| Commit message (Collapse) | Author | Age | Files | Lines |
| |
Make the code below work.
```
sql("DESCRIBE test").registerTempTable("describeTest")
sql("SELECT * FROM describeTest").collect()
```
Author: OopsOutOfMemory <victorshengli@126.com>
Author: Sheng, Li <OopsOutOfMemory@users.noreply.github.com>
Closes #4249 from OopsOutOfMemory/desc_query and squashes the following commits:
6fee13d [OopsOutOfMemory] up-to-date
e71430a [Sheng, Li] Update HiveOperatorQueryableSuite.scala
3ba1058 [OopsOutOfMemory] change to default argument
aac7226 [OopsOutOfMemory] Merge branch 'master' into desc_query
68eb6dd [OopsOutOfMemory] Merge branch 'desc_query' of github.com:OopsOutOfMemory/spark into desc_query
354ad71 [OopsOutOfMemory] query describe command
d541a35 [OopsOutOfMemory] refine test suite
e1da481 [OopsOutOfMemory] refine test suite
a780539 [OopsOutOfMemory] Merge branch 'desc_query' of github.com:OopsOutOfMemory/spark into desc_query
0015f82 [OopsOutOfMemory] code style
dd0aaef [OopsOutOfMemory] code style
c7d606d [OopsOutOfMemory] rename test suite
75f2342 [OopsOutOfMemory] refine code and test suite
f942c9b [OopsOutOfMemory] initial
11559ae [OopsOutOfMemory] code style
c5fdecf [OopsOutOfMemory] code style
aeaea5f [OopsOutOfMemory] rename test suite
ac2c3bb [OopsOutOfMemory] refine code and test suite
544573e [OopsOutOfMemory] initial
(cherry picked from commit 0b7eb3f3b700080bf6cb810d092709a8a468e5db)
Signed-off-by: Michael Armbrust <michael@databricks.com>
| |
Author: q00251598 <qiyadong@huawei.com>
Closes #4397 from watermen/SPARK-5619 and squashes the following commits:
f819b6c [q00251598] Support show roles in HiveContext.
(cherry picked from commit a958d60975147fb1afc76fcbd80f65ac8d78759a)
Signed-off-by: Michael Armbrust <michael@databricks.com>
| |
Author: Tobias Schlatter <tobias@meisch.ch>
Closes #4431 from gzm0/sync-scala-refl and squashes the following commits:
c5da21e [Tobias Schlatter] [SPARK-5640] Synchronize ScalaReflection where necessary
(cherry picked from commit 500dc2b4b3136029457e708859fe27da93b1f9e8)
Signed-off-by: Michael Armbrust <michael@databricks.com>
| |
In Hive, the 'FROM' clause is optional. This PR adds support for that.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes #4426 from viirya/optional_from and squashes the following commits:
fe81f31 [Liang-Chi Hsieh] Support optional 'FROM' clause.
(cherry picked from commit d433816157bb3ae1f0fbe44efec43a0c906d9f82)
Signed-off-by: Michael Armbrust <michael@databricks.com>
| |
Every proper command line tool should include a `--version` option or something similar.
This PR adds this to `spark-ec2` using the standard functionality provided by `optparse`.
One thing we don't do here is follow the Python convention of setting `__version__`, since it seems awkward given how `spark-ec2` is laid out.
Author: Nicholas Chammas <nicholas.chammas@gmail.com>
Closes #4414 from nchammas/spark-ec2-show-version and squashes the following commits:
914cab5 [Nicholas Chammas] add version info
(cherry picked from commit 70e5b030a78ddfdcc8c9eee568009f277dee0872)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
| |
https://issues.apache.org/jira/browse/SPARK-2945
`spark.executor.instances` works. As the JIRA recommends, we should document this common config.
Author: WangTaoTheTonic <wangtao111@huawei.com>
Closes #4350 from WangTaoTheTonic/SPARK-2945 and squashes the following commits:
4c3913a [WangTaoTheTonic] not compatible with dynamic allocation
5fa9c46 [WangTaoTheTonic] add doc for spark.executor.instances
(cherry picked from commit d34f79c8db79ae461fadae190446ebc19091bec9)
Signed-off-by: Andrew Or <andrew@databricks.com>
| |
I'm trying to point out that reusing a Configuration in these APIs is dangerous. Any better ideas?
Author: zsxwing <zsxwing@gmail.com>
Closes #3225 from zsxwing/SPARK-4361 and squashes the following commits:
fe4e3d5 [zsxwing] Add more docs for Hadoop Configuration
(cherry picked from commit af2a2a263ac5d890e84d012b75fcb50e02c9ede8)
Signed-off-by: Andrew Or <andrew@databricks.com>
| |
This was caused because #3486 added a new field to ExecutorInfo and #4369
added new tests that created ExecutorInfos. These patches were merged in
quick succession and were never tested together, hence the compilation error.
| |
`LogisticRegressionModel`'s `predictPoint` should use the broadcast weights directly. This PR also fixes compilation errors in two unit test suites: `JavaLogisticRegressionSuite` and `JavaLinearRegressionSuite`.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes #4429 from viirya/use_bcvalue and squashes the following commits:
5a797e5 [Liang-Chi Hsieh] Use broadcasted weights. Fix compilation error.
(cherry picked from commit 80f3bcb58f836cfe1829c85bdd349c10525c8a5e)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
| |
This patch enables UISeleniumSuite, a set of tests for the Spark application web UI. These tests were previously disabled because they were slow, but I think we now have sufficient test time budget that the benefit of enabling them outweighs the time costs.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #4334 from JoshRosen/enable-uiseleniumsuite and squashes the following commits:
4ab9477 [Josh Rosen] Use BeforeAndAfterAll to cleanup WebDriver
71efc72 [Josh Rosen] Update broken UISeleniumSuite tests; use random port #.
a5ab595 [Josh Rosen] Enable UISeleniumSuite tests.
(cherry picked from commit 0d74bd7fd7b2722d08eddc5c269b8b2b6cb47635)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
| |
Adds links to stderr/stdout in the executor tab of the webUI for:
1) Standalone
2) Yarn client
3) Yarn cluster
This tries to add the log url support in a general way so as to make it easy to add support for all the
cluster managers. This is done by using environment variables to pass to the executor the log urls. The
SPARK_LOG_URL_ prefix is used and so additional logs besides stderr/stdout can also be added.
To propagate this information to the UI we use the onExecutorAdded spark listener event.
Although this commit doesn't add log urls when running on a mesos cluster, it should be possible to add using the same mechanism.
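The prefix-based lookup described above could be sketched roughly as follows. This is a minimal illustration, not the code from this commit: the object name `LogUrlSketch`, the function `extractLogUrls`, and the choice to lowercase the log name are assumptions; only the `SPARK_LOG_URL_` prefix comes from the message.

```scala
object LogUrlSketch {
  // Prefix named in the commit message; whatever follows it is treated as the log name.
  val Prefix = "SPARK_LOG_URL_"

  /** Collects (logName -> url) pairs from an environment map.
    * Lowercasing the log name is an assumption made for this sketch. */
  def extractLogUrls(env: Map[String, String]): Map[String, String] =
    env.collect {
      case (key, url) if key.startsWith(Prefix) =>
        key.stripPrefix(Prefix).toLowerCase -> url
    }
}
```

Under these assumptions, an environment containing `SPARK_LOG_URL_STDERR` and `HOME` would yield a single entry keyed `stderr`; any additional `SPARK_LOG_URL_*` variable would show up as another log without further changes.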
Author: Kostas Sakellis <kostas@cloudera.com>
Author: Josh Rosen <joshrosen@databricks.com>
Closes #3486 from ksakellis/kostas-spark-2450 and squashes the following commits:
d190936 [Josh Rosen] Fix a few minor style / formatting nits. Reset listener after each test Don't null listener out at end of main().
8673fe1 [Kostas Sakellis] CR feedback. Hide the log column if there are no logs available
5bf6952 [Kostas Sakellis] [SPARK-2450] [CORE] Adds exeuctor log links to Web UI
(cherry picked from commit 32e964c410e7083b43264c46291e93cd206a8038)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
| |
Author: Makoto Fukuhara <fukuo33@gmail.com>
Closes #4396 from fukuo33/fix-unnecessary-regex and squashes the following commits:
cd07fd6 [Makoto Fukuhara] fix unnecessary regex.
(cherry picked from commit 4cdb26c174e479a144950d12e1ad180f361af1fd)
Signed-off-by: Andrew Or <andrew@databricks.com>
| |
ExecutorAllocationListener
More strictly, in ExecutorAllocationListener we need to replace onBlockManagerAdded and onBlockManagerRemoved with onExecutorAdded and onExecutorRemoved, because the executor events express these meanings more accurately. For example, in SPARK-5529 the BlockManager has been removed but the executor still exists.
andrewor14 sryza
Author: lianhuiwang <lianhuiwang09@gmail.com>
Closes #4369 from lianhuiwang/SPARK-5593 and squashes the following commits:
333367c [lianhuiwang] Replace BlockManagerListener with ExecutorListener in ExecutorAllocationListener
(cherry picked from commit 6072fcc14ee1a4eba793e725fcb2cb2ffebd5b60)
Signed-off-by: Andrew Or <andrew@databricks.com>
| |
Previously, the classloader isolation was almost too good, such
that if a child class needed to load/reference a class that was
only available in the parent, it could not do so.
This adds tests for that case: the user-first Fake2 class extends
the only-in-parent Fake3 class.
It also sneaks in a fix where only the first stage seemed to work,
and on subsequent stages, a LinkageError happened because classes
from the user-first classpath were getting defined twice.
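As a rough illustration of the behaviour described above, a child-first loader with a parent fallback might look like the sketch below. The class name `ChildFirstLoaderSketch` and its constructor are hypothetical and this is not the loader from this commit; it only uses standard `java.net.URLClassLoader` behaviour.

```scala
import java.net.{URL, URLClassLoader}

/** Hypothetical child-first loader: tries the user URLs first, then the fallback loader. */
class ChildFirstLoaderSketch(userUrls: Array[URL], fallback: ClassLoader)
    extends URLClassLoader(userUrls, null) { // null parent: only bootstrap classes and userUrls are visible here

  override def loadClass(name: String, resolve: Boolean): Class[_] =
    try {
      // ClassLoader.loadClass consults findLoadedClass first, so this loader
      // defines a given class at most once.
      super.loadClass(name, resolve)
    } catch {
      // Not on the user class path: delegate to the fallback (the "parent" in the text above).
      case _: ClassNotFoundException => fallback.loadClass(name)
    }
}
```

With this shape, a user-first class whose superclass exists only in the fallback loader can still be linked, because the JVM resolves the superclass through the same `loadClass` call and ends up in the fallback branch.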
Author: Stephen Haberman <stephen@exigencecorp.com>
Closes #3725 from stephenh/4877_user_first_parent_inheritance and squashes the following commits:
dabcd35 [Stephen Haberman] [SPARK-4877] Respect userClassPathFirst for the driver code too.
3d0fa7c [Stephen Haberman] [SPARK-4877] Allow user first classes to extend classes in the parent.
(cherry picked from commit 9792bec596113a6f5f4534772b7539255403b082)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
| |
Modified syntax error in spark-submit2.cmd. Command prompt doesn't have "defined" operator.
Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp>
Closes #4428 from tsudukim/feature/SPARK-5396 and squashes the following commits:
ec18465 [Masayoshi TSUZUKI] [SPARK-5396] Syntax error in spark scripts on windows.
(cherry picked from commit c01b9852ea2f7d453249b07d89e62af71bd26e3d)
Signed-off-by: Andrew Or <andrew@databricks.com>
| |
A recent patch #4051 made the initial number of executors default to 0. With this change, any Spark application using dynamic allocation's default settings will ramp up very slowly. Since we never request more executors than needed to saturate the pending tasks, it is safe to ramp up quickly. The current default of 60 may be too slow.
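To get a feel for why the interval matters, here is a small back-of-the-envelope sketch. It assumes that each request round asks for twice as many executors as the previous one (1, 2, 4, ...), starting from 0 and capped at what the pending tasks need; that doubling policy, the object name `RampUpSketch`, and the 5-second comparison value are assumptions for illustration, not taken from this message.

```scala
object RampUpSketch {
  /** Rounds needed to reach `needed` executors when each round requests
    * twice as many as the previous one, starting from zero executors. */
  def roundsToReach(needed: Int): Int = {
    var total = 0
    var toAdd = 1
    var rounds = 0
    while (total < needed) {
      total = math.min(needed, total + toAdd) // never exceed what pending tasks require
      toAdd *= 2
      rounds += 1
    }
    rounds
  }

  /** Total ramp-up time if one round happens per interval. */
  def secondsToReach(needed: Int, intervalSeconds: Int): Int =
    roundsToReach(needed) * intervalSeconds
}
```

Under these assumptions, reaching 100 executors takes 7 rounds: 420 seconds with an interval of 60, versus 35 seconds with a hypothetical interval of 5.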
Author: Andrew Or <andrew@databricks.com>
Closes #4409 from andrewor14/dynamic-allocation-interval and squashes the following commits:
d3cc485 [Andrew Or] Lower request interval
| |
Author: Sandy Ryza <sandy@cloudera.com>
Closes #4141 from sryza/sandy-spark-4337 and squashes the following commits:
a98bd20 [Sandy Ryza] Andrew's comments
cdaab7f [Sandy Ryza] SPARK-4337. Add ability to cancel pending requests to YARN
(cherry picked from commit 1a88f20de798030a7d5713bd267f612ba5617fca)
Signed-off-by: Andrew Or <andrew@databricks.com>
| |
Some ExecutorSource metrics can NPE by attempting to reference the
threadpool otherwise.
Author: Ryan Williams <ryan.blake.williams@gmail.com>
Closes #4212 from ryan-williams/threadpool and squashes the following commits:
236f2ad [Ryan Williams] init Executor.threadPool before ExecutorSource
| |
755 means the owner can read, write, and execute, and everyone else can just read and execute. I think that's what we want here since without execute permissions others cannot open directories.
Inspired by [this comment on a separate PR](https://github.com/apache/spark/pull/3297#issuecomment-63286730).
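For readers less familiar with octal modes, the following sketch decodes a three-digit mode into the usual `rwx` notation. The object `ModeSketch` and its `render` function are illustrative names only and are not part of the build change.

```scala
object ModeSketch {
  /** Renders a three-digit octal mode such as "755" as an rwx string
    * (owner, group, others). */
  def render(mode: String): String = {
    require(mode.matches("[0-7]{3}"), s"expected three octal digits, got: $mode")
    mode.map { digit =>
      val bits = digit.asDigit
      val r = if ((bits & 4) != 0) "r" else "-" // read bit
      val w = if ((bits & 2) != 0) "w" else "-" // write bit
      val x = if ((bits & 1) != 0) "x" else "-" // execute (or directory traversal) bit
      r + w + x
    }.mkString
  }
}
```

`ModeSketch.render("755")` gives `rwxr-xr-x`, while `ModeSketch.render("644")` gives `rw-r--r--`, which lacks the execute bit that others need to open directories.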
Author: Nicholas Chammas <nicholas.chammas@gmail.com>
Closes #4277 from nchammas/patch-1 and squashes the following commits:
da77fb0 [Nicholas Chammas] [Build] Set all Debian package permissions to 755
| |
Change spark-version from 1.1.0 to 1.2.0 in the example for spark-ec2/Launch Cluster.
Author: Miguel Peralvo <miguel.peralvo@gmail.com>
Closes #4300 from MiguelPeralvo/patch-1 and squashes the following commits:
38adf0b [Miguel Peralvo] Update ec2-scripts.md
1850869 [Miguel Peralvo] Update ec2-scripts.md
| |
Currently KryoSerializer loads the classes listed in classesToRegister at initialization time. When we set spark.kryo.classesToRegister=class1, it throws SparkException("Failed to load class to register with Kryo"),
because during KryoSerializer's initialization the class loader does not yet include the classes from the user's jars.
We need to use the Serializer's defaultClassLoader in newKryo(), because the executor resets the Serializer's defaultClassLoader after the Serializer is initialized.
Thanks to zzcclp for reporting this.
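The general idea, resolving class names when they are needed instead of at construction time, can be sketched in plain Scala as below. `LazyRegistrationSketch`, `resolveClasses`, and the exception type are hypothetical stand-ins; the real change lives in KryoSerializer and is not reproduced here.

```scala
class LazyRegistrationSketch(classNames: Seq[String]) {
  // May be replaced after construction, for example once user jars are visible.
  @volatile var defaultClassLoader: Option[ClassLoader] = None

  /** Resolves the configured names at use time, not at construction time. */
  def resolveClasses(): Seq[Class[_]] = {
    val loader =
      defaultClassLoader.getOrElse(Thread.currentThread.getContextClassLoader)
    classNames.map { name =>
      try Class.forName(name, true, loader)
      catch {
        case e: ClassNotFoundException =>
          throw new IllegalStateException(s"Failed to load class to register: $name", e)
      }
    }
  }
}
```

Constructing the object never touches a class loader, so a loader installed later through `defaultClassLoader` is the one that gets used when `resolveClasses()` finally runs.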
Author: lianhuiwang <lianhuiwang09@gmail.com>
Closes #4258 from lianhuiwang/SPARK-5470 and squashes the following commits:
73b719f [lianhuiwang] do the splitting and filtering during initialization
64cf306 [lianhuiwang] use defaultClassLoader to load classes of classesToRegister in KryoSerializer
| |
In ApplicationMaster, rename isDriver to isClusterMode. Client already uses isClusterMode, so ApplicationMaster should stay consistent with it. isClusterMode is also easier to understand.
andrewor14 sryza
Author: lianhuiwang <lianhuiwang09@gmail.com>
Closes #4430 from lianhuiwang/am-isDriver-rename and squashes the following commits:
f9f3ed0 [lianhuiwang] rename isDriver to isClusterMode
(cherry picked from commit cc6e53119d7a51b95b19244f50b25814088b4d11)
Signed-off-by: Andrew Or <andrew@databricks.com>
| |
Empty log directories are not useful at the moment, but if one ends
up showing in the log root, it breaks the code that checks for log
directories.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #4352 from vanzin/SPARK-5582 and squashes the following commits:
1a6a3d4 [Marcelo Vanzin] [SPARK-5582] Fix exception when looking at empty directories.
| |
ConcMarkSweepGC for AM.
When we set `SPARK_USE_CONC_INCR_GC`, ConcurrentMarkSweepGC works on the AM.
In fact, when ConcurrentMarkSweepGC is enabled for the JVM, the following JVM options are set automatically and implicitly:
* MaxTenuringThreshold=0
* SurvivorRatio=1024
Those values are not appropriate in most cases.
See also http://www.oracle.com/technetwork/java/tuning-139912.html
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes #3956 from sarutak/SPARK-5157 and squashes the following commits:
c15da4e [Kousuke Saruta] Set more JVM options for AM when enabling CMS
| |
The .cmd files in bin do not have the execute permission set, except for spark-shell.cmd.
Let's make that consistent.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes #3983 from sarutak/fix-mode-of-cmd and squashes the following commits:
9d6eedc [Kousuke Saruta] Removed permission for execution from spark-shell.cmd
| |
graph with a file format error
When I build a graph from a file with a format error, an ArrayIndexOutOfBoundsException is thrown.
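A defensive way to parse one line of an edge list is sketched below. It is an illustration under assumptions, not the GraphLoader code: the object `EdgeLineSketch`, the whitespace-separated `srcId dstId` format, and the `#` comment convention are all assumed here.

```scala
object EdgeLineSketch {
  /** Parses "srcId dstId". Returns None for blank lines and comments, and fails
    * with a descriptive message (not an index error) on malformed lines. */
  def parse(line: String): Option[(Long, Long)] = {
    val trimmed = line.trim
    if (trimmed.isEmpty || trimmed.startsWith("#")) None
    else {
      val fields = trimmed.split("\\s+")
      if (fields.length < 2) {
        // Check the field count before indexing, so the error names the bad line.
        throw new IllegalArgumentException(s"Invalid line: $line")
      }
      Some((fields(0).toLong, fields(1).toLong))
    }
  }
}
```

With this check, a line such as `42` fails with a message that names the offending line, where unchecked indexing into the split result would raise an ArrayIndexOutOfBoundsException.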
Author: Leolh <leosandylh@gmail.com>
Closes #4176 from Leolh/patch-1 and squashes the following commits:
94f6d22 [Leolh] Update GraphLoader.scala
23767f1 [Leolh] [SPARK-3650][GraphX] There will be an ArrayIndexOutOfBoundsException if the format of the source file is wrong
| |
GaussianMixture
Simple description and code samples (and sample data) for GaussianMixture
Author: Travis Galoppo <tjg2107@columbia.edu>
Closes #4401 from tgaloppo/spark-5013 and squashes the following commits:
c9ff9a5 [Travis Galoppo] Fixed link in mllib-clustering.md Added Gaussian mixture and power iteration as available clustering techniques in mllib-guide
2368690 [Travis Galoppo] Minor fixes
3eb41fa [Travis Galoppo] [SPARK-5013] Added documentation and sample data file for GaussianMixture
(cherry picked from commit 9ad56ad2a2a51df449040c4f4b7c66b104883312)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
| |
This is part (1a) of the updates from the design doc in [https://docs.google.com/document/d/1BH9el33kBX8JiDdgUJXdLW14CA2qhTCWIG46eXZVoJs]
**UPDATE**: Most of the APIs are being kept private[spark] to allow further discussion. Here is a list of changes which are public:
* new output columns: rawPrediction, probabilities
* The “score” column is now called “rawPrediction”
* Classifiers now provide numClasses
* Params.get and .set are now protected instead of private[ml].
* ParamMap now has a size method.
* new classes: LinearRegression, LinearRegressionModel
* LogisticRegression now has an intercept.
### Sketch of APIs (most of which are private[spark] for now)
Abstract classes for learning algorithms (+ corresponding Model abstractions):
* Classifier (+ ClassificationModel)
* ProbabilisticClassifier (+ ProbabilisticClassificationModel)
* Regressor (+ RegressionModel)
* Predictor (+ PredictionModel)
* *For all of these*:
* There is no strongly typed training-time API.
* There is a strongly typed test-time (prediction) API which helps developers implement new algorithms.
Concrete classes: learning algorithms
* LinearRegression
* LogisticRegression (updated to use new abstract classes)
* Also, removed "score" in favor of "probability" output column. Changed BinaryClassificationEvaluator to match. (SPARK-5031)
Other updates:
* params.scala: Changed Params.set/get to be protected instead of private[ml]
* This was needed for the example of defining a class from outside of the MLlib namespace.
* VectorUDT: Will later change from private[spark] to public.
* This is needed for outside users to write their own validateAndTransformSchema() methods using vectors.
* Also, added equals() method.
* SPARK-4942 : ML Transformers should allow output cols to be turned on,off
* Update validateAndTransformSchema
* Update transform
* (Updated examples, test suites according to other changes)
New examples:
* DeveloperApiExample.scala (example of defining algorithm from outside of the MLlib namespace)
* Added Java version too
Test Suites:
* LinearRegressionSuite
* LogisticRegressionSuite
* + Java versions of above suites
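The test-time side of the hierarchy listed above might be pictured roughly like this. It is a heavily simplified sketch in plain Scala: the class names follow the message, but every signature, the type parameter `F`, and the toy `ThresholdModel` are assumptions, and the training-time API (which the message says is not strongly typed) is left out entirely.

```scala
// Simplified stand-ins; the real classes operate on DataFrames and Params.
abstract class PredictionModel[F] {
  def predict(features: F): Double
}

abstract class RegressionModel[F] extends PredictionModel[F]

abstract class ClassificationModel[F] extends PredictionModel[F] {
  def numClasses: Int
  def predictRaw(features: F): Seq[Double]
  // Assumed default: pick the class with the largest raw score.
  def predict(features: F): Double =
    predictRaw(features).zipWithIndex
      .reduceLeft((best, next) => if (next._1 > best._1) next else best)
      ._2.toDouble
}

abstract class ProbabilisticClassificationModel[F] extends ClassificationModel[F] {
  def predictProbabilities(features: F): Seq[Double]
}

// Toy binary model over a single numeric feature, for illustration only.
final class ThresholdModel(threshold: Double)
    extends ProbabilisticClassificationModel[Double] {
  val numClasses = 2
  def predictRaw(x: Double): Seq[Double] = Seq(threshold - x, x - threshold)
  def predictProbabilities(x: Double): Seq[Double] = {
    val p1 = 1.0 / (1.0 + math.exp(-(x - threshold)))
    Seq(1.0 - p1, p1)
  }
}
```

The point of the layering is that a developer implementing a new probabilistic classifier only supplies the raw scores and probabilities, and inherits the typed `predict` from the classes above it.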
CC: mengxr etrain shivaram
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #3637 from jkbradley/ml-api-part1 and squashes the following commits:
405bfb8 [Joseph K. Bradley] Last edits based on code review. Small cleanups
fec348a [Joseph K. Bradley] Added JavaDeveloperApiExample.java and fixed other issues: Made developer API private[spark] for now. Added constructors Java can understand to specialized Param types.
8316d5e [Joseph K. Bradley] fixes after rebasing on master
fc62406 [Joseph K. Bradley] fixed test suites after last commit
bcb9549 [Joseph K. Bradley] Fixed issues after rebasing from master (after move from SchemaRDD to DataFrame)
9872424 [Joseph K. Bradley] fixed JavaLinearRegressionSuite.java Java sql api
f542997 [Joseph K. Bradley] Added MIMA excludes for VectorUDT (now public), and added DeveloperApi annotation to it
216d199 [Joseph K. Bradley] fixed after sql datatypes PR got merged
f549e34 [Joseph K. Bradley] Updates based on code review. Major ones are: * Created weakly typed Predictor.train() method which is called by fit() so that developers do not have to call schema validation or copy parameters. * Made Predictor.featuresDataType have a default value of VectorUDT. * NOTE: This could be dangerous since the FeaturesType type parameter cannot have a default value.
343e7bd [Joseph K. Bradley] added blanket mima exclude for ml package
82f340b [Joseph K. Bradley] Fixed bug in LogisticRegression (introduced in this PR). Fixed Java suites
0a16da9 [Joseph K. Bradley] Fixed Linear/Logistic RegressionSuites
c3c8da5 [Joseph K. Bradley] small cleanup
934f97b [Joseph K. Bradley] Fixed bugs from previous commit.
1c61723 [Joseph K. Bradley] * Made ProbabilisticClassificationModel into a subclass of ClassificationModel. Also introduced ProbabilisticClassifier. * This was to support output column “probabilityCol” in transform().
4e2f711 [Joseph K. Bradley] rat fix
bc654e1 [Joseph K. Bradley] Added spark.ml LinearRegressionSuite
8d13233 [Joseph K. Bradley] Added methods: * Classifier: batch predictRaw() * Predictor: train() without paramMap ProbabilisticClassificationModel.predictProbabilities() * Java versions of all above batch methods + others
1680905 [Joseph K. Bradley] Added JavaLabeledPointSuite.java for spark.ml, and added constructor to LabeledPoint which defaults weight to 1.0
adbe50a [Joseph K. Bradley] * fixed LinearRegression train() to use embedded paramMap * added Predictor.predict(RDD[Vector]) method * updated Linear/LogisticRegressionSuites
58802e3 [Joseph K. Bradley] added train() to Predictor subclasses which does not take a ParamMap.
57d54ab [Joseph K. Bradley] * Changed semantics of Predictor.train() to merge the given paramMap with the embedded paramMap. * remove threshold_internal from logreg * Added Predictor.copy() * Extended LogisticRegressionSuite
e433872 [Joseph K. Bradley] Updated docs. Added LabeledPointSuite to spark.ml
54b7b31 [Joseph K. Bradley] Fixed issue with logreg threshold being set correctly
0617d61 [Joseph K. Bradley] Fixed bug from last commit (sorting paramMap by parameter names in toString). Fixed bug in persisting logreg data. Added threshold_internal to logreg for faster test-time prediction (avoiding map lookup).
601e792 [Joseph K. Bradley] Modified ParamMap to sort parameters in toString. Cleaned up classes in class hierarchy, before implementing tests and examples.
d705e87 [Joseph K. Bradley] Added LinearRegression and Regressor back from ml-api branch
52f4fde [Joseph K. Bradley] removing everything except for simple class hierarchy for classification
d35bb5d [Joseph K. Bradley] fixed compilation issues, but have not added tests yet
bfade12 [Joseph K. Bradley] Added lots of classes for new ML API:
(cherry picked from commit dc0c4490a12ecedd8ca5a1bb256c7ccbdf0be04f)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
| |
This is the second part of SPARK-5604, which removes checkpointDir from tree strategies. Note that this is a breaking change. I will mention it in the migration guide.
Author: Xiangrui Meng <meng@databricks.com>
Closes #4407 from mengxr/SPARK-5604-1 and squashes the following commits:
13a276d [Xiangrui Meng] remove checkpointDir from trees
(cherry picked from commit 6b88825a25a0a072c13bbcc57bbfdb102a3f133d)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
| |
Author: Reynold Xin <rxin@databricks.com>
Closes #4410 from rxin/df-renameCol and squashes the following commits:
a6a796e [Reynold Xin] [SPARK-5639][SQL] Support DataFrame.renameColumn.
(cherry picked from commit 7dc4965f34e37b37f4fab69859fcce6476f87811)
Signed-off-by: Reynold Xin <rxin@databricks.com>
| |
This reverts commit c3b8d272cf0574e72422d8d7f4f0683dcbdce41b.
| |
Because of the way we shade jetty, we lose its dependency orbit
in the assembly jar, which includes the javax servlet API's. This
adds back orbit explicitly, using the version that matches
our jetty version.
Author: Patrick Wendell <patrick@databricks.com>
Closes #4411 from pwendell/servlet-api and squashes the following commits:
445f868 [Patrick Wendell] SPARK-5557: Explicitly include servlet API in dependencies.
(cherry picked from commit 793dbaef401d777c3efc1759a3ea7580e01de528)
Signed-off-by: Patrick Wendell <patrick@databricks.com>
| |
"SQLQuerySuite.CTAS with serde"
Ideally we should convert Metastore Parquet tables with our own Parquet implementation on both read path and write path. However, the write path is not well covered, and causes this test failure. This PR is a hotfix to bring back Jenkins PR builder. A proper fix will be delivered in a follow-up PR.
Author: Cheng Lian <lian@databricks.com>
Closes #4413 from liancheng/hotfix-parquet-ctas and squashes the following commits:
5291289 [Cheng Lian] Hot fix for "SQLQuerySuite.CTAS with serde"
(cherry picked from commit 7c0a648fb5537ba7a1fe2545ead49219b14b656c)
Signed-off-by: Cheng Lian <lian@databricks.com>
| |
Author: Reynold Xin <rxin@databricks.com>
Closes #4408 from rxin/df-config-eager and squashes the following commits:
c0204cf [Reynold Xin] [SPARK-5638][SQL] Add a config flag to disable eager analysis of DataFrames.
(cherry picked from commit e8a5d50a96f6e7d4fce33ea19fbfc083f4351296)
Signed-off-by: Reynold Xin <rxin@databricks.com>
| |
It seems that `(ScalaUnidoc, unidoc)` is the correct way to overwrite `scalacOptions` in unidoc.
CC: rxin gzm0
Author: Xiangrui Meng <meng@databricks.com>
Closes #4404 from mengxr/SPARK-5620 and squashes the following commits:
f890cf5 [Xiangrui Meng] add -groups to scalacOptions in unidoc
(cherry picked from commit 85ccee81acef578ec4b40fb5f5d97b9e24314f35)
Signed-off-by: Reynold Xin <rxin@databricks.com>
| |
source improvements
This PR adds three major improvements to Parquet data source:
1. Partition discovery
When reading Parquet files that reside in Hive-style partition directories, `ParquetRelation2` automatically discovers partitioning information and infers partition column types.
This is also a partial work for [SPARK-5182] [1], which aims to provide first class partitioning support for the data source API. Related code in this PR can be easily extracted to the data source API level in future versions.
1. Schema merging
When enabled, Parquet data source collects schema information from all Parquet part-files and tries to merge them. Exceptions are thrown when incompatible schemas are detected. This feature is controlled by data source option `parquet.mergeSchema`, and is enabled by default.
1. Metastore Parquet table conversion moved to analysis phase
This greatly simplifies the conversion logic. `ParquetConversion` strategy can be removed once the old Parquet implementation is removed in the future.
This version of Parquet data source aims to entirely replace the old Parquet implementation. However, the old version hasn't been removed yet. Users can fall back to the old version by turning off SQL configuration `spark.sql.parquet.useDataSourceApi`.
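To make the partition discovery item above more concrete, here is a small sketch of reading `column=value` directory names out of a path and guessing a type for each value. It is not `ParquetRelation2`'s logic: the object `PartitionPathSketch`, the example path, and the Int-then-Double-then-String inference order are assumptions for illustration.

```scala
object PartitionPathSketch {
  /** Extracts `column=value` directory segments from a path, in order,
    * and infers a simple type for each value. */
  def parse(path: String): Seq[(String, Any)] =
    path.split("/").toSeq.flatMap { segment =>
      segment.split("=", 2) match {
        case Array(column, raw) if column.nonEmpty => Some(column -> infer(raw))
        case _ => None // ordinary directory or file name
      }
    }

  // Tries Int, then Double, and otherwise keeps the raw string.
  private def infer(raw: String): Any =
    scala.util.Try(raw.toInt)
      .orElse(scala.util.Try(raw.toDouble))
      .getOrElse(raw)
}
```

For a hypothetical path like `/data/year=2015/country=US/part-0.parquet`, this yields `year` as the integer 2015 and `country` as the string `US`, while the data file name itself is ignored.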
Other JIRA tickets fixed as side effects in this PR:
- [SPARK-5509] [3]: `EqualTo` now uses a proper `Ordering` to compare binary types.
- [SPARK-3575] [4]: Metastore schema is now preserved and passed to `ParquetRelation2` via data source option `parquet.metastoreSchema`.
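The SPARK-5509 item is easier to appreciate with a quick illustration of why binary values need an explicit `Ordering`: in Scala, `==` on two `Array[Byte]` values compares references, not contents. The sketch below is not the Catalyst implementation; the object name and the choice to treat bytes as unsigned are assumptions.

```scala
object BinaryOrderingSketch {
  /** Lexicographic ordering over byte arrays, treating bytes as unsigned
    * (the unsigned choice is an assumption of this sketch). */
  val byteArrayOrdering: Ordering[Array[Byte]] = new Ordering[Array[Byte]] {
    def compare(a: Array[Byte], b: Array[Byte]): Int = {
      var i = 0
      while (i < a.length && i < b.length) {
        val cmp = (a(i) & 0xff) - (b(i) & 0xff)
        if (cmp != 0) return cmp
        i += 1
      }
      a.length - b.length // a shorter array that is a prefix sorts first
    }
  }

  /** Content-based equality derived from the ordering. */
  def binaryEquals(a: Array[Byte], b: Array[Byte]): Boolean =
    byteArrayOrdering.equiv(a, b)
}
```

Two distinct arrays holding the same bytes are unequal under `==` but equal under `binaryEquals`, which is the behaviour an equality expression over binary columns needs.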
TODO:
- [ ] More test cases for partition discovery
- [x] Fix write path after data source write support (#4294) is merged
It turned out to be non-trivial to fall back to old Parquet implementation on the write path when Parquet data source is enabled. Since we're planning to include data source write support in 1.3.0, I simply ignored two test cases involving Parquet insertion for now.
- [ ] Fix outdated comments and documentation
PS: This PR looks big, but more than half of the changed lines in this PR are trivial changes to test cases. To test Parquet with and without the new data source, almost all Parquet test cases are moved into wrapper driver functions. This introduces hundreds of lines of changes.
[1]: https://issues.apache.org/jira/browse/SPARK-5182
[2]: https://issues.apache.org/jira/browse/SPARK-5528
[3]: https://issues.apache.org/jira/browse/SPARK-5509
[4]: https://issues.apache.org/jira/browse/SPARK-3575
Author: Cheng Lian <lian@databricks.com>
Closes #4308 from liancheng/parquet-partition-discovery and squashes the following commits:
b6946e6 [Cheng Lian] Fixes MiMA issues, addresses comments
8232e17 [Cheng Lian] Write support for Parquet data source
a49bd28 [Cheng Lian] Fixes spelling typo in trait name "CreateableRelationProvider"
808380f [Cheng Lian] Fixes issues introduced while rebasing
50dd8d1 [Cheng Lian] Addresses @rxin's comment, fixes UDT schema merging
adf2aae [Cheng Lian] Fixes compilation error introduced while rebasing
4e0175f [Cheng Lian] Fixes Python Parquet API, we need Py4J array to call varargs method
0d8ec1d [Cheng Lian] Adds more test cases
b35c8c6 [Cheng Lian] Fixes some typos and outdated comments
dd704fd [Cheng Lian] Fixes Python Parquet API
596c312 [Cheng Lian] Uses switch to control whether use Parquet data source or not
7d0f7a2 [Cheng Lian] Fixes Metastore Parquet table conversion
a1896c7 [Cheng Lian] Fixes all existing Parquet test suites except for ParquetMetastoreSuite
5654c9d [Cheng Lian] Draft version of Parquet partition discovery and schema merging
(cherry picked from commit a9ed51178c89d83aae1ad420fb3f4a7f4d1812ec)
Signed-off-by: Michael Armbrust <michael@databricks.com>
| |
`checkpointDir` is a Spark global configuration. Users should set it outside LDA. This PR also hides some methods under `private[clustering] object LDA`, so they don't show up in the generated Java doc (SPARK-5610).
jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes #4390 from mengxr/SPARK-5604 and squashes the following commits:
a34bb39 [Xiangrui Meng] remove checkpointDir from LDA
(cherry picked from commit c19152cd2a5d407ecf526a90e3bb059f09905b3a)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
| |
`deleteAllCheckpoints` can throw an IOException.
This fixes the issue.
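The pattern named in the squashed commits below, an explicit `try`/`catch` around a cleanup call, could look roughly like this. The helper `cleanupQuietly`, its parameters, and the warning text are hypothetical; the actual change is inside RandomForest and is not reproduced here.

```scala
import java.io.IOException

object CheckpointCleanupSketch {
  /** Runs a cleanup action and reports whether it succeeded, so that an
    * IOException during cleanup does not abort the caller. */
  def cleanupQuietly(deleteAll: () => Unit, warn: String => Unit): Boolean =
    try {
      deleteAll()
      true
    } catch {
      case e: IOException =>
        // Only IOException is handled; any other failure still propagates.
        warn(s"Failed to delete checkpoints: ${e.getMessage}")
        false
    }
}
```

Catching only `IOException` keeps unrelated failures visible, which is one reason to prefer a narrow `catch` over wrapping the call in a blanket `Try`.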
Author: x1- <viva008@gmail.com>
Closes #4347 from x1-/SPARK-5460 and squashes the following commits:
7a3d8de [x1-] change `Try()` to `try catch { case ... }` ar RandomForest.
3a52745 [x1-] modified typo. 'faild' -> 'failed' and remove disused '-'.
1572576 [x1-] Wrapped `Try` around `deleteAllCheckpoints` - RandomForest.
(cherry picked from commit 62371adaa5b9251579db7300504506975689610c)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
| |
Hi, rxin marmbrus
I considered your suggestion (in #4127) and have rewritten it. It is now up to date.
Could you please review it?
Author: OopsOutOfMemory <victorshengli@126.com>
Closes #4227 from OopsOutOfMemory/describe and squashes the following commits:
053826f [OopsOutOfMemory] describe
(cherry picked from commit 4d8d070c4f9f8211afb95d29036eb5e41796dcf2)
Signed-off-by: Reynold Xin <rxin@databricks.com>
| |
SQLQuerySuite test failure:
[info] - simple select (22 milliseconds)
[info] - sorting (722 milliseconds)
[info] - external sorting (728 milliseconds)
[info] - limit (95 milliseconds)
[info] - date row *** FAILED *** (35 milliseconds)
[info] Results do not match for query:
[info] 'Limit 1
[info] 'Project [CAST(2015-01-28, DateType) AS c0#3630]
[info] 'UnresolvedRelation [testData], None
[info]
[info] == Analyzed Plan ==
[info] Limit 1
[info] Project [CAST(2015-01-28, DateType) AS c0#3630]
[info] LogicalRDD [key#0,value#1], MapPartitionsRDD[1] at mapPartitions at ExistingRDD.scala:35
[info]
[info] == Physical Plan ==
[info] Limit 1
[info] Project [16463 AS c0#3630]
[info] PhysicalRDD [key#0,value#1], MapPartitionsRDD[1] at mapPartitions at ExistingRDD.scala:35
[info]
[info] == Results ==
[info] !== Correct Answer - 1 == == Spark Answer - 1 ==
[info] ![2015-01-28] [2015-01-27] (QueryTest.scala:77)
[info] org.scalatest.exceptions.TestFailedException:
[info] at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:495)
[info] at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
[info] at org.scalatest.Assertions$class.fail(Assertions.scala:1328)
[info] at org.scalatest.FunSuite.fail(FunSuite.scala:1555)
[info] at org.apache.spark.sql.QueryTest.checkAnswer(QueryTest.scala:77)
[info] at org.apache.spark.sql.QueryTest.checkAnswer(QueryTest.scala:95)
[info] at org.apache.spark.sql.SQLQuerySuite$$anonfun$23.apply$mcV$sp(SQLQuerySuite.scala:300)
[info] at org.apache.spark.sql.SQLQuerySuite$$anonfun$23.apply(SQLQuerySuite.scala:300)
[info] at org.apache.spark.sql.SQLQuerySuite$$anonfun$23.apply(SQLQuerySuite.scala:300)
[info] at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
[info] at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
[info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info] at org.scalatest.Transformer.apply(Transformer.scala:22)
[info] at org.scalatest.Transformer.apply(Transformer.scala:20)
[info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
[info] at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
[info] at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
[info] at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
[info] at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
[info] at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
[info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
[info] at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
[info] at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
[info] at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
[info] at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
[info] at org.scalatest.SuperEngine$$anonfun$traverseSubNode
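As a quick sanity check on the log above, the literal `16463` in the physical plan can be read as a day count since 1970-01-01. The sketch below uses `java.time` (an assumption; it requires Java 8 and is not part of the patch) to confirm that this count corresponds to 2015-01-28, the expected answer. That suggests the off-by-one appears when the stored day count is converted back to a date, though the log alone does not show where.

```scala
import java.time.LocalDate

object EpochDayCheck {
  // Days between 1970-01-01 and the given ISO date (yyyy-MM-dd).
  def toEpochDay(isoDate: String): Long = LocalDate.parse(isoDate).toEpochDay

  // The calendar date that lies `days` days after 1970-01-01.
  def fromEpochDay(days: Long): LocalDate = LocalDate.ofEpochDay(days)

  def main(args: Array[String]): Unit = {
    println(toEpochDay("2015-01-28")) // prints 16463
    println(fromEpochDay(16463L))     // prints 2015-01-28
  }
}
```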
Author: wangfei <wangfei1@huawei.com>
Closes #4395 from scwf/SQLQuerySuite and squashes the following commits:
1431a2d [wangfei] fix conflicts
c35fe5e [wangfei] minor fix
01dab3a [wangfei] fix test failure of SQLQuerySuite
(cherry picked from commit a83936e109087b5cae8b9734032f2f331fdad2e3)
Signed-off-by: Reynold Xin <rxin@databricks.com>
| |
Trivial fix.
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes #4400 from adrian-wang/docdate and squashes the following commits:
31bbe40 [Daoyuan Wang] doc fix for date
(cherry picked from commit 6fa4ac1b007a545201d82603f09b0573f529a4e6)
Signed-off-by: Reynold Xin <rxin@databricks.com>
| |
Author: GuoQiang Li <witgo@qq.com>
Closes #4263 from witgo/SPARK-5474 and squashes the following commits:
ef397ff [GuoQiang Li] review commits
a398324 [GuoQiang Li] curl should support URL redirection in build/mvn
(cherry picked from commit 34147549a7ad188e5eae8d818d36ca0fe882c16f)
Signed-off-by: Patrick Wendell <patrick@databricks.com>
| |
(cherry picked from commit 6580929fa029c4010dd4170de9be9f18516f8e5a)
Signed-off-by: Reynold Xin <rxin@databricks.com>
| |
`Await.result` and `selection.resolveOne` run with the same timeout simultaneously. When the `Await.result` timeout is reached first, a `TimeoutException` is thrown. When the `selection.resolveOne` timeout is reached first, an `ActorNotFoundException` is thrown. This is an obvious race condition, and the easiest way to fix it is to increase the timeout of one method so that the code always fails on the other method first.
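The effect of staggering the two timeouts can be illustrated without Akka. The following is a minimal sketch using only `scala.concurrent` and a JDK scheduler; `LookupFailed`, `lookup`, and `observed` are hypothetical names that stand in for `ActorNotFoundException` and `selection.resolveOne`, and this is not the code from the patch. When the outer (`Await.result`) timeout is clearly longer than the inner one, the inner failure is always the one the caller sees.

```scala
import java.util.concurrent.{Executors, TimeUnit, TimeoutException}
import scala.concurrent.{Await, Future, Promise}
import scala.concurrent.duration._
import scala.util.{Failure, Try}

object TimeoutRaceSketch {
  // Hypothetical stand-in for the exception raised when the inner lookup gives up.
  final class LookupFailed(msg: String) extends RuntimeException(msg)

  // A future that never succeeds and fails with LookupFailed after `innerMs` milliseconds.
  def lookup(innerMs: Long): Future[String] = {
    val promise = Promise[String]()
    val scheduler = Executors.newSingleThreadScheduledExecutor()
    scheduler.schedule(new Runnable {
      def run(): Unit = {
        promise.tryFailure(new LookupFailed(s"no answer within $innerMs ms"))
        scheduler.shutdown()
      }
    }, innerMs, TimeUnit.MILLISECONDS)
    promise.future
  }

  // Which failure the caller observes for a given pair of timeouts.
  def observed(innerMs: Long, outerMs: Long): String =
    Try(Await.result(lookup(innerMs), outerMs.millis)) match {
      case Failure(_: LookupFailed)     => "LookupFailed"
      case Failure(_: TimeoutException) => "TimeoutException"
      case other                        => s"unexpected: $other"
    }

  def main(args: Array[String]): Unit = {
    println(observed(innerMs = 100, outerMs = 2000)) // inner timeout fires first
    println(observed(innerMs = 2000, outerMs = 100)) // outer timeout fires first
  }
}
```

With equal timeouts either outcome is possible, which is the race described above; a wide gap between the two values makes the result deterministic.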
Author: Jacek Lewandowski <lewandowski.jacek@gmail.com>
Closes #4343 from jacek-lewandowski/SPARK-5548-1.3 and squashes the following commits:
b9ba47e [Jacek Lewandowski] SPARK-5548: Fixed a race condition in AkkaUtilsSuite
| |
- Add meta description tags on some of the most important doc pages
- Shorten the titles of some pages to have more relevant keywords; for
example there's no reason to have "Spark SQL Programming Guide - Spark
1.2.0 documentation", we can just say "Spark SQL - Spark 1.2.0
documentation".
Author: Matei Zaharia <matei@databricks.com>
Closes #4381 from mateiz/docs-seo and squashes the following commits:
4940563 [Matei Zaharia] [SPARK-5608] Improve SEO of Spark documentation pages
(cherry picked from commit 4d74f0601a2465b0d2273a8bcc716b304584831f)
Signed-off-by: Matei Zaharia <matei@databricks.com>
| |
This adds a recursive option to the addFile API to satisfy Hive's needs. It only allows specifying HDFS dirs that will be copied down on every executor.
There are a couple of outstanding questions.
* Should we allow specifying local dirs as well? The best way to do this would probably be to archive them. The drawback is that it would require a fair bit of code that I don't know of any current use cases for.
* The addFiles implementation has a caching component that I don't entirely understand. What events are we caching between? AFAICT it's users calling addFile on the same file in the same app at different times? Do we want/need to add something similar for addDirectory?
* The addFiles implementation will check to see if an added file already exists and has the same contents. I imagine we want the same behavior, so planning to add this unless people think otherwise.
I plan to add some tests if people are OK with the approach.
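The "already exists and has the same contents" check mentioned in the last question could look roughly like the sketch below. It uses only `java.nio.file` on local paths; `copyIfAbsentOrSame` is a hypothetical helper written for illustration and is not Spark's implementation.

```scala
import java.nio.file.{Files, Path}
import java.util.Arrays

object AddFileCheckSketch {
  // Copies `src` to `dst` unless `dst` already holds identical bytes.
  // Returns true if a copy happened, false if the existing file was reused.
  // Throws if `dst` exists with different contents.
  def copyIfAbsentOrSame(src: Path, dst: Path): Boolean = {
    if (Files.exists(dst)) {
      val same = Arrays.equals(Files.readAllBytes(src), Files.readAllBytes(dst))
      if (!same) {
        throw new IllegalStateException(s"$dst exists and does not match contents of $src")
      }
      false
    } else {
      Files.copy(src, dst)
      true
    }
  }

  def main(args: Array[String]): Unit = {
    val dir = Files.createTempDirectory("addfile-sketch")
    val src = Files.write(dir.resolve("src.txt"), "hello".getBytes("UTF-8"))
    val dst = dir.resolve("dst.txt")
    println(copyIfAbsentOrSame(src, dst)) // first call copies the file
    println(copyIfAbsentOrSame(src, dst)) // second call finds identical contents
  }
}
```

A recursive variant would apply the same check to each file under the directory; reading whole files into memory is only reasonable for small inputs.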
Author: Sandy Ryza <sandy@cloudera.com>
Closes #3670 from sryza/sandy-spark-4687 and squashes the following commits:
f9fc77f [Sandy Ryza] Josh's comments
70cd24d [Sandy Ryza] Add another test
13da824 [Sandy Ryza] Revert executor changes
38bf94d [Sandy Ryza] Marcelo's comments
ca83849 [Sandy Ryza] Add addFile test
1941be3 [Sandy Ryza] Fix test and avoid HTTP server in local mode
31f15a9 [Sandy Ryza] Use cache recursively and fix some compile errors
0239c3d [Sandy Ryza] Change addDirectory to addFile with recursive
46fe70a [Sandy Ryza] SPARK-4687. Add a addDirectory API
(cherry picked from commit c4b1108c3f9658adebbdf8508d325528c3206f16)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
| |
Author: Reynold Xin <rxin@databricks.com>
Closes #4388 from rxin/mllib-style and squashes the following commits:
61d465b [Reynold Xin] oops
3364295 [Reynold Xin] Missed one ..
5e068e3 [Reynold Xin] [MLlib] Minor: UDF style update.
(cherry picked from commit c3ba4d4cd032e376bfdf7ea7eaab65a79a771e7e)
Signed-off-by: Reynold Xin <rxin@databricks.com>
| |
Author: Reynold Xin <rxin@databricks.com>
Closes #4386 from rxin/df-implicits and squashes the following commits:
9d96606 [Reynold Xin] style fix
edd296b [Reynold Xin] ReplSuite
1c946ab [Reynold Xin] [SPARK-5612][SQL] Move DataFrame implicit functions into SQLContext.implicits.
(cherry picked from commit 7d789e117d6ddaf66159e708db600f2d8db8d787)
Signed-off-by: Reynold Xin <rxin@databricks.com>
| |
Spark currently supports only ```SELECT -key FROM DECIMAL_UDF;``` in HiveContext.
This patch adds support for ```SELECT +key FROM DECIMAL_UDF;``` in HiveContext.
Author: q00251598 <qiyadong@huawei.com>
Closes #4378 from watermen/SPARK-5606 and squashes the following commits:
777f132 [q00251598] sql-case22
74dd368 [q00251598] sql-case22
1a67410 [q00251598] sql-case22
c5cd5bc [q00251598] sql-case22
(cherry picked from commit 9d3a75ef80d0b736d1366a464bf00b64a120f461)
Signed-off-by: Michael Armbrust <michael@databricks.com>
| |
There are no breaking changes (against 1.2) in this PR. I hid `PythonMLLibAPI`, which is only called by Py4J, and renamed `SparseMatrix.diag` to `SparseMatrix.spdiag`. All other changes are documentation and annotations. The `Experimental` tag is removed from `ALS.setAlpha` and `Rating`. One issue not addressed in this PR is the `setCheckpointDir` in `LDA` (https://issues.apache.org/jira/browse/SPARK-5604).
CC: srowen jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes #4377 from mengxr/SPARK-5599 and squashes the following commits:
17975dc [Xiangrui Meng] fix tests
4487f20 [Xiangrui Meng] remove experimental tag from each stat method because Statistics is experimental already
3cd969a [Xiangrui Meng] remove freeman (sorry~) from StreamLA public doc
55900f5 [Xiangrui Meng] make IR experimental and update its doc
9b8eed3 [Xiangrui Meng] graduate Rating and setAlpha in ALS
b854d28 [Xiangrui Meng] correct iid doc in RandomRDDs
27f5bdd [Xiangrui Meng] update linalg docs and some new method signatures
371721b [Xiangrui Meng] mark fpg as experimental and update its doc
8aca7ee [Xiangrui Meng] change SLR to experimental and update the doc
ebbb2e9 [Xiangrui Meng] mark PIC experimental and update the doc
7830d3b [Xiangrui Meng] mark GMM experimental
a378496 [Xiangrui Meng] use the correct subscript syntax in PIC
c65c424 [Xiangrui Meng] update LDAModel doc
a213b0c [Xiangrui Meng] update GMM constructor
3993054 [Xiangrui Meng] hide algorithm in SLR
ad6b9ce [Xiangrui Meng] Revert "make ClassificatinModel.predict(JavaRDD) return JavaDoubleRDD"
0054684 [Xiangrui Meng] add doc to LRModel's constructor
a89763b [Xiangrui Meng] make ClassificatinModel.predict(JavaRDD) return JavaDoubleRDD
7c0946c [Xiangrui Meng] hide PythonMLLibAPI
(cherry picked from commit db34690466d67f9c8ac6a145fddb5f7ea30a8d8d)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
|