Commit message | Author | Age | Files | Lines
...
* [SPARK-9238] [SQL] Remove two extra useless entries for bytesOfCodePointInUTF8 | zhichao.li | 2015-07-24 | 1 | -1/+1
    This is only a trial change, and I am not sure I understand it correctly, but I believe only 2 entries in `bytesOfCodePointInUTF8` are needed for the 6-byte code point case (lead byte 1111110x). Details can be found in the "Description" section of https://en.wikipedia.org/wiki/UTF-8.

    Author: zhichao.li <zhichao.li@intel.com>

    Closes #7582 from zhichao-li/utf8 and squashes the following commits:

    8bddd01 [zhichao.li] two extra entries
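The count of 2 follows from the bit pattern: a lead byte of the form 1111110x has a single free bit, so only 0xFC and 0xFD match it. A minimal sketch of such a lookup table, assuming the original (up to 6-byte) UTF-8 scheme; the names `LEAD_BYTE_TABLE` and `num_bytes_for_first_byte` are illustrative and not taken from Spark's source:

```python
# Illustrative lookup table indexed by (lead byte - 0xC0).
# Each group size is 2 ** (number of free bits in the lead-byte pattern).
LEAD_BYTE_TABLE = (
    [2] * 32    # 110xxxxx: 0xC0-0xDF
    + [3] * 16  # 1110xxxx: 0xE0-0xEF
    + [4] * 8   # 11110xxx: 0xF0-0xF7
    + [5] * 4   # 111110xx: 0xF8-0xFB
    + [6] * 2   # 1111110x: 0xFC-0xFD (only two possible values)
)


def num_bytes_for_first_byte(b: int) -> int:
    """Return the sequence length implied by a lead byte.

    Bytes outside the table (ASCII, continuation bytes, 0xFE, 0xFF)
    are treated as single-byte in this sketch.
    """
    offset = (b & 0xFF) - 0xC0
    if 0 <= offset < len(LEAD_BYTE_TABLE):
        return LEAD_BYTE_TABLE[offset]
    return 1
```

Modern UTF-8 (RFC 3629) stops at 4-byte sequences, so the 5-byte and 6-byte rows only matter for tolerating legacy input.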
* [SPARK-9069] [SQL] follow upDavies Liu2015-07-245-26/+55
| | | | | | | | | | | | | | | Address comments for #7605 cc rxin Author: Davies Liu <davies@databricks.com> Closes #7634 from davies/decimal_unlimited2 and squashes the following commits: b2d8b0d [Davies Liu] add doc and test for DecimalType.isWiderThan 65b251c [Davies Liu] fix test 6a91f32 [Davies Liu] fix style ca9c973 [Davies Liu] address comments
* [SPARK-9236] [CORE] Make defaultPartitioner not reuse a parent RDD's partitioner if it has 0 partitions | François Garillot | 2015-07-24 | 2 | -1/+24
    See also comments on https://issues.apache.org/jira/browse/SPARK-9236

    Author: François Garillot <francois@garillot.net>

    Closes #7616 from huitseeker/issue/SPARK-9236 and squashes the following commits:

    217f902 [François Garillot] [SPARK-9236] Make defaultPartitioner not reuse a parent RDD's partitioner if it has 0 partitions
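The rule in the title can be sketched as follows. This is a simplified model, not Spark's code: `Rdd`, `Partitioner`, and the fallback order (configured default parallelism, then the largest parent's partition count) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Partitioner:
    """Hypothetical stand-in for a partitioner; only the size matters here."""
    num_partitions: int


@dataclass(frozen=True)
class Rdd:
    """Hypothetical stand-in for an RDD with an optional partitioner."""
    num_partitions: int
    partitioner: Optional[Partitioner] = None


def default_partitioner(
    rdds: Sequence[Rdd], default_parallelism: Optional[int] = None
) -> Partitioner:
    # Consider the largest parents first.
    by_size = sorted(rdds, key=lambda r: r.num_partitions, reverse=True)
    for rdd in by_size:
        # Reuse a parent's partitioner only if it has at least one partition.
        if rdd.partitioner is not None and rdd.partitioner.num_partitions > 0:
            return rdd.partitioner
    # Otherwise build a fresh partitioner (assumed fallback order).
    if default_parallelism is not None:
        return Partitioner(default_parallelism)
    return Partitioner(by_size[0].num_partitions)
```

Without the `num_partitions > 0` check, joining an empty RDD that carries a zero-partition partitioner with a non-empty one would produce a result with no partitions to hold the data.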
* [SPARK-8756] [SQL] Keep cached information and avoid re-calculating footers in ParquetRelation2 | Liang-Chi Hsieh | 2015-07-24 | 1 | -14/+24
    JIRA: https://issues.apache.org/jira/browse/SPARK-8756

    Currently, in ParquetRelation2, footers are re-read every time refresh() is called. Reading all footers is expensive when there are many partitions, so we can first check whether a file has possibly changed before reading it. This PR does so by keeping some cached information for that check.

    Author: Liang-Chi Hsieh <viirya@appier.com>

    Closes #7154 from viirya/cached_footer_parquet_relation and squashes the following commits:

    92e9347 [Liang-Chi Hsieh] Fix indentation.
    ae0ec64 [Liang-Chi Hsieh] Fix wrong assignment.
    c8fdfb7 [Liang-Chi Hsieh] Fix it.
    a52b6d1 [Liang-Chi Hsieh] For comments.
    c2a2420 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into cached_footer_parquet_relation
    fa5458f [Liang-Chi Hsieh] Use Map to cache FileStatus and do merging previously loaded schema and newly loaded one.
    6ae0911 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into cached_footer_parquet_relation
    21bbdec [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into cached_footer_parquet_relation
    12a0ed9 [Liang-Chi Hsieh] Add check of FileStatus's modification time.
    186429d [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into cached_footer_parquet_relation
    0ef8caf [Liang-Chi Hsieh] Keep cached information and avoid re-calculating footers.
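The caching idea (a map keyed by file, checked against the modification time) could look roughly like this. It is a sketch under assumptions: `FooterCache`, `read_footer`, and the plain path-to-mtime dictionary are hypothetical stand-ins for the FileStatus-based bookkeeping the commits mention.

```python
from typing import Callable, Dict, Tuple


class FooterCache:
    """Re-read a footer only when the file is new or its mtime changed."""

    def __init__(self, read_footer: Callable[[str], str]):
        self._read_footer = read_footer            # the expensive operation
        self._cache: Dict[str, Tuple[int, str]] = {}
        self.reads = 0                             # counts expensive reads

    def refresh(self, statuses: Dict[str, int]) -> Dict[str, str]:
        """`statuses` maps each current file path to its modification time."""
        fresh: Dict[str, Tuple[int, str]] = {}
        for path, mtime in statuses.items():
            cached = self._cache.get(path)
            if cached is None or cached[0] != mtime:
                self.reads += 1
                cached = (mtime, self._read_footer(path))
            fresh[path] = cached
        # Replacing the map drops entries for files that no longer exist.
        self._cache = fresh
        return {path: footer for path, (_, footer) in fresh.items()}
```

Calling `refresh` twice with unchanged statuses performs no additional reads, which is the saving the change is after when a table has many partitions.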
* [build] Enable memory leak detection for Tungsten.Reynold Xin2015-07-241-1/+1
| | | | | | | | | | This was turned off accidentally in #7591. Author: Reynold Xin <rxin@databricks.com> Closes #7637 from rxin/enable-mem-leak-detect and squashes the following commits: 34bc3ef [Reynold Xin] Enable memory leak detection for Tungsten.
* [SPARK-9200][SQL] Don't implicitly cast non-atomic types to string type.Reynold Xin2015-07-242-1/+10
| | | | | | | | Author: Reynold Xin <rxin@databricks.com> Closes #7636 from rxin/complex-string-implicit-cast and squashes the following commits: 3e67327 [Reynold Xin] [SPARK-9200][SQL] Don't implicitly cast non-atomic types to string type.
* [SPARK-9294][SQL] cleanup comments, code style, naming typo for the new ↵Wenchen Fan2015-07-237-89/+46
| | | | | | | | | | | | | | aggregation fix some comments and code style for https://github.com/apache/spark/pull/7458 Author: Wenchen Fan <cloud0fan@outlook.com> Closes #7619 from cloud-fan/agg-clean and squashes the following commits: 3925457 [Wenchen Fan] one more... cc78357 [Wenchen Fan] one more cleanup 26f6a93 [Wenchen Fan] some minor cleanup for the new aggregation
* [SPARK-8092] [ML] Allow OneVsRest Classifier feature and label column names ↵Ram Sriharsha2015-07-232-1/+40
| | | | | | | | | | | | | | | | to be configurable. The base classifier input and output columns are ignored in favor of the ones specified in OneVsRest. Author: Ram Sriharsha <rsriharsha@hw11853.local> Closes #6631 from harsha2010/SPARK-8092 and squashes the following commits: 6591dc6 [Ram Sriharsha] add documentation for params b7024b1 [Ram Sriharsha] cleanup f0e2bfb [Ram Sriharsha] merge with master 108d3d7 [Ram Sriharsha] merge with master 4f74126 [Ram Sriharsha] Allow label/ features columns to be configurable
* [SPARK-9216] [STREAMING] Define KinesisBackedBlockRDDsTathagata Das2015-07-235-5/+545
| | | | | | | | | | | | | | | | | | | | | | For more information see master JIRA: https://issues.apache.org/jira/browse/SPARK-9215 Design Doc: https://docs.google.com/document/d/1k0dl270EnK7uExrsCE7jYw7PYx0YC935uBcxn3p0f58/edit Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #7578 from tdas/kinesis-rdd and squashes the following commits: 543d208 [Tathagata Das] Fixed scala style 5082a30 [Tathagata Das] Fixed scala style 3f40c2d [Tathagata Das] Addressed comments c4f25d2 [Tathagata Das] Addressed comment d3d64d1 [Tathagata Das] Minor update f6e35c8 [Tathagata Das] Added retry logic to make it more robust 8874b70 [Tathagata Das] Updated Kinesis RDD 575bdbc [Tathagata Das] Fix scala style issues 4a36096 [Tathagata Das] Add license 5da3995 [Tathagata Das] Changed KinesisSuiteHelper to KinesisFunSuite 528e206 [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into kinesis-rdd 3ae0814 [Tathagata Das] Added KinesisBackedBlockRDD
* [SPARK-9122] [MLLIB] [PySpark] spark.mllib regression support batch predictYanbo Liang2015-07-231-2/+10
| | | | | | | | | | spark.mllib support batch predict for LinearRegressionModel, RidgeRegressionModel and LassoModel. Author: Yanbo Liang <ybliang8@gmail.com> Closes #7614 from yanboliang/spark-9122 and squashes the following commits: 4e610c0 [Yanbo Liang] spark.mllib regression support batch predict
* [SPARK-9069] [SPARK-9264] [SQL] remove unlimited precision support for DecimalType | Davies Liu | 2015-07-23 | 53 | -473/+459
    Remove Decimal.Unlimited (change to support precision up to 38, to match Hive and other databases).

    To keep backward source compatibility, Decimal.Unlimited is still there, but it is changed to Decimal(38, 18).

    If no precision and scale are provided, it is Decimal(10, 0) as before.

    Author: Davies Liu <davies@databricks.com>

    Closes #7605 from davies/decimal_unlimited and squashes the following commits:

    aa3f115 [Davies Liu] fix tests and style
    fb0d20d [Davies Liu] address comments
    bfaae35 [Davies Liu] fix style
    df93657 [Davies Liu] address comments and clean up
    06727fd [Davies Liu] Merge branch 'master' of github.com:apache/spark into decimal_unlimited
    4c28969 [Davies Liu] fix tests
    8d783cc [Davies Liu] fix tests
    788631c [Davies Liu] fix double with decimal in Union/except
    1779bde [Davies Liu] fix scala style
    c9c7c78 [Davies Liu] remove Decimal.Unlimited
* [SPARK-9207] [SQL] Enables Parquet filter push-down by defaultCheng Lian2015-07-232-13/+4
| | | | | | | | | | PARQUET-136 and PARQUET-173 have been fixed in parquet-mr 1.7.0. It's time to enable filter push-down by default now. Author: Cheng Lian <lian@databricks.com> Closes #7612 from liancheng/spark-9207 and squashes the following commits: 77e6b5e [Cheng Lian] Enables Parquet filter push-down by default
* [SPARK-9286] [SQL] Methods in Unevaluable should be final and AlgebraicAggregate should extend Unevaluable. | Josh Rosen | 2015-07-23 | 3 | -20/+10
    This patch marks the Unevaluable.eval() and Unevaluable.genCode() methods as final and fixes two cases where they were overridden. It also updates AggregateFunction2 to extend Unevaluable.

    Author: Josh Rosen <joshrosen@databricks.com>

    Closes #7627 from JoshRosen/unevaluable-fix and squashes the following commits:

    8d9ed22 [Josh Rosen] AlgebraicAggregate should extend Unevaluable
    65329c2 [Josh Rosen] Do not have AggregateFunction1 inherit from AggregateExpression1
    fa68a22 [Josh Rosen] Make eval() and genCode() final
* [SPARK-5447][SQL] Replace reference 'schema rdd' with DataFrame @rxin.David Arroyo Cazorla2015-07-231-1/+1
| | | | | | | | Author: David Arroyo Cazorla <darroyo@stratio.com> Closes #7618 from darroyocazorla/master and squashes the following commits: 5f91379 [David Arroyo Cazorla] [SPARK-5447][SQL] Replace reference 'schema rdd' with DataFrame
* [SPARK-9243] [Documentation] null -> zero in crosstab docXiangrui Meng2015-07-233-3/+3
| | | | | | | | | | We forgot to update doc. brkyvz Author: Xiangrui Meng <meng@databricks.com> Closes #7608 from mengxr/SPARK-9243 and squashes the following commits: 0ea3236 [Xiangrui Meng] null -> zero in crosstab doc
* [SPARK-9183] confusing error message when looking up missing function in ↵Yijie Shen2015-07-233-0/+12
| | | | | | | | | | | | | | | | Spark SQL JIRA: https://issues.apache.org/jira/browse/SPARK-9183 cc rxin Author: Yijie Shen <henry.yijieshen@gmail.com> Closes #7613 from yjshen/npe_udf and squashes the following commits: 44f58f2 [Yijie Shen] add jira ticket number 903c963 [Yijie Shen] add explanation comments f44dd3c [Yijie Shen] Change two hive class LogLevel to avoid annoying messages
* [Build][Minor] Fix building error & performance | Cheng Hao | 2015-07-23 | 2 | -1/+2
    1. Building the latest code with sbt fails with an error like:
       [error] /home/hcheng/git/catalyst/core/src/main/scala/org/apache/spark/ui/jobs/AllJobsPage.scala:78: match may not be exhaustive.
       [error] It would fail on the following input: UNKNOWN
       [error] val classNameByStatus = status match {
       [error]
    2. Potential performance issue when implicitly converting an Array[Any] to Seq[Any].

    Author: Cheng Hao <hao.cheng@intel.com>

    Closes #7611 from chenghao-intel/toseq and squashes the following commits:

    cab75c5 [Cheng Hao] remove the toArray
    24df682 [Cheng Hao] fix building error & performance
* [SPARK-9082] [SQL] [FOLLOW-UP] use `partition` in `PushPredicateThroughProject`Wenchen Fan2015-07-231-14/+8
| | | | | | | | | | a follow up of https://github.com/apache/spark/pull/7446 Author: Wenchen Fan <cloud0fan@outlook.com> Closes #7607 from cloud-fan/tmp and squashes the following commits: 7106989 [Wenchen Fan] use `partition` in `PushPredicateThroughProject`
* [SPARK-9212] [CORE] upgrade Netty version to 4.0.29.FinalZhang, Liye2015-07-231-1/+1
| | | | | | | | | | related JIRA: [SPARK-9212](https://issues.apache.org/jira/browse/SPARK-9212) and [SPARK-8101](https://issues.apache.org/jira/browse/SPARK-8101) Author: Zhang, Liye <liye.zhang@intel.com> Closes #7562 from liyezhang556520/SPARK-9212 and squashes the following commits: 1917729 [Zhang, Liye] SPARK-9212 upgrade Netty version to 4.0.29.Final
* Revert "[SPARK-8579] [SQL] support arbitrary object in UnsafeRow" | Reynold Xin | 2015-07-23 | 24 | -631/+355
    Reverts ObjectPool. As it stands, it has a few problems:

    1. ObjectPool doesn't work with spilling and memory accounting.
    2. I don't think in the long run the idea of an object pool is what we want to support, since it essentially goes back to unmanaged memory, and creates pressure on GC, and is hard to account for the total in memory size.
    3. The ObjectPool patch removed the specialized getters for strings and binary, and as a result, actually introduced branches when reading non primitive data types.

    If we do want to support arbitrary user defined types in the future, I think we can just add an object array in UnsafeRow, rather than relying on indirect memory addressing through a pool. We also need to pick execution strategies that are optimized for those, rather than keeping a lot of unserialized JVM objects in memory during aggregation.

    This is probably the hardest thing I had to revert in Spark, due to recent patches that also change the same part of the code. Would be great to get a careful look.

    Author: Reynold Xin <rxin@databricks.com>

    Closes #7591 from rxin/revert-object-pool and squashes the following commits:

    01db0bc [Reynold Xin] Scala style.
    eda89fc [Reynold Xin] Fixed describe.
    2967118 [Reynold Xin] Fixed accessor for JoinedRow.
    e3294eb [Reynold Xin] Merge branch 'master' into revert-object-pool
    657855f [Reynold Xin] Temp commit.
    c20f2c8 [Reynold Xin] Style fix.
    fe37079 [Reynold Xin] Revert "[SPARK-8579] [SQL] support arbitrary object in UnsafeRow"
* [SPARK-9266] Prevent "managed memory leak detected" exception from masking original exception | Josh Rosen | 2015-07-23 | 2 | -2/+30
    When a task fails with an exception and also fails to properly clean up its managed memory, the `spark.unsafe.exceptionOnMemoryLeak` memory leak detection mechanism's exceptions will mask the original exception that caused the task to fail. We should throw the memory leak exception only if no other exception occurred.

    Author: Josh Rosen <joshrosen@databricks.com>

    Closes #7603 from JoshRosen/SPARK-9266 and squashes the following commits:

    c268cb5 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-9266
    c1f0167 [Josh Rosen] Fix the error masking problem
    448eae8 [Josh Rosen] Add regression test
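The masking problem and its fix can be illustrated with a small sketch. Everything here is hypothetical (`run_task`, `release_all_memory`, `MemoryLeakError`); it only models the rule "raise the leak error only if the task body itself succeeded".

```python
class MemoryLeakError(RuntimeError):
    """Hypothetical error type standing in for the leak-detection exception."""


def run_task(body, release_all_memory, fail_on_leak=True):
    """Run `body`, then clean up; report a leak only if `body` succeeded.

    `release_all_memory` is assumed to free whatever the task left behind
    and return the number of bytes it had to free.
    """
    succeeded = False
    try:
        result = body()
        succeeded = True
        return result
    finally:
        leaked_bytes = release_all_memory()
        # Raising here unconditionally would replace the exception that is
        # already propagating from `body`, hiding the real failure.
        if leaked_bytes > 0 and fail_on_leak and succeeded:
            raise MemoryLeakError(
                f"Managed memory leak detected; size = {leaked_bytes} bytes"
            )
```

If `body` raises, the `finally` block still releases memory but stays silent about the leak, so the caller sees the original exception.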
* [SPARK-8695] [CORE] [MLLIB] TreeAggregation shouldn't be triggered when it ↵Perinkulam I. Ganesh2015-07-231-1/+3
| | | | | | | | | | | | doesn't save wall-clock time. Author: Perinkulam I. Ganesh <gip@us.ibm.com> Closes #7397 from piganesh/SPARK-8695 and squashes the following commits: 041620c [Perinkulam I. Ganesh] [SPARK-8695][CORE][MLlib] TreeAggregation shouldn't be triggered when it doesn't save wall-clock time. 9ad067c [Perinkulam I. Ganesh] [SPARK-8695] [core] [WIP] TreeAggregation shouldn't be triggered for 5 partitions a6fed07 [Perinkulam I. Ganesh] [SPARK-8695] [core] [WIP] TreeAggregation shouldn't be triggered for 5 partitions
* [SPARK-8935] [SQL] Implement code generation for all castsYijie Shen2015-07-222-51/+508
| | | | | | | | | | | | | | | | | | | | JIRA: https://issues.apache.org/jira/browse/SPARK-8935 Author: Yijie Shen <henry.yijieshen@gmail.com> Closes #7365 from yjshen/cast_codegen and squashes the following commits: ef6e8b5 [Yijie Shen] getColumn and setColumn in struct cast, autounboxing in array and map eaece18 [Yijie Shen] remove null case in cast code gen fd7eba4 [Yijie Shen] resolve comments 80378a5 [Yijie Shen] the missing self cast 611d66e [Yijie Shen] Bug fix: NullType & primitive object unboxing 6d5c0fe [Yijie Shen] rebase and add Interval codegen 9424b65 [Yijie Shen] tiny style fix 4a1c801 [Yijie Shen] remove CodeHolder class, use function instead. 3f5df88 [Yijie Shen] CodeHolder for complex dataTypes c286f13 [Yijie Shen] moved all the cast code into class body 4edfd76 [Yijie Shen] [WIP] finished primitive part
* [SPARK-7254] [MLLIB] Run PowerIterationClustering directly on graphLiang-Chi Hsieh2015-07-222-0/+94
| | | | | | | | | | | | | | | JIRA: https://issues.apache.org/jira/browse/SPARK-7254 Author: Liang-Chi Hsieh <viirya@appier.com> Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #6054 from viirya/pic_on_graph and squashes the following commits: 8b87b81 [Liang-Chi Hsieh] Fix scala style. a22fb8b [Liang-Chi Hsieh] For comment. ef565a0 [Liang-Chi Hsieh] Fix indentation. d249aa1 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into pic_on_graph 82d7351 [Liang-Chi Hsieh] Run PowerIterationClustering directly on graph.
* [SPARK-9268] [ML] Removed varargs annotation from Params.setDefault taking ↵Joseph K. Bradley2015-07-222-4/+4
| | | | | | | | | | | | | | | | multiple params Removed varargs annotation from Params.setDefault taking multiple params. Though varargs is technically correct, it often requires that developers do clean assembly, rather than (not clean) assembly, which is a nuisance during development. CC: mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #7604 from jkbradley/params-setdefault-varargs and squashes the following commits: 6016dc6 [Joseph K. Bradley] removed varargs annotation from Params.setDefault taking multiple params
* [SPARK-8364] [SPARKR] Add crosstab to SparkR DataFrames | Xiangrui Meng | 2015-07-22 | 4 | -0/+46
    Add `crosstab` to SparkR DataFrames, which takes two column names and returns a local R data.frame. This is similar to `table` in R. However, `table` in SparkR is used for loading SQL tables as DataFrames. The return type of `crosstab` is data.frame instead of table, to be compatible with Scala/Python.

    I couldn't run the R tests successfully on my local machine; many unit tests failed. So let's try Jenkins.

    Author: Xiangrui Meng <meng@databricks.com>

    Closes #7318 from mengxr/SPARK-8364 and squashes the following commits:

    d75e894 [Xiangrui Meng] fix tests
    53f6ddd [Xiangrui Meng] fix tests
    f1348d6 [Xiangrui Meng] update test
    47cb088 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-8364
    5621262 [Xiangrui Meng] first version without test
* [SPARK-9144] Remove DAGScheduler.runLocallyWithinThread and ↵Josh Rosen2015-07-2219-229/+108
| | | | | | | | | | | | | | | | | | | | | | | | spark.localExecution.enabled Spark has an option called spark.localExecution.enabled; according to the docs: > Enables Spark to run certain jobs, such as first() or take() on the driver, without sending tasks to the cluster. This can make certain jobs execute very quickly, but may require shipping a whole partition of data to the driver. This feature ends up adding quite a bit of complexity to DAGScheduler, especially in the runLocallyWithinThread method, but as far as I know nobody uses this feature (I searched the mailing list and haven't seen any recent mentions of the configuration nor stacktraces including the runLocally method). As a step towards scheduler complexity reduction, I propose that we remove this feature and all code related to it for Spark 1.5. This pull request simply brings #7484 up to date. Author: Josh Rosen <joshrosen@databricks.com> Author: Reynold Xin <rxin@databricks.com> Closes #7585 from rxin/remove-local-exec and squashes the following commits: 84bd10e [Reynold Xin] Python fix. 1d9739a [Reynold Xin] Merge pull request #7484 from JoshRosen/remove-localexecution eec39fa [Josh Rosen] Remove allowLocal(); deprecate user-facing uses of it. b0835dc [Josh Rosen] Remove local execution code in DAGScheduler 8975d96 [Josh Rosen] Remove local execution tests. ffa8c9b [Josh Rosen] Remove documentation for configuration
* [SPARK-9262][build] Treat Scala compiler warnings as errors | Reynold Xin | 2015-07-22 | 12 | -18/+55
    I've seen a few cases in the past few weeks in which the compiler emitted warnings that were caused by legitimate bugs. This patch upgrades warnings to errors, except deprecation warnings.

    Note that ideally we should be able to mark deprecation warnings as errors as well. However, because the Scala compiler cannot suppress individual warning messages, we cannot do that (since we do need to access deprecated APIs in Hadoop).

    Most of the work was done by ericl.

    Author: Reynold Xin <rxin@databricks.com>
    Author: Eric Liang <ekl@databricks.com>

    Closes #7598 from rxin/warnings and squashes the following commits:

    beb311b [Reynold Xin] Fixed tests.
    542c031 [Reynold Xin] Fixed one more warning.
    87c354a [Reynold Xin] Fixed all non-deprecation warnings.
    78660ac [Eric Liang] first effort to fix warnings
* [SPARK-8484] [ML] Added TrainValidationSplit for hyper-parameter tuning. | martinzapletal | 2015-07-22 | 4 | -32/+368
    - [X] Added TrainValidationSplit for hyper-parameter tuning. It randomly splits the input dataset into train and validation sets and uses an evaluation metric on the validation set to select the best model. It should be similar to CrossValidator, but simpler and less expensive.
    - [X] Simplified replacement of https://github.com/apache/spark/pull/6996

    Author: martinzapletal <zapletal-martin@email.cz>

    Closes #7337 from zapletal-martin/SPARK-8484-TrainValidationSplit and squashes the following commits:

    cafc949 [martinzapletal] Review comments https://github.com/apache/spark/pull/7337.
    511b398 [martinzapletal] Merge remote-tracking branch 'upstream/master' into SPARK-8484-TrainValidationSplit
    f4fc9c4 [martinzapletal] SPARK-8484 Resolved feedback to https://github.com/apache/spark/pull/7337
    00c4f5a [martinzapletal] SPARK-8484. Styling.
    d699506 [martinzapletal] SPARK-8484. Styling.
    93ed2ee [martinzapletal] Styling.
    3bc1853 [martinzapletal] SPARK-8484. Styling.
    2aa6f43 [martinzapletal] SPARK-8484. Added TrainValidationSplit for hyper-parameter tuning. It randomly splits the input dataset into train and validation and use evaluation metric on the validation set to select the best model.
    21662eb [martinzapletal] SPARK-8484. Added TrainValidationSplit for hyper-parameter tuning. It randomly splits the input dataset into train and validation and use evaluation metric on the validation set to select the best model.
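The procedure described above (one random split, fit every candidate on the train part, score on the validation part, keep the best) can be sketched in plain Python. This is not the Spark ML API: `train_validation_split`, `fit`, `evaluate`, and the default `train_ratio` are assumptions for illustration, and the metric is assumed to be larger-is-better.

```python
import random
from typing import Any, Callable, Optional, Sequence, Tuple


def train_validation_split(
    data: Sequence[Any],
    param_grid: Sequence[Any],
    fit: Callable[[Sequence[Any], Any], Any],
    evaluate: Callable[[Any, Sequence[Any]], float],
    train_ratio: float = 0.75,
    seed: int = 0,
) -> Optional[Tuple[float, Any, Any]]:
    """Return (score, params, model) for the best candidate, or None."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)                      # one random split only
    cut = int(len(shuffled) * train_ratio)
    train, validation = shuffled[:cut], shuffled[cut:]

    best: Optional[Tuple[float, Any, Any]] = None
    for params in param_grid:
        model = fit(train, params)             # one fit per candidate
        score = evaluate(model, validation)
        if best is None or score > best[0]:
            best = (score, params, model)
    return best
```

Compared with k-fold cross-validation, each candidate is fitted once instead of k times, which is where the "simpler and less expensive" trade-off comes from; the price is a noisier estimate from a single validation set.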
* [SPARK-9223] [PYSPARK] [MLLIB] Support model save/load in LDAMechCoder2015-07-221-1/+42
| | | | | | | | | | Since save / load has been merged in LDA, it takes no time to write the wrappers in Python as well. Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #7587 from MechCoder/python_lda_save_load and squashes the following commits: c8e4ea7 [MechCoder] [SPARK-9223] [PySpark] Support model save/load in LDA
* [SPARK-9180] fix spark-shell to accept --name option | Kenichi Maehashi | 2015-07-22 | 4 | -5/+5
    This patch fixes [[SPARK-9180]](https://issues.apache.org/jira/browse/SPARK-9180). Users can now set the app name of spark-shell using `spark-shell --name "whatever"`.

    Author: Kenichi Maehashi <webmaster@kenichimaehashi.com>

    Closes #7512 from kmaehashi/fix-spark-shell-app-name and squashes the following commits:

    e24991a [Kenichi Maehashi] use setIfMissing instead of setAppName
    18aa4ad [Kenichi Maehashi] fix spark-shell to accept --name option
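The squashed commit "use setIfMissing instead of setAppName" hints at why the option was being ignored: an unconditional set overwrites whatever the user passed. A toy model of the difference, assuming a dictionary-backed configuration; the `Conf` class and the default name used below are illustrative, not Spark's `SparkConf`:

```python
class Conf:
    """Toy configuration object with overwrite and set-if-missing semantics."""

    def __init__(self):
        self._settings = {}

    def set(self, key, value):
        self._settings[key] = value            # always overwrites
        return self

    def set_if_missing(self, key, value):
        self._settings.setdefault(key, value)  # keeps an existing value
        return self

    def get(self, key):
        return self._settings[key]


# A user-supplied name survives a later default when set_if_missing is used.
conf = Conf().set("spark.app.name", "whatever")
conf.set_if_missing("spark.app.name", "Spark shell")
```

After these two calls `conf.get("spark.app.name")` is still the user's value, whereas a second `set` would have replaced it with the default.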
* [SPARK-8975] [STREAMING] Adds a mechanism to send a new rate from the driver ↵Iulian Dragos2015-07-228-8/+153
| | | | | | | | | | | | | | | | | | | | | | | to the block generator First step for [SPARK-7398](https://issues.apache.org/jira/browse/SPARK-7398). tdas huitseeker Author: Iulian Dragos <jaguarul@gmail.com> Author: François Garillot <francois@garillot.net> Closes #7471 from dragos/topic/streaming-bp/dynamic-rate and squashes the following commits: 8941cf9 [Iulian Dragos] Renames and other nitpicks. 162d9e5 [Iulian Dragos] Use Reflection for accessing truly private `executor` method and use the listener bus to know when receivers have registered (`onStart` is called before receivers have registered, leading to flaky behavior). 210f495 [Iulian Dragos] Revert "Added a few tests that measure the receiver’s rate." 0c51959 [Iulian Dragos] Added a few tests that measure the receiver’s rate. 261a051 [Iulian Dragos] - removed field to hold the current rate limit in rate limiter - made rate limit a Long and default to Long.MaxValue (consequence of the above) - removed custom `waitUntil` and replaced it by `eventually` cd1397d [Iulian Dragos] Add a test for the propagation of a new rate limit from driver to receivers. 6369b30 [Iulian Dragos] Merge pull request #15 from huitseeker/SPARK-8975 d15de42 [François Garillot] [SPARK-8975][Streaming] Adds Ratelimiter unit tests w.r.t. spark.streaming.receiver.maxRate 4721c7d [François Garillot] [SPARK-8975][Streaming] Add a mechanism to send a new rate from the driver to the block generator
* [SPARK-9244] Increase some memory defaults | Matei Zaharia | 2015-07-22 | 27 | -80/+78
    There are a few memory limits that people hit often and that we could make higher, especially now that memory sizes have grown.

    - spark.akka.frameSize: This defaults to 10 but is often hit for map output statuses in large shuffles. This memory is not fully allocated up-front, so we can just make this larger and still not affect jobs that never send a status that large. We increase it to 128.
    - spark.executor.memory: Defaults to 512m, which is really small. We increase it to 1g.

    Author: Matei Zaharia <matei@databricks.com>

    Closes #7586 from mateiz/configs and squashes the following commits:

    ce0038a [Matei Zaharia] [SPARK-9244] Increase some memory defaults
* [SPARK-8536] [MLLIB] Generalize OnlineLDAOptimizer to asymmetric ↵Feynman Liang2015-07-223-32/+126
| | | | | | | | | | | | | | | | | | | document-topic Dirichlet priors Modify `LDA` to take asymmetric document-topic prior distributions and `OnlineLDAOptimizer` to use the asymmetric prior during variational inference. This PR only generalizes `OnlineLDAOptimizer` and the associated `LocalLDAModel`; `EMLDAOptimizer` and `DistributedLDAModel` still only support symmetric `alpha` (checked during `EMLDAOptimizer.initialize`). Author: Feynman Liang <fliang@databricks.com> Closes #7575 from feynmanliang/SPARK-8536-LDA-asymmetric-priors and squashes the following commits: af8fbb7 [Feynman Liang] Fix merge errors ef5821d [Feynman Liang] Merge remote-tracking branch 'apache/master' into SPARK-8536-LDA-asymmetric-priors 58f1d7b [Feynman Liang] Fix from review feedback a6dcf70 [Feynman Liang] Change docConcentration interface and move LDAOptimizer validation to initialize, add sad path tests 72038ff [Feynman Liang] Add tests referenced against gensim d4284fa [Feynman Liang] Generalize OnlineLDA to asymmetric priors, no tests
* [SPARK-4366] [SQL] [Follow-up] Fix SqlParser compiling warning.Yin Huai2015-07-221-2/+1
| | | | | | | | Author: Yin Huai <yhuai@databricks.com> Closes #7588 from yhuai/SPARK-4366-update1 and squashes the following commits: 25f5f36 [Yin Huai] Fix SqlParser Warning.
* [SPARK-9224] [MLLIB] OnlineLDA Performance ImprovementsFeynman Liang2015-07-221-32/+27
| | | | | | | | | | | | | In-place updates, reduce number of transposes, and vectorize operations in OnlineLDA implementation. Author: Feynman Liang <fliang@databricks.com> Closes #7454 from feynmanliang/OnlineLDA-perf-improvements and squashes the following commits: 78b0f5a [Feynman Liang] Make in-place variables vals, fix BLAS error 7f62a55 [Feynman Liang] --amend c62cb1e [Feynman Liang] Outer product for stats, revert Range slicing aead650 [Feynman Liang] Range slice, in-place update, reduce transposes
* [SPARK-9024] Unsafe HashJoin/HashOuterJoin/HashSemiJoin | Davies Liu | 2015-07-22 | 20 | -135/+444
    This PR introduces unsafe versions (using UnsafeRow) of HashJoin, HashOuterJoin and HashSemiJoin, including the broadcast and shuffle variants (except FullOuterJoin, which is better implemented using SortMergeJoin).

    It uses HashMap to store UnsafeRow right now, and will change to use BytesToBytesMap for better performance (in another PR).

    Author: Davies Liu <davies@databricks.com>

    Closes #7480 from davies/unsafe_join and squashes the following commits:

    6294b1e [Davies Liu] fix projection
    10583f1 [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_join
    dede020 [Davies Liu] fix test
    84c9807 [Davies Liu] address comments
    a05b4f6 [Davies Liu] support UnsafeRow in LeftSemiJoinBNL and BroadcastNestedLoopJoin
    611d2ed [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_join
    9481ae8 [Davies Liu] return UnsafeRow after join()
    ca2b40f [Davies Liu] revert unrelated change
    68f5cd9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_join
    0f4380d [Davies Liu] ada a comment
    69e38f5 [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_join
    1a40f02 [Davies Liu] refactor
    ab1690f [Davies Liu] address comments
    60371f2 [Davies Liu] use UnsafeRow in SemiJoin
    a6c0b7d [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_join
    184b852 [Davies Liu] fix style
    6acbb11 [Davies Liu] fix tests
    95d0762 [Davies Liu] remove println
    bea4a50 [Davies Liu] Unsafe HashJoin
* [SPARK-9165] [SQL] codegen for CreateArray, CreateStruct and CreateNamedStructYijie Shen2015-07-222-5/+76
| | | | | | | | | | | | JIRA: https://issues.apache.org/jira/browse/SPARK-9165 Author: Yijie Shen <henry.yijieshen@gmail.com> Closes #7537 from yjshen/array_struct_codegen and squashes the following commits: 3a6dce6 [Yijie Shen] use infix notion in createArray test 5e90f0a [Yijie Shen] resolve comments: classOf 39cefb8 [Yijie Shen] codegen for createArray createStruct & createNamedStruct
* [SPARK-9082] [SQL] Filter using non-deterministic expressions should not be pushed down | Wenchen Fan | 2015-07-22 | 2 | -11/+84
    Author: Wenchen Fan <cloud0fan@outlook.com>

    Closes #7446 from cloud-fan/filter and squashes the following commits:

    330021e [Wenchen Fan] add exists to tree node
    2cab68c [Wenchen Fan] more enhance
    949be07 [Wenchen Fan] push down part of predicate if possible
    3912f84 [Wenchen Fan] address comments
    8ce15ca [Wenchen Fan] fix bug
    557158e [Wenchen Fan] Filter using non-deterministic expressions should not be pushed down
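The squashed commit "push down part of predicate if possible" suggests splitting a conjunctive filter into a part that may move below another operator and a part that must stay. A heavily simplified sketch of that split, assuming each conjunct already carries a `deterministic` flag; `Predicate` and `split_for_pushdown` are hypothetical names, not Catalyst classes:

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Predicate:
    """Hypothetical conjunct of a filter condition."""
    sql: str
    deterministic: bool = True


def split_for_pushdown(
    conjuncts: Sequence[Predicate],
) -> Tuple[List[Predicate], List[Predicate]]:
    """Partition conjuncts into (pushed_down, kept_in_place).

    A non-deterministic conjunct, e.g. one calling rand(), stays where it
    is: evaluating it at a different point in the plan could change which
    rows it sees and how often it runs, and therefore the query result.
    """
    pushed = [p for p in conjuncts if p.deterministic]
    kept = [p for p in conjuncts if not p.deterministic]
    return pushed, kept
```

For a condition such as `a > 1 AND rand() > 0.5`, only `a > 1` would be pushed down in this model.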
* [SPARK-9254] [BUILD] [HOTFIX] sbt-launch-lib.bash should support HTTP/HTTPS ↵Cheng Lian2015-07-221-2/+6
| | | | | | | | | | | | | redirection Target file(s) can be hosted on CDN nodes. HTTP/HTTPS redirection must be supported to download these files. Author: Cheng Lian <lian@databricks.com> Closes #7597 from liancheng/spark-9254 and squashes the following commits: fd266ca [Cheng Lian] Uses `--fail' to make curl return non-zero value and remove garbage output when the download fails a7cbfb3 [Cheng Lian] Supports HTTP/HTTPS redirection
* [SPARK-4233] [SPARK-4367] [SPARK-3947] [SPARK-3056] [SQL] Aggregation ↵Yin Huai2015-07-2139-100/+3087
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Improvement This is the first PR for the aggregation improvement, which is tracked by https://issues.apache.org/jira/browse/SPARK-4366 (umbrella JIRA). This PR contains work for its subtasks, SPARK-3056, SPARK-3947, SPARK-4233, and SPARK-4367. This PR introduces a new code path for evaluating aggregate functions. This code path is guarded by `spark.sql.useAggregate2` and by default the value of this flag is true. This new code path contains: * A new aggregate function interface (`AggregateFunction2`) and 7 built-int aggregate functions based on this new interface (`AVG`, `COUNT`, `FIRST`, `LAST`, `MAX`, `MIN`, `SUM`) * A UDAF interface (`UserDefinedAggregateFunction`) based on the new code path and two example UDAFs (`MyDoubleAvg` and `MyDoubleSum`). * A sort-based aggregate operator (`Aggregate2Sort`) for the new aggregate function interface . * A sort-based aggregate operator (`FinalAndCompleteAggregate2Sort`) for distinct aggregations (for distinct aggregations the query plan will use `Aggregate2Sort` and `FinalAndCompleteAggregate2Sort` together). With this change, `spark.sql.useAggregate2` is `true`, the flow of compiling an aggregation query is: 1. Our analyzer looks up functions and returns aggregate functions built based on the old aggregate function interface. 2. When our planner is compiling the physical plan, it tries try to convert all aggregate functions to the ones built based on the new interface. The planner will fallback to the old code path if any of the following two conditions is true: * code-gen is disabled. * there is any function that cannot be converted (right now, Hive UDAFs). * the schema of grouping expressions contain any complex data type. * There are multiple distinct columns. 
Right now, the new code path handles a single distinct column in the query (you can have multiple aggregate functions using that distinct column). For a query having a aggregate function with DISTINCT and regular aggregate functions, the generated plan will do partial aggregations for those regular aggregate function. Thanks chenghao-intel for his initial work on it. Author: Yin Huai <yhuai@databricks.com> Author: Michael Armbrust <michael@databricks.com> Closes #7458 from yhuai/UDAF and squashes the following commits: 7865f5e [Yin Huai] Put the catalyst expression in the comment of the generated code for it. b04d6c8 [Yin Huai] Remove unnecessary change. f1d5901 [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF 35b0520 [Yin Huai] Use semanticEquals to replace grouping expressions in the output of the aggregate operator. 3b43b24 [Yin Huai] bug fix. 00eb298 [Yin Huai] Make it compile. a3ca551 [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF e0afca3 [Yin Huai] Gracefully fallback to old aggregation code path. 8a8ac4a [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF 88c7d4d [Yin Huai] Enable spark.sql.useAggregate2 by default for testing purpose. dc96fd1 [Yin Huai] Many updates: 85c9c4b [Yin Huai] newline. 43de3de [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF c3614d7 [Yin Huai] Handle single distinct column. 68b8ee9 [Yin Huai] Support single distinct column set. WIP 3013579 [Yin Huai] Format. d678aee [Yin Huai] Remove AggregateExpressionSuite.scala since our built-in aggregate functions will be based on AlgebraicAggregate and we need to have another way to test it. e243ca6 [Yin Huai] Add aggregation iterators. a101960 [Yin Huai] Change MyJavaUDAF to MyDoubleSum. 594cdf5 [Yin Huai] Change existing AggregateExpression to AggregateExpression1 and add an AggregateExpression as the common interface for both AggregateExpression1 and AggregateExpression2. 
380880f [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF 0a827b3 [Yin Huai] Add comments and doc. Move some classes to the right places. a19fea6 [Yin Huai] Add UDAF interface. 262d4c4 [Yin Huai] Make it compile. b2e358e [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF 6edb5ac [Yin Huai] Format update. 70b169c [Yin Huai] Remove groupOrdering. 4721936 [Yin Huai] Add CheckAggregateFunction to extendedCheckRules. d821a34 [Yin Huai] Cleanup. 32aea9c [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF 5b46d41 [Yin Huai] Bug fix. aff9534 [Yin Huai] Make Aggregate2Sort work with both algebraic AggregateFunctions and non-algebraic AggregateFunctions. 2857b55 [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF 4435f20 [Yin Huai] Add ConvertAggregateFunction to HiveContext's analyzer. 1b490ed [Michael Armbrust] make hive test 8cfa6a9 [Michael Armbrust] add test 1b0bb3f [Yin Huai] Do not bind references in AlgebraicAggregate and use code gen for all places. 072209f [Yin Huai] Bug fix: Handle expressions in grouping columns that are not attribute references. f7d9e54 [Michael Armbrust] Merge remote-tracking branch 'apache/master' into UDAF 39ee975 [Yin Huai] Code cleanup: Remove unnecesary AttributeReferences. b7720ba [Yin Huai] Add an analysis rule to convert aggregate function to the new version. 5c00f3f [Michael Armbrust] First draft of codegen 6bbc6ba [Michael Armbrust] now with correct answers! f7996d0 [Michael Armbrust] Add AlgebraicAggregate dded1c5 [Yin Huai] wip
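The SPARK-4366 message above mentions partial aggregation for regular aggregate functions. As a rough illustration of that idea, here is a plain-Python sketch in which each partition is reduced to a small buffer and the buffers are merged afterwards. It does not reproduce the Scala `AggregateFunction2` or `UserDefinedAggregateFunction` interfaces: `DoubleAvgSketch`, `partial_aggregate`, `final_aggregate`, and the method names are assumptions made for this example.

```python
class DoubleAvgSketch:
    """Hypothetical average aggregate whose buffer is a (sum, count) pair."""

    def initialize(self):
        return (0.0, 0)

    def update(self, buffer, value):
        # Skip missing values; otherwise fold one input value into the buffer.
        if value is None:
            return buffer
        return (buffer[0] + value, buffer[1] + 1)

    def merge(self, left, right):
        # Combine two partial buffers produced on different partitions.
        return (left[0] + right[0], left[1] + right[1])

    def evaluate(self, buffer):
        total, count = buffer
        return total / count if count else None


def partial_aggregate(agg, partition):
    """Reduce one partition to a single buffer (the partial step)."""
    buffer = agg.initialize()
    for value in partition:
        buffer = agg.update(buffer, value)
    return buffer


def final_aggregate(agg, partial_buffers):
    """Merge the per-partition buffers and produce the final value."""
    buffer = agg.initialize()
    for partial in partial_buffers:
        buffer = agg.merge(buffer, partial)
    return agg.evaluate(buffer)
```

A usage example: `final_aggregate(agg, [partial_aggregate(agg, p) for p in partitions])` gives the same average as a single pass over all values, while only the small `(sum, count)` buffers need to move between the two steps.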
* [SPARK-9232] [SQL] Duplicate code in JSONRelation Andrew Or 2015-07-21 1 -29/+21
| | | | | | | | Author: Andrew Or <andrew@databricks.com> Closes #7576 from andrewor14/clean-up-json-relation and squashes the following commits: ea80803 [Andrew Or] Clean up duplicate code
* [SPARK-9121] [SPARKR] Get rid of the warnings about `no visible global ↵ Yu ISHIKAWA 2015-07-21 1 -3/+9
| | | | | | | | | | | | | | | | | | function definition` in SparkR [[SPARK-9121] Get rid of the warnings about `no visible global function definition` in SparkR - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-9121) ## The Result of `dev/lint-r` [The result of lint-r for SPARK-9121 at the revision:1ddd0f2f1688560f88470e312b72af04364e2d49 when I have sent a PR](https://gist.github.com/yu-iskw/6f55953425901725edf6) Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #7567 from yu-iskw/SPARK-9121 and squashes the following commits: c8cfd63 [Yu ISHIKAWA] Fix the typo b1f19ed [Yu ISHIKAWA] Add a validate statement for local SparkR 1a03987 [Yu ISHIKAWA] Load the `testthat` package in `dev/lint-r.R`, instead of using the full path of function. 3a5e0ab [Yu ISHIKAWA] [SPARK-9121][SparkR] Get rid of the warnings about `no visible global function definition` in SparkR
* [SPARK-9154][SQL] Rename formatString to format_string. Reynold Xin 2015-07-21 5 -42/+18
| | | | | | | | | | | | Also make format_string the canonical form, rather than printf. Author: Reynold Xin <rxin@databricks.com> Closes #7579 from rxin/format_strings and squashes the following commits: 53ee54f [Reynold Xin] Fixed unit tests. 52357e1 [Reynold Xin] Add format_string alias. b40a42a [Reynold Xin] [SPARK-9154][SQL] Rename formatString to format_string.
* [SPARK-9154] [SQL] codegen StringFormat Tarek Auel 2015-07-21 4 -11/+70
| | | | | | | | | | | | | | | | | | | | Jira: https://issues.apache.org/jira/browse/SPARK-9154 fixes bug of #7546 marmbrus I can't reopen the other PR, because I didn't close it. Can you trigger Jenkins? Author: Tarek Auel <tarek.auel@googlemail.com> Closes #7571 from tarekauel/SPARK-9154 and squashes the following commits: dcae272 [Tarek Auel] [SPARK-9154][SQL] build fix 1487602 [Tarek Auel] Merge remote-tracking branch 'upstream/master' into SPARK-9154 f512c5f [Tarek Auel] [SPARK-9154][SQL] build fix a943d3e [Tarek Auel] [SPARK-9154] implicit input cast, added tests for null, support for null primitives 10b4de8 [Tarek Auel] [SPARK-9154][SQL] codegen removed fallback trait cd8322b [Tarek Auel] [SPARK-9154][SQL] codegen string format 086caba [Tarek Auel] [SPARK-9154][SQL] codegen string format
* [SPARK-9206] [SQL] Fix HiveContext classloading for GCS connector. Dennis Huo 2015-07-21 1 -1/+1
| | | | | | | | | | | | | | | | | IsolatedClientLoader.isSharedClass includes all of com.google.*, presumably for Guava, protobuf, and/or other shared Google libraries, but needs to count com.google.cloud.* as "hive classes" when determining which ClassLoader to use. Otherwise, things like HiveContext.parquetFile will throw a ClassCastException when fs.defaultFS is set to a Google Cloud Storage (gs://) path. On StackOverflow: http://stackoverflow.com/questions/31478955 EDIT: Adding yhuai who worked on the relevant classloading isolation pieces. Author: Dennis Huo <dhuo@google.com> Closes #7549 from dennishuo/dhuo-fix-hivecontext-gcs and squashes the following commits: 1f8db07 [Dennis Huo] Fix HiveContext classloading for GCS connector.
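The package rule described in the SPARK-9206 message can be sketched as a prefix check in which the more specific prefix wins. This is a plain-Python illustration, not the Scala body of `IsolatedClientLoader.isSharedClass`; the function name `is_shared_class` is hypothetical, and only the two `com.google` prefixes named in the message are modelled (the real predicate presumably covers other packages as well).

```python
def is_shared_class(name):
    """Decide whether a class name belongs to the shared Google libraries.

    Only the com.google rule from the commit message is modelled here.
    """
    # The GCS connector lives under com.google.cloud and must not be shared,
    # so the narrower prefix is checked before the broader one.
    if name.startswith("com.google.cloud."):
        return False
    return name.startswith("com.google.")
```

Checking the narrower prefix first is the point of the sketch: a broad `com.google.` match on its own would also capture the connector classes.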
* [SPARK-8906][SQL] Move all internal data source classes into ↵ Reynold Xin 2015-07-21 32 -62/+124
| | | | | | | | | | | | | | execution.datasources. This way, the sources package contains only public facing interfaces. Author: Reynold Xin <rxin@databricks.com> Closes #7565 from rxin/move-ds and squashes the following commits: 7661aff [Reynold Xin] Mima 9d5196a [Reynold Xin] Rearranged imports. 3dd7174 [Reynold Xin] [SPARK-8906][SQL] Move all internal data source classes into execution.datasources.
* [SPARK-8357] Fix unsafe memory leak on empty inputs in GeneratedAggregate navis.ryu 2015-07-21 3 -1/+70
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch fixes a managed memory leak in GeneratedAggregate. The leak occurs when the unsafe aggregation path is used to perform grouped aggregation on an empty input; in this case, GeneratedAggregate allocates an UnsafeFixedWidthAggregationMap that is never cleaned up because `next()` is never called on the aggregate result iterator. This patch fixes this by short-circuiting on empty inputs. This patch is an updated version of #6810. Closes #6810. Author: navis.ryu <navis@apache.org> Author: Josh Rosen <joshrosen@databricks.com> Closes #7560 from JoshRosen/SPARK-8357 and squashes the following commits: 3486ce4 [Josh Rosen] Some minor cleanup c649310 [Josh Rosen] Revert SparkPlan change: 3c7db0f [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-8357 adc8239 [Josh Rosen] Back out Projection changes. c5419b3 [navis.ryu] addressed comments 143e1ef [navis.ryu] fixed format & added test for CCE case 735972f [navis.ryu] used new conf apis 1a02a55 [navis.ryu] Rolled-back test-conf cleanup & fixed possible CCE & added more tests 51178e8 [navis.ryu] addressed comments 4d326b9 [navis.ryu] fixed test fails 15c5afc [navis.ryu] added a test as suggested by JoshRosen d396589 [navis.ryu] added comments 1b07556 [navis.ryu] [SPARK-8357] [SQL] Memory leakage on unsafe aggregation path with empty input
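The leak described in the SPARK-8357 message comes from a buffer that is only released while the result iterator is being consumed. The plain-Python sketch below illustrates the short-circuit idea under simplified assumptions: `TrackedMap` is a hypothetical stand-in for a manually managed aggregation map, and `grouped_sum` is not the `GeneratedAggregate` code.

```python
from itertools import chain

_EMPTY = object()  # sentinel marking an exhausted input


class TrackedMap:
    """Hypothetical stand-in for a manually managed aggregation buffer."""
    live = 0  # number of buffers allocated and not yet freed

    def __init__(self):
        self.sums = {}
        TrackedMap.live += 1

    def free(self):
        TrackedMap.live -= 1


def grouped_sum(rows):
    """Sum values per key, allocating the buffer only for non-empty input."""
    rows = iter(rows)
    first = next(rows, _EMPTY)
    if first is _EMPTY:
        # Short-circuit: no buffer is allocated, so nothing can leak.
        return iter(())

    buffer = TrackedMap()
    for key, value in chain([first], rows):
        buffer.sums[key] = buffer.sums.get(key, 0) + value

    def results():
        try:
            yield from sorted(buffer.sums.items())
        finally:
            # The buffer is released once the results have been consumed.
            buffer.free()

    return results()
```

Without the early return, an empty input would still allocate a `TrackedMap` whose cleanup depends on a result iterator that has no rows to hand out, which is the situation the message describes.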
* Revert "[SPARK-9154] [SQL] codegen StringFormat" Michael Armbrust 2015-07-21 3 -59/+11
| | | | | | | | | | | | This reverts commit 7f072c3d5ec50c65d76bd9f28fac124fce96a89e. Revert #7546 Author: Michael Armbrust <michael@databricks.com> Closes #7570 from marmbrus/revert9154 and squashes the following commits: ed2c32a [Michael Armbrust] Revert "[SPARK-9154] [SQL] codegen StringFormat"
* [SPARK-5989] [MLLIB] Model save/load for LDA MechCoder 2015-07-21 3 -5/+274
| | | | | | | | | | | | | | Add support for saving and loading LDA models, for both the local and distributed versions. Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #6948 from MechCoder/lda_save_load and squashes the following commits: 49bcdce [MechCoder] minor style fixes cc14054 [MechCoder] minor 4587d1d [MechCoder] Minor changes c753122 [MechCoder] Load and save the model in private methods 2782326 [MechCoder] [SPARK-5989] Model save/load for LDA