...
* [SPARK-13594][SQL] remove typed operations (e.g. map, flatMap) from python DataFrame (Wenchen Fan, 2016-03-02, 6 files, -60/+22)
## What changes were proposed in this pull request?
Remove `map`, `flatMap`, `mapPartitions` from python DataFrame, to prepare for Dataset API in the future.
## How was this patch tested?
existing tests
Author: Wenchen Fan <wenchen@databricks.com>
Closes #11445 from cloud-fan/python-clean.
* [SPARK-13574] [SQL] Add benchmark to measure string dictionary decode. (Nong Li, 2016-03-02, 1 file, -18/+52)
| | | | | | | | | | ## What changes were proposed in this pull request? Also updated the other benchmarks when the default to use vectorized decode was flipped. Author: Nong Li <nong@databricks.com> Closes #11454 from nongli/benchmark.
* [SPARK-13601] call failure callbacks before writer.close() (Davies Liu, 2016-03-02, 7 files, -53/+271)
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? In order to tell OutputStream that the task has failed or not, we should call the failure callbacks BEFORE calling writer.close(). ## How was this patch tested? Added new unit tests. Author: Davies Liu <davies@databricks.com> Closes #11450 from davies/callback.
* [SPARK-13535][SQL] Fix Analysis Exceptions when Using Backticks in Transform Clause (gatorsmile, 2016-03-02, 3 files, -4/+12)
#### What changes were proposed in this pull request?
```SQL
FROM
(FROM test SELECT TRANSFORM(key, value) USING 'cat' AS (`thing1` int, thing2 string)) t
SELECT thing1 + 1
```
This query returns an analysis error, like:
```
Failed to analyze query: org.apache.spark.sql.AnalysisException: cannot resolve '`thing1`' given input columns: [`thing1`, thing2]; line 3 pos 7
'Project [unresolvedalias(('thing1 + 1), None)]
+- SubqueryAlias t
   +- ScriptTransformation [key#2,value#3], cat, [`thing1`#6,thing2#7], HiveScriptIOSchema(List(),List(),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),List((field.delim, )),List((field.delim, )),Some(org.apache.hadoop.hive.ql.exec.TextRecordReader),Some(org.apache.hadoop.hive.ql.exec.TextRecordWriter),false)
      +- SubqueryAlias test
         +- Project [_1#0 AS key#2,_2#1 AS value#3]
            +- LocalRelation [_1#0,_2#1], [[1,1],[2,2],[3,3],[4,4],[5,5]]
```
The backticks around `thing1` should be cleaned up before entering the Parser/Analyzer. This PR fixes this issue.
#### How was this patch tested?
Added a test case and modified an existing test case.
Author: gatorsmile <gatorsmile@gmail.com>
Closes #11415 from gatorsmile/scriptTransform.
* [SPARK-12817] Add BlockManager.getOrElseUpdate and remove CacheManager (Josh Rosen, 2016-03-02, 16 files, -597/+365)
| | | | | | | | | | | | | | CacheManager directly calls MemoryStore.unrollSafely() and has its own logic for handling graceful fallback to disk when cached data does not fit in memory. However, this logic also exists inside of the MemoryStore itself, so this appears to be unnecessary duplication. Thanks to the addition of block-level read/write locks in #10705, we can refactor the code to remove the CacheManager and replace it with an atomic `BlockManager.getOrElseUpdate()` method. This pull request replaces / subsumes #10748. /cc andrewor14 and nongli for review. Note that this changes the locking semantics of a couple of internal BlockManager methods (`doPut()` and `lockNewBlockForWriting`), so please pay attention to the Scaladoc changes and new test cases for those methods. Author: Josh Rosen <joshrosen@databricks.com> Closes #11436 from JoshRosen/remove-cachemanager.
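For readers unfamiliar with the pattern, here is a minimal get-or-compute sketch. It is purely illustrative: the class name and simplified signature are assumed and do not match the real `BlockManager.getOrElseUpdate()`, which additionally handles storage levels, block read/write locks and disk fallback.
```scala
import scala.collection.mutable

// Illustrative only: a simplified, synchronized get-or-compute cache keyed by block id.
class SimpleBlockCache[K, V] {
  private val store = mutable.Map.empty[K, V]

  /** Return the cached value for `key`, computing and caching it if absent. */
  def getOrElseUpdate(key: K, compute: () => V): V = synchronized {
    store.getOrElseUpdate(key, compute())
  }
}

object SimpleBlockCacheExample {
  def main(args: Array[String]): Unit = {
    val cache = new SimpleBlockCache[String, Array[Int]]
    // Computed on the first call, served from the cache on later calls.
    val block = cache.getOrElseUpdate("rdd_0_0", () => Array.tabulate(10)(identity))
    println(block.mkString(","))
  }
}
```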
* [SPARK-13609] [SQL] Support Column Pruning for MapPartitions (gatorsmile, 2016-03-02, 3 files, -2/+28)
| | | | | | | | | | | | | | | | #### What changes were proposed in this pull request? This PR is to prune unnecessary columns when the operator is `MapPartitions`. The solution is to add an extra `Project` in the child node. For the other two operators `AppendColumns` and `MapGroups`, it sounds doable. More discussions are required. The major reason is the current implementation of the `inputPlan` of `groupBy` is based on the child of `AppendColumns`. It might be a bug? Thus, will submit a separate PR. #### How was this patch tested? Added a test case in ColumnPruningSuite to verify the rule. Added another test case in DatasetSuite.scala to verify the data. Author: gatorsmile <gatorsmile@gmail.com> Closes #11460 from gatorsmile/datasetPruningNew.
* [SPARK-13515] Make FormatNumber work irrespective of locale. (lgieron, 2016-03-02, 1 file, -7/+13)
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Change in class FormatNumber to make it work irrespective of locale. ## How was this patch tested? Unit tests. Author: lgieron <lgieron@gmail.com> Closes #11396 from lgieron/SPARK-13515_Fix_Format_Number.
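Locale-independent formatting of this kind is usually achieved by pinning the format symbols to a fixed locale instead of the JVM default. A small sketch, using plain `java.text.DecimalFormat` rather than Spark's internal expression code:
```scala
import java.text.{DecimalFormat, DecimalFormatSymbols}
import java.util.Locale

object LocaleIndependentFormat {
  def main(args: Array[String]): Unit = {
    // Pin the grouping/decimal symbols to a fixed locale so the output does not
    // depend on the JVM's default locale (e.g. "," vs "." as decimal separator).
    val symbols = DecimalFormatSymbols.getInstance(Locale.US)
    val formatter = new DecimalFormat("#,###,###,##0.00", symbols)
    println(formatter.format(1234567.891)) // always "1,234,567.89"
  }
}
```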
* Fix run-tests.py typos (Wojciech Jurczyk, 2016-03-02, 1 file, -1/+1)
| | | | | | | | | | ## What changes were proposed in this pull request? The PR fixes typos in an error message in dev/run-tests.py. Author: Wojciech Jurczyk <wojciech.jurczyk@codilime.com> Closes #11467 from wjur/wjur/typos_run_tests.
* [MINOR][STREAMING] Replace deprecated `apply` with `create` in example. (Dongjoon Hyun, 2016-03-02, 1 file, -1/+1)
| | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Twitter Algebird deprecated `apply` in HyperLogLog.scala. ``` deprecated("Use toHLL", since = "0.10.0 / 2015-05") def apply[T <% Array[Byte]](t: T) = create(t) ``` This PR replace the deprecated usage `apply` with new `create` according to the upstream change. ## How was this patch tested? manual. ``` /bin/spark-submit --class org.apache.spark.examples.streaming.TwitterAlgebirdHLL examples/target/scala-2.11/spark-examples-2.0.0-SNAPSHOT-hadoop2.2.0.jar ``` Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11451 from dongjoon-hyun/replace_deprecated_hll_apply.
* [BUILD][MINOR] Fix SBT build error with network-yarn module (jerryshao, 2016-03-01, 1 file, -1/+1)
## What changes were proposed in this pull request?
```
[error] Expected ID character
[error] Not a valid command: common (similar: completions)
[error] Expected project ID
[error] Expected configuration
[error] Expected ':' (if selecting a configuration)
[error] Expected key
[error] Not a valid key: common (similar: commands)
[error] common/network-yarn/test
```
`common/network-yarn` is not a valid sbt project, we should change it to `network-yarn`.
## How was this patch tested?
Ran the unit tests locally.
CC rxin, we should either change it here, or change the sbt project name.
Author: jerryshao <sshao@hortonworks.com>
Closes #11456 from jerryshao/build-fix.
* [SPARK-13008][ML][PYTHON] Put one alg per line in pyspark.ml all lists (Joseph K. Bradley, 2016-03-01, 3 files, -15/+36)
| | | | | | | | | | | | This is to fix a long-time annoyance: Whenever we add a new algorithm to pyspark.ml, we have to add it to the ```__all__``` list at the top. Since we keep it alphabetized, it often creates a lot more changes than needed. It is also easy to add the Estimator and forget the Model. I'm going to switch it to have one algorithm per line. This also alphabetizes a few out-of-place classes in pyspark.ml.feature. No changes have been made to the moved classes. CC: thunterdb Author: Joseph K. Bradley <joseph@databricks.com> Closes #10927 from jkbradley/ml-python-all-list.
* [SPARK-13167][SQL] Include rows with null values for partition column when reading from JDBC datasources. (sureshthalamati, 2016-03-01, 2 files, -1/+45)
Rows with null values in the partition column are not included in the results, because none of the partition WHERE clauses specifies an IS NULL predicate on the partition column. This fix adds an IS NULL predicate on the partition column to the first JDBC partition's WHERE clause.
Example:
JDBCPartition(THEID < 1 or THEID is null, 0), JDBCPartition(THEID >= 1 AND THEID < 2, 1), JDBCPartition(THEID >= 2, 2)
Author: sureshthalamati <suresh.thalamati@gmail.com>
Closes #11063 from sureshthalamati/nullable_jdbc_part_col_spark-13167.
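A sketch of the kind of partitioned JDBC read this affects. The connection URL, table and column names are hypothetical; only the shape of the call matters:
```scala
import java.util.Properties
import org.apache.spark.sql.SQLContext

object JdbcPartitionedRead {
  def readPartitioned(sqlContext: SQLContext): Unit = {
    val props = new Properties()
    props.setProperty("user", "test")
    props.setProperty("password", "test")

    // Reads TEST.PEOPLE in 3 partitions split on column THEID. With this fix, the
    // first partition's predicate becomes "THEID < 1 or THEID is null", so rows
    // with a NULL THEID are no longer silently dropped.
    val df = sqlContext.read.jdbc(
      "jdbc:h2:mem:testdb",   // url (placeholder)
      "TEST.PEOPLE",          // table
      "THEID",                // partition column
      0L,                     // lowerBound
      4L,                     // upperBound
      3,                      // numPartitions
      props)
    println(df.count())
  }
}
```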
* [SPARK-13598] [SQL] remove LeftSemiJoinBNL (Davies Liu, 2016-03-01, 4 files, -95/+2)
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Broadcast left semi join without joining keys is already supported in BroadcastNestedLoopJoin, it has the same implementation as LeftSemiJoinBNL, we should remove that. ## How was this patch tested? Updated unit tests. Author: Davies Liu <davies@databricks.com> Closes #11448 from davies/remove_bnl.
* [SPARK-13548][BUILD] Move tags and unsafe modules into common (Reynold Xin, 2016-03-01, 27 files, -4/+4)
| | | | | | | | | | | | ## What changes were proposed in this pull request? This patch moves tags and unsafe modules into common directory to remove 2 top level non-user-facing directories. ## How was this patch tested? Jenkins should suffice. Author: Reynold Xin <rxin@databricks.com> Closes #11426 from rxin/SPARK-13548.
* [SPARK-13582] [SQL] defer dictionary decoding in parquet reader (Davies Liu, 2016-03-01, 15 files, -203/+221)
## What changes were proposed in this pull request?
This PR defers the resolution from a dictionary id to its value until the column is actually accessed (inside getInt/getLong), which is very useful for columns and rows that are filtered out. It's also useful for the binary type, since we no longer need to copy all the byte arrays.
This PR also changes the underlying type for small decimals that fit within an Int, in order to use getInt() to look up the value from the IntDictionary.
## How was this patch tested?
Manually tested TPCDS Q7 with scale factor 10, saw about 30% improvements (after PR #11274).
Author: Davies Liu <davies@databricks.com>
Closes #11437 from davies/decode_dict.
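A toy sketch of the deferred-decoding idea. It is illustrative only; these are not Spark's actual column vector classes:
```scala
// Keep the raw dictionary ids and resolve them only when a value is read,
// so rows that are filtered out never pay the decoding cost.
final class LazyDictionaryColumn(ids: Array[Int], dictionary: Array[Long]) {
  def getLong(row: Int): Long = dictionary(ids(row)) // decode on access
}

object LazyDictionaryColumnExample {
  def main(args: Array[String]): Unit = {
    val dict = Array(100L, 200L, 300L)
    val col = new LazyDictionaryColumn(Array(2, 0, 1, 2), dict)
    println(col.getLong(0)) // 300: looked up only now, not at load time
  }
}
```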
* Closes #11320 (Xiangrui Meng, 2016-03-01, no files changed)
Closes #10940
Closes #11302
Closes #11430
Closes #10912
* [SPARK-12811][ML] Estimator for Generalized Linear Models (GLMs) (Yanbo Liang, 2016-03-01, 4 files, -4/+1094)
Estimator for Generalized Linear Models (GLMs), which will be solved by IRLS.
cc mengxr
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #11136 from yanboliang/spark-12811.
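A minimal usage sketch of the new estimator, assuming it follows the usual spark.ml builder conventions; the class and setter names here may differ slightly from the merged API:
```scala
import org.apache.spark.ml.regression.GeneralizedLinearRegression
import org.apache.spark.sql.DataFrame

object GlmExample {
  def fitGaussian(training: DataFrame): Unit = {
    val glr = new GeneralizedLinearRegression()
      .setFamily("gaussian")  // distribution of the response
      .setLink("identity")    // link function
      .setMaxIter(10)
      .setRegParam(0.3)
    // Expects the usual "label"/"features" columns; solved with IRLS under the hood.
    val model = glr.fit(training)
    println(s"Coefficients: ${model.coefficients} Intercept: ${model.intercept}")
  }
}
```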
* [SPARK-13511] [SQL] Add wholestage codegen for limit (Liang-Chi Hsieh, 2016-03-01, 2 files, -2/+47)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | JIRA: https://issues.apache.org/jira/browse/SPARK-13511 ## What changes were proposed in this pull request? Current limit operator doesn't support wholestage codegen. This is open to add support for it. In the `doConsume` of `GlobalLimit` and `LocalLimit`, we use a count term to count the processed rows. Once the row numbers catches the limit number, we set the variable `stopEarly` of `BufferedRowIterator` newly added in this pr to `true` that indicates we want to stop processing remaining rows. Then when the wholestage codegen framework checks `shouldStop()`, it will stop the processing of the row iterator. Before this, the executed plan for a query `sqlContext.range(N).limit(100).groupBy().sum()` is: TungstenAggregate(key=[], functions=[(sum(id#5L),mode=Final,isDistinct=false)], output=[sum(id)#6L]) +- TungstenAggregate(key=[], functions=[(sum(id#5L),mode=Partial,isDistinct=false)], output=[sum#9L]) +- GlobalLimit 100 +- Exchange SinglePartition, None +- LocalLimit 100 +- Range 0, 1, 1, 524288000, [id#5L] After add wholestage codegen support: WholeStageCodegen : +- TungstenAggregate(key=[], functions=[(sum(id#40L),mode=Final,isDistinct=false)], output=[sum(id)#41L]) : +- TungstenAggregate(key=[], functions=[(sum(id#40L),mode=Partial,isDistinct=false)], output=[sum#44L]) : +- GlobalLimit 100 : +- INPUT +- Exchange SinglePartition, None +- WholeStageCodegen : +- LocalLimit 100 : +- Range 0, 1, 1, 524288000, [id#40L] ## How was this patch tested? A test is added into BenchmarkWholeStageCodegen. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #11391 from viirya/wholestage-limit.
* [SPARK-13592][WINDOWS] fix path of spark-submit2.cmd in spark-submit.cmd (Masayoshi TSUZUKI, 2016-03-01, 1 file, -1/+1)
## What changes were proposed in this pull request?
This patch fixes the problem that pyspark fails on Windows because pyspark can't find `spark-submit2.cmd`.
## How was this patch tested?
Manual tests: I ran `bin\pyspark.cmd` and checked that pyspark launches correctly after this patch is applied.
Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp>
Closes #11442 from tsudukim/feature/SPARK-13592.
* [SPARK-13550][ML] Add java example for ml.clustering.BisectingKMeans (Zheng RuiFeng, 2016-02-29, 1 file, -0/+81)
JIRA: https://issues.apache.org/jira/browse/SPARK-13550
## What changes were proposed in this pull request?
Just add a java example for ml.clustering.BisectingKMeans.
## How was this patch tested?
Manual tests were done.
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes #11428 from zhengruifeng/ml_bkm_je.
* [SPARK-13551][MLLIB] Fix wrong comment and remove meaningless lines in mllib.JavaBisectingKMeansExample (Zheng RuiFeng, 2016-02-29, 1 file, -4/+2)
JIRA: https://issues.apache.org/jira/browse/SPARK-13551
## What changes were proposed in this pull request?
Fix a wrong comment and remove meaningless lines in mllib.JavaBisectingKMeansExample.
## How was this patch tested?
manual test
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes #11429 from zhengruifeng/mllib_bkm_je.
* [SPARK-13478][YARN] Use real user when fetching delegation tokens. (Marcelo Vanzin, 2016-02-29, 3 files, -12/+41)
| | | | | | | | | | | | | | | | | The Hive client library is not smart enough to notice that the current user is a proxy user; so when using a proxy user, it fails to fetch delegation tokens from the metastore because of a missing kerberos TGT for the current user. To fix it, just run the code that fetches the delegation token as the real logged in user. Tested on a kerberos cluster both submitting normally and with a proxy user; Hive and HBase tokens are retrieved correctly in both cases. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #11358 from vanzin/SPARK-13478.
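The underlying pattern here is Hadoop's `UserGroupInformation.doAs`. A hedged sketch of running the token fetch as the real logged-in user; the `fetch` callback is a stand-in for the actual metastore call, and the real code lives in Spark's YARN/Hive integration:
```scala
import java.security.PrivilegedExceptionAction
import org.apache.hadoop.security.UserGroupInformation

object FetchTokensAsRealUser {
  def fetchDelegationToken(fetch: () => String): String = {
    // The login user is the real kerberos-authenticated user, not the proxy user.
    val realUser = UserGroupInformation.getLoginUser
    realUser.doAs(new PrivilegedExceptionAction[String] {
      override def run(): String = fetch() // e.g. talk to the Hive metastore here
    })
  }
}
```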
* [SPARK-13123][SQL] Implement whole stage codegen for sort (Sameer Agarwal, 2016-02-29, 5 files, -35/+122)
## What changes were proposed in this pull request?
This PR adds support for whole stage codegen for sort. It builds heavily on nongli's PR https://github.com/apache/spark/pull/11008 (which actually implements the feature), and adds the following changes on top:
- [x] Generated code updates peak execution memory metrics
- [x] Unit tests in `WholeStageCodegenSuite` and `SQLMetricsSuite`
## How was this patch tested?
New unit tests in `WholeStageCodegenSuite` and `SQLMetricsSuite`. Further, all existing sort tests should pass.
Author: Sameer Agarwal <sameer@databricks.com>
Author: Nong Li <nong@databricks.com>
Closes #11359 from sameeragarwal/sort-codegen.
* [SPARK-13522][CORE] Fix the exit log place for heartbeat (Shixiong Zhu, 2016-02-29, 1 file, -1/+2)
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Just fixed the log place introduced by #11401 ## How was this patch tested? unit tests. Author: Shixiong Zhu <shixiong@databricks.com> Closes #11432 from zsxwing/SPARK-13522-follow-up.
* [SPARK-13522][CORE] Executor should kill itself when it's unable to heartbeat to driver more than N times (Shixiong Zhu, 2016-02-29, 2 files, -1/+29)
## What changes were proposed in this pull request?
Sometimes the network disconnection event won't be triggered, because of other potential race conditions that we may not have thought of; the executor will then keep sending heartbeats to the driver and never exit. This PR adds a new configuration `spark.executor.heartbeat.maxFailures` to kill the Executor when it's unable to heartbeat to the driver more than `spark.executor.heartbeat.maxFailures` times.
## How was this patch tested?
unit tests
Author: Shixiong Zhu <shixiong@databricks.com>
Closes #11401 from zsxwing/SPARK-13522.
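A sketch of how the new setting would be supplied, assuming it is set like any other Spark configuration key; the threshold value below is only an example:
```scala
import org.apache.spark.{SparkConf, SparkContext}

object HeartbeatConfExample {
  def main(args: Array[String]): Unit = {
    // "spark.executor.heartbeat.maxFailures" is the key named in this change;
    // 60 consecutive failed heartbeats is an arbitrary example threshold.
    val conf = new SparkConf()
      .setAppName("heartbeat-conf-example")
      .setMaster("local[2]")
      .set("spark.executor.heartbeat.maxFailures", "60")
    val sc = new SparkContext(conf)
    try {
      println(sc.parallelize(1 to 10).sum())
    } finally {
      sc.stop()
    }
  }
}
```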
* [SPARK-13544][SQL] Rewrite/Propagate Constraints for Aliases in Aggregate (gatorsmile, 2016-02-29, 3 files, -23/+38)
| | | | | | | | | | | | | | | | #### What changes were proposed in this pull request? After analysis by Analyzer, two operators could have alias. They are `Project` and `Aggregate`. So far, we only rewrite and propagate constraints if `Alias` is defined in `Project`. This PR is to resolve this issue in `Aggregate`. #### How was this patch tested? Added a test case for `Aggregate` in `ConstraintPropagationSuite`. marmbrus sameeragarwal Author: gatorsmile <gatorsmile@gmail.com> Closes #11422 from gatorsmile/validConstraintsInUnaryNodes.
* [SPARK-13509][SPARK-13507][SQL] Support for writing CSV with a single function call (hyukjinkwon, 2016-02-29, 6 files, -10/+80)
https://issues.apache.org/jira/browse/SPARK-13507
https://issues.apache.org/jira/browse/SPARK-13509
## What changes were proposed in this pull request?
This PR adds support for writing CSV data directly by a single call to the given path. Several unit tests were added for each functionality.
## How was this patch tested?
This was tested with unit tests and with `dev/run_tests` for coding style.
Author: hyukjinkwon <gurwls223@gmail.com>
Author: Hyukjin Kwon <gurwls223@gmail.com>
Closes #11389 from HyukjinKwon/SPARK-13507-13509.
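A sketch of the single-call write this enables. The path is a placeholder, and the `csv` shortcut name is taken from the PR description rather than verified against the merged API:
```scala
import org.apache.spark.sql.{DataFrame, SQLContext}

object CsvWriteExample {
  def writeAndReadBack(sqlContext: SQLContext, df: DataFrame): DataFrame = {
    df.write.csv("/tmp/people_csv")                        // write CSV directly to the given path
    sqlContext.read.format("csv").load("/tmp/people_csv")  // read it back through the csv data source
  }
}
```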
* [SPARK-13540][SQL] Supports using nested classes within Scala objects as Dataset element type (Cheng Lian, 2016-03-01, 2 files, -1/+19)
## What changes were proposed in this pull request?
Nested classes defined within Scala objects are translated into Java static nested classes. Unlike inner classes, they don't need outer scopes. But the analyzer still thinks that an outer scope is required. This PR fixes this issue simply by checking whether a nested class is static before looking up its outer scope.
## How was this patch tested?
A test case is added to `DatasetSuite`. It checks contents of a Dataset whose element type is a nested class declared in a Scala object.
Author: Cheng Lian <lian@databricks.com>
Closes #11421 from liancheng/spark-13540-object-as-outer-scope.
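A small example of the case this fixes, assuming the usual `toDS()` implicits; the `Outer`/`Inner` names are made up for illustration:
```scala
import org.apache.spark.sql.SQLContext

object Outer {
  // A class nested inside a Scala object compiles to a *static* nested class,
  // so no outer instance is needed to construct it.
  case class Inner(id: Int, name: String)
}

object NestedClassDataset {
  def demo(sqlContext: SQLContext): Unit = {
    import sqlContext.implicits._
    // After this fix, the encoder no longer demands an outer scope for Outer.Inner.
    val ds = Seq(Outer.Inner(1, "a"), Outer.Inner(2, "b")).toDS()
    ds.show()
  }
}
```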
* [SPARK-13506][MLLIB] Fix the wrong parameter in R code comment in AssociationRulesSuite (Zheng RuiFeng, 2016-02-29, 1 file, -1/+2)
JIRA: https://issues.apache.org/jira/browse/SPARK-13506
## What changes were proposed in this pull request?
Just change the R snippet comment in AssociationRulesSuite.
## How was this patch tested?
unit tests passed
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes #11387 from zhengruifeng/ars.
* [SPARK-13481] Desc order of appID by default for history server page. (zhuol, 2016-02-29, 1 file, -1/+2)
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Now by default, it shows as ascending order of appId. We might prefer to display as descending order by default, which will show the latest application at the top. ## How was this patch tested? Manual tested. See screenshot below: ![desc-sort](https://cloud.githubusercontent.com/assets/11683054/13307473/102f4cf8-db31-11e5-8dd5-391edbf32f0d.png) Author: zhuol <zhuol@yahoo-inc.com> Closes #11357 from zhuoliu/13481.
* [SPARK-12633][PYSPARK] [DOC] PySpark regression parameter desc to consistent format (vijaykiran, 2016-02-29, 2 files, -164/+166)
Part of the task for [SPARK-11219](https://issues.apache.org/jira/browse/SPARK-11219) to make PySpark MLlib parameter description formatting consistent. This is for the regression module. Also, updated 2 params in classification to read as `Supported values:` to be consistent.
closes #10600
Author: vijaykiran <mail@vijaykiran.com>
Author: Bryan Cutler <cutlerb@gmail.com>
Closes #11404 from BryanCutler/param-desc-consistent-regression-SPARK-12633.
* [SPARK-12994][CORE] It is not necessary to create ExecutorAllocationManager in local mode (Jeff Zhang, 2016-02-29, 3 files, -7/+21)
Author: Jeff Zhang <zjffdu@apache.org>
Closes #10914 from zjffdu/SPARK-12994.
* [SPARK-13545][MLLIB][PYSPARK] Make MLlib LogisticRegressionWithLBFGS's default parameters consistent in Scala and Python (Yanbo Liang, 2016-02-29, 2 files, -3/+9)
## What changes were proposed in this pull request?
* The default value of `regParam` of PySpark MLlib `LogisticRegressionWithLBFGS` should be consistent with Scala, which is `0.0`. (This is also consistent with ML `LogisticRegression`.)
* BTW, if we use a known updater (L1 or L2) for binary classification, `LogisticRegressionWithLBFGS` will call the ML implementation. We should update the API doc to clarify that `numCorrections` will have no effect if we fall into that route.
* Made a pass over all parameters of `LogisticRegressionWithLBFGS`; the others are set properly.
cc mengxr dbtsai
## How was this patch tested?
No new tests, it should pass all current tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #11424 from yanboliang/spark-13545.
* [SPARK-13309][SQL] Fix type inference issue with CSV data (Rahul Tanwani, 2016-02-28, 4 files, -10/+32)
| | | | | | | | Fix type inference issue for sparse CSV data - https://issues.apache.org/jira/browse/SPARK-13309 Author: Rahul Tanwani <rahul@Rahuls-MacBook-Pro.local> Closes #11194 from tanwanirahul/master.
* [SPARK-13537][SQL] Fix readBytes in VectorizedPlainValuesReader (Liang-Chi Hsieh, 2016-02-28, 2 files, -1/+34)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | JIRA: https://issues.apache.org/jira/browse/SPARK-13537 ## What changes were proposed in this pull request? In readBytes of VectorizedPlainValuesReader, we use buffer[offset] to access bytes in buffer. It is incorrect because offset is added with Platform.BYTE_ARRAY_OFFSET when initialization. We should fix it. ## How was this patch tested? `ParquetHadoopFsRelationSuite` sometimes (depending on the randomly generated data) will be [failed](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52136/consoleFull) by this bug. After applying this, the test can be passed. I added a test to `ParquetHadoopFsRelationSuite` with the data which will fail without this patch. The error exception: [info] ParquetHadoopFsRelationSuite: [info] - test all data types - StringType (440 milliseconds) [info] - test all data types - BinaryType (434 milliseconds) [info] - test all data types - BooleanType (406 milliseconds) 20:59:38.618 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 2597.0 (TID 67966) java.lang.ArrayIndexOutOfBoundsException: 46 at org.apache.spark.sql.execution.datasources.parquet.VectorizedPlainValuesReader.readBytes(VectorizedPlainValuesReader.java:88) Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #11418 from viirya/fix-readbytes.
* [SPARK-13529][BUILD] Move network/* modules into common/network-* (Reynold Xin, 2016-02-28, 115 files, -8/+9)
| | | | | | | | | | | | ## What changes were proposed in this pull request? As the title says, this moves the three modules currently in network/ into common/network-*. This removes one top level, non-user-facing folder. ## How was this patch tested? Compilation and existing tests. We should run both SBT and Maven. Author: Reynold Xin <rxin@databricks.com> Closes #11409 from rxin/SPARK-13529.
* [SPARK-13526][SQL] Move SQLContext per-session states to new class (Andrew Or, 2016-02-27, 10 files, -164/+294)
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? This creates a `SessionState`, which groups a few fields that existed in `SQLContext`. Because `HiveContext` extends `SQLContext` we also need to make changes there. This is mainly a cleanup task that will soon pave the way for merging the two contexts. ## How was this patch tested? Existing unit tests; this patch introduces no change in behavior. Author: Andrew Or <andrew@databricks.com> Closes #11405 from andrewor14/refactor-session.
* Closes #11413 (Reynold Xin, 2016-02-27, no files changed)
* [SPARK-13533][SQL] Fix readBytes in VectorizedPlainValuesReader (Nong Li, 2016-02-27, 1 file, -1/+1)
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Fix readBytes in VectorizedPlainValuesReader. This fixes a copy and paste issue. ## How was this patch tested? Ran ParquetHadoopFsRelationSuite which failed before this. Author: Nong Li <nong@databricks.com> Closes #11414 from nongli/spark-13533.
* [SPARK-13530][SQL] Add ShortType support to UnsafeRowParquetRecordReader (Liang-Chi Hsieh, 2016-02-27, 2 files, -1/+36)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | JIRA: https://issues.apache.org/jira/browse/SPARK-13530 ## What changes were proposed in this pull request? By enabling vectorized parquet scanner by default, the unit test `ParquetHadoopFsRelationSuite` based on `HadoopFsRelationTest` will be failed due to the lack of short type support in `UnsafeRowParquetRecordReader`. We should fix it. The error exception: [info] ParquetHadoopFsRelationSuite: [info] - test all data types - StringType (499 milliseconds) [info] - test all data types - BinaryType (447 milliseconds) [info] - test all data types - BooleanType (520 milliseconds) [info] - test all data types - ByteType (418 milliseconds) 00:22:58.920 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 124.0 (TID 1949) org.apache.commons.lang.NotImplementedException: Unimplemented type: ShortType at org.apache.spark.sql.execution.datasources.parquet.UnsafeRowParquetRecordReader$ColumnReader.readIntBatch(UnsafeRowParquetRecordReader.java:769) at org.apache.spark.sql.execution.datasources.parquet.UnsafeRowParquetRecordReader$ColumnReader.readBatch(UnsafeRowParquetRecordReader.java:640) at org.apache.spark.sql.execution.datasources.parquet.UnsafeRowParquetRecordReader$ColumnReader.access$000(UnsafeRowParquetRecordReader.java:461) at org.apache.spark.sql.execution.datasources.parquet.UnsafeRowParquetRecordReader.nextBatch(UnsafeRowParquetRecordReader.java:224) ## How was this patch tested? The unit test `ParquetHadoopFsRelationSuite` based on `HadoopFsRelationTest` will be [failed](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52110/consoleFull) due to the lack of short type support in UnsafeRowParquetRecordReader. By adding this support, the test can be passed. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #11412 from viirya/add-shorttype-support.
* [SPARK-7483][MLLIB] Upgrade Chill to 0.7.2 to support Kryo with FPGrowth (mark800, 2016-02-27, 6 files, -11/+11)
| | | | | | | | | | It registers more Scala classes, including ListBuffer to support Kryo with FPGrowth. See https://github.com/twitter/chill/releases for Chill's change log. Author: mark800 <yky800@126.com> Closes #11041 from mark800/master.
* [SPARK-13518][SQL] Enable vectorized parquet scanner by default (Nong Li, 2016-02-26, 1 file, -4/+1)
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Change the default of the flag to enable this feature now that the implementation is complete. ## How was this patch tested? The new parquet reader should be a drop in, so will be exercised by the existing tests. Author: Nong Li <nong@databricks.com> Closes #11397 from nongli/spark-13518.
* [SPARK-13521][BUILD] Remove reference to Tachyon in cluster & release scripts (Reynold Xin, 2016-02-26, 9 files, -160/+6)
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? We provide a very limited set of cluster management script in Spark for Tachyon, although Tachyon itself provides a much better version of it. Given now Spark users can simply use Tachyon as a normal file system and does not require extensive configurations, we can remove this management capabilities to simplify Spark bash scripts. Note that this also reduces coupling between a 3rd party external system and Spark's release scripts, and would eliminate possibility for failures such as Tachyon being renamed or the tar balls being relocated. ## How was this patch tested? N/A Author: Reynold Xin <rxin@databricks.com> Closes #11400 from rxin/release-script.
* [SPARK-13474][PROJECT INFRA] Update packaging scripts to push artifacts to home.apache.org (Josh Rosen, 2016-02-26, 1 file, -16/+44)
Due to the people.apache.org -> home.apache.org migration, we need to update our packaging scripts to publish artifacts to the new server. Because the new server only supports sftp instead of ssh, we need to update the scripts to use lftp instead of ssh + rsync.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #11350 from JoshRosen/update-release-scripts-for-apache-home.
* [SPARK-13519][CORE] Driver should tell Executor to stop itself when cleaning executor's state (Shixiong Zhu, 2016-02-26, 1 file, -0/+4)
## What changes were proposed in this pull request?
When the driver removes an executor's state, the connection between the driver and the executor may still be alive, so the executor cannot exit automatically (e.g., Master will send RemoveExecutor when a worker is lost but the executor is still alive), so the driver should try to tell the executor to stop itself. Otherwise, we will leak an executor. This PR modifies the driver to send `StopExecutor` to the executor when it's removed.
## How was this patch tested?
Manual test: increased the worker heartbeat interval to force it to always time out, and the leaked executors are gone.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes #11399 from zsxwing/SPARK-13519.
* [SPARK-13505][ML] add python api for MaxAbsScaler (zlpmichelle, 2016-02-26, 1 file, -7/+68)
| | | | | | | | | | | | ## What changes were proposed in this pull request? After SPARK-13028, we should add Python API for MaxAbsScaler. ## How was this patch tested? unit test Author: zlpmichelle <zlpmichelle@gmail.com> Closes #11393 from zlpmichelle/master.
* [SPARK-13465] Add a task failure listener to TaskContext (Reynold Xin, 2016-02-26, 9 files, -85/+169)
## What changes were proposed in this pull request?
TaskContext supports a task completion callback, which gets called regardless of task failures. However, there is no way for the listener to know if there is an error. This patch adds a new listener that gets called when a task fails.
## How was this patch tested?
New unit test case and integration test case covering the code path.
Author: Reynold Xin <rxin@databricks.com>
Closes #11340 from rxin/SPARK-13465.
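A sketch of how the new failure callback could be used next to the existing completion callback. Method names follow the description above; exact signatures in the merged API may differ:
```scala
import org.apache.spark.{SparkContext, TaskContext}

object TaskListenerExample {
  def run(sc: SparkContext): Unit = {
    sc.parallelize(1 to 100, 4).foreachPartition { iter =>
      val ctx = TaskContext.get()
      // Completion listener fires on success and failure alike.
      ctx.addTaskCompletionListener { _ => println("task finished (success or failure)") }
      // Failure listener fires only when the task fails, and receives the error.
      ctx.addTaskFailureListener { (_, error) => println(s"task failed: ${error.getMessage}") }
      iter.foreach(_ => ())
    }
  }
}
```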
* [SPARK-13499] [SQL] Performance improvements for parquet reader. (Nong Li, 2016-02-26, 3 files, -54/+59)
## What changes were proposed in this pull request?
This patch includes these performance fixes:
- Remove unnecessary setNotNull() calls. The NULL bits are cleared already.
- Speed up RLE group decoding
- Speed up dictionary decoding by decoding NULLs directly into the result.
## How was this patch tested?
In addition to the updated benchmarks, on TPCDS, the result of these changes running Q55 (sf40) is:
```
TPCDS:                     Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)
-----------------------------------------------------------------------
q55 (Before)                     6398 / 6616         18.0          55.5
q55 (After)                      4983 / 5189         23.1          43.3
```
Author: Nong Li <nong@databricks.com>
Closes #11375 from nongli/spark-13499.
* [SPARK-12313] [SQL] improve performance of BroadcastNestedLoopJoin (Davies Liu, 2016-02-26, 7 files, -91/+295)
## What changes were proposed in this pull request?
Currently, BroadcastNestedLoopJoin is implemented for the worst case; it's too slow and can very easily hang forever. This PR creates a fast path for some joinType/buildSide combinations, and also improves the worst case (which will use much less memory than before).
Before this PR, one task requires O(N*K) + O(K) in worst cases, where N is the number of rows from one partition of the streamed table; it could hang the job (because of GC). In order to work around this for InnerJoin, we have to disable auto-broadcast and switch to CartesianProduct; see https://forums.databricks.com/questions/6747/how-do-i-get-a-cartesian-product-of-a-huge-dataset.html
In this PR, we will have fast paths for these joins:
- InnerJoin with BuildLeft or BuildRight
- LeftOuterJoin with BuildRight
- RightOuterJoin with BuildLeft
- LeftSemi with BuildRight
These fast paths are all stream based (take one pass over the streamed table) and require O(1) memory. All other join types and build types take two passes over the streamed table: one pass to find the matched rows that include the streamed part, which requires O(1) memory, and another pass to find the rows from the build table that do not have a matched row from the streamed table, which requires O(K) memory, where K is the number of rows from the build side (one bit per row), which should be much smaller than the memory needed for the broadcast. The following join types work in this way:
- LeftOuterJoin with BuildLeft
- RightOuterJoin with BuildRight
- FullOuterJoin with BuildLeft or BuildRight
- LeftSemi with BuildLeft
This PR also adds tests for all the join types for BroadcastNestedLoopJoin. After this PR, for InnerJoin with one small table, BroadcastNestedLoopJoin should be faster than CartesianProduct, so we don't need that workaround anymore.
## How was this patch tested?
Added unit tests.
Author: Davies Liu <davies@databricks.com>
Closes #11328 from davies/nested_loop.
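An example of the kind of query that ends up in BroadcastNestedLoopJoin, i.e. a join with no equality keys; the data and column names are made up:
```scala
import org.apache.spark.sql.SQLContext

object NonEquiJoinExample {
  def run(sqlContext: SQLContext): Unit = {
    import sqlContext.implicits._
    val events = Seq((1, 10), (2, 25), (3, 40)).toDF("id", "value")
    val ranges = Seq((0, 20, "low"), (20, 50, "high")).toDF("lo", "hi", "bucket")
    // A range condition has no equality keys, so it cannot use a hash join and
    // falls back to a nested-loop join; with a small build side it is broadcast,
    // which is the path optimized by this change.
    val joined = events.join(ranges, $"value" >= $"lo" && $"value" < $"hi", "left_outer")
    joined.show()
  }
}
```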
* [MINOR][SQL] Fix modifier order. (Dongjoon Hyun, 2016-02-26, 1 file, -1/+1)
| | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR fixes the order of modifier from `abstract public` into `public abstract`. Currently, when we run `./dev/lint-java`, it shows the error. ``` Checkstyle checks failed at following occurrences: [ERROR] src/main/java/org/apache/spark/util/sketch/CountMinSketch.java:[53,10] (modifier) ModifierOrder: 'public' modifier out of order with the JLS suggestions. ``` ## How was this patch tested? ``` $ ./dev/lint-java Checkstyle checks passed. ``` Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11390 from dongjoon-hyun/fix_modifier_order.