* [SPARK-14851][CORE] Support radix sort with nullable longs (Eric Liang, 2016-06-11, 15 files changed, -79/+178)

## What changes were proposed in this pull request?

This adds support for radix sort of nullable long fields. When a sort field is null and radix sort is enabled, we keep nulls in a separate region of the sort buffer so that radix sort does not need to deal with them. This also has performance benefits when sorting smaller integer types, since the current representation of nulls in two's complement (Long.MIN_VALUE) otherwise forces a full-width radix sort.

This strategy for nulls does mean the sort is no longer stable. cc davies

## How was this patch tested?

Existing randomized sort tests for correctness. I also tested some TPCDS queries and there does not seem to be any significant regression for non-null sorts. Some test queries (best of 5 runs each).

Before change:

```
scala> val start = System.nanoTime; spark.range(5000000).selectExpr("if(id > 5, cast(hash(id) as long), NULL) as h").coalesce(1).orderBy("h").collect(); (System.nanoTime - start) / 1e6
start: Long = 3190437233227987
res3: Double = 4716.471091
```

After change:

```
scala> val start = System.nanoTime; spark.range(5000000).selectExpr("if(id > 5, cast(hash(id) as long), NULL) as h").coalesce(1).orderBy("h").collect(); (System.nanoTime - start) / 1e6
start: Long = 3190367870952791
res4: Double = 2981.143045
```

Author: Eric Liang <ekl@databricks.com>

Closes #13161 from ericl/sc-2998.
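A self-contained sketch of the null-region idea (illustrative only, not Spark's actual sorter; `java.util.Arrays.sort` stands in for the radix sort):

```scala
object NullAwareSortSketch {
  def sortNullsFirst(values: Array[java.lang.Long]): Unit = {
    // swap each null into a dedicated region at the front of the buffer
    var boundary = 0
    for (i <- values.indices) {
      if (values(i) == null) {
        val tmp = values(boundary)
        values(boundary) = values(i)
        values(i) = tmp
        boundary += 1
      }
    }
    // the radix sort would run here over values(boundary until values.length);
    // java.util.Arrays.sort stands in for it in this sketch
    java.util.Arrays.sort(values, boundary, values.length)
  }

  def main(args: Array[String]): Unit = {
    val data = Array[java.lang.Long](3L, null, 1L, null, 2L)
    sortNullsFirst(data)
    println(data.mkString(", "))  // prints: null, null, 1, 2, 3
  }
}
```

Note that the swap-based partition reorders equal elements, which matches the commit's observation that the sort is no longer stable.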
* [SPARK-15856][SQL] Revert API breaking changes made in SQLContext.range (Wenchen Fan, 2016-06-11, 1 file changed, -18/+18)

## What changes were proposed in this pull request?

It's easy for users to call `range(...).as[Long]` to get a typed Dataset, so this isn't worth an API-breaking change. This PR reverts it.

## How was this patch tested?

N/A

Author: Wenchen Fan <wenchen@databricks.com>

Closes #13605 from cloud-fan/range.
* [SPARK-15881] Update microbenchmark results for WideSchemaBenchmark (Eric Liang, 2016-06-11, 4 files changed, -234/+123)

## What changes were proposed in this pull request?

These were not updated after performance improvements. To make updating them easier, I also moved the results from inline comments out into a file, which is auto-generated when the benchmark is re-run.

Author: Eric Liang <ekl@databricks.com>

Closes #13607 from ericl/sc-3538.
* [SPARK-15585][SQL] Add doc for turning off quotations (Takeshi YAMAMURO, 2016-06-11, 3 files changed, -3/+17)

## What changes were proposed in this pull request?

This PR adds documentation for turning off quotations, because this behavior is different from `com.databricks.spark.csv`.

## How was this patch tested?

Checked the behavior when putting an empty string in the CSV options.

Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>

Closes #13616 from maropu/SPARK-15585-2.
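A usage sketch of the behavior being documented (assuming the standard `quote` CSV option; the input path is hypothetical):

```scala
// quotation handling is turned off by passing an empty string as the `quote` option
val df = spark.read
  .format("csv")
  .option("quote", "")  // empty string disables quote processing
  .load("/tmp/people.csv")
```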
* [SPARK-15883][MLLIB][DOCS] Fix broken links in mllib documents (Dongjoon Hyun, 2016-06-11, 8 files changed, -29/+25)

## What changes were proposed in this pull request?

This issue fixes all broken links in the Spark 2.0 preview MLlib documents. It also contains some editorial changes.

**Fix broken links**
* mllib-data-types.md
* mllib-decision-tree.md
* mllib-ensembles.md
* mllib-feature-extraction.md
* mllib-pmml-model-export.md
* mllib-statistics.md

**Fix malformed section header and Scala coding style**
* mllib-linear-methods.md

**Replace indirect forward links with direct ones**
* ml-classification-regression.md

## How was this patch tested?

Manual tests (with `cd docs; jekyll build`).

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13608 from dongjoon-hyun/SPARK-15883.
* [SPARK-15879][DOCS][UI] Update logo in UI and docs to add "Apache" (Sean Owen, 2016-06-11, 7 files changed, -0/+0)

## What changes were proposed in this pull request?

Use the new Spark logo including "Apache" (now, with crushed PNGs). Remove old unreferenced logo files.

## How was this patch tested?

Manual check of the generated HTML site and the Spark UI. I searched for references to the deleted files to make sure they were not used.

Author: Sean Owen <sowen@cloudera.com>

Closes #13609 from srowen/SPARK-15879.
* [SPARK-15759] [SQL] Fallback to non-codegen when fail to compile generated code (Davies Liu, 2016-06-10, 3 files changed, -3/+24)

## What changes were proposed in this pull request?

In case a bug in whole-stage codegen means the generated code can't be compiled, we should fall back to the non-codegen path to make sure the query can still run. The batch mode of the new Parquet reader depends on codegen and can't easily be switched to non-batch mode, so we still use codegen for batched scans (for Parquet); because it only supports primitive types and the number of columns is less than spark.sql.codegen.maxFields (100), it should not fail. The fallback is configurable via `spark.sql.codegen.fallback`.

## How was this patch tested?

Manually tested with a buggy operator; it worked well.

Author: Davies Liu <davies@databricks.com>

Closes #13501 from davies/codegen_fallback.
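The fallback pattern can be sketched in a few lines (illustrative only, not Spark's actual code): try the compiled path first, and take the interpreted path if compilation throws.

```scala
def executeWithFallback[T](compiledRun: () => T, interpretedRun: () => T,
                           fallbackEnabled: Boolean): T = {
  try {
    compiledRun()
  } catch {
    case e: Exception if fallbackEnabled =>
      // in Spark this would be logged; here we just take the non-codegen path
      interpretedRun()
  }
}

// usage sketch: the compiled path fails, so the interpreted path answers
val result = executeWithFallback(
  () => throw new RuntimeException("generated code failed to compile"),
  () => 42,
  fallbackEnabled = true)
```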
* [SPARK-15678] Add support to REFRESH data source paths (Sameer Agarwal, 2016-06-10, 8 files changed, -2/+158)

## What changes were proposed in this pull request?

Spark currently incorrectly continues to use cached data even if the underlying data is overwritten.

Current behavior:

```scala
val dir = "/tmp/test"
sqlContext.range(1000).write.mode("overwrite").parquet(dir)
val df = sqlContext.read.parquet(dir).cache()
df.count() // outputs 1000

sqlContext.range(10).write.mode("overwrite").parquet(dir)
sqlContext.read.parquet(dir).count() // outputs 1000 <---- We are still using the cached dataset
```

This patch fixes this bug by adding support for `REFRESH path` that invalidates and refreshes all the cached data (and the associated metadata) for any dataframe that contains the given data source path.

Expected behavior:

```scala
val dir = "/tmp/test"
sqlContext.range(1000).write.mode("overwrite").parquet(dir)
val df = sqlContext.read.parquet(dir).cache()
df.count() // outputs 1000

sqlContext.range(10).write.mode("overwrite").parquet(dir)
spark.catalog.refreshResource(dir)
sqlContext.read.parquet(dir).count() // outputs 10 <---- We are not using the cached dataset
```

## How was this patch tested?

Unit tests for overwrites and appends in `ParquetQuerySuite` and `CachedTableSuite`.

Author: Sameer Agarwal <sameer@databricks.com>

Closes #13566 from sameeragarwal/refresh-path-2.
* Revert "[SPARK-15639][SQL] Try to push down filter at RowGroups level for ↵Cheng Lian2016-06-103-21/+57
| | | | | | parquet reader" This reverts commit bba5d7999f7b3ae9d816ea552ba9378fea1615a6.
* [SPARK-14615][ML][FOLLOWUP] Fix Python examples to use the new ML Vector and Matrix APIs in the ML pipeline based algorithms (hyukjinkwon, 2016-06-10, 10 files changed, -19/+18)

## What changes were proposed in this pull request?

This PR fixes the Python examples to use the new ML Vector and Matrix APIs in the ML pipeline based algorithms. I first ran the shell command `grep -r "from pyspark.mllib" .` and then executed all of the matching examples. Some of the tests in `ml` produced error messages such as:

```
pyspark.sql.utils.IllegalArgumentException: u'requirement failed: Input type must be VectorUDT but got org.apache.spark.mllib.linalg.VectorUDT@f71b0bce.'
```

So, I fixed them to use the new APIs, identically to some of the Python tests fixed in https://github.com/apache/spark/pull/12627.

## How was this patch tested?

Manually tested all the examples listed by `grep -r "from pyspark.mllib" .`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #13393 from HyukjinKwon/SPARK-14615.
* [SPARK-15639][SQL] Try to push down filter at RowGroups level for parquet reader (Liang-Chi Hsieh, 2016-06-10, 3 files changed, -57/+21)

## What changes were proposed in this pull request?

The base class `SpecificParquetRecordReaderBase`, used by the vectorized parquet reader, tries to get the pushed-down filters from the given configuration. These pushed-down filters are used for RowGroups-level filtering. However, we never set the filters to push down into the configuration; in other words, the filters are not actually pushed down to do RowGroups-level filtering. This patch fixes that by setting up the filters for pushdown in the configuration for the reader.

## How was this patch tested?

Existing tests should pass.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #13371 from viirya/vectorized-reader-push-down-filter.
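For context, placing a pushed-down filter into the Hadoop configuration looks roughly like the following (a sketch using the public parquet-hadoop filter API; the column name `id` is hypothetical, and Spark's internal code differs):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.parquet.filter2.predicate.FilterApi
import org.apache.parquet.hadoop.ParquetInputFormat

val conf = new Configuration()
// build a predicate such as `id > 5` and place it in the configuration,
// where RowGroup-level filtering can pick it up
val predicate = FilterApi.gt(FilterApi.longColumn("id"), java.lang.Long.valueOf(5L))
ParquetInputFormat.setFilterPredicate(conf, predicate)
```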
* [SPARK-15884][SPARKR][SQL] Overriding stringArgs in MapPartitionsInR (Narine Kokhlikyan, 2016-06-10, 1 file changed, -0/+3)

## What changes were proposed in this pull request?

As discussed in https://github.com/apache/spark/pull/12836, we need to override the stringArgs method in MapPartitionsInR in order to avoid the overly large strings that "stringArgs" would otherwise generate from the input arguments. In this case we exclude some of the input arguments: the serialized R objects.

## How was this patch tested?

Existing test cases.

Author: Narine Kokhlikyan <narine.kokhlikyan@gmail.com>

Closes #13610 from NarineK/dapply_MapPartitionsInR_stringArgs.
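The Catalyst convention this relies on can be shown with a self-contained stand-in (not Spark's actual TreeNode or MapPartitionsInR classes): `stringArgs` feeds a node's string rendering, so excluding bulky fields keeps plan strings small.

```scala
abstract class Node {
  protected def stringArgs: Iterator[Any]
  override def toString: String =
    getClass.getSimpleName + stringArgs.mkString("(", ", ", ")")
}

case class MapPartitionsInRLike(serializedRFunc: Array[Byte], schemaName: String)
    extends Node {
  // exclude the (potentially huge) serialized R object from the rendering
  override protected def stringArgs: Iterator[Any] = Iterator(schemaName)
}

// MapPartitionsInRLike(Array.fill[Byte](1 << 20)(0), "schema").toString
// renders as "MapPartitionsInRLike(schema)" regardless of the payload size
```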
* [SPARK-15773][CORE][EXAMPLE] Avoid creating local variable `sc` in examples if possible (Dongjoon Hyun, 2016-06-10, 12 files changed, -52/+30)

## What changes were proposed in this pull request?

Instead of using a local variable `sc` as in the following example, this PR uses `spark.sparkContext`. This makes the examples more concise, and also fixes a misleading pattern, i.e., creating a SparkContext from a SparkSession.

```
-    println("Creating SparkContext")
-    val sc = spark.sparkContext
-
     println("Writing local file to DFS")
     val dfsFilename = dfsDirPath + "/dfs_read_write_test"
-    val fileRDD = sc.parallelize(fileContents)
+    val fileRDD = spark.sparkContext.parallelize(fileContents)
```

This changes 12 files (+30 lines, -52 lines).

## How was this patch tested?

Manual.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13520 from dongjoon-hyun/SPARK-15773.
* [SPARK-15489][SQL] Dataset kryo encoder won't load custom user settings (Sela, 2016-06-10, 2 files changed, -9/+89)

## What changes were proposed in this pull request?

Serializer instantiation now considers the existing SparkConf.

## How was this patch tested?

Manual test with `ImmutableList` (Guava) and the `Immutable*Serializer` implementations from `kryo-serializers`. Added a test suite.

Author: Sela <ansela@paypal.com>

Closes #13424 from amitsela/SPARK-15489.
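A usage sketch of what the fix enables: custom Kryo settings placed in the SparkConf are honored when a Kryo-backed Dataset encoder instantiates its serializer. The registrator below is a minimal assumed example, not code from the patch; a real one would register Guava's `ImmutableList` with the matching serializer from `kryo-serializers`.

```scala
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator
import org.apache.spark.sql.{Encoders, SparkSession}

class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[java.util.ArrayList[_]])
  }
}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", classOf[MyRegistrator].getName)
val spark = SparkSession.builder().master("local").config(conf).getOrCreate()

// with the fix, the encoder's Kryo instance sees the registrations above
val enc = Encoders.kryo[java.util.ArrayList[String]]
```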
* [SPARK-15654] [SQL] fix non-splitable files for text based file formats (Davies Liu, 2016-06-10, 11 files changed, -13/+115)

## What changes were proposed in this pull request?

Currently we always split a file when it's bigger than maxSplitBytes, but Hadoop's LineRecordReader does not respect the splits for compressed files correctly, so we should have an API on FileFormat to check whether the file can be split or not. This PR is based on #13442, closes #13442.

## How was this patch tested?

Added regression tests.

Author: Davies Liu <davies@databricks.com>

Closes #13531 from davies/fix_split.
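The kind of check such an API performs for text-based formats can be sketched as follows (the signature here is an assumption, not Spark's exact `FileFormat` API): a file is splittable if it is uncompressed, or compressed with a splittable codec such as bzip2.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.compress.{CompressionCodecFactory, SplittableCompressionCodec}

// a file is splittable when no codec matches its suffix (uncompressed),
// or when its codec supports splitting
def isSplitable(conf: Configuration, path: Path): Boolean = {
  val codec = new CompressionCodecFactory(conf).getCodec(path)
  codec == null || codec.isInstanceOf[SplittableCompressionCodec]
}

// isSplitable(new Configuration(), new Path("/data/part-00000.gz"))  // false
// isSplitable(new Configuration(), new Path("/data/part-00000.txt")) // true
```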
* [SPARK-15825] [SQL] Fix SMJ invalid results (Herman van Hovell, 2016-06-10, 2 files changed, -0/+16)

## What changes were proposed in this pull request?

Code-generated `SortMergeJoin` failed with wrong results when using structs as keys. This could (eventually) be traced back to the use of a wrong row reference when comparing structs.

## How was this patch tested?

TBD

Author: Herman van Hovell <hvanhovell@databricks.com>

Closes #13589 from hvanhovell/SPARK-15822.
* [SPARK-15875] Try to use Seq.isEmpty and Seq.nonEmpty instead of Seq.length == 0 and Seq.length > 0 (wangyang, 2016-06-10, 9 files changed, -12/+12)

## What changes were proposed in this pull request?

In Scala, immutable.List.length is an expensive operation, so we should avoid using Seq.length == 0 or Seq.length > 0 and use Seq.isEmpty and Seq.nonEmpty instead.

## How was this patch tested?

Existing tests.

Author: wangyang <wangyang@haizhi.com>

Closes #13601 from yangw1234/isEmpty.
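The point in miniature: for a List-backed Seq, `length` walks the whole list (O(n)) while `isEmpty`/`nonEmpty` look only at the head (O(1)).

```scala
val xs: Seq[Int] = List(1, 2, 3)
if (xs.nonEmpty) println("has elements")  // preferred over xs.length > 0
if (xs.isEmpty) println("empty")          // preferred over xs.length == 0
```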
* [MINOR][X][X] Replace all occurrences of None: Option with Option.empty (Sandeep Singh, 2016-06-10, 6 files changed, -11/+11)

## What changes were proposed in this pull request?

Replace all occurrences of `None: Option[X]` with `Option.empty[X]`.

## How was this patch tested?

Existing tests.

Author: Sandeep Singh <sandeep@techaddict.me>

Closes #13591 from techaddict/minor-7.
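For illustration, the two forms are equivalent; `Option.empty[X]` just avoids the type ascription:

```scala
val before = None: Option[String]   // the old form
val after = Option.empty[String]    // the preferred form
```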
* [SPARK-6320][SQL] Move planLater method into GenericStrategy. (Takuya UESHIN, 2016-06-10, 6 files changed, -12/+151)

## What changes were proposed in this pull request?

This PR moves the `QueryPlanner.planLater()` method into `GenericStrategy` so that extra strategies are able to use `planLater` in their own strategies.

## How was this patch tested?

Existing tests.

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #13147 from ueshin/issues/SPARK-6320.
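A sketch of the extension point this unlocks (assuming the public `Strategy` alias and the `spark.experimental.extraStrategies` hook; the matched operator is left as a comment because it would be user-defined):

```scala
import org.apache.spark.sql.Strategy
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.SparkPlan

object MyStrategy extends Strategy {
  def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    // a real strategy would match its own logical operator here and call
    // planLater(child) to have the planner fill in the child's physical plan:
    //   case MyLogicalOp(child) => MyPhysicalOp(planLater(child)) :: Nil
    case _ => Nil
  }
}

// registered through the experimental hook, e.g.:
// spark.experimental.extraStrategies = Seq(MyStrategy)
```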
* [SPARK-15871][SQL] Add `assertNotPartitioned` check in `DataFrameWriter` (Liwei Lin, 2016-06-10, 3 files changed, -10/+52)

## What changes were proposed in this pull request?

It doesn't make sense to specify partitioning parameters when we write data out from Datasets/DataFrames into `jdbc` tables or streaming `ForeachWriter`s. This patch adds an `assertNotPartitioned` check in `DataFrameWriter`.

| operation      | should check not partitioned? |
|----------------|-------------------------------|
| mode           |                               |
| outputMode     |                               |
| trigger        |                               |
| format         |                               |
| option/options |                               |
| partitionBy    |                               |
| bucketBy       |                               |
| sortBy         |                               |
| save           |                               |
| queryName      |                               |
| startStream    |                               |
| foreach        | yes                           |
| insertInto     |                               |
| saveAsTable    |                               |
| jdbc           | yes                           |
| json           |                               |
| parquet        |                               |
| orc            |                               |
| text           |                               |
| csv            |                               |

## How was this patch tested?

New dedicated tests.

Author: Liwei Lin <lwlin7@gmail.com>

Closes #13597 from lw-lin/add-assertNotPartitioned.
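A minimal sketch of the check itself (Spark raises an `AnalysisException`; a plain `IllegalArgumentException` keeps the sketch dependency-free):

```scala
// fail fast when partitioning was configured before an operation that cannot
// honor it, such as jdbc() or foreach()
def assertNotPartitioned(operation: String,
                         partitioningColumns: Option[Seq[String]]): Unit = {
  if (partitioningColumns.isDefined) {
    throw new IllegalArgumentException(s"'$operation' does not support partitioning")
  }
}

// assertNotPartitioned("jdbc", Some(Seq("year")))  // throws
// assertNotPartitioned("jdbc", None)               // ok
```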
* Revert [SPARK-14485][CORE] ignore task finished for executor lost (Kay Ousterhout, 2016-06-10, 1 file changed, -13/+1)

This reverts commit 695dbc816a6d70289abeb145cb62ff4e62b3f49b.

This change is being reverted because it hurts the performance of some jobs, and only helps in a narrow set of cases. For more discussion, refer to the JIRA.

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #13580 from kayousterhout/revert-SPARK-14485.
* [SPARK-15766][SPARKR] R should export is.nan (wm624@hotmail.com, 2016-06-10, 1 file changed, -0/+2)

## What changes were proposed in this pull request?

When reviewing SPARK-15545, we found that is.nan is not exported, though it should be. This adds it to the NAMESPACE.

## How was this patch tested?

Manual tests.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #13508 from wangmiao1981/unused.
* [SPARK-15743][SQL] Prevent saving with all-column partitioning (Dongjoon Hyun, 2016-06-10, 5 files changed, -21/+37)

## What changes were proposed in this pull request?

When saving datasets to storage, `partitionBy` provides an easy way to construct the directory structure. However, if a user chooses all columns as partition columns, exceptions occur.

- **ORC with all-column partitioning**: `AnalysisException` on **future read** due to schema inference failure.

```scala
scala> spark.range(10).write.format("orc").mode("overwrite").partitionBy("id").save("/tmp/data")

scala> spark.read.format("orc").load("/tmp/data").collect()
org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC at /tmp/data. It must be specified manually;
```

- **Parquet with all-column partitioning**: `InvalidSchemaException` on **write execution** due to a Parquet limitation.

```scala
scala> spark.range(100).write.format("parquet").mode("overwrite").partitionBy("id").save("/tmp/data")
[Stage 0:>                                                          (0 + 8) / 8]16/06/02 16:51:17 ERROR Utils: Aborting task
org.apache.parquet.schema.InvalidSchemaException: A group type can not be empty. Parquet does not support empty group without leaves. Empty group: spark_schema
... (lots of error messages)
```

Although some formats like JSON support all-column partitioning without any problem, it does not seem a good idea to create lots of empty directories. This PR prevents saving with all-column partitioning by consistently raising `AnalysisException` before executing the save operation.

## How was this patch tested?

Newly added `PartitioningUtilsSuite`.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13486 from dongjoon-hyun/SPARK-15743.
* [SPARK-15738][PYSPARK][ML] Adding Pyspark ml RFormula __str__ method similar to Scala API (Bryan Cutler, 2016-06-10, 3 files changed, -2/+26)

## What changes were proposed in this pull request?

Adds `__str__` to RFormula and its model, which shows the set formula param and the resolved formula. This is currently present in the Scala API and was found missing in PySpark during the Spark 2.0 coverage review.

## How was this patch tested?

Ran pyspark-ml tests locally.

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #13481 from BryanCutler/pyspark-ml-rformula_str-SPARK-15738.
* [SPARK-15866] Rename listAccumulator collectionAccumulator (Reynold Xin, 2016-06-10, 4 files changed, -18/+23)

## What changes were proposed in this pull request?

SparkContext.listAccumulator, by Spark's convention, makes it sound like "list" is a verb and the method should return a list of accumulators. This patch renames the method to `collectionAccumulator` and the class to `CollectionAccumulator`.

## How was this patch tested?

Updated test cases to reflect the new names.

Author: Reynold Xin <rxin@databricks.com>

Closes #13594 from rxin/SPARK-15866.
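A usage sketch of the renamed API (assumes a live SparkContext `sc`, e.g. in spark-shell; the accumulator name is illustrative):

```scala
val acc = sc.collectionAccumulator[String]("seenNames")
sc.parallelize(Seq("a", "b", "c")).foreach(name => acc.add(name))
println(acc.value)  // a java.util.List with the collected elements
```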
* [SPARK-15753][SQL] Move Analyzer stuff to Analyzer from DataFrameWriter (Liang-Chi Hsieh, 2016-06-10, 3 files changed, -16/+17)

## What changes were proposed in this pull request?

This patch moves some code in `DataFrameWriter.insertInto` that belongs in the `Analyzer`.

## How was this patch tested?

Existing tests.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #13496 from viirya/move-analyzer-stuff.
* [SPARK-15812][SQL][STREAMING] Added support for sorting after streaming aggregation with complete mode (Tathagata Das, 2016-06-10, 4 files changed, -32/+95)

## What changes were proposed in this pull request?

When the output mode is complete, the output of a streaming aggregation essentially contains the complete aggregates every time, so it is no different from a batch dataset within an incremental execution, and other non-streaming operations should be supported on such a dataset. In this PR, I am just adding support for sorting, as it is a common and useful piece of functionality. Support for other operations will come later.

## How was this patch tested?

Additional unit tests.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13549 from tdas/SPARK-15812.
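A usage sketch of what is now allowed, written against the 2.0-preview streaming API names that appear elsewhere in this log (`outputMode`, `startStream`); `streamingDf` and the query name are hypothetical, and details may differ from the final API:

```scala
// hypothetical streaming source with a "word" column
val counts = streamingDf.groupBy("word").count()

val query = counts
  .orderBy(counts("count").desc)  // sorting is now allowed under complete mode
  .write
  .format("memory")
  .queryName("wordCounts")        // hypothetical query name
  .outputMode("complete")
  .startStream()
```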
* [SPARK-15837][ML][PYSPARK] Word2vec python add maxsentence parameter (WeichenXu, 2016-06-10, 1 file changed, -5/+24)

## What changes were proposed in this pull request?

Add the max-sentence-length parameter to the Python Word2Vec API.

## How was this patch tested?

Existing tests.

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #13578 from WeichenXu123/word2vec_python_add_maxsentence.
* [SPARK-15823][PYSPARK][ML] Add @property for 'accuracy' in MulticlassMetrics (Zheng RuiFeng, 2016-06-10, 1 file changed, -5/+2)

## What changes were proposed in this pull request?

`accuracy` should be decorated with `property` to keep in step with the other methods in `pyspark.MulticlassMetrics`, like `weightedPrecision`, `weightedRecall`, etc.

## How was this patch tested?

Manual tests.

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #13560 from zhengruifeng/add_accuracy_property.
* [DOCUMENTATION] fixed groupby aggregation example for pyspark (Mortada Mehyar, 2016-06-10, 1 file changed, -1/+1)

## What changes were proposed in this pull request?

Fixing the documentation for the groupby/agg example in Python.

## How was this patch tested?

The existing example in the documentation does not contain valid syntax (it is missing a parenthesis) and is not using `Column` in the expression for `agg()`. After the fix, here's how I tested it:

```
In [1]: from pyspark.sql import Row

In [2]: import pyspark.sql.functions as func

In [3]: %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:records = [{'age': 19, 'department': 1, 'expense': 100},
:           {'age': 20, 'department': 1, 'expense': 200},
:           {'age': 21, 'department': 2, 'expense': 300},
:           {'age': 22, 'department': 2, 'expense': 300},
:           {'age': 23, 'department': 3, 'expense': 300}]
:--

In [4]: df = sqlContext.createDataFrame([Row(**d) for d in records])

In [5]: df.groupBy("department").agg(df["department"], func.max("age"), func.sum("expense")).show()
+----------+----------+--------+------------+
|department|department|max(age)|sum(expense)|
+----------+----------+--------+------------+
|         1|         1|      20|         300|
|         2|         2|      22|         600|
|         3|         3|      23|         300|
+----------+----------+--------+------------+
```

Author: Mortada Mehyar <mortada.mehyar@gmail.com>

Closes #13587 from mortada/groupby_agg_doc_fix.
* [SPARK-15593][SQL] Add DataFrameWriter.foreach to allow the user consuming data in ContinuousQuery (Shixiong Zhu, 2016-06-10, 6 files changed, -42/+413)

## What changes were proposed in this pull request?

* Add DataFrameWriter.foreach to allow the user to consume data in a ContinuousQuery
* ForeachWriter is the interface for the user to consume partitions of data
* Add a type parameter T to DataFrameWriter

Usage:

```scala
val ds = spark.read....stream().as[String]

ds.....write
  .queryName(...)
  .option("checkpointLocation", ...)
  .foreach(new ForeachWriter[Int] {

    def open(partitionId: Long, version: Long): Boolean = {
      // prepare some resources for a partition
      // check `version` if possible and return `false` if this is duplicated
      // data, to skip the data processing
    }

    override def process(value: Int): Unit = {
      // process data
    }

    def close(errorOrNull: Throwable): Unit = {
      // release resources for a partition
      // check `errorOrNull` and handle the error if necessary
    }
  })
```

## How was this patch tested?

New unit tests.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #13342 from zsxwing/foreach.
* [SPARK-15696][SQL] Improve `crosstab` to have a consistent column order (Dongjoon Hyun, 2016-06-09, 2 files changed, -4/+4)

## What changes were proposed in this pull request?

Currently, `crosstab` returns a Dataframe having **random-order** columns obtained by just `distinct`. Also, the documentation of `crosstab` shows the result in a sorted order, which is different from the current implementation. This PR explicitly constructs the columns in a sorted order to improve the user experience, and this implementation gives the same result as the documentation.

**Before**
```scala
scala> spark.createDataFrame(Seq((1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3, 2), (3, 3))).toDF("key", "value").stat.crosstab("key", "value").show()
+---------+---+---+---+
|key_value|  3|  2|  1|
+---------+---+---+---+
|        2|  1|  0|  2|
|        1|  0|  1|  1|
|        3|  1|  1|  0|
+---------+---+---+---+

scala> spark.createDataFrame(Seq((1, "a"), (1, "b"), (2, "a"), (2, "a"), (2, "c"), (3, "b"), (3, "c"))).toDF("key", "value").stat.crosstab("key", "value").show()
+---------+---+---+---+
|key_value|  c|  a|  b|
+---------+---+---+---+
|        2|  1|  2|  0|
|        1|  0|  1|  1|
|        3|  1|  0|  1|
+---------+---+---+---+
```

**After**
```scala
scala> spark.createDataFrame(Seq((1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3, 2), (3, 3))).toDF("key", "value").stat.crosstab("key", "value").show()
+---------+---+---+---+
|key_value|  1|  2|  3|
+---------+---+---+---+
|        2|  2|  0|  1|
|        1|  1|  1|  0|
|        3|  0|  1|  1|
+---------+---+---+---+

scala> spark.createDataFrame(Seq((1, "a"), (1, "b"), (2, "a"), (2, "a"), (2, "c"), (3, "b"), (3, "c"))).toDF("key", "value").stat.crosstab("key", "value").show()
+---------+---+---+---+
|key_value|  a|  b|  c|
+---------+---+---+---+
|        2|  2|  0|  1|
|        1|  1|  1|  0|
|        3|  0|  1|  1|
+---------+---+---+---+
```

## How was this patch tested?

Passed the Jenkins tests with updated testcases.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13436 from dongjoon-hyun/SPARK-15696.
* [SPARK-15791] Fix NPE in ScalarSubquery (Eric Liang, 2016-06-09, 4 files changed, -4/+15)

## What changes were proposed in this pull request?

The fix is pretty simple: just don't make the executedPlan transient in `ScalarSubquery`, since it is referenced at execution time.

## How was this patch tested?

I verified the fix manually in non-local mode. It's not clear to me why the problem did not manifest in local mode; any suggestions? cc davies

Author: Eric Liang <ekl@databricks.com>

Closes #13569 from ericl/fix-scalar-npe.
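Why a `@transient` field turns into a null reference once the object crosses a serialization boundary (as happens when a plan is shipped to executors, but not in local mode) can be shown with a minimal, Spark-free sketch:

```scala
import java.io._

case class Holder(@transient plan: String)

// round-trip through Java serialization
val bos = new ByteArrayOutputStream()
val oos = new ObjectOutputStream(bos)
oos.writeObject(Holder("executedPlan"))
oos.close()

val revived = new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray))
  .readObject().asInstanceOf[Holder]
println(revived.plan)  // null: the transient field was not serialized
```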
* [SPARK-15850][SQL] Remove function grouping in SparkSession (Reynold Xin, 2016-06-09, 3 files changed, -31/+61)

## What changes were proposed in this pull request?

SparkSession does not have that many functions due to better namespacing, and as a result we probably don't need the function grouping. This patch removes the grouping and also adds missing scaladocs for the createDataset functions in SQLContext. Closes #13577.

## How was this patch tested?

N/A - this is a documentation change.

Author: Reynold Xin <rxin@databricks.com>

Closes #13582 from rxin/SPARK-15850.
* [SPARK-15853][SQL] HDFSMetadataLog.get should close the input stream (Shixiong Zhu, 2016-06-09, 1 file changed, -2/+6)

## What changes were proposed in this pull request?

This PR closes the input stream created in `HDFSMetadataLog.get`.

## How was this patch tested?

Jenkins unit tests.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #13583 from zsxwing/leak.
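The standard pattern the fix applies, as a self-contained sketch (HDFSMetadataLog's actual deserialization code differs):

```scala
import java.io.InputStream

def readFully(open: () => InputStream): Array[Byte] = {
  val in = open()
  try {
    val out = new java.io.ByteArrayOutputStream()
    val buf = new Array[Byte](8192)
    var n = in.read(buf)
    while (n != -1) { out.write(buf, 0, n); n = in.read(buf) }
    out.toByteArray
  } finally {
    in.close()  // the stream is released even if reading fails
  }
}
```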
* [SPARK-15794] Should truncate toString() of very wide plans (Eric Liang, 2016-06-09, 14 files changed, -24/+140)

## What changes were proposed in this pull request?

With very wide tables, e.g. thousands of fields, the plan output is unreadable and often causes OOMs due to inefficient string processing. This truncates all struct and operator field lists to a user-configurable threshold to limit the performance impact.

It would also be nice to optimize string generation to avoid these sorts of O(n^2) slowdowns entirely (i.e. use StringBuilder everywhere including expressions), but this is probably too large of a change for 2.0 at this point, and truncation has other benefits for usability.

## How was this patch tested?

Added a microbenchmark that covers this case particularly well. I also ran the microbenchmark while varying the truncation threshold.

```
numFields = 5
wide shallowly nested struct field r/w:  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
2000 wide x 50 rows (write in-mem)            2336 / 2558          0.0       23364.4       0.1X

numFields = 25
wide shallowly nested struct field r/w:  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
2000 wide x 50 rows (write in-mem)            4237 / 4465          0.0       42367.9       0.1X

numFields = 100
wide shallowly nested struct field r/w:  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
2000 wide x 50 rows (write in-mem)          10458 / 11223          0.0      104582.0       0.0X

numFields = Infinity
wide shallowly nested struct field r/w:  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
[info] java.lang.OutOfMemoryError: Java heap space
```

Author: Eric Liang <ekl@databricks.com>
Author: Eric Liang <ekhliang@gmail.com>

Closes #13537 from ericl/truncated-string.
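The truncation idea in miniature (an approximation; Spark's helper lives in its internal utils):

```scala
// render at most maxFields elements, then note how many were dropped
def truncatedString[T](seq: Seq[T], sep: String, maxFields: Int): String = {
  if (seq.length <= maxFields) {
    seq.mkString(sep)
  } else {
    seq.take(maxFields).mkString(sep) + sep + s"... ${seq.length - maxFields} more fields"
  }
}

println(truncatedString(1 to 10000, ", ", 3))  // 1, 2, 3, ... 9997 more fields
```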
* [SPARK-15841][Tests] REPLSuite has incorrect env set for a couple of tests. (Prashant Sharma, 2016-06-09, 2 files changed, -4/+4)

Description from JIRA: in ReplSuite, a test that can be tested well in local mode should not have to start a local-cluster; and similarly, a test is insufficiently exercised if it is meant to fix a problem related to a distributed run but actually runs in a local environment.

Tested with existing tests.

Author: Prashant Sharma <prashsh1@in.ibm.com>

Closes #13574 from ScrapCodes/SPARK-15841/repl-suite-fix.
* [SPARK-12447][YARN] Only update the states when executor is successfully launched (jerryshao, 2016-06-09, 2 files changed, -30/+47)

The details are described in https://issues.apache.org/jira/browse/SPARK-12447.

vanzin Please help to review, thanks a lot.

Author: jerryshao <sshao@hortonworks.com>

Closes #10412 from jerryshao/SPARK-12447.
* [SPARK-14321][SQL] Reduce date format cost and string-to-date cost in date functions (Herman van Hovell, 2016-06-09, 1 file changed, -24/+24)

## What changes were proposed in this pull request?

The current implementations of `UnixTime` and `FromUnixTime` do not cache their parser/formatter as much as they could. This PR resolves that issue.

This PR is a take-over from https://github.com/apache/spark/pull/13522 and further optimizes the re-use of the parser/formatter. It also improves exception handling (catching the actual exception instead of `Throwable`). All credit for this work should go to rajeshbalamohan. This PR closes https://github.com/apache/spark/pull/13522.

## How was this patch tested?

Current tests.

Author: Herman van Hovell <hvanhovell@databricks.com>
Author: Rajesh Balamohan <rbalamohan@apache.org>

Closes #13581 from hvanhovell/SPARK-14321.
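The caching idea, sketched with a stand-in class (not Spark's expression code): construct the formatter once per expression instance instead of once per row. Note that `SimpleDateFormat` is not thread-safe, so a real implementation keeps each cached instance confined to a single thread or partition.

```scala
import java.text.SimpleDateFormat

class FromUnixTimeLike(pattern: String) {
  // built lazily, once per instance, instead of inside the per-row path
  @transient private lazy val formatter = new SimpleDateFormat(pattern)

  def format(seconds: Long): String =
    formatter.format(new java.util.Date(seconds * 1000L))
}

val f = new FromUnixTimeLike("yyyy-MM-dd")
println(f.format(0L))  // the epoch, rendered in the local time zone
```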
* [SPARK-15839] Fix Maven doc-jar generation when JAVA_7_HOME is set (Josh Rosen, 2016-06-09, 1 file changed, -6/+23)

## What changes were proposed in this pull request?

It looks like the nightly Maven snapshots broke after we set `JAVA_7_HOME` in the build: https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/1573/. It seems that passing `-javabootclasspath` to ScalaDoc using scala-maven-plugin ends up preventing the Scala library classes from being added to scalac's internal class path, causing compilation errors while building doc-jars.

There might be a principled fix to this inside of the scala-maven-plugin itself, but for now this patch configures the build to omit the `-javabootclasspath` option during Maven doc-jar generation.

## How was this patch tested?

Tested manually with `build/mvn clean install -DskipTests=true` when `JAVA_7_HOME` was set. Also manually inspected the effective POM diff to verify that the final POM changes were scoped correctly: https://gist.github.com/JoshRosen/f889d1c236fad14fa25ac4be01654653

/cc vanzin and yhuai for review.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #13573 from JoshRosen/SPARK-15839.
* [SPARK-15827][BUILD] Publish Spark's forked sbt-pom-reader to Maven Central (Josh Rosen, 2016-06-09, 2 files changed, -28/+9)

Spark's SBT build currently uses a fork of the sbt-pom-reader plugin, but depends on that fork via an SBT subproject which is cloned from https://github.com/scrapcodes/sbt-pom-reader/tree/ignore_artifact_id. This unnecessarily slows down the initial build on fresh machines and is also risky because the build would break if that GitHub repository were ever changed or deleted.

In order to address these issues, I have published a pre-built binary of our forked sbt-pom-reader plugin to Maven Central under the `org.spark-project` namespace and have updated Spark's build to use that artifact. The published artifact was built from https://github.com/JoshRosen/sbt-pom-reader/tree/v1.0.0-spark, which contains the contents of ScrapCodes's branch plus an additional patch to configure the build for artifact publication.

/cc srowen ScrapCodes for review.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #13564 from JoshRosen/use-published-fork-of-pom-reader.
* [SPARK-15788][PYSPARK][ML] PySpark IDFModel missing "idf" property (Jeff Zhang, 2016-06-09, 1 file changed, -0/+10)

## What changes were proposed in this pull request?

Add the `idf` method to IDF in PySpark.

## How was this patch tested?

Added a unit test.

Author: Jeff Zhang <zjffdu@apache.org>

Closes #13540 from zjffdu/SPARK-15788.
* [SPARK-15804][SQL] Include metadata in the toStructType (Kevin Yu, 2016-06-09, 2 files changed, -1/+16)

## What changes were proposed in this pull request?

The helper function `toStructType` in the AttributeSeq class doesn't include the metadata when it builds each StructField, which causes the problem reported in https://issues.apache.org/jira/browse/SPARK-15804?jql=project%20%3D%20SPARK when Spark writes a dataframe with metadata to the parquet data source. The code path: when Spark writes the dataframe to the parquet data source through `InsertIntoHadoopFsRelationCommand`, it builds the WriteRelation container and calls the helper function `toStructType` to create the StructType containing the StructFields. The metadata should be included there; otherwise the user-provided metadata is lost.

## How was this patch tested?

Added a test case in ParquetQuerySuite.scala.

Author: Kevin Yu <qyu@us.ibm.com>

Closes #13555 from kevinyu98/spark-15804.
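The essence of the fix, sketched with the public types (the field name and metadata key are illustrative): pass the attribute's metadata through as the fourth `StructField` argument.

```scala
import org.apache.spark.sql.types._

val md = new MetadataBuilder().putString("comment", "user id").build()

// before: StructField(name, dataType, nullable)           -> metadata dropped
// after:  StructField(name, dataType, nullable, metadata) -> metadata preserved
val field = StructField("id", LongType, nullable = false, md)
val schema = StructType(Seq(field))
println(schema("id").metadata)  // {"comment":"user id"}
```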
* [SPARK-15818][BUILD] Upgrade to Hadoop 2.7.2 (Adam Roberts, 2016-06-09, 4 files changed, -48/+48)

## What changes were proposed in this pull request?

Updating the Hadoop version from 2.7.0 to 2.7.2 when using the Hadoop-2.7 build profile.

## How was this patch tested?

Existing tests.

I'd like us to use Hadoop 2.7.2 owing to the Hadoop release notes stating that Hadoop 2.7.0 is not ready for production use. https://hadoop.apache.org/docs/r2.7.0/ states: "Apache Hadoop 2.7.0 is a minor release in the 2.x.y release line, building upon the previous stable release 2.6.0. This release is not yet ready for production use. Production users should use 2.7.1 release and beyond."

Hadoop 2.7.1 release notes: "Apache Hadoop 2.7.1 is a minor release in the 2.x.y release line, building upon the previous release 2.7.0. This is the next stable release after Apache Hadoop 2.6.x."

And then the Hadoop 2.7.2 release notes: "Apache Hadoop 2.7.2 is a minor release in the 2.x.y release line, building upon the previous stable release 2.7.1."

I've tested this is OK with Intel hardware and IBM Java 8, so let's test it with OpenJDK; ideally this will be pushed to branch-2.0 and master.

Author: Adam Roberts <aroberts@uk.ibm.com>

Closes #13556 from a-roberts/patch-2.
* [SPARK-12712] Fix failure in ./dev/test-dependencies when run against empty .m2 cache (Josh Rosen, 2016-06-09, 1 file changed, -1/+1)

This patch fixes a bug in `./dev/test-dependencies.sh` which caused spurious failures when the script was run on a machine with an empty `.m2` cache. The problem was that extra log output from the dependency download was conflicting with the grep / regex used to identify the classpath in the Maven output. This patch fixes this issue by adjusting the regex pattern.

Tested manually with the following reproduction of the bug:

```
rm -rf ~/.m2/repository/org/apache/commons/
./dev/test-dependencies.sh
```

Author: Josh Rosen <joshrosen@databricks.com>

Closes #13568 from JoshRosen/SPARK-12712.
* [MINOR][DOC] In Dataset docs, remove self link to Dataset and add link to Column (Sandeep Singh, 2016-06-08, 1 file changed, -100/+100)

## What changes were proposed in this pull request?

Documentation fix.

## How was this patch tested?

Author: Sandeep Singh <sandeep@techaddict.me>

Closes #13567 from techaddict/minor-4.
* [SPARK-14670] [SQL] allow updating driver side sql metrics (Wenchen Fan, 2016-06-08, 3 files changed, -8/+85)

## What changes were proposed in this pull request?

On the Spark UI right now we have the SQL tab that displays accumulator values per operator. However, it only displays metrics updated on the executors, not on the driver. It is useful to also include driver metrics, e.g. broadcast time.

This is a different version from https://github.com/apache/spark/pull/12427. This PR sends driver-side accumulator updates right after the update happens, not at the end of execution, via a new event.

## How was this patch tested?

New test in `SQLListenerSuite`.

![qq20160606-0](https://cloud.githubusercontent.com/assets/3182036/15841418/0eb137da-2c06-11e6-9068-5694eeb78530.png)

Author: Wenchen Fan <wenchen@databricks.com>

Closes #13189 from cloud-fan/metrics.
* [SPARK-15735] Allow specifying min time to run in microbenchmarks (Eric Liang, 2016-06-08, 1 file changed, -37/+72)

## What changes were proposed in this pull request?

This makes microbenchmarks run for at least 2 seconds by default, to allow some time for JIT compilation to kick in.

## How was this patch tested?

Tested manually with existing microbenchmarks. This change is backwards compatible in that existing microbenchmarks which specified numIters per case will still run exactly that number of iterations. Microbenchmarks which previously overrode defaultNumIters now override minNumIters.

cc hvanhovell

Author: Eric Liang <ekl@databricks.com>
Author: Eric Liang <ekhliang@gmail.com>

Closes #13472 from ericl/spark-15735.
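The min-time loop in sketch form (names are illustrative, not Spark's Benchmark class): keep iterating until both the minimum iteration count and the minimum wall-clock budget are satisfied.

```scala
def runCase(minNumIters: Int, minTimeMs: Long)(body: () => Unit): Unit = {
  var iters = 0
  val deadline = System.currentTimeMillis() + minTimeMs
  // run until both the iteration floor and the time budget are met,
  // giving the JIT time to warm up
  while (iters < minNumIters || System.currentTimeMillis() < deadline) {
    body()
    iters += 1
  }
  println(s"ran $iters iterations")
}

runCase(minNumIters = 2, minTimeMs = 2000)(() => math.sqrt(2.0))
```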
* [DOCUMENTATION] Fixed target JAR path (prabs, 2016-06-08, 1 file changed, -2/+2)

## What changes were proposed in this pull request?

The Scala version mentioned in the sbt configuration file is 2.11, so the path of the target JAR should be `/target/scala-2.11/simple-project_2.11-1.0.jar`.

## How was this patch tested?

n/a

Author: prabs <prabsmails@gmail.com>
Author: Prabeesh K <prabsmails@gmail.com>

Closes #13554 from prabeesh/master.
* [MINOR] Fix Java Lint errors introduced by #13286 and #13280 (Sandeep Singh, 2016-06-08, 3 files changed, -6/+6)

## What changes were proposed in this pull request?

Revived #13464: fix the Java lint errors introduced by #13286 and #13280.

Before:

```
Using `mvn` from path: /Users/pichu/Project/spark/build/apache-maven-3.3.9/bin/mvn
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0
Checkstyle checks failed at following occurrences:
[ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[340,5] (whitespace) FileTabCharacter: Line contains a tab character.
[ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[341,5] (whitespace) FileTabCharacter: Line contains a tab character.
[ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[342,5] (whitespace) FileTabCharacter: Line contains a tab character.
[ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[343,5] (whitespace) FileTabCharacter: Line contains a tab character.
[ERROR] src/main/java/org/apache/spark/sql/streaming/OutputMode.java:[41,28] (naming) MethodName: Method name 'Append' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
[ERROR] src/main/java/org/apache/spark/sql/streaming/OutputMode.java:[52,28] (naming) MethodName: Method name 'Complete' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
[ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[61,8] (imports) UnusedImports: Unused import - org.apache.parquet.schema.PrimitiveType.
[ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[62,8] (imports) UnusedImports: Unused import - org.apache.parquet.schema.Type.
```

## How was this patch tested?

Ran `dev/lint-java` locally.

Author: Sandeep Singh <sandeep@techaddict.me>

Closes #13559 from techaddict/minor-3.