Commit log, most recent first. Each entry gives the subject (author, date, files changed, lines -/+), followed by the commit message.
* [SPARK-11913][SQL] support typed aggregate with complex buffer schema (Wenchen Fan, 2015-11-23, 2 files, -10/+56)
  Author: Wenchen Fan <wenchen@databricks.com> Closes #9898 from cloud-fan/agg.
* [SPARK-11921][SQL] fix `nullable` of encoder schema (Wenchen Fan, 2015-11-23, 2 files, -3/+50)
  Author: Wenchen Fan <wenchen@databricks.com> Closes #9906 from cloud-fan/nullable.
* [SPARK-11894][SQL] fix isNull for GetInternalRowField (Wenchen Fan, 2015-11-23, 2 files, -15/+23)
  We should use `InternalRow.isNullAt` to check whether the field is null before calling `InternalRow.getXXX`. Thanks to gatorsmile, who discovered this bug.
  Author: Wenchen Fan <wenchen@databricks.com> Closes #9904 from cloud-fan/null.
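  A minimal sketch of the null-check pattern this fix describes, written against the `InternalRow` API (`readIntField` is a hypothetical helper, not part of the patch):
  ```scala
  import org.apache.spark.sql.catalyst.InternalRow

  // Check isNullAt before the typed getter: primitive getters such as getInt
  // return an unspecified default for null slots rather than signalling null.
  def readIntField(row: InternalRow, ordinal: Int): Option[Int] =
    if (row.isNullAt(ordinal)) None else Some(row.getInt(ordinal))
  ```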
* [SPARK-11628][SQL] support column datatype of char(x) to recognize HiveChar (Xiu Guo, 2015-11-23, 6 files, -7/+43)
  Can someone review my code to make sure I'm not missing anything? Thanks!
  Author: Xiu Guo <xguo27@gmail.com> Author: Xiu Guo <guoxi@us.ibm.com> Closes #9612 from xguo27/SPARK-11628.
* [SPARK-11902][ML] Unhandled case in VectorAssembler#transform (BenFradet, 2015-11-22, 2 files, -0/+13)
  There is an unhandled case in the transform method of VectorAssembler if one of the input columns doesn't have one of the supported types: DoubleType, NumericType, BooleanType, or VectorUDT. So if you try to transform a column of StringType, you get a cryptic "scala.MatchError: StringType". This PR fixes that by throwing a SparkException when dealing with an unknown column type.
  Author: BenFradet <benjamin.fradet@gmail.com> Closes #9885 from BenFradet/SPARK-11902.
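  A hypothetical reduction of the fix (vector columns omitted for brevity; the names here are illustrative, not the patch's):
  ```scala
  import org.apache.spark.SparkException
  import org.apache.spark.sql.types._

  // An explicit catch-all turns an unsupported input type into a clear
  // SparkException instead of a cryptic scala.MatchError.
  def validateInputType(dataType: DataType): Unit = dataType match {
    case _: NumericType | BooleanType => // supported (DoubleType is a NumericType)
    case other =>
      throw new SparkException(s"VectorAssembler does not support the $other type")
  }
  ```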
* [SPARK-11912][ML] ml.feature.PCA minor refactor (Yanbo Liang, 2015-11-22, 2 files, -30/+24)
  Like [SPARK-11852](https://issues.apache.org/jira/browse/SPARK-11852), ```k``` is a param and we should save it under ```metadata/``` rather than under both ```data/``` and ```metadata/```. Refactor the constructor of ```ml.feature.PCAModel``` to take only ```pc```, and construct ```mllib.feature.PCAModel``` inside ```transform```.
  Author: Yanbo Liang <ybliang8@gmail.com> Closes #9897 from yanboliang/spark-11912.
* [SPARK-11835] Adds a sidebar menu to MLlib's documentation (Timothy Hunter, 2015-11-22, 6 files, -8/+163)
  This PR adds a sidebar menu when browsing the user guide of MLlib. It uses a YAML file to describe the structure of the documentation. It should be trivial to adapt this to the other projects.
  ![screen shot 2015-11-18 at 4 46 12 pm](https://cloud.githubusercontent.com/assets/7594753/11259591/a55173f4-8e17-11e5-9340-0aed79d66262.png)
  Author: Timothy Hunter <timhunter@databricks.com> Closes #9826 from thunterdb/spark-11835.
* [SPARK-6791][ML] Add read/write for CrossValidator and Evaluators (Joseph K. Bradley, 2015-11-22, 12 files, -85/+522)
  I believe this works for general estimators within CrossValidator, including compound estimators. (See the complex unit test.) Added read/write for all 3 Evaluators as well.
  CC: mengxr yanboliang
  Author: Joseph K. Bradley <joseph@databricks.com> Closes #9848 from jkbradley/cv-io.
* [SPARK-11895][ML] rename and refactor DatasetExample under mllib/examples (Xiangrui Meng, 2015-11-22, 1 file, -45/+26)
  We used the name `Dataset` to refer to `SchemaRDD` in 1.2 in ML pipelines and created this example file. Since `Dataset` has a new meaning in Spark 1.6, we should rename it to avoid confusion. This PR also removes support for dense format to simplify the example code.
  cc: yinxusen
  Author: Xiangrui Meng <meng@databricks.com> Closes #9873 from mengxr/SPARK-11895.
* [SPARK-11908][SQL] Add NullType support to RowEncoder (Liang-Chi Hsieh, 2015-11-22, 3 files, -2/+9)
  JIRA: https://issues.apache.org/jira/browse/SPARK-11908
  We should add NullType support to RowEncoder.
  Author: Liang-Chi Hsieh <viirya@appier.com> Closes #9891 from viirya/rowencoder-nulltype.
* [SPARK-11899][SQL] API audit for GroupedDataset. (Reynold Xin, 2015-11-21, 9 files, -45/+131)
  1. Renamed map to mapGroup, flatMap to flatMapGroup.
  2. Renamed asKey -> keyAs.
  3. Added more documentation.
  4. Changed type parameter T to V on GroupedDataset.
  5. Added since versions for all functions.
  Author: Reynold Xin <rxin@databricks.com> Closes #9880 from rxin/SPARK-11899.
* [SPARK-11901][SQL] API audit for Aggregator. (Reynold Xin, 2015-11-21, 2 files, -16/+24)
  Author: Reynold Xin <rxin@databricks.com> Closes #9882 from rxin/SPARK-11901.
* [SPARK-11900][SQL] Add since version for all encoders (Reynold Xin, 2015-11-21, 1 file, -0/+63)
  Author: Reynold Xin <rxin@databricks.com> Closes #9881 from rxin/SPARK-11900.
* [SPARK-11819][SQL][FOLLOW-UP] fix scala 2.11 build (Wenchen Fan, 2015-11-20, 1 file, -2/+2)
  It seems Scala 2.11 doesn't support defining private methods in `trait xxx` and using them in `object xxx extends xxx`.
  Author: Wenchen Fan <wenchen@databricks.com> Closes #9879 from cloud-fan/follow.
* Revert "[SPARK-11689][ML] Add user guide and example code for LDA under ↵Xiangrui Meng2015-11-205-204/+1
| | | | | | spark.ml" This reverts commit e359d5dcf5bd300213054ebeae9fe75c4f7eb9e7.
* [HOTFIX] Fix Java Dataset Tests (Michael Armbrust, 2015-11-20, 1 file, -2/+2)
* [SPARK-11890][SQL] Fix compilation for Scala 2.11 (Michael Armbrust, 2015-11-20, 1 file, -2/+2)
  Author: Michael Armbrust <michael@databricks.com> Closes #9871 from marmbrus/scala211-break.
* [SPARK-11889][SQL] Fix type inference for GroupedDataset.agg in REPL (Michael Armbrust, 2015-11-20, 3 files, -29/+30)
  In this PR I delete a method that breaks type inference for aggregators (only in the REPL). The error when this method is present is:
  ```
  <console>:38: error: missing parameter type for expanded function ((x$2) => x$2._2)
         ds.groupBy(_._1).agg(sum(_._2), sum(_._3)).collect()
  ```
  Author: Michael Armbrust <michael@databricks.com> Closes #9870 from marmbrus/dataset-repl-agg.
* [SPARK-11787][SPARK-11883][SQL][FOLLOW-UP] Cleanup for this patch. (Nong Li, 2015-11-20, 10 files, -80/+175)
  This mainly moves SqlNewHadoopRDD to the sql package. There is some state that is shared with core, and I've left that in core. This allows some other associated minor cleanup.
  Author: Nong Li <nong@databricks.com> Closes #9845 from nongli/spark-11787.
* [SPARK-11549][DOCS] Replace example code in mllib-evaluation-metrics.md using include_example (Vikas Nelamangala, 2015-11-20, 16 files, -925/+1319)
  Author: Vikas Nelamangala <vikasnelamangala@Vikass-MacBook-Pro.local> Closes #9689 from vikasnp/master.
* [SPARK-11636][SQL] Support classes defined in the REPL with Encoders (Michael Armbrust, 2015-11-20, 4 files, -7/+43)
  #theScaryParts (i.e. changes to the repl, executor classloaders and codegen)...
  Author: Michael Armbrust <michael@databricks.com> Author: Yin Huai <yhuai@databricks.com> Closes #9825 from marmbrus/dataset-replClasses2.
* [SPARK-11756][SPARKR] Fix use of aliases - SparkR can not output help information for SparkR:::summary correctly (felixcheung, 2015-11-20, 4 files, -84/+37)
  Fixes the use of aliases and changes uses of rdname and seealso. `aliases` is the hint for `?`; it should not be linked to some other name (those should be seealso): https://cran.r-project.org/web/packages/roxygen2/vignettes/rd.html
  Cleans up usage of family, as multiple uses of family with the same rdname were causing duplicated See Also html blocks (like http://spark.apache.org/docs/latest/api/R/count.html). Also changes some rdname for dplyr-like variants for better R user visibility in R doc, e.g. rbind, summary, mutate, summarize.
  shivaram yanboliang
  Author: felixcheung <felixcheung_m@hotmail.com> Closes #9750 from felixcheung/rdocaliases.
* [SPARK-11716][SQL] UDFRegistration just drops the input type when re-creating the UserDefinedFunction (Jean-Baptiste Onofré, 2015-11-20, 2 files, -24/+39)
  https://issues.apache.org/jira/browse/SPARK-11716
  This is #9739 plus a regression test. When committing it, please make sure the author is jbonofre. You can find the original PR at https://github.com/apache/spark/pull/9739
  closes #9739
  Author: Jean-Baptiste Onofré <jbonofre@apache.org> Author: Yin Huai <yhuai@databricks.com> Closes #9868 from yhuai/SPARK-11716.
* [SPARK-11887] Close PersistenceEngine at the end of PersistenceEngineSuite tests (Josh Rosen, 2015-11-20, 1 file, -48/+52)
  In PersistenceEngineSuite, we do not call `close()` on the PersistenceEngine at the end of the test. For the ZooKeeperPersistenceEngine, this causes us to leak a ZooKeeper client, causing the logs of unrelated tests to be periodically spammed with connection error messages from that client:
  ```
  15/11/20 05:13:35.789 pool-1-thread-1-ScalaTest-running-PersistenceEngineSuite-SendThread(localhost:15741) INFO ClientCnxn: Opening socket connection to server localhost/127.0.0.1:15741. Will not attempt to authenticate using SASL (unknown error)
  15/11/20 05:13:35.790 pool-1-thread-1-ScalaTest-running-PersistenceEngineSuite-SendThread(localhost:15741) WARN ClientCnxn: Session 0x15124ff48dd0000 for server null, unexpected error, closing socket connection and attempting reconnect
  java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1068)
  ```
  This patch fixes this by using a `finally` block.
  Author: Josh Rosen <joshrosen@databricks.com> Closes #9864 from JoshRosen/close-zookeeper-client-in-tests.
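  A minimal sketch of the try/finally shape the patch applies, written against java.io.Closeable for self-containment (PersistenceEngine itself just defines a close() method):
  ```scala
  import java.io.Closeable

  // Run the test body, then release the resource even when the body throws,
  // so no ZooKeeper client outlives its suite.
  def withCloseable[T <: Closeable](resource: T)(body: T => Unit): Unit = {
    try body(resource) finally resource.close()
  }
  ```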
* [SPARK-11870][STREAMING][PYSPARK] Rethrow the exceptions in TransformFunction and TransformFunctionSerializer (Shixiong Zhu, 2015-11-20, 2 files, -0/+19)
  TransformFunction and TransformFunctionSerializer don't rethrow the exception, so when any exception happens, they just return None. This causes some weird NPEs and confuses people.
  Author: Shixiong Zhu <shixiong@databricks.com> Closes #9847 from zsxwing/pyspark-streaming-exception.
* [SPARK-11724][SQL] Change casting between int and timestamp to consistently treat int in seconds. (Nong Li, 2015-11-20, 8 files, -25/+39)
  Hive has since changed this behavior as well. https://issues.apache.org/jira/browse/HIVE-3454
  Author: Nong Li <nong@databricks.com> Author: Nong Li <nongli@gmail.com> Author: Yin Huai <yhuai@databricks.com> Closes #9685 from nongli/spark-11724.
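  Illustrative only, sketched against the modern SparkSession entry point rather than the 1.6-era SQLContext: with seconds-based semantics, an int cast to timestamp is read as seconds since the epoch, and casting back recovers the int.
  ```scala
  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().master("local").appName("cast-demo").getOrCreate()
  import spark.implicits._

  // 86400 seconds after the epoch is 1970-01-02 00:00:00 in UTC
  // (the displayed value depends on the session time zone).
  Seq(86400).toDF("secs")
    .selectExpr("cast(secs as timestamp) AS ts",
                "cast(cast(secs as timestamp) as int) AS roundtrip")
    .show()
  ```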
* [SPARK-11650] Reduce RPC timeouts to speed up slow AkkaUtilsSuite test (Josh Rosen, 2015-11-20, 1 file, -1/+2)
  This patch reduces some RPC timeouts in order to speed up the slow "AkkaUtilsSuite.remote fetch ssl on - untrusted server" test, which used to take two minutes to run.
  Author: Josh Rosen <joshrosen@databricks.com> Closes #9869 from JoshRosen/SPARK-11650.
* [SPARK-11819][SQL] nice error message for missing encoder (Wenchen Fan, 2015-11-20, 2 files, -23/+129)
  Before this PR, when users tried to get an encoder for an unsupported class, they only got a very terse error message like `Encoder for type xxx is not supported`. After this PR, the error message becomes friendlier, for example:
  ```
  No Encoder found for abc.xyz.NonEncodable
  - array element class: "abc.xyz.NonEncodable"
  - field (class: "scala.Array", name: "arrayField")
  - root class: "abc.xyz.AnotherClass"
  ```
  Author: Wenchen Fan <wenchen@databricks.com> Closes #9810 from cloud-fan/error-message.
* [SPARK-11817][SQL] Truncating the fractional seconds to prevent inserting a NULL (Liang-Chi Hsieh, 2015-11-20, 2 files, -0/+13)
  JIRA: https://issues.apache.org/jira/browse/SPARK-11817
  Instead of returning None, we should truncate the fractional seconds to prevent inserting NULL.
  Author: Liang-Chi Hsieh <viirya@appier.com> Closes #9834 from viirya/truncate-fractional-sec.
* [SPARK-11876][SQL] Support printSchema in DataSet API (gatorsmile, 2015-11-20, 2 files, -9/+9)
  DataSet APIs look great! However, I am lost when doing multiple-level joins. For example:
  ```
  val ds1 = Seq(("a", 1), ("b", 2)).toDS().as("a")
  val ds2 = Seq(("a", 1), ("b", 2)).toDS().as("b")
  val ds3 = Seq(("a", 1), ("b", 2)).toDS().as("c")
  ds1.joinWith(ds2, $"a._2" === $"b._2").as("ab").joinWith(ds3, $"ab._1._2" === $"c._2").printSchema()
  ```
  The printed schema is like:
  ```
  root
   |-- _1: struct (nullable = true)
   |    |-- _1: struct (nullable = true)
   |    |    |-- _1: string (nullable = true)
   |    |    |-- _2: integer (nullable = true)
   |    |-- _2: struct (nullable = true)
   |    |    |-- _1: string (nullable = true)
   |    |    |-- _2: integer (nullable = true)
   |-- _2: struct (nullable = true)
   |    |-- _1: string (nullable = true)
   |    |-- _2: integer (nullable = true)
  ```
  Personally, I think we need the printSchema function. Sometimes, I do not know how to specify the column, especially when their data types are mixed. For example, if I want to write the following select for the above multi-level join, I have to know the schema:
  ```
  newDS.select(expr("_1._2._2 + 1").as[Int]).collect()
  ```
  marmbrus rxin cloud-fan Do you have the same feeling?
  Author: gatorsmile <gatorsmile@gmail.com> Closes #9855 from gatorsmile/printSchemaDataSet.
* [SPARK-11689][ML] Add user guide and example code for LDA under spark.ml (Yuhao Yang, 2015-11-20, 5 files, -1/+204)
  jira: https://issues.apache.org/jira/browse/SPARK-11689
  Add a simple user guide for LDA under spark.ml and example code under examples/. Use include_example to include the example code in the user guide markdown. Check SPARK-11606 for instructions.
  Author: Yuhao Yang <hhbyyh@gmail.com> Closes #9722 from hhbyyh/ldaMLExample.
* [SPARK-11852][ML] StandardScaler minor refactor (Yanbo Liang, 2015-11-20, 2 files, -39/+32)
  ```withStd``` and ```withMean``` should be params of ```StandardScaler``` and ```StandardScalerModel```.
  Author: Yanbo Liang <ybliang8@gmail.com> Closes #9839 from yanboliang/standardScaler-refactor.
* [SPARK-11877] Prevent agg. fallback conf. from leaking across test suites (Josh Rosen, 2015-11-20, 1 file, -23/+21)
  This patch fixes an issue where the `spark.sql.TungstenAggregate.testFallbackStartsAt` SQLConf setting was not properly reset / cleared at the end of `TungstenAggregationQueryWithControlledFallbackSuite`. This ended up causing test failures in HiveCompatibilitySuite in Maven builds by causing spilling to occur way too frequently. This configuration leak was inadvertently introduced during test cleanup in #9618.
  Author: Josh Rosen <joshrosen@databricks.com> Closes #9857 from JoshRosen/clear-fallback-prop-in-test-teardown.
* [SPARK-11867] Add save/load for kmeans and naive bayes (Xusen Yin, 2015-11-19, 4 files, -28/+195)
  https://issues.apache.org/jira/browse/SPARK-11867
  Author: Xusen Yin <yinxusen@gmail.com> Closes #9849 from yinxusen/SPARK-11867.
* [SPARK-11869][ML] Clean up TempDirectory properly in ML tests (Joseph K. Bradley, 2015-11-19, 1 file, -1/+1)
  Need to remove the parent directory (```className```) rather than just tempDir (```className/random_name```). I tested this with IDFSuite, which has 2 read/write tests, and it fixes the problem.
  CC: mengxr Can you confirm this is fine? I believe it is, since the same ```random_name``` is used for all tests in a suite; we basically have an extra unneeded level of nesting.
  Author: Joseph K. Bradley <joseph@databricks.com> Closes #9851 from jkbradley/tempdir-cleanup.
* [SPARK-11875][ML][PYSPARK] Update doc for PySpark HasCheckpointInterval (Yanbo Liang, 2015-11-19, 2 files, -9/+11)
  * Update doc for PySpark ```HasCheckpointInterval``` so that users understand how to disable checkpointing.
  * Update doc for PySpark ```cacheNodeIds``` of ```DecisionTreeParams``` to note the relationship between ```cacheNodeIds``` and ```checkpointInterval```.
  Author: Yanbo Liang <ybliang8@gmail.com> Closes #9856 from yanboliang/spark-11875.
* [SPARK-11829][ML] Add read/write to estimators under ml.feature (II) (Yanbo Liang, 2015-11-19, 9 files, -33/+338)
  Add read/write support to the following estimators under spark.ml:
  * ChiSqSelector
  * PCA
  * VectorIndexer
  * Word2Vec
  Author: Yanbo Liang <ybliang8@gmail.com> Closes #9838 from yanboliang/spark-11829.
* [SPARK-11846] Add save/load for AFTSurvivalRegression and IsotonicRegression (Xusen Yin, 2015-11-19, 4 files, -22/+210)
  https://issues.apache.org/jira/browse/SPARK-11846
  mengxr
  Author: Xusen Yin <yinxusen@gmail.com> Closes #9836 from yinxusen/SPARK-11846.
* [SPARK-11544][SQL][TEST-HADOOP1.0] sqlContext doesn't use PathFilter (Dilip Biswal, 2015-11-19, 2 files, -7/+59)
  Apply the user-supplied PathFilter while retrieving the files from the filesystem.
  Author: Dilip Biswal <dbiswal@us.ibm.com> Closes #9830 from dilipbiswal/spark-11544.
* [SPARK-11864][SQL] Improve performance of max/min (Davies Liu, 2015-11-19, 5 files, -25/+45)
  This PR has the following optimizations:
  1) greatest/least already does the null check, so the `If` and `IsNull` are not necessary.
  2) In greatest/least, the result should be initialized using the first child (removing one block).
  3) For primitive types, the generated greater expression is too complicated (`(a > b ? 1 : (a < b ? -1 : 0)) > 0`); it should be as simple as `a > b`.
  Combined, these optimizations improve the performance of the `ss_max` query by 30%.
  Author: Davies Liu <davies@databricks.com> Closes #9846 from davies/improve_max.
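  The shape of optimization 3), sketched in plain Scala: the old generated code went through a three-way compare and then tested the sign, while the new code compares directly.
  ```scala
  // Old generated comparison: compute -1/0/1, then test the sign.
  def oldGreater(a: Int, b: Int): Boolean =
    (if (a > b) 1 else if (a < b) -1 else 0) > 0

  // New generated comparison: one primitive comparison.
  def newGreater(a: Int, b: Int): Boolean = a > b

  assert(oldGreater(3, 2) == newGreater(3, 2))
  ```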
* [SPARK-11845][STREAMING][TEST] Added unit test to verify TrackStateRDD is correctly checkpointed (Tathagata Das, 2015-11-19, 2 files, -204/+267)
  To make sure that all lineage is correctly truncated for TrackStateRDD when checkpointed.
  Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #9831 from tdas/SPARK-11845.
* [SPARK-4134][CORE] Lower severity of some executor loss logs. (Marcelo Vanzin, 2015-11-19, 5 files, -24/+45)
  Don't log ERROR messages when executors are explicitly killed or when the exit reason is not yet known.
  Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #9780 from vanzin/SPARK-11789.
* [SPARK-11275][SQL] Incorrect results when using rollup/cube (Andrew Ray, 2015-11-19, 3 files, -34/+90)
  Fixes a bug with grouping sets (including cube/rollup) where aggregates that included grouping expressions would return the wrong (null) result. Also simplifies the analyzer rule a bit and leaves column pruning to the optimizer. Added multiple unit tests to DataFrameAggregateSuite and verified it passes the hive compatibility suite:
  ```
  build/sbt -Phive -Dspark.hive.whitelist='groupby.*_grouping.*' 'test-only org.apache.spark.sql.hive.execution.HiveCompatibilitySuite'
  ```
  This is an alternative to PR https://github.com/apache/spark/pull/9419, but I think it's better, as it simplifies the analyzer rule instead of adding another special case to it.
  Author: Andrew Ray <ray.andrew@gmail.com> Closes #9815 from aray/groupingset-agg-fix.
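  A query of the shape the fix addresses, sketched with hypothetical data against the modern SparkSession API: an aggregate (`sum($"a")`) over one of the grouping expressions, which previously came back null for grouping-set rows.
  ```scala
  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.sum

  val spark = SparkSession.builder().master("local").appName("cube-demo").getOrCreate()
  import spark.implicits._

  val df = Seq((1, "x"), (1, "y"), (2, "x")).toDF("a", "b")
  // cube produces grouping-set rows; sum("a") must still aggregate correctly
  // even though "a" is also a grouping expression.
  df.cube($"a").agg(sum($"a")).show()
  ```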
* [SPARK-11746][CORE] Use cache-aware method dependencies (hushan, 2015-11-19, 1 file, -1/+1)
  A small change.
  Author: hushan <hushan@xiaomi.com> Closes #9691 from suyanNone/unify-getDependency.
* [SPARK-11828][CORE] Register DAGScheduler metrics source after app id is known. (Marcelo Vanzin, 2015-11-19, 2 files, -3/+2)
  Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #9820 from vanzin/SPARK-11828.
* [SPARK-11799][CORE] Make it explicit in executor logs that uncaught exceptions are thrown during executor shutdown (Srinivasa Reddy Vundela, 2015-11-19, 1 file, -1/+5)
  This commit makes sure that uncaught exception log messages are prepended with [Container in shutdown] when the JVM is shutting down.
  Author: Srinivasa Reddy Vundela <vsr@cloudera.com> Closes #9809 from vundela/master_11799.
* [SPARK-11831][CORE][TESTS] Use port 0 to avoid port conflicts in tests (Shixiong Zhu, 2015-11-19, 2 files, -14/+14)
  Use port 0 to fix port-contention-related flakiness.
  Author: Shixiong Zhu <shixiong@databricks.com> Closes #9841 from zsxwing/SPARK-11831.
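  The trick itself is standard socket behavior, shown here in plain Scala: binding to port 0 asks the OS for any free ephemeral port, so two tests can never contend for the same hard-coded port.
  ```scala
  import java.net.ServerSocket

  val socket = new ServerSocket(0)      // 0 = let the OS pick a free port
  val boundPort = socket.getLocalPort   // the port the OS actually assigned
  println(s"bound to $boundPort")
  socket.close()
  ```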
* [SPARK-11858][SQL] Move sql.columnar into sql.execution. (Reynold Xin, 2015-11-19, 30 files, -147/+155)
  In addition, tightened visibility of a lot of classes in the columnar package from private[sql] to private[columnar].
  Author: Reynold Xin <rxin@databricks.com> Closes #9842 from rxin/SPARK-11858.
* [SPARK-11812][PYSPARK] invFunc=None works properly with python's reduceByKeyAndWindow (David Tolpin, 2015-11-19, 2 files, -3/+14)
  invFunc is optional and can be None. Instead of invFunc (the parameter), invReduceFunc (a local function) was checked for truthiness (that is, not None, in this context). A local function is never None, thus the case of invFunc=None (a common one when inverse reduction is not defined) was treated incorrectly, resulting in loss of data. In addition, the docstring used wrong parameter names; this is also fixed.
  Author: David Tolpin <david.tolpin@gmail.com> Closes #9775 from dtolpin/master.
* [SPARK-11778][SQL] parse table name before it is passed to lookupRelation (Huaxin Gao, 2015-11-19, 2 files, -1/+12)
  Fixes a bug in DataFrameReader.table (a table with a schema name such as "db_name.table" doesn't work): use SqlParser.parseTableIdentifier to parse the table name before lookupRelation.
  Author: Huaxin Gao <huaxing@oc0558782468.ibm.com> Closes #9773 from huaxingao/spark-11778.
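  A sketch of the idea, assuming the 1.6-era `SqlParser.parseTableIdentifier` entry point: parse "db_name.table" into a `TableIdentifier` first, so lookupRelation receives the database and table parts separately instead of one opaque string.
  ```scala
  import org.apache.spark.sql.catalyst.SqlParser

  // Splits the qualified name into its database and table components.
  val ident = SqlParser.parseTableIdentifier("db_name.table")
  // ident.database == Some("db_name"), ident.table == "table"
  ```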