Commit log (each entry: commit message, then Author, Date, Files changed, Lines -/+)
* [SPARK-13425][SQL] Documentation for CSV datasource options (hyukjinkwon, 2016-05-01, 3 files changed, -4/+103)

## What changes were proposed in this pull request?
This PR adds the explanation and documentation for CSV options for reading and writing.

## How was this patch tested?
Style tests with `./dev/run_tests` for documentation style.

Author: hyukjinkwon <gurwls223@gmail.com>
Author: Hyukjin Kwon <gurwls223@gmail.com>

Closes #12817 from HyukjinKwon/SPARK-13425.
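As a rough illustration of the kind of usage those docs cover, reading and writing CSV with a couple of options might look like the sketch below (the `header` and `inferSchema` option names and the paths are illustrative examples, not a list taken from this commit):

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch, assuming a Spark 2.0-style SparkSession and the built-in CSV source.
val spark = SparkSession.builder().appName("csv-options-sketch").getOrCreate()

val people = spark.read
  .option("header", "true")       // treat the first line as column names
  .option("inferSchema", "true")  // infer column types from the data
  .csv("/path/to/people.csv")     // illustrative path

people.write
  .option("header", "true")
  .csv("/path/to/output")         // illustrative path
```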
* [SPARK-14931][ML][PYTHON] Mismatched default values between pipelines in Spark and PySpark - update (Xusen Yin, 2016-05-01, 11 files changed, -40/+96)

## What changes were proposed in this pull request?
This PR is an update for [https://github.com/apache/spark/pull/12738] which:
* Adds a generic unit test for JavaParams wrappers in pyspark.ml that checks default Param values against the defaults on the Scala side
* Various fixes for bugs found, including changing classes that take weightCol to treat unset and empty String Param values the same way

Defaults changed:
* Scala
  * LogisticRegression: weightCol defaults to not set (instead of empty string)
  * StringIndexer: labels default to not set (instead of empty array)
  * GeneralizedLinearRegression:
    * maxIter always defaults to 25 (simpler than defaulting to 25 for a particular solver)
    * weightCol defaults to not set (instead of empty string)
  * LinearRegression: weightCol defaults to not set (instead of empty string)
* Python
  * MultilayerPerceptron: layers default to not set (instead of [1,1])
  * ChiSqSelector: numTopFeatures defaults to 50 (instead of not set)

## How was this patch tested?
A generic unit test. Manually verified the test by changing defaults and confirming that it then failed.

Author: Joseph K. Bradley <joseph@databricks.com>
Author: yinxusen <yinxusen@gmail.com>

Closes #12816 from jkbradley/yinxusen-SPARK-14931.
* [SPARK-14505][CORE] Fix bug: creating two SparkContext objects in the same JVM, the first one cannot run any task (Allen, 2016-05-01, 2 files changed, -15/+16)

After attempting to create two SparkContext objects in the same JVM (the second one cannot be created successfully), using the first one to run a job throws an exception like the one below:
![image](https://cloud.githubusercontent.com/assets/7162889/14402832/0c8da2a6-fe73-11e5-8aba-68ee3ddaf605.png)

Author: Allen <yufan_1990@163.com>

Closes #12273 from the-sea/context-create-bug.
* [SPARK-15033][SQL] fix a flaky test in CachedTableSuite (Wenchen Fan, 2016-04-30, 2 files changed, -7/+14)

## What changes were proposed in this pull request?
The flakiness is caused by https://github.com/apache/spark/pull/12776, which removed the `synchronized` from all methods in `AccumulatorContext`. However, a test in `CachedTableSuite` synchronizes on `AccumulatorContext` and expects that no one else can change it, which is no longer true. This PR updates that test to not require a lock on `AccumulatorContext`.

## How was this patch tested?
N/A

Author: Wenchen Fan <wenchen@databricks.com>

Closes #12811 from cloud-fan/flaky.
* [SPARK-14143] Options for parsing NaNs, Infinity and nulls for numeric types (Hossein, 2016-04-30, 6 files changed, -42/+174)

1. Adds the following option for parsing NaNs: nanValue
2. Adds the following options for parsing infinity: positiveInf, negativeInf.

`TypeCast.castTo` is unit tested and an end-to-end test is added to `CSVSuite`.

Author: Hossein <hossein@databricks.com>

Closes #11947 from falaki/SPARK-14143.
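A sketch of how these options could be combined when reading a CSV file (the `nanValue`/`positiveInf`/`negativeInf` names come from this commit; the marker strings and path are made up, and a `SparkSession` named `spark` is assumed):

```scala
// Parse custom markers in numeric columns as NaN / +Infinity / -Infinity.
val measurements = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("nanValue", "NA")       // parse "NA" as NaN
  .option("positiveInf", "Inf")   // parse "Inf" as positive infinity
  .option("negativeInf", "-Inf")  // parse "-Inf" as negative infinity
  .csv("/path/to/measurements.csv")
```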
* [SPARK-15034][SPARK-15035][SPARK-15036][SQL] Use spark.sql.warehouse.dir as the warehouse location (Yin Huai, 2016-04-30, 10 files changed, -22/+236)

This PR contains three changes:
1. We will use spark.sql.warehouse.dir to set the warehouse location; hive.metastore.warehouse.dir is no longer used.
2. SessionCatalog needs to set the location of the default db. Otherwise, when creating a table in a SparkSession without Hive support, the default db's path will be an empty string.
3. When we create a database, we need to make the path qualified.

Tested with existing tests and new tests.

Author: Yin Huai <yhuai@databricks.com>

Closes #12812 from yhuai/warehouse.
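For example, with this change the warehouse location can be set through the Spark conf when building a session; a minimal sketch (the path is illustrative):

```scala
import org.apache.spark.sql.SparkSession

// spark.sql.warehouse.dir now controls where the default database and managed tables live.
val spark = SparkSession.builder()
  .appName("warehouse-location-sketch")
  .config("spark.sql.warehouse.dir", "/data/spark-warehouse")  // illustrative path
  .getOrCreate()
```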
* [SPARK-15030][ML][SPARKR] Support formula in spark.kmeans in SparkR (Yanbo Liang, 2016-04-30, 9 files changed, -50/+87)

## What changes were proposed in this pull request?
* ```RFormula``` supports an empty response variable, such as ```~ x + y```.
* Support formula in ```spark.kmeans``` in SparkR.
* Fix some outdated docs for SparkR.

## How was this patch tested?
Unit tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12813 from yanboliang/spark-15030.
* [SPARK-14952][CORE][ML] Remove methods that were deprecated in 1.6.0 (Herman van Hovell, 2016-04-30, 6 files changed, -34/+5)

#### What changes were proposed in this pull request?
This PR removes three methods that were deprecated in 1.6.0:
- `PortableDataStream.close()`
- `LinearRegression.weights`
- `LogisticRegression.weights`

The rationale for doing this is that the impact is small and that Spark 2.0 is a major release.

#### How was this patch tested?
Compilation succeeded.

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #12732 from hvanhovell/SPARK-14952.
* [SPARK-14653][ML] Remove json4s from mllib-local (Xiangrui Meng, 2016-04-30, 5 files changed, -62/+103)

## What changes were proposed in this pull request?
This PR moves Vector.toJson/fromJson to ml.linalg.VectorEncoder under mllib/ to keep mllib-local's dependencies minimal. The JSON encoding is used by Params, so we still need this feature in SPARK-14615, where we will switch to ml.linalg in the spark.ml APIs.

## How was this patch tested?
Copied existing unit tests over.

cc: dbtsai

Author: Xiangrui Meng <meng@databricks.com>

Closes #12802 from mengxr/SPARK-14653.
* [SPARK-13289][MLLIB] Fix infinite distances between word vectors in Word2VecModel (Junyang, 2016-04-30, 3 files changed, -17/+37)

## What changes were proposed in this pull request?
This PR fixes the bug that generates infinite distances between word vectors. For example, before this PR,
```
val synonyms = model.findSynonyms("who", 40)
```
gives the following results:
```
to Infinity
and Infinity
that Infinity
with Infinity
```
With this PR, the distance between words is a value between 0 and 1, as follows:
```
scala> model.findSynonyms("who", 10)
res0: Array[(String, Double)] = Array((Harvard-educated,0.5253688097000122), (ex-SAS,0.5213794708251953), (McMutrie,0.5187736749649048), (fellow,0.5166833400726318), (businessman,0.5145374536514282), (American-born,0.5127736330032349), (British-born,0.5062344074249268), (gray-bearded,0.5047978162765503), (American-educated,0.5035858750343323), (mentored,0.49849334359169006))

scala> model.findSynonyms("king", 10)
res1: Array[(String, Double)] = Array((queen,0.6787897944450378), (prince,0.6786158084869385), (monarch,0.659771203994751), (emperor,0.6490438580513), (goddess,0.643266499042511), (dynasty,0.635733425617218), (sultan,0.6166239380836487), (pharaoh,0.6150713562965393), (birthplace,0.6143025159835815), (empress,0.6109727025032043))

scala> model.findSynonyms("queen", 10)
res2: Array[(String, Double)] = Array((princess,0.7670737504959106), (godmother,0.6982434988021851), (raven-haired,0.6877717971801758), (swan,0.684934139251709), (hunky,0.6816608309745789), (Titania,0.6808111071586609), (heroine,0.6794036030769348), (king,0.6787897944450378), (diva,0.67848801612854), (lip-synching,0.6731793284416199))
```

### There are two places changed in this PR:
- Normalize the word vectors to avoid overflow when calculating inner products between word vectors. This also simplifies the distance calculation, since the word vectors only need to be normalized once.
- Scale the learning rate by the number of iterations, to be consistent with the Google Word2Vec implementation.

## How was this patch tested?
Used word2vec to train a text corpus, then ran model.findSynonyms() to check the distances between word vectors.

Author: Junyang <fly.shenjy@gmail.com>
Author: flyskyfly <fly.shenjy@gmail.com>

Closes #11812 from flyjy/TVec.
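The first change boils down to standard cosine similarity: once vectors are unit-normalized, a dot product stays in [-1, 1] and cannot blow up to Infinity. A self-contained sketch of that idea (plain Scala, not the actual Word2VecModel code):

```scala
// Cosine similarity via up-front normalization: with unit vectors the result is a plain
// dot product bounded by [-1, 1], so no intermediate value can overflow to Infinity.
def cosineSimilarity(a: Array[Double], b: Array[Double]): Double = {
  require(a.length == b.length, "vectors must have the same dimension")
  def norm(v: Array[Double]): Double = math.sqrt(v.map(x => x * x).sum)
  val (na, nb) = (norm(a), norm(b))
  if (na == 0.0 || nb == 0.0) 0.0
  else a.zip(b).map { case (x, y) => x * y }.sum / (na * nb)
}
```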
* [SPARK-13973][PYSPARK] Make pyspark fail noisily if IPYTHON or IPYTHON_OPTS are set (pshearer, 2016-04-30, 2 files changed, -25/+18)

## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-13973

Following discussion with srowen, the IPYTHON and IPYTHON_OPTS variables are removed. If they are set in the user's environment, pyspark will not execute and prints an error message. Failing noisily will force users to remove these options and learn the new configuration scheme, which is much more sustainable and less confusing.

## How was this patch tested?
Manual testing; set IPYTHON=1 and verified that the error message prints.

Author: pshearer <pshearer@massmutual.com>
Author: shearerp <shearerp@umich.edu>

Closes #12528 from shearerp/master.
* [SPARK-15028][SQL] Remove HiveSessionState.setDefaultOverrideConfs (Reynold Xin, 2016-04-30, 7 files changed, -56/+9)

## What changes were proposed in this pull request?
This patch removes some code that is no longer relevant -- mainly HiveSessionState.setDefaultOverrideConfs.

## How was this patch tested?
N/A

Author: Reynold Xin <rxin@databricks.com>

Closes #12806 from rxin/SPARK-15028.
* [SPARK-14831][.2][ML][R] rename ml.save/ml.load to write.ml/read.ml (Xiangrui Meng, 2016-04-30, 4 files changed, -44/+44)

## What changes were proposed in this pull request?
Continue the work of #12789 to rename ml.save/ml.load to write.ml/read.ml, which are more consistent with read.df/write.df and other methods in SparkR. I didn't rename `data` to `df` because we still use `predict` for prediction, which uses `newData` to match the signature in R.

## How was this patch tested?
Existing unit tests.

cc: yanboliang thunterdb

Author: Xiangrui Meng <meng@databricks.com>

Closes #12807 from mengxr/SPARK-14831.
* [SPARK-14412][.2][ML] rename *RDDStorageLevel to *StorageLevel in ml.ALS (Xiangrui Meng, 2016-04-30, 4 files changed, -62/+62)

## What changes were proposed in this pull request?
As discussed in #12660, this PR renames:
* intermediateRDDStorageLevel -> intermediateStorageLevel
* finalRDDStorageLevel -> finalStorageLevel

The argument name in `ALS.train` will be addressed in SPARK-15027.

## How was this patch tested?
Existing unit tests.

Author: Xiangrui Meng <meng@databricks.com>

Closes #12803 from mengxr/SPARK-14412.
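With the rename, the Scala-side setters on `ml.ALS` read as in the sketch below (the values are illustrative, and this assumes the Params keep taking the storage level name as a string):

```scala
import org.apache.spark.ml.recommendation.ALS

// Formerly setIntermediateRDDStorageLevel / setFinalRDDStorageLevel.
val als = new ALS()
  .setRank(10)
  .setMaxIter(5)
  .setIntermediateStorageLevel("MEMORY_AND_DISK")
  .setFinalStorageLevel("MEMORY_ONLY")
```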
* [SPARK-14533][MLLIB] RowMatrix.computeCovariance inaccurate when values are very large (partial fix) (Sean Owen, 2016-04-30, 4 files changed, -27/+26)

## What changes were proposed in this pull request?
Fix for part of SPARK-14533: a trivial simplification and more accurate computation of column means. See also https://github.com/apache/spark/pull/12299, which contained a complete fix that was very slow. This PR does _not_ resolve SPARK-14533 entirely.

## How was this patch tested?
Existing tests.

Author: Sean Owen <sowen@cloudera.com>

Closes #12779 from srowen/SPARK-14533.2.
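The numerical issue here is the classic one: a one-pass formula like cov(x, y) = E[xy] - E[x]E[y] subtracts two huge, nearly equal terms when values are very large. A small sketch of the mean-shifted (two-pass) alternative, which keeps the intermediates small (this illustrates the general idea, not the literal RowMatrix patch):

```scala
// Two-pass sample covariance: center each column by its mean first, then accumulate.
def covTwoPass(x: Array[Double], y: Array[Double]): Double = {
  require(x.length == y.length && x.length > 1)
  val n = x.length
  val meanX = x.sum / n
  val meanY = y.sum / n
  x.zip(y).map { case (a, b) => (a - meanX) * (b - meanY) }.sum / (n - 1)
}
```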
* [MINOR][EXAMPLE] Use SparkSession instead of SQLContext in RDDRelation.scala (Dongjoon Hyun, 2016-04-30, 1 file changed, -10/+10)

## What changes were proposed in this pull request?
Now that `SQLContext` is kept only for backward compatibility, we had better use `SparkSession` in the Spark 2.0 examples.

## How was this patch tested?
It's just an example change. After building, run `bin/run-example org.apache.spark.examples.sql.RDDRelation`.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12808 from dongjoon-hyun/rddrelation.
* [SPARK-14850][.2][ML] use UnsafeArrayData.fromPrimitiveArray in ml.VectorUDT/MatrixUDT (Xiangrui Meng, 2016-04-29, 2 files changed, -11/+9)

## What changes were proposed in this pull request?
This PR uses `UnsafeArrayData.fromPrimitiveArray` to implement `ml.VectorUDT/MatrixUDT` and avoid boxing/unboxing.

## How was this patch tested?
Existing unit tests.

cc: cloud-fan

Author: Xiangrui Meng <meng@databricks.com>

Closes #12805 from mengxr/SPARK-14850.
* [SPARK-14391][LAUNCHER] Fix launcher communication test, take 2. (Marcelo Vanzin, 2016-04-29, 2 files changed, -3/+2)

There's actually a race here: the state of the handler was changed before the connection was set, so the test code could be notified of the state change, wake up, and still see the connection as null, triggering the assert.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #12785 from vanzin/SPARK-14391.
* [SPARK-14831][SPARKR] Make the SparkR MLlib API more consistent with Spark (Timothy Hunter, 2016-04-29, 4 files changed, -72/+247)

## What changes were proposed in this pull request?
This PR splits the MLlib algorithms into two flavors:
- the R flavor, which tries to mimic the existing R API for these algorithms (and works as an S4 specialization for Spark dataframes)
- the Spark flavor, which follows the same API and naming conventions as the rest of the MLlib algorithms in the other languages

In practice, the former calls the latter.

## How was this patch tested?
The tests for the various algorithms were adapted to be run against both interfaces.

Author: Timothy Hunter <timhunter@databricks.com>

Closes #12789 from thunterdb/14831.
* [SPARK-14850][ML] convert primitive array from/to unsafe array directly in VectorUDT/MatrixUDT (Wenchen Fan, 2016-04-29, 6 files changed, -14/+186)

## What changes were proposed in this pull request?
This PR adds `fromPrimitiveArray` and `toPrimitiveArray` to `UnsafeArrayData`, so that we can do the conversion much faster in VectorUDT/MatrixUDT.

## How was this patch tested?
Existing tests and a new test suite, `UnsafeArraySuite`.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #12640 from cloud-fan/ml.
* [SPARK-13667][SQL] Support for specifying custom date format for date and timestamp types at CSV datasource (hyukjinkwon, 2016-04-29, 8 files changed, -66/+173)

## What changes were proposed in this pull request?
This PR adds support for specifying a custom date format for `DateType` and `TimestampType`.

For `TimestampType`, this uses the given format to infer the schema and also to convert the values. For `DateType`, this uses the given format to convert the values. If `dateFormat` is not given, then it works with `DateTimeUtils.stringToTime()` for backwards compatibility. When it is given, it uses `SimpleDateFormat` for parsing the data.

In addition, `IntegerType`, `DoubleType` and `LongType` have a higher priority than `TimestampType` in type inference. This means that even if the given format is `yyyy` or `yyyy.MM`, it will be inferred as `IntegerType` or `DoubleType`. Since it is type inference, I think it is okay to give such precedence.

In addition, I renamed `csv.CSVInferSchema` to `csv.InferSchema`, as the JSON datasource has `json.InferSchema`. Although they have the same names, I did this because I thought the parent package name can still differentiate each. Accordingly, the suite name was also changed from `CSVInferSchemaSuite` to `InferSchemaSuite`.

## How was this patch tested?
Unit tests, plus `./dev/run_tests` for coding style tests.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #11550 from HyukjinKwon/SPARK-13667.
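A sketch of supplying the new `dateFormat` option when reading CSV (assumes a `SparkSession` named `spark`; the pattern and path are illustrative):

```scala
// With dateFormat set, date/timestamp values are parsed with SimpleDateFormat using this
// pattern instead of the default DateTimeUtils.stringToTime() path.
val events = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("dateFormat", "yyyy/MM/dd HH:mm")
  .csv("/path/to/events.csv")
```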
* [SPARK-14591][SQL] Remove DataTypeParser and add more keywords to the nonReserved list (Yin Huai, 2016-04-29, 9 files changed, -232/+26)

## What changes were proposed in this pull request?
CatalystSqlParser can parse data types, so we do not need an individual DataTypeParser.

## How was this patch tested?
Existing tests.

Author: Yin Huai <yhuai@databricks.com>

Closes #12796 from yhuai/removeDataTypeParser.
* [SPARK-14757] [SQL] Fix nullability bug in EqualNullSafe codegen (Reynold Xin, 2016-04-29, 2 files changed, -2/+3)

## What changes were proposed in this pull request?
This patch fixes a null handling bug in EqualNullSafe's code generation.

## How was this patch tested?
Updated unit tests so they would fail without the fix.

Closes #12628.

Author: Reynold Xin <rxin@databricks.com>
Author: Arash Nabili <arash@levyx.com>

Closes #12799 from rxin/equalnullsafe.
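For context, `EqualNullSafe` is the `<=>` operator: unlike plain `=`, it returns false rather than null when exactly one side is null, and true when both sides are null. A small sketch of the semantics the fixed codegen has to honour (assumes a `SparkSession` named `spark`):

```scala
import spark.implicits._

val pairs = Seq[(Option[Int], Option[Int])]((Some(1), Some(1)), (Some(1), None), (None, None))
  .toDF("a", "b")

// "a <=> b" is true for (1, 1), false for (1, null), and true for (null, null);
// plain "a = b" would return null for the rows containing nulls.
pairs.selectExpr("a", "b", "a = b", "a <=> b").show()
```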
* [SPARK-14412][ML][PYSPARK] Add StorageLevel params to ALS (Nick Pentreath, 2016-04-29, 4 files changed, -11/+209)

`mllib` `ALS` supports `setIntermediateRDDStorageLevel` and `setFinalRDDStorageLevel`. This PR adds these as Params in `ml` `ALS`. They are put in group **expertParam** since few users will need them.

## How was this patch tested?
New test cases in `ALSSuite` and `tests.py`.

cc yanboliang jkbradley sethah rishabhbhardwaj

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #12660 from MLnick/SPARK-14412-als-storage-params.
* [SPARK-14917][SQL] Enable some ORC compression tests for writing (hyukjinkwon, 2016-04-29, 1 file changed, -29/+33)

## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-14917

As described in the JIRA, it seems Hive 1.2.1, which Spark uses now, supports snappy and none. So, this PR enables some tests for writing ORC files with the compression codecs `SNAPPY` and `NONE`.

## How was this patch tested?
Unit tests in `OrcQuerySuite` and `sbt scalastyle`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #12699 from HyukjinKwon/SPARK-14917.
* [SPARK-13786][ML][PYTHON] Removed save/load for python tuning (Joseph K. Bradley, 2016-04-29, 2 files changed, -262/+21)

## What changes were proposed in this pull request?
Per discussion on [https://github.com/apache/spark/pull/12604], this removes ML persistence for Python tuning (TrainValidationSplit, CrossValidator, and their Models) since they do not handle nesting easily. This support should be re-designed and added in the next release.

## How was this patch tested?
Removed unit test elements saving and loading the tuning algorithms, but kept tests to save and load their bestModel fields.

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12782 from jkbradley/remove-python-tuning-saveload.
* [SPARK-15012][SQL] Simplify configuration API further (Andrew Or, 2016-04-29, 18 files changed, -187/+108)

## What changes were proposed in this pull request?
1. Remove all the `spark.setConf` etc. Just expose `spark.conf`.
2. Make `spark.conf` take in things set in the core `SparkConf` as well; otherwise users may get confused.

This was done for both the Python and Scala APIs.

## How was this patch tested?
`SQLConfSuite`, python tests.

This one fixes the failed tests in #12787.

Closes #12787

Author: Andrew Or <andrew@databricks.com>
Author: Yin Huai <yhuai@databricks.com>

Closes #12798 from yhuai/conf-api.
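After this change, runtime configuration is read and written through `spark.conf` rather than `spark.setConf`; a minimal sketch (the key and value are just an example, and a `SparkSession` named `spark` is assumed):

```scala
// RuntimeConfig: settings from the core SparkConf are visible here too after this change.
spark.conf.set("spark.sql.shuffle.partitions", "400")
val partitions = spark.conf.get("spark.sql.shuffle.partitions")
```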
* [SPARK-15010][CORE] new accumulator should be tolerant of local RPC message delivery (Wenchen Fan, 2016-04-29, 1 file changed, -2/+7)

## What changes were proposed in this pull request?
The RPC framework does not serialize and deserialize messages in local mode, so we should not call `acc.value` when receiving a heartbeat message, because the serialization hook of the new accumulator may not have been triggered and the `atDriverSide` flag may not be set.

## How was this patch tested?
Tested it locally via spark shell.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #12795 from cloud-fan/bug.
* [SPARK-15019][SQL] Propagate all Spark Confs to HiveConf created in HiveClientImpl (Yin Huai, 2016-04-29, 7 files changed, -40/+52)

## What changes were proposed in this pull request?
This PR makes two changes:
1. We will propagate Spark Confs to the HiveConf created in HiveClientImpl, so users can also use the Spark conf to set the warehouse location and metastore url.
2. In sql/hive, HiveClientImpl will be the only place where we create a new HiveConf.

## How was this patch tested?
Existing tests.

Author: Yin Huai <yhuai@databricks.com>

Closes #12791 from yhuai/onlyUseHiveConfInHiveClientImpl.
* [SPARK-15003] Use ConcurrentHashMap in place of HashMap for NewAccumulator.originals (tedyu, 2016-04-30, 2 files changed, -13/+10)

## What changes were proposed in this pull request?
This PR proposes to use ConcurrentHashMap in place of HashMap for NewAccumulator.originals. This should result in better performance.

## How was this patch tested?
Existing unit test suite.

cc cloud-fan

Author: tedyu <yuzhihong@gmail.com>

Closes #12776 from tedyu/master.
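A sketch of the general pattern (not the actual Spark internals): a registry keyed by accumulator id backed by a `ConcurrentHashMap`, so concurrent register/lookup calls need no external synchronization:

```scala
import java.util.concurrent.ConcurrentHashMap

// Hypothetical registry illustrating the HashMap -> ConcurrentHashMap swap.
class AccumulatorRegistry[T] {
  private val originals = new ConcurrentHashMap[Long, T]()
  def register(id: Long, acc: T): Unit = originals.putIfAbsent(id, acc)
  def get(id: Long): Option[T] = Option(originals.get(id))
  def remove(id: Long): Unit = originals.remove(id)
}
```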
* [SPARK-14858] [SQL] Enable subquery pushdown (Herman van Hovell, 2016-04-29, 13 files changed, -318/+390)

The previous subquery PRs did not include support for pushing down subqueries used in filters (`WHERE`/`HAVING`). This PR adds this support. For example:
```scala
range(0, 10).registerTempTable("a")
range(5, 15).registerTempTable("b")
range(7, 25).registerTempTable("c")
range(3, 12).registerTempTable("d")
val plan = sql("select * from a join b on a.id = b.id left join c on c.id = b.id where a.id in (select id from d)")
plan.explain(true)
```
Leads to the following Analyzed & Optimized plans:
```
== Parsed Logical Plan ==
...

== Analyzed Logical Plan ==
id: bigint, id: bigint, id: bigint
Project [id#0L,id#4L,id#8L]
+- Filter predicate-subquery#16 [(id#0L = id#12L)]
   :  +- SubqueryAlias predicate-subquery#16 [(id#0L = id#12L)]
   :     +- Project [id#12L]
   :        +- SubqueryAlias d
   :           +- Range 3, 12, 1, 8, [id#12L]
   +- Join LeftOuter, Some((id#8L = id#4L))
      :- Join Inner, Some((id#0L = id#4L))
      :  :- SubqueryAlias a
      :  :  +- Range 0, 10, 1, 8, [id#0L]
      :  +- SubqueryAlias b
      :     +- Range 5, 15, 1, 8, [id#4L]
      +- SubqueryAlias c
         +- Range 7, 25, 1, 8, [id#8L]

== Optimized Logical Plan ==
Join LeftOuter, Some((id#8L = id#4L))
:- Join Inner, Some((id#0L = id#4L))
:  :- Join LeftSemi, Some((id#0L = id#12L))
:  :  :- Range 0, 10, 1, 8, [id#0L]
:  :  +- Range 3, 12, 1, 8, [id#12L]
:  +- Range 5, 15, 1, 8, [id#4L]
+- Range 7, 25, 1, 8, [id#8L]

== Physical Plan ==
...
```
I have also taken the opportunity to move quite a bit of code around:
- Rewriting subqueries and pulling out correlated predicates from subqueries has been moved into the analyzer. The analyzer transforms `Exists` and `InSubQuery` into `PredicateSubquery` expressions. A PredicateSubquery exposes the 'join' expressions and the proper references. This makes things like type coercion, optimization and planning easier to do.
- I have added support for `Aggregate` plans in subqueries. Any correlated expressions will be added to the grouping expressions. I have removed support for `Union` plans, since pulling in an outer reference from beneath a Union has no value (a filtered value could easily be part of another Union child).
- Resolution of subqueries is now done using `OuterReference`s. These are used to wrap any outer reference; this makes the identification of these references easier, and also makes dealing with duplicate attributes in the outer and inner plans easier. The resolution of subqueries initially used a resolution loop which would alternate between calling the analyzer and trying to resolve the outer references. We now use a dedicated analyzer which uses a special rule for outer reference resolution.

These changes are a stepping stone for enabling correlated scalar subqueries, enabling all Hive tests, and allowing us to use predicate subqueries anywhere.

Tested with current tests and added test cases in FilterPushdownSuite.

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #12720 from hvanhovell/SPARK-14858.
* [SPARK-14646][ML] Modified Kmeans to store cluster centers with one per row (Joseph K. Bradley, 2016-04-29, 1 file changed, -6/+13)

## What changes were proposed in this pull request?
Modified Kmeans to store cluster centers with one per row.

## How was this patch tested?
Existing tests.

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12792 from jkbradley/kmeans-save-fix.
* [SPARK-14988][PYTHON] SparkSession API follow-ups (Andrew Or, 2016-04-29, 11 files changed, -213/+256)

## What changes were proposed in this pull request?
Addresses comments in #12765.

## How was this patch tested?
Python tests.

Author: Andrew Or <andrew@databricks.com>

Closes #12784 from andrewor14/python-followup.
* [SPARK-12919][SPARKR] Implement dapply() on DataFrame in SparkR (Sun Rui, 2016-04-29, 15 files changed, -15/+337)

## What changes were proposed in this pull request?
dapply() applies an R function on each partition of a DataFrame and returns a new DataFrame.

The function signature is `dapply(df, function(localDF) {}, schema = NULL)`.

R function input: a local data.frame from the partition on the local node.
R function output: a local data.frame.

Schema specifies the Row format of the resulting DataFrame. It must match the R function's output. If schema is not specified, each partition of the result DataFrame will be serialized in R into a single byte array. Such a resulting DataFrame can be processed by successive calls to dapply().

## How was this patch tested?
SparkR unit tests.

Author: Sun Rui <rui.sun@intel.com>
Author: Sun Rui <sunrui2016@gmail.com>

Closes #12493 from sun-rui/SPARK-12919.
* [SPARK-14570][ML] Log instrumentation in Random forests (BenFradet, 2016-04-29, 8 files changed, -33/+81)

## What changes were proposed in this pull request?
Added Instrumentation logging to DecisionTree{Classifier,Regressor} and RandomForest{Classifier,Regressor}.

## How was this patch tested?
No tests involved since it's logging related.

Author: BenFradet <benjamin.fradet@gmail.com>

Closes #12536 from BenFradet/SPARK-14570.
* [SPARK-15013][SQL] Remove hiveConf from HiveSessionState (Yin Huai, 2016-04-29, 2 files changed, -30/+1)

## What changes were proposed in this pull request?
The hiveConf in HiveSessionState is not actually used anymore. Let's remove it.

## How was this patch tested?
Existing tests.

Author: Yin Huai <yhuai@databricks.com>

Closes #12786 from yhuai/removeHiveConf.
* [SPARK-14981][SQL] Throws exception if DESC is specified for sorting columns (Cheng Lian, 2016-04-29, 3 files changed, -15/+41)

## What changes were proposed in this pull request?
Currently Spark SQL doesn't support sorting columns in descending order. However, the parser accepts the syntax and silently drops the sorting direction. This PR fixes this by throwing an exception if `DESC` is specified as the sorting direction of a sorting column.

## How was this patch tested?
A test case is added that checks the exception message for an invalid sorting order.

Author: Cheng Lian <lian@databricks.com>

Closes #12759 from liancheng/spark-14981.
* [SPARK-15004][SQL] Remove zookeeper service discovery code in thrift-server (Reynold Xin, 2016-04-29, 5 files changed, -506/+6)

## What changes were proposed in this pull request?
We recently inlined Hive's thrift server code in SPARK-14987. This patch removes the code related to zookeeper service discovery, Tez, and Hive on Spark, since they are irrelevant.

## How was this patch tested?
N/A - removing dead code.

Author: Reynold Xin <rxin@databricks.com>

Closes #12780 from rxin/SPARK-15004.
* [SPARK-15011][SQL][TEST] Ignore org.apache.spark.sql.hive.StatisticsSuite.analyze MetastoreRelation (Yin Huai, 2016-04-29, 1 file changed, -1/+1)

This test always fails with sbt's hadoop 2.3 and 2.4 tests. Let's disable it for now and investigate the problem.

Author: Yin Huai <yhuai@databricks.com>

Closes #12783 from yhuai/SPARK-15011-ignore.
* [SPARK-11940][PYSPARK][ML] Python API for ml.clustering.LDA PR2 (Jeff Zhang, 2016-04-29, 3 files changed, -6/+546)

## What changes were proposed in this pull request?
pyspark.ml API for LDA:
* LDA, LDAModel, LocalLDAModel, DistributedLDAModel
* includes persistence

This replaces [https://github.com/apache/spark/pull/10242].

## How was this patch tested?
* doc test for LDA, including Param setters
* unit test for persistence

Author: Joseph K. Bradley <joseph@databricks.com>
Author: Jeff Zhang <zjffdu@apache.org>

Closes #12723 from jkbradley/zjffdu-SPARK-11940.
* [SPARK-14984][ML] Deprecated model field in LinearRegressionSummary (Joseph K. Bradley, 2016-04-29, 2 files changed, -4/+5)

## What changes were proposed in this pull request?
Deprecated the model field in LinearRegressionSummary and removed unnecessary Since annotations.

## How was this patch tested?
Existing tests.

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12763 from jkbradley/lr-summary-api.
* [SPARK-14314][SPARK-14315][ML][SPARKR] Model persistence in SparkR (glm & kmeans) (Yanbo Liang, 2016-04-29, 7 files changed, -76/+315)

SparkR ```glm``` and ```kmeans``` model persistence.

Unit tests.

Author: Yanbo Liang <ybliang8@gmail.com>
Author: Gayathri Murali <gayathri.m.softie@gmail.com>

Closes #12778 from yanboliang/spark-14311.
Closes #12680
Closes #12683
* [SPARK-14988][PYTHON] SparkSession catalog and conf API (Andrew Or, 2016-04-29, 7 files changed, -87/+611)

## What changes were proposed in this pull request?
The `catalog` and `conf` APIs were exposed in `SparkSession` in #12713 and #12669. This patch adds those to the Python API.

## How was this patch tested?
Python tests.

Author: Andrew Or <andrew@databricks.com>

Closes #12765 from andrewor14/python-spark-session-more.
* [SPARK-14987][SQL] inline hive-service (cli) into sql/hive-thriftserver (Davies Liu, 2016-04-29, 181 files changed, -78/+69973)

## What changes were proposed in this pull request?
This PR copies the thrift-server from hive-service-1.2 (including TCLIService.thrift and the generated Java source code) into sql/hive-thriftserver, so we can do further cleanup and improvements.

## How was this patch tested?
Existing tests.

Author: Davies Liu <davies@databricks.com>

Closes #12764 from davies/thrift_server.
* [SPARK-14571][ML] Log instrumentation in ALS (wm624@hotmail.com, 2016-04-29, 2 files changed, -0/+12)

## What changes were proposed in this pull request?
Add log instrumentation for the parameters rank, numUserBlocks, numItemBlocks, implicitPrefs, alpha, userCol, itemCol, ratingCol, predictionCol, maxIter, regParam, nonnegative, checkpointInterval, and seed, as well as for numUserFeatures and numItemFeatures.

## How was this patch tested?
Manual test: set a breakpoint in IntelliJ and run def testALS(). Single-step debugging and check that the log method is called.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #12560 from wangmiao1981/log.
* [SPARK-14969][MLLIB] Remove duplicate implementation of compute in LogisticGradient (dding3, 2016-04-29, 2 files changed, -13/+0)

## What changes were proposed in this pull request?
This PR removes the duplicate implementation of compute in the LogisticGradient class.

## How was this patch tested?
Unit tests.

Author: dding3 <dingding@dingding-ubuntu.sh.intel.com>

Closes #12747 from dding3/master.
* [SPARK-14511][BUILD] Upgrade genjavadoc to latest upstream (Jakob Odersky, 2016-04-29, 1 file changed, -3/+7)

## What changes were proposed in this pull request?
In the past, genjavadoc had issues with package-private members, which led the Spark project to use a forked version. This issue has been fixed upstream (typesafehub/genjavadoc#70) and a release is available for Scala versions 2.10, 2.11 **and 2.12**, hence a forked version for Spark is no longer necessary. This pull request updates the build configuration to use the newest upstream genjavadoc.

## How was this patch tested?
The build was run with `sbt unidoc`. During the process javadoc emits some errors on the generated Java stubs; however, these errors were also present before the upgrade. Furthermore, the produced HTML is fine.

Author: Jakob Odersky <jakob@odersky.com>

Closes #12707 from jodersky/SPARK-14511-genjavadoc.
* [SPARK-14994][SQL] Remove execution hive from HiveSessionState (Reynold Xin, 2016-04-29, 20 files changed, -309/+327)

## What changes were proposed in this pull request?
This patch removes executionHive from HiveSessionState and HiveSharedState.

## How was this patch tested?
Updated test cases.

Author: Reynold Xin <rxin@databricks.com>
Author: Yin Huai <yhuai@databricks.com>

Closes #12770 from rxin/SPARK-14994.
* [SPARK-14996][SQL] Add TPCDS Benchmark Queries for SparkSQL (Sameer Agarwal, 2016-04-29, 1 file changed, -0/+1225)

## What changes were proposed in this pull request?
This PR adds support for easily running and benchmarking a set of common TPCDS queries locally in SparkSQL.

## How was this patch tested?
N/A

Author: Sameer Agarwal <sameer@databricks.com>

Closes #12771 from sameeragarwal/tpcds-2.
* [SPARK-12660][SPARK-14967][SQL] Implement Except Distinct by Left Anti Join (gatorsmile, 2016-04-29, 12 files changed, -111/+132)

#### What changes were proposed in this pull request?
Replaces a logical `Except` operator with a `Left-anti Join` operator. This way, we can take advantage of all the benefits of join implementations (e.g. managed memory, code generation, broadcast joins).
```SQL
SELECT a1, a2 FROM Tab1 EXCEPT SELECT b1, b2 FROM Tab2
==> SELECT DISTINCT a1, a2 FROM Tab1 LEFT ANTI JOIN Tab2 ON a1<=>b1 AND a2<=>b2
```
Note:
1. This rule is only applicable to EXCEPT DISTINCT. Do not use it for EXCEPT ALL.
2. This rule has to be done after de-duplicating the attributes; otherwise, the generated join conditions will be incorrect.

This PR also corrects the existing behavior in Spark. Before this PR, the behavior was like:
```scala
test("except") {
  val df_left = Seq(1, 2, 2, 3, 3, 4).toDF("id")
  val df_right = Seq(1, 3).toDF("id")

  checkAnswer(
    df_left.except(df_right),
    Row(2) :: Row(2) :: Row(4) :: Nil
  )
}
```
After this PR, the result is corrected. We strictly follow the SQL compliance of `Except Distinct`.

#### How was this patch tested?
Modified and added a few test cases to verify the optimization rule and the results of operators.

Author: gatorsmile <gatorsmile@gmail.com>

Closes #12736 from gatorsmile/exceptByAntiJoin.