Commit message | Author | Age | Files | Lines
* [SPARK-15803] [PYSPARK] Support with statement syntax for SparkSession | Jeff Zhang | 2016-06-17 | 1 file | -0/+16
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Support with statement syntax for SparkSession in PySpark. ## How was this patch tested? Manually verified. Although a unit test could be added, it would affect other unit tests because the SparkContext is stopped after the with statement. Author: Jeff Zhang <zjffdu@apache.org> Closes #13541 from zjffdu/SPARK-15803.
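A minimal sketch of the usage this commit enables in PySpark; the local master and app name are illustrative:

```python
from pyspark.sql import SparkSession

# SparkSession can now be used as a context manager; per the PR description,
# the underlying SparkContext is stopped when the block exits.
with SparkSession.builder.master("local[2]").appName("with-demo").getOrCreate() as session:
    session.range(5).show()
```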
* [SPARK-16035][PYSPARK] Fix SparseVector parser assertion for end parenthesis | andreapasqua | 2016-06-17 | 1 file | -1/+1
| | | | | | | | | | | ## What changes were proposed in this pull request? The check on the end parenthesis of the expression to parse was using the wrong variable. I corrected that. ## How was this patch tested? Manual test Author: andreapasqua <andrea@radius.com> Closes #13750 from andreapasqua/sparse-vector-parser-assertion-fix.
* [SPARK-16020][SQL] Fix complete mode aggregation with console sink | Shixiong Zhu | 2016-06-17 | 3 files | -1/+105
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? We cannot use `limit` on DataFrame in ConsoleSink because it will use a wrong planner. This PR just collects `DataFrame` and calls `show` on a batch DataFrame based on the result. This is fine since ConsoleSink is only for debugging. ## How was this patch tested? Manually confirmed ConsoleSink now works with complete mode aggregation. Author: Shixiong Zhu <shixiong@databricks.com> Closes #13740 from zsxwing/complete-console.
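A hedged sketch of the kind of streaming query this fix targets; the socket source, host, and port are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("console-complete").getOrCreate()

# A streaming aggregation; the socket source and its options are illustrative.
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())
counts = lines.groupBy("value").count()

# Complete-mode output through the console sink, which per this patch now
# collects each batch and calls show() instead of relying on limit().
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```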
* [SPARK-15159][SPARKR] SparkR SparkSession API | Felix Cheung | 2016-06-17 | 24 files | -186/+420
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR introduces the new SparkSession API for SparkR: `sparkR.session.getOrCreate()` and `sparkR.session.stop()`. "getOrCreate" is a bit unusual in R but it's important to name this clearly. The SparkR implementation should follow these points: - SparkSession is the main entrypoint (vs SparkContext; due to limited functionality supported with SparkContext in SparkR) - SparkSession replaces SQLContext and HiveContext (both are wrappers around SparkSession, and because of API changes, supporting all 3 would be a lot more work) - Changes to SparkSession are mostly transparent to users due to SPARK-10903 - Full backward compatibility is expected - users should be able to initialize everything just as in Spark 1.6.1 (`sparkR.init()`), but with a deprecation warning - Mostly cosmetic changes to the parameter list - users should be able to move to `sparkR.session.getOrCreate()` easily - An advanced syntax with named parameters (aka varargs aka "...") is supported; that should be closer to the Builder syntax in Scala/Python (which unfortunately does not work in R because it would look like this: `enableHiveSupport(config(config(master(appName(builder(), "foo"), "local"), "first", "value"), "next", "value"))`) - Updating config on an existing SparkSession is supported; the behavior is the same as in Python, in which config is applied to both SparkContext and SparkSession - Some SparkSession changes are not matched in SparkR, mostly because they would be breaking API changes: `catalog` object, `createOrReplaceTempView` - Other SQLContext workarounds are replicated in SparkR, e.g. `tables`, `tableNames` - The `sparkR` shell is updated to use the SparkSession entrypoint (`sqlContext` is removed, just like with Scala/Python) - All tests are updated to use the SparkSession entrypoint - A bug in `read.jdbc` is fixed TODO - [x] Add more tests - [ ] Separate PR - update all roxygen2 doc coding examples - [ ] Separate PR - update the SparkR programming guide ## How was this patch tested? Unit tests, manual tests. shivaram sun-rui rxin Author: Felix Cheung <felixcheung_m@hotmail.com> Author: felixcheung <felixcheung_m@hotmail.com> Closes #13635 from felixcheung/rsparksession.
* [SPARK-15946][MLLIB] Conversion between old/new vector columns in a ↵ | Xiangrui Meng | 2016-06-17 | 2 files | -0/+96
| | | | | | | | | | | | | | | | | | DataFrame (Python) ## What changes were proposed in this pull request? This PR implements python wrappers for #13662 to convert old/new vector columns in a DataFrame. ## How was this patch tested? doctest in Python cc: yanboliang Author: Xiangrui Meng <meng@databricks.com> Closes #13731 from mengxr/SPARK-15946.
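A short sketch of the conversion these wrappers expose, assuming the Python API mirrors the Scala one as `MLUtils.convertVectorColumnsToML` / `convertVectorColumnsFromML`:

```python
from pyspark.sql import SparkSession
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.util import MLUtils

spark = SparkSession.builder.appName("vector-convert").getOrCreate()

# A DataFrame with an old-style (pyspark.mllib) vector column.
df = spark.createDataFrame([(0, Vectors.dense([1.0, 2.0]))], ["id", "features"])

# Convert the old vector column to the new pyspark.ml vector type, and back.
ml_df = MLUtils.convertVectorColumnsToML(df, "features")
mllib_df = MLUtils.convertVectorColumnsFromML(ml_df, "features")
```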
* [SPARK-15129][R][DOC] R API changes in ML | GayathriMurali | 2016-06-17 | 2 files | -60/+21
| | | | | | | | | | ## What changes were proposed in this pull request? Make user guide changes to SparkR documentation for all changes that happened in 2.0 to Machine Learning APIs Author: GayathriMurali <gayathri.m@intel.com> Closes #13285 from GayathriMurali/SPARK-15129.
* [SPARK-16033][SQL] insertInto() can't be used together with partitionBy() | Cheng Lian | 2016-06-17 | 2 files | -3/+46
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? When inserting into an existing partitioned table, partitioning columns should always be determined by catalog metadata of the existing table to be inserted. Extra `partitionBy()` calls don't make sense, and mess up existing data because newly inserted data may have wrong partitioning directory layout. ## How was this patch tested? New test case added in `InsertIntoHiveTableSuite`. Author: Cheng Lian <lian@databricks.com> Closes #13747 from liancheng/spark-16033-insert-into-without-partition-by.
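A minimal sketch of the affected pattern; the table and column names are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("insert-demo").enableHiveSupport().getOrCreate()
df = spark.createDataFrame([(1, "US"), (2, "CA")], ["id", "country"])

# "events" is a hypothetical existing table partitioned by "country".
# Combining partitionBy() with insertInto() is now rejected, because the
# partitioning columns must come from the catalog metadata of the target table:
# df.write.partitionBy("country").insertInto("events")  # now an error

# Correct usage: let the existing table's metadata determine the layout.
df.write.insertInto("events")
```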
* [SPARK-15916][SQL] JDBC filter push down should respect operator precedence | hyukjinkwon | 2016-06-17 | 2 files | -2/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR fixes the problem that the precedence order is messed up when pushing a where-clause expression down to the JDBC layer. **Case 1:** For the SQL `select * from table where (a or b) and c`, the where-clause is wrongly converted to the JDBC where-clause `a or (b and c)` after filter push down. The consequence is that JDBC may return fewer or more rows than expected. **Case 2:** For the SQL `select * from table where always_false_condition`, the result table may not be empty if the JDBC RDD is partitioned using where-clauses: ``` spark.read.jdbc(url, table, predicates = Array("partition 1 where clause", "partition 2 where clause"...) ``` ## How was this patch tested? Unit test. This PR also closes #13640 Author: hyukjinkwon <gurwls223@gmail.com> Author: Sean Zhong <seanzhong@databricks.com> Closes #13743 from clockfly/SPARK-15916.
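A small sketch of Case 1 above; the JDBC URL, table, and column names are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-pushdown").getOrCreate()

# Hypothetical JDBC source with columns a, b and c.
df = spark.read.jdbc("jdbc:postgresql://db-host/test", "some_table",
                     properties={"user": "sa", "password": ""})

# The pushed-down WHERE clause must keep the parentheses, i.e.
# ((a = 1) OR (b = 1)) AND (c = 1), not a = 1 OR (b = 1 AND c = 1).
rows = df.filter("(a = 1 OR b = 1) AND c = 1").collect()
```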
* [SPARK-16005][R] Add `randomSplit` to SparkR | Dongjoon Hyun | 2016-06-17 | 4 files | -0/+60
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR adds `randomSplit` to SparkR for API parity. ## How was this patch tested? Pass the Jenkins tests (with new testcase.) Author: Dongjoon Hyun <dongjoon@apache.org> Closes #13721 from dongjoon-hyun/SPARK-16005.
* [SPARK-15925][SPARKR] R DataFrame add back registerTempTable, add tests | Felix Cheung | 2016-06-17 | 4 files | -18/+57
| | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Add registerTempTable back to DataFrame, marked as deprecated. ## How was this patch tested? Unit tests. shivaram liancheng Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #13722 from felixcheung/rregistertemptable.
* [SPARK-16014][SQL] Rename optimizer rules to be more consistent | Reynold Xin | 2016-06-17 | 6 files | -22/+19
| | | | | | | | | | | | ## What changes were proposed in this pull request? This small patch renames a few optimizer rules to make the naming more consistent, e.g. class names start with a verb. The most important "fix" is probably SamplePushDown -> PushProjectThroughSample. SamplePushDown is actually the wrong name, since the rule is not about pushing Sample down. ## How was this patch tested? Updated test cases. Author: Reynold Xin <rxin@databricks.com> Closes #13732 from rxin/SPARK-16014.
* [SPARK-16017][CORE] Send hostname from CoarseGrainedExecutorBackend to driver | Shixiong Zhu | 2016-06-17 | 5 files | -11/+12
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? [SPARK-15395](https://issues.apache.org/jira/browse/SPARK-15395) changed how the driver gets the executor host: the driver now receives the executor IP address instead of the host name. This PR sends the hostname from executors to the driver so that the driver can pass it to the TaskScheduler. ## How was this patch tested? Existing unit tests. Author: Shixiong Zhu <shixiong@databricks.com> Closes #13741 from zsxwing/SPARK-16017.
* [SPARK-16018][SHUFFLE] Shade netty to load shuffle jar in NodeManager | Dhruve Ashar | 2016-06-17 | 1 file | -0/+7
| | | | | | | | | | | | ## What changes were proposed in this pull request? Shade the netty.io namespace so that we can use it in the shuffle service independently of the dependencies pulled in by Hadoop jars. ## How was this patch tested? Ran a decent job involving shuffle write/read and tested the new spark-x-yarn-shuffle jar. After shading the netty.io namespace, the NodeManager loads it and the shuffle job completes successfully. Author: Dhruve Ashar <dhruveashar@gmail.com> Closes #13739 from dhruve/bug/SPARK-16018.
* [SPARK-15926] Improve readability of DAGScheduler stage creation methods | Kay Ousterhout | 2016-06-17 | 2 files | -86/+76
| | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This pull request refactors parts of the DAGScheduler to improve readability, focusing on the code around stage creation. One goal of this change it to make it clearer which functions may create new stages (as opposed to looking up stages that already exist). There are no functionality changes in this pull request. In more detail: * shuffleToMapStage was renamed to shuffleIdToMapStage (when reading the existing code I have sometimes struggled to remember what the key is -- is it a stage? A stage id? This change is intended to avoid that confusion) * Cleaned up the code to create shuffle map stages. Previously, creating a shuffle map stage involved 3 different functions (newOrUsedShuffleStage, newShuffleMapStage, and getShuffleMapStage), and it wasn't clear what the purpose of each function was. With the new code, a single function (getOrCreateShuffleMapStage) is responsible for getting a stage (if it already exists) or creating new shuffle map stages and any missing ancestor stages, and it delegates to createShuffleMapStage when new stages need to be created. There's some remaining confusion here because the getOrCreateParentStages call in createShuffleMapStage may recursively create ancestor stages; this is an issue I plan to fix in a future pull request, because it's trickier to fix and involves a slight functionality change. * newResultStage was renamed to createResultStage, for consistency with naming around shuffle map stages. * getParentStages has been renamed to getOrCreateParentStages, to make it clear that this function will sometimes create missing ancestor stages. * The only *slight* functionality change is that on line 478, updateJobIdStageIdMaps now uses a stage's parents instance variable rather than re-calculating them (I couldn't see any reason why they'd need to be re-calculated, and suspect this is just leftover from older code). * getAncestorShuffleDependencies was renamed to getMissingAncestorShuffleDependencies, to make it clear that this only returns dependencies that have not yet been run. cc squito markhamstra JoshRosen (who requested more DAG scheduler commenting long ago -- an issue this pull request tries, in part, to address) FYI rxin Author: Kay Ousterhout <kayousterhout@gmail.com> Closes #13677 from kayousterhout/SPARK-15926.
* [SPARK-16008][ML] Remove unnecessary serialization in logistic regression | sethah | 2016-06-17 | 1 file | -28/+29
| | | | | | | | | | | | | | | | | | | | | JIRA: [SPARK-16008](https://issues.apache.org/jira/browse/SPARK-16008) ## What changes were proposed in this pull request? `LogisticAggregator` stores references to two arrays of dimension `numFeatures` which are serialized before the combine op, unnecessarily. This results in the shuffle write being ~3x (for multiclass logistic regression, this number will go up) larger than it should be (in MLlib, for instance, it is 3x smaller). This patch modifies `LogisticAggregator.add` to accept the two arrays as method parameters which avoids the serialization. ## How was this patch tested? I tested this locally and verified the serialization reduction. ![image](https://cloud.githubusercontent.com/assets/7275795/16140387/d2974bac-3404-11e6-94f9-268860c931a2.png) Additionally, I ran some tests of a 4 node cluster (4x48 cores, 4x128 GB RAM). Data set size of 2M rows and 10k features showed >2x iteration speedup. Author: sethah <seth.hendrickson16@gmail.com> Closes #13729 from sethah/lr_improvement.
* Remove non-obvious conf settings from TPCDS benchmark | Sameer Agarwal | 2016-06-17 | 1 file | -2/+0
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? My fault -- these 2 conf entries are mysteriously hidden inside the benchmark code and make it non-obvious how to disable whole-stage codegen and/or the vectorized Parquet reader. PS: Didn't attach a JIRA as this change should otherwise be a no-op (both of these confs are enabled by default in Spark) ## How was this patch tested? N/A Author: Sameer Agarwal <sameer@databricks.com> Closes #13726 from sameeragarwal/tpcds-conf.
* [SPARK-15811][SQL] fix the Python UDF in Scala 2.10 | Davies Liu | 2016-06-17 | 1 file | -1/+1
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? An Iterator can't be serialized in Scala 2.10, so we should force it into an array to make sure it can be serialized. ## How was this patch tested? Built with Scala 2.10 and ran all the Python unit tests manually (will be covered by a Jenkins build). Author: Davies Liu <davies@databricks.com> Closes #13717 from davies/fix_udf_210.
* [SPARK-15706][SQL] Fix Wrong Answer when using IF NOT EXISTS in INSERT ↵ | gatorsmile | 2016-06-16 | 5 files | -5/+85
| | | | | | | | | | | | | | | | OVERWRITE for DYNAMIC PARTITION #### What changes were proposed in this pull request? `IF NOT EXISTS` in `INSERT OVERWRITE` should not support dynamic partitions. If we specify `IF NOT EXISTS`, the inserted data does not show up in the table. This PR issues an exception in this case, just like what Hive does. It also issues an exception if users specify `IF NOT EXISTS` without any `PARTITION` specification. #### How was this patch tested? Added test cases into `PlanParserSuite` and `InsertIntoHiveTableSuite` Author: gatorsmile <gatorsmile@gmail.com> Closes #13447 from gatorsmile/insertIfNotExist.
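A sketch of the distinction, issued through spark.sql(); the table names are hypothetical and Hive support is assumed:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("insert-if-not-exists").enableHiveSupport().getOrCreate()

# A static partition spec combined with IF NOT EXISTS is still accepted.
spark.sql("INSERT OVERWRITE TABLE target PARTITION (p = '2016-06-16') IF NOT EXISTS "
          "SELECT id FROM source")

# Dynamic partitions (no value given for p) combined with IF NOT EXISTS now
# raise an exception, matching Hive's behavior:
# spark.sql("INSERT OVERWRITE TABLE target PARTITION (p) IF NOT EXISTS "
#           "SELECT id, p FROM source")
```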
* [SPARK-15822] [SQL] Prevent byte array backed classes from referencing freed ↵ | Pete Robbins | 2016-06-16 | 2 files | -7/+17
| | | | | | | | | | | | | | | | | | | | | memory ## What changes were proposed in this pull request? `UTF8String` and all `Unsafe*` classes are backed by either on-heap or off-heap byte arrays. The code-generated version of `SortMergeJoin` buffers the left hand side join keys during iteration. This was actually problematic in off-heap mode when one of the keys is a `UTF8String` (or any other `Unsafe*` object) and the left hand side iterator was exhausted (and released its memory); the buffered keys would reference freed memory. This causes segfaults and all kinds of other undefined behavior when we use one of these buffered keys. This PR fixes this problem by creating copies of the buffered variables. I have added a general method to the `CodeGenerator` for this. I have checked all places in which this could happen, and only `SortMergeJoin` had this problem. This PR is largely based on the work of robbinspg and he should be credited for this. closes https://github.com/apache/spark/pull/13707 ## How was this patch tested? Manually tested on problematic workloads. Author: Pete Robbins <robbinspg@gmail.com> Author: Herman van Hovell <hvanhovell@databricks.com> Closes #13723 from hvanhovell/SPARK-15822-2.
* [SPARK-15908][R] Add varargs-type dropDuplicates() function in SparkR | Dongjoon Hyun | 2016-06-16 | 3 files | -11/+29
| | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR adds varargs-type `dropDuplicates` function to SparkR for API parity. Refer to https://issues.apache.org/jira/browse/SPARK-15807, too. ## How was this patch tested? Pass the Jenkins tests with new testcases. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #13684 from dongjoon-hyun/SPARK-15908.
* [SPARK-15490][R][DOC] SparkR 2.0 QA: New R APIs and API docs for non-MLib ↵ | Kai Jiang | 2016-06-16 | 10 files | -88/+123
| | | | | | | | | | | | | | changes ## What changes were proposed in this pull request? R docs changes include typo, format, and layout fixes. ## How was this patch tested? Tested locally. Author: Kai Jiang <jiangkai@gmail.com> Closes #13394 from vectorijk/spark-15490.
* [SPARK-15782][YARN] Fix spark.jars and spark.yarn.dist.jars handling | Nezih Yigitbasi | 2016-06-16 | 5 files | -16/+59
| | | | | | | | | | | | | When `--packages` is specified with spark-shell, the classes from those packages cannot be found, which I think is due to some of the changes in SPARK-12343. Tested manually with both Scala 2.10 and 2.11 REPLs. vanzin davies can you guys please review? Author: Marcelo Vanzin <vanzin@cloudera.com> Author: Nezih Yigitbasi <nyigitbasi@netflix.com> Closes #13709 from nezihyigitbasi/SPARK-15782.
* [SPARK-15966][DOC] Add closing tag to fix rendering issue for Spark monitoring | Dhruve Ashar | 2016-06-16 | 1 file | -1/+1
| | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Adds the missing closing tag for spark.ui.view.acls.groups ## How was this patch tested? I built the docs locally and verified the change in the browser. **Before:** ![image](https://cloud.githubusercontent.com/assets/7732317/16135005/49fc0724-33e6-11e6-9390-98711593fa5b.png) **After:** ![image](https://cloud.githubusercontent.com/assets/7732317/16135021/62b5c4a8-33e6-11e6-8118-b22fda5c66eb.png) Author: Dhruve Ashar <dhruveashar@gmail.com> Closes #13719 from dhruve/doc/SPARK-15966.
* [SPARK-15608][ML][EXAMPLES][DOC] add examples and documents of ml.isotonic ↵ | WeichenXu | 2016-06-16 | 9 files | -114/+373
| | | | | | | | | | | | | | | | | | | | | | | | | regression ## What changes were proposed in this pull request? Add an ML doc for ml isotonic regression; add Scala, Java, and Python examples for ml isotonic regression; modify the Scala, Java, and Python examples for mllib isotonic regression; add data/mllib/sample_isotonic_regression_libsvm_data.txt; delete data/mllib/sample_isotonic_regression_data.txt. ## How was this patch tested? N/A Author: WeichenXu <WeichenXu123@outlook.com> Closes #13381 from WeichenXu123/add_isotonic_regression_doc.
* [SPARK-15991] SparkContext.hadoopConfiguration should be always the base of ↵ | Yin Huai | 2016-06-16 | 5 files | -17/+28
| | | | | | | | | | | | | | | | hadoop conf created by SessionState ## What changes were proposed in this pull request? Before this patch, after a SparkSession has been created, hadoop conf set directly to SparkContext.hadoopConfiguration will not affect the hadoop conf created by SessionState. This patch makes the change to always use SparkContext.hadoopConfiguration as the base. This patch also changes the behavior of hive-site.xml support added in https://github.com/apache/spark/pull/12689/. With this patch, we will load hive-site.xml to SparkContext.hadoopConfiguration. ## How was this patch tested? New test in SparkSessionBuilderSuite. Author: Yin Huai <yhuai@databricks.com> Closes #13711 from yhuai/SPARK-15991.
* [SPARK-15749][SQL] make the error message more meaningful | Huaxin Gao | 2016-06-16 | 2 files | -3/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? For table test1 (C1 varchar (10), C2 varchar (10)), when I insert a row using ``` sqlContext.sql("insert into test1 values ('abc', 'def', 1)") ``` I got error message ``` Exception in thread "main" java.lang.RuntimeException: RelationC1#0,C2#1 JDBCRelation(test1) requires that the query in the SELECT clause of the INSERT INTO/OVERWRITE statement generates the same number of columns as its schema. ``` The error message is a little confusing. In my simple insert statement, it doesn't have a SELECT clause. I will change the error message to a more general one ``` Exception in thread "main" java.lang.RuntimeException: RelationC1#0,C2#1 JDBCRelation(test1) requires that the data to be inserted have the same number of columns as the target table. ``` ## How was this patch tested? I tested the patch using my simple unit test, but it's a very trivial change and I don't think I need to check in any test. Author: Huaxin Gao <huaxing@us.ibm.com> Closes #13492 from huaxingao/spark-15749.
* [SPARK-15868][WEB UI] Executors table in Executors tab should sort Executor ↵ | Alex Bozarth | 2016-06-16 | 1 file | -2/+7
| | | | | | | | | | | | | | | | | | | IDs in numerical order ## What changes were proposed in this pull request? Currently the Executors table sorts by id using a string sort (since that's what it is stored as). Since the id is a number (other than the driver) we should be sorting numerically. I have changed both the initial sort on page load as well as the table sort to sort on id numerically, treating non-numeric strings (like the driver) as "-1" ## How was this patch tested? Manually tested and dev/run-tests ![pageload](https://cloud.githubusercontent.com/assets/13952758/16027882/d32edd0a-318e-11e6-9faf-fc972b7c36ab.png) ![sorted](https://cloud.githubusercontent.com/assets/13952758/16027883/d34541c6-318e-11e6-9ed7-6bfc0cd4152e.png) Author: Alex Bozarth <ajbozart@us.ibm.com> Closes #13654 from ajbozarth/spark15868.
* [MINOR][DOCS][SQL] Fix some comments about types(TypeCoercion,Partition) and ↵ | Dongjoon Hyun | 2016-06-16 | 4 files | -5/+5
| | | | | | | | | | | | | | | | | | | exceptions. ## What changes were proposed in this pull request? This PR contains a few changes on code comments. - `HiveTypeCoercion` is renamed into `TypeCoercion`. - `NoSuchDatabaseException` is only used for the absence of database. - For partition type inference, only `DoubleType` is considered. ## How was this patch tested? N/A Author: Dongjoon Hyun <dongjoon@apache.org> Closes #13674 from dongjoon-hyun/minor_doc_types.
* [SPARK-15998][SQL] Verification of SQLConf HIVE_METASTORE_PARTITION_PRUNING | gatorsmile | 2016-06-16 | 1 file | -3/+57
| | | | | | | | | | | | | | #### What changes were proposed in this pull request? `HIVE_METASTORE_PARTITION_PRUNING` is a public `SQLConf`. When `true`, some predicates will be pushed down into the Hive metastore so that unmatching partitions can be eliminated earlier. The current default value is `false`. For performance improvement, users might turn this parameter on. So far, the code base does not have such a test case to verify whether this `SQLConf` properly works. This PR is to improve the test case coverage for avoiding future regression. #### How was this patch tested? N/A Author: gatorsmile <gatorsmile@gmail.com> Closes #13716 from gatorsmile/addTestMetastorePartitionPruning.
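A short sketch of turning the flag on; the conf key is assumed to be spark.sql.hive.metastorePartitionPruning, and the table name is hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("metastore-pruning").enableHiveSupport().getOrCreate()

# Assumed key backing the HIVE_METASTORE_PARTITION_PRUNING SQLConf entry; when
# true, partition predicates are pushed into the Hive metastore call so that
# non-matching partitions are eliminated earlier.
spark.conf.set("spark.sql.hive.metastorePartitionPruning", "true")

spark.sql("SELECT * FROM partitioned_table WHERE part_col = '2016-06-16'").explain()
```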
* [SQL] Minor HashAggregateExec string output fixes | Cheng Lian | 2016-06-16 | 1 file | -5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR fixes some minor `.toString` format issues for `HashAggregateExec`. Before: ``` *HashAggregate(key=[a#234L,b#235L], functions=[count(1),max(c#236L)], output=[a#234L,b#235L,count(c)#247L,max(c)#248L]) ``` After: ``` *HashAggregate(keys=[a#234L, b#235L], functions=[count(1), max(c#236L)], output=[a#234L, b#235L, count(c)#247L, max(c)#248L]) ``` ## How was this patch tested? Manually tested. Author: Cheng Lian <lian@databricks.com> Closes #13710 from liancheng/minor-agg-string-fix.
* [SPARK-15975] Fix improper Popen retcode code handling in dev/run-tests | Josh Rosen | 2016-06-16 | 2 files | -2/+5
| | | | | | | | | | In the `dev/run-tests.py` script we check `Popen.retcode` for success using `retcode > 0`, but this is subtly wrong because Popen's return code will be negative if the child process was terminated by a signal: https://docs.python.org/2/library/subprocess.html#subprocess.Popen.returncode In order to properly handle signals, we should change this to check `retcode != 0` instead. Author: Josh Rosen <joshrosen@databricks.com> Closes #13692 from JoshRosen/dev-run-tests-return-code-handling.
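A small sketch of the distinction being fixed; the command is illustrative:

```python
import subprocess

# Run a child process; the command here is only illustrative.
proc = subprocess.Popen(["bash", "-c", "exit 0"])
proc.wait()

# Popen.returncode is negative when the child was killed by a signal
# (e.g. -9 for SIGKILL), so a `retcode > 0` check would miss those failures.
if proc.returncode != 0:
    raise RuntimeError("child failed with return code %d" % proc.returncode)
```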
* [SPARK-15978][SQL] improve 'show tables' command related codes | bomeng | 2016-06-16 | 2 files | -2/+2
| | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? I've found some minor issues in the "show tables" command: 1. In `SessionCatalog.scala`, the `listTables(db: String)` method will call `listTables(formatDatabaseName(db), "*")` to list all the tables for a certain db, but in the method `listTables(db: String, pattern: String)`, this db name is formatted once more. So I think we should remove `formatDatabaseName()` from the caller. 2. I suggest adding a sort to listTables(db: String) in InMemoryCatalog.scala, just like listDatabases(). ## How was this patch tested? The existing test cases should cover it. Author: bomeng <bmeng@us.ibm.com> Closes #13695 from bomeng/SPARK-15978.
* [SPARK-15796][CORE] Reduce spark.memory.fraction default to avoid ↵ | Sean Owen | 2016-06-16 | 4 files | -9/+26
| | | | | | | | | | | | | | | | overrunning old gen in JVM default config ## What changes were proposed in this pull request? Reduce `spark.memory.fraction` default to 0.6 in order to make it fit within default JVM old generation size (2/3 heap). See JIRA discussion. This means a full cache doesn't spill into the new gen. CC andrewor14 ## How was this patch tested? Jenkins tests. Author: Sean Owen <sowen@cloudera.com> Closes #13618 from srowen/SPARK-15796.
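A minimal sketch of overriding the setting, e.g. to restore the previous default of 0.75; it must be set before the SparkContext starts:

```python
from pyspark import SparkConf, SparkContext

# spark.memory.fraction now defaults to 0.6 so that the unified memory region
# fits within the JVM old generation (roughly 2/3 of the heap by default);
# setting 0.75 here only reproduces the old behavior.
conf = SparkConf().set("spark.memory.fraction", "0.75")
sc = SparkContext(conf=conf)
```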
* [SPARK-15922][MLLIB] `toIndexedRowMatrix` should consider the case `cols < ↵ | Dongjoon Hyun | 2016-06-16 | 2 files | -1/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | offset+colsPerBlock` ## What changes were proposed in this pull request? SPARK-15922 reports the following scenario throwing an exception due to the mismatched vector sizes. This PR handles the exceptional case, `cols < (offset + colsPerBlock)`. **Before** ```scala scala> import org.apache.spark.mllib.linalg.distributed._ scala> import org.apache.spark.mllib.linalg._ scala> val rows = IndexedRow(0L, new DenseVector(Array(1,2,3))) :: IndexedRow(1L, new DenseVector(Array(1,2,3))):: IndexedRow(2L, new DenseVector(Array(1,2,3))):: Nil scala> val rdd = sc.parallelize(rows) scala> val matrix = new IndexedRowMatrix(rdd, 3, 3) scala> val bmat = matrix.toBlockMatrix scala> val imat = bmat.toIndexedRowMatrix scala> imat.rows.collect ... // java.lang.IllegalArgumentException: requirement failed: Vectors must be the same length! ``` **After** ```scala ... scala> imat.rows.collect res0: Array[org.apache.spark.mllib.linalg.distributed.IndexedRow] = Array(IndexedRow(0,[1.0,2.0,3.0]), IndexedRow(1,[1.0,2.0,3.0]), IndexedRow(2,[1.0,2.0,3.0])) ``` ## How was this patch tested? Pass the Jenkins tests (including the above case) Author: Dongjoon Hyun <dongjoon@apache.org> Closes #13643 from dongjoon-hyun/SPARK-15922.
* [SPARK-15977][SQL] Fix TRUNCATE TABLE for Spark specific datasource tables | Herman van Hovell | 2016-06-16 | 2 files | -11/+21
| | | | | | | | | | | | ## What changes were proposed in this pull request? `TRUNCATE TABLE` is currently broken for Spark specific datasource tables (json, csv, ...). This PR correctly sets the location for these datasources which allows them to be truncated. ## How was this patch tested? Extended the datasources `TRUNCATE TABLE` tests in `DDLSuite`. Author: Herman van Hovell <hvanhovell@databricks.com> Closes #13697 from hvanhovell/SPARK-15977.
* [SPARK-15981][SQL][STREAMING] Fixed bug and added tests in DataStreamReader ↵ | Tathagata Das | 2016-06-16 | 1 file | -122/+136
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Python API ## What changes were proposed in this pull request? - Fixed bug in Python API of DataStreamReader. Because a single path was being converted to a array before calling Java DataStreamReader method (which takes a string only), it gave the following error. ``` File "/Users/tdas/Projects/Spark/spark/python/pyspark/sql/readwriter.py", line 947, in pyspark.sql.readwriter.DataStreamReader.json Failed example: json_sdf = spark.readStream.json(os.path.join(tempfile.mkdtemp(), 'data'), schema = sdf_schema) Exception raised: Traceback (most recent call last): File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/doctest.py", line 1253, in __run compileflags, 1) in test.globs File "<doctest pyspark.sql.readwriter.DataStreamReader.json[0]>", line 1, in <module> json_sdf = spark.readStream.json(os.path.join(tempfile.mkdtemp(), 'data'), schema = sdf_schema) File "/Users/tdas/Projects/Spark/spark/python/pyspark/sql/readwriter.py", line 963, in json return self._df(self._jreader.json(path)) File "/Users/tdas/Projects/Spark/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__ answer, self.gateway_client, self.target_id, self.name) File "/Users/tdas/Projects/Spark/spark/python/pyspark/sql/utils.py", line 63, in deco return f(*a, **kw) File "/Users/tdas/Projects/Spark/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 316, in get_return_value format(target_id, ".", name, value)) Py4JError: An error occurred while calling o121.json. Trace: py4j.Py4JException: Method json([class java.util.ArrayList]) does not exist at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318) at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326) at py4j.Gateway.invoke(Gateway.java:272) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:211) at java.lang.Thread.run(Thread.java:744) ``` - Reduced code duplication between DataStreamReader and DataFrameWriter - Added missing Python doctests ## How was this patch tested? New tests Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #13703 from tdas/SPARK-15981.
* [SPARK-15996][R] Fix R examples by removing deprecated functions | Dongjoon Hyun | 2016-06-16 | 2 files | -8/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Currently, R examples(`dataframe.R` and `data-manipulation.R`) fail like the following. We had better update them before releasing 2.0 RC. This PR updates them to use up-to-date APIs. ```bash $ bin/spark-submit examples/src/main/r/dataframe.R ... Warning message: 'createDataFrame(sqlContext...)' is deprecated. Use 'createDataFrame(data, schema = NULL, samplingRatio = 1.0)' instead. See help("Deprecated") ... Warning message: 'read.json(sqlContext...)' is deprecated. Use 'read.json(path)' instead. See help("Deprecated") ... Error: could not find function "registerTempTable" Execution halted ``` ## How was this patch tested? Manual. ``` curl -LO http://s3-us-west-2.amazonaws.com/sparkr-data/flights.csv bin/spark-submit examples/src/main/r/dataframe.R bin/spark-submit examples/src/main/r/data-manipulation.R flights.csv ``` Author: Dongjoon Hyun <dongjoon@apache.org> Closes #13714 from dongjoon-hyun/SPARK-15996.
* [SPARK-15983][SQL] Removes FileFormat.prepareRead | Cheng Lian | 2016-06-16 | 3 files | -31/+18
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Interface method `FileFormat.prepareRead()` was added in #12088 to handle a special case in the LibSVM data source. However, the semantics of this interface method isn't intuitive: it returns a modified version of the data source options map. Considering that the LibSVM case can be easily handled using schema metadata inside `inferSchema`, we can remove this interface method to keep the `FileFormat` interface clean. ## How was this patch tested? Existing tests. Author: Cheng Lian <lian@databricks.com> Closes #13698 from liancheng/remove-prepare-read.
* [SPARK-15862][SQL] Better Error Message When Having Database Name in CACHE ↵ | gatorsmile | 2016-06-16 | 7 files | -64/+121
| | | | | | | | | | | | | | | | | | | | | | | | TABLE AS SELECT #### What changes were proposed in this pull request? ~~If the temp table already exists, we should not silently replace it when doing `CACHE TABLE AS SELECT`. This is inconsistent with the behavior of `CREAT VIEW` or `CREATE TABLE`. This PR is to fix this silent drop.~~ ~~Maybe, we also can introduce new syntax for replacing the existing one. For example, in Hive, to replace a view, the syntax should be like `ALTER VIEW AS SELECT` or `CREATE OR REPLACE VIEW AS SELECT`~~ The table name in `CACHE TABLE AS SELECT` should NOT contain database prefix like "database.table". Thus, this PR captures this in Parser and outputs a better error message, instead of reporting the view already exists. In addition, refactoring the `Parser` to generate table identifiers instead of returning the table name string. #### How was this patch tested? - Added a test case for caching and uncaching qualified table names - Fixed a few test cases that do not drop temp table at the end - Added the related test case for the issue resolved in this PR Author: gatorsmile <gatorsmile@gmail.com> Author: xiaoli <lixiao1983@gmail.com> Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local> Closes #13572 from gatorsmile/cacheTableAsSelect.
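A short sketch of the two cases; the table names are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-table").getOrCreate()
spark.range(10).createOrReplaceTempView("src")

# An unqualified name is accepted.
spark.sql("CACHE TABLE cached_src AS SELECT * FROM src")

# A database-qualified name is now rejected by the parser with a clearer error
# message, instead of a confusing "view already exists" report:
# spark.sql("CACHE TABLE somedb.cached_src AS SELECT * FROM src")
```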
* [SPARK-12922][SPARKR][WIP] Implement gapply() on DataFrame in SparkR | Narine Kokhlikyan | 2016-06-15 | 14 files | -65/+540
| | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? gapply() applies an R function on groups grouped by one or more columns of a DataFrame, and returns a DataFrame. It is like GroupedDataSet.flatMapGroups() in the Dataset API. Please let me know what you think and if you have any ideas to improve it. Thank you! ## How was this patch tested? Unit tests. 1. Primitive test with different column types 2. Add a boolean column 3. Compute average by a group Author: Narine Kokhlikyan <narine.kokhlikyan@gmail.com> Author: NarineK <narine.kokhlikyan@us.ibm.com> Closes #12836 from NarineK/gapply2.
* [SPARK-15824][SQL] Execute WITH .... INSERT ... statements immediately | Herman van Hovell | 2016-06-15 | 3 files | -2/+27
| | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? We currently immediately execute `INSERT` commands when they are issued. This is not the case as soon as we use a `WITH` to define common table expressions, for example: ```sql WITH tbl AS (SELECT * FROM x WHERE id = 10) INSERT INTO y SELECT * FROM tbl ``` This PR fixes this problem. This PR closes https://github.com/apache/spark/pull/13561 (which fixes the a instance of this problem in the ThriftSever). ## How was this patch tested? Added a test to `InsertSuite` Author: Herman van Hovell <hvanhovell@databricks.com> Closes #13678 from hvanhovell/SPARK-15824.
* [SPARK-15851][BUILD] Fix the call of the bash script to enable proper run in ↵ | Reynold Xin | 2016-06-15 | 2 files | -2/+3
| | | | | | | | | | | | | | | | | | | Windows ## What changes were proposed in this pull request? The way bash script `build/spark-build-info` is called from core/pom.xml prevents Spark building on Windows. Instead of calling the script directly we call bash and pass the script as an argument. This enables running it on Windows with bash installed which typically comes with Git. This brings https://github.com/apache/spark/pull/13612 up-to-date and also addresses comments from the code review. Closes #13612 ## How was this patch tested? I built manually (on a Mac) to verify it didn't break Mac compilation. Author: Reynold Xin <rxin@databricks.com> Author: avulanov <nashb@yandex.ru> Closes #13691 from rxin/SPARK-15851.
* [SPARK-13498][SQL] Increment the recordsRead input metric for JDBC data source | Wayne Song | 2016-06-15 | 1 file | -0/+2
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? This patch brings https://github.com/apache/spark/pull/11373 up-to-date and increments the record count for JDBC data source. Closes #11373. ## How was this patch tested? N/A Author: Reynold Xin <rxin@databricks.com> Closes #13694 from rxin/SPARK-13498.
* [SPARK-15979][SQL] Rename various Parquet support classes. | Reynold Xin | 2016-06-15 | 14 files | -123/+120
| | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This patch renames various Parquet support classes from CatalystAbc to ParquetAbc. This new naming makes more sense for two reasons: 1. These are not optimizer related (i.e. Catalyst) classes. 2. We are in the Spark code base, and as a result it'd be more clear to call out these are Parquet support classes, rather than some Spark classes. ## How was this patch tested? Renamed test cases as well. Author: Reynold Xin <rxin@databricks.com> Closes #13696 from rxin/parquet-rename.
* [SPARK-12492][SQL] Add missing SQLExecution.withNewExecutionId for ↵ | KaiXinXiaoLei | 2016-06-15 | 1 file | -14/+17
| | | | | | | | | | | | | | | | | | hiveResultString ## What changes were proposed in this pull request? Add missing SQLExecution.withNewExecutionId for hiveResultString so that queries running in `spark-sql` will be shown in Web UI. Closes #13115 ## How was this patch tested? Existing unit tests. Author: KaiXinXiaoLei <huleilei1@huawei.com> Closes #13689 from zsxwing/pr13115.
* [DOCS] Fix Gini and Entropy scaladocs in context of multiclass classification | Wojciech Jurczyk | 2016-06-15 | 2 files | -3/+2
| | | | | | | | The PR updates outdated scaladocs for the Gini and Entropy classes. Since PR #886, Spark supports multiclass classification, but the docs only describe binary classification. Author: Wojciech Jurczyk <wojciech.jurczyk@codilime.com> Closes #11252 from wjur/wjur/docs_multiclass.
* Revert "[SPARK-15782][YARN] Set spark.jars system property in client mode"Davies Liu2016-06-155-43/+6
| | | | This reverts commit 4df8df5c2e68f5a5d231c401b04d762d7a648159.
* Closing stale pull requests. | Reynold Xin | 2016-06-15 | 0 files | -0/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Closes #13103 Closes #8320 Closes #7871 Closes #7461 Closes #9159 Closes #9150 Closes #9200 Closes #9089 Closes #8022 Closes #6767 Closes #8505 Closes #9457 Closes #9397 Closes #8563 Closes #10062 Closes #9944 Closes #10137 Closes #10148 Closes #9057 Closes #10163 Closes #8023 Closes #10302 Closes #8979 Closes #8981 Closes #10258 Closes #7345 Closes #9183 Closes #10087 Closes #10292 Closes #10254 Closes #10374 Closes #8915 Closes #10128 Closes #10666 Closes #8533 Closes #10625 Closes #8013 Closes #8427 Closes #7753 Closes #10116 Closes #11005 Closes #10797 Closes #11026 Closes #11009 Closes #10117 Closes #11382 Closes #9483 Closes #10566 Closes #10753 Closes #11386 Closes #9097 Closes #11245 Closes #11257 Closes #11045 Closes #10144 Closes #11066 Closes #8610 Closes #10634 Closes #11224 Closes #11212 Closes #11244 Closes #10326 Closes #13524
* [SPARK-7848][STREAMING][UPDATE SPARKSTREAMING DOCS TO INCORPORATE IMPORTANT ↵ | Nirman Narang | 2016-06-15 | 1 file | -0/+19
| | | | | | | | | | POINTS.] Updated the Spark Streaming docs with some important points. Author: Nirman Narang <narang@us.ibm.com> Closes #11114 from nirmannarang/SPARK-7848.
* [HOTFIX][CORE] fix flaky BasicSchedulerIntegrationTest | Imran Rashid | 2016-06-15 | 1 file | -8/+7
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? SPARK-15927 exacerbated a race in BasicSchedulerIntegrationTest, so it went from very unlikely to fairly frequent. The issue is that stage numbering is not completely deterministic, but these tests treated it like it was. So turn off the tests. ## How was this patch tested? On my laptop the test failed about 10% of the time before this change, and didn't fail in 500 runs after the change. Author: Imran Rashid <irashid@cloudera.com> Closes #13688 from squito/hotfix_basic_scheduler.