Commit log (newest first); each entry shows the commit message, author, date, files changed, and lines removed/added.
* [SPARK-11476][DOCS] Incorrect function referred to in MLlib Random data generation documentation (Sean Owen, 2015-11-08; 1 file, -1/+1)
    Fix the Python example to use normalRDD as advertised.
    Author: Sean Owen <sowen@cloudera.com>  Closes #9529 from srowen/SPARK-11476.
* [SPARK-11362][SQL] Use Spark BitSet in BroadcastNestedLoopJoin (Liang-Chi Hsieh, 2015-11-07; 1 file, -10/+8)
    JIRA: https://issues.apache.org/jira/browse/SPARK-11362
    We use scala.collection.mutable.BitSet in BroadcastNestedLoopJoin now. We should use Spark's BitSet.
    Author: Liang-Chi Hsieh <viirya@appier.com>  Closes #9316 from viirya/use-spark-bitset.
* [SPARK-9241][SQL] Supporting multiple DISTINCT columns - follow-up (Herman van Hovell, 2015-11-07; 2 files, -23/+108)
    This PR is a follow-up to https://github.com/apache/spark/pull/9406. It adds more documentation to the rewriting rule, removes a redundant if expression in the non-distinct aggregation path, and adds a multiple-distinct test to AggregationQuerySuite. cc yhuai marmbrus
    Author: Herman van Hovell <hvanhovell@questtec.nl>  Closes #9541 from hvanhovell/SPARK-9241-followup.
* [SPARK-8467][MLLIB][PYSPARK] Add LDAModel.describeTopics() in Python (Yu ISHIKAWA, 2015-11-06; 3 files, -17/+75)
    Could jkbradley and davies review it?
    - Create a wrapper class `LDAModelWrapper` for `LDAModel`, because we can't handle the return value of `describeTopics` in Scala from pyspark directly; `Array[(Array[Int], Array[Double])]` is too complicated to convert.
    - Add `loadLDAModel` in `PythonMLlibAPI`, since `LDAModel` in Scala is an abstract class and we need to call `load` on `DistributedLDAModel`.
    JIRA: https://issues.apache.org/jira/browse/SPARK-8467
    Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>  Closes #8643 from yu-iskw/SPARK-8467-2.
* [SPARK-11112] DAG visualization: display RDD callsite (Andrew Or, 2015-11-07; 7 files, -20/+79)
    Screenshot: https://cloud.githubusercontent.com/assets/2133137/10870343/2a8cd070-807d-11e5-857a-4ebcace77b5b.png
    mateiz sarutak
    Author: Andrew Or <andrew@databricks.com>  Closes #9398 from andrewor14/rdd-callsite.
* [SPARK-11389][CORE] Add support for off-heap memory to MemoryManager (Josh Rosen, 2015-11-06; 21 files, -465/+828)
    In order to lay the groundwork for proper off-heap memory support in SQL / Tungsten, we need to extend our MemoryManager to perform bookkeeping for off-heap memory.
    ## User-facing changes
    This PR introduces a new configuration, `spark.memory.offHeapSize` (name subject to change), which specifies the absolute amount of off-heap memory that Spark and Spark SQL can use. If Tungsten is configured to use off-heap execution memory for allocating data pages, then all data page allocations must fit within this size limit.
    ## Internals changes
    This PR contains a lot of internal refactoring of the MemoryManager. The key change at the heart of this patch is the introduction of a `MemoryPool` class (name subject to change) to manage the bookkeeping for a particular category of memory (storage, on-heap execution, and off-heap execution). These MemoryPools are not fixed-size; they can be dynamically grown and shrunk according to the MemoryManager's policies. In StaticMemoryManager, these pools have fixed sizes, proportional to the legacy `[storage|shuffle].memoryFraction`. In the new UnifiedMemoryManager, the sizes of these pools are dynamically adjusted according to its policies.
    There are two subclasses of `MemoryPool`: `StorageMemoryPool` manages storage memory and `ExecutionMemoryPool` manages execution memory. The MemoryManager creates two execution pools, one for on-heap memory and one for off-heap. Instances of `ExecutionMemoryPool` manage the logic for fair sharing of their pooled memory across running tasks (in other words, the ShuffleMemoryManager-like logic has been moved out of MemoryManager and pushed into these ExecutionMemoryPool instances).
    I think that this design is substantially easier to understand and reason about than the previous design, where most of these responsibilities were handled by MemoryManager and its subclasses. To see this, take a look at how simple the logic in `UnifiedMemoryManager` has become: it's now very easy to see when memory is dynamically shifted between storage and execution.
    ## TODOs
    - [x] Fix handful of test failures in the MemoryManagerSuites.
    - [x] Fix remaining TODO comments in code.
    - [ ] Document new configuration.
    - [x] Fix commented-out tests / asserts:
      - [x] UnifiedMemoryManagerSuite.
    - [x] Write tests that exercise the new off-heap memory management policies.
    Author: Josh Rosen <joshrosen@databricks.com>  Closes #9344 from JoshRosen/offheap-memory-accounting.
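    A minimal, hypothetical sketch of the pool-style bookkeeping described above (the class and method names below are illustrative only, not Spark's actual `MemoryPool` API):
    ```scala
    // A pool tracks how much of its current size is in use; the manager can grow
    // or shrink the pool, and tasks acquire/release bytes against it.
    class SimplePool(private var _poolSize: Long) {
      private var used: Long = 0L

      def poolSize: Long = _poolSize
      def memoryFree: Long = _poolSize - used

      def incrementPoolSize(delta: Long): Unit = { _poolSize += delta }
      def decrementPoolSize(delta: Long): Unit = {
        require(delta <= memoryFree, "cannot shrink below the amount currently in use")
        _poolSize -= delta
      }

      // Grant as much of the request as currently fits; return the granted amount.
      def acquire(numBytes: Long): Long = {
        val granted = math.min(numBytes, memoryFree)
        used += granted
        granted
      }

      def release(numBytes: Long): Unit = { used = math.max(0L, used - numBytes) }
    }
    ```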
* [HOTFIX] Fix python tests after #9527 (Michael Armbrust, 2015-11-06; 1 file, -1/+1)
    #9527 missed updating the python tests.
    Author: Michael Armbrust <michael@databricks.com>  Closes #9533 from marmbrus/hotfixTextValue.
* [SPARK-11546] Thrift server makes too many logs about result schema (navis.ryu, 2015-11-06; 1 file, -11/+13)
    SparkExecuteStatementOperation logs the result schema for every getNextRowSet() call, which by default happens every 1000 rows, overwhelming the whole log file.
    Author: navis.ryu <navis@apache.org>  Closes #9514 from navis/SPARK-11546.
* [SPARK-9241][SQL] Supporting multiple DISTINCT columns (2) - Rewriting Rule (Herman van Hovell, 2015-11-06; 6 files, -44/+238)
    The second PR for SPARK-9241: this adds support for multiple distinct columns to the new aggregation code path. It solves the multiple DISTINCT column problem by rewriting these Aggregates into an Expand-Aggregate-Aggregate combination. See the [JIRA ticket](https://issues.apache.org/jira/browse/SPARK-9241) for more information. The advantages over the competing [first PR](https://github.com/apache/spark/pull/9280) are:
    - This can use the faster TungstenAggregate code path.
    - It is impossible to OOM due to an ```OpenHashSet``` allocating too much memory.
    However, this will multiply the number of input rows by the number of distinct clauses (plus one) and puts a lot more memory pressure on the aggregation code path itself. The location of this Rule is a bit funny and should probably change when the old aggregation path is changed. cc yhuai - could you also tell me where to add tests for this?
    Author: Herman van Hovell <hvanhovell@questtec.nl>  Closes #9406 from hvanhovell/SPARK-9241-rewriter.
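    For context, a hedged spark-shell sketch (assuming a `sqlContext` and the 1.x DataFrame API; the data is made up) of the kind of query with multiple DISTINCT aggregates that this rewrite targets:
    ```scala
    import sqlContext.implicits._
    import org.apache.spark.sql.functions.countDistinct

    // Two DISTINCT aggregates over different columns in one grouping. The rewrite
    // described above expands each input row once per distinct clause (plus one)
    // so that a single Aggregate pair can compute all of them.
    val sales = Seq((1, "a", 10), (1, "b", 10), (2, "a", 30)).toDF("store", "item", "price")
    sales.groupBy("store")
      .agg(countDistinct("item"), countDistinct("price"))
      .show()
    ```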
* [SPARK-11410][PYSPARK] Add python bindings for repartition and sortWithinPartitions (Nong Li, 2015-11-06; 1 file, -16/+101)
    Author: Nong Li <nong@databricks.com>  Closes #9504 from nongli/spark-11410.
* [SPARK-11269][SQL] Java API support & test cases for Dataset (Wenchen Fan, 2015-11-06; 8 files, -12/+644)
    This simply brings https://github.com/apache/spark/pull/9358 up-to-date.
    Author: Wenchen Fan <wenchen@databricks.com>  Author: Reynold Xin <rxin@databricks.com>  Closes #9528 from rxin/dataset-java.
* [SPARK-11555] spark on yarn spark-class --num-workers doesn't work (Thomas Graves, 2015-11-06; 2 files, -3/+6)
    I tested the various ways of specifying the number of executors with both spark-submit and spark-class, in both client and cluster mode where applicable: --num-workers, --num-executors, spark.executor.instances, SPARK_EXECUTOR_INSTANCES, and the default with nothing supplied.
    Author: Thomas Graves <tgraves@staydecay.corp.gq1.yahoo.com>  Closes #9523 from tgravescs/SPARK-11555.
* [SPARK-11217][ML] save/load for non-meta estimators and transformers (Xiangrui Meng, 2015-11-06; 7 files, -4/+469)
    This PR implements the default save/load for non-meta estimators and transformers using the JSON serialization of param values. The saved metadata includes:
    * class name
    * uid
    * timestamp
    * paramMap
    The save/load interface is similar to DataFrames. We use the current active context by default, which should be sufficient for most use cases.
    ~~~scala
    instance.save("path")
    instance.write.context(sqlContext).overwrite().save("path")
    Instance.load("path")
    ~~~
    The param handling is different from the design doc. We didn't save default and user-set params separately, and when we load it back, all parameters are user-set. This does cause issues, but it also causes other issues if we modify the default params.
    TODOs:
    * [x] Java test
    * [ ] a follow-up PR to implement default save/load for all non-meta estimators and transformers
    cc jkbradley
    Author: Xiangrui Meng <meng@databricks.com>  Closes #9454 from mengxr/SPARK-11217.
* [SPARK-11561][SQL] Rename text data source's column name to value (Reynold Xin, 2015-11-06; 2 files, -5/+3)
    Author: Reynold Xin <rxin@databricks.com>  Closes #9527 from rxin/SPARK-11561.
* [SPARK-11450][SQL] Add Unsafe Row processing to Expand (Herman van Hovell, 2015-11-06; 4 files, -14/+73)
    This PR enables the Expand operator to process and produce Unsafe Rows.
    Author: Herman van Hovell <hvanhovell@questtec.nl>  Closes #9414 from hvanhovell/SPARK-11450.
* [SPARK-10116][CORE] XORShiftRandom.hashSeed is random in high bits (Imran Rashid, 2015-11-06; 15 files, -61/+128)
    https://issues.apache.org/jira/browse/SPARK-10116
    This is really trivial, just happened to notice it: if `XORShiftRandom.hashSeed` is really supposed to have random bits throughout (as the comment implies), it needs to do something for the conversion to `long`. mengxr mkolod
    Author: Imran Rashid <irashid@cloudera.com>  Closes #8314 from squito/SPARK-10116.
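    As a hedged sketch of one way to get varying high bits (this mirrors the general idea, not necessarily the exact code that was merged):
    ```scala
    import scala.util.hashing.MurmurHash3

    // Hash the seed's bytes twice so that both the low and the high 32 bits of the
    // resulting Long vary, instead of zero-extending a 32-bit hash into a Long.
    def hashSeedSketch(seed: Long): Long = {
      val bytes = java.nio.ByteBuffer.allocate(8).putLong(seed).array()
      val lowBits = MurmurHash3.bytesHash(bytes)
      val highBits = MurmurHash3.bytesHash(bytes, lowBits) // second pass, differently seeded
      (highBits.toLong << 32) | (lowBits.toLong & 0xFFFFFFFFL)
    }
    ```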
* Typo fixes + code readability improvements (Jacek Laskowski, 2015-11-06; 4 files, -17/+21)
    Author: Jacek Laskowski <jacek.laskowski@deepsense.io>  Closes #9501 from jaceklaskowski/typos-with-style.
* [SPARK-9858][SQL] Add an ExchangeCoordinator to estimate the number of post-shuffle partitions for aggregates and joins (follow-up) (Yin Huai, 2015-11-06; 4 files, -62/+167)
    https://issues.apache.org/jira/browse/SPARK-9858
    This PR is the follow-up work of https://github.com/apache/spark/pull/9276. It addresses JoshRosen's comments.
    Author: Yin Huai <yhuai@databricks.com>  Closes #9453 from yhuai/numReducer-followUp.
* [SPARK-10978][SQL][FOLLOW-UP] More comprehensive tests for PR #9399 (Cheng Lian, 2015-11-06; 3 files, -46/+321)
    This PR adds test cases covering various column pruning and filter push-down scenarios.
    Author: Cheng Lian <lian@databricks.com>  Closes #9468 from liancheng/spark-10978.follow-up.
* [SPARK-9162][SQL] Implement code generation for ScalaUDF (Liang-Chi Hsieh, 2015-11-06; 2 files, -2/+124)
    JIRA: https://issues.apache.org/jira/browse/SPARK-9162
    Currently ScalaUDF extends CodegenFallback and doesn't provide a code generation implementation. This patch implements code generation for ScalaUDF.
    Author: Liang-Chi Hsieh <viirya@appier.com>  Closes #9270 from viirya/scalaudf-codegen.
* [SPARK-11511][STREAMING] Fix NPE when an InputDStream is not used (Shixiong Zhu, 2015-11-06; 2 files, -1/+18)
    Just ignore `InputDStream`s that have a null `rememberDuration` in `DStreamGraph.getMaxInputStreamRememberDuration`.
    Author: Shixiong Zhu <shixiong@databricks.com>  Closes #9476 from zsxwing/SPARK-11511.
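    A hedged sketch of the null-filtering idea (an illustrative helper, not the actual method body):
    ```scala
    import org.apache.spark.streaming.Duration

    // Take the maximum remember duration while skipping input streams that never set one.
    def maxRememberSketch(durations: Seq[Duration]): Option[Duration] =
      durations
        .filter(_ != null)
        .reduceOption((a, b) => if (a.milliseconds >= b.milliseconds) a else b)
    ```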
* [SPARK-11453][SQL][FOLLOW-UP] remove DecimalLit (Wenchen Fan, 2015-11-06; 3 files, -29/+35)
    A cleanup for https://github.com/apache/spark/pull/9085. `DecimalLit` is very similar to `FloatLit`, so we can just keep one of them. Also added a low-level unit test in `SqlParserSuite`.
    Author: Wenchen Fan <wenchen@databricks.com>  Closes #9482 from cloud-fan/parser.
* [SPARK-11541][SQL] Break JdbcDialects.scala into multiple files and mark various dialects as private (Reynold Xin, 2015-11-05; 10 files, -187/+332)
    Author: Reynold Xin <rxin@databricks.com>  Closes #9511 from rxin/SPARK-11541.
* [SPARK-11528][SQL] Typed aggregations for Datasets (Michael Armbrust, 2015-11-05; 4 files, -3/+132)
    This PR adds the ability to do typed SQL aggregations. We will likely also want to provide an interface to allow users to do aggregations on objects, but this is deferred to another PR.
    ```scala
    val ds = Seq(("a", 10), ("a", 20), ("b", 1), ("b", 2), ("c", 1)).toDS()
    ds.groupBy(_._1).agg(sum("_2").as[Int]).collect()

    res0: Array(("a", 30), ("b", 3), ("c", 1))
    ```
    Author: Michael Armbrust <michael@databricks.com>  Closes #9499 from marmbrus/dataset-agg.
* [SPARK-7542][SQL] Support off-heap index/sort buffer (Davies Liu, 2015-11-05; 17 files, -189/+265)
    This adds off-heap memory support for the arrays inside BytesToBytesMap and InMemorySorter, so that all execution memory can be allocated off-heap.
    Closes #8068
    Author: Davies Liu <davies@databricks.com>  Closes #9477 from davies/unsafe_timsort.
* [SPARK-11540][SQL] API audit for QueryExecutionListener (Reynold Xin, 2015-11-05; 2 files, -59/+72)
    Author: Reynold Xin <rxin@databricks.com>  Closes #9509 from rxin/SPARK-11540.
* [SPARK-11538][BUILD] Force guava 14 in sbt build (Marcelo Vanzin, 2015-11-05; 1 file, -1/+10)
    sbt's version resolution code always picks the most recent version, and we don't want that for guava.
    Author: Marcelo Vanzin <vanzin@cloudera.com>  Closes #9508 from vanzin/SPARK-11538.
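    For illustration, the usual sbt mechanism for pinning a transitive dependency looks like the following (a sketch of the general technique; the actual change in Spark's build may be expressed differently):
    ```scala
    // build.sbt (sketch): force a specific guava version regardless of what transitive
    // dependencies request, instead of letting resolution pick the most recent one.
    dependencyOverrides += "com.google.guava" % "guava" % "14.0.1"
    ```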
* [SPARK-11457][STREAMING][YARN] Fix incorrect AM proxy filter conf recovery from checkpoint (jerryshao, 2015-11-05; 1 file, -1/+12)
    Currently the YARN AM proxy filter configuration is recovered from the checkpoint file when a Spark Streaming application is restarted, which leads to some unwanted behaviors:
    1. Wrong RM address if the RM is redeployed after a failure.
    2. Wrong proxyBase, since the app id is updated and the old app id used for proxyBase is no longer valid.
    So instead of recovering them from the checkpoint file, these configurations should be reloaded each time the app starts. This problem only exists in YARN cluster mode; in YARN client mode these configurations are updated with the RPC message `AddWebUIFilter`. Please help to review tdas harishreedharan vanzin, thanks a lot.
    Author: jerryshao <sshao@hortonworks.com>  Closes #9412 from jerryshao/SPARK-11457.
* [SPARK-11514][ML] Pass random seed to spark.ml DecisionTree* (Yu ISHIKAWA, 2015-11-05; 5 files, -7/+14)
    cc jkbradley
    Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>  Closes #9486 from yu-iskw/SPARK-11514.
* Revert "[SPARK-11469][SQL] Allow users to define nondeterministic udfs." (Reynold Xin, 2015-11-05; 6 files, -262/+78)
    This reverts commit 9cf56c96b7d02a14175d40b336da14c2e1c88339.
* [SPARK-11537][SQL] fix negative hours/minutes/seconds (Davies Liu, 2015-11-05; 2 files, -8/+28)
    Currently, if the Timestamp is before the epoch (1970/01/01), the hours, minutes and seconds will be negative (also rounding up).
    Author: Davies Liu <davies@databricks.com>  Closes #9502 from davies/neg_hour.
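    A hedged illustration of the floored-modulo idea that keeps these fields non-negative for pre-epoch timestamps (names and constants here are illustrative, not the exact code in the patch):
    ```scala
    // Microseconds since 1970-01-01 can be negative; a plain % yields a negative
    // remainder, so use a floored modulo before extracting the hour.
    def floorMod(a: Long, n: Long): Long = ((a % n) + n) % n

    val microsPerHour = 60L * 60 * 1000 * 1000
    def hourOf(micros: Long): Long = floorMod(micros, 24 * microsPerHour) / microsPerHour

    hourOf(-2 * microsPerHour)  // 22, rather than a negative hour
    ```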
* [SPARK-11542][SPARKR] fix glm with long formula (Davies Liu, 2015-11-05; 2 files, -1/+14)
    Because deparse() breaks a long formula string into multiple lines, the deserialization will fail.
    Author: Davies Liu <davies@databricks.com>  Closes #9510 from davies/fix_glm.
* [SPARK-11536][SQL] Remove the internal implicit conversion from Expression to Column in functions.scala (Reynold Xin, 2015-11-05; 1 file, -281/+299)
    Author: Reynold Xin <rxin@databricks.com>  Closes #9505 from rxin/SPARK-11536.
* [SPARK-10656][SQL] completely support special chars in DataFrame (Wenchen Fan, 2015-11-05; 2 files, -6/+16)
    The main problem is that we interpret column names with special handling of `.` for DataFrame. This lets us write something like `df("a.b")` to get the field `b` of `a`. However, we don't need this feature in `DataFrame.apply("*")` or `DataFrame.withColumnRenamed`: in these two cases the column name is already the final name, and no extra interpretation is needed. The solution is simple: use `queryExecution.analyzed.output` to get the resolved column directly, instead of using `DataFrame.resolve`.
    Closes https://github.com/apache/spark/pull/8811
    Author: Wenchen Fan <wenchen@databricks.com>  Closes #9462 from cloud-fan/special-chars.
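    An illustrative spark-shell sketch (the column name and data are made up; assumes a `sqlContext` with the 1.x DataFrame API):
    ```scala
    import org.apache.spark.sql.functions.col

    // A column whose name literally contains a dot.
    val df = sqlContext.range(3).withColumn("a.b", col("id") * 2)

    // withColumnRenamed should treat "a.b" as a literal column name to rename,
    // not as field `b` of a struct column `a`.
    df.withColumnRenamed("a.b", "ab").show()
    ```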
* [SPARK-11260][SPARKR] with() function support (adrian555, 2015-11-05; 5 files, -6/+51)
    Author: adrian555 <wzhuang@us.ibm.com>  Author: Adrian Zhuang <adrian555@users.noreply.github.com>  Closes #9443 from adrian555/with.
* [SPARK-11532][SQL] Remove implicit conversion from Expression to Column (Reynold Xin, 2015-11-05; 1 file, -52/+66)
    Author: Reynold Xin <rxin@databricks.com>  Closes #9500 from rxin/SPARK-11532.
* [SPARK-10648] Oracle dialect to handle nonspecific numeric types (Travis Hegner, 2015-11-05; 1 file, -0/+25)
    This is the alternative, agreed-upon solution to PR #8780: create an OracleDialect to handle the nonspecific numeric types that can be defined in Oracle.
    Author: Travis Hegner <thegner@trilliumit.com>  Closes #9495 from travishegner/OracleDialect.
* [SPARK-10265][DOCUMENTATION, ML] Fixed @Since annotation to ml.regression (Ehsan M.Kermani, 2015-11-05; 5 files, -18/+119)
    Here is my first commit.
    Author: Ehsan M.Kermani <ehsanmo1367@gmail.com>  Closes #8728 from ehsanmok/SinceAnn.
* [SPARK-11513][SQL] Remove implicit conversion from LogicalPlan to DataFrame (Reynold Xin, 2015-11-05; 2 files, -50/+78)
    This internal implicit conversion has been a source of confusion for a lot of new developers.
    Author: Reynold Xin <rxin@databricks.com>  Closes #9479 from rxin/SPARK-11513.
* [SPARK-11484][WEBUI] Using proxyBase set by spark AM (Srinivasa Reddy Vundela, 2015-11-05; 1 file, -8/+4)
    Use the proxyBase set by the AM, and fall back to the environment variable only if it is not found. This fixes the issue where somebody accidentally sets APPLICATION_WEB_PROXY_BASE to the wrong proxyBase.
    Author: Srinivasa Reddy Vundela <vsr@cloudera.com>  Closes #9448 from vundela/master.
* [SPARK-11473][ML] R-like summary statistics with intercept for OLS via normal equation solver (Yanbo Liang, 2015-11-05; 4 files, -41/+48)
    Following up on [SPARK-9836](https://issues.apache.org/jira/browse/SPARK-9836), we should also support summary statistics for the ```intercept```.
    Author: Yanbo Liang <ybliang8@gmail.com>  Closes #9485 from yanboliang/spark-11473.
* [SPARK-11474][SQL] change fetchSize to fetchsize (Huaxin Gao, 2015-11-05; 1 file, -1/+2)
    In DefaultDataSource.scala there is
    `override def createRelation(sqlContext: SQLContext, parameters: Map[String, String]): BaseRelation`
    where `parameters` is a CaseInsensitiveMap. After the line
    `parameters.foreach(kv => properties.setProperty(kv._1, kv._2))`
    `properties` holds all-lower-case key/value pairs, so fetchSize becomes fetchsize. However, the compute method in JDBCRDD has
    `val fetchSize = properties.getProperty("fetchSize", "0").toInt`
    so the fetchSize value is always 0 and never gets set correctly.
    Author: Huaxin Gao <huaxing@oc0558782468.ibm.com>  Closes #9473 from huaxingao/spark-11474.
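    A standalone sketch of the mismatch (plain Scala, hypothetical values), showing why a case-sensitive lookup misses keys copied from a lower-casing map:
    ```scala
    // Keys arriving from a case-insensitive map are stored lower-cased.
    val params = Map("fetchsize" -> "100")

    val props = new java.util.Properties()
    params.foreach { case (k, v) => props.setProperty(k, v) }

    props.getProperty("fetchSize", "0").toInt  // 0: Properties lookups are case-sensitive
    props.getProperty("fetchsize", "0").toInt  // 100
    ```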
* [SPARK-11501][CORE][YARN] Propagate spark.rpc config to executors (Nishkam Ravi, 2015-11-05; 1 file, -0/+1)
    spark.rpc is supposed to be configurable but currently is not: it doesn't get propagated to executors because RpcEnv.create is called before driver properties are fetched.
    Author: Nishkam Ravi <nishkamravi@gmail.com>  Closes #9460 from nishkamravi2/master_akka.
* [SPARK-11527][ML][PYSPARK] PySpark AFTSurvivalRegressionModel should expose coefficients/intercept/scale (Yanbo Liang, 2015-11-05; 1 file, -0/+24)
    PySpark ```AFTSurvivalRegressionModel``` should expose coefficients/intercept/scale. mengxr vectorijk
    Author: Yanbo Liang <ybliang8@gmail.com>  Closes #9492 from yanboliang/spark-11527.
* [MINOR][ML][DOC] Rename weights to coefficients in user guide (Yanbo Liang, 2015-11-05; 1 file, -12/+12)
    We should use ```coefficients``` rather than ```weights``` in the user guide so that newcomers pick up the right conventional name at the outset. mengxr vectorijk
    Author: Yanbo Liang <ybliang8@gmail.com>  Closes #9493 from yanboliang/docs-coefficients.
* [MINOR][SQL] A minor log line fix (Cheng Lian, 2015-11-05; 1 file, -1/+2)
    `jars` in the log line is an array, so `$jars` doesn't print its content.
    Author: Cheng Lian <lian@databricks.com>  Closes #9494 from liancheng/minor.log-fix.
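    To illustrate with a generic Scala sketch (not the actual log statement):
    ```scala
    val jars = Array("a.jar", "b.jar")

    // Interpolating the array directly prints its toString, e.g. "[Ljava.lang.String;@1f32e575".
    println(s"Initializing with jars: $jars")

    // Joining the elements prints the actual content.
    println(s"Initializing with jars: ${jars.mkString(", ")}")
    ```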
* [SPARK-11506][MLLIB] Removed redundant operation in Online LDA implementation (a1singh, 2015-11-05; 1 file, -1/+1)
    In LDAOptimizer.scala, line 441: since "idx" was never used, replaced the unneeded zipWithIndex.foreach with foreach.
    - nonEmptyDocs.zipWithIndex.foreach { case ((_, termCounts: Vector), idx: Int) =>
    + nonEmptyDocs.foreach { case (_, termCounts: Vector) =>
    Author: a1singh <a1singh@ucsd.edu>  Closes #9456 from a1singh/master.
* [SPARK-11449][CORE] PortableDataStream should be a factory (Herman van Hovell, 2015-11-05; 1 file, -29/+16)
    ```PortableDataStream``` maintains some internal state, which makes it tricky to reuse a stream (one needs to call ```close``` on both the ```PortableDataStream``` and the ```InputStream``` it produces). This PR removes all state from ```PortableDataStream``` and effectively turns it into an ```InputStream```/```Array[Byte]``` factory, making the user responsible for managing the ```InputStream``` it returns. cc srowen
    Author: Herman van Hovell <hvanhovell@questtec.nl>  Closes #9417 from hvanhovell/SPARK-11449.
* [SPARK-11378][STREAMING] make StreamingContext.awaitTerminationOrTimeout return properly (Nick Evans, 2015-11-05; 2 files, -1/+8)
    This adds a failing test checking that `awaitTerminationOrTimeout` returns the expected value, and then fixes that failing test with the addition of a `return`. tdas zsxwing
    Author: Nick Evans <me@nicolasevans.org>  Closes #9336 from manygrams/fix_await_termination_or_timeout.
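    A hedged usage sketch of what callers rely on (assumes `ssc` is a started `StreamingContext`):
    ```scala
    // The Boolean result tells the caller whether the context terminated within the
    // timeout, so it can decide whether to keep waiting or to shut down.
    val terminated: Boolean = ssc.awaitTerminationOrTimeout(10000L)
    if (!terminated) {
      ssc.stop(stopSparkContext = true, stopGracefully = false)
    }
    ```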
* [SPARK-11440][CORE][STREAMING][BUILD] Declare rest of @Experimental items non-experimental if they've existed since 1.2.0 (Sean Owen, 2015-11-05; 15 files, -75/+2)
    Remove `Experimental` annotations in core, streaming for items that existed in 1.2.0 or before. The changes are:
    * SparkContext
      * binary{Files,Records} : 1.2.0
      * submitJob : 1.0.0
    * JavaSparkContext
      * binary{Files,Records} : 1.2.0
    * DoubleRDDFunctions, JavaDoubleRDD
      * {mean,sum}Approx : 1.0.0
    * PairRDDFunctions, JavaPairRDD
      * sampleByKeyExact : 1.2.0
      * countByKeyApprox : 1.0.0
    * PairRDDFunctions
      * countApproxDistinctByKey : 1.1.0
    * RDD
      * countApprox, countByValueApprox, countApproxDistinct : 1.0.0
    * JavaRDDLike
      * countApprox : 1.0.0
    * PythonHadoopUtil.Converter : 1.1.0
    * PortableDataStream : 1.2.0 (related to binaryFiles)
    * BoundedDouble : 1.0.0
    * PartialResult : 1.0.0
    * StreamingContext, JavaStreamingContext
      * binaryRecordsStream : 1.2.0
    * HiveContext
      * analyze : 1.2.0
    Author: Sean Owen <sowen@cloudera.com>  Closes #9396 from srowen/SPARK-11440.