Commit message | Author | Date | Files | Lines (-/+)
* Preparing Spark release v1.3.1-rc3 [tag: v1.3.1] | Patrick Wendell | 2015-04-11 | 28 | -28/+28
* Revert "Preparing Spark release v1.3.1-rc2"Patrick Wendell2015-04-1028-28/+28
| | | | This reverts commit 7c4473aa5a7f5de0323394aaedeefbf9738e8eb5.
* Revert "Preparing development version 1.3.2-SNAPSHOT"Patrick Wendell2015-04-1028-28/+28
| | | | This reverts commit cdef7d080aa3f473f5ea06ba816c01b41a0239eb.
* [SPARK-6851][SQL] Create new instance for each converted parquet relation | Michael Armbrust | 2015-04-10 | 2 | -1/+80

    Otherwise we end up rewriting predicates to be trivially equal (i.e. `a#1 = a#2` -> `a#3 = a#3`), at which point the query is no longer valid.

    Author: Michael Armbrust <michael@databricks.com>
    Closes #5458 from marmbrus/selfJoinParquet and squashes the following commits:
      22df77c [Michael Armbrust] [SPARK-6851][SQL] Create new instance for each converted parquet relation
    (cherry picked from commit 23d5f8864f7d665a74b1d38118700139854dbb1c)
    Signed-off-by: Michael Armbrust <michael@databricks.com>
    Conflicts:
      sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala
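
    A hedged sketch of the failure mode (the table name `t`, column `key`, and the `sqlContext` in scope are illustrative, not from the patch): a self-join where both sides resolve to the same shared parquet relation instance, so their attribute IDs collide.

        import sqlContext.implicits._

        // Both sides of the join come from the same converted parquet relation.
        // With a single shared instance, the join predicate below can be
        // rewritten to compare an attribute with itself and become trivially true.
        val a = sqlContext.table("t").as("a")
        val b = sqlContext.table("t").as("b")
        a.join(b, $"a.key" === $"b.key").collect()
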
* [SPARK-5969][PySpark] Fix descending pyspark.rdd.sortByKey. | Milan Straka | 2015-04-10 | 2 | -1/+12

    The samples should always be sorted in ascending order, because bisect.bisect_left is used on them. The reverse order of the result is already achieved in rangePartitioner by reversing the found index. The current implementation also works, but always uses only two partitions, the first one and the last one (because bisect_left returns either "beginning" or "end" for a descending sequence).

    Author: Milan Straka <fox@ucw.cz>
    This patch had conflicts when merged, resolved by
    Committer: Josh Rosen <joshrosen@databricks.com>
    Closes #4761 from foxik/fix-descending-sort and squashes the following commits:
      95896b5 [Milan Straka] Add regression test for SPARK-5969.
      5757490 [Milan Straka] Fix descending pyspark.rdd.sortByKey.
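
    Why a descending sample degenerates to the endpoints: a minimal, self-contained Scala rendering of the lower-bound search that bisect.bisect_left implements (illustrative, not Spark code).

        // Lower-bound binary search, which assumes ASCENDING input.
        def lowerBound(sorted: IndexedSeq[Int], key: Int): Int = {
          var (lo, hi) = (0, sorted.length)
          while (lo < hi) {
            val mid = (lo + hi) >>> 1
            if (sorted(mid) < key) lo = mid + 1 else hi = mid
          }
          lo
        }

        // Fed a DESCENDING sample of partition bounds, the comparisons walk a
        // single root-to-leaf path, so the result is always 0 or sorted.length:
        val descendingBounds = IndexedSeq(90, 70, 50, 30, 10)
        println(lowerBound(descendingBounds, 60)) // 5 -> last partition
        println(lowerBound(descendingBounds, 95)) // 5 -> last partition
        println(lowerBound(descendingBounds, 40)) // 0 -> first partition
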
* [SPARK-6343] Doc driver-worker network reqs | Peter Parente | 2015-04-09 | 3 | -1/+5

    Attempt at making the driver-worker networking requirement more explicit and up-front in the documentation (see https://issues.apache.org/jira/browse/SPARK-6343). Update cluster overview diagram to show connections from workers to driver. Add a bullet below about how driver listens / accepts connections from workers.

    Author: Peter Parente <pparent@us.ibm.com>
    Closes #5382 from parente/SPARK-6343 and squashes the following commits:
      0b2fb9d [Peter Parente] [SPARK-6343] Doc driver-worker network reqs
    (cherry picked from commit b9c51c04932efeeda790752276078314db440634)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
* [SPARK-6767][SQL] Fixed Query DSL error in spark sql Readme | Tijo Thomas | 2015-04-08 | 1 | -1/+1

    Fixed the following error:

        query.where('key > 30).select(avg('key)).collect()
        <console>:43: error: value > is not a member of Symbol
               query.where('key > 30).select(avg('key)).collect()

    Author: Tijo Thomas <tijoparacka@gmail.com>
    Closes #5415 from tijoparacka/ERROR_SQL_DATAFRAME_EXAMPLE and squashes the following commits:
      234751e [Tijo Thomas] Fixed Query DSL error in spark sql Readme
    (cherry picked from commit 2f482d706b9d38820472c3152dbd1612c98729bd)
    Signed-off-by: Reynold Xin <rxin@databricks.com>
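
    The patch itself edits the SQL README; as a hedged illustration (not necessarily the exact wording the patch chose, and assuming `query` is a DataFrame with a `sqlContext` in scope), an equivalent expression that does compile against the 1.3 DataFrame API:

        import org.apache.spark.sql.functions.avg
        import sqlContext.implicits._ // brings the $"..." column syntax into scope

        query.where($"key" > 30).select(avg($"key")).collect()
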
* [SPARK-6781] [SQL] use sqlContext in python shell | Davies Liu | 2015-04-08 | 12 | -70/+69

    Use `sqlContext` in PySpark shell, make it consistent with SQL programming guide. `sqlCtx` is also kept for compatibility.

    Author: Davies Liu <davies@databricks.com>
    Closes #5425 from davies/sqlCtx and squashes the following commits:
      af67340 [Davies Liu] sqlCtx -> sqlContext
      15a278f [Davies Liu] use sqlContext in python shell
    (cherry picked from commit 6ada4f6f52cf1d992c7ab0c32318790cf08b0a0d)
    Signed-off-by: Reynold Xin <rxin@databricks.com>
* [SPARK-6753] Clone SparkConf in ShuffleSuite tests | Kay Ousterhout | 2015-04-08 | 1 | -2/+2

    Prior to this change, the unit test for SPARK-3426 did not clone the original SparkConf, which meant that that test did not use the options set by suites that subclass ShuffleSuite.scala. This commit fixes that problem.

    JoshRosen would be great if you could take a look at this, since you wrote this test originally.

    Author: Kay Ousterhout <kayousterhout@gmail.com>
    Closes #5401 from kayousterhout/SPARK-6753 and squashes the following commits:
      368c540 [Kay Ousterhout] [SPARK-6753] Clone SparkConf in ShuffleSuite tests
    (cherry picked from commit 9d44ddce1d1e19011026605549c37d0db6d6afa1)
    Signed-off-by: Josh Rosen <joshrosen@databricks.com>
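
    A hedged sketch of the pattern (the config keys are illustrative, not copied from the test, and `conf` is assumed to be the suite's SparkConf): start from the suite's conf so subclass settings survive, then layer test-specific options on top.

        // SparkConf.clone() preserves whatever the enclosing suite configured;
        // building `new SparkConf()` here would silently discard those options.
        val testConf = conf.clone()
          .set("spark.shuffle.compress", "true")
          .set("spark.shuffle.spill.compress", "false")
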
* [SPARK-6506] [pyspark] Do not try to retrieve SPARK_HOME when not needed. | Marcelo Vanzin | 2015-04-08 | 1 | -2/+1

    In particular, the old behavior makes pyspark in yarn-cluster mode fail unless SPARK_HOME is set, when it's not really needed.

    Author: Marcelo Vanzin <vanzin@cloudera.com>
    Closes #5405 from vanzin/SPARK-6506 and squashes the following commits:
      e184507 [Marcelo Vanzin] [SPARK-6506] [pyspark] Do not try to retrieve SPARK_HOME when not needed.
    (cherry picked from commit f7e21dd1ec4541be54eb01d8b15cfcc6714feed0)
    Signed-off-by: Josh Rosen <joshrosen@databricks.com>
* Preparing development version 1.3.2-SNAPSHOT | Patrick Wendell | 2015-04-08 | 28 | -28/+28
* Preparing Spark release v1.3.1-rc2 | Patrick Wendell | 2015-04-08 | 28 | -28/+28
* Revert "Preparing Spark release v1.3.1-rc1"Patrick Wendell2015-04-0728-28/+28
| | | | This reverts commit 0dcb5d9f31b713ed90bcec63ebc4e530cbb69851.
* Revert "Preparing development version 1.3.2-SNAPSHOT"Patrick Wendell2015-04-0728-28/+28
| | | | This reverts commit 728c1f927822eb6b12f04dc47109feb6fbe02ec2.
* [SPARK-6737] Fix memory leak in OutputCommitCoordinator | Josh Rosen | 2015-04-07 | 3 | -29/+42

    This patch fixes a memory leak in the DAGScheduler, which caused us to leak a map entry per submitted stage. The problem is that the OutputCommitCoordinator needs to be informed when stages end in order to remove entries from its `authorizedCommitters` map, but the DAGScheduler only called it in one of the four code paths that are used to mark stages as completed.

    This patch fixes this issue by consolidating the processing of stage completion into a new `markStageAsFinished` method and updates DAGSchedulerSuite's `assertDataStructuresEmpty` assertion to also check the OutputCommitCoordinator data structures. I've also added a comment at the top of DAGScheduler so that we remember to update this test when adding new data structures.

    Author: Josh Rosen <joshrosen@databricks.com>
    Closes #5397 from JoshRosen/SPARK-6737 and squashes the following commits:
      af3b02f [Josh Rosen] Consolidate stage completion handling code in a single method.
      e96ce3a [Josh Rosen] Consolidate stage completion handling code in a single method.
      3052aea [Josh Rosen] Comment update
      7896899 [Josh Rosen] Fix SPARK-6737 by informing OutputCommitCoordinator of all stage end events.
      4ead1dc [Josh Rosen] Add regression tests for SPARK-6737
    (cherry picked from commit c83e03948b184ffb3a9418fecc4d2c26ae33b057)
    Signed-off-by: Josh Rosen <joshrosen@databricks.com>
    Conflicts:
      core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
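
    A self-contained model of the leak and the fix, with names following the commit message but everything else illustrative (this is not Spark's actual scheduler code): per-stage coordinator state must be released on every completion path, so completion handling is funneled through one method.

        import scala.collection.mutable

        class Coordinator {
          // stageId -> authorized committer state; leaks if never removed
          val authorizedCommitters = mutable.Map.empty[Int, Long]
          def stageStart(stageId: Int): Unit = authorizedCommitters(stageId) = -1L
          def stageEnd(stageId: Int): Unit = authorizedCommitters.remove(stageId)
        }

        class Scheduler(coordinator: Coordinator) {
          // The single funnel: success, failure, cancellation and abort all land here.
          private def markStageAsFinished(stageId: Int, error: Option[String]): Unit =
            coordinator.stageEnd(stageId)
          def stageSucceeded(id: Int): Unit = markStageAsFinished(id, None)
          def stageFailed(id: Int, msg: String): Unit = markStageAsFinished(id, Some(msg))
        }
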
* [SPARK-6636] Use public DNS hostname everywhere in spark_ec2.py | Matt Aasted | 2015-04-06 | 1 | -1/+1

    The spark_ec2.py script uses public_dns_name everywhere in the script except for testing ssh availability, which is done using the public ip address of the instances. This breaks the script for users who are deploying the cluster with a private-network-only security group. The fix is to use public_dns_name in the remaining place.

    Author: Matt Aasted <aasted@twitch.tv>
    Closes #5302 from aasted/master and squashes the following commits:
      60cf6ee [Matt Aasted] [SPARK-6636] Use public DNS hostname everywhere in spark_ec2.py
    (cherry picked from commit 6f0d55d76f758d217fd18ffa0ccf273d7ab0377b)
    Signed-off-by: Josh Rosen <joshrosen@databricks.com>
* SPARK-6205 [CORE] UISeleniumSuite fails for Hadoop 2.x test with NoClassDefFoundError | Sean Owen | 2015-04-06 | 2 | -0/+13

    Add xml-apis to core test deps to work around UISeleniumSuite classpath issue

    Author: Sean Owen <sowen@cloudera.com>
    Closes #4933 from srowen/SPARK-6205 and squashes the following commits:
      ddd4d32 [Sean Owen] Add xml-apis to core test deps to work around UISeleniumSuite classpath issue
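
    Spark's build is Maven-based; as a hedged sbt-style rendering of the dependency being added (the version is illustrative, not taken from the patch):

        libraryDependencies += "xml-apis" % "xml-apis" % "1.4.01" % "test"
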
* Preparing development version 1.3.2-SNAPSHOT | Patrick Wendell | 2015-04-04 | 28 | -28/+28
* Preparing Spark release v1.3.1-rc1 | Patrick Wendell | 2015-04-04 | 28 | -28/+28
* Version info and CHANGES.txt for 1.3.1 | Patrick Wendell | 2015-04-04 | 5 | -6/+735
* [SQL] Use path.makeQualified in newParquet. | Yin Huai | 2015-04-04 | 1 | -1/+2

    Author: Yin Huai <yhuai@databricks.com>
    Closes #5353 from yhuai/wrongFS and squashes the following commits:
      849603b [Yin Huai] Not use deprecated method.
      6d6ae34 [Yin Huai] Use path.makeQualified.
    (cherry picked from commit da25c86d64ff9ce80f88186ba083f6c21dd9a568)
    Signed-off-by: Cheng Lian <lian@databricks.com>
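
    A hedged illustration of the call in question (`path` and `hadoopConf` are assumed to be in scope; the non-deprecated overload mentioned in the commits takes the filesystem's URI and working directory):

        import org.apache.hadoop.fs.Path

        val fs = path.getFileSystem(hadoopConf)
        // Qualify the path against its actual FileSystem so it is not later
        // resolved against the wrong (default) filesystem.
        val qualified = path.makeQualified(fs.getUri, fs.getWorkingDirectory)
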
* [SPARK-6700] disable flaky test | Davies Liu | 2015-04-03 | 1 | -1/+2

    Author: Davies Liu <davies@databricks.com>
    Closes #5356 from davies/flaky and squashes the following commits:
      08955f4 [Davies Liu] disable flaky test
    (cherry picked from commit 9b40c17ab161b64933539abeefde443cb4f98673)
    Signed-off-by: Andrew Or <andrew@databricks.com>
* [SPARK-6688] [core] Always use resolved URIs in EventLoggingListener. | Marcelo Vanzin | 2015-04-03 | 6 | -19/+30

    Author: Marcelo Vanzin <vanzin@cloudera.com>
    Closes #5340 from vanzin/SPARK-6688 and squashes the following commits:
      ccfddd9 [Marcelo Vanzin] Resolve at the source.
      20d2a34 [Marcelo Vanzin] [SPARK-6688] [core] Always use resolved URIs in EventLoggingListener.
    (cherry picked from commit 14632b7942c02a332c4d3814fb6b2611e3f76fc7)
    Signed-off-by: Andrew Or <andrew@databricks.com>
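
    A minimal sketch of the "resolve at the source" idea (illustrative, not Spark's actual Utils.resolveURI): a raw path with no scheme is interpreted as a local file and made absolute once, so downstream consumers never see an ambiguous URI.

        import java.io.File
        import java.net.URI

        def resolve(raw: String): URI = {
          val uri = new URI(raw)
          if (uri.getScheme != null) uri           // already qualified, e.g. hdfs://...
          else new File(raw).getAbsoluteFile.toURI // schemeless -> absolute file: URI
        }
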
* [SPARK-6575][SQL] Converted Parquet Metastore tables no longer cache metadata | Yin Huai | 2015-04-03 | 3 | -12/+23

    https://issues.apache.org/jira/browse/SPARK-6575

    Author: Yin Huai <yhuai@databricks.com>
    This patch had conflicts when merged, resolved by
    Committer: Cheng Lian <lian@databricks.com>
    Closes #5339 from yhuai/parquetRelationCache and squashes the following commits:
      b0e1a42 [Yin Huai] Address comments.
      83d9846 [Yin Huai] Remove unnecessary change.
      c0dc7a4 [Yin Huai] Cache converted parquet relations.
    (cherry picked from commit c42c3fc7f7b79a1f6ce990d39b5d9d14ab19fcf0)
    Signed-off-by: Cheng Lian <lian@databricks.com>
* [SPARK-6621][Core] Fix the bug that calling EventLoop.stop in EventLoop.onReceive/onError/onStart doesn't call onStop | zsxwing | 2015-04-02 | 2 | -3/+87

    Author: zsxwing <zsxwing@gmail.com>
    Closes #5280 from zsxwing/SPARK-6621 and squashes the following commits:
      521125e [zsxwing] Fix the bug that calling EventLoop.stop in EventLoop.onReceive and EventLoop.onError doesn't call onStop
    (cherry picked from commit 440ea31b76aa7e813436271fd63880c7bcd69157)
    Signed-off-by: Josh Rosen <joshrosen@databricks.com>
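
    A hedged sketch of the scenario the fix covers (EventLoop is a Spark-internal utility, so this assumes access to it; the handler bodies are illustrative): stop() requested from inside onReceive must still run onStop.

        import org.apache.spark.util.EventLoop

        val loop = new EventLoop[Int]("example") {
          override def onReceive(event: Int): Unit = {
            if (event < 0) stop() // stop requested from the event thread itself
          }
          override def onError(e: Throwable): Unit = {}
          override def onStop(): Unit = println("cleaned up") // must run even in this case
        }
        loop.start()
        loop.post(-1)
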
* [SPARK-6345][STREAMING][MLLIB] Fix for training with prediction | freeman | 2015-04-02 | 5 | -3/+62

    This patch fixes a reported bug causing model updates to not properly propagate to model predictions during streaming regression. These minor changes in model declaration fix the problem, and I expanded the tests to include the scenario in which the bug was arising. The two new tests failed prior to the patch and now pass.

    cc mengxr

    Author: freeman <the.freeman.lab@gmail.com>
    Closes #5037 from freeman-lab/train-predict-fix and squashes the following commits:
      3af953e [freeman] Expand test coverage to include combined training and prediction
      8f84fc8 [freeman] Move model declaration
    (cherry picked from commit 6e1c1ec67bc4d7e5700f523ec08db6bb25bd2302)
    Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [CORE] The description of the jobHistory config should be spark.history.fs.logDirectory | KaiXinXiaoLei | 2015-04-02 | 1 | -1/+1

    The config option is spark.history.fs.logDirectory, not spark.fs.history.logDirectory, so the description should be changed. Thanks.

    Author: KaiXinXiaoLei <huleilei1@huawei.com>
    Closes #5332 from KaiXinXiaoLei/historyConfig and squashes the following commits:
      5ffbfb5 [KaiXinXiaoLei] the describe of jobHistory config is error
    (cherry picked from commit 8a0aa81ca37d337423db60edb09cf264cc2c6498)
    Signed-off-by: Andrew Or <andrew@databricks.com>
* [SPARK-6575][SQL] Converted Parquet Metastore tables no longer cache metadata | Yin Huai | 2015-04-02 | 2 | -6/+167

    https://issues.apache.org/jira/browse/SPARK-6575

    Author: Yin Huai <yhuai@databricks.com>
    Closes #5339 from yhuai/parquetRelationCache and squashes the following commits:
      83d9846 [Yin Huai] Remove unnecessary change.
      c0dc7a4 [Yin Huai] Cache converted parquet relations.
    (cherry picked from commit 4b82bd730a24f96d94dfea87420cfaa4253a5ccb)
    Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SPARK-6650] [core] Stop ExecutorAllocationManager when context stops. | Marcelo Vanzin | 2015-04-02 | 3 | -35/+48

    This fixes the thread leak. I also changed the unit test to keep track of allocated contexts and make sure they're closed after tests are run; this is needed since some tests use this pattern:

        val sc = createContext()
        doSomethingThatMayThrow()
        sc.stop()

    Author: Marcelo Vanzin <vanzin@cloudera.com>
    Closes #5311 from vanzin/SPARK-6650 and squashes the following commits:
      652c73b [Marcelo Vanzin] Nits.
      5711512 [Marcelo Vanzin] More exception safety.
      cc5a744 [Marcelo Vanzin] Stop alloc manager before scheduler.
      9886f69 [Marcelo Vanzin] [SPARK-6650] [core] Stop ExecutorAllocationManager when context stops.
    (cherry picked from commit 45134ec920c3766c22aefd4366b4b60ec99bd810)
    Signed-off-by: Andrew Or <andrew@databricks.com>
    Conflicts:
      core/src/main/scala/org/apache/spark/SparkContext.scala
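
    The problem with that pattern is that sc.stop() never runs if the body throws. A hedged sketch of the safer shape the test moves toward (the helper name and tracking set are illustrative, not from the patch):

        import scala.collection.mutable
        import org.apache.spark.{SparkConf, SparkContext}

        val allocated = mutable.Set.empty[SparkContext]

        def withContext(body: SparkContext => Unit): Unit = {
          val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("test"))
          allocated += sc                 // remembered so teardown can double-check
          try body(sc) finally sc.stop()  // stopped even when the body throws
        }
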
* [SPARK-6686][SQL] Use resolved output instead of names for toDF rename | Michael Armbrust | 2015-04-02 | 2 | -2/+10

    This is a workaround for a problem reported on the user list. This doesn't fix the core problem, but in general is a more robust way to do renames.

    Author: Michael Armbrust <michael@databricks.com>
    Closes #5337 from marmbrus/toDFrename and squashes the following commits:
      6a3159d [Michael Armbrust] [SPARK-6686][SQL] Use resolved output instead of names for toDF rename
    (cherry picked from commit 052dee0707830cfd3cd8821ecc3471a37ede294a)
    Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SPARK-6672][SQL] convert row to catalyst in createDataFrame(RDD[Row], ...) | Xiangrui Meng | 2015-04-02 | 7 | -8/+37

    We assume that `RDD[Row]` contains Scala types. So we need to convert them into catalyst types in createDataFrame. liancheng

    Author: Xiangrui Meng <meng@databricks.com>
    Closes #5329 from mengxr/SPARK-6672 and squashes the following commits:
      2d52644 [Xiangrui Meng] set needsConversion = false in jsonRDD
      06896e4 [Xiangrui Meng] add createDataFrame without conversion
      4a3767b [Xiangrui Meng] convert Row to catalyst
    (cherry picked from commit 424e987dfebbbaa37f4496d44090d469a931ce76)
    Signed-off-by: Michael Armbrust <michael@databricks.com>
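
    A hedged illustration of the call path this fixes (the schema and data are made up; `sc` and `sqlContext` are assumed in scope): rows built from plain Scala values are converted to catalyst types inside createDataFrame.

        import org.apache.spark.sql.Row
        import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

        val schema = StructType(Seq(
          StructField("name", StringType, nullable = true),
          StructField("age", IntegerType, nullable = true)))

        // Plain Scala values (String, Int) must be converted to catalyst types.
        val rows = sc.parallelize(Seq(Row("alice", 1), Row("bob", 2)))
        val df = sqlContext.createDataFrame(rows, schema)
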
* [SPARK-6618][SPARK-6669][SQL] Lock Hive metastore client correctly. | Yin Huai | 2015-04-02 | 4 | -27/+53

    Author: Yin Huai <yhuai@databricks.com>
    Author: Michael Armbrust <michael@databricks.com>
    Closes #5333 from yhuai/lookupRelationLock and squashes the following commits:
      59c884f [Michael Armbrust] [SQL] Lock metastore client in analyzeTable
      7667030 [Yin Huai] Merge pull request #2 from marmbrus/pr/5333
      e4a9b0b [Michael Armbrust] Correctly lock on MetastoreCatalog
      d6fc32f [Yin Huai] Missing `)`.
      1e241af [Yin Huai] Protect InsertIntoHive.
      fee7e9c [Yin Huai] A test?
      5416b0f [Yin Huai] Just protect client.
    (cherry picked from commit 5db89127e72630aec7c5552f2c84018ae18d03fe)
    Signed-off-by: Michael Armbrust <michael@databricks.com>
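
    A hedged sketch of the "just protect client" idea (names are illustrative; this is not the actual HiveMetastoreCatalog code): route every use of the shared, non-thread-safe metastore client through a single lock.

        object MetastoreAccess {
          private val clientLock = new Object
          // All callers (lookupRelation, analyzeTable, inserts, ...) go through
          // this helper instead of touching the shared client directly.
          def withHiveClient[A](body: => A): A = clientLock.synchronized(body)
        }
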
* [Minor] [SQL] Follow-up of PR #5210 | Cheng Lian | 2015-04-02 | 1 | -4/+5

    This PR addresses rxin's comments in PR #5210.

    Author: Cheng Lian <lian@databricks.com>
    Closes #5219 from liancheng/spark-6554-followup and squashes the following commits:
      41f3a09 [Cheng Lian] Addresses comments in #5210
    (cherry picked from commit d3944b6f2aeb36629bf89207629cc5e55d327241)
    Signed-off-by: Reynold Xin <rxin@databricks.com>
* [SPARK-6655][SQL] We need to read the schema of a data source table stored in spark.sql.sources.schema property | Yin Huai | 2015-04-02 | 2 | -4/+37

    https://issues.apache.org/jira/browse/SPARK-6655

    Author: Yin Huai <yhuai@databricks.com>
    Closes #5313 from yhuai/SPARK-6655 and squashes the following commits:
      1e00c03 [Yin Huai] Unnecessary change.
      f131bd9 [Yin Huai] Fix.
      f1218c1 [Yin Huai] Failed test.
    (cherry picked from commit 251698fb7335a3bb465f1cd0c29e7e74e0361f4a)
    Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SQL] Throw UnsupportedOperationException instead of NotImplementedError | Michael Armbrust | 2015-04-02 | 2 | -4/+3

    NotImplementedError in scala 2.10 is a fatal exception, which is not very nice to throw when not actually fatal.

    Author: Michael Armbrust <michael@databricks.com>
    Closes #5315 from marmbrus/throwUnsupported and squashes the following commits:
      c29e03b [Michael Armbrust] [SQL] Throw UnsupportedOperationException instead of NotImplementedError
      052e05b [Michael Armbrust] [SQL] Throw UnsupportedOperationException instead of NotImplementedError
    (cherry picked from commit 4214e50fc32de1478584d8edfa3a35576c12c025)
    Signed-off-by: Michael Armbrust <michael@databricks.com>
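
    Why "fatal" matters in practice, as a minimal demo: under Scala 2.10, scala.util.control.NonFatal does not match NotImplementedError, so generic recovery code lets it escape (later Scala versions reclassified it as non-fatal).

        import scala.util.control.NonFatal

        def check(t: Throwable): String = t match {
          case NonFatal(e) => s"handled: ${e.getClass.getSimpleName}"
          case e           => s"escaped: ${e.getClass.getSimpleName}"
        }

        check(new UnsupportedOperationException) // handled on any Scala version
        check(new NotImplementedError)           // escaped on 2.10; handled on 2.11+
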
* SPARK-6414: Spark driver failed with NPE on job cancelation | Hung Lin | 2015-04-02 | 3 | -14/+25

    Use Option for ActiveJob.properties to avoid NPE bug

    Author: Hung Lin <hung.lin@gmail.com>
    Closes #5124 from hunglin/SPARK-6414 and squashes the following commits:
      2290b6b [Hung Lin] [SPARK-6414][core] Fix NPE in SparkContext.cancelJobGroup()
    (cherry picked from commit e3202aa2e9bd140effbcf2a7a02b90cb077e760b)
    Signed-off-by: Josh Rosen <joshrosen@databricks.com>
    Conflicts:
      core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
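
    A hedged sketch of the Option-wrapping direction (`job` is assumed in scope; the property key is the one Spark uses for job groups, but treat it as illustrative):

        // null properties can no longer NPE: both lookups collapse to None.
        val properties: Option[java.util.Properties] = Option(job.properties)
        val groupId: Option[String] =
          properties.flatMap(p => Option(p.getProperty("spark.jobGroup.id")))
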
* [SPARK-6079] Use index to speed up StatusTracker.getJobIdsForGroup() | Josh Rosen | 2015-04-02 | 3 | -6/+51

    `StatusTracker.getJobIdsForGroup()` is implemented via a linear scan over a HashMap rather than using an index, which might be an expensive operation if there are many (e.g. thousands) of retained jobs. This patch adds a new map to `JobProgressListener` in order to speed up these lookups.

    Author: Josh Rosen <joshrosen@databricks.com>
    Closes #4830 from JoshRosen/statustracker-job-group-indexing and squashes the following commits:
      e39c5c7 [Josh Rosen] Address review feedback
      6709fb2 [Josh Rosen] Merge remote-tracking branch 'origin/master' into statustracker-job-group-indexing
      2c49614 [Josh Rosen] getOrElse
      97275a7 [Josh Rosen] Add jobGroup to jobId index to JobProgressListener
    (cherry picked from commit d44a3362ed8cf3068f8ff233e13851a39da42219)
    Signed-off-by: Josh Rosen <joshrosen@databricks.com>
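
    A self-contained model of the indexing change (field and method names are illustrative, not the listener's real ones): maintain a group -> jobIds map as jobs start, so group lookups stop scanning every retained job.

        import scala.collection.mutable

        val groupToJobIds = mutable.HashMap.empty[String, mutable.Set[Int]]

        def onJobStart(jobId: Int, group: String): Unit =
          groupToJobIds.getOrElseUpdate(group, mutable.Set.empty) += jobId

        // O(size of the group) instead of O(number of retained jobs):
        def getJobIdsForGroup(group: String): Seq[Int] =
          groupToJobIds.get(group).map(_.toSeq).getOrElse(Seq.empty)
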
* [SPARK-6667] [PySpark] remove setReuseAddress | Davies Liu | 2015-04-02 | 2 | -1/+1

    Reusing the address on the server side could leave the server unable to acknowledge connected connections, so remove it. This PR also retries once after a timeout, and adds a timeout on the client side.

    Author: Davies Liu <davies@databricks.com>
    Closes #5324 from davies/collect_hang and squashes the following commits:
      e5a51a2 [Davies Liu] remove setReuseAddress
      7977c2f [Davies Liu] do retry on client side
      b838f35 [Davies Liu] retry after timeout
    (cherry picked from commit 0cce5451adfc6bf4661bcf67aca3db26376455fe)
    Signed-off-by: Josh Rosen <joshrosen@databricks.com>
* Revert "[SPARK-6618][SQL] HiveMetastoreCatalog.lookupRelation should use ↵Cheng Lian2015-04-022-20/+3
| | | | | | fine-grained lock" This reverts commit fd600cec0c8cf9e14c3d5d5f63b1de94413ffba8.
* [SQL] SPARK-6658: Update DataFrame documentation to refer to correct types | Chet Mancini | 2015-04-01 | 1 | -6/+6
* [SPARK-6578] Small rewrite to make the logic more clear in MessageWithHeader.transferTo. | Reynold Xin | 2015-04-01 | 1 | -20/+23

    Author: Reynold Xin <rxin@databricks.com>
    Closes #5319 from rxin/SPARK-6578 and squashes the following commits:
      7c62a64 [Reynold Xin] Small rewrite to make the logic more clear in transferTo.
    (cherry picked from commit 899ebcb1448126f40be784ce42e69218e9a1ead7)
    Signed-off-by: Reynold Xin <rxin@databricks.com>
* [SPARK-6660][MLLIB] pythonToJava doesn't recognize object arrays | Xiangrui Meng | 2015-04-01 | 2 | -1/+12

    davies

    Author: Xiangrui Meng <meng@databricks.com>
    Closes #5318 from mengxr/SPARK-6660 and squashes the following commits:
      0f66ec2 [Xiangrui Meng] recognize object arrays
      ad8c42f [Xiangrui Meng] add a test for SPARK-6660
    (cherry picked from commit 4815bc2128c7f6d4d21da730b8c72da087233b34)
    Signed-off-by: Xiangrui Meng <meng@databricks.com>
    Conflicts:
      python/pyspark/mllib/tests.py
* [SPARK-6553] [pyspark] Support functools.partial as UDF | ksonj | 2015-04-01 | 2 | -1/+33

    Use `f.__repr__()` instead of `f.__name__` when instantiating `UserDefinedFunction`s, so `functools.partial`s may be used.

    Author: ksonj <kson@siberie.de>
    Closes #5206 from ksonj/partials and squashes the following commits:
      ea66f3d [ksonj] Inserted blank lines for PEP8 compliance
      d81b02b [ksonj] added tests for udf with partial function and callable object
      2c76100 [ksonj] Makes UDFs work with all types of callables
      b814a12 [ksonj] support functools.partial as udf
* [SPARK-6642][MLLIB] use 1.2 lambda scaling and remove addImplicit from NormalEquation | Xiangrui Meng | 2015-04-01 | 3 | -84/+60

    This PR changes lambda scaling from number of users/items to number of explicit ratings. The latter is the behavior in 1.2. Slight refactor of NormalEquation to make it independent of ALS models. srowen codexiang

    Author: Xiangrui Meng <meng@databricks.com>
    Closes #5314 from mengxr/SPARK-6642 and squashes the following commits:
      dc655a1 [Xiangrui Meng] relax python tests
      f410df2 [Xiangrui Meng] use 1.2 scaling and remove addImplicit from NormalEquation
    (cherry picked from commit ccafd757eda478913f783f3127be715bf6413740)
    Signed-off-by: Xiangrui Meng <meng@databricks.com>
    Conflicts:
      mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala
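
    As a hedged reading of what "1.2 scaling" refers to (ALS-WR-style regularization; the symbols below are ours, not from the patch), the objective weights each factor's penalty by its rating count rather than by a global user/item count:

        \min_{X,Y} \sum_{(u,i) \in R} \left( r_{ui} - x_u^\top y_i \right)^2
          + \lambda \left( \sum_u n_u \lVert x_u \rVert^2 + \sum_i m_i \lVert y_i \rVert^2 \right)

    where n_u and m_i count the explicit ratings of user u and item i.
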
* [SPARK-6578] [core] Fix thread-safety issue in outbound path of network library. | Marcelo Vanzin | 2015-04-01 | 7 | -10/+364

    While the inbound path of a netty pipeline is thread-safe, the outbound path is not. That means that multiple threads can compete to write messages to the next stage of the pipeline.

    The network library sometimes breaks a single RPC message into multiple buffers internally to avoid copying data (see MessageEncoder). This can result in the following scenario (where "FxBy" means "frame x, buffer y"):

        T1  F1B1           F1B2
                \              \
                 socket F1B1 F2B1 F1B2 F2B2
                /              /
        T2  F2B1           F2B2

    And the frames now cannot be rebuilt on the receiving side because the different messages have been mixed up on the wire.

    The fix wraps these multi-buffer messages into a `FileRegion` object so that these messages are written "atomically" to the next pipeline handler.

    Author: Marcelo Vanzin <vanzin@cloudera.com>
    Closes #5234 from vanzin/SPARK-6578 and squashes the following commits:
      16b2d70 [Marcelo Vanzin] Forgot to update a type.
      c9c2e4e [Marcelo Vanzin] Review comments: simplify some code.
      9c888ac [Marcelo Vanzin] Small style nits.
      8474bab [Marcelo Vanzin] Fix multiple calls to MessageWithHeader.transferTo().
      e26509f [Marcelo Vanzin] Merge branch 'master' into SPARK-6578
      c503f6c [Marcelo Vanzin] Implement a custom FileRegion instead of using locks.
      84aa7ce [Marcelo Vanzin] Rename handler to the correct name.
      432f3bd [Marcelo Vanzin] Remove unneeded method.
      8d70e60 [Marcelo Vanzin] Fix thread-safety issue in outbound path of network library.
    (cherry picked from commit f084c5de14eb10a6aba82a39e03e7877926ebb9e)
    Signed-off-by: Reynold Xin <rxin@databricks.com>
* [SPARK-6657] [Python] [Docs] fixed python doc build warnings | Joseph K. Bradley | 2015-04-01 | 2 | -17/+11

    Fixed python doc build warnings. CC whomever wants to review: rxin mengxr davies

    Author: Joseph K. Bradley <joseph@databricks.com>
    Closes #5317 from jkbradley/python-doc-warnings and squashes the following commits:
      4cd43c2 [Joseph K. Bradley] fixed python doc build warnings
    (cherry picked from commit fb25e8c7f45b4f96561e3f7434a0f4dfce8ddefe)
    Signed-off-by: Reynold Xin <rxin@databricks.com>
* [SPARK-6651][MLLIB] delegate dense vector arithmetics to the underlying numpy array | Xiangrui Meng | 2015-04-01 | 1 | -1/+37

    Users should be able to use numpy operators directly on dense vectors. davies atalwalkar

    Author: Xiangrui Meng <meng@databricks.com>
    Closes #5312 from mengxr/SPARK-6651 and squashes the following commits:
      e665c5c [Xiangrui Meng] wrap the result in a dense vector
      23dfca3 [Xiangrui Meng] delegate dense vector arithmetics to the underlying numpy array
    (cherry picked from commit 2275acce7ba5fac83c58554d7ee9f4c7f3e866cf)
    Signed-off-by: Xiangrui Meng <meng@databricks.com>
* SPARK-6626 [DOCS]: Corrected Scala:TwitterUtils parameters | jayson | 2015-04-01 | 1 | -1/+1

    Per Sean Owen's request, here is the updated call for TwitterUtils using Scala :)

    Author: jayson <jayson@ziprecruiter.com>
    Closes #5295 from JaysonSunshine/master and squashes the following commits:
      df1d056 [jayson] Corrected Scala:TwitterUtils parameters
    (cherry picked from commit 0358b08db85b3ee4ae70834626e7a42311bcc635)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
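
    A hedged guess at the corrected call (the Scala API takes an explicit Option for the OAuth credentials; `ssc` is assumed to be a StreamingContext):

        import org.apache.spark.streaming.twitter.TwitterUtils

        // None falls back to credentials from the twitter4j system properties.
        val tweets = TwitterUtils.createStream(ssc, None)
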
* [Doc] Improve Python DataFrame documentation | Reynold Xin | 2015-03-31 | 6 | -390/+253

    Author: Reynold Xin <rxin@databricks.com>
    Closes #5287 from rxin/pyspark-df-doc-cleanup-context and squashes the following commits:
      1841b60 [Reynold Xin] Lint.
      f2007f1 [Reynold Xin] functions and types.
      bc3b72b [Reynold Xin] More improvements to DataFrame Python doc.
      ac1d4c0 [Reynold Xin] Bug fix.
      b163365 [Reynold Xin] Python fix. Added Experimental flag to DataFrameNaFunctions.
      608422d [Reynold Xin] [Doc] Cleanup context.py Python docs.
    (cherry picked from commit 305abe1e57450f49e3ec4dffb073c5adf17cadef)
    Signed-off-by: Reynold Xin <rxin@databricks.com>
* [SPARK-6614] OutputCommitCoordinator should clear authorized committer only after authorized committer fails, not after any failure | Josh Rosen | 2015-03-31 | 2 | -3/+30

    In OutputCommitCoordinator, there is some logic to clear the authorized committer's lock on committing in case that task fails. However, it looks like the current code also clears this lock if other non-authorized tasks fail, which is an obvious bug.

    In theory, it's possible that this could allow a new committer to start, run to completion, and commit output before the authorized committer finished, but it's unlikely that this race occurs often in practice due to the complex combination of failure and timing conditions that would be required to expose it. This patch addresses this issue and adds a regression test.

    Thanks to aarondav for spotting this issue.

    Author: Josh Rosen <joshrosen@databricks.com>
    Closes #5276 from JoshRosen/SPARK-6614 and squashes the following commits:
      d532ba7 [Josh Rosen] Check whether failed task was authorized committer
      cbb3784 [Josh Rosen] Add regression test for SPARK-6614
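
    A self-contained model of the corrected check (names are illustrative; this is not Spark's actual coordinator): only a failure of the currently authorized attempt may clear the commit lock.

        import scala.collection.mutable

        // (stageId, partition) -> attemptId of the authorized committer
        val authorized = mutable.Map.empty[(Int, Int), Long]

        def onTaskFailed(stage: Int, partition: Int, attemptId: Long): Unit = {
          if (authorized.get((stage, partition)) == Some(attemptId)) {
            authorized.remove((stage, partition)) // authorized committer died: free the lock
          }
          // otherwise a non-authorized attempt failed; the lock stays held
        }
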