* [SPARK-12708][UI] Sorting task error in Stages Page when yarn mode. (Koyo Yoshida, 2016-01-15; 6 files changed, -18/+46)
  If the sort column contains a slash (e.g. "Executor ID / Host") in yarn mode, sorting fails with the following message. ![spark-12708](https://cloud.githubusercontent.com/assets/6679275/12193320/80814f8c-b62a-11e5-9914-7bf3907029df.png) It's similar to SPARK-4313.
  Author: root <root@R520T1.(none)> Author: Koyo Yoshida <koyo0615@gmail.com> Closes #10663 from yoshidakuy/SPARK-12708.
* [SPARK-12813][SQL] Eliminate serialization for back to back operations (Michael Armbrust, 2016-01-14; 17 files changed, -274/+518)
  The goal of this PR is to eliminate unnecessary translations when there are back-to-back `MapPartitions` operations. In order to achieve this I also made the following simplifications:
  - Operators no longer hold encoders; instead they have only the expressions that they need. The benefits here are twofold: the expressions are visible to transformations, so they go through the normal resolution/binding process, and now that they are visible we can change them on a case-by-case basis.
  - Operators no longer have type parameters. Since the engine is responsible for its own type checking, having the types visible to the compiler was an unnecessary complication. We still leverage the Scala compiler in the companion factory when constructing a new operator, but after this the types are discarded.
  Deferred to a follow-up PR:
  - Remove as much of the resolution/binding from Dataset/GroupedDataset as possible. We should still eagerly check resolution and throw an error in the case of mismatches for an `as` operation.
  - Eliminate serializations in more cases by adding more cases to `EliminateSerialization`.
  Author: Michael Armbrust <michael@databricks.com> Closes #10747 from marmbrus/encoderExpressions.
* [SPARK-12174] Speed up BlockManagerSuite getRemoteBytes() test (Josh Rosen, 2016-01-14; 1 file changed, -41/+30)
  This patch significantly speeds up the BlockManagerSuite's "SPARK-9591: getRemoteBytes from another location when Exception throw" test, reducing the test time from 45s to ~250ms. The key change was to set `spark.shuffle.io.maxRetries` to 0 (the code previously set `spark.network.timeout` to `2s`, but this didn't make a difference because the slowdown was not due to this timeout). Along the way, I also cleaned up the way that we handle SparkConf in BlockManagerSuite: previously, each test would mutate a shared SparkConf instance, while now each test gets a fresh SparkConf.
  Author: Josh Rosen <joshrosen@databricks.com> Closes #10759 from JoshRosen/SPARK-12174.
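  A rough sketch of the two changes described above (the suite wiring is simplified and hypothetical; `spark.shuffle.io.maxRetries` is the actual setting name):

  ```scala
  import org.apache.spark.SparkConf

  // Each test builds its own SparkConf instead of mutating a shared instance,
  // and disables shuffle I/O retries so a failed remote fetch fails fast.
  def freshTestConf(): SparkConf =
    new SparkConf(loadDefaults = false)
      .set("spark.shuffle.io.maxRetries", "0")
  ```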
* [SPARK-12821][BUILD] Style checker should run when some configuration files for style are modified but any source files are not. (Kousuke Saruta, 2016-01-14; 1 file changed, -2/+7)
  When running the `run-tests` script, the style checkers run only when source files are modified, but they should also run when configuration files related to style are modified.
  Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #10754 from sarutak/SPARK-12821.
* [SPARK-12771][SQL] Simplify CaseWhen code generation (Reynold Xin, 2016-01-14; 1 file changed, -25/+35)
  The generated code for CaseWhen uses a control variable "got" to make sure we do not evaluate more branches once a branch is true. Changing that to generate just a simple "if / else" chain would be slightly more efficient. This closes #10737.
  Author: Reynold Xin <rxin@databricks.com> Closes #10755 from rxin/SPARK-12771.
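  To make the difference concrete, a hedged sketch in plain Scala (the real generated code is Java emitted by Catalyst codegen; all names here are illustrative):

  ```scala
  // Old shape: a "got" control variable guards every branch.
  def caseWhenWithFlag(cond1: Boolean, cond2: Boolean,
                       v1: Int, v2: Int, elseV: Int): Int = {
    var got = false
    var result = elseV
    if (!got && cond1) { got = true; result = v1 }
    if (!got && cond2) { got = true; result = v2 }
    result
  }

  // New shape: plain chained if / else, no flag bookkeeping.
  def caseWhenChained(cond1: Boolean, cond2: Boolean,
                      v1: Int, v2: Int, elseV: Int): Int =
    if (cond1) v1 else if (cond2) v2 else elseV
  ```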
* [SPARK-12784][UI] Fix Spark UI IndexOutOfBoundsException with dynamic allocation (Shixiong Zhu, 2016-01-14; 2 files changed, -6/+17)
  Add `listener.synchronized` to get `storageStatusList` and `execInfo` atomically.
  Author: Shixiong Zhu <shixiong@databricks.com> Closes #10728 from zsxwing/SPARK-12784.
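  A minimal sketch of the pattern (the listener shape below is a hypothetical stand-in, not the actual UI listener class):

  ```scala
  import scala.collection.mutable.ArrayBuffer

  class UIListenerSketch {              // hypothetical stand-in
    val storageStatusList = ArrayBuffer[String]()
    val execInfo = ArrayBuffer[String]()
  }

  // Read both collections under the listener's lock, so a concurrent update
  // cannot leave them at mismatched lengths (the IndexOutOfBoundsException).
  def snapshot(listener: UIListenerSketch): (Seq[String], Seq[String]) =
    listener.synchronized {
      (listener.storageStatusList.toList, listener.execInfo.toList)
    }
  ```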
* [SPARK-9844][CORE] File appender race condition during shutdown (Bryan Cutler, 2016-01-14; 2 files changed, -10/+95)
  When an Executor process is destroyed, the FileAppender that is asynchronously reading the stderr stream of the process can throw an IOException during read because the stream is closed. Before the ExecutorRunner destroys the process, the FileAppender thread is flagged to stop. This PR wraps the inputStream.read call of the FileAppender in a try/catch block so that if an IOException is thrown and the thread has been flagged to stop, it will safely ignore the exception. Additionally, the FileAppender thread was changed to use Utils.tryWithSafeFinally to better log any exceptions that do occur. Added unit tests to verify that an IOException is thrown and logged if the FileAppender is not flagged to stop, and that no IOException is thrown when the flag is set.
  Author: Bryan Cutler <cutlerb@gmail.com> Closes #10714 from BryanCutler/file-appender-read-ioexception-SPARK-9844.
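  A rough sketch of the guarded read loop (simplified; the real FileAppender also buffers output and wraps the loop in Utils.tryWithSafeFinally for logging):

  ```scala
  import java.io.{IOException, InputStream}

  object AppenderSketch {
    @volatile var markedForStop = false // set before the process is destroyed

    def appendLoop(in: InputStream, buf: Array[Byte]): Unit =
      try {
        var n = in.read(buf)
        while (n != -1) {
          // ... append buf(0 until n) to the log file here ...
          n = in.read(buf)
        }
      } catch {
        // The stream closes when the executor process dies; if we were already
        // flagged to stop, the IOException is expected and safely ignored.
        case _: IOException if markedForStop => ()
      }
  }
  ```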
* [SPARK-12707][SPARK SUBMIT] Remove submit python/R scripts through pyspark/sparkR (Jeff Zhang, 2016-01-13; 1 file changed, -7/+6)
  Author: Jeff Zhang <zjffdu@apache.org> Closes #10658 from zjffdu/SPARK-12707.
* [SPARK-12756][SQL] use hash expression in Exchange (Wenchen Fan, 2016-01-13; 12 files changed, -64/+84)
  This PR makes bucketing and exchange share one common hash algorithm, so that we can guarantee the data distribution is the same between shuffle and the bucketed data source, which enables us to shuffle only one side when joining a bucketed table with a normal one. This PR also fixes the tests that are broken by the new hash behaviour in shuffle.
  Author: Wenchen Fan <wenchen@databricks.com> Closes #10703 from cloud-fan/use-hash-expr-in-shuffle.
* [SPARK-12819] Deprecate TaskContext.isRunningLocally() (Josh Rosen, 2016-01-13; 5 files changed, -19/+4)
  We've already removed local execution but didn't deprecate `TaskContext.isRunningLocally()`; we should deprecate it for 2.0.
  Author: Josh Rosen <joshrosen@databricks.com> Closes #10751 from JoshRosen/remove-local-exec-from-taskcontext.
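  For reference, deprecation in Scala is a one-line annotation; a hedged sketch (the class and message here are illustrative, not the exact Spark source):

  ```scala
  abstract class TaskContextSketch { // illustrative stand-in for TaskContext
    // Local execution is gone, so the method can only ever return false;
    // deprecating it warns callers before it is removed entirely.
    @deprecated("Local execution was removed, so this always returns false", "2.0.0")
    def isRunningLocally(): Boolean = false
  }
  ```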
* [SPARK-12703][MLLIB][DOC][PYTHON] Fixed pyspark.mllib.clustering.KMeans user guide example (Joseph K. Bradley, 2016-01-13; 1 file changed, -5/+1)
  Fixed the WSSSE computation in the Python mllib KMeans user guide example by using the new computeCost method API in Python.
  Author: Joseph K. Bradley <joseph@databricks.com> Closes #10707 from jkbradley/kmeans-doc-fix.
* [SPARK-12026][MLLIB] ChiSqTest gets slower and slower over time when number of features is large (Yuhao Yang, 2016-01-13; 1 file changed, -2/+4)
  jira: https://issues.apache.org/jira/browse/SPARK-12026 The issue is valid, as features.toArray.view.zipWithIndex.slice(startCol, endCol) becomes slower as startCol gets larger. I tested locally; the change improves performance and the running time was stable.
  Author: Yuhao Yang <hhbyyh@gmail.com> Closes #10146 from hhbyyh/chiSq.
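  The gist, as a hedged sketch (the actual patch differs; this just shows replacing the re-scanned view with direct indexing so the cost no longer grows with startCol):

  ```scala
  import scala.collection.mutable.ArrayBuffer

  // features.toArray.view.zipWithIndex.slice(startCol, endCol) re-walks the
  // view from the start; indexing directly costs O(endCol - startCol) no
  // matter where the block begins.
  def columnBlock(features: Array[Double], startCol: Int, endCol: Int): Seq[(Double, Int)] = {
    val out = new ArrayBuffer[(Double, Int)](endCol - startCol)
    var i = startCol
    while (i < endCol) {
      out += ((features(i), i))
      i += 1
    }
    out
  }
  ```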
* [SPARK-12400][SHUFFLE] Avoid generating temp shuffle files for empty partitions (jerryshao, 2016-01-13; 2 files changed, -12/+51)
  The problem lies in `BypassMergeSortShuffleWriter`: an empty partition would also generate a temp shuffle file of several bytes. This change creates the file only when the partition is not empty. The problem is confined to this writer; there is no such issue in `HashShuffleWriter`. Please help to review, thanks a lot.
  Author: jerryshao <sshao@hortonworks.com> Closes #10376 from jerryshao/SPARK-12400.
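  A hedged sketch of the lazy-creation idea (much simplified; the real writer deals with serialization, buffering, and metrics):

  ```scala
  import java.io.{File, FileOutputStream, OutputStream}

  // Open the partition's temp file only on the first write, so partitions
  // that receive no records never create a file at all.
  final class LazyPartitionWriter(tempFile: File) {
    private var out: OutputStream = null
    def write(bytes: Array[Byte]): Unit = {
      if (out == null) out = new FileOutputStream(tempFile)
      out.write(bytes)
    }
    def close(): Unit = if (out != null) out.close()
  }
  ```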
* [SPARK-12690][CORE] Fix NPE in UnsafeInMemorySorter.free() (Carson Wang, 2016-01-13; 1 file changed, -2/+4)
  I hit the exception below. The `UnsafeKVExternalSorter` does pass `null` as the consumer when creating an `UnsafeInMemorySorter`. Normally the NPE doesn't occur because the `inMemSorter` is set to null later and the `free()` method is not called. It happens when there is another exception, like an OOM, thrown before setting `inMemSorter` to null. Anyway, we can add the null check to avoid it.
  ```
  ERROR spark.TaskContextImpl: Error in TaskCompletionListener
  java.lang.NullPointerException
      at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.free(UnsafeInMemorySorter.java:110)
      at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.cleanupResources(UnsafeExternalSorter.java:288)
      at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter$1.onTaskCompletion(UnsafeExternalSorter.java:141)
      at org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:79)
      at org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:77)
      at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
      at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
      at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:77)
      at org.apache.spark.scheduler.Task.run(Task.scala:91)
      at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
      at java.lang.Thread.run(Thread.java:722)
  ```
  Author: Carson Wang <carson.wang@intel.com> Closes #10637 from carsonwang/FixNPE.
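  The shape of the fix, sketched under the assumption that free() only needs to skip its consumer bookkeeping when no consumer was supplied (field names are illustrative):

  ```scala
  final class InMemSorterSketch(private var consumer: AnyRef,
                                private var array: Array[Long]) {
    // UnsafeKVExternalSorter legitimately passes a null consumer, so guard
    // before touching it instead of dereferencing unconditionally.
    def free(): Unit = {
      if (consumer != null) {
        // ... hand the backing array's memory back to the consumer here ...
      }
      array = null
    }
  }
  ```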
* [SPARK-12791][SQL] Simplify CaseWhen by breaking "branches" into "conditions" and "values" (Reynold Xin, 2016-01-13; 12 files changed, -138/+156)
  This pull request rewrites the CaseWhen expression to break the single, monolithic "branches" field into a sequence of tuples (Seq[(condition, value)]) and an explicit optional elseValue field. Prior to this pull request, each even position in "branches" represented the condition for each branch, and each odd position represented the value for each branch. Their use had been pretty confusing, with a lot of sliding-window or grouped(2) calls.
  Author: Reynold Xin <rxin@databricks.com> Closes #10734 from rxin/simplify-case.
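  A toy sketch of the reshaping (field types simplified to String; the real fields are Catalyst Expressions):

  ```scala
  // After: (condition, value) pairs plus an explicit optional ELSE.
  case class CaseWhenSketch(branches: Seq[(String, String)], elseValue: Option[String])

  // Before: one flat Seq where even positions were conditions and odd
  // positions were values, decoded with grouped(2) or sliding windows.
  val flat = Seq("c1", "v1", "c2", "v2", "else")
  val decoded = CaseWhenSketch(
    flat.dropRight(flat.length % 2).grouped(2).map { case Seq(c, v) => (c, v) }.toSeq,
    if (flat.length % 2 == 1) Some(flat.last) else None)
  ```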
* [SPARK-12642][SQL] improve the hash expression to be decoupled from unsafe row (Wenchen Fan, 2016-01-13; 6 files changed, -29/+288)
  https://issues.apache.org/jira/browse/SPARK-12642
  Author: Wenchen Fan <wenchen@databricks.com> Closes #10694 from cloud-fan/hash-expr.
* [SPARK-12268][PYSPARK] Make pyspark shell pythonstartup work under python3 (Erik Selin, 2016-01-13; 1 file changed, -1/+3)
  This replaces the `execfile` used for running custom python shell scripts with explicit open, compile and exec (as recommended by 2to3). The reason for this change is to make the pythonstartup option compatible with python3.
  Author: Erik Selin <erik.selin@gmail.com> Closes #10255 from tyro89/pythonstartup-python3.
* [SPARK-9383][PROJECT-INFRA] PR merge script should reset back to previous branch when possible (Josh Rosen, 2016-01-13; 1 file changed, -3/+16)
  This patch modifies our PR merge script to reset back to a named branch when restoring the original checkout upon exit. When the committer is originally checked out to a detached head, they will be restored back to that same ref (the same as today's behavior). This is a slightly updated version of #7569, with an extra fix to handle the detached-head corner case.
  Author: Josh Rosen <joshrosen@databricks.com> Closes #10709 from JoshRosen/SPARK-9383.
* [SPARK-12761][CORE] Remove duplicated code (Jakob Odersky, 2016-01-13; 1 file changed, -5/+1)
  Removes some duplicated code that was reintroduced during a merge.
  Author: Jakob Odersky <jodersky@gmail.com> Closes #10711 from jodersky/repl-2.11-duplicate.
* [SPARK-12805][MESOS] Fixes documentation on Mesos run modes (Luc Bourlier, 2016-01-13; 1 file changed, -7/+5)
  The default run mode has changed, but the documentation didn't fully reflect the change.
  Author: Luc Bourlier <luc.bourlier@typesafe.com> Closes #10740 from skyluc/issue/mesos-modes-doc.
* [SPARK-9297] [SQL] Add covar_pop and covar_samp (Liang-Chi Hsieh, 2016-01-13; 4 files changed, -0/+272)
  JIRA: https://issues.apache.org/jira/browse/SPARK-9297 Add two aggregation functions: covar_pop and covar_samp.
  Author: Liang-Chi Hsieh <viirya@gmail.com> Author: Liang-Chi Hsieh <viirya@appier.com> Closes #10029 from viirya/covar-funcs.
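  For reference, the two estimators differ only in the divisor; a plain-Scala sketch (not the Catalyst implementation, which computes these incrementally):

  ```scala
  // Population covariance: average co-deviation, divided by n.
  def covarPop(xs: Seq[Double], ys: Seq[Double]): Double = {
    val n = xs.length
    val mx = xs.sum / n
    val my = ys.sum / n
    xs.zip(ys).map { case (x, y) => (x - mx) * (y - my) }.sum / n
  }

  // Sample covariance uses the unbiased divisor n - 1 instead of n.
  def covarSamp(xs: Seq[Double], ys: Seq[Double]): Double =
    covarPop(xs, ys) * xs.length / (xs.length - 1)
  ```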
* [SPARK-12692][BUILD][HOT-FIX] Fix the scala style of KinesisBackedBlockRDDSuite.scala. (Yin Huai, 2016-01-13; 1 file changed, -2/+2)
  https://github.com/apache/spark/pull/10736 was merged yesterday and caused the master build to start failing because of the style issue.
  Author: Yin Huai <yhuai@databricks.com> Closes #10742 from yhuai/fixStyle.
* [SPARK-12692][BUILD] Enforce style checking about white space before comma (Kousuke Saruta, 2016-01-13; 1 file changed, -7/+6)
  This is the final PR about SPARK-12692. We have removed all whitespace before commas from the code, so let's enforce the style check.
  Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #10736 from sarutak/SPARK-12692-followup-enforce-checking.
* [SPARK-12692][BUILD][SQL] Scala style: Fix the style violation (Space before ",") (Kousuke Saruta, 2016-01-12; 10 files changed, -22/+22)
  Fix the style violation (space before , and :). This PR is a followup for #10643 and a rework of #10685.
  Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #10732 from sarutak/SPARK-12692-followup-sql.
* [SPARK-12558][SQL] AnalysisException when multiple functions applied in GROUP BY clause (Dilip Biswal, 2016-01-12; 2 files changed, -0/+30)
  cloud-fan Can you please take a look? In this case, we are failing during check analysis while validating the aggregation expression. I have added a semanticEquals for HiveGenericUDF to fix this. Please let me know if this is the right way to address this issue.
  Author: Dilip Biswal <dbiswal@us.ibm.com> Closes #10520 from dilipbiswal/spark-12558.
* [SPARK-12692][BUILD][CORE] Scala style: Fix the style violation (Space before ",") (Kousuke Saruta, 2016-01-12; 6 files changed, -6/+6)
  Fix the style violation (space before , and :). This PR is a followup for #10643.
  Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #10719 from sarutak/SPARK-12692-followup-core.
* [SPARK-12788][SQL] Simplify BooleanEquality by using casts. (Reynold Xin, 2016-01-12; 2 files changed, -26/+32)
  Author: Reynold Xin <rxin@databricks.com> Closes #10730 from rxin/SPARK-12788.
* [SPARK-12785][SQL] Add ColumnarBatch, an in memory columnar format for execution. (Nong Li, 2016-01-12; 6 files changed, -0/+1463)
  There are many potential benefits of having an efficient in-memory columnar format as an alternative to UnsafeRow. This patch introduces ColumnarBatch/ColumnarVector, which starts this effort. The remaining implementation can be done in follow-up patches. As stated in the JIRA, there are useful external components that operate on memory in a simple columnar format. ColumnarBatch would serve that purpose and could serve as a zero-serialization/zero-copy exchange for this use case. This patch supports running the underlying data either on heap or off heap. On heap runs a bit faster, but we would need off heap for zero-copy exchanges. Currently, this mode is hidden behind one interface (ColumnVector). This differs from Parquet or the existing columnar cache because this is *not* intended to be used as a storage format. The focus is entirely on CPU efficiency, as we expect to only have one of these batches in memory per task. The layout of the values is just dense arrays of the value type.
  Author: Nong Li <nong@databricks.com> Author: Nong <nongli@gmail.com> Closes #10628 from nongli/spark-12635.
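  A toy illustration of the "dense arrays of the value type" layout (on-heap only; the real ColumnVector also handles nulls, more types, and off-heap storage):

  ```scala
  // One column: values live in a dense primitive array, addressed by row id,
  // with no per-row object or serialization overhead.
  final class IntColumnSketch(capacity: Int) {
    private val values = new Array[Int](capacity)
    def put(row: Int, v: Int): Unit = values(row) = v
    def get(row: Int): Int = values(row)
  }

  // A batch is just the per-column vectors plus the number of valid rows.
  final class ColumnarBatchSketch(val numRows: Int, val columns: Array[IntColumnSketch])
  ```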
* [SPARK-12652][PYSPARK] Upgrade Py4J to 0.9.1 (Shixiong Zhu, 2016-01-12; 18 files changed, -112/+20)
  - [x] Upgrade Py4J to 0.9.1
  - [x] SPARK-12657: Revert SPARK-12617
  - [x] SPARK-12658: Revert SPARK-12511 - Still keep the change that reads the checkpoint only once. This is a manual change and worth a careful look: https://github.com/zsxwing/spark/commit/bfd4b5c040eb29394c3132af3c670b1a7272457c
  - [x] Verify there are no leaks any more after reverting our workarounds
  Author: Shixiong Zhu <shixiong@databricks.com> Closes #10692 from zsxwing/py4j-0.9.1.
* [SPARK-12724] SQL generation support for persisted data source tables (Cheng Lian, 2016-01-12; 17 files changed, -51/+55)
  This PR implements SQL generation support for persisted data source tables. A new field `metastoreTableIdentifier: Option[TableIdentifier]` is added to `LogicalRelation`. When a `LogicalRelation` representing a persisted data source relation is created, this field holds the database name and table name of the relation.
  Author: Cheng Lian <lian@databricks.com> Closes #10712 from liancheng/spark-12724-datasources-sql-gen.
* Revert "[SPARK-12692][BUILD][SQL] Scala style: Fix the style violation ↵Reynold Xin2016-01-1254-150/+141
| | | | | | (Space before "," or ":")" This reverts commit 8cfa218f4f1b05f4d076ec15dd0a033ad3e4500d.
* [SPARK-12768][SQL] Remove CaseKeyWhen expression (Reynold Xin, 2016-01-12; 3 files changed, -171/+38)
  This patch removes the CaseKeyWhen expression and replaces it with a factory method that generates the equivalent CaseWhen. This reduces the amount of code we'd need to maintain in the future for both code generation and the optimizer. Note that we introduced CaseKeyWhen to avoid duplicate evaluations of the key. This is no longer a problem because we now have common subexpression elimination.
  Author: Reynold Xin <rxin@databricks.com> Closes #10722 from rxin/SPARK-12768.
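  A hedged sketch of the factory idea (simplified to strings; the real version builds Catalyst equality expressions against the key):

  ```scala
  // Simplified stand-in for the Catalyst CaseWhen expression.
  case class CaseWhenSketch(branches: Seq[(String, String)], elseValue: Option[String])

  // CASE key WHEN k THEN v ... ELSE e END desugars into an ordinary CaseWhen
  // by turning every key match into an equality condition against the key.
  def caseKeyWhen(key: String, keys: Seq[String], values: Seq[String],
                  elseValue: Option[String]): CaseWhenSketch =
    CaseWhenSketch(keys.zip(values).map { case (k, v) => (s"$key = $k", v) }, elseValue)
  ```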
* [SPARK-9843][SQL] Make catalyst optimizer pass pluggable at runtime (Robert Kruszewski, 2016-01-12; 4 files changed, -2/+46)
  Let me know whether you'd like to see it in another place.
  Author: Robert Kruszewski <robertk@palantir.com> Closes #10210 from robert3005/feature/pluggable-optimizer.
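  Assuming this is the `experimental.extraOptimizations` hook, a hedged usage sketch (the rule body is a deliberate no-op placeholder):

  ```scala
  import org.apache.spark.sql.SQLContext
  import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
  import org.apache.spark.sql.catalyst.rules.Rule

  // A do-nothing optimizer rule, registered at runtime without rebuilding Spark.
  object NoopRule extends Rule[LogicalPlan] {
    override def apply(plan: LogicalPlan): LogicalPlan = plan
  }

  def registerExtraRules(sqlContext: SQLContext): Unit = {
    sqlContext.experimental.extraOptimizations = Seq(NoopRule)
  }
  ```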
* [SPARK-12762][SQL] Add unit test for SimplifyConditionals optimization rule (Reynold Xin, 2016-01-12; 5 files changed, -7/+69)
  This pull request does a few small things:
  1. Separated if simplification from BooleanSimplification and created a new rule SimplifyConditionals. In the future we can also simplify other conditional expressions here.
  2. Added unit test for SimplifyConditionals.
  3. Renamed SimplifyCaseConversionExpressionsSuite to SimplifyStringCaseConversionSuite
  Author: Reynold Xin <rxin@databricks.com> Closes #10716 from rxin/SPARK-12762.
* [SPARK-12582][TEST] IndexShuffleBlockResolverSuite fails in windows (Yucai Yu, 2016-01-12; 1 file changed, -17/+34)
  - IndexShuffleBlockResolverSuite fails in Windows because a file is not closed.
  - Move IndexShuffleBlockResolverSuite.scala from "test/java" to "test/scala". https://issues.apache.org/jira/browse/SPARK-12582
  Author: Yucai Yu <yucai.yu@intel.com> Closes #10526 from yucai/master.
* [SPARK-12638][API DOC] Parameter explanation not very accurate for rdd function "aggregate" (Tommy YU, 2016-01-12; 1 file changed, -0/+14)
  Currently, the parameters of the RDD function aggregate are not explained well, especially "zeroValue". It's helpful to let junior Scala users know that "zeroValue" takes part in both the "seqOp" and "combOp" phases.
  Author: Tommy YU <tummyyu@163.com> Closes #10587 from Wenpei/rdd_aggregate_doc.
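  A worked example of the point being documented, since zeroValue really is used in both phases:

  ```scala
  import org.apache.spark.SparkContext

  // zeroValue (0, 0) seeds seqOp inside every partition, and then seeds
  // combOp again when the per-partition results are merged.
  def sumAndCount(sc: SparkContext): (Int, Int) =
    sc.parallelize(1 to 4, numSlices = 2).aggregate((0, 0))(
      seqOp = (acc, x) => (acc._1 + x, acc._2 + 1),   // fold within a partition
      combOp = (a, b) => (a._1 + b._1, a._2 + b._2))  // merge partition results
  // => (10, 4): sum 10 over 4 elements
  ```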
* [SPARK-5273][MLLIB][DOCS] Improve documentation examples for LinearRegression (Sean Owen, 2016-01-12; 1 file changed, -3/+5)
  Use a much smaller step size in the LinearRegressionWithSGD MLlib examples to achieve a reasonable RMSE. Our training folks hit this exact same issue when concocting an example and had the same solution.
  Author: Sean Owen <sowen@cloudera.com> Closes #10675 from srowen/SPARK-5273.
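  A hedged sketch of the call the examples revolve around (the step-size value below is illustrative of "much smaller", not the exact number used in the docs):

  ```scala
  import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
  import org.apache.spark.rdd.RDD

  // With too large a step size, SGD diverges and the example reports a huge
  // RMSE; shrinking the step size yields a sensible fit.
  def fit(parsedData: RDD[LabeledPoint]) =
    LinearRegressionWithSGD.train(parsedData, 100 /* numIterations */, 0.0001 /* stepSize */)
  ```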
* [SPARK-7615][MLLIB] MLLIB Word2Vec wordVectors divided by Euclidean Norm equals to zero (Sean Owen, 2016-01-12; 1 file changed, -1/+6)
  Cosine similarity with a 0 vector should be 0. Related to https://github.com/apache/spark/pull/10152
  Author: Sean Owen <sowen@cloudera.com> Closes #10696 from srowen/SPARK-7615.
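  A minimal sketch of the guard (not the MLlib implementation, which works on the flattened wordVectors array):

  ```scala
  // Define cosine similarity to be 0 when either vector has zero norm,
  // instead of dividing by zero and producing NaN/Infinity.
  def cosineSimilarity(a: Array[Double], b: Array[Double]): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val normA = math.sqrt(a.map(x => x * x).sum)
    val normB = math.sqrt(b.map(x => x * x).sum)
    if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
  }
  ```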
* [SPARK-12692][BUILD][SQL] Scala style: Fix the style violation (Space before "," or ":") (Kousuke Saruta, 2016-01-12; 54 files changed, -141/+150)
  Fix the style violation (space before , and :). This PR is a followup for #10643.
  Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #10718 from sarutak/SPARK-12692-followup-sql.
* [SPARK-12692][BUILD][YARN] Scala style: Fix the style violation (Space before "," or ":") (Kousuke Saruta, 2016-01-11; 1 file changed, -1/+1)
  Fix the style violation (space before , and :). This PR is a followup for #10643.
  Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #10686 from sarutak/SPARK-12692-followup-yarn.
* [SPARK-12692][BUILD][STREAMING] Scala style: Fix the style violation (Space before "," or ":") (Kousuke Saruta, 2016-01-11; 30 files changed, -96/+108)
  Fix the style violation (space before , and :). This PR is a followup for #10643.
  Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #10685 from sarutak/SPARK-12692-followup-streaming.
* [SPARK-11823] Ignores HiveThriftBinaryServerSuite's test jdbc cancel (Yin Huai, 2016-01-11; 1 file changed, -1/+3)
  https://issues.apache.org/jira/browse/SPARK-11823 This test often hangs and times out, leaving hanging processes. Let's ignore it for now and improve the test.
  Author: Yin Huai <yhuai@databricks.com> Closes #10715 from yhuai/SPARK-11823-ignore.
* [SPARK-12498][SQL][MINOR] BooleanSimplication simplification (Cheng Lian, 2016-01-11; 2 files changed, -102/+92)
  Scala syntax allows binary case classes to be used as infix operators in pattern matching. This PR makes use of this syntactic sugar to make `BooleanSimplification` more readable.
  Author: Cheng Lian <lian@databricks.com> Closes #10445 from liancheng/boolean-simplification-simplification.
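  The sugar in question, shown on a toy expression type (illustrative, not the Catalyst classes):

  ```scala
  sealed trait Expr
  case class And(left: Expr, right: Expr) extends Expr
  case class Lit(value: Boolean) extends Expr

  // `l And r` is the infix form of the pattern `And(l, r)` - the same sugar
  // that makes BooleanSimplification's nested matches read like the algebra.
  def eval(e: Expr): Boolean = e match {
    case l And r => eval(l) && eval(r)
    case Lit(v)  => v
  }
  ```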
* [SPARK-12742][SQL] org.apache.spark.sql.hive.LogicalPlanToSQLSuite failure due to Table already exists exception (wangfei, 2016-01-11; 1 file changed, -0/+3)
  ```
  [info] Exception encountered when attempting to run a suite with class name: org.apache.spark.sql.hive.LogicalPlanToSQLSuite *** ABORTED *** (325 milliseconds)
  [info]   org.apache.spark.sql.AnalysisException: Table `t1` already exists.;
  [info]   at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:296)
  [info]   at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:285)
  [info]   at org.apache.spark.sql.hive.LogicalPlanToSQLSuite.beforeAll(LogicalPlanToSQLSuite.scala:33)
  [info]   at org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187)
  [info]   at org.apache.spark.sql.hive.LogicalPlanToSQLSuite.beforeAll(LogicalPlanToSQLSuite.scala:23)
  [info]   at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253)
  [info]   at org.apache.spark.sql.hive.LogicalPlanToSQLSuite.run(LogicalPlanToSQLSuite.scala:23)
  [info]   at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:462)
  [info]   at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:671)
  [info]   at sbt.ForkMain$Run$2.call(ForkMain.java:296)
  [info]   at sbt.ForkMain$Run$2.call(ForkMain.java:286)
  [info]   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
  [info]   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  [info]   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  [info]   at java.lang.Thread.run(Thread.java:745)
  ```
  /cc liancheng
  Author: wangfei <wangfei_hello@126.com> Closes #10682 from scwf/fix-test.
* [SPARK-12576][SQL] Enable expression parsing in CatalystQl (Herman van Hovell, 2016-01-11; 9 files changed, -56/+217)
  The PR allows us to use the new SQL parser to parse SQL expressions such as: ```1 + sin(x*x)``` We enable this functionality in this PR, but we will not start using it actively yet. This will be done as soon as we have reached grammar parity with the existing parser stack. cc rxin
  Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #10649 from hvanhovell/SPARK-12576.
* [SPARK-10809][MLLIB] Single-document topicDistributions method for LocalLDAModel (Yuhao Yang, 2016-01-11; 2 files changed, -3/+38)
  jira: https://issues.apache.org/jira/browse/SPARK-10809 We could provide a single-document topicDistributions method for LocalLDAModel to allow for quick queries that avoid RDD operations. Currently, the user must use an RDD of documents. Also adds some missing asserts.
  Author: Yuhao Yang <hhbyyh@gmail.com> Closes #9484 from hhbyyh/ldaTopicPre.
* [SPARK-12685][MLLIB] word2vec trainWordsCount gets overflow (Yuhao Yang, 2016-01-11; 1 file changed, -4/+4)
  jira: https://issues.apache.org/jira/browse/SPARK-12685 The log of `word2vec` reports trainWordsCount = -785727483 during computation over a large dataset. Updated the priority, as the count affects the computation of the learning rate: `alpha = learningRate * (1 - numPartitions * wordCount.toDouble / (trainWordsCount + 1))`
  Author: Yuhao Yang <hhbyyh@gmail.com> Closes #10627 from hhbyyh/w2voverflow.
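  The negative value is a classic Int overflow; a hedged sketch of the gist of the fix (widening the accumulator to Long so the learning-rate formula stays sane; names are illustrative):

  ```scala
  final class AlphaSketch(learningRate: Double, numPartitions: Int) {
    // Past ~2.1 billion words an Int accumulator wraps negative, corrupting
    // the learning-rate schedule; a Long accumulator does not.
    private var trainWordsCount: Long = 0L

    def addWords(n: Int): Unit = trainWordsCount += n

    def alpha(wordCount: Long): Double =
      learningRate * (1 - numPartitions * wordCount.toDouble / (trainWordsCount + 1))
  }
  ```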
* [SPARK-12603][MLLIB] PySpark MLlib GaussianMixtureModel should support single instance predict/predictSoft (Yanbo Liang, 2016-01-11; 5 files changed, -14/+37)
  PySpark MLlib ```GaussianMixtureModel``` should support single instance ```predict/predictSoft``` just like Scala does.
  Author: Yanbo Liang <ybliang8@gmail.com> Closes #10552 from yanboliang/spark-12603.
* [SPARK-12758][SQL] add note to Spark SQL Migration guide about TimestampType casting (Brandon Bradley, 2016-01-11; 1 file changed, -0/+5)
  Warning users about casting changes.
  Author: Brandon Bradley <bradleytastic@gmail.com> Closes #10708 from blbradley/spark-12758.
* [SPARK-12734][HOTFIX] Build changes must trigger all tests; clean after install in dep tests (Josh Rosen, 2016-01-11; 2 files changed, -2/+2)
  This patch fixes a build/test issue caused by the combination of #10672 and a latent issue in the original `dev/test-dependencies` script. First, changes which _only_ touched build files were not triggering full Jenkins runs, making it possible for a build change to be merged even though it could cause failures in other tests. The `root` build module now depends on `build`, so all tests will now be run whenever a build-related file is changed. I also added a `clean` step to the Maven install step in `dev/test-dependencies` in order to address an issue where the dummy JARs stuck around and caused "multiple assembly JARs found" errors in tests. /cc zsxwing
  Author: Josh Rosen <joshrosen@databricks.com> Closes #10704 from JoshRosen/fix-build-test-problems.