Commit message  Author  Age  Files  Lines
* [SPARK-12212][ML][DOC] Clarifies the difference between spark.ml, ↵Timothy Hunter2015-12-1031-1793/+149
| | | | | | | | | | | | spark.mllib and mllib in the documentation. Replaces a number of occurrences of `MLlib` in the documentation that were meant to refer to the `spark.mllib` package instead. It should clarify for new users the difference between `spark.mllib` (the package) and MLlib (the umbrella project for ML in Spark). It also removes some files that I forgot to delete with #10207 Author: Timothy Hunter <timhunter@databricks.com> Closes #10234 from thunterdb/12212.
* [SPARK-12228][SQL] Try to run execution hive's derby in memory.Yin Huai2015-12-104-5/+9
| | | | | | | | | | This PR tries to make execution Hive's Derby run in memory, since it is a fake metastore and every time we create a HiveContext we switch to a new one. It may reduce the flakiness of our tests that need to create a HiveContext (e.g. HiveSparkSubmitSuite). I will test it more. https://issues.apache.org/jira/browse/SPARK-12228 Author: Yin Huai <yhuai@databricks.com> Closes #10204 from yhuai/derbyInMemory.
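For readers unfamiliar with the mechanics: an in-memory Derby metastore is selected purely through the JDBC connection URL. A minimal sketch, with a hypothetical database name (the exact URL the patch sets may differ):

```scala
// Derby's in-memory protocol: the database lives and dies with the JVM,
// which suits a throwaway execution-side metastore. The property name is
// the standard Hive/DataNucleus one; "execHiveMetastore" is made up here.
hiveConf.set(
  "javax.jdo.option.ConnectionURL",
  "jdbc:derby:memory:execHiveMetastore;create=true")
```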
* [SPARK-12250][SQL] Allow users to define a UDAF without providing details of ↵Yin Huai2015-12-102-5/+64
| | | | | | | | | | its inputSchema https://issues.apache.org/jira/browse/SPARK-12250 Author: Yin Huai <yhuai@databricks.com> Closes #10236 from yhuai/SPARK-12250.
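For context, the shape of a UDAF in the 1.6-era API; a minimal sketch of a sum aggregate, assuming the standard `UserDefinedAggregateFunction` contract (the patch relaxes how much detail `inputSchema` must carry, not the overall shape):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class MySum extends UserDefinedAggregateFunction {
  // With this change, the schema details here no longer need to be exhaustive.
  def inputSchema: StructType = new StructType().add("value", DoubleType)
  def bufferSchema: StructType = new StructType().add("sum", DoubleType)
  def dataType: DataType = DoubleType
  def deterministic: Boolean = true
  def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = 0.0
  def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    buffer(0) = buffer.getDouble(0) + input.getDouble(0)
  def merge(b1: MutableAggregationBuffer, b2: Row): Unit =
    b1(0) = b1.getDouble(0) + b2.getDouble(0)
  def evaluate(buffer: Row): Double = buffer.getDouble(0)
}
```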
* [SPARK-12234][SPARKR] Fix ```subset``` function error when only set ↵Yanbo Liang2015-12-102-2/+11
| | | | | | | | | | | | ```select``` argument Fix the ```subset``` function error when only the ```select``` argument is set. Please refer to the [JIRA](https://issues.apache.org/jira/browse/SPARK-12234) for the error and how to reproduce it. cc sun-rui felixcheung shivaram Author: Yanbo Liang <ybliang8@gmail.com> Closes #10217 from yanboliang/spark-12234.
* [SPARK-11602][MLLIB] Refine visibility for 1.6 scala API auditYuhao Yang2015-12-104-5/+5
| | | | | | | | | | jira: https://issues.apache.org/jira/browse/SPARK-11602 Made a pass over the API changes for 1.6. Opening the PR for efficient discussion. Author: Yuhao Yang <hhbyyh@gmail.com> Closes #9939 from hhbyyh/auditScala.
* [SPARK-12198][SPARKR] SparkR support read.parquet and deprecate parquetFileYanbo Liang2015-12-103-6/+22
| | | | | | | | SparkR supports ```read.parquet``` and deprecates ```parquetFile```. This change is similar to #10145 for ```jsonFile```. Author: Yanbo Liang <ybliang8@gmail.com> Closes #10191 from yanboliang/spark-12198.
* [SPARK-11832][CORE] Process arguments in spark-shell for Scala 2.11Jakob Odersky2015-12-102-13/+27
| | | | | | | | Process arguments passed to the spark-shell. Fixes running the spark-shell from within a build environment. Author: Jakob Odersky <jodersky@gmail.com> Closes #9824 from jodersky/shell-2.11.
* [SPARK-12242][SQL] Add DataFrame.transform methodReynold Xin2015-12-102-1/+14
| | | | | | Author: Reynold Xin <rxin@databricks.com> Closes #10226 from rxin/df-transform.
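A quick illustration of why the method is handy; the helper functions here are hypothetical, only `transform` itself comes from the patch:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

// Hypothetical helpers: each takes and returns a DataFrame.
def withGreeting(df: DataFrame): DataFrame = df.withColumn("greeting", lit("hello"))
def withFarewell(df: DataFrame): DataFrame = df.withColumn("farewell", lit("bye"))

// Instead of the inside-out withFarewell(withGreeting(df)), custom
// transformations now chain fluently:
val df = sqlContext.range(3)
val result = df.transform(withGreeting).transform(withFarewell)
```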
* [SPARK-11530][MLLIB] Return eigenvalues with PCA modelSean Owen2015-12-107-25/+67
| | | | | | | | | | Add `computePrincipalComponentsAndVariance` to also compute PCA's explained variance. CC mengxr Author: Sean Owen <sowen@cloudera.com> Closes #9736 from srowen/SPARK-11530.
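A sketch of how the new API might be called, assuming (as the title suggests) the method sits on `RowMatrix` and returns the principal components together with a vector of explained variance; the name and return shape are taken from the commit message, not verified against the final API:

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val mat = new RowMatrix(sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 3.0),
  Vectors.dense(4.0, 5.0, 7.0),
  Vectors.dense(7.0, 9.0, 8.0))))

// Top 2 principal components plus the proportion of variance each explains.
val (pc, explainedVariance) = mat.computePrincipalComponentsAndVariance(2)
```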
* [SPARK-12136][STREAMING] rddToFileName does not properly handle prefix and ↵bomeng2015-12-101-6/+7
| | | | | | | | | | | | | suffix parameters The original code does not properly handle the case where the prefix is null but the suffix is not: the suffix should be used but is not. The fix uses a StringBuilder to construct the proper file name. Author: bomeng <bmeng@us.ibm.com> Author: Bo Meng <mengbo@bos-macbook-pro.usca.ibm.com> Closes #10185 from bomeng/SPARK-12136.
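A minimal sketch of the corrected logic under a simplified signature (the real helper takes a streaming `Time`, not a raw `Long`):

```scala
// Build "prefix-time.suffix", tolerating a null/empty prefix or suffix;
// previously a null prefix caused the suffix to be dropped.
def rddToFileName(prefix: String, suffix: String, time: Long): String = {
  val sb = new StringBuilder(String.valueOf(time))
  if (prefix != null && prefix.nonEmpty) sb.insert(0, prefix + "-")
  if (suffix != null && suffix.nonEmpty) sb.append(".").append(suffix)
  sb.toString()
}
```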
* [SPARK-12252][SPARK-12131][SQL] refactor MapObjects to make it less hackyWenchen Fan2015-12-104-47/+35
| | | | | | | | | | | | in https://github.com/apache/spark/pull/10133 we found that we should ensure the children of `TreeNode` are all accessible in the `productIterator`, or the behavior will be very confusing. In this PR, I try to fix this problem by exposing the `loopVar`. This also fixes SPARK-12131, which is caused by the hacky `MapObjects`. Author: Wenchen Fan <wenchen@databricks.com> Closes #10239 from cloud-fan/map-objects.
* [SPARK-12244][SPARK-12245][STREAMING] Rename trackStateByKey to mapWithState ↵Tathagata Das2015-12-0913-382/+389
| | | | | | | | | | | | | | | | | | | and change tracking function signature SPARK-12244: Based on feedback from early users and personal experience attempting to explain it, the name trackStateByKey had two problems: "trackState" is a completely new term that gives no intuition about what the operation does, and the resultant data stream of objects returned by the function is called the "emitted" data in the docs for lack of a better term. "mapWithState" makes sense because the API is like a mapping function (Key, Value) => T with State as an additional parameter, and the resultant data stream is the "mapped data", so both problems are solved. SPARK-12245: From initial experience, not having the key in the function makes it hard to return mapped records, as the whole information of the record is not available; the user is basically restricted to doing something like mapValues() instead of map(). So the key is added as a parameter. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #10224 from tdas/rename.
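To make the rename concrete, a sketch of the new API for a running word count; `wordStream` is a hypothetical `DStream[(String, Int)]`:

```scala
import org.apache.spark.streaming.{State, StateSpec}

// The function maps (key, value, state) to an output record, like an
// enriched map(); here the state accumulates a per-word running sum.
val mappingFunc = (word: String, one: Option[Int], state: State[Int]) => {
  val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
  state.update(sum)
  (word, sum)  // the "mapped" record emitted downstream
}
val mapped = wordStream.mapWithState(StateSpec.function(mappingFunc))
```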
* [SPARK-11796] Fix httpclient and httpcore dependency issues related to ↵Mark Grover2015-12-093-2/+50
| | | | | | | | | | docker-client This commit fixes dependency issues which prevented the Docker-based JDBC integration tests from running in the Maven build. Author: Mark Grover <mgrover@cloudera.com> Closes #9876 from markgrover/master_docker.
* [SPARK-11678][SQL][DOCS] Document basePath in the programming guide.Yin Huai2015-12-091-0/+7
| | | | | | | | | | | | | This PR adds document for `basePath`, which is a new parameter used by `HadoopFsRelation`. The compiled doc is shown below. ![image](https://cloud.githubusercontent.com/assets/2072857/11673132/1ba01192-9dcb-11e5-98d9-ac0b4e92e98c.png) JIRA: https://issues.apache.org/jira/browse/SPARK-11678 Author: Yin Huai <yhuai@databricks.com> Closes #10211 from yhuai/basePathDoc.
* [SPARK-12165][ADDENDUM] Fix outdated comments on unroll testAndrew Or2015-12-091-4/+9
| | | | | | | | JoshRosen Author: Andrew Or <andrew@databricks.com> Closes #10229 from andrewor14/unroll-test-comments.
* [SPARK-12211][DOC][GRAPHX] Fix version number in graphx doc for migration ↵Andrew Ray2015-12-091-1/+1
| | | | | | | | | | from 1.1 The "Migrating from Spark 1.1" section added to the GraphX doc in 1.2.0 (see https://spark.apache.org/docs/1.2.0/graphx-programming-guide.html#migrating-from-spark-11) uses {{site.SPARK_VERSION}} as the version where the changes were introduced; it should be just 1.2. Author: Andrew Ray <ray.andrew@gmail.com> Closes #10206 from aray/graphx-doc-1.1-migration.
* [SPARK-11551][DOC] Replace example code in ml-features.md using include_exampleXusen Yin2015-12-0952-1061/+2820
| | | | | | | | | PR on behalf of somideshmukh, thanks! Author: Xusen Yin <yinxusen@gmail.com> Author: somideshmukh <somilde@us.ibm.com> Closes #10219 from yinxusen/SPARK-11551.
* [SPARK-11824][WEBUI] WebUI does not render descriptions with 'bad' HTML, ↵Sean Owen2015-12-091-1/+0
| | | | | | | | | | | | | | throws console error Don't warn when a description isn't valid HTML, since it may legitimately be something like "SELECT ... where foo <= 1". The tests for this code indicate that it's normal to handle strings like this, which don't contain HTML, as plain strings rather than markup. Hence logging every such instance as a warning is too noisy, since it's not a problem. This is an issue for stages whose names contain SQL like the above. CC tdas as author of this bit of code Author: Sean Owen <sowen@cloudera.com> Closes #10159 from srowen/SPARK-11824.
* [SPARK-12165][SPARK-12189] Fix bugs in eviction of storage memory by executionJosh Rosen2015-12-098-204/+230
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch fixes a bug in the eviction of storage memory by execution. ## The bug: In general, execution should be able to evict storage memory when the total storage memory usage is greater than `maxMemory * spark.memory.storageFraction`. Due to a bug, however, Spark might wind up evicting no storage memory in certain cases where the storage memory usage was between `maxMemory * spark.memory.storageFraction` and `maxMemory`. For example, here is a regression test which illustrates the bug: ```scala val maxMemory = 1000L val taskAttemptId = 0L val (mm, ms) = makeThings(maxMemory) // Since we used the default storage fraction (0.5), we should be able to allocate 500 bytes // of storage memory which are immune to eviction by execution memory pressure. // Acquire enough storage memory to exceed the storage region size assert(mm.acquireStorageMemory(dummyBlock, 750L, evictedBlocks)) assertEvictBlocksToFreeSpaceNotCalled(ms) assert(mm.executionMemoryUsed === 0L) assert(mm.storageMemoryUsed === 750L) // At this point, storage is using 250 more bytes of memory than it is guaranteed, so execution // should be able to reclaim up to 250 bytes of storage memory. // Therefore, execution should now be able to require up to 500 bytes of memory: assert(mm.acquireExecutionMemory(500L, taskAttemptId, MemoryMode.ON_HEAP) === 500L) // <--- fails by only returning 250L assert(mm.storageMemoryUsed === 500L) assert(mm.executionMemoryUsed === 500L) assertEvictBlocksToFreeSpaceCalled(ms, 250L) ``` The problem relates to the control flow / interaction between `StorageMemoryPool.shrinkPoolToReclaimSpace()` and `MemoryStore.ensureFreeSpace()`. While trying to allocate the 500 bytes of execution memory, the `UnifiedMemoryManager` discovers that it will need to reclaim 250 bytes of memory from storage, so it calls `StorageMemoryPool.shrinkPoolToReclaimSpace(250L)`. This method, in turn, calls `MemoryStore.ensureFreeSpace(250L)`. However, `ensureFreeSpace()` first checks whether the requested space is less than `maxStorageMemory - storageMemoryUsed`, which will be true if there is any free execution memory because it turns out that `MemoryStore.maxStorageMemory = (maxMemory - onHeapExecutionMemoryPool.memoryUsed)` when the `UnifiedMemoryManager` is used. The control flow here is somewhat confusing (it grew to be messy / confusing over time / as a result of the merging / refactoring of several components). In the pre-Spark 1.6 code, `ensureFreeSpace` was called directly by the `MemoryStore` itself, whereas in 1.6 it's involved in a confusing control flow where `MemoryStore` calls `MemoryManager.acquireStorageMemory`, which then calls back into `MemoryStore.ensureFreeSpace`, which, in turn, calls `MemoryManager.freeStorageMemory`. ## The solution: The solution implemented in this patch is to remove the confusing circular control flow between `MemoryManager` and `MemoryStore`, making the storage memory acquisition process much more linear / straightforward. The key changes: - Remove a layer of inheritance which made the memory manager code harder to understand (53841174760a24a0df3eb1562af1f33dbe340eb9). - Move some bounds checks earlier in the call chain (13ba7ada77f87ef1ec362aec35c89a924e6987cb). - Refactor `ensureFreeSpace()` so that the part which evicts blocks can be called independently from the part which checks whether there is enough free space to avoid eviction (7c68ca09cb1b12f157400866983f753ac863380e). 
- Realize that this lets us remove a layer of overloads from `ensureFreeSpace` (eec4f6c87423d5e482b710e098486b3bbc4daf06). - Realize that `ensureFreeSpace()` can simply be replaced with an `evictBlocksToFreeSpace()` method which is called [after we've already figured out](https://github.com/apache/spark/blob/2dc842aea82c8895125d46a00aa43dfb0d121de9/core/src/main/scala/org/apache/spark/memory/StorageMemoryPool.scala#L88) how much memory needs to be reclaimed via eviction; (2dc842aea82c8895125d46a00aa43dfb0d121de9). Along the way, I fixed some problems with the mocks in `MemoryManagerSuite`: the old mocks would [unconditionally](https://github.com/apache/spark/blob/80a824d36eec9d9a9f092ee1741453851218ec73/core/src/test/scala/org/apache/spark/memory/MemoryManagerSuite.scala#L84) report that a block had been evicted even if there was enough space in the storage pool such that eviction would be avoided. I also fixed a problem where `StorageMemoryPool._memoryUsed` might become negative due to freed memory being double-counted when execution evicts storage. The problem was that `StorageMemoryPool.shrinkPoolToFreeSpace` would [decrement `_memoryUsed`](https://github.com/apache/spark/commit/7c68ca09cb1b12f157400866983f753ac863380e#diff-935c68a9803be144ed7bafdd2f756a0fL133) even though `StorageMemoryPool.freeMemory` had already decremented it as each evicted block was freed. See SPARK-12189 for details. Author: Josh Rosen <joshrosen@databricks.com> Author: Andrew Or <andrew@databricks.com> Closes #10170 from JoshRosen/SPARK-12165.
* [SPARK-12241][YARN] Improve failure reporting in Yarn client ↵Steve Loughran2015-12-093-31/+64
| | | | | | | | | | | | | | obtainTokenForHBase() This lines up the HBase token logic with that done for Hive in SPARK-11265: reflection with only CFNE (ClassNotFoundException) being swallowed. There is a test, one which doesn't try to put HBase on the yarn/test classpath and really do the reflection (the way the Hive introspection does). If people do want that, it could be added with careful POM work. Also: cut an incorrect comment from the Hive test case before copying it, and a couple of imports that may have been related to the Hive test in the past. Author: Steve Loughran <stevel@hortonworks.com> Closes #10227 from steveloughran/stevel/patches/SPARK-12241-obtainTokenForHBase.
* [SPARK-10582][YARN][CORE] Fix AM failure situation for dynamic allocationjerryshao2015-12-094-2/+142
| | | | | | | | | | | | Because of an AM failure, the target executor numbers tracked by the driver and the AM will differ, which leads to unexpected behavior in dynamic allocation. So when the AM is re-registered with the driver, the state in `ExecutorAllocationManager` and `CoarseGrainedSchedulerBackend` should be reset. This issue was originally addressed in #8737 and is re-opened here. Thanks a lot KaiXinXiaoLei for finding this issue. andrewor14 and vanzin would you please help to review this, thanks a lot. Author: jerryshao <sshao@hortonworks.com> Closes #9963 from jerryshao/SPARK-10582.
* [SPARK-10299][ML] word2vec should allow users to specify the window sizeHolden Karau2015-12-093-4/+65
| | | | | | | | | Currently word2vec has the window hard-coded at 5; some users may want different sizes (for example if using it on n-gram input or similar). The user request comes from http://stackoverflow.com/questions/32231975/spark-word2vec-window-size . Author: Holden Karau <holden@us.ibm.com> Author: Holden Karau <holden@pigscanfly.ca> Closes #8513 from holdenk/SPARK-10299-word2vec-should-allow-users-to-specify-the-window-size.
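A sketch of the new knob, assuming the setter follows the existing naming convention (`setWindowSize`) and `tokenizedCorpus` is a hypothetical `RDD[Seq[String]]`:

```scala
import org.apache.spark.mllib.feature.Word2Vec

val model = new Word2Vec()
  .setVectorSize(100)
  .setWindowSize(10)   // previously hard-coded at 5
  .fit(tokenizedCorpus)
```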
* [SPARK-12012][SQL] Show more comprehensive PhysicalRDD metadata when ↵Cheng Lian2015-12-0911-32/+87
| | | | | | | | | | | | | | | | visualizing SQL query plan This PR adds a `private[sql]` method `metadata` to `SparkPlan`, which can be used to describe detail information about a physical plan during visualization. Specifically, this PR uses this method to provide details of `PhysicalRDD`s translated from a data source relation. For example, a `ParquetRelation` converted from Hive metastore table `default.psrc` is now shown as the following screenshot: ![image](https://cloud.githubusercontent.com/assets/230655/11526657/e10cb7e6-9916-11e5-9afa-f108932ec890.png) And here is the screenshot for a regular `ParquetRelation` (not converted from Hive metastore table) loaded from a really long path: ![output](https://cloud.githubusercontent.com/assets/230655/11680582/37c66460-9e94-11e5-8f50-842db5309d5a.png) Author: Cheng Lian <lian@databricks.com> Closes #10004 from liancheng/spark-12012.physical-rdd-metadata.
* [SPARK-12031][CORE][BUG] Integer overflow when do samplinguncleGen2015-12-092-7/+8
| | | | | | Author: uncleGen <hustyugm@gmail.com> Closes #10023 from uncleGen/1.6-bugfix.
* [SPARK-11676][SQL] Parquet filter tests all pass if filters are not really ↵hyukjinkwon2015-12-091-28/+41
| | | | | | | | | | | | | | | | pushed down Currently the Parquet predicate tests all pass even if filters are not pushed down or pushdown is disabled. In this PR, to check filter evaluation, it simply builds the expression from `expression.Filter` and then tries to create filters just like Spark does. To check the results, it manually accesses the child RDD (of `expression.Filter`) to produce the values that should be filtered properly, and then compares them to the expected values. Now, if filters are not pushed down or pushdown is disabled, exceptions are thrown. Author: hyukjinkwon <gurwls223@gmail.com> Closes #9659 from HyukjinKwon/SPARK-11676.
* [SPARK-12222] [CORE] Deserialize RoaringBitmap using Kryo serializer throw ↵Fei Wang2015-12-082-2/+36
| | | | | | | | | | | | | | | | | | | | | | | | Buffer underflow exception Jira: https://issues.apache.org/jira/browse/SPARK-12222 Deserializing a RoaringBitmap with the Kryo serializer throws a Buffer underflow exception: ``` com.esotericsoftware.kryo.KryoException: Buffer underflow. at com.esotericsoftware.kryo.io.Input.require(Input.java:156) at com.esotericsoftware.kryo.io.Input.skip(Input.java:131) at com.esotericsoftware.kryo.io.Input.skip(Input.java:264) ``` This is caused by a bug in Kryo's `Input.skip(long count)` (https://github.com/EsotericSoftware/kryo/issues/119), which we call in `KryoInputDataInputBridge`. Instead of upgrading Kryo's version, this PR bypasses Kryo's `Input.skip(long count)` by directly calling another `skip` method in Kryo's Input.java (https://github.com/EsotericSoftware/kryo/blob/kryo-2.21/src/com/esotericsoftware/kryo/io/Input.java#L124), i.e. it writes the bug-fixed version of `Input.skip(long count)` in KryoInputDataInputBridge's `skipBytes` method. More detail at https://github.com/apache/spark/pull/9748#issuecomment-162860246 Author: Fei Wang <wangfei1@huawei.com> Closes #10213 from scwf/patch-1.
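The essence of the workaround, sketched: loop over Kryo's int-based `Input.skip(int)`, which is unaffected by the bug, from inside the bridge's `skipBytes`; `input` stands for the wrapped Kryo `Input`:

```scala
override def skipBytes(n: Int): Int = {
  var remaining: Long = n
  while (remaining > 0) {
    // The int overload of skip does not hit the Input.skip(long) bug.
    val skip = Math.min(Integer.MAX_VALUE, remaining).toInt
    input.skip(skip)
    remaining -= skip
  }
  n
}
```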
* [SPARK-11343][ML] Documentation of float and double prediction/label columns ↵Dominik Dahlem2015-12-081-2/+7
| | | | | | | | | | | | in RegressionEvaluator felixcheung , mengxr Just added a message to require() Author: Dominik Dahlem <dominik.dahlem@gmail.combination> Closes #9598 from dahlem/ddahlem_regression_evaluator_double_predictions_message_04112015.
* [SPARK-8517][ML][DOC] Reorganizes the spark.ml user guideTimothy Hunter2015-12-088-81/+1752
| | | | | | | | | | This PR moves pieces of the spark.ml user guide to reflect suggestions in SPARK-8517. It does not introduce new content, as requested. <img width="192" alt="screen shot 2015-12-08 at 11 36 00 am" src="https://cloud.githubusercontent.com/assets/7594753/11666166/e82b84f2-9d9f-11e5-8904-e215424d8444.png"> Author: Timothy Hunter <timhunter@databricks.com> Closes #10207 from thunterdb/spark-8517.
* [SPARK-12069][SQL] Update documentation with DatasetsMichael Armbrust2015-12-085-104/+237
| | | | | | Author: Michael Armbrust <michael@databricks.com> Closes #10060 from marmbrus/docs.
* [SPARK-12187] *MemoryPool classes should not be fully publicAndrew Or2015-12-084-4/+4
| | | | | | | | This patch tightens them to `private[memory]`. Author: Andrew Or <andrew@databricks.com> Closes #10182 from andrewor14/memory-visibility.
* [SPARK-3873][BUILD] Add style checker to enforce import ordering.Marcelo Vanzin2015-12-081-1/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The checker tries to follow as closely as possible the guidelines of the code style document, and makes some decisions where the guide is not clear. In particular: - wildcard imports come first when there are other imports in the same package - multi-import blocks come before single imports - lower-case names inside multi-import blocks come before others In some projects, such as graphx, there seems to be a convention to separate o.a.s imports from the project's own; to simplify the checker, I chose not to allow that, which is a strict interpretation of the code style guide, even though I think it makes sense. Since the checks are based on syntax only, some edge cases may generate spurious warnings; for example, when class names start with a lower case letter (and are thus treated as a package name by the checker). The checker is currently only generating warnings, and since there are many of those, the build output does get a little noisy. The idea is to fix the code (and the checker, as needed) little by little instead of having a huge change that touches everywhere. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #6502 from vanzin/SPARK-3873.
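An illustrative file header under those rules (the file contents are hypothetical):

```scala
import java.io._                        // wildcard import first within its group
import java.util.{ArrayList, HashMap}   // multi-import block before single imports
import java.util.Collections

import scala.collection.mutable

import org.apache.spark.SparkContext    // o.a.s imports are not split out
import org.apache.spark.rdd.RDD
```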
* [SPARK-12159][ML] Add user guide section for IndexToString transformerBenFradet2015-12-084-16/+268
| | | | | | | | Documentation regarding the `IndexToString` label transformer with code snippets in Scala/Java/Python. Author: BenFradet <benjamin.fradet@gmail.com> Closes #10166 from BenFradet/SPARK-12159.
* [SPARK-11605][MLLIB] ML 1.6 QA: API: Java compatibility, docsYuhao Yang2015-12-085-27/+96
| | | | | | | | | | | | | | | | | | | | | | | | | jira: https://issues.apache.org/jira/browse/SPARK-11605 Check Java compatibility for MLlib for this release. Fixes: 1. `StreamingTest.registerStream` needs a Java-friendly interface. 2. `GradientBoostedTreesModel.computeInitialPredictionAndError` and `GradientBoostedTreesModel.updatePredictionError` have a Java compatibility issue. Mark them as `DeveloperApi`. TBD: [updated] no fix for now per discussion. `org.apache.spark.mllib.classification.LogisticRegressionModel` `public scala.Option<java.lang.Object> getThreshold();` has the wrong return type for Java invocation. `SVMModel` has a similar issue. Yet adding a `scala.Option<java.lang.Double> getThreshold()` would result in an overloading error due to the same function signature, and adding a new function with a different name does not seem necessary. cc jkbradley feynmanliang Author: Yuhao Yang <hhbyyh@gmail.com> Closes #10102 from hhbyyh/javaAPI.
* [SPARK-12205][SQL] Pivot fails Analysis when aggregate is UnresolvedFunctionAndrew Ray2015-12-082-1/+9
| | | | | | | | Delays application of ResolvePivot until all aggregates are resolved, to prevent problems with UnresolvedFunction, and adds a unit test. Author: Andrew Ray <ray.andrew@gmail.com> Closes #10202 from aray/sql-pivot-unresolved-function.
* [SPARK-10393] use ML pipeline in LDA exampleYuhao Yang2015-12-081-113/+40
| | | | | | | | | | | jira: https://issues.apache.org/jira/browse/SPARK-10393 Since the logic of the text processing part has been moved to ML estimators/transformers, replace the related code in LDA Example with the ML pipeline. Author: Yuhao Yang <hhbyyh@gmail.com> Author: yuhaoyang <yuhao@zhanglipings-iMac.local> Closes #8551 from hhbyyh/ldaExUpdate.
* [SPARK-12188][SQL] Code refactoring and comment correction in Dataset APIsgatorsmile2015-12-081-40/+40
| | | | | | | | | | | | | | | This PR contains the following updates: - Created a new private variable `boundTEncoder` that can be shared by multiple functions, `RDD`, `select` and `collect`. - Replaced all the `queryExecution.analyzed` by the function call `logicalPlan` - A few API comments are using wrong class names (e.g., `DataFrame`) or parameter names (e.g., `n`) - A few API descriptions are wrong. (e.g., `mapPartitions`) marmbrus rxin cloud-fan Could you take a look and check if they are appropriate? Thank you! Author: gatorsmile <gatorsmile@gmail.com> Closes #10184 from gatorsmile/datasetClean.
* [SPARK-12195][SQL] Adding BigDecimal, Date and Timestamp into Encodergatorsmile2015-12-082-0/+35
| | | | | | | | | | This PR adds three more data types to Encoder: `BigDecimal`, `Date` and `Timestamp`. marmbrus cloud-fan rxin Could you take a quick look at these three types? Not sure if it can be merged to 1.6. Thank you very much! Author: gatorsmile <gatorsmile@gmail.com> Closes #10188 from gatorsmile/dataTypesinEncoder.
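A sketch assuming the new types surface as factory methods on `Encoders`, next to the existing primitive ones:

```scala
import java.sql.{Date, Timestamp}
import org.apache.spark.sql.{Encoder, Encoders}

val decimalEncoder: Encoder[java.math.BigDecimal] = Encoders.DECIMAL
val dateEncoder: Encoder[Date] = Encoders.DATE
val timestampEncoder: Encoder[Timestamp] = Encoders.TIMESTAMP
```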
* [SPARK-12201][SQL] add type coercion rule for greatest/leastWenchen Fan2015-12-083-0/+47
| | | | | | | | | Checked with Hive: greatest/least should cast their children to the tightest common type, i.e. `(int, long) => long`, `(int, string) => error`, `(decimal(10,5), decimal(5, 10)) => error` Author: Wenchen Fan <wenchen@databricks.com> Closes #10196 from cloud-fan/type-coercion.
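Concretely, with a `sqlContext` in scope (illustrative queries following the rule stated above):

```scala
sqlContext.sql("SELECT greatest(1, 2L)")   // (int, long) => long: coerced, succeeds
sqlContext.sql("SELECT greatest(1, 'a')")  // (int, string): no tightest common type, analysis error
```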
* [SPARK-12074] Avoid memory copy involving ↵tedyu2015-12-083-7/+8
| | | | | | | | | | | | | ByteBuffer.wrap(ByteArrayOutputStream.toByteArray) SPARK-12060 fixed JavaSerializerInstance.serialize. This PR applies the same technique to two other classes. zsxwing Author: tedyu <yuzhihong@gmail.com> Closes #10177 from tedyu/master.
* [SPARK-11155][WEB UI] Stage summary json should include stage durationXin Ren2015-12-0811-9/+124
| | | | | | | | | | The json endpoint for stages doesn't include information on the stage duration that is present in the UI. This looks like a simple oversight; it should be included, e.g. at api/v1/applications/<appId>/stages. The metrics I've added are: submissionTime, firstTaskLaunchedTime and completionTime Author: Xin Ren <iamshrek@126.com> Closes #10107 from keypointt/SPARK-11155.
* [SPARK-11652][CORE] Remote code execution with InvokerTransformerSean Owen2015-12-081-1/+1
| | | | | | | | | | Fix commons-collection group ID to commons-collections for version 3.x Patches earlier PR at https://github.com/apache/spark/pull/9731 Author: Sean Owen <sowen@cloudera.com> Closes #10198 from srowen/SPARK-11652.2.
* [SPARK-11551][DOC][EXAMPLE] Revert PR #10002Cheng Lian2015-12-0852-2806/+1058
| | | | | | | | | | This reverts PR #10002, commit 78209b0ccaf3f22b5e2345dfb2b98edfdb746819. The original PR wasn't tested on Jenkins before being merged. Author: Cheng Lian <lian@databricks.com> Closes #10200 from liancheng/revert-pr-10002.
* [SPARK-11439][ML] Optimization of creating sparse feature without dense oneNakul Jindal2015-12-083-122/+142
| | | | | | | | Sparse features generated in LinearDataGenerator no longer create dense vectors as an intermediate. Author: Nakul Jindal <njindal@us.ibm.com> Closes #9756 from nakul02/SPARK-11439_sparse_without_creating_dense_feature.
* [SPARK-12166][TEST] Unset hadoop related environment in testingJeff Zhang2015-12-081-0/+6
| | | | | | Author: Jeff Zhang <zjffdu@apache.org> Closes #10172 from zjffdu/SPARK-12166.
* [SPARK-12103][STREAMING][KAFKA][DOC] document that K means Key and V …cody koeninger2015-12-081-0/+61
| | | | | | | | …means Value Author: cody koeninger <cody@koeninger.org> Closes #10132 from koeninger/SPARK-12103.
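For reference, where K and V appear in the direct-stream signature; `ssc`, `kafkaParams` and `topics` are hypothetical:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// Type parameters are [K, V, KeyDecoder, ValueDecoder]: K is the Kafka
// message key type and V is the message value type, not free type variables.
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)
```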
* [SPARK-11958][SPARK-11957][ML][DOC] SQLTransformer user guide and example codeYanbo Liang2015-12-075-2/+212
| | | | | | | | Add ```SQLTransformer``` user guide, example code and make Scala API doc more clear. Author: Yanbo Liang <ybliang8@gmail.com> Closes #10006 from yanboliang/spark-11958.
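The gist of the transformer, sketched; `df` is a hypothetical DataFrame with columns v1 and v2:

```scala
import org.apache.spark.ml.feature.SQLTransformer

// The statement is run against the input DataFrame, which is referenced
// by the placeholder __THIS__.
val sqlTrans = new SQLTransformer()
  .setStatement("SELECT *, (v1 + v2) AS v3, (v1 * v2) AS v4 FROM __THIS__")
val transformed = sqlTrans.transform(df)
```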
* [SPARK-10259][ML] Add @since annotation to ml.classificationTakahashi Hiroshi2015-12-077-44/+185
| | | | | | | | Add since annotation to ml.classification Author: Takahashi Hiroshi <takahashi.hiroshi@lab.ntt.co.jp> Closes #8534 from taishi-oss/issue10259.
* Closes #10098Xiangrui Meng2015-12-070-0/+0
|
* [SPARK-11551][DOC][EXAMPLE] Replace example code in ml-features.md using ↵somideshmukh2015-12-0752-1058/+2806
| | | | | | | | | | | | | include_example Made a new patch containing only the markdown examples moved to the example/ folder. Only three Java examples were not shifted, since they contained compilation errors; these classes are 1) StandardScale 2) NormalizerExample 3) VectorIndexer Author: Xusen Yin <yinxusen@gmail.com> Author: somideshmukh <somilde@us.ibm.com> Closes #10002 from somideshmukh/SomilBranch1.33.
* [SPARK-12160][MLLIB] Use SQLContext.getOrCreate in MLlibJoseph K. Bradley2015-12-0713-29/+29
| | | | | | | | | | | | Switched from using SQLContext constructor to using getOrCreate, mainly in model save/load methods. This covers all instances in spark.mllib. There were no uses of the constructor in spark.ml. CC: mengxr yhuai Author: Joseph K. Bradley <joseph@databricks.com> Closes #10161 from jkbradley/mllib-sqlcontext-fix.
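The pattern in question, as a one-liner (`sc` is an existing SparkContext):

```scala
import org.apache.spark.sql.SQLContext

// Reuses the SQLContext already associated with this SparkContext, if any,
// instead of constructing a fresh one inside every model save/load call.
val sqlContext = SQLContext.getOrCreate(sc)
```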