Commit message | Author | Age | Files | Lines
* [SPARK-4292][SQL] Result set iterator bug in JDBC/ODBCwangfei2014-11-073-6/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | select * from src, get the wrong result set as follows: ``` ... | 309 | val_309 | | 309 | val_309 | | 309 | val_309 | | 309 | val_309 | | 309 | val_309 | | 309 | val_309 | | 309 | val_309 | | 309 | val_309 | | 309 | val_309 | | 309 | val_309 | | 97 | val_97 | | 97 | val_97 | | 97 | val_97 | | 97 | val_97 | | 97 | val_97 | | 97 | val_97 | | 97 | val_97 | | 97 | val_97 | | 97 | val_97 | | 97 | val_97 | | 97 | val_97 | ... ``` Author: wangfei <wangfei1@huawei.com> Closes #3149 from scwf/SPARK-4292 and squashes the following commits: 1574a43 [wangfei] using result.collect 8b2d845 [wangfei] adding test f64eddf [wangfei] result set iter bug (cherry picked from commit d6e55524437026c0c76addeba8f99249a8316716) Signed-off-by: Michael Armbrust <michael@databricks.com>
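The squashed commits above point at the fix direction ("using result.collect"). The following is only an illustrative sketch of that idea, assuming a `HiveContext` named `hiveContext` is in scope; it is not the actual Thrift-server code path:
```scala
// Materialize the query result once with collect(), then serve rows from the
// local array; repeatedly consuming a one-shot iterator is the kind of mistake
// that produces the duplicated result set shown above.
val rows = hiveContext.sql("SELECT * FROM src").collect()
val iter = rows.iterator   // safe to drain incrementally across fetch requests
```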
* [SPARK-4203][SQL] Partition directories in random order when inserting into ↵Matthew Taylor2014-11-072-4/+43
| | | | | | | | | | | | | | | | | | | hive table When inserting into a Hive table with partitions, the partition directories are written to the file system in a random order instead of the order defined at table creation. The loadPartition method in Hive.java takes a Map<String,String> parameter but expects to be called with a map that has a defined ordering, such as LinkedHashMap. Still working on a test, but having IntelliJ problems Author: Matthew Taylor <matthew.t@tbfe.net> Closes #3076 from tbfenet/partition_dir_order_problem and squashes the following commits: f1b9a52 [Matthew Taylor] Comment format fix bca709f [Matthew Taylor] review changes 0e50f6b [Matthew Taylor] test fix 99f1a31 [Matthew Taylor] partition ordering fix 369e618 [Matthew Taylor] partition ordering fix (cherry picked from commit ac70c972a51952f801fd02dd5962c0a0c1aba8f8) Signed-off-by: Michael Armbrust <michael@databricks.com>
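A small illustration of the ordering difference at the heart of this fix, using plain java.util maps rather than the actual Hive.loadPartition call:
```scala
import java.util.{HashMap => JHashMap, LinkedHashMap => JLinkedHashMap}

// HashMap makes no ordering guarantee, so partition columns can come back in an
// arbitrary order; LinkedHashMap preserves insertion order, which is what a
// partition spec needs when the directory path is built from it.
val unordered = new JHashMap[String, String]()
val ordered = new JLinkedHashMap[String, String]()
for ((k, v) <- Seq("year" -> "2014", "month" -> "11", "day" -> "07")) {
  unordered.put(k, v)
  ordered.put(k, v)
}
println(unordered.keySet())  // order is not guaranteed
println(ordered.keySet())    // always [year, month, day]
```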
* [SPARK-4270][SQL] Fix Cast from DateType to DecimalType.Takuya UESHIN2014-11-072-1/+3
| | | | | | | | | | | | | `Cast` from `DateType` to `DecimalType` throws `NullPointerException`. Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #3134 from ueshin/issues/SPARK-4270 and squashes the following commits: 7394e4b [Takuya UESHIN] Fix Cast from DateType to DecimalType. (cherry picked from commit a6405c5ddcda112f8efd7d50d8e5f44f78a0fa41) Signed-off-by: Michael Armbrust <michael@databricks.com>
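A hedged reproduction sketch of the failing cast; the context name `hc` and the `src` table are assumptions, and the exact HiveQL accepted by this release may differ:
```scala
// Cast a value to DATE and then to DECIMAL; the DateType -> DecimalType branch
// of Cast is the code path this commit fixes (it previously hit an NPE).
hc.sql("SELECT CAST(CAST('2014-11-07' AS DATE) AS DECIMAL) FROM src LIMIT 1").collect()
```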
* [SPARK-4272] [SQL] Add more unwrapper functions for primitive type in ↵Cheng Hao2014-11-072-4/+15
| | | | | | | | | | | | | | | | TableReader Currently, the data "unwrap" supports only a handful of primitive types, not all of them. The missing cases do not cause an exception, but they can cost performance when scanning tables with types like binary, date, timestamp, decimal, etc. Author: Cheng Hao <hao.cheng@intel.com> Closes #3136 from chenghao-intel/table_reader and squashes the following commits: fffb729 [Cheng Hao] fix bug for retrieving the timestamp object e9c97a4 [Cheng Hao] Add more unwrapper functions for primitive type in TableReader (cherry picked from commit 60ab80f501b8384ddf48a9ac0ba0c2b9eb548b28) Signed-off-by: Michael Armbrust <michael@databricks.com>
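An illustrative sketch (not the actual TableReader code) of what adding unwrappers means: resolve a type-specific unwrap function per Hive ObjectInspector once, instead of falling back to a generic, slower path for types such as timestamp or decimal:
```scala
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector
import org.apache.hadoop.hive.serde2.objectinspector.primitive.{BinaryObjectInspector, HiveDecimalObjectInspector, TimestampObjectInspector}

// Pick the unwrap function once per column, then reuse it for every row.
def unwrapperFor(oi: PrimitiveObjectInspector): AnyRef => AnyRef = oi match {
  case x: TimestampObjectInspector   => data => x.getPrimitiveJavaObject(data)
  case x: HiveDecimalObjectInspector => data => x.getPrimitiveJavaObject(data)
  case x: BinaryObjectInspector      => data => x.getPrimitiveJavaObject(data)
  case other                         => data => other.getPrimitiveJavaObject(data)
}
```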
* [SPARK-4213][SQL] ParquetFilters - No support for LT, LTE, GT, GTE operatorsKousuke Saruta2014-11-072-11/+364
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Following description is quoted from JIRA: When I issue a hql query against a HiveContext where my predicate uses a column of string type with one of LT, LTE, GT, or GTE operator, I get the following error: scala.MatchError: StringType (of class org.apache.spark.sql.catalyst.types.StringType$) Looking at the code in org.apache.spark.sql.parquet.ParquetFilters, StringType is absent from the corresponding functions for creating these filters. To reproduce, in a Hive 0.13.1 shell, I created the following table (at a specified DB): create table sparkbug ( id int, event string ) stored as parquet; Insert some sample data: insert into table sparkbug select 1, '2011-06-18' from <some table> limit 1; insert into table sparkbug select 2, '2012-01-01' from <some table> limit 1; Launch a spark shell and create a HiveContext to the metastore where the table above is located. import org.apache.spark.sql._ import org.apache.spark.sql.SQLContext import org.apache.spark.sql.hive.HiveContext val hc = new HiveContext(sc) hc.setConf("spark.sql.shuffle.partitions", "10") hc.setConf("spark.sql.hive.convertMetastoreParquet", "true") hc.setConf("spark.sql.parquet.compression.codec", "snappy") import hc._ hc.hql("select * from <db>.sparkbug where event >= '2011-12-01'") A scala.MatchError will appear in the output. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #3083 from sarutak/SPARK-4213 and squashes the following commits: 4ab6e56 [Kousuke Saruta] WIP b6890c6 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-4213 9a1fae7 [Kousuke Saruta] Fixed ParquetFilters so that compare Strings (cherry picked from commit 14c54f1876fcf91b5c10e80be2df5421c7328557) Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SQL] Modify keyword val location according to orderingJacky Li2014-11-071-1/+1
| | | | | | | | | | | | | 'DOUBLE' should be moved before 'ELSE' according to the ordering convention Author: Jacky Li <jacky.likun@gmail.com> Closes #3080 from jackylk/patch-5 and squashes the following commits: 3c11df7 [Jacky Li] [SQL] Modify keyword val location according to ordering (cherry picked from commit 68609c51ad1ab2def302df3c4a1c0bc1ec6e1075) Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SQL] Support ScalaReflection of schema in different universesMichael Armbrust2014-11-071-3/+15
| | | | | | | | | | | Author: Michael Armbrust <michael@databricks.com> Closes #3096 from marmbrus/reflectionContext and squashes the following commits: adc221f [Michael Armbrust] Support ScalaReflection of schema in different universes (cherry picked from commit 8154ed7df6c5407e638f465d3bd86b43f36216ef) Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SPARK-4225][SQL] Resorts to SparkContext.version to inspect Spark versionCheng Lian2014-11-072-24/+12
| | | | | | | | | | | | | | | This PR resorts to `SparkContext.version` rather than META-INF/MANIFEST.MF in the assembly jar to inspect Spark version. Currently, when built with Maven, the MANIFEST.MF file in the assembly jar is incorrectly replaced by Guava 15.0 MANIFEST.MF, probably because of the assembly/shading tricks. Another related PR is #3103, which tries to fix the MANIFEST issue. Author: Cheng Lian <lian@databricks.com> Closes #3105 from liancheng/spark-4225 and squashes the following commits: d9585e1 [Cheng Lian] Resorts to SparkContext.version to inspect Spark version (cherry picked from commit 86e9eaa3f0ec23cb38bce67585adb2d5f484f4ee) Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SQL][DOC][Minor] Spark SQL Hive now support dynamic partitioningwangfei2014-11-071-1/+0
| | | | | | | | | | | Author: wangfei <wangfei1@huawei.com> Closes #3127 from scwf/patch-9 and squashes the following commits: e39a560 [wangfei] now support dynamic partitioning (cherry picked from commit 636d7bcc96b912f5b5caa91110cd55b55fa38ad8) Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SPARK-4187] [Core] Switch to binary protocol for external shuffle service ↵Aaron Davidson2014-11-0730-284/+640
| | | | | | | | | | | | | | | | | | | | | messages This PR eliminates the network package's usage of the Java serializer and replaces it with Encodable, which is a lightweight binary protocol. Each message is preceded by a type id, which will allow us to change messages (by only adding new ones), or to change the format entirely by switching to a special id (such as -1). This protocol has the advantage over Java serialization that we can guarantee messages will remain compatible across compiled versions and JVMs, though it does not provide a clean way to do schema migration. In the future, it may be good to use a more heavyweight serialization format like protobuf, thrift, or avro, but these all add several dependencies which are unnecessary at the present time. Additionally, this unifies the RPC messages of NettyBlockTransferService and ExternalShuffleClient. Author: Aaron Davidson <aaron@databricks.com> Closes #3146 from aarondav/free and squashes the following commits: ed1102a [Aaron Davidson] Remove some unused imports b8e2a49 [Aaron Davidson] Add appId to test 538f2a3 [Aaron Davidson] [SPARK-4187] [Core] Switch to binary protocol for external shuffle service messages (cherry picked from commit d4fa04e50d299e9cad349b3781772956453a696b) Signed-off-by: Reynold Xin <rxin@databricks.com>
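A minimal sketch of the "type id + payload" framing described above, written against a plain ByteBuffer; the real messages implement the network package's Encodable interface, and the message name and field layout below are illustrative only:
```scala
import java.nio.ByteBuffer

// Each message is preceded by a one-byte type id, so new message types can be
// added (or the whole format swapped out via a sentinel id) without relying on
// Java serialization compatibility.
case class RegisterExecutorMsg(appId: String, execId: String) {
  val typeId: Byte = 1
  def encode(): ByteBuffer = {
    val app = appId.getBytes("UTF-8")
    val exec = execId.getBytes("UTF-8")
    val buf = ByteBuffer.allocate(1 + 4 + app.length + 4 + exec.length)
    buf.put(typeId).putInt(app.length).put(app).putInt(exec.length).put(exec)
    buf.flip()
    buf
  }
}
```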
* [SPARK-4204][Core][WebUI] Change Utils.exceptionString to contain the inner ↵zsxwing2014-11-0612-30/+148
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | exceptions and make the error information in Web UI more friendly This PR fixed `Utils.exceptionString` to output the full exception information. However, the stack trace may become very huge, so I also updated the Web UI to collapse the error information by default (display the first line and clicking `+detail` will display the full info). Here are the screenshots: Stages: ![stages](https://cloud.githubusercontent.com/assets/1000778/4882441/66d8cc68-6356-11e4-8346-6318677d9470.png) Details for one stage: ![stage](https://cloud.githubusercontent.com/assets/1000778/4882513/1311043c-6357-11e4-8804-ca14240a9145.png) The full information in the gray text field is: ```Java org.apache.spark.shuffle.FetchFailedException: Connection reset by peer at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67) at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83) at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:129) at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:160) at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:159) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:189) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) Caused by: java.io.IOException: Connection reset by peer at sun.nio.ch.FileDispatcher.read0(Native Method) at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21) at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:198) at sun.nio.ch.IOUtil.read(IOUtil.java:166) at 
sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:245) at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311) at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881) at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:225) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116) ... 1 more ``` /cc aarondav Author: zsxwing <zsxwing@gmail.com> Closes #3073 from zsxwing/SPARK-4204 and squashes the following commits: 176d1e3 [zsxwing] Add comments to explain the stack trace difference ca509d3 [zsxwing] Add fullStackTrace to the constructor of ExceptionFailure a07057b [zsxwing] Core style fix dfb0032 [zsxwing] Backward compatibility for old history server 1e50f71 [zsxwing] Update as per review and increase the max height of the stack trace details 94f2566 [zsxwing] Change Utils.exceptionString to contain the inner exceptions and make the error information in Web UI more friendly (cherry picked from commit 3abdb1b24aa48f21e7eed1232c01d3933873688c) Signed-off-by: Andrew Or <andrew@databricks.com>
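A tiny sketch of the core behavior `Utils.exceptionString` gains here: rendering the full stack trace, including the `Caused by:` chain, rather than only the top-level message (the web UI collapse/expand handling is omitted):
```scala
import java.io.{PrintWriter, StringWriter}

// printStackTrace already walks the cause chain, so writing it into a string
// captures the inner exceptions as well.
def fullStackTrace(e: Throwable): String = {
  val sw = new StringWriter()
  e.printStackTrace(new PrintWriter(sw, true))
  sw.toString
}
```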
* [SPARK-4236] Cleanup removed applications' files in shuffle serviceAaron Davidson2014-11-068-22/+319
| | | | | | | | | | | | | | | | This relies on a hook from whoever is hosting the shuffle service to invoke removeApplication() when the application is completed. Once invoked, we will clean up all the executors' shuffle directories we know about. Author: Aaron Davidson <aaron@databricks.com> Closes #3126 from aarondav/cleanup and squashes the following commits: 33a64a9 [Aaron Davidson] Missing brace e6e428f [Aaron Davidson] Address comments 16a0d27 [Aaron Davidson] Cleanup e4df3e7 [Aaron Davidson] [SPARK-4236] Cleanup removed applications' files in shuffle service (cherry picked from commit 48a19a6dba896f7d0b637f84e114b7efbb814e51) Signed-off-by: Andrew Or <andrew@databricks.com>
* [SPARK-4188] [Core] Perform network-level retry of shuffle file fetchesAaron Davidson2014-11-0616-45/+668
| | | | | | | | | | | | | | | | | | | | | | | | This adds a RetryingBlockFetcher to the NettyBlockTransferService which is wrapped around our typical OneForOneBlockFetcher, adding retry logic in the event of an IOException. This sort of retry allows us to avoid marking an entire executor as failed due to garbage collection or high network load. TODO: - [x] unit tests - [x] put in ExternalShuffleClient too Author: Aaron Davidson <aaron@databricks.com> Closes #3101 from aarondav/retry and squashes the following commits: 72a2a32 [Aaron Davidson] Add that we should remove the condition around the retry thingy c7fd107 [Aaron Davidson] Fix unit tests e80e4c2 [Aaron Davidson] Address initial comments 6f594cd [Aaron Davidson] Fix unit test 05ff43c [Aaron Davidson] Add to external shuffle client and add unit test 66e5a24 [Aaron Davidson] [SPARK-4238] [Core] Perform network-level retry of shuffle file fetches (cherry picked from commit f165b2bbf5d4acf34d826fa55b900f5bbc295654) Signed-off-by: Reynold Xin <rxin@databricks.com>
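A hand-wavy sketch of the retry idea, not the actual RetryingBlockFetcher API (the function name and backoff policy are made up): retry a fetch a bounded number of times on IOException instead of letting one transient failure mark the executor as lost:
```scala
import java.io.IOException

def fetchWithRetry[T](maxRetries: Int, waitMs: Long = 5000L)(fetch: => T): T = {
  var attempt = 0
  while (true) {
    try {
      return fetch                 // success: hand the result back
    } catch {
      case e: IOException if attempt < maxRetries =>
        attempt += 1               // transient failure (GC pause, network load): back off and retry
        Thread.sleep(waitMs)
    }
  }
  sys.error("unreachable")         // the loop only exits via return or a rethrown exception
}
```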
* [SPARK-4277] Support external shuffle service on Standalone WorkerAaron Davidson2014-11-066-26/+79
| | | | | | | | | | | | | | Author: Aaron Davidson <aaron@databricks.com> Closes #3142 from aarondav/worker and squashes the following commits: 3780bd7 [Aaron Davidson] Address comments 2dcdfc1 [Aaron Davidson] Add private[worker] 47f49d3 [Aaron Davidson] NettyBlockTransferService shouldn't care about app ids (it's only b/t executors) 258417c [Aaron Davidson] [SPARK-4277] Support external shuffle service on executor (cherry picked from commit 6e9ef10fd7446a11f37446c961916ba2a8e02cb8) Signed-off-by: Andrew Or <andrew@databricks.com>
* [SPARK-3797] Minor addendum to Yarn shuffle serviceAndrew Or2014-11-064-26/+28
| | | | | | | | | | | | | | | | | I did not realize there was a `network.util.JavaUtils` when I wrote this code. This PR moves the `ByteBuffer` string conversion to the appropriate place. I tested the changes on a stable yarn cluster. Author: Andrew Or <andrew@databricks.com> Closes #3144 from andrewor14/yarn-shuffle-util and squashes the following commits: b6c08bf [Andrew Or] Remove unused import 94e205c [Andrew Or] Use netty Unpooled 85202a5 [Andrew Or] Use guava Charsets 057135b [Andrew Or] Reword comment adf186d [Andrew Or] Move byte buffer String conversion logic to JavaUtils (cherry picked from commit 96136f222abd4f3abd10cb78a4ebecdb21f3bde7) Signed-off-by: Andrew Or <andrew@databricks.com>
* [HOT FIX] Make distribution failsAndrew Or2014-11-061-3/+0
| | | | | | | | | | | | | This was added by me in https://github.com/apache/spark/commit/61a5cced049a8056292ba94f23fa7bd040f50685. The real fix will be added in [SPARK-4281](https://issues.apache.org/jira/browse/SPARK-4281). Author: Andrew Or <andrew@databricks.com> Closes #3145 from andrewor14/fix-make-distribution and squashes the following commits: c78be61 [Andrew Or] Hot fix make distribution (cherry picked from commit 470881b24a503c9edcaed159c29bafa446ab0e9a) Signed-off-by: Andrew Or <andrew@databricks.com>
* [SPARK-4249][GraphX]fix a problem of EdgePartitionBuilder in Graphxlianhuiwang2014-11-061-2/+2
| | | | | | | | | | | | | At first srcIds is not initialized, so all of its entries are 0; we therefore use edgeArray(0).srcId to initialize currSrcId. Author: lianhuiwang <lianhuiwang09@gmail.com> Closes #3138 from lianhuiwang/SPARK-4249 and squashes the following commits: 3f4e503 [lianhuiwang] fix a problem of EdgePartitionBuilder in Graphx (cherry picked from commit d15c6e9dc2860bbe56e31ddf71218ccc6d5c841d) Signed-off-by: Ankur Dave <ankurdave@gmail.com>
* [SPARK-4264] Completion iterator should only invoke callback onceAaron Davidson2014-11-062-1/+51
| | | | | | | | | | | Author: Aaron Davidson <aaron@databricks.com> Closes #3128 from aarondav/compiter and squashes the following commits: 698e4be [Aaron Davidson] [SPARK-4264] Completion iterator should only invoke callback once (cherry picked from commit 23eaf0e12ff221dcca40a79e61b6cc5e7c846cb5) Signed-off-by: Aaron Davidson <aaron@databricks.com>
* [SPARK-4186] add binaryFiles and binaryRecords in PythonDavies Liu2014-11-065-22/+90
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | add binaryFiles() and binaryRecords() in Python ``` binaryFiles(self, path, minPartitions=None): :: Developer API :: Read a directory of binary files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI as a byte array. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file. Note: Small files are preferred, large file is also allowable, but may cause bad performance. binaryRecords(self, path, recordLength): Load data from a flat binary file, assuming each record is a set of numbers with the specified numerical format (see ByteBuffer), and the number of bytes per record is constant. :param path: Directory to the input data files :param recordLength: The length at which to split the records ``` Author: Davies Liu <davies@databricks.com> Closes #3078 from davies/binary and squashes the following commits: cd0bdbd [Davies Liu] Merge branch 'master' of github.com:apache/spark into binary 3aa349b [Davies Liu] add experimental notes 24e84b6 [Davies Liu] Merge branch 'master' of github.com:apache/spark into binary 5ceaa8a [Davies Liu] Merge branch 'master' of github.com:apache/spark into binary 1900085 [Davies Liu] bugfix bb22442 [Davies Liu] add binaryFiles and binaryRecords in Python (cherry picked from commit b41a39e24038876359aeb7ce2bbbb4de2234e5f3) Signed-off-by: Matei Zaharia <matei@databricks.com>
* [SPARK-4255] Fix incorrect table stripingKay Ousterhout2014-11-062-5/+2
| | | | | | | | | | | | | | | This commit stripes table rows after hiding some rows, to ensure that rows are correct striped to alternate white and grey even when rows are hidden by default. Author: Kay Ousterhout <kayousterhout@gmail.com> Closes #3117 from kayousterhout/striping and squashes the following commits: be6e10a [Kay Ousterhout] [SPARK-4255] Fix incorrect table striping (cherry picked from commit 5f27ae16d5b016fae4afeb0f2ad779fd3130b390) Signed-off-by: Kay Ousterhout <kayousterhout@gmail.com>
* [SPARK-4137] [EC2] Don't change working dir on userNicholas Chammas2014-11-052-3/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | This issue was uncovered after [this discussion](https://issues.apache.org/jira/browse/SPARK-3398?focusedCommentId=14187471&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14187471). Don't change the working directory on the user. This breaks relative paths the user may pass in, e.g., for the SSH identity file. ``` ./ec2/spark-ec2 -i ../my.pem ``` This patch will preserve the user's current working directory and allow calls like the one above to work. Author: Nicholas Chammas <nicholas.chammas@gmail.com> Closes #2988 from nchammas/spark-ec2-cwd and squashes the following commits: f3850b5 [Nicholas Chammas] pep8 fix fbc20c7 [Nicholas Chammas] revert to old commenting style 752f958 [Nicholas Chammas] specify deploy.generic path absolutely bcdf6a5 [Nicholas Chammas] fix typo 77871a2 [Nicholas Chammas] add clarifying comment ce071fc [Nicholas Chammas] don't change working dir (cherry picked from commit db45f5ad0368760dbeaa618a04f66ae9b2bed656) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
* [SPARK-4262][SQL] add .schemaRDD to JavaSchemaRDDXiangrui Meng2014-11-051-0/+3
| | | | | | | | | | | | | marmbrus Author: Xiangrui Meng <meng@databricks.com> Closes #3125 from mengxr/SPARK-4262 and squashes the following commits: 307695e [Xiangrui Meng] add .schemaRDD to JavaSchemaRDD (cherry picked from commit 3d2b5bc5bb979d8b0b71e06bc0f4548376fdbb98) Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-4254] [mllib] MovieLensALS bug fixJoseph K. Bradley2014-11-051-1/+3
| | | | | | | | | | | | | | | Changed code so it does not try to serialize Params. CC: mengxr debasish83 srowen Author: Joseph K. Bradley <joseph@databricks.com> Closes #3116 from jkbradley/als-bugfix and squashes the following commits: e575bd8 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into als-bugfix 9401b16 [Joseph K. Bradley] changed implicitPrefs so it is not serialized to fix MovieLensALS example bug (cherry picked from commit c315d1316cb2372e90ae3a12f72d5b3304435a6b) Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-4158] Fix for missing resources.Brenden Matthews2014-11-052-4/+2
| | | | | | | | | | | | | | | Mesos offers may not contain all resources, and Spark needs to check to ensure they are present and sufficient. Spark may throw an erroneous exception when resources aren't present. Author: Brenden Matthews <brenden@diddyinc.com> Closes #3024 from brndnmtthws/fix-mesos-resource-misuse and squashes the following commits: e5f9580 [Brenden Matthews] [SPARK-4158] Fix for missing resources. (cherry picked from commit cb0eae3b78d7f6f56c0b9521ee48564a4967d3de) Signed-off-by: Andrew Or <andrew@databricks.com>
* SPARK-3223 runAsSparkUser cannot change HDFS write permission properly i...Jongyoul Lee2014-11-052-2/+2
| | | | | | | | | | | | | | | ...n mesos cluster mode - change master newer Author: Jongyoul Lee <jongyoul@gmail.com> Closes #3034 from jongyoul/SPARK-3223 and squashes the following commits: 42b2ed3 [Jongyoul Lee] SPARK-3223 runAsSparkUser cannot change HDFS write permission properly in mesos cluster mode - change master newer (cherry picked from commit f7ac8c2b1de96151231617846b7468d23379c74a) Signed-off-by: Andrew Or <andrew@databricks.com>
* SPARK-4040. Update documentation to exemplify use of local (n) value, fo...jay@apache.org2014-11-052-7/+17
| | | | | | | | | | | | | This is a minor docs update which helps to clarify the way local[n] is used for streaming apps. Author: jay@apache.org <jayunit100> Closes #2964 from jayunit100/SPARK-4040 and squashes the following commits: 35b5a5e [jay@apache.org] SPARK-4040: Update documentation to exemplify use of local (n) value. (cherry picked from commit 868cd4c3ca11e6ecc4425b972d9a20c360b52425) Signed-off-by: Matei Zaharia <matei@databricks.com>
* [SPARK-3797] Run external shuffle service in Yarn NMAndrew Or2014-11-0512-16/+483
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This creates a new module `network/yarn` that depends on `network/shuffle` recently created in #3001. This PR introduces a custom Yarn auxiliary service that runs the external shuffle service. As of the changes here this shuffle service is required for using dynamic allocation with Spark. This is still WIP mainly because it doesn't handle security yet. I have tested this on a stable Yarn cluster. Author: Andrew Or <andrew@databricks.com> Closes #3082 from andrewor14/yarn-shuffle-service and squashes the following commits: ef3ddae [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-shuffle-service 0ee67a2 [Andrew Or] Minor wording suggestions 1c66046 [Andrew Or] Remove unused provided dependencies 0eb6233 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-shuffle-service 6489db5 [Andrew Or] Try catch at the right places 7b71d8f [Andrew Or] Add detailed java docs + reword a few comments d1124e4 [Andrew Or] Add security to shuffle service (INCOMPLETE) 5f8a96f [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-shuffle-service 9b6e058 [Andrew Or] Address various feedback f48b20c [Andrew Or] Fix tests again f39daa6 [Andrew Or] Do not make network-yarn an assembly module 761f58a [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-shuffle-service 15a5b37 [Andrew Or] Fix build for Hadoop 1.x baff916 [Andrew Or] Fix tests 5bf9b7e [Andrew Or] Address a few minor comments 5b419b8 [Andrew Or] Add missing license header 804e7ff [Andrew Or] Include the Yarn shuffle service jar in the distribution cd076a4 [Andrew Or] Require external shuffle service for dynamic allocation ea764e0 [Andrew Or] Connect to Yarn shuffle service only if it's enabled 1bf5109 [Andrew Or] Use the shuffle service port specified through hadoop config b4b1f0c [Andrew Or] 4 tabs -> 2 tabs 43dcb96 [Andrew Or] First cut integration of shuffle service with Yarn aux service b54a0c4 [Andrew Or] Initial skeleton for Yarn shuffle service (cherry picked from commit 61a5cced049a8056292ba94f23fa7bd040f50685) Signed-off-by: Andrew Or <andrew@databricks.com>
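Because the external shuffle service becomes a requirement for dynamic allocation, an application would opt in roughly as follows; the property keys are quoted from memory and should be checked against the configuration docs for this release:
```scala
import org.apache.spark.SparkConf

// Pair dynamic allocation with the external shuffle service served by the
// Yarn NodeManager auxiliary service introduced in this change.
val conf = new SparkConf()
  .setAppName("dynamic-allocation-example")
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.dynamicAllocation.enabled", "true")
```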
* SPARK-4222 [CORE] use readFully in FixedLengthBinaryRecordReaderindustrial-sloth2014-11-051-1/+1
| | | | | | | | | | replaces the existing read() call with readFully(). Author: industrial-sloth <industrial-sloth@users.noreply.github.com> Closes #3093 from industrial-sloth/branch-1.2-fixedLenRecRdr and squashes the following commits: a245c8a [industrial-sloth] use readFully in FixedLengthBinaryRecordReader
* [SPARK-3984] [SPARK-3983] Fix incorrect scheduler delay and display task ↵Kay Ousterhout2014-11-054-3/+36
| | | | | | | | | | | | | | | | | | | | | | | | | | | | deserialization time in UI This commit fixes the scheduler delay in the UI (which previously included things that are not scheduler delay, like time to deserialize the task and serialize the result), and also adds information about time to deserialize tasks to the optional additional metrics. Time to deserialize the task can be large relative to task time for short jobs, and understanding when it is high can help developers realize that they should try to reduce closure size (e.g, by including less data in the task description). cc shivaram etrain Author: Kay Ousterhout <kayousterhout@gmail.com> Closes #2832 from kayousterhout/SPARK-3983 and squashes the following commits: 0c1398e [Kay Ousterhout] Fixed ordering 531575d [Kay Ousterhout] Removed executor launch time 1f13afe [Kay Ousterhout] Minor spacing fixes 335be4b [Kay Ousterhout] Made metrics hideable 5bc3cba [Kay Ousterhout] [SPARK-3984] [SPARK-3983] Improve UI task metrics. (cherry picked from commit a46497eecc50f854c5c5701dc2b8a2468b76c085) Signed-off-by: Kay Ousterhout <kayousterhout@gmail.com>
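A rough sketch of the corrected accounting described above (the parameter names are illustrative, not the real UI code): scheduler delay should exclude the time spent deserializing the task and serializing the result, which this change now reports separately:
```scala
// Whatever is not attributable to running, deserializing, or serializing the
// task counts as scheduler delay; clamp at zero to absorb clock skew.
def schedulerDelay(totalTime: Long, executorRunTime: Long,
    deserializationTime: Long, resultSerializationTime: Long): Long = {
  math.max(0L, totalTime - executorRunTime - deserializationTime - resultSerializationTime)
}
```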
* [EC2] Factor out Mesos spark-ec2 branchNicholas Chammas2014-11-051-2/+9
| | | | | | | | | | We reference a specific branch in two places. This patch makes it one place. Author: Nicholas Chammas <nicholas.chammas@gmail.com> Closes #3008 from nchammas/mesos-spark-ec2-branch and squashes the following commits: 10a6089 [Nicholas Chammas] factor out mess spark-ec2 branch
* [SPARK-4166][Core] Add a backward compatibility test for ExecutorLostFailurezsxwing2014-11-051-0/+9
| | | | | | | | Author: zsxwing <zsxwing@gmail.com> Closes #3085 from zsxwing/SPARK-4166-back-comp and squashes the following commits: 89329f4 [zsxwing] Add a backward compatibility test for ExecutorLostFailure
* [SPARK-4163][Core] Add a backward compatibility test for FetchFailedzsxwing2014-11-051-0/+11
| | | | | | | | | | /cc aarondav Author: zsxwing <zsxwing@gmail.com> Closes #3086 from zsxwing/SPARK-4163-back-comp and squashes the following commits: 21cb2a8 [zsxwing] Add a backward compatibility test for FetchFailed
* [SPARK-4168][WebUI] web stages number should show correctly when stages are ↵Zhang, Liye2014-11-062-4/+15
| | | | | | | | | | | | | | more than 1000 The number of completed stages and failed stages shown on the web UI will always be less than 1000. This is really misleading when thousands of stages have already completed or failed. The number should be correct even when only some of the stages are listed on the web UI (stage info is removed when the number grows too large). Author: Zhang, Liye <liye.zhang@intel.com> Closes #3035 from liyezhang556520/webStageNum and squashes the following commits: d9e29fb [Zhang, Liye] add detailed comments for variables 4ea8fd1 [Zhang, Liye] change variable name accroding to comments f4c404d [Zhang, Liye] [SPARK-4168][WebUI] web statges number should show correctly when stages are more than 1000
* [SPARK-611] Display executor thread dumps in web UIJosh Rosen2014-11-0513-7/+247
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch allows executor thread dumps to be collected on-demand and viewed in the Spark web UI. The thread dumps are collected using Thread.getAllStackTraces(). To allow remote thread dumps to be triggered from the web UI, I added a new `ExecutorActor` that runs inside of the Executor actor system and responds to RPCs from the driver. The driver's mechanism for obtaining a reference to this actor is a little bit hacky: it uses the block manager master actor to determine the host/port of the executor actor systems in order to construct ActorRefs to ExecutorActor. Unfortunately, I couldn't find a much cleaner way to do this without a big refactoring of the executor -> driver communication. Screenshots: ![image](https://cloud.githubusercontent.com/assets/50748/4781793/7e7a0776-5cbf-11e4-874d-a91cd04620bd.png) ![image](https://cloud.githubusercontent.com/assets/50748/4781794/8bce76aa-5cbf-11e4-8d13-8477748c9f7e.png) ![image](https://cloud.githubusercontent.com/assets/50748/4781797/bd11a8b8-5cbf-11e4-9ad7-a7459467ec8e.png) Author: Josh Rosen <joshrosen@databricks.com> Closes #2944 from JoshRosen/jstack-in-web-ui and squashes the following commits: 3c21a5d [Josh Rosen] Address review comments: 880f7f7 [Josh Rosen] Merge remote-tracking branch 'origin/master' into jstack-in-web-ui f719266 [Josh Rosen] Merge remote-tracking branch 'origin/master' into jstack-in-web-ui 19707b0 [Josh Rosen] Add one comment. 127a130 [Josh Rosen] Update to use SparkContext.DRIVER_IDENTIFIER b8e69aa [Josh Rosen] Merge remote-tracking branch 'origin/master' into jstack-in-web-ui 3dfc2d4 [Josh Rosen] Add missing file. bc1e675 [Josh Rosen] Undo some leftover changes from the earlier approach. f4ac1c1 [Josh Rosen] Switch to on-demand collection of thread dumps dfec08b [Josh Rosen] Add option to disable thread dumps in UI. 4c87d7f [Josh Rosen] Use separate RPC for sending thread dumps. 2b8bdf3 [Josh Rosen] Enable thread dumps from the driver when running in non-local mode. cc3e6b3 [Josh Rosen] Fix test code in DAGSchedulerSuite. 87b8b65 [Josh Rosen] Add new listener event for thread dumps. 8c10216 [Josh Rosen] Add missing file. 0f198ac [Josh Rosen] [SPARK-611] Display executor thread dumps in web UI
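The dumps are gathered with Thread.getAllStackTraces(), as the description says; a minimal, UI-free sketch of that collection step (the RPC and rendering pieces are omitted):
```scala
import scala.collection.JavaConverters._

// Snapshot every live thread's stack; the actual patch ships this to the
// driver over an RPC and renders it in the web UI.
def threadDump(): Seq[String] = {
  Thread.getAllStackTraces.asScala.toSeq.map { case (thread, frames) =>
    s"${thread.getName} (${thread.getState})\n" +
      frames.map(frame => s"    at $frame").mkString("\n")
  }
}
```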
* [SPARK-4242] [Core] Add SASL to external shuffle serviceAaron Davidson2014-11-0515-23/+272
| | | | | | | | | | | | | | Does three things: (1) Adds SASL to ExternalShuffleClient, (2) puts SecurityManager in BlockManager's constructor, and (3) adds unit test. Author: Aaron Davidson <aaron@databricks.com> Closes #3108 from aarondav/sasl-client and squashes the following commits: 48b622d [Aaron Davidson] Screw it, let's just get LimitedInputStream 3543b70 [Aaron Davidson] Back out of pom change due to unknown test issue? b58518a [Aaron Davidson] ByteStreams.limit() not available :( cbe451a [Aaron Davidson] Address comments 2bf2908 [Aaron Davidson] [SPARK-4242] [Core] Add SASL to external shuffle service
* [SPARK-2938] Support SASL authentication in NettyBlockTransferServiceAaron Davidson2014-11-0537-392/+1257
| | | | | | | | | | | | | | | | | | | Also lays the groundwork for supporting it inside the external shuffle service. Author: Aaron Davidson <aaron@databricks.com> Closes #3087 from aarondav/sasl and squashes the following commits: 3481718 [Aaron Davidson] Delete rogue println 44f8410 [Aaron Davidson] Delete documentation - muahaha! eb9f065 [Aaron Davidson] Improve documentation and add end-to-end test at Spark-level a6b95f1 [Aaron Davidson] Address comments 785bbde [Aaron Davidson] Cleanup 79973cb [Aaron Davidson] Remove unused file 151b3c5 [Aaron Davidson] Add docs, timeout config, better failure handling f6177d7 [Aaron Davidson] Cleanup SASL state upon connection termination 7b42adb [Aaron Davidson] Add unit tests 8191bcb [Aaron Davidson] [SPARK-2938] Support SASL authentication in NettyBlockTransferService
* [SPARK-4197] [mllib] GradientBoosting API cleanup and examples in Scala, JavaJoseph K. Bradley2014-11-057-206/+462
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ### Summary * Made it easier to construct default Strategy and BoostingStrategy and to set parameters using simple types. * Added Scala and Java examples for GradientBoostedTrees * small cleanups and fixes ### Details GradientBoosting bug fixes (“bug” = bad default options) * Force boostingStrategy.weakLearnerParams.algo = Regression * Force boostingStrategy.weakLearnerParams.impurity = impurity.Variance * Only persist data if not yet persisted (since it causes an error if persisted twice) BoostingStrategy * numEstimators: renamed to numIterations * removed subsamplingRate (duplicated by Strategy) * removed categoricalFeaturesInfo since it belongs with the weak learner params (since boosting can be oblivious to feature type) * Changed algo to var (not val) and added BeanProperty, with overload taking String argument * Added assertValid() method * Updated defaultParams() method and eliminated defaultWeakLearnerParams() since that belongs in Strategy Strategy (for DecisionTree) * Changed algo to var (not val) and added BeanProperty, with overload taking String argument * Added setCategoricalFeaturesInfo method taking Java Map. * Cleaned up assertValid * Changed val’s to def’s since parameters can now be changed. CC: manishamde mengxr codedeft Author: Joseph K. Bradley <joseph@databricks.com> Closes #3094 from jkbradley/gbt-api and squashes the following commits: 7a27e22 [Joseph K. Bradley] scalastyle fix 52013d5 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into gbt-api e9b8410 [Joseph K. Bradley] Summary of changes (cherry picked from commit 5b3b6f6f5f029164d7749366506e142b104c1d43) Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-4029][Streaming] Update streaming driver to reliably save and recover ↵Tathagata Das2014-11-058-89/+597
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | received block metadata on driver failures As part of the initiative of preventing data loss on driver failure, this JIRA tracks the sub task of modifying the streaming driver to reliably save received block metadata, and recover them on driver restart. This was solved by introducing a `ReceivedBlockTracker` that takes all the responsibility of managing the metadata of received blocks (i.e. `ReceivedBlockInfo`, and any actions on them (e.g, allocating blocks to batches, etc.). All actions to block info get written out to a write ahead log (using `WriteAheadLogManager`). On recovery, all the actions are replaying to recreate the pre-failure state of the `ReceivedBlockTracker`, which include the batch-to-block allocations and the unallocated blocks. Furthermore, the `ReceiverInputDStream` was modified to create `WriteAheadLogBackedBlockRDD`s when file segment info is present in the `ReceivedBlockInfo`. After recovery of all the block info (through recovery `ReceivedBlockTracker`), the `WriteAheadLogBackedBlockRDD`s gets recreated with the recovered info, and jobs submitted. The data of the blocks gets pulled from the write ahead logs, thanks to the segment info present in the `ReceivedBlockInfo`. This is still a WIP. Things that are missing here are. - *End-to-end integration tests:* Unit tests that tests the driver recovery, by killing and restarting the streaming context, and verifying all the input data gets processed. This has been implemented but not included in this PR yet. A sneak peek of that DriverFailureSuite can be found in this PR (on my personal repo): https://github.com/tdas/spark/pull/25 I can either include it in this PR, or submit that as a separate PR after this gets in. - *WAL cleanup:* Cleaning up the received data write ahead log, by calling `ReceivedBlockHandler.cleanupOldBlocks`. This is being worked on. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #3026 from tdas/driver-ha-rbt and squashes the following commits: a8009ed [Tathagata Das] Added comment 1d704bb [Tathagata Das] Enabled storing recovered WAL-backed blocks to BM 2ee2484 [Tathagata Das] More minor changes based on PR 47fc1e3 [Tathagata Das] Addressed PR comments. 9a7e3e4 [Tathagata Das] Refactored ReceivedBlockTracker API a bit to make things a little cleaner for users of the tracker. af63655 [Tathagata Das] Minor changes. fce2b21 [Tathagata Das] Removed commented lines 59496d3 [Tathagata Das] Changed class names, made allocation more explicit and added cleanup 19aec7d [Tathagata Das] Fixed casting bug. f66d277 [Tathagata Das] Fix line lengths. cda62ee [Tathagata Das] Added license 25611d6 [Tathagata Das] Minor changes before submitting PR 7ae0a7fb [Tathagata Das] Transferred changes from driver-ha-working branch (cherry picked from commit 5f13759d3642ea5b58c12a756e7125ac19aff10e) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
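The recovery scheme above is a classic write-ahead-log replay. A very small, purely illustrative sketch of that pattern follows; none of these names belong to the real ReceivedBlockTracker API:
```scala
sealed trait TrackerEvent
case class BlockAdded(blockId: String) extends TrackerEvent
case class BlocksAllocated(batchTime: Long) extends TrackerEvent

// Every mutation is appended to the log before it is applied; on restart the
// tracker replays the logged events to rebuild its pre-failure state.
class Tracker(appendToLog: TrackerEvent => Unit) {
  private var unallocated = Vector.empty[String]

  def addBlock(id: String): Unit = handle(BlockAdded(id), writeToLog = true)
  def allocateToBatch(time: Long): Unit = handle(BlocksAllocated(time), writeToLog = true)
  def recover(loggedEvents: Seq[TrackerEvent]): Unit =
    loggedEvents.foreach(handle(_, writeToLog = false))

  private def handle(event: TrackerEvent, writeToLog: Boolean): Unit = {
    if (writeToLog) appendToLog(event)
    event match {
      case BlockAdded(id)     => unallocated :+= id
      case BlocksAllocated(_) => unallocated = Vector.empty
    }
  }
}
```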
* [SPARK-3964] [MLlib] [PySpark] add Hypothesis test Python APIDavies Liu2014-11-045-4/+219
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ``` pyspark.mllib.stat.StatisticschiSqTest(observed, expected=None) :: Experimental :: If `observed` is Vector, conduct Pearson's chi-squared goodness of fit test of the observed data against the expected distribution, or againt the uniform distribution (by default), with each category having an expected frequency of `1 / len(observed)`. (Note: `observed` cannot contain negative values) If `observed` is matrix, conduct Pearson's independence test on the input contingency matrix, which cannot contain negative entries or columns or rows that sum up to 0. If `observed` is an RDD of LabeledPoint, conduct Pearson's independence test for every feature against the label across the input RDD. For each feature, the (feature, label) pairs are converted into a contingency matrix for which the chi-squared statistic is computed. All label and feature values must be categorical. :param observed: it could be a vector containing the observed categorical counts/relative frequencies, or the contingency matrix (containing either counts or relative frequencies), or an RDD of LabeledPoint containing the labeled dataset with categorical features. Real-valued features will be treated as categorical for each distinct value. :param expected: Vector containing the expected categorical counts/relative frequencies. `expected` is rescaled if the `expected` sum differs from the `observed` sum. :return: ChiSquaredTest object containing the test statistic, degrees of freedom, p-value, the method used, and the null hypothesis. ``` Author: Davies Liu <davies@databricks.com> Closes #3091 from davies/his and squashes the following commits: 145d16c [Davies Liu] address comments 0ab0764 [Davies Liu] fix float 5097d54 [Davies Liu] add Hypothesis test Python API (cherry picked from commit c8abddc5164d8cf11cdede6ab3d5d1ea08028708) Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SQL] Add String option for DSL ASMichael Armbrust2014-11-041-1/+2
| | | | | | | | | | | Author: Michael Armbrust <michael@databricks.com> Closes #3097 from marmbrus/asString and squashes the following commits: 6430520 [Michael Armbrust] Add String option for DSL AS (cherry picked from commit 515abb9afa2d6b58947af6bb079a493b49d315ca) Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [Spark-4060] [MLlib] exposing special rdd functions to the publicNiklas Wilcke2014-11-044-11/+13
| | | | | | | | | | | Author: Niklas Wilcke <1wilcke@informatik.uni-hamburg.de> Closes #2907 from numbnut/master and squashes the following commits: 7f7c767 [Niklas Wilcke] [Spark-4060] [MLlib] exposing special rdd functions to the public, #2907 (cherry picked from commit f90ad5d426cb726079c490a9bb4b1100e2b4e602) Signed-off-by: Xiangrui Meng <meng@databricks.com>
* fixed MLlib Naive-Bayes java example bugDariusz Kobylarz2014-11-041-3/+3
| | | | | | | | | | | | | | The filter compared Double objects by reference, whereas it should compare their values. Author: Dariusz Kobylarz <darek.kobylarz@gmail.com> Closes #3081 from dkobylarz/master and squashes the following commits: 5d43a39 [Dariusz Kobylarz] naive bayes example update a304b93 [Dariusz Kobylarz] fixed MLlib Naive-Bayes java example bug (cherry picked from commit bcecd73fdd4d2ec209259cfd57d3ad1d63f028f2) Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-3886] [PySpark] simplify serializer, use AutoBatchedSerializer by ↵Davies Liu2014-11-0314-338/+201
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | default. This PR simplify serializer, always use batched serializer (AutoBatchedSerializer as default), even batch size is 1. Author: Davies Liu <davies@databricks.com> This patch had conflicts when merged, resolved by Committer: Josh Rosen <joshrosen@databricks.com> Closes #2920 from davies/fix_autobatch and squashes the following commits: e544ef9 [Davies Liu] revert unrelated change 6880b14 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch 1d557fc [Davies Liu] fix tests 8180907 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch 76abdce [Davies Liu] clean up 53fa60b [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch d7ac751 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch 2cc2497 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch b4292ce [Davies Liu] fix bug in master d79744c [Davies Liu] recover hive tests be37ece [Davies Liu] refactor eb3938d [Davies Liu] refactor serializer in scala 8d77ef2 [Davies Liu] simplify serializer, use AutoBatchedSerializer by default. (cherry picked from commit e4f42631a68b473ce706429915f3f08042af2119) Signed-off-by: Josh Rosen <joshrosen@databricks.com>
* [SPARK-3573][MLLIB] Make MLlib's Vector compatible with SQL's SchemaRDDXiangrui Meng2014-11-038-6/+353
| | | | | | | | | | | | | | | | | | Register MLlib's Vector as a SQL user-defined type (UDT) in both Scala and Python. With this PR, we can easily map a RDD[LabeledPoint] to a SchemaRDD, and then select columns or save to a Parquet file. Examples in Scala/Python are attached. The Scala code was copied from jkbradley. ~~This PR contains the changes from #3068 . I will rebase after #3068 is merged.~~ marmbrus jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #3070 from mengxr/SPARK-3573 and squashes the following commits: 3a0b6e5 [Xiangrui Meng] organize imports 236f0a0 [Xiangrui Meng] register vector as UDT and provide dataset examples (cherry picked from commit 1a9c6cddadebdc53d083ac3e0da276ce979b5d1f) Signed-off-by: Xiangrui Meng <meng@databricks.com>
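A hedged usage sketch of what this enables, assuming `sc` and an `sqlContext` of type SQLContext are in scope and following the 1.2-era SchemaRDD API:
```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// With Vector registered as a UDT, an RDD[LabeledPoint] maps straight onto a
// SchemaRDD and can be queried (or saved to Parquet) like any other table.
val points = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(0.1, 0.2)),
  LabeledPoint(0.0, Vectors.dense(0.3, 0.4))))

import sqlContext.createSchemaRDD   // implicit RDD[Product] -> SchemaRDD conversion
points.registerTempTable("points")
sqlContext.sql("SELECT label FROM points").collect()
```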
* [SPARK-4192][SQL] Internal API for Python UDTXiangrui Meng2014-11-037-5/+375
| | | | | | | | | | | | | | | | | | | | | | | | Following #2919, this PR adds Python UDT (for internal use only) with tests under "pyspark.tests". Before `SQLContext.applySchema`, we check whether we need to convert user-type instances into SQL recognizable data. In the current implementation, a Python UDT must be paired with a Scala UDT for serialization on the JVM side. A following PR will add VectorUDT in MLlib for both Scala and Python. marmbrus jkbradley davies Author: Xiangrui Meng <meng@databricks.com> Closes #3068 from mengxr/SPARK-4192-sql and squashes the following commits: acff637 [Xiangrui Meng] merge master dba5ea7 [Xiangrui Meng] only use pyClass for Python UDT output sqlType as well 2c9d7e4 [Xiangrui Meng] move import to global setup; update needsConversion 7c4a6a9 [Xiangrui Meng] address comments 75223db [Xiangrui Meng] minor update f740379 [Xiangrui Meng] remove UDT from default imports e98d9d0 [Xiangrui Meng] fix py style 4e84fce [Xiangrui Meng] remove local hive tests and add more tests 39f19e0 [Xiangrui Meng] add tests b7f666d [Xiangrui Meng] add Python UDT (cherry picked from commit 04450d11548cfb25d4fb77d4a33e3a7cd4254183) Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [FIX][MLLIB] fix seed in BaggedPointSuiteXiangrui Meng2014-11-031-5/+5
| | | | | | | | | | | | | | | Saw Jenkins test failures due to random seeds. jkbradley manishamde Author: Xiangrui Meng <meng@databricks.com> Closes #3084 from mengxr/fix-baggedpoint-suite and squashes the following commits: f735a43 [Xiangrui Meng] fix seed in BaggedPointSuite (cherry picked from commit c5912ecc7b392a13089ae735c07c2d7256de36c6) Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SQL] Convert arguments to Scala UDFsMichael Armbrust2014-11-032-262/+316
| | | | | | | | | | | | Author: Michael Armbrust <michael@databricks.com> Closes #3077 from marmbrus/udfsWithUdts and squashes the following commits: 34b5f27 [Michael Armbrust] style 504adef [Michael Armbrust] Convert arguments to Scala UDFs (cherry picked from commit 15b58a2234ab7ba30c9c0cbb536177a3c725e350) Signed-off-by: Michael Armbrust <michael@databricks.com>
* SPARK-4178. Hadoop input metrics ignore bytes read in RecordReader insta...Sandy Ryza2014-11-033-25/+53
| | | | | | | | | | | | | | ...ntiation Author: Sandy Ryza <sandy@cloudera.com> Closes #3045 from sryza/sandy-spark-4178 and squashes the following commits: 8d2e70e [Sandy Ryza] Kostas's review feedback e5b27c0 [Sandy Ryza] SPARK-4178. Hadoop input metrics ignore bytes read in RecordReader instantiation (cherry picked from commit 28128150e7e0c2b7d1c483e67214bdaef59f7d75) Signed-off-by: Patrick Wendell <pwendell@gmail.com>
* [SQL] More aggressive defaultsMichael Armbrust2014-11-034-14/+22
| | | | | | | | | | | | | | | | | | | | | | | - Turns on compression for in-memory cached data by default - Changes the default parquet compression format back to gzip (we have seen more OOMs with production workloads due to the way Snappy allocates memory) - Ups the batch size to 10,000 rows - Increases the broadcast threshold to 10mb. - Uses our parquet implementation instead of the hive one by default. - Cache parquet metadata by default. Author: Michael Armbrust <michael@databricks.com> Closes #3064 from marmbrus/fasterDefaults and squashes the following commits: 97ee9f8 [Michael Armbrust] parquet codec docs e641694 [Michael Armbrust] Remote also a12866a [Michael Armbrust] Cache metadata. 2d73acc [Michael Armbrust] Update docs defaults. d63d2d5 [Michael Armbrust] document parquet option da373f9 [Michael Armbrust] More aggressive defaults (cherry picked from commit 25bef7e6951301e93004567fc0cef96bf8d1a224) Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SPARK-4152] [SQL] Avoid data change in CTAS while table already existedCheng Hao2014-11-034-3/+46
| | | | | | | | | | | | | | | CREATE TABLE t1 (a String); CREATE TABLE t1 AS SELECT key FROM src; – should throw an exception. CREATE TABLE IF NOT EXISTS t1 AS SELECT key FROM src; – should do nothing; currently it overwrites t1, which is incorrect. Author: Cheng Hao <hao.cheng@intel.com> Closes #3013 from chenghao-intel/ctas_unittest and squashes the following commits: 194113e [Cheng Hao] fix bug in CTAS when table already existed (cherry picked from commit e83f13e8d37ca33f4e183e977d077221b90c6025) Signed-off-by: Michael Armbrust <michael@databricks.com>