Commit message | Author | Age | Files | Lines
* [Minor] [SQL] Cleans up DataFrame variable names and toDF() calls | Cheng Lian | 2015-02-17 | 37 files, -259/+250
    Although we've migrated to the DataFrame API, lots of code still uses `rdd` or
    `srdd` as local variable names. This PR tries to address these naming
    inconsistencies and some other minor DataFrame related style issues.

    Author: Cheng Lian <lian@databricks.com>
    Closes #4670 from liancheng/df-cleanup and squashes the following commits:
      3e14448 [Cheng Lian] Cleans up DataFrame variable names and toDF() calls
    (cherry picked from commit 61ab08549cb6fceb6de1b5c490c55a89d4bd28fa)
    Signed-off-by: Reynold Xin <rxin@databricks.com>
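    A minimal sketch of the convention this cleanup enforces (the names and the
    Person case class below are ours, not from the patch): DataFrame-typed values
    are bound to `df` rather than `rdd` or `srdd`, and toDF() converts an RDD of
    case classes.

        import org.apache.spark.sql.SQLContext

        case class Person(name: String, age: Int)

        def example(sqlContext: SQLContext): Unit = {
          import sqlContext.implicits._
          // Bind the converted result to `df`, not `rdd`.
          val df = sqlContext.sparkContext
            .parallelize(Seq(Person("Alice", 30), Person("Bob", 25)))
            .toDF()
          df.select("name").show()
        }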
* [SPARK-5731][Streaming][Test] Fix incorrect test in DirectKafkaStreamSuite | Tathagata Das | 2015-02-17 | 1 file, -12/+16
    The test was incorrect: instead of counting the number of records, it counted
    the number of partitions of the RDD generated by the DStream, which was not
    its intention. I will be testing this patch multiple times to understand its
    flakiness.

    PS: This was caused by my refactoring in
    https://github.com/apache/spark/pull/4384/
    koeninger check it out.

    Author: Tathagata Das <tathagata.das1565@gmail.com>
    Closes #4597 from tdas/kafka-flaky-test and squashes the following commits:
      d236235 [Tathagata Das] Unignored last test.
      e9a1820 [Tathagata Das] fix test
    (cherry picked from commit 3912d332464dcd124c60b734724c34d9742466a4)
    Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
* [SPARK-5723][SQL] Change the default file format to Parquet for CTAS statements. | Yin Huai | 2015-02-17 | 5 files, -25/+158
    JIRA: https://issues.apache.org/jira/browse/SPARK-5723

    Author: Yin Huai <yhuai@databricks.com>
    This patch had conflicts when merged, resolved by
    Committer: Michael Armbrust <michael@databricks.com>
    Closes #4639 from yhuai/defaultCTASFileFormat and squashes the following commits:
      a568137 [Yin Huai] Merge remote-tracking branch 'upstream/master' into defaultCTASFileFormat
      ad2b07d [Yin Huai] Update tests and error messages.
      8af5b2a [Yin Huai] Update conf key and unit test.
      5a67903 [Yin Huai] Use data source write path for Hive's CTAS statements when no storage format/handler is specified.
    (cherry picked from commit e50934f11e1e3ded21a631e5ab69db3c79467137)
    Signed-off-by: Michael Armbrust <michael@databricks.com>
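    A hedged sketch of the new behavior (the conf key below is our assumption
    based on the "Update conf key" squash entry, so verify it against the merged
    code): a CTAS statement with no explicit storage format or handler now goes
    through the data source write path and stores Parquet.

        import org.apache.spark.sql.hive.HiveContext

        def ctasExample(hive: HiveContext): Unit = {
          // Assumed conf key; toggles converting CTAS to the data source path.
          hive.setConf("spark.sql.hive.convertCTAS", "true")
          // No STORED AS clause: the table is written as Parquet rather than
          // Hive's default text format.
          hive.sql("CREATE TABLE ctas_test AS SELECT key, value FROM src")
        }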
* Preparing development version 1.3.1-SNAPSHOT | Patrick Wendell | 2015-02-18 | 28 files, -28/+28
* Preparing Spark release v1.3.0-rc1 | Patrick Wendell | 2015-02-18 | 28 files, -28/+28
* [SPARK-5875][SQL] logical.Project should not be resolved if it contains aggregates or generators | Yin Huai | 2015-02-17 | 3 files, -2/+53
    https://issues.apache.org/jira/browse/SPARK-5875 has a case to reproduce the
    bug and explain the root cause.

    Author: Yin Huai <yhuai@databricks.com>
    Closes #4663 from yhuai/projectResolved and squashes the following commits:
      472f7b6 [Yin Huai] If a logical.Project has any AggregateExpression or Generator, its resolved field should be false.
    (cherry picked from commit d5f12bfe8f0a98d6fee114bb24376668ebe2898e)
    Signed-off-by: Michael Armbrust <michael@databricks.com>
* Revert "Preparing Spark release v1.3.0-snapshot1"Patrick Wendell2015-02-1728-28/+28
| | | | This reverts commit d97bfc6f28ec4b7acfb36410c7c167d8d3c145ec.
* Revert "Preparing development version 1.3.1-SNAPSHOT"Patrick Wendell2015-02-1728-28/+28
| | | | This reverts commit e57c81b8c1a6581c2588973eaf30d3c7ae90ed0c.
* [SPARK-4454] Revert getOrElse() cleanup in DAGScheduler.getCacheLocs() | Josh Rosen | 2015-02-17 | 1 file, -3/+5
    This method is performance-sensitive and this change wasn't necessary.
* [SPARK-4454] Properly synchronize accesses to DAGScheduler cacheLocs map | Josh Rosen | 2015-02-17 | 1 file, -10/+24
    This patch addresses a race condition in DAGScheduler by properly
    synchronizing accesses to its `cacheLocs` map. This map is accessed by the
    `getCacheLocs` and `clearCacheLocs` methods, which can be called by separate
    threads, since DAGScheduler's `getPreferredLocs` method is called by
    SparkContext and indirectly calls `getCacheLocs`. If this map is cleared by
    the DAGScheduler event processing thread while a user thread is submitting a
    job and computing preferred locations, then this can cause the user thread to
    throw "NoSuchElementException: key not found" errors.

    Most accesses to DAGScheduler's internal state do not need synchronization
    because that state is only accessed from the event processing loop's thread.
    An alternative approach to fixing this bug would be to refactor this code so
    that SparkContext sends the DAGScheduler a message in order to get the list
    of preferred locations. However, this would involve more extensive changes to
    this code and would be significantly harder to backport to maintenance
    branches, since some of the related code has undergone significant
    refactoring (e.g. the introduction of EventLoop). Since `cacheLocs` is the
    only state that's accessed in this way, adding simple synchronization seems
    like a better short-term fix.

    See #3345 for additional context.

    Author: Josh Rosen <joshrosen@databricks.com>
    Closes #4660 from JoshRosen/SPARK-4454 and squashes the following commits:
      12d64ba [Josh Rosen] Properly synchronize accesses to DAGScheduler cacheLocs map.
    (cherry picked from commit d46d6246d225ff3af09ebae1a09d4de2430c502d)
    Signed-off-by: Patrick Wendell <patrick@databricks.com>
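    A simplified sketch of the locking pattern (names mirror the description
    above; this is not the DAGScheduler source): every read and clear of the
    shared map takes the same lock, so the event-processing thread can no longer
    clear it out from under a user thread.

        import scala.collection.mutable

        class CacheLocTracker {
          private val cacheLocs = new mutable.HashMap[Int, Seq[String]]

          def getCacheLocs(rddId: Int): Seq[String] = cacheLocs.synchronized {
            // getOrElseUpdate is now safe: no concurrent clear() can race it.
            cacheLocs.getOrElseUpdate(rddId, Seq.empty)
          }

          def clearCacheLocs(): Unit = cacheLocs.synchronized {
            cacheLocs.clear()
          }
        }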
* [SPARK-5811] Added documentation for maven coordinates and added Spark Packages support | Burak Yavuz | 2015-02-17 | 5 files, -27/+131
    Documentation for maven coordinates + Spark Packages support. Added pyspark
    tests for `--packages`.

    Author: Burak Yavuz <brkyvz@gmail.com>
    Author: Davies Liu <davies@databricks.com>
    Closes #4662 from brkyvz/SPARK-5811 and squashes the following commits:
      56ccccd [Burak Yavuz] fixed broken test
      64cb8ee [Burak Yavuz] passed pep8 on local
      c07b81e [Burak Yavuz] fixed pep8
      a8bd6b7 [Burak Yavuz] submit PR
      4ef4046 [Burak Yavuz] ready for PR
      8fb02e5 [Burak Yavuz] merged master
      25c9b9f [Burak Yavuz] Merge branch 'master' of github.com:apache/spark into python-jar
      560d13b [Burak Yavuz] before PR
      17d3f76 [Davies Liu] support .jar as python package
      a3eb717 [Burak Yavuz] Merge branch 'master' of github.com:apache/spark into SPARK-5811
      c60156d [Burak Yavuz] [SPARK-5811] Added documentation for maven coordinates
    (cherry picked from commit ae6cfb3acdbc2721d25793698a4a440f0519dbec)
    Signed-off-by: Patrick Wendell <patrick@databricks.com>
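    As a usage note: with this change, dependencies can be pulled in by maven
    coordinate at submit time, e.g. `spark-submit --packages
    groupId:artifactId:version my_app.py` (the coordinate here is a placeholder;
    see the documentation added by this PR for the exact flag semantics).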
* [SPARK-5785] [PySpark] narrow dependency for cogroup/join in PySpark | Davies Liu | 2015-02-17 | 7 files, -25/+101
    Currently, PySpark does not support narrow dependencies during cogroup/join:
    even when the two RDDs have the same partitioner, an unnecessary shuffle
    stage comes in. The Python implementation of cogroup/join is different from
    the Scala one; it depends on union() and partitionBy().

    This patch tries to use PartitionerAwareUnionRDD() in union() when all the
    RDDs have the same partitioner. It also fixes `reservePartitioner` in all the
    map() or mapPartitions() calls, so that partitionBy() can skip the
    unnecessary shuffle stage.

    Author: Davies Liu <davies@databricks.com>
    Closes #4629 from davies/narrow and squashes the following commits:
      dffe34e [Davies Liu] improve test, check number of stages for join/cogroup
      1ed3ba2 [Davies Liu] Merge branch 'master' of github.com:apache/spark into narrow
      4d29932 [Davies Liu] address comment
      cc28d97 [Davies Liu] add unit tests
      940245e [Davies Liu] address comments
      ff5a0a6 [Davies Liu] skip the partitionBy() on Python side
      eb26c62 [Davies Liu] narrow dependency in PySpark
    (cherry picked from commit c3d2b90bde2e11823909605d518167548df66bd8)
    Signed-off-by: Josh Rosen <joshrosen@databricks.com>
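    For reference, the Scala-side behavior this patch brings to PySpark, as a
    small self-contained example: two RDDs that share a partitioner join with a
    narrow dependency, so no extra shuffle stage is scheduled.

        import org.apache.spark.{HashPartitioner, SparkContext}

        def narrowJoin(sc: SparkContext): Unit = {
          val part = new HashPartitioner(4)
          val a = sc.parallelize(1 to 100).map(x => (x, x)).partitionBy(part)
          val b = sc.parallelize(1 to 100).map(x => (x, x * 2)).partitionBy(part)
          val joined = a.join(b) // co-partitioned inputs: narrow dependency
          assert(joined.partitioner == Some(part))
        }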
* [SPARK-5852][SQL] Fail to convert a newly created empty metastore parquet table to a data source parquet table. | Yin Huai | 2015-02-17 | 3 files, -6/+164
    The problem is that after we create an empty Hive metastore parquet table
    (e.g. `CREATE TABLE test (a int) STORED AS PARQUET`), Hive will create an
    empty dir for us, which causes our data source `ParquetRelation2` to fail to
    get the schema of the table. See the JIRA for the case to reproduce the bug
    and the exception.

    This PR is based on #4562 from chenghao-intel.

    JIRA: https://issues.apache.org/jira/browse/SPARK-5852

    Author: Yin Huai <yhuai@databricks.com>
    Author: Cheng Hao <hao.cheng@intel.com>
    Closes #4655 from yhuai/CTASParquet and squashes the following commits:
      b8b3450 [Yin Huai] Update tests.
      2ac94f7 [Yin Huai] Update tests.
      3db3d20 [Yin Huai] Minor update.
      d7e2308 [Yin Huai] Revert changes in HiveMetastoreCatalog.scala.
      36978d1 [Cheng Hao] Update the code as feedback
      a04930b [Cheng Hao] fix bug of scan an empty parquet based table
      442ffe0 [Cheng Hao] passdown the schema for Parquet File in HiveContext
    (cherry picked from commit 117121a4ecaadda156a82255333670775e7727db)
    Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SPARK-5872] [SQL] create a sqlCtx in pyspark shell | Davies Liu | 2015-02-17 | 2 files, -3/+22
    The sqlCtx will be a HiveContext if Hive is built into the assembly jar, or a
    SQLContext if not. It also skips the Hive tests in pyspark.sql.tests if Hive
    is not available.

    Author: Davies Liu <davies@databricks.com>
    Closes #4659 from davies/sqlctx and squashes the following commits:
      0e6629a [Davies Liu] sqlCtx in pyspark
    (cherry picked from commit 4d4cc760fa9687ce563320094557ef9144488676)
    Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SPARK-5871] output explain in Python | Davies Liu | 2015-02-17 | 1 file, -3/+20
    Author: Davies Liu <davies@databricks.com>
    Closes #4658 from davies/explain and squashes the following commits:
      db87ea2 [Davies Liu] output explain in Python
    (cherry picked from commit 3df85dccbc8fd1ba19bbcdb8d359c073b1494d98)
    Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SPARK-4172] [PySpark] Progress API in Python | Davies Liu | 2015-02-17 | 6 files, -24/+232
    This patch brings the pull-based progress API into Python, along with an
    example in Python.

    Author: Davies Liu <davies@databricks.com>
    Closes #3027 from davies/progress_api and squashes the following commits:
      b1ba984 [Davies Liu] fix style
      d3b9253 [Davies Liu] add tests, mute the exception after stop
      4297327 [Davies Liu] Merge branch 'master' of github.com:apache/spark into progress_api
      969fa9d [Davies Liu] Merge branch 'master' of github.com:apache/spark into progress_api
      25590c9 [Davies Liu] update with Java API
      360de2d [Davies Liu] Merge branch 'master' of github.com:apache/spark into progress_api
      c0f1021 [Davies Liu] Merge branch 'master' of github.com:apache/spark into progress_api
      023afb3 [Davies Liu] add Python API and example for progress API
    (cherry picked from commit 445a755b884885b88c1778fd56a3151045b0b0ed)
    Signed-off-by: Josh Rosen <joshrosen@databricks.com>
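    For orientation, the JVM-side status tracker that the new Python API mirrors,
    shown in Scala (the Python method names may differ slightly):

        import org.apache.spark.SparkContext

        def printProgress(sc: SparkContext): Unit = {
          val tracker = sc.statusTracker
          for (stageId <- tracker.getActiveStageIds;
               info <- tracker.getStageInfo(stageId)) {
            // Pull-based: we poll for a snapshot instead of receiving events.
            println(s"stage $stageId: ${info.numCompletedTasks}/${info.numTasks} tasks done")
          }
        }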
* [SPARK-5868][SQL] Fix python UDFs in HiveContext and checks in SQLContext | Michael Armbrust | 2015-02-17 | 2 files, -1/+5
    Author: Michael Armbrust <michael@databricks.com>
    Closes #4657 from marmbrus/pythonUdfs and squashes the following commits:
      a7823a8 [Michael Armbrust] [SPARK-5868][SQL] Fix python UDFs in HiveContext and checks in SQLContext
    (cherry picked from commit de4836f8f12c36c1b350cef288a75b5e59155735)
    Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SQL] [Minor] Update the HiveContext Unittest | Cheng Hao | 2015-02-17 | 7 files, -0/+17
    In the unit test, the table src(key INT, value STRING) is not the same as
    Hive's src(key STRING, value STRING):
    https://github.com/apache/hive/blob/branch-0.13/data/scripts/q_test_init.sql

    In reflect.q, the test failed for the expression
    `reflect("java.lang.Integer", "valueOf", key, 16)`, which expects the
    argument `key` to be a STRING, not an INT.

    This PR doesn't aim to change the `src` schema; we can do that after 1.3 is
    released. However, we would probably need to re-generate all the golden
    files.

    Author: Cheng Hao <hao.cheng@intel.com>
    Closes #4584 from chenghao-intel/reflect and squashes the following commits:
      e5bdc3a [Cheng Hao] Move the test case reflect into blacklist
      184abfd [Cheng Hao] revert the change to table src1
      d9bcf92 [Cheng Hao] Update the HiveContext Unittest
    (cherry picked from commit 9d281fa56022800dc008a3de233fec44379a2bd7)
    Signed-off-by: Michael Armbrust <michael@databricks.com>
* [Minor][SQL] Use same function to check path parameter in JSONRelation | Liang-Chi Hsieh | 2015-02-17 | 2 files, -3/+3
    Author: Liang-Chi Hsieh <viirya@gmail.com>
    Closes #4649 from viirya/use_checkpath and squashes the following commits:
      0f9a1a1 [Liang-Chi Hsieh] Use same function to check path parameter.
    (cherry picked from commit ac506b7c2846f656e03839bbd0e93827c7cc613e)
    Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SPARK-5862][SQL] Only transformUp the given plan once in HiveMetastoreCatalog | Liang-Chi Hsieh | 2015-02-17 | 1 file, -17/+20
    The current `ParquetConversions` rule in `HiveMetastoreCatalog` will
    transformUp the given plan multiple times if there are many metastore Parquet
    tables. Since the transformUp operation is recursive, it is better to perform
    it only once.

    Author: Liang-Chi Hsieh <viirya@gmail.com>
    Closes #4651 from viirya/parquet_atonce and squashes the following commits:
      c1ed29d [Liang-Chi Hsieh] Fix bug.
      e0f919b [Liang-Chi Hsieh] Only transformUp the given plan once.
    (cherry picked from commit 4611de1cef7363bc71ec608560dfd866ae477747)
    Signed-off-by: Michael Armbrust <michael@databricks.com>
* [Minor] fix typo in SQL document | CodingCat | 2015-02-17 | 1 file, -1/+1
    Author: CodingCat <zhunansjtu@gmail.com>
    Closes #4656 from CodingCat/fix_typo and squashes the following commits:
      b41d15c [CodingCat] recover
      689fe46 [CodingCat] fix typo
    (cherry picked from commit 31efb39c1deb253032b38e8fbafde4b2b1dde1f6)
    Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SPARK-5864] [PySpark] support .jar as python package | Davies Liu | 2015-02-17 | 1 file, -2/+4
    A jar file containing Python sources can be used as a Python package, just
    like a zip file. spark-submit already puts the jar file on the PYTHONPATH;
    this patch also puts it on sys.path, so that it can be used in the Python
    worker.

    Author: Davies Liu <davies@databricks.com>
    Closes #4652 from davies/jar and squashes the following commits:
      17d3f76 [Davies Liu] support .jar as python package
    (cherry picked from commit fc4eb9505adda192eb38cb4454d532027690bfa3)
    Signed-off-by: Patrick Wendell <patrick@databricks.com>
* SPARK-5841 [CORE] [HOTFIX] Memory leak in DiskBlockManager | Sean Owen | 2015-02-17 | 1 file, -1/+4
    Avoid the call to remove a shutdown hook being made from within the shutdown
    hook itself.

    CC pwendell JoshRosen MattWhelan

    Author: Sean Owen <sowen@cloudera.com>
    Closes #4648 from srowen/SPARK-5841.2 and squashes the following commits:
      51548db [Sean Owen] Avoid call to remove shutdown hook being called from shutdown hook
    (cherry picked from commit 49c19fdbad57f0609bbcc9278f9eaa8115a73604)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
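    The hazard in miniature, as a generic sketch (not DiskBlockManager's code):
    Runtime.removeShutdownHook throws IllegalStateException once shutdown has
    begun, so the removal must be skipped or guarded when stop() runs inside the
    hook itself.

        class TempDirManager {
          @volatile private var inShutdown = false

          private val hook = new Thread {
            override def run(): Unit = { inShutdown = true; cleanup() }
          }
          Runtime.getRuntime.addShutdownHook(hook)

          def stop(): Unit = {
            cleanup()
            if (!inShutdown) {
              try Runtime.getRuntime.removeShutdownHook(hook)
              catch { case _: IllegalStateException => () } // shutdown already started
            }
          }

          private def cleanup(): Unit = () // delete temp dirs, release resources
        }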
* [SPARK-5661] function hasShutdownDeleteTachyonDir should use shutdownDeleteTachyonPaths to determine whether it contains the file | xukun 00228947 | 2015-02-17 | 1 file, -2/+2
    hasShutdownDeleteTachyonDir(file: TachyonFile) should use
    shutdownDeleteTachyonPaths (not shutdownDeletePaths) to determine whether it
    contains the file. To fix this, delete the two unused functions.

    Author: xukun 00228947 <xukun.xu@huawei.com>
    Author: viper-kun <xukun.xu@huawei.com>
    Closes #4418 from viper-kun/deleteunusedfun and squashes the following commits:
      87340eb [viper-kun] fix style
      3d6c69e [xukun 00228947] fix bug
      2bc397e [xukun 00228947] deleteunusedfun
    (cherry picked from commit b271c265b742fa6947522eda4592e9e6a7fd1f3a)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
* [SPARK-5778] throw if nonexistent metrics config file provided | Ryan Williams | 2015-02-17 | 5 files, -19/+23
    The previous behavior was to log an error. This is fine in the general case,
    where no `spark.metrics.conf` parameter is specified: a default
    `metrics.properties` is looked for, and the exception is logged and
    suppressed if it doesn't exist. If the user has purposefully specified a
    metrics config file, however, it makes more sense to show them an error when
    said file doesn't exist.

    Author: Ryan Williams <ryan.blake.williams@gmail.com>
    Closes #4571 from ryan-williams/metrics and squashes the following commits:
      5bccb14 [Ryan Williams] private-ize some MetricsConfig members
      08ff998 [Ryan Williams] rename METRICS_CONF: DEFAULT_METRICS_CONF_FILENAME
      f4d7fab [Ryan Williams] fix tests
      ad24b0e [Ryan Williams] add "metrics.properties" to .rat-excludes
      94e810b [Ryan Williams] throw if nonexistent Sink class is specified
      31d2c30 [Ryan Williams] metrics code review feedback
      56287db [Ryan Williams] throw if nonexistent metrics config file provided
    (cherry picked from commit d8f69cf78862d13a48392a0b94388b8d403523da)
    Signed-off-by: Patrick Wendell <patrick@databricks.com>
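    The intended behavior, distilled (illustrative names, not MetricsConfig's
    actual members): only the implicit default file may be missing silently.

        import java.io.{File, FileNotFoundException}

        def checkMetricsConfig(userPath: Option[String]): Unit = userPath match {
          case Some(path) if !new File(path).isFile =>
            // User explicitly pointed at a file: fail loudly.
            throw new FileNotFoundException(s"Metrics config file '$path' does not exist")
          case Some(path) => () // parse the user-specified properties file
          case None       => () // fall back to metrics.properties if present
        }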
* [SPARK-5859] [PySpark] [SQL] fix DataFrame Python API | Davies Liu | 2015-02-17 | 2 files, -18/+59
    1. added explain()
    2. add isLocal()
    3. do not call show() in __repr__
    4. add foreach() and foreachPartition()
    5. add distinct()
    6. fix functions.col()/column()/lit()
    7. fix unit tests in sql/functions.py
    8. fix unicode in showString()

    Author: Davies Liu <davies@databricks.com>
    Closes #4645 from davies/df6 and squashes the following commits:
      6b46a2c [Davies Liu] fix DataFrame Python API
    (cherry picked from commit d8adefefcc2a4af32295440ed1d4917a6968f017)
    Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SPARK-5166][SPARK-5247][SPARK-5258][SQL] API Cleanup / Documentation | Michael Armbrust | 2015-02-17 | 30 files, -405/+483
    Author: Michael Armbrust <michael@databricks.com>
    Closes #4642 from marmbrus/docs and squashes the following commits:
      d291c34 [Michael Armbrust] python tests
      9be66e3 [Michael Armbrust] comments
      d56afc2 [Michael Armbrust] fix style
      f004747 [Michael Armbrust] fix build
      c4a907b [Michael Armbrust] fix tests
      42e2b73 [Michael Armbrust] [SQL] Documentation / API Clean-up.
    (cherry picked from commit c74b07fa94a8da50437d952ae05cf6ac70fbb93e)
    Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SPARK-5858][MLLIB] Remove unnecessary first() call in GLM | Xiangrui Meng | 2015-02-17 | 2 files, -4/+9
    `numFeatures` is only used by multinomial logistic regression. Calling
    `.first()` for every GLM causes a performance regression, especially in
    Python.

    Author: Xiangrui Meng <meng@databricks.com>
    Closes #4647 from mengxr/SPARK-5858 and squashes the following commits:
      036dc7f [Xiangrui Meng] remove unnecessary first() call
      12c5548 [Xiangrui Meng] check numFeatures only once
    (cherry picked from commit c76da36c2163276b5c34e59fbb139eeb34ed0faa)
    Signed-off-by: Xiangrui Meng <meng@databricks.com>
* SPARK-5856: In Maven build script, launch Zinc with more memory | Patrick Wendell | 2015-02-17 | 1 file, -1/+4
    I've seen out-of-memory exceptions when trying to run many parallel builds
    against the same Zinc server during packaging. We should use the same
    increased memory settings we use for Maven itself.

    I tested this and confirmed that the Nailgun JVM launched with higher memory.

    Author: Patrick Wendell <patrick@databricks.com>
    Closes #4643 from pwendell/zinc-memory and squashes the following commits:
      717cfb0 [Patrick Wendell] SPARK-5856: Launch Zinc with larger memory options.
    (cherry picked from commit 3ce46e94fe77d15f18e916b76b37fa96356ace93)
    Signed-off-by: Patrick Wendell <patrick@databricks.com>
* Revert "[SPARK-5363] [PySpark] check ending mark in non-block way"Josh Rosen2015-02-172-18/+4
| | | | This reverts commits ac6fe67e1d8bf01ee565f9cc09ad48d88a275829 and c06e42f2c1e5fcf123b466efd27ee4cb53bbed3f.
* [SPARK-5826][Streaming] Fix Configuration not serializable problem | jerryshao | 2015-02-17 | 1 file, -2/+4
    Author: jerryshao <saisai.shao@intel.com>
    Closes #4612 from jerryshao/SPARK-5826 and squashes the following commits:
      7ec71db [jerryshao] Remove transient for conf statement
      88d84e6 [jerryshao] Fix Configuration not serializable problem
    (cherry picked from commit a65766bf0244a41b793b9dc5fbdd2882664ad00e)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
* HOTFIX: Style issue causing build break | Patrick Wendell | 2015-02-16 | 1 file, -2/+2
    Caused by #4601.
* [SPARK-5802][MLLIB] cache transformed data in glm | Xiangrui Meng | 2015-02-16 | 1 file, -14/+15
    If we need to transform the input data, we should cache the output to avoid
    re-computing feature vectors every iteration.

    CC dbtsai

    Author: Xiangrui Meng <meng@databricks.com>
    Closes #4593 from mengxr/SPARK-5802 and squashes the following commits:
      ae3be84 [Xiangrui Meng] cache transformed data in glm
    (cherry picked from commit fd84229e2aeb6a03760703c9dccd2db853779400)
    Signed-off-by: Xiangrui Meng <meng@databricks.com>
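    The pattern in a minimal form (a sketch, not the actual GLM internals):
    transform once, cache, and let every subsequent gradient pass reuse the
    cached vectors.

        import org.apache.spark.mllib.feature.StandardScaler
        import org.apache.spark.mllib.linalg.Vector
        import org.apache.spark.mllib.regression.LabeledPoint
        import org.apache.spark.rdd.RDD

        def prepare(input: RDD[LabeledPoint]): RDD[(Double, Vector)] = {
          val scaler = new StandardScaler(withMean = false, withStd = true)
            .fit(input.map(_.features))
          input
            .map(p => (p.label, scaler.transform(p.features)))
            .cache() // without this, scaling reruns on every iteration
        }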
* [SPARK-5853][SQL] Schema support in Row. | Reynold Xin | 2015-02-16 | 4 files, -5/+20
    Author: Reynold Xin <rxin@databricks.com>
    Closes #4640 from rxin/SPARK-5853 and squashes the following commits:
      9c6f569 [Reynold Xin] [SPARK-5853][SQL] Schema support in Row.
    (cherry picked from commit d380f324c6d38ffacfda83a525a1a7e23347e5b8)
    Signed-off-by: Reynold Xin <rxin@databricks.com>
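    A small usage sketch, assuming the accessor is `row.schema`: a Row can now
    report its own schema, so collected rows no longer need the parent
    DataFrame's schema threaded alongside them.

        import org.apache.spark.sql.{DataFrame, Row}
        import org.apache.spark.sql.types.StructType

        def inspectFirstRow(df: DataFrame): Unit = {
          val row: Row = df.first()
          val schema: StructType = row.schema // schema travels with the row
          println(schema.fieldNames.mkString(", "))
        }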
* SPARK-5850: Remove experimental label for Scala 2.11 and FlumePollingStream | Patrick Wendell | 2015-02-16 | 3 files, -12/+4
    Author: Patrick Wendell <patrick@databricks.com>
    Closes #4638 from pwendell/SPARK-5850 and squashes the following commits:
      386126f [Patrick Wendell] SPARK-5850: Remove experimental label for Scala 2.11 and FlumePollingStream.
* [SPARK-5363] [PySpark] check ending mark in non-block way | Davies Liu | 2015-02-16 | 2 files, -4/+18
    There is a chance of deadlock when the Python process is waiting for an
    ending mark from the JVM that has been eaten by a corrupted stream. This PR
    checks the ending mark from Python in a non-blocking way, so the JVM side
    will not be blocked by the Python process.

    There is a small chance that the ending mark has been sent by the Python
    process but is not available yet; in that case the Python worker will not be
    reused.

    cc JoshRosen pwendell

    Author: Davies Liu <davies@databricks.com>
    Closes #4601 from davies/freeze and squashes the following commits:
      e15a8c3 [Davies Liu] update logging
      890329c [Davies Liu] Merge branch 'freeze' of github.com:davies/spark into freeze
      2bd2228 [Davies Liu] add more logging
      656d544 [Davies Liu] Update PythonRDD.scala
      05e1085 [Davies Liu] check ending mark in non-block way
    (cherry picked from commit ac6fe67e1d8bf01ee565f9cc09ad48d88a275829)
    Signed-off-by: Josh Rosen <joshrosen@databricks.com>
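    The non-blocking check, as an illustrative sketch rather than PythonRDD's
    actual code: only read the end-of-stream marker if bytes are already
    buffered, so a wedged stream cannot hang the reader.

        import java.io.InputStream

        def endMarkReceived(in: InputStream, expectedMark: Int): Boolean = {
          // available() reports buffered bytes without blocking.
          if (in.available() > 0) in.read() == expectedMark
          else false // marker not there yet; the caller decides whether to retry
        }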
* [SQL] Various DataFrame doc changes. | Reynold Xin | 2015-02-16 | 9 files, -87/+436
    Added a bunch of tags. Also changed parquetFile to take varargs rather than a
    string followed by varargs.

    Author: Reynold Xin <rxin@databricks.com>
    Closes #4636 from rxin/df-doc and squashes the following commits:
      651f80c [Reynold Xin] Fixed parquetFile in PySpark.
      8dc3024 [Reynold Xin] [SQL] Various DataFrame doc changes.
    (cherry picked from commit 0e180bfc3c7f18780d4fc4f42681609832418e43)
    Signed-off-by: Reynold Xin <rxin@databricks.com>
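    With parquetFile as a plain varargs method, multiple paths read naturally; a
    small sketch (the paths are illustrative):

        import org.apache.spark.sql.SQLContext

        def readParts(sqlContext: SQLContext): Unit = {
          // One call, any number of paths.
          val df = sqlContext.parquetFile("/data/events/part1", "/data/events/part2")
          println(df.count())
        }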
* [SPARK-5849] Handle more types of invalid JSON requests in SubmitRestProtocolMessage.parseAction | Josh Rosen | 2015-02-17 | 2 files, -8/+12
    This patch improves SubmitRestProtocol's handling of invalid JSON requests in
    cases where those requests were parsable as JSON but not as JSON objects
    (e.g. they could be parsed as arrays or strings). I replaced an unchecked
    cast with pattern matching and added a new test case.

    Author: Josh Rosen <joshrosen@databricks.com>
    Closes #4637 from JoshRosen/rest-protocol-cast and squashes the following commits:
      b3f282b [Josh Rosen] [SPARK-5849] Handle more types of invalid JSON in SubmitRestProtocolMessage.parseAction
    (cherry picked from commit 58a82a7882d7a8a7e4064278c4bf28607d9a42ba)
    Signed-off-by: Andrew Or <andrew@databricks.com>
* [SPARK-3340] Deprecate ADD_JARS and ADD_FILES | azagrebin | 2015-02-16 | 3 files, -6/+12
    I created a patch that disables the environment variables. The Scala and
    Python shells now log a warning message to notify the user about the
    deprecation:

    scala: "ADD_JARS environment variable is deprecated, use --jar spark submit argument instead"
    python: "Warning: ADD_FILES environment variable is deprecated, use --py-files argument instead"

    Is this what is expected, or should the code associated with the variables be
    removed completely? Should this be documented somewhere?

    Author: azagrebin <azagrebin@gmail.com>
    Closes #4616 from azagrebin/master and squashes the following commits:
      bab1aa9 [azagrebin] [SPARK-3340] Deprecate ADD_JARS and ADD_FILES: minor readability issue
      0643895 [azagrebin] [SPARK-3340] Deprecate ADD_JARS and ADD_FILES: add warning messages
      42f0107 [azagrebin] [SPARK-3340] Deprecate ADD_JARS and ADD_FILES
    (cherry picked from commit 16687651f05bde8ff2e2fcef100383168958bf7f)
    Signed-off-by: Andrew Or <andrew@databricks.com>
* [SPARK-5788] [PySpark] capture the exception in python write thread | Davies Liu | 2015-02-16 | 1 file, -2/+2
    An uncaught exception in the Python writer thread would shut down the
    executor.

    Author: Davies Liu <davies@databricks.com>
    Closes #4577 from davies/exception and squashes the following commits:
      eb0ceff [Davies Liu] Update PythonRDD.scala
      139b0db [Davies Liu] capture the exception in python write thread
    (cherry picked from commit b1bd1dd3228ef50fa7310d466afd834b8cb1f22e)
    Signed-off-by: Josh Rosen <joshrosen@databricks.com>
* SPARK-5848: tear down the ConsoleProgressBar timer | Matt Whelan | 2015-02-17 | 2 files, -1/+7
    The timer is a GC root, and failing to terminate it leaks SparkContext
    instances.

    Author: Matt Whelan <mwhelan@perka.com>
    Closes #4635 from MattWhelan/SPARK-5848 and squashes the following commits:
      2a1e8a5 [Matt Whelan] SPARK-5848: teardown the ConsoleProgressBar timer
    (cherry picked from commit 1294a6e01af0d4f6678ea8cb5d47dc97112608b5)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
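    The leak pattern in miniature (a generic sketch, not ConsoleProgressBar's
    code): the live Timer thread is a GC root that keeps its owner reachable, so
    stop() must cancel it.

        import java.util.{Timer, TimerTask}

        class ProgressReporter {
          private val timer = new Timer("progress reporter")
          timer.schedule(new TimerTask {
            override def run(): Unit = println("refresh progress")
          }, 0L, 200L)

          def stop(): Unit = timer.cancel() // kills the timer thread, releasing the root
        }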
* [SPARK-4865][SQL] Include temporary tables in SHOW TABLES | Yin Huai | 2015-02-16 | 9 files, -50/+111
    This PR adds a `ShowTablesCommand` to support the `SHOW TABLES [IN
    databaseName]` SQL command. The result of `SHOW TABLES` has two columns,
    `tableName` and `isTemporary`. For temporary tables, the value of the
    `isTemporary` column will be `true`.

    JIRA: https://issues.apache.org/jira/browse/SPARK-4865

    Author: Yin Huai <yhuai@databricks.com>
    Closes #4618 from yhuai/showTablesCommand and squashes the following commits:
      0c09791 [Yin Huai] Use ShowTablesCommand.
      85ee76d [Yin Huai] Since SHOW TABLES is not a Hive native command any more and we will not see "OK" (originally generated by Hive's driver), use SHOW DATABASES in the test.
      94bacac [Yin Huai] Add SHOW TABLES to the list of noExplainCommands.
      d71ed09 [Yin Huai] Fix test.
      a4a6ec3 [Yin Huai] Add SHOW TABLE command.
    (cherry picked from commit e189cbb052d59eb499dd4312403925fdd72f5718)
    Signed-off-by: Michael Armbrust <michael@databricks.com>
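    Usage, as described above; the result comes back as an ordinary DataFrame
    with `tableName` and `isTemporary` columns:

        import org.apache.spark.sql.SQLContext

        def listTables(sqlContext: SQLContext): Unit = {
          sqlContext.sql("SHOW TABLES").show()          // current database
          sqlContext.sql("SHOW TABLES IN mydb").show()  // explicit database
        }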
* [SQL] Optimize arithmetic and predicate operators | kai | 2015-02-16 | 10 files, -260/+290
    The existing implementations of arithmetic operators and BinaryComparison
    operators contain redundant type-checking code. For example, Expression.n2 is
    used by Add/Subtract/Multiply:
    (1) n2 always checks left.dataType == right.dataType. However, this check
        should be done once, when we resolve expression types;
    (2) n2 requires dataType to be a NumericType. This can be checked once.

    This PR optimizes arithmetic and predicate operators by removing such
    redundant type-checking code.

    Some preliminary benchmarking on 10G of TPC-H data over 5 r3.2xlarge EC2
    machines shows that this PR can reduce the query time by 5.5% to 11%. The
    benchmark queries follow the template below, where OP is plus/minus/times/
    divide/remainder/bitwise and/bitwise or/bitwise xor:

      SELECT l_returnflag, l_linestatus,
             SUM(l_quantity OP cnt1),
             SUM(l_quantity OP cnt2),
             ...,
             SUM(l_quantity OP cnt700)
      FROM (
        SELECT l_returnflag, l_linestatus, l_quantity,
               1 AS cnt1, 2 AS cnt2, ..., 700 AS cnt700
        FROM lineitem
        WHERE l_shipdate <= '1998-09-01'
      )
      GROUP BY l_returnflag, l_linestatus;

    Author: kai <kaizeng@eecs.berkeley.edu>
    Closes #4472 from kai-zeng/arithmetic-optimize and squashes the following commits:
      fef0cf1 [kai] Merge branch 'master' of github.com:apache/spark into arithmetic-optimize
      4b3a1bb [kai] chmod a-x
      5a41e49 [kai] chmod a-x Expression.scala
      cb37c94 [kai] rebase onto spark master
      7f6e968 [kai] chmod 100755 -> 100644
      6cddb46 [kai] format
      7490dbc [kai] fix unresolved-expression exception for EqualTo
      9c40bc0 [kai] fix bitwisenot
      3cbd363 [kai] clean up test code
      ca47801 [kai] override evalInternal for bitwise ops
      8fa84a1 [kai] add bitwise or and xor
      6892fc4 [kai] revert override evalInternal
      f8eba24 [kai] override evalInternal
      31ccdd4 [kai] rewrite all bitwise op and remove evalInternal
      86297e2 [kai] generalized
      cb92ae1 [kai] bitwise-and: override eval
      97a7d6c [kai] bitwise-and: override evalInternal using and func
      0906c39 [kai] add bitwise test
      62abbbc [kai] clean up predicate and arithmetic
      b34d58d [kai] add caching and benmark option
      12c5b32 [kai] override eval
      1cd7571 [kai] fix sqrt and maxof
      03fd0c3 [kai] fix predicate
      16fd84c [kai] optimize + - * / % -(unary) abs < > <= >=
      fd95823 [kai] remove unnecessary type checking
      24d062f [kai] test suite
    (cherry picked from commit cb6c48c874af2bd78ee73c1dc8a44fd28ecc0991)
    Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SPARK-5839][SQL] HiveMetastoreCatalog does not recognize table names and aliases of data source tables. | Yin Huai | 2015-02-16 | 3 files, -4/+53
    JIRA: https://issues.apache.org/jira/browse/SPARK-5839

    Author: Yin Huai <yhuai@databricks.com>
    Closes #4626 from yhuai/SPARK-5839 and squashes the following commits:
      f779d85 [Yin Huai] Use subquery to wrap replaced ParquetRelation.
      2695f13 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SPARK-5839
      f1ba6ca [Yin Huai] Address comment.
      2c7fa08 [Yin Huai] Use Subqueries to wrap a data source table.
    (cherry picked from commit f3ff1eb2985ff3e1567645b898f6b42e4b01f237)
    Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SPARK-5746][SQL] Check invalid cases for the write path of data source API | Yin Huai | 2015-02-16 | 14 files, -57/+197
    JIRA: https://issues.apache.org/jira/browse/SPARK-5746

    CC liancheng marmbrus

    Author: Yin Huai <yhuai@databricks.com>
    Closes #4617 from yhuai/insertOverwrite and squashes the following commits:
      8e3019d [Yin Huai] Fix compilation error.
      499e8e7 [Yin Huai] Merge remote-tracking branch 'upstream/master' into insertOverwrite
      e76e85a [Yin Huai] Address comments.
      ac31b3c [Yin Huai] Merge remote-tracking branch 'upstream/master' into insertOverwrite
      f30bdad [Yin Huai] Use toDF.
      99da57e [Yin Huai] Merge remote-tracking branch 'upstream/master' into insertOverwrite
      6b7545c [Yin Huai] Add a pre write check to the data source API.
      a88c516 [Yin Huai] DDLParser will take a parsering function to take care CTAS statements.
    (cherry picked from commit 5b6cd65cd611b1a46a7d5eb33139c6224b96264e)
    Signed-off-by: Michael Armbrust <michael@databricks.com>
* HOTFIX: Break in Jekyll build from #4589 | Patrick Wendell | 2015-02-16 | 1 file, -2/+1
    That patch had a line break in the middle of a {{ }} expression, which is not
    allowed.
* [SPARK-2313] Use socket to communicate GatewayServer port back to Python driver | Josh Rosen | 2015-02-16 | 3 files, -43/+97
    This patch changes PySpark so that the GatewayServer's port is communicated
    back to the Python process that launches it over a local socket instead of a
    pipe. The old pipe-based approach was brittle and could fail if
    `spark-submit` printed unexpected output to stdout. To accomplish this, I
    wrote a custom `PythonGatewayServer.main()` function to use in place of
    Py4J's `GatewayServer.main()`.

    Closes #3424.

    Author: Josh Rosen <joshrosen@databricks.com>
    Closes #4603 from JoshRosen/SPARK-2313 and squashes the following commits:
      6a7740b [Josh Rosen] Remove EchoOutputThread since it's no longer needed
      0db501f [Josh Rosen] Use select() so that we don't block if GatewayServer dies.
      9bdb4b6 [Josh Rosen] Handle case where getListeningPort returns -1
      3fb7ed1 [Josh Rosen] Remove stdout=PIPE
      2458934 [Josh Rosen] Use underscore to mark env var. as private
      d12c95d [Josh Rosen] Use Logging and Utils.tryOrExit()
      e5f9730 [Josh Rosen] Wrap everything in a giant try-block
      2f70689 [Josh Rosen] Use stdin PIPE to share fate with driver
      8bf956e [Josh Rosen] Initial cut at passing Py4J gateway port back to driver via socket
    (cherry picked from commit 0cfda8461f173428f955aa9a7140b1356beea400)
    Signed-off-by: Josh Rosen <joshrosen@databricks.com>
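    A hedged sketch of the handshake described above (the env var name is our
    placeholder, not necessarily what the patch uses): the JVM connects back to a
    socket the Python launcher opened and writes the gateway port.

        import java.io.DataOutputStream
        import java.net.Socket

        def reportGatewayPort(gatewayPort: Int): Unit = {
          // Hypothetical variable name; the launcher tells us where to call back.
          val callbackPort = sys.env("_PYSPARK_CALLBACK_PORT").toInt
          val sock = new Socket("127.0.0.1", callbackPort)
          val out = new DataOutputStream(sock.getOutputStream)
          out.writeInt(gatewayPort) // Python side reads 4 big-endian bytes
          out.close()
          sock.close()
        }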
* SPARK-5357: Update commons-codec version to 1.10 (current) | Matt Whelan | 2015-02-16 | 1 file, -1/+1
    Resolves https://issues.apache.org/jira/browse/SPARK-5357

    In commons-codec 1.5, Base64 instances are not thread-safe. That was only
    true from 1.4-1.6.

    Author: Matt Whelan <mwhelan@perka.com>
    Closes #4153 from MattWhelan/depsUpdate and squashes the following commits:
      b4a91f4 [Matt Whelan] SPARK-5357: Update commons-codec version to 1.10 (current)
    (cherry picked from commit c01c4ebcfe5c1a4a56a8987af596eca090c2cc2f)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
* SPARK-5841: remove DiskBlockManager shutdown hook on stop | Matt Whelan | 2015-02-16 | 1 file, -4/+9
    After a call to stop, the shutdown hook is redundant, and causes a memory
    leak.

    Author: Matt Whelan <mwhelan@perka.com>
    Closes #4627 from MattWhelan/SPARK-5841 and squashes the following commits:
      d5f5c7f [Matt Whelan] SPARK-5841: remove DiskBlockManager shutdown hook on stop
    (cherry picked from commit bb05982dd25e008fb01684dff1f95d03e7271721)
    Signed-off-by: Sean Owen <sowen@cloudera.com>
* [SPARK-5833] [SQL] Adds REFRESH TABLE command | Cheng Lian | 2015-02-16 | 4 files, -24/+42
    Lifts `HiveMetastoreCatalog.refreshTable` to `Catalog`. Adds a `RefreshTable`
    command to refresh (possibly cached) metadata in external data source tables.

    Author: Cheng Lian <lian@databricks.com>
    Closes #4624 from liancheng/refresh-table and squashes the following commits:
      8d1aa4c [Cheng Lian] Adds REFRESH TABLE command
    (cherry picked from commit c51ab37faddf4ede23243058dfb388e74a192552)
    Signed-off-by: Michael Armbrust <michael@databricks.com>