aboutsummaryrefslogtreecommitdiff
Commit message (Collapse)AuthorAgeFilesLines
* [SPARK-4172] [PySpark] Progress API in PythonDavies Liu2015-02-176-24/+232
| | | | | | | | | | | | | | | | | This patch bring the pull based progress API into Python, also a example in Python. Author: Davies Liu <davies@databricks.com> Closes #3027 from davies/progress_api and squashes the following commits: b1ba984 [Davies Liu] fix style d3b9253 [Davies Liu] add tests, mute the exception after stop 4297327 [Davies Liu] Merge branch 'master' of github.com:apache/spark into progress_api 969fa9d [Davies Liu] Merge branch 'master' of github.com:apache/spark into progress_api 25590c9 [Davies Liu] update with Java API 360de2d [Davies Liu] Merge branch 'master' of github.com:apache/spark into progress_api c0f1021 [Davies Liu] Merge branch 'master' of github.com:apache/spark into progress_api 023afb3 [Davies Liu] add Python API and example for progress API
* [SPARK-5868][SQL] Fix python UDFs in HiveContext and checks in SQLContextMichael Armbrust2015-02-172-1/+5
| | | | | | | | Author: Michael Armbrust <michael@databricks.com> Closes #4657 from marmbrus/pythonUdfs and squashes the following commits: a7823a8 [Michael Armbrust] [SPARK-5868][SQL] Fix python UDFs in HiveContext and checks in SQLContext
* [SQL] [Minor] Update the HiveContext UnittestCheng Hao2015-02-177-0/+17
| | | | | | | | | | | | | | | | | In unit test, the table src(key INT, value STRING) is not the same as HIVE src(key STRING, value STRING) https://github.com/apache/hive/blob/branch-0.13/data/scripts/q_test_init.sql And in the reflect.q, test failed for expression `reflect("java.lang.Integer", "valueOf", key, 16)`, which expect the argument `key` as STRING not INT. This PR doesn't aim to change the `src` schema, we can do that after 1.3 released, however, we probably need to re-generate all the golden files. Author: Cheng Hao <hao.cheng@intel.com> Closes #4584 from chenghao-intel/reflect and squashes the following commits: e5bdc3a [Cheng Hao] Move the test case reflect into blacklist 184abfd [Cheng Hao] revert the change to table src1 d9bcf92 [Cheng Hao] Update the HiveContext Unittest
* [Minor][SQL] Use same function to check path parameter in JSONRelationLiang-Chi Hsieh2015-02-172-3/+3
| | | | | | | | Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #4649 from viirya/use_checkpath and squashes the following commits: 0f9a1a1 [Liang-Chi Hsieh] Use same function to check path parameter.
* [SPARK-5862][SQL] Only transformUp the given plan once in HiveMetastoreCatalogLiang-Chi Hsieh2015-02-171-17/+20
| | | | | | | | | | | Current `ParquetConversions` in `HiveMetastoreCatalog` will transformUp the given plan multiple times if there are many Metastore Parquet tables. Since the transformUp operation is recursive, it should be better to only perform it once. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #4651 from viirya/parquet_atonce and squashes the following commits: c1ed29d [Liang-Chi Hsieh] Fix bug. e0f919b [Liang-Chi Hsieh] Only transformUp the given plan once.
* [Minor] fix typo in SQL documentCodingCat2015-02-171-1/+1
| | | | | | | | | Author: CodingCat <zhunansjtu@gmail.com> Closes #4656 from CodingCat/fix_typo and squashes the following commits: b41d15c [CodingCat] recover 689fe46 [CodingCat] fix typo
* [SPARK-5864] [PySpark] support .jar as python packageDavies Liu2015-02-171-2/+4
| | | | | | | | | | | | A jar file containing Python sources in it could be used as a Python package, just like zip file. spark-submit already put the jar file into PYTHONPATH, this patch also put it in the sys.path, then it could be used in Python worker. Author: Davies Liu <davies@databricks.com> Closes #4652 from davies/jar and squashes the following commits: 17d3f76 [Davies Liu] support .jar as python package
* SPARK-5841 [CORE] [HOTFIX] Memory leak in DiskBlockManagerSean Owen2015-02-171-1/+4
| | | | | | | | | | | | Avoid call to remove shutdown hook being called from shutdown hook CC pwendell JoshRosen MattWhelan Author: Sean Owen <sowen@cloudera.com> Closes #4648 from srowen/SPARK-5841.2 and squashes the following commits: 51548db [Sean Owen] Avoid call to remove shutdown hook being called from shutdown hook
* MAINTENANCE: Automated closing of pull requests.Patrick Wendell2015-02-170-0/+0
| | | | | | | | | | | | This commit exists to close the following pull requests on Github: Closes #3297 (close requested by 'andrewor14') Closes #3345 (close requested by 'pwendell') Closes #2729 (close requested by 'srowen') Closes #2320 (close requested by 'pwendell') Closes #4529 (close requested by 'andrewor14') Closes #2098 (close requested by 'srowen') Closes #4120 (close requested by 'andrewor14')
* [SPARK-3381] [MLlib] Eliminate bins for unordered features in DecisionTreesMechCoder2015-02-173-73/+15
| | | | | | | | | | | | | For unordered features, it is sufficient to use splits since the threshold of the split corresponds the threshold of the HighSplit of the bin and there is no use of the LowSplit. Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #4231 from MechCoder/spark-3381 and squashes the following commits: 58c19a5 [MechCoder] COSMIT c274b74 [MechCoder] Remove unordered feature calculation in labeledPointToTreePoint b2b9b89 [MechCoder] COSMIT d3ee042 [MechCoder] [SPARK-3381] [MLlib] Eliminate bins for unordered features
* [SPARK-5661]function hasShutdownDeleteTachyonDir should use ↵xukun 002289472015-02-171-2/+2
| | | | | | | | | | | | | | | shutdownDeleteTachyonPaths to determine whether contains file hasShutdownDeleteTachyonDir(file: TachyonFile) should use shutdownDeleteTachyonPaths(not shutdownDeletePaths) to determine Whether contain file. To solve it ,delete two unused function. Author: xukun 00228947 <xukun.xu@huawei.com> Author: viper-kun <xukun.xu@huawei.com> Closes #4418 from viper-kun/deleteunusedfun and squashes the following commits: 87340eb [viper-kun] fix style 3d6c69e [xukun 00228947] fix bug 2bc397e [xukun 00228947] deleteunusedfun
* [SPARK-5778] throw if nonexistent metrics config file providedRyan Williams2015-02-175-19/+23
| | | | | | | | | | | | | | | | | | | | | | | previous behavior was to log an error; this is fine in the general case where no `spark.metrics.conf` parameter was specified, in which case a default `metrics.properties` is looked for, and the execption logged and suppressed if it doesn't exist. if the user has purposefully specified a metrics.conf file, however, it makes more sense to show them an error when said file doesn't exist. Author: Ryan Williams <ryan.blake.williams@gmail.com> Closes #4571 from ryan-williams/metrics and squashes the following commits: 5bccb14 [Ryan Williams] private-ize some MetricsConfig members 08ff998 [Ryan Williams] rename METRICS_CONF: DEFAULT_METRICS_CONF_FILENAME f4d7fab [Ryan Williams] fix tests ad24b0e [Ryan Williams] add "metrics.properties" to .rat-excludes 94e810b [Ryan Williams] throw if nonexistent Sink class is specified 31d2c30 [Ryan Williams] metrics code review feedback 56287db [Ryan Williams] throw if nonexistent metrics config file provided
* [SPARK-5859] [PySpark] [SQL] fix DataFrame Python APIDavies Liu2015-02-172-18/+59
| | | | | | | | | | | | | | | | | 1. added explain() 2. add isLocal() 3. do not call show() in __repl__ 4. add foreach() and foreachPartition() 5. add distinct() 6. fix functions.col()/column()/lit() 7. fix unit tests in sql/functions.py 8. fix unicode in showString() Author: Davies Liu <davies@databricks.com> Closes #4645 from davies/df6 and squashes the following commits: 6b46a2c [Davies Liu] fix DataFrame Python API
* [SPARK-5166][SPARK-5247][SPARK-5258][SQL] API Cleanup / DocumentationMichael Armbrust2015-02-1730-405/+483
| | | | | | | | | | | | | Author: Michael Armbrust <michael@databricks.com> Closes #4642 from marmbrus/docs and squashes the following commits: d291c34 [Michael Armbrust] python tests 9be66e3 [Michael Armbrust] comments d56afc2 [Michael Armbrust] fix style f004747 [Michael Armbrust] fix build c4a907b [Michael Armbrust] fix tests 42e2b73 [Michael Armbrust] [SQL] Documentation / API Clean-up.
* [SPARK-5858][MLLIB] Remove unnecessary first() call in GLMXiangrui Meng2015-02-172-4/+9
| | | | | | | | | | | `numFeatures` is only used by multinomial logistic regression. Calling `.first()` for every GLM causes performance regression, especially in Python. Author: Xiangrui Meng <meng@databricks.com> Closes #4647 from mengxr/SPARK-5858 and squashes the following commits: 036dc7f [Xiangrui Meng] remove unnecessary first() call 12c5548 [Xiangrui Meng] check numFeatures only once
* SPARK-5856: In Maven build script, launch Zinc with more memoryPatrick Wendell2015-02-171-1/+4
| | | | | | | | | | | | | | | | I've seen out of memory exceptions when trying to run many parallel builds against the same Zinc server during packaging. We should use the same increased memory settings we use for Maven itself. I tested this and confirmed that the Nailgun JVM launched with higher memory. Author: Patrick Wendell <patrick@databricks.com> Closes #4643 from pwendell/zinc-memory and squashes the following commits: 717cfb0 [Patrick Wendell] SPARK-5856: Launch Zinc with larger memory options.
* Revert "[SPARK-5363] [PySpark] check ending mark in non-block way"Josh Rosen2015-02-172-18/+4
| | | | This reverts commits ac6fe67e1d8bf01ee565f9cc09ad48d88a275829 and c06e42f2c1e5fcf123b466efd27ee4cb53bbed3f.
* [SPARK-5826][Streaming] Fix Configuration not serializable problemjerryshao2015-02-171-2/+4
| | | | | | | | | Author: jerryshao <saisai.shao@intel.com> Closes #4612 from jerryshao/SPARK-5826 and squashes the following commits: 7ec71db [jerryshao] Remove transient for conf statement 88d84e6 [jerryshao] Fix Configuration not serializable problem
* HOTFIX: Style issue causing build breakPatrick Wendell2015-02-161-2/+2
| | | | Caused by #4601
* [SPARK-5802][MLLIB] cache transformed data in glmXiangrui Meng2015-02-161-14/+15
| | | | | | | | | | If we need to transform the input data, we should cache the output to avoid re-computing feature vectors every iteration. dbtsai Author: Xiangrui Meng <meng@databricks.com> Closes #4593 from mengxr/SPARK-5802 and squashes the following commits: ae3be84 [Xiangrui Meng] cache transformed data in glm
* [SPARK-5853][SQL] Schema support in Row.Reynold Xin2015-02-164-5/+20
| | | | | | | | Author: Reynold Xin <rxin@databricks.com> Closes #4640 from rxin/SPARK-5853 and squashes the following commits: 9c6f569 [Reynold Xin] [SPARK-5853][SQL] Schema support in Row.
* SPARK-5850: Remove experimental label for Scala 2.11 and FlumePollingStreamPatrick Wendell2015-02-163-12/+4
| | | | | | | | Author: Patrick Wendell <patrick@databricks.com> Closes #4638 from pwendell/SPARK-5850 and squashes the following commits: 386126f [Patrick Wendell] SPARK-5850: Remove experimental label for Scala 2.11 and FlumePollingStream.
* [SPARK-5363] [PySpark] check ending mark in non-block wayDavies Liu2015-02-162-4/+18
| | | | | | | | | | | | | | | | | | | | There is chance of dead lock that the Python process is waiting for ending mark from JVM, but which is eaten by corrupted stream. This PR checks the ending mark from Python in non-block way, so it will not blocked by Python process. There is a small chance that the ending mark is sent by Python process but not available right now, then Python process will not be used. cc JoshRosen pwendell Author: Davies Liu <davies@databricks.com> Closes #4601 from davies/freeze and squashes the following commits: e15a8c3 [Davies Liu] update logging 890329c [Davies Liu] Merge branch 'freeze' of github.com:davies/spark into freeze 2bd2228 [Davies Liu] add more logging 656d544 [Davies Liu] Update PythonRDD.scala 05e1085 [Davies Liu] check ending mark in non-block way
* [SQL] Various DataFrame doc changes.Reynold Xin2015-02-169-87/+436
| | | | | | | | | | | | | Added a bunch of tags. Also changed parquetFile to take varargs rather than a string followed by varargs. Author: Reynold Xin <rxin@databricks.com> Closes #4636 from rxin/df-doc and squashes the following commits: 651f80c [Reynold Xin] Fixed parquetFile in PySpark. 8dc3024 [Reynold Xin] [SQL] Various DataFrame doc changes.
* [SPARK-5849] Handle more types of invalid JSON requests in ↵Josh Rosen2015-02-162-8/+12
| | | | | | | | | | | | SubmitRestProtocolMessage.parseAction This patch improves SubmitRestProtocol's handling of invalid JSON requests in cases where those requests were parsable as JSON but not as JSON objects (e.g. they could be parsed as ararys or strings). I replaced an unchecked cast with pattern-matching and added a new test case. Author: Josh Rosen <joshrosen@databricks.com> Closes #4637 from JoshRosen/rest-protocol-cast and squashes the following commits: b3f282b [Josh Rosen] [SPARK-5849] Handle more types of invalid JSON in SubmitRestProtocolMessage.parseAction
* [SPARK-3340] Deprecate ADD_JARS and ADD_FILESazagrebin2015-02-163-6/+12
| | | | | | | | | | | | | | | | | | | I created a patch that disables the environment variables. Thereby scala or python shell log a warning message to notify user about the deprecation with the following message: scala: "ADD_JARS environment variable is deprecated, use --jar spark submit argument instead" python: "Warning: ADD_FILES environment variable is deprecated, use --py-files argument instead" Is it what is expected or the code associated with the variables should be just completely removed? Should it be somewhere documented? Author: azagrebin <azagrebin@gmail.com> Closes #4616 from azagrebin/master and squashes the following commits: bab1aa9 [azagrebin] [SPARK-3340] Deprecate ADD_JARS and ADD_FILES: minor readability issue 0643895 [azagrebin] [SPARK-3340] Deprecate ADD_JARS and ADD_FILES: add warning messages 42f0107 [azagrebin] [SPARK-3340] Deprecate ADD_JARS and ADD_FILES
* [SPARK-5788] [PySpark] capture the exception in python write threadDavies Liu2015-02-161-2/+2
| | | | | | | | | | | The exception in Python writer thread will shutdown executor. Author: Davies Liu <davies@databricks.com> Closes #4577 from davies/exception and squashes the following commits: eb0ceff [Davies Liu] Update PythonRDD.scala 139b0db [Davies Liu] capture the exception in python write thread
* SPARK-5848: tear down the ConsoleProgressBar timerMatt Whelan2015-02-172-1/+7
| | | | | | | | | | | The timer is a GC root, and failing to terminate it leaks SparkContext instances. Author: Matt Whelan <mwhelan@perka.com> Closes #4635 from MattWhelan/SPARK-5848 and squashes the following commits: 2a1e8a5 [Matt Whelan] SPARK-5848: teardown the ConsoleProgressBar timer
* [SPARK-4865][SQL]Include temporary tables in SHOW TABLESYin Huai2015-02-169-50/+111
| | | | | | | | | | | | | | | | This PR adds a `ShowTablesCommand` to support `SHOW TABLES [IN databaseName]` SQL command. The result of `SHOW TABLE` has two columns, `tableName` and `isTemporary`. For temporary tables, the value of `isTemporary` column will be `false`. JIRA: https://issues.apache.org/jira/browse/SPARK-4865 Author: Yin Huai <yhuai@databricks.com> Closes #4618 from yhuai/showTablesCommand and squashes the following commits: 0c09791 [Yin Huai] Use ShowTablesCommand. 85ee76d [Yin Huai] Since SHOW TABLES is not a Hive native command any more and we will not see "OK" (originally generated by Hive's driver), use SHOW DATABASES in the test. 94bacac [Yin Huai] Add SHOW TABLES to the list of noExplainCommands. d71ed09 [Yin Huai] Fix test. a4a6ec3 [Yin Huai] Add SHOW TABLE command.
* [SQL] Optimize arithmetic and predicate operatorskai2015-02-1610-260/+290
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Existing implementation of arithmetic operators and BinaryComparison operators have redundant type checking codes, e.g.: Expression.n2 is used by Add/Subtract/Multiply. (1) n2 always checks left.dataType == right.dataType. However, this checking should be done once when we resolve expression types; (2) n2 requires dataType is a NumericType. This can be done once. This PR optimizes arithmetic and predicate operators by removing such redundant type-checking codes. Some preliminary benchmarking on 10G TPC-H data over 5 r3.2xlarge EC2 machines shows that this PR can reduce the query time by 5.5% to 11%. The benchmark queries follow the template below, where OP is plus/minus/times/divide/remainder/bitwise and/bitwise or/bitwise xor. SELECT l_returnflag, l_linestatus, SUM(l_quantity OP cnt1), SUM(l_quantity OP cnt2), ...., SUM(l_quantity OP cnt700) FROM ( SELECT l_returnflag, l_linestatus, l_quantity, 1 AS cnt1, 2 AS cnt2, ..., 700 AS cnt700 FROM lineitem WHERE l_shipdate <= '1998-09-01' ) GROUP BY l_returnflag, l_linestatus; Author: kai <kaizeng@eecs.berkeley.edu> Closes #4472 from kai-zeng/arithmetic-optimize and squashes the following commits: fef0cf1 [kai] Merge branch 'master' of github.com:apache/spark into arithmetic-optimize 4b3a1bb [kai] chmod a-x 5a41e49 [kai] chmod a-x Expression.scala cb37c94 [kai] rebase onto spark master 7f6e968 [kai] chmod 100755 -> 100644 6cddb46 [kai] format 7490dbc [kai] fix unresolved-expression exception for EqualTo 9c40bc0 [kai] fix bitwisenot 3cbd363 [kai] clean up test code ca47801 [kai] override evalInternal for bitwise ops 8fa84a1 [kai] add bitwise or and xor 6892fc4 [kai] revert override evalInternal f8eba24 [kai] override evalInternal 31ccdd4 [kai] rewrite all bitwise op and remove evalInternal 86297e2 [kai] generalized cb92ae1 [kai] bitwise-and: override eval 97a7d6c [kai] bitwise-and: override evalInternal using and func 0906c39 [kai] add bitwise test 62abbbc [kai] clean up predicate and arithmetic b34d58d [kai] add caching and benmark option 12c5b32 [kai] override eval 1cd7571 [kai] fix sqrt and maxof 03fd0c3 [kai] fix predicate 16fd84c [kai] optimize + - * / % -(unary) abs < > <= >= fd95823 [kai] remove unnecessary type checking 24d062f [kai] test suite
* [SPARK-5839][SQL]HiveMetastoreCatalog does not recognize table names and ↵Yin Huai2015-02-163-4/+53
| | | | | | | | | | | | | | | aliases of data source tables. JIRA: https://issues.apache.org/jira/browse/SPARK-5839 Author: Yin Huai <yhuai@databricks.com> Closes #4626 from yhuai/SPARK-5839 and squashes the following commits: f779d85 [Yin Huai] Use subqeury to wrap replaced ParquetRelation. 2695f13 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SPARK-5839 f1ba6ca [Yin Huai] Address comment. 2c7fa08 [Yin Huai] Use Subqueries to wrap a data source table.
* [SPARK-5746][SQL] Check invalid cases for the write path of data source APIYin Huai2015-02-1614-57/+197
| | | | | | | | | | | | | | | | | | | JIRA: https://issues.apache.org/jira/browse/SPARK-5746 liancheng marmbrus Author: Yin Huai <yhuai@databricks.com> Closes #4617 from yhuai/insertOverwrite and squashes the following commits: 8e3019d [Yin Huai] Fix compilation error. 499e8e7 [Yin Huai] Merge remote-tracking branch 'upstream/master' into insertOverwrite e76e85a [Yin Huai] Address comments. ac31b3c [Yin Huai] Merge remote-tracking branch 'upstream/master' into insertOverwrite f30bdad [Yin Huai] Use toDF. 99da57e [Yin Huai] Merge remote-tracking branch 'upstream/master' into insertOverwrite 6b7545c [Yin Huai] Add a pre write check to the data source API. a88c516 [Yin Huai] DDLParser will take a parsering function to take care CTAS statements.
* HOTFIX: Break in Jekyll build from #4589Patrick Wendell2015-02-161-2/+1
| | | | That patch had a line break in the middle of a {{ }} expression, which is not allowed.
* [SPARK-2313] Use socket to communicate GatewayServer port back to Python driverJosh Rosen2015-02-163-43/+97
| | | | | | | | | | | | | | | | | | | | | | This patch changes PySpark so that the GatewayServer's port is communicated back to the Python process that launches it over a local socket instead of a pipe. The old pipe-based approach was brittle and could fail if `spark-submit` printed unexpected to stdout. To accomplish this, I wrote a custom `PythonGatewayServer.main()` function to use in place of Py4J's `GatewayServer.main()`. Closes #3424. Author: Josh Rosen <joshrosen@databricks.com> Closes #4603 from JoshRosen/SPARK-2313 and squashes the following commits: 6a7740b [Josh Rosen] Remove EchoOutputThread since it's no longer needed 0db501f [Josh Rosen] Use select() so that we don't block if GatewayServer dies. 9bdb4b6 [Josh Rosen] Handle case where getListeningPort returns -1 3fb7ed1 [Josh Rosen] Remove stdout=PIPE 2458934 [Josh Rosen] Use underscore to mark env var. as private d12c95d [Josh Rosen] Use Logging and Utils.tryOrExit() e5f9730 [Josh Rosen] Wrap everything in a giant try-block 2f70689 [Josh Rosen] Use stdin PIPE to share fate with driver 8bf956e [Josh Rosen] Initial cut at passing Py4J gateway port back to driver via socket
* SPARK-5357: Update commons-codec version to 1.10 (current)Matt Whelan2015-02-161-1/+1
| | | | | | | | | | | | Resolves https://issues.apache.org/jira/browse/SPARK-5357 In commons-codec 1.5, Base64 instances are not thread safe. That was only true from 1.4-1.6. Author: Matt Whelan <mwhelan@perka.com> Closes #4153 from MattWhelan/depsUpdate and squashes the following commits: b4a91f4 [Matt Whelan] SPARK-5357: Update commons-codec version to 1.10 (current)
* SPARK-5841: remove DiskBlockManager shutdown hook on stopMatt Whelan2015-02-161-4/+9
| | | | | | | | | | | After a call to stop, the shutdown hook is redundant, and causes a memory leak. Author: Matt Whelan <mwhelan@perka.com> Closes #4627 from MattWhelan/SPARK-5841 and squashes the following commits: d5f5c7f [Matt Whelan] SPARK-5841: remove DiskBlockManager shutdown hook on stop
* [SPARK-5833] [SQL] Adds REFRESH TABLE commandCheng Lian2015-02-164-24/+42
| | | | | | | | | | | | | | Lifts `HiveMetastoreCatalog.refreshTable` to `Catalog`. Adds `RefreshTable` command to refresh (possibly cached) metadata in external data sources tables. <!-- Reviewable:start --> [<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/4624) <!-- Reviewable:end --> Author: Cheng Lian <lian@databricks.com> Closes #4624 from liancheng/refresh-table and squashes the following commits: 8d1aa4c [Cheng Lian] Adds REFRESH TABLE command
* [SPARK-5296] [SQL] Add more filter types for data sources APICheng Lian2015-02-165-31/+103
| | | | | | | | | | | | | | | | | | | | | | | | | This PR adds the following filter types for data sources API: - `IsNull` - `IsNotNull` - `Not` - `And` - `Or` The code which converts Catalyst predicate expressions to data sources filters is very similar to filter conversion logics in `ParquetFilters` which converts Catalyst predicates to Parquet filter predicates. In this way we can support nested AND/OR/NOT predicates without changing current `BaseScan` type hierarchy. <!-- Reviewable:start --> [<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/4623) <!-- Reviewable:end --> Author: Cheng Lian <lian@databricks.com> This patch had conflicts when merged, resolved by Committer: Michael Armbrust <michael@databricks.com> Closes #4623 from liancheng/more-fiters and squashes the following commits: 1b296f4 [Cheng Lian] Add more filter types for data sources API
* [SQL] Add fetched row count in SparkSQLCLIDriverOopsOutOfMemory2015-02-161-2/+10
| | | | | | | | | | | | | | | | | | before this change: ```scala Time taken: 0.619 seconds ``` after this change : ```scala Time taken: 0.619 seconds, Fetched: 4 row(s) ``` Author: OopsOutOfMemory <victorshengli@126.com> Closes #4604 from OopsOutOfMemory/rowcount and squashes the following commits: 7252dea [OopsOutOfMemory] add fetched row count
* [SQL] Initial support for reporting location of error in sql stringMichael Armbrust2015-02-1611-39/+314
| | | | | | | | | | | | Author: Michael Armbrust <michael@databricks.com> Closes #4587 from marmbrus/position and squashes the following commits: 0810052 [Michael Armbrust] fix tests 395c019 [Michael Armbrust] Merge remote-tracking branch 'marmbrus/position' into position e155dce [Michael Armbrust] more errors f3efa51 [Michael Armbrust] Update AnalysisException.scala d45ff60 [Michael Armbrust] [SQL] Initial support for reporting location of error in sql string
* [SPARK-5824] [SQL] add null format in ctas and set default col comment to nullDaoyuan Wang2015-02-1619-1/+61
| | | | | | | | | Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #4609 from adrian-wang/ctas and squashes the following commits: 0a75d5a [Daoyuan Wang] reorder import 93d1863 [Daoyuan Wang] add null format in ctas and set default col comment to null
* [SQL] [Minor] Update the SpecificMutableRow.copyCheng Hao2015-02-161-3/+4
| | | | | | | | | | When profiling the Join / Aggregate queries via VisualVM, I noticed lots of `SpecificMutableRow` objects created, as well as the `MutableValue`, since the `SpecificMutableRow` are mostly used in data source implementation, but the `copy` method could be called multiple times in upper modules (e.g. in Join / aggregation etc.), duplicated instances created should be avoid. Author: Cheng Hao <hao.cheng@intel.com> Closes #4619 from chenghao-intel/specific_mutable_row and squashes the following commits: 9300d23 [Cheng Hao] update the SpecificMutableRow.copy
* SPARK-5795 [STREAMING] api.java.JavaPairDStream.saveAsNewAPIHadoopFiles may ↵Sean Owen2015-02-162-10/+28
| | | | | | | | | | | | | | | not friendly to java Revise JavaPairDStream API declaration on saveAs Hadoop methods, to allow it to be called directly as intended. CC tdas for review Author: Sean Owen <sowen@cloudera.com> Closes #4608 from srowen/SPARK-5795 and squashes the following commits: 36f1ead [Sean Owen] Add code that shows compile problem and fix 036bd27 [Sean Owen] Revise JavaPairDStream API declaration on saveAs Hadoop methods, to allow it to be called directly as intended.
* Minor fixes for commit https://github.com/apache/spark/pull/4592.Reynold Xin2015-02-162-8/+7
|
* [SPARK-5799][SQL] Compute aggregation function on specified numeric columnsLiang-Chi Hsieh2015-02-165-26/+123
| | | | | | | | | | | | | | | | | | | | | | | | Compute aggregation function on specified numeric columns. For example: val df = Seq(("a", 1, 0, "b"), ("b", 2, 4, "c"), ("a", 2, 3, "d")).toDataFrame("key", "value1", "value2", "rest") df.groupBy("key").min("value2") Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #4592 from viirya/specific_cols_agg and squashes the following commits: 9446896 [Liang-Chi Hsieh] For comments. 314c4cd [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into specific_cols_agg 353fad7 [Liang-Chi Hsieh] For python unit tests. 54ed0c4 [Liang-Chi Hsieh] Address comments. b079e6b [Liang-Chi Hsieh] Remove duplicate codes. 55100fb [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into specific_cols_agg 880c2ac [Liang-Chi Hsieh] Fix Python style checks. 4c63a01 [Liang-Chi Hsieh] Fix pyspark. b1a24fc [Liang-Chi Hsieh] Address comments. 2592f29 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into specific_cols_agg 27069c3 [Liang-Chi Hsieh] Combine functions and add varargs annotation. 371a3f7 [Liang-Chi Hsieh] Compute aggregation function on specified numeric columns.
* SPARK-5815 [MLLIB] Part 2. Deprecate SVDPlusPlus APIs that expose ↵Sean Owen2015-02-161-27/+15
| | | | | | | | | | | | DoubleMatrix from JBLAS Now, deprecated runSVDPlusPlus and update run, for 1.4.0 / master only Author: Sean Owen <sowen@cloudera.com> Closes #4625 from srowen/SPARK-5815.2 and squashes the following commits: 6fd2ca5 [Sean Owen] Now, deprecated runSVDPlusPlus and update run, for 1.4.0 / master only
* [SPARK-5831][Streaming]When checkpoint file size is bigger than 10, then ↵Xutingjun2015-02-161-1/+1
| | | | | | | | | | delete the old ones Author: Xutingjun <1039320815@qq.com> Closes #4621 from XuTingjun/checkpointFile and squashes the following commits: b5f2816 [Xutingjun] fix bug
* [SPARK-4553] [SPARK-5767] [SQL] Wires Parquet data source with the newly ↵Cheng Lian2015-02-1612-675/+1148
| | | | | | | | | | | | | | | | | | | | | | | | introduced write support for data source API This PR migrates the Parquet data source to the new data source write support API. Now users can also overwriting and appending to existing tables. Notice that inserting into partitioned tables is not supported yet. When Parquet data source is enabled, insertion to Hive Metastore Parquet tables is also fullfilled by the Parquet data source. This is done by the newly introduced `HiveMetastoreCatalog.ParquetConversions` rule, which is a "proper" implementation of the original hacky `HiveStrategies.ParquetConversion`. The latter is still preserved, and can be removed together with the old Parquet support in the future. TODO: - [x] Update outdated comments in `newParquet.scala`. <!-- Reviewable:start --> [<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/4563) <!-- Reviewable:end --> Author: Cheng Lian <lian@databricks.com> Closes #4563 from liancheng/parquet-refining and squashes the following commits: fa98d27 [Cheng Lian] Fixes test cases which should disable off Parquet data source 2476e82 [Cheng Lian] Fixes compilation error introduced during rebasing a83d290 [Cheng Lian] Passes Hive Metastore partitioning information to ParquetRelation2
* [Minor] [SQL] Renames stringRddToDataFrame to stringRddToDataFrameHolder for ↵Cheng Lian2015-02-161-1/+1
| | | | | | | | | | | | | | consistency <!-- Reviewable:start --> [<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/4613) <!-- Reviewable:end --> Author: Cheng Lian <lian@databricks.com> Closes #4613 from liancheng/df-implicit-rename and squashes the following commits: db8bdd3 [Cheng Lian] Renames stringRddToDataFrame to stringRddToDataFrameHolder for consistency
* [Ml] SPARK-5804 Explicitly manage cache in Crossvalidator k-fold loopPeter Rudenko2015-02-161-0/+2
| | | | | | | | | | | On a big dataset explicitly unpersist train and validation folds allows to load more data into memory in the next loop iteration. On my environment (single node 8Gb worker RAM, 2 GB dataset file, 3 folds for cross validation), saved more than 5 minutes. Author: Peter Rudenko <petro.rudenko@gmail.com> Closes #4595 from petro-rudenko/patch-2 and squashes the following commits: 66a7cfb [Peter Rudenko] Move validationDataset cache to declaration c5f3265 [Peter Rudenko] [Ml] SPARK-5804 Explicitly manage cache in Crossvalidator k-fold loop