Commit message | Author | Age | Files | Lines
* [SPARK-8407] [SQL] complex type constructors: struct and named_struct | Yijie Shen | 2015-07-02 | 9 files | -13/+126

  This is a follow-up of [SPARK-8283](https://issues.apache.org/jira/browse/SPARK-8283) ([PR-6828](https://github.com/apache/spark/pull/6828)), to support both `struct` and `named_struct` in Spark SQL.

  After [#6725](https://github.com/apache/spark/pull/6828), the semantics of [`CreateStruct`](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypes.scala#L56) changed slightly: it is no longer limited to columns of `NamedExpression`s, and it names non-NamedExpression fields following the Hive convention: col1, col2, ...

  This PR both loosens [`struct`](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L723) to take children of `Expression` type and adds `named_struct` support.

  Author: Yijie Shen <henry.yijieshen@gmail.com>

  Closes #6874 from yijieshen/SPARK-8283 and squashes the following commits:

  4cd3375ac [Yijie Shen] change struct documentation
  d599d0b [Yijie Shen] rebase code
  9a7039e [Yijie Shen] fix reviews and regenerate golden answers
  b487354 [Yijie Shen] replace assert using checkAnswer
  f07e114 [Yijie Shen] tiny fix
  9613be9 [Yijie Shen] review fix
  7fef712 [Yijie Shen] Fix checkInputTypes' implementation using foldable and nullable
  60812a7 [Yijie Shen] Fix type check
  828d694 [Yijie Shen] remove unnecessary resolved assertion inside dataType method
  fd3cd8e [Yijie Shen] remove type check from eval
  7a71255 [Yijie Shen] tiny fix
  ccbbd86 [Yijie Shen] Fix reviews
  47da332 [Yijie Shen] remove nameStruct API from DataFrame
  917e680 [Yijie Shen] Fix reviews
  4bd75ad [Yijie Shen] loosen struct method in functions.scala to take Expression children
  0acb7be [Yijie Shen] Add CreateNamedStruct in both DataFrame function API and FunctionRegistery
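The field-naming rule from the commit above can be sketched in plain Python. This is only an illustration under stated assumptions, not Spark's `CreateStruct` code: `struct_field_names` and `named_struct` are hypothetical helpers, unnamed expressions are modeled as `None`, and generated names are assumed to use the field's 1-based position.

```python
def struct_field_names(names):
    """Return field names, filling unnamed fields with col1, col2, ...

    `names` holds one entry per field: a string for a named expression,
    or None for an expression without a name (an assumption of this sketch).
    The generated name uses the 1-based position of the field.
    """
    return [
        name if name is not None else "col%d" % position
        for position, name in enumerate(names, start=1)
    ]


def named_struct(*args):
    """Pair up alternating name/value arguments: name1, val1, name2, val2, ..."""
    if len(args) % 2 != 0:
        raise ValueError("named_struct expects an even number of arguments")
    return list(zip(args[0::2], args[1::2]))
```

For example, `struct_field_names(["a", None])` gives the second field the generated name `col2`, while `named_struct("x", 1, "y", 2)` pairs each name with the value that follows it.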
* [SPARK-8747] [SQL] fix EqualNullSafe for binary type | Wenchen Fan | 2015-07-02 | 2 files | -47/+78

  Also improves tests for binary comparison.

  Author: Wenchen Fan <cloud0fan@outlook.com>

  Closes #7143 from cloud-fan/binary and squashes the following commits:

  28a5b76 [Wenchen Fan] improve test
  04ef4b0 [Wenchen Fan] fix equalNullSafe
* [SPARK-8223] [SPARK-8224] [SQL] shift left and shift right | Tarek Auel | 2015-07-02 | 6 files | -1/+223

  Jira: https://issues.apache.org/jira/browse/SPARK-8223 https://issues.apache.org/jira/browse/SPARK-8224

  ~~I am aware of #7174 and will update this PR if it's merged.~~ Done

  I don't know if #7034 can simplify this, but we can have a look at it if it gets merged.

  rxin

  In the Jira ticket the function has no second argument. I added a `numBits` argument that allows specifying the number of bits, which I think improves usability. I wanted to add `shiftleft(value)` as well, but the `selectExpr` DataFrame tests crash if I have both. To do this, I added `def shiftRight(e: Column): Column = ShiftRight(e.expr, lit(1).expr)` to functions.scala, but as mentioned this doesn't pass tests like `df.selectExpr("shiftRight(a)", ...` (not enough arguments exception). If we need the bitwise shift in order to be Hive compatible, I suggest adding `shiftLeft` and something like `shiftLeftX`.

  Author: Tarek Auel <tarek.auel@googlemail.com>

  Closes #7178 from tarekauel/8223 and squashes the following commits:

  8023bb5 [Tarek Auel] [SPARK-8223][SPARK-8224] fixed test
  f3f64e6 [Tarek Auel] [SPARK-8223][SPARK-8224] Integer -> Int
  f628706 [Tarek Auel] [SPARK-8223][SPARK-8224] removed toString; updated function description
  3b56f2a [Tarek Auel] Merge remote-tracking branch 'origin/master' into 8223
  5189690 [Tarek Auel] [SPARK-8223][SPARK-8224] minor fix and style fix
  9434a28 [Tarek Auel] Merge remote-tracking branch 'origin/master' into 8223
  44ee324 [Tarek Auel] [SPARK-8223][SPARK-8224] docu fix
  ac7fe9d [Tarek Auel] [SPARK-8223][SPARK-8224] right and left bit shift
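To make the two-argument form concrete, here is a minimal Python sketch of left and right shifts with a `num_bits` argument. It emulates 32-bit signed JVM `Int` behavior (wrap-around on overflow, shift distance taken modulo 32); that emulation, and the helper names, are assumptions of this sketch and not code from the patch.

```python
def _to_int32(x):
    """Wrap an arbitrary Python int into the signed 32-bit range."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x


def shift_left(value, num_bits):
    """Shift left, keeping only the low 32 bits of the result."""
    return _to_int32(value << (num_bits & 31))


def shift_right(value, num_bits):
    """Arithmetic (sign-preserving) shift right."""
    return _to_int32(value) >> (num_bits & 31)
```

With these definitions `shift_left(21, 1)` is `42` and `shift_right(42, 1)` is `21`; a one-argument variant such as the `shiftRight(e)` discussed above would simply fix `num_bits` to 1.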
* [SPARK-8758] [MLLIB] Add Python user guide for PowerIterationClustering | Yanbo Liang | 2015-07-02 | 2 files | -4/+69

  Add Python user guide for PowerIterationClustering

  Author: Yanbo Liang <ybliang8@gmail.com>

  Closes #7155 from yanboliang/spark-8758 and squashes the following commits:

  18d803b [Yanbo Liang] address comments
  dd29577 [Yanbo Liang] Add Python user guide for PowerIterationClustering
* [SPARK-8647] [MLLIB] Potential issue with constant hashCode | Alok Singh | 2015-07-02 | 2 files | -2/+4

  I added the following code, which gets the constant hash code as per the JIRA:

      // see [SPARK-8647], this achieves the needed constant hash code without constant no.
      override def hashCode(): Int = this.getClass.getName.hashCode()

  Author: Alok Singh <singhal@Aloks-MacBook-Pro.local>

  Closes #7146 from aloknsingh/aloknsingh_SPARK-8647 and squashes the following commits:

  e58bccf [Alok Singh] [SPARK-8647][MLlib] to avoid the class derivation issues, change the constant hashCode to override def hashCode(): Int = classOf[MatrixUDT].getName.hashCode()
  43cdb89 [Alok Singh] [SPARK-8647][MLlib] Potential issue with constant hashCode
* [SPARK-8690] [SQL] Add a setting to disable SparkSQL parquet schema merge by using datasource API | Wisely Chen | 2015-07-02 | 3 files | -1/+30

  The detailed problem description is in https://issues.apache.org/jira/browse/SPARK-8690

  Generally speaking, I add a config `spark.sql.parquet.mergeSchema` to achieve the same effect as `sqlContext.load("parquet", Map("path" -> "...", "mergeSchema" -> "false"))`.

  It becomes a simple flag without any side effects.

  Author: Wisely Chen <wiselychen@appier.com>

  Closes #7070 from thegiive/SPARK8690 and squashes the following commits:

  c6f3e86 [Wisely Chen] Refactor some code style and merge the test case to ParquetSchemaMergeConfigSuite
  94c9307 [Wisely Chen] Remove some style problem
  db8ef1b [Wisely Chen] Change config to SQLConf and add test case
  b6806fb [Wisely Chen] remove text
  c0edb8c [Wisely Chen] [SPARK-8690] add a config spark.sql.parquet.mergeSchema to disable datasource API schema merge feature.
* [SPARK-8746] [SQL] update download link for Hive 0.13.1 | Christian Kadner | 2015-07-02 | 1 file | -1/+1

  Updated the [Hive 0.13.1](https://archive.apache.org/dist/hive/hive-0.13.1) download link in `sql/README.md`.

  Author: Christian Kadner <ckadner@us.ibm.com>

  Closes #7144 from ckadner/SPARK-8746 and squashes the following commits:

  65d80f7 [Christian Kadner] [SPARK-8746][SQL] update download link for Hive 0.13.1
* [SPARK-8787] [SQL] Changed parameter order of @deprecated in package object sql | Vinod K C | 2015-07-02 | 1 file | -1/+1

  The parameter order of the deprecated annotation in package object sql is wrong: `deprecated("1.3.0", "use DataFrame")`.

  This has to be changed to `deprecated("use DataFrame", "1.3.0")`.

  Author: Vinod K C <vinod.kc@huawei.com>

  Closes #7183 from vinodkc/fix_deprecated_param_order and squashes the following commits:

  1cbdbe8 [Vinod K C] Modified the message
  700911c [Vinod K C] Changed order of parameters
* [DOCS] Fix minor wrong lambda expression example. | Kousuke Saruta | 2015-07-02 | 1 file | -1/+1

  It's a really minor issue, but there is an example with wrong lambda-expression usage in `SQLContext.scala`, as follows.

  ```
  sqlContext.udf().register("myUDF",
      (Integer arg1, String arg2) -> arg2 + arg1),  <- We have an extra `)` here.
      DataTypes.StringType);
  ```

  Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

  Closes #7187 from sarutak/fix-minor-wrong-lambda-expression and squashes the following commits:

  a13196d [Kousuke Saruta] Fixed minor wrong lambda expression example.
* [SPARK-8687] [YARN] Fix bug: Executor can't fetch the new set configuration in yarn-client | huangzhaowei | 2015-07-01 | 1 file | -1/+6

  Spark initializes the properties in `CoarseGrainedSchedulerBackend.start`:

  ```scala
  // TODO (prashant) send conf instead of properties
  driverEndpoint = rpcEnv.setupEndpoint(
    CoarseGrainedSchedulerBackend.ENDPOINT_NAME, new DriverEndpoint(rpcEnv, properties))
  ```

  The YARN logic then sets some configuration values but does not update them in this `properties`, so the `Executor` never receives them.

  [Jira](https://issues.apache.org/jira/browse/SPARK-8687)

  Author: huangzhaowei <carlmartinmax@gmail.com>

  Closes #7066 from SaintBacchus/SPARK-8687 and squashes the following commits:

  1de4f48 [huangzhaowei] Ensure all necessary properties have already been set before startup ExecutorLaucher
* [SPARK-3071] Increase default driver memory | Ilya Ganelin | 2015-07-01 | 15 files | -21/+35

  I've updated default values in comments, documentation, and in the command line builder to be 1g based on comments in the JIRA. I've also updated most usages to point at a single variable defined in the Utils.scala and JavaUtils.java files. This wasn't possible in all cases (R, shell scripts etc.) but usage in most code is now pointing at the same place.

  Please let me know if I've missed anything.

  Will the spark-shell use the value within the command line builder during instantiation?

  Author: Ilya Ganelin <ilya.ganelin@capitalone.com>

  Closes #7132 from ilganeli/SPARK-3071 and squashes the following commits:

  4074164 [Ilya Ganelin] String fix
  271610b [Ilya Ganelin] Merge branch 'SPARK-3071' of github.com:ilganeli/spark into SPARK-3071
  273b6e9 [Ilya Ganelin] Test fix
  fd67721 [Ilya Ganelin] Update JavaUtils.java
  26cc177 [Ilya Ganelin] test fix
  e5db35d [Ilya Ganelin] Fixed test failure
  39732a1 [Ilya Ganelin] merge fix
  a6f7deb [Ilya Ganelin] Created default value for DRIVER MEM in Utils that's now used in almost all locations instead of setting manually in each
  09ad698 [Ilya Ganelin] Update SubmitRestProtocolSuite.scala
  19b6f25 [Ilya Ganelin] Missed one doc update
  2698a3d [Ilya Ganelin] Updated default value for driver memory
* [SPARK-8740] [PROJECT INFRA] Support GitHub OAuth tokens in dev/merge_spark_pr.py | Josh Rosen | 2015-07-01 | 1 file | -2/+16

  This commit allows `dev/merge_spark_pr.py` to use personal GitHub OAuth tokens in order to make authenticated requests. This is necessary to work around per-IP rate limiting issues.

  To use a token, just set the `GITHUB_OAUTH_KEY` environment variable. You can create a personal token at https://github.com/settings/tokens; we only require `public_repo` scope.

  If the script fails due to a rate-limit issue, it now logs a useful message directing the user to the OAuth token instructions.

  Author: Josh Rosen <joshrosen@databricks.com>

  Closes #7136 from JoshRosen/pr-merge-script-oauth-authentication and squashes the following commits:

  4d011bd [Josh Rosen] Fix error message
  23d92ff [Josh Rosen] Support GitHub OAuth tokens in dev/merge_spark_pr.py
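As a rough illustration of the token handling described above, the following Python 3 sketch attaches a token from `GITHUB_OAUTH_KEY` to a request when the variable is set. It is not the code from `dev/merge_spark_pr.py`: the helper name `build_github_request` is hypothetical, and the `Authorization: token ...` header form is an assumption of this sketch. No network call is made.

```python
import os
import urllib.request


def build_github_request(url, environ=os.environ):
    """Build a request, adding an OAuth token header when one is configured."""
    request = urllib.request.Request(url)
    token = environ.get("GITHUB_OAUTH_KEY")
    if token:
        # Authenticated requests are subject to a per-user limit
        # rather than the stricter per-IP limit.
        request.add_header("Authorization", "token %s" % token)
    return request
```

Passing the environment as a parameter keeps the helper easy to exercise without touching the real process environment, for example `build_github_request(url, {"GITHUB_OAUTH_KEY": "abc"})`.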
* [SPARK-8769] [TRIVIAL] [DOCS] toLocalIterator should mention it results in many jobs | Holden Karau | 2015-07-01 | 1 file | -0/+4

  Author: Holden Karau <holden@pigscanfly.ca>

  Closes #7171 from holdenk/SPARK-8769-toLocalIterator-documentation-improvement and squashes the following commits:

  97ddd99 [Holden Karau] Add note
* [SPARK-8771] [TRIVIAL] Add a version to the deprecated annotation for the actorSystem | Holden Karau | 2015-07-01 | 1 file | -2/+2

  Author: Holden Karau <holden@pigscanfly.ca>

  Closes #7172 from holdenk/SPARK-8771-actor-system-deprecation-tag-uses-deprecated-deprecation-tag and squashes the following commits:

  7f1455b [Holden Karau] Add .0s to the versions for the derpecated anotations in SparkEnv.scala
  ca13c9d [Holden Karau] Add a version to the deprecated annotation for the actorSystem in SparkEnv
* [SPARK-8688] [YARN] Bug fix: disable the cache fs to gain the HDFS connection. | huangzhaowei | 2015-07-01 | 3 files | -5/+23

  If `fs.hdfs.impl.disable.cache` is `false` (the default), `FileSystem` uses the cached `DFSClient`, which uses the old token.

  [AMDelegationTokenRenewer](https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/AMDelegationTokenRenewer.scala#L196)

  ```scala
  val credentials = UserGroupInformation.getCurrentUser.getCredentials
  credentials.writeTokenStorageFile(tempTokenPath, discachedConfiguration)
  ```

  Although `credentials` has the new token, it still uses the cached client and the old token. So it's better to set `fs.hdfs.impl.disable.cache` to `true` to avoid token expiry.

  [Jira](https://issues.apache.org/jira/browse/SPARK-8688)

  Author: huangzhaowei <carlmartinmax@gmail.com>

  Closes #7069 from SaintBacchus/SPARK-8688 and squashes the following commits:

  f94cd0b [huangzhaowei] modify function parameter
  8fb9eb9 [huangzhaowei] explicit the comment
  0cd55c9 [huangzhaowei] Rename function name to be an accurate one
  cf776a1 [huangzhaowei] [SPARK-8688][YARN]Bug fix: disable the cache fs to gain the HDFS connection.
* [SPARK-8754] [YARN] YarnClientSchedulerBackend doesn't stop gracefully in failure conditions | Devaraj K | 2015-07-01 | 1 file | -1/+3

  In YarnClientSchedulerBackend.stop(), added a check for monitorThread.

  Author: Devaraj K <devaraj@apache.org>

  Closes #7153 from devaraj-kavali/master and squashes the following commits:

  66be9ad [Devaraj K] https://issues.apache.org/jira/browse/SPARK-8754 YarnClientSchedulerBackend doesn't stop gracefully in failure conditions
* [SPARK-8227] [SQL] Add function unhex | zhichao.li | 2015-07-01 | 6 files | -0/+88

  cc chenghao-intel adrian-wang

  Author: zhichao.li <zhichao.li@intel.com>

  Closes #7113 from zhichao-li/unhex and squashes the following commits:

  379356e [zhichao.li] remove exception checking
  a4ae6dc [zhichao.li] add udf_unhex to whitelist
  fe5c14a [zhichao.li] add todigit
  607d7a3 [zhichao.li] use checkInputTypes
  bffd37f [zhichao.li] change to use Hex in apache common package
  cde73f5 [zhichao.li] update to use AutoCastInputTypes
  11945c7 [zhichao.li] style
  c852d46 [zhichao.li] Add function unhex
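For readers unfamiliar with `unhex`, here is a small Python sketch of what such a function does: it turns a hexadecimal string back into bytes. The handling of edge cases (left-padding odd-length input with `0`, returning `None` for invalid digits) is an assumption of this sketch and may differ from the Spark implementation added in this commit.

```python
import binascii


def unhex(hex_string):
    """Decode a hex string to bytes; return None for null or invalid input."""
    if hex_string is None:
        return None
    if len(hex_string) % 2 == 1:
        # Assumption: treat an odd-length string as having a leading zero.
        hex_string = "0" + hex_string
    try:
        return binascii.unhexlify(hex_string)
    except ValueError:  # binascii.Error is a subclass of ValueError
        return None
```

For instance, `unhex("4D7953514C")` decodes to the bytes of `MySQL`.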
* [SPARK-8660] [MLLIB] removed > symbols from comments in LogisticRegressionSuite.scala for ease of copypaste | Rosstin | 2015-07-01 | 1 file | -54/+63

  '>' symbols removed from comments in LogisticRegressionSuite.scala, for ease of copypaste.

  Also single-lined the multiline commands (is this desirable, or does it violate style?).

  Author: Rosstin <asterazul@gmail.com>

  Closes #7167 from Rosstin/SPARK-8660-2 and squashes the following commits:

  f4b9bc8 [Rosstin] SPARK-8660 restored character limit on multiline comments in LogisticRegressionSuite.scala
  fe6b112 [Rosstin] SPARK-8660 > symbols removed from LogisticRegressionSuite.scala for easy of copypaste
  39ddd50 [Rosstin] Merge branch 'master' of github.com:apache/spark into SPARK-8661
  5a05dee [Rosstin] SPARK-8661 for LinearRegressionSuite.scala, changed javadoc-style comments to regular multiline comments to make it easier to copy-paste the R code.
  bb9a4b1 [Rosstin] Merge branch 'master' of github.com:apache/spark into SPARK-8660
  242aedd [Rosstin] SPARK-8660, changed comment style from JavaDoc style to normal multiline comment in order to make copypaste into R easier, in file classification/LogisticRegressionSuite.scala
  2cd2985 [Rosstin] Merge branch 'master' of github.com:apache/spark into SPARK-8639
  21ac1e5 [Rosstin] Merge branch 'master' of github.com:apache/spark into SPARK-8639
  6c18058 [Rosstin] fixed minor typos in docs/README.md and docs/api.md
* [SPARK-8770][SQL] Create BinaryOperator abstract class. | Reynold Xin | 2015-07-01 | 15 files | -155/+191

  Our current BinaryExpression abstract class is not for generic binary expressions, i.e. it requires left/right children to have the same type. However, due to its name, contributors build new binary expressions that don't have that assumption (e.g. Sha) and still extend BinaryExpression.

  This patch creates a new BinaryOperator abstract class, and updates the analyzer to only apply the type casting rule there. This patch also adds the notion of "prettyName" to expressions, which defines the user-facing name for the expression.

  Author: Reynold Xin <rxin@databricks.com>

  Closes #7174 from rxin/binary-opterator and squashes the following commits:

  f31900d [Reynold Xin] [SPARK-8770][SQL] Create BinaryOperator abstract class.
  fceb216 [Reynold Xin] Merge branch 'master' of github.com:apache/spark into binary-opterator
  d8518cf [Reynold Xin] Updated Python tests.
* Revert "[SPARK-8770][SQL] Create BinaryOperator abstract class." | Reynold Xin | 2015-07-01 | 12 files | -170/+135

  This reverts commit 272778999823ed79af92280350c5869a87a21f29.
* [SPARK-8770][SQL] Create BinaryOperator abstract class. | Reynold Xin | 2015-07-01 | 12 files | -135/+170

  Our current BinaryExpression abstract class is not for generic binary expressions, i.e. it requires left/right children to have the same type. However, due to its name, contributors build new binary expressions that don't have that assumption (e.g. Sha) and still extend BinaryExpression.

  This patch creates a new BinaryOperator abstract class, and updates the analyzer to only apply the type casting rule there. This patch also adds the notion of "prettyName" to expressions, which defines the user-facing name for the expression.

  Author: Reynold Xin <rxin@databricks.com>

  Closes #7170 from rxin/binaryoperator and squashes the following commits:

  51264a5 [Reynold Xin] [SPARK-8770][SQL] Create BinaryOperator abstract class.
* [SPARK-8766] support non-ascii character in column names | Davies Liu | 2015-07-01 | 4 files | -5/+15

  Use UTF-8 to encode column names in Python 2; otherwise encoding may fail with the default encoding ('ascii').

  This PR also fixes a bug that occurs when a Java exception has no error message.

  Author: Davies Liu <davies@databricks.com>

  Closes #7165 from davies/non_ascii and squashes the following commits:

  02cb61a [Davies Liu] fix tests
  3b09d31 [Davies Liu] add encoding in header
  867754a [Davies Liu] support non-ascii character in column names
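The encoding failure mentioned above is easy to reproduce outside Spark. The sketch below, written for Python 3, shows that the `ascii` codec rejects a non-ASCII column name while UTF-8 accepts it; `encode_column_name` is a hypothetical helper, not a PySpark function.

```python
def encode_column_name(name, encoding="utf-8"):
    """Encode a column name to bytes (hypothetical helper for this sketch)."""
    return name.encode(encoding)


name = u"数量"  # a non-ASCII column name

try:
    encode_column_name(name, "ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False  # the 'ascii' codec cannot represent these characters

utf8_bytes = encode_column_name(name)  # UTF-8 handles any Unicode name
```

In Python 2 the `ascii` codec was the implicit default when converting `unicode` to `str`, which is why an explicit UTF-8 encode was needed there.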
* [SPARK-3444] [CORE] Restore INFO level after log4j test. | Marcelo Vanzin | 2015-07-01 | 1 file | -5/+11

  Otherwise other tests don't log anything useful...

  Author: Marcelo Vanzin <vanzin@cloudera.com>

  Closes #7140 from vanzin/SPARK-3444 and squashes the following commits:

  de14836 [Marcelo Vanzin] Better fix.
  6cff13a [Marcelo Vanzin] [SPARK-3444] [core] Restore INFO level after log4j test.
* [QUICKFIX] [SQL] fix copy of generated row | Davies Liu | 2015-07-01 | 2 files | -1/+4

  copy() of generated Row doesn't check nullability of columns

  Author: Davies Liu <davies@databricks.com>

  Closes #7163 from davies/fix_copy and squashes the following commits:

  661a206 [Davies Liu] fix copy of generated row
* [SPARK-7820] [BUILD] Fix Java8-tests suite compile and test error under sbt | jerryshao | 2015-07-01 | 2 files | -2/+10

  Author: jerryshao <saisai.shao@intel.com>

  Closes #7120 from jerryshao/SPARK-7820 and squashes the following commits:

  6902439 [jerryshao] fix Java8-tests suite compile error under sbt
* [SPARK-8378] [STREAMING] Add the Python API for Flume | zsxwing | 2015-07-01 | 15 files | -236/+1009

  Author: zsxwing <zsxwing@gmail.com>

  Closes #6830 from zsxwing/flume-python and squashes the following commits:

  78dfdac [zsxwing] Fix the compile error in the test code
  f1bf3c0 [zsxwing] Address TD's comments
  0449723 [zsxwing] Add sbt goal streaming-flume-assembly/assembly
  e93736b [zsxwing] Fix the test case for determine_modules_to_test
  9d5821e [zsxwing] Fix pyspark_core dependencies
  f9ee681 [zsxwing] Merge branch 'master' into flume-python
  7a55837 [zsxwing] Add streaming_flume_assembly to run-tests.py
  b96b0de [zsxwing] Merge branch 'master' into flume-python
  ce85e83 [zsxwing] Fix incompatible issues for Python 3
  01cbb3d [zsxwing] Add import sys
  152364c [zsxwing] Fix the issue that StringIO doesn't work in Python 3
  14ba0ff [zsxwing] Add flume-assembly for sbt building
  b8d5551 [zsxwing] Merge branch 'master' into flume-python
  4762c34 [zsxwing] Fix the doc
  0336579 [zsxwing] Refactor Flume unit tests and also add tests for Python API
  9f33873 [zsxwing] Add the Python API for Flume
* [SPARK-8765] [MLLIB] [PYTHON] removed flaky python PIC test | Joseph K. Bradley | 2015-07-01 | 1 file | -4/+0

  See failure: [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36133/console]

  CC yanboliang mengxr

  Author: Joseph K. Bradley <joseph@databricks.com>

  Closes #7164 from jkbradley/pic-python-test and squashes the following commits:

  156d55b [Joseph K. Bradley] removed flaky python PIC test
* [SPARK-8308] [MLLIB] add missing save load for python example | Yuhao Yang | 2015-07-01 | 3 files | -6/+18

  jira: https://issues.apache.org/jira/browse/SPARK-8308

  1. Add some missing save/load calls in the Python examples: LogisticRegression, LinearRegression and NaiveBayes.
  2. Tune down the iterations for MatrixFactorization, since the current number triggers a StackOverflow with the default Java configuration (>1M).

  Author: Yuhao Yang <hhbyyh@gmail.com>

  Closes #6760 from hhbyyh/docUpdate and squashes the following commits:

  9bd3383 [Yuhao Yang] update scala example
  8a44692 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into docUpdate
  077cbb8 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into docUpdate
  3e948dc [Yuhao Yang] add missing save load for python example
* [SPARK-6263] [MLLIB] Python MLlib API missing items: Utils | lewuathe | 2015-07-01 | 3 files | -0/+74

  Implement missing API in PySpark.

  MLUtils
  * appendBias
  * loadVectors

  `kFold` is also missing; however, I am not sure `ClassTag` can be passed or restored through Python.

  Author: lewuathe <lewuathe@me.com>

  Closes #5707 from Lewuathe/SPARK-6263 and squashes the following commits:

  16863ea [lewuathe] Merge master
  3fc27e7 [lewuathe] Merge branch 'master' into SPARK-6263
  6084e9c [lewuathe] Resolv conflict
  d2aa2a0 [lewuathe] Resolv conflict
  9c329d8 [lewuathe] Fix efficiency
  3a12a2d [lewuathe] Merge branch 'master' into SPARK-6263
  1d4714b [lewuathe] Fix style
  b29e2bc [lewuathe] Remove scipy dependencies
  e32eb40 [lewuathe] Merge branch 'master' into SPARK-6263
  25d3c9d [lewuathe] Remove unnecessary imports
  7ec04db [lewuathe] Resolv conflict
  1502d13 [lewuathe] Resolv conflict
  d6bd416 [lewuathe] Check existence of scipy.sparse
  5d555b1 [lewuathe] Construct scipy.sparse matrix
  c345a44 [lewuathe] Merge branch 'master' into SPARK-6263
  b8b5ef7 [lewuathe] Fix unnecessary sort method
  d254be7 [lewuathe] Merge branch 'master' into SPARK-6263
  62a9c7e [lewuathe] Fix appendBias return type
  454c73d [lewuathe] Merge branch 'master' into SPARK-6263
  a353354 [lewuathe] Remove unnecessary appendBias implementation
  44295c2 [lewuathe] Merge branch 'master' into SPARK-6263
  64f72ad [lewuathe] Merge branch 'master' into SPARK-6263
  c728046 [lewuathe] Fix style
  2980569 [lewuathe] [SPARK-6263] Python MLlib API missing items: Utils
* [SPARK-8621] [SQL] support empty string as column name | Wenchen Fan | 2015-07-01 | 2 files | -2/+9

  Improve the empty check in `parseAttributeName` so that an empty string is allowed as a column name.

  Close https://github.com/apache/spark/pull/7117

  Author: Wenchen Fan <cloud0fan@outlook.com>

  Closes #7149 from cloud-fan/8621 and squashes the following commits:

  efa9e3e [Wenchen Fan] support empty string
* [SPARK-8752][SQL] Add ExpectsInputTypes trait for defining expected input types. | Reynold Xin | 2015-07-01 | 7 files | -24/+44

  This patch doesn't actually introduce any code that uses the new ExpectsInputTypes. It just adds the trait so others can use it. Also renamed the old expectsInputTypes function to just inputTypes.

  We should add implicit type casting also in the future.

  Author: Reynold Xin <rxin@databricks.com>

  Closes #7151 from rxin/expects-input-types and squashes the following commits:

  16cf07b [Reynold Xin] [SPARK-8752][SQL] Add ExpectsInputTypes trait for defining expected input types.
* [SPARK-7714] [SPARKR] SparkR tests should use more specific expectations than expect_true | Sun Rui | 2015-07-01 | 9 files | -194/+194

  1. Update the pattern 'expect_true(a == b)' to 'expect_equal(a, b)'.
  2. Update the pattern 'expect_true(inherits(a, b))' to 'expect_is(a, b)'.
  3. Update the pattern 'expect_true(identical(a, b))' to 'expect_identical(a, b)'.

  Author: Sun Rui <rui.sun@intel.com>

  Closes #7152 from sun-rui/SPARK-7714 and squashes the following commits:

  8ad2440 [Sun Rui] Fix test case errors.
  8fe9f0c [Sun Rui] Update the pattern 'expect_true(identical(a, b))' to 'expect_identical(a, b)'.
  f1b8005 [Sun Rui] Update the pattern 'expect_true(inherits(a, b))' to 'expect_is(a, b)'.
  f631e94 [Sun Rui] Update the pattern 'expect_true(a == b)' to 'expect_equal(a, b)'.
* [SPARK-8763] [PYSPARK] executing run-tests.py with Python 2.6 fails with absence of subprocess.check_output function | cocoatomo | 2015-07-01 | 1 file | -2/+19

  Running run-tests.py with Python 2.6 causes the following error:

  ```
  Running PySpark tests. Output is in python//Users/tomohiko/.jenkins/jobs/pyspark_test/workspace/python/unit-tests.log
  Will test against the following Python executables: ['python2.6', 'python3.4', 'pypy']
  Will test the following Python modules: ['pyspark-core', 'pyspark-ml', 'pyspark-mllib', 'pyspark-sql', 'pyspark-streaming']
  Traceback (most recent call last):
    File "./python/run-tests.py", line 196, in <module>
      main()
    File "./python/run-tests.py", line 159, in main
      python_implementation = subprocess.check_output(
  AttributeError: 'module' object has no attribute 'check_output'
  ...
  ```

  The cause of this error is the use of the subprocess.check_output function, which has existed only since Python 2.7.
  (ref. https://docs.python.org/2.7/library/subprocess.html#subprocess.check_output)

  Author: cocoatomo <cocoatomo77@gmail.com>

  Closes #7161 from cocoatomo/issues/8763-test-fails-py26 and squashes the following commits:

  cf4f901 [cocoatomo] [SPARK-8763] backport process.check_output function from Python 2.7
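A backport of `check_output` can be built from `subprocess.Popen`, which is available in older interpreters. The sketch below follows the general shape of the standard-library function; it is not the code merged in this commit, and the name `check_output_compat` is hypothetical.

```python
import subprocess
import sys


def check_output_compat(*popenargs, **kwargs):
    """Run a command and return its stdout, raising on a non-zero exit code."""
    if "stdout" in kwargs:
        raise ValueError("stdout argument not allowed, it will be overridden.")
    process = subprocess.Popen(stdout=subprocess.PIPE, *popenargs, **kwargs)
    output, _ = process.communicate()
    retcode = process.poll()
    if retcode:
        # Report the command that failed, mirroring check_output's behavior.
        cmd = kwargs.get("args")
        if cmd is None:
            cmd = popenargs[0]
        raise subprocess.CalledProcessError(retcode, cmd)
    return output


# Usage: ask the current interpreter to print a short message.
greeting = check_output_compat([sys.executable, "-c", "print('ok')"])
```

A caller could then pick `subprocess.check_output` when it exists and fall back to a helper like this one otherwise.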
* [SPARK-8750][SQL] Remove the closure in functions.callUdf. | Reynold Xin | 2015-07-01 | 1 file | -1/+9

  Author: Reynold Xin <rxin@databricks.com>

  Closes #7148 from rxin/calludf-closure and squashes the following commits:

  00df372 [Reynold Xin] Fixed index out of bound exception.
  4beba76 [Reynold Xin] [SPARK-8750][SQL] Remove the closure in functions.callUdf.
* [SQL] [MINOR] remove internalRowRDD in DataFrame | Wenchen Fan | 2015-07-01 | 4 files | -7/+5

  Developers are already familiar with `queryExecution.toRDD` as the internal row RDD, and we should not add a new concept.

  Author: Wenchen Fan <cloud0fan@outlook.com>

  Closes #7116 from cloud-fan/internal-rdd and squashes the following commits:

  24756ca [Wenchen Fan] remove internalRowRDD
* [SPARK-8749][SQL] Remove HiveTypeCoercion trait. | Reynold Xin | 2015-07-01 | 3 files | -44/+33

  Moved all the rules into the companion object.

  Author: Reynold Xin <rxin@databricks.com>

  Closes #7147 from rxin/SPARK-8749 and squashes the following commits:

  c1c6dc0 [Reynold Xin] [SPARK-8749][SQL] Remove HiveTypeCoercion trait.
* [SPARK-8748][SQL] Move castability test out from Cast case class into Cast object. | Reynold Xin | 2015-06-30 | 1 file | -66/+78

  This patch moves the resolve function in the Cast case class into the companion object, and renames it canCast. We can then use this in the analyzer without a Cast expr.

  Author: Reynold Xin <rxin@databricks.com>

  Closes #7145 from rxin/cast and squashes the following commits:

  cd086a9 [Reynold Xin] Whitespace changes.
  4d2d989 [Reynold Xin] [SPARK-8748][SQL] Move castability test out from Cast case class into Cast object.
* [SPARK-6602][Core]Remove unnecessary synchronized | zsxwing | 2015-06-30 | 1 file | -5/+2

  A follow-up PR to address https://github.com/apache/spark/pull/5392#discussion_r33627528

  Author: zsxwing <zsxwing@gmail.com>

  Closes #7141 from zsxwing/pr5392-follow-up and squashes the following commits:

  fcf7b50 [zsxwing] Remove unnecessary synchronized
* [SPARK-8535] [PYSPARK] PySpark : Can't create DataFrame from Pandas dataframe with no explicit column name | x1- | 2015-06-30 | 1 file | -1/+3

  The implicit names in `pandas.columns` are Int, but the `StructField` JSON expects `String`, so I think the `pandas.columns` values should be converted to `String`.

  ### issue

  * [SPARK-8535 PySpark : Can't create DataFrame from Pandas dataframe with no explicit column name](https://issues.apache.org/jira/browse/SPARK-8535)

  Author: x1- <viva008@gmail.com>

  Closes #7124 from x1-/SPARK-8535 and squashes the following commits:

  d68fd38 [x1-] modify unit-test using pandas.
  ea1897d [x1-] For implicit name of pandas.columns are Int, so should be convert to String.
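The conversion described above amounts to turning every column label into a string before building the schema. A minimal sketch, using a plain list in place of `pandas.columns` so that it runs without pandas; `normalize_column_names` is a hypothetical helper, not the function changed by this commit.

```python
def normalize_column_names(columns):
    """Return column labels as strings, leaving existing strings untouched."""
    return [c if isinstance(c, str) else str(c) for c in columns]


# A DataFrame created without explicit column names gets integer labels.
implicit_labels = list(range(3))
schema_names = normalize_column_names(implicit_labels)
```

Mixed labels are handled the same way, so `["a", 1]` becomes `["a", "1"]`.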
* [SPARK-8471] [ML] Rename DiscreteCosineTransformer to DCT | Feynman Liang | 2015-06-30 | 3 files | -8/+8

  Rename DiscreteCosineTransformer and related classes to DCT.

  Author: Feynman Liang <fliang@databricks.com>

  Closes #7138 from feynmanliang/dct-features and squashes the following commits:

  e547b3e [Feynman Liang] Fix renaming bug
  9d5c9e4 [Feynman Liang] Lowercase JavaDCTSuite variable
  f9a8958 [Feynman Liang] Remove old files
  f8fe794 [Feynman Liang] Merge branch 'master' into dct-features
  894d0b2 [Feynman Liang] Rename DiscreteCosineTransformer to DCT
  433dbc7 [Feynman Liang] Test refactoring
  91e9636 [Feynman Liang] Style guide and test helper refactor
  b5ac19c [Feynman Liang] Use Vector types, add Java test
  530983a [Feynman Liang] Tests for other numeric datatypes
  195d7aa [Feynman Liang] Implement support for arbitrary numeric types
  95d4939 [Feynman Liang] Working DCT for 1D Doubles
* [SPARK-6602][Core] Update Master, Worker, Client, AppClient and related classes to use RpcEndpoint | zsxwing | 2015-06-30 | 27 files | -633/+806

  This PR updates the remaining Actors in core to RpcEndpoint.

  Because there is no `ActorSelection` in RpcEnv, I changed the logic of `registerWithMaster` in Worker and AppClient to avoid blocking the message loop. These changes need to be reviewed carefully.

  Author: zsxwing <zsxwing@gmail.com>

  Closes #5392 from zsxwing/rpc-rewrite-part3 and squashes the following commits:

  2de7bed [zsxwing] Merge branch 'master' into rpc-rewrite-part3
  f12d943 [zsxwing] Address comments
  9137b82 [zsxwing] Fix the code style
  e734c71 [zsxwing] Merge branch 'master' into rpc-rewrite-part3
  2d24fb5 [zsxwing] Fix the code style
  5a82374 [zsxwing] Merge branch 'master' into rpc-rewrite-part3
  fa47110 [zsxwing] Merge branch 'master' into rpc-rewrite-part3
  72304f0 [zsxwing] Update the error strategy for AkkaRpcEnv
  e56cb16 [zsxwing] Always send failure back to the sender
  a7b86e6 [zsxwing] Use JFuture for java.util.concurrent.Future
  aa34b9b [zsxwing] Fix the code style
  bd541e7 [zsxwing] Merge branch 'master' into rpc-rewrite-part3
  25a84d8 [zsxwing] Use ThreadUtils
  060ff31 [zsxwing] Merge branch 'master' into rpc-rewrite-part3
  dbfc916 [zsxwing] Improve the docs and comments
  837927e [zsxwing] Merge branch 'master' into rpc-rewrite-part3
  5c27f97 [zsxwing] Merge branch 'master' into rpc-rewrite-part3
  fadbb9e [zsxwing] Fix the code style
  6637e3c [zsxwing] Merge remote-tracking branch 'origin/master' into rpc-rewrite-part3
  7fdee0e [zsxwing] Fix the return type to ExecutorService and ScheduledExecutorService
  e8ad0a5 [zsxwing] Fix the code style
  6b2a104 [zsxwing] Log error and use SparkExitCode.UNCAUGHT_EXCEPTION exit code
  fbf3194 [zsxwing] Add Utils.newDaemonSingleThreadExecutor and newDaemonSingleThreadScheduledExecutor
  b776817 [zsxwing] Update Master, Worker, Client, AppClient and related classes to use RpcEndpoint
* [SPARK-8727] [SQL] Missing python api; md5, log2 | Tarek Auel | 2015-06-30 | 1 file | -13/+52

  Jira: https://issues.apache.org/jira/browse/SPARK-8727

  Author: Tarek Auel <tarek.auel@gmail.com>
  Author: Tarek Auel <tarek.auel@googlemail.com>

  Closes #7114 from tarekauel/missing-python and squashes the following commits:

  ef4c61b [Tarek Auel] [SPARK-8727] revert dataframe change
  4029d4d [Tarek Auel] removed dataframe pi and e unit test
  66f0d2b [Tarek Auel] removed pi and e from python api and dataframe api; added _to_java_column(col) for strlen
  4d07318 [Tarek Auel] fixed python unit test
  45f2bee [Tarek Auel] fixed result of pi and e
  c39f47b [Tarek Auel] add python api
  bd50a3a [Tarek Auel] add missing python functions
* [SPARK-8741] [SQL] Remove e and pi from DataFrame functions. (Reynold Xin, 2015-06-30, 2 files, -26/+0)
| Author: Reynold Xin <rxin@databricks.com>
|
| Closes #7137 from rxin/SPARK-8741 and squashes the following commits:
|
| 32c7e75 [Reynold Xin] [SPARK-8741][SQL] Remove e and pi from DataFrame functions.
* [SPARK-7739] [MLLIB] Improve ChiSqSelector example code in user guide (sethah, 2015-06-30, 1 file, -6/+8)
| Author: sethah <seth.hendrickson16@gmail.com>
|
| Closes #7029 from sethah/working_on_SPARK-7739 and squashes the following commits:
|
| ef96916 [sethah] Fixing some style issues
| efea1f8 [sethah] adding clarification to ChiSqSelector example
* [SPARK-8738] [SQL] [PYSPARK] capture SQL AnalysisException in Python API (Davies Liu, 2015-06-30, 4 files, -1/+65)
| Capture the AnalysisException in SQL, hide the long Java stack trace, and show only the error message.
|
| cc rxin
|
| Author: Davies Liu <davies@databricks.com>
|
| Closes #7135 from davies/ananylis and squashes the following commits:
|
| dad7ae7 [Davies Liu] add comment
| ec0c0e8 [Davies Liu] Update utils.py
| cdd7edd [Davies Liu] add doc
| 7b044c2 [Davies Liu] fix python 3
| f84d3bd [Davies Liu] capture SQL AnalysisException in Python API
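The idea in the commit above can be sketched roughly as follows. This is an illustration, not the code from the commit: a decorator catches a JVM-side error and re-raises a short Python exception that carries only the message. `FakeJavaError`, `AnalysisError`, `capture_analysis_exception`, and `run_query` are hypothetical names; the Py4J error objects that PySpark actually deals with are not modeled here.

```python
import functools

PREFIX = "org.apache.spark.sql.AnalysisException: "


class FakeJavaError(Exception):
    """Stand-in for a JVM-side error: a description plus a long stack trace (hypothetical)."""

    def __init__(self, description, stack_trace):
        super().__init__(description + "\n" + stack_trace)
        self.description = description
        self.stack_trace = stack_trace


class AnalysisError(Exception):
    """Short Python-side exception that carries only the message (hypothetical name)."""


def capture_analysis_exception(func):
    """Re-raise analysis errors without the Java stack trace; let everything else through."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except FakeJavaError as err:
            if err.description.startswith(PREFIX):
                # Keep only the readable message; "from None" drops the chained traceback.
                raise AnalysisError(err.description[len(PREFIX):]) from None
            raise

    return wrapper


@capture_analysis_exception
def run_query(sql):
    # Simulates the JVM rejecting a query during analysis.
    raise FakeJavaError(PREFIX + "cannot resolve 'missing_col'", "at ... (many frames)")
```

Calling `run_query(...)` here raises `AnalysisError` with just the message, while any error that does not carry the AnalysisException prefix propagates unchanged.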
* [SPARK-8739] [WEB UI] [WINDOWS] A illegal character `\r` can be contained in StagePage. (Kousuke Saruta, 2015-06-30, 1 file, -1/+1)
| This issue was reported by saurfang. Thanks!
|
| StagePage.scala contains the following code.
|
| ```
|       |width="$serializationTimeProportion%"></rect>
|       |<rect class="getting-result-time-proportion"
|       |x="$gettingResultTimeProportionPos%" y="0px" height="26px"
|       |width="$gettingResultTimeProportion%"></rect></svg>',
|       |'start': new Date($launchTime),
|       |'end': new Date($finishTime)
|       |}
|       |""".stripMargin.replaceAll("\n", " ")
| ```
|
| The last `replaceAll("\n", " ")` doesn't work when we check out and build the source code on Windows and deploy on Linux. That is because checking out the source code on Windows converts the line endings to `"\r\n"`, and `replaceAll("\n", " ")` replaces only `"\n"`.
|
| Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
|
| Closes #7133 from sarutak/SPARK-8739 and squashes the following commits:
|
| 17fb044 [Kousuke Saruta] Fixed a new-line-code issue
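The line-ending problem described above can be reproduced in a few lines. This is a Python analogue of the Scala `replaceAll` call, with a made-up template string; the regular expression shown is one possible way to handle both line endings and is not necessarily the change this commit applied.

```python
import re

# A multi-line template as it would look after a Windows checkout (CRLF line endings).
template = "{\r\n'start': 1,\r\n'end': 2\r\n}"

# Replacing only "\n" leaves every "\r" behind in the output.
naive = template.replace("\n", " ")

# Matching an optional "\r" before "\n" handles both LF and CRLF sources.
robust = re.sub(r"\r?\n", " ", template)

print(repr(naive))
print(repr(robust))
```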
* [SPARK-8563] [MLLIB] Fixed a bug so that IndexedRowMatrix.computeSVD().U.numCols = k (lee19, 2015-06-30, 2 files, -1/+12)
| Sorry, I closed https://github.com/apache/spark/pull/6949 by mistake. I have pushed the code again and added a test.
|
| > There is a bug that `U.numCols() = self.nCols` in `IndexedRowMatrix.computeSVD()`.
| > It should have been `U.numCols() = k = svd.U.numCols()`.
| >
| > ```
| > self = U * sigma * V.transpose
| > (m x n) = (m x n) * (k x k) * (k x n) //ASIS
| > -->
| > (m x n) = (m x k) * (k x k) * (k x n) //TOBE
| > ```
|
| Author: lee19 <lee19@live.co.kr>
|
| Closes #6953 from lee19/MLlibBugfix and squashes the following commits:
|
| c1812a0 [lee19] [SPARK-8563] [MLlib] Used nRows instead of numRows() to reduce a burden.
| 4b9803b [lee19] [SPARK-8563] [MLlib] Fixed a build error.
| c2ccd89 [lee19] Added a unit test that validates matrix sizes of svd for [SPARK-8563][MLlib]
| 8373424 [lee19] [SPARK-8563][MLlib] Fixed a bug so that IndexedRowMatrix.computeSVD().U.numCols = k
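The dimension argument in the commit above can be checked mechanically. The sketch below uses a hypothetical helper, `product_shape`, and arbitrary example sizes; it only does shape arithmetic and does not call Spark.

```python
def product_shape(*shapes):
    """Return the shape of a chained matrix product, or raise if inner dimensions disagree."""
    rows, cols = shapes[0]
    for next_rows, next_cols in shapes[1:]:
        if cols != next_rows:
            raise ValueError(f"cannot multiply: {cols} columns vs {next_rows} rows")
        cols = next_cols
    return rows, cols


m, n, k = 6, 4, 2  # arbitrary example sizes with k < n

# TOBE: U is m x k, sigma is k x k, V.transpose is k x n.
fixed = product_shape((m, k), (k, k), (k, n))

# ASIS: a U reported as m x n cannot be multiplied by a k x k sigma when n != k.
try:
    product_shape((m, n), (k, k), (k, n))
    asis_ok = True
except ValueError:
    asis_ok = False

print(fixed, asis_ok)
```

With `U` sized `m x k` the product has the shape of the original matrix, `m x n`; with `U` sized `m x n` the product is not even defined unless `n` equals `k`.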
* [SPARK-8705] [WEBUI] Don't display rects when totalExecutionTime is 0 (zsxwing, 2015-06-30, 1 file, -22/+30)
| Because `System.currentTimeMillis()` is not accurate for tasks that only need several milliseconds, sometimes `totalExecutionTime` in `makeTimeline` will be 0. If `totalExecutionTime` is 0, there will be the following error in the console.
|
| ![screen shot 2015-06-29 at 7 08 55 pm](https://cloud.githubusercontent.com/assets/1000778/8406776/5cd38e04-1e92-11e5-89f2-0c5134fe4b6b.png)
|
| This PR fixes it by using an empty svg tag when `totalExecutionTime` is 0. This is a screenshot, taken after the fix, of a task whose `totalExecutionTime` is 0.
|
| ![screen shot 2015-06-30 at 12 26 52 am](https://cloud.githubusercontent.com/assets/1000778/8412896/7b33b4be-1ebf-11e5-9100-d6d656af3747.png)
|
| Author: zsxwing <zsxwing@gmail.com>
|
| Closes #7088 from zsxwing/SPARK-8705 and squashes the following commits:
|
| 9ee4ef5 [zsxwing] Address comments
| ef2ecfa [zsxwing] Don't display rects when totalExecutionTime is 0
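A rough sketch of the guard described above, under the assumption that each rect's width is a share of the total time. `proportion_rects` and `timeline_svg` are hypothetical names, and the markup is simplified compared with what StagePage actually generates.

```python
def proportion_rects(durations_ms):
    """Return (name, percent-width) pairs for a timeline bar, or [] when there is nothing to draw."""
    total = sum(durations_ms.values())
    if total == 0:
        # Millisecond clocks can report 0 for very short tasks; a share of a zero
        # total is undefined, so emit no rects at all.
        return []
    return [(name, 100.0 * ms / total) for name, ms in durations_ms.items()]


def timeline_svg(durations_ms):
    """Build a simplified svg element; it is empty when the total time is 0."""
    rects = "".join(
        f'<rect class="{name}" width="{pct}%"></rect>'
        for name, pct in proportion_rects(durations_ms)
    )
    return f"<svg>{rects}</svg>"
```

For example, `timeline_svg({"run": 0, "serialize": 0})` yields an empty `<svg></svg>` instead of rects with undefined widths.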
* [SPARK-8736] [ML] GBTRegressor should not threshold prediction (Joseph K. Bradley, 2015-06-30, 2 files, -3/+23)
| Changed GBTRegressor so it does NOT threshold the prediction. Added test which fails with bug but works after fix.
|
| CC: feynmanliang mengxr
|
| Author: Joseph K. Bradley <joseph@databricks.com>
|
| Closes #7134 from jkbradley/gbrt-fix and squashes the following commits:
|
| 613b90e [Joseph K. Bradley] Changed GBTRegressor so it does NOT threshold the prediction
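To illustrate what "thresholding" means here, the sketch below assumes the common formulation in which a boosted ensemble's prediction is a weighted sum of per-tree outputs, and that a binary classifier cuts that sum at 0. All function names are hypothetical and this is not the Spark implementation.

```python
def raw_score(tree_predictions, tree_weights):
    """Weighted sum of the per-tree predictions."""
    return sum(w * p for p, w in zip(tree_predictions, tree_weights))


def predict_regression(tree_predictions, tree_weights):
    # A regressor returns the raw weighted sum unchanged.
    return raw_score(tree_predictions, tree_weights)


def predict_classification(tree_predictions, tree_weights):
    # A binary classifier thresholds the same sum into a 0/1 label.
    return 1.0 if raw_score(tree_predictions, tree_weights) > 0.0 else 0.0
```

Applying the classifier's threshold in a regressor would collapse every prediction to 0 or 1, which is the kind of behavior the commit above removes.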
* [SPARK-8372] Do not show applications that haven't recorded their app ID yet. (Marcelo Vanzin, 2015-06-30, 2 files, -60/+147)
| Showing these applications may lead to weird behavior in the History Server. For old logs, if the app ID is recorded later, you may end up with a duplicate entry. For new logs, the app might be listed with a ".inprogress" suffix.
|
| So ignore those, but still allow old applications that don't record app IDs at all (1.0 and 1.1) to be shown.
|
| Author: Marcelo Vanzin <vanzin@cloudera.com>
| Author: Carson Wang <carson.wang@intel.com>
|
| Closes #7097 from vanzin/SPARK-8372 and squashes the following commits:
|
| a24eab2 [Marcelo Vanzin] Feedback.
| 112ae8f [Marcelo Vanzin] Merge branch 'master' into SPARK-8372
| 7b91b74 [Marcelo Vanzin] Handle logs generated by 1.0 and 1.1.
| 1eca3fe [Carson Wang] [SPARK-8372] History server shows incorrect information for application not started