Log format: Commit message (Author, Date; Files changed, -Deletions/+Insertions)

...
* [SPARK-5716] [SQL] Support TOK_CHARSETLITERAL in HiveQl (Daoyuan Wang, 2015-02-10; 7 files, -0/+8)

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #4502 from adrian-wang/utf8 and squashes the following commits:
4d7b0ee [Daoyuan Wang] remove useless import
606f981 [Daoyuan Wang] support TOK_CHARSETLITERAL in HiveQl

* [Spark-5717] [MLlib] add stop and reorganize import (JqueryFan, 2015-02-10; 2 files, -1/+2)

Trivial: add sc.stop() and reorganize imports. https://issues.apache.org/jira/browse/SPARK-5717

Author: JqueryFan <firing@126.com>
Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #4503 from hhbyyh/scstop and squashes the following commits:
7837a2c [JqueryFan] revert import change
2e85cc1 [Yuhao Yang] add stop and reorganize import

* [SPARK-1805] [EC2] Validate instance types (Nicholas Chammas, 2015-02-10; 1 file, -51/+81)

Addresses [SPARK-1805](https://issues.apache.org/jira/browse/SPARK-1805), though it doesn't resolve it completely.

Error out quickly if the user asks for the master and slaves to have different AMI virtualization types, since we don't currently support that. In addition, we print warnings if the given instance types are not recognized, though I would prefer that we error out. Elsewhere in the script it seems [we allow unrecognized instance types](https://github.com/apache/spark/blob/5de14cc2763a8211f77eeb55940dec025822eb78/ec2/spark_ec2.py#L331), though I think we should remove that. It's messy, but it should serve us until we enhance spark-ec2 to support clusters with mixed virtualization types.

Author: Nicholas Chammas <nicholas.chammas@gmail.com>

Closes #4455 from nchammas/ec2-master-slave-different-virtualization and squashes the following commits:
ce28609 [Nicholas Chammas] fix style
b0adba0 [Nicholas Chammas] validate input instance types

* [SPARK-5700] [SQL] [Build] Bumps jets3t to 0.9.3 for hadoop-2.3 and hadoop-2.4 profiles (Cheng Lian, 2015-02-10; 2 files, -5/+2)

This is a follow-up PR for #4454 and #4484. JetS3t 0.9.2 contains a log4j.properties file inside the artifact and breaks our tests (see SPARK-5696). This is fixed in 0.9.3. This PR also reverts hotfix changes introduced in #4484. The reason is that asking users to configure HiveThriftServer2 logging configurations in hive-log4j.properties can be unintuitive.

Author: Cheng Lian <lian@databricks.com>

Closes #4499 from liancheng/spark-5700 and squashes the following commits:
4f020c7 [Cheng Lian] Bumps jets3t to 0.9.3 for hadoop-2.3 and hadoop-2.4 profiles

* SPARK-5239 [CORE] JdbcRDD throws "java.lang.AbstractMethodError: oracle.jdbc.driver.xxxxxx.isClosed()Z" (Sean Owen, 2015-02-10; 2 files, -6/+6)

This is a completion of https://github.com/apache/spark/pull/4033 which was withdrawn for some reason.

Author: Sean Owen <sowen@cloudera.com>

Closes #4470 from srowen/SPARK-5239.2 and squashes the following commits:
2398bde [Sean Owen] Avoid use of JDBC4-only isClosed()

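The squashed commit points at the underlying pattern: avoid the JDBC4-only `isClosed()` probe, which drivers compiled against pre-JDBC4 APIs (such as older Oracle jars) leave unimplemented, hence the `AbstractMethodError`, and instead close unconditionally inside a guard. A minimal sketch with a hypothetical helper name, not the PR's actual code:

```scala
import java.sql.ResultSet

// Close without calling isClosed(): close() exists in every JDBC version,
// while isClosed() was only added in JDBC4 and triggers AbstractMethodError
// on drivers built against older APIs.
def closeQuietly(rs: ResultSet): Unit = {
  try {
    if (rs != null) rs.close()
  } catch {
    case e: Exception => println(s"Exception closing resultset: $e")
  }
}
```
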
* [SPARK-4964][Streaming][Kafka] More updates to Exactly-once Kafka stream (Tathagata Das, 2015-02-09; 17 files, -262/+1048)

Changes:
- Added example
- Added a critical unit test that verifies that offset ranges can be recovered through checkpoints

Might add more changes.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #4384 from tdas/new-kafka-fixes and squashes the following commits:
7c931c3 [Tathagata Das] Small update
3ed9284 [Tathagata Das] updated scala doc
83d0402 [Tathagata Das] Added JavaDirectKafkaWordCount example.
26df23c [Tathagata Das] Updates based on PR comments from Cody
e4abf69 [Tathagata Das] Scala doc improvements and stuff.
bb65232 [Tathagata Das] Fixed test bug and refactored KafkaStreamSuite
50f2b56 [Tathagata Das] Added Java API and added more Scala and Java unit tests. Also updated docs.
e73589c [Tathagata Das] Minor changes.
4986784 [Tathagata Das] Added unit test to kafka offset recovery
6a91cab [Tathagata Das] Added example

* [SPARK-5597][MLLIB] save/load for decision trees and ensembles (Joseph K. Bradley, 2015-02-09; 8 files, -38/+561)

This is based on #4444 from jkbradley with the following changes:

1. Node schema updated to
~~~
treeId: int
nodeId: Int
predict/
  |- predict: Double
  |- prob: Double
impurity: Double
isLeaf: Boolean
split/
  |- feature: Int
  |- threshold: Double
  |- featureType: Int
  |- categories: Array[Double]
leftNodeId: Integer
rightNodeId: Integer
infoGain: Double
~~~
2. Some refactoring of the implementation.

Closes #4444.

Author: Joseph K. Bradley <joseph@databricks.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #4493 from mengxr/SPARK-5597 and squashes the following commits:
75e3bb6 [Xiangrui Meng] fix style
2b0033d [Xiangrui Meng] update tree export schema and refactor the implementation
45873a2 [Joseph K. Bradley] org imports
1d4c264 [Joseph K. Bradley] Added save/load for tree ensembles
dcdbf85 [Joseph K. Bradley] added save/load for decision tree but need to generalize it to ensembles

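The new API follows MLlib's save/load convention; a brief usage sketch, where the training call, data file, and output path are assumed for illustration rather than taken from the PR:

```scala
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.mllib.util.MLUtils

// Assumes sc is an existing SparkContext and the LibSVM file exists.
val data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")
val model = DecisionTree.trainClassifier(data, numClasses = 2,
  categoricalFeaturesInfo = Map[Int, Int](), impurity = "gini",
  maxDepth = 5, maxBins = 32)

model.save(sc, "myTreeModel") // persists nodes using the schema above
val sameModel = DecisionTreeModel.load(sc, "myTreeModel")
```
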
* [SQL] Remove the duplicated code (Cheng Hao, 2015-02-09; 1 file, -5/+0)

Author: Cheng Hao <hao.cheng@intel.com>

Closes #4494 from chenghao-intel/tiny_code_change and squashes the following commits:
450dfe7 [Cheng Hao] remove the duplicated code

* [SPARK-5701] Only set ShuffleReadMetrics when task has shuffle deps (Kay Ousterhout, 2015-02-09; 2 files, -10/+40)

The updateShuffleReadMetrics method in TaskMetrics (called by the executor heartbeater) will currently always add a ShuffleReadMetrics to TaskMetrics (with values set to 0), even when the task didn't read any shuffle data. ShuffleReadMetrics should only be added if the task reads shuffle data.

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #4488 from kayousterhout/SPARK-5701 and squashes the following commits:
673ed58 [Kay Ousterhout] SPARK-5701: Only set ShuffleReadMetrics when task has shuffle deps

* [SPARK-5703] AllJobsPage throws empty.max exception (Andrew Or, 2015-02-09; 1 file, -1/+3)

If you have a `SparkListenerJobEnd` event without the corresponding `SparkListenerJobStart` event, then `JobProgressListener` will create an empty `JobUIData` with an empty `stageIds` list. However, later in `AllJobsPage` we call `stageIds.max`. If this is empty, it will throw an exception. This crashed my history server.

Author: Andrew Or <andrew@databricks.com>

Closes #4490 from andrewor14/jobs-page-max and squashes the following commits:
21797d3 [Andrew Or] Check nonEmpty before calling max

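The squash note ("Check nonEmpty before calling max") is the standard guard for this Scala pitfall; a minimal self-contained illustration with invented values:

```scala
// Seq.max on an empty sequence throws UnsupportedOperationException: empty.max
val stageIds: Seq[Int] = Seq.empty
val lastStageId = if (stageIds.nonEmpty) stageIds.max else -1 // safe fallback
```
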
* [SPARK-2996] Implement userClassPathFirst for driver, yarn. (Marcelo Vanzin, 2015-02-09; 27 files, -348/+736)

Yarn's config option `spark.yarn.user.classpath.first` does not work the same way as `spark.files.userClassPathFirst`; Yarn's version is a lot more dangerous, in that it modifies the system classpath, instead of restricting the changes to the user's class loader. So this change implements the behavior of the latter for Yarn, and deprecates the more dangerous choice.

To be able to achieve feature-parity, I also implemented the option for drivers (the existing option only applies to executors). So now there are two options, each controlling whether to apply userClassPathFirst to the driver or executors. The old option was deprecated, and aliased to the new one (`spark.executor.userClassPathFirst`).

The existing "child-first" class loader also had to be fixed. It didn't handle resources, and it was also doing some things that ended up causing JVM errors depending on how things were being called.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #3233 from vanzin/SPARK-2996 and squashes the following commits:
9cf9cf1 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
a1499e2 [Marcelo Vanzin] Remove SPARK_HOME propagation.
fa7df88 [Marcelo Vanzin] Remove 'test.resource' file, create it dynamically.
a8c69f1 [Marcelo Vanzin] Review feedback.
cabf962 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
a1b8d7e [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
3f768e3 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
2ce3c7a [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
0e6d6be [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
70d4044 [Marcelo Vanzin] Fix pyspark/yarn-cluster test.
0fe7777 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
0e6ef19 [Marcelo Vanzin] Move class loaders around and make names more meaninful.
fe970a7 [Marcelo Vanzin] Review feedback.
25d4fed [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
3cb6498 [Marcelo Vanzin] Call the right loadClass() method on the parent.
fbb8ab5 [Marcelo Vanzin] Add locking in loadClass() to avoid deadlocks.
2e6c4b7 [Marcelo Vanzin] Mention new setting in documentation.
b6497f9 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
a10f379 [Marcelo Vanzin] Some feedback.
3730151 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
f513871 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
44010b6 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
7b57cba [Marcelo Vanzin] Remove now outdated message.
5304d64 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
35949c8 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
54e1a98 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
d1273b2 [Marcelo Vanzin] Add test file to rat exclude.
fa1aafa [Marcelo Vanzin] Remove write check on user jars.
89d8072 [Marcelo Vanzin] Cleanups.
a963ea3 [Marcelo Vanzin] Implement spark.driver.userClassPathFirst for standalone cluster mode.
50afa5f [Marcelo Vanzin] Fix Yarn executor command line.
7d14397 [Marcelo Vanzin] Register user jars in executor up front.
7f8603c [Marcelo Vanzin] Fix yarn-cluster mode without userClassPathFirst.
20373f5 [Marcelo Vanzin] Fix ClientBaseSuite.
55c88fa [Marcelo Vanzin] Run all Yarn integration tests via spark-submit.
0b64d92 [Marcelo Vanzin] Add deprecation warning to yarn option.
4a84d87 [Marcelo Vanzin] Fix the child-first class loader.
d0394b8 [Marcelo Vanzin] Add "deprecated configs" to SparkConf.
46d8cf2 [Marcelo Vanzin] Update doc with new option, change name to "userClassPathFirst".
a314f2d [Marcelo Vanzin] Enable driver class path isolation in SparkSubmit.
91f7e54 [Marcelo Vanzin] [yarn] Enable executor class path isolation.
a853e74 [Marcelo Vanzin] Re-work CoarseGrainedExecutorBackend command line arguments.
89522ef [Marcelo Vanzin] Add class path isolation support for Yarn cluster mode.

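A minimal sketch of a child-first class loader of the kind this change fixes; this is not Spark's exact code, and the class name is illustrative. It covers class loading only (the real fix also handles resources):

```scala
import java.net.{URL, URLClassLoader}

// Child-first loading: consult the user's jars before the parent. Passing
// null as URLClassLoader's parent makes super.loadClass search only the
// bootstrap loader plus our own URLs; we fall back to the real parent
// explicitly when that fails.
class ChildFirstURLClassLoader(urls: Array[URL], parent: ClassLoader)
  extends URLClassLoader(urls, null) {

  override def loadClass(name: String, resolve: Boolean): Class[_] = {
    // Lock per class name, mirroring the deadlock fix noted in the commits.
    getClassLoadingLock(name).synchronized {
      try {
        super.loadClass(name, resolve)
      } catch {
        case _: ClassNotFoundException => parent.loadClass(name)
      }
    }
  }
}
```
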
* SPARK-4900 [MLLIB] MLlib SingularValueDecomposition ARPACK IllegalStateException (Sean Owen, 2015-02-09; 1 file, -1/+1)

Fix ARPACK error code mapping, at least. It's not yet clear whether the error is what we expect from ARPACK. If it isn't, not sure if that's to be treated as an MLlib or Breeze issue.

Author: Sean Owen <sowen@cloudera.com>

Closes #4485 from srowen/SPARK-4900 and squashes the following commits:
7355aa1 [Sean Owen] Fix ARPACK error code mapping

* Add a config option to print DAG. (KaiXinXiaoLei, 2015-02-09; 1 file, -0/+3)

Add a config option, "spark.rddDebug.enable", that controls whether DAG info is printed. When it is true, information about the DAG is written to the log.

Author: KaiXinXiaoLei <huleilei1@huawei.com>

Closes #4257 from KaiXinXiaoLei/DAGprint and squashes the following commits:
d9fe42e [KaiXinXiaoLei] change log info
c27ee76 [KaiXinXiaoLei] change log info
83c2b32 [KaiXinXiaoLei] change config option
adcb14f [KaiXinXiaoLei] change the file.
f4e7b9e [KaiXinXiaoLei] add a option to print DAG

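What the option amounts to, as a sketch; the exact call site in the codebase is assumed, and `toDebugString` is the existing lineage printer:

```scala
// Log the RDD's dependency DAG when spark.rddDebug.enable is set.
val rdd = sc.parallelize(1 to 10).map(_ * 2)
if (sc.getConf.getBoolean("spark.rddDebug.enable", defaultValue = false)) {
  println(s"RDD dependency graph:\n${rdd.toDebugString}")
}
```
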
* [SPARK-5469] restructure pyspark.sql into multiple files (Davies Liu, 2015-02-09; 12 files, -2756/+2962)

All the DataTypes were moved into pyspark.sql.types. The changes can be tracked by `--find-copies-harder -M25`:

```
davies@localhost:~/work/spark/python$ git diff --find-copies-harder -M25 --numstat master..
2     5     python/docs/pyspark.ml.rst
0     3     python/docs/pyspark.mllib.rst
10    2     python/docs/pyspark.sql.rst
1     1     python/pyspark/mllib/linalg.py
21    14    python/pyspark/{mllib => sql}/__init__.py
14    2108  python/pyspark/{sql.py => sql/context.py}
10    1772  python/pyspark/{sql.py => sql/dataframe.py}
7     6     python/pyspark/{sql_tests.py => sql/tests.py}
8     1465  python/pyspark/{sql.py => sql/types.py}
4     2     python/run-tests
1     1     sql/core/src/main/scala/org/apache/spark/sql/test/ExamplePointUDT.scala
```

Also `git blame -C -C python/pyspark/sql/context.py` to track the history.

Author: Davies Liu <davies@databricks.com>

Closes #4479 from davies/sql and squashes the following commits:
1b5f0a5 [Davies Liu] Merge branch 'master' of github.com:apache/spark into sql
2b2b983 [Davies Liu] restructure pyspark.sql

* [SPARK-5698] Do not let user request negative # of executors (Andrew Or, 2015-02-09; 1 file, -0/+5)

Otherwise we might crash the ApplicationMaster. Why? Please see https://issues.apache.org/jira/browse/SPARK-5698. sryza I believe this is also relevant in your patch #4168.

Author: Andrew Or <andrew@databricks.com>

Closes #4483 from andrewor14/da-negative and squashes the following commits:
53ed955 [Andrew Or] Throw IllegalArgumentException instead
0e89fd5 [Andrew Or] Check against negative requests

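Per the squash notes, the fix is an up-front validation that throws IllegalArgumentException; a sketch with a hypothetical method name:

```scala
// Reject a negative executor count before it can reach the ApplicationMaster.
def validateExecutorRequest(requestedTotal: Int): Unit = {
  if (requestedTotal < 0) {
    throw new IllegalArgumentException(
      s"Attempted to request a negative number of executor(s): $requestedTotal")
  }
}
```
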
* [SPARK-5699] [SQL] [Tests] Runs hive-thriftserver tests whenever SQL code is modified (Cheng Lian, 2015-02-09; 1 file, -3/+3)

Author: Cheng Lian <lian@databricks.com>

Closes #4486 from liancheng/spark-5699 and squashes the following commits:
538001d [Cheng Lian] Runs hive-thriftserver tests whenever SQL code is modified

* [SPARK-5648][SQL] support "alter ... unset tblproperties("key")" (DoingDone9, 2015-02-09; 1 file, -0/+2)

Make HiveContext support "alter ... unset tblproperties("key")", e.g.:
alter view viewName unset tblproperties("k")
alter table tableName unset tblproperties("k")

Author: DoingDone9 <799203320@qq.com>

Closes #4424 from DoingDone9/unset and squashes the following commits:
6dd8bee [DoingDone9] support "alter ... unset tblproperties("key")"

* [SPARK-2096][SQL] support dot notation on array of struct (Wenchen Fan, 2015-02-09; 5 files, -22/+53)

~~The rule is simple: If you want `a.b` to work, then `a` must be some level of nested array of struct (level 0 means just a StructType). And the result of `a.b` is the same level of nested array of b-type. An optimization is: the resolve chain looks like `Attribute -> GetItem -> GetField -> GetField ...`, so we could transmit the nested array information between `GetItem` and `GetField` to avoid repeated computation of `innerDataType` and `containsNullList` of that nested array.~~

marmbrus Could you take a look?

To evaluate `a.b`: if `a` is an array of struct, then `a.b` means getting field `b` from each element of `a` and returning the results as an array.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #2405 from cloud-fan/nested-array-dot and squashes the following commits:
08a228a [Wenchen Fan] support dot notation on array of struct

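An illustration of the semantics on invented data; the table and column names are hypothetical, and a registered table is assumed:

```scala
// Suppose table `people` has a column friends: Array[Struct(name: String, age: Int)].
// With this change, `friends.name` extracts `name` from every array element:
val names = sqlContext.sql("SELECT friends.name FROM people")
// For a row where friends = [{name: "a", age: 1}, {name: "b", age: 2}],
// friends.name evaluates to ["a", "b"].
```
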
* [SPARK-5614][SQL] Predicate pushdown through Generate. (Lu Yan, 2015-02-09; 2 files, -1/+87)

Currently in Catalyst's rules, predicates cannot be pushed through "Generate" nodes. Furthermore, partition pruning in HiveTableScan cannot be applied to queries that involve "Generate". This makes such queries very inefficient. In practice, the rule finds patterns like

```scala
Filter(predicate, Generate(generator, _, _, _, grandChild))
```

and splits the predicate into two parts, according to whether or not each conjunct references a column generated by the Generate node. A new Filter is created for the conjuncts that can be pushed beneath the Generate node. If nothing is left for the original Filter, it is removed.

For example, the physical plan for the query

```sql
select len, bk
from s_server lateral view explode(len_arr) len_table as len
where len > 5 and day = '20150102';
```

where 'day' is a partition column in the metastore, looks like this in the current version of Spark SQL:

> Project [len, bk]
> > Filter ((len > "5") && "(day = "20150102")")
> > > Generate explode(len_arr), true, false
> > > > HiveTableScan [bk, len_arr, day], (MetastoreRelation default, s_server, None), None

But theoretically the plan should be:

> Project [len, bk]
> > Filter (len > "5")
> > > Generate explode(len_arr), true, false
> > > > HiveTableScan [bk, len_arr, day], (MetastoreRelation default, s_server, None), Some(day = "20150102")

where the partition-pruning predicate is pushed into the HiveTableScan node.

Author: Lu Yan <luyan02@baidu.com>

Closes #4394 from ianluyan/ppd and squashes the following commits:
a67dce9 [Lu Yan] Fix English grammar.
7cea911 [Lu Yan] Revised based on @marmbrus's opinions
ffc59fc [Lu Yan] [SPARK-5614][SQL] Predicate pushdown through Generate.

* [SPARK-5696] [SQL] [HOTFIX] Asks HiveThriftServer2 to re-initialize log4j using Hive configurations (Cheng Lian, 2015-02-09; 1 file, -0/+3)

In this way, log4j configurations overridden by jets3t-0.9.2.jar can again be overridden by Hive's default log4j configurations. This might not be the best solution for this issue, since it requires users to use `hive-log4j.properties` rather than `log4j.properties` to initialize `HiveThriftServer2` logging configurations, which can be confusing. The main purpose of this PR is to fix the Jenkins PR build.

Author: Cheng Lian <lian@databricks.com>

Closes #4484 from liancheng/spark-5696 and squashes the following commits:
df83956 [Cheng Lian] Hot fix: asks HiveThriftServer2 to re-initialize log4j using Hive configurations

* [SQL] Code cleanup. (Yin Huai, 2015-02-09; 1 file, -3/+0)

I added an unnecessary line of code in https://github.com/apache/spark/commit/13531dd97c08563e53dacdaeaf1102bdd13ef825. My bad. Let's delete it.

Author: Yin Huai <yhuai@databricks.com>

Closes #4482 from yhuai/unnecessaryCode and squashes the following commits:
3645af0 [Yin Huai] Code cleanup.

* [SQL] Add some missing DataFrame functions. (Michael Armbrust, 2015-02-09; 6 files, -21/+51)

- as with a `Symbol`
- distinct
- sqlContext.emptyDataFrame
- move add/remove col out of RDDApi section

Author: Michael Armbrust <michael@databricks.com>

Closes #4437 from marmbrus/dfMissingFuncs and squashes the following commits:
2004023 [Michael Armbrust] Add missing functions

* [SPARK-5611] [EC2] Allow spark-ec2 repo and branch to be set on CLI of spark_ec2.py and by extension, the ami-list (Florian Verhein, 2015-02-09; 1 file, -5/+32)

Useful for using alternate spark-ec2 repos or branches.

Author: Florian Verhein <florian.verhein@gmail.com>

Closes #4385 from florianverhein/master and squashes the following commits:
7e2b4be [Florian Verhein] [SPARK-5611] [EC2] typo
8b653dc [Florian Verhein] [SPARK-5611] [EC2] Enforce only supporting spark-ec2 forks from github, log improvement
bc4b0ed [Florian Verhein] [SPARK-5611] allow spark-ec2 repos with different names
8b5c551 [Florian Verhein] improve option naming, fix logging, fix lint failing, add guard to enforce spark-ec2
7724308 [Florian Verhein] [SPARK-5611] [EC2] fixes
b42b68c [Florian Verhein] [SPARK-5611] [EC2] Allow spark-ec2 repo and branch to be set on CLI of spark_ec2.py

* [SPARK-5675][SQL] XyzType companion object should subclass XyzType (Reynold Xin, 2015-02-09; 1 file, -12/+73)

Otherwise, the following will always return false in Java:

```scala
dataType instanceof StringType
```

Author: Reynold Xin <rxin@databricks.com>

Closes #4463 from rxin/type-companion-object and squashes the following commits:
04d5d8d [Reynold Xin] Comment.
976e11e [Reynold Xin] [SPARK-5675][SQL]StringType case object should be subclass of StringType class

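The pattern, sketched on a made-up type hierarchy (Spark's real definitions differ in detail):

```scala
// The type exposed to users is an abstract class...
abstract class DataType
abstract class StringType extends DataType
// ...and the companion case object is a subclass of it, so the singleton is
// an instance of the class. From Java, `dataType instanceof StringType` now
// matches; previously the object's synthetic class did not extend it.
case object StringType extends StringType

val dt: DataType = StringType
assert(dt.isInstanceOf[StringType])
```
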
* [SPARK-4905][STREAMING] FlumeStreamSuite fix. (Hari Shreedharan, 2015-02-09; 1 file, -4/+3)

Using String constructor instead of CharsetDecoder to see if it fixes the issue of empty strings in Flume test output.

Author: Hari Shreedharan <hshreedharan@apache.org>

Closes #4371 from harishreedharan/Flume-stream-attempted-fix and squashes the following commits:
550d363 [Hari Shreedharan] Fix imports.
8695950 [Hari Shreedharan] Use Charsets.UTF_8 instead of "UTF-8" in String constructors.
af3ba14 [Hari Shreedharan] [SPARK-4905][STREAMING] FlumeStreamSuite fix.

* [SPARK-5691] Fixing wrong data structure lookup for dupe app registration (mcheah, 2015-02-09; 1 file, -1/+1)

In Master's registerApplication method, it checks whether the application has already registered by examining the addressToWorker hash map. In reality, it should refer to the addressToApp data structure, as this is what really tracks which apps have been registered.

Author: mcheah <mcheah@palantir.com>

Closes #4477 from mccheah/spark-5691 and squashes the following commits:
efdc573 [mcheah] [SPARK-5691] Fixing wrong data structure lookup for dupe app registration

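The one-line fix amounts to consulting the right map; a self-contained model of the check, with the surrounding Master code reduced to a toy (only the `addressToApp` name comes from the message):

```scala
import scala.collection.mutable

// addressToApp, not addressToWorker, tracks registered application addresses.
val addressToApp = mutable.HashMap[String, String]()

def registerApplication(appAddress: String, appId: String): Unit = {
  if (addressToApp.contains(appAddress)) {
    println(s"Attempted to re-register application at same address: $appAddress")
    return
  }
  addressToApp(appAddress) = appId
}
```
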
* [SPARK-5664][BUILD] Restore stty settings when exiting from SBT's spark-shell (Liang-Chi Hsieh, 2015-02-09; 2 files, -1/+29)

For launching spark-shell from SBT.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #4451 from viirya/restore_stty and squashes the following commits:
fdfc480 [Liang-Chi Hsieh] Restore stty settings when exit (for launching spark-shell from SBT).

* [SPARK-5678] Convert DataFrame to pandas.DataFrame and Series (Davies Liu, 2015-02-09; 1 file, -0/+25)

```
pyspark.sql.DataFrame.to_pandas = to_pandas(self) unbound pyspark.sql.DataFrame method
    Collect all the rows and return a `pandas.DataFrame`.

    >>> df.to_pandas()  # doctest: +SKIP
       age   name
    0    2  Alice
    1    5    Bob

pyspark.sql.Column.to_pandas = to_pandas(self) unbound pyspark.sql.Column method
    Return a pandas.Series from the column

    >>> df.age.to_pandas()  # doctest: +SKIP
    0    2
    1    5
    dtype: int64
```

Not tested by Jenkins (the tests depend on pandas).

Author: Davies Liu <davies@databricks.com>

Closes #4476 from davies/to_pandas and squashes the following commits:
6276fb6 [Davies Liu] Convert DataFrame to pandas.DataFrame and Series

* SPARK-4267 [YARN] Failing to launch jobs on Spark on YARN with Hadoop 2.5.0 or later (Sean Owen, 2015-02-09; 3 files, -14/+18)

Before passing to YARN, escape arguments in "extraJavaOptions" args, in order to correctly handle cases like -Dfoo="one two three". Also standardize how these args are handled and ensure that individual args are treated as stand-alone args, not one string. vanzin andrewor14

Author: Sean Owen <sowen@cloudera.com>

Closes #4452 from srowen/SPARK-4267.2 and squashes the following commits:
c8297d2 [Sean Owen] Before passing to YARN, escape arguments in "extraJavaOptions" args, in order to correctly handle cases like -Dfoo="one two three". Also standardize how these args are handled and ensure that individual args are treated as stand-alone args, not one string.

* SPARK-2149. [MLLIB] Univariate kernel density estimation (Sandy Ryza, 2015-02-09; 3 files, -0/+132)

Author: Sandy Ryza <sandy@cloudera.com>

Closes #1093 from sryza/sandy-spark-2149 and squashes the following commits:
5f06b33 [Sandy Ryza] More review comments
0f73060 [Sandy Ryza] Respond to Sean's review comments
0dfa005 [Sandy Ryza] SPARK-2149. Univariate kernel density estimation

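For context, univariate kernel density estimation with a Gaussian kernel evaluates f(x) = (1/(n*h)) * sum_i K((x - x_i)/h), where K(u) = exp(-u^2/2)/sqrt(2*pi) and h is the bandwidth. A distributed sketch of that formula over an RDD; the function and parameter names are illustrative, not the PR's actual interface:

```scala
import org.apache.spark.rdd.RDD

// Evaluate a Gaussian kernel density estimate at the given points.
def estimate(samples: RDD[Double], bandwidth: Double,
             points: Array[Double]): Array[Double] = {
  val n = samples.count()
  val norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.Pi))
  samples
    // For each sample, the (unnormalized) kernel value at every query point.
    .map(xi => points.map(x => math.exp(-0.5 * math.pow((x - xi) / bandwidth, 2))))
    // Sum contributions elementwise across all samples.
    .reduce((a, b) => a.zip(b).map { case (u, v) => u + v })
    .map(_ * norm)
}
```
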
* [SPARK-5473] [EC2] Expose SSH failures after status checks pass (Nicholas Chammas, 2015-02-09; 1 file, -12/+24)

If there is some fatal problem with launching a cluster, `spark-ec2` just hangs without giving the user useful feedback on what the problem is. This PR exposes the output of the SSH calls to the user if the SSH test fails during cluster launch for any reason but the instance status checks are all green. It also removes the growing trail of dots while waiting in favor of a fixed 3 dots. For example:

```
$ ./ec2/spark-ec2 -k key -i /incorrect/path/identity.pem --instance-type m3.medium --slaves 1 --zone us-east-1c launch "spark-test"
Setting up security groups...
Searching for existing cluster spark-test...
Spark AMI: ami-35b1885c
Launching instances...
Launched 1 slaves in us-east-1c, regid = r-7dadd096
Launched master in us-east-1c, regid = r-fcadd017
Waiting for cluster to enter 'ssh-ready' state...

Warning: SSH connection error. (This could be temporary.)
Host: 127.0.0.1
SSH return code: 255
SSH output: Warning: Identity file /incorrect/path/identity.pem not accessible: No such file or directory.
Warning: Permanently added '127.0.0.1' (RSA) to the list of known hosts.
Permission denied (publickey).
```

This should give users enough information when some unrecoverable error occurs during launch so they can know to abort the launch. This will help avoid situations like the ones reported [here on Stack Overflow](http://stackoverflow.com/q/28002443/) and [here on the user list](http://mail-archives.apache.org/mod_mbox/spark-user/201501.mbox/%3C1422323829398-21381.postn3.nabble.com%3E), where the users couldn't tell what the problem was because it was being hidden by `spark-ec2`. This is a usability improvement that should be backported to 1.2.

Resolves [SPARK-5473](https://issues.apache.org/jira/browse/SPARK-5473).

Author: Nicholas Chammas <nicholas.chammas@gmail.com>

Closes #4262 from nchammas/expose-ssh-failure and squashes the following commits:
8bda6ed [Nicholas Chammas] default to print SSH output
2b92534 [Nicholas Chammas] show SSH output after status check pass

* [SPARK-5539][MLLIB] LDA guide (Xiangrui Meng, 2015-02-08; 3 files, -1/+215)

This is the LDA user guide from jkbradley, with Java and Scala code examples.

Author: Xiangrui Meng <meng@databricks.com>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes #4465 from mengxr/lda-guide and squashes the following commits:
6dcb7d1 [Xiangrui Meng] update java example in the user guide
76169ff [Xiangrui Meng] update java example
36c3ae2 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into lda-guide
c2a1efe [Joseph K. Bradley] Added LDA programming guide, plus Java example (which is in the guide and probably should be removed).

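A condensed sketch in the spirit of the new guide; the toy corpus here is invented, while the API shape (documents as (docId, termCountVector) pairs fed to LDA.run) matches the MLlib interface the guide documents:

```scala
import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vectors

// Toy corpus: (document id, term-count vector) pairs over a 3-word vocabulary.
val corpus = sc.parallelize(Seq(
  (0L, Vectors.dense(1.0, 2.0, 0.0)),
  (1L, Vectors.dense(0.0, 3.0, 1.0)),
  (2L, Vectors.dense(4.0, 0.0, 2.0))
))
val ldaModel = new LDA().setK(2).run(corpus)
println(ldaModel.topicsMatrix) // vocabSize x k matrix of term weights per topic
```
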
* [SPARK-5472][SQL] Fix Scala code style (Hung Lin, 2015-02-08; 2 files, -36/+41)

Fix Scala code style.

Author: Hung Lin <hung@zoomdata.com>

Closes #4464 from hunglin/SPARK-5472 and squashes the following commits:
ef7a3b3 [Hung Lin] SPARK-5472: fix scala style

* SPARK-4405 [MLLIB] Matrices.* construction methods should check for rows x cols overflow (Sean Owen, 2015-02-08; 1 file, -2/+12)

Check that size of dense matrix array is not beyond Int.MaxValue in Matrices.* methods. jkbradley this should be an easy one. Review and/or merge as you see fit.

Author: Sean Owen <sowen@cloudera.com>

Closes #4461 from srowen/SPARK-4405 and squashes the following commits:
c67574e [Sean Owen] Check that size of dense matrix array is not beyond Int.MaxValue in Matrices.* methods

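The guard boils down to doing the size arithmetic in Long space before allocating, so rows * cols cannot silently overflow Int; a sketch with assumed names:

```scala
// Widen to Long before multiplying: Int * Int would wrap around silently.
def checkMatrixSize(numRows: Int, numCols: Int): Unit = {
  require(numRows.toLong * numCols <= Int.MaxValue,
    s"$numRows x $numCols dense matrix is too large to allocate as a single array")
}
```
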
* [SPARK-5660][MLLIB] Make Matrix apply public (Joseph K. Bradley, 2015-02-08; 1 file, -3/+3)

This is #4447 with `override`. Closes #4447

Author: Joseph K. Bradley <joseph@databricks.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #4462 from mengxr/SPARK-5660 and squashes the following commits:
f82c8d6 [Xiangrui Meng] add override to matrix.apply
91cedde [Joseph K. Bradley] made matrix apply public

* [SPARK-5643][SQL] Add a show method to print the content of a DataFrame in tabular format. (Reynold Xin, 2015-02-08; 8 files, -21/+151)

An example:

```
year  month  AVG('Adj Close)  MAX('Adj Close)
1980  12     0.503218         0.595103
1981  01     0.523289         0.570307
1982  02     0.436504         0.475256
1983  03     0.410516         0.442194
1984  04     0.450090         0.483521
```

Author: Reynold Xin <rxin@databricks.com>

Closes #4416 from rxin/SPARK-5643 and squashes the following commits:
d0e0d6e [Reynold Xin] [SQL] Minor update to data source and statistics documentation.
269da83 [Reynold Xin] Updated isLocal comment.
2cf3c27 [Reynold Xin] Moved logic into optimizer.
1a04d8b [Reynold Xin] [SPARK-5643][SQL] Add a show method to print the content of a DataFrame in columnar format.

* SPARK-5665 [DOCS] Update netlib-java documentation (Sam Halliday, 2015-02-08; 1 file, -17/+24)

I am the author of netlib-java and I found this documentation to be out of date. Some main points:

1. Breeze has not depended on jBLAS for some time
2. netlib-java provides a pure JVM implementation as the fallback (the original docs did not appear to be aware of this, claiming that gfortran was necessary)
3. The licensing issue is not just about LGPL: optimised natives have proprietary licenses. Building with the LGPL flag turned on really doesn't help you get past this.
4. I really think it's best to direct people to my detailed setup guide instead of trying to compress it into one sentence. It is different for each architecture, each OS, and for each backend.

I hope this helps to clear things up :smile:

Author: Sam Halliday <sam.halliday@Gmail.com>
Author: Sam Halliday <sam.halliday@gmail.com>

Closes #4448 from fommil/patch-1 and squashes the following commits:
18cda11 [Sam Halliday] remove link to skillsmatters at request of @mengxr
a35e4a9 [Sam Halliday] reword netlib-java/breeze docs

* [SPARK-5598][MLLIB] model save/load for ALS (Xiangrui Meng, 2015-02-08; 3 files, -3/+100)

following #4233. jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #4422 from mengxr/SPARK-5598 and squashes the following commits:
a059394 [Xiangrui Meng] SaveLoad not extending Loader
14b7ea6 [Xiangrui Meng] address comments
f487cb2 [Xiangrui Meng] add unit tests
62fc43c [Xiangrui Meng] implement save/load for MFM

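A sketch of the round trip this adds for MatrixFactorizationModel; the toy ratings and the path are invented for illustration:

```scala
import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}

// Tiny made-up ratings: (user, product, rating).
val ratings = sc.parallelize(Seq(
  Rating(1, 1, 5.0), Rating(1, 2, 1.0), Rating(2, 1, 4.0)))
val model = ALS.train(ratings, 10, 5) // rank = 10, 5 iterations

model.save(sc, "alsModelPath")
val loaded = MatrixFactorizationModel.load(sc, "alsModelPath")
```
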
* [SQL] Set sessionState in QueryExecution. (Yin Huai, 2015-02-08; 1 file, -0/+5)

This PR sets the SessionState in HiveContext's QueryExecution, so we can make sure that SessionState.get returns the SessionState every time.

Author: Yin Huai <yhuai@databricks.com>

Closes #4445 from yhuai/setSessionState and squashes the following commits:
769c9f1 [Yin Huai] Remove unused import.
439f329 [Yin Huai] Try again.
427a0c9 [Yin Huai] Set SessionState everytime when we create a QueryExecution in HiveContext.
a3b7793 [Yin Huai] Set sessionState when dealing with CreateTableAsSelect.

* [SPARK-3039] [BUILD] Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 1 API (medale, 2015-02-08; 1 file, -0/+4)

This issue had been marked as resolved but did not work for at least some builds, due to version conflicts between avro-mapred-1.7.5.jar and avro-mapred-1.7.6-hadoop2.jar (the correct version) when building for hadoop2. In sql/hive/pom.xml, org.spark-project.hive:hive-exec depends on 1.7.5:

Building Spark Project Hive 1.2.0
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] --- maven-dependency-plugin:2.4:tree (default-cli) spark-hive_2.10 ---
[INFO] org.apache.spark:spark-hive_2.10:jar:1.2.0
[INFO] +- org.spark-project.hive:hive-exec:jar:0.13.1a:compile
[INFO] |  \- org.apache.avro:avro-mapred:jar:1.7.5:compile
[INFO] \- org.apache.avro:avro-mapred:jar:hadoop2:1.7.6:compile
[INFO]

Excluding this dependency allows the explicitly listed avro-mapred dependency to be picked up.

Author: medale <medale94@yahoo.com>

Closes #4315 from medale/avro-hadoop2 and squashes the following commits:
1ab4fa3 [medale] Merge branch 'master' into avro-hadoop2
9d85e2a [medale] Merge remote-tracking branch 'upstream/master' into avro-hadoop2
51b9c2a [medale] [SPARK-3039] [BUILD] Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 1 API had been marked as resolved but did not work for at least some builds due to version conflicts using avro-mapred-1.7.5.jar and avro-mapred-1.7.6-hadoop2.jar (the correct version) when building for hadoop2.

* [SPARK-5672][Web UI] Don't return `ERROR 500` when args are missing (Kirill A. Korinskiy, 2015-02-08; 6 files, -16/+35)

The Spark web UI returns `HTTP ERROR 500` when GET arguments are missing.

Author: Kirill A. Korinskiy <catap@catap.ru>

Closes #4239 from catap/ui_500 and squashes the following commits:
520e180 [Kirill A. Korinskiy] [SPARK-5672][Web UI] Return `HTTP ERROR 400` when have missing args

* [SPARK-5656] Fail gracefully for large values of k and/or n that will exceed max int (mbittmann, 2015-02-08; 1 file, -0/+3)

Large values of k and/or n in EigenValueDecomposition.symmetricEigs will result in array initialization to a value larger than Integer.MAX_VALUE in the following:
var v = new Array[Double](n * ncv)

Author: mbittmann <mbittmann@gmail.com>
Author: bittmannm <mark.bittmann@agilex.com>

Closes #4433 from mbittmann/master and squashes the following commits:
ee56e05 [mbittmann] [SPARK-5656] Combine checks into simple message
e49cbbb [mbittmann] [SPARK-5656] Simply error message
860836b [mbittmann] Array size check updates based on code review
a604816 [bittmannm] [SPARK-5656] Fail gracefully for large values of k and/or n that will exceed max int.

* [SPARK-5366][EC2] Check the mode of private key (liuchang0812, 2015-02-08; 1 file, -0/+15)

Check the mode of the private key file.

Author: liuchang0812 <liuchang0812@gmail.com>

Closes #4162 from Liuchang0812/ec2-script and squashes the following commits:
fc37355 [liuchang0812] quota file name
01ed464 [liuchang0812] more output
ce2a207 [liuchang0812] pep8
f44efd2 [liuchang0812] move code to real_main
8475a54 [liuchang0812] fix bug
cd61a1a [liuchang0812] import stat
c106cb2 [liuchang0812] fix trivis bug
89c9953 [liuchang0812] more output about checking private key
1177a90 [liuchang0812] remove commet
41188ab [liuchang0812] check the mode of private key

* [SPARK-5671] Upgrade jets3t to 0.9.2 in hadoop-2.3 and 2.4 profiles (Josh Rosen, 2015-02-07; 1 file, -2/+2)

Upgrading from jets3t 0.9.0 to 0.9.2 fixes a dependency issue that was causing UISeleniumSuite to fail with ClassNotFoundExceptions when run with the hadoop-2.3 or hadoop-2.4 profiles. The jets3t release notes can be found at http://www.jets3t.org/RELEASE_NOTES.html

Author: Josh Rosen <joshrosen@databricks.com>

Closes #4454 from JoshRosen/SPARK-5671 and squashes the following commits:
fa6cb3e [Josh Rosen] [SPARK-5671] Upgrade jets3t to 0.9.2 in hadoop-2.3 and 2.4 profiles

* [SPARK-5108][BUILD] Jackson dependency management for Hadoop-2.6.0 support (Zhan Zhang, 2015-02-07; 1 file, -0/+13)

There is a dependency compatibility issue: hadoop-2.6.0 currently uses Jackson 1.9.13. Upgrade to the same version to make it consistent.

Author: Zhan Zhang <zhazhan@gmail.com>

Closes #3938 from zhzhan/spark5108 and squashes the following commits:
0080a84 [Zhan Zhang] change to upgrade jackson version only in hadoop-2.x
0b9bad6 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark into spark5108
917600a [Zhan Zhang] solve conflicts
f7064d0 [Zhan Zhang] hadoop2.6 dependency management fix
fc56b25 [Zhan Zhang] squash all commits
3bf966c [Zhan Zhang] test

* SPARK-5408: Use -XX:MaxPermSize specified by user instead of default in ExecutorRunner and DriverRunner (Jacek Lewandowski, 2015-02-07; 1 file, -2/+12)

Author: Jacek Lewandowski <lewandowski.jacek@gmail.com>

Closes #4203 from jacek-lewandowski/SPARK-5408-1.3 and squashes the following commits:
d913686 [Jacek Lewandowski] SPARK-5408: Use -XX:MaxPermSize specified by used instead of default in ExecutorRunner and DriverRunner

* [BUILD] Add the ability to launch spark-shell from SBT. (Michael Armbrust, 2015-02-07; 1 file, -0/+23)

Now you can quickly launch the spark-shell without building an assembly. For quick development iteration, run `build/sbt ~sparkShell`; calling exit will relaunch with any changes.

Author: Michael Armbrust <michael@databricks.com>

Closes #4438 from marmbrus/sparkShellSbt and squashes the following commits:
b4e44fe [Michael Armbrust] [BUILD] Add the ability to launch spark-shell from SBT.

* [SPARK-5388] Provide a stable application submission gateway for standalone cluster mode (Andrew Or, 2015-02-06; 23 files, -94/+2027)

The goal is to provide a stable, REST-based application submission gateway that is not inherently based on Akka, which is unstable across versions. This PR targets standalone cluster mode, but is implemented in a general enough manner that it can potentially be extended to other modes in the future. Client mode is currently not included in the changes here because there are many more Akka messages exchanged there.

As of the changes here, the Master will advertise two ports, 7077 and 6066. We need to keep around the old one (7077) for client mode and older versions of Spark submit. However, all new versions of Spark submit will use the REST gateway (6066). By the way, this includes ~700 lines of tests and ~200 lines of license.

Author: Andrew Or <andrew@databricks.com>

Closes #4216 from andrewor14/rest and squashes the following commits:
8d7ce07 [Andrew Or] Merge branch 'master' of github.com:apache/spark into rest
6f0c597 [Andrew Or] Use nullable fields for integer and boolean values
dfe4bd7 [Andrew Or] Merge branch 'master' of github.com:apache/spark into rest
b9e2a08 [Andrew Or] Minor comments
02b5cea [Andrew Or] Fix tests
d2b1ef8 [Andrew Or] Comment changes + minor code refactoring across the board
9c82a36 [Andrew Or] Minor comment and wording updates
b4695e7 [Andrew Or] Merge branch 'master' of github.com:apache/spark into rest
c9a8ad7 [Andrew Or] Do not include appResource and mainClass as properties
6fc7670 [Andrew Or] Report REST server response back to the user
40e6095 [Andrew Or] Pass submit parameters through system properties
cbd670b [Andrew Or] Include unknown fields, if any, in server response
9fee16f [Andrew Or] Include server protocol version on mismatch
09f873a [Andrew Or] Fix style
8188e61 [Andrew Or] Upgrade Jackson from 2.3.0 to 2.4.4
37538e0 [Andrew Or] Merge branch 'master' of github.com:apache/spark into rest
9165ae8 [Andrew Or] Fall back to Akka if endpoint was not REST
252d53c [Andrew Or] Clean up server error handling behavior further
c643f64 [Andrew Or] Fix style
bbbd329 [Andrew Or] Merge branch 'master' of github.com:apache/spark into rest
792e112 [Andrew Or] Use specific HTTP response codes on error
f98660b [Andrew Or] Version the protocol and include it in REST URL
721819f [Andrew Or] Provide more REST-like interface for submit/kill/status
581f7bf [Andrew Or] Merge branch 'master' of github.com:apache/spark into rest
9e0d1af [Andrew Or] Move some classes around to reduce number of files (minor)
42e5de4 [Andrew Or] Merge branch 'master' of github.com:apache/spark into rest
1f1c03f [Andrew Or] Use Jackson's DefaultScalaModule to simplify messages
9229433 [Andrew Or] Reduce duplicate naming in REST field
ade28fd [Andrew Or] Clean up REST response output in Spark submit
b2fef8b [Andrew Or] Abstract the success field to the general response
6c57b4b [Andrew Or] Increase timeout in end-to-end tests
bf696ff [Andrew Or] Add checks for enabling REST when using kill/status
7ee6737 [Andrew Or] Merge branch 'master' of github.com:apache/spark into rest
e2f7f5f [Andrew Or] Provide more safeguard against missing fields
9581df7 [Andrew Or] Clean up uses of exceptions
914fdff [Andrew Or] Merge branch 'master' of github.com:apache/spark into rest
e2104e6 [Andrew Or] stable -> rest
3db7379 [Andrew Or] Fix comments and name fields for better error messages
8d43486 [Andrew Or] Replace SubmitRestProtocolAction with class name
df90e8b [Andrew Or] Use Jackson for JSON de/serialization
d7a1f9f [Andrew Or] Fix local cluster tests
efa5e18 [Andrew Or] Merge branch 'master' of github.com:apache/spark into rest
e42c131 [Andrew Or] Add end-to-end tests for standalone REST protocol
837475b [Andrew Or] Show the REST port on the Master UI
d8d3717 [Andrew Or] Use a daemon thread pool for REST server
6568ca5 [Andrew Or] Merge branch 'master' of github.com:apache/spark into rest
77774ba [Andrew Or] Minor fixes
206cae4 [Andrew Or] Refactor and add tests for the REST protocol
63c05b3 [Andrew Or] Remove MASTER as a field (minor)
9e21b72 [Andrew Or] Action -> SparkSubmitAction (minor)
51c5ca6 [Andrew Or] Distinguish client and server side Spark versions
b44e103 [Andrew Or] Implement status requests + fix validation behavior
120ab9d [Andrew Or] Support kill and request driver status through SparkSubmit
544de1d [Andrew Or] Major clean ups in code and comments
e958cae [Andrew Or] Supported nested values in messages
484bd21 [Andrew Or] Specify an ordering for fields in SubmitDriverRequestMessage
6ff088d [Andrew Or] Rename classes to generalize REST protocol
af9d9cb [Andrew Or] Integrate REST protocol in standalone mode
53e7c0e [Andrew Or] Initial client, server, and all the messages

* SPARK-5403: Ignore UserKnownHostsFile in SSH calls (Grzegorz Dubicki, 2015-02-06; 1 file, -0/+1)

See https://issues.apache.org/jira/browse/SPARK-5403

Author: Grzegorz Dubicki <grzegorz.dubicki@gmail.com>

Closes #4196 from grzegorz-dubicki/SPARK-5403 and squashes the following commits:
a7d863f [Grzegorz Dubicki] Resolve start command hanging issue

* [SPARK-5601][MLLIB] make streaming linear algorithms Java-friendly (Xiangrui Meng, 2015-02-06; 3 files, -1/+181)

Overload `trainOn`, `predictOn`, and `predictOnValues`. CC freeman-lab

Author: Xiangrui Meng <meng@databricks.com>

Closes #4432 from mengxr/streaming-java and squashes the following commits:
6a79b85 [Xiangrui Meng] add java test for streaming logistic regression
2d7b357 [Xiangrui Meng] organize imports
1f662b3 [Xiangrui Meng] make streaming linear algorithms Java-friendly

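For reference, a sketch of the Scala calls these overloads mirror for Java; the stream directories are assumed, and LabeledPoint.parse expects lines like "(label,[f1,f2,...])":

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Train continuously on one file stream and score another.
val ssc = new StreamingContext(sc, Seconds(1))
val training = ssc.textFileStream("trainDir").map(LabeledPoint.parse)
val test = ssc.textFileStream("testDir").map(LabeledPoint.parse)

val model = new StreamingLinearRegressionWithSGD()
  .setInitialWeights(Vectors.zeros(2))
model.trainOn(training)
model.predictOnValues(test.map(lp => (lp.label, lp.features))).print()
ssc.start()
ssc.awaitTermination()
```
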