Commit message | Author | Date | Files | Lines
* [SPARK-3813][SQL] Support "case when" conditional functions in Spark SQL. | ravipesala | 2014-10-09 | 2 | -2/+27
| | | | | | | | | | | | | | The "case when" conditional function is already supported in Spark SQL, but there was no support for it in SqlParser, so parser support was added. Author: ravipesala <ravindra.pesala@huawei.com> Closes #2678 from ravipesala/SPARK-3813 and squashes the following commits: 70c75a7 [ravipesala] Fixed styles 713ea84 [ravipesala] Updated as per admin comments 709684f [ravipesala] Changed parser to support case when function.
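A minimal Scala sketch of the syntax this enables through `SqlParser`; the table and column names (`people`, `name`, `age`) are made up for illustration, and an existing SparkContext `sc` is assumed.
```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)  // assumes an existing SparkContext `sc`

// Hypothetical query exercising the newly parsed CASE WHEN ... THEN ... ELSE ... END form.
val labeled = sqlContext.sql("""
  SELECT name,
         CASE WHEN age < 18 THEN 'minor'
              WHEN age < 65 THEN 'adult'
              ELSE 'senior'
         END AS ageGroup
  FROM people
""")
```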
* [SPARK-3858][SQL] Pass the generator alias into logical plan node | Nathan Howell | 2014-10-09 | 2 | -1/+9
| | | | | | | | | | The alias parameter is being ignored, which makes it more difficult to specify a qualifier for Generator expressions. Author: Nathan Howell <nhowell@godaddy.com> Closes #2721 from NathanHowell/SPARK-3858 and squashes the following commits: 8aa0f43 [Nathan Howell] [SPARK-3858][SQL] Pass the generator alias into logical plan node
* [SPARK-3412][SQL] add missing row api | Daoyuan Wang | 2014-10-09 | 3 | -11/+32
| | | | | | | | | | | | | | chenghao-intel assigned this to me, check PR #2284 for previous discussion Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #2529 from adrian-wang/rowapi and squashes the following commits: c6594b2 [Daoyuan Wang] using boxed 7b7e6e3 [Daoyuan Wang] update pattern match 7a39456 [Daoyuan Wang] rename file and refresh getAs[T] 4c18c29 [Daoyuan Wang] remove setAs[T] and null judge 1614493 [Daoyuan Wang] add missing row api
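A rough sketch of the kind of typed accessor this adds to `Row`; the import path of `Row` and the exact set of new methods have shifted between releases, so treat the names below as approximate.
```scala
import org.apache.spark.sql.Row

// Constructing a Row directly here purely for illustration.
val row = Row("alice", 42)

val name = row.getAs[String](0)  // typed access instead of row(0).asInstanceOf[String]
val age  = row.getAs[Int](1)
```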
* [SPARK-3339][SQL] Support for skipping json lines that fail to parse | Yin Huai | 2014-10-09 | 6 | -19/+116
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This PR aims to provide a way to skip/query corrupt JSON records. To do so, we introduce an internal column to hold corrupt records (the default name is `_corrupt_record`. This name can be changed by setting the value of `spark.sql.columnNameOfCorruptRecord`). When there is a parsing error, we will put the corrupt record in its unparsed format to the internal column. Users can skip/query this column through SQL. * To query those corrupt records ``` -- For Hive parser SELECT `_corrupt_record` FROM jsonTable WHERE `_corrupt_record` IS NOT NULL -- For our SQL parser SELECT _corrupt_record FROM jsonTable WHERE _corrupt_record IS NOT NULL ``` * To skip corrupt records and query regular records ``` -- For Hive parser SELECT field1, field2 FROM jsonTable WHERE `_corrupt_record` IS NULL -- For our SQL parser SELECT field1, field2 FROM jsonTable WHERE _corrupt_record IS NULL ``` Generally, it is not recommended to change the name of the internal column. If the name has to be changed to avoid possible name conflicts, you can use `sqlContext.setConf(SQLConf.COLUMN_NAME_OF_CORRUPT_RECORD, <new column name>)` or `sqlContext.sql(SET spark.sql.columnNameOfCorruptRecord=<new column name>)`. Author: Yin Huai <huai@cse.ohio-state.edu> Closes #2680 from yhuai/corruptJsonRecord and squashes the following commits: 4c9828e [Yin Huai] Merge remote-tracking branch 'upstream/master' into corruptJsonRecord 309616a [Yin Huai] Change the default name of corrupt record to "_corrupt_record". b4a3632 [Yin Huai] Merge remote-tracking branch 'upstream/master' into corruptJsonRecord 9375ae9 [Yin Huai] Set the column name of corrupt json record back to the default one after the unit test. ee584c0 [Yin Huai] Provide a way to query corrupt json records as unparsed strings.
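A hedged Scala sketch of using the corrupt-record column described above end-to-end; the file path and temp table name are placeholders, and the method names reflect the SQLContext API of that era.
```scala
// Optional: override the internal column name (the default is "_corrupt_record").
sqlContext.setConf("spark.sql.columnNameOfCorruptRecord", "_corrupt_record")

val jsonTable = sqlContext.jsonFile("hdfs:///data/events.json")  // placeholder path
jsonTable.registerTempTable("jsonTable")

// Pull out only the records that failed to parse, in their unparsed form.
val corrupt = sqlContext.sql(
  "SELECT _corrupt_record FROM jsonTable WHERE _corrupt_record IS NOT NULL")
```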
* Revert "[SPARK-2805] Upgrade to akka 2.3.4"Patrick Wendell2014-10-096-6/+77
| | | | This reverts commit b9df8af62e8d7b263a668dfb6e9668ab4294ea37.
* [SPARK-3853][SQL] JSON Schema support for Timestamp fields | Mike Timper | 2014-10-09 | 2 | -0/+18
| | | | | | | | | | In JSONRDD.scala, add 'case TimestampType' in the enforceCorrectType function and a toTimestamp function. Author: Mike Timper <mike@aurorafeint.com> Closes #2720 from mtimper/master and squashes the following commits: 9386ab8 [Mike Timper] Fix and tests for SPARK-3853
* [SPARK-3868][PySpark] Hard to recognize which module is tested from unit-tests.log | cocoatomo | 2014-10-09 | 1 | -3/+4
| | | | | | | | | | | | | The ./python/run-tests script displays messages about which test it is currently running on stdout, but does not write them to unit-tests.log. That makes it harder to recognize which test programs were executed and which test failed. Author: cocoatomo <cocoatomo77@gmail.com> Closes #2724 from cocoatomo/issues/3868-display-testing-module-name and squashes the following commits: c63d9fa [cocoatomo] [SPARK-3868][PySpark] Hard to recognize which module is tested from unit-tests.log
* [SPARK-3806][SQL] Minor fix for CliSuite | scwf | 2014-10-09 | 1 | -3/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This fixes two issues in CliSuite. 1. CliSuite throws IndexOutOfBoundsException: Exception in thread "Thread-6" java.lang.IndexOutOfBoundsException: 6 at scala.collection.mutable.ResizableArray$class.apply(ResizableArray.scala:43) at scala.collection.mutable.ArrayBuffer.apply(ArrayBuffer.scala:47) at org.apache.spark.sql.hive.thriftserver.CliSuite.org$apache$spark$sql$hive$thriftserver$CliSuite$$captureOutput$1(CliSuite.scala:67) at org.apache.spark.sql.hive.thriftserver.CliSuite$$anonfun$4.apply(CliSuite.scala:78) at org.apache.spark.sql.hive.thriftserver.CliSuite$$anonfun$4.apply(CliSuite.scala:78) at scala.sys.process.ProcessLogger$$anon$1.out(ProcessLogger.scala:96) at scala.sys.process.BasicIO$$anonfun$processOutFully$1.apply(BasicIO.scala:135) at scala.sys.process.BasicIO$$anonfun$processOutFully$1.apply(BasicIO.scala:135) at scala.sys.process.BasicIO$.readFully$1(BasicIO.scala:175) at scala.sys.process.BasicIO$.processLinesFully(BasicIO.scala:179) at scala.sys.process.BasicIO$$anonfun$processFully$1.apply(BasicIO.scala:164) at scala.sys.process.BasicIO$$anonfun$processFully$1.apply(BasicIO.scala:162) at scala.sys.process.ProcessBuilderImpl$Simple$$anonfun$3.apply$mcV$sp(ProcessBuilderImpl.scala:73) at scala.sys.process.ProcessImpl$Spawn$$anon$1.run(ProcessImpl.scala:22) Actually, it is multi-threading that leads to this problem. 2. Use ```line.startsWith``` instead of ```line.contains``` to assert the expected answer. This is a tiny bug in CliSuite: for the test case "Simple commands", there is an expected answer "5"; if we use ```contains```, output like "14/10/06 11:```5```4:36 INFO CliDriver: Time taken: 1.078 seconds" or "14/10/06 11:54:36 INFO StatsReportListener: 0% ```5```% 10% 25% 50% 75% 90% 95% 100%" would make the assert pass spuriously. Author: scwf <wangfei1@huawei.com> Closes #2666 from scwf/clisuite and squashes the following commits: 11430db [scwf] fix-clisuite
* [SPARK-3711][SQL] Optimize where in clause filter queries | Yash Datta | 2014-10-09 | 4 | -2/+132
| | | | | | | | | | | | | | | The In case class is replaced by a InSet class in case all the filters are literals, which uses a hashset instead of Sequence, thereby giving significant performance improvement (earlier the seq was using a worst case linear match (exists method) since expressions were assumed in the filter list) . Maximum improvement should be visible in case small percentage of large data matches the filter list. Author: Yash Datta <Yash.Datta@guavus.com> Closes #2561 from saucam/branch-1.1 and squashes the following commits: 4bf2d19 [Yash Datta] SPARK-3711: 1. Fix code style and import order 2. Fix optimization condition 3. Add tests for null in filter list 4. Add test case that optimization is not triggered in case of attributes in filter list afedbcd [Yash Datta] SPARK-3711: 1. Add test cases for InSet class in ExpressionEvaluationSuite 2. Add class OptimizedInSuite on the lines of ConstantFoldingSuite, for the optimized In clause 0fc902f [Yash Datta] SPARK-3711: UnaryMinus will be handled by constantFolding bd84c67 [Yash Datta] SPARK-3711: Incorporate review comments. Move optimization of In clause to Optimizer.scala by adding a rule. Add appropriate comments 430f5d1 [Yash Datta] SPARK-3711: Optimize the filter list in case of negative values as well bee98aa [Yash Datta] SPARK-3711: Optimize where in clause filter queries
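A toy Scala illustration of the optimization's core idea (this is not the actual Catalyst `In`/`InSet` code): replace a linear scan over a `Seq` of literals with a hash-set lookup per row.
```scala
val filterValues: Seq[Any] = Seq(1, 2, 3, 42)

// "In" semantics: worst-case linear probe for every row.
def inSeq(v: Any): Boolean = filterValues.exists(_ == v)

// "InSet" semantics: the set is built once, and each per-row lookup is constant time.
val filterSet: Set[Any] = filterValues.toSet
def inSet(v: Any): Boolean = filterSet.contains(v)
```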
* [SPARK-3752][SQL]: Add tests for different UDF's | Vida Ha | 2014-10-09 | 6 | -15/+265
| | | | | | | | Author: Vida Ha <vida@databricks.com> Closes #2621 from vidaha/vida/SPARK-3752 and squashes the following commits: d7fdbbc [Vida Ha] Add tests for different UDF's
* [SPARK-3741] Make ConnectionManager propagate errors properly and add more logs to avoid Executors swallowing errors | zsxwing | 2014-10-09 | 2 | -69/+172
| | | | | | | | | | | | | | | | | This PR made the following changes: * Register a callback to `Connection` so that the error will be propagated properly. * Add more logs so that the errors won't be swallowed by Executors. * Use trySuccess/tryFailure because `Promise` doesn't allow calling success/failure more than once. Author: zsxwing <zsxwing@gmail.com> Closes #2593 from zsxwing/SPARK-3741 and squashes the following commits: 1d5aed5 [zsxwing] Fix naming 0b8a61c [zsxwing] Merge branch 'master' into SPARK-3741 764aec5 [zsxwing] [SPARK-3741] Make ConnectionManager propagate errors properly and add more logs to avoid Executors swallowing errors
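The trySuccess/tryFailure point above follows from the standard-library contract that a `Promise` can only be completed once; a small self-contained illustration:
```scala
import scala.concurrent.Promise

val p = Promise[Int]()
p.success(1)      // completes the promise
// p.success(2)   // would throw IllegalStateException: promise already completed
p.trySuccess(2)   // returns false instead of throwing, so it is safe to call repeatedly
```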
* [Minor] use norm operator after breeze 0.10 upgrade | GuoQiang Li | 2014-10-09 | 1 | -8/+10
| | | | | | | | | | cc mengxr Author: GuoQiang Li <witgo@qq.com> Closes #2730 from witgo/SPARK-3856 and squashes the following commits: 2cffce1 [GuoQiang Li] use norm operator after breeze 0.10 upgrade
* [SPARK-3158][MLLIB] Avoid 1 extra aggregation for DecisionTree training | Qiping Li | 2014-10-09 | 4 | -48/+197
| | | | | | | | | | | | | | | | | | | | | | | | | | Currently, the implementation does one unnecessary aggregation step. The aggregation step for level L (to choose splits) gives enough information to set the predictions of any leaf nodes at level L+1. We can use that info and skip the aggregation step for the last level of the tree (which only has leaf nodes). ### Implementation Details Each node now has a `impurity` field and the `predict` is changed from type `Double` to type `Predict`(this can be used to compute predict probability in the future) When compute best splits for each node, we also compute impurity and predict for the child nodes, which is used to constructed newly allocated child nodes. So at level L, we have set impurity and predict for nodes at level L +1. If level L+1 is the last level, then we can avoid aggregation. What's more, calculation of parent impurity in Top nodes for each tree needs to be treated differently because we have to compute impurity and predict for them first. In `binsToBestSplit`, if current node is top node(level == 0), we calculate impurity and predict first. after finding best split, top node's predict and impurity is set to the calculated value. Non-top nodes's impurity and predict are already calculated and don't need to be recalculated again. I have considered to add a initialization step to set top nodes' impurity and predict and then we can treat all nodes in the same way, but this will need a lot of duplication of code(all the code to do seq operation(BinSeqOp) needs to be duplicated), so I choose the current way. CC mengxr manishamde jkbradley, please help me review this, thanks. Author: Qiping Li <liqiping1991@gmail.com> Closes #2708 from chouqin/avoid-agg and squashes the following commits: 8e269ea [Qiping Li] adjust code and comments eefeef1 [Qiping Li] adjust comments and check child nodes' impurity c41b1b6 [Qiping Li] fix pyspark unit test 7ad7a71 [Qiping Li] fix unit test 822c912 [Qiping Li] add comments and unit test e41d715 [Qiping Li] fix bug in test suite 6cc0333 [Qiping Li] SPARK-3158: Avoid 1 extra aggregation for DecisionTree training
* add spark.driver.memory to config docs | nartz | 2014-10-09 | 1 | -0/+8
| | | | | | | | | | | | | | It took me a minute to track this down, so I thought it could be useful to have it in the docs. I'm unsure if 512mb is the default for spark.driver.memory? Also - there could be a better value for the 'description' to differentiate it from spark.executor.memory. Author: nartz <nartzpod@gmail.com> Author: Nathan Artz <nathanartz@Nathans-MacBook-Pro.local> Closes #2410 from nartz/docs/add-spark-driver-memory-to-config-docs and squashes the following commits: a2f6c62 [nartz] Update configuration.md 74521b8 [Nathan Artz] add spark.driver.memory to config docs
* [SPARK-3844][UI] Truncate appName in WebUI if it is too long | Xiangrui Meng | 2014-10-09 | 1 | -1/+5
| | | | | | | | | | | Truncate appName in WebUI if it is too long. Author: Xiangrui Meng <meng@databricks.com> Closes #2707 from mengxr/truncate-app-name and squashes the following commits: 87834ce [Xiangrui Meng] move scala import below java c7111dc [Xiangrui Meng] truncate appName in WebUI if it is too long
* [SPARK-2805] Upgrade to akka 2.3.4 | Anand Avati | 2014-10-08 | 6 | -77/+6
| | | | | | | | | | | Upgrade to akka 2.3.4 Author: Anand Avati <avati@redhat.com> Closes #1685 from avati/SPARK-1812-akka-2.3 and squashes the following commits: 57a2315 [Anand Avati] SPARK-1812: streaming - remove tests which depend on akka.actor.IO 2a551d3 [Anand Avati] SPARK-1812: core - upgrade to akka 2.3.4
* [SPARK-3856][MLLIB] use norm operator after breeze 0.10 upgrade | Xiangrui Meng | 2014-10-08 | 1 | -2/+2
| | | | | | | | | | | | | | | | | Got warning msg: ~~~ [warn] /Users/meng/src/spark/mllib/src/main/scala/org/apache/spark/mllib/feature/Normalizer.scala:50: method norm in trait NumericOps is deprecated: Use norm(XXX) instead of XXX.norm [warn] var norm = vector.toBreeze.norm(p) ~~~ dbtsai Author: Xiangrui Meng <meng@databricks.com> Closes #2718 from mengxr/SPARK-3856 and squashes the following commits: 4f38169 [Xiangrui Meng] use norm operator
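For reference, the non-deprecated form after the Breeze 0.10 upgrade looks roughly like this (a sketch, not Spark's Normalizer code):
```scala
import breeze.linalg.{norm, DenseVector}

val v = DenseVector(3.0, 4.0)
val l2 = norm(v, 2.0)  // 5.0 -- replaces the deprecated v.norm(2.0)
val l1 = norm(v, 1.0)  // 7.0
```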
* Fetch from branch v4 in Spark EC2 script. | Josh Rosen | 2014-10-08 | 1 | -1/+1
|
* [SPARK-3857] Create joins package for various join operators. | Reynold Xin | 2014-10-08 | 15 | -646/+844
| | | | | | | | | | | Author: Reynold Xin <rxin@apache.org> Closes #2719 from rxin/sql-join-break and squashes the following commits: 0c0082b [Reynold Xin] Fix line length. cbc664c [Reynold Xin] Rename join -> joins package. a070d44 [Reynold Xin] Fix line length in HashJoin a39be8c [Reynold Xin] [SPARK-3857] Create a join package for various join operators.
* [SQL] Prevents per row dynamic dispatching and pattern matching when inserting Hive values | Cheng Lian | 2014-10-08 | 1 | -30/+34
| | | | | | | | | | | | | Builds all wrappers up front according to object inspector types to avoid per-row costs. Author: Cheng Lian <lian.cs.zju@gmail.com> Closes #2592 from liancheng/hive-value-wrapper and squashes the following commits: 9696559 [Cheng Lian] Passes all tests 4998666 [Cheng Lian] Prevents per row dynamic dispatching and pattern matching when inserting Hive values
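A toy sketch of the pattern described above (not Spark's actual Hive writer code): decide how to wrap each column once, based on its type, instead of pattern matching on every row.
```scala
sealed trait ColumnType
case object IntColumn extends ColumnType
case object StringColumn extends ColumnType

// Build one wrapper function per column, up front.
def makeWrapper(t: ColumnType): Any => Any = t match {
  case IntColumn    => (v: Any) => Int.box(v.asInstanceOf[Int])
  case StringColumn => (v: Any) => v.asInstanceOf[String]
}

val schema   = Seq(IntColumn, StringColumn)
val wrappers = schema.map(makeWrapper)  // computed once, reused for every row

def wrapRow(row: Seq[Any]): Seq[Any] =
  row.zip(wrappers).map { case (value, wrap) => wrap(value) }  // no per-row type dispatch
```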
* [SPARK-3810][SQL] Makes PreInsertionCasts handle partitions properly | Cheng Lian | 2014-10-08 | 2 | -10/+41
| | | | | | | | | | Includes partition keys into account when applying `PreInsertionCasts` rule. Author: Cheng Lian <lian.cs.zju@gmail.com> Closes #2672 from liancheng/fix-pre-insert-casts and squashes the following commits: def1a1a [Cheng Lian] Makes PreInsertionCasts handle partitions properly
* [SPARK-3707] [SQL] Fix bug of type coercion in DIV | Cheng Hao | 2014-10-08 | 2 | -5/+42
| | | | | | | | | | | Calling `BinaryArithmetic.dataType` throws an exception until the expression is resolved, but the type coercion rule `Division` doesn't seem to follow this. Author: Cheng Hao <hao.cheng@intel.com> Closes #2559 from chenghao-intel/type_coercion and squashes the following commits: 199a85d [Cheng Hao] Simplify the divide rule dc55218 [Cheng Hao] fix bug of type coercion in div
* [SQL][Doc] Keep Spark SQL README.md up to date | Liquan Pei | 2014-10-08 | 1 | -16/+15
| | | | | | | | | | | marmbrus Update README.md to be consistent with Spark 1.1 Author: Liquan Pei <liquanpei@gmail.com> Closes #2706 from Ishiihara/SparkSQL-readme and squashes the following commits: 33b9d4b [Liquan Pei] keep README.md up to date
* [SPARK-3713][SQL] Uses JSON to serialize DataType objects | Cheng Lian | 2014-10-08 | 7 | -168/+277
| | | | | | | | | | | | | | | | | | | | | This PR uses JSON instead of `toString` to serialize `DataType`s. The latter is not only hard to parse but also flaky in many cases. Since we already write schema information to Parquet metadata in the old style, we have to reserve the old `DataType` parser and ensure downward compatibility. The old parser is now renamed to `CaseClassStringParser` and moved into `object DataType`. JoshRosen davies Please help review PySpark related changes, thanks! Author: Cheng Lian <lian.cs.zju@gmail.com> Closes #2563 from liancheng/datatype-to-json and squashes the following commits: fc92eb3 [Cheng Lian] Reverts debugging code, simplifies primitive type JSON representation 438c75f [Cheng Lian] Refactors PySpark DataType JSON SerDe per comments 6b6387b [Cheng Lian] Removes debugging code 6a3ee3a [Cheng Lian] Addresses per review comments dc158b5 [Cheng Lian] Addresses PEP8 issues 99ab4ee [Cheng Lian] Adds compatibility est case for Parquet type conversion a983a6c [Cheng Lian] Adds PySpark support f608c6e [Cheng Lian] De/serializes DataType objects from/to JSON
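A hedged sketch of round-tripping a schema through JSON; the package containing `DataType`/`StructType` has moved between releases (catalyst.types vs. sql.types), so the import below is approximate.
```scala
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = false)))

val asJson   = schema.json                // compact JSON string describing the schema
val restored = DataType.fromJson(asJson)  // parse it back into a DataType
assert(restored == schema)
```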
* [SPARK-3831] [SQL] Filter rule Improvement and bool expression optimization. | Kousuke Saruta | 2014-10-08 | 3 | -2/+16
| | | | | | | | | | | | | | | | | | | | | | | If we write the filter which is always FALSE like SELECT * from person WHERE FALSE; 200 tasks will run. I think, 1 task is enough. And current optimizer cannot optimize the case NOT is duplicated like SELECT * from person WHERE NOT ( NOT (age > 30)); The filter rule above should be simplified Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #2692 from sarutak/SPARK-3831 and squashes the following commits: 25f3e20 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3831 23c750c [Kousuke Saruta] Improved unsupported predicate test case a11b9f3 [Kousuke Saruta] Modified NOT predicate test case in PartitionBatchPruningSuite 8ea872b [Kousuke Saruta] Fixed the number of tasks when the data of LocalRelation is empty.
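A toy expression tree showing the double-negation rewrite the optimizer rule performs; these are illustrative classes, not Catalyst's.
```scala
sealed trait Expr
case class Attr(name: String) extends Expr
case class GreaterThan(left: Expr, right: Int) extends Expr
case class Not(child: Expr) extends Expr

def simplify(e: Expr): Expr = e match {
  case Not(Not(inner)) => simplify(inner)     // NOT (NOT x)  =>  x
  case Not(inner)      => Not(simplify(inner))
  case other           => other
}

// simplify(Not(Not(GreaterThan(Attr("age"), 30)))) == GreaterThan(Attr("age"), 30)
```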
* [SPARK-3843][Minor] Cleanup scalastyle.txt at the end of running dev/scalastyle | Kousuke Saruta | 2014-10-08 | 1 | -0/+2
| | | | | | | | | | dev/scalastyle create a log file 'scalastyle.txt'. it is overwrote per running but never deleted even though dev/mima and dev/lint-python delete their log files. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #2702 from sarutak/scalastyle-txt-cleanup and squashes the following commits: d6e238e [Kousuke Saruta] Fixed dev/scalastyle to cleanup scalastyle.txt
* [SPARK-3841] [mllib] Pretty-print params for ML examples | Joseph K. Bradley | 2014-10-08 | 11 | -7/+75
| | | | | | | | | | | | | | | | | | | | | | | | | | Provide a parent class for the Params case classes used in many MLlib examples, where the parent class pretty-prints the case class fields: Param1Name Param1Value Param2Name Param2Value ... Using this class will make it easier to print test settings to logs. Also, updated DecisionTreeRunner to print a little more info. CC: mengxr Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com> Closes #2700 from jkbradley/dtrunner-update and squashes the following commits: cff873f [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dtrunner-update 7a08ae4 [Joseph K. Bradley] code review comment updates b4d2043 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dtrunner-update d8228a7 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dtrunner-update 0fc9c64 [Joseph K. Bradley] Added abstract TestParams class for mllib example parameters 12b7798 [Joseph K. Bradley] Added abstract class TestParams for pretty-printing Params values 5f84f03 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dtrunner-update f7441b6 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dtrunner-update 19eb6fc [Joseph K. Bradley] Updated DecisionTreeRunner to print training time.
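A minimal sketch of the pretty-printing idea. Spark's actual parent class used Scala 2.10 runtime reflection; `productElementNames` below requires Scala 2.13, and the field names are made up.
```scala
case class Params(input: String = "data.txt", maxDepth: Int = 5, numTrees: Int = 10) {
  // One "name<TAB>value" pair per line, as described above.
  override def toString: String =
    productElementNames.zip(productIterator)
      .map { case (name, value) => s"$name\t$value" }
      .mkString("\n")
}

// println(Params()) prints:
//   input	data.txt
//   maxDepth	5
//   numTrees	10
```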
* HOTFIX: Use correct Hadoop profile in build | Patrick Wendell | 2014-10-08 | 1 | -1/+1
|
* [SPARK-3848] yarn alpha doesn't build on master | Kousuke Saruta | 2014-10-08 | 1 | -1/+1
| | | | | | | | | | | | yarn alpha build was broken by #2432 as it added an argument to YarnAllocator but not to yarn/alpha YarnAllocationHandler commit https://github.com/apache/spark/commit/79e45c9323455a51f25ed9acd0edd8682b4bbb88 Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #2715 from sarutak/SPARK-3848 and squashes the following commits: bafb8d1 [Kousuke Saruta] Fixed parameters for the default constructor of alpha/YarnAllocatorHandler.
* [SPARK-3788] [yarn] Fix compareFs to do the right thing for HDFS namespaces. | Marcelo Vanzin | 2014-10-08 | 1 | -19/+12
| | | | | | | | | | | | | | HA and viewfs use namespaces instead of host names, so you can't resolve them since that will fail. So be smarter to avoid doing unnecessary work. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #2649 from vanzin/SPARK-3788 and squashes the following commits: fedbc73 [Marcelo Vanzin] Update comment. c938845 [Marcelo Vanzin] Use Objects.equal() to avoid issues with ==. 9f7b571 [Marcelo Vanzin] [SPARK-3788] [yarn] Fix compareFs to do the right thing for HA, federation.
* [SPARK-3710] Fix Yarn integration tests on Hadoop 2.2. | Marcelo Vanzin | 2014-10-07 | 1 | -0/+51
| | | | | | | | | | | | | | | | It seems some dependencies are not declared when pulling the 2.2 test dependencies, so we need to add them manually for the Yarn cluster to come up. These don't seem to be necessary for 2.3 and beyond, so restrict them to the hadoop-2.2 profile. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #2682 from vanzin/SPARK-3710 and squashes the following commits: 701d4fb [Marcelo Vanzin] Add comment. 0540bdf [Marcelo Vanzin] [SPARK-3710] Fix Yarn integration tests on Hadoop 2.2.
* [SPARK-3836] [REPL] Spark REPL optionally propagate internal exceptions | Ahir Reddy | 2014-10-07 | 1 | -3/+9
| | | | | | | | | | | | Optionally have the repl throw exceptions generated by interpreted code, instead of swallowing the exception and returning it as text output. This is useful when embedding the repl, otherwise it's not possible to know when user code threw an exception. Author: Ahir Reddy <ahirreddy@gmail.com> Closes #2695 from ahirreddy/repl-throw-exceptions and squashes the following commits: bad25ee [Ahir Reddy] Style Fixes f0e5b44 [Ahir Reddy] Fixed style 0d4413d [Ahir Reddy] propogate excetions from repl
* [SPARK-3412] [PySpark] Replace Epydoc with Sphinx to generate Python API docs | Davies Liu | 2014-10-07 | 15 | -206/+167
| | | | | | | | | | | | | | | | | | | | | | Retire Epydoc, use Sphinx to generate API docs. Refine Sphinx docs, also convert some docstrings into Sphinx style. It looks like: ![api doc](https://cloud.githubusercontent.com/assets/40902/4538272/9e2d4f10-4dec-11e4-8d96-6e45a8fe51f9.png) Author: Davies Liu <davies.liu@gmail.com> Closes #2689 from davies/docs and squashes the following commits: bf4a0a5 [Davies Liu] fix links 3fb1572 [Davies Liu] fix _static in jekyll 65a287e [Davies Liu] fix scripts and logo 8524042 [Davies Liu] Merge branch 'master' of github.com:apache/spark into docs d5b874a [Davies Liu] Merge branch 'master' of github.com:apache/spark into docs 4bc1c3c [Davies Liu] refactor 746d0b6 [Davies Liu] @param -> :param 240b393 [Davies Liu] replace epydoc with sphinx doc
* [SPARK-3829] Make Spark logo image on the header of HistoryPage a link to HistoryPage's page #1 | Kousuke Saruta | 2014-10-07 | 1 | -2/+4
| | | | | | | | | | | | | | | There is a Spark logo on the header of HistoryPage. We can have too many HistoryPages if we run 20+ applications, so it's useful if the logo is a link to the HistoryPage's page number 1. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #2690 from sarutak/SPARK-3829 and squashes the following commits: 908c109 [Kousuke Saruta] Removed extra space. 00bfbd7 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3829 dd87480 [Kousuke Saruta] Made header Spark log image as a link to History Server's top page.
* [SPARK-3398] [EC2] Have spark-ec2 intelligently wait for specific cluster states | Nicholas Chammas | 2014-10-07 | 1 | -25/+86
| | | | | | | | | | | | | | | | | | | | | Instead of waiting arbitrary amounts of time for the cluster to reach a specific state, this patch lets `spark-ec2` explicitly wait for a cluster to reach a desired state. This is useful in a couple of situations: * The cluster is launching and you want to wait until SSH is available before installing stuff. * The cluster is being terminated and you want to wait until all the instances are terminated before trying to delete security groups. This patch removes the need for the `--wait` option and removes some of the time-based retry logic that was being used. Author: Nicholas Chammas <nicholas.chammas@gmail.com> Closes #2339 from nchammas/spark-ec2-wait-properly and squashes the following commits: 43a69f0 [Nicholas Chammas] short-circuit SSH check; linear backoff 9a9e035 [Nicholas Chammas] remove extraneous comment 26c5ed0 [Nicholas Chammas] replace print with write() bb67c06 [Nicholas Chammas] deprecate wait option; remove dead code 7969265 [Nicholas Chammas] fix long line (PEP 8) 126e4cf [Nicholas Chammas] wait for specific cluster states
* [SPARK-3832][MLlib] Upgrade Breeze dependency to 0.10 | DB Tsai | 2014-10-07 | 1 | -1/+1
| | | | | | | | | | | In Breeze 0.10, the L1regParam can be configured through anonymous function in OWLQN, and each component can be penalized differently. This is required for GLMNET in MLlib with L1/L2 regularization. https://github.com/scalanlp/breeze/commit/2570911026aa05aa1908ccf7370bc19cd8808a4c Author: DB Tsai <dbtsai@dbtsai.com> Closes #2693 from dbtsai/breeze0.10 and squashes the following commits: 7a0c45c [DB Tsai] In Breeze 0.10, the L1regParam can be configured through anonymous function in OWLQN, and each component can be penalized differently. This is required for GLMNET in MLlib with L1/L2 regularization. https://github.com/scalanlp/breeze/commit/2570911026aa05aa1908ccf7370bc19cd8808a4c
* [SPARK-3486][MLlib][PySpark] PySpark support for Word2Vec | Liquan Pei | 2014-10-07 | 5 | -7/+264
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | mengxr Added PySpark support for Word2Vec Change list (1) PySpark support for Word2Vec (2) SerDe support of string sequence both on python side and JVM side (3) Test for SerDe of string sequence on JVM side Author: Liquan Pei <liquanpei@gmail.com> Closes #2356 from Ishiihara/Word2Vec-python and squashes the following commits: 476ea34 [Liquan Pei] style fixes b13a0b9 [Liquan Pei] resolve merge conflicts and minor fixes 8671eba [Liquan Pei] Merge remote-tracking branch 'upstream/master' into Word2Vec-python daf88a6 [Liquan Pei] modification according to feedback a73fa19 [Liquan Pei] clean up 3d8007b [Liquan Pei] fix findSynonyms for vector 1bdcd2e [Liquan Pei] minor fixes cdef9f4 [Liquan Pei] add missing comments b7447eb [Liquan Pei] modify according to feedback b9a7383 [Liquan Pei] cache words RDD in fit 89490bf [Liquan Pei] add tests and Word2VecModelWrapper 78bbb53 [Liquan Pei] use pickle for seq string SerDe a264b08 [Liquan Pei] Merge remote-tracking branch 'upstream/master' into Word2Vec-python ca1e5ff [Liquan Pei] fix test 68e7276 [Liquan Pei] minor style fixes 48d5e72 [Liquan Pei] Functionality improvement 0ad3ac1 [Liquan Pei] minor fix c867fdf [Liquan Pei] add Word2Vec to pyspark
* [SPARK-3790][MLlib] CosineSimilarity Example | Reza Zadeh | 2014-10-07 | 1 | -0/+107
| | | | | | | | | | | | | | | | | Provide example for `RowMatrix.columnSimilarity()` Author: Reza Zadeh <rizlar@gmail.com> Closes #2622 from rezazadeh/dimsumexample and squashes the following commits: 8f20b82 [Reza Zadeh] update comment 379066d [Reza Zadeh] cache rows 792b81c [Reza Zadeh] Address review comments e573c7a [Reza Zadeh] Average absolute error b15685f [Reza Zadeh] Use scopt. Distribute evaluation. eca3dfd [Reza Zadeh] Documentation ac96fb2 [Reza Zadeh] Compute approximation error, add command line. 4533579 [Reza Zadeh] CosineSimilarity Example
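A hedged sketch of the API the example exercises; an existing SparkContext `sc` is assumed, the input vectors are made up, and the threshold value is arbitrary.
```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 3.0),
  Vectors.dense(4.0, 5.0, 6.0)))
val mat = new RowMatrix(rows)

val exact  = mat.columnSimilarities()     // exact pairwise cosine similarities of columns
val approx = mat.columnSimilarities(0.1)  // DIMSUM sampling with a similarity threshold
```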
* [SPARK-3777] Display "Executor ID" for Tasks in Stage page | zsxwing | 2014-10-07 | 1 | -2/+2
| | | | | | | | | | | | | | | Now the Stage page only displays "Executor"(host) for tasks. However, there may be more than one Executors running in the same host. Currently, when some task is hung, I only know the host of the faulty executor. Therefore I have to check all executors in the host. Adding "Executor ID" in the Tasks table. would be helpful to locate the faulty executor. Here is the new page: ![add_executor_id_for_tasks](https://cloud.githubusercontent.com/assets/1000778/4505774/acb9648c-4afa-11e4-8826-8768a0a60cc9.png) Author: zsxwing <zsxwing@gmail.com> Closes #2642 from zsxwing/SPARK-3777 and squashes the following commits: 37945af [zsxwing] Put Executor ID and Host into one cell 4bbe2c7 [zsxwing] [SPARK-3777] Display "Executor ID" for Tasks in Stage page
* [SPARK-3825] Log more detail when unrolling a block fails | Andrew Or | 2014-10-07 | 2 | -8/+39
| | | | | | | | | | | | | | | | | | | | | | | Before: ``` 14/10/06 16:45:42 WARN CacheManager: Not enough space to cache partition rdd_0_2 in memory! Free memory is 481861527 bytes. ``` After: ``` 14/10/07 11:08:24 WARN MemoryStore: Not enough space to cache rdd_2_0 in memory! (computed 68.8 MB so far) 14/10/07 11:08:24 INFO MemoryStore: Memory use = 1088.0 B (blocks) + 445.1 MB (scratch space shared across 8 thread(s)) = 445.1 MB. Storage limit = 459.5 MB. ``` Author: Andrew Or <andrewor14@gmail.com> Closes #2688 from andrewor14/cache-log-message and squashes the following commits: 28e33d6 [Andrew Or] Shy away from "unrolling" 5638c49 [Andrew Or] Grammar 39a0c28 [Andrew Or] Log more detail when unrolling a block fails
* [SPARK-3731] [PySpark] fix memory leak in PythonRDD | Davies Liu | 2014-10-07 | 1 | -0/+5
| | | | | | | | | | The parent.getOrCompute() of PythonRDD is executed in a separated thread, it should release the memory reserved for shuffle and unrolling finally. Author: Davies Liu <davies.liu@gmail.com> Closes #2668 from davies/leak and squashes the following commits: ae98be2 [Davies Liu] fix memory leak in PythonRDD
* [SPARK-3762] clear reference of SparkEnv after stop | Davies Liu | 2014-10-07 | 9 | -21/+8
| | | | | | | | | | | | | | | | | SparkEnv is cached in a ThreadLocal object, so after stopping a SparkContext and creating a new one, the old SparkEnv is still used by some threads, which triggers many problems. For example, PySpark has problems after restarting the SparkContext, because Py4J uses a thread pool for RPC. This patch clears all the references after a SparkEnv is stopped. cc mateiz tdas pwendell Author: Davies Liu <davies.liu@gmail.com> Closes #2624 from davies/env and squashes the following commits: a69f30c [Davies Liu] deprecate getThreadLocal ba77ca4 [Davies Liu] remove getThreadLocal(), update docs ee62bb7 [Davies Liu] cleanup ThreadLocal of SparnENV 4d0ea8b [Davies Liu] clear reference of SparkEnv after stop
* [SPARK-3808] PySpark fails to start in Windows | Masayoshi TSUZUKI | 2014-10-07 | 1 | -1/+1
| | | | | | | | | | Modified syntax error of *.cmd script. Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp> Closes #2669 from tsudukim/feature/SPARK-3808 and squashes the following commits: 7f804e6 [Masayoshi TSUZUKI] [SPARK-3808] PySpark fails to start in Windows
* [SPARK-3827] Very long RDD names are not rendered properly in web UI | Hossein | 2014-10-07 | 1 | -0/+5
| | | | | | | | | | | | | | With Spark SQL we generate very long RDD names. These names are not properly rendered in the web UI. This PR fixes the rendering issue. [SPARK-3827] #comment Linking PR with JIRA Author: Hossein <hossein@databricks.com> Closes #2687 from falaki/sparkTableUI and squashes the following commits: fd06409 [Hossein] Limit width of cell when RDD name is too long
* [SPARK-3627] - [yarn] - fix exit code and final status reporting to RM | Thomas Graves | 2014-10-07 | 4 | -126/+212
| | | | | | | | | | | | | | | | | | | See the description and whats handled in the jira comment: https://issues.apache.org/jira/browse/SPARK-3627?focusedCommentId=14150013&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14150013 This does not handle yarn client mode reporting of the driver to the AM. I think that should be handled when we make it an unmanaged AM. Author: Thomas Graves <tgraves@apache.org> Closes #2577 from tgravescs/SPARK-3627 and squashes the following commits: 9c2efbf [Thomas Graves] review comments e8cc261 [Thomas Graves] fix accidental typo during fixing comment 24c98e3 [Thomas Graves] rework 85f1901 [Thomas Graves] Merge remote-tracking branch 'upstream/master' into SPARK-3627 fab166d [Thomas Graves] update based on review comments 32f4dfa [Thomas Graves] switch back f0b6519 [Thomas Graves] change order of cleanup staging dir d3cc800 [Thomas Graves] SPARK-3627 - yarn - fix exit code and final status reporting to RM
* [SPARK-3479] [Build] Report failed test category | Nicholas Chammas | 2014-10-06 | 3 | -35/+126
| | | | | | | | | | | | | | | | | | | | | | This PR allows SparkQA (i.e. Jenkins) to report in its posts to GitHub what category of test failed, if one can be determined. The failure categories are: * general failure * RAT checks failed * Scala style checks failed * Python style checks failed * Build failed * Spark unit tests failed * PySpark unit tests failed * MiMa checks failed This PR also fixes the diffing logic used to determine if a patch introduces new classes. Author: Nicholas Chammas <nicholas.chammas@gmail.com> Closes #2606 from nchammas/report-failed-test-category and squashes the following commits: d67df03 [Nicholas Chammas] report what test category failed
* [SPARK-3773][PySpark][Doc] Sphinx build warning | cocoatomo | 2014-10-06 | 6 | -23/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When building Sphinx documents for PySpark, we have 12 warnings. Their causes are almost docstrings in broken ReST format. To reproduce this issue, we should run following commands on the commit: 6e27cb630de69fa5acb510b4e2f6b980742b1957. ```bash $ cd ./python/docs $ make clean html ... /Users/<user>/MyRepos/Scala/spark/python/pyspark/__init__.py:docstring of pyspark.SparkContext.sequenceFile:4: ERROR: Unexpected indentation. /Users/<user>/MyRepos/Scala/spark/python/pyspark/__init__.py:docstring of pyspark.RDD.saveAsSequenceFile:4: ERROR: Unexpected indentation. /Users/<user>/MyRepos/Scala/spark/python/pyspark/mllib/classification.py:docstring of pyspark.mllib.classification.LogisticRegressionWithSGD.train:14: ERROR: Unexpected indentation. /Users/<user>/MyRepos/Scala/spark/python/pyspark/mllib/classification.py:docstring of pyspark.mllib.classification.LogisticRegressionWithSGD.train:16: WARNING: Definition list ends without a blank line; unexpected unindent. /Users/<user>/MyRepos/Scala/spark/python/pyspark/mllib/classification.py:docstring of pyspark.mllib.classification.LogisticRegressionWithSGD.train:17: WARNING: Block quote ends without a blank line; unexpected unindent. /Users/<user>/MyRepos/Scala/spark/python/pyspark/mllib/classification.py:docstring of pyspark.mllib.classification.SVMWithSGD.train:14: ERROR: Unexpected indentation. /Users/<user>/MyRepos/Scala/spark/python/pyspark/mllib/classification.py:docstring of pyspark.mllib.classification.SVMWithSGD.train:16: WARNING: Definition list ends without a blank line; unexpected unindent. /Users/<user>/MyRepos/Scala/spark/python/pyspark/mllib/classification.py:docstring of pyspark.mllib.classification.SVMWithSGD.train:17: WARNING: Block quote ends without a blank line; unexpected unindent. /Users/<user>/MyRepos/Scala/spark/python/docs/pyspark.mllib.rst:50: WARNING: missing attribute mentioned in :members: or __all__: module pyspark.mllib.regression, attribute RidgeRegressionModelLinearRegressionWithSGD /Users/<user>/MyRepos/Scala/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.DecisionTreeModel.predict:3: ERROR: Unexpected indentation. ... checking consistency... /Users/<user>/MyRepos/Scala/spark/python/docs/modules.rst:: WARNING: document isn't included in any toctree ... copying static files... WARNING: html_static_path entry u'/Users/<user>/MyRepos/Scala/spark/python/docs/_static' does not exist ... build succeeded, 12 warnings. ``` Author: cocoatomo <cocoatomo77@gmail.com> Closes #2653 from cocoatomo/issues/3773-sphinx-build-warnings and squashes the following commits: 6f65661 [cocoatomo] [SPARK-3773][PySpark][Doc] Sphinx build warning
* [SPARK-3786] [PySpark] speedup tests | Davies Liu | 2014-10-06 | 4 | -91/+82
| | | | | | | | | | | | | | | | | | | | | | | | | This patch tries to speed up the PySpark tests: it re-uses the SparkContext in tests.py and mllib/tests.py to reduce the overhead of creating a SparkContext, and removes some test cases that did not make sense. It also improves the performance of some cases, such as MergerTests and SortTests. Before this patch: real 21m27.320s user 4m42.967s sys 0m17.343s After this patch: real 9m47.541s user 2m12.947s sys 0m14.543s It almost cuts the time in half. Author: Davies Liu <davies.liu@gmail.com> Closes #2646 from davies/tests and squashes the following commits: c54de60 [Davies Liu] revert change about memory limit 6a2a4b0 [Davies Liu] refactor of tests, speedup 100%
* [SPARK-2461] [PySpark] Add a toString method to GeneralizedLinearModel | Sandy Ryza | 2014-10-06 | 4 | -4/+9
| | | | | | | | | | | | | | | | | | | | | | Add a toString method to GeneralizedLinearModel, also change `__str__` to `__repr__` for some classes, to provide better message in repr. This PR is based on #1388, thanks to sryza! closes #1388 Author: Sandy Ryza <sandy@cloudera.com> Author: Davies Liu <davies.liu@gmail.com> Closes #2625 from davies/string and squashes the following commits: 3544aad [Davies Liu] fix LinearModel 0bcd642 [Davies Liu] Merge branch 'sandy-spark-2461' of github.com:sryza/spark 1ce5c2d [Sandy Ryza] __repr__ back to __str__ in a couple places aa9e962 [Sandy Ryza] Switch __str__ to __repr__ a0c5041 [Sandy Ryza] Add labels back in 1aa17f5 [Sandy Ryza] Match existing conventions fac1bc4 [Sandy Ryza] Fix PEP8 error f7b58ed [Sandy Ryza] SPARK-2461. Add a toString method to GeneralizedLinearModel
* [SPARK-3765][Doc] Add test information to sbt build docs | scwf | 2014-10-05 | 1 | -0/+15
| | | | | | | | | | Add testing with sbt to doc ```building-spark.md``` Author: scwf <wangfei1@huawei.com> Closes #2629 from scwf/sbt-doc and squashes the following commits: fd9cf29 [scwf] add testing with sbt to docs