aboutsummaryrefslogtreecommitdiff
Commit message (Collapse)AuthorAgeFilesLines
...
* [STREAMING] [MINOR] Keep streaming.UIUtils privateAndrew Or2015-05-131-1/+1
| | | | | | | | | | | | | zsxwing Author: Andrew Or <andrew@databricks.com> Closes #6134 from andrewor14/private-streaming-uiutils and squashes the following commits: 225df94 [Andrew Or] Privatize class (cherry picked from commit bb6dec3b160b54488892a509965fee70a530deff) Signed-off-by: Andrew Or <andrew@databricks.com>
* [SPARK-7502] DAG visualization: gracefully handle removed stagesAndrew Or2015-05-133-8/+25
| | | | | | | | | | | | | | | | | | | | | Old stages are removed without much feedback to the user. This happens very often in streaming. See screenshots below for more detail. zsxwing **Before** <img src="https://cloud.githubusercontent.com/assets/2133137/7621031/643cc1e0-f978-11e4-8f42-09decaac44a7.png" width="500px"/> ------------------------- **After** <img src="https://cloud.githubusercontent.com/assets/2133137/7621037/6e37348c-f978-11e4-84a5-e44e154f9b13.png" width="400px"/> Author: Andrew Or <andrew@databricks.com> Closes #6132 from andrewor14/dag-viz-remove-gracefully and squashes the following commits: 43175cd [Andrew Or] Handle removed jobs and stages gracefully (cherry picked from commit aa1837875a3febad2f22b91a294f91749852b42f) Signed-off-by: Andrew Or <andrew@databricks.com>
* [SPARK-7464] DAG visualization: highlight the same RDDs on hoverAndrew Or2015-05-133-16/+37
| | | | | | | | | | | | | | | | This is pretty useful for MLlib. <img src="https://cloud.githubusercontent.com/assets/2133137/7599650/c7d03dd8-f8b8-11e4-8c0a-0a89e786c90f.png" width="400px"/> Author: Andrew Or <andrew@databricks.com> Closes #6100 from andrewor14/dag-viz-hover and squashes the following commits: fefe2af [Andrew Or] Link tooltips for nodes that belong to the same RDD 90c6a7e [Andrew Or] Assign classes to clusters and nodes, not IDs (cherry picked from commit 44403414d3e754f7b991c0bbeb4868edb4135aa2) Signed-off-by: Andrew Or <andrew@databricks.com>
* [SPARK-7399] Spark compilation error for scala 2.11Andrew Or2015-05-133-12/+14
| | | | | | | | | | | | | Subsequent fix following #5966. I tried this out locally. Author: Andrew Or <andrew@databricks.com> Closes #6129 from andrewor14/211-compilation and squashes the following commits: 713868f [Andrew Or] Fix compilation issue for scala 2.11 (cherry picked from commit f88ac701552a1a854247509db49d78f13515eae4) Signed-off-by: Andrew Or <andrew@databricks.com>
* [SPARK-7608] Clean up old state in RDDOperationGraphListenerAndrew Or2015-05-132-9/+108
| | | | | | | | | | | | | | This is necessary for streaming and long-running Spark applications. zsxwing tdas Author: Andrew Or <andrew@databricks.com> Closes #6125 from andrewor14/viz-listener-leak and squashes the following commits: 8660949 [Andrew Or] Fix thing + add tests 33c0843 [Andrew Or] Clean up old job state (cherry picked from commit f6e18388d993d99f768c6d547327e0720ec64224) Signed-off-by: Andrew Or <andrew@databricks.com>
* [SQL] Move some classes into packages that are more appropriate.Reynold Xin2015-05-1324-57/+80
| | | | | | | | | | | | | | | | | | | JavaTypeInference into catalyst types.DateUtils into catalyst CacheManager into execution DefaultParserDialect into catalyst Author: Reynold Xin <rxin@databricks.com> Closes #6108 from rxin/sql-rename and squashes the following commits: 3fc9613 [Reynold Xin] Fixed import ordering. 83d9ff4 [Reynold Xin] Fixed codegen tests. e271e86 [Reynold Xin] mima f4e24a6 [Reynold Xin] [SQL] Move some classes into packages that are more appropriate. (cherry picked from commit e683182c3e6347afdac0e5658487f80e5e054ef4) Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SPARK-7303] [SQL] push down project if possible when the child is sortscwf2015-05-132-1/+40
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Optimize the case of `project(_, sort)` , a example is: `select key from (select * from testData order by key) t` before this PR: ``` == Parsed Logical Plan == 'Project ['key] 'Subquery t 'Sort ['key ASC], true 'Project [*] 'UnresolvedRelation [testData], None == Analyzed Logical Plan == Project [key#0] Subquery t Sort [key#0 ASC], true Project [key#0,value#1] Subquery testData LogicalRDD [key#0,value#1], MapPartitionsRDD[1] == Optimized Logical Plan == Project [key#0] Sort [key#0 ASC], true LogicalRDD [key#0,value#1], MapPartitionsRDD[1] == Physical Plan == Project [key#0] Sort [key#0 ASC], true Exchange (RangePartitioning [key#0 ASC], 5), [] PhysicalRDD [key#0,value#1], MapPartitionsRDD[1] ``` after this PR ``` == Parsed Logical Plan == 'Project ['key] 'Subquery t 'Sort ['key ASC], true 'Project [*] 'UnresolvedRelation [testData], None == Analyzed Logical Plan == Project [key#0] Subquery t Sort [key#0 ASC], true Project [key#0,value#1] Subquery testData LogicalRDD [key#0,value#1], MapPartitionsRDD[1] == Optimized Logical Plan == Sort [key#0 ASC], true Project [key#0] LogicalRDD [key#0,value#1], MapPartitionsRDD[1] == Physical Plan == Sort [key#0 ASC], true Exchange (RangePartitioning [key#0 ASC], 5), [] Project [key#0] PhysicalRDD [key#0,value#1], MapPartitionsRDD[1] ``` with this rule we will first do column pruning on the table and then do sorting. Author: scwf <wangfei1@huawei.com> This patch had conflicts when merged, resolved by Committer: Michael Armbrust <michael@databricks.com> Closes #5838 from scwf/pruning and squashes the following commits: b00d833 [scwf] address michael's comment e230155 [scwf] fix tests failure b09b895 [scwf] improve column pruning (cherry picked from commit 59250fe51486908f9e3f3d9ef10aadbcb9b4d62d) Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SPARK-7382] [MLLIB] Feature Parity in PySpark for ml.classificationBurak Yavuz2015-05-133-10/+501
| | | | | | | | | | | | | | | | The missing pieces in ml.classification for Python! cc mengxr Author: Burak Yavuz <brkyvz@gmail.com> Closes #6106 from brkyvz/ml-class and squashes the following commits: dd78237 [Burak Yavuz] fix style 1048e29 [Burak Yavuz] ready for PR (cherry picked from commit df2fb1305aba6781017b0973b0965b664f835e31) Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-7545] [MLLIB] Added check in Bernoulli Naive Bayes to make sure that ↵leahmcguire2015-05-132-3/+58
| | | | | | | | | | | | | | | | | | | | | | | both training and predict features have values of 0 or 1 Author: leahmcguire <lmcguire@salesforce.com> Closes #6073 from leahmcguire/binaryCheckNB and squashes the following commits: b8442c2 [leahmcguire] changed to if else for value checks 911bf83 [leahmcguire] undid reformat 4eedf1e [leahmcguire] moved bernoulli check 9ee9e84 [leahmcguire] fixed style error 3f3b32c [leahmcguire] fixed zero one check so only called in combiner 831fd27 [leahmcguire] got test working f44bb3c [leahmcguire] removed changes from CV branch 67253f0 [leahmcguire] added check to bernoulli to ensure feature values are zero or one f191c71 [leahmcguire] fixed name 58d060b [leahmcguire] changed param name and test according to comments 04f0d3c [leahmcguire] Added stats from cross validation as a val in the cross validation model to save them for user access (cherry picked from commit 61e05fc58e1245de871c409b60951745b5db3420) Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
* [SPARK-7593] [ML] Python Api for ml.feature.BucketizerBurak Yavuz2015-05-133-2/+92
| | | | | | | | | | | | | | | | Added `ml.feature.Bucketizer` to PySpark. cc mengxr Author: Burak Yavuz <brkyvz@gmail.com> Closes #6124 from brkyvz/ml-bucket and squashes the following commits: 05285be [Burak Yavuz] added sphinx doc 6abb6ed [Burak Yavuz] added support for Bucketizer (cherry picked from commit 5db18ba6e1bd8c6307c41549176c53590cf344a0) Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-7551][DataFrame] support backticks for DataFrame attribute resolutionWenchen Fan2015-05-133-4/+82
| | | | | | | | | | | | | | Author: Wenchen Fan <cloud0fan@outlook.com> Closes #6074 from cloud-fan/7551 and squashes the following commits: e6f579e [Wenchen Fan] allow space 2b86699 [Wenchen Fan] handle blank e218d99 [Wenchen Fan] address comments 54c4209 [Wenchen Fan] fix 7551 (cherry picked from commit 213a6f30fee4a1c416ea76b678c71877fd36ef18) Signed-off-by: Reynold Xin <rxin@databricks.com>
* [SPARK-7567] [SQL] Migrating Parquet data source to FSBasedRelationCheng Lian2015-05-1316-1090/+926
| | | | | | | | | | | | | | | | | | | | | | | | | | | | This PR migrates Parquet data source to the newly introduced `FSBasedRelation`. `FSBasedParquetRelation` is created to replace `ParquetRelation2`. Major differences are: 1. Partition discovery code has been factored out to `FSBasedRelation` 1. `AppendingParquetOutputFormat` is not used now. Instead, an anonymous subclass of `ParquetOutputFormat` is used to handle appending and writing dynamic partitions 1. When scanning partitioned tables, `FSBasedParquetRelation.buildScan` only builds an `RDD[Row]` for a single selected partition 1. `FSBasedParquetRelation` doesn't rely on Catalyst expressions for filter push down, thus it doesn't extend `CatalystScan` anymore After migrating `JSONRelation` (which extends `CatalystScan`), we can remove `CatalystScan`. <!-- Reviewable:start --> [<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/6090) <!-- Reviewable:end --> Author: Cheng Lian <lian@databricks.com> Closes #6090 from liancheng/parquet-migration and squashes the following commits: 6063f87 [Cheng Lian] Casts to OutputCommitter rather than FileOutputCommtter bfd1cf0 [Cheng Lian] Fixes compilation error introduced while rebasing f9ea56e [Cheng Lian] Adds ParquetRelation2 related classes to MiMa check whitelist 261d8c1 [Cheng Lian] Minor bug fix and more tests db65660 [Cheng Lian] Migrates Parquet data source to FSBasedRelation (cherry picked from commit 7ff16e8abef9fbf4a4855e23c256b22e62e560a6) Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SPARK-7589] [STREAMING] [WEBUI] Make "Input Rate" in the Streaming page ↵zsxwing2015-05-133-19/+30
| | | | | | | | | | | | | | | | | | consistent with other pages This PR makes "Input Rate" in the Streaming page consistent with Job and Stage pages. ![screen shot 2015-05-12 at 5 03 35 pm](https://cloud.githubusercontent.com/assets/1000778/7601444/f943f8ac-f8ca-11e4-8280-a715d814f434.png) ![screen shot 2015-05-12 at 5 07 25 pm](https://cloud.githubusercontent.com/assets/1000778/7601445/f9571c0c-f8ca-11e4-9b12-9317cb55c002.png) Author: zsxwing <zsxwing@gmail.com> Closes #6102 from zsxwing/SPARK-7589 and squashes the following commits: 2745225 [zsxwing] Make "Input Rate" in the Streaming page consistent with other pages (cherry picked from commit bec938f777a2e18757c7d04504d86a5342e2b49e) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
* [SPARK-6734] [SQL] Add UDTF.close support in GenerateCheng Hao2015-05-147-13/+74
| | | | | | | | | | | | | | | Some third-party UDTF extensions generate additional rows in the "GenericUDTF.close()" method, which is supported / documented by Hive. https://cwiki.apache.org/confluence/display/Hive/DeveloperGuide+UDTF However, Spark SQL ignores the "GenericUDTF.close()", and it causes bug while porting job from Hive to Spark SQL. Author: Cheng Hao <hao.cheng@intel.com> Closes #5383 from chenghao-intel/udtf_close and squashes the following commits: 98b4e4b [Cheng Hao] Support UDTF.close (cherry picked from commit 0da254fb2903c01e059fa7d0dc81df5740312b35) Signed-off-by: Cheng Lian <lian@databricks.com>
* [MINOR] [SQL] Removes debugging printlnCheng Lian2015-05-131-2/+0
| | | | | | | | | | | Author: Cheng Lian <lian@databricks.com> Closes #6123 from liancheng/remove-println and squashes the following commits: 03356b6 [Cheng Lian] Removes debugging println (cherry picked from commit aa6ba3f2166edcc8bcda3abc70482fa8605e83b7) Signed-off-by: Cheng Lian <lian@databricks.com>
* [SQL] In InsertIntoFSBasedRelation.insert, log cause before abort job/task.Yin Huai2015-05-131-0/+2
| | | | | | | | | | | | | | | We need to add a log entry before calling `abortTask`/`abortJob`. Otherwise, an exception from `abortTask`/`abortJob` will shadow the real cause. cc liancheng Author: Yin Huai <yhuai@databricks.com> Closes #6105 from yhuai/logCause and squashes the following commits: 8dfe0d8 [Yin Huai] Log cause. (cherry picked from commit b061bd517a3dc26e7f37a334f49c3465d98334c6) Signed-off-by: Cheng Lian <lian@databricks.com>
* [SPARK-7599] [SQL] Don't restrict customized output committers to be ↵Cheng Lian2015-05-131-8/+12
| | | | | | | | | | | | | subclasses of FileOutputCommitter Author: Cheng Lian <lian@databricks.com> Closes #6118 from liancheng/spark-7599 and squashes the following commits: 31e1bd6 [Cheng Lian] Don't restrict customized output committers to be subclasses of FileOutputCommitter (cherry picked from commit 10c546e9d42a0f3fbf45c919e74f62c548ca8347) Signed-off-by: Yin Huai <yhuai@databricks.com>
* [SPARK-6568] spark-shell.cmd --jars option does not accept the jar that has ↵Masayoshi TSUZUKI2015-05-135-80/+89
| | | | | | | | | | | | | | | | | | | | | | | | | | | | space in its path escape spaces in the arguments. Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp> Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #5447 from tsudukim/feature/SPARK-6568-2 and squashes the following commits: 3f9a188 [Masayoshi TSUZUKI] modified some errors. ed46047 [Masayoshi TSUZUKI] avoid scalastyle errors. 1784239 [Masayoshi TSUZUKI] removed Utils.formatPath. e03f289 [Masayoshi TSUZUKI] removed testWindows from Utils.resolveURI and Utils.resolveURIs. replaced SystemUtils.IS_OS_WINDOWS to Utils.isWindows. removed Utils.formatPath from PythonRunner.scala. 84c33d0 [Masayoshi TSUZUKI] - use resolveURI in nonLocalPaths - run tests for Windows path only on Windows 016128d [Masayoshi TSUZUKI] fixed to use File.toURI() 2c62e3b [Masayoshi TSUZUKI] Merge pull request #1 from sarutak/SPARK-6568-2 7019a8a [Masayoshi TSUZUKI] Merge branch 'master' of https://github.com/apache/spark into feature/SPARK-6568-2 45946ee [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-6568-2 10f1c73 [Kousuke Saruta] Added a comment 93c3c40 [Kousuke Saruta] Merge branch 'classpath-handling-fix' of github.com:sarutak/spark into SPARK-6568-2 649da82 [Kousuke Saruta] Fix classpath handling c7ba6a7 [Masayoshi TSUZUKI] [SPARK-6568] spark-shell.cmd --jars option does not accept the jar that has space in its path (cherry picked from commit 50c72708015fba15d0e78946f1f4ec262776bc38) Signed-off-by: Sean Owen <sowen@cloudera.com>
* [SPARK-7526] [SPARKR] Specify ip of RBackend, MonitorServer and RRDD Socket ↵linweizhong2015-05-122-6/+6
| | | | | | | | | | | | | | | server These R process only used to communicate with JVM process on local, so binding to localhost is more reasonable then wildcard ip. Author: linweizhong <linweizhong@huawei.com> Closes #6053 from Sephiroth-Lin/spark-7526 and squashes the following commits: 5303af7 [linweizhong] bind to localhost rather than wildcard ip (cherry picked from commit 98195c3031fe60683bb25840f135458d5d0e52c5) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
* [SPARK-7482] [SPARKR] Rename some DataFrame API methods in SparkR to match ↵Sun Rui2015-05-126-49/+71
| | | | | | | | | | | | | | | their counterparts in Scala. Author: Sun Rui <rui.sun@intel.com> Closes #6007 from sun-rui/SPARK-7482 and squashes the following commits: 5c5cf5e [Sun Rui] Implement alias loadDF() as a new function. 3a30c10 [Sun Rui] Rename load()/save() to read.df()/write.df(). Also add loadDF()/saveDF() as aliases. 9f569d6 [Sun Rui] [SPARK-7482][SparkR] Rename some DataFrame API methods in SparkR to match their counterparts in Scala. (cherry picked from commit df9b94a57cbd0e028228059d215b446d59d25ba8) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
* [SPARK-7566][SQL] Add type to HiveContext.analyzerSantiago M. Mola2015-05-121-1/+1
| | | | | | | | | | | | | This makes HiveContext.analyzer overrideable. Author: Santiago M. Mola <santi@mola.io> Closes #6086 from smola/patch-3 and squashes the following commits: 8ece136 [Santiago M. Mola] [SPARK-7566][SQL] Add type to HiveContext.analyzer (cherry picked from commit 208b902257bbfb85bf8cadfc942b7134ad690f8b) Signed-off-by: Reynold Xin <rxin@databricks.com>
* [SPARK-7321][SQL] Add Column expression for conditional statements ↵Reynold Xin2015-05-126-2/+163
| | | | | | | | | | | | | | | | | | | | | | | | | | | | (when/otherwise) This builds on https://github.com/apache/spark/pull/5932 and should close https://github.com/apache/spark/pull/5932 as well. As an example: ```python df.select(when(df['age'] == 2, 3).otherwise(4).alias("age")).collect() ``` Author: Reynold Xin <rxin@databricks.com> Author: kaka1992 <kaka_1992@163.com> Closes #6072 from rxin/when-expr and squashes the following commits: 8f49201 [Reynold Xin] Throw exception if otherwise is applied twice. 0455eda [Reynold Xin] Reset run-tests. bfb9d9f [Reynold Xin] Updated documentation and test cases. 762f6a5 [Reynold Xin] Merge pull request #5932 from kaka1992/IFCASE 95724c6 [kaka1992] Update 8218d0a [kaka1992] Update 801009e [kaka1992] Update 76d6346 [kaka1992] [SPARK-7321][SQL] Add Column expression for conditional statements (if, case) (cherry picked from commit 97dee313f23b00f15638cb72a4a80c1f197f8a9d) Signed-off-by: Reynold Xin <rxin@databricks.com>
* [SPARK-7588] Document all SQL/DataFrame public methods with @since tagReynold Xin2015-05-1217-26/+706
| | | | | | | | | | | | | This pull request adds since tag to all public methods/classes in SQL/DataFrame to indicate which version the methods/classes were first added. Author: Reynold Xin <rxin@databricks.com> Closes #6101 from rxin/tbc and squashes the following commits: ed55e11 [Reynold Xin] Add since version to all DataFrame methods. (cherry picked from commit 8fd55358b7fc1c7545d823bef7b39769f731c1ee) Signed-off-by: Reynold Xin <rxin@databricks.com>
* [HOTFIX] Use the old Job API to support old Hadoop versionszsxwing2015-05-131-1/+1
| | | | | | | | | | | | | | | #5526 uses `Job.getInstance`, which does not exist in the old Hadoop versions. Just use `new Job` to replace it. cc liancheng Author: zsxwing <zsxwing@gmail.com> Closes #6095 from zsxwing/hotfix and squashes the following commits: b0c2049 [zsxwing] Use the old Job API to support old Hadoop versions (cherry picked from commit 247b70349c1e4413657359d626d92e0ffbc2b7f1) Signed-off-by: Cheng Lian <lian@databricks.com>
* [SPARK-7572] [MLLIB] do not import Param/Params under pyspark.mlXiangrui Meng2015-05-123-7/+11
| | | | | | | | | | | | | Remove `Param` and `Params` from `pyspark.ml` and add a section in the doc. brkyvz Author: Xiangrui Meng <meng@databricks.com> Closes #6094 from mengxr/SPARK-7572 and squashes the following commits: 022abd6 [Xiangrui Meng] do not import Param/Params under spark.ml (cherry picked from commit 77f64c736d07a44f64393910d092091e8ba6047a) Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-7554] [STREAMING] Throw exception when an active/stopped ↵Tathagata Das2015-05-123-3/+59
| | | | | | | | | | | | | StreamingContext is used to create DStreams and output operations Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #6099 from tdas/SPARK-7554 and squashes the following commits: 2cd4158 [Tathagata Das] Throw exceptions on attempts to add stuff to active and stopped contexts. (cherry picked from commit 23f7d66d51c8809ebc27bfbce3d95515e9b34c2e) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
* [SPARK-7528] [MLLIB] make RankingMetrics Java-friendlyXiangrui Meng2015-05-122-4/+87
| | | | | | | | | | | | | `RankingMetrics` contains a ClassTag, which is hard to create in Java. This PR adds a factory method `of` for Java users. coderxiang Author: Xiangrui Meng <meng@databricks.com> Closes #6098 from mengxr/SPARK-7528 and squashes the following commits: e5d57ae [Xiangrui Meng] make RankingMetrics Java-friendly (cherry picked from commit 2713bc65af1e0e81edd5fad0338e34fd127391f9) Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-7553] [STREAMING] Added methods to maintain a singleton StreamingContextTathagata Das2015-05-122-11/+202
| | | | | | | | | | | | | | | | | | | | | | | | | | | | In a REPL/notebook environment, its very easy to lose a reference to a StreamingContext by overriding the variable name. So if you happen to execute the following commands ``` val ssc = new StreamingContext(...) // cmd 1 ssc.start() // cmd 2 ... val ssc = new StreamingContext(...) // accidentally run cmd 1 again ``` The value of ssc will be overwritten. Now you can neither start the new context (as only one context can be started), nor stop the previous context (as the reference is lost). Hence its best to maintain a singleton reference to the active context, so that we never loose reference for the active context. Since this problem occurs useful in REPL environments, its best to add this as an Experimental support in the Scala API only so that it can be used in Scala REPLs and notebooks. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #6070 from tdas/SPARK-7553 and squashes the following commits: 731c9a1 [Tathagata Das] Fixed style a797171 [Tathagata Das] Added more unit tests 19fc70b [Tathagata Das] Added :: Experimental :: in docs 64706c9 [Tathagata Das] Fixed test 634db5d [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into SPARK-7553 3884a25 [Tathagata Das] Fixing test bug d37a846 [Tathagata Das] Added getActive and getActiveOrCreate (cherry picked from commit 00e7b09a0bee2fcfd0ce34992bd26435758daf26) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
* [SPARK-7573] [ML] OneVsRest cleanupsJoseph K. Bradley2015-05-123-31/+23
| | | | | | | | | | | | | | | | | | | | | Minor cleanups discussed with [~mengxr]: * move OneVsRest from reduction to classification sub-package * make model constructor private Some doc cleanups too CC: harsha2010 Could you please verify this looks OK? Thanks! Author: Joseph K. Bradley <joseph@databricks.com> Closes #6097 from jkbradley/onevsrest-cleanup and squashes the following commits: 4ecd48d [Joseph K. Bradley] org imports 430b065 [Joseph K. Bradley] moved OneVsRest from reduction subpackage to classification. small java doc style fixes 9f8b9b9 [Joseph K. Bradley] Small cleanups to OneVsRest. Made model constructor private to ml package. (cherry picked from commit 96c4846db89802f5a81dca5dcfa3f2a0f72b5cb8) Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-7557] [ML] [DOC] User guide for spark.ml HashingTF, TokenizerJoseph K. Bradley2015-05-123-0/+278
| | | | | | | | | | | | | | | | | | | | | Added feature transformer subsection to spark.ml guide, with HashingTF and Tokenizer. Added JavaHashingTFSuite to test Java examples in new guide. I've run Scala, Python examples in the Spark/PySpark shells. I ran the Java examples via the test suite (with small modifications for printing). CC: mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #6093 from jkbradley/hashingtf-guide and squashes the following commits: d5d213f [Joseph K. Bradley] small fix dd6e91a [Joseph K. Bradley] fixes from code review of user guide 33c3ff9 [Joseph K. Bradley] small fix bc6058c [Joseph K. Bradley] fix link 361a174 [Joseph K. Bradley] Added subsection for feature transformers to spark.ml guide, with HashingTF and Tokenizer. Added JavaHashingTFSuite to test Java examples in new guide (cherry picked from commit f0c1bc3472a7422ae5649634f29c88e161f5ecaf) Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-7496] [MLLIB] Update Programming guide with Online LDAYuhao Yang2015-05-121-3/+3
| | | | | | | | | | | | | | | | jira: https://issues.apache.org/jira/browse/SPARK-7496 Update LDA subsection of clustering section of MLlib programming guide to include OnlineLDA. Author: Yuhao Yang <hhbyyh@gmail.com> Closes #6046 from hhbyyh/ldaDocument and squashes the following commits: 4b6fbfa [Yuhao Yang] add online paper and some comparison fd4c983 [Yuhao Yang] update lda document for optimizers (cherry picked from commit 1d703660d4d14caea697affdf31170aea44c8903) Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
* [SPARK-7406] [STREAMING] [WEBUI] Add tooltips for "Scheduling Delay", ↵zsxwing2015-05-124-14/+21
| | | | | | | | | | | | | | | | | | | | | | "Processing Time" and "Total Delay" Screenshots: ![screen shot 2015-05-06 at 2 29 03 pm](https://cloud.githubusercontent.com/assets/1000778/7504129/9c57f710-f3fc-11e4-9c6e-1b79c17c546d.png) ![screen shot 2015-05-06 at 2 24 35 pm](https://cloud.githubusercontent.com/assets/1000778/7504140/b63bb216-f3fc-11e4-83a5-6dfc6481d192.png) tdas as we discussed offline Author: zsxwing <zsxwing@gmail.com> Closes #5952 from zsxwing/SPARK-7406 and squashes the following commits: 2b004ea [zsxwing] Merge branch 'master' into SPARK-7406 e9eb506 [zsxwing] Update tooltip contents 2215b2a [zsxwing] Add tooltips for "Scheduling Delay", "Processing Time" and "Total Delay" (cherry picked from commit 1422e79e517ca14a6b0e178f015362d2e0d413c6) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
* [SPARK-7571] [MLLIB] rename Math to mathXiangrui Meng2015-05-129-15/+15
| | | | | | | | | | | | | `scala.Math` is deprecated since 2.8. This PR only touchs `Math` usages in MLlib. dbtsai Author: Xiangrui Meng <meng@databricks.com> Closes #6092 from mengxr/SPARK-7571 and squashes the following commits: fe8f8d3 [Xiangrui Meng] Math -> math (cherry picked from commit a4874b0d1820efd24071108434a4d89429473fe3) Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-7484][SQL]Support jdbc connection propertiesVenkata Ramana Gollamudi2015-05-124-31/+148
| | | | | | | | | | | | | | | | | Few jdbc drivers like SybaseIQ support passing username and password only through connection properties. So the same needs to be supported for SQLContext.jdbc, dataframe.createJDBCTable and dataframe.insertIntoJDBC. Added as default arguments or overrided function to support backward compatability. Author: Venkata Ramana Gollamudi <ramana.gollamudi@huawei.com> Closes #6009 from gvramana/add_jdbc_conn_properties and squashes the following commits: 396a0d0 [Venkata Ramana Gollamudi] fixed comments d66dd8c [Venkata Ramana Gollamudi] fixed comments 1b8cd8c [Venkata Ramana Gollamudi] Support jdbc connection properties (cherry picked from commit 455551d1c6cc206ffe1ff5ac52ca0ed89c61653d) Signed-off-by: Reynold Xin <rxin@databricks.com>
* [SPARK-7559] [MLLIB] Bucketizer should include the right most boundary in ↵Xiangrui Meng2015-05-122-39/+41
| | | | | | | | | | | | | | | | | | | the last bucket. We make special treatment for +inf in `Bucketizer`. This could be simplified by always including the largest split value in the last bucket. E.g., (x1, x2, x3) defines buckets [x1, x2) and [x2, x3]. This shouldn't affect user code much, and there are applications that need to include the right-most value. For example, we can bucketize ratings from 0 to 10 to bad, neutral, and good with splits 0, 4, 6, 10. It may reads weird if the users need to put 0, 4, 6, 10.1 (or 11). This also update the impl to use `Arrays.binarySearch` and `withClue` in test. yinxusen jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #6075 from mengxr/SPARK-7559 and squashes the following commits: e28f910 [Xiangrui Meng] update bucketizer impl (cherry picked from commit 23b9863e2aa7ecd0c4fa3aa8a59fdae09b4fe1d7) Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
* [SPARK-7569][SQL] Better error for invalid binary expressionsMichael Armbrust2015-05-121-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | `scala> Seq((1,1)).toDF("a", "b").select(lit(1) + new java.sql.Date(1)) ` Before: ``` org.apache.spark.sql.AnalysisException: invalid expression (1 + 0) between Literal 1, IntegerType and Literal 0, DateType; ``` After: ``` org.apache.spark.sql.AnalysisException: invalid expression (1 + 0) between int and date; ``` Author: Michael Armbrust <michael@databricks.com> Closes #6089 from marmbrus/betterBinaryError and squashes the following commits: 23b68ad [Michael Armbrust] [SPARK-7569][SQL] Better error for invalid binary expressions (cherry picked from commit 2a41c0d71a13558f12c6811bf98791e01186f3ad) Signed-off-by: Reynold Xin <rxin@databricks.com>
* [SPARK-7015] [MLLIB] [WIP] Multiclass to Binary Reduction: One Against AllRam Sriharsha2015-05-1210-8/+471
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | initial cut of one against all. test code is a scaffolding , not fully implemented. This WIP is to gather early feedback. Author: Ram Sriharsha <rsriharsha@hw11853.local> Closes #5830 from harsha2010/reduction and squashes the following commits: 5f4b495 [Ram Sriharsha] Fix Test 386e98b [Ram Sriharsha] Style fix 49b4a17 [Ram Sriharsha] Simplify the test 02279cc [Ram Sriharsha] Output Label Metadata in Prediction Col bc78032 [Ram Sriharsha] Code Review Updates 8ce4845 [Ram Sriharsha] Merge with Master 2a807be [Ram Sriharsha] Merge branch 'master' into reduction e21bfcc [Ram Sriharsha] Style Fix 5614f23 [Ram Sriharsha] Style Fix c75583a [Ram Sriharsha] Cleanup 7a5f136 [Ram Sriharsha] Fix TODOs 804826b [Ram Sriharsha] Merge with Master 1448a5f [Ram Sriharsha] Style Fix 6e47807 [Ram Sriharsha] Style Fix d63e46b [Ram Sriharsha] Incorporate Code Review Feedback ced68b5 [Ram Sriharsha] Refactor OneVsAll to implement Predictor 78fa82a [Ram Sriharsha] extra line 0dfa1fb [Ram Sriharsha] Fix inexhaustive match cases that may arise from UnresolvedAttribute a59a4f4 [Ram Sriharsha] @Experimental 4167234 [Ram Sriharsha] Merge branch 'master' into reduction 868a4fd [Ram Sriharsha] @Experimental 041d905 [Ram Sriharsha] Code Review Fixes df188d8 [Ram Sriharsha] Style fix 612ec48 [Ram Sriharsha] Style Fix 6ef43d3 [Ram Sriharsha] Prefer Unresolved Attribute to Option: Java APIs are cleaner 6bf6bff [Ram Sriharsha] Update OneHotEncoder to new API e29cb89 [Ram Sriharsha] Merge branch 'master' into reduction 1c7fa44 [Ram Sriharsha] Fix Tests ca83672 [Ram Sriharsha] Incorporate Code Review Feedback + Rename to OneVsRestClassifier 221beeed [Ram Sriharsha] Upgrade to use Copy method for cloning Base Classifiers 26f1ddb [Ram Sriharsha] Merge with SPARK-5956 API changes 9738744 [Ram Sriharsha] Merge branch 'master' into reduction 1a3e375 [Ram Sriharsha] More efficient Implementation: Use withColumn to generate label column dynamically 32e0189 [Ram Sriharsha] Restrict reduction to Margin Based Classifiers ff272da [Ram Sriharsha] Style fix 28771f5 [Ram Sriharsha] Add Tests for Multiclass to Binary Reduction b60f874 [Ram Sriharsha] Fix Style issues in Test 3191cdf [Ram Sriharsha] Remove this test, accidental commit 23f056c [Ram Sriharsha] Fix Headers for test 1b5e929 [Ram Sriharsha] Fix Style issues and add Header 8752863 [Ram Sriharsha] [SPARK-7015][MLLib][WIP] Multiclass to Binary Reduction: One Against All (cherry picked from commit 595a67589a42f8025d3e5fd4da413b1faa2e14bf) Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
* [SPARK-2018] [CORE] Upgrade LZF library to fix endian serialization p…Tim Ellison2015-05-121-1/+1
| | | | | | | | | | | | | | | | | | | …roblem Pick up newer version of dependency with fix for SPARK-2018. The update involved patching the ning/compress LZF library to handle big endian systems correctly. Credit goes to gireeshpunathil for diagnosing the problem, and cowtowncoder for fixing it. Spark tests run clean for me. Author: Tim Ellison <t.p.ellison@gmail.com> Closes #6077 from tellison/UpgradeLZF and squashes the following commits: ad8d4ef [Tim Ellison] [SPARK-2018] [CORE] Upgrade LZF library to fix endian serialization problem (cherry picked from commit 5438f49ccf374fed16bc2b7fc1556e4c0095b14c) Signed-off-by: Sean Owen <sowen@cloudera.com>
* [SPARK-7487] [ML] Feature Parity in PySpark for ml.regressionBurak Yavuz2015-05-126-8/+709
| | | | | | | | | | | | | | | | | | | Added LinearRegression Python API Author: Burak Yavuz <brkyvz@gmail.com> Closes #6016 from brkyvz/ml-reg and squashes the following commits: 11c9ef9 [Burak Yavuz] address comments 1027a40 [Burak Yavuz] fix typo 4c699ad [Burak Yavuz] added tree regressor api 8afead2 [Burak Yavuz] made mixin for DT fa51c74 [Burak Yavuz] save additions 0640d48 [Burak Yavuz] added ml.regression 82aac48 [Burak Yavuz] added linear regression (cherry picked from commit 8e935b0a214f8b477fe9579fbf6a2d0a27b59118) Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [HOT FIX #6076] DAG visualization: curve the edgesAndrew Or2015-05-121-1/+2
|
* [SPARK-7276] [DATAFRAME] speed up DataFrame.select by collapsing ProjectWenchen Fan2015-05-124-18/+41
| | | | | | | | | | | | | | Author: Wenchen Fan <cloud0fan@outlook.com> Closes #5831 from cloud-fan/7276 and squashes the following commits: ee4a1e1 [Wenchen Fan] fix rebase mistake a3b565d [Wenchen Fan] refactor 99deb5d [Wenchen Fan] add test f1f67ad [Wenchen Fan] fix 7276 (cherry picked from commit 4e290522c2a6310636317c54589dc35c91d95486) Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SPARK-7500] DAG visualization: move cluster labeling to dagre-d3Andrew Or2015-05-124-87/+48
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This fixes the label bleeding issue described in the JIRA and pictured in the screenshots below. I also took the opportunity to move some code to the places that they belong more to. In particular: (1) Drawing cluster labels is now implemented in my branch of dagre-d3 instead of in Spark (2) All graph styling is now moved from Scala to JS Note that these changes are related because our existing mechanism of "tacking on cluster labels" afterwards isn't flexible enough for us to fix issues like this one easily. For the other half of the changes, visit http://github.com/andrewor14/dagre-d3. ------------------- **Before.** <img src="https://cloud.githubusercontent.com/assets/2133137/7582769/b1423440-f845-11e4-8248-b3446a01bf79.png" width="300px"/> ------------------- **After.** <img src="https://cloud.githubusercontent.com/assets/2133137/7582742/74891ae6-f845-11e4-96c4-41c7b8aedbdf.png" width="400px"/> Author: Andrew Or <andrew@databricks.com> Closes #6076 from andrewor14/dag-viz-bleed and squashes the following commits: 5858d7a [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-bleed c686dc4 [Andrew Or] Fix tooltip placement d908c36 [Andrew Or] Add link to dagre-d3 changes (minor) 4a4fb58 [Andrew Or] Fix bleeding + move all styling to JS (cherry picked from commit 65697bbeafe507dda066e2dc14ca5183f278dfe9) Signed-off-by: Andrew Or <andrew@databricks.com>
* [DataFrame][minor] support column in field accessorWenchen Fan2015-05-122-1/+2
| | | | | | | | | | | | | Minor improvement, now we can use `Column` as extraction expression. Author: Wenchen Fan <cloud0fan@outlook.com> Closes #6080 from cloud-fan/tmp and squashes the following commits: 0fdefb7 [Wenchen Fan] support column in field accessor (cherry picked from commit bfcaf8adcdc20dec203e2e9d5a72b52dd6f226a9) Signed-off-by: Reynold Xin <rxin@databricks.com>
* [SPARK-3928] [SPARK-5182] [SQL] Partitioning support for the data sources APICheng Lian2015-05-1321-255/+2042
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This PR adds partitioning support for the external data sources API. It aims to simplify development of file system based data sources, and provide first class partitioning support for both read path and write path. Existing data sources like JSON and Parquet can be simplified with this work. ## New features provided 1. Hive compatible partition discovery This actually generalizes the partition discovery strategy used in Parquet data source in Spark 1.3.0. 1. Generalized partition pruning optimization Now partition pruning is handled during physical planning phase. Specific data sources don't need to worry about this harness anymore. (This also implies that we can remove `CatalystScan` after migrating the Parquet data source, since now we don't need to pass Catalyst expressions to data source implementations.) 1. Insertion with dynamic partitions When inserting data to a `FSBasedRelation`, data can be partitioned dynamically by specified partition columns. ## New structures provided ### Developer API 1. `FSBasedRelation` Base abstract class for file system based data sources. 1. `OutputWriter` Base abstract class for output row writers, responsible for writing a single row object. 1. `FSBasedRelationProvider` A new relation provider for `FSBasedRelation` subclasses. Note that data sources extending `FSBasedRelation` don't need to extend `RelationProvider` and `SchemaRelationProvider`. ### User API New overloaded versions of 1. `DataFrame.save()` 1. `DataFrame.saveAsTable()` 1. `SQLContext.load()` are provided to allow users to save/load DataFrames with user defined dynamic partition columns. ### Spark SQL query planning 1. `InsertIntoFSBasedRelation` Used to implement write path for `FSBasedRelation`s. 1. New rules for `FSBasedRelation` in `DataSourceStrategy` These are added to hook `FSBasedRelation` into physical query plan in read path, and perform partition pruning. ## TODO - [ ] Use scratch directories when overwriting a table with data selected from itself. Currently, this is not supported, because the table been overwritten is always deleted before writing any data to it. - [ ] When inserting with dynamic partition columns, use external sorter to group the data first. This ensures that we only need to open a single `OutputWriter` at a time. For data sources like Parquet, `OutputWriter`s can be quite memory consuming. One issue is that, this approach breaks the row distribution in the original DataFrame. However, we did't promise to preserve data distribution when writing a DataFrame. - [x] More tests. Specifically, test cases for - [x] Self-join - [x] Loading partitioned relations with a subset of partition columns stored in data files. - [x] `SQLContext.load()` with user defined dynamic partition columns. ## Parquet data source migration Parquet data source migration is covered in PR https://github.com/liancheng/spark/pull/6, which is against this PR branch and for preview only. A formal PR need to be made after this one is merged. Author: Cheng Lian <lian@databricks.com> Closes #5526 from liancheng/partitioning-support and squashes the following commits: 5351a1b [Cheng Lian] Fixes compilation error introduced while rebasing 1f9b1a5 [Cheng Lian] Tweaks data schema passed to FSBasedRelations 43ba50e [Cheng Lian] Avoids serializing generated projection code edf49e7 [Cheng Lian] Removed commented stale code block 348a922 [Cheng Lian] Adds projection in FSBasedRelation.buildScan(requiredColumns, inputPaths) ad4d4de [Cheng Lian] Enables HDFS style globbing 8d12e69 [Cheng Lian] Fixes compilation error c71ac6c [Cheng Lian] Addresses comments from @marmbrus 7552168 [Cheng Lian] Fixes typo in MimaExclude.scala 0349e09 [Cheng Lian] Fixes compilation error introduced while rebasing 52b0c9b [Cheng Lian] Adjusts project/MimaExclude.scala c466de6 [Cheng Lian] Addresses comments bc3f9b4 [Cheng Lian] Uses projection to separate partition columns and data columns while inserting rows 795920a [Cheng Lian] Fixes compilation error after rebasing 0b8cd70 [Cheng Lian] Adds Scala/Catalyst row conversion when writing non-partitioned tables fa543f3 [Cheng Lian] Addresses comments 5849dd0 [Cheng Lian] Fixes doc typos. Fixes partition discovery refresh. 51be443 [Cheng Lian] Replaces FSBasedRelation.outputCommitterClass with FSBasedRelation.prepareForWrite c4ed4fe [Cheng Lian] Bug fixes and a new test suite a29e663 [Cheng Lian] Bug fix: should only pass actuall data files to FSBaseRelation.buildScan 5f423d3 [Cheng Lian] Bug fixes. Lets data source to customize OutputCommitter rather than OutputFormat 54c3d7b [Cheng Lian] Enforces that FileOutputFormat must be used be0c268 [Cheng Lian] Uses TaskAttempContext rather than Configuration in OutputWriter.init 0bc6ad1 [Cheng Lian] Resorts to new Hadoop API, and now FSBasedRelation can customize output format class f320766 [Cheng Lian] Adds prepareForWrite() hook, refactored writer containers 422ff4a [Cheng Lian] Fixes style issue ce52353 [Cheng Lian] Adds new SQLContext.load() overload with user defined dynamic partition columns 8d2ff71 [Cheng Lian] Merges partition columns when reading partitioned relations ca1805b [Cheng Lian] Removes duplicated partition discovery code in new Parquet f18dec2 [Cheng Lian] More strict schema checking b746ab5 [Cheng Lian] More tests 9b487bf [Cheng Lian] Fixes compilation errors introduced while rebasing ea6c8dd [Cheng Lian] Removes remote debugging stuff 327bb1d [Cheng Lian] Implements partitioning support for data sources API 3c5073a [Cheng Lian] Fixes SaveModes used in test cases fb5a607 [Cheng Lian] Fixes compilation error 9d17607 [Cheng Lian] Adds the contract that OutputWriter should have zero-arg constructor 5de194a [Cheng Lian] Forgot Apache licence header 95d0b4d [Cheng Lian] Renames PartitionedSchemaRelationProvider to FSBasedRelationProvider 770b5ba [Cheng Lian] Adds tests for FSBasedRelation 3ba9bbf [Cheng Lian] Adds DataFrame.saveAsTable() overrides which support partitioning 1b8231f [Cheng Lian] Renames FSBasedPrunedFilteredScan to FSBasedRelation aa8ba9a [Cheng Lian] Javadoc fix 012ed2d [Cheng Lian] Adds PartitioningOptions 7dd8dd5 [Cheng Lian] Adds new interfaces and stub methods for data sources API partitioning support (cherry picked from commit 0595b6de8f1da04baceda082553c2aa1aa2cb006) Signed-off-by: Cheng Lian <lian@databricks.com>
* [DataFrame][minor] cleanup unapply methods in DataTypesWenchen Fan2015-05-121-12/+3
| | | | | | | | | | | | Author: Wenchen Fan <cloud0fan@outlook.com> Closes #6079 from cloud-fan/unapply and squashes the following commits: 40da442 [Wenchen Fan] one more 7d90a05 [Wenchen Fan] cleanup unapply in DataTypes (cherry picked from commit 831504cf6bde6b1131005d5552e56a842725c84c) Signed-off-by: Reynold Xin <rxin@databricks.com>
* [SPARK-6876] [PySpark] [SQL] add DataFrame na.replace in pysparkDaoyuan Wang2015-05-123-0/+140
| | | | | | | | | | | | | | | Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #6003 from adrian-wang/pynareplace and squashes the following commits: 672efba [Daoyuan Wang] remove py2.7 feature 4a148f7 [Daoyuan Wang] to_replace support dict, value support single value, and add full tests 9e232e7 [Daoyuan Wang] rename scala map af0268a [Daoyuan Wang] remove na 63ac579 [Daoyuan Wang] add na.replace in pyspark (cherry picked from commit d86ce845840a92b4dde7975082738ed94ab8c570) Signed-off-by: Reynold Xin <rxin@databricks.com>
* [SPARK-7532] [STREAMING] StreamingContext.start() made to logWarning and not ↵Tathagata Das2015-05-122-17/+14
| | | | | | | | | | | | | | | | | | | throw exception Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #6060 from tdas/SPARK-7532 and squashes the following commits: 6fe2e83 [Tathagata Das] Update docs 7dadfc3 [Tathagata Das] Fixed bug again 99c7678 [Tathagata Das] Added logInfo 65aec20 [Tathagata Das] Fix bug 5bf031b [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into SPARK-7532 1a9a818 [Tathagata Das] Fix scaladoc c584313 [Tathagata Das] StreamingContext.start() made to logWarning and not throw exception (cherry picked from commit ec6f2a9774167014566fb9608ee4394d2ce5fd6a) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
* [SPARK-7467] Dag visualization: treat checkpoint as an RDD operationAndrew Or2015-05-121-7/+9
| | | | | | | | | | | | | | | Such that a checkpoint RDD does not go into random scopes on the UI, e.g. `take`. We've seen this in streaming. Author: Andrew Or <andrew@databricks.com> Closes #6004 from andrewor14/dag-viz-checkpoint and squashes the following commits: 9217439 [Andrew Or] Fix checkpoints 4ae8806 [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-checkpoint 19bc07b [Andrew Or] Treat checkpoint as an RDD operation (cherry picked from commit f3e8e60063ccf0d713d03e671a3231560475f90d) Signed-off-by: Andrew Or <andrew@databricks.com>
* [SPARK-7485] [BUILD] Remove pyspark files from assembly.Marcelo Vanzin2015-05-125-115/+3
| | | | | | | | | | | | | | | | | The sbt part of the build is hacky; it basically tricks sbt into generating the zip by using a generator, but returns an empty list for the generated files so that nothing is actually added to the assembly. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #6022 from vanzin/SPARK-7485 and squashes the following commits: 22c1e04 [Marcelo Vanzin] Remove unneeded code. 4893622 [Marcelo Vanzin] [SPARK-7485] [build] Remove pyspark files from assembly. (cherry picked from commit 82e890fb19d6fbaffa69856eecb4699f2f8a81eb) Signed-off-by: Andrew Or <andrew@databricks.com>
* [MINOR] [PYSPARK] Set PYTHONPATH to python/lib/pyspark.zip rather than ↵linweizhong2015-05-121-1/+1
| | | | | | | | | | | | | | | python/pyspark As PR #5580 we have created pyspark.zip on building and set PYTHONPATH to python/lib/pyspark.zip, so to keep consistence update this. Author: linweizhong <linweizhong@huawei.com> Closes #6047 from Sephiroth-Lin/pyspark_pythonpath and squashes the following commits: 8cc3d96 [linweizhong] Set PYTHONPATH to python/lib/pyspark.zip rather than python/pyspark as PR#5580 we have create pyspark.zip on build (cherry picked from commit 984787526625b4ef8a1635faf7a5ac3cb0b758b7) Signed-off-by: Andrew Or <andrew@databricks.com>