...
* Typo in mllib-evaluation-metrics.md (Mageswaran.D, 2015-10-28; 1 file, -2/+2)
  The recall-by-threshold snippet was mistakenly using "precisionByThreshold".
  Author: Mageswaran.D <mageswaran1989@gmail.com> Closes #9333 from Mageswaran1989/Typo_in_mllib-evaluation-metrics.md.
* [SPARK-11313][SQL] implement cogroup on DataSets (support 2 datasets) (Wenchen Fan, 2015-10-28; 8 files, -0/+257)
  A simpler version of https://github.com/apache/spark/pull/9279; only supports 2 datasets.
  Author: Wenchen Fan <wenchen@databricks.com> Closes #9324 from cloud-fan/cogroup2.
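  A hedged sketch of what using the new cogroup might look like. The case classes, data, and the exact `GroupedDataset.cogroup` signature below are assumptions based on the Spark 1.6-era Dataset preview API, not code taken from the patch:
  ```scala
  // Assumes a SQLContext named sqlContext and `import sqlContext.implicits._`.
  case class Click(user: String, url: String)
  case class Purchase(user: String, amount: Double)

  val clicks = Seq(Click("a", "/x"), Click("b", "/y")).toDS().groupBy(_.user)
  val purchases = Seq(Purchase("a", 9.99)).toDS().groupBy(_.user)

  // For each key, receive both groups' iterators and emit any number of results.
  val summary = clicks.cogroup(purchases) { (user, cs, ps) =>
    Iterator((user, cs.size, ps.size))
  }
  ```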
* [SPARK-11332] [ML] Refactored to use ml.feature.Instance instead of WeightedLeastSquare.Instance (Nakul Jindal, 2015-10-28; 3 files, -24/+15)
  WeightedLeastSquares now uses the common Instance class in ml.feature instead of a private one.
  Author: Nakul Jindal <njindal@us.ibm.com> Closes #9325 from nakul02/SPARK-11332_refactor_WeightedLeastSquares_dot_Instance.
* [MINOR][ML] fix compile warns (Xiangrui Meng, 2015-10-27; 2 files, -2/+3)
  This fixes some compile time warnings.
  Author: Xiangrui Meng <meng@databricks.com> Closes #9319 from mengxr/mllib-compile-warn-20151027.
* [SPARK-11302][MLLIB] 2) Multivariate Gaussian Model with Covariance matrix returns incorrect answer in some cases (Sean Owen, 2015-10-27; 3 files, -6/+21)
  Fix computation of root-sigma-inverse in multivariate Gaussian; add a test and fix the related Python mixture model test. Supersedes https://github.com/apache/spark/pull/9293
  Author: Sean Owen <sowen@cloudera.com> Closes #9309 from srowen/SPARK-11302.2.
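  For reference, a minimal sketch of the root-sigma-inverse computation in question, assuming Breeze and a simple fixed tolerance (the real code derives its tolerance differently). It builds M = D^(-1/2) * U^T from the eigendecomposition sigma = U * D * U^T, so that M^T * M equals the (pseudo-)inverse of sigma:
  ```scala
  import breeze.linalg.{diag, eigSym, DenseMatrix => BDM, DenseVector => BDV}

  def rootSigmaInv(sigma: BDM[Double]): BDM[Double] = {
    val eigSym.EigSym(d, u) = eigSym(sigma)  // sigma = u * diag(d) * u.t
    val tol = 1e-9 * d.toArray.max           // assumed tolerance for "zero" eigenvalues
    // Scale by 1/sqrt(eigenvalue); zero out near-singular directions so a
    // singular covariance yields a pseudo-inverse instead of blowing up.
    val dInvSqrt = BDV(d.toArray.map(v => if (v > tol) 1.0 / math.sqrt(v) else 0.0))
    diag(dInvSqrt) * u.t
  }
  ```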
* [SPARK-10484] [SQL] Optimize the cartesian join with broadcast join for some cases (Cheng Hao, 2015-10-27; 10 files, -16/+261)
  In some cases, we can broadcast the smaller relation in a cartesian join, which improves performance significantly.
  Author: Cheng Hao <hao.cheng@intel.com> Closes #8652 from chenghao-intel/cartesian.
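  A short illustration of the kind of query this targets; `largeDf` and `smallDf` are hypothetical DataFrames. A join with no equi-join keys is planned as a cartesian product, but if one side is small it can be broadcast instead:
  ```scala
  import org.apache.spark.sql.functions.broadcast

  // Hinting the small side lets the planner use a broadcast nested-loop
  // join rather than computing the full cartesian product.
  val result = largeDf.join(broadcast(smallDf), largeDf("a") < smallDf("b"))
  ```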
* [SPARK-11178] Improving naming around task failures. (Kay Ousterhout, 2015-10-27; 11 files, -47/+66)
  Commit af3bc59d1f5d9d952c2d7ad1af599c49f1dbdaf0 introduced new functionality so that if an executor dies for a reason that's not caused by one of the tasks running on the executor (e.g., due to pre-emption), Spark doesn't count the failure towards the maximum number of failures for the task. That commit introduced some vague naming that this commit attempts to fix; in particular:
  (1) The variable "isNormalExit", which was used to refer to cases where the executor died for a reason unrelated to the tasks running on the machine, has been renamed (and reversed) to "exitCausedByApp". The problem with the existing name is that it's not clear (at least to me!) what it means for an exit to be "normal"; the new name is intended to make the purpose of this variable more clear.
  (2) The variable "shouldEventuallyFailJob" has been renamed to "countTowardsTaskFailures". This variable is used to determine whether a task's failure should be counted towards the maximum number of failures allowed for a task before the associated Stage is aborted. The problem with the existing name is that it can be confused with implying that the task's failure should immediately cause the stage to fail because it is somehow fatal (this is the case for a fetch failure, for example: if a task fails because of a fetch failure, there's no point in retrying, and the whole stage should be failed).
  Author: Kay Ousterhout <kayousterhout@gmail.com> Closes #9164 from kayousterhout/SPARK-11178.
* [SPARK-11212][CORE][STREAMING] Make preferred locations support ExecutorCacheTaskLocation and update ReceiverTracker and ReceiverSchedulingPolicy to use it (zsxwing, 2015-10-27; 8 files, -132/+217)
  This PR includes the following changes:
  1. Add a new preferred location format, `executor_<host>_<executorID>` (e.g., "executor_localhost_2"; see the sketch after this entry), to support specifying executor locations for an RDD.
  2. Use the new preferred location format in `ReceiverTracker` to optimize the starting time of Receivers when there are multiple executors in a host.
  The goal of this PR is to enable the streaming scheduler to place receivers (which run as tasks) in specific executors. Basically, I want to have more control over the placement of the receivers such that they are evenly distributed among the executors. We tried to do this without changing the core scheduling logic, but it does not allow specifying a particular executor as a preferred location, only the host level. So if there are two executors in the same host, and I want two receivers to run on them (one on each executor), I cannot specify that. The current code only specifies the host as the preference, which may end up launching both receivers on the same executor. We tried to work around it by restarting a receiver when it does not launch in the desired executor, hoping that next time it will be started in the right one. But that causes lots of restarts, and delays in correctly launching the receiver. So this change allows the streaming scheduler to specify the exact executor as the preferred location. Also, this is not exposed to the user; only the streaming scheduler uses it.
  Author: zsxwing <zsxwing@gmail.com> Closes #9181 from zsxwing/executor-location.
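  A hedged sketch of the new location format; the class and parser below are illustrative, not the exact Spark implementation:
  ```scala
  case class ExecutorCacheTaskLocation(host: String, executorId: String) {
    // Renders as "executor_<host>_<executorID>", e.g. "executor_localhost_2".
    override def toString: String = s"executor_${host}_$executorId"
  }

  def parseLocation(loc: String): Option[ExecutorCacheTaskLocation] =
    loc.split("_", 3) match {
      case Array("executor", host, execId) => Some(ExecutorCacheTaskLocation(host, execId))
      case _ => None
    }

  // parseLocation("executor_localhost_2") == Some(ExecutorCacheTaskLocation("localhost", "2"))
  ```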
* [SPARK-11324][STREAMING] Flag for closing Write Ahead Logs after a write (Burak Yavuz, 2015-10-27; 3 files, -9/+44)
  Currently the Write Ahead Log in Spark Streaming flushes data as writes need to be made. S3 does not support flushing of data; data is written only once the stream is actually closed. In case of failure, the data for the last minute (default rolling interval) will not be properly written. Therefore we need a flag to close the stream after the write, so that we achieve read-after-write consistency. cc tdas zsxwing
  Author: Burak Yavuz <brkyvz@gmail.com> Closes #9285 from brkyvz/caw-wal.
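  A minimal sketch of the close-after-write idea, with assumed names (not the patch's API): on stores like S3 where flush() is a no-op, data only becomes visible once the stream is closed, so each write optionally closes the stream to get read-after-write consistency:
  ```scala
  import java.io.OutputStream

  def writeRecord(openStream: () => OutputStream,
                  bytes: Array[Byte],
                  closeAfterWrite: Boolean): Unit = {
    val out = openStream()
    out.write(bytes)
    out.flush() // effectively a no-op on S3-backed streams
    if (closeAfterWrite) {
      out.close() // closing forces the data to be persisted and visible
    }
    // Otherwise the caller keeps the stream open and reuses it for later writes.
  }
  ```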
* [SPARK-10024][PYSPARK] Python API RF and GBT related params clear up (vectorijk, 2015-10-27; 2 files, -338/+168)
  Implement {RandomForest, GBT, TreeEnsemble, TreeClassifier, TreeRegressor}Params for the Python API in pyspark/ml/{classification, regression}.py.
  Author: vectorijk <jiangkai@gmail.com> Closes #9233 from vectorijk/spark-10024.
* [SPARK-11347] [SQL] Support for joinWith in Datasets (Michael Armbrust, 2015-10-27; 18 files, -615/+563)
  This PR adds a new operation `joinWith` to a `Dataset`, which returns a `Tuple` for each pair where a given `condition` evaluates to true.
  ```scala
  case class ClassData(a: String, b: Int)

  val ds1 = Seq(ClassData("a", 1), ClassData("b", 2)).toDS()
  val ds2 = Seq(("a", 1), ("b", 2)).toDS()

  > ds1.joinWith(ds2, $"_1" === $"a").collect()
  res0: Array((ClassData("a", 1), ("a", 1)), (ClassData("b", 2), ("b", 2)))
  ```
  This operation is similar to the relational `join` function with one important difference in the result schema. Since `joinWith` preserves objects present on either side of the join, the result schema is similarly nested into a tuple under the column names `_1` and `_2`. This type of join can be useful both for preserving type-safety with the original object types as well as for working with relational data where either side of the join has column names in common.
  ## Required Changes to Encoders
  In the process of working on this patch, several deficiencies in the way we were handling encoders were discovered. Specifically, it turned out to be very difficult to `rebind` the non-expression based encoders to extract the nested objects from the results of joins (and also typed selects that return tuples). As a result, the following changes were made.
  - `ClassEncoder` has been renamed to `ExpressionEncoder` and has been improved to also handle primitive types. Additionally, it is now possible to take arbitrary expression encoders and rewrite them into a single encoder that returns a tuple.
  - All internal operations on `Dataset`s now require an `ExpressionEncoder`. If the user tries to pass a non-`ExpressionEncoder` in, an error will be thrown. We can relax this requirement in the future by constructing a wrapper class that uses expressions to project the row to the expected schema, shielding the user's code from the required remapping. This will give us a nice balance where we don't force user encoders to understand attribute references and binding, but still allow our native encoder to leverage runtime code generation to construct specific encoders for a given schema that avoid an extra remapping step.
  - Additionally, the semantics for different types of objects are now better defined. As stated in the `ExpressionEncoder` scaladoc:
    - Classes will have their sub fields extracted by name using `UnresolvedAttribute` expressions and `UnresolvedExtractValue` expressions.
    - Tuples will have their subfields extracted by position using `BoundReference` expressions.
    - Primitives will have their values extracted from the first ordinal with a schema that defaults to the name `value`.
  - Finally, the binding lifecycle for `Encoders` has now been unified across the codebase. Encoders are now `resolved` to the appropriate schema in the constructor of `Dataset`. This process replaces unresolved expressions with concrete `AttributeReference` expressions. Binding then happens on demand, when an encoder is going to be used to construct an object. This closely mirrors the lifecycle for standard expressions when executing normal SQL or `DataFrame` queries.
  Author: Michael Armbrust <michael@databricks.com> Closes #9300 from marmbrus/datasets-tuples.
* [SPARK-6488][MLLIB][PYTHON] Support addition/multiplication in PySpark's BlockMatrix (Mike Dusenberry, 2015-10-27; 1 file, -0/+68)
  This PR adds addition and multiplication to PySpark's `BlockMatrix` class via `add` and `multiply` functions.
  Author: Mike Dusenberry <mwdusenb@us.ibm.com> Closes #9139 from dusenberrymw/SPARK-6488_Add_Addition_and_Multiplication_to_PySpark_BlockMatrix.
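  The new PySpark methods wrap the existing Scala API; a sketch of the equivalent Scala calls (block sizes and values here are made up for illustration):
  ```scala
  import org.apache.spark.mllib.linalg.Matrices
  import org.apache.spark.mllib.linalg.distributed.BlockMatrix

  // Assumes a SparkContext named sc. One 2x2 block at grid position (0, 0).
  val blocks = sc.parallelize(Seq(
    ((0, 0), Matrices.dense(2, 2, Array(1.0, 2.0, 3.0, 4.0)))))
  val a = new BlockMatrix(blocks, 2, 2)

  val sum     = a.add(a)      // element-wise addition
  val product = a.multiply(a) // distributed block matrix multiplication
  ```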
* [SPARK-11306] Fix hang when JVM exits. (Kay Ousterhout, 2015-10-27; 1 file, -1/+1)
  This commit fixes a bug where, in Standalone mode, if a task fails and crashes the JVM, the failure is considered a "normal failure" (meaning it's considered unrelated to the task), so the failure isn't counted against the task's maximum number of failures: https://github.com/apache/spark/commit/af3bc59d1f5d9d952c2d7ad1af599c49f1dbdaf0#diff-a755f3d892ff2506a7aa7db52022d77cL138. As a result, if a task fails in a way that results in it crashing the JVM, it will continuously be re-launched, resulting in a hang. This commit fixes that problem.
  This bug was introduced by #8007; andrewor14 mccheah vanzin can you take a look at this? This error is hard to trigger because we handle executor losses through 2 code paths (the second is via Akka, where Akka notices that the executor endpoint is disconnected). In my setup, the Akka code path completes first, and doesn't have this bug, so things work fine (see my recent email to the dev list about this). If I manually disable the Akka code path, I can see the hang (and this commit fixes the issue).
  Author: Kay Ousterhout <kayousterhout@gmail.com> Closes #9273 from kayousterhout/SPARK-11306.
* [SPARK-11303][SQL] filter should not be pushed down into sample (Yanbo Liang, 2015-10-27; 2 files, -4/+10)
  When sampling and then filtering a DataFrame, the SQL optimizer will push the filter down into the sample and produce a wrong result. This is because the sample is defined over the original data, not over the filtered data, so pushing the filter below the sample changes the population being sampled.
  Author: Yanbo Liang <ybliang8@gmail.com> Closes #9294 from yanboliang/spark-11303.
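  A hedged illustration of why the pushdown is wrong; names are assumed and the plans are simplified. Sampling first and filtering after is not the same as filtering first and sampling after:
  ```scala
  // Assumes a SQLContext named sqlContext and `import sqlContext.implicits._`.
  val df = sqlContext.range(0, 1000)

  // What the user wrote: sample 10%, then keep ids > 500.
  val intended = df.sample(false, 0.1, seed = 42L).filter($"id" > 500)

  // What an (incorrectly) pushed-down plan would compute instead:
  val pushedDown = df.filter($"id" > 500).sample(false, 0.1, seed = 42L)
  ```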
* [SPARK-11277][SQL] sort_array throws exception scala.MatchError (Jia Li, 2015-10-27; 2 files, -1/+12)
  I'm new to Spark. I was trying out the sort_array function and hit this exception. I looked into the Spark source code and found the root cause: sort_array does not check for an array of NULLs. It's not meaningful to sort an array of entirely NULLs anyway. I'm adding a check on the input array type to SortArray; if the array consists entirely of NULLs, there is no need to sort it. I have also added a test case for this. Please help to review my fix. Thanks!
  Author: Jia Li <jiali@us.ibm.com> Closes #9247 from jliwork/SPARK-11277.
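  A hedged repro sketch (the exact expression from the report is an assumption): sorting an array whose elements are all NULL is what triggered the `scala.MatchError`:
  ```scala
  // Before the fix, something like this could hit scala.MatchError;
  // after it, the all-NULL array is simply returned unsorted.
  sqlContext.sql("SELECT sort_array(array(null, null))").show()
  ```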
* [SPARK-5569][STREAMING] fix ObjectInputStreamWithLoader for supporting load array classes. (maxwell, 2015-10-27; 2 files, -3/+36)
  When using the Kafka DirectStream API to create a checkpoint and then restoring the saved checkpoint on restart, a ClassNotFound exception would occur. The reason for this error is that ObjectInputStreamWithLoader extends the ObjectInputStream class and overrides its resolveClass method. But instead of using Class.forName(desc, false, loader), Spark uses loader.loadClass(desc) to instantiate the class, which does not work with array classes. For example: Class.forName("[Lorg.apache.spark.streaming.kafka.OffsetRange.", false, loader) works well, while loader.loadClass("[Lorg.apache.spark.streaming.kafka.OffsetRange") would throw a ClassNotFoundException. Details of the difference between Class.forName and loader.loadClass can be found here: http://bugs.java.com/view_bug.do?bug_id=6446627
  Author: maxwell <maxwellzdm@gmail.com> Author: DEMING ZHU <deming.zhu@linecorp.com> Closes #8955 from maxwellzdm/master.
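  A minimal sketch of the fix pattern described above: resolving classes via Class.forName with the custom loader handles JVM array class names, unlike ClassLoader.loadClass:
  ```scala
  import java.io.{InputStream, ObjectInputStream, ObjectStreamClass}

  class ObjectInputStreamWithLoader(in: InputStream, loader: ClassLoader)
      extends ObjectInputStream(in) {
    override protected def resolveClass(desc: ObjectStreamClass): Class[_] =
      try {
        // Class.forName understands array descriptors such as "[Lcom.example.Foo;".
        Class.forName(desc.getName, false, loader)
      } catch {
        case _: ClassNotFoundException => super.resolveClass(desc)
      }
  }
  ```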
* [SPARK-11270][STREAMING] Add improved equality testing for TopicAndPartition from the Kafka Streaming API (Nick Evans, 2015-10-27; 2 files, -0/+20)
  jerryshao tdas I know this is kind of minor, and I know you all are busy, but this brings this class in line with the `OffsetRange` class, and makes tests a little more concise. Instead of doing something like:
  ```
  assert topic_and_partition_instance._topic == "foo"
  assert topic_and_partition_instance._partition == 0
  ```
  You can do something like:
  ```
  assert topic_and_partition_instance == TopicAndPartition("foo", 0)
  ```
  Before:
  ```
  >>> from pyspark.streaming.kafka import TopicAndPartition
  >>> TopicAndPartition("foo", 0) == TopicAndPartition("foo", 0)
  False
  ```
  After:
  ```
  >>> from pyspark.streaming.kafka import TopicAndPartition
  >>> TopicAndPartition("foo", 0) == TopicAndPartition("foo", 0)
  True
  ```
  I couldn't find any tests - am I missing something?
  Author: Nick Evans <me@nicolasevans.org> Closes #9236 from manygrams/topic_and_partition_equality.
* [SPARK-11276][CORE] SizeEstimator prevents class unloading (Sem Mulder, 2015-10-27; 1 file, -2/+4)
  The SizeEstimator keeps a cache of ClassInfos, but this cache uses Class objects as keys, which results in strong references to the Class objects. If these classes are dynamically created, this prevents the corresponding ClassLoader from being GCed, leading to PermGen exhaustion. We use a Map with WeakKeys to prevent this issue.
  Author: Sem Mulder <sem.mulder@site2mobile.com> Closes #9244 from SemMulder/fix-sizeestimator-classunloading.
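  A hedged sketch of the weak-keys approach, assuming Guava's MapMaker (the `ClassInfo` shape here is simplified): weak keys let the GC collect a Class, and hence its ClassLoader, once nothing else references it:
  ```scala
  import java.util.concurrent.ConcurrentMap
  import com.google.common.collect.MapMaker

  case class ClassInfo(shellSize: Long) // cached per-class layout info (simplified)

  // Class keys are held weakly, so dynamically generated classes can be unloaded.
  val classInfos: ConcurrentMap[Class[_], ClassInfo] =
    new MapMaker().weakKeys().makeMap[Class[_], ClassInfo]()
  ```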
* [SPARK-11297] Add new code tags (Xusen Yin, 2015-10-26; 1 file, -0/+4)
  mengxr https://issues.apache.org/jira/browse/SPARK-11297 Add new code tags to keep the same look and feel as the previous documents.
  Author: Xusen Yin <yinxusen@gmail.com> Closes #9265 from yinxusen/SPARK-11297.
* [SPARK-10654][MLLIB] Add columnSimilarities to IndexedRowMatrix (Reza Zadeh, 2015-10-26; 2 files, -0/+25)
  Add columnSimilarities to IndexedRowMatrix by delegating to functionality already in RowMatrix. With a test.
  Author: Reza Zadeh <reza@databricks.com> Closes #8792 from rezazadeh/colsims.
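  The delegation is simple enough to sketch; assuming the method is essentially a pass-through (column cosine similarities don't depend on row order, so the row indices can be dropped):
  ```scala
  import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, IndexedRowMatrix}

  // Compute pairwise cosine similarities between columns by reusing RowMatrix.
  def columnSimilarities(mat: IndexedRowMatrix): CoordinateMatrix =
    mat.toRowMatrix().columnSimilarities()
  ```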
* [SPARK-11184][MLLIB] Declare most of .mllib code not-Experimental (Sean Owen, 2015-10-26; 45 files, -208/+43)
  Remove "Experimental" from .mllib code that has been around since 1.4.0 or earlier.
  Author: Sean Owen <sowen@cloudera.com> Closes #9169 from srowen/SPARK-11184.
* [SPARK-10271][PYSPARK][MLLIB] Added @since tags to pyspark.mllib.clustering (noelsmith, 2015-10-26; 1 file, -1/+68)
  Duplicated the since decorator from pyspark.sql into pyspark (also tweaked to handle functions without docstrings). Added since to methods + "versionadded::" to classes (derived from the git file history in pyspark).
  Author: noelsmith <mail@noelsmith.com> Closes #8627 from noel-smith/SPARK-10271-since-mllib-clustering.
* [SPARK-11289][DOC] Substitute code examples in ML features extractors with include_example (Xusen Yin, 2015-10-26; 9 files, -209/+480)
  mengxr https://issues.apache.org/jira/browse/SPARK-11289 I made some changes to the ML feature extractors, i.e. TF-IDF, Word2Vec, and CountVectorizer. I added new example code in spark/examples; I hope it is the right place to add those examples.
  Author: Xusen Yin <yinxusen@gmail.com> Closes #9266 from yinxusen/SPARK-11289.
* [SPARK-10562] [SQL] support mixed case partitionBy column names for tables stored in metastore (Wenchen Fan, 2015-10-26; 3 files, -27/+54)
  https://issues.apache.org/jira/browse/SPARK-10562
  Author: Wenchen Fan <wenchen@databricks.com> Closes #9226 from cloud-fan/par.
* [SPARK-11209][SPARKR] Add window functions into SparkR [step 1]. (Sun Rui, 2015-10-26; 5 files, -1/+122)
  Author: Sun Rui <rui.sun@intel.com> Closes #9193 from sun-rui/SPARK-11209.
* [SPARK-10947] [SQL] With schema inference from JSON into a Dataframe, add option to infer all primitive object types as strings (Stephen De Gennaro, 2015-10-26; 4 files, -11/+171)
  Currently, when a schema is inferred from a JSON file using sqlContext.read.json, the primitive object types are inferred as string, long, boolean, etc. However, if the inferred type is too specific (JSON obviously does not enforce types itself), this can cause issues with merging dataframe schemas. This pull request adds the option "primitivesAsString" to the JSON DataFrameReader, which when true (defaults to false if not set) will infer all primitives as strings. Below is an example usage of this new functionality.
  ```
  val jsonDf = sqlContext.read.option("primitivesAsString", "true").json(sampleJsonFile)

  scala> jsonDf.printSchema()
  root
   |-- bigInteger: string (nullable = true)
   |-- boolean: string (nullable = true)
   |-- double: string (nullable = true)
   |-- integer: string (nullable = true)
   |-- long: string (nullable = true)
   |-- null: string (nullable = true)
   |-- string: string (nullable = true)
  ```
  Author: Stephen De Gennaro <stepheng@realitymine.com> Closes #9249 from stephend-realitymine/stephend-primitives.
* [SPARK-11325] [SQL] Alias 'alias' in Scala's DataFrame API (Nong Li, 2015-10-26; 2 files, -0/+21)
  Author: Nong Li <nongli@gmail.com> Closes #9286 from nongli/spark-11325.
* [SQL][DOC] Minor document fixes in interfaces.scala (Alexander Slesarenko, 2015-10-26; 1 file, -7/+7)
  rxin just noticed this while reading the code.
  Author: Alexander Slesarenko <avslesarenko@gmail.com> Closes #9284 from aslesarenko/doc-typos.
* [SPARK-11258] Converting a Spark DataFrame into an R data.frame is slow / requires a lot of memory (Frank Rosner, 2015-10-26; 2 files, -7/+47)
  https://issues.apache.org/jira/browse/SPARK-11258 I was not able to locate an existing unit test for this function, so I wrote one.
  Author: Frank Rosner <frank@fam-rosner.de> Closes #9222 from FRosner/master.
* [SPARK-10979][SPARKR] Sparkrmerge: Add merge to DataFrame with R signature (Narine Kokhlikyan, 2015-10-26; 2 files, -8/+169)
  Add a merge function to DataFrame, which supports the R signature. https://stat.ethz.ch/R-manual/R-devel/library/base/html/merge.html
  Author: Narine Kokhlikyan <narine.kokhlikyan@gmail.com> Closes #9012 from NarineK/sparkrmerge.
* [SPARK-5966][WIP] Spark-submit deploy-mode cluster is not compatible with master local> (Kevin Yu, 2015-10-26; 1 file, -0/+2)
  Author: Kevin Yu <qyu@us.ibm.com> Closes #9220 from kevinyu98/working_on_spark-5966.
* [SPARK-11279][PYSPARK] Add DataFrame#toDF in PySpark (Jeff Zhang, 2015-10-26; 1 file, -0/+12)
  Author: Jeff Zhang <zjffdu@apache.org> Closes #9248 from zjffdu/SPARK-11279.
* [SPARK-11253] [SQL] reset all accumulators in physical operators before execute an action (Wenchen Fan, 2015-10-25; 3 files, -4/+87)
  With this change, our query execution listener can get the metrics correctly. The UI still looks good after this change (screenshots: https://cloud.githubusercontent.com/assets/3182036/10683834/d516f37e-7978-11e5-8118-343ed40eb824.png and https://cloud.githubusercontent.com/assets/3182036/10683837/e1fa60da-7978-11e5-8ec8-178b88f27764.png).
  Author: Wenchen Fan <wenchen@databricks.com> Closes #9215 from cloud-fan/metric.
* [SPARK-11127][STREAMING] upgrade AWS SDK and Kinesis Client Library (KCL) (Xiangrui Meng, 2015-10-25; 1 file, -2/+2)
  AWS SDK 1.9.40 is the latest 1.9.x release. KCL 1.5.1 is the latest release that uses AWS SDK 1.9.x. The main goal is to have the Kinesis consumer be able to read messages generated by the Kinesis Producer Library (KPL). The API should be compatible with old versions. tdas brkyvz
  Author: Xiangrui Meng <meng@databricks.com> Closes #9153 from mengxr/SPARK-11127.
* [SPARK-10984] Simplify *MemoryManager class structure (Josh Rosen, 2015-10-25; 58 files, -1255/+888)
  This patch refactors the MemoryManager class structure. After #9000, Spark had the following classes:
  - MemoryManager
  - StaticMemoryManager
  - ExecutorMemoryManager
  - TaskMemoryManager
  - ShuffleMemoryManager
  This is fairly confusing. To simplify things, this patch consolidates several of these classes:
  - ShuffleMemoryManager and ExecutorMemoryManager were merged into MemoryManager.
  - TaskMemoryManager is moved into Spark Core.
  **Key changes and tasks**:
  - [x] Merge ExecutorMemoryManager into MemoryManager.
  - [x] Move pooling logic into Allocator.
  - [x] Move TaskMemoryManager from `spark-unsafe` to `spark-core`.
  - [x] Refactor the existing Tungsten TaskMemoryManager interactions so Tungsten code uses only this and not both this and ShuffleMemoryManager.
  - [x] Refactor non-Tungsten code to use the TaskMemoryManager instead of ShuffleMemoryManager.
  - [x] Merge ShuffleMemoryManager into MemoryManager.
    - [x] Move code
    - [x] ~~Simplify 1/n calculation.~~ **Will defer to followup, since this needs more work.**
  - [x] Port ShuffleMemoryManagerSuite tests.
  - [x] Move classes from `unsafe` package to `memory` package.
  - [ ] Figure out how to handle the hacky use of the memory managers in HashedRelation's broadcast variable construction.
  - [x] Test porting and cleanup: several tests relied on mock functionality (such as `TestShuffleMemoryManager.markAsOutOfMemory`) which has been changed or broken during the memory manager consolidation
    - [x] AbstractBytesToBytesMapSuite
    - [x] UnsafeExternalSorterSuite
    - [x] UnsafeFixedWidthAggregationMapSuite
    - [x] UnsafeKVExternalSorterSuite
  **Compatibility notes**:
  - This patch introduces breaking changes in `ExternalAppendOnlyMap`, which is marked as `DeveloperApi` (likely for legacy reasons): this class now cannot be used outside of a task.
  Author: Josh Rosen <joshrosen@databricks.com> Closes #9127 from JoshRosen/SPARK-10984.
* [SPARK-10891][STREAMING][KINESIS] Add MessageHandler to KinesisUtils.createStream similar to Direct Kafka (Burak Yavuz, 2015-10-25; 9 files, -75/+337)
  This PR allows users to map a Kinesis `Record` to a generic `T` when creating a Kinesis stream. This is particularly useful if you would like to do extra work with Kinesis metadata such as sequence number and partition key.
  TODO:
  - [x] add tests
  Author: Burak Yavuz <brkyvz@gmail.com> Closes #8954 from brkyvz/kinesis-handler.
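  A hedged sketch of a message handler; the `KinesisEvent` type is hypothetical, and the exact createStream overload is an assumption. The handler maps each raw Kinesis Record to a custom type, keeping metadata alongside the payload:
  ```scala
  import com.amazonaws.services.kinesis.model.Record

  case class KinesisEvent(partitionKey: String, sequenceNumber: String, data: Array[Byte])

  val handler: Record => KinesisEvent = { r =>
    val bytes = new Array[Byte](r.getData.remaining())
    r.getData.get(bytes) // copy the payload out of the ByteBuffer
    KinesisEvent(r.getPartitionKey, r.getSequenceNumber, bytes)
  }
  // Passed as the messageHandler argument to KinesisUtils.createStream(...).
  ```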
* [SPARK-11287] Fixed class name to properly start TestExecutor from deploy.client.TestClient (Bryan Cutler, 2015-10-25; 1 file, -1/+2)
  Executing deploy.client.TestClient fails due to a bad class name for TestExecutor in ApplicationDescription.
  Author: Bryan Cutler <bjcutler@us.ibm.com> Closes #9255 from BryanCutler/fix-TestClient-classname-SPARK-11287.
* [SPARK-6428][SQL] Removed unnecessary typecasts in MutableInt, MutableDouble etc. (Alexander Slesarenko, 2015-10-25; 1 file, -9/+9)
  marmbrus rxin I believe these typecasts are not required in the presence of explicit return types.
  Author: Alexander Slesarenko <avslesarenko@gmail.com> Closes #9262 from aslesarenko/remove-typecasts.
* [SPARK-11299][DOC] Fix link to Scala DataFrame Functions reference (Josh Rosen, 2015-10-25; 1 file, -1/+1)
  The SQL programming guide's link to the DataFrame functions reference points to the wrong location; this patch fixes that.
  Author: Josh Rosen <joshrosen@databricks.com> Closes #9269 from JoshRosen/SPARK-11299.
* Fix typos (Jacek Laskowski, 2015-10-25; 4 files, -4/+5)
  Two typos squashed. BTW, let me know how to proceed with other typos if I run across any. I don't feel good leaving them aside, but nor about sending pull requests with such tiny changes. Guide me.
  Author: Jacek Laskowski <jacek.laskowski@deepsense.io> Closes #9250 from jaceklaskowski/typos-hunting.
* [SPARK-11264] bin/spark-class can't find assembly jars with certain GREP_OPTIONS set (Jeffrey Naisbitt, 2015-10-24; 1 file, -0/+1)
  Temporarily remove GREP_OPTIONS if set in bin/spark-class. Some GREP_OPTIONS will modify the output of the grep commands that are looking for the assembly jars. For example, if the -n option is specified, the grep output will look like:
  5:spark-assembly-1.5.1-hadoop2.4.0.jar
  This will not match the regular expressions, and so the jar files will not be found. We could improve the regular expression to handle this case and trim off extra characters, but it is difficult to know which options may or may not be set. Unsetting GREP_OPTIONS within the script handles all the cases and gives the desired output.
  Author: Jeffrey Naisbitt <jnaisbitt@familysearch.org> Closes #9231 from naisbitt/unset-GREP_OPTIONS.
* [SPARK-11245] update twitter4j to 4.0.4 version (dima, 2015-10-24; 2 files, -2/+2)
  Update twitter4j to the 4.0.4 version. https://issues.apache.org/jira/browse/SPARK-11245
  Author: dima <pronix.service@gmail.com> Closes #9221 from pronix/twitter4j_update.
* [SPARK-11125] [SQL] Uninformative exception when running spark-sql without building with -Phive-thriftserver and SPARK_PREPEND_CLASSES is set (Jeff Zhang, 2015-10-23; 1 file, -0/+9)
  This is the exception after this patch. Please help review.
  ```
  java.lang.NoClassDefFoundError: org/apache/hadoop/hive/cli/CliDriver
      at java.lang.ClassLoader.defineClass1(Native Method)
      at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
      at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
      at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
      at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
      at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
      at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
      at java.security.AccessController.doPrivileged(Native Method)
      at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
      at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:412)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
      at java.lang.Class.forName0(Native Method)
      at java.lang.Class.forName(Class.java:270)
      at org.apache.spark.util.Utils$.classForName(Utils.scala:173)
      at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:647)
      at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
      at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
      at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
      at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
  Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hive.cli.CliDriver
      at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
      at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
      at java.security.AccessController.doPrivileged(Native Method)
      at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
      at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
      ... 21 more
  Failed to load hive class. You need to build Spark with -Phive and -Phive-thriftserver.
  ```
  Author: Jeff Zhang <zjffdu@apache.org> Closes #9134 from zjffdu/SPARK-11125.
* [SPARK-11294][SPARKR] Improve R doc for read.df, write.df, saveAsTable (felixcheung, 2015-10-23; 2 files, -19/+24)
  Add examples for read.df, write.df; fix grouping for read.df, loadDF; fix formatting and text truncation for write.df, saveAsTable. Several text issues (screenshot: https://cloud.githubusercontent.com/assets/8969467/10708590/1303a44e-79c3-11e5-854f-3a2e16854cd7.png):
  - text collapsed into a single paragraph
  - text truncated at 2 places, e.g. "overwrite: Existing data is expected to be overwritten by the contents of error:"
  shivaram
  Author: felixcheung <felixcheung_m@hotmail.com> Closes #9261 from felixcheung/rdocreadwritedf.
* [SPARK-10971][SPARKR] RRunner should allow setting path to Rscript. (Sun Rui, 2015-10-23; 2 files, -1/+28)
  Add a new spark conf option "spark.sparkr.r.driver.command" to specify the executable for an R script in client modes. The existing spark conf option "spark.sparkr.r.command" is used to specify the executable for an R script in cluster modes for both driver and workers. See also [launch R worker script](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/RRDD.scala#L395).
  BTW, [environment variable "SPARKR_DRIVER_R"](https://github.com/apache/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java#L275) is used to locate the R shell on the local host.
  For your information, PySpark has two environment variables serving a similar purpose:
  - PYSPARK_PYTHON: Python binary executable to use for PySpark in both driver and workers (default is `python`).
  - PYSPARK_DRIVER_PYTHON: Python binary executable to use for PySpark in the driver only (default is PYSPARK_PYTHON).
  PySpark uses the code [here](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/PythonRunner.scala#L41) to determine the Python executable for a Python script.
  Author: Sun Rui <rui.sun@intel.com> Closes #9179 from sun-rui/SPARK-10971.
* [SPARK-11194] [SQL] Use MutableURLClassLoader for the classLoader in IsolatedClientLoader. (Yin Huai, 2015-10-23; 1 file, -28/+51)
  https://issues.apache.org/jira/browse/SPARK-11194
  Author: Yin Huai <yhuai@databricks.com> Closes #9170 from yhuai/SPARK-11194.
* [SPARK-11274] [SQL] Text data source support for Spark SQL. (Reynold Xin, 2015-10-23; 7 files, -4/+283)
  This adds an API for reading and writing text files, similar to SparkContext.textFile and RDD.saveAsTextFile.
  ```
  SQLContext.read.text("/path/to/something.txt")
  DataFrame.write.text("/path/to/write.txt")
  ```
  Using the new Dataset API, this also supports
  ```
  val ds: Dataset[String] = SQLContext.read.text("/path/to/something.txt").as[String]
  ```
  Author: Reynold Xin <rxin@databricks.com> Closes #9240 from rxin/SPARK-11274.
* [SPARK-6723] [MLLIB] Model import/export for ChiSqSelector (Jayant Shekar, 2015-10-23; 2 files, -1/+95)
  This is a PR for Parquet-based model import/export.
  * Added save/load for ChiSqSelectorModel
  * Updated the test suite ChiSqSelectorSuite
  Author: Jayant Shekar <jayant@user-MBPMBA-3.local> Closes #6785 from jayantshekhar/SPARK-6723.
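  A short sketch of the standard MLlib save/load pattern this patch implements for ChiSqSelectorModel; `sc` and `labeledData` (an RDD[LabeledPoint]) are assumed to exist:
  ```scala
  import org.apache.spark.mllib.feature.{ChiSqSelector, ChiSqSelectorModel}

  val model: ChiSqSelectorModel = new ChiSqSelector(50).fit(labeledData)
  model.save(sc, "/tmp/chisq-model")                             // Parquet-backed model data
  val restored = ChiSqSelectorModel.load(sc, "/tmp/chisq-model") // round-trips the model
  ```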
* [SPARK-10277] [MLLIB] [PYSPARK] Add @since annotation to pyspark.mllib.regression (Yu ISHIKAWA, 2015-10-23; 1 file, -1/+101)
  Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #8684 from yu-iskw/SPARK-10277.
* [SPARK-10382] Make example code in user guide testable (Xusen Yin, 2015-10-23; 1 file, -0/+96)
  POC code for making the example code in the user guide testable. mengxr We still need to talk about the labels in the code.
  Author: Xusen Yin <yinxusen@gmail.com> Closes #9109 from yinxusen/SPARK-10382.