Commit log (most recent first). Each entry: commit message (Author, Date, Files changed, Lines -deleted/+added)
* [SPARK-11550][DOCS] Replace example code in mllib-optimization.md using include_example (Pravin Gadakh, 2015-11-10, 3 files, -143/+200)

  Author: Pravin Gadakh <pravingadakh177@gmail.com>
  Closes #9516 from pravingadakh/SPARK-11550.

* [SPARK-11616][SQL] Improve toString for Dataset (Michael Armbrust, 2015-11-10, 4 files, -13/+47)

  Author: Michael Armbrust <michael@databricks.com>
  Closes #9586 from marmbrus/dataset-toString.

* [SPARK-7316][MLLIB] RDD sliding window with step (unknown, 2015-11-10, 3 files, -39/+54)

  Implementation of a step capability for the sliding window function in MLlib's RDD. Though one can use the current sliding window with step 1 and then filter every Nth window, that takes more time and space (N*data.count times more than needed). For example, below are the results for various windows and steps on 10M data points:

  Window | Step | Time (s) | Windows produced
  ------ | ---- | -------- | ----------------
  128    | 1    | 6.38     | 9999873
  128    | 10   | 0.9      | 999988
  128    | 100  | 0.41     | 99999
  1024   | 1    | 44.67    | 9998977
  1024   | 10   | 4.74     | 999898
  1024   | 100  | 0.78     | 99990

  ```
  import org.apache.spark.mllib.rdd.RDDFunctions._
  val rdd = sc.parallelize(1 to 10000000, 10)
  rdd.count
  val window = 1024
  val step = 1
  val t = System.nanoTime()
  val windows = rdd.sliding(window, step)
  println(windows.count)
  println((System.nanoTime() - t) / 1e9)
  ```

  Author: unknown <ulanov@ULANOV3.americas.hpqcorp.net>
  Author: Alexander Ulanov <nashb@yandex.ru>
  Author: Xiangrui Meng <meng@databricks.com>
  Closes #5855 from avulanov/SPARK-7316-sliding.

* [SPARK-11618][ML] Minor refactoring of basic ML import/export (Joseph K. Bradley, 2015-11-10, 1 file, -27/+30)

  Refactoring:
  * separated overwrite and param save logic in DefaultParamsWriter
  * added sparkVersion to DefaultParamsWriter

  CC: mengxr

  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #9587 from jkbradley/logreg-io.

* [ML][R] SparkR::glm summary result to compare with native R (Yanbo Liang, 2015-11-10, 2 files, -22/+11)

  Follow-up to #9561. Since [SPARK-11587](https://issues.apache.org/jira/browse/SPARK-11587) has been fixed, we should compare the SparkR::glm summary result with native R output rather than a hard-coded one. cc mengxr

  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #9590 from yanboliang/glm-r-test.

* [SPARK-10371][SQL] Implement subexpr elimination for UnsafeProjections (Nong Li, 2015-11-10, 11 files, -16/+523)

  This patch adds the building blocks for codegening subexpression elimination and implements it end to end for UnsafeProjection. The building blocks can be used to do the same thing for other operators.

  It introduces some utilities to compute common subexpressions. Expressions can be added to this data structure; the expression and its children are recursively matched against existing expressions (ones previously added) and grouped into common groups. This is built using the existing `semanticEquals`. It does not understand things like commutative or associative expressions; that can be done as future work.

  After building this data structure, the codegen process takes advantage of it by:

  1. Generating a helper function in the generated class that computes the common subexpression. This is done for all common subexpressions that have at least two occurrences and a sufficiently complex expression tree.
  2. When generating the apply() function, calling the helper function (if it exists) instead of regenerating the expression tree. Repeated calls to the helper function short-circuit the evaluation logic.

  Author: Nong Li <nong@databricks.com>
  Author: Nong Li <nongli@gmail.com>

  This patch had conflicts when merged, resolved by
  Committer: Michael Armbrust <michael@databricks.com>

  Closes #9480 from nongli/spark-10371.

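  The grouping idea, reduced to a self-contained sketch (hypothetical types, not Spark's actual Catalyst classes; plain case-class equality stands in for `semanticEquals`):

  ```scala
  import scala.collection.mutable

  sealed trait Expr { def children: Seq[Expr] }
  case class Col(name: String) extends Expr { def children = Nil }
  case class Upper(e: Expr) extends Expr { def children = Seq(e) }
  case class Concat(l: Expr, r: Expr) extends Expr { def children = Seq(l, r) }

  class EquivalentExprs {
    private val counts = mutable.LinkedHashMap.empty[Expr, Int]
    def add(e: Expr): Unit = {
      counts(e) = counts.getOrElse(e, 0) + 1
      e.children.foreach(add)  // recurse so shared subtrees are counted too
    }
    // Subexpressions worth a codegen helper: seen at least twice.
    def common: Seq[Expr] = counts.collect { case (e, n) if n > 1 => e }.toSeq
  }

  object SubexprDemo extends App {
    val shared = Upper(Col("name"))
    val eq = new EquivalentExprs
    Seq(Concat(shared, Col("a")), Concat(shared, Col("b"))).foreach(eq.add)
    println(eq.common)  // Upper(Col(name)) and its child Col(name) both repeat
  }
  ```
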
* [SPARK-11590][SQL] use native json_tuple in lateral view (Wenchen Fan, 2015-11-10, 8 files, -40/+104)

  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #9562 from cloud-fan/json-tuple.

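  For context, a lateral view with `json_tuple` extracts several fields from a JSON string column in one pass. A usage sketch (not from the patch; table and field names are made up, 1.6-era `sqlContext` assumed):

  ```scala
  // events(json STRING), e.g. {"user": "u1", "action": "click"}
  sqlContext.sql(
    """SELECT t.user, t.action
      |FROM events
      |LATERAL VIEW json_tuple(json, 'user', 'action') t AS user, action
    """.stripMargin).show()
  ```
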
* [SPARK-11578][SQL][FOLLOW-UP] complete the user facing api for typed aggregation (Wenchen Fan, 2015-11-10, 4 files, -14/+99)

  Currently the user-facing API for typed aggregation has some limitations:

  * the customized typed aggregation must be the first of the aggregation list
  * the customized typed aggregation can only use long as the buffer type
  * the customized typed aggregation can only use a flat type as the result type

  This PR removes these limitations; an example of what the relaxed API permits is sketched below.

  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #9599 from cloud-fan/agg.

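  For illustration, with those limits gone an `Aggregator` no longer needs a `Long` buffer. A hypothetical average aggregator with a composite `(sum, count)` buffer, written against the zero/reduce/present interface shown in the SPARK-11578 entry further down:

  ```scala
  // Hypothetical: the buffer is (running sum, running count), not a single long.
  class AverageOf[I](f: I => Double) extends Aggregator[I, (Double, Long), Double]
      with Serializable {
    override def zero: (Double, Long) = (0.0, 0L)
    override def reduce(b: (Double, Long), a: I): (Double, Long) =
      (b._1 + f(a), b._2 + 1L)
    override def present(r: (Double, Long)): Double =
      if (r._2 == 0L) 0.0 else r._1 / r._2
  }
  ```
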
* [SPARK-10863][SPARKR] Method coltypes() (New version) (Oscar D. Lara Yejas, 2015-11-10, 7 files, -18/+124)

  This is a follow-up on PR #8984, as the corresponding branch for that PR was damaged.

  Author: Oscar D. Lara Yejas <olarayej@mail.usf.edu>
  Closes #9579 from olarayej/SPARK-10863_NEW14.

* [SPARK-9830][SQL] Remove AggregateExpression1 and the Aggregate operator used to evaluate AggregateExpression1s (Yin Huai, 2015-11-10, 64 files, -2260/+743)

  https://issues.apache.org/jira/browse/SPARK-9830

  This PR contains the following main changes:

  * Removing `AggregateExpression1`.
  * Removing the `Aggregate` operator, which is used to evaluate `AggregateExpression1`.
  * Removing the planner rule used to plan `Aggregate`.
  * Linking `MultipleDistinctRewriter` to the analyzer.
  * Renaming `AggregateExpression2` to `AggregateExpression` and `AggregateFunction2` to `AggregateFunction`.
  * Updating places where we create aggregate expressions. The way to create aggregate expressions is `AggregateExpression(aggregateFunction, mode, isDistinct)`.
  * Changing `val`s in `DeclarativeAggregate`s that touch children of the function to `lazy val`s (when we create an aggregate expression in the DataFrame API, children of an aggregate function can be unresolved).

  Author: Yin Huai <yhuai@databricks.com>
  Closes #9556 from yhuai/removeAgg1.

* [SPARK-11252][NETWORK] ShuffleClient should release connection after fetching blocks has completed for external shuffle (Lianhui Wang, 2015-11-10, 6 files, -15/+81)

  With YARN's external shuffle, each executor's ExternalShuffleClient keeps its connection to YARN's NodeManager open until the application completes, so the NodeManager and the executors end up holding many socket connections. To reduce network pressure on the NodeManager's shuffle service, the connection to it should be closed once registerWithShuffleServer or fetchBlocks has completed in ExternalShuffleClient. cc andrewor14 rxin vanzin

  Author: Lianhui Wang <lianhuiwang09@gmail.com>
  Closes #9227 from lianhuiwang/spark-11252.

* [SPARK-7841][BUILD] Stop using retrieveManaged to retrieve dependencies in SBT (Josh Rosen, 2015-11-10, 2 files, -6/+18)

  This patch modifies Spark's SBT build so that it no longer uses `retrieveManaged` / `lib_managed` to store its dependencies. The motivations for this change are nicely described in the JIRA ticket ([SPARK-7841](https://issues.apache.org/jira/browse/SPARK-7841)); my personal interest in doing this stems from the fact that `lib_managed` has caused me some pain while debugging dependency issues in another PR of mine.

  Removing our use of `lib_managed` would be trivial except for one snag: the Datanucleus JARs, required by Spark SQL's Hive integration, cannot be included in assembly JARs due to problems with merging OSGI `plugin.xml` files. As a result, several places in the packaging and deployment pipeline assume that these Datanucleus JARs are copied to `lib_managed/jars`. In the interest of maintaining compatibility, I have chosen to retain the `lib_managed/jars` directory _only_ for these Datanucleus JARs and have added custom code to `SparkBuild.scala` to automatically copy those JARs to that folder as part of the `assembly` task.

  `dev/mima` also depended on `lib_managed` in a hacky way in order to set classpaths when generating MiMa excludes; I've updated this to obtain the classpaths directly from SBT instead.

  /cc dragos marmbrus pwendell srowen

  Author: Josh Rosen <joshrosen@databricks.com>
  Closes #9575 from JoshRosen/SPARK-7841.

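  For reference, the mechanism being removed is SBT's `retrieveManaged` setting, which copies every resolved dependency jar into `lib_managed/` (a simplified sketch of the old configuration, not the new build code; the exact pattern is illustrative):

  ```scala
  // In SparkBuild.scala (old behavior, simplified): every dependency jar is
  // copied under lib_managed/jars on resolution.
  retrieveManaged := true,
  retrievePattern := "[type]s/[artifact](-[revision])(-[classifier]).[ext]"
  ```
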
* [SPARK-11382] Replace example code in mllib-decision-tree.md using include_example (Xusen Yin, 2015-11-10, 8 files, -247/+438)

  https://issues.apache.org/jira/browse/SPARK-11382

  B.T.W. I fixed an error in naive_bayes_example.py.

  Author: Xusen Yin <yinxusen@gmail.com>
  Closes #9596 from yinxusen/SPARK-11382.

* Fix typo in driver page (Paul Chandler, 2015-11-10, 1 file, -1/+1)

  "Comamnd property" => "Command property"

  Author: Paul Chandler <pestilence669@users.noreply.github.com>
  Closes #9578 from pestilence669/fix_spelling.

* [SPARK-11598][SQL] enable tests for ShuffledHashOuterJoin (Davies Liu, 2015-11-09, 1 file, -204/+231)

  Author: Davies Liu <davies@databricks.com>
  Closes #9573 from davies/join_condition.

* [SPARK-11599][SQL] fix NPE when resolving Hive UDF in SQLParser (Davies Liu, 2015-11-09, 2 files, -9/+34)

  The DataFrame APIs that take a SQL expression always use SQLParser, so the HiveFunctionRegistry is called outside of the Hive state, causing an NPE if there is no active SessionState for the current thread (in PySpark). cc rxin yhuai

  Author: Davies Liu <davies@databricks.com>
  Closes #9576 from davies/hive_udf.

* [SPARK-11587][SPARKR] Fix the summary generic to match base R (Shivaram Venkataraman, 2015-11-09, 4 files, -10/+16)

  The signature is summary(object, ...) as defined in https://stat.ethz.ch/R-manual/R-devel/library/base/html/summary.html

  Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
  Closes #9582 from shivaram/summary-fix.

* Add mockito as an explicit test dependency to spark-streaming (Burak Yavuz, 2015-11-09, 1 file, -0/+5)

  While SBT compiles successfully because it properly pulls in the mockito dependency, Maven builds have broken. We need this ASAP. cc tdas

  Author: Burak Yavuz <brkyvz@gmail.com>
  Closes #9584 from brkyvz/fix-master.

* [SPARK-11333][STREAMING] Add executorId to ReceiverInfo and display it in UI (Shixiong Zhu, 2015-11-09, 8 files, -7/+22)

  Expose executorId in `ReceiverInfo` and the UI, since it's helpful when multiple executors are running on the same host. Screenshot: https://cloud.githubusercontent.com/assets/1000778/10890968/2e2f5512-8150-11e5-8d9d-746e826b69e8.png

  Author: Shixiong Zhu <shixiong@databricks.com>
  Author: zsxwing <zsxwing@gmail.com>
  Closes #9418 from zsxwing/SPARK-11333.

* [SPARK-11462][STREAMING] Add JavaStreamingListener (zsxwing, 2015-11-09, 4 files, -0/+665)

  Currently StreamingListener is not Java-friendly because it exposes some Scala collections, such as Option and Map, to Java users directly. This PR adds a Java version of StreamingListener and a set of Java-friendly classes for Java users.

  Author: zsxwing <zsxwing@gmail.com>
  Author: Shixiong Zhu <shixiong@databricks.com>
  Closes #9420 from zsxwing/java-streaming-listener.

* [SPARK-11141][STREAMING] Batch ReceivedBlockTrackerLogEvents for WAL writes (Burak Yavuz, 2015-11-09, 6 files, -192/+767)

  When using S3 as a directory for WALs, the writes take too long. The driver gets very easily bottlenecked when multiple receivers send AddBlock events to the ReceiverTracker. This PR adds batching of events in the ReceivedBlockTracker so that receivers don't get blocked by the driver for too long. cc zsxwing tdas

  Author: Burak Yavuz <brkyvz@gmail.com>
  Closes #9143 from brkyvz/batch-wal-writes.

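  The batching idea in a generic, illustrative form (not the actual `ReceivedBlockTracker` code): receivers enqueue events without blocking, and a single writer thread drains whatever has accumulated into one WAL write.

  ```scala
  import java.util.concurrent.LinkedBlockingQueue
  import scala.collection.JavaConverters._

  class BatchedWriter[E](writeBatch: Seq[E] => Unit) {
    private val queue = new LinkedBlockingQueue[E]()
    def post(event: E): Unit = queue.put(event)  // cheap for the caller

    private val writer = new Thread("batched-wal-writer") {
      override def run(): Unit = try {
        while (true) {
          val batch = new java.util.ArrayList[E]()
          batch.add(queue.take())          // block until at least one event
          queue.drainTo(batch)             // then grab everything else queued
          writeBatch(batch.asScala.toSeq)  // one write per batch
        }
      } catch { case _: InterruptedException => () }  // shutdown path
    }
    writer.setDaemon(true)
    writer.start()
  }
  ```
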
* [SPARK-11198][STREAMING][KINESIS] Support de-aggregation of records during recovery (Burak Yavuz, 2015-11-09, 8 files, -25/+76)

  While the KCL handles de-aggregation during regular operation, during recovery we use the lower-level API and therefore need to de-aggregate the records ourselves. cc tdas

  Testing is an issue: we need protobuf magic to produce aggregated records. Maybe we could depend on the KPL for tests?

  Author: Burak Yavuz <brkyvz@gmail.com>
  Closes #9403 from brkyvz/kinesis-deaggregation.

* [SPARK-11069][ML] Add RegexTokenizer option to convert to lowercase (Yuhao Yang, 2015-11-09, 3 files, -7/+35)

  jira: https://issues.apache.org/jira/browse/SPARK-11069

  Quoting the JIRA: Tokenizer converts strings to lowercase automatically, but RegexTokenizer does not. It would be nice to add an option to RegexTokenizer to convert to lowercase. Proposal: call the Boolean Param "toLowercase" and set the default to false (so behavior does not change). Actually, sklearn converts to lowercase before tokenizing too.

  Author: Yuhao Yang <hhbyyh@gmail.com>
  Closes #9092 from hhbyyh/tokenLower.

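  Usage would look roughly like this (a sketch; the param name follows the proposal quoted above, and the column names are illustrative):

  ```scala
  import org.apache.spark.ml.feature.RegexTokenizer

  val tokenizer = new RegexTokenizer()
    .setInputCol("sentence")
    .setOutputCol("words")
    .setPattern("\\W")       // split on non-word characters
    .setToLowercase(true)    // the new option; the proposal's default is false
  // tokenizer.transform(df) then lowercases the text before tokenizing
  ```
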
* [SPARK-11610][MLLIB][PYTHON][DOCS] Make the docs of LDAModel.describeTopics in Python more specific (Yu ISHIKAWA, 2015-11-09, 1 file, -0/+6)

  cc jkbradley

  Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
  Closes #9577 from yu-iskw/SPARK-11610.

* [SPARK-11564][SQL] Fix documentation for DataFrame.take/collect (Reynold Xin, 2015-11-09, 1 file, -4/+4)

  Author: Reynold Xin <rxin@databricks.com>
  Closes #9557 from rxin/SPARK-11564-1.

* [SPARK-11578][SQL] User API for Typed Aggregation (Michael Armbrust, 2015-11-09, 9 files, -42/+360)

  This PR adds a new interface for user-defined aggregations that can be used in `DataFrame` and `Dataset` operations to take all of the elements of a group and reduce them to a single value. For example, the following aggregator extracts an `int` from a specific class and adds them up:

  ```scala
  case class Data(i: Int)

  val customSummer = new Aggregator[Data, Int, Int] {
    def prepare(d: Data) = d.i
    def reduce(l: Int, r: Int) = l + r
    def present(r: Int) = r
  }.toColumn()

  val ds: Dataset[Data] = ...
  val aggregated = ds.select(customSummer)
  ```

  By using helper functions, users can make a generic `Aggregator` that works on any input type:

  ```scala
  /** An `Aggregator` that adds up any numeric type returned by the given function. */
  class SumOf[I, N : Numeric](f: I => N) extends Aggregator[I, N, N] with Serializable {
    val numeric = implicitly[Numeric[N]]
    override def zero: N = numeric.zero
    override def reduce(b: N, a: I): N = numeric.plus(b, f(a))
    override def present(reduction: N): N = reduction
  }

  def sum[I, N : Numeric : Encoder](f: I => N): TypedColumn[I, N] = new SumOf(f).toColumn
  ```

  These aggregators can then be used alongside other built-in SQL aggregations:

  ```scala
  val ds = Seq(("a", 10), ("a", 20), ("b", 1), ("b", 2), ("c", 1)).toDS()

  ds.groupBy(_._1)
    .agg(
      sum(_._2),                // The aggregator defined above.
      expr("sum(_2)").as[Int],  // A built-in dynamically typed aggregation.
      count("*"))               // A built-in statically typed aggregation.
    .collect()

  res0: ("a", 30, 30, 2L), ("b", 3, 3, 2L), ("c", 1, 1, 1L)
  ```

  The current implementation focuses on integrating this into the typed API, but currently only supports running aggregations that return a single long value, as explained in `TypedAggregateExpression`. This will be improved in a follow-up PR.

  Author: Michael Armbrust <michael@databricks.com>
  Closes #9555 from marmbrus/dataset-useragg.

* [SPARK-11360][DOC] Loss of nullability when writing parquet files (gatorsmile, 2015-11-09, 1 file, -1/+2)

  This fix adds one line to explain the current behavior of Spark SQL when writing Parquet files: all columns are forced to be nullable for compatibility reasons.

  Author: gatorsmile <gatorsmile@gmail.com>
  Closes #9314 from gatorsmile/lossNull.

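  The documented behavior is easy to see directly (a minimal sketch, assuming a 1.6-era `sqlContext` and its implicits; the path is made up):

  ```scala
  import sqlContext.implicits._

  val df = Seq((1, "a"), (2, "b")).toDF("id", "name")
  df.printSchema()  // id: integer (nullable = false)

  df.write.parquet("/tmp/nullability_demo")
  sqlContext.read.parquet("/tmp/nullability_demo").printSchema()
  // id: integer (nullable = true) -- forced nullable on write, per the doc note
  ```
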
* [SPARK-9557][SQL] Refactor ParquetFilterSuite and remove old ParquetFilters code (hyukjinkwon, 2015-11-09, 1 file, -4/+4)

  This was actually resolved by https://github.com/apache/spark/pull/8275, but I found that the JIRA issue is not marked as resolved, since that PR was opened for another issue yet resolved both. I commented that this is resolved by the PR above; however, I opened this PR as I would like to add a small correction.

  In the previous PR, I refactored the test so that it collects filters without reducing them; this does not properly test the `And` filter (which is never given to the tests), and I had unintentionally changed the behavior from the original (pre-refactoring) way. In this PR, I follow the original way of collecting filters by reducing.

  I would like to close this if this PR is inappropriate and somebody else would like to deal with it in a separate PR.

  Author: hyukjinkwon <gurwls223@gmail.com>
  Closes #9554 from HyukjinKwon/SPARK-9557.

* [SPARK-11564][SQL][FOLLOW-UP] improve java api for GroupedDataset (Wenchen Fan, 2015-11-09, 9 files, -25/+118)

  Created `MapGroupFunction`, `FlatMapGroupFunction`, `CoGroupFunction`.

  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #9564 from cloud-fan/map.

* [SPARK-6517][MLLIB] Implement the Algorithm of Hierarchical Clustering (Yu ISHIKAWA, 2015-11-09, 4 files, -0/+841)

  I implemented a hierarchical clustering algorithm again. This PR doesn't include examples, documentation, or spark.ml APIs; I am going to send other PRs later. https://issues.apache.org/jira/browse/SPARK-6517

  - This implementation is based on bisecting K-means clustering.
  - It derives from freeman-lab's implementation.
  - The basic idea is not changed from the previous version (#2906).
  - However, it is 1000x faster than the previous version through parallel processing.

  Thank you for your great cooperation, RJ Nowling (rnowling), Jeremy Freeman (freeman-lab), Xiangrui Meng (mengxr) and Sean Owen (srowen).

  Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
  Author: Xiangrui Meng <meng@databricks.com>
  Author: Yu ISHIKAWA <yu-iskw@users.noreply.github.com>
  Closes #5267 from yu-iskw/new-hierarchical-clustering.

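  The bisecting idea in miniature (a toy one-dimensional sketch, not the MLlib code): repeatedly take the largest cluster and split it with 2-means until k clusters exist.

  ```scala
  object BisectingSketch extends App {
    // Crude 2-means on doubles: seed centers at min/max, alternate assign/update.
    def twoMeans(xs: Seq[Double], iters: Int = 10): (Seq[Double], Seq[Double]) = {
      var (c1, c2) = (xs.min, xs.max)
      var split = xs.partition(x => math.abs(x - c1) <= math.abs(x - c2))
      for (_ <- 1 to iters) {
        if (split._1.nonEmpty) c1 = split._1.sum / split._1.size
        if (split._2.nonEmpty) c2 = split._2.sum / split._2.size
        split = xs.partition(x => math.abs(x - c1) <= math.abs(x - c2))
      }
      split
    }

    def bisect(data: Seq[Double], k: Int): Seq[Seq[Double]] = {
      var clusters = Seq(data)
      while (clusters.size < k && clusters.exists(_.size > 1)) {
        val largest = clusters.maxBy(_.size)
        val (a, b) = twoMeans(largest)
        clusters = clusters.filterNot(_ == largest) ++ Seq(a, b).filter(_.nonEmpty)
      }
      clusters
    }

    println(bisect(Seq(1.0, 1.1, 1.2, 8.0, 8.1, 15.0, 15.2), k = 3))
    // clusters: {1.0, 1.1, 1.2}, {8.0, 8.1}, {15.0, 15.2} (order may vary)
  }
  ```
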
* [SPARK-11359][STREAMING][KINESIS] Checkpoint to DynamoDB even when new data doesn't come in (Burak Yavuz, 2015-11-09, 6 files, -183/+349)

  Currently, the checkpoints to DynamoDB occur only when new data comes in, as we update the clock for the checkpointState. This PR makes the checkpoint a scheduled execution based on the `checkpointInterval`.

  Author: Burak Yavuz <brkyvz@gmail.com>
  Closes #9421 from brkyvz/kinesis-checkpoint.

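  Conceptually the change replaces "checkpoint when data arrives" with a timer, along these lines (a generic sketch, not the actual Kinesis receiver code; the interval value is made up):

  ```scala
  import java.util.concurrent.{Executors, TimeUnit}

  val checkpointIntervalMs = 10000L  // hypothetical checkpointInterval
  val scheduler = Executors.newSingleThreadScheduledExecutor()
  scheduler.scheduleAtFixedRate(new Runnable {
    override def run(): Unit = {
      // the real code would invoke the KCL checkpointer here, whether or not
      // new records arrived since the last tick
      println(s"checkpoint tick at ${System.currentTimeMillis()}")
    }
  }, checkpointIntervalMs, checkpointIntervalMs, TimeUnit.MILLISECONDS)
  ```
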
* [SPARK-11595][SQL] Fixes ADD JAR when the input path contains a URL scheme (Cheng Lian, 2015-11-09, 4 files, -11/+18)

  Author: Cheng Lian <lian@databricks.com>
  Closes #9569 from liancheng/spark-11595.fix-add-jar.

* [SPARK-9301][SQL] Add collect_set and collect_list aggregate functions (Nick Buroojy, 2015-11-09, 4 files, -13/+64)

  For now these are thin wrappers around the corresponding Hive UDAFs. One limitation in Hive 0.13.0 is that they only support aggregating primitive types.

  I chose snake_case here instead of camelCase because it seems to be used in the majority of the multi-word fns. Do we also want to add these to `functions.py`?

  This approach was recommended here: https://github.com/apache/spark/pull/8592#issuecomment-154247089

  cc marmbrus rxin

  Author: Nick Buroojy <nick.buroojy@civitaslearning.com>
  Closes #9526 from nburoojy/nick/udaf-alias.

  (cherry picked from commit a6ee4f989d020420dd08b97abb24802200ff23b2)
  Signed-off-by: Michael Armbrust <michael@databricks.com>

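  Typical use, as a sketch (assuming a DataFrame `df` with columns `user` and `page`):

  ```scala
  import org.apache.spark.sql.functions.{collect_list, collect_set}

  df.groupBy("user")
    .agg(collect_list("page"), collect_set("page"))
    .show()
  // collect_list keeps duplicates; collect_set de-duplicates.
  // Neither guarantees element order.
  ```
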
* [SPARK-11548][DOCS] Replaced example code in mllib-collaborative-filtering.md using include_example (Rishabh Bhardwaj, 2015-11-09, 4 files, -135/+221)

  Kindly review the changes.

  Author: Rishabh Bhardwaj <rbnext29@gmail.com>
  Closes #9519 from rishabhbhardwaj/SPARK-11337.

* [SPARK-11552][DOCS] Replaced example code in ml-decision-tree.md using include_example (sachin aggarwal, 2015-11-09, 7 files, -330/+527)

  I have tested it on my local machine; it is working fine. Please review.

  Author: sachin aggarwal <different.sachin@gmail.com>
  Closes #9539 from agsachin/SPARK-11552-real.

* [SPARK-10471][CORE][MESOS] prevent getting offers for unmet constraints (Felix Bechstein, 2015-11-09, 3 files, -53/+91)

  This change rejects offers from slaves with unmet constraints for 120s, to mitigate offer starvation. This prevents Mesos from sending us these offers again and again; in return, we get more offers for slaves which might meet our constraints, and it enables Mesos to send the rejected offers to other frameworks. A sketch of the mechanism follows.

  Author: Felix Bechstein <felix.bechstein@otto.de>
  Closes #8639 from felixb/decline_offers_constraint_mismatch.

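  In Mesos API terms, this amounts to declining the offer with a refusal filter, roughly (a sketch against the Mesos Java API, not the actual patch; the 120s figure is from the description above):

  ```scala
  import org.apache.mesos.Protos.{Filters, Offer}
  import org.apache.mesos.SchedulerDriver

  // Decline an offer whose slave doesn't meet our constraints, and ask Mesos
  // not to re-offer those resources to this framework for 120 seconds.
  def declineForConstraintMismatch(driver: SchedulerDriver, offer: Offer): Unit = {
    val filters = Filters.newBuilder().setRefuseSeconds(120.0).build()
    driver.declineOffer(offer.getId, filters)
  }
  ```
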
* [SPARK-10280][MLLIB][PYSPARK][DOCS] Add @since annotation to pyspark.ml.classification (Yu ISHIKAWA, 2015-11-09, 1 file, -0/+56)

  Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
  Closes #8690 from yu-iskw/SPARK-10280.

* [SPARK-11581][DOCS] Example mllib code in documentation incorrectly computes MSE (Bharat Lal, 2015-11-09, 1 file, -1/+1)

  Author: Bharat Lal <bharat.iisc@gmail.com>
  Closes #9560 from bharatl/SPARK-11581.

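  For reference, the corrected quantity is a mean of squared errors; in the Scala form of the same kind of example it reads (a sketch, assuming `valuesAndPreds: RDD[(Double, Double)]` of (label, prediction) pairs):

  ```scala
  val MSE = valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.mean()
  println(s"training Mean Squared Error = $MSE")
  ```
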
* [DOCS] Fix typo for Python section on unifying Kafka streams (chriskang90, 2015-11-09, 1 file, -2/+2)

  1) kafkaStreams is a list, so it should be unpacked when passing it into the streaming context's union method, which accepts a variable number of streams.
  2) print() should be pprint() for PySpark.

  This contribution is my original work, and I license the work to the project under the project's open source license.

  Author: chriskang90 <jckang@uchicago.edu>
  Closes #9545 from c-kang/streaming_python_typo.

* [SPARK-9865][SPARKR] Flaky SparkR test: test_sparkSQL.R: sample on a DataFrame (felixcheung, 2015-11-09, 1 file, -2/+2)

  Make the sample test less flaky by setting the seed. Tested with:

  ```
  repeat {
    if (count(sample(df, FALSE, 0.1)) == 3) {
      break
    }
  }
  ```

  Author: felixcheung <felixcheung_m@hotmail.com>
  Closes #9549 from felixcheung/rsample.

* [SPARK-11112] Fix Scala 2.11 compilation error in RDDInfo.scala (tedyu, 2015-11-09, 1 file, -11/+1)

  As shown in https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Compile/job/Spark-Master-Scala211-Compile/1946/console , compilation fails with:

  ```
  [error] /home/jenkins/workspace/Spark-Master-Scala211-Compile/core/src/main/scala/org/apache/spark/storage/RDDInfo.scala:25: in class RDDInfo, multiple overloaded alternatives of constructor RDDInfo define default arguments.
  [error] class RDDInfo(
  [error]
  ```

  This PR tries to fix the compilation error.

  Author: tedyu <yuzhihong@gmail.com>
  Closes #9538 from tedyu/master.

* [SPARK-10565][CORE] add missing web UI stats to /api/v1/applications JSON (Charles Yeh, 2015-11-09, 6 files, -2/+114)

  I looked at the other endpoints, and they don't seem to be missing any fields. Added fields: ![image](https://cloud.githubusercontent.com/assets/613879/10948801/58159982-82e4-11e5-86dc-62da201af910.png)

  Author: Charles Yeh <charlesyeh@dropbox.com>
  Closes #9472 from CharlesYeh/api_vars.

* [SPARK-11582][MLLIB] specifying pmml version attribute =4.2 in the root node of pmml model (fazlan-nazeem, 2015-11-09, 1 file, -0/+1)

  The PMML models generated so far do not specify the PMML version in their root node. This is a problem when using the models in other tools, because those tools expect the version attribute to be set explicitly. This fix adds the version attribute to the generated PMML models, with the value 4.2.

  Author: fazlan-nazeem <fazlann@wso2.com>
  Closes #9558 from fazlan-nazeem/master.

* [SPARK-10689][ML][DOC] User guide and example code for AFTSurvivalRegression (Yanbo Liang, 2015-11-09, 5 files, -0/+281)

  Add user guide and example code for `AFTSurvivalRegression`.

  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #9491 from yanboliang/spark-10689.

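  A minimal usage sketch of the estimator the new guide covers (column names are illustrative, and `training` is an assumed DataFrame):

  ```scala
  import org.apache.spark.ml.regression.AFTSurvivalRegression

  // Expects a features vector column, a label column (time to event), and a
  // censor column where 1.0 = event observed and 0.0 = censored.
  val aft = new AFTSurvivalRegression()
    .setFeaturesCol("features")
    .setLabelCol("label")
    .setCensorCol("censor")

  val model = aft.fit(training)
  model.transform(training).show()
  ```
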
* [SPARK-11494][ML][R] Expose R-like summary statistics in SparkR::glm for linear regression (Yanbo Liang, 2015-11-09, 3 files, -15/+88)

  Expose R-like summary statistics in SparkR::glm for linear regression. The output of `summary` looks like:

  ```
  $DevianceResiduals
   Min         Max
   -0.9509607  0.7291832

  $Coefficients
                      Estimate    Std. Error  t value    Pr(>|t|)
  (Intercept)         1.6765      0.2353597   7.123139   4.456124e-11
  Sepal_Length        0.3498801   0.04630128  7.556598   4.187317e-12
  Species_versicolor  -0.9833885  0.07207471  -13.64402  0
  Species_virginica   -1.00751    0.09330565  -10.79796  0
  ```

  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #9561 from yanboliang/spark-11494.

* [DOC][MINOR][SQL] Fix internal link (Rohit Agarwal, 2015-11-09, 1 file, -1/+1)

  It doesn't show up as a hyperlink currently. It will show up as a hyperlink after this change.

  Author: Rohit Agarwal <mindprince@gmail.com>
  Closes #9544 from mindprince/patch-2.

* [SPARK-11218][CORE] show help messages for start-slave and start-master (Charles Yeh, 2015-11-09, 2 files, -14/+34)

  Addressing https://issues.apache.org/jira/browse/SPARK-11218; mostly copied from start-thriftserver.sh.

  ```
  charlesyeh-mbp:spark charlesyeh$ ./sbin/start-master.sh --help
  Usage: Master [options]

  Options:
    -i HOST, --ip HOST      Hostname to listen on (deprecated, please use --host or -h)
    -h HOST, --host HOST    Hostname to listen on
    -p PORT, --port PORT    Port to listen on (default: 7077)
    --webui-port PORT       Port for web UI (default: 8080)
    --properties-file FILE  Path to a custom Spark properties file.
                            Default is conf/spark-defaults.conf.
  ```

  ```
  charlesyeh-mbp:spark charlesyeh$ ./sbin/start-slave.sh
  Usage: Worker [options] <master>

  Master must be a URL of the form spark://hostname:port

  Options:
    -c CORES, --cores CORES  Number of cores to use
    -m MEM, --memory MEM     Amount of memory to use (e.g. 1000M, 2G)
    -d DIR, --work-dir DIR   Directory to run apps in (default: SPARK_HOME/work)
    -i HOST, --ip IP         Hostname to listen on (deprecated, please use --host or -h)
    -h HOST, --host HOST     Hostname to listen on
    -p PORT, --port PORT     Port to listen on (default: random)
    --webui-port PORT        Port for web UI (default: 8081)
    --properties-file FILE   Path to a custom Spark properties file.
                             Default is conf/spark-defaults.conf.
  ```

  Author: Charles Yeh <charlesyeh@dropbox.com>
  Closes #9432 from CharlesYeh/helpmsg.

* [SPARK-11453][SQL] append data to partitioned table will mess up the result (Wenchen Fan, 2015-11-08, 3 files, -4/+53)

  The reason is that:

  1. For a partitioned Hive table, we move the partition columns after the data columns (e.g. `<a: Int, b: Int>` partitioned by `a` becomes `<b: Int, a: Int>`).
  2. When appending data to a table, we use position to figure out how to match input columns to the table's columns.

  So when we append data to a partitioned table, we match the wrong columns between input and table. A solution is reordering the input columns before matching by position, like what we did for [`InsertIntoHadoopFsRelation`](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelation.scala#L101-L105). A sketch of the reordering idea follows.

  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #9408 from cloud-fan/append.

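  The reordering idea as a standalone sketch (a hypothetical helper, not the actual patch):

  ```scala
  import org.apache.spark.sql.DataFrame

  // Reorder the input's columns to the table's column order by name, so that a
  // later position-based match lines data and partition columns up correctly.
  def reorderTo(input: DataFrame, tableColumns: Seq[String]): DataFrame =
    input.select(tableColumns.map(input.col): _*)

  // e.g. table layout <b: Int, a: Int> (partition column `a` moved last):
  // reorderTo(input, Seq("b", "a")).write.insertInto("partitioned_table")
  ```
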
* [SPARK-11564][SQL] Dataset Java API audit (Reynold Xin, 2015-11-08, 14 files, -97/+316)

  A few changes:

  1. Removed fold, since it can be confusing for distributed collections.
  2. Created specific interfaces for each Dataset function (e.g. MapFunction, ReduceFunction, MapPartitionsFunction).
  3. Added more documentation and test cases.

  The other thing I'm considering is a "collector" interface for FlatMapFunction and MapPartitionsFunction, similar to MapReduce's map function.

  Author: Reynold Xin <rxin@databricks.com>
  Closes #9531 from rxin/SPARK-11564.

* [SPARK-11554][SQL] add map/flatMap to GroupedDataset (Wenchen Fan, 2015-11-08, 6 files, -37/+70)

  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #9521 from cloud-fan/map.

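  On the Scala side the corresponding operations read like this (a sketch against the 1.6-era typed API, where the methods ended up named `mapGroups`/`flatMapGroups`):

  ```scala
  // Assumes sqlContext.implicits._ is in scope.
  val ds = Seq(("a", 1), ("a", 2), ("b", 3)).toDS()

  // One output value per group:
  val sums = ds.groupBy(_._1).mapGroups { (key, values) =>
    (key, values.map(_._2).sum)
  }

  // Zero or more output values per group:
  val tagged = ds.groupBy(_._1).flatMapGroups { (key, values) =>
    values.map { case (_, v) => s"$key:$v" }
  }
  ```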