aboutsummaryrefslogtreecommitdiff
Commit message (Collapse)AuthorAgeFilesLines
...
* [SPARK-13130][SQL] Make codegen variable names easier to readReynold Xin2016-02-011-2/+9
| | | | | | | | | 1. Use lower case 2. Change long prefixes to something shorter (in this case I am changing only one: TungstenAggregate -> agg). Author: Reynold Xin <rxin@databricks.com> Closes #11017 from rxin/SPARK-13130.
* [SPARK-12790][CORE] Remove HistoryServer old multiple files formatfelixcheung2016-02-0120-235/+23
| | | | | | | | | Removed isLegacyLogDirectory code path and updated tests andrewor14 Author: felixcheung <felixcheung_m@hotmail.com> Closes #10860 from felixcheung/historyserverformat.
* [SPARK-12637][CORE] Print stage info of finished stages properlySean Owen2016-02-011-1/+12
| | | | | | | | | | Improve printing of StageInfo in onStageCompleted See also https://github.com/apache/spark/pull/10585 Author: Sean Owen <sowen@cloudera.com> Closes #10922 from srowen/SPARK-12637.
* [SPARK-13078][SQL] API and test cases for internal catalogReynold Xin2016-02-014-0/+710
| | | | | | | | | | This pull request creates an internal catalog API. The creation of this API is the first step towards consolidating SQLContext and HiveContext. I envision we will have two different implementations in Spark 2.0: (1) a simple in-memory implementation, and (2) an implementation based on the current HiveClient (ClientWrapper). I took a look at what Hive's internal metastore interface/implementation, and then created this API based on it. I believe this is the minimal set needed in order to achieve all the needed functionality. Author: Reynold Xin <rxin@databricks.com> Closes #10982 from rxin/SPARK-13078.
* Fix for [SPARK-12854][SQL] Implement complex types support in Columna…Jacek Laskowski2016-02-012-2/+2
| | | | | | | | | | …rBatch Fixes build for Scala 2.11. Author: Jacek Laskowski <jacek@japila.pl> Closes #10946 from jaceklaskowski/SPARK-12854-fix.
* [SPARK-13043][SQL] Implement remaining catalyst types in ColumnarBatch.Nong Li2016-02-018-42/+484
| | | | | | | | | | | | | | | This includes: float, boolean, short, decimal and calendar interval. Decimal is mapped to long or byte array depending on the size and calendar interval is mapped to a struct of int and long. The only remaining type is map. The schema mapping is straightforward but we might want to revisit how we deal with this in the rest of the execution engine. Author: Nong Li <nong@databricks.com> Closes #10961 from nongli/spark-13043.
* [SPARK-12979][MESOS] Don’t resolve paths on the local file system in Mesos ↵Iulian Dragos2016-02-013-3/+3
| | | | | | | | | | scheduler The driver filesystem is likely different from where the executors will run, so resolving paths (and symlinks, etc.) will lead to invalid paths on executors. Author: Iulian Dragos <jaguarul@gmail.com> Closes #10923 from dragos/issue/canonical-paths.
* [SPARK-12265][MESOS] Spark calls System.exit inside driver instead of ↵Nilanjan Raychaudhuri2016-02-013-4/+19
| | | | | | | | | | | throwing exception This takes over #10729 and makes sure that `spark-shell` fails with a proper error message. There is a slight behavioral change: before this change `spark-shell` would exit, while now the REPL is still there, but `sc` and `sqlContext` are not defined and the error is visible to the user. Author: Nilanjan Raychaudhuri <nraychaudhuri@gmail.com> Author: Iulian Dragos <jaguarul@gmail.com> Closes #10921 from dragos/pr/10729.
* [SPARK-12463][SPARK-12464][SPARK-12465][SPARK-10647][MESOS] Fix zookeeper ↵Timothy Chen2016-02-015-25/+36
| | | | | | | | | | dir with mesos conf and add docs. Fix zookeeper dir configuration used in cluster mode, and also add documentation around these settings. Author: Timothy Chen <tnachen@gmail.com> Closes #10057 from tnachen/fix_mesos_dir.
* [ML][MINOR] Invalid MulticlassClassification reference in ml-guideLewuathe2016-02-011-1/+1
| | | | | | | | | | In [ml-guide](https://spark.apache.org/docs/latest/ml-guide.html#example-model-selection-via-cross-validation), there is invalid reference to `MulticlassClassificationEvaluator` apidoc. https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.evaluation.MultiClassClassificationEvaluator Author: Lewuathe <lewuathe@me.com> Closes #10996 from Lewuathe/fix-typo-in-ml-guide.
* [DOCS] Fix the jar location of datanucleus in sql-programming-guid.mdTakeshi YAMAMURO2016-02-011-1/+1
| | | | | | | | ISTM `lib` is better because `datanucleus` jars are located in `lib` for release builds. Author: Takeshi YAMAMURO <linguin.m.s@gmail.com> Closes #10901 from maropu/DocFix.
* [SPARK-12705][SPARK-10777][SQL] Analyzer Rule ResolveSortReferencesgatorsmile2016-02-016-22/+274
| | | | | | | | | | | | | | | | | | | | | | | | | | | JIRA: https://issues.apache.org/jira/browse/SPARK-12705 **Scope:** This PR is a general fix for sorting reference resolution when the child's `outputSet` does not have the order-by attributes (called, *missing attributes*): - UnaryNode support is limited to `Project`, `Window`, `Aggregate`, `Distinct`, `Filter`, `RepartitionByExpression`. - We will not try to resolve the missing references inside a subquery, unless the outputSet of this subquery contains it. **General Reference Resolution Rules:** - Jump over the nodes with the following types: `Distinct`, `Filter`, `RepartitionByExpression`. Do not need to add missing attributes. The reason is their `outputSet` is decided by their `inputSet`, which is the `outputSet` of their children. - Group-by expressions in `Aggregate`: missing order-by attributes are not allowed to be added into group-by expressions since it will change the query result. Thus, in RDBMS, it is not allowed. - Aggregate expressions in `Aggregate`: if the group-by expressions in `Aggregate` contains the missing attributes but aggregate expressions do not have it, just add them into the aggregate expressions. This can resolve the analysisExceptions thrown by the three TCPDS queries. - `Project` and `Window` are special. We just need to add the missing attributes to their `projectList`. **Implementation:** 1. Traverse the whole tree in a pre-order manner to find all the resolvable missing order-by attributes. 2. Traverse the whole tree in a post-order manner to add the found missing order-by attributes to the node if their `inputSet` contains the attributes. 3. If the origins of the missing order-by attributes are different nodes, each pass only resolves the missing attributes that are from the same node. **Risk:** Low. This rule will be trigger iff ```!s.resolved && child.resolved``` is true. Thus, very few cases are affected. Author: gatorsmile <gatorsmile@gmail.com> Closes #10678 from gatorsmile/sortWindows.
* [SPARK-12989][SQL] Delaying Alias Cleanup after ExtractWindowExpressionsgatorsmile2016-02-012-2/+13
| | | | | | | | | | | | | | | | | | | | | | | | JIRA: https://issues.apache.org/jira/browse/SPARK-12989 In the rule `ExtractWindowExpressions`, we simply replace alias by the corresponding attribute. However, this will cause an issue exposed by the following case: ```scala val data = Seq(("a", "b", "c", 3), ("c", "b", "a", 3)).toDF("A", "B", "C", "num") .withColumn("Data", struct("A", "B", "C")) .drop("A") .drop("B") .drop("C") val winSpec = Window.partitionBy("Data.A", "Data.B").orderBy($"num".desc) data.select($"*", max("num").over(winSpec) as "max").explain(true) ``` In this case, both `Data.A` and `Data.B` are `alias` in `WindowSpecDefinition`. If we replace these alias expression by their alias names, we are unable to know what they are since they will not be put in `missingExpr` too. Author: gatorsmile <gatorsmile@gmail.com> Author: xiaoli <lixiao1983@gmail.com> Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local> Closes #10963 from gatorsmile/seletStarAfterColDrop.
* [SPARK-6847][CORE][STREAMING] Fix stack overflow issue when updateStateByKey ↵Shixiong Zhu2016-02-015-2/+119
| | | | | | | | | | is followed by a checkpointed dstream Add a local property to indicate if checkpointing all RDDs that are marked with the checkpoint flag, and enable it in Streaming Author: Shixiong Zhu <shixiong@databricks.com> Closes #10934 from zsxwing/recursive-checkpoint.
* [SPARK-13093] [SQL] improve null check in nullSafeCodeGen for unary, binary ↵Wenchen Fan2016-01-313-67/+85
| | | | | | | | | | | | | | | and ternary expression The current implementation is sub-optimal: * If an expression is always nullable, e.g. `Unhex`, we can still remove null check for children if they are not nullable. * If an expression has some non-nullable children, we can still remove null check for these children and keep null check for others. This PR improves this by making the null check elimination more fine-grained. Author: Wenchen Fan <wenchen@databricks.com> Closes #10987 from cloud-fan/null-check.
* [SPARK-13049] Add First/last with ignore nulls to functions.scalaHerman van Hovell2016-01-314-29/+157
| | | | | | | | | | | | | This PR adds the ability to specify the ```ignoreNulls``` option to the functions dsl, e.g: ```df.select($"id", last($"value", ignoreNulls = true).over(Window.partitionBy($"id").orderBy($"other"))``` This PR is some where between a bug fix (see the JIRA) and a new feature. I am not sure if we should backport to 1.6. cc yhuai Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #10957 from hvanhovell/SPARK-13049.
* [SPARK-12689][SQL] Migrate DDL parsing to the newly absorbed parserLiang-Chi Hsieh2016-01-3010-229/+208
| | | | | | | | | | | | JIRA: https://issues.apache.org/jira/browse/SPARK-12689 DDLParser processes three commands: createTable, describeTable and refreshTable. This patch migrates the three commands to newly absorbed parser. Author: Liang-Chi Hsieh <viirya@gmail.com> Author: Liang-Chi Hsieh <viirya@appier.com> Closes #10723 from viirya/migrate-ddl-describe.
* [SPARK-13070][SQL] Better error message when Parquet schema merging failsCheng Lian2016-01-304-7/+77
| | | | | | | | | Make sure we throw better error messages when Parquet schema merging fails. Author: Cheng Lian <lian@databricks.com> Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #10979 from viirya/schema-merging-failure-message.
* [SPARK-13100][SQL] improving the performance of stringToDate method in ↵wangyang2016-01-301-1/+2
| | | | | | | | | | DateTimeUtils.scala In jdk1.7 TimeZone.getTimeZone() is synchronized, so use an instance variable to hold an GMT TimeZone object instead of instantiate it every time. Author: wangyang <wangyang@haizhi.com> Closes #10994 from wangyang1992/datetimeUtil.
* [SPARK-6363][BUILD] Make Scala 2.11 the default Scala versionJosh Rosen2016-01-3049-194/+186
| | | | | | | | | | | | This patch changes Spark's build to make Scala 2.11 the default Scala version. To be clear, this does not mean that Spark will stop supporting Scala 2.10: users will still be able to compile Spark for Scala 2.10 by following the instructions on the "Building Spark" page; however, it does mean that Scala 2.11 will be the default Scala version used by our CI builds (including pull request builds). The Scala 2.11 compiler is faster than 2.10, so I think we'll be able to look forward to a slight speedup in our CI builds (it looks like it's about 2X faster for the Maven compile-only builds, for instance). After this patch is merged, I'll update Jenkins to add new compile-only jobs to ensure that Scala 2.10 compilation doesn't break. Author: Josh Rosen <joshrosen@databricks.com> Closes #10608 from JoshRosen/SPARK-6363.
* [SPARK-13098] [SQL] remove GenericInternalRowWithSchemaWenchen Fan2016-01-292-20/+5
| | | | | | | | This class is only used for serialization of Python DataFrame. However, we don't require internal row there, so `GenericRowWithSchema` can also do the job. Author: Wenchen Fan <wenchen@databricks.com> Closes #10992 from cloud-fan/python.
* [SPARK-12914] [SQL] generate aggregation with grouping keysDavies Liu2016-01-296-53/+393
| | | | | | | | | | This PR add support for grouping keys for generated TungstenAggregate. Spilling and performance improvements for BytesToBytesMap will be done by followup PR. Author: Davies Liu <davies@databricks.com> Closes #10855 from davies/gen_keys.
* [SPARK-13071] Coalescing HadoopRDD overwrites existing input metricsAndrew Or2016-01-293-3/+18
| | | | | | | | | | | | | | | | This issue is causing tests to fail consistently in master with Hadoop 2.6 / 2.7. This is because for Hadoop 2.5+ we overwrite existing values of `InputMetrics#bytesRead` in each call to `HadoopRDD#compute`. In the case of coalesce, e.g. ``` sc.textFile(..., 4).coalesce(2).count() ``` we will call `compute` multiple times in the same task, overwriting `bytesRead` values from previous calls to `compute`. For a regression test, see `InputOutputMetricsSuite.input metrics for old hadoop with coalesce`. I did not add a new regression test because it's impossible without significant refactoring; there's a lot of existing duplicate code in this corner of Spark. This was caused by #10835. Author: Andrew Or <andrew@databricks.com> Closes #10973 from andrewor14/fix-input-metrics-coalesce.
* [SPARK-13088] Fix DAG viz in latest version of chromeAndrew Or2016-01-291-6/+7
| | | | | | | | | | Apparently chrome removed `SVGElement.prototype.getTransformToElement`, which is used by our JS library dagre-d3 when creating edges. The real diff can be found here: https://github.com/andrewor14/dagre-d3/commit/7d6c0002e4c74b82a02c5917876576f71e215590, which is taken from the fix in the main repo: https://github.com/cpettitt/dagre-d3/commit/1ef067f1c6ad2e0980f6f0ca471bce998784b7b2 Upstream issue: https://github.com/cpettitt/dagre-d3/issues/202 Author: Andrew Or <andrew@databricks.com> Closes #10986 from andrewor14/fix-dag-viz.
* [SPARK-13096][TEST] Fix flaky verifyPeakExecutionMemorySetAndrew Or2016-01-291-0/+2
| | | | | | | | | | Previously we would assert things before all events are guaranteed to have been processed. To fix this, just block until all events are actually processed, i.e. until the listener queue is empty. https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/79/testReport/junit/org.apache.spark.util.collection/ExternalAppendOnlyMapSuite/spilling/ Author: Andrew Or <andrew@databricks.com> Closes #10990 from andrewor14/accum-suite-less-flaky.
* [SPARK-13076][SQL] Rename ClientInterface -> HiveClientReynold Xin2016-01-2912-42/+41
| | | | | | | | | | And ClientWrapper -> HiveClientImpl. I have some followup pull requests to introduce a new internal catalog, and I think this new naming reflects better the functionality of the two classes. Author: Reynold Xin <rxin@databricks.com> Closes #10981 from rxin/SPARK-13076.
* [SPARK-13055] SQLHistoryListener throws ClassCastExceptionAndrew Or2016-01-2913-45/+133
| | | | | | | | | | This is an existing issue uncovered recently by #10835. The reason for the exception was because the `SQLHistoryListener` gets all sorts of accumulators, not just the ones that represent SQL metrics. For example, the listener gets the `internal.metrics.shuffleRead.remoteBlocksFetched`, which is an Int, then it proceeds to cast the Int to a Long, which fails. The fix is to mark accumulators representing SQL metrics using some internal metadata. Then we can identify which ones are SQL metrics and only process those in the `SQLHistoryListener`. Author: Andrew Or <andrew@databricks.com> Closes #10971 from andrewor14/fix-sql-history.
* [SPARK-12818] Polishes spark-sketch moduleCheng Lian2016-01-296-83/+110
| | | | | | | | Fixes various minor code and Javadoc styling issues. Author: Cheng Lian <lian@databricks.com> Closes #10985 from liancheng/sketch-polishing.
* [SPARK-12656] [SQL] Implement Intersect with Left-semi Joingatorsmile2016-01-2911-122/+211
| | | | | | | | | | | | Our current Intersect physical operator simply delegates to RDD.intersect. We should remove the Intersect physical operator and simply transform a logical intersect into a semi-join with distinct. This way, we can take advantage of all the benefits of join implementations (e.g. managed memory, code generation, broadcast joins). After a search, I found one of the mainstream RDBMS did the same. In their query explain, Intersect is replaced by Left-semi Join. Left-semi Join could help outer-join elimination in Optimizer, as shown in the PR: https://github.com/apache/spark/pull/10566 Author: gatorsmile <gatorsmile@gmail.com> Author: xiaoli <lixiao1983@gmail.com> Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local> Closes #10630 from gatorsmile/IntersectBySemiJoin.
* [SPARK-13072] [SQL] simplify and improve murmur3 hash expression codegenWenchen Fan2016-01-291-86/+69
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | simplify(remove several unnecessary local variables) the generated code of hash expression, and avoid null check if possible. generated code comparison for `hash(int, double, string, array<string>)`: **before:** ``` public UnsafeRow apply(InternalRow i) { /* hash(input[0, int],input[1, double],input[2, string],input[3, array<int>],42) */ int value1 = 42; /* input[0, int] */ int value3 = i.getInt(0); if (!false) { value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashInt(value3, value1); } /* input[1, double] */ double value5 = i.getDouble(1); if (!false) { value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashLong(Double.doubleToLongBits(value5), value1); } /* input[2, string] */ boolean isNull6 = i.isNullAt(2); UTF8String value7 = isNull6 ? null : (i.getUTF8String(2)); if (!isNull6) { value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(value7.getBaseObject(), value7.getBaseOffset(), value7.numBytes(), value1); } /* input[3, array<int>] */ boolean isNull8 = i.isNullAt(3); ArrayData value9 = isNull8 ? null : (i.getArray(3)); if (!isNull8) { int result10 = value1; for (int index11 = 0; index11 < value9.numElements(); index11++) { if (!value9.isNullAt(index11)) { final int element12 = value9.getInt(index11); result10 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashInt(element12, result10); } } value1 = result10; } } ``` **after:** ``` public UnsafeRow apply(InternalRow i) { /* hash(input[0, int],input[1, double],input[2, string],input[3, array<int>],42) */ int value1 = 42; /* input[0, int] */ int value3 = i.getInt(0); value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashInt(value3, value1); /* input[1, double] */ double value5 = i.getDouble(1); value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashLong(Double.doubleToLongBits(value5), value1); /* input[2, string] */ boolean isNull6 = i.isNullAt(2); UTF8String value7 = isNull6 ? null : (i.getUTF8String(2)); if (!isNull6) { value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(value7.getBaseObject(), value7.getBaseOffset(), value7.numBytes(), value1); } /* input[3, array<int>] */ boolean isNull8 = i.isNullAt(3); ArrayData value9 = isNull8 ? null : (i.getArray(3)); if (!isNull8) { for (int index10 = 0; index10 < value9.numElements(); index10++) { final int element11 = value9.getInt(index10); value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashInt(element11, value1); } } rowWriter14.write(0, value1); return result12; } ``` Author: Wenchen Fan <wenchen@databricks.com> Closes #10974 from cloud-fan/codegen.
* [SPARK-10873] Support column sort and search for History Server.zhuol2016-01-2928-202/+1721
| | | | | | | | | | | | | | | | | | | | | | | | [SPARK-10873] Support column sort and search for History Server using jQuery DataTable and REST API. Before this commit, the history server was generated hard-coded html and can not support search, also, the sorting was disabled if there is any application that has more than one attempt. Supporting search and sort (over all applications rather than the 20 entries in the current page) in any case will greatly improve user experience. 1. Create the historypage-template.html for displaying application information in datables. 2. historypage.js uses jQuery to access the data from /api/v1/applications REST API, and use DataTable to display each application's information. For application that has more than one attempt, the RowsGroup is used to merge such entries while at the same time supporting sort and search. 3. "duration" and "lastUpdated" rest API are added to application's "attempts". 4. External javascirpt and css files for datatables, RowsGroup and jquery plugins are added with licenses clarified. Snapshots for how it looks like now: History page view: ![historypage](https://cloud.githubusercontent.com/assets/11683054/12184383/89bad774-b55a-11e5-84e4-b0276172976f.png) Search: ![search](https://cloud.githubusercontent.com/assets/11683054/12184385/8d3b94b0-b55a-11e5-869a-cc0ef0a4242a.png) Sort by started time: ![sort-by-started-time](https://cloud.githubusercontent.com/assets/11683054/12184387/8f757c3c-b55a-11e5-98c8-577936366566.png) Author: zhuol <zhuol@yahoo-inc.com> Closes #10648 from zhuoliu/10873.
* [SPARK-13032][ML][PYSPARK] PySpark support model export/import and take ↵Yanbo Liang2016-01-295-29/+236
| | | | | | | | | | | | | | LinearRegression as example * Implement ```MLWriter/MLWritable/MLReader/MLReadable``` for PySpark. * Making ```LinearRegression``` to support ```save/load``` as example. After this merged, the work for other transformers/estimators will be easy, then we can list and distribute the tasks to the community. cc mengxr jkbradley Author: Yanbo Liang <ybliang8@gmail.com> Author: Joseph K. Bradley <joseph@databricks.com> Closes #10469 from yanboliang/spark-11939.
* [SPARK-13031][SQL] cleanup codegen and improve test coverageDavies Liu2016-01-2911-205/+350
| | | | | | | | | | | | | | 1. enable whole stage codegen during tests even there is only one operator supports that. 2. split doProduce() into two APIs: upstream() and doProduce() 3. generate prefix for fresh names of each operator 4. pass UnsafeRow to parent directly (avoid getters and create UnsafeRow again) 5. fix bugs and tests. This PR re-open #10944 and fix the bug. Author: Davies Liu <davies@databricks.com> Closes #10977 from davies/gen_refactor.
* [SPARK-13050][BUILD] Scalatest tags fail build with the addition of the ↵Alex Bozarth2016-01-281-0/+7
| | | | | | | | | | sketch module A dependency on the spark test tags was left out of the sketch module pom file causing builds to fail when test tags were used. This dependency is found in the pom file for every other module in spark. Author: Alex Bozarth <ajbozart@us.ibm.com> Closes #10954 from ajbozarth/spark13050.
* [SPARK-13067] [SQL] workaround for a weird scala reflection problemWenchen Fan2016-01-282-6/+23
| | | | | | | | | A simple workaround to avoid getting parameter types when convert a logical plan to json. Author: Wenchen Fan <wenchen@databricks.com> Closes #10970 from cloud-fan/reflection.
* [SPARK-12968][SQL] Implement command to set current databaseLiang-Chi Hsieh2016-01-289-3/+50
| | | | | | | | | | | JIRA: https://issues.apache.org/jira/browse/SPARK-12968 Implement command to set current database. Author: Liang-Chi Hsieh <viirya@gmail.com> Author: Liang-Chi Hsieh <viirya@appier.com> Closes #10916 from viirya/ddl-use-database.
* Revert "[SPARK-13031] [SQL] cleanup codegen and improve test coverage"Davies Liu2016-01-289-334/+202
| | | | This reverts commit cc18a7199240bf3b03410c1ba6704fe7ce6ae38e.
* [SPARK-11955][SQL] Mark optional fields in merging schema for safely ↵Liang-Chi Hsieh2016-01-286-29/+117
| | | | | | | | | | | | | | | pushdowning filters in Parquet JIRA: https://issues.apache.org/jira/browse/SPARK-11955 Currently we simply skip pushdowning filters in parquet if we enable schema merging. However, we can actually mark particular fields in merging schema for safely pushdowning filters in parquet. Author: Liang-Chi Hsieh <viirya@appier.com> Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #9940 from viirya/safe-pushdown-parquet-filters.
* [SPARK-12749][SQL] add json option to parse floating-point types as DecimalTypeBrandon Bradley2016-01-285-2/+40
| | | | | | | | | | I tried to add this via `USE_BIG_DECIMAL_FOR_FLOATS` option from Jackson with no success. Added test for non-complex types. Should I add a test for complex types? Author: Brandon Bradley <bradleytastic@gmail.com> Closes #10936 from blbradley/spark-12749.
* [SPARK-12401][SQL] Add integration tests for postgres enum typesTakeshi YAMAMURO2016-01-281-6/+9
| | | | | | | | | We can handle posgresql-specific enum types as strings in jdbc. So, we should just add tests and close the corresponding JIRA ticket. Author: Takeshi YAMAMURO <linguin.m.s@gmail.com> Closes #10596 from maropu/AddTestsInIntegration.
* [SPARK-9835][ML] Implement IterativelyReweightedLeastSquares solverYanbo Liang2016-01-283-1/+314
| | | | | | | | | | | | | | | | | | | | Implement ```IterativelyReweightedLeastSquares``` solver for GLM. I consider it as a solver rather than estimator, it only used internal so I keep it ```private[ml]```. There are two limitations in the current implementation compared with R: * It can not support ```Tuple``` as response for ```Binomial``` family, such as the following code: ``` glm( cbind(using, notUsing) ~ age + education + wantsMore , family = binomial) ``` * It does not support ```offset```. Because I considered that ```RFormula``` did not support ```Tuple``` as label and ```offset``` keyword, so I simplified the implementation. But to add support for these two functions is not very hard, I can do it in follow-up PR if it is necessary. Meanwhile, we can also add R-like statistic summary for IRLS. The implementation refers R, [statsmodels](https://github.com/statsmodels/statsmodels) and [sparkGLM](https://github.com/AlteryxLabs/sparkGLM). Please focus on the main structure and overpass minor issues/docs that I will update later. Any comments and opinions will be appreciated. cc mengxr jkbradley Author: Yanbo Liang <ybliang8@gmail.com> Closes #10639 from yanboliang/spark-9835.
* [SPARK-13031] [SQL] cleanup codegen and improve test coverageDavies Liu2016-01-289-202/+334
| | | | | | | | | | | | 1. enable whole stage codegen during tests even there is only one operator supports that. 2. split doProduce() into two APIs: upstream() and doProduce() 3. generate prefix for fresh names of each operator 4. pass UnsafeRow to parent directly (avoid getters and create UnsafeRow again) 5. fix bugs and tests. Author: Davies Liu <davies@databricks.com> Closes #10944 from davies/gen_refactor.
* [SPARK-12926][SQL] SQLContext to display warning message when non-sql ↵Tejas Patil2016-01-281-3/+11
| | | | | | | | | | configs are being set Users unknowingly try to set core Spark configs in SQLContext but later realise that it didn't work. eg. sqlContext.sql("SET spark.shuffle.memoryFraction=0.4"). This PR adds a warning message when such operations are done. Author: Tejas Patil <tejasp@fb.com> Closes #10849 from tejasapatil/SPARK-12926.
* [SPARK-12818][SQL] Specialized integral and string types for Count-min SketchCheng Lian2016-01-283-35/+99
| | | | | | | | This PR is a follow-up of #10911. It adds specialized update methods for `CountMinSketch` so that we can avoid doing internal/external row format conversion in `DataFrame.countMinSketch()`. Author: Cheng Lian <lian@databricks.com> Closes #10968 from liancheng/cms-specialized.
* Provide same info as in spark-submit --helpJames Lohse2016-01-281-2/+3
| | | | | | | | this is stated for --packages and --repositories. Without stating it for --jars, people expect a standard java classpath to work, with expansion and using a different delimiter than a comma. Currently this is only state in the --help for spark-submit "Comma-separated list of local jars to include on the driver and executor classpaths." Author: James Lohse <jimlohse@users.noreply.github.com> Closes #10890 from jimlohse/patch-1.
* [SPARK-13045] [SQL] Remove ColumnVector.Struct in favor of ColumnarBatch.RowNong Li2016-01-273-120/+32
| | | | | | | | These two classes became identical as the implementation progressed. Author: Nong Li <nong@databricks.com> Closes #10952 from nongli/spark-13045.
* [HOTFIX] Fix Scala 2.11 compilationAndrew Or2016-01-271-1/+2
| | | | | | | | | | by explicitly marking annotated parameters as vals (SI-8813). Caused by #10835. Author: Andrew Or <andrew@databricks.com> Closes #10955 from andrewor14/fix-scala211.
* [SPARK-12865][SPARK-12866][SQL] Migrate SparkSQLParser/ExtendedHiveQlParser ↵Herman van Hovell2016-01-2716-226/+161
| | | | | | | | | | | | | | | | commands to new Parser This PR moves all the functionality provided by the SparkSQLParser/ExtendedHiveQlParser to the new Parser hierarchy (SparkQl/HiveQl). This also improves the current SET command parsing: the current implementation swallows ```set role ...``` and ```set autocommit ...``` commands, this PR respects these commands (and passes them on to Hive). This PR and https://github.com/apache/spark/pull/10723 end the use of Parser-Combinator parsers for SQL parsing. As a result we can also remove the ```AbstractSQLParser``` in Catalyst. The PR is marked WIP as long as it doesn't pass all tests. cc rxin viirya winningsix (this touches https://github.com/apache/spark/pull/10144) Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #10905 from hvanhovell/SPARK-12866.
* [SPARK-12938][SQL] DataFrame API for Bloom filterWenchen Fan2016-01-277-93/+306
| | | | | | | | | | This PR integrates Bloom filter from spark-sketch into DataFrame. This version resorts to RDD.aggregate for building the filter. A more performant UDAF version can be built in future follow-up PRs. This PR also add 2 specify `put` version(`putBinary` and `putLong`) into `BloomFilter`, which makes it easier to build a Bloom filter over a `DataFrame`. Author: Wenchen Fan <wenchen@databricks.com> Closes #10937 from cloud-fan/bloom-filter.
* [SPARK-13021][CORE] Fail fast when custom RDDs violate RDD.partition's API ↵Josh Rosen2016-01-272-0/+25
| | | | | | | | | | | | | | contract Spark's `Partition` and `RDD.partitions` APIs have a contract which requires custom implementations of `RDD.partitions` to ensure that for all `x`, `rdd.partitions(x).index == x`; in other words, the `index` reported by a repartition needs to match its position in the partitions array. If a custom RDD implementation violates this contract, then Spark has the potential to become stuck in an infinite recomputation loop when recomputing a subset of an RDD's partitions, since the tasks that are actually run will not correspond to the missing output partitions that triggered the recomputation. Here's a link to a notebook which demonstrates this problem: https://rawgit.com/JoshRosen/e520fb9a64c1c97ec985/raw/5e8a5aa8d2a18910a1607f0aa4190104adda3424/Violating%2520RDD.partitions%2520contract.html In order to guard against this infinite loop behavior, this patch modifies Spark so that it fails fast and refuses to compute RDDs' whose `partitions` violate the API contract. Author: Josh Rosen <joshrosen@databricks.com> Closes #10932 from JoshRosen/SPARK-13021.