Commit message | Author | Age | Files | Lines
* Preparing Spark release v1.5.0-snapshot-20150811Patrick Wendell2015-08-1133-33/+33
|
* [SPARK-9074] [LAUNCHER] Allow arbitrary Spark args to be set.Marcelo Vanzin2015-08-113-3/+150
| | | | | | | | | | | | | | | | | This change allows any Spark argument to be added to the app to be started using SparkLauncher. Known arguments are properly validated, while unknown arguments are allowed so that the library can launch newer Spark versions (in case SPARK_HOME points at one). Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #7975 from vanzin/SPARK-9074 and squashes the following commits: b5e451a [Marcelo Vanzin] [SPARK-9074] [launcher] Allow arbitrary Spark args to be set. (cherry picked from commit 5a5bbc29961630d649d4bd4acd5d19eb537b5fd0) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
* [HOTFIX] Fix style error caused by ef961ed48a4f45447f0e0ad256b040c7ab2d78d9Andrew Or2015-08-111-1/+1
|
* Preparing development version 1.5.0-SNAPSHOTPatrick Wendell2015-08-1133-33/+33
|
* Preparing Spark release v1.5.0-snapshot-20150811Patrick Wendell2015-08-1133-33/+33
|
* [SPARK-8925] [MLLIB] Add @since tags to mllib.utilSudhakar Thota2015-08-111-1/+21
| | | | | | | | | | | | Went through the change history of MLUtils.scala and picked the version in which each change was introduced. Author: Sudhakar Thota <sudhakarthota@yahoo.com> Author: Sudhakar Thota <sudhakarthota@sudhakars-mbp-2.usca.ibm.com> Closes #7436 from sthota2014/SPARK-8925_thotas. (cherry picked from commit 017b5de07ef6cff249e984a2ab781c520249ac76) Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-9788] [MLLIB] Fix LDA Binary CompatibilityFeynman Liang2015-08-114-24/+46
| | | | | | | | | | | | | | | | | | | 1. Add `asymmetricDocConcentration` and revert docConcentration changes. If the (internal) doc concentration vector is a single value, `getDocConcentration` returns it. If it is a constant vector, `getDocConcentration` returns the first item; otherwise it fails. 2. Give `LDAModel.gammaShape` a default value in `LDAModel` concrete class constructors. jkbradley Author: Feynman Liang <fliang@databricks.com> Closes #8077 from feynmanliang/SPARK-9788 and squashes the following commits: 6b07bc8 [Feynman Liang] Code review changes 9d6a71e [Feynman Liang] Add asymmetricAlpha alias bf4e685 [Feynman Liang] Asymmetric docConcentration 4cab972 [Feynman Liang] Default gammaShape (cherry picked from commit be3e27164133025db860781bd5cdd3ca233edd21) Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
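The `getDocConcentration` rule described in SPARK-9788 can be sketched in plain Python. This is an illustration only, not the MLlib code: `get_doc_concentration` is a hypothetical helper and a plain list of floats stands in for the internal vector.

```python
def get_doc_concentration(doc_concentration):
    """Return the scalar concentration for a single-value or constant vector.

    Hypothetical helper: `doc_concentration` is a list of floats standing in
    for the internal doc concentration vector.
    """
    if not doc_concentration:
        raise ValueError("doc concentration vector is empty")
    first = doc_concentration[0]
    # Single value: return it directly.
    if len(doc_concentration) == 1:
        return first
    # Constant vector: every entry equals the first, so return the first.
    if all(value == first for value in doc_concentration):
        return first
    # Anything else is asymmetric and has no single scalar representation.
    raise ValueError("doc concentration is asymmetric; read the full vector instead")
```

Callers that need the full vector in the asymmetric case would read it through a separate accessor, which is the role the commit gives to `asymmetricDocConcentration`.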
* [SPARK-9824] [CORE] Fix the issue that InternalAccumulator leaks WeakReferencezsxwing2015-08-113-11/+16
| | | | | | | | | | | | | `InternalAccumulator.create` doesn't call `registerAccumulatorForCleanup` to register itself with ContextCleaner, so `WeakReference`s for these accumulators in `Accumulators.originals` won't be removed. This PR added `registerAccumulatorForCleanup` for internal accumulators to avoid the memory leak. Author: zsxwing <zsxwing@gmail.com> Closes #8108 from zsxwing/internal-accumulators-leak. (cherry picked from commit f16bc68dfb25c7b746ae031a57840ace9bafa87f) Signed-off-by: Reynold Xin <rxin@databricks.com>
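The leak pattern in SPARK-9824 can be shown with Python's `weakref` module. This is a sketch under assumptions, not Spark's `ContextCleaner`: `Accumulator`, `originals` and `register` are hypothetical names, and the `cleanup` flag plays the role of registering the accumulator for cleanup.

```python
import weakref


class Accumulator:
    """Hypothetical stand-in for an accumulator identified by an integer id."""

    def __init__(self, acc_id):
        self.acc_id = acc_id


# Registry of weak references, keyed by accumulator id.
originals = {}


def register(acc, cleanup=False):
    """Store a weak reference to `acc`; optionally remove the entry when it dies."""
    acc_id = acc.acc_id
    if cleanup:
        # The callback captures only the id, never the accumulator itself,
        # so it does not keep the accumulator alive.
        originals[acc_id] = weakref.ref(
            acc, lambda _ref: originals.pop(acc_id, None)
        )
    else:
        # Without a callback the dead reference stays in the registry forever.
        originals[acc_id] = weakref.ref(acc)
```

Without the callback, the registry keeps one dead entry per collected accumulator, which is the slow leak the commit describes.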
* [SPARK-9814] [SQL] EqualNotNull not passing to data sourceshyukjinkwon2015-08-113-0/+15
| | | | | | | | | | Author: hyukjinkwon <gurwls223@gmail.com> Author: 권혁진 <gurwls223@gmail.com> Closes #8096 from HyukjinKwon/master. (cherry picked from commit 00c02728a6c6c4282c389ca90641dd78dd5e3d32) Signed-off-by: Reynold Xin <rxin@databricks.com>
* [SPARK-7726] Add import so Scaladoc doesn't fail.Patrick Wendell2015-08-111-0/+3
| | | | | | | | | | | | | This is another import needed so Scala 2.11 doc generation doesn't fail. See SPARK-7726 for more detail. I tested this locally and the 2.11 install goes from failing to succeeding with this patch. Author: Patrick Wendell <patrick@databricks.com> Closes #8095 from pwendell/scaladoc. (cherry picked from commit 2a3be4ddf9d9527353f07ea0ab204ce17dbcba9a) Signed-off-by: Reynold Xin <rxin@databricks.com>
* [SPARK-9750] [MLLIB] Improve equals on SparseMatrix and DenseMatrixFeynman Liang2015-08-112-2/+24
| | | | | | | | | | | | | | | | | | | | Adds unit test for `equals` on `mllib.linalg.Matrix` class and `equals` to both `SparseMatrix` and `DenseMatrix`. Supports equality testing between `SparseMatrix` and `DenseMatrix`. mengxr Author: Feynman Liang <fliang@databricks.com> Closes #8042 from feynmanliang/SPARK-9750 and squashes the following commits: bb70d5e [Feynman Liang] Breeze compare for dense matrices as well, in case other is sparse ab6f3c8 [Feynman Liang] Sparse matrix compare for equals 22782df [Feynman Liang] Add equality based on matrix semantics, not representation 78f9426 [Feynman Liang] Add casts 43d28fa [Feynman Liang] Fix failing test 6416fa0 [Feynman Liang] Add failing sparse matrix equals tests (cherry picked from commit 520ad44b17f72e6465bf990f64b4e289f8a83447) Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
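Equality "based on matrix semantics, not representation" can be sketched as follows. The classes below are hypothetical and much simpler than `mllib.linalg`; the dense matrix is assumed to be stored column-major and the sparse one as a dict of coordinates.

```python
class Matrix:
    """Base class: two matrices are equal when shape and non-zero cells match."""

    def __init__(self, num_rows, num_cols):
        self.num_rows = num_rows
        self.num_cols = num_cols

    def nonzeros(self):
        raise NotImplementedError

    def __eq__(self, other):
        if not isinstance(other, Matrix):
            return NotImplemented
        same_shape = (self.num_rows, self.num_cols) == (other.num_rows, other.num_cols)
        return same_shape and self.nonzeros() == other.nonzeros()


class DenseMatrix(Matrix):
    def __init__(self, num_rows, num_cols, values):
        super().__init__(num_rows, num_cols)
        self.values = list(values)  # column-major storage (assumption)

    def nonzeros(self):
        # Map flat column-major index k to (row, col).
        return {
            (k % self.num_rows, k // self.num_rows): v
            for k, v in enumerate(self.values)
            if v != 0
        }


class SparseMatrix(Matrix):
    def __init__(self, num_rows, num_cols, entries):
        super().__init__(num_rows, num_cols)
        self.entries = dict(entries)  # {(row, col): value}

    def nonzeros(self):
        # Explicitly stored zeros are ignored so they do not affect equality.
        return {pos: v for pos, v in self.entries.items() if v != 0}
```

Comparing through a shared canonical form keeps `dense == sparse` and `sparse == dense` symmetric.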
* [SPARK-9646] [SQL] Add metrics for all join and aggregate operatorszsxwing2015-08-1127-107/+847
| | | | | | | | | | | | | | | | | | | | | | This PR added metrics for all join and aggregate operators. However, the metrics may be confusing in the following two cases: 1. If the iterator is not fully consumed, the metric values will be lower than expected. 2. Recreating the iterators will make metric values look bigger than the size of the input source, such as `CartesianProduct`. Author: zsxwing <zsxwing@gmail.com> Closes #8060 from zsxwing/sql-metrics and squashes the following commits: 40f3fc1 [zsxwing] Mark LongSQLMetric private[metric] to avoid using incorrectly and leak memory b1b9071 [zsxwing] Merge branch 'master' into sql-metrics 4bef25a [zsxwing] Add metrics for SortMergeOuterJoin 95ccfc6 [zsxwing] Merge branch 'master' into sql-metrics 67cb4dd [zsxwing] Add metrics for Project and TungstenProject; remove metrics from PhysicalRDD and LocalTableScan 0eb47d4 [zsxwing] Merge branch 'master' into sql-metrics dd9d932 [zsxwing] Avoid creating new Iterators 589ea26 [zsxwing] Add metrics for all join and aggregate operators (cherry picked from commit 5831294a7a8fa2524133c5d718cbc8187d2b0620) Signed-off-by: Yin Huai <yhuai@databricks.com>
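Both caveats follow from counting rows as they pass through an iterator. A minimal sketch, assuming a hypothetical `RowMetric` counter and `counted` wrapper (neither is a Spark API):

```python
class RowMetric:
    """Hypothetical counter standing in for a per-operator row metric."""

    def __init__(self):
        self.value = 0


def counted(rows, metric):
    """Yield rows unchanged, incrementing `metric` once per row produced."""
    for row in rows:
        metric.value += 1
        yield row


# Case 1: the consumer stops early, so the metric undercounts the input.
partial = RowMetric()
iterator = counted(range(10), partial)
first_three = [next(iterator) for _ in range(3)]

# Case 2: the same input is iterated twice, so the metric exceeds its size.
repeated = RowMetric()
for _ in range(2):
    list(counted(range(10), repeated))
```

Here `partial.value` ends at 3 and `repeated.value` at 20 for a 10-row input, which mirrors the two cases listed in the commit message.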
* [SPARK-9572] [STREAMING] [PYSPARK] Added ↵Tathagata Das2015-08-113-15/+177
| | | | | | | | | | | | | | | | | | | | | | | | | | StreamingContext.getActiveOrCreate() in Python Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #8080 from tdas/SPARK-9572 and squashes the following commits: 64a231d [Tathagata Das] Fix based on comments 741a0d0 [Tathagata Das] Fixed style f4f094c [Tathagata Das] Tweaked test 9afcdbe [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into SPARK-9572 e21488d [Tathagata Das] Minor update 1a371d9 [Tathagata Das] Addressed comments. 60479da [Tathagata Das] Fixed indent 9c2da9c [Tathagata Das] Fixed bugs b5bd32c [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into SPARK-9572 b55b348 [Tathagata Das] Removed prints 5781728 [Tathagata Das] Fix style issues b711214 [Tathagata Das] Reverted run-tests.py 643b59d [Tathagata Das] Revert unnecessary change 150e58c [Tathagata Das] Added StreamingContext.getActiveOrCreate() in Python (cherry picked from commit 5b8bb1b213b8738f563fcd00747604410fbb3087) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
* Fix comment errorJeff Zhang2015-08-111-1/+1
| | | | | | | | | | | API is updated but its doc comment is not updated. Author: Jeff Zhang <zjffdu@apache.org> Closes #8097 from zjffdu/dev. (cherry picked from commit bce72797f3499f14455722600b0d0898d4fd87c9) Signed-off-by: Reynold Xin <rxin@databricks.com>
* [SPARK-9785] [SQL] HashPartitioning compatibility should consider expression ↵Josh Rosen2015-08-112-10/+60
| | | | | | | | | | | | | | | | | | | ordering HashPartitioning compatibility is currently defined w.r.t the _set_ of expressions, but the ordering of those expressions matters when computing hash codes; this could lead to incorrect answers if we mistakenly avoided a shuffle based on the assumption that HashPartitionings with the same expressions in different orders will produce equivalent row hashcodes. The first commit adds a regression test which illustrates this problem. The fix for this is simple: make `HashPartitioning.compatibleWith` and `HashPartitioning.guarantees` sensitive to the expression ordering (i.e. do not perform set comparison). Author: Josh Rosen <joshrosen@databricks.com> Closes #8074 from JoshRosen/hashpartitioning-compatiblewith-fixes and squashes the following commits: b61412f [Josh Rosen] Demonstrate that I haven't cheated in my fix 0b4d7d9 [Josh Rosen] Update so that clusteringSet is only used in satisfies(). dc9c9d7 [Josh Rosen] Add failing regression test for SPARK-9785 (cherry picked from commit dfe347d2cae3eb05d7539aaf72db3d309e711213) Signed-off-by: Yin Huai <yhuai@databricks.com>
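Why expression order matters can be seen with a toy hash. The sketch below uses a hypothetical polynomial row hash, not Spark's actual hashing, and dict-based rows with integer columns.

```python
def row_hash(row, expressions):
    """Combine column values in the given order (toy polynomial hash)."""
    h = 0
    for name in expressions:
        h = 31 * h + row[name]
    return h


def partition_for(row, expressions, num_partitions):
    """Pick the partition for `row` under a hash partitioning on `expressions`."""
    return row_hash(row, expressions) % num_partitions


row = {"a": 1, "b": 2}
by_a_then_b = partition_for(row, ["a", "b"], 4)  # hash 33 -> partition 1
by_b_then_a = partition_for(row, ["b", "a"], 4)  # hash 63 -> partition 3
```

The two partitionings use the same set of expressions but send the same row to different partitions, so treating them as compatible and skipping a shuffle would pair up the wrong rows.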
* [SPARK-9815] Rename PlatformDependent.UNSAFE -> Platform.Reynold Xin2015-08-1139-499/+371
| | | | | | | | | | | | | PlatformDependent.UNSAFE is way too verbose. Author: Reynold Xin <rxin@databricks.com> Closes #8094 from rxin/SPARK-9815 and squashes the following commits: 229b603 [Reynold Xin] [SPARK-9815] Rename PlatformDependent.UNSAFE -> Platform. (cherry picked from commit d378396f86f625f006738d87fe5dbc2ff8fd913d) Signed-off-by: Davies Liu <davies.liu@gmail.com>
* [SPARK-9727] [STREAMING] [BUILD] Updated streaming kinesis SBT project name ↵Tathagata Das2015-08-113-5/+5
| | | | | | | | | | | | | to be more consistent Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #8092 from tdas/SPARK-9727 and squashes the following commits: b1b01fd [Tathagata Das] Updated streaming kinesis project name (cherry picked from commit 600031ebe27473d8fffe6ea436c2149223b82896) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
* [SPARK-9640] [STREAMING] [TEST] Do not run Python Kinesis tests when the ↵Tathagata Das2015-08-101-12/+44
| | | | | | | | | | | | | | | | | | | | | | | Kinesis assembly JAR has not been generated Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #7961 from tdas/SPARK-9640 and squashes the following commits: 974ce19 [Tathagata Das] Undo changes related to SPARK-9727 004ae26 [Tathagata Das] style fixes 9bbb97d [Tathagata Das] Minor style fies e6a677e [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into SPARK-9640 ca90719 [Tathagata Das] Removed extra line ba9cfc7 [Tathagata Das] Improved kinesis test selection logic 88d59bd [Tathagata Das] updated test modules 871fcc8 [Tathagata Das] Fixed SparkBuild 94be631 [Tathagata Das] Fixed style b858196 [Tathagata Das] Fixed conditions and few other things based on PR comments. e292e64 [Tathagata Das] Added filters for Kinesis python tests (cherry picked from commit 0f90d6055e5bea9ceb1d454db84f4aa1d59b284d) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
* [SPARK-9729] [SPARK-9363] [SQL] Use sort merge join for left and right outer ↵Josh Rosen2015-08-1013-319/+1165
| | | | | | | | | | | | | | | | | | | | | | | | | | | | join This patch adds a new `SortMergeOuterJoin` operator that performs left and right outer joins using sort merge join. It also refactors `SortMergeJoin` in order to improve performance and code clarity. Along the way, I also performed a couple of pieces of minor cleanup and optimization: - Rename the `HashJoin` physical planner rule to `EquiJoinSelection`, since it's also used for non-hash joins. - Rewrite the comment at the top of `HashJoin` to better explain the precedence for choosing join operators. - Update `JoinSuite` to use `SqlTestUtils.withConf` for changing SQLConf settings. This patch incorporates several ideas from adrian-wang's patch, #5717. Closes #5717. Author: Josh Rosen <joshrosen@databricks.com> Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #7904 from JoshRosen/outer-join-smj and squashes 1 commits. (cherry picked from commit 91e9389f39509e63654bd4bcb7bd919eaedda910) Signed-off-by: Reynold Xin <rxin@databricks.com>
* [SPARK-9340] [SQL] Fixes converting unannotated Parquet listsDamian Guy2015-08-1114-33/+247
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This PR is inspired by #8063 authored by dguy. In particular, the Parquet test files added here are all taken from that PR. **Committer who merges this PR should attribute it to "Damian Guy <damian.guygmail.com>".** ---- SPARK-6776 and SPARK-6777 followed `parquet-avro` to implement backwards-compatibility rules defined in `parquet-format` spec. However, both Spark SQL and `parquet-avro` neglected the following statement in `parquet-format`: > This does not affect repeated fields that are not annotated: A repeated field that is neither contained by a `LIST`- or `MAP`-annotated group nor annotated by `LIST` or `MAP` should be interpreted as a required list of required elements where the element type is the type of the field. One consequence is that Parquet files generated by `parquet-protobuf` containing unannotated repeated fields are not correctly converted to Catalyst arrays. This PR fixes this issue by 1. Handling unannotated repeated fields in `CatalystSchemaConverter`. 2. Converting these special repeated fields to Catalyst arrays in `CatalystRowConverter`. Two special converters, `RepeatedPrimitiveConverter` and `RepeatedGroupConverter`, are added. They delegate actual conversion work to a child `elementConverter` and accumulate elements in an `ArrayBuffer`. Two extra methods, `start()` and `end()`, are added to `ParentContainerUpdater` so that they can be used to initialize new `ArrayBuffer`s for unannotated repeated fields and to propagate converted array values upstream. 
Author: Cheng Lian <lian@databricks.com> Closes #8070 from liancheng/spark-9340/unannotated-parquet-list and squashes the following commits: ace6df7 [Cheng Lian] Moves ParquetProtobufCompatibilitySuite f1c7bfd [Cheng Lian] Updates .rat-excludes 420ad2b [Cheng Lian] Fixes converting unannotated Parquet lists (cherry picked from commit 071bbad5db1096a548c886762b611a8484a52753) Signed-off-by: Cheng Lian <lian@databricks.com>
* [SPARK-9801] [STREAMING] Check if file exists before deleting temporary files.Hao Zhu2015-08-101-2/+6
| | | | | | | | | | | | | | | Spark streaming deletes the temp file and backup files without checking if they exist or not Author: Hao Zhu <viadeazhu@gmail.com> Closes #8082 from viadea/master and squashes the following commits: 242d05f [Hao Zhu] [SPARK-9801][Streaming]No need to check the existence of those files fd143f2 [Hao Zhu] [SPARK-9801][Streaming]Check if backupFile exists before deleting backupFile files. 087daf0 [Hao Zhu] SPARK-9801 (cherry picked from commit 3c9802d9400bea802984456683b2736a450ee17e) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
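The guard described in SPARK-9801 amounts to an existence check before the delete. A minimal Python sketch, assuming a hypothetical `delete_if_exists` helper on the local file system (the Spark change itself works against its own file system API):

```python
import os


def delete_if_exists(path):
    """Delete `path` only if it is present; return True when a file was removed."""
    if os.path.exists(path):
        os.remove(path)
        return True
    # Nothing to delete: skip the call instead of raising on a missing file.
    return False
```

The check is not atomic, so a file removed by another process between the check and the delete would still raise; a caller that cares would also catch `FileNotFoundError`.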
* [SPARK-5155] [PYSPARK] [STREAMING] Mqtt streaming support in PythonPrabeesh K2015-08-1014-109/+565
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This PR is based on #4229, thanks prabeesh. Closes #4229 Author: Prabeesh K <prabsmails@gmail.com> Author: zsxwing <zsxwing@gmail.com> Author: prabs <prabsmails@gmail.com> Author: Prabeesh K <prabeesh.k@namshi.com> Closes #7833 from zsxwing/pr4229 and squashes the following commits: 9570bec [zsxwing] Fix the variable name and check null in finally 4a9c79e [zsxwing] Fix pom.xml indentation abf5f18 [zsxwing] Merge branch 'master' into pr4229 935615c [zsxwing] Fix the flaky MQTT tests 47278c5 [zsxwing] Include the project class files 478f844 [zsxwing] Add unpack 5f8a1d4 [zsxwing] Make the maven build generate the test jar for Python MQTT tests 734db99 [zsxwing] Merge branch 'master' into pr4229 126608a [Prabeesh K] address the comments b90b709 [Prabeesh K] Merge pull request #1 from zsxwing/pr4229 d07f454 [zsxwing] Register StreamingListerner before starting StreamingContext; Revert unncessary changes; fix the python unit test a6747cb [Prabeesh K] wait for starting the receiver before publishing data 87fc677 [Prabeesh K] address the comments: 97244ec [zsxwing] Make sbt build the assembly test jar for streaming mqtt 80474d1 [Prabeesh K] fix 1f0cfe9 [Prabeesh K] python style fix e1ee016 [Prabeesh K] scala style fix a5a8f9f [Prabeesh K] added Python test 9767d82 [Prabeesh K] implemented Python-friendly class a11968b [Prabeesh K] fixed python style 795ec27 [Prabeesh K] address comments ee387ae [Prabeesh K] Fix assembly jar location of mqtt-assembly 3f4df12 [Prabeesh K] updated version b34c3c1 [prabs] adress comments 3aa7fff [prabs] Added Python streaming mqtt word count example b7d42ff [prabs] Mqtt streaming support in Python (cherry picked from commit 853809e948e7c5092643587a30738115b6591a59) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
* [SPARK-9737] [YARN] Add the suggested configuration when required executor ↵Yadong Qi2015-08-101-2/+4
| | | | | | | | | | | | | memory is above the max threshold of this cluster on YARN mode Author: Yadong Qi <qiyadong2010@gmail.com> Closes #8028 from watermen/SPARK-9737 and squashes the following commits: 48bdf3d [Yadong Qi] Add suggested configuration. (cherry picked from commit 86fa4ba6d13f909cb508b7cb3b153d586fe59bc3) Signed-off-by: Reynold Xin <rxin@databricks.com>
* Preparing development version 1.5.0-SNAPSHOTPatrick Wendell2015-08-1032-32/+32
|
* Preparing Spark release v1.5.0-snapshot-20150810Patrick Wendell2015-08-1032-32/+32
|
* Preparing development version 1.5.0-SNAPSHOTPatrick Wendell2015-08-1032-32/+32
|
* Preparing Spark release v1.5.0-snapshot-20150810Patrick Wendell2015-08-1032-32/+32
|
* [SPARK-9759] [SQL] improve decimal.times() and cast(int, decimalType)Davies Liu2015-08-102-32/+22
| | | | | | | | | | | | | | | | | | | This patch optimizes two things: 1. Passing MathContext to JavaBigDecimal.multiply/divide/remainder to do the right rounding, because java.math.BigDecimal.apply(MathContext) is expensive. 2. Casting integer/short/byte to decimal directly (without going through double). These two optimizations could speed up the end-to-end time of an aggregation (SUM(short * decimal(5, 2))) by 75% (from 19s -> 10.8s). Author: Davies Liu <davies@databricks.com> Closes #8052 from davies/optimize_decimal and squashes the following commits: 225efad [Davies Liu] improve decimal.times() and cast(int, decimalType) (cherry picked from commit c4fd2a242228ee101904770446e3f37d49e39b76) Signed-off-by: Reynold Xin <rxin@databricks.com>
* [SPARK-9620] [SQL] generated UnsafeProjection should support many columns or ↵Davies Liu2015-08-106-142/+207
| | | | | | | | | | | | | | | | | | | | | | | | | | | large expressions Currently, generated UnsafeProjection can hit the 64k bytecode limit of Java. This patch splits the generated expressions into multiple functions to avoid the limitation. After this patch, we can work well with tables that have up to 64k columns (at which point we hit the max number of constants limit in Java), which should be enough in practice. cc rxin Author: Davies Liu <davies@databricks.com> Closes #8044 from davies/wider_table and squashes the following commits: 9192e6c [Davies Liu] fix generated safe projection d1ef81a [Davies Liu] fix failed tests 737b3d3 [Davies Liu] Merge branch 'master' of github.com:apache/spark into wider_table ffcd132 [Davies Liu] address comments 1b95be4 [Davies Liu] put the generated class into sql package 77ed72d [Davies Liu] address comments 4518e17 [Davies Liu] Merge branch 'master' of github.com:apache/spark into wider_table 75ccd01 [Davies Liu] Merge branch 'master' of github.com:apache/spark into wider_table 495e932 [Davies Liu] support wider table with more than 1k columns for generated projections (cherry picked from commit fe2fb7fb7189d183a4273ad27514af4b6b461f26) Signed-off-by: Reynold Xin <rxin@databricks.com>
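The splitting idea can be sketched with a toy code generator. This is not Spark's codegen: `generate_projection`, `compile_projection` and the `chunk_size` parameter are hypothetical, and Python source stands in for the generated Java.

```python
def generate_projection(expressions, chunk_size=2):
    """Emit source where each helper evaluates at most `chunk_size` expressions."""
    lines = []
    helper_names = []
    for start in range(0, len(expressions), chunk_size):
        name = f"apply_{start // chunk_size}"
        helper_names.append(name)
        lines.append(f"def {name}(row, out):")
        chunk = expressions[start:start + chunk_size]
        for index, expr in enumerate(chunk, start=start):
            lines.append(f"    out[{index}] = {expr}")
    # The entry point only dispatches, so its size grows with the number of
    # helpers rather than with the total size of all expressions.
    lines.append("def apply(row):")
    lines.append(f"    out = [None] * {len(expressions)}")
    for name in helper_names:
        lines.append(f"    {name}(row, out)")
    lines.append("    return out")
    return "\n".join(lines)


def compile_projection(expressions, chunk_size=2):
    """Compile the generated source and return its `apply` function."""
    namespace = {}
    exec(generate_projection(expressions, chunk_size), namespace)
    return namespace["apply"]


project = compile_projection(
    ["row[0] + 1", "row[1] * 2", "row[0] - row[1]", "row[1]"], chunk_size=2
)
```

Calling `project([3, 4])` evaluates the four expressions through two helper functions; no single generated function has to contain all of them.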
* [SPARK-9763][SQL] Minimize exposure of internal SQL classes.Reynold Xin2015-08-1076-966/+1114
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are a few changes in this pull request: 1. Moved all data sources to execution.datasources, except the public JDBC APIs. 2. In order to maintain backward compatibility from 1, added a backward compatibility translation map in data source resolution. 3. Moved ui and metric package into execution. 4. Added more documentation on some internal classes. 5. Renamed DataSourceRegister.format -> shortName. 6. Added "override" modifier on shortName. 7. Removed IntSQLMetric. Author: Reynold Xin <rxin@databricks.com> Closes #8056 from rxin/SPARK-9763 and squashes the following commits: 9df4801 [Reynold Xin] Removed hardcoded name in test cases. d9babc6 [Reynold Xin] Shorten. e484419 [Reynold Xin] Removed VisibleForTesting. 171b812 [Reynold Xin] MimaExcludes. 2041389 [Reynold Xin] Compile ... 79dda42 [Reynold Xin] Compile. 0818ba3 [Reynold Xin] Removed IntSQLMetric. c46884f [Reynold Xin] Two more fixes. f9aa88d [Reynold Xin] [SPARK-9763][SQL] Minimize exposure of internal SQL classes. (cherry picked from commit 40ed2af587cedadc6e5249031857a922b3b234ca) Signed-off-by: Reynold Xin <rxin@databricks.com>
* [SPARK-9784] [SQL] Exchange.isUnsafe should check whether codegen and unsafe ↵Josh Rosen2015-08-101-1/+1
| | | | | | | | | | | | | | | are enabled Exchange.isUnsafe should check whether codegen and unsafe are enabled. Author: Josh Rosen <joshrosen@databricks.com> Closes #8073 from JoshRosen/SPARK-9784 and squashes the following commits: 7a1019f [Josh Rosen] [SPARK-9784] Exchange.isUnsafe should check whether codegen and unsafe are enabled (cherry picked from commit 0fe66744f16854fc8cd8a72174de93a788e3cf6c) Signed-off-by: Josh Rosen <joshrosen@databricks.com>
* Fixed AtomicReference<> ExampleMahmoud Lababidi2015-08-101-1/+1
| | | | | | | | | | | Author: Mahmoud Lababidi <lababidi@gmail.com> Closes #8076 from lababidi/master and squashes the following commits: af4553b [Mahmoud Lababidi] Fixed AtomicReference<> Example (cherry picked from commit d285212756168200383bf4df2c951bd80a492a7c) Signed-off-by: Reynold Xin <rxin@databricks.com>
* [SPARK-9755] [MLLIB] Add docs to MultivariateOnlineSummarizer methodsFeynman Liang2015-08-101-0/+16
| | | | | | | | | | | | | | | Adds method documentation back to `MultivariateOnlineSummarizer`, which was present in 1.4 but disappeared somewhere along the way to 1.5. jkbradley Author: Feynman Liang <fliang@databricks.com> Closes #8045 from feynmanliang/SPARK-9755 and squashes the following commits: af67fde [Feynman Liang] Add MultivariateOnlineSummarizer docs (cherry picked from commit 00b655cced637e1c3b750c19266086b9dcd7c158) Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
* [SPARK-9743] [SQL] Fixes JSONRelation refreshingCheng Lian2015-08-104-12/+21
| | | | | | | | | | | | | | | | | | | | PR #7696 added two `HadoopFsRelation.refresh()` calls ([this] [1], and [this] [2]) in `DataSourceStrategy` to make test case `InsertSuite.save directly to the path of a JSON table` pass. However, this forces every `HadoopFsRelation` table scan to do a refresh, which can be super expensive for tables with a large number of partitions. The reason why the original test case fails without the `refresh()` calls is that the old JSON relation builds the base RDD with the input paths, while `HadoopFsRelation` provides `FileStatus`es of leaf files. With the old JSON relation, we can create a temporary table based on a path, write data to it, and then read the newly written data without refreshing the table. This is no longer true for `HadoopFsRelation`. This PR removes those two expensive refresh calls, and moves the refresh into `JSONRelation` to fix this issue. We might want to update `HadoopFsRelation` interface to provide better support for this use case. [1]: https://github.com/apache/spark/blob/ebfd91c542aaead343cb154277fcf9114382fee7/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L63 [2]: https://github.com/apache/spark/blob/ebfd91c542aaead343cb154277fcf9114382fee7/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L91 Author: Cheng Lian <lian@databricks.com> Closes #8035 from liancheng/spark-9743/fix-json-relation-refreshing and squashes the following commits: ec1957d [Cheng Lian] Fixes JSONRelation refreshing (cherry picked from commit e3fef0f9e17b1766a3869cb80ce7e4cd521cb7b6) Signed-off-by: Yin Huai <yhuai@databricks.com>
* [SPARK-9777] [SQL] Window operator can accept UnsafeRowsYin Huai2015-08-091-0/+2
| | | | | | | | | | | | | https://issues.apache.org/jira/browse/SPARK-9777 Author: Yin Huai <yhuai@databricks.com> Closes #8064 from yhuai/windowUnsafe and squashes the following commits: 8fb3537 [Yin Huai] Set canProcessUnsafeRows to true. (cherry picked from commit be80def0d07ed0f45d60453f4f82500d8c4c9106) Signed-off-by: Reynold Xin <rxin@databricks.com>
* [CORE] [SPARK-9760] Use Option instead of Some for Ivy reposShivaram Venkataraman2015-08-091-1/+1
| | | | | | | | | | | | | | | | | This was introduced in #7599 cc rxin brkyvz Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu> Closes #8055 from shivaram/spark-packages-repo-fix and squashes the following commits: 890f306 [Shivaram Venkataraman] Remove test case 51d69ee [Shivaram Venkataraman] Add test case for --packages without --repository c02e0b4 [Shivaram Venkataraman] Use Option instead of Some for Ivy repos (cherry picked from commit 46025616b414eaf1da01fcc1255d8041ea1554bc) Signed-off-by: Reynold Xin <rxin@databricks.com>
* [SPARK-9703] [SQL] Refactor EnsureRequirements to avoid certain unnecessary ↵Josh Rosen2015-08-095-65/+328
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | shuffles This pull request refactors the `EnsureRequirements` planning rule in order to avoid the addition of certain unnecessary shuffles. As an example of how unnecessary shuffles can occur, consider SortMergeJoin, which requires clustered distribution and sorted ordering of its children's input rows. Say that both of SMJ's children produce unsorted output but are both SinglePartition. In this case, we will need to inject sort operators but should not need to inject Exchanges. Unfortunately, it looks like the EnsureRequirements unnecessarily repartitions using a hash partitioning. This patch solves this problem by refactoring `EnsureRequirements` to properly implement the `compatibleWith` checks that were broken in earlier implementations. See the significant inline comments for a better description of how this works. The majority of this PR is new comments and test cases, with few actual changes to the code. Author: Josh Rosen <joshrosen@databricks.com> Closes #7988 from JoshRosen/exchange-fixes and squashes the following commits: 38006e7 [Josh Rosen] Rewrite EnsureRequirements _yet again_ to make things even simpler 0983f75 [Josh Rosen] More guarantees vs. compatibleWith cleanup; delete BroadcastPartitioning. 8784bd9 [Josh Rosen] Giant comment explaining compatibleWith vs. guarantees 1307c50 [Josh Rosen] Update conditions for requiring child compatibility. 18cddeb [Josh Rosen] Rename DummyPlan to DummySparkPlan. 2c7e126 [Josh Rosen] Merge remote-tracking branch 'origin/master' into exchange-fixes fee65c4 [Josh Rosen] Further refinement to comments / reasoning 642b0bb [Josh Rosen] Further expand comment / reasoning 06aba0c [Josh Rosen] Add more comments 8dbc845 [Josh Rosen] Add even more tests. 
4f08278 [Josh Rosen] Fix the test by adding the compatibility check to EnsureRequirements a1c12b9 [Josh Rosen] Add failing test to demonstrate allCompatible bug 0725a34 [Josh Rosen] Small assertion cleanup. 5172ac5 [Josh Rosen] Add test for requiresChildrenToProduceSameNumberOfPartitions. 2e0f33a [Josh Rosen] Write a more generic test for EnsureRequirements. 752b8de [Josh Rosen] style fix c628daf [Josh Rosen] Revert accidental ExchangeSuite change. c9fb231 [Josh Rosen] Rewrite exchange to fix better handle this case. adcc742 [Josh Rosen] Move test to PlannerSuite. 0675956 [Josh Rosen] Preserving ordering and partitioning in row format converters also does not help. cc5669c [Josh Rosen] Adding outputPartitioning to Repartition does not fix the test. 2dfc648 [Josh Rosen] Add failing test illustrating bad exchange planning. (cherry picked from commit 23cf5af08d98da771c41571c00a2f5cafedfebdd) Signed-off-by: Yin Huai <yhuai@databricks.com>
* [SPARK-8930] [SQL] Throw an AnalysisException with meaningful messages if ↵Yijie Shen2015-08-093-2/+21
| | | | | | | | | | | | | | DataFrame#explode takes a star in expressions Author: Yijie Shen <henry.yijieshen@gmail.com> Closes #8057 from yjshen/explode_star and squashes the following commits: eae181d [Yijie Shen] change explaination message 54c9d11 [Yijie Shen] meaning message for * in explode (cherry picked from commit 68ccc6e184598822b19a880fdd4597b66a1c2d92) Signed-off-by: Yin Huai <yhuai@databricks.com>
* [SPARK-9752][SQL] Support UnsafeRow in Sample operator.Reynold Xin2015-08-094-27/+61
| | | | | | | | | | | | | | | | | | In order for this to work, I had to disable gap sampling. Author: Reynold Xin <rxin@databricks.com> Closes #8040 from rxin/SPARK-9752 and squashes the following commits: f9e248c [Reynold Xin] Fix the test case for real this time. adbccb3 [Reynold Xin] Fixed test case. 589fb23 [Reynold Xin] Merge branch 'SPARK-9752' of github.com:rxin/spark into SPARK-9752 55ccddc [Reynold Xin] Fixed core test. 78fa895 [Reynold Xin] [SPARK-9752][SQL] Support UnsafeRow in Sample operator. c9e7112 [Reynold Xin] [SPARK-9752][SQL] Support UnsafeRow in Sample operator. (cherry picked from commit e9c36938ba972b6fe3c9f6228508e3c9f1c876b2) Signed-off-by: Reynold Xin <rxin@databricks.com>
* [SPARK-6212] [SQL] The EXPLAIN output of CTAS only shows the analyzed planYijie Shen2015-08-083-3/+38
| | | | | | | | | | | | | | | | JIRA: https://issues.apache.org/jira/browse/SPARK-6212 Author: Yijie Shen <henry.yijieshen@gmail.com> Closes #7986 from yjshen/ctas_explain and squashes the following commits: bb6fee5 [Yijie Shen] refine test f731041 [Yijie Shen] address comment b2cf8ab [Yijie Shen] bug fix bd7eb20 [Yijie Shen] ctas explain (cherry picked from commit 3ca995b78f373251081f6877623649bfba3040b2) Signed-off-by: Yin Huai <yhuai@databricks.com>
* [MINOR] inaccurate comments for showString()CodingCat2015-08-081-1/+1
| | | | | | | | | | | Author: CodingCat <zhunansjtu@gmail.com> Closes #8050 from CodingCat/minor and squashes the following commits: 5bc4b89 [CodingCat] inaccurate comments (cherry picked from commit 25c363e93bc79119c5ba5c228fcad620061cff62) Signed-off-by: Reynold Xin <rxin@databricks.com>
* [SPARK-9486][SQL] Add data source aliasing for external packagesJoseph Batchik2015-08-0811-30/+156
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Users currently have to provide the full class name for external data sources, like: `sqlContext.read.format("com.databricks.spark.avro").load(path)` This allows external data source packages to register themselves using a Service Loader so that they can add custom alias like: `sqlContext.read.format("avro").load(path)` This makes it so that using external data source packages uses the same format as the internal data sources like parquet, json, etc. Author: Joseph Batchik <joseph.batchik@cloudera.com> Author: Joseph Batchik <josephbatchik@gmail.com> Closes #7802 from JDrit/service_loader and squashes the following commits: 49a01ec [Joseph Batchik] fixed a couple of format / error bugs e5e93b2 [Joseph Batchik] modified rat file to only excluded added services 72b349a [Joseph Batchik] fixed error with orc data source actually 9f93ea7 [Joseph Batchik] fixed error with orc data source 87b7f1c [Joseph Batchik] fixed typo 101cd22 [Joseph Batchik] removing unneeded changes 8f3cf43 [Joseph Batchik] merged in changes b63d337 [Joseph Batchik] merged in master 95ae030 [Joseph Batchik] changed the new trait to be used as a mixin for data source to register themselves 74db85e [Joseph Batchik] reformatted class loader ac2270d [Joseph Batchik] removing some added test a6926db [Joseph Batchik] added test cases for data source loader 208a2a8 [Joseph Batchik] changes to do error catching if there are multiple data sources 946186e [Joseph Batchik] started working on service loader (cherry picked from commit a3aec918bed22f8e33cf91dc0d6e712e6653c7d2) Signed-off-by: Reynold Xin <rxin@databricks.com>
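The aliasing scheme can be sketched as a small registry that maps short names to provider names and falls back to the literal string. This is an illustration only: `DataSourceRegistry` and its methods are hypothetical, and the real mechanism relies on a Java Service Loader rather than explicit `register` calls.

```python
class DataSourceRegistry:
    """Hypothetical registry mapping short aliases to data source names."""

    def __init__(self):
        self._aliases = {}

    def register(self, short_name, provider):
        """Record that `provider` answers to `short_name`."""
        self._aliases.setdefault(short_name, []).append(provider)

    def resolve(self, fmt):
        """Translate an alias to its provider, or treat `fmt` as a full name."""
        providers = self._aliases.get(fmt, [])
        if len(providers) == 1:
            return providers[0]
        if len(providers) > 1:
            # Two packages claiming the same alias is ambiguous, so fail loudly.
            raise ValueError(f"multiple sources registered for {fmt!r}: {providers}")
        return fmt


registry = DataSourceRegistry()
registry.register("avro", "com.databricks.spark.avro")
```

With this registration, `registry.resolve("avro")` and `registry.resolve("com.databricks.spark.avro")` name the same source, so existing code that passes the full name keeps working.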
* [SPARK-9728][SQL]Support CalendarIntervalType in HiveQLYijie Shen2015-08-085-0/+309
| | | | | | | | | | | | | | | | This PR enables converting interval term in HiveQL to CalendarInterval Literal. JIRA: https://issues.apache.org/jira/browse/SPARK-9728 Author: Yijie Shen <henry.yijieshen@gmail.com> Closes #8034 from yjshen/interval_hiveql and squashes the following commits: 7fe9a5e [Yijie Shen] declare throw exception and add unit test fce7795 [Yijie Shen] convert hiveql interval term into CalendarInterval literal (cherry picked from commit 23695f1d2d7ef9f3ea92cebcd96b1cf0e8904eb4) Signed-off-by: Reynold Xin <rxin@databricks.com>
* [SPARK-6902] [SQL] [PYSPARK] Row should be read-onlyDavies Liu2015-08-082-0/+20
| | | | | | | | | | | | | | Raise a read-only exception when a user tries to mutate a Row. Author: Davies Liu <davies@databricks.com> Closes #8009 from davies/readonly_row and squashes the following commits: 8722f3f [Davies Liu] add tests 05a3d36 [Davies Liu] Row should be read-only (cherry picked from commit ac507a03c3371cd5404ca195ee0ba0306badfc23) Signed-off-by: Davies Liu <davies.liu@gmail.com>
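A minimal Python sketch of the read-only behaviour in SPARK-6902, under stated assumptions: `ReadOnlyRecord` is a hypothetical stand-in written for this note, not PySpark's `Row`, and the exception type and message are illustrative choices. The idea shown is a tuple subclass that exposes fields as attributes and rejects attribute assignment.

```python
class ReadOnlyRecord(tuple):
    """Hypothetical stand-in for a Row-like record; not PySpark's Row."""

    def __new__(cls, **fields):
        names = sorted(fields)
        record = super().__new__(cls, (fields[n] for n in names))
        # Bypass our own __setattr__ once to remember the field names.
        object.__setattr__(record, "_names", tuple(names))
        return record

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        names = object.__getattribute__(self, "_names")
        if name in names:
            return self[names.index(name)]
        raise AttributeError(name)

    def __setattr__(self, name, value):
        # Any attribute assignment after construction is rejected.
        raise AttributeError("record is read-only")


person = ReadOnlyRecord(name="Alice", age=11)
try:
    person.name = "Bob"
except AttributeError as err:
    print("rejected:", err)
```

Item assignment such as `person[0] = 1` is already rejected by `tuple` itself, so only attribute assignment needs the explicit guard.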
* [SPARK-4561] [PYSPARK] [SQL] turn Row into dict recursivelyDavies Liu2015-08-081-2/+25
| | | | | | | | | | | | | Add an option `recursive` to `Row.asDict()`; when True (the default is False), nested Rows are converted into dicts as well. Author: Davies Liu <davies@databricks.com> Closes #8006 from davies/as_dict and squashes the following commits: 922cc5a [Davies Liu] turn Row into dict recursively (cherry picked from commit 74a6541aa82bcd7a052b2e57b5ca55b7c316495b) Signed-off-by: Davies Liu <davies.liu@gmail.com>
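The effect of the `recursive` option in SPARK-4561 can be sketched with plain namedtuples standing in for Rows. The helper `as_dict` below is hypothetical and is not the PySpark implementation; descending into lists and dict values is an assumption of this sketch, since the commit message only mentions nested Rows.

```python
from collections import namedtuple


def as_dict(record, recursive=False):
    """Convert a namedtuple-like record to a dict, optionally recursing."""

    def convert(value):
        if hasattr(value, "_fields"):  # a nested record
            return as_dict(value, recursive=True)
        if isinstance(value, list):  # assumption: also descend into lists
            return [convert(v) for v in value]
        if isinstance(value, dict):  # assumption: and into dict values
            return {k: convert(v) for k, v in value.items()}
        return value

    items = zip(record._fields, record)
    if recursive:
        return {name: convert(value) for name, value in items}
    return dict(items)


Address = namedtuple("Address", ["city"])
Person = namedtuple("Person", ["name", "address"])
alice = Person("Alice", Address("Paris"))

print(as_dict(alice))                  # nested record is left as-is
print(as_dict(alice, recursive=True))  # nested record becomes a dict
```

Keeping the default at False preserves the earlier behaviour for callers that expect nested records to stay intact.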
* [SPARK-9738] [SQL] remove FromUnsafe and add its codegen version to GenerateSafeWenchen Fan2015-08-084-107/+95
| | | | | | | | | | | | | | In https://github.com/apache/spark/pull/7752 we added `FromUnsafe` to convert nested unsafe data like array/map/struct to safe versions. That was a quick solution, and we already have `GenerateSafe`, which does the conversion with generated code. So we should remove `FromUnsafe` and implement its codegen version in `GenerateSafe`. Author: Wenchen Fan <cloud0fan@outlook.com> Closes #8029 from cloud-fan/from-unsafe and squashes the following commits: ed40d8f [Wenchen Fan] add the copy back a93fd4b [Wenchen Fan] cogengen FromUnsafe (cherry picked from commit 106c0789d8c83c7081bc9a335df78ba728e95872) Signed-off-by: Davies Liu <davies.liu@gmail.com>
* [SPARK-4176] [SQL] [MINOR] Should use unscaled Long to write decimals for ↵Cheng Lian2015-08-082-13/+18
| | | | | | | | | | | | | | | | | precision <= 18 rather than 8 This PR fixes a minor bug introduced in #7455: when writing decimals, we should use the unscaled Long for better performance when the precision is <= 18 rather than <= 8 (the 8 was presumably a typo). This bug doesn't affect correctness, but hurts Parquet decimal writing performance. This PR also replaces similar magic numbers with newly defined constants. Author: Cheng Lian <lian@databricks.com> Closes #8031 from liancheng/spark-4176/minor-fix-for-writing-decimals and squashes the following commits: 10d4ea3 [Cheng Lian] Should use unscaled Long to write decimals for precision <= 18 rather than 8 (cherry picked from commit 11caf1ce290b6931647c2f71268f847d1d48930e) Signed-off-by: Cheng Lian <lian@databricks.com>
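The threshold of 18 in SPARK-4176 follows from the range of a signed 64-bit integer: every 18-digit unscaled value fits, while some 19-digit values do not. The short Python check below illustrates that arithmetic; `fits_in_long` and `unscaled_value` are hypothetical helper names for this note and do not come from the Spark code base.

```python
from decimal import Decimal

MAX_LONG = 2**63 - 1  # largest signed 64-bit integer


def fits_in_long(precision):
    """True if every unscaled value with `precision` digits fits in a long."""
    return 10**precision - 1 <= MAX_LONG


def unscaled_value(value, scale):
    """Integer obtained by shifting the decimal point `scale` places right."""
    return int(value.scaleb(scale))


print(fits_in_long(18), fits_in_long(19))
print(unscaled_value(Decimal("123.45"), 2))
```

With a cutoff of 8, decimals of precision 9 through 18 would miss the long-based path even though their unscaled values fit in a long, which matches the commit's description of a performance issue rather than a correctness issue.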
* [SPARK-9731] Standalone scheduling incorrect cores if spark.executor.cores ↵Carson Wang2015-08-072-12/+29
| | | | | | | | | | | | | | | | | | | is not set The issue only happens if `spark.executor.cores` is not set and executor memory is set to a high value. For example, if we have a worker with 4G and 10 cores and we set `spark.executor.memory` to 3G, then only 1 core is assigned to the executor. The correct number should be 10 cores. I've added a unit test to illustrate the issue. Author: Carson Wang <carson.wang@intel.com> Closes #8017 from carsonwang/SPARK-9731 and squashes the following commits: d09ec48 [Carson Wang] Fix code style 86b651f [Carson Wang] Simplify the code 943cc4c [Carson Wang] fix scheduling correct cores to executors (cherry picked from commit ef062c15992b0d08554495b8ea837bef3fabf6e9) Signed-off-by: Andrew Or <andrew@databricks.com>
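The example in SPARK-9731 (a 4G, 10-core worker with `spark.executor.memory` set to 3G) can be reproduced with a toy single-worker model. This sketch is an assumption-laden illustration, not the standalone Master's scheduling code: `assign_cores` and its parameters are hypothetical, it ignores multiple workers and any application-wide core limit, and it assumes that when cores per executor is unset, one executor per worker takes all free cores and its memory is charged once.

```python
def assign_cores(worker_cores, worker_mem_gb, executor_mem_gb,
                 executor_cores=None):
    """Return how many cores executors receive on one worker (toy model)."""
    step = executor_cores or 1  # cores granted per scheduling step
    one_executor_per_worker = executor_cores is None
    assigned_cores = 0
    executors = 0
    while worker_cores - assigned_cores >= step:
        launching_new = not one_executor_per_worker or executors == 0
        if launching_new:
            # Memory is only charged when a new executor is launched.
            free_mem = worker_mem_gb - executors * executor_mem_gb
            if free_mem < executor_mem_gb:
                break
            executors += 1
        assigned_cores += step
    return assigned_cores


# The scenario from the commit message: all 10 cores should be assigned.
print(assign_cores(worker_cores=10, worker_mem_gb=4, executor_mem_gb=3))
```

If the memory check were instead applied on every core increment, the second increment would see only 1G free against a 3G requirement and the loop would stop at 1 core, which is the symptom the commit message reports.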
* [SPARK-9753] [SQL] TungstenAggregate should also accept InternalRow instead ↵Yin Huai2015-08-074-50/+39
| | | | | | | | | | | | | | | | | | | of just UnsafeRow https://issues.apache.org/jira/browse/SPARK-9753 This PR makes TungstenAggregate accept `InternalRow` instead of just `UnsafeRow`. It also adds a `getAggregationBufferFromUnsafeRow` method to `UnsafeFixedWidthAggregationMap`, which is useful when the grouping keys are already stored in `UnsafeRow`s. Finally, it wraps `InputStream` and `OutputStream` in `UnsafeRowSerializer` with `BufferedInputStream` and `BufferedOutputStream`, respectively. Author: Yin Huai <yhuai@databricks.com> Closes #8041 from yhuai/joinedRowForProjection and squashes the following commits: 7753e34 [Yin Huai] Use BufferedInputStream and BufferedOutputStream. d68b74e [Yin Huai] Use joinedRow instead of UnsafeRowJoiner. e93c009 [Yin Huai] Add getAggregationBufferFromUnsafeRow for cases that the given groupingKeyRow is already an UnsafeRow. (cherry picked from commit c564b27447ed99e55b359b3df1d586d5766b85ea) Signed-off-by: Reynold Xin <rxin@databricks.com>
* [SPARK-9754][SQL] Remove TypeCheck in debug package.Reynold Xin2015-08-072-104/+4
| | | | | | | | | | | | | TypeCheck no longer applies in the new "Tungsten" world. Author: Reynold Xin <rxin@databricks.com> Closes #8043 from rxin/SPARK-9754 and squashes the following commits: 4ec471e [Reynold Xin] [SPARK-9754][SQL] Remove TypeCheck in debug package. (cherry picked from commit 998f4ff94df1d9db1c9e32c04091017c25cd4e81) Signed-off-by: Reynold Xin <rxin@databricks.com>