aboutsummaryrefslogtreecommitdiff
Commit message (Collapse)AuthorAgeFilesLines
* [SPARK-11622][MLLIB] Make LibSVMRelation extends HadoopFsRelation and…Jeff Zhang2016-01-263-16/+113
| | | | | | | | | | | | … Add LibSVMOutputWriter The behavior of LibSVMRelation is not changed except adding LibSVMOutputWriter * Partition is still not supported * Multiple input paths is not supported Author: Jeff Zhang <zjffdu@apache.org> Closes #9595 from zjffdu/SPARK-11622.
* [SPARK-12614][CORE] Don't throw non fatal exception from askShixiong Zhu2016-01-261-25/+29
| | | | | | | | Right now RpcEndpointRef.ask may throw exception in some corner cases, such as calling ask after stopping RpcEnv. It's better to avoid throwing exception from RpcEndpointRef.ask. We can send the exception to the future for `ask`. Author: Shixiong Zhu <shixiong@databricks.com> Closes #10568 from zsxwing/send-ask-fail.
* [SPARK-10509][PYSPARK] Reduce excessive param boiler plate codeHolden Karau2016-01-2612-317/+43
| | | | | | | | The current python ml params require cut-and-pasting the param setup and description between the class & ```__init__``` methods. Remove this possible case of errors & simplify use of custom params by adding a ```_copy_new_parent``` method to param so as to avoid cut and pasting (and cut and pasting at different indentation levels urgh). Author: Holden Karau <holden@us.ibm.com> Closes #10216 from holdenk/SPARK-10509-excessive-param-boiler-plate-code.
* [SPARK-12993][PYSPARK] Remove usage of ADD_FILES in pysparkJeff Zhang2016-01-261-10/+1
| | | | | | | | environment variable ADD_FILES is created for adding python files on spark context to be distributed to executors (SPARK-865), this is deprecated now. User are encouraged to use --py-files for adding python files. Author: Jeff Zhang <zjffdu@apache.org> Closes #10913 from zjffdu/SPARK-12993.
* [SQL] Minor Scaladoc format fixCheng Lian2016-01-261-4/+4
| | | | | | | | Otherwise the `^` character is always marked as error in IntelliJ since it represents an unclosed superscript markup tag. Author: Cheng Lian <lian@databricks.com> Closes #10926 from liancheng/agg-doc-fix.
* [SPARK-8725][PROJECT-INFRA] Test modules in topologically-sorted order in ↵Josh Rosen2016-01-264-18/+162
| | | | | | | | | | | | | | dev/run-tests This patch improves our `dev/run-tests` script to test modules in a topologically-sorted order based on modules' dependencies. This will help to ensure that bugs in upstream projects are not misattributed to downstream projects because those projects' tests were the first ones to exhibit the failure Topological sorting is also useful for shortening the feedback loop when testing pull requests: if I make a change in SQL then the SQL tests should run before MLlib, not after. In addition, this patch also updates our test module definitions to split `sql` into `catalyst`, `sql`, and `hive` in order to allow more tests to be skipped when changing only `hive/` files. Author: Josh Rosen <joshrosen@databricks.com> Closes #10885 from JoshRosen/SPARK-8725.
* [SPARK-12952] EMLDAOptimizer initialize() should return EMLDAOptimizer other ↵Xusen Yin2016-01-261-1/+3
| | | | | | | | | | than its parent class https://issues.apache.org/jira/browse/SPARK-12952 Author: Xusen Yin <yinxusen@gmail.com> Closes #10863 from yinxusen/SPARK-12952.
* [SPARK-11923][ML] Python API for ml.feature.ChiSqSelectorXusen Yin2016-01-261-1/+97
| | | | | | | | https://issues.apache.org/jira/browse/SPARK-11923 Author: Xusen Yin <yinxusen@gmail.com> Closes #10186 from yinxusen/SPARK-11923.
* [SPARK-7799][STREAMING][DOCUMENT] Add the linking and deploying instructions ↵Shixiong Zhu2016-01-261-37/+44
| | | | | | | | | | | | for streaming-akka project Since `actorStream` is an external project, we should add the linking and deploying instructions for it. A follow up PR of #10744 Author: Shixiong Zhu <shixiong@databricks.com> Closes #10856 from zsxwing/akka-link-instruction.
* [SPARK-12682][SQL] Add support for (optionally) not storing tables in hive ↵Sameer Agarwal2016-01-262-0/+39
| | | | | | | | | | metadata format This PR adds a new table option (`skip_hive_metadata`) that'd allow the user to skip storing the table metadata in hive metadata format. While this could be useful in general, the specific use-case for this change is that Hive doesn't handle wide schemas well (see https://issues.apache.org/jira/browse/SPARK-12682 and https://issues.apache.org/jira/browse/SPARK-6024) which in turn prevents such tables from being queried in SparkSQL. Author: Sameer Agarwal <sameer@databricks.com> Closes #10826 from sameeragarwal/skip-hive-metadata.
* [SPARK-10911] Executors should System.exit on clean shutdown.zhuol2016-01-261-0/+1
| | | | | | | | Call system.exit explicitly to make sure non-daemon user threads terminate. Without this, user applications might live forever if the cluster manager does not appropriately kill them. E.g., YARN had this bug: HADOOP-12441. Author: zhuol <zhuol@yahoo-inc.com> Closes #9946 from zhuoliu/10911.
* [SPARK-3369][CORE][STREAMING] Java mapPartitions Iterator->Iterable is ↵Sean Owen2016-01-2630-110/+146
| | | | | | | | | | | | inconsistent with Scala's Iterator->Iterator Fix Java function API methods for flatMap and mapPartitions to require producing only an Iterator, not Iterable. Also fix DStream.flatMap to require a function producing TraversableOnce only, not Traversable. CC rxin pwendell for API change; tdas since it also touches streaming. Author: Sean Owen <sowen@cloudera.com> Closes #10413 from srowen/SPARK-3369.
* [SPARK-12961][CORE] Prevent snappy-java memory leakLiang-Chi Hsieh2016-01-261-6/+14
| | | | | | | | | | | | JIRA: https://issues.apache.org/jira/browse/SPARK-12961 To prevent memory leak in snappy-java, just call the method once and cache the result. After the library releases new version, we can remove this object. JoshRosen Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #10875 from viirya/prevent-snappy-memory-leak.
* [SPARK-12937][SQL] bloom filter serializationWenchen Fan2016-01-266-44/+159
| | | | | | | | | | This PR adds serialization support for BloomFilter. A version number is added to version the serialized binary format. Author: Wenchen Fan <wenchen@databricks.com> Closes #10920 from cloud-fan/bloom-filter.
* [SQL][MINOR] A few minor tweaks to CSV reader.Reynold Xin2016-01-262-14/+9
| | | | | | | | This pull request simply fixes a few minor coding style issues in csv, as I was reviewing the change post-hoc. Author: Reynold Xin <rxin@databricks.com> Closes #10919 from rxin/csv-minor.
* [SPARK-10086][MLLIB][STREAMING][PYSPARK] ignore StreamingKMeans test in ↵Xiangrui Meng2016-01-251-0/+1
| | | | | | | | | | | | | | PySpark for now I saw several failures from recent PR builds, e.g., https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50015/consoleFull. This PR marks the test as ignored and we will fix the flakyness in SPARK-10086. gliptak Do you know why the test failure didn't show up in the Jenkins "Test Result"? cc: jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #10909 from mengxr/SPARK-10086.
* [SPARK-12834] Change ser/de of JavaArray and JavaListXusen Yin2016-01-251-1/+5
| | | | | | | | | | https://issues.apache.org/jira/browse/SPARK-12834 We use `SerDe.dumps()` to serialize `JavaArray` and `JavaList` in `PythonMLLibAPI`, then deserialize them with `PickleSerializer` in Python side. However, there is no need to transform them in such an inefficient way. Instead of it, we can use type conversion to convert them, e.g. `list(JavaArray)` or `list(JavaList)`. What's more, there is an issue to Ser/De Scala Array as I said in https://issues.apache.org/jira/browse/SPARK-12780 Author: Xusen Yin <yinxusen@gmail.com> Closes #10772 from yinxusen/SPARK-12834.
* [SPARK-11922][PYSPARK][ML] Python api for ml.feature.quantile discretizerHolden Karau2016-01-251-4/+85
| | | | | | | | | | | Add Python API for ml.feature.QuantileDiscretizer. One open question: Do we want to do this stuff to re-use the java model, create a new model, or use a different wrapper around the java model. cc brkyvz & mengxr Author: Holden Karau <holden@us.ibm.com> Closes #10085 from holdenk/SPARK-11937-SPARK-11922-Python-API-for-ml.feature.QuantileDiscretizer.
* [SPARK-12934] use try-with-resources for streamstedyu2016-01-251-0/+2
| | | | | | | | liancheng please take a look Author: tedyu <yuzhihong@gmail.com> Closes #10906 from tedyu/master.
* [SPARK-12936][SQL] Initial bloom filter implementationWenchen Fan2016-01-255-0/+602
| | | | | | | | | | | | | This PR adds an initial implementation of bloom filter in the newly added sketch module. The implementation is based on the [`BloomFilter` class in guava](https://code.google.com/p/guava-libraries/source/browse/guava/src/com/google/common/hash/BloomFilter.java). Some difference from the design doc: * expose `bitSize` instead of `sizeInBytes` to user. * always need the `expectedInsertions` parameter when create bloom filter. Author: Wenchen Fan <wenchen@databricks.com> Closes #10883 from cloud-fan/bloom-filter.
* [SPARK-12879] [SQL] improve the unsafe row writing frameworkWenchen Fan2016-01-257-78/+258
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As we begin to use unsafe row writing framework(`BufferHolder` and `UnsafeRowWriter`) in more and more places(`UnsafeProjection`, `UnsafeRowParquetRecordReader`, `GenerateColumnAccessor`, etc.), we should add more doc to it and make it easier to use. This PR abstract the technique used in `UnsafeRowParquetRecordReader`: avoid unnecessary operatition as more as possible. For example, do not always point the row to the buffer at the end, we only need to update the size of row. If all fields are of primitive type, we can even save the row size updating. Then we can apply this technique to more places easily. a local benchmark shows `UnsafeProjection` is up to 1.7x faster after this PR: **old version** ``` Intel(R) Core(TM) i7-4960HQ CPU 2.60GHz unsafe projection: Avg Time(ms) Avg Rate(M/s) Relative Rate ------------------------------------------------------------------------------- single long 2616.04 102.61 1.00 X single nullable long 3032.54 88.52 0.86 X primitive types 9121.05 29.43 0.29 X nullable primitive types 12410.60 21.63 0.21 X ``` **new version** ``` Intel(R) Core(TM) i7-4960HQ CPU 2.60GHz unsafe projection: Avg Time(ms) Avg Rate(M/s) Relative Rate ------------------------------------------------------------------------------- single long 1533.34 175.07 1.00 X single nullable long 2306.73 116.37 0.66 X primitive types 8403.93 31.94 0.18 X nullable primitive types 12448.39 21.56 0.12 X ``` For single non-nullable long(the best case), we can have about 1.7x speed up. Even it's nullable, we can still have 1.3x speed up. For other cases, it's not such a boost as the saved operations only take a little proportion of the whole process. The benchmark code is included in this PR. Author: Wenchen Fan <wenchen@databricks.com> Closes #10809 from cloud-fan/unsafe-projection.
* [SPARK-12934][SQL] Count-min sketch serializationCheng Lian2016-01-254-19/+213
| | | | | | | | | | This PR adds serialization support for `CountMinSketch`. A version number is added to version the serialized binary format. Author: Cheng Lian <lian@databricks.com> Closes #10893 from liancheng/cms-serialization.
* [SPARK-12905][ML][PYSPARK] PCAModel return eigenvalues for PySparkYanbo Liang2016-01-252-0/+13
| | | | | | | | | | ```PCAModel``` can output ```explainedVariance``` at Python side. cc mengxr srowen Author: Yanbo Liang <ybliang8@gmail.com> Closes #10830 from yanboliang/spark-12905.
* [SPARK-12975][SQL] Throwing Exception when Bucketing Columns are part of ↵gatorsmile2016-01-253-3/+83
| | | | | | | | | | | | | | | | | | | | | | Partitioning Columns When users are using `partitionBy` and `bucketBy` at the same time, some bucketing columns might be part of partitioning columns. For example, ``` df.write .format(source) .partitionBy("i") .bucketBy(8, "i", "k") .saveAsTable("bucketed_table") ``` However, in the above case, adding column `i` into `bucketBy` is useless. It is just wasting extra CPU when reading or writing bucket tables. Thus, like Hive, we can issue an exception and let users do the change. Also added a test case for checking if the information of `sortBy` and `bucketBy` columns are correctly saved in the metastore table. Could you check if my understanding is correct? cloud-fan rxin marmbrus Thanks! Author: gatorsmile <gatorsmile@gmail.com> Closes #10891 from gatorsmile/commonKeysInPartitionByBucketBy.
* [SPARK-12901][SQL][HOT-FIX] Fix scala 2.11 compilation.Yin Huai2016-01-252-2/+2
|
* [SPARK-12902] [SQL] visualization for generated operatorsDavies Liu2016-01-259-32/+104
| | | | | | | | | | | | | | This PR brings back visualization for generated operators, they looks like: ![sql](https://cloud.githubusercontent.com/assets/40902/12460920/0dc7956a-bf6b-11e5-9c3f-8389f452526e.png) ![stage](https://cloud.githubusercontent.com/assets/40902/12460923/11806ac4-bf6b-11e5-9c72-e84a62c5ea93.png) Note: SQL metrics are not supported right now, because they are very slow, will be supported once we have batch mode. Author: Davies Liu <davies@databricks.com> Closes #10828 from davies/viz_codegen.
* [SPARK-12149][WEB UI] Executor UI improvement suggestions - Color UIAlex Bozarth2016-01-257-20/+103
| | | | | | | | | | | | | Added color coding to the Executors page for Active Tasks, Failed Tasks, Completed Tasks and Task Time. Active Tasks is shaded blue with it's range based on percentage of total cores used. Failed Tasks is shaded red ranging over the first 10% of total tasks failed Completed Tasks is shaded green ranging over 10% of total tasks including failed and active tasks, but only when there are active or failed tasks on that executor. Task Time is shaded red when GC Time goes over 10% of total time with it's range directly corresponding to the percent of total time. Author: Alex Bozarth <ajbozart@us.ibm.com> Closes #10154 from ajbozarth/spark12149.
* Closes #10879Xiangrui Meng2016-01-250-0/+0
| | | | | | | | | Closes #9046 Closes #8532 Closes #10756 Closes #8960 Closes #10485 Closes #10467
* [SPARK-11965][ML][DOC] Update user guide for RFormula feature interactionsYanbo Liang2016-01-252-1/+40
| | | | | | | | Update user guide for RFormula feature interactions. Meanwhile we also update other new features such as supporting string label in Spark 1.6. Author: Yanbo Liang <ybliang8@gmail.com> Closes #10222 from yanboliang/spark-11965.
* [SPARK-12755][CORE] Stop the event logger before the DAG schedulerMichael Allman2016-01-251-6/+6
| | | | | | | | | | [SPARK-12755][CORE] Stop the event logger before the DAG scheduler to avoid a race condition where the standalone master attempts to build the app's history UI before the event log is stopped. This contribution is my original work, and I license this work to the Spark project under the project's open source license. Author: Michael Allman <michael@videoamp.com> Closes #10700 from mallman/stop_event_logger_first.
* [SPARK-12932][JAVA API] improved error message for java type inference failureAndy Grove2016-01-251-1/+2
| | | | | | Author: Andy Grove <andygrove73@gmail.com> Closes #10865 from andygrove/SPARK-12932.
* [SPARK-12901][SQL] Refactor options for JSON and CSV datasource (not case ↵hyukjinkwon2016-01-256-52/+40
| | | | | | | | | | | | | | | | | class and same format). https://issues.apache.org/jira/browse/SPARK-12901 This PR refactors the options in JSON and CSV datasources. In more details, 1. `JSONOptions` uses the same format as `CSVOptions`. 2. Not case classes. 3. `CSVRelation` that does not have to be serializable (it was `with Serializable` but I removed) Author: hyukjinkwon <gurwls223@gmail.com> Closes #10895 from HyukjinKwon/SPARK-12901.
* [SPARK-12624][PYSPARK] Checks row length when converting Java arrays to ↵Cheng Lian2016-01-242-1/+17
| | | | | | | | | | Python rows When actual row length doesn't conform to specified schema field length, we should give a better error message instead of throwing an unintuitive `ArrayOutOfBoundsException`. Author: Cheng Lian <lian@databricks.com> Closes #10886 from liancheng/spark-12624.
* [SPARK-12120][PYSPARK] Improve exception message when failing to init…Jeff Zhang2016-01-241-3/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | …ialize HiveContext in PySpark davies Mind to review ? This is the error message after this PR ``` 15/12/03 16:59:53 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException /Users/jzhang/github/spark/python/pyspark/sql/context.py:689: UserWarning: You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly warnings.warn("You must build Spark with Hive. " Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/Users/jzhang/github/spark/python/pyspark/sql/context.py", line 663, in read return DataFrameReader(self) File "/Users/jzhang/github/spark/python/pyspark/sql/readwriter.py", line 56, in __init__ self._jreader = sqlContext._ssql_ctx.read() File "/Users/jzhang/github/spark/python/pyspark/sql/context.py", line 692, in _ssql_ctx raise e py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext. : java.lang.RuntimeException: java.net.ConnectException: Call From jzhangMBPr.local/127.0.0.1 to 0.0.0.0:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522) at org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:194) at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:238) at org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:218) at org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:208) at org.apache.spark.sql.hive.HiveContext.functionRegistry$lzycompute(HiveContext.scala:462) at org.apache.spark.sql.hive.HiveContext.functionRegistry(HiveContext.scala:461) at org.apache.spark.sql.UDFRegistration.<init>(UDFRegistration.scala:40) at org.apache.spark.sql.SQLContext.<init>(SQLContext.scala:330) at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:90) at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:101) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:526) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381) at py4j.Gateway.invoke(Gateway.java:214) at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79) at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68) at py4j.GatewayConnection.run(GatewayConnection.java:209) at java.lang.Thread.run(Thread.java:745) ``` Author: Jeff Zhang <zjffdu@apache.org> Closes #10126 from zjffdu/SPARK-12120.
* [SPARK-10498][TOOLS][BUILD] Add requirements.txt file for dev python toolsHolden Karau2016-01-241-0/+3
| | | | | | | | | | Minor since so few people use them, but it would probably be good to have a requirements file for our python release tools for easier setup (also version pinning). cc JoshRosen who looked at the original JIRA. Author: Holden Karau <holden@us.ibm.com> Closes #10871 from holdenk/SPARK-10498-add-requirements-file-for-dev-python-tools.
* [SPARK-12971] Fix Hive tests which fail in Hadoop-2.3 SBT buildJosh Rosen2016-01-242-4/+22
| | | | | | | | | | ErrorPositionSuite and one of the HiveComparisonTest tests have been consistently failing on the Hadoop 2.3 SBT build (but on no other builds). I believe that this is due to test isolation issues (e.g. tests sharing state via the sets of temporary tables that are registered to TestHive). This patch attempts to improve the isolation of these tests in order to address this issue. Author: Josh Rosen <joshrosen@databricks.com> Closes #10884 from JoshRosen/fix-failing-hadoop-2.3-hive-tests.
* [STREAMING][MINOR] Scaladoc + logsJacek Laskowski2016-01-234-6/+5
| | | | | | | | Found while doing code review Author: Jacek Laskowski <jacek@japila.pl> Closes #10878 from jaceklaskowski/streaming-scaladoc-logs-tiny-fixes.
* [SPARK-12904][SQL] Strength reduction for integral and decimal literal ↵Reynold Xin2016-01-236-139/+376
| | | | | | | | | | comparisons This pull request implements strength reduction for comparing integral expressions and decimal literals, which is more common now because we switch to parsing fractional literals as decimal types (rather than doubles). I added the rules to the existing DecimalPrecision rule with some refactoring to simplify the control flow. I also moved DecimalPrecision rule into its own file due to the growing size. Author: Reynold Xin <rxin@databricks.com> Closes #10882 from rxin/SPARK-12904-1.
* [SPARK-11137][STREAMING] Make StreamingContext.stop() exception-safejayadevanmurali2016-01-231-4/+12
| | | | | | | | Make StreamingContext.stop() exception-safe Author: jayadevanmurali <jayadevan.m@tcs.com> Closes #10807 from jayadevanmurali/branch-0.1-SPARK-11137.
* [SPARK-12760][DOCS] inaccurate description for difference between local vs ↵Sean Owen2016-01-231-4/+4
| | | | | | | | | | cluster mode in closure handling Clarify that modifying a driver local variable won't have the desired effect in cluster modes, and may or may not work as intended in local mode Author: Sean Owen <sowen@cloudera.com> Closes #10866 from srowen/SPARK-12760.
* [SPARK-12760][DOCS] invalid lambda expression in python example for …Mortada Mehyar2016-01-231-2/+5
| | | | | | | | | | | | | | | | | | | | | | | …local vs cluster srowen thanks for the PR at https://github.com/apache/spark/pull/10866! sorry it took me a while. This is related to https://github.com/apache/spark/pull/10866, basically the assignment in the lambda expression in the python example is actually invalid ``` In [1]: data = [1, 2, 3, 4, 5] In [2]: counter = 0 In [3]: rdd = sc.parallelize(data) In [4]: rdd.foreach(lambda x: counter += x) File "<ipython-input-4-fcb86c182bad>", line 1 rdd.foreach(lambda x: counter += x) ^ SyntaxError: invalid syntax ``` Author: Mortada Mehyar <mortada.mehyar@gmail.com> Closes #10867 from mortada/doc_python_fix.
* [SPARK-12859][STREAMING][WEB UI] Names of input streams with receivers don't ↵Alex Bozarth2016-01-231-1/+1
| | | | | | | | | | fit in Streaming page Added CSS style to force names of input streams with receivers to wrap Author: Alex Bozarth <ajbozart@us.ibm.com> Closes #10873 from ajbozarth/spark12859.
* [SPARK-12933][SQL] Initial implementation of Count-Min sketchCheng Lian2016-01-239-12/+892
| | | | | | | | | | | | | | | | | | | This PR adds an initial implementation of count min sketch, contained in a new module spark-sketch under `common/sketch`. The implementation is based on the [`CountMinSketch` class in stream-lib][1]. As required by the [design doc][2], spark-sketch should have no external dependency. Two classes, `Murmur3_x86_32` and `Platform` are copied to spark-sketch from spark-unsafe for hashing facilities. They'll also be used in the upcoming bloom filter implementation. The following features will be added in future follow-up PRs: - Serialization support - DataFrame API integration [1]: https://github.com/addthis/stream-lib/blob/aac6b4d23a8686b000f80baa447e0922ecac3bcb/src/main/java/com/clearspring/analytics/stream/frequency/CountMinSketch.java [2]: https://issues.apache.org/jira/secure/attachment/12782378/BloomFilterandCount-MinSketchinSpark2.0.pdf Author: Cheng Lian <lian@databricks.com> Closes #10851 from liancheng/count-min-sketch.
* [SPARK-12872][SQL] Support to specify the option for compression codec for ↵hyukjinkwon2016-01-225-29/+96
| | | | | | | | | | | | | | | JSON datasource https://issues.apache.org/jira/browse/SPARK-12872 This PR makes the JSON datasource can compress output by option instead of manually setting Hadoop configurations. For reflecting codec by names, it is similar with https://github.com/apache/spark/pull/10805. As `CSVCompressionCodecs` can be shared with other datasources, it became a separate class to share as `CompressionCodecs`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #10858 from HyukjinKwon/SPARK-12872.
* [HOTFIX]Remove rpcEnv.awaitTermination to avoid dead-lock in some testShixiong Zhu2016-01-221-1/+0
| | | | Looks rpcEnv.awaitTermination may block some tests forever. Just remove it and investigate the tests.
* [SPARK-7997][CORE] Remove Akka from Spark Core and StreamingShixiong Zhu2016-01-2243-831/+123
| | | | | | | | | | | | - Remove Akka dependency from core. Note: the streaming-akka project still uses Akka. - Remove HttpFileServer - Remove Akka configs from SparkConf and SSLOptions - Rename `spark.akka.frameSize` to `spark.rpc.message.maxSize`. I think it's still worth to keep this config because using `DirectTaskResult` or `IndirectTaskResult` depends on it. - Update comments and docs Author: Shixiong Zhu <shixiong@databricks.com> Closes #10854 from zsxwing/remove-akka.
* [HOTFIX][BUILD][TEST-MAVEN] Remove duplicate dependencyShixiong Zhu2016-01-221-7/+4
| | | | | | Author: Shixiong Zhu <shixiong@databricks.com> Closes #10868 from zsxwing/hotfix-akka-pom.
* [SPARK-12629][SPARKR] Fixes for DataFrame saveAsTable methodNarine Kokhlikyan2016-01-223-9/+41
| | | | | | | | | | I've tried to solve some of the issues mentioned in: https://issues.apache.org/jira/browse/SPARK-12629 Please, let me know what do you think. Thanks! Author: Narine Kokhlikyan <narine.kokhlikyan@gmail.com> Closes #10580 from NarineK/sparkrSavaAsRable.
* [SPARK-12959][SQL] Writing Bucketed Data with Disabled Bucketing in SQLConfgatorsmile2016-01-223-6/+26
| | | | | | | | | | | | When users turn off bucketing in SQLConf, we should issue some messages to tell users these operations will be converted to normal way. Also added a test case for this scenario and fixed the helper function. Do you think this PR is helpful when using bucket tables? cloud-fan Thank you! Author: gatorsmile <gatorsmile@gmail.com> Closes #10870 from gatorsmile/bucketTableWritingTestcases.
* [SPARK-12960] [PYTHON] Some examples are missing support for python2Mark Grover2016-01-212-0/+3
| | | | | | | | Without importing the print_function, the lines later on like ```print("Usage: direct_kafka_wordcount.py <broker_list> <topic>", file=sys.stderr)``` fail when using python2.*. Import fixes that problem and doesn't break anything on python3 either. Author: Mark Grover <mark@apache.org> Closes #10872 from markgrover/python2_compat.