Commit message (author, date, files changed, lines -deleted/+added)
* [SPARK-12195][SQL] Adding BigDecimal, Date and Timestamp into Encoder (gatorsmile, 2015-12-08, 2 files, -0/+35)
  This PR adds three more data types to Encoder: `BigDecimal`, `Date` and `Timestamp`. marmbrus cloud-fan rxin Could you take a quick look at these three types? Not sure if it can be merged to 1.6. Thank you very much! Author: gatorsmile <gatorsmile@gmail.com> Closes #10188 from gatorsmile/dataTypesinEncoder.
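  A minimal sketch of what the new encoders enable, assuming a Spark 1.6 SQLContext named `sqlContext` is already in scope (setup not shown):

```scala
import java.sql.{Date, Timestamp}
import sqlContext.implicits._

// With the new encoders, these types can be Dataset element types directly.
val decimals = Seq(BigDecimal("9.99"), BigDecimal("0.01")).toDS()
val dates    = Seq(Date.valueOf("2015-12-08")).toDS()
val times    = Seq(new Timestamp(System.currentTimeMillis())).toDS()
```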
* [SPARK-12201][SQL] add type coercion rule for greatest/least (Wenchen Fan, 2015-12-08, 3 files, -0/+47)
  Checked with Hive: greatest/least should cast their children to the tightest common type, i.e. `(int, long) => long`, `(int, string) => error`, `(decimal(10,5), decimal(5,10)) => error`. Author: Wenchen Fan <wenchen@databricks.com> Closes #10196 from cloud-fan/type-coercion.
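  A quick illustration of the coercion rule, as a sketch to run against a SQLContext (assumed in scope):

```scala
// (int, long): the tightest common type is long, so this analyzes fine
// and the result column has type bigint.
sqlContext.sql("SELECT greatest(1, CAST(2 AS BIGINT))").show()

// (int, string): there is no tightest common type, so analysis fails.
// sqlContext.sql("SELECT least(1, 'a')")  // throws AnalysisException
```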
* [SPARK-12074] Avoid memory copy involving ByteBuffer.wrap(ByteArrayOutputStream.toByteArray) (tedyu, 2015-12-08, 3 files, -7/+8)
  SPARK-12060 fixed JavaSerializerInstance.serialize. This PR applies the same technique to two other classes. zsxwing Author: tedyu <yuzhihong@gmail.com> Closes #10177 from tedyu/master.
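  The copy being avoided: `ByteArrayOutputStream.toByteArray` duplicates the internal buffer before it is wrapped. A minimal sketch of the technique (class and method names here are illustrative, not necessarily those in the patch):

```scala
import java.io.ByteArrayOutputStream
import java.nio.ByteBuffer

// Exposes the internal buffer directly instead of copying it via toByteArray.
// `buf` and `count` are protected fields of ByteArrayOutputStream.
class ByteBufferOutputStream extends ByteArrayOutputStream {
  def toByteBuffer: ByteBuffer = ByteBuffer.wrap(buf, 0, count)
}

// Before: ByteBuffer.wrap(bos.toByteArray)  -- copies count bytes
// After:  bos.toByteBuffer                  -- wraps the buffer in place
```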
* [SPARK-11155][WEB UI] Stage summary json should include stage duration (Xin Ren, 2015-12-08, 11 files, -9/+124)
  The JSON endpoint for stages doesn't include the stage duration information that is present in the UI. This looks like a simple oversight; the metrics should be included, e.g. at api/v1/applications/<appId>/stages. Metrics added: submissionTime, firstTaskLaunchedTime and completionTime. Author: Xin Ren <iamshrek@126.com> Closes #10107 from keypointt/SPARK-11155.
* [SPARK-11652][CORE] Remote code execution with InvokerTransformer (Sean Owen, 2015-12-08, 1 file, -1/+1)
  Fix the commons-collection group ID to commons-collections for version 3.x. Patches earlier PR at https://github.com/apache/spark/pull/9731. Author: Sean Owen <sowen@cloudera.com> Closes #10198 from srowen/SPARK-11652.2.
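  The corrected coordinate, shown as an SBT dependency (the version shown is illustrative; 3.2.2 is the 3.x release that patches the InvokerTransformer deserialization issue):

```scala
// Group ID and artifact ID are both "commons-collections" for the 3.x line.
libraryDependencies += "commons-collections" % "commons-collections" % "3.2.2"
```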
* [SPARK-11551][DOC][EXAMPLE] Revert PR #10002 (Cheng Lian, 2015-12-08, 52 files, -2806/+1058)
  This reverts PR #10002, commit 78209b0ccaf3f22b5e2345dfb2b98edfdb746819. The original PR wasn't tested on Jenkins before being merged. Author: Cheng Lian <lian@databricks.com> Closes #10200 from liancheng/revert-pr-10002.
* [SPARK-11439][ML] Optimization of creating sparse feature without dense one (Nakul Jindal, 2015-12-08, 3 files, -122/+142)
  Sparse features generated in LinearDataGenerator no longer create dense vectors as an intermediate step. Author: Nakul Jindal <njindal@us.ibm.com> Closes #9756 from nakul02/SPARK-11439_sparse_without_creating_dense_feature.
* [SPARK-12166][TEST] Unset Hadoop-related environment variables in testing (Jeff Zhang, 2015-12-08, 1 file, -0/+6)
  Author: Jeff Zhang <zjffdu@apache.org> Closes #10172 from zjffdu/SPARK-12166.
* [SPARK-12103][STREAMING][KAFKA][DOC] document that K means Key and V means Value (cody koeninger, 2015-12-08, 1 file, -0/+61)
  Author: cody koeninger <cody@koeninger.org> Closes #10132 from koeninger/SPARK-12103.
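  For context, the type parameters being documented appear in the direct stream API roughly like this (a sketch; `ssc`, `kafkaParams` and `topics` are assumed to be in scope):

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// Type parameters: K = key type, V = value type, then their two decoders.
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)

// Each record arrives as a (key, value) pair.
stream.map { case (k, v) => s"key=$k value=$v" }
```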
* [SPARK-11958][SPARK-11957][ML][DOC] SQLTransformer user guide and example code (Yanbo Liang, 2015-12-07, 5 files, -2/+212)
  Add a ```SQLTransformer``` user guide and example code, and make the Scala API doc clearer. Author: Yanbo Liang <ybliang8@gmail.com> Closes #10006 from yanboliang/spark-11958.
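  A short usage sketch of the transformer being documented (column names are made up; `df` is assumed to have numeric columns v1 and v2):

```scala
import org.apache.spark.ml.feature.SQLTransformer

// __THIS__ stands for the underlying table of the input DataFrame.
val sqlTrans = new SQLTransformer()
  .setStatement("SELECT *, (v1 + v2) AS v3 FROM __THIS__")

val transformed = sqlTrans.transform(df)
```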
* [SPARK-10259][ML] Add @since annotation to ml.classification (Takahashi Hiroshi, 2015-12-07, 7 files, -44/+185)
  Add since annotations to ml.classification. Author: Takahashi Hiroshi <takahashi.hiroshi@lab.ntt.co.jp> Closes #8534 from taishi-oss/issue10259.
* Closes #10098 (Xiangrui Meng, 2015-12-07, 0 files, -0/+0)
* [SPARK-11551][DOC][EXAMPLE] Replace example code in ml-features.md using include_example (somideshmukh, 2015-12-07, 52 files, -1058/+2806)
  Made a new patch containing only the markdown examples moved to the example/ folder. Only three Java examples were not shifted, since they contained compilation errors; these classes are 1) StandardScale 2) NormalizerExample 3) VectorIndexer. Author: Xusen Yin <yinxusen@gmail.com> Author: somideshmukh <somilde@us.ibm.com> Closes #10002 from somideshmukh/SomilBranch1.33.
* [SPARK-12160][MLLIB] Use SQLContext.getOrCreate in MLlib (Joseph K. Bradley, 2015-12-07, 13 files, -29/+29)
  Switched from using the SQLContext constructor to getOrCreate, mainly in model save/load methods. This covers all instances in spark.mllib; there were no uses of the constructor in spark.ml. CC: mengxr yhuai Author: Joseph K. Bradley <joseph@databricks.com> Closes #10161 from jkbradley/mllib-sqlcontext-fix.
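  The pattern in question, as a minimal sketch (assuming a SparkContext `sc` is in scope):

```scala
import org.apache.spark.sql.SQLContext

// Before: constructs a fresh context even when one already exists.
// val sqlContext = new SQLContext(sc)

// After: reuses the singleton context if present, creating it only once.
val sqlContext = SQLContext.getOrCreate(sc)
```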
* [SPARK-12184][PYTHON] Make python api doc for pivot consistent with scala doc (Andrew Ray, 2015-12-07, 1 file, -5/+9)
  In SPARK-11946 the API for pivot was changed a bit and got updated docs, but the doc changes were not made for the Python API. This PR updates the Python doc to be consistent. Author: Andrew Ray <ray.andrew@gmail.com> Closes #10176 from aray/sql-pivot-python-doc.
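  The Scala side of the API being documented, for reference (a sketch; column names are made up):

```scala
// Pivot the distinct values of "course" into columns, aggregating "earnings".
val pivoted = df.groupBy("year").pivot("course").sum("earnings")

// Supplying the pivot values explicitly skips the pass that computes them.
val pivoted2 = df.groupBy("year").pivot("course", Seq("dotNET", "Java")).sum("earnings")
```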
* [SPARK-11884] Drop multiple columns in the DataFrame API (tedyu, 2015-12-07, 2 files, -8/+23)
  See the thread Ben started: http://search-hadoop.com/m/q3RTtveEuhjsr7g/ This PR adds a drop() method to DataFrame which accepts multiple column names. Author: tedyu <yuzhihong@gmail.com> Closes #9862 from ted-yu/master.
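  A sketch of the varargs overload this describes (exact signature assumed):

```scala
// Drop several columns in one call...
val trimmed = df.drop("col1", "col2", "col3")

// ...equivalent to chaining single-column drops.
val same = df.drop("col1").drop("col2").drop("col3")
```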
* [SPARK-11963][DOC] Add docs for QuantileDiscretizer (Xusen Yin, 2015-12-07, 3 files, -0/+185)
  https://issues.apache.org/jira/browse/SPARK-11963 Author: Xusen Yin <yinxusen@gmail.com> Closes #9962 from yinxusen/SPARK-11963.
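  A usage sketch of the feature being documented (column names are made up):

```scala
import org.apache.spark.ml.feature.QuantileDiscretizer

// Buckets a continuous column into numBuckets bins whose boundaries are
// chosen from approximate quantiles of the data.
val discretizer = new QuantileDiscretizer()
  .setInputCol("hour")
  .setOutputCol("bucket")
  .setNumBuckets(3)

val bucketed = discretizer.fit(df).transform(df)
```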
* [SPARK-12060][CORE] Avoid memory copy in JavaSerializerInstance.serialize (Shixiong Zhu, 2015-12-07, 2 files, -4/+34)
  Merged #10051 again since #10083 is resolved. This reverts commit 328b757d5d4486ea3c2e246780792d7a57ee85e5. Author: Shixiong Zhu <shixiong@databricks.com> Closes #10167 from zsxwing/merge-SPARK-12060.
* [SPARK-11932][STREAMING] Partition previous TrackStateRDD if partitioner not present (Tathagata Das, 2015-12-07, 6 files, -84/+258)
  The reason is that TrackStateRDDs generated by trackStateByKey expect the previous batch's TrackStateRDDs to have a partitioner. However, when recovering from DStream checkpoints, the RDDs recovered from RDD checkpoints do not have a partitioner attached, because RDD checkpoints do not preserve the partitioner (SPARK-12004). While #9983 solves SPARK-12004 by preserving the partitioner through RDD checkpoints, there is a non-zero chance that the saving and recovery fails. To be resilient, this PR repartitions the previous state RDD if the partitioner is not detected. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #9988 from tdas/SPARK-11932.
* [SPARK-12132] [PYSPARK] raise KeyboardInterrupt inside SIGINT handler (Davies Liu, 2015-12-07, 1 file, -0/+1)
  Currently, the current line is not cleared by Ctrl-C. After this patch:

```
>>> asdfasdf^C
Traceback (most recent call last):
  File "~/spark/python/pyspark/context.py", line 225, in signal_handler
    raise KeyboardInterrupt()
KeyboardInterrupt
```

  It's still worse than 1.5 (and before). Author: Davies Liu <davies@databricks.com> Closes #10134 from davies/fix_cltrc.
* [SPARK-12034][SPARKR] Eliminate warnings in SparkR test cases. (Sun Rui, 2015-12-07, 20 files, -39/+50)
  This PR:
  1. Suppresses all known warnings.
  2. Cleans up test cases and fixes some errors in them.
  3. Fixes errors in HiveContext-related test cases. These test cases were actually not run previously, due to a bug in creating TestHiveContext.
  4. Supports 'testthat' package version 0.11.0, which prefers that test cases live under 'tests/testthat'.
  5. Makes sure the default Hadoop file system is local when running test cases.
  6. Turns warnings into errors.
  Author: Sun Rui <rui.sun@intel.com> Closes #10030 from sun-rui/SPARK-12034.
* [SPARK-12032] [SQL] Re-order inner joins to do join with conditions first (Davies Liu, 2015-12-07, 3 files, -6/+185)
  Currently, the order of joins is exactly the same as in the SQL query, so some conditions may not be pushed down to the correct join; those joins then become cross products and are extremely slow. This patch tries to re-order the inner joins (which are common in SQL queries), picking the joins that have self-contained conditions first and delaying those that have no conditions. After this patch, the TPC-DS queries Q64/Q65 run hundreds of times faster. cc marmbrus nongli Author: Davies Liu <davies@databricks.com> Closes #10073 from davies/reorder_joins.
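  A sketch of the kind of query that benefits (table and column names are made up):

```scala
// As written, a JOIN b has no usable condition until c arrives, so planning
// the joins in textual order produces a cross product between a and b.
val q = sqlContext.sql("""
  SELECT *
  FROM a JOIN b JOIN c
  WHERE a.id = c.a_id AND b.id = c.b_id
""")
// With re-ordering, the joins with self-contained conditions (a-c, then b-c)
// are planned first, and the cross product disappears.
```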
* [SPARK-12106][STREAMING][FLAKY-TEST] BatchedWAL test transiently flaky when Jenkins load is high (Burak Yavuz, 2015-12-07, 2 files, -6/+14)
  We need to make sure that the last entry is indeed the last entry in the queue. Author: Burak Yavuz <brkyvz@gmail.com> Closes #10110 from brkyvz/batch-wal-test-fix.
* [SPARK-12152][PROJECT-INFRA] Speed up Scalastyle checks by only invoking SBT once (Josh Rosen, 2015-12-06, 1 file, -8/+11)
  Currently, `dev/scalastyle` invokes SBT four times, but these invocations can be replaced with a single invocation, saving about one minute of build time. Author: Josh Rosen <joshrosen@databricks.com> Closes #10151 from JoshRosen/speed-up-scalastyle.
* [SPARK-12138][SQL] Escape \u in the generated comments of codegen (gatorsmile, 2015-12-06, 2 files, -1/+12)
  When \u appears in a comment block (i.e. in /* */), codegen will break, because the Java compiler translates unicode escapes before tokenizing, even inside comments. So, in Expression and CodegenFallback, we escape \u to \\u. yhuai Please review it. I reproduced the issue, and it works after the fix. Thanks! Author: gatorsmile <gatorsmile@gmail.com> Closes #10155 from gatorsmile/escapeU.
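  The escaping itself is a one-liner; a sketch (the helper name is an assumption, not necessarily what the patch uses):

```scala
// An unescaped "\u" not followed by four hex digits is a javac error,
// even inside /* ... */ comments, because unicode escapes are processed
// before the source is tokenized.
def toCommentSafeString(s: String): String =
  s.replace("\\u", "\\\\u")
```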
* [SPARK-12048][SQL] Prevent closing JDBC resources twice (gcc, 2015-12-06, 1 file, -0/+1)
  Author: gcc <spark-src@condor.rhaag.ip> Closes #10101 from rh99/master.
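  The usual guard pattern for this class of bug, sketched (names assumed; the actual patch is a one-line change):

```scala
import java.sql.Connection

class JdbcResource(private var conn: Connection) {
  // Null out the reference after closing so a second close() is a no-op.
  def close(): Unit = {
    if (conn != null) {
      try conn.close()
      finally conn = null
    }
  }
}
```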
* [SPARK-12044][SPARKR] Fix usage of isnan, isNaN (Yanbo Liang, 2015-12-05, 4 files, -11/+31)
  1. Add ```isNaN``` to ```Column``` for SparkR. ```Column``` should have three related variable functions: ```isNaN, isNull, isNotNull```.
  2. Replace ```DataFrame.isNaN``` with ```DataFrame.isnan``` on the SparkR side, because ```DataFrame.isNaN``` has been deprecated and will be removed in Spark 2.0.
  ~~3. Add ```isnull``` to ```DataFrame``` for SparkR. ```DataFrame``` should have two related functions: ```isnan, isnull```.~~
  cc shivaram sun-rui felixcheung Author: Yanbo Liang <ybliang8@gmail.com> Closes #10037 from yanboliang/spark-12044.
* [SPARK-12115][SPARKR] Change numPartitions() to getNumPartitions() to be consistent with Scala/Python (Yanbo Liang, 2015-12-05, 4 files, -30/+45)
  Change ```numPartitions()``` to ```getNumPartitions()``` to be consistent with Scala/Python. ~~Note: if we cannot catch up with the 1.6 release, this will be a breaking change for 1.7 that we also need to explain in the release notes.~~ cc sun-rui felixcheung shivaram Author: Yanbo Liang <ybliang8@gmail.com> Closes #10123 from yanboliang/spark-12115.
* [SPARK-11715][SPARKR] Add R support for corr for Column aggregation (felixcheung, 2015-12-05, 4 files, -6/+22)
  Needs to match the existing method signature. Author: felixcheung <felixcheung_m@hotmail.com> Closes #9680 from felixcheung/rcorr.
* [SPARK-11774][SPARKR] Implement struct(), encode(), decode() functions in SparkR. (Sun Rui, 2015-12-05, 4 files, -6/+105)
  Author: Sun Rui <rui.sun@intel.com> Closes #9804 from sun-rui/SPARK-11774.
* [SPARK-11988][ML][MLLIB] Update JPMML to 1.2.7 (Sean Owen, 2015-12-05, 6 files, -65/+59)
  Update JPMML pmml-model to 1.2.7. Author: Sean Owen <sowen@cloudera.com> Closes #9972 from srowen/SPARK-11988.
* [SPARK-11994][MLLIB] Word2VecModel load and save cause SparkException when model is bigger than spark.kryoserializer.buffer.max (Antonio Murgia, 2015-12-05, 2 files, -4/+31)
  Author: Antonio Murgia <antonio.murgia2@studio.unibo.it> Closes #9989 from tmnd1991/SPARK-11932.
* [SPARK-12096][MLLIB] remove the old constraint in word2vec (Yuhao Yang, 2015-12-05, 1 file, -2/+2)
  jira: https://issues.apache.org/jira/browse/SPARK-12096 word2vec can now handle a much bigger vocabulary. The old constraint `vocabSize.toLong * vectorSize < Int.MaxValue / 8` should be removed; the new constraint is `vocabSize.toLong * vectorSize < max array length` (usually a little less than Int.MaxValue). I tested with vocabSize over 18M and vectorSize = 100. srowen jkbradley Sorry to miss this in the last PR. I was reminded today. Author: Yuhao Yang <hhbyyh@gmail.com> Closes #10103 from hhbyyh/w2vCapacity.
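  The arithmetic behind the change, as a quick check (not from the patch):

```scala
val vocabSize  = 18000000L  // 18M words, as tested in the PR
val vectorSize = 100L

val product = vocabSize * vectorSize  // 1.8e9 floats
val oldCap  = Int.MaxValue / 8        // ~268M: the old limit rejects this model
val newCap  = Int.MaxValue.toLong     // ~2.1e9: the array-length limit accepts it

assert(product > oldCap && product < newCap)
```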
* [SPARK-12084][CORE] Fix code that uses ByteBuffer.array incorrectly (Shixiong Zhu, 2015-12-04, 22 files, -69/+81)
  `ByteBuffer` doesn't guarantee that all contents of `ByteBuffer.array` are valid; e.g., a ByteBuffer returned by `ByteBuffer.slice` shares a backing array with bytes outside the slice. We should not use the whole content of `ByteBuffer.array` unless we know that's correct. This patch fixes all places that use `ByteBuffer.array` incorrectly. Author: Shixiong Zhu <shixiong@databricks.com> Closes #10083 from zsxwing/bytebuffer-array.
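  The hazard and a safe pattern, sketched (illustrative, not code from the patch):

```scala
import java.nio.ByteBuffer

def toBytes(buffer: ByteBuffer): Array[Byte] = {
  if (buffer.hasArray && buffer.arrayOffset() == 0 &&
      buffer.array().length == buffer.remaining()) {
    // The backing array is exactly this buffer's contents: safe to use as-is.
    buffer.array()
  } else {
    // A slice, offset view, or direct buffer: copy exactly the remaining bytes.
    val bytes = new Array[Byte](buffer.remaining())
    buffer.duplicate().get(bytes)
    bytes
  }
}
```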
* [SPARK-12080][CORE] Kryo - Support multiple user registrators (rotems, 2015-12-04, 2 files, -4/+6)
  Author: rotems <roter> Closes #10078 from Botnaim/KryoMultipleCustomRegistrators.
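  Presumably this lets `spark.kryo.registrator` carry a comma-separated list; a sketch under that assumption (registrator class names are made up):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Multiple registrators, comma-separated (hypothetical class names).
  .set("spark.kryo.registrator",
       "com.example.CoreKryoRegistrator,com.example.MlKryoRegistrator")
```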
* [SPARK-12142][CORE] Reply false when container allocator is not ready and reset target (meiyoula, 2015-12-04, 2 files, -1/+3)
  With the Dynamic Allocation function, when a new AM is starting, ExecutorAllocationManager sends a RequestExecutor message to the AM. If the container allocator is not ready, the whole app will hang. Author: meiyoula <1039320815@qq.com> Closes #10138 from XuTingjun/patch-1.
* [SPARK-12112][BUILD] Upgrade to SBT 0.13.9 (Josh Rosen, 2015-12-05, 20 files, -48/+47)
  We should upgrade to SBT 0.13.9, since this is a requirement in order to use SBT's new Maven-style resolution features (which will be done in a separate patch, because it's blocked by some binary compatibility issues in the POM reader plugin). I also upgraded Scalastyle to version 0.8.0, which was necessary in order to fix a Scala 2.10.5 compatibility issue (see https://github.com/scalastyle/scalastyle/issues/156). The newer Scalastyle is slightly stricter about whitespace surrounding tokens, so I fixed the new style violations. Author: Josh Rosen <joshrosen@databricks.com> Closes #10112 from JoshRosen/upgrade-to-sbt-0.13.9.
* [SPARK-11314][BUILD][HOTFIX] Add exclusion for moved YARN classes. (Marcelo Vanzin, 2015-12-04, 1 file, -1/+4)
  Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #10147 from vanzin/SPARK-11314.
* [SPARK-12058][STREAMING][KINESIS][TESTS] fix Kinesis python tests (Burak Yavuz, 2015-12-04, 5 files, -50/+115)
  Python tests require access to the `KinesisTestUtils` file. When this file lives under src/test, Python can't access it, since it is not available in the assembly jar. However, if we move KinesisTestUtils to src/main, we need to add the Kinesis Producer Library (KPL) as a dependency. To avoid this, I moved KinesisTestUtils to src/main and extended it with ExtendedKinesisTestUtils, which lives under src/test and adds support for the KPL. cc zsxwing tdas Author: Burak Yavuz <brkyvz@gmail.com> Closes #10050 from brkyvz/kinesis-py.
* [SPARK-6990][BUILD] Add Java linting script; fix minor warnings (Dmitry Erastov, 2015-12-04, 31 files, -70/+368)
  This replaces https://github.com/apache/spark/pull/9696. Invoke Checkstyle and print any errors to the console, failing the step. Use Google's style rules modified according to https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide. Some important checks are disabled (see TODOs in `checkstyle.xml`) due to multiple violations being present in the codebase; suggest fixing those TODOs in separate PR(s). More on Checkstyle can be found on the [official website](http://checkstyle.sourceforge.net/).
  Sample output (from [build 46345](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/46345/consoleFull)), duplicated because the build runs twice with different profiles:
  > Checkstyle checks failed at following occurrences: [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/UnsafeRowParquetRecordReader.java:[217,7] (coding) MissingSwitchDefault: switch without "default" clause.
  > [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[198,10] (modifier) ModifierOrder: 'protected' modifier out of order with the JLS suggestions.
  > [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/UnsafeRowParquetRecordReader.java:[217,7] (coding) MissingSwitchDefault: switch without "default" clause.
  > [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[198,10] (modifier) ModifierOrder: 'protected' modifier out of order with the JLS suggestions.
  > [error] running /home/jenkins/workspace/SparkPullRequestBuilder2/dev/lint-java ; received return code 1
  Also fix some of the minor violations that didn't require sweeping changes. Apologies for the previous botched PRs - I finally figured out the issue. cr: JoshRosen, pwendell
  > I state that the contribution is my original work, and I license the work to the project under the project's open source license.
  Author: Dmitry Erastov <derastov@gmail.com> Closes #9867 from dskrvk/master.
* [SPARK-12089] [SQL] Fix memory corruption due to freeing a page being referenced (Nong, 2015-12-04, 1 file, -2/+5)
  When the spillable sort iterator was spilled, it was mistakenly keeping the last page in memory rather than the current page. This caused the current record to get corrupted. Author: Nong <nong@cloudera.com> Closes #10142 from nongli/spark-12089.
* Add links on how to set up IDEs for developing Spark (kaklakariada, 2015-12-04, 1 file, -0/+2)
  These links make it easier for new developers to work with Spark in their IDE. Author: kaklakariada <kaklakariada@users.noreply.github.com> Closes #10104 from kaklakariada/readme-developing-ide-gettting-started.
* [SPARK-12122][STREAMING] Prevent batches from being submitted twice after recovering StreamingContext from checkpoint (Tathagata Das, 2015-12-04, 1 file, -1/+2)
  Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #10127 from tdas/SPARK-12122.
* [SPARK-12104][SPARKR] collect() does not handle multiple columns with same name. (Sun Rui, 2015-12-03, 2 files, -4/+10)
  Author: Sun Rui <rui.sun@intel.com> Closes #10118 from sun-rui/SPARK-12104.
* [SPARK-11206] Support SQL UI on the history server (resubmit) (Carson Wang, 2015-12-03, 21 files, -135/+329)
  Resubmit of #9297 and #9991. On the live web UI, there is a SQL tab which provides valuable information about a SQL query, but once the workload is finished, the SQL tab is not available on the history server. It would be helpful to support the SQL UI on the history server so the query can be analyzed even after its execution. To support this:
  1. Added an onOtherEvent method to the SparkListener trait, and post all SQL-related events to the same event bus.
  2. Two SQL events, SparkListenerSQLExecutionStart and SparkListenerSQLExecutionEnd, are defined in the sql module.
  3. The new SQL events are written to the event log using Jackson.
  4. A new trait SparkHistoryListenerFactory is added to allow the history server to feed events to the SQL history listener. The SQL implementation is loaded at runtime using java.util.ServiceLoader.
  Author: Carson Wang <carson.wang@intel.com> Closes #10061 from carsonwang/SqlHistoryUI.
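  A sketch of the catch-all hook described in item 1 (only the onOtherEvent callback comes from the description above; the listener body is illustrative):

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent}

// Any listener can now receive events it has no dedicated callback for,
// such as the SQL execution events posted on the shared event bus.
class SqlHistoryListener extends SparkListener {
  override def onOtherEvent(event: SparkListenerEvent): Unit = {
    // A real implementation would match on SparkListenerSQLExecutionStart
    // and SparkListenerSQLExecutionEnd from the sql module.
    println(s"received: ${event.getClass.getSimpleName}")
  }
}
```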
* [SPARK-12056][CORE] Create a TaskAttemptContext only after calling setConf. (Anderson de Andrade, 2015-12-03, 1 file, -2/+2)
  TaskAttemptContext's constructor clones the configuration instead of referencing it, so calling setConf after creating the TaskAttemptContext makes any changes to the configuration made inside setConf invisible to RecordReader instances. As an example, Titan's InputFormat changes conf when calling setConf: it wraps Cassandra's ColumnFamilyInputFormat and appends Cassandra's configuration. This change fixes the following error when using Titan's CassandraInputFormat with Spark: *java.lang.RuntimeException: org.apache.thrift.protocol.TProtocolException: Required field 'keyspace' was not present! Struct: set_keyspace_args(keyspace:null)* There's a discussion of this error here: https://groups.google.com/forum/#!topic/aureliusgraphs/4zpwyrYbGAE Author: Anderson de Andrade <adeandrade@verticalscope.com> Closes #10046 from adeandrade/newhadooprdd-fix.
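  The ordering issue, sketched (the Hadoop types are real; `conf`, `attemptId` and `inputFormat` are assumed to be in scope):

```scala
import org.apache.hadoop.conf.Configurable
import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl

// Wrong order: the context clones conf *before* setConf mutates it, so the
// RecordReader created from the context never sees those changes.
//   val context = new TaskAttemptContextImpl(conf, attemptId)
//   inputFormat match { case c: Configurable => c.setConf(conf); case _ => }

// Right order: let the InputFormat amend conf first, then clone it.
inputFormat match {
  case c: Configurable => c.setConf(conf)
  case _ =>
}
val context = new TaskAttemptContextImpl(conf, attemptId)
```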
* [SPARK-12019][SPARKR] Support character vector for sparkR.init(), check param and fix doc and add tests. (felixcheung, 2015-12-03, 5 files, -21/+79)
  Spark submit expects a comma-separated list. Author: felixcheung <felixcheung_m@hotmail.com> Closes #10034 from felixcheung/sparkrinitdoc.
* [FLAKY-TEST-FIX][STREAMING][TEST] Make sure StreamingContexts are shutdown after test (Tathagata Das, 2015-12-03, 1 file, -61/+61)
  Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #10124 from tdas/InputStreamSuite-flaky-test.
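  The standard cleanup pattern for such tests, as a generic sketch (not the code from the patch):

```scala
import org.apache.spark.streaming.StreamingContext

def withStreamingContext[T](ssc: StreamingContext)(body: StreamingContext => T): T = {
  try body(ssc)
  finally ssc.stop(stopSparkContext = true)  // release resources even if the test fails
}
```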
* [SPARK-12107][EC2] Update spark-ec2 versions (Nicholas Chammas, 2015-12-03, 1 file, -3/+9)
  I haven't created a JIRA. If we absolutely need one I'll do it, but I'm fine with not getting mentioned in the release notes if that's the only purpose it'll serve. cc marmbrus: we should include this in 1.6-RC2 if there is one. I can open a second PR against branch-1.6 if necessary. Author: Nicholas Chammas <nicholas.chammas@gmail.com> Closes #10109 from nchammas/spark-ec2-versions.
* [MINOR][ML] Use coefficients to replace weights (Yanbo Liang, 2015-12-03, 2 files, -2/+2)
  Use ```coefficients``` to replace ```weights```; I hope these are the last two occurrences. mengxr Author: Yanbo Liang <ybliang8@gmail.com> Closes #10065 from yanboliang/coefficients.