...
* [SPARK-11243][SQL] zero out padding bytes in UnsafeRow (Davies Liu, 2015-10-23, 2 files changed, -5/+35)
  For nested StructType, the underlying buffer could have been used for other data before, so we should zero out the padding bytes for primitive types that take less than 8 bytes. cc cloud-fan Author: Davies Liu <davies@databricks.com> Closes #9217 from davies/zero_out.
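  Not from the patch itself, but a minimal sketch of the idea with hypothetical names: each fixed-width field occupies an 8-byte word, so a value narrower than 8 bytes must clear the whole word before writing, otherwise stale bytes from a reused buffer survive in the padding.
  ```scala
  // Hypothetical sketch (not Spark's UnsafeRowWriter): clear the full 8-byte word
  // before writing a 4-byte int so reused-buffer bytes cannot leak into the padding.
  def writeInt(buffer: Array[Byte], wordOffset: Int, value: Int): Unit = {
    java.util.Arrays.fill(buffer, wordOffset, wordOffset + 8, 0.toByte) // zero out padding bytes
    var i = 0
    while (i < 4) { // little-endian write of the 4 value bytes
      buffer(wordOffset + i) = ((value >>> (8 * i)) & 0xFF).toByte
      i += 1
    }
  }
  ```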
* Fix typo "Received" to "Receiver" in streaming-kafka-integration.mdRohan Bhanderi2015-10-231-1/+1
| | | | | | | | Removed typo on line 8 in markdown : "Received" -> "Receiver" Author: Rohan Bhanderi <rohan.bhanderi@sjsu.edu> Closes #9242 from RohanBhanderi/patch-1.
* [SPARK-11273][SQL] Move ArrayData/MapData/DataTypeParser to catalyst.util package (Reynold Xin, 2015-10-23, 50 files changed, -48/+76)
  Author: Reynold Xin <rxin@databricks.com> Closes #9239 from rxin/types-private.
* Fix a (very tiny) typo (Jacek Laskowski, 2015-10-22, 1 file changed, -1/+1)
  Author: Jacek Laskowski <jacek.laskowski@deepsense.io> Closes #9230 from jaceklaskowski/utils-seconds-typo.
* [SPARK-11134][CORE] Increase LauncherBackendSuite timeout. (Marcelo Vanzin, 2015-10-22, 1 file changed, -2/+2)
  This test can take a little while to finish on slow / loaded machines. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #9235 from vanzin/SPARK-11134.
* [SPARK-11098][CORE] Add Outbox to cache the sending messages to resolve the message disorder issue (zsxwing, 2015-10-22, 2 files changed, -57/+310)
  The current NettyRpc has a message ordering issue because it uses a thread pool to send messages. E.g., when running the following two lines in the same thread,
  ```
  ref.send("A")
  ref.send("B")
  ```
  the remote endpoint may see "B" before "A" because "A" and "B" are sent in parallel. To resolve this issue, this PR adds an outbox for each connection; if the connection to the remote node is still being established when a message is sent, the message is cached in the outbox, and the cached messages are sent one by one once the connection is established. Author: zsxwing <zsxwing@gmail.com> Closes #9197 from zsxwing/rpc-outbox.
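  A minimal sketch of the per-connection outbox idea (hypothetical class and method names, not the actual NettyRpc code): messages sent before the connection is ready are buffered in FIFO order and drained once it is established, so ordering is preserved.
  ```scala
  import scala.collection.mutable

  // Sketch only: buffer messages until a connection-backed sender is available,
  // then drain the buffer in arrival order so "A" is always delivered before "B".
  class Outbox[T] {
    private val pending = mutable.Queue.empty[T]
    private var sender: Option[T => Unit] = None

    def send(msg: T): Unit = synchronized {
      sender match {
        case Some(s) => s(msg)               // connection established: send right away
        case None    => pending.enqueue(msg) // still connecting: keep the message in order
      }
    }

    def onConnected(s: T => Unit): Unit = synchronized {
      while (pending.nonEmpty) s(pending.dequeue()) // flush buffered messages in FIFO order
      sender = Some(s)
    }
  }
  ```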
* [SPARK-11251] Fix page size calculation in local mode (Andrew Or, 2015-10-22, 3 files changed, -15/+40)
  ```
  // My machine only has 8 cores
  $ bin/spark-shell --master local[32]

  scala> val df = sc.parallelize(Seq((1, 1), (2, 2))).toDF("a", "b")
  scala> df.as("x").join(df.as("y"), $"x.a" === $"y.a").count()

  Caused by: java.io.IOException: Unable to acquire 2097152 bytes of memory
    at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:351)
  ```
  Author: Andrew Or <andrew@databricks.com> Closes #9209 from andrewor14/fix-local-page-size.
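  An illustrative sketch of the kind of heuristic involved (the constants and bounds below are assumptions, not Spark's exact formula): the page size is derived from the execution memory divided by the number of task slots actually in use, so `local[32]` on an 8-core machine has to count 32 slots, not 8.
  ```scala
  // Illustrative only: rough page-size heuristic with assumed bounds and safety factor.
  def defaultPageSize(maxExecutionMemory: Long, numCores: Int): Long = {
    val safetyFactor = 16L                                   // assumed head-room factor
    val perTask = maxExecutionMemory / math.max(1, numCores) / safetyFactor
    val minPage = 1L << 20                                   // assumed 1 MB lower bound
    val maxPage = 64L << 20                                  // assumed 64 MB upper bound
    math.min(maxPage, math.max(minPage, java.lang.Long.highestOneBit(perTask)))
  }

  // With 1 GB of execution memory, counting 32 slots yields much smaller pages
  // than counting the 8 physical cores, which is the point of the fix.
  println(defaultPageSize(1L << 30, 32))
  println(defaultPageSize(1L << 30, 8))
  ```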
* [SPARK-7021] Add JUnit output for Python unit tests (Gábor Lipták, 2015-10-22, 5 files changed, -9/+48)
  WIP Author: Gábor Lipták <gliptak@gmail.com> Closes #8323 from gliptak/SPARK-7021.
* [SPARK-11116][SQL] First Draft of Dataset API (Michael Armbrust, 2015-10-22, 26 files changed, -23/+1501)
  *This PR adds a new experimental API to Spark, tentatively named Datasets.*

  A `Dataset` is a strongly-typed collection of objects that can be transformed in parallel using functional or relational operations. Example usage is as follows:

  ### Functional
  ```scala
  > val ds: Dataset[Int] = Seq(1, 2, 3).toDS()
  > ds.filter(_ % 1 == 0).collect()
  res1: Array[Int] = Array(1, 2, 3)
  ```

  ### Relational
  ```scala
  scala> ds.toDF().show()
  +-----+
  |value|
  +-----+
  |    1|
  |    2|
  |    3|
  +-----+

  > ds.select(expr("value + 1").as[Int]).collect()
  res11: Array[Int] = Array(2, 3, 4)
  ```

  ## Comparison to RDDs
  A `Dataset` differs from an `RDD` in the following ways:
  - The creation of a `Dataset` requires the presence of an explicit `Encoder` that can be used to serialize the object into a binary format. Encoders are also capable of mapping the schema of a given object to the Spark SQL type system. In contrast, RDDs rely on runtime reflection based serialization.
  - Internally, a `Dataset` is represented by a Catalyst logical plan and the data is stored in the encoded form. This representation allows for additional logical operations and enables many operations (sorting, shuffling, etc.) to be performed without deserializing to an object.

  A `Dataset` can be converted to an `RDD` by calling the `.rdd` method.

  ## Comparison to DataFrames
  A `Dataset` can be thought of as a specialized DataFrame, where the elements map to a specific JVM object type, instead of to a generic `Row` container. A DataFrame can be transformed into a specific Dataset by calling `df.as[ElementType]`. Similarly, you can transform a strongly-typed `Dataset` into a generic DataFrame by calling `ds.toDF()`.

  ## Implementation Status and TODOs
  This is a rough cut at the least controversial parts of the API. The primary purpose here is to get something committed so that we can better parallelize further work and get early feedback on the API. The following is being deferred to future PRs:
  - Joins and Aggregations (prototype here https://github.com/apache/spark/commit/f11f91e6f08c8cf389b8388b626cd29eec32d937)
  - Support for Java

  Additionally, the responsibility for binding an encoder to a given schema is currently done in a fairly ad-hoc fashion. This is an internal detail, and what we are doing today works for the cases we care about. However, as we add more APIs we'll probably need to do this in a more principled way (i.e. separate resolution from binding as we do in DataFrames).

  ## COMPATIBILITY NOTE
  Long term we plan to make `DataFrame` extend `Dataset[Row]`. However, making this change to the class hierarchy would break the function signatures for the existing functional operations (map, flatMap, etc). As such, this class should be considered a preview of the final API. Changes will be made to the interface after Spark 1.6.

  Author: Michael Armbrust <michael@databricks.com> Closes #9190 from marmbrus/dataset-infra.
* [SPARK-11242][SQL] In conf/spark-env.sh.template SPARK_DRIVER_MEMORY is documented incorrectly (guoxi, 2015-10-22, 1 file changed, -4/+4)
  Minor fix on the comment. Author: guoxi <guoxi@us.ibm.com> Closes #9201 from xguo27/SPARK-11242.
* [SPARK-9735][SQL] Respect the user-specified schema over the inferred partition schema for HadoopFsRelation (Cheng Hao, 2015-10-22, 2 files changed, -16/+55)
  To enable the unit test `hadoopFsRelationSuite.Partition column type casting`. It previously threw an exception like the one below, because the automatically inferred partition schema was treated with higher priority than the user-specified one.
  ```
  java.lang.ClassCastException: java.lang.Integer cannot be cast to org.apache.spark.unsafe.types.UTF8String
    at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:45)
    at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getUTF8String(rows.scala:220)
    at org.apache.spark.sql.catalyst.expressions.JoinedRow.getUTF8String(JoinedRow.scala:102)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(generated.java:62)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$17$$anonfun$apply$9.apply(DataSourceStrategy.scala:212)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$17$$anonfun$apply$9.apply(DataSourceStrategy.scala:212)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
    at scala.collection.AbstractIterator.to(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:903)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:903)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1846)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1846)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:88)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
  07:44:01.344 ERROR org.apache.spark.executor.Executor: Exception in task 14.0 in stage 3.0 (TID 206)
  java.lang.ClassCastException: java.lang.Integer cannot be cast to org.apache.spark.unsafe.types.UTF8String
    ... (same stack trace as above)
  ```
  Author: Cheng Hao <hao.cheng@intel.com> Closes #8026 from chenghao-intel/partition_discovery.
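  A small usage sketch (hypothetical path and column names, not from the patch) of the behavior being fixed: when the reader is given an explicit schema, the types in it should win over whatever partition discovery infers from the directory names.
  ```scala
  import org.apache.spark.sql.types.{StringType, StructField, StructType}

  // Keep the partition column `part` as a string even if its directory values
  // (e.g. part=1, part=2) would otherwise be inferred as integers.
  val userSchema = StructType(Seq(
    StructField("value", StringType),
    StructField("part", StringType)))

  // sqlContext is the SQLContext available in spark-shell.
  val df = sqlContext.read.schema(userSchema).parquet("/tmp/partitioned_table")
  df.printSchema()
  ```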
* [SPARK-11163] Remove unnecessary addPendingTask calls. (Kay Ousterhout, 2015-10-22, 1 file changed, -22/+5)
  This commit removes unnecessary calls to addPendingTask in TaskSetManager.executorLost. These calls are unnecessary: for tasks that are still pending and haven't been launched, they're still in all of the correct pending lists, so calling addPendingTask has no effect. For tasks that are currently running (which may still be in the pending lists, depending on how they were scheduled), we call addPendingTask in handleFailedTask, so the calls at the beginning of executorLost are redundant. I think these calls are left over from when we re-computed the locality levels in addPendingTask; now that we call recomputeLocality separately, I don't think these are necessary. Now that those calls are removed, the `readding` parameter in addPendingTask is no longer necessary, so this commit also removes that parameter. markhamstra can you take a look at this? cc vanzin Author: Kay Ousterhout <kayousterhout@gmail.com> Closes #9154 from kayousterhout/SPARK-11163.
* [SPARK-11232][CORE] Use 'offer' instead of 'put' to make sure calling send won't be interrupted (zsxwing, 2015-10-22, 1 file changed, -5/+5)
  The current `NettyRpcEndpointRef.send` can be interrupted because it uses `LinkedBlockingQueue.put`, which may hang the application. Imagine the following execution order:

  |   | thread 1: TaskRunner.kill | thread 2: TaskRunner.run |
  |---|---------------------------|--------------------------|
  | 1 | killed = true | |
  | 2 | | if (killed) { |
  | 3 | | throw new TaskKilledException |
  | 4 | | case _: TaskKilledException \| _: InterruptedException if task.killed => |
  | 5 | task.kill(interruptThread): interruptThread is true | |
  | 6 | | execBackend.statusUpdate(taskId, TaskState.KILLED, ser.serialize(TaskKilled)) |
  | 7 | | localEndpoint.send(StatusUpdate(taskId, state, serializedData)): in LocalBackend |

  Then `localEndpoint.send(StatusUpdate(taskId, state, serializedData))` will throw `InterruptedException`. This will prevent the executor from updating the task status and hang the application. A failure caused by the above issue: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44062/consoleFull Since `receivers` is an unbounded `LinkedBlockingQueue`, we can just use `LinkedBlockingQueue.offer` to resolve this issue. Author: zsxwing <zsxwing@gmail.com> Closes #9198 from zsxwing/dont-interrupt-send.
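  A standalone sketch of the java.util.concurrent behavior being relied on (plain JDK semantics, not Spark code): `put` acquires its lock interruptibly and so can throw `InterruptedException` even on an unbounded queue, while the non-blocking `offer` does not, which makes it safe to call from a thread that may already have been interrupted by a task kill.
  ```scala
  import java.util.concurrent.LinkedBlockingQueue

  // Unbounded queue: offer always succeeds immediately and never throws
  // InterruptedException, so the status update cannot be lost to a kill.
  val receivers = new LinkedBlockingQueue[String]()

  Thread.currentThread().interrupt()        // simulate a task-kill interrupt
  receivers.offer("StatusUpdate(KILLED)")   // still enqueued despite the interrupt flag
  // receivers.put("StatusUpdate(KILLED)")  // would throw InterruptedException here
  println(receivers.size)                   // 1
  ```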
* [SPARK-11216][SQL][FOLLOW-UP] add encoder/decoder for external row (Wenchen Fan, 2015-10-22, 4 files changed, -16/+17)
  address comments in https://github.com/apache/spark/pull/9184 Author: Wenchen Fan <wenchen@databricks.com> Closes #9212 from cloud-fan/encoder.
* [SPARK-10708] Consolidate sort shuffle implementations (Josh Rosen, 2015-10-22, 30 files changed, -1317/+456)
  There's a lot of duplication between SortShuffleManager and UnsafeShuffleManager. Given that these now provide the same set of functionality (now that UnsafeShuffleManager supports large records), I think we should replace SortShuffleManager's serialized shuffle implementation with UnsafeShuffleManager's and merge the two managers together. Author: Josh Rosen <joshrosen@databricks.com> Closes #8829 from JoshRosen/consolidate-sort-shuffle-implementations.
* [SPARK-11244][SPARKR] sparkR.stop() should remove SQLContext (Forest Fang, 2015-10-22, 2 files changed, -0/+18)
  SparkR should remove `.sparkRSQLsc` and `.sparkRHivesc` when `sparkR.stop()` is called. Otherwise, even when the SparkContext is reinitialized, `sparkRSQL.init` returns the stale copy of the object and complains:
  ```r
  sc <- sparkR.init("local")
  sqlContext <- sparkRSQL.init(sc)
  sparkR.stop()
  sc <- sparkR.init("local")
  sqlContext <- sparkRSQL.init(sc)
  sqlContext
  ```
  producing
  ```r
  Error in callJMethod(x, "getClass") :
    Invalid jobj 1. If SparkR was restarted, Spark operations need to be re-executed.
  ```
  I have added the check and removal only when the SparkContext itself is initialized. I have also added a corresponding test for this fix. Let me know if you want me to move the test to the SQL test suite instead. p.s. I tried lint-r but ended up with lots of errors on the existing code. Author: Forest Fang <forest.fang@outlook.com> Closes #9205 from saurfang/sparkR.stop.
* [SPARK-11121][CORE] Correct the TaskLocation type (zhichao.li, 2015-10-22, 2 files changed, -4/+9)
  Correct the logic to return an `HDFSCacheTaskLocation` instance when the input `str` is an in-memory location. Author: zhichao.li <zhichao.li@intel.com> Closes #9096 from zhichao-li/uselessBranch.
* [SPARK-11243][SQL] output UnsafeRow from columnar cache (Davies Liu, 2015-10-21, 7 files changed, -132/+291)
  This PR changes InMemoryTableScan to output UnsafeRow, and optimizes the unrolling and scanning by copying the bytes for var-length types between UnsafeRow and ByteBuffer directly, without creating the wrapper objects. When scanning the decimals in the TPC-DS store_sales table, it's 80% faster (copying them as longs without creating Decimal objects). Author: Davies Liu <davies@databricks.com> Closes #9203 from davies/unsafe_cache.
* [SPARK-9392][SQL] Dataframe drop should work on unresolved columns (Yanbo Liang, 2015-10-21, 2 files changed, -4/+9)
  Dataframe drop should work on unresolved columns. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8821 from yanboliang/spark-9392.
* Minor cleanup of ShuffleMapStage.outputLocs code. (Reynold Xin, 2015-10-21, 4 files changed, -20/+39)
  I was looking at this code and found the documentation to be insufficient. I added more documentation and refactored some relevant code paths slightly to improve encapsulation. There is more that I want to do, but I want to get these changes in before doing more work. My goal is to reduce the direct exposure of internal fields in ShuffleMapStage to improve encapsulation. After this change, DAGScheduler no longer writes outputLocs directly. There are still 3 places that read outputLocs directly, but we can change those later. Author: Reynold Xin <rxin@databricks.com> Closes #9175 from rxin/stage-cleanup.
* [SPARK-10151][SQL] Support invocation of hive macro (navis.ryu, 2015-10-21, 28 files changed, -8/+28)
  A macro in Hive (GenericUDFMacro) contains the real function inside of it, but that function is not conveyed to tasks, resulting in a null-pointer exception. Author: navis.ryu <navis@apache.org> Closes #8354 from navis/SPARK-10151.
* [SPARK-8654][SQL] Analysis exception when using NULL IN (...): invalid cast (Dilip Biswal, 2015-10-21, 3 files changed, -3/+32)
  In the analysis phase, while processing the rules for the IN predicate, we compare the in-list types to the LHS expression type and generate a cast operation if necessary. In the case of NULL [NOT] IN expr1, we end up generating a cast from the in-list types to NULL, like cast(1 as NULL), which is not a valid cast. The fix is to find a common type between the LHS and RHS expressions and cast all the expressions to that common type. Author: Dilip Biswal <dbiswal@us.ibm.com> This patch had conflicts when merged, resolved by Committer: Michael Armbrust <michael@databricks.com> Closes #9036 from dilipbiswal/spark_8654_new.
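  A tiny repro-style sketch (hypothetical table name, not from the patch) of the query shape that used to fail analysis; after the fix the literals and the NULL are coerced to a common type instead of producing cast(1 as NULL).
  ```scala
  // Before the fix this kind of predicate failed in analysis with an invalid cast;
  // afterwards it simply returns no rows, since NULL IN (...) is never true.
  sqlContext.sql("SELECT * FROM src WHERE NULL IN (1, 2, 3)").show()
  sqlContext.sql("SELECT * FROM src WHERE NULL NOT IN (1, 2, 3)").show()
  ```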
* [SPARK-11233][SQL] register cosh in function registry (Shagun Sodhani, 2015-10-21, 1 file changed, -0/+1)
  Author: Shagun Sodhani <sshagunsodhani@gmail.com> Closes #9199 from shagunsodhani/proposed-fix-#11233.
* [SPARK-11208][SQL] Filter out 'hive.metastore.rawstore.impl' from executionHive temporary config (Artem Aliev, 2015-10-21, 1 file changed, -1/+2)
  The executionHive is assumed to be a standard metastore located in a temporary directory as a Derby DB. But hive.metastore.rawstore.impl was not filtered out, so any custom implementation of the metastore with other storage properties (not JDO) would persist those temporary functions. CassandraHiveMetaStore from DataStax Enterprise is one example. Author: Artem Aliev <artem.aliev@datastax.com> Closes #9178 from artem-aliev/SPARK-11208.
* [SPARK-9740][SPARK-9592][SPARK-9210][SQL] Change the default behavior of First/Last to RESPECT NULLS. (Yin Huai, 2015-10-21, 7 files changed, -45/+219)
  I am changing the default behavior of `First`/`Last` to respect null values (the SQL standard default behavior). https://issues.apache.org/jira/browse/SPARK-9740 Author: Yin Huai <yhuai@databricks.com> Closes #8113 from yhuai/firstLast.
* [SPARK-11197][SQL] run SQL on files directly (Davies Liu, 2015-10-21, 10 files changed, -9/+91)
  This PR introduces a new feature to run SQL directly on files without creating a table, for example:
  ```
  select id from json.`path/to/json/files` as j
  ```
  Author: Davies Liu <davies@databricks.com> Closes #9173 from davies/source.
* [SPARK-10743][SQL] keep the name of the expression if possible when doing a cast (Wenchen Fan, 2015-10-21, 4 files changed, -25/+23)
  Author: Wenchen Fan <cloud0fan@163.com> Closes #8859 from cloud-fan/cast.
* [SPARK-10534] [SQL] ORDER BY clause allows only columns that are present in the select projection list (Dilip Biswal, 2015-10-21, 2 files changed, -1/+11)
  Find the missing attributes by recursively looking at the sort order expressions; the rest of the code takes care of projecting them out. Added description from cloud-fan: I want to explain a bit more about this bug. When we resolve the sort ordering, we use a special method which only resolves UnresolvedAttribute and UnresolvedExtractValue. However, for something like Floor('a), even if 'a is resolved, the floor expression may still be unresolved due to a data type mismatch (for example, 'a is string type and Floor needs double type), so it can't pass this filter and we can't push down the missing attribute 'a. Author: Dilip Biswal <dbiswal@us.ibm.com> Closes #9123 from dilipbiswal/SPARK-10534.
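  A minimal illustration (hypothetical table and column names) of the query shape this enables: the sort expression references a column that is not in the projection, so the analyzer has to add it as a missing attribute and project it away after the sort.
  ```scala
  // `age` appears only inside the ORDER BY expression, not in the SELECT list.
  sqlContext.sql("SELECT name FROM people ORDER BY floor(age)").show()
  ```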
* [SPARK-11216] [SQL] add encoder/decoder for external row (Wenchen Fan, 2015-10-21, 9 files changed, -54/+459)
  Implement encode/decode for external row based on `ClassEncoder`. TODO:
  * code cleanup
  * ~~fix corner cases~~
  * refactor the encoder interface
  * improve test for product codegen, to cover more corner cases
  Author: Wenchen Fan <wenchen@databricks.com> Closes #9184 from cloud-fan/encoder.
* [SPARK-11179] [SQL] Push filters through aggregate (nitin goyal, 2015-10-21, 2 files changed, -0/+69)
  Push conjunctive predicates through Aggregate operators when their references are a subset of the groupingExpressions.

  Query plan before optimisation:
  ```
  Filter ((c#138L = 2) && (a#0 = 3))
   Aggregate [a#0], [a#0,count(b#1) AS c#138L]
    Project [a#0,b#1]
     LocalRelation [a#0,b#1,c#2]
  ```
  Query plan after optimisation:
  ```
  Filter (c#138L = 2)
   Aggregate [a#0], [a#0,count(b#1) AS c#138L]
    Filter (a#0 = 3)
     Project [a#0,b#1]
      LocalRelation [a#0,b#1,c#2]
  ```
  Author: nitin goyal <nitin.goyal@guavus.com> Author: nitin.goyal <nitin.goyal@guavus.com> Closes #9167 from nitin2goyal/master.
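  A DataFrame-level sketch of the same query (the data and column names a, b, c are assumed from the plans above): the predicate on the grouping column can be evaluated below the aggregate, while the predicate on the aggregated alias has to stay above it.
  ```scala
  import org.apache.spark.sql.functions.count
  import sqlContext.implicits._

  val df = Seq((3, 1), (3, 2), (4, 5)).toDF("a", "b")

  // The predicate on grouping column `a` can be pushed below the Aggregate;
  // the predicate on the aggregate alias `c` cannot.
  val agg = df.groupBy("a").agg(count("b").as("c"))
  agg.filter($"c" === 2 && $"a" === 3).explain()
  ```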
* [SPARK-11037][SQL] using Option instead of Some in JdbcDialects (Pravin Gadakh, 2015-10-21, 1 file changed, -8/+8)
  Using Option instead of Some in the getCatalystType method. Author: Pravin Gadakh <prgadakh@in.ibm.com> Closes #9195 from pravingadakh/master.
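  For context, a sketch of a custom dialect (the VARCHAR mapping and dialect name are hypothetical; the signature shown is the one exposed by `org.apache.spark.sql.jdbc.JdbcDialect` at the time): returning `Option` lets a dialect answer "no opinion" with `None` instead of wrapping every result in `Some`.
  ```scala
  import java.sql.Types
  import org.apache.spark.sql.jdbc.JdbcDialect
  import org.apache.spark.sql.types.{DataType, MetadataBuilder, StringType}

  // Hypothetical dialect: map VARCHAR to StringType, defer to the default
  // mapping (None) for everything else.
  object MyDialect extends JdbcDialect {
    override def canHandle(url: String): Boolean = url.startsWith("jdbc:mydb")
    override def getCatalystType(
        sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] =
      if (sqlType == Types.VARCHAR) Option(StringType) else None
  }
  ```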
* [SPARK-11205][PYSPARK] Delegate to scala DataFrame API rather than print in python (Jeff Zhang, 2015-10-20, 1 file changed, -1/+2)
  No test needed. Verified manually in the pyspark shell. Author: Jeff Zhang <zjffdu@apache.org> Closes #9177 from zjffdu/SPARK-11205.
* [SPARK-11221][SPARKR] fix R doc for lit and add examples (felixcheung, 2015-10-20, 1 file changed, -4/+9)
  Currently the documentation for `lit` is inconsistent with the doc format, references "Scala symbol", and has no example. Fixing that. shivaram Author: felixcheung <felixcheung_m@hotmail.com> Closes #9187 from felixcheung/rlit.
* [MINOR][ML] fix doc warnings (Xiangrui Meng, 2015-10-20, 1 file changed, -0/+1)
  Without an empty line, sphinx will treat doctest as docstring. holdenk
  ~~~
  /Users/meng/src/spark/python/pyspark/ml/feature.py:docstring of pyspark.ml.feature.CountVectorizer:3: ERROR: Undefined substitution referenced: "label|raw |vectors | +-----+---------------+-------------------------+ |0 |[a, b, c] |(3,[0,1,2],[1.0,1.0,1.0])".
  /Users/meng/src/spark/python/pyspark/ml/feature.py:docstring of pyspark.ml.feature.CountVectorizer:3: ERROR: Undefined substitution referenced: "1 |[a, b, b, c, a]|(3,[0,1,2],[2.0,2.0,1.0])".
  ~~~
  Author: Xiangrui Meng <meng@databricks.com> Closes #9188 from mengxr/py-count-vec-doc-fix.
* [SPARK-10082][MLLIB] minor style updates for matrix indexing after #8271 (Xiangrui Meng, 2015-10-20, 2 files changed, -8/+8)
  * `>=0` => `>= 0`
  * print `i`, `j` in the log message
  MechCoder Author: Xiangrui Meng <meng@databricks.com> Closes #9189 from mengxr/SPARK-10082.
* [SPARK-11153][SQL] Disables Parquet filter push-down for string and binary columns (Cheng Lian, 2015-10-21, 2 files changed, -2/+31)
  Due to PARQUET-251, `BINARY` columns in existing Parquet files may be written with corrupted statistics information. This information is used by filter push-down optimization. Since Spark 1.5 turns on Parquet filter push-down by default, we may end up with wrong query results. PARQUET-251 has been fixed in parquet-mr 1.8.1, but Spark 1.5 is still using 1.7.0. This affects all Spark SQL data types that can be mapped to Parquet `BINARY`, namely:
  - `StringType`
  - `BinaryType`
  - `DecimalType` (but Spark SQL doesn't support pushing down filters involving `DecimalType` columns for now)
  To avoid wrong query results, we should disable filter push-down for columns of `StringType` and `BinaryType` until we upgrade to parquet-mr 1.8. Author: Cheng Lian <lian@databricks.com> Closes #9152 from liancheng/spark-11153.workaround-parquet-251. (cherry picked from commit 0887e5e87891e8e22f534ca6d0406daf86ec2dad) Signed-off-by: Cheng Lian <lian@databricks.com>
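  As a heavier-handed workaround (an assumption about how one might apply it manually, not part of this patch), Parquet filter push-down can also be switched off entirely through the existing SQL configuration key:
  ```scala
  // Disable Parquet filter push-down globally until parquet-mr is upgraded.
  sqlContext.setConf("spark.sql.parquet.filterPushdown", "false")
  ```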
* [SPARK-10767][PYSPARK] Make pyspark shared params codegen more consistent (Holden Karau, 2015-10-20, 3 files changed, -65/+65)
  Namely, "." shows up in some places in the template when using the param docstring and not in others. Author: Holden Karau <holden@pigscanfly.ca> Closes #9017 from holdenk/SPARK-10767-Make-pyspark-shared-params-codegen-more-consistent.
* [SPARK-10082][MLLIB] Validate i, j in apply for DenseMatrices and SparseMatrices (MechCoder, 2015-10-20, 2 files changed, -0/+15)
  The given row_ind should be less than the number of rows and the given col_ind should be less than the number of cols. The current code in master gives unpredictable behavior for such cases. Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #8271 from MechCoder/hash_code_matrices.
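  A sketch of the kind of bounds check being added (illustrative, not the exact MLlib code), assuming column-major storage as used by DenseMatrix:
  ```scala
  // Fail fast with a clear message instead of silently indexing the wrong entry.
  def apply(values: Array[Double], numRows: Int, numCols: Int, i: Int, j: Int): Double = {
    require(i >= 0 && i < numRows, s"Expected 0 <= i < $numRows, got i = $i.")
    require(j >= 0 && j < numCols, s"Expected 0 <= j < $numCols, got j = $j.")
    values(i + j * numRows) // column-major layout
  }
  ```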
* [SPARK-10269][PYSPARK][MLLIB] Add @since annotation to pyspark.mllib.classification (noelsmith, 2015-10-20, 1 file changed, -4/+66)
  Duplicated the since decorator from pyspark.sql into pyspark (also tweaked to handle functions without docstrings). Added since to methods + "versionadded::" to classes derived from the file history. Note - some methods are inherited from the regression module (i.e. LinearModel.intercept) so these won't have version numbers in the API docs until that model is updated. Author: noelsmith <mail@noelsmith.com> Closes #8626 from noel-smith/SPARK-10269-since-mlib-classification.
* [SPARK-10261][DOCUMENTATION, ML] Fixed @Since annotation to ml.evaluation (Tijo Thomas, 2015-10-20, 4 files changed, -7/+43)
  Author: Tijo Thomas <tijoparacka@gmail.com> Author: tijo <tijo@ezzoft.com> Closes #8554 from tijoparacka/SPARK-10261-2.
* [SPARK-10272][PYSPARK][MLLIB] Added @since tags to pyspark.mllib.evaluation (noelsmith, 2015-10-20, 1 file changed, -0/+41)
  Duplicated the since decorator from pyspark.sql into pyspark (also tweaked to handle functions without docstrings). Added since to public methods + "versionadded::" to classes (derived from the git file history in pyspark). Note - I also added the tags to MultilabelMetrics even though it isn't declared as public in the __all__ statement... if that's incorrect - I'll remove. Author: noelsmith <mail@noelsmith.com> Closes #8628 from noel-smith/SPARK-10272-since-mllib-evalutation.
* [SPARK-11149] [SQL] Improve cache performance for primitive types (Davies Liu, 2015-10-20, 9 files changed, -122/+265)
  This PR improves the performance by:
  1) Generating an Iterator that takes Iterator[CachedBatch] as input and calls the accessors (unrolling the loop over columns), avoiding the expensive Iterator.flatMap.
  2) Using Unsafe.getInt/getLong/getFloat/getDouble instead of ByteBuffer.getInt/getLong/getFloat/getDouble; the latter actually reads byte by byte.
  3) Removing the unnecessary copy() in Coalesce(), which is not related to the memory cache but was found during benchmarking.
  The following benchmark showed that we can speed up the columnar cache of ints by 2x.
  ```
  path = '/opt/tpcds/store_sales/'
  int_cols = ['ss_sold_date_sk', 'ss_sold_time_sk', 'ss_item_sk','ss_customer_sk']
  df = sqlContext.read.parquet(path).select(int_cols).cache()
  df.count()

  t = time.time()
  print df.select("*")._jdf.queryExecution().toRdd().count()
  print time.time() - t
  ```
  Author: Davies Liu <davies@databricks.com> Closes #9145 from davies/byte_buffer.
* [SPARK-11111] [SQL] fast null-safe join (Davies Liu, 2015-10-20, 5 files changed, -15/+105)
  Currently, we use CartesianProduct for joins with a null-safe-equal condition.
  ```
  scala> sqlContext.sql("select * from t a join t b on (a.i <=> b.i)").explain
  == Physical Plan ==
  TungstenProject [i#2,j#3,i#7,j#8]
   Filter (i#2 <=> i#7)
    CartesianProduct
     LocalTableScan [i#2,j#3], [[1,1]]
     LocalTableScan [i#7,j#8], [[1,1]]
  ```
  Actually, we can express it as an equal-join condition `coalesce(a.i, default) = coalesce(b.i, default)`, so a partitioned join algorithm can be used. After this PR, the plan becomes:
  ```
  >>> sqlContext.sql("select * from a join b ON a.id <=> b.id").explain()
  TungstenProject [id#0L,id#1L]
   Filter (id#0L <=> id#1L)
    SortMergeJoin [coalesce(id#0L,0)], [coalesce(id#1L,0)]
     TungstenSort [coalesce(id#0L,0) ASC], false, 0
      TungstenExchange hashpartitioning(coalesce(id#0L,0),200)
       ConvertToUnsafe
        Scan PhysicalRDD[id#0L]
     TungstenSort [coalesce(id#1L,0) ASC], false, 0
      TungstenExchange hashpartitioning(coalesce(id#1L,0),200)
       ConvertToUnsafe
        Scan PhysicalRDD[id#1L]
  ```
  Author: Davies Liu <davies@databricks.com> Closes #9120 from davies/null_safe.
* [SPARK-6740] [SQL] correctly parse NOT operator with comparison operations (Wenchen Fan, 2015-10-20, 3 files changed, -8/+21)
  We couldn't parse the `NOT` operator with comparison operations like `SELECT NOT TRUE > TRUE`; this PR fixes it. Takes over https://github.com/apache/spark/pull/6326. Author: Wenchen Fan <cloud0fan@outlook.com> Closes #8617 from cloud-fan/not.
* [SPARK-11105] [YARN] Distribute log4j.properties to executors (vundela, 2015-10-20, 2 files changed, -1/+17)
  Currently the log4j.properties file is not uploaded to executors, which leads them to use the default values. This fix makes sure that the file is always uploaded to the distributed cache so that executors will use the latest settings. If the user specifies log configurations through --files, then executors will pick the configs from --files instead of $SPARK_CONF_DIR/log4j.properties. Author: vundela <vsr@cloudera.com> Author: Srinivasa Reddy Vundela <vsr@cloudera.com> Closes #9118 from vundela/master.
* [SPARK-10447][SPARK-3842][PYSPARK] upgrade pyspark to py4j0.9 (Holden Karau, 2015-10-20, 17 files changed, -66/+34)
  Upgrade to Py4J 0.9. Author: Holden Karau <holden@pigscanfly.ca> Author: Holden Karau <holden@us.ibm.com> Closes #8615 from holdenk/SPARK-10447-upgrade-pyspark-to-py4j0.9.
* [SPARK-10463] [SQL] remove PromotePrecision during optimization (Daoyuan Wang, 2015-10-20, 1 file changed, -3/+4)
  PromotePrecision is no longer necessary after HiveTypeCoercion is done. Jira: https://issues.apache.org/jira/browse/SPARK-10463 Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #8621 from adrian-wang/promoterm.
* [SPARK-11110][BUILD] Remove transient annotation for parameters. (Jakob Odersky, 2015-10-20, 2 files changed, -4/+4)
  `transient` annotations on class parameters (not case class parameters or vals) cause compilation errors when compiling with Scala 2.11. I understand that transient *parameters* make no sense, however I don't quite understand why the 2.10 compiler accepted them. Note: in case it is preferred to keep the annotations in case someone would in the future want to redefine them as vals, it would also be possible to just add `val` after the annotation, e.g. `class Foo(transient x: Int)` becomes `class Foo(transient private val x: Int)`. I chose to remove the annotation as it also reduces needless clutter; however, please feel free to tell me if you prefer the second option and I'll update the PR. Author: Jakob Odersky <jodersky@gmail.com> Closes #9126 from jodersky/sbt-scala-2.11.
* [SPARK-10876] Display total uptime for completed applications (Jean-Baptiste Onofré, 2015-10-20, 2 files changed, -8/+17)
  Author: Jean-Baptiste Onofré <jbonofre@apache.org> Closes #9059 from jbonofre/SPARK-10876.
* [SPARK-11088][SQL] Merges partition values using UnsafeProjection (Cheng Lian, 2015-10-19, 1 file changed, -49/+24)
  `DataSourceStrategy.mergeWithPartitionValues` is essentially a projection implemented in a quite inefficient way. This PR optimizes this method with `UnsafeProjection` to avoid unnecessary boxing costs. Author: Cheng Lian <lian@databricks.com> Closes #9104 from liancheng/spark-11088.faster-partition-values-merging.