Commit message (Author, Age, Files, Lines)
* [SPARK-1308] Add getNumPartitions to pyspark RDD (Syed Hashmi, 2014-06-09, 1 file, -18/+27)
| | | | | | | | | | Add getNumPartitions to pyspark RDD to provide an intuitive way to get number of partitions in RDD like we can do in scala today. Author: Syed Hashmi <shashmi@cloudera.com> Closes #995 from syedhashmi/master and squashes the following commits: de0ed5e [Syed Hashmi] [SPARK-1308] Add getNumPartitions to pyspark RDD
* Grammar: read -> reads (Andrew Ash, 2014-06-08, 1 file, -1/+1)
| | | | | | | | Author: Andrew Ash <andrew@andrewash.com> Closes #1016 from ash211/patch-6 and squashes the following commits: e3865c8 [Andrew Ash] Grammar: read -> reads
* [SPARK-2067] use relative path for Spark logo in UI (Neville Li, 2014-06-08, 1 file, -1/+1)
| | | | | | | | Author: Neville Li <neville@spotify.com> Closes #1006 from nevillelyh/gh/SPARK-2067 and squashes the following commits: 9ee64cf [Neville Li] [SPARK-2067] use relative path for Spark logo in UI
* SPARK-1628 follow up: Improve RangePartitioner's documentation. (Reynold Xin, 2014-06-08, 1 file, -1/+4)
| | | | | | | | | | | | Adding a paragraph clarifying a weird behavior in RangePartitioner. See also #549. Author: Reynold Xin <rxin@apache.org> Closes #1012 from rxin/partitioner-doc and squashes the following commits: 6f0109e [Reynold Xin] SPARK-1628 follow up: Improve RangePartitioner's documentation.
* Update run-example (maji2014, 2014-06-08, 1 file, -1/+1)
| The old code could only be run from the Spark home directory as "bin/run-example". Running it from any other location failed with the error "./run-example: line 55: ./bin/spark-submit: No such file or directory". This change removes that restriction.
| Author: maji2014 <maji3@asiainfo-linkage.com>
| Closes #1011 from maji2014/master and squashes the following commits:
| 2cc1af6 [maji2014] Update run-example
| Closes #988.
* SPARK-1628: Add missing hashCode methods in Partitioner subclasses (zsxwing, 2014-06-08, 3 files, -1/+20)
| | | | | | | | | | | | JIRA: https://issues.apache.org/jira/browse/SPARK-1628 Added `hashCode` in HashPartitioner, RangePartitioner, PythonPartitioner and PageRankUtils.CustomPartitioner. Author: zsxwing <zsxwing@gmail.com> Closes #549 from zsxwing/SPARK-1628 and squashes the following commits: 2620936 [zsxwing] SPARK-1628: Add missing hashCode methods in Partitioner subclasses
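Why a missing `hashCode` matters next to `equals` can be illustrated with a small Python analogue. This is only a sketch, not Spark code: `SimpleHashPartitioner` is a hypothetical class, and Python's `__eq__`/`__hash__` stand in for Scala's `equals`/`hashCode`.

```python
class SimpleHashPartitioner:
    """Hypothetical partitioner used only to illustrate the equals/hash contract."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions

    def get_partition(self, key):
        # Python's % yields a non-negative result for a positive modulus.
        return hash(key) % self.num_partitions

    def __eq__(self, other):
        return (isinstance(other, SimpleHashPartitioner)
                and other.num_partitions == self.num_partitions)

    def __hash__(self):
        # Equal partitioners must hash equally, so hash the same field
        # that __eq__ compares.
        return hash(self.num_partitions)


a = SimpleHashPartitioner(8)
b = SimpleHashPartitioner(8)
```

With both methods defined on the same field, `a` and `b` compare equal and also collapse to a single entry in a set or dict, which is the property hash-based collections rely on.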
* SPARK-1898: In deploy.yarn.Client, use YarnClient not YarnClientImpl (Colin Patrick McCabe, 2014-06-08, 2 files, -10/+17)
| | | | | | | | | | https://issues.apache.org/jira/browse/SPARK-1898 Author: Colin Patrick McCabe <cmccabe@cloudera.com> Closes #850 from cmccabe/master and squashes the following commits: d66eddc [Colin Patrick McCabe] SPARK-1898: In deploy.yarn.Client, use YarnClient rather than YarnClientImpl
* SPARK-2026: Maven Hadoop Profiles Should Set The Hadoop Version (Bernardo Gomez Palacio, 2014-06-08, 1 file, -2/+8)
| The Maven profiles that refer to hadoopX, e.g. `hadoop-2.4`, should set the expected `hadoop.version` and `yarn.version`, e.g.
| ```
| <profile>
|   <id>hadoop-2.4</id>
|   <properties>
|     <hadoop.version>2.4.0</hadoop.version>
|     <yarn.version>${hadoop.version}</yarn.version>
|     <protobuf.version>2.5.0</protobuf.version>
|     <jets3t.version>0.9.0</jets3t.version>
|   </properties>
| </profile>
| ```
| Builds can still define the `-Dhadoop.version` option, but this will correctly default the Hadoop version to the one expected by the selected profile, e.g.
| ```$ mvn -P hadoop-2.4,yarn clean install```
| or
| ```$ mvn -P hadoop-0.23,yarn clean install```
| [ticket] : https://issues.apache.org/jira/browse/SPARK-2026
| Author : berngp
| Reviewer : ?
| Author: Bernardo Gomez Palacio <bernardo.gomezpalacio@gmail.com>
| Closes #998 from berngp/feature/SPARK-2026 and squashes the following commits:
| 07ba4f7 [Bernardo Gomez Palacio] SPARK-2026: Maven Hadoop Profiles Should Set The Hadoop Version
* SPARK-2056 Set RDD name to input path (Neville Li, 2014-06-07, 1 file, -4/+4)
| | | | | | | | Author: Neville Li <neville@spotify.com> Closes #992 from nevillelyh/master and squashes the following commits: 3011739 [Neville Li] [SPARK-2056] Set RDD name to input path
* HOTFIX: Support empty body in merge script (Patrick Wendell, 2014-06-07, 1 file, -2/+3)
| | | | | | | | | | Discovered in #992 Author: Patrick Wendell <pwendell@gmail.com> Closes #1007 from pwendell/hotfix and squashes the following commits: af90aa0 [Patrick Wendell] HOTFIX: Support empty body in merge script
* [SPARK-1994][SQL] Weird data corruption bug when running Spark SQL on data in HDFS (Michael Armbrust, 2014-06-07, 1 file, -10/+5)
| Basically there is a race condition (possibly a scala bug?) when these values are recomputed on all of the slaves that results in an incorrect projection being generated (possibly because the GUID uniqueness contract is broken?).
| In general we should probably enforce that all expression planning occurs on the driver, as is now occurring here.
| Author: Michael Armbrust <michael@databricks.com>
| Closes #1004 from marmbrus/fixAggBug and squashes the following commits:
| e0c116c [Michael Armbrust] Compute aggregate expression during planning instead of lazily on workers.
* [SPARK-1841]: update scalatest to version 2.1.5 (witgo, 2014-06-06, 11 files, -36/+47)
| | | | | | | | | | | | | | | | | | | | Author: witgo <witgo@qq.com> Closes #713 from witgo/scalatest and squashes the following commits: b627a6a [witgo] merge master 51fb3d6 [witgo] merge master 3771474 [witgo] fix RDDSuite 996d6f9 [witgo] fix TimeStampedWeakValueHashMap test 9dfa4e7 [witgo] merge bug 1479b22 [witgo] merge master 29b9194 [witgo] fix code style 022a7a2 [witgo] fix test dependency a52c0fa [witgo] fix test dependency cd8f59d [witgo] Merge branch 'master' of https://github.com/apache/spark into scalatest 046540d [witgo] fix RDDSuite.scala 2c543b9 [witgo] fix ReplSuite.scala c458928 [witgo] update scalatest to version 2.1.5
* [SPARK-2050 - 2][SQL] DIV and BETWEEN should not be case sensitive. (Michael Armbrust, 2014-06-06, 4 files, -4/+10)
| | | | | | | | | | Followup: #989 Author: Michael Armbrust <michael@databricks.com> Closes #994 from marmbrus/caseSensitiveFunctions2 and squashes the following commits: 9d9c8ed [Michael Armbrust] Fix DIV and BETWEEN.
* [SPARK-1552] Fix type comparison bug in {map,outerJoin}Vertices (Ankur Dave, 2014-06-05, 5 files, -8/+40)
| | | | | | | | | | | | | | | | | | | | | In GraphImpl, mapVertices and outerJoinVertices use a more efficient implementation when the map function conserves vertex attribute types. This is implemented by comparing the ClassTags of the old and new vertex attribute types. However, ClassTags store erased types, so the comparison will return a false positive for types with different type parameters, such as Option[Int] and Option[Double]. This PR resolves the problem by requesting that the compiler generate evidence of equality between the old and new vertex attribute types, and providing a default value for the evidence parameter if the two types are not equal. The methods can then check the value of the evidence parameter to see whether the types are equal. It also adds a test called "mapVertices changing type with same erased type" that failed before the PR and succeeds now. Callers of mapVertices and outerJoinVertices can no longer use a wildcard for a graph's VD type. To avoid "Error occurred in an application involving default arguments," they must bind VD to a type parameter, as this PR does for ShortestPaths and LabelPropagation. Author: Ankur Dave <ankurdave@gmail.com> Closes #967 from ankurdave/SPARK-1552 and squashes the following commits: 68a4fff [Ankur Dave] Undo conserve naming 7388705 [Ankur Dave] Remove unnecessary ClassTag for VD parameters a704e5f [Ankur Dave] Use type equality constraint with default argument 29a5ab7 [Ankur Dave] Add failing test f458c83 [Ankur Dave] Revert "[SPARK-1552] Fix type comparison bug in mapVertices and outerJoinVertices" 16d6af8 [Ankur Dave] [SPARK-1552] Fix type comparison bug in mapVertices and outerJoinVertices
* [SPARK-2050][SQL] LIKE, RLIKE and IN in HQL should not be case sensitive. (Michael Armbrust, 2014-06-05, 1 file, -4/+8)
| | | | | | | | Author: Michael Armbrust <michael@databricks.com> Closes #989 from marmbrus/caseSensitiveFuncitons and squashes the following commits: 681de54 [Michael Armbrust] LIKE, RLIKE and IN in HQL should not be case sensitive.
* SPARK-2043: ExternalAppendOnlyMap doesn't always find matching keys (Matei Zaharia, 2014-06-05, 2 files, -5/+44)
| | | | | | | | | | | The current implementation reads one key with the next hash code as it finishes reading the keys with the current hash code, which may cause it to miss some matches of the next key. This can cause operations like join to give the wrong result when reduce tasks spill to disk and there are hash collisions, as values won't be matched together. This PR fixes it by not reading in that next key, using a peeking iterator instead. Author: Matei Zaharia <matei@databricks.com> Closes #986 from mateiz/spark-2043 and squashes the following commits: 0959514 [Matei Zaharia] Added unit test for having many hash collisions 892debb [Matei Zaharia] SPARK-2043: don't read a key with the next hash code in ExternalAppendOnlyMap, instead use a buffered iterator to only read values with the current hash code.
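The peeking-iterator idea described in this commit can be sketched in Python as follows. This is an illustration under assumptions, not the ExternalAppendOnlyMap code: `PeekingIterator`, `read_group`, and the `hash_fn` parameter are hypothetical names, and the input stream is assumed to be sorted by hash code.

```python
class PeekingIterator:
    """Iterator wrapper with one-element lookahead (hypothetical helper)."""

    _EMPTY = object()

    def __init__(self, iterable):
        self._it = iter(iterable)
        self._head = self._EMPTY

    def has_next(self):
        if self._head is self._EMPTY:
            self._head = next(self._it, self._EMPTY)
        return self._head is not self._EMPTY

    def peek(self):
        if not self.has_next():
            raise StopIteration
        return self._head

    def next(self):
        if not self.has_next():
            raise StopIteration
        value, self._head = self._head, self._EMPTY
        return value


def read_group(stream, hash_fn=hash):
    """Return all (key, value) pairs sharing the hash code at the head of the stream.

    The first pair with a different hash code is only peeked at, so it stays
    in the stream and is returned by the next call.
    """
    group = []
    if stream.has_next():
        current = hash_fn(stream.peek()[0])
        while stream.has_next() and hash_fn(stream.peek()[0]) == current:
            group.append(stream.next())
    return group
```

Because the boundary element is never consumed, every key with the next hash code is still available when the following group is read, which is the failure mode the commit message describes for the old approach.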
* [SPARK-2025] Unpersist edges of previous graph in Pregel (Ankur Dave, 2014-06-05, 1 file, -0/+1)
| | | | | | | | | | | | | | Due to a bug introduced by apache/spark#497, Pregel does not unpersist replicated vertices from previous iterations. As a result, they stay cached until memory is full, wasting GC time. This PR corrects the problem by unpersisting both the edges and the replicated vertices of previous iterations. This is safe because the edges and replicated vertices of the current iteration are cached by the call to `g.cache()` and then materialized by the call to `messages.count()`. Therefore no unmaterialized RDDs depend on `prevG.edges`. I verified that no recomputation occurs by running PageRank with a custom patch to Spark that warns when a partition is recomputed. Thanks to Tim Weninger for reporting this bug. Author: Ankur Dave <ankurdave@gmail.com> Closes #972 from ankurdave/SPARK-2025 and squashes the following commits: 13d5b07 [Ankur Dave] Unpersist edges of previous graph in Pregel
* Use pluggable clock in DAGSheduler #SPARK-2031 (CrazyJvm, 2014-06-05, 1 file, -6/+7)
| DAGScheduler now supports a pluggable clock, as TaskSetManager does.
| Author: CrazyJvm <crazyjvm@gmail.com>
| Closes #976 from CrazyJvm/clock and squashes the following commits:
| 6779a4c [CrazyJvm] Use pluggable clock in DAGSheduler
* [SPARK-2041][SQL] Correctly analyze queries where columnName == tableName. (Michael Armbrust, 2014-06-05, 3 files, -1/+11)
| | | | | | | | Author: Michael Armbrust <michael@databricks.com> Closes #985 from marmbrus/tableName and squashes the following commits: 3caaa27 [Michael Armbrust] Correctly analyze queries where columnName == tableName.
* Remove compile-scoped junit dependency. (Marcelo Vanzin, 2014-06-05, 2 files, -1/+10)
| | | | | | | | | | | | | This avoids having junit classes showing up in the assembly jar. I verified that only test classes in the jtransforms package use junit. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #794 from vanzin/junit-dep-exclusion and squashes the following commits: 274e1c2 [Marcelo Vanzin] Remove junit from assembly in sbt build also. ad950be [Marcelo Vanzin] Remove compile-scoped junit dependency.
* sbt 0.13.X should be using sbt-assembly 0.11.X (Kalpit Shah, 2014-06-05, 1 file, -1/+1)
| | | | | | | | | | https://github.com/sbt/sbt-assembly/blob/master/README.md Author: Kalpit Shah <shahkalpit84@gmail.com> Closes #555 from kalpit/upgrade/sbtassembly and squashes the following commits: 1fa7324 [Kalpit Shah] sbt 0.13.X should be using sbt-assembly 0.11.X
* HOTFIX: Remove generated-mima-excludes file after runing MIMA. (Patrick Wendell, 2014-06-05, 1 file, -0/+1)
| | | | | | | | | | | This has been causing some false failures on PR's that don't merge correctly. Author: Patrick Wendell <pwendell@gmail.com> Closes #971 from pwendell/mima and squashes the following commits: 1dc80aa [Patrick Wendell] HOTFIX: Remove generated-mima-excludes file after runing MIMA.
* [SPARK-2036] [SQL] CaseConversionExpression should check if the evaluated value is null. (Takuya UESHIN, 2014-06-05, 3 files, -2/+28)
| `CaseConversionExpression` should check if the evaluated value is `null`.
| Author: Takuya UESHIN <ueshin@happy-camper.st>
| Closes #982 from ueshin/issues/SPARK-2036 and squashes the following commits:
| 61e1c54 [Takuya UESHIN] Add check if the evaluated value is null.
* SPARK-1677: allow user to disable output dir existence checking (CodingCat, 2014-06-05, 3 files, -2/+34)
| | | | | | | | | | | | | | | https://issues.apache.org/jira/browse/SPARK-1677 For compatibility with older versions of Spark it would be nice to have an option `spark.hadoop.validateOutputSpecs` (default true) for the user to disable the output directory existence checking Author: CodingCat <zhunansjtu@gmail.com> Closes #947 from CodingCat/SPARK-1677 and squashes the following commits: 7930f83 [CodingCat] miao c0c0e03 [CodingCat] bug fix and doc update 5318562 [CodingCat] bug fix 13219b5 [CodingCat] allow user to disable output dir existence checking
* [SPARK-2029] Bump pom.xml version number of master branch to 1.1.0-SNAPSHOT. (Takuya UESHIN, 2014-06-05, 23 files, -23/+23)
| | | | | | | | Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #974 from ueshin/issues/SPARK-2029 and squashes the following commits: e19e8f4 [Takuya UESHIN] Bump version number to 1.1.0-SNAPSHOT.
* Fix issue in ReplSuite with hadoop-provided profile. (Marcelo Vanzin, 2014-06-04, 1 file, -1/+13)
| | | | | | | | | | | | | | | When building the assembly with the maven "hadoop-provided" profile, the executors were failing to come up because Hadoop classes were not found in the classpath anymore; so add them explicitly to the classpath using spark.executor.extraClassPath. This is only needed for the local-cluster mode, but doesn't affect other tests, so it's added for all of them to keep the code simpler. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #781 from vanzin/repl-test-fix and squashes the following commits: 4f0a3b0 [Marcelo Vanzin] Fix issue in ReplSuite with hadoop-provided profile.
* Minor: Fix documentation error from apache/spark#946 (Ankur Dave, 2014-06-04, 1 file, -2/+2)
| | | | | | | | Author: Ankur Dave <ankurdave@gmail.com> Closes #970 from ankurdave/SPARK-1991_docfix and squashes the following commits: 6d07343 [Ankur Dave] Minor: Fix documentation error from apache/spark#946
* SPARK-1790: Update EC2 scripts to support r3 instance types (Varakhedi Sujeet, 2014-06-04, 1 file, -2/+12)
| | | | | | | | Author: Varakhedi Sujeet <svarakhedi@gopivotal.com> Closes #960 from sujeetv/ec2-r3 and squashes the following commits: 3cb9fd5 [Varakhedi Sujeet] SPARK-1790: Update EC2 scripts to support r3 instance
* SPARK-1518: FileLogger: Fix compile against Hadoop trunk (Colin McCabe, 2014-06-04, 1 file, -4/+12)
| | | | | | | | | | | | | In Hadoop trunk (currently Hadoop 3.0.0), the deprecated FSDataOutputStream#sync() method has been removed. Instead, we should call FSDataOutputStream#hflush, which does the same thing as the deprecated method used to do. Author: Colin McCabe <cmccabe@cloudera.com> Closes #898 from cmccabe/SPARK-1518 and squashes the following commits: 752b9d7 [Colin McCabe] FileLogger: Fix compile against Hadoop trunk
* [SPARK-1752][MLLIB] Standardize text format for vectors and labeled points (Xiangrui Meng, 2014-06-04, 18 files, -72/+579)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We should standardize the text format used to represent vectors and labeled points. The proposed formats are the following: 1. dense vector: `[v0,v1,..]` 2. sparse vector: `(size,[i0,i1],[v0,v1])` 3. labeled point: `(label,vector)` where "(..)" indicates a tuple and "[...]" indicate an array. `loadLabeledPoints` is added to pyspark's `MLUtils`. I didn't add `loadVectors` to pyspark because `RDD.saveAsTextFile` cannot stringify dense vectors in the proposed format automatically. `MLUtils#saveLabeledData` and `MLUtils#loadLabeledData` are deprecated. Users should use `RDD#saveAsTextFile` and `MLUtils#loadLabeledPoints` instead. In Scala, `MLUtils#loadLabeledPoints` is compatible with the format used by `MLUtils#loadLabeledData`. CC: @mateiz, @srowen Author: Xiangrui Meng <meng@databricks.com> Closes #685 from mengxr/labeled-io and squashes the following commits: 2d1116a [Xiangrui Meng] make loadLabeledData/saveLabeledData deprecated since 1.0.1 297be75 [Xiangrui Meng] change LabeledPoint.parse to LabeledPointParser.parse to maintain binary compatibility d6b1473 [Xiangrui Meng] Merge branch 'master' into labeled-io 56746ea [Xiangrui Meng] replace # by . 
623a5f0 [Xiangrui Meng] merge master f06d5ba [Xiangrui Meng] add docs and minor updates 640fe0c [Xiangrui Meng] throw SparkException 5bcfbc4 [Xiangrui Meng] update test to add scientific notations e86bf38 [Xiangrui Meng] remove NumericTokenizer 050fca4 [Xiangrui Meng] use StringTokenizer 6155b75 [Xiangrui Meng] merge master f644438 [Xiangrui Meng] remove parse methods based on eval from pyspark a41675a [Xiangrui Meng] python loadLabeledPoint uses Scala's implementation ce9a475 [Xiangrui Meng] add deserialize_labeled_point to pyspark with tests e9fcd49 [Xiangrui Meng] add serializeLabeledPoint and tests aea4ae3 [Xiangrui Meng] minor updates 810d6df [Xiangrui Meng] update tokenizer/parser implementation 7aac03a [Xiangrui Meng] remove Scala parsers c1885c1 [Xiangrui Meng] add headers and minor changes b0c50cb [Xiangrui Meng] add customized parser d731817 [Xiangrui Meng] style update 63dc396 [Xiangrui Meng] add loadLabeledPoints to pyspark ea122b5 [Xiangrui Meng] Merge branch 'master' into labeled-io cd6c78f [Xiangrui Meng] add __str__ and parse to LabeledPoint a7a178e [Xiangrui Meng] add stringify to pyspark's Vectors 5c2dbfa [Xiangrui Meng] add parse to pyspark's Vectors 7853f88 [Xiangrui Meng] update pyspark's SparseVector.__str__ e761d32 [Xiangrui Meng] make LabelPoint.parse compatible with the dense format used before v1.0 and deprecate loadLabeledData and saveLabeledData 9e63a02 [Xiangrui Meng] add loadVectors and loadLabeledPoints 19aa523 [Xiangrui Meng] update toString and add parsers for Vectors and LabeledPoint
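The three proposed text formats can be made concrete with a short Python sketch. This is not MLlib's parser or pyspark's `MLUtils`: the function names are hypothetical, and reusing the JSON parser (rather than `eval`) is an assumption chosen only to keep the example small.

```python
import json


def format_dense(values):
    """Dense vector: [v0,v1,..]"""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"


def format_sparse(size, indices, values):
    """Sparse vector: (size,[i0,i1],[v0,v1])"""
    idx = ",".join(str(int(i)) for i in indices)
    vals = ",".join(repr(float(v)) for v in values)
    return "(%d,[%s],[%s])" % (size, idx, vals)


def format_labeled_point(label, vector_text):
    """Labeled point: (label,vector), where vector is already formatted text."""
    return "(%s,%s)" % (repr(float(label)), vector_text)


def parse(text):
    """Parse any of the three formats into nested Python lists.

    Tuples "(..)" and arrays "[..]" both become lists, so the caller decides
    how to interpret the result. No eval() is involved.
    """
    return json.loads(text.replace("(", "[").replace(")", "]"))


dense_text = format_dense([1, 2.5])
sparse_text = format_sparse(3, [0, 2], [1, 5])
point_text = format_labeled_point(1, sparse_text)
```

The sketch does not handle values such as infinity or NaN, and it does not distinguish a tuple from an array after parsing; a real parser would need both.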
* SPARK-1973. Add randomSplit to JavaRDD (with tests, and tidy Java tests) (Sean Owen, 2014-06-04, 4 files, -334/+358)
| | | | | | | | | | | | | | | | | | | | | I'd like to use randomSplit through the Java API, and would like to add a convenience wrapper for this method to JavaRDD. This is fairly trivial. (In fact, is the intent that JavaRDD not wrap every RDD method? and that sometimes users should just use JavaRDD.wrapRDD()?) Along the way, I added tests for it, and also touched up the Java API test style and behavior. This is maybe the more useful part of this small change. Author: Sean Owen <sowen@cloudera.com> Author: Xiangrui Meng <meng@databricks.com> This patch had conflicts when merged, resolved by Committer: Xiangrui Meng <meng@databricks.com> Closes #919 from srowen/SPARK-1973 and squashes the following commits: 148cb7b [Sean Owen] Some final Java test polish, while we are at it 1fc3f3e [Xiangrui Meng] more cleaning on Java 8 tests 9ebc57f [Sean Owen] Use accumulator instead of temp files to test foreach 5efb0be [Sean Owen] Add Java randomSplit, and unit tests (including for sample) 5dcc158 [Sean Owen] Simplified Java 8 test with new language features, and fixed the name of MLB's greatest team 91a1769 [Sean Owen] Touch up minor style issues in existing Java API suite test
* [MLLIB] set RDD names in ALS (Neville Li, 2014-06-04, 1 file, -5/+11)
| | | | | | | | | | | This is very useful when debugging & fine tuning jobs with large data sets. Author: Neville Li <neville@spotify.com> Closes #966 from nevillelyh/master and squashes the following commits: 6747764 [Neville Li] [MLLIB] use string interpolation for RDD names 3b15d34 [Neville Li] [MLLIB] set RDD names in ALS
* [SPARK-1817] RDD.zip() should verify partition sizes for each partition (Kan Zhang, 2014-06-03, 5 files, -100/+33)
| | | | | | | | | | | RDD.zip() will throw an exception if it finds partition sizes are not the same. Author: Kan Zhang <kzhang@apache.org> Closes #944 from kanzhang/SPARK-1817 and squashes the following commits: c073848 [Kan Zhang] [SPARK-1817] Cosmetic updates 524c670 [Kan Zhang] [SPARK-1817] RDD.zip() should verify partition sizes for each partition
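The behaviour described here, failing when two partitions do not hold the same number of elements, can be sketched for plain Python iterators. `strict_zip` and its error message are hypothetical; this is not the RDD.zip() implementation.

```python
def strict_zip(left, right):
    """Zip two iterables, raising if they do not have the same length."""
    left_it, right_it = iter(left), iter(right)
    sentinel = object()  # marks exhaustion without clashing with real values
    while True:
        a = next(left_it, sentinel)
        b = next(right_it, sentinel)
        if a is sentinel and b is sentinel:
            return  # both sides ended together: sizes match
        if a is sentinel or b is sentinel:
            raise ValueError("cannot zip partitions with different sizes")
        yield (a, b)
```

The check is made element by element, so a mismatch is detected as soon as one side runs out rather than by counting both sides in advance.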
* SPARK-1806 (addendum) Use non-deprecated methods in Mesos 0.18 (Sean Owen, 2014-06-03, 2 files, -3/+5)
| | | | | | | | | | The update to Mesos 0.18 caused some deprecation warnings in the build. The change to the non-deprecated version is straightforward as it emulates what the Mesos driver does with the deprecated method anyway (https://github.com/apache/mesos/blob/c5aa1dd22155d79c5a7c33076319299a40fd63b3/src/sched/sched.cpp#L1354) Author: Sean Owen <sowen@cloudera.com> Closes #920 from srowen/SPARK-1806 and squashes the following commits: 8d76b6a [Sean Owen] Use non-deprecated methods in Mesos 0.18
* Update spark-ec2 scripts for 1.0.0 on master (Aaron Davidson, 2014-06-03, 1 file, -3/+3)
| | | | | | | | | | | | | The change was previously committed only to branch-1.0 as part of https://github.com/apache/spark/commit/a34e6fda1d6fb8e769c21db70845f1a6dde968d8 Author: Aaron Davidson <aaron@databricks.com> This patch had conflicts when merged, resolved by Committer: Patrick Wendell <pwendell@gmail.com> Closes #938 from aarondav/sparkec2 and squashes the following commits: 067cc31 [Aaron Davidson] Update spark-ec2 scripts for 1.0.0 on master
* Enable repartitioning of graph over different number of partitions (Joseph E. Gonzalez, 2014-06-03, 3 files, -4/+20)
| | | | | | | | | | It is currently very difficult to repartition a graph over a different number of partitions. This PR adds an additional `partitionBy` function that takes the number of partitions. Author: Joseph E. Gonzalez <joseph.e.gonzalez@gmail.com> Closes #719 from jegonzal/graph_partitioning_options and squashes the following commits: 730b405 [Joseph E. Gonzalez] adding an additional number of partitions option to partitionBy
* use env default python in merge_spark_pr.py (Xiangrui Meng, 2014-06-03, 1 file, -1/+1)
| | | | | | | | | | A minor change to use env default python instead of fixed `/usr/bin/python`. Author: Xiangrui Meng <meng@databricks.com> Closes #965 from mengxr/merge-pr-python and squashes the following commits: 1ae0013 [Xiangrui Meng] use env default python in merge_spark_pr.py
* SPARK-1941: Update streamlib to 2.7.0 and use HyperLogLogPlus instead of HyperLogLog. (Reynold Xin, 2014-06-03, 11 files, -135/+189)
| I also corrected some errors in the previous HLL approximate count API, including the fact that relativeSD wasn't really a measure of error (and we used it to test error bounds in test results).
| Author: Reynold Xin <rxin@apache.org>
| Closes #897 from rxin/hll and squashes the following commits:
| 4d83f41 [Reynold Xin] New error bound and non-randomness.
| f154ea0 [Reynold Xin] Added a comment on the value bound for testing.
| e367527 [Reynold Xin] One more round of code review.
| 41e649a [Reynold Xin] Update final mima list.
| 9e320c8 [Reynold Xin] Incorporate code review feedback.
| e110d70 [Reynold Xin] Merge branch 'master' into hll
| 354deb8 [Reynold Xin] Added comment on the Mima exclude rules.
| acaa524 [Reynold Xin] Added the right exclude rules in MimaExcludes.
| 6555bfe [Reynold Xin] Added a default method and re-arranged MimaExcludes.
| 1db1522 [Reynold Xin] Excluded util.SerializableHyperLogLog from MIMA check.
| 9221b27 [Reynold Xin] Merge branch 'master' into hll
| 88cfe77 [Reynold Xin] Updated documentation and restored the old incorrect API to maintain API compatibility.
| 1294be6 [Reynold Xin] Updated HLL+.
| e7786cb [Reynold Xin] Merge branch 'master' into hll
| c0ef0c2 [Reynold Xin] SPARK-1941: Update streamlib to 2.7.0 and use HyperLogLogPlus instead of HyperLogLog.
* [SPARK-1161] Add saveAsPickleFile and SparkContext.pickleFile in Python (Kan Zhang, 2014-06-03, 2 files, -8/+39)
| | | | | | | | | | Author: Kan Zhang <kzhang@apache.org> Closes #755 from kanzhang/SPARK-1161 and squashes the following commits: 24ed8a2 [Kan Zhang] [SPARK-1161] Fixing doc tests 44e0615 [Kan Zhang] [SPARK-1161] Adding an optional batchSize with default value 10 d929429 [Kan Zhang] [SPARK-1161] Add saveAsObjectFile and SparkContext.objectFile in Python
* Fixed a typo (DB Tsai, 2014-06-03, 1 file, -1/+1)
| | | | | | | | | | in RowMatrix.scala Author: DB Tsai <dbtsai@dbtsai.com> Closes #959 from dbtsai/dbtsai-typo and squashes the following commits: fab0e0e [DB Tsai] Fixed typo
* [SPARK-1991] Support custom storage levels for vertices and edges (Ankur Dave, 2014-06-03, 8 files, -97/+229)
| | | | | | | | | | | | | | | | | | | | | This PR adds support for specifying custom storage levels for the vertices and edges of a graph. This enables GraphX to handle graphs larger than memory size by specifying MEMORY_AND_DISK and then repartitioning the graph to use many small partitions, each of which does fit in memory. Spark will then automatically load partitions from disk as needed. The user specifies the desired vertex and edge storage levels when building the graph by passing them to the graph constructor. These are then stored in the `targetStorageLevel` attribute of the VertexRDD and EdgeRDD respectively. Whenever GraphX needs to cache a VertexRDD or EdgeRDD (because it plans to use it more than once, for example), it uses the specified target storage level. Also, when the user calls `Graph#cache()`, the vertices and edges are persisted using their target storage levels. In order to facilitate propagating the target storage levels across VertexRDD and EdgeRDD operations, we remove raw calls to the constructors and instead introduce the `withPartitionsRDD` and `withTargetStorageLevel` methods. I tested this change by running PageRank and triangle count on a severely memory-constrained cluster (1 executor with 300 MB of memory, and a 1 GB graph). Before this PR, these algorithms used to fail with OutOfMemoryErrors. With this PR, and using the DISK_ONLY storage level, they succeed. Author: Ankur Dave <ankurdave@gmail.com> Closes #946 from ankurdave/SPARK-1991 and squashes the following commits: ce17d95 [Ankur Dave] Move pickStorageLevel to StorageLevel.fromString ccaf06f [Ankur Dave] Shadow members in withXYZ() methods rather than using underscores c34abc0 [Ankur Dave] Exclude all of GraphX from compatibility checks vs. 
1.0.0 c5ca068 [Ankur Dave] Revert "Exclude all of GraphX from binary compatibility checks" 34bcefb [Ankur Dave] Exclude all of GraphX from binary compatibility checks 6fdd137 [Ankur Dave] [SPARK-1991] Support custom storage levels for vertices and edges
* Synthetic GraphX Benchmark (Joseph E. Gonzalez, 2014-06-03, 4 files, -11/+171)
| | | | | | | | | | | | | | | | | | | | This PR accomplishes two things: 1. It introduces a Synthetic Benchmark application that generates an arbitrarily large log-normal graph and executes either PageRank or connected components on the graph. This can be used to profile GraphX system on arbitrary clusters without access to large graph datasets 2. This PR improves the implementation of the log-normal graph generator. Author: Joseph E. Gonzalez <joseph.e.gonzalez@gmail.com> Author: Ankur Dave <ankurdave@gmail.com> Closes #720 from jegonzal/graphx_synth_benchmark and squashes the following commits: e40812a [Ankur Dave] Exclude all of GraphX from compatibility checks vs. 1.0.0 bccccad [Ankur Dave] Fix long lines 374678a [Ankur Dave] Bugfix and style changes 1bdf39a [Joseph E. Gonzalez] updating options d943972 [Joseph E. Gonzalez] moving the benchmark application into the examples folder. f4f839a [Joseph E. Gonzalez] Creating a synthetic benchmark script.
* fix java.lang.ClassCastException (baishuo(白硕), 2014-06-03, 1 file, -1/+1)
| Running bin/run-example org.apache.spark.examples.sql.RDDRelation throws an exception:
| Exception in thread "main" java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Integer
|   at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:106)
|   at org.apache.spark.sql.catalyst.expressions.GenericRow.getInt(Row.scala:145)
|   at org.apache.spark.examples.sql.RDDRelation$.main(RDDRelation.scala:49)
|   at org.apache.spark.examples.sql.RDDRelation.main(RDDRelation.scala)
| After changing sql("SELECT COUNT(*) FROM records").collect().head.getInt(0) to sql("SELECT COUNT(*) FROM records").collect().head.getLong(0), the exception no longer occurs.
| Author: baishuo(白硕) <vc_java@hotmail.com>
| Closes #949 from baishuo/master and squashes the following commits:
| f4b319f [baishuo(白硕)] fix java.lang.ClassCastException
* [SPARK-1468] Modify the partition function used by partitionBy. (Erik Selin, 2014-06-03, 1 file, -1/+4)
| | | | | | | | | | | | | | Make partitionBy use a tweaked version of hash as its default partition function since the python hash function does not consistently assign the same value to None across python processes. Associated JIRA at https://issues.apache.org/jira/browse/SPARK-1468 Author: Erik Selin <erik.selin@jadedpixel.com> Closes #371 from tyro89/consistent_hashing and squashes the following commits: 201c301 [Erik Selin] Make partitionBy use a tweaked version of hash as its default partition function since the python hash function does not consistently assign the same value to None across python processes.
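A "tweaked version of hash" of the kind described here might look like the sketch below. The function names are hypothetical and the exact expression used in pyspark's `partitionBy` is an assumption; the only point taken from the commit message is that `None` must map to the same value in every Python process.

```python
def default_partition_func(key):
    """Hash a key, mapping None to a fixed value (0) instead of hash(None)."""
    return 0 if key is None else hash(key)


def partition_index(key, num_partitions, partition_func=default_partition_func):
    """Pick the partition for a key; Python's % keeps the result in [0, n)."""
    return partition_func(key) % num_partitions
```

With this default, records keyed by `None` always land in partition 0, whichever worker process computes the index.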
* Add support for Pivotal HD in the Maven build: SPARK-1992 (tzolov, 2014-06-03, 2 files, -0/+12)
| | | | | | | | | | | | | Allow Spark to build against particular Pivotal HD distributions. For example to build Spark against Pivotal HD 2.0.1 one can run: ``` mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0-gphd-3.0.1.0 -DskipTests clean package ``` Author: tzolov <christian.tzolov@gmail.com> Closes #942 from tzolov/master and squashes the following commits: bc3e05a [tzolov] Add support for Pivotal HD in the Maven build and SBT build: [SPARK-1992]
* [SPARK-1912] fix compress memory issue during reduce (Wenchen Fan(Cloud), 2014-06-03, 1 file, -2/+20)
| When we need to read a compressed block, we first create a compression stream instance (LZF or Snappy) and use it to wrap that block.
| Say a reducer task needs to read 1000 local shuffle blocks: it first prepares to read those 1000 blocks, which means creating 1000 compression stream instances to wrap them. Initializing a compression instance allocates some memory, so having many instances alive at the same time is a problem.
| In fact the reducer reads the shuffle blocks one by one, so we can initialize the compression instances lazily.
| Author: Wenchen Fan(Cloud) <cloud0fan@gmail.com>
| Closes #860 from cloud-fan/fix-compress and squashes the following commits:
| 0924a6b [Wenchen Fan(Cloud)] rename 'doWork' into 'getIterator'
| 07f32c2 [Wenchen Fan(Cloud)] move the LazyProxyIterator to dataDeserialize
| d80c426 [Wenchen Fan(Cloud)] remove empty lines in short class
| 2c8adb2 [Wenchen Fan(Cloud)] add inline comment
| 8ebff77 [Wenchen Fan(Cloud)] fix compress memory issue during reduce
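Lazy initialization of the per-block stream can be sketched in Python as below. This is an illustration only: `LazyStream` is a hypothetical class, `zlib` stands in for LZF or Snappy, and the `created` counter exists purely to show when allocation happens.

```python
import zlib


class LazyStream:
    """Defers building the decompressor until the block is actually read."""

    created = 0  # counts how many decompressors have been allocated so far

    def __init__(self, compressed_block):
        self._block = compressed_block
        self._data = None

    def read(self):
        if self._data is None:
            LazyStream.created += 1
            decompressor = zlib.decompressobj()  # allocated only on first read
            self._data = decompressor.decompress(self._block) + decompressor.flush()
        return self._data


blocks = [zlib.compress(("block-%d" % i).encode("utf-8")) for i in range(1000)]
streams = [LazyStream(b) for b in blocks]  # wrapping allocates no decompressor
first = streams[0].read()                  # exactly one decompressor exists now
```

Wrapping all 1000 blocks costs only 1000 small wrapper objects; a decompressor is created one at a time as each block is consumed.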
* SPARK-2001 : Remove docs/spark-debugger.md from master (Henry Saputra, 2014-06-03, 1 file, -121/+0)
| | | | | | | | | | | | | | | | | | | | Per discussion in dev list: " Seemed like the spark-debugger.md is no longer accurate (see http://spark.apache.org/docs/latest/spark-debugger.html) and since it was originally written Spark has evolved that makes the doc obsolete. There are already work pending for new replay debugging (I could not find the PR links for it) so I With version control we could always reinstate the old doc if needed, but as of today the doc is no longer reflect the current state of Spark's RDD. " Author: Henry Saputra <henry.saputra@gmail.com> Closes #953 from hsaputra/SPARK-2001-hsaputra and squashes the following commits: dc324aa [Henry Saputra] SPARK-2001 : Remove docs/spark-debugger.md from master since it is obsolete
* [SPARK-1942] Stop clearing spark.driver.port in unit tests (Syed Hashmi, 2014-06-03, 23 files, -42/+0)
| | | | | | | | | | | | | | | | | | | stop resetting spark.driver.port in unit tests (scala, java and python). Author: Syed Hashmi <shashmi@cloudera.com> Author: CodingCat <zhunansjtu@gmail.com> Closes #943 from syedhashmi/master and squashes the following commits: 885f210 [Syed Hashmi] Removing unnecessary file (created by mergetool) b8bd4b5 [Syed Hashmi] Merge remote-tracking branch 'upstream/master' b895e59 [Syed Hashmi] Revert "[SPARK-1784] Add a new partitioner" 57b6587 [Syed Hashmi] Revert "[SPARK-1784] Add a balanced partitioner" 1574769 [Syed Hashmi] [SPARK-1942] Stop clearing spark.driver.port in unit tests 4354836 [Syed Hashmi] Revert "SPARK-1686: keep schedule() calling in the main thread" fd36542 [Syed Hashmi] [SPARK-1784] Add a balanced partitioner 6668015 [CodingCat] SPARK-1686: keep schedule() calling in the main thread 4ca94cc [Syed Hashmi] [SPARK-1784] Add a new partitioner
* Avoid dynamic dispatching when unwrapping Hive data. (Cheng Lian, 2014-06-02, 2 files, -15/+18)
| This is a follow up of PR #758.
| The `unwrapHiveData` function is now composed statically, according to the field object inspector, before any rows are scanned, to avoid the cost of dynamic dispatching.
| According to the same micro benchmark used in PR #758, this simple change brings a slight performance boost: 2.5% for the CSV table and 1% for the RCFile table.
| ```
| Optimized version:
| CSV: 6870 ms, RCFile: 5687 ms
| CSV: 6832 ms, RCFile: 5800 ms
| CSV: 6822 ms, RCFile: 5679 ms
| CSV: 6704 ms, RCFile: 5758 ms
| CSV: 6819 ms, RCFile: 5725 ms
| Original version:
| CSV: 7042 ms, RCFile: 5667 ms
| CSV: 6883 ms, RCFile: 5703 ms
| CSV: 7115 ms, RCFile: 5665 ms
| CSV: 7020 ms, RCFile: 5981 ms
| CSV: 6871 ms, RCFile: 5906 ms
| ```
| Author: Cheng Lian <lian.cs.zju@gmail.com>
| Closes #935 from liancheng/staticUnwrapping and squashes the following commits:
| c49c70c [Cheng Lian] Avoid dynamic dispatching when unwrapping Hive data.
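The idea of composing the unwrapping function once, before rows are scanned, can be sketched in Python. Nothing here models Hive object inspectors: `make_unwrapper`, the type names, and the converter table are all hypothetical.

```python
def make_unwrapper(column_types):
    """Choose one converter per column up front, then reuse it for every row."""
    converters = {
        "int": int,
        "double": float,
        "string": str,
    }
    chosen = [converters[t] for t in column_types]  # dispatch happens once here

    def unwrap_row(raw_row):
        # No per-value type inspection: apply the pre-selected functions.
        return [convert(value) for convert, value in zip(chosen, raw_row)]

    return unwrap_row


unwrap = make_unwrapper(["int", "double", "string"])
rows = [unwrap(r) for r in [("1", "2.5", "a"), ("2", "4.0", "b")]]
```

The lookup by type name runs once per column when `make_unwrapper` is called, so the per-row loop contains only direct function calls.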
* [SPARK-1995][SQL] system function upper and lower can be supported (egraldlo, 2014-06-02, 4 files, -1/+70)
| I don't know whether now is the right time to implement system functions for string operations in Spark SQL.
| Author: egraldlo <egraldlo@gmail.com>
| Closes #936 from egraldlo/stringoperator and squashes the following commits:
| 3c6c60a [egraldlo] Add UPPER, LOWER, MAX and MIN into hive parser
| ea76d0a [egraldlo] modify the formatting issues
| b49f25e [egraldlo] modify the formatting issues
| 1f0bbb5 [egraldlo] system function upper and lower supported
| 13d3267 [egraldlo] system function upper and lower supported