path: root/core
Commit message | Author | Date | Files | Lines
* [SPARK-1100] prevent Spark from overwriting directory silently | CodingCat | 2014-03-01 | 2 files | -10/+59

  Thanks to Diana Carroll for reporting this issue (https://spark-project.atlassian.net/browse/SPARK-1100).

  The current saveAsTextFile/SequenceFile silently overwrite the output directory if it already exists. This behaviour is not user-friendly: if the partition count of two write operations differs, the output directory ends up containing results from both runs.

  My fix adds new APIs with a flag for users to state whether they want to overwrite the directory: if the flag is set to true, the output directory is deleted first and the new data is then written, so the directory never contains results from multiple runs; if the flag is set to false, Spark throws an exception when the output directory already exists. The Java API part is changed accordingly; the default behaviour is to overwrite.

  Two questions: should we deprecate the old APIs without such a flag? I noticed that Spark Streaming also calls these APIs; I assume we don't need to change the related part in streaming? @tdas

  Author: CodingCat <zhunansjtu@gmail.com>

  Closes #11 from CodingCat/SPARK-1100 and squashes the following commits:

  6a4e3a3 [CodingCat] code clean
  ef2d43f [CodingCat] add new test cases and code clean
  ac63136 [CodingCat] checkOutputSpecs not applicable to FSOutputFormat
  ec490e8 [CodingCat] prevent Spark from overwriting directory silently and leaving dirty directory
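  A minimal sketch of the flag semantics described above, using Hadoop's FileSystem API; the helper name `prepareOutputDir` is illustrative, not part of the actual patch:

  ```scala
  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.Path

  // Hypothetical helper: delete-then-write when overwrite is requested,
  // fail fast when it is not, so a directory never mixes output of two runs.
  def prepareOutputDir(dir: String, overwrite: Boolean, hadoopConf: Configuration): Unit = {
    val outputPath = new Path(dir)
    val fs = outputPath.getFileSystem(hadoopConf)
    if (fs.exists(outputPath)) {
      if (overwrite) {
        fs.delete(outputPath, true) // recursive delete of the stale results
      } else {
        throw new RuntimeException(s"Output directory $dir already exists")
      }
    }
  }
  ```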
* Revert "[SPARK-1150] fix repo location in create script" | Patrick Wendell | 2014-03-01 | 1 file | -8/+2

  This reverts commit 9aa095711858ce8670e51488f66a3d7c1a821c30.
* [SPARK-1150] fix repo location in create script | Mark Grover | 2014-03-01 | 1 file | -2/+8

  https://spark-project.atlassian.net/browse/SPARK-1150

  fix the repo location in create_release script

  Author: Mark Grover <mark@apache.org>

  Closes #48 from CodingCat/script_fixes and squashes the following commits:

  01f4bf7 [Mark Grover] Fixing some nitpicks
  d2244d4 [Mark Grover] SPARK-676: Abbreviation in SPARK_MEM but not in SPARK_WORKER_MEMORY
* [SPARK-979] Randomize order of offers. | Kay Ousterhout | 2014-03-01 | 4 files | -41/+75

  This commit randomizes the order of resource offers to avoid scheduling all tasks on the same small set of machines. This is a much simpler solution to SPARK-979 than #7.

  Author: Kay Ousterhout <kayousterhout@gmail.com>

  Closes #27 from kayousterhout/randomize and squashes the following commits:

  435d817 [Kay Ousterhout] [SPARK-979] Randomize order of offers.
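  The core idea is simple enough to sketch; `WorkerOffer` here is a stand-in case class, not necessarily the scheduler's actual type:

  ```scala
  import scala.util.Random

  // Stand-in for the scheduler's resource-offer type.
  case class WorkerOffer(executorId: String, host: String, cores: Int)

  // Shuffling the offers before round-robin task assignment keeps tasks from
  // always piling onto whichever machines happen to come first in the list.
  def randomizedOffers(offers: Seq[WorkerOffer]): Seq[WorkerOffer] =
    Random.shuffle(offers)
  ```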
* SPARK-1051. On YARN, executors don't doAs submitting user | Sandy Ryza | 2014-02-28 | 1 file | -8/+10

  This reopens https://github.com/apache/incubator-spark/pull/538 against the new repo.

  Author: Sandy Ryza <sandy@cloudera.com>

  Closes #29 from sryza/sandy-spark-1051 and squashes the following commits:

  708ce49 [Sandy Ryza] SPARK-1051. doAs submitting user in YARN
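  A sketch of the Hadoop doAs pattern the title refers to; the wrapper itself is illustrative, not the exact patch:

  ```scala
  import java.security.PrivilegedExceptionAction
  import org.apache.hadoop.security.UserGroupInformation

  // Run a block of work with the identity of the user who submitted the job,
  // rather than the user the YARN daemon happens to run as.
  def runAsSubmitter[T](submittingUser: String)(body: => T): T = {
    val ugi = UserGroupInformation.createRemoteUser(submittingUser)
    ugi.doAs(new PrivilegedExceptionAction[T] {
      override def run(): T = body
    })
  }
  ```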
* Remove BlockFetchTracker trait | Kay Ousterhout | 2014-02-27 | 2 files | -38/+17

  This trait seems to have been created a while ago when there were multiple implementations; now that there's just one, I think it makes sense to merge it into the BlockFetcherIterator trait.

  Author: Kay Ousterhout <kayousterhout@gmail.com>

  Closes #39 from kayousterhout/remove_tracker and squashes the following commits:

  8173939 [Kay Ousterhout] Remove BlockFetchTracker.
* [HOTFIX] Patching maven build after #6 (SPARK-1121). | Patrick Wendell | 2014-02-27 | 1 file | -8/+0

  That patch removed the Maven avro declaration but didn't remove the actual dependency in core. /cc @scrapcodes

  Author: Patrick Wendell <pwendell@gmail.com>

  Closes #37 from pwendell/master and squashes the following commits:

  0ef3008 [Patrick Wendell] [HOTFIX] Patching maven build after #6 (SPARK-1121).
* SPARK 1084.1 (resubmitted) | Sean Owen | 2014-02-27 | 7 files | -20/+35

  (Ported from https://github.com/apache/incubator-spark/pull/637)

  Author: Sean Owen <sowen@cloudera.com>

  Closes #31 from srowen/SPARK-1084.1 and squashes the following commits:

  6c4a32c [Sean Owen] Suppress warnings about legitimate unchecked array creations, or change code to avoid it
  f35b833 [Sean Owen] Fix two misc javadoc problems
  254e8ef [Sean Owen] Fix one new style error introduced in scaladoc warning commit
  5b2fce2 [Sean Owen] Fix scaladoc invocation warning, and enable javac warnings properly, with plugin config updates
  007762b [Sean Owen] Remove dead scaladoc links
  b8ff8cb [Sean Owen] Replace deprecated Ant <tasks> with <target>
* Show Master status on UI page | Raymond Liu | 2014-02-26 | 1 file | -0/+1

  For standalone HA mode, a status field is useful to identify the current master; it is already available in the JSON output too.

  Author: Raymond Liu <raymond.liu@intel.com>

  Closes #24 from colorant/status and squashes the following commits:

  df630b3 [Raymond Liu] Show Master status on UI page
* SPARK-1129: use a predefined seed when seed is zero in XORShiftRandom | Xiangrui Meng | 2014-02-26 | 2 files | -3/+16

  If the seed is zero, XORShift generates all zeros, which produces unexpected results.

  JIRA: https://spark-project.atlassian.net/browse/SPARK-1129

  Author: Xiangrui Meng <meng@databricks.com>

  Closes #645 from mengxr/xor and squashes the following commits:

  1b086ab [Xiangrui Meng] use MurmurHash3 to set seed in XORShiftRandom
  45c6f16 [Xiangrui Meng] minor style change
  51f4050 [Xiangrui Meng] use a predefined seed when seed is zero in XORShiftRandom
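  Why a zero seed is fatal: the xorshift update applies only shifts and XORs to the state, all of which map zero to zero, so the generator would emit zeros forever. A sketch of the MurmurHash3-based seed scrambling the squash log mentions (details illustrative):

  ```scala
  import java.nio.ByteBuffer
  import scala.util.hashing.MurmurHash3

  // Hash the raw seed into a (practically) never-zero initial state.
  def hashSeed(seed: Long): Long = {
    val bytes = ByteBuffer.allocate(8).putLong(seed).array()
    val lo = MurmurHash3.bytesHash(bytes)
    val hi = MurmurHash3.bytesHash(bytes, lo)
    (hi.toLong << 32) | (lo.toLong & 0xFFFFFFFFL)
  }
  ```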
* Remove references to ClusterScheduler (SPARK-1140) | Kay Ousterhout | 2014-02-26 | 8 files | -46/+47

  ClusterScheduler was renamed to TaskSchedulerImpl; this commit updates comments and tests accordingly.

  Author: Kay Ousterhout <kayousterhout@gmail.com>

  Closes #9 from kayousterhout/cluster_scheduler_death and squashes the following commits:

  d6fd119 [Kay Ousterhout] Remove references to ClusterScheduler.
* Deprecated and added a few java api methods for corresponding scala api. | Prashant Sharma | 2014-02-26 | 5 files | -6/+32

  PR [402](https://github.com/apache/incubator-spark/pull/402) from the incubator repo.

  Author: Prashant Sharma <prashant.s@imaginea.com>

  Closes #19 from ScrapCodes/java-api-completeness and squashes the following commits:

  11d0c2b [Prashant Sharma] Integer -> java.lang.Integer
  737819a [Prashant Sharma] SPARK-1095 add explicit return types to APIs.
  3ddc8bb [Prashant Sharma] Deprecated *With functions in scala and added a few missing Java APIs
* SPARK-1078: Replace lift-json with json4s-jackson. | William Benton | 2014-02-26 | 8 files | -23/+31

  The aim of the Json4s project is to provide a common API for Scala JSON libraries. It is Apache-licensed, easier for downstream distributions to package, and mostly API-compatible with lift-json. Furthermore, the Jackson-backed implementation parses faster than lift-json on all but the smallest inputs.

  Author: William Benton <willb@redhat.com>

  Closes #582 from willb/json4s and squashes the following commits:

  7ca62c4 [William Benton] Replace lift-json with json4s-jackson.
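  Since json4s exposes a lift-json-compatible AST, most call sites only swap imports. A minimal round trip (values illustrative):

  ```scala
  import org.json4s._
  import org.json4s.jackson.JsonMethods._

  implicit val formats: Formats = DefaultFormats

  val ast: JValue = parse("""{"name": "spark", "cores": 8}""")
  val cores = (ast \ "cores").extract[Int] // 8
  println(compact(render(ast)))            // {"name":"spark","cores":8}
  ```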
* For SPARK-1082, Use Curator for ZK interaction in standalone cluster | Raymond Liu | 2014-02-24 | 7 files | -296/+95

  Author: Raymond Liu <raymond.liu@intel.com>

  Closes #611 from colorant/curator and squashes the following commits:

  7556aa1 [Raymond Liu] Address review comments
  af92e1f [Raymond Liu] Fix coding style
  964f3c2 [Raymond Liu] Ignore NodeExists exception
  6df2966 [Raymond Liu] Rewrite zookeeper client code with curator
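  Curator's value here is that connection management and retries come for free, which is most of what the hand-written ZooKeeper code did. A minimal client setup (hosts, retry policy, and path are illustrative):

  ```scala
  import org.apache.curator.framework.CuratorFrameworkFactory
  import org.apache.curator.retry.ExponentialBackoffRetry

  val client = CuratorFrameworkFactory.newClient(
    "zk1:2181,zk2:2181",                 // placeholder connection string
    new ExponentialBackoffRetry(1000, 3) // base sleep 1s, up to 3 retries
  )
  client.start()
  // e.g. client.create().creatingParentsIfNeeded().forPath("/spark/master_status")
  ```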
* For outputformats that are Configurable, call setConf before sending data to them. [SPARK-1108] | Bryn Keller | 2014-02-24 | 2 files | -1/+80

  This allows us to use, e.g., HBase's TableOutputFormat with PairRDDFunctions.saveAsNewAPIHadoopFile, which otherwise would throw NullPointerException because the output table name hasn't been configured.

  Note this bug also affects branch-0.9.

  Author: Bryn Keller <bryn.keller@intel.com>

  Closes #638 from xoltar/SPARK-1108 and squashes the following commits:

  7e94e7d [Bryn Keller] Import, comment, and format cleanup per code review
  7cbcaa1 [Bryn Keller] For outputformats that are Configurable, call setConf before sending data to them. This allows us to use, e.g. HBase TableOutputFormat, which otherwise would throw NullPointerException because the output table name hasn't been configured
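  The pattern behind the fix, sketched (helper name illustrative, not the exact patch):

  ```scala
  import org.apache.hadoop.conf.{Configurable, Configuration}

  // If the output format implements Configurable, hand it the job configuration
  // before writing any records; TableOutputFormat reads its table name from it.
  def configureIfInstance(outputFormat: AnyRef, jobConf: Configuration): Unit =
    outputFormat match {
      case c: Configurable => c.setConf(jobConf)
      case _               => // non-Configurable formats need nothing
    }
  ```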
* Fix removal from shuffleToMapStage to search for a key-value pair with our stage instead of using our shuffleID. | Matei Zaharia | 2014-02-24 | 1 file | -2/+2
* SPARK-1124: Fix infinite retries of reduce stage when a map stage failed | Matei Zaharia | 2014-02-23 | 2 files | -14/+30

  In the previous code, if you had a failing map stage and then tried to run reduce stages on it repeatedly, the first reduce stage would fail correctly, but the later ones would mistakenly believe that all map outputs are available and start failing infinitely with fetch failures from "null".
* SPARK-1071: Tidy logging strategy and use of log4j | Sean Owen | 2014-02-23 | 1 file | -9/+22

  Prompted by a recent thread on the mailing list, I tried and failed to see if Spark can be made independent of log4j. There are a few cases where control of the underlying logging is pretty useful, and to do that, you have to bind to a specific logger. Instead I propose some tidying that leaves Spark's use of log4j, but gets rid of warnings and should still enable downstream users to switch. The idea is to pipe everything (except log4j) through SLF4J, and have Spark use SLF4J directly when logging, and where Spark needs to output info (REPL and tests), bind from SLF4J to log4j. This leaves the same behavior in Spark. It means that downstream users who want to use something except log4j should:

  - Exclude dependencies on log4j, slf4j-log4j12 from Spark
  - Include dependency on log4j-over-slf4j
  - Include dependency on another logger X, and another slf4j-X
  - Recreate any log config that Spark does, that is needed, in the other logger's config

  That sounds about right. Here are the key changes:

  - Include the jcl-over-slf4j shim everywhere by depending on it in core.
  - Exclude dependencies on commons-logging from third-party libraries.
  - Include the jul-to-slf4j shim everywhere by depending on it in core.
  - Exclude slf4j-* dependencies from third-party libraries to prevent collision or warnings
  - Added missing slf4j-log4j12 binding to GraphX, Bagel module tests

  And minor/incidental changes:

  - Update to SLF4J 1.7.5, which happily matches Hadoop 2's version and is a recommended update over 1.7.2
  - (Remove a duplicate HBase dependency declaration in SparkBuild.scala)
  - (Remove a duplicate mockito dependency declaration that was causing warnings and bugging me)

  Author: Sean Owen <sowen@cloudera.com>

  Closes #570 from srowen/SPARK-1071 and squashes the following commits:

  52eac9f [Sean Owen] Add slf4j-over-log4j12 dependency to core (non-test) and remove it from things that depend on core.
  77a7fa9 [Sean Owen] SPARK-1071: Tidy logging strategy and use of log4j
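  The downstream recipe above, sketched in sbt syntax (coordinates and versions are illustrative):

  ```scala
  libraryDependencies ++= Seq(
    ("org.apache.spark" %% "spark-core" % "1.0.0")
      .exclude("log4j", "log4j")                   // drop Spark's log4j binding
      .exclude("org.slf4j", "slf4j-log4j12"),
    "org.slf4j" % "log4j-over-slf4j" % "1.7.5",    // route log4j calls into SLF4J
    "ch.qos.logback" % "logback-classic" % "1.1.2" // or any other SLF4J backend
  )
  ```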
* Migrate Java code to Scala or move it to src/main/java | Punya Biswal | 2014-02-22 | 11 files | -92/+56

  These classes can't be migrated:

  - StorageLevels: impossible to create static fields in Scala
  - JavaSparkContextVarargsWorkaround: incompatible varargs
  - JavaAPISuite: should test Java APIs in pure Java (for sanity)

  Author: Punya Biswal <pbiswal@palantir.com>

  Closes #605 from punya/move-java-sources and squashes the following commits:

  25b00b2 [Punya Biswal] Remove redundant type param; reformat
  853da46 [Punya Biswal] Use factory method rather than constructor
  e5d53d9 [Punya Biswal] Migrate Java code to Scala or move it to src/main/java
* SPARK-1117: update accumulator docs | Xiangrui Meng | 2014-02-21 | 1 file | -2/+2

  The current doc hints that Spark doesn't support accumulators of type `Long`, which is wrong.

  JIRA: https://spark-project.atlassian.net/browse/SPARK-1117

  Author: Xiangrui Meng <meng@databricks.com>

  Closes #631 from mengxr/acc and squashes the following commits:

  45ecd25 [Xiangrui Meng] update accumulator docs
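  For reference, Long accumulators do work in the 1.x-era API this log covers, since SparkContext provides an implicit AccumulatorParam for Long (sketch):

  ```scala
  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("acc-demo"))
  val total = sc.accumulator(0L) // Long, not just Int/Double
  sc.parallelize(1L to 100L).foreach(x => total += x)
  println(total.value) // 5050
  ```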
* [SPARK-1113] External spilling - fix Int.MaxValue hash code collision bug | Andrew Or | 2014-02-21 | 2 files | -38/+102

  The original poster of this bug is @guojc, who opened a PR that preceded this one at https://github.com/apache/incubator-spark/pull/612.

  ExternalAppendOnlyMap uses key hash code to order the buffer streams from which spilled files are read back into memory. When a buffer stream is empty, the default hash code for that stream is equal to Int.MaxValue. This is, however, a perfectly legitimate candidate for a key hash code. When reading from a spilled map containing such a key, a hash collision may occur, in which case we attempt to read from an empty stream and throw NoSuchElementException.

  The fix is to maintain the invariant that empty buffer streams are never added back to the merge queue to be considered. This guarantees that we never read from an empty buffer stream, ever again.

  This PR also includes two new tests for hash collisions.

  Author: Andrew Or <andrewor14@gmail.com>

  Closes #624 from andrewor14/spilling-bug and squashes the following commits:

  9e7263d [Andrew Or] Slightly optimize next()
  2037ae2 [Andrew Or] Move a few comments around...
  cf95942 [Andrew Or] Remove default value of Int.MaxValue for minKeyHash
  c11f03b [Andrew Or] Fix Int.MaxValue hash collision bug in ExternalAppendOnlyMap
  21c1a39 [Andrew Or] Add hash collision tests to ExternalAppendOnlyMapSuite
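  A toy model of the invariant (types and names are illustrative, not ExternalAppendOnlyMap's internals):

  ```scala
  import scala.collection.mutable

  // A buffer of (hash, value) pairs already sorted by key hash.
  case class StreamBuffer(pairs: mutable.Queue[(Int, String)]) {
    def minKeyHash: Int = pairs.head._1 // only meaningful for non-empty buffers
  }

  // Min-heap on the smallest key hash of each buffer.
  implicit val ord: Ordering[StreamBuffer] = Ordering.by(b => -b.minKeyHash)

  def mergeNext(queue: mutable.PriorityQueue[StreamBuffer]): Option[(Int, String)] =
    if (queue.isEmpty) None
    else {
      val buf  = queue.dequeue()
      val pair = buf.pairs.dequeue()
      // The invariant: a drained buffer is never re-enqueued, so no sentinel
      // hash code is needed and Int.MaxValue stays a legal key hash.
      if (buf.pairs.nonEmpty) queue.enqueue(buf)
      Some(pair)
    }
  ```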
* SPARK-1111: URL Validation Throws Error for HDFS URLs | Patrick Wendell | 2014-02-21 | 2 files | -9/+42

  Fixes an error where HDFS URLs cause an exception. Should be merged into master and 0.9.

  Author: Patrick Wendell <pwendell@gmail.com>

  Closes #625 from pwendell/url-validation and squashes the following commits:

  d14bfe3 [Patrick Wendell] SPARK-1111: URL Validation Throws Error for HDFS URL's
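  A sketch of scheme-aware validation in the spirit of the fix; the helper name and the list of accepted schemes are illustrative assumptions, not the actual patch:

  ```scala
  import java.net.{URI, URISyntaxException}

  // Accept hdfs:// (and friends) instead of treating anything non-HTTP as an error.
  def isSupportedUri(path: String): Boolean =
    try {
      new URI(path).getScheme match {
        case null | "file" | "local" | "hdfs" | "http" | "https" | "ftp" => true
        case _ => false
      }
    } catch {
      case _: URISyntaxException => false
    }
  ```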
* Super minor: Add require for mergeCombiners in combineByKey | Aaron Davidson | 2014-02-20 | 1 file | -0/+1

  We changed the behavior in 0.9.0 from requiring that mergeCombiners be null when mapSideCombine was false to requiring that mergeCombiners *never* be null, for external sorting. This patch adds a require() so that the behavior change surfaces as an explicit error message rather than an NPE.

  Author: Aaron Davidson <aaron@databricks.com>

  Closes #623 from aarondav/master and squashes the following commits:

  520b80c [Aaron Davidson] Super minor: Add require for mergeCombiners in combineByKey
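  The change is essentially one precondition; a sketch of where it sits (`PairOps` is a stand-in for PairRDDFunctions, and the body is elided):

  ```scala
  class PairOps[K, V] {
    def combineByKey[C](createCombiner: V => C,
                        mergeValue: (C, V) => C,
                        mergeCombiners: (C, C) => C): Unit = {
      // Fail with a clear message instead of an NPE deep inside external sorting.
      require(mergeCombiners != null, "mergeCombiners must be defined")
      // ... combine logic elided ...
    }
  }
  ```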
* Optimized imports | NirmalReddy | 2014-02-18 | 246 files | -552/+446

  Optimized imports and arranged them according to the Scala style guide at https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide#SparkCodeStyleGuide-Imports

  Author: NirmalReddy <nirmal.reddy@imaginea.com>
  Author: NirmalReddy <nirmal_reddy2000@yahoo.com>

  Closes #613 from NirmalReddy/opt-imports and squashes the following commits:

  578b4f5 [NirmalReddy] imported java.lang.Double as JDouble
  a2cbcc5 [NirmalReddy] addressed the comments
  776d664 [NirmalReddy] Optimized imports in core
* SPARK-1098: Minor cleanup of ClassTag usage in Java API | Aaron Davidson | 2014-02-17 | 4 files | -100/+108

  Our usage of fake ClassTags in this manner is probably not healthy, but I'm not sure if there's a better solution available, so I just cleaned up and documented the current one.

  Author: Aaron Davidson <aaron@databricks.com>

  Closes #604 from aarondav/master and squashes the following commits:

  b398e89 [Aaron Davidson] SPARK-1098: Minor cleanup of ClassTag usage in Java API
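  The "fake ClassTag" pattern being documented, in one line (a sketch: erased element types on the Java side get a ClassTag manufactured for AnyRef):

  ```scala
  import scala.reflect.ClassTag

  // Java callers cannot supply a ClassTag, so the wrapper conjures one. Safe in
  // practice because the tag is only used to create Object arrays under erasure.
  def fakeClassTag[T]: ClassTag[T] = ClassTag.AnyRef.asInstanceOf[ClassTag[T]]
  ```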
* Worker registration logging fix | Andrew Ash | 2014-02-17 | 1 file | -1/+1

  Author: Andrew Ash <andrew@andrewash.com>

  Closes #608 from ash211/patch-7 and squashes the following commits:

  bd85f2a [Andrew Ash] Worker registration logging fix
* Add subtractByKey to the JavaPairRDD wrapper | Punya Biswal | 2014-02-16 | 1 file | -0/+23

  Author: Punya Biswal <pbiswal@palantir.com>

  Closes #600 from punya/subtractByKey-java and squashes the following commits:

  e961913 [Punya Biswal] Hide implicit ClassTags from Java API
  c5d317b [Punya Biswal] Add subtractByKey to the JavaPairRDD wrapper
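  A sketch of the wrapper pattern: the Java-facing method hides the ClassTag plumbing and delegates to the Scala RDD. `PairWrapper` is a stand-in class, not the real JavaPairRDD:

  ```scala
  import scala.reflect.ClassTag
  import org.apache.spark.SparkContext._ // rddToPairRDDFunctions in the 1.x-era API
  import org.apache.spark.rdd.RDD

  class PairWrapper[K: ClassTag, V: ClassTag](rdd: RDD[(K, V)]) {
    // Keys present in `other` are dropped; ClassTags never surface to Java.
    def subtractByKey[W: ClassTag](other: RDD[(K, W)]): RDD[(K, V)] =
      rdd.subtractByKey(other)
  }
  ```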
* fix for https://spark-project.atlassian.net/browse/SPARK-1052 | Bijay Bisht | 2014-02-16 | 1 file | -7/+2

  Author: Bijay Bisht <bijay.bisht@gmail.com>

  Closes #568 from bijaybisht/SPARK-1052 and squashes the following commits:

  da70395 [Bijay Bisht] fix for https://spark-project.atlassian.net/browse/SPARK-1052 - comments incorporated
  fdb1d94 [Bijay Bisht] fix for https://spark-project.atlassian.net/browse/SPARK-1052

  (cherry picked from commit e797c1abd9692f1b7ec290e4c83d31fd106e6b05)
  Signed-off-by: Aaron Davidson <aaron@databricks.com>
* [SPARK-1092] print warning information if user uses SPARK_MEM to regulate executor memory usage | CodingCat | 2014-02-16 | 1 file | -0/+5

  https://spark-project.atlassian.net/browse/SPARK-1092?jql=project%20%3D%20SPARK

  Print warning information if the user sets SPARK_MEM to regulate the memory usage of executors.

  ----

  OUTDATED: Currently, users will usually set SPARK_MEM to control the memory usage of driver programs (in spark-class):

      91 JAVA_OPTS="$OUR_JAVA_OPTS"
      92 JAVA_OPTS="$JAVA_OPTS -Djava.library.path=$SPARK_LIBRARY_PATH"
      93 JAVA_OPTS="$JAVA_OPTS -Xms$SPARK_MEM -Xmx$SPARK_MEM"

  If they didn't set spark.executor.memory, the value in this environment variable will also affect the memory usage of executors, because of the following lines in SparkContext:

      private[spark] val executorMemory = conf.getOption("spark.executor.memory")
        .orElse(Option(System.getenv("SPARK_MEM")))
        .map(Utils.memoryStringToMb)
        .getOrElse(512)

  Also, since SPARK_MEM has been (proposed to be) deprecated in SPARK-929 (https://spark-project.atlassian.net/browse/SPARK-929) and the corresponding PR (https://github.com/apache/incubator-spark/pull/104), we should remove this line.

  Author: CodingCat <zhunansjtu@gmail.com>

  Closes #602 from CodingCat/clean_spark_mem and squashes the following commits:

  302bb28 [CodingCat] print warning information if user use SPARK_MEM to regulate executor memory usage
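  A minimal sketch of the warning; the placement (inside SparkContext, where `conf` and Logging's `logWarning` are in scope) and the message wording are illustrative assumptions:

  ```scala
  // Illustrative: warn when SPARK_MEM silently drives executor sizing.
  if (conf.getOption("spark.executor.memory").isEmpty &&
      Option(System.getenv("SPARK_MEM")).isDefined) {
    logWarning("SPARK_MEM was set but spark.executor.memory was not; " +
      "SPARK_MEM is deprecated, please use spark.executor.memory instead.")
  }
  ```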
* Merge pull request #591 from mengxr/transient-new. | Xiangrui Meng | 2014-02-12 | 2 files | -2/+2

  SPARK-1076: [Fix #578] add @transient to some vals

  I'll try to be more careful next time.

  Author: Xiangrui Meng <meng@databricks.com>

  Closes #591 and squashes the following commits:

  2b4f044 [Xiangrui Meng] add @transient to prev in ZippedWithIndexRDD; add @transient to seed in PartitionwiseSampledRDD
* Merge pull request #589 from mengxr/index. | Xiangrui Meng | 2014-02-12 | 1 file | -1/+1

  SPARK-1076: Convert Int to Long to avoid overflow

  Patch for PR #578.

  Author: Xiangrui Meng <meng@databricks.com>

  Closes #589 and squashes the following commits:

  98c435e [Xiangrui Meng] cast Int to Long to avoid Int overflow
* Merge pull request #578 from mengxr/rank. | Xiangrui Meng | 2014-02-12 | 5 files | -12/+139

  SPARK-1076: zipWithIndex and zipWithUniqueId to RDD

  Assigning ranks to an ordered or unordered data set is a common operation. This could be done by first counting records in each partition and then assigning ranks in parallel. The purpose of assigning ranks to an unordered set is usually to get a unique id for each item, e.g., to map feature names to feature indices. In such cases, the assignment could be done without counting records, saving one Spark job.

  https://spark-project.atlassian.net/browse/SPARK-1076

  == update ==

  Because assigning ranks is very similar to Scala's zipWithIndex, I changed the method name to zipWithIndex and put the index in the value field.

  Author: Xiangrui Meng <meng@databricks.com>

  Closes #578 and squashes the following commits:

  52a05e1 [Xiangrui Meng] changed assignRanks to zipWithIndex, changed assignUniqueIds to zipWithUniqueId, minor updates
  756881c [Xiangrui Meng] simplified RankedRDD by implementing assignUniqueIds separately, moved counting iterator size to Utils, do not count items in the last partition and skip counting if there is only one partition
  630868c [Xiangrui Meng] newline
  21b434b [Xiangrui Meng] add assignRanks and assignUniqueIds to RDD
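  Usage of the two new operators, per the description above, assuming a SparkContext `sc` (the zipWithUniqueId output is illustrative; its ids are unique but not consecutive, which is what saves the extra counting job):

  ```scala
  val rdd = sc.parallelize(Seq("a", "b", "c"), 2)

  // Runs one extra job to count per-partition sizes, so indices are consecutive.
  rdd.zipWithIndex().collect()    // Array((a,0), (b,1), (c,2))

  // No counting job: item k in partition i gets id k * numPartitions + i.
  rdd.zipWithUniqueId().collect() // e.g. Array((a,0), (b,1), (c,3))
  ```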
* Merge pull request #583 from colorant/zookeeper. | Raymond Liu | 2014-02-11 | 1 file | -1/+1

  Minor fix for ZooKeeperPersistenceEngine to use configured working dir

  Author: Raymond Liu <raymond.liu@intel.com>

  Closes #583 and squashes the following commits:

  91b0609 [Raymond Liu] Minor fix for ZooKeeperPersistenceEngine to use configured working dir
* Merge pull request #571 from holdenk/switchtobinarysearch. | Holden Karau | 2014-02-11 | 3 files | -5/+91

  SPARK-1072 Use binary search when needed in RangePartitioner

  Author: Holden Karau <holden@pigscanfly.ca>

  Closes #571 and squashes the following commits:

  f31a2e1 [Holden Karau] Switch to using CollectionsUtils in Partitioner
  4c7a0c3 [Holden Karau] Add CollectionsUtil as suggested by aarondav
  7099962 [Holden Karau] Add the binary search to only init once
  1bef01d [Holden Karau] CR feedback
  a21e097 [Holden Karau] Use binary search if we have more than 1000 elements inside of RangePartitioner
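  The lookup being switched, sketched for Int keys (the real code goes through CollectionsUtils and, per the squash log, only uses binary search past a size threshold of about 1000 elements):

  ```scala
  import java.util.Arrays

  // rangeBounds holds the sorted upper bound of each partition except the last.
  def partitionFor(key: Int, rangeBounds: Array[Int]): Int = {
    val pos = Arrays.binarySearch(rangeBounds, key)
    // A negative result encodes (-(insertionPoint) - 1) per the binarySearch contract.
    if (pos >= 0) pos else -pos - 1
  }
  ```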
* Revert "Merge pull request #560 from pwendell/logging. Closes #560." | Patrick Wendell | 2014-02-09 | 1 file | -5/+2

  This reverts commit b6d40b782327188a25ded5b22790552121e5271f.
* Merge pull request #567 from ScrapCodes/style2. | Prashant Sharma | 2014-02-09 | 21 files | -76/+80

  SPARK-1058, Fix Style Errors and Add Scala Style to Spark Build. Pt 2

  Continuation of PR #557. With this, all Scala style errors are fixed across the code base! The reason for creating a separate PR was to not interrupt an already reviewed and ready-to-merge PR. Hope this gets reviewed soon and merged too.

  Author: Prashant Sharma <prashant.s@imaginea.com>

  Closes #567 and squashes the following commits:

  3b1ec30 [Prashant Sharma] scala style fixes
* Merge pull request #551 from qqsun8819/json-protocol. | qqsun8819 | 2014-02-09 | 2 files | -8/+109

  [SPARK-1038] Add more fields in JsonProtocol and add tests that verify the JSON itself

  This is a PR for SPARK-1038. Two major changes:

  1. Add some fields to JsonProtocol which are new and important to standalone-related data structures.
  2. Use Diff in liftweb.json to verify the stringified JSON output, so as to detect someone changing a type T to Option[T].

  Author: qqsun8819 <jin.oyj@alibaba-inc.com>

  Closes #551 and squashes the following commits:

  fdf0b4e [qqsun8819] [SPARK-1038] 1. Change code style for more readable according to rxin review 2. change submitdate hard-coded string to a date object toString for more complexiblity
  095a26f [qqsun8819] [SPARK-1038] mod according to review of pwendel, use hard-coded json string for json data validation. Each test use its own json string
  0524e41 [qqsun8819] Merge remote-tracking branch 'upstream/master' into json-protocol
  d203d5c [qqsun8819] [SPARK-1038] Add more fields in JsonProtocol and add tests that verify the JSON itself
* Merge pull request #557 from ScrapCodes/style. Closes #557. | Patrick Wendell | 2014-02-09 | 84 files | -322/+465

  SPARK-1058, Fix Style Errors and Add Scala Style to Spark Build.

  Author: Patrick Wendell <pwendell@gmail.com>
  Author: Prashant Sharma <scrapcodes@gmail.com>

  == Merge branch commits ==

  commit 1a8bd1c059b842cb95cc246aaea74a79fec684f4
  Author: Prashant Sharma <scrapcodes@gmail.com>
  Date: Sun Feb 9 17:39:07 2014 +0530

      scala style fixes

  commit f91709887a8e0b608c5c2b282db19b8a44d53a43
  Author: Patrick Wendell <pwendell@gmail.com>
  Date: Fri Jan 24 11:22:53 2014 -0800

      Adding scalastyle snapshot
* Merge pull request #556 from CodingCat/JettyUtil. Closes #556. | CodingCat | 2014-02-08 | 5 files | -9/+12

  [SPARK-1060] startJettyServer should explicitly use IP information

  https://spark-project.atlassian.net/browse/SPARK-1060

  In the current implementation, the webserver in Master/Worker is started with:

      val (srv, bPort) = JettyUtils.startJettyServer("0.0.0.0", port, handlers)

  Inside startJettyServer:

      val server = new Server(currentPort) // here, the Server takes "0.0.0.0" as the hostname, i.e. it will always bind to the IP address of the first NIC

  This can cause wrong IP binding. E.g., if the host has two NICs, N1 and N2, and the user specifies SPARK_LOCAL_IP as N2's IP address, then when starting the web server, for the reason stated above, it will always bind to N1's address.

  Author: CodingCat <zhunansjtu@gmail.com>

  == Merge branch commits ==

  commit 6c6d9a8ccc9ec4590678a3b34cb03df19092029d
  Author: CodingCat <zhunansjtu@gmail.com>
  Date: Thu Feb 6 14:53:34 2014 -0500

      startJettyServer should explicitly use IP information
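  The gist of the fix, sketched against the Jetty API (host resolution and port are illustrative): bind to an explicit address instead of the port-only constructor.

  ```scala
  import java.net.InetSocketAddress
  import org.eclipse.jetty.server.Server

  val host = sys.env.getOrElse("SPARK_LOCAL_IP", "127.0.0.1") // illustrative
  val server = new Server(new InetSocketAddress(host, 8080))  // binds to that NIC only
  server.start()
  ```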
* Merge pull request #560 from pwendell/logging. Closes #560. | Patrick Wendell | 2014-02-08 | 1 file | -2/+5

  [WIP] SPARK-1067: Default log4j initialization causes errors for those not using log4j

  To fix this - we add a check when initializing log4j.

  Author: Patrick Wendell <pwendell@gmail.com>

  == Merge branch commits ==

  commit ffdce513877f64b6eed6d36138c3e0003d392889
  Author: Patrick Wendell <pwendell@gmail.com>
  Date: Fri Feb 7 15:22:29 2014 -0800

      Logging fix
* Merge pull request #542 from markhamstra/versionBump. Closes #542. | Mark Hamstra | 2014-02-08 | 1 file | -1/+1

  Version number to 1.0.0-SNAPSHOT

  Since 0.9.0-incubating is done and out the door, we shouldn't be building 0.9.0-incubating-SNAPSHOT anymore. @pwendell

  Author: Mark Hamstra <markhamstra@gmail.com>

  == Merge branch commits ==

  commit 1b00a8a7c1a7f251b4bb3774b84b9e64758eaa71
  Author: Mark Hamstra <markhamstra@gmail.com>
  Date: Wed Feb 5 09:30:32 2014 -0800

      Version number to 1.0.0-SNAPSHOT
* Merge pull request #561 from Qiuzhuang/master. Closes #561. | Qiuzhuang Lian | 2014-02-08 | 1 file | -0/+1

  Kill drivers in postStop() for Worker.

  JIRA SPARK-1068: https://spark-project.atlassian.net/browse/SPARK-1068

  Author: Qiuzhuang Lian <Qiuzhuang.Lian@gmail.com>

  == Merge branch commits ==

  commit 9c19ce63637eee9369edd235979288d3d9fc9105
  Author: Qiuzhuang Lian <Qiuzhuang.Lian@gmail.com>
  Date: Sat Feb 8 16:07:39 2014 +0800

      Kill drivers in postStop() for Worker.
      JIRA SPARK-1068: https://spark-project.atlassian.net/browse/SPARK-1068
* Merge pull request #506 from ash211/intersection. Closes #506. | Andrew Ash | 2014-02-06 | 2 files | -2/+58

  SPARK-1062 Add rdd.intersection(otherRdd) method

  Author: Andrew Ash <andrew@andrewash.com>

  == Merge branch commits ==

  commit 5d9982b171b9572649e9828f37ef0b43f0242912
  Author: Andrew Ash <andrew@andrewash.com>
  Date: Thu Feb 6 18:11:45 2014 -0800

      Minor fixes
      - style: (v,null) => (v, null)
      - mention the shuffle in Javadoc

  commit b86d02f14e810902719cef893cf6bfa18ff9acb0
  Author: Andrew Ash <andrew@andrewash.com>
  Date: Sun Feb 2 13:17:40 2014 -0800

      Overload .intersection() for numPartitions and custom Partitioner

  commit bcaa34911fcc6bb5bc5e4f9fe46d1df73cb71c09
  Author: Andrew Ash <andrew@andrewash.com>
  Date: Sun Feb 2 13:05:40 2014 -0800

      Better naming of parameters in intersection's filter

  commit b10a6af2d793ec6e9a06c798007fac3f6b860d89
  Author: Andrew Ash <andrew@andrewash.com>
  Date: Sat Jan 25 23:06:26 2014 -0800

      Follow spark code format conventions of tab => 2 spaces

  commit 965256e4304cca514bb36a1a36087711dec535ec
  Author: Andrew Ash <andrew@andrewash.com>
  Date: Fri Jan 24 00:28:01 2014 -0800

      Add rdd.intersection(otherRdd) method
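  The cogroup-based idea behind the new method, plus the (v, null) trick from the style commit, sketched as a free-standing function (not the exact implementation):

  ```scala
  import scala.reflect.ClassTag
  import org.apache.spark.SparkContext._ // pair-RDD implicits in the 1.x-era API
  import org.apache.spark.rdd.RDD

  // Keep values that appear on both sides, deduplicated; as the Javadoc note
  // in the squash log says, this performs a shuffle.
  def intersection[T: ClassTag](a: RDD[T], b: RDD[T]): RDD[T] =
    a.map(v => (v, null)).cogroup(b.map(v => (v, null)))
      .filter { case (_, (left, right)) => left.nonEmpty && right.nonEmpty }
      .keys
  ```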
* Merge pull request #533 from andrewor14/master. Closes #533. | Andrew Or | 2014-02-06 | 3 files | -97/+92

  External spilling - generalize batching logic

  The existing implementation consists of a hack for Kryo specifically and only works for LZF compression. Introducing an intermediate batch-level stream takes care of pre-fetching and other arbitrary behavior of higher level streams in a more general way.

  Author: Andrew Or <andrewor14@gmail.com>

  == Merge branch commits ==

  commit 3ddeb7ef89a0af2b685fb5d071aa0f71c975cc82
  Author: Andrew Or <andrewor14@gmail.com>
  Date: Wed Feb 5 12:09:32 2014 -0800

      Also privatize fields

  commit 090544a87a0767effd0c835a53952f72fc8d24f0
  Author: Andrew Or <andrewor14@gmail.com>
  Date: Wed Feb 5 10:58:23 2014 -0800

      Privatize methods

  commit 13920c918efe22e66a1760b14beceb17a61fd8cc
  Author: Andrew Or <andrewor14@gmail.com>
  Date: Tue Feb 4 16:34:15 2014 -0800

      Update docs

  commit bd5a1d7350467ed3dc19c2de9b2c9f531f0e6aa3
  Author: Andrew Or <andrewor14@gmail.com>
  Date: Tue Feb 4 13:44:24 2014 -0800

      Typo: phyiscal -> physical

  commit 287ef44e593ad72f7434b759be3170d9ee2723d2
  Author: Andrew Or <andrewor14@gmail.com>
  Date: Tue Feb 4 13:38:32 2014 -0800

      Avoid reading the entire batch into memory; also simplify streaming logic
      Additionally, address formatting comments.

  commit 3df700509955f7074821e9aab1e74cb53c58b5a5
  Merge: a531d2e 164489d
  Author: Andrew Or <andrewor14@gmail.com>
  Date: Mon Feb 3 18:27:49 2014 -0800

      Merge branch 'master' of github.com:andrewor14/incubator-spark

  commit a531d2e347acdcecf2d0ab72cd4f965ab5e145d8
  Author: Andrew Or <andrewor14@gmail.com>
  Date: Mon Feb 3 18:18:04 2014 -0800

      Relax assumptions on compressors and serializers when batching

      This commit introduces an intermediate layer of an input stream on the batch level. This guards against interference from higher level streams (i.e. compression and deserialization streams), especially pre-fetching, without specifically targeting particular libraries (Kryo) and forcing shuffle spill compression to use LZF.

  commit 164489d6f176bdecfa9dabec2dfce5504d1ee8af
  Author: Andrew Or <andrewor14@gmail.com>
  Date: Mon Feb 3 18:18:04 2014 -0800

      Relax assumptions on compressors and serializers when batching

      This commit introduces an intermediate layer of an input stream on the batch level. This guards against interference from higher level streams (i.e. compression and deserialization streams), especially pre-fetching, without specifically targeting particular libraries (Kryo) and forcing shuffle spill compression to use LZF.
* Merge pull request #450 from kayousterhout/fetch_failures. Closes #450. | Kay Ousterhout | 2014-02-06 | 1 file | -22/+11

  Only run ResubmitFailedStages event after a fetch fails

  Previously, the ResubmitFailedStages event was called every 200 milliseconds, leading to a lot of unnecessary event processing and clogged DAGScheduler logs.

  Author: Kay Ousterhout <kayousterhout@gmail.com>

  == Merge branch commits ==

  commit e603784b3a562980e6f1863845097effe2129d3b
  Author: Kay Ousterhout <kayousterhout@gmail.com>
  Date: Wed Feb 5 11:34:41 2014 -0800

      Re-add check for empty set of failed stages

  commit d258f0ef50caff4bbb19fb95a6b82186db1935bf
  Author: Kay Ousterhout <kayousterhout@gmail.com>
  Date: Wed Jan 15 23:35:41 2014 -0800

      Only run ResubmitFailedStages event after a fetch fails

      Previously, the ResubmitFailedStages event was called every 200 milliseconds, leading to a lot of unnecessary event processing and clogged DAGScheduler logs.
* Merge pull request #321 from kayousterhout/ui_kill_fix. Closes #321. | Kay Ousterhout | 2014-02-06 | 9 files | -144/+183

  Inform DAG scheduler about all started/finished tasks.

  Previously, the DAG scheduler was not always informed when tasks started and finished. The simplest example here is for speculated tasks: the DAGScheduler was only told about the first attempt of a task, meaning that SparkListeners were also not told about multiple task attempts, so users can't see what's going on with speculation in the UI. The DAGScheduler also wasn't always told about finished tasks, so in the UI, some tasks will never be shown as finished (this occurs, for example, if a task set gets killed). The other problem is that the fairness accounting was wrong -- the number of running tasks in a pool was decreased when a task set was considered done, even if all of its tasks hadn't yet finished.

  Author: Kay Ousterhout <kayousterhout@gmail.com>

  == Merge branch commits ==

  commit c8d547d0f7a17f5a193bef05f5872b9f475675c5
  Author: Kay Ousterhout <kayousterhout@gmail.com>
  Date: Wed Jan 15 16:47:33 2014 -0800

      Addressed Reynold's review comments.

      Always use a TaskEndReason (remove the option), and explicitly signal when we don't know the reason. Also, always tell DAGScheduler (and associated listeners) about started tasks, even when they're speculated.

  commit 3fee1e2e3c06b975ff7f95d595448f38cce97a04
  Author: Kay Ousterhout <kayousterhout@gmail.com>
  Date: Wed Jan 8 22:58:13 2014 -0800

      Fixed broken test and improved logging

  commit ff12fcaa2567c5d02b75a1d5db35687225bcd46f
  Author: Kay Ousterhout <kayousterhout@gmail.com>
  Date: Sun Dec 29 21:08:20 2013 -0800

      Inform DAG scheduler about all finished tasks.

      Previously, the DAG scheduler was not always informed when tasks finished. For example, when a task set was aborted, the DAG scheduler was never told when the tasks in that task set finished. The DAG scheduler was also never told about the completion of speculated tasks. This led to confusion with SparkListeners because information about the completion of those tasks was never passed on to the listeners (so in the UI, for example, some tasks will never be shown as finished). The other problem is that the fairness accounting was wrong -- the number of running tasks in a pool was decreased when a task set was considered done, even if all of its tasks hadn't yet finished.
* Merge pull request #554 from sryza/sandy-spark-1056. Closes #554. | Sandy Ryza | 2014-02-06 | 1 file | -1/+1

  SPARK-1056. Fix header comment in Executor to not imply that it's only used for Mesos and Standalone.

  Author: Sandy Ryza <sandy@cloudera.com>

  == Merge branch commits ==

  commit 1f2443d902a26365a5c23e4af9077e1539ed2eab
  Author: Sandy Ryza <sandy@cloudera.com>
  Date: Thu Feb 6 15:03:50 2014 -0800

      SPARK-1056. Fix header comment in Executor to not imply that it's only used for Mesos and Standalone
* Merge pull request #545 from kayousterhout/fix_progress. Closes #545. | Kay Ousterhout | 2014-02-05 | 1 file | -1/+1

  Fix off-by-one error with task progress info log.

  Author: Kay Ousterhout <kayousterhout@gmail.com>

  == Merge branch commits ==

  commit 29798fc685c4e7e3eb3bf91c75df7fa8ec94a235
  Author: Kay Ousterhout <kayousterhout@gmail.com>
  Date: Wed Feb 5 13:40:01 2014 -0800

      Fix off-by-one error with task progress info log.
* Merge pull request #549 from CodingCat/deadcode_master. Closes #549. | CodingCat | 2014-02-05 | 1 file | -3/+0

  Remove actorToWorker in master.scala, which is actually not used

  actorToWorker is actually not used in the code... just remove it.

  Author: CodingCat <zhunansjtu@gmail.com>

  == Merge branch commits ==

  commit 52656c2d4bbf9abcd8bef65d454badb9cb14a32c
  Author: CodingCat <zhunansjtu@gmail.com>
  Date: Thu Feb 6 00:28:26 2014 -0500

      remove actorToWorker in master.scala, which is actually not used
* Merge pull request #544 from kayousterhout/fix_test_warnings. Closes #544. | Kay Ousterhout | 2014-02-05 | 2 files | -2/+1

  Fixed warnings in test compilation.

  This commit fixes two problems: a redundant import, and a deprecated function.

  Author: Kay Ousterhout <kayousterhout@gmail.com>

  == Merge branch commits ==

  commit da9d2e13ee4102bc58888df0559c65cb26232a82
  Author: Kay Ousterhout <kayousterhout@gmail.com>
  Date: Wed Feb 5 11:41:51 2014 -0800

      Fixed warnings in test compilation.

      This commit fixes two problems: a redundant import, and a deprecated function.