spark - Mirror of Apache Spark

	Commit message (Collapse)	Author	Age	Files	Lines
*	Merge pull request #73 from falaki/ApproximateDistinctCount	Reynold Xin	2013-12-31	12	-233/+1595
\|\ \| \| \| \| \| \| \| \| \| \|	Approximate distinct count Added countApproxDistinct() to RDD and countApproxDistinctByKey() to PairRDDFunctions to approximately count distinct number of elements and distinct number of values per key, respectively. Both functions use HyperLogLog from stream-lib for counting. Both functions take a parameter that controls the trade-off between accuracy and memory consumption. Also added Scala docs and test suites for both methods.
\| *	Made the code more compact and readable	Hossein Falaki	2013-12-31	3	-23/+8
\| \|
\| *	minor improvements	Hossein Falaki	2013-12-31	2	-4/+5
\| \|
\| *	Added Java unit tests for countApproxDistinct and countApproxDistinctByKey	Hossein Falaki	2013-12-30	1	-0/+32
\| \|
\| *	Added Java API for countApproxDistinct	Hossein Falaki	2013-12-30	1	-0/+11
\| \|
\| *	Added Java API for countApproxDistinctByKey	Hossein Falaki	2013-12-30	1	-0/+36
\| \|
\| *	Added stream 2.5.1 jar dependency	Hossein Falaki	2013-12-30	1	-1/+2
\| \|
\| *	Renamed countDistinct and countDistinctByKey methods to include Approx	Hossein Falaki	2013-12-30	5	-15/+15
\| \|
\| *	Using origin version	Hossein Falaki	2013-12-30	374	-8424/+19051
\| \|\
\| * \|	Removed superfluous abs call from test cases.	Hossein Falaki	2013-12-10	1	-2/+2
\| \| \|
\| * \|	Made SerializableHyperLogLog Externalizable and added Kryo tests	Hossein Falaki	2013-10-18	2	-5/+10
\| \| \|
\| * \|	Added stream-lib dependency to Maven build	Hossein Falaki	2013-10-18	2	-0/+9
\| \| \|
\| * \|	Improved code style.	Hossein Falaki	2013-10-17	4	-15/+19
\| \| \|
\| * \|	Fixed document typo	Hossein Falaki	2013-10-17	2	-4/+4
\| \| \|
\| * \|	Added dependency on stream-lib version 2.4.0 for approximate distinct count ↵	Hossein Falaki	2013-10-17	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \|	support.
\| * \|	Added countDistinctByKey to PairRDDFunctions that counts the approximate ↵	Hossein Falaki	2013-10-17	2	-0/+81
\| \| \| \| \| \| \| \| \| \| \| \|	number of unique values for each key in the RDD.
\| * \|	Added a countDistinct method to RDD that takes an accuracy parameter ↵	Hossein Falaki	2013-10-17	2	-1/+38
\| \| \| \| \| \| \| \| \| \| \| \|	and returns the (approximate) number of distinct elements in the RDD.
\| * \|	Added a serializable wrapper for HyperLogLog	Hossein Falaki	2013-10-17	1	-0/+44
\| \| \|
* \| \|	Merge pull request #238 from ngbinh/upgradeNetty	Patrick Wendell	2013-12-31	8	-44/+60
\|\ \ \ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	upgrade Netty from 4.0.0.Beta2 to 4.0.13.Final the changes are listed at https://github.com/netty/netty/wiki/New-and-noteworthy
\| * \| \|	Fix failed unit tests	Binh Nguyen	2013-12-27	3	-13/+24
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Also clean up a bit.
\| * \| \|	Fix imports order	Binh Nguyen	2013-12-24	3	-5/+2
\| \| \| \|
\| * \| \|	Remove import * and fix some formatting	Binh Nguyen	2013-12-24	2	-7/+4
\| \| \| \|
\| * \| \|	upgrade Netty from 4.0.0.Beta2 to 4.0.13.Final	Binh Nguyen	2013-12-24	7	-31/+42
\| \| \| \|
* \| \| \|	Merge pull request #289 from tdas/filestream-fix	Patrick Wendell	2013-12-31	14	-196/+269
\|\ \ \ \ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Bug fixes for file input stream and checkpointing - Fixed bugs in the file input stream that led the stream to fail due to transient HDFS errors (listing files while a background thread is deleting them fails, which caused errors, etc.) - Updated Spark's CheckpointRDD and Streaming's CheckpointWriter to use SparkContext.hadoopConfiguration, to allow checkpoints to be written to any HDFS compatible store requiring special configuration. - Changed the API of SparkContext.setCheckpointDir() - eliminated the unnecessary 'useExisting' parameter. Now SparkContext will always create a unique subdirectory within the user specified checkpoint directory. This is to ensure that previous checkpoint files are not accidentally overwritten. - Fixed bug where setting checkpoint directory as a relative local path caused the checkpointing to fail.
\| * \| \| \|	Fixed comments and long lines based on comments on PR 289.	Tathagata Das	2013-12-31	4	-10/+19
\| \| \| \| \|
\| * \| \| \|	Minor changes in comments and strings to address comments in PR 289.	Tathagata Das	2013-12-27	1	-8/+6
\| \| \| \| \|
\| * \| \| \|	Added warning if filestream adds files with no data in them (file RDDs have ↵	Tathagata Das	2013-12-26	1	-0/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	0 partitions).
\| * \| \| \|	Changed file stream to not catch any exceptions related to finding new files ↵	Tathagata Das	2013-12-26	1	-19/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	(FileNotFound exception is still caught and ignored).
\| * \| \| \|	Removed slack time in file stream and added better handling of exceptions ↵	Tathagata Das	2013-12-26	3	-50/+21
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	due to failures due to FileNotFound exceptions.
\| * \| \| \|	Fixed Python API for sc.setCheckpointDir. Also other fixes based on ↵	Tathagata Das	2013-12-24	7	-22/+16
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Reynold's comments on PR 289.
\| * \| \| \|	Merge branch 'apache-master' into filestream-fix	Tathagata Das	2013-12-24	37	-123/+465
\| \|\ \ \ \ \| \| \| \|_\|/ \| \| \|/\| \|
\| * \| \| \|	Minor formatting fixes.	Tathagata Das	2013-12-23	3	-9/+13
\| \| \| \| \|
\| * \| \| \|	Updated testsuites to work with the slack time of file stream.	Tathagata Das	2013-12-23	3	-2/+22
\| \| \| \| \|
\| * \| \| \|	Merge branch 'scheduler-update' into filestream-fix	Tathagata Das	2013-12-23	3	-4/+26
\| \|\ \ \ \
\| * \| \| \| \|	Fixed bug in file stream that prevented some files from being read	Tathagata Das	2013-12-23	1	-9/+12
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	correctly.
\| * \| \| \| \|	Updated CheckpointWriter and FileInputDStream to be robust against failed ↵	Tathagata Das	2013-12-22	3	-35/+78
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	FileSystem objects. Refactored JobGenerator to use actor so that all updating of DStream's metadata is single threaded.
\| * \| \| \| \|	Merge branch 'scheduler-update' into filestream-fix	Tathagata Das	2013-12-22	2	-1/+6
\| \|\ \ \ \ \
\| * \ \ \ \ \	Merge branch 'scheduler-update' into filestream-fix	Tathagata Das	2013-12-19	224	-3164/+4050
\| \|\ \ \ \ \ \ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Conflicts: core/src/main/scala/org/apache/spark/rdd/CheckpointRDD.scala streaming/src/main/scala/org/apache/spark/streaming/StreamingContext.scala streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala streaming/src/test/scala/org/apache/spark/streaming/CheckpointSuite.scala
\| * \| \| \| \| \| \|	Fixed multiple file stream and checkpointing bugs.	Tathagata Das	2013-12-11	10	-117/+159
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	- Made file stream more robust to transient failures. - Changed Spark.setCheckpointDir API to not have the second 'useExisting' parameter. Spark will always create a unique directory for checkpointing underneath the directory provided to the function. - Fixed bug wrt local relative paths as checkpoint directory. - Made DStream and RDD checkpointing use SparkContext.hadoopConfiguration, so that more HDFS compatible filesystems are supported for checkpointing.
* \| \| \| \| \| \| \|	Merge pull request #308 from kayousterhout/stage_naming	Patrick Wendell	2013-12-30	7	-14/+18
\|\ \ \ \ \ \ \ \ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Changed naming of StageCompleted event to be consistent The rest of the SparkListener events are named with "SparkListener" as the prefix of the name; this commit renames the StageCompleted event to SparkListenerStageCompleted for consistency.
\| * \| \| \| \| \| \| \|	Updated code style according to Patrick's comments	Kay Ousterhout	2013-12-29	1	-4/+2
\| \| \| \| \| \| \| \| \|
\| * \| \| \| \| \| \| \|	Changed naming of StageCompleted event to be consistent	Kay Ousterhout	2013-12-27	7	-14/+20
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The rest of the SparkListener events are named with "SparkListener" as the prefix of the name; this commit renames the StageCompleted event to SparkListenerStageCompleted for consistency.
* \| \| \| \| \| \| \| \|	Revert "Merge pull request #310 from jyunfan/master"	Reynold Xin	2013-12-28	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This reverts commit 79b20e4dbe3dcd8559ec8316784d3334bb55868b, reversing changes made to 7375047d516c5aa69221611f5f7b0f1d367039af.
* \| \| \| \| \| \| \| \|	Merge pull request #310 from jyunfan/master	Reynold Xin	2013-12-28	1	-1/+1
\|\ \ \ \ \ \ \ \ \ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Fix typo in the Accumulators section Change 'val' to 'var'
\| * \| \| \| \| \| \| \| \|	Fix typo in the Accumulators section	Jyun-Fan Tsai	2013-12-29	1	-1/+1
\|/ / / / / / / / / \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	val => var
* \| \| \| \| \| \| \| \|	Merge pull request #304 from kayousterhout/remove_unused	Patrick Wendell	2013-12-28	1	-6/+0
\|\ \ \ \ \ \ \ \ \ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Removed unused failed and causeOfFailure variables (in TaskSetManager)
\| * \| \| \| \| \| \| \| \|	Removed unused failed and causeOfFailure variables	Kay Ousterhout	2013-12-27	1	-6/+0
\| \| \| \| \| \| \| \| \| \|
* \| \| \| \| \| \| \| \| \|	Merge pull request #307 from kayousterhout/other_failure	Matei Zaharia	2013-12-27	2	-6/+0
\|\ \ \ \ \ \ \ \ \ \ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Removed unused OtherFailure TaskEndReason. The OtherFailure TaskEndReason was added by @mateiz 3 years ago in this commit: https://github.com/apache/incubator-spark/commit/24a1e7f8380bfd8d4fbdda688482a451bd6ea215 Unless I am missing something, it doesn't seem to have been used then, and is not used now, so seems safe for deletion.
\| * \| \| \| \| \| \| \| \| \|	Removed unused OtherFailure TaskEndReason.	Kay Ousterhout	2013-12-27	2	-6/+0
\| \| \|/ / / / / / / / \| \|/\| \| \| \| \| \| \| \|
* \| \| \| \| \| \| \| \| \|	Merge pull request #306 from kayousterhout/remove_pending	Matei Zaharia	2013-12-27	4	-16/+0
\|\ \ \ \ \ \ \ \ \ \ \| \|/ / / / / / / / / \|/\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Remove unused hasPendingTasks methods