| Commit message (Collapse) | Author | Age | Files | Lines |
|\
| |
| |
| |
| |
| | |
Approximate distinct count
Added countApproxDistinct() to RDD and countApproxDistinctByKey() to PairRDDFunctions to approximately count distinct number of elements and distinct number of values per key, respectively. Both functions use HyperLogLog from stream-lib for counting. Both functions take a parameter that controls the trade-off between accuracy and memory consumption. Also added Scala docs and test suites for both methods.
|
| | |
|
| | |
|
| | |
|
| | |
|
| | |
|
| | |
|
| | |
|
| |\ |
|
| | | |
|
| | | |
|
| | | |
|
| | | |
|
| | | |
|
| | |
| | |
| | |
| | | |
support.
|
| | |
| | |
| | |
| | | |
number of unique values for each key in the RDD.
|
| | |
| | |
| | |
| | | |
and returns the (approximate) number of distinct elements in the RDD.
|
| | | |
|
|\ \ \
| | | |
| | | |
| | | |
| | | |
| | | | |
upgrade Netty from 4.0.0.Beta2 to 4.0.13.Final
the changes are listed at https://github.com/netty/netty/wiki/New-and-noteworthy
|
| | | |
| | | |
| | | |
| | | | |
Also clean up a bit.
|
| | | | |
|
| | | | |
|
| | | | |
|
|\ \ \ \
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Bug fixes for file input stream and checkpointing
- Fixed bugs in the file input stream that led the stream to fail due to transient HDFS errors (listing files when a background thread it deleting fails caused errors, etc.)
- Updated Spark's CheckpointRDD and Streaming's CheckpointWriter to use SparkContext.hadoopConfiguration, to allow checkpoints to be written to any HDFS compatible store requiring special configuration.
- Changed the API of SparkContext.setCheckpointDir() - eliminated the unnecessary 'useExisting' parameter. Now SparkContext will always create a unique subdirectory within the user specified checkpoint directory. This is to ensure that previous checkpoint files are not accidentally overwritten.
- Fixed bug where setting checkpoint directory as a relative local path caused the checkpointing to fail.
|
| | | | | |
|
| | | | | |
|
| | | | |
| | | | |
| | | | |
| | | | | |
0 partitions).
|
| | | | |
| | | | |
| | | | |
| | | | | |
(FileNotFound exception is still caught and ignored).
|
| | | | |
| | | | |
| | | | |
| | | | | |
due to failures due FileNotFound exceptions.
|
| | | | |
| | | | |
| | | | |
| | | | | |
Reynold's comments on PR 289.
|
| |\ \ \ \
| | | |_|/
| | |/| | |
|
| | | | | |
|
| | | | | |
|
| |\ \ \ \ |
|
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
correctly.
|
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
FileSystem objects. Refactored JobGenerator to use actor so that all updating of DStream's metadata is single threaded.
|
| |\ \ \ \ \ |
|
| |\ \ \ \ \ \
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | | |
Conflicts:
core/src/main/scala/org/apache/spark/rdd/CheckpointRDD.scala
streaming/src/main/scala/org/apache/spark/streaming/StreamingContext.scala
streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala
streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala
streaming/src/test/scala/org/apache/spark/streaming/CheckpointSuite.scala
|
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | | |
- Made file stream more robust to transient failures.
- Changed Spark.setCheckpointDir API to not have the second
'useExisting' parameter. Spark will always create a unique directory
for checkpointing underneath the directory provide to the funtion.
- Fixed bug wrt local relative paths as checkpoint directory.
- Made DStream and RDD checkpointing use
SparkContext.hadoopConfiguration, so that more HDFS compatible
filesystems are supported for checkpointing.
|
|\ \ \ \ \ \ \ \
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | | |
Changed naming of StageCompleted event to be consistent
The rest of the SparkListener events are named with "SparkListener"
as the prefix of the name; this commit renames the StageCompleted
event to SparkListenerStageCompleted for consistency.
|
| | | | | | | | | |
|
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | | |
The rest of the SparkListener events are named with "SparkListener"
as the prefix of the name; this commit renames the StageCompleted
event to SparkListenerStageCompleted for consistency.
|
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | | |
This reverts commit 79b20e4dbe3dcd8559ec8316784d3334bb55868b, reversing
changes made to 7375047d516c5aa69221611f5f7b0f1d367039af.
|
|\ \ \ \ \ \ \ \ \
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | | |
Fix typo in the Accumulators section
Change 'val' to 'var'
|
|/ / / / / / / / /
| | | | | | | | |
| | | | | | | | | |
val => var
|
|\ \ \ \ \ \ \ \ \
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | | |
Removed unused failed and causeOfFailure variables (in TaskSetManager)
|
| | | | | | | | | | |
|
|\ \ \ \ \ \ \ \ \ \
| | | | | | | | | | |
| | | | | | | | | | |
| | | | | | | | | | |
| | | | | | | | | | |
| | | | | | | | | | |
| | | | | | | | | | |
| | | | | | | | | | | |
Removed unused OtherFailure TaskEndReason.
The OtherFailure TaskEndReason was added by @mateiz 3 years ago in this commit: https://github.com/apache/incubator-spark/commit/24a1e7f8380bfd8d4fbdda688482a451bd6ea215
Unless I am missing something, it doesn't seem to have been used then, and is not used now, so seems safe for deletion.
|
| | |/ / / / / / / /
| |/| | | | | | | | |
|
|\ \ \ \ \ \ \ \ \ \
| |/ / / / / / / / /
|/| | | | | | | | |
| | | | | | | | | | |
Remove unused hasPendingTasks methods
|