author    Ankur Dave <ankurdave@gmail.com>  2014-08-28 15:17:01 -0700
committer Josh Rosen <joshrosen@apache.org>  2014-08-28 15:17:01 -0700
commit    96df92906978c5f58e0cc8ff5eebe5b35a08be3b (patch)
tree      983908ea591adc8147530f3d1431eeaa21a5db9a /graphx/src/main/scala/org/apache
parent    39012452daa0746fe5d218493b85f9b5f96190c1 (diff)
[SPARK-3190] Avoid overflow in VertexRDD.count()
VertexRDDs with more than 4 billion elements are counted incorrectly due to integer overflow when summing partition sizes. This PR fixes the issue by converting partition sizes to Longs before summing them.

The following code previously returned -10000000. After applying this PR, it returns the correct answer of 5000000000 (5 billion).

```scala
val pairs = sc.parallelize(0L until 500L).map(_ * 10000000)
  .flatMap(start => start until (start + 10000000)).map(x => (x, x))
VertexRDD(pairs).count()
```

Author: Ankur Dave <ankurdave@gmail.com>

Closes #2106 from ankurdave/SPARK-3190 and squashes the following commits:

641f468 [Ankur Dave] Avoid overflow in VertexRDD.count()
Diffstat (limited to 'graphx/src/main/scala/org/apache')
-rw-r--r--graphx/src/main/scala/org/apache/spark/graphx/VertexRDD.scala2
1 file changed, 1 insertion, 1 deletion
diff --git a/graphx/src/main/scala/org/apache/spark/graphx/VertexRDD.scala b/graphx/src/main/scala/org/apache/spark/graphx/VertexRDD.scala
index 4825d12fc2..04fbc9dbab 100644
--- a/graphx/src/main/scala/org/apache/spark/graphx/VertexRDD.scala
+++ b/graphx/src/main/scala/org/apache/spark/graphx/VertexRDD.scala
@@ -108,7 +108,7 @@ class VertexRDD[@specialized VD: ClassTag](
/** The number of vertices in the RDD. */
override def count(): Long = {
- partitionsRDD.map(_.size).reduce(_ + _)
+ partitionsRDD.map(_.size.toLong).reduce(_ + _)
}
/**
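The overflow this one-line change prevents can be reproduced outside Spark with plain Scala collections. The following is a minimal, hypothetical sketch (no Spark required; the object and value names are illustrative) contrasting the buggy `Int` sum with the patched `Long` sum:

```scala
// Minimal sketch (hypothetical, no Spark required) of the Int overflow
// fixed by this patch. Summing Int partition sizes wraps around once the
// running total exceeds Int.MaxValue (2147483647).
object CountOverflowDemo {
  def main(args: Array[String]): Unit = {
    // Three partitions of one billion elements each: 3 billion total.
    val partitionSizes: Seq[Int] = Seq.fill(3)(1000000000)

    // Buggy pattern: Int addition silently wraps around at 2^31.
    val buggy: Int = partitionSizes.reduce(_ + _)

    // Fixed pattern, as in this patch: widen each size to Long first.
    val fixed: Long = partitionSizes.map(_.toLong).reduce(_ + _)

    println(buggy) // -1294967296 (wrapped)
    println(fixed) // 3000000000
  }
}
```

Widening inside `map` (rather than calling `.toLong` on the final result) is the key point: each pairwise addition in `reduce` must already be performed in `Long` arithmetic, because the wraparound happens during the intermediate sums, not at the end.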