[SPARK-18274][ML][PYSPARK] Memory leak in PySpark JavaWrapper - spark

author	Sandeep Singh <sandeep@techaddict.me>	2016-12-01 13:22:40 -0800
committer	Joseph K. Bradley <joseph@databricks.com>	2016-12-01 13:22:40 -0800
commit	78bb7f8071379114314c394e0167c4c5fd8545c5 (patch)
tree	4c49f4fd69c635edf605ceda934c4b3f33595266 /streaming/pom.xml
parent	e6534847100670a22b3b191a0f9d924fab7f3c02 (diff)

[SPARK-18274][ML][PYSPARK] Memory leak in PySpark JavaWrapper

## What changes were proposed in this pull request?

In `JavaWrapper`'s destructor, make the Java Gateway dereference the wrapped object, using `SparkContext._active_spark_context._gateway.detach`.

Fix the parameter-copying bug by moving the `copy` method from `JavaModel` to `JavaParams`.

## How was this patch tested?

```python
import random, string
from pyspark.ml.feature import StringIndexer

# 700000 random strings of 10 characters
l = [(''.join(random.choice(string.ascii_uppercase) for _ in range(10)), ) for _ in range(int(7e5))]
df = spark.createDataFrame(l, ['string'])

# Repeatedly create and fit an indexer so that unreleased JVM objects accumulate
for i in range(50):
    indexer = StringIndexer(inputCol='string', outputCol='index')
    indexer.fit(df)
```

* Before: a strong reference to each StringIndexer was kept, causing GC issues, and the run halted midway.
* After: garbage collection works because the object is dereferenced, and the computation completes.
* Memory footprint was tested using a profiler.
* Added a parameter-copy test that was failing before.

Author: Sandeep Singh <sandeep@techaddict.me>
Author: jkbradley <joseph.kurata.bradley@gmail.com>

Closes #15843 from techaddict/SPARK-18274.
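The destructor change described in the commit message can be illustrated with a minimal, Spark-free sketch. It is not the actual PySpark source: `FakeGateway` and `Wrapper` are hypothetical stand-ins for the Py4J gateway and `JavaWrapper`, and the `held` set only imitates the references the JVM side would keep. The point is the pattern: the Python object's `__del__` tells the gateway to release its counterpart.

```python
import gc


class FakeGateway:
    """Hypothetical stand-in for the Py4J gateway; tracks objects held on the 'JVM' side."""

    def __init__(self):
        self.held = set()
        self._next_id = 0

    def new_object(self):
        # Simulate creating a JVM object that the gateway keeps a reference to.
        self._next_id += 1
        self.held.add(self._next_id)
        return self._next_id

    def detach(self, obj_id):
        # Simulate releasing the gateway's reference to the JVM object.
        self.held.discard(obj_id)


class Wrapper:
    """Simplified analogue of a wrapper that releases its Java-side object on destruction."""

    def __init__(self, gateway):
        self._gateway = gateway
        self._java_obj = gateway.new_object()

    def __del__(self):
        # Guard against partially constructed instances before detaching.
        gateway = getattr(self, "_gateway", None)
        java_obj = getattr(self, "_java_obj", None)
        if gateway is not None and java_obj is not None:
            gateway.detach(java_obj)


gateway = FakeGateway()
for _ in range(50):
    w = Wrapper(gateway)  # the previous wrapper becomes unreachable here
del w
gc.collect()
remaining = len(gateway.held)
```

Without the `__del__` method, all 50 entries would remain in `held`, which mirrors the accumulation the commit message reports for repeated `StringIndexer` fits.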

Diffstat (limited to 'streaming/pom.xml')

0 files changed, 0 insertions, 0 deletions

