author: Maxime Rihouey <maxime.rihouey@gmail.com> 2016-10-17 10:56:22 +0100
committer: Sean Owen <sowen@cloudera.com> 2016-10-17 10:56:22 +0100
commit: e3bf37fa3ada43624b2e77bef90ad3d3dbcd8ce1
tree: 1af2185a65c6752069c7c628f480408599d63137 /examples
parent: 56b0f5f4d1d7826737b81ebc4ec5dad83b6463e3
Fix example of tf_idf with minDocFreq
## What changes were proposed in this pull request?

The Python example for tf_idf with the "minDocFreq" parameter is not set up correctly: the same model is used to transform the documents both with and without "minDocFreq". The model fitted with IDF(minDocFreq=2) is stored in the variable "idfIgnore", but the original variable "idf" is then used to transform "tf" instead of "idfIgnore".

## How was this patch tested?

Before the fix, the results for "tfidf" and "tfidfIgnore" were the same:

tfidf:
(1048576,[1046921],[3.75828890549])
(1048576,[1046920],[3.75828890549])
(1048576,[1046923],[3.75828890549])
(1048576,[892732],[3.75828890549])
(1048576,[892733],[3.75828890549])
(1048576,[892734],[3.75828890549])

tfidfIgnore:
(1048576,[1046921],[3.75828890549])
(1048576,[1046920],[3.75828890549])
(1048576,[1046923],[3.75828890549])
(1048576,[892732],[3.75828890549])
(1048576,[892733],[3.75828890549])
(1048576,[892734],[3.75828890549])

After the fix, the results are as they should be:

tfidf:
(1048576,[1046921],[3.75828890549])
(1048576,[1046920],[3.75828890549])
(1048576,[1046923],[3.75828890549])
(1048576,[892732],[3.75828890549])
(1048576,[892733],[3.75828890549])
(1048576,[892734],[3.75828890549])

tfidfIgnore:
(1048576,[1046921],[0.0])
(1048576,[1046920],[0.0])
(1048576,[1046923],[0.0])
(1048576,[892732],[0.0])
(1048576,[892733],[0.0])
(1048576,[892734],[0.0])

Author: Maxime Rihouey <maxime.rihouey@gmail.com>

Closes #15503 from maximerihouey/patch-1.
Diffstat (limited to 'examples')
-rw-r--r-- examples/src/main/python/mllib/tf_idf_example.py | 2
1 files changed, 1 insertions, 1 deletions
diff --git a/examples/src/main/python/mllib/tf_idf_example.py b/examples/src/main/python/mllib/tf_idf_example.py
index c4d53333a9..b66412b233 100644
--- a/examples/src/main/python/mllib/tf_idf_example.py
+++ b/examples/src/main/python/mllib/tf_idf_example.py
@@ -43,7 +43,7 @@ if __name__ == "__main__":
     # In such cases, the IDF for these terms is set to 0.
     # This feature can be used by passing the minDocFreq value to the IDF constructor.
     idfIgnore = IDF(minDocFreq=2).fit(tf)
-    tfidfIgnore = idf.transform(tf)
+    tfidfIgnore = idfIgnore.transform(tf)
     # $example off$
 
     print("tfidf:")
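
To make the effect of the one-line fix easier to see without a Spark cluster, here is a minimal pure-Python sketch of the same idea. It is not MLlib's implementation: the helpers `fit_idf` and `transform` are hypothetical names, plain dictionaries stand in for hashed sparse vectors, and the six-document corpus is an assumed stand-in for the example's input file. The weight formula log((m + 1) / (df + 1)) follows the MLlib documentation, and terms that appear in fewer than `min_doc_freq` documents get an IDF of 0.

```python
import math
from collections import Counter


def fit_idf(docs, min_doc_freq=0):
    """Return {term: idf}; terms seen in fewer than min_doc_freq docs get 0.0."""
    m = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    return {
        term: math.log((m + 1) / (n + 1)) if n >= min_doc_freq else 0.0
        for term, n in df.items()
    }


def transform(idf, docs):
    """Scale each document's raw term counts by the fitted IDF weights."""
    return [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]


# Assumed corpus: six documents, each repeating one unique token three times.
docs = [[tok] * 3 for tok in ["0.0", "0.1", "0.2", "9.0", "9.1", "9.2"]]

idf = fit_idf(docs)
idf_ignore = fit_idf(docs, min_doc_freq=2)

tfidf = transform(idf, docs)
# The point of the fix: transform with idf_ignore, not with idf.
tfidf_ignore = transform(idf_ignore, docs)

print("tfidf:", tfidf[0])
print("tfidfIgnore:", tfidf_ignore[0])
```

With this assumed corpus every term occurs in exactly one document, so each weight is 3 * log(7 / 2), which agrees with the 3.758... values in the commit message, while `min_doc_freq=2` zeroes every weight. Transforming with `idf` in both places, as the example did before the patch, would make the two results identical.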