[SPARK-7243][SQL] Reduce size for Contingency Tables in DataFrames

Reduced take size from 1e8 to 1e6.

cc rxin

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #5900 from brkyvz/df-cont-followup and squashes the following commits:

c11e762 [Burak Yavuz] fix grammar
b30ace2 [Burak Yavuz] address comments
a417ba5 [Burak Yavuz] [SPARK-7243][SQL] Reduce size for Contingency Tables in DataFrames
author: Burak Yavuz <brkyvz@gmail.com> 2015-05-05 11:01:25 -0700
committer: Reynold Xin <rxin@databricks.com> 2015-05-05 11:01:25 -0700
commit: 18340d7be55a6834918956555bf820c96769aa52 (patch)
tree: 0327a2603cdfd3321bb7a2e765439254beba5acb /python/pyspark/sql/dataframe.py
parent: 9f1f9b1037ee003a07ff09d60bb360cf32c8a564 (diff)
1 files changed, 5 insertions, 4 deletions
diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py
index f30a92dfc8..17448b38c3 100644
--- a/python/pyspark/sql/dataframe.py
+++ b/python/pyspark/sql/dataframe.py
@@ -934,10 +934,11 @@ class DataFrame(object):
     def crosstab(self, col1, col2):
         """
         Computes a pair-wise frequency table of the given columns. Also known as a contingency
-        table. The number of distinct values for each column should be less than 1e4. The first
-        column of each row will be the distinct values of `col1` and the column names will be the
-        distinct values of `col2`. The name of the first column will be `$col1_$col2`. Pairs that
-        have no occurrences will have `null` as their counts.
+        table. The number of distinct values for each column should be less than 1e4. At most 1e6
+        non-zero pair frequencies will be returned.
+        The first column of each row will be the distinct values of `col1` and the column names
+        will be the distinct values of `col2`. The name of the first column will be `$col1_$col2`.
+        Pairs that have no occurrences will have `null` as their counts.
         :func:`DataFrame.crosstab` and :func:`DataFrameStatFunctions.crosstab` are aliases.
 
         :param col1: The name of the first column. Distinct items will make the first item of
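
The table shape described in the `crosstab` docstring can be illustrated with a small pure-Python sketch. This is not Spark's implementation: `crosstab_sketch`, its `rows` and `max_pairs` parameters, and the sorted output order are all assumptions made for illustration. It follows the docstring as of this commit: the first column is named `col1_col2`, the remaining column names are the distinct values of `col2`, at most `max_pairs` non-zero pair frequencies are kept, and pairs with no occurrences get `None` (standing in for `null`).

```python
from collections import Counter


def crosstab_sketch(rows, col1, col2, max_pairs=10**6):
    """Hypothetical helper: build a contingency table from a list of dicts."""
    # Count each (col1, col2) value pair; only pairs that occur get an entry.
    pair_counts = Counter((row[col1], row[col2]) for row in rows)

    # Keep at most max_pairs non-zero pair frequencies.
    # Which pairs survive the cut is arbitrary in this sketch.
    kept = list(pair_counts.items())[:max_pairs]

    # Distinct col2 values become the column names (sorted here for determinism).
    col2_values = sorted({str(b) for (_, b), _ in kept})
    header = ["{}_{}".format(col1, col2)] + col2_values

    # Group the counts by col1 value.
    table = {}
    for (a, b), n in kept:
        table.setdefault(a, {})[str(b)] = n

    # One output row per distinct col1 value; missing pairs become None.
    body = [[a] + [table[a].get(c) for c in col2_values] for a in sorted(table)]
    return header, body


rows = [
    {"key": "a", "value": 1},
    {"key": "a", "value": 1},
    {"key": "a", "value": 2},
    {"key": "b", "value": 2},
]
header, body = crosstab_sketch(rows, "key", "value")
print(header)  # ['key_value', '1', '2']
print(body)    # [['a', 2, 1], ['b', None, 1]]
```

The pair `("b", 1)` never occurs in the input, so its cell is `None`, matching the "Pairs that have no occurrences will have `null` as their counts" wording in the patched docstring.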