path: root/python/pyspark/sql
* [SPARK-12091][PYSPARK] Deprecate the JAVA-specific deserialized storage levels (gatorsmile, 2015-12-18; 1 file, -3/+3)
  The current default storage level of the Python persist API is MEMORY_ONLY_SER. This differs from the default level MEMORY_ONLY in the official documentation and in the RDD APIs. davies, is this inconsistency intentional? Thanks! Updates: Since the data is always serialized on the Python side, the Java-specific deserialized storage levels, such as MEMORY_ONLY, are not removed. Updates: Based on the reviewers' feedback. In Python, stored objects will always be serialized with the [Pickle](https://docs.python.org/2/library/pickle.html) library, so it does not matter whether you choose a serialized level. The available storage levels in Python include `MEMORY_ONLY`, `MEMORY_ONLY_2`, `MEMORY_AND_DISK`, `MEMORY_AND_DISK_2`, `DISK_ONLY`, `DISK_ONLY_2` and `OFF_HEAP`.
  Author: gatorsmile <gatorsmile@gmail.com> Closes #10092 from gatorsmile/persistStorageLevel.
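  For illustration, a minimal persist sketch from the Python side (assuming an existing SparkContext `sc`; since data is pickled in Python, choosing a "serialized" level makes no difference here):
  ```python
  from pyspark import StorageLevel

  rdd = sc.parallelize(range(100))
  rdd.persist(StorageLevel.MEMORY_AND_DISK)  # one of the levels still exposed in Python
  rdd.count()      # materializes the cache
  rdd.unpersist()
  ```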
* [SQL] Update SQLContext.read.text doc (Yanbo Liang, 2015-12-17; 1 file, -1/+1)
  Since we renamed the column from `text` to `value` for DataFrames loaded by `SQLContext.read.text`, we need to update the doc.
  Author: Yanbo Liang <ybliang8@gmail.com> Closes #10349 from yanboliang/text-value.
* [SPARK-12012][SQL] Show more comprehensive PhysicalRDD metadata when visualizing SQL query plan (Cheng Lian, 2015-12-09; 1 file, -1/+1)
  This PR adds a `private[sql]` method `metadata` to `SparkPlan`, which can be used to describe detail information about a physical plan during visualization. Specifically, this PR uses this method to provide details of `PhysicalRDD`s translated from a data source relation. For example, a `ParquetRelation` converted from Hive metastore table `default.psrc` is now shown as the following screenshot: ![image](https://cloud.githubusercontent.com/assets/230655/11526657/e10cb7e6-9916-11e5-9afa-f108932ec890.png) And here is the screenshot for a regular `ParquetRelation` (not converted from Hive metastore table) loaded from a really long path: ![output](https://cloud.githubusercontent.com/assets/230655/11680582/37c66460-9e94-11e5-8f50-842db5309d5a.png)
  Author: Cheng Lian <lian@databricks.com> Closes #10004 from liancheng/spark-12012.physical-rdd-metadata.
* [SPARK-12184][PYTHON] Make python api doc for pivot consistent with scala doc (Andrew Ray, 2015-12-07; 1 file, -5/+9)
  In SPARK-11946 the pivot API was changed slightly and its doc updated, but the doc changes were not made for the Python API. This PR updates the Python doc to be consistent.
  Author: Andrew Ray <ray.andrew@gmail.com> Closes #10176 from aray/sql-pivot-python-doc.
* [SPARK-11917][PYSPARK] Add SQLContext#dropTempTable to PySpark (Jeff Zhang, 2015-11-26; 1 file, -0/+9)
  Author: Jeff Zhang <zjffdu@apache.org> Closes #9903 from zjffdu/SPARK-11917.
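  A minimal usage sketch of the new binding (assuming an existing `sqlContext` and DataFrame `df`):
  ```python
  sqlContext.registerDataFrameAsTable(df, "people")
  sqlContext.sql("SELECT COUNT(*) FROM people").show()
  sqlContext.dropTempTable("people")  # "people" is no longer resolvable afterwards
  ```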
* [SPARK-11980][SPARK-10621][SQL] Fix json_tuple and add test cases for (gatorsmile, 2015-11-25; 1 file, -10/+34)
  Added Python test cases for the functions `isnan`, `isnull`, `nanvl` and `json_tuple`. Fixed a bug in the function `json_tuple`. rxin, could you help me review my changes? Please let me know if anything is missing. Thank you! Have a good Thanksgiving day!
  Author: gatorsmile <gatorsmile@gmail.com> Closes #9977 from gatorsmile/json_tuple.
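  A sketch of the functions the new tests cover (assuming a DataFrame `df` with a double column `x` and a JSON string column `jstring`; column names are hypothetical):
  ```python
  from pyspark.sql.functions import isnan, isnull, nanvl, json_tuple, lit

  df.select(isnan(df.x), isnull(df.x)).show()
  df.select(nanvl(df.x, lit(0.0))).show()  # x where x is not NaN, else 0.0
  df.select(json_tuple(df.jstring, "f1", "f2")).show()  # extracts top-level JSON fields
  ```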
* [SPARK-11969][SQL][PYSPARK] visualization of SQL query for pyspark (Davies Liu, 2015-11-25; 1 file, -1/+1)
  Currently we do not have visualization for SQL queries run from Python; this PR fixes that. cc zsxwing
  Author: Davies Liu <davies@databricks.com> Closes #9949 from davies/pyspark_sql_ui.
* [SPARK-11984][SQL][PYTHON] Fix typos in doc for pivot for scala and python (felixcheung, 2015-11-25; 1 file, -3/+3)
  Author: felixcheung <felixcheung_m@hotmail.com> Closes #9967 from felixcheung/pypivotdoc.
* [SPARK-11860][PYSPARK][DOCUMENTATION] Invalid argument specification for registerFunction [Python] (Jeff Zhang, 2015-11-25; 1 file, -2/+3)
  Straightforward change to the python doc.
  Author: Jeff Zhang <zjffdu@apache.org> Closes #9901 from zjffdu/SPARK-11860.
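  The corrected usage, sketched (assuming an existing `sqlContext`; the UDF name is hypothetical):
  ```python
  from pyspark.sql.types import IntegerType

  sqlContext.registerFunction("slen", lambda s: len(s), IntegerType())
  sqlContext.sql("SELECT slen('test')").collect()  # the UDF returns 4
  ```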
* [SPARK-10621][SQL] Consistent naming for functions in SQL, Python, Scala (Reynold Xin, 2015-11-24; 1 file, -17/+94)
  Author: Reynold Xin <rxin@databricks.com> Closes #9948 from rxin/SPARK-10621.
* [SPARK-11967][SQL] Consistent use of varargs for multiple paths in DataFrameReader (Reynold Xin, 2015-11-24; 1 file, -7/+12)
  This patch makes it consistent to use varargs in all DataFrameReader methods, including Parquet, JSON, text, and the generic load function. Also added a few more API tests for the Java API.
  Author: Reynold Xin <rxin@databricks.com> Closes #9945 from rxin/SPARK-11967.
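  On the Python side, the parquet reader takes varargs paths directly; a sketch with hypothetical paths:
  ```python
  df = sqlContext.read.parquet("data/part1.parquet", "data/part2.parquet")
  df.count()  # rows from both paths
  ```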
* [SPARK-11946][SQL] Audit pivot API for 1.6. (Reynold Xin, 2015-11-24; 1 file, -4/+8)
  Currently pivot's signature looks like
  ```scala
  @scala.annotation.varargs
  def pivot(pivotColumn: Column, values: Column*): GroupedData

  @scala.annotation.varargs
  def pivot(pivotColumn: String, values: Any*): GroupedData
  ```
  I think we can remove the one that takes "Column" types, since callers should always be passing in literals. It'd also be more clear if the values are not varargs, but rather Seq or java.util.List. I also made similar changes for Python.
  Author: Reynold Xin <rxin@databricks.com> Closes #9929 from rxin/SPARK-11946.
* [SPARK-11836][SQL] udf/cast should not create new SQLContext (Davies Liu, 2015-11-23; 2 files, -6/+8)
  They should use the existing SQLContext.
  Author: Davies Liu <davies@databricks.com> Closes #9914 from davies/create_udf.
* [SPARK-11720][SQL][ML] Handle edge cases when count = 0 or 1 for Stats function (JihongMa, 2015-11-18; 1 file, -1/+1)
  Return Double.NaN for mean/average when count == 0 for all numeric types that are converted to Double; Decimal type continues to return null.
  Author: JihongMa <linlin200605@gmail.com> Closes #9705 from JihongMA/SPARK-11720.
* [SPARK-11804][PYSPARK] Exception raised when using JDBC predicates option in PySpark (Jeff Zhang, 2015-11-18; 2 files, -5/+18)
  Author: Jeff Zhang <zjffdu@apache.org> Closes #9791 from zjffdu/SPARK-11804.
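  A sketch of the predicates option that this fixes (the URL, table, and credentials are hypothetical):
  ```python
  props = {"user": "sa", "password": ""}
  df = sqlContext.read.jdbc(
      url="jdbc:postgresql://localhost/testdb",
      table="orders",
      predicates=["id < 1000", "id >= 1000"],  # one partition per predicate
      properties=props)
  ```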
* [SPARK-11745][SQL] Enable more JSON parsing options (Reynold Xin, 2015-11-16; 1 file, -0/+10)
  This patch adds the following options to the JSON data source, for dealing with non-standard JSON files:
  * `allowComments` (default `false`): ignores Java/C++ style comments in JSON records
  * `allowUnquotedFieldNames` (default `false`): allows unquoted JSON field names
  * `allowSingleQuotes` (default `true`): allows single quotes in addition to double quotes
  * `allowNumericLeadingZeros` (default `false`): allows leading zeros in numbers (e.g. 00012)
  To avoid passing a lot of options throughout the json package, I introduced a new JSONOptions case class to define all JSON config options. Also updated documentation to explain these options. Scala ![screen shot 2015-11-15 at 6 12 12 pm](https://cloud.githubusercontent.com/assets/323388/11172965/e3ace6ec-8bc4-11e5-805e-2d78f80d0ed6.png) Python ![screen shot 2015-11-15 at 6 11 28 pm](https://cloud.githubusercontent.com/assets/323388/11172964/e23ed6ee-8bc4-11e5-8216-312f5983acd5.png)
  Author: Reynold Xin <rxin@databricks.com> Closes #9724 from rxin/SPARK-11745.
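  From Python, these can be set through the reader's generic option interface; a sketch with a hypothetical path:
  ```python
  df = (sqlContext.read
        .option("allowComments", "true")
        .option("allowUnquotedFieldNames", "true")
        .option("allowNumericLeadingZeros", "true")
        .json("data/nonstandard.json"))
  ```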
* [SPARK-11690][PYSPARK] Add pivot to python api (Andrew Ray, 2015-11-13; 1 file, -1/+23)
  This PR adds pivot to the python api of GroupedData with the same syntax as Scala/Java.
  Author: Andrew Ray <ray.andrew@gmail.com> Closes #9653 from aray/sql-pivot-python.
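  The new Python pivot, sketched with hypothetical columns:
  ```python
  # Pivot the listed course values into columns, aggregating earnings per year.
  df.groupBy("year").pivot("course", ["dotNET", "Java"]).sum("earnings")
  # Values may be omitted; Spark then computes the distinct values itself (less efficient).
  df.groupBy("year").pivot("course").sum("earnings")
  ```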
* [SPARK-11671] documentation code example typo (Chris Snow, 2015-11-12; 1 file, -1/+1)
  The example for `sqlContext.createDataFrame` from a pandas.DataFrame has a typo (`createDataDrame`).
  Author: Chris Snow <chsnow123@gmail.com> Closes #9639 from snowch/patch-2.
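  The corrected pattern, as a sketch (requires pandas):
  ```python
  import pandas as pd

  pdf = pd.DataFrame({"name": ["Alice", "Bob"], "age": [1, 2]})
  df = sqlContext.createDataFrame(pdf)  # createDataFrame, not createDataDrame
  df.collect()
  ```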
* [SPARK-11420] Updating Stddev support via Imperative Aggregate (JihongMa, 2015-11-12; 1 file, -1/+1)
  Switched stddev support from DeclarativeAggregate to ImperativeAggregate.
  Author: JihongMa <linlin200605@gmail.com> Closes #9380 from JihongMA/SPARK-11420.
* [SPARK-11567][PYTHON] Add Python API for corr Aggregate function (felixcheung, 2015-11-10; 1 file, -0/+16)
  Like `df.agg(corr("col1", "col2"))`. davies
  Author: felixcheung <felixcheung_m@hotmail.com> Closes #9536 from felixcheung/pyfunc.
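  Usage sketch (assuming numeric columns `col1` and `col2`):
  ```python
  from pyspark.sql.functions import corr

  df.agg(corr("col1", "col2").alias("pearson_r")).collect()
  ```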
* [SPARK-9830][SQL] Remove AggregateExpression1 and Aggregate Operator used to evaluate AggregateExpression1s (Yin Huai, 2015-11-10; 3 files, -3/+3)
  https://issues.apache.org/jira/browse/SPARK-9830 This PR contains the following main changes:
  * Removing `AggregateExpression1`.
  * Removing `Aggregate` operator, which is used to evaluate `AggregateExpression1`.
  * Removing planner rule used to plan `Aggregate`.
  * Linking `MultipleDistinctRewriter` to analyzer.
  * Renaming `AggregateExpression2` to `AggregateExpression` and `AggregateFunction2` to `AggregateFunction`.
  * Updating places where we create aggregate expressions. The way to create aggregate expressions is `AggregateExpression(aggregateFunction, mode, isDistinct)`.
  * Changing `val`s in `DeclarativeAggregate`s that touch children of this function to `lazy val`s (when we create an aggregate expression in the DataFrame API, children of an aggregate function can be unresolved).
  Author: Yin Huai <yhuai@databricks.com> Closes #9556 from yhuai/removeAgg1.
* [SPARK-9301][SQL] Add collect_set and collect_list aggregate functions (Nick Buroojy, 2015-11-09; 2 files, -11/+31)
  For now they are thin wrappers around the corresponding Hive UDAFs. One limitation with these in Hive 0.13.0 is they only support aggregating primitive types. I chose snake_case here instead of camelCase because it seems to be used in the majority of the multi-word fns. Do we also want to add these to `functions.py`? This approach was recommended here: https://github.com/apache/spark/pull/8592#issuecomment-154247089 marmbrus rxin
  Author: Nick Buroojy <nick.buroojy@civitaslearning.com> Closes #9526 from nburoojy/nick/udaf-alias. (cherry picked from commit a6ee4f989d020420dd08b97abb24802200ff23b2) Signed-off-by: Michael Armbrust <michael@databricks.com>
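  Usage sketch (hypothetical columns; since these wrap Hive UDAFs at this point, a Hive-enabled context is assumed):
  ```python
  from pyspark.sql.functions import collect_list, collect_set

  df.groupBy("key").agg(
      collect_list("value"),  # keeps duplicates, element order not guaranteed
      collect_set("value"))   # de-duplicates
  ```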
* [HOTFIX] Fix python tests after #9527 (Michael Armbrust, 2015-11-06; 1 file, -1/+1)
  #9527 missed updating the python tests.
  Author: Michael Armbrust <michael@databricks.com> Closes #9533 from marmbrus/hotfixTextValue.
* [SPARK-11410][PYSPARK] Add python bindings for repartition and sortWithinPartitions. (Nong Li, 2015-11-06; 1 file, -16/+101)
  Author: Nong Li <nong@databricks.com> Closes #9504 from nongli/spark-11410.
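  Sketch of the new bindings (hypothetical column names):
  ```python
  # Hash-partition by a column, then sort rows within each partition.
  df.repartition(8, "customer_id") \
    .sortWithinPartitions("customer_id", "order_ts")
  ```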
* [SPARK-10116][CORE] XORShiftRandom.hashSeed is random in high bits (Imran Rashid, 2015-11-06; 1 file, -3/+3)
  https://issues.apache.org/jira/browse/SPARK-10116 This is really trivial, just happened to notice it -- if `XORShiftRandom.hashSeed` is really supposed to have random bits throughout (as the comment implies), it needs to do something for the conversion to `long`. mengxr mkolod
  Author: Imran Rashid <irashid@cloudera.com> Closes #8314 from squito/SPARK-10116.
* [SPARK-11489][SQL] Only include common first order statistics in GroupedData (Reynold Xin, 2015-11-03; 1 file, -88/+0)
  We added a bunch of higher order statistics such as skewness and kurtosis to GroupedData. I don't think they are common enough to justify being listed, since users can always use the normal statistics aggregate functions. That is to say, after this change, we won't support
  ```scala
  df.groupBy("key").kurtosis("colA", "colB")
  ```
  However, we will still support
  ```scala
  df.groupBy("key").agg(kurtosis(col("colA")), kurtosis(col("colB")))
  ```
  Author: Reynold Xin <rxin@databricks.com> Closes #9446 from rxin/SPARK-11489.
* [SPARK-11467][SQL] add Python API for stddev/variance (Davies Liu, 2015-11-03; 2 files, -0/+105)
  Add Python API for stddev/stddev_pop/stddev_samp/variance/var_pop/var_samp/skewness/kurtosis.
  Author: Davies Liu <davies@databricks.com> Closes #9424 from davies/py_var.
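  Sketch of the added aggregates (assuming a numeric column `x`):
  ```python
  from pyspark.sql.functions import stddev, stddev_pop, variance, skewness, kurtosis

  df.agg(stddev("x"), stddev_pop("x"), variance("x"),
         skewness("x"), kurtosis("x")).show()
  ```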
* [SPARK-11437][PYSPARK] Don't .take when converting RDD to DataFrame with provided schema (Jason White, 2015-11-02; 1 file, -7/+1)
  When creating a DataFrame from an RDD in PySpark, `createDataFrame` calls `.take(10)` to verify the first 10 rows of the RDD match the provided schema. Similar to https://issues.apache.org/jira/browse/SPARK-8070, but that issue affected cases where a schema was not provided. Verifying the first 10 rows is of limited utility and causes the DAG to be executed non-lazily. If necessary, I believe this verification should be done lazily on all rows. However, since the caller is providing a schema to follow, I think it's acceptable to simply fail if the schema is incorrect. marmbrus We chatted about this at SparkSummitEU. davies you made a similar change for the infer-schema path in https://github.com/apache/spark/pull/6606
  Author: Jason White <jason.white@shopify.com> Closes #9392 from JasonMWhite/createDataFrame_without_take.
* [SPARK-11322][PYSPARK] Keep full stack trace in captured exception (Liang-Chi Hsieh, 2015-10-28; 2 files, -4/+21)
  JIRA: https://issues.apache.org/jira/browse/SPARK-11322 As reported by JoshRosen in [databricks/spark-redshift/issues/89](https://github.com/databricks/spark-redshift/issues/89#issuecomment-149828308), the exception-masking behavior sometimes makes debugging harder. To deal with this issue, we should keep the full stack trace in the captured exception.
  Author: Liang-Chi Hsieh <viirya@appier.com> Closes #9283 from viirya/py-exception-stacktrace.
* [SPARK-11292][SQL] Python API for text data source (Reynold Xin, 2015-10-28; 1 file, -2/+25)
  Adds DataFrameReader.text and DataFrameWriter.text.
  Author: Reynold Xin <rxin@databricks.com> Closes #9259 from rxin/SPARK-11292.
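  Sketch (paths hypothetical); each line of the input becomes a row in a single string column named `value`:
  ```python
  df = sqlContext.read.text("data/input.txt")
  df.select("value").show()
  df.select("value").write.text("data/output")  # the writer expects a single string column
  ```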
* [SPARK-11279][PYSPARK] Add DataFrame#toDF in PySpark (Jeff Zhang, 2015-10-26; 1 file, -0/+12)
  Author: Jeff Zhang <zjffdu@apache.org> Closes #9248 from zjffdu/SPARK-11279.
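  Sketch of the new method, which returns a DataFrame with the columns renamed:
  ```python
  df = sqlContext.createDataFrame([(1, "a"), (2, "b")])
  df2 = df.toDF("id", "letter")  # renames _1, _2 to id, letter
  ```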
* [SPARK-7021] Add JUnit output for Python unit tests (Gábor Lipták, 2015-10-22; 1 file, -1/+8)
  WIP
  Author: Gábor Lipták <gliptak@gmail.com> Closes #8323 from gliptak/SPARK-7021.
* [SPARK-11205][PYSPARK] Delegate to scala DataFrame API rather than print in python (Jeff Zhang, 2015-10-20; 1 file, -1/+2)
  No test needed; verified manually in the pyspark shell.
  Author: Jeff Zhang <zjffdu@apache.org> Closes #9177 from zjffdu/SPARK-11205.
* [SPARK-11114][PYSPARK] add getOrCreate for SparkContext/SQLContext in Python (Davies Liu, 2015-10-19; 2 files, -0/+41)
  Also added SQLContext.newSession().
  Author: Davies Liu <davies@databricks.com> Closes #9122 from davies/py_create.
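  Sketch:
  ```python
  from pyspark import SparkContext
  from pyspark.sql import SQLContext

  sc = SparkContext.getOrCreate()          # reuses a running context if one exists
  sqlContext = SQLContext.getOrCreate(sc)  # likewise for the SQLContext
  isolated = sqlContext.newSession()       # separate temp tables/config, shared SparkContext
  ```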
* [SPARK-11158][SQL] Modified _verify_type() to be more informative on Errors by presenting the Object (Mahmoud Lababidi, 2015-10-18; 1 file, -3/+3)
  The _verify_type() function raised errors when there were type-conversion issues but left out the object in question. The object is now included in the error, so the user no longer has to debug through the code to figure out which object failed the type conversion. The use case for me was a pandas DataFrame that contained 'nan' as values for columns of Strings.
  Author: Mahmoud Lababidi <mahmoud@thehumangeo.com> Author: Mahmoud Lababidi <lababidi@gmail.com> Closes #9149 from lababidi/master.
* [SPARK-10185][SQL] Feat sql comma separated paths (Koert Kuipers, 2015-10-17; 1 file, -1/+13)
  Make sure comma-separated paths get processed correctly in ResolvedDataSource for a HadoopFsRelationProvider.
  Author: Koert Kuipers <koert@tresata.com> Closes #8416 from koertkuipers/feat-sql-comma-separated-paths.
* [SPARK-10782][PYTHON] Update dropDuplicates documentation (asokadiggs, 2015-09-29; 1 file, -0/+2)
  Documentation for dropDuplicates() and drop_duplicates() is one and the same. Resolved the error in the example for drop_duplicates using the same approach used for groupby and groupBy, by indicating that dropDuplicates and drop_duplicates are aliases.
  Author: asokadiggs <asoka.diggs@intel.com> Closes #8930 from asokadiggs/jira-10782.
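  Usage sketch (hypothetical columns); `drop_duplicates` is the alias:
  ```python
  df.dropDuplicates().count()                  # consider all columns
  df.dropDuplicates(["name", "age"]).count()   # consider a subset of columns
  df.drop_duplicates(["name"])                 # alias for dropDuplicates
  ```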
* [SPARK-10731][SQL] Delegate to Scala's DataFrame.take implementation in Python DataFrame. (Reynold Xin, 2015-09-23; 1 file, -1/+4)
  Python DataFrame.head/take now requires scanning all the partitions. This pull request changes them to delegate the actual implementation to Scala DataFrame (by calling DataFrame.take). This is more of a hack for fixing this issue in 1.5.1. A more proper fix is to change executeCollect and executeTake to return InternalRow rather than Row, and thus eliminate the extra round-trip conversion.
  Author: Reynold Xin <rxin@databricks.com> Closes #8876 from rxin/SPARK-10731.
* [SPARK-10446][SQL] Support to specify join type when calling join with usingColumns (Liang-Chi Hsieh, 2015-09-21; 1 file, -1/+5)
  JIRA: https://issues.apache.org/jira/browse/SPARK-10446 Currently the method `join(right: DataFrame, usingColumns: Seq[String])` only supports inner join. It is more convenient to have it support other join types.
  Author: Liang-Chi Hsieh <viirya@appier.com> Closes #8600 from viirya/usingcolumns_df.
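  From Python this surfaces as the `how` argument combined with a list of column names; a sketch with hypothetical frames:
  ```python
  # Previously only an inner join was possible when joining on usingColumns.
  df1.join(df2, ["id"], "outer")
  df1.join(df2, ["id", "day"], "left_outer")
  ```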
* [SPARK-10577][PYSPARK] DataFrame hint for broadcast join (Jian Feng, 2015-09-21; 2 files, -0/+27)
  https://issues.apache.org/jira/browse/SPARK-10577
  Author: Jian Feng <jzhang.chs@gmail.com> Closes #8801 from Jianfeng-chs/master.
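  Sketch of the hint (assuming `small_df` is small enough to ship to every executor):
  ```python
  from pyspark.sql.functions import broadcast

  big_df.join(broadcast(small_df), "id").explain()  # plan should show a broadcast join
  ```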
* [SPARK-10615][PYSPARK] change assertEquals to assertEqual (Yanbo Liang, 2015-09-18; 1 file, -9/+9)
  As `assertEquals` is deprecated, we need to change `assertEquals` to `assertEqual` in the existing python unit tests.
  Author: Yanbo Liang <ybliang8@gmail.com> Closes #8814 from yanboliang/spark-10615.
* [SPARK-6548] Adding stddev to DataFrame functions (JihongMa, 2015-09-12; 1 file, -18/+18)
  Adding STDDEV support for DataFrame using a one-pass online/parallel algorithm to compute variance. Please review the code change.
  Author: JihongMa <linlin200605@gmail.com> Author: Jihong MA <linlin200605@gmail.com> Author: Jihong MA <jihongma@jihongs-mbp.usca.ibm.com> Author: Jihong MA <jihongma@Jihongs-MacBook-Pro.local> Closes #6297 from JihongMA/SPARK-SQL.
* [SPARK-9014][SQL] Allow Python spark API to use built-in exponential operator (0x0FFF, 2015-09-11; 2 files, -1/+14)
  This PR addresses [SPARK-9014](https://issues.apache.org/jira/browse/SPARK-9014). Added functionality: the `Column` object in Python now supports the exponential operator `**`. Example:
  ```
  from pyspark.sql import *
  df = sqlContext.createDataFrame([Row(a=2)])
  df.select(3**df.a, df.a**3, df.a**df.a).collect()
  ```
  Outputs:
  ```
  [Row(POWER(3.0, a)=9.0, POWER(a, 3.0)=8.0, POWER(a, a)=4.0)]
  ```
  Author: 0x0FFF <programmerag@gmail.com> Closes #8658 from 0x0FFF/SPARK-9014.
* [SPARK-7544][SQL][PySpark] pyspark.sql.types.Row implements __getitem__ (Yanbo Liang, 2015-09-10; 1 file, -0/+15)
  pyspark.sql.types.Row implements `__getitem__`.
  Author: Yanbo Liang <ybliang8@gmail.com> Closes #8333 from yanboliang/spark-7544.
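  Sketch:
  ```python
  from pyspark.sql import Row

  row = Row(name="Alice", age=5)
  row["name"]  # 'Alice', equivalent to row.name
  ```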
* [SPARK-10373][PYSPARK] move @since into pyspark from sql (Davies Liu, 2015-09-08; 8 files, -25/+7)
  cc mengxr
  Author: Davies Liu <davies@databricks.com> Closes #8657 from davies/move_since.
* [SPARK-10417][SQL] Iterating through Column results in infinite loop (0x0FFF, 2015-09-02; 2 files, -0/+12)
  The `pyspark.sql.column.Column` object has a `__getitem__` method, which makes it iterable for Python. In fact it has `__getitem__` to address the case when the column might be a list or dict, for you to be able to access a certain element of it in the DF API. The ability to iterate over it is just a side effect that might cause confusion for people getting familiar with Spark DF (as you might iterate this way on a Pandas DF, for instance). Issue reproduction:
  ```
  df = sqlContext.jsonRDD(sc.parallelize(['{"name": "El Magnifico"}']))
  for i in df["name"]: print i
  ```
  Author: 0x0FFF <programmerag@gmail.com> Closes #8574 from 0x0FFF/SPARK-10417.
* [SPARK-10392][SQL] Pyspark - Wrong DateType support on JDBC connection (0x0FFF, 2015-09-01; 2 files, -2/+9)
  This PR addresses issue [SPARK-10392](https://issues.apache.org/jira/browse/SPARK-10392). The problem is that for the "start of epoch" date (01 Jan 1970) the PySpark class DateType returns 0 instead of the `datetime.date`, due to the implementation of its return statement. Issue reproduction on master:
  ```
  >>> from pyspark.sql.types import *
  >>> a = DateType()
  >>> a.fromInternal(0)
  0
  >>> a.fromInternal(1)
  datetime.date(1970, 1, 2)
  ```
  Author: 0x0FFF <programmerag@gmail.com> Closes #8556 from 0x0FFF/SPARK-10392.
* [SPARK-10162][SQL] Fix the timezone omitting for PySpark Dataframe filter function (0x0FFF, 2015-09-01; 2 files, -10/+23)
  This PR addresses [SPARK-10162](https://issues.apache.org/jira/browse/SPARK-10162). The issue is with the DataFrame filter() function, if datetime.datetime is passed to it:
  * Timezone information of this datetime is ignored
  * This datetime is assumed to be in local timezone, which depends on the OS timezone setting
  The fix includes both a code change and a regression test. Problem reproduction code on master:
  ```python
  import pytz
  from datetime import datetime
  from pyspark.sql import *
  from pyspark.sql.types import *
  sqc = SQLContext(sc)
  df = sqc.createDataFrame([], StructType([StructField("dt", TimestampType())]))
  m1 = pytz.timezone('UTC')
  m2 = pytz.timezone('Etc/GMT+3')
  df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m1)).explain()
  df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m2)).explain()
  ```
  It gives the same timestamp, ignoring the time zone:
  ```
  >>> df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m1)).explain()
  Filter (dt#0 > 946713600000000)
   Scan PhysicalRDD[dt#0]
  >>> df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m2)).explain()
  Filter (dt#0 > 946713600000000)
   Scan PhysicalRDD[dt#0]
  ```
  After the fix:
  ```
  >>> df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m1)).explain()
  Filter (dt#0 > 946684800000000)
   Scan PhysicalRDD[dt#0]
  >>> df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m2)).explain()
  Filter (dt#0 > 946695600000000)
   Scan PhysicalRDD[dt#0]
  ```
  PR [8536](https://github.com/apache/spark/pull/8536) was accidentally closed by me when dropping the repo.
  Author: 0x0FFF <programmerag@gmail.com> Closes #8555 from 0x0FFF/SPARK-10162.
* [SPARK-9964][PYSPARK][SQL] PySpark DataFrameReader accept RDD of String for JSON (Yanbo Liang, 2015-08-26; 1 file, -6/+22)
  PySpark DataFrameReader should accept an RDD of Strings (like the Scala version does) for JSON, rather than only taking a path. If this PR is merged, it should be duplicated to cover the other input types (not just JSON).
  Author: Yanbo Liang <ybliang8@gmail.com> Closes #8444 from yanboliang/spark-9964.
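  Sketch of the new input type:
  ```python
  rdd = sc.parallelize(['{"name": "Alice"}', '{"name": "Bob"}'])
  df = sqlContext.read.json(rdd)  # previously only a path string was accepted here
  df.printSchema()
  ```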
* [SPARK-10305][SQL] fix create DataFrame from Python class (Davies Liu, 2015-08-26; 2 files, -0/+18)
  cc jkbradley
  Author: Davies Liu <davies@databricks.com> Closes #8470 from davies/fix_create_df.