path: root/python/pyspark/sql/context.py
* [SPARK-18687][PYSPARK][SQL] Backward compatibility - creating a DataFrame on a new SQLContext object fails with a Derby error (Vinayak, 2017-01-13, 1 file, -1/+1)

The change makes SQLContext reuse the active SparkSession during construction if the SparkContext supplied is the same as the currently active one. Without this change, a new SparkSession is instantiated, which results in a Derby error when attempting to create a DataFrame using a new SQLContext object even though the SparkContext supplied to the new SQLContext is the same as the currently active one. See https://issues.apache.org/jira/browse/SPARK-18687 for details on the error and a repro.

Tested with the existing unit tests and a new unit test added to pyspark-sql: /python/run-tests --python-executables=python --modules=pyspark-sql

Author: Vinayak <vijoshi5@in.ibm.com>
Author: Vinayak Joshi <vijoshi@users.noreply.github.com>

Closes #16119 from vijoshi/SPARK-18687_master.

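As a quick illustration of the intended behaviour (a hedged sketch, not code from the patch itself): constructing a second SQLContext over the same SparkContext should now reuse the active SparkSession rather than spin up a new one.

```python
from pyspark.sql import SQLContext

# Assumes an existing SparkContext `sc`, e.g. from the pyspark shell.
sqlContext1 = SQLContext(sc)
sqlContext2 = SQLContext(sc)   # reuses the active SparkSession instead of creating a new one

# Before the fix, this second context could hit the Derby error described above.
sqlContext2.createDataFrame([(1, "a")], ["x", "y"]).show()
```
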
* [SPARK-11775][PYSPARK][SQL] Allow PySpark to register Java UDF (Jeff Zhang, 2016-10-14, 1 file, -1/+27)

Currently PySpark can only call built-in Java UDFs, not custom ones. Allowing it has two benefits:

* Leverage the power of rich third-party Java libraries.
* Improve performance: a Python UDF requires Python daemons to be started on the workers, which hurts performance.

Author: Jeff Zhang <zjffdu@apache.org>

Closes #9766 from zjffdu/SPARK-11775.

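A hedged sketch of registering a custom Java UDF from PySpark, assuming the patch's `registerJavaFunction` API; the Java class name below is purely illustrative and must be a UDF implementation already on the classpath.

```python
from pyspark.sql.types import IntegerType

# "test.org.apache.spark.sql.JavaStringLength" is an illustrative class name; any
# Java UDF class available on the driver/executor classpath would do.
sqlContext.registerJavaFunction("javaStringLength",
                                "test.org.apache.spark.sql.JavaStringLength",
                                IntegerType())

sqlContext.sql("SELECT javaStringLength('spark')").show()
```
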
* [SPARK-17338][SQL] add global temp view (Wenchen Fan, 2016-10-10, 1 file, -1/+1)

## What changes were proposed in this pull request?

A global temporary view is a cross-session temporary view, i.e. it is shared among all sessions. Its lifetime is the lifetime of the Spark application: it is automatically dropped when the application terminates. It is tied to a system-preserved database `global_temp` (configurable via SparkConf), and the qualified name must be used to refer to a global temp view, e.g. SELECT * FROM global_temp.view1.

Changes for `SessionCatalog`:

1. add a new field `globalTempViews: GlobalTempViewManager`, to access the shared global temp views and the global temp db name.
2. `createDatabase` will fail if users want to create `global_temp`, which is system preserved.
3. `setCurrentDatabase` will fail if users want to set `global_temp`, which is system preserved.
4. add `createGlobalTempView`, which is used in `CreateViewCommand` to create global temp views.
5. add `dropGlobalTempView`, which is used in `CatalogImpl` to drop global temp views.
6. add `alterTempViewDefinition`, which is used in `AlterViewAsCommand` to update the view definition for local/global temp views.
7. `renameTable`/`dropTable`/`isTemporaryTable`/`lookupRelation`/`getTempViewOrPermanentTableMetadata`/`refreshTable` will handle global temp views.

Changes for SQL commands:

1. `CreateViewCommand`/`AlterViewAsCommand` are updated to support global temp views.
2. `ShowTablesCommand` outputs a new column `database`, which is used to distinguish global and local temp views.
3. Other commands can also handle global temp views if they call `SessionCatalog` APIs which accept global temp views, e.g. `DropTableCommand`, `AlterTableRenameCommand`, `ShowColumnsCommand`, etc.

Changes for other public APIs:

1. add a new method `dropGlobalTempView` in `Catalog`.
2. `Catalog.findTable` can find global temp views.
3. add a new method `createGlobalTempView` in `Dataset`.

## How was this patch tested?

New tests in `SQLViewSuite`.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #14897 from cloud-fan/global-temp-view.

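From the Python side, a hedged sketch of how the global temp view described above is used (view name is illustrative, `spark` is a SparkSession):

```python
df = spark.range(3)
df.createGlobalTempView("view1")

# Global temp views live in the system-preserved `global_temp` database and are
# visible from any session of the same application.
spark.sql("SELECT * FROM global_temp.view1").show()
spark.newSession().sql("SELECT * FROM global_temp.view1").show()

spark.catalog.dropGlobalTempView("view1")
```
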
* [SPARK-16700][PYSPARK][SQL] create DataFrame from dict/Row with schema (Davies Liu, 2016-08-15, 1 file, -2/+6)

## What changes were proposed in this pull request?

In 2.0 we verify the data type against the schema for every row for safety, but at a performance cost; this PR makes that verification optional. When verifying a data type against a StructType, not all the types supported by schema inference were supported (for example, dict); this PR fixes that to make them consistent. For Row objects created with named arguments, the fields are sorted by name and so may differ from the order in the provided schema; this PR fixes that by ignoring the field order in that case.

## How was this patch tested?

Created regression tests for them.

Author: Davies Liu <davies@databricks.com>

Closes #14469 from davies/py_dict.

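A hedged sketch of the two cases described above: a dict row checked against an explicit schema, and a Row built from keyword arguments. The `verifySchema` flag name is my reading of the "make it optional" part and should be confirmed against the merged code.

```python
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([StructField("a", IntegerType()), StructField("b", StringType())])

# dict rows are verified against the schema like the other supported types
df1 = spark.createDataFrame([{"a": 1, "b": "x"}], schema)

# Row(**kwargs) sorts its fields by name, which may not match the schema order;
# after this change the fields are matched regardless of that order
df2 = spark.createDataFrame([Row(b="x", a=1)], schema)

# per-row type verification can be turned off for speed (flag name assumed, see above)
df3 = spark.createDataFrame([{"a": 2, "b": "y"}], schema, verifySchema=False)
```
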
* [SPARK-16772][PYTHON][DOCS] Restore "datatype string" to Python API docstrings (Nicholas Chammas, 2016-07-29, 1 file, -6/+4)

## What changes were proposed in this pull request?

This PR corrects [an error made in an earlier PR](https://github.com/apache/spark/pull/14393/files#r72843069).

## How was this patch tested?

```sh
$ ./dev/lint-python
PEP8 checks passed.
rm -rf _build/*
pydoc checks passed.
```

I also built the docs and confirmed that they looked good in my browser.

Author: Nicholas Chammas <nicholas.chammas@gmail.com>

Closes #14408 from nchammas/SPARK-16772.

* [SPARK-16772] Correct API doc references to PySpark classes + formatting fixes (Nicholas Chammas, 2016-07-28, 1 file, -19/+25)

## What's Been Changed

The PR corrects several broken or missing class references in the Python API docs. It also corrects formatting problems.

For example, you can see [here](http://spark.apache.org/docs/2.0.0/api/python/pyspark.sql.html#pyspark.sql.SQLContext.registerFunction) how Sphinx is not picking up the reference to `DataType`. That's because the reference is relative to the current module, whereas `DataType` is in a different module.

You can also see [here](http://spark.apache.org/docs/2.0.0/api/python/pyspark.sql.html#pyspark.sql.SQLContext.createDataFrame) how the formatting for byte, tinyint, and so on is italic instead of monospace. That's because in ReST single backticks just make things italic, unlike in Markdown.

## Testing

I tested this PR by [building the Python docs](https://github.com/apache/spark/tree/master/docs#generating-the-documentation-html) and reviewing the results locally in my browser. I confirmed that the broken or missing class references were resolved and that the formatting was corrected.

Author: Nicholas Chammas <nicholas.chammas@gmail.com>

Closes #14393 from nchammas/python-docstring-fixes.

* [SPARK-16662][PYSPARK][SQL] fix HiveContext warning bug (WeichenXu, 2016-07-23, 1 file, -5/+4)

## What changes were proposed in this pull request?

Move the `HiveContext` deprecation warning into the `HiveContext` constructor, so that the warning appears only when `HiveContext` is actually used; otherwise the warning is always printed whenever the pyspark.ml.context code file is referenced.

## How was this patch tested?

Manual.

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #14301 from WeichenXu123/hiveContext_python_warning_update.

* [SPARK-15954][SQL] Disable loading test tables in Python tests (Reynold Xin, 2016-06-30, 1 file, -1/+1)

## What changes were proposed in this pull request?

This patch introduces a flag to disable loading test tables in TestHiveSparkSession and disables that in Python. This fixes an issue in which python/run-tests would fail due to failure to load test tables. Note that these test tables are not used outside of HiveCompatibilitySuite. In the long run we should probably decouple the loading of test tables from the test Hive setup.

## How was this patch tested?

This is a test-only change.

Author: Reynold Xin <rxin@databricks.com>

Closes #14005 from rxin/SPARK-15954.

* [SPARK-16313][SQL] Spark should not silently drop exceptions in file listing (Reynold Xin, 2016-06-30, 1 file, -1/+1)

## What changes were proposed in this pull request?

Spark silently drops exceptions during file listing. This is a very bad behavior because it can mask legitimate errors and the resulting plan will silently have 0 rows. This patch changes it to not silently drop the errors.

## How was this patch tested?

Manually verified.

Author: Reynold Xin <rxin@databricks.com>

Closes #13987 from rxin/SPARK-16313.

* [SPARK-16266][SQL][STREAMING] Moved DataStreamReader/Writer from pyspark.sql to pyspark.sql.streaming (Tathagata Das, 2016-06-28, 1 file, -1/+2)

## What changes were proposed in this pull request?

- Moved DataStreamReader/Writer from pyspark.sql to pyspark.sql.streaming to make them consistent with the Scala packaging
- Exposed the necessary classes in the sql.streaming package so that they appear in the docs
- Added the pyspark.sql.streaming module to the docs

## How was this patch tested?

- Updated unit tests.
- Generated docs to test the visibility of the pyspark.sql.streaming classes.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13955 from tdas/SPARK-16266.

* [SPARK-16268][PYSPARK] SQLContext should import DataStreamReader (Shixiong Zhu, 2016-06-28, 1 file, -2/+9)

## What changes were proposed in this pull request?

Fixed the following error:

```
>>> sqlContext.readStream
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "...", line 442, in readStream
    return DataStreamReader(self._wrapped)
NameError: global name 'DataStreamReader' is not defined
```

## How was this patch tested?

The added test.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #13958 from zsxwing/fix-import.

* [SPARK-15953][WIP][STREAMING] Renamed ContinuousQuery to StreamingQuery (Tathagata Das, 2016-06-15, 1 file, -4/+4)

Renamed for simplicity, so that it's obvious that it's related to streaming.

Tested with existing unit tests.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13673 from tdas/SPARK-15953.

* [SPARK-15933][SQL][STREAMING] Refactored DF reader-writer to use readStream and writeStream for streaming DFs (Tathagata Das, 2016-06-14, 1 file, -0/+13)

## What changes were proposed in this pull request?

Currently, DataFrameReader/Writer have methods that are needed for both streaming and non-streaming DFs. This is quite awkward because each of those methods throws a runtime exception for one case or the other. Rather than having half the methods throw runtime exceptions, it's better to have a separate reader/writer API for streams.

- [x] Python API!!

## How was this patch tested?

Existing unit tests + two sets of unit tests for DataFrameReader/Writer and DataStreamReader/Writer.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13653 from tdas/SPARK-15933.

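A hedged sketch of the split API this describes, using the streaming entry points that ended up in PySpark (the path and format choices below are illustrative only):

```python
# Streaming input now goes through readStream rather than read;
# the directory is illustrative and must exist for this to run.
streaming_df = spark.readStream.format("text").load("/tmp/streaming-input")

# ...and streaming output goes through writeStream rather than write.
query = (streaming_df.writeStream
         .format("console")
         .outputMode("append")
         .start())

query.stop()
```
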
* [SPARK-15935][PYSPARK] Enable test for sql/streaming.py and fix these tests (Shixiong Zhu, 2016-06-14, 1 file, -0/+2)

## What changes were proposed in this pull request?

This PR just enables tests for sql/streaming.py and also fixes the failures.

## How was this patch tested?

Existing unit tests.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #13655 from zsxwing/python-streaming-test.

* [SPARK-15075][SPARK-15345][SQL] Clean up SparkSession builder and propagate config options to existing sessions if specified (Reynold Xin, 2016-05-19, 1 file, -1/+4)

## What changes were proposed in this pull request?

Currently SparkSession.Builder uses SQLContext.getOrCreate. It should probably be the other way around, i.e. all the core logic goes in SparkSession and SQLContext just calls that. This patch does that. It also makes sure config options specified in the builder are propagated to the existing (and of course the new) SparkSession.

## How was this patch tested?

Updated tests to reflect the change, and also introduced a new SparkSessionBuilderSuite that should cover all the branches.

Author: Reynold Xin <rxin/databricks.com>

Closes #13200 from rxin/SPARK-15075.

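In Python terms, the builder behaviour being described looks roughly like this sketch (the config key is just an example):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("builder-example")
         .config("spark.sql.shuffle.partitions", "4")   # example option
         .getOrCreate())

# getOrCreate() returns the existing session if one is active, and the
# options given to the builder are applied to that session as well.
spark2 = SparkSession.builder.config("spark.sql.shuffle.partitions", "8").getOrCreate()
assert spark is spark2
print(spark.conf.get("spark.sql.shuffle.partitions"))   # "8"
```
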
* [SPARK-15171][SQL] Remove the references to deprecated method dataset.registerTempTable (Sean Zhong, 2016-05-18, 1 file, -2/+2)

## What changes were proposed in this pull request?

Update the unit test code, examples, and documents to remove calls to the deprecated method `dataset.registerTempTable`.

## How was this patch tested?

This PR only changes the unit test code, examples, and comments. It should be safe. This is a follow-up of PR https://github.com/apache/spark/pull/12945, which was merged.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #13098 from clockfly/spark-15171-remove-deprecation.

* [SPARK-15171][SQL] Deprecate registerTempTable and add dataset.createTempView (Sean Zhong, 2016-05-12, 1 file, -2/+2)

## What changes were proposed in this pull request?

Deprecates registerTempTable and adds dataset.createTempView and dataset.createOrReplaceTempView.

## How was this patch tested?

Unit tests.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #12945 from clockfly/spark-15171.

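A hedged sketch of the replacement pattern in PySpark (the view name is illustrative):

```python
df = spark.range(3)

# old, now deprecated:
#   df.registerTempTable("numbers")
# new:
df.createOrReplaceTempView("numbers")

spark.sql("SELECT * FROM numbers").show()
```
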
* [SPARK-15270] [SQL] Use SparkSession Builder to build a session with HiveSupport (Sandeep Singh, 2016-05-11, 1 file, -1/+1)

## What changes were proposed in this pull request?

Before: creating a HiveContext was failing

```python
from pyspark.sql import HiveContext
hc = HiveContext(sc)
```

with

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "spark-2.0/python/pyspark/sql/context.py", line 458, in __init__
    sparkSession = SparkSession.withHiveSupport(sparkContext)
  File "spark-2.0/python/pyspark/sql/session.py", line 192, in withHiveSupport
    jsparkSession = sparkContext._jvm.SparkSession.withHiveSupport(sparkContext._jsc.sc())
  File "spark-2.0/python/lib/py4j-0.9.2-src.zip/py4j/java_gateway.py", line 1048, in __getattr__
py4j.protocol.Py4JError: org.apache.spark.sql.SparkSession.withHiveSupport does not exist in the JVM
```

Now:

```python
>>> from pyspark.sql import HiveContext
>>> hc = HiveContext(sc)
>>> hc.range(0, 100)
DataFrame[id: bigint]
>>> hc.range(0, 100).count()
100
```

## How was this patch tested?

Existing tests, tested manually in the Python shell.

Author: Sandeep Singh <sandeep@techaddict.me>

Closes #13056 from techaddict/SPARK-15270.

* [SPARK-14896][SQL] Deprecate HiveContext in Python (Andrew Or, 2016-05-04, 1 file, -1/+8)

## What changes were proposed in this pull request?

See title.

## How was this patch tested?

PySpark tests.

Author: Andrew Or <andrew@databricks.com>

Closes #12917 from andrewor14/deprecate-hive-context-python.

* [SPARK-15012][SQL] Simplify configuration API further (Andrew Or, 2016-04-29, 1 file, -2/+2)

## What changes were proposed in this pull request?

1. Remove all the `spark.setConf` etc. Just expose `spark.conf`.
2. Make `spark.conf` take in things set in the core `SparkConf` as well, otherwise users may get confused.

This was done for both the Python and Scala APIs.

## How was this patch tested?

`SQLConfSuite`, Python tests. This one fixes the failed tests in #12787.

Closes #12787

Author: Andrew Or <andrew@databricks.com>
Author: Yin Huai <yhuai@databricks.com>

Closes #12798 from yhuai/conf-api.

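A hedged sketch of the single `spark.conf` entry point described above (the keys are examples):

```python
# Runtime SQL options are set and read through spark.conf
spark.conf.set("spark.sql.shuffle.partitions", "8")
print(spark.conf.get("spark.sql.shuffle.partitions"))

# Values set in the core SparkConf are visible through the same interface
print(spark.conf.get("spark.app.name"))
```
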
* [SPARK-14988][PYTHON] SparkSession API follow-ups (Andrew Or, 2016-04-29, 1 file, -4/+4)

## What changes were proposed in this pull request?

Addresses comments in #12765.

## How was this patch tested?

Python tests.

Author: Andrew Or <andrew@databricks.com>

Closes #12784 from andrewor14/python-followup.

* [SPARK-14988][PYTHON] SparkSession catalog and conf API (Andrew Or, 2016-04-29, 1 file, -5/+6)

## What changes were proposed in this pull request?

The `catalog` and `conf` APIs were exposed in `SparkSession` in #12713 and #12669. This patch adds those to the Python API.

## How was this patch tested?

Python tests.

Author: Andrew Or <andrew@databricks.com>

Closes #12765 from andrewor14/python-spark-session-more.

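A hedged sketch of what the two new attributes expose on the Python side (the config key is only an example):

```python
# Catalog: inspect databases and tables of the current session
print(spark.catalog.listDatabases())
print(spark.catalog.listTables())

# Conf: runtime configuration for the session
spark.conf.set("spark.sql.crossJoin.enabled", "true")
print(spark.conf.get("spark.sql.crossJoin.enabled"))
```
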
* [SPARK-14555] Second cut of Python API for Structured Streaming (Burak Yavuz, 2016-04-28, 1 file, -0/+9)

## What changes were proposed in this pull request?

This PR adds Python APIs for:

- `ContinuousQueryManager`
- `ContinuousQueryException`

`ContinuousQueryException` is a very basic wrapper; it doesn't provide the functionality that the Scala side provides, but it follows the same pattern as `AnalysisException`. For `ContinuousQueryManager`, all APIs are provided except for registering listeners.

This PR also attempts to fix test flakiness by stopping all active streams just before tests.

## How was this patch tested?

Python doc tests and unit tests.

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #12673 from brkyvz/pyspark-cqm.

* [SPARK-14945][PYTHON] SparkSession Python API (Andrew Or, 2016-04-28, 1 file, -229/+49)

## What changes were proposed in this pull request?

```
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.0.0-SNAPSHOT
      /_/

Using Python version 2.7.5 (default, Mar 9 2014 22:15:05)
SparkSession available as 'spark'.
>>> spark
<pyspark.sql.session.SparkSession object at 0x101f3bfd0>
>>> spark.sql("SHOW TABLES").show()
...
+---------+-----------+
|tableName|isTemporary|
+---------+-----------+
|      src|      false|
+---------+-----------+
>>> spark.range(1, 10, 2).show()
+---+
| id|
+---+
|  1|
|  3|
|  5|
|  7|
|  9|
+---+
```

**Note**: This API is NOT complete in its current state. In particular, for now I left out the `conf` and `catalog` APIs, which were added later in Scala. These will be added later, before 2.0.

## How was this patch tested?

Python tests.

Author: Andrew Or <andrew@databricks.com>

Closes #12746 from andrewor14/python-spark-session.

* [SPARK-14721][SQL] Remove HiveContext (part 2) (Andrew Or, 2016-04-25, 1 file, -1/+2)

## What changes were proposed in this pull request?

This removes the class `HiveContext` itself along with all code usages associated with it. The bulk of the work was already done in #12485. This is mainly just code cleanup and actually removing the class.

Note: a couple of things will break after this patch. These will be fixed separately.

- the Python HiveContext
- all the documentation / comments referencing HiveContext
- there will be no more HiveContext in the REPL (fixed by #12589)

## How was this patch tested?

No change in functionality.

Author: Andrew Or <andrew@databricks.com>

Closes #12585 from andrewor14/delete-hive-context.

* Support single argument version of sqlContext.getConf (mathieu longtin, 2016-04-23, 1 file, -3/+17)

## What changes were proposed in this pull request?

In Python, sqlContext.getConf didn't allow getting the system default (getConf with one parameter). Now the following are supported:

```
sqlContext.getConf(confName)             # system default if not locally set, this is new
sqlContext.getConf(confName, myDefault)  # myDefault if not locally set, old behavior
```

I also added doctests to this function. The original behavior does not change.

## How was this patch tested?

Manually, but doctests were added.

Author: mathieu longtin <mathieu.longtin@nuance.com>

Closes #12488 from mathieulongtin/pyfixgetconf3.

* [SPARK-14573][PYSPARK][BUILD] Fix PyDoc Makefile & highlighting issues (Holden Karau, 2016-04-14, 1 file, -1/+1)

## What changes were proposed in this pull request?

The PyDoc Makefile used "=" rather than "?=" for setting env variables, so it overwrote the user values. This ignored the environment variables we set for linting, allowing warnings through. This PR also fixes the warnings that had been introduced.

## How was this patch tested?

Manual local export & make.

Author: Holden Karau <holden@us.ibm.com>

Closes #12336 from holdenk/SPARK-14573-fix-pydoc-makefile.

* [SPARK-14014][SQL] Integrate session catalog (attempt #2) (Andrew Or, 2016-03-24, 1 file, -1/+1)

## What changes were proposed in this pull request?

This reopens #11836, which was merged but promptly reverted because it introduced flaky Hive tests.

## How was this patch tested?

See `CatalogTestCases`, `SessionCatalogSuite` and `HiveContextSuite`.

Author: Andrew Or <andrew@databricks.com>

Closes #11938 from andrewor14/session-catalog-again.

* Revert "[SPARK-14014][SQL] Replace existing catalog with SessionCatalog" (Andrew Or, 2016-03-23, 1 file, -1/+1)

This reverts commit 5dfc01976bb0d72489620b4f32cc12d620bb6260.

* [SPARK-14014][SQL] Replace existing catalog with SessionCatalog (Andrew Or, 2016-03-23, 1 file, -1/+1)

## What changes were proposed in this pull request?

`SessionCatalog`, introduced in #11750, is a catalog that keeps track of temporary functions and tables, and delegates metastore operations to `ExternalCatalog`. This functionality overlaps a lot with the existing `analysis.Catalog`. As of this commit, `SessionCatalog` and `ExternalCatalog` will no longer be dead code. There are still things that need to be done after this patch, namely:

- SPARK-14013: Properly implement temporary functions in `SessionCatalog`
- SPARK-13879: Decide which DDL/DML commands to support natively in Spark
- SPARK-?????: Implement the ones we do want to support through `SessionCatalog`.
- SPARK-?????: Merge SQL/HiveContext

## How was this patch tested?

This is largely a refactoring task, so there are no new tests introduced. The particularly relevant tests are `SessionCatalogSuite` and `ExternalCatalogSuite`.

Author: Andrew Or <andrew@databricks.com>
Author: Yin Huai <yhuai@databricks.com>

Closes #11836 from andrewor14/use-session-catalog.

* [SPARK-13593][SQL] improve the `createDataFrame` to accept data type string and verify the data (Wenchen Fan, 2016-03-08, 1 file, -14/+54)

## What changes were proposed in this pull request?

This PR improves the `createDataFrame` method to also accept a datatype string, so users can convert a Python RDD to a DataFrame easily, for example, `df = rdd.toDF("a: int, b: string")`. It also supports flat schemas, so users can convert an RDD of int to a DataFrame directly; we automatically wrap the int into a row for them. If a schema is given, we now check whether the real data matches the given schema, and throw an error if it doesn't.

## How was this patch tested?

New tests in `test.py` and doc tests in `types.py`.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #11444 from cloud-fan/pyrdd.

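A hedged sketch of the datatype-string forms described above (assuming a SparkContext `sc` and SparkSession `spark`):

```python
rdd = sc.parallelize([(1, "x"), (2, "y")])

# schema given as a datatype string instead of a StructType
df = rdd.toDF("a: int, b: string")
df2 = spark.createDataFrame(rdd, "a: int, b: string")

# flat schema: an RDD of plain ints is wrapped into single-column rows
ints = spark.createDataFrame(sc.parallelize([1, 2, 3]), "int")
```
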
* [SPARK-13594][SQL] remove typed operations (e.g. map, flatMap) from Python DataFrame (Wenchen Fan, 2016-03-02, 1 file, -1/+1)

## What changes were proposed in this pull request?

Remove `map`, `flatMap`, and `mapPartitions` from the Python DataFrame, to prepare for the Dataset API in the future.

## How was this patch tested?

Existing tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #11445 from cloud-fan/python-clean.

* [SPARK-13467] [PYSPARK] abstract python function to simplify pyspark code (Wenchen Fan, 2016-02-24, 1 file, -1/+1)

## What changes were proposed in this pull request?

When we pass a Python function to the JVM side, we also need to send its context, e.g. `envVars`, `pythonIncludes`, `pythonExec`, etc. However, it's annoying to pass around so many parameters in many places. This PR abstracts the Python function along with its context, to simplify some PySpark code and make the logic clearer.

## How was this patch tested?

By existing unit tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #11342 from cloud-fan/python-clean.

* [SPARK-12799] Simplify various string output for expressions (Cheng Lian, 2016-02-21, 1 file, -5/+5)

This PR introduces several major changes:

1. Replacing `Expression.prettyString` with `Expression.sql`

   The `prettyString` method is mostly an internal, developer-facing facility for debugging purposes, and shouldn't be exposed to users.

2. Using a SQL-like representation as column names for selected fields that are not named expressions (back-ticks and double quotes should be removed)

   Before, we were using `prettyString` as column names when possible, and sometimes the resulting column names can be weird. Here are several examples:

   Expression        | `prettyString` | `sql`     | Note
   ----------------- | -------------- | --------- | ----------------
   `a && b`          | `a && b`       | `a AND b` |
   `a.getField("f")` | `a[f]`         | `a.f`     | `a` is a struct

3. Adding trait `NonSQLExpression`, extending from `Expression`, for expressions that don't have a SQL representation (e.g. Scala UDF/UDAF and Java/Scala object expressions used for encoders)

   `NonSQLExpression.sql` may return an arbitrary user-facing string representation of the expression.

Author: Cheng Lian <lian@databricks.com>

Closes #10757 from liancheng/spark-12799.simplify-expression-string-methods.

* [SPARK-12120][PYSPARK] Improve exception message when failing to initialize HiveContext in PySpark (Jeff Zhang, 2016-01-24, 1 file, -3/+5)

davies Mind to review? This is the error message after this PR:

```
15/12/03 16:59:53 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
/Users/jzhang/github/spark/python/pyspark/sql/context.py:689: UserWarning: You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly
  warnings.warn("You must build Spark with Hive. "
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/jzhang/github/spark/python/pyspark/sql/context.py", line 663, in read
    return DataFrameReader(self)
  File "/Users/jzhang/github/spark/python/pyspark/sql/readwriter.py", line 56, in __init__
    self._jreader = sqlContext._ssql_ctx.read()
  File "/Users/jzhang/github/spark/python/pyspark/sql/context.py", line 692, in _ssql_ctx
    raise e
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.
: java.lang.RuntimeException: java.net.ConnectException: Call From jzhangMBPr.local/127.0.0.1 to 0.0.0.0:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
    at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
    at org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:194)
    at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:238)
    at org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:218)
    at org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:208)
    at org.apache.spark.sql.hive.HiveContext.functionRegistry$lzycompute(HiveContext.scala:462)
    at org.apache.spark.sql.hive.HiveContext.functionRegistry(HiveContext.scala:461)
    at org.apache.spark.sql.UDFRegistration.<init>(UDFRegistration.scala:40)
    at org.apache.spark.sql.SQLContext.<init>(SQLContext.scala:330)
    at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:90)
    at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:101)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
    at py4j.Gateway.invoke(Gateway.java:214)
    at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
    at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
    at py4j.GatewayConnection.run(GatewayConnection.java:209)
    at java.lang.Thread.run(Thread.java:745)
```

Author: Jeff Zhang <zjffdu@apache.org>

Closes #10126 from zjffdu/SPARK-12120.

* [SPARK-12600][SQL] Remove deprecated methods in Spark SQL (Reynold Xin, 2016-01-04, 1 file, -111/+0)

Author: Reynold Xin <rxin@databricks.com>

Closes #10559 from rxin/remove-deprecated-sql.

* [SPARK-12300] [SQL] [PYSPARK] fix schema inference on local collections (Holden Karau, 2015-12-30, 1 file, -7/+3)

Current schema inference for local Python collections halts as soon as there are no NullTypes. This is different from when we specify a sampling ratio of 1.0 on a distributed collection, and could result in incomplete schema information.

Author: Holden Karau <holden@us.ibm.com>

Closes #10275 from holdenk/SPARK-12300-fix-schmea-inferance-on-local-collections.

* [SPARK-11917][PYSPARK] Add SQLContext#dropTempTable to PySpark (Jeff Zhang, 2015-11-26, 1 file, -0/+9)

Author: Jeff Zhang <zjffdu@apache.org>

Closes #9903 from zjffdu/SPARK-11917.

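A hedged sketch of the added method in use (the table name is illustrative):

```python
df = sqlContext.createDataFrame([(1, "a")], ["id", "name"])
sqlContext.registerDataFrameAsTable(df, "people_tmp")

# dropTempTable removes the temporary table from the catalog; the DataFrame itself is unaffected
sqlContext.dropTempTable("people_tmp")
```
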
* [SPARK-11860][PYSPARK][DOCUMENTATION] Invalid argument specification for registerFunction [Python] (Jeff Zhang, 2015-11-25, 1 file, -2/+3)

Straightforward change to the Python doc.

Author: Jeff Zhang <zjffdu@apache.org>

Closes #9901 from zjffdu/SPARK-11860.

* [SPARK-11671] documentation code example typo (Chris Snow, 2015-11-12, 1 file, -1/+1)

The example for sqlContext.createDataFrame from a pandas.DataFrame has a typo.

Author: Chris Snow <chsnow123@gmail.com>

Closes #9639 from snowch/patch-2.

* [SPARK-11437] [PYSPARK] Don't .take when converting RDD to DataFrame with provided schema (Jason White, 2015-11-02, 1 file, -7/+1)

When creating a DataFrame from an RDD in PySpark, `createDataFrame` calls `.take(10)` to verify that the first 10 rows of the RDD match the provided schema. Similar to https://issues.apache.org/jira/browse/SPARK-8070, but that issue affected cases where a schema was not provided.

Verifying the first 10 rows is of limited utility and causes the DAG to be executed non-lazily. If necessary, I believe this verification should be done lazily on all rows. However, since the caller is providing a schema to follow, I think it's acceptable to simply fail if the schema is incorrect.

marmbrus We chatted about this at SparkSummitEU. davies you made a similar change for the infer-schema path in https://github.com/apache/spark/pull/6606

Author: Jason White <jason.white@shopify.com>

Closes #9392 from JasonMWhite/createDataFrame_without_take.

* [SPARK-11114][PYSPARK] add getOrCreate for SparkContext/SQLContext in Python (Davies Liu, 2015-10-19, 1 file, -0/+27)

Also added SQLContext.newSession().

Author: Davies Liu <davies@databricks.com>

Closes #9122 from davies/py_create.

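A hedged sketch of the two additions:

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()          # returns the running SparkContext, or creates one
sqlContext = SQLContext.getOrCreate(sc)  # likewise for the SQLContext

# newSession() gives a context with separate temp tables and UDF registrations,
# while still sharing the same SparkContext
other = sqlContext.newSession()
```
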
* [SPARK-10373] [PYSPARK] move @since into pyspark from sql (Davies Liu, 2015-09-08, 1 file, -1/+1)

cc mengxr

Author: Davies Liu <davies@databricks.com>

Closes #8657 from davies/move_since.

* [SPARK-9942] [PYSPARK] [SQL] ignore exceptions while trying to import pandas (Davies Liu, 2015-08-13, 1 file, -1/+1)

If pandas is broken (it can't be imported, or raises exceptions other than ImportError), PySpark can't be imported; we should ignore all such exceptions.

Author: Davies Liu <davies@databricks.com>

Closes #8173 from davies/fix_pandas.

* [SPARK-9116] [SQL] [PYSPARK] support Python only UDT in __main__ (Davies Liu, 2015-07-29, 1 file, -41/+67)

Also, we can now create a Python UDT without having a Scala one; this is important for Python users.

cc mengxr JoshRosen

Author: Davies Liu <davies@databricks.com>

Closes #7453 from davies/class_in_main and squashes the following commits:

4dfd5e1 [Davies Liu] add tests for Python and Scala UDT
793d9b2 [Davies Liu] Merge branch 'master' of github.com:apache/spark into class_in_main
dc65f19 [Davies Liu] address comment
a9a3c40 [Davies Liu] Merge branch 'master' of github.com:apache/spark into class_in_main
a86e1fc [Davies Liu] fix serialization
ad528ba [Davies Liu] Merge branch 'master' of github.com:apache/spark into class_in_main
63f52ef [Davies Liu] fix pylint check
655b8a9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into class_in_main
316a394 [Davies Liu] support Python UDT with UTF
0bcb3ef [Davies Liu] fix bug in mllib
de986d6 [Davies Liu] fix test
83d65ac [Davies Liu] fix bug in StructType
55bb86e [Davies Liu] support Python UDT in __main__ (without Scala one)

* [SPARK-9114] [SQL] [PySpark] convert returned object from UDF into internal type (Davies Liu, 2015-07-20, 1 file, -13/+3)

This PR also removes the duplicated code between registerFunction and UserDefinedFunction.

cc JoshRosen

Author: Davies Liu <davies@databricks.com>

Closes #7450 from davies/fix_return_type and squashes the following commits:

e80bf9f [Davies Liu] remove debugging code
f94b1f6 [Davies Liu] fix mima
8f9c58b [Davies Liu] convert returned object from UDF into internal type

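For context, a hedged sketch of the registerFunction path whose return values this change converts (the UDF name and logic are illustrative):

```python
from pyspark.sql.types import IntegerType

# register a Python UDF on the SQLContext; its return values are converted
# into the internal representation of the declared return type
sqlContext.registerFunction("plus_one", lambda x: x + 1, IntegerType())
sqlContext.sql("SELECT plus_one(41) AS answer").show()
```
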
* [SPARK-7902] [SPARK-6289] [SPARK-8685] [SQL] [PYSPARK] Refactor of serialization for Python DataFrame (Davies Liu, 2015-07-09, 1 file, -3/+2)

This PR fixes the long-standing issue of serialization between Python RDDs and DataFrames. It changes to using a customized Pickler for InternalRow to enable customized unpickling (type conversion, especially for UDTs), so we can now support UDTs for UDFs. cc mengxr. There is no generated `Row` anymore.

Author: Davies Liu <davies@databricks.com>

Closes #7301 from davies/sql_ser and squashes the following commits:

81bef71 [Davies Liu] address comments
e9217bd [Davies Liu] add regression tests
db34167 [Davies Liu] Refactor of serialization for Python DataFrame

* [SPARK-8535] [PYSPARK] PySpark: Can't create DataFrame from Pandas dataframe with no explicit column name (x1-, 2015-06-30, 1 file, -1/+3)

The implicit names in `pandas.columns` are ints, but `StructField` JSON expects `String`, so `pandas.columns` should be converted to `String`.

### issue

* [SPARK-8535 PySpark : Can't create DataFrame from Pandas dataframe with no explicit column name](https://issues.apache.org/jira/browse/SPARK-8535)

Author: x1- <viva008@gmail.com>

Closes #7124 from x1-/SPARK-8535 and squashes the following commits:

d68fd38 [x1-] modify unit-test using pandas.
ea1897d [x1-] For implicit name of pandas.columns are Int, so should be convert to String.

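A hedged sketch of the situation being fixed: a pandas DataFrame built without column names gets integer column labels, which must become strings in the Spark schema.

```python
import pandas as pd

# no column names given, so pandas labels the columns 0 and 1
pdf = pd.DataFrame([[1, "a"], [2, "b"]])

# the integer labels are converted to strings for the Spark schema
df = sqlContext.createDataFrame(pdf)
print(df.columns)   # ['0', '1']
```
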
* [SPARK-8738] [SQL] [PYSPARK] capture SQL AnalysisException in Python API (Davies Liu, 2015-06-30, 1 file, -0/+2)

Capture the AnalysisException in SQL, hide the long Java stack trace, and only show the error message.

cc rxin

Author: Davies Liu <davies@databricks.com>

Closes #7135 from davies/ananylis and squashes the following commits:

dad7ae7 [Davies Liu] add comment
ec0c0e8 [Davies Liu] Update utils.py
cdd7edd [Davies Liu] add doc
7b044c2 [Davies Liu] fix python 3
f84d3bd [Davies Liu] capture SQL AnalysisException in Python API

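A hedged sketch of what this enables on the Python side: catching the analysis error as a Python exception instead of wading through a Py4J Java stack trace (the table name below intentionally does not exist):

```python
from pyspark.sql.utils import AnalysisException

try:
    sqlContext.sql("SELECT * FROM table_that_does_not_exist")
except AnalysisException as e:
    # only the analyzer's error message, not the full Java stack trace
    print("query failed: %s" % e)
```
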
* [SPARK-8070] [SQL] [PYSPARK] avoid spark jobs in createDataFrame (Davies Liu, 2015-06-29, 1 file, -17/+47)

Avoid the unnecessary jobs when inferring a schema from a list.

cc yhuai mengxr

Author: Davies Liu <davies@databricks.com>

Closes #6606 from davies/improve_create and squashes the following commits:

a5928bf [Davies Liu] Update MimaExcludes.scala
62da911 [Davies Liu] fix mima
bab4d7d [Davies Liu] Merge branch 'improve_create' of github.com:davies/spark into improve_create
eee44a8 [Davies Liu] Merge branch 'master' of github.com:apache/spark into improve_create
8d9292d [Davies Liu] Update context.py
eb24531 [Davies Liu] Update context.py
c969997 [Davies Liu] bug fix
d5a8ab0 [Davies Liu] fix tests
8c3f10d [Davies Liu] Merge branch 'master' of github.com:apache/spark into improve_create
6ea5925 [Davies Liu] address comments
6ceaeff [Davies Liu] avoid spark jobs in createDataFrame
