* [SPARK-19069][CORE] Expose task 'status' and 'duration' in spark history server REST API. (Parag Chaudhari, 2017-01-20, 15 files, -5/+360)

  ## What changes were proposed in this pull request?
  Although the Spark history server UI shows task 'status' and 'duration' fields, it does not expose these fields in the REST API response, so API users cannot determine task status and duration. The history server has access to both values from the event log but does not expose them. This patch exposes the task 'status' and 'duration' fields in the Spark history server REST API.

  ## How was this patch tested?
  Modified existing test cases in org.apache.spark.deploy.history.HistoryServerSuite.

  Author: Parag Chaudhari <paragpc@amazon.com>
  Closes #16473 from paragpc/expose_task_status.

* [SPARK-19302][DOC][MINOR] Fix the wrong item format in security.md (sarutak, 2017-01-20, 1 file, -0/+1)

  ## What changes were proposed in this pull request?
  docs/security.md contains the following description:

  ```
  steps to configure the key-stores and the trust-store for the standalone deployment mode is as follows:
  * Generate a keys pair for each node
  * Export the public key of the key pair to a file on each node
  * Import all exported public keys into a single trust-store
  ```

  According to markdown format, the first item should follow a blank line.

  ## How was this patch tested?
  Manually tested. The following captures are the rendered web page before and after the fix.
  * before ![before](https://cloud.githubusercontent.com/assets/4736016/22136731/b358115c-df19-11e6-8f6c-2f7b65766265.png)
  * after ![after](https://cloud.githubusercontent.com/assets/4736016/22136745/c6366ff8-df19-11e6-840d-e7e894218f9c.png)

  Author: sarutak <sarutak@oss.nttdata.co.jp>
  Closes #16653 from sarutak/SPARK-19302.

* [SPARK-19271][SQL] Change non-cbo estimation of aggregate (wangzhenhua, 2017-01-19, 7 files, -11/+38)

  ## What changes were proposed in this pull request?
  Change the non-cbo estimation behavior of aggregate:
  - If groupExpression is empty, we can know the row count (= 1) and the corresponding size;
  - otherwise, estimation falls back to UnaryNode's computeStats method, which should not propagate rowCount and attributeStats in Statistics because they are not estimated in that method.

  ## How was this patch tested?
  Added test case.

  Author: wangzhenhua <wangzhenhua@huawei.com>
  Closes #16631 from wzhfy/aggNoCbo.

* [SPARK-19292][SQL] filter with partition columns should be case-insensitive on Hive tables (Wenchen Fan, 2017-01-19, 3 files, -2/+25)

  ## What changes were proposed in this pull request?
  When we query a table with a filter on partitioned columns, we push the partition filter down to the metastore to get matched partitions directly. `HiveExternalCatalog.listPartitionsByFilter` assumes the column names in the partition filter are already normalized, so it does not consider case sensitivity. However, `HiveTableScanExec` doesn't follow this assumption. This PR fixes it.

  ## How was this patch tested?
  New regression test.

  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #16647 from cloud-fan/bug.

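  For illustration only (not part of the PR above): a minimal sketch of the scenario being fixed, assuming a spark-shell with Hive support and the default case-insensitive resolution; the table and column names are made up.

  ```scala
  // Partition column declared as `partCol`, filter written in a different case.
  spark.sql("CREATE TABLE part_t (v INT) PARTITIONED BY (partCol INT) STORED AS PARQUET")
  // With case-insensitive resolution this should prune on partCol; before the fix,
  // HiveTableScanExec could push the un-normalized column name to the metastore.
  spark.sql("SELECT * FROM part_t WHERE PARTCOL = 1").show()
  ```
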
* [SPARK-17912][SQL] Refactor code generation to get data for ColumnVector/ColumnarBatch (Kazuaki Ishizaki, 2017-01-19, 2 files, -84/+135)

  ## What changes were proposed in this pull request?
  This PR refactors the code generation that reads data from `ColumnarVector` and `ColumnarBatch` into a trait, `ColumnarBatchScan`, for ease of reuse. This part will be reused by several components (e.g. the parquet reader, Dataset.cache, and others) since `ColumnarBatch` will be a first-class citizen. This PR is part of https://github.com/apache/spark/pull/15219; in advance of that work, it makes the code generation for `ColumnarVector` and `ColumnarBatch` reusable as a trait, which is also useful for other components from a reusability point of view.

  ## How was this patch tested?
  Tested with existing test suites.

  Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
  Closes #15467 from kiszk/columnarrefactor.

* [SPARK-19295][SQL] IsolatedClientLoader's downloadVersion should log the location of downloaded metastore client jars (Yin Huai, 2017-01-19, 1 file, -0/+1)

  ## What changes were proposed in this pull request?
  This will help the users to know the location of those downloaded jars when `spark.sql.hive.metastore.jars` is set to `maven`.

  ## How was this patch tested?
  Jenkins.

  Author: Yin Huai <yhuai@databricks.com>
  Closes #16649 from yhuai/SPARK-19295.

* [SPARK-16654][CORE] Add UI coverage for Application Level BlacklistingJosé Hiram Soltren2017-01-1936-235/+950
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Builds on top of work in SPARK-8425 to update Application Level Blacklisting in the scheduler. ## What changes were proposed in this pull request? Adds a UI to these patches by: - defining new listener events for blacklisting and unblacklisting, nodes and executors; - sending said events at the relevant points in BlacklistTracker; - adding JSON (de)serialization code for these events; - augmenting the Executors UI page to show which, and how many, executors are blacklisted; - adding a unit test to make sure events are being fired; - adding HistoryServerSuite coverage to verify that the SHS reads these events correctly. - updates the Executor UI to show Blacklisted/Active/Dead as a tri-state in Executors Status Updates .rat-excludes to pass tests. username squito ## How was this patch tested? ./dev/run-tests testOnly org.apache.spark.util.JsonProtocolSuite testOnly org.apache.spark.scheduler.BlacklistTrackerSuite testOnly org.apache.spark.deploy.history.HistoryServerSuite https://github.com/jsoltren/jose-utils/blob/master/blacklist/test-blacklist.sh ![blacklist-20161219](https://cloud.githubusercontent.com/assets/1208477/21335321/9eda320a-c623-11e6-8b8c-9c912a73c276.jpg) Author: José Hiram Soltren <jose@cloudera.com> Closes #16346 from jsoltren/SPARK-16654-submit.
* [SPARK-19059][SQL] Unable to retrieve data from parquet table whose name starts with underscore (jayadevanmurali, 2017-01-19, 2 files, -46/+53)

  ## What changes were proposed in this pull request?
  The initial shouldFilterOut() invocation filters the root path name (the table name in the initial call) and removes it if it contains "_". This change moves the check one level below: it first lists the files/directories in the given root path and then applies the filter.

  ## How was this patch tested?
  Added a new test case for this scenario.

  Author: jayadevanmurali <jayadevan.m@tcs.com>
  Author: jayadevan <jayadevan.m@tcs.com>
  Closes #16635 from jayadevanmurali/branch-0.1-SPARK-19059.

* [SPARK-14272][ML] Add Loglikelihood in GaussianMixtureSummary (Zheng RuiFeng, 2017-01-19, 5 files, -4/+27)

  ## What changes were proposed in this pull request?
  Add loglikelihood to GMM.summary.

  ## How was this patch tested?
  Added tests.

  Author: Zheng RuiFeng <ruifengz@foxmail.com>
  Author: Ruifeng Zheng <ruifengz@foxmail.com>
  Closes #12064 from zhengruifeng/gmm_metric.

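  A usage sketch of the new metric (tiny made-up dataset; `spark` is an existing SparkSession, and the summary field name is assumed from the description above):

  ```scala
  import org.apache.spark.ml.clustering.GaussianMixture
  import org.apache.spark.ml.linalg.Vectors

  val data = spark.createDataFrame(Seq(
    Tuple1(Vectors.dense(0.0, 0.1)), Tuple1(Vectors.dense(0.2, 0.0)),
    Tuple1(Vectors.dense(9.0, 8.8)), Tuple1(Vectors.dense(8.9, 9.1))
  )).toDF("features")

  val model = new GaussianMixture().setK(2).setSeed(1L).fit(data)
  // Log-likelihood of the training data, surfaced by this change.
  println(model.summary.logLikelihood)
  ```
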
* [SPARK-19265][SQL] make table relation cache general and does not depend on hiveWenchen Fan2017-01-1920-198/+144
| | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? We have a table relation plan cache in `HiveMetastoreCatalog`, which caches a lot of things: file status, resolved data source, inferred schema, etc. However, it doesn't make sense to limit this cache with hive support, we should move it to SQL core module so that users can use this cache without hive support. It can also reduce the size of `HiveMetastoreCatalog`, so that it's easier to remove it eventually. main changes: 1. move the table relation cache to `SessionCatalog` 2. `SessionCatalog.lookupRelation` will return `SimpleCatalogRelation` and the analyzer will convert it to `LogicalRelation` or `MetastoreRelation` later, then `HiveSessionCatalog` doesn't need to override `lookupRelation` anymore 3. `FindDataSourceTable` will read/write the table relation cache. ## How was this patch tested? existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #16621 from cloud-fan/plan-cache.
* Update known_translations for contributor names (Yin Huai, 2017-01-18, 1 file, -2/+3)

  ## What changes were proposed in this pull request?
  Update known_translations per https://github.com/apache/spark/pull/16423#issuecomment-269739634.

  Author: Yin Huai <yhuai@databricks.com>
  Closes #16628 from yhuai/known_translations.

* [SPARK-14975][ML] Fixed GBTClassifier to predict probability per training instance and fixed interfaces (Ilya Matiach, 2017-01-18, 5 files, -29/+248)

  ## What changes were proposed in this pull request?
  All of the classifiers in MLlib can predict probabilities except GBTClassifier. Also, all classifiers inherit from ProbabilisticClassifier, but GBTClassifier strangely inherits from Predictor, which is a bug. This change corrects the interface and adds the ability for the classifier to give a probabilities vector.

  ## How was this patch tested?
  The basic ML tests were run after making the changes. I've marked this as WIP as I need to add more tests.

  Author: Ilya Matiach <ilmat@microsoft.com>
  Closes #16441 from imatiach-msft/ilmat/fix-GBT.

* [SPARK-19182][DSTREAM] Optimize the lock in StreamingJobProgressListener to not block UI when generating Streaming jobs (uncleGen, 2017-01-18, 2 files, -8/+13)

  ## What changes were proposed in this pull request?
  When DStreamGraph is generating a job, it holds a lock and blocks other APIs. Because StreamingJobProgressListener (numInactiveReceivers, streamName(streamId: Int), streamIds) needs to call DStreamGraph's methods to access some information, the UI may hang if generating a job is very slow (e.g., talking to a slow Kafka cluster to fetch metadata). It's better to optimize the locks in DStreamGraph and StreamingJobProgressListener so that the UI is not blocked by job generation.

  ## How was this patch tested?
  Existing unit tests.

  cc zsxwing

  Author: uncleGen <hustyugm@gmail.com>
  Closes #16601 from uncleGen/SPARK-19182.

* [SPARK-19168][STRUCTURED STREAMING] StateStore should be aborted upon error (Liwei Lin, 2017-01-18, 3 files, -2/+10)

  ## What changes were proposed in this pull request?
  We should call `StateStore.abort()` when any error occurs before the store is committed.

  ## How was this patch tested?
  Manually.

  Author: Liwei Lin <lwlin7@gmail.com>
  Closes #16547 from lw-lin/append-filter.

* [SPARK-19113][SS][TESTS] Ignore StreamingQueryException thrown from awaitInitialization to avoid breaking tests (Shixiong Zhu, 2017-01-18, 1 file, -1/+6)

  ## What changes were proposed in this pull request?
  #16492 missed one race condition: `StreamExecution.awaitInitialization` may throw fatal errors and fail the test. This PR just ignores `StreamingQueryException` thrown from `awaitInitialization` so that we can verify the exception in the `ExpectFailure` action later. It's fine since `StopStream` or `ExpectFailure` will catch `StreamingQueryException` as well.

  ## How was this patch tested?
  Jenkins.

  Author: Shixiong Zhu <shixiong@databricks.com>
  Closes #16567 from zsxwing/SPARK-19113-2.

* [SPARK-18113] Use ask to replace askWithRetry in canCommit and make receiver idempotent. (jinxing, 2017-01-18, 2 files, -4/+31)

  ## What changes were proposed in this pull request?
  The canCommit method sends AskPermissionToCommitOutput using askWithRetry. On timeout it sends again, so AskPermissionToCommitOutput can be received multiple times. canCommit should return the same value when called by the same attempt multiple times. In the implementation before this fix, handleAskPermissionToCommit only checks whether a committer is already registered, which is not enough: when a worker retries AskPermissionToCommitOutput it gets CommitDeniedException, and the task then fails with reason TaskCommitDenied, which is not regarded as a task failure (SPARK-11178), so TaskScheduler will schedule this task infinitely. This fix uses `ask` to replace `askWithRetry` in `canCommit` and makes the receiver idempotent.

  ## How was this patch tested?
  Added a new unit test to OutputCommitCoordinatorSuite.

  Author: jinxing <jinxing@meituan.com>
  Closes #16503 from jinxing64/SPARK-18113.

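  A self-contained sketch (not the actual OutputCommitCoordinator code) of the idempotency property described above: the same attempt asking again gets the same answer instead of a denial.

  ```scala
  import scala.collection.mutable

  object CommitPermissionSketch {
    // Keyed by (stage, partition); the value is the attempt that was authorized first.
    private val authorized = mutable.Map[(Int, Int), Int]()

    def handleAskPermissionToCommit(stage: Int, partition: Int, attempt: Int): Boolean =
      authorized.get((stage, partition)) match {
        case Some(winner) => winner == attempt // a repeated ask by the winner stays true
        case None =>
          authorized((stage, partition)) = attempt
          true
      }
  }
  ```
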
* [SPARK-19231][SPARKR] add error handling for download and untar for Spark release (Felix Cheung, 2017-01-18, 1 file, -15/+40)

  ## What changes were proposed in this pull request?
  When R starts as a package and needs to download the Spark release distribution, we need to handle errors from download and untar and clean up, otherwise it will get stuck.

  ## How was this patch tested?
  Manually.

  Author: Felix Cheung <felixcheung_m@hotmail.com>
  Closes #16589 from felixcheung/rtarreturncode.

* [SPARK-19223][SQL][PYSPARK] Fix InputFileBlockHolder for datasources which are based on HadoopRDD or NewHadoopRDD (Liang-Chi Hsieh, 2017-01-18, 2 files, -3/+28)

  ## What changes were proposed in this pull request?
  For datasources that are based on HadoopRDD or NewHadoopRDD, such as spark-xml, InputFileBlockHolder doesn't work with Python UDFs. The issue can be reproduced by running the following code with `bin/pyspark --packages com.databricks:spark-xml_2.11:0.4.1`:

  ```
  from pyspark.sql.functions import udf, input_file_name
  from pyspark.sql.types import StringType
  from pyspark.sql import SparkSession

  def filename(path):
      return path

  session = SparkSession.builder.appName('APP').getOrCreate()
  session.udf.register('sameText', filename)
  sameText = udf(filename, StringType())

  df = session.read.format('xml').load('a.xml', rowTag='root').select('*', input_file_name().alias('file'))
  df.select('file').show()                 # works
  df.select(sameText(df['file'])).show()   # returns empty content
  ```

  The issue is that `HadoopRDD` and `NewHadoopRDD` set the file block's info in `InputFileBlockHolder` before the returned iterator begins consuming. `InputFileBlockHolder` records this info in a thread-local variable. When running a Python UDF in batch, we set up another thread to consume the iterator from the child plan's output RDD, so we can't read the info back in that other thread. To fix this, we have to set the info in `InputFileBlockHolder` after the iterator begins consuming, so the info can be read in the correct thread.

  ## How was this patch tested?
  Manual test with the example code above for the spark-xml package on pyspark (`bin/pyspark --packages com.databricks:spark-xml_2.11:0.4.1`). Added a pyspark test.

  Author: Liang-Chi Hsieh <viirya@gmail.com>
  Closes #16585 from viirya/fix-inputfileblock-hadooprdd.

* [SPARK-19024][SQL] Implement new approach to write a permanent view (jiangxingbo, 2017-01-18, 5 files, -92/+146)

  ## What changes were proposed in this pull request?
  On CREATE/ALTER of a view, it is no longer needed to generate a SQL text string from the LogicalPlan; instead we store the SQL query text, the output column names of the query plan, and the current database in CatalogTable. Permanent views created by this approach can be resolved by the current view resolution approach. The main advantages include:
  1. If you update an underlying view, the current view also gets updated;
  2. It gives us a chance to get rid of SQL generation for operators.

  Major changes of this PR:
  1. Generate the view-specific properties (e.g. view default database, view query output column names) during permanent view creation and store them as properties in the CatalogTable;
  2. Update the commands `CreateViewCommand` and `AlterViewAsCommand` to get rid of SQL generation.

  ## How was this patch tested?
  Existing tests.

  Author: jiangxingbo <jiangxb1987@gmail.com>
  Closes #16613 from jiangxb1987/view-write-path.

* [SPARK-18782][BUILD] Bump Hadoop 2.6 version to use Hadoop 2.6.5 (Adam Roberts, 2017-01-18, 2 files, -16/+16)

  **What changes were proposed in this pull request?**
  Use Hadoop 2.6.5 for the Hadoop 2.6 profile. The release notes contain a bunch of fixes, including security ones, that we should pick up.

  **How was this patch tested?**
  Running the unit tests now with IBM's SDK for Java; let's see what happens with OpenJDK in the community builder. No trouble expected as it is only a minor release.

  Author: Adam Roberts <aroberts@uk.ibm.com>
  Closes #16616 from a-roberts/Hadoop265Bumper.

* [SPARK-19227][SPARK-19251] remove unused imports and outdated comments (uncleGen, 2017-01-18, 47 files, -79/+25)

  ## What changes were proposed in this pull request?
  Remove unused imports and outdated comments, and fix some minor code style issues.

  ## How was this patch tested?
  Existing unit tests.

  Author: uncleGen <hustyugm@gmail.com>
  Closes #16591 from uncleGen/SPARK-19227.

* [SPARK-18243][SQL] Port Hive writing to use FileFormat interfaceWenchen Fan2017-01-1715-533/+318
| | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Inserting data into Hive tables has its own implementation that is distinct from data sources: `InsertIntoHiveTable`, `SparkHiveWriterContainer` and `SparkHiveDynamicPartitionWriterContainer`. Note that one other major difference is that data source tables write directly to the final destination without using some staging directory, and then Spark itself adds the partitions/tables to the catalog. Hive tables actually write to some staging directory, and then call Hive metastore's loadPartition/loadTable function to load those data in. So we still need to keep `InsertIntoHiveTable` to put this special logic. In the future, we should think of writing to the hive table location directly, so that we don't need to call `loadTable`/`loadPartition` at the end and remove `InsertIntoHiveTable`. This PR removes `SparkHiveWriterContainer` and `SparkHiveDynamicPartitionWriterContainer`, and create a `HiveFileFormat` to implement the write logic. In the future, we should also implement the read logic in `HiveFileFormat`. ## How was this patch tested? existing tests Author: Wenchen Fan <wenchen@databricks.com> Closes #16517 from cloud-fan/insert-hive.
* [SPARK-18206][ML] Add instrumentation for MLP,NB,LDA,AFT,GLM,Isotonic,LiR (Zheng RuiFeng, 2017-01-17, 13 files, -33/+105)

  ## What changes were proposed in this pull request?
  Add instrumentation for MLP, NB, LDA, AFT, GLM, Isotonic and LiR.

  ## How was this patch tested?
  Local test in spark-shell.

  Author: Zheng RuiFeng <ruifengz@foxmail.com>
  Author: Ruifeng Zheng <ruifengz@foxmail.com>
  Closes #15671 from zhengruifeng/lir_instr.

* [SPARK-13721][SQL] Support outer generators in DataFrame API (Bogdan Raducanu, 2017-01-17, 9 files, -16/+150)

  ## What changes were proposed in this pull request?
  Added outer_explode, outer_posexplode and outer_inline functions and expressions. Also some bug fixing in GenerateExec.scala for CollectionGenerator: previously it was not correctly handling the case of outer with empty collections, only with nulls.

  ## How was this patch tested?
  New tests added to GeneratorFunctionSuite.

  Author: Bogdan Raducanu <bogdan.rdc@gmail.com>
  Closes #16608 from bogdanrdc/SPARK-13721.

* [SPARK-18917][SQL] Remove schema check in appending data (Reynold Xin, 2017-01-17, 1 file, -33/+3)

  ## What changes were proposed in this pull request?
  In append mode, we check whether the schema of the write is compatible with the schema of the existing data. Finding the existing schema for files can be a significant performance issue in cloud environments, so this patch removes the check. Note that for catalog tables, we always do the check, as discussed in https://github.com/apache/spark/pull/16339#discussion_r96208357.

  ## How was this patch tested?
  N/A

  Closes #16339.

  Author: Reynold Xin <rxin@databricks.com>
  Closes #16622 from rxin/SPARK-18917.

* [MINOR][SQL] Remove duplicate call of reset() function in CurrentOrigin.withOrigin() (jiangxingbo, 2017-01-17, 1 file, -1/+0)

  ## What changes were proposed in this pull request?
  Remove the duplicate call of reset() in CurrentOrigin.withOrigin().

  ## How was this patch tested?
  Existing test cases.

  Author: jiangxingbo <jiangxb1987@gmail.com>
  Closes #16615 from jiangxb1987/dummy-code.

* [SPARK-19239][PYSPARK] Check whether parameters equal None when the column is specified in the jdbc API (DjvuLee, 2017-01-17, 1 file, -3/+6)

  ## What changes were proposed in this pull request?
  The `jdbc` API does not check `lowerBound` and `upperBound` when `column` is specified, and just throws the following exception:

  ```
  int() argument must be a string or a number, not 'NoneType'
  ```

  If we check the parameters, we can give a friendlier suggestion.

  ## How was this patch tested?
  Tested in the pyspark shell, without the lowerBound and upperBound parameters.

  Author: DjvuLee <lihu@bytedance.com>
  Closes #16599 from djvulee/pysparkFix.

* [SPARK-19129][SQL] SessionCatalog: Disallow empty part col values in partition spec (gatorsmile, 2017-01-18, 5 files, -8/+106)

  ### What changes were proposed in this pull request?
  Empty partition column values are not valid in a partition specification. Before this PR, we accepted them, and the Hive metastore does not detect and disallow them either, so users hit the following strange behavior:

  ```Scala
  val df = spark.createDataFrame(Seq((0, "a"), (1, "b"))).toDF("partCol1", "name")
  df.write.mode("overwrite").partitionBy("partCol1").saveAsTable("partitionedTable")
  spark.sql("alter table partitionedTable drop partition(partCol1='')")
  spark.table("partitionedTable").show()
  ```

  In the above example, the WHOLE table is DROPPED when the user specifies a partition spec containing only one partition column with empty values. When the partition spec contains more than one column, Hive metastore APIs simply ignore the columns with empty values and treat it as a partial spec, which is also not expected and does not follow the actual Hive behavior. This PR disallows users from specifying such an invalid partition spec in the `SessionCatalog` APIs.

  ### How was this patch tested?
  Added test cases.

  Author: gatorsmile <gatorsmile@gmail.com>
  Closes #16583 from gatorsmile/disallowEmptyPartColValue.

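  A standalone sketch of the kind of validation described above (illustrative name, not the actual SessionCatalog method):

  ```scala
  // Reject a partition spec that contains an empty value for any partition column.
  def requireNonEmptyPartitionValues(spec: Map[String, String]): Unit = {
    val empties = spec.filter(_._2.isEmpty).keys
    require(empties.isEmpty,
      s"Partition spec is invalid: empty value for partition column(s) ${empties.mkString(", ")}")
  }

  // requireNonEmptyPartitionValues(Map("partCol1" -> ""))  // throws IllegalArgumentException
  ```
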
* [SPARK-19065][SQL] Don't inherit expression id in dropDuplicates (Shixiong Zhu, 2017-01-18, 3 files, -11/+27)

  ## What changes were proposed in this pull request?
  `dropDuplicates` will create an Alias using the same exprId, so `StreamExecution` should also replace the Alias if necessary.

  ## How was this patch tested?
  test("SPARK-19065: dropDuplicates should not create expressions using the same id")

  Author: Shixiong Zhu <shixiong@databricks.com>
  Closes #16564 from zsxwing/SPARK-19065.

* [SPARK-19019] [PYTHON] Fix hijacked `collections.namedtuple` and port ↵hyukjinkwon2017-01-172-31/+87
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | cloudpickle changes for PySpark to work with Python 3.6.0 ## What changes were proposed in this pull request? Currently, PySpark does not work with Python 3.6.0. Running `./bin/pyspark` simply throws the error as below and PySpark does not work at all: ``` Traceback (most recent call last): File ".../spark/python/pyspark/shell.py", line 30, in <module> import pyspark File ".../spark/python/pyspark/__init__.py", line 46, in <module> from pyspark.context import SparkContext File ".../spark/python/pyspark/context.py", line 36, in <module> from pyspark.java_gateway import launch_gateway File ".../spark/python/pyspark/java_gateway.py", line 31, in <module> from py4j.java_gateway import java_import, JavaGateway, GatewayClient File "<frozen importlib._bootstrap>", line 961, in _find_and_load File "<frozen importlib._bootstrap>", line 950, in _find_and_load_unlocked File "<frozen importlib._bootstrap>", line 646, in _load_unlocked File "<frozen importlib._bootstrap>", line 616, in _load_backward_compatible File ".../spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 18, in <module> File "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pydoc.py", line 62, in <module> import pkgutil File "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pkgutil.py", line 22, in <module> ModuleInfo = namedtuple('ModuleInfo', 'module_finder name ispkg') File ".../spark/python/pyspark/serializers.py", line 394, in namedtuple cls = _old_namedtuple(*args, **kwargs) TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 'rename', and 'module' ``` The root cause seems because some arguments of `namedtuple` are now completely keyword-only arguments from Python 3.6.0 (See https://bugs.python.org/issue25628). We currently copy this function via `types.FunctionType` which does not set the default values of keyword-only arguments (meaning `namedtuple.__kwdefaults__`) and this seems causing internally missing values in the function (non-bound arguments). This PR proposes to work around this by manually setting it via `kwargs` as `types.FunctionType` seems not supporting to set this. Also, this PR ports the changes in cloudpickle for compatibility for Python 3.6.0. ## How was this patch tested? Manually tested with Python 2.7.6 and Python 3.6.0. ``` ./bin/pyspsark ``` , manual creation of `namedtuple` both in local and rdd with Python 3.6.0, and Jenkins tests for other Python versions. 
Also, ``` ./run-tests --python-executables=python3.6 ``` ``` Will test against the following Python executables: ['python3.6'] Will test the following Python modules: ['pyspark-core', 'pyspark-ml', 'pyspark-mllib', 'pyspark-sql', 'pyspark-streaming'] Finished test(python3.6): pyspark.sql.tests (192s) Finished test(python3.6): pyspark.accumulators (3s) Finished test(python3.6): pyspark.mllib.tests (198s) Finished test(python3.6): pyspark.broadcast (3s) Finished test(python3.6): pyspark.conf (2s) Finished test(python3.6): pyspark.context (14s) Finished test(python3.6): pyspark.ml.classification (21s) Finished test(python3.6): pyspark.ml.evaluation (11s) Finished test(python3.6): pyspark.ml.clustering (20s) Finished test(python3.6): pyspark.ml.linalg.__init__ (0s) Finished test(python3.6): pyspark.streaming.tests (240s) Finished test(python3.6): pyspark.tests (240s) Finished test(python3.6): pyspark.ml.recommendation (19s) Finished test(python3.6): pyspark.ml.feature (36s) Finished test(python3.6): pyspark.ml.regression (37s) Finished test(python3.6): pyspark.ml.tuning (28s) Finished test(python3.6): pyspark.mllib.classification (26s) Finished test(python3.6): pyspark.mllib.evaluation (18s) Finished test(python3.6): pyspark.mllib.clustering (44s) Finished test(python3.6): pyspark.mllib.linalg.__init__ (0s) Finished test(python3.6): pyspark.mllib.feature (26s) Finished test(python3.6): pyspark.mllib.fpm (23s) Finished test(python3.6): pyspark.mllib.random (8s) Finished test(python3.6): pyspark.ml.tests (92s) Finished test(python3.6): pyspark.mllib.stat.KernelDensity (0s) Finished test(python3.6): pyspark.mllib.linalg.distributed (25s) Finished test(python3.6): pyspark.mllib.stat._statistics (15s) Finished test(python3.6): pyspark.mllib.recommendation (24s) Finished test(python3.6): pyspark.mllib.regression (26s) Finished test(python3.6): pyspark.profiler (9s) Finished test(python3.6): pyspark.mllib.tree (16s) Finished test(python3.6): pyspark.shuffle (1s) Finished test(python3.6): pyspark.mllib.util (18s) Finished test(python3.6): pyspark.serializers (11s) Finished test(python3.6): pyspark.rdd (20s) Finished test(python3.6): pyspark.sql.conf (8s) Finished test(python3.6): pyspark.sql.catalog (17s) Finished test(python3.6): pyspark.sql.column (18s) Finished test(python3.6): pyspark.sql.context (18s) Finished test(python3.6): pyspark.sql.group (27s) Finished test(python3.6): pyspark.sql.dataframe (33s) Finished test(python3.6): pyspark.sql.functions (35s) Finished test(python3.6): pyspark.sql.types (6s) Finished test(python3.6): pyspark.sql.streaming (13s) Finished test(python3.6): pyspark.streaming.util (0s) Finished test(python3.6): pyspark.sql.session (16s) Finished test(python3.6): pyspark.sql.window (4s) Finished test(python3.6): pyspark.sql.readwriter (35s) Tests passed in 433 seconds ``` Author: hyukjinkwon <gurwls223@gmail.com> Closes #16429 from HyukjinKwon/SPARK-19019.
* [SPARK-19179][YARN] Change spark.yarn.access.namenodes config and update docs (jerryshao, 2017-01-17, 4 files, -15/+23)

  ## What changes were proposed in this pull request?
  The name of the `spark.yarn.access.namenodes` configuration does not reflect its actual usage: in the code it refers to the Hadoop filesystems we get tokens for, not NameNodes. This proposes to update the name of this configuration and change the related code and doc accordingly.

  ## How was this patch tested?
  Local verification.

  Author: jerryshao <sshao@hortonworks.com>
  Closes #16560 from jerryshao/SPARK-19179.

* [SPARK-3249][DOC] Fix links in ScalaDoc that cause warning messages in ↵hyukjinkwon2017-01-1720-86/+93
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | `sbt/sbt unidoc` ## What changes were proposed in this pull request? This PR proposes to fix ambiguous link warnings by simply making them as code blocks for both javadoc and scaladoc. ``` [warn] .../spark/core/src/main/scala/org/apache/spark/Accumulator.scala:20: The link target "SparkContext#accumulator" is ambiguous. Several members fit the target: [warn] .../spark/mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala:281: The link target "runMiniBatchSGD" is ambiguous. Several members fit the target: [warn] .../spark/mllib/src/main/scala/org/apache/spark/mllib/fpm/AssociationRules.scala:83: The link target "run" is ambiguous. Several members fit the target: ... ``` This PR also fixes javadoc8 break as below: ``` [error] .../spark/sql/core/target/java/org/apache/spark/sql/LowPrioritySQLImplicits.java:7: error: reference not found [error] * newProductEncoder - to disambiguate for {link List}s which are both {link Seq} and {link Product} [error] ^ [error] .../spark/sql/core/target/java/org/apache/spark/sql/LowPrioritySQLImplicits.java:7: error: reference not found [error] * newProductEncoder - to disambiguate for {link List}s which are both {link Seq} and {link Product} [error] ^ [error] .../spark/sql/core/target/java/org/apache/spark/sql/LowPrioritySQLImplicits.java:7: error: reference not found [error] * newProductEncoder - to disambiguate for {link List}s which are both {link Seq} and {link Product} [error] ^ [info] 3 errors ``` ## How was this patch tested? Manually via `sbt unidoc > output.txt` and the checked it via `cat output.txt | grep ambiguous` and `sbt unidoc | grep error`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #16604 from HyukjinKwon/SPARK-3249.
* [SPARK-19219][SQL] Fix Parquet log output defaults (Nick Lavers, 2017-01-17, 3 files, -4/+8)

  ## What changes were proposed in this pull request?
  Change the default parquet logging levels to reflect the changes made in PR [#15538](https://github.com/apache/spark/pull/15538), in order to prevent the flood of log messages emitted by default.

  ## How was this patch tested?
  Default log output when reading from parquet 1.6 files was compared with and without this change. The change eliminates the extraneous logging and makes the output readable.

  Author: Nick Lavers <nick.lavers@videoamp.com>
  Closes #16580 from nicklavers/spark-19219-set_default_parquet_log_level.

* [SPARK-19240][SQL][TEST] add test for setting location for managed table (Wenchen Fan, 2017-01-17, 1 file, -0/+28)

  ## What changes were proposed in this pull request?
  SET LOCATION also works on a managed table (or a table created without a custom path). The behavior is a little weird, but as we already support it, we should add a test to explicitly show the behavior.

  ## How was this patch tested?
  N/A

  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #16597 from cloud-fan/set-location.

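  A usage sketch of the behavior the new test pins down (table name and path are placeholders):

  ```scala
  spark.sql("CREATE TABLE managed_tbl (i INT) USING parquet")
  // Allowed even though the table was created without a custom path.
  spark.sql("ALTER TABLE managed_tbl SET LOCATION '/tmp/some/custom/location'")
  ```
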
* [MINOR][YARN] Move YarnSchedulerBackendSuite to resource-managers/yarn directory. (Yanbo Liang, 2017-01-17, 1 file, -0/+0)

  ## What changes were proposed in this pull request?
  #16092 moved YARN resource manager related code to the resource-managers/yarn directory. The test case ```YarnSchedulerBackendSuite``` was added after that, but in the wrong place. This PR moves it to the correct directory.

  ## How was this patch tested?
  Existing test.

  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #16595 from yanboliang/yarn.

* [SPARK-19148][SQL] do not expose the external table concept in Catalog (Wenchen Fan, 2017-01-17, 6 files, -103/+211)

  ## What changes were proposed in this pull request?
  In https://github.com/apache/spark/pull/16296 we reached a consensus that we should hide the external/managed table concept from users and only expose a custom table path. This PR renames `Catalog.createExternalTable` to `createTable` (the old versions are kept for backward compatibility), and only sets the table type to EXTERNAL if `path` is specified in options.

  ## How was this patch tested?
  New tests in `CatalogSuite`.

  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #16528 from cloud-fan/create-table.

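  An illustrative sketch of the renamed API (overload shapes assumed; table names and path are placeholders): with a `path` option the table becomes EXTERNAL, without one it stays managed.

  ```scala
  import org.apache.spark.sql.types.{LongType, StructType}

  val schema = new StructType().add("id", LongType)
  // Managed table: no `path` in options.
  spark.catalog.createTable("t_managed", "parquet", schema, Map.empty[String, String])
  // External table: `path` supplied in options.
  spark.catalog.createTable("t_external", "parquet", schema, Map("path" -> "/tmp/t_external"))
  ```
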
* [SPARK-18905][STREAMING] Fix the issue of removing a failed jobset from ↵CodingCat2017-01-161-6/+8
| | | | | | | | | | | | | | | | | | | | | | | | | JobScheduler.jobSets ## What changes were proposed in this pull request? the current implementation of Spark streaming considers a batch is completed no matter the results of the jobs (https://github.com/apache/spark/blob/1169db44bc1d51e68feb6ba2552520b2d660c2c0/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobScheduler.scala#L203) Let's consider the following case: A micro batch contains 2 jobs and they read from two different kafka topics respectively. One of these jobs is failed due to some problem in the user defined logic, after the other one is finished successfully. 1. The main thread in the Spark streaming application will execute the line mentioned above, 2. and another thread (checkpoint writer) will make a checkpoint file immediately after this line is executed. 3. Then due to the current error handling mechanism in Spark Streaming, StreamingContext will be closed (https://github.com/apache/spark/blob/1169db44bc1d51e68feb6ba2552520b2d660c2c0/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobScheduler.scala#L214) the user recovers from the checkpoint file, and because the JobSet containing the failed job has been removed (taken as completed) before the checkpoint is constructed, the data being processed by the failed job would never be reprocessed This PR fix it by removing jobset from JobScheduler.jobSets only when all jobs in a jobset are successfully finished ## How was this patch tested? existing tests Author: CodingCat <zhunansjtu@gmail.com> Author: Nan Zhu <zhunansjtu@gmail.com> Closes #16542 from CodingCat/SPARK-18905.
* [SPARK-18828][SPARKR] Refactor scripts for R (Felix Cheung, 2017-01-16, 7 files, -52/+146)

  ## What changes were proposed in this pull request?
  Refactored the scripts to remove duplication and give each script a clearer purpose.

  ## How was this patch tested?
  Manually.

  Author: Felix Cheung <felixcheung_m@hotmail.com>
  Closes #16249 from felixcheung/rscripts.

* [SPARK-19232][SPARKR] Update Spark distribution download cache location on Windows (Felix Cheung, 2017-01-16, 1 file, -2/+2)

  ## What changes were proposed in this pull request?
  Windows seems to be the only platform with an appauthor component in the path, for which we should say "Apache" (case sensitive). The current path of `AppData\Local\spark\spark\Cache` is a bit odd.

  ## How was this patch tested?
  Manual.

  Author: Felix Cheung <felixcheung_m@hotmail.com>
  Closes #16590 from felixcheung/rcachedir.

* [SPARK-19066][SPARKR] SparkR LDA doesn't set optimizer correctly (wm624@hotmail.com, 2017-01-16, 4 files, -5/+42)

  ## What changes were proposed in this pull request?
  spark.lda passes the optimizer "em" or "online" as a string to the backend, but LDAWrapper doesn't set the optimizer based on the value from R. As a result, for optimizer "em" the `isDistributed` field is FALSE when it should be TRUE based on the Scala code. In addition, the `summary` method should bring back the results related to `DistributedLDAModel`.

  ## How was this patch tested?
  Manual tests comparing with the Scala example. Modified the current unit tests: fixed the incorrect unit test and added necessary tests for the `summary` method.

  Author: wm624@hotmail.com <wm624@hotmail.com>
  Closes #16464 from wangmiao1981/new.

* [SPARK-18801][SQL][FOLLOWUP] Alias the view with its childjiangxingbo2017-01-165-57/+214
| | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR is a follow-up to address the comments https://github.com/apache/spark/pull/16233/files#r95669988 and https://github.com/apache/spark/pull/16233/files#r95662299. We try to wrap the child by: 1. Generate the `queryOutput` by: 1.1. If the query column names are defined, map the column names to attributes in the child output by name; 1.2. Else set the child output attributes to `queryOutput`. 2. Map the `queryQutput` to view output by index, if the corresponding attributes don't match, try to up cast and alias the attribute in `queryOutput` to the attribute in the view output. 3. Add a Project over the child, with the new output generated by the previous steps. If the view output doesn't have the same number of columns neither with the child output, nor with the query column names, throw an AnalysisException. ## How was this patch tested? Add new test cases in `SQLViewSuite`. Author: jiangxingbo <jiangxb1987@gmail.com> Closes #16561 from jiangxb1987/alias-view.
* [SPARK-19082][SQL] Make ignoreCorruptFiles work for Parquet (Liang-Chi Hsieh, 2017-01-16, 5 files, -8/+140)

  ## What changes were proposed in this pull request?
  We have a config `spark.sql.files.ignoreCorruptFiles` which can be used to ignore corrupt files when reading files in SQL. Currently the `ignoreCorruptFiles` config has two issues and doesn't work for Parquet:

  1. We only ignore corrupt files in `FileScanRDD`. Actually, we begin to read those files as early as inferring the data schema from the files: for corrupt files, we can't read the schema and the program fails. A related issue is reported at http://apache-spark-developers-list.1001551.n3.nabble.com/Skip-Corrupted-Parquet-blocks-footer-tc20418.html
  2. In `FileScanRDD`, we assume that we only begin to read the files when starting to consume the iterator. However, the files are possibly read before that, in which case the `ignoreCorruptFiles` config doesn't work either.

  This patch targets the Parquet datasource. If this direction is ok, we can address the same issue for other datasources like Orc. Two main changes in this patch:

  1. Replace `ParquetFileReader.readAllFootersInParallel` by implementing the logic to read footers in a multi-threaded manner. We can't ignore corrupt files if we use `ParquetFileReader.readAllFootersInParallel`, so this patch implements similar logic in `readParquetFootersInParallel`.
  2. In `FileScanRDD`, also ignore corrupt files when we call `readFunction` to return the iterator.

  One thing to notice: we read the schema from the Parquet file's footer, and the method `ParquetFileReader.readFooter` throws `RuntimeException`, instead of `IOException`, if it can't successfully read the footer (see https://github.com/apache/parquet-mr/blob/df9d8e415436292ae33e1ca0b8da256640de9710/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L470), so this patch catches `RuntimeException`. One concern is that it might also shadow other runtime exceptions besides reading corrupt files.

  ## How was this patch tested?
  Jenkins tests.

  Author: Liang-Chi Hsieh <viirya@gmail.com>
  Closes #16474 from viirya/fix-ignorecorrupted-parquet-files.

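  A usage sketch of the config named above (the path is a placeholder):

  ```scala
  // Opt in to skipping corrupt files; with this patch the setting is honored for Parquet
  // both while inferring the schema and while reading the data.
  spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
  val df = spark.read.parquet("/data/events")
  df.count()
  ```
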
* [SPARK-19120] Refresh Metadata Cache After Loading Hive Tablesgatorsmile2017-01-153-14/+75
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ### What changes were proposed in this pull request? ```Scala sql("CREATE TABLE tab (a STRING) STORED AS PARQUET") // This table fetch is to fill the cache with zero leaf files spark.table("tab").show() sql( s""" |LOAD DATA LOCAL INPATH '$newPartitionDir' OVERWRITE |INTO TABLE tab """.stripMargin) spark.table("tab").show() ``` In the above example, the returned result is empty after table loading. The metadata cache could be out of dated after loading new data into the table, because loading/inserting does not update the cache. So far, the metadata cache is only used for data source tables. Thus, for Hive serde tables, only `parquet` and `orc` formats are facing such issues, because the Hive serde tables in the format of parquet/orc could be converted to data source tables when `spark.sql.hive.convertMetastoreParquet`/`spark.sql.hive.convertMetastoreOrc` is on. This PR is to refresh the metadata cache after processing the `LOAD DATA` command. In addition, Spark SQL does not convert **partitioned** Hive tables (orc/parquet) to data source tables in the write path, but the read path is using the metadata cache for both **partitioned** and non-partitioned Hive tables (orc/parquet). That means, writing the partitioned parquet/orc tables still use `InsertIntoHiveTable`, instead of `InsertIntoHadoopFsRelationCommand`. To avoid reading the out-of-dated cache, `InsertIntoHiveTable` needs to refresh the metadata cache for partitioned tables. Note, it does not need to refresh the cache for non-partitioned parquet/orc tables, because it does not call `InsertIntoHiveTable` at all. Based on the comments, this PR will keep the existing logics unchanged. That means, we always refresh the table no matter whether the table is partitioned or not. ### How was this patch tested? Added test cases in parquetSuites.scala Author: gatorsmile <gatorsmile@gmail.com> Closes #16500 from gatorsmile/refreshInsertIntoHiveTable.
* [SPARK-19206][DOC][DSTREAM] Fix outdated parameter descriptions in kafka010 (uncleGen, 2017-01-15, 3 files, -25/+16)

  ## What changes were proposed in this pull request?
  Fix outdated parameter descriptions in kafka010.

  ## How was this patch tested?
  cc koeninger zsxwing

  Author: uncleGen <hustyugm@gmail.com>
  Closes #16569 from uncleGen/SPARK-19206.

* [SPARK-18971][CORE] Upgrade Netty to 4.0.43.Final (Shixiong Zhu, 2017-01-15, 6 files, -6/+6)

  ## What changes were proposed in this pull request?
  Upgrade Netty to `4.0.43.Final` to add the fix for https://github.com/netty/netty/issues/6153.

  ## How was this patch tested?
  Jenkins.

  Author: Shixiong Zhu <shixiong@databricks.com>
  Closes #16568 from zsxwing/SPARK-18971.

* [MINOR][DOC] Document local[*,F] master modes (Maurus Cuelenaere, 2017-01-15, 1 file, -0/+2)

  ## What changes were proposed in this pull request?
  core/src/main/scala/org/apache/spark/SparkContext.scala contains the LOCAL_N_FAILURES_REGEX master mode, but it was never documented, so do so.

  ## How was this patch tested?
  By using the Github Markdown preview feature.

  Author: Maurus Cuelenaere <mcuelenaere@gmail.com>
  Closes #16562 from mcuelenaere/patch-1.

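  An example of the master string being documented, as a sketch: `local[*,3]` runs locally on all cores and tolerates up to 3 task failures.

  ```scala
  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .master("local[*,3]")   // all cores, maxFailures = 3
    .appName("local-n-failures-demo")
    .getOrCreate()
  ```
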
* [SPARK-19042] spark executor can't download the jars when uber jar's http url contains any query strings (xiaojian.fxj, 2017-01-15, 1 file, -2/+2)

  If the uber jar's http URL contains any query strings, the Executor.updateDependencies method can't download the jars correctly. This is because `localName = name.split("/").last` won't get the expected local jar name. The fix is the same as [SPARK-17855].

  Author: xiaojian.fxj <xiaojian.fxj@alibaba-inc.com>
  Closes #16509 from hustfxj/bug.

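  The issue described above, in miniature (the URL is hypothetical): taking the last "/" segment of the raw URL keeps the query string, so the local file name is wrong; parsing the path first is one way to strip it.

  ```scala
  val url = "http://repo.example.com/jars/app.jar?token=abc"
  url.split("/").last                           // "app.jar?token=abc" -- not a usable local name
  new java.net.URI(url).getPath.split("/").last // "app.jar"
  ```
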
* [SPARK-19207][SQL] LocalSparkSession should use Slf4JLoggerFactory.INSTANCE (Tsuyoshi Ozawa, 2017-01-15, 1 file, -1/+1)

  ## What changes were proposed in this pull request?
  Use Slf4JLoggerFactory.INSTANCE instead of creating a Slf4JLoggerFactory object with the constructor, which is deprecated.

  ## How was this patch tested?
  By running StateStoreRDDSuite.

  Author: Tsuyoshi Ozawa <ozawa@apache.org>
  Closes #16570 from oza/SPARK-19207.

* [SPARK-19151][SQL] DataFrameWriter.saveAsTable support hive overwrite (windpiger, 2017-01-14, 3 files, -14/+34)

  ## What changes were proposed in this pull request?
  After [SPARK-19107](https://issues.apache.org/jira/browse/SPARK-19107), we can now treat hive as a data source and create hive tables with DataFrameWriter and Catalog. However, the support is not complete; there are still some cases we do not support. This PR implements: DataFrameWriter.saveAsTable working with the hive format in overwrite mode.

  ## How was this patch tested?
  Unit test added.

  Author: windpiger <songjun@outlook.com>
  Closes #16549 from windpiger/saveAsTableWithHiveOverwrite.

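  A usage sketch of what this enables (table name is a placeholder):

  ```scala
  val df = spark.range(10).toDF("id")
  // With this change, overwrite mode works when saving with the hive format.
  df.write.mode("overwrite").format("hive").saveAsTable("hive_overwrite_tbl")
  ```
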
* [SPARK-19221][PROJECT INFRA][R] Add winutils binaries to the path in ↵hyukjinkwon2017-01-141-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | AppVeyor tests for Hadoop libraries to call native codes properly ## What changes were proposed in this pull request? It seems Hadoop libraries need winutils binaries for native libraries in the path. It is not a problem in tests for now because we are only testing SparkR on Windows via AppVeyor but it can be a problem if we run Scala tests via AppVeyor as below: ``` - SPARK-18220: read Hive orc table with varchar column *** FAILED *** (3 seconds, 937 milliseconds) org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution Error, return code -101 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:625) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:609) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:283) ... ``` This PR proposes to add it to the `Path` for AppVeyor tests. ## How was this patch tested? Manually via AppVeyor. **Before** https://ci.appveyor.com/project/spark-test/spark/build/549-windows-complete/job/gc8a1pjua2bc4i8m **After** https://ci.appveyor.com/project/spark-test/spark/build/572-windows-complete/job/c4vrysr5uvj2hgu7 Author: hyukjinkwon <gurwls223@gmail.com> Closes #16584 from HyukjinKwon/set-path-appveyor.