aboutsummaryrefslogtreecommitdiff
Commit message (Collapse)AuthorAgeFilesLines
...
* [SPARK-13446][SQL] Support reading data from Hive 2.0.1 metastoreXiao Li2017-03-037-9/+107
| | | | | | | | | | | | | | | ### What changes were proposed in this pull request? This PR is to make Spark work with Hive 2.0's metastores. Compared with Hive 1.2, Hive 2.0's metastore has an API update due to removal of `HOLD_DDLTIME` in https://issues.apache.org/jira/browse/HIVE-12224. Based on the following Hive JIRA description, `HOLD_DDLTIME` should be removed from our internal API too. (https://github.com/apache/spark/pull/17063 was submitted for it): > This arcane feature was introduced long ago via HIVE-1394 It was broken as soon as it landed, HIVE-1442 and is thus useless. Fact that no one has fixed it since informs that its not really used by anyone. Better is to remove it so no one hits the bug of HIVE-1442 In the next PR, we will support 2.1.0 metastore, whose APIs were changed due to https://issues.apache.org/jira/browse/HIVE-12730. However, before that, we need a code cleanup for stats collection and setting. ### How was this patch tested? Added test cases to VersionsSuite.scala Author: Xiao Li <gatorsmile@gmail.com> Closes #17061 from gatorsmile/Hive2.
* [SPARK-19348][PYTHON] PySpark keyword_only decorator is not thread-safeBryan Cutler2017-03-0311-120/+161
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? The `keyword_only` decorator in PySpark is not thread-safe. It writes kwargs to a static class variable in the decorator, which is then retrieved later in the class method as `_input_kwargs`. If multiple threads are constructing the same class with different kwargs, it becomes a race condition to read from the static class variable before it's overwritten. See [SPARK-19348](https://issues.apache.org/jira/browse/SPARK-19348) for reproduction code. This change will write the kwargs to a member variable so that multiple threads can operate on separate instances without the race condition. It does not protect against multiple threads operating on a single instance, but that is better left to the user to synchronize. ## How was this patch tested? Added new unit tests for using the keyword_only decorator and a regression test that verifies `_input_kwargs` can be overwritten from different class instances. Author: Bryan Cutler <cutlerb@gmail.com> Closes #16782 from BryanCutler/pyspark-keyword_only-threadsafe-SPARK-19348.
* [SPARK-18939][SQL] Timezone support in partition values.Takuya UESHIN2017-03-0315-59/+175
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This is a follow-up pr of #16308 and #16750. This pr enables timezone support in partition values. We should use `timeZone` option introduced at #16750 to parse/format partition values of the `TimestampType`. For example, if you have timestamp `"2016-01-01 00:00:00"` in `GMT` which will be used for partition values, the values written by the default timezone option, which is `"GMT"` because the session local timezone is `"GMT"` here, are: ```scala scala> spark.conf.set("spark.sql.session.timeZone", "GMT") scala> val df = Seq((1, new java.sql.Timestamp(1451606400000L))).toDF("i", "ts") df: org.apache.spark.sql.DataFrame = [i: int, ts: timestamp] scala> df.show() +---+-------------------+ | i| ts| +---+-------------------+ | 1|2016-01-01 00:00:00| +---+-------------------+ scala> df.write.partitionBy("ts").save("/path/to/gmtpartition") ``` ```sh $ ls /path/to/gmtpartition/ _SUCCESS ts=2016-01-01 00%3A00%3A00 ``` whereas setting the option to `"PST"`, they are: ```scala scala> df.write.option("timeZone", "PST").partitionBy("ts").save("/path/to/pstpartition") ``` ```sh $ ls /path/to/pstpartition/ _SUCCESS ts=2015-12-31 16%3A00%3A00 ``` We can properly read the partition values if the session local timezone and the timezone of the partition values are the same: ```scala scala> spark.read.load("/path/to/gmtpartition").show() +---+-------------------+ | i| ts| +---+-------------------+ | 1|2016-01-01 00:00:00| +---+-------------------+ ``` And even if the timezones are different, we can properly read the values with setting corrent timezone option: ```scala // wrong result scala> spark.read.load("/path/to/pstpartition").show() +---+-------------------+ | i| ts| +---+-------------------+ | 1|2015-12-31 16:00:00| +---+-------------------+ // correct result scala> spark.read.option("timeZone", "PST").load("/path/to/pstpartition").show() +---+-------------------+ | i| ts| +---+-------------------+ | 1|2016-01-01 00:00:00| +---+-------------------+ ``` ## How was this patch tested? Existing tests and added some tests. Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #17053 from ueshin/issues/SPARK-18939.
* [MINOR][DOC] Fix doc for web UI https configurationjerryshao2017-03-031-1/+1
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Doc about enabling web UI https is not correct, "spark.ui.https.enabled" is not existed, actually enabling SSL is enough for https. ## How was this patch tested? N/A Author: jerryshao <sshao@hortonworks.com> Closes #17147 from jerryshao/fix-doc-ssl.
* [SPARK-19774] StreamExecution should call stop() on sources when a stream failsBurak Yavuz2017-03-033-3/+169
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? We call stop() on a Structured Streaming Source only when the stream is shutdown when a user calls streamingQuery.stop(). We should actually stop all sources when the stream fails as well, otherwise we may leak resources, e.g. connections to Kafka. ## How was this patch tested? Unit tests in `StreamingQuerySuite`. Author: Burak Yavuz <brkyvz@gmail.com> Closes #17107 from brkyvz/close-source.
* [SPARK-19710][SQL][TESTS] Fix ordering of rows in query resultsPete Robbins2017-03-036-13/+13
| | | | | | | | | | | | | ## What changes were proposed in this pull request? Changes to SQLQueryTests to make the order of the results constant. Where possible ORDER BY has been added to match the existing expected output ## How was this patch tested? Test runs on x86, zLinux (big endian), ppc (big endian) Author: Pete Robbins <robbinspg@gmail.com> Closes #17039 from robbinspg/SPARK-19710.
* [SPARK-19758][SQL] Resolving timezone aware expressions with time zone when ↵Liang-Chi Hsieh2017-03-035-23/+48
| | | | | | | | | | | | | | | | | | | | | | resolving inline table ## What changes were proposed in this pull request? When we resolve inline tables in analyzer, we will evaluate the expressions of inline tables. When it evaluates a `TimeZoneAwareExpression` expression, an error will happen because the `TimeZoneAwareExpression` is not associated with timezone yet. So we need to resolve these `TimeZoneAwareExpression`s with time zone when resolving inline tables. ## How was this patch tested? Jenkins tests. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #17114 from viirya/resolve-timeawareexpr-inline-table.
* [SPARK-19801][BUILD] Remove JDK7 from Travis CIDongjoon Hyun2017-03-031-1/+0
| | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Since Spark 2.1.0, Travis CI was supported by SPARK-15207 for automated PR verification (JDK7/JDK8 maven compilation and Java Linter) and contributors can see the additional result via their Travis CI dashboard (or PC). This PR aims to make `.travis.yml` up-to-date by removing JDK7 which was removed via SPARK-19550. ## How was this patch tested? See the result via Travis CI. - https://travis-ci.org/dongjoon-hyun/spark/builds/207111713 Author: Dongjoon Hyun <dongjoon@apache.org> Closes #17143 from dongjoon-hyun/SPARK-19801.
* [SPARK-19797][DOC] ML pipeline document correctionZhe Sun2017-03-031-1/+1
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Description about pipeline in this paragraph is incorrect https://spark.apache.org/docs/latest/ml-pipeline.html#how-it-works > If the Pipeline had more **stages**, it would call the LogisticRegressionModel’s transform() method on the DataFrame before passing the DataFrame to the next stage. Reason: Transformer could also be a stage. But only another Estimator will invoke an transform call and pass the data to next stage. The description in the document misleads ML pipeline users. ## How was this patch tested? This is a tiny modification of **docs/ml-pipelines.md**. I jekyll build the modification and check the compiled document. Author: Zhe Sun <ymwdalex@gmail.com> Closes #17137 from ymwdalex/SPARK-19797-ML-pipeline-document-correction.
* [SPARK-19739][CORE] propagate S3 session token to cluseruncleGen2017-03-031-5/+8
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? propagate S3 session token to cluser ## How was this patch tested? existing ut Author: uncleGen <hustyugm@gmail.com> Closes #17080 from uncleGen/SPARK-19739.
* [SPARK-18699][SQL][FOLLOWUP] Add explanation in CSV parser and minor cleanuphyukjinkwon2017-03-031-29/+68
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR suggests adding some comments in `UnivocityParser` logics to explain what happens. Also, it proposes, IMHO, a little bit cleaner (at least easy for me to explain). ## How was this patch tested? Unit tests in `CSVSuite`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #17142 from HyukjinKwon/SPARK-18699.
* [SPARK-18726][SQL] resolveRelation for FileFormat DataSource don't need to ↵windpiger2017-03-022-4/+20
| | | | | | | | | | | | | | | | | | | | | | listFiles twice ## What changes were proposed in this pull request? Currently when we resolveRelation for a `FileFormat DataSource` without providing user schema, it will execute `listFiles` twice in `InMemoryFileIndex` during `resolveRelation`. This PR add a `FileStatusCache` for DataSource, this can avoid listFiles twice. But there is a bug in `InMemoryFileIndex` see: [SPARK-19748](https://github.com/apache/spark/pull/17079) [SPARK-19761](https://github.com/apache/spark/pull/17093), so this pr should be after SPARK-19748/ SPARK-19761. ## How was this patch tested? unit test added Author: windpiger <songjun@outlook.com> Closes #17081 from windpiger/resolveDataSourceScanFilesTwice.
* [SPARK-19779][SS] Delete needless tmp file after restart structured ↵guifeng2017-03-022-1/+10
| | | | | | | | | | | | | | | | | streaming job ## What changes were proposed in this pull request? [SPARK-19779](https://issues.apache.org/jira/browse/SPARK-19779) The PR (https://github.com/apache/spark/pull/17012) can to fix restart a Structured Streaming application using hdfs as fileSystem, but also exist a problem that a tmp file of delta file is still reserved in hdfs. And Structured Streaming don't delete the tmp file generated when restart streaming job in future. ## How was this patch tested? unit tests Author: guifeng <guifengleaf@gmail.com> Closes #17124 from gf53520/SPARK-19779.
* [SPARK-19602][SQL][TESTS] Add tests for qualified column namesSunitha Kambhampati2017-03-028-5/+980
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? - Add tests covering different scenarios with qualified column names - Please see Section 2 in the design doc for the various test scenarios [here](https://issues.apache.org/jira/secure/attachment/12854681/Design_ColResolution_JIRA19602.pdf) - As part of SPARK-19602, changes are made to support three part column name. In order to aid in the review and to reduce the diff, the test scenarios are separated out into this PR. ## How was this patch tested? - This is a **test only** change. The individual test suites were run successfully. Author: Sunitha Kambhampati <skambha@us.ibm.com> Closes #17067 from skambha/colResolutionTests.
* [SPARK-19745][ML] SVCAggregator captures coefficients in its closuresethah2017-03-026-24/+34
| | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? JIRA: [SPARK-19745](https://issues.apache.org/jira/browse/SPARK-19745) Reorganize SVCAggregator to avoid serializing coefficients. This patch also makes the gradient array a `lazy val` which will avoid materializing a large array on the driver before shipping the class to the executors. This improvement stems from https://github.com/apache/spark/pull/16037. Actually, probably all ML aggregators can benefit from this. We can either: a.) separate the gradient improvement into another patch b.) keep what's here _plus_ add the lazy evaluation to all other aggregators in this patch or c.) keep it as is. ## How was this patch tested? This is an interesting question! I don't know of a reasonable way to test this right now. Ideally, we could perform an optimization and look at the shuffle write data for each task, and we could compare the size to what it we know it should be: `numCoefficients * 8 bytes`. Not sure if there is a good way to do that right now? We could discuss this here or in another JIRA, but I suspect it would be a significant undertaking. Author: sethah <seth.hendrickson16@gmail.com> Closes #17076 from sethah/svc_agg.
* [SPARK-19276][CORE] Fetch Failure handling robust to user error handlingImran Rashid2017-03-027-17/+198
| | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Fault-tolerance in spark requires special handling of shuffle fetch failures. The Executor would catch FetchFailedException and send a special msg back to the driver. However, intervening user code could intercept that exception, and wrap it with something else. This even happens in SparkSQL. So rather than checking the thrown exception only, we'll store the fetch failure directly in the TaskContext, where users can't touch it. ## How was this patch tested? Added a test case which failed before the fix. Full test suite via jenkins. Author: Imran Rashid <irashid@cloudera.com> Closes #16639 from squito/SPARK-19276.
* [SPARK-19631][CORE] OutputCommitCoordinator should not allow commits for ↵Patrick Woody2017-03-022-24/+46
| | | | | | | | | | | | | | | | | | already failed tasks ## What changes were proposed in this pull request? Previously it was possible for there to be a race between a task failure and committing the output of a task. For example, the driver may mark a task attempt as failed due to an executor heartbeat timeout (possibly due to GC), but the task attempt actually ends up coordinating with the OutputCommitCoordinator once the executor recovers and committing its result. This will lead to any retry attempt failing because the task result has already been committed despite the original attempt failing. This ensures that any previously failed task attempts cannot enter the commit protocol. ## How was this patch tested? Added a unit test Author: Patrick Woody <pwoody@palantir.com> Closes #16959 from pwoody/pw/recordFailuresForCommitter.
* [SPARK-19720][CORE] Redact sensitive information from SparkSubmit consoleMark Grover2017-03-023-5/+31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This change redacts senstive information (based on `spark.redaction.regex` property) from the Spark Submit console logs. Such sensitive information is already being redacted from event logs and yarn logs, etc. ## How was this patch tested? Testing was done manually to make sure that the console logs were not printing any sensitive information. Here's some output from the console: ``` Spark properties used, including those specified through --conf and those from the properties file /etc/spark2/conf/spark-defaults.conf: (spark.yarn.appMasterEnv.HADOOP_CREDSTORE_PASSWORD,*********(redacted)) (spark.authenticate,false) (spark.executorEnv.HADOOP_CREDSTORE_PASSWORD,*********(redacted)) ``` ``` System properties: (spark.yarn.appMasterEnv.HADOOP_CREDSTORE_PASSWORD,*********(redacted)) (spark.authenticate,false) (spark.executorEnv.HADOOP_CREDSTORE_PASSWORD,*********(redacted)) ``` There is a risk if new print statements were added to the console down the road, sensitive information may still get leaked, since there is no test that asserts on the console log output. I considered it out of the scope of this JIRA to write an integration test to make sure new leaks don't happen in the future. Running unit tests to make sure nothing else is broken by this change. Author: Mark Grover <mark@apache.org> Closes #17047 from markgrover/master_redaction.
* [SPARK-19345][ML][DOC] Add doc for "coldStartStrategy" usage in ALSNick Pentreath2017-03-024-1/+35
| | | | | | | | | | | | [SPARK-14489](https://issues.apache.org/jira/browse/SPARK-14489) added the ability to skip `NaN` predictions during `ALSModel.transform`. This PR adds documentation for the `coldStartStrategy` param to the ALS user guide, and add code to the examples to illustrate usage. ## How was this patch tested? Doc and example change only. Build HTML doc locally and verified example code builds, and runs in shell for Scala/Python. Author: Nick Pentreath <nickp@za.ibm.com> Closes #17102 from MLnick/SPARK-19345-coldstart-doc.
* [SPARK-19704][ML] AFTSurvivalRegression should support numeric censorColZheng RuiFeng2017-03-023-5/+37
| | | | | | | | | | | ## What changes were proposed in this pull request? make `AFTSurvivalRegression` support numeric censorCol ## How was this patch tested? existing tests and added tests Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #17034 from zhengruifeng/aft_numeric_censor.
* [SPARK-19733][ML] Removed unnecessary castings and refactored checked casts ↵Vasilis Vryniotis2017-03-022-20/+95
| | | | | | | | | | | | | | | | in ALS. ## What changes were proposed in this pull request? The original ALS was performing unnecessary casting to the user and item ids because the protected checkedCast() method required a double. I removed the castings and refactored the method to receive Any and efficiently handle all permitted numeric values. ## How was this patch tested? I tested it by running the unit-tests and by manually validating the result of checkedCast for various legal and illegal values. Author: Vasilis Vryniotis <bbriniotis@datumbox.com> Closes #17059 from datumbox/als_casting_fix.
* [SPARK-18352][DOCS] wholeFile JSON update doc and programming guideFelix Cheung2017-03-026-22/+30
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Update doc for R, programming guide. Clarify default behavior for all languages. ## How was this patch tested? manually Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #17128 from felixcheung/jsonwholefiledoc.
* [SPARK-19734][PYTHON][ML] Correct OneHotEncoder doc string to say dropLastMark Grover2017-03-011-1/+1
| | | | | | | | | | | | | ## What changes were proposed in this pull request? Updates the doc string to match up with the code i.e. say dropLast instead of includeFirst ## How was this patch tested? Not much, since it's a doc-like change. Will run unit tests via Jenkins job. Author: Mark Grover <mark@apache.org> Closes #17127 from markgrover/spark_19734.
* [MINOR][ML] Fix comments in LSH Examples and Python APIYun Ni2017-03-015-5/+5
| | | | | | | | | | | | | ## What changes were proposed in this pull request? Remove `org.apache.spark.examples.` in Add slash in one of the python doc. ## How was this patch tested? Run examples using the commands in the comments. Author: Yun Ni <yunn@uber.com> Closes #17104 from Yunni/yunn_minor.
* [SPARK-19583][SQL] CTAS for data source table with a created location should ↵windpiger2017-03-013-13/+156
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | succeed ## What changes were proposed in this pull request? ``` spark.sql( s""" |CREATE TABLE t |USING parquet |PARTITIONED BY(a, b) |LOCATION '$dir' |AS SELECT 3 as a, 4 as b, 1 as c, 2 as d """.stripMargin) ``` Failed with the error message: ``` path file:/private/var/folders/6r/15tqm8hn3ldb3rmbfqm1gf4c0000gn/T/spark-195cd513-428a-4df9-b196-87db0c73e772 already exists.; org.apache.spark.sql.AnalysisException: path file:/private/var/folders/6r/15tqm8hn3ldb3rmbfqm1gf4c0000gn/T/spark-195cd513-428a-4df9-b196-87db0c73e772 already exists.; at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:102) ``` while hive table is ok ,so we should fix it for datasource table. The reason is that the SaveMode check is put in `InsertIntoHadoopFsRelationCommand` , and the SaveMode check actually use `path`, this is fine when we use `DataFrameWriter.save()`, because this situation of SaveMode act on `path`. While when we use `CreateDataSourceAsSelectCommand`, the situation of SaveMode act on table, and we have already do SaveMode check in `CreateDataSourceAsSelectCommand` for table , so we should not do SaveMode check in the following logic in `InsertIntoHadoopFsRelationCommand` for path, this is redundant and wrong logic for `CreateDataSourceAsSelectCommand` After this PR, the following DDL will succeed, when the location has been created we will append it or overwrite it. ``` CREATE TABLE ... (PARTITIONED BY ...) LOCATION path AS SELECT ... ``` ## How was this patch tested? unit test added Author: windpiger <songjun@outlook.com> Closes #16938 from windpiger/CTASDataSourceWitLocation.
* [SPARK-13931] Stage can hang if an executor fails while speculated tasks are ↵GavinGavinNo12017-03-012-2/+65
| | | | | | | | | | | | | | | | | | | | | | | | | | running ## What changes were proposed in this pull request? When function 'executorLost' is invoked in class 'TaskSetManager', it's significant to judge whether variable 'isZombie' is set to true. This pull request fixes the following hang: 1.Open speculation switch in the application. 2.Run this app and suppose last task of shuffleMapStage 1 finishes. Let's get the record straight, from the eyes of DAG, this stage really finishes, and from the eyes of TaskSetManager, variable 'isZombie' is set to true, but variable runningTasksSet isn't empty because of speculation. 3.Suddenly, executor 3 is lost. TaskScheduler receiving this signal, invokes all executorLost functions of rootPool's taskSetManagers. DAG receiving this signal, removes all this executor's outputLocs. 4.TaskSetManager adds all this executor's tasks to pendingTasks and tells DAG they will be resubmitted (Attention: possibly not on time). 5.DAG starts to submit a new waitingStage, let's say shuffleMapStage 2, and going to find that shuffleMapStage 1 is its missing parent because some outputLocs are removed due to executor lost. Then DAG submits shuffleMapStage 1 again. 6.DAG still receives Task 'Resubmitted' signal from old taskSetManager, and increases the number of pendingTasks of shuffleMapStage 1 each time. However, old taskSetManager won't resolve new task to submit because its variable 'isZombie' is set to true. 7.Finally shuffleMapStage 1 never finishes in DAG together with all stages depending on it. ## How was this patch tested? It's quite difficult to construct test cases. Author: GavinGavinNo1 <gavingavinno1@gmail.com> Author: 16092929 <16092929@cnsuning.com> Closes #16855 from GavinGavinNo1/resolve-stage-blocked2.
* [SPARK-19777] Scan runningTasksSet when check speculatable tasks in ↵jinxing2017-03-011-3/+2
| | | | | | | | | | | | | | | TaskSetManager. ## What changes were proposed in this pull request? When check speculatable tasks in `TaskSetManager`, only scan `runningTasksSet` instead of scanning all `taskInfos`. ## How was this patch tested? Existing tests. Author: jinxing <jinxing6042@126.com> Closes #17111 from jinxing64/SPARK-19777.
* [SPARK-19775][SQL] Remove an obsolete `partitionBy().insertInto()` test caseDongjoon Hyun2017-03-021-13/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This issue removes [a test case](https://github.com/apache/spark/blame/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertIntoHiveTableSuite.scala#L287-L298) which was introduced by [SPARK-14459](https://github.com/apache/spark/commit/652bbb1bf62722b08a062c7a2bf72019f85e179e) and was superseded by [SPARK-16033](https://github.com/apache/spark/blame/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertIntoHiveTableSuite.scala#L365-L371). Basically, we cannot use `partitionBy` and `insertInto` together. ```scala test("Reject partitioning that does not match table") { withSQLConf(("hive.exec.dynamic.partition.mode", "nonstrict")) { sql("CREATE TABLE partitioned (id bigint, data string) PARTITIONED BY (part string)") val data = (1 to 10).map(i => (i, s"data-$i", if ((i % 2) == 0) "even" else "odd")) .toDF("id", "data", "part") intercept[AnalysisException] { // cannot partition by 2 fields when there is only one in the table definition data.write.partitionBy("part", "data").insertInto("partitioned") } } } ``` ## How was this patch tested? This only removes a test case. Pass the existing Jenkins test. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #17106 from dongjoon-hyun/SPARK-19775.
* [DOC][MINOR][SPARKR] Update SparkR doc for names, columns and colnamesactuaryzhang2017-03-012-2/+8
| | | | | | | | | | | | Update R doc: 1. columns, names and colnames returns a vector of strings, not **list** as in current doc. 2. `colnames<-` does allow the subset assignment, so the length of `value` can be less than the number of columns, e.g., `colnames(df)[1] <- "a"`. felixcheung Author: actuaryzhang <actuaryzhang10@gmail.com> Closes #17115 from actuaryzhang/sparkRMinorDoc.
* [SPARK-19787][ML] Changing the default parameter of regParam.Vasilis Vryniotis2017-03-011-1/+1
| | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? In the ALS method the default values of regParam do not match within the same file (lines [224](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L224) and [714](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L714)). In one place we set it to 1.0 and in the other to 0.1. I changed the one of train() method to 0.1 and now it matches the default value which is visible to Spark users. The method is marked with DeveloperApi so it should not affect the users. Whenever we use the particular method we provide all parameters, so the default does not matter. Only exception is the unit-tests on ALSSuite but the change does not break them. Note: This PR should get the award of the laziest commit in Spark history. Originally I wanted to correct this on another PR but MLnick [suggested](https://github.com/apache/spark/pull/17059#issuecomment-283333572) to create a separate PR & ticket. If you think this change is too insignificant/minor, you are probably right, so feel free to reject and close this. :) ## How was this patch tested? Unit-tests Author: Vasilis Vryniotis <vvryniotis@hotels.com> Closes #17121 from datumbox/als_regparam.
* [SPARK-19761][SQL] create InMemoryFileIndex with an empty rootPaths when set ↵windpiger2017-03-013-3/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | PARALLEL_PARTITION_DISCOVERY_THRESHOLD to zero failed ## What changes were proposed in this pull request? If we create a InMemoryFileIndex with an empty rootPaths when set PARALLEL_PARTITION_DISCOVERY_THRESHOLD to zero, it will throw an exception: ``` Positive number of slices required java.lang.IllegalArgumentException: Positive number of slices required at org.apache.spark.rdd.ParallelCollectionRDD$.slice(ParallelCollectionRDD.scala:119) at org.apache.spark.rdd.ParallelCollectionRDD.getPartitions(ParallelCollectionRDD.scala:97) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.rdd.RDD.partitions(RDD.scala:250) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.rdd.RDD.partitions(RDD.scala:250) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.rdd.RDD.partitions(RDD.scala:250) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2084) at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:362) at org.apache.spark.rdd.RDD.collect(RDD.scala:935) at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$.org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$bulkListLeafFiles(PartitioningAwareFileIndex.scala:357) at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.listLeafFiles(PartitioningAwareFileIndex.scala:256) at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.refresh0(InMemoryFileIndex.scala:74) at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.<init>(InMemoryFileIndex.scala:50) at org.apache.spark.sql.execution.datasources.FileIndexSuite$$anonfun$9$$anonfun$apply$mcV$sp$2.apply$mcV$sp(FileIndexSuite.scala:186) at org.apache.spark.sql.test.SQLTestUtils$class.withSQLConf(SQLTestUtils.scala:105) at org.apache.spark.sql.execution.datasources.FileIndexSuite.withSQLConf(FileIndexSuite.scala:33) at org.apache.spark.sql.execution.datasources.FileIndexSuite$$anonfun$9.apply$mcV$sp(FileIndexSuite.scala:185) at org.apache.spark.sql.execution.datasources.FileIndexSuite$$anonfun$9.apply(FileIndexSuite.scala:185) at org.apache.spark.sql.execution.datasources.FileIndexSuite$$anonfun$9.apply(FileIndexSuite.scala:185) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) ``` ## How was this patch tested? unit test added Author: windpiger <songjun@outlook.com> Closes #17093 from windpiger/fixEmptiPathInBulkListFiles.
* [SPARK-19766][SQL] Constant alias columns in INNER JOIN should not be folded ↵Stan Zhai2017-03-014-1/+100
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | by FoldablePropagation rule ## What changes were proposed in this pull request? This PR fixes the code in Optimizer phase where the constant alias columns of a `INNER JOIN` query are folded in Rule `FoldablePropagation`. For the following query(): ``` val sqlA = """ |create temporary view ta as |select a, 'a' as tag from t1 union all |select a, 'b' as tag from t2 """.stripMargin val sqlB = """ |create temporary view tb as |select a, 'a' as tag from t3 union all |select a, 'b' as tag from t4 """.stripMargin val sql = """ |select tb.* from ta inner join tb on |ta.a = tb.a and |ta.tag = tb.tag """.stripMargin ``` The tag column is an constant alias column, it's folded by `FoldablePropagation` like this: ``` TRACE SparkOptimizer: === Applying Rule org.apache.spark.sql.catalyst.optimizer.FoldablePropagation === Project [a#4, tag#14] Project [a#4, tag#14] !+- Join Inner, ((a#0 = a#4) && (tag#8 = tag#14)) +- Join Inner, ((a#0 = a#4) && (a = a)) :- Union :- Union : :- Project [a#0, a AS tag#8] : :- Project [a#0, a AS tag#8] : : +- LocalRelation [a#0] : : +- LocalRelation [a#0] : +- Project [a#2, b AS tag#9] : +- Project [a#2, b AS tag#9] : +- LocalRelation [a#2] : +- LocalRelation [a#2] +- Union +- Union :- Project [a#4, a AS tag#14] :- Project [a#4, a AS tag#14] : +- LocalRelation [a#4] : +- LocalRelation [a#4] +- Project [a#6, b AS tag#15] +- Project [a#6, b AS tag#15] +- LocalRelation [a#6] +- LocalRelation [a#6] ``` Finally the Result of Batch Operator Optimizations is: ``` Project [a#4, tag#14] Project [a#4, tag#14] !+- Join Inner, ((a#0 = a#4) && (tag#8 = tag#14)) +- Join Inner, (a#0 = a#4) ! :- SubqueryAlias ta, `ta` :- Union ! : +- Union : :- LocalRelation [a#0] ! : :- Project [a#0, a AS tag#8] : +- LocalRelation [a#2] ! : : +- SubqueryAlias t1, `t1` +- Union ! : : +- Project [a#0] :- LocalRelation [a#4, tag#14] ! : : +- SubqueryAlias grouping +- LocalRelation [a#6, tag#15] ! : : +- LocalRelation [a#0] ! : +- Project [a#2, b AS tag#9] ! : +- SubqueryAlias t2, `t2` ! : +- Project [a#2] ! : +- SubqueryAlias grouping ! : +- LocalRelation [a#2] ! +- SubqueryAlias tb, `tb` ! +- Union ! :- Project [a#4, a AS tag#14] ! : +- SubqueryAlias t3, `t3` ! : +- Project [a#4] ! : +- SubqueryAlias grouping ! : +- LocalRelation [a#4] ! +- Project [a#6, b AS tag#15] ! +- SubqueryAlias t4, `t4` ! +- Project [a#6] ! +- SubqueryAlias grouping ! +- LocalRelation [a#6] ``` The condition `tag#8 = tag#14` of INNER JOIN has been removed. This leads to the data of inner join being wrong. After fix: ``` === Result of Batch LocalRelation === GlobalLimit 21 GlobalLimit 21 +- LocalLimit 21 +- LocalLimit 21 +- Project [a#4, tag#11] +- Project [a#4, tag#11] +- Join Inner, ((a#0 = a#4) && (tag#8 = tag#11)) +- Join Inner, ((a#0 = a#4) && (tag#8 = tag#11)) ! :- SubqueryAlias ta :- Union ! : +- Union : :- LocalRelation [a#0, tag#8] ! : :- Project [a#0, a AS tag#8] : +- LocalRelation [a#2, tag#9] ! : : +- SubqueryAlias t1 +- Union ! : : +- Project [a#0] :- LocalRelation [a#4, tag#11] ! : : +- SubqueryAlias grouping +- LocalRelation [a#6, tag#12] ! : : +- LocalRelation [a#0] ! : +- Project [a#2, b AS tag#9] ! : +- SubqueryAlias t2 ! : +- Project [a#2] ! : +- SubqueryAlias grouping ! : +- LocalRelation [a#2] ! +- SubqueryAlias tb ! +- Union ! :- Project [a#4, a AS tag#11] ! : +- SubqueryAlias t3 ! : +- Project [a#4] ! : +- SubqueryAlias grouping ! : +- LocalRelation [a#4] ! +- Project [a#6, b AS tag#12] ! +- SubqueryAlias t4 ! +- Project [a#6] ! +- SubqueryAlias grouping ! +- LocalRelation [a#6] ``` ## How was this patch tested? add sql-tests/inputs/inner-join.sql All tests passed. Author: Stan Zhai <zhaishidan@haizhi.com> Closes #17099 from stanzhai/fix-inner-join.
* [SPARK-19736][SQL] refreshByPath should clear all cached plans with the ↵Liang-Chi Hsieh2017-03-012-9/+26
| | | | | | | | | | | | | | | | | | | | specified path ## What changes were proposed in this pull request? `Catalog.refreshByPath` can refresh the cache entry and the associated metadata for all dataframes (if any), that contain the given data source path. However, `CacheManager.invalidateCachedPath` doesn't clear all cached plans with the specified path. It causes some strange behaviors reported in SPARK-15678. ## How was this patch tested? Jenkins tests. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #17064 from viirya/fix-refreshByPath.
* [SPARK-19633][SS] FileSource read from FileSinkLiwei Lin2017-02-285-41/+200
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Right now file source always uses `InMemoryFileIndex` to scan files from a given path. But when reading the outputs from another streaming query, the file source should use `MetadataFileIndex` to list files from the sink log. This patch adds this support. ## `MetadataFileIndex` or `InMemoryFileIndex` ```scala spark .readStream .format(...) .load("/some/path") // for a non-glob path: // - use `MetadataFileIndex` when `/some/path/_spark_meta` exists // - fall back to `InMemoryFileIndex` otherwise ``` ```scala spark .readStream .format(...) .load("/some/path/*/*") // for a glob path: always use `InMemoryFileIndex` ``` ## How was this patch tested? two newly added tests Author: Liwei Lin <lwlin7@gmail.com> Closes #16987 from lw-lin/source-read-from-sink.
* [SPARK-19460][SPARKR] Update dataset used in R documentation, examples to ↵wm624@hotmail.com2017-02-289-74/+85
| | | | | | | | | | | | | | | | reduce warning noise and confusions ## What changes were proposed in this pull request? Replace `iris` dataset with `Titanic` or other dataset in example and document. ## How was this patch tested? Manual and existing test Author: wm624@hotmail.com <wm624@hotmail.com> Closes #17032 from wangmiao1981/example.
* [SPARK-19572][SPARKR] Allow to disable hive in sparkR shellJeff Zhang2017-02-281-2/+4
| | | | | | | | | | | | | ## What changes were proposed in this pull request? SPARK-15236 do this for scala shell, this ticket is for sparkR shell. This is not only for sparkR itself, but can also benefit downstream project like livy which use shell.R for its interactive session. For now, livy has no control of whether enable hive or not. ## How was this patch tested? Tested it manually, run `bin/sparkR --master local --conf spark.sql.catalogImplementation=in-memory` and verify hive is not enabled. Author: Jeff Zhang <zjffdu@apache.org> Closes #16907 from zjffdu/SPARK-19572.
* [SPARK-14503][ML] spark.ml API for FPGrowthYuhao2017-02-282-0/+469
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? jira: https://issues.apache.org/jira/browse/SPARK-14503 Function parity: Add FPGrowth and AssociationRules to ML. design doc: https://docs.google.com/document/d/1bVhABn5DiEj8bw0upqGMJT2L4nvO_0_cXdwu4uMT6uU/pub Currently I make FPGrowthModel a transformer. For each association rule, it will just examine the input items against antecedents and summarize the consequents. Update: Thinking again, FPGrowth is only the algorithm to find the frequent itemsets, and can be replaced by other algorithms. The frequent itemsets are used by AssociationRules to generate the association rules. Then we can use the association rules to predict with other records. ![drawing1](https://cloud.githubusercontent.com/assets/7981698/22489294/76b9302c-e7cb-11e6-8d2d-3fc53f407b2f.png) **For reviewers**, Let's first decide if the current `transform` function meets your expectation. Current options: 1. Current implementation: Use Estimator and Transformer pattern in ML, the `transform` function will examine the input items against all the association rules and summarize the consequents. Users can also access frequent items and association rules via other model members. 2. Keep the Estimator and Transformer pattern. But AssociationRulesModel and FPGrowthModel will have empty `transform` function, meaning DataFrame has no change after transform. But users can access frequent items and association rules via other model members. 3. (mentioned by zhengruifeng) Keep the Estimator and Transformer pattern. But `FPGrowthModel` and `AssociationRulesModel` will just return frequent itemsets and association rules DataFrame in the `transform` function. Meaning the resulting DataFrame after `transform` will not be related to the input DataFrame. 4. Discard the Estimator and Transformer pattern. Both FPGrowth and FPGrowthModel will directly extend from PipelineStage, thus we don't need to have a `transform` function. I'd like to hear more concrete suggestions. I would prefer option 1 or 2. update 2: As discussed in the jira, we will not expose AssociationRules as a public API for now. ## How was this patch tested? new unit test suites Author: Yuhao <yuhao.yang@intel.com> Author: Yuhao Yang <yuhao.yang@intel.com> Author: Yuhao Yang <hhbyyh@gmail.com> Closes #15415 from hhbyyh/mlfpm.
* [SPARK-19373][MESOS] Base spark.scheduler.minRegisteredResourceRatio on ↵Michael Gummelt2017-03-012-68/+70
| | | | | | | | | | | | | | | | | | registered cores rather than accepted cores ## What changes were proposed in this pull request? See JIRA ## How was this patch tested? Unit tests, Mesos/Spark integration tests cc skonto susanxhuynh Author: Michael Gummelt <mgummelt@mesosphere.io> Closes #17045 from mgummelt/SPARK-19373-registered-resources.
* [SPARK-19769][DOCS] Update quickstart instructionsMichael McCune2017-03-011-3/+3
| | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This change addresses the renaming of the `simple.sbt` build file to `build.sbt`. Newer versions of the sbt tool are not finding the older named file and are looking for the `build.sbt`. The quickstart instructions for self-contained applications is updated with this change. ## How was this patch tested? As this is a relatively minor change of a few words, the markdown was checked for syntax and spelling. Site was built with `SKIP_API=1 jekyll serve` for testing purposes. Author: Michael McCune <msm@redhat.com> Closes #17101 from elmiko/spark-19769.
* [MINOR][DOC] Update GLM doc to include tweedie distributionactuaryzhang2017-02-281-0/+5
| | | | | | | | | | Update GLM documentation to include the Tweedie distribution. #16344 jkbradley yanboliang Author: actuaryzhang <actuaryzhang10@gmail.com> Closes #17103 from actuaryzhang/doc.
* [SPARK-19610][SQL] Support parsing multiline CSV fileshyukjinkwon2017-02-2814-197/+525
| | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR proposes the support for multiple lines for CSV by resembling the multiline supports in JSON datasource (in case of JSON, per file). So, this PR introduces `wholeFile` option which makes the format not splittable and reads each whole file. Since Univocity parser can produces each row from a stream, it should be capable of parsing very large documents when the internal rows are fix in the memory. ## How was this patch tested? Unit tests in `CSVSuite` and `tests.py` Manual tests with a single 9GB CSV file in local file system, for example, ```scala spark.read.option("wholeFile", true).option("inferSchema", true).csv("tmp.csv").count() ``` Author: hyukjinkwon <gurwls223@gmail.com> Closes #16976 from HyukjinKwon/SPARK-19610.
* [SPARK-19463][SQL] refresh cache after the InsertIntoHadoopFsRelationCommandwindpiger2017-02-284-25/+14
| | | | | | | | | | | | | ## What changes were proposed in this pull request? If we first cache a DataSource table, then we insert some data into the table, we should refresh the data in the cache after the insert command. ## How was this patch tested? unit test added Author: windpiger <songjun@outlook.com> Closes #16809 from windpiger/refreshCacheAfterInsert.
* [SPARK-19677][SS] Committing a delta file atop an existing one should not ↵Roberto Agostino Vitillo2017-02-282-8/+34
| | | | | | | | | | | | | | | | | | | | | | | fail on HDFS ## What changes were proposed in this pull request? HDFSBackedStateStoreProvider fails to rename files on HDFS but not on the local filesystem. According to the [implementation notes](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/filesystem/filesystem.html) of `rename()`, the behavior of the local filesystem and HDFS varies: > Destination exists and is a file > Renaming a file atop an existing file is specified as failing, raising an exception. > - Local FileSystem : the rename succeeds; the destination file is replaced by the source file. > - HDFS : The rename fails, no exception is raised. Instead the method call simply returns false. This patch ensures that `rename()` isn't called if the destination file already exists. It's still semantically correct because Structured Streaming requires that rerunning a batch should generate the same output. ## How was this patch tested? This patch was tested by running `StateStoreSuite`. Author: Roberto Agostino Vitillo <ra.vitillo@gmail.com> Closes #17012 from vitillo/fix_rename.
* [SPARK-19678][SQL] remove MetastoreRelationWenchen Fan2017-02-2827-672/+442
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? `MetastoreRelation` is used to represent table relation for hive tables, and provides some hive related information. We will resolve `SimpleCatalogRelation` to `MetastoreRelation` for hive tables, which is unnecessary as these 2 are the same essentially. This PR merges `SimpleCatalogRelation` and `MetastoreRelation` ## How was this patch tested? existing tests Author: Wenchen Fan <wenchen@databricks.com> Closes #17015 from cloud-fan/table-relation.
* [SPARK-14489][ML][PYSPARK] ALS unknown user/item prediction strategyNick Pentreath2017-02-283-9/+116
| | | | | | | | | | | | | | | This PR adds a param to `ALS`/`ALSModel` to set the strategy used when encountering unknown users or items at prediction time in `transform`. This can occur in 2 scenarios: (a) production scoring, and (b) cross-validation & evaluation. The current behavior returns `NaN` if a user/item is unknown. In scenario (b), this can easily occur when using `CrossValidator` or `TrainValidationSplit` since some users/items may only occur in the test set and not in the training set. In this case, the evaluator returns `NaN` for all metrics, making model selection impossible. The new param, `coldStartStrategy`, defaults to `nan` (the current behavior). The other option supported initially is `drop`, which drops all rows with `NaN` predictions. This flag allows users to use `ALS` in cross-validation settings. It is made an `expertParam`. The param is made a string so that the set of strategies can be extended in future (some options are discussed in [SPARK-14489](https://issues.apache.org/jira/browse/SPARK-14489)). ## How was this patch tested? New unit tests, and manual "before and after" tests for Scala & Python using MovieLens `ml-latest-small` as example data. Here, using `CrossValidator` or `TrainValidationSplit` with the default param setting results in metrics that are all `NaN`, while setting `coldStartStrategy` to `drop` results in valid metrics. Author: Nick Pentreath <nickp@za.ibm.com> Closes #12896 from MLnick/SPARK-14489-als-nan.
* [SPARK-19660][CORE][SQL] Replace the configuration property names that are ↵Yuming Wang2017-02-28145-205/+213
| | | | | | | | | | | | | | | | | | | | | | deprecated in the version of Hadoop 2.6 ## What changes were proposed in this pull request? Replace all the Hadoop deprecated configuration property names according to [DeprecatedProperties](https://hadoop.apache.org/docs/r2.6.0/hadoop-project-dist/hadoop-common/DeprecatedProperties.html). except: https://github.com/apache/spark/blob/v2.1.0/python/pyspark/sql/tests.py#L1533 https://github.com/apache/spark/blob/v2.1.0/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala#L987 https://github.com/apache/spark/blob/v2.1.0/sql/core/src/main/scala/org/apache/spark/sql/execution/command/SetCommand.scala#L45 https://github.com/apache/spark/blob/v2.1.0/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L614 ## How was this patch tested? Existing tests Author: Yuming Wang <wgyumg@gmail.com> Closes #16990 from wangyum/HadoopDeprecatedProperties.
* [SPARK-19748][SQL] refresh function has a wrong order to do cache invalidate ↵windpiger2017-02-282-1/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | and regenerate the inmemory var for InMemoryFileIndex with FileStatusCache ## What changes were proposed in this pull request? If we refresh a InMemoryFileIndex with a FileStatusCache, it will first use the FileStatusCache to re-generate the cachedLeafFiles etc, then call FileStatusCache.invalidateAll. While the order to do these two actions is wrong, this lead to the refresh action does not take effect. ``` override def refresh(): Unit = { refresh0() fileStatusCache.invalidateAll() } private def refresh0(): Unit = { val files = listLeafFiles(rootPaths) cachedLeafFiles = new mutable.LinkedHashMap[Path, FileStatus]() ++= files.map(f => f.getPath -> f) cachedLeafDirToChildrenFiles = files.toArray.groupBy(_.getPath.getParent) cachedPartitionSpec = null } ``` ## How was this patch tested? unit test added Author: windpiger <songjun@outlook.com> Closes #17079 from windpiger/fixInMemoryFileIndexRefresh.
* [SPARK-19749][SS] Name socket source with a meaningful nameuncleGen2017-02-271-2/+4
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Name socket source with a meaningful name ## How was this patch tested? Jenkins Author: uncleGen <hustyugm@gmail.com> Closes #17082 from uncleGen/SPARK-19749.
* [SPARK-19746][ML] Faster indexing for logistic aggregatorsethah2017-02-282-3/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? JIRA: [SPARK-19746](https://issues.apache.org/jira/browse/SPARK-19746) The following code is inefficient: ````scala val localCoefficients: Vector = bcCoefficients.value features.foreachActive { (index, value) => val stdValue = value / localFeaturesStd(index) var j = 0 while (j < numClasses) { margins(j) += localCoefficients(index * numClasses + j) * stdValue j += 1 } } ```` `localCoefficients(index * numClasses + j)` calls `Vector.apply` which creates a new Breeze vector and indexes that. Even if it is not that slow to create the object, we will generate a lot of extra garbage that may result in longer GC pauses. This is a hot inner loop, so we should optimize wherever possible. ## How was this patch tested? I don't think there's a great way to test this patch. It's purely performance related, so unit tests should guarantee that we haven't made any unwanted changes. Empirically I observed between 10-40% speedups just running short local tests. I suspect the big differences will be seen when large data/coefficient sizes have to pause for GC more often. I welcome other ideas for testing. Author: sethah <seth.hendrickson16@gmail.com> Closes #17078 from sethah/logistic_agg_indexing.
* [SPARK-15615][SQL][BUILD][FOLLOW-UP] Replace deprecated usage of ↵hyukjinkwon2017-02-2721-201/+189
| | | | | | | | | | | | | | | | | | json(RDD[String]) API ## What changes were proposed in this pull request? This PR proposes to replace the deprecated `json(RDD[String])` usage to `json(Dataset[String])`. This currently produces so many warnings. ## How was this patch tested? Fixed tests. Author: hyukjinkwon <gurwls223@gmail.com> Closes #17071 from HyukjinKwon/SPARK-15615-followup.