spark - Mirror of Apache Spark

	Commit message (Collapse)	Author	Age	Files	Lines
*	Revert "[SPARK-17637][SCHEDULER] Packed scheduling for Spark tasks across ↵	Reynold Xin	2016-10-15	1	-11/+0
\| \| \| \| \| \| \| \|	executors" This reverts commit ed1463341455830b8867b721a1b34f291139baf3. The patch merged had obvious quality and documentation issue. The idea is useful, and we should work towards improving its quality and merging it in again.
*	[SPARK-17637][SCHEDULER] Packed scheduling for Spark tasks across executors	Zhan Zhang	2016-10-15	1	-0/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Restructure the code and implement two new task assigner. PackedAssigner: try to allocate tasks to the executors with least available cores, so that spark can release reserved executors when dynamic allocation is enabled. BalancedAssigner: try to allocate tasks to the executors with more available cores in order to balance the workload across all executors. By default, the original round robin assigner is used. We test a pipeline, and new PackedAssigner save around 45% regarding the reserved cpu and memory with dynamic allocation enabled. ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) Both unit test in TaskSchedulerImplSuite and manual tests in production pipeline. Author: Zhan Zhang <zhanzhang@fb.com> Closes #15218 from zhzhan/packed-scheduler.
*	[DOC] Fix typo in sql hive doc	Dhruve Ashar	2016-10-14	1	-1/+1
\| \| \| \| \| \| \| \|	Change is too trivial to file a JIRA. Author: Dhruve Ashar <dhruveashar@gmail.com> Closes #15485 from dhruve/master.
*	[SPARK-17675][CORE] Expand Blacklist for TaskSets	Imran Rashid	2016-10-12	1	-0/+43
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This is a step along the way to SPARK-8425. To enable incremental review, the first step proposed here is to expand the blacklisting within tasksets. In particular, this will enable blacklisting for * (task, executor) pairs (this already exists via an undocumented config) * (task, node) * (taskset, executor) * (taskset, node) Adding (task, node) is critical to making spark fault-tolerant of one-bad disk in a cluster, without requiring careful tuning of "spark.task.maxFailures". The other additions are also important to avoid many misleading task failures and long scheduling delays when there is one bad node on a large cluster. Note that some of the code changes here aren't really required for just this -- they put pieces in place for SPARK-8425 even though they are not used yet (eg. the `BlacklistTracker` helper is a little out of place, `TaskSetBlacklist` holds onto a little more info than it needs to for just this change, and `ExecutorFailuresInTaskSet` is more complex than it needs to be). ## How was this patch tested? Added unit tests, run tests via jenkins. Author: Imran Rashid <irashid@cloudera.com> Author: mwws <wei.mao@intel.com> Closes #15249 from squito/taskset_blacklist_only.
*	[SPARK-17853][STREAMING][KAFKA][DOC] make it clear that reusing group.id is bad	cody koeninger	2016-10-12	1	-2/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Documentation fix to make it clear that reusing group id for different streams is super duper bad, just like it is with the underlying Kafka consumer. ## How was this patch tested? I built jekyll doc and made sure it looked ok. Author: cody koeninger <cody@koeninger.org> Closes #15442 from koeninger/SPARK-17853.
*	[SPARK-17880][DOC] The url linking to `AccumulatorV2` in the document is ↵	Kousuke Saruta	2016-10-11	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	incorrect. ## What changes were proposed in this pull request? In `programming-guide.md`, the url which links to `AccumulatorV2` says `api/scala/index.html#org.apache.spark.AccumulatorV2` but `api/scala/index.html#org.apache.spark.util.AccumulatorV2` is correct. ## How was this patch tested? manual test. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #15439 from sarutak/SPARK-17880.
*	Fix hadoop.version in building-spark.md	Alexander Pivovarov	2016-10-11	1	-2/+2
\| \| \| \| \| \| \| \|	Couple of mvn build examples use `-Dhadoop.version=VERSION` instead of actual version number Author: Alexander Pivovarov <apivovarov@gmail.com> Closes #15440 from apivovarov/patch-1.
*	[SPARK-17719][SPARK-17776][SQL] Unify and tie up options in a single place ↵	hyukjinkwon	2016-10-10	1	-9/+27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	in JDBC datasource package ## What changes were proposed in this pull request? This PR proposes to fix arbitrary usages among `Map[String, String]`, `Properties` and `JDBCOptions` instances for options in `execution/jdbc` package and make the connection properties exclude Spark-only options. This PR includes some changes as below: - Unify `Map[String, String]`, `Properties` and `JDBCOptions` in `execution/jdbc` package to `JDBCOptions`. - Move `batchsize`, `fetchszie`, `driver` and `isolationlevel` options into `JDBCOptions` instance. - Document `batchSize` and `isolationlevel` with marking both read-only options and write-only options. Also, this includes minor types and detailed explanation for some statements such as url. - Throw exceptions fast by checking arguments first rather than in execution time (e.g. for `fetchsize`). - Exclude Spark-only options in connection properties. ## How was this patch tested? Existing tests should cover this. Author: hyukjinkwon <gurwls223@gmail.com> Closes #15292 from HyukjinKwon/SPARK-17719.
*	[SPARK-14082][MESOS] Enable GPU support with Mesos	Timothy Chen	2016-10-10	1	-0/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Enable GPU resources to be used when running coarse grain mode with Mesos. ## How was this patch tested? Manual test with GPU. Author: Timothy Chen <tnachen@gmail.com> Closes #14644 from tnachen/gpu_mesos.
*	[SPARK-17338][SQL] add global temp view	Wenchen Fan	2016-10-10	1	-5/+40
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Global temporary view is a cross-session temporary view, which means it's shared among all sessions. Its lifetime is the lifetime of the Spark application, i.e. it will be automatically dropped when the application terminates. It's tied to a system preserved database `global_temp`(configurable via SparkConf), and we must use the qualified name to refer a global temp view, e.g. SELECT * FROM global_temp.view1. changes for `SessionCatalog`: 1. add a new field `gloabalTempViews: GlobalTempViewManager`, to access the shared global temp views, and the global temp db name. 2. `createDatabase` will fail if users wanna create `global_temp`, which is system preserved. 3. `setCurrentDatabase` will fail if users wanna set `global_temp`, which is system preserved. 4. add `createGlobalTempView`, which is used in `CreateViewCommand` to create global temp views. 5. add `dropGlobalTempView`, which is used in `CatalogImpl` to drop global temp view. 6. add `alterTempViewDefinition`, which is used in `AlterViewAsCommand` to update the view definition for local/global temp views. 7. `renameTable`/`dropTable`/`isTemporaryTable`/`lookupRelation`/`getTempViewOrPermanentTableMetadata`/`refreshTable` will handle global temp views. changes for SQL commands: 1. `CreateViewCommand`/`AlterViewAsCommand` is updated to support global temp views 2. `ShowTablesCommand` outputs a new column `database`, which is used to distinguish global and local temp views. 3. other commands can also handle global temp views if they call `SessionCatalog` APIs which accepts global temp views, e.g. `DropTableCommand`, `AlterTableRenameCommand`, `ShowColumnsCommand`, etc. changes for other public API 1. add a new method `dropGlobalTempView` in `Catalog` 2. `Catalog.findTable` can find global temp view 3. add a new method `createGlobalTempView` in `Dataset` ## How was this patch tested? new tests in `SQLViewSuite` Author: Wenchen Fan <wenchen@databricks.com> Closes #14897 from cloud-fan/global-temp-view.
*	[SPARK-17707][WEBUI] Web UI prevents spark-submit application to be finished	Sean Owen	2016-10-07	1	-1/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This expands calls to Jetty's simple `ServerConnector` constructor to explicitly specify a `ScheduledExecutorScheduler` that makes daemon threads. It should otherwise result in exactly the same configuration, because the other args are copied from the constructor that is currently called. (I'm not sure we should change the Hive Thriftserver impl, but I did anyway.) This also adds `sc.stop()` to the quick start guide example. ## How was this patch tested? Existing tests; _pending_ at least manual verification of the fix. Author: Sean Owen <sowen@cloudera.com> Closes #15381 from srowen/SPARK-17707.
*	[SPARK-17346][SQL] Add Kafka source for Structured Streaming	Shixiong Zhu	2016-10-05	2	-1/+245
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This PR adds a new project ` external/kafka-0-10-sql` for Structured Streaming Kafka source. It's based on the design doc: https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit?usp=sharing tdas did most of work and part of them was inspired by koeninger's work. ### Introduction The Kafka source is a structured streaming data source to poll data from Kafka. The schema of reading data is as follows: Column \| Type ---- \| ---- key \| binary value \| binary topic \| string partition \| int offset \| long timestamp \| long timestampType \| int The source can deal with deleting topics. However, the user should make sure there is no Spark job processing the data when deleting a topic. ### Configuration The user can use `DataStreamReader.option` to set the following configurations. Kafka Source's options \| value \| default \| meaning ------ \| ------- \| ------ \| ----- startingOffset \| ["earliest", "latest"] \| "latest" \| The start point when a query is started, either "earliest" which is from the earliest offset, or "latest" which is just from the latest offset. Note: This only applies when a new Streaming query is started, and that resuming will always pick up from where the query left off. failOnDataLost \| [true, false] \| true \| Whether to fail the query when it's possible that data is lost (e.g., topics are deleted, or offsets are out of range). This may be a false alarm. You can disable it when it doesn't work as you expected. subscribe \| A comma-separated list of topics \| (none) \| The topic list to subscribe. Only one of "subscribe" and "subscribeParttern" options can be specified for Kafka source. subscribePattern \| Java regex string \| (none) \| The pattern used to subscribe the topic. Only one of "subscribe" and "subscribeParttern" options can be specified for Kafka source. kafka.consumer.poll.timeoutMs \| long \| 512 \| The timeout in milliseconds to poll data from Kafka in executors fetchOffset.numRetries \| int \| 3 \| Number of times to retry before giving up fatch Kafka latest offsets. fetchOffset.retryIntervalMs \| long \| 10 \| milliseconds to wait before retrying to fetch Kafka offsets Kafka's own configurations can be set via `DataStreamReader.option` with `kafka.` prefix, e.g, `stream.option("kafka.bootstrap.servers", "host:port")` ### Usage * Subscribe to 1 topic ```Scala spark .readStream .format("kafka") .option("kafka.bootstrap.servers", "host:port") .option("subscribe", "topic1") .load() ``` * Subscribe to multiple topics ```Scala spark .readStream .format("kafka") .option("kafka.bootstrap.servers", "host:port") .option("subscribe", "topic1,topic2") .load() ``` * Subscribe to a pattern ```Scala spark .readStream .format("kafka") .option("kafka.bootstrap.servers", "host:port") .option("subscribePattern", "topic.*") .load() ``` ## How was this patch tested? The new unit tests. Author: Shixiong Zhu <shixiong@databricks.com> Author: Tathagata Das <tathagata.das1565@gmail.com> Author: Shixiong Zhu <zsxwing@gmail.com> Author: cody koeninger <cody@koeninger.org> Closes #15102 from zsxwing/kafka-source.
*	[SPARK-17239][ML][DOC] Update user guide for multiclass logistic regression	sethah	2016-10-05	1	-7/+58
\| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Updates user guide to reflect that LogisticRegression now supports multiclass. Also adds new examples to show multiclass training. ## How was this patch tested? Ran locally using spark-submit, run-example, and copy/paste from user guide into shells. Generated docs and verified correct output. Author: sethah <seth.hendrickson16@gmail.com> Closes #15349 from sethah/SPARK-17239.
*	[SPARK-17718][DOCS][MLLIB] Make loss function formulation label note clearer ↵	Sean Owen	2016-10-03	1	-4/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	in MLlib docs ## What changes were proposed in this pull request? Move note about labels being +1/-1 in formulation only to be just under the table of formulations. ## How was this patch tested? Doc build Author: Sean Owen <sowen@cloudera.com> Closes #15330 from srowen/SPARK-17718.
*	[SPARK-17736][DOCUMENTATION][SPARKR] Update R README for rmarkdown,…	Jagadeesan	2016-10-03	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? To build R docs (which are built when R tests are run), users need to install pandoc and rmarkdown. This was done for Jenkins in ~~[SPARK-17420](https://issues.apache.org/jira/browse/SPARK-17420)~~ … pandoc] Author: Jagadeesan <as2@us.ibm.com> Closes #15309 from jagadeesanas2/SPARK-17736.
*	[MINOR][DOC] Add an up-to-date description for default serialization during ↵	Dongjoon Hyun	2016-09-30	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	shuffling ## What changes were proposed in this pull request? This PR aims to make the doc up-to-date. The documentation is generally correct, but after https://issues.apache.org/jira/browse/SPARK-13926, Spark starts to choose Kyro as a default serialization library during shuffling of simple types, arrays of simple types, or string type. ## How was this patch tested? This is a documentation update. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #15315 from dongjoon-hyun/SPARK-DOC-SERIALIZER.
*	[SPARK-17412][DOC] All test should not be run by `root` or any admin user	Dongjoon Hyun	2016-09-29	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? `FsHistoryProviderSuite` fails if `root` user runs it. The test case SPARK-3697: ignore directories that cannot be read depends on `setReadable(false, false)` to make test data files and expects the number of accessible files is 1. But, `root` can access all files, so it returns 2. This PR adds the assumption explicitly on doc. `building-spark.md`. ## How was this patch tested? This is a documentation change. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #15291 from dongjoon-hyun/SPARK-17412.
*	[DOCS] Reorganize explanation of Accumulators and Broadcast Variables	José Hiram Soltren	2016-09-29	1	-164/+164
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? The discussion of the interaction of Accumulators and Broadcast Variables should logically follow the discussion on Checkpointing. As currently written, this section discusses Checkpointing before it is formally introduced. To remedy this: - Rename this section to "Accumulators, Broadcast Variables, and Checkpoints", and - Move this section after "Checkpointing". ## How was this patch tested? Testing: ran $ SKIP_API=1 jekyll build , and verified changes in a Web browser pointed at docs/_site/index.html. Author: José Hiram Soltren <jose@cloudera.com> Closes #15281 from jsoltren/doc-changes.
*	[MINOR][DOCS] Fix th doc. of spark-streaming with kinesis	Takeshi YAMAMURO	2016-09-29	1	-6/+3
\| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This pr is just to fix the document of `spark-kinesis-integration`. Since `SPARK-17418` prevented all the kinesis stuffs (including kinesis example code) from publishing, `bin/run-example streaming.KinesisWordCountASL` and `bin/run-example streaming.JavaKinesisWordCountASL` does not work. Instead, it fetches the kinesis jar from the Spark Package. Author: Takeshi YAMAMURO <linguin.m.s@gmail.com> Closes #15260 from maropu/DocFixKinesis.
*	[SPARK-17017][ML][MLLIB][ML][DOC] Updated the ml/mllib feature selection ↵	Shuai Lin	2016-09-28	2	-8/+20
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	docs for ChiSqSelector ## What changes were proposed in this pull request? A follow up for #14597 to update feature selection docs about ChiSqSelector. ## How was this patch tested? Generated html docs. It can be previewed at: * ml: http://sparkdocs.lins05.pw/spark-17017/ml-features.html#chisqselector * mllib: http://sparkdocs.lins05.pw/spark-17017/mllib-feature-extraction.html#chisqselector Author: Shuai Lin <linshuai2012@gmail.com> Closes #15236 from lins05/spark-17017-update-docs-for-chisq-selector-fpr.
*	[Docs] Update spark-standalone.md to fix link	Andrew Mills	2016-09-26	1	-1/+1
\| \| \| \| \| \| \| \| \| \|	Corrected a link to the configuration.html page, it was pointing to a page that does not exist (configurations.html). Documentation change, verified in preview. Author: Andrew Mills <ammills01@users.noreply.github.com> Closes #15244 from ammills01/master.
*	[SPARK-17153][SQL] Should read partition data when reading new files in ↵	Liang-Chi Hsieh	2016-09-26	1	-0/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	filestream without globbing ## What changes were proposed in this pull request? When reading file stream with non-globbing path, the results return data with all `null`s for the partitioned columns. E.g., case class A(id: Int, value: Int) val data = spark.createDataset(Seq( A(1, 1), A(2, 2), A(2, 3)) ) val url = "/tmp/test" data.write.partitionBy("id").parquet(url) spark.read.parquet(url).show +-----+---+ \|value\| id\| +-----+---+ \| 2\| 2\| \| 3\| 2\| \| 1\| 1\| +-----+---+ val s = spark.readStream.schema(spark.read.load(url).schema).parquet(url) s.writeStream.queryName("test").format("memory").start() sql("SELECT * FROM test").show +-----+----+ \|value\| id\| +-----+----+ \| 2\|null\| \| 3\|null\| \| 1\|null\| +-----+----+ ## How was this patch tested? Jenkins tests. Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #14803 from viirya/filestreamsource-option.
*	[SPARK-14525][SQL] Make DataFrameWrite.save work for jdbc	Justin Pihony	2016-09-26	1	-1/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This change modifies the implementation of DataFrameWriter.save such that it works with jdbc, and the call to jdbc merely delegates to save. ## How was this patch tested? This was tested via unit tests in the JDBCWriteSuite, of which I added one new test to cover this scenario. ## Additional details rxin This seems to have been most recently touched by you and was also commented on in the JIRA. This contribution is my original work and I license the work to the project under the project's open source license. Author: Justin Pihony <justin.pihony@gmail.com> Author: Justin Pihony <justin.pihony@typesafe.com> Closes #12601 from JustinPihony/jdbc_reconciliation.
*	[SPARK-17210][SPARKR] sparkr.zip is not distributed to executors when ↵	Jeff Zhang	2016-09-23	1	-0/+15
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	running sparkr in RStudio ## What changes were proposed in this pull request? Spark will add sparkr.zip to archive only when it is yarn mode (SparkSubmit.scala). ``` if (args.isR && clusterManager == YARN) { val sparkRPackagePath = RUtils.localSparkRPackagePath if (sparkRPackagePath.isEmpty) { printErrorAndExit("SPARK_HOME does not exist for R application in YARN mode.") } val sparkRPackageFile = new File(sparkRPackagePath.get, SPARKR_PACKAGE_ARCHIVE) if (!sparkRPackageFile.exists()) { printErrorAndExit(s"$SPARKR_PACKAGE_ARCHIVE does not exist for R application in YARN mode.") } val sparkRPackageURI = Utils.resolveURI(sparkRPackageFile.getAbsolutePath).toString // Distribute the SparkR package. // Assigns a symbol link name "sparkr" to the shipped package. args.archives = mergeFileLists(args.archives, sparkRPackageURI + "#sparkr") // Distribute the R package archive containing all the built R packages. if (!RUtils.rPackages.isEmpty) { val rPackageFile = RPackageUtils.zipRLibraries(new File(RUtils.rPackages.get), R_PACKAGE_ARCHIVE) if (!rPackageFile.exists()) { printErrorAndExit("Failed to zip all the built R packages.") } val rPackageURI = Utils.resolveURI(rPackageFile.getAbsolutePath).toString // Assigns a symbol link name "rpkg" to the shipped package. args.archives = mergeFileLists(args.archives, rPackageURI + "#rpkg") } } ``` So it is necessary to pass spark.master from R process to JVM. Otherwise sparkr.zip won't be distributed to executor. Besides that I also pass spark.yarn.keytab/spark.yarn.principal to spark side, because JVM process need them to access secured cluster. ## How was this patch tested? Verify it manually in R Studio using the following code. ``` Sys.setenv(SPARK_HOME="/Users/jzhang/github/spark") .libPaths(c(file.path(Sys.getenv(), "R", "lib"), .libPaths())) library(SparkR) sparkR.session(master="yarn-client", sparkConfig = list(spark.executor.instances="1")) df <- as.DataFrame(mtcars) head(df) ``` … Author: Jeff Zhang <zjffdu@apache.org> Closes #14784 from zjffdu/SPARK-17210.
*	[SPARK-17421][DOCS] Documenting the current treatment of MAVEN_OPTS.	frreiss	2016-09-22	1	-6/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Modified the documentation to clarify that `build/mvn` and `pom.xml` always add Java 7-specific parameters to `MAVEN_OPTS`, and that developers can safely ignore warnings about `-XX:MaxPermSize` that may result from compiling or running tests with Java 8. ## How was this patch tested? Rebuilt HTML documentation, made sure that building-spark.html displays correctly in a browser. Author: frreiss <frreiss@us.ibm.com> Closes #15005 from frreiss/fred-17421a.
*	[SPARK-4563][CORE] Allow driver to advertise a different network address.	Marcelo Vanzin	2016-09-21	1	-1/+22
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The goal of this feature is to allow the Spark driver to run in an isolated environment, such as a docker container, and be able to use the host's port forwarding mechanism to be able to accept connections from the outside world. The change is restricted to the driver: there is no support for achieving the same thing on executors (or the YARN AM for that matter). Those still need full access to the outside world so that, for example, connections can be made to an executor's block manager. The core of the change is simple: add a new configuration that tells what's the address the driver should bind to, which can be different than the address it advertises to executors (spark.driver.host). Everything else is plumbing the new configuration where it's needed. To use the feature, the host starting the container needs to set up the driver's port range to fall into a range that is being forwarded; this required the block manager port to need a special configuration just for the driver, which falls back to the existing spark.blockManager.port when not set. This way, users can modify the driver settings without affecting the executors; it would theoretically be nice to also have different retry counts for driver and executors, but given that docker (at least) allows forwarding port ranges, we can probably live without that for now. Because of the nature of the feature it's kinda hard to add unit tests; I just added a simple one to make sure the configuration works. This was tested with a docker image running spark-shell with the following command: docker blah blah blah \ -p 38000-38100:38000-38100 \ [image] \ spark-shell \ --num-executors 3 \ --conf spark.shuffle.service.enabled=false \ --conf spark.dynamicAllocation.enabled=false \ --conf spark.driver.host=[host's address] \ --conf spark.driver.port=38000 \ --conf spark.driver.blockManager.port=38020 \ --conf spark.ui.port=38040 Running on YARN; verified the driver works, executors start up and listen on ephemeral ports (instead of using the driver's config), and that caching and shuffling (without the shuffle service) works. Clicked through the UI to make sure all pages (including executor thread dumps) worked. Also tested apps without docker, and ran unit tests. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #15120 from vanzin/SPARK-4563.
*	[SPARK-17219][ML] Add NaN value handling in Bucketizer	VinceShieh	2016-09-21	1	-1/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This PR fixes an issue when Bucketizer is called to handle a dataset containing NaN value. Sometimes, null value might also be useful to users, so in these cases, Bucketizer should reserve one extra bucket for NaN values, instead of throwing an illegal exception. Before: ``` Bucketizer.transform on NaN value threw an illegal exception. ``` After: ``` NaN values will be grouped in an extra bucket. ``` ## How was this patch tested? New test cases added in `BucketizerSuite`. Signed-off-by: VinceShieh <vincent.xieintel.com> Author: VinceShieh <vincent.xie@intel.com> Closes #14858 from VinceShieh/spark-17219.
*	[SPARK-17575][DOCS] Remove extra table tags in configuration document	sandy	2016-09-17	1	-2/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Remove extra table tags in configurations document. ## How was this patch tested? Run all test cases and generate document. Before with extra tag its look like below ![config-wrong1](https://cloud.githubusercontent.com/assets/8075390/18608239/c602bb60-7d01-11e6-875e-f38558997dd3.png) ![config-wrong2](https://cloud.githubusercontent.com/assets/8075390/18608241/cf3b672c-7d01-11e6-935e-1e73f9e6e578.png) After removing tags its looks like below ![config](https://cloud.githubusercontent.com/assets/8075390/18608245/e156eb8e-7d01-11e6-98aa-3be68d4d1961.png) ![config2](https://cloud.githubusercontent.com/assets/8075390/18608247/e84eecd4-7d01-11e6-9738-a3f7ff8fe834.png) Author: sandy <phalodi@gmail.com> Closes #15130 from phalodi/SPARK-17575.
*	Correct fetchsize property name in docs	Daniel Darabos	2016-09-17	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Replace `fetchSize` with `fetchsize` in the docs. ## How was this patch tested? I manually tested `fetchSize` and `fetchsize`. The latter has an effect. See also [`JdbcUtils.scala#L38`](https://github.com/apache/spark/blob/v2.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L38) for the definition of the property. Author: Daniel Darabos <darabos.daniel@gmail.com> Closes #14975 from darabos/patch-3.
*	[SPARK-17445][DOCS] Reference an ASF page as the main place to find ↵	Sean Owen	2016-09-14	4	-4/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	third-party packages ## What changes were proposed in this pull request? Point references to spark-packages.org to https://cwiki.apache.org/confluence/display/SPARK/Third+Party+Projects This will be accompanied by a parallel change to the spark-website repo, and additional changes to this wiki. ## How was this patch tested? Jenkins tests. Author: Sean Owen <sowen@cloudera.com> Closes #15075 from srowen/SPARK-17445.
*	[SPARK-17449][DOCUMENTATION] Relation between heartbeatInterval and…	Jagadeesan	2016-09-14	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? The relation between spark.network.timeout and spark.executor.heartbeatInterval should be mentioned in the document. … network timeout] Author: Jagadeesan <as2@us.ibm.com> Closes #15042 from jagadeesanas2/SPARK-17449.
*	Streaming doc correction.	Satendra Kumar	2016-09-09	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? (Please fill in changes proposed in this fix) Streaming doc correction. ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Author: Satendra Kumar <satendra@knoldus.com> Closes #14996 from satendrakumar06/patch-1.
*	[SPARK-15487][WEB UI] Spark Master UI to reverse proxy Application and ↵	Gurvinder Singh	2016-09-08	1	-0/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Workers UI ## What changes were proposed in this pull request? This pull request adds the functionality to enable accessing worker and application UI through master UI itself. Thus helps in accessing SparkUI when running spark cluster in closed networks e.g. Kubernetes. Cluster admin needs to expose only spark master UI and rest of the UIs can be in the private network, master UI will reverse proxy the connection request to corresponding resource. It adds the path for workers/application UIs as WorkerUI: <http/https>://master-publicIP:<port>/target/workerID/ ApplicationUI: <http/https>://master-publicIP:<port>/target/appID/ This makes it easy for users to easily protect the Spark master cluster access by putting some reverse proxy e.g. https://github.com/bitly/oauth2_proxy ## How was this patch tested? The functionality has been tested manually and there is a unit test too for testing access to worker UI with reverse proxy address. pwendell bomeng BryanCutler can you please review it, thanks. Author: Gurvinder Singh <gurvinder.singh@uninett.no> Closes #13950 from gurvindersingh/rproxy.
*	[SPARK-16619] Add shuffle service metrics entry in monitoring docs	Yangyang Liu	2016-09-01	1	-0/+1
\| \| \| \| \| \| \| \|	After change [SPARK-16405](https://github.com/apache/spark/pull/14080), we need to update docs by adding shuffle service metrics entry in currently supporting metrics list. Author: Yangyang Liu <yangyangliu@fb.com> Closes #14254 from lovexi/yangyang-monitoring-doc.
*	[SPARK-17331][CORE][MLLIB] Avoid allocating 0-length arrays	Sean Owen	2016-09-01	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Avoid allocating some 0-length arrays, esp. in UTF8String, and by using Array.empty in Scala over Array[T]() ## How was this patch tested? Jenkins Author: Sean Owen <sowen@cloudera.com> Closes #14895 from srowen/SPARK-17331.
*	fixed typos	Seigneurin, Alexis (CONT)	2016-09-01	1	-2/+2
\| \| \| \| \| \| \| \|	fixed 2 typos Author: Seigneurin, Alexis (CONT) <Alexis.Seigneurin@capitalone.com> Closes #14877 from aseigneurin/fix-typo-2.
*	[SPARK-17178][SPARKR][SPARKSUBMIT] Allow to set sparkr shell command through ↵	Jeff Zhang	2016-08-31	1	-1/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	--conf ## What changes were proposed in this pull request? Allow user to set sparkr shell command through --conf spark.r.shell.command ## How was this patch tested? Unit test is added and also verify it manually through ``` bin/sparkr --master yarn-client --conf spark.r.shell.command=/usr/local/bin/R ``` Author: Jeff Zhang <zjffdu@apache.org> Closes #14744 from zjffdu/SPARK-17178.
*	[SPARK-17243][WEB UI] Spark 2.0 History Server won't load with very large ↵	Alex Bozarth	2016-08-30	1	-3/+13
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	application history ## What changes were proposed in this pull request? With the new History Server the summary page loads the application list via the the REST API, this makes it very slow to impossible to load with large (10K+) application history. This pr fixes this by adding the `spark.history.ui.maxApplications` conf to limit the number of applications the History Server displays. This is accomplished using a new optional `limit` param for the `applications` api. (Note this only applies to what the summary page displays, all the Application UI's are still accessible if the user knows the App ID and goes to the Application UI directly.) I've also added a new test for the `limit` param in `HistoryServerSuite.scala` ## How was this patch tested? Manual testing and dev/run-tests Author: Alex Bozarth <ajbozart@us.ibm.com> Closes #14835 from ajbozarth/spark17243.
*	[SPARK-5682][CORE] Add encrypted shuffle in spark	Ferdinand Xu	2016-08-30	1	-0/+23
\| \| \| \| \| \| \| \| \|	This patch is using Apache Commons Crypto library to enable shuffle encryption support. Author: Ferdinand Xu <cheng.a.xu@intel.com> Author: kellyzly <kellyzly@126.com> Closes #8880 from winningsix/SPARK-10771.
*	[MINOR][DOCS] Fix minor typos in python example code	Dmitriy Sokolov	2016-08-30	6	-77/+77
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Fix minor typos python example code in streaming programming guide ## How was this patch tested? N/A Author: Dmitriy Sokolov <silentsokolov@gmail.com> Closes #14805 from silentsokolov/fix-typos.
*	fixed a typo	Seigneurin, Alexis (CONT)	2016-08-29	1	-1/+1
\| \| \| \| \| \| \| \|	idempotant -> idempotent Author: Seigneurin, Alexis (CONT) <Alexis.Seigneurin@capitalone.com> Closes #14833 from aseigneurin/fix-typo.
*	[SPARK-17001][ML] Enable standardScaler to standardize sparse vectors when ↵	Sean Owen	2016-08-27	2	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	withMean=True ## What changes were proposed in this pull request? Allow centering / mean scaling of sparse vectors in StandardScaler, if requested. This is for compatibility with `VectorAssembler` in common usages. ## How was this patch tested? Jenkins tests, including new caes to reflect the new behavior. Author: Sean Owen <sowen@cloudera.com> Closes #14663 from srowen/SPARK-17001.
*	[SPARK-16967] move mesos to module	Michael Gummelt	2016-08-26	1	-10/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Move Mesos code into a mvn module ## How was this patch tested? unit tests manually submitting a client mode and cluster mode job spark/mesos integration test suite Author: Michael Gummelt <mgummelt@mesosphere.io> Closes #14637 from mgummelt/mesos-module.
*	[SPARK-17242][DOCUMENT] Update links of external dstream projects	Shixiong Zhu	2016-08-25	1	-6/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Updated links of external dstream projects. ## How was this patch tested? Just document changes. Author: Shixiong Zhu <shixiong@databricks.com> Closes #14814 from zsxwing/dstream-link.
*	[SPARK-15083][WEB UI] History Server can OOM due to unlimited TaskUIData	Alex Bozarth	2016-08-24	1	-0/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Based on #12990 by tankkyo Since the History Server currently loads all application's data it can OOM if too many applications have a significant task count. `spark.ui.trimTasks` (default: false) can be set to true to trim tasks by `spark.ui.retainedTasks` (default: 10000) (This is a "quick fix" to help those running into the problem until a update of how the history server loads app data can be done) ## How was this patch tested? Manual testing and dev/run-tests ![spark-15083](https://cloud.githubusercontent.com/assets/13952758/17713694/fe82d246-63b0-11e6-9697-b87ea75ff4ef.png) Author: Alex Bozarth <ajbozart@us.ibm.com> Closes #14673 from ajbozarth/spark15083.
*	[MINOR][DOC] Fix wrong ml.feature.Normalizer document.	Yanbo Liang	2016-08-24	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? The ```ml.feature.Normalizer``` examples illustrate L1 norm rather than L2, we should correct corresponding document. ![image](https://cloud.githubusercontent.com/assets/1962026/17928637/85aec284-69b0-11e6-9b13-d465ee560581.png) ## How was this patch tested? Doc change, no test. Author: Yanbo Liang <ybliang8@gmail.com> Closes #14787 from yanboliang/normalizer.
*	[MINOR][DOC] Use standard quotes instead of "curly quote" marks from Mac in ↵	hyukjinkwon	2016-08-23	1	-19/+19
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	structured streaming programming guides ## What changes were proposed in this pull request? This PR fixes curly quotes (`“` and `”` ) to standard quotes (`"`). This will be a actual problem when users copy and paste the examples. This would not work. This seems only happening in `structured-streaming-programming-guide.md`. ## How was this patch tested? Manually built. This will change some examples to be correctly marked down as below: ![2016-08-23 3 24 13](https://cloud.githubusercontent.com/assets/6477701/17882878/2a38332e-694a-11e6-8e84-76bdb89151e0.png) to ![2016-08-23 3 26 06](https://cloud.githubusercontent.com/assets/6477701/17882888/376eaa28-694a-11e6-8b88-32ea83997037.png) Author: hyukjinkwon <gurwls223@gmail.com> Closes #14770 from HyukjinKwon/minor-quotes.
*	[SPARK-16320][DOC] Document G1 heap region's effect on spark 2.0 vs 1.6	Sean Owen	2016-08-22	1	-19/+17
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Collect GC discussion in one section, and documenting findings about G1 GC heap region size. ## How was this patch tested? Jekyll doc build Author: Sean Owen <sowen@cloudera.com> Closes #14732 from srowen/SPARK-16320.
*	[SPARK-17085][STREAMING][DOCUMENTATION AND ACTUAL CODE DIFFERS - UNSUPPORTED ↵	Jagadeesan	2016-08-22	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \|	OPERATIONS] Changes in Spark Stuctured Streaming doc in this link https://spark.apache.org/docs/2.0.0/structured-streaming-programming-guide.html#unsupported-operations Author: Jagadeesan <as2@us.ibm.com> Closes #14715 from jagadeesanas2/SPARK-17085.
*	[SPARK-16968] Document additional options in jdbc Writer	GraceH	2016-08-22	1	-0/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? (Please fill in changes proposed in this fix) This is the document for previous JDBC Writer options. ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) Unit test has been added in previous PR. (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Author: GraceH <jhuang1@paypal.com> Closes #14683 from GraceH/jdbc_options.