Commit message | Author | Age | Files | Lines
* Revert "Preparing development version 1.3.1-SNAPSHOT"Patrick Wendell2015-03-0428-28/+28
| | | | This reverts commit 0ecab40e4391d0674ac86595ec09af3b9a4ac50d.
* Preparing development version 1.3.1-SNAPSHOTPatrick Wendell2015-03-0528-28/+28
|
* Preparing Spark release v1.3.0-rc3Patrick Wendell2015-03-0528-28/+28
|
* Updating CHANGES filePatrick Wendell2015-03-041-0/+50
|
* Revert "Preparing Spark release v1.3.0-rc2"Patrick Wendell2015-03-0428-28/+28
| | | | This reverts commit 3af26870e5163438868c4eb2df88380a533bb232.
* Revert "Preparing development version 1.3.1-SNAPSHOT"Patrick Wendell2015-03-0428-28/+28
| | | | This reverts commit 05d5a29eb3193aeb57d177bafe39eb75edce72a1.
* SPARK-5143 [BUILD] [WIP] spark-network-yarn 2.11 depends on ↵Sean Owen2015-03-042-2/+13
| spark-network-shuffle 2.10
|
| Update `<scala.binary.version>` prop in POM when switching between Scala 2.10/2.11
|
| ScrapCodes for review. This `sed` command is supposed to just replace the first occurrence, but it replaces them all. Are you more of a `sed` wizard than I? It may be a GNU/BSD thing that is throwing me off. Really, just the first instance should be replaced, hence the `[WIP]`.
|
| NB on OS X the original `sed` command here will create files like `pom.xml-e` through the source tree though it otherwise works. It's like `-e` is also the arg to `-i`. I couldn't get rid of that even with `-i""`. No biggie.
|
| Author: Sean Owen <sowen@cloudera.com>
|
| Closes #4876 from srowen/SPARK-5143 and squashes the following commits:
|
| b060c44 [Sean Owen] Oops, fixed reversed version numbers!
| e875d4a [Sean Owen] Add note about non-GNU sed; fix new pom.xml update to work as intended on GNU sed
| 703e1eb [Sean Owen] Update scala.binary.version prop in POM when switching between Scala 2.10/2.11
|
| (cherry picked from commit 7ac072f74b5a9a02339cede82ad5ffec5beed715)
| Signed-off-by: Patrick Wendell <patrick@databricks.com>
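The "replace only the first occurrence" behaviour that the SPARK-5143 message asks of `sed` can be illustrated outside the shell. The following is a minimal Python sketch, not the script from the patch; the function name and the sample POM text are hypothetical and exist only for illustration.

```python
import re


def replace_first_scala_binary_version(pom_text: str, new_version: str) -> str:
    """Rewrite only the first <scala.binary.version> element found in pom_text."""
    pattern = r"<scala\.binary\.version>[^<]*</scala\.binary\.version>"
    replacement = f"<scala.binary.version>{new_version}</scala.binary.version>"
    # count=1 limits the substitution to the first match; later matches stay untouched.
    return re.sub(pattern, replacement, pom_text, count=1)


# Hypothetical POM fragment with two occurrences of the property.
sample_pom = (
    "<properties><scala.binary.version>2.10</scala.binary.version></properties>\n"
    "<profile><scala.binary.version>2.10</scala.binary.version></profile>\n"
)
updated = replace_first_scala_binary_version(sample_pom, "2.11")
print(updated)
```

Limiting the substitution count makes the intent explicit and avoids depending on GNU versus BSD `sed` differences.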
* [SPARK-6149] [SQL] [Build] Excludes Guava 15 referenced by ↵Cheng Lian2015-03-041-0/+8
| jackson-module-scala_2.10
|
| This PR excludes Guava 15.0 from the SBT build, to make Spark SQL CLI (`bin/spark-sql`) work when compiled against Hive 0.12.0.
|
| Author: Cheng Lian <lian@databricks.com>
|
| Closes #4890 from liancheng/exclude-guava-15 and squashes the following commits:
|
| 91ae9fa [Cheng Lian] Moves Guava 15 exclusion from SBT build to POM
| 282bd2a [Cheng Lian] Excludes Guava 15 referenced by jackson-module-scala_2.10
|
| (cherry picked from commit 1aa90e39e33caa497971544ee7643fb3ff048c12)
| Signed-off-by: Patrick Wendell <patrick@databricks.com>
* [SPARK-6144] [core] Fix addFile when source files are on "hdfs:"Marcelo Vanzin2015-03-042-50/+63
| | | | | | | | | | | | | | | | The code failed in two modes: it complained when it tried to re-create a directory that already existed, and it was placing some files in the wrong parent directory. The patch fixes both issues. Author: Marcelo Vanzin <vanzin@cloudera.com> Author: trystanleftwich <trystan@atscale.com> Closes #4894 from vanzin/SPARK-6144 and squashes the following commits: 100b3a1 [Marcelo Vanzin] Style fix. 58266aa [Marcelo Vanzin] Fix fetchHcfs file for directories. 91733b7 [trystanleftwich] [SPARK-6144]When in cluster mode using ADD JAR with a hdfs:// sourced jar will fail (cherry picked from commit 3a35a0dfe940843c3f3a5f51acfe24def488faa9) Signed-off-by: Andrew Or <andrew@databricks.com>
* [SPARK-6134][SQL] Fix wrong datatype for casting FloatType and default ↵Liang-Chi Hsieh2015-03-041-2/+2
| | | | | | | | | | | | | | | | | LongType value in defaultPrimitive In `CodeGenerator`, the casting on `FloatType` should use `FloatType` instead of `IntegerType`. Besides, `defaultPrimitive` for `LongType` should be `-1L` instead of `1L`. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #4870 from viirya/codegen_type and squashes the following commits: 76311dd [Liang-Chi Hsieh] Fix wrong datatype for casting on FloatType. Fix the wrong value for LongType in defaultPrimitive. (cherry picked from commit aef8a84e42351419a67d56abaf1ee75a05eb11ea) Signed-off-by: Cheng Lian <lian@databricks.com>
* [SPARK-6136] [SQL] Removed JDBC integration tests which depends on docker-clientCheng Lian2015-03-044-432/+0
| Integration test suites in the JDBC data source (`MySQLIntegration` and `PostgresIntegration`) depend on docker-client 2.7.5, which transitively depends on Guava 17.0. Unfortunately, Guava 17.0 is causing test runtime binary compatibility issues when Spark is compiled against Hive 0.12.0 or Hadoop 2.4. Considering `MySQLIntegration` and `PostgresIntegration` are ignored right now, I'd suggest moving them from the Spark project to the [Spark integration tests][1] project. This PR removes both the JDBC data source integration tests and the docker-client test dependency.
|
| [1]: https://github.com/databricks/spark-integration-tests
|
| Author: Cheng Lian <lian@databricks.com>
|
| Closes #4872 from liancheng/remove-docker-client and squashes the following commits:
|
| 1f4169e [Cheng Lian] Removes DockerHacks
| 159b24a [Cheng Lian] Removed JDBC integration tests which depends on docker-client
|
| (cherry picked from commit 76b472f12a57bb5bec7b3791660eb47e9177da7f)
| Signed-off-by: Cheng Lian <lian@databricks.com>
* [SPARK-6141][MLlib] Upgrade Breeze from 0.10 to 0.11 to fix convergence bugXiangrui Meng2015-03-032-1/+5
| LBFGS and OWLQN in Breeze 0.10 have a convergence check bug. This is fixed in 0.11; see the description in the Breeze project for details: https://github.com/scalanlp/breeze/pull/373#issuecomment-76879760
|
| Author: Xiangrui Meng <meng@databricks.com>
| Author: DB Tsai <dbtsai@alpinenow.com>
| Author: DB Tsai <dbtsai@dbtsai.com>
|
| Closes #4879 from dbtsai/breeze and squashes the following commits:
|
| d848f65 [DB Tsai] Merge pull request #1 from mengxr/AlpineNow-breeze
| c2ca6ac [Xiangrui Meng] upgrade to breeze-0.11.1
| 35c2f26 [Xiangrui Meng] fix LRSuite
| 397a208 [DB Tsai] upgrade breeze
|
| (cherry picked from commit 76e20a0a03cf2c02db35e00271924efb070eaaa5)
| Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-5949] HighlyCompressedMapStatus needs more classes registered w/ kryoImran Rashid2015-03-032-5/+33
| | | | | | | | | | | | | | | | | | | https://issues.apache.org/jira/browse/SPARK-5949 Author: Imran Rashid <irashid@cloudera.com> Closes #4877 from squito/SPARK-5949_register_roaring_bitmap and squashes the following commits: 7e13316 [Imran Rashid] style style style 5f6bb6d [Imran Rashid] more style 709bfe0 [Imran Rashid] style a5cb744 [Imran Rashid] update tests to cover both types of RoaringBitmapContainers 09610c6 [Imran Rashid] formatting f9a0b7c [Imran Rashid] put primitive array registrations together 97beaf8 [Imran Rashid] SPARK-5949 HighlyCompressedMapStatus needs more classes registered w/ kryo (cherry picked from commit 1f1fccc5ceb0c5b7656a0594be3a67bd3b432e85) Signed-off-by: Reynold Xin <rxin@databricks.com>
* SPARK-1911 [DOCS] Warn users if their assembly jars are not built with Java 6Sean Owen2015-03-031-0/+4
| | | | | | | | | | | | | | | Add warning about building with Java 7+ and running the JAR on early Java 6. CC andrewor14 Author: Sean Owen <sowen@cloudera.com> Closes #4874 from srowen/SPARK-1911 and squashes the following commits: 79fa2f6 [Sean Owen] Add warning about building with Java 7+ and running the JAR on early Java 6. (cherry picked from commit e750a6bfddf1d7bf7d3e99a424ec2b83a18b40d9) Signed-off-by: Andrew Or <andrew@databricks.com>
* Revert "[SPARK-5423][Core] Cleanup resources in DiskMapIterator.finalize to ↵Andrew Or2015-03-031-43/+9
| | | | | | ensure deleting the temp file" This reverts commit 25fae8e7e6c93b7817771342d370b73b40dcf92e.
* Preparing development version 1.3.1-SNAPSHOTPatrick Wendell2015-03-0328-28/+28
|
* Preparing Spark release v1.3.0-rc2Patrick Wendell2015-03-0328-28/+28
|
* Revert "Preparing Spark release v1.3.0-rc1"Patrick Wendell2015-03-0328-28/+28
| | | | This reverts commit f97b0d4a6b26504916816d7aefcf3132cd1da6c2.
* Revert "Preparing development version 1.3.1-SNAPSHOT"Patrick Wendell2015-03-0328-28/+28
| | | | This reverts commit 2ab0ba04f66683be25cbe0e83cecf2bdcb0f13ba.
* Adding CHANGES.txt for Spark 1.3Patrick Wendell2015-03-032-2/+6522
|
* BUILD: Minor tweaks to internal build scriptsPatrick Wendell2015-03-031-5/+19
| | | | | | | | This adds two features: 1. The ability to publish with a different maven version than that specified in the release source. 2. Forking of different Zinc instances during the parallel dist creation (to help with some stability issues).
* HOTFIX: Bump HBase version in MapR profiles.Patrick Wendell2015-03-031-2/+2
| | | | After #2982 (SPARK-4048) we rely on the newer HBase packaging format.
* [SPARK-5537][MLlib][Docs] Add user guide for multinomial logistic regressionDB Tsai2015-03-021-0/+10
| | | | | | | | | | | | | Adding more description on top of #4861. Author: DB Tsai <dbtsai@alpinenow.com> Closes #4866 from dbtsai/doc and squashes the following commits: 37e9d07 [DB Tsai] doc (cherry picked from commit b196056190c569505cc32669d1aec30ed9d70665) Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-6120] [mllib] Warnings about memory in tree, ensemble model saveJoseph K. Bradley2015-03-022-4/+50
| | | | | | | | | | | | | | | | | | Issue: When the Python DecisionTree example in the programming guide is run, it runs out of Java Heap Space when using the default memory settings for the spark shell. This prints a warning. CC: mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #4864 from jkbradley/dt-save-heap and squashes the following commits: 02e8daf [Joseph K. Bradley] fixed based on code review 7ecb1ed [Joseph K. Bradley] Added warnings about memory when calling tree and ensemble model save with too small a Java heap size (cherry picked from commit c2fe3a6ff1a48a9da54d2c2c4d80ecd06cdeebca) Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-6097][MLLIB] Support tree model save/load in PySpark/MLlibXiangrui Meng2015-03-026-33/+109
| Similar to `MatrixFactorizationModel`, we only need wrappers to support save/load for tree models in Python.
|
| jkbradley
|
| Author: Xiangrui Meng <meng@databricks.com>
|
| Closes #4854 from mengxr/SPARK-6097 and squashes the following commits:
|
| 4586a4d [Xiangrui Meng] fix more typos
| 8ebcac2 [Xiangrui Meng] fix python style
| 91172d8 [Xiangrui Meng] fix typos
| 201b3b9 [Xiangrui Meng] update user guide
| b5158e2 [Xiangrui Meng] support tree model save/load in PySpark/MLlib
|
| (cherry picked from commit 7e53a79c30511dbd0e5d9878a4b8b0f5bc94e68b)
| Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-5310][SQL] Fixes to Docs and Datasources APIReynold Xin2015-03-0222-136/+115
| | | | | | | | | | | | | | | | | | | - Various Fixes to docs - Make data source traits actually interfaces Based on #4862 but with fixed conflicts. Author: Reynold Xin <rxin@databricks.com> Author: Michael Armbrust <michael@databricks.com> Closes #4868 from marmbrus/pr/4862 and squashes the following commits: fe091ea [Michael Armbrust] Merge remote-tracking branch 'origin/master' into pr/4862 0208497 [Reynold Xin] Test fixes. 34e0a28 [Reynold Xin] [SPARK-5310][SQL] Various fixes to Spark SQL docs. (cherry picked from commit 54d19689ff8d786acde5b8ada6741854ffadadea) Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SPARK-5950][SQL]Insert array into a metastore table saved as parquet should ↵Yin Huai2015-03-0217-36/+330
| work when using datasource api
|
| This PR contains the following changes:
| 1. Add a new method, `DataType.equalsIgnoreCompatibleNullability`, which is the middle ground between DataType's equality check and `DataType.equalsIgnoreNullability`. For two data types `from` and `to`, it does `equalsIgnoreNullability` and also checks whether the nullability of `from` is compatible with that of `to`. For example, the nullability of `ArrayType(IntegerType, containsNull = false)` is compatible with that of `ArrayType(IntegerType, containsNull = true)` (for an array without null values, we can always say it may contain null values). However, the nullability of `ArrayType(IntegerType, containsNull = true)` is incompatible with that of `ArrayType(IntegerType, containsNull = false)` (for an array that may have null values, we cannot say it does not have null values).
| 2. For the `resolved` field of `InsertIntoTable`, use `equalsIgnoreCompatibleNullability` to replace the equality check of the data types.
| 3. For our data source write path, when appending data, we always use the schema of the existing table to write the data. This is important for parquet, since nullability directly impacts the way values are encoded and decoded. If we do not do this, we may see corrupted values when reading values from a set of parquet files generated with different nullability settings.
| 4. When generating a new parquet table, we always set nullable/containsNull/valueContainsNull to true. So we will not face situations where we cannot append data because containsNull/valueContainsNull in an Array/Map column of the existing table has already been set to `false`. This change makes the whole data pipeline more robust.
| 5. Update the equality check of JSON relation. Since JSON does not really care about nullability, `equalsIgnoreNullability` seems a better choice for comparing schemata from two JSON tables.
|
| JIRA: https://issues.apache.org/jira/browse/SPARK-5950
|
| Thanks viirya for the initial work in #4729.
|
| cc marmbrus liancheng
|
| Author: Yin Huai <yhuai@databricks.com>
|
| Closes #4826 from yhuai/insertNullabilityCheck and squashes the following commits:
|
| 3b61a04 [Yin Huai] Revert change on equals.
| 80e487e [Yin Huai] asNullable in UDT.
| 587d88b [Yin Huai] Make methods private.
| 0cb7ea2 [Yin Huai] marmbrus's comments.
| 3cec464 [Yin Huai] Cheng's comments.
| 486ed08 [Yin Huai] Merge remote-tracking branch 'upstream/master' into insertNullabilityCheck
| d3747d1 [Yin Huai] Remove unnecessary change.
| 8360817 [Yin Huai] Merge remote-tracking branch 'upstream/master' into insertNullabilityCheck
| 8a3f237 [Yin Huai] Use equalsIgnoreNullability instead of equality check.
| 0eb5578 [Yin Huai] Fix tests.
| f6ed813 [Yin Huai] Update old parquet path.
| e4f397c [Yin Huai] Unit tests.
| b2c06f8 [Yin Huai] Ignore nullability in JSON relation's equality check.
| 8bd008b [Yin Huai] nullable, containsNull, and valueContainsNull will be always true for parquet data.
| bf50d73 [Yin Huai] When appending data, we use the schema of the existing table instead of the schema of the new data.
| 0a703e7 [Yin Huai] Test failed again since we cannot read correct content.
| 9a26611 [Yin Huai] Make InsertIntoTable happy.
| 8f19fe5 [Yin Huai] equalsIgnoreCompatibleNullability
| 4ec17fd [Yin Huai] Failed test.
|
| (cherry picked from commit 12599942e69e4d73040f3a8611661a0862514ffc)
| Signed-off-by: Michael Armbrust <michael@databricks.com>
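The one-directional compatibility rule described in the SPARK-5950 message can be sketched in a few lines. This is a simplified Python illustration under stated assumptions, not Spark's implementation: the `IntegerType` and `ArrayType` classes and the function name below are hypothetical stand-ins, and only array types are handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IntegerType:
    """Hypothetical stand-in for a primitive data type."""


@dataclass(frozen=True)
class ArrayType:
    """Hypothetical stand-in for an array type with a nullability flag."""
    element_type: object
    contains_null: bool


def equals_ignore_compatible_nullability(from_type, to_type) -> bool:
    """True if the types match structurally and from_type's nullability fits to_type."""
    if isinstance(from_type, ArrayType) and isinstance(to_type, ArrayType):
        # Data without nulls may go where nulls are allowed, but not the reverse.
        nullability_ok = to_type.contains_null or not from_type.contains_null
        return nullability_ok and equals_ignore_compatible_nullability(
            from_type.element_type, to_type.element_type
        )
    return from_type == to_type


no_nulls = ArrayType(IntegerType(), contains_null=False)
may_have_nulls = ArrayType(IntegerType(), contains_null=True)
print(equals_ignore_compatible_nullability(no_nulls, may_have_nulls))  # True
print(equals_ignore_compatible_nullability(may_have_nulls, no_nulls))  # False
```

The check is deliberately asymmetric: it answers whether data of type `from` can be written into a column of type `to`, which is the question an insert has to answer.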
* [SPARK-6127][Streaming][Docs] Add Kafka to Python api docsTathagata Das2015-03-021-0/+7
| | | | | | | | | | | | | davies Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #4860 from tdas/SPARK-6127 and squashes the following commits: 82de92a [Tathagata Das] Add Kafka to Python api docs (cherry picked from commit 9eb22ece115c69899d100cecb8a5e20b3a268649) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
* [SPARK-5537] Add user guide for multinomial logistic regressionXiangrui Meng2015-03-021-61/+217
| | | | | | | | | | | | | | | | | | This is based on #4801 from dbtsai. The linear method guide is re-organized a little bit for this change. Closes #4801 Author: Xiangrui Meng <meng@databricks.com> Author: DB Tsai <dbtsai@alpinenow.com> Closes #4861 from mengxr/SPARK-5537 and squashes the following commits: 47af0ac [Xiangrui Meng] update user guide for multinomial logistic regression cdc2e15 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into AlpineNow-mlor-doc 096d0ca [DB Tsai] first commit (cherry picked from commit 9d6c5aeebd3c7f8ff6defe3bccd8ff12ed918293) Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-6121][SQL][MLLIB] simpleString for UDTXiangrui Meng2015-03-022-1/+4
| | | | | | | | | | | | | | | `df.dtypes` shows `null` for UDTs. This PR uses `udt` by default and `VectorUDT` overwrites it with `vector`. jkbradley davies Author: Xiangrui Meng <meng@databricks.com> Closes #4858 from mengxr/SPARK-6121 and squashes the following commits: 34f0a77 [Xiangrui Meng] simpleString for UDT (cherry picked from commit 2db6a853a53b4c25e35983bc489510abb8a73e1d) Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-6048] SparkConf should not translate deprecated configs on setAndrew Or2015-03-025-22/+25
| | | | | | | | | | | | | | | | | | | | There are multiple issues with translating on set outlined in the JIRA. This PR reverts the translation logic added to `SparkConf`. In the future, after the 1.3.0 release we will figure out a way to reorganize the internal structure more elegantly. For now, let's preserve the existing semantics of `SparkConf` since it's a public interface. Unfortunately this means duplicating some code for now, but this is all internal and we can always clean it up later. Author: Andrew Or <andrew@databricks.com> Closes #4799 from andrewor14/conf-set-translate and squashes the following commits: 11c525b [Andrew Or] Move warning to driver 10e77b5 [Andrew Or] Add documentation for deprecation precedence a369cb1 [Andrew Or] Merge branch 'master' of github.com:apache/spark into conf-set-translate c26a9e3 [Andrew Or] Revert all translate logic in SparkConf fef6c9c [Andrew Or] Restore deprecation logic for spark.executor.userClassPathFirst 94b4dfa [Andrew Or] Translate on get, not set (cherry picked from commit 258d154c9f1afdd52dce19f03d81683ee34effac) Signed-off-by: Patrick Wendell <patrick@databricks.com>
* [SPARK-6066] Make event log format easier to parseAndrew Or2015-03-0214-189/+212
| Some users have reported difficulty in parsing the new event log format. Since we embed the metadata in the beginning of the file, when we compress the event log we need to skip the metadata because we need that information to parse the log later. This means we'll end up with a partially compressed file if event logging compression is turned on. The old format looks like:
| ```
| sparkVersion = 1.3.0
| compressionCodec = org.apache.spark.io.LZFCompressionCodec
| === LOG_HEADER_END ===
| // actual events, could be compressed bytes
| ```
| The new format in this patch puts the compression codec in the log file name instead. It also removes the metadata header altogether along with the Spark version, which was not needed. The new file name looks something like:
| ```
| app_without_compression
| app_123.lzf
| app_456.snappy
| ```
| I tested this with and without compression, using different compression codecs and event logging directories. I verified that both the `Master` and the `HistoryServer` can render both compressed and uncompressed logs as before.
|
| Author: Andrew Or <andrew@databricks.com>
|
| Closes #4821 from andrewor14/event-log-format and squashes the following commits:
|
| 8511141 [Andrew Or] Fix test
| 654883d [Andrew Or] Add back metadata with Spark version
| 7f537cd [Andrew Or] Address review feedback
| 7d6aa61 [Andrew Or] Make codec an extension
| 59abee9 [Andrew Or] Merge branch 'master' of github.com:apache/spark into event-log-format
| 27c9a6c [Andrew Or] Address review feedback
| 519e51a [Andrew Or] Address review feedback
| ef69276 [Andrew Or] Merge branch 'master' of github.com:apache/spark into event-log-format
| 88a091d [Andrew Or] Add tests for new format and file name
| f32d8d2 [Andrew Or] Fix tests
| 8db5a06 [Andrew Or] Embed metadata in the event log file name instead
|
| (cherry picked from commit 6776cb33ea691f7843b956b3e80979282967e826)
| Signed-off-by: Patrick Wendell <patrick@databricks.com>
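With the codec carried in the file name, as the SPARK-6066 message describes, a reader can choose a decompressor before opening the file. The following Python sketch assumes the codec's short name is simply the file extension, as in the example names above; the function name is hypothetical and this is not the parsing code from the patch.

```python
import os


def codec_from_log_name(file_name: str):
    """Return the codec short name taken from the extension, or None if there is none."""
    _, extension = os.path.splitext(file_name)
    # splitext keeps the leading dot, so strip it; an empty extension means uncompressed.
    return extension[1:] if extension else None


for name in ["app_without_compression", "app_123.lzf", "app_456.snappy"]:
    print(name, "->", codec_from_log_name(name))
```

Keeping the codec outside the file contents means the whole file can be compressed uniformly, which avoids the partially compressed layout described for the old format.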
* [SPARK-6082] [SQL] Provides better error message for malformed rows when ↵Cheng Lian2015-03-021-0/+11
| caching tables
|
| Constructs like Hive `TRANSFORM` may generate malformed rows (via badly authored external scripts for example). I'm a bit hesitant to have this feature, since it introduces per-tuple cost when caching tables. However, considering caching tables is usually a one-time cost, this is probably worth having.
|
| Author: Cheng Lian <lian@databricks.com>
|
| Closes #4842 from liancheng/spark-6082 and squashes the following commits:
|
| b05dbff [Cheng Lian] Provides better error message for malformed rows when caching tables
|
| (cherry picked from commit 1a49496b4a9df40c74739fc0fb8a21c88a477075)
| Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SPARK-6114][SQL] Avoid metastore conversions before plan is resolvedMichael Armbrust2015-03-022-0/+14
| | | | | | | | | | | Author: Michael Armbrust <michael@databricks.com> Closes #4855 from marmbrus/explodeBug and squashes the following commits: a712249 [Michael Armbrust] [SPARK-6114][SQL] Avoid metastore conversions before plan is resolved (cherry picked from commit 8223ce6a81e4cc9fdf816892365fcdff4006c35e) Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SPARK-6050] [yarn] Relax matching of vcore count in received containers.Marcelo Vanzin2015-03-021-2/+8
| Some YARN configurations return a vcore count for allocated containers that does not match the requested resource. That means Spark would always ignore those containers. So relax the matching of the vcore count to allow the Spark jobs to run.
|
| Author: Marcelo Vanzin <vanzin@cloudera.com>
|
| Closes #4818 from vanzin/SPARK-6050 and squashes the following commits:
|
| 991c803 [Marcelo Vanzin] Remove config option, standardize on legacy behavior (no vcore matching).
| 8c9c346 [Marcelo Vanzin] Restrict lax matching to vcores only.
| 3359692 [Marcelo Vanzin] [SPARK-6050] [yarn] Add config option to do lax resource matching.
|
| (cherry picked from commit 6b348d90f475440c285a4b636134ffa9351580b9)
| Signed-off-by: Thomas Graves <tgraves@apache.org>
* [SPARK-6040][SQL] Fix the percent bug in tablesampleq002515982015-03-022-1/+11
| A HiveQL expression like `select count(1) from src tablesample(1 percent);` means sampling 1% of the rows. But it means 100% in the current version of Spark.
|
| Author: q00251598 <qiyadong@huawei.com>
|
| Closes #4789 from watermen/SPARK-6040 and squashes the following commits:
|
| 2453ebe [q00251598] check and adjust the fraction.
|
| (cherry picked from commit 582e5a24c55e8c876733537c9910001affc8b29b)
| Signed-off-by: Michael Armbrust <michael@databricks.com>
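The percent-versus-fraction confusion behind SPARK-6040 is plain arithmetic: a `tablesample(N percent)` clause should yield a sampling fraction of N / 100. The Python sketch below illustrates that conversion with a range check; the function name and the exact validation are assumptions for illustration, not the code from the patch.

```python
def tablesample_fraction(percent: float) -> float:
    """Convert a TABLESAMPLE percentage into a sampling fraction in [0.0, 1.0]."""
    if not 0.0 <= percent <= 100.0:
        raise ValueError(f"sampling percentage must be within [0, 100], got {percent}")
    # 1 percent must become 0.01, not 1.0 (which would sample every row).
    return percent / 100.0


print(tablesample_fraction(1))    # 0.01
print(tablesample_fraction(100))  # 1.0
```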
* [Minor] Fix doc typo for describing primitiveTerm effectiveness conditionLiang-Chi Hsieh2015-03-021-1/+1
| | | | | | | | | | | | | It should be `true` instead of `false`? Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #4762 from viirya/doc_fix and squashes the following commits: 2e37482 [Liang-Chi Hsieh] Fix doc. (cherry picked from commit 3f9def81170c24f24f4a6b7ca7905de4f75e11e0) Signed-off-by: Michael Armbrust <michael@databricks.com>
* SPARK-5390 [DOCS] Encourage users to post on Stack Overflow in Community DocsSean Owen2015-03-021-8/+2
| | | | | | | | | | | | | | | | Point "Community" to main Spark Community page; mention SO tag apache-spark. Separately, the Apache site can be updated to mention, under Mailing Lists: "StackOverflow also has an apache-spark tag for Spark Q&A." or similar. Author: Sean Owen <sowen@cloudera.com> Closes #4843 from srowen/SPARK-5390 and squashes the following commits: 3508ac6 [Sean Owen] Point "Community" to main Spark Community page; mention SO tag apache-spark (cherry picked from commit 0b472f60cdf4984ab5e28e6dbf12615e8997a448) Signed-off-by: Sean Owen <sowen@cloudera.com>
* [DOCS] Refactored Dataframe join comment to use correct parameter orderingPaul Power2015-03-021-2/+2
| The API signature for join requires the JoinType to be the third parameter. The code examples provided for join show JoinType being provided as the second parameter, resulting in errors (i.e. `df1.join(df2, "outer", $"df1Key" === $"df2Key")`). The correct sample code is `df1.join(df2, $"df1Key" === $"df2Key", "outer")`.
|
| Author: Paul Power <paul.power@peerside.com>
|
| Closes #4847 from peerside/master and squashes the following commits:
|
| ebc1efa [Paul Power] Merge pull request #1 from peerside/peerside-patch-1
| e353340 [Paul Power] Updated comments use correct sample code for Dataframe joins
|
| (cherry picked from commit d9a8bae77826a0cc77df29d85883e914d0f0b4f3)
| Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SPARK-6080] [PySpark] correct LogisticRegressionWithLBFGS regType parameter ↵Yanbo Liang2015-03-021-1/+1
| for pyspark
|
| Currently LogisticRegressionWithLBFGS in python/pyspark/mllib/classification.py invokes callMLlibFunc with a wrong "regType" parameter. It was assigned "str(regType)", which translates None (Python) to "None" (Java/Scala). The right way is to translate None (Python) to null (Java/Scala), just as we did in LogisticRegressionWithSGD.
|
| Author: Yanbo Liang <ybliang8@gmail.com>
|
| Closes #4831 from yanboliang/pyspark_classification and squashes the following commits:
|
| 12db65a [Yanbo Liang] correct LogisticRegressionWithLBFGS regType parameter for pyspark
|
| (cherry picked from commit af2effdd7b54316af0c02e781911acfb148b962b)
| Signed-off-by: Xiangrui Meng <meng@databricks.com>
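The pitfall described in the SPARK-6080 message is easy to reproduce in plain Python: `str(None)` is the four-character string `"None"`, not a missing value. The sketch below contrasts the two behaviours; both helper names are hypothetical and neither is the actual PySpark code.

```python
def reg_type_naive(reg_type):
    """Mimics the reported bug: None becomes the literal string 'None'."""
    return str(reg_type)


def reg_type_preserving_none(reg_type):
    """Keeps None as None so that a bridge layer can map it to a null value."""
    return None if reg_type is None else str(reg_type)


print(repr(reg_type_naive(None)))            # 'None'
print(repr(reg_type_preserving_none(None)))  # None
print(repr(reg_type_preserving_none("l2")))  # 'l2'
```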
* [SPARK-5741][SQL] Support the path contains comma in HiveContextq002515982015-03-0218-1/+2511
| Running `select * from nzhang_part where hr = 'file,';` throws `java.lang.IllegalArgumentException: Can not create a Path from an empty string`, because the HDFS path contains a comma and FileInputFormat.setInputPaths splits the path by comma.
|
| ### SQL
| ```
| set hive.merge.mapfiles=true;
| set hive.merge.mapredfiles=true;
| set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
| set hive.exec.dynamic.partition=true;
| set hive.exec.dynamic.partition.mode=nonstrict;
| create table nzhang_part like srcpart;
| insert overwrite table nzhang_part partition (ds='2010-08-15', hr) select key, value, hr from srcpart where ds='2008-04-08';
| insert overwrite table nzhang_part partition (ds='2010-08-15', hr=11) select key, value from srcpart where ds='2008-04-08';
| insert overwrite table nzhang_part partition (ds='2010-08-15', hr) select * from ( select key, value, hr from srcpart where ds='2008-04-08' union all select '1' as key, '1' as value, 'file,' as hr from src limit 1) s;
| select * from nzhang_part where hr = 'file,';
| ```
|
| ### Error Log
| ```
| 15/02/10 14:33:16 ERROR SparkSQLDriver: Failed in [select * from nzhang_part where hr = 'file,']
| java.lang.IllegalArgumentException: Can not create a Path from an empty string
|   at org.apache.hadoop.fs.Path.checkPathArg(Path.java:127)
|   at org.apache.hadoop.fs.Path.<init>(Path.java:135)
|   at org.apache.hadoop.util.StringUtils.stringToPath(StringUtils.java:241)
|   at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:400)
|   at org.apache.spark.sql.hive.HadoopTableReader$.initializeLocalJobConfFunc(TableReader.scala:251)
|   at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$11.apply(TableReader.scala:229)
|   at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$11.apply(TableReader.scala:229)
|   at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:172)
|   at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:172)
|   at scala.Option.map(Option.scala:145)
|   at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:172)
|   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:196)
| ```
|
| Author: q00251598 <qiyadong@huawei.com>
|
| Closes #4532 from watermen/SPARK-5741 and squashes the following commits:
|
| 9758ab1 [q00251598] fix bug
| 1db1a1c [q00251598] use setInputPaths(Job job, Path... inputPaths)
| b788a72 [q00251598] change FileInputFormat.setInputPaths to jobConf.set and add test suite
|
| (cherry picked from commit 9ce12aaf283a2793e719bdc956dd858922636e8d)
| Signed-off-by: Michael Armbrust <michael@databricks.com>
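The failure mode in the SPARK-5741 message comes from treating one path as a comma-separated list of paths. The Python sketch below reproduces only the string-splitting effect; it does not call Hadoop, and the directory name and helper are hypothetical.

```python
def split_paths_on_comma(comma_separated: str):
    """Naive splitting, as done when a single string is treated as a list of paths."""
    return comma_separated.split(",")


# Hypothetical partition directory whose last component ends with a comma.
partition_dir = "/warehouse/nzhang_part/ds=2010-08-15/hr=file,"

pieces = split_paths_on_comma(partition_dir)
print(pieces)  # ['/warehouse/nzhang_part/ds=2010-08-15/hr=file', '']

# Passing the path as a single list element keeps it intact.
paths_as_list = [partition_dir]
print(paths_as_list)
```

The trailing empty string is the piece that cannot be turned into a path, which matches the "empty string" wording of the exception. Passing paths as separate objects rather than one delimited string sidesteps the split, which is the direction the squashed commits describe.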
* [SPARK-6111] Fixed usage string in documentation.Kenneth Myers2015-03-021-1/+1
| | | | | | | | | | | | | | | Usage info in documentation does not match actual usage info. Doc string usage says ```Usage: network_wordcount.py <zk> <topic>``` whereas the actual usage is ```Usage: kafka_wordcount.py <zk> <topic>``` Author: Kenneth Myers <myerske@us.ibm.com> Closes #4852 from kennethmyers/kafka_wordcount_documentation_fix and squashes the following commits: 3855325 [Kenneth Myers] Fixed usage string in documentation. (cherry picked from commit 95ac68bf127b5370c13d6bc15adbda78228829cc) Signed-off-by: Sean Owen <sowen@cloudera.com>
* [SPARK-6052][SQL]In JSON schema inference, we should always set containsNull ↵Yin Huai2015-03-022-24/+23
| of an ArrayType to true
|
| Always set `containsNull = true` when inferring the schema of JSON datasets. If we set `containsNull` based on the records we scanned, we may miss arrays with null values when we do sampling. Also, because future data can have arrays with null values, if we convert JSON data to parquet, always setting `containsNull = true` is a more robust way to go.
|
| JIRA: https://issues.apache.org/jira/browse/SPARK-6052
|
| Author: Yin Huai <yhuai@databricks.com>
|
| Closes #4806 from yhuai/jsonArrayContainsNull and squashes the following commits:
|
| 05eab9d [Yin Huai] Change containsNull to true.
|
| (cherry picked from commit 3efd8bb6cf139ce094ff631c7a9c1eb93fdcd566)
| Signed-off-by: Cheng Lian <lian@databricks.com>
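The sampling argument in the SPARK-6052 message can be made concrete with a toy dataset. In this Python sketch the data and the helper name are hypothetical; it only shows that a flag inferred from a sample can disagree with the full data.

```python
def contains_null_from_sample(sampled_arrays):
    """Infer the flag only from the arrays that were actually scanned."""
    return any(item is None for array in sampled_arrays for item in array)


# Toy dataset: only the last record holds a null element.
all_arrays = [[1, 2], [3, 4], [5, None]]
sample = all_arrays[:2]

inferred = contains_null_from_sample(sample)
actual = contains_null_from_sample(all_arrays)
print(inferred, actual)  # False True
```

Always reporting that arrays may contain nulls trades a little schema precision for not being wrong about unscanned or future records.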
* [SPARK-6073][SQL] Need to refresh metastore cache after append data in ↵Yin Huai2015-03-022-0/+54
| | | | | | | | | | | | | | | | | CreateMetastoreDataSourceAsSelect JIRA: https://issues.apache.org/jira/browse/SPARK-6073 liancheng Author: Yin Huai <yhuai@databricks.com> Closes #4824 from yhuai/refreshCache and squashes the following commits: b9542ef [Yin Huai] Refresh metadata cache in the Catalog in CreateMetastoreDataSourceAsSelect. (cherry picked from commit 39a54b40aff66816f8b8f5c6133eaaad6eaecae1) Signed-off-by: Cheng Lian <lian@databricks.com>
* [Streaming][Minor]Fix some error docs in streaming examplesSaisai Shao2015-03-023-3/+4
| | | | | | | | | | | | | Small changes, please help to review, thanks a lot. Author: Saisai Shao <saisai.shao@intel.com> Closes #4837 from jerryshao/doc-fix and squashes the following commits: 545291a [Saisai Shao] Fix some error docs in streaming examples (cherry picked from commit d8fb40edea7c8c811814f1ff288d59178928964b) Signed-off-by: Sean Owen <sowen@cloudera.com>
* [SPARK-6083] [MLLib] [DOC] Make Python API example consistent in NaiveBayesMechCoder2015-03-011-10/+16
| | | | | | | | | | | | Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #4834 from MechCoder/spark-6083 and squashes the following commits: 1cdd7b5 [MechCoder] Add parse function 65bbbe9 [MechCoder] [SPARK-6083] Make Python API example consistent in NaiveBayes (cherry picked from commit 3f00bb3ef1384fabf86a68180d40a1a515f6f5e3) Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-6053][MLLIB] support save/load in PySpark's ALSXiangrui Meng2015-03-014-6/+82
| | | | | | | | | | | | | | | | A simple wrapper to save/load `MatrixFactorizationModel` in Python. jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #4811 from mengxr/SPARK-5991 and squashes the following commits: f135dac [Xiangrui Meng] update save doc 57e5200 [Xiangrui Meng] address comments 06140a4 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5991 282ec8d [Xiangrui Meng] support save/load in PySpark's ALS (cherry picked from commit aedbbaa3dda9cbc154cd52c07f6d296b972b0eb2) Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-6074] [sql] Package pyspark sql bindings.Marcelo Vanzin2015-03-011-0/+8
| | | | | | | | | | | | | This is needed for the SQL bindings to work on Yarn. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #4822 from vanzin/SPARK-6074 and squashes the following commits: fb52001 [Marcelo Vanzin] [SPARK-6074] [sql] Package pyspark sql bindings. (cherry picked from commit fd8d283eeb98e310b1e85ef8c3a8af9e547ab5e0) Signed-off-by: Sean Owen <sowen@cloudera.com>
* SPARK-5984: Fix TimSort bug causes ArrayOutOfBoundsExceptionEvan Yu2015-02-284-5/+161
| Fix TimSort bug which causes an ArrayOutOfBoundsException.
|
| Using the proposed fix here: http://envisage-project.eu/proving-android-java-and-python-sorting-algorithm-is-broken-and-how-to-fix-it/
|
| Author: Evan Yu <ehotou@gmail.com>
|
| Closes #4804 from hotou/SPARK-5984 and squashes the following commits:
|
| 3421b6c [Evan Yu] SPARK-5984: Add info to LICENSE
| e61c6b8 [Evan Yu] SPARK-5984: Fix license and document
| 6ccc280 [Evan Yu] SPARK-5984: Add License header to file
| e06c0d2 [Evan Yu] SPARK-5984: Add License header to file
| 4d95f75 [Evan Yu] SPARK-5984: Fix TimSort bug causes ArrayOutOfBoundsException
| 479a106 [Evan Yu] SPARK-5984: Fix TimSort bug causes ArrayOutOfBoundsException
|
| (cherry picked from commit 643300a6e27dac3822f9a3ced0ad5fb3b4f2ad75)
| Signed-off-by: Reynold Xin <rxin@databricks.com>
* [SPARK-5775] [SQL] BugFix: GenericRow cannot be cast to SpecificMutableRow ↵Cheng Lian2015-02-283-24/+217
| when nested data and partitioned table
|
| This PR adapts anselmevignon's #4697 to master and branch-1.3. Please refer to PR description of #4697 for details.
|
| Author: Cheng Lian <lian@databricks.com>
| Author: Cheng Lian <liancheng@users.noreply.github.com>
| Author: Yin Huai <yhuai@databricks.com>
|
| Closes #4792 from liancheng/spark-5775 and squashes the following commits:
|
| 538f506 [Cheng Lian] Addresses comments
| cee55cf [Cheng Lian] Merge pull request #4 from yhuai/spark-5775-yin
| b0b74fb [Yin Huai] Remove runtime pattern matching.
| ca6e038 [Cheng Lian] Fixes SPARK-5775
|
| (cherry picked from commit e6003f0a571ba44fcd011e695c8622e11cfee7dd)
| Signed-off-by: Cheng Lian <lian@databricks.com>