Commit message | Author | Age | Files | Lines
* [SQL] Python JsonRDD UTF8 Encoding Fix | Ahir Reddy | 2014-08-14 | 1 file | -1/+3
  Only encode unicode objects to UTF-8, and not strings. Author: Ahir Reddy <ahirreddy@gmail.com> Closes #1914 from ahirreddy/json-rdd-unicode-fix1 and squashes the following commits: ca4e9ba [Ahir Reddy] Encoding Fix
* [SPARK-2927][SQL] Add a conf to configure if we always read Binary columns stored in Parquet as String columns | Yin Huai | 2014-08-14 | 5 files | -22/+87
  This PR adds a new conf flag `spark.sql.parquet.binaryAsString`. When it is `true`, if there is no Parquet metadata file available to provide the schema of the data, we will always treat binary fields stored in Parquet as string fields. This conf provides a way to read string fields generated without UTF8 decoration. JIRA: https://issues.apache.org/jira/browse/SPARK-2927 Author: Yin Huai <huai@cse.ohio-state.edu> Closes #1855 from yhuai/parquetBinaryAsString and squashes the following commits: 689ffa9 [Yin Huai] Add missing "=". 80827de [Yin Huai] Unit test. 1765ca4 [Yin Huai] Use .toBoolean. 9d3f199 [Yin Huai] Merge remote-tracking branch 'upstream/master' into parquetBinaryAsString 5d436a1 [Yin Huai] The initial support of adding a conf to treat binary columns stored in Parquet as string columns.
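A minimal usage sketch of the new flag, assuming a 1.x-style `SQLContext`; the Parquet path is hypothetical:

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)  // sc: an existing SparkContext
// With no Parquet metadata declaring UTF8, binary columns now come back as strings
sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")
val strings = sqlContext.parquetFile("/path/to/data.parquet")  // hypothetical path
```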
* [SPARK-3011][SQL] _temporary directory should be filtered out by sqlContext.parquetFile | Chia-Yung Su | 2014-08-14 | 1 file | -1/+2
  Author: Chia-Yung Su <chiayung@appier.com> Closes #1924 from joesu/bugfix-spark3011 and squashes the following commits: c7e44f2 [Chia-Yung Su] match syntax f8fc32a [Chia-Yung Su] filter out tmp dir
* SPARK-2893: Do not swallow Exceptions when running a custom kryo registrator | Graham Dennis | 2014-08-14 | 2 files | -5/+16
  The previous behaviour of swallowing ClassNotFound exceptions when running a custom Kryo registrator could lead to difficult-to-debug problems later on, at serialisation/deserialisation time; see SPARK-2878. It is better to fail fast instead. Added a test case. Author: Graham Dennis <graham.dennis@gmail.com> Closes #1827 from GrahamDennis/feature/spark-2893 and squashes the following commits: fbe4cb6 [Graham Dennis] [SPARK-2878]: Update the test case to match the updated exception message 65e53c5 [Graham Dennis] [SPARK-2893]: Improve message when a spark.kryo.registrator fails. f480d85 [Graham Dennis] [SPARK-2893] Fix typo. b59d2c2 [Graham Dennis] SPARK-2893: Do not swallow Exceptions when running a custom spark.kryo.registrator
* [SPARK-3029] Disable local execution of Spark jobs by default | Aaron Davidson | 2014-08-14 | 3 files | -2/+18
  Currently, local execution of Spark jobs is only used by take(), and it can be problematic as it can load a significant amount of data onto the driver. The worst-case scenarios occur if the RDD is cached (guaranteed to load the whole partition), has very large elements, or the partition is just large and we apply a filter with high selectivity or computational overhead. Additionally, jobs that run locally in this manner do not show up in the web UI, making them harder to track or understand. This PR adds a flag to disable local execution, which is turned OFF by default, with the intention of perhaps eventually removing this functionality altogether. Removing it now is a tougher proposition since it is part of the public runJob API. An alternative solution would be to limit the flag to take()/first() to avoid impacting any external users of this API, but such usage (or, at least, reliance upon the feature) is hopefully minimal. Author: Aaron Davidson <aaron@databricks.com> Closes #1321 from aarondav/allowlocal and squashes the following commits: 136b253 [Aaron Davidson] Fix DAGSchedulerSuite 5599d55 [Aaron Davidson] [RFC] Disable local execution of Spark jobs by default
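A sketch of opting back in. The commit message does not name the flag; `spark.localExecution.enabled` is the name used in Spark 1.1 and is an assumption here:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Re-enable local execution of take()/first() jobs (now disabled by default).
// "spark.localExecution.enabled" is an assumed flag name, not stated in the commit.
val conf = new SparkConf()
  .setAppName("local-exec-demo")
  .set("spark.localExecution.enabled", "true")
val sc = new SparkContext(conf)
sc.parallelize(1 to 100).take(5)  // take() is the main user of local execution
```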
* [SPARK-2995][MLLIB] add ALS.setIntermediateRDDStorageLevel | Xiangrui Meng | 2014-08-13 | 1 file | -15/+30
  As mentioned in SPARK-2465, using `MEMORY_AND_DISK_SER` for user/product in/out links together with `spark.rdd.compress=true` can help reduce the space requirement by a lot, at the cost of speed. It might be useful to add this option so people can run ALS on much bigger datasets. Another option for the method name is `setIntermediateRDDStorageLevel`. Author: Xiangrui Meng <meng@databricks.com> Closes #1913 from mengxr/als-storagelevel and squashes the following commits: d942017 [Xiangrui Meng] rename to setIntermediateRDDStorageLevel 7550029 [Xiangrui Meng] add ALS.setIntermediateDataStorageLevel
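A short sketch of the new setter in use, assuming a ratings RDD prepared elsewhere:

```scala
import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.storage.StorageLevel

// ratings: RDD[Rating], prepared elsewhere; pairing this level with
// spark.rdd.compress=true gives the space savings described above
val model = new ALS()
  .setRank(10)
  .setIterations(10)
  .setIntermediateRDDStorageLevel(StorageLevel.MEMORY_AND_DISK_SER)
  .run(ratings)
```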
* [Docs] Add missing <code> tags (minor) | Andrew Or | 2014-08-13 | 1 file | -2/+2
  These configs looked inconsistent with the rest. Author: Andrew Or <andrewor14@gmail.com> Closes #1936 from andrewor14/docs-code and squashes the following commits: 15f578a [Andrew Or] Add <code> tag
* [SPARK-3006] Failed to execute spark-shell in Windows OS | Masayoshi TSUZUKI | 2014-08-13 | 1 file | -1/+1
  Modified the order of the options and arguments in spark-shell.cmd. Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp> Closes #1918 from tsudukim/feature/SPARK-3006 and squashes the following commits: 8bba494 [Masayoshi TSUZUKI] [SPARK-3006] Failed to execute spark-shell in Windows OS 1a32410 [Masayoshi TSUZUKI] [SPARK-3006] Failed to execute spark-shell in Windows OS
* SPARK-3020: Print completed indices rather than tasks in web UI | Patrick Wendell | 2014-08-13 | 3 files | -1/+4
  Author: Patrick Wendell <pwendell@gmail.com> Closes #1933 from pwendell/speculation and squashes the following commits: 33a3473 [Patrick Wendell] Use OpenHashSet 8ce2ff0 [Patrick Wendell] SPARK-3020: Print completed indices rather than tasks in web UI
* [SPARK-2986] [SQL] fixed: setting properties does not take effect | guowei | 2014-08-13 | 1 file | -2/+2
  It seems the SET command is not run by SparkSQLDriver; it runs through the Hive API instead. As a result, users cannot change the reducer count by setting spark.sql.shuffle.partitions, even though setting such properties should be Spark SQL's responsibility. Author: guowei <guowei@upyoo.com> Closes #1904 from guowei2/temp-branch and squashes the following commits: 7d47dde [guowei] fixed: setting properties like spark.sql.shuffle.partitions does not take effect
* [SPARK-2970] [SQL] spark-sql script ends with IOException when EventLogging is enabled | Kousuke Saruta | 2014-08-13 | 1 file | -2/+8
  Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #1891 from sarutak/SPARK-2970 and squashes the following commits: 4a2d2fe [Kousuke Saruta] Modified comment style 8bd833c [Kousuke Saruta] Modified style 6c0997c [Kousuke Saruta] Modified the timing of shutdown hook execution. It should be executed before the shutdown hook of o.a.h.f.FileSystem
* [SPARK-2935][SQL] Fix parquet predicate push down bug | Michael Armbrust | 2014-08-13 | 3 files | -3/+10
  Author: Michael Armbrust <michael@databricks.com> Closes #1863 from marmbrus/parquetPredicates and squashes the following commits: 10ad202 [Michael Armbrust] left <=> right f249158 [Michael Armbrust] quiet parquet tests. 802da5b [Michael Armbrust] Add test case. eab2eda [Michael Armbrust] Fix parquet predicate push down bug
* [SPARK-2650][SQL] More precise initial buffer size estimation for in-memory column buffer | Cheng Lian | 2014-08-13 | 1 file | -5/+6
  This is a follow-up of #1880. Since the row count within a single batch is known, we can estimate a much more precise initial buffer size when building an in-memory column buffer. Author: Cheng Lian <lian.cs.zju@gmail.com> Closes #1901 from liancheng/precise-init-buffer-size and squashes the following commits: d5501fa [Cheng Lian] More precise initial buffer size estimation for in-memory column buffer
* [SPARK-2994][SQL] Support for UDFs that take complex types | Michael Armbrust | 2014-08-13 | 2 files | -18/+37
  Author: Michael Armbrust <michael@databricks.com> Closes #1915 from marmbrus/arrayUDF and squashes the following commits: a1c503d [Michael Armbrust] Support for udfs that take complex types
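A hedged sketch of what this enables: registering a Scala function over an array-typed column. The table and column names are hypothetical; `registerFunction` is the 1.x registration API:

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
// Register a UDF whose argument is a complex (array) type
sqlContext.registerFunction("arrayLen", (xs: Seq[Int]) => xs.size)
// `records` and its array<int> column `values` are hypothetical
sqlContext.sql("SELECT arrayLen(values) FROM records")
```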
* [SPARK-2817] [SQL] add "show create table" support | tianyi | 2014-08-13 | 37 files | -0/+199
  In the Spark SQL component, the "show create table" syntax had been disabled. We think it is a useful function for describing a Hive table. Author: tianyi <tianyi@asiainfo-linkage.com> Author: tianyi <tianyi@asiainfo.com> Author: tianyi <tianyi.asiainfo@gmail.com> Closes #1760 from tianyi/spark-2817 and squashes the following commits: 7d28b15 [tianyi] [SPARK-2817] fix too short prefix problem cbffe8b [tianyi] [SPARK-2817] fix the case problem 565ec14 [tianyi] [SPARK-2817] fix the case problem 60d48a9 [tianyi] [SPARK-2817] use system temporary folder instead of temporary files in the source tree, and also clean some empty line dbe1031 [tianyi] [SPARK-2817] move some code out of function rewritePaths, as it may be called multiple times 9b2ba11 [tianyi] [SPARK-2817] fix the line length problem 9f97586 [tianyi] [SPARK-2817] remove test.tmp.dir from pom.xml bfc2999 [tianyi] [SPARK-2817] add "File.separator" support, create a "testTmpDir" outside the rewritePaths bde800a [tianyi] [SPARK-2817] add "${system:test.tmp.dir}" support add "last_modified_by" to nonDeterministicLineIndicators in HiveComparisonTest bb82726 [tianyi] [SPARK-2817] remove test which requires a system from the whitelist. bbf6b42 [tianyi] [SPARK-2817] add a systemProperties named "test.tmp.dir" to pass the test which contains "${system:test.tmp.dir}" a337bd6 [tianyi] [SPARK-2817] add "show create table" support a03db77 [tianyi] [SPARK-2817] add "show create table" support
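A minimal illustration of the re-enabled syntax, assuming a `HiveContext` and an existing Hive table named `src` (both assumptions):

```scala
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
// Print the DDL for an existing Hive table (table name assumed)
hiveContext.sql("SHOW CREATE TABLE src").collect().foreach(println)
```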
* [SPARK-3004][SQL] Added null checking when retrieving row set | Cheng Lian | 2014-08-13 | 3 files | -33/+96
  JIRA issue: [SPARK-3004](https://issues.apache.org/jira/browse/SPARK-3004) HiveThriftServer2 throws an exception when the result set contains `NULL`. We should check `isNullAt` in `SparkSQLOperationManager.getNextRowSet`. Note that simply using `row.addColumnValue(null)` doesn't work, since Hive sets the column type of a null `ColumnValue` to String by default. Author: Cheng Lian <lian.cs.zju@gmail.com> Closes #1920 from liancheng/spark-3004 and squashes the following commits: 1b1db1c [Cheng Lian] Adding NULL column values in the Hive way 2217722 [Cheng Lian] Fixed SPARK-3004: added null checking when retrieving row set
* [MLLIB] use Iterator.fill instead of Array.fill | Xiangrui Meng | 2014-08-13 | 1 file | -5/+5
  Iterator.fill uses less memory. Author: Xiangrui Meng <meng@databricks.com> Closes #1930 from mengxr/rand-gen-iter and squashes the following commits: 24178ca [Xiangrui Meng] use Iterator.fill instead of Array.fill
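The difference in a nutshell: `Array.fill` materializes every element up front, while `Iterator.fill` produces them lazily. A small sketch:

```scala
import scala.util.Random

// Array.fill allocates all one million doubles at once (~8 MB on-heap)
val eager = Array.fill(1000000)(Random.nextGaussian())

// Iterator.fill produces one value at a time; nothing is materialized until consumed
val lazyGen = Iterator.fill(1000000)(Random.nextGaussian())
val mean = lazyGen.sum / 1000000  // streams through without holding the data
```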
* [SPARK-2983] [PySpark] improve performance of sortByKey() | Davies Liu | 2014-08-13 | 1 file | -23/+24
  1. Skip partitionBy() when numOfPartition is 1.
  2. Use bisect_left (O(log N)) instead of a linear loop (O(N)) in rangePartitioner (see the sketch below).
  Author: Davies Liu <davies.liu@gmail.com> Closes #1898 from davies/sort and squashes the following commits: 0a9608b [Davies Liu] Merge branch 'master' into sort 1cf9565 [Davies Liu] improve performance of sortByKey()
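The second optimization replaces a linear scan over range-partition bounds with a binary search. A conceptual Scala sketch of the same idea; this is illustrative, not the PySpark code:

```scala
import java.util.Arrays

// Given sorted upper bounds for each partition, find a key's partition
// in O(log n) instead of scanning the bounds one by one.
def partitionFor(bounds: Array[Double], key: Double): Int = {
  val i = Arrays.binarySearch(bounds, key)
  if (i >= 0) i else -(i + 1)  // insertion point when the key is not an exact bound
}

val bounds = Array(10.0, 20.0, 30.0)  // 4 partitions
partitionFor(bounds, 15.0)            // => 1
```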
* [SPARK-3013] [SQL] [PySpark] convert array into list | Davies Liu | 2014-08-13 | 1 file | -7/+7
  Because Pyrolite does not support `array` from Python 2.6. Author: Davies Liu <davies.liu@gmail.com> Closes #1928 from davies/fix_array and squashes the following commits: 858e6c5 [Davies Liu] convert array into list
* [SPARK-2963] [SQL] There is no documentation about building to use HiveServer and CLI for SparkSQL | Kousuke Saruta | 2014-08-13 | 2 files | -0/+18
  Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #1885 from sarutak/SPARK-2963 and squashes the following commits: ed53329 [Kousuke Saruta] Modified description and notation of proper nouns 07c59fc [Kousuke Saruta] Added a description about how to build to use HiveServer and CLI for SparkSQL to building-with-maven.md 6e6645a [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-2963 c88fa93 [Kousuke Saruta] Added a description about building to use HiveServer and CLI for SparkSQL
* [SPARK-2993] [MLLib] colStats (wrapper around MultivariateStatisticalSummary) in Statistics | Doris Xin | 2014-08-12 | 5 files | -261/+374
  For both Scala and Python. The ser/de util functions were moved out of `PythonMLLibAPI` and into their own object to avoid creating the `PythonMLLibAPI` object inside of `MultivariateStatisticalSummarySerialized`, which is then referenced inside of a method in `PythonMLLibAPI`. `MultivariateStatisticalSummarySerialized` was created to serialize the `Vector` fields in `MultivariateStatisticalSummary`. Author: Doris Xin <doris.s.xin@gmail.com> Closes #1911 from dorx/colStats and squashes the following commits: 77b9924 [Doris Xin] developerAPI tag de9cbbe [Doris Xin] reviewer comments and moved more ser/de 459faba [Doris Xin] colStats in Statistics for both Scala and Python
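A brief sketch of the Scala side of the new wrapper, assuming an RDD of MLlib vectors:

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0),
  Vectors.dense(3.0, 4.0)
))
val summary = Statistics.colStats(rows)  // MultivariateStatisticalSummary
println(summary.mean)      // column means
println(summary.variance)  // column variances
```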
* [SPARK-1777 (partial)] bugfix: compute the size of requested memory correctly | Zhang, Liye | 2014-08-12 | 1 file | -2/+2
  Author: Zhang, Liye <liye.zhang@intel.com> Closes #1892 from liyezhang556520/lazy_memory_request and squashes the following commits: 335ab61 [Zhang, Liye] [SPARK-1777 (partial)] bugfix: make size of requested memory correctly
* Use transferTo when copying merge files in ExternalSorter | Raymond Liu | 2014-08-12 | 2 files | -11/+25
  Since this is a file-to-file copy, using transferTo should be faster. Author: Raymond Liu <raymond.liu@intel.com> Closes #1884 from colorant/externalSorter and squashes the following commits: 6e42f3c [Raymond Liu] More code into copyStream bfb496b [Raymond Liu] Use transferTo when copy merge files in ExternalSorter
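A self-contained sketch of the underlying NIO technique, channel-to-channel copy via `transferTo`, which can use zero-copy I/O where the OS supports it (this is illustrative, not the ExternalSorter code):

```scala
import java.io.{FileInputStream, FileOutputStream}

def copyFile(src: String, dst: String): Unit = {
  val in = new FileInputStream(src).getChannel
  val out = new FileOutputStream(dst).getChannel
  try {
    var pos = 0L
    val size = in.size()
    // transferTo may copy fewer bytes than requested, so loop until done
    while (pos < size) {
      pos += in.transferTo(pos, size - pos, out)
    }
  } finally {
    in.close()
    out.close()
  }
}
```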
* [SPARK-2953] Allow using short names for io compression codecs | Reynold Xin | 2014-08-12 | 3 files | -5/+32
  Instead of requiring "org.apache.spark.io.LZ4CompressionCodec", it is easier for users if Spark just accepts "lz4", "lzf", "snappy". Author: Reynold Xin <rxin@apache.org> Closes #1873 from rxin/compressionCodecShortForm and squashes the following commits: 9f50962 [Reynold Xin] Specify short-form compression codec names first. 63f78ee [Reynold Xin] Updated configuration documentation. 47b3848 [Reynold Xin] [SPARK-2953] Allow using short names for io compression codecs
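For example, after this change the short form suffices:

```scala
import org.apache.spark.SparkConf

// "snappy" now resolves to org.apache.spark.io.SnappyCompressionCodec;
// "lz4" and "lzf" work the same way
val conf = new SparkConf().set("spark.io.compression.codec", "snappy")
```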
* SPARK-2830 [MLlib]: re-organize mllib documentation | Ameet Talwalkar | 2014-08-12 | 10 files | -220/+317
  As per discussions with Xiangrui, I've reorganized and edited the mllib documentation. Author: Ameet Talwalkar <atalwalkar@gmail.com> Closes #1908 from atalwalkar/master and squashes the following commits: fe6938a [Ameet Talwalkar] made xiangruis suggested changes 840028b [Ameet Talwalkar] made xiangruis suggested changes 7ec366a [Ameet Talwalkar] reorganize and edit mllib documentation
* fix flaky tests | Davies Liu | 2014-08-12 | 1 file | -1/+1
  Python 2.6 does not handle float errors as well as 2.7+. Author: Davies Liu <davies.liu@gmail.com> Closes #1910 from davies/fix_test and squashes the following commits: 7e51200 [Davies Liu] fix flaky tests
* [MLlib] Correctly set vectorSize and alpha | Liquan Pei | 2014-08-12 | 1 file | -13/+12
  mengxr Correctly set vectorSize and alpha in Word2Vec training. Author: Liquan Pei <liquanpei@gmail.com> Closes #1900 from Ishiihara/Word2Vec-bugfix and squashes the following commits: 85f64f2 [Liquan Pei] correctly set vectorSize and alpha
* [SPARK-2923][MLLIB] Implement some basic BLAS routines | Xiangrui Meng | 2014-08-11 | 7 files | -66/+432
  Having some basic BLAS operations implemented in MLlib can help simplify the current implementation and improve performance. Tested on my local machine:
  ~~~
  bin/spark-submit --class org.apache.spark.examples.mllib.BinaryClassification \
    examples/target/scala-*/spark-examples-*.jar --algorithm LR --regType L2 \
    --regParam 1.0 --numIterations 1000 ~/share/data/rcv1.binary/rcv1_train.binary
  ~~~
  1. before: ~1m
  2. after: ~30s
  CC: jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #1849 from mengxr/ml-blas and squashes the following commits: ba583a2 [Xiangrui Meng] exclude Vector.copy a4d7d2f [Xiangrui Meng] Merge branch 'master' into ml-blas 6edeab9 [Xiangrui Meng] address comments 940bdeb [Xiangrui Meng] rename MLlibBLAS to BLAS c2a38bc [Xiangrui Meng] enhance dot tests 4cfaac4 [Xiangrui Meng] add apache header 48d01d2 [Xiangrui Meng] add tests for zeros and copy 3b882b1 [Xiangrui Meng] use blas.scal in gradient 735eb23 [Xiangrui Meng] remove d from BLAS routines d2d7d3c [Xiangrui Meng] update gradient and lbfgs 7f78186 [Xiangrui Meng] add zeros to Vectors; add dscal and dcopy to BLAS 14e6645 [Xiangrui Meng] add ddot cbb8273 [Xiangrui Meng] add daxpy test 07db0bb [Xiangrui Meng] Merge branch 'master' into ml-blas e8c326d [Xiangrui Meng] axpy
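MLlib's BLAS object is internal, so here is only a conceptual sketch of one routine this commit adds (`axpy`, i.e. `y := a*x + y`) on plain arrays; the names are illustrative:

```scala
// Conceptual axpy kernel: y := a*x + y, a basic BLAS level-1 operation
def axpy(a: Double, x: Array[Double], y: Array[Double]): Unit = {
  require(x.length == y.length, "vectors must have the same size")
  var i = 0
  while (i < x.length) {
    y(i) += a * x(i)
    i += 1
  }
}

val x = Array(1.0, 2.0, 3.0)
val y = Array(1.0, 1.0, 1.0)
axpy(2.0, x, y)  // y is now Array(3.0, 5.0, 7.0)
```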
* [SQL] [SPARK-2826] Reduce the memory copy while building the hashmap for HashOuterJoin | Cheng Hao | 2014-08-11 | 1 file | -26/+28
  This is a follow-up for #1147; this PR improves performance by about 10%-15% in my local tests.
  ```
  Before:
  LeftOuterJoin: took 16750 ms ([3000000] records)
  LeftOuterJoin: took 15179 ms ([3000000] records)
  RightOuterJoin: took 15515 ms ([3000000] records)
  RightOuterJoin: took 15276 ms ([3000000] records)
  FullOuterJoin: took 19150 ms ([6000000] records)
  FullOuterJoin: took 18935 ms ([6000000] records)

  After:
  LeftOuterJoin: took 15218 ms ([3000000] records)
  LeftOuterJoin: took 13503 ms ([3000000] records)
  RightOuterJoin: took 13663 ms ([3000000] records)
  RightOuterJoin: took 14025 ms ([3000000] records)
  FullOuterJoin: took 16624 ms ([6000000] records)
  FullOuterJoin: took 16578 ms ([6000000] records)
  ```
  Besides the performance improvement, I also did some cleanup as suggested in #1147. Author: Cheng Hao <hao.cheng@intel.com> Closes #1765 from chenghao-intel/hash_outer_join_fixing and squashes the following commits: ab1f9e0 [Cheng Hao] Reduce the memory copy while building the hashmap
* [SPARK-2650][SQL] Build column buffers in smaller batches | Michael Armbrust | 2014-08-11 | 7 files | -36/+70
  Author: Michael Armbrust <michael@databricks.com> Closes #1880 from marmbrus/columnBatches and squashes the following commits: 0649987 [Michael Armbrust] add test 4756fad [Michael Armbrust] fix compilation 2314532 [Michael Armbrust] Build column buffers in smaller batches
* [SPARK-2968][SQL] Fix nullabilities of Explode. | Takuya UESHIN | 2014-08-11 | 1 file | -4/+4
  Output nullabilities of `Explode` can be determined by `ArrayType.containsNull` or `MapType.valueContainsNull`. Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #1888 from ueshin/issues/SPARK-2968 and squashes the following commits: d128c95 [Takuya UESHIN] Fix nullability of Explode.
* [SPARK-2965][SQL] Fix HashOuterJoin output nullabilities. | Takuya UESHIN | 2014-08-11 | 1 file | -1/+12
  Output attributes of the opposite side of an `OuterJoin` should be nullable. Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #1887 from ueshin/issues/SPARK-2965 and squashes the following commits: bcb2d37 [Takuya UESHIN] Fix HashOuterJoin output nullabilities.
* [SQL] A tiny refactoring in HiveContext#analyze | Yin Huai | 2014-08-11 | 1 file | -5/+3
  I should use `EliminateAnalysisOperators` in `analyze` instead of manually pattern matching. Author: Yin Huai <huaiyin.thu@gmail.com> Closes #1881 from yhuai/useEliminateAnalysisOperators and squashes the following commits: f3e1e7f [Yin Huai] Use EliminateAnalysisOperators.
* [sql] use SparkSQLEnv.stop() in ShutdownHook | wangfei | 2014-08-11 | 1 file | -1/+1
  Author: wangfei <wangfei1@huawei.com> Closes #1852 from scwf/patch-3 and squashes the following commits: ae28c29 [wangfei] use SparkSQLEnv.stop() in ShutdownHook
* [SPARK-2590][SQL] Added option to handle incremental collection, disabled by default | Cheng Lian | 2014-08-11 | 1 file | -1/+10
  JIRA issue: [SPARK-2590](https://issues.apache.org/jira/browse/SPARK-2590) Author: Cheng Lian <lian.cs.zju@gmail.com> Closes #1853 from liancheng/inc-collect-option and squashes the following commits: cb3ea45 [Cheng Lian] Moved incremental collection option to Thrift server 43ce3aa [Cheng Lian] Changed incremental collect option name 623abde [Cheng Lian] Added option to handle incremental collection, disabled by default
* [SPARK-2844][SQL] Correctly set JVM HiveContext if it is passed into Python HiveContext constructor | Ahir Reddy | 2014-08-11 | 1 file | -0/+14
  https://issues.apache.org/jira/browse/SPARK-2844 Author: Ahir Reddy <ahirreddy@gmail.com> Closes #1768 from ahirreddy/python-hive-context-fix and squashes the following commits: 7972d3b [Ahir Reddy] Correctly set JVM HiveContext if it is passed into Python HiveContext constructor
* [SPARK-2934][MLlib] Adding LogisticRegressionWithLBFGS Interface | DB Tsai | 2014-08-11 | 2 files | -4/+136
  For training with the LBFGS optimizer, which will converge faster than SGD. Author: DB Tsai <dbtsai@alpinenow.com> Closes #1862 from dbtsai/dbtsai-lbfgs-lor and squashes the following commits: aa84b81 [DB Tsai] small change f852bcd [DB Tsai] Remove duplicate method f119fdc [DB Tsai] Formatting 97776aa [DB Tsai] address more feedback 85b4a91 [DB Tsai] address feedback 3cf50c2 [DB Tsai] LogisticRegressionWithLBFGS interface
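A minimal training sketch with the new interface; the toy data is made up, and `run` takes an RDD[LabeledPoint]:

```scala
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val training = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(1.0, 0.5)),
  LabeledPoint(0.0, Vectors.dense(-1.0, -0.5))
))
val model = new LogisticRegressionWithLBFGS().run(training)
model.predict(Vectors.dense(0.8, 0.3))  // expected to be 1.0 on this separable toy set
```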
* [SPARK-2515][mllib] Chi Squared test | Doris Xin | 2014-08-11 | 4 files | -0/+512
  Author: Doris Xin <doris.s.xin@gmail.com> Closes #1733 from dorx/chisquare and squashes the following commits: cafb3a7 [Doris Xin] fixed p-value for extreme case. d286783 [Doris Xin] Merge branch 'master' into chisquare e95e485 [Doris Xin] reviewer comments. 7dde711 [Doris Xin] ChiSqTestResult renaming and changed to Class 80d03e2 [Doris Xin] Reviewer comments. c39eeb5 [Doris Xin] units passed with updated API e90d90a [Doris Xin] Merge branch 'master' into chisquare 7eea80b [Doris Xin] WIP d64c2fb [Doris Xin] Merge branch 'master' into chisquare 5686082 [Doris Xin] facelift bc7eb2e [Doris Xin] unit passed; still need docs and some refactoring 50703a5 [Doris Xin] merge master 4e4e361 [Doris Xin] WIP e6b83f3 [Doris Xin] reviewer comments 3d61582 [Doris Xin] input names 706d436 [Doris Xin] Added API for RDD[Vector] 6598379 [Doris Xin] API and code structure. ff17423 [Doris Xin] WIP
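A quick sketch of the new test via `Statistics.chiSqTest`; the observed counts are made up, and with a single vector argument this runs a goodness-of-fit test against a uniform expected distribution:

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

val observed = Vectors.dense(4.0, 6.0, 5.0)
val result = Statistics.chiSqTest(observed)
println(result.pValue)     // probability of counts this skewed arising by chance
println(result.statistic)  // the chi-squared statistic itself
```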
* [SPARK-2931] In TaskSetManager, reset currentLocalityIndex after recomputing locality levels | Josh Rosen | 2014-08-11 | 2 files | -5/+46
  This addresses SPARK-2931, a bug where getAllowedLocalityLevel() could throw an ArrayIndexOutOfBoundsException. The fix here is to reset currentLocalityIndex after recomputing the locality levels. Thanks to kayousterhout, mridulm, and lirui-intel for helping me debug this. Author: Josh Rosen <joshrosen@apache.org> Closes #1896 from JoshRosen/SPARK-2931 and squashes the following commits: 48b60b5 [Josh Rosen] Move FakeRackUtil.cleanUp() into beforeEach(). 6fec474 [Josh Rosen] Set currentLocalityIndex after recomputing locality levels. 9384897 [Josh Rosen] Update SPARK-2931 test to reflect changes in 63bdb1f41b4895e3a9444f7938094438a94d3007. 9ecd455 [Josh Rosen] Apply @mridulm's patch for reproducing SPARK-2931.
* [SPARK-2952] Enable logging actor messages at DEBUG level | Reynold Xin | 2014-08-11 | 13 files | -38/+111
  Example messages:
  ```
  14/08/09 21:37:01 DEBUG BlockManagerMasterActor: [actor] received message RegisterBlockManager(BlockManagerId(0, rxin-mbp, 58092, 0),278302556,Actor[akka.tcp://spark@rxin-mbp:58088/user/BlockManagerActor1#-63596539]) from Actor[akka.tcp://spark@rxin-mbp:58088/temp/$c]
  14/08/09 21:37:01 DEBUG BlockManagerMasterActor: [actor] handled message (0.279 ms) RegisterBlockManager(BlockManagerId(0, rxin-mbp, 58092, 0),278302556,Actor[akka.tcp://spark@rxin-mbp:58088/user/BlockManagerActor1#-63596539]) from Actor[akka.tcp://spark@rxin-mbp:58088/temp/$c]
  ```
  cc @mengxr @tdas @pwendell Author: Reynold Xin <rxin@apache.org> Closes #1870 from rxin/actorLogging and squashes the following commits: c531ee5 [Reynold Xin] Added license header for ActorLogReceive. f6b1ebe [Reynold Xin] [SPARK-2952] Enable logging actor messages at DEBUG level
* [PySpark] [SPARK-2954] [SPARK-2948] [SPARK-2910] [SPARK-2101] Python 2.6 Fixes | Josh Rosen | 2014-08-11 | 5 files | -7/+36
  - Modify python/run-tests to test with Python 2.6
  - Use unittest2 when running on Python 2.6.
  - Fix issue with namedtuple.
  - Skip TestOutputFormat.test_newhadoop on Python 2.6 until SPARK-2951 is fixed.
  - Fix MLlib _deserialize_double on Python 2.6.
  Closes #1868. Closes #1042. Author: Josh Rosen <joshrosen@apache.org> Closes #1874 from JoshRosen/python2.6 and squashes the following commits: 983d259 [Josh Rosen] [SPARK-2954] Fix MLlib _deserialize_double on Python 2.6. 5d18fd7 [Josh Rosen] [SPARK-2948] [SPARK-2910] [SPARK-2101] Python 2.6 fixes
* [SPARK-2936] Migrate Netty network module from Java to Scala | Reynold Xin | 2014-08-10 | 12 files | -364/+292
  The Netty network module was originally written when Scala 2.9.x had a bug that prevented a pure Scala implementation, so a subset of the files were done in Java. We have since upgraded to Scala 2.10 and can now migrate all the Java files to Scala. https://github.com/netty/netty/issues/781 https://github.com/mesos/spark/pull/522 Author: Reynold Xin <rxin@apache.org> Closes #1865 from rxin/netty and squashes the following commits: 332422f [Reynold Xin] Code review feedback ca9eeee [Reynold Xin] Minor update. 7f1434b [Reynold Xin] [SPARK-2936] Migrate Netty network module from Java to Scala
* [SPARK-2937] Separate out sampleByKeyExact as its own API in PairRDDFunctions | Doris Xin | 2014-08-10 | 4 files | -128/+216
  To enable Python consistency and the `Experimental` label on the `sampleByKeyExact` API. Author: Doris Xin <doris.s.xin@gmail.com> Author: Xiangrui Meng <meng@databricks.com> Closes #1866 from dorx/stratified and squashes the following commits: 0ad97b2 [Doris Xin] reviewer comments. 2948aae [Doris Xin] remove unrelated changes e990325 [Doris Xin] Merge branch 'master' into stratified 555a3f9 [Doris Xin] separate out sampleByKeyExact as its own API 616e55c [Doris Xin] merge master 245439e [Doris Xin] moved minSamplingRate to getUpperBound eaf5771 [Doris Xin] bug fixes. 17a381b [Doris Xin] fixed a merge issue and a failed unit ea7d27f [Doris Xin] merge master b223529 [Xiangrui Meng] use approx bounds for poisson fix poisson mean for waitlisting add unit tests for Java b3013a4 [Xiangrui Meng] move math3 back to test scope eecee5f [Doris Xin] Merge branch 'master' into stratified f4c21f3 [Doris Xin] Reviewer comments a10e68d [Doris Xin] style fix a2bf756 [Doris Xin] Merge branch 'master' into stratified 680b677 [Doris Xin] use mapPartitionWithIndex instead 9884a9f [Doris Xin] style fix bbfb8c9 [Doris Xin] Merge branch 'master' into stratified ee9d260 [Doris Xin] addressed reviewer comments 6b5b10b [Doris Xin] Merge branch 'master' into stratified 254e03c [Doris Xin] minor fixes and Java API. 4ad516b [Doris Xin] remove unused imports from PairRDDFunctions bd9dc6e [Doris Xin] unit bug and style violation fixed 1fe1cff [Doris Xin] Changed fractionByKey to a map to enable arg check 944a10c [Doris Xin] [SPARK-2145] Add lower bound on sampling rate 0214a76 [Doris Xin] cleanUp 90d94c0 [Doris Xin] merge master 9e74ab5 [Doris Xin] Separated out most of the logic in sampleByKey 7327611 [Doris Xin] merge master 50581fc [Doris Xin] added a TODO for logging in python 46f6c8c [Doris Xin] fixed the NPE caused by closures being cleaned before being passed into the aggregate function 7e1a481 [Doris Xin] changed the permission on SamplingUtil 1d413ce [Doris Xin] fixed checkstyle issues 9ee94ee [Doris Xin] [SPARK-2082] stratified sampling in PairRDDFunctions that guarantees exact sample size e3fd6a6 [Doris Xin] Merge branch 'master' into takeSample 7cab53a [Doris Xin] fixed import bug in rdd.py ffea61a [Doris Xin] SPARK-1939: Refactor takeSample method in RDD 1441977 [Doris Xin] SPARK-1939 Refactor takeSample method in RDD to use ScaSRS
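A small usage sketch of the separated-out API; the keys and fractions are made up, and `sampleByKeyExact` guarantees the exact per-key sample size at the cost of extra passes over the data:

```scala
import org.apache.spark.SparkContext._  // pulls in PairRDDFunctions

val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3), ("b", 4)))
val fractions = Map("a" -> 0.5, "b" -> 0.5)
// Exact per-key sample sizes, unlike the approximate sampleByKey
val sample = pairs.sampleByKeyExact(withReplacement = false, fractions)
```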
* [SPARK-2898] [PySpark] fix bugs in daemon.py | Davies Liu | 2014-08-10 | 2 files | -32/+48
  1. Do not use a signal handler for SIGCHLD; it can easily cause deadlock.
  2. Handle EINTR during accept().
  3. Pass errno into the JVM.
  4. Handle EAGAIN during fork().
  Now it can pass 50k-task tests in 180 seconds. Author: Davies Liu <davies.liu@gmail.com> Closes #1842 from davies/qa and squashes the following commits: f0ea451 [Davies Liu] fix lint 03a2e8c [Davies Liu] cleanup dead children every second 32cb829 [Davies Liu] fix lint 0cd0817 [Davies Liu] fix bugs in daemon.py
* [SPARK-2950] Add gc time and shuffle write time to JobLogger | Shivaram Venkataraman | 2014-08-10 | 1 file | -3/+6
  The JobLogger is very useful for performing offline performance profiling of Spark jobs. GC time and shuffle write time are available in TaskMetrics but are currently missing from the JobLogger output. This patch adds these two fields. ~~Since this is a small change, I didn't create a JIRA. Let me know if I should do that.~~ cc kayousterhout Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu> Closes #1869 from shivaram/job-logger and squashes the following commits: 1b709fc [Shivaram Venkataraman] Add a space before GC_TIME c418105 [Shivaram Venkataraman] Add gc time and shuffle write time to JobLogger
* Remove extra semicolon in Task.scala | GuoQiang Li | 2014-08-10 | 1 file | -1/+1
  Author: GuoQiang Li <witgo@qq.com> Closes #1876 from witgo/remove_semicolon_in_Task_scala and squashes the following commits: c6ea732 [GuoQiang Li] Remove extra semicolon in Task.scala
* Turn UpdateBlockInfo into case class. | Reynold Xin | 2014-08-09 | 1 file | -19/+1
  This helps us log UpdateBlockInfo properly once #1870 is merged. Author: Reynold Xin <rxin@apache.org> Closes #1872 from rxin/UpdateBlockInfo and squashes the following commits: 0cee1c2 [Reynold Xin] Turn UpdateBlockInfo into case class.
* Updated Spark SQL README to include the hive-thriftserver module | Reynold Xin | 2014-08-09 | 1 file | -1/+2
  Author: Reynold Xin <rxin@apache.org> Closes #1867 from rxin/sql-readme and squashes the following commits: 42a5307 [Reynold Xin] Updated Spark SQL README to include the hive-thriftserver module
* [SPARK-2894] spark-shell doesn't accept flags | Kousuke Saruta | 2014-08-09 | 6 files | -11/+94
  As sryza reported, spark-shell doesn't accept any flags. The root cause is incorrect usage of spark-submit in spark-shell, and it came to the surface with #1801. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Author: Cheng Lian <lian.cs.zju@gmail.com> Closes #1715, Closes #1864, and Closes #1861 Closes #1825 from sarutak/SPARK-2894 and squashes the following commits: 47f3510 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-2894 2c899ed [Kousuke Saruta] Removed useless code from java_gateway.py 98287ed [Kousuke Saruta] Removed useless code from java_gateway.py 513ad2e [Kousuke Saruta] Modified util.sh to enable use of options including white spaces 28a374e [Kousuke Saruta] Modified java_gateway.py to recognize arguments 5afc584 [Cheng Lian] Filter out spark-submit options when starting Python gateway e630d19 [Cheng Lian] Fixing pyspark and spark-shell CLI options
* [SPARK-1766] sorted functions to meet pedantic requirements | Chris Cope | 2014-08-09 | 1 file | -19/+19
  Pedantry is underrated. Author: Chris Cope <ccope@resilientscience.com> Closes #1859 from copester/master and squashes the following commits: 0fb4499 [Chris Cope] [SPARK-1766] sorted functions to meet pedantic requirements