Commit message (Author, Age, Files, Lines)
...
* [SPARK-10003] Improve readability of DAGScheduler (Andrew Or, 2015-09-03, 1 file, -37/+9)
  Note: this is not intended to be in Spark 1.5!

  This patch rewrites some code in the `DAGScheduler` to make it more readable. In particular:
  - there were blocks of code that are unnecessary and removed for simplicity
  - there were abstractions that are unnecessary and made the code hard to navigate
  - other minor changes

  Author: Andrew Or <andrew@databricks.com>

  Closes #8217 from andrewor14/dag-scheduler-readability and squashes the following commits:

  57abca3 [Andrew Or] Move comment back into if case
  574fb1e [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-scheduler-readability
  64a9ed2 [Andrew Or] Remove unnecessary code + minor code rewrites
* [SPARK-10421] [BUILD] Exclude curator artifacts from tachyon dependencies. (Marcelo Vanzin, 2015-09-03, 1 file, -0/+8)
  This avoids them being mistakenly pulled instead of the newer ones that Spark actually uses. Spark only depends on these artifacts transitively, so sometimes maven just decides to pick tachyon's version of the dependency for whatever reason.

  Author: Marcelo Vanzin <vanzin@cloudera.com>

  Closes #8577 from vanzin/SPARK-10421.
* [SPARK-10435] Spark submit should fail fast for Mesos cluster mode with R (Andrew Or, 2015-09-03, 1 file, -0/+3)
  It's not supported yet so we should error with a clear message.

  Author: Andrew Or <andrew@databricks.com>

  Closes #8590 from andrewor14/mesos-cluster-r-guard.
* [SPARK-9591] [CORE] Job may fail for exception during getting remote block (jeanlyn, 2015-09-03, 3 files, -2/+80)
  [SPARK-9591](https://issues.apache.org/jira/browse/SPARK-9591)

  When getting a broadcast variable, we can fetch the block from several locations. Currently, however, connecting to a lost block manager (for example, one that was idle long enough to be removed by the driver under dynamic resource allocation) causes the task to fail, and in the worst case causes the job to fail.

  Author: jeanlyn <jeanlyn92@gmail.com>

  Closes #7927 from jeanlyn/catch_exception.
* [SPARK-10430] [CORE] Added hashCode methods in AccumulableInfo and RDDOperationScope (Vinod K C, 2015-09-03, 4 files, -1/+26)
  Author: Vinod K C <vinod.kc@huawei.com>

  Closes #8581 from vinodkc/fix_RDDOperationScope_Hashcode.
* [SPARK-9672] [MESOS] Don’t include SPARK_ENV_LOADED when passing env vars (Pat Shields, 2015-09-03, 2 files, -4/+25)
  This contribution is my original work and I license the work to the project under the project's open source license.

  Author: Pat Shields <yeoldefortran@gmail.com>

  Closes #7979 from pashields/env-loading-on-driver.
* [SPARK-9869] [STREAMING] Wait for all event notifications before asserting results (robbins, 2015-09-03, 1 file, -0/+3)
  Author: robbins <robbins@uk.ibm.com>

  Closes #8589 from robbinspg/InputStreamSuite-fix.
* [SPARK-10431] [CORE] Fix intermittent test failure. Wait for event queue to be clear (robbins, 2015-09-03, 1 file, -0/+4)
  Author: robbins <robbins@uk.ibm.com>

  Closes #8582 from robbinspg/InputOutputMetricsSuite.
* [SPARK-10432] spark.port.maxRetries documentation is unclear (Tom Graves, 2015-09-03, 1 file, -1/+5)
  Author: Tom Graves <tgraves@yahoo-inc.com>

  Closes #8585 from tgravescs/SPARK-10432.
* [SPARK-8951] [SPARKR] support Unicode characters in collect() (CHOIJAEHONG, 2015-09-03, 4 files, -8/+35)
  Spark gives an error message and does not show the output when a field of the result DataFrame contains CJK characters. I changed SerDe.scala so that Spark supports Unicode characters when it writes a string to R.

  Author: CHOIJAEHONG <redrock07@naver.com>

  Closes #7494 from CHOIJAEHONG1/SPARK-8951.
* [SPARK-9596] [SQL] treat hadoop classes as shared one in IsolatedClientLoader (WangTaoTheTonic, 2015-09-03, 1 file, -0/+1)
  https://issues.apache.org/jira/browse/SPARK-9596

  Author: WangTaoTheTonic <wangtao111@huawei.com>

  Closes #7931 from WangTaoTheTonic/SPARK-9596.
* [SPARK-10332] [CORE] Fix yarn spark executor validation (Holden Karau, 2015-09-03, 1 file, -0/+3)
  From Jira:
  Running spark-submit with yarn with number-executors equal to 0 when not using dynamic allocation should error out. In spark 1.5.0 it continues and ends up hanging. yarn.ClientArguments still has the check so something else must have changed.

  spark-submit --master yarn --deploy-mode cluster --class org.apache.spark.examples.SparkPi --num-executors 0 ....

  spark 1.4.1 errors with:
  java.lang.IllegalArgumentException: Number of executors was 0, but must be at least 1 (or 0 if dynamic executor allocation is enabled).

  Author: Holden Karau <holden@pigscanfly.ca>

  Closes #8580 from holdenk/SPARK-10332-spark-submit-to-yarn-executors-0-message.
* [SPARK-10411] [SQL] Move visualization above explain output and hide explain by default (zsxwing, 2015-09-02, 1 file, -5/+22)
  New screenshots after this fix:

  <img width="627" alt="s1" src="https://cloud.githubusercontent.com/assets/1000778/9625782/4b2dba36-518b-11e5-9104-c713ff026e3d.png">

  Default:

  <img width="462" alt="s2" src="https://cloud.githubusercontent.com/assets/1000778/9625817/92366e50-518b-11e5-9981-cdfb774d66b8.png">

  After clicking `+details`:

  <img width="377" alt="s3" src="https://cloud.githubusercontent.com/assets/1000778/9625784/4ba24342-518b-11e5-8522-846a16a95d44.png">

  Author: zsxwing <zsxwing@gmail.com>

  Closes #8570 from zsxwing/SPARK-10411.
* [SPARK-10379] preserve first page in UnsafeShuffleExternalSorter (Davies Liu, 2015-09-02, 3 files, -3/+8)
  Author: Davies Liu <davies@databricks.com>

  Closes #8543 from davies/preserve_page.
* [SPARK-10247] [CORE] improve readability of a test case in DAGSchedulerSuite (Imran Rashid, 2015-09-02, 1 file, -10/+47)
  This is pretty minor, just trying to improve the readability of `DAGSchedulerSuite`; I figure every bit helps. Before, whenever I read this test, I never knew what "should work" and "should be ignored" really meant -- this adds some asserts and updates comments to make it clearer. Also some reformatting per a suggestion from markhamstra on https://github.com/apache/spark/pull/7699

  Author: Imran Rashid <irashid@cloudera.com>

  Closes #8434 from squito/SPARK-10247.
* Removed code duplication in ShuffleBlockFetcherIterator (Evan Racah, 2015-09-02, 1 file, -8/+10)
  Added fetchUpToMaxBytes() to prevent having to update both code blocks when a change is made.

  Author: Evan Racah <ejracah@gmail.com>

  Closes #8514 from eracah/master.
* [SPARK-8707] RDD#toDebugString fails if any cached RDD has invalid partitions (navis.ryu, 2015-09-02, 2 files, -2/+6)
  Added numPartitions(evaluate: Boolean) to RDD. With "evaluate=true", the method is the same as "partitions.length". With "evaluate=false", it checks checked-out or already evaluated partitions in the RDD to get the number of partitions. In any other case, it returns -1. RDDInfo.partitionNum calls numPartition only when it's accessed.

  Author: navis.ryu <navis@apache.org>

  Closes #7127 from navis/SPARK-8707.
* [SPARK-5945] Spark should not retry a stage infinitely on a FetchFailedException (Ilya Ganelin, 2015-09-02, 3 files, -5/+320)
  The ```Stage``` class now tracks whether there were a sufficient number of consecutive failures of that stage to trigger an abort.

  To avoid an infinite loop of stage retries, we abort the job completely after 4 consecutive stage failures for one stage. We still allow more than 4 consecutive stage failures if there is an intervening successful attempt for the stage, so that in very long-lived applications, where a stage may get reused many times, we don't abort the job after failures that have been recovered from successfully.

  I've added test cases to exercise the most obvious scenarios.

  Author: Ilya Ganelin <ilya.ganelin@capitalone.com>

  Closes #5636 from ilganeli/SPARK-5945.
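The abort rule described for SPARK-5945 can be sketched in a few lines. This is only an illustration of the counting logic, not the code in the patch: `StageFailureTracker` and its methods are hypothetical names, and the threshold of 4 is taken from the commit message.

```python
class StageFailureTracker:
    """Hypothetical helper that counts consecutive failures of one stage.

    This is not Spark's Stage class; it only models the rule described in
    the commit message.
    """

    def __init__(self, max_consecutive_failures=4):
        self.max_consecutive_failures = max_consecutive_failures
        self.consecutive_failures = 0

    def record_failure(self):
        """Record a failed attempt; return True if the job should abort."""
        self.consecutive_failures += 1
        return self.consecutive_failures >= self.max_consecutive_failures

    def record_success(self):
        """A successful attempt resets the streak."""
        self.consecutive_failures = 0


tracker = StageFailureTracker()
# Three failures, one intervening success, then four failures in a row.
outcomes = ["fail", "fail", "fail", "ok", "fail", "fail", "fail", "fail"]
abort_at = None
for attempt, outcome in enumerate(outcomes, start=1):
    if outcome == "ok":
        tracker.record_success()
    elif tracker.record_failure():
        abort_at = attempt
        break

print(abort_at)  # 8: the success at attempt 4 reset the counter
```

Because the counter is reset on success, a long-lived application can accumulate many failures over time without aborting, as long as no four of them are consecutive.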
* [SPARK-9723] [ML] params getordefault should throw more useful error (Holden Karau, 2015-09-02, 3 files, -8/+15)
  Params.getOrDefault should throw a more meaningful exception than what you get from a bad key lookup.

  Author: Holden Karau <holden@pigscanfly.ca>

  Closes #8567 from holdenk/SPARK-9723-params-getordefault-should-throw-more-useful-error.
* [SPARK-10422] [SQL] String column in InMemoryColumnarCache needs to override clone method (Yin Huai, 2015-09-02, 2 files, -0/+22)
  https://issues.apache.org/jira/browse/SPARK-10422

  Author: Yin Huai <yhuai@databricks.com>

  Closes #8578 from yhuai/SPARK-10422.
* [SPARK-10417] [SQL] Iterating through Column results in infinite loop (0x0FFF, 2015-09-02, 2 files, -0/+12)
  The `pyspark.sql.column.Column` object has a `__getitem__` method, which makes it iterable for Python. In fact it has `__getitem__` to address the case when the column might be a list or dict, so that you can access a certain element of it in the DF API. The ability to iterate over it is just a side effect that might cause confusion for people getting familiar with Spark DF (as you might iterate this way on a Pandas DF, for instance).

  Issue reproduction:
  ```
  df = sqlContext.jsonRDD(sc.parallelize(['{"name": "El Magnifico"}']))
  for i in df["name"]:
      print(i)
  ```

  Author: 0x0FFF <programmerag@gmail.com>

  Closes #8574 from 0x0FFF/SPARK-10417.
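The side effect described for SPARK-10417 comes from Python's legacy iteration protocol: any object with `__getitem__` is iterable, and iteration stops only when `__getitem__` raises `IndexError`. The sketch below reproduces this without Spark. `IndexableOnly` and `NotIterable` are hypothetical stand-ins for a column-like class, and defining `__iter__` to raise `TypeError` is one possible guard, not necessarily what the patch does.

```python
from itertools import islice


class IndexableOnly:
    """Stand-in for a column-like object; this is not pyspark's Column."""

    def __getitem__(self, key):
        # Builds a new expression for any key and never raises IndexError,
        # so the legacy iteration protocol never terminates.
        return "col[%r]" % (key,)


class NotIterable(IndexableOnly):
    """Same indexing behaviour, but iteration is rejected explicitly."""

    def __iter__(self):
        raise TypeError("NotIterable object is not iterable")


# islice bounds what would otherwise be an endless loop.
first_three = list(islice(IndexableOnly(), 3))
print(first_three)  # ['col[0]', 'col[1]', 'col[2]']

try:
    iter(NotIterable())
    guarded = False
except TypeError:
    guarded = True
print(guarded)  # True
```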
* [SPARK-10004] [SHUFFLE] Perform auth checks when clients read shuffle data. (Marcelo Vanzin, 2015-09-02, 13 files, -36/+221)
  To correctly isolate applications, when requests to read shuffle data arrive at the shuffle service, proper authorization checks need to be performed. This change makes sure that only the application that created the shuffle data can read from it.

  Such checks are only enabled when "spark.authenticate" is enabled, otherwise there's no secure way to make sure that the client is really who it says it is.

  Author: Marcelo Vanzin <vanzin@cloudera.com>

  Closes #8218 from vanzin/SPARK-10004.
* [SPARK-10389] [SQL] support order by non-attribute grouping expression on Aggregate (Wenchen Fan, 2015-09-02, 2 files, -39/+52)
  For example, we can write `SELECT MAX(value) FROM src GROUP BY key + 1 ORDER BY key + 1` in PostgreSQL, and we should support this in Spark SQL.

  Author: Wenchen Fan <cloud0fan@outlook.com>

  Closes #8548 from cloud-fan/support-order-by-non-attribute.
* [SPARK-10034] [SQL] add regression test for Sort on Aggregate (Wenchen Fan, 2015-09-02, 2 files, -0/+18)
  Before #8371, there was a bug for `Sort` on `Aggregate`: we couldn't use aggregate expressions named `_aggOrdering`, and couldn't use more than one ordering expression containing aggregate functions.

  The reason for this bug is that the aggregate expression in `SortOrder` never got resolved; we aliased it with `_aggOrdering` and called `toAttribute`, which gives us an `UnresolvedAttribute`. So we were actually referencing the aggregate expression by name, not by exprId as we thought. If there was already an aggregate expression named `_aggOrdering`, or more than one ordering expression had aggregate functions, the names would conflict and the lookup by name would fail.

  However, after #8371 got merged, the `SortOrder`s are guaranteed to be resolved and we always reference aggregate expressions by exprId. The bug doesn't exist anymore, and this PR adds regression tests for it.

  Author: Wenchen Fan <cloud0fan@outlook.com>

  Closes #8231 from cloud-fan/sort-agg.
* [SPARK-7336] [HISTORYSERVER] Fix bug that applications status incorrect on JobHistory UI. (Chuan Shao, 2015-09-02, 1 file, -5/+22)
  Author: ArcherShao <shaochuan@huawei.com>

  Closes #5886 from ArcherShao/SPARK-7336.
* [SPARK-10392] [SQL] Pyspark - Wrong DateType support on JDBC connection (0x0FFF, 2015-09-01, 2 files, -2/+9)
  This PR addresses issue [SPARK-10392](https://issues.apache.org/jira/browse/SPARK-10392)

  The problem is that for the "start of epoch" date (01 Jan 1970), the PySpark class DateType returns 0 instead of a `datetime.date`, due to the implementation of its return statement.

  Issue reproduction on master:
  ```
  >>> from pyspark.sql.types import *
  >>> a = DateType()
  >>> a.fromInternal(0)
  0
  >>> a.fromInternal(1)
  datetime.date(1970, 1, 2)
  ```

  Author: 0x0FFF <programmerag@gmail.com>

  Closes #8556 from 0x0FFF/SPARK-10392.
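The reproduction for SPARK-10392 is consistent with a return statement that relies on truthiness, since `0` is falsy in Python. The sketch below shows that pattern and a `None` check that avoids it. It is an assumption about the shape of the bug, not a copy of the PySpark source; `from_internal_truthy` and `from_internal_checked` are hypothetical names.

```python
import datetime

# Ordinal of 1970-01-01, so that internal value 0 maps to the epoch date.
EPOCH_ORDINAL = datetime.date(1970, 1, 1).toordinal()


def from_internal_truthy(v):
    # Assumed shape of the bug: `and` short-circuits on 0 because 0 is falsy,
    # so day 0 is returned unchanged instead of being converted.
    return v and datetime.date.fromordinal(v + EPOCH_ORDINAL)


def from_internal_checked(v):
    # Only None is treated as "no value"; 0 is a valid day offset.
    if v is not None:
        return datetime.date.fromordinal(v + EPOCH_ORDINAL)
    return None


print(from_internal_truthy(0))   # 0
print(from_internal_truthy(1))   # 1970-01-02
print(from_internal_checked(0))  # 1970-01-01
```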
* [SPARK-10162] [SQL] Fix the timezone omitting for PySpark Dataframe filter function (0x0FFF, 2015-09-01, 2 files, -10/+23)
  This PR addresses [SPARK-10162](https://issues.apache.org/jira/browse/SPARK-10162)

  The issue is with the DataFrame filter() function, if a datetime.datetime is passed to it:
  * Timezone information of this datetime is ignored
  * This datetime is assumed to be in the local timezone, which depends on the OS timezone setting

  The fix includes both a code change and a regression test. Problem reproduction code on master:
  ```python
  import pytz
  from datetime import datetime
  from pyspark.sql import *
  from pyspark.sql.types import *
  sqc = SQLContext(sc)
  df = sqc.createDataFrame([], StructType([StructField("dt", TimestampType())]))

  m1 = pytz.timezone('UTC')
  m2 = pytz.timezone('Etc/GMT+3')

  df.filter(df.dt > datetime(2000, 1, 1, tzinfo=m1)).explain()
  df.filter(df.dt > datetime(2000, 1, 1, tzinfo=m2)).explain()
  ```

  It gives the same timestamp, ignoring the time zone:
  ```
  >>> df.filter(df.dt > datetime(2000, 1, 1, tzinfo=m1)).explain()
  Filter (dt#0 > 946713600000000)
   Scan PhysicalRDD[dt#0]

  >>> df.filter(df.dt > datetime(2000, 1, 1, tzinfo=m2)).explain()
  Filter (dt#0 > 946713600000000)
   Scan PhysicalRDD[dt#0]
  ```

  After the fix:
  ```
  >>> df.filter(df.dt > datetime(2000, 1, 1, tzinfo=m1)).explain()
  Filter (dt#0 > 946684800000000)
   Scan PhysicalRDD[dt#0]

  >>> df.filter(df.dt > datetime(2000, 1, 1, tzinfo=m2)).explain()
  Filter (dt#0 > 946695600000000)
   Scan PhysicalRDD[dt#0]
  ```

  PR [8536](https://github.com/apache/spark/pull/8536) was accidentally closed when I dropped the repo.

  Author: 0x0FFF <programmerag@gmail.com>

  Closes #8555 from 0x0FFF/SPARK-10162.
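The "after the fix" values above can be checked with the standard library alone. The sketch below converts a timezone-aware `datetime` to microseconds since the Unix epoch. `to_epoch_micros` is a hypothetical helper rather than the function changed by the patch, and a fixed UTC-03:00 offset stands in for `Etc/GMT+3` (POSIX-style zone names invert the sign).

```python
import calendar
from datetime import datetime, timedelta, timezone


def to_epoch_micros(dt):
    """Convert a timezone-aware datetime to microseconds since the epoch."""
    if dt.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    # utctimetuple() applies the UTC offset; timegm() interprets it as UTC.
    seconds = calendar.timegm(dt.utctimetuple())
    return seconds * 1000000 + dt.microsecond


utc = timezone.utc
gmt_plus_3 = timezone(timedelta(hours=-3))  # stand-in for 'Etc/GMT+3'

a = to_epoch_micros(datetime(2000, 1, 1, tzinfo=utc))
b = to_epoch_micros(datetime(2000, 1, 1, tzinfo=gmt_plus_3))
print(a)  # 946684800000000
print(b)  # 946695600000000
```

The two results differ by exactly three hours (10,800,000,000 microseconds), which matches the filter constants shown after the fix.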
* [SPARK-4223] [CORE] Support * in acls. (zhuol, 2015-09-01, 3 files, -7/+69)
  SPARK-4223.

  Currently we support setting view and modify acls but you have to specify a list of users. It would be nice to support * meaning all users have access.

  Manual tests to verify that "*" works for any user in:
  a. Spark ui: view and kill stage. Done.
  b. Spark history server. Done.
  c. Yarn application killing. Done.

  Author: zhuol <zhuol@yahoo-inc.com>

  Closes #8398 from zhuoliu/4223.
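The wildcard behaviour requested in SPARK-4223 amounts to one extra membership test. The sketch below is a minimal illustration under assumed names: `parse_acls` and `has_access` are hypothetical, and the comma-separated input format is an assumption rather than something stated in the commit message.

```python
WILDCARD = "*"


def parse_acls(value):
    """Split a comma-separated ACL string into a set of user names."""
    return {name.strip() for name in value.split(",") if name.strip()}


def has_access(user, acls):
    """Return True if the ACL set names the user or contains the wildcard."""
    return WILDCARD in acls or user in acls


print(has_access("carol", parse_acls("alice, bob")))  # False
print(has_access("carol", parse_acls("*")))           # True
```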
* [SPARK-10398] [DOCS] Migrate Spark download page to use new lua mirroring scripts (Sean Owen, 2015-09-01, 2 files, -2/+2)
  Migrate Apache download closer.cgi refs to new closer.lua

  This is the bit of the change that affects the project docs; I'm implementing the changes to the Apache site separately.

  Author: Sean Owen <sowen@cloudera.com>

  Closes #8557 from srowen/SPARK-10398.
* [SPARK-9679] [ML] [PYSPARK] Add Python API for Stop Words Remover (Holden Karau, 2015-09-01, 4 files, -8/+93)
  Add a python API for the Stop Words Remover.

  Author: Holden Karau <holden@pigscanfly.ca>

  Closes #8118 from holdenk/SPARK-9679-python-StopWordsRemover.
* [SPARK-10301] [SQL] Fixes schema merging for nested structs (Cheng Lian, 2015-09-01, 7 files, -125/+653)
  This PR can be quite challenging to review. I'm trying to give a detailed description of the problem as well as its solution here.

  When reading Parquet files, we need to specify a potentially nested Parquet schema (of type `MessageType`) as requested schema for column pruning. This Parquet schema is translated from a Catalyst schema (of type `StructType`), which is generated by the query planner and represents all requested columns. However, this translation can be fairly complicated for several reasons:

  1. Requested schema must conform to the real schema of the physical file to be read.

     This means we have to tailor the actual file schema of every individual physical Parquet file to be read according to the given Catalyst schema. Fortunately we are already doing this in Spark 1.5 by pushing request schema conversion to executor side in PR #7231.

  2. Support for schema merging.

     A single Parquet dataset may consist of multiple physical Parquet files that come with different but compatible schemas. This means we may request a column path that doesn't exist in a physical Parquet file. All requested column paths can be nested. For example, for a Parquet file schema

  ```
  message root {
    required group f0 {
      required group f00 {
        required int32 f000;
        required binary f001 (UTF8);
      }
    }
  }
  ```

     we may request column paths defined in the following schema:

  ```
  message root {
    required group f0 {
      required group f00 {
        required binary f001 (UTF8);
        required float f002;
      }
    }

    optional double f1;
  }
  ```

     Notice that we pruned column path `f0.f00.f000`, but added `f0.f00.f002` and `f1`.

     The good news is that Parquet handles non-existing column paths properly and always returns null for them.

  3. The map from `StructType` to `MessageType` is a one-to-many map.

     This is the most unfortunate part. Due to historical reasons (dark histories!), schemas of Parquet files generated by different libraries have different "flavors". For example, to handle a schema with a single non-nullable column, whose type is an array of non-nullable integers, parquet-protobuf generates the following Parquet schema:

  ```
  message m0 {
    repeated int32 f;
  }
  ```

     while parquet-avro generates another version:

  ```
  message m1 {
    required group f (LIST) {
      repeated int32 array;
    }
  }
  ```

     and parquet-thrift spills this:

  ```
  message m2 {
    required group f (LIST) {
      repeated int32 f_tuple;
    }
  }
  ```

     All of them can be mapped to the following _unique_ Catalyst schema:

  ```
  StructType(
    StructField(
      "f",
      ArrayType(IntegerType, containsNull = false),
      nullable = false))
  ```

     This greatly complicates Parquet requested schema construction, since the path of a given column varies in different cases. To read the array elements from files with the above schemas, we must use `f` for `m0`, `f.array` for `m1`, and `f.f_tuple` for `m2`.

  In earlier Spark versions, we didn't try to fix this issue properly. Spark 1.4 and prior versions simply translate the Catalyst schema in a way that is more or less compatible with parquet-hive and parquet-avro, but is broken in many other cases. Earlier revisions of Spark 1.5 only try to tailor the Parquet file schema at the first level, and ignore nested ones. This caused [SPARK-10301] [spark-10301] as well as [SPARK-10005] [spark-10005]. In PR #8228, I tried to avoid the hard part of the problem and made a minimum change in `CatalystRowConverter` to fix SPARK-10005. However, when taking SPARK-10301 into consideration, continuing to hack `CatalystRowConverter` doesn't seem to be a good idea. So this PR is an attempt to fix the problem in a proper way.
  For a given physical Parquet file with schema `ps` and a compatible Catalyst requested schema `cs`, we use the following algorithm to tailor `ps` to get the result Parquet requested schema `ps'`:

  For a leaf column path `c` in `cs`:

  - if a corresponding Parquet column path `c'` can be found in `ps`, `c'` should be included in `ps'`;
  - otherwise, we convert `c` to a Parquet column path `c"` using `CatalystSchemaConverter`, and include `c"` in `ps'`;
  - no other column paths should exist in `ps'`.

  Then comes the most tedious part:

  > Given `cs`, `ps`, and `c`, how to locate `c'` in `ps`?

  Unfortunately, there's no quick answer, and we have to enumerate all possible structures defined in the parquet-format spec. They are:

  1. the standard structure of nested types, and
  2. cases defined in all backwards-compatibility rules for `LIST` and `MAP`.

  The core part of this PR is `CatalystReadSupport.clipParquetType()`, which tailors a given Parquet file schema according to a requested schema in its Catalyst form. Backwards-compatibility rules of `LIST` and `MAP` are covered in `clipParquetListType()` and `clipParquetMapType()` respectively. The column path selection algorithm is implemented in `clipParquetGroupFields()`.

  With this PR, we no longer need to do schema tailoring in `CatalystReadSupport` and `CatalystRowConverter`. Another benefit is that we can now also read Parquet datasets that consist of files with different physical Parquet schemas but share the same logical schema, for example, files generated by different Parquet libraries. This situation is illustrated by [this test case] [test-case].
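The column path selection rule above can be sketched for the simplest case, the standard structure of nested types. The model here is an assumption made for illustration only: schemas are plain nested dicts (a group maps field names to sub-schemas, a leaf is a type string), `clip_schema` is a hypothetical function, and the backwards-compatibility rules for `LIST` and `MAP`, which are the hard part of the PR, are not handled.

```python
def clip_schema(file_schema, requested_schema):
    """Tailor file_schema (ps) to requested_schema (cs) and return ps'.

    Assumed model: a group is a dict of field name -> sub-schema, a leaf is
    a type string. Only requested paths appear in the result.
    """
    clipped = {}
    for name, requested in requested_schema.items():
        existing = file_schema.get(name)
        if isinstance(requested, dict):
            if isinstance(existing, dict):
                # Both sides are groups: recurse into the nested fields.
                clipped[name] = clip_schema(existing, requested)
            else:
                # Group missing from the file: use the converted form.
                clipped[name] = requested
        elif isinstance(existing, str):
            # Leaf found in the file: keep the file's own definition (c').
            clipped[name] = existing
        else:
            # Leaf missing from the file: use the converted form (c").
            clipped[name] = requested
    return clipped


file_schema = {
    "f0": {"f00": {"f000": "required int32",
                   "f001": "required binary (UTF8)"}},
}
requested_schema = {
    "f0": {"f00": {"f001": "optional binary (UTF8)",
                   "f002": "optional float"}},
    "f1": "optional double",
}

result = clip_schema(file_schema, requested_schema)
print(result)
```

In this example `f0.f00.f000` is pruned, `f0.f00.f001` keeps the definition found in the file, and `f0.f00.f002` and `f1` are taken from the requested schema because the file does not contain them.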
  [spark-10301]: https://issues.apache.org/jira/browse/SPARK-10301
  [spark-10005]: https://issues.apache.org/jira/browse/SPARK-10005
  [test-case]: https://github.com/liancheng/spark/commit/38644d8a45175cbdf20d2ace021c2c2544a50ab3#diff-a9b98e28ce3ae30641829dffd1173be2R26

  Author: Cheng Lian <lian@databricks.com>

  Closes #8509 from liancheng/spark-10301/fix-parquet-requested-schema.
* [SPARK-10378][SQL][Test] Remove HashJoinCompatibilitySuite. (Reynold Xin, 2015-08-31, 1 file, -169/+0)
  They don't bring much value since we now have better unit test coverage for hash joins. This will also help reduce the test time.

  Author: Reynold Xin <rxin@databricks.com>

  Closes #8542 from rxin/SPARK-10378.
* [SPARK-10355] [ML] [PySpark] Add Python API for SQLTransformer (Yanbo Liang, 2015-08-31, 1 file, -3/+54)
  Add Python API for SQLTransformer

  Author: Yanbo Liang <ybliang8@gmail.com>

  Closes #8527 from yanboliang/spark-10355.
* [SPARK-10349] [ML] OneVsRest use 'when ... otherwise' not UDF to generate new label at binary reduction (Yanbo Liang, 2015-08-31, 1 file, -8/+2)
  Currently OneVsRest uses a UDF to generate the new binary label during training. Considering that [SPARK-7321](https://issues.apache.org/jira/browse/SPARK-7321) has been merged, we can use ```when ... otherwise```, which will be more efficient.

  Author: Yanbo Liang <ybliang8@gmail.com>

  Closes #8519 from yanboliang/spark-10349.
* [SPARK-10341] [SQL] fix memory starving in unsafe SMJ (Davies Liu, 2015-08-31, 3 files, -6/+42)
  In SMJ, the first ExternalSorter could consume all the memory before spilling, so the second cannot even acquire its first page.

  Until we have a better memory allocator, SMJ should call prepare() before calling any compute() of its children.

  cc rxin JoshRosen

  Author: Davies Liu <davies@databricks.com>

  Closes #8511 from davies/smj_memory.
* [SPARK-8472] [ML] [PySpark] Python API for DCT (Yanbo Liang, 2015-08-31, 1 file, -1/+64)
  Add Python API for ml.feature.DCT.

  Author: Yanbo Liang <ybliang8@gmail.com>

  Closes #8485 from yanboliang/spark-8472.
* [SPARK-9954] [MLLIB] use first 128 nonzeros to compute Vector.hashCode (Xiangrui Meng, 2015-08-31, 1 file, -17/+21)
  This could help reduce hash collisions, e.g., in `RDD[Vector].repartition`. jkbradley

  Author: Xiangrui Meng <meng@databricks.com>

  Closes #8182 from mengxr/SPARK-9954.
* [SPARK-10170] [SQL] Add DB2 JDBC dialect support. (sureshthalamati, 2015-08-31, 2 files, -0/+25)
  Data frame writes to a DB2 database fail because, by default, the JDBC data source implementation generates a table schema with data types that DB2 does not support: TEXT for String and BIT(1) for Boolean.

  This patch registers a DB2 JDBC dialect that maps String and Boolean to valid DB2 data types.

  Author: sureshthalamati <suresh.thalamati@gmail.com>

  Closes #8393 from sureshthalamati/db2_dialect_spark-10170.
* [SPARK-10369] [STREAMING] Don't remove ReceiverTrackingInfo when deregisterReceivering since we may reuse it later (zsxwing, 2015-08-31, 2 files, -2/+53)
  `deregisterReceiver` should not remove `ReceiverTrackingInfo`. Otherwise, it will throw `java.util.NoSuchElementException: key not found` when restarting it.

  Author: zsxwing <zsxwing@gmail.com>

  Closes #8538 from zsxwing/SPARK-10369.
* [SPARK-8730] Fixes - Deser objects containing a primitive class attribute (EugenCepoi, 2015-08-31, 2 files, -5/+40)
  Author: EugenCepoi <cepoi.eugen@gmail.com>

  Closes #7122 from EugenCepoi/master.
* [SPARK-10354] [MLLIB] fix some apparent memory issues in k-means|| initialization (Xiangrui Meng, 2015-08-30, 1 file, -7/+14)
  * do not cache first cost RDD
  * change following cost RDD cache level to MEMORY_AND_DISK
  * remove Vector wrapper to save an object per instance

  Further improvements will be addressed in SPARK-10329

  cc: yu-iskw HuJiayin

  Author: Xiangrui Meng <meng@databricks.com>

  Closes #8526 from mengxr/SPARK-10354.
* [SPARK-10351] [SQL] Fixes UTF8String.fromAddress to handle off-heap memory (Feynman Liang, 2015-08-30, 2 files, -9/+6)
  CC rxin marmbrus

  Author: Feynman Liang <fliang@databricks.com>

  Closes #8523 from feynmanliang/SPARK-10351.
* SPARK-9545, SPARK-9547: Use Maven in PRB if title contains "[test-maven]" (Patrick Wendell, 2015-08-30, 2 files, -4/+42)
  This is just some small glue code to actually make use of the AMPLAB_JENKINS_BUILD_TOOL switch. As far as I can tell, we actually don't currently use the Maven support in the tool even though it exists. This patch switches to Maven when the PR title contains "test-maven".

  There are a few small other pieces of cleanup in the patch as well.

  Author: Patrick Wendell <patrick@databricks.com>

  Closes #7878 from pwendell/maven-tests.
* [SPARK-10353] [MLLIB] BLAS gemm not scaling when beta = 0.0 for some subset of matrix multiplications (Burak Yavuz, 2015-08-30, 2 files, -16/+15)
  mengxr jkbradley rxin

  It would be great if this fix made it into RC3!

  Author: Burak Yavuz <brkyvz@gmail.com>

  Closes #8525 from brkyvz/blas-scaling.
* [SPARK-10184] [CORE] Optimization for bounds determination in RangePartitioner (ihainan, 2015-08-30, 1 file, -1/+1)
  JIRA Issue: https://issues.apache.org/jira/browse/SPARK-10184

  Change `cumWeight > target` to `cumWeight >= target` in `RangePartitioner.determineBounds` method to make the output partitions more balanced.

  Author: ihainan <ihainan72@gmail.com>

  Closes #8397 from ihainan/opt_for_rangepartitioner.
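The effect of switching from `>` to `>=` can be seen on a small example. The sketch below is a simplified Python rendering of a cumulative-weight bounds search, written for illustration; it is not the Scala code of `RangePartitioner.determineBounds`, and `determine_bounds` with its `inclusive` flag is a hypothetical helper.

```python
def determine_bounds(candidates, partitions, inclusive=True):
    """Pick partitions - 1 upper bounds from (key, weight) candidates.

    A key becomes a bound once the cumulative weight reaches the running
    target; `inclusive` selects `>=` (True) or `>` (False).
    """
    ordered = sorted(candidates)
    step = sum(weight for _, weight in ordered) / partitions
    cum_weight = 0.0
    target = step
    bounds = []
    for key, weight in ordered:
        if len(bounds) >= partitions - 1:
            break
        cum_weight += weight
        reached = cum_weight >= target if inclusive else cum_weight > target
        # Skip duplicate keys so that bounds stay strictly increasing.
        if reached and (not bounds or key > bounds[-1]):
            bounds.append(key)
            target += step
    return bounds


# Six keys of equal weight, split into three partitions.
candidates = [(key, 1.0) for key in range(1, 7)]
strict = determine_bounds(candidates, 3, inclusive=False)
balanced = determine_bounds(candidates, 3, inclusive=True)
print(strict)    # [3, 5]
print(balanced)  # [2, 4]
```

With keys less than or equal to a bound assigned to that bound's partition, the strict comparison yields partitions of 3, 2 and 1 keys, while the inclusive comparison yields 2, 2 and 2.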
* [SPARK-10331] [MLLIB] Update example code in ml-guide (Xiangrui Meng, 2015-08-29, 1 file, -215/+147)
  * The example code was added in 1.2, before `createDataFrame`. This PR switches to `createDataFrame`. Java code still uses JavaBean.
  * assume `sqlContext` is available
  * fix some minor issues from previous code review

  jkbradley srowen feynmanliang

  Author: Xiangrui Meng <meng@databricks.com>

  Closes #8518 from mengxr/SPARK-10331.
* [SPARK-10348] [MLLIB] updates ml-guide (Xiangrui Meng, 2015-08-29, 2 files, -52/+78)
  * replace `ML Dataset` by `DataFrame` to unify the abstraction
  * ML algorithms -> pipeline components to describe the main concept
  * remove Scala API doc links from the main guide
  * `Section Title` -> `Section title` to be consistent with other section titles in MLlib guide
  * modified lines break at 100 chars or periods

  jkbradley feynmanliang

  Author: Xiangrui Meng <meng@databricks.com>

  Closes #8517 from mengxr/SPARK-10348.
* [SPARK-9986] [SPARK-9991] [SPARK-9993] [SQL] Create a simple test framework for local operators (zsxwing, 2015-08-29, 14 files, -55/+509)
  This PR includes the following changes:
  - Add `LocalNodeTest` for local operator tests and add unit tests for FilterNode and ProjectNode.
  - Add `LimitNode` and `UnionNode` and their unit tests to show how to use `LocalNodeTest`. (SPARK-9991, SPARK-9993)

  Author: zsxwing <zsxwing@gmail.com>

  Closes #8464 from zsxwing/local-execution.
* [SPARK-10339] [SPARK-10334] [SPARK-10301] [SQL] Partitioned table scan can OOM driver and throw a better error message when users need to enable parquet schema merging (Yin Huai, 2015-08-29, 3 files, -42/+65)
  This fixes the problem that scanning a partitioned table puts the driver under high memory pressure and takes down the cluster. Also, with this fix, we will be able to correctly show the query plan of a query consuming partitioned tables.

  https://issues.apache.org/jira/browse/SPARK-10339
  https://issues.apache.org/jira/browse/SPARK-10334

  Finally, this PR squeezes in a "quick fix" for SPARK-10301. It is not a real fix; it just throws a better error message to let the user know what to do.

  Author: Yin Huai <yhuai@databricks.com>

  Closes #8515 from yhuai/partitionedTableScan.
* [SPARK-10330] Use SparkHadoopUtil TaskAttemptContext reflection methods in more places (Josh Rosen, 2015-08-29, 5 files, -12/+28)
  SparkHadoopUtil contains methods that use reflection to work around TaskAttemptContext binary incompatibilities between Hadoop 1.x and 2.x. We should use these methods in more places.

  Author: Josh Rosen <joshrosen@databricks.com>

  Closes #8499 from JoshRosen/use-hadoop-reflection-in-more-places.