Commit message / Author / Age / Files / Lines
* [SPARK-9817][YARN] Improve the locality calculation of containers by taking ↵jerryshao2015-11-025-40/+159
| | | | | | | | | | | | pending container requests into consideration This is a follow-up PR to further improve the locality calculation by considering pending container requests. Since the locality preferences of tasks may shift over time, the localities of pending container requests may no longer fully match the new preferences. This PR improves this by removing outdated, unmatched container requests and replacing them with new requests. sryza please help to review, thanks a lot. Author: jerryshao <sshao@hortonworks.com> Closes #8100 from jerryshao/SPARK-9817.
* [SPARK-11311][SQL] spark cannot describe temporary functionsDaoyuan Wang2015-11-022-1/+15
| | | | | | | | When describing a temporary function, Spark would return 'Unable to find function', which is incorrect. Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #9277 from adrian-wang/functionreg.
* [SPARK-10786][SQL] Take the whole statement to generate the CommandProcessorhuangzhaowei2015-11-021-1/+1
| | | | | | | | | | | | | | | | | | | In the current implementation of `SparkSQLCLIDriver.scala`: `val proc: CommandProcessor = CommandProcessorFactory.get(Array(tokens(0)), hconf)` `CommandProcessorFactory` only takes the first token of the statement, which makes it hard to distinguish the statements `delete jar xxx` and `delete from xxx`. So it may be better to pass the whole statement to `CommandProcessorFactory`. [HiveCommand](https://github.com/SaintBacchus/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/processors/HiveCommand.java#L76) already has special handling for these two statements:

```java
if(command.length > 1 && "from".equalsIgnoreCase(command[1])) {
  //special handling for SQL "delete from <table> where..."
  return null;
}
```

Author: huangzhaowei <carlmartinmax@gmail.com> Closes #8895 from SaintBacchus/SPARK-10786.
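A minimal Python sketch of why the first token alone is ambiguous. The `pick_processor` helper and its return labels are hypothetical and are not Hive's or Spark's API; the sketch only mirrors the `delete from` special case quoted above.

```python
def pick_processor(tokens):
    """Return a label for the processor that should handle a statement.

    Hypothetical stand-in for a command-processor lookup: "delete" is
    treated as a resource command (e.g. `delete jar`) unless the second
    token is "from", in which case the statement is ordinary SQL.
    """
    if not tokens:
        return "sql"
    if tokens[0].lower() == "delete":
        if len(tokens) > 1 and tokens[1].lower() == "from":
            return "sql"
        return "delete-resource"
    return "sql"


statement = "delete from t where id = 1".split()
first_only = pick_processor(statement[:1])  # only the first token is visible
whole = pick_processor(statement)           # the whole statement is visible
```

With only the first token, the SQL statement is misclassified as a resource command; with the whole statement, the second token disambiguates it.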
* [SPARK-11413][BUILD] Bump joda-time version to 2.9 for java 8 and s3Yongjia Wang2015-11-021-1/+1
| | | | | | | | | It's a known issue that joda-time before 2.8.1 is incompatible with java 1.8u60 or later, which causes S3 requests to fail. This affects Spark when using S3 as a data source. https://github.com/aws/aws-sdk-java/issues/444 Author: Yongjia Wang <yongjiaw@gmail.com> Closes #9379 from yongjiaw/SPARK-11413.
* [SPARK-11271][SPARK-11016][CORE] Use Spark BitSet instead of RoaringBitmap ↵Liang-Chi Hsieh2015-11-027-33/+82
| | | | | | | | | | | | to reduce memory usage JIRA: https://issues.apache.org/jira/browse/SPARK-11271 As reported in the JIRA ticket, when there are too many tasks, the memory usage of MapStatus can cause problems. Using BitSet instead of RoaringBitmap should be more memory-efficient. Author: Liang-Chi Hsieh <viirya@appier.com> Closes #9243 from viirya/mapstatus-bitset.
* [SPARK-9722] [ML] Pass random seed to spark.ml DecisionTree*Yu ISHIKAWA2015-11-011-3/+5
| | | | | | Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #9402 from yu-iskw/SPARK-9722.
* [SPARK-9298][SQL] Add pearson correlation aggregation functionLiang-Chi Hsieh2015-11-017-2/+311
| | | | | | | | | | JIRA: https://issues.apache.org/jira/browse/SPARK-9298 This patch adds pearson correlation aggregation function based on `AggregateExpression2`. Author: Liang-Chi Hsieh <viirya@appier.com> Closes #8587 from viirya/corr_aggregation.
* [SPARK-11073][CORE][YARN] Remove akka dependency in secret key generation.Marcelo Vanzin2015-11-018-83/+138
| | | | | | | | | | Use standard JDK APIs for that (with a little help from Guava). Most of the changes here are in test code, since there were no tests specific to that part of the code. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #9257 from vanzin/SPARK-11073.
* [SPARK-11020][CORE] Wait for HDFS to leave safe mode before initializing HS.Marcelo Vanzin2015-11-012-3/+166
| | | | | | | | | | | Large HDFS clusters may take a while to leave safe mode when starting; this change makes the HS wait for that before doing checks about its configuration. This means the HS won't stop right away if HDFS is in safe mode and the configuration is not correct, but that should be a very uncommon situation. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #9043 from vanzin/SPARK-11020.
* [SPARK-11410][SQL] Add APIs to provide functionality similar to Hive's ↵Nong Li2015-11-014-19/+186
| | | | | | | | | | | | | | | | | | | DISTRIBUTE BY and SORT BY. DISTRIBUTE BY allows the user to hash partition the data by specified exprs. It also allows for optionally sorting within each resulting partition. There is no required relationship between the exprs for partitioning and sorting (i.e. one does not need to be a prefix of the other). This patch adds two APIs to DataFrames which can be used together to provide this functionality: 1. distributeBy() which partitions the data frame into a specified number of partitions using the partitioning exprs. 2. localSort() which sorts each partition using the provided sorting exprs. To get the DISTRIBUTE BY functionality, the user simply does: df.distributeBy(...).localSort(...) Author: Nong Li <nongli@gmail.com> Closes #9364 from nongli/spark-11410.
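The semantics described above can be sketched in plain Python, without Spark. The helper names `distribute_by` and `local_sort` below are illustrative stand-ins that only echo the API names in the commit message; they operate on lists, not DataFrames, and integer keys are used so that `hash` is deterministic.

```python
from collections import defaultdict


def distribute_by(rows, key, num_partitions):
    """Hash-partition rows into num_partitions buckets using key(row)."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[hash(key(row)) % num_partitions].append(row)
    return [buckets[i] for i in range(num_partitions)]


def local_sort(partitions, key):
    """Sort each partition independently; no global order is implied."""
    return [sorted(part, key=key) for part in partitions]


# (department id, salary): partition by department, sort by salary.
rows = [(1, 30), (2, 10), (1, 20), (2, 5), (3, 7)]
partitions = local_sort(
    distribute_by(rows, key=lambda r: r[0], num_partitions=2),
    key=lambda r: r[1],
)
```

The sort key (salary) is unrelated to the partitioning key (department), which matches the note that neither expression list needs to be a prefix of the other.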
* [SPARK-11338] [WEBUI] Prepend app links on HistoryPage with uiRoot pathChristian Kadner2015-11-012-7/+23
| | | | | | | | | | | | | | [SPARK-11338: HistoryPage not multi-tenancy enabled ...](https://issues.apache.org/jira/browse/SPARK-11338) - `HistoryPage.scala` ...prepending all page links with the web proxy (`uiRoot`) path - `HistoryServerSuite.scala` ...adding a test case to verify all site-relative links are prefixed when the environment variable `APPLICATION_WEB_PROXY_BASE` (or System property `spark.ui.proxyBase`) is set Author: Christian Kadner <ckadner@us.ibm.com> Closes #9291 from ckadner/SPARK-11338 and squashes the following commits: 01d2f35 [Christian Kadner] [SPARK-11338][WebUI] nit fixes d054bd7 [Christian Kadner] [SPARK-11338][WebUI] prependBaseUri in method makePageLink 8bcb3dc [Christian Kadner] [SPARK-11338][WebUI] Prepend application links on HistoryPage with uiRoot path
* [SPARK-11305][DOCS] Remove Third-Party Hadoop Distributions Doc PageSean Owen2015-11-016-129/+19
| | | | | | | | | | Remove Hadoop third party distro page, and move Hadoop cluster config info to configuration page CC pwendell Author: Sean Owen <sowen@cloudera.com> Closes #9298 from srowen/SPARK-11305.
* [SPARK-11117] [SPARK-11345] [SQL] Makes all HadoopFsRelation data sources ↵Cheng Lian2015-10-3112-59/+156
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | produce UnsafeRow

This PR fixes two issues:

1. `PhysicalRDD.outputsUnsafeRows` is always `false`

   Thus a `ConvertToUnsafe` operator is often required even if the underlying data source relation does output `UnsafeRow`.

2. Internal/external row conversion for `HadoopFsRelation` is kinda messy

   Currently we're using `HadoopFsRelation.needConversion` and [dirty type erasure hacks][1] to indicate whether the relation outputs external rows or internal rows, and apply external-to-internal conversion when necessary. Basically, all builtin `HadoopFsRelation` data sources, i.e. Parquet, JSON, ORC, and Text, output `InternalRow`, while typical external `HadoopFsRelation` data sources, e.g. spark-avro and spark-csv, output `Row`.

This PR adds a `private[sql]` interface method `HadoopFsRelation.buildInternalScan`, which by default invokes `HadoopFsRelation.buildScan` and converts `Row`s to `UnsafeRow`s (which are also `InternalRow`s). All builtin `HadoopFsRelation` data sources override this method and directly output `UnsafeRow`s. In this way, `HadoopFsRelation` now always produces `UnsafeRow`s. Thus `PhysicalRDD.outputsUnsafeRows` can be properly set by checking whether the underlying data source is a `HadoopFsRelation`.

A remaining question is whether we can assume that all non-builtin `HadoopFsRelation` data sources output external rows. At least all well known ones do so. However, it's possible that some users implemented their own `HadoopFsRelation` data sources that leverage `InternalRow` and thus all those unstable internal data representations. If this assumption is safe, we can deprecate `HadoopFsRelation.needConversion` and clean up some more conversion code (like [here][2] and [here][3]).

This PR supersedes #9125.

Follow-ups:

1. Makes JSON and ORC data sources output `UnsafeRow` directly
2. Makes `HiveTableScan` output `UnsafeRow` directly

   This is related to 1 since the ORC data source shares the same `Writable` unwrapping code with `HiveTableScan`.

[1]: https://github.com/apache/spark/blob/v1.5.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala#L353
[2]: https://github.com/apache/spark/blob/v1.5.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L331-L335
[3]: https://github.com/apache/spark/blob/v1.5.1/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala#L630-L669

Author: Cheng Lian <lian@databricks.com> Closes #9305 from liancheng/spark-11345.unsafe-hadoop-fs-relation.
* [SPARK-11265][YARN] YarnClient can't get tokens to talk to Hive 1.2.1 in a ↵Steve Loughran2015-10-314-49/+129
| | | | | | | | | | secure cluster This is a fix for SPARK-11265; the introspection code to get Hive delegation tokens was failing on Spark 1.5.1+, due to changes in the Hive codebase. Author: Steve Loughran <stevel@hortonworks.com> Closes #9232 from steveloughran/stevel/patches/SPARK-11265-hive-tokens.
* [SPARK-11024][SQL] Optimize NULL in <inlist-expressions> by folding it to ↵Dilip Biswal2015-10-312-1/+55
| | | | | | | | | | | | | | | Literal(null) Add a rule in the optimizer to convert NULL [NOT] IN (expr1,...,expr2) to Literal(null). This is a follow-up defect to SPARK-8654. cloud-fan, can you please take a look? Author: Dilip Biswal <dbiswal@us.ibm.com> Closes #9348 from dilipbiswal/spark_11024.
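The folding is valid because, under SQL three-valued logic, a NULL left-hand side makes both IN and NOT IN evaluate to NULL for any non-empty list. A small Python sketch of those semantics, using `None` for SQL NULL; `sql_in` and `sql_not` are illustrative helpers, not Catalyst code.

```python
def sql_in(value, candidates):
    """Three-valued SQL IN over a non-empty list: True, False, or None."""
    if value is None:
        return None  # NULL compared with anything is NULL
    saw_null = False
    for candidate in candidates:
        if candidate is None:
            saw_null = True
        elif candidate == value:
            return True
    # No match: the result is unknown if a NULL candidate was present.
    return None if saw_null else False


def sql_not(value):
    """Three-valued SQL NOT: NOT NULL is still NULL."""
    return None if value is None else (not value)


null_in = sql_in(None, [1, 2, 3])
null_not_in = sql_not(sql_in(None, [1, 2, 3]))
```

Both `null_in` and `null_not_in` are `None` regardless of the list contents, so an optimizer can replace the whole expression with a null literal without evaluating the list.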
* [SPARK-11424] Guard against double-close() of RecordReadersJosh Rosen2015-10-314-52/+66
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | **TL;DR**: We can rule out one rare but potential cause of input stream corruption via defensive programming.

## Background

[MAPREDUCE-5918](https://issues.apache.org/jira/browse/MAPREDUCE-5918) is a bug where an instance of a decompressor ends up getting placed into a pool multiple times. Since the pool is backed by a list instead of a set, this can lead to the same decompressor being used in different places at the same time, which is not safe because those decompressors will overwrite each other's buffers. Sometimes this buffer sharing will lead to exceptions, but other times it might silently result in invalid / garbled input.

That Hadoop bug is fixed in Hadoop 2.7 but is still present in many Hadoop versions that we wish to support. As a result, I think that we should try to work around this issue in Spark via defensive programming to prevent RecordReaders from being closed multiple times.

So far, I've had a hard time coming up with explanations of exactly how double-`close()`s occur in practice, but I do have a couple of explanations that work on paper.

For instance, it looks like https://github.com/apache/spark/pull/7424, added in 1.5, introduces at least one extremely rare corner-case path where Spark could double-close() a LineRecordReader instance in a way that triggers the bug. Here are the steps involved in the bad execution that I brainstormed up:

* [The task has finished reading input, so we call close()](https://github.com/apache/spark/blob/v1.5.1/core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala#L168).
* [While handling the close call and trying to close the reader, reader.close() throws an exception](https://github.com/apache/spark/blob/v1.5.1/core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala#L190)
* We don't set `reader = null` after handling this exception, so the [TaskCompletionListener also ends up calling NewHadoopRDD.close()](https://github.com/apache/spark/blob/v1.5.1/core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala#L156), which, in turn, closes the record reader again.

In this hypothetical situation, `LineRecordReader.close()` could [fail with an exception if its InputStream failed to close](https://github.com/apache/hadoop/blob/release-1.2.1/src/mapred/org/apache/hadoop/mapred/LineRecordReader.java#L212). I googled for "Exception in RecordReader.close()" and it looks like it's possible for a closed Hadoop FileSystem to trigger an error there: [SPARK-757](https://issues.apache.org/jira/browse/SPARK-757), [SPARK-2491](https://issues.apache.org/jira/browse/SPARK-2491)

Looking at [SPARK-3052](https://issues.apache.org/jira/browse/SPARK-3052), it seems like it's possible to get spurious exceptions there when there is an error reading from Hadoop. If the Hadoop FileSystem were to get into an error state _right_ after reading the last record then it looks like we could hit the bug here in 1.5.

## The fix

This patch guards against these issues by modifying `HadoopRDD.close()` and `NewHadoopRDD.close()` so that they set `reader = null` even if an exception occurs in the `reader.close()` call. In addition, I modified `NextIterator.closeIfNeeded()` to guard against double-close if the first `close()` call throws an exception.

I don't have an easy way to test this, since I haven't been able to reproduce the bug that prompted this patch, but these changes seem safe and seem to rule out the on-paper reproductions that I was able to brainstorm up.
Author: Josh Rosen <joshrosen@databricks.com> Closes #9382 from JoshRosen/hadoop-decompressor-pooling-fix and squashes the following commits: 5ec97d7 [Josh Rosen] Add SqlNewHadoopRDD.unsetInputFileName() that I accidentally deleted. ae46cf4 [Josh Rosen] Merge remote-tracking branch 'origin/master' into hadoop-decompressor-pooling-fix 087aa63 [Josh Rosen] Guard against double-close() of RecordReaders.
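The guard described in "The fix" can be sketched in Python as follows. `FlakyReader` and `GuardedSource` are hypothetical stand-ins for a record reader and the RDD-side wrapper; the only point is that the reference is cleared in a `finally` block, so a failing first `close()` cannot lead to a second one.

```python
class FlakyReader:
    """Stand-in reader whose close() always fails, and counts its calls."""

    def __init__(self):
        self.close_calls = 0

    def close(self):
        self.close_calls += 1
        raise IOError("underlying stream failed to close")


class GuardedSource:
    """Closes its reader at most once, even if close() raises."""

    def __init__(self, reader):
        self.reader = reader

    def close(self):
        if self.reader is not None:
            try:
                self.reader.close()
            except IOError:
                pass  # this sketch swallows the error; real code would log it
            finally:
                self.reader = None  # cleared even when close() raised


flaky = FlakyReader()
source = GuardedSource(flaky)
source.close()  # e.g. called when the input is exhausted
source.close()  # e.g. called again by a task-completion callback
```

After both calls, `flaky.close_calls` is 1: the second `close()` finds the cleared reference and does nothing.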
* [SPARK-11226][SQL] Empty line in json file should be skippedJeff Zhang2015-10-313-24/+36
| | | | | | | | | | Currently an empty line in a JSON file is parsed into a Row with all null field values. But in JSON, "{}" represents an empty JSON object, whereas an empty line is supposed to be skipped. This makes a trivial change for that. Author: Jeff Zhang <zjffdu@apache.org> Closes #9211 from zjffdu/SPARK-11226.
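The intended behavior, sketched with Python's standard `json` module rather than Spark's JSON data source: blank lines produce no record at all, while `{}` still produces an (empty) record. `parse_json_lines` is an illustrative helper name.

```python
import json


def parse_json_lines(lines):
    """Parse line-delimited JSON, skipping empty or whitespace-only lines."""
    records = []
    for line in lines:
        if not line.strip():
            continue  # a blank line is not a record
        records.append(json.loads(line))
    return records


records = parse_json_lines(['{"a": 1}', '', '   ', '{}', '{"a": 2}'])
```

Here `records` holds three entries: the two populated objects and the empty object from `{}`.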
* [SPARK-11434][SPARK-11103][SQL] Fix test ": Filter applied on merged Parquet ↵Yin Huai2015-10-301-3/+3
| | | | | | | | | | schema with new column fails" https://issues.apache.org/jira/browse/SPARK-11434 Author: Yin Huai <yhuai@databricks.com> Closes #9387 from yhuai/SPARK-11434.
* [SPARK-11385] [ML] foreachActive made public in MLLib's vector APINakul Jindal2015-10-301-3/+6
| | | | | | | | Made foreachActive public in MLLib's vector API Author: Nakul Jindal <njindal@us.ibm.com> Closes #9362 from nakul02/SPARK-11385_foreach_for_mllib_linalg_vector.
* Revert "[SPARK-11236][CORE] Update Tachyon dependency from 0.7.1 -> 0.8.0."Yin Huai2015-10-302-5/+9
| | | | This reverts commit 4f5e60c647d7d6827438721b7fabbc3a57b81023.
* [SPARK-11423] remove MapPartitionsWithPreparationRDDDavies Liu2015-10-308-270/+75
| | | | | | | | | | Since we do not need to preserve a page before calling compute(), MapPartitionsWithPreparationRDD is not needed anymore. This PR basically reverts #8543, #8511, #8038, #8011 Author: Davies Liu <davies@databricks.com> Closes #9381 from davies/remove_prepare2.
* [SPARK-11340][SPARKR] Support setting driver properties when starting Spark ↵felixcheung2015-10-303-13/+87
| | | | | | | | | | | | | from R programmatically or from RStudio Mapping spark.driver.memory from sparkEnvir to spark-submit commandline arguments. shivaram suggested that we possibly add other spark.driver.* properties - do we want to add all of those? I thought those could be set in SparkConf? sun-rui Author: felixcheung <felixcheung_m@hotmail.com> Closes #9290 from felixcheung/rdrivermem.
* [SPARK-11342][TESTS] Allow to set hadoop profile when running dev/ru…Jeff Zhang2015-10-301-1/+1
| | | | | | | | …n_tests Author: Jeff Zhang <zjffdu@apache.org> Closes #9295 from zjffdu/SPARK-11342.
* [SPARK-11210][SPARKR] Add window functions into SparkR [step 2].Sun Rui2015-10-304-0/+117
| | | | | | Author: Sun Rui <rui.sun@intel.com> Closes #9196 from sun-rui/SPARK-11210.
* [SPARK-11414][SPARKR] Forgot to update usage of 'spark.sparkr.r.command' in ↵Sun Rui2015-10-301-1/+6
| | | | | | | | RRDD in the PR for SPARK-10971. Author: Sun Rui <rui.sun@intel.com> Closes #9368 from sun-rui/SPARK-11414.
* [SPARK-10986][MESOS] Set the context class loader in the Mesos executor backend.Iulian Dragos2015-10-301-0/+5
| | | | | | | | | | | | | | | | See [SPARK-10986](https://issues.apache.org/jira/browse/SPARK-10986) for details. This fixes the `ClassNotFoundException` for Spark classes in the serializer. I am not sure this is the right way to handle the class loader, but I couldn't find any documentation on how the context class loader is used and who relies on it. It seems at least the serializer uses it to instantiate classes during deserialization. I am open to suggestions (I tried this fix on a real Mesos cluster and it *does* fix the issue). tnachen andrewor14 Author: Iulian Dragos <jaguarul@gmail.com> Closes #9282 from dragos/issue/mesos-classloader.
* [SPARK-11393] [SQL] CoGroupedIterator should respect the fact that ↵Wenchen Fan2015-10-302-6/+32
| | | | | | | | | | | | GroupedIterator.hasNext is not idempotent When we cogroup 2 `GroupedIterator`s in `CoGroupedIterator`, if the right side is smaller, we will consume the right data and keep the left data unchanged. Then we call `hasNext`, which calls `left.hasNext`. This makes `GroupedIterator` generate an extra group, as the previous one has not been consumed yet. Author: Wenchen Fan <wenchen@databricks.com> Closes #9346 from cloud-fan/cogroup and squashes the following commits: 9be67c8 [Wenchen Fan] SPARK-11393
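A Python sketch of the hazard and one way to respect it. `AdvancingGroups` and `IdempotentGroups` are hypothetical classes, not Spark's iterators: the first advances on every `has_next()` call, and the second remembers a pending result so that repeated calls do not skip a group. This illustrates the idea only and is not the patch's actual code.

```python
class AdvancingGroups:
    """has_next() is NOT idempotent: every call moves to a new group."""

    def __init__(self, groups):
        self._groups = iter(groups)
        self.current = None

    def has_next(self):
        try:
            self.current = next(self._groups)
            return True
        except StopIteration:
            self.current = None
            return False


class IdempotentGroups:
    """Wrapper that only advances after the pending group was consumed."""

    def __init__(self, source):
        self._source = source
        self._pending = False

    def has_next(self):
        if not self._pending:
            self._pending = self._source.has_next()
        return self._pending

    def next(self):
        if not self.has_next():
            raise StopIteration
        self._pending = False  # the pending group is now consumed
        return self._source.current


raw = AdvancingGroups(["a", "b"])
raw.has_next()
raw.has_next()              # second call silently skips group "a"
skipped_to = raw.current

safe = IdempotentGroups(AdvancingGroups(["a", "b"]))
safe.has_next()
safe.has_next()             # no effect: "a" is still pending
first = safe.next()
```

With the raw iterator, `skipped_to` is `"b"`; with the wrapper, `first` is still `"a"`.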
* [SPARK-11103][SQL] Filter applied on Merged Parquet schema with new column failhyukjinkwon2015-10-302-1/+25
| | | | | | | | | | | When both mergedSchema and predicate filtering are enabled, the query fails because Parquet does not accept pushed-down filters whose columns do not exist in the schema. This is related to a Parquet issue (https://issues.apache.org/jira/browse/PARQUET-389). For now, this PR simply disables predicate push-down when using a merged schema. Author: hyukjinkwon <gurwls223@gmail.com> Closes #9327 from HyukjinKwon/SPARK-11103.
* [SPARK-11207] [ML] Add test cases for solver selection of LinearRegres…Lewuathe2015-10-302-82/+144
| | | | | | | | | | | | …sion as followup. This is the follow-up work of SPARK-10668. * Fix minor style issues. * Add a test case for checking whether the solver is selected properly. Author: Lewuathe <lewuathe@me.com> Author: lewuathe <lewuathe@me.com> Closes #9180 from Lewuathe/SPARK-11207.
* [SPARK-11417] [SQL] no @Override in codegenDavies Liu2015-10-303-9/+0
| | | | | | | | Older versions of Janino (>2.7) do not support @Override, so we should not use it in codegen. Author: Davies Liu <davies@databricks.com> Closes #9372 from davies/no_override.
* [SPARK-10342] [SPARK-10309] [SPARK-10474] [SPARK-10929] [SQL] Cooperative ↵Davies Liu2015-10-2930-834/+1270
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | memory management

This PR introduces a mechanism to call spill() on those SQL operators that support spilling (for example, BytesToBytesMap, UnsafeExternalSorter and ShuffleExternalSorter) if there is not enough memory for execution. The preserved first page is not needed anymore, so it is removed.

Other Spillable objects in Spark core (ExternalSorter and AppendOnlyMap) are not included in this PR, but those could benefit from this (trigger others' spilling).

The PrepareRDD may not be needed anymore and could be removed in a follow-up PR.

The following script fails with OOM before this PR; with this PR it finishes in 150 seconds with a 2G heap (it also works in the 1.5 branch, with similar duration).

```python
# Assumes an existing `sqlContext`, e.g. in the PySpark shell.
sqlContext.setConf("spark.sql.shuffle.partitions", "1")
df = sqlContext.range(1<<25).selectExpr("id", "repeat(id, 2) as s")
df2 = df.select(df.id.alias('id2'), df.s.alias('s2'))
j = df.join(df2, df.id==df2.id2).groupBy(df.id).max("id", "id2")
j.explain()
print(j.count())
```

For thread-safety, here is what I've got:

1) Without calling spill(), the operators should only be used by a single thread, so there are no safety problems.

2) spill() can be triggered in two cases: by the operator itself, or by other operators. We can check trigger == this in spill(); in that case it's still in the same thread, so there are no safety problems.

3) If it's triggered by other operators (right now cache will not trigger spill()), we only spill the data to disk when it's in the scanning stage (building is finished), so the in-memory sorter or memory pages are read-only; we only need to synchronize the iterator and change it.

4) During scanning, the iterator will only use one record in one page. We can't free this page, because the downstream is currently using it (used by UnsafeRow or other objects). In BytesToBytesMap, we just skip the current page and dump all others to disk. In UnsafeExternalSorter, we keep the page that is used by the current record (having the same baseObject) and free it when loading the next record. In ShuffleExternalSorter, spill() will not be triggered during scanning.

5) In order to avoid deadlock, we don't call acquireMemory during spill (so we reuse the pointer array in InMemorySorter).

Author: Davies Liu <davies@databricks.com> Closes #9241 from davies/force_spill.
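A toy Python model of the cooperative idea: when a request cannot be satisfied, the manager asks consumers to spill until enough memory is free. `MemoryManager` and `Consumer` are hypothetical classes, and the spill order (other consumers first, then the requester) is an assumption of this sketch rather than something the commit message specifies. As in point 5, `spill()` never requests memory itself.

```python
class MemoryManager:
    """Tracks bytes held per consumer against a fixed capacity."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = {}  # consumer -> bytes currently granted

    def free(self):
        return self.capacity - sum(self.used.values())

    def acquire(self, consumer, size):
        """Grant `size` bytes, forcing spills if needed; return bytes granted."""
        if self.free() < size:
            others = [c for c in self.used if c is not consumer]
            for candidate in others + [consumer]:
                if self.free() >= size:
                    break
                released = candidate.spill()
                self.used[candidate] = self.used.get(candidate, 0) - released
        if self.free() < size:
            return 0  # still not enough, even after spilling
        self.used[consumer] = self.used.get(consumer, 0) + size
        return size


class Consumer:
    """A spillable operator that holds memory granted by the manager."""

    def __init__(self, manager):
        self.manager = manager
        self.held = 0
        self.spills = 0

    def request(self, size):
        granted = self.manager.acquire(self, size)
        self.held += granted
        return granted

    def spill(self):
        """Release everything held; never calls acquire() (avoids deadlock)."""
        released = self.held
        if released:
            self.spills += 1
        self.held = 0
        return released


manager = MemoryManager(capacity=100)
a, b = Consumer(manager), Consumer(manager)
a.request(80)
granted_to_b = b.request(50)  # forces `a` to spill its 80 bytes
```

In this run, `b` receives its 50 bytes only because `a` was asked to spill; without the cooperative step the second request would fail.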
* [SPARK-11409][SPARKR] Enable url link in R doc for Persistfelixcheung2015-10-291-2/+2
| | | | | | | | | | | | Quick one line doc fix link is not clickable ![image](https://cloud.githubusercontent.com/assets/8969467/10833041/4e91dd7c-7e4c-11e5-8905-713b986dbbde.png) shivaram Author: felixcheung <felixcheung_m@hotmail.com> Closes #9363 from felixcheung/rpersistdoc.
* [SPARK-11301] [SQL] fix case sensitivity for filter on partitioned columnsWenchen Fan2015-10-292-7/+15
| | | | | | Author: Wenchen Fan <wenchen@databricks.com> Closes #9271 from cloud-fan/filter.
* [SPARK-11236][CORE] Update Tachyon dependency from 0.7.1 -> 0.8.0.Calvin Jia2015-10-292-9/+5
| | | | | | | | | | Upgrades the tachyon-client version to the latest release. No new dependencies are added and no spark facing APIs are changed. The removal of the `tachyon-underfs-s3` exclusion will enable users to use S3 out of the box and there are no longer any additional external dependencies added by the module. Author: Calvin Jia <jia.calvin@gmail.com> Closes #9204 from calvinjia/spark-11236.
* [SPARK-10532][EC2] Added --profile option to specify the name of profileteramonagi2015-10-291-1/+8
| | | | | | | | | | | | | "Profiles" let you specify which set of credentials to use when initializing a connection to AWS. You can keep multiple sets of credentials in the same credentials file under different profile names. For example, the AWS CLI tool supports this through its --profile option. http://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html Author: teramonagi <teramonagi@gmail.com> Closes #8696 from teramonagi/SPARK-10532.
* [SPARK-10641][SQL] Add Skewness and Kurtosis Supportsethah2015-10-2912-11/+823
| | | | | | | | | Implementing skewness and kurtosis support based on following algorithm: https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Higher-order_statistics Author: sethah <seth.hendrickson16@gmail.com> Closes #9003 from sethah/SPARK-10641.
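For reference, a single-pass Python sketch of the higher-order update from the linked Wikipedia section. It is not Spark's implementation; the function name is illustrative, and it returns the population (biased) skewness and excess kurtosis, which is an assumption about the intended definitions.

```python
import math


def online_skew_kurtosis(values):
    """One-pass skewness and excess kurtosis via running central moments."""
    n = 0
    mean = m2 = m3 = m4 = 0.0
    for x in values:
        n1 = n
        n += 1
        delta = x - mean
        delta_n = delta / n
        delta_n2 = delta_n * delta_n
        term1 = delta * delta_n * n1
        mean += delta_n
        # Update order matters: m4 uses the old m3 and m2, m3 uses the old m2.
        m4 += (term1 * delta_n2 * (n * n - 3 * n + 3)
               + 6 * delta_n2 * m2 - 4 * delta_n * m3)
        m3 += term1 * delta_n * (n - 2) - 3 * delta_n * m2
        m2 += term1
    if n == 0 or m2 == 0:
        raise ValueError("undefined for empty or constant input")
    skewness = math.sqrt(n) * m3 / m2 ** 1.5
    kurtosis = n * m4 / (m2 * m2) - 3.0
    return skewness, kurtosis


skew, kurt = online_skew_kurtosis([1, 2, 3, 4, 5])
```

For the symmetric sample above the skewness is zero up to rounding, and the excess kurtosis is negative, as expected for evenly spaced values.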
* [SPARK-11188][SQL] Elide stacktraces in bin/spark-sql for AnalysisExceptionsDilip Biswal2015-10-293-6/+27
| | | | | | | | Only print the error message to the console for Analysis Exceptions in sql-shell. Author: Dilip Biswal <dbiswal@us.ibm.com> Closes #9194 from dilipbiswal/spark-11188.
* [SPARK-11246] [SQL] Table cache for Parquet broken in 1.5xin Wu2015-10-292-0/+16
| | | | | | | | | The root cause is that when spark.sql.hive.convertMetastoreParquet=true by default, the cached InMemoryRelation of the ParquetRelation can not be looked up from the cachedData of CacheManager because the key comparison fails even though it is the same LogicalPlan representing the Subquery that wraps the ParquetRelation. The solution in this PR is overriding the LogicalPlan.sameResult function in Subquery case class to eliminate subquery node first before directly comparing the child (ParquetRelation), which will find the key to the cached InMemoryRelation. Author: xin Wu <xinwu@us.ibm.com> Closes #9326 from xwu0226/spark-11246-commit.
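The lookup problem can be modeled in a few lines of Python. `Relation`, `Subquery`, `strip_subquery` and `lookup` are hypothetical stand-ins for the plan nodes and the cache manager, shown only to illustrate comparing plans after removing the subquery wrapper.

```python
class Relation:
    """Leaf plan node identified by a table name."""

    def __init__(self, name):
        self.name = name

    def same_result(self, other):
        other = strip_subquery(other)
        return isinstance(other, Relation) and other.name == self.name


class Subquery:
    """Alias wrapper that does not change the rows its child produces."""

    def __init__(self, alias, child):
        self.alias = alias
        self.child = child

    def same_result(self, other):
        # Compare the wrapped child instead of the wrapper itself.
        return strip_subquery(self).same_result(other)


def strip_subquery(plan):
    while isinstance(plan, Subquery):
        plan = plan.child
    return plan


def lookup(cached, plan):
    """Return the cached value whose key yields the same result as `plan`."""
    for key, value in cached:
        if plan.same_result(key):
            return value
    return None


cached = [(Relation("t"), "in-memory copy of t")]
hit = lookup(cached, Subquery("alias", Relation("t")))
miss = lookup(cached, Subquery("alias", Relation("u")))
```

Because the wrapper is stripped before comparison, the aliased plan finds the cached entry, while a plan over a different relation does not.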
* [SPARK-11388][BUILD] Fix self closing tags.Herman van Hovell2015-10-292-6/+6
| | | | | | | | | | Java 8 javadoc does not like self closing tags: ```<p/>```, ```<br/>```, ... This PR fixes those. Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #9339 from hvanhovell/SPARK-11388.
* [SPARK-11318] Include hive profile in make-distribution.sh commandtedyu2015-10-291-1/+1
| | | | | | Author: tedyu <yuzhihong@gmail.com> Closes #9281 from tedyu/master.
* [SPARK-11370] [SQL] fix a bug in GroupedIterator and create unit test for itWenchen Fan2015-10-292-37/+144
| | | | | | | | Before this PR, users had to fully consume the iterator of one group before processing the next group; otherwise we would get into an infinite loop. Author: Wenchen Fan <wenchen@databricks.com> Closes #9330 from cloud-fan/group.
* [SPARK-11379][SQL] ExpressionEncoder can't handle top level primitive type ↵Wenchen Fan2015-10-292-1/+2
| | | | | | | | | | | | correctly For an inner primitive type (e.g. inside `Product`), we use `schemaFor` to get its catalyst type, https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L403. However, for a top level primitive type, we use `dataTypeFor`, which is wrong. Author: Wenchen Fan <wenchen@databricks.com> Closes #9337 from cloud-fan/encoder.
* [SPARK-11322] [PYSPARK] Keep full stack trace in captured exceptionLiang-Chi Hsieh2015-10-282-4/+21
| | | | | | | | | | JIRA: https://issues.apache.org/jira/browse/SPARK-11322 As reported by JoshRosen in [databricks/spark-redshift/issues/89](https://github.com/databricks/spark-redshift/issues/89#issuecomment-149828308), the exception-masking behavior sometimes makes debugging harder. To deal with this issue, we should keep full stack trace in the captured exception. Author: Liang-Chi Hsieh <viirya@appier.com> Closes #9283 from viirya/py-exception-stacktrace.
* [SPARK-11351] [SQL] support hive interval literalWenchen Fan2015-10-282-20/+103
| | | | | | Author: Wenchen Fan <wenchen@databricks.com> Closes #9304 from cloud-fan/interval.
* [SPARK-11376][SQL] Removes duplicated `mutableRow` fieldCheng Lian2015-10-291-2/+0
| | | | | | | | This PR fixes a mistake in the code generated by `GenerateColumnAccessor`. Interestingly, although the code is illegal in Java (the class has two fields with the same name), Janino accepts it happily and the generated code happens to work properly. Author: Cheng Lian <lian@databricks.com> Closes #9335 from liancheng/spark-11376.fix-generated-code.
* [SPARK-11363] [SQL] LeftSemiJoin should be LeftSemi in SparkStrategiesLiang-Chi Hsieh2015-10-281-3/+3
| | | | | | | | | | | | JIRA: https://issues.apache.org/jira/browse/SPARK-11363 In SparkStrategies some places use LeftSemiJoin. It should be LeftSemi. cc chenghao-intel liancheng Author: Liang-Chi Hsieh <viirya@appier.com> Closes #9318 from viirya/no-left-semi-join.
* [SPARK-11292] [SQL] Python API for text data sourceReynold Xin2015-10-282-2/+27
| | | | | | | | Adds DataFrameReader.text and DataFrameWriter.text. Author: Reynold Xin <rxin@databricks.com> Closes #9259 from rxin/SPARK-11292.
* [SPARK-11377] [SQL] withNewChildren should not convert StructType to SeqMichael Armbrust2015-10-281-1/+3
| | | | | | | | | | This is minor, but I ran into it while writing Datasets, and while it wasn't needed for the final solution, it was super confusing, so we should fix it. Basically we recurse into `Seq` to see if they have children. This breaks because we don't preserve the original subclass of `Seq` (and `StructType <:< Seq[StructField]`). Since a struct can never contain children, let's just not recurse into it. Author: Michael Armbrust <michael@databricks.com> Closes #9334 from marmbrus/structMakeCopy.
* [SPARK-11367][ML][PYSPARK] Python LinearRegression should support setting solverYanbo Liang2015-10-283-22/+37
| | | | | | | | [SPARK-10668](https://issues.apache.org/jira/browse/SPARK-10668) has provided ```WeightedLeastSquares``` solver("normal") in ```LinearRegression``` with L2 regularization in Scala and R, Python ML ```LinearRegression``` should also support setting solver("auto", "normal", "l-bfgs") Author: Yanbo Liang <ybliang8@gmail.com> Closes #9328 from yanboliang/spark-11367.
* [SPARK-11369][ML][R] SparkR glm should support setting standardizeYanbo Liang2015-10-282-2/+5
| | | | | | | | | | SparkR glm currently supports: ```formula, family = c("gaussian", "binomial"), data, lambda = 0, alpha = 0``` We should also support setting standardize, which has been defined in the [design documentation](https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit) Author: Yanbo Liang <ybliang8@gmail.com> Closes #9331 from yanboliang/spark-11369.