Commit message (Author, Date, Files changed, Lines deleted/added)
* [SPARK-11093] [CORE] ChildFirstURLClassLoader#getResources should return all found resources, not just those in the child classloader (Adam Lewandowski, 2015-10-15, 2 files, -9/+44)
  Author: Adam Lewandowski <alewandowski@ipcoop.com>
  Closes #9106 from alewando/childFirstFix.
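  A minimal sketch of the idea behind this fix, assuming a simplified child-first loader (the class and its wiring are illustrative, not Spark's actual ChildFirstURLClassLoader): getResources should return the union of the child's and the parent's matches rather than the child's matches alone.
  ```scala
  import java.net.{URL, URLClassLoader}
  import java.util.Collections
  import scala.collection.JavaConverters._

  // Illustrative child-first loader: child URLs are searched first, but
  // getResources still surfaces resources found by the parent as well.
  class ChildFirstClassLoader(urls: Array[URL], realParent: ClassLoader)
    extends URLClassLoader(urls, null) {

    override def getResources(name: String): java.util.Enumeration[URL] = {
      val childFound  = super.getResources(name).asScala
      val parentFound = realParent.getResources(name).asScala
      // Return everything, child matches first.
      Collections.enumeration((childFound ++ parentFound).toSeq.asJava)
    }
  }
  ```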
* [SPARK-11076] [SQL] Add decimal support for floor and ceil (Cheng Hao, 2015-10-14, 4 files, -13/+91)
  Currently none of the `UnaryMathExpression` implementations support Decimal; follow-ups will be created for supporting it there. This first PR is a good place to review the approach I am taking.
  Author: Cheng Hao <hao.cheng@intel.com>
  Closes #9086 from chenghao-intel/ceiling.
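  A hedged usage example of what decimal support means for these functions; the `sqlContext` and column name are assumed, and exact result types may differ by version.
  ```scala
  import org.apache.spark.sql.functions.{ceil, floor}

  // With decimal support, floor/ceil can operate on DECIMAL inputs directly
  // instead of forcing a conversion through Double.
  val df = sqlContext.sql("SELECT CAST(3.45 AS DECIMAL(10, 2)) AS x")
  df.select(floor(df("x")), ceil(df("x"))).show()
  // floor(x) = 3, ceil(x) = 4
  ```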
* [SPARK-11017] [SQL] Support ImperativeAggregates in TungstenAggregate (Josh Rosen, 2015-10-14, 9 files, -260/+457)
  This patch extends TungstenAggregate to support ImperativeAggregate functions. The existing TungstenAggregate operator only supported DeclarativeAggregate functions, which are defined in terms of Catalyst expressions and can be evaluated via generated projections. ImperativeAggregate functions, on the other hand, are evaluated by calling their `initialize`, `update`, `merge`, and `eval` methods. The basic strategy here is similar to how SortBasedAggregate evaluates both types of aggregate functions: use a generated projection to evaluate the expression-based declarative aggregates with dummy placeholder expressions inserted in place of the imperative aggregate function output, then invoke the imperative aggregate functions and target them against the aggregation buffer. The bulk of the diff here consists of code that was copied and adapted from SortBasedAggregate, with some key changes to handle TungstenAggregate's sort fallback path.
  Author: Josh Rosen <joshrosen@databricks.com>
  Closes #9038 from JoshRosen/support-interpreted-in-tungsten-agg-final.
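  A hedged sketch of the imperative contract described above: the method names come from the commit message, while the buffer and row types are simplified stand-ins for Spark's internal classes.
  ```scala
  object ImperativeAggSketch {
    // Simplified stand-ins for Spark's internal buffer and row types.
    type Buffer = Array[Any]
    type Row = Array[Any]

    // The imperative contract: the operator calls these methods directly
    // instead of evaluating generated expressions.
    trait ImperativeAgg {
      def initialize(buffer: Buffer): Unit
      def update(buffer: Buffer, input: Row): Unit
      def merge(buffer: Buffer, other: Buffer): Unit
      def eval(buffer: Buffer): Any
    }

    // Example: a count aggregate using slot 0 of its buffer.
    class CountAgg extends ImperativeAgg {
      def initialize(buffer: Buffer): Unit = buffer(0) = 0L
      def update(buffer: Buffer, input: Row): Unit =
        buffer(0) = buffer(0).asInstanceOf[Long] + 1L
      def merge(buffer: Buffer, other: Buffer): Unit =
        buffer(0) = buffer(0).asInstanceOf[Long] + other(0).asInstanceOf[Long]
      def eval(buffer: Buffer): Any = buffer(0)
    }
  }
  ```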
* [SPARK-10829] [SQL] Filter combining partition key and attribute doesn't work in DataSource scan (Cheng Hao, 2015-10-14, 2 files, -12/+39)
  ```scala
  withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key -> "true") {
    withTempPath { dir =>
      val path = s"${dir.getCanonicalPath}/part=1"
      (1 to 3).map(i => (i, i.toString)).toDF("a", "b").write.parquet(path)

      // If the "part = 1" filter gets pushed down, this query will throw an exception since
      // "part" is not a valid column in the actual Parquet file
      checkAnswer(
        sqlContext.read.parquet(path).filter("a > 0 and (part = 0 or a > 1)"),
        (2 to 3).map(i => Row(i, i.toString, 1)))
    }
  }
  ```
  We expect the result to be:
  ```
  2,1
  3,1
  ```
  But got:
  ```
  1,1
  2,1
  3,1
  ```
  Author: Cheng Hao <hao.cheng@intel.com>
  Closes #8916 from chenghao-intel/partition_filter.
* [SPARK-11113] [SQL] Remove DeveloperApi annotation from private classes (Reynold Xin, 2015-10-14, 29 files, -153/+22)
  o.a.s.sql.catalyst and o.a.s.sql.execution are supposed to be private.
  Author: Reynold Xin <rxin@databricks.com>
  Closes #9121 from rxin/SPARK-11113.
* [SPARK-10104] [SQL] Consolidate different forms of table identifiers (Wenchen Fan, 2015-10-14, 32 files, -327/+212)
  Right now we have QualifiedTableName, TableIdentifier, and Seq[String] to represent table identifiers. We should have only one form, and TableIdentifier is the best one because it provides methods to get the table name and database name and to return the identifier as a quoted or unquoted string.
  Author: Wenchen Fan <wenchen@databricks.com>
  Author: Wenchen Fan <cloud0fan@163.com>
  Closes #8453 from cloud-fan/table-name.
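  A minimal sketch of such a consolidated identifier, using illustrative field and method names rather than Spark's exact definitions.
  ```scala
  // Illustrative only: a single identifier type carrying the table name and an
  // optional database, with quoted and unquoted renderings.
  case class TableIdentifier(table: String, database: Option[String] = None) {
    def unquotedString: String = (database.toSeq :+ table).mkString(".")
    def quotedString: String   = (database.toSeq :+ table).map(p => s"`$p`").mkString(".")
  }

  println(TableIdentifier("sales", Some("warehouse")).quotedString)  // `warehouse`.`sales`
  println(TableIdentifier("sales").unquotedString)                   // sales
  ```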
* [SPARK-11068] [SQL] [FOLLOW-UP] move execution listener to util (Wenchen Fan, 2015-10-14, 3 files, -2/+4)
  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #9119 from cloud-fan/callback.
* [SPARK-11096] Post-hoc review Netty based RPC implementation - round 2 (Reynold Xin, 2015-10-14, 7 files, -107/+81)
  A few more changes:
  1. Renamed IDVerifier -> RpcEndpointVerifier
  2. Renamed NettyRpcAddress -> RpcEndpointAddress
  3. Simplified NettyRpcHandler a bit by removing the connection count tracking. This is OK because I now force spark.shuffle.io.numConnectionsPerPeer to 1
  4. Reduced spark.rpc.connect.threads to 64. It would be great to eventually remove this extra thread pool.
  5. Minor cleanup & documentation.
  Author: Reynold Xin <rxin@databricks.com>
  Closes #9112 from rxin/SPARK-11096.
* [SPARK-10973] (Reynold Xin, 2015-10-14, 0 files, -0/+0)
  Close #9064
  Close #9063
  Close #9062
  These pull requests were merged into branch-1.5, branch-1.4, and branch-1.3.
* [SPARK-8386] [SQL] add write.mode for insertIntoJDBC when the param overwrite is false (Huaxin Gao, 2015-10-14, 1 file, -1/+1)
  The fix is for JIRA https://issues.apache.org/jira/browse/SPARK-8386.
  Author: Huaxin Gao <huaxing@us.ibm.com>
  Closes #9042 from huaxingao/spark8386.
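  A hedged illustration of the intended behavior: with overwrite = false, the insert should map to an append-mode JDBC write. The connection URL, table name, and the existing DataFrame `df` are placeholders, not part of the patch.
  ```scala
  import java.util.Properties
  import org.apache.spark.sql.SaveMode

  // Placeholder JDBC settings for illustration only.
  val url = "jdbc:h2:mem:testdb"
  val props = new Properties()

  // insertIntoJDBC(url, "people", overwrite = false) should behave like this:
  df.write.mode(SaveMode.Append).jdbc(url, "people", props)

  // whereas overwrite = true corresponds to:
  df.write.mode(SaveMode.Overwrite).jdbc(url, "people", props)
  ```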
* [SPARK-11040] [NETWORK] Make sure SASL handler delegates all events (Marcelo Vanzin, 2015-10-14, 3 files, -3/+37)
  Author: Marcelo Vanzin <vanzin@cloudera.com>
  Closes #9053 from vanzin/SPARK-11040.
* [SPARK-10619] Can't sort columns on Executor Page (Tom Graves, 2015-10-14, 4 files, -3/+4)
  This should also be picked into Spark 1.5.2. https://issues.apache.org/jira/browse/SPARK-10619
  It looks like this was broken by commit https://github.com/apache/spark/commit/fb1d06fc242ec00320f1a3049673fbb03c4a6eb9#diff-b8adb646ef90f616c34eb5c98d1ebd16: some pages were changed to use UIUtils.listingTable, but the executor page wasn't converted, so when "sortable" was removed from UIUtils.TABLE_CLASS_NOT_STRIPED this page broke. Simply adding the sortable tag back fixes both the active UI and the history server UI.
  Author: Tom Graves <tgraves@yahoo-inc.com>
  Closes #9101 from tgravescs/SPARK-10619.
* [SPARK-10996] [SPARKR] Implement sampleBy() in DataFrameStatFunctions (Sun Rui, 2015-10-13, 7 files, -19/+76)
  Author: Sun Rui <rui.sun@intel.com>
  Closes #9023 from sun-rui/SPARK-10996.
* [SPARK-10981] [SPARKR] SparkR Join improvements (Monica Liu, 2015-10-13, 2 files, -6/+34)
  I was having issues with collect() and orderBy() in Spark 1.5.0, so I used the DataFrame.R file and test_sparkSQL.R file from the Spark 1.5.1 download. I only modified the join() function in DataFrame.R to include "full", "fullouter", "left", "right", and "leftsemi", and added corresponding test cases to the tests for join() and merge() in the test_sparkSQL.R file. Pull request because I filed this JIRA bug report: https://issues.apache.org/jira/browse/SPARK-10981
  Author: Monica Liu <liu.monica.f@gmail.com>
  Closes #9029 from mfliu/master.
* [SPARK-11091] [SQL] Change spark.sql.canonicalizeView to spark.sql.nativeView (Yin Huai, 2015-10-13, 4 files, -11/+11)
  https://issues.apache.org/jira/browse/SPARK-11091
  Author: Yin Huai <yhuai@databricks.com>
  Closes #9103 from yhuai/SPARK-11091.
* [SPARK-11068] [SQL] add callback to query execution (Wenchen Fan, 2015-10-13, 4 files, -6/+261)
  With this feature, we can track the query plan, time cost, and exceptions during query execution for Spark users.
  Author: Wenchen Fan <cloud0fan@163.com>
  Closes #9078 from cloud-fan/callback.
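  A hedged sketch of what such a callback surface can look like; the trait and method names here are illustrative, not necessarily the exact listener API this patch adds.
  ```scala
  // Illustrative callback interface: invoked after a query action finishes,
  // with either the elapsed time or the failure that occurred.
  trait QueryCallback {
    def onSuccess(action: String, plan: String, durationNs: Long): Unit
    def onFailure(action: String, plan: String, error: Throwable): Unit
  }

  // Example listener that just logs outcomes.
  class LoggingCallback extends QueryCallback {
    def onSuccess(action: String, plan: String, durationNs: Long): Unit =
      println(s"$action succeeded in ${durationNs / 1e6} ms")
    def onFailure(action: String, plan: String, error: Throwable): Unit =
      println(s"$action failed: ${error.getMessage}")
  }
  ```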
* [SPARK-11032] [SQL] correctly handle having (Wenchen Fan, 2015-10-13, 2 files, -1/+10)
  We should not stop resolving HAVING once the having condition is resolved, or something like `count(1)` will crash.
  Author: Wenchen Fan <cloud0fan@163.com>
  Closes #9105 from cloud-fan/having.
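  For context, a hedged example of the general query shape involved (the table and column names are made up, and `sqlContext` is assumed).
  ```scala
  // A HAVING clause over an aggregate such as count(1); queries of this shape
  // exercise the analyzer rule that this patch fixes.
  sqlContext.sql(
    """SELECT dept, count(1) AS n
      |FROM employees
      |GROUP BY dept
      |HAVING count(1) > 10""".stripMargin).show()
  ```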
* [SPARK-11090] [SQL] Constructor for Product types from InternalRow (Michael Armbrust, 2015-10-13, 9 files, -159/+723)
  This is a first draft of the ability to construct expressions that will take a catalyst internal row and construct a Product (case class or tuple) that has fields with the correct names. Supported so far:
  - Nested classes
  - Maps
  - Efficient handling of arrays of primitive types
  Not yet supported:
  - Case classes that require custom collection types (i.e. List instead of Seq).
  Author: Michael Armbrust <michael@databricks.com>
  Closes #9100 from marmbrus/productContructor.
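  Conceptually, the generated expressions do the equivalent of this hand-written mapping from a row to a case class; this simplified sketch uses the public Row API rather than InternalRow.
  ```scala
  import org.apache.spark.sql.Row

  case class Person(name: String, age: Int)

  // Hand-written equivalent of what the generated constructor expressions do:
  // look fields up by name and build the Product instance.
  def personFromRow(row: Row): Person =
    Person(
      row.getString(row.fieldIndex("name")),
      row.getInt(row.fieldIndex("age")))
  ```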
* [SPARK-11059] [ML] Change range of quantile probabilities in AFTSurvivalRegression (vectorijk, 2015-10-13, 1 file, -2/+2)
  Values in the quantile probabilities array should be in the range (0, 1) instead of [0, 1] in `AFTSurvivalRegression.scala`, according to [Discussion](https://github.com/apache/spark/pull/8926#discussion-diff-40698242).
  Author: vectorijk <jiangkai@gmail.com>
  Closes #9083 from vectorijk/spark-11059.
* [SPARK-10932] [PROJECT INFRA] Port two minor changes to release-build.sh from scripts' old repo (Josh Rosen, 2015-10-13, 1 file, -3/+7)
  Spark's release packaging scripts used to live in a separate repository. Although these scripts are now part of the Spark repo, there are some minor patches made against the old repos that are missing in Spark's copy of the script. This PR ports those changes.
  /cc shivaram, who originally submitted these changes against https://github.com/rxin/spark-utils
  Author: Josh Rosen <joshrosen@databricks.com>
  Closes #8986 from JoshRosen/port-release-build-fixes-from-rxin-repo.
* [SPARK-11080] [SQL] Incorporate per-JVM id into ExprId to prevent unsafe cross-JVM comparisons (Josh Rosen, 2015-10-13, 1 file, -3/+12)
  In the current implementation of named expressions' `ExprId`s, we rely on a per-JVM AtomicLong to ensure that expression ids are unique within a JVM. However, these expression ids will not be _globally_ unique. This opens the potential for id collisions if new expression ids happen to be created inside of tasks rather than on the driver. There are currently a few cases where tasks allocate expression ids, which happen to be safe because those expressions are never compared to expressions created on the driver. In order to guard against the introduction of invalid comparisons between driver-created and executor-created expression ids, this patch extends `ExprId` to incorporate a UUID to identify the JVM that created the id, which prevents collisions.
  Author: Josh Rosen <joshrosen@databricks.com>
  Closes #9093 from JoshRosen/SPARK-11080.
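  A minimal sketch of the idea, assuming simplified names (Spark's actual ExprId definition may differ in detail): pair a per-JVM counter with a UUID generated once per JVM, so ids minted in different JVMs can never compare equal.
  ```scala
  import java.util.UUID
  import java.util.concurrent.atomic.AtomicLong

  // Two ids are equal only if both the counter value and the JVM's UUID match.
  case class ExprId(id: Long, jvmId: UUID)

  object ExprId {
    private val curId = new AtomicLong(0L)
    private val jvmId = UUID.randomUUID()   // unique per JVM instance

    def newExprId: ExprId = ExprId(curId.getAndIncrement(), jvmId)
  }
  ```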
* [SPARK-11052] Spaces in the build dir cause failures in the build/mvn script (trystanleftwich, 2015-10-13, 2 files, -6/+6)
  Author: trystanleftwich <trystan@atscale.com>
  Closes #9065 from trystanleftwich/SPARK-11052.
* [SPARK-10983] Unified memory manager (Andrew Or, 2015-10-13, 21 files, -306/+840)
  This patch unifies the memory management of the storage and execution regions such that either side can borrow memory from the other. When memory pressure arises, storage will be evicted in favor of execution. To avoid regressions in cases where storage is crucial, we dynamically allocate a fraction of space for storage that execution cannot evict. Several configurations are introduced:
  - **spark.memory.fraction (default 0.75)**: fraction of the heap space used for execution and storage. The lower this is, the more frequently spills and cached data eviction occur. The purpose of this config is to set aside memory for internal metadata, user data structures, and imprecise size estimation in the case of sparse, unusually large records.
  - **spark.memory.storageFraction (default 0.5)**: size of the storage region within the space set aside by `spark.memory.fraction`. Cached data may only be evicted if total storage exceeds this region.
  - **spark.memory.useLegacyMode (default false)**: whether to use the memory management that existed in Spark 1.5 and before. This is mainly for backward compatibility.
  For a detailed description of the design, see [SPARK-10000](https://issues.apache.org/jira/browse/SPARK-10000). This patch builds on top of the `MemoryManager` interface introduced in #9000.
  Author: Andrew Or <andrew@databricks.com>
  Closes #9084 from andrewor14/unified-memory-manager.
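  A brief usage sketch of the new settings; the values shown are just the defaults documented above.
  ```scala
  import org.apache.spark.SparkConf

  // Tuning the unified memory manager; these are the documented defaults.
  val conf = new SparkConf()
    .set("spark.memory.fraction", "0.75")        // heap share for execution + storage
    .set("spark.memory.storageFraction", "0.5")  // storage region execution cannot evict
    .set("spark.memory.useLegacyMode", "false")  // "true" restores the Spark 1.5-and-earlier behavior
  ```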
* [SPARK-7402] [ML] JSON SerDe for standard param types (Xiangrui Meng, 2015-10-13, 2 files, -0/+283)
  This PR implements the JSON SerDe for the following param types: `Boolean`, `Int`, `Long`, `Float`, `Double`, `String`, `Array[Int]`, `Array[Double]`, and `Array[String]`. The implementations for `Float`, `Double`, and `Array[Double]` are specialized to handle `NaN` and `Inf`s. This will be used in pipeline persistence. jkbradley
  Author: Xiangrui Meng <meng@databricks.com>
  Closes #9090 from mengxr/SPARK-7402.
* [PYTHON] [MINOR] List modules in PySpark tests when given bad name (Joseph K. Bradley, 2015-10-13, 1 file, -1/+2)
  Output the list of supported modules for Python tests in the error message when given a bad module name. CC: davies
  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #9088 from jkbradley/python-tests-modules.
* [SPARK-10913] [SPARKR] attach() function support (Adrian Zhuang, 2015-10-13, 4 files, -0/+55)
  Bring the change code up to date.
  Author: Adrian Zhuang <adrian555@users.noreply.github.com>
  Author: adrian555 <wzhuang@us.ibm.com>
  Closes #9031 from adrian555/attach2.
* [SPARK-10888] [SPARKR] Added as.DataFrame as a synonym to createDataFrame (Narine Kokhlikyan, 2015-10-13, 3 files, -5/+30)
  as.DataFrame is a more R-style signature. Also, I'd like to know if we could make the context (e.g. sqlContext) global, so that we do not have to specify it as an argument each time we create a DataFrame.
  Author: Narine Kokhlikyan <narine.kokhlikyan@gmail.com>
  Closes #8952 from NarineK/sparkrasDataFrame.
* [SPARK-10051] [SPARKR] Support collecting data of StructType in DataFrame (Sun Rui, 2015-10-13, 9 files, -69/+224)
  Two points in this PR:
  1. The original thought was that a named R list is assumed to be a struct in SerDe. But this is problematic because some R functions will implicitly generate named lists that are not intended to be a struct when transferred by SerDe. So SerDe clients have to explicitly mark a named list as a struct by changing its class from "list" to "struct".
  2. SerDe is in the Spark Core module, and data of StructType is represented as GenericRow, which is defined in the Spark SQL module. SerDe can't import GenericRow, since in the Maven build the Spark SQL module depends on the Spark Core module. So this PR adds a registration hook in SerDe to allow SQLUtils in the Spark SQL module to register its functions for serialization and deserialization of StructType.
  Author: Sun Rui <rui.sun@intel.com>
  Closes #8794 from sun-rui/SPARK-10051.
* [SPARK-11030] [SQL] share the SQLTab across sessions (Davies Liu, 2015-10-13, 5 files, -24/+33)
  The SQLTab will be shared by multiple sessions. If we create multiple independent SQLContexts (not using newSession()), we will still see multiple SQLTabs in the Spark UI.
  Author: Davies Liu <davies@databricks.com>
  Closes #9048 from davies/sqlui.
* [SPARK-11079] Post-hoc review Netty-based RPC - round 1 (Reynold Xin, 2015-10-13, 15 files, -302/+336)
  I'm going through the implementation right now for post-hoc review, adding more comments and renaming things as I go through them. I also want to write higher level documentation about how the whole thing works -- but that will come in other pull requests.
  Author: Reynold Xin <rxin@databricks.com>
  Closes #9091 from rxin/rpc-review.
* [SPARK-11009] [SQL] fix wrong result of Window function in cluster mode (Davies Liu, 2015-10-13, 2 files, -10/+51)
  Currently, all window functions can generate wrong results in cluster mode sometimes. The root cause is that AttributeReference is called in the executor, so its id may not be unique relative to ids created in the driver. Here is a script that reproduces the problem (run in a local cluster):
  ```python
  from pyspark import SparkContext
  from pyspark.sql import HiveContext
  from pyspark.sql.window import Window
  from pyspark.sql.functions import rowNumber

  sqlContext = HiveContext(SparkContext())
  sqlContext.setConf("spark.sql.shuffle.partitions", "3")
  df = sqlContext.range(1 << 20)
  df2 = df.select((df.id % 1000).alias("A"), (df.id / 1000).alias('B'))
  ws = Window.partitionBy(df2.A).orderBy(df2.B)
  df3 = df2.select("A", "B", rowNumber().over(ws).alias("rn")).filter("rn < 0")
  assert df3.count() == 0
  ```
  Author: Davies Liu <davies@databricks.com>
  Author: Yin Huai <yhuai@databricks.com>
  Closes #9050 from davies/wrong_window.
* [SPARK-11026] [YARN] spark.yarn.user.classpath.first doesn't work for 'spark-submit --jars hdfs://user/foo.jar' (Lianhui Wang, 2015-10-13, 1 file, -8/+15)
  When spark.yarn.user.classpath.first=true and using 'spark-submit --jars hdfs://user/foo.jar', foo.jar does not get put on the system classpath. So we need to put YARN's link names of the jars on the system classpath. vanzin tgravescs
  Author: Lianhui Wang <lianhuiwang09@gmail.com>
  Closes #9045 from lianhuiwang/spark-11026.
* [SPARK-10990] [SPARK-11018] [SQL] improve unrolling of complex types (Davies Liu, 2015-10-12, 12 files, -140/+188)
  This PR improves the unrolling and reading of complex types in the columnar cache:
  1) Use UnsafeProjection to do serialization of complex types, so they will not be serialized three times (two of those for actualSize).
  2) Copy the bytes from UnsafeRow/UnsafeArrayData to the ByteBuffer directly, avoiding the intermediate byte[].
  3) Use the underlying array in the ByteBuffer to create UTF8String/UnsafeRow/UnsafeArrayData without copying.
  Combining these optimizations, we can reduce the unrolling time from 25s to 21s (20% less) and the scanning time from 3.5s to 2.5s (28% less).
  ```python
  df = sqlContext.read.parquet(path)
  t = time.time()
  df.cache()
  df.count()
  print 'unrolling', time.time() - t

  for i in range(10):
      t = time.time()
      print df.select("*")._jdf.queryExecution().toRdd().count()
      print time.time() - t
  ```
  The schema is:
  ```
  root
   |-- a: struct (nullable = true)
   |    |-- b: long (nullable = true)
   |    |-- c: string (nullable = true)
   |-- d: array (nullable = true)
   |    |-- element: long (containsNull = true)
   |-- e: map (nullable = true)
   |    |-- key: long
   |    |-- value: string (valueContainsNull = true)
  ```
  The columnar cache now depends on UnsafeProjection supporting all the data types (including UDTs); this PR also fixes that.
  Author: Davies Liu <davies@databricks.com>
  Closes #9016 from davies/complex2.
* [SPARK-10739] [YARN] Add application attempt window for Spark on Yarn (jerryshao, 2015-10-12, 2 files, -0/+23)
  Add an application attempt window for Spark on YARN to ignore old, out-of-window failures; this is useful for long running applications to recover from failures.
  Author: jerryshao <sshao@hortonworks.com>
  Closes #8857 from jerryshao/SPARK-10739 and squashes the following commits:
  36eabdc [jerryshao] change the doc
  7f9b77d [jerryshao] Style change
  1c9afd0 [jerryshao] Address the comments
  caca695 [jerryshao] Add application attempt window for Spark on Yarn
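  A hedged usage sketch of the feature; the property name follows the YARN configuration documented for this behavior in later releases and should be treated as an assumption here, not a detail taken from this commit.
  ```scala
  import org.apache.spark.SparkConf

  // Only AM failures within the last hour count toward the max-attempts limit.
  val conf = new SparkConf()
    .set("spark.yarn.am.attemptFailuresValidityInterval", "1h")
    .set("spark.yarn.maxAppAttempts", "4")
  ```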
* [SPARK-11056] Improve documentation of SBT build (Kay Ousterhout, 2015-10-12, 1 file, -0/+5)
  This commit improves the documentation around building Spark to (1) recommend using SBT interactive mode to avoid the overhead of launching SBT and (2) refer to the wiki page that documents using SPARK_PREPEND_CLASSES to avoid creating the assembly jar for each compile. cc srowen
  Author: Kay Ousterhout <kayousterhout@gmail.com>
  Closes #9068 from kayousterhout/SPARK-11056.
* [SPARK-11042] [SQL] Add a mechanism to ban creating multiple root SQLContexts/HiveContexts in a JVM (Yin Huai, 2015-10-12, 4 files, -7/+156)
  https://issues.apache.org/jira/browse/SPARK-11042
  Author: Yin Huai <yhuai@databricks.com>
  Closes #9058 from yhuai/SPARK-11042.
* [SPARK-8170] [PYTHON] Add signal handler to trap Ctrl-C in pyspark and cancel all running jobs (Ashwin Shankar, 2015-10-12, 1 file, -0/+7)
  This patch adds a signal handler to trap Ctrl-C and cancel running jobs.
  Author: Ashwin Shankar <ashankar@netflix.com>
  Closes #9033 from ashwinshankar77/master.
* [SPARK-11023] [YARN] Avoid creating URIs from local paths directly (Marcelo Vanzin, 2015-10-12, 1 file, -5/+6)
  The issue is that local paths on Windows, when provided with drive letters or backslashes, are not valid URIs. Instead of trying to figure out whether paths are URIs or not, use Utils.resolveURI(), which does that for us.
  Author: Marcelo Vanzin <vanzin@cloudera.com>
  Closes #9049 from vanzin/SPARK-11023 and squashes the following commits:
  77021f2 [Marcelo Vanzin] [SPARK-11023] [yarn] Avoid creating URIs from local paths directly.
* [SPARK-11007] [SQL] Adds dictionary aware Parquet decimal converters (Cheng Lian, 2015-10-12, 6 files, -26/+103)
  For Parquet decimal columns that are encoded using plain-dictionary encoding, we can make the upper level converter aware of the dictionary, so that we can pre-instantiate all the decimals to avoid duplicated instantiation. Note that plain-dictionary encoding isn't available for `FIXED_LEN_BYTE_ARRAY` for Parquet writer version `PARQUET_1_0`. So currently only decimals written as `INT32` and `INT64` can benefit from this optimization.
  Author: Cheng Lian <lian@databricks.com>
  Closes #9040 from liancheng/spark-11007.decimal-converter-dict-support.
* [SPARK-10960] [SQL] SQL with windowing function should be able to refer column in inner select (Liang-Chi Hsieh, 2015-10-12, 2 files, -0/+31)
  JIRA: https://issues.apache.org/jira/browse/SPARK-10960
  When accessing a column in an inner select from a select with a window function, `AnalysisException` will be thrown. For example, a query like this:
  ```sql
  select area, rank() over (partition by area order by tmp.month) + tmp.tmp1 as c1
  from (select month, area, product, 1 as tmp1 from windowData) tmp
  ```
  Currently, the rule `ExtractWindowExpressions` in `Analyzer` only extracts regular expressions from `WindowFunction`, `WindowSpecDefinition` and `AggregateExpression`. We need to also extract other attributes, such as the one in `Alias` as shown in the above query.
  Author: Liang-Chi Hsieh <viirya@appier.com>
  Closes #9011 from viirya/fix-window-inner-column.
* [SPARK-11053] Remove use of KVIterator in SortBasedAggregationIterator (Josh Rosen, 2015-10-11, 3 files, -159/+33)
  SortBasedAggregationIterator uses a KVIterator interface in order to process input rows as key-value pairs, but this use of KVIterator is unnecessary, slightly complicates the code, and might hurt performance. This patch refactors this code to remove the use of this extra layer of iterator wrapping and simplifies other parts of the code in the process.
  Author: Josh Rosen <joshrosen@databricks.com>
  Closes #9066 from JoshRosen/sort-iterator-cleanup.
* [SPARK-10772] [STREAMING] [SCALA] NullPointerException when transform function in DStream returns NULL (Jacker Hu, 2015-10-10, 2 files, -2/+23)
  Currently, ```TransformedDStream``` uses ```Some(transformFunc(parentRDDs, validTime))``` as the compute return value; when ```transformFunc``` somehow returns null, the downstream operator hits a NullPointerException. This fix uses ```Option()``` instead of ```Some()``` to deal with the possible null value: when ```transformFunc``` returns ```null```, the option turns it into ```None```, which the downstream can handle correctly.
  NOTE (2015-09-25): The latest fix checks the return value of the transform function; if it is ```NULL```, a Spark exception will be thrown.
  Author: Jacker Hu <gt.hu.chang@gmail.com>
  Author: jhu-chang <gt.hu.chang@gmail.com>
  Closes #8881 from jhu-chang/Fix_Transform.
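  A one-line illustration of why the change works, in plain Scala independent of Spark:
  ```scala
  // Some() wraps null as-is, while Option() converts null to None.
  val maybeNull: String = null
  Some(maybeNull)    // Some(null): downstream .get or map would hit the null
  Option(maybeNull)  // None: downstream handles the missing value safely
  ```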
* [SPARK-10079] [SPARKR] Make 'column' and 'col' functions be S4 functions (Sun Rui, 2015-10-09, 5 files, -9/+34)
  1. Add a "col" function into DataFrame.
  2. Move the current "col" function in Column.R to functions.R and convert it to an S4 function.
  3. Add an S4 "column" function in functions.R.
  4. Convert the "column" function in Column.R to an S4 function. This is for private use.
  Author: Sun Rui <rui.sun@intel.com>
  Closes #8864 from sun-rui/SPARK-10079.
* [SPARK-10535] Sync up API for matrix factorization model between Scala and PySpark (Vladimir Vladimirov, 2015-10-09, 2 files, -4/+36)
  Support for recommendUsersForProducts and recommendProductsForUsers in the matrix factorization model for PySpark.
  Author: Vladimir Vladimirov <vladimir.vladimirov@magnetic.com>
  Closes #8700 from smartkiwi/SPARK-10535_.
* [SPARK-10858] YARN: archives/jar/files rename with # doesn't work unl… (Tom Graves, 2015-10-09, 2 files, -3/+10)
  https://issues.apache.org/jira/browse/SPARK-10858
  The issue here is that in resolveURI we default to calling new File(path).getAbsoluteFile().toURI(). But if the path passed in already has a # in it, then File(path) will think that is supposed to be part of the actual file path and not a fragment, so it changes # to %23. Then when we try to parse that later in Client as a URI, it doesn't recognize that there is a fragment. To fix this, we just check if there is a fragment, still create the File like we did before, and then add the fragment back on.
  Author: Tom Graves <tgraves@yahoo-inc.com>
  Closes #9035 from tgravescs/SPARK-10858.
* [SPARK-10855] [SQL] Add a JDBC dialect for Apache Derby (Rick Hillegas, 2015-10-09, 2 files, -1/+41)
  marmbrus rxin
  This patch adds a JdbcDialect class which customizes the datatype mappings for Derby backends. The patch also adds unit tests for the new dialect, corresponding to the existing tests for other JDBC dialects. JDBCSuite runs cleanly for me with this patch. So does JDBCWriteSuite, although it produces noise as described here: https://issues.apache.org/jira/browse/SPARK-10890
  This patch is my original work, which I license to the ASF. I am a Derby contributor, so my ICLA is on file under SVN id "rhillegas": http://people.apache.org/committer-index.html
  Touches the following files:
  - org.apache.spark.sql.jdbc.JdbcDialects: adds a DerbyDialect.
  - org.apache.spark.sql.jdbc.JDBCSuite: adds unit tests for the new DerbyDialect.
  Author: Rick Hillegas <rhilleg@us.ibm.com>
  Closes #8982 from rick-ibm/b_10855.
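  For context, a hedged sketch of how a custom dialect plugs in through the public JdbcDialects API of that era; only the URL check is shown, whereas the real DerbyDialect also customizes type mappings.
  ```scala
  import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

  // Minimal custom dialect: claim Derby JDBC URLs; type-mapping overrides
  // (getCatalystType / getJDBCType) are omitted in this sketch.
  object MyDerbyDialect extends JdbcDialect {
    override def canHandle(url: String): Boolean = url.startsWith("jdbc:derby")
  }

  JdbcDialects.registerDialect(MyDerbyDialect)
  ```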
* [SPARK-8673] [LAUNCHER] API and infrastructure for communicating with child apps (Marcelo Vanzin, 2015-10-09, 29 files, -146/+1820)
  This change adds an API that encapsulates information about an app launched using the library. It also creates a socket-based communication layer for apps that are launched as child processes; the launching application listens for connections from launched apps, and once communication is established, the channel can be used to send updates to the launching app, or to send commands to the child app. The change also includes hooks for local, standalone/client and yarn masters.
  Author: Marcelo Vanzin <vanzin@cloudera.com>
  Closes #7052 from vanzin/SPARK-8673.
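  A hedged sketch of the launcher-side usage this enables; the handle-returning method and state accessor follow the launcher API as documented in later releases, so treat names like startApplication as assumptions rather than this commit's exact surface. The app resource, main class, and master are placeholders.
  ```scala
  import org.apache.spark.launcher.SparkLauncher

  // Launch a child app and keep a handle that reports its state over the
  // socket-based channel described above.
  val handle = new SparkLauncher()
    .setAppResource("/path/to/app.jar")    // placeholder
    .setMainClass("com.example.MyApp")     // placeholder
    .setMaster("local[*]")
    .startApplication()

  // The handle can be polled for state, or used to stop/kill the child app.
  println(handle.getState)
  ```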
* [SPARK-10905] [SPARKR] Export freqItems() for DataFrameStatFunctions (Rerngvit Yanggratoke, 2015-10-09, 4 files, -0/+53)
  [SPARK-10905][SparkR]: Export freqItems() for DataFrameStatFunctions
  - Add function (together with roxygen2 doc) to DataFrame.R and generics.R
  - Expose the function in NAMESPACE
  - Add unit test for the function
  Author: Rerngvit Yanggratoke <rerngvit@kth.se>
  Closes #8962 from rerngvit/SPARK-10905.
* [SPARK-10875] [MLLIB] Computed covariance matrix should be symmetric (Nick Pritchard, 2015-10-08, 2 files, -2/+22)
  Compute the upper triangular values of the covariance matrix, then copy them to the lower triangular values.
  Author: Nick Pritchard <nicholas.pritchard@falkonry.com>
  Closes #8940 from pnpritchard/SPARK-10875.
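  A small sketch of that symmetrization step for a dense n x n matrix stored column-major; this illustrates the approach rather than reproducing the patch's code.
  ```scala
  // Mirror the computed upper triangle into the lower triangle so that
  // cov(i, j) == cov(j, i) exactly, avoiding floating-point asymmetry.
  def symmetrizeFromUpper(values: Array[Double], n: Int): Unit = {
    var j = 0
    while (j < n) {
      var i = j + 1
      while (i < n) {
        values(i + j * n) = values(j + i * n)  // lower(i, j) := upper(j, i)
        i += 1
      }
      j += 1
    }
  }
  ```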
* [SPARK-10959] [PYSPARK] StreamingLogisticRegressionWithSGD does not train with given regParam and convergenceTol parameters (Bryan Cutler, 2015-10-08, 2 files, -2/+3)
  These params were being passed into the StreamingLogisticRegressionWithSGD constructor but not transferred to the call for model training. Same with StreamingLinearRegressionWithSGD. I added the params as named arguments to the call and also fixed the intercept parameter, which was being passed as the regularization value.
  Author: Bryan Cutler <bjcutler@us.ibm.com>
  Closes #9002 from BryanCutler/StreamingSGD-convergenceTol-bug-10959.