| Commit message (Collapse) | Author | Age | Files | Lines |
... | |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Add `import org.apache.spark.sql.Dsl._` to make DSL query works.
Since queryExecution is not avaliable in DataFrame, so remove it.
Author: OopsOutOfMemory <victorshengli@126.com>
Author: Sheng, Li <OopsOutOfMemory@users.noreply.github.com>
Closes #4330 from OopsOutOfMemory/hiveconsole and squashes the following commits:
46eb790 [Sheng, Li] Update SparkBuild.scala
d23ee9f [OopsOutOfMemory] minor
d4dd593 [OopsOutOfMemory] refine hive console
(cherry picked from commit b73d5fff7208f9d09adde92c241c70095983bc09)
Signed-off-by: Michael Armbrust <michael@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
A follow up for #4163: support `select array(key, *) from src`
Since array(key, *) will not go into this case
```
case Alias(f UnresolvedFunction(_, args), name) if containsStar(args) =>
val expandedArgs = args.flatMap {
case s: Star => s.expand(child.output, resolver)
case o => o :: Nil
}
```
here added a case to cover the corner case of array.
/cc liancheng
Author: wangfei <wangfei1@huawei.com>
Author: scwf <wangfei1@huawei.com>
Closes #4353 from scwf/udf-star1 and squashes the following commits:
4350d17 [wangfei] minor fix
a7cd191 [wangfei] minor fix
0942fb1 [wangfei] follow up: support select array(key, *) from src
6ae00db [wangfei] also fix problem with array
da1da09 [scwf] minor fix
f87b5f9 [scwf] added test case
587bf7e [wangfei] compile fix
eb93c16 [wangfei] fix star resolve issue in udf
(cherry picked from commit 417d1118cd720b616bfd30e9c1a0bfb1838e8cbd)
Signed-off-by: Michael Armbrust <michael@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Right now the PR adds few helper methods for java apis. But the issue was opened mainly to get rid of transformations in java api like `.rdd` and `.toJavaRDD` while working with `SQLContext` or `HiveContext`.
Author: kul <kuldeep.bora@gmail.com>
Closes #4243 from kul/master and squashes the following commits:
2390fba [kul] [SPARK-5426][SQL] Add SparkSQL Java API helper methods.
(cherry picked from commit 424cb699ee9b091eb23b86dc018a86e377ad309f)
Signed-off-by: Michael Armbrust <michael@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Support change database owner, here i do not add the golden files since the golden answer is related to the tmp dir path (see https://github.com/scwf/spark/commit/6331e4ac0f982caf70531defcb957be76fe093c7)
Author: wangfei <wangfei1@huawei.com>
Closes #4357 from scwf/db_owner and squashes the following commits:
f761533 [wangfei] remove the alter_db_owner which have added to whitelist
79413c6 [wangfei] Revert "added golden files"
6331e4a [wangfei] added golden files
6f7cacd [wangfei] support change database owner
(cherry picked from commit b90dd39793ab0a13a4559fcfff3bb5305c92ab3b)
Signed-off-by: Michael Armbrust <michael@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Now CTAS runs successfully but will throw a NoSuchObjectException.
```
create table sc as select *
from (select '2011-01-11', '2011-01-11+14:18:26' from src tablesample (1 rows)
union all
select '2011-01-11', '2011-01-11+15:18:26' from src tablesample (1 rows)
union all
select '2011-01-11', '2011-01-11+16:18:26' from src tablesample (1 rows) ) s;
```
Get this exception:
ERROR Hive: NoSuchObjectException(message:default.sc table not found)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_table(HiveMetaStore.java:1560)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:105)
at $Proxy8.get_table(Unknown Source)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:997)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89)
at $Proxy9.getTable(Unknown Source)
at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:976)
at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:950)
at org.apache.spark.sql.hive.HiveMetastoreCatalog.tableExists(HiveMetastoreCatalog.scala:152)
at org.apache.spark.sql.hive.HiveContext$$anon$2.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$tableExists(HiveContext.scala:309)
at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.tableExists(Catalog.scala:121)
at org.apache.spark.sql.hive.HiveContext$$anon$2.tableExists(HiveContext.scala:309)
at org.apache.spark.sql.hive.execution.CreateTableAsSelect.run(CreateTableAsSelect.scala:63)
at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:53)
Author: wangfei <wangfei1@huawei.com>
Closes #4365 from scwf/ctas-exception and squashes the following commits:
c7c67bc [wangfei] no used imports
f54eb2a [wangfei] fix exception for CTAS
(cherry picked from commit a9f0db1fb41a6e5510d2c14f79ff086da2ecc7b0)
Signed-off-by: Michael Armbrust <michael@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently, if there are different locality in a task set, the tasks with NODE_LOCAL only get scheduled after all the PROCESS_LOCAL tasks are scheduled and timeout with spark.locality.wait.process (3 seconds by default). In local mode, the LocalScheduler will never call resourceOffer() again once it failed to get a task with same locality, then all the NODE_LOCAL tasks will be never scheduled.
This bug could be reproduced by run example python/streaming/stateful_network_wordcount.py, it will hang after finished a batch with some data.
This patch will check whether there is task for current locality level, if not, it will change to next locality level without waiting for `spark.locality.wait.process` seconds. It works for all locality levels.
Because the list of pending tasks are updated lazily, the check can be false-positive, it means it will not move to next locality level even there is no valid pending tasks, it will wait for timeout.
Author: Davies Liu <davies@databricks.com>
Closes #3779 from davies/local_streaming and squashes the following commits:
2d25fb3 [Davies Liu] Update TaskSetManager.scala
1550668 [Davies Liu] add comment
1c37aac [Davies Liu] address comments
6b13824 [Davies Liu] address comments
906f456 [Davies Liu] Merge branch 'master' of github.com:apache/spark into local_streaming
414e79e [Davies Liu] fix bug, add logging
ff8eabb [Davies Liu] Merge branch 'master' into local_streaming
28d1b3c [Davies Liu] check tasks
9d0ceab [Davies Liu] Merge branch 'master' of github.com:apache/spark into local_streaming
37a2804 [Davies Liu] fix tests
49bda82 [Davies Liu] address comment
d8fb95a [Davies Liu] move to next locality level if no more tasks
2d6ae73 [Davies Liu] add comments
32d363f [Davies Liu] add regression test
7d8c5a5 [Davies Liu] jump to next locality if no pending tasks for executors
(cherry picked from commit 0a89b156850fc5ba93160987927f249a7e633d51)
Signed-off-by: Kay Ousterhout <kayousterhout@gmail.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
...ck generator fails to store data.
The Reliable Kafka Receiver commits offsets only when events are actually stored, which ensures that on restart we will actually start where we left off. But if the failure happens in the store() call, and the block generator reports an error the receiver does not do anything and will continue reading from the current offset and not the last commit. This means that messages between the last commit and the current offset will be lost.
This PR retries the store call four times and then stops the receiver with an error message and the last exception that was received from the store.
Author: Hari Shreedharan <hshreedharan@apache.org>
Closes #3655 from harishreedharan/kafka-failure-fix and squashes the following commits:
5e2e7ad [Hari Shreedharan] [SPARK-4704][STREAMING] Reliable Kafka Receiver can lose data if the block generator fails to store data.
(cherry picked from commit f0500f9fa378d81e4b4038a66a40eee15806b677)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Author: cody koeninger <cody@koeninger.org>
Closes #3798 from koeninger/kafkaRdd and squashes the following commits:
1dc2941 [cody koeninger] [SPARK-4964] silence ConsumerConfig warnings about broker connection props
59e29f6 [cody koeninger] [SPARK-4964] settle on "Direct" as a naming convention for the new stream
8c31855 [cody koeninger] [SPARK-4964] remove HasOffsetRanges interface from return types
0df3ebe [cody koeninger] [SPARK-4964] add comments per pwendell / dibbhatt
8991017 [cody koeninger] [SPARK-4964] formatting
825110f [cody koeninger] [SPARK-4964] rename stuff per TD
4354bce [cody koeninger] [SPARK-4964] per td, remove java interfaces, replace with final classes, corresponding changes to KafkaRDD constructor and checkpointing
9adaa0a [cody koeninger] [SPARK-4964] formatting
0090553 [cody koeninger] [SPARK-4964] javafication of interfaces
9a838c2 [cody koeninger] [SPARK-4964] code cleanup, add more tests
2b340d8 [cody koeninger] [SPARK-4964] refactor per TD feedback
80fd6ae [cody koeninger] [SPARK-4964] Rename createExactlyOnceStream so it isnt over-promising, change doc
99d2eba [cody koeninger] [SPARK-4964] Reduce level of nesting. If beginning is past end, its actually an error (may happen if Kafka topic was deleted and recreated)
19406cc [cody koeninger] Merge branch 'master' of https://github.com/apache/spark into kafkaRdd
2e67117 [cody koeninger] [SPARK-4964] one potential way of hiding most of the implementation, while still allowing access to offsets (but not subclassing)
bb80bbe [cody koeninger] [SPARK-4964] scalastyle line length
d4a7cf7 [cody koeninger] [SPARK-4964] allow for use cases that need to override compute for custom kafka dstreams
c1bd6d9 [cody koeninger] [SPARK-4964] use newly available attemptNumber for correct retry behavior
548d529 [cody koeninger] Merge branch 'master' of https://github.com/apache/spark into kafkaRdd
0458e4e [cody koeninger] [SPARK-4964] recovery of generated rdds from checkpoint
e86317b [cody koeninger] [SPARK-4964] try seed brokers in random order to spread metadata requests
e93eb72 [cody koeninger] [SPARK-4964] refactor to add preferredLocations. depends on SPARK-4014
356c7cc [cody koeninger] [SPARK-4964] code cleanup per helena
adf99a6 [cody koeninger] [SPARK-4964] fix serialization issues for checkpointing
1d50749 [cody koeninger] [SPARK-4964] code cleanup per tdas
8bfd6c0 [cody koeninger] [SPARK-4964] configure rate limiting via spark.streaming.receiver.maxRate
e09045b [cody koeninger] [SPARK-4964] add foreachPartitionWithIndex, to avoid doing equivalent map + empty foreach boilerplate
cac63ee [cody koeninger] additional testing, fix fencepost error
37d3053 [cody koeninger] make KafkaRDDPartition available to users so offsets can be committed per partition
bcca8a4 [cody koeninger] Merge branch 'master' of https://github.com/apache/spark into kafkaRdd
6bf14f2 [cody koeninger] first attempt at a Kafka dstream that allows for exactly-once semantics
326ff3c [cody koeninger] add some tests
38bb727 [cody koeninger] give easy access to the parameters of a KafkaRDD
979da25 [cody koeninger] dont allow empty leader offsets to be returned
8d7de4a [cody koeninger] make sure leader offsets can be found even for leaders that arent in the seed brokers
4b078bf [cody koeninger] differentiate between leader and consumer offsets in error message
3c2a96a [cody koeninger] fix scalastyle errors
29c6b43 [cody koeninger] cleanup logging
783b477 [cody koeninger] update tests for kafka 8.1.1
7d050bc [cody koeninger] methods to set consumer offsets and get topic metadata, switch back to inclusive start / exclusive end to match typical kafka consumer behavior
ce91c59 [cody koeninger] method to get consumer offsets, explicit error handling
4dafd1b [cody koeninger] method to get leader offsets, switch rdd bound to being exclusive start, inclusive end to match offsets typically returned from cluster
0b94b33 [cody koeninger] use dropWhile rather than filter to trim beginning of fetch response
1d70625 [cody koeninger] WIP on kafka cluster
76913e2 [cody koeninger] Batch oriented kafka rdd, WIP. todo: cluster metadata / finding leader
(cherry picked from commit b0c0021953826bccaee818a54afc44e8bdfa8572)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
```
df.selectExpr('a + 1', 'abs(age)')
df.filter('age > 3')
df[ df.age > 3 ]
df[ ['age', 'name'] ]
```
Author: Davies Liu <davies@databricks.com>
Closes #4359 from davies/select_expr and squashes the following commits:
d99856b [Davies Liu] support select/filter by SQL expression
(cherry picked from commit ac0b2b788ff144970d6fdbdc445367772770458d)
Signed-off-by: Reynold Xin <rxin@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Add a seed for tests.
Author: Davies Liu <davies@databricks.com>
Closes #4358 from davies/flaky_test and squashes the following commits:
02371c3 [Davies Liu] Merge branch 'master' of github.com:apache/spark into flaky_test
ced499b [Davies Liu] add seed for test
(cherry picked from commit 38a416f0360fa68fc445af14910fb253ff9ad493)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
https://issues.apache.org/jira/browse/SPARK-5574
very minor, doesn't effect external behavior at all.
Note that after this change, some of these dirs no longer will have "spark" in the name at all. I could change those locations that do pass in a name prefix to also include "spark", eg. "blockmgr" -> "spark-blockmgr"
Author: Imran Rashid <irashid@cloudera.com>
Closes #4344 from squito/SPARK-5574 and squashes the following commits:
33a84fe [Imran Rashid] use given name prefix in dir
(cherry picked from commit 5aa0f219d1f8e8d3bb083dd547532ebb72ba7a15)
Signed-off-by: Reynold Xin <rxin@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The warning log looks incorrect. Just fix it.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes #4360 from viirya/fixing_typo and squashes the following commits:
48fbe4f [Liang-Chi Hsieh] Fix incorrect warning log.
(cherry picked from commit a74cbbf12fa59df37eb7172652138c78707d33d8)
Signed-off-by: Tathagata Das <tdas@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Added `awaitTerminationOrTimeout` to return if the waiting time elapsed:
* `true` if it's stopped.
* `false` if the waiting time elapsed before returning from the method.
* throw the reported error if it's thrown during the execution.
Also deprecated `awaitTermination(timeout: Long)`.
Author: zsxwing <zsxwing@gmail.com>
Closes #4171 from zsxwing/SPARK-5379 and squashes the following commits:
c9e660b [zsxwing] Add a unit test for awaitTerminationOrTimeout
8a89f92 [zsxwing] Add awaitTerminationOrTimeout to python
cdc820b [zsxwing] Add awaitTerminationOrTimeout
(cherry picked from commit 4cf4cba08f1757ec0d9bffdfae6db719a4fb5a3f)
Signed-off-by: Tathagata Das <tdas@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
spark-submit
This PR adds support for using maven coordinates as dependencies to spark-shell.
Coordinates can be provided as a comma-delimited string after the flag `--packages`.
Additional remote repositories (like SonaType) can be supplied as a comma-delimited string after the flag
`--repositories`.
Uses the Ivy library to resolve dependencies. Unfortunately the library has no decent documentation, therefore solving more complex dependency issues can be a problem.
pwendell, mateiz, mengxr
**Note: This is still a WIP. The following need to be handled:**
- [x] add docs for the methods
- [x] take local ivy cache path as an argument
- [x] add tests
- [x] add Windows compatibility
- [x] exclude unused Ivy dependencies
Author: Burak Yavuz <brkyvz@gmail.com>
Closes #4215 from brkyvz/SPARK-5341ivy and squashes the following commits:
9215851 [Burak Yavuz] ready to merge
db2a5cc [Burak Yavuz] changed logging to printStream
9dae87f [Burak Yavuz] file separators changed
71c374d [Burak Yavuz] merge conflicts fixed
c08dc9f [Burak Yavuz] fixed merge conflicts
3ada19a [Burak Yavuz] fixed Jenkins error (hopefully) and added comment on oro
43c2290 [Burak Yavuz] fixed that ONE line
231f72f [Burak Yavuz] addressed code review
2cd6562 [Burak Yavuz] Merge branch 'master' of github.com:apache/spark into SPARK-5341ivy
85ec5a3 [Burak Yavuz] added oro as a dependency explicitly
ea44ca4 [Burak Yavuz] add oro back to dependencies
cef0e24 [Burak Yavuz] IntelliJ is just messing things up
97c4a92 [Burak Yavuz] fix more weird IntelliJ formatting
9cf077d [Burak Yavuz] fix weird IntelliJ formatting
dcf5e13 [Burak Yavuz] fix windows command line flags
3a23f21 [Burak Yavuz] excluded ivy dependencies
53423e0 [Burak Yavuz] tests added
3705907 [Burak Yavuz] remove ivy-repo as a command line argument. Use global ivy cache as default
c04d885 [Burak Yavuz] take path to ivy cache as a conf
2edc9b5 [Burak Yavuz] managed to exclude Spark and it's dependencies
a0870af [Burak Yavuz] add docs. remove unnecesary new lines
6645af4 [Burak Yavuz] [SPARK-5341] added base implementation
882c4c8 [Burak Yavuz] added maven dependency download
(cherry picked from commit 6aed719e503afa48820f6e3e798da483649dfcb9)
Signed-off-by: Patrick Wendell <patrick@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The locality timeout assume that the SchedulerBackend can revive offers periodically, but currently LocalBackend did do that, then some job with mixed locality levels in local mode will hang forever.
This PR let LocalBackend revive offers periodically, just like in cluster mode.
Author: Davies Liu <davies@databricks.com>
Closes #4147 from davies/revive and squashes the following commits:
2acdf9d [Davies Liu] Update LocalBackend.scala
3c8ca7c [Davies Liu] Update LocalBackend.scala
d1b60d2 [Davies Liu] address comments from Kay
33ac9bb [Davies Liu] fix build
d0da0d5 [Davies Liu] Merge branch 'master' of github.com:apache/spark into revive
6cf5972 [Davies Liu] fix thread-safety
ed62a31 [Davies Liu] fix scala style
df9008b [Davies Liu] fix typo
bfc1396 [Davies Liu] revive offers periodically in LocalBackend
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In Spark 1.2 we added a `binaryRecords` input method for loading flat binary data. This format is useful for numerical array data, e.g. in scientific computing applications. This PR adds support for the same format in Streaming applications, where it is similarly useful, especially for streaming time series or sensor data.
Summary of additions
- adding `binaryRecordsStream` to Spark Streaming
- exposing `binaryRecordsStream` in the new PySpark Streaming
- new unit tests in Scala and Python
This required adding an optional Hadoop configuration param to `fileStream` and `FileInputStream`, but was otherwise straightforward.
tdas davies
Author: freeman <the.freeman.lab@gmail.com>
Closes #3803 from freeman-lab/streaming-binary-records and squashes the following commits:
b676534 [freeman] Clarify note
5ff1b75 [freeman] Add note to java streaming context
eba925c [freeman] Simplify notes
c4237b8 [freeman] Add experimental tag
30eba67 [freeman] Add filter and newFilesOnly alongside conf
c2cfa6d [freeman] Expose new version of fileStream with conf in java
34d20ef [freeman] Add experimental tag
14bca9a [freeman] Add experimental tag
b85bffc [freeman] Formatting
47560f4 [freeman] Space formatting
9a3715a [freeman] Refactor to reflect changes to FileInputSuite
7373f73 [freeman] Add note and defensive assertion for byte length
3ceb684 [freeman] Merge remote-tracking branch 'upstream/master' into streaming-binary-records
317b6d1 [freeman] Make test inline
fcb915c [freeman] Formatting
becb344 [freeman] Formatting
d3e75b2 [freeman] Add tests in python
a4324a3 [freeman] Line length
029d49c [freeman] Formatting
1c739aa [freeman] Simpler default arg handling
94d90d0 [freeman] Spelling
2843e9d [freeman] Add params to docstring
8b70fbc [freeman] Reorganization
28bff9b [freeman] Fix missing arg
9398bcb [freeman] Expose optional hadoop configuration
23dd69f [freeman] Tests for binaryRecordsStream
36cb0fd [freeman] Add binaryRecordsStream to scala
fe4e803 [freeman] Add binaryRecordStream to Java API
ecef0eb [freeman] Add binaryRecordsStream to python
8550c26 [freeman] Expose additional argument combination
(cherry picked from commit 242b4f02df7f71ebcfa86a85c9ed39e40750a7fd)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
```scala
df.selectExpr("abs(colA)", "colB")
df.filter("age > 21")
```
Author: Reynold Xin <rxin@databricks.com>
Closes #4348 from rxin/SPARK-5579 and squashes the following commits:
2baeef2 [Reynold Xin] Fix Python.
b416372 [Reynold Xin] [SPARK-5579][SQL][DataFrame] Support for project/filter using SQL expressions.
(cherry picked from commit 40c4cb2fe79ceac0d656be7b72cb2ee8d7db7258)
Signed-off-by: Reynold Xin <rxin@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If `seed` is `None` on the python side, it will pass in as a `null`. So we should use `java.lang.Long` instead of `Long` to take it.
Author: Xiangrui Meng <meng@databricks.com>
Closes #4349 from mengxr/gmm-fix and squashes the following commits:
3be5926 [Xiangrui Meng] fix seed handling in Python GMM
(cherry picked from commit eb15631854f8e8e7bcd1b3331f18d6ee81dd6d5c)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
to make them be activated automatically
Try to redesign the "primitive type => Writable" implicit APIs to make them be activated automatically and without breaking binary compatibility.
However, this PR will breaking the source compatibility if people use `xxxToXxxWritable` occasionally. See the unit test in `graphx`.
Author: zsxwing <zsxwing@gmail.com>
Closes #3642 from zsxwing/SPARK-4795 and squashes the following commits:
914b2d6 [zsxwing] Add implicit back to the Writables methods
0b9017f [zsxwing] Add some docs
a0e8509 [zsxwing] Merge branch 'master' into SPARK-4795
39343de [zsxwing] Fix the unit test
64853af [zsxwing] Reorganize the rest 'implicit' methods in SparkContext
(cherry picked from commit d37978d8aafef8a2e637687f3848ca0a8b935b33)
Signed-off-by: Reynold Xin <rxin@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
UDFs
A more convenient way to define user-defined functions.
Author: Reynold Xin <rxin@databricks.com>
Closes #4345 from rxin/defineUDF and squashes the following commits:
639c0f8 [Reynold Xin] udf tests.
0a0b339 [Reynold Xin] defineUDF -> udf.
b452b8d [Reynold Xin] Fix UDF registration.
d2e42c3 [Reynold Xin] SQLContext.udf.register() returns a UserDefinedFunction also.
4333605 [Reynold Xin] [SQL][DataFrame] defineUDF.
(cherry picked from commit 1077f2e1def6266aee6ad6f0640a8f46cd273e21)
Signed-off-by: Reynold Xin <rxin@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Make FPGrowth.run API take generic item types:
`def run[Item: ClassTag, Basket <: Iterable[Item]](data: RDD[Basket]): FPGrowthModel[Item]`
so that user can invoke it by run[String, Seq[String]], run[Int, Seq[Int]], run[Int, List[Int]], etc.
Scala part is done, while java part is still in progress
Author: Jacky Li <jacky.likun@huawei.com>
Author: Jacky Li <jackylk@users.noreply.github.com>
Author: Xiangrui Meng <meng@databricks.com>
Closes #4340 from jackylk/SPARK-5520-WIP and squashes the following commits:
f5acf84 [Jacky Li] Merge pull request #2 from mengxr/SPARK-5520
63073d0 [Xiangrui Meng] update to make generic FPGrowth Java-friendly
737d8bb [Jacky Li] fix scalastyle
793f85c [Jacky Li] add Java test case
7783351 [Jacky Li] add generic support in FPGrowth
(cherry picked from commit e380d2d46c92b319eafe30974ac7c1509081fca4)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Add more tests and docs for DataFrame Python API, improve test coverage, fix bugs.
Author: Davies Liu <davies@databricks.com>
Closes #4331 from davies/fix_df and squashes the following commits:
dd9919f [Davies Liu] fix tests
467332c [Davies Liu] support string in cast()
83c92fe [Davies Liu] address comments
c052f6f [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_df
8dd19a9 [Davies Liu] fix tests in python 2.6
35ccb9f [Davies Liu] fix build
78ebcfa [Davies Liu] add sql_test.py in run_tests
9ab78b4 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_df
6040ba7 [Davies Liu] fix docs
3ab2661 [Davies Liu] add more tests for DataFrame
(cherry picked from commit 068c0e2ee05ee8b133c2dc26b8fa094ab2712d45)
Signed-off-by: Reynold Xin <rxin@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
terminate
A slow receiver might not have enough time to shutdown cleanly even when graceful shutdown is used. This PR extends graceful waiting to make sure all receivers have deregistered and that the receiver job has terminated.
Author: Jesper Lundgren <jesper.lundgren@vpon.com>
Closes #4338 from cleaton/stopreceivers and squashes the following commits:
a9cf223 [Jesper Lundgren] remove cleaner.ttl config
f969b6e [Jesper Lundgren] fix inversed logic in unit test
3d0bd35 [Jesper Lundgren] switch boleans to match running status instead of terminated
9a9ff88 [Jesper Lundgren] wait for receivers to shutdown and receiver job to terminate
d179372 [Jesper Lundgren] Add graceful shutdown unit test covering slow receiver onStop
(cherry picked from commit 1e8b5394b44a0d3b36f64f10576c3ae3b977810c)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
KafkaStreamSuite
Timeout increased to allow overloaded Jenkins to cope with delay in topic creation.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes #4342 from tdas/SPARK-5153 and squashes the following commits:
dc42762 [Tathagata Das] Increased timeout to deal with delays in overloaded Jenkins.
(cherry picked from commit 681f9df47ff40f7b0d9175d835e9758d33a13a06)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The previous #3732 is reverted due to some test failure.
Have fixed that.
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes #4325 from adrian-wang/datenative and squashes the following commits:
096e20d [Daoyuan Wang] fix for mixed timezone
0ed0fdc [Daoyuan Wang] fix test data
a2fdd4e [Daoyuan Wang] getDate
c37832b [Daoyuan Wang] row to catalyst
f0005b1 [Daoyuan Wang] add date in sql parser and java type conversion
024c9a6 [Daoyuan Wang] clean some import order
d6715fc [Daoyuan Wang] refactoring Date as Primitive Int internally
374abd5 [Daoyuan Wang] spark native date type support
(cherry picked from commit db821ed2ededf6ce79b838c77a9c10bed2ce555a)
Signed-off-by: Michael Armbrust <michael@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Add support for alias of udtfs, such as
```
select stack(2, key, value, key, value) as (a, b) from src limit 5;
select a, b from (select stack(2, key, value, key, value) as (a, b) from src) t limit 5
```
Author: wangfei <wangfei1@huawei.com>
Author: scwf <wangfei1@huawei.com>
Author: Fei Wang <wangfei1@huawei.com>
Closes #4186 from scwf/multi-alias-names and squashes the following commits:
c35e922 [wangfei] fix conflicts
adc8311 [wangfei] minor format fix
2783aed [wangfei] convert it to a Generate instead of leaving it inside of a Project clause
a87668a [wangfei] minor improvement
b25d9b3 [wangfei] resolve conflicts
d38f041 [wangfei] style fix
8cfcebf [wangfei] minor improvement
12a239e [wangfei] fix test case
050177f [wangfei] added extendedCheckRules
3d69329 [wangfei] added CheckMultiAlias to analyzer
324150d [wangfei] added multi alias node
74f5a81 [Fei Wang] imports order fix
5bc3f59 [scwf] style fix
3daec28 [scwf] support alias for udfs with multi output columns
(cherry picked from commit 5adbb39482631998dbfe4a1da88f6e75b30fb5ac)
Signed-off-by: Michael Armbrust <michael@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
SQL in HiveContext, should be case insensitive, however, the following query will fail.
```scala
udf.register("random0", () => { Math.random()})
assert(sql("SELECT RANDOM0() FROM src LIMIT 1").head().getDouble(0) >= 0.0)
```
Author: Cheng Hao <hao.cheng@intel.com>
Closes #4326 from chenghao-intel/udf_case_sensitive and squashes the following commits:
485cf66 [Cheng Hao] Support the case insensitive for UDF
(cherry picked from commit ca7a6cdff004eb4605fd223e127b4a46a0a214e7)
Signed-off-by: Michael Armbrust <michael@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes #3820 from adrian-wang/parquettimestamp and squashes the following commits:
b1e2a0d [Daoyuan Wang] fix for nanos
4dadef1 [Daoyuan Wang] fix wrong read
93f438d [Daoyuan Wang] parquet timestamp support
(cherry picked from commit 0c20ce69fb4bcb1cec5313a9d072826c5588cbbc)
Signed-off-by: Michael Armbrust <michael@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
1. Added Java-friendly version of the expression operators (i.e. gt, geq)
2. Added JavaDoc for most operators
3. Simplified expression operators by having only one version of the function (that accepts Any). Previously we had two methods for each expression operator, one accepting Any and another accepting Column.
4. agg function now accepts varargs of (String, String).
Author: Reynold Xin <rxin@databricks.com>
Closes #4332 from rxin/df-update and squashes the following commits:
ab0aa69 [Reynold Xin] Added Java friendly expression methods. Added JavaDoc. For each expression operator, have only one version of the function (that accepts Any). Previously we had two methods for each expression operator, one accepting Any and another accepting Column.
576d07a [Reynold Xin] random commit.
|
|
|
|
|
|
|
|
|
|
| |
Made a mistake in https://github.com/apache/spark/pull/4324
Author: Reynold Xin <rxin@databricks.com>
Closes #4333 from rxin/taskcontext-deprecate and squashes the following commits:
61c44ee [Reynold Xin] Minor: Fix TaskContext deprecated annotations.
|
|
|
|
|
|
|
|
|
|
|
|
| |
So the interface documentation shows up in ScalaDoc.
Author: Reynold Xin <rxin@databricks.com>
Closes #4324 from rxin/TaskContext-scala and squashes the following commits:
2480a17 [Reynold Xin] comment
573756f [Reynold Xin] style fixes and javadoc fixes.
87dd537 [Reynold Xin] [SPARK-5549] Define TaskContext interface in Scala.
|
|
|
|
|
|
|
|
|
|
| |
compatibility
Author: Reynold Xin <rxin@databricks.com>
Closes #4327 from rxin/schemarddTypeAlias and squashes the following commits:
e5a8ff3 [Reynold Xin] [SPARK-5551][SQL] Create type alias for SchemaRDD for source backward compatibility
|
|
|
|
|
|
|
|
|
|
| |
They were there mostly for code review and easier check of the API. I don't think they need to be there anymore.
Author: Reynold Xin <rxin@databricks.com>
Closes #4328 from rxin/remove-df-api and squashes the following commits:
723d600 [Reynold Xin] [SQL][DataFrame] Remove DataFrameApi and ColumnApi.
|
|
|
|
|
|
|
|
| |
Author: Xiangrui Meng <meng@databricks.com>
Closes #4329 from mengxr/streaming-lr and squashes the following commits:
78731e1 [Xiangrui Meng] update streaming linear algorithms
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
**This PR introduces an API + simple implementation for Latent Dirichlet Allocation (LDA).**
The [design doc for this PR](https://docs.google.com/document/d/1kSsDqTeZMEB94Bs4GTd0mvdAmduvZSSkpoSfn-seAzo) has been updated since I initially posted it. In particular, see the API and Planning for the Future sections.
* Settle on a public API which may eventually include:
* more inference algorithms
* more options / functionality
* Have an initial easy-to-understand implementation which others may improve.
* This is NOT intended to support every topic model out there. However, if there are suggestions for making this extensible or pluggable in the future, that could be nice, as long as it does not complicate the API or implementation too much.
* This may not be very scalable currently. It will be important to check and improve accuracy. For correctness of the implementation, please check against the Asuncion et al. (2009) paper in the design doc.
**Dependency: This makes MLlib depend on GraphX.**
Files and classes:
* LDA.scala (441 lines):
* class LDA (main estimator class)
* LDA.Document (text + document ID)
* LDAModel.scala (266 lines)
* abstract class LDAModel
* class LocalLDAModel
* class DistributedLDAModel
* LDAExample.scala (245 lines): script to run LDA + a simple (private) Tokenizer
* LDASuite.scala (144 lines)
Data/model representation and algorithm:
* Data/model: Uses GraphX, with term vertices + document vertices
* Algorithm: EM, following [Asuncion, Welling, Smyth, and Teh. "On Smoothing and Inference for Topic Models." UAI, 2009.](http://arxiv-web3.library.cornell.edu/abs/1205.2662v1)
* For more details, please see the description in the “DEVELOPERS NOTE” in LDA.scala
Please refer to the JIRA for more discussion + the [design doc for this PR](https://docs.google.com/document/d/1kSsDqTeZMEB94Bs4GTd0mvdAmduvZSSkpoSfn-seAzo)
Here, I list the main changes AFTER the design doc was posted.
Design decisions:
* logLikelihood() computes the log likelihood of the data and the current point estimate of parameters. This is different from the likelihood of the data given the hyperparameters, which would be harder to compute. I’d describe the current approach as more frequentist, whereas the harder approach would be more Bayesian.
* The current API takes Documents as token count vectors. I believe there should be an extended API taking RDD[String] or RDD[Array[String]] in a future PR. I have sketched this out in the design doc (as well as handier versions of getTopics returning Strings).
* Hyperparameters should be set differently for different inference/learning algorithms. See Asuncion et al. (2009) in the design doc for a good demonstration. I encourage good behavior via defaults and warning messages.
Items planned for future PRs:
* perplexity
* API taking Strings
* Should LDA be called LatentDirichletAllocation (and LDAModel be LatentDirichletAllocationModel)?
* Pro: We may someday want LinearDiscriminantAnalysis.
* Con: Very long names
* Should LDA reside in clustering? Or do we want a sub-package?
* mllib.topicmodel
* mllib.clustering.topicmodel
* Does the API seem reasonable and extensible?
* Unit tests:
* Should there be a test which checks a clustering results? E.g., train on a small, fake dataset with 2 very distinct topics/clusters, and ensure LDA finds those 2 topics/clusters. Does that sound useful or too flaky?
This has not been tested much for scaling. I have run it on a laptop for 200 iterations on a 5MB dataset with 1000 terms and 5 topics. Running it for 500 iterations made it fail because of GC problems. I'm running larger scale tests & will put results here, but future PRs may need to improve the scaling.
* dlwh for the initial implementation
* + jegonzal for some code in the initial implementation
* The many contributors towards topic model implementations in Spark which were referenced as a basis for this PR: akopich witgo yinxusen dlwh EntilZha jegonzal IlyaKozlov
* Note: The plan is to include this full list in the authors if this PR gets merged. Please notify me if you prefer otherwise.
CC: mengxr
Authors:
Joseph K. Bradley <joseph@databricks.com>
Joseph Gonzalez <joseph.e.gonzalez@gmail.com>
David Hall <david.lw.hall@gmail.com>
Guoqiang Li <witgo@qq.com>
Xiangrui Meng <meng@databricks.com>
Pedro Rodriguez <pedro@snowgeek.org>
Avanesov Valeriy <acopich@gmail.com>
Xusen Yin <yinxusen@gmail.com>
Closes #2388
Closes #4047 from jkbradley/davidhall-lda and squashes the following commits:
77e8814 [Joseph K. Bradley] small doc fix
5c74345 [Joseph K. Bradley] cleaned up doc based on code review
589728b [Joseph K. Bradley] Updates per code review. Main change was in LDAExample for faster vocab computation. Also updated PeriodicGraphCheckpointerSuite.scala to clean up checkpoint files at end
e3980d2 [Joseph K. Bradley] cleaned up PeriodicGraphCheckpointerSuite.scala
74487e5 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into davidhall-lda
4ae2a7d [Joseph K. Bradley] removed duplicate graphx dependency in mllib/pom.xml
e391474 [Joseph K. Bradley] Removed LDATiming. Added PeriodicGraphCheckpointerSuite.scala. Small LDA cleanups.
e8d8acf [Joseph K. Bradley] Added catch for BreakIterator exception. Improved preprocessing to reduce passes over data
1a231b4 [Joseph K. Bradley] fixed scalastyle
91aadfe [Joseph K. Bradley] Added Java-friendly run method to LDA. Added Java test suite for LDA. Changed LDAModel.describeTopics to return Java-friendly type
b75472d [Joseph K. Bradley] merged improvements from LDATiming into LDAExample. Will remove LDATiming after done testing
993ca56 [Joseph K. Bradley] * Removed Document type in favor of (Long, Vector) * Changed doc ID restriction to be: id must be nonnegative and unique in the doc (instead of 0,1,2,...) * Add checks for valid ranges of eta, alpha * Rename “LearningState” to “EMOptimizer” * Renamed params: termSmoothing -> topicConcentration, topicSmoothing -> docConcentration * Also added aliases alpha, beta
cb5a319 [Joseph K. Bradley] Added checkpointing to LDA * new class PeriodicGraphCheckpointer * params checkpointDir, checkpointInterval to LDA
43c1c40 [Joseph K. Bradley] small cleanup
0b90393 [Joseph K. Bradley] renamed LDA LearningState.collectTopicTotals to globalTopicTotals
77a2c85 [Joseph K. Bradley] Moved auto term,topic smoothing computation to get*Smoothing methods. Changed word to term in some places. Updated LDAExample to use default smoothing amounts.
fb1e7b5 [Xiangrui Meng] minor
08d59a3 [Xiangrui Meng] reset spacing
9fe0b95 [Xiangrui Meng] optimize aggregateMessages
cec0a9c [Xiangrui Meng] * -> *=
6cb11b0 [Xiangrui Meng] optimize computePTopic
9eb3d02 [Xiangrui Meng] + -> +=
892530c [Xiangrui Meng] use axpy
45cc7f2 [Xiangrui Meng] mapPart -> flatMap
ce53be9 [Joseph K. Bradley] fixed example name
75749e7 [Joseph K. Bradley] scala style fix
9f2a492 [Joseph K. Bradley] Unit tests and fixes for LDA, now ready for PR
377ebd9 [Joseph K. Bradley] separated LDA models into own file. more cleanups before PR
2d40006 [Joseph K. Bradley] cleanups before PR
2891e89 [Joseph K. Bradley] Prepped LDA main class for PR, but some cleanups remain
0cb7187 [Joseph K. Bradley] Added 3 files from dlwh LDA implementation
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The only issue is that `analyzeBlock` is removed, which was marked as a developer API. I didn't change other tests in the ALSSuite under `spark.mllib` to ensure that the implementation is correct.
CC: srowen coderxiang
Author: Xiangrui Meng <meng@databricks.com>
Closes #4321 from mengxr/SPARK-5536 and squashes the following commits:
5a3cee8 [Xiangrui Meng] update python tests that are too strict
e840acf [Xiangrui Meng] ignore scala style check for ALS.train
e9a721c [Xiangrui Meng] update mima excludes
9ee6a36 [Xiangrui Meng] merge master
9a8aeac [Xiangrui Meng] update tests
d8c3271 [Xiangrui Meng] remove analyzeBlocks
d68eee7 [Xiangrui Meng] add checkpoint to new ALS
22a56f8 [Xiangrui Meng] wrap old ALS
c387dff [Xiangrui Meng] support random seed
3bdf24b [Xiangrui Meng] make storage level configurable in the new ALS
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
events
There isn't a good way to write a SparkListener that receives all SparkListener events and which will be future-compatible (e.g. it will receive events introduced in newer versions of Spark without having to override new methods to process those events).
To address this, this patch adds `SparkFirehoseListener`, a SparkListener implementation that receives all events and dispatches them to a single `onEvent` method (which can be overridden by users).
Author: Josh Rosen <joshrosen@databricks.com>
Closes #4210 from JoshRosen/firehose-listener and squashes the following commits:
223f579 [Josh Rosen] Expand comment to explain rationale for this being a Java class.
ecdfaed [Josh Rosen] Add SparkFirehoseListener class for consuming all SparkListener events.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This PR aims to support `INSERT INTO/OVERWRITE TABLE tableName` and `CREATE TABLE tableName AS SELECT` for the data source API (partitioned tables are not supported).
In this PR, I am also adding the support of `IF NOT EXISTS` for our ddl parser. The current semantic of `IF NOT EXISTS` is explained as follows.
* For a `CREATE TEMPORARY TABLE` statement, it does not `IF NOT EXISTS` for now.
* For a `CREATE TABLE` statement (we are creating a metastore table), if there is an existing table having the same name ...
* when `IF NOT EXISTS` clause is used, we will do nothing.
* when `IF NOT EXISTS` clause is not used, the user will see an exception saying the table already exists.
TODOs:
- [x] CTAS support
- [x] Programmatic APIs
- [ ] Python API (another PR)
- [x] More unit tests
- [ ] Documents (another PR)
marmbrus liancheng rxin
Author: Yin Huai <yhuai@databricks.com>
Closes #4294 from yhuai/writeSupport and squashes the following commits:
3db1539 [Yin Huai] save does not take overwrite.
1c98881 [Yin Huai] Fix test.
142372a [Yin Huai] Merge remote-tracking branch 'upstream/master' into writeSupport
34e1bfb [Yin Huai] Address comments.
1682ca6 [Yin Huai] Better support for CTAS statements.
e789d64 [Yin Huai] For the Scala API, let users to use tuples to provide options.
0128065 [Yin Huai] Short hand versions of save and load.
66ebd74 [Yin Huai] Formatting.
9203ec2 [Yin Huai] Merge remote-tracking branch 'upstream/master' into writeSupport
e5d29f2 [Yin Huai] Programmatic APIs.
1a719a5 [Yin Huai] CREATE TEMPORARY TABLE with IF NOT EXISTS is not allowed for now.
909924f [Yin Huai] Add saveAsTable for the data source API to DataFrame.
95a7c71 [Yin Huai] Fix bug when handling IF NOT EXISTS clause in a CREATE TEMPORARY TABLE statement.
d37b19c [Yin Huai] Cheng's comments.
fd6758c [Yin Huai] Use BeforeAndAfterAll.
7880891 [Yin Huai] Support CREATE TABLE AS SELECT STATEMENT and the IF NOT EXISTS clause.
cb85b05 [Yin Huai] Initial write support.
2f91354 [Yin Huai] Make INSERT OVERWRITE/INTO statements consistent between HiveQL and SqlParser.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Python API for the Gaussian Mixture Model clustering algorithm in MLLib.
Author: FlytxtRnD <meethu.mathew@flytxt.com>
Closes #4059 from FlytxtRnD/PythonGmmWrapper and squashes the following commits:
c973ab3 [FlytxtRnD] Merge branch 'PythonGmmWrapper', remote-tracking branch 'upstream/master' into PythonGmmWrapper
339b09c [FlytxtRnD] Added MultivariateGaussian namedtuple and Arraybuffer in trainGaussianMixture
fa0a142 [FlytxtRnD] New line added
d5b36ab [FlytxtRnD] Changed argument names to lowercase
ac134f1 [FlytxtRnD] Merge branch 'PythonGmmWrapper' of https://github.com/FlytxtRnD/spark into PythonGmmWrapper
6671ea1 [FlytxtRnD] Added mllib/stat/distribution.py
3aee84b [FlytxtRnD] Fixed style issues
2e9f12a [FlytxtRnD] Added mllib/stat/distribution.py and fixed style issues
b22532c [FlytxtRnD] Merge branch 'PythonGmmWrapper', remote-tracking branch 'upstream/master' into PythonGmmWrapper
2e14d82 [FlytxtRnD] Incorporate MultivariateGaussian instances in GaussianMixtureModel
05767c7 [FlytxtRnD] Merge branch 'PythonGmmWrapper', remote-tracking branch 'upstream/master' into PythonGmmWrapper
3464d19 [FlytxtRnD] Merge branch 'PythonGmmWrapper', remote-tracking branch 'upstream/master' into PythonGmmWrapper
c1d4c71 [FlytxtRnD] Merge branch 'PythonGmmWrapper', remote-tracking branch 'origin/PythonGmmWrapper' into PythonGmmWrapper
426d130 [FlytxtRnD] Added random seed parameter
332bad1 [FlytxtRnD] Merge branch 'PythonGmmWrapper', remote-tracking branch 'upstream/master' into PythonGmmWrapper
f82750b [FlytxtRnD] Fixed style issues
5c83825 [FlytxtRnD] Split input file with space delimiter
fda60f3 [FlytxtRnD] Python API for Gaussian Mixture Model
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
.this was https://github.com/apache/spark/pull/2676
https://issues.apache.org/jira/browse/SPARK-3778
This affects if someone is trying to access secure hdfs something like:
val lines = {
val hconf = new Configuration()
hconf.set("mapred.input.dir", "mydir")
hconf.set("textinputformat.record.delimiter","\003432\n")
sc.newAPIHadoopRDD(hconf, classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
}
Author: Thomas Graves <tgraves@apache.org>
Closes #4292 from tgravescs/SPARK-3788 and squashes the following commits:
cf3b453 [Thomas Graves] newAPIHadoopRDD doesn't properly pass credentials for secure hdfs on yarn
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This adds support for streaming logistic regression with stochastic gradient descent, in the same manner as the existing implementation of streaming linear regression. It is a relatively simple addition because most of the work is already done by the abstract class `StreamingLinearAlgorithm` and existing algorithms and models from MLlib.
The PR includes
- Streaming Logistic Regression algorithm
- Unit tests for accuracy, streaming convergence, and streaming prediction
- An example use
cc mengxr tdas
Author: freeman <the.freeman.lab@gmail.com>
Closes #4306 from freeman-lab/streaming-logisitic-regression and squashes the following commits:
5c2c70b [freeman] Use Option on model
5cca2bc [freeman] Merge remote-tracking branch 'upstream/master' into streaming-logisitic-regression
275f8bd [freeman] Make private to mllib
3926e4e [freeman] Line formatting
5ee8694 [freeman] Experimental tag for docs
2fc68ac [freeman] Fix example formatting
85320b1 [freeman] Fixed line length
d88f717 [freeman] Remove stray comment
59d7ecb [freeman] Add streaming logistic regression
e78fe28 [freeman] Add streaming logistic regression example
321cc66 [freeman] Set private and protected within mllib
|
|
|
|
|
|
|
|
| |
Author: zsxwing <zsxwing@gmail.com>
Closes #4019 from zsxwing/SPARK-5219 and squashes the following commits:
36a8b4e [zsxwing] Add locks to avoid race conditions
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Fixes several minor formatting issues in the [Continuous Compilation] [1] section.
[1]: http://spark.apache.org/docs/latest/building-spark.html#continuous-compilation
<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/4316)
<!-- Reviewable:end -->
Author: Cheng Lian <lian@databricks.com>
Closes #4316 from liancheng/fix-build-instruction-docs and squashes the following commits:
0a92e01 [Cheng Lian] Fixes several formatting issues
|
|
|
|
|
|
|
|
|
|
|
|
| |
These are needed transitively from the other Jetty libraries
we include. It was not picked up by unit tests because we
disable the UI.
Author: Patrick Wendell <patrick@databricks.com>
Closes #4323 from pwendell/jetty and squashes the following commits:
d8669da [Patrick Wendell] SPARK-3996: Add jetty servlet and continuations.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
These are some changes to the build script to allow parts of it to be run independently. This has already been tested during the 1.2.1 release cycle.
Author: Patrick Wendell <patrick@databricks.com>
Author: Patrick Wendell <pwendell@gmail.com>
Closes #4319 from pwendell/release-updates and squashes the following commits:
dfe7ed9 [Patrick Wendell] Merge remote-tracking branch 'apache/master' into release-updates
478b072 [Patrick Wendell] Merge remote-tracking branch 'apache/master' into release-updates
126dd0c [Patrick Wendell] Allow decoupling Maven publishing from cutting release
|
|
|
|
|
|
|
|
|
|
|
| |
Simple PR to Remove unused import JsonUtil from from org.apache.spark.util.JsonProtocol.scala which fails builds with older versions of hadoop-core
This import is unused. It was introduced in PR #4029 https://github.com/apache/spark/pull/4029 as a part of JIRA SPARK-5231
Author: nemccarthy <nathan@nemccarthy.me>
Closes #4320 from nemccarthy/master and squashes the following commits:
8e34a11 [nemccarthy] [SPARK-5543][WebUI] Remove unused import JsonUtil from from org.apache.spark.util.JsonProtocol.scala which fails builds with older versions of hadoop-core
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This pull request contains a Spark SQL data source that can pull data from, and can put data into, a JDBC database.
I have tested both read and write support with H2, MySQL, and Postgres. It would surprise me if both read and write support worked flawlessly out-of-the-box for any other database; different databases have different names for different JDBC data types and different meanings for SQL types with the same name. However, this code is designed (see `DriverQuirks.scala`) to make it *relatively* painless to add support for another database by augmenting the type mapping contained in this PR.
Author: Tor Myklebust <tmyklebu@gmail.com>
Closes #4261 from tmyklebu/master and squashes the following commits:
cf167ce [Tor Myklebust] Work around other Java tests ruining TestSQLContext.
67893bf [Tor Myklebust] Move the jdbcRDD methods into SQLContext itself.
585f95b [Tor Myklebust] Dependencies go into the project's pom.xml.
829d5ba [Tor Myklebust] Merge branch 'master' of https://github.com/apache/spark
41647ef [Tor Myklebust] Hide a couple things that don't need to be public.
7318aea [Tor Myklebust] Fix scalastyle warnings.
a09eeac [Tor Myklebust] JDBC data source for Spark SQL.
176bb98 [Tor Myklebust] Add test deps for JDBC support.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
the PIC paper
As suggested by the paper of Power Iteration Clustering, it is useful to set the initial vector v0 as the degree vector d. This pr tries to add a running method for that.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes #4301 from viirya/pic_degreevector and squashes the following commits:
7db28fb [Liang-Chi Hsieh] Refactor it to address comments.
19cf94e [Liang-Chi Hsieh] Add an option to select initialization method.
ec88567 [Liang-Chi Hsieh] Run the PIC algorithm with degree vector d as suggected by the PIC paper.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This PR brings the Python API for Spark Streaming Kafka data source.
```
class KafkaUtils(__builtin__.object)
| Static methods defined here:
|
| createStream(ssc, zkQuorum, groupId, topics, storageLevel=StorageLevel(True, True, False, False,
2), keyDecoder=<function utf8_decoder>, valueDecoder=<function utf8_decoder>)
| Create an input stream that pulls messages from a Kafka Broker.
|
| :param ssc: StreamingContext object
| :param zkQuorum: Zookeeper quorum (hostname:port,hostname:port,..).
| :param groupId: The group id for this consumer.
| :param topics: Dict of (topic_name -> numPartitions) to consume.
| Each partition is consumed in its own thread.
| :param storageLevel: RDD storage level.
| :param keyDecoder: A function used to decode key
| :param valueDecoder: A function used to decode value
| :return: A DStream object
```
run the example:
```
bin/spark-submit --driver-class-path external/kafka-assembly/target/scala-*/spark-streaming-kafka-assembly-*.jar examples/src/main/python/streaming/kafka_wordcount.py localhost:2181 test
```
Author: Davies Liu <davies@databricks.com>
Author: Tathagata Das <tdas@databricks.com>
Closes #3715 from davies/kafka and squashes the following commits:
d93bfe0 [Davies Liu] Update make-distribution.sh
4280d04 [Davies Liu] address comments
e6d0427 [Davies Liu] Merge branch 'master' of github.com:apache/spark into kafka
f257071 [Davies Liu] add tests for null in RDD
23b039a [Davies Liu] address comments
9af51c4 [Davies Liu] Merge branch 'kafka' of github.com:davies/spark into kafka
a74da87 [Davies Liu] address comments
dc1eed0 [Davies Liu] Update kafka_wordcount.py
31e2317 [Davies Liu] Update kafka_wordcount.py
370ba61 [Davies Liu] Update kafka.py
97386b3 [Davies Liu] address comment
2c567a5 [Davies Liu] update logging and comment
33730d1 [Davies Liu] Merge branch 'master' of github.com:apache/spark into kafka
adeeb38 [Davies Liu] Merge pull request #3 from tdas/kafka-python-api
aea8953 [Tathagata Das] Kafka-assembly for Python API
eea16a7 [Davies Liu] refactor
f6ce899 [Davies Liu] add example and fix bugs
98c8d17 [Davies Liu] fix python style
5697a01 [Davies Liu] bypass decoder in scala
048dbe6 [Davies Liu] fix python style
75d485e [Davies Liu] add mqtt
07923c4 [Davies Liu] support kafka in Python
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
1. Throw UnsupportedOperationException if a Column is not computable.
2. Perform eager analysis on DataFrame so we can catch errors when they happen (not when an action is run).
Author: Reynold Xin <rxin@databricks.com>
Author: Davies Liu <davies@databricks.com>
Closes #4296 from rxin/col-computability and squashes the following commits:
6527b86 [Reynold Xin] Merge pull request #8 from davies/col-computability
fd92bc7 [Reynold Xin] Merge branch 'master' into col-computability
f79034c [Davies Liu] fix python tests
5afe1ff [Reynold Xin] Fix scala test.
17f6bae [Reynold Xin] Various fixes.
b932e86 [Reynold Xin] Added eager analysis for error reporting.
e6f00b8 [Reynold Xin] [SQL][API] ComputableColumn vs IncomputableColumn
|