spark - Mirror of Apache Spark

	Commit message (Collapse)	Author	Age	Files	Lines
*	SPARK-1637: Clean up examples for 1.0	Sandeep	2014-05-06	10	-574/+0
\|	- [x] Move all of them into subpackages of org.apache.spark.examples (right now some are in org.apache.spark.streaming.examples, for instance, and others are in org.apache.spark.examples.mllib)
\|	- [x] Move Python examples into examples/src/main/python
\|	- [x] Update docs to reflect these changes
\|
\|	Author: Sandeep <sandeep@techaddict.me>
\|
\|	This patch had conflicts when merged, resolved by
\|	Committer: Matei Zaharia <matei@databricks.com>
\|
\|	Closes #571 from techaddict/SPARK-1637 and squashes the following commits:
\|
\|	47ef86c [Sandeep] Changes based on Discussions on PR, removing use of RawTextHelper from examples
\|	8ed2d3f [Sandeep] Docs Updated for changes, Change for java examples
\|	5f96121 [Sandeep] Move Python examples into examples/src/main/python
\|	0a8dd77 [Sandeep] Move all Scala Examples to org.apache.spark.examples (some are in org.apache.spark.streaming.examples, for instance, and others are in org.apache.spark.examples.mllib)
*	[WIP] SPARK-1430: Support sparse data in Python MLlib	Matei Zaharia	2014-04-15	4	-6/+107
\|	This PR adds a SparseVector class in PySpark and updates all the regression, classification and clustering algorithms and models to support sparse data, similar to MLlib. I chose to add this class because SciPy is quite difficult to install in many environments (more so than NumPy), but I plan to add support for SciPy sparse vectors later too, and make the methods work transparently on objects of either type.
\|
\|	On the Scala side, we keep Python sparse vectors sparse and pass them to MLlib. We always return dense vectors from our models.
\|
\|	Some to-do items left:
\|	- [x] Support SciPy's scipy.sparse matrix objects when SciPy is available. We can easily add a function to convert these to our own SparseVector.
\|	- [x] MLlib currently uses a vector with one extra column on the left to represent what we call LabeledPoint in Scala. Do we really want this? It may get annoying once you deal with sparse data since you must add/subtract 1 to each feature index when training. We can remove this API in 1.0 and use tuples for labeling.
\|	- [x] Explain how to use these in the Python MLlib docs.
\|
\|	CC @mengxr, @joshrosen
\|
\|	Author: Matei Zaharia <matei@databricks.com>
\|
\|	Closes #341 from mateiz/py-ml-update and squashes the following commits:
\|
\|	d52e763 [Matei Zaharia] Remove no-longer-needed slice code and handle review comments
\|	ea5a25a [Matei Zaharia] Fix remaining uses of copyto() after merge
\|	b9f97a3 [Matei Zaharia] Fix test
\|	1e1bd0f [Matei Zaharia] Add MLlib logistic regression example in Python
\|	88bc01f [Matei Zaharia] Clean up inheritance of LinearModel in Python, and expose its parametrs
\|	37ab747 [Matei Zaharia] Fix some examples and docs due to changes in MLlib API
\|	da0f27e [Matei Zaharia] Added a MLlib K-means example and updated docs to discuss sparse data
\|	c48e85a [Matei Zaharia] Added some tests for passing lists as input, and added mllib/tests.py to run-tests script.
\|	a07ba10 [Matei Zaharia] Fix some typos and calculation of initial weights
\|	74eefe7 [Matei Zaharia] Added LabeledPoint class in Python
\|	889dde8 [Matei Zaharia] Support scipy.sparse matrices in all our algorithms and models
\|	ab244d1 [Matei Zaharia] Allow SparseVectors to be initialized using a dict
\|	a5d6426 [Matei Zaharia] Add linalg.py to run-tests script
\|	0e7a3d8 [Matei Zaharia] Keep vectors sparse in Java when reading LabeledPoints
\|	eaee759 [Matei Zaharia] Update regression, classification and clustering models for sparse data
\|	2abbb44 [Matei Zaharia] Further work to get linear models working with sparse data
\|	154f45d [Matei Zaharia] Update docs, name some magic values
\|	881fef7 [Matei Zaharia] Added a sparse vector in Python and made Java-Python format more compact
*	Merge pull request #562 from jyotiska/master. Closes #562.	jyotiska	2014-02-08	1	-0/+36
\|	Added example Python code for sort
\|
\|	I added an example Python code for sort. Right now, PySpark has limited examples for new people willing to use the project. This example code sorts integers stored in a file. I was able to sort 5 million, 10 million and 25 million integers with this code.
\|
\|	Author: jyotiska <jyotiska123@gmail.com>
\|
\|	== Merge branch commits ==
\|
\|	commit 8ad8faf6c8e02ae1cd68565d98524edf165f54df
\|	Author: jyotiska <jyotiska123@gmail.com>
\|	Date: Sun Feb 9 11:00:41 2014 +0530
\|	Added comments in code on collect() method
\|
\|	commit 6f98f1e313f4472a7c2207d36c4f0fbcebc95a8c
\|	Author: jyotiska <jyotiska123@gmail.com>
\|	Date: Sat Feb 8 13:12:37 2014 +0530
\|	Updated python example code sort.py
\|
\|	commit 945e39a5d68daa7e5bab0d96cbd35d7c4b04eafb
\|	Author: jyotiska <jyotiska123@gmail.com>
\|	Date: Sat Feb 8 12:59:09 2014 +0530
\|	Added example python code for sort
*	Add banner to PySpark and make wordcount output nicer	Matei Zaharia	2013-09-01	1	-1/+1
\|
*	Merge pull request #802 from stayhf/SPARK-760-Python	Matei Zaharia	2013-08-12	1	-0/+70
\|\
\| \|	Simple PageRank algorithm implementation in Python for SPARK-760
\| *	Code update for Matei's suggestions	stayhf	2013-08-11	1	-7/+9
\| \|
\| *	Simple PageRank algorithm implementation in Python for SPARK-760	stayhf	2013-08-10	1	-0/+68
\| \|
* \|	Fix string parsing and style in LR	Matei Zaharia	2013-07-31	1	-1/+1
\| \|
* \|	Update the Python logistic regression example to read from a file and	Matei Zaharia	2013-07-29	1	-27/+26
\| \|	batch input records for more efficient NumPy computations
* \|	Some fixes to Python examples (style and package name for LR)	Matei Zaharia	2013-07-27	6	-15/+9
\|/
*	Add Apache license headers and LICENSE and NOTICE files	Matei Zaharia	2013-07-16	6	-0/+102
\|
*	Fix argv handling in Python transitive closure example	Jey Kottalam	2013-04-02	1	-1/+1
\|
*	Minor formatting fixes	Matei Zaharia	2013-01-20	1	-1/+1
\|
*	Python ALS example	Nick Pentreath	2013-01-15	1	-0/+71
\|
*	Use take() instead of takeSample() in PySpark kmeans example.	Josh Rosen	2013-01-09	1	-1/+3
\|	This is a temporary change until we port takeSample().
*	Rename top-level 'pyspark' directory to 'python'	Josh Rosen	2013-01-01	5	-0/+199