Getting Started with PySpark
MLeap PySpark integration provides serialization of PySpark-trained ML pipelines to MLeap Bundles. MLeap also provides several extensions to Spark, including an enhanced one-hot encoder and one-vs-rest models. Unlike the MLeap<>Spark integration, MLeap does not yet provide PySpark integration for the Spark Extensions transformers.
Adding MLeap PySpark to Your Project
Before adding MLeap PySpark to your project, you first have to compile and add MLeap Spark.
MLeap PySpark is available in the combust/mleap GitHub repository, in the python package.
To add MLeap to your PySpark project, clone the git repository, add the mleap/python path, and import mleap.pyspark:
git clone git@github.com:combust/mleap.git
Then, in your Python environment, do:
import sys
sys.path.append('<git directory>/mleap/python')
import mleap.pyspark
Note: the import of mleap.pyspark needs to happen before any other PySpark libraries are imported.
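Once mleap.pyspark has been imported, fitted Spark pipelines gain a serializeToBundle method. The following is a minimal sketch of the end-to-end workflow; the toy data, column names, app name, and output path are illustrative only:

import mleap.pyspark
from mleap.pyspark.spark_support import SimpleSparkSerializer

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mleap-example").getOrCreate()

# Toy DataFrame; substitute your own training data
df = spark.createDataFrame([("a",), ("b",), ("a",)], ["category"])

# Fit a trivial one-stage pipeline
pipeline = Pipeline(stages=[StringIndexer(inputCol="category", outputCol="category_index")])
model = pipeline.fit(df)

# Serialize the fitted pipeline to an MLeap Bundle (a zip file),
# passing a transformed DataFrame so MLeap can capture the schema
model.serializeToBundle("jar:file:/tmp/pyspark.example.zip", model.transform(df))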
Note: If you are working from a notebook environment, be sure to take a look at the instructions for setting up MLeap PySpark in that environment.
Using PIP
Alternatively, MLeap PySpark is available via pip: https://pypi.python.org/pypi/mleap.
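For example:

pip install mleap

After installation, import mleap.pyspark works without the sys.path changes above; the import-order note still applies.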
To use MLeap extensions to PySpark:
- See build instructions to build MLeap from source.
- See core concepts for an overview of ML pipelines.
- See Spark documentation to learn how to train ML pipelines in Spark.
- See the demo notebook for how to use PySpark and MLeap to serialize your pipeline to Bundle.ML; a quick round-trip sketch follows below.
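As a complement to the demo notebook, a Bundle written by serializeToBundle can be loaded back into a Spark PipelineModel. A minimal sketch, assuming the Bundle produced in the example above (the path is illustrative):

import mleap.pyspark
from mleap.pyspark.spark_support import SimpleSparkSerializer

from pyspark.ml import PipelineModel

# deserializeFromBundle is attached to PipelineModel by the mleap.pyspark import
model = PipelineModel.deserializeFromBundle("jar:file:/tmp/pyspark.example.zip")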