Getting Started with PySpark

MLeap PySpark integration provides serialization of PySpark-trained ML pipelines to MLeap Bundles. MLeap also provides several extensions to Spark, including an enhanced one-hot encoder and one-vs-rest models. Unlike the MLeap-Spark (Scala) integration, MLeap does not yet provide PySpark integration for the Spark Extensions transformers.

Adding MLeap Spark to Your Project

Before adding MLeap PySpark to your project, you first have to compile and add MLeap Spark.

MLeap PySpark is available in the python package of the combust/mleap GitHub repository.

To add MLeap to your PySpark project, clone the git repository, add the mleap/python directory to your Python path, and import mleap.pyspark:

git clone git@github.com:combust/mleap.git

Then, in your Python environment, do:

# Make the MLeap Python package importable
import sys
sys.path.append('<git directory>/mleap/python')

# mleap.pyspark must be imported before any other PySpark modules
import mleap.pyspark

Note: the import of mleap.pyspark needs to happen before any other PySpark libraries are imported.
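For example, the following import order satisfies that constraint. The SimpleSparkSerializer import is a sketch based on MLeap's documented usage; treat the exact module path as an assumption for your installed version:

import mleap.pyspark
from mleap.pyspark.spark_support import SimpleSparkSerializer  # assumed path, per MLeap docs

# PySpark modules are only imported after mleap.pyspark
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline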

Note: If you are working from a notebook environment, be sure to take a look at the instructions for setting up MLeap PySpark in your notebook of choice.

Using PIP

Alternatively, MLeap PySpark is available via pip from https://pypi.python.org/pypi/mleap.
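Assuming a standard pip setup, installation is a single command (the package name follows from the PyPI URL above):

pip install mleap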

To use MLeap extensions to PySpark:

  1. See the build instructions to build MLeap from source.
  2. See core concepts for an overview of ML pipelines.
  3. See the Spark documentation to learn how to train ML pipelines in Spark.
  4. See the Demo notebook for an example of using PySpark and MLeap to serialize a pipeline to Bundle.ml (a minimal sketch follows below).
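
As a companion to the demo notebook, here is a minimal sketch of serializing a fitted PySpark pipeline to an MLeap Bundle. The pipeline stages, column names, and output path are illustrative, not prescribed by MLeap; the serializeToBundle call follows the pattern shown in MLeap's documentation:

# mleap.pyspark first, then PySpark modules
import mleap.pyspark
from mleap.pyspark.spark_support import SimpleSparkSerializer  # assumed path, per MLeap docs

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler

spark = SparkSession.builder.appName("mleap-pyspark-example").getOrCreate()

# Toy data; replace with your own DataFrame
df = spark.createDataFrame([(1.0,), (2.0,), (3.0,)], ["raw_feature"])

# A two-stage pipeline: assemble columns into a vector, then scale it
assembler = VectorAssembler(inputCols=["raw_feature"], outputCol="features")
scaler = StandardScaler(inputCol="features", outputCol="scaled_features")
pipeline = Pipeline(stages=[assembler, scaler])
fitted = pipeline.fit(df)

# Importing mleap.pyspark adds serializeToBundle to fitted transformers;
# the jar:file: URI writes the bundle as a zip archive
fitted.serializeToBundle("jar:file:/tmp/pyspark.example.zip", fitted.transform(df))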
