Parallel Prediction

Overview

  • Tutorial: 20 Minutes

    Objectives:
    • Learn how to do parallel prediction in Dask.

ParallelPostFit in Dask-ML wraps around a scikit-learn model, enabling parallel prediction and transformation operations. While the model’s training step remains on a single machine, the predict and transform methods are executed in parallel using Dask, which is especially useful for large datasets that exceed a single machine’s memory.
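To make the idea concrete, the following is a minimal sketch of block-wise prediction written directly with dask.array.map_blocks, assuming dask and scikit-learn are installed. It is not the Dask-ML implementation, only an illustration of the same pattern: fit in memory, then apply the fitted model's predict to each chunk independently. The LogisticRegression model, the random input array, and all variable names are assumptions chosen for illustration.

```python
import dask.array as da
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Train an ordinary scikit-learn model in memory on a small dataset
# (make_classification produces 20 features by default).
X_small, y_small = make_classification(n_samples=1000, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_small, y_small)

# A larger, chunked input: 10 blocks of 10,000 rows, 20 features each.
# Random values are used here purely as placeholder data.
X_large = da.random.random((100_000, 20), chunks=(10_000, 20))

# Apply the fitted model's predict to every block independently.
# drop_axis=1 tells Dask the feature axis is absent from the output;
# meta describes the output type so Dask does not need to infer it.
pred = X_large.map_blocks(
    model.predict,
    dtype=y_small.dtype,
    drop_axis=1,
    meta=np.empty((0,), dtype=y_small.dtype),
)

# Nothing has been computed yet; compute() runs the blocks in parallel.
labels = pred.compute()
print(labels.shape)
```

Each block is predicted without reference to any other block, which is why this step parallelizes cleanly while training does not.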

Here we first generate a small training dataset

from sklearn.ensemble import GradientBoostingClassifier
import sklearn.datasets
import dask_ml.datasets
from dask_ml.wrappers import ParallelPostFit

X, y = sklearn.datasets.make_classification(n_samples=1000, random_state=0)

and then wrap the GradientBoostingClassifier model using ParallelPostFit.

clf = ParallelPostFit(estimator=GradientBoostingClassifier())
clf.fit(X, y)

Training occurs on a single node, while prediction is distributed across the cluster. Given a Dask array as input, predict returns a lazy Dask array.

X_big, _ = dask_ml.datasets.make_classification(n_samples=100000, chunks=10000, random_state=0)
clf.predict(X_big)

Key Points

  • ParallelPostFit can be used to parallelize prediction across a cluster.