Parallel Prediction

Overview

Tutorial: 20 Minutes

Objectives:

Learn how to do parallel prediction in Dask.

ParallelPostFit in Dask-ML wraps around a scikit-learn model, enabling parallel prediction and transformation operations. While the model’s training step remains on a single machine, the predict and transform methods are executed in parallel using Dask, which is especially useful for large datasets that exceed a single machine’s memory.
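To make the idea of chunk-wise prediction concrete, the sketch below applies an already fitted scikit-learn model to each chunk of a Dask array by hand, using dask.array's map_blocks. It is only an illustration of the general pattern, not a description of Dask-ML's internals. The array sizes, the chunking, and the names X_big_demo and lazy_pred are assumptions made for this example.

```python
import numpy as np
import dask.array as da
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Train an ordinary scikit-learn model in memory on a small dataset.
X, y = make_classification(n_samples=1000, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Hypothetical larger input: 10,000 rows split into 10 chunks of 1,000 rows.
rng = np.random.default_rng(0)
X_big_demo = da.from_array(
    rng.normal(size=(10000, X.shape[1])), chunks=(1000, X.shape[1])
)

# Apply model.predict to every chunk independently. drop_axis=1 tells Dask
# that each 2-D block (rows, features) becomes a 1-D block (rows,).
lazy_pred = X_big_demo.map_blocks(
    model.predict,
    dtype=y.dtype,
    drop_axis=1,
    meta=np.array((), dtype=y.dtype),
)

chunked = lazy_pred.compute()                 # runs the per-chunk tasks
direct = model.predict(X_big_demo.compute())  # single in-memory call

print(lazy_pred.numblocks)
print(np.array_equal(chunked, direct))
```

Because each row is predicted independently of the others, the chunk-wise result matches a single in-memory call, which is what makes this step easy to parallelize. ParallelPostFit packages this pattern behind the usual predict and transform methods.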

Here we first generate a small training dataset

from sklearn.ensemble import GradientBoostingClassifier
import sklearn.datasets
import dask_ml.datasets
from dask_ml.wrappers import ParallelPostFit

X, y = sklearn.datasets.make_classification(n_samples=1000, random_state=0)

and then wrap the GradientBoostingClassifier model using ParallelPostFit.

clf = ParallelPostFit(estimator=GradientBoostingClassifier())
clf.fit(X, y)

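The training occurs on a single node, while the prediction is split by chunk and distributed across the cluster (or across local threads when no cluster is attached). The result is a lazy Dask array: no prediction is actually computed until you call .compute() on it.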

X_big, _ = dask_ml.datasets.make_classification(n_samples=100000, chunks=10000, random_state=0)
clf.predict(X_big)

Key Points

ParallelPostFit can be used to parallelize prediction and transformation across a cluster; the training step itself is not parallelized.