Parallel Prediction
Overview
Tutorial: 20 Minutes
- Objectives:
Learn how to do parallel prediction in Dask.
ParallelPostFit in Dask-ML wraps around a scikit-learn model, enabling parallel prediction and transformation operations. While the model’s training step remains on a single machine, the predict and transform methods are executed in parallel using Dask, which is especially useful for large datasets that exceed a single machine’s memory.
Here we a first generate a small training data
1from sklearn.ensemble import GradientBoostingClassifier
2import sklearn.datasets
3import dask_ml.datasets
4from dask_ml.wrappers import ParallelPostFit
5
6X, y = sklearn.datasets.make_classification(n_samples=1000, random_state=0)
and then wrap the GradientBoostingClassifier model using ParallelPostFit.
1clf = ParallelPostFit(estimator=GradientBoostingClassifier())
2clf.fit(X, y)
The training occurs on a single node, while the prediction is distributed across the cluster, returning a Dask array.
1X_big, _ = dask_ml.datasets.make_classification(n_samples=100000, chunks=10000, random_state=0)
2clf.predict(X_big)
Key Points
ParallelPostFit can be used to parallelize prediction across a cluster.