Distributed Learning
----------------------

.. admonition:: Overview
   :class: Overview

    * **Tutorial:** 20 Minutes

        **Objectives:**
            - Learn how to do distributed learning in Dask.

Distributed learning in Dask splits both the data and computation across multiple machines or nodes in a cluster, leveraging parallelism for faster 
training. Incremental learning focuses on processing data in smaller chunks sequentially, updating the model after each chunk, and does not 
require the whole dataset in memory. While Distributed learning distributes the entire dataset across a cluster of workers, where each worker 
processes a portion of the data in parallel. 

Dask distributes the data and computation across the cluster, where each worker processes its portion of the data in parallel.
After each worker processes its chunk of the data, the results are aggregated to update the model. The model's parameters are synchronized across 
the workers, and training happens concurrently across the distributed setup.

..  code-block:: python
    :linenos:

    from dask.distributed import Client
    from dask_ml.datasets import make_classification
    from dask_ml.model_selection import train_test_split
    from sklearn.linear_model import SGDClassifier
    from dask_ml.model_selection import GridSearchCV
    from scipy.stats import uniform, loguniform

    # Step 1: Start a Dask client
    client = Client()

    # Step 2: Create a synthetic classification dataset with Dask
    X, y = make_classification(n_samples=5000, n_features=20, chunks=25, random_state=0)

    # Step 3: Split the dataset into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y)

    # Step 4: Initialize the estimator (SGDClassifier)
    estimator = SGDClassifier(random_state=10, max_iter=100)

    # Step 5: Define the parameter grid for hyperparameter tuning
    params = {
        'alpha': loguniform(1e-5, 1e-1),
        'l1_ratio': uniform(0, 1)
    }

    # Step 6: Set up GridSearchCV using Dask-ML to distribute hyperparameter search
    search = GridSearchCV(estimator, params, cv=3)

    # Step 7: Fit the model (this will distribute the computation across the Dask cluster)
    search.fit(X_train, y_train)

    # Step 8: Output the best hyperparameters and best score
    print("Best parameters:", search.best_params_)
    print("Best score:", search.best_score_)

    # Step 9: Evaluate the model on the test set
    test_score = search.score(X_test, y_test)
    print("Test score:", test_score)

    # Close the Dask client when done
    client.close()


.. admonition:: Key Points
   :class: hint

        - Distributed learning can help with large dataset in Dask..