
Introduction to AutoML: Automating Machine Learning Workflows


AutoML is a tool designed for both technical and non-technical users. It simplifies the process of training machine learning models: you provide the dataset, and it returns the best-performing model for your use case. You don't have to code for long hours or experiment with various techniques; it does everything for you on its own.

In this tutorial, we will learn about AutoML and TPOT, a Python AutoML tool for building machine learning pipelines. We will also learn to build a machine learning classifier, save the model, and use it for model inference.

What is AutoML?

AutoML, or Automated Machine Learning, takes a dataset you provide and handles all the work on the back end needed to produce a high-performing machine learning model. AutoML performs tasks such as data preprocessing, feature selection, model selection, hyperparameter tuning, model ensembling, and model evaluation. Even a non-technical user can build a highly complex machine learning model using AutoML tools.

By using advanced machine learning algorithms and techniques, AutoML systems can automatically discover the best models and configurations for a given dataset, thus reducing the time and effort required to develop machine learning models.

1. Getting Started with TPOT

TPOT (Tree-based Pipeline Optimization Tool) is one of the simplest and most popular AutoML tools. It uses genetic programming to optimize machine learning pipelines, automatically exploring hundreds of candidate pipelines to identify the most effective model for a given dataset.

You can install TPOT using the following command on your system. 

!pip install tpot==0.12.2
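If you want to confirm the installation before moving on, a quick version check is enough. This is a minimal optional sketch, not part of the original tutorial:

import tpot
# Print the installed version to confirm the pinned release (0.12.2) was picked up
print(tpot.__version__)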

Load the necessary Python libraries to load and process the data and train the classification model. 

import numpy as np

import pandas as pd

from tpot import TPOTClassifier

from sklearn.model_selection import train_test_split

from sklearn.datasets import load_breast_cancer

2. Loading the Data

For this tutorial, we are using the Mushroom Dataset from Kaggle, which contains 9 features for determining whether a mushroom is poisonous or not.

We will load the dataset using Pandas and randomly select 1,000 samples from it.

data = pd.read_csv('mushroom_cleaned.csv')

data = data.sample(n=1000, random_state=55)

data.head()
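Before training, it can also help to glance at the class balance of the 1,000-row sample. This is an optional check, not part of the original workflow; it only uses the “class” column described in the next step:

# Count how many samples fall into each class (0 = non-poisonous, 1 = poisonous)
print(data['class'].value_counts())
# Check column types and confirm there are no missing values in the sample
data.info()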

(Output: the first five rows of the sampled mushroom dataset.)

3. Data Processing

The “class” column is our target variable. It contains two values, 0 and 1, where 0 means non-poisonous and 1 means poisonous. We will use it to create the independent and dependent variables, and then split them into train and test datasets.

X = data.drop('class', axis=1)

y = data['class'].values

# Split the dataset into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=55)
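If the 1,000-sample subset turns out to be imbalanced, you may want the split to preserve the class ratio. The stratify argument below is standard scikit-learn but is not used in the original tutorial, so treat this as an optional variant:

# Optional: a stratified split keeps the 0/1 ratio the same in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=55, stratify=y
)
# Sanity-check the resulting shapes
print(X_train.shape, X_test.shape)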

4. Building and Fitting TPOT Classifier

We will initialize the TPOT classifier and train it on the training set. The model will experiment with various models and techniques and return the best-performing model and pipeline.

# Initialize TPOTClassifier

tpot = TPOTClassifier(verbosity=2, generations=5, population_size=20, random_state=55)

# Fit the classifier to the training data

tpot.fit(X_train, y_train)
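Once fitting finishes, the winning pipeline is stored on the fitted object as fitted_pipeline_, a documented TPOT attribute. Printing it shows which preprocessing steps and estimators the genetic search settled on:

# Inspect the best scikit-learn pipeline found by the genetic search
print(tpot.fitted_pipeline_)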

TPOT reports the best cross-validation score for each generation and, at the end, the best pipeline it found.

(Output: per-generation cross-validation scores and the best pipeline found by TPOT.)

Let’s evaluate our best pipeline on the test dataset by using the .score function.

# Evaluate the model on the test set

print(tpot.score(X_test, y_test))

I think we have a pretty stable and accurate model. 
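Accuracy alone can hide class-specific errors. If you want a fuller picture, standard scikit-learn metrics work directly on the TPOT object's predictions; this is an optional check, not part of the original tutorial:

from sklearn.metrics import classification_report

# Per-class precision, recall, and F1 on the held-out test set
y_pred = tpot.predict(X_test)
print(classification_report(y_test, y_pred))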

5. Saving the TPOT Pipeline and Model

To save the TPOT pipeline, we will use the .export function and provide it with the file name and .py extension. 

tpot.export('tpot_mashroom_pipeline.py')

The file will be saved as a Python script containing the code for the best pipeline. In order to run the pipeline, you have to update the dataset path, the column separator, and the target column name; an adapted version for our mushroom data is shown after the script.

tpot_mashroom_pipeline.py:


import numpy as np

import pandas as pd

from sklearn.ensemble import ExtraTreesClassifier

from sklearn.feature_selection import SelectFromModel

from sklearn.model_selection import train_test_split

from sklearn.pipeline import make_pipeline

from tpot.export_utils import set_param_recursive

# NOTE: Make sure that the outcome column is labeled 'target' in the data file

tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)

features = tpot_data.drop('target', axis=1)

training_features, testing_features, training_target, testing_target = \

            train_test_split(features, tpot_data['target'], random_state=55)

# Average CV score on the training set was: 0.8800000000000001

exported_pipeline = make_pipeline(

    SelectFromModel(estimator=ExtraTreesClassifier(criterion="entropy", max_features=0.9000000000000001, n_estimators=100), threshold=0.1),

    ExtraTreesClassifier(bootstrap=False, criterion="gini", max_features=0.9500000000000001, min_samples_leaf=4, min_samples_split=2, n_estimators=100)

)

# Fix random state for all the steps in exported pipeline

set_param_recursive(exported_pipeline.steps, 'random_state', 55)

exported_pipeline.fit(training_features, training_target)

results = exported_pipeline.predict(testing_features)
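For the mushroom data used in this tutorial, those edits would look roughly like the following. This is a sketch that assumes the same mushroom_cleaned.csv file; since the exported script expects the label column to be named 'target', we rename 'class' accordingly, and the default comma separator makes the sep argument unnecessary:

# Adapt the exported script's data-loading section to mushroom_cleaned.csv
tpot_data = pd.read_csv('mushroom_cleaned.csv', dtype=np.float64)
# The exported pipeline expects the label column to be called 'target'
tpot_data = tpot_data.rename(columns={'class': 'target'})
features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'], random_state=55)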

You can also save the model as a pickle file using the joblib library. This file stores the fitted pipeline object, which can be loaded later to run model inference.

import joblib

joblib.dump(tpot.fitted_pipeline_, 'tpot_mashroom_pipeline.pkl')

6. Loading the TPOT Pipeline and Model Inference

We will load the saved model using the joblib.load function and predict the first 10 samples from the test dataset.

model = joblib.load('tpot_mashroom_pipeline.pkl')

print(y_test[0:10])

print(model.predict(X_test[0:10]))

Our model is accurate: the predicted labels match the actual labels.

[1 1 1 1 1 1 0 1 0 1]

[1 1 1 1 1 1 0 1 0 1]

Summary

In this tutorial, we have learned about AutoML and how it can be used by anyone, even non-technical users. We have also learned to use TPOT, an AutoML Python tool that automatically performs data processing, feature selection, model selection, hyperparameter tuning, model ensembling, and model evaluation. At the end of model training, we get the best-performing model and the pipeline by running two lines of code. We can even save the model and use it to build an AI application.

Source: machinelearningmastery.com
