
Introduction to AutoML: Automating Machine Learning Workflows


AutoML is a tool designed for both technical and non-technical users. It simplifies the process of training machine learning models: you provide the dataset, and it returns the best-performing model for your use case. You don't have to code for long hours or experiment with various techniques; it does everything for you on its own.

In this tutorial, we will learn about AutoML and TPOT, a Python AutoML tool for building machine learning pipelines. We will also learn to build a machine learning classifier, save the model, and use it for model inference.

What is AutoML?

AutoML, or Automated Machine Learning, takes a dataset you provide and handles all the work on the back end needed to produce a high-performing machine learning model. AutoML performs tasks such as data preprocessing, feature selection, model selection, hyperparameter tuning, model ensembling, and model evaluation. Even a non-technical user can build a highly complex machine learning model using AutoML tools.

By using advanced machine learning algorithms and techniques, AutoML systems can automatically discover the best models and configurations for a given dataset, thus reducing the time and effort required to develop machine learning models.

1. Getting Started with TPOT

TPOT (Tree-based Pipeline Optimization Tool) is one of the simplest and most popular AutoML tools. It uses genetic programming to optimize machine learning pipelines, automatically exploring hundreds of candidate pipelines to identify the most effective model for a given dataset.

You can install TPOT using the following command on your system. 

!pip install tpot==0.12.2
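If you want to confirm the installation before moving on, a quick version check is enough. This is a minimal optional sketch, not part of the original tutorial:

import tpot
# Print the installed version to confirm the pinned release (0.12.2) was picked up
print(tpot.__version__)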

Load the necessary Python libraries to load and process the data and train the classification model. 

import numpy as np

import pandas as pd

from tpot import TPOTClassifier

from sklearn.model_selection import train_test_split

from sklearn.datasets import load_breast_cancer

2. Loading the Data

For this tutorial, we are using the Mushroom Dataset from Kaggle, which contains 9 features for determining whether a mushroom is poisonous or not.

We will load the dataset using Pandas and randomly select 1,000 samples from it.

data = pd.read_csv('mushroom_cleaned.csv')

data = data.sample(n=1000, random_state=55)

data.head()
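Before training, it can also help to glance at the class balance of the 1,000-row sample. This is an optional check, not part of the original workflow; it only uses the “class” column described in the next step:

# Count how many samples fall into each class (0 = non-poisonous, 1 = poisonous)
print(data['class'].value_counts())
# Check column types and confirm there are no missing values in the sample
data.info()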

(Output: the first five rows of the sampled mushroom dataset.)

3. Data Processing

The “class” column is our target variable. It contains two values, 0 and 1, where 0 means non-poisonous and 1 means poisonous. We will use it to create the independent and dependent variables, and then split them into train and test datasets.

X = data.drop('class', axis=1)

y = data['class'].values

# Split the dataset into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=55)
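If the 1,000-sample subset turns out to be imbalanced, you may want the split to preserve the class ratio. The stratify argument below is standard scikit-learn but is not used in the original tutorial, so treat this as an optional variant:

# Optional: a stratified split keeps the 0/1 ratio the same in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=55, stratify=y
)
# Sanity-check the resulting shapes
print(X_train.shape, X_test.shape)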

4. Building and Fitting TPOT Classifier

We will initialize the TPOT classifier and train it on the training set. The model will experiment with various models and techniques and return the best-performing model and pipeline.

# Initialize TPOTClassifier

tpot = TPOTClassifier(verbosity=2, generations=5, population_size=20, random_state=55)

# Fit the classifier to the training data

tpot.fit(X_train, y_train)
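Once fitting finishes, the winning pipeline is stored on the fitted object as fitted_pipeline_, a documented TPOT attribute. Printing it shows which preprocessing steps and estimators the genetic search settled on:

# Inspect the best scikit-learn pipeline found by the genetic search
print(tpot.fitted_pipeline_)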

TPOT reports the best cross-validation score for each generation and, at the end, the best pipeline it found.

(Output: per-generation cross-validation scores and the best pipeline found by TPOT.)

Let’s evaluate our best pipeline on the test dataset by using the .score function.

# Evaluate the model on the test set

print(tpot.score(X_test, y_test))

I think we have a pretty stable and accurate model. 
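Accuracy alone can hide class-specific errors. If you want a fuller picture, standard scikit-learn metrics work directly on the TPOT object's predictions; this is an optional check, not part of the original tutorial:

from sklearn.metrics import classification_report

# Per-class precision, recall, and F1 on the held-out test set
y_pred = tpot.predict(X_test)
print(classification_report(y_test, y_pred))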

5. Saving the TPOT Pipeline and Model

To save the TPOT pipeline, we will use the .export function and provide it with the file name and .py extension. 

tpot.export('tpot_mashroom_pipeline.py')

The file will be saved as a Python script containing the code for the best pipeline. In order to run the pipeline, you have to update the dataset path, the column separator, and the target column name; an adapted version for our mushroom data is shown after the script.

tpot_mashroom_pipeline.py:


import numpy as np

import pandas as pd

from sklearn.ensemble import ExtraTreesClassifier

from sklearn.feature_selection import SelectFromModel

from sklearn.model_selection import train_test_split

from sklearn.pipeline import make_pipeline

from tpot.export_utils import set_param_recursive

# NOTE: Make sure that the outcome column is labeled 'target' in the data file

tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)

features = tpot_data.drop('target', axis=1)

training_features, testing_features, training_target, testing_target = \

            train_test_split(features, tpot_data['target'], random_state=55)

# Average CV score on the training set was: 0.8800000000000001

exported_pipeline = make_pipeline(

    SelectFromModel(estimator=ExtraTreesClassifier(criterion="entropy", max_features=0.9000000000000001, n_estimators=100), threshold=0.1),

    ExtraTreesClassifier(bootstrap=False, criterion="gini", max_features=0.9500000000000001, min_samples_leaf=4, min_samples_split=2, n_estimators=100)

)

# Fix random state for all the steps in exported pipeline

set_param_recursive(exported_pipeline.steps, 'random_state', 55)

exported_pipeline.fit(training_features, training_target)

results = exported_pipeline.predict(testing_features)
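For the mushroom data used in this tutorial, those edits would look roughly like the following. This is a sketch that assumes the same mushroom_cleaned.csv file; since the exported script expects the label column to be named 'target', we rename 'class' accordingly, and the default comma separator makes the sep argument unnecessary:

# Adapt the exported script's data-loading section to mushroom_cleaned.csv
tpot_data = pd.read_csv('mushroom_cleaned.csv', dtype=np.float64)
# The exported pipeline expects the label column to be called 'target'
tpot_data = tpot_data.rename(columns={'class': 'target'})
features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'], random_state=55)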

You can also save the model as a pickle file using the joblib library. This file stores the fitted pipeline object, which can be loaded later to run model inference.

import joblib

joblib.dump(tpot.fitted_pipeline_, 'tpot_mashroom_pipeline.pkl')

6. Loading the TPOT Pipeline and Model Inference

We will load the saved model using the joblib.load function and predict the first 10 samples from the test dataset.

model = joblib.load('tpot_mashroom_pipeline.pkl')

print(y_test[0:10])

print(model.predict(X_test[0:10]))

Our model is accurate: the predicted labels match the actual labels.

[1 1 1 1 1 1 0 1 0 1]

[1 1 1 1 1 1 0 1 0 1]

Summary

In this tutorial, we have learned about AutoML and how it can be used by anyone, even non-technical users. We have also learned to use TPOT, an AutoML Python tool that automatically performs data processing, feature selection, model selection, hyperparameter tuning, model ensembling, and model evaluation. At the end of model training, we get the best-performing model and the pipeline by running two lines of code. We can even save the model and use it to build an AI application.

Source: machinelearningmastery.com
