Integrating Scikit-Learn and Statsmodels for Regression

Statistics and Machine Learning both aim to extract insights from data, though their approaches differ significantly. Traditional statistics primarily concerns itself with inference, using the entire dataset to test hypotheses and estimate probabilities about a larger population. In contrast, machine learning emphasizes prediction and decision-making, typically employing a train-test split methodology where models learn from a portion of the data (the training set) and validate their predictions on unseen data (the testing set).

In this post, we will demonstrate how a seemingly straightforward technique like linear regression can be viewed through these two lenses. We will explore their unique contributions by using Scikit-Learn for machine learning and Statsmodels for statistical inference.

Let’s get started.

Integrating Scikit-Learn and Statsmodels for Regression.
Photo by Stephen Dawson. Some rights reserved.

Overview

This post is divided into three parts; they are:

  • Supervised Learning: Classification vs. Regression
  • Diving into Regression with a Machine Learning Focus
  • Enhancing Understanding with Statistical Insights

Supervised Learning: Classification vs. Regression

Supervised learning is a branch of machine learning where the model is trained on a labeled dataset. This means that each example in the training dataset is paired with the correct output. Once trained, the model can apply what it has learned to new, unseen data.

In supervised learning, we encounter two main tasks: classification and regression. These tasks are determined by the type of output we aim to predict. If the goal is to predict categories, such as determining if an email is spam, we are dealing with a classification task. Alternatively, if we estimate a value, such as calculating the miles per gallon (MPG) a car will achieve based on its features, it falls under regression. The output’s nature — a category or a number — steers us toward the appropriate approach.
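To make the distinction concrete, here is a minimal sketch with made-up toy data (the email lengths and engine sizes below are hypothetical and unrelated to the Ames dataset): a categorical target calls for a classifier, while a numerical target calls for a regressor.

# A minimal sketch with toy data: the target's type decides the task

from sklearn.linear_model import LinearRegression, LogisticRegression

# Categorical target (0 = not spam, 1 = spam) -> classification
X_emails = [[300], [450], [4200], [5100]]   # e.g., message length in characters
y_spam = [0, 0, 1, 1]
classifier = LogisticRegression().fit(X_emails, y_spam)
print(classifier.predict([[4800]]))         # predicts a class label (spam or not)

# Numerical target (miles per gallon) -> regression
X_cars = [[1.6], [2.0], [3.5], [5.0]]       # e.g., engine size in liters
y_mpg = [38.0, 32.1, 24.5, 18.2]
regressor = LinearRegression().fit(X_cars, y_mpg)
print(regressor.predict([[2.4]]))           # predicts a number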

In this series, we will use the Ames housing dataset. It provides a comprehensive collection of features related to houses, including architectural details, condition, and location, aimed at predicting the “SalePrice” (the sales price) of each house.

# Load the Ames dataset

import pandas as pd

Ames = pd.read_csv('Ames.csv')

# Display the first few rows of the dataset and the data type of 'SalePrice'

print(Ames.head())

sale_price_dtype = Ames['SalePrice'].dtype

print(f"The data type of 'SalePrice' is {sale_price_dtype}.")

This should output:

PID  GrLivArea  SalePrice  ...          Prop_Addr   Latitude  Longitude

0  909176150        856     126000  ...    436 HAYWARD AVE  42.018564 -93.651619

1  905476230       1049     139500  ...       3416 WEST ST  42.024855 -93.663671

2  911128020       1001     124900  ...       320 S 2ND ST  42.021548 -93.614068

3  535377150       1039     114000  ...   1524 DOUGLAS AVE  42.037391 -93.612207

4  534177230       1665     227000  ...  2304 FILLMORE AVE  42.044554 -93.631818

[5 rows x 85 columns]

The data type of 'SalePrice' is int64.

The “SalePrice” column is of data type int64, indicating that it represents integer values. Since “SalePrice” is a numerical (continuous) variable rather than categorical, predicting the “SalePrice” would be a regression task. This means the goal is to predict a continuous quantity (the sale price of a house) based on the input features provided in your dataset.

Diving into Regression with a Machine Learning Focus

Supervised learning in machine learning focuses on predicting outcomes based on input data. In our case, using the Ames Housing dataset, we aim to predict a house’s sale price from its living area—a classic regression task. For this, we turn to scikit-learn, renowned for its simplicity and effectiveness in building predictive models.

To start, we select “GrLivArea” (ground living area) as our feature and “SalePrice” as the target. The next step involves splitting our dataset into training and testing sets using scikit-learn’s train_test_split() function. This crucial step allows us to train our model on one set of data and evaluate its performance on another, ensuring the model’s reliability.

Here’s how we do it:

# Import Linear Regression from scikit-learn

from sklearn.linear_model import LinearRegression

from sklearn.model_selection import train_test_split

# Select features and target

X = Ames[['GrLivArea']]  # Feature: GrLivArea, 2D matrix

y = Ames['SalePrice']    # Target: SalePrice, 1D vector

# Split data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and fit the Linear Regression model

model = LinearRegression()

model.fit(X_train, y_train)

# Scoring the model

score = round(model.score(X_test, y_test), 4)

print(f"Model R^2 Score: {score}")

This should output:

Model R^2 Score: 0.4789

The LinearRegression object imported in the code above is scikit-learn’s implementation of linear regression. The model’s R² score of 0.4789 indicates that our model explains approximately 48% of the variation in sale prices based on the living area alone—a significant insight for such a simple model. This step marks our initial foray into machine learning with scikit-learn, showcasing the ease with which we can assess model performance on unseen or test data.
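For LinearRegression, the score() method returns the coefficient of determination R². If you want to confirm this yourself, here is a short sketch continuing from the code above that computes the same number from the test-set predictions:

# Verify that model.score() is the R^2 of the test-set predictions
from sklearn.metrics import r2_score

y_pred = model.predict(X_test)
print(round(r2_score(y_test, y_pred), 4))   # should match the 0.4789 printed above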

Enhancing Understanding with Statistical Insights

After exploring how scikit-learn can help us assess model performance on unseen data, we now turn our attention to statsmodels, a Python package that offers a different angle of analysis. While scikit-learn excels in building models and predicting outcomes, statsmodels shines by diving deep into the statistical aspects of our data and model. Let’s see how statsmodels can provide you with insight at a different level:

import statsmodels.api as sm

# Adding a constant to our independent variable for the intercept

X_with_constant = sm.add_constant(X)

# Fit the OLS model

model_stats = sm.OLS(y, X_with_constant).fit()

# Print the summary of the model

print(model_stats.summary())

The first key distinction to highlight is statsmodels' use of all observations in our dataset. Unlike the predictive modeling approach, where we split our data into training and testing sets, statsmodels leverages the entire dataset to provide comprehensive statistical insights. This full utilization of data allows for a detailed understanding of the relationships between variables and enhances the accuracy of our statistical estimates. The above code should output the following:

OLS Regression Results                            

==============================================================================

Dep. Variable:              SalePrice   R-squared:                       0.518

Model:                            OLS   Adj. R-squared:                  0.518

Method:                 Least Squares   F-statistic:                     2774.

Date:                Sun, 31 Mar 2024   Prob (F-statistic):               0.00

Time:                        19:59:01   Log-Likelihood:                -31668.

No. Observations:                2579   AIC:                         6.334e+04

Df Residuals:                    2577   BIC:                         6.335e+04

Df Model:                           1                                        

Covariance Type:            nonrobust                                        

==============================================================================

coef    std err          t      P>|t|      [0.025      0.975]

------------------------------------------------------------------------------

const       1.377e+04   3283.652      4.195      0.000    7335.256    2.02e+04

GrLivArea    110.5551      2.099     52.665      0.000     106.439     114.671

==============================================================================

Omnibus:                      566.257   Durbin-Watson:                   1.926

Prob(Omnibus):                  0.000   Jarque-Bera (JB):             3364.083

Skew:                           0.903   Prob(JB):                         0.00

Kurtosis:                       8.296   Cond. No.                     5.01e+03

==============================================================================

Notes:

[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

[2] The condition number is large, 5.01e+03. This might indicate that there are

strong multicollinearity or other numerical problems.

Note that this is not the same regression as the scikit-learn one above, because here the full dataset is used without a train-test split.
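To confirm that the difference comes from the data rather than the libraries, you can refit scikit-learn’s LinearRegression on the full dataset; a minimal sketch reusing the X and y defined earlier:

# Refit scikit-learn's linear regression on the full dataset (no train-test split)
full_model = LinearRegression().fit(X, y)
print(full_model.intercept_, full_model.coef_[0])  # should be close to 1.377e+04 and 110.5551
print(round(full_model.score(X, y), 3))            # should be close to the 0.518 R-squared above

Both libraries solve the same ordinary least squares problem, so when given the same data, they produce essentially the same coefficients.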

Let’s dive into the statsmodels output for our OLS regression and explain what the P-values, coefficients, confidence intervals, and diagnostics tell us about our model, specifically focusing on predicting “SalePrice” from “GrLivArea”:

P-values and Coefficients

  • Coefficient of “GrLivArea”: The coefficient for “GrLivArea” is 110.5551. This means that for every additional square foot of living area, the sales price of the house is expected to increase by approximately $110.55. This coefficient quantifies the impact of living area size on the house’s sales price.
  • P-value for “GrLivArea”: The p-value associated with the “GrLivArea” coefficient is essentially 0 (indicated by P>|t| near 0.000), suggesting that the living area is a highly significant predictor of the sales price. In statistical terms, we can reject the null hypothesis that the coefficient is zero (no effect) and confidently state that there is a strong relationship between the living area and sales price (but not necessarily the only factor).
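
Besides reading the printed summary, these quantities can also be pulled programmatically from the fitted results object; a brief sketch using the model_stats object fitted above:

# Access the coefficients and p-values directly from the fitted results
print(model_stats.params)    # intercept (const) and GrLivArea coefficients
print(model_stats.pvalues)   # the corresponding p-values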

Confidence Intervals

  • Confidence Interval for “GrLivArea”: The confidence interval for the “GrLivArea” coefficient is [106.439, 114.671]. This range tells us that we can be 95% confident that the true impact of living area on sale price falls within this interval. It offers a measure of the precision of our coefficient estimate.
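
The same interval is available from the results object, and you can request a different confidence level if needed; a short sketch:

# 95% confidence intervals (the default) and a wider 99% interval
print(model_stats.conf_int())            # matches the [0.025, 0.975] columns in the summary
print(model_stats.conf_int(alpha=0.01))  # 99% confidence intervals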

Diagnostics

  • R-squared (R²): The R² value of 0.518 indicates that the living area can explain approximately 51.8% of the variability in sale prices. It’s a measure of how well the model fits the data. This number is expected to differ from the scikit-learn score above because it is computed on the full dataset rather than on a held-out test set.
  • F-statistic and Prob (F-statistic): The F-statistic is a measure of the overall significance of the model. With an F-statistic of 2774 and a Prob (F-statistic) essentially at 0, this indicates that the model is statistically significant.
  • Omnibus, Prob(Omnibus): These tests assess the normality of the residuals. A residual is the difference between the actual value ($y$) and the predicted value ($\hat{y}$). The statistical tests used in linear regression rely on the assumption that the residuals are normally distributed. A Prob(Omnibus) value close to 0 suggests the residuals are not normally distributed, which could be a concern for the validity of some statistical tests.
  • Durbin-Watson: The Durbin-Watson statistic tests for autocorrelation in the residuals. It ranges from 0 to 4, and a value close to 2 (here 1.926) suggests there is no strong autocorrelation. A value far from 2 indicates correlated residuals, which can be a sign of model misspecification, for example, a relationship between $X$ and $y$ that is not purely linear. Some of these diagnostics can be recomputed directly from the residuals, as shown in the sketch after this list.
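
The summary table reports these diagnostics for you, but several of them can be reproduced from the residuals of the fitted model; a minimal sketch using helpers from statsmodels.stats.stattools:

# Recompute a few diagnostics from the residuals of the fitted model
from statsmodels.stats.stattools import durbin_watson, jarque_bera

residuals = model_stats.resid                  # actual SalePrice minus predicted SalePrice
print(durbin_watson(residuals))                # should be close to 1.926, as in the summary
jb_stat, jb_pvalue, skew, kurtosis = jarque_bera(residuals)
print(jb_stat, jb_pvalue, skew, kurtosis)      # should match the Jarque-Bera row above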

This comprehensive output from statsmodels provides a deep understanding of how and why “GrLivArea” influences “SalePrice,” backed by statistical evidence. It underscores the importance of not just using models for predictions but also interpreting them to make informed decisions based on a solid statistical foundation. This insight is invaluable for those looking to explore the statistical story behind their data.

Summary

In this post, we navigated through the foundational concepts of supervised learning, specifically focusing on regression analysis. Using the Ames Housing dataset, we demonstrated how to employ scikit-learn for model building and performance, and statsmodels for gaining statistical insights into our data. This journey from data to insights underscores the critical role of both predictive modeling and statistical analysis in understanding and leveraging data effectively.

Specifically, you learned:

  • The distinction between classification and regression tasks in supervised learning.
  • How to identify which approach to use based on the nature of your data.
  • How to use scikit-learn to implement a simple linear regression model, assess its performance, and understand the significance of the model’s R² score.
  • The value of employing statsmodels to explore the statistical aspects of your data, including the interpretation of coefficients, p-values, and confidence intervals, and the importance of diagnostic tests for model assumptions.

Do you have any questions? Please ask your questions in the comments below, and I will do my best to answer.

Source: machinelearningmastery.com
