5 DIY Python Functions for Data Cleaning

Image by Author | Midjourney

Data cleaning: whether you love it or hate it, you likely spend a lot of time doing it.

It’s what we signed up for. There’s no understanding, analyzing, or modeling data without first cleaning it. Making sure we have reusable tools handy for data cleaning is essential. To that end, here are 5 DIY functions to give you some examples and starting points for building up your own data cleaning tool chest.

The functions are well-documented, and include explicit descriptions of function parameters and return types. Type hinting is also employed to ensure both that the functions are called as intended and that they can be well understood by you, the reader.

Before we get started, let’s take care of the imports.

import re
from datetime import datetime
import pandas as pd
import numpy as np
from typing import List, Union, Optional

With that done, let’s get on to the functions.

1. Remove Multiple Spaces

Our first DIY function removes excessive whitespace from text. If we want neither multiple spaces within a string nor excess leading or trailing spaces, this single-line function takes care of it for us. We make use of a regular expression for the internal spaces, and strip() for leading/trailing whitespace.

def clean_spaces(text: str) -> str:
    """
    Remove multiple spaces from a string and trim leading/trailing spaces.

    :param text: The input string to clean
    :returns: A string with multiple spaces removed and trimmed
    """
    return re.sub(' +', ' ', str(text).strip())

Testing:

messy_text = "This   has   too    many    spaces"
clean_text = clean_spaces(messy_text)
print(clean_text)

Output:

This has too many spaces
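
One thing to note: the pattern ' +' only collapses runs of literal spaces. If your text may also contain tabs or newlines, a small variation using \s+ handles those as well. Here is a minimal sketch of such a variant (a hypothetical helper, not part of the function above):

def clean_whitespace(text: str) -> str:
    """Variant of clean_spaces that also collapses tabs and newlines into single spaces."""
    return re.sub(r'\s+', ' ', str(text).strip())

print(clean_whitespace("tabs\tand\nnewlines   too"))  # tabs and newlines too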

2. Standardize Date Formats

Do you have datasets with dates written in a variety of formats? This function converts several common formats to our specified standard (YYYY-MM-DD), and returns None for anything it cannot parse.

def standardize_date(date_string: str) -> Optional[str]:
    """
    Convert various date formats to YYYY-MM-DD.

    :param date_string: The input date string to standardize
    :returns: A standardized date string in YYYY-MM-DD format, or None if parsing fails
    """
    date_formats = ["%Y-%m-%d", "%d-%m-%Y", "%m/%d/%Y", "%d/%m/%Y", "%B %d, %Y"]
    for fmt in date_formats:
        try:
            return datetime.strptime(date_string, fmt).strftime("%Y-%m-%d")
        except ValueError:
            pass
    # Return None if no format matches
    return None

Testing:

dates = ["2023-04-01", "01-04-2023", "04/01/2023", "April 1, 2023"]
standardized_dates = [standardize_date(date) for date in dates]
print(standardized_dates)

Output:

['2023-04-01', '2023-04-01', '2023-04-01', '2023-04-01']
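
If your dates come in formats beyond the five listed in date_formats, you can either extend that list or fall back on a general-purpose parser. Here is a minimal sketch of the latter approach, assuming the third-party python-dateutil package is available (the wrapper name is our own):

from dateutil import parser as date_parser

def standardize_date_with_fallback(date_string: str) -> Optional[str]:
    """Try the explicit formats first, then fall back to dateutil's flexible parser."""
    result = standardize_date(date_string)
    if result is not None:
        return result
    try:
        return date_parser.parse(date_string).strftime("%Y-%m-%d")
    except (ValueError, OverflowError):
        return None

print(standardize_date_with_fallback("1 Apr 2023"))  # 2023-04-01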

3. Handle Missing Values

Let’s deal with those pesky missing values. We can specify the strategy to use for numeric data (‘mean’, ‘median’, or ‘mode’) as well as for categorical data (‘mode’ or ‘dummy’).

def handle_missing(df: pd.DataFrame, numeric_strategy: str = 'mean', categorical_strategy: str = 'mode') -> pd.DataFrame:
    """
    Fill missing values in a DataFrame.

    :param df: The input DataFrame
    :param numeric_strategy: Strategy for handling missing numeric values ('mean', 'median', or 'mode')
    :param categorical_strategy: Strategy for handling missing categorical values ('mode' or 'dummy')
    :returns: A DataFrame with missing values filled
    """
    for column in df.columns:
        if df[column].dtype in ['int64', 'float64']:
            # Numeric columns: fill with the chosen summary statistic
            if numeric_strategy == 'mean':
                df[column] = df[column].fillna(df[column].mean())
            elif numeric_strategy == 'median':
                df[column] = df[column].fillna(df[column].median())
            elif numeric_strategy == 'mode':
                df[column] = df[column].fillna(df[column].mode()[0])
        else:
            # Categorical columns: fill with the most frequent value or a placeholder
            if categorical_strategy == 'mode':
                df[column] = df[column].fillna(df[column].mode()[0])
            elif categorical_strategy == 'dummy':
                df[column] = df[column].fillna('Unknown')
    return df

Testing:

df = pd.DataFrame({'A': [1, 2, np.nan, 4], 'B': ['x', 'y', np.nan, 'z']})
cleaned_df = handle_missing(df)
print(cleaned_df)

Output:

          A  B
0  1.000000  x
1  2.000000  y
2  2.333333  x
3  4.000000  z
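
The defaults are ‘mean’ for numeric columns and ‘mode’ for categorical ones, but the strategy parameters let you change that. A quick sketch on the same toy DataFrame, using the median for numerics and a dummy label for categoricals:

df = pd.DataFrame({'A': [1, 2, np.nan, 4], 'B': ['x', 'y', np.nan, 'z']})
cleaned_df = handle_missing(df, numeric_strategy='median', categorical_strategy='dummy')
print(cleaned_df)
# The NaN in A becomes the median (2.0) and the NaN in B becomes 'Unknown'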

4. Remove Outliers

Outliers causing you problems? Not anymore. This DIY function uses the IQR method to remove outliers from our data. You just pass in the data and specify the columns to check for outliers, and it returns an outlier-free DataFrame.

def remove_outliers_iqr(df: pd.DataFrame, columns: List[str], factor: float = 1.5) -> pd.DataFrame:
    """
    Remove outliers from specified columns using the Interquartile Range (IQR) method.

    :param df: The input DataFrame
    :param columns: List of column names to check for outliers
    :param factor: The IQR factor to use (default is 1.5)
    :returns: A DataFrame with outliers removed
    """
    mask = pd.Series(True, index=df.index)
    for col in columns:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - factor * IQR
        upper_bound = Q3 + factor * IQR
        mask &= (df[col] >= lower_bound) & (df[col] <= upper_bound)
    cleaned_df = df[mask]
    return cleaned_df

Testing:

df = pd.DataFrame({'A': [1, 2, 3, 100, 4, 5], 'B': [10, 20, 30, 40, 50, 1000]})
print("Original DataFrame:")
print(df)
print("\nCleaned DataFrame:")
cleaned_df = remove_outliers_iqr(df, ['A', 'B'])
print(cleaned_df)

Output:

Original DataFrame:
     A     B
0    1    10
1    2    20
2    3    30
3  100    40
4    4    50
5    5  1000

Cleaned DataFrame:
   A   B
0  1  10
1  2  20
2  3  30
4  4  50
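
Keep in mind that only the columns you list are screened; everything else passes through untouched. For instance, restricting the check to column 'A' on the same DataFrame drops the row where A is 100 but keeps the row where B is 1000:

partially_cleaned = remove_outliers_iqr(df, ['A'])
print(partially_cleaned)
# Row 3 (A = 100) is dropped, but row 5 (B = 1000) survives because B was not checked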

5. Normalize Text Data

Let’s get normal! When you want to convert all text to lowercase, strip extra whitespace, and remove special characters, this DIY function will do the trick.

def normalize_text(text: str) -> str:
    """
    Normalize text data by converting to lowercase, removing special characters, and extra spaces.

    :param text: The input text to normalize
    :returns: Normalized text
    """
    # Convert to lowercase
    text = str(text).lower()
    # Remove special characters
    text = re.sub(r'[^\w\s]', '', text)
    # Remove extra spaces
    text = re.sub(r'\s+', ' ', text).strip()
    return text

Testing:

messy_text = "This is MESSY!!! Text   with $pecial ch@racters."
clean_text = normalize_text(messy_text)
print(clean_text)

Output:

this is messy text with pecial chracters
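
As with the earlier functions, normalize_text is easy to apply across a whole column. A minimal sketch, assuming a hypothetical DataFrame with a free-text 'comment' column:

df = pd.DataFrame({'comment': ["GREAT product!!!", "  Would   buy again :)  "]})
df['comment'] = df['comment'].apply(normalize_text)
print(df['comment'].tolist())  # ['great product', 'would buy again']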

Final Thoughts

Well, that’s that. We presented 5 different DIY functions, each performing a specific data cleaning task. We test-drove them all and checked out their results. You should now have some idea of where to go on your own from here, and don’t forget to save these functions for later use.

Source: machinelearningmastery.com
