5 DIY Python Functions for Data Cleaning

Image by Author | Midjourney

Data cleaning: whether you love it or hate it, you likely spend a lot of time doing it.

It’s what we signed up for. There’s no understanding, analyzing, or modeling data without first cleaning it. Making sure we have reusable tools handy for data cleaning is essential. To that end, here are 5 DIY functions to give you some examples and starting points for building up your own data cleaning tool chest.

The functions are well-documented, and include explicit descriptions of function parameters and return types. Type hinting is also employed to ensure both that the functions are called as intended and that they can be well understood by you, the reader.

Before we get started, let’s take care of the imports.

import re
from datetime import datetime
import pandas as pd
import numpy as np
from typing import List, Union, Optional

With that done, let’s get on to the functions.

1. Remove Multiple Spaces

Our first DIY function removes excessive whitespace from text. If we want neither multiple spaces within a string nor excess leading or trailing spaces, this single-line function takes care of it for us. We make use of a regular expression for the internal spaces, and strip() for leading/trailing whitespace.

def clean_spaces(text: str) -> str:
    """
    Remove multiple spaces from a string and trim leading/trailing spaces.

    :param text: The input string to clean
    :returns: A string with multiple spaces removed and trimmed
    """
    return re.sub(' +', ' ', str(text).strip())

Testing:

messy_text = "This   has   too    many    spaces"
clean_text = clean_spaces(messy_text)
print(clean_text)

Output:

This has too many spaces
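
One thing to note: the pattern ' +' only collapses runs of literal spaces. If your text may also contain tabs or newlines, a small variation using \s+ handles those as well. Here is a minimal sketch of such a variant (a hypothetical helper, not part of the function above):

def clean_whitespace(text: str) -> str:
    """Variant of clean_spaces that also collapses tabs and newlines into single spaces."""
    return re.sub(r'\s+', ' ', str(text).strip())

print(clean_whitespace("tabs\tand\nnewlines   too"))  # tabs and newlines too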

2. Standardize Date Formats

Do you have datasets with dates written in a variety of formats? This function converts several common formats to our specified standard (YYYY-MM-DD), and returns None for anything it cannot parse.

def standardize_date(date_string: str) -> Optional[str]:
    """
    Convert various date formats to YYYY-MM-DD.

    :param date_string: The input date string to standardize
    :returns: A standardized date string in YYYY-MM-DD format, or None if parsing fails
    """
    date_formats = ["%Y-%m-%d", "%d-%m-%Y", "%m/%d/%Y", "%d/%m/%Y", "%B %d, %Y"]
    for fmt in date_formats:
        try:
            return datetime.strptime(date_string, fmt).strftime("%Y-%m-%d")
        except ValueError:
            pass
    # Return None if no format matches
    return None

Testing:

dates = ["2023-04-01", "01-04-2023", "04/01/2023", "April 1, 2023"]
standardized_dates = [standardize_date(date) for date in dates]
print(standardized_dates)

Output:

['2023-04-01', '2023-04-01', '2023-04-01', '2023-04-01']
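
If your dates come in formats beyond the five listed in date_formats, you can either extend that list or fall back on a general-purpose parser. Here is a minimal sketch of the latter approach, assuming the third-party python-dateutil package is available (the wrapper name is our own):

from dateutil import parser as date_parser

def standardize_date_with_fallback(date_string: str) -> Optional[str]:
    """Try the explicit formats first, then fall back to dateutil's flexible parser."""
    result = standardize_date(date_string)
    if result is not None:
        return result
    try:
        return date_parser.parse(date_string).strftime("%Y-%m-%d")
    except (ValueError, OverflowError):
        return None

print(standardize_date_with_fallback("1 Apr 2023"))  # 2023-04-01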

3. Handle Missing Values

Let’s deal with those pesky missing values. We can specify the strategy to use for numeric data (‘mean’, ‘median’, or ‘mode’) as well as for categorical data (‘mode’ or ‘dummy’).

def handle_missing(df: pd.DataFrame, numeric_strategy: str = 'mean', categorical_strategy: str = 'mode') -> pd.DataFrame:
    """
    Fill missing values in a DataFrame.

    :param df: The input DataFrame
    :param numeric_strategy: Strategy for handling missing numeric values ('mean', 'median', or 'mode')
    :param categorical_strategy: Strategy for handling missing categorical values ('mode' or 'dummy')
    :returns: A DataFrame with missing values filled
    """
    for column in df.columns:
        if df[column].dtype in ['int64', 'float64']:
            # Numeric columns: fill with the chosen summary statistic
            if numeric_strategy == 'mean':
                df[column] = df[column].fillna(df[column].mean())
            elif numeric_strategy == 'median':
                df[column] = df[column].fillna(df[column].median())
            elif numeric_strategy == 'mode':
                df[column] = df[column].fillna(df[column].mode()[0])
        else:
            # Categorical columns: fill with the most frequent value or a placeholder
            if categorical_strategy == 'mode':
                df[column] = df[column].fillna(df[column].mode()[0])
            elif categorical_strategy == 'dummy':
                df[column] = df[column].fillna('Unknown')
    return df

Testing:

df = pd.DataFrame({'A': [1, 2, np.nan, 4], 'B': ['x', 'y', np.nan, 'z']})
cleaned_df = handle_missing(df)
print(cleaned_df)

Output:

          A  B
0  1.000000  x
1  2.000000  y
2  2.333333  x
3  4.000000  z
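
The defaults are ‘mean’ for numeric columns and ‘mode’ for categorical ones, but the strategy parameters let you change that. A quick sketch on the same toy DataFrame, using the median for numerics and a dummy label for categoricals:

df = pd.DataFrame({'A': [1, 2, np.nan, 4], 'B': ['x', 'y', np.nan, 'z']})
cleaned_df = handle_missing(df, numeric_strategy='median', categorical_strategy='dummy')
print(cleaned_df)
# The NaN in A becomes the median (2.0) and the NaN in B becomes 'Unknown'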

4. Remove Outliers

Outliers causing you problems? Not anymore. This DIY function uses the IQR method to remove outliers from our data. You just pass in the data and specify the columns to check for outliers, and it returns an outlier-free DataFrame.

def remove_outliers_iqr(df: pd.DataFrame, columns: List[str], factor: float = 1.5) -> pd.DataFrame:
    """
    Remove outliers from specified columns using the Interquartile Range (IQR) method.

    :param df: The input DataFrame
    :param columns: List of column names to check for outliers
    :param factor: The IQR factor to use (default is 1.5)
    :returns: A DataFrame with outliers removed
    """
    mask = pd.Series(True, index=df.index)
    for col in columns:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - factor * IQR
        upper_bound = Q3 + factor * IQR
        mask &= (df[col] >= lower_bound) & (df[col] <= upper_bound)
    cleaned_df = df[mask]
    return cleaned_df

Testing:

df = pd.DataFrame({'A': [1, 2, 3, 100, 4, 5], 'B': [10, 20, 30, 40, 50, 1000]})
print("Original DataFrame:")
print(df)
print("\nCleaned DataFrame:")
cleaned_df = remove_outliers_iqr(df, ['A', 'B'])
print(cleaned_df)

Output:

Original DataFrame:
     A     B
0    1    10
1    2    20
2    3    30
3  100    40
4    4    50
5    5  1000

Cleaned DataFrame:
   A   B
0  1  10
1  2  20
2  3  30
4  4  50
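
Keep in mind that only the columns you list are screened; everything else passes through untouched. For instance, restricting the check to column 'A' on the same DataFrame drops the row where A is 100 but keeps the row where B is 1000:

partially_cleaned = remove_outliers_iqr(df, ['A'])
print(partially_cleaned)
# Row 3 (A = 100) is dropped, but row 5 (B = 1000) survives because B was not checked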

5. Normalize Text Data

Let’s get normal! When you want to convert all text to lowercase, strip extra whitespace, and remove special characters, this DIY function will do the trick.

def normalize_text(text: str) -> str:
    """
    Normalize text data by converting to lowercase, removing special characters, and extra spaces.

    :param text: The input text to normalize
    :returns: Normalized text
    """
    # Convert to lowercase
    text = str(text).lower()
    # Remove special characters
    text = re.sub(r'[^\w\s]', '', text)
    # Remove extra spaces
    text = re.sub(r'\s+', ' ', text).strip()
    return text

Testing:

messy_text = "This is MESSY!!! Text   with $pecial ch@racters."
clean_text = normalize_text(messy_text)
print(clean_text)

Output:

this is messy text with pecial chracters
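
As with the earlier functions, normalize_text is easy to apply across a whole column. A minimal sketch, assuming a hypothetical DataFrame with a free-text 'comment' column:

df = pd.DataFrame({'comment': ["GREAT product!!!", "  Would   buy again :)  "]})
df['comment'] = df['comment'].apply(normalize_text)
print(df['comment'].tolist())  # ['great product', 'would buy again']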

Final Thoughts

Well, that’s that. We presented 5 different DIY functions, each performing a specific data cleaning task. We test-drove them all and checked out their results. You should now have some idea of where to go on your own from here, and don’t forget to save these functions for later use.

Source: machinelearningmastery.com
