
Assessing Data Quality Dimensions in Business Data

Prerequisites
1. Install Python: Make sure you have Python installed. You can download it
from Python's official website (https://www.python.org/downloads/).

2. Install Required Libraries: You will need the following libraries: 'pandas',
'numpy', and 'matplotlib'. You can install them using pip.

pip install pandas numpy matplotlib

3. Set Up Your IDE: You can use any Python IDE or text editor (like Jupyter
Notebook, VS Code, or PyCharm).
Step 1: Gather Data
For demonstration, let’s create a sample dataset in CSV format. It deliberately
contains quality problems (a missing email, an impossible date, a missing
amount, and a duplicate row) for the checks below. Save the following data in a
file named 'business_data.csv'.

CustomerID,Name,Email,JoinDate,AmountSpent
1,John Doe,[email protected],2024-01-15,150.00
2,Jane Smith,[email protected],2024-02-20,200.00
3,Bob Johnson,,2024-03-05,150.00
4,Mary Johnson,[email protected],2024-02-30,300.00
5,Tom Brown,[email protected],2024-03-15,400.00
6,Emily Davis,[email protected],2024-01-25,
1,John Doe,[email protected],2024-01-15,150.00

Step 2: Load the Data

Use Pandas to load the dataset and inspect its contents.

import pandas as pd

# Load the data
data = pd.read_csv('business_data.csv')

# Display the first few rows
print(data.head())

Step 3: Data Profiling

Perform basic profiling to understand the structure of the dataset.

# Get summary statistics (numeric columns only, by default)
print(data.describe())

# Check for missing values
print(data.isnull().sum())

# Check data types
print(data.dtypes)

# Check unique values in 'CustomerID'
print(data['CustomerID'].unique())
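
As a quick complement to the checks above, pandas' info() method reports
column dtypes and non-null counts in a single call:

# One-call overview: dtypes, non-null counts, and memory usage
data.info()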

Step 4: Assess Data Quality Dimensions

a. Accuracy:

Check for potential inaccuracies, like invalid email formats or incorrect join
dates.

import re

# Simple pattern-based email validation
def is_valid_email(email):
    pattern = r'^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$'
    return bool(re.match(pattern, email))

# Validate emails; missing values are treated as invalid
data['Email_Valid'] = data['Email'].apply(
    lambda x: is_valid_email(x) if pd.notnull(x) else False)
print(data[['Email', 'Email_Valid']])
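
A count of the invalid addresses is often more useful for reporting than the
full listing:

# Count rows flagged as invalid
print("Invalid emails:", (~data['Email_Valid']).sum())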

b. Completeness:

Count missing values in each column.

# Check completeness
missing_values = data.isnull().sum()
print("Missing Values:\n", missing_values)

c. Consistency:

Check for duplicate entries.

# Check for duplicates
duplicates = data.duplicated().sum()
print("Number of duplicate entries:", duplicates)

# Display duplicates
print(data[data.duplicated()])

d. Timeliness:

Check whether each JoinDate parses as a valid date.

# Convert JoinDate to datetime; unparseable values become NaT
data['JoinDate'] = pd.to_datetime(data['JoinDate'], errors='coerce')
invalid_dates = data[data['JoinDate'].isnull()]
print("Invalid Join Dates:\n", invalid_dates)

e. Relevance:

Evaluate whether all columns are relevant for analysis.

# Display columns to evaluate relevance
print("Columns in the dataset:\n", data.columns)

f. Uniqueness:

Check for unique CustomerID values.

# Check for unique CustomerID
unique_ids = data['CustomerID'].nunique()
print("Unique Customer IDs:", unique_ids)

Step 5: Data Cleaning

Now let’s clean the dataset based on our assessments.

a. Fill missing values:

For AmountSpent, we could fill missing values with the average.

# Fill missing AmountSpent with the column mean
mean_amount = data['AmountSpent'].mean()
data['AmountSpent'] = data['AmountSpent'].fillna(mean_amount)
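
If AmountSpent were skewed by large outliers, the median would be a more
robust fill value; this is an alternative to the mean fill above:

# Alternative: fill with the median, which is less sensitive to outliers
data['AmountSpent'] = data['AmountSpent'].fillna(data['AmountSpent'].median())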

b. Remove duplicates:

# Remove duplicates
data.drop_duplicates(inplace=True)

c. Remove invalid emails:

# Remove rows with invalid emails
data = data[data['Email_Valid']]

d. Remove invalid dates:

# Remove rows with invalid JoinDates
data = data[data['JoinDate'].notnull()]

Step 6: Validate Data Quality After Cleaning

After cleaning, validate the data quality again.

# Check for missing values again
print("Missing Values after cleaning:\n", data.isnull().sum())

# Check for duplicates again
print("Number of duplicate entries after cleaning:", data.duplicated().sum())

Step 7: Documentation and Reporting

Create a report summarizing your findings.

with open('data_quality_report.txt', 'w') as report:
    report.write("Data Quality Assessment Report\n")
    report.write("=================================\n")
    report.write(f"Total Rows: {len(data)}\n")
    report.write(f"Missing Values: {data.isnull().sum().to_dict()}\n")
    report.write(f"Duplicate Entries: {data.duplicated().sum()}\n")
    # After the cleaning steps, the two counts below should be zero
    report.write(f"Invalid Emails: {data[data['Email_Valid'] == False].shape[0]}\n")
    report.write(f"Invalid Join Dates: {data[data['JoinDate'].isnull()].shape[0]}\n")
