Assessing Data Quality Dimensions
Prerequisites
1. Install Python: Make sure you have Python installed. You can download it
from Python's official website (https://www.python.org/downloads/).
2. Install Required Libraries: You will need the following libraries: 'pandas',
'numpy', and 'matplotlib'. You can install them using pip, for example:
pip install pandas numpy matplotlib
3. Set Up Your IDE: You can use any Python IDE or text editor (like Jupyter
Notebook, VS Code, or PyCharm).
Step 1: Gather Data
For demonstration, let’s create a sample dataset in CSV format. Save the
following data in a file named 'business_data.csv'.
CustomerID,Name,Email,JoinDate,AmountSpent
1,John Doe,[email protected],2024-01-15,150.00
2,Jane Smith,[email protected],2024-02-20,200.00
3,Bob Johnson,,2024-03-05,150.00
4,Mary Johnson,[email protected],2024-02-30,300.00
5,Tom Brown,[email protected],2024-03-15,400.00
6,Emily Davis,[email protected],2024-01-25,
1,John Doe,[email protected],2024-01-15,150.00
import pandas as pd
data = pd.read_csv('business_data.csv')  # load the sample file saved above
a. Accuracy:
Check for potential inaccuracies, like invalid email formats or incorrect join
dates.
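import re

def is_valid_email(email):
    # Assumed pattern (the original helper is not shown): name@domain.tld
    return re.match(r'^[\w.+-]+@[\w-]+\.[\w.-]+$', email) is not None

# Validate emails; missing values are treated as invalid
data['Email_Valid'] = data['Email'].apply(
    lambda x: is_valid_email(x) if pd.notnull(x) else False)
print(data[['Email', 'Email_Valid']])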
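The accuracy check also mentions incorrect join dates, and the sample data contains one (2024-02-30 does not exist). A minimal sketch of one way to flag such values is shown here, assuming dates are expected in YYYY-MM-DD form. The column name 'JoinDate_Valid' and the variable 'invalid_ids' are introduced only for this illustration.

```python
import pandas as pd

# CustomerID and JoinDate values copied from the sample dataset
data = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4, 5, 6, 1],
    "JoinDate": ["2024-01-15", "2024-02-20", "2024-03-05", "2024-02-30",
                 "2024-03-15", "2024-01-25", "2024-01-15"],
})

# errors="coerce" turns dates that cannot be parsed into NaT instead of raising
parsed = pd.to_datetime(data["JoinDate"], format="%Y-%m-%d", errors="coerce")

# Illustrative flag column: True where the date parsed successfully
data["JoinDate_Valid"] = parsed.notna()

# Customers whose join date is not a real calendar date
invalid_ids = data.loc[~data["JoinDate_Valid"], "CustomerID"].tolist()
print(data[["CustomerID", "JoinDate", "JoinDate_Valid"]])
print("Invalid join dates for CustomerID:", invalid_ids)
```

Coercing to NaT keeps the check from stopping at the first bad value, so every invalid date is reported in one pass.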
b. Completeness:
# Check completeness
missing_values = data.isnull().sum()
print("Missing Values:\n", missing_values)
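Raw counts of missing values can be easier to compare when expressed as a percentage per column. The sketch below is one possible way to do that. It rebuilds a small version of the sample data in memory; the email strings are placeholders rather than the addresses from the dataset, and the names 'completeness' and 'incomplete_ids' are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Reduced copy of the sample data; email strings are placeholders
data = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4, 5, 6, 1],
    "Email": ["user@example.com", "user@example.com", None, "user@example.com",
              "user@example.com", "user@example.com", "user@example.com"],
    "AmountSpent": [150.0, 200.0, 150.0, 300.0, 400.0, np.nan, 150.0],
})

# Share of non-missing values per column, as a percentage
completeness = data.notnull().mean() * 100
print(completeness.round(1))

# Customers with at least one missing field
incomplete_ids = data.loc[data.isnull().any(axis=1), "CustomerID"].tolist()
print("Rows with missing values, CustomerID:", incomplete_ids)
```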
c. Consistency:
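No check is given for this dimension. Since the sample data repeats one row, one possible reading of consistency is to look for fully repeated rows and for CustomerID values that appear more than once or with conflicting names. The sketch below works under that assumption; the variable names are illustrative only.

```python
import pandas as pd

# Columns copied from the sample dataset
data = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4, 5, 6, 1],
    "Name": ["John Doe", "Jane Smith", "Bob Johnson", "Mary Johnson",
             "Tom Brown", "Emily Davis", "John Doe"],
    "JoinDate": ["2024-01-15", "2024-02-20", "2024-03-05", "2024-02-30",
                 "2024-03-15", "2024-01-25", "2024-01-15"],
})

# Rows that exactly repeat an earlier row
duplicate_rows = data[data.duplicated()]
print("Fully duplicated rows:\n", duplicate_rows)

# CustomerID values that occur more than once
repeated = data["CustomerID"].duplicated(keep=False)
repeated_ids = data.loc[repeated, "CustomerID"].unique().tolist()
print("Repeated CustomerIDs:", repeated_ids)

# IDs that map to more than one distinct name (a stronger inconsistency)
names_per_id = data.groupby("CustomerID")["Name"].nunique()
conflicting_ids = names_per_id[names_per_id > 1].index.tolist()
print("CustomerIDs with conflicting names:", conflicting_ids)
```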
d. Timeliness:
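No check is given for this dimension either. One common interpretation is to measure how old each record is relative to a reference date. The sketch below assumes that interpretation; the reference date (2024-04-01) and the 60-day threshold are hypothetical values chosen for illustration and should be replaced with whatever suits your use case.

```python
import pandas as pd

# A few valid JoinDate values from the sample dataset
join_dates = pd.Series(["2024-01-15", "2024-02-20", "2024-03-05", "2024-03-15"])
parsed = pd.to_datetime(join_dates, format="%Y-%m-%d")

# Hypothetical reference date and staleness threshold
reference_date = pd.Timestamp("2024-04-01")
max_age_days = 60

# Age of each record in whole days
age_days = (reference_date - parsed).dt.days
is_stale = age_days > max_age_days

report = pd.DataFrame({"JoinDate": join_dates, "AgeDays": age_days,
                       "Stale": is_stale})
print(report)
```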
e. Relevance:
b. Remove duplicates:
# Remove duplicates
data.drop_duplicates(inplace=True)