Data Cleaning
Data cleaning is a critical step in data analysis that ensures data is accurate, complete, and ready for analysis. Here are
the general steps involved in data cleaning:
1. Understand the Data
• Explore the dataset: Get familiar with the data by understanding its structure, data types, and distributions.
• Identify missing values, outliers, and inconsistencies.
2. Remove Duplicates
• Identify duplicate records: Look for rows that are repeated.
• Drop duplicates: Use functions like drop_duplicates() in Python (Pandas) to eliminate them.
3. Handle Missing Data
• Identify missing data: Use techniques like isnull() in Pandas to spot missing values.
• Impute missing values: Replace missing data using strategies like:
○ Mean/Median/Mode imputation for numerical data.
○ Most frequent value for categorical data.
○ Forward/backward fill for time-series data.
○ Dropping rows or columns with too many missing values if necessary.
4. Standardize Data
• Convert data types: Ensure columns are in the correct format (e.g., dates, numeric, categorical).
• Correct capitalization: Make sure that text fields follow a consistent style (all lowercase, etc.).
• Remove leading/trailing spaces: Clean up extra spaces from string fields.
• Unify formats: Make sure formats are consistent (e.g., date formats, phone numbers).
5. Handle Outliers
• Detect outliers: Use techniques like boxplots, z-scores, or IQR (interquartile range) to identify outliers.
• Decide on action: Depending on the use case, you can remove outliers, transform them, or cap them to a certain
threshold.
6. Fix Structural Errors
• Identify inconsistencies: These include typos, misplaced values, and inconsistent naming conventions (e.g., “NY”
vs. “New York”).
• Correct errors: Replace or modify inconsistent data values.
7. Normalize/Scale Data
• Normalization: Adjust numerical features to a common scale (e.g., min-max scaling, Z-score normalization) if
needed for specific algorithms (like machine learning).
• Log transformations: Apply log transformations to handle skewed data.
8. Convert Categorical Data
• Encode categorical variables: Convert categorical variables into numerical forms (e.g., one-hot encoding, label
encoding) for machine learning algorithms that require numeric inputs.
9. Remove Irrelevant Data
• Drop unneeded columns: Remove features that are not necessary for analysis or do not add value to the final
model.
10. Validate Data
• Check data consistency: Verify the cleaned data against known rules (e.g., dates within reasonable ranges, numeric
values within expected limits).
• Re-check for missing data and duplicates.
11. Save the Cleaned Data
• Store the cleaned data: Save the cleaned dataset in the desired format (CSV, database, etc.) for further analysis.
By following these steps, you ensure that your data is clean, reliable, and ready for analysis or modeling.
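As a quick illustration of how several of these steps fit together, here is a minimal pandas sketch on a small hypothetical dataset (the column names and replacement values are placeholders, not a prescription):
python
import pandas as pd

# Hypothetical raw data with duplicates, missing values, and messy text
df = pd.DataFrame({
    'name': [' Alice ', 'bob', 'bob', None],
    'age': [29, 35, 35, 41],
    'city': ['NY', 'new york', 'new york', 'LA']
})

df = df.drop_duplicates()                            # step 2: remove duplicates
df['name'] = df['name'].fillna('unknown')            # step 3: handle missing data
df['name'] = df['name'].str.strip().str.lower()      # step 4: standardize text
df['city'] = df['city'].replace({'NY': 'new york', 'LA': 'los angeles'})  # step 6: fix naming
df['age'] = pd.to_numeric(df['age'], errors='coerce')  # step 4: enforce types
print(df)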
Understand the data
Thursday, October 17, 2024 3:23 PM
Understanding the data is the first and most crucial step in the data analysis or data cleaning process. It involves examining the structure, content, and quality
of the data. Here are some methods and approaches to help you understand your data effectively:
1. Inspect the Data Structure
• Data types: Check the types of each column (e.g., integer, float, string, date) using methods like .dtypes in Pandas. This will help you understand how
the data is stored.
• Shape and size: Use .shape or .info() to get the number of rows and columns and see if any initial transformations are needed.
• Sample the data: Use .head() or .tail() to view the first few or last few rows and get a general sense of what the data looks like.
• Metadata/Schema: If available, review the data dictionary or schema that describes each column's meaning, the units of measurement, etc.
2. Descriptive Statistics
• Summary statistics: Use .describe() for numerical columns and .value_counts() for categorical columns. This will give you key information like:
○ Mean, median, mode.
○ Standard deviation, minimum, maximum values.
○ Quantiles, counts, and distribution for each feature.
• Distribution of data: Look for patterns in how the data is distributed to understand its central tendencies, spread, and possible skewness.
3. Identify Missing and Null Values
• Check for missing data: Use .isnull() combined with .sum() to see how many missing values exist in each column.
• Patterns of missingness: Understand if there are columns with many missing values, which could signal a problem with data collection or suggest a
need for imputation.
4. Check for Duplicates
• Duplicate records: Look for duplicate rows using .duplicated() to avoid skewed results later on. This step can help identify data entry issues or
redundancies in the dataset.
5. Understand Relationships Between Variables
• Correlation analysis: For numerical data, you can use .corr() to find the correlation matrix, which shows the strength of relationships between variables.
Highly correlated variables may indicate multicollinearity.
• Cross-tabulation: For categorical data, use cross-tabulation (pd.crosstab()) to see relationships between different categorical features.
6. Visualize the Data
Visualization is a powerful tool to help you understand the distribution, trends, and anomalies in the data.
• Histograms: Show the distribution of numerical data.
• Box plots: Identify outliers and understand the spread of data.
• Scatter plots: Help reveal relationships and patterns between two continuous variables.
• Bar charts and pie charts: For understanding the frequency distribution of categorical variables.
• Heatmaps: Can visualize the correlation matrix or missing value patterns.
7. Understand the Business Context
• Domain knowledge: Understanding the business problem, the context in which the data was collected, and what each feature represents is crucial. You
might need to collaborate with domain experts or review documentation.
• Units and measurement scales: Ensure you understand the units of measurement (e.g., currency, time, etc.) for each feature to prevent incorrect
assumptions during analysis.
• Key performance indicators (KPIs): If the data is used to drive decisions, understand the KPIs and how they are calculated.
8. Identify Data Quality Issues
• Outliers: Detect unusually high or low values that don't make sense in the context of the data (e.g., negative values in age).
• Inconsistent data: Identify and investigate inconsistent values, such as incorrect or inconsistent formatting in categorical variables (e.g., "NY" vs "New
York").
• Check for time-based inconsistencies: For time-series data, ensure the data is complete and properly sequenced.
9. Categorical Feature Analysis
• Unique values: Use .unique() or .nunique() to see the distinct values in a categorical column. This will help in understanding the diversity or range of
categories.
• Frequency counts: Examine the most common categories in each column, which will help you detect possible data anomalies or imbalances.
10. Review Data Collection Process
• Source of data: Understand where and how the data was collected (e.g., sensors, user input, automated systems). This will give you insight into
potential biases or gaps in the data.
• Time range: Check if the data covers the appropriate time period for your analysis.
Tools to Understand the Data:
• Python (Pandas): Use functions like .head(), .info(), .describe(), .corr(), and visualizations with libraries like Matplotlib or Seaborn.
• SQL: Write queries to extract summaries like COUNT(), SUM(), AVG(), and GROUP BY to understand patterns in relational databases.
• Data visualization tools: Tableau, Power BI, or Excel can help visualize data distributions and relationships interactively.
Understanding the data thoroughly sets the foundation for accurate analysis and cleaning.
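Most of the checks above can be scripted in a few lines of pandas. A minimal sketch, assuming an illustrative CSV file:
python
import pandas as pd

df = pd.read_csv('your_dataset.csv')    # illustrative path

print(df.shape)                          # rows and columns
print(df.dtypes)                         # data types per column
print(df.head())                         # sample the data
print(df.describe())                     # summary statistics for numeric columns
print(df.isnull().sum())                 # missing values per column
print(df.duplicated().sum())             # number of duplicate rows
print(df.corr(numeric_only=True))        # correlations (numeric_only requires pandas >= 1.5)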
To analyze the distribution of data and understand its central tendencies, spread, and possible skewness, several statistical and visual techniques can be
applied. Here's how it can be done step-by-step:
1. Summary Statistics
• Mean: The average value, which indicates the central tendency.
• Median: The middle value when the data is sorted, providing insight into the center of the distribution, especially when the data is skewed.
• Mode: The most frequent value, which may highlight common occurrences in the data.
• Range: The difference between the maximum and minimum values, providing an understanding of the data’s spread.
• Variance and Standard Deviation: These measures quantify the spread of the data around the mean. A high standard deviation indicates that the
data points are spread out over a large range of values.
python
import pandas as pd
# Example DataFrame
df = pd.DataFrame({'values': [10, 12, 13, 15, 18, 20, 21, 24, 28, 30]})
# Summary statistics
mean = df['values'].mean()
median = df['values'].median()
mode = df['values'].mode()
std_dev = df['values'].std()
skewness = df['values'].skew()
print(f"Mean: {mean}, Median: {median}, Mode: {mode[0]}, Std Dev: {std_dev}, Skewness: {skewness}")
2. Visualizing Distribution
• Histograms: Show the frequency of data points across bins. This helps visualize the spread and shape of the data, including skewness and modality.
• Box Plot (Box and Whisker Plot): Highlights the quartiles of the data, median, and potential outliers. It is excellent for detecting skewness and
spread.
• Density Plot: A smoothed version of the histogram, providing a continuous curve that can help in identifying the underlying distribution of the
data.
python
import matplotlib.pyplot as plt
import seaborn as sns
# Histogram
plt.hist(df['values'], bins=10, alpha=0.7, color='blue')
plt.title('Histogram of Values')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.show()
# Boxplot
sns.boxplot(data=df['values'])
plt.title('Boxplot of Values')
plt.show()
# Density Plot
sns.kdeplot(df['values'], fill=True)  # 'fill' replaces the deprecated 'shade' argument in recent seaborn versions
plt.title('Density Plot of Values')
plt.show()
3. Checking for Skewness
• Skewness: If the skewness value is close to 0, the data is fairly symmetric. A positive skew indicates a longer right tail, while a negative skew
indicates a longer left tail.
python
skewness = df['values'].skew()
print(f"Skewness: {skewness}")
• Positive skew means the right tail is longer, meaning a few high values are dragging the mean upwards.
• Negative skew means the left tail is longer, meaning a few low values are dragging the mean downwards.
4. Kurtosis
• Kurtosis indicates the "tailedness" of the data distribution. High kurtosis means more data is concentrated in the tails, while low kurtosis indicates
lighter tails.
python
kurtosis = df['values'].kurtosis()
print(f"Kurtosis: {kurtosis}")
5. Normality Tests
• Shapiro-Wilk Test: A statistical test that checks whether data follows a normal distribution.
• Kolmogorov-Smirnov Test: Another test to compare the sample distribution with a normal distribution.
python
from scipy import stats
shapiro_test = stats.shapiro(df['values'])
print(f"Shapiro-Wilk Test: {shapiro_test}")
These methods, combined, provide a comprehensive understanding of how your data is distributed.
Remove Duplicates
Thursday, October 17, 2024 3:24 PM
To remove duplicates in a dataset, you follow two main steps: identifying duplicate records and dropping duplicates. Here’s how to do this step by step
using Python and Pandas.
1. Identify Duplicate Records:
• Goal: You want to detect rows that are repeated in your dataset.
• Code Example:
python
import pandas as pd

# Load the dataset (the file path here is illustrative)
df = pd.read_csv('your_dataset.csv')

# Identify duplicate rows; duplicated() marks repeats of earlier rows as True
duplicates = df[df.duplicated()]
print(f"Number of duplicate rows: {len(duplicates)}")

# Drop duplicates, keeping the first occurrence of each duplicated row
df_cleaned = df.drop_duplicates()
Handle Missing Data
Thursday, October 17, 2024 3:25 PM
Deciding on the best technique to handle missing data depends on several factors, including the type, amount, and pattern of missingness, as well as the
impact on your analysis. Here's a step-by-step guide to help you determine the right approach:
1. Understand the Pattern of Missingness:
First, assess whether your missing data falls into one of these categories:
• MCAR (Missing Completely at Random): The missing data has no relationship with any variable. In this case, deletion or simple imputation works well.
• MAR (Missing at Random): The missingness is related to other observed variables but not the value of the missing data itself. For example, salary might
be missing for certain education levels.
• MNAR (Missing Not at Random): The missing data depends on the unobserved value itself. For instance, people with very high or low income might
intentionally not report it.
How to Check:
• Visualize the missing data pattern with a missingness matrix (using Python's missingno or seaborn's heatmap).
• Conduct Little's MCAR test (there is no built-in implementation in pandas or statsmodels; use R's naniar::mcar_test(), a third-party Python package, or the Chi-Square approach described later in these notes).
Based on this, you’ll have a better idea of how to proceed:
• MCAR: Deletion is often acceptable.
• MAR: Imputation is usually preferred.
• MNAR: More complex techniques like model-based imputation or even consulting subject-matter experts might be necessary.
2. Assess the Proportion of Missing Data:
• Small Amount of Missing Data (e.g., < 5%): Simple methods like mean/median imputation or listwise deletion are often sufficient.
• Moderate Amount of Missing Data (5–20%): Consider advanced imputation techniques (like k-NN imputation, regression imputation, or even multiple
imputation).
• Large Amount of Missing Data (>20%): Deletion may be harmful due to significant data loss. Use multiple imputation, model-based imputation, or
predictive modeling to handle the missing values effectively.
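A quick way to apply these bands is to compute the share of missing values per column; a small sketch on a hypothetical DataFrame:
python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'age': [25, np.nan, 40, 35, np.nan],
    'income': [50000, 60000, np.nan, np.nan, np.nan],
    'city': ['NY', 'LA', 'NY', None, 'LA']
})

# Percentage of missing values per column
missing_pct = df.isnull().mean() * 100
print(missing_pct.sort_values(ascending=False))

# Bucket columns by how much is missing, following the bands above
bands = pd.cut(missing_pct, bins=[-0.1, 5, 20, 100],
               labels=['<5%: simple methods', '5-20%: advanced imputation', '>20%: multiple imputation / modeling'])
print(bands)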
3. Understand the Type of Data:
• Numerical Data: Imputation techniques like mean, median, or regression imputation are good options.
• Categorical Data: Use mode imputation, dummy encoding for missing categories, or a machine learning model to predict the missing category.
• Time Series Data: You can use forward-fill or backward-fill if the missing data is sequential, or interpolation if there is a trend or pattern in the data.
4. Consider the Impact on Your Analysis:
• Predictive Modeling: If you are using machine learning models, certain algorithms can handle missing data internally (e.g., XGBoost and LightGBM; most scikit-learn estimators cannot).
For other models, imputation is often needed to prevent losing data. You should also evaluate whether you need to keep the "missingness" information
itself as a feature.
• Statistical Analysis: Deletion methods (especially listwise deletion) can bias results if data is MAR or MNAR. Imputation or more sophisticated
techniques should be used to avoid biased estimates.
5. Evaluate the Importance of Missing Data:
• If the column or rows with missing data are unimportant or non-essential to your analysis, deletion might be acceptable.
• If the missing data affects key variables or has important relationships with other variables, imputation or advanced techniques are necessary to
preserve the integrity of your analysis.
6. Available Resources and Tools:
• Computational Power: If you're dealing with a large dataset and limited computational resources, mean/median/mode imputation or simple deletion
might be more efficient. More advanced methods like multiple imputation and model-based methods require more time and computational power.
• Software Capabilities: Make sure the tools you're using support your chosen imputation method. For instance, Python’s pandas, sklearn, or fancyimpute
libraries offer several imputation strategies, while R provides packages like mice for multiple imputation.
7. Review the Final Model Performance:
After applying a technique to handle missing data, evaluate how it impacts your model or analysis:
• Compare model performance (e.g., accuracy, RMSE, R²) across different methods for handling missing data to ensure that the selected method
improves or at least maintains the overall performance.
• Check the validity of statistical tests after using imputation or deletion to ensure that your conclusions remain robust.
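One way to run such a comparison is to cross-validate the same model under different imputation strategies. The sketch below uses scikit-learn pipelines on a synthetic numeric dataset; the model, metric, and missingness rate are illustrative assumptions:
python
import numpy as np
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=['f1', 'f2', 'f3'])
y = X['f1'] * 2 + X['f2'] + rng.normal(scale=0.1, size=200)
X = X.mask(rng.random(X.shape) < 0.1)     # inject ~10% missing values

strategies = {
    'mean': SimpleImputer(strategy='mean'),
    'median': SimpleImputer(strategy='median'),
    'knn': KNNImputer(n_neighbors=5),
}
for name, imputer in strategies.items():
    pipe = make_pipeline(imputer, Ridge())
    scores = cross_val_score(pipe, X, y, cv=5, scoring='r2')
    print(f"{name}: mean R^2 = {scores.mean():.3f}")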
Common Techniques and When to Use Them:
• Listwise Deletion — When to use: MCAR or when missing data is minimal (<5%). Pros: easy to apply, no bias if MCAR. Cons: reduces sample size; potential for bias if data is MAR/MNAR.
• Pairwise Deletion — When to use: when you want to preserve some data and missing values only affect certain calculations. Pros: preserves more data. Cons: inconsistent sample sizes across analyses; potential bias if not MCAR.
• Mean/Median Imputation — When to use: MCAR or when missing values are few and the data is numerical. Pros: simple to implement, retains sample size. Cons: reduces variability; leads to biased estimates if data is not MCAR.
• Mode Imputation — When to use: categorical variables with minimal missing data. Pros: easy to implement. Cons: can distort categorical distributions.
• k-NN Imputation — When to use: MAR data with moderate missing values (<20%) where relationships between variables can be leveraged. Pros: captures complex relationships. Cons: computationally expensive; sensitive to the choice of k.
• Regression Imputation — When to use: MAR data with moderate missing values where relationships between variables can predict missing values. Pros: takes relationships between variables into account. Cons: can underestimate the variance of the data.
• Multiple Imputation — When to use: MAR data with moderate to high missing values (>20%). Pros: preserves variability, unbiased estimates. Cons: computationally expensive; requires multiple models; complexity increases with missing-data patterns.
• Forward/Backward Fill — When to use: sequential (time-series) data where missing values occur at the beginning or end of a series. Pros: simple and maintains trend continuity. Cons: only suitable for ordered data; not ideal if the data fluctuates significantly.
• Model-Based Imputation — When to use: MAR or MNAR data where relationships between variables can be modeled (e.g., decision trees, random forest, or XGBoost). Pros: can handle complex missing-data patterns and capture interactions. Cons: computationally expensive; requires proper model tuning.
The statement "Patterns of missingness" refers to analyzing the distribution of missing values within your dataset. Here's how this can be understood and
acted upon:
1. Identify Columns with Missing Values: First, look for columns (features) that contain missing data. You can either manually check this in tools like
Excel or programmatically identify them, for instance with pandas (df.isnull().sum()).
2. Evaluate the Extent of Missingness:
○ If a column has a large proportion of missing values, it could indicate:
▪ Issues with how the data was collected (e.g., faulty sensors, data entry errors).
▪ A feature that might not be relevant for the population being studied (leading to missing entries).
3. Patterns of Missingness: Missingness can follow different patterns, including:
○ MCAR (Missing Completely at Random): No pattern to the missing data, which may mean there are no deeper data collection issues.
○ MAR (Missing at Random): Missing values depend on another observed variable. For example, if missing salary data correlates with higher
levels of education.
○ MNAR (Missing Not at Random): The missingness is related to the value itself (e.g., people with low incomes may not disclose it).
4. Next Steps - Imputation:
○ If many missing values are detected, you might need to impute (fill in the missing values).
○ Common imputation techniques include:
▪ Mean/Median/Mode Imputation: For numerical data, use the mean or median of that column. For categorical data, use the most
frequent category (mode).
▪ Forward/Backward Fill: Fill missing values based on nearby values in a time-series dataset.
▪ Model-based Imputation: Predict the missing values using a machine learning model trained on the rest of the dataset.
By analyzing the patterns of missingness, you can make informed decisions on how to clean your data, leading to better model performance and more
accurate insights.
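To visualize these patterns, the missingno library mentioned earlier gives a quick overview. A brief sketch, assuming the package is installed (pip install missingno) and df is your DataFrame:
python
import missingno as msno
import matplotlib.pyplot as plt

msno.matrix(df)     # per-row pattern of missing values
plt.show()

msno.bar(df)        # count of non-missing values per column
plt.show()

msno.heatmap(df)    # correlation between the missingness of different columns
plt.show()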
Deletion is another common method for handling missing data, and it is typically used when imputation is either unnecessary or infeasible. Here are the
main types of deletion techniques and when you might use them:
1. Listwise Deletion (Complete Case Analysis)
• Description: This method involves removing any row that has one or more missing values.
• When to Use:
○ If the proportion of missing data is small (e.g., less than 5%).
○ When the missingness is MCAR (Missing Completely at Random), meaning there is no pattern or relationship between missing values and other
variables.
○ If you're willing to lose some data points but still maintain statistical validity.
• Drawbacks:
○ Reduces the sample size, which can affect the statistical power of your analysis.
○ If missing data is not MCAR, this method may introduce bias into your results.
2. Pairwise Deletion
• Description: Instead of removing entire rows, this method excludes missing values only when performing operations that require that specific data
point. For example, if you're calculating a correlation between two variables and one of them has missing values, those specific rows would be
ignored just for that calculation.
• When to Use:
○ When you want to preserve as much data as possible without sacrificing entire rows.
○ Suitable when you're running analyses like correlation or regression, and only a subset of the columns is being used at a time.
• Drawbacks:
○ May lead to inconsistent sample sizes across analyses, which can complicate interpretation.
○ Potential for bias if the data is not MCAR.
3. Column Deletion
• Description: Entire columns (features) are removed if they contain too many missing values.
• When to Use:
○ If a specific column has a very high percentage of missing data (e.g., >60%).
○ When the missing values are MNAR (Missing Not at Random), and you can't impute them accurately.
○ When the feature is not essential or contributes little to the overall analysis or model.
• Drawbacks:
○ You may lose potentially useful information if the missing values could have been imputed accurately.
Considerations for Deletion:
• Data Distribution: If deletion significantly reduces the amount of data, it can skew the results, especially if the data is not missing at random (MNAR).
• Impact on Model Performance: In predictive modeling, reducing the dataset size can affect your model’s ability to generalize. For small datasets,
deletion may not be an option as it would lead to underfitting.
• Preserving Variability: The more data you delete, the more likely you are to lose important variability and relationships between variables.
Best Practices:
• MCAR data: Deletion is often a valid strategy since it does not introduce bias.
• MAR or MNAR data: You should be cautious with deletion as it can lead to biased estimates. Imputation is often preferable in these cases.
In general, deletion is useful when missing data is limited or unimportant, but imputation tends to be a better strategy when the missing data is substantial
or the relationships in the data are complex.
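In pandas, the three deletion strategies above map onto a few calls. A brief sketch (the DataFrame and the 50% threshold are illustrative):
python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'age': [25, np.nan, 40, 35],
    'income': [50000, 60000, np.nan, 70000],
    'notes': [np.nan, np.nan, np.nan, 'ok']
})

# Listwise deletion: drop any row containing a missing value
df_listwise = df.dropna()

# Pairwise deletion: pandas does this implicitly for many statistics,
# e.g. corr() uses all rows available for each pair of columns
pairwise_corr = df[['age', 'income']].corr()

# Column deletion: keep only columns with at least 50% non-missing values
df_cols = df.dropna(axis=1, thresh=len(df) // 2)

print(df_listwise, pairwise_corr, df_cols, sep='\n\n')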
To test whether data is Missing Completely at Random (MCAR), you can use Little's MCAR test. This statistical test assesses whether the missingness in
the data is random or if it is related to the observed data.
Overview of Little's MCAR Test
• Null Hypothesis (H0): The data is MCAR (the missing data mechanism is random and does not depend on the observed data).
• Alternative Hypothesis (H1): The data is not MCAR (the missing data mechanism is related to the observed data).
Steps to Perform Little's MCAR Test
1. Create a Missingness Indicator: First, you need to create a binary variable that indicates whether each value is missing (1 for missing, 0 for
observed).
2. Conduct the Test: Use statistical software to compute the test statistic and p-value.
Availability in Python
Little's MCAR test is not shipped with pandas, statsmodels, or fancyimpute. In practice you can use R's naniar::mcar_test(), a third-party Python implementation, or fall back on the Chi-Square approach described in the next section. Whichever implementation you use, the decision rule on the resulting p-value is the same:
python
# p_value obtained from whichever Little's MCAR implementation you use
if p_value > 0.05:
    print("Fail to reject the null hypothesis: Data is MCAR.")
else:
    print("Reject the null hypothesis: Data is not MCAR.")
Interpretation of Results
• If the p-value is greater than 0.05: You fail to reject the null hypothesis, which suggests that the data is MCAR.
• If the p-value is less than or equal to 0.05: You reject the null hypothesis, indicating that the data is not MCAR.
Alternative Method: Chi-Square Test
In addition to Little's MCAR test, another common approach to test for MCAR involves using a Chi-Square test for independence. This method checks
whether the missingness is independent of observed data.
1. Create a contingency table: This table should cross-tabulate the observed data and the missingness indicators.
2. Perform the Chi-Square test: Use the Chi-Square test to see if there’s a significant association between the missingness and the observed data.
The Chi-Squared test is a statistical method used to determine whether there is a significant association between two categorical variables. In the
context of assessing missing data, you can use it to check if the missingness of a variable is related to other observed variables.
Steps to Perform the Chi-Squared Test for Missing Data
1. Create a Contingency Table: Cross-tabulate the observed data with the missingness indicator.
2. Perform the Chi-Squared Test: Use statistical software or libraries to perform the test.
3. Interpret the Results: Evaluate the p-value to determine if the variables are independent.
Example Code in Python
Below is a step-by-step example of how to perform the Chi-Squared test using Python with the pandas and scipy libraries.
Step 1: Install Required Libraries
If you haven't already installed pandas and scipy, you can do so using pip:
bash
pip install pandas scipy
Step 2: Create a Sample Dataset
Here’s an example of how you might set up your dataset:
python
import pandas as pd
import numpy as np
# Create a sample dataset
data = {
'education': ['High School', 'Bachelors', 'Masters', 'Bachelors', 'High School', 'Masters', 'PhD', 'PhD'],
'income': [50000, 60000, np.nan, 70000, np.nan, 80000, 90000, np.nan]
}
df = pd.DataFrame(data)
# Display the DataFrame
print(df)
Step 3: Create a Missingness Indicator
You need to create a binary column to indicate whether the income data is missing:
python
# Create a missingness indicator
df['income_missing'] = df['income'].isnull().astype(int) # 1 for missing, 0 for observed
# Display the updated DataFrame
print(df)
Step 4: Create the Contingency Table
Next, create a contingency table that shows the counts of missing and observed data for each education level:
python
# Create a contingency table
contingency_table = pd.crosstab(df['education'], df['income_missing'])
print(contingency_table)
The contingency table will have education levels as rows and the missingness indicator as columns.
Step 5: Perform the Chi-Squared Test
Now you can perform the Chi-Squared test using scipy:
python
from scipy.stats import chi2_contingency
# Perform the Chi-Squared test
chi2_stat, p_value, dof, expected = chi2_contingency(contingency_table)
# Output the results
print(f"Chi-Squared Statistic: {chi2_stat}")
print(f"p-value: {p_value}")
print(f"Degrees of Freedom: {dof}")
print("Expected Frequencies:\n", expected)
Step 6: Interpret the Results
Interpret the p-value to determine whether there is a significant relationship between education level and income missingness:
python
# Interpret results
alpha = 0.05  # Significance level
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant association between education level and income missingness.")
else:
    print("Fail to reject the null hypothesis: There is no significant association between education level and income missingness.")
Summary of Results
• Chi-Squared Statistic: A measure of how much the observed counts deviate from the expected counts.
• p-value: Indicates whether the relationship between the variables is statistically significant.
• Degrees of Freedom (dof): Calculated as (number of rows − 1) × (number of columns − 1).
• Expected Frequencies: The counts expected if the null hypothesis is true.
Conclusion
If the p-value is less than the significance level (commonly 0.05), you conclude that there is a significant association between the missingness of income
data and education level, suggesting that the data might not be MCAR. If the p-value is greater than 0.05, it indicates that the missingness is
independent of the education variable, supporting the MCAR hypothesis.
Summary
• Little's MCAR Test is the primary statistical test used to check if data is MCAR.
• The test compares the observed data patterns to see if the missingness is randomly distributed.
• A Chi-Square test can also be used as an alternative method to test for independence between missingness and observed data.
python
import pandas as pd
# Sample DataFrame
df = pd.DataFrame({
'age': [25, 30, None, 40, 35],
'income': [50000, None, 60000, None, 70000]
})
# Mean Imputation
df['age'] = df['age'].fillna(df['age'].mean())
# Median Imputation
df['income'] = df['income'].fillna(df['income'].median())
○ K-Nearest Neighbors (KNN) Imputation: KNN uses the feature values of the nearest observations to estimate missing values. It can be more
accurate than simple imputation methods.
Implementation Example (Python using sklearn library):
python
from sklearn.impute import KNNImputer
# Sample DataFrame
df = pd.DataFrame({
'age': [25, 30, None, 40, 35],
'income': [50000, None, 60000, None, 70000]
})
imputer = KNNImputer(n_neighbors=2)
imputed_data = imputer.fit_transform(df)
df_imputed = pd.DataFrame(imputed_data, columns=df.columns)
○ Regression Imputation: Use regression models to predict the missing values based on other variables in the dataset.
Implementation Example (Python):
python
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 -- enables IterativeImputer
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression
import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
    'age': [25, 30, None, 40, 35],
    'income': [50000, None, 60000, None, 70000]
})

# Regress each column with missing values on the other columns and predict the gaps
imputer = IterativeImputer(estimator=LinearRegression(), random_state=0)
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(df_imputed)
○ Multiple Imputation (MICE): Impute through chained equations, cycling over the variables; statsmodels provides an implementation.
Implementation Example (Python using statsmodels):
python
import statsmodels.api as sm
from statsmodels.imputation import mice
import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
    'age': [25, 30, None, 40, 35],
    'income': [50000, None, 60000, None, 70000]
})

# MICEData manages chained-equation imputation for the DataFrame
imp_data = mice.MICEData(df)
imp_data.update_all(10)        # run 10 imputation cycles
print(imp_data.data)           # DataFrame with imputed values
○ Model-Based Imputation: Predict missing values with a model trained on the observed rows (e.g., a random forest).
Implementation Example (Python using sklearn):
python
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

# Iteratively impute each column using a random forest fitted on the other columns
# (use RandomForestClassifier on the observed rows if the missing column is categorical)
rf_imputer = IterativeImputer(estimator=RandomForestRegressor(n_estimators=100, random_state=0))
df_imputed = pd.DataFrame(rf_imputer.fit_transform(df), columns=df.columns)
python
# Dropping rows with any missing values
df_dropped_rows = df.dropna()
7. Documentation: Maintain a clear record of how you handled missing data, including the imputation methods used and any assumptions made. This
practice enhances transparency and reproducibility in your analysis.
Example Scenario
Let's consider a practical scenario where we have a dataset about customers with missing income data that we suspect is MAR. Here’s a step-by-step approach:
1. Analyze the Missing Data:
○ Check the pattern of missingness. For example, you can use heatmaps to visualize missing data.
python
import seaborn as sns
import matplotlib.pyplot as plt
sns.heatmap(df.isnull(), cbar=False)
plt.show()
2. Choose an Imputation Method:
○ Suppose you find that income is missing for younger customers. You could use KNN imputation, leveraging age and perhaps other demographic
information to predict missing income values.
3. Implement KNN Imputation:
○ Use KNN to fill in missing income values.
python
from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=3)
imputed_data = imputer.fit_transform(df)
df_imputed = pd.DataFrame(imputed_data, columns=df.columns)
4. Model the Data:
○ After imputation, you can proceed to analyze or build models using the complete dataset.
5. Conduct Sensitivity Analysis:
○ Test different imputation methods (mean, median, KNN) and compare model performance metrics (e.g., accuracy, RMSE) to understand how your
choice of method impacts results.
6. Document Your Process:
○ Keep detailed records of your analysis, including imputation methods, results, and any potential limitations.
By following these steps and techniques, you can effectively handle missing data that is MAR, thereby improving the robustness of your data analysis and
modeling efforts.
Understanding MNAR
In MNAR situations, the reason for the missing data is related to the missing values themselves. This means that the data that is not observed would have
a different distribution if it were observed. This can lead to biased results if we don't properly account for the missing data.
Example Scenario
Imagine a clinical trial studying the effects of a new medication on blood pressure. Researchers collect data on participants' blood pressure readings
before and after the treatment. However, after a few weeks, some participants drop out of the study. The dropout rate is higher among participants who
experienced side effects from the medication.
• Observed Data: Blood pressure readings for participants who stayed in the study.
• Missing Data: Blood pressure readings for participants who dropped out due to side effects.
Why Is This MNAR?
The missingness is related to the value of the data that is missing:
• Participants who experienced side effects might have higher blood pressure readings than those who stayed in the study.
• If you only analyze the data from participants who completed the trial, you may overstate the medication's benefit and understate its adverse effects,
because you're missing data from those who experienced adverse reactions.
Consequences of MNAR
If the missing data is not handled appropriately, the analysis could lead to:
• Underestimation of adverse effects: The true side effects of the medication may be obscured.
• Bias in treatment effect: The perceived effectiveness of the medication may be overstated because you're not accounting for the participants who
had negative experiences.
Strategies to Address MNAR
1. Sensitivity Analysis
• Implementation: Conduct analyses under different assumptions about the missing data. For instance, you could analyze the data assuming that
dropouts had higher blood pressure than observed participants or the same.
• Example: If you find that the treatment appears effective under one assumption but not under another, it highlights the uncertainty caused by the
missing data.
2. Modeling the Missingness
• Implementation: Use a statistical model to understand and predict why data is missing. This could involve logistic regression to model dropout
based on observed characteristics.
• Example: You might create a model that predicts the likelihood of dropout based on baseline characteristics (e.g., age, initial blood pressure, side
effects) and include these predictors in your analysis.
3. Data Augmentation
• Implementation: Use methods like multiple imputation, where you create several datasets by filling in the missing values based on observed data.
• Example: You could generate multiple plausible blood pressure readings for the dropouts based on the relationship between other variables and
blood pressure.
4. Weighting
• Implementation: Assign weights to participants based on the likelihood of being observed. This adjusts the analysis to account for the bias
introduced by the missing data.
• Example: If participants who reported side effects had a 70% chance of dropping out, while those who did not had a 30% chance, you could weight
the remaining data accordingly to reduce bias.
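A rough sketch of how strategies 2 and 4 can be combined via inverse-probability weighting: model the probability that a participant's outcome is observed from baseline covariates, then weight the observed rows by the inverse of that probability. The column names, values, and model choice are assumptions for illustration only:
python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical trial data: 'observed' marks whether the follow-up reading exists
df = pd.DataFrame({
    'age': [50, 61, 45, 70, 55, 65, 48, 59],
    'baseline_bp': [140, 155, 130, 160, 145, 150, 135, 148],
    'side_effects': [0, 1, 0, 1, 0, 1, 0, 1],
    'observed': [1, 0, 1, 0, 1, 1, 1, 0],
    'followup_bp': [135, np.nan, 128, np.nan, 138, 142, 130, np.nan]
})

# Step 1: model the probability of being observed from baseline covariates
X = df[['age', 'baseline_bp', 'side_effects']]
prob_observed = LogisticRegression().fit(X, df['observed']).predict_proba(X)[:, 1]

# Step 2: weight observed participants by the inverse of that probability
mask = (df['observed'] == 1).to_numpy()
observed = df[mask].copy()
observed['weight'] = 1.0 / prob_observed[mask]

# Weighted mean follow-up blood pressure (a crude IPW estimate)
ipw_mean = np.average(observed['followup_bp'], weights=observed['weight'])
print(f"IPW-weighted mean follow-up BP: {ipw_mean:.1f}")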
5. Collect More Data
• Implementation: If possible, conduct follow-up surveys or collect additional data to understand why participants dropped out.
• Example: Conducting exit interviews with participants who dropped out could provide valuable insights into their reasons and the missing data.
Conclusion
MNAR data presents significant challenges in data analysis, as the missingness is related to the values of the missing data itself. It’s crucial to recognize
when data is MNAR and to apply appropriate strategies to mitigate bias. By doing so, researchers can make more accurate inferences and avoid misleading
conclusions based on incomplete data.
Standardize Data
Thursday, October 17, 2024 3:28 PM
To standardize data as outlined above, here's how you can achieve each of the tasks in Python using pandas, a widely used library for data manipulation. I'll
provide code snippets to handle each step.
1. Convert Data Types
You can use pandas.to_datetime() for date columns, pandas.to_numeric() for numeric columns, and astype() for categorical conversions.
python
import pandas as pd
# Sample DataFrame
data = {'date_col': ['2024-01-01', '2024/02/02'],
'numeric_col': ['10', '20'],
'categorical_col': [1, 2]}
df = pd.DataFrame(data)
# Convert date column
df['date_col'] = pd.to_datetime(df['date_col'], errors='coerce')
# Convert numeric column
df['numeric_col'] = pd.to_numeric(df['numeric_col'], errors='coerce')
# Convert categorical column
df['categorical_col'] = df['categorical_col'].astype('category')
print(df.dtypes)
2. Correct Capitalization
To ensure consistent capitalization, you can apply string methods like .str.lower() or .str.upper() to convert all text to lowercase or uppercase.
python
# Sample DataFrame
df = pd.DataFrame({'text_col': ['John Doe', 'jane DOE', 'JAMES Smith']})
# Convert text to lowercase
df['text_col'] = df['text_col'].str.lower()
print(df)
3. Remove Leading/Trailing Spaces
You can use .str.strip() to remove extra spaces from the start and end of string fields.
python
# Sample DataFrame
df = pd.DataFrame({'text_col': [' John ', 'jane ', ' James ']})
# Remove leading and trailing spaces
df['text_col'] = df['text_col'].str.strip()
print(df)
4. Unify Formats (e.g., Dates, Phone Numbers)
For date formats, you can standardize them using strftime() after converting to datetime. For phone numbers, you can use regex or formatting libraries like
phonenumbers to clean and format them.
Example: Standardizing Date Formats
python
# Sample DataFrame
df = pd.DataFrame({'date_col': ['01/01/2024', '2024-02-02']})
# Convert to datetime and unify format to 'YYYY-MM-DD'
df['date_col'] = pd.to_datetime(df['date_col'], errors='coerce').dt.strftime('%Y-%m-%d')
print(df)
Example: Standardizing Phone Numbers (Using Regex)
python
import re
# Sample DataFrame
df = pd.DataFrame({'phone_col': ['(123) 456-7890', '123.456.7890', '+1-123-456-7890']})
# Function to clean phone numbers
def clean_phone(phone):
    return re.sub(r'\D', '', phone)  # Remove non-numeric characters
df['phone_col'] = df['phone_col'].apply(clean_phone)
print(df)
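If you need properly formatted numbers rather than just stripped digits, the phonenumbers library mentioned above can parse and re-format them. A brief sketch, assuming US numbers and that the package is installed (pip install phonenumbers):
python
import pandas as pd
import phonenumbers

df = pd.DataFrame({'phone_col': ['(123) 456-7890', '123.456.7890', '+1-123-456-7890']})

def to_e164(raw, region='US'):
    # Parse the raw string and re-emit it in E.164 format, e.g. '+11234567890'
    parsed = phonenumbers.parse(raw, region)
    return phonenumbers.format_number(parsed, phonenumbers.PhoneNumberFormat.E164)

df['phone_col'] = df['phone_col'].apply(to_e164)
print(df)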
Combining Everything into One Function
Here’s a comprehensive function that standardizes data for a given pandas DataFrame.
python
def standardize_data(df):
    # Convert data types
    df['date_col'] = pd.to_datetime(df['date_col'], errors='coerce')
    df['numeric_col'] = pd.to_numeric(df['numeric_col'], errors='coerce')
    df['categorical_col'] = df['categorical_col'].astype('category')
    # Correct capitalization
    df['text_col'] = df['text_col'].str.lower()
    # Remove leading/trailing spaces
    df['text_col'] = df['text_col'].str.strip()
    # Standardize date format
    df['date_col'] = df['date_col'].dt.strftime('%Y-%m-%d')
    # Standardize phone numbers
    df['phone_col'] = df['phone_col'].apply(lambda x: re.sub(r'\D', '', x))
    return df
# Example usage
df = pd.DataFrame({
'date_col': ['01/01/2024', '2024-02-02'],
'numeric_col': ['10', '20'],
'categorical_col': [1, 2],
'text_col': [' John DOE ', 'jane doe '],
'phone_col': ['(123) 456-7890', '123.456.7890']
})
df = standardize_data(df)
print(df)
This function handles all the standardization tasks in one place, making it easier to clean and prepare your data for further analysis or modeling.
Handle Outliers
Thursday, October 17, 2024 3:29 PM
1. Using Z-Score Method:
The Z-score measures how many standard deviations a data point lies from the mean; points whose absolute Z-score exceeds a threshold (commonly 3) are flagged as outliers.
Code:
python
import pandas as pd
import numpy as np
from scipy import stats
# Sample Data
data = {'value': [10, 12, 12, 15, 14, 13, 100, 12, 11, 10, 13, 12, 11, 9, 12]}
df = pd.DataFrame(data)
# Z-score calculation
df['z_score'] = np.abs(stats.zscore(df['value']))
# Threshold for Z-score (commonly 3)
threshold = 3
df['outlier'] = df['z_score'] > threshold
# Displaying the results
print(df)
Explanation:
• stats.zscore() calculates the Z-score for each data point.
• We take the absolute value of Z-scores to focus on the magnitude.
• Any Z-score greater than the threshold (3 in this case) is flagged as an outlier.
2. Using IQR Method:
The IQR method identifies outliers by checking if data points are outside the range of 1.5 * IQR from the 25th and 75th percentiles.
Code:
python
# Sample Data
data = {'value': [10, 12, 12, 15, 14, 13, 100, 12, 11, 10, 13, 12, 11, 9, 12]}
df = pd.DataFrame(data)
# Calculate Q1 (25th percentile) and Q3 (75th percentile)
Q1 = df['value'].quantile(0.25)
Q3 = df['value'].quantile(0.75)
# Calculate IQR (Interquartile Range)
IQR = Q3 - Q1
# Define bounds for outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# Flag outliers
df['outlier'] = (df['value'] < lower_bound) | (df['value'] > upper_bound)
# Display the results
print(df)
Explanation:
• quantile(0.25) and quantile(0.75) calculate the 25th and 75th percentiles.
• The IQR is calculated as the difference between Q3 and Q1.
• Data points outside the bounds (Q1 - 1.5 * IQR or Q3 + 1.5 * IQR) are flagged as outliers.
3. Handling Outliers:
After detecting outliers, you can choose to:
1. Remove them: Simply drop rows where the outlier flag is True.
2. Cap them: Replace outliers with the upper or lower bounds to reduce their impact.
Code for Removing Outliers:
python
# Remove outliers
df_cleaned = df[~df['outlier']]
print(df_cleaned)
Code for Capping Outliers:
python
# Cap outliers
df['value'] = np.where(df['value'] < lower_bound, lower_bound, df['value'])
df['value'] = np.where(df['value'] > upper_bound, upper_bound, df['value'])
print(df)
Conclusion:
• Z-Score Method is useful for normally distributed data, flagging data points that deviate significantly from the mean.
• IQR Method works well for skewed data as it uses percentiles, so it isn't affected by extreme values like the mean or standard deviation might be.
Choosing the right technique to detect and handle outliers depends on the characteristics of your data and the specific context of your analysis. Here’s a
guide on how to determine which method to use:
1. Z-Score Method:
When to Use:
• Normal Distribution (Gaussian): Z-score works well when the data follows a normal (or near-normal) distribution. Since Z-scores are based on the
mean and standard deviation, they rely on data symmetry.
• Small or Moderate Dataset: Z-scores work better when you have a reasonably sized dataset and the assumption of normality holds.
• Outliers that are distant from the mean: Z-score flags extreme deviations from the mean as outliers.
How to Check:
• Histogram: Plot the data using a histogram or density plot to see if it follows a bell-shaped curve.
• QQ Plot: A Quantile-Quantile plot helps visualize how your data compares to a normal distribution.
• Shapiro-Wilk Test: A statistical test for normality. If the p-value is greater than 0.05, you cannot reject normality, so the data is consistent with a normal distribution.
Use if: Data looks normally distributed with no extreme skewness or heavy tails.
Example:
python
from scipy.stats import shapiro
stat, p_value = shapiro(df['value'])
print(p_value) # If > 0.05, the data is likely normally distributed
2. IQR Method (Boxplot Method):
When to Use:
• Skewed Distribution: IQR is robust to outliers and works well for data that does not follow a normal distribution, such as heavily skewed data.
• Non-parametric datasets: If your dataset is ordinal or non-parametric, the IQR method is preferable as it does not rely on the mean or standard
deviation, which can be skewed by outliers.
• Resistant to extreme values: IQR is not affected by extreme outliers and uses percentiles, making it more stable when you expect skewed or heavy-
tailed data.
How to Check:
• Boxplot: Create a boxplot to see how the data is distributed and spot potential outliers. A long whisker on one side indicates skewness.
• Skewness/Kurtosis: Check for skewness and kurtosis. High positive skewness indicates a long tail on the right, while high negative skewness
indicates a long tail on the left.
Use if: The data is skewed or not normally distributed.
Example:
python
import seaborn as sns
import matplotlib.pyplot as plt
sns.boxplot(df['value'])
plt.show()
# Checking for skewness
print(df['value'].skew()) # If > 1 or < -1, the data is skewed
3. Considerations for Choosing:
Size of Dataset:
• Small Dataset: Z-scores may work better since extreme deviations from the mean can be easily identified. But for highly skewed small datasets, the
IQR method is more appropriate.
• Large Dataset: IQR is often more robust in large datasets because it doesn’t assume normality.
Domain Knowledge:
• Sometimes, the choice of method depends on what kind of outliers you expect based on your domain. For example:
○ In finance: Sudden spikes in transaction amounts may not follow a normal distribution, so IQR would work better.
○ In sensor data: If sensor measurements are supposed to follow a normal pattern, Z-scores can be effective.
Summary of Key Points:
• Z-Score — Use when: data is approximately normally distributed. Key considerations: relies on the mean and standard deviation, which are themselves sensitive to outliers. Pros: simple to implement and understand. Cons: unreliable for skewed data.
• IQR — Use when: data is skewed or not normally distributed. Key considerations: based on percentiles, so it is more robust to extreme values. Pros: stable, non-parametric method. Cons: less sensitive to moderate outliers.
When in Doubt:
If you're unsure about the distribution:
• Start by checking the data’s distribution using a histogram or boxplot.
• If the data is normally distributed, Z-score is a good starting point.
• If the data is skewed or contains extreme values, IQR is more appropriate.
Conclusion:
• Z-score is great for normally distributed data with no heavy skewness.
• IQR works well for skewed, non-normally distributed, or heavy-tailed datasets.
Fix Structural Errors
Thursday, October 17, 2024 3:30 PM
To fix structural errors like typos, misplaced values, and inconsistent naming conventions in a dataset, here’s a step-by-step process for identifying and correcting
these issues:
1. Identifying Inconsistencies
• Typos: Use basic string similarity techniques or regex to identify typos. For example, fuzzy matching methods like Levenshtein distance can be used to detect
strings with minor variations (e.g., "Ney York" vs. "New York").
• Misplaced Values: Validate values based on expected data types and ranges. For example, dates should be in a specific format (YYYY-MM-DD), and numeric
fields should not have alphabetic characters.
• Inconsistent Naming Conventions: Identify cases where the same entity is referred to differently (e.g., “NY” vs. “New York”). You can group similar entities
using:
○ Value Counts: This helps to spot inconsistencies in the names by identifying uncommon or rare variations.
○ Category Lists: For predefined categories, validate each entry against a list (e.g., ["NY", "New York"]).
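A small sketch of the Value Counts and Category Lists checks in pandas (the city column and the allowed list are hypothetical):
python
import pandas as pd

df = pd.DataFrame({'city': ['New York', 'NY', 'new york', 'Los Angeles', 'LA', 'Ney York']})

# Value counts: rare variants often point to typos or inconsistent naming
print(df['city'].value_counts())

# Category list: flag entries that are not in the approved list of values
allowed = {'New York', 'Los Angeles'}
invalid = df[~df['city'].isin(allowed)]
print("Entries needing correction:")
print(invalid)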
2. Correcting Errors
• Typos:
○ Use fuzzy matching techniques like fuzzywuzzy in Python to identify and fix typos automatically by replacing strings with the closest match.
python
from fuzzywuzzy import process

# Canonical values to match against (the list and the 'city' column are illustrative)
valid_cities = ['New York', 'Los Angeles']

# Replace each value with its closest match, e.g. 'Ney York' -> 'New York'
df['city'] = df['city'].apply(lambda v: process.extractOne(v, valid_cities)[0])
• Misplaced Values:
○ Coerce columns to their expected types so that invalid entries surface as NaT/NaN and can be reviewed or corrected.
python
# Convert a column to datetime and handle errors
df['date_column'] = pd.to_datetime(df['date_column'], errors='coerce')
• Inconsistent Naming Conventions:
○ Map known variants to a single canonical value.
python
# Mapping for correcting inconsistent values
name_corrections = {"NY": "New York", "LA": "Los Angeles"}

# Apply the mapping (the 'city' column name is illustrative)
df['city'] = df['city'].replace(name_corrections)
Normalize/Scale Data
Thursday, October 17, 2024 3:31 PM
Normalization and scaling of data are crucial steps in the preprocessing pipeline for machine learning and data analysis. Let's break down the main techniques:
1. Normalization
Normalization adjusts numerical features so that they fall within a specific range or follow a certain distribution. This is necessary for algorithms sensitive to the
scale of input data (e.g., k-NN, SVM, neural networks). The two most common techniques for normalization are:
• Min-Max Scaling: Rescales features to a range between 0 and 1 (or any arbitrary range).
When to use: This is useful when you want to preserve the relationships of the data while bringing them into a specific range (e.g., [0,1] or [-1,1]). It's
especially beneficial for models that depend on distance measures like k-Nearest Neighbors (k-NN) and clustering algorithms.
Formula:
X' = (X − X_min) / (X_max − X_min)
Code Example:
python
from sklearn.preprocessing import MinMaxScaler
import pandas as pd
# Sample DataFrame
data = {'feature1': [20, 30, 40, 50, 60],
'feature2': [1, 2, 3, 4, 5]}
df = pd.DataFrame(data)
# Initialize MinMaxScaler
scaler = MinMaxScaler()
# Apply to DataFrame
scaled_df = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print(scaled_df)
• Z-score Normalization (Standardization): This centers the data around a mean of 0 with a standard deviation of 1.
When to use: Standardization is useful when features are on different scales and you want values expressed as standard deviations from the mean; unlike min-max scaling it does not bound values to a fixed range, and it remains influenced by outliers. It is the usual preprocessing choice for many machine
learning algorithms like logistic regression, SVM, and k-Means.
Formula:
X' = (X − μ) / σ
where μ is the mean and σ is the standard deviation of the feature.
Code Example:
python
from sklearn.preprocessing import StandardScaler
# Initialize StandardScaler
scaler = StandardScaler()
# Apply to DataFrame
standardized_df = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print(standardized_df)
2. Log Transformations
Logarithmic transformations are used to handle skewed data, making it more normally distributed. Many machine learning models perform better with normally
distributed data because they assume a Gaussian-like distribution (e.g., linear regression).
• When to use: When your data has a long tail or when the variance increases as the values increase, indicating skewness. For example, income data or sales
data may have a long right tail where a few individuals/companies make significantly more money than the rest.
• Formula:
X' = log(X)
Note: Before applying, you may need to add a small constant to avoid the issue of taking the log of zero values, e.g., X' = log(X + 1).
Code Example:
python
import numpy as np

# Apply the log transform; log1p computes log(1 + x), which is safe when values include 0
df['feature1_log'] = np.log1p(df['feature1'])
print(df)
Choosing the Technique:
• Min-Max Scaling: Best when features have a known range, or algorithms sensitive to distances like k-NN, clustering (K-means).
• Z-score Normalization: Ideal when your data doesn't have a bounded range and you want to compare data points in terms of standard deviations (for
algorithms like SVM, Logistic Regression).
• Log Transformation: Use when your data is highly skewed, as it compresses large values and stretches small ones.
Convert Categorical Data
Thursday, October 17, 2024 3:31 PM
Converting categorical data into numerical formats is essential when preparing data for machine learning algorithms, which often require numerical inputs. There
are several techniques to do this, such as label encoding and one-hot encoding. I'll explain each method, when to use it, and provide code examples.
1. Label Encoding
Label Encoding converts categorical values into integer values (0, 1, 2, etc.). It's suitable when the categorical variable is ordinal—where the categories have an
inherent order.
Example:
python
from sklearn.preprocessing import LabelEncoder
import pandas as pd
# Example data
data = {'Category': ['Low', 'Medium', 'High', 'Medium', 'Low']}
df = pd.DataFrame(data)
# Applying Label Encoding
label_encoder = LabelEncoder()
df['Category_LabelEncoded'] = label_encoder.fit_transform(df['Category'])
print(df)
Output:
Category Category_LabelEncoded
0 Low 1
1 Medium 2
2 High 0
3 Medium 2
4 Low 1
When to Use:
• Ordinal data: Use label encoding when the categories have a natural rank (e.g., "Low", "Medium", "High").
• Caution: Label encoding can sometimes introduce a sense of order where none exists, which might confuse machine learning models.
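If you want to control the ordering yourself (LabelEncoder assigns codes alphabetically, which is why High became 0 above), one option is an ordered pandas Categorical; a small sketch continuing the df from the example above:
python
# Encode with an explicit order: Low < Medium < High
order = ['Low', 'Medium', 'High']
df['Category_Ordinal'] = pd.Categorical(df['Category'], categories=order, ordered=True).codes
print(df)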
2. One-Hot Encoding
One-hot encoding converts categories into a binary matrix, where each category is represented as a new column, and the values are either 0 or 1. It’s suitable
when there is no ordinal relationship between categories (i.e., the categories are nominal).
Example:
python
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
# Example data
data = {'Category': ['Red', 'Blue', 'Green', 'Blue', 'Red']}
df = pd.DataFrame(data)
# Applying One-Hot Encoding
one_hot_encoder = OneHotEncoder(sparse_output=False)  # on scikit-learn versions before 1.2, use sparse=False instead
one_hot_encoded = one_hot_encoder.fit_transform(df[['Category']])
# Converting the result to a DataFrame
one_hot_df = pd.DataFrame(one_hot_encoded, columns=one_hot_encoder.get_feature_names_out())
# Concatenating the one-hot encoded columns with the original data
df = pd.concat([df, one_hot_df], axis=1)
print(df)
Output:
Category Category_Blue Category_Green Category_Red
0 Red 0.0 0.0 1.0
1 Blue 1.0 0.0 0.0
2 Green 0.0 1.0 0.0
3 Blue 1.0 0.0 0.0
4 Red 0.0 0.0 1.0
When to Use:
• Nominal data: Use one-hot encoding when the categories have no inherent order (e.g., colors: "Red", "Blue", "Green").
• Caution: One-hot encoding can create many additional columns if the number of categories is large, which can lead to the "curse of dimensionality" and
potentially slow down model training.
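As an alternative to scikit-learn's OneHotEncoder, pandas offers a convenient one-liner for quick analysis; a small sketch on the same kind of data:
python
import pandas as pd

df = pd.DataFrame({'Category': ['Red', 'Blue', 'Green', 'Blue', 'Red']})

# pandas one-hot encoding in a single call
one_hot_df = pd.get_dummies(df, columns=['Category'], prefix='Category')
print(one_hot_df)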
Validate Data
Thursday, October 17, 2024 3:32 PM
To validate data, check consistency against known rules, re-check for missing data and duplicates, and verify values against expected limits. The same checks can also be applied with a Dask
DataFrame if you are working with large datasets.
1. Check Data Consistency:
Verify Dates within Reasonable Ranges
Assume you have a dataset with a Date column, and you want to ensure that the dates are within a reasonable range (for example, between 2000-01-01 and
2023-12-31):
python
import pandas as pd
# Sample data
df = pd.DataFrame({
'date': pd.to_datetime(['2022-05-10', '2001-03-15', '1999-12-31', '2023-08-21']),
'value': [100, 200, 150, 80]
})
# Define the date range
start_date = pd.Timestamp('2000-01-01')
end_date = pd.Timestamp('2023-12-31')
# Filter out dates outside the range
invalid_dates = df[(df['date'] < start_date) | (df['date'] > end_date)]
if not invalid_dates.empty:
    print("Invalid dates found:")
    print(invalid_dates)
else:
    print("All dates are within the valid range.")
Check Numeric Values within Expected Limits
If you have numeric data, you can set a minimum and maximum threshold and validate whether all values fall within that range.
python
# Assume 'value' column must be between 50 and 250
min_value = 50
max_value = 250
# Find values outside the valid range
invalid_values = df[(df['value'] < min_value) | (df['value'] > max_value)]
if not invalid_values.empty:
    print("Invalid numeric values found:")
    print(invalid_values)
else:
    print("All numeric values are within the valid range.")
2. Re-check for Missing Data and Duplicates
Check for Missing Data
You can check for missing values in the entire DataFrame or in specific columns:
python
# Check for missing values in the DataFrame
missing_data = df.isnull().sum()
if missing_data.any():
    print("Columns with missing data:")
    print(missing_data[missing_data > 0])
else:
    print("No missing data found.")
Check for Duplicates
To find duplicate rows:
python
# Check for duplicate rows
duplicates = df[df.duplicated()]
if not duplicates.empty:
    print("Duplicate rows found:")
    print(duplicates)
else:
    print("No duplicate rows found.")
You can also remove duplicates using:
python
# Remove duplicates
df_cleaned = df.drop_duplicates()
Using Dask for Large Data
If you are working with large datasets in Dask, the code is similar, but with Dask DataFrame methods (dd). Here's how you would modify the missing data check and
duplicate handling:
python
import dask.dataframe as dd
# Load Dask DataFrame
df = dd.read_csv('your_large_data.csv')
# Check for missing data in Dask
missing_data = df.isnull().sum().compute()
if missing_data.any():
    print("Columns with missing data:")
    print(missing_data[missing_data > 0])
else:
    print("No missing data found.")
# Check for duplicates in Dask by comparing row counts before and after drop_duplicates()
n_total = len(df)
n_unique = len(df.drop_duplicates())
if n_total > n_unique:
    print(f"Duplicate rows found: {n_total - n_unique}")
else:
    print("No duplicate rows found.")
This process will help ensure data quality by validating consistency, missing data, and duplicates.
Anomalies Detection
Thursday, October 17, 2024 3:35 PM