
Course: DS5002
Data Science Tools and Techniques
Data Preprocessing
Dr. Safdar Ali

Explore and discuss the process of data cleaning, with an understanding of its importance, common challenges, and effective techniques, along with data transformation.
Example:
Based on various market surveys, a consulting firm has gathered a large dataset of different types of used cars across the market.
Data Dictionary:
1. Sales_ID (Sales ID)
2. name (Name of the used car)
3. year (Year of the car purchase)
4. selling_price (Current selling price for the used car)
5. km_driven (Total km driven)
6. Region (Region where it is used)
7. State or Province (State or Province where it is used)
8. City (City where it is used)
9. fuel (Fuel type)
10. seller_type (Who is selling the car)
11. transmission (Transmission type of the car)
12. owner (Owner type)
13. mileage (Mileage of the car)
14. engine (Engine power)
15. max_power (Max power)
16. seats (Number of seats)
17. sold (Used car sold or not)
https://www.kaggle.com/datasets/shubham1kumar/usedcar-data
Example data

Problems in data
• Missing values
• Mixed data: e.g., in the 1st column, car_name is combined with the company name; in the 2nd column, car_price amounts are mixed with the unit "Lakh"; in the last column, the date is in unstructured form.
Using Python - Advantages
• Simple syntax that is easy to understand and reasonably fast to prototype
• Libraries designed for specific data science tasks
• Provides a good ecosystem of libraries that are robust and varied
• Links well with the majority of cloud platform service providers
• Tight-knit integration with big data frameworks such as Hadoop, Spark, etc.
• Supports both object-oriented and functional programming paradigms
• Supports reading files from local storage, databases, and the cloud
Data Science using Python
• Python libraries provide key feature sets which are essential for data science
• For this, necessary knowledge of:
  – Python and the following powerful, basic modules or libraries for data analysis and visualization:
    • Pandas (for data manipulation and cleaning)
    • Matplotlib (for general-purpose plotting)
    • Seaborn (builds on Matplotlib for advanced statistical visualizations)
    • NumPy (for numerical Python)
  – Machine learning libraries like 'scikit-learn' ('sklearn') offer a bouquet of learning algorithms

To list the modules and functions within a library:

import numpy
content = dir(numpy)  # names defined in the numpy namespace
print(content)
Pandas
• This module is employed for data manipulation and analysis.
• Easy to work with, and it gives data structures like:
  – Series (1D = a single column); series = pd.Series()
  – DataFrame (2D = a collection of columns; provides merging, joining, and reshaping of data); df = pd.DataFrame(), where df stands for "DataFrame"
  – Both handle large datasets.
• General practice for:
  – Cleaning, filtering, and transforming data.
  – Handling missing data and combining datasets.
  – Analyzing time series and statistics.
• Example: use it to read data from CSV files for cleaning/analysis. The .csv file extension stands for "comma-separated values" file, one of the most common outputs of any spreadsheet program.
https://flatfile.com/demo/
Example: Series (1D) and DataFrame (2D)

Series (1D):

import pandas as pd
data = [100, 200, 300, 400]
series = pd.Series(data, index=['A', 'B', 'C', 'D'])
print(series)

Output:
A    100
B    200
C    300
D    400
dtype: int64

DataFrame (2D):

data = {
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "Salary": [50000, 60000, 70000]
}
df = pd.DataFrame(data)
print(df)

Output:
      Name  Age  Salary
0    Alice   25   50000
1      Bob   30   60000
2  Charlie   35   70000
Matplotlib
• A plotting module used for creating static, animated, and interactive visualizations
• General practice for:
  – Plotting line graphs, histograms, bar charts, scatter plots, etc.
  – Customizing plots with titles, labels, legends, and other annotations.
• Example: for a given dataset, use it to visualize trends over time with line charts or bar charts (see the sketch below).
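As a quick sketch of a trend-over-time line chart (the years and price values below are invented purely for illustration):

import matplotlib.pyplot as plt

# Hypothetical yearly average selling prices (illustrative values only)
years = [2018, 2019, 2020, 2021, 2022]
avg_price = [450000, 470000, 430000, 510000, 560000]

plt.plot(years, avg_price, marker='o', label='avg_price')  # line chart with point markers
plt.title("Average Selling Price by Year")                 # title annotation
plt.xlabel("Year")                                         # axis labels
plt.ylabel("Average selling price")
plt.legend()                                               # uses the label given to plot()
plt.show()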
Seaborn
• A higher-level plotting interface built on Matplotlib, used for making attractive and informative statistical graphics by simplifying complex visualizations.
• General practice for:
  – Making more sophisticated plots like heatmaps, violin plots (a combination of box and density plots), pair plots, etc.
  – Adding statistical features like regression lines, correlation coefficients, and distributions.
• Example: use it to create a correlation heatmap or to visualize the distribution of data, e.g. with:
seaborn.heatmap()
seaborn.violinplot()
seaborn.pairplot()
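A minimal correlation-heatmap sketch, assuming a small numeric DataFrame (the columns and values here are invented for illustration):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Illustrative numeric columns (not real used-car data)
df = pd.DataFrame({
    "selling_price": [450000, 300000, 700000, 250000, 520000],
    "km_driven": [40000, 90000, 15000, 120000, 35000],
    "year": [2018, 2015, 2021, 2013, 2019],
})

corr = df.corr()                                # pairwise correlation matrix
sns.heatmap(corr, annot=True, cmap="coolwarm")  # annotate each cell with its coefficient
plt.title("Correlation Heatmap")
plt.show()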
NumPy (Numerical Python)
• A powerful Python library used for numerical computing.
• Supports data structures and operations such as:
  – Large, multi-dimensional arrays and matrices,
  – Mathematical functions (linear algebra, statistics, random number generation, etc.)

Installing: pip install numpy

import numpy as np
# Creating a 1D array
arr1 = np.array([1, 2, 3, 4, 5])
print(arr1)
# Creating a 2D array
arr2 = np.array([[1, 2, 3], [4, 5, 6]])
print(arr2)
Real-world sample employee salary dataset-1

Index  Empl_ID  Name     Depart   Age   Salary    Joining_Date
0      101      Alice    HR       25.0  50000.0   2020-01-15
1      102      Bob      IT       30.0  60000.0   2018-06-23
2      103      Charlie  Finance  NaN   70000.0   2017-08-19
3      104      David    IT       40.0  NaN       2015-09-10
4      105      Eve      HR       35.0  65000.0   2019-12-11
5      106      NaN      Finance  28.0  72000.0   2021-07-01
6      107      Grace    IT       NaN   55000.0   2016-05-14
Tasks to perform in Python
Using dataset-1, perform the following operations in Python:
• Load the sample employee salary dataset
• Handle missing values (fill missing ages & salaries, remove rows with missing names)
• Filter data (employees with salary > 60K, IT employees above 30)
• Transform data (add "Years of Experience", increase salary by 10%)
• Merge datasets (add a Bonus column from another dataset)
• Sort & group data (sort by salary, group by department)
Creating and displaying a sample employee dataset

import pandas as pd
import numpy as np

# Creating a sample employee dataset
data = {
    "EmployeeID": [101, 102, 103, 104, 105, 106, 107],
    "Name": ["Alice", "Bob", "Charlie", "David", "Eve", np.nan, "Grace"],
    "Department": ["HR", "IT", "Finance", "IT", "HR", "Finance", "IT"],
    "Age": [25, 30, np.nan, 40, 35, 28, np.nan],
    "Salary": [50000, 60000, 70000, np.nan, 65000, 72000, 55000],
    "Joining_Date": ["2020-01-15", "2018-06-23", "2017-08-19", "2015-09-10",
                     "2019-12-11", "2021-07-01", "2016-05-14"]
}

# Convert to DataFrame
df = pd.DataFrame(data)

# Convert Joining_Date to datetime
df["Joining_Date"] = pd.to_datetime(df["Joining_Date"])

# Display the dataset
print(df)

Load existing sample data

import pandas as pd
# Load DataFrame from a CSV file
df = pd.read_csv("path/to/your/folder/data.csv")
# Display the first 5 rows
print(df.head())

File Format  Method
CSV          pd.read_csv("file.csv")
Excel        pd.read_excel("file.xlsx")
JSON         pd.read_json("file.json")
Pickle       pd.read_pickle("file.pkl")

JSON: JavaScript Object Notation
Cleaning Data - Pandas
• Removing Duplicates
df.drop_duplicates(inplace=True)
• Renaming Columns
df.rename(columns={"OldColumn": "NewColumn"}, inplace=True)
• Changing Data Types
df["Age"] = df["Age"].astype(int)  # Convert to integer
df["Date"] = pd.to_datetime(df["Date"])  # Convert to datetime*
• Stripping Whitespace from Column Names
df.columns = df.columns.str.strip()  # Remove spaces from column names

*class datetime.date: an idealized naive date, assuming the current Gregorian calendar always was, and always will be, in effect. Attributes: year, month, and day.
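A minimal sketch tying these steps together on the sample employee dataset created earlier (note: astype(int) fails while the column still holds NaN, so missing ages are filled first; using the column mean as the fill value is an assumption for illustration):

# Assumes df is the sample employee dataset from the earlier slide
df = df.drop_duplicates()                       # remove exact duplicate rows
df.columns = df.columns.str.strip()             # strip stray spaces from column names
df["Age"] = df["Age"].fillna(df["Age"].mean())  # fill NaN ages before converting
df["Age"] = df["Age"].astype(int)               # integer conversion is now safe
print(df.dtypes)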
Handling Missing Data (NaN values)
• Checking for Missing Values
df.isnull().sum()  # Count missing values per column
• Removing Rows with Missing Data
df.dropna(inplace=True)  # Drop rows with NaN values
• Filling Missing Values
df.fillna(0, inplace=True)  # Replace NaN with 0
df["Salary"] = df["Salary"].fillna(df["Salary"].mean())  # Replace with column mean
print(df)
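On the sample employee dataset created earlier, the per-column missing-value counts come out as follows (output shown as comments):

print(df.isnull().sum())
# EmployeeID      0
# Name            1
# Department      0
# Age             2
# Salary          1
# Joining_Date    0
# dtype: int64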
Boxplot (5-number statistic)
A box-and-whisker plot is a graphical representation of the distribution of a dataset:
• Minimum – the smallest data point, excluding outliers.
• First Quartile (Q1) – 25th percentile (middle of the lower half of the data).
• Median (Q2) – 50th percentile (middle value of the dataset).
• Third Quartile (Q3) – 75th percentile (middle of the upper half of the data).
• Maximum – the largest data point, excluding outliers.
Boxplot (5-number statistic)
• Skewness: if the median sits closer to Q1 or Q3, the data is skewed.
  • If the median is closer to Q1, the distribution is right-skewed (longer tail on the right).
  • If the median is closer to Q3, the distribution is left-skewed (longer tail on the left).
• Spread of data: a wider box means more variability in the data.
• Outliers: points beyond the whiskers suggest extreme values.
Boxplot (5-number statistic)
A box plot consists of:
• A box that represents the interquartile range (IQR = Q3 - Q1), which contains the middle 50% of the data.
• A line inside the box that shows the median (Q2).
• Whiskers extending from the box to the minimum and maximum values within 1.5 times the IQR.
• Outliers, which are individual points outside the whiskers, marked as dots or small circles.

• Lower Bound (Minimum): Q1 - 1.5 × IQR. Any data point below this bound is considered an outlier.
• Upper Bound (Maximum): Q3 + 1.5 × IQR. Any data point above this bound is also considered an outlier.
Identify Outliers

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Sample data
data = {
    'salary': [50000, 60000, 65000, 70000, 75000, 80000, 85000, 90000,
               120000, 200000, 250000, 300000, 350000]
}

# Create a DataFrame
df = pd.DataFrame(data)

# Step 1: Calculate Q1, Q3, and IQR
Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1

# Step 2: Calculate the outlier thresholds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Step 3: Identify outliers
outliers = df[(df['salary'] < lower_bound) | (df['salary'] > upper_bound)]

# Step 4: Visualize using a box plot
plt.figure(figsize=(8, 6))
sns.boxplot(x=df['salary'])
plt.title('Box Plot of Salaries')
plt.show()

# Display outliers
print("Outliers:")
print(outliers)
Filtering Data
• Filtering Rows Based on Condition
df_filtered = df[df["Age"] > 30]  # Select rows where Age > 30
• Filtering Multiple Conditions
df_filtered = df[(df["Age"] > 30) & (df["Salary"] > 50000)]
• Using .query() for Filtering
df_filtered = df.query("Age > 30 and Salary > 50000")  # Filter rows where Age is greater than 30 and Salary is greater than 50000
print(df_filtered)
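Applied to the sample employee dataset, df[df["Age"] > 30] keeps only David and Eve; NaN ages compare as False, so Charlie and Grace drop out (output shown as comments, spacing approximate):

df_filtered = df[df["Age"] > 30]
print(df_filtered)
#    EmployeeID   Name Department   Age   Salary Joining_Date
# 3         104  David         IT  40.0      NaN   2015-09-10
# 4         105    Eve         HR  35.0  65000.0   2019-12-11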
Transforming Data
• Transforming Data
df["Salary"] = df["Salary"].apply(lambda x: x * 1.1)  # Increase salary by 10%
• Creating a New Column
df["Salary_After_Tax"] = df["Salary"] * 0.8
• Replacing Values
df["Department"] = df["Department"].replace({"HR": "Human Resources", "IT": "Tech"})
# Replacing 'Islamabad' with 'Rawalpindi'
df["City"] = df["City"].replace("Islamabad", "Rawalpindi")

In pandas, apply() applies a function to each value in a column/row. lambda x: x * 1.1 is a lambda function that multiplies each value (x) by 1.1, effectively increasing the salary by 10%.
Combining Datasets (Merging, Joining, and Concatenation)
• Merging DataFrames on a Key (like a SQL JOIN*)
df_merged = pd.merge(df1, df2, on="EmployeeID", how="inner")  # Inner join
df_merged = pd.merge(df1, df2, on="EmployeeID", how="left")   # Left join
df_merged = pd.merge(df1, df2, on="EmployeeID", how="outer")  # Outer join

*A SQL JOIN is used to combine rows from two or more tables based on a related column between them.
Example: LEFT JOIN
Returns all records from the left table (Employees, df1) and matching records from the right table (Departments, df2). If no match is found, NULL (NaN) is returned.

left_merge = pd.merge(df1, df2, on='DepartmentID', how='left')

Note that David is included, but with NULL in DepartmentName, because no matching record exists in the Departments table.

INNER JOIN

inner_merge = pd.merge(df1, df2, on='DepartmentID', how='inner')

Note that David is missing because there's no matching DepartmentID = 4 in the Departments table.
RIGHT JOIN
Returns all records from the right table (Departments) and matching records from the left (Employees).

right_merge = pd.merge(df1, df2, on='DepartmentID', how='right')

FULL OUTER JOIN
Returns all records from both tables, with NULLs where there are no matches.

full_outer_merge = pd.merge(df1, df2, on='DepartmentID', how='outer')

Note that David is included (no match in Departments) and "Sales" appears with NULL for the employee columns (no match in Employees).
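The df1 and df2 tables on the original slides were shown as images; a minimal reconstruction consistent with the notes above (the exact table contents are assumptions) might be:

import pandas as pd

# Assumed contents: David's DepartmentID (4) has no match in Departments,
# and 'Sales' (DepartmentID 5) has no employee in Employees.
df1 = pd.DataFrame({  # Employees
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "DepartmentID": [1, 2, 3, 4],
})
df2 = pd.DataFrame({  # Departments
    "DepartmentID": [1, 2, 3, 5],
    "DepartmentName": ["HR", "IT", "Finance", "Sales"],
})

print(pd.merge(df1, df2, on="DepartmentID", how="inner"))  # drops David and Sales
print(pd.merge(df1, df2, on="DepartmentID", how="left"))   # keeps David (NaN DepartmentName)
print(pd.merge(df1, df2, on="DepartmentID", how="right"))  # keeps Sales (NaN Name)
print(pd.merge(df1, df2, on="DepartmentID", how="outer"))  # keeps everything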
Combining Datasets (Merging, Joining, and Concatenation)

"Orders" table:
OrderID  CustomerID  OrderDate
10308    2           1996-09-18
10309    37          1996-09-19
10310    77          1996-09-20

"Customers" table:
CustomerID  CustomerName                        ContactName     Country
1           Alfreds Futterkiste                 Maria Anders    Germany
2           Ana Trujillo Emparedados y helados  Ana Trujillo    Mexico
3           Antonio Moreno Taquería             Antonio Moreno  Mexico

Notice that the "CustomerID" column in the "Orders" table refers to the "CustomerID" in the "Customers" table. The relationship between the two tables above is the "CustomerID" column.

Joining the two tables on "CustomerID" gives, for example:
OrderID  CustomerName                        OrderDate
10308    Ana Trujillo Emparedados y helados  9/18/1996
10365    Antonio Moreno Taquería             11/27/1996
10383    Around the Horn                     12/16/1996
10355    Around the Horn                     11/15/1996
10278    Berglunds snabbköp                  8/12/1996
Summary of Types of SQL JOINs
• INNER JOIN → Returns only matching records.
• LEFT JOIN (LEFT OUTER JOIN) → Returns all records from the left table and matching records from the right.
• RIGHT JOIN (RIGHT OUTER JOIN) → Returns all records from the right table and matching records from the left.
• FULL JOIN (FULL OUTER JOIN) → Returns all records from both tables (matching and non-matching).
Combining Datasets (Merging, Joining, and Concatenation)
• Joining DataFrames on Index
df_joined = df1.join(df2.set_index("EmployeeID"), on="EmployeeID")
• Concatenating DataFrames (Stacking)
df_combined = pd.concat([df1, df2], axis=0)  # Stack rows
df_combined = pd.concat([df1, df2], axis=1)  # Merge side by side (columns)
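A small sketch of the two concat directions, using toy frames invented for illustration:

import pandas as pd

a = pd.DataFrame({"EmployeeID": [1, 2], "Name": ["Alice", "Bob"]})
b = pd.DataFrame({"EmployeeID": [3, 4], "Name": ["Carol", "Dan"]})

rows = pd.concat([a, b], axis=0, ignore_index=True)  # stack: 4 rows, same columns
cols = pd.concat([a, b], axis=1)                     # side by side: 2 rows, repeated columns
print(rows)
print(cols)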
Practice
1. Load the datasets using pandas.
2. Merge the first two datasets on the Department_ID column.
3. Filter the merged dataset to show only employees who earn a salary greater than some specific value X.
4. Join the merged dataset with a third dataset (managers.csv) that contains Manager_ID, Manager_Name, and Manager_Age.
5. Combine the resulting dataset with a new dataset (office_locations.csv) that contains Department_ID, Office_Location, and City, showing the office location for each department.

• Provide the Python code to perform these tasks using pandas.


import pandas as pd

# 1. Load datasets
employees = pd.read_csv('employees.csv')
departments = pd.read_csv('departments.csv')
managers = pd.read_csv('managers.csv')
office_locations = pd.read_csv('office_locations.csv')

# 2. Merge employees with departments on 'Department_ID'
merged_data = pd.merge(employees, departments, on='Department_ID')

# 3. Filter employees with salary > 60,000 (X = 60000 here)
filtered_data = merged_data[merged_data['Salary'] > 60000]

# 4. Join the filtered dataset with managers on 'Manager_ID'
final_data = pd.merge(filtered_data, managers, on='Manager_ID')

# 5. Combine with office locations on 'Department_ID'
final_dataset = pd.merge(final_data, office_locations, on='Department_ID')

# Display the final result
print(final_dataset)
Grouping and Aggregating Data
• Grouping Data & Summarizing
df_grouped = df.groupby("Department")["Salary"].mean()  # Mean salary per department
df_grouped = df.groupby("Department").agg({"Salary": "mean", "Age": "max"})  # Multiple aggregations
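On the sample employee dataset (before any salary transformations), the mean-salary grouping gives (output shown as comments):

print(df.groupby("Department")["Salary"].mean())
# Department
# Finance    71000.0
# HR         57500.0
# IT         57500.0
# Name: Salary, dtype: float64
# (David's NaN salary is skipped by mean() automatically)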
Sorting & Rearranging Data
• Sorting Data
df_sorted = df.sort_values("Salary", ascending=False)  # Sort by salary (descending)
• Reset Index
df.reset_index(drop=True, inplace=True)