
Course: DS5002
Data Science Tools and Techniques
Data Preprocessing
Dr. Safdar Ali

Explore and discuss the process of data cleaning, with an understanding of its importance, common challenges, and effective techniques, along with data transformation.
Example:
Based on various market surveys, a consulting firm has gathered a large dataset of different types of used cars across the market.
Data Dictionary:
1. Sales_ID (Sales ID)
2. name (Name of the used car)
3. year (Year of the car purchase)
4. selling_price (Current selling price for the used car)
5. km_driven (Total km driven)
6. Region (Region where it is used)
7. State or Province (State or Province where it is used)
8. City (City where it is used)
9. fuel (Fuel type)
10. seller_type (Who is selling the car)
11. transmission (Transmission type of the car)
12. owner (Owner type)
13. mileage (Mileage of the car)
14. engine (Engine power)
15. max_power (Max power)
16. seats (Number of seats)
17. sold (Used car sold or not)
https://www.kaggle.com/datasets/shubham1kumar/usedcar-data
Example data

Problems in data
• Missing values
• Mixed data: e.g., in the 1st column, car_name is combined with the company name; in the 2nd column, car_price amounts are mixed with the unit "Lakh"; in the last column, the date is in unstructured form.
Using Python - Advantages
• Simple syntax that is easy to understand and reasonably fast to prototype
• Libraries designed for specific data science tasks
• Provides a good ecosystem of libraries that are robust and varied
• Links well with the majority of cloud platform service providers
• Tight-knit integration with big data frameworks such as Hadoop, Spark, etc.
• Supports both object-oriented and functional programming paradigms
• Supports reading files from local storage, databases, and the cloud
Data Science using Python
• Python libraries provide key feature sets which are essential for data science
• For this, necessary knowledge of:
  – Python and the following powerful, basic modules or libraries for data analysis and visualization:
    • Pandas (for data manipulation and cleaning)
    • Matplotlib (for general-purpose plotting)
    • Seaborn (builds on Matplotlib for advanced statistical visualizations)
    • NumPy (for numerical Python)
  – Machine learning libraries like 'scikit-learn' ('sklearn') offer a bouquet of learning algorithms

To list the modules and functions within a library:

import numpy
content = dir(numpy)  # names defined in the numpy namespace
print(content)
Pandas
• This module is employed for data manipulation and analysis.
• Easy to work with, and it gives data structures like:
  – Series (1D = a single column); series = pd.Series()
  – DataFrame (2D = a collection of columns; provides merging, joining, and reshaping of data); df = pd.DataFrame(), where df stands for "DataFrame"
  – Both handle large datasets.
• General practice for:
  – Cleaning, filtering, and transforming data.
  – Handling missing data and combining datasets.
  – Analyzing time series and statistics.
• Example: use it to read data from CSV files for cleaning/analysis. The .csv file extension stands for "comma-separated values" file, one of the most common outputs of any spreadsheet program.
https://flatfile.com/demo/
Example: Series (1D) and DataFrame (2D)

Series (1D):

import pandas as pd
data = [100, 200, 300, 400]
series = pd.Series(data, index=['A', 'B', 'C', 'D'])
print(series)

Output:
A    100
B    200
C    300
D    400
dtype: int64

DataFrame (2D):

data = {
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "Salary": [50000, 60000, 70000]
}
df = pd.DataFrame(data)
print(df)

Output:
      Name  Age  Salary
0    Alice   25   50000
1      Bob   30   60000
2  Charlie   35   70000
Matplotlib
• A plotting module used for creating static, animated, and interactive visualizations
• General practice for:
  – Plotting line graphs, histograms, bar charts, scatter plots, etc.
  – Customizing plots with titles, labels, legends, and other annotations.
• Example: for a given dataset, use it to visualize trends over time with line charts or bar charts (see the sketch below).
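As a quick sketch of a trend-over-time line chart (the years and price values below are invented purely for illustration):

import matplotlib.pyplot as plt

# Hypothetical yearly average selling prices (illustrative values only)
years = [2018, 2019, 2020, 2021, 2022]
avg_price = [450000, 470000, 430000, 510000, 560000]

plt.plot(years, avg_price, marker='o', label='avg_price')  # line chart with point markers
plt.title("Average Selling Price by Year")                 # title annotation
plt.xlabel("Year")                                         # axis labels
plt.ylabel("Average selling price")
plt.legend()                                               # uses the label given to plot()
plt.show()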
Seaborn
• A higher-level plotting interface built on Matplotlib, used for making attractive and informative statistical graphics by simplifying complex visualizations.
• General practice for:
  – Making more sophisticated plots like heatmaps, violin plots (a combination of box and density plots), pair plots, etc.
  – Adding statistical features like regression lines, correlation coefficients, and distributions.
• Example: use it to create a correlation heatmap or to visualize the distribution of data, e.g. with:
seaborn.heatmap()
seaborn.violinplot()
seaborn.pairplot()
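A minimal correlation-heatmap sketch, assuming a small numeric DataFrame (the columns and values here are invented for illustration):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Illustrative numeric columns (not real used-car data)
df = pd.DataFrame({
    "selling_price": [450000, 300000, 700000, 250000, 520000],
    "km_driven": [40000, 90000, 15000, 120000, 35000],
    "year": [2018, 2015, 2021, 2013, 2019],
})

corr = df.corr()                                # pairwise correlation matrix
sns.heatmap(corr, annot=True, cmap="coolwarm")  # annotate each cell with its coefficient
plt.title("Correlation Heatmap")
plt.show()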
NumPy (Numerical Python)
• A powerful Python library used for numerical computing.
• Supports data structures and operations such as:
  – Large, multi-dimensional arrays and matrices,
  – Mathematical functions (linear algebra, statistics, random number generation, etc.)

Installing: pip install numpy

import numpy as np
# Creating a 1D array
arr1 = np.array([1, 2, 3, 4, 5])
print(arr1)
# Creating a 2D array
arr2 = np.array([[1, 2, 3], [4, 5, 6]])
print(arr2)
Real-world sample employee salary dataset-1

Index  Empl_ID  Name     Depart   Age   Salary    Joining_Date
0      101      Alice    HR       25.0  50000.0   2020-01-15
1      102      Bob      IT       30.0  60000.0   2018-06-23
2      103      Charlie  Finance  NaN   70000.0   2017-08-19
3      104      David    IT       40.0  NaN       2015-09-10
4      105      Eve      HR       35.0  65000.0   2019-12-11
5      106      NaN      Finance  28.0  72000.0   2021-07-01
6      107      Grace    IT       NaN   55000.0   2016-05-14
Tasks to perform in Python
Using dataset-1, perform the following operations in Python:
• Load the sample employee salary dataset
• Handle missing values (fill missing ages & salaries, remove rows with missing names)
• Filter data (employees with salary > 60K, IT employees above 30)
• Transform data (add "Years of Experience", increase salary by 10%)
• Merge datasets (add a Bonus column from another dataset)
• Sort & group data (sort by salary, group by department)
Creating and displaying a sample employee dataset

import pandas as pd
import numpy as np

# Creating a sample employee dataset
data = {
    "EmployeeID": [101, 102, 103, 104, 105, 106, 107],
    "Name": ["Alice", "Bob", "Charlie", "David", "Eve", np.nan, "Grace"],
    "Department": ["HR", "IT", "Finance", "IT", "HR", "Finance", "IT"],
    "Age": [25, 30, np.nan, 40, 35, 28, np.nan],
    "Salary": [50000, 60000, 70000, np.nan, 65000, 72000, 55000],
    "Joining_Date": ["2020-01-15", "2018-06-23", "2017-08-19", "2015-09-10",
                     "2019-12-11", "2021-07-01", "2016-05-14"]
}

# Convert to DataFrame
df = pd.DataFrame(data)

# Convert Joining_Date to datetime
df["Joining_Date"] = pd.to_datetime(df["Joining_Date"])

# Display the dataset
print(df)

Load existing sample data

import pandas as pd
# Load DataFrame from a CSV file
df = pd.read_csv("path/to/your/folder/data.csv")
# Display the first 5 rows
print(df.head())

File Format  Method
CSV          pd.read_csv("file.csv")
Excel        pd.read_excel("file.xlsx")
JSON         pd.read_json("file.json")
Pickle       pd.read_pickle("file.pkl")

JSON: JavaScript Object Notation
Cleaning Data - Pandas
• Removing Duplicates
df.drop_duplicates(inplace=True)
• Renaming Columns
df.rename(columns={"OldColumn": "NewColumn"}, inplace=True)
• Changing Data Types
df["Age"] = df["Age"].astype(int)  # Convert to integer
df["Date"] = pd.to_datetime(df["Date"])  # Convert to datetime*
• Stripping Whitespace from Column Names
df.columns = df.columns.str.strip()  # Remove spaces from column names

*class datetime.date: an idealized naive date, assuming the current Gregorian calendar always was, and always will be, in effect. Attributes: year, month, and day.
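A minimal sketch tying these steps together on the sample employee dataset created earlier (note: astype(int) fails while the column still holds NaN, so missing ages are filled first; using the column mean as the fill value is an assumption for illustration):

# Assumes df is the sample employee dataset from the earlier slide
df = df.drop_duplicates()                       # remove exact duplicate rows
df.columns = df.columns.str.strip()             # strip stray spaces from column names
df["Age"] = df["Age"].fillna(df["Age"].mean())  # fill NaN ages before converting
df["Age"] = df["Age"].astype(int)               # integer conversion is now safe
print(df.dtypes)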
Handling Missing Data (NaN values)
• Checking for Missing Values
df.isnull().sum()  # Count missing values per column
• Removing Rows with Missing Data
df.dropna(inplace=True)  # Drop rows with NaN values
• Filling Missing Values
df.fillna(0, inplace=True)  # Replace NaN with 0
df["Salary"] = df["Salary"].fillna(df["Salary"].mean())  # Replace with column mean
print(df)
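On the sample employee dataset created earlier, the per-column missing-value counts come out as follows (output shown as comments):

print(df.isnull().sum())
# EmployeeID      0
# Name            1
# Department      0
# Age             2
# Salary          1
# Joining_Date    0
# dtype: int64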
Boxplot (5-number statistic)
A box-and-whisker plot is a graphical representation of the distribution of a dataset:
• Minimum – the smallest data point, excluding outliers.
• First Quartile (Q1) – 25th percentile (middle of the lower half of the data).
• Median (Q2) – 50th percentile (middle value of the dataset).
• Third Quartile (Q3) – 75th percentile (middle of the upper half of the data).
• Maximum – the largest data point, excluding outliers.
Boxplot (5-number statistic)
• Skewness: if the median sits closer to Q1 or Q3, the data is skewed.
  • If the median is closer to Q1, the distribution is right-skewed (longer tail on the right).
  • If the median is closer to Q3, the distribution is left-skewed (longer tail on the left).
• Spread of data: a wider box means more variability in the data.
• Outliers: points beyond the whiskers suggest extreme values.
Boxplot (5-number statistic)
A box plot consists of:
• A box that represents the interquartile range (IQR = Q3 - Q1), which contains the middle 50% of the data.
• A line inside the box that shows the median (Q2).
• Whiskers extending from the box to the minimum and maximum values within 1.5 times the IQR.
• Outliers, which are individual points outside the whiskers, marked as dots or small circles.

• Lower Bound (Minimum): Q1 - 1.5 × IQR. Any data point below this bound is considered an outlier.
• Upper Bound (Maximum): Q3 + 1.5 × IQR. Any data point above this bound is also considered an outlier.
Identify Outliers

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Sample data
data = {
    'salary': [50000, 60000, 65000, 70000, 75000, 80000, 85000, 90000,
               120000, 200000, 250000, 300000, 350000]
}

# Create a DataFrame
df = pd.DataFrame(data)

# Step 1: Calculate Q1, Q3, and IQR
Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1

# Step 2: Calculate the outlier thresholds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Step 3: Identify outliers
outliers = df[(df['salary'] < lower_bound) | (df['salary'] > upper_bound)]

# Step 4: Visualize using a box plot
plt.figure(figsize=(8, 6))
sns.boxplot(x=df['salary'])
plt.title('Box Plot of Salaries')
plt.show()

# Display outliers
print("Outliers:")
print(outliers)
Filtering Data
• Filtering Rows Based on Condition
df_filtered = df[df["Age"] > 30]  # Select rows where Age > 30
• Filtering Multiple Conditions
df_filtered = df[(df["Age"] > 30) & (df["Salary"] > 50000)]
• Using .query() for Filtering
df_filtered = df.query("Age > 30 and Salary > 50000")  # Filter rows where Age is greater than 30 and Salary is greater than 50000
print(df_filtered)
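Applied to the sample employee dataset, df[df["Age"] > 30] keeps only David and Eve; NaN ages compare as False, so Charlie and Grace drop out (output shown as comments, spacing approximate):

df_filtered = df[df["Age"] > 30]
print(df_filtered)
#    EmployeeID   Name Department   Age   Salary Joining_Date
# 3         104  David         IT  40.0      NaN   2015-09-10
# 4         105    Eve         HR  35.0  65000.0   2019-12-11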
Transforming Data
• Transforming Data
df["Salary"] = df["Salary"].apply(lambda x: x * 1.1)  # Increase salary by 10%
• Creating a New Column
df["Salary_After_Tax"] = df["Salary"] * 0.8
• Replacing Values
df["Department"] = df["Department"].replace({"HR": "Human Resources", "IT": "Tech"})
# Replacing 'Islamabad' with 'Rawalpindi'
df["City"] = df["City"].replace("Islamabad", "Rawalpindi")

In pandas, apply() applies a function to each value in a column/row. lambda x: x * 1.1 is a lambda function that multiplies each value (x) by 1.1, effectively increasing the salary by 10%.
Combining Datasets (Merging, Joining, and Concatenation)
• Merging DataFrames on a Key (like a SQL JOIN*)
df_merged = pd.merge(df1, df2, on="EmployeeID", how="inner")  # Inner join
df_merged = pd.merge(df1, df2, on="EmployeeID", how="left")   # Left join
df_merged = pd.merge(df1, df2, on="EmployeeID", how="outer")  # Outer join

*A SQL JOIN is used to combine rows from two or more tables based on a related column between them.
Example: LEFT JOIN
Returns all records from the left table (Employees, df1) and matching records from the right table (Departments, df2). If no match is found, NULL (NaN) is returned.

left_merge = pd.merge(df1, df2, on='DepartmentID', how='left')

Note that David is included, but with NULL in DepartmentName, because no matching record exists in the Departments table.

INNER JOIN

inner_merge = pd.merge(df1, df2, on='DepartmentID', how='inner')

Note that David is missing because there's no matching DepartmentID = 4 in the Departments table.
RIGHT JOIN
Returns all records from the right table (Departments) and matching records from the left (Employees).

right_merge = pd.merge(df1, df2, on='DepartmentID', how='right')

FULL OUTER JOIN
Returns all records from both tables, with NULLs where there are no matches.

full_outer_merge = pd.merge(df1, df2, on='DepartmentID', how='outer')

Note that David is included (no match in Departments) and "Sales" appears with NULL for the employee columns (no match in Employees).
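The df1 and df2 tables on the original slides were shown as images; a minimal reconstruction consistent with the notes above (the exact table contents are assumptions) might be:

import pandas as pd

# Assumed contents: David's DepartmentID (4) has no match in Departments,
# and 'Sales' (DepartmentID 5) has no employee in Employees.
df1 = pd.DataFrame({  # Employees
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "DepartmentID": [1, 2, 3, 4],
})
df2 = pd.DataFrame({  # Departments
    "DepartmentID": [1, 2, 3, 5],
    "DepartmentName": ["HR", "IT", "Finance", "Sales"],
})

print(pd.merge(df1, df2, on="DepartmentID", how="inner"))  # drops David and Sales
print(pd.merge(df1, df2, on="DepartmentID", how="left"))   # keeps David (NaN DepartmentName)
print(pd.merge(df1, df2, on="DepartmentID", how="right"))  # keeps Sales (NaN Name)
print(pd.merge(df1, df2, on="DepartmentID", how="outer"))  # keeps everything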
Combining Datasets (Merging, Joining, and Concatenation)

"Orders" table:
OrderID  CustomerID  OrderDate
10308    2           1996-09-18
10309    37          1996-09-19
10310    77          1996-09-20

"Customers" table:
CustomerID  CustomerName                        ContactName     Country
1           Alfreds Futterkiste                 Maria Anders    Germany
2           Ana Trujillo Emparedados y helados  Ana Trujillo    Mexico
3           Antonio Moreno Taquería             Antonio Moreno  Mexico

Notice that the "CustomerID" column in the "Orders" table refers to the "CustomerID" in the "Customers" table. The relationship between the two tables above is the "CustomerID" column.

Joining the two tables on "CustomerID" gives, for example:
OrderID  CustomerName                        OrderDate
10308    Ana Trujillo Emparedados y helados  9/18/1996
10365    Antonio Moreno Taquería             11/27/1996
10383    Around the Horn                     12/16/1996
10355    Around the Horn                     11/15/1996
10278    Berglunds snabbköp                  8/12/1996
Summary of Types of SQL JOINs
• INNER JOIN → Returns only matching records.
• LEFT JOIN (LEFT OUTER JOIN) → Returns all records from the left table and matching records from the right.
• RIGHT JOIN (RIGHT OUTER JOIN) → Returns all records from the right table and matching records from the left.
• FULL JOIN (FULL OUTER JOIN) → Returns all records from both tables (matching and non-matching).
Combining Datasets (Merging, Joining, and Concatenation)
• Joining DataFrames on Index
df_joined = df1.join(df2.set_index("EmployeeID"), on="EmployeeID")
• Concatenating DataFrames (Stacking)
df_combined = pd.concat([df1, df2], axis=0)  # Stack rows
df_combined = pd.concat([df1, df2], axis=1)  # Merge side by side (columns)
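A small sketch of the two concat directions, using toy frames invented for illustration:

import pandas as pd

a = pd.DataFrame({"EmployeeID": [1, 2], "Name": ["Alice", "Bob"]})
b = pd.DataFrame({"EmployeeID": [3, 4], "Name": ["Carol", "Dan"]})

rows = pd.concat([a, b], axis=0, ignore_index=True)  # stack: 4 rows, same columns
cols = pd.concat([a, b], axis=1)                     # side by side: 2 rows, repeated columns
print(rows)
print(cols)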
Practice
1. Load the datasets using pandas.
2. Merge the first two datasets on the Department_ID column.
3. Filter the merged dataset to show only employees who earn a salary greater than some specific value X.
4. Join the merged dataset with a third dataset (managers.csv) that contains Manager_ID, Manager_Name, and Manager_Age.
5. Combine the resulting dataset with a new dataset (office_locations.csv) that contains Department_ID, Office_Location, and City, showing the office location for each department.

• Provide the Python code to perform these tasks using pandas.


import pandas as pd

# 1. Load datasets
employees = pd.read_csv('employees.csv')
departments = pd.read_csv('departments.csv')
managers = pd.read_csv('managers.csv')
office_locations = pd.read_csv('office_locations.csv')

# 2. Merge employees with departments on 'Department_ID'
merged_data = pd.merge(employees, departments, on='Department_ID')

# 3. Filter employees with salary > 60,000 (X = 60000 here)
filtered_data = merged_data[merged_data['Salary'] > 60000]

# 4. Join the filtered dataset with managers on 'Manager_ID'
final_data = pd.merge(filtered_data, managers, on='Manager_ID')

# 5. Combine with office locations on 'Department_ID'
final_dataset = pd.merge(final_data, office_locations, on='Department_ID')

# Display the final result
print(final_dataset)
Grouping and Aggregating Data
• Grouping Data & Summarizing
df_grouped = df.groupby("Department")["Salary"].mean()  # Mean salary per department
df_grouped = df.groupby("Department").agg({"Salary": "mean", "Age": "max"})  # Multiple aggregations
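On the sample employee dataset (before any salary transformations), the mean-salary grouping gives (output shown as comments):

print(df.groupby("Department")["Salary"].mean())
# Department
# Finance    71000.0
# HR         57500.0
# IT         57500.0
# Name: Salary, dtype: float64
# (David's NaN salary is skipped by mean() automatically)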
Sorting & Rearranging Data
• Sorting Data
df_sorted = df.sort_values("Salary", ascending=False)  # Sort by salary (descending)
• Reset Index
df.reset_index(drop=True, inplace=True)