
Data exploration and preprocessing
Loading a dataset
• Loading a dataset into Python can be done with various libraries; one of the most commonly used for data manipulation and analysis is pandas.
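• A minimal sketch of loading a CSV file with pandas; the file name data.csv is illustrative, and pandas also provides readers such as read_excel, read_json, and read_sql.

import pandas as pd

# Load a CSV file into a DataFrame (file name is illustrative)
df = pd.read_csv('data.csv')

# Preview the first rows and the overall shape (rows, columns)
print(df.head())
print(df.shape)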
Statistical measures
• Statistical measures, also known as summary
statistics or descriptive statistics, are
numerical values or techniques used to
summarize and describe a dataset.
• These measures provide a concise overview of
key characteristics of the data, helping to
understand its central tendency, variability,
and distribution.
Skewness
• Skewness is a statistical measure that describes the asymmetry of a distribution around its mean: a positive skew indicates a longer right tail, while a negative skew indicates a longer left tail.
Kurtosis
• Kurtosis is a statistical measure that describes
the distribution of data points in a dataset,
specifically how data points are distributed in
the tails (extreme values) compared to the
center (mean) of the distribution.
• There are three main types of kurtosis:
• Mesokurtic: A mesokurtic distribution has excess kurtosis equal to zero, i.e., the same kurtosis as a normal distribution. Its tails are neither too heavy (leptokurtic) nor too light (platykurtic) compared to a normal distribution.
• Leptokurtic: A leptokurtic distribution has positive excess kurtosis. It has heavier tails and a sharper peak around the mean than a normal distribution.
• Platykurtic: A platykurtic distribution has negative excess kurtosis. It has lighter tails and a flatter peak around the mean than a normal distribution.
• In summary, skewness describes the asymmetry of the
distribution, while kurtosis describes the tails of the
distribution. Both measures can provide valuable insights into
the characteristics of a dataset, such as whether it is skewed,
whether it has extreme values in the tails, and how it deviates
from a normal distribution.
• Researchers and statisticians often use skewness and kurtosis
in combination with other descriptive statistics to gain a
comprehensive understanding of data distributions.
• These statistical measures are essential for
summarizing and gaining insights from
datasets in various fields, including statistics,
data science, and research.
• Depending on the characteristics of the data
and the research question, different measures
may be more relevant or informative.
Summary statistics
• To calculate and examine basic summary
statistics like mean, median, mode, standard
deviation, and range for a dataset in Python,
you can use the pandas library.
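• A brief sketch of these measures, assuming df is the DataFrame loaded earlier and 'age' is an illustrative numeric column:

print(df['age'].mean())                   # mean
print(df['age'].median())                 # median
print(df['age'].mode()[0])                # mode (first modal value)
print(df['age'].std())                    # standard deviation
print(df['age'].max() - df['age'].min())  # range
print(df.describe())                      # summary statistics for all numeric columns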
Data Cleaning:

– Handle missing data: identify and decide how to deal with missing values (e.g., imputation, removal).
– Detect and address duplicate records if they exist.
– Handle outliers: identify and decide whether to remove or transform outliers.
Handling Missing Data:

• Use pandas to handle missing data by either dropping or imputing missing values, as in the sketch below (first by dropping, then by imputing with the mean).
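• A minimal sketch of both approaches, assuming df is the DataFrame loaded earlier and 'age' is an illustrative numeric column:

# By dropping: remove rows that contain any missing value
df_dropped = df.dropna()

# By dropping: remove columns that contain any missing value
df_dropped_cols = df.dropna(axis=1)

# Imputing missing values by mean (for a numeric column)
df['age'] = df['age'].fillna(df['age'].mean())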
Handling Duplicates
• # Remove duplicate rows
df.drop_duplicates(inplace=True)
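• It is often useful to check how many duplicates exist before removing them; a short sketch:

# Count fully duplicated rows
print(df.duplicated().sum())

# Keep the first occurrence of each duplicate and drop the rest
df = df.drop_duplicates(keep='first')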
Handling Outliers:

• Identify and handle outliers based on your domain knowledge or statistical methods, as in the sketch below.
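• A minimal sketch of one common statistical method, the IQR (interquartile range) rule, for the illustrative numeric column 'age'; the 1.5 multiplier is a convention, not a fixed requirement:

# Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] as outliers
q1 = df['age'].quantile(0.25)
q3 = df['age'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = df[(df['age'] < lower) | (df['age'] > upper)]       # inspect outliers
df_filtered = df[(df['age'] >= lower) & (df['age'] <= upper)]  # or remove them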
Data Type Conversion
• Convert data types as needed (e.g., converting
strings to numbers).
• One-hot encoding and label encoding are two
different techniques used to convert
categorical data into numerical format,
making it suitable for machine learning
algorithms.
• They have distinct characteristics and use cases; a brief sketch of both follows.
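• A minimal sketch, assuming df has an illustrative string column 'price' and an illustrative categorical column 'color':

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Convert a string column to numeric (invalid values become NaN)
df['price'] = pd.to_numeric(df['price'], errors='coerce')

# One-hot encoding: one binary column per category
df_onehot = pd.get_dummies(df, columns=['color'])

# Label encoding: one integer per category
df['color_label'] = LabelEncoder().fit_transform(df['color'])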
Text Cleaning:
• For text data, you can perform tasks like
removing special characters and lowercasing.
• # Remove special characters and lowercase text
• df['text_column'] = df['text_column'].str.replace(r'[^a-zA-Z0-9\s]', '', regex=True).str.lower()
• regex=True treats the pattern as a regular expression (this is no longer the default in newer pandas versions).
Feature Scaling
• Standardize or normalize numerical features.
• Feature scaling is an important preprocessing step in machine learning to ensure that numerical features are on a similar scale.
• Two common techniques for feature scaling are standardization (Z-score normalization) and normalization (min-max scaling), as sketched below.
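• A brief sketch using scikit-learn, assuming df has illustrative numeric columns 'age' and 'income':

from sklearn.preprocessing import StandardScaler, MinMaxScaler

num_cols = ['age', 'income']

# Standardization (Z-score): mean 0, standard deviation 1
df[num_cols] = StandardScaler().fit_transform(df[num_cols])

# Normalization (min-max): rescale values to the [0, 1] range
# (apply to the original values, not the standardized ones)
# df[num_cols] = MinMaxScaler().fit_transform(df[num_cols])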
Data Integration
• Data integration involves combining data from different
sources or datasets into a single, unified dataset. This is often
necessary when working with diverse data sources, such as
databases, spreadsheets, APIs, and more.
• Data Source Identification: Identifying the various sources of
data that need to be integrated.
• Data Extraction: Extracting data from different sources. This
can involve querying databases, reading CSV files, or using
APIs to collect data.
• Data Transformation: Transforming the extracted data into a
common format or structure. This might include converting
data types, aggregating data, or reformatting data to match
the target schema.
• Data Cleaning (Again): After integration, additional data
cleaning may be required to address inconsistencies or issues
that arise during the integration process.
• Data Merging or Joining: Combining the transformed data from different sources based on common keys or identifiers. This can involve various types of joins, such as inner, outer, left, or right joins (see the sketch below).
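• A minimal sketch of merging two sources on a common key; the file names and the 'customer_id' column are illustrative:

import pandas as pd

customers = pd.read_csv('customers.csv')
orders = pd.read_csv('orders.csv')

# Inner join keeps only rows whose key appears in both tables;
# how can also be 'left', 'right', or 'outer'
merged = pd.merge(customers, orders, on='customer_id', how='inner')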
Visualization
• Visualization is the process of representing data graphically,
typically using charts, graphs, and other visual elements, to
help individuals understand and interpret complex data
patterns, relationships, and trends.
• Effective data visualization can make information more accessible, intuitive, and actionable. Here are some common types of data visualizations (a brief code sketch follows the list):
• Bar Charts: Bar charts are used to display categorical data
with rectangular bars. They are useful for comparing values
across categories. Vertical bars are commonly used for these
charts, and horizontal bar charts are also used in some cases.
• Line Charts: Line charts are used to show trends over a
continuous interval or time series data. They connect data
points with lines, making it easy to see how values change
over time.
• Pie Charts: Pie charts represent data as a circle divided into
slices, where each slice represents a proportion of the whole.
They are suitable for displaying parts of a whole and showing
the composition of a dataset.
• Scatter Plots: Scatter plots are used to visualize the
relationship between two variables. Each data point is plotted
on a two-dimensional plane, with one variable on the x-axis
and the other on the y-axis.
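• A short matplotlib sketch of these chart types, using made-up sample data:

import matplotlib.pyplot as plt

categories = ['A', 'B', 'C']
values = [10, 24, 17]

plt.bar(categories, values)                             # bar chart: compare categories
plt.show()

plt.plot([2019, 2020, 2021, 2022], [5, 9, 12, 15])      # line chart: trend over time
plt.show()

plt.pie(values, labels=categories, autopct='%1.1f%%')   # pie chart: parts of a whole
plt.show()

plt.scatter([1, 2, 3, 4], [2.1, 4.3, 5.9, 8.2])         # scatter plot: two variables
plt.show()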
Histogram
• A histogram is a graphical representation of the
distribution of a dataset.
• It provides a visual way to understand the frequency
or occurrence of different values or ranges of values
within the dataset.
• Histograms are commonly used in data analysis and
statistics to explore and visualize the underlying
characteristics of a dataset.
Key components and characteristics of a histogram
• Bins or Intervals: A histogram divides the range of data into a set of
contiguous, non-overlapping intervals or bins. Each bin represents a
specific range of values.
• Frequency or Count: The height of each bar in the histogram
corresponds to the frequency or count of data points that fall
within the respective bin. In other words, it shows how many data
points fall into each interval.
• X-Axis: The X-axis of the histogram represents the range of values
covered by the dataset. Each bin is positioned along the X-axis
according to its interval.
• Y-Axis: The Y-axis represents the frequency or count of data points
within each bin.
• Bars: The bars or rectangles in the histogram visually depict the
frequency distribution of the data. Taller bars indicate a higher
frequency of data points within the corresponding bin.
• Example: the same data plotted with bin=20 and with bin=10 produces histograms of different granularity.
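• A minimal sketch of plotting the illustrative column 'age' with different bin counts:

import matplotlib.pyplot as plt

# 20 bins; change to bins=10 for a coarser view of the same data
plt.hist(df['age'].dropna(), bins=20, edgecolor='black')
plt.xlabel('age')
plt.ylabel('frequency')
plt.show()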
Heatmaps
• Creating a heatmap in Python is often done
using libraries like Matplotlib and Seaborn.
• Heatmaps use color-coding to represent data
values in a matrix.
• They are often used to visualize correlations,
density, or patterns in large datasets.
• Import the Seaborn and Matplotlib libraries, define the data, and create the heatmap:

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Sample 2D integer data; replace with your own data
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Create a heatmap
sns.heatmap(data, annot=True, cmap='YlGnBu', fmt='d')
plt.show()

• We define the sample data as a 2D array. You can replace this with your own data or load data from a file or DataFrame.
• We use sns.heatmap to create the heatmap. The annot=True parameter adds annotations (data values) to each cell, the cmap parameter sets the color map (you can choose from various predefined color maps), and fmt='d' specifies that the annotations should be displayed as integers.
