
Data exploration and preprocessing
Loading a dataset
• Loading a dataset into Python can be done with various libraries; one of the most commonly used for data manipulation and analysis is pandas.
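• A minimal sketch of loading a CSV file with pandas; the file name data.csv is illustrative, and pandas also provides readers such as read_excel, read_json, and read_sql.

import pandas as pd

# Load a CSV file into a DataFrame (file name is illustrative)
df = pd.read_csv('data.csv')

# Preview the first rows and the overall shape (rows, columns)
print(df.head())
print(df.shape)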
Statistical measures
• Statistical measures, also known as summary
statistics or descriptive statistics, are
numerical values or techniques used to
summarize and describe a dataset.
• These measures provide a concise overview of
key characteristics of the data, helping to
understand its central tendency, variability,
and distribution.
Skewness
• Skewness is a statistical measure that describes the asymmetry of a distribution around its mean: a positive skew indicates a longer right tail, while a negative skew indicates a longer left tail.
Kurtosis
• Kurtosis is a statistical measure that describes
the distribution of data points in a dataset,
specifically how data points are distributed in
the tails (extreme values) compared to the
center (mean) of the distribution.
• There are three main types of kurtosis:
• Mesokurtic: A mesokurtic distribution has excess kurtosis equal to zero, i.e., the same kurtosis as a normal distribution. Its tails are neither too heavy (leptokurtic) nor too light (platykurtic) compared to a normal distribution.
• Leptokurtic: A leptokurtic distribution has positive excess kurtosis. It has heavier tails and a sharper peak around the mean than a normal distribution.
• Platykurtic: A platykurtic distribution has negative excess kurtosis. It has lighter tails and a flatter peak around the mean than a normal distribution.
• In summary, skewness describes the asymmetry of the
distribution, while kurtosis describes the tails of the
distribution. Both measures can provide valuable insights into
the characteristics of a dataset, such as whether it is skewed,
whether it has extreme values in the tails, and how it deviates
from a normal distribution.
• Researchers and statisticians often use skewness and kurtosis
in combination with other descriptive statistics to gain a
comprehensive understanding of data distributions.
• These statistical measures are essential for
summarizing and gaining insights from
datasets in various fields, including statistics,
data science, and research.
• Depending on the characteristics of the data
and the research question, different measures
may be more relevant or informative.
Summary statistics
• To calculate and examine basic summary
statistics like mean, median, mode, standard
deviation, and range for a dataset in Python,
you can use the pandas library.
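• A brief sketch of these measures, assuming df is the DataFrame loaded earlier and 'age' is an illustrative numeric column:

print(df['age'].mean())                   # mean
print(df['age'].median())                 # median
print(df['age'].mode()[0])                # mode (first modal value)
print(df['age'].std())                    # standard deviation
print(df['age'].max() - df['age'].min())  # range
print(df.describe())                      # summary statistics for all numeric columns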
Data Cleaning:

– Handle missing data: identify and decide how to deal with missing values (e.g., imputation, removal).
– Detect and address duplicate records if they exist.
– Handle outliers: identify and decide whether to remove or transform outliers.
Handling Missing Data:

• Use pandas to handle missing data by either dropping or imputing missing values, as in the sketch below (first by dropping, then by imputing with the mean).
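• A minimal sketch of both approaches, assuming df is the DataFrame loaded earlier and 'age' is an illustrative numeric column:

# By dropping: remove rows that contain any missing value
df_dropped = df.dropna()

# By dropping: remove columns that contain any missing value
df_dropped_cols = df.dropna(axis=1)

# Imputing missing values by mean (for a numeric column)
df['age'] = df['age'].fillna(df['age'].mean())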
Handling Duplicates
• # Remove duplicate rows
df.drop_duplicates(inplace=True)
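• It is often useful to check how many duplicates exist before removing them; a short sketch:

# Count fully duplicated rows
print(df.duplicated().sum())

# Keep the first occurrence of each duplicate and drop the rest
df = df.drop_duplicates(keep='first')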
Handling Outliers:

• Identify and handle outliers based on your domain knowledge or statistical methods, as in the sketch below.
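• A minimal sketch of one common statistical method, the IQR (interquartile range) rule, for the illustrative numeric column 'age'; the 1.5 multiplier is a convention, not a fixed requirement:

# Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] as outliers
q1 = df['age'].quantile(0.25)
q3 = df['age'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = df[(df['age'] < lower) | (df['age'] > upper)]       # inspect outliers
df_filtered = df[(df['age'] >= lower) & (df['age'] <= upper)]  # or remove them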
Data Type Conversion
• Convert data types as needed (e.g., converting
strings to numbers).
• One-hot encoding and label encoding are two
different techniques used to convert
categorical data into numerical format,
making it suitable for machine learning
algorithms.
• They have distinct characteristics and use cases; a brief sketch of both follows.
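• A minimal sketch, assuming df has an illustrative string column 'price' and an illustrative categorical column 'color':

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Convert a string column to numeric (invalid values become NaN)
df['price'] = pd.to_numeric(df['price'], errors='coerce')

# One-hot encoding: one binary column per category
df_onehot = pd.get_dummies(df, columns=['color'])

# Label encoding: one integer per category
df['color_label'] = LabelEncoder().fit_transform(df['color'])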
Text Cleaning:
• For text data, you can perform tasks like
removing special characters and lowercasing.
• # Remove special characters and lowercase text
• df['text_column'] = df['text_column'].str.replace(r'[^a-zA-Z0-9\s]', '', regex=True).str.lower()
• regex=True treats the pattern as a regular expression (this is no longer the default in newer pandas versions).
Feature Scaling
• Standardize or normalize numerical features.
• Feature scaling is an important preprocessing step in machine learning to ensure that numerical features are on a similar scale.
• Two common techniques for feature scaling are standardization (Z-score normalization) and normalization (min-max scaling), as sketched below.
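• A brief sketch using scikit-learn, assuming df has illustrative numeric columns 'age' and 'income':

from sklearn.preprocessing import StandardScaler, MinMaxScaler

num_cols = ['age', 'income']

# Standardization (Z-score): mean 0, standard deviation 1
df[num_cols] = StandardScaler().fit_transform(df[num_cols])

# Normalization (min-max): rescale values to the [0, 1] range
# (apply to the original values, not the standardized ones)
# df[num_cols] = MinMaxScaler().fit_transform(df[num_cols])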
Data Integration
• Data integration involves combining data from different
sources or datasets into a single, unified dataset. This is often
necessary when working with diverse data sources, such as
databases, spreadsheets, APIs, and more.
• Data Source Identification: Identifying the various sources of
data that need to be integrated.
• Data Extraction: Extracting data from different sources. This
can involve querying databases, reading CSV files, or using
APIs to collect data.
• Data Transformation: Transforming the extracted data into a
common format or structure. This might include converting
data types, aggregating data, or reformatting data to match
the target schema.
• Data Cleaning (Again): After integration, additional data
cleaning may be required to address inconsistencies or issues
that arise during the integration process.
• Data Merging or Joining: Combining the transformed data from different sources based on common keys or identifiers. This can involve various types of joins, such as inner, outer, left, or right joins (see the sketch below).
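• A minimal sketch of merging two sources on a common key; the file names and the 'customer_id' column are illustrative:

import pandas as pd

customers = pd.read_csv('customers.csv')
orders = pd.read_csv('orders.csv')

# Inner join keeps only rows whose key appears in both tables;
# how can also be 'left', 'right', or 'outer'
merged = pd.merge(customers, orders, on='customer_id', how='inner')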
Visualization
• Visualization is the process of representing data graphically,
typically using charts, graphs, and other visual elements, to
help individuals understand and interpret complex data
patterns, relationships, and trends.
• Effective data visualization can make information more accessible, intuitive, and actionable. Here are some common types of data visualizations (a brief code sketch follows the list):
• Bar Charts: Bar charts are used to display categorical data
with rectangular bars. They are useful for comparing values
across categories. Vertical bars are commonly used for these
charts, and horizontal bar charts are also used in some cases.
• Line Charts: Line charts are used to show trends over a
continuous interval or time series data. They connect data
points with lines, making it easy to see how values change
over time.
• Pie Charts: Pie charts represent data as a circle divided into
slices, where each slice represents a proportion of the whole.
They are suitable for displaying parts of a whole and showing
the composition of a dataset.
• Scatter Plots: Scatter plots are used to visualize the
relationship between two variables. Each data point is plotted
on a two-dimensional plane, with one variable on the x-axis
and the other on the y-axis.
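• A short matplotlib sketch of these chart types, using made-up sample data:

import matplotlib.pyplot as plt

categories = ['A', 'B', 'C']
values = [10, 24, 17]

plt.bar(categories, values)                             # bar chart: compare categories
plt.show()

plt.plot([2019, 2020, 2021, 2022], [5, 9, 12, 15])      # line chart: trend over time
plt.show()

plt.pie(values, labels=categories, autopct='%1.1f%%')   # pie chart: parts of a whole
plt.show()

plt.scatter([1, 2, 3, 4], [2.1, 4.3, 5.9, 8.2])         # scatter plot: two variables
plt.show()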
Histogram
• A histogram is a graphical representation of the
distribution of a dataset.
• It provides a visual way to understand the frequency
or occurrence of different values or ranges of values
within the dataset.
• Histograms are commonly used in data analysis and
statistics to explore and visualize the underlying
characteristics of a dataset.
Key components and characteristics of a histogram
• Bins or Intervals: A histogram divides the range of data into a set of
contiguous, non-overlapping intervals or bins. Each bin represents a
specific range of values.
• Frequency or Count: The height of each bar in the histogram
corresponds to the frequency or count of data points that fall
within the respective bin. In other words, it shows how many data
points fall into each interval.
• X-Axis: The X-axis of the histogram represents the range of values
covered by the dataset. Each bin is positioned along the X-axis
according to its interval.
• Y-Axis: The Y-axis represents the frequency or count of data points
within each bin.
• Bars: The bars or rectangles in the histogram visually depict the
frequency distribution of the data. Taller bars indicate a higher
frequency of data points within the corresponding bin.
• Example: the same data plotted with bin=20 and with bin=10 produces histograms of different granularity.
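• A minimal sketch of plotting the illustrative column 'age' with different bin counts:

import matplotlib.pyplot as plt

# 20 bins; change to bins=10 for a coarser view of the same data
plt.hist(df['age'].dropna(), bins=20, edgecolor='black')
plt.xlabel('age')
plt.ylabel('frequency')
plt.show()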
Heatmaps
• Creating a heatmap in Python is often done
using libraries like Matplotlib and Seaborn.
• Heatmaps use color-coding to represent data
values in a matrix.
• They are often used to visualize correlations,
density, or patterns in large datasets.
• Import the Seaborn and Matplotlib libraries, define the data, and create the heatmap:

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Sample 2D integer data; replace with your own data
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Create a heatmap
sns.heatmap(data, annot=True, cmap='YlGnBu', fmt='d')
plt.show()

• We define the sample data as a 2D array. You can replace this with your own data or load data from a file or DataFrame.
• We use sns.heatmap to create the heatmap. The annot=True parameter adds annotations (data values) to each cell, the cmap parameter sets the color map (you can choose from various predefined color maps), and fmt='d' specifies that the annotations should be displayed as integers.
