
Data Exploration,

Pre-processing & Visualization


Prepared By:
Piyush Kumar Soni
Contents
• Missing Values Treatment
• Handling Categorical data:
• Mapping ordinal features
• Encoding class labels
• Performing one-hot encoding on nominal features
• Outlier Detection and Treatment
• Feature Engineering
• Variable Transformation and Variable Creation
• Selecting meaningful features



Missing Values Treatment
• Missing data can arise from various places in data:
• A survey was conducted and some values were simply missed at random when
being entered into the computer.
• A respondent chooses not to respond to a question like 'Have you ever
recreationally used opioids?'.
• You decide to start collecting a new variable (due to new actions: like a
pandemic) partway through the data collection of a study.
• You want to measure the speed of meteors, and some observations are just
'too quick' to be measured properly.



Types of Missingness
• Missing Completely at Random (MCAR)
• Definition:
Data is MCAR when the missingness is independent of both the observed and
unobserved data. In other words, the likelihood of a value being missing is the same
for all data points, without any underlying pattern or reason.
• Characteristics:
• No systematic difference between the missing data and the observed data.
• Handling MCAR data does not introduce bias if the missing values are ignored or imputed.
• Example:
• A researcher conducts a survey, but some participants forget to answer a question due to
oversight. There’s no reason to believe that those who didn’t answer systematically differ
from those who did.
• In a sensor network, a random sensor fails temporarily, causing some missing temperature
readings.
• Handling MCAR:
• Simple imputation techniques or deletion (listwise or pairwise) can be applied without biasing
the results.



• Missing at Random (MAR)
• Definition:
Data is MAR when the missingness is related to the observed data but not to the
missing values themselves. The reason for the missing data can be explained by the
observed data.
• Characteristics:
• There is a systematic relationship between missingness and observed data.
• MAR requires more sophisticated imputation techniques to avoid bias.
• Example:
• In a medical study, patients who are older (observed variable) are less likely to respond to
certain survey questions, but the likelihood of missingness is unrelated to the unobserved
value of the missing response.
• In a customer database, high-income individuals (observed variable) may choose not to
disclose their spending habits (missing value).
• Handling MAR:
• Techniques like multiple imputation or model-based approaches (e.g., maximum likelihood
estimation) are suitable for handling MAR data.



• Missing Not at Random (MNAR)
• Definition:
Data is MNAR when the missingness depends on the value of the missing data itself or other
unobserved factors. The missingness introduces systematic bias because the missing data is
not random and cannot be explained by the observed data.
• Characteristics:
• MNAR data is the most challenging to handle because the reasons for missingness are tied to the missing
values themselves.
• Requires domain knowledge to model the missingness mechanism.
• Example:
• In a survey about income, people with very high or very low incomes may be less likely to report their
earnings due to privacy concerns. Here, the likelihood of missing data depends on the income itself
(missing variable).
• In a health study, patients with severe symptoms might be less likely to attend follow-up appointments,
leading to missing health outcome data.
• Handling MNAR:
• Requires external information, assumptions, or domain-specific knowledge to model the missingness
mechanism accurately.
• Sensitivity analysis or pattern-mixture models can be used.



Handling Missing Values
• Handling missing values is a crucial step in the data preprocessing phase for
machine learning. Missing data can adversely affect the performance and
accuracy of machine learning models. Here are some common techniques
for treating missing values:
• Identifying Missing Values:
• Start by identifying the missing values in your dataset. This can be done using
functions like isnull() or info() in pandas for Python, or summary() in R.
• Remove Missing Values:
• If the proportion of missing values for a particular feature is small and the missing
values are randomly distributed, you may choose to remove the rows with missing
values using the dropna() function. However, be cautious about removing too many
rows, as it may lead to loss of valuable information.
df.dropna(inplace=True)
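• A minimal illustrative sketch of identifying and removing missing values (the column names and values below are made up, not from the slides):
import pandas as pd

df = pd.DataFrame({"age": [25, None, 31], "city": ["Pune", "Delhi", None]})
print(df.isnull().sum())       # count of missing values per column
df.info()                      # non-null counts and dtypes per column
df_clean = df.dropna()         # drop rows that contain any missing value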



• Imputation:
• Imputation involves replacing missing values with estimated or calculated
values. Common imputation methods include:
• Mean/Median Imputation: Replace missing values with the mean or median of the
observed values in that column.
df['column_name'].fillna(df['column_name'].mean(), inplace=True)
• Mode Imputation: For categorical data, replace missing values with the mode (most
frequent value) of the column.
df['column_name'].fillna(df['column_name'].mode()[0], inplace=True)
• Forward Fill or Backward Fill: Use the values from the previous or next row to fill
missing values in time-series data.
df.ffill(inplace=True) # forward fill (older pandas: df.fillna(method='ffill', inplace=True))
df.bfill(inplace=True) # backward fill (older pandas: df.fillna(method='bfill', inplace=True))



• Predictive Modeling:
• Use machine learning algorithms to predict missing values based on the other
features in the dataset. This approach can be powerful but may require more
computational resources.
• Indicator Variables:
• Create an indicator variable to signify whether a value was missing. This way, you
retain information about the missingness, which may be useful for some models.
df['column_name_missing'] = df['column_name'].isnull().astype(int)
• Domain-Specific Imputation:
• Depending on the nature of the data, domain knowledge can be used to impute
missing values. For example, if data represents time series, missing values might be
imputed differently than in cross-sectional data.



• Multiple Imputation:
• Generate multiple imputations to account for uncertainty in the imputation
process. This involves creating several datasets with different imputed values
and combining the results.
• The choice of the method depends on the nature of your data and
the underlying assumptions of your analysis.
• It's essential to carefully evaluate the impact of missing value
treatment on your model and choose the method that best suits your
specific use case.
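• A hedged sketch of model-based imputation with scikit-learn's IterativeImputer, which models each feature from the others (the matrix X is illustrative; running it several times with sample_posterior=True and different seeds approximates multiple imputation):
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # required to unlock IterativeImputer
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0], [np.nan, 8.0]])
imputer = IterativeImputer(max_iter=10, random_state=0)
print(imputer.fit_transform(X))   # missing entries replaced by model predictions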



Choosing the Best Method
• Percentage of missing values: If less than 5%, deletion might be
acceptable.
• Pattern of missingness: Understanding the type (MCAR, MAR, MNAR)
is crucial.
• Variable importance: More important variables might warrant more
sophisticated imputation.
• Algorithm sensitivity: Some algorithms are more sensitive to missing
data than others.
• Domain knowledge: Insights into reasons for missingness can guide
appropriate methods.



Imputation through Modeling
• How do we use models to fill in missing data?

• How do we use models to fill in missing data? Using k-NN for k = 2?
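A minimal sketch of k-NN imputation with scikit-learn's KNNImputer, using k = 2 (the small matrix X is made up for illustration):
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0, np.nan],
              [3.0, 4.0, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])
imputer = KNNImputer(n_neighbors=2)      # fill each gap with the mean of the 2 nearest rows
print(imputer.fit_transform(X))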



• How do we use models to fill in missing data? Using linear regression?
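A hedged sketch of regression-based imputation: fit a linear model on the rows where the target column is observed, then predict the missing entries (the DataFrame and column names are hypothetical):
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({"height": [160, 192, 178, 157, 167],
                   "weight": [55, 90, None, 52, 68]})
known = df[df["weight"].notnull()]
missing = df[df["weight"].isnull()]

model = LinearRegression().fit(known[["height"]], known["weight"])
df.loc[df["weight"].isnull(), "weight"] = model.predict(missing[["height"]])
print(df)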



Handling Categorical Data
• Handling categorical data is an important aspect of data preprocessing in
machine learning.
• Categorical data represents variables that can take on one of a limited and
usually fixed number of possible values, such as colors, gender, or country
names.
• Types of categorical data:
• Nominal
• Ordinal
• Data that lacks any intrinsic order, such as colors, genders, or animal
species, is represented as nominal categorical data.
• Ordinal categorical data, by contrast, is information that is naturally ranked
or ordered, such as customer satisfaction levels or educational attainment.



• Encoding class labels:
• Encodes target labels with values between 0 and n_classes-1.



• It can also be used to transform non-numerical labels
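A minimal LabelEncoder sketch covering both cases (the label lists are illustrative, not from the slides):
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit([1, 2, 2, 6])
print(le.transform([1, 1, 2, 6]))          # [0 0 1 2]

le.fit(["paris", "paris", "tokyo", "amsterdam"])
print(list(le.classes_))                   # ['amsterdam', 'paris', 'tokyo']
print(le.transform(["tokyo", "paris"]))    # [2 1]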



• Mapping ordinal features:
• Ordinal coding is a popular technique for encoding categorical data where
each category is given a different numerical value based on its rank or order.
• The categories with the lowest values receive the smallest integers, while
those with the highest values receive the largest integers.
• This strategy is beneficial when the categories have a natural order, such as
ratings (poor, fair, good, outstanding) or educational attainment (high school,
college, graduate school).
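A hedged sketch of mapping an ordinal feature by hand so that the integers respect the intended order (the 'rating' column is hypothetical):
import pandas as pd

df = pd.DataFrame({"rating": ["poor", "good", "fair", "outstanding", "good"]})
order = {"poor": 0, "fair": 1, "good": 2, "outstanding": 3}
df["rating_encoded"] = df["rating"].map(order)
print(df)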

No Name Gender Blood Grade Height Study
1 Tom M O 56 160 Math
2 Harry M A 76 192 Math
3 John M A 45 178 English
4 Nancy F B 78 157 Biology
5 Mike M O 79 167 Math
6 Kate F AB 66 156 English
7 Mary F O 99 166 Science



No Name Gender Blood Grade Height Study
1 Tom 0 O 56 160 0
2 Harry 0 A 76 192 0
3 John 0 A 45 178 1
4 Nancy 1 B 78 157 2
5 Mike 0 O 79 167 0
6 Kate 1 AB 66 156 1
7 Mary 1 O 99 166 3



from sklearn.preprocessing import OrdinalEncoder
enc = OrdinalEncoder()
enc.fit(df[["Gender", "Study"]])
df[["Gender", "Study"]] = enc.transform(df[["Gender", "Study"]])
• Or
enc = OrdinalEncoder()
df[["Gender", "Study"]] = enc.fit_transform(df[["Gender", "Study"]])



• Performing one-hot encoding on nominal features:
• One-Hot Encoding is a technique used to convert categorical data into
numerical format.
• It creates a binary vector for each category in the dataset. The vector contains
a 1 for the category it represents and 0s for all other categories.
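A minimal sketch with pandas and scikit-learn (the 'Blood' column values are illustrative; the exact output layout may differ by version):
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"Blood": ["O", "A", "A", "B", "O", "AB", "O"]})
print(pd.get_dummies(df, columns=["Blood"]))        # one binary column per category

# sparse_output requires scikit-learn >= 1.2 (older versions use sparse=False)
enc = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
onehot = enc.fit_transform(df[["Blood"]])
print(enc.get_feature_names_out(["Blood"]))
print(onehot[:3])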

Outlier Detection and Treatment
• Outlier detection and treatment are important steps in the process of
building machine learning models.
• Outlier:
• Outliers are those data points that are significantly different from the rest of
the dataset.
• They are often abnormal observations that skew the data distribution.
• Arise due to inconsistent data entry, or erroneous observations.
• To ensure that the trained model generalizes well to the valid range of
test inputs, it’s important to detect and remove outliers.



• Outliers can negatively impact the performance of machine learning
models in several ways:
• Overfitting: Models can focus on fitting the outliers rather than the
underlying patterns in the majority of the data.
• Reduced accuracy: Outliers can pull the model’s predictions towards
themselves, leading to inaccurate predictions for other data points.
• Unstable models: The presence of outliers can make the model’s predictions
sensitive to small changes in the data.



Outlier Detection
• Statistical Methods
• Box Plot
• Z-Score
• IQR (Interquartile Range)
• Outlier Detection Using Percentile
• Distance from the mean
• Distance-based Methods
• Distance to Centroid
• Nearest Neighbors
• Density-Based Methods
• DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
• LOF (Local Outlier Factor)



Box Plot
• Box plots are a visual method to identify outliers.
• Box plots are one of the many ways to visualize data distribution.
• A box plot shows q1 (25th percentile), q2 (50th percentile or median)
and q3 (75th percentile) of the data, along with the whiskers at (q1 - 1.5*(q3 - q1))
and (q3 + 1.5*(q3 - q1)).
• Outliers, if any, are plotted as individual points beyond the whiskers.

IQR method
• IQR stands for interquartile range, which is the difference between q3
(75th percentile) and q1 (25th percentile). The IQR method computes a
lower bound and an upper bound to identify outliers.
Lower Bound = q1 - 1.5*IQR
Upper Bound = q3 + 1.5*IQR
• Any values below the lower bound or above the upper bound are
considered outliers.
• Example
• 27, 2, 22, 29, 19, 30, 32, 59, 52, 35
• 78, 74, 88, 90, 94, 90, 98, 80
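A small sketch applying the IQR rule to the first example list (quartile values can differ slightly depending on the interpolation method used):
import numpy as np

data = np.array([27, 2, 22, 29, 19, 30, 32, 59, 52, 35])
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(lower, upper)
print(data[(data < lower) | (data > upper)])   # values flagged as outliers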



Outlier Detection Using Percentile
• Define a custom range that accommodates all data points that lie
anywhere between 0.5 and 99.5 (for eg.) percentile of the dataset.
• Observations Outside this range are treated as outliers.
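A minimal sketch of the percentile rule (the 0.5 and 99.5 cut-offs follow the slide; the data array is randomly generated for illustration):
import numpy as np

data = np.random.normal(loc=50, scale=10, size=1000)
low, high = np.percentile(data, [0.5, 99.5])
outliers = data[(data < low) | (data > high)]
print(low, high, len(outliers))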



• The Z-score method is generally used when a variable's distribution looks
close to normal.
• A Z-score is the number of standard deviations a value of a variable is
away from the variable's mean.
Z-Score = (X - mean) / Standard deviation
• When the values of a normally distributed variable are converted to Z-scores,
the resulting distribution is the standard normal distribution, with
mean = 0 and standard deviation = 1.



• For data that is normally distributed, around 68.2% of the data will lie
within one standard deviation from the mean. Close to 95.4% and
99.7% of the data lie within two and three standard deviations from
the mean, respectively.
• One approach to outlier detection is to set the lower limit to three
standard deviations below the mean (μ - 3*σ), and the upper limit to
three standard deviations above the mean (μ + 3*σ). Any data point
that falls outside this range is detected as an outlier.
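A hedged sketch of Z-score detection with the 3-sigma rule (the data array is synthetic, with two planted extremes):
import numpy as np

data = np.append(np.random.normal(loc=50, scale=5, size=200), [95, 5])
z = (data - data.mean()) / data.std()
print(data[np.abs(z) > 3])     # points more than 3 standard deviations from the mean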

Distance from the mean
• Unlike the previous methods, this method considers multiple
variables in a data set to detect outliers.
• This method calculates the Euclidean distance of the data points from
their mean and converts the distances into absolute z-scores. Any z-
score greater than the pre-specified cut-off is considered to be an
outlier.
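A hedged multivariate sketch: compute each row's Euclidean distance from the feature-wise mean, z-score those distances, and flag large ones (X and the cut-off are illustrative):
import numpy as np

X = np.random.normal(size=(200, 3))
X[0] = [8, 8, 8]                                   # plant an obvious outlier
dist = np.linalg.norm(X - X.mean(axis=0), axis=1)  # Euclidean distance to the mean vector
z = np.abs((dist - dist.mean()) / dist.std())
print(np.where(z > 3)[0])                          # row indices flagged as outliers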



Outlier Treatment
• Removal:
• The simplest approach is to remove outliers from the dataset. However, this should
be done carefully, as it may lead to information loss.
• Transformation:
• Apply mathematical transformations to make the distribution more symmetrical,
such as logarithmic or square root transformations.
• Imputation:
• Replace outlier values with a measure of central tendency, such as mean, median, or
mode.
• Winsorizing:
• Replace extreme values with values within a certain percentile range. This helps to
limit the impact of outliers without entirely removing them.



• Binning:
• Group outlier values into a specific bin or category, treating them as a
separate group.
• Model-Specific Approaches:
• Some models have built-in methods to handle outliers. For example, decision
trees and random forests are often less sensitive to outliers.
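As one concrete treatment, a hedged sketch of winsorizing by clipping values to the 5th and 95th percentiles (the data and limits are illustrative):
import numpy as np

data = np.array([27, 2, 22, 29, 19, 30, 32, 59, 52, 35])
low, high = np.percentile(data, [5, 95])
winsorized = np.clip(data, low, high)      # extreme values pulled back to the percentile limits
print(winsorized)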



Feature Engineering
• Feature engineering is the process of transforming raw data into features
that are suitable for machine learning models.
• In other words, it is the process of selecting, extracting, and transforming
the most relevant features from the available data to build more accurate
and efficient machine learning models.
• The success of machine learning models heavily depends on the quality of
the features used to train them.
• Feature engineering involves a set of techniques that enable us to create
new features by combining or transforming the existing ones.
• These techniques help to highlight the most important patterns and
relationships in the data, which in turn helps the machine learning model
to learn from the data more effectively.

What is a Feature?
• In the context of machine learning, a feature (also known as a
variable or attribute) is an individual measurable property or
characteristic of a data point that is used as input for a machine
learning algorithm.
• Features can be numerical, categorical, or text-based, and they
represent different aspects of the data that are relevant to the
problem at hand.
• The choice and quality of features are critical in machine learning, as
they can greatly impact the accuracy and performance of the model.



Feature Creation
• Feature Creation is the process of generating new features based on
domain knowledge or by observing patterns in the data.
• It is a form of feature engineering that can significantly improve the
performance of a machine-learning model.
• Types of Feature Creation:
• Domain-Specific: Creating new features based on domain knowledge, such as
creating features based on business rules or industry standards.
• Data-Driven: Creating new features by observing patterns in the data, such as
calculating aggregations or creating interaction features.
• Synthetic: Generating new features by combining existing features or
synthesizing new data points.



Feature Transformation
• Feature Transformation is the process of transforming the features
into a more suitable representation for the machine learning model.
• This is done to ensure that the model can effectively learn from the
data.
• Types of Feature Transformation
• Transformation: Transforming the features using mathematical operations to
change the distribution or scale of the features. Examples are logarithmic,
square root, and reciprocal transformations.
• Encoding Categorical Variables: Converting categorical variables into a
numerical format that can be fed into machine learning algorithms. Common
methods include one-hot encoding, label encoding, or target encoding.

• Scaling and Normalization: Scaling features to ensure that they are on similar
scales. Common methods include MaxAbs scaling, Min-Max scaling, Standard
scaling (Z-score normalization), and robust scaling.
• Binning or Discretization: Grouping continuous numerical features into
discrete bins or intervals. This can help capture non-linear relationships and
patterns in the data.
• Polynomial Features: Introducing interaction terms or polynomial features to
capture non-linear relationships between variables.
• Text Data Features: Extracting features from text data, such as word
frequency, TF-IDF scores, or word embeddings.



Feature Selection
• Feature Selection is the process of selecting a subset of relevant features
from the dataset to be used in a machine-learning model.
• It is an important step in the feature engineering process as it can have a
significant impact on the model’s performance.
• Common feature selection techniques include:
• Filter methods:
• Assess feature relevance independent of the model using statistical measures.
• Examples: correlation analysis, chi-square test, ANOVA, mutual information.
• Wrapper methods:
• Evaluate different feature subsets using the performance of a specific machine
learning model.
• Examples: forward selection, backward elimination, recursive feature elimination
(RFE).



• Embedded methods:
• Incorporate feature selection into the model training process.
• Examples: L1 regularization (Lasso regression), tree-based models (random
forests, decision trees) that inherently provide feature importance scores.
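A minimal sketch of a filter method (SelectKBest) and an embedded method (L1-regularized logistic regression) on a toy dataset; the number of kept features is arbitrary:
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
selector = SelectKBest(score_func=f_classif, k=2)   # keep the 2 most relevant features
X_new = selector.fit_transform(X, y)
print(X_new.shape)                                  # (150, 2)

model = LogisticRegression(penalty="l1", solver="liblinear").fit(X, y)
print(model.coef_)                                  # coefficients shrunk to zero indicate dropped features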



Libraries
Scikit-learn for Pre-processing
• The sklearn.preprocessing package provides several common utility
functions and transformer classes to change raw feature vectors into a
representation that is more suitable for the downstream estimators.
• Standardization, or mean removal and variance scaling-
• Standardization of datasets is a common requirement for many machine learning
estimators implemented in scikit-learn; they might behave badly if the individual
features do not more or less look like standard normally distributed data: Gaussian
with zero mean and unit variance.
• In practice we often ignore the shape of the distribution and just transform the data
to center it by removing the mean value of each feature, then scale it by dividing
non-constant features by their standard deviation.
• The preprocessing module provides the StandardScaler utility class, which is a quick
and easy way to perform the following operation on an array-like dataset:
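A minimal StandardScaler sketch (the small array X_train is illustrative):
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, -1.0, 2.0],
                    [2.0,  0.0, 0.0],
                    [0.0,  1.0, -1.0]])
scaler = StandardScaler().fit(X_train)
X_scaled = scaler.transform(X_train)
print(scaler.mean_)            # per-feature mean that was removed
print(X_scaled.mean(axis=0))   # ~0 for every feature
print(X_scaled.std(axis=0))    # 1 for every feature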

• Scaling features to a range-
• An alternative standardization is scaling features to lie between a given
minimum and maximum value, often between zero and one, or so that the
maximum absolute value of each feature is scaled to unit size. This can be
achieved using MinMaxScaler or MaxAbsScaler, respectively.
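A hedged sketch of both scalers on the same illustrative array:
import numpy as np
from sklearn.preprocessing import MinMaxScaler, MaxAbsScaler

X = np.array([[1.0, -1.0, 2.0],
              [2.0,  0.0, 0.0],
              [0.0,  1.0, -1.0]])
print(MinMaxScaler().fit_transform(X))   # each feature rescaled to [0, 1]
print(MaxAbsScaler().fit_transform(X))   # each feature divided by its maximum absolute value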

• Normalization-
• Normalization is the process of scaling individual samples to have unit norm.
• The function normalize provides a quick and easy way to perform this
operation on a single array-like dataset, either using the l1, l2, or max norms:
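A minimal sketch with the normalize function (the array values are illustrative):
import numpy as np
from sklearn.preprocessing import normalize

X = np.array([[4.0, 1.0, 2.0, 2.0],
              [1.0, 3.0, 9.0, 3.0],
              [5.0, 7.0, 5.0, 1.0]])
X_l2 = normalize(X, norm="l2")        # each row rescaled to unit Euclidean length
print(np.linalg.norm(X_l2, axis=1))   # [1. 1. 1.]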



• Encoding categorical features-
• Label encoding
• Ordinal encoding
• One hot encoding
• Target encoding



• Discretization-
• Discretization (otherwise known as quantization or binning) provides a way to
partition continuous features into discrete values.
• Certain datasets with continuous features may benefit from discretization
because discretization can transform the dataset of continuous attributes to
one with only nominal attributes.
• K-bins discretization-



• After fitting, the discretizer exposes the learned bin intervals through its
bin_edges_ attribute.
• Based on these bin intervals, transform(X) replaces each value with the index
of the bin it falls into (with encode='ordinal'), as the sketch below illustrates.
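A hedged K-bins discretization sketch (the matrix X and the bin counts are illustrative):
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[-3.0, 5.0, 15.0],
              [ 0.0, 6.0, 14.0],
              [ 6.0, 3.0, 11.0]])
est = KBinsDiscretizer(n_bins=[3, 2, 2], encode="ordinal").fit(X)
print(est.bin_edges_)      # interval boundaries learned per feature
print(est.transform(X))    # each value replaced by the index of its bin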



• Feature binarization-
• Feature binarization is the process of thresholding numerical features
to get boolean values. This can be useful for downstream probabilistic
estimators that assume that the input data is distributed according to
a multi-variate Bernoulli distribution.
• It is also common among the text processing community to use
binary feature values (probably to simplify the probabilistic reasoning)
even if normalized counts (a.k.a. term frequencies) or TF-IDF valued
features often perform slightly better in practice.

• It is possible to adjust the threshold of the binarizer:
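A minimal Binarizer sketch showing the default threshold of 0.0 and a custom one (the array is illustrative):
import numpy as np
from sklearn.preprocessing import Binarizer

X = np.array([[1.0, -1.0, 2.0],
              [2.0,  0.0, 0.0],
              [0.0,  1.0, -1.0]])
print(Binarizer().fit_transform(X))               # values > 0.0 become 1, the rest 0
print(Binarizer(threshold=1.1).fit_transform(X))  # only values above 1.1 become 1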



• Imputation of missing values-
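scikit-learn's SimpleImputer covers the basic strategies described earlier; a minimal sketch (array values are illustrative):
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0]])
imp = SimpleImputer(strategy="mean")     # also supports "median", "most_frequent", "constant"
print(imp.fit_transform(X))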
• Generating polynomial features-
• Often it's useful to add complexity to a model by considering
nonlinear features of the input data.
• A simple and common method is polynomial features, which generate
higher-order and interaction terms of the features. This is implemented
in PolynomialFeatures.

• In some cases, only interaction terms among features are required;
these can be obtained by setting interaction_only=True:
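A minimal PolynomialFeatures sketch, including the interaction_only option (X is illustrative):
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.arange(6).reshape(3, 2)                    # [[0, 1], [2, 3], [4, 5]]
print(PolynomialFeatures(degree=2).fit_transform(X))
# columns: 1, x1, x2, x1^2, x1*x2, x2^2
print(PolynomialFeatures(degree=2, interaction_only=True).fit_transform(X))
# columns: 1, x1, x2, x1*x2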



• Custom transformers-
• You can implement a transformer from an arbitrary function with
FunctionTransformer. For example, to build a transformer that applies a log
transformation in a pipeline, do:
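A minimal FunctionTransformer sketch applying a log transformation (log1p is used here so zero values are handled; the array is illustrative):
import numpy as np
from sklearn.preprocessing import FunctionTransformer

transformer = FunctionTransformer(np.log1p, validate=True)
X = np.array([[0.0, 1.0], [2.0, 3.0]])
print(transformer.transform(X))    # element-wise log(1 + x)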



Matplotlib for Data Visualization
• Data Visualization is the process of presenting data in the form of
graphs or charts.
• It makes large and complex datasets easier to understand, allows
decision-makers to make decisions more efficiently, and makes new trends and
patterns easier to identify.
• Matplotlib-
• Matplotlib is a low-level Python library used for data visualization.
• It is easy to use and emulates MATLAB-style graphs and visualization.
• The library is built on top of NumPy arrays and provides several plot types,
such as line charts, bar charts, and histograms.



• Import Essential Packages
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
%matplotlib inline
import os



• Pyplot is a Matplotlib module that provides a MATLAB-like interface.
Matplotlib is designed to be as usable as MATLAB, with the ability to
use Python and the advantage of being free and open-source.
• Each pyplot function makes some change to a figure: e.g., creates a
figure, creates a plotting area in a figure, plots some lines in a plotting
area, decorates the plot with labels, etc.
• The various plots we can utilize using Pyplot are Line Plot, Histogram,
Scatter, 3D Plot, Image, Contour, and Polar.



• How to create a simple plot?
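A minimal pyplot sketch (the x and y values are illustrative):
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
plt.plot(x, y)        # draw a simple line plot
plt.show()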



• Adding Title-



• Adding X Label and Y Label-



• Adding Legends-
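One hedged sketch covering the title, axis labels, and legend mentioned above (the data is illustrative):
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
plt.plot(x, [v ** 2 for v in x], label="squares")
plt.plot(x, [v ** 3 for v in x], label="cubes")
plt.title("Simple Line Plot")     # adding a title
plt.xlabel("x value")             # adding an x label
plt.ylabel("y value")             # adding a y label
plt.legend()                      # adding a legend from the line labels
plt.show()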



• The Axes class is the most basic and flexible unit for creating sub-plots. A
given figure may contain many Axes, but a given Axes object can only be
present in one figure.
• The plt.subplots() method is the best way to handle several subplots
at once.
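A minimal plt.subplots() sketch with two Axes in one figure (the data is illustrative):
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(8, 3))
ax1.plot(x, [v ** 2 for v in x])
ax1.set_title("squares")
ax2.plot(x, [v ** 3 for v in x])
ax2.set_title("cubes")
plt.tight_layout()
plt.show()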

• Creating a pie chart-
• To create a pie chart using matplotlib, refer to the pie() function.



• Labels-
• Adding labels to a pie chart is straightforward: just pass a list of
strings with labels corresponding to the list of values:
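A minimal pie() sketch with labels (the values and labels are illustrative):
import matplotlib.pyplot as plt

sizes = [35, 25, 25, 15]
labels = ["Math", "English", "Biology", "Science"]
plt.pie(sizes, labels=labels, autopct="%1.1f%%")   # show each slice's share
plt.show()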

• A heatmap is a graph that extensively uses color for data visualization:
each cell's color encodes the value for that combination of the independent variables.



• Adding labels-
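A hedged heatmap sketch built with imshow, including tick labels and a colorbar (the matrix and labels are illustrative):
import numpy as np
import matplotlib.pyplot as plt

data = np.random.rand(4, 4)
fig, ax = plt.subplots()
im = ax.imshow(data, cmap="viridis")
ax.set_xticks(range(4))
ax.set_xticklabels(["A", "B", "C", "D"])   # adding x labels
ax.set_yticks(range(4))
ax.set_yticklabels(["w", "x", "y", "z"])   # adding y labels
fig.colorbar(im)                           # color scale legend
plt.show()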



• A bar chart is a diagram where variables are represented as
rectangular bars; the taller or longer the bar, the higher the value it
represents.
• Usually, one axis of a bar chart represents a category, and the other is
its value.
• A bar chart is used to compare discrete data, such as occurrences or
proportions.
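A minimal bar chart sketch (the categories and counts are illustrative):
import matplotlib.pyplot as plt

subjects = ["Math", "English", "Biology", "Science"]
counts = [3, 2, 1, 1]
plt.bar(subjects, counts)
plt.xlabel("Subject")
plt.ylabel("Number of students")
plt.show()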

• Plotting multiple bars next to each other can come in handy when we
need to compare two or more data series that share categories.
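A hedged sketch of grouped (side-by-side) bars using an offset on the x positions (the data is illustrative):
import numpy as np
import matplotlib.pyplot as plt

labels = ["Q1", "Q2", "Q3", "Q4"]
series_a = [20, 34, 30, 35]
series_b = [25, 32, 34, 20]
x = np.arange(len(labels))
width = 0.35
plt.bar(x - width / 2, series_a, width, label="Product A")
plt.bar(x + width / 2, series_b, width, label="Product B")
plt.xticks(x, labels)
plt.legend()
plt.show()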

• Stacked bar plot
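A minimal stacked bar sketch, where the second series is drawn on top of the first via the bottom argument (the data is illustrative):
import matplotlib.pyplot as plt

labels = ["Q1", "Q2", "Q3", "Q4"]
series_a = [20, 34, 30, 35]
series_b = [25, 32, 34, 20]
plt.bar(labels, series_a, label="Product A")
plt.bar(labels, series_b, bottom=series_a, label="Product B")   # stack on top of series_a
plt.legend()
plt.show()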

• Histogram-
• A histogram is a graphical display of data that organizes groups of
data points into ranges. These ranges are represented by bars.
• It resembles a bar chart, but it's not quite the same. The key
difference is that you use a bar chart for categorical data
representation, while a histogram displays only numerical data.

• Changing bins-
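A minimal histogram sketch, including the bins argument mentioned above (the data is randomly generated for illustration):
import numpy as np
import matplotlib.pyplot as plt

data = np.random.normal(loc=170, scale=10, size=500)
plt.hist(data, bins=20, edgecolor="black")   # change bins to control the number of ranges
plt.xlabel("Height (cm)")
plt.ylabel("Frequency")
plt.show()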

• Scatter Plots-
• A scatter plot is a visualization of how two variables relate to each other by
plotting individual data points. It is widely used for its simplicity in building a chart.
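A minimal scatter plot sketch (the data is illustrative):
import numpy as np
import matplotlib.pyplot as plt

height = np.random.normal(170, 10, 100)
weight = 0.5 * height + np.random.normal(0, 5, 100)
plt.scatter(height, weight)
plt.xlabel("Height (cm)")
plt.ylabel("Weight (kg)")
plt.show()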



• Box plot-
• A box plot (also known as a box-and-whisker plot) is a convenient way to
visualize the distribution of numerical data using quartiles. Box plots are
widespread in descriptive statistics; they allow you to quickly explore one or
more datasets.
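A minimal box plot sketch comparing two illustrative samples:
import numpy as np
import matplotlib.pyplot as plt

group_a = np.random.normal(50, 10, 200)
group_b = np.random.normal(60, 15, 200)
plt.boxplot([group_a, group_b], labels=["Group A", "Group B"])   # very new Matplotlib uses tick_labels
plt.ylabel("Score")
plt.show()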

• Stack Plots-
• A stack plot is basically like a pie-chart, only over time.
• Let's consider a situation where we have 24 hours in a day, and we'd like to
see how we're spending our time. We'll divide our activities into: Sleeping,
eating, working, and playing.
• We're going to assume that we're tracking this over the course of 5 days, so
our starting data will look like:
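A hedged stack plot sketch; the hour values below are made up so that each day sums to 24, they are not the slide's original numbers:
import matplotlib.pyplot as plt

days = [1, 2, 3, 4, 5]
sleeping = [7, 8, 6, 11, 7]
eating = [2, 3, 4, 3, 2]
working = [7, 8, 7, 2, 2]
playing = [8, 5, 7, 8, 13]
plt.stackplot(days, sleeping, eating, working, playing,
              labels=["Sleeping", "Eating", "Working", "Playing"])
plt.legend(loc="upper left")
plt.xlabel("Day")
plt.ylabel("Hours")
plt.show()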

Pandas for Exploratory Data Analysis
• Exploratory data analysis (EDA) is a vital initial step of any data
analysis or machine learning project. It is necessary for:
• getting an overall understanding of the data, including first insights
• identifying the size of the dataset, its structure, and the features that are
crucial for the project goal
• gathering fundamental statistics of the data
• detecting potential issues to fix (such as missing values, duplicates, or
outliers)



• head() and tail()
• By default, the head() method returns the first five rows and tail() the last five
rows of a dataframe or a series. To return a different number of rows, we need
to pass in that number.
• sample()
• By default, it returns a random row of a dataframe or a series. To return a
certain number of random rows, we need to pass in that number.
• shape
• For a dataframe, it returns a tuple with the number of rows and columns. For
a series, it returns a one-element tuple with the number of rows.



• size
• Returns the number of elements in a dataframe or a series. For a series object, it makes more
sense to use size rather than shape. The obtained information is the same in both cases, but
size returns it in a more handy form – as an integer rather than a one-element tuple.
• info()
• Returns overall information about a dataframe, including the index data type, the number of
rows and columns, column names, indices, and data types, the number of non-null values by
column, and memory usage.
• describe()
• Returns the major statistics of a dataframe or a series, including the number of non-null
values, the minimum, maximum, and mean values, and percentiles. For a dataframe, it
returns the information by column, and by default, only for numeric columns. To include the
statistics for the columns of an object type as well, we need to pass in include='all'. For object
columns, the method returns the number of non-null values, the number of unique values,
the most frequent value, and the number of times it is encountered in the corresponding
column.
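A short sketch tying these first-look methods together (the tiny DataFrame is made up for illustration):
import pandas as pd

df = pd.DataFrame({"name": ["Tom", "Nancy", "Mike"],
                   "grade": [56, 78, 79]})
print(df.head())                    # first rows (up to five by default)
print(df.sample(2))                 # two random rows
print(df.shape, df.size)            # (rows, columns) and total number of elements
df.info()                           # dtypes, non-null counts, memory usage
print(df.describe(include="all"))   # summary statistics, object columns included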



• dtypes
• Returns the data types of a dataframe by column. If a column contains mixed data
types or if all its values are None, the returned data type of that column will be an
object. This also includes a special case of a column containing booleans and null
(NaN) or None values.
• For a series, we can interchangeably use either dtypes or dtype for EDA in pandas.
• select_dtypes()
• Returns a subset of the columns of a dataframe based on the provided column data
type (or types). We have to specify the data type (or types) either to include into the
subset (using the include parameter) or exclude from it (exclude).
• Some typical data types are number, int, float, object, bool, category, and datetime.
To specify several data types to include or exclude, we pass in a list of those data
types.
• columns
• Returns the column names of a dataframe.
• count()
• Returns the count of non-null values in a dataframe or a series. For a
dataframe, by default, returns the results by column. Passing in axis=1 or
axis='columns' will give the results by row.
• unique() and nunique()
• The unique() method returns the unique values of a series, while nunique()
returns the number of unique values in a dataframe or a series.
• For a dataframe, nunique(), by default, returns the results by column.
Passing in axis=1 or axis='columns' will give the results by row.
• is_unique
• Returns True if all the values in a series are unique.



• isnull() and isna()
• Both methods return a boolean same-sized object showing which values are null
(True) and which are not (False). These functions apply to both series and
dataframes.
• isnull() and isna() work best when chained with sum() (e.g., df.isnull().sum())
returning the number of null values by column for a dataframe or their total number
for a series. For a dataframe, the method chaining df.isnull().sum().sum() gives the
total number of null values.
• hasnans
• Returns True if a series contains at least one null value.
• value_counts()
• Returns the count of each unique value in a series. By default, the counts are not
normalized, they are sorted in descending order, and null values are excluded. To
override these defaults, we can set the optional parameters normalize, ascending, and
dropna accordingly.



• nsmallest() and nlargest()
• By default, nsmallest() returns the five smallest and nlargest() the five largest values of a
series, together with their indices. To return a different number of values, we need to
pass in that number.
• corr()
• This method of EDA in pandas applies both to dataframes and series, but in a slightly
different way. For a dataframe (df.corr()), it returns column pairwise correlation, excluding
null values. For a series (Series1.corr(Series2)), this method returns the correlation of that
series with another one, excluding null values.
• plot()
• Allows the creation of simple plots of various kinds for a data frame or a series. The main
parameters are x, y, and kind. The popular types of supported plots are line, bar, barh, hist,
box, area, density, pie, and scatter.
• In general, pandas is not the best choice for creating compelling visualizations in Python.
However, for the purposes of EDA in pandas, the plot() method works just fine.



• Sorting Values
• Sorting your data according to a certain column can also be useful in EDA. For
example, you might want to sort your data by a 'population' column to see
which countries have the highest populations. In Pandas, you can use the
sort_values() function to sort your DataFrame:
sorted_df = df.sort_values(by='population', ascending=False)
• This will return a new DataFrame sorted by the 'population' column in
descending order. The ascending=False argument sorts the column in
descending order. If you want to sort in ascending order, you can omit this
argument as True is the default value.



• Grouping Data
• Grouping your data based on certain criteria can provide valuable insights. For
example, you might want to group your data by 'continent' to analyze the
data at the continent level. In Pandas, you can use the groupby() function to
group your data:
grouped_df = df.groupby('continent').mean(numeric_only=True)
• This will return a new DataFrame where the data is grouped by the 'continent'
column, and the values in each group are the mean values of the original data
in that group.



• Filtering Data Based on Data Types
• Sometimes, you might want to perform operations only on columns of a
certain data type. For example, you might want to calculate statistical
measures like mean, median, etc., only on numerical columns. In such cases,
you can filter the columns based on their data types.
• In Pandas, you can use the select_dtypes() function to select columns of a
specific data type:
numeric_df = df.select_dtypes(include='number')
• This will return a new DataFrame containing only the columns with numerical
data. Similarly, you can select columns with object (string) data type as
follows:
object_df = df.select_dtypes(include='object')



• Applying Functions to Cells, Columns and Rows
• To apply functions to each column, use apply():
df.apply(np.max)
• The apply method can also be used to apply a function to each row. To do this,
specify axis=1. Lambda functions are very convenient in such scenarios. For example,
if we need to select all states starting with ‘W’, we can do it like this:
df[df["State"].apply(lambda state: state[0] == "W")].head()
• The map method can be used to replace values in a column by passing a
dictionary of the form {old_value: new_value} as its argument:
d = {"No": False, "Yes": True}
df["International plan"] = df["International plan"].map(d)
df.head()



NumPy for Statistical Analysis
• numpy.amin() and numpy.amax()
• These functions return the minimum and the
maximum from the elements in the given array along
the specified axis.
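A minimal sketch (the array is illustrative):
import numpy as np

a = np.array([[3, 7, 5], [8, 4, 3], [2, 4, 9]])
print(np.amin(a), np.amax(a))                  # overall minimum and maximum
print(np.amin(a, axis=0), np.amax(a, axis=1))  # per-column minima, per-row maxima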



• numpy.ptp()
• The numpy.ptp() function returns the range (maximum-minimum) of values
along an axis.
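A minimal sketch with the same illustrative array:
import numpy as np

a = np.array([[3, 7, 5], [8, 4, 3], [2, 4, 9]])
print(np.ptp(a))           # range of all elements (max - min)
print(np.ptp(a, axis=1))   # range of each row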



• numpy.percentile()
• Percentile (or a centile) is a measure used in statistics indicating the value
below which a given percentage of observations in a group of observations
fall. The function numpy.percentile() takes the following arguments.
numpy.percentile(a, q, axis)
Sr.No.   Argument   Description
1        a          Input array
2        q          The percentile to compute; must be between 0 and 100
3        axis       The axis along which the percentile is to be calculated
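A minimal sketch (the array is illustrative):
import numpy as np

a = np.array([[30, 40, 70], [80, 20, 10], [50, 90, 60]])
print(np.percentile(a, 50))           # 50th percentile over the whole array
print(np.percentile(a, 50, axis=0))   # 50th percentile down each column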
• numpy.median()
• Median is defined as the value separating the higher half of a data sample
from the lower half. The numpy.median() function is used as shown in the
following program.
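A minimal sketch (the array is illustrative):
import numpy as np

a = np.array([[30, 65, 70], [80, 95, 10], [50, 90, 60]])
print(np.median(a))           # median of the flattened array
print(np.median(a, axis=0))   # median of each column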



• numpy.mean()
• Arithmetic mean is the sum of elements along an axis divided by the number
of elements. The numpy.mean() function returns the arithmetic mean of
elements in the array. If the axis is mentioned, it is calculated along it.
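A minimal sketch (the array is illustrative):
import numpy as np

a = np.array([[1, 2, 3], [3, 4, 5], [4, 5, 6]])
print(np.mean(a))           # mean of all elements
print(np.mean(a, axis=0))   # mean of each column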



• Standard Deviation
import numpy as np
print(np.std([1, 2, 3, 4]))   # 1.118033988749895
• Variance
import numpy as np
print(np.var([1, 2, 3, 4]))   # 1.25
