
Unit-5

Wrangling Data: Stretching Python’s Capabilities and Exploring Data Analysis

Subject: Python for Data Science
1 Stretching Python’s Capabilities
This chapter focuses on extending Python’s functionality with libraries and techniques
to optimize performance and handle large datasets efficiently.

1.1 Playing with Scikit-learn


Scikit-learn is one of the most popular machine learning libraries in Python, used for
implementing algorithms such as classification, regression, clustering, dimensionality
reduction, and model evaluation.
Key Features:

• Preprocessing: Scikit-learn provides tools for scaling, encoding, and transforming data.

• Model Training: Offers implementations of common machine learning models (e.g., SVM, Random Forest, Linear Regression).

• Model Evaluation: Provides tools for cross-validation, accuracy metrics, and hyperparameter tuning.

Example:

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

1.2 What is Scikit-learn?


• Scikit-learn is a Python library built on NumPy, SciPy, and Matplotlib, designed for machine learning.

• It provides tools for classification, regression, clustering, dimensionality reduction, model selection, and preprocessing.

• Emphasizes simplicity, scalability, and performance for machine learning workflows.

1.3 Core Features of Scikit-learn


• Classification: Identifying the category to which an object belongs (e.g., spam detection).

• Regression: Predicting continuous values (e.g., predicting house prices).

• Clustering: Grouping similar objects (e.g., customer segmentation).

• Dimensionality Reduction: Reducing the number of features while retaining essential information.

• Model Selection: Tuning hyperparameters, cross-validation, grid search.

• Preprocessing: Scaling, encoding, transforming raw data.

1.4 Installing and Importing Scikit-learn
To install Scikit-learn, run:
pip install scikit-learn

Example imports:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score

2 Understanding Classes in Scikit-learn


2.1 Classes in Scikit-learn
• Scikit-learn is built on an object-oriented architecture.

• Each machine learning algorithm is represented as a class.

2.2 Main Components of a Scikit-learn Class


• Estimator: Object that estimates parameters based on data (e.g., LinearRegression, SVC).

• fit(): Trains the model.

• predict(): Makes predictions based on the trained model.

• transform(): Transforms data (often used in preprocessing).

• score(): Evaluates the performance of the model.

2.3 Example Workflow Using Classes


# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import r2_score

# Load the California housing dataset
data = fetch_california_housing()
X = data.data    # Features
y = data.target  # Target variable

# Split the data into training and testing sets (an 80/20 split is assumed here)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the Linear Regression model
model = LinearRegression()

# Train the model
model.fit(X_train, y_train)

# Make predictions on the test set
predictions = model.predict(X_test)

# Evaluate the model using the R^2 score
score = r2_score(y_test, predictions)
print(f"Model R^2 score: {score}")

2.4 Pipelines in Scikit-learn


A pipeline is a sequence of transformers and a final estimator:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ('scaler', StandardScaler()),    # Transform step
    ('model', LinearRegression())    # Estimator step
])

pipe.fit(X_train, y_train)
predictions = pipe.predict(X_test)

3 Defining Applications for Data Science


3.1 Classification
• Objective: Categorize data into predefined classes.

• Applications:

– Spam detection: Categorizing emails as spam or not spam.


– Image classification: Identifying objects in images.
– Medical diagnosis: Predicting diseases based on symptoms.

3.2 Regression
• Objective: Predict continuous outcomes.

• Applications:

– Predicting housing prices.
– Stock market prediction.
– Energy consumption prediction.

3.3 Clustering
• Objective: Group similar data points together.

• Applications:

– Customer segmentation.
– Fraud detection.
– Document clustering.

3.4 Dimensionality Reduction


• Objective: Reduce the number of variables in data.

• Applications:

– Reducing features in genetics datasets.


– Data visualization in 2D/3D.
– Speeding up machine learning models.

3.5 Preprocessing and Feature Engineering


• Objective: Clean and transform data for machine learning.

• Applications:

– Normalizing income data for credit scoring.


– One-hot encoding for categorical variables.
– Scaling numerical data.

3.6 Model Evaluation and Selection


• Objective: Compare and select the best model.

• Applications:

– Cross-validation for selecting models.


– Hyperparameter tuning (e.g., selecting the best k for KNN).
– Grid search for finding optimal parameters.

4 Performing the Hashing Trick
The Hashing Trick:

• The Hashing Trick transforms large amounts of data (such as text) into fixed-size feature vectors.

• It is commonly used in Natural Language Processing (NLP) to map high-dimensional data into a smaller vector space.

• A hash function is used to map each feature into an index in a vector.

Advantages:

• Memory efficiency: Reduces memory usage by avoiding the storage of large feature dictionaries.

• Speed: Faster than traditional methods for large datasets.

• Scalability: Can handle very large datasets using fixed-size vectors.

5 Using Hash Functions


Hash Functions:

• Hash functions map input data to a fixed-size integer value (hash).

• They are deterministic: the same input always gives the same hash (see the short example below).
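As a short illustration, the sketch below reduces a hash value to a fixed-size index with the modulo operator; the vector size of 10 is an arbitrary choice. Note that Python’s built-in hash() for strings is randomized between interpreter runs unless PYTHONHASHSEED is set, so it is deterministic only within a single run.

def feature_index(token, size=10):
    # Map an arbitrary token to one of `size` possible indices.
    return hash(token) % size

print(feature_index("sample"))  # same index every time within one run
print(feature_index("text"))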

6 Demonstrating the Hashing Trick


How it Works:

• Choose a hash function to map each feature (word) to an index.

• The feature vector is reduced in dimensionality by using these indices.

• Collisions (multiple features mapping to the same index) may occur (a worked example follows).
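The following minimal sketch builds a fixed-size count vector by hand for one sentence. The vector size of 8 and the use of Python’s built-in hash() are illustrative choices, not the exact scheme used by scikit-learn’s HashingVectorizer.

import numpy as np

def hashing_trick(text, vector_size=8):
    # Start from an all-zero vector of fixed length.
    vector = np.zeros(vector_size)
    for word in text.lower().split():
        # Map each word to an index; different words may collide on the same index.
        index = hash(word) % vector_size
        vector[index] += 1
    return vector

print(hashing_trick("the quick brown fox jumps over the lazy dog"))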

7 CountVectorizer vs HashingVectorizer
7.1 CountVectorizer
• Converts a collection of text documents into a matrix of token counts.

• Each word (token) is a feature, and the number of times it appears in a document is counted.

Python Code Example:

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["This is a sample text", "Another sample text"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
print(X.toarray())

7.2 HashingVectorizer
• Similar to CountVectorizer, but uses a hash function to map words into a fixed-size vector.

• Saves memory and is ideal for very large datasets.

Python Code Example:

from sklearn.feature_extraction.text import HashingVectorizer

corpus = ["This is a sample text", "Another sample text"]

vectorizer = HashingVectorizer(n_features=10)
X = vectorizer.fit_transform(corpus)
print(X.toarray())

8 Considering Timing and Performance


Importance of Timing and Performance:
• Execution time and memory usage are critical for machine learning tasks.

• Python provides tools like time and timeit to benchmark performance.

8.1 Benchmarking with %time and %%timeit


%time Command:

• Measures the execution time of a single line of code.

Python Code Example:

%time sum(range(1000000))

%%timeit Command:

• Benchmarks the execution time of a code block.

• Runs the code multiple times to get an average execution time.

Python Code Example:

%%timeit
total = 0
for i in range(1000000):
    total += i

9 Working with the Memory Profiler
Memory Profiling:

• Memory profiling tracks how much memory your code uses.

• The memory_profiler package helps identify memory bottlenecks.

Python Code Example:

# Install memory_profiler with: pip install memory-profiler
from memory_profiler import memory_usage

def my_function():
    a = [i for i in range(10000)]
    return a

print(memory_usage(my_function))

10 Running in Parallel on Multiple Cores


Parallel Computing:

• Parallel computing splits tasks into smaller parts and runs them across multiple cores.

• This reduces execution time and improves performance, especially for large datasets.

11 Demonstrating Multiprocessing
What is Multiprocessing?

• Multiprocessing allows multiple processes to run simultaneously, bypassing Python’s Global Interpreter Lock (GIL).

• The multiprocessing module in Python is used to implement multiprocessing.

Python Code Example:

from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == '__main__':
    with Pool(4) as pool:
        results = pool.map(square, [1, 2, 3, 4, 5])
    print(results)

12 Exploring Data Analysis
Exploratory Data Analysis (EDA) is a critical step in understanding and preparing data
before applying machine learning algorithms. This chapter covers techniques to explore
and summarize data.

12.1 The EDA Approach


Goal: EDA aims to uncover patterns, spot anomalies, test hypotheses, and check assumptions using summary statistics and graphical representations. It is an approach to analyzing datasets in order to summarize their main characteristics, often with visual methods. The goals of EDA are to:

• Understand the data’s structure.

• Identify patterns, anomalies, and relationships.

• Prepare for more advanced modeling or hypothesis testing.

The EDA process typically involves:

• Understanding the dataset: Identifying data types, missing values, and outliers.

• Summarizing descriptive statistics: Computing mean, median, mode, variance, skewness, and kurtosis.

• Visualizing data: Using histograms, box plots, scatter plots, and correlation matrices to explore relationships.

EDA helps ensure the validity and reliability of data models by providing insights into
the data that might affect analysis.
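As a hedged illustration of these first steps, the snippet below assumes a dataset has already been saved as a CSV file (the file name data.csv is hypothetical); it inspects data types, missing values, and summary statistics.

import pandas as pd

df = pd.read_csv("data.csv")   # hypothetical input file

df.info()                   # column types and non-null counts (prints directly)
print(df.describe())        # mean, std, quartiles for numeric columns
print(df.isnull().sum())    # missing values per column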

13 Defining Descriptive Statistics for Numeric Data


Descriptive statistics summarize and describe the main features of a dataset. There are
three primary types of descriptive statistics:

• Measures of Central Tendency

• Measures of Dispersion

• Measures of Normality

13.1 Measures of Central Tendency


Measures of central tendency describe the ”center” of a data distribution. Common
measures include:

• Mean: The arithmetic average of all values.

• Median: The middle value when the data is arranged in order.

• Mode: The most frequent value in the dataset.

13.2 Measures of Dispersion
Measures of dispersion describe the spread or variability of data around the central
value. Common measures include:

• Range: The difference between the maximum and minimum values.

• Variance: The average squared deviation from the mean.

• Standard Deviation: The square root of the variance, indicating how much the data varies from the mean.

13.3 Measures of Normality


Measures of normality describe the shape of a dataset’s distribution compared to a
normal distribution. Two key measures are:

• Skewness: Measures the asymmetry of the data distribution. A skewness of zero indicates a perfectly symmetrical distribution.

• Kurtosis: Measures the ”tailedness” of the distribution, or how much of the data falls in the tails. High kurtosis means more extreme values in the tails, while low kurtosis means fewer.

13.4 Python Code for all three measures


import numpy as np
import pandas as pd
from scipy.stats import skew, kurtosis

# Sample dataset (you can replace this with any dataset)
data = pd.DataFrame({
    'Values': [12, 15, 14, 10, 18, 13, 19, 17, 16, 15, 14, 16, 18, 19, 20]
})

# Measures of Central Tendency
mean = data['Values'].mean()
median = data['Values'].median()
mode = data['Values'].mode()[0]  # Take the first mode

# Measures of Dispersion
range_value = data['Values'].max() - data['Values'].min()
variance = data['Values'].var()
std_dev = data['Values'].std()

# Measures of Normality
skewness = skew(data['Values'])
kurt = kurtosis(data['Values'])

# Display results
print("Measures of Central Tendency:")
print(f"Mean: {mean}")
print(f"Median: {median}")
print(f"Mode: {mode}")
print("\nMeasures of Dispersion:")
print(f"Range: {range_value}")
print(f"Variance: {variance}")
print(f"Standard Deviation: {std_dev}")
print("\nMeasures of Normality:")
print(f"Skewness: {skewness}")
print(f"Kurtosis: {kurt}")
Output: When you run the code, the output for the given dataset would be as
follows:

Measures of Central Tendency:


Mean: 15.533333333333333
Median: 16.0
Mode: 15

Measures of Dispersion:
Range: 10
Variance: 9.361904761904761
Standard Deviation: 3.0598956126999554

Measures of Normality:
Skewness: -0.05370816581648294
Kurtosis: -1.3244086460519235

Example code of Skewness and Kurtosis: Below is an example Python program illustrating the concepts of skewness and kurtosis:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import skew, kurtosis

# Generating datasets with different skewness and kurtosis
data_normal = np.random.normal(0, 1, 1000)                       # Normal distribution (mesokurtic)
data_positive_skew = np.random.exponential(scale=1, size=1000)   # Positively skewed
data_negative_skew = -np.random.exponential(scale=1, size=1000)  # Negatively skewed

# Plotting histograms
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Normal distribution
axes[0].hist(data_normal, bins=30, color='blue', alpha=0.7)
axes[0].set_title(f'Normal Distribution\nSkewness: {skew(data_normal):.2f}, '
                  f'Kurtosis: {kurtosis(data_normal):.2f}')

# Positively skewed
axes[1].hist(data_positive_skew, bins=30, color='green', alpha=0.7)
axes[1].set_title(f'Positively Skewed\nSkewness: {skew(data_positive_skew):.2f}, '
                  f'Kurtosis: {kurtosis(data_positive_skew):.2f}')

# Negatively skewed
axes[2].hist(data_negative_skew, bins=30, color='red', alpha=0.7)
axes[2].set_title(f'Negatively Skewed\nSkewness: {skew(data_negative_skew):.2f}, '
                  f'Kurtosis: {kurtosis(data_negative_skew):.2f}')

plt.tight_layout()
plt.show()
Output: When you run the code, it will display histograms for the datasets with different skewness and kurtosis, as shown in Figure 1.

Figure 1: Histograms showing Normal, Positively Skewed, and Negatively Skewed Dis-
tributions

Categorical Data
Categorical data refers to variables that are divided into groups or categories, where each
group represents a distinct characteristic or attribute. This type of data is qualitative
and typically does not have a numerical meaning, which means it cannot be measured on
a numeric scale. Examples of categorical data include:

• Types of fruits: apple, orange, banana

• Car brands: Toyota, Honda, Ford

• Gender: male, female

• Colors: red, green, blue

Categorical data is often visualized using bar charts or pie charts, as these types of
visualizations help to compare different categories.

Measures of Scale
Data can be divided into four major scales based on the nature of the measurement.
These scales are nominal, ordinal, interval, and ratio scales.

• Nominal Scale: This is the simplest type of scale, where data is grouped into distinct categories without any order or ranking between them. The categories are mutually exclusive, meaning each data point belongs to one and only one category. There is no numerical relationship between categories. For example:

– Example 1: Blood types (A, B, AB, O)


– Example 2: Types of animals (dog, cat, bird )

Nominal data can only be counted, not ordered or measured. It is best suited for
qualitative analysis.

• Ordinal Scale: Ordinal data is similar to nominal data, but with the added feature of meaningful order or ranking between categories. However, the intervals between the categories are not necessarily equal. For example:

– Example 1: Customer satisfaction rating (poor, fair, good, excellent)


– Example 2: Education level (high school, bachelor’s, master’s, PhD)

Although ordinal data can be ranked, we cannot quantify the difference between
categories. For example, the difference between ”fair” and ”good” may not be the
same as the difference between ”good” and ”excellent.”

• Interval Scale: Interval data is numerical and has equal intervals between measurements, meaning that the difference between two values is meaningful. However, there is no true zero point in interval data, which means you cannot make statements about ratios or percentages. For example:

– Example 1: Temperature in Celsius or Fahrenheit. The difference between
20°C and 30°C is the same as between 30°C and 40°C, but 0°C does not indicate
an absence of temperature.
– Example 2: IQ scores. The difference between an IQ score of 100 and 110
is the same as between 110 and 120, but an IQ score of 0 does not mean an
absence of intelligence.

Interval data can be added and subtracted, but not multiplied or divided meaningfully.

• Ratio Scale: Ratio data is the most informative scale. It is numerical, has equal intervals between values, and also has a true zero point, which allows for meaningful ratio comparisons. For example:

– Example 1: Height in centimeters. A height of 0 cm means the absence of


height, and you can meaningfully say that someone who is 180 cm tall is twice
as tall as someone who is 90 cm.
– Example 2: Weight in kilograms. A weight of 0 kg means no weight, and you
can meaningfully compare different weights using ratios.

Ratio data allows for all arithmetic operations, including addition, subtraction,
multiplication, and division.

13.5 Counting for Categorical Data


For categorical data, EDA focuses on counting unique categories and analyzing their
frequency distribution.
Example:

df['category_column'].value_counts()

14 Creating Contingency Tables


• Definition: A contingency table is a type of table that displays the frequency distribution of variables.

• Purpose: It helps to summarize the relationship between two categorical variables.

• Structure: The rows typically represent one categorical variable, while the columns represent another. Each cell contains the count of occurrences for the combination of the row and column categories.

• Applications: Useful in various fields such as marketing, social sciences, and health studies to analyze the relationship between variables (see the example below).
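A minimal sketch using pandas.crosstab; the small dataset of gender and product preference is hypothetical.

import pandas as pd

# Hypothetical categorical data for illustration.
df = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Female', 'Male', 'Female', 'Male'],
    'Preference': ['A', 'B', 'A', 'A', 'B', 'B']
})

# Rows hold one categorical variable, columns the other;
# each cell contains the count of that row/column combination.
table = pd.crosstab(df['Gender'], df['Preference'])
print(table)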

15 Creating Applied Visualization for EDA
• Purpose of Visualization: To make the analysis of data easier to understand and interpret by visually representing data distributions and relationships.

• Types of Visualizations:
– Histograms: Display the distribution of a single continuous variable.
– Bar Charts: Compare frequencies or counts of categorical variables.
– Boxplots: Show the distribution of data based on five summary statistics.
– Scatterplots: Illustrate the relationship between two continuous variables.

16 Inspecting Boxplots
• Definition: A boxplot (or box-and-whisker plot) is a standardized way of displaying the distribution of data based on a five-number summary.

• Components:

– Minimum: The smallest data point.
– First Quartile (Q1): The median of the lower half of the dataset.
– Median (Q2): The middle value of the dataset.
– Third Quartile (Q3): The median of the upper half of the dataset.
– Maximum: The largest data point.

• Purpose: Helps to identify outliers and visualize the spread of the data (see the sketch below).
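A minimal sketch using Matplotlib; the sample values (including one artificial outlier) are hypothetical.

import matplotlib.pyplot as plt

# Hypothetical numeric sample with one clear outlier (45).
values = [12, 15, 14, 10, 18, 13, 19, 17, 16, 15, 14, 16, 18, 19, 45]

# The box spans Q1 to Q3, the line marks the median,
# and points beyond the whiskers are drawn as outliers.
plt.boxplot(values)
plt.title('Boxplot of sample values')
plt.show()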

17 Performing t-tests After Boxplots


• Purpose: A t-test is used to determine whether there is a statistically significant difference between the means of two groups.

• Types of t-tests:

– Independent t-test: Compares means from two different groups.
– Paired t-test: Compares means from the same group at different times.

• Application: Often used in A/B testing, clinical trials, and comparing group performances (see the example below).
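A minimal sketch of an independent t-test using scipy.stats; the two groups are randomly generated, hypothetical samples.

import numpy as np
from scipy.stats import ttest_ind

# Hypothetical scores for two independent groups.
group_a = np.random.normal(loc=70, scale=5, size=30)
group_b = np.random.normal(loc=75, scale=5, size=30)

t_stat, p_value = ttest_ind(group_a, group_b)
print(f"t-statistic: {t_stat:.3f}, p-value: {p_value:.4f}")
# A small p-value (e.g., < 0.05) suggests the group means differ significantly.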

18 Observing Parallel Coordinates


• Definition: A visualization technique that displays multivariate data by using parallel axes.

• Structure: Each vertical line represents a variable, and each data point is connected across the axes.

• Purpose: Helps to visualize relationships and trends in multi-dimensional data (an illustrative sketch follows).
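A minimal sketch using pandas.plotting.parallel_coordinates; the small DataFrame and its column names are hypothetical.

import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

# Small hypothetical multivariate dataset with a class column.
df = pd.DataFrame({
    'feature_1': [1.0, 1.2, 3.1, 3.3],
    'feature_2': [2.0, 2.1, 0.9, 1.1],
    'feature_3': [0.5, 0.6, 2.5, 2.4],
    'label':     ['A', 'A', 'B', 'B']
})

# Each vertical axis is one variable; each line is one observation, colored by class.
parallel_coordinates(df, 'label', color=['blue', 'red'])
plt.show()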

19 Graphing Distributions
• Purpose: To visualize how data is distributed across different values (illustrated below).

• Common Types:

– Histograms: Show frequency distributions.
– Density Plots: Estimate the probability density function of the variable.
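A minimal sketch plotting a histogram and a density estimate with pandas and Matplotlib; the sample is randomly generated, and the density plot requires SciPy.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical continuous sample.
values = pd.Series(np.random.normal(loc=50, scale=10, size=500))

values.plot(kind='hist', bins=30, title='Histogram')   # frequency distribution
plt.show()

values.plot(kind='density', title='Density Plot')      # estimated probability density
plt.show()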

20 Plotting Scatterplots
• Definition: A scatterplot displays the values of (typically) two variables for a set of data.

• Purpose: To investigate the relationship or correlation between two continuous variables.

• Interpretation: Patterns in the scatterplot can indicate positive, negative, or no correlation (see the sketch below).
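A minimal sketch with Matplotlib; the data are randomly generated so that y is positively correlated with x.

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical positively correlated data.
x = np.random.rand(100)
y = 2 * x + np.random.normal(scale=0.1, size=100)

plt.scatter(x, y)
plt.xlabel('x')
plt.ylabel('y')
plt.title('Scatterplot of y versus x')
plt.show()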

21 Understanding Correlation
• Definition: Correlation measures the strength and direction of a linear relationship between two variables.

• Correlation Coefficient (r):

– Range: -1 to 1.
– Interpretation:
  * 1: Perfect positive correlation.
  * -1: Perfect negative correlation.
  * 0: No correlation.

• Importance: Helps in predictive modeling and understanding relationships.

22 Using Covariance and Correlation


• Covariance: Measures how two variables change together.

– Positive covariance indicates that both variables increase or decrease together.
– Negative covariance indicates that as one variable increases, the other decreases.

• Correlation: A standardized measure of covariance that provides a clearer understanding of the relationship’s strength (see the example below).
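A minimal sketch with NumPy; the two arrays are small hypothetical samples.

import numpy as np

# Hypothetical paired measurements.
x = np.array([2, 4, 6, 8, 10])
y = np.array([1, 3, 7, 9, 12])

cov_matrix = np.cov(x, y)        # off-diagonal entries hold the covariance of x and y
corr_matrix = np.corrcoef(x, y)  # off-diagonal entries hold the Pearson correlation

print("Covariance:", cov_matrix[0, 1])
print("Correlation:", corr_matrix[0, 1])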

23 Using Nonparametric Correlation
• Definition: Nonparametric correlation methods (such as Spearman’s rank correlation) do not assume a normal distribution.

• Use Cases: Ideal for ordinal data or when the data do not meet parametric test assumptions.

• Benefit: More robust in the presence of outliers or non-linear (but monotonic) relationships (see the example below).
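A minimal sketch using scipy.stats.spearmanr; the data form a small hypothetical sample with a perfectly monotonic but non-linear relationship.

from scipy.stats import spearmanr

# Hypothetical data: y grows monotonically (quadratically) with x.
x = [1, 2, 3, 4, 5, 6]
y = [1, 4, 9, 16, 25, 36]

rho, p_value = spearmanr(x, y)
print(f"Spearman rho: {rho:.2f}, p-value: {p_value:.4f}")
# rho is 1.0 here because the ranks of x and y agree exactly.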

24 Considering the Chi-Square Test for Tables


• Definition: A statistical test to determine if there is a significant association between categorical variables.

• Application: Used with contingency tables to assess whether the observed frequencies differ from the expected frequencies under the assumption of independence.

• Interpretation: A large chi-square statistic (and a correspondingly small p-value) indicates a significant relationship (see the example below).
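A minimal sketch using scipy.stats.chi2_contingency on a small hypothetical 2x2 contingency table.

import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical observed counts (rows and columns are two categorical variables).
observed = np.array([[30, 10],
                     [20, 40]])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"Chi-square: {chi2:.2f}, p-value: {p_value:.4f}, degrees of freedom: {dof}")
# A small p-value suggests the two variables are associated.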

25 Modifying Data Distributions


• Purpose: Sometimes, data needs to be transformed to meet the assumptions of statistical tests.

• Common Modifications:

– Log Transformation: Useful for right-skewed data.
– Square Root Transformation: Helps to stabilize variance.
– Box-Cox Transformation: A family of power transformations that can help normalize distributions (see the sketch below).
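A minimal sketch of the three transformations; the right-skewed input is randomly generated and kept strictly positive, as the log and Box-Cox transformations require.

import numpy as np
from scipy.stats import boxcox

# Hypothetical right-skewed, strictly positive data.
data = np.random.exponential(scale=2, size=1000) + 0.01

log_data = np.log(data)          # log transformation
sqrt_data = np.sqrt(data)        # square root transformation
boxcox_data, lam = boxcox(data)  # Box-Cox transformation; also returns the fitted lambda

print(f"Fitted Box-Cox lambda: {lam:.2f}")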

26 Using Different Statistical Distributions


• Normal Distribution: Symmetrical, bell-shaped distribution; many statistical tests assume normality.

• Binomial Distribution: Discrete distribution representing the number of successes in a fixed number of trials.

• Poisson Distribution: Represents the number of events occurring in a fixed interval of time or space. Samples from all three are drawn in the sketch below.
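A minimal sketch drawing samples from each distribution with NumPy; the distribution parameters and the seed are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(42)  # seeded generator for reproducibility

normal_sample = rng.normal(loc=0, scale=1, size=1000)    # normal: mean 0, std 1
binomial_sample = rng.binomial(n=10, p=0.5, size=1000)   # binomial: successes in 10 trials
poisson_sample = rng.poisson(lam=3, size=1000)           # Poisson: events per interval, rate 3

print(normal_sample.mean(), binomial_sample.mean(), poisson_sample.mean())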

27 Creating a Z-score Standardization


• Definition: Z-score standardization transforms data to have a mean of 0 and a standard deviation of 1.

• Formula:

  Z = (X − µ) / σ    (1)

  where X is the value, µ is the mean, and σ is the standard deviation.

• Purpose: Useful for comparing scores from different distributions (see the sketch below).
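A minimal sketch applying the formula by hand and with the equivalent scipy.stats.zscore helper; the scores are hypothetical.

import numpy as np
from scipy.stats import zscore

# Hypothetical exam scores.
scores = np.array([55, 60, 65, 70, 75, 80, 85])

# Manual standardization using Z = (X - mu) / sigma.
z_manual = (scores - scores.mean()) / scores.std()

# Equivalent SciPy helper (also uses the population standard deviation by default).
z_scipy = zscore(scores)

print(z_manual)
print(z_scipy)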

28 Transforming Other Notable Distributions


• Purpose: To normalize or rescale data for analysis.

• Common Techniques:

– Min-Max Scaling: Rescales data to a range of [0, 1].
– Quantile Transformation: Transforms features to follow a uniform or normal distribution (both techniques are sketched below).
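A minimal sketch of both techniques using scikit-learn’s preprocessing tools; the skewed input feature is randomly generated, and n_quantiles=100 is an arbitrary illustrative setting.

import numpy as np
from sklearn.preprocessing import MinMaxScaler, QuantileTransformer

# Hypothetical right-skewed feature, reshaped to a single column.
X = np.random.exponential(scale=2, size=(1000, 1))

# Min-Max scaling: rescales values to the [0, 1] range.
X_minmax = MinMaxScaler().fit_transform(X)

# Quantile transformation: maps the feature to an approximately normal distribution.
qt = QuantileTransformer(output_distribution='normal', n_quantiles=100)
X_normal = qt.fit_transform(X)

print(X_minmax.min(), X_minmax.max())
print(X_normal.mean(), X_normal.std())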
