Unit 5 Material
1 Stretching Python’s Capabilities
This chapter focuses on extending Python’s functionality with libraries and techniques
to optimize performance and handle large datasets efficiently.
Model Evaluation: Provides tools for cross-validation, accuracy metrics, and hyperparameter tuning.
Example:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
1.4 Installing and Importing Scikit-learn
To install Scikit-learn, run:
pip install scikit-learn
Example imports:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score, r2_score
from sklearn.datasets import fetch_california_housing

# Load the California housing dataset
data = fetch_california_housing()
X = data.data    # Features
y = data.target  # Target variable
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the Linear Regression model
model = LinearRegression()

# Train the model
model.fit(X_train, y_train)

# Make predictions on the test set
predictions = model.predict(X_test)

# Evaluate the model using the R^2 score
score = r2_score(y_test, predictions)
print(f"Model R^2 score: {score}")
A Pipeline chains preprocessing and modeling steps into a single estimator:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ('scaler', StandardScaler()),    # Transform step
    ('model', LinearRegression())    # Estimator step
])
pipe.fit(X_train, y_train)
predictions = pipe.predict(X_test)
3.2 Regression
Objective: Predict continuous outcomes.
Applications:
– Predicting housing prices.
– Stock market prediction.
– Energy consumption prediction.
3.3 Clustering
Objective: Group similar data points together.
Applications:
– Customer segmentation.
– Fraud detection.
– Document clustering.
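As a concrete illustration, here is a minimal clustering sketch using scikit-learn's KMeans on synthetic two-dimensional data; the data and parameter values are assumptions for illustration, not from the original notes:

import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D points forming two loose groups (illustrative data)
X = np.vstack([
    np.random.normal(loc=0.0, scale=0.5, size=(50, 2)),
    np.random.normal(loc=5.0, scale=0.5, size=(50, 2)),
])

# Fit KMeans with two clusters and inspect the assigned labels
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)
print(labels[:10])              # cluster index (0 or 1) for the first 10 points
print(kmeans.cluster_centers_)  # coordinates of the two cluster centers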
4 Performing the Hashing Trick
The Hashing Trick transforms large amounts of data (such as text) into fixed-size feature vectors.
Advantages:
Memory efficiency: Reduces memory usage by avoiding the storage of large feature
dictionaries.
Determinism: hash functions are deterministic, so the same input always maps to the same feature position.
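A minimal sketch of the hashing trick using scikit-learn's FeatureHasher; the feature dictionaries and the n_features value are illustrative assumptions:

from sklearn.feature_extraction import FeatureHasher

# Hash dictionary-style features into a fixed-size vector of 8 columns
hasher = FeatureHasher(n_features=8, input_type='dict')
features = [{'dog': 1, 'cat': 2}, {'dog': 2, 'run': 5}]
X = hasher.transform(features)
print(X.toarray())  # the same input always hashes to the same columns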
7 CountVectorizer vs HashingVectorizer
7.1 CountVectorizer
Converts a collection of text documents into a matrix of token counts.
Each word (token) is a feature, and the number of times it appears in a document
is counted.
from sklearn.feature_extraction.text import CountVectorizer
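A short usage sketch; the corpus is an illustrative assumption, and get_feature_names_out assumes scikit-learn 1.0 or newer:

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'the cat sat on the mat',
    'the dog sat on the log',
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)       # sparse matrix of token counts
print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(X.toarray())                         # counts per document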
7.2 HashingVectorizer
Similar to CountVectorizer, but uses a hash function to map words into a fixed-size
vector.
Saves memory and is ideal for very large datasets.
Python Code Example:
from sklearn.feature_extraction.text import HashingVectorizer
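A short usage sketch; the corpus and the n_features value are illustrative assumptions:

from sklearn.feature_extraction.text import HashingVectorizer

corpus = [
    'the cat sat on the mat',
    'the dog sat on the log',
]
# n_features fixes the output width up front; no vocabulary is stored
vectorizer = HashingVectorizer(n_features=16)
X = vectorizer.transform(corpus)  # no fit needed: hashing is stateless
print(X.shape)  # (2, 16) regardless of how many distinct words appear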
9 Working with the Memory Profiler
Memory Profiling: Measures how much memory a program uses, line by line, to locate memory-hungry code.
from memory_profiler import profile

@profile
def my_function():
    a = [i for i in range(10000)]  # allocate a list of 10,000 integers
    return a
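To obtain a line-by-line memory report, one typical workflow (assuming the memory-profiler package is installed via pip install memory-profiler) is to save the decorated function in a script and run:

python -m memory_profiler my_script.py

Here my_script.py is a placeholder file name.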
Parallel computing splits tasks into smaller parts and runs them across multiple
cores.
This reduces execution time and improves performance, especially for large datasets.
11 Demonstrating Multiprocessing
What is Multiprocessing? Multiprocessing runs tasks in separate Python processes, each with its own interpreter and memory space, so CPU-bound work can run on multiple cores in parallel.
from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == '__main__':
    # Create a pool of 4 worker processes and split the work across them
    with Pool(4) as pool:
        results = pool.map(square, [1, 2, 3, 4, 5])
    print(results)  # [1, 4, 9, 16, 25]
12 Exploring Data Analysis
Exploratory Data Analysis (EDA) is a critical step in understanding and preparing data
before applying machine learning algorithms. This chapter covers techniques to explore
and summarize data.
Understanding the dataset: Identifying data types, missing values, and outliers.
EDA helps ensure the validity and reliability of data models by providing insights into
the data that might affect analysis.
The following groups of summary statistics are covered:
– Measures of Central Tendency
– Measures of Dispersion
– Measures of Normality
13.2 Measures of Dispersion
Measures of dispersion describe the spread or variability of data around the central
value. Common measures include:
Standard Deviation: The square root of the variance, indicating how much data
varies from the mean.
Kurtosis: Measures the "tailedness" of the distribution, or how much of the data falls in the tails. High kurtosis means more extreme values in the tails, while low kurtosis means fewer.
import pandas as pd
from scipy.stats import skew, kurtosis

# Sample dataset (assumed for illustration; the notes' original dataset is not shown)
data = pd.DataFrame({'Values': [2, 4, 5, 7, 8, 9, 10, 11, 12, 3, 6]})

# Measures of Central Tendency
mean = data['Values'].mean()
median = data['Values'].median()
mode = data['Values'].mode()[0]

# Measures of Dispersion
range_value = data['Values'].max() - data['Values'].min()
variance = data['Values'].var()
std_dev = data['Values'].std()

# Measures of Normality
skewness = skew(data['Values'])
kurt = kurtosis(data['Values'])

# Display results
print("Measures of Central Tendency:")
print(f"Mean: {mean}")
print(f"Median: {median}")
print(f"Mode: {mode}")
print("\nMeasures of Dispersion:")
print(f"Range: {range_value}")
print(f"Variance: {variance}")
print(f"Standard Deviation: {std_dev}")
print("\nMeasures of Normality:")
print(f"Skewness: {skewness}")
print(f"Kurtosis: {kurt}")
Output: For the dataset used in the original notes, the output was as follows (a different sample dataset will produce different values):
Measures of Dispersion:
Range: 10
Variance: 9.361904761904761
Standard Deviation: 3.0598956126999554
Measures of Normality:
Skewness: -0.05370816581648294
Kurtosis: -1.3244086460519235
Example code for skewness and kurtosis: Below is example Python code illustrating the concepts of skewness and kurtosis:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import skew, kurtosis

# Generate sample datasets (assumed; the original data generation is not shown)
rng = np.random.default_rng(42)
data_normal = rng.normal(0, 1, 1000)           # symmetric, bell-shaped
data_positive_skew = rng.exponential(1, 1000)  # long right tail
data_negative_skew = -rng.exponential(1, 1000) # long left tail

# Plotting histograms
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Normal distribution
axes[0].hist(data_normal, bins=30, color='blue', alpha=0.7)
axes[0].set_title(f'Normal Distribution\nSkewness: {skew(data_normal):.2f}, '
                  f'Kurtosis: {kurtosis(data_normal):.2f}')

# Positively skewed
axes[1].hist(data_positive_skew, bins=30, color='green', alpha=0.7)
axes[1].set_title(f'Positively Skewed\nSkewness: {skew(data_positive_skew):.2f}, '
                  f'Kurtosis: {kurtosis(data_positive_skew):.2f}')

# Negatively skewed
axes[2].hist(data_negative_skew, bins=30, color='red', alpha=0.7)
axes[2].set_title(f'Negatively Skewed\nSkewness: {skew(data_negative_skew):.2f}, '
                  f'Kurtosis: {kurtosis(data_negative_skew):.2f}')

plt.tight_layout()
plt.show()
Output: When you run the code, it displays histograms for the three datasets, each labeled with its skewness and kurtosis, as shown in Figure 1.
Figure 1: Histograms showing Normal, Positively Skewed, and Negatively Skewed Dis-
tributions
Categorical Data
Categorical data refers to variables that are divided into groups or categories, where each
group represents a distinct characteristic or attribute. This type of data is qualitative
and typically does not have a numerical meaning, which means it cannot be measured on
a numeric scale. Examples of categorical data include:
– Gender (male, female)
– Blood type (A, B, AB, O)
– Product category (electronics, clothing, groceries)
Categorical data is often visualized using bar charts or pie charts, as these types of
visualizations help to compare different categories.
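For instance, a bar chart of category counts can be drawn with pandas and Matplotlib; the color values below are an illustrative assumption:

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical categorical column (illustrative values)
colors = pd.Series(['red', 'blue', 'red', 'green', 'blue', 'red'])

# Bar chart of category counts
colors.value_counts().plot(kind='bar', title='Color counts')
plt.show()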
Measures of Scale
Data can be divided into four major scales based on the nature of the measurement.
These scales are nominal, ordinal, interval, and ratio scales.
Nominal Scale: This is the simplest type of scale, where data is grouped into
distinct categories without any order or ranking between them. The categories are
mutually exclusive, meaning each data point belongs to one and only one category.
There is no numerical relationship between categories. For example:
– Eye color (blue, brown, green)
– Nationality (Indian, American, Japanese)
Nominal data can only be counted, not ordered or measured. It is best suited for
qualitative analysis.
Ordinal Scale: Ordinal data is similar to nominal data, but with the added feature
of meaningful order or ranking between categories. However, the intervals between
the categories are not necessarily equal. For example:
– Customer satisfaction ratings (poor, fair, good, excellent)
– Education level (high school, bachelor's, master's, doctorate)
Although ordinal data can be ranked, we cannot quantify the difference between
categories. For example, the difference between ”fair” and ”good” may not be the
same as the difference between ”good” and ”excellent.”
Interval Scale: Interval data is numerical and has equal intervals between mea-
surements, meaning that the difference between two values is meaningful. However,
there is no true zero point in interval data, which means you cannot make state-
ments about ratios or percentages. For example:
– Example 1: Temperature in Celsius or Fahrenheit. The difference between
20°C and 30°C is the same as between 30°C and 40°C, but 0°C does not indicate
an absence of temperature.
– Example 2: IQ scores. The difference between an IQ score of 100 and 110
is the same as between 110 and 120, but an IQ score of 0 does not mean an
absence of intelligence.
Interval data can be added and subtracted, but not multiplied or divided meaning-
fully.
Ratio Scale: Ratio data is the most informative scale. It is numerical, has equal
intervals between values, and also has a true zero point, which allows for meaningful
ratio comparisons. For example:
– Height, weight, and age (zero means an absence of the quantity)
– Income ($0 means no income, and $40,000 is twice $20,000)
Ratio data allows for all arithmetic operations, including addition, subtraction,
multiplication, and division.
Contingency Tables
Structure: The rows typically represent one categorical variable, while the columns represent another. Each cell contains the count of occurrences for the combination of the row and column categories.
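In pandas, such a table can be built with pd.crosstab; the survey-style data below is an illustrative assumption:

import pandas as pd

# Hypothetical survey data (illustrative)
df = pd.DataFrame({
    'Gender': ['M', 'F', 'F', 'M', 'F', 'M'],
    'Preference': ['A', 'B', 'A', 'A', 'B', 'B'],
})

# Rows: one categorical variable; columns: another; cells: counts
table = pd.crosstab(df['Gender'], df['Preference'])
print(table)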
15 Creating Applied Visualization for EDA
Purpose of Visualization: To make the analysis of data easier to understand
and interpret by visually representing data distributions and relationships.
Types of Visualizations:
– Histograms: Display the distribution of a single continuous variable.
– Bar Charts: Compare frequencies or counts of categorical variables.
– Boxplots: Show the distribution of data based on five summary statistics.
– Scatterplots: Illustrate the relationship between two continuous variables.
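The sketch below draws one example of each type listed above; all data values are illustrative assumptions:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2 * x + rng.normal(size=200)

fig, axes = plt.subplots(1, 4, figsize=(16, 3))
axes[0].hist(x, bins=20)                    # histogram: one continuous variable
axes[0].set_title('Histogram')
axes[1].bar(['A', 'B', 'C'], [10, 25, 15])  # bar chart: categorical counts
axes[1].set_title('Bar Chart')
axes[2].boxplot(x)                          # boxplot: five-number summary
axes[2].set_title('Boxplot')
axes[3].scatter(x, y, s=10)                 # scatterplot: two continuous variables
axes[3].set_title('Scatterplot')
plt.tight_layout()
plt.show()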
16 Inspecting Boxplots
Definition: A boxplot (or whisker plot) is a standardized way of displaying the
distribution of data based on a five-number summary.
Components:
– Minimum: The smallest data point.
– First Quartile (Q1): The median of the lower half of the dataset.
– Median (Q2): The middle value of the dataset.
– Third Quartile (Q3): The median of the upper half of the dataset.
– Maximum: The largest data point.
Purpose: Helps to identify outliers and visualize the spread of the data.
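A minimal sketch relating the five-number summary to a boxplot; the data, including the deliberate outlier, is an illustrative assumption:

import numpy as np
import matplotlib.pyplot as plt

data = [2, 4, 4, 5, 6, 7, 8, 9, 10, 25]  # 25 is a deliberate outlier

# The five-number summary underlying the boxplot
print(np.percentile(data, [0, 25, 50, 75, 100]))  # min, Q1, median, Q3, max

plt.boxplot(data)  # points beyond the whiskers are drawn as outliers
plt.title('Boxplot with an outlier')
plt.show()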
19 Graphing Distributions
Purpose: To visualize how data is distributed across different values.
Common Types:
– Histograms
– Density (KDE) plots
– Boxplots
20 Plotting Scatterplots
Definition: A scatterplot displays the values of two variables for a set of data, with each point representing one observation.
21 Understanding Correlation
Definition: Correlation measures the strength and direction of a linear relationship
between two variables.
– Range: -1 to 1.
– Interpretation:
* 1: Perfect positive correlation.
* -1: Perfect negative correlation.
* 0: No linear correlation.
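A minimal sketch computing a Pearson correlation with NumPy; the synthetic data is an illustrative assumption:

import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=100)
y = 0.8 * x + rng.normal(scale=0.5, size=100)  # positively related to x

r = np.corrcoef(x, y)[0, 1]   # Pearson correlation coefficient
print(f'Pearson r: {r:.2f}')  # close to +1 for a strong positive relationship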
23 Using Nonparametric Correlation
Definition: Nonparametric correlation methods (like Spearman’s rank correlation)
do not assume a normal distribution.
Use Cases: Ideal for ordinal data or when the data do not meet parametric test
assumptions.
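A minimal sketch using SciPy's spearmanr; the rank-style data is an illustrative assumption:

from scipy.stats import spearmanr

# Ordinal-style data: satisfaction ranks vs. repeat purchases (illustrative)
satisfaction = [1, 2, 2, 3, 4, 5, 5]
purchases = [0, 1, 1, 2, 2, 4, 5]

rho, p_value = spearmanr(satisfaction, purchases)
print(f'Spearman rho: {rho:.2f}, p-value: {p_value:.3f}')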
The Chi-Square Test
Application: Used with contingency tables to assess whether the observed frequencies differ from the expected frequencies under the assumption of independence.
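A minimal sketch using SciPy's chi2_contingency; the observed counts are an illustrative assumption:

import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 contingency table of observed counts
observed = np.array([[30, 10],
                     [20, 40]])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f'chi2 = {chi2:.2f}, p = {p_value:.4f}, dof = {dof}')
print(expected)  # counts expected under independence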
Common Modifications:
Z-Score Standardization
Formula:

Z = (X − µ) / σ    (1)
where X is the value, µ is the mean, and σ is the standard deviation.
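A minimal sketch computing z-scores with NumPy; the data values are an illustrative assumption, and note that NumPy's std uses the population formula by default:

import numpy as np

data = np.array([10, 12, 14, 16, 18])

# Standardize: subtract the mean, divide by the standard deviation
z = (data - data.mean()) / data.std()
print(z)                   # standardized values
print(z.mean(), z.std())   # mean of z is 0, std of z is 1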
Common Techniques:
– Z-score standardization (shown above)
– Min-max scaling to a fixed range such as [0, 1]