Unit 5 Material
1 Stretching Python’s Capabilities
This chapter focuses on extending Python’s functionality with libraries and techniques
to optimize performance and handle large datasets efficiently.
Model Evaluation: Provides tools for cross-validation, accuracy metrics, and hyperparameter tuning.
Example:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
1.4 Installing and Importing Scikit-learn
To install Scikit-learn, run:
pip install scikit-learn
Example imports:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score, r2_score
from sklearn.datasets import fetch_california_housing

# Load the California housing dataset
data = fetch_california_housing()
X = data.data    # Features
y = data.target  # Target variable
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the Linear Regression model
model = LinearRegression()

# Train the model
model.fit(X_train, y_train)

# Make predictions on the test set
predictions = model.predict(X_test)

# Evaluate the model using the R^2 score
score = r2_score(y_test, predictions)
print(f"Model R^2 score: {score}")
A Pipeline chains preprocessing and modeling steps into a single estimator:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ('scaler', StandardScaler()),    # Transform step
    ('model', LinearRegression())    # Estimator step
])
pipe.fit(X_train, y_train)
predictions = pipe.predict(X_test)
3.2 Regression
Objective: Predict continuous outcomes.
Applications:
– Predicting housing prices.
– Stock market prediction.
– Energy consumption prediction.
3.3 Clustering
Objective: Group similar data points together.
Applications:
– Customer segmentation.
– Fraud detection.
– Document clustering.
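As a concrete illustration, here is a minimal clustering sketch using scikit-learn's KMeans on synthetic two-dimensional data; the data and parameter values are assumptions for illustration, not from the original notes:

import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D points forming two loose groups (illustrative data)
X = np.vstack([
    np.random.normal(loc=0.0, scale=0.5, size=(50, 2)),
    np.random.normal(loc=5.0, scale=0.5, size=(50, 2)),
])

# Fit KMeans with two clusters and inspect the assigned labels
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)
print(labels[:10])              # cluster index (0 or 1) for the first 10 points
print(kmeans.cluster_centers_)  # coordinates of the two cluster centers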
4 Performing the Hashing Trick
The Hashing Trick transforms large amounts of data (such as text) into fixed-size feature vectors.
Advantages:
Memory efficiency: Reduces memory usage by avoiding the storage of large feature
dictionaries.
Determinism: hash functions are deterministic, so the same input always maps to the same feature position.
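A minimal sketch of the hashing trick using scikit-learn's FeatureHasher; the feature dictionaries and the n_features value are illustrative assumptions:

from sklearn.feature_extraction import FeatureHasher

# Hash dictionary-style features into a fixed-size vector of 8 columns
hasher = FeatureHasher(n_features=8, input_type='dict')
features = [{'dog': 1, 'cat': 2}, {'dog': 2, 'run': 5}]
X = hasher.transform(features)
print(X.toarray())  # the same input always hashes to the same columns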
7 CountVectorizer vs HashingVectorizer
7.1 CountVectorizer
Converts a collection of text documents into a matrix of token counts.
Each word (token) is a feature, and the number of times it appears in a document
is counted.
from sklearn.feature_extraction.text import CountVectorizer
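A short usage sketch; the corpus is an illustrative assumption, and get_feature_names_out assumes scikit-learn 1.0 or newer:

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'the cat sat on the mat',
    'the dog sat on the log',
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)       # sparse matrix of token counts
print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(X.toarray())                         # counts per document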
7.2 HashingVectorizer
Similar to CountVectorizer, but uses a hash function to map words into a fixed-size
vector.
Saves memory and is ideal for very large datasets.
Python Code Example:
from sklearn.feature_extraction.text import HashingVectorizer
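A short usage sketch; the corpus and the n_features value are illustrative assumptions:

from sklearn.feature_extraction.text import HashingVectorizer

corpus = [
    'the cat sat on the mat',
    'the dog sat on the log',
]
# n_features fixes the output width up front; no vocabulary is stored
vectorizer = HashingVectorizer(n_features=16)
X = vectorizer.transform(corpus)  # no fit needed: hashing is stateless
print(X.shape)  # (2, 16) regardless of how many distinct words appear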
9 Working with the Memory Profiler
Memory Profiling: Measures how much memory a program uses, line by line, to locate memory-hungry code.
from memory_profiler import profile

@profile
def my_function():
    a = [i for i in range(10000)]  # allocate a list of 10,000 integers
    return a
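To obtain a line-by-line memory report, one typical workflow (assuming the memory-profiler package is installed via pip install memory-profiler) is to save the decorated function in a script and run:

python -m memory_profiler my_script.py

Here my_script.py is a placeholder file name.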
Parallel computing splits tasks into smaller parts and runs them across multiple
cores.
This reduces execution time and improves performance, especially for large datasets.
11 Demonstrating Multiprocessing
What is Multiprocessing? Multiprocessing runs tasks in separate Python processes, each with its own interpreter and memory space, so CPU-bound work can run on multiple cores in parallel.
from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == '__main__':
    # Create a pool of 4 worker processes and split the work across them
    with Pool(4) as pool:
        results = pool.map(square, [1, 2, 3, 4, 5])
    print(results)  # [1, 4, 9, 16, 25]
12 Exploring Data Analysis
Exploratory Data Analysis (EDA) is a critical step in understanding and preparing data
before applying machine learning algorithms. This chapter covers techniques to explore
and summarize data.
Understanding the dataset: Identifying data types, missing values, and outliers.
EDA helps ensure the validity and reliability of data models by providing insights into
the data that might affect analysis.
The following groups of summary statistics are covered:
– Measures of Central Tendency
– Measures of Dispersion
– Measures of Normality
13.2 Measures of Dispersion
Measures of dispersion describe the spread or variability of data around the central
value. Common measures include:
Standard Deviation: The square root of the variance, indicating how much data
varies from the mean.
Kurtosis: Measures the "tailedness" of the distribution, or how much of the data falls in the tails. High kurtosis means more extreme values in the tails, while low kurtosis means fewer.
import pandas as pd
from scipy.stats import skew, kurtosis

# Sample dataset (assumed for illustration; the notes' original dataset is not shown)
data = pd.DataFrame({'Values': [2, 4, 5, 7, 8, 9, 10, 11, 12, 3, 6]})

# Measures of Central Tendency
mean = data['Values'].mean()
median = data['Values'].median()
mode = data['Values'].mode()[0]

# Measures of Dispersion
range_value = data['Values'].max() - data['Values'].min()
variance = data['Values'].var()
std_dev = data['Values'].std()

# Measures of Normality
skewness = skew(data['Values'])
kurt = kurtosis(data['Values'])

# Display results
print("Measures of Central Tendency:")
print(f"Mean: {mean}")
print(f"Median: {median}")
print(f"Mode: {mode}")
print("\nMeasures of Dispersion:")
print(f"Range: {range_value}")
print(f"Variance: {variance}")
print(f"Standard Deviation: {std_dev}")
print("\nMeasures of Normality:")
print(f"Skewness: {skewness}")
print(f"Kurtosis: {kurt}")
Output: For the dataset used in the original notes, the output was as follows (a different sample dataset will produce different values):
Measures of Dispersion:
Range: 10
Variance: 9.361904761904761
Standard Deviation: 3.0598956126999554
Measures of Normality:
Skewness: -0.05370816581648294
Kurtosis: -1.3244086460519235
Example code for skewness and kurtosis: Below is example Python code illustrating the concepts of skewness and kurtosis:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import skew, kurtosis

# Generate sample datasets (assumed; the original data generation is not shown)
rng = np.random.default_rng(42)
data_normal = rng.normal(0, 1, 1000)           # symmetric, bell-shaped
data_positive_skew = rng.exponential(1, 1000)  # long right tail
data_negative_skew = -rng.exponential(1, 1000) # long left tail

# Plotting histograms
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Normal distribution
axes[0].hist(data_normal, bins=30, color='blue', alpha=0.7)
axes[0].set_title(f'Normal Distribution\nSkewness: {skew(data_normal):.2f}, '
                  f'Kurtosis: {kurtosis(data_normal):.2f}')

# Positively skewed
axes[1].hist(data_positive_skew, bins=30, color='green', alpha=0.7)
axes[1].set_title(f'Positively Skewed\nSkewness: {skew(data_positive_skew):.2f}, '
                  f'Kurtosis: {kurtosis(data_positive_skew):.2f}')

# Negatively skewed
axes[2].hist(data_negative_skew, bins=30, color='red', alpha=0.7)
axes[2].set_title(f'Negatively Skewed\nSkewness: {skew(data_negative_skew):.2f}, '
                  f'Kurtosis: {kurtosis(data_negative_skew):.2f}')

plt.tight_layout()
plt.show()
Output: When you run the code, it displays histograms for the three datasets, each labeled with its skewness and kurtosis, as shown in Figure 1.
Figure 1: Histograms showing Normal, Positively Skewed, and Negatively Skewed Dis-
tributions
Categorical Data
Categorical data refers to variables that are divided into groups or categories, where each
group represents a distinct characteristic or attribute. This type of data is qualitative
and typically does not have a numerical meaning, which means it cannot be measured on
a numeric scale. Examples of categorical data include:
– Gender (male, female)
– Blood type (A, B, AB, O)
– Product category (electronics, clothing, groceries)
Categorical data is often visualized using bar charts or pie charts, as these types of
visualizations help to compare different categories.
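For instance, a bar chart of category counts can be drawn with pandas and Matplotlib; the color values below are an illustrative assumption:

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical categorical column (illustrative values)
colors = pd.Series(['red', 'blue', 'red', 'green', 'blue', 'red'])

# Bar chart of category counts
colors.value_counts().plot(kind='bar', title='Color counts')
plt.show()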
Measures of Scale
Data can be divided into four major scales based on the nature of the measurement.
These scales are nominal, ordinal, interval, and ratio scales.
Nominal Scale: This is the simplest type of scale, where data is grouped into
distinct categories without any order or ranking between them. The categories are
mutually exclusive, meaning each data point belongs to one and only one category.
There is no numerical relationship between categories. For example:
– Eye color (blue, brown, green)
– Nationality (Indian, American, Japanese)
Nominal data can only be counted, not ordered or measured. It is best suited for
qualitative analysis.
Ordinal Scale: Ordinal data is similar to nominal data, but with the added feature
of meaningful order or ranking between categories. However, the intervals between
the categories are not necessarily equal. For example:
– Customer satisfaction ratings (poor, fair, good, excellent)
– Education level (high school, bachelor's, master's, doctorate)
Although ordinal data can be ranked, we cannot quantify the difference between
categories. For example, the difference between ”fair” and ”good” may not be the
same as the difference between ”good” and ”excellent.”
Interval Scale: Interval data is numerical and has equal intervals between mea-
surements, meaning that the difference between two values is meaningful. However,
there is no true zero point in interval data, which means you cannot make state-
ments about ratios or percentages. For example:
– Example 1: Temperature in Celsius or Fahrenheit. The difference between
20°C and 30°C is the same as between 30°C and 40°C, but 0°C does not indicate
an absence of temperature.
– Example 2: IQ scores. The difference between an IQ score of 100 and 110
is the same as between 110 and 120, but an IQ score of 0 does not mean an
absence of intelligence.
Interval data can be added and subtracted, but not multiplied or divided meaning-
fully.
Ratio Scale: Ratio data is the most informative scale. It is numerical, has equal
intervals between values, and also has a true zero point, which allows for meaningful
ratio comparisons. For example:
– Height, weight, and age (zero means an absence of the quantity)
– Income ($0 means no income, and $40,000 is twice $20,000)
Ratio data allows for all arithmetic operations, including addition, subtraction,
multiplication, and division.
Contingency Tables
Structure: The rows typically represent one categorical variable, while the columns represent another. Each cell contains the count of occurrences for the combination of the row and column categories.
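In pandas, such a table can be built with pd.crosstab; the survey-style data below is an illustrative assumption:

import pandas as pd

# Hypothetical survey data (illustrative)
df = pd.DataFrame({
    'Gender': ['M', 'F', 'F', 'M', 'F', 'M'],
    'Preference': ['A', 'B', 'A', 'A', 'B', 'B'],
})

# Rows: one categorical variable; columns: another; cells: counts
table = pd.crosstab(df['Gender'], df['Preference'])
print(table)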
15 Creating Applied Visualization for EDA
Purpose of Visualization: To make the analysis of data easier to understand
and interpret by visually representing data distributions and relationships.
Types of Visualizations:
– Histograms: Display the distribution of a single continuous variable.
– Bar Charts: Compare frequencies or counts of categorical variables.
– Boxplots: Show the distribution of data based on five summary statistics.
– Scatterplots: Illustrate the relationship between two continuous variables.
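The sketch below draws one example of each type listed above; all data values are illustrative assumptions:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2 * x + rng.normal(size=200)

fig, axes = plt.subplots(1, 4, figsize=(16, 3))
axes[0].hist(x, bins=20)                    # histogram: one continuous variable
axes[0].set_title('Histogram')
axes[1].bar(['A', 'B', 'C'], [10, 25, 15])  # bar chart: categorical counts
axes[1].set_title('Bar Chart')
axes[2].boxplot(x)                          # boxplot: five-number summary
axes[2].set_title('Boxplot')
axes[3].scatter(x, y, s=10)                 # scatterplot: two continuous variables
axes[3].set_title('Scatterplot')
plt.tight_layout()
plt.show()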
16 Inspecting Boxplots
Definition: A boxplot (or whisker plot) is a standardized way of displaying the
distribution of data based on a five-number summary.
Components:
– Minimum: The smallest data point.
– First Quartile (Q1): The median of the lower half of the dataset.
– Median (Q2): The middle value of the dataset.
– Third Quartile (Q3): The median of the upper half of the dataset.
– Maximum: The largest data point.
Purpose: Helps to identify outliers and visualize the spread of the data.
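A minimal sketch relating the five-number summary to a boxplot; the data, including the deliberate outlier, is an illustrative assumption:

import numpy as np
import matplotlib.pyplot as plt

data = [2, 4, 4, 5, 6, 7, 8, 9, 10, 25]  # 25 is a deliberate outlier

# The five-number summary underlying the boxplot
print(np.percentile(data, [0, 25, 50, 75, 100]))  # min, Q1, median, Q3, max

plt.boxplot(data)  # points beyond the whiskers are drawn as outliers
plt.title('Boxplot with an outlier')
plt.show()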
19 Graphing Distributions
Purpose: To visualize how data is distributed across different values.
Common Types:
– Histograms
– Density (KDE) plots
– Boxplots
20 Plotting Scatterplots
Definition: A scatterplot displays the values of two variables for a set of data, with each point representing one observation.
21 Understanding Correlation
Definition: Correlation measures the strength and direction of a linear relationship
between two variables.
– Range: -1 to 1.
– Interpretation:
* 1: Perfect positive correlation.
* -1: Perfect negative correlation.
* 0: No linear correlation.
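A minimal sketch computing a Pearson correlation with NumPy; the synthetic data is an illustrative assumption:

import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=100)
y = 0.8 * x + rng.normal(scale=0.5, size=100)  # positively related to x

r = np.corrcoef(x, y)[0, 1]   # Pearson correlation coefficient
print(f'Pearson r: {r:.2f}')  # close to +1 for a strong positive relationship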
23 Using Nonparametric Correlation
Definition: Nonparametric correlation methods (like Spearman’s rank correlation)
do not assume a normal distribution.
Use Cases: Ideal for ordinal data or when the data do not meet parametric test
assumptions.
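A minimal sketch using SciPy's spearmanr; the rank-style data is an illustrative assumption:

from scipy.stats import spearmanr

# Ordinal-style data: satisfaction ranks vs. repeat purchases (illustrative)
satisfaction = [1, 2, 2, 3, 4, 5, 5]
purchases = [0, 1, 1, 2, 2, 4, 5]

rho, p_value = spearmanr(satisfaction, purchases)
print(f'Spearman rho: {rho:.2f}, p-value: {p_value:.3f}')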
The Chi-Square Test
Application: Used with contingency tables to assess whether the observed frequencies differ from the expected frequencies under the assumption of independence.
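A minimal sketch using SciPy's chi2_contingency; the observed counts are an illustrative assumption:

import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 contingency table of observed counts
observed = np.array([[30, 10],
                     [20, 40]])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f'chi2 = {chi2:.2f}, p = {p_value:.4f}, dof = {dof}')
print(expected)  # counts expected under independence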
Common Modifications:
Z-Score Standardization
Formula:

Z = (X − µ) / σ    (1)
where X is the value, µ is the mean, and σ is the standard deviation.
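A minimal sketch computing z-scores with NumPy; the data values are an illustrative assumption, and note that NumPy's std uses the population formula by default:

import numpy as np

data = np.array([10, 12, 14, 16, 18])

# Standardize: subtract the mean, divide by the standard deviation
z = (data - data.mean()) / data.std()
print(z)                   # standardized values
print(z.mean(), z.std())   # mean of z is 0, std of z is 1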
Common Techniques:
– Z-score standardization (shown above)
– Min-max scaling to a fixed range such as [0, 1]