Chapter 1,2,3

Uploaded by

sara.ayubian
Copyright
© All Rights Reserved

Chapter 1

Figure: diagram relating Data Science, Artificial Intelligence, Machine Learning, Data Mining, and Deep Learning

Data Science & Machine Learning A-Z: Hands on Python Instructor: Navid Shirzadi
Data

Data are characteristics or information, usually numerical, that are collected through observation (Wikipedia).

Each column of a data table is a variable (also called an attribute or feature). Each row is an example (also called a sample).

Date/Time     Building Code   Power Consumption (kW)   Heat Consumption (kW)   Power Price ($/kW)   Heat Price ($/kW)
1/1/21 0:00   6601            450                      550                     10                   4
1/2/21 1:00   6602            480                      590                     12                   5
1/3/21 2:00   6603            600                      540                     11                   7
1/4/21 3:00   6604            670                      596                     12                   3
1/5/21 4:00   6605            -26                      523                     10                   4
1/6/21 5:00   6606            390                      488                     9                    6
1/7/21 6:00   6607            430                      610                     14                   6


IDE (Integrated Development Environment)


Required Packages

NumPy
Pandas
Matplotlib
Scikit-learn

Installation options: pip, or the Anaconda distribution (conda).
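
Once the packages have been installed (for example with `pip install numpy pandas matplotlib scikit-learn`, or through Anaconda), a quick check like the following can confirm that Python can find them. This is only a sketch: the helper name `missing_packages` is made up for this example, and note that Scikit-learn is imported under the name `sklearn`.

```python
from importlib.util import find_spec

# Import names of the four packages listed above.
# Scikit-learn is imported as "sklearn".
REQUIRED = ["numpy", "pandas", "matplotlib", "sklearn"]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```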

Chapter 2

NumPy

NumPy is an open-source library that is used for working with arrays.


NumPy

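Matrix product (@):
[[1, 3], [4, 5]] @ [[3, 4], [5, 7]] = [[18, 25], [37, 51]]

Element-wise product (*):
[[1, 3], [4, 5]] * [[3, 4], [5, 7]] = [[3, 12], [20, 35]]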
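
The two products on this slide can be reproduced directly in NumPy. The variable names below are chosen for this example only.

```python
import numpy as np

A = np.array([[1, 3], [4, 5]])
B = np.array([[3, 4], [5, 7]])

matrix_product = A @ B       # rows of A combined with columns of B
elementwise_product = A * B  # each entry multiplied by its counterpart

print(matrix_product)        # [[18 25]
                             #  [37 51]]
print(elementwise_product)   # [[ 3 12]
                             #  [20 35]]
```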


NumPy

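Array addition:
[[3, 4], [5, 7]] + [[3, 3], [3, 3]]

Adding a single row to a 2D array repeats the row for every row of the array:
[[1, 2, 3], [3, 4, 5]] + [4, 5, 6]
is computed as
[[1, 2, 3], [3, 4, 5]] + [[4, 5, 6], [4, 5, 6]] = [[5, 7, 9], [7, 9, 11]]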
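
NumPy calls this behaviour broadcasting. The sketch below reproduces the row example from the slide. For the first example it assumes that the block of 3s comes from adding the scalar 3, which the slide does not state explicitly.

```python
import numpy as np

M = np.array([[3, 4], [5, 7]])
# Assumption: the slide's [[3, 3], [3, 3]] is the scalar 3 stretched to M's shape.
scalar_sum = M + 3

N = np.array([[1, 2, 3], [3, 4, 5]])
row = np.array([4, 5, 6])
row_sum = N + row  # the row is repeated for every row of N

print(scalar_sum)
print(row_sum)     # [[ 5  7  9]
                   #  [ 7  9 11]]
```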

Data Science & Machine Learning A-Z: Hands on Python Instructor: Navid Shirzadi
Chapter 2

NumPy

Normal Distribution
Uniform Distribution
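
Random samples from both distributions can be drawn with NumPy's random generator. This is a minimal sketch: the seed, the sample size, and the distribution parameters (mean 0 and standard deviation 1, and the interval from 0 to 1) are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seed fixed only to make the run repeatable

# Normal distribution: values cluster around the mean (loc).
normal_samples = rng.normal(loc=0.0, scale=1.0, size=10_000)

# Uniform distribution: every value in [low, high) is equally likely.
uniform_samples = rng.uniform(low=0.0, high=1.0, size=10_000)

print(normal_samples.mean(), normal_samples.std())
print(uniform_samples.min(), uniform_samples.max())
```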


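NumPy

Mean:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

Variance:
$$V = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}$$

Standard Deviation:
$$SD = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}}$$

Median (for sorted values):
odd number of values: x_1, x_2, x_3, x_4, x_5 (the median is the middle value, x_3)
even number of values: x_1, x_2, x_3, x_4 (the median is the average of the two middle values, x_2 and x_3)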
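
NumPy provides each of these statistics as a function. The sketch below uses two small made-up arrays (they are not from the slides) to show the odd and even median cases. By default `var` and `std` divide by n, which matches the formulas above.

```python
import numpy as np

odd = np.array([1, 3, 5, 7, 9])   # five values: the median is the middle one
even = np.array([1, 3, 5, 7])     # four values: the median averages 3 and 5

median_odd = np.median(odd)       # 5.0
median_even = np.median(even)     # 4.0

mean = odd.mean()                 # 5.0
variance = odd.var()              # 8.0 (divides by n)
sd = odd.std()                    # square root of the variance

print(median_odd, median_even, mean, variance, sd)
```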


Pandas

Pandas (Panel Data) is an open-source library that is designed for working with DataFrames and Series of data.

Series are 1D arrays with labeling possibility (the data type could be numerical or string).

Index   Values
age1    10
age2    20
age3    30
age4    40

A Pandas DataFrame is 2D labeled data with different types of features (columns). Several Series could create a DataFrame.

Index   Age   Grade1   Grade2
S1      20    10       8
S2      25    8        10
S3      27    5        3
S4      30    9        7
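
The two tables on this slide can be built as follows. The variable names are chosen for this example. The DataFrame is assembled from three Series that share the same index, which is one of several ways to create it.

```python
import pandas as pd

# A Series: one column of values with an index label for each value.
ages = pd.Series([10, 20, 30, 40], index=["age1", "age2", "age3", "age4"])

# Three Series sharing the same index ...
labels = ["S1", "S2", "S3", "S4"]
age = pd.Series([20, 25, 27, 30], index=labels)
grade1 = pd.Series([10, 8, 5, 9], index=labels)
grade2 = pd.Series([8, 10, 3, 7], index=labels)

# ... combined into one DataFrame, one column per Series.
students = pd.DataFrame({"Age": age, "Grade1": grade1, "Grade2": grade2})

print(ages)
print(students)
```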


Matplotlib


Year TI
2010 0.72
2011 0.61
2012 0.65
2013 0.68
2014 0.75
2015 0.90
2016 1.02
2017 0.93
2018 0.85
2019 0.99
2020 1.02

Data source: NASA/GISS


Credit: NASA Scientific Visualization Studio
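
A table like this is a natural fit for a line plot. The following is a minimal sketch using the values above. The axis labels simply reuse the column headers.

```python
import matplotlib.pyplot as plt

years = list(range(2010, 2021))
ti = [0.72, 0.61, 0.65, 0.68, 0.75, 0.90, 1.02, 0.93, 0.85, 0.99, 1.02]

fig, ax = plt.subplots()
(line,) = ax.plot(years, ti)  # one line connecting the yearly values
ax.set_xlabel("Year")
ax.set_ylabel("TI")

# In a notebook or IDE, finish with plt.show() to display the figure.
```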


Matplotlib

Month Elec 1 Elec 2


Jan 12 14
Feb 13 16
March 9 11
April 8 7
May 7 6
June 8 6
July 8 7
Aug 7 6
Sep 6 5
Oct 5 8
Nov 8 9
Dec 10 12
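
Two series can share one set of axes, and each can be given its own color, marker, and legend entry. The sketch below plots the table above. The particular colors, markers, and legend position are arbitrary choices for illustration.

```python
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "March", "April", "May", "June",
          "July", "Aug", "Sep", "Oct", "Nov", "Dec"]
elec1 = [12, 13, 9, 8, 7, 8, 8, 7, 6, 5, 8, 10]
elec2 = [14, 16, 11, 7, 6, 6, 7, 6, 5, 8, 9, 12]

fig, ax = plt.subplots()
# color sets the line color, marker the symbol drawn at each data point,
# and label the text shown for that line in the legend.
ax.plot(months, elec1, color="tab:blue", marker="o", label="Elec 1")
ax.plot(months, elec2, color="tab:orange", marker="s", label="Elec 2")
ax.set_xlabel("Month")
legend = ax.legend(loc="upper center")

# In a notebook or IDE, finish with plt.show() to display the figure.
```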


Matplotlib

Reference: https://matplotlib.org/3.1.0/gallery/color/named_colors.html


Matplotlib

Reference: https://matplotlib.org/3.1.1/api/markers_api.html


Matplotlib

Reference: https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.legend.html

Chapter 3

Preprocessing
Statistics

Data: 4, 8, 12, 21, 33, 58, 92, 98

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{4 + 8 + 12 + 21 + 33 + 58 + 92 + 98}{8} = 40.75$$

$$V = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n} = \frac{(4 - 40.75)^2 + (8 - 40.75)^2 + (12 - 40.75)^2 + (21 - 40.75)^2 + (33 - 40.75)^2 + (58 - 40.75)^2 + (92 - 40.75)^2 + (98 - 40.75)^2}{8} \approx 1237.7$$

$$SD = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}} = \sqrt{1237.7} \approx 35.2$$


Preprocessing
Statistics

Data: 4, 5, 6, 8, 10, 11, 13, 14

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{4 + 5 + 6 + 8 + 10 + 11 + 13 + 14}{8} = 8.875$$

$$V = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n} \approx 12.1$$

$$SD = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}} = \sqrt{12.1} \approx 3.48$$

A low SD means low dispersion of our data relative to its average.
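
Both worked examples can be checked with NumPy. The names `wide` and `narrow` are labels invented for this sketch. `var` and `std` divide by n by default, as in the formulas used here.

```python
import numpy as np

wide = np.array([4, 8, 12, 21, 33, 58, 92, 98])    # values spread far from the mean
narrow = np.array([4, 5, 6, 8, 10, 11, 13, 14])    # values close to the mean

for name, data in [("wide", wide), ("narrow", narrow)]:
    mean = data.mean()
    variance = data.var()  # population variance: divides by n
    sd = data.std()        # square root of the variance
    print(f"{name}: mean={mean:.3f}, variance={variance:.2f}, SD={sd:.2f}")

# wide: mean=40.750, variance=1237.69, SD=35.18
# narrow: mean=8.875, variance=12.11, SD=3.48
```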


Preprocessing
Covariance

Covariance shows the relation between two variables (features). The covariance formula measures how the two variables vary together, and its sign gives the direction of the relationship.

$$Cov(X, Y) = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{N}$$

Figure: three plots of Y against X, showing Cov(X, Y) < 0, Cov(X, Y) > 0, and Cov(X, Y) = 0.
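
The three cases can be illustrated with a direct implementation of the formula. The function name and the small data sets below are made up for this sketch. `np.cov` with `bias=True` also divides by N and should give the same value.

```python
import numpy as np


def population_covariance(x, y):
    """Covariance of x and y, dividing by N as in the formula above."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sum((x - x.mean()) * (y - y.mean())) / len(x))


x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]   # increases with x: positive covariance
falling = [10, 8, 6, 4, 2]  # decreases as x increases: negative covariance
flat = [3, 3, 3, 3, 3]      # does not change with x: zero covariance

print(population_covariance(x, rising))   # 4.0
print(population_covariance(x, falling))  # -4.0
print(population_covariance(x, flat))     # 0.0
```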

Preprocessing
Missing Values

NaN ≈ Not a Number
Null ≈ No Value

Python reports null when a cell is empty, while NaN can be used when the cell is filled with something that does not make sense. In pandas, both are treated as missing values.

Time   E_Plug   E_Heat   Price   Temperature   No. Occup
1      24       28       10      -15           12
2      17       32       12      -17           12
3      16       34       11      -19           12
4      16       33       12      -18           12
5      16       30       10      -14           12
6      16       31       10      -16           12
7      19       28       14      -14           12
8      22       29       12      -15           9
9      25       26       12      -12           8
10     26       24       14      -8            8
11     27       20       14      -4            8
12     30       19       16      0             5
13     30       19       16      0             4
14     NaN      13       17      2             4
15     27       14       17      3             4
16     27       16       17      2             6
17     28       -4       18      0             8
18     33       26       20      -6            9
19     42       32       2       -8            10
20     48       33       21      -12           12
21     47       32       21      -16           12
22     44       30       22      -18           12
23     36       35       21      -19           12
24     37       36       18      -22           12
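
Pandas offers a few standard ways to find and handle missing values. The sketch below uses rows 13 to 16 of the table (only the E_Plug and E_Heat columns). Filling the gap with the column mean is just one possible strategy, chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Rows 13 to 16 of the table; row 14 has a missing E_Plug reading.
df = pd.DataFrame(
    {"E_Plug": [30, np.nan, 27, 27], "E_Heat": [19, 13, 14, 16]},
    index=[13, 14, 15, 16],
)

missing_per_column = df.isna().sum()  # count the missing cells in each column
dropped = df.dropna()                 # option 1: remove rows with a missing value
filled = df.fillna({"E_Plug": df["E_Plug"].mean()})  # option 2: fill with the column mean

print(missing_per_column)
print(dropped)
print(filled)
```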


Outlier
Concept

When a data point is unusually smaller or larger than the other data, it could be (but is not always) an outlier.

Data: 4, 2, 5, 3, 7, 5, 6, 9, 32, 2

The value 32 is not necessarily an outlier, but it has a high potential to be one.


Outlier
How to find them

Data: 4, 2, 5, 3, 7, 5, 6, 9, 32, 2
Sorted: 2, 2, 3, 4, 5, 5, 6, 7, 9, 32
Median = 5
With the median inserted: 2, 2, 3, 4, 5, 5, 5, 6, 7, 9, 32
Q1 = 3.5
Q3 = 6.5
IQR = Q3 - Q1 = 3

There are 2 standards for this:

1. Mild Outlier
• Q1 - 1.5×IQR (Lower) = 3.5 - 1.5×3 = -1
• Q3 + 1.5×IQR (Upper) = 6.5 + 1.5×3 = 11

2. Extreme Outlier
• Q1 - 3×IQR (Lower) = 3.5 - 3×3 = -5.5
• Q3 + 3×IQR (Upper) = 6.5 + 3×3 = 15.5

Figure: box plot of the data marking Q1 = 3.5, the median 5, Q3 = 6.5, the values 2 and 9, the limit 15.5, and the point 32.
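
The same rule can be sketched in NumPy. The function name `iqr_outliers` is invented for this example. One caveat: `np.percentile` interpolates quartiles differently from the hand calculation above (it gives Q1 = 3.25 and Q3 = 6.75 for this data), so the limits differ slightly, but 32 is flagged either way.

```python
import numpy as np


def iqr_outliers(values, factor=1.5):
    """Return the values outside [Q1 - factor*IQR, Q3 + factor*IQR] and the limits."""
    data = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(data, [25, 75])  # NumPy's default (linear) quartiles
    iqr = q3 - q1
    lower = q1 - factor * iqr
    upper = q3 + factor * iqr
    mask = (data < lower) | (data > upper)
    return data[mask].tolist(), (float(lower), float(upper))


data = [4, 2, 5, 3, 7, 5, 6, 9, 32, 2]
mild, mild_limits = iqr_outliers(data, factor=1.5)         # mild outlier rule
extreme, extreme_limits = iqr_outliers(data, factor=3.0)   # extreme outlier rule

print(mild, mild_limits)
print(extreme, extreme_limits)
```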


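Preprocessing
Concatenating

axis = 1 joins data side by side as new columns (here, Col 3 is joined to Col 1 and Col 2).
axis = 0 joins data as new rows (here, Code11 is added below Code10).

         Col 1   Col 2
Code1    4       25
Code2    2       36
Code3    5       55
Code4    3       69
Code5    7       99
Code6    5       65
Code7    6       51
Code8    9       21
Code9    32      58
Code10   2       22
Code11   56      12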
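
In pandas, this is done with `pd.concat`. The sketch below uses the first three rows of the table plus the Code11 row. The values in Col 3 are hypothetical, since the slide does not list them.

```python
import pandas as pd

base = pd.DataFrame(
    {"Col 1": [4, 2, 5], "Col 2": [25, 36, 55]},
    index=["Code1", "Code2", "Code3"],
)
new_row = pd.DataFrame({"Col 1": [56], "Col 2": [12]}, index=["Code11"])
new_col = pd.DataFrame({"Col 3": [1, 2, 3]}, index=base.index)  # hypothetical values

rows_added = pd.concat([base, new_row], axis=0)  # stack vertically: more rows
cols_added = pd.concat([base, new_col], axis=1)  # join side by side: more columns

print(rows_added)
print(cols_added)
```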

Preprocessing
Dummy Coding

To make categorical variables understandable for the machine, we need to give them numerical values, and that is why we use dummy coding. The categorical variable (P/OffP) becomes the dummy variables (Peak and OffPeak).

P/OffP    Peak   OffPeak
OffPeak   0      1
OffPeak   0      1
OffPeak   0      1
OffPeak   0      1
OffPeak   0      1
OffPeak   0      1
Peak      1      0
Peak      1      0
Peak      1      0
OffPeak   0      1
OffPeak   0      1
OffPeak   0      1
OffPeak   0      1
OffPeak   0      1
OffPeak   0      1
OffPeak   0      1
Peak      1      0
Peak      1      0
Peak      1      0
Peak      1      0
Peak      1      0
OffPeak   0      1
OffPeak   0      1
OffPeak   0      1
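
Pandas can generate dummy variables with `pd.get_dummies`. This is a minimal sketch on a shortened version of the column. The variable name `tariff` is chosen for this example, and `dtype=int` is passed so that the result holds 0 and 1 as in the table.

```python
import pandas as pd

# A shortened version of the P/OffP column.
tariff = pd.Series(["OffPeak", "OffPeak", "Peak", "Peak", "OffPeak"], name="P/OffP")

# One new 0/1 column per category.
dummies = pd.get_dummies(tariff, dtype=int)

# Reorder the columns to match the table (Peak first).
dummies = dummies[["Peak", "OffPeak"]]

print(dummies)
```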


Preprocessing
Normalization

Col 1   Col 2
4       25
2       36
5       55
3       69
7       99
5       65
6       51
9       21
32      58
2       22

2 < Col 1 < 9 (apart from the value 32)
21 < Col 2 < 99

We are dealing with features that have different ranges. Although the range of Col 1 is much smaller than the range of Col 2, Col 1 could be as important as Col 2 (and maybe more).

The machine does not understand this. It only understands numbers.

To solve this, we need to change the ranges of the features to the same range (for example, between 0 and 1). This is called normalizing the data.


Preprocessing
Normalization

Sklearn (preprocessing sub-package)

Min Max Scaler (MinMaxScaler):
$$\frac{x - x_{Min}}{x_{Max} - x_{Min}}$$

Normalize (normalize):

L1 Norm (Manhattan Distance):
$$\frac{x_i}{|x_1| + |x_2| + |x_3| + \dots + |x_n|}$$

L2 Norm (Euclidean Distance):
$$\frac{x_i}{\sqrt{x_1^2 + x_2^2 + x_3^2 + \dots + x_n^2}}$$
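
To make the formulas concrete, the sketch below implements them directly in NumPy on the first five rows of the Col 1 / Col 2 table. The function names are invented for this example. In practice, `MinMaxScaler` and `normalize` from `sklearn.preprocessing` perform these operations. The sketch assumes that min-max scaling is applied to each column and the L1/L2 norms to each row, which corresponds to the default behaviour of those two tools.

```python
import numpy as np

X = np.array([[4, 25], [2, 36], [5, 55], [3, 69], [7, 99]], dtype=float)


def min_max_scale(X):
    """Rescale every column to the range 0 to 1 (assumes no column is constant)."""
    col_min = X.min(axis=0)
    col_max = X.max(axis=0)
    return (X - col_min) / (col_max - col_min)


def normalize_rows(X, norm="l2"):
    """Divide every row by its L1 or L2 norm (assumes no all-zero row)."""
    if norm == "l1":
        lengths = np.abs(X).sum(axis=1, keepdims=True)
    elif norm == "l2":
        lengths = np.sqrt((X ** 2).sum(axis=1, keepdims=True))
    else:
        raise ValueError("norm must be 'l1' or 'l2'")
    return X / lengths


scaled = min_max_scale(X)
l1_rows = normalize_rows(X, norm="l1")
l2_rows = normalize_rows(X, norm="l2")

print(scaled)
print(l1_rows)
print(l2_rows)
```

After min-max scaling, both columns run from 0 to 1, so neither dominates simply because its raw numbers are larger.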
