Data Science Crash Course for Beginners
Edited by AI Publishing
eBook Converted and Cover by Gazler Studio
Published by AI Publishing LLC
ISBN-13: 978-1-7347901-4-6
Legal Notice:
You are not permitted to amend, use, distribute, sell, quote, or paraphrase
any part of the content within this book without the specific consent of
the author.
Disclaimer Notice:
Kindly note that the information contained within this document is
solely for educational and entertainment purposes. No warranties of
any kind are indicated or expressed. Readers accept that the author
is not providing any legal, professional, financial, or medical advice.
Kindly consult a licensed professional before trying out any techniques
explained in this book.
Preface
Who Is This Book For?
How to Use This Book?
Conclusions
1.1. Introduction
We are living in the age of data. Not only is the amount of
data we currently have gigantic, but the pace of data
generation is also accelerating day by day. There are more
than 4.5 billion active internet users at the time of writing this
book, July 2020, which is about 60 percent of the
global population. These internet users create large volumes
of data by using social media such as Facebook, Twitter,
Instagram, and YouTube.
Figure 1.1 shows the global growth trend of data volume from
2006 to 2020 based on "The digital universe in 2020: big data,
bigger digital shadows, and biggest growth in the far east."
Note that the graph is measured in zettabytes (ZB), or
10^21 bytes; 1 ZB corresponds to 1 trillion gigabytes (GB). This is
a huge amount of data. The value associated with this data
diminishes because the data is not processed and analyzed
at the same rate. Thus, it is of utmost importance to extract
knowledge from the data.
· Data Acquisition,
· Data Preparation,
· Exploratory Data Analysis,
· Data Modeling and Evaluation, and
· Interpretation and Reporting of Findings.
Hands-on Time
It is time to check your understanding of the topic of this
book through the exercise questions given in Section 1.5.
The answers to these questions are given at the end of the
book.
1.5. Exercises
Question 1:
Question 2:
Question 3:
Question 4:
2.1. Introduction
Digital computers can understand instructions given to them
in zeros and ones, where a one means turning ON a specific
part of the central processing unit of a computer, and a zero
means turning OFF that part. The computer instructions in
the form of 0s and 1s are called machine language or machine
code. It is very difficult for humans to understand and program
the computer using machine language. Instead of using a
low-level machine language, we use easily understandable
higher-level languages whose instructions are automatically
translated into machine language.
2.2.1. Windows
1. Download the graphical Windows installer from https://
www.anaconda.com/products/individual
2. Double-click the downloaded file. Then, click Continue
to begin the installation.
3. Answer the following prompts: Introduction, Read Me,
and License screens.
4. Next, click the Install button. Install Anaconda in the
specified directory.
2.2.2. Apple OS X
1. Download the graphical MacOS installer from https://
www.anaconda.com/products/individual
2. Double-click the downloaded file. Click Continue to
begin the installation.
3. Next, click the Install button. Install Anaconda in the
specified directory.
2.2.3. GNU/Linux
We use the command line to install Anaconda on Linux because
there is no graphical installer. We first download a copy of the
installation file from https://www.anaconda.com/products/individual.
2.3. Datasets
Scikit-Learn, also called sklearn, is a free Python library for
machine learning tasks such as classification, regression,
and clustering. It is designed to work with the Python library
NumPy that operates on numbers and is used to perform
common arithmetic operations. The Scikit-Learn library comes
with standard datasets that can be loaded using the Python
functions.
A single Python command can load the dataset of our choice.
For example,
from sklearn.datasets import load_boston
command loads the Boston house-prices dataset.
Table 2.1 shows some of the datasets available in the Scikit-
Learn library.
Table 2.1: Scikit-Learn datasets for machine learning tasks.
Dataset                                            Description
load_boston(*[, return_X_y])                       Load and return the Boston house-prices dataset (regression).
load_iris(*[, return_X_y, as_frame])               Load and return the iris dataset (classification).
load_diabetes(*[, return_X_y, as_frame])           Load and return the diabetes dataset (regression).
load_digits(*[, n_class, return_X_y, as_frame])    Load and return the digits dataset (classification).
load_linnerud(*[, return_X_y, as_frame])           Load and return the physical exercise Linnerud dataset.
load_wine(*[, return_X_y, as_frame])               Load and return the wine dataset (classification).
load_breast_cancer(*[, return_X_y, as_frame])      Load and return the breast cancer Wisconsin dataset (classification).
2.4.1. NumPy
NumPy (Numerical Python) is a core Python library for
performing arithmetic and computing tasks. It offers multi-
dimensional array objects for processing data. An array is a
grid of values, as shown below:
x = [2 1 3.5 -9 0.6]
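A quick sketch of creating such an array with NumPy and applying element-wise operations to it:

import numpy as np

x = np.array([2, 1, 3.5, -9, 0.6])   # a one-dimensional NumPy array
print(x * 2)        # element-wise arithmetic
print(x.mean())     # a common numeric operation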
2.4.2. Pandas
Pandas is a Python library for data cleaning, manipulation, and
preprocessing. It is very handy for working with labeled data.
Pandas allows importing data from various file formats, such as
comma-separated values (.csv) and Excel. Pandas is based on
two main data structures: Series, which is like a list of items,
and DataFrame, which is like a table of rows and columns.
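A minimal sketch of these two data structures:

import pandas as pd

# A Series is a one-dimensional labeled list of items
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(s)

# A DataFrame is a two-dimensional labeled table of rows and columns
df = pd.DataFrame({'name': ['John', 'Julia'], 'age': [50, 40]})
print(df)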
2.4.3. Matplotlib
A picture is worth a thousand words. It is very convenient
to discover patterns in the data by visualizing its important
features. Matplotlib is a standard data science plotting
library that helps us generate two-dimensional plots, graphs,
and diagrams. Using one-liners to generate basic plots in
Matplotlib is quite simple. For example, the following Python
code generates a sine wave.
1. # Python Code to generate a Sine wave
2. import numpy as np
3. import matplotlib.pyplot as plt
4. #time values of wave
5. time = np.arange(0, 10, 0.1)
6. #Amplitude of wave
7. x = np.sin(time)
8. #Plotting the wave
9. plt.plot(time, x)
10. plt.show()
Figure 2.22: Sine wave generated from the code given above.
2.4.4. Scikit-Learn
Scikit-Learn is an open-source library in the SciPy Stack. The
latter is a Python-based ecosystem of open-source software
for mathematics, science, and engineering. Scikit-Learn uses
the math operations of SciPy to implement machine learning
algorithms for common data science tasks. These tasks include
classification, regression, and clustering. The details of these
tasks are provided in Chapter 7.
Further Reading
· Further reading related to the installation of Anaconda
on Windows, MacOS, and Linux can be found at https://
docs.anaconda.com/anaconda/install/.
Hands-on Time
It is time to check your understanding of the topic of this
chapter through the exercise questions given in Section 2.5.
The answers to these questions are given at the end of the
book.
Question 2:
Question 3:
Question 4:
Question 5:
3.1. Introduction
Chapter 2 outlines the Python installation specific to data
science. However, the installed Python can perform common
tasks as well. This chapter provides a quick overview of the
basic functionality of Python. If the reader is familiar with the
basics of Python, this chapter can be skipped to directly move
on to the next chapter. Nevertheless, it is recommended to
briefly go through the material presented here and test some
of the examples to review Python commands. The focus of this
chapter is to provide an overview of the basic functionality of
Python to solve data science problems.
Output:
2
20-4*8
Output:
-12
(20-4*8)/3
Output:
-4.0
Output:
2.5
Integer numbers such as 20 have type int, whereas
2.5 has type float. To get an integer result from division by
discarding the fractional part, we use the // operator.
50 // 8
Output:
6
Output:
2
Output:
32
Output:
float
Output:
(2+7j)
Operator (Symbol)     Description                                                                          Example
Addition (+)          Adds two values together.                                                            3 + 4 → Output: 7
Subtraction (−)       Subtracts the right-hand operand from the left operand.                              5 − 2 → Output: 3
Multiplication (*)    Multiplies the operands together.                                                    5 * 2 → Output: 10
Division (/)          Divides the left-hand operand by the right operand.                                  5 / 2 → Output: 2.5
Modulus (%)           Divides the left-hand operand by the right operand and returns the remainder.        15 % 6 → Output: 3
Exponent (**)         Raises the left-hand operand to the power of the right operand.                      3 ** 3 → Output: 27
Floor division (//)   Performs integer division of the left operand by the right operand, returning the integer part of the result.   20 // 3 → Output: 6
1. height = 10
2. width = 3 * 4
3. width * height
Output:
120
Output:
NameError: name ‘Height’ is not defined
Output:
1000
Logical Operator   Description                                       Example
and                Returns True only when both operands are true.    x = True; y = False; x and y → Output: False
or                 Returns True if either of the operands is true.   x = True; y = False; x or y → Output: True
not                Complements (negates) the operand.                x = True; not x → Output: False
Comparison Operator             Description                                                                 Example
Equal (==)                      Checks whether both operands are equal.                                     x == y → Output: False
Not equal (!=)                  Checks whether both operands x and y are not equal.                         x != y → Output: True
Greater than (>)                Checks whether one operand is greater than the other operand.               x > y → Output: True
Less than (<)                   Checks whether one operand is smaller than the other operand.               x < y → Output: False
Greater than or equal to (>=)   Checks whether one operand is greater than or equal to the other operand.   x >= y → Output: True
Each character has a numeric Unicode code point; for example, the
character 'a' has the code point 97. Type ord("a") in the Jupyter cell and press Enter
to observe 97 as an output. Strings can be specified by:
· Enclosing characters in single quotes (‘...’) or
· Enclosing characters in double quotes (“...”)
Type the following code to observe the output.
‘A string enclosed in a single quote.’
Output:
‘A string enclosed in a single quote.’
Output:
apple and banana are fruits
Output:
File "<ipython-input-96-868ddf679a10>", line 1
print('Are n't, you said this.')
^
SyntaxError: invalid syntax
Output:
Are n’t, you said this.
Output:
c:
ew_directory
Output:
c:\new_directory
Output:
‘Python’
Output:
H
print(x[-1])
Output:
d
§§ String Slicing
Python allows indexing the elements of a string to extract
substrings. This is called string slicing. If x is a string, an
expression of the form x[p1:p2] returns the portion of x from
position p1 to the position just before p2. To access multiple
characters of a string, we specify a range of indices as follows:
print(x[0:5])
When the first index is omitted, the slice starts at the beginning
of the string: x[:p2] and x[0:p2] give the same result. For an
integer n (0 ≤ n ≤ len(x)), x[:n] + x[n:] will be equal to x.
1. x = ‘Hello world’
2. x[:6] + x[6:] == x
Output:
True
1. x = ‘Hello world’
2. x[1:8:2]
Output:
‘el o’
Output:
‘hELLO, HOW ARE you DOING?’
isalnum(): Returns True if all characters in the string are alphanumeric.
    'hello*'.isalnum() → False
Output:
True
islower(): Returns True if all characters in the string are lower case.
    "Python".islower() → False
isupper(): Returns True if all characters in the string are upper case.
    "PYTHON".isupper() → True
lower(): Converts a string into a lower case string.
    "PYTHON".lower() → 'python'
partition(): Returns a tuple, where the string is split into three parts.
    "python".partition('t') → ('py', 't', 'hon')
replace(): Returns a string where a specified value is replaced with another specified value.
    "We learn".replace('learn', 'have learned') → 'We have learned'
split(): Splits the string at the specified separator (whitespace by default) and returns a list.
    "Data Science".split() → ['Data', 'Science']
splitlines(): Splits the string at line breaks and returns a list.
    "Learn Python\nfor Data Science".splitlines() → ['Learn Python', 'for Data Science']
startswith(): Returns True if the string starts with the specified value.
    "Data Science".startswith("D") → True
strip(): Returns a trimmed version of the string by removing spaces at the start and end.
    "  Python  ".strip() → 'Python'
swapcase(): Swaps cases; lower case becomes upper case and vice versa.
    "Hello123".swapcase() → 'hELLO123'
title(): Converts the first character of each word to upper case.
    "python for data science".title() → 'Python For Data Science'
upper(): Converts a string into upper case.
    "python".upper() → 'PYTHON'
Further Readings
More information about Python and its commonly used
functions can be found at
https://docs.python.org/3/tutorial/index.html
if(condition):
Statement1
Note that the input() function takes the input from the user
and saves it as a string. We use int(input()) to convert the
string to an integer. Now, if we run this program and enter any
value greater than 100, we get a warning “Invalid marks.” If
the marks entered by a user are 100 or less, we do not get any
warning message.
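A minimal sketch of the program being described (the variable name and prompt text are assumptions):

marks = int(input('Enter your marks: '))
if (marks > 100):
    print('Invalid marks.')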
else Statement
if(condition):
else:
§§ Nested Decisions
Complex decisions require multiple tests to be performed. Python
allows us to perform these tests via nested if statements, as sketched below.
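For example, a nested decision might look like this (a sketch reusing the marks example from above):

marks = int(input('Enter your marks: '))
if (marks <= 100):
    if (marks >= 50):
        print('Pass')
    else:
        print('Fail')
else:
    print('Invalid marks.')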
Further Readings
More information about conditional statements can be
found at
https://www.techbeamers.com/python-if-else/
switch = {
case1: value1,
case2: value2,
case3: value3,
...
}
switch.get(case)
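Python has no built-in switch statement (before the match statement added in Python 3.10), so a dictionary lookup is commonly used instead. A small concrete example:

day = {
    1: 'Monday',
    2: 'Tuesday',
    3: 'Wednesday',
}
print(day.get(2))               # Tuesday
print(day.get(7, 'Unknown'))    # the second argument is returned when the key is missing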
while (condition):
§§ Nested Loops
Both for and while loops can be used inside each other. This is
called nested loops, which are used to work in 2-dimensions.
For example, if we have two variables, and we want to print
all the combinations of these variables together, we may write
the following program that uses two for loops, one nested
inside another.
1. properties = [“red”, “round”, “tasty”]
2. fruits = [“apple”, “orange”, “banana”]
3. for j in properties:
4. for k in fruits:
5. print(j, k)
Output:
red apple
red orange
red banana
round apple
round orange
round banana
tasty apple
tasty orange
tasty banana
Further Readings
More information about loops can be
found at
https://bit.ly/3hQUFay
1. def my_function1():
2. print(“This is my first Python function”)
Output:
This is my first Python function
N! = 1*2*3*...*(N-2)*(N-1)*N.
5! = 1*2*3*4*5=120
None in the output means the function does not return any
numeric value for 0 and negative inputs.
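A sketch of a factorial function consistent with the behavior described (it returns nothing, hence None, for zero and negative inputs):

def factorial(n):
    # Returns N! for positive n; returns nothing (None) for 0 and negative inputs
    if n <= 0:
        return
    result = 1
    for i in range(1, n + 1):
        result = result * i
    return result

print(factorial(5))    # 120
print(factorial(0))    # None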
3.6.1. Lists
A list is an ordered and changeable collection of elements. In
Python, lists are written with square brackets. For example, to
create a list named fruitlist, type the following code:
1. fruitlist = [“apple”, “orange”, “banana”, “melon”]
2. print(fruitlist)
Table 3.7: Some useful methods for the list data type.
Output:
NameError: name ‘fruitlist’ is not defined
3.6.2. Tuples
A tuple is an ordered collection that is unchangeable. Tuples
are written using round brackets () in Python. For example, to
create a tuple, type:
Output:
Data Science
§§ Tuple Methods
count() returns the number of times a specified value occurs
in a tuple.
1. mytuple = (‘Python’, ‘is handy for’, ‘Data Science’)
2. print(mytuple.count(‘Data’))
3. print(mytuple.count(‘Data Science’))
Output:
0
1
3.6.3. Sets
A set is an unordered and unindexed collection of elements.
Sets are written with curly brackets { } in Python. For example,
1. myset = {“cat”, “tiger”, “dog”, “cow”}
2. print(myset)
Output:
{'cat', 'tiger', 'dog', 'cow'}
Furthermore, we can:
· remove an arbitrary item from a set by using the pop() method (sets are unordered, so there is no guaranteed last item);
· empty the set by using the clear() method; and
· delete the set completely using the keyword del.
These operations are sketched below.
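myset = {"cat", "tiger", "dog", "cow"}
myset.pop()       # removes an arbitrary item (sets are unordered)
print(myset)
myset.clear()     # empties the set
print(myset)      # set()
del myset         # deletes the set completely; using myset afterwards raises a NameError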
Output:
{‘name’: ‘Python’, ‘purpose’: ‘Data science’, ‘year’: 2020}
The values can repeat, and they can be of any data type.
However, keys must be unique and immutable, such as a string, number,
or tuple. We use square brackets to access a specified value
of a dictionary by referring to its key name as follows.
1. # accesses value for key ‘name’
2. print(mydict[‘name’])
3.
4. # accesses value for key ‘purpose’
5. print(mydict.get(‘purpose’))
Output:
Python
Data science
Output:
None
KeyError: ‘address’
This error indicates that the key ‘address’ does not exist. We
can alter the value of a specific element by referring to its key
name as follows:
1. mydict[“year”] = 2019
2. mydict
Output:
{‘name’: ‘Python’, ‘purpose’: ‘Data science’, ‘year’: 2019}
1. for x, y in mydict.items():
2. print(x, y)
Output:
name Python
purpose Data science
year 2019
1. mydict.clear()
2. mydict
Output:
{ }
§§ Nested Dictionaries
A dictionary inside another dictionary is called a nested
dictionary. We can have multiple dictionaries inside a dictionary.
For instance, the following code will create a dictionary family
that contains three other dictionaries: child1, child2, and child3.
1. child1 = {
2. “name” : “John”,
3. “dob” : 2004
4. }
5. child2 = {
6. “name” : “Jack”,
7. “dob” : 2007
8. }
9. child3 = {
10. “name” : “Tom”,
11. “dob” : 2011
12. }
13.
14. family = {
15. “child1” : child1, # child1 is placed inside dictionary
family
16. “child2” : child2, # child2 is placed inside dictionary
family
17. “child3” : child3 # child3 is placed inside dictionary
family
18. }
19. family # displays dictionary family
Output:
{‘child1’: {‘name’: ‘John’, ‘dob’: 2004},
‘child2’: {‘name’: ‘Jack’, ‘dob’: 2007},
‘child3’: {‘name’: ‘Tom’, ‘dob’: 2011} }
Output:
dict_items([(‘name’, ‘John’), (‘dob’, 2004)])
Further Readings
More detailed tutorials on Python can be found at
https://bit.ly/2DqhAe5
Question 1:
Question 2:
Question 3:
Question 4:
Question 5:
Question 6:
word = ‘Python’
word[1]
A. ‘P’
B. ‘p’
C. ‘y’
D. ‘Y’
Question 7:
word = ‘Python’
word[−2]
A. ‘n’
B. ‘o’
C. ‘h’
D. ‘P’
Question 8:
Question 9:
Question 10:
4.1. Introduction
As discussed earlier in Chapter 1, the process to collect,
preprocess, clean, visualize, analyze, model, and interpret the
data is called a data science pipeline. The main steps of this
pipeline are as follows:
· Data Acquisition,
· Data Preparation,
· Exploratory Data Analysis,
· Data Modeling and Evaluation, and
· Interpretation and Reporting of Findings.
This chapter explains the data acquisition step in detail,
along with practical Python examples. Subsequent chapters
of the book explain the steps of the pipeline after the data
acquisition. In the following sections, we describe different
methods to acquire data.
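The output below appears to come from building a small trigonometry table with NumPy and pandas; a sketch of the kind of code involved (the exact script is not shown in this excerpt):

import numpy as np
import pandas as pd

degrees = np.arange(0, 361)
radians = np.deg2rad(degrees)
trig_table = pd.DataFrame({
    'Degrees': degrees,
    'Sine': np.round(np.sin(radians), 4),
    'Cosine': np.round(np.cos(radians), 4),
    'Tangent': np.round(np.tan(radians), 4),
})
print(trig_table)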
Output:
Degrees Sine Cosine Tangent
0 0 0 1 0
1 1 0.0175 0.9998 0.0175
2 2 0.0349 0.9994 0.0349
3 3 0.0523 0.9986 0.0524
4 4 0.0698 0.9976 0.0699
.. ... ... ... ...
356 356 -0.0698 0.9976 -0.0699
357 357 -0.0523 0.9986 -0.0524
358 358 -0.0349 0.9994 -0.0349
359 359 -0.0175 0.9998 -0.0175
360 360 0 1 0
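The next output comes from reading a CSV file of employee records with Python's csv module; a sketch of the kind of code involved (the file name and column order are assumptions based on the printed output):

import csv

# 'employees.csv' is a hypothetical file name; the columns are inferred from the output below
with open('employees.csv') as csv_file:
    csv_reader = csv.reader(csv_file)
    line_count = 0
    for row in csv_reader:
        if line_count == 0:
            print(f'Column names are {", ".join(row)}')
        else:
            print(f'{row[0]} is {row[1]} years old and works as {row[2]}.')
        line_count += 1
    print(f'Processed {line_count} lines.')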
Output:
Column names are name, age, designation
John is 50 years old and works as Manager.
Julia is 40 years old and works as Assistant Manager.
Tom is 30 years old and works as Programmer.
Sophia is 25 years old and works as Accountant.
Processed 5 lines.
The join() method takes all items in its input argument and
connects them into one string. The expression {", ".join(row)} connects all row
values together, separated by commas.
<span class="item-price">$29.95</span>
Now, we are able to create the correct XPath query and use
the lxml xpath function as given below.
1. # Creates a list of buyers
2. buyers = tree.xpath('//div[@title="buyer-name"]/text()')
3. # Create a list of prices
4. prices = tree.xpath('//span[@class="item-price"]/text()')
5. print('Buyers: ', buyers)
6. print('Prices: ', prices)
Output:
Buyers: [‘Carson Busses’, ‘Earl E. Byrd’, ‘Patty Cakes’,
‘Derri Anne Connecticut’, ‘Moe Dess’, ‘Leda Doggslife’,
‘Dan Druff’, ‘Al Fresco’, ‘Ido Hoe’, ‘Howie Kisses’, ‘Len
Lease’, ‘Phil Meup’, ‘Ira Pent’, ‘Ben D. Rules’, ‘Ave
Sectomy’, ‘Gary Shattire’, ‘Bobbi Soks’, ‘Sheila Takya’,
‘Rose Tattoo’, ‘Moe Tell’]
To read data from this XML file, we type the following code:
1. from lxml import objectify
2. my_xml = objectify.parse(‘books.xml’)
3. my_xml
Output:
<lxml.etree._ElementTree at 0x1de253ed488>
Output:
‘Giada De Laurentiis’
Output:
2005
We can find the children of the root element of the XML page.
root.getchildren()
Output:
[<Element book at 0x1de26789808>,
<Element book at 0x1de26789a48>,
<Element book at 0x1de26789a88>,
<Element book at 0x1de26789ac8>]
Output:
[‘title’, ‘author’, ‘year’, ‘price’]
Output:
[‘Everyday Italian’, ‘Giada De Laurentiis’, ‘2005’, ‘30.00’]
Hands-on Time
It is time to check your understanding of the topic of this
chapter through the exercise questions given in Section 4.7.
The answers to these questions are given at the end of the
book.
5.1. Introduction
Data preparation is the process of constructing a clean dataset
from one or more sources such that the data can be fed into
subsequent stages of a data science pipeline. Common data
preparation tasks include handling missing values, outlier
detection, feature/variable scaling, and feature encoding.
Data preparation is often a time-consuming process.
1. myseries2.values
Output:
array([ 1, −3, 5, 20], dtype=int64)
1. myseries2.index
Output:
Index([‘a’, ‘b’, ‘c’, ‘d’], dtype=’object’)
Output:
2 3
3 4
dtype: int32
Output:
a 0.000000
b NaN
c 2.302585
d 2.995732
dtype: float64
Output:
white 1
black 2
blue 3
green 4
green 5
yellow 4
black 3
red 2
dtype: int64
Output:
array([1, 2, 3, 4, 5], dtype=int64)
Output:
4 2
3 2
2 2
5 1
1 1
dtype: int64
Output:
white False
black False
blue False
green False
green True
yellow False
black False
red False
dtype: bool
mycolors[mycolors.isin([5,7])]
Output:
green 5
dtype: int64
Output:
array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14]])
pd.concat([myseries1,myseries2])
Output:
0 0.865165
1 0.305467
2 0.692341
3 0.859180
4 0.004683
5 0.670931
6 0.762998
7 0.200184
8 0.266258
9 0.296408
dtype: float64
Output:
If we use the option keys along with axis=1, the provided keys
become the column names of the resulting DataFrame.
Output:
Output:
Note the NaN values have been placed in those columns whose
information is not present in individual data frames.
pd.merge(myframe1, myframe2)
Output:
Output:
Output:
1. myseries1 = pd.
Series([50, 40, 30, 20, 10], index=[1,2,3,4,5])
2. myseries1
Output:
1 50
2 40
3 30
4 20
5 10
dtype: int64
1. myseries2 = pd.
Series([100, 200, 300, 400] ,index=[3,4,5,6])
2. myseries2
Output:
3 100
4 200
5 300
6 400
dtype: int64
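The output below appears to come from calling combine_first on myseries1, which keeps its own values and fills only the missing index 6 from myseries2:

myseries1.combine_first(myseries2)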
Output:
1 50.0
2 40.0
3 30.0
4 20.0
5 10.0
6 400.0
dtype: float64
myseries2.combine_first(myseries1)
Output:
1 50.0
2 40.0
3 100.0
4 200.0
5 300.0
6 400.0
dtype: float64
2. del myframe5[‘School’]
3. myframe5
Output:
Output:
item_frame.duplicated()
Output:
0 False
1 False
2 False
3 False
4 True
dtype: bool
Output:
Output:
student_frame.describe()
Output:
We subtract this value, 4.5, from the Q1 to find the lower limit,
and add 4.5 to the Q3 to find the upper limit. Thus,
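A sketch of this quartile rule, using a hypothetical pandas Series chosen so that 1.5 × IQR comes out to the 4.5 mentioned above:

import pandas as pd

# A hypothetical series; the original values are not shown in this excerpt
s = pd.Series([10, 12, 13, 14, 15, 16, 40])
Q1 = s.quantile(0.25)
Q3 = s.quantile(0.75)
IQR = Q3 - Q1
lower_limit = Q1 - 1.5 * IQR    # values below this limit are flagged as outliers
upper_limit = Q3 + 1.5 * IQR    # values above this limit are flagged as outliers
print(s[(s < lower_limit) | (s > upper_limit)])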
0 10.0
1 20.0
2 30.0
3 NaN
4 40.0
5 50.0
6 NaN
dtype: float64
Output:
3 NaN
6 NaN
dtype: float64
Note that the original colors blue and green have been replaced
by dark blue and light green, as mapped inside the dictionary
mymap. The function replace() can also be used to replace
NaN values contained inside a data structure.
1. myseries = pd.Series([1,2,np.nan,4,5,np.nan])
2. myseries.replace(np.nan,0)
Output:
0 1.0
1 2.0
2 0.0
3 4.0
4 5.0
5 0.0
dtype: float64
myframe=myframe.rename(reindex)
Output:
Note that we rename the indices, and assign the result of the
right-hand side to myframe to update it. If this assignment
operation is not performed, myframe will not be updated.
Length: 20
Categories (4, interval[int64]): [(0, 25] < (25, 50] < (50,
75] < (75, 100]]
Output:
(75, 100] 6
(25, 50] 5
(0, 25] 5
(50, 75] 4
dtype: int64
Length: 20
Categories (4, object): [Poor < Below Average < Average <
Good]
Output:
[(31.0, 46.5], (31.0, 46.5], (46.5, 83.0], (46.5, 83.0],
(14.999, 31.0], ..., (83.0, 99.0], (14.999, 31.0], (31.0,
46.5], (83.0, 99.0], (83.0, 99.0]]
Length: 20
Categories (4, interval[float64]): [(14.999, 31.0] < (31.0,
46.5] < (46.5, 83.0] < (83.0, 99.0]]
Output:
(83.0, 99.0] 5
(46.5, 83.0] 5
(31.0, 46.5] 5
(14.999, 31.0] 5
dtype: int64
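Outputs of this kind are produced by pandas' cut and qcut functions; a sketch, using a hypothetical list of 20 values since the original numbers are not shown in this excerpt:

import numpy as np
import pandas as pd

# Hypothetical values between 1 and 99
values = np.random.randint(1, 100, 20)

# Equal-width bins with explicit edges
bins = [0, 25, 50, 75, 100]
cats = pd.cut(values, bins)
print(pd.Series(cats).value_counts())

# The same bins with descriptive labels
labels = ['Poor', 'Below Average', 'Average', 'Good']
print(pd.cut(values, bins, labels=labels))

# Quantile-based bins: each bin holds roughly the same number of values
print(pd.Series(pd.qcut(values, 4)).value_counts())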
4. myframe = pd.DataFrame(data)
5. myframe
Output:
Note that the column color has two entries for both white
and red. If we want to group the data based upon the column
color, for example, we may type:
1. mygroup = myframe[‘price’].groupby(myframe[‘color’])
2. mygroup.groups
Output:
{‘blue’: Int64Index([0], dtype=’int64’),
‘red’: Int64Index([2, 3], dtype=’int64’),
‘white’: Int64Index([1, 4], dtype=’int64’)}
1. mygroup.sum()
Output:
color
blue 1.2
red 1.5
white 2.7
Name: price, dtype: float64
2. mygroup2.groups
Output:
{(‘blue’, ‘ball’): Int64Index([0], dtype=’int64’),
(‘red’, ‘paper’): Int64Index([3], dtype=’int64’),
(‘red’, ‘pencil’): Int64Index([2], dtype=’int64’),
(‘white’, ‘mug’): Int64Index([4], dtype=’int64’),
(‘white’, ‘pen’): Int64Index([1], dtype=’int64’)}
1. myframe2 = myframe
2. myframe2.loc[5]=[‘red’,’pencil’,0.8]
3. myframe2
Output:
1. mygroup2.mean()
Output:
color  object
blue   ball      1.2
red    paper     0.9
       pencil    0.7
white  mug       1.7
       pen       1.0
Name: price, dtype: float64
1. mygroup2.sum()
Output:
color  object
blue   ball      1.2
red    paper     0.9
       pencil    1.4
white  mug       1.7
       pen       1.0
Name: price, dtype: float64
1. myframe4.columns
Output:
Index([‘col0’, ‘col1’, ‘col2’, ‘col3’, ‘col4’],
dtype=’object’)
myframe4.index
Output:
Index([‘row0’, ‘row1’, ‘row2’], dtype=’object’)
1. myframe4.values
Output:
array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14]])
Output:
row0 2
row1 7
row2 12
Name: col2, dtype: int32
Output:
row0 2
row1 7
row2 12
Name: col2, dtype: int32
myframe4[1:3]
Output:
Output:
col0 5
col1 6
col2 7
col3 8
col4 9
Name: row1, dtype: int32
myframe4.isin([1,4,99])
Output:
Output:
1. del myframe4[‘col5’]
2. myframe4
Output:
Hands-on Time
It is time to check your understanding of the topic of this
chapter through the exercise questions given in Section 5.7.
The answers to these questions are given at the end of the
book.
6.1. Introduction
In the previous chapter, we have seen that data preprocessing
is an important step in the data science pipeline. Once we get
the preprocessed data, we have to choose suitable machine
learning algorithms to model the data. However, before
applying machine learning, we have to answer the following
questions:
· How to find the structure of the data?
· How to test assumptions about the data?
· How to select the features that can be used for machine
learning methods?
· How to choose suitable machine learning algorithms to
model our dataset?
Exploratory Data Analysis (EDA) is a process to get familiar
with the structure and important features of a dataset. EDA
helps us answer the aforementioned questions by providing
us with a good understanding of what the data contains. EDA
explores the preprocessed data using suitable visualization
tools to find the structure of the data, its salient features, and
important patterns.
https://www.kaggle.com/aariyan101/usa-housingcsv
Or
https://raw.githubusercontent.com/bcbarsness/machine-
learning/master/USA_Housing.csv
plt.rcParams[‘figure.figsize’] = [12,8]
Output:
6.3.4. Histogram
A histogram is a plot that indicates the frequency distribution
or shape of a numeric feature in a dataset. This allows us to
discover the underlying distribution of the data by visual
inspection. To plot a histogram, we pass a collection of numeric
values to the method hist (). For example, the following
histogram plots the distribution of values in the price column
of the USA_Housing dataset.
plt.hist(housing_price[‘Price’])
plt.show()
Output:
This plot shows that more than 1,200 houses, out of 5,000, have
a price of around $1,000,000. A few houses have prices less
than $500,000 or greater than $2,000,000. By default, the
method hist() uses 10 bins or groups to plot the distribution
of the data. We can change the number of bins by using the
option bins.
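For instance, a histogram with 30 bins could be drawn as follows (the bin count here is only an illustration):

plt.hist(housing_price['Price'], bins=30)
plt.show()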
From this plot, it can be observed that the house price follows
a Normal or a Gaussian distribution that is a bell curve. It is
important to know that many machine learning algorithms
assume Gaussian distribution of features. Thus, it is better for
a feature to follow this distribution.
1. iris_data = pd.read_csv(‘c:/Users/GNG/Desktop/iris_
dataset.csv’)
2. iris_data.info()
Output:
<class ‘pandas.core.frame.DataFrame’>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 150 non-null int64
1 SepalLengthCm 150 non-null float64
2 SepalWidthCm 150 non-null float64
3 PetalLengthCm 150 non-null float64
4 PetalWidthCm 150 non-null float64
5 Species 150 non-null object
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB
From the bar charts given above, we observe that species Iris-
virginica has the highest petal length, petal width, and sepal
length. Species Iris-setosa has the smallest petal length, petal
width, and sepal length. However, there is a deviation from the
trend; Iris-setosa shows the highest sepal width, followed by
virginica and versicolor.
housing_price.tail()
Output:
Output:
(5000, 7)
Output:
<class ‘pandas.core.frame.DataFrame’>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 7 columns):
 #   Column                        Non-Null Count  Dtype
---  ------                        --------------  -----
0 Avg. Area Income 5000 non-null float64
1 Avg. Area House Age 5000 non-null float64
2 Avg. Area Number of Rooms 5000 non-null float64
3 Avg. Area Number of Bedrooms 5000 non-null float64
4 Area Population 5000 non-null float64
5 Price 5000 non-null float64
6 Address 5000 non-null object
dtypes: float64(6), object(1)
memory usage: 273.6+ KB
Output:
Output:
Avg. Area Income 0
Avg. Area House Age 0
Avg. Area Number of Rooms 0
Avg. Area Number of Bedrooms 0
Area Population 0
Price 0
Address 0
dtype: int64
Output:
To find the correlation, we use the method corr (). To plot the
correlation matrix, we import the Seaborn library as sns.
1. import seaborn as sns
2. corrmat = housing_price.corr()
3. feature_ind = corrmat.index
4.
5. plt.figure(figsize=(12,12))
6.
7. #Plotting correlation matrix as an sns heat map
8. sns.heatmap(housing_price[feature_ind].
corr(),annot=True, cmap=”RdYlGn”)
9. plt.show()
Output:
The annot and cmap options, respectively, display the value of the correlation
coefficient within the square boxes and set the color map used to
display the figure.
The values closer to zero in the red boxes indicate that these
features are nearly independent of each other. However, larger
values, such as 0.64 between Price and Avg. Area Income,
indicate that the house prices are strongly correlated with
the average income of residents of a particular area. Features
having a small correlation with the target variable contribute
little and can be dropped.
7.1. Introduction
In the previous chapter, we have performed Exploratory Data
Analysis (EDA) to get familiar with the structure of a dataset.
Various methods were presented to discover patterns in the
data. Once we have explored the important features of the
data, we choose suitable machine learning (ML) algorithms to
model the data. Future instances of the data can make use of
ML models to predict the output.
Output:
The mean is = 5.43
The median is = 3.0
The mode is = 3
The range is = 25
The standard deviation is = 7.27
The variance is = 52.82
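The statistics shown above were presumably computed with code along the following lines; a sketch using Python's statistics module and a hypothetical sample (the actual values are not shown in this excerpt, so the printed numbers will differ):

import statistics as st

data = [3, 1, 3, 2, 5, 3, 26, 4, 3, 7]   # hypothetical sample values

print('The mean is =', round(st.mean(data), 2))
print('The median is =', st.median(data))
print('The mode is =', st.mode(data))
print('The range is =', max(data) - min(data))
print('The standard deviation is =', round(st.pstdev(data), 2))
print('The variance is =', round(st.pvariance(data), 2))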
Output:
array([[ 1. , 3.25 ],
[ 3.25 , 11.58333333]])
Output:
array([[1. , 0.95491911],
[0.95491911, 1. ]])
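The covariance and correlation matrices above are the kind of output produced by NumPy's cov and corrcoef functions; a sketch with two hypothetical variables (the actual data is not shown in this excerpt):

import numpy as np

# Two hypothetical paired variables
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 7, 9, 12])

print(np.cov(x, y))        # 2 x 2 covariance matrix
print(np.corrcoef(x, y))   # 2 x 2 correlation coefficient matrix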
§§ Bernoulli Distribution
A Bernoulli distribution has only two possible values, outputs,
or outcomes, namely 0 for failure and 1 for success. This
distribution assumes only one trial of the experiment that
generates 0 or 1. Thus, the variable, also known as a random
variable, that follows a Bernoulli distribution can take on the
value 1 (success), or the value 0 (failure).
§§ Uniform Distribution
A uniform distribution is for the continuous-valued data. It
has a single value, 1/(b-a), which occurs in a certain range
[a,b], whereas everything is zero outside that range. We can
think of it as an indication of a categorical variable with two
categories: 0 or the value. The categorical variable may have
multiple values in a continuous range between some numbers
a and b.
§§ Gaussian Distribution
A Normal or Gaussian Distribution is defined by its mean and
standard deviation. The data values are spread around the
mean value, and the standard deviation controls the spread. A
Gaussian distribution has most data values around the mean
or center value. A smaller value of standard deviation indicates
that data is highly concentrated and vice versa.
§§ Poisson Distribution
A Poisson Distribution is similar to the Normal distribution
but with some skewness. A Poisson distribution has relatively
uniform spread in all directions, just like the Normal distribution;
however, the spread becomes non-uniform for increasing
values of skewness.
7.4.3. Cross-Validation
Generally, we split our dataset into training and test datasets.
Sometimes, we keep aside the test set, and choose some
percentage of the training set to train the model, and use the
remaining part of the training set to validate the model. This
reserved portion of the training set is called the validation set.
The terms loss function and cost function both refer to the
difference, or error, between the model's predictions and the actual labels. These
terms are used interchangeably; however, a loss function is
defined for a single data point, whereas a cost function is the average
of the loss over all data points. Thus, the MSE is a cost function, and the
squared error for a single point can be considered a loss function.
Figure 7.7: The pair of data points (X,Y) are shown as dots in
colors. The model is assumed to be a line that approximates the
data points.
A linear regression model finds the line that passes through the data points such that the error between
the line and the data points is minimum.
Even when we have multiple input features x1, x2, x3, …, and all
of them have a linear relationship with the target variable y, a
linear model can still be employed.
An overfit model also captures the noise present in the data. Due to this, both underfit and
overfit models fail to generalize well on the test data.
Output:
[0 1]
[-1.04608067]
[[0.51491375]]
[0 0 0 1 1 1 1 1 1 1]
0.9
accuracy 0.90 10
macro avg 0.93 0.88 0.89 10
weighted avg 0.91 0.90 0.90 10
Output:
Each of the nearest neighboring points votes for its class. The class with the most votes in the
neighborhood is taken as the predicted class for the test point.
To find points closest or similar to the test point, we find the
distance between points. The steps to classify a test point by
KNN are as follows:
· Calculate distance
· Find closest neighbors
· Vote for labels
To implement a KNN classifier in Python, we first import the
libraries and packages.
1. from sklearn.datasets import load_digits
2. from sklearn.model_selection import train_test_split
3. from sklearn.neighbors import KNeighborsClassifier
4. from sklearn import metrics
We load the Digits dataset and split it into test and training
sets using the following script.
1. digits = load_digits()
2. # Train the model using the training sets
3. x_train, x_test, y_train, y_test = train_test_
split(digits.data, digits.target, test_size=0.25)
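The creation and fitting of the classifier itself is not shown in this excerpt; a minimal sketch, assuming a KNeighborsClassifier with five neighbors:

# Create the KNN classifier and train it on the training split
model = KNeighborsClassifier(n_neighbors=5)
model.fit(x_train, y_train)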
To predict the output values from the test input x_test, and
to evaluate the performance of the trained model, we use the
following Python commands.
1. #Predict Output
2. cm = metrics.confusion_matrix(y_test, model.predict(x_
test))
3. score = model.score(x_test, y_test)
Output:
The predicted and actual labels are shown on the x and y-axis
of the confusion matrix, respectively. The diagonal entries on
the confusion matrix represent correct classification results.
It can be observed that most digits are correctly classified
by the model. However, occasional misclassified results are
shown on the off-diagonal entries of the matrix. The output of
the model shows an accuracy of 98.67 percent.
temp=[‘Hot’,’Hot’,’Hot’,’Mild’,’Cool’,’Cool’,’Cool’,’Mild’,
’Cool’,’Mild’,’Mild’,’Mild’,’Hot’,’Mild’]
play=[‘No’,’No’,’Yes’,’Yes’,’Yes’,’No’,’Yes’,’No’,’Yes’,’Yes’,
’Yes’,’Yes’,’Yes’,’No’]
Output:
Weather: [2 2 0 1 1 1 0 2 2 1 2 0 0 1]
Temp: [1 1 1 2 0 0 0 2 0 2 2 2 1 2]
Play: [0 0 1 1 1 0 1 0 1 1 1 1 1 0]
Combined feature: [[2 1]
[2 1]
[0 1]
[1 2]
[1 0]
[1 0]
[0 0]
[2 2]
[2 0]
[1 2]
[2 2]
[0 2]
[0 1]
[1 2]]
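The encoded arrays above and the prediction below were presumably produced with scikit-learn's LabelEncoder and a naive Bayes classifier; a sketch under those assumptions (the weather list and the test point [0, 2] are inferred from the printed output and are not shown in this excerpt):

from sklearn import preprocessing
from sklearn.naive_bayes import GaussianNB

# Hypothetical weather list (encoded above as 0 = Overcast, 1 = Rainy, 2 = Sunny)
weather = ['Sunny','Sunny','Overcast','Rainy','Rainy','Rainy','Overcast',
           'Sunny','Sunny','Rainy','Sunny','Overcast','Overcast','Rainy']

le = preprocessing.LabelEncoder()
weather_encoded = le.fit_transform(weather)
temp_encoded = le.fit_transform(temp)
label = le.fit_transform(play)

# Combine the two encoded features into (weather, temp) pairs
features = list(zip(weather_encoded, temp_encoded))

model = GaussianNB()
model.fit(features, label)

# Predict for an assumed unseen combination: 0 = Overcast weather, 2 = Mild temperature
predicted = model.predict([[0, 2]])
print('Predicted Value:', predicted)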
Output:
Predicted Value: [1]
Further Reading
Further reading related to the nearest neighbor and K-means
can be found at
https://www.datacamp.com/community/tutorials/naive-
bayes-scikit-learn
https://www.datacamp.com/community/tutorials/k-
nearest-neighbor-classification-scikit-learn
Output:
Classes to predict: [‘setosa’ ‘versicolor’ ‘virginica’]
Output:
Number of examples in the data: 150
We split our dataset into test and training sets using train_
test_split.
X_train, X_test, y_train, y_test = train_test_
split(X, y, random_state = 47, test_size = 0.25)
In unsupervised learning techniques, the input data is not labeled, i.e., the input variables
or features are provided with no corresponding output labels.
The algorithms based on the unsupervised learning aim to
discover patterns and structures within the data.
7.7.1. Clustering
The aim of clustering is to divide or group the given data into
several categories based on their similarities. Let us assume
we have only two features in a given dataset. The labels of the
data are not given to us. We plot this data, and it looks like the
plot given in figure 7.15. The left side of figure 7.15 shows the
data without labels, and the right side presents the clustered
data based on the similarities between the data points.
§§ K-Means Clustering
K-means clustering is the most popular clustering algorithm.
It is an iterative algorithm that aims to find the best cluster
assignment for each data point.
The graph given above shows ‘the elbow method’ that gives
us the optimum number of clusters where the elbow occurs on
the plot. This is the point after which WCSS does not decrease
significantly with an increasing number of clusters. It is evident
from the graph that the optimum number of clusters is three,
which confirms the actual number of species/classes in the
Iris dataset. We can apply the K-means clustering algorithm to
the Iris dataset after getting the optimum number of clusters
from the elbow method.
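The elbow graph discussed above is typically produced by computing the within-cluster sum of squares (WCSS) for a range of cluster counts; a sketch of such code, assuming the iris_data DataFrame and the four numeric feature columns used earlier:

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# The four numeric Iris feature columns (names as used earlier)
X = iris_data[['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']].values

wcss = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)    # within-cluster sum of squares

plt.plot(range(1, 11), wcss, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.title('The elbow method')
plt.show()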
Output:
8. versicolor_examples= iris_data[versicolor_selection]
9. virginica_examples= iris_data[virginica_selection]
10.
11. # Plotting the examples of all 3 classes on a single plot
12. plt.scatter(setosa_examples[‘PetalLengthCm’],setosa_
examples[‘PetalWidthCm’],c=’red’, label=’Iris-setosa’)
13. plt.scatter(versicolor_examples[‘PetalLengthCm’],
versicolor_examples[‘PetalWidthCm’], c=’green’,
label=’Iris-versicolor’)
14. plt.scatter(virginica_examples[‘PetalLengthCm’],
virginica_examples[‘PetalWidthCm’], c=’blue’, label=’Iris-
virginica’)
15.
16. # Giving title and labels to the plot
17. plt.xlabel(‘Petal length (cm)’)
18. plt.ylabel(‘Petal width (cm)’)
19. plt.title(‘Iris dataset: Petal length vs Petal width’)
20. plt.legend(loc=’lower right’)
21. plt.show()
Output:
§§ Holdout
In the holdout method, the dataset is randomly divided into
three subsets.
· The training set is used to prepare models.
· The validation set is used to assess the performance of the models built during training.
§§ Cross-Validation
As discussed earlier, often, we choose a percentage of the
training set to train the model and use the remaining part of
the training set to validate the model. If the model is iteratively
trained and validated on different validation sets generated
randomly from the training set, the process is commonly
referred to as cross-validation (CV).
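The fold indices shown below are the kind of output produced by scikit-learn's KFold; a sketch that yields splits of this form (the data values and random seed are assumptions):

import numpy as np
from sklearn.model_selection import KFold

data = np.array([1, 2, 3, 4, 5, 6])
kfold = KFold(n_splits=3, shuffle=True, random_state=1)
for train_idx, test_idx in kfold.split(data):
    print('train: %s, test: %s' % (data[train_idx], data[test_idx]))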
Output:
train: [1 4 5 6], test: [2 3]
train: [2 3 4 6], test: [1 5]
train: [1 2 3 5], test: [4 6]
· Classification Accuracy,
· Confusion matrix,
· Precision recall curve, and
· Receiver operating characteristic (ROC) curve.
§§ Classification Accuracy
Accuracy is the commonly accepted evaluation metric for
classification problems. It is defined as the number of correct
predictions made as a ratio of all predictions made.
§§ Confusion Matrix
A confusion matrix offers a detailed breakdown of correct and
incorrect classifications for each class. A sample confusion
matrix is shown in figure 7.17.
§§ Regression Metrics
The two most common metrics to assess the performance of
regression models are the mean absolute error (MAE) and the
mean squared error (MSE). The MAE is the sum of the absolute
differences between the predicted and actual values divided by
the number of predictions, whereas the MSE averages the squared
differences instead.
The root mean squared error (RMSE) is the square root of the
MSE. We go into the details of these metrics in Chapter 8 of
the book.
Hands-on Time
It is time to check your understanding of this chapter through
the exercise questions given in Section 7.9. The answers to
these questions are given at the end of the book.
8.1. Introduction
All the hard work behind data preprocessing, analysis, and
modeling is of little value unless we interpret the results and
explain the important findings to the stakeholders. At the end
of a data science project, we should interpret the results using
suitable mathematical and visualization tools, and explain our
results to technical and non-technical stakeholders.
= (100+50)/(100+50+5+10)
= 150/165 = 90.9%.
The error rate is the fraction of misclassified examples:
= (5 + 10)/165
= 15/165
= 9.1%
20.
21. # Calculating confusion matrix and accuracy
22. cm = metrics.confusion_matrix(y, model.predict(x))
23. score = model.score(x, y)
24.
25. # Plotting the results
26. sns.heatmap(cm, annot=True, fmt=”.3f”, linewidths=.5,
square = True, cmap = ‘YlGnBu’);
27.
28. plt.ylabel(‘Actual label’);
29. plt.xlabel(‘Predicted label’);
30. all_sample_title = ‘Accuracy Score: {0}’.format(score)
31. plt.title(all_sample_title, size = 15);
Output:
Note that the confusion matrix shows the accuracy score of 0.9
at its top in this output. The confusion matrix also shows how many
examples of each class were classified correctly and how many were
misclassified.
= TP /P
= TP / (TP + FN)
= 100/105
= FP / N
= FP / (FP+TN)
Besides TPR and FPR, we have the true negative rate (TNR),
also called the specificity or selectivity of a classifier model.
TNR measures the proportion of actual negatives that are
correctly identified as such (e.g., the percentage of healthy
people who are correctly identified as not having the disease).
TNR = TN / N
= TN / (FP+TN)
= 1 – FPR
= 50/60
The false negative rate (FNR), also called the miss rate,
measures the chance or probability that a true positive is
missed. It is given as:
FNR = FN/P
= FN/(FN+TP)
= 5/(5+100)
= 0.048
1. model = LogisticRegression(solver=’lbfgs’)
2. model.fit(trainx, trainy)
3.
4. # generate a no skill prediction (always predicts the
majority class)
5. ns_probs = [0 for i in range(len(testy))]
6.
7. # predict probabilities for logistic regression
8. lr_probs = model.predict_proba(testx)
9.
10. # keep probabilities for the positive outcome only
11. lr_probs = lr_probs[:, 1]
12.
13. # calculate scores
14. ns_auc = roc_auc_score(testy, ns_probs)
15. lr_auc = roc_auc_score(testy, lr_probs)
16.
17. # summarize scores
18. print(‘No Skill: ROC AUC=%.3f’ % (ns_auc))
19. print(‘Logistic: ROC AUC=%.3f’ % (lr_auc))
Output:
No Skill: ROC AUC=0.500
Logistic: ROC AUC=0.903
P = TP / (TP + FP) = 100/110
R = TPR = TP / (TP + FN)
F1 = 2(P × R) / (P + R)
The average precision summarizes the PR curve as
AP = Σ_n (R_n − R_(n−1)) · P_n,
where P_n and R_n are the precision and recall at the nth threshold,
and (R_n − R_(n−1)) is the increase in the recall. To generate a PR
curve in Python, we first load the libraries and packages, and train
our model.
1. import matplotlib.pyplot as plt
2. from sklearn.metrics import precision_recall_curve
3. from sklearn.metrics import plot_precision_recall_curve
4. from sklearn.metrics import average_precision_score
5. from sklearn.model_selection import train_test_split
6. from sklearn.linear_model import LogisticRegression
7.
8. # Here, we use the same LogisticRegression model that was
used to generate a ROC curve in the previous program.
9.
10. model = LogisticRegression(solver=’lbfgs’)
11. model.fit(trainx, trainy)
1. scorey = model.decision_function(testx)
2. average_precision = average_precision_score(testy, scorey)
3.
4. disp = plot_precision_recall_curve(model, testx, testy)
5. disp.ax_.set_title(‘2-class Precision-Recall curve: ‘
6. ‘AP={0:0.2f}’.format(average_precision))
Output:
Text (0.5, 1.0, ‘2-class Precision-Recall curve: AP=0.90’)
Hands-on Time
It is time to check your understanding of the topic of this
chapter through the exercise questions given in Section 8.6.
The answers to these questions are given at the end of the
book.
9.1. Regression
This project forecasts temperature using a numerical prediction
model with an advanced technique known as bias correction.
The dataset used for this project is publicly available at the
UCI Machine Learning Repository:
https://archive.ics.uci.edu/ml/datasets/Bias+correction+of+
numerical+prediction+model+temperature+forecast#
We import the required packages, and read the csv file of the
project as follows:
1. import pandas as pd
2. import numpy as np
3. df = pd.read_csv(r’I:/Data science books/temperature.csv’)
4. df.drop(‘Date’,axis=1,inplace=True)
5. df.head()
Output:
Output:
0
0
Output:
(7752, 23)
Output:
display the MAE and the MSE of the result to assess the
performance of the method.
1. from sklearn.model_selection import train_test_split
2. Xtrain,Xtest,ytrain,ytest = train_test_split(X,y,test_
size=0.2)
3. from sklearn.linear_model import LinearRegression
4. m = LinearRegression()
5. m.fit(Xtrain,ytrain)
6. y_pred = m.predict(Xtest)
7. print(‘Absolute Error: %0.3f’%float(np.abs(ytest-y_pred).
sum()/ len(y_pred)))
Output:
Absolute Error: 1.181
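The MSE mentioned above is not printed in this excerpt; it could be computed alongside the MAE with scikit-learn's metrics, for example:

from sklearn.metrics import mean_absolute_error, mean_squared_error

print('MAE: %0.3f' % mean_absolute_error(ytest, y_pred))
print('MSE: %0.3f' % mean_squared_error(ytest, y_pred))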
9.2. Classification
This project aims to detect and recognize different English
language accents. We use the Speaker Accent Recognition
dataset from the UCI Machine Learning Repository.
https://archive.ics.uci.edu/ml/datasets/
Speaker+Accent+Recognition#
Output:
Output:
<bound method NDFrame.describe of language X1 X2
X3 X4 X5 X6 \
0 ES 7.071476 -6.512900 7.650800 11.150783 -7.657312 12.484021
1 ES 10.982967 -5.157445 3.952060 11.529381 -7.638047 12.136098
2 ES 7.827108 -5.477472 7.816257 9.187592 -7.172511 11.715299
3 ES 6.744083 -5.688920 6.546789 9.000183 -6.924963 11.710766
4 ES 5.836843 -5.326557 7.472265 8.847440 -6.773244 12.677218
.. ... ... ... ... ... ... ...
324 US -0.525273 -3.868338 3.548304 1.496249 3.490753 5.849887
325 US -2.094001 -1.073113 1.217397 -0.550790 2.666547 7.449942
326 US 2.116909 -4.441482 5.350392 3.675396 2.715876 3.682670
327 US 0.299616 0.324844 3.299919 2.044040 3.634828 6.693840
328 US 3.214254 -3.135152 1.122691 4.712444 5.926518 6.915566
Output:
Output:
Output:
Text(89.18, 0.5, ‘predicted label’)
http://vis-www.cs.umass.edu/lfw/
Output:
Downloading LFW metadata: https://ndownloader.figshare.com/
files/5976012
Downloading LFW metadata: https://ndownloader.figshare.com/
files/5976009
Downloading LFW metadata: https://ndownloader.figshare.com/
files/5976006
Downloading LFW data (~200MB): https://ndownloader.figshare.
com/files/5976015
Output:
(1560, 2914)
Output:
(62, 47)
faces.target_names
Output:
array([‘Ariel Sharon’, ‘Colin Powell’, ‘Donald Rumsfeld’,
‘George W Bush’,’Gerhard Schroeder’, ‘Hugo Chavez’,
‘Jacques Chirac’, ‘Jean Chretien’, ‘John Ashcroft’,
‘Junichiro Koizumi’,’Serena Williams’, ‘Tony Blair’],
dtype=’<U17’)
faces.target_names.size
Output:
12
np.unique(faces.target)
Output:
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
dtype=int64)
Output:
‘Gerhard Schroeder’
plt.imshow(faces.images[0])
Output:
<matplotlib.image.AxesImage at 0x2b9c8a12588>
Output:
Output:
GridSearchCV(cv=’warn’, error_score=’raise-deprecating’,
estimator=Pipeline(memory=None,
steps=[(‘pca’,
PCA(copy=True, iterated_power=’auto’,
n_components=150, random_state=None,
svd_solver=’auto’, tol=0.0,
whiten=True)),
(‘svc’,
SVC(C=1.0, cache_size=200,
class_weight=’balanced’, coef0=0.0,
decision_function_shape=’ovr’,
degree=3, gamma=’auto_deprecated’,
kernel=’rbf’, max_iter=-1,
probability=False,
random_state=None, shrinking=True,
tol=0.001, verbose=False))],
verbose=False),
iid=’warn’, n_jobs=None,
param_grid={‘svc__C’: [1, 5, 15, 30],
‘svc__gamma’: [1e-05, 5e-05, 0.0001, 0.005]},
pre_dispatch=’2*n_jobs’, refit=True,
return_train_score=False,
scoring=None, verbose=0)
print(grid.best_params_)
Output:
{‘svc__C’: 1, ‘svc__gamma’: 0.005}
Output:
Output:
Text(89.18, 0.5, ‘predicted label’)
The true and predicted labels are shown on the x and y-axis of
the confusion matrix, respectively. The diagonal entries on the
confusion matrix represent correct classification results. It can
be observed that most faces are correctly classified by the model,
which indicates that it has learned the relationship between
the input and the output. The trained model at this stage is used
to make predictions on the unseen test data.
https://www.datascienceweekly.org/.
https://www.kdnuggets.com/
https://www.datasciencecentral.com/
https://fivethirtyeight.com/
https://flowingdata.com/
https://archive.ics.uci.edu/ml/index.php
10.3. Challenges
If you want to access tons of public datasets, learn from the
examples by top data scientists, and even start competing,
Kaggle is an obvious place for you.
https://www.kaggle.com/competitions
https://www.topcoder.com/community/data-science/
https://www.drivendata.org/
https://mlcontests.com/
Conclusions
Chapter 1
Question 1: C
Question 2: D
Question 3: D
Question 4: C
Chapter 2
Question 1: B
Question 2: B
Question 3: C
Question 4: C
Question 5: D
Chapter 3
Question 1: A
Question 2: C
Question 3: B
Question 4: A
Question 5: B
Question 6: C
Question 7: B
Question 8: C
Question 9: C
Question 10: C
Chapter 4
Question 1: B
Question 2: A
Question 3: D
Question 4: B
Question 5: C
Chapter 5
Question 1: A
Question 2: B
Question 3: C
Question 4: B
Question 5: A
Chapter 6
Question 1: B
Question 2: C
Question 3: B
Question 4: C
Question 5: D
Chapter 7
Question 1: B
Question 2: D
Question 3: B
Question 4: A
Question 5: D
Question 6: B
Question 7: B
Question 8: B
Question 9: B
Question 10: C
Chapter 8
Question 1: C
Question 2: D
Question 3: C
Question 4: A