Unit 1
Machine Learning is an application of artificial intelligence where a computer/machine learns from the past
experiences (input data) and makes future predictions.
Machine learning is a branch of artificial intelligence (AI) and computer science which focuses on the use of data
and algorithms to imitate the way that humans learn, gradually improving its accuracy.
Machine learning enables a machine to automatically learn from data, improve performance from
experiences, and predict things without being explicitly programmed.
Suppose we have a complex problem that requires predictions. Instead of writing code for it, we can feed the data to generic algorithms, and with the help of these algorithms the machine builds the logic from the data and predicts the output. Machine learning has changed the way we think about such problems. The block diagram below explains the working of a machine learning algorithm:
We can train machine learning algorithms by providing them with large amounts of data, letting them explore the data, construct models, and predict the required output automatically. The performance of a machine learning algorithm depends on the amount of data, and it can be measured by the cost function. With the help of machine learning, we can save both time and money.
The importance of machine learning can be easily understood from its use cases. Currently, machine learning is used in self-driving cars, cyber fraud detection, face recognition, friend suggestions on Facebook, and so on. Various top companies such as Netflix and Amazon have built machine learning models that use vast amounts of data to analyze user interests and recommend products accordingly.
Machine learning can be classified into three main types:
1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning
1) Supervised Learning
Supervised learning is a type of machine learning method in which we provide sample labeled data to
the machine learning system in order to train it, and on that basis, it predicts the output.
The system creates a model using labeled data to understand the datasets and learn about each one. Once the training and processing are done, we test the model by providing sample data to check whether it predicts the correct output.
The goal of supervised learning is to map input data to output data. Supervised learning is based on supervision, much as a student learns under the supervision of a teacher. A common example of supervised learning is spam filtering.
Supervised learning can be further divided into two categories of algorithms:
• Classification
• Regression
2) Unsupervised Learning
Unsupervised learning is a learning method in which a machine learns without any supervision.
The training is provided to the machine with the set of data that has not been labeled, classified, or
categorized, and the algorithm needs to act on that data without any supervision. The goal of
unsupervised learning is to restructure the input data into new features or a group of objects with similar
patterns.
In unsupervised learning, we don't have a predetermined result. The machine tries to find useful insights from the huge amount of data. It can be further classified into two categories of algorithms:
• Clustering
• Association
3) Reinforcement Learning
Reinforcement learning is a feedback-based learning method, in which a learning agent gets a reward for each right action and a penalty for each wrong action. The agent learns automatically from this feedback and improves its performance. In reinforcement learning, the agent interacts with the environment and explores it. The goal of the agent is to collect the most reward points, and in doing so it improves its performance.
A robotic dog that automatically learns the movement of its limbs is an example of reinforcement learning.
Note: We will learn about the above types of machine learning in detail in later chapters.
When creating a machine learning project, we do not always come across clean and formatted data. Before doing any operation with data, it is necessary to clean it and put it in a formatted way, and for this we use the data preprocessing task.
Real-world data generally contains noise and missing values, and may be in an unusable format which cannot be directly used for machine learning models. Data preprocessing is the required task of cleaning the data and making it suitable for a machine learning model, which also increases the accuracy and efficiency of the model. It involves the below steps:
• Getting the dataset
• Importing libraries
• Importing datasets
• Finding Missing Data
• Encoding Categorical Data
• Splitting dataset into training and test set
• Feature scaling
1) Get the Dataset
To create a machine learning model, the first thing we require is a dataset, as a machine learning model works entirely on data. The collected data for a particular problem in a proper format is known as the dataset.
Datasets come in different formats for different purposes; for example, the dataset needed for a business problem will be different from the dataset required for a liver-patient problem, so each dataset is different from another. To use a dataset in our code, we usually put it into a CSV file. However, sometimes we may also need to use an HTML or xlsx file.
CSV stands for "Comma-Separated Values"; it is a file format which allows us to save tabular data, such as spreadsheets. It is useful for huge datasets and allows these datasets to be used in programs.
Here we will use a demo dataset for data preprocessing; for practice, it can be downloaded from https://www.superdatascience.com/pages/machine-learning. For real-world problems, we can download datasets online from various sources such as https://www.kaggle.com/uciml/datasets and https://archive.ics.uci.edu/ml/index.php.
We can also create our own dataset by gathering data using various APIs with Python and putting that data into a CSV file.
2) Importing Libraries
In order to perform data preprocessing using Python, we need to import some predefined Python
libraries. These libraries are used to perform some specific jobs. There are three specific libraries that we
will use for data preprocessing, which are:
Numpy: The Numpy library is used for including any type of mathematical operation in the code. It is the fundamental package for scientific computing in Python. It also supports large, multidimensional arrays and matrices. In Python, we can import it as:
import numpy as nm
Here we have used nm, which is a short name for Numpy, and it will be used in the whole program.
Matplotlib: The second library is matplotlib, which is a Python 2D plotting library; with this library, we need to import the pyplot sub-library. This library is used to plot any type of chart in Python. It will be imported as below:
import matplotlib.pyplot as mtp
Pandas: The last library is the Pandas library, which is one of the most famous Python libraries and
used for importing and managing the datasets. It is an open-source data manipulation and analysis
library. It will be imported as below:
import pandas as pd
Here, we have used pd as a short name for this library. Consider the below image:
3) Importing the Datasets
Now we need to import the dataset which we have collected for our machine learning project. But before importing a dataset, we need to set the current directory as the working directory.
Note: We can set any directory as a working directory, but it must contain the required dataset.
Here, in the below image, we can see the Python file along with the required dataset. Now the current folder is set as the working directory.
read_csv() function:
Now, to import the dataset, we will use the read_csv() function of the pandas library, which reads a csv file and lets us perform various operations on it. Using this function, we can read a csv file locally as well as through a URL.
data_set= pd.read_csv('Dataset.csv')
Here, data_set is the name of the variable that stores our dataset, and inside the function we have passed the name of our dataset file. Once we execute the above line of code, it will successfully import the dataset into
our code. We can also check the imported dataset by clicking on the section variable explorer, and then
double click on data_set. Consider the below image:
As in the above image, indexing is started from 0, which is the default indexing in Python. We can also
change the format of our dataset by clicking on the format option.
In machine learning, it is important to distinguish the matrix of features (independent variables) from the dependent variable in the dataset. In our dataset, there are three independent variables, Country, Age, and Salary, and one dependent variable, Purchased. To extract the independent variables, we will use the iloc[] method of pandas:
x= data_set.iloc[:,:-1].values
In the above code, the first colon(:) is used to take all the rows, and the second colon(:) is for all the
columns. Here we have used :-1, because we don't want to take the last column as it contains the
dependent variable. So by doing this, we will get the matrix of features.
As we can see in the above output, there are only three variables.
y= data_set.iloc[:,3].values
Here we have taken all the rows with the last column only. It will give the array of dependent variables.
Output:
array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
dtype=object)
Note: If you are using Python language for machine learning, then extraction is mandatory, but
for R language it is not required.
4) Handling Missing Data
The next step of data preprocessing is to handle missing data in the dataset. There are mainly two ways to handle missing data, which are:
By deleting the particular row: The first way is commonly used to deal with null values. Here we simply delete the specific row or column which contains null values. This way is not very efficient, because removing data may lead to a loss of information and less accurate output.
By calculating the mean: In this way, we calculate the mean of the column or row which contains the missing value and put it in place of the missing value. This strategy is useful for features with numeric data such as age, salary, year, etc. Here, we will use this approach.
To handle missing values, we will use Scikit-learn library in our code, which contains various libraries
for building machine learning models. Here we will use Imputer class of sklearn.preprocessing
library. Below is the code for it:
#handling missing data (Replacing missing data with the mean value)
from sklearn.preprocessing import Imputer
imputer= Imputer(missing_values ='NaN', strategy='mean', axis = 0)
#Fitting imputer object to the independent variables x.
imputer= imputer.fit(x[:, 1:3])
#Replacing missing data with the calculated mean value
x[:, 1:3]= imputer.transform(x[:, 1:3])
Output:
As we can see in the above output, the missing values have been replaced with the means of the rest of the column values.
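Note that the Imputer class shown above exists only in older scikit-learn releases; it was removed in version 0.22. If you are on a newer version, a minimal equivalent sketch uses SimpleImputer (all other names as in the example above):
# Equivalent imputation on newer scikit-learn versions
# (assumes x is the feature matrix built earlier, with numeric values in columns 1 and 2)
import numpy as np
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
x[:, 1:3] = imputer.fit_transform(x[:, 1:3])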
5) Encoding Categorical Data
Categorical data is data which has some categories, such as the Country and Purchased variables in our dataset. Since a machine learning model works entirely on numbers, a categorical variable may create problems while building the model, so it needs to be encoded into numbers.
Firstly, we will encode the Country variable. To do this, we will use the LabelEncoder() class from the preprocessing library.
#Categorical data
#for Country Variable
from sklearn.preprocessing import LabelEncoder
label_encoder_x= LabelEncoder()
x[:, 0]= label_encoder_x.fit_transform(x[:, 0])
Output:
Out[15]:
array([[2, 38.0, 68000.0],
[0, 43.0, 45000.0],
[1, 30.0, 54000.0],
[0, 48.0, 65000.0],
[1, 40.0, 65222.22222222222],
[2, 35.0, 58000.0],
[1, 41.111111111111114, 53000.0],
[0, 49.0, 79000.0],
[2, 50.0, 88000.0],
[0, 37.0, 77000.0]], dtype=object)
Explanation:
In the above code, we have imported the LabelEncoder class of the sklearn library. This class has successfully encoded the categories into digits.
But in our case, there are three country categories, and as we can see in the above output, they are encoded into 0, 1, and 2. From these values, the machine learning model may assume that there is some ordering or correlation between these categories, which would produce wrong output. To remove this issue, we will use dummy encoding.
Dummy Variables:
Dummy variables are variables that take only the values 0 or 1. A value of 1 indicates the presence of that category for a particular row, and the rest of the dummy variables are 0. With dummy encoding, we will have a number of columns equal to the number of categories.
In our dataset, we have 3 categories, so it will produce three columns containing 0 and 1 values. For dummy encoding, we will use the OneHotEncoder class of the preprocessing library.
#for Country Variable
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
label_encoder_x= LabelEncoder()
x[:, 0]= label_encoder_x.fit_transform(x[:, 0])
#Encoding for dummy variables
onehot_encoder= OneHotEncoder(categorical_features= [0])
x= onehot_encoder.fit_transform(x).toarray()
Output:
array([[0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.80000000e+01,
6.80000000e+04],
[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.30000000e+01,
4.50000000e+04],
[0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 3.00000000e+01,
5.40000000e+04],
[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.80000000e+01,
6.50000000e+04],
[0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 4.00000000e+01,
6.52222222e+04],
[0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.50000000e+01,
5.80000000e+04],
[0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 4.11111111e+01,
5.30000000e+04],
[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.90000000e+01,
7.90000000e+04],
[0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 5.00000000e+01,
8.80000000e+04],
[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 3.70000000e+01,
7.70000000e+04]])
As we can see in the above output, all the variables are encoded into numbers 0 and 1 and divided into
three columns.
It can be seen more clearly in the variables explorer section, by clicking on x option as:
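The categorical_features argument used above was removed from OneHotEncoder in newer scikit-learn releases. A minimal sketch of the same dummy encoding on a current version, assuming x still holds the (label-encoded) country values in column 0, uses ColumnTransformer:
# Dummy encoding on newer scikit-learn versions
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Encode column 0 (Country) into dummy columns, keep the remaining columns as-is
ct = ColumnTransformer([('country', OneHotEncoder(), [0])], remainder='passthrough')
x = ct.fit_transform(x)
# If the result comes back as a sparse matrix (it can, for many categories), densify it:
# x = x.toarray()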
For Purchased Variable:
labelencoder_y= LabelEncoder()
y= labelencoder_y.fit_transform(y)
For the second categorical variable, we only use the labelencoder_y object of the LabelEncoder class. Here we are not using the OneHotEncoder class because the Purchased variable has only two categories, yes or no, which are automatically encoded into 0 and 1.
Output:
Out[17]: array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])
6) Splitting the Dataset into the Training set and Test set
In machine learning data preprocessing, we divide our dataset into a training set and test set. This is one
of the crucial steps of data preprocessing as by doing this, we can enhance the performance of our
machine learning model.
Suppose we train our machine learning model on one dataset and then test it on a completely different dataset. This will create difficulties for the model in understanding the correlations between the variables.
If we train our model very well and its training accuracy is very high, but then we give it a new dataset, its performance will decrease. So we always try to make a machine learning model which performs well with the training set and also with the test dataset. Here, we can define these datasets as:
Training Set: A subset of the dataset used to train the machine learning model, for which we already know the output.
Test set: A subset of the dataset used to test the machine learning model; the model predicts the output for the test set.
For splitting the dataset, we will use the below lines of code:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)
Explanation:
• In the above code, the first line is used for splitting arrays of the dataset into random train and
test subsets.
• In the second line, we have used four variables for our output that are
o x_train: features for the training data
o x_test: features for testing data
o y_train: Dependent variables for training data
o y_test: Dependent variables for testing data
• In the train_test_split() function, we have passed four arguments, the first two of which are the arrays of data, while test_size specifies the size of the test set. The test_size may be .5, .3, or .2, which gives the dividing ratio of the training and testing sets.
• The last parameter random_state is used to set a seed for a random generator so that you always
get the same result, and the most used value for this is 42.
Output:
By executing the above code, we will get 4 different variables, which can be seen under the variable
explorer section.
As we can see in the above image, the x and y variables are divided into 4 different variables with
corresponding values.
7) Feature Scaling
Feature scaling is the final step of data preprocessing in machine learning. It is a technique to standardize the independent variables of the dataset to a specific range. In feature scaling, we put our variables in the same range and on the same scale so that no variable dominates the others.
There are two main techniques for feature scaling:
• Standardization
• Normalization
Here, we will use the standardization method for our dataset.
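As a quick reference, the two techniques are usually defined by the following standard formulas (where \mu is the mean, \sigma the standard deviation, and x_{min}, x_{max} the minimum and maximum of a feature; these are general definitions, not specific to this dataset):
% Standardization: center each feature on its mean and scale by its standard deviation
x' = \frac{x - \mu}{\sigma}
% Normalization (min-max scaling): rescale each feature into the range [0, 1]
x' = \frac{x - x_{min}}{x_{max} - x_{min}}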
For feature scaling, we will import the StandardScaler class of the sklearn.preprocessing library as:
from sklearn.preprocessing import StandardScaler
Now, we will create the object of StandardScaler class for independent variables or features. And then
we will fit and transform the training dataset.
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
For the test dataset, we directly apply the transform() function instead of fit_transform() because the scaler has already been fitted on the training set.
x_test= st_x.transform(x_test)
Output:
By executing the above lines of code, we will get the scaled values for x_train and x_test as:
x_train:
x_test:
As we can see in the above output, all the variables have been scaled to a comparable range (roughly between -1 and 1).
Note: Here, we have not scaled the dependent variable because it takes only the two values 0 and 1. But if a variable has a wider range of values, then we will also need to scale it.
Now, in the end, we can combine all the steps together to make our complete code more understandable.
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

#importing datasets
data_set= pd.read_csv('Dataset.csv')

#Extracting Independent Variable
x= data_set.iloc[:, :-1].values

#Extracting Dependent variable
y= data_set.iloc[:, 3].values

#handling missing data (Replacing missing data with the mean value)
from sklearn.preprocessing import Imputer
imputer= Imputer(missing_values ='NaN', strategy='mean', axis = 0)

#Fitting imputer object to the independent variables x.
imputer= imputer.fit(x[:, 1:3])

#Replacing missing data with the calculated mean value
x[:, 1:3]= imputer.transform(x[:, 1:3])

#for Country Variable
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
label_encoder_x= LabelEncoder()
x[:, 0]= label_encoder_x.fit_transform(x[:, 0])

#Encoding for dummy variables
onehot_encoder= OneHotEncoder(categorical_features= [0])
x= onehot_encoder.fit_transform(x).toarray()

#encoding for purchased variable
labelencoder_y= LabelEncoder()
y= labelencoder_y.fit_transform(y)

# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)

#Feature Scaling of datasets
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)
In the above code, we have included all the data preprocessing steps together. But there are some steps
or lines of code which are not necessary for all machine learning models. So we can exclude them from
our code to make it reusable for all models.
Pandas library
Install and import
Pandas is an easy package to install. Open up your terminal program (for Mac users) or command line (for PC users) and install it using either of the following commands:
conda install pandas
OR
pip install pandas
Alternatively, if you're currently viewing this article in a Jupyter notebook you can run this cell:
!pip install pandas
To import pandas we usually import it with a shorter name since it's used so much:
import pandas as pd
The primary two components of pandas are the Series and the DataFrame. A Series is essentially a column, and a DataFrame is a multi-dimensional table made up of a collection of Series.
DataFrames and Series are quite similar in that many operations that you can do with one you can do with the other, such as filling in null values and calculating the mean.
You'll see how these components work when we start working with data below.
Creating DataFrames right in Python is good to know and quite useful when testing new methods and
functions you find in the pandas docs.
There are many ways to create a DataFrame from scratch, but a great option is to just use a simple dict.
Let's say we have a fruit stand that sells apples and oranges. We want to have a column for each fruit
and a row for each customer purchase. To organize this as a dictionary for pandas we could do
something like:
data = {
'apples': [3, 2, 0, 1],
'oranges': [0, 3, 7, 2]
}
purchases = pd.DataFrame(data)
purchases
Out:
   apples  oranges
0       3        0
1       2        3
2       0        7
3       1        2
Each (key, value) item in data corresponds to a column in the resulting DataFrame.
The Index of this DataFrame was given to us on creation as the numbers 0-3, but we could also create our own when we initialize the DataFrame:
purchases = pd.DataFrame(data, index=['June', 'Robert', 'Lily', 'David'])
purchases
Out:
        apples  oranges
June         3        0
Robert       2        3
Lily         0        7
David        1        2
So now we can locate a customer's order by using their name:
purchases.loc['June']
Out:
apples 3
oranges 0
Name: June, dtype: int64
There's more on locating and extracting data from the DataFrame later, but now you should be able to
create a DataFrame with any random data to learn on.
Let's move on to some quick methods for creating DataFrames from various other sources.
How to read in data
It’s quite simple to load data from various file formats into a DataFrame. In the following examples
we'll keep using our apples and oranges data, but this time it's coming from various files.
With CSV files all you need is a single line to load in the data:
df = pd.read_csv('purchases.csv')
df
Out:
Unnamed: 0 apples oranges
0 June 3 0
1 Robert 2 3
2 Lily 0 7
3 David 1 2
CSVs don't have indexes like our DataFrames, so all we need to do is just designate the index_col
when reading:
df = pd.read_csv('purchases.csv', index_col=0)
df
Out:
apples oranges
June 3 0
Robert 2 3
Lily 0 7
David 1 2
If you have a JSON file — which is essentially a stored Python dict — pandas can read this just as
easily:
df = pd.read_json('purchases.json')
df
Out:
apples oranges
David 1 2
June 3 0
Lily 0 7
Robert 2 3
Notice this time our index came with us correctly since using JSON allowed indexes to work through
nesting. Feel free to open data_file.json in a notepad so you can see how it works.
If you’re working with data from a SQL database you need to first establish a connection using an
appropriate Python library, then pass a query to pandas. Here we'll use SQLite to demonstrate.
sqlite3 is used to create a connection to a database which we can then use to generate a DataFrame
through a SELECT query.
import sqlite3
con = sqlite3.connect("database.db")
SQL Tip
If you have data in PostgreSQL, MySQL, or some other SQL server, you'll need to obtain the right
Python library to make a connection. For example, psycopg2 is a commonly used library for
making connections to PostgreSQL. Furthermore, you would make a connection to a database URI
instead of a file like we did here with SQLite.
In this SQLite database we have a table called purchases, and our index is in a column called "index".
By passing a SELECT query and our con, we can read from the purchases table:
df = pd.read_sql_query("SELECT * FROM purchases", con)
df
Out:
index apples oranges
0 June 3 0
1 Robert 2 3
2 Lily 0 7
3 David 1 2
Just like with CSVs, we could pass index_col='index', but we can also set an index after-the-fact:
df = df.set_index('index')
df
Out:
        apples  oranges
index
June         3        0
Robert       2        3
Lily         0        7
David        1        2
In fact, we could use set_index() on any DataFrame using any column at any time. Indexing Series and DataFrames is a very common task, and the different ways of doing it are worth remembering.
So after extensive work on cleaning your data, you’re now ready to save it as a file of your choice.
Similar to the ways we read in data, pandas provides intuitive commands to save it:
df.to_csv('new_purchases.csv')
df.to_json('new_purchases.json')
When we save JSON and CSV files, all we have to pass to those functions is our desired filename with the appropriate file extension. With SQL, we're not creating a new file but instead inserting a new table into the database using our con variable from before:
df.to_sql('new_purchases', con)
Let's move on to importing some real-world data and detailing a few of the operations you'll be using a
lot.
We're loading this dataset from a CSV and designating the movie titles to be our index.
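The exact filename is not given in this text; assuming the IMDB movies CSV is saved locally as "IMDB-Movie-Data.csv", the loading step would look like this:
import pandas as pd

# Load the movies dataset and use the Title column as the index
movies_df = pd.read_csv("IMDB-Movie-Data.csv", index_col="Title")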
movies_df.head()
Out:
                         Rank  Genre                     Director              Year  Runtime (Minutes)  Rating  Votes   Revenue (Millions)  Metascore
Title
Guardians of the Galaxy  1     Action,Adventure,Sci-Fi   James Gunn            2014  121                8.1     757074  333.13              76.0
Prometheus               2     Adventure,Mystery,Sci-Fi  Ridley Scott          2012  124                7.0     485820  126.46              65.0
Split                    3     Horror,Thriller           M. Night Shyamalan    2016  117                7.3     157606  138.12              62.0
Sing                     4     Animation,Comedy,Family   Christophe Lourdelet  2016  108                7.2     60545   270.32              59.0
Suicide Squad            5     Action,Adventure,Fantasy  David Ayer            2016  123                6.2     393727  325.02              40.0
(The Description and Actors columns, e.g. "A group of intergalactic criminals are forced ..." and "Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...", are truncated here for readability.)
.head() outputs the first five rows of your DataFrame by default, but we could also pass a number as
well: movies_df.head(10) would output the top ten rows, for example.
To see the last five rows use .tail(). tail() also accepts a number, and in this case we're printing the bottom two rows:
movies_df.tail(2)
Out:
              Rank  Genre                  Director          Year  Runtime (Minutes)  Rating  Votes  Revenue (Millions)  Metascore
Title
Search Party  999   Adventure,Comedy       Scot Armstrong    2014  93                 5.6     4881   NaN                 22.0
Nine Lives    1000  Comedy,Family,Fantasy  Barry Sonnenfeld  2016  87                 5.3     12435  19.64               11.0
(The Description and Actors columns are truncated here for readability.)
Typically when we load in a dataset, we like to view the first five or so rows to see what's under the
hood. Here we can see the names of each column, the index, and examples of values in each row.
You'll notice that the index in our DataFrame is the Title column, which you can tell by how the word
Title is slightly lower than the rest of the columns.
.info() should be one of the very first commands you run after loading your data:
movies_df.info()
Out:
<class 'pandas.core.frame.DataFrame'>
Index: 1000 entries, Guardians of the Galaxy to Nine Lives
Data columns (total 11 columns):
Rank 1000 non-null int64
Genre 1000 non-null object
Description 1000 non-null object
Director 1000 non-null object
Actors 1000 non-null object
Year 1000 non-null int64
Runtime (Minutes) 1000 non-null int64
Rating 1000 non-null float64
Votes 1000 non-null int64
Revenue (Millions) 872 non-null float64
Metascore 936 non-null float64
dtypes: float64(3), int64(4), object(4)
memory usage: 93.8+ KB
.info() provides the essential details about your dataset, such as the number of rows and columns, the
number of non-null values, what type of data is in each column, and how much memory your
DataFrame is using.
Notice in our movies dataset we have some obvious missing values in the Revenue and Metascore
columns. We'll look at how to handle those in a bit.
Seeing the datatype quickly is actually quite useful. Imagine you just imported some JSON and the integers were recorded as strings. You go to do some arithmetic and get an "unsupported operand" exception because you can't do math with strings. Calling .info() will quickly point out that the column you thought was all integers actually contains string objects.
Another fast and useful attribute is .shape, which outputs just a tuple of (rows, columns):
movies_df.shape
Out:
(1000, 11)
Note that .shape has no parentheses and is a simple tuple of format (rows, columns). So we have 1000
rows and 11 columns in our movies DataFrame.
You'll be going to .shape a lot when cleaning and transforming data. For example, you might filter
some rows based on some criteria and then want to know quickly how many rows were removed.
Handling duplicates
This dataset does not have duplicate rows, but it is always important to verify you aren't aggregating
duplicate rows.
To demonstrate, let's simply just double up our movies DataFrame by appending it to itself:
temp_df = movies_df.append(movies_df)
temp_df.shape
Out:
(2000, 11)
Using append() will return a copy without affecting the original DataFrame. We are capturing this copy
in temp so we aren't working with the real data.
Notice that calling .shape quickly proves our DataFrame rows have doubled.
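As a side note, DataFrame.append() was deprecated and removed in pandas 2.0; on recent pandas versions the same doubling can be done with pd.concat():
import pandas as pd

# Equivalent to movies_df.append(movies_df) on modern pandas
temp_df = pd.concat([movies_df, movies_df])
temp_df.shape  # (2000, 11)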
temp_df = temp_df.drop_duplicates()
temp_df.shape
Out:
(1000, 11)
Just like append(), the drop_duplicates() method will also return a copy of your DataFrame, but this
time with duplicates removed. Calling .shape confirms we're back to the 1000 rows of our original
dataset.
It's a little verbose to keep assigning DataFrames to the same variable like in this example. For this
reason, pandas has the inplace keyword argument on many of its methods. Using inplace=True will
modify the DataFrame object in place:
temp_df.drop_duplicates(inplace=True)
Another important argument for drop_duplicates() is keep, which has three possible options:
• first: (default) Drop duplicates except for the first occurrence.
• last: Drop duplicates except for the last occurrence.
• False: Drop all duplicates.
Since we didn't define the keep argument in the previous example, it defaulted to first. keep=False, on the other hand, will drop all duplicates: if two rows are the same, then both will be dropped.
Watch what happens to temp_df:
temp_df.drop_duplicates(inplace=True, keep=False)
temp_df.shape
Out:
(0, 11)
Since all rows were duplicates, keep=False dropped them all resulting in zero rows being left over. If
you're wondering why you would want to do this, one reason is that it allows you to locate all duplicates
in your dataset. When conditional selections are shown below you'll see how to do that.
Column cleanup
Many times datasets will have verbose column names with symbols, upper and lowercase words, spaces, and typos. To make selecting data by column name easier, we can spend a little time cleaning up their names.
movies_df.columns
Out:
Index(['Rank', 'Genre', 'Description', 'Director', 'Actors', 'Year',
'Runtime (Minutes)', 'Rating', 'Votes', 'Revenue (Millions)',
'Metascore'],
dtype='object')
Not only does .columns come in handy if you want to rename columns by allowing for simple copy and
paste, it's also useful if you need to understand why you are receiving a Key Error when selecting data
by column.
We can use the .rename() method to rename certain or all columns via a dict. We don't want
parentheses, so let's rename those:
movies_df.rename(columns={
'Runtime (Minutes)': 'Runtime',
'Revenue (Millions)': 'Revenue_millions'
}, inplace=True)
movies_df.columns
Out:
Index(['Rank', 'Genre', 'Description', 'Director', 'Actors', 'Year', 'Runtime',
'Rating', 'Votes', 'Revenue_millions', 'Metascore'],
dtype='object')
Excellent. But what if we want to lowercase all names? Instead of using .rename() we could also set a list of names on the columns like so:
movies_df.columns = ['rank', 'genre', 'description', 'director', 'actors', 'year', 'runtime', 'rating', 'votes', 'revenue_millions', 'metascore']
movies_df.columns
Out:
Index(['rank', 'genre', 'description', 'director', 'actors', 'year', 'runtime',
'rating', 'votes', 'revenue_millions', 'metascore'],
dtype='object')
But that's too much work. Instead of just renaming each column manually we can do a list comprehension:
movies_df.columns = [col.lower() for col in movies_df.columns]
movies_df.columns
Out:
Index(['rank', 'genre', 'description', 'director', 'actors', 'year', 'runtime',
'rating', 'votes', 'revenue_millions', 'metascore'],
dtype='object')
list (and dict) comprehensions come in handy a lot when working with pandas and data in general.
It's a good idea to lowercase, remove special characters, and replace spaces with underscores if you'll be
working with a dataset for some time.
When exploring data, you’ll most likely encounter missing or null values, which are essentially
placeholders for non-existent values. Most commonly you'll see Python's None or NumPy's np.nan,
each of which are handled differently in some situations.
movies_df.isnull()
Out:
                         rank   genre  description  director  actors  year   runtime  rating  votes  revenue_millions  metascore
Title
Guardians of the Galaxy  False  False  False        False     False   False  False    False   False  False             False
Prometheus               False  False  False        False     False   False  False    False   False  False             False
Split                    False  False  False        False     False   False  False    False   False  False             False
Sing                     False  False  False        False     False   False  False    False   False  False             False
Suicide Squad            False  False  False        False     False   False  False    False   False  False             False
Notice isnull() returns a DataFrame where each cell is either True or False depending on that cell's
null status.
To count the number of nulls in each column we use an aggregate function for summing:
movies_df.isnull().sum()
Out:
rank 0
genre 0
description 0
director 0
actors 0
year 0
runtime 0
rating 0
votes 0
revenue_millions 128
metascore 64
dtype: int64
.isnull() just by itself isn't very useful, and is usually used in conjunction with other methods, like sum().
We can see now that our data has 128 missing values for revenue_millions and 64 missing values for
metascore.
Data Scientists and Analysts regularly face the dilemma of dropping or imputing null values, which is a decision that requires intimate knowledge of your data and its context. Overall, removing null data is only suggested if you have a small amount of missing data. The simplest option is to remove any rows that contain null values:
movies_df.dropna()
This operation will delete any row with at least a single null value, but it will return a new DataFrame
without altering the original one. You could specify inplace=True in this method as well.
So in the case of our dataset, this operation would remove 128 rows where revenue_millions is null
and 64 rows where metascore is null. This obviously seems like a waste since there's perfectly good
data in the other columns of those dropped rows. That's why we'll look at imputation next.
Other than just dropping rows, you can also drop columns with null values by setting axis=1:
movies_df.dropna(axis=1)
In our dataset, this operation would drop the revenue_millions and metascore columns.
Intuition
It's not immediately obvious where axis comes from and why you need it to be 1 for it to affect
columns. To see why, just look at the .shape output:
movies_df.shape
As we learned above, this is a tuple that represents the shape of the DataFrame, i.e. 1000 rows and 11
columns. Note that the rows are at index zero of this tuple and columns are at index one of this tuple.
This is why axis=1 affects columns. This comes from NumPy, and is a great example of why learning
NumPy is worth your time.
Imputation
Imputation is a conventional feature engineering technique used to keep valuable data that have null
values.
There may be instances where dropping every row with a null value removes too big a chunk from your
dataset, so instead we can impute that null with another value, usually the mean or the median of that
column.
Let's look at imputing the missing values in the revenue_millions column. First we'll extract that
column into its own variable:
revenue = movies_df['revenue_millions']
If you remember back to when we created DataFrames from scratch, the keys of the dict ended up as
column names. Now when we select columns of a DataFrame, we use brackets just like if we were
accessing a Python dictionary.
revenue.head()
Out:
Title
Guardians of the Galaxy 333.13
Prometheus 126.46
Split 138.12
Sing 270.32
Suicide Squad 325.02
Name: revenue_millions, dtype: float64
Slightly different formatting than a DataFrame, but we still have our Title index.
We'll impute the missing values of revenue using the mean. Here's the mean value:
revenue_mean = revenue.mean()
revenue_mean
Out:
82.95637614678897
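With the mean in hand, one way to fill the nulls in that column is fillna(), writing the result back to the DataFrame (a minimal sketch using the names defined above):
# Replace missing revenue values with the column mean
movies_df['revenue_millions'] = movies_df['revenue_millions'].fillna(revenue_mean)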
movies_df.isnull().sum()
Out:
rank 0
genre 0
description 0
director 0
actors 0
year 0
runtime 0
rating 0
votes 0
revenue_millions 0
metascore 64
dtype: int64
Imputing an entire column with the same value like this is a basic example. It would be a better idea to
try a more granular imputation by Genre or Director.
For example, you would find the mean of the revenue generated in each genre individually and impute
the nulls in each genre with that genre's mean.
Let's now look at more ways to examine and understand the dataset.
Using describe() on an entire DataFrame we can get a summary of the distribution of continuous
variables:
movies_df.describe()
Out:
       rank         year         runtime      rating       votes         revenue_millions  metascore
count  1000.000000  1000.000000  1000.000000  1000.000000  1.000000e+03  1000.000000       936.000000
mean   500.500000   2012.783000  113.172000   6.723200     1.698083e+05  82.956376         58.985043
std    288.819436   3.205962     18.810908    0.945429     1.887626e+05  96.412043         17.194757
min    1.000000     2006.000000  66.000000    1.900000     6.100000e+01  0.000000          11.000000
25%    250.750000   2010.000000  100.000000   6.200000     3.630900e+04  17.442500         47.000000
50%    500.500000   2014.000000  111.000000   6.800000     1.107990e+05  60.375000         59.500000
75%    750.250000   2016.000000  123.000000   7.400000     2.399098e+05  99.177500         72.000000
max    1000.000000  2016.000000  191.000000   9.000000     1.791916e+06  936.630000        100.000000
Understanding which numbers are continuous also comes in handy when thinking about the type of plot
to use to represent your data visually.
.describe() can also be used on a categorical variable to get the count of rows, unique count of
categories, top category, and freq of top category:
movies_df['genre'].describe()
Out:
count 1000
unique 207
top Action,Adventure,Sci-Fi
freq 50
Name: genre, dtype: object
This tells us that the genre column has 207 unique values, the top value is Action/Adventure/Sci-Fi,
which shows up 50 times (freq).
.value_counts() can tell us the frequency of all values in a column:
movies_df['genre'].value_counts().head(10)
Out:
Action,Adventure,Sci-Fi 50
Drama 48
Comedy,Drama,Romance 35
Comedy 32
Drama,Romance 31
Action,Adventure,Fantasy 27
Comedy,Drama 27
Animation,Adventure,Comedy 27
Comedy,Romance 26
Crime,Drama,Thriller 24
Name: genre, dtype: int64
By using the correlation method .corr() we can generate the relationship between each continuous variable:
movies_df.corr()
Correlation tables are a numerical representation of the bivariate relationships in the dataset.
Positive numbers indicate a positive correlation — one goes up the other goes up — and negative
numbers represent an inverse correlation — one goes up the other goes down. 1.0 indicates a perfect
correlation.
So looking in the first row, first column we see rank has a perfect correlation with itself, which is
obvious. On the other hand, the correlation between votes and revenue_millions is 0.6. A little more
interesting.
Examining bivariate relationships comes in handy when you have an outcome or dependent variable in
mind and would like to see the features most correlated to the increase or decrease of the outcome. You
can visually represent bivariate relationships with scatterplots (seen below in the plotting section).
It's important to note that, although many methods are the same, DataFrames and Series have different attributes, so you'll need to be sure to know which type you are working with or else you will receive attribute errors.
By column
You already saw how to extract a column using square brackets like this:
genre_col = movies_df['genre']
type(genre_col)
Out:
pandas.core.series.Series
This will return a Series. To extract a column as a DataFrame, you need to pass a list of column names. In our case that's just a single column:
genre_col = movies_df[['genre']]
type(genre_col)
Out:
pandas.core.frame.DataFrame
Since it's just a list, adding another column name is easy:
subset = movies_df[['genre', 'rating']]
subset.head()
Out:
genre rating
Title
Guardians of the Galaxy Action,Adventure,Sci-Fi 8.1
Prometheus Adventure,Mystery,Sci-Fi 7.0
Split Horror,Thriller 7.3
Sing Animation,Comedy,Family 7.2
Suicide Squad Action,Adventure,Fantasy 6.2
By rows
For rows, we have two options:
• .loc - locates by name
• .iloc - locates by numerical index
Remember that we are still indexed by movie Title, so to use .loc we give it the Title of a movie:
prom = movies_df.loc["Prometheus"]
prom
Out:
rank 2
genre Adventure,Mystery,Sci-Fi
description Following clues to the origin of mankind, a te...
director Ridley Scott
actors Noomi Rapace, Logan Marshall-Green, Michael Fa...
year 2012
runtime 124
rating 7
votes 485820
revenue_millions 126.46
metascore 65
Name: Prometheus, dtype: object
On the other hand, with iloc we give it the numerical index of Prometheus:
prom = movies_df.iloc[1]
loc and iloc can be thought of as similar to Python list slicing. To show this even further, let's select
multiple rows.
How would you do it with a list? In Python, just slice with brackets like example_list[1:4]. It works the same way in pandas:
movie_subset = movies_df.loc['Prometheus':'Sing']
movie_subset = movies_df.iloc[1:4]
movie_subset
Out:
            rank  genre                     director              year  runtime  rating  votes   revenue_millions  metascore
Title
Prometheus  2     Adventure,Mystery,Sci-Fi  Ridley Scott          2012  124      7.0     485820  126.46            65.0
Split       3     Horror,Thriller           M. Night Shyamalan    2016  117      7.3     157606  138.12            62.0
Sing        4     Animation,Comedy,Family   Christophe Lourdelet  2016  108      7.2     60545   270.32            59.0
(The description and actors columns are truncated here for readability.)
One important distinction between using .loc and .iloc to select multiple rows is that .loc includes the movie Sing in the result, whereas with .iloc we get rows 1:4 and the movie at index 4 (Suicide Squad) is not included.
Slicing with .iloc follows the same rules as slicing with lists, the object at the index at the end is not
included.
Conditional selections
We’ve gone over how to select columns and rows, but what if we want to make a conditional selection?
For example, what if we want to filter our movies DataFrame to show only films directed by Ridley
Scott or films with a rating greater than or equal to 8.0?
To do that, we take a column from the DataFrame and apply a Boolean condition to it. Here's an
example of a Boolean condition:
condition = (movies_df['director'] == "Ridley Scott")
condition.head()
Out:
Title
Guardians of the Galaxy False
Prometheus True
Split False
Sing False
Suicide Squad False
Name: director, dtype: bool
Similar to isnull(), this returns a Series of True and False values: True for films directed by Ridley
Scott and False for ones not directed by him.
We want to filter out all movies not directed by Ridley Scott; in other words, we don't want the False films. To return the rows where that condition is True, we pass the condition into the DataFrame:
movies_df[condition].head()
Let's look at conditional selections using numerical values by filtering the DataFrame by ratings:
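A minimal sketch of such a filter (the 8.0 cutoff here is just an illustrative choice):
# Keep only rows whose rating meets the (illustrative) cutoff
movies_df[movies_df['rating'] >= 8.0].head()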
We can make some richer conditionals by using logical operators | for "or" and & for "and".
Let's filter the DataFrame to show only movies by Christopher Nolan OR Ridley Scott:
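One way to write that filter:
# Rows where the director is either Christopher Nolan or Ridley Scott
movies_df[
    (movies_df['director'] == 'Christopher Nolan') | (movies_df['director'] == 'Ridley Scott')
].head()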
We need to make sure to group evaluations with parentheses so Python knows how to evaluate the
conditional.
Using the isin() method we could make this more concise though:
movies_df[movies_df['director'].isin(['Christopher Nolan', 'Ridley Scott'])].head()
Out:
                 rank  genre                     director           year  runtime  rating  votes    revenue_millions  metascore
Title
Prometheus       2     Adventure,Mystery,Sci-Fi  Ridley Scott       2012  124      7.0     485820   126.46            65.0
Interstellar     37    Adventure,Drama,Sci-Fi    Christopher Nolan  2014  169      8.6     1047747  187.99            74.0
The Dark Knight  55    Action,Crime,Drama        Christopher Nolan  2008  152      9.0     1791916  533.32            82.0
The Prestige     65    Drama,Mystery,Sci-Fi      Christopher Nolan  2006  130      8.5     913152   53.08             66.0
Inception        81    Action,Adventure,Sci-Fi   Christopher Nolan  2010  148      8.8     1583625  292.57            74.0
(The description and actors columns are truncated here for readability.)
Let's say we want all movies that were released between 2005 and 2010, have a rating above 8.0, but
made below the 25th percentile in revenue.
movies_df[
    ((movies_df['year'] >= 2005) & (movies_df['year'] <= 2010))
    & (movies_df['rating'] > 8.0)
    & (movies_df['revenue_millions'] < movies_df['revenue_millions'].quantile(0.25))
]
Out:
                     rank  genre               director                          year  runtime  rating  votes   revenue_millions  metascore
Title
3 Idiots             431   Comedy,Drama        Rajkumar Hirani                   2009  170      8.4     238789  6.52              67.0
The Lives of Others  477   Drama,Thriller      Florian Henckel von Donnersmarck  2006  137      8.5     278103  11.28             89.0
Incendies            714   Drama,Mystery,War   Denis Villeneuve                  2010  131      8.2     92863   6.86              80.0
Taare Zameen Par     992   Drama,Family,Music  Aamir Khan                        2007  165      8.5     102697  1.20              42.0
(The description and actors columns are truncated here for readability.)
If you recall from when we used .describe(), the 25th percentile for revenue was about 17.4, and we can access this value directly by using the quantile() method with a float of 0.25.
Applying functions
It is possible to iterate over a DataFrame or Series as you would with a list, but doing so — especially on
large datasets — is very slow.
An efficient alternative is to apply() a function to the dataset. For example, we could use a function to convert movies with a rating of 8.0 or greater to a string value of "good" and the rest to "bad", and use these transformed values to create a new column.
First we would create a function that, when given a rating, determines if it's good or bad:
def rating_function(x):
    if x >= 8.0:
        return "good"
    else:
        return "bad"
Now we want to send the entire rating column through this function, which is what apply() does:
movies_df["rating_category"] = movies_df["rating"].apply(rating_function)
movies_df.head(2)
Out:
                         rank  genre                     director      year  runtime  rating  votes   revenue_millions  metascore  rating_category
Title
Guardians of the Galaxy  1     Action,Adventure,Sci-Fi   James Gunn    2014  121      8.1     757074  333.13            76.0       good
Prometheus               2     Adventure,Mystery,Sci-Fi  Ridley Scott  2012  124      7.0     485820  126.46            65.0       bad
(The description and actors columns are truncated here for readability.)
The .apply() method passes every value in the rating column through the rating_function and then
returns a new Series. This Series is then assigned to a new column called rating_category.
You can also use anonymous functions as well. This lambda function achieves the same result as
rating_function:
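One way to write it inline:
# Same categorisation as rating_function, written as a lambda
movies_df["rating_category"] = movies_df["rating"].apply(
    lambda x: 'good' if x >= 8.0 else 'bad')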
movies_df.head(2)
Out:
                         rank  genre                     director      year  runtime  rating  votes   revenue_millions  metascore  rating_category
Title
Guardians of the Galaxy  1     Action,Adventure,Sci-Fi   James Gunn    2014  121      8.1     757074  333.13            76.0       good
Prometheus               2     Adventure,Mystery,Sci-Fi  Ridley Scott  2012  124      7.0     485820  126.46            65.0       bad
(The description and actors columns are truncated here for readability.)
Overall, using apply() will be much faster than iterating manually over rows because pandas is
utilizing vectorization.
Vectorization: a style of computer programming where operations are applied to whole arrays instead of
individual elements
A good example of high usage of apply() is during natural language processing (NLP) work. You'll
need to apply all sorts of text cleaning functions to strings to prepare for machine learning.
Logistic Regression
Note: Logistic regression uses the concept of predictive modeling as regression; therefore, it is called logistic regression, but it is used to classify samples, so it falls under the classification algorithms.
• The logistic regression equation can be obtained from the linear regression equation: y = b0 + b1x1 + b2x2 + ... + bnxn
• In Logistic Regression y can be between 0 and 1 only, so let's divide the above equation by (1-y): y/(1-y)
• But we need a range between -[infinity] and +[infinity]; taking the logarithm of the equation, it becomes: log[y/(1-y)] = b0 + b1x1 + b2x2 + ... + bnxn
On the basis of the categories, Logistic Regression can be classified into three types:
• Binomial: In binomial Logistic regression, there can be only two possible types of the dependent
variables, such as 0 or 1, Pass or Fail, etc.
• Multinomial: In multinomial Logistic regression, there can be 3 or more possible unordered types of the
dependent variable, such as "cat", "dogs", or "sheep"
• Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of dependent
variables, such as "low", "Medium", or "High".
For this problem, we will build a Machine Learning model using the Logistic regression algorithm. The
dataset is shown in the below image. In this problem, we will predict the purchased variable
(Dependent Variable) by using age and salary (Independent variables).
Steps in Logistic Regression: To implement the Logistic Regression using Python, we will use the
same steps as we have done in previous topics of Regression. Below are the steps:
1. Data Pre-processing step: In this step, we will pre-process/prepare the data so that we can use it in
our code efficiently. It will be the same as we have done in Data pre-processing topic. The code for this
is given below:
#Data Pre-processing Step
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

#importing datasets
data_set= pd.read_csv('user_data.csv')
By executing the above lines of code, we will get the dataset as the output. Consider the given image:
Now, we will extract the dependent and independent variables from the given dataset. Below is the code
for it:
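A sketch of the extraction, assuming Age and Salary sit in the third and fourth columns of user_data.csv and Purchased in the fifth (adjust the indices to your file):
#Extracting Independent and dependent Variable
x = data_set.iloc[:, [2, 3]].values
y = data_set.iloc[:, 4].values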
Now we will split the dataset into a training set and test set. Below is the code for it:
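A sketch of the split, using a 75/25 split and a fixed random_state as assumptions:
# Splitting the dataset into training and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)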
#feature Scaling
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)
Our dataset is now well prepared, and we will train the model using the training set. To fit the model to the training set, we will import the LogisticRegression class of the sklearn library.
After importing the class, we will create a classifier object and use it to fit the logistic regression model to the training set. Below is the code for it:
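A minimal sketch of fitting the classifier (random_state=0 is an assumption for reproducibility):
#Fitting Logistic Regression to the training set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=0)
classifier.fit(x_train, y_train)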
Output: By executing the above code, we will get the below output:
Out[5]:
Our model is well trained on the training set, so we will now predict the result by using test set data.
Below is the code for it:
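The prediction step is a single call:
#Predicting the test set result
y_pred = classifier.predict(x_test)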
In the above code, we have created a y_pred vector to predict the test set result.
Output: By executing the above code, a new vector (y_pred) will be created under the variable explorer
option. It can be seen as:
The above output image shows the corresponding predicted users who want to purchase or not purchase
the car.
4. Test Accuracy of the result
Now we will create the confusion matrix to check the accuracy of the classification. To create it, we need to import the confusion_matrix function from the sklearn.metrics module. After importing the function, we will call it and store the result in a new variable cm. The function takes two main parameters, y_true (the actual values) and y_pred (the values predicted by the classifier). Below is the code for it:
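A sketch of that call:
#Creating the Confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)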
Output:
By executing the above code, a new confusion matrix will be created. Consider the below image:
We can find the accuracy of the predicted result by interpreting the confusion matrix. From the above output, we can see that 65+24 = 89 predictions are correct and 8+3 = 11 are incorrect.
Support Vector Machine (SVM)
Support Vector Machine (SVM) is a supervised learning algorithm used for classification as well as regression problems, though it is primarily used for classification. The goal of SVM is to create the best line or decision boundary (a hyperplane) that segregates n-dimensional space into classes. SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed Support Vector Machine. Consider the below diagram in which there are two different categories that are classified using a decision boundary or hyperplane:
Example: SVM can be understood with the example that we used in the KNN classifier. Suppose we see a strange cat that also has some features of dogs; if we want a model that can accurately identify whether it is a cat or a dog, such a model can be created by using the SVM algorithm. We will first train our model with lots of images of cats and dogs so that it can learn about the different features of cats and dogs, and then we test it with this strange creature. The support vector machine creates a decision boundary between the two classes (cat and dog) and chooses the extreme cases (support vectors) of cats and dogs. On the basis of the support vectors, it will classify the creature as a cat. Consider the below diagram:
SVM algorithm can be used for Face detection, image classification, text categorization, etc.
Types of SVM
SVM can be of two types:
• Linear SVM: Linear SVM is used for linearly separable data, which means if a dataset can be
classified into two classes by using a single straight line, then such data is termed as linearly
separable data, and classifier is used called as Linear SVM classifier.
• Non-linear SVM: Non-Linear SVM is used for non-linearly separated data, which means if a
dataset cannot be classified by using a straight line, then such data is termed as non-linear data
and classifier used is called as Non-linear SVM classifier.
Hyperplane: The dimensions of the hyperplane depend on the features present in the dataset, which means if there are 2 features (as shown in the image), then the hyperplane will be a straight line. And if there are 3 features, then the hyperplane will be a two-dimensional plane.
We always create the hyperplane that has the maximum margin, which means the maximum distance between the data points.
Support Vectors:
The data points or vectors that are closest to the hyperplane and which affect the position of the hyperplane are termed support vectors. Since these vectors support the hyperplane, they are called support vectors.
How does SVM work?
Linear SVM:
The working of the SVM algorithm can be understood by using an example. Suppose we have a dataset that has two tags (green and blue), and the dataset has two features x1 and x2. We want a classifier that can classify the pair (x1, x2) of coordinates as either green or blue. Consider the below image:
Since it is a 2-d space, we can easily separate these two classes just by using a straight line. But there can be multiple lines that separate these classes. Consider the below image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or region is called a hyperplane. The SVM algorithm finds the closest points of the lines from both classes. These points are called support vectors. The distance between the vectors and the hyperplane is called the margin, and the goal of SVM is to maximize this margin. The hyperplane with the maximum margin is called the optimal hyperplane.
Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for non-linear data, we
cannot draw a single straight line. Consider the below image:
So to separate these data points, we need to add one more dimension. For linear data, we have used two
dimensions x and y, so for non-linear data, we will add a third dimension z. It can be calculated as:
z = x² + y²
By adding the third dimension, the sample space will become as below image:
So now, SVM will divide the datasets into classes in the following way. Consider the below image:
Since we are in 3-d Space, hence it is looking like a plane parallel to the x-axis. If we convert it in 2d
space with z=1, then it will become as:
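To make this mapping concrete, the short sketch below (an illustration on synthetic circular data, not part of the original example) adds the feature z = x² + y² explicitly and shows that a plain linear SVM separates, in (x, y, z) space, data that no straight line can separate in (x, y):

# Adding the z = x^2 + y^2 dimension and fitting a linear SVM (illustrative)
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_circles

xy, labels = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

z = (xy[:, 0] ** 2 + xy[:, 1] ** 2).reshape(-1, 1)   # the third dimension
xyz = np.hstack([xy, z])

clf = SVC(kernel='linear')
clf.fit(xyz, labels)
print('Accuracy with the added z feature:', clf.score(xyz, labels))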
Now we will implement the SVM algorithm using Python. Here we will use the same dataset, user_data, which we have used in Logistic Regression and KNN classification.
Up to the data pre-processing step, the code will remain the same. Below is the code:
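The pre-processing listing itself is not reproduced here; the sketch below is one reasonable reconstruction, assuming the same user_data layout as in the earlier Logistic Regression and KNN sections (two numeric feature columns and a binary label; the file name and column indices are assumptions):

# Data pre-processing (reconstruction; file name and column indices are assumed)
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Importing the dataset
data_set = pd.read_csv('user_data.csv')

# Extracting independent and dependent variables
x = data_set.iloc[:, [2, 3]].values
y = data_set.iloc[:, 4].values

# Splitting the dataset into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

# Feature scaling
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)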
After executing the above code, we will pre-process the data. The code will give the dataset as:
Now the training set will be fitted to the SVM classifier. To create the SVM classifier, we will import the SVC class from the sklearn.svm library. Below is the code for it:
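The fitting code is not shown in this section; a minimal reconstruction, consistent with the SVC object printed in the output below (kernel='linear', random_state=0; the variable name classifier is an assumption), would be:

from sklearn.svm import SVC   # import the Support Vector Classifier class

classifier = SVC(kernel='linear', random_state=0)   # linear kernel for linearly separable data
classifier.fit(x_train, y_train)                    # fit the classifier to the training set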
In the above code, we have used kernel='linear', as here we are creating an SVM for linearly separable data; however, we can change it for non-linear data. We then fitted the classifier to the training dataset (x_train, y_train).
Output:
Out[8]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
kernel='linear', max_iter=-1, probability=False, random_state=0,
shrinking=True, tol=0.001, verbose=False)
The model performance can be altered by changing the value of C (the regularization parameter), gamma, and the kernel.
• Predicting the test set result:
Now we will predict the output for the test set. For this, we will create a new vector y_pred. Below is the code for it:
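The prediction code is not reproduced here; a minimal reconstruction (assuming the classifier object created in the previous step) is:

# Predicting the test set result
y_pred = classifier.predict(x_test)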
After getting the y_pred vector, we can compare y_pred with y_test to check the difference between the actual values and the predicted values.
Output: Below is the output for the prediction of the test set:
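The 66 + 24 and 8 + 2 counts discussed below come from a confusion matrix computed on the test set. The matrix-creation code is not reproduced in this section; a standard scikit-learn reconstruction (an assumption, not the original listing) would be:

# Creating the confusion matrix for the test-set predictions
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
print(cm)   # diagonal entries are correct predictions, off-diagonal entries are errors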
As we can see in the above output image, there are 66 + 24 = 90 correct predictions and 8 + 2 = 10 incorrect predictions. Therefore, we can say that our SVM model improved as compared to the Logistic Regression model.
K-Means Clustering Algorithm
K-Means Clustering is an unsupervised learning algorithm which groups an unlabeled dataset into different clusters. Here K defines the number of pre-defined clusters that need to be created in the process: if K=2, there will be two clusters, for K=3 there will be three clusters, and so on.
It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each data point belongs to only one group of points with similar properties.
It allows us to cluster the data into different groups and is a convenient way to discover the categories of groups in an unlabeled dataset on its own, without the need for any training.
It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this algorithm is to minimize the sum of distances between the data points and their corresponding cluster centroids.
The algorithm takes the unlabeled dataset as input, divides it into k clusters, and repeats the process until it finds the best clusters. The value of k should be predetermined in this algorithm.
The k-means clustering algorithm mainly performs two tasks:
• Determines the best values for the K center points or centroids by an iterative process.
• Assigns each data point to its closest k-center. The data points which are near a particular k-center form a cluster.
Hence each cluster has data points with some commonalities, and it is away from the other clusters.
The below diagram explains the working of the K-means Clustering Algorithm:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points or centroids. (These can be points other than the ones in the input dataset.)
Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat the third step, i.e., reassign each data point to the new closest centroid of each cluster.
Step-6: If any reassignment occurs, go to Step-4; otherwise, the clusters are final and the model is ready. (A minimal from-scratch sketch of these steps is given below.)
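The steps above can be turned into a small from-scratch sketch. The code below is illustrative only (random 2-D points, Euclidean distance, plain NumPy); it is not the scikit-learn implementation used later in this section:

import numpy as np

def kmeans(x, k, n_iters=100, seed=0):
    # Step-1/2: choose K and pick K random data points as the initial centroids
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iters):
        # Step-3: assign each point to its closest centroid (Euclidean distance)
        distances = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step-4: place each new centroid at the mean of its assigned points
        new_centroids = np.array([x[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        # Step-5/6: repeat until no centroid moves any more
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Example usage with random 2-D points
points = np.random.default_rng(1).random((50, 2))
labels, centroids = kmeans(points, k=2)
print(centroids)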
Suppose we have two variables, M1 and M2. The x-y scatter plot of these two variables is given below:
• Let's take the number of clusters k, i.e., K=2, to identify the dataset and put the points into different clusters. It means here we will try to group these data points into two different clusters.
• We need to choose some random k points or centroids to form the clusters. These points can be either points from the dataset or any other points. So, here we are selecting the below two points as k points, which are not part of our dataset. Consider the below image:
• Now we will assign each data point of the scatter plot to its closest k-point or centroid. We will compute this by calculating the distance between the two points. So, we will draw a median line between both the centroids. Consider the below image:
From the above image, it is clear that the points on the left side of the line are near the K1 or blue centroid, and the points to the right of the line are close to the yellow centroid. Let's color them blue and yellow for clear visualization.
• As we need to find the closest cluster, we will repeat the process by choosing new centroids. To choose the new centroids, we will compute the center of gravity of the points in each cluster, and will find the new centroids as below:
• Next, we will reassign each data point to the new centroids. For this, we will repeat the same process of finding a median line. The median will be like the below image:
From the above image, we can see that one yellow point is on the left side of the line, and two blue points are to the right of the line. So, these three points will be assigned to the new centroids.
As reassignment has taken place, we will again go to Step-4, which is finding the new centroids or k-points.
• We will repeat the process by finding the center of gravity of each cluster, so the new centroids will be as shown in the below image:
• As we have got the new centroids, we will again draw the median line and reassign the data points. So, the image will be:
• We can see in the above image that there are no dissimilar data points on either side of the line, which means our model is formed. Consider the below image:
As our model is ready, we can now remove the assumed centroids, and the two final clusters will be as shown in the below image:
The performance of the K-means clustering algorithm depends on the quality of the clusters that it forms. But choosing the optimal number of clusters is a big task. There are different ways to find the optimal number of clusters, but here we discuss the most appropriate method to find the number of clusters, or the value of K. The method is given below:
Elbow Method
The Elbow method is one of the most popular ways to find the optimal number of clusters. This method uses the concept of the WCSS value. WCSS stands for Within-Cluster Sum of Squares, which defines the total variation within a cluster. The formula to calculate the value of WCSS (for 3 clusters) is given below:
WCSS = Σ(Pi in Cluster1) distance(Pi, C1)² + Σ(Pi in Cluster2) distance(Pi, C2)² + Σ(Pi in Cluster3) distance(Pi, C3)²
Here, Σ(Pi in Cluster1) distance(Pi, C1)² is the sum of the squared distances between each data point Pi in Cluster1 and its centroid C1, and the same holds for the other two terms.
To measure the distance between the data points and the centroid, we can use any distance method, such as Euclidean distance or Manhattan distance.
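As a quick numerical illustration of the formula (with made-up points and centroids, not values from this section), WCSS can be computed directly:

import numpy as np

# Hypothetical data points grouped into three clusters
clusters = [
    np.array([[1.0, 2.0], [2.0, 1.0]]),    # Cluster1
    np.array([[8.0, 8.0], [9.0, 7.0]]),    # Cluster2
    np.array([[4.0, 9.0], [5.0, 10.0]]),   # Cluster3
]
centroids = [c.mean(axis=0) for c in clusters]   # C1, C2, C3

# WCSS: sum over clusters of squared Euclidean distances to the cluster centroid
wcss = sum(((c - centroid) ** 2).sum() for c, centroid in zip(clusters, centroids))
print(wcss)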
To find the optimal value of clusters, the elbow method follows the below steps:
• It executes K-means clustering on a given dataset for different values of K (for example, ranging from 1 to 10).
• For each value of K, it calculates the WCSS value and plots a curve between the calculated WCSS values and the number of clusters K.
• The sharp point of bend, where the plot looks like an arm, is considered the best value of K.
Since the graph shows a sharp bend, which looks like an elbow, this approach is known as the elbow method. The graph for the elbow method looks like the below image:
Note: We can choose the number of clusters equal to the number of data points. If we do so, the value of WCSS becomes zero, and that will be the endpoint of the plot.
In the above section, we have discussed the K-means algorithm, now let's see how it can be implemented using
Python.
Before implementation, let's understand what type of problem we will solve here. We have a Mall_Customers dataset, which contains the data of customers who visit the mall and spend there.
In the given dataset, we have Customer_Id, Gender, Age, Annual Income ($), and Spending Score (a calculated value indicating how much a customer has spent in the mall; the higher the value, the more they have spent).
From this dataset, we need to find some patterns; since this is an unsupervised method, we don't know exactly what to compute.
Step-1: Data Pre-processing
The first step will be the data pre-processing, as we did in our earlier topics of Regression and Classification. But
for the clustering problem, it will be different from other models. Let's discuss it:
• Importing Libraries
As we did in previous topics, firstly, we will import the libraries for our model, which is part of data pre-
processing. The code is given below:
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
In the above code, numpy is imported for performing mathematical calculations, matplotlib is for plotting the graph, and pandas is for managing the dataset.
# importing the dataset
dataset = pd.read_csv('Mall_Customers_data.csv')
By executing the above lines of code, we will get our dataset in the Spyder IDE. The dataset looks like the below
image:
From the above dataset, we need to find some patterns in it.
Here we don't need any dependent variable for the data pre-processing step, as this is a clustering problem and we have no idea about what to determine. So we will just add a line of code for the matrix of features (shown below).
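That line of code is not reproduced here; a standard reconstruction, assuming the Annual Income and Spending Score columns sit at positions 3 and 4 of the CSV, is:

# Extracting the matrix of features (Annual Income and Spending Score columns)
x = dataset.iloc[:, [3, 4]].values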
As we can see, we are extracting only the 3rd and 4th features (Annual Income and Spending Score). This is because we need a 2-D plot to visualize the model, and some features, such as customer_id, are not required.
Step-2: Finding the optimal number of clusters using the elbow method
In the second step, we will try to find the optimal number of clusters for our clustering problem. So, as discussed
above, here we are going to use the elbow method for this purpose.
As we know, the elbow method uses the WCSS concept to draw the plot by plotting WCSS values on the Y-axis
and the number of clusters on the X-axis. So we are going to calculate the value for WCSS for different k values
ranging from 1 to 10. Below is the code for it:
# finding the optimal number of clusters using the elbow method
from sklearn.cluster import KMeans
wcss_list = []   # list to hold the WCSS value for each k
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)   # k-means++ initialization (a common choice)
    kmeans.fit(x)
    wcss_list.append(kmeans.inertia_)
mtp.plot(range(1, 11), wcss_list)
mtp.title('The Elbow Method Graph')
mtp.xlabel('Number of clusters (k)')
mtp.ylabel('wcss_list')
mtp.show()
As we can see in the above code, we have used the KMeans class of the sklearn.cluster library to form the clusters.
Next, we have created the wcss_list variable, initialized as an empty list, which is used to hold the WCSS value computed for different values of k ranging from 1 to 10.
After that, we have initialized the for loop for the iteration over different values of k ranging from 1 to 10; since a for loop in Python excludes the upper bound, it is written as range(1, 11) to include the 10th value.
The rest of the code is similar to what we did in earlier topics: we have fitted the model on the matrix of features and then plotted the graph between the number of clusters and WCSS.
Output: After executing the above code, we will get the below output:
From the above plot, we can see the elbow point is at 5. So the number of clusters here will be 5.
Step-3: Training the K-means algorithm on the training dataset
As we have got the number of clusters, we can now train the model on the dataset.
To train the model, we will use the same two lines of code as we used in the above section, but here, instead of using i, we will use 5, as we know there are 5 clusters that need to be formed. The code is given below:
kmeans = KMeans(n_clusters=5, init='k-means++', random_state=42)
y_predict = kmeans.fit_predict(x)
The first line is the same as above: it creates the object of the KMeans class.
In the second line of code, we have created the y_predict variable, which holds the cluster assigned to each data point by fit_predict.
By executing the above lines of code, we will get the y_predict variable. We can check it under the Variable Explorer option in the Spyder IDE, and we can now compare the values of y_predict with our original dataset. Consider the below image:
From the above image, we can see that CustomerID 1 belongs to cluster 3 (the cluster index starts from 0, so a value of 2 corresponds to cluster 3), CustomerID 2 belongs to cluster 4, and so on.
The last step is to visualize the clusters. As we have 5 clusters for our model, so we will visualize each cluster one
by one.
To visualize the clusters, we will use a scatter plot via the mtp.scatter() function of matplotlib. Below is the code for it:
# visualizing the clusters
mtp.scatter(x[y_predict == 0, 0], x[y_predict == 0, 1], s=100, c='blue', label='Cluster 1')      # first cluster
mtp.scatter(x[y_predict == 1, 0], x[y_predict == 1, 1], s=100, c='green', label='Cluster 2')     # second cluster
mtp.scatter(x[y_predict == 2, 0], x[y_predict == 2, 1], s=100, c='red', label='Cluster 3')       # third cluster
mtp.scatter(x[y_predict == 3, 0], x[y_predict == 3, 1], s=100, c='cyan', label='Cluster 4')      # fourth cluster
mtp.scatter(x[y_predict == 4, 0], x[y_predict == 4, 1], s=100, c='magenta', label='Cluster 5')   # fifth cluster
mtp.title('Clusters of customers')
mtp.xlabel('Annual Income')
mtp.ylabel('Spending Score')
mtp.legend()
mtp.show()
In the above lines of code, we have written one scatter call for each of the five clusters. The first argument of mtp.scatter, i.e., x[y_predict == 0, 0], selects the x values (the first feature column) of the points assigned to cluster 0, and x[y_predict == 0, 1] selects their y values; y_predict takes the values 0 to 4, one for each cluster.
Output:
The output image clearly shows the five different clusters with different colors. The clusters are formed between two parameters of the dataset: the annual income of the customer and the spending score. We can change the colors and labels as per requirement or choice. We can also observe some points from the above patterns, which are given below:
• Cluster 1 shows the customers with average salary and average spending, so we can categorize these customers as average.
• Cluster 2 shows customers with a high income but low spending, so we can categorize them as careful.
• Cluster 3 shows low income and also low spending, so these customers can be categorized as sensible.
• Cluster 4 shows customers with low income but very high spending, so they can be categorized as careless.
• Cluster 5 shows the customers with high income and high spending, so they can be categorized as target customers, and these can be the most profitable customers for the mall owner.