Unit 1


What is machine learning?

Machine Learning is an application of artificial intelligence in which a computer/machine learns from past
experiences (input data) and makes future predictions.

Machine learning is a branch of artificial intelligence (AI) and computer science which focuses on the use of data
and algorithms to imitate the way that humans learn, gradually improving its accuracy.

Machine learning enables a machine to automatically learn from data, improve performance from
experiences, and predict things without being explicitly programmed.

How does Machine Learning work?


A Machine Learning system learns from historical data, builds prediction models, and,
whenever it receives new data, predicts the output for it. The accuracy of the predicted output depends
on the amount of data: a larger amount of data generally helps to build a better model that predicts the
output more accurately.

Suppose we have a complex problem where we need to make predictions. Instead of writing
code for it directly, we just feed the data to generic algorithms; with the help of these algorithms, the
machine builds the logic from the data and predicts the output. Machine learning has changed the way
we think about such problems.

Features of Machine Learning:


• Machine learning uses data to detect various patterns in a given dataset.
• It can learn from past data and improve automatically.
• It is a data-driven technology.
• Machine learning is similar to data mining, as both deal with huge amounts of data.
Need for Machine Learning
The need for machine learning is increasing day by day. The reason is that machine learning is capable
of doing tasks that are too complex for a person to implement directly. As humans, we have limitations:
we cannot manually process huge amounts of data, so we need computer systems, and this is where
machine learning makes things easy for us.

We can train machine learning algorithms by providing them with huge amounts of data, letting them
explore the data, construct models, and predict the required output automatically. The performance
of a machine learning algorithm depends on the amount of data, and it can be measured by the cost
function. With the help of machine learning, we can save both time and money.

The importance of machine learning can be easily understood from its use cases. Currently, machine
learning is used in self-driving cars, cyber fraud detection, face recognition, friend suggestions
on Facebook, and more. Top companies such as Netflix and Amazon have built machine learning
models that use vast amounts of data to analyze user interests and recommend products
accordingly.

Following are some key points which show the importance of Machine Learning:

• Rapid increase in the production of data
• Solving complex problems that are difficult for a human
• Decision making in various sectors, including finance
• Finding hidden patterns and extracting useful information from data

Classification of Machine Learning


At a broad level, machine learning can be classified into three types:

1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning
1) Supervised Learning

Supervised learning is a type of machine learning method in which we provide sample labeled data to
the machine learning system in order to train it, and on that basis, it predicts the output.

The system creates a model using the labeled data to understand the dataset and learn about each example.
Once training and processing are done, we test the model by providing sample data to check whether
it predicts the correct output.

The goal of supervised learning is to map input data to output data. Supervised learning is
based on supervision, much as a student learns under the supervision of a teacher.
An example of supervised learning is spam filtering.

Supervised learning can be grouped further in two categories of algorithms:

• Classification
• Regression
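As a small illustrative sketch (assuming scikit-learn is installed; not a method prescribed by this text), a supervised classifier is trained on labeled examples and then asked to predict outputs for data it has not seen:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Labeled data: features X and known outputs y
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = DecisionTreeClassifier()
model.fit(X_train, y_train)            # learn the mapping from inputs to labels
print(model.score(X_test, y_test))     # accuracy on unseen data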

2) Unsupervised Learning

Unsupervised learning is a learning method in which a machine learns without any supervision.

The machine is trained on a set of data that has not been labeled, classified, or
categorized, and the algorithm must act on that data without any supervision. The goal of
unsupervised learning is to restructure the input data into new features or groups of objects with similar
patterns.
In unsupervised learning, we don't have a predetermined result; the machine tries to find useful insights
from the huge amount of data. It can be further classified into two categories of algorithms:

• Clustering
• Association
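As a small illustrative sketch (again assuming scikit-learn is installed), a clustering algorithm groups unlabeled points purely from their similarity:

import numpy as np
from sklearn.cluster import KMeans

# Unlabeled data: only features, no known outputs
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)   # cluster assignments discovered from the data itself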

3) Reinforcement Learning

Reinforcement learning is a feedback-based learning method in which a learning agent gets a reward for
each right action and a penalty for each wrong action. The agent learns automatically from this
feedback and improves its performance. In reinforcement learning, the agent interacts with the
environment and explores it. The goal of the agent is to collect the most reward points, and in doing so it
improves its performance.

A robotic dog that automatically learns the movement of its limbs is an example of reinforcement
learning.
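As a minimal, self-contained sketch of the reward-and-penalty idea (a hypothetical toy problem, not the robotic-dog example itself), a tabular Q-learning agent can learn to walk to the end of a tiny corridor:

import random

# A 5-state corridor: the agent starts at state 0 and is rewarded (+1)
# only when it reaches the rightmost state (state 4).
n_states, actions = 5, [-1, +1]            # move left or move right
Q = [[0.0, 0.0] for _ in range(n_states)]  # action-value table
alpha, gamma, epsilon = 0.5, 0.9, 0.1

def choose_action(state):
    # epsilon-greedy: mostly exploit the best known action, sometimes explore
    if random.random() < epsilon:
        return random.randrange(2)
    best = max(Q[state])
    return random.choice([a for a, q in enumerate(Q[state]) if q == best])

for episode in range(200):
    state = 0
    while state != n_states - 1:
        a = choose_action(state)
        next_state = min(max(state + actions[a], 0), n_states - 1)
        reward = 1.0 if next_state == n_states - 1 else 0.0
        # the reward (or its absence) feeds back into the value estimate
        Q[state][a] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][a])
        state = next_state

print(Q)   # action values learned purely from reward feedback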

Note: We will learn about the above types of machine learning in detail in later chapters.

History of Machine Learning


About 40-50 years ago, machine learning was science fiction, but today it is part of
our daily life. Machine learning is making everyday life easier, from self-driving cars to Amazon's
virtual assistant "Alexa". The idea behind machine learning, however, is quite old and has a long history.
Data Preprocessing in Machine learning
Data preprocessing is the process of preparing raw data and making it suitable for a machine learning model. It is the first
and crucial step in creating a machine learning model.

When creating a machine learning project, we do not always come across clean and formatted data. Before
doing any operation with data, it is necessary to clean it and put it in a formatted way. For this, we use data
preprocessing.

Why do we need Data Preprocessing?

Real-world data generally contains noise and missing values, and may be in an unusable format that cannot be
directly used for machine learning models. Data preprocessing is the required task of cleaning the data and making it
suitable for a machine learning model, which also increases the accuracy and efficiency of the model. It involves the
following steps:

• Getting the dataset
• Importing libraries
• Importing datasets
• Finding missing data
• Encoding categorical data
• Splitting the dataset into training and test sets
• Feature scaling
1) Get the Dataset
To create a machine learning model, the first thing we require is a dataset, since a machine learning model
works entirely on data. The data collected for a particular problem, in a proper format, is known as the
dataset.

Datasets come in different formats for different purposes. For example, the dataset needed to build a model
for a business problem will be different from the dataset needed for a liver-patient problem, so each dataset
is different from the others. To use a dataset in our code, we usually put it into a CSV file. However,
sometimes we may also need to use an HTML or xlsx file.

What is a CSV File?

CSV stands for "Comma-Separated Values"; it is a file format that allows us to save tabular
data, such as spreadsheets. It is useful for huge datasets, and these datasets can be used directly in programs.

Here we will use a demo dataset for data preprocessing; for practice, it can be downloaded from
https://www.superdatascience.com/pages/machine-learning. For real-world problems, we can
download datasets online from various sources such as https://www.kaggle.com/uciml/datasets,
https://archive.ics.uci.edu/ml/index.php, etc.

We can also create our own dataset by gathering data using various APIs with Python and putting that data
into a .csv file.
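For reference, the demo dataset used in the rest of this section (reconstructed from the outputs shown below, with blanks for the missing values) looks like this when stored as Dataset.csv:

Country,Age,Salary,Purchased
India,38,68000,No
France,43,45000,Yes
Germany,30,54000,No
France,48,65000,No
Germany,40,,Yes
India,35,58000,Yes
Germany,,53000,No
France,49,79000,Yes
India,50,88000,No
France,37,77000,Yes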

2) Importing Libraries
In order to perform data preprocessing using Python, we need to import some predefined Python
libraries. These libraries are used to perform some specific jobs. There are three specific libraries that we
will use for data preprocessing, which are:

Numpy: The Numpy library is used to include any type of mathematical operation in the code. It
is the fundamental package for scientific computation in Python and supports large,
multi-dimensional arrays and matrices. In Python, we can import it as:

1. import numpy as nm

Here we have used nm, which is a short name for Numpy, and it will be used in the whole program.

Matplotlib: The second library is matplotlib, which is a Python 2D plotting library; from it we
need to import the sub-library pyplot. This library is used to plot any type of chart in Python.
It will be imported as below:
1. import matplotlib.pyplot as mtp

Here we have used mtp as a short name for this library.

Pandas: The last library is the Pandas library, which is one of the most famous Python libraries and
is used for importing and managing datasets. It is an open-source data manipulation and analysis
library. It will be imported as below:

1. import pandas as pd

Here, we have used pd as a short name for this library.

3) Importing the Datasets


Now we need to import the datasets which we have collected for our machine learning project. But
before importing a dataset, we need to set the current directory as the working directory. To set a working
directory in Spyder IDE, we follow the below steps:

1. Save your Python file in the directory which contains the dataset.
2. Go to the File explorer option in Spyder IDE, and select the required directory.
3. Press F5 or click the run option to execute the file.

Note: We can set any directory as a working directory, but it must contain the required dataset.

Once the Python file is saved alongside the required dataset, the current folder is set as the working
directory.
read_csv() function:

Now, to import the dataset, we will use the read_csv() function of the pandas library, which is used to read a csv
file and perform various operations on it. Using this function, we can read a csv file locally as well as
through a URL.

We can use read_csv function as below:

1. data_set= pd.read_csv('Dataset.csv')

Here, data_set is the name of the variable that stores our dataset, and inside the function we have passed the
name of our dataset file. Once we execute the above line of code, the dataset is imported into
our code. We can also inspect the imported dataset in the variable explorer section by
double-clicking on data_set. Indexing starts from 0, which is the default indexing in Python. We can also
change the format of our dataset by clicking on the format option.

Extracting dependent and independent variables:

In machine learning, it is important to distinguish the matrix of features (independent variables) from the
dependent variable in the dataset. In our dataset, there are three independent variables, Country,
Age, and Salary, and one dependent variable, Purchased.

Extracting independent variable:


To extract the independent variables, we will use the iloc[] method of the Pandas library. It is used to extract the
required rows and columns from the dataset.

1. x= data_set.iloc[:,:-1].values

In the above code, the first colon (:) is used to take all the rows, and the second colon (:) is for all the
columns. Here we have used :-1 because we don't want to take the last column, as it contains the
dependent variable. By doing this, we will get the matrix of features.

By executing the above code, we will get output as:

1. [['India' 38.0 68000.0]


2. ['France' 43.0 45000.0]
3. ['Germany' 30.0 54000.0]
4. ['France' 48.0 65000.0]
5. ['Germany' 40.0 nan]
6. ['India' 35.0 58000.0]
7. ['Germany' nan 53000.0]
8. ['France' 49.0 79000.0]
9. ['India' 50.0 88000.0]
10. ['France' 37.0 77000.0]]

As we can see in the above output, only the three independent variables are included.

Extracting dependent variable:

To extract dependent variables, again, we will use Pandas .iloc[] method.

1. y= data_set.iloc[:,3].values

Here we have taken all the rows with the last column only. It will give the array of dependent variables.

By executing the above code, we will get output as:

Output:
array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
dtype=object)

Note: If you are using Python language for machine learning, then extraction is mandatory, but
for R language it is not required.


4) Handling Missing data:


The next step of data preprocessing is to handle missing data in the datasets. If our dataset contains some
missing data, then it may create a huge problem for our machine learning model. Hence it is necessary to
handle missing values present in the dataset.

Ways to handle missing data:

There are mainly two ways to handle missing data, which are:

By deleting the particular row: The first way is commonly used to deal with null values: we simply
delete the specific row or column which contains null values. But this way is not very efficient,
and removing data may lead to a loss of information, which will not give an accurate output.

By calculating the mean: In this way, we calculate the mean of the column or row which contains
a missing value and put it in place of the missing value. This strategy is useful for features
which have numeric data, such as age, salary, year, etc. Here, we will use this approach.

To handle missing values, we will use the Scikit-learn library in our code, which contains various tools
for building machine learning models. Here we will use the Imputer class of the sklearn.preprocessing
module. Below is the code for it:

1. #handling missing data (Replacing missing data with the mean value)
2. from sklearn.preprocessing import Imputer
3. imputer= Imputer(missing_values ='NaN', strategy='mean', axis = 0)
4. #Fitting imputer object to the independent variables x.
5. imputer= imputer.fit(x[:, 1:3])
6. #Replacing missing data with the calculated mean value
7. x[:, 1:3]= imputer.transform(x[:, 1:3])

Output:

array([['India', 38.0, 68000.0],


['France', 43.0, 45000.0],
['Germany', 30.0, 54000.0],
['France', 48.0, 65000.0],
['Germany', 40.0, 65222.22222222222],
['India', 35.0, 58000.0],
['Germany', 41.111111111111114, 53000.0],
['France', 49.0, 79000.0],
['India', 50.0, 88000.0],
['France', 37.0, 77000.0]], dtype=object

As we can see in the above output, the missing values have been replaced with the mean of the remaining
values in the corresponding column.
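Note that in current versions of scikit-learn the Imputer class has been removed in favour of SimpleImputer from sklearn.impute. Assuming a recent installation, an equivalent sketch would be:

import numpy as np
from sklearn.impute import SimpleImputer

# Replace missing entries in the Age and Salary columns with the column mean
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
x[:, 1:3] = imputer.fit_transform(x[:, 1:3])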

5) Encoding Categorical data:


Categorical data is data which has some categories; in our dataset, there are two categorical
variables, Country and Purchased.
Since a machine learning model works entirely on mathematics and numbers, a categorical variable in the
dataset may create trouble while building the model. So it is necessary to encode these categorical
variables into numbers.

For Country variable:

Firstly, we will convert the Country categories into numerical data. To do this, we will use the
LabelEncoder() class from the preprocessing module.

1. #Categorical data
2. #for Country Variable
3. from sklearn.preprocessing import LabelEncoder
4. label_encoder_x= LabelEncoder()
5. x[:, 0]= label_encoder_x.fit_transform(x[:, 0])

Output:

Out[15]:
array([[2, 38.0, 68000.0],
[0, 43.0, 45000.0],
[1, 30.0, 54000.0],
[0, 48.0, 65000.0],
[1, 40.0, 65222.22222222222],
[2, 35.0, 58000.0],
[1, 41.111111111111114, 53000.0],
[0, 49.0, 79000.0],
[2, 50.0, 88000.0],
[0, 37.0, 77000.0]], dtype=object)

Explanation:

In the above code, we have imported the LabelEncoder class of the sklearn library. This class has successfully
encoded the categories into digits.

But in our case, there are three country categories, and as we can see in the above output, they
are encoded into 0, 1, and 2. From these values, the machine learning model may assume that there is some
ordering or correlation between these categories, which would produce the wrong output. To remove this issue, we
will use dummy encoding.

Dummy Variables:

Dummy variables are variables which take the value 0 or 1. A value of 1 indicates the presence of that
category in a particular column, while the rest of the dummy columns are 0. With dummy encoding, we have a
number of columns equal to the number of categories.

In our dataset, we have 3 categories, so it will produce three columns of 0 and 1 values. For dummy
encoding, we will use the OneHotEncoder class of the preprocessing module.
1. #for Country Variable
2. from sklearn.preprocessing import LabelEncoder, OneHotEncoder
3. label_encoder_x= LabelEncoder()
4. x[:, 0]= label_encoder_x.fit_transform(x[:, 0])
5. #Encoding for dummy variables
6. onehot_encoder= OneHotEncoder(categorical_features= [0])
7. x= onehot_encoder.fit_transform(x).toarray()

Output:

array([[0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.80000000e+01,
6.80000000e+04],
[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.30000000e+01,
4.50000000e+04],
[0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 3.00000000e+01,
5.40000000e+04],
[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.80000000e+01,
6.50000000e+04],
[0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 4.00000000e+01,
6.52222222e+04],
[0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.50000000e+01,
5.80000000e+04],
[0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 4.11111111e+01,
5.30000000e+04],
[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.90000000e+01,
7.90000000e+04],
[0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 5.00000000e+01,
8.80000000e+04],
[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 3.70000000e+01,
7.70000000e+04]])

As we can see in the above output, all the variables are encoded into numbers 0 and 1 and divided into
three columns.

It can be seen more clearly in the variable explorer section, by clicking on the x option.

For Purchased Variable:

1. labelencoder_y= LabelEncoder()
2. y= labelencoder_y.fit_transform(y)

For the second categorical variable, we will only use the labelencoder object of the LabelEncoder class. Here
we are not using the OneHotEncoder class because the Purchased variable has only two categories, yes or
no, which are automatically encoded into 0 and 1.

Output:

Out[17]: array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

It can also be seen in the variable explorer section.


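In current versions of scikit-learn the categorical_features argument of OneHotEncoder has also been removed. Assuming a recent installation, the same dummy encoding of the Country column can be sketched with a ColumnTransformer:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# One-hot encode column 0 (Country) and pass the remaining columns through unchanged
column_transformer = ColumnTransformer(
    [('country', OneHotEncoder(), [0])], remainder='passthrough')
x = column_transformer.fit_transform(x)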

6) Splitting the Dataset into the Training set and Test set
In machine learning data preprocessing, we divide our dataset into a training set and a test set. This is one
of the crucial steps of data preprocessing, as it lets us evaluate and improve the performance of our
machine learning model.

Suppose we train our machine learning model on one dataset and then test it on a completely different
dataset. The model will then have difficulty understanding the relationships between the variables.

If we train our model very well so that its training accuracy is very high, but then give it a new dataset,
its performance will drop. So we always try to build a machine learning model which
performs well on the training set and also on the test dataset. Here, we can define these datasets as:

Training Set: A subset of the dataset used to train the machine learning model; we already know the output for it.

Test set: A subset of the dataset used to test the machine learning model; the model predicts
the output for the test set.

For splitting the dataset, we will use the below lines of code:

1. from sklearn.model_selection import train_test_split


2. x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)

Explanation:

• In the above code, the first line imports the function used for splitting arrays of the dataset into random train and
test subsets.
• In the second line, we have used four variables for our output, which are:
o x_train: features for the training data
o x_test: features for the testing data
o y_train: dependent variable for the training data
o y_test: dependent variable for the testing data
• In the train_test_split() function, we have passed four parameters, the first two of which are the arrays
of data; test_size specifies the size of the test set. The test_size may be .5, .3, or .2,
which sets the dividing ratio of training and testing sets.
• The last parameter, random_state, sets a seed for the random generator so that you always
get the same result; the most commonly used value for it is 42.

Output:

By executing the above code, we will get 4 different variables, which can be seen under the variable
explorer section. The x and y variables are divided into 4 different variables with
corresponding values.
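As a quick sanity check (a sketch using the variables defined above), printing the shapes confirms the 80/20 split of our 10 rows:

print(x_train.shape, x_test.shape)   # (8, 5) and (2, 5): 8 training rows, 2 test rows
print(y_train.shape, y_test.shape)   # (8,) and (2,)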


7) Feature Scaling
Feature scaling is the final step of data preprocessing in machine learning. It is a technique to
standardize the independent variables of the dataset within a specific range. In feature scaling, we put our
variables in the same range and on the same scale so that no variable dominates the others.


Consider the Age and Salary columns of our dataset: the values are not on the same scale. Many machine
learning models are based on Euclidean distance, and if we do not scale the variables, this will cause
problems for the model.

The Euclidean distance between two points A(x1, y1) and B(x2, y2) is given as:

d(A, B) = sqrt((x2 - x1)^2 + (y2 - y1)^2)

If we compute the distance using any two values from Age and Salary, the salary values will dominate the
age values, and the result will be misleading. To remove this issue, we need to perform feature scaling for
machine learning.

There are two ways to perform feature scaling in machine learning:

Standardization

Normalization
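For reference, standardization rescales a feature x as x' = (x - mean(x)) / std(x), so that it has zero mean and unit variance, while normalization (min-max scaling) rescales it as x' = (x - min(x)) / (max(x) - min(x)), so that it lies between 0 and 1.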
Here, we will use the standardization method for our dataset.

For feature scaling, we will import StandardScaler class of sklearn.preprocessing library as:

1. from sklearn.preprocessing import StandardScaler

Now, we will create the object of StandardScaler class for independent variables or features. And then
we will fit and transform the training dataset.

1. st_x= StandardScaler()
2. x_train= st_x.fit_transform(x_train)

For the test dataset, we directly apply the transform() function instead of fit_transform(), because the
scaler has already been fitted on the training set.

1. x_test= st_x.transform(x_test)

Output:

By executing the above lines of code, we will get the scaled values for x_train and x_test. All the
variables are now scaled to roughly the range -1 to 1.

Note: Here, we have not scaled the dependent variable because it only has the two values 0 and 1.
But if the dependent variable had a wider range of values, we would also need to scale it.

Combining all the steps:

Now, in the end, we can combine all the steps together to make our complete code more understandable.


1. # importing libraries
2. import numpy as nm
3. import matplotlib.pyplot as mtp
4. import pandas as pd
5.
6. #importing datasets
7. data_set= pd.read_csv('Dataset.csv')
8.
9. #Extracting Independent Variable
10. x= data_set.iloc[:, :-1].values
11.
12. #Extracting Dependent variable
13. y= data_set.iloc[:, 3].values
14.
15. #handling missing data(Replacing missing data with the mean value)
16. from sklearn.preprocessing import Imputer
17. imputer= Imputer(missing_values ='NaN', strategy='mean', axis = 0)
18.
19. #Fitting imputer object to the independent variables x.
20. imputer= imputer.fit(x[:, 1:3])
21.
22. #Replacing missing data with the calculated mean value
23. x[:, 1:3]= imputer.transform(x[:, 1:3])
24.
25. #for Country Variable
26. from sklearn.preprocessing import LabelEncoder, OneHotEncoder
27. label_encoder_x= LabelEncoder()
28. x[:, 0]= label_encoder_x.fit_transform(x[:, 0])
29.
30. #Encoding for dummy variables
31. onehot_encoder= OneHotEncoder(categorical_features= [0])
32. x= onehot_encoder.fit_transform(x).toarray()
33.
34. #encoding for purchased variable
35. labelencoder_y= LabelEncoder()
36. y= labelencoder_y.fit_transform(y)
37.
38. # Splitting the dataset into training and test set.
39. from sklearn.model_selection import train_test_split
40. x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)
41.
42. #Feature Scaling of datasets
43. from sklearn.preprocessing import StandardScaler
44. st_x= StandardScaler()
45. x_train= st_x.fit_transform(x_train)
46. x_test= st_x.transform(x_test)

In the above code, we have included all the data preprocessing steps together. However, some steps
or lines of code are not necessary for all machine learning models, so we can exclude them from
our code to make it reusable across models.
Pandas library
Install and import

Pandas is an easy package to install. Open up your terminal program (for Mac users) or command line
(for PC users) and install it using either of the following commands:

conda install pandas

OR

pip install pandas

Alternatively, if you're currently viewing this article in a Jupyter notebook you can run this cell:

!pip install pandas

To import pandas we usually import it with a shorter name since it's used so much:

import pandas as pd

Now to the basic components of pandas.

Core components of pandas: Series and DataFrames


The primary two components of pandas are the Series and DataFrame.

A Series is essentially a column, and a DataFrame is a multi-dimensional table made up of a collection
of Series.

DataFrames and Series are quite similar in that many operations that you can do with one you can do
with the other, such as filling in null values and calculating the mean.
You'll see how these components work when we start working with data below.

Creating DataFrames from scratch

Creating DataFrames right in Python is good to know and quite useful when testing new methods and
functions you find in the pandas docs.

There are many ways to create a DataFrame from scratch, but a great option is to just use a simple dict.

Let's say we have a fruit stand that sells apples and oranges. We want to have a column for each fruit
and a row for each customer purchase. To organize this as a dictionary for pandas we could do
something like:

data = {
'apples': [3, 2, 0, 1],
'oranges': [0, 3, 7, 2]
}

And then pass it to the pandas DataFrame constructor:

purchases = pd.DataFrame(data)

purchases
Out:
   apples  oranges
0       3        0
1       2        3
2       0        7
3       1        2

How did that work?

Each (key, value) item in data corresponds to a column in the resulting DataFrame.

The Index of this DataFrame was given to us on creation as the numbers 0-3, but we could also create
our own when we initialize the DataFrame.

Let's have customer names as our index:

purchases = pd.DataFrame(data, index=['June', 'Robert', 'Lily', 'David'])

purchases
Out:
        apples  oranges
June         3        0
Robert       2        3
Lily         0        7
David        1        2

So now we could locate a customer's order by using their name:

purchases.loc['June']
Out:
apples 3
oranges 0
Name: June, dtype: int64

There's more on locating and extracting data from the DataFrame later, but now you should be able to
create a DataFrame with any random data to learn on.

Let's move on to some quick methods for creating DataFrames from various other sources.

How to read in data
It’s quite simple to load data from various file formats into a DataFrame. In the following examples
we'll keep using our apples and oranges data, but this time it's coming from various files.

Reading data from CSVs

With CSV files all you need is a single line to load in the data:

df = pd.read_csv('purchases.csv')

df
Out:
Unnamed: 0 apples oranges
0 June 3 0
1 Robert 2 3
2 Lily 0 7
3 David 1 2
CSVs don't have indexes like our DataFrames, so all we need to do is just designate the index_col
when reading:

df = pd.read_csv('purchases.csv', index_col=0)

df
Out:
apples oranges
June 3 0
Robert 2 3
Lily 0 7
David 1 2

Here we're setting the index to be column zero.

Reading data from JSON

If you have a JSON file — which is essentially a stored Python dict — pandas can read this just as
easily:

df = pd.read_json('purchases.json')

df
Out:
apples oranges
David 1 2
June 3 0
Lily 0 7
Robert 2 3

Notice this time our index came with us correctly since using JSON allowed indexes to work through
nesting. Feel free to open data_file.json in a notepad so you can see how it works.

Reading data from a SQL database

If you’re working with data from a SQL database you need to first establish a connection using an
appropriate Python library, then pass a query to pandas. Here we'll use SQLite to demonstrate.

SQLite support is provided by the sqlite3 module, which ships with the Python standard library, so there
is normally nothing extra to install.

sqlite3 is used to create a connection to a database which we can then use to generate a DataFrame
through a SELECT query.

So first we'll make a connection to a SQLite database file:

import sqlite3

con = sqlite3.connect("database.db")

SQL Tip

If you have data in PostgreSQL, MySQL, or some other SQL server, you'll need to obtain the right
Python library to make a connection. For example, psycopg2 is a commonly used library for
making connections to PostgreSQL. Furthermore, you would make a connection to a database URI
instead of a file like we did here with SQLite.

In this SQLite database we have a table called purchases, and our index is in a column called "index".

By passing a SELECT query and our con, we can read from the purchases table:

df = pd.read_sql_query("SELECT * FROM purchases", con)

df
Out:
index apples oranges
0 June 3 0
1 Robert 2 3
2 Lily 0 7
3 David 1 2

Just like with CSVs, we could pass index_col='index', but we can also set an index after-the-fact:

df = df.set_index('index')

df
Out:
        apples  oranges
index
June         3        0
Robert       2        3
Lily         0        7
David        1        2

In fact, we could use set_index() on any DataFrame using any column at any time. Indexing Series
and DataFrames is a very common task, and the different ways of doing it are worth remembering.

Converting back to a CSV, JSON, or SQL

So after extensive work on cleaning your data, you’re now ready to save it as a file of your choice.
Similar to the ways we read in data, pandas provides intuitive commands to save it:

df.to_csv('new_purchases.csv')

df.to_json('new_purchases.json')

When we save JSON and CSV files, all we have to input into those functions is our desired filename
with the appropriate file extension. With SQL, we’re not creating a new file but instead inserting a new
table into the database using our con variable from before.
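As a rough sketch (reusing the con connection created earlier), writing the DataFrame back to the database could look like:

df.to_sql('new_purchases', con)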

Let's move on to importing some real-world data and detailing a few of the operations you'll be using a
lot.

Most important DataFrame operations


DataFrames possess hundreds of methods and other operations that are crucial to any analysis. As a
beginner, you should know the operations that perform simple transformations of your data and those
that provide fundamental statistical analysis.

Let's load in the IMDB movies dataset to begin:

movies_df = pd.read_csv("IMDB-Movie-Data.csv", index_col="Title")

We're loading this dataset from a CSV and designating the movie titles to be our index.

Viewing your data


The first thing to do when opening a new dataset is print out a few rows to keep as a visual reference.
We accomplish this with .head():

movies_df.head()
Out:
                         Rank  Genre                     Director              Year  Runtime (Minutes)  Rating  Votes   Revenue (Millions)  Metascore
Title
Guardians of the Galaxy  1     Action,Adventure,Sci-Fi   James Gunn            2014  121                8.1     757074  333.13              76.0
Prometheus               2     Adventure,Mystery,Sci-Fi  Ridley Scott          2012  124                7.0     485820  126.46              65.0
Split                    3     Horror,Thriller           M. Night Shyamalan    2016  117                7.3     157606  138.12              62.0
Sing                     4     Animation,Comedy,Family   Christophe Lourdelet  2016  108                7.2     60545   270.32              59.0
Suicide Squad            5     Action,Adventure,Fantasy  David Ayer            2016  123                6.2     393727  325.02              40.0

(The Description and Actors columns are omitted here for readability.)

.head() outputs the first five rows of your DataFrame by default, but we could also pass a number as
well: movies_df.head(10) would output the top ten rows, for example.

To see the last five rows use .tail(). tail() also accepts a number, and in this case we print the
bottom two rows:

movies_df.tail(2)
Out:
              Rank  Genre                  Director          Year  Runtime (Minutes)  Rating  Votes  Revenue (Millions)  Metascore
Title
Search Party  999   Adventure,Comedy       Scot Armstrong    2014  93                 5.6     4881   NaN                 22.0
Nine Lives    1000  Comedy,Family,Fantasy  Barry Sonnenfeld  2016  87                 5.3     12435  19.64               11.0

(The Description and Actors columns are omitted here for readability.)

Typically when we load in a dataset, we like to view the first five or so rows to see what's under the
hood. Here we can see the names of each column, the index, and examples of values in each row.
You'll notice that the index in our DataFrame is the Title column, which you can tell by how the word
Title is slightly lower than the rest of the columns.

Getting info about your data

.info() should be one of the very first commands you run after loading your data:

movies_df.info()
Out:
<class 'pandas.core.frame.DataFrame'>
Index: 1000 entries, Guardians of the Galaxy to Nine Lives
Data columns (total 11 columns):
Rank 1000 non-null int64
Genre 1000 non-null object
Description 1000 non-null object
Director 1000 non-null object
Actors 1000 non-null object
Year 1000 non-null int64
Runtime (Minutes) 1000 non-null int64
Rating 1000 non-null float64
Votes 1000 non-null int64
Revenue (Millions) 872 non-null float64
Metascore 936 non-null float64
dtypes: float64(3), int64(4), object(4)
memory usage: 93.8+ KB

.info() provides the essential details about your dataset, such as the number of rows and columns, the
number of non-null values, what type of data is in each column, and how much memory your
DataFrame is using.

Notice in our movies dataset we have some obvious missing values in the Revenue and Metascore
columns. We'll look at how to handle those in a bit.

Seeing the datatype quickly is actually quite useful. Imagine you just imported some JSON and the
integers were recorded as strings. You go to do some arithmetic and find an "unsupported operand"
Exception because you can't do math with strings. Calling .info() will quickly point out that the
column you thought was all integers actually contains string objects.

Another fast and useful attribute is .shape, which outputs just a tuple of (rows, columns):

movies_df.shape
Out:
(1000, 11)

Note that .shape has no parentheses and is a simple tuple of format (rows, columns). So we have 1000
rows and 11 columns in our movies DataFrame.

You'll be going to .shape a lot when cleaning and transforming data. For example, you might filter
some rows based on some criteria and then want to know quickly how many rows were removed.
Handling duplicates

This dataset does not have duplicate rows, but it is always important to verify you aren't aggregating
duplicate rows.

To demonstrate, let's simply just double up our movies DataFrame by appending it to itself:

temp_df = movies_df.append(movies_df)

temp_df.shape
Out:
(2000, 11)

Using append() will return a copy without affecting the original DataFrame. We are capturing this copy
in temp_df so we aren't working with the real data.

Notice that calling .shape quickly proves our DataFrame rows have doubled.
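Note that DataFrame.append() has been deprecated and removed in recent versions of pandas; if it is unavailable in your environment, pd.concat() produces the same doubled copy:

temp_df = pd.concat([movies_df, movies_df])   # same result as appending movies_df to itself

temp_df.shape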

Now we can try dropping duplicates:

temp_df = temp_df.drop_duplicates()

temp_df.shape
Out:
(1000, 11)

Just like append(), the drop_duplicates() method will also return a copy of your DataFrame, but this
time with duplicates removed. Calling .shape confirms we're back to the 1000 rows of our original
dataset.

It's a little verbose to keep assigning DataFrames to the same variable like in this example. For this
reason, pandas has the inplace keyword argument on many of its methods. Using inplace=True will
modify the DataFrame object in place:

temp_df.drop_duplicates(inplace=True)

Now our temp_df will have the transformed data automatically.

Another important argument for drop_duplicates() is keep, which has three possible options:

• first: (default) Drop duplicates except for the first occurrence.


• last: Drop duplicates except for the last occurrence.
• False: Drop all duplicates.
Since we didn't define the keep argument in the previous example, it defaulted to first. This means
that if two rows are the same, pandas will drop the second row and keep the first row. Using last has the
opposite effect: the first row is dropped.

keep=False, on the other hand, will drop all duplicates. If two rows are the same then both will be dropped.
Watch what happens to temp_df:

temp_df = movies_df.append(movies_df) # make a new copy

temp_df.drop_duplicates(inplace=True, keep=False)

temp_df.shape
Out:
(0, 11)

Since all rows were duplicates, keep=False dropped them all resulting in zero rows being left over. If
you're wondering why you would want to do this, one reason is that it allows you to locate all duplicates
in your dataset. When conditional selections are shown below you'll see how to do that.

Column cleanup

Many times datasets will have verbose column names with symbols, upper and lowercase words, spaces,
and typos. To make selecting data by column name easier, we can spend a little time cleaning up their
names.

Here's how to print the column names of our dataset:

movies_df.columns
Out:
Index(['Rank', 'Genre', 'Description', 'Director', 'Actors', 'Year',
'Runtime (Minutes)', 'Rating', 'Votes', 'Revenue (Millions)',
'Metascore'],
dtype='object')

Not only does .columns come in handy if you want to rename columns by allowing for simple copy and
paste, it's also useful if you need to understand why you are receiving a KeyError when selecting data
by column.

We can use the .rename() method to rename certain or all columns via a dict. We don't want
parentheses, so let's rename those:

movies_df.rename(columns={
'Runtime (Minutes)': 'Runtime',
'Revenue (Millions)': 'Revenue_millions'
}, inplace=True)

movies_df.columns
Out:
Index(['Rank', 'Genre', 'Description', 'Director', 'Actors', 'Year', 'Runtime',
'Rating', 'Votes', 'Revenue_millions', 'Metascore'],
dtype='object')

Excellent. But what if we want to lowercase all names? Instead of using .rename() we could also set a
list of names to the columns like so:

movies_df.columns = ['rank', 'genre', 'description', 'director', 'actors', 'year', 'runtime',
                     'rating', 'votes', 'revenue_millions', 'metascore']

movies_df.columns
Out:
Index(['rank', 'genre', 'description', 'director', 'actors', 'year', 'runtime',
'rating', 'votes', 'revenue_millions', 'metascore'],
dtype='object')

But that's too much work. Instead of just renaming each column manually we can do a list
comprehension:

movies_df.columns = [col.lower() for col in movies_df]

movies_df.columns
Out:
Index(['rank', 'genre', 'description', 'director', 'actors', 'year', 'runtime',
'rating', 'votes', 'revenue_millions', 'metascore'],
dtype='object')

list (and dict) comprehensions come in handy a lot when working with pandas and data in general.

It's a good idea to lowercase, remove special characters, and replace spaces with underscores if you'll be
working with a dataset for some time.
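For instance, a slightly more general cleanup along those lines (a sketch, not something this particular dataset still needs) could be:

movies_df.columns = [col.strip().lower().replace(' ', '_') for col in movies_df.columns]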

How to work with missing values

When exploring data, you’ll most likely encounter missing or null values, which are essentially
placeholders for non-existent values. Most commonly you'll see Python's None or NumPy's np.nan,
each of which are handled differently in some situations.

There are two options in dealing with nulls:

1. Get rid of rows or columns with nulls


2. Replace nulls with non-null values, a technique known as imputation
Let's calculate the total number of nulls in each column of our dataset. The first step is to check which
cells in our DataFrame are null:

movies_df.isnull()
Out:
                         rank   genre  description  director  actors  year   runtime  rating  votes  revenue_millions  metascore
Title
Guardians of the Galaxy  False  False  False        False     False   False  False    False   False  False             False
Prometheus               False  False  False        False     False   False  False    False   False  False             False
Split                    False  False  False        False     False   False  False    False   False  False             False
Sing                     False  False  False        False     False   False  False    False   False  False             False
Suicide Squad            False  False  False        False     False   False  False    False   False  False             False

Notice isnull() returns a DataFrame where each cell is either True or False depending on that cell's
null status.

To count the number of nulls in each column we use an aggregate function for summing:

movies_df.isnull().sum()
Out:
rank 0
genre 0
description 0
director 0
actors 0
year 0
runtime 0
rating 0
votes 0
revenue_millions 128
metascore 64
dtype: int64

.isnull() just by itself isn't very useful; it is usually used in conjunction with other methods, like
sum().
We can see now that our data has 128 missing values for revenue_millions and 64 missing values for
metascore.

Removing null values

Data Scientists and Analysts regularly face the dilemma of dropping or imputing null values, and it is a
decision that requires intimate knowledge of your data and its context. Overall, removing null data is
only suggested if you have a small amount of missing data.

Removing nulls is pretty simple:

movies_df.dropna()

This operation will delete any row with at least a single null value, but it will return a new DataFrame
without altering the original one. You could specify inplace=True in this method as well.

So in the case of our dataset, this operation would remove 128 rows where revenue_millions is null
and 64 rows where metascore is null. This obviously seems like a waste since there's perfectly good
data in the other columns of those dropped rows. That's why we'll look at imputation next.

Other than just dropping rows, you can also drop columns with null values by setting axis=1:

movies_df.dropna(axis=1)

In our dataset, this operation would drop the revenue_millions and metascore columns

Intuition

What's with this axis=1 parameter?

It's not immediately obvious where axis comes from and why you need it to be 1 for it to affect
columns. To see why, just look at the .shape output:

movies_df.shape

Out: (1000, 11)

As we learned above, this is a tuple that represents the shape of the DataFrame, i.e. 1000 rows and 11
columns. Note that the rows are at index zero of this tuple and columns are at index one of this tuple.
This is why axis=1 affects columns. This comes from NumPy, and is a great example of why learning
NumPy is worth your time.

Imputation
Imputation is a conventional feature engineering technique used to keep valuable data that have null
values.

There may be instances where dropping every row with a null value removes too big a chunk from your
dataset, so instead we can impute that null with another value, usually the mean or the median of that
column.

Let's look at imputing the missing values in the revenue_millions column. First we'll extract that
column into its own variable:

revenue = movies_df['revenue_millions']

Using square brackets is the general way we select columns in a DataFrame.

If you remember back to when we created DataFrames from scratch, the keys of the dict ended up as
column names. Now when we select columns of a DataFrame, we use brackets just like if we were
accessing a Python dictionary.

revenue now contains a Series:

revenue.head()
Out:
Title
Guardians of the Galaxy 333.13
Prometheus 126.46
Split 138.12
Sing 270.32
Suicide Squad 325.02
Name: revenue_millions, dtype: float64

Slightly different formatting than a DataFrame, but we still have our Title index.

We'll impute the missing values of revenue using the mean. Here's the mean value:

revenue_mean = revenue.mean()

revenue_mean
Out:
82.95637614678897

With the mean, let's fill the nulls using fillna():


revenue.fillna(revenue_mean, inplace=True)



We have now replaced all nulls in revenue with the mean of the column. Notice that by using
inplace=True we have actually affected the original movies_df:

movies_df.isnull().sum()
Out:
rank 0
genre 0
description 0
director 0
actors 0
year 0
runtime 0
rating 0
votes 0
revenue_millions 0
metascore 64
dtype: int64

Imputing an entire column with the same value like this is a basic example. It would be a better idea to
try a more granular imputation by Genre or Director.

For example, you would find the mean of the revenue generated in each genre individually and impute
the nulls in each genre with that genre's mean.
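A sketch of that per-genre imputation (it would need to be applied before the global fillna above) might look like:

genre_means = movies_df.groupby('genre')['revenue_millions'].transform('mean')
movies_df['revenue_millions'] = movies_df['revenue_millions'].fillna(genre_means)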

Let's now look at more ways to examine and understand the dataset.

Understanding your variables

Using describe() on an entire DataFrame we can get a summary of the distribution of continuous
variables:

movies_df.describe()
Out:
              rank         year      runtime       rating         votes  revenue_millions   metascore
count  1000.000000  1000.000000  1000.000000  1000.000000  1.000000e+03       1000.000000  936.000000
mean    500.500000  2012.783000   113.172000     6.723200  1.698083e+05         82.956376   58.985043
std     288.819436     3.205962    18.810908     0.945429  1.887626e+05         96.412043   17.194757
min       1.000000  2006.000000    66.000000     1.900000  6.100000e+01          0.000000   11.000000
25%     250.750000  2010.000000   100.000000     6.200000  3.630900e+04         17.442500   47.000000
50%     500.500000  2014.000000   111.000000     6.800000  1.107990e+05         60.375000   59.500000
75%     750.250000  2016.000000   123.000000     7.400000  2.399098e+05         99.177500   72.000000
max    1000.000000  2016.000000   191.000000     9.000000  1.791916e+06        936.630000  100.000000

Understanding which numbers are continuous also comes in handy when thinking about the type of plot
to use to represent your data visually.

.describe() can also be used on a categorical variable to get the count of rows, unique count of
categories, top category, and freq of top category:

movies_df['genre'].describe()
Out:
count 1000
unique 207
top Action,Adventure,Sci-Fi
freq 50
Name: genre, dtype: object

This tells us that the genre column has 207 unique values, the top value is Action/Adventure/Sci-Fi,
which shows up 50 times (freq).

.value_counts() can tell us the frequency of all values in a column:

movies_df['genre'].value_counts().head(10)
Out:
Action,Adventure,Sci-Fi 50
Drama 48
Comedy,Drama,Romance 35
Comedy 32
Drama,Romance 31
Action,Adventure,Fantasy 27
Comedy,Drama 27
Animation,Adventure,Comedy 27
Comedy,Romance 26
Crime,Drama,Thriller 24
Name: genre, dtype: int64



Relationships between continuous variables

By using the correlation method .corr() we can generate the relationship between each continuous
variable:

movies_df.corr()
Out:
                      rank      year   runtime    rating     votes  revenue_millions  metascore
rank              1.000000 -0.261605 -0.221739 -0.219555 -0.283876         -0.252996  -0.191869
year             -0.261605  1.000000 -0.164900 -0.211219 -0.411904         -0.117562  -0.079305
runtime          -0.221739 -0.164900  1.000000  0.392214  0.407062          0.247834   0.211978
rating            -0.219555 -0.211219  0.392214  1.000000  0.511537          0.189527   0.631897
votes             -0.283876 -0.411904  0.407062  0.511537  1.000000          0.607941   0.325684
revenue_millions  -0.252996 -0.117562  0.247834  0.189527  0.607941          1.000000   0.133328
metascore         -0.191869 -0.079305  0.211978  0.631897  0.325684          0.133328   1.000000

Correlation tables are a numerical representation of the bivariate relationships in the dataset.

Positive numbers indicate a positive correlation — one goes up the other goes up — and negative
numbers represent an inverse correlation — one goes up the other goes down. 1.0 indicates a perfect
correlation.

So looking in the first row, first column we see rank has a perfect correlation with itself, which is
obvious. On the other hand, the correlation between votes and revenue_millions is 0.6. A little more
interesting.

Examining bivariate relationships comes in handy when you have an outcome or dependent variable in
mind and would like to see the features most correlated to the increase or decrease of the outcome. You
can visually represent bivariate relationships with scatterplots (seen below in the plotting section).
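As a quick preview (assuming matplotlib is installed), a scatterplot of two of these variables can be drawn directly from the DataFrame:

import matplotlib.pyplot as plt

movies_df.plot(kind='scatter', x='rating', y='revenue_millions', title='Revenue vs. Rating')
plt.show()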

Let's now look more at manipulating DataFrames.

DataFrame slicing, selecting, extracting


Up until now we've focused on some basic summaries of our data. We've learned about simple column
extraction using single brackets, and we imputed null values in a column using fillna(). Below are the
other methods of slicing, selecting, and extracting you'll need to use constantly.

It's important to note that, although many methods are the same, DataFrames and Series have different
attributes, so you'll need to be sure to know which type you are working with, or else you will receive
attribute errors.

Let's look at working with columns first.

By column

You already saw how to extract a column using square brackets like this:

genre_col = movies_df['genre']

type(genre_col)
Out:
pandas.core.series.Series

This will return a Series. To extract a column as a DataFrame, you need to pass a list of column names.

In our case that's just a single column:


genre_col = movies_df[['genre']]

type(genre_col)

pandas.core.frame.DataFrame

Since it's just a list, adding another column name is easy:


subset = movies_df[['genre', 'rating']]

subset.head()
Out:
genre rating
Title
Guardians of the Galaxy Action,Adventure,Sci-Fi 8.1
Prometheus Adventure,Mystery,Sci-Fi 7.0
Split Horror,Thriller 7.3
Sing Animation,Comedy,Family 7.2
Suicide Squad Action,Adventure,Fantasy 6.2

Now we'll look at getting data by rows.

By rows
For rows, we have two options:

• .loc - locates by name


• .iloc- locates by numerical index

Remember that we are still indexed by movie Title, so to use .loc we give it the Title of a movie:

prom = movies_df.loc["Prometheus"]

prom
Out:
rank 2
genre Adventure,Mystery,Sci-Fi
description Following clues to the origin of mankind, a te...
director Ridley Scott
actors Noomi Rapace, Logan Marshall-Green, Michael Fa...
year 2012
runtime 124
rating 7
votes 485820
revenue_millions 126.46
metascore 65
Name: Prometheus, dtype: object

On the other hand, with iloc we give it the numerical index of Prometheus:

prom = movies_df.iloc[1]

loc and iloc can be thought of as similar to Python list slicing. To show this even further, let's select
multiple rows.

How would you do it with a list? In Python, just slice with brackets like example_list[1:4]. It works
the same way in pandas:

movie_subset = movies_df.loc['Prometheus':'Sing']

movie_subset = movies_df.iloc[1:4]

movie_subset
Out:
            rank  genre                     director              year  runtime  rating  votes   revenue_millions  metascore
Title
Prometheus  2     Adventure,Mystery,Sci-Fi  Ridley Scott          2012  124      7.0     485820  126.46            65.0
Split       3     Horror,Thriller           M. Night Shyamalan    2016  117      7.3     157606  138.12            62.0
Sing        4     Animation,Comedy,Family   Christophe Lourdelet  2016  108      7.2     60545   270.32            59.0

(The description and actors columns are omitted here for readability.)

One important distinction between using .loc and .iloc to select multiple rows is that .loc includes
the movie Sing in the result, but when using .iloc we're getting rows 1:4, and the movie at index 4
(Suicide Squad) is not included.

Slicing with .iloc follows the same rules as slicing with lists, the object at the index at the end is not
included.

Conditional selections

We’ve gone over how to select columns and rows, but what if we want to make a conditional selection?

For example, what if we want to filter our movies DataFrame to show only films directed by Ridley
Scott or films with a rating greater than or equal to 8.0?

To do that, we take a column from the DataFrame and apply a Boolean condition to it. Here's an
example of a Boolean condition:
condition = (movies_df['director'] == "Ridley Scott")

condition.head()
Out:
Title
Guardians of the Galaxy False
Prometheus True
Split False
Sing False
Suicide Squad False
Name: director, dtype: bool

Similar to isnull(), this returns a Series of True and False values: True for films directed by Ridley
Scott and False for ones not directed by him.

We want to filter out all movies not directed by Ridley Scott, in other words, we don’t want the False
films. To return the rows where that condition is True we have to pass this operation into the
DataFrame:

movies_df[movies_df['director'] == "Ridley Scott"]


Out:
                        rank  genre                     director      year  runtime  rating  votes   revenue_millions  metascore  rating_category
Title
Prometheus              2     Adventure,Mystery,Sci-Fi  Ridley Scott  2012  124      7.0     485820  126.46            65.0       bad
The Martian             103   Adventure,Drama,Sci-Fi    Ridley Scott  2015  144      8.0     556097  228.43            80.0       good
Robin Hood              388   Action,Adventure,Drama    Ridley Scott  2010  140      6.7     221117  105.22            53.0       bad
American Gangster       471   Biography,Crime,Drama     Ridley Scott  2007  157      7.8     337835  130.13            76.0       bad
Exodus: Gods and Kings  517   Action,Adventure,Drama    Ridley Scott  2014  150      6.0     137299  65.01             52.0       bad

(The description and actors columns are omitted here for readability.)
You can get used to looking at these conditionals by reading it like:

Select movies_df where movies_df director equals Ridley Scott.

Let's look at conditional selections using numerical values by filtering the DataFrame by ratings:

movies_df[movies_df['rating'] >= 8.6].head(3)


Out:
                 rank  genre                    director           year  runtime  rating  votes    revenue_millions  metascore
Title
Interstellar     37    Adventure,Drama,Sci-Fi   Christopher Nolan  2014  169      8.6     1047747  187.99            74.0
The Dark Knight  55    Action,Crime,Drama       Christopher Nolan  2008  152      9.0     1791916  533.32            82.0
Inception        81    Action,Adventure,Sci-Fi  Christopher Nolan  2010  148      8.8     1583625  292.57            74.0

(The description and actors columns are omitted here for readability.)

We can make some richer conditionals by using logical operators | for "or" and & for "and".

Let's filter the DataFrame to show only movies by Christopher Nolan OR Ridley Scott:

movies_df[(movies_df['director'] == 'Christopher Nolan') | (movies_df['director'] == 'Ridley Scott')].head()
Out:
                 rank  genre                     director           year  runtime  rating  votes    revenue_millions  metascore
Title
Prometheus       2     Adventure,Mystery,Sci-Fi  Ridley Scott       2012  124      7.0     485820   126.46            65.0
Interstellar     37    Adventure,Drama,Sci-Fi    Christopher Nolan  2014  169      8.6     1047747  187.99            74.0
The Dark Knight  55    Action,Crime,Drama        Christopher Nolan  2008  152      9.0     1791916  533.32            82.0
The Prestige     65    Drama,Mystery,Sci-Fi      Christopher Nolan  2006  130      8.5     913152   53.08             66.0
Inception        81    Action,Adventure,Sci-Fi   Christopher Nolan  2010  148      8.8     1583625  292.57            74.0

(The description and actors columns are omitted here for readability.)

We need to make sure to group evaluations with parentheses so Python knows how to evaluate the
conditional.

Using the isin() method we could make this more concise though:
movies_df[movies_df['director'].isin(['Christopher Nolan', 'Ridley Scott'])].head()
Out:
(same five films as the previous output; description and actors columns omitted here for readability)

Title            rank  genre                     director           year  runtime  rating  votes    revenue_millions  metascore
Prometheus       2     Adventure,Mystery,Sci-Fi  Ridley Scott       2012  124      7.0     485820   126.46            65.0
Interstellar     37    Adventure,Drama,Sci-Fi    Christopher Nolan  2014  169      8.6     1047747  187.99            74.0
The Dark Knight  55    Action,Crime,Drama        Christopher Nolan  2008  152      9.0     1791916  533.32            82.0
The Prestige     65    Drama,Mystery,Sci-Fi      Christopher Nolan  2006  130      8.5     913152   53.08             66.0
Inception        81    Action,Adventure,Sci-Fi   Christopher Nolan  2010  148      8.8     1583625  292.57            74.0

Let's say we want all movies that were released between 2005 and 2010, have a rating above 8.0, but earned below the 25th percentile in revenue.

Here's how we could do all of that:

movies_df[
((movies_df['year'] >= 2005) & (movies_df['year'] <= 2010))
& (movies_df['rating'] > 8.0)
& (movies_df['revenue_millions'] <
movies_df['revenue_millions'].quantile(0.25))
]
Out:
(description and actors columns omitted here for readability)

Title                rank  genre               director                          year  runtime  rating  votes   revenue_millions  metascore
3 Idiots             431   Comedy,Drama        Rajkumar Hirani                   2009  170      8.4     238789  6.52              67.0
The Lives of Others  477   Drama,Thriller      Florian Henckel von Donnersmarck  2006  137      8.5     278103  11.28             89.0
Incendies            714   Drama,Mystery,War   Denis Villeneuve                  2010  131      8.2     92863   6.86              80.0
Taare Zameen Par     992   Drama,Family,Music  Aamir Khan                        2007  165      8.5     102697  1.20              42.0

If you recall, when we used .describe() earlier, the 25th percentile for revenue was about 17.4, and we can access this value directly by using the quantile() method with a float of 0.25.

So here we have only four movies that match those criteria.
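
For example, you can check this cutoff directly (a minimal sketch; the exact value depends on your copy of the dataset):

# 25th percentile of the revenue column, roughly 17.4 for this dataset
revenue_cutoff = movies_df['revenue_millions'].quantile(0.25)
print(revenue_cutoff)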

Applying functions

It is possible to iterate over a DataFrame or Series as you would with a list, but doing so — especially on
large datasets — is very slow.

An efficient alternative is to apply() a function to the dataset. For example, we could use a function to convert movies with a rating of 8.0 or greater to a string value of "good" and the rest to "bad", and use these transformed values to create a new column.

First we would create a function that, when given a rating, determines if it's good or bad:

def rating_function(x):
    if x >= 8.0:
        return "good"
    else:
        return "bad"

Now we want to send the entire rating column through this function, which is what apply() does:

movies_df["rating_category"] = movies_df["rating"].apply(rating_function)

movies_df.head(2)
Out:
(description and actors columns omitted here for readability)

Title                    rank  genre                     director      year  runtime  rating  votes   revenue_millions  metascore  rating_category
Guardians of the Galaxy  1     Action,Adventure,Sci-Fi   James Gunn    2014  121      8.1     757074  333.13            76.0       good
Prometheus               2     Adventure,Mystery,Sci-Fi  Ridley Scott  2012  124      7.0     485820  126.46            65.0       bad

The .apply() method passes every value in the rating column through the rating_function and then
returns a new Series. This Series is then assigned to a new column called rating_category.

You can also use anonymous functions. This lambda function achieves the same result as rating_function:

movies_df["rating_category"] = movies_df["rating"].apply(lambda x: 'good' if x >=


8.0 else 'bad')

movies_df.head(2)
Out:
(description and actors columns omitted here for readability)

Title                    rank  genre                     director      year  runtime  rating  votes   revenue_millions  metascore  rating_category
Guardians of the Galaxy  1     Action,Adventure,Sci-Fi   James Gunn    2014  121      8.1     757074  333.13            76.0       good
Prometheus               2     Adventure,Mystery,Sci-Fi  Ridley Scott  2012  124      7.0     485820  126.46            65.0       bad

Overall, using apply() will be much faster than iterating manually over rows because pandas is
utilizing vectorization.

Vectorization: a style of computer programming where operations are applied to whole arrays instead of
individual elements

A good example of high usage of apply() is during natural language processing (NLP) work. You'll
need to apply all sorts of text cleaning functions to strings to prepare for machine learning.
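
As a small, hypothetical illustration (the reviews Series below is made up and is not a column of movies_df), a text-cleaning step applied with apply() might look like this:

import pandas as pd

# hypothetical review texts, just for illustration
reviews = pd.Series([" Great movie!! ", "TERRIBLE acting...", "Loved it :)"])

def clean_text(text):
    # lowercase, strip surrounding whitespace, keep only letters and spaces
    text = text.lower().strip()
    return ''.join(ch for ch in text if ch.isalpha() or ch == ' ')

cleaned = reviews.apply(clean_text)
print(cleaned)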

Logistic Regression in Machine Learning


• Logistic regression is one of the most popular Machine Learning algorithms, which comes under the
Supervised Learning technique. It is used for predicting the categorical dependent variable using a given
set of independent variables.
• Logistic regression predicts the output of a categorical dependent variable. Therefore, the outcome must be a categorical or discrete value such as Yes or No, 0 or 1, True or False, etc.; however, instead of giving the exact values 0 and 1, it gives probabilistic values which lie between 0 and 1.
• Logistic Regression is quite similar to Linear Regression except in how it is used: Linear Regression is used for solving regression problems, whereas Logistic Regression is used for solving classification problems.
• In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic function, which
predicts two maximum values (0 or 1).
• The curve from the logistic function indicates the likelihood of something such as whether the cells are
cancerous or not, a mouse is obese or not based on its weight, etc.
• Logistic Regression is a significant machine learning algorithm because it has the ability to provide
probabilities and classify new data using continuous and discrete datasets.
• Logistic Regression can be used to classify observations using different types of data and can easily determine the most effective variables for the classification. The below image shows the logistic function:

Note: Logistic regression uses the same predictive-modeling concept as regression, which is why it is called logistic regression; however, because it is used to classify samples, it falls under the classification algorithms.

Logistic Function (Sigmoid Function):


• The sigmoid function is a mathematical function used to map the predicted values to probabilities.
• It maps any real value into another value within a range of 0 and 1.
• The output of logistic regression must lie between 0 and 1 and cannot go beyond this limit, so it forms a curve like the "S" form. This S-shaped curve is called the sigmoid function or the logistic function.
• In logistic regression, we use the concept of a threshold value, which decides between 0 and 1: values above the threshold tend towards 1, and values below the threshold tend towards 0 (see the short sketch below).
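
A minimal sketch of the sigmoid function and a 0.5 threshold (plain NumPy, for illustration only):

import numpy as np

def sigmoid(z):
    # maps any real value into the open interval (0, 1)
    return 1 / (1 + np.exp(-z))

z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
probabilities = sigmoid(z)
predictions = (probabilities >= 0.5).astype(int)  # apply a 0.5 threshold
print(probabilities)
print(predictions)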
Assumptions for Logistic Regression:
• The dependent variable must be categorical in nature.
• The independent variables should not exhibit multicollinearity (they should not be highly correlated with each other).

Logistic Regression Equation:


The Logistic regression equation can be obtained from the Linear Regression equation. The
mathematical steps to get Logistic Regression equations are given below:

• We know the equation of a straight line can be written as:

  y = b0 + b1x1 + b2x2 + ... + bnxn

• In Logistic Regression, y can only be between 0 and 1, so let's consider the ratio y / (1 - y); it is 0 for y = 0 and infinity for y = 1:

  y / (1 - y)

• But we need a range between -infinity and +infinity, so taking the logarithm of this ratio, the equation becomes:

  log[ y / (1 - y) ] = b0 + b1x1 + b2x2 + ... + bnxn

The above equation is the final equation for Logistic Regression.

Type of Logistic Regression:


On the basis of the categories, Logistic Regression can be classified into three types:

• Binomial: In binomial Logistic Regression, there can be only two possible types of the dependent variable, such as 0 or 1, Pass or Fail, etc.
• Multinomial: In multinomial Logistic Regression, there can be 3 or more possible unordered types of the dependent variable, such as "cat", "dog", or "sheep" (see the short sketch below).
• Ordinal: In ordinal Logistic Regression, there can be 3 or more possible ordered types of the dependent variable, such as "low", "medium", or "high".

Python Implementation of Logistic Regression (Binomial)


To understand the implementation of Logistic Regression in Python, we will use the below example:
Example: We are given a dataset that contains information about various users, obtained from social networking sites. A car manufacturer has recently launched a new SUV, and the company wants to check how many users from the dataset want to purchase the car.

For this problem, we will build a Machine Learning model using the Logistic regression algorithm. The
dataset is shown in the below image. In this problem, we will predict the purchased variable
(Dependent Variable) by using age and salary (Independent variables).

Steps in Logistic Regression: To implement the Logistic Regression using Python, we will use the
same steps as we have done in previous topics of Regression. Below are the steps:

• Data Pre-processing step


• Fitting Logistic Regression to the Training set
• Predicting the test result
• Test accuracy of the result (creation of the confusion matrix)
• Visualizing the test set result.

1. Data Pre-processing step: In this step, we will pre-process/prepare the data so that we can use it in
our code efficiently. It will be the same as we have done in Data pre-processing topic. The code for this
is given below:
#Data Pre-processing Step
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

#importing datasets
data_set= pd.read_csv('user_data.csv')

By executing the above lines of code, we will get the dataset as the output. Consider the given image:

Now, we will extract the dependent and independent variables from the given dataset. Below is the code
for it:

#Extracting Independent and dependent Variable
x= data_set.iloc[:, [2,3]].values
y= data_set.iloc[:, 4].values
In the above code, we have taken [2, 3] for x because our independent variables are age and salary,
which are at index 2, 3. And we have taken 4 for y variable because our dependent variable is at index 4.
The output will be:

Now we will split the dataset into a training set and test set. Below is the code for it:

# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)

The output for this is given below:


For test set:

For training set:


In logistic regression, we will do feature scaling because we want accurate prediction results. Here we will only scale the independent variables, because the dependent variable has only 0 and 1 values. Below is the code for it:

#feature Scaling
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)

The scaled output is given below:


2. Fitting Logistic Regression to the Training set:

Our dataset is now well prepared, and we will train the model using the training set. To fit the model to the training set, we will import the LogisticRegression class from the sklearn library.

After importing the class, we will create a classifier object and use it to fit the logistic regression model to the training set. Below is the code for it:

#Fitting Logistic Regression to the training set
from sklearn.linear_model import LogisticRegression
classifier= LogisticRegression(random_state=0)
classifier.fit(x_train, y_train)

Output: By executing the above code, we will get the below output:

Out[5]:

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=0, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)
Hence our model is well fitted to the training set.

3. Predicting the Test Result

Our model is well trained on the training set, so we will now predict the result by using test set data.
Below is the code for it:

#Predicting the test set result
y_pred= classifier.predict(x_test)

In the above code, we have created a y_pred vector to predict the test set result.

Output: By executing the above code, a new vector (y_pred) will be created under the variable explorer
option. It can be seen as:

The above output image shows the corresponding predicted users who want to purchase or not purchase
the car.
4. Test Accuracy of the result

Now we will create the confusion matrix to check the accuracy of the classification. To create it, we need to import the confusion_matrix function from the sklearn library. After importing the function, we will call it and store the result in a new variable cm. The function mainly takes two parameters: y_true (the actual values) and y_pred (the values predicted by the classifier). Below is the code for it:

#Creating the Confusion matrix
from sklearn.metrics import confusion_matrix
cm= confusion_matrix(y_test, y_pred)

Output:

By executing the above code, a new confusion matrix will be created. Consider the below image:

We can find the accuracy of the predicted result by interpreting the confusion matrix. From the above output, we can see that 65 + 24 = 89 predictions are correct and 8 + 3 = 11 predictions are incorrect, i.e., an accuracy of 89/100 = 89% on the test set.
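
We could also compute the accuracy directly (a minimal sketch; accuracy_score simply divides the number of correct predictions by the total, 89/100 = 0.89 here):

from sklearn.metrics import accuracy_score

# fraction of test samples classified correctly
print(accuracy_score(y_test, y_pred))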

Support Vector Machine Algorithm


Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms, which is
used for Classification as well as Regression problems. However, primarily, it is used for Classification
problems in Machine Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-
dimensional space into classes so that we can easily put the new data point in the correct category in the
future. This best decision boundary is called a hyperplane.

SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed the Support Vector Machine. Consider the below diagram in which there are two different categories that are classified using a decision boundary or hyperplane:

Example: SVM can be understood with the example that we used in the KNN classifier. Suppose we see a strange cat that also has some features of dogs; if we want a model that can accurately identify whether it is a cat or a dog, such a model can be created using the SVM algorithm. We will first train our model with lots of images of cats and dogs so that it can learn their different features, and then we test it with this strange creature. The SVM creates a decision boundary between the two classes (cat and dog) by choosing the extreme cases (support vectors) of each, and on the basis of those support vectors it will classify the creature as a cat. Consider the below diagram:
SVM algorithm can be used for Face detection, image classification, text categorization, etc.

Types of SVM
SVM can be of two types:

• Linear SVM: Linear SVM is used for linearly separable data, which means that if a dataset can be classified into two classes by using a single straight line, then such data is termed linearly separable data, and the classifier used is called a Linear SVM classifier.
• Non-linear SVM: Non-linear SVM is used for non-linearly separable data, which means that if a dataset cannot be classified by using a straight line, then such data is termed non-linear data, and the classifier used is called a Non-linear SVM classifier.

Hyperplane and Support Vectors in the SVM algorithm:


Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in n-dimensional
space, but we need to find out the best decision boundary that helps to classify the data points. This best
boundary is known as the hyperplane of SVM.

The dimension of the hyperplane depends on the number of features present in the dataset: if there are 2 features (as shown in the image), then the hyperplane will be a straight line, and if there are 3 features, then the hyperplane will be a two-dimensional plane.

We always create the hyperplane that has the maximum margin, i.e., the maximum distance between the hyperplane and the nearest data points of each class.

Support Vectors:
The data points or vectors that are closest to the hyperplane and which affect the position of the hyperplane are termed support vectors. Since these vectors support the hyperplane, they are called support vectors.

How does SVM work?


Linear SVM:

The working of the SVM algorithm can be understood by using an example. Suppose we have a dataset
that has two tags (green and blue), and the dataset has two features x1 and x2. We want a classifier that
can classify the pair(x1, x2) of coordinates in either green or blue. Consider the below image:

Since this is a 2-d space, we can easily separate these two classes just by using a straight line. But there can be multiple lines that separate these classes. Consider the below image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or region is called a hyperplane. The SVM algorithm finds the closest points of the lines from both classes; these points are called support vectors. The distance between the support vectors and the hyperplane is called the margin, and the goal of SVM is to maximize this margin. The hyperplane with maximum margin is called the optimal hyperplane.
Non-Linear SVM:

If the data is linearly arranged, we can separate it using a straight line, but for non-linear data we cannot draw a single straight line. Consider the below image:
So to separate these data points, we need to add one more dimension. For linear data, we have used the two dimensions x and y, so for non-linear data we will add a third dimension z. It can be calculated as:

z = x^2 + y^2

By adding the third dimension, the sample space will become as below image:
So now, SVM will divide the datasets into classes in the following way. Consider the below image:
Since we are now in 3-d space, the separating boundary looks like a plane parallel to the x-axis. If we convert it back into 2-d space with z = 1, then it becomes:

Hence we get a circle of radius 1 in the case of non-linear data.
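
A minimal NumPy sketch of this idea, lifting 2-D points into 3-D with z = x^2 + y^2 (the sample points are made up for illustration):

import numpy as np

# a few 2-D points: the first two lie near the origin, the last two far from it
points = np.array([[0.2, 0.1], [-0.3, 0.2], [2.0, 1.5], [-1.8, -2.2]])

x1, x2 = points[:, 0], points[:, 1]
z = x1**2 + x2**2              # the new third dimension

lifted = np.column_stack([x1, x2, z])
print(lifted)

# in the lifted space a flat plane such as z = 1 separates the inner points (z < 1)
# from the outer points (z > 1), which corresponds to a circle of radius 1 in 2-D
print(z < 1.0)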

Python Implementation of Support Vector Machine

Now we will implement the SVM algorithm using Python. Here we will use the same dataset user_data,
which we have used in Logistic regression and KNN classification.

• Data Pre-processing step

Till the Data pre-processing step, the code will remain the same. Below is the code:

#Data Pre-processing Step
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

#importing datasets
data_set= pd.read_csv('user_data.csv')

#Extracting Independent and dependent Variable
x= data_set.iloc[:, [2,3]].values
y= data_set.iloc[:, 4].values

# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)

#feature Scaling
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)

After executing the above code, we will pre-process the data. The code will give the dataset as:

The scaled output for the test set will be:


Fitting the SVM classifier to the training set:

Now the training set will be fitted to the SVM classifier. To create the SVM classifier, we will import the SVC class from the sklearn.svm library. Below is the code for it:

from sklearn.svm import SVC # "Support vector classifier"
classifier = SVC(kernel='linear', random_state=0)
classifier.fit(x_train, y_train)

In the above code, we have used kernel='linear', as here we are creating an SVM for linearly separable data; however, we can change the kernel for non-linear data. We then fitted the classifier to the training dataset (x_train, y_train).

Output:

Out[8]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
kernel='linear', max_iter=-1, probability=False, random_state=0,
shrinking=True, tol=0.001, verbose=False)

The model performance can be altered by changing the value of C (the regularization factor), gamma, and the kernel.
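
For non-linearly separable data, a minimal sketch (assuming a reasonably recent scikit-learn, where gamma='scale' is available) would simply swap the kernel:

from sklearn.svm import SVC

# Gaussian (RBF) kernel with explicit regularization and kernel-width settings
classifier_rbf = SVC(kernel='rbf', C=1.0, gamma='scale', random_state=0)
classifier_rbf.fit(x_train, y_train)
print(classifier_rbf.score(x_test, y_test))  # mean accuracy on the test set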
• Predicting the test set result:
Now we will predict the output for the test set. For this, we will create a new vector y_pred. Below is the code for it:

#Predicting the test set result
y_pred= classifier.predict(x_test)

After getting the y_pred vector, we can compare the result of y_pred and y_test to check the difference
between the actual value and predicted value.

Output: Below is the output for the prediction of the test set:

• Creating the confusion matrix:


Now we will check the performance of the SVM classifier, i.e., how many incorrect predictions it makes compared to the Logistic Regression classifier. To create the confusion matrix, we need to import the confusion_matrix function from the sklearn library. After importing the function, we will call it and store the result in a new variable cm. The function mainly takes two parameters: y_true (the actual values) and y_pred (the values predicted by the classifier). Below is the code for it:

#Creating the Confusion matrix
from sklearn.metrics import confusion_matrix
cm= confusion_matrix(y_test, y_pred)

Output:

As we can see in the above output image, there are 66 + 24 = 90 correct predictions and 8 + 2 = 10 incorrect predictions. Therefore we can say that our SVM model improved compared to the Logistic Regression model.

K-Means Clustering Algorithm


K-Means Clustering is an unsupervised learning algorithm that is used to solve the clustering problems in
machine learning or data science. In this topic, we will learn what is K-means clustering algorithm, how the
algorithm works, along with the Python implementation of k-means clustering.

What is K-Means Algorithm?

K-Means Clustering is an Unsupervised Learning algorithm which groups the unlabeled dataset into different clusters. Here K defines the number of pre-defined clusters that need to be created in the process; for example, if K=2 there will be two clusters, if K=3 there will be three clusters, and so on.

It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each data point belongs to only one group of points with similar properties.
It allows us to cluster the data into different groups and is a convenient way to discover the categories of groups in the unlabeled dataset on its own, without the need for any training.

It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this algorithm
is to minimize the sum of distances between the data point and their corresponding clusters.

The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and repeats the process until it finds the best clusters. The value of k should be predetermined in this algorithm.

The k-means clustering algorithm mainly performs two tasks:

• Determines the best value for K center points or centroids by an iterative process.

• Assigns each data point to its closest k-center. Those data points which are near to the particular k-
center, create a cluster.

Hence each cluster has datapoints with some commonalities, and it is away from other clusters.

The below diagram explains the working of the K-means Clustering Algorithm:

How does the K-Means Algorithm Work?

The working of the K-Means algorithm is explained in the below steps:

Step-1: Select the number K to decide the number of clusters.

Step-2: Select K random points or centroids. (They may be points other than those in the input dataset.)

Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.

Step-4: Calculate the variance and place a new centroid for each cluster.

Step-5: Repeat the third step, which means reassigning each data point to the new closest centroid of each cluster.

Step-6: If any reassignment occurred, then go to Step-4; otherwise go to FINISH.

Step-7: The model is ready. (A minimal code sketch of these steps is shown right after this list.)
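
Before walking through the visual plots, here is a minimal NumPy sketch of these steps on a tiny made-up dataset (for illustration only; the actual implementation below uses sklearn's KMeans):

import numpy as np

# tiny made-up 2-D dataset with two obvious groups
X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
              [8.0, 8.0], [8.5, 7.5], [7.8, 8.3]])
K = 2

rng = np.random.default_rng(0)
centroids = X[rng.choice(len(X), size=K, replace=False)]      # Step-2: random initial centroids

for _ in range(10):                                           # repeat Steps 3-6
    # Step-3: assign each point to its closest centroid
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Step-4: recompute each centroid as the mean of the points assigned to it
    new_centroids = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                              else centroids[k] for k in range(K)])
    if np.allclose(new_centroids, centroids):                 # Step-6: stop when nothing moves
        break
    centroids = new_centroids

print(labels)      # cluster assignment of each point
print(centroids)   # final cluster centers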

Let's understand the above steps by considering the visual plots:

Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two variables is given below:

• Let's take number k of clusters, i.e., K=2, to identify the dataset and to put them into different clusters. It
means here we will try to group these datasets into two different clusters.

• We need to choose some random k points or centroid to form the cluster. These points can be either the
points from the dataset or any other point. So, here we are selecting the below two points as k points,
which are not the part of our dataset. Consider the below image:

• Now we will assign each data point of the scatter plot to its closest K-point or centroid. We will compute
it by applying some mathematics that we have studied to calculate the distance between two points. So,
we will draw a median between both the centroids. Consider the below image:

From the above image, it is clear that the points on the left side of the line are nearer to the K1 or blue centroid, and the points to the right of the line are closer to the yellow centroid. Let's color them blue and yellow for clear visualization.
• As we need to find the closest cluster, we will repeat the process by choosing new centroids. To choose the new centroids, we will compute the center of gravity of the points in each cluster and place the new centroids there, as below:
• Next, we will reassign each datapoint to the new centroid. For this, we will repeat the same process of
finding a median line. The median will be like below image:

From the above image, we can see that one yellow point is on the left side of the line and two blue points are to the right of the line. So, these three points will be assigned to new centroids.
As reassignment has taken place, we will again go to Step-4, which is finding new centroids or K-points.
• We will repeat the process by finding the center of gravity of the points in each cluster, so the new centroids will be as shown in the below image:
• As we got the new centroids so again will draw the median line and reassign the data points. So, the
image will be:

• We can see in the above image; there are no dissimilar data points on either side of the line, which
means our model is formed. Consider the below image:
As our model is ready, so we can now remove the assumed centroids, and the two final clusters will be as shown
in the below image:

How to choose the value of "K number of clusters" in K-means Clustering?

The performance of the K-means clustering algorithm depends on the quality of the clusters that it forms. But choosing the optimal number of clusters is a big task. There are several ways to find the optimal number of clusters; here we discuss the most widely used method to find the number of clusters, i.e., the value of K. The method is given below:

Elbow Method

The Elbow method is one of the most popular ways to find the optimal number of clusters. This method uses the
concept of WCSS value. WCSS stands for Within Cluster Sum of Squares, which defines the total variations
within a cluster. The formula to calculate the value of WCSS (for 3 clusters) is given below:

WCSS = Σ(Pi in Cluster1) distance(Pi, C1)² + Σ(Pi in Cluster2) distance(Pi, C2)² + Σ(Pi in Cluster3) distance(Pi, C3)²

In the above formula of WCSS,

Σ(Pi in Cluster1) distance(Pi, C1)²: the sum of the squared distances between each data point in Cluster1 and its centroid C1; the other two terms are analogous for Cluster2 and Cluster3.

To measure the distance between data points and centroid, we can use any method such as Euclidean distance
or Manhattan distance.
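
A minimal sketch of the two distance measures between a data point and a centroid (NumPy, illustrative values only):

import numpy as np

point = np.array([2.0, 3.0])
centroid = np.array([5.0, 7.0])

euclidean = np.sqrt(np.sum((point - centroid) ** 2))  # sqrt(9 + 16) = 5.0
manhattan = np.sum(np.abs(point - centroid))          # 3 + 4 = 7.0
print(euclidean, manhattan)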

To find the optimal value of clusters, the elbow method follows the below steps:

• It executes the K-means clustering on a given dataset for different K values (ranges from 1-10).

• For each value of K, calculates the WCSS value.


• Plots a curve between calculated WCSS values and the number of clusters K.

• The sharp point of bend, where the plot looks like an arm or elbow, is considered the best value of K.

Since the graph shows the sharp bend, which looks like an elbow, hence it is known as the elbow method. The
graph for the elbow method looks like the below image:

Note: We can choose a number of clusters up to the number of data points. If we choose the number of clusters equal to the number of data points, then the value of WCSS becomes zero, and that will be the endpoint of the plot.

Python Implementation of K-means Clustering Algorithm

In the above section, we have discussed the K-means algorithm, now let's see how it can be implemented using
Python.

Before implementation, let's understand what type of problem we will solve here. So, we have a dataset of
Mall_Customers, which is the data of customers who visit the mall and spend there.

In the given dataset, we have Customer_Id, Gender, Age, Annual Income (k$), and Spending Score (a calculated value of how much a customer has spent in the mall; the higher the value, the more they have spent). From this dataset, we need to find some patterns; as it is an unsupervised method, we don't know exactly what to look for.

The steps to be followed for the implementation are given below:

• Data Pre-processing

• Finding the optimal number of clusters using the elbow method


• Training the K-means algorithm on the training dataset

• Visualizing the clusters

Step-1: Data pre-processing Step

The first step will be the data pre-processing, as we did in our earlier topics of Regression and Classification. But
for the clustering problem, it will be different from other models. Let's discuss it:

• Importing Libraries
As we did in previous topics, firstly, we will import the libraries for our model, which is part of data pre-
processing. The code is given below:

# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

In the above code, numpy is imported for performing mathematical calculations, matplotlib for plotting the graph, and pandas for managing the dataset.

• Importing the Dataset:


Next, we will import the dataset that we need to use. Here we are using the Mall_Customers_data.csv dataset. It can be imported using the below code:

# Importing the dataset
dataset = pd.read_csv('Mall_Customers_data.csv')

By executing the above lines of code, we will get our dataset in the Spyder IDE. The dataset looks like the below
image:
From the above dataset, we need to find some patterns in it.

• Extracting Independent Variables

Here we don't need any dependent variable for data pre-processing step as it is a clustering problem, and we
have no idea about what to determine. So we will just add a line of code for the matrix of features.

x = dataset.iloc[:, [3, 4]].values

As we can see, we are extracting only the 3rd and 4th columns (Annual Income and Spending Score). This is because we need a 2-d plot to visualize the model, and some features, such as customer_id, are not required.

Step-2: Finding the optimal number of clusters using the elbow method

In the second step, we will try to find the optimal number of clusters for our clustering problem. So, as discussed
above, here we are going to use the elbow method for this purpose.

As we know, the elbow method uses the WCSS concept to draw the plot by plotting WCSS values on the Y-axis
and the number of clusters on the X-axis. So we are going to calculate the value for WCSS for different k values
ranging from 1 to 10. Below is the code for it:
#finding optimal number of clusters using the elbow method
from sklearn.cluster import KMeans
wcss_list= []  #Initializing the list for the values of WCSS

#Using for loop for iterations from 1 to 10.
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state= 42)
    kmeans.fit(x)
    wcss_list.append(kmeans.inertia_)
mtp.plot(range(1, 11), wcss_list)
mtp.title('The Elbow Method Graph')
mtp.xlabel('Number of clusters(k)')
mtp.ylabel('wcss_list')
mtp.show()

As we can see in the above code, we have used the KMeans class of the sklearn.cluster library to form the clusters.

Next, we have created the wcss_list variable to initialize an empty list, which is used to contain the value of wcss
computed for different values of k ranging from 1 to 10.

After that, we have initialized the for loop to iterate over values of k ranging from 1 to 10; since Python's range() excludes the upper bound, it is written as range(1, 11) so that k = 10 is included.

The rest part of the code is similar as we did in earlier topics, as we have fitted the model on a matrix of features
and then plotted the graph between the number of clusters and WCSS.

Output: After executing the above code, we will get the below output:
From the above plot, we can see the elbow point is at 5. So the number of clusters here will be 5.
Step- 3: Training the K-means algorithm on the training dataset

As we have got the number of clusters, so we can now train the model on the dataset.

To train the model, we will use the same two lines of code as we have used in the above section, but here
instead of using i, we will use 5, as we know there are 5 clusters that need to be formed. The code is given
below:

#training the K-means model on a dataset
kmeans = KMeans(n_clusters=5, init='k-means++', random_state= 42)
y_predict= kmeans.fit_predict(x)

The first line is the same as above, creating the object of the KMeans class.

In the second line of code, we use fit_predict(), which trains the model and returns the cluster label for each data point, stored in y_predict.
By executing the above lines of code, we will get the y_predict variable. We can check it under the variable explorer option in the Spyder IDE, and we can now compare the values of y_predict with our original dataset. Consider the below image:

From the above image, we can see that CustomerID 1 belongs to cluster 3 (as the index starts from 0, the label 2 corresponds to cluster 3), CustomerID 2 belongs to cluster 4, and so on.
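
To make this comparison easier, a small optional step (the column name cluster is our own addition, not part of the original CSV) attaches each label to the DataFrame:

# add the predicted cluster label of each customer to the original DataFrame
dataset['cluster'] = y_predict
print(dataset.head())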

Step-4: Visualizing the Clusters

The last step is to visualize the clusters. As we have 5 clusters for our model, so we will visualize each cluster one
by one.


To visualize the clusters, we will use a scatter plot drawn with the mtp.scatter() function of matplotlib.

#visualizing the clusters
mtp.scatter(x[y_predict == 0, 0], x[y_predict == 0, 1], s = 100, c = 'blue', label = 'Cluster 1')     #for first cluster
mtp.scatter(x[y_predict == 1, 0], x[y_predict == 1, 1], s = 100, c = 'green', label = 'Cluster 2')    #for second cluster
mtp.scatter(x[y_predict == 2, 0], x[y_predict == 2, 1], s = 100, c = 'red', label = 'Cluster 3')      #for third cluster
mtp.scatter(x[y_predict == 3, 0], x[y_predict == 3, 1], s = 100, c = 'cyan', label = 'Cluster 4')     #for fourth cluster
mtp.scatter(x[y_predict == 4, 0], x[y_predict == 4, 1], s = 100, c = 'magenta', label = 'Cluster 5')  #for fifth cluster
mtp.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 300, c = 'yellow', label = 'Centroid')
mtp.title('Clusters of customers')
mtp.xlabel('Annual Income (k$)')
mtp.ylabel('Spending Score (1-100)')
mtp.legend()
mtp.show()

In the above lines of code, we have written one scatter call for each of the five clusters. The first argument of mtp.scatter, i.e., x[y_predict == 0, 0], selects the first feature (Annual Income) of the points assigned to cluster 0, and x[y_predict == 0, 1] selects their second feature (Spending Score); the cluster index compared against y_predict ranges from 0 to 4 across the five calls.

Output:

The output image clearly shows the five different clusters in different colors. The clusters are formed from two attributes of the dataset: Annual Income and Spending Score. We can change the colors and labels as per our requirement or choice. We can also observe some points from the above patterns, which are given below:

• Cluster1 shows the customers with average salary and average spending, so we can categorize these customers as standard.

• Cluster2 shows customers with a high income but low spending, so we can categorize them as careful.

• Cluster3 shows customers with low income and low spending, so they can be categorized as sensible.
• Cluster4 shows customers with low income but very high spending, so they can be categorized as careless.

• Cluster5 shows customers with high income and high spending, so they can be categorized as target customers; these can be the most profitable customers for the mall owner.
