
BABA GHULAM SHAH BADSHAH UNIVERSITY RAJOURI

(J & K) – 185234

INDUSTRIAL TRAINING REPORT


ON
PYTHON WITH DATA SCIENCE & MACHINE LEARNING
Submitted in Partial Fulfilment of the Requirement

For the Award of the Degree of

BACHELOR OF TECHNOLOGY
IN

INFORMATION TECHNOLOGY ENGINEERING

Submitted by

Under the Guidance of

WINNOVATION

(Duration: 10 JUNE to 10 JULY)

DECLARATION

I hereby declare that the project entitled “PYTHON WITH DATA SCIENCE & MACHINE LEARNING”, submitted for the B.Tech. (ITE) degree, is my original work completed under the supervision of “WINNOVATION”. To the best of my knowledge, the project has not formed the basis for the award of any other degree, diploma, fellowship or any other similar title.

Place: Rajouri Name of the Student:

Date: 26-11-2024 HRITIK SHARMA

ABSTRACT

The objective of practical training is to gain hands-on exposure to industry and to become familiar with the working style of a technical professional, so that one can adapt readily to the industrial environment.

This report describes the tools and techniques covered during the training and their general operating principles. Data
science encompasses a set of principles, problem definitions, algorithms, and processes for
extracting nonobvious and useful patterns from large data sets. Many of the elements of data
science have been developed in related fields such as machine learning and data mining. In fact,
the terms data science, machine learning, and data mining are often used interchangeably. The
commonality across these disciplines is a focus on improving decision making through the
analysis of data. However, although data science borrows from these other fields, it is broader in
scope. Machine learning (ML) focuses on the design and evaluation of algorithms for extracting
patterns from data. Data mining generally deals with the analysis of structured data and often
implies an emphasis on commercial applications.

Contents

Chapters Page no’s

1. Introduction to Data Science & Machine Learning ……………..…….............. 7


1.1. What is Data science? ……………………………………...…….………… 7
1.2. The Data Science Lifecycle ………..………………………………………. 7
1.2.1. Data ingestion ………………………………………………………. 7
1.2.2. Data storage and data processing …………………………………… 7
1.2.3. Data analysis …………………………………………………..……. 8
1.2.4. Communicate ……………….………………………………………. 8
1.3. Data science tools ………………………………………………………….. 8
1.3.1. R Studio …………………………………………………………….. 8
1.3.2. Python ………………………………………………………………. 8
1.4. Machine Learning …………..……………………………………………… 9
1.5. How machine learning works ………...……………………………………. 9
1.5.1. A Decision Process …………………………………………………. 9
1.5.2. An Error Function …………………….…………………………….. 9
1.5.3. A Model Optimization Process ………………….………………….. 9
1.6. Machine learning methods …………………..……………………………. 10
1.6.1. Supervised machine learning ……………………..……………….. 10
1.6.2. Unsupervised machine learning …………………...………………. 10
1.6.3. Reinforcement Learning ………………...………………………… 10
1.7. Real-Life machine learning Examples …………….…………………….. 11
1.7.1. Speech Recognition ……………..………………………………… 11
1.7.2. Customer Service ………………………………………………….. 11
1.7.3. Computer Vision ………………..…………………………………. 11
1.7.4. Automated stock trading ………….……………………………….. 11
2. Data Pre-processing & Visualization ……...………………………………..…. 12
2.1. Dealing with Null Values …………………………………………………. 12
2.1.1. Deleting Rows or Columns ………………..………………………. 12
2.1.2. Imputation of Null values ………………….………………………. 12
2.2. Dealing with Categorical Variables ………………...…………………….. 13
2.2.1. Label Encoding …………………………………...……………….. 14
2.2.2. One-Hot Encoding …………………………...……………………. 14
2.3. Standardize Data …………………………..……………………………… 14
2.3.1. Data Analysis …………………………...…………………………. 15
2.4. Data Visualization …………..……………………………………………. 16
2.4.1. Scatter Plot …………………………………..…………………….. 16
2.4.2. Line Plot …………………………………………..………………. 17
2.4.3. Bar Charts …………………………………………………………. 18
3. Feature Engineering ……….…………………………………………………. 19
3.1. Processes of Features Engineering …………………………...…………… 19
3.1.1. Feature Creation .……………………………………………..…… 19
3.1.2. Transformations …………………………………………………… 20
3.1.3. Feature Extraction …………………………………….…………… 20
3.1.4. Exploratory Data Analysis ………………………………..……….. 20
3.1.5. Benchmark …………………………………………………...……. 20
3.2. Importance of Feature Engineering …………….………………………… 21
3.3. Feature Engineering Techniques for Machine Learning ………………….. 21
3.3.1. Imputation ……………………………………...…………………. 22
3.3.2. Handling Outliers …………………………………………………. 22
3.3.3. Log transform …………………………………………………….. 22
3.3.4. Feature Split ……………………………………..………………… 22
3.3.5. One-hot encoding ……………………………………………. 23
4. Machine Learning Algorithm ……….………………………………………… 24
4.1. Machine learning types …………………………………………………… 25
4.1.1. Supervised Machine Learning ………………..…………………… 25
4.1.2. Unsupervised Machine Learning …………………………………… 26
4.1.3. Semi Supervised Machine Learning ………………………………. 26
4.1.4. Reinforcement Machine Learning ………………………………… 26
4.2. Machine Learning Algorithms ……………...…………………………….. 27
4.2.1. Linear Regression …………………………………………………. 27
4.2.2. Logistic Regression ……………….………………………………. 30
4.2.3. K-Means ……………………...…………………………………… 31
4.2.4. k-nearest neighbours (KNN) ………...…………………………….. 32
5. Model Evaluation …………..………………………………………….……… 33
5.1. Model Evaluation Techniques ……...…………………………………….. 33
5.1.1. Holdout ………………………………...………………………….. 34
5.1.2. Cross Validation …………………...……………………………… 34
5.2. Classification metrics ……………..……………………………………… 35
5.2.1. Classification Accuracy …………………..……………………….. 35
5.2.2. Confusion Matrix …………………..……………………………… 35
5.2.3. F-Measure …………………..…………………………………….. 36
5.2.4. Area Under Curve (AUC) ……....…………………………………. 37
5.2.5. Logarithmic Loss ……………..…………………………………… 37
5.2.6. Regression Metrics …………..……………………………………. 37
5.3. Bias vs. Variance ……………………………...………………………….. 38
6. Natural Language Processing using NLTK ……….……………..…………….. 39
6.1. Tokenizing …………………...…………………………………………… 40
6.2. Filtering Stop Words ………………...……………………………………. 41
6.3. Stemming ……………………………..………………………………….. 42
6.4. Tagging Parts of Speech (pos) ……………………………………………. 42
7. Conclusion ………………………………………………………………..…… 44

Chapter 01
INTRODUCTION TO DATA SCIENCE & MACHINE LEARNING

1.1 What is Data science?


Data science combines math and statistics, specialized programming, advanced analytics,
artificial intelligence (AI), and machine learning with specific subject matter expertise to uncover
actionable insights hidden in an organization’s data. These insights can be used to guide decision
making and strategic planning. Data science is the domain of study that deals with vast volumes of
data using modern tools and techniques to find unseen patterns, derive meaningful information, and
make business decisions. Data science uses complex machine learning algorithms to build
predictive models.

The data used for analysis can come from many different sources and be presented in various formats.

1.2 The Data Science Lifecycle

The data science lifecycle involves various roles, tools, and processes, which enable analysts to glean actionable insights. Typically, a data science project undergoes the following stages:

1.2.1 Data ingestion:

The lifecycle begins with data collection: gathering both raw structured and unstructured data
from all relevant sources using a variety of methods. These methods can include manual entry, web
scraping, and real-time streaming data from systems and devices. Data sources can include
structured data, such as customer data, along with unstructured data like log files, video, audio,
pictures, the Internet of Things (IoT), social media, and more.

1.2.2 Data storage and data processing:

Since data can have different formats and structures, companies need to consider different
storage systems based on the type of data that needs to be captured. Data management teams help
to set standards around data storage and structure, which facilitate workflows around analytics,
machine learning and deep learning models. This stage includes cleaning data, deduplication, transforming and combining the data using ETL (extract, transform, and load) jobs or other data integration
technologies. This data preparation is essential for promoting data quality before loading into a
data warehouse, Data Lake, or other repository.

1.2.3 Data analysis:

Here, data scientists conduct an exploratory data analysis to examine biases, patterns,
ranges, and distributions of values within the data. This data analytics exploration drives hypothesis
generation for a/b testing. It also allows analysts to determine the data’s relevance for use within
modelling efforts for predictive analytics, machine learning, and/or deep learning. Depending on a
model’s accuracy, organizations can become reliant on these insights for business decision making,
allowing them to drive more scalability.

1.2.4 Communicate:

Finally, insights are presented as reports and other data visualizations that make the insights
and their impact on business easier for business analysts and other decision-makers to understand.
A data science programming language such as R or Python includes components for generating visualizations; alternatively, data scientists can use dedicated visualization tools.

1.3 Data science tools

Data scientists rely on popular programming languages to conduct exploratory data analysis
and statistical regression. These open source tools support pre-built statistical modelling, machine
learning, and graphics capabilities. These languages include the following:

1.3.1 R Studio:

An open source programming language and environment for statistical computing and graphics.

1.3.2 Python:

It is a dynamic and flexible programming language. Python includes numerous libraries, such as NumPy, Pandas and Matplotlib, for analysing data quickly.

1.4 Machine Learning
Machine learning is a branch of artificial intelligence (AI) and computer science which
focuses on the use of data and algorithms to imitate the way that humans learn, gradually
improving its accuracy.

Machine learning is an important component of the growing field of data science. Through
the use of statistical methods, algorithms are trained to make classifications or predictions,
uncovering key insights within data mining projects. These insights subsequently drive decision
making within applications and businesses, ideally impacting key growth metrics. As big data
continues to expand and grow, the market demand for data scientists will increase, requiring them
to assist in the identification of the most relevant business questions and subsequently the data to
answer them.

1.5 How machine learning works

UC Berkeley breaks the learning system of a machine learning algorithm into three main parts.

1.5.1 A Decision Process:

In general, machine learning algorithms are used to make a prediction or classification.


Based on some input data, which can be labelled or unlabelled, your algorithm will produce an
estimate about a pattern in the data.

1.5.2 An Error Function:

An error function serves to evaluate the prediction of the model. If there are known
examples, an error function can make a comparison to assess the accuracy of the model.

1.5.3 A Model Optimization Process:

If the model can fit better to the data points in the training set, then weights are adjusted to
reduce the discrepancy between the known example and the model estimate. The algorithm will
repeat this evaluate and optimize process, updating weights autonomously until a threshold of
accuracy has been met.

1.6 Machine learning methods
Machine learning classifiers fall into three primary categories.

1.6.1 Supervised machine learning

Supervised learning, also known as supervised machine learning, is defined by its use of labelled datasets to train algorithms to classify data or predict outcomes accurately. As input
data is fed into the model, it adjusts its weights until the model has been fitted appropriately. This
occurs as part of the cross validation process to ensure that the model avoids overfitting or
underfitting. Supervised learning helps organizations solve for a variety of real-world problems at
scale, such as classifying spam in a separate folder from your inbox. Some methods used in
supervised learning include neural networks, naïve bayes, linear regression, logistic regression,
random forest, support vector machine (SVM), and more.

1.6.2 Unsupervised machine learning

Unsupervised learning, also known as unsupervised machine learning, uses machine


learning algorithms to analyse and cluster unlabelled datasets. These algorithms discover hidden
patterns or data groupings without the need for human intervention. Its ability to discover similarities and differences in information makes it the ideal solution for exploratory data analysis,
cross-selling strategies, customer segmentation, image and pattern recognition. It’s also used to
reduce the number of features in a model through the process of dimensionality reduction; principal
component analysis (PCA) and singular value decomposition (SVD) are two common approaches
for this. Other algorithms used in unsupervised learning include neural networks, k-means
clustering, probabilistic clustering methods, and more.

1.6.3 Reinforcement Learning

Reinforcement learning is a machine learning training method based on rewarding desired


behaviours and/or punishing undesired ones. In general, a reinforcement learning agent is able to
perceive and interpret its environment, take actions and learn through trial and error.

1.7 Real-Life machine learning Examples

1.7.1 Speech Recognition:

It is also known as automatic speech recognition (ASR), computer speech recognition, or


speech-to-text, and it is a capability which uses natural language processing (NLP) to process
human speech into a written format. Many mobile devices incorporate speech recognition into their
systems to conduct voice search (e.g., Siri) or provide more accessibility around texting.

1.7.2 Customer Service:

Online chatbots are replacing human agents along the customer journey. They answer frequently asked questions (FAQs) around topics like shipping, or provide personalized advice, changing the way we think about customer engagement across websites and social media platforms. Examples include messaging bots on e-commerce sites with virtual agents, messaging apps such as Slack and Facebook Messenger, and tasks usually done by virtual assistants and voice assistants.

1.7.3 Computer Vision:

This AI technology enables computers and systems to derive meaningful information from
digital images, videos and other visual inputs, and based on those inputs, it can take action. This
ability to provide recommendations distinguishes it from image recognition tasks. Powered by
convolutional neural networks, computer vision has applications within photo tagging in social
media, radiology imaging in healthcare, and self-driving cars within the automotive industry.

1.7.4 Automated stock trading:

Designed to optimize stock portfolios, AI-driven high-frequency trading platforms make


thousands or even millions of trades per day without human intervention.

Chapter 02

DATA PREPROCESSING AND VISUALIZATION


The meaning of machine learning is learning from data. Data plays a major role in building an accurate, efficient machine learning model: the better the data, the better the model will be. To make our data better, we need to perform data pre-processing, followed by data analysis and then visualization of the data.

The real-world data mostly contains noise, has null values, and might not be in a suitable
format. So, we can't train our model with this data. To deal with this problem, we need to pre-
process data.

Data pre-processing is a technique to prepare the raw data such that a Machine learning
model can learn from it.
The following data pre-processing techniques make our data fit for training the model.

2.1 Dealing with Null Values:

It is the first phase of data pre-processing. The real-world data contains null values. No
machine learning algorithm can handle these null values on its own. So, we have to deal with null
values before training the model.

There are two ways to deal with null values:


2.1.1 Deleting Rows or Columns
2.1.2 Imputation of Null values

2.1.1 Deleting Rows or Columns


We can delete those rows or columns that contain null values. But this approach is not
preferred because we will lose information. For example, suppose our dataset has 5k rows and 10
columns, out of which 2k rows contain null values for a few columns (1, 2, or 3). So, if we delete
these rows, we will lose 40% of our data just because those rows have few missing values; this
method is not efficient and it is advised to use this method only when a few rows/columns contain
missing values.

2.1.2 Imputation of Null values:

This method is used for those rows or columns that contain numeric data like age, salary,
etc. We can calculate the mean, mode, or median of the rows or columns that contain null values and replace the null values with the mean, mode, or median. In this method, we don't lose any
information and it gives better results than the previous method.

Look at the demonstration below to see how to perform imputation in Python using the NumPy and scikit-learn modules.

Fig. 2.1

Note: In Python, NaN (nan) represents a null/missing value.
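Since Fig. 2.1 may not be legible here, the following is a minimal sketch of the same idea, assuming an invented column of ages; it uses scikit-learn's SimpleImputer to replace nan entries with the column mean.

import numpy as np
from sklearn.impute import SimpleImputer

# invented example column of ages with missing (nan) entries
ages = np.array([[25.0], [np.nan], [31.0], [np.nan], [40.0]])

# replace every nan with the mean of the non-missing values
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
ages_imputed = imputer.fit_transform(ages)

print(ages_imputed)   # nan entries become 32.0, the mean of 25, 31 and 40

The strategy argument can be changed to "median" or "most_frequent" to impute with the median or mode instead.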

2.2 Dealing with Categorical Variables:

Categorical variables are discrete variables for example - gender, religion, address, etc.
Machine learning algorithms can only deal with continuous/numeric variables like age, salary, etc.
So, we have to encode the categorical variables before training the model. It is a crucial step in data
pre-processing.

There are two types of categorical variables.

a) Ordinal
b) Non-Ordinal

Ordinal categorical variables are those which can be sorted in an order. Example - Size of bags.
L-M-S (descending) and S-M-L (ascending). On the other hand, non-ordinal categorical variables
can’t be sorted. Example – Colour of bags.
We have different techniques to deal with both ordinal and non-ordinal categorical variables.
For ordinal, we use Label encoding, and for non-ordinal, we use One-Hot encoding.

2.2.1 Label Encoding
This technique converts categorical data into numeric form so that it becomes machine-readable.
Suppose we have 3 sizes of bags, i.e., small, medium, and large; after applying Label encoding, they are labelled as 0, 1, and 2.

2.2.2 One-Hot Encoding


It generates new columns, specifying each possible value from the parent dataset. It is the
most effective way to encode non-ordinal categorical variables so that machine learning algorithms
can work on them.

Take a look at the picture below to understand the difference between Label encoding and One-Hot
encoding.

Fig. 2.2

Here, in the above picture, we can see that in Label Encoding, the features are labelled as
integers. On the other hand, One-Hot Encoding labels the features by creating multiple columns
equal to the number of features and then encodes them.
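As a rough sketch of the difference, the snippet below encodes an invented ordinal size column with an explicit integer mapping and a non-ordinal colour column with pandas get_dummies (scikit-learn's LabelEncoder and OneHotEncoder can be used in much the same way).

import pandas as pd

# invented example with one ordinal and one non-ordinal categorical column
df = pd.DataFrame({
    "size":   ["small", "medium", "large", "medium"],
    "colour": ["red", "blue", "green", "blue"],
})

# Label encoding for the ordinal column (an explicit mapping keeps the order S < M < L)
size_order = {"small": 0, "medium": 1, "large": 2}
df["size_encoded"] = df["size"].map(size_order)

# One-hot encoding for the non-ordinal column: one new 0/1 column per colour value
df = pd.get_dummies(df, columns=["colour"])

print(df)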

2.3 Standardize Data


Standardization is the last stage in data pre-processing. Generally, the features in our data
are not scaled and there is variance in our data which can affect the prediction made by our model.
For example, let us say we have a housing dataset of a particular area, and we have to predict the
price of houses. The most important feature is house size, and if the size of most of the houses is
less than 300 sq. ft but few houses have a size more than 1000 sq. ft., then it will impact the
predictions as the data is not scaled. To get rid of this problem, we use Standardization.

It is a technique in which values are centred around the mean and have unit standard deviation, i.e., the mean of the values is zero and the standard deviation is one. Look at the demonstration below to understand how to use it in Python.
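Since the demonstration figure may not reproduce here, a minimal sketch using scikit-learn's StandardScaler is shown below; the house sizes are made-up values.

import numpy as np
from sklearn.preprocessing import StandardScaler

# made-up house sizes in sq. ft (unscaled, with one much larger value)
sizes = np.array([[250.0], [280.0], [300.0], [290.0], [1200.0]])

# centre the values around the mean and scale them to unit standard deviation
scaler = StandardScaler()
sizes_scaled = scaler.fit_transform(sizes)

print(sizes_scaled.mean(), sizes_scaled.std())   # approximately 0.0 and 1.0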

2.3.1 Data Analysis


It is a technique to gain important information from data by manipulating, transforming, and
visualizing the data. The goal is to find patterns in data.

Exploratory Data Analysis (EDA) is a method of analyzing data to outline principal characteristics, often using graphs and other visualization techniques.

Steps to perform EDA:

a) Dataset Features
b) Variable Identification
c) Data Type Identification
d) Numeric Variable Statistical Summary
e) Non-Graphic Analysis
f) Graphical Analysis

a) Dataset Features
Start with understanding your dataset i.e., size of the dataset, number of rows and columns. Use the
following code for knowing your dataset features.

dataset_name.shape

b) Variable Identification
It is the most important step in EDA. There are two types of variables- Numerical and
Categorical. Identify the type of variables and store their names in different lists. It is a
manual process.

c) Datatype identification
Once you have categorized the variables, the next step is to identify their data type. You can
use the following python code for this step.

dataset_name.dtypes
d) Numeric Variable Statistical Summary
To describe the statistical features like count, mean, min, percentile, etc. of your dataset use
the following Pandas function.

dataset_name.describe()

e) Non-Graphic Analysis

Use Pandas methods to analyse your data.

f) Graphic Analysis

The last step of EDA is graphic analysis i.e., visualizing your data.
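Putting the steps above together, here is a minimal pandas sketch; the file name housing.csv and the city column are hypothetical placeholders.

import pandas as pd

# hypothetical dataset; replace the file name with your own
df = pd.read_csv("housing.csv")

print(df.shape)                    # (a) dataset features: number of rows and columns
print(df.dtypes)                   # (c) data type of each variable
print(df.describe())               # (d) statistical summary of the numeric variables
print(df["city"].value_counts())   # (e) non-graphic analysis of a categorical column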

2.4 Data Visualization


It is a method to present data in graphical format. It makes data easily understandable, as the data is in summarized form. Even a large amount of data can be understood just by looking at a graph or plot. In Python, we mostly use the Matplotlib library for data visualization.

Some of the basic plotting techniques are described below.

2.4.1 Scatter Plot


It is a mathematical diagram to plot the values of two variables. Take a look at the image
below.

Fig 2.3

2.4.2 Line Plot


It shows information as a series of data points connected by a straight line. Take a look at the
image below.

Fig. 2.4

2.4.3 Bar Charts

It represents categorical data with the help of rectangular bars having lengths equal to the
values it speaks for. Take a look at the image below.

Fig. 2.5
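The three plot types above can be produced with a few lines of Matplotlib; the numbers below are invented purely for illustration.

import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

axes[0].scatter(x, y)                      # scatter plot of two variables
axes[0].set_title("Scatter Plot")

axes[1].plot(x, y)                         # line plot: points joined by lines
axes[1].set_title("Line Plot")

axes[2].bar(["A", "B", "C"], [5, 3, 7])    # bar chart for categorical data
axes[2].set_title("Bar Chart")

plt.tight_layout()
plt.show()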

Chapter 03

FEATURE ENGINEERING
Feature engineering is a machine learning technique that leverages data to create new
variables that aren’t in the training set. It can produce new features for both supervised and
unsupervised learning, with the goal of simplifying and speeding up data transformations while
also enhancing model accuracy. Feature engineering is required when working with machine
learning models. Regardless of the data or architecture, a terrible feature will have a direct impact
on your model.

Feature engineering is the pre-processing step of machine learning which extracts features from raw data. It helps to represent the underlying problem to predictive models in a better way, which, as a result, improves the accuracy of the model on unseen data. The predictive model contains predictor variables and an outcome variable, and the feature engineering process selects the most useful predictor variables for the model.

Fig. 3.1

Since 2016, automated feature engineering has also been used in different machine learning software packages that help in automatically extracting features from raw data.

3.1 Processes of Features Engineering

3.1.1 Feature Creation:

Creating features involves creating new variables that will be most helpful for our model. This can mean adding or removing some features. For example, deriving a cost-per-square-foot column from the price and area columns of a housing dataset would be feature creation.

3.1.2 Transformations:

Feature transformation is simply a function that transforms features from one representation to another. The goal here is to plot and visualise the data; if something is not adding up with the new features, we can reduce the number of features used, speed up training, or increase the accuracy of a certain model.

3.1.3 Feature Extraction:

Feature extraction is the process of extracting features from a data set to identify useful
information. Without distorting the original relationships or significant information, this
compresses the amount of data into manageable quantities for algorithms to process.

3.1.4 Exploratory Data Analysis:

Exploratory data analysis (EDA) is a powerful and simple tool that can be used to improve your
understanding of your data, by exploring its properties. The technique is often applied when the
goal is to create new hypotheses or find patterns in the data. It’s often used on large amounts of
qualitative or quantitative data that haven’t been analyzed before.

3.1.5 Benchmark:

A Benchmark Model is the most user-friendly, dependable, transparent, and interpretable


model against which you can measure your own. It’s a good idea to run test datasets to see if your new machine learning model outperforms a recognised benchmark. These benchmarks are often used as measures for comparing the performance of different machine learning models, such as neural networks and support vector machines, linear and non-linear classifiers, or different approaches like bagging and boosting. Now, let’s have a look at why we need feature engineering in machine learning.

3.2 Importance of Feature Engineering
Feature Engineering is a very important step in machine learning. Feature engineering refers
to the process of designing artificial features into an algorithm. These artificial features are then
used by that algorithm in order to improve its performance, or in other words reap better results.
Data scientists spend most of their time with data, and it becomes important to make models
accurate.

Fig. 3.2

When feature engineering activities are done correctly, the resulting dataset is optimal and contains
all of the important factors that affect the business problem. As a result of these datasets, the most
accurate predictive models and the most useful insights are produced.

3.3 Feature Engineering Techniques for Machine Learning

Some of the popular feature engineering techniques include:

3.3.1 Imputation
When it comes to preparing your data for machine learning, missing values are one of the
most typical issues. Human errors, data flow interruptions, privacy concerns, and other factors
could all contribute to missing values. Whatever the cause, missing values have an impact on the performance of machine learning models. The main goal of imputation is to handle these missing values.
There are two types of imputation

a) Numerical Imputation
b) Categorical Imputation

3.3.2 Handling Outliers


Outlier handling is a technique for removing outliers from a dataset. This method can be
used on a variety of scales to produce a more accurate data representation. This has an impact on
the model’s performance. Depending on the model, the effect could be large or minimal; for
example, linear regression is particularly susceptible to outliers. This procedure should be
completed prior to model training. The various methods of handling outliers include:

a) Removal
b) Replacing values
c) Capping
d) Discretization

3.3.3 Log transform

Log Transform is one of the most used techniques among data scientists. It’s mostly used to turn a skewed distribution into a normal or less-skewed distribution. In this transform, we take the log of the values in a column and use those values as the column. It is used to handle skewed data, and after the transformation the distribution becomes closer to normal.
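A minimal sketch of the log transform, assuming an invented, highly skewed salary column; np.log1p (log of 1 + x) is used so that zero values do not cause errors.

import numpy as np
import pandas as pd

# invented skewed column: most salaries are small, one is very large
df = pd.DataFrame({"salary": [20000, 25000, 30000, 45000, 900000]})

# log(1 + x) compresses the large values and reduces the skew
df["salary_log"] = np.log1p(df["salary"])

print(df)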

3.3.4 Feature Split


As the name suggests, feature split is the process of splitting a feature into two or more parts in order to make new features. This technique helps the algorithms to better understand and learn the patterns in the dataset. The feature splitting process enables the new features to be clustered and binned, which results in extracting useful information and improving the performance of the data models.
3.3.5 One-hot encoding

A one-hot encoding is a type of encoding in which an element of a finite set is represented by a binary vector whose length equals the size of the set: the position corresponding to that element is set to “1” and all other positions are set to “0”. In contrast to binary encoding schemes, where each bit can represent two values (i.e. 0 and 1) and combinations of bits encode different cases, this scheme assigns a dedicated position to each possible case.

Chapter 04

Machine Learning Algorithm

According to Arthur Samuel (1959), ML is the field of study that gives computers the ability to learn without being explicitly programmed. Thus, we can define ML as the field of computer science in which machines can be designed that can program themselves. Learning here simply means learning from experience or from previous observations, such as examples or instruction, in order to look for patterns in data so that, with the help of the examples provided, the system can make better decisions. The basic aim of ML is to make computers learn automatically with no human intervention and to adjust their actions accordingly.

Fig. 4.1

The figure shows that in the machine learning process, past data are used to train the model, and this trained model is then used on new data to make predictions. The trained ML model's performance is evaluated using some portion of the available past data (which is not present during training). This is usually referred to as the validation process. In this process, the ML model is evaluated on a performance measure, such as accuracy. Accuracy describes the ML model's performance over unseen data in terms of the ratio of the number of correctly predicted samples to the total number of samples to be predicted.

Fig. 4.2

4.1 Machine learning types


Machine learning is of various types; some of them are:

4.1.1 Supervised Machine Learning


4.1.2 Unsupervised Machine Learning
4.1.3 Semi Supervised Machine Learning
4.1.4 Reinforcement Machine Learning

4.1.1 Supervised Machine Learning

Supervised ML algorithms are a type of ML technique in which what was previously learned from labelled data is applied to new data in order to predict future events or labels. In this type of learning, a supervisor (the labels) is present to guide or correct the model. The algorithm first analyses a known training set and then predicts the output values using what it has learned. The output produced by the learning system can be compared with the actual output; if errors are identified, they can be rectified and the model can be modified accordingly.

Fig. 4.3
4.1.2 Unsupervised Machine Learning

Unsupervised ML algorithms: In this type, there is no supervisor to guide or correct. This


type of learning algorithm is used when unlabelled or unclassified information is present to train
the system. The system does not define the correct output, but it explores the data in such a way
that it can draw inferences (rules) from datasets and can describe hidden structures from unlabelled
data.

Fig. 4.4

4.1.3 Semi Supervised Machine Learning

Semi Supervised ML algorithms are algorithms that are between the category of supervised
and unsupervised learning. Thus, this type of learning algorithm uses both unlabelled and labelled
data for training purposes, generally a small amount of labelled data and a large amount of
unlabelled data. This type of method is used to improve the accuracy of learning.

4.1.4 Reinforcement Machine Learning

Reinforcement ML algorithms are a type of learning method that gives rewards or punishments on the basis of the work performed by the system. If we train the system to perform a certain task and it fails to do that, the system might be punished; if it performs perfectly, it will be rewarded. It typically works on 0 and 1, in which 0 indicates a punishment and 1 indicates a reward.
It works on the principle in which, if we train a bird or a dog to do some task and it does
exactly as we want, we give it a treat or the food it likes, or we might praise it. This is a reward. If
it did not perform the task properly, it might be scolded as a punishment by us.

4.2 Machine Learning Algorithms
Here is the list of commonly used machine learning algorithms. These algorithms can be applied to
almost any data problem:

4.2.1 Linear Regression


4.2.2 Logistic Regression
4.2.3 K-Means
4.2.4 k-nearest neighbours (KNN)
4.2.5 Decision Tree
4.2.6 SVM
4.2.7 Naive Bayes
4.2.8 Random Forest
4.2.9 Dimensionality Reduction Algorithms

4.2.1 Linear Regression

It is used to estimate real values (cost of houses, number of calls, total sales etc.) based on
continuous variable(s). Here, we establish the relationship between independent and dependent
variables by fitting the best line. This best fit line is known as the regression line and is represented
by a linear equation Y= a *X + b.

The best way to understand linear regression is to relive an experience from childhood. Let us
say, you ask a child in fifth grade to arrange people in his class by increasing the order of weight,
without asking them their weights! What do you think the child will do? He/she would likely look
(visually analyze) at the height and build of people and arrange them using a combination of these
visible parameters. This is linear regression in real life! The child has actually figured out that
height and build would be correlated to weight by a relationship, which looks like the equation
above.

In this equation:

• Y – Dependent Variable
• a – Slope
• X – Independent variable
• b – Intercept

These coefficients a and b are derived based on minimizing the sum of the squared
difference of distance between data points and the regression line.

Look at the below example. Here we have identified the best fit line having linear equation
y=0.2811x+13.9. Now using this equation, we can find the weight, knowing the height of a person.

Fig. 4.5

Linear Regression is mainly of two types:

• Simple Linear Regression

• Multiple Linear Regression.

Simple Linear Regression is characterized by one independent variable, while Multiple Linear Regression (as the name suggests) is characterized by multiple (more than one) independent variables. While finding the best fit line, you can also fit a polynomial or curved line, and this is known as polynomial or curvilinear regression.

Example of the code is:

import numpy as np
import matplotlib.pyplot as plt

def estimate_coef(x, y):
    # number of observations/points
    n = np.size(x)
    # mean of x and y vector
    m_x = np.mean(x)
    m_y = np.mean(y)
    # calculating cross-deviation and deviation about x
    SS_xy = np.sum(y*x) - n*m_y*m_x
    SS_xx = np.sum(x*x) - n*m_x*m_x
    # calculating regression coefficients
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1*m_x
    return (b_0, b_1)

def plot_regression_line(x, y, b):
    # plotting the actual points as scatter plot
    plt.scatter(x, y, color="m", marker="o", s=30)
    # predicted response vector
    y_pred = b[0] + b[1]*x
    # plotting the regression line
    plt.plot(x, y_pred, color="g")
    # putting labels
    plt.xlabel('x')
    plt.ylabel('y')
    # function to show plot
    plt.show()

def main():
    # observations / data
    x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])
    # estimating coefficients
    b = estimate_coef(x, y)
    print("Estimated coefficients:\nb_0 = {}\nb_1 = {}".format(b[0], b[1]))
    # plotting regression line
    plot_regression_line(x, y, b)

if __name__ == "__main__":
    main()

4.2.2 Logistic Regression


Don’t get confused by its name! It is a classification, not a regression, algorithm. It is used to estimate discrete values (binary values like 0/1, yes/no, true/false) based on a given set of independent variable(s). In simple words, it predicts the probability of occurrence of an event by fitting data to a logit function. Hence, it is also known as logit regression. Since it predicts a probability, its output values lie between 0 and 1 (as expected). Again, let us try and understand this through a simple example.

Let’s say your friend gives you a puzzle to solve. There are only 2 outcome scenarios: either you solve it or you don’t. Now imagine that you are being given a wide range of puzzles/quizzes in an attempt to understand which subjects you are good at. The outcome of this study would be something like this: if you are given a trigonometry-based tenth-grade problem, you are 70% likely to solve it. On the other hand, if it is a grade-five history question, the probability of getting an answer is only 30%. This is what Logistic Regression provides you.

Coming to the math, the log odds of the outcome are modeled as a linear combination of the predictor variables.

odds = p / (1-p) = probability of event occurrence / probability of the event not occurring

ln(odds) = ln(p / (1-p))

logit(p) = ln(p / (1-p)) = b0 + b1X1 + b2X2 + b3X3 + ... + bkXk

Above, p is the probability of the presence of the characteristic of interest. Logistic regression chooses the parameters that maximize the likelihood of observing the sample values, rather than those that minimize the sum of squared errors (as in ordinary regression).

Fig. 4.6
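As a rough illustration, the sketch below fits scikit-learn's LogisticRegression to an invented puzzle-solving dataset (hours of practice versus solved or not) and predicts the probability of success.

import numpy as np
from sklearn.linear_model import LogisticRegression

# invented data: hours of practice vs. whether the puzzle was solved (1) or not (0)
hours = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
solved = np.array([0, 0, 0, 1, 0, 1, 1, 1])

model = LogisticRegression()
model.fit(hours, solved)

# predicted probability of solving the puzzle after 4.5 hours of practice
print(model.predict_proba([[4.5]])[0][1])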

4.2.3 K-Means
It is a type of unsupervised algorithm which solves the clustering problem. Its procedure follows a simple and easy way to classify a given data set into a certain number of clusters (assume k clusters). Data points inside a cluster are homogeneous, and heterogeneous with respect to other clusters.

Remember figuring out shapes from ink blots? k means is somewhat similar to this activity.
You look at the shape and spread to decipher how many different clusters/populations are present!

How K-means forms cluster:

• K-means picks k points, known as centroids, one for each cluster.

• Each data point forms a cluster with the closest centroids i.e. k clusters.

• Finds the centroid of each cluster based on existing cluster members. Here we have new
centroids.
• As we have new centroids, repeat steps 2 and 3. Find the closest distance for each data point
from new centroids and get associated with new k-clusters. Repeat this process until
convergence occurs i.e. centroids do not change.

How to determine the value of K:

In K-means, we have clusters and each cluster has its own centroid. The sum of the squared differences between the centroid and the data points within a cluster constitutes the sum of squares for that cluster. When the sums of squares for all the clusters are added, the result is the total within-cluster sum of squares for the cluster solution.

We know that as the number of clusters increases, this value keeps on decreasing, but if you plot the result you may see that the sum of squared distances decreases sharply up to some value of k, and then much more slowly after that. Here, at this “elbow”, we can find the optimum number of clusters.

Fig. 4.7
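The sketch below shows this with scikit-learn's KMeans on a handful of invented 2-D points: the inertia_ attribute is the total within-cluster sum of squares, and printing it for several values of k reveals the elbow described above.

import numpy as np
from sklearn.cluster import KMeans

# invented 2-D points forming two rough groups
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# total within-cluster sum of squares for different k; the "elbow" suggests a good k
for k in range(1, 5):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)

# fit the chosen model and inspect the cluster assignments and centroids
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)
print(km.cluster_centers_)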

4.2.4 k-nearest neighbours (KNN)

The k-nearest neighbour’s algorithm, also known as KNN or k-NN, is a non-parametric,


supervised learning classifier, which uses proximity to make classifications or predictions about the
grouping of an individual data point. While it can be used for either regression or classification
problems, it is typically used as a classification algorithm, working off the assumption that similar
points can be found near one another.

Regression problems use a similar concept to classification problems, but in this case the average of the k nearest neighbours is taken to make a prediction. The main distinction here is that classification is used for discrete values, whereas regression is used with continuous ones. However, before a classification can be made, the distance between points must be defined.
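A minimal sketch of KNN classification with scikit-learn, using the built-in Iris dataset as stand-in data.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# classify each test point by the majority class of its 5 nearest neighbours
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

print(knn.score(X_test, y_test))   # classification accuracy on unseen data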

Chapter 05
MODEL EVALUATION
Model Evaluation is the process through which we quantify the quality of a system’s
predictions. To do this, we measure the newly trained model performance on a new and
independent dataset. This model will compare labelled data with its own predictions. There are a
large number of machine learning algorithms out there but not all of them apply to a given
problem. We need to choose among those algorithms the one that best suits our problem and gives us the desired results. This is where the role of Model Evaluation comes in. It defines metrics to
evaluate our models, and then, based on that evaluation, we choose one or more models to use. This evaluation helps us to know which algorithm best suits the given dataset for solving a particular problem; in machine learning terms, this is called the “best fit”. It compares the performance of different machine learning models on the same input dataset, focusing on the accuracy of the model in predicting the end outcomes.

Out of all the different algorithms we use at this stage, we choose the algorithm that gives the highest accuracy for the input data, and it is considered the best model as it best predicts the outcome. Accuracy is considered the main factor when we work on solving different problems using machine learning. If the accuracy is high, the model's predictions on the given data are also true to the maximum possible extent.

There are several stages in solving an ML problem like collection of dataset, defining the
problem, brainstorming on the given data, pre-processing, transformation, training the model and
evaluating. Even though there are several stages, the stage of Evaluation of a ML model is the most
crucial stage, because it gives us an idea of the accuracy of model prediction. The performance and
usage of the ML model is decided in terms of accuracy measures at the end.

Fig. 5.1

5.1 Model Evaluation Techniques:


Model evaluation is an integral part of machine learning. Initially, the dataset is divided into two parts, the “training dataset” and the “test dataset”. We build the machine learning model using the training dataset to see the functionality of the model, but we evaluate the designed model using the test dataset, which consists of unseen or unknown samples of data that are not used for training purposes. Evaluation of a model tells us how accurate the results are. If we used the training dataset for evaluation, the model would always show correct predictions with high accuracy measures for any instance of the training data, even if it were not adequately effective on new data.

There are two methods that are used to evaluate a model performance. They are

5.1.1 Holdout
5.1.2 Cross Validation

Fig. 5.2

5.1.1 Holdout
The Holdout method is used to evaluate the model performance and uses two types of data
for testing and training. The test data is used to calculate the performance of the model whereas it is
trained using the training data set. This method is used to check how well the machine learning
model developed using different algorithm techniques performs on unseen samples of data. This
approach is simple, flexible and fast.
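A minimal holdout sketch using scikit-learn's train_test_split, with the built-in Iris dataset standing in for real data.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# hold out 20% of the data for testing and train on the remaining 80%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(model.score(X_test, y_test))   # accuracy on the held-out, unseen data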

5.1.2 Cross Validation

Cross-validation is a procedure of dividing the whole dataset into samples and then evaluating the machine learning model on the held-out samples to estimate the accuracy of the model; i.e., we train the model using one subset of the data and we evaluate it with the complementary subset. We can calculate cross validation based on the following 3 methods, namely
• Validation
• Leave one out cross validation (LOOCV)
• K-Fold Cross Validation
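As a sketch of the K-Fold method listed above, cross_val_score from scikit-learn trains and evaluates the model once per fold; the Iris dataset is used as stand-in data.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold cross validation: train on 4 folds, evaluate on the remaining fold, repeat 5 times
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

print(scores)          # accuracy of each fold
print(scores.mean())   # average accuracy across folds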

5.2 Classification metrics:


In order to evaluate the performance of a Machine Learning model, there are some Metrics to
know its performance and are applied for Regression and Classification algorithms. The different
types of classification metrics are:

5.2.1 Classification Accuracy
5.2.2 Confusion Matrix
5.2.3 F-Measure
5.2.4 Area under Curve (AUC)
5.2.5 Logarithmic Loss
5.2.6 Regression Metrics

5.2.1 Classification Accuracy:

Classification accuracy is what we usually mean by the term accuracy. It is the ratio of the number of correct predictions to the total number of predictions made by the model on the given data.

We can get better accuracy if the given data samples have the same type of data related to the given
problem statement. If the accuracy is high, the model is more accurate and we can use the model in
the real world and for different types of applications also.

If the accuracy is less, it shows that the data samples are not correctly classified to suit the given
problem.

5.2.2 Confusion Matrix:

It is an N x N matrix structure used for evaluating the performance of a classification


model, where N is the number of classes that are predicted. It is operated on a test dataset in which
the true values are known. The matrix lets us know about the number of incorrect and correct
predictions made by a classifier and is used to find the correctness of the model. It consists of values such as True Positive, False Positive, True Negative, and False Negative, which help in measuring accuracy, precision, recall, specificity, sensitivity, and the AUC curve. These measures describe the model's performance and allow it to be compared with other models.

There are 4 important terms in confusion matrix:

True Positives (TP): The cases in which our predictions are TRUE, and the actual output
was also TRUE.
True Negatives (TN): The cases in which our predictions are FALSE, and the actual output
was also FALSE.
False Positives (FP): The cases in which our predictions are TRUE, and the actual output
was FALSE.
False Negative (FN): The cases in which our predictions are FALSE, and the actual output
was TRUE.

These four outcomes are often plotted on a confusion matrix. The following confusion matrix is
an example for the case of binary classification. You would generate this matrix after making
predictions on your test data and then identifying each prediction as one of the four possible
outcomes described above.

Fig. 5.3
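The sketch below builds the four counts and the related metrics with scikit-learn, using invented true labels and predictions for a binary problem.

from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# invented true labels and model predictions for a binary problem
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# confusion matrix unpacked into the four outcomes
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)

print("Accuracy: ", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("F1 score: ", f1_score(y_true, y_pred))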

5.2.3 F-Measure:

It is also called the F1 Score. It is a good measure of the test accuracy of the developed model. It makes our task easy by removing the need to examine Precision and Recall separately to judge the model performance: the F1 Score is the harmonic mean of Recall and Precision, F1 = 2 × (Precision × Recall) / (Precision + Recall). The higher the F1 Score, the better the performance of the model. Without inspecting Precision and Recall separately, we can summarize the model performance using the F1 score, as it is precise and robust.

Specificity is also called the True Negative Rate. It is the ratio of the Number of True
Negatives in the sample to the sum of True negative and the False positive samples in the given
dataset. It tells about the number of actual Negative samples that are correctly identified from the
given dataset.

False positive rate is defined as 1-specificity. It is the ratio of number of False Positives in
the sample to the sum of False positive and True Negative samples. It tells us about the Negative
data samples that are classified as Positive, with respect to all Negative data samples.

5.2.4 Area Under Curve (AUC):

It is a widely used evaluation metric, mainly used for binary classification. The false positive rate and the true positive rate both take values ranging from 0 to 1. The TPR and FPR are calculated at different threshold values and a graph (the ROC curve) is drawn to better understand the data. Thus, the Area Under Curve is the area under the plot of true positive rate against false positive rate across the different thresholds in [0, 1].

5.2.5 Logarithmic Loss:

It is also called Log Loss. As we know, the AUC ROC determines the model performance using the predicted probabilities, but it does not consider the model's capability to assign a higher probability to samples that are more likely to be positive. This technique is mostly used in multi-class classification. It is calculated as the negative average of the log of the predicted probabilities of the correct classes for each instance:

Log Loss = -(1/N) Σ_i Σ_j y_ij log(p_ij)

where y_ij indicates whether sample i belongs to class j or not, and p_ij indicates the probability of sample i belonging to class j.

5.2.6 Regression Metrics:

Regression metrics evaluate models that predict a continuous outcome with the help of correlated independent variables. There are mainly 3 different types of metrics used in regression. These metrics also help to diagnose whether the model is underfitting or overfitting the data, for better usage of the model. They are:

Mean Absolute Error (MAE)

Mean Squared Error (MSE)

Root Mean Squared Error (RMSE)

1. Mean Absolute Error (MAE) is the average of the absolute differences between the original values and the predicted values. It gives us an idea of how far the predictions are from the actual output, but it doesn’t give clarity on whether the data is underfitted or overfitted. It is calculated as:

MAE = (1/n) Σ |y_i - ŷ_i|

2. The Mean Squared Error (MSE) is similar to the mean absolute error. It is computed by taking the average of the squares of the differences between the original and predicted values. Because the errors are squared, large errors are penalized more heavily, which draws attention to them. It is computed as:

MSE = (1/n) Σ (y_i - ŷ_i)²

3. The Root Mean Squared Error (RMSE) is the square root of the mean of the squared differences between the predicted and actual values of the given data. It is the most popular metric evaluation technique used in regression problems. It assumes that the errors are unbiased and follow a normal distribution. It is computed as:

RMSE = √MSE = √((1/n) Σ (y_i - ŷ_i)²)
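These three metrics can be computed as in the sketch below, with invented actual and predicted values.

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# invented actual and predicted values from a regression model
y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.0, 8.0, 11.0])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)

print("MAE: ", mae)
print("MSE: ", mse)
print("RMSE:", rmse)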

5.3 Bias vs Variance:


Bias is the difference between the expected (true) value and the value predicted by our model. It arises from assumptions made by the model to make the target function easier to learn. Low bias indicates fewer assumptions, whereas high bias indicates more assumptions about the target data. High bias leads to underfitting of the model.

Variance measures how much the model's predictions change with the training data, including its noise. With high variance, the model treats the noise as something to learn, learns too much from the training data, and in the end fails to give accurate results on the given problem statement; this leads to overfitting of the model.

Chapter 06

Natural Language Processing using NLTK


NLTK (Natural Language Toolkit) is the go-to API for NLP (Natural Language Processing) with Python. It is a really powerful tool for pre-processing text data for further analysis, for example with ML models. It helps convert text into numbers, which a model can then easily work with. This chapter is a basic introduction to NLTK for getting your feet wet and assumes some basic knowledge of Python.

First, you want to install NLTK using pip (or conda). The command for this is pretty
straightforward for both Mac and Windows:

pip install nltk

If this does not work, take a look at the NLTK documentation (https://www.nltk.org). Note: you must have at least version 3.5 of Python for NLTK.

To check if NLTK is installed properly, just type import nltk in your IDE. If it runs without any error, congrats! But hold up, there’s still a bunch of stuff to download and install. In your IDE, after importing, continue to the next line, type nltk.download() and run this script. An installation window will pop up. Select all and click ‘Download’ to download and install the additional bundles. This will download all the dictionaries and other language and grammar data frames necessary for full NLTK functionality. NLTK fully supports the English language, but others like Spanish or French are not supported as extensively. Now we are ready to process our first natural language.

NLTK is a leading platform for building Python programs to work with human language
data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet,
along with a suite of text processing libraries for classification, tokenization, stemming, tagging,
parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active
discussion forum.

Thanks to a hands-on guide introducing programming fundamentals alongside topics in


computational linguistics, plus comprehensive API documentation, NLTK is suitable for linguists,
engineers, students, educators, researchers, and industry users alike. NLTK is available for
Windows, Mac OS X, and Linux. Best of all, NLTK is a free, open source, community-driven
project.
NLTK has been called “a wonderful tool for teaching, and working in, computational
linguistics using Python,” and “an amazing library to play with natural language.”

Natural Language Processing with Python provides a practical introduction to programming


for language processing. Written by the creators of NLTK, it guides the reader through the
fundamentals of writing Python programs, working with corpora, categorizing text, analyzing
linguistic structure, and more. The online version of the book has been updated for Python 3 and
NLTK 3.

Fig.6.1

6.1. Tokenizing
By tokenizing, you can conveniently split up text by word or by sentence. This will allow
you to work with smaller pieces of text that are still relatively coherent and meaningful even
outside of the context of the rest of the text. It’s your first step in turning unstructured data into
structured data, which is easier to analyze.
When you’re analysing text, you’ll be tokenizing by word and tokenizing by sentence. Here’s what
both types of tokenization bring to the table:

● Tokenizing by word: Words are like the atoms of natural language. They’re the smallest
unit of meaning that still makes sense on its own. Tokenizing your text by word allows you to
identify words that come up particularly often. For example, if you were analyzing a group of job
ads, then you might find that the word “Python” comes up often. That could suggest high demand
for Python knowledge, but you’d need to look deeper to know more.

● Tokenizing by sentence: When you tokenize by sentence, you can analyze how those
words relate to one another and see more context. Are there a lot of negative words around the
word “Python” because the hiring manager doesn’t like Python? Are there more terms from the
domain of herpetology than the domain of software development, suggesting that you may be
dealing with an entirely different kind of python than you were expecting?

Fig. 6.2
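A minimal tokenizing sketch with NLTK is shown below; the example text is invented, and the punkt resource is downloaded in case it is not already installed (resource names can vary slightly between NLTK versions).

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download("punkt")   # tokenizer models, if not already installed

text = "Python is popular for data science. NLTK makes text processing easier."

print(sent_tokenize(text))   # split the text into sentences
print(word_tokenize(text))   # split the text into words and punctuation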

6.2 Filtering Stop Words


Stop words are basically words that don’t have strong meaningful connotations, for instance ‘and’, ‘a’, ‘it’s’, ‘they’, etc. They have a meaningful impact when we use them to communicate with each other, but for analysis by a computer they are not really that useful.

Stop words are words that you want to ignore, so you filter them out of your text when
you’re processing it. Very common words like 'in', 'is', and 'an' are often used as stop words since
they don’t add a lot of meaning to a text in and of themselves.

Fig. 6.3

As you can see, many of the words like ‘will’ and ‘and’ are removed. This will save massive amounts of computation power, and hence time, if we were to shove bodies of text with lots of “fluff” words into an ML model.
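A small sketch of stop word filtering with NLTK's built-in English stop word list; the sentence is invented.

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("punkt")
nltk.download("stopwords")

text = "This is an example sentence and it will show the removal of stop words."

stop_words = set(stopwords.words("english"))
tokens = word_tokenize(text)
filtered = [w for w in tokens if w.lower() not in stop_words]

print(filtered)   # common fluff words such as 'is', 'an', 'and', 'the' are dropped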

6.3 Stemming
This is when ‘fluff’ letters (not words) are removed from a word and grouped together with
its “stem form”. For instance, the words ‘play’, ‘playing’, or ‘plays’ convey the same meaning
(although, again, not exactly, but for analysis with a computer, that sort of detail is still not a viable
option). So instead of having them as different words, we can put them together under the same
umbrella term ‘play’.

Fig. 6.4

There are other stemmers, such as the Snowball Stemmer and the Lancaster Stemmer, but the PorterStemmer is the simplest one. ‘Play’ and ‘Playful’ should arguably have been recognized as two different words; notice how ‘playful’ gets reduced to ‘play’ rather than kept as ‘playful’. This is where the simplicity of the PorterStemmer is undesirable. You can also train your own stemmer using unsupervised clustering or supervised classification ML models. Now let’s stem an actual sentence!
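A minimal stemming sketch with NLTK's PorterStemmer, using an invented word list and sentence.

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

words = ["play", "playing", "plays", "played", "playful"]
print([stemmer.stem(w) for w in words])   # the simple Porter rules reduce all of these to 'play'

# stemming an actual (invented) sentence, word by word
sentence = "She was playing and replaying the playful melodies"
print([stemmer.stem(w) for w in sentence.lower().split()])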
6.4 Tagging Parts of Speech (pos)
The next essential thing we want to do is tag each word in the corpus (a corpus is just a ‘bag’ of words) we created after converting sentences by tokenizing. The pos_tag() method takes in a list of tokenized words and tags each of them with a corresponding part-of-speech identifier, returning tuples. For example, VB refers to a ‘verb’, NNS refers to ‘plural nouns’, and DT refers to a ‘determiner’. Refer to the NLTK documentation for a full list of tags. These tags are almost always pretty accurate, but we should be aware that they can be inaccurate at times. However, pre-trained models usually assume the English being used is written properly, following the grammatical rules.

This can be a problem when analyzing informal texts like from the internet. Remember the data
frames we downloaded after pip installing NLTK? Those contain the datasets that were used to
train these models initially. To apply these models in the context of our own interests, we would
need to train these models on new datasets containing informal languages first.
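A minimal POS tagging sketch is shown below; the sentence is invented, and the tagger resource is downloaded first (its name may differ slightly between NLTK versions).

import nltk
from nltk import pos_tag, word_tokenize

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

tokens = word_tokenize("NLTK tags each word with a part of speech")
print(pos_tag(tokens))   # a list of (word, tag) tuples; tags like NN, VB and DT identify parts of speech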

Chapter 07

Conclusion
This course has been an excellent and rewarding experience. I can conclude that there is a lot that I have learnt from my work at the research centre. Needless to say, the technical aspects of the work I have done are not flawless and could be improved given enough time. As someone with no prior experience in machine learning whatsoever, I believe my time spent on research and discovering new approaches was well worth it and contributed to finding an acceptable solution to an important aspect of machine learning. Two main things that I have learned are the importance of time-management skills and self-motivation. Although I have often stumbled upon these problems at university, they had to be approached differently in a working environment.

