
A

Practical Training Report


on
DATA SCIENCE
Submitted in partial fulfillment for the award of degree of
BACHELOR OF TECHNOLOGY
In
Computer Science & Engineering

Coordinator: Mr. Loveleen Kumar, Assistant Professor
Submitted By: Amit Khandelwal (18EGJCS010)

Submitted to:
Mr. S. S. Shekhawat
Head of Dept.

Department of Computer Science & Engineering


GLOBAL INSTITUTE OF TECHNOLOGY
JAIPUR (RAJASTHAN)-302022
SESSION: 2020-21

SUMMER TRAINING CERTIFICATE

https://trainings.internshala.com/verify_certificate


Acknowledgement

I take this opportunity to express my deep sense of gratitude to my coordinator Mr. Loveleen
Kumar, Assistant Professor, Department of Computer Science and Engineering, Global Institute
of Technology, Jaipur, for his valuable guidance and cooperation throughout the Practical Training
work. He provided constant encouragement and unceasing enthusiasm at every stage of the
Practical Training work.
I am grateful to our respected Dr. I. C. Sharma, Principal, GIT, for guiding us during the Practical
Training period.
I express my indebtedness to Mr. S. S. Shekhawat, Head of the Department of Computer Science
and Engineering, Global Institute of Technology, Jaipur, for providing me ample support during my
Practical Training period.
Without their support and timely guidance, the completion of my Practical Training would have
seemed a far-fetched dream. In this respect, I find myself lucky to have mentors of such great
potential.

Place: GIT, Jaipur

Amit Khandelwal
18EGJCS010
B.Tech. V Semester, III Year, CS


Abstract
In this project, we were asked to experiment with a real-world dataset and to explore how machine
learning can be used to find patterns in data. We were expected to gain experience using a
common data-mining and machine learning library and to submit a report about the
dataset and the algorithms used. After performing the required tasks on the chosen dataset,
herein lies my final report.

Keywords: Machine Learning, Classification, Supervised Learning, Artificial Intelligence, Data Science, Statistics


Table of Contents

Certificate ............................................................................................................................ ii

Acknowledgement ............................................................................................................... iii

Abstract ............................................................................................................................... iv

Table of Content ................................................................................................................... v

List of Figures ...................................................................................................................... vi

1. History of python…………………………………………………..…………………...…….01

1.1. Why python……………………………………………………………………...………01

1.2. Characteristics of Python ……………………………………………………………....02

1.3. Data Structure in Python…………………………………………………………………03

1.3.1 List………………………………………………..…………………..03
1.3.2 Dictionary………………………………………………………….…03
1.3.3 Tuple…………………………………………...………………….….04
1.3.4 Sets……………………………………………..…………………….04
1.4. File Handling…………………………………………………………………………….05

1.5. NumPy…………………………………………………………………………………...07

1.5.1 Operations using NumPy……………………………….……..….07


1.6 Pandas……………………………………………………………………..……..………..9
1.6.1 Key Features…………………………………….……...……………..9
2. Machine Learning…………………………………………………………………………….11
2.1 A taste of Machine learning……………….………………………………….…11
2.2 Relation of Data mining…………….……………………………………..…….11
2.3 Relation to optimization……………….…………………………………...……12
2.4 Relation to Statistics……………………………………………………………..12
2.5 Future of Machine Learning……………………………………………………13
2.6 Definition of AI……………………………………………………………...…13
2.6.1 Definition of ML…………………………………………………….13

2.7 ML Algorithms…………………………………………...……………………14

2.7.1 Approach Traditions…………………………..…………………….14


2.7.2 ML approach………………………………………..….……………15
2.8 ML Techniques………………………………………………..………………15
2.8.1 Applications of ML……………………………………….…………16
2.9 Techniques of ML……………………………………………………..………16
2.9.1 Supervised Learning…………………………………………………16
2.9.2 Unsupervised ……………………………………………………….17
2.9.3 Semi-supervised …………………………………………………….19
2.9.4 Reinforcement………………………………………………….……19
2.10 Supervised Learning…………………………………………………….………20
2.10.1 Regression……………………………………………………….…20
2.10.1.1 Linear……………………………………………….……21
2.10.1.2 Multilinear regression……………………………………22
2.10.1.3 Decision tree…………………………………………..…22
2.10.1.2 Random Forest…………………………………….…..…23
2.10.2 Classification………………………………………………………24
2.10.2.1 Linear model………………………………………….… 24
2.10.2.1.1 Logistic……………………………………………….. 24
2.10.2.1.2 SVM………………………………………………..….25
2.10.2.2 Nonlinear model…………………………………………26
2.10.2.2.1 KNN……………………………………….…..26
2.11 Unsupervised Learning…………………………………….………………..27
2.11.1 Clustering………………………………………………….27
2.11.1.1 K means Clustering……………………………..28
2.11.1.2 Elbow Method…………………………..…..….28

3. Introduction to Statistics……………………………………………………………………30
3.1 Terminologies…………………………………………………………………31
3.2 Types of analysis………………………………………………………………31
3.3 Categories………………………………………………………………...……32
3.3.1 Descriptive…………………………………………………..……….32
3.3.2 Inferential…………………………………………………………….33


3.4 Understanding Descriptive analysis………………………………….…...…...34


3.4.1 Measures of the center……………………………………..……….35
3.4.2 Measures of the spread………………………………………..…….36
3.5 Understanding Inferential analysis……………………………………………36
4. Problem Statement given in the Training……………………………………………….39
4.1 Problem statement………………………………………………………………39
4.2 Data dictionary………………………………………………………………….39
4.3 Correlation plot………………………………………………………………….42
4.4 Model Building………………………………………………………………….43

5. Conclusion …………………………………………………………………………………45
6. References and Bibliography……………………………………………………………...46


List of Figures
Fig 1.1 Program of ‘Hello World’…………………………..………………………….……2
Fig 1.2 Example of file handling code in python……………………………………….…6
Fig 1.3 Example of dictionary.…………………………………………..……………….…6
Fig 1.4 Output of File handling..………………………………………………………...…7
Fig 1.5 Numpy Example 1…………………………………………………………….…...8
Fig 1.6 Numpy Example 2…………………………………………………………..……..8
Fig 1.7 Pandas example…………………………………………………………………....10
Fig 2.1 Process of ML…………………………………………………..……….…………11
Fig 2.2 Optimization……………………………………………………………………… 12
Fig 2.3 Relation to statistics………………………………………………………….……12
Fig 2.4 ML & AI Relation…………………………………………………………….……13
Fig 2.5 ML vs Traditional programming…………………………………………………..14
Fig 2.6 Traditional Approach……………………………………………………………...15
Fig 2.7 ML Approach……………………………………………………………….……...15
Fig 2.8 ML technique……………………………………………………………………....16
Fig 2.9 ML Application…………………………………………………………………….16
Fig 2.10 Supervised learning………………………………………………………………17
Fig 2.11 Unsupervised learning…………………………………………………...………..18
Fig 2.12 Types of Unsupervised learning.....…………………………………….…….......18
Fig 2.13 Labeled and Unlabeled………………………………………………...….…........19
Fig 2.14 Reinforcement learning ……………………………………………….….….…. .20
Fig 2.15 Linear regression graph…………………………………………………...…...….21
Fig 2.17 Multiple linear regression………………………………………………………...22
Fig 2.18 Decision tree…………………………………………………………………..….23
Fig 2.19 A single decision tree vs a bagging ensemble of 500 trees …………………...…24
Fig 2.20 Logistic regression……………………………………………………………….25

Fig 2.22 SVM…………………………………………………………………………….26


Fig 2.23 KNN…………………………………………………………………………….27
Fig 2.24 Iterative process to get data points in the best clusters possible………………...28
Fig 3.1 Statistics………………………………………………………………….………..30
Fig 3.2 Types of analysis…………………………………………………….……………31
Fig 3.3 Descriptive statistics………………………………………………….…………..32
Fig 3.4 Descriptive statistics example…………………………………………….………33
Fig 3.5 Inferential Statistics…………………………………………………….………...34
Fig 3.6 Car dataset………………………………………………………………………..34
Fig 3.7 Inferential analysis………………………………………………….……………37
Fig 4.1 Table of data dictionary…………………………………………………………..40
Fig 4.2 loading the data…………………………………………………………………..40
Fig 4.3 Assigning variables……………………………………………………………….41
Fig 4.4 Data types………………………………………………………………………...41
Fig 4.5 Five rows of the dataset…………………………………………………………..41
Fig 4.6 Box plot of frequencies…………………………………………………………..42
Fig 4.7 Code for the correlation plot……………………………………………………....42
Fig 4.8 Correlation plot…………………………………………………………………..42
Fig 4.9 Code and output for null values………………………………………………….43
Fig 4.10 Splitting of the dataset into train and test……………………………………….43
Fig 4.11 Using of the decision tree function from the sklearn library……………………44
Fig 4.12 Saving the prediction into a csv file……………………………………………44
Fig 4.13 Converting the target value from “yes” or “no” to “1” or “0”…………………44


Chapter 1.

History of Python

Python was developed in the late 1980s by Guido van Rossum at the National Research
Institute for Mathematics and Computer Science in the Netherlands as a successor to the
ABC language, capable of exception handling and interfacing. Python features a
dynamic type system and automatic memory management. It supports multiple
programming paradigms, including object-oriented, imperative, functional and
procedural, and has a large and comprehensive standard library.

Van Rossum picked the name Python for the new language from a TV show, Monty
Python's Flying Circus.

In December 1989 he wrote the first Python interpreter as a hobby project; later,
on 16 October 2000, Python 2.0 was released with many new features.

...In December 1989, I was looking for a "hobby" programming project that would
keep me occupied during the week around Christmas. My office ... would be closed,
but I had a home computer, and not much else on my hands. I decided to write an
interpreter for the new scripting language I had been thinking about lately: a
descendant of ABC that would appeal to Unix/C hackers. I chose Python as a working
title for the project, being in a slightly irreverent mood (and a big fan of Monty
Python's Flying Circus)

— Guido van Rossum

1.1 Why Python?

The language's core philosophy is summarized in the document The Zen of Python
(PEP 20), which includes aphorisms such as…

Beautiful is better than ugly


Simple is better than complex

Complex is better than complicated

Readability counts

Explicit is better than implicit

Fig 1.1 Program of “Hello World”

1.2 Characteristics of Python

• Interpreted Language: Python is processed at runtime by the interpreter.

• Easy to read: Python source code is clearly defined and readable.

• Portable: Python code can run on a wide variety of hardware platforms using the same interface.

• Extendable: Users can add low-level modules to the Python interpreter.

• Scalable: Python provides a better structure for supporting large programs than shell scripts.

• Object-Oriented Language: It supports the object-oriented programming style.

• Interactive Programming Language: Users can interact with the Python interpreter directly to write programs.

• Easy language: Python is an easy language to learn, especially for beginners.

• Straightforward Syntax: Python syntax is simple and straightforward, which also adds to its popularity.

1.3 Data Structures in Python

1.3.1 LISTS-

Ordered collection of data.

Supports similar slicing and indexing functionalities as in the case of Strings.

They are mutable.

Advantages of a list over a conventional array:

• Lists have no size or type constraints (no restrictions set beforehand).

• They can contain different object types.

• We can delete elements from a list using del list_name[index_val].

 Example-

• my_list = ['one', 'two', 'three', 4, 5]

• len(my_list) would output 5.
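A short illustrative snippet (not from the report) putting these points together; the values are hypothetical:

my_list = ['one', 'two', 'three', 4, 5]   # mixed object types are allowed
print(len(my_list))        # 5
print(my_list[1:3])        # ['two', 'three'] -- slicing works as with strings
my_list[0] = 'ONE'         # lists are mutable, elements can be replaced
my_list.append(6.0)        # no size constraint, items can keep being added
del my_list[3]             # deletes the value 4 by index
print(my_list)             # ['ONE', 'two', 'three', 5, 6.0]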

1.3.2 Dictionary-

Lists are sequences, whereas dictionaries are mappings.

They are mappings between a unique key and a value pair.

These mappings may not retain order.

Constructing a dictionary.

Accessing object from a dictionary.

Nesting Dictionaries.

Basic Dictionary Methods.

Basic Syntax

o d = {} generates an empty dictionary; keys and values can then be assigned to it, like d['animal'] = 'Dog'

o d = {'K1': 'V1', 'K2': 'V2'}

o d['K1'] outputs 'V1'
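A short illustrative snippet (not from the report) showing construction, access, nesting and a few basic dictionary methods; the keys and values are hypothetical:

d = {'animal': 'Dog', 'K1': 'V1'}
d['K2'] = 'V2'                      # assign a new key/value pair
nested = {'outer': {'inner': 10}}   # nesting dictionaries
print(nested['outer']['inner'])     # 10
print(d.keys())                     # dict_keys(['animal', 'K1', 'K2'])
print(d.values())                   # dict_values(['Dog', 'V1', 'V2'])
print(d.get('missing', 'default'))  # 'default' -- safe lookup for an absent key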

1.3.3 Tuples-

Immutable in nature, i.e., they cannot be changed.

No type restriction.

Indexing and slicing work the same way as in strings and lists.

Constructing tuples.

Basic tuple methods.

Immutability.

When to use tuples?

We can use tuples to represent things that shouldn't change, such as days of the
week or dates on a calendar.
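A short illustrative snippet (not from the report) showing construction, the basic tuple methods and immutability; the values are hypothetical:

t = ('Mon', 'Tue', 'Wed', 3, 4.5)    # no type restriction
print(t[0], t[-1])                   # indexing works like lists and strings
print(t.count('Mon'), t.index(3))    # the two basic tuple methods
try:
    t[0] = 'Sun'                     # tuples are immutable
except TypeError as e:
    print('Cannot modify a tuple:', e)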

1.3.4 Sets-

A set contains unique and unordered elements and we can construct them by
using a set() function.

Convert a list into Set-

l=[1,2,3,4,1,1,2,3,6,7]

k = set(l)

k becomes {1,2,3,4,6,7}

Basic Syntax-

x = set()

x.add(1)

x is now {1}

Adding 1 again with x.add(1) would make no change in x, since sets only keep unique elements.

1.4 File Handling in Python

Python too supports file handling and allows users to handle files, i.e., to read and write
files, along with many other file handling options. The concept of
file handling exists in many other languages, but the implementation is
often complicated or lengthy; as with other concepts in Python, it is
easy and short here. Python treats files differently as text or binary, and this is important.
Each line of a text file is a sequence of characters terminated by a special character,
called the EOL or End of Line character, such as the newline character. It ends the
current line and tells the interpreter a new one has begun. Let's start with reading and writing files.

We use the open() function in Python to open a file in read or write mode. open()
returns a file object and accepts two arguments, the file name and the mode, i.e.,
whether to read or write. So the syntax is: open(filename, mode). These are the modes
that Python provides for opening files:

• "r", for reading.

• "w", for writing.

• "a", for appending.

• "r+", for both reading and writing.

Example - the input is a notepad file (101.txt).


Fig 1.2 Example of file handling code in Python

Fig 1.3 Example of dictionary


The code reads the words from the 101.txt file, prints all the words present in the
file, and also reports how many times each word occurs.

Fig 1.4 Output of file handling
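The actual code is only shown as figures in the report; a minimal sketch of the kind of word-count script they describe (assuming the file name 101.txt from the example) could look like this:

word_counts = {}
with open('101.txt', 'r') as f:          # open the file in read mode
    for line in f:
        for word in line.split():
            # count how many times each word has been seen so far
            word_counts[word] = word_counts.get(word, 0) + 1

for word, count in word_counts.items():
    print(word, 'occurs', count, 'time(s)')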

1.5 NumPy

NumPy is a Python package. It stands for 'Numerical Python'. It is a library
consisting of multidimensional array objects and a collection of routines for
processing arrays.

Numeric, the ancestor of NumPy, was developed by Jim Hugunin. Another package,
Numarray, was also developed, having some additional functionalities. In 2005, Travis
Oliphant created the NumPy package by incorporating the features of Numarray into
the Numeric package. There are many contributors to this open-source project.

1.5.1 Operations using NumPy

Using NumPy, a developer can perform the following operations −

Mathematical and logical operations on arrays.

Fourier transforms and routines for shape manipulation.



Operations related to linear algebra. NumPy has in-built functions for linear
algebra and random number generation.

Simple program to create a matrix-

First of all, we import the NumPy package; then we pass a list to a NumPy function to create a matrix.

Fig 1.5 NumPy example 1

Many more operations can be performed in this way, such as taking the sine of the given
values or printing a zero matrix; we can also represent any image in the form of an array.

Fig 1.6 Numpy example 2
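A short illustrative snippet (not from the report) covering the operations just described; the matrix values are hypothetical:

import numpy as np

m = np.array([[1, 2, 3], [4, 5, 6]])   # create a matrix from a nested list
print(m.shape)                          # (2, 3)
print(np.sin(m))                        # element-wise sine of the values
print(np.zeros((3, 3)))                 # 3x3 zero matrix
print(np.dot(m, m.T))                   # matrix multiplication (2x2 result)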


1.6 Pandas

Pandas is an open-source, BSD-licensed Python library providing high-performance,
easy-to-use data structures and data analysis tools for the Python
programming language. Python with Pandas is used in a wide range of fields,
including academic and commercial domains such as finance, economics,
statistics and analytics.

Pandas provides high-performance data manipulation and analysis tools built on its
powerful data structures. The name Pandas is derived from "Panel Data", an
econometrics term for multidimensional data.

1.6.1 Key Features of Pandas-

• Fast and efficient Data Frame object with default and customized indexing.

• Tools for loading data into in-memory data objects from different file
formats.

• Data alignment and integrated handling of missing data.

• Reshaping and pivoting of data sets.

• Label-based slicing, indexing and subsetting of large data sets.

• Columns from a data structure can be deleted or inserted.

• Group by data for aggregation and transformations.

Pandas deals with the following three data structures −

Series

Data Frame

Panel

These data structures are built on top of NumPy arrays, which means they are fast.


Fig 1.7 Pandas example
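A short illustrative snippet (not from the report) showing a Series and a DataFrame together with a few of the listed features; the column names and values are hypothetical:

import pandas as pd

# Series: a one-dimensional labelled array
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# DataFrame: a two-dimensional labelled data structure
df = pd.DataFrame({'city': ['Jaipur', 'Delhi'], 'temp': [38, 41]})
print(df.dtypes)                          # column data types
print(df.describe())                      # summary statistics for numeric columns
df['temp_f'] = df['temp'] * 9 / 5 + 32    # insert a derived column
df = df.drop(columns=['temp_f'])          # delete a column again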


Chapter 2.

Introduction to Machine Learning

2.1 Taste of Machine Learning

 Arthur Samuel, an American pioneer in the field of computer gaming and artificial
intelligence, coined the term "Machine Learning" in 1959.

 Over the past two decades Machine Learning has become one of the mainstays of
information technology.

 With the ever-increasing amounts of data becoming available there is good reason
to believe that smart data analysis will become even more pervasive as a necessary
ingredient for technological progress.

2.2 Relation to Data Mining

Fig 2.1 Process of machine learning

Data mining uses many machine learning methods, but with different goals;
on the other hand, machine learning also employs data mining methods as
"unsupervised learning" or as a preprocessing step to improve learner
accuracy.


2.3 Relation to Optimization

Fig 2.2 Optimization

Machine learning also has intimate ties to optimization: many learning problems
are formulated as minimization of some loss function on a training set of examples.
Loss functions express the discrepancy between the predictions of the model being
trained and the actual problem instances.

2.4 Relation to Statistics

Fig 2.3 Relation to Statistics

Michael I. Jordan suggested the term data science as a placeholder to call the overall
field. Leo Breiman distinguished two statistical modelling paradigms: data models
and algorithmic models, wherein "algorithmic model" means, more or less, machine
learning algorithms like Random Forest.


2.5 Future of Machine Learning

 Machine Learning can be a competitive advantage to any company, be it a top MNC or a
startup, as things that are currently done manually will be done tomorrow by machines.

 The Machine Learning revolution will stay with us for long, and so will the future of
Machine Learning.

2.6 Definition of Artificial Intelligence

Artificial intelligence refers to the simulation of human intelligence in machines that


are programmed to think like humans and mimic their actions. The term may also be
applied to any machine that exhibits traits associated with a human mind such as
learning and problem- solving.

2.6.1 Definition of Machine Learning

Relationship between AI and ML

Fig 2.4 ML & AI relation

Machine Learning is an approach or subset of Artificial Intelligence that is based on the


idea that machines can be given access to data along with the ability to learn from it.

Define Machine Learning

Machine learning is an application of artificial intelligence (AI) that provides systems the


ability to automatically learn and improve from experience without being explicitly
programmed. Machine learning focuses on the development of computer programs that
can access data and use it to learn for themselves.

Features of Machine Learning

 Machine Learning is computing-intensive and generally requires a large amount of


training data.

 It involves repetitive training to improve the learning and decision making of


algorithms.

 As more data gets added, Machine Learning training can be automated for learning
new data patterns and adapting its algorithm.

2.7 Machine Learning Algorithms

 Traditional Programming vs. Machine Learning Approach

Fig 2.5 Traditional programming vs Machine learning


2.7.1 Traditional Approach

Traditional programming relies on hard-coded rules.

Fig 2.6 Traditional approach

2.7.2 Machine Learning Approach

Machine Learning relies on learning patterns based on sample data.

Fig 2.7 ML approach

2.8 Machine Learning Techniques


 Machine Learning uses a number of theories and techniques from Data


Science.

Fig 2.8 ML technique

 Machine Learning can learn from labelled data (known as supervised


learning) or unlabeled data (known as unsupervised learning).

2.8.1 Applications of Machine Learning

Fig 2.9 Application of ML

2.9 Techniques of Machine Learning

2.9.1 Supervised Learning

Define Supervised Learning

Supervised learning is the machine learning task of learning a function that maps an input to

an output based on example input-output pairs. It infers a function from labeled training data
consisting of a set of training examples.

Fig 2.10

 Examples of Supervised Learning

 Voice Assistants

 Gmail Filters

 Weather Apps

 Types of Supervised Learning

Fig 2.10 Supervised learning

2.9.2 Unsupervised Learning

Define Unsupervised Learning


Unsupervised learning is the training of a machine using information that is neither
classified nor labeled, allowing the algorithm to act on that information without
guidance.

Fig 2.11 Unsupervised learning

Here the task of the machine is to group unsorted information according to similarities, patterns
and differences without any prior training on the data.

 Types of Unsupervised Learning

Fig 2.12 types of Unsupervised Learning

 Clustering

The most common unsupervised learning method is cluster analysis. It is used to find


data clusters so that each cluster has the most closely matched data.

 Visualization Algorithms

Visualization algorithms are unsupervised learning algorithms that accept unlabeled


data and display this data in an intuitive 2D or 3D format. The data is separated into
somewhat clear clusters to aid understanding.

 Anomaly Detection

This algorithm detects anomalies in data without any prior training

2.9.3 Semi-supervised Learning

Define Semi-supervised Learning

Semi-supervised learning is a class of machine learning tasks and techniques that also make
use of unlabeled data for training – typically a small amount of labeled data with a large amount
of unlabeled data.

Fig 2.13 Labeled and unlabeled data

Semi-supervised learning falls between unsupervised learning (without any labeled training
data) and supervised learning (with completely labeled training data).

2.9.4 Reinforcement Learning


Reinforcement Learning is a type of Machine Learning that allows the learning system to
observe the environment and learn the ideal behavior based on trying to maximize some notion
of cumulative reward. It differs from supervised learning in that labelled input/output pairs
need not be presented, and sub-optimal actions need not be explicitly corrected. Instead, the
focus is on finding a balance between exploration (of uncharted territory) and exploitation (of
current knowledge).

Fig 2.14 Reinforcement learning

Features of Reinforcement Learning

 The learning system (agent) observes the environment, selects and takes certain
actions, and gets rewards in return (or penalties in certain cases).

 The agent learns the strategy or policy (choice of actions) that maximizes its
rewards over time.

Example of Reinforcement Learning

 In a manufacturing unit, a robot uses deep reinforcement learning to identify a


device from one box and put it in a container.

 The robot learns this by means of a rewards-based learning system, which


incentivizes it for the right action

2.10 Supervised learning

2.10.1 Regression

 In statistical modeling, regression analysis is a set of statistical processes for estimating



the relationships among variables.

 It includes many techniques for modeling and analyzing several variables, when the
focus is on the relationship between a dependent variable and one or more independent
variables (or 'predictors').

 More specifically, regression analysis helps one understand how the typical value of
the dependent variable (or 'criterion variable') changes when any one of the independent
variables is varied, while the other independent variables are held fixed.

2.10.1.1 Linear Regression

Fig 2.15 Basic linear regression

Linear regression is a linear approach for modeling the relationship between a scalar
dependent variable y and an independent variable x.

ŷ = wᵀx
where x is the input (feature) vector and w is the vector of weight parameters.

The equation is also written as:


y = wx + b
where b is the bias or the value of output for zero input
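As an illustration (not part of the original report), a minimal scikit-learn sketch with made-up data shows how the weight w and bias b are learned:

import numpy as np
from sklearn.linear_model import LinearRegression

# hypothetical data: y is roughly 2*x + 1 with a little noise
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([3.1, 4.9, 7.2, 9.0, 11.1])

model = LinearRegression().fit(X, y)
print(model.coef_[0], model.intercept_)   # learned weight w and bias b
print(model.predict([[6.0]]))             # prediction for a new input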

2.10.1.2 Multiple Linear Regression


It is a statistical technique used to predict the outcome of a response variable through
several explanatory variables and model the relationships between them.

The graph shows dependent variable y plotted against two independent variables x1 and x2. It is
shown in 3D. More independent variables (if involved) will increase the dimensions further.

Fig 2.17 Multiple regression

It represents a linear fit between multiple inputs and one output, typically:

y = w1x1 + w2x2 + b

2.10.1.3 Decision Tree Regression

 A decision tree is a graphical representation of all the possible solutions to a decision


based on a few conditions.

 Decision Trees are non-parametric models, which means that the number of parameters
is not determined prior to training. Such models can easily overfit the data if left unconstrained.

 In contrast, a parametric model (such as a linear model) has a predetermined number of
parameters, thereby reducing its degrees of freedom. This in turn reduces the risk of overfitting.

Fig 2.18 Decision tree

o max_depth – limits the maximum depth of the tree

o min_samples_split – the minimum number of samples a node must have before it can be split

o min_samples_leaf – the minimum number of samples a leaf node must have

o min_weight_fraction_leaf – same as min_samples_leaf but expressed as a fraction of total instances

o max_leaf_nodes – the maximum number of leaf nodes

o max_features – the maximum number of features that are evaluated for splitting at each node
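As an illustration of these hyperparameters, here is a minimal sketch using scikit-learn's DecisionTreeRegressor on hypothetical data; the specific values and settings are illustrative, not taken from the report:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])   # hypothetical feature
y = np.array([1.2, 1.9, 3.1, 3.9, 5.2, 5.8, 7.1, 8.2])   # hypothetical target

# restricting depth and leaf size regularizes the tree and limits overfitting
tree = DecisionTreeRegressor(max_depth=3, min_samples_leaf=2, max_leaf_nodes=6)
tree.fit(X, y)
print(tree.predict([[4.5]]))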

2.10.1.4 Random Forest Regression

Ensemble Learning uses the same algorithm multiple times or a group of different algorithms
together to improve the prediction of a model.


Fig 2.19 A single decision tree vs a bagging ensemble of 500 trees

Random Forests use an ensemble of decision trees to perform regression tasks.

2.10.2 Classification

It specifies the class to which data elements belong.

It predicts a class for an input variable.

It is best used when the output has finite and discrete values.

2.10.2.1 Linear Models

2.10.2.1.1 Logistic Regression

This method is widely used for binary classification problems. It can also be extended to
multi-class classification problems.

A binary dependent variable can have only two values, like 0 or 1, win or lose, pass or fail,
healthy or sick, etc.


Fig 2.20 Logistic regression

The probability in logistic regression is often represented by the Sigmoid function (also
called the logistic function or the S-curve), represented as:

S(t) = 1 / (1 + e^(-t))
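A minimal sketch (not from the report) showing the sigmoid function and a scikit-learn logistic regression on hypothetical binary data:

import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(t):
    # S(t) = 1 / (1 + e^(-t)), squashes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-t))

print(sigmoid(0), sigmoid(4), sigmoid(-4))   # 0.5, ~0.98, ~0.02

# hypothetical binary data: the label is 1 when the feature exceeds about 2.5
X = np.array([[0.5], [1.0], [2.0], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])
clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[2.8]]))   # class probabilities for a new point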

2.10.2.1.2 Support Vector machines

 SVMs are very versatile and are also capable of performing linear or
nonlinear classification, regression, and outlier detection.

 They involve detecting hyperplanes which segregate data into classes.


Fig 2.22 Support vector machines

 The optimization objective is to find “maximum margin hyperplane” that


is farthest from the closest points in the two classes (these points are called
support vectors).

2.10.2.2 Nonlinear Models

2.10.2.2.1 K-Nearest Neighbors (KNN)

The K-Nearest Neighbors algorithm assigns a data point to a class based on a
similarity (distance) measure to its nearest labelled neighbours.


Fig 2.23 KNN illustration

2.11 Unsupervised learning

2.11.1 Clustering

Clustering means

 Clustering is a Machine Learning technique that involves the grouping


of data points.

Prototype Based Clustering

 Prototype-based clustering assumes that most data is located near
prototypes; for example, centroids (the average point) or medoids (the most
representative actual data point).

 K-means, a Prototype-based method, is the most popular method for


clustering that involves:


 Training data that gets assigned to matching cluster based on


similarity

Fig 2.24 Iterative process to get data points in the best clusters possible

2.11.1.1 K-means Clustering

K-means Clustering Algorithm

Step 1: randomly pick k centroids

Step 2: assign each point to the nearest centroid

Step 3: move each centroid to the center of the respective cluster


Step 4: calculate the distance of the centroids from each point again

Step 5: move points across clusters and re-calculate the distance from
the centroid

Step 6: keep moving the points across clusters until the Euclidean
distance is minimized
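A minimal sketch of these steps using scikit-learn's KMeans on a few hypothetical 2-D points (the library call hides the iterative centroid updates described above):

import numpy as np
from sklearn.cluster import KMeans

# hypothetical 2-D points forming two loose groups
X = np.array([[1, 1], [1.5, 2], [1, 0.5],
              [8, 8], [8.5, 9], [9, 8]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)            # cluster assignment for each point
print(kmeans.cluster_centers_)   # the final centroids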

2.11.1.2 Elbow Method

 One could plot the Distortion against the number of clusters K. Intuitively, if K
increases, distortion should decrease. This is because the samples will be close
to their assigned centroids. This plot is called the Elbow method.


 It indicates the optimum number of clusters at the position of the elbow, the
point beyond which increasing K gives only a marginal decrease in distortion.

K-means is based on finding points close to cluster centroids. The distance between two points
x and y can be measured by the squared Euclidean distance between them in an m-dimensional
space.
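A minimal sketch of the Elbow method (not from the report), computing the distortion for several values of K on hypothetical data; inertia_ in scikit-learn is the sum of squared distances of samples to their assigned centroids:

import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(100, 2)   # hypothetical unlabeled data

distortions = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    distortions.append(km.inertia_)

# the "elbow" is the k after which the distortion stops dropping sharply
for k, d in zip(range(1, 9), distortions):
    print(k, round(d, 2))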

3 Examples of K-means Clustering

 Grouping articles (example: Google news)

 Grouping customers who share similar interests

 Classifying high risk and low risk patients from a patient pool


Chapter 3

Introduction to Statistics
Statistics is used to process complex problems in the real world so that Data Scientists and
Analysts can look for meaningful trends and changes in Data. In simple words, Statistics can be
used to derive meaningful insights from data by performing mathematical computations on it.

Several Statistical functions, principles, and algorithms are implemented to analyse raw data, build
a Statistical Model and infer or predict the result.

Statistics is a Mathematical Science pertaining to data collection, analysis, interpretation and


presentation.

Fig 3.1 Overview of statistic

The field of Statistics has an influence over all domains of life: the stock market, life sciences,
weather, retail, insurance and education, to name but a few.


3.1 Terminologies in Statistics - Statistics for Data Science


One should be aware of a few key statistical terminologies while dealing with Statistics for Data
Science. I’ve discussed these terminologies below:

 The population is the set of sources from which data has to be collected.
 A Sample is a subset of the Population
 A Variable is any characteristics, number, or quantity that can be measured or counted.
A variable may also be called a data item.
 A statistical parameter or population parameter is a quantity that indexes a family of
probability distributions; for example, the mean or median of a population.

Before we move any further and discuss the categories of Statistics, let’s look at the types of
analysis.

3.2 Types of Analysis

An analysis of any event can be done in one of two ways:

Fig 3.2 Types of Analysis – Math And Statistics For Data Science

1. Quantitative Analysis: Quantitative Analysis or Statistical Analysis is the science of


collecting and interpreting data with numbers and graphs to identify patterns and trends.
2. Qualitative Analysis: Qualitative or Non-Statistical Analysis gives generic information and
uses text, sound and other forms of media to do so.

For example, if I want to purchase a coffee from Starbucks, it is available in Short, Tall and Grande.
This is an example of Qualitative Analysis. But if a store sells 70 regular coffees a week, it is
Quantitative Analysis because we have a number representing the coffees sold per week.

Although the purpose of both these analyses is to provide results, Quantitative analysis provides a
clearer picture hence making it crucial in analytics.

3.3 Categories in Statistics


There are two main categories in Statistics, namely:

1. Descriptive Statistics
2. Inferential Statistics

3.3.1 Descriptive Statistics

Descriptive Statistics uses the data to provide descriptions of the population, either through
numerical calculations or graphs or tables.

Descriptive Statistics helps organize data and focuses on the characteristics of data providing
parameters.

Fig 3.3 Descriptive Statistics – Math and Statistics for Data Science

Suppose you want to study the average height of students in a classroom, in descriptive statistics
you would record the heights of all students in the class and then you would find out the maximum,
minimum and average height of the class.


Fig 3.4 Descriptive Statistics Example – Math and Statistics for Data Science

3.3.2 Inferential Statistics

Inferential Statistics makes inferences and predictions about a population based on a sample of
data taken from the population in question.

Inferential statistics generalizes a large data set and applies probability to arrive at a conclusion. It
allows you to infer parameters of the population based on sample stats and build models on it.

Fig 3.5 Inferential Statistics – Math and Statistics for Data Science

So, if we consider the same example of finding the average height of students in a class, in
Inferential Statistics you will take a sample set of the class, which is basically a few people from
the entire class. You have already grouped the class into tall, average and short. In this
method, you basically build a statistical model and expand it for the entire population in the class.


Inferential Statistics Example – Math and Statistics for Data Science

Now let’s focus our attention on Descriptive Statistics and see how it can be used to solve
analytical problems.

3.4 Understanding Descriptive Analysis

When we try to represent data in the form of graphs, like histograms, line plots, etc., the data is
represented based on some kind of central tendency. Central tendency measures like the mean and
median, or measures of the spread, are used for statistical analysis. To better understand
Statistics, let's discuss the different measures in Statistics with the help of an example.

Fig 3.6 Cars Data Set – Math and Statistics for Data Science

Here is a sample data set of cars containing the variables:


1. Cars
2. Mileage per Gallon (mpg)
3. Cylinder Type (cyl)
4. Displacement (disp)
5. Horse Power (hp)
6. Rear Axle Ratio (drat).

Before we move any further, let’s define the main Measures of the Centre or Measures of Central
tendency.
3.5 Measures of The Centre

1. Mean: Measure of average of all the values in a sample is called Mean.


2. Median: Measure of the central value of the sample set is called Median.
3. Mode: The value most recurrent in the sample set is known as Mode.

Using descriptive Analysis, you can analyse each of the variables in the sample data set for mean,
standard deviation, minimum and maximum.

 If we want to find out the mean or average horsepower of the cars among the population of
cars, we will check and calculate the average of all values. In this case, we’ll take the sum
of the Horse Power of each car, divided by the total number of cars:

Mean = (110+110+93+96+90+110+110+110)/8 = 103.625

 If we want to find out the center value of mpg among the population of cars, we will
arrange the mpg values in ascending or descending order and choose the middle value. In
this case, we have 8 values, which is an even count, so we must take the average of the
two middle values.

For the 8 cars, the two middle mpg values after sorting are 22.8 and 23.

Median = (22.8 + 23)/2 = 22.9

 If we want to find out the most common type of cylinder among the population of cars, we
will check the value which is repeated most number of times. Here we can see that the
cylinders come in two values, 4 and 6. Take a look at the data set, you can see that the most
recurring value is 6. Hence 6 is our Mode.

3.6 Measures of the Spread

Just like the measures of the centre, we also have measures of the spread, which comprise the
following measures:

1. Range: It is a measure of how spread apart the values in a data set are.
2. Inter Quartile Range (IQR): It is the measure of variability, based on dividing a data set into
quartiles.
3. Variance: It describes how much a random variable differs from its expected value. It
entails computing squares of deviations.
1. Deviation is the difference between each element from the mean.
2. Population Variance is the average of squared deviations
3. Sample Variance is the average of squared differences from the mean
4. Standard Deviation: It is the measure of the dispersion of a set of data from its mean.

Now that we’ve seen the stats and math behind Descriptive analysis, let’s try to work it out in R.
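Although the sentence above mentions R, the rest of the training uses Python, so here is a hedged pandas sketch of the same measures; it reuses the horsepower values from the mean example, while the cylinder values are hypothetical (chosen so that 6 is the mode, as in the report):

import pandas as pd

# hypothetical subset of the cars data set
df = pd.DataFrame({
    'hp':  [110, 110, 93, 96, 90, 110, 110, 110],
    'cyl': [6, 6, 4, 6, 6, 6, 4, 6],
})

print(df['hp'].mean())       # mean: 103.625
print(df['hp'].median())     # median of the sorted values
print(df['cyl'].mode()[0])   # most frequent cylinder count: 6
print(df['hp'].max() - df['hp'].min())   # range
print(df['hp'].var(), df['hp'].std())    # sample variance and standard deviation
print(df['hp'].quantile(0.75) - df['hp'].quantile(0.25))   # inter-quartile range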

3.7 Understanding Inferential Analysis


Statisticians use hypothesis testing to formally check whether the hypothesis is accepted or
rejected. Hypothesis testing is an Inferential Statistical technique used to determine whether there
is enough evidence in a data sample to infer that a certain condition holds true for an entire
population.

To understand the characteristics of a general population, we take a random sample and analyze the
properties of the sample. We test whether or not the identified conclusion represents the population
accurately, and finally we interpret the results. Whether or not to accept the hypothesis depends
upon the percentage value that we get from the hypothesis test.

To better understand this, let’s look at an example.

Consider four boys, Nick, John, Bob and Harry who were caught bunking a class. They were asked
to stay back at school and clean their classroom as a punishment.


Fig 3.7 Inferential Analysis – Math and Statistics For Data Science

So, John decided that the four of them would take turns to clean their classroom. He came up with
a plan of writing each of their names on chits and putting them in a bowl. Every day they had to
pick up a name from the bowl and that person must clean the class.

Now it has been three days and everybody’s name has come up, except John’s! Assuming that this
event is completely random and free of bias, what is the probability of John not cheating?

Let’s begin by calculating the probability of John not being picked for a day:

P(John not picked for a day) = 3/4 = 75%

The probability here is 75%, which is fairly high. Now, if John is not picked for three days in a
row, the probability drops down to 42%

P(John not picked for 3 days) = 3/4 × 3/4 × 3/4 ≈ 0.42

Now, let’s consider a situation where John is not picked for 12 days in a row! The probability
drops down to 3.2%. Thus, the probability of John cheating becomes fairly high.

P(John not picked for 12 days) = (3/4)^12 ≈ 0.032 < 0.05

In order for statisticians to come to a conclusion, they define what is known as a threshold value.
Considering the above situation, if the threshold value is set to 5%, it would indicate that, if the
probability lies below 5%, then John is cheating his way out of detention. But if the probability is
above the threshold value, then John is just lucky, and his name isn’t getting picked.
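The arithmetic can be checked with a few lines of Python (illustrative only):

# probability of John's name not being picked n days in a row
for n in (1, 3, 12):
    p = (3 / 4) ** n
    print(n, 'day(s):', round(p, 3))   # 0.75, 0.422, 0.032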

The probability and hypothesis testing give rise to two important concepts, namely:

 Null Hypothesis: The result is no different from assumption.


 Alternate Hypothesis: Result disproves the assumption.

Therefore, in our example, if the probability of an event occurring is less than 5%, then it is a
biased event, and hence it supports the alternate hypothesis.


Chapter 4.

Problem statement given during the training

4.1 Problem statement

Your client is a retail banking institution. Term deposits are a major source of income for a bank. A
term deposit is a cash investment held at a financial institution. Your money is invested for an
agreed rate of interest over a fixed amount of time, or term.

The bank has various outreach plans to sell term deposits to its customers, such as email
marketing, advertisements, telephonic marketing and digital marketing. Telephonic marketing
campaigns still remain one of the most effective ways to reach out to people. However, they require
huge investment, as large call centres are hired to actually execute these campaigns. Hence, it is
crucial to identify beforehand the customers most likely to convert, so that they can be specifically
targeted via call.

You are provided with client data such as the age of the client, their job type, their marital status,
etc. Along with the client data, you are also provided with information about the call, such as the
duration of the call, and the day and month of the call. Given this information, your task is to
predict whether the client will subscribe to a term deposit.

I was provided with the following data

1. train.csv: Use this dataset to train the model. This file contains all the client and call details as
well as the target variable “subscribed”. You have to train your model using this file.

2. test.csv: Use the trained model to predict whether a new set of clients will subscribe to the term
deposit.

4.2 Data Dictionary

Variable: Definition

ID: Unique client ID
age: Age of the client
job: Type of job
marital: Marital status of the client
education: Education level
default: Credit in default
housing: Housing loan
loan: Personal loan
contact: Type of communication
month: Contact month
day_of_week: Day of week of contact
duration: Contact duration
campaign: Number of contacts performed during this campaign to the client
pdays: Number of days that passed after the client was last contacted
previous: Number of contacts performed before this campaign
poutcome: Outcome of the previous marketing campaign
subscribed (target): Has the client subscribed to a term deposit?

Table 4.1 Data dictionary

The following are the steps that I performed to solve this problem statement:

Importing all the necessary machine learning libraries and loading the data.

Fig 4.2 loading the data.
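The actual code is shown only as figures; a hedged reconstruction of the loading step (assuming the file names train.csv and test.csv given in the problem statement) might look like this:

import pandas as pd

train = pd.read_csv('train.csv')   # training data with the 'subscribed' target
test = pd.read_csv('test.csv')     # new clients to score with the trained model
print(train.shape, test.shape)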


Assigning the dependent and independent variables.

Fig 4.3 Assigning variables

Checking the data types for each variable.

Fig 4.4 Data types

Printing the first five rows from the dataset

Fig 4.5 Five rows of the dataset


Fig 4.6 Box plot of frequencies

So, 3715 users out of a total of 31647 have subscribed, which is around 12%. Let's now explore the
variables to have a better understanding of the dataset. We will first explore the variables
individually using univariate analysis, then we will look at the relation between the various
independent variables and the target variable. We will look at the correlation plot to see which
variables affect the target variable the most.

Fig 4.7 Code for the correlation plot
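The correlation-plot code itself appears only as a figure; a minimal sketch of one common way to produce such a plot (assuming seaborn and matplotlib, which the report does not name explicitly, and the train DataFrame from the loading step above) is:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

train = pd.read_csv('train.csv')
corr = train.select_dtypes(include='number').corr()   # correlations between numeric columns
plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()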

4.3 Correlation Plot

Fig 4.8 Correlation plot



Now we will check whether there are any null values in the dataset.

Fig 4.9 Code and output for null values

There are no missing values in the train dataset.

Next, we will start to build our predictive model to predict whether a client will subscribe to a term
deposit or not.

As sklearn models take only numerical input, we will convert the categorical variables into
numerical values using dummies. We will remove the ID variable, as it holds unique values, and
then apply dummies. We will also remove the target variable and keep it in a separate variable.
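A minimal sketch of this preprocessing step, assuming the column names ID and subscribed from the data dictionary (the exact code is only shown as a figure):

import pandas as pd

train = pd.read_csv('train.csv')
target = train['subscribed'].map({'yes': 1, 'no': 0})          # numeric target
features = pd.get_dummies(train.drop(columns=['ID', 'subscribed']))   # one-hot encode categoricals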

4.4 Model Building

Fig 4.10 Splitting of the dataset into train and test

We will use the Decision tree algorithm for the prediction


Fig 4.11 Using of the decision tree function from the sklearn library

Fig 4.12 Saving the prediction into a csv file

Since the target variable is yes or no, we will convert 1 and 0 in the predictions to yes and no
respectively.

Fig 4.13 Converting the target value from “yes” or “no” to “1” or “0”
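Pulling the steps shown in Figs 4.10 to 4.13 together, a hedged end-to-end sketch might look like the following; the validation split size, tree depth and the output file name submission.csv are illustrative assumptions, not taken from the report:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

y = train['subscribed'].map({'yes': 1, 'no': 0})
X = pd.get_dummies(train.drop(columns=['ID', 'subscribed']))
X_test = pd.get_dummies(test.drop(columns=['ID'])).reindex(columns=X.columns, fill_value=0)

# hold out part of the training data to check the tree
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=12)

clf = DecisionTreeClassifier(max_depth=4, random_state=0)
clf.fit(X_train, y_train)
print('validation accuracy:', clf.score(X_val, y_val))

# predict for the new clients and convert 1/0 back to "yes"/"no"
pred = pd.Series(clf.predict(X_test)).map({1: 'yes', 0: 'no'})
pd.DataFrame({'ID': test['ID'], 'subscribed': pred}).to_csv('submission.csv', index=False)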


Chapter 5.

Conclusion

Machine Learning is a technique of training machines to perform the activities a human
brain can do, albeit a bit faster and better than an average human being. Today we have
seen that machines can beat human champions in games, such as chess and Go, which
are considered very complex. You have seen that machines can be trained to
perform human activities in several areas and can aid humans in living better lives.

Machine Learning can be Supervised or Unsupervised. If you have a smaller amount of
data that is clearly labelled for training, opt for Supervised Learning. Unsupervised
Learning would generally give better performance and results for large data sets.

Finally, when it comes to the development of machine learning models of your own,
you looked at the choices of various development languages, IDEs and platforms. The next
thing that you need to do is start learning and practicing each machine learning
technique. The subject is vast, which means that there is breadth, but if you consider the
depth, each topic can be learned in a few hours. Each topic is independent of the
others. You need to take one topic at a time, learn it, practice it and
implement the algorithm(s) in it using a language of your choice. This is the best way
to start studying Machine Learning. Practicing one topic at a time, very soon you
will acquire the breadth that is eventually required of a Machine Learning expert.


Chapter 6.

References and Bibliography

The content is taken from: -

1. https://www.simplilearn.com/

2. https://www.scribd.com/document/434622438/AN-INDUSTRIAL-TRAINING-REPORT-pdf

3. https://www.wikipedia.org/

4. https://www.analyticsvidhya.com/blog/2015/08/comprehensive-guide-regression/

5. https://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/

6. https://data-flair.training/blogs/svm-support-vector-machine-tutorial/

7. https://towardsdatascience.com/

8. https://towardsdatascience.com/machine-learning-algorithms-in-laymans-terms-part-1-d0368d769a7b

9. https://www.expertsystem.com

10. https://analyticsindiamag.com/7-types-classification-algorithms/

11. https://www.edureka.co/

12. https://towardsdatascience.com/machine-learning-algorithms-in-laymans-terms-part-1-d0368d769a7b

13. https://medium.com/

14. https://data-flair.training/blogs/machine-learning-tutorial/


Pictures are taken from:

1. https://www.simplilearn.com/

2. https://www.lbef.org/machine-learning-projects-is-it-really-hard-to-manage/

3. https://www.wordstream.com/blog/ws/2017/07/28/machine-learning-applications

4. https://www.wikipedia.org/

5. https://www.youtube.com/

6. https://data-flair.training/blogs/svm-support-vector-machine-tutorial/
