Machine Learning Industrial Training
LIST OF FIGURES
1. Linear Regression
2. Flowchart of Regression
3. Logistic Regression
4. Clustering Technique
5. K-Nearest Neighbours Algorithm
1) INTRODUCTION
Machine learning is a sub-domain of computer science that evolved from the study of pattern recognition in data and from computational learning theory in artificial intelligence. It is a field of computer science that gives computers the ability to learn without being explicitly programmed.
“A computer program is said to learn from experience E with respect to some task T and
some performance measure P, if its performance on T, as measured by P, improves with
experience E.”
It is a first-class ticket to some of the most interesting careers in data analytics today. As data sources proliferate along with the computing power to process them, going straight to the data is one of the most straightforward ways to quickly gain insights and make predictions. Machine learning can be thought of as the study of a list of sub-problems, viz.: decision making, clustering, classification, forecasting, deep learning, inductive logic programming, support vector machines, reinforcement learning, similarity and metric learning, genetic algorithms, sparse dictionary learning, etc.
Deep Learning - Deep learning is an artificial intelligence function that imitates the workings of the human brain in processing data and creating patterns for use in decision making. Deep learning is a subset of machine learning in artificial intelligence (AI) that uses networks capable of learning, without supervision, from data that is unstructured or unlabeled. It is also known as deep neural learning or a deep neural network.
Supervised Learning - We know what we are trying to predict. We use examples that we (and the model) know the answer to in order to "train" our model. It can then generate predictions for examples we don't know the answer to.
Examples: Predict the price a house will sell at. Identify the gender of someone based on a
photograph.
Unsupervised Learning - We don't know what we are trying to predict. We are trying to identify some naturally occurring patterns in the data which may be informative.
CHAPTER - 2
The machine learning methodology consists of four phases:
● Planning
● Data Preparation
● Modelling
● Model Presentation
2.1) Planning
The planning phase is the most important phase of the machine learning methodology; you can never win a war if your planning is poor. The purpose of planning is to accumulate important information and facts about the selected problem. This information can then be applied later in the process to reach the required position and to obtain the important data.
Defining Goals - The journey always begins with the primary act of questioning: what are we doing, why are we doing it, and how are we doing it?
2.2) Data Preparation
The data preparation phase consists of three steps:
1. Data Retrieval
2. Data Cleansing
3. Data Exploration and Refining
Data Retrieval - Extracting and retrieving the data (the dataset) on which we are going to work and build the model.
Data Cleansing - Cleansing the data retrieved in the previous step is known as data cleansing. Typical cleansing actions include:
1. Manual overrules
2. Use string functions
3. Replace with another value
4. Treat as a missing value
5. Omit the values
6. Set the value to null
7. Recalculate to the same unit
8. Bring to the same level of aggregation
Data Exploration and Refining - This phase takes a deep dive into understanding the data. Using various graphical and statistical techniques to gain information, the data scientist tries to understand whether the data is normally distributed or not. Correlation functions are used to learn how variables are related to each other. The visualization techniques used can be as simple as a histogram or as complex as a pairplot and others.
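As an illustration, a minimal exploration sketch along these lines might look as follows, assuming the dataset has already been loaded into a pandas DataFrame named data with numeric columns (the file name here is only a placeholder):
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
data = pd.read_csv("dataset.csv")   # placeholder file name
print(data.describe())              # summary statistics for every column
print(data.corr())                  # pairwise correlations between variables
data.hist(figsize=(10, 8))          # simple histograms, one per column
sns.pairplot(data)                  # pairwise scatter plots (the "pairplot" mentioned above)
plt.show()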
2.3) Modelling
The modelling phase consists of three steps:
1. Model Creation
2. Model Validation
3. Model Evaluation
Model Creation - With clean data and an understanding of its content, we can build models with the goal of making better predictions, classifying objects, or gaining an understanding of the system. This phase is the most important part of the cycle, because a better selection of algorithm and model leads to better accuracy of the model and of its predictions. Choosing the right algorithm and model for the dataset is therefore the key to obtaining sufficient accuracy.
Model Validation - The fitted model is validated against statistical quality measures such as the following (a small sketch of how some of these are computed follows the list):
1. Adjusted R-squared
2. Standard Error
3. P-value
4. Z-value
5. MAPE - Mean absolute percentage error
6. MSE - Mean squared error
7. RMSE - Root mean squared error
8. MASE - Mean absolute scaled error
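As an illustration, some of these error measures could be computed with NumPy and scikit-learn roughly as below; y_true and y_pred are placeholder arrays of actual and predicted values, invented purely for this sketch:
import numpy as np
from sklearn.metrics import mean_squared_error
y_true = np.array([3.0, 5.0, 2.5, 7.0])        # placeholder actual values
y_pred = np.array([2.5, 5.0, 3.0, 8.0])        # placeholder predictions
mse = mean_squared_error(y_true, y_pred)       # MSE
rmse = np.sqrt(mse)                            # RMSE
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100   # MAPE in percent
print(mse, rmse, mape)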
Model Evaluation -
Hold-out method - A portion of the data is held back from training and used only to estimate how the model performs on unseen data.
Cross-validation method - When only a limited amount of data is available, we use k-fold cross-validation to achieve an unbiased estimate of the model performance. We divide the data into k subsets of equal size and build models k times, each time leaving out one of the subsets from training and using it as the test set.
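A minimal sketch of k-fold cross-validation with scikit-learn, assuming X and y already hold the prepared features and labels (as in the project chapter later in this report), might be:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
# Assumed: X (features) and y (labels) have already been prepared.
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=5)    # 5 folds: train on 4 subsets, test on the remaining 1, five times
print(scores.mean(), scores.std())             # average accuracy and its spread across folds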
2.4) Model Presentation
1. Preparation of the presentation that showcases the outcome of the analysis or the model we developed, and how well it addresses the business problem
2. Deployment of the model based on the preference
3. Scaling of the model based on the issues that emerge
CHAPTER - 3
Like all regression analyses, the logistic regression is a predictive analysis. Logistic
regression is used to describe data and to explain the relationship between one dependent
binary variable and one or more nominal, ordinal, interval or ratio-level independent
variables.
3.4) Clustering Technique - Clustering is a Machine Learning technique that involves the grouping of data points. In theory, data points that are in the same group should have similar properties and/or features, while data points in different groups should have highly dissimilar properties and/or features. K-means is the most used algorithm for this technique.
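For example, a small K-means sketch with scikit-learn, using a synthetic two-dimensional dataset purely for illustration, could look like this:
import numpy as np
from sklearn.cluster import KMeans
rng = np.random.RandomState(0)
points = np.vstack([rng.normal(0, 1, (50, 2)),      # one synthetic blob of points
                    rng.normal(5, 1, (50, 2))])     # a second, well-separated blob
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_[:10])          # cluster assigned to the first few points
print(kmeans.cluster_centers_)      # centre of each discovered cluster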
3.5) K-Nearest Neighbours (KNN) Algorithm -
The KNN algorithm assumes that similar things exist in close proximity. In other words, similar things are near to each other. KNN captures the idea of similarity (sometimes called distance, proximity, or closeness) with some mathematics, by calculating the distance between points on a graph.
There are other ways of calculating distance, and one way might be preferable depending on
the problem we are solving. However, the straight-line distance (also called the Euclidean
distance) is a popular and familiar choice.
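A small sketch of this idea, computing Euclidean distances by hand and taking a majority vote among the k closest training points (the tiny dataset here is made up purely for illustration), might be:
import numpy as np
from collections import Counter
def knn_predict(X_train, y_train, x_new, k=3):
    # Euclidean (straight-line) distance from the new point to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(distances)[:k]          # indices of the k closest points
    labels = [y_train[i] for i in nearest]
    return Counter(labels).most_common(1)[0][0]  # majority vote among the neighbours
X_train = np.array([[1, 1], [1, 2], [5, 5], [6, 5]])   # made-up training points
y_train = ['A', 'A', 'B', 'B']
print(knn_predict(X_train, y_train, np.array([1.5, 1.5])))   # expected: 'A'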
CHAPTER - 4
4.1) NUMPY-
⚫ Introduces objects for multidimensional arrays and matrices, as well as functions that allow one to easily perform advanced mathematical and statistical operations on those objects
⚫ provides vectorization of mathematical operations on arrays and matrices, which significantly improves performance
⚫ many other Python libraries are built on NumPy.
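For example, a tiny sketch of NumPy's vectorized operations (the arrays here are made up for illustration):
import numpy as np
a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([10.0, 20.0, 30.0, 40.0])
print(a + b)                    # element-wise addition without an explicit Python loop
print(a * b)                    # element-wise multiplication
print(a.mean(), b.std())        # built-in statistical operations
m = np.arange(6).reshape(2, 3)  # a 2x3 matrix
print(m.T @ m)                  # matrix multiplication (transpose times itself)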
4.2) Pandas:
⚫ Adds data structures and tools designed to work with table-like data (the Series and DataFrame objects, similar to data frames in R)
⚫ provides tools for data manipulation: reshaping, merging, sorting, slicing, aggregation
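A short pandas sketch of this table-like workflow (the small table below is invented purely to show the operations):
import pandas as pd
df = pd.DataFrame({
    "city": ["Delhi", "Mumbai", "Delhi", "Pune"],
    "sales": [250, 300, 150, 200],
})
print(df.sort_values("sales"))               # sorting
print(df[df["sales"] > 180])                 # slicing / filtering rows
print(df.groupby("city")["sales"].sum())     # aggregation by group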
4.3) SciKit-Learn:
⚫ provides machine learning algorithms: classification, regression, clustering, model selection, and preprocessing
⚫ built on NumPy, SciPy, and matplotlib
4.4) Matplotlib:
⚫ a set of plotting functionalities similar to those of MATLAB: line plots, scatter plots, bar charts, histograms, pie charts, etc.
⚫ relatively low-level; some effort is needed to create advanced visualizations
4.5) Seaborn:
⚫ based on matplotlib; provides a high-level interface for drawing attractive statistical graphics
⚫ works directly with pandas data structures
⚫ Similar (in style) to the popular ggplot2 library in R
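As a small illustration of how the two plotting libraries are typically used together (the data below is synthetic, and a reasonably recent seaborn version is assumed):
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
x = np.linspace(0, 10, 100)
y = np.sin(x) + np.random.normal(0, 0.1, 100)   # a noisy sine curve
sns.set_theme()                 # seaborn's nicer default styling on top of matplotlib
plt.figure()
plt.plot(x, y)                  # low-level matplotlib line plot
plt.title("Matplotlib line plot")
plt.figure()
sns.histplot(y)                 # higher-level seaborn statistical plot
plt.title("Seaborn histogram")
plt.show()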
CHAPTER - 5
DESCRIPTION
Predict the onset of diabetes based on diagnostic measures.
SUMMARY
This dataset is originally from the National Institute of Diabetes and Digestive and
Kidney Diseases. The objective is to predict based on diagnostic measurements whether a
patient has diabetes.
Several constraints were placed on the selection of these instances from a larger database.
In particular, all patients here are females at least 21 years old of Pima Indian heritage.
Then we used a pairplot to understand which feature vectors are more significant and important than others. From the outcome we concluded that all of the vectors can be used to compute the prediction.
We then checked the dataset for missing values:
data.isnull().sum()
Pregnancies 0
Glucose 0
BloodPressure 0
SkinThickness 0
Insulin 0
BMI 0
DiabetesPedigreeFunction 0
Age 0
Outcome 0
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)  # 80/20 split
Then we found that the data is not in standard form, so we standardized it using the sklearn library.
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)  # fit the scaler on the training data and transform it
X_test = sc.transform(X_test)        # reuse the same scaling for the test data
5. Selection of Algorithm -
Since this is a classification problem, we use logistic regression to solve it.
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train)
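The accuracy figure quoted in the conclusion below would come from an evaluation step of roughly this form (a sketch only; the exact evaluation code used in the original analysis is not reproduced in this report):
from sklearn.metrics import accuracy_score, confusion_matrix
y_pred = classifier.predict(X_test)            # predictions on the held-out test set
print(confusion_matrix(y_test, y_pred))        # breakdown of correct and incorrect predictions
print(accuracy_score(y_test, y_pred))          # fraction of test patients classified correctly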
Conclusion-
We conclude that the dataset is not a complete space, and there are still other feature vectors
missing from it. What we were attempting to generalize is a subspace of the actual input
space, where the other dimensions are not known, and hence none of the classifiers were able
to do better than 71.6%. In the future, if similar studies are conducted to generate the dataset
used in this report, more feature vectors need to be calculated so that the classifiers can form
a better idea of the problem at hand.
References -
1. Datasciencesociety.org
2. Kaggle.com
3. HP student guide
4. Analytics Vidhya
5. Coursera
1. Introduction
1.1. A Taste of Machine Learning
✓ Arthur Samuel, an American pioneer in the field of computer gaming and artificial intelligence, coined the term "Machine Learning" in 1959.
✓ Over the past two decades Machine Learning has become one of the mainstays of information technology.
✓ With the ever-increasing amounts of data becoming available, there is good reason to believe that smart data analysis will become even more pervasive as a necessary ingredient for technological progress.
1.2. Relation to Data Mining
• Data mining uses many machine learning methods, but with different goals; on the other hand, machine learning also employs data mining methods as "unsupervised learning" or as a preprocessing step to improve learner accuracy.
1.3. Relation to Optimization
Machine learning also has intimate ties to optimization: many learning problems
are formulated as minimization of some loss function on a training set of examples.
Loss functions express the discrepancy between the predictions of the model being
trained and the actual problem instances.
1.4. Relation to Statistics
Michael I. Jordan suggested the term data science as a placeholder to call the overall
field.
Leo Breiman distinguished two statistical modelling paradigms: the data model and the algorithmic model, where "algorithmic model" refers more or less to machine learning algorithms such as Random Forest.
1.5. Future of Machine Learning
➢ Machine Learning can be a competitive advantage for any company, be it a top MNC or a startup, as tasks that are currently done manually will be done by machines tomorrow.
➢ The Machine Learning revolution will stay with us for a long time, and so will the future of Machine Learning.
2. Technology Learnt
2.1. Introduction to AI & Machine Learning
2.1.1. Definition of Artificial Intelligence
❖ Data Economy
✓ The world is witnessing a real-time flow of all types of structured and unstructured data from social media, communication, transportation, sensors, and devices.
✓ International Data Corporation (IDC) forecasts that 180 zettabytes of data will
be generated by 2025.
✓ This explosion of data has given rise to a new economy known as the Data
Economy.
✓ Data is the new oil that is precious but useful only when cleaned and processed.
✓ There is a constant battle for ownership of data between enterprises to derive
benefits from it.
❖ Define Artificial Intelligence
Artificial intelligence refers to the simulation of human intelligence in machines that are
programmed to think like humans and mimic their actions. The term may also be applied to
any machine that exhibits traits associated with a human mind such as learning and problem-
solving.
Machine Learning is an approach or subset of Artificial Intelligence that is based on the idea
that machines can be given access to data along with the ability to learn from it.
❖ Define Machine Learning
Machine learning is an application of artificial intelligence (AI) that provides systems the
ability to automatically learn and improve from experience without being explicitly
programmed. Machine learning focuses on the development of computer programs that can
access data and use it to learn for themselves.
❖ Features of Machine Learning
✓ Machine Learning is computing-intensive and generally requires a large amount of
training data.
✓ It involves repetitive training to improve the learning and decision making of
algorithms.
✓ As more data gets added, Machine Learning training can be automated for learning
new data patterns and adapting its algorithm.
2.1.3. Machine Learning Algorithms
❖ Traditional Programming vs. Machine Learning Approach
❖ Traditional Approach
Traditional programming relies on hard-coded rules.
❖ Machine Learning Approach
Machine Learning relies on learning patterns based on sample data.
❖ Image Processing
✓ Optical Character Recognition (OCR)
✓ Self-driving cars
✓ Image tagging and recognition
❖ Robotics
✓ Industrial robotics
✓ Human simulation
❖ Data Mining
✓ Association rules
✓ Anomaly detection
✓ Grouping and Predictions
❖ Video games
✓ Pokémon
✓ PUBG
❖ Text Analysis
✓ Spam Filtering
✓ Information Extraction
✓ Sentiment Analysis
❖ Healthcare
✓ Emergency Room & Surgery
✓ Research
✓ Medical Imaging & Diagnostics
2.2. Techniques of Machine Learning
2.2.1. Supervised Learning
❖ Define Supervised Learning
Supervised learning is the machine learning task of learning a function that maps an input to
an output based on example input-output pairs. It infers a function from labeled training data
consisting of a set of training examples.
In supervised learning, each example is a pair consisting of an input object (typically a vector)
and a desired output value (also called the supervisory signal).
❖ Supervised Learning Flow
✓ Data Preparation
Clean data
Label data (x, y)
Feature Engineering
Reserve 80% of data for Training (Train_X) and 20% for Evaluation
(Train_E)
✓ Training Step
Design algorithmic logic
Train the model with Train_X
Derive the relationship between x and y, that is, y = f(x)
✓ Evaluation or Test Step
Evaluate or test with Train_E
If accuracy score is high, you have the final learned algorithm y = f(x)
If accuracy score is low, go back to training step
✓ Production Deployment
Use the learned algorithm y = f(x) to predict production data.
The algorithm can be improved with more training data, more capacity, or an algorithm redesign.
✓ If the learning is poor, we have an underfitted situation. The algorithm will not
work well on test data. Retraining may be needed to find a better fit.
❖ Examples of Supervised Learning
✓ Voice Assistants
✓ Gmail Filters
✓ Weather Apps
❖ Types of Supervised Learning
✓ Classification
➢ Answers “What class?”
Here the task of the machine is to group unsorted information according to similarities, patterns, and differences without any prior training on the data.
❖ Types of Unsupervised Learning
✓ Clustering
The most common unsupervised learning method is cluster analysis. It is used to find data
clusters so that each cluster has the most closely matched data.
▪ Google Photos automatically detects the same person in multiple photos from a
vacation trip (clustering –unsupervised).
▪ One has to just name the person once (supervised), and the name tag gets
attached to that person in all the photos.
2.2.4. Reinforcement Learning
❖ Define Reinforcement Learning
Reinforcement Learning is a type of Machine Learning that allows the learning system to
observe the environment and learn the ideal behavior based on trying to maximize some notion
of cumulative reward.
It differs from supervised learning in that labelled input/output pairs need not be presented, and sub-optimal actions need not be explicitly corrected. Instead, the focus is on finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge).
❖ Features of Reinforcement Learning
• The learning system (agent) observes the environment, selects and takes certain
actions, and gets rewards in return (or penalties in certain cases).
• The agent learns the strategy or policy (choice of actions) that maximizes its
rewards over time.
❖ Example of Reinforcement Learning
• In a manufacturing unit, a robot uses deep reinforcement learning to identify a
device from one box and put it in a container.
• The robot learns this by means of a rewards-based learning system, which
incentivizes it for the right action.
2.2.5. Some Important Considerations in Machine Learning
❖ Bias & Variance Tradeoff
➢ Bias refers to error in the machine learning model due to wrong assumptions. A
high-bias model will underfit the training data.
➢ Variance refers to problems caused due to overfitting. This is a result of over-
sensitivity of the model to small variations in the training data. A model with
many degrees of freedom (such as a high-degree polynomial model) is likely to
have high variance and thus overfit the training data.
❖ Bias & Variance Dependencies
➢ Increasing a model’s complexity will reduce its bias and increase its variance.
➢ Conversely, reducing a model’s complexity will increase its bias and reduce its
variance. This is why it is called a tradeoff.
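A quick way to see this tradeoff is to fit polynomials of different degrees to the same noisy data; the setup below is synthetic and purely illustrative:
import numpy as np
rng = np.random.RandomState(0)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 20)   # noisy samples of a smooth curve
for degree in (1, 3, 9):
    coeffs = np.polyfit(x, y, degree)                # fit a polynomial of this degree
    y_fit = np.polyval(coeffs, x)
    error = np.mean((y - y_fit) ** 2)                # training error shrinks as the degree grows,
    print(degree, round(error, 4))                   # but very high degrees start fitting the noise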
❖ What is Representational Learning
In Machine Learning, representation refers to the way the data is presented. This often makes a huge difference in understanding.
❖ Types of Data
✓ Labelled Data or Training Data
✓ Unlabeled Data
✓ Test Data
✓ Validation Data
2.3.2. Feature Engineering
❖ Define Feature Engineering
The transformation stage in the data preparation process includes an important step known as
Feature Engineering.
Feature Engineering refers to selecting and extracting the right features from the data, that is, those relevant to the task and model in consideration.
❖ Aspects of Feature Engineering
✓ Feature Selection
Most useful and relevant features are selected from the available data
✓ Feature Addition
New features are created by gathering new data
✓ Feature Extraction
Existing features are combined to develop more useful ones
✓ Feature Filtering
Filter out irrelevant features to make the modelling step easy
✓ Standardization
▪ Standardization is a popular feature scaling method, which gives data
the property of a standard normal distribution (also known as Gaussian
distribution).
▪ All features are standardized on the normal distribution (a mathematical
model).
▪ The mean of each feature is centered at zero, and the feature column has
a standard deviation of one.
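In other words, each value is transformed as z = (x - mean) / standard deviation. A minimal sketch of this, either by hand with NumPy or with scikit-learn's StandardScaler (the small feature column is made up for illustration):
import numpy as np
from sklearn.preprocessing import StandardScaler
feature = np.array([[10.0], [12.0], [14.0], [16.0]])      # one made-up feature column
z_manual = (feature - feature.mean()) / feature.std()     # centre at zero, scale to unit variance
z_sklearn = StandardScaler().fit_transform(feature)       # the same operation via scikit-learn
print(z_manual.ravel())
print(z_sklearn.ravel())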
2.3.4. Datasets
➢ Machine Learning problems often need training or testing datasets.
➢ A dataset is a large repository of structured data.
➢ In many cases, it has input and output labels that assist in Supervised Learning.
2.3.5. Dimensionality Reduction with Principal Component Analysis
❖ Define Dimensionality Reduction
✓ Dimensionality reduction involves transformation of data to new dimensions in
a way that facilitates discarding of some dimensions without losing any key
information.
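A minimal PCA sketch with scikit-learn, reducing a made-up three-dimensional dataset to two dimensions, could look like this:
import numpy as np
from sklearn.decomposition import PCA
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))                 # made-up data with three features
pca = PCA(n_components=2)                     # keep only the two strongest directions
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                        # (100, 2)
print(pca.explained_variance_ratio_)          # how much variance each kept dimension retains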
✓ They are also the fundamental components of Random Forests, one of
the most powerful ML algorithms.
✓ Start at the tree root and split the data on the feature that results in the largest information gain (IG), as chosen by the decision algorithm.
2.5.2.2.5. Random Forest Classification
➢ Random decision forests correct for decision trees' habit of overfitting to
their training set.
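A small sketch contrasting a single decision tree with a random forest on the same data, using scikit-learn and a synthetic dataset purely for illustration:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
X, y = make_classification(n_samples=500, n_features=10, random_state=0)            # synthetic data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)                        # a single, possibly overfitted tree
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)   # many trees averaged together
print(tree.score(X_te, y_te), forest.score(X_te, y_te))                              # the forest usually generalizes better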
2.7.3. TensorFlow
❖ TensorFlow is the open source Deep Learning library provided by Google.
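As a minimal illustration (assuming TensorFlow 2.x with its bundled Keras API is installed), a tiny neural network could be defined like this:
import tensorflow as tf
# A very small fully connected network, defined with the Keras API bundled in TensorFlow 2.x
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()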