
Machine Learning Industrial Training

Machine Learning

Uploaded by

vegito gogeta
Copyright
© © All Rights Reserved

LIST OF FIGURES

CHAPTER 1: INTRODUCTION

1.1 Machine Learning
1.2 Artificial Intelligence and Deep Learning
1.3 Types of Machine Learning
1.4 Planning For Machine Learning
1.5 Commandments of AI and ML

CHAPTER 2: Machine Learning Cycle

2.1 Planning
2.2 Data Preparation
2.3 Data Pre-processing
2.4 Model Presentation

CHAPTER 3: Some Common Techniques and Algorithms

3.1 Linear Regression
3.2 Multivariate Linear Regression
3.3 Logistic Regression
3.4 Clustering Technique
3.5 K nearest Algorithm
3.6 Naive Bayes Classifier

CHAPTER 4: Python Library used

4.1 Numpy
4.2 Pandas
4.3 SciKit-learn
4.4 Matplotlib
4.5 Seaborn

CHAPTER 5: Report on Indian Diabetic Dataset

5.1 Summary
5.1.1 Importing Library
5.1.2 Importing Dataset
5.1.3 Planning and Cleaning the data
5.1.4 Creating train and test dataset for validation
5.1.5 Selection of Algorithm
5.1.6 Model Evaluation

CONCLUSIONS AND FUTURE SCOPE

REFERENCES
LIST OF FIGURES

1. Linear Regression

2. Flowchart of regression

3. Logistic Regression

4. Clustering technique

5. K nearest algorithm

6. Naive Bayes classifier

7. Feature plot of dataset

8. Heatmap and correlation of dataset vectors

9. Pairplot of dataset’s vector

10. Confusion matrix


CHAPTER - 1

1) INTRODUCTION

1.1) Machine Learning:

Machine learning is a sub-domain of computer science which evolved from the study of
pattern recognition in data, and also from the computational learning theory in artificial
intelligence. It is a field of computer science that gives computers the ability to learn without
being explicitly programmed.

“A computer program is said to learn from experience E with respect to some task T and
some performance measure P, if its performance on T, as measured by P, improves with
experience E.”

It is the first-class ticket to most interesting careers in data analytics today. As data sources
proliferate along with the computing power to process them, going straight to the data is one
of the most straightforward ways to quickly gain insights and make predictions. Machine
Learning can be thought of as the study of a list of sub-problems, viz: decision making,
clustering, classification, forecasting, deep-learning, inductive logic programming, support
vector machines, reinforcement learning, similarity and metric learning, genetic algorithms,
sparse dictionary learning, etc.

1.2) Artificial Intelligence And Deep Learning:

Artificial Intelligence - In computer science, artificial intelligence (AI), sometimes
called machine intelligence, is intelligence demonstrated by machines, in contrast to
the natural intelligence displayed by humans. Colloquially, the term "artificial intelligence"
is often used to describe machines (or computers) that mimic "cognitive" functions that
humans associate with the human mind, such as "learning" and "problem solving".

As machines become increasingly capable, tasks considered to require "intelligence" are
often removed from the definition of AI, a phenomenon known as the AI effect. A quip in
Tesler's Theorem says "AI is whatever hasn't been done yet." For instance, optical character
recognition is frequently excluded from things considered to be AI, having become a routine
technology.

Deep Learning - Deep learning is an artificial intelligence function that imitates the
workings of the human brain in processing data and creating patterns for use in decision
making. Deep learning is a subset of machine learning in artificial intelligence (AI) that has
networks capable of learning unsupervised from data that is unstructured or unlabeled. It is
also known as deep neural learning or deep neural networks.

1. Introduction
1.1. A Taste of Machine Learning
✓ Arthur Samuel, an American pioneer in the field of computer gaming and artificial
intelligence, coined the term "Machine Learning" in 1959.
✓ Over the past two decades Machine Learning has become one of the mainstays of
information technology.
✓ With the ever-increasing amounts of data becoming available there is good reason
to believe that smart data analysis will become even more pervasive as a necessary
ingredient for technological progress.
1.2. Relation to Data Mining

• Data mining uses many machine learning methods, but with different goals; on the
other hand, machine learning also employs data mining methods as "unsupervised
learning" or as a preprocessing step to improve learner accuracy.
1.3. Relation to Optimization

1.3) Types of Machine Learning:

ENSEMBLE LEARNING- Ensemble learning is the process by which multiple models,
such as classifiers or experts, are strategically generated and combined to solve a
particular computational intelligence problem. Ensemble learning is primarily used to
improve the performance of a model (classification, prediction, function approximation, etc.),
or to reduce the likelihood of an unfortunate selection of a poor one. Other applications of
ensemble learning include assigning a confidence to the decision made by the model,
selecting optimal (or near optimal) features, data fusion, incremental learning, nonstationary
learning and error correction. Ensemble learning is most often discussed for classification,
but the principal ideas generalize readily to function approximation and prediction
problems.

Supervised Learning - We know what we are trying to predict. We use examples for which
we (and the model) know the answer to “train” our model. It can then generate predictions
for examples whose answer we don’t know.

Examples: Predict the price a house will sell at. Identify the gender of someone based on a
photograph.

Unsupervised Learning- We don’t know what we are trying to predict. We are trying to
identify some naturally occurring patterns in the data which may be informative.

Examples: Try to identify “clusters” of customers based on data we have on them

Reinforcement Learning - Behavioural psychology is the core concept behind
reinforcement learning. Much as people are given rewards as incentives to produce better
results, the learner receives rewards for its actions and tries to maximize the cumulative
reward. It learns a policy, that is, a rule for the best way to act when an observation of
the real world is given.

Classification Learning Technique: In machine learning and statistics, classification is the
problem of identifying to which of a set of categories (sub-populations) a new observation
belongs, on the basis of a training set of data containing observations (or instances) whose
category membership is known.
CHAPTER - 2

2) Machine Learning Cycle:

The various stages in the Machine Learning methodology are

● Planning

● Data Preparation

● Modelling

● Model Presentation

2.1) Planning

The planning phase is the most important phase of the machine learning methodology, because
a project with poor planning rarely succeeds. The purpose of planning is to gather the
important information and facts about the selected target. This information can then be
applied in practice to reach the required position and to obtain the important data.

Planning has two phases: first, defining goals, and second, developing the project charter.

Defining Goals - The journey always begins with three questions: what are we doing, why are
we doing it, and how are we doing it?

⚫ What is the business expecting from the project?
⚫ Why is the management spending its resources on this project?
⚫ How will this project drive the business in the next financial year?

Answering these questions gets the job of planning half done and makes the expected outcome
clear.

The goals of the various stakeholders involved in the project should be understood, and the
set goals are agreed upon at this level. In this initial phase the emphasis is mainly on
organizing resources and developing a communication matrix between the stakeholders so that
the teams coordinate smoothly. This phase is initiated and shouldered by the senior project
manager, as it involves clear communication and information exchange with the client.

Developing Project Charter:

It contains the following points:
⚫ Comprehensible research goal
⚫ Project Mission and Context
⚫ Mechanisms to perform data analysis
⚫ Organising technical resources
⚫ Proof of concepts
⚫ Measure of Success
⚫ Schedules and Time Lines

2.2) Data Preparation

It consists of three steps:

1. Data Retrieval
2. Data Cleansing
3. Data Exploration and Refining

Data Retrieval - Extracting and retrieving the dataset on which we are going to work and
build the model.

Data Cleansing - Removing errors from the data retrieved in the previous phase.

Errors of the following kinds are removed from the data in this phase-

1. Mistakes during entry


2. Redundant space
3. Impossible values
4. Missing values
5. Outliers
6. Deviation from code book
7. Different unit of measurements
8. Different level of aggregation.

Possible solutions for these include-

1. Manual overrules
2. Use string functions
3. Manual overrules
4. Replace with another value
5. Treat as missing value
6. Omit the values
7. Set value to null
8. Recalculate for same unit
9. Bring same level of aggregation

Data Exploration and Refining-

This phase takes a deep dive into understanding the data. Using various graphical and
statistical techniques, the data scientist tries to understand, for example, whether the data
is normally distributed. Correlation functions are used to learn how the variables are
related to each other. The visualization techniques used can be as simple as a histogram or
as complex as a pairplot.

2.3) Data Preprocessing-

It has the following steps-

1. Model Creation
2. Model Validation
3. Model Evaluation

Model Creation - With clean data and an understanding of its content, we can build models with
the goal of making better predictions, classifying objects or gaining an understanding of
the system. This phase is the most important part of the cycle, because a better choice of
algorithm and model leads to better accuracy of the model and of its predictions.
Choosing an algorithm and model that suit the dataset is therefore the main lever for
reaching acceptable accuracy.

Model Validation -

A model can be validated using various measures-

1. Adjusted R square
2. Standard Error
3. P-Value
4. Z-value
5. MAPE- Mean absolute percentage error
6. MSE- mean squared error
7. RMSE - Root mean square error
8. MASE - Mean absolute scaled error
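
Three of the error measures listed above (MSE, RMSE and MAPE) can be sketched in a few lines of plain Python. This is only an illustration: the function names and the sample values are made up for the example and do not come from the report.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error: average of the squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean square error: square root of the MSE."""
    return math.sqrt(mse(y_true, y_pred))

def mape(y_true, y_pred):
    """Mean absolute percentage error (undefined if a true value is zero)."""
    ratios = [abs((t - p) / t) for t, p in zip(y_true, y_pred)]
    return 100.0 * sum(ratios) / len(ratios)

# Hypothetical true values and predictions, chosen only for illustration
y_true = [100.0, 200.0, 400.0]
y_pred = [110.0, 180.0, 400.0]

mse_value = mse(y_true, y_pred)
rmse_value = rmse(y_true, y_pred)
mape_value = mape(y_true, y_pred)
```

Lower values are better for all three measures; RMSE is in the same unit as the target, while MAPE is a percentage.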

Model Evaluation-

Popular Methods of Evaluating models in data science are-

1. Hold out method


2. Cross-validation method

Hold-out method -

In this method the dataset is divided into three subsets-

⚫ Training set is a subset of the dataset used to build the model.
⚫ Validation set is a subset of the dataset used to assess the performance of the model built
in the training phase. It provides a test platform for fine-tuning the model’s parameters and
selecting the best performing model.
⚫ Test set is a subset of the dataset used to estimate the future performance of the model.
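
A three-way hold-out split can be sketched as follows. The 60/20/20 proportions, the function name `holdout_split` and the seed are assumptions made for this example; the report does not state the proportions it used.

```python
import random

def holdout_split(rows, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle the rows and cut them into training, validation and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # whatever remains is the test set
    return train, val, test

# 100 hypothetical row identifiers
train, val, test = holdout_split(range(100))
```

The three subsets do not overlap, so the test set stays unseen until the final check.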

Cross-validation method -

When only a limited amount of data is available, we use k-fold cross-validation to obtain a
less biased estimate of the model performance. We divide the data into k subsets of equal
size. We build models k times, each time leaving out one of the subsets from training and
using it as the test set. The k test scores are then averaged.
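
The fold construction described above can be sketched without any library. The helper name `k_fold_indices` is hypothetical; it only produces the index lists, and a model would be trained and scored once per fold.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs, one pair per fold."""
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    indices = list(range(n_samples))
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

# 10 samples split into 5 folds of 2 samples each
folds = list(k_fold_indices(10, 5))
```

Every sample appears in exactly one test fold, which is what lets the averaged score use all of the limited data.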

2.4) Model Presentation -

1. Preparing the presentation, which showcases the results of the analysis or of the
model we developed and how well they address the business problem
2. Deploying the model based on the preference
3. Scaling the model based on the issues that emerge
CHAPTER - 3

3) Some Common Techniques and Algorithms-

3.1) Linear Regression-

A Supervised Learning Algorithm that learns from a set of training samples.

It estimates the relationship between a dependent variable (target/label) and one or more
independent variables (predictors).
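
For a single predictor, the relationship can be estimated with the ordinary least squares formulas. The sketch below assumes one independent variable and uses made-up data points; the function name is hypothetical.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (intercept, slope) of the least squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical samples that lie exactly on the line y = 1 + 2x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
intercept, slope = fit_simple_linear_regression(xs, ys)
```

On real data the points do not lie exactly on a line, and the fitted line is the one with the smallest sum of squared residuals.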
3.2) Multivariate Linear Regression

A natural generalization of the simple linear regression model is the situation in which more
than one independent variable influences the dependent variable, again with a
linear relationship (strictly, mathematically speaking, this is virtually the same model). Such
a regression model is called the multiple linear regression model. The dependent variable is
denoted by y, x1, x2,…,xn are the independent variables, and β0, β1,…, βn denote the coefficients.
Although multiple regression is analogous to regression between two random variables,
developing the model is more complex in this case. First of all, we might not put all
available independent variables into the model; instead, among m>n candidates we
choose the n variables with the greatest contribution to the model accuracy. In general we
aim to develop as simple a model as possible, so a variable with a small contribution is
usually not included in the model.
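
Written out with the symbols defined above, the multiple linear regression model takes the following form. The error term ε is an addition for completeness and is not named in the text.

```latex
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \varepsilon
```

With n = 1 this reduces to the simple linear regression model of section 3.1.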

3.3) Logistic Regression

Like all regression analyses, the logistic regression is a predictive analysis. Logistic
regression is used to describe data and to explain the relationship between one dependent
binary variable and one or more nominal, ordinal, interval or ratio-level independent
variables.
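
One common way to picture how logistic regression produces a binary prediction is the sigmoid function applied to a weighted sum of the predictors. In the sketch below the weights and bias are hypothetical, fixed values; in practice they are learned from training data.

```python
import math

def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of the positive class for one observation."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

def predict(features, weights, bias, threshold=0.5):
    """Binary decision: 1 if the probability reaches the threshold, else 0."""
    return 1 if predict_proba(features, weights, bias) >= threshold else 0

# Hypothetical coefficients, chosen only for illustration
weights = [2.0, -1.0]
bias = -0.5

label_a = predict([1.0, 0.5], weights, bias)  # weighted sum is 1.0
label_b = predict([0.0, 1.0], weights, bias)  # weighted sum is -1.5
```

The 0.5 threshold is a conventional default and can be moved when false positives and false negatives have different costs.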
3.4) Clustering Technique

Clustering is a Machine Learning technique that involves the
grouping of data points. ... In theory, data points that are in the same group should have
similar properties and/or features, while data points in different groups should have highly
dissimilar properties and/or features. K-means is the most used algorithm for this technique.
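
A bare-bones K-means loop might look like the sketch below. It assumes the initial centroids are supplied by the caller and runs a fixed number of iterations; library implementations use smarter initialization and stop on convergence. The points are made up for the example.

```python
def kmeans(points, centroids, n_iter=10):
    """Alternate between assigning points to the nearest centroid and updating centroids."""
    for _ in range(n_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            # Squared Euclidean distance to every centroid
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # New centroid = mean of the assigned points (unchanged if the cluster is empty)
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Two well separated hypothetical groups of 2-D points
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = kmeans(points, centroids=[points[0], points[3]])
```

Each returned centroid is the mean of the points in its group, so points in the same group are close to a common centre.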
3.5) K nearest Algorithm-

The k-nearest neighbors (KNN) algorithm is a simple, easy-to-implement supervised machine
learning algorithm that can be used to solve both classification and regression problems.

The KNN algorithm assumes that similar things exist in close proximity. In other words,
similar things are near to each other. KNN captures the idea of similarity (sometimes called
distance, proximity, or closeness) with some mathematics calculating the distance between
points on a graph.

There are other ways of calculating distance, and one way might be preferable depending on
the problem we are solving. However, the straight-line distance (also called the Euclidean
distance) is a popular and familiar choice.
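
The idea can be sketched as a small classifier that uses the Euclidean distance and a majority vote among the k closest training points. The training points, labels and the choice k=3 are assumptions made for illustration.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Return the most common label among the k training points nearest to the query."""
    distances = sorted(
        (math.dist(p, query), label) for p, label in zip(train_points, train_labels)
    )
    top_k = [label for _, label in distances[:k]]
    return Counter(top_k).most_common(1)[0][0]

# Hypothetical labelled points forming two groups
train_points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
train_labels = ["a", "a", "a", "b", "b", "b"]

near_first_group = knn_predict(train_points, train_labels, (1.5, 1.5))
near_second_group = knn_predict(train_points, train_labels, (8.5, 8.5))
```

For regression the same neighbours would be used, with the vote replaced by the mean of their target values.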
CHAPTER - 4

4) Python Library used-

4.1) NUMPY-

⚫ Introduces objects for multidimensional arrays and matrices, as well as functions that
allow to easily perform advanced mathematical and statistical operations on those objects
⚫ provides vectorization of mathematical operations on arrays and matrices which
significantly improves the performance
⚫ many other python libraries are built on NumPy.

4.2) Pandas:

⚫ Adds data structures (Series and DataFrame) and tools designed to work with table-like data
(similar to data frames in R)
⚫ provides tools for data manipulation: reshaping, merging, sorting, slicing, aggregation

⚫ Allows handling missing data

4.3) SciKit-Learn:

⚫ provides machine learning algorithms: classification, regression, clustering, model

⚫ Most important tools for the machine learning.


⚫ built on NumPy, SciPy and matplotlib

4.4) Matplotlib:

⚫ python 2D plotting library which produces publication quality figures in a variety of

⚫ a set of functionalities similar to those of MATLAB line plots, scatter plots, barcharts,
histograms, pie charts etc.
⚫ relatively low-level; some effort needed to create advanced visualization
4.5) Seaborn:



⚫ Similar (in style) to the popular ggplot2 library in R
CHAPTER - 5

5) Report on Indian Diabetic Dataset -

Dataset - The dataset used is a sample of patients from a population at high risk of
diabetes; despite the name, the patients are of Pima Indian heritage rather than from
India. The dataset that was used for this project is a subset of a much larger dataset,
and has the following feature vectors:

DESCRIPTION
Predict the onset of diabetes based on diagnostic measures.
SUMMARY

This dataset is originally from the National Institute of Diabetes and Digestive and
Kidney Diseases. The objective is to predict based on diagnostic measurements whether a
patient has diabetes.

Several constraints were placed on the selection of these instances from a larger database.
In particular, all patients here are females at least 21 years old of Pima Indian heritage.

Pregnancies: Number of times pregnant

Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test

BloodPressure: Diastolic blood pressure (mm Hg)

SkinThickness: Triceps skin fold thickness (mm)

Insulin: 2-Hour serum insulin (mu U/ml)

BMI: Body mass index (weight in kg/(height in m)^2)

DiabetesPedigreeFunction: Diabetes pedigree function

Age: Age (years)

Outcome: Class variable (0 or 1)


Selected attributes: 1,2,3,4,5,6,7,8 : 9

At first sight all factors appear important after we do the PCA. However, the last feature
(Age) was deemed unworthy by the PCA implementation, which made little sense to us, as age
is highly correlated with most diseases. We therefore continued our investigation using
another attribute selector, the Significance Attribute Evaluator.

We then used a pairplot to understand which vectors are more significant than others. The
outcome was-

Hence we concluded that we could use all the vectors to compute the prediction.

data.isnull().sum()
Pregnancies 0
Glucose 0
BloodPressure 0
SkinThickness 0
Insulin 0
BMI 0
DiabetesPedigreeFunction 0
Age 0
Outcome 0

4. Creating the Train and test dataset for validation -

from sklearn.model_selection import train_test_split

# X and y are assumed to hold the eight feature columns and the Outcome column of `data`
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

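The features are on different scales, so we standardize the data using the sklearn
library.

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

# Learn the mean and standard deviation from the training set, then scale it
X_train = sc.fit_transform(X_train)

# Apply the same training-set statistics to the test set
X_test = sc.transform(X_test)

fit_transform fits the scaler on the training data and transforms it; transform then applies
the same scaling to the test data.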

5. Selection of Algorithm-

Since this is a classification problem, we use logistic regression to solve it.

from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression(random_state=0)

classifier.fit(X_train, y_train)
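
Once a classifier is fitted, its predictions on the test set can be summarized in a confusion matrix and an accuracy score. The sketch below is self-contained and uses made-up label lists; in the pipeline above the predictions would presumably come from `classifier.predict(X_test)` and be compared with `y_test`.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels 0 and 1."""
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return tn, fp, fn, tp

def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    tn, fp, fn, tp = confusion_counts(y_true, y_pred)
    return (tn + tp) / (tn + fp + fn + tp)

# Hypothetical labels, not results from the report
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
acc = accuracy(y_true, y_pred)
```

The four counts show which kind of mistake the model makes, which a single accuracy figure hides.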
Conclusion-

We conclude that the dataset is not a complete space, and there are still other feature vectors
missing from it. What we were attempting to generalize is a subspace of the actual input
space, where the other dimensions are not known, and hence none of the classifiers were able
to do better than 71.6% . In the future, if similar studies are conducted to generate the dataset
used in this report, more feature vectors need to be calculated so that the classifiers can form
a better idea of the problem at hand.

Machine learning also has intimate ties to optimization: many learning problems
are formulated as minimization of some loss function on a training set of examples.
Loss functions express the discrepancy between the predictions of the model being
trained and the actual problem instances.
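
As a small illustration of learning as loss minimization, the sketch below fits a one-parameter model y = w * x by gradient descent on the mean squared error. The data, learning rate and step count are assumptions chosen so that the example converges quickly.

```python
def mse_loss(w, xs, ys):
    """Mean squared discrepancy between predictions w * x and targets y."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def mse_gradient(w, xs, ys):
    """Derivative of the MSE loss with respect to w."""
    return 2.0 * sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def gradient_descent(xs, ys, learning_rate=0.05, n_steps=200):
    """Start from w = 0 and repeatedly step against the gradient."""
    w = 0.0
    for _ in range(n_steps):
        w -= learning_rate * mse_gradient(w, xs, ys)
    return w

# Hypothetical training set generated by y = 2x
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
w = gradient_descent(xs, ys)
final_loss = mse_loss(w, xs, ys)
```

The loss falls at every step here because the learning rate is small enough for this data; too large a rate would make the updates diverge.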
1.4. Relation to Statistics

Michael I. Jordan suggested the term data science as a placeholder to call the overall
field.
Leo Breiman distinguished two statistical modelling paradigms: data model and
algorithmic model, wherein "algorithmic model" means more or less the machine
learning algorithms like Random forest.
1.5. Future of Machine Learning
➢ Machine Learning can be a competitive advantage to any company, be it a top MNC
or a startup, as things that are currently being done manually will increasingly be done
by machines.
➢ The Machine Learning revolution is expected to stay with us for a long time.
2. Technology Learnt
2.1. Introduction to AI & Machine Learning
2.1.1. Definition of Artificial Intelligence
❖ Data Economy
✓ The world is witnessing a real-time flow of all types of structured and unstructured data from
social media, communication, transportation, sensors, and devices.
✓ International Data Corporation (IDC) forecasts that 180 zettabytes of data will
be generated by 2025.

✓ This explosion of data has given rise to a new economy known as the Data
Economy.
✓ Data is the new oil that is precious but useful only when cleaned and processed.
✓ There is a constant battle for ownership of data between enterprises to derive
benefits from it.
❖ Define Artificial Intelligence
Artificial intelligence refers to the simulation of human intelligence in machines that are
programmed to think like humans and mimic their actions. The term may also be applied to
any machine that exhibits traits associated with a human mind such as learning and problem-
solving.

2.1.2. Definition of Machine Learning


❖ Relationship between AI and ML

Machine Learning is an approach or subset of Artificial Intelligence that is based on the idea
that machines can be given access to data along with the ability to learn from it.

❖ Define Machine Learning
Machine learning is an application of artificial intelligence (AI) that provides systems the
ability to automatically learn and improve from experience without being explicitly
programmed. Machine learning focuses on the development of computer programs that can
access data and use it to learn for themselves.
❖ Features of Machine Learning
✓ Machine Learning is computing-intensive and generally requires a large amount of
training data.
✓ It involves repetitive training to improve the learning and decision making of
algorithms.
✓ As more data gets added, Machine Learning training can be automated for learning
new data patterns and adapting its algorithm.
2.1.3. Machine Learning Algorithms
❖ Traditional Programming vs. Machine Learning Approach

❖ Traditional Approach
Traditional programming relies on hard-coded rules.

❖ Machine Learning Approach
Machine Learning relies on learning patterns based on sample data.

❖ Machine Learning Techniques


✓ Machine Learning uses a number of theories and techniques from Data Science.

✓ Machine Learning can learn from labelled data (known as supervised learning) or
unlabeled data (known as unsupervised learning).
2.1.4. Applications of Machine Learning

❖ Image Processing
✓ Optical Character Recognition (OCR)
✓ Self-driving cars
✓ Image tagging and recognition
❖ Robotics
✓ Industrial robotics
✓ Human simulation
❖ Data Mining
✓ Association rules
✓ Anomaly detection
✓ Grouping and Predictions
❖ Video games
✓ Pokémon
✓ PUBG
❖ Text Analysis
✓ Spam Filtering
✓ Information Extraction
✓ Sentiment Analysis
❖ Healthcare
✓ Emergency Room & Surgery
✓ Research
✓ Medical Imaging & Diagnostics
2.2. Techniques of Machine Learning
2.2.1. Supervised Learning
❖ Define Supervised Learning
Supervised learning is the machine learning task of learning a function that maps an input to
an output based on example input-output pairs. It infers a function from labeled training data
consisting of a set of training examples.

In supervised learning, each example is a pair consisting of an input object (typically a vector)
and a desired output value (also called the supervisory signal).
❖ Supervised Learning Flow
✓ Data Preparation
Clean data
Label data (x, y)
Feature Engineering
Reserve 80% of data for Training (Train_X) and 20% for Evaluation
(Train_E)
✓ Training Step
Design algorithmic logic
Train the model with Train_X
Derive the relationship between x and y, that is, y = f(x)
✓ Evaluation or Test Step
Evaluate or test with Train_E
If accuracy score is high, you have the final learned algorithm y = f(x)
If accuracy score is low, go back to training step
✓ Production Deployment
Use the learned algorithm y = f(x) to predict production data.
The algorithm can be improved by more training data, capacity, or algo redesign.

❖ Testing the Algorithms


✓ Once the algorithm is trained, test it with test data (a set of data instances that
do not appear in the training set).
✓ A well-trained algorithm can predict well for new test data.

✓ If the learning is poor, we have an underfitted situation. The algorithm will not
work well on test data. Retraining may be needed to find a better fit.

✓ If learning on training data is too intensive, it may lead to overfitting, a situation
where the algorithm is not able to handle new testing data that it has not seen
before. The technique used to keep the model general is called regularization.
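
One common form of regularization is an L2 (ridge) penalty on the model weights. The text does not say which form is meant, so the sketch below is only one possibility; it uses a one-parameter model without intercept and made-up data.

```python
def fit_ridge_1d(xs, ys, penalty):
    """Weight w that minimizes sum((y - w * x)^2) + penalty * w^2."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + penalty)

# Hypothetical training data generated by y = 2x
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

w_unregularized = fit_ridge_1d(xs, ys, penalty=0.0)
w_regularized = fit_ridge_1d(xs, ys, penalty=14.0)
```

The penalty pulls the weight toward zero, trading a slightly worse fit on the training data for a model that reacts less to noise in it.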

❖ Examples of Supervised Learning
✓ Voice Assistants
✓ Gmail Filters
✓ Weather Apps
❖ Types of Supervised Learning

✓ Classification
➢ Answers “What class?”

Here the task of machine is to group unsorted information according to similarities, patterns
and differences without any prior training of data.
❖ Types of Unsupervised Learning

✓ Clustering
The most common unsupervised learning method is cluster analysis. It is used to find data
clusters so that each cluster has the most closely matched data.

▪ Google Photos automatically detects the same person in multiple photos from a
vacation trip (clustering –unsupervised).
▪ One has to just name the person once (supervised), and the name tag gets
attached to that person in all the photos.
2.2.4. Reinforcement Learning
❖ Define Reinforcement Learning
Reinforcement Learning is a type of Machine Learning that allows the learning system to
observe the environment and learn the ideal behavior based on trying to maximize some notion
of cumulative reward.

It differs from supervised learning in that labelled input/output pairs need not be presented, and
sub-optimal actions need not be explicitly corrected. Instead the focus is finding a balance
between exploration (of uncharted territory) and exploitation (of current knowledge)
❖ Features of Reinforcement Learning
• The learning system (agent) observes the environment, selects and takes certain
actions, and gets rewards in return (or penalties in certain cases).
• The agent learns the strategy or policy (choice of actions) that maximizes its
rewards over time.
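
The balance between exploration and exploitation can be illustrated with an epsilon-greedy agent on a multi-armed bandit, one of the simplest reinforcement learning settings. The fixed reward per action, the epsilon value and the function name are assumptions made for this example.

```python
import random

def epsilon_greedy(arm_rewards, n_steps=1000, epsilon=0.1, seed=0):
    """Estimate the value of each action while mostly choosing the best one so far."""
    rng = random.Random(seed)
    n_arms = len(arm_rewards)
    estimates = [0.0] * n_arms
    counts = [0] * n_arms
    total_reward = 0.0

    def pull(arm):
        nonlocal total_reward
        reward = arm_rewards[arm]  # the environment returns a reward for the action
        counts[arm] += 1
        # Running average of the rewards seen for this action
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward

    for arm in range(n_arms):      # try every action once to initialize the estimates
        pull(arm)
    for _ in range(n_steps):
        if rng.random() < epsilon:
            pull(rng.randrange(n_arms))                            # explore
        else:
            pull(max(range(n_arms), key=lambda a: estimates[a]))   # exploit
    return estimates, total_reward

# Three hypothetical actions with fixed rewards
estimates, total_reward = epsilon_greedy([0.1, 0.5, 0.9])
best_arm = max(range(len(estimates)), key=lambda a: estimates[a])
```

The resulting policy is simply "choose the action with the highest estimated reward", with occasional random actions so that better options are not missed.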
❖ Example of Reinforcement Learning
• In a manufacturing unit, a robot uses deep reinforcement learning to identify a
device from one box and put it in a container.
• The robot learns this by means of a rewards-based learning system, which
incentivizes it for the right action.
2.2.5. Some Important Considerations in Machine Learning
❖ Bias & Variance Tradeoff
➢ Bias refers to error in the machine learning model due to wrong assumptions. A
high-bias model will underfit the training data.
➢ Variance refers to problems caused due to overfitting. This is a result of
over-sensitivity of the model to small variations in the training data. A model with
many degrees of freedom (such as a high-degree polynomial model) is likely to
have high variance and thus overfit the training data.
❖ Bias & Variance Dependencies
➢ Increasing a model’s complexity will reduce its bias and increase its variance.

➢ Conversely, reducing a model’s complexity will increase its bias and reduce its
variance. This is why it is called a tradeoff.
❖ What is Representational Learning
In Machine Learning, Representation refers to the way the data is presented. This often makes
a huge difference in understanding.

2.3. Data Preprocessing


2.3.1. Data Preparation
❖ Data Preparation Process
✓ Machine Learning depends largely on training data.
✓ Data preparation involves data selection, filtering, transformation, etc.

✓ Data preparation is a crucial step to make data suitable for ML.


✓ A large amount of data is generally required for the most common forms of
ML.

❖ Types of Data
✓ Labelled Data or Training Data

✓ Unlabeled Data
✓ Test Data
✓ Validation Data
2.3.2. Feature Engineering
❖ Define Feature Engineering
The transformation stage in the data preparation process includes an important step known as
Feature Engineering.

Feature Engineering refers to selecting and extracting right features from the data that are
relevant to the task and model in consideration.
❖ Aspects of Feature Engineering
✓ Feature Selection
Most useful and relevant features are selected from the available data
✓ Feature Addition

New features are created by gathering new data
✓ Feature Extraction
Existing features are combined to develop more useful ones
✓ Feature Filtering
Filter out irrelevant features to make the modelling step easy

2.3.3. Feature Scaling


❖ Define Feature Scaling
✓ Feature scaling is an important step in the data transformation stage of data
preparation process.
✓ Feature Scaling is a method used in Machine Learning to standardize the range of the
independent variables (features) of the data.
❖ Techniques of Feature Scaling

✓ Standardization
▪ Standardization is a popular feature scaling method, which rescales each feature so
that it has the mean and standard deviation of a standard normal distribution (a
Gaussian distribution with mean zero and standard deviation one).
▪ Every feature is rescaled in the same way; the shape of its distribution is not
changed.
▪ The mean of each feature is centered at zero, and the feature column has
a standard deviation of one.
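
Standardization of a single feature column can be sketched with the standard library. The sample values are made up, and the population standard deviation is used here, which is an assumption about the intended convention.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Rescale values to mean zero and standard deviation one: z = (x - mean) / std."""
    mean = fmean(values)
    std = pstdev(values)  # population standard deviation
    return [(v - mean) / std for v in values]

# Hypothetical feature column with mean 5 and standard deviation 2
scaled = standardize([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

After scaling, features measured in different units become comparable, which matters for distance-based and gradient-based algorithms.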

2.3.4. Datasets
➢ Machine Learning problems often need training or testing datasets.
➢ A dataset is a large repository of structured data.
➢ In many cases, it has input and output labels that assist in Supervised Learning.
2.3.5. Dimensionality Reduction with Principal Component Analysis
❖ Define Dimensionality Reduction
✓ Dimensionality reduction involves transformation of data to new dimensions in
a way that facilitates discarding of some dimensions without losing any key
information.

❖ Define Principal Component Analysis (PCA)


✓ Principal component analysis (PCA) is a technique for dimensionality reduction
that helps in arriving at better visualization models.
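
A minimal PCA can be sketched with NumPy's singular value decomposition. The function name `pca` and the toy data are assumptions for the example; the data lies exactly on a line, so one component is enough to describe it.

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its first principal components and report the variance they explain."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing singular value
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained_variance_ratio = (S ** 2) / np.sum(S ** 2)
    projected = X_centered @ Vt[:n_components].T
    return projected, explained_variance_ratio[:n_components]

# Hypothetical 2-D data in which the second column is twice the first
t = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
X = np.column_stack([t, 2.0 * t])
projected, ratio = pca(X, n_components=1)
```

Because the two columns are perfectly correlated, the first component carries all of the variance and the second dimension can be discarded without losing information.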

✓ They are also the fundamental components of Random Forests, one of
the most powerful ML algorithms.

✓ Start at the tree root and split the data on the feature that, according to the
decision algorithm, results in the largest information gain (IG).
2.5.2.2.5. Random Forest Classification
➢ Random decision forests correct for decision trees' habit of overfitting to
their training set.

➢ Random forests or random decision forests are an ensemble learning method
for classification, regression and other tasks that operates by constructing a
multitude of decision trees at training time and outputting the class that is
the mode of the classes (classification) or mean prediction (regression) of
the individual trees.
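
The final aggregation step described above can be sketched on its own. The per-tree outputs below are made up; training the individual trees is not shown.

```python
from statistics import fmean, mode

def aggregate_classification(tree_votes):
    """Forest prediction for classification: the most common class among the trees."""
    return mode(tree_votes)

def aggregate_regression(tree_outputs):
    """Forest prediction for regression: the mean of the tree outputs."""
    return fmean(tree_outputs)

# Hypothetical outputs of five classification trees and three regression trees
votes = ["diabetic", "healthy", "diabetic", "diabetic", "healthy"]
outputs = [1.0, 2.0, 3.0]

forest_class = aggregate_classification(votes)
forest_value = aggregate_regression(outputs)
```

Averaging over many trees is what dampens the overfitting of any single tree.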

2.7.3. TensorFlow
❖ TensorFlow is the open source Deep Learning library provided by Google.

❖ It allows development of a variety of neural network applications such as computer
vision, speech processing, or text recognition.
❖ It uses data flow graphs for numerical computations.
3. Reason for choosing Machine Learning
➢ Learning machine learning brings in better career opportunities
✓ Machine learning is the shining star of the moment.
✓ With every industry looking to apply AI in its domain, studying machine learning
opens a world of opportunities to develop cutting-edge machine learning
applications in various verticals, such as cyber security, image recognition,
medicine, or face recognition.
✓ With several machine learning companies on the verge of hiring skilled ML
engineers, it is becoming the brain behind business intelligence.
➢ Machine Learning Jobs on the rise
✓ Major hiring is happening in all top tech companies, which are in search of that
special kind of people (machine learning engineers) who can build a hammer
(machine learning algorithms).
✓ The job market for machine learning engineers is not just hot, it is sizzling.
✓ Machine Learning Jobs on Indeed.com - 2,500+ (India) & 12,000+ (US)
