CS601 - Machine Learning - Unit 1 - Notes - 1672759748
A Machine Learning process begins by feeding the machine lots of data. Using this data, the
machine is trained to detect hidden insights and trends. These insights are then used to
build a Machine Learning model by applying an algorithm in order to solve a problem.
Scope
• Increase in Data Generation: Due to the excessive production of data, we need a method that
can be used to structure, analyse and draw useful insights from data. This is where
Machine Learning comes in. It uses data to solve problems and find solutions to the
most complex tasks faced by organizations.
• Improve Decision Making: By making use of various algorithms, Machine Learning can
be used to make better business decisions.
• Uncover patterns & trends in data: Finding hidden patterns and extracting key insights
from data is the most essential part of Machine Learning. By building predictive models
and using statistical techniques, Machine Learning allows you to dig beneath the surface
and explore the data at a minute scale. Understanding data and extracting patterns
manually will take days, whereas Machine Learning algorithms can perform such
computations in less than a second.
• Solve complex problems: Machine Learning can be used to solve the most complex problems,
such as building self-driving cars.
Limitations
Machine Learning still raises a number of open questions and faces practical limitations:
1. What algorithms exist for learning general target functions from specific training
examples?
2. In what settings will a particular algorithm converge to the desired function, given sufficient
training data?
3. Which algorithm performs best for which types of problems and representations?
4. How much training data is sufficient?
5. When and how can prior knowledge held by the learner guide the process of
generalizing from examples?
6. What is the best way to reduce the learning task to one or more function approximation
problems?
7. Machine learning algorithms require massive stores of training data.
8. Labeling training data is a tedious process.
9. Machines cannot explain themselves.
Regression
Regression models are used to predict a continuous value. Predicting the price of a house given
features of the house, such as its size and number of rooms, is one of the common examples of
Regression. It is a supervised technique.
Types of Regression
1. Simple Linear Regression
2. Polynomial Regression
3. Support Vector Regression
4. Decision Tree Regression
5. Random Forest Regression
Simple Linear Regression
This is one of the most common and interesting types of Regression technique. Here we predict
a target variable Y based on the input variable X. A linear relationship should exist between
the target variable and the predictor, and from this comes the name Linear Regression.
Consider predicting the salary of an employee based on his/her age. We can easily identify that
there seems to be a correlation between an employee's age and salary (the greater the age, the
higher the salary). The hypothesis of linear regression is: Y = a + bX
Y represents salary, X is the employee's age, and a and b are the coefficients of the equation. So in
order to predict Y (salary) given X (age), we need to know the values of a and b (the model's
coefficients).
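As a minimal sketch of how a and b can be estimated, the ordinary least squares formulas can be applied with NumPy; the ages and salaries below are made-up illustrative numbers, not data from these notes.

# A minimal sketch of simple linear regression Y = a + bX using NumPy.
# The age and salary values below are made-up illustrative numbers.
import numpy as np

X = np.array([25, 30, 35, 40, 45, 50], dtype=float)   # employee ages
Y = np.array([30, 38, 45, 52, 60, 67], dtype=float)   # salaries (in thousands)

# Ordinary least squares estimates of the coefficients:
# b = cov(X, Y) / var(X),  a = mean(Y) - b * mean(X)
b = np.cov(X, Y, bias=True)[0, 1] / np.var(X)
a = Y.mean() - b * X.mean()

print(f"Y = {a:.2f} + {b:.2f} * X")
print("Predicted salary at age 42:", a + b * 42)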
Probability
Probability is an intuitive concept. We use it on a daily basis without necessarily realising that
we are talking about and applying probability.
Life is full of uncertainties. We don’t know the outcomes of a particular situation until it
happens. Will it rain today? Will I pass the next math test? Will my favourite team win the toss?
Will I get a promotion in the next 6 months? All these questions are examples of the uncertain
situations we live in. Let us map them to a few common terms:
• Experiment: an uncertain situation, which could have multiple outcomes. Whether
it rains on a daily basis is an experiment.
• Outcome: the result of a single trial. So, if it rains today, the outcome of today's trial
from the experiment is "It rained".
• Event: one or more outcomes from an experiment. "It rained" is one of the possible
events for this experiment.
• Probability: a measure of how likely an event is. So, if there is a 60% chance that it will rain
tomorrow, the probability of the outcome "it rained" for tomorrow is 0.6.
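As a small illustrative sketch, the 60%-chance-of-rain example can be simulated in Python, assuming rain on each simulated day is an independent event with probability 0.6.

# A small sketch relating the 60%-chance-of-rain example to simulation.
# Assumes rain on each simulated day is an independent event with P = 0.6.
import random

p_rain = 0.6
trials = 100_000

# Each trial is one run of the "will it rain?" experiment;
# the outcome "it rained" occurs when the random draw falls below p_rain.
rained = sum(random.random() < p_rain for _ in range(trials))

print("Estimated probability of the event 'it rained':", rained / trials)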
Statistics
Machine learning and statistics are two tightly related fields of study. So much so that
statisticians refer to machine learning as “applied statistics” or “statistical learning” rather than
the computer-science-centric name.
Raw observations alone are data, but they are not information or knowledge.
Data raises questions, such as:
What is the most common or expected observation?
What are the limits on the observations?
What does the data look like?
Although they appear simple, these questions must be answered in order to turn raw
observations into information that we can use and share.
Beyond raw data, we may design experiments in order to collect observations. From these
experimental results we may have more sophisticated questions, such as:
What variables are most relevant?
What is the difference in an outcome between two experiments?
Are the differences real or the result of noise in the data?
Questions of this type are important. The results matter to the project, to stakeholders, and to
effective decision making.
Statistical methods are required to find answers to the questions that we have about data.
We can see that statistical methods are required both to understand the data used to train a
machine learning model and to interpret the results of testing different machine learning
models. Statistics is a subfield of mathematics.
It refers to a collection of methods for working with data and using data to answer questions.
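As a small sketch of how the basic questions above can be answered, the standard-library statistics module can compute simple summaries; the observations below are made-up numbers used only for illustration.

# A small sketch of answering the basic questions above with summary statistics.
# The observations here are made-up numbers used only for illustration.
import statistics

observations = [4.2, 5.1, 4.8, 5.0, 4.9, 5.3, 4.7, 5.1, 6.0, 4.8]

print("Expected observation (mean):", statistics.mean(observations))
print("Most common observation (mode):", statistics.mode(observations))
print("Limits on the observations:", min(observations), "to", max(observations))
print("Spread (standard deviation):", statistics.stdev(observations))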
Convex Optimization
Optimization is a big part of machine learning. It is the core of most popular methods, from
least squares regression to artificial neural networks.
These methods are useful in the core implementation of a machine learning algorithm, and they
are also needed when you implement your own tuning scheme to optimize the parameters of a
model for some cost function.
A good example may be the case where we want to optimize the hyper-parameters of a blend
of predictions from an ensemble of multiple child models.
Machine learning algorithms use optimization all the time. We minimize loss, or error, or
maximize some kind of score function. Gradient descent is the "hello world" optimization
algorithm covered in probably any machine learning course. It is obvious in the case of
regression or classification models, but even with tasks such as clustering we are looking for a
solution that optimally fits our data (e.g. k-means minimizes the within-cluster sum of squares).
So if you want to understand how machine learning algorithms work, learning more
about optimization helps. Moreover, if you need to do things like hyperparameter tuning, then
you are also directly using optimization.
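As a minimal sketch of the gradient descent idea mentioned above, the following applies it to minimizing mean squared error for a simple linear model; the data, learning rate and iteration count are illustrative assumptions.

# A minimal sketch of gradient descent minimizing mean squared error
# for the linear model y = a + b*x. Data and learning rate are illustrative.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])   # roughly y = 1 + 2x with noise

a, b = 0.0, 0.0          # initial parameters
lr = 0.01                # learning rate
for _ in range(5000):
    pred = a + b * x
    error = pred - y
    # Gradients of the mean squared error with respect to a and b
    grad_a = 2 * error.mean()
    grad_b = 2 * (error * x).mean()
    a -= lr * grad_a
    b -= lr * grad_b

print(f"Fitted model: y = {a:.2f} + {b:.2f} * x")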
Data Visualization
Data visualization is an important skill in applied statistics and machine learning.
Statistics does indeed focus on quantitative descriptions and estimations of data. Data
visualization provides an important suite of tools for gaining a qualitative understanding.
This can be helpful when exploring and getting to know a dataset and can help with identifying
patterns, corrupt data, outliers, and much more. With a little domain knowledge, data
visualizations can be used to express and demonstrate key relationships in plots and charts that
are more visceral to yourself and stakeholders than measures of association or significance.
There are five key plots that you need to know well for basic data visualization. They are:
Line Plot
Bar Chart
Histogram Plot
Box and Whisker Plot
Scatter Plot
With knowledge of these plots, you can quickly get a qualitative understanding of most data
that you come across.
Line Plot
A line plot is generally used to present observations collected at regular intervals.
The x-axis represents the regular interval, such as time. The y-axis shows the observations,
ordered by the x-axis and connected by a line.
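A short sketch of such a line plot with matplotlib (assumed to be installed); the monthly observations are made-up values used only for illustration.

# A short sketch of a line plot with matplotlib (assumed to be installed).
# The monthly observations are made-up values used only for illustration.
import matplotlib.pyplot as plt

months = list(range(1, 13))                                 # regular interval on the x-axis
sales = [12, 14, 13, 17, 19, 22, 24, 23, 20, 18, 15, 13]    # observations

plt.plot(months, sales)          # observations connected by a line
plt.xlabel("Month")
plt.ylabel("Sales")
plt.title("Observations collected at regular intervals")
plt.show()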
Data Distributions
From a practical perspective, we can think of a distribution as a function that describes the
relationship between observations in a sample space.
For example, we may be interested in the age of humans, with individual ages representing
observations in the domain, and ages 0 to 125 the extent of the sample space. The distribution
is a mathematical function that describes the relationship between observations of different ages.
A distribution is simply a collection of data, or scores, on a variable. Usually, these scores are
arranged in order from smallest to largest and then they can be presented graphically.
Density Functions
Distributions are often described in terms of their density or density functions.
Density functions describe how the proportion of data, or the likelihood of observations,
changes over the range of the distribution.
Two types of density functions are probability density functions and cumulative density
functions.
Probability Density Function: calculates the relative likelihood of observing a given value.
Cumulative Density Function: calculates the probability of an observation being equal to or less
than a given value.
A probability density function, or PDF, can be used to calculate the likelihood of a given
observation in a distribution. It can also be used to summarize the likelihood of observations
across the distribution’s sample space. Plots of the PDF show the familiar shape of a
distribution, such as the bell-curve for the Gaussian distribution.
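As a small sketch, the PDF and CDF of a Gaussian distribution can be evaluated with SciPy (assumed to be installed); the mean and standard deviation below are illustrative.

# A small sketch evaluating the PDF and CDF of a Gaussian distribution
# with SciPy (assumed to be installed). Mean and spread are illustrative.
from scipy.stats import norm

mu, sigma = 50, 10          # e.g. a bell-shaped distribution of some measurement

# PDF: relative likelihood of observing the value 55
print("pdf(55):", norm.pdf(55, loc=mu, scale=sigma))

# CDF: probability of an observation equal to or less than 55
print("cdf(55):", norm.cdf(55, loc=mu, scale=sigma))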
Data Pre-processing
Pre-processing refers to the transformations applied to our data before feeding it to the
algorithm.
• Data pre-processing is a technique that is used to convert raw data into a clean data set.
In other words, whenever data is gathered from different sources it is collected in a raw
format that is not feasible for analysis.
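As a minimal sketch of one common pre-processing step, feature scaling, the following uses scikit-learn's StandardScaler (assumed to be installed); the raw values are illustrative.

# A minimal sketch of one common pre-processing step: feature scaling with
# scikit-learn (assumed to be installed). The raw values are illustrative.
import numpy as np
from sklearn.preprocessing import StandardScaler

# Raw data with features on very different scales (e.g. age and income)
raw = np.array([[25, 40000],
                [32, 52000],
                [47, 81000],
                [51, 90000]], dtype=float)

scaler = StandardScaler()
clean = scaler.fit_transform(raw)   # each column now has mean 0 and unit variance
print(clean)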
Data Augmentation
Data augmentation is the process of increasing the amount and diversity of data. We do not
collect new data; rather, we transform the data that is already present. For instance, with
images there are various ways to transform and augment the image data.
Need for data augmentation
Data augmentation is an integral process in deep learning, as deep learning requires large
amounts of data and, in some cases, it is not feasible to collect thousands or millions of images,
so data augmentation comes to the rescue. It helps us to increase the size of the dataset and
introduce variability into the dataset.
Operations in data augmentation
The most commonly used operations are listed below; a short code sketch follows the list.
1. Rotation
2. Shearing
3. Zooming
4. Cropping
5. Flipping
6. Changing the brightness level
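A sketch of these operations using torchvision transforms (assumed to be installed); the image path is a hypothetical placeholder.

# A sketch of the operations above using torchvision transforms
# (assumed to be installed). The image path is a hypothetical placeholder.
from PIL import Image
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),            # rotation
    transforms.RandomAffine(degrees=0, shear=10),     # shearing
    transforms.RandomResizedCrop(size=224,
                                 scale=(0.8, 1.0)),   # zooming / cropping
    transforms.RandomHorizontalFlip(p=0.5),           # flipping
    transforms.ColorJitter(brightness=0.3),           # brightness change
])

image = Image.open("example.jpg")      # hypothetical input image
augmented = augment(image)             # a new, transformed version of the image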
Supervised Learning
Supervised Learning is the case where you can consider the learning to be guided by a teacher. We
have a dataset which acts as the teacher, and its role is to train the model or the machine. Once
the model is trained, it can start making predictions or decisions when new data is given to it.
Learning under supervision directly translates to being under guidance and learning from an
entity that is in charge of providing feedback through this process. When training a machine,
supervised learning refers to a category of methods in which we teach or train a machine
learning algorithm using data, while guiding the model with labels associated with the
data.
The formal supervised learning process involves input variables, which we call (X), and an
output variable, which we call (Y). We use an algorithm to learn the mapping function from the
input to the output. In simple mathematics, the output (Y) is a dependent variable of input (X)
as illustrated by:
Y = f(X)
Here, our end goal is to try to approximate the mapping function (f), so that we can predict the
output variables (Y) when we have new input data (X).
Example: House Prices
One practical example of supervised learning problems is predicting house prices. How is this
achieved?
First, we need data about the houses: square footage, number of rooms, whether a house has a
garden or not, and so on. We then need to know the prices of these houses, i.e. the
corresponding labels. By leveraging data coming from thousands of houses, their features and
prices, we can now train a supervised machine learning model to predict a new house’s price
based on the examples observed by the model.
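As a minimal sketch of this house-price example, a linear model can be fitted with scikit-learn (assumed to be installed); the houses, features and prices below are made-up numbers.

# A minimal sketch of supervised learning on the house-price example with
# scikit-learn (assumed installed). The houses and prices are made-up numbers.
import numpy as np
from sklearn.linear_model import LinearRegression

# X: [square footage, number of rooms, has garden (1/0)] for each known house
X = np.array([[1400, 3, 1],
              [1600, 3, 0],
              [2100, 4, 1],
              [2500, 4, 1],
              [1100, 2, 0]], dtype=float)
# Y: the corresponding labels, i.e. the known selling prices
Y = np.array([240000, 255000, 330000, 395000, 180000], dtype=float)

model = LinearRegression().fit(X, Y)           # learn the mapping f: X -> Y

new_house = np.array([[1800, 3, 1]], dtype=float)
print("Predicted price:", model.predict(new_house)[0])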
Example: Is it a cat or a dog?
Image classification is a popular problem in the computer vision field. Here, the goal is to
predict what class an image belongs to. In this set of problems, we are interested in finding the
class label of an image. More precisely: is the image of a car or a plane? A cat or a dog?
Example: How’s the weather today?
One particularly interesting problem which requires considering a lot of different parameters is
predicting weather conditions in a particular location. To make correct predictions for the
weather, we need to take into account various parameters, including historical temperature
data, precipitation, wind, humidity, and so on.
Unsupervised Learning
In supervised learning, the main idea is to learn under supervision, where the supervision signal
is called the target value or label. In unsupervised learning, we lack this kind of signal.
Therefore, we need to find our way without any supervision or guidance. This simply means
that we are on our own and need to figure out what is what by ourselves.
The model learns through observation and finds structures in the data. Once the model is given
a dataset, it automatically finds patterns and relationships in the dataset by creating clusters in
it. What it cannot do is add labels to the clusters; it cannot say this is a group of apples or
mangoes, but it will separate all the apples from the mangoes.
This is roughly how unsupervised learning happens. We use the data points as references to find
meaningful structure and patterns in the observations. Unsupervised learning is commonly used
for finding meaningful patterns and groupings inherent in data, extracting generative features,
and exploratory purposes.
Example: Finding customer segments
Clustering is an unsupervised technique where the goal is to find natural groups or clusters in a
feature space and interpret the input data. Clustering is commonly used for determining
customer segments in marketing data. Being able to determine different segments of customers
helps marketing teams approach these customer segments in unique ways. (Think of features
like gender, location, age, education, income bracket, and so on.)
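A small sketch of customer segmentation with k-means in scikit-learn (assumed to be installed); the customer records (age, income) are made-up numbers.

# A small sketch of customer segmentation with k-means in scikit-learn
# (assumed installed). The customers (age, income) are made-up numbers.
import numpy as np
from sklearn.cluster import KMeans

customers = np.array([[22, 25000], [25, 27000], [24, 30000],    # younger, lower income
                      [45, 80000], [48, 82000], [50, 90000]],   # older, higher income
                     dtype=float)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)

# The model groups customers into clusters but does not name the segments;
# interpreting each cluster is left to the marketing team.
print("Cluster assignments:", kmeans.labels_)
print("Cluster centres:", kmeans.cluster_centers_)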
Example: Reducing the complexity of a problem
Dimensionality reduction is a commonly used unsupervised learning technique where the goal
is to reduce the number of random variables under consideration. It has several practical
applications. One of the most common uses of dimensionality reduction is to reduce the
complexity of a problem by projecting the feature space to a lower-dimensional space so that
less correlated variables are considered in a machine learning system.
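A short sketch of dimensionality reduction with PCA in scikit-learn (assumed to be installed), projecting a 10-dimensional feature space down to 2 dimensions on random illustrative data.

# A short sketch of dimensionality reduction with PCA in scikit-learn
# (assumed installed), projecting 10 features down to 2 principal components.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))          # 100 samples, 10 features (illustrative)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)        # project onto 2 principal components

print("Original shape:", X.shape)           # (100, 10)
print("Reduced shape:", X_reduced.shape)    # (100, 2)
print("Variance explained:", pca.explained_variance_ratio_)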
Example: Feature selection
Even though feature selection and dimensionality reduction both aim to reduce the number of
features in the original feature set, understanding how feature selection works helps us
get a better understanding of dimensionality reduction.
It is important to understand that not every feature adds value to solving the problem.
Therefore, eliminating these features is an essential part of machine learning. In feature
selection, we try to eliminate a subset of the original set of features.
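As a small sketch of feature selection, SelectKBest from scikit-learn (assumed to be installed) keeps the two features most related to the target on a built-in toy dataset.

# A small sketch of feature selection with scikit-learn (assumed installed),
# keeping the 2 features most related to the target on a built-in toy dataset.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)        # 4 original features

selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print("Original number of features:", X.shape[1])           # 4
print("Selected number of features:", X_selected.shape[1])  # 2
print("Kept feature indices:", selector.get_support(indices=True))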