DS Unit2
ML In Data Science
Machine learning is a field of computer science that gives computers the ability
to learn without being explicitly programmed. Machine learning is a subset of
artificial intelligence that focuses on building systems that can learn from and
make decisions based on data. Instead of explicitly programming rules, machine
learning algorithms use statistical techniques to enable computers to “learn” and
improve their performance on a specific task over time.
Types of Machine Learning
There are three main types of machine learning:
Supervised Learning
Unsupervised Learning
Reinforcement Learning
Role of Machine Learning in Data Science
Machine learning significantly boosts data science by improving analysis
efficiency, spotting patterns, predicting outcomes, and identifying anomalies in
extensive datasets, facilitating informed decision-making.
Enabling predictive modeling: Machine learning can examine historical data,
find patterns in it, and use those patterns to predict what is likely to
happen next, often with high accuracy. Businesses rely on this to plan and
make better decisions. In finance, for example, machine learning can analyze
past stock market data to forecast price movements and help investors decide
when to buy or sell. In healthcare, it can analyze patient records to
estimate the risk of illness, so doctors can intervene sooner and improve
patient outcomes.
Machine learning has become increasingly important in data science as it
can uncover patterns and correlations in large datasets that would be
impossible to detect otherwise. By training algorithms on vast amounts of
real-world data, machine learning techniques are able to identify useful
insights and make predictions that guide critical decisions in many
different fields.
Facilitating classification: Machine learning algorithms sort data into
predefined groups, which makes information easier to handle and understand.
By grouping items based on their attributes, we can make sense of large
amounts of data. In an online shop, for instance, machine learning
algorithms can sort products into categories such as electronics, clothing,
or home goods, so customers can quickly find what they want. Because this
sorting is automated, it saves time and effort, letting businesses focus on
analyzing the data and extracting useful insights. In short, machine
learning improves data management and understanding, leading to faster
decisions and a clearer grasp of complex data sets.
Supporting anomaly detection: Machine learning plays a key role in
identifying unusual patterns or outliers in datasets, which can point to
potential problems or fraudulent activity. Machine learning algorithms scan
large volumes of data and flag anything that deviates from the norm, such as
irregular financial transactions or unusual user behavior. This ability to
spot anomalies is vital in fields like finance, cybersecurity, and
healthcare, where catching something unusual early can prevent major losses
or risks. In banking, for example, machine learning algorithms can flag
transactions that deviate from a customer's normal behavior, helping to
stop fraud.
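As a rough illustration, here is a minimal sketch of anomaly detection on simulated transaction amounts, using scikit-learn's IsolationForest (the library choice, the data, and the 2% contamination rate are assumptions for demonstration, not part of these notes):

import numpy as np
from sklearn.ensemble import IsolationForest

# Simulated transaction amounts: mostly typical values plus a few extremes.
rng = np.random.default_rng(0)
amounts = np.concatenate([rng.normal(50, 10, 980), rng.normal(5000, 100, 20)])
X = amounts.reshape(-1, 1)

# contamination is the assumed fraction of anomalies in the data.
model = IsolationForest(contamination=0.02, random_state=0)
labels = model.fit_predict(X)  # -1 = anomaly, 1 = normal
print("flagged:", int((labels == -1).sum()), "transactions")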
Applications of Machine Learning in Data Science
Machine learning has a wide range of applications across various industries.
1. Finance:
o Credit scoring
o Fraud detection
o Algorithmic trading
2. Healthcare:
o Disease diagnosis
o Drug discovery
o Personalized treatment plans
3. Retail:
o Product recommendations
o Demand forecasting
o Price optimization
4. Manufacturing:
o Predictive maintenance
o Quality control
o Supply chain optimization
5. Marketing:
o Customer segmentation
o Churn prediction
o Sentiment analysis
6. Transportation:
o Route optimization
o Self-driving vehicles
o Traffic prediction
Types of ML
1. Supervised Learning
In supervised learning, the algorithm learns a mapping between the input and
output data. This mapping is learned from a labelled dataset, which consists of
pairs of input and output data. This process involves supervised learning
algorithms that help the machine learn from input-output pairs. The algorithm
tries to learn the relationship between the input and output data so that it can
make accurate predictions on new, unseen data.
A supervised learning algorithm uses a labelled dataset consisting of input
features and corresponding output labels. Input features are the characteristics
of the data used to make predictions, while the output labels are the desired
outcomes the model is being trained to predict. By learning this mapping, the
model becomes capable of making predictions on new, unseen data.
A fundamental concept in supervised machine learning is learning a class from
examples. This involves providing the model with examples where the correct
label is known, such as learning to classify images of cats and dogs by being
shown labeled examples of both. The model then learns the distinguishing
features of each class and applies this knowledge to classify new images.
Types of Supervised Learning
Supervised learning is typically divided into two main categories:
Regression: the algorithm learns to predict a continuous output value, such
as the price of a house or the temperature of a city.
Classification: the algorithm learns to predict a categorical output
variable or class label, such as whether a customer is likely to purchase a
product or not.
While training the model, the data is usually split in the ratio 80:20, i.e.
80% as training data and the remaining 20% as testing data. For the training
portion we feed the model both the inputs and the corresponding outputs; the
model learns from the training data only, and the held-out test data is used
to check how well it predicts unseen examples.
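A minimal sketch of this 80:20 split with scikit-learn (the synthetic dataset is an assumption for illustration):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
# test_size=0.2 holds out 20% of the rows; the model never sees them in training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)  # (800, 10) (200, 10)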
Supervised Machine Learning Algorithms
Supervised learning can be further divided into several different types,
each with its own unique characteristics and applications. Some of the most
common supervised learning algorithms are:
Linear Regression: Linear regression is used to predict a continuous
output value. It is one of the simplest and most widely used algorithms in
supervised learning.
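For example, a minimal linear regression sketch on synthetic data (scikit-learn and the generated dataset are illustrative choices, not from the notes):

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=3, noise=10, random_state=0)
model = LinearRegression().fit(X, y)  # learns one weight per feature plus an intercept
print(model.coef_, model.intercept_)
print(model.predict(X[:2]))  # continuous output values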
Logistic Regression : Logistic regression is used to predict a binary
output variable. It is commonly used in machine learning applications
where the output variable is either true or false, such as in fraud detection
or spam filtering.
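A minimal sketch of binary classification with logistic regression (the synthetic data stands in for, say, spam vs. not spam):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=8, random_state=1)
model = LogisticRegression().fit(X, y)
# predict gives the class (0 or 1); predict_proba gives the probability of class 1
print(model.predict(X[:3]), model.predict_proba(X[:3])[:, 1])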
Decision Trees : Decision tree is a tree-like structure that is used to
model decisions and their possible consequences. Each internal node in
the tree represents a decision, while each leaf node represents a possible
outcome. Decision trees can be used to model complex relationships
between input features and output variables. A decision tree is a type
of machine learning algorithm that is used for both classification and
regression tasks.
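A minimal decision tree sketch on the classic iris dataset (an illustrative choice); export_text prints the learned decisions (internal nodes) and outcomes (leaves):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree))  # the tree's if/else decision rules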
Random Forests : Random forests are made up of multiple
decision trees that work together to make predictions. Each tree in the
forest is trained on a different subset of the input features and data. The
final prediction is made by aggregating the predictions of all the trees in
the forest. Random forests are an ensemble machine learning
technique that is used for both classification and regression tasks in
supervised learning.
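A minimal random forest sketch (synthetic data; 100 trees is an arbitrary illustrative choice):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
# Each tree sees a bootstrap sample of the rows and random subsets of features;
# the forest aggregates the trees' votes into the final prediction.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(forest.predict(X[:3]))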
Support Vector Machine (SVM) : The SVM algorithm creates a
hyperplane to segregate n-dimensional space into classes and identify the
correct category of new data points. The extreme cases that help create
the hyperplane are called support vectors, hence the name Support Vector
Machine. A Support Vector Machine is a type of supervised machine
learning algorithm that is also used for both classification and regression
tasks.
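A minimal SVM sketch (synthetic data; the linear kernel is an illustrative choice):

from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=4, random_state=0)
# SVC fits the separating hyperplane; the support vectors are the extreme
# training points that define it.
svm = SVC(kernel="linear").fit(X, y)
print("number of support vectors:", svm.support_vectors_.shape[0])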
K-Nearest Neighbors (KNN) : KNN works by finding k training
examples closest to a given input and then predicts the class or value
based on the majority class or average value of these neighbors. The
performance of KNN can be influenced by the choice of k and the
distance metric used to measure proximity. KNN is intuitive but can be
sensitive to noisy data and requires careful selection of k for optimal
results. K-Nearest Neighbors is a type of algorithm that is used for both
classification and regression tasks.
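A minimal KNN sketch (synthetic data; k=5 is an illustrative choice):

from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
# Each prediction is the majority class among the 5 nearest training points.
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print(knn.predict(X[:3]))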
Gradient Boosting : Gradient Boosting combines weak learners,
like decision trees, to create a strong model. It iteratively builds new
models that correct errors made by previous ones. Each new model is
trained to minimize residual errors, resulting in a powerful predictor
capable of handling complex data relationships. A Gradient Boosting is a
type of algorithm that is used for both classification and regression tasks.
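A minimal gradient boosting sketch (synthetic data; the number of trees and the learning rate are illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, random_state=0)
# Each new shallow tree is fitted to the residual errors of the ensemble
# built so far, gradually reducing the remaining error.
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                random_state=0).fit(X, y)
print(gb.score(X, y))  # accuracy on the training data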
Naive Bayes Algorithm: The Naive Bayes algorithm is a supervised
machine learning algorithm based on applying Bayes’ Theorem with
the “naive” assumption that features are independent of each other given
the class label. Despite this simplifying assumption, Naive Bayes
performs well for many real-world tasks, especially in text classification,
spam detection, and document categorization.
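A minimal Naive Bayes sketch for spam-style text classification (the tiny example corpus is made up for illustration):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["win a free prize now", "meeting at noon tomorrow",
         "free offer click now", "lunch with the team"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

vec = CountVectorizer()
X = vec.fit_transform(texts)  # word counts as features

# MultinomialNB treats the word counts as independent given the class.
nb = MultinomialNB().fit(X, labels)
print(nb.predict(vec.transform(["free prize now"])))  # expected: spam (1)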
Advantages of Supervised Learning
Labeled training data benefits supervised learning by enabling models to
accurately learn patterns and relationships between inputs and outputs.
Supervised learning models can accurately predict and classify new data.
Supervised learning has a wide range of applications, including
classification, regression, and even more complex problems like image
recognition and natural language processing.
Well-established evaluation metrics, including accuracy, precision, recall,
and F1-score, facilitate the assessment of supervised learning model
performance.
One of the primary advantages of supervised learning is that it allows for
the creation of complex models that can make accurate predictions on
new data. However, supervised learning requires large amounts of labeled
training data to be effective. Additionally, the quality and
representativeness of the training data can have a significant impact on
the accuracy of the model.
Disadvantages of Supervised Learning
Overfitting : Models can overfit training data, which leads to poor
performance on new, unseen data due to the capture of noise.
Feature Engineering : Extracting relevant features from raw data is
crucial for model performance, but this process can be time-consuming
and may require domain expertise.
Bias in Models: Training data biases can lead to unfair predictions.
Dependence on Labeled Data: Supervised learning heavily depends on labeled
training data, which can be costly, time-consuming, and may require domain
expertise.
2. Unsupervised Learning
Unsupervised learning is a machine learning technique in which models are
not supervised using a labelled training dataset. Instead, the model itself
finds hidden patterns and insights in the given data. It can be compared to
the learning that takes place in the human brain when encountering new
things.
The goal of unsupervised learning is to find the underlying structure of a
dataset, group the data according to similarities, and represent the dataset
in a compressed format.
Types of Unsupervised Learning Algorithm:
The unsupervised learning algorithm can be further categorized into two types.
Clustering
Dimensionality Reduction
Clustering :
Clustering works by giving the algorithm a large amount of input data with
no labels and letting it find any groupings in the data it can.
Those groupings are called clusters. A cluster is a group of data points that are
similar to each other based on their relation to surrounding data points.
Clustering is used for things like feature engineering or pattern discovery.
Types of clustering algorithms
There are different types of clustering algorithms that handle all kinds of unique
data.
Density-based
In density-based clustering, data is grouped by areas of high concentrations of
data points surrounded by areas of low concentrations of data points. Basically
the algorithm finds the places that are dense with data points and calls those
clusters.
The great thing about this is that the clusters can be any shape and aren't
constrained to a predefined form.
The clustering algorithms of this type don't try to assign outliers to
clusters; those points are simply left out.
Distribution-based
With a distribution-based clustering approach, all of the data points are
considered parts of a cluster based on the probability that they belong to a given
cluster.
It works like this: there is a center-point, and as the distance of a data point from
the center increases, the probability of it being a part of that cluster decreases.
If you aren't sure how the data might be distributed, you should consider a
different type of algorithm.
Centroid-based
Centroid-based clustering is probably the type you hear about most. It's a
little sensitive to the initial parameters you give it, but it's fast and
efficient.
These types of algorithms separate data points based on multiple centroids in
the data. Each data point is assigned to a cluster based on its squared distance
from the centroid. This is the most commonly used type of clustering.
Hierarchical-based
Hierarchical-based clustering is typically used on hierarchical data, like you
would get from a company database or taxonomies. It builds a tree of clusters so
everything is organized from the top-down.
This is more restrictive than the other clustering types, but it's perfect for
specific kinds of data sets.
Clustering Algorithms
K-means clustering algorithm
K-means clustering is the most commonly used clustering algorithm. It's a
centroid-based algorithm and the simplest unsupervised learning algorithm.
This algorithm tries to minimize the variance of data points within a cluster.
It's also how most people are introduced to unsupervised machine learning.
K-means is best used on smaller data sets because it iterates over all of
the data points. That means it will take more time to classify data points
if the data set contains a large number of them.
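A minimal k-means sketch (synthetic blob data; 3 clusters is an illustrative choice):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
# Places 3 centroids and assigns each point to its nearest one,
# minimizing the variance within each cluster.
km = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_labels = km.fit_predict(X)
print(km.cluster_centers_)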
DBSCAN clustering algorithm
DBSCAN stands for density-based spatial clustering of applications with
noise. It's a density-based clustering algorithm, unlike k-means.
This is a good algorithm for finding outliers in a data set. It finds arbitrarily
shaped clusters based on the density of data points in different regions. It
separates regions by areas of low-density so that it can detect outliers
between the high-density clusters.
This algorithm is better than k-means when it comes to working with oddly
shaped data.
DBSCAN uses two parameters to determine how clusters are
defined: minPts (the minimum number of data points that need to be
clustered together for an area to be considered high-density) and eps (the
distance used to determine if a data point is in the same area as other data
points).
Choosing the right initial parameters is critical for this algorithm to work.
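A minimal DBSCAN sketch on oddly shaped (two-moons) data; in scikit-learn the minPts parameter is called min_samples, and the eps and min_samples values here are illustrative:

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
# eps is the neighbourhood radius; min_samples corresponds to minPts.
cluster_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)
print("clusters:", len(set(cluster_labels) - {-1}),
      "| outliers:", int((cluster_labels == -1).sum()))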
Gaussian Mixture Model algorithm
One of the problems with k-means is that it assumes roughly circular
(spherical) clusters. Because k-means assigns points by their distance to a
centroid, clusters that aren't circular in shape aren't clustered correctly.
This is an issue that Gaussian mixture models fix. You don’t need circular
shaped data for it to work well.
The Gaussian mixture model uses multiple Gaussian distributions to fit
arbitrarily shaped data.
The model is a mixture of several single Gaussian components. It calculates
the probability that a data point belongs to each Gaussian distribution, and
the most likely distribution is the cluster the point falls under.
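A minimal Gaussian mixture sketch (synthetic data; 3 components is illustrative):

from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
# predict_proba: probability of each point under every Gaussian component;
# predict: the most likely component, i.e. the assigned cluster.
print(gmm.predict_proba(X[:2]).round(3))
print(gmm.predict(X[:2]))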
BIRCH algorithm
The Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH)
algorithm works better on large data sets than the k-means algorithm.
It breaks the data into little summaries that are clustered instead of the
original data points. The summaries hold as much distribution information
about the data points as possible.
This algorithm is commonly used with other clustering algorithms because
the other clustering techniques can be used on the summaries generated by
BIRCH.
The main downside of the BIRCH algorithm is that it only works on numeric
data values. You can't use this for categorical values unless you do some data
transformations.
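A minimal BIRCH sketch on a larger synthetic data set (the sizes and cluster count are illustrative):

from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=10000, centers=5, random_state=0)
# BIRCH first compresses the points into small tree-structured summaries,
# then clusters the summaries rather than the raw points.
cluster_labels = Birch(n_clusters=5).fit_predict(X)
print("clusters found:", len(set(cluster_labels)))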
Affinity Propagation clustering algorithm
This clustering algorithm is completely different from the others in the way
that it clusters data.
Each data point communicates with all of the other data points to let each
other know how similar they are and that starts to reveal the clusters in the
data. You don't have to tell this algorithm how many clusters to expect in the
initialization parameters.
As messages are sent between data points, sets of data called exemplars are
found and they represent the clusters.
An exemplar is found after the data points have passed messages to each
other and form a consensus on what data point best represents a cluster.
When you aren't sure how many clusters to expect, like in a computer vision
problem, this is a great algorithm to start with.
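A minimal affinity propagation sketch (synthetic data; note that no cluster count is passed in):

from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=4, random_state=0)
# Message passing between points selects the exemplars on its own.
ap = AffinityPropagation(random_state=0).fit(X)
print("exemplars found:", len(ap.cluster_centers_indices_))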
Mean-Shift clustering algorithm
This is another algorithm that is particularly useful for handling images and
computer vision processing.
Mean-shift is similar to the BIRCH algorithm because it also finds clusters
without an initial number of clusters being set.
This is a density-based, mode-seeking clustering algorithm, but the downside
is that it doesn't scale well when working with large data sets.
It works by iterating over all of the data points and shifting them towards
the mode. The mode in this context is the high-density area of data points
in a region.
That's why you might hear this algorithm referred to as the mode-seeking
algorithm. It will go through this iterative process with each data point and
move them closer to where other data points are until all data points have
been assigned to a cluster.
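A minimal mean-shift sketch (synthetic data; the bandwidth is estimated rather than hand-picked):

from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
# Each point is shifted towards the nearest high-density mode; points that
# converge to the same mode end up in the same cluster.
bandwidth = estimate_bandwidth(X, quantile=0.2)
cluster_labels = MeanShift(bandwidth=bandwidth).fit_predict(X)
print("clusters found:", len(set(cluster_labels)))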
OPTICS algorithm
OPTICS stands for Ordering Points to Identify the Clustering Structure. It's a
density-based algorithm similar to DBSCAN, but it's better because it can
find meaningful clusters in data that varies in density. It does this by ordering
the data points so that the closest points are neighbors in the ordering.
This makes it easier to detect different density clusters. The OPTICS
algorithm only processes each data point once, similar to DBSCAN
(although it runs slower than DBSCAN). A special reachability distance is
also stored for each data point, indicating how strongly the point belongs
to a specific cluster.
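A minimal OPTICS sketch on data with two blobs of different density (the spreads and min_samples value are illustrative):

from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

# Two blobs with different spreads: varying density, which plain DBSCAN
# handles poorly with a single eps value.
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [5, 5]],
                  cluster_std=[0.3, 1.2], random_state=0)
opt = OPTICS(min_samples=10).fit(X)
# reachability_ holds the per-point distance used to order points and
# separate the clusters; labels_ holds the final assignments (-1 = noise).
print(set(opt.labels_))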
Agglomerative Hierarchy clustering algorithm
This is the most common type of hierarchical clustering algorithm. It's used
to group objects in clusters based on how similar they are to each other.
This is a form of bottom-up clustering, where each data point is assigned to
its own cluster. Then those clusters get joined together.
At each iteration, similar clusters are merged until all of the data points are
part of one big root cluster.
Agglomerative clustering is best at finding small clusters. The end result
looks like a dendrogram so that you can easily visualize the clusters when
the algorithm finishes.
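A minimal agglomerative clustering sketch (synthetic data; 3 clusters is illustrative):

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=0)
# Bottom-up: every point starts as its own cluster, and the most similar
# clusters are merged until only the requested number remains.
cluster_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)
print(cluster_labels[:10])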
Advantages of Unsupervised Learning
o Unsupervised learning can be used for more complex tasks than supervised
learning because it does not require labeled input data.
o Unsupervised learning is often preferable because unlabeled data is much
easier to obtain than labeled data.
Disadvantages of Unsupervised Learning
o Unsupervised learning is intrinsically more difficult than supervised
learning because there are no corresponding output labels to learn from.
o The results of an unsupervised learning algorithm may be less accurate,
since the input data is not labeled and the algorithm does not know the
expected output in advance.
Dimensionality Reduction
Dimensionality reduction is the process of reducing the number of features
(or dimensions) in a dataset while retaining as much information as possible.
This can be done for a variety of reasons, such as to reduce the complexity of
a model, to improve the performance of a learning algorithm, or to make it
easier to visualize the data.
What is Dimensionality Reduction?
Dimensionality reduction is a technique used to reduce the number of
features in a dataset while retaining as much of the important information as
possible. In other words, it is a process of transforming high-dimensional
data into a lower-dimensional space that still preserves the essence of the
original data.
In machine learning, high-dimensional data refers to data with a large
number of features or variables. Such data can be hard to visualize,
expensive to process, and prone to overfitting, a set of difficulties often
called the curse of dimensionality. Dimensionality reduction can help to
mitigate these problems by reducing the complexity of the model and
improving its generalization performance.
There are two main approaches to dimensionality reduction:
feature selection and
feature extraction.
Feature Selection:
Feature selection involves selecting a subset of the original features that are
most relevant to the problem at hand. The goal is to reduce the
dimensionality of the dataset while retaining the most important features.
There are several methods for feature selection, including filter methods,
wrapper methods, and embedded methods. Filter methods rank the features
based on their relevance to the target variable, wrapper methods use the
model performance as the criteria for selecting features, and embedded
methods combine feature selection with the model training process.
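A minimal filter-method sketch using scikit-learn's SelectKBest (the dataset and k=5 are illustrative):

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=5, random_state=0)
# A filter method: scores every feature against the target (here with an
# ANOVA F-test) and keeps only the 5 highest-scoring features.
X_reduced = SelectKBest(score_func=f_classif, k=5).fit_transform(X, y)
print(X.shape, "->", X_reduced.shape)  # (300, 20) -> (300, 5)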
Feature Extraction:
Feature extraction involves creating new features by combining or
transforming the original features. The goal is to create a set of features that
captures the essence of the original data in a lower-dimensional space. There
are several methods for feature extraction, including principal component
analysis (PCA), linear discriminant analysis (LDA), and t-distributed
stochastic neighbor embedding (t-SNE). PCA is a popular technique that
projects the original features onto a lower-dimensional space while
preserving as much of the variance as possible.
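A minimal PCA sketch on the iris data (an illustrative dataset choice):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
# Project the 4 original features onto the 2 principal components that
# preserve the most variance.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print(X_2d.shape, pca.explained_variance_ratio_)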