Introduction To Data Science UNIT - II
Unit -II
Applications of Machine Learning in Data science
Machine learning is one of the most exciting technologies in use today. As the name
suggests, it gives the computer the ability that makes it more similar to humans: the
ability to learn.
Machine learning is actively being used today, perhaps in many more places than one would
expect.
Today, companies are using machine learning to improve business decisions, increase
productivity, detect disease, forecast weather, and much more. With the exponential
growth of technology, we not only need better tools to understand the data we currently
have, but we also need to prepare ourselves for the data we will have. To achieve this goal we
need to build intelligent machines. We can write a program to do simple things, but
hard-wiring intelligence into a program is difficult most of the time. A better approach is to
give machines a mechanism for learning: if a machine can learn from input, it does the hard
work for us. This is where machine learning comes into action.
Some of the most common examples are:
• Image Recognition
• Speech Recognition
• Recommender Systems
• Fraud Detection
• Self Driving Cars
• Medical Diagnosis
• Stock Market Trading
• Virtual Try On
Image Recognition
Image recognition is one of the main reasons behind the boom in deep learning. A task
that started with classifying images of cats and dogs has now evolved to face
recognition and real-world use cases built on it, such as employee attendance tracking.
Image recognition has also helped revolutionize the healthcare industry through smart
systems for disease recognition and diagnosis.
Speech Recognition
You have almost certainly come across smart assistants such as Alexa and Siri and used
speech to communicate with them. In the backend, these assistants rely on speech
recognition systems, which are designed to convert voice instructions into text.
Another application of speech recognition that we encounter in day-to-day life is
performing Google searches just by speaking.
Recommender Systems
As our world becomes more and more digital, almost every tech giant tries to provide
customized services to its users. This is possible because of recommender systems, which
analyze a user’s preferences and search history and recommend content or services
based on them.
YouTube is a common example: it recommends new videos and content based on the
user’s past search patterns. Netflix recommends movies and series based on the
interests a user provides when creating an account for the first time.
Fraud Detection
In today’s world, most things have been digitized: from buying a toothbrush to making
transactions worth millions of dollars, everything is accessible and easy to do online. But
with this digitization, cases of fraudulent transactions and fraudulent activities have
increased. Identifying them is not easy, but machine learning systems are very efficient at
this task.
Whenever such a system detects red flags in a user’s activity, it sends a suitable
notification to the administrator so that these cases can be monitored properly for
spam or fraud.
Medical Diagnosis
If you are a machine learning practitioner, or even a student, you have probably heard
about projects such as breast cancer classification, Parkinson’s disease classification,
and pneumonia detection, along with many other health-related tasks for which machine
learning models are reported to reach more than 90% accuracy.
These models are not limited to disease diagnosis in human beings. They also work well
for plant-disease tasks, whether that is predicting the type of a disease or detecting
whether a disease is likely to occur in the future.
Virtual Try On
Have you ever purchased spectacles or lenses from Lenskart? If so, you have probably
come across the feature that lets you try different frames virtually without purchasing
them or visiting an outlet. This is possible because machine learning systems identify
certain landmarks on a person’s face and then place the frames virtually on the face
using those landmarks.
In today’s world, the collaboration between machine learning and data science plays an
important role in maximizing the potential of large datasets. Despite their complexity,
these concepts are integral to extracting insights from vast data pools. Let’s delve into
the role of machine learning in data science, exploring its functions and significance
across diverse domains.
• Data scientists use powerful tools from machine learning algorithms. Machine learning
algorithms act as magic keys, unlocking large datasets with ease.
• Data science is like a twin to machine learning, enhancing each other’s abilities.
Machine learning is a handy sidekick for data scientists, helping them navigate through
complex data mazes.
• With machine learning, data scientists can dig deeper into information and uncover
concealed patterns. Machine learning algorithms are wizards at finding patterns,
predicting outcomes, and spotting anomalies.
• The collaboration between data science and machine learning is crucial across various
fields. This joint effort leads to smarter decisions, better work methods, and success in
data-driven environments.
2. Facilitating classification: Machine learning algorithms sort data into predefined
groups, which makes information easier to handle and understand. By grouping items
based on their qualities, we can make sense of large amounts of data. Picture an
online shop: machine learning algorithms can sort products into groups such as
electronics, clothes, or home goods, so that customers can easily find what they
want. Because this sorting is automated, it saves time and energy and lets businesses
focus on studying the data and pulling out useful details. In short, machine learning
improves data management and understanding, which leads to swifter decisions and a
clearer grasp of complex datasets.
3. Supporting anomaly detection: Machine learning plays a key role in picking out
unusual patterns in datasets, which can point to possible problems or suspicious
activity. Machine learning algorithms examine large volumes of data and flag anything
that deviates from the norm, such as odd money transactions or strange user actions.
This ability to spot anomalies is key in many areas, including finance, cybersecurity,
and healthcare, where spotting something unusual early can prevent big losses or
risks. For example, in banks, machine learning algorithms can flag transactions that
stray from the normal pattern, which can stop fraud.
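The idea of flagging transactions that stray from the normal can be sketched in a few lines. This is only an illustration under stated assumptions: the transaction amounts are synthetic, and scikit-learn's IsolationForest is one possible anomaly detector chosen for the example, not a method named in these notes.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic (made-up) transaction amounts: 200 ordinary payments around 50,
# followed by one very large payment that should stand out.
rng = np.random.default_rng(0)
amounts = rng.normal(loc=50.0, scale=5.0, size=(200, 1))
amounts = np.vstack([amounts, [[5000.0]]])

# contamination is the assumed fraction of anomalies in the data.
detector = IsolationForest(contamination=0.01, random_state=0)
flags = detector.fit_predict(amounts)  # -1 marks an anomaly, 1 marks a normal point

print("Flagged amounts:", amounts[flags == -1].ravel())
```

In practice the flagged transactions would be passed to a human reviewer rather than blocked automatically.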
Industry Applications
Real-world Applications
The influence of machine learning in data science spans industries, facilitating efficient
analysis, predictive modeling, anomaly detection, and decision-making processes, en-
hancing overall productivity and effectiveness.
1. Business: Machine learning helps businesses improve service, hone marketing, and
streamline tasks. It uses client data to tailor suggestions, predict demand, and
automate jobs, which improves service and efficiency. Moreover, it allows firms to
extract valuable knowledge from vast amounts of data, aiding strategic choices and
powering innovation. For example, machine learning-based predictive analytics can
predict demand shifts, helping businesses better manage supplies and resources.
3. Finance: Machine learning is very important in finance. It helps detect fraud, assess
risk, and manage investments. It examines large amounts of financial data to find
irregular patterns that might indicate fraud, so that crime can be stopped earlier.
Machine learning also helps assess how risky different financial dealings or
investments are. This helps organizations make better choices and reduce the chance
of loss.
4. Marketing: Machine learning enables customer segmentation, campaign optimization,
and personalized marketing strategies, improving targeting and conversion rates.
1. Enhancing Efficiency and Insights: Machine learning algorithms help data scientists
examine complex data and find hidden patterns, trends, and connections. Together,
data science and machine learning can help fields such as healthcare, finance, and
retail. They can forecast outcomes, recommend things people might like, and improve
business processes. Take healthcare, for example. Here, machine learning can detect
disease early, predict whether a treatment is likely to work, and create treatment
plans tailored to each patient. Finance uses machine learning in the same way: it
helps find fraud, assess risks, and choose investment strategies. Adding machine
learning to data science helps companies make smart choices more easily, get work
done more quickly, and cope with market changes. As technology grows, uniting
machine learning and data science is key. It drives new ideas and shapes industries
globally.
3. Driving Innovation and Competitiveness: The spark that machine learning and data
science create together is not going to dim any time soon. It is a must-have for
successful businesses and efficient operations. Plus, it keeps you ahead in the competitive
jungle out there. Companies that use machine learning to bolster their data science
game get a huge leg up. Better choices, a smooth process, and new paths to innova-
tion are their prizes. Technology is improving. More data is accumulated. The link
between machine learning and data science fuels growth in numerous areas.
By using machine learning, companies decode crucial knowledge from their data. They
can then adjust to market trends, predict customer wants, and sharpen their products.
Simply, mixing machine learning with data science boosts businesses. It allows them to
thrive in a changing world full of opportunities and obstacles.
Conclusion
Think of machine learning as the spine of data science. It’s super important because it
can dig deep into big, complicated data collections and pull out useful info. Beyond pre-
dicting what might happen in the future, machine learning can spot tricky patterns and
help businesses work smoother and smarter, sparking new ideas in all kinds of fields.
Scikit-learn
Scikit-learn is one of the most popular ML libraries for classical ML algorithms. It is built
on top of two basic Python libraries, NumPy and SciPy. Scikit-learn supports most
supervised and unsupervised learning algorithms. It can also be used for data mining and
data analysis, which makes it a great tool for anyone starting out with ML.
Python
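# Python script using Scikit-learn
# for Decision Tree Classifier
from sklearn import datasets, metrics, tree
# load data and fit the model (assumed: the iris dataset, which matches the 3x3 output)
dataset = datasets.load_iris()
model = tree.DecisionTreeClassifier().fit(dataset.data, dataset.target)
# make predictions on the same data the model was trained on
expected = dataset.target
predicted = model.predict(dataset.data)
print(metrics.confusion_matrix(expected, predicted))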
Output:
[[50  0  0]
 [ 0 50  0]
 [ 0  0 50]]
• Numpy
• Scipy
• Scikit-learn
• Theano
• TensorFlow
• Keras
• PyTorch
• Pandas
• Matplotlib
The success of machine learning models heavily depends on the quality of the fea-
tures used to train them. Feature engineering involves a set of techniques that ena-
ble us to create new features by combining or transforming the existing ones. These
techniques help to highlight the most important patterns and relationships in the
data, which in turn helps the machine learning model to learn from the data more
effectively.
What is a Feature?
In the context of machine learning, a feature (also known as a variable or attribute)
is an individual measurable property or characteristic of a data point that is used as
input for a machine learning algorithm. Features can be numerical, categorical, or
text-based, and they represent different aspects of the data that are relevant to the
problem at hand.
• For example, in a dataset of housing prices, features could include the number of
bedrooms, the square footage, the location, and the age of the property. In a da-
taset of customer demographics, features could include age, gender, income level,
and occupation.
• The choice and quality of features are critical in machine learning, as they can
greatly impact the accuracy and performance of the model.
Need for Feature Engineering in Machine Learning?
We engineer features for various reasons, and some of the main reasons include:
• Improve User Experience: The primary reason we engineer features is to enhance
the user experience of a product or service. By adding new features, we can make
the product more intuitive, efficient, and user-friendly, which can increase user sat-
isfaction and engagement.
• Competitive Advantage: Another reason we engineer features is to gain a competi-
tive advantage in the marketplace. By offering unique and innovative features, we
can differentiate our product from competitors and attract more customers.
• Meet Customer Needs: We engineer features to meet the evolving needs of custom-
ers. By analyzing user feedback, market trends, and customer behavior, we can
identify areas where new features could enhance the product’s value and meet cus-
tomer needs.
• Increase Revenue: Features can also be engineered to generate more revenue. For
example, a new feature that streamlines the checkout process can increase sales,
or a feature that provides additional functionality could lead to more upsells or
cross-sells.
• Future-Proofing: Engineering features can also be done to future-proof a product or
service. By anticipating future trends and potential customer needs, we can de-
velop features that ensure the product remains relevant and useful in the long term.
Processes Involved in Feature Engineering
Feature engineering in Machine learning consists of mainly 5 processes: Feature
Creation, Feature Transformation, Feature Extraction, Feature Selection, and Fea-
ture Scaling. It is an iterative process that requires experimentation and testing to
find the best combination of features for a given problem. The success of a machine
learning model largely depends on the quality of the features used in the model.
1. Feature Creation
Feature Creation is the process of generating new features based on domain
knowledge or by observing patterns in the data. It is a form of feature engineering
that can significantly improve the performance of a machine-learning model.
Types of Feature Creation:
1. Domain-Specific: Creating new features based on domain knowledge, such as cre-
ating features based on business rules or industry standards.
2. Data-Driven: Creating new features by observing patterns in the data, such as cal-
culating aggregations or creating interaction features.
3. Synthetic: Generating new features by combining existing features or synthesizing
new data points.
Why Feature Creation?
1. Improves Model Performance: By providing additional and more relevant infor-
mation to the model, feature creation can increase the accuracy and precision of
the model.
2. Increases Model Robustness: By adding additional features, the model can become
more robust to outliers and other anomalies.
3. Improves Model Interpretability: By creating new features, it can be easier to under-
stand the model’s predictions.
4. Increases Model Flexibility: By adding new features, the model can be made more
flexible to handle different types of data.
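The three types of feature creation above can be sketched with pandas. This is a minimal illustration, assuming a made-up housing table; the column names (`sqft`, `bedrooms`, `bathrooms`, `neighborhood`) and all values are hypothetical.

```python
import pandas as pd

# Hypothetical housing records; every column name and value is made up.
houses = pd.DataFrame({
    "sqft": [1500, 2000, 1000, 2400],
    "bedrooms": [3, 4, 2, 4],
    "bathrooms": [2, 2, 1, 3],
    "neighborhood": ["A", "A", "B", "B"],
})

# Domain-specific feature: living area available per bedroom.
houses["sqft_per_bedroom"] = houses["sqft"] / houses["bedrooms"]

# Data-driven feature: an aggregation, here the mean area within each neighborhood.
houses["neighborhood_mean_sqft"] = (
    houses.groupby("neighborhood")["sqft"].transform("mean")
)

# Synthetic feature: an interaction that combines two existing columns.
houses["bed_x_bath"] = houses["bedrooms"] * houses["bathrooms"]

print(houses)
```

Whether any of these new columns actually helps depends on the model and the data, so each candidate feature should be checked against validation performance.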
2. Feature Transformation
Feature Transformation is the process of transforming the features into a more suit-
able representation for the machine learning model. This is done to ensure that the
model can effectively learn from the data.
Types of Feature Transformation:
1. Normalization: Rescaling the features to have a similar range, such as between 0
and 1, to prevent some features from dominating others.
2. Scaling: Rescaling numerical features to a similar scale, such as a standard
deviation of 1, so that they can be compared more easily and the model considers
all features equally.
3. Encoding: Transforming categorical features into a numerical representation. Ex-
amples are one-hot encoding and label encoding.
4. Transformation: Transforming the features using mathematical operations to
change the distribution or scale of the features. Examples are logarithmic, square
root, and reciprocal transformations.
Why Feature Transformation?
1. Improves Model Performance: By transforming the features into a more suitable
representation, the model can learn more meaningful patterns in the data.
2. Increases Model Robustness: Transforming the features can make the model more
robust to outliers and other anomalies.
3. Improves Computational Efficiency: The transformed features often require fewer
computational resources.
4. Improves Model Interpretability: By transforming the features, it can be easier to un-
derstand the model’s predictions.
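A short sketch of the transformation and encoding types listed above, assuming a small made-up table (the `income` and `city` columns are hypothetical). It uses NumPy's `log1p` for the logarithmic transformation and pandas for one-hot and label encoding.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a right-skewed numeric column and a categorical column.
df = pd.DataFrame({
    "income": [20000, 35000, 50000, 1000000],
    "city": ["Delhi", "Mumbai", "Delhi", "Pune"],
})

# Logarithmic transformation: log1p computes log(1 + x) and compresses large values.
df["log_income"] = np.log1p(df["income"])

# One-hot encoding: one 0/1 column per category.
one_hot = pd.get_dummies(df["city"], prefix="city", dtype=int)

# Label encoding: each category becomes an integer code (alphabetical order here).
df["city_code"] = df["city"].astype("category").cat.codes

print(pd.concat([df, one_hot], axis=1))
```

Label encoding imposes an order on the categories, so it suits ordinal data or tree-based models better than linear models, where one-hot encoding is usually the safer choice.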
3. Feature Extraction
Feature Extraction is the process of creating new features from existing ones to pro-
vide more relevant information to the machine learning model. This is done by
transforming, combining, or aggregating existing features.
Types of Feature Extraction:
1. Dimensionality Reduction: Reducing the number of features by transforming the
data into a lower-dimensional space while retaining important information. Exam-
ples are PCA and t-SNE.
2. Feature Combination: Combining two or more existing features to create a new one.
For example, the interaction between two features.
3. Feature Aggregation: Aggregating features to create a new one. For example, calcu-
lating the mean, sum, or count of a set of features.
4. Feature Transformation: Transforming existing features into a new representation.
For example, log transformation of a feature with a skewed distribution.
Why Feature Extraction?
1. Improves Model Performance: By creating new and more relevant features, the
model can learn more meaningful patterns in the data.
2. Reduces Overfitting: By reducing the dimensionality of the data, the model is less
likely to overfit the training data.
3. Improves Computational Efficiency: The transformed features often require fewer
computational resources.
4. Improves Model Interpretability: By creating new features, it can be easier to under-
stand the model’s predictions.
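Dimensionality reduction with PCA can be sketched as follows. The example assumes scikit-learn's bundled iris dataset purely for illustration; standardizing before PCA is a common choice here, not a requirement stated in these notes.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Iris has 150 samples with 4 numeric features.
X, _ = load_iris(return_X_y=True)

# PCA is sensitive to feature scale, so standardize first.
X_scaled = StandardScaler().fit_transform(X)

# Project the 4 original features onto 2 principal components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print("Reduced shape:", X_2d.shape)
print("Variance explained by each component:", pca.explained_variance_ratio_)
```

The explained variance ratio indicates how much of the original information the retained components keep, which helps when choosing the number of components.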
4. Feature Selection
Feature Selection is the process of selecting a subset of relevant features from the
dataset to be used in a machine-learning model. It is an important step in the fea-
ture engineering process as it can have a significant impact on the model’s perfor-
mance.
Types of Feature Selection:
1. Filter Method: Based on a statistical measure of the relationship between each
feature and the target variable. Features with a high correlation with the target are selected.
2. Wrapper Method: Based on the evaluation of the feature subset using a specific ma-
chine learning algorithm. The feature subset that results in the best performance is
selected.
3. Embedded Method: Based on the feature selection as part of the training process of
the machine learning algorithm.
Why Feature Selection?
1. Reduces Overfitting: By using only the most relevant features, the model can gener-
alize better to new data.
2. Improves Model Performance: Selecting the right features can improve the accu-
racy, precision, and recall of the model.
3. Decreases Computational Costs: A smaller number of features requires less com-
putation and storage resources.
4. Improves Interpretability: By reducing the number of features, it is easier to under-
stand and interpret the results of the model.
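As an illustration of the filter method, the sketch below scores each feature against the target with an ANOVA F-test and keeps the two highest-scoring ones. The iris dataset and the choice of `k=2` are assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

iris = load_iris()
X, y = iris.data, iris.target

# Filter method: score every feature independently of any model,
# then keep the k features with the highest scores.
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

# get_support() returns a boolean mask over the original features.
kept = [name for name, keep in zip(iris.feature_names, selector.get_support()) if keep]

print("Selected features:", kept)
print("Shape after selection:", X_selected.shape)
```

Wrapper and embedded methods follow the same fit and transform pattern in scikit-learn, for example through `RFE` and `SelectFromModel`.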
5. Feature Scaling
Feature Scaling is the process of transforming the features so that they have a simi-
lar scale. This is important in machine learning because the scale of the features
can affect the performance of the model.
Types of Feature Scaling:
1. Min-Max Scaling: Rescaling the features to a specific range, such as between 0 and
1, by subtracting the minimum value and dividing by the range.
2. Standard Scaling: Rescaling the features to have a mean of 0 and a standard devia-
tion of 1 by subtracting the mean and dividing by the standard deviation.
3. Robust Scaling: Rescaling the features to be robust to outliers by subtracting the
median and dividing by the interquartile range.
Why Feature Scaling?
1. Improves Model Performance: By transforming the features to have a similar scale,
the model can learn from all features equally and avoid being dominated by a few
large features.
2. Increases Model Robustness: By transforming the features to be robust to outliers,
the model can become more robust to anomalies.
3. Improves Computational Efficiency: Many machine learning algorithms, such as
k-nearest neighbors, are sensitive to the scale of the features and perform better with
scaled features, and gradient-based training often converges faster on scaled features.
4. Improves Model Interpretability: By transforming the features to have a similar
scale, it can be easier to understand the model’s predictions.
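The three scaling types can be compared on one small column that contains an outlier. This is a sketch using scikit-learn's scalers on made-up values; the comments give the formula each scaler applies with its default settings.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# One made-up feature with a clear outlier (100).
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

x_minmax = MinMaxScaler().fit_transform(X)      # (x - min) / (max - min)
x_standard = StandardScaler().fit_transform(X)  # (x - mean) / standard deviation
x_robust = RobustScaler().fit_transform(X)      # (x - median) / interquartile range

print("min-max :", x_minmax.ravel())
print("standard:", x_standard.ravel())
print("robust  :", x_robust.ravel())
```

With min-max scaling the outlier squeezes the ordinary values close to 0, while robust scaling keeps them spread out because the median and interquartile range are barely affected by the outlier.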
What are the Steps in Feature Engineering?
The steps for feature engineering vary between ML engineers and data scientists.
Some of the common steps involved in most machine learning workflows are:
1. Data Cleansing
• Data cleansing (also known as data cleaning or data scrubbing) involves identi-
fying and removing or correcting any errors or inconsistencies in the dataset.
This step is important to ensure that the data is accurate and reliable.
2. Data Transformation
3. Feature Extraction
4. Feature Selection
• Feature selection involves selecting the most relevant features from the da-
taset for use in machine learning. This can include techniques like correlation
analysis, mutual information, and stepwise regression.
5. Feature Iteration
• Feature iteration involves refining and improving the features based on the per-
formance of the machine learning model. This can include techniques like add-
ing new features, removing redundant features and transforming features in
different ways.
model_selection.TimeSeriesSplit([n_splits, ...]): This cross-validation test is for time series.
Splitter Functions
Hyper-parameter optimizers
Cross-Validation in Sklearn
Cross-validation helps data scientists working with machine learning models in two
key ways: it can reduce the amount of data needed, and it helps ensure that the
model is reliable. It does so at the expense of extra computation, so it is important
to understand how it operates before deciding to use it.
In this section, we quickly go over the advantages of cross-validation and then look
at how it is applied using a range of techniques from the well-known Python
Scikit-learn package.
What is Cross-validation?
A fundamental methodological error is to train a model and then test it on the same
data to obtain a validation score. A model that simply repeats the labels of the
samples it has just seen would receive a perfect score but be unable to make useful
predictions about data it has not yet seen. This situation is called overfitting. To
avoid it, it is customary to reserve a portion of the available data as a test set
(X_test, y_test) when conducting a (supervised) machine learning experiment. The word
"experiment" here does not refer only to academic use: even in business contexts,
machine learning usually starts out experimentally.
How does Cross Validation Solve the Problem of Overfitting?
During cross-validation we create numerous small train-test splits from the initial
training dataset and use these splits to fine-tune the model. For instance, in
standard k-fold cross-validation we divide the dataset into k subsets (folds). The
model is trained on k-1 of the folds and tested on the remaining fold, and this is
repeated so that each fold serves as the test data once. In this way the model is
always tested on data it was not trained on. We will
learn about the seven most popular cross-validation approaches in this tutorial. The
code samples for each method are also included.
There is still a chance of overfitting the test dataset when comparing various
settings for estimators, because we can keep adjusting the parameters until the
estimator performs at its best on that dataset. In this way the model may "leak"
information about the test dataset, and evaluation measures may no longer reflect
generalization performance. We can resolve this issue by holding out a different
portion of the dataset as a "validation set": training is conducted on the training
dataset, followed by evaluation on the validation dataset, and when it appears that
the experiment has succeeded, we can perform a final assessment on the test set.
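The train, validation, and test workflow described above can be sketched as follows. The iris dataset, the 60/20/20 proportions, and the decision tree with candidate `max_depth` values are all assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hold out 20% of the data as the final test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
# Split the remainder: 25% of the remaining 80% gives a 20% validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0, stratify=y_rest
)

# Tune a hyperparameter using only the validation set.
best_depth, best_score = None, -1.0
for depth in (1, 2, 3, 4):
    candidate = DecisionTreeClassifier(max_depth=depth, random_state=0)
    score = candidate.fit(X_train, y_train).score(X_val, y_val)
    if score > best_score:
        best_depth, best_score = depth, score

# The test set is touched once, after the choice has been made.
final_model = DecisionTreeClassifier(max_depth=best_depth, random_state=0)
test_accuracy = final_model.fit(X_train, y_train).score(X_test, y_test)

print("Chosen max_depth:", best_depth)
print("Test accuracy:", test_accuracy)
```

Because the test set plays no part in choosing `max_depth`, its accuracy is a fairer estimate of performance on new data than the validation score.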
Data Size Reduction
Usually, the data is divided into three sets.
Training: used to train the model.
Validation: used to tune the hyperparameters and to compare candidate models.
Testing: used for the last check on completely unseen data, because some knowledge
about the validation data seeps into the model through the choice of parameters
during optimization.
Because cross-validation trains and validates on different folds of the same dataset,
adding it to the workflow removes the need for a separate validation dataset.
Robust Process
Even when sklearn's train_test_split is used with a stratified split (its stratify
argument), which ensures that the target variable's distribution is the same in both
the train and test sets, it is still possible to unintentionally train on a subset
that does not accurately represent the real world. Cross-validation reduces this risk
by evaluating the model on several different splits.
Methods of Cross-Validation with Sklearn
HoldOut Cross Validation or Train-Test Split
This cross-validation procedure randomly divides the entire dataset into a training
dataset and a validation dataset. Generally, approximately 70% of the whole da-
taset is utilized as a training set, and the leftover 30% is taken as a validation da-
taset.
The advantage of this method is that we only need to divide the dataset into the
training and validation sets once. The machine learning model will only need to be
trained once based on the training dataset, allowing for quick execution.
This method is not appropriate for an unbalanced dataset. Consider an unbalanced
dataset with classes "0" and "1", in which 80% of the data falls under class "0" and
the remaining 20% under class "1". Suppose we perform a train-test split in which the
training dataset makes up 80% of the data and the test dataset 20%. In the worst
case, the training dataset may contain only class "0" data and the test dataset only
class "1" data. Since the model has never encountered class "1" data, it will not
generalize to the test data.
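One common way to guard against the worst case above is a stratified split, which keeps the class proportions the same in both parts. The sketch below assumes a made-up label array with the same 80/20 imbalance and uses the `stratify` argument of `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 80 samples of class 0 and 20 samples of class 1.
y = np.array([0] * 80 + [1] * 20)
X = np.arange(100).reshape(-1, 1)  # placeholder feature values

# stratify=y keeps the 80/20 class ratio in both the training and the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print("Class counts in train:", np.bincount(y_train))
print("Class counts in test: ", np.bincount(y_test))
```

Without `stratify`, the split is purely random, so the class ratio in each part can drift away from the ratio in the full dataset.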
The procedure is repeated K times, until each fold has been used once as the
validation dataset while the remaining folds serve as the training dataset.
The model's final accuracy is calculated as the average of the accuracies that the
k models achieve on their validation folds.
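The repeat-and-average procedure can be sketched with scikit-learn's `KFold` and `cross_val_score`. The iris dataset, the logistic regression model, and the choice of five folds are assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)

# Five folds: each fold is used once for validation and four times for training.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
model = LogisticRegression(max_iter=1000)

# cross_val_score fits a fresh copy of the model for every fold.
scores = cross_val_score(model, X, y, cv=kfold)

print("Accuracy per fold:", scores)
print("Mean accuracy:", scores.mean())
```

The spread of the per-fold scores is also informative: a large spread suggests that the model's performance depends heavily on which samples it was trained on.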
Supervised Learning
Let’s understand it with the help of an example.
Example: Consider a scenario where you have to build an image classifier to
differentiate between cats and dogs. If you feed labelled images of dogs and cats to
the algorithm, the machine learns to distinguish a dog from a cat from these labelled
images. When we input new dog or cat images that it has never seen before, it uses
the learned model to predict whether each one is a dog or a cat. This is how
supervised learning works, and this particular task is image classification.
There are two main categories of supervised learning that are mentioned below:
• Classification
• Regression
Classification
Classification deals with predicting categorical target variables, which represent
discrete classes or labels. For instance, classifying emails as spam or not spam, or
predicting whether a patient has a high risk of heart disease. Classification algo-
rithms learn to map the input features to one of the predefined classes.
Here are some classification algorithms:
• Logistic Regression
• Support Vector Machine
• Random Forest
• Decision Tree
• K-Nearest Neighbors (KNN)
• Naive Bayes
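A minimal classification sketch using one of the algorithms listed above, K-Nearest Neighbors. It assumes scikit-learn's bundled breast cancer dataset; the split proportion and `n_neighbors=5` are illustrative choices, not tuned values.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Binary target: each tumor is labelled malignant or benign.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# KNN is distance-based, so the features are standardized first.
classifier = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
classifier.fit(X_train, y_train)

predictions = classifier.predict(X_test)  # one discrete class label per sample
accuracy = classifier.score(X_test, y_test)
print("Test accuracy:", accuracy)
```

The other classifiers in the list expose the same `fit`, `predict`, and `score` methods, so they can be swapped into the pipeline with a one-line change.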
Regression
Regression, on the other hand, deals with predicting continuous target variables,
which represent numerical values. For example, predicting the price of a house
based on its size, location, and amenities, or forecasting the sales of a product. Re-
gression algorithms learn to map the input features to a continuous numerical
value.
Here are some regression algorithms:
• Linear Regression
• Polynomial Regression
• Ridge Regression
• Lasso Regression
• Decision tree
• Random Forest
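A minimal regression sketch using Linear Regression from the list above. The data is synthetic: house prices are generated from a made-up rule (a base of 10,000 plus 50 per square foot, with random noise), so the fitted slope can be compared with the value used to generate the data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: price = 10,000 + 50 * size + noise (all numbers are made up).
rng = np.random.default_rng(0)
size_sqft = rng.uniform(500, 3000, size=200)
price = 10_000 + 50 * size_sqft + rng.normal(0, 5_000, size=200)

X = size_sqft.reshape(-1, 1)  # scikit-learn expects a 2-D feature array
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.25, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
slope = model.coef_[0]
r2 = model.score(X_test, y_test)  # R^2 on data the model has not seen

print("Estimated price per square foot:", slope)
print("R^2 on the test set:", r2)
```

Unlike a classifier, the model outputs a continuous number, and its quality is measured with error-based metrics such as R^2 rather than accuracy.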
Advantages of Supervised Machine Learning
• Supervised Learning models can have high accuracy as they are trained on labelled
data.
• The process of decision-making in supervised learning models is often interpreta-
ble.
• Pre-trained supervised models can often be reused, which saves time and resources
compared with developing new models from scratch.
Disadvantages of Supervised Machine Learning
• It is limited to the patterns present in the training data and may struggle with
unseen or unexpected patterns.
• It can be time-consuming and costly because it relies on labeled data.
• It may generalize poorly to new data.
Applications of Supervised Learning
Supervised learning is used in a wide variety of applications, including:
• Image classification: Identify objects, faces, and other features in images.
• Natural language processing: Extract information from text, such as sentiment, en-
tities, and relationships.
• Speech recognition: Convert spoken language into text.
• Recommendation systems: Make personalized recommendations to users.
• Predictive analytics: Predict outcomes, such as sales, customer churn, and stock
prices.
• Medical diagnosis: Detect diseases and other medical conditions.
• Fraud detection: Identify fraudulent transactions.
• Autonomous vehicles: Recognize and respond to objects in the environment.
• Email spam detection: Classify emails as spam or not spam.
• Quality control in manufacturing: Inspect products for defects.
• Credit scoring: Assess the risk of a borrower defaulting on a loan.
• Gaming: Recognize characters, analyze player behavior, and create NPCs.
• Customer support: Automate customer support tasks.
• Weather forecasting: Make predictions for temperature, precipitation, and other
meteorological parameters.
• Sports analytics: Analyze player performance, make game predictions, and opti-
mize strategies.
2. Unsupervised Machine Learning
Unsupervised learning is a type of machine learning technique in which an algorithm
discovers patterns and relationships using unlabeled data. Unlike supervised
learning, unsupervised learning does not involve providing the algorithm with labeled
target outputs. The primary goal of unsupervised learning is often to discover hidden
patterns, similarities, or clusters within the data, which can then be used for
various purposes, such as data exploration, visualization, dimensionality reduction,
and more.
Unsupervised Learning
Let’s understand it with the help of an example.
Example: Consider a dataset that contains information about the purchases customers
have made at a shop. Through clustering, the algorithm can group customers with
similar purchasing behavior, which reveals potential customer segments without
predefined labels. This type of information can help businesses target customers as
well as identify outliers.
There are two main categories of unsupervised learning that are mentioned below:
• Clustering
• Association
Clustering
Clustering is the process of grouping data points into clusters based on their simi-
larity. This technique is useful for identifying patterns and relationships in data
without the need for labeled examples.
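Here are some clustering algorithms:
• K-Means Clustering algorithm
• Mean-shift algorithm
• DBSCAN Algorithm
Two related unsupervised techniques, often listed alongside clustering, reduce or
separate features rather than group data points:
• Principal Component Analysis
• Independent Component Analysis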
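As a rough illustration of clustering, the sketch below runs K-means on a handful of two-dimensional points using only the Python standard library. The customer data, the kmeans helper, and the choice of starting centroids are assumptions made for this example; a real project would normally rely on a library implementation.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain K-means on a list of (x, y) tuples with given starting centroids."""
    centroids = list(centroids)
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centroids

# Hypothetical customers: (visits per month, average basket value).
customers = [(1, 10), (2, 12), (1, 11), (9, 80), (10, 85), (8, 78)]
labels, centroids = kmeans(customers, centroids=[customers[0], customers[3]])
print(labels)     # cluster index assigned to each customer
print(centroids)  # final cluster centers
```

No labels are supplied at any point: the two groups emerge only from the distances between points.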
Association
Association rule learning is a technique for discovering relationships between items
in a dataset. It identifies rules stating that the presence of one item implies the
presence of another item with a certain probability.
Here are some association rule learning algorithms:
• Apriori Algorithm
• Eclat
• FP-growth Algorithm
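The algorithms above all search for rules that score well on measures such as support and confidence. The sketch below computes these two measures for a single rule, "bread implies butter", over a tiny set of baskets. The basket data and the helper names support and confidence are assumptions made for this example; it does not implement Apriori, Eclat, or FP-growth themselves.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated probability of the consequent given the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical shopping baskets, one set of items per transaction.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk"},
]

rule_support = support(baskets, {"bread", "butter"})
rule_confidence = confidence(baskets, {"bread"}, {"butter"})
print(rule_support, rule_confidence)
```

Here the confidence is the "specific probability" attached to the rule: among baskets that contain bread, it is the share that also contain butter.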
Advantages of Unsupervised Machine Learning
• It helps to discover hidden patterns and relationships in the data.
• It is used for tasks such as customer segmentation, anomaly detection, and data
exploration.
• It does not require labeled data, which reduces the effort of data labeling.
• Techniques such as autoencoders and dimensionality reduction can be used to
extract meaningful features from raw data.
Disadvantages of Unsupervised Machine Learning
• Without labels, it may be difficult to evaluate the quality of the model’s output.
• Clusters may be hard to interpret and may not correspond to meaningful groups.
Applications of Unsupervised Learning
Here are some common applications of unsupervised learning:
• Clustering: Group similar data points into clusters.
• Anomaly detection: Identify outliers or anomalies in data.
• Dimensionality reduction: Reduce the dimensionality of data while preserving its
essential information.
• Recommendation systems: Suggest products, movies, or content to users based on
their historical behavior or preferences.
• Topic modeling: Discover latent topics within a collection of documents.
• Density estimation: Estimate the probability density function of data.
• Image and video compression: Reduce the amount of storage required for multime-
dia content.
• Data preprocessing: Help with data preprocessing tasks such as data cleaning, im-
putation of missing values, and data scaling.
• Market basket analysis: Discover associations between products.
• Genomic data analysis: Identify patterns or group genes with similar expression
profiles.
• Image segmentation: Segment images into meaningful regions.
• Community detection in social networks: Identify communities or groups of individ-
uals with similar interests or connections.
• Customer behavior analysis: Uncover patterns and insights for better marketing
and product recommendations.
• Content recommendation: Classify and tag content to make it easier to recommend
similar items to users.
• Exploratory data analysis (EDA): Explore data and gain insights before defining spe-
cific tasks.
3. Semi-Supervised Learning
Semi-supervised learning is a machine learning approach that sits between
supervised and unsupervised learning: it uses both labeled and unlabeled data.
It is particularly useful when obtaining labeled data is costly, time-consuming, or
resource-intensive, for example when labeling requires skilled annotators.
We use these techniques when a small portion of the data is labeled and the large
remainder is unlabeled. One option is to use unsupervised techniques to predict
labels and then feed these labels to supervised techniques. This is common with
image datasets, where usually not all images are labeled.
Semi-Supervised Learning
Let’s understand it with the help of an example.
Example: Consider that we are building a language translation model. Having labeled
translations for every sentence pair can be resource-intensive. Semi-supervised
learning allows the model to learn from both labeled sentence pairs and unlabeled
text, which can make it more accurate. This technique has contributed to
improvements in the quality of machine translation services.
Types of Semi-Supervised Learning Methods
There are a number of different semi-supervised learning methods each with its
own characteristics. Some of the most common ones include:
• Graph-based semi-supervised learning: This approach uses a graph to represent the
relationships between the data points. The graph is then used to propagate labels
from the labeled data points to the unlabeled data points.
• Label propagation: This approach iteratively propagates labels from the labeled
data points to the unlabeled data points, based on the similarities between the data
points.
• Co-training: This approach trains two different machine learning models on
different views (feature subsets) of the labeled data. Each model then labels
unlabeled examples that it is confident about, and these are added to the other
model’s training set.
• Self-training: This approach trains a machine learning model on the labeled data
and then uses the model to predict labels for the unlabeled data. The model is then
retrained on the labeled data and the predicted labels for the unlabeled data.
• Generative adversarial networks (GANs): GANs are a type of deep learning algorithm
that can be used to generate synthetic data. A GAN trains two neural networks, a
generator and a discriminator, and the generated samples can serve as additional
unlabeled data for semi-supervised learning.
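The self-training method from the list above can be sketched in a few lines. The example below uses a deliberately simple nearest-centroid classifier on one-dimensional data; the data values, the function names, and the min_margin confidence threshold are assumptions made for this illustration, not part of any standard library.

```python
def fit_centroids(samples):
    """Return the mean feature value per class from (value, label) pairs."""
    sums, counts = {}, {}
    for x, y in samples:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(centroids, x):
    """Return (label, margin); margin is how much closer the best class is."""
    ranked = sorted(centroids, key=lambda y: abs(x - centroids[y]))
    margin = abs(x - centroids[ranked[1]]) - abs(x - centroids[ranked[0]])
    return ranked[0], margin

def self_train(labeled, unlabeled, min_margin=2.0, rounds=5):
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        centroids = fit_centroids(labeled)      # train on current labels
        still_unlabeled = []
        for x in pool:
            label, margin = predict(centroids, x)
            if margin >= min_margin:
                labeled.append((x, label))      # accept confident pseudo-label
            else:
                still_unlabeled.append(x)       # leave uncertain points alone
        if len(still_unlabeled) == len(pool):   # nothing new was labeled
            break
        pool = still_unlabeled
    return fit_centroids(labeled), pool

# Hypothetical data: two labeled points and five unlabeled ones.
labeled = [(1.0, "small"), (9.0, "large")]
unlabeled = [1.5, 2.0, 8.0, 8.5, 5.2]
model, remaining = self_train(labeled, unlabeled)
print(model, remaining)
```

Only predictions that clear the confidence threshold are added as pseudo-labels, which limits the risk of the model reinforcing its own mistakes on ambiguous points.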
Advantages of Semi-Supervised Machine Learning
• It can generalize better than supervised learning on the labeled data alone, as it
learns from both labeled and unlabeled data.
• Can be applied to a wide range of data.
Disadvantages of Semi-Supervised Machine Learning
• Semi-supervised methods can be more complex to implement compared to other
approaches.
• It still requires some labeled data that might not always be available or easy to
obtain.
• Noisy or unrepresentative unlabeled data can degrade model performance.
Applications of Semi-Supervised Learning
Here are some common applications of semi-supervised learning:
• Image Classification and Object Recognition: Improve the accuracy of models by
combining a small set of labeled images with a larger set of unlabeled images.
• Natural Language Processing (NLP): Enhance the performance of language models
and classifiers by combining a small set of labeled text data with a vast amount of
unlabeled text.
• Speech Recognition: Improve the accuracy of speech recognition by leveraging a
limited amount of transcribed speech data and a more extensive set of unlabeled
audio.
• Recommendation Systems: Improve the accuracy of personalized recommenda-
tions by supplementing a sparse set of user-item interactions (labeled data) with a
wealth of unlabeled user behavior data.
• Healthcare and Medical Imaging: Enhance medical image analysis by utilizing a
small set of labeled medical images alongside a larger set of unlabeled images.
4. Reinforcement Machine Learning
Reinforcement learning is a learning method in which an agent interacts with
an environment by taking actions and learning from the outcomes. Trial and error and
delayed reward are its most relevant characteristics. In this technique, the model
keeps improving its performance by using reward feedback to learn the behavior or
pattern. These algorithms are trained for a particular problem, e.g. Google’s
self-driving car, or AlphaGo, where a bot competes with humans and even with itself
to become a better and better player of the game of Go. Each interaction adds to the
agent’s experience, which acts as its training data, so the more it interacts, the
better trained and more experienced it becomes.
Here are some of the most common reinforcement learning algorithms:
• Q-learning: Q-learning is a model-free RL algorithm that learns a Q-function, which
maps each state-action pair to a value. The Q-function estimates the expected
cumulative reward of taking a particular action in a given state.
• SARSA (State-Action-Reward-State-Action): SARSA is another model-free RL
algorithm that learns a Q-function. However, unlike Q-learning, SARSA updates the
Q-function using the next action that the agent actually takes, rather than the
best available next action.
• Deep Q-learning: Deep Q-learning is a combination of Q-learning and deep learning.
Deep Q-learning uses a neural network to represent the Q-function, which allows it
to learn complex relationships between states and actions.
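To make the Q-learning update concrete, here is a minimal tabular sketch on a toy environment: five cells in a row, where the agent starts in cell 0 and receives a reward of 1 on reaching cell 4. The environment, the step function, and the values chosen for alpha, gamma, and the number of episodes are all assumptions made for this illustration.

```python
import random

N_STATES, GOAL = 5, 4   # cells 0..4 in a row; reaching cell 4 ends the episode
LEFT, RIGHT = 0, 1

def step(state, action):
    """Toy environment: move one cell; reward 1 only when the goal is reached."""
    nxt = max(0, state - 1) if action == LEFT else state + 1
    return nxt, (1.0 if nxt == GOAL else 0.0), nxt == GOAL

def q_learning(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]   # q[state][action]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = rng.choice((LEFT, RIGHT))  # purely random exploration
            nxt, reward, done = step(state, action)
            # Q-learning target: reward plus the discounted value of the BEST
            # next action. SARSA would instead use the value of the action
            # the agent actually takes next.
            target = reward + (0.0 if done else gamma * max(q[nxt]))
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
    return q

q = q_learning()
# Greedy policy: in each non-terminal cell, pick the action with the higher Q-value.
policy = [max((LEFT, RIGHT), key=lambda a: q[s][a]) for s in range(GOAL)]
print(policy)
```

The agent explores at random here, yet the learned Q-values still describe the best policy, because the update always looks at the best next action rather than the one the agent happened to take.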
supervised learning