
UNIT - I

Introduction to Machine Learning: Introduction, Classic and Adaptive machines, learning types (Supervised, Unsupervised), deep learning, bio-inspired adaptive systems, Machine Learning and big data.
Elements of Machine Learning: Data formats, Learnability, Statistical learning concepts, Class balancing, Elements of Information theory.

A rapidly developing field of technology, machine learning allows computers to


automatically learn from previous data. For building mathematical models and making
predictions based on historical data or information, machine learning employs a variety of
algorithms. It is currently being used for a variety of tasks, including speech recognition,
email filtering, auto-tagging on Facebook, a recommender system, and image recognition.

A subset of artificial intelligence known as machine learning focuses primarily on the


creation of algorithms that enable a computer to independently learn from data and previous
experiences.

“Without being explicitly programmed, machine learning enables a machine to automatically learn from data, improve performance from experiences, and predict things.”

How does Machine Learning work?


A machine learning system builds prediction models, learns from previous data, and predicts the output for new data whenever it receives it. The more data that is available, the better the model, and the more accurate the predicted output.

● Machine learning uses data to detect various patterns in a given dataset.


● It can learn from past data and improve automatically.
● It is a data-driven technology.
● Machine learning is similar to data mining in that it also deals with huge amounts of data.

Need for Machine Learning



● Predictive modeling: Machine learning can be used to build predictive models that
can help businesses make better decisions. For example, machine learning can be used
to predict which customers are most likely to buy a particular product, or which
patients are most likely to develop a certain disease.
● Natural language processing: Machine learning is used to build systems that can
understand and interpret human language. This is important for applications such as
voice recognition, chatbots, and language translation.
● Computer vision: Machine learning is used to build systems that can recognize and
interpret images and videos. This is important for applications such as self-driving
cars, surveillance systems, and medical imaging.
● Fraud detection: Machine learning can be used to detect fraudulent behavior in
financial transactions, online advertising, and other areas.
● Recommendation systems: Machine learning can be used to build recommendation
systems that suggest products, services, or content to users based on their past
behavior and preferences.

→ The demand for machine learning is steadily rising. Machine learning is required because it can perform tasks that are too complex for a person to implement directly. Humans are constrained by our inability to manually process vast amounts of data; as a result, we require computer systems, which is where machine learning comes in to simplify our lives.
→ We can train machine learning algorithms by providing them with a large amount of data and allowing them to automatically explore the data, build models, and predict the required output. A cost function can be used to measure how well a machine learning algorithm performs on the data. We can save both time and money by using machine learning.
→ The significance of machine learning can be easily understood from its use cases. At present, machine learning is used in self-driving cars, cyber fraud detection, face recognition, friend suggestions on Facebook, and so on. Top companies such as Netflix and Amazon have built machine learning models that use a huge amount of data to analyze user interest and recommend products accordingly.
What are Classic Machines (Traditional Machines)?
Classic machines, sometimes referred to as classical machine learning algorithms, are a subset of machine learning algorithms that discover patterns and relationships in data using statistical techniques (linear regression, logistic regression, SVM, k-NN, random forest, decision trees). These algorithms are designed to perform well in situations with a defined scope and a distinct set of characteristics.

Classic machines examine data inputs according to a predetermined set of rules, finding patterns and relationships that can be used to make predictions or decisions. Support vector machines, decision trees, and logistic regression are some of the most used classical machine learning techniques.

Advantages of classic machines:


● Easy implementation: Because classic machines don’t need sophisticated learning
algorithms or a lot of data, they are frequently easier to design and implement than
adaptive machines.
● Reduction in the likelihood of errors: Because classic machines do not learn from
their surroundings, the likelihood of errors caused by unforeseen inputs or data is
minimized.
● Less expensive to develop and maintain: Because classic machines do not need
expensive hardware or software components, they may be less expensive to design
and maintain than adaptive machines.

Disadvantages of classic machines


● Restricted adaptability: Traditional machines are built to function in accordance with
pre-established norms, which restricts their capacity to respond to novel or
unexpected circumstances.
● Reduced accuracy: In some applications, classic machines may not be as accurate as
adaptive machines due to their inability to handle complicated or unstructured input.
● Restricted scalability: Classic machines might not be able to handle sophisticated systems or vast amounts of data without extensive reprogramming or hardware upgrades.
● Lack of ability to learn: Traditional machines are incapable of gaining knowledge
from their surroundings or enhancing their performance over time, which can restrict
their capacity to adjust to shifting circumstances or enhance performance.

Applications of classic machines


● Image and video recognition: classical machine learning techniques like decision
trees, random forests, and support vector machines have been used for tasks like face
detection, object recognition and scene recognition in image and video applications.
● Natural language processing: classical machine learning methods like Naive Bayes,
logistic regression, and decision trees have been used in applications of natural
language processing like sentiment analysis, text classification, and spam filtering.
● Data mining: To find patterns and connections in massive datasets, applications of
classical machine learning algorithms, such as association rules and clustering, have
been utilized.
● Fraud detection: classical machine learning techniques, such as logistic regression and
decision trees, have been used in fraud detection applications in order to spot patterns
of fraudulent activity and alert users to potentially fraudulent transactions.
● Medical diagnosis: classical machine learning techniques have been applied to
medical diagnosis applications to recognize symptom patterns and forecast the
likelihood of specific diseases. These techniques include decision trees and support
vector machines.

What are Adaptive machines?


Adaptive machines, commonly referred to as adaptive or deep learning systems, are a class of machine learning techniques designed to automatically learn from data inputs without being explicitly programmed. By learning hierarchical representations of the input, these algorithms are able to handle more complex and unstructured data, such as photos, videos, and natural language.

Artificial neural networks, which are modeled after the structure and operation of the human brain, are used by adaptive machines. These neural networks are made up of layers of interconnected nodes, or neurons, each of which carries out a straightforward calculation. The neurons are arranged in layers, and each layer processes the input data in a unique way.

Advantages of Adaptive machines


● Handling complicated and unstructured data: By learning hierarchical representations
of the input, adaptive machines are capable of managing complex and unstructured
data, including photos, videos, and natural language. This enables them to extract and
learn useful characteristics from unprocessed data without the need for manual feature
extraction.
● Better accuracy: Adaptive machines frequently outperform traditional machine
learning algorithms in tasks like speech and picture recognition, natural language
processing, and gaming.
● Automatic feature learning: Adaptive machines automatically extract the pertinent features from the data, in contrast to traditional machine learning methods that depend on manually engineered features. This frequently leads to enhanced feature representations and accuracy.
● Scalability: By extending the neural network’s layers and neurons, adaptive machines
can be made larger to handle enormous datasets and challenging tasks. This qualifies
them for extensive use in areas including speech and image recognition, natural
language processing, and autonomous driving.

Disadvantages of Adaptive machines:


● Large volumes of labeled data are necessary for adaptive machine learning algorithms to train efficiently. Usually, the more complicated the problem, the more data is needed. Data collection and labeling can take a lot of time and money.
● A large amount of computing power is needed for adaptive machine learning
methods, including strong hardware and specialized software frameworks. It can take
days or weeks to train deep neural networks, and this demands a lot of memory and
computing power.
● Overfitting is a problem that can affect adaptive machine-learning algorithms. It
happens when the model gets too complicated and begins to memorize the training
data rather than learning broader patterns. Poor performance on new data can result
from overfitting.
● Adaptive machine learning algorithms are susceptible to adversarial attacks, in which
rogue actors deliberately alter the input data to make the model produce inaccurate
predictions or judgments.

Application of Adaptive machines


Adaptive machines, usually referred to as adaptive or deep learning systems, have several uses in a variety of industries. Some examples of their applications include:

● Recognition of voice and images: Adaptive machines can recognize faces, objects, and other visual cues in pictures and videos. Moreover, they are capable of speech recognition, enabling voice-activated technology and virtual assistants.
● Natural language processing: Language translation, sentiment analysis, and chatbots are just a few examples of applications made possible by adaptive machines' ability to comprehend and evaluate human language.
● Recommendation systems: Adaptive machines can learn from user preferences and behavior to offer tailored suggestions for goods, media, and other things.
● Autonomous vehicles: Deep learning algorithms are used to help self-driving cars
recognize objects, detect obstacles, and navigate complex environments.
● Fraud detection: Adaptive machines are capable of analyzing huge datasets to spot
trends and anomalies that might be signs of fraud.
Classification of Machine Learning:

Supervised learning:
Supervised learning is the type of machine learning in which machines are trained using well "labelled" training data, and on the basis of that data, machines predict the output. Labelled data means that some input data is already tagged with the correct output. In supervised learning, the training data provided to the machines works as the supervisor that teaches the machines to predict the output correctly. It applies the same concept as a student learning under the supervision of a teacher. Supervised learning is a process of providing input data as well as correct output data to the machine learning model. The aim of a supervised learning algorithm is to find a mapping function that maps the input variable (x) to the output variable (y).

Classification algorithms are used when the output variable is categorical, which means there are two or more classes, such as Yes-No, Male-Female, True-False, etc.
○ Random Forest
○ Decision Trees
○ Logistic Regression
○ Support vector Machines
Regression algorithms are used if there is a relationship between the input variable and the output variable. They are used for the prediction of continuous variables, such as weather forecasting, market trends, etc. (A minimal sketch of both settings follows the list below.)
○ Linear Regression
○ Regression Trees
○ Non-Linear Regression
○ Bayesian Linear Regression
○ Polynomial Regression
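A minimal sketch of both settings, assuming scikit-learn as the library (any of the algorithms listed above could be substituted for LogisticRegression or LinearRegression):

from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split

# Classification: the labelled output y is a category (the iris species).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Classification accuracy:", clf.score(X_test, y_test))

# Regression: predict a continuous variable (here, petal width from the
# other three measurements of the same dataset).
reg = LinearRegression().fit(X[:, :3], X[:, 3])
print("Regression R^2:", reg.score(X[:, :3], X[:, 3]))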
Advantages of Supervised learning:
○ With the help of supervised learning, the model can predict the output on the basis of
prior experiences.
○ In supervised learning, we can have an exact idea about the classes of objects.
○ Supervised learning models help us to solve various real-world problems such as fraud detection, spam filtering, etc.
Disadvantages of supervised learning:
○ Supervised learning models are not suitable for handling complex tasks.
○ Supervised learning cannot predict the correct output if the test data is different from
the training dataset.
○ Training requires a lot of computation time.
○ In supervised learning, we need enough knowledge about the classes of objects.

What is Unsupervised Learning?


As the name suggests, unsupervised learning is a machine learning technique in which models are not supervised using a training dataset. Instead, the models themselves find the hidden patterns and insights in the given data. It can be compared to the learning which takes place in the human brain while learning new things. It can be defined as:
Unsupervised learning is a type of machine learning in which models are trained using an unlabeled dataset and are allowed to act on that data without any supervision.

○ Unsupervised learning is helpful for finding useful insights from the data.
○ Unsupervised learning is similar to how a human learns to think through their own experiences, which makes it closer to real AI.
○ Unsupervised learning works on unlabeled and uncategorized data, which makes unsupervised learning all the more important.
○ In the real world, we do not always have input data with the corresponding output, so to solve such cases we need unsupervised learning.
Types of Unsupervised Learning Algorithm:

○ Clustering: Clustering is a method of grouping objects into clusters such that objects with the most similarities remain in one group and have few or no similarities with the objects of another group. Cluster analysis finds the commonalities between the data objects and categorizes them as per the presence and absence of those commonalities. (A minimal clustering sketch follows this list.)
○ Association: An association rule is an unsupervised learning method which is used for finding relationships between variables in a large database. It determines the set of items that occur together in the dataset. Association rules make marketing strategies more effective: for example, people who buy item X (say, bread) also tend to purchase item Y (butter or jam). A typical example of association rules in use is Market Basket Analysis.
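A minimal clustering sketch, assuming scikit-learn's KMeans (one clustering algorithm among many):

import numpy as np
from sklearn.cluster import KMeans

# Unlabeled data: two blobs of 2-D points, with no output labels supplied.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

# The algorithm groups the points into 2 clusters on its own.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)   # one center near (0, 0), one near (5, 5)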
Advantages of Unsupervised Learning
○ Unsupervised learning is used for more complex tasks as compared to supervised
learning because, in unsupervised learning, we don't have labeled input data.
○ Unsupervised learning is preferable as it is easy to get unlabeled data in comparison to
labeled data.
Disadvantages of Unsupervised Learning
○ Unsupervised learning is intrinsically more difficult than supervised learning as it
does not have corresponding output.
○ The result of the unsupervised learning algorithm might be less accurate as input data
is not labeled, and algorithms do not know the exact output in advance.
Reinforcement learning:
Reinforcement learning is a feedback-based learning method, in which a learning agent gets a reward for each right action and a penalty for each wrong action. The agent learns automatically from this feedback and improves its performance. In reinforcement learning, the agent interacts with the environment and explores it. The goal of the agent is to get the most reward points, and in doing so it improves its performance. A robotic dog that automatically learns the movement of its limbs is an example of reinforcement learning.
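A toy sketch of one reinforcement learning algorithm, tabular Q-learning (the 5-state chain environment below is a made-up illustration, not from the notes; the agent earns a reward for reaching the rightmost state):

import numpy as np

n_states, n_actions = 5, 2            # actions: 0 = move left, 1 = move right
Q = np.zeros((n_states, n_actions))   # the agent's value table, learned below
alpha, gamma, eps = 0.5, 0.9, 0.1     # learning rate, discount, exploration

rng = np.random.default_rng(0)
for _ in range(500):                  # 500 episodes of trial and error
    s = 0
    while s != n_states - 1:
        # Mostly act greedily, but occasionally explore a random action.
        a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
        s_next = max(0, s - 1) if a == 0 else s + 1
        r = 1.0 if s_next == n_states - 1 else 0.0   # reward / penalty signal
        # Q-learning update: move Q(s, a) toward r + gamma * max Q(s', a')
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q.argmax(axis=1)[:-1])          # learned policy: always move right (1s)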
DEEP LEARNING:
Deep learning models can analyze data continuously. They draw conclusions similarly to humans: by taking in information, consulting data reserves full of information, and determining an answer.
This technique enables them to recognize speech and images, and DL has made a lasting impact on fields such as healthcare, finance, retail, logistics, and robotics.

Deep learning is an evolution of machine learning. Both are algorithms that use data to learn,
but the key difference is how they process and learn from it.
While basic machine learning models do become progressively better at performing their
specific functions as they take in new data, they still need some human intervention. If an AI
algorithm returns an inaccurate prediction, then an engineer has to step in and make
adjustments.
With a deep learning model, an algorithm can determine whether or not a prediction is
accurate through its own neural network—minimal to no human help is required. A deep
learning model is able to learn through its own method of computing—a technique that
makes it seem like it has its own brain.

Other key differences include:

● ML consists of thousands of data points while DL uses millions of data points.


Machine learning algorithms usually perform well with relatively small datasets.
Deep Learning requires large amounts of data to understand and perform better than
traditional machine learning algorithms.
● Machine learning algorithms solve problems by using explicit programming. Deep
learning algorithms solve problems based on the layers of neural networks.
● Machine learning algorithms take relatively less time to train, ranging from a few
seconds to a few hours. Deep learning algorithms, on the other hand, take a lot of time
to train, ranging from a few hours to many weeks.
Deep learning: DL is a subset of machine learning. With this model, an algorithm can
determine whether or not a prediction is accurate through a neural network without human
intervention. Deep learning models can build extensive knowledge over time, acting as a
brain, of sorts.

What are the different types of deep learning algorithms?

Convolutional neural networks: Convolutional neural networks (CNNs) are algorithms


that work like the brain’s visual processing system. They can process images and detect
objects by filtering a visual prompt and assessing components such as patterns, texture,
shapes, and colors.

Recurrent neural networks: Recurrent neural networks (RNNs) are AI algorithms that use
built-in feedback loops to “remember” past data points. RNNs can use this memory of past
events to inform their understanding of current events or even predict the future.

Multilayer perceptron: Multilayer perceptrons (MLPs) are a type of algorithm used primarily in deep learning. MLPs are classified as feedforward neural networks, meaning the information the user inputs flows in one direction only, without feedback loops, which makes them better at processing unpredictable data and patterns than some other algorithms. MLPs can be used to classify images, recognize speech, solve regression problems, and more. (A minimal MLP sketch follows.)
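A minimal MLP sketch, assuming scikit-learn's MLPClassifier (deep learning frameworks such as TensorFlow or PyTorch are the usual choice for larger networks):

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# A small image-classification task: 8x8 grayscale digit images.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One hidden layer of 64 neurons; information flows forward only.
mlp = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300, random_state=0)
mlp.fit(X_train, y_train)
print("Test accuracy:", mlp.score(X_test, y_test))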
ELEMENTS OF MACHINE LEARNING
1. Data: Data is the foundation of machine learning. It includes both the input features
(attributes) and the corresponding target labels (if it's a supervised learning problem).
High-quality, relevant, and representative data is essential for training accurate
models.
2. Task: It is important to know what exactly we are trying to achieve by building our Machine Learning model. Do we have a supervised learning problem, are we simply trying to find a pattern in the data (unsupervised learning), or do we have a robot whose actions we are trying to optimize (reinforcement learning)? It is also required to know whether we have a classification or regression task at hand. Knowing the problem is pertinent to the selection of a better model.
3. Model: The model is the algorithm or mathematical representation used to learn patterns and make predictions from the data. It could be a decision tree, neural network, support vector machine, or any other algorithm suitable for the task at hand.
4. Loss Function: The loss function calculates the difference between the true output y and the approximated value f^(x). It is denoted by L. Examples: Squared Error Loss, Cross-Entropy Loss. (A sketch combining this element and the next follows the list.)
5. Learning Algorithm: (e.g., Gradient Descent) Parameter estimation in machine learning is a kind of search operation. We compute the parameters through a learning algorithm, and it becomes an optimization problem where we try to optimize the parameters by minimizing the loss. Hence, the learning algorithm and the loss function go hand in hand.
6. Evaluation : Evaluation is the process of assessing the performance of the trained
model. It involves testing the model on unseen data to measure its accuracy, precision,
recall, F1-score, or other relevant metrics depending on the problem domain.
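A minimal sketch tying elements 4 and 5 together, using NumPy and made-up toy data: a squared-error loss for a linear model f(x) = w*x + b, with plain gradient descent used to estimate the parameters w and b:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100)
y = 3.0 * x + 1.0 + rng.normal(0, 0.1, 100)    # true w = 3, b = 1, plus noise

w, b, lr = 0.0, 0.0, 0.1
for _ in range(2000):
    y_hat = w * x + b                          # approximated value f^(x)
    loss = np.mean((y - y_hat) ** 2)           # squared error loss L
    grad_w = -2 * np.mean((y - y_hat) * x)     # dL/dw
    grad_b = -2 * np.mean(y - y_hat)           # dL/db
    w -= lr * grad_w                           # gradient descent step:
    b -= lr * grad_b                           # minimize the loss
print(round(w, 2), round(b, 2), round(loss, 4))  # expect w ~ 3, b ~ 1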

Data formats
Data is a distinct piece of information that is gathered and translated for some purpose. Data can be available in different forms, such as bits and bytes stored in electronic memory, numbers or text on pieces of paper, or facts stored in a person's mind.

Big Data is defined as data that is very large in size.
Big Data can be structured, unstructured, or semi-structured, and is collected from different sources.

Structured data: It is easy to search and analyze structured data. Structured data exists in a predefined format. A relational database consisting of tables with rows and columns is one of the best examples of structured data. Structured data generally exists in tables such as Excel files and Google Sheets. Structured data is highly organized and understandable to machines. Common applications of relational databases with structured data include sales transactions, airline reservation systems, inventory control, and others. It is quantitative in nature.
Unstructured data: All unstructured files, log files, audio files, and image files are included in unstructured data. It requires a lot of storage space, and it is hard to maintain security in it. It cannot be represented in a data model or schema. That is why managing, analyzing, or searching unstructured data is hard. It is qualitative in nature.

Semi-Structured data –
Semi-structured data is information that does not reside in a relational database but that has
some organizational properties that make it easier to analyze.

DATA FORMATS IN ML:


● Tabular Data: Tabular data is structured data organized into rows and columns,
similar to a spreadsheet. Each row represents an individual sample or instance, and
each column represents a feature or attribute. Tabular data is commonly used in tasks
such as classification, regression, and clustering

● Text Data: Text data consists of unstructured text documents, such as emails, articles, tweets, or reviews. Text data is often preprocessed and represented using techniques like bag-of-words, TF-IDF (Term Frequency-Inverse Document Frequency), or word embeddings before being used in machine learning algorithms (a TF-IDF sketch follows this list). Natural Language Processing (NLP) techniques are applied to analyze and extract insights from text data.

● Image Data: Image data consists of visual representations, such as photographs,


digital images, or medical scans. Images are typically represented as arrays of pixel
values, and techniques like convolutional neural networks (CNNs) are commonly
used to analyze and classify image data. Image data is prevalent in applications such
as object detection, image recognition, and medical imaging.

● Audio Data: Audio data consists of sound recordings, such as speech, music, or
environmental sounds. Audio data is often represented as waveforms or spectrograms,
and techniques like recurrent neural networks (RNNs) or convolutional neural
networks (CNNs) are used for tasks such as speech recognition, audio classification,
and sound generation.

● Time Series Data: Time series data consists of observations collected sequentially
over time, such as stock prices, weather measurements, or sensor readings. Time
series data is characterized by temporal dependencies, and techniques like
autoregression, recurrent neural networks (RNNs), or Long Short-Term Memory
(LSTM) networks are commonly used for time series forecasting, anomaly detection,
and pattern recognition.

● Graph Data: Graph data consists of entities (nodes) and relationships (edges)
between them, represented as a graph structure. Graph data is prevalent in social
networks, recommendation systems, and biological networks. Graph neural networks
(GNNs) and graph-based algorithms are used to analyze and make predictions on
graph data.

● Geospatial Data: Geospatial data consists of geographic information, such as GPS


coordinates, maps, or satellite imagery. Geospatial data is used in applications such as
location-based services, environmental monitoring, and urban planning. Techniques
like geographic information systems (GIS), spatial analysis, and deep learning are
applied to analyze and extract insights from geospatial data.
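A minimal sketch of the text-data preprocessing mentioned above, assuming scikit-learn's TfidfVectorizer: it turns unstructured text into a tabular, numeric format that learning algorithms can consume:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat",
        "the dog chased the cat",
        "dogs and cats are pets"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)           # sparse matrix: rows = documents,
print(X.shape)                        # columns = vocabulary terms
print(vec.get_feature_names_out())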

Learnability:
→ Learnability in machine learning can be defined as the ability of AI models to quickly adapt to changes in the data and to new data. Haivo Annotation, Text Annotation, and Data Annotation are forms of machine learning data labeling that can be used to improve learnability.
→Haivo Annotation involves labeling images, videos, audio, and text to give the AI model a
better understanding of what it is looking at.
→ Text Annotation is used to label text for the AI model to recognize patterns.
→ Data annotation is used to label data such as objects, images, and languages. Data
Validation for AI is the process of validating AI models and ensuring that the results are
accurate. Through learnability, AI models can be trained to recognize patterns and respond
quickly to changes in data and new data.
→ Learnability also refers to a model's capacity to extract meaningful patterns from data, its flexibility to adjust its internal parameters based on new examples, and its ability to avoid overfitting (memorizing the training data too well) while still capturing the underlying relationships in the data. Techniques such as regularization, cross-validation, and ensemble methods are often employed to improve a model's learnability.

SOME IMPORTANT POINTS

Machine Learning Process -


1. Define the Problem
2. Collect the data
3. Prepare the data
4. Split data into training and testing sets (steps 4-7 are sketched below)
5. Algorithm Selection
6. Training the algorithm
7. Evaluation on test data
8. Parameter Tuning
9. Start using your model (deployment)
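A minimal end-to-end sketch of steps 4-7, assuming scikit-learn (the dataset and algorithm here are placeholder choices):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(       # step 4: split
    X, y, test_size=0.2, random_state=0)
model = RandomForestClassifier(n_estimators=100,           # step 5: algorithm
                               random_state=0)
model.fit(X_train, y_train)                                # step 6: training
y_pred = model.predict(X_test)                             # step 7: evaluation
print(classification_report(y_test, y_pred))               # precision/recall/F1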

Time series data refers to a sequence of data points collected, recorded, or measured at
successive and evenly spaced intervals over time.
Categorical data refers to data that represents categories or labels and does not have a natural
order or numerical value associated with it.
Categorical data can be further divided into two main subtypes (an encoding sketch for both follows the examples below):

Nominal Data:

Nominal data consists of categories that have no inherent order or ranking.


Each category is distinct and unrelated to other categories.
Examples of nominal data include:
Types of fruits: apple, banana, orange.
Colors: red, blue, green.
Marital status: single, married, divorced.
Gender: male, female, other.
Ordinal Data:

Ordinal data consists of categories with a natural order or ranking.


While the categories have a defined order, the differences between the categories may not be uniform or measurable.
Examples of ordinal data include:
Educational attainment: elementary school, high school, college, graduate school.
Rating scales: poor, fair, good, excellent.
Socioeconomic status: lower class, middle class, upper class.
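A minimal sketch of encoding the two subtypes, assuming pandas: one-hot encoding for nominal data (no order to preserve) and an ordered integer mapping for ordinal data:

import pandas as pd

df = pd.DataFrame({
    "color": ["red", "blue", "green"],                 # nominal: no order
    "rating": ["poor", "good", "excellent"],           # ordinal: has order
})

one_hot = pd.get_dummies(df["color"], prefix="color")  # nominal -> one-hot
order = {"poor": 0, "fair": 1, "good": 2, "excellent": 3}
df["rating_code"] = df["rating"].map(order)            # ordinal -> integers
print(pd.concat([df, one_hot], axis=1))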

Semi-supervised learning is a machine learning paradigm that combines elements of both


supervised and unsupervised learning. In semi-supervised learning, the training dataset
consists of a small amount of labeled data along with a much larger amount of unlabeled
data. The goal is to leverage the additional unlabeled data to improve the performance of the
learning algorithm.
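A minimal semi-supervised sketch, assuming scikit-learn's SelfTrainingClassifier: most labels are hidden (marked -1), and the wrapper pseudo-labels them from its own confident predictions:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
y_partial = y.copy()
y_partial[rng.random(len(y)) < 0.8] = -1   # hide ~80% of labels (-1 = unlabeled)

model = SelfTrainingClassifier(LogisticRegression(max_iter=1000))
model.fit(X, y_partial)                    # trains on labeled + unlabeled data
print("Accuracy against all true labels:", model.score(X, y))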

Big data and ML

When talking about the 5 V's of big data, machine learning models help to deal with them and predict accurate results. Similarly, while developing machine learning models, big data helps analytics teams extract high-quality data and improve learning methods. Machine learning is a very crucial technology, and with big data it has become even more powerful for data collection, data analysis, and data integration. This leads to improved business operations and better customer relationship management. Big data helps machine learning by providing a variety of data so that machines can learn from more samples and richer training data.

Machine learning provides efficient and automated tools for data gathering, analysis, and integration. In combination with the strengths of cloud computing, machine learning brings agility to processing and integrates large amounts of data regardless of its source.
Machine learning algorithms can be applied to every element of a big data operation, including data segmentation, data analytics, and simulation.
All these stages are integrated to create the big picture out of big data, with insights and patterns which later get categorized and packaged into an understandable format.

Bio-inspired adaptive systems


Bio-inspired adaptive systems refer to systems or technologies that draw inspiration from
biological processes and mechanisms to adapt to changing environments, solve complex
problems, and improve performance. These systems often mimic principles observed in
nature, such as evolution, self-organization, and learning, to develop innovative solutions for
various engineering, computing, and robotics applications.
Some examples of bio-inspired adaptive systems include:

1. Evolutionary Algorithms: These algorithms are inspired by the process of natural selection
and evolution. They involve generating a population of candidate solutions to a problem,
subjecting them to selective pressure, and iteratively refining them through processes like
mutation, recombination, and selection to find optimal or near-optimal solutions.

2. Artificial Neural Networks (ANNs): ANNs are computational models inspired by the
structure and function of biological neural networks in the brain. They consist of
interconnected nodes (neurons) organized in layers, and they are capable of learning from
data through a process called training. ANNs have been applied to various tasks, including
pattern recognition, classification, and optimization.

3. Swarm Intelligence: Swarm intelligence algorithms are inspired by the collective behavior
of social insects, such as ants, bees, and termites. These algorithms involve multiple agents
(particles, robots, etc.) interacting locally with one another and their environment to achieve a
global objective. Examples include ant colony optimization, particle swarm optimization, and
bee colony optimization.

4. Self-organizing Systems: These systems are inspired by biological systems that exhibit
emergent properties through decentralized interactions among their components. They are
capable of self-organization, adaptation, and robustness in response to changing conditions.
Examples include self-organizing networks, self-organizing robotic systems, and
self-organizing algorithms.

5. Genetic Algorithms: Genetic algorithms are optimization algorithms inspired by the


process of natural selection and genetics. They involve representing potential solutions to a
problem as individuals in a population, and iteratively evolving them through processes like
crossover, mutation, and selection to find optimal or near-optimal solutions.
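A toy genetic-algorithm sketch in NumPy (an illustrative assumption, not from the notes): a population of real numbers x evolves to maximize the fitness f(x) = -(x - 3)^2, whose optimum is x = 3:

import numpy as np

def fitness(x):
    return -(x - 3.0) ** 2

rng = np.random.default_rng(0)
pop = rng.uniform(-10, 10, 20)                    # initial random population

for _ in range(100):
    parents = pop[np.argsort(fitness(pop))[-10:]] # selection: keep best half
    mates = rng.permutation(parents)
    children = (parents + mates) / 2              # crossover: average parents
    children += rng.normal(0, 0.1, children.size) # mutation: small noise
    pop = np.concatenate([parents, children])

best = pop[np.argmax(fitness(pop))]
print(round(best, 3))                             # should approach 3.0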

Class Balancing -
Class balancing, also known as imbalanced learning, refers to the process of adjusting the
distribution of classes in a dataset to address class imbalance. Class imbalance occurs when
the number of instances belonging to one class significantly outweighs the number of
instances belonging to another class in a classification problem.

In many real-world scenarios, class imbalance is common. For example, in fraud detection,
the number of fraudulent transactions is typically much lower than the number of legitimate
transactions. In medical diagnosis, the number of patients with a rare disease may be much
smaller than the number of healthy individuals.

Class imbalance can pose challenges for machine learning algorithms, as they may become
biased towards the majority class, leading to poor performance in accurately predicting
minority class instances. Class balancing techniques aim to mitigate this issue by adjusting
the class distribution in the dataset.

Some common class balancing techniques include:

1. Random Oversampling: Random oversampling involves randomly duplicating instances from the minority class to increase its representation in the dataset. This approach helps to balance the class distribution but may lead to overfitting if not carefully applied.

2. Random Undersampling: Random undersampling involves randomly removing instances from the majority class to reduce its dominance in the dataset. While this approach helps balance the class distribution, it may lead to loss of important information and reduced model performance.

3. Synthetic Minority Over-sampling Technique (SMOTE): SMOTE is a popular method that generates synthetic instances for the minority class by interpolating between existing instances. This approach helps to balance the class distribution while preserving the original information in the dataset.

4. Class Weighting: Class weighting involves assigning higher weights to instances of the minority class and lower weights to instances of the majority class during model training. This approach allows the algorithm to pay more attention to minority class instances, thereby reducing bias towards the majority class.

5. Cost-sensitive Learning: Cost-sensitive learning involves explicitly specifying the costs or penalties associated with misclassifying instances of different classes. By adjusting these costs, the algorithm can learn to prioritize correctly classifying instances of the minority class. (A sketch of techniques 1 and 4 follows.)
TO DO: Statistical learning concepts, Elements of Information theory.
