ML Unit 1
→ The demand for machine learning is steadily rising. Machine learning is needed because it
can perform tasks that are too complex for a person to implement directly. Humans cannot
manually process vast amounts of data, so we rely on computer systems, and this is where
machine learning comes in to simplify our lives.
→ We can train machine learning algorithms by providing them with a large amount of data
and allowing them to automatically explore the data, build models, and predict the required
output. A cost function is used to measure how well a machine learning algorithm performs
on the data. Using machine learning can save both time and money.
→ The importance of machine learning is easy to see from its use cases. Today, machine
learning is used in self-driving vehicles, cyber fraud detection, face recognition, friend
suggestions on Facebook, and so on. Various top companies, such as Netflix and Amazon,
have built machine learning models that use huge amounts of data to analyze user interest
and recommend products accordingly.
What are Classic Machines (Traditional Machines)?
Classic machines, sometimes referred to as classical machine learning algorithms, are a subset
of machine learning algorithms that discover patterns and relationships in data using
statistical techniques (linear regression, logistic regression, SVM, k-NN, random forests,
decision trees). These algorithms are designed to perform well in situations with a defined
scope and a distinct set of features.
Classic machines examine data inputs according to a predetermined set of rules, finding
patterns and relationships that can be used to make predictions or decisions. Support vector
machines, decision trees, and logistic regression are among the most widely used classical
machine learning techniques.
Adaptive machines, in contrast, use artificial neural networks, which are modeled after the
structure and function of the human brain. These neural networks are made up of layers of
connected nodes, or neurons, each of which carries out a simple calculation. The neurons are
arranged in layers, and each layer processes the input data in a different way.
Applications of adaptive machines include:
● Voice and image recognition: Adaptive machines can pick out faces, objects, and
other visual cues in pictures and videos. They are also capable of speech
recognition, enabling voice-activated technology and virtual assistants.
● Natural language processing: Language translation, sentiment analysis, and chatbots
are just a few examples of applications made possible by adaptive machines' ability
to understand and evaluate human language.
● Recommendation systems: Adaptive machines can learn from user preferences and
behavior to offer tailored suggestions for goods, media, and other items.
● Autonomous vehicles: Deep learning algorithms are used to help self-driving cars
recognize objects, detect obstacles, and navigate complex environments.
● Fraud detection: Adaptive machines are capable of analyzing huge datasets to spot
trends and anomalies that might be signs of fraud.
Classification of Machine Learning :
Supervised learning :
Supervised learning is the type of machine learning in which machines are trained using
well-"labelled" training data, and on the basis of that data, machines predict the output.
Labelled data means input data that is already tagged with the correct output. In supervised
learning, the training data provided to the machine acts as a supervisor that teaches the
machine to predict the output correctly. It applies the same concept as a student learning
under the supervision of a teacher. Supervised learning is the process of providing input data
as well as correct output data to the machine learning model. The aim of a supervised
learning algorithm is to find a mapping function that maps the input variable (x) to the
output variable (y).
Classification algorithms are used when the output variable is categorical, meaning there are
two or more classes, such as Yes-No, Male-Female, True-False, etc.
○ Random Forest
○ Decision Trees
○ Logistic Regression
○ Support Vector Machines
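A minimal sketch of one such classifier, k-nearest neighbours (the k-NN mentioned earlier), in pure Python; the toy points and the choice of k = 3 are illustrative assumptions:

```python
# A minimal k-nearest-neighbours (k-NN) classifier in pure Python.
# The training points and k = 3 are illustrative assumptions.
import math
from collections import Counter

def knn_predict(train, k, point):
    """train: list of ((x, y), label) pairs; returns the majority label
    among the k training points closest to `point`."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B")]
print(knn_predict(train, k=3, point=(2, 2)))   # → A
print(knn_predict(train, k=3, point=(8, 7)))   # → B
```

The new point simply takes the majority label of its nearest neighbours, which is why k-NN needs no explicit training step.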
Regression algorithms are used if there is a relationship between the input variable and the
output variable. They are used for the prediction of continuous variables, in tasks such as
weather forecasting, market trend prediction, etc.
○ Linear Regression
○ Regression Trees
○ Non-Linear Regression
○ Bayesian Linear Regression
○ Polynomial Regression
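As a worked example of the first item, simple linear regression fits a line y ≈ w·x + b using the closed-form least-squares formulas for slope and intercept; the toy data (an exact line) is assumed for illustration:

```python
# Ordinary least squares for a single input variable, in pure Python:
# fits y ≈ w*x + b using the closed-form slope/intercept formulas.
# The toy data (an exact line, y = 2x + 1) is assumed for illustration.
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    w = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - w * mean_x
    return w, b

xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
w, b = fit_line(xs, ys)
print(w, b)   # → 2.0 1.0
```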
Advantages of Supervised learning:
○ With the help of supervised learning, the model can predict the output on the basis of
prior experiences.
○ In supervised learning, we can have an exact idea about the classes of objects.
○ Supervised learning models help us solve various real-world problems, such as
fraud detection, spam filtering, etc.
Disadvantages of supervised learning:
○ Supervised learning models are not suitable for handling complex tasks.
○ Supervised learning cannot predict the correct output if the test data is different from
the training dataset.
○ Training requires a lot of computation time.
○ In supervised learning, we need enough knowledge about the classes of objects.
Unsupervised learning :
Unsupervised learning is the type of machine learning in which models are trained on
unlabeled data and must find patterns and structure in that data on their own.
○ Unsupervised learning is helpful for finding useful insights from the data.
○ Unsupervised learning is similar to how a human learns to think through their own
experiences, which makes it closer to real AI.
○ Unsupervised learning works on unlabeled and uncategorized data, which makes
unsupervised learning more important.
○ In the real world, we do not always have input data with the corresponding output, so
to solve such cases, we need unsupervised learning.
Types of Unsupervised Learning Algorithm:
○ Clustering: Clustering is a method of grouping objects into clusters such that
objects with the most similarities remain in one group and have few or no similarities
with the objects of another group. Cluster analysis finds the commonalities between
the data objects and categorizes them according to the presence and absence of those
commonalities.
○ Association: An association rule is an unsupervised learning method used for
finding relationships between variables in a large database. It determines the sets of
items that occur together in the dataset. Association rules make marketing strategies
more effective: for example, people who buy item X (say, bread) also tend to
purchase item Y (butter or jam). A typical example of association rules is Market
Basket Analysis.
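The two measures behind association rules are support ("how often do these items occur together?") and confidence ("given X in the basket, how often is Y there too?"). A sketch over a handful of assumed toy transactions:

```python
# Support and confidence for association rules, over assumed toy
# transactions, as in Market Basket Analysis.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread", "milk"},
    {"milk", "jam"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(x, y):
    """Of the transactions containing x, the fraction that also contain y."""
    return support(x | y) / support(x)

print(support({"bread", "butter"}))                 # → 0.5
print(round(confidence({"bread"}, {"butter"}), 2))  # → 0.67
```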
Advantages of Unsupervised Learning
○ Unsupervised learning is used for more complex tasks as compared to supervised
learning because, in unsupervised learning, we don't have labeled input data.
○ Unsupervised learning is preferable as it is easy to get unlabeled data in comparison to
labeled data.
Disadvantages of Unsupervised Learning
○ Unsupervised learning is intrinsically harder than supervised learning, because the
model has no corresponding output to learn from.
○ The result of an unsupervised learning algorithm may be less accurate, since the input
data is not labeled and the algorithm does not know the exact output in advance.
Reinforcement learning :
Reinforcement learning is a feedback-based learning method in which a learning agent gets
a reward for each right action and a penalty for each wrong action. The agent learns
automatically from this feedback and improves its performance. In reinforcement learning,
the agent interacts with the environment and explores it. The goal of the agent is to collect
the most reward points, and in doing so it improves its performance. A robotic dog that
automatically learns the movement of its limbs is an example of reinforcement learning.
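The reward/penalty loop above can be sketched as a tiny agent choosing between two actions; the reward scheme, exploration rate, and learning rate are illustrative assumptions:

```python
# A minimal reward/penalty loop: the agent tries two actions, receives +1
# for the "right" one and -1 for the "wrong" one, and updates a value
# estimate for each. Reward scheme and rates are illustrative assumptions.
import random

random.seed(0)
values = [0.0, 0.0]     # estimated value of action 0 and action 1
alpha = 0.1             # learning rate
for step in range(200):
    # explore 10% of the time; otherwise exploit the best-looking action
    if random.random() < 0.1:
        a = random.randrange(2)
    else:
        a = values.index(max(values))
    reward = 1 if a == 1 else -1      # action 1 is the rewarded action
    values[a] += alpha * (reward - values[a])

print(values.index(max(values)))      # → 1: the agent prefers the rewarded action
```

After a few hundred trials the estimate for the rewarded action approaches +1 and the penalized action drifts negative, so the agent exploits the right action almost all the time.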
DEEP LEARNING :
Deep learning models can analyze data continuously. They draw conclusions similar to
humans—by taking in information, consulting data reserves full of information, and
determining an answer.
This technique enables deep learning models to recognize speech and images, and DL has
made a lasting impact on fields such as healthcare, finance, retail, logistics, and robotics.
Deep learning is an evolution of machine learning. Both are algorithms that use data to learn,
but the key difference is how they process and learn from it.
While basic machine learning models do become progressively better at performing their
specific functions as they take in new data, they still need some human intervention. If an AI
algorithm returns an inaccurate prediction, then an engineer has to step in and make
adjustments.
With a deep learning model, an algorithm can determine whether or not a prediction is
accurate through its own neural network—minimal to no human help is required. A deep
learning model is able to learn through its own method of computing—a technique that
makes it seem like it has its own brain.
Recurrent neural networks : Recurrent neural networks (RNNs) are AI algorithms that use
built-in feedback loops to “remember” past data points. RNNs can use this memory of past
events to inform their understanding of current events or even predict the future.
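A single recurrent step can be illustrated with fixed (untrained) weights; the feedback of the previous hidden state into the next one is what lets an input's effect persist:

```python
# One step of a toy recurrent cell: the new hidden state mixes the current
# input with the previous hidden state; that feedback loop is the "memory".
# The weights are fixed illustrative numbers, not trained values.
import math

def rnn_step(x, h_prev, w_x=0.5, w_h=0.8, b=0.0):
    return math.tanh(w_x * x + w_h * h_prev + b)

h = 0.0
for x in [1.0, 0.0, 0.0]:      # the input fires only at the first step...
    h = rnn_step(x, h)
    print(round(h, 3))          # ...but its effect persists in later steps
```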
Data formats :
Data is a distinct piece of information that is gathered and translated for some purpose. Data
can be available in different forms, such as bits and bytes stored in electronic memory,
numbers or text on pieces of paper, or facts stored in a person's mind.
Big Data is defined as data that is very large in size. Big Data can be structured,
unstructured, or semi-structured, and is collected from many different sources.
Structured data : Structured data is easy to search and analyze. It exists in a predefined
format; a relational database consisting of tables with rows and columns is one of the best
examples of structured data. Structured data generally exists in tables such as Excel files and
Google Docs spreadsheets. It is highly organized and understandable for machine processing.
Common applications of relational databases with structured data include sales transactions,
airline reservation systems, inventory control, and others. Structured data is quantitative in
nature.
Unstructured data : Unstructured data includes unstructured files such as log files, audio
files, and image files. It requires a lot of storage space, and it is hard to maintain security for
it. It cannot be represented in a data model or schema, which is why managing, analyzing, or
searching unstructured data is hard. It is qualitative in nature.
Semi-structured data :
Semi-structured data is information that does not reside in a relational database but that has
some organizational properties that make it easier to analyze.
● Text Data: Text data consists of unstructured text documents, such as emails, articles,
tweets, or reviews. Text data is often preprocessed and represented using techniques
like bag-of-words, TF-IDF (Term Frequency-Inverse Document Frequency), or word
embeddings before being used in machine learning algorithms. Natural Language
Processing (NLP) techniques are applied to analyze and extract insights from text
data.
● Audio Data: Audio data consists of sound recordings, such as speech, music, or
environmental sounds. Audio data is often represented as waveforms or spectrograms,
and techniques like recurrent neural networks (RNNs) or convolutional neural
networks (CNNs) are used for tasks such as speech recognition, audio classification,
and sound generation.
● Time Series Data: Time series data consists of observations collected sequentially
over time, such as stock prices, weather measurements, or sensor readings. Time
series data is characterized by temporal dependencies, and techniques like
autoregression, recurrent neural networks (RNNs), or Long Short-Term Memory
(LSTM) networks are commonly used for time series forecasting, anomaly detection,
and pattern recognition.
● Graph Data: Graph data consists of entities (nodes) and relationships (edges)
between them, represented as a graph structure. Graph data is prevalent in social
networks, recommendation systems, and biological networks. Graph neural networks
(GNNs) and graph-based algorithms are used to analyze and make predictions on
graph data.
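The TF-IDF representation mentioned for text data above reduces to a few lines; the three toy "documents" are assumed for illustration:

```python
# A bare-bones TF-IDF computation over three tiny assumed "documents".
# Real pipelines add tokenization and smoothing; this shows the core idea:
# a term scores high when frequent in one document but rare across documents.
import math
from collections import Counter

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]

def tf_idf(term, doc):
    tf = Counter(doc)[term] / len(doc)       # term frequency within the doc
    df = sum(1 for d in docs if term in d)   # document frequency
    idf = math.log(len(docs) / df)           # inverse document frequency
    return tf * idf

print(round(tf_idf("cat", docs[0]), 3))   # "cat" is informative in doc 0
print(tf_idf("the", docs[0]))             # → 0.0: "the" appears everywhere
```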
Learnability:
→ Learnability in machine learning can be defined as the ability of AI models to quickly
adapt to changes in the data and to new data. Haivo Annotation, Text Annotation, and Data
Annotation are forms of machine learning data labeling that can be used to improve
learnability.
→Haivo Annotation involves labeling images, videos, audio, and text to give the AI model a
better understanding of what it is looking at.
→ Text Annotation is used to label text for the AI model to recognize patterns.
→ Data annotation is used to label data such as objects, images, and languages. Data
Validation for AI is the process of validating AI models and ensuring that the results are
accurate. Through learnability, AI models can be trained to recognize patterns and respond
quickly to changes in data and new data.
→ Learnability also refers to a model's capacity to extract meaningful patterns from data, its
flexibility to adjust its internal parameters based on new examples, and its ability to avoid
overfitting (memorizing the training data too well) while still capturing the underlying
relationships in the data. Techniques such as regularization, cross-validation, and ensemble
methods are often employed to improve a model's learnability.
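Cross-validation, one of the techniques named above, repeatedly holds out one "fold" of the data for testing. A sketch that only produces the index splits (which model to train on each split is left open):

```python
# k-fold cross-validation: partition the sample indices into k folds, then
# yield (train, test) index pairs so every sample is tested exactly once.
def k_fold_splits(n_samples, k):
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train_idx, test_idx

for train_idx, test_idx in k_fold_splits(6, 3):
    print(sorted(test_idx), sorted(train_idx))   # each sample is held out once
```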
Time series data refers to a sequence of data points collected, recorded, or measured at
successive and evenly spaced intervals over time.
Categorical data refers to data that represents categories or labels and does not have a natural
order or numerical value associated with it.
Categorical data can be further divided into two main subtypes:
Nominal Data: categories with no inherent order, such as colours or blood types.
Ordinal Data: categories with a meaningful order but no fixed numeric spacing, such as
ratings like low, medium, and high.
Machine Learning and Big Data :
Machine learning provides efficient and automated tools for data gathering, analysis, and
integration. Combined with the power of cloud computing, machine learning brings agility to
processing and integrating large amounts of data regardless of its source.
Machine learning algorithms can be applied to every element of a Big Data operation,
including: data segmentation, data analytics, and simulation.
All these stages are integrated to create the big picture out of Big Data, with insights and
patterns that later get categorized and packaged into an understandable format.
Biologically Inspired Approaches :
1. Evolutionary Algorithms: These algorithms are inspired by the process of natural selection
and evolution. They involve generating a population of candidate solutions to a problem,
subjecting them to selective pressure, and iteratively refining them through processes like
mutation, recombination, and selection to find optimal or near-optimal solutions.
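The generate-select-mutate loop described above can be sketched as a tiny evolutionary search for the maximum of an assumed fitness function f(x) = -(x - 7)²:

```python
# A tiny evolutionary loop: candidates evolve toward the maximum of the
# assumed fitness function f(x) = -(x - 7)**2 through selection + mutation.
# Population size, mutation scale, and generation count are illustrative.
import random

random.seed(1)

def fitness(x):
    return -(x - 7) ** 2          # best possible candidate is x = 7

population = [random.uniform(0, 20) for _ in range(20)]
for generation in range(50):
    population.sort(key=fitness, reverse=True)
    parents = population[:10]                               # selection: keep the fitter half
    children = [p + random.gauss(0, 0.5) for p in parents]  # mutation
    population = parents + children                         # parents survive (elitism)

best = max(population, key=fitness)
print(round(best, 1))    # the best candidate ends up close to 7
```

Because the parents are carried over each generation, the best solution found so far is never lost, which is what drives the steady convergence.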
2. Artificial Neural Networks (ANNs): ANNs are computational models inspired by the
structure and function of biological neural networks in the brain. They consist of
interconnected nodes (neurons) organized in layers, and they are capable of learning from
data through a process called training. ANNs have been applied to various tasks, including
pattern recognition, classification, and optimization.
3. Swarm Intelligence: Swarm intelligence algorithms are inspired by the collective behavior
of social insects, such as ants, bees, and termites. These algorithms involve multiple agents
(particles, robots, etc.) interacting locally with one another and their environment to achieve a
global objective. Examples include ant colony optimization, particle swarm optimization, and
bee colony optimization.
4. Self-organizing Systems: These systems are inspired by biological systems that exhibit
emergent properties through decentralized interactions among their components. They are
capable of self-organization, adaptation, and robustness in response to changing conditions.
Examples include self-organizing networks, self-organizing robotic systems, and
self-organizing algorithms.
Class Balancing -
Class balancing, also known as imbalanced learning, refers to the process of adjusting the
distribution of classes in a dataset to address class imbalance. Class imbalance occurs when
the number of instances belonging to one class significantly outweighs the number of
instances belonging to another class in a classification problem.
In many real-world scenarios, class imbalance is common. For example, in fraud detection,
the number of fraudulent transactions is typically much lower than the number of legitimate
transactions. In medical diagnosis, the number of patients with a rare disease may be much
smaller than the number of healthy individuals.
Class imbalance can pose challenges for machine learning algorithms, as they may become
biased towards the majority class, leading to poor performance in accurately predicting
minority class instances. Class balancing techniques aim to mitigate this issue by adjusting
the class distribution in the dataset.
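One simple balancing technique is random oversampling: minority-class rows are resampled with replacement until both classes are the same size. The 95/5 toy split below, echoing the fraud-detection example, is assumed:

```python
# Random oversampling, one simple class-balancing technique: minority-class
# rows are resampled with replacement until both classes are the same size.
# The assumed toy dataset has 95 "legit" and 5 "fraud" examples.
import random
from collections import Counter

random.seed(0)
data = [("tx%d" % i, "legit") for i in range(95)] + \
       [("tx%d" % i, "fraud") for i in range(95, 100)]

counts = Counter(label for _, label in data)
minority = min(counts, key=counts.get)                 # the rarer class ("fraud")
gap = max(counts.values()) - counts[minority]          # how many copies to add
minority_rows = [row for row in data if row[1] == minority]
balanced = data + random.choices(minority_rows, k=gap)

print(Counter(label for _, label in balanced))         # 95 of each class
```

Oversampling duplicates information rather than creating it, so more sophisticated methods (e.g. synthetic sampling) or undersampling of the majority class are often considered as well.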
TO DO :
Statistical learning concepts,
Elements of Information theory.