
Module 3- Introduction to AIML (21CS752)

1.3 MACHINE LEARNING IN RELATION TO OTHER FIELDS


Machine learning primarily draws on the concepts of Artificial Intelligence, Data Science, and Statistics; it is the result of combining ideas from these diverse fields.
1.3.1 Machine Learning and Artificial Intelligence

Figure 1.3: Relationship of AI with Machine Learning

Artificial Intelligence (AI) is a broad field focused on creating systems (called
"intelligent agents") that can perform tasks autonomously, such as robots or other
autonomous systems. The early goal of AI was ambitious: to create intelligent
systems that could think and act like humans, focusing on logic and reasoning.
However, AI faced several challenges and periods of slow progress, called AI
winters, where enthusiasm and funding declined. AI’s resurgence came with the
rise of data-driven systems—models that learn by finding patterns in data. This
led to the development of Machine Learning (ML), a key branch of AI.
Machine Learning aims to extract patterns from data to make predictions.
Instead of explicitly programming systems for every possible scenario, ML
algorithms "learn" from examples (training data) and can handle new, unseen
situations. Machine learning includes various techniques like reinforcement
learning, where agents learn by interacting with their environment.


Relationship Between AI and Machine Learning:


 AI is the broader field aiming to create intelligent agents.
 ML is a subfield of AI that focuses on learning from data.
 Deep Learning, a subset of ML, uses neural networks inspired by the human brain
to build models. These networks consist of layers of interconnected units ("neurons") that
process information in a way that mimics how the brain works, and they are especially
useful for tasks like image and speech recognition.

1.3.2 Machine Learning, Data Science, Data Mining, and Data Analytics
Data Science is an umbrella term that covers various fields related to working
with data. It involves gathering, processing, analyzing, and drawing insights from
data. Machine learning starts with data, which makes it closely linked to data
science. Here’s how machine learning connects to related fields:
Big Data:
Big data is part of data science and refers to massive volumes of data generated by
companies like Facebook, Twitter, and YouTube. It deals with three key
characteristics:
1. Volume: The sheer amount of data being generated.
2. Variety: Data comes in many forms—text, images, videos, etc.
3. Velocity: The speed at which data is generated and processed.
Big data is essential for machine learning because many algorithms rely on large
datasets for training. For example, deep learning (a subfield of ML) uses big data
for tasks like image recognition and language translation.
Data Mining:
Data mining originally came from business applications. It’s like "mining" for
valuable information hidden in large datasets. While data mining and machine
learning overlap significantly, the distinction is:
 Data Mining: Focuses on discovering hidden patterns in data.
 Machine Learning: Uses those patterns to make predictions.
Data Analytics:


1. Statistics:
o Mathematical Models: Uses complex equations (e.g., regression, ANOVA) to explain data.
o Knowledge Required: Strong statistical background needed for analysis and interpretation.
o Goal: Primarily concerned with verifying relationships and patterns in data.
2. Machine Learning (ML):
 Definition: A branch of AI focused on building models that learn from data to make
predictions or decisions without being explicitly programmed.
 Key Features:
o Data-driven: Focuses on learning from data patterns for predictions.
o Fewer Assumptions: Fewer restrictions on the data (e.g., it can handle non-normal data).
o Automation: Emphasizes using tools and algorithms to automate the learning process.
o Flexibility: Works well with large, complex datasets; adaptable to different scenarios.
o Goal: Makes predictions based on learned patterns, often without needing detailed
statistical knowledge.
1.2 TYPES OF MACHINE LEARNING
What does the word ‘learn’ mean? Learning, like adaptation, occurs as the result of
interaction of the program with its environment. There are four types of machine learning
as shown in Figure 1.5.

Before discussing the types of learning, it is necessary to discuss data.


In this grid game, the gray tile indicates the danger, black is a block, and the tile with
diagonal lines is the goal. The aim is to start, say from the bottom-left grid, and use the
actions left, right, top and bottom to reach the goal state.
To solve this sort of problem, there is no data. The agent interacts with the environment
to get experience. In the above case, the agent tries to create a model by simulating many
paths and finding rewarding paths. This experience helps in constructing a model.

1.5 CHALLENGES OF MACHINE LEARNING


Machine learning allows computers to solve certain types of problems much
better than humans, especially tasks involving computation. For instance,
computers can quickly calculate the square root of large numbers or win games
like chess and Go against professional players.
However, humans are still better than machines at tasks like recognition, though
modern machine learning systems, especially deep learning, are improving rapidly;
for example, machines can now recognize human faces almost instantly. But there are
still challenges in machine learning, mainly due to the need for high-quality data.
Key Challenges in Machine Learning:
1. Well-Posed vs. Ill-Posed Problems:
o Machine learning works well with well-posed problems, where the problem is clearly
defined and has enough information to find a solution.
o In ill-posed problems, there may be multiple possible answers, making it hard to find
the correct one. For example, in a simple dataset (as shown in Table 1.3), several models
could fit the data (e.g., multiplication or division). To solve such problems, more data is
needed to narrow down the correct model.
Table 1.3: An Example

Input (x1, x2) Output (y)

1, 1 1
2, 1 2
3, 1 3
4, 1 4
5, 1 5


Can a model for this data be multiplication, that is, y = x1 * x2? Well, it is true! But it is
equally true that y may be y = x1 / x2 or y = x1 ^ x2. So, there are three functions that fit
the data.
This means that the problem is ill-posed. To solve this problem, one needs more examples to
check the model. Puzzles and games that do not have sufficient specification may become
ill-posed problems, and scientific computation has many ill-posed problems.
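The ambiguity can be checked directly. The following is a minimal sketch (our own illustration, not part of the module) that tests the three candidate functions against the rows of Table 1.3; because x2 is always 1, every candidate fits, which is exactly why the problem is ill-posed.

```python
# Sketch: checking that several candidate models all fit Table 1.3.
# The data and candidate functions come from the text above; the
# dictionary layout is our own for illustration.

data = [((1, 1), 1), ((2, 1), 2), ((3, 1), 3), ((4, 1), 4), ((5, 1), 5)]

candidates = {
    "y = x1 * x2": lambda x1, x2: x1 * x2,
    "y = x1 / x2": lambda x1, x2: x1 / x2,
    "y = x1 ** x2": lambda x1, x2: x1 ** x2,
}

for name, f in candidates.items():
    fits = all(f(x1, x2) == y for (x1, x2), y in data)
    print(f"{name}: fits all rows -> {fits}")  # every candidate prints True
```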
2. Need for Huge, Quality Data:
o Machine learning requires large amounts of high-quality data. The data must be
complete, without missing or incorrect values. Poor-quality data can lead to inaccurate
models.
3. High Computational Power:
o With the growth of Big Data, machine learning tasks require powerful computers with
specialized hardware like GPUs or TPUs to handle the high computational load. The
increasing complexity of tasks has made high-performance computing essential.
4. Complexity of Algorithms:
o Choosing the right machine learning algorithm, explaining how it works, applying it
correctly, and comparing different algorithms are now critical skills for data scientists.
This makes the selection and evaluation of algorithms a significant challenge.
5. Bias-Variance Trade-off:
o Overfitting: When a model performs well on training data but fails on test data, it’s
called overfitting. This means the model has learned the training data too well but lacks
generalization to new data.
o Underfitting: When a model fails to perform well on both training and test data, it’s
called underfitting. The model is too simple to capture the patterns in the data.
o Balancing between overfitting and underfitting is a major challenge for machine
learning algorithms.
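As a rough numerical illustration of this trade-off (our own sketch; the sine target, noise level, and polynomial degrees are arbitrary choices, assuming NumPy is available), fitting polynomials of increasing degree shows training error falling while test error eventually rises:

```python
# Sketch: underfitting vs. overfitting with polynomial fits of
# increasing degree on noisy data. All choices here are illustrative.
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 20)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, x_train.size)
x_test = np.linspace(0.01, 0.99, 50)    # unseen points in the same range
y_test = np.sin(2 * np.pi * x_test)     # noise-free ground truth

for degree in (1, 4, 15):               # too simple / about right / too flexible
    coeffs = np.polyfit(x_train, y_train, degree)   # may warn at high degree
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE = {train_mse:.3f}, "
          f"test MSE = {test_mse:.3f}")

# Expected pattern: degree 1 is poor on both sets (underfitting), while
# degree 15 is near-perfect on training data but worse on unseen data
# (overfitting). Degree 4 balances the two.
```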
1.6 MACHINE LEARNING PROCESS
The emerging process model for data mining solutions in business organizations is
CRISP-DM. Since machine learning is like data mining, except for the aim, this process can
be used for machine learning as well. CRISP-DM stands for CRoss-Industry Standard Process
for Data Mining. This process involves six steps, which are listed below in Figure 1.11.


1. Understanding the business – This step involves understanding the objectives and
requirements of the business organization. Generally, a single data mining algorithm is
enough to provide the solution. This step also involves the formulation of the problem
statement for the data mining process.
2. Understanding the data – This step involves data collection, study of the
characteristics of the data, formulation of a hypothesis, and matching of patterns to the
selected hypothesis.
3. Preparation of data – This step involves producing the final dataset by cleaning the
raw data and preparing it for the data mining process. Missing values may cause problems
during both the training and testing phases. Missing data forces classifiers to produce
inaccurate results; this is a perennial problem for classification models. Hence, suitable
strategies should be adopted to handle missing data.
4. Modelling – This step applies a data mining algorithm to the data to obtain a model
or pattern.
5. Evaluate – This step involves the evaluation of the data mining results using statistical
analysis and visualization methods. The performance of the classifier is determined by
evaluating its accuracy. The process of classification is a fuzzy issue; for example,
classification of emails requires extensive domain knowledge and domain experts. Hence,
the performance of the classifier is very crucial.
6. Deployment – This step involves the deployment of results of the data mining
algorithm to improve the existing process or for a new situation.


6. Most Probable Value: Predict missing values using machine learning algorithms like
decision trees.
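As a hedged sketch of this idea (the tiny dataset and column names below are hypothetical, and scikit-learn and pandas are assumed to be available), a decision tree trained on the complete rows can predict a missing value from the other attributes:

```python
# Sketch: filling a missing value with the "most probable value"
# predicted by a decision tree, as described above.
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

df = pd.DataFrame({
    "age":   [25, 32, 47, 51, 38],
    "hours": [40, 45, 50, 38, None],   # one missing value to impute
})

known = df[df["hours"].notna()]
unknown = df[df["hours"].isna()]

# Train on rows where 'hours' is known, using the other attribute(s),
# then predict 'hours' for the incomplete row.
model = DecisionTreeRegressor().fit(known[["age"]], known["hours"])
df.loc[unknown.index, "hours"] = model.predict(unknown[["age"]])
print(df)
```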

Removal of Noisy or Outlier Data


In data analysis, noise refers to random errors or variations in
the data that can distort the results of analysis. Noise can affect
data accuracy and, if not removed, can lead to misleading
conclusions. Therefore, it's important to clean noisy data before
applying any analysis or machine learning algorithms.
What is Noise?
 Noise is random error or variance in measured values.
 It can appear as outliers, missing values, or inconsistent data.
 Noise reduction is an essential step in data cleaning to improve
the quality of analysis.
Techniques for Removing Noise:
One common method to remove noisy data is binning, which
organizes data into groups (bins) and then applies smoothing
techniques to remove noise. Binning methods can also be used
for data discretization, which reduces the number of values for
easier analysis.
Binning Method:
 Step 1: Sort the data in increasing order.
 Step 2: Divide the sorted data into equal-frequency bins (also
called buckets).
 Step 3: Apply smoothing techniques within each bin to reduce
noise.
Smoothing Techniques for Binning:
1. Smoothing by Means:


o Replace all values in the bin with the mean (average) of the bin
values.
Example:
o Given data: S = {12, 14, 19, 22, 24, 26, 28, 31, 34}
o First, divide into bins of size 3:
 Bin 1: {12, 14, 19}
 Bin 2: {22, 24, 26}
 Bin 3: {28, 31, 34}
o Now apply smoothing by means (replace all values with the bin's
mean):
 Bin 1 (mean = 15): {15, 15, 15}
 Bin 2 (mean = 24): {24, 24, 24}
 Bin 3 (mean = 31): {31, 31, 31}
o Explanation: Each value in the bin is replaced by the mean of the
bin to smooth the data.
2. Smoothing by Medians:
o Replace all values in the bin with the median of the bin values (the
middle value when the data is sorted).
Example:
o Given the same data and bins:
 Bin 1 (median = 14): {14, 14, 14}
 Bin 2 (median = 24): {24, 24, 24}
 Bin 3 (median = 31): {31, 31, 31}
o Explanation: Each value in the bin is replaced by the median,
which reduces the effect of outliers or extreme values.
3. Smoothing by Bin Boundaries:
o Replace each value in the bin with the closest boundary value
(minimum or maximum value in the bin).


Example:
o Given the same data and bins:
 Bin 1 (boundary values: 12 and 19): {12, 12, 19}
 Bin 2 (boundary values: 22 and 26): {22, 22, 26}
 Bin 3 (boundary values: 28 and 34): {28, 34, 34}
o Explanation: For each bin, values are replaced by the closest
boundary value (either the minimum or maximum of that bin).
o Example: In Bin 1, the original data was {12, 14, 19}. The
boundaries are 12 and 19, so the value 14 is closer to 12, and it's
replaced by 12.
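The three smoothing techniques can be reproduced with a short sketch in plain Python (our own illustration of the worked example above; note that the tie-breaking rule for equidistant values in boundary smoothing is an assumption, flagged in the comments):

```python
# Sketch: the three bin-smoothing techniques on the dataset from the
# worked example above (S is already sorted; bins of size 3).
import statistics

S = [12, 14, 19, 22, 24, 26, 28, 31, 34]
bins = [S[i:i + 3] for i in range(0, len(S), 3)]

by_means = [[statistics.mean(b) for _ in b] for b in bins]
by_medians = [[statistics.median(b) for _ in b] for b in bins]

# Boundaries: each value snaps to the nearer of the bin's min or max.
# For equidistant values the convention varies; here ties go to the
# lower boundary, so Bin 3's 31 maps to 28 (the example above maps
# it to 34 instead).
by_bounds = [
    [min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
    for b in bins
]

print(by_means)    # [[15, 15, 15], [24, 24, 24], [31, 31, 31]]
print(by_medians)  # [[14, 14, 14], [24, 24, 24], [31, 31, 31]]
print(by_bounds)   # [[12, 12, 19], [22, 22, 26], [28, 28, 34]]
```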

Why Use Binning to Remove Noise?


 Smoothing by Means: Reduces random noise by averaging the
values within each bin.
 Smoothing by Medians: More robust against outliers than using
means since medians are less sensitive to extreme values.
 Smoothing by Bin Boundaries: Eliminates noise by forcing all
values within the bin to adhere to the boundaries, creating a more
consistent dataset.

Data Integration and Data Transformations


Data integration involves routines that merge data from multiple
sources into a single data source. This merging may introduce
redundancies, and the main goal of data integration is to detect and
remove the redundancies that arise from integration. Data
transformation routines perform operations like normalization to
improve the performance of the data mining algorithms. It is
necessary to transform data so that it can be processed; this can be
considered a preliminary stage of data conditioning. Normalization
is one such technique: the attribute values are scaled to fit in a
range (say 0–1) to improve the performance of the data mining
algorithm. These techniques are often used in neural networks. Some
of the normalization procedures used are:
1. Min-Max
2. z-Score
Min-Max Procedure – This is a normalization technique where each
value v of a variable V is reduced by the minimum value and scaled by
the range, mapping it to a new range, say 0–1. Neural networks often
require this kind of normalization. The formula to implement this
normalization is given as:

v' = \frac{v - \min}{\max - \min} \times (\text{new\_max} - \text{new\_min}) + \text{new\_min} \qquad (2.1)

Here, max − min is the range; min and max are the minimum and
maximum of the given data, and new_min and new_max are the minimum
and maximum of the target range, say 0 and 1.

Example 2.2: Consider the set V = {88, 90, 92, 94}. Apply the Min-Max
procedure and map the marks to a new range 0–1.
Solution: The minimum of the list V is 88 and the maximum is 94. The
new min and new max are 0 and 1, respectively. The mapping can be
done using Eq. (2.1) as:

v'(88) = (88 − 88)/(94 − 88) = 0
v'(90) = (90 − 88)/(94 − 88) ≈ 0.33
v'(92) = (92 − 88)/(94 − 88) ≈ 0.67
v'(94) = (94 − 88)/(94 − 88) = 1

So, it can be observed that the marks {88, 90, 92, 94} are mapped to
the new range {0, 0.33, 0.67, 1}. Thus, the Min-Max normalization
range is between 0 and 1.
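A one-line sketch of Eq. (2.1) applied to Example 2.2 (plain Python, our own illustration):

```python
# Sketch: Min-Max normalization of Example 2.2, mapping V to [0, 1].
V = [88, 90, 92, 94]
lo, hi = min(V), max(V)
new_min, new_max = 0.0, 1.0

normalized = [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
              for v in V]
print([round(v, 2) for v in normalized])  # [0.0, 0.33, 0.67, 1.0]
```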

z-Score Normalization – This procedure works by taking the
difference between the field value and the mean value, and scaling
this difference by the standard deviation of the attribute:

v' = \frac{v - m}{s} \qquad (2.2)

Here, s is the standard deviation of the list V and m is the mean of
the list V.
Example 2.3: Consider the mark list V = {10, 20, 30}; convert the
marks to z-scores.
Solution: The mean and sample standard deviation (s) of the list V
are 20 and 10, respectively. So the z-scores of these marks are
calculated using Eq. (2.2) as:

z(10) = (10 − 20)/10 = −1
z(20) = (20 − 20)/10 = 0
z(30) = (30 − 20)/10 = 1

Hence, the z-scores of the marks 10, 20, 30 are −1, 0 and 1, respectively.
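A matching sketch for Eq. (2.2) and Example 2.3, using the sample standard deviation as in the solution above (plain Python, our own illustration):

```python
# Sketch: z-score normalization of Example 2.3.
import statistics

V = [10, 20, 30]
m = statistics.mean(V)    # 20
s = statistics.stdev(V)   # sample standard deviation: 10

z = [(v - m) / s for v in V]
print(z)  # [-1.0, 0.0, 1.0]
```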

Data Reduction
Data reduction reduces the size of the data while producing essentially
the same analytical results. There are different ways in which data
reduction can be carried out, such as data aggregation, feature
selection, and dimensionality reduction.

2.4 DESCRIPTIVE STATISTICS


Descriptive statistics is a branch of statistics that summarizes and
describes datasets. It is purely descriptive and does not go beyond
that; in other words, descriptive statistics is not concerned with
machine learning algorithms or their functioning.
Let us discuss descriptive statistics with the fundamental concepts of
data types.
Dataset and Data Types
A dataset can be assumed to be a collection of data objects. The data
objects may be records, points, vectors, patterns, events, cases,
samples or observations. These records contain many attributes. An
attribute can be defined as the property or characteristics of an object.

An advantage of such a small dataset is that by visual inspection one
can find out who got more marks; for larger datasets, visual
inspection quickly becomes impractical.

2.5.2 Central Tendency


Therefore, a condensation or summary of the data is necessary. This
makes the data analysis easy and simple. One such summary is called
central tendency. Thus, central tendency can explain the
characteristics of the data, which further helps in comparison. Mass
data have a tendency to concentrate at certain values, normally in the
central location; a measure of this concentration is called a measure
of central tendency (or average). Popular measures are the mean,
median and mode.
1. Mean – The arithmetic average (or mean) is a measure of central
tendency that represents the 'center' of the dataset. Mathematically,
the mean of all the values in the sample is denoted as x̄. Let
x1, x2, …, xN be a set of N values or observations; then the
arithmetic mean is given as:

\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i

For example, the mean of the three numbers 10, 20, and 30 is
(10 + 20 + 30)/3 = 20.


• Weighted mean – Unlike the arithmetic mean, which weights all items
equally, the weighted mean gives different importance to different
items, since item importance varies. In the case of a frequency
distribution, the mid-values of the ranges are taken for computation.
The weighted mean is computed by summing the products of the weights
(or proportions) and the item (or group) values and dividing by the
sum of the weights:

\bar{x}_w = \frac{\sum_{i=1}^{N} w_i x_i}{\sum_{i=1}^{N} w_i}

It is mostly used when the sample sizes are unequal.
• Geometric mean – Let x1, x2, …, xN be a set of N values or
observations. The geometric mean is the Nth root of the product of
the N items. The formula for computing the geometric mean is:

GM = \sqrt[n]{x_1 x_2 \cdots x_n}

Here, n is the number of items and the xi are the values. For example,
if the values are 6 and 8, the geometric mean is
\sqrt{6 \times 8} = \sqrt{48} \approx 6.93. For larger datasets,
computing the product directly is difficult; hence the geometric mean
is usually calculated using logarithms as:

GM = \exp\left( \frac{1}{n} \sum_{i=1}^{n} \ln x_i \right)

A problem with the mean is its extreme sensitivity to noise: even
small changes in the input can affect the mean drastically. Hence,
for larger datasets, the extreme values (say, the top 2%) are often
chopped off before the mean is calculated.

2. Median – The middle value in the distribution is called the
median. If the total number of items in the distribution is odd, the
median is the middle value; if it is even, the median is the average
of the two middle values. For grouped data, the median class is the
class in which the (N/2)th item is present.


In the continuous (grouped) case, the median is given by the formula:

\text{Median} = L_1 + \frac{N/2 - cf}{f} \times i

The median class is the class in which the (N/2)th item is present.
Here, i is the class interval of the median class, L1 is the lower
limit of the median class, f is the frequency of the median class,
and cf is the cumulative frequency of all classes preceding the
median class.
3. Mode – The mode is the value that occurs most frequently in the
dataset. In other words, the value that has the highest frequency is
called the mode.

2.5.3 Dispersion
The spread of a set of data around the central tendency (mean, median
or mode) is called dispersion. Dispersion is represented in various
ways, such as range, variance, standard deviation, and standard
error. These are second-order measures. The most common measures of
dispersion are listed below:
Range – The range is the difference between the maximum and minimum
values of the given list of data.
Standard Deviation – The mean does not convey much more than a middle
point. For example, the datasets {10, 20, 30} and {10, 50, 0} both
have a mean of 20; the difference between these two sets is the spread
of the data. Standard deviation is, roughly, the average distance from
the mean of the dataset to each point.
The formula for the standard deviation is given by:

\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - m)^2} \qquad (2.8)

Here, N is the size of the population, xi is an observation or value
from the population, and m is the population mean. Often, N − 1 is
used instead of N in the denominator of Eq. (2.8), which gives the
sample standard deviation.
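A short sketch (our own illustration, using Python's statistics module) computing the range and both flavours of standard deviation for the two datasets discussed above:

```python
# Sketch: range and standard deviation for two datasets with the
# same mean (20) but different spreads.
import statistics

for data in ([10, 20, 30], [10, 50, 0]):
    data_range = max(data) - min(data)
    pop_sd = statistics.pstdev(data)   # divides by N, as in Eq. (2.8)
    samp_sd = statistics.stdev(data)   # divides by N - 1
    print(data, "range:", data_range,
          "population sd:", round(pop_sd, 2),
          "sample sd:", round(samp_sd, 2))
```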

Dr. Sudhamani M J, Professor, Dept. of CSE, RNSIT
