unit-1 Lecture Notes
1.4.1.1 Abstraction:
Representing raw input data in a structured format is the typical task of a learning algorithm.
The work of assigning meaning to the data occurs during the abstraction process.
During knowledge representation, the computer summarizes the raw input in a model.
The model may be in any one of the following forms:
(a) Computational blocks like if/else rules
(b) Mathematical equations
(c) Specific data structures like trees or graphs.
(d) Logical groupings of similar observations
This process of fitting the model based on the input data is known as training, and the input data on which the model is fitted is known as training data (a small illustrative sketch follows).
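For illustration only (not part of the original notes), a minimal Python sketch of training: a decision tree, i.e. a structure of if/else rules, is fitted to a tiny made-up training set. The feature meanings and values are assumptions.

```python
# A minimal sketch of "training": fitting a model (here a decision tree,
# i.e. a structure of if/else rules) to a small made-up training set.
from sklearn.tree import DecisionTreeClassifier

# Training data: hours studied and attendance (%) for a few students (made up)
X_train = [[2, 60], [8, 90], [1, 40], [9, 95], [5, 75]]
y_train = [0, 1, 0, 1, 1]             # labels: 0 = fail, 1 = pass

model = DecisionTreeClassifier()      # choose a model form
model.fit(X_train, y_train)           # training: fit the model to the training data

# The fitted model summarizes the raw inputs and can label new observations
print(model.predict([[7, 85]]))       # e.g. [1]
```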
1.4.1.2 Generalization:
FIG: Classification
Regression:
Regression algorithms are used to solve regression problems, in which the output variable is continuous and depends on the input variables. They are used to predict continuous output variables such as market trends, weather forecasts, etc.
Regression is a type of supervised learning that learns from labelled data sets to predict a continuous output for new data (a minimal sketch is given after the list below).
Some popular Regression algorithms are given below:
1. Linear Regression Algorithm
2. Logistic Regression
3. Multivariate Regression Algorithm
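As an illustrative sketch only (the data values and the use of scikit-learn are assumptions, not part of the notes), a simple linear regression fitted to labelled data:

```python
# A minimal linear-regression sketch: fit a line to labelled (x, y) pairs
# and predict a continuous output for unseen input. All values are made up.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])   # input variable (e.g. years of experience)
y = np.array([30, 35, 42, 48, 55])        # continuous output (e.g. salary in thousands)

model = LinearRegression()
model.fit(X, y)                           # learn the relationship from labelled data

print(model.coef_, model.intercept_)      # fitted slope and intercept
print(model.predict([[6]]))               # predicted continuous output for x = 6
```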
1.5.2 Unsupervised learning:
Unsupervised learning is a type of machine learning in which models are trained on an unlabelled dataset and are allowed to act on that data without any supervision.
In unsupervised learning, the objective is to take a dataset as input and find similar groupings or patterns within the data elements or records.
Unsupervised learning is also termed a descriptive model, and the process of unsupervised learning is referred to as pattern discovery or knowledge discovery (a small clustering sketch is given below).
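A minimal sketch of pattern discovery, assuming k-means clustering from scikit-learn and made-up 2-D points (neither is prescribed by the notes):

```python
# A minimal unsupervised-learning sketch: k-means groups unlabelled points
# into clusters purely from their similarity. The points are made up.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 2.0], [1.5, 1.8], [1.0, 0.6],
              [8.0, 8.0], [9.0, 9.0], [8.5, 9.5]])   # unlabelled records

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)            # discovered groupings, no labels supplied

print(labels)                             # e.g. [0 0 0 1 1 1]
print(kmeans.cluster_centers_)            # centre of each discovered group
```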
1. Healthcare:
o Diagnostics: ML algorithms can analyze medical images (like X-rays, MRIs) to
detect conditions such as tumors, fractures, and infections.
o Personalized Medicine: Tailoring treatment plans based on individual patient
data and genetic information.
2. Finance:
o Fraud Detection: Identifying unusual patterns in transactions that may indicate
fraudulent activity.
o Algorithmic Trading: Developing trading strategies that can analyze market
conditions and execute trades at optimal times.
3. Retail:
o Recommendation Systems: Providing personalized product recommendations to
customers based on their browsing and purchase history.
o Inventory Management: Predicting demand for products to optimize stock
levels and reduce waste.
o Customer Sentiment Analysis: Analyzing customer reviews and feedback to
improve products and services.
4. Transportation:
o Autonomous Vehicles: Developing self-driving cars that can navigate and make
decisions in real-time.
o Route Optimization: Finding the most efficient routes for logistics and delivery
services.
o Traffic Prediction: Predicting traffic patterns and suggesting optimal routes to
reduce congestion.
5. Manufacturing:
o Predictive Maintenance: Monitoring equipment to predict failures and schedule
maintenance before breakdowns occur.
o Quality Control: Inspecting products for defects using computer vision and other
sensors.
6. Entertainment:
o Content Recommendation: Suggesting movies, music, and other content based
on user preferences.
o Content Creation: Using generative models to create music, art, and even entire
movie scripts.
o Audience Analysis: Analyzing audience behavior to improve content delivery
and marketing strategies.
Data plays a vital role in machine learning; many data scientists point out that inadequate, noisy, and unclean data make machine learning algorithms extremely difficult to train.
Bias-related errors exist when certain elements of the dataset are weighted more heavily or given more importance than others. Biased data leads to inaccurate results, skewed outcomes, and other analytical errors.
2. Preparing to Model
Once the data is prepared for modelling, the following activities are carried out in the learning step (a holdout-split sketch is given after this list):
• The input data is first divided into two parts – the training data and the test data (called the holdout). This step is applicable to supervised learning only.
• Consider different models or learning algorithms for selection.
• Train the model based on the training data (for supervised learning), or directly apply the chosen model to the data (for unsupervised learning).
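A hedged sketch of the holdout split described above, using scikit-learn's train_test_split; the 80/20 ratio and the toy data are assumptions:

```python
# Holdout split: divide the input data into training data and test data.
# The 80/20 ratio, random_state and the toy data are illustrative assumptions.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)          # 10 records with 2 attributes each
y = np.array([0, 1, 0, 1, 1, 0, 1, 0, 1, 0])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape)        # (8, 2) (2, 2)
```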
1. Qualitative data:
• Qualitative data provides information about the quality of an object or information which
cannot be measured.
• For example, if we consider the quality of performance of students in terms of 'Good', 'Average', and 'Poor', it falls under the category of qualitative data.
• Also, the name or roll number of students is information that cannot be measured using any scale of measurement.
• Qualitative data is also called categorical data.
• Qualitative data - two types:
(a) Nominal data
(b) Ordinal data
(a) Nominal data:
Nominal data is data which has no numeric value, but a named value.
• It is used for assigning named values to attributes.
• Nominal values cannot be quantified.
• Examples of nominal data are
1. Blood group: A+, B, O, AB, etc.
2. Nationality: Indian, American, British, etc.
3. Gender: Male, Female, Other
• Mathematical operations such as addition, subtraction, multiplication, etc. and statistical functions such as mean, variance, etc. cannot be performed on nominal data; only frequency counts and the mode are meaningful (a small sketch is given below).
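For illustration (not from the notes), a short pandas sketch with made-up blood-group values, showing that counting categories works on nominal data while arithmetic such as the mean does not:

```python
# Nominal data holds named values only: frequency counts (and the mode) make
# sense, but arithmetic such as the mean does not. The values are made up.
import pandas as pd

blood_group = pd.Series(["A+", "B", "O", "AB", "O", "A+", "O"])

print(blood_group.value_counts())   # how often each named value occurs
print(blood_group.mode()[0])        # most frequent category, e.g. "O"
# blood_group.mean()                # arithmetic on named values is undefined
#                                   # (typically raises a TypeError)
```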
Variance of an attribute x is measured as:
variance(x) = (1/n) Σ (xᵢ − x̄)²
where x is the variable or attribute whose variance is to be measured, x̄ is its mean, and n is the number of observations.
Standard deviation of the data is measured as the square root of the variance:
σ(x) = √variance(x)
Larger value of variance or standard deviation indicates more dispersion in the data and vice
versa.
In the above example, if we calculate the variance of attribute 1 and that of attribute 2, it is clear from the measure that attribute 1 values are quite concentrated around the mean while attribute 2 values are extremely spread out (a small computational sketch is given below).
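A minimal sketch of computing variance and standard deviation for two attributes; the values below are illustrative, not the ones from the notes' example:

```python
# Variance and standard deviation of two attributes. The values are made up to
# mimic one attribute concentrated around its mean and one widely spread out.
import numpy as np

attribute_1 = np.array([48, 49, 50, 51, 52])      # concentrated around the mean
attribute_2 = np.array([10, 30, 50, 70, 90])      # extremely spread out

for name, values in [("attribute 1", attribute_1), ("attribute 2", attribute_2)]:
    var = np.var(values)          # (1/n) * sum of squared deviations from the mean
    std = np.sqrt(var)            # standard deviation = square root of variance
    print(name, "mean =", values.mean(), "variance =", var, "std dev =", round(std, 2))
```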
2. Measuring data value position:
When the data values of an attribute are arranged in increasing order, we have seen earlier that the median gives the central data value, which divides the entire data set into two halves.
Similarly, if the first half of the data is divided into two halves so that each half consists of one quarter of the data set, then the median of the first half is known as the first quartile, or Q1.
In the same way, if the second half of the data is divided into two halves, then the median of the second half is known as the third quartile, or Q3.
The overall median is also known as the second quartile, or Q2. So, any data set has five summary values: minimum, first quartile (Q1), median (Q2), third quartile (Q3), and maximum (a small sketch follows).
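A minimal sketch of computing the five summary values with NumPy; the data set is made up:

```python
# Five-number summary of a data set: minimum, Q1, median (Q2), Q3, maximum.
# The data values are made up for illustration.
import numpy as np

data = np.array([3, 5, 7, 8, 12, 13, 14, 18, 21])

q1, q2, q3 = np.percentile(data, [25, 50, 75])   # quartiles split the data into quarters
print("min =", data.min(), "Q1 =", q1, "Q2 (median) =", q2,
      "Q3 =", q3, "max =", data.max())
```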
2.3.2.1 Box Plots: It is a type of chart that depicts a group of numerical data through their
quartiles. It is a simple way to visualize the shape of our data.
The components of a box plot give a five-number summary of a set of data, which is:
Minimum – It is the minimum value in the dataset excluding the outliers
First Quartile (Q1) – 25% of the data lies below the First (lower) Quartile.
Median (Q2) – It is the mid-point of the dataset. Half of the values lie below it and half above.
Third Quartile (Q3) – 75% of the data lies below the Third (Upper) Quartile.
Maximum – It is the maximum value in the dataset excluding the outliers.
The area inside the box (50% of the data) is known as the Inter Quartile Range. The IQR
is calculated as – IQR = Q3-Q1
Outliers are the data points lying below the lower limit or above the upper limit.
The lower and upper limits are calculated as: Lower Limit = Q1 - 1.5*IQR, Upper Limit = Q3 + 1.5*IQR. A small box-plot sketch is given below.
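A hedged sketch of drawing a box plot and computing the IQR-based limits, assuming matplotlib and made-up data:

```python
# Box plot of a numeric attribute, plus the IQR-based outlier limits.
# The data values are made up; 99 plays the role of an outlier.
import numpy as np
import matplotlib.pyplot as plt

data = np.array([7, 12, 14, 15, 18, 21, 23, 25, 99])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1                                  # Inter Quartile Range
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
print("IQR =", iqr, "lower limit =", lower_limit, "upper limit =", upper_limit)

plt.boxplot(data)        # points beyond 1.5*IQR from the box are drawn as outliers
plt.title("Box plot of the sample attribute")
plt.show()
```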
2.3.2.2 HISTOGRAM:
A histogram is another plot which helps in effective visualization of numeric attributes. It helps in understanding the distribution of numeric data over a series of intervals, also termed 'bins'.
The important difference between a histogram and a box plot is:
The focus of a histogram is to plot ranges of data values (acting as 'bins'); the number of data elements in each range depends on the data distribution, and the size of the bar corresponding to each range varies accordingly.
The focus of a box plot is to divide the data elements in a data set into four equal portions, such that each portion contains an equal number of data elements. A small histogram sketch is given below.
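A minimal histogram sketch, assuming matplotlib and a made-up set of student marks:

```python
# Histogram of a numeric attribute: the value range is split into intervals
# ('bins') and each bar shows how many values fall in that bin. Marks are made up.
import numpy as np
import matplotlib.pyplot as plt

marks = np.array([35, 42, 47, 51, 55, 58, 61, 63, 66, 70, 72, 75, 81, 88, 93])

plt.hist(marks, bins=5)            # 5 equal-width intervals acting as bins
plt.xlabel("Marks")
plt.ylabel("Number of students")
plt.title("Histogram of student marks")
plt.show()
```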
2.3.3 Exploring categorical data: Categorical data is used to group the information with similar characteristics. The mean and median cannot be applied to categorical variables; only the mode is applicable (a small sketch is given after the list below).
• An attribute may have one or more modes. The frequency distribution of an attribute having
1. a single mode is called 'unimodal',
2. two modes is called 'bimodal', and
3. multiple modes is called 'multimodal'.
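A short pandas sketch of the mode of a categorical attribute; the grade values are assumptions, not from the notes:

```python
# Mode of a categorical attribute: the most frequently occurring named value.
# The grades below are made up for illustration.
import pandas as pd

grades = pd.Series(["Good", "Average", "Good", "Poor", "Average", "Good"])

print(grades.value_counts())    # frequency distribution of the categories
print(grades.mode().tolist())   # ['Good'] -> one mode, so the distribution is unimodal
```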
Let's take a sample data set: