Unit 1 Machine Learning
The randn function, part of NumPy's random module, generates random numbers from the standard normal distribution. The Series constructor creates a pandas Series, which consists of an index (the first column) and a second column containing the random values. At the bottom of the output is the datatype of the series. The index of the series can be customized by calling the following:
import pandas as pd
import numpy as np
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
print(s)
DataFrame
DataFrame is a 2D data structure with columns that can be of different data types. It can be seen as a
table. A DataFrame can be formed from the following data structures: A NumPy array, Lists, Dicts,
Series, etc.
A DataFrame can be created from a dictionary of series as below:
import pandas as pd
d = {'c1': pd.Series(['A', 'B', 'C']),'c2': pd.Series([1, 2, 3, 4])}
df = pd.DataFrame(d)
print(df)
import pandas as pd
d = {'c1': ['A', 'B', 'C', 'D'],'c2': [1, 2, 3, 4]}
df = pd.DataFrame(d)
print(df)
XLS
To read data from an Excel file, the read_excel() function can be used, and the to_excel() function can be used to write an Excel file.
Example: Write a program that reads the Book1.xlsx file, displays its content, stores Sid and Sname in another dataframe, and writes it to the mybook.xlsx file.
import pandas as pd
book = pd.read_excel('/content/drive/My Drive/Data/Book1.xlsx')
print(book)
b = book[["Sid", "Sname"]]
print(b)
b.to_excel("/content/drive/My Drive/Data/mybook.xlsx")
JSON Data
JSON is a syntax for storing and exchanging data. JSON is text, written with JavaScript object notation.
Python has a built-in package called json, which can be used to work with JSON data. If we have a JSON string, we can parse it by using the json.loads() method. The result will be a Python dictionary (or a list of dictionaries, if the JSON text is an array). If we have a Python object, we can convert it into a JSON string by using the json.dumps() method.
Example: Represent the Id, Name, and Email of 3 persons in JSON format, load it into a Python object, and display it. Again, represent the Id, Name, and Email of 3 persons in a dictionary, convert it into JSON format, and display it.
import json #JSON Data
x = """[
{"ID":101,"name":"Ram", "email":"[email protected]"},
{"ID":102,"name":"Bob", "email":"[email protected]"},
{"ID":103,"name":"Hari", "email":"[email protected]"}
]"""
# loads method converts x into a list of dictionaries
y = json.loads(x)
print(y)
# Displaying the email id of all persons from the list
for r in y:
    print(r["email"])
Data Cleansing
Data cleansing, sometimes known as data cleaning or data scrubbing, denotes the procedure of
rectifying inaccurate, unfinished, duplicated, or other flawed data within a dataset. This task entails
detecting data discrepancies and subsequently modifying, enhancing, or eliminating the data to
rectify them. Through data cleansing, data quality is enhanced, thereby furnishing more precise,
uniform, and dependable information crucial for organizational decision-making.
# Count missing and non-missing values in each column of the emp dataframe
for c in emp.columns:
    print(emp[c].isnull().value_counts())
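Once the missing or duplicated entries are identified, they can be corrected or removed. A minimal sketch of such cleanup, assuming the employees.csv file and Salary column used in the later examples:
import pandas as pd
emp = pd.read_csv('/content/drive/My Drive/Data/employees.csv')
emp.drop_duplicates(inplace=True)                         # remove duplicated rows
emp["Salary"].fillna(emp["Salary"].mean(), inplace=True)  # replace missing salaries with the mean
emp.dropna(inplace=True)                                  # drop rows that still contain missing values
print(emp.isnull().sum())                                 # verify that no missing values remain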
Merging Data
To combine datasets, the concat function of pandas can be utilized. We can concatenate two or more dataframes together.
Example
import pandas as pd
import numpy as np
emp = pd.read_csv('/content/drive/My Drive/Data/employees.csv')
e1 = emp[0:5]
print("First 5 Rows of Dataframe:")
print(e1)
e2 = emp[10:15]
print("Rows 10-15 of Dataframe:")
print(e2)
print("Concatenated Dataframe:")
e=pd.concat([e1,e2])
print(e)
Data operations
Once the missing data is handled, various operations such as aggregate operations, joins etc. can be
performed on the data.
Aggregation Operations
There are a number of aggregation operations, such as average, sum, and so on, which we would
like to perform on a numerical field. These aggregate methods are discussed below.
Average: The mean() method of a pandas dataframe is used for finding the average of a specified numerical field of the dataframe.
Sum: The sum() method is used for finding the total of a specified numerical field of the dataframe.
Max: The max() method is used for finding the maximum value of a specified numerical field of the dataframe.
Min: The min() method is used for finding the minimum value of a specified numerical field of the dataframe.
Standard Deviation: The std() method is used for finding the standard deviation of a specified numerical field of the dataframe.
Count: The count() method is used for finding the total number of values in a specified field of the dataframe.
Example
import pandas as pd
import numpy as np
emp = pd.read_csv('/content/drive/My Drive/Data/employees.csv')
emp.drop_duplicates(inplace=True)
avgsal = emp["Salary"].mean()
print("Average Salary=", avgsal)
totsal = emp["Salary"].sum()
print("Total Salary=", totsal)
maxsal = emp["Salary"].max()
print("Maximum Salary=", maxsal)
minsal = emp["Salary"].min()
print("Minimum Salary=", minsal)
nemp = emp["First Name"].count()
print("#Employees=", nemp)
teams = emp["Team"].drop_duplicates().count()
print("#Teams=", teams)
std = emp["Bonus %"].std()
print("Standard Deviation of Bonus=", std)
groupby Function
A groupby operation involves some combination of splitting the object, applying a function, and
combining the results. This can be used to group large amounts of data and compute operations on these
groups.
import pandas as pd
import numpy as np
emp = pd.read_csv('/content/drive/My Drive/Data/employees.csv')
emp.drop_duplicates(inplace=True)
emp.dropna(inplace=True)
avgsal = emp[["Team", "Salary"]].groupby(["Team"]).mean()
print("Average Salary For Each Team")
print(avgsal)
gencount = emp.groupby(["Gender"]).count()
print("#Employees Gender Wise")
print(gencount)
minbonus = emp[["Gender", "Bonus %"]].groupby(["Gender"]).min()
print("Minimum Bonus% For Each Gender")
print(minbonus)
For all normal distributions, 68.2% of the observations will appear within plus or minus one
standard deviation of the mean; 95.4% of the observations will fall within +/- two standard
deviations; and 99.7% within +/- three standard deviations. This fact is sometimes referred to as the
"empirical rule," a heuristic that describes where most of the data in a normal distribution will appear.
This means that data falling outside of three standard deviations ("3-sigma") would signify rare
occurrences.
Example
from scipy import stats
import matplotlib.pyplot as plt
import numpy as np
dist = stats.norm(loc=5.6, scale=1)  # Here 5.6 is the mean and 1 is the SD
# Generate a sample of 100 random penguin heights
heights = dist.rvs(size=100)
heights = heights.round(2)
heights = np.sort(heights)
prob = dist.pdf(x=5.2)
print("Probability of being height 5.2=", prob)
probs = dist.pdf(x=[4.5, 5, 5.5, 6, 6.5])
print("Probability of heights=", probs)
probs = dist.pdf(x=heights)
# Plotting histogram with density curve
plt.figure(figsize=(6, 4))
plt.hist(heights, bins=20, density=True)
plt.title("Height Histogram and Density Curve")
plt.xlabel("Height")
plt.ylabel("Frequency")
plt.plot(heights, probs)
plt.show()
The method pdf() from the norm class can help us find the probability of some randomly selected
value. It returns the probabilities for specific values from a normal distribution. PDF stands for
probability density function.
The norm method cdf() helps us to calculate the proportion of a normally distributed population
that is less than or equal to a given value. CDF stands for Cumulative Distribution Function.
Example
from scipy import stats
import matplotlib.pyplot as plt
import numpy as np
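# A minimal sketch of cdf() usage, assuming the same height distribution
# (mean 5.6, SD 1) as in the previous example:
dist = stats.norm(loc=5.6, scale=1)
p = dist.cdf(x=5.2)  # proportion of heights less than or equal to 5.2
print("Proportion of heights <= 5.2 =", p)
print("Proportion of heights between 5 and 6 =", dist.cdf(x=6) - dist.cdf(x=5))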
Z-score
Z-score is a statistical measurement that describes a value's relationship to the mean of a group of
values. Z-score is measured in terms of standard deviations from the mean. If a Z-score is 0, it
indicates that the data point's score is identical to the mean value. A Z-score of 1.0 would indicate
a value that is one standard deviation from the mean. Z-scores may be positive or negative, with a
positive value indicating the value is larger than the mean and a negative z-score indicating it is
smaller than the mean. It is calculated by using the formula given below.
z= (x − μ)/σ
Here, x is the value in the distribution, μ is the mean of the distribution, and σ is the standard
deviation of the distribution. Conversely, if x is a normal random variable with mean μ and
standard deviation σ, it is calculated as below.
x = σz + μ
Numerical Example
A survey of daily travel time had these results (in minutes): 26, 33, 65, 28, 34, 55, 25, 44,
50, 36, 26, 37, 43, 62, 35, 38, 45, 32, 28, 34. Convert the values to z-scores ("standard scores").
Solution
μ = 38.8, σ = 11.4
Original Value Standard Score (z-score)
26 (26-38.8) / 11.4 = −1.12
33 (33-38.8) / 11.4 = −0.51
65 (65-38.8) / 11.4 = 2.30
Example
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
dist = stats.norm(loc=50, scale=10)
scores = dist.rvs(size=100)
scores = scores.round()
print(*scores)
plt.hist(scores, bins=30)
plt.title("Histogram of Original Scores")
plt.show()
# Converting scores to z-scores
z = stats.zscore(scores).round(3)
print(*z)
plt.hist(z, bins=30)
plt.title("Histogram of Z-values of Scores")
plt.show()
# Converting z-scores back to values in the distribution
s = (scores.std() * z + scores.mean()).round()
print(*s)
print(*scores)
StandardScaler computes the z-score of every value while normalizing data, and the normalized values can be inverse-scaled back to the original distribution as shown above.
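A minimal sketch of this behaviour using scikit-learn's StandardScaler (the score values here are hypothetical):
import numpy as np
from sklearn.preprocessing import StandardScaler

scores = np.array([45, 50, 55, 60, 65, 70, 75]).reshape(-1, 1)  # hypothetical scores
scaler = StandardScaler()
z = scaler.fit_transform(scores)        # every value is replaced by its z-score
original = scaler.inverse_transform(z)  # inverse scaling recovers the original values
print(z.ravel().round(3))
print(original.ravel())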
Binomial Distribution
Binomial distribution is a probability distribution that summarizes the likelihood that a variable will take
one of two independent values under a given set of parameters. The distribution is obtained by
performing a number of Bernoulli trials. A Bernoulli trial is assumed to meet each of these criteria.
There must be only 2 possible outcomes.
Each outcome has a fixed probability of occurring. A success has the probability of p, and a
failure has the probability of 1 – p.
Each trial is completely independent of all others.
For example, the probability of getting a head or a tail is 50%. If we take the same coin and flip it n times, the probability of getting x heads can be computed using the probability mass function (PMF) of the binomial distribution. For any random variable x, the binomial distribution formula is given by
P(x) = (n! / (x! (n − x)!)) × p^x × q^(n−x)
Where n is the number of times the coin is flipped, p is the probability of success, q = 1 − p is the probability of failure, and x is the number of successes desired.
Numerical Example: If a coin is tossed 5 times, find the probability of: (a) Exactly 2 heads and (b)
at least 4 heads.
Solution
Number of trials: n=5
Probability of head: p= 1/2 and hence the probability of tail, q =1/2
For exactly two heads: x=2
P(x = 2) = (5! / (2! × 3!)) × 0.5² × 0.5³ = 0.3125
Again, for at least 4 heads:
P(x ≥ 4) = (5! / (4! × 1!)) × 0.5⁴ × 0.5¹ + (5! / (5! × 0!)) × 0.5⁵ × 0.5⁰
P(x ≥ 4) = 0.15625 + 0.03125 = 0.1875
Example
from scipy import stats
import matplotlib.pyplot as plt
dist = stats.binom(n=5, p=0.5)
prob = dist.pmf(k=2)
print("Probability of two heads=", prob)
prob = dist.pmf(k=4) + dist.pmf(k=5)
print("Probability of at least 4 heads=", prob)
Note!!!
A probability mass function is a function that gives the probability that a discrete random variable is
exactly equal to some value. A probability mass function differs from a probability density function
(PDF) in that the latter is associated with continuous rather than discrete random variables. A PDF
must be integrated over an interval to yield a probability.
Poisson Distribution
The Poisson distribution is a discrete distribution. It estimates how many times an event can happen in a specified interval, given the mean occurrence of the event in that interval. For example, if someone eats twice a day, what is the probability he will eat thrice? If lambda (λ) is the mean occurrence of the events per interval, then the probability of having k occurrences within a given interval is given by the following formula.
P(k) = (λᵏ × e^(−λ)) / k!
Where e is Euler's number, k is the number of occurrences for which the probability is going to be determined, and λ is the mean number of occurrences.
Numerical Example
In the World Cup, an average of 2.5 goals are scored in each game. Modeling this situation with a
Poisson distribution, what is the probability that 3 goals are scored in a game? What is the
probability that 5 goals are scored in a game?
Solution
Given, λ=2.5
P(x = 3) = (2.5³ × e^(−2.5)) / 3! = 0.214
P(x = 5) = (2.5⁵ × e^(−2.5)) / 5! = 0.0668
Example
from scipy import stats
dist=stats.poisson(2.5)#2.5 is average values
prob=dist.pmf(k=1)
print("Probability of having 1-goal=",prob) prob=dist.pmf(k=3)
print("Probability of having 3-goals=",prob) prob=dist.pmf(k=5)
print("Probability of having 5-goals=",prob)
P-value
The P-value is known as the probability value. It is defined as the probability of getting a result that is either the same as or more extreme than the actual observations. A p-value is used in a statistical test to determine whether the null hypothesis is rejected or not. The null hypothesis is a statement that says that there is no difference between two measures. If the hypothesis is that people who clock in 4 hours of study every day score more than 90 marks out of 100, the null hypothesis would be that there is no relation between the number of hours clocked in and the marks scored. If the p-value is equal to or less than the significance level, then the null hypothesis is inconsistent with the data and it needs to be rejected. The p-value table below shows the hypothesis interpretations.
P-value          Decision
P-value > 0.05   The result is not statistically significant and hence the null hypothesis is accepted.
P-value < 0.05   The result is statistically significant. Generally, reject the null hypothesis in favor of the alternative hypothesis.
P-value < 0.01   The result is highly statistically significant, and thus the null hypothesis is rejected in favor of the alternative hypothesis.
Suppose the null hypothesis is "It is common for students to score 68 marks in mathematics." Let's define the significance level at 5%. If the p-value is less than 5%, then the null hypothesis is rejected and it is not common to score 68 marks in mathematics. First calculate the z-score of 68 marks (say z68) and then calculate the p-value for the given z-score as below.
pv = P(z ≥ z68)
This means pv×100% of the students score above the specified score of 68.
import numpy as np
from scipy import stats
# Generate 100 random scores with mean=50 and SD=10
dist = stats.norm(loc=50, scale=10)
scores = dist.rvs(size=100)
mean = scores.mean()
SD = scores.std()
z = (68 - mean) / SD  # z-value of score=68
print("Z-value of score=68:", z)
p = stats.norm.cdf(z)  # probability of score < 68
pv = 1 - p             # probability of score >= 68
print(pv)
pvp = np.round(pv * 100, 2)
print(f"p-value={pvp}%")
if pv > 0.05:
    print("Null hypothesis is accepted: It is common to score 68 marks in mathematics.")
else:
    print("Null hypothesis is rejected: It is not common to score 68 marks in mathematics.")
One-tailed and Two-tailed Tests
A one-tailed test may be either left-tailed or right-tailed. A left-tailed test is used when the
alternative hypothesis states that the true value of the parameter specified in the null hypothesis is less
than the null hypothesis claims. A right-tailed test is used when the alternative hypothesis states that the
true value of the parameter specified in the null hypothesis is greater than the null hypothesis claims.
The main difference between one-tailed and two-tailed tests is that one-tailed tests will only have one
critical region whereas two-tailed tests will have two critical regions. If we require a 100(1-α)%
confidence interval we have to make some adjustments when using a two-tailed test. The confidence
interval must remain a constant size, so if we are performing a two-tailed test, as there are twice as many
critical regions then these critical regions must be half the size. This means that when performing a two-
tailed test, we need to consider α/2 significance level rather than α.
Example: A light bulb manufacturer claims that its energy-saving light bulbs last an average of 60 days. Set up a hypothesis test to check this claim and comment on what sort of test we need to use.
The example in the previous section was an instance of a one-tailed test where the null hypothesis
is rejected or accepted based on one direction of the normal distribution. In a two-tailed test, both
the tails of the null hypothesis are used to test the hypothesis. In a two-tailed test, when a
significance level of 5% is used, then it is distributed equally in the both directions, that is, 2.5% of
it in one direction and 2.5% in the other direction.
Let's understand this with an example. The mean score of the mathematics exam at a national level is
60 marks and the standard deviation is 3 marks. The mean marks of a class are 53. The null
hypothesis is that the mean marks of the class are similar to the national average.
from scipy import stats
zs = ( 53 - 60 ) / 3.0
print(f"z-score={zs}")
pv= stats.norm.cdf(zs)
print(f"p-value={pv}")
pv=(pv*100).round(2)
print(f"p-value={pv}%")
So, the p-value is 0.98%. For the null hypothesis to be rejected, the p-value should be less than 2.5% in either direction of the bell curve. Since the p-value is less than 2.5%, we can reject the null hypothesis and clearly state that the average marks of the class are significantly different from the national average.
Type 1 and Type 2 Errors
A type 1 error appears when the null hypothesis of an experiment is true, but still, it is rejected. A
type 1 error is often called a false positive. Consider the following example. There is a new drug
that is being developed and it needs to be tested on whether it is effective in combating diseases.
The null hypothesis is that “it is not effective in combating diseases.” The significance level is kept
at 5% so that the null hypothesis can be accepted confidently 95% of the time. However, 5% of the time we will wrongly reject the null hypothesis even though it is true, which means that even though the drug is ineffective, it is assumed to be effective. The Type 1 error is controlled by
controlling the significance level, which is α. α is the highest probability to have a Type 1 error.
The lower the α, the lower will be the Type 1 error.
The Type 2 error is the kind of error that occurs when we do not reject a null hypothesis that is false. A type 2 error is also known as a false negative. This kind of error occurs in the drug scenario when the drug is accepted as ineffective but it is actually effective. The probability of a type 2 error is β. Beta depends on the power of the test, meaning the probability of not committing a type 2 error is equal to 1 − β. There are 3 parameters that can affect the power of a test: sample size (n), significance level of the test (α), and the "true" value of the tested parameter.
Sample size (n): Other things being equal, the greater the sample size, the greater the power of
the test.
Significance level (α): The lower the significance level, the lower the power of the test. If we
reduce the significance level (e.g., from 0.05 to 0.01), the region of acceptance gets bigger.
As a result, we are less likely to reject the null hypothesis. This means we are less likely to
reject the null hypothesis when it is false, so we are more likely to make a Type II error. In
short, the power of the test is reduced when we reduce the significance level; and vice versa.
The "true" value of the parameter being tested: The greater the difference between the
"true" value of a parameter and the value specified in the null hypothesis, the greater the
power of the test.
These errors can be controlled one at a time. If one of the errors is lowered, then the other one increases.
It depends on the use case and the problem statement that the analysis is trying to address, and
depending on it, the appropriate error should reduce. In the case of this drug scenario, typically, a
Type 1 error should be lowered because it is better to ship a drug that is confidently effective.
Confidence Interval
When we make an estimate in statistics, whether it is a summary statistic or a test statistic, there is
always uncertainty around that estimate because the number is based on a sample of the population we
are studying. A confidence interval is the mean of our estimate plus and minus the variation in that
estimate. This is the range of values we expect our estimate to fall between if we experiment again or
re-sample the population in the same way.
The confidence level is the percentage of times we expect to reproduce an estimate between the upper
and lower bounds of the confidence interval. For example, if we construct a confidence interval with a
95% confidence level, you are confident that 95 out of 100 times the estimate will fall between the upper
and lower values specified by the confidence interval. Our desired confidence level is usually one minus
the alpha (α) value we used in our statistical test:
Confidence level = 1 − α
So if we use an alpha value of p < 0.05 for statistical significance, then our confidence level would
be 1 − 0.05 = 0.95, or 95%.
Calculate the standard deviation as: SD = √( ∑(xᵢ − x̄)² / (n − 1) ), where the sum runs over i = 1 to n.
Find the standard error: The standard error of the mean is the deviation of the sample mean from the population mean. It is defined using the following formula: SE = SD / √n
Finally, find the confidence interval as: Upper/Lower limit = x̄ ± z × SE, where z is the z-score of the given confidence level.
Note: The z-scores of commonly used confidence levels are approximately 1.645 for 90%, 1.960 for 95%, and 2.576 for 99%.
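A minimal sketch of these steps in Python (the sample values are hypothetical, and 1.96 is taken as the z-score for a 95% confidence level):
import numpy as np

sample = np.array([26, 33, 65, 28, 34, 55, 25, 44, 50, 36])  # hypothetical sample
mean = sample.mean()
sd = sample.std(ddof=1)         # sample standard deviation (n - 1 in the denominator)
se = sd / np.sqrt(len(sample))  # standard error of the mean
z = 1.96                        # z-score for a 95% confidence level
lower, upper = mean - z * se, mean + z * se
print(f"95% confidence interval: ({lower:.2f}, {upper:.2f})")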
Correlation
Correlation is a statistical measure that expresses the extent to which two variables are linearly related
(meaning they change together at a constant rate). It’s a common tool for describing simple relationships
without making a statement about cause and effect. The sample correlation coefficient, r, quantifies the
strength and direction of the relationship. Correlation coefficient quite close to 0, but either positive or
negative, implies little or no relationship between the two variables. A correlation coefficient close to
plus 1 means a positive relationship between the two variables, with increases in one of the variables
being associated with increases in the other variable. A correlation coefficient close to −1 indicates a
negative relationship between two variables, with an increase in one of the variables being associated
with a decrease in the other variable. The most common formula is the Pearson correlation coefficient, used for linear dependency between the data sets, and is given as below.
r = (n∑xy − ∑x∑y) / ( √(n∑x² − (∑x)²) × √(n∑y² − (∑y)²) )
Numerical Example
Calculate the coefficient of correlation for the following two data sets: x = (41, 19, 23, 40, 55, 57, 33)
and y = (94, 60, 74, 71, 82, 76, 61).
∑x = 41 + 19 + 23 + 40 + 55 + 57 + 33 = 268
∑y = 94 + 60 + 74 + 71 + 82 + 76 + 61 = 518
∑xy = 20,391, ∑x² = 11,534, ∑y² = 39,174
r = (7 × 20,391 − 268 × 518) / ( √(7 × 11,534 − 268²) × √(7 × 39,174 − 518²) ) = 0.54
Example
from scipy import stats
import numpy as np
x = np.array([2, 4, 3, 9, 7, 6, 5])
y = np.array([5, 7, 7, 18, 15, 11, 10])
r = stats.pearsonr(x, y)  # computes the Pearson correlation coefficient
print("Result:", r)
print("Correlation Coefficient:",r[0])
T-test
A t-test is a statistical test that is used to compare the means of two groups. It is often used in
hypothesis testing to determine whether two groups are different from one another. A t-test can
only be used when comparing the means of two groups. If we want to compare more than two groups, we use an ANOVA test. When choosing a t-test, we will need to consider two things: whether
the groups being compared come from a single population or two different populations, and
whether we want to test the difference in a specific direction.
If the groups come from a single population perform a paired sample t test.
If the groups come from two different populations perform a two-sample t-test.
If there is one group being compared against a standard value, perform a one-sample t test.
A paired sample t-test is used when the two sets of scores come from the same people. The formula used to obtain the t-value is:
t = d̄ / (s / √n)
Where d is the difference between paired samples, d̄ is the mean of d, s is the standard deviation of the differences, and n is the sample size.
Example
An instructor gives two exams to the students. Scores of both exams are given in the table below. He/she wants to know if the exams are equally difficult.
Student Exam1 Score(x) Exam2 Score(y)
S1 63 69
S2 65 65
S3 56 62
S4 100 91
S5 88 78
S6 83 87
S7 77 79
S8 92 88
S9 90 85
S10 84 92
S11 68 69
S12 74 81
S13 87 84
S14 64 75
S15 71 84
S16 88 82
Solution
d̄ = 1.31
Now,
s = √( ∑(x − y)² / (n − 1) ) = 7.13
Now,
t = d̄ / (s / √n) = 1.31 / (7.13 / √16) = 0.74
Let’s assume, Significance level (α) = 0.05
Degree of freedom (df)= n-1=15
The tabulated t-value with α = 0.05 and 15 degrees of freedom is 2.131.
Because 0.74 < 2.131, we accept the null hypothesis. This means the mean scores of the two exams are similar, i.e., the exams are equally difficult.
In Python, we can perform a paired t-test using the scipy.stats.ttest_rel() function. It performs the t-test on two related samples of scores.
Example
from scipy import stats
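# A sketch of the remaining code for this example, using the exam scores from the table above:
exam1 = [63, 65, 56, 100, 88, 83, 77, 92, 90, 84, 68, 74, 87, 64, 71, 88]
exam2 = [69, 65, 62, 91, 78, 87, 79, 88, 85, 92, 69, 81, 84, 75, 84, 82]
t, p = stats.ttest_rel(exam1, exam2)  # paired t-test on two related samples
print("t-statistic:", t)
print("p-value:", p)
if p > 0.05:
    print("Null Hypothesis is accepted. This means there is no difference between mean scores of two exams")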
Output
t-statistic: -0.7497768853141169
p-value: 0.4649871003972206
Null Hypothesis is accepted. This means there is no difference between the mean scores of the two exams.
One-Sample T-test
The one-sample t-test examines whether the mean of a sample is statistically different from a known or hypothesized population mean. It is calculated as below.
t = (x̄ − μ) / (s / √n)
Numerical Example
Imagine a company wants to test the claim that their batteries last more than 40 hours. Using a
simple random sample of 15 batteries yielded a mean of 44.9 hours, with a standard deviation of
8.9 hours. Test this claim using a significance level of 0.05.
Solution
t = (44.9 − 40) / (8.9 / √15) = 2.13
Given, significance level (α) = 0.05
Degree of freedom (df) = n − 1 = 14
The tabulated t-value with α = 0.05 and 14 degrees of freedom is 1.761.
Because 2.13 > 1.761, we reject the null hypothesis and conclude that batteries last more than 40
hours.
In Python, we can perform a one-sample t-test using the scipy.stats.ttest_1samp() function.
Example
from scipy import stats
battry_hour = [40, 50, 55, 38, 48, 62, 44, 52, 46, 44, 37, 42, 46, 38, 45]
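# A sketch of the remaining code; alternative='greater' tests the one-sided claim
# that mean battery life exceeds 40 hours (this keyword requires SciPy >= 1.6)
t, p = stats.ttest_1samp(battry_hour, popmean=40, alternative='greater')
print("t-statistic:", t)
print("p-value:", p)
if p < 0.05:
    print("Null hypothesis is rejected: batteries last more than 40 hours")
else:
    print("Null hypothesis is accepted")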
Two-Sample T-test
The two-sample t-test (also known as the independent samples t-test) is a method used to test
whether the unknown population means of two groups are equal or not. We can use the test when
our data values are independent, are randomly sampled from two normal populations, and the two independent groups have approximately equal variances. It is carried out as below.
t = (x̄₁ − x̄₂) / ( sₚ × √(1/n₁ + 1/n₂) )
where x̄₁ and x̄₂ are the sample means, n₁ and n₂ are the sample sizes, and sₚ is the pooled standard deviation, calculated as below.
sₚ = √( ((n₁ − 1)s₁² + (n₂ − 1)s₂²) / (n₁ + n₂ − 2) )
Example
Our sample data is from a group of men and women who did workouts at a gym three times a week for
a year. Then, their trainer measured the body fat. The table below shows the data.
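A minimal sketch of how such a comparison could be carried out with scipy.stats.ttest_ind(); the body-fat values below are hypothetical, not the ones from the trainer's table:
from scipy import stats

men = [13.3, 6.0, 20.0, 8.0, 14.0, 19.0, 18.0, 25.0, 16.0, 24.0]      # hypothetical values
women = [22.0, 16.0, 21.7, 21.0, 30.0, 26.0, 12.0, 23.2, 28.0, 23.0]  # hypothetical values
t, p = stats.ttest_ind(men, women)  # two-sample (independent) t-test
print("t-statistic:", t)
print("p-value:", p)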
T-test vs Z-Test
The difference between a t-test and a z-test hinges on the differences in their respective test statistics. The z-statistic is computed as
z = (x̄ − μ) / (σ / √n)
Where x̄ is a measured sample mean, μ is the hypothesized population mean, σ is the population standard deviation, and n is the sample size.
Notice that this distribution uses a known population standard deviation for a data set to
approximate the population mean. However, the population standard deviation is not always
known, and the sample standard deviation, s, is not always a good approximation. In these
instances, it is better to use the T-test.
The t-distribution looks a lot like the standard normal distribution. In fact, the larger a sample is, the more it looks like the standard normal distribution; at sample sizes larger than 30, they are very, very similar. Like the standard normal distribution, the t-distribution is defined as having a mean μ = 0, but its standard deviation, and thus the width of its graph, varies according to the sample size of the data set used for the hypothesis test. The t-statistic is calculated using the following formula:
t = (x̄ − μ) / (s / √n)
The standard normal or z-distribution assumes that you know the population standard deviation. The t-distribution is based on the sample standard deviation. The t-distribution is similar to a normal distribution. The useful properties of the t-distribution are: it has a mean of 0, it is symmetric and bell-shaped, it has heavier tails than the normal distribution, and it approaches the standard normal distribution as the sample size (and hence the degrees of freedom) increases.
The table below presents the key differences between the two statistical methods, Z-test and T-test.
Z-Test: Used for large sample sizes (n ≥ 30).
T-Test: Used for small to moderate sample sizes (n < 30).
Z-Test: Performing this test requires knowledge of the population standard deviation (σ); it does not involve the sample standard deviation.
T-Test: Performed when the population standard deviation is unknown; it involves the sample standard deviation (s).
Z-Test: Assumes a standard normal distribution.
T-Test: Assumes a t-distribution, which varies with degrees of freedom.
Chi-square Distribution
If we repeatedly take samples and compute the chi-square statistic, then we can form a chi-square
distribution. A chi-square (Χ2) distribution is a continuous probability distribution that is used in many
hypothesis tests. The shape of a chi-square distribution is determined by the parameter k, which
represents the degrees of freedom. The graph below shows examples of chi-square distributions
with different values of k.
There are two main types of Chi-Square tests namely: Chi-Square for the Goodness-of-Fit
and Chi-Square for the test of Independence.
Chi-Square for the Goodness-of-Fit
A chi-square test is a statistical test that is used to compare observed and expected results. The goal of
this test is to identify whether a disparity between actual and predicted data is due to chance or to a
link between the variables under consideration. As a result, the chi-square test is an ideal choice for
aiding in our understanding and interpretation of the connection between our two categorical variables.
Pearson's chi-square test was the first chi-square test to be discovered and is the most widely used. Pearson's chi-square test statistic is given as below.
χ² = ∑ (Oᵢ − Eᵢ)² / Eᵢ
Where Oᵢ is the observed frequency and Eᵢ is the expected frequency of category i.
The null hypothesis in the chi-square test is that the observed value is similar to the expected value. The chi-square test can be performed using the chisquare function in the SciPy package. The function gives the chi-square value and p-value as output. By looking at the p-value we can reject or accept the null hypothesis.
Example
from scipy import stats
import numpy as np
expected = np.array([6, 6, 6, 6, 6, 6])
observed = np.array([7, 5, 3, 9, 6, 6])
cp = stats.chisquare(observed, expected)
print(cp)
Output: P-value=0.65
Conclusion: Since the p-value > 0.05, the null hypothesis is accepted. Thus, we conclude that the observed values of the dice are the same as the expected values.
Chi-Square for the Test of Independence
Example: Test whether empathy level is independent of gender using the contingency table below.
Gender    Empathy High    Empathy Low
Male      180             120
Female    140             60
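A sketch of this test of independence using scipy.stats.chi2_contingency() on the table above:
from scipy import stats
import numpy as np

observed = np.array([[180, 120],
                     [140, 60]])  # rows: Male, Female; columns: High, Low empathy
chi2, p, dof, expected = stats.chi2_contingency(observed)
print("P-value =", p)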
Output: P-value = 0.029
Conclusion: Since the p-value is less than 0.05, the null hypothesis is rejected. Thus, we conclude that empathy is related to gender.
ANOVA
ANOVA stands for Analysis of Variance. It is a statistical method used to analyze the differences
between the means of two or more groups. It is often used to determine whether there are any
statistically significant differences between the means of different groups. ANOVA compares the
variation between group means to the variation within the groups. If the variation between group means
is significantly larger than the variation within groups, it suggests a significant difference between the
means of the groups.
ANOVA calculates an F-statistic by comparing between-group variability to within-group variability. If
the F-statistic exceeds a critical value, it indicates significant differences between group means. Types of
ANOVA include one-way (for comparing means of groups) and two-way (for examining effects of two
independent variables on a dependent variable). To perform the one-way ANOVA, we can use the
f_oneway() function of the SciPy package.
Example
Suppose we want to know whether or not three different exam prep programs lead to different mean
scores on a certain exam. To test this, we recruit 30 students to participate in a study and split them into
three groups. The students in each group are randomly assigned to use one of the three exam prep
programs for the next three weeks to prepare for an exam. At the end of the three weeks, all of the
students take the same exam. The exam scores for each group are shown below.
Python program to solve above problem
from scipy import stats
import numpy as np
sg1 = [85, 86, 88, 75, 78, 94, 98, 79, 71, 80]
sg2 = [91, 92, 93, 85, 87, 84, 82, 88, 95, 96]
sg3 = [79, 78, 88, 94, 92, 85, 83, 85, 82, 81]
r = stats.f_oneway(sg1, sg2, sg3)
print("P-value=", r.pvalue)
There are many properties of a line that can be set, such as the color, dashes, etc. There are essentially three ways of doing this: using keyword arguments, using setter methods, and using the setp() command.
Using Keyword Arguments
Keyword arguments (or named arguments) are values that, when passed into a function, are identified by
specific parameter names. These arguments can be sent using key = value syntax. We can use keyword
arguments to change default value of properties of line charts as below. Major keyword arguments
supported by plot() methods are: linewidth, color, linestyle, label, alpha, etc.
import matplotlib.pyplot as plt
x=[1,2,3,4,5,6,7]
y=[3,5,7,9,11,13,15]
plt.plot(x,y, linewidth=4, linestyle="--", color="red", label="y=2x+1")
plt.xlabel("x")
plt.ylabel("y") plt.title("Line Chart
Example") plt.legend(loc='upper center')
plt.show()
#Generate data for the curve y = 2x^2 and plot it
import numpy as np
x = np.arange(-20, 21, 1)
y = 2 * x**2
plt.plot(x, y)
plt.show()
We can use the text() method to display text over the columns in a bar chart, so that we can place text at a specific location on each bar.
Example
import matplotlib.pyplot as plt
x = ['A', 'B', 'C', 'D', 'E']
y = [1, 3, 2, 5, 4]
percentage = [10, 30, 20, 50, 40]
plt.figure(figsize=(3, 4))
plt.bar(x, y)
for i in range(len(x)):
    plt.text(x[i], y[i], percentage[i])
plt.show()
The annotate() function in pyplot module of matplotlib library is used to annotate the point xy with
specified text. In order to add text annotations to a matplotlib chart we need to set at least, the text, the
coordinates of the plot to be highlighted with an arrow (xy), the coordinates of the text (xytext) and
the properties of the arrow (arrowprops).
Example
import numpy as np
import matplotlib.pyplot as plt
x = np.arange(0, 10, 0.25)
y = np.sin(x)
plt.plot(x, y)
plt.annotate('Minimum', xy=(4.75, -1), xytext=(4.75, 0.2),
             arrowprops=dict(facecolor='black', width=0.2),
             horizontalalignment='center')
plt.show()
Styling Plots
Matplotlib provides a number of predefined styles that change the overall appearance of plots. These options can be accessed by executing the command plt.style.available. This gives a list of all the available stylesheet option names that can be used as an attribute inside plt.style.use().
Example
import matplotlib.pyplot as plt
ls = plt.style.available
print("Number of Styles:", len(ls))
print("List of Styles:", ls)
ggplot is a popular data visualization package in R programming. It stands for “Grammar of
Graphics plot”. To apply ggplot styling to a plot created in Matplotlib, we can use the following
syntax:
plt.style.use('ggplot')
This style adds a light grey background with white gridlines and uses slightly larger axis tick labels.
The statement plt.style.use(‘ggplot’) can be used to apply ggplot styling to any plot in Matplotlib.
Example
from scipy import stats
import matplotlib.pyplot as plt
dist=stats.norm(loc=150,scale=20)
data=dist.rvs(size=1000)
plt.style.use('ggplot')
plt.hist(data,bins=100,color='blue')
plt.show()
The FiveThirtyEight Style is another way of styling plots in matplotlib.pyplot. It is based on the
popular American blog FiveThirtyEight which provides economic, sports, and political analysis.
The FiveThirtyEight stylesheet in Matplotlib has gridlines on the plot area with bold x and y ticks.
The colors of the bars in Bar plot or Lines in the Line chart are usually bright and distinguishable.
The syntax for using this style is plt.style.use("fivethirtyeight"), as shown in the example below.
Example
import matplotlib.pyplot as plt
plt.style.use("fivethirtyeight")
a = [2, 3, 4, 3, 4, 5, 3]
b = [4, 5, 5, 7, 9, 8, 6]
plt.figure(figsize = (4,3))
plt.plot(a, marker='o',linewidth=1,color='blue')
plt.plot(b, marker='v',linewidth=1,color='red')
plt.show()
The dark_background stylesheet is third popular style that is based on the dark mode. Applying this
stylesheet makes the plot background black and ticks color to white, in contrast. In the foreground,
the bars and/or Lines are grey based colors to increase the aesthetics and readability of the plot.
Example
import matplotlib.pyplot as plt
plt.style.use("dark_background")
a = [1, 2, 3, 4, 5, 6, 7]
b = [1, 4, 9, 16, 25, 36, 49]
plt.figure(figsize = (4,3))
plt.plot(a, marker='o',linewidth=1,color='blue')
plt.plot(b, marker='v',linewidth=1,color='red')
plt.show()
Box Plots
A box plot is a way to visualize the distribution of data by using a box and some vertical lines. It is also known as the box-and-whisker plot. The data is summarized by five key values, which are as follows:
Minimum: Q1-1.5*IQR
1st quartile (Q1): 25th percentile
Median:50th percentile
3rd quartile(Q3):75th percentile
Maximum: Q3+1.5*IQR
Here IQR represents the InterQuartile Range which starts from the first quartile (Q1) and ends at the
third quartile (Q3). Thus, IQR=Q3-Q1.
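A small sketch of computing these five values for a dataset (the data values are hypothetical):
import numpy as np

data = np.array([7, 15, 36, 39, 40, 41, 42, 43, 47, 49])
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower_whisker = q1 - 1.5 * iqr  # the "minimum" of the box plot
upper_whisker = q3 + 1.5 * iqr  # the "maximum" of the box plot
print("Q1 =", q1, "Median =", np.median(data), "Q3 =", q3)
print("IQR =", iqr, "Whisker limits:", lower_whisker, upper_whisker)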
In the box plot, those points which are out of range are called outliers. We can create a box plot of the data to determine the following:
The number of outliers in a dataset
Whether the data is skewed or not
The range of the data
The range of the data from minimum to maximum is called the whisker limit. In Python, we will use matplotlib's pyplot module, which has an inbuilt function named boxplot() that can create the box plot of any data set. Multiple boxes can be created just by sending a list of datasets to the boxplot() method.
Example
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
dist = stats.norm(100, 30)
data = dist.rvs(size=500)
plt.figure(figsize=(6, 4))
plt.boxplot(data)
plt.show()
Example 2
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
dist = stats.norm(100, 50)
data1 = dist.rvs(size=500)
data2 = dist.rvs(size=500)
data3 = dist.rvs(size=500)
plt.figure(figsize=(6, 4))
plt.boxplot([data1, data2, data3])
plt.show()
Horizontal box plots can be created by setting vert=0 while creating box plots. Boxes in the plot can be filled by setting patch_artist=True. The boxplot() function returns a Python dictionary with keys such as boxes, whiskers, fliers, caps, and medians. We can change the properties of these dictionary objects by calling their set() method.
Example
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
dist = stats.norm(100, 50)
data = dist.rvs(size=500)
plt.figure(figsize=(6, 4))
bp = plt.boxplot(data, vert=0, patch_artist=True)
for b in bp['boxes']:
    b.set(color='blue', facecolor='cyan', linewidth=2)
for w in bp['whiskers']:
    w.set(linestyle='--', linewidth=1, color='green')
for f in bp['fliers']:
    f.set(marker='D', color='black', alpha=1)
for m in bp['medians']:
    m.set(color='yellow', linewidth=2)
for c in bp['caps']:
    c.set(color='red')
plt.show()
Heatmaps
A heatmap (or heat map) is a graphical representation of data where values are depicted by color. A
simple heat map provides an immediate visual summary of information across two axes, allowing
users to quickly grasp the most important or relevant data points. More elaborate heat maps allow the
viewer to understand complex data sets. All heat maps share one thing in common -- they use
different colors or different shades of the same color to represent different values and to communicate
the relationships that may exist between the variables plotted on the x-axis and y-axis. Usually, a
darker color or shade represents a higher or greater quantity of the value being represented in the heat
map. For instance, a heat map showing the rain distribution (range of values) of a city grouped by
month may use varying shades of red, yellow and blue. The months may be mapped on the y axis and
the rain ranges on the x axis. The lightest color (i.e., blue) would represent the lower rainfall. In
contrast, yellow and red would represent increasing rainfall values, with red indicating the highest
values.
When using matplotlib we can create a heat map with the imshow() function. In order to create a
default heat map you just need to input an array of m×n dimensions, where the first dimension
defines the rows and the second the columns of the heat map. We can choose different colors for
Heatmap using the cmap parameter. Cmap is colormap instance or registered color map name.
Some of the possible values of cmap are: ‘pink’, ‘spring’, ‘summer’, ‘autumn’, ‘winter’, ‘cool’,
‘Wistia’, ‘hot’, ‘copper’ etc.
Example
import numpy as np
import matplotlib.pyplot as plt
data = np.random.random((12, 12))
plt.imshow(data, cmap='autumn')
plt.title("2-D Heat Map")
plt.show()
Heat maps usually provide a legend named color bar for better interpretation of the colors of the
cells. We can add a colorbar to the heatmap using plt.colorbar(). We can also add the ticks and labels
for our heatmap using xticks() and yticks() methods.
Example
import numpy as np
import matplotlib.pyplot as plt
teams = ["A", "B", "C", "D","E", "F", "G"]
year= ["2022", "2021", "2020", "2019", "2018", "2017", "2016"]
games_won = np.array([[82, 63, 83, 92, 70, 45, 64],
[86, 48, 72, 67, 46, 42, 71],
[76, 89, 45, 43, 51, 38, 53],
[54, 56, 78, 76, 72, 80, 65],
[67, 49, 91, 56, 68, 40, 87],
[45, 70, 53, 86, 59, 63, 97],
[97, 67, 62, 90, 67, 78, 39]])
plt.figure(figsize = (4,4))
plt.imshow(games_won,cmap='spring')
plt.colorbar()
plt.xticks(np.arange(len(teams)), labels=teams)
plt.yticks(np.arange(len(year)), labels=year)
plt.title("Games Won By Teams")
plt.show()
We can also use a heatmap to plot the correlation between the columns of a dataset. We use correlation to find the relation between the columns of the dataset.
Example
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
df=pd.DataFrame({"x":[2,3,4,5,6],"y":[5,8,9,13,15],"z":[0,4,5,6,7]})
corr=df.corr(method='pearson')
plt.figure(figsize = (4,4))
plt.imshow(corr,cmap='spring')
plt.colorbar()
plt.xticks(np.arange(len(df.columns)), labels=df.columns,rotation=65)
plt.yticks(np.arange(len(df.columns)), labels=df.columns)
plt.show()
Unsupervised Learning
In unsupervised learning, the ML algorithm is provided with a dataset without desired outputs. The ML algorithm then attempts to find patterns in the data by extracting useful features and analyzing its structure. Unsupervised learning algorithms are widely used for tasks like clustering, dimensionality reduction, association mining, etc. The K-Means algorithm, K-Medoids algorithm, agglomerative algorithm, etc., are examples of clustering algorithms.
Reinforcement Learning
In reinforcement learning, we do not provide the machine with examples of correct input-output pairs, but we do provide a method for the machine to quantify its performance in the form of a reward signal. Reinforcement learning methods resemble how humans and animals learn: the machine tries a bunch of different things and is rewarded with a performance signal. Reinforcement learning algorithms are widely used for training agents that interact with their environments.
Linear Regression
Regression analysis is the process of curve fitting, in which the relationship between the independent variables and the dependent variable is modeled as an mth-degree polynomial. Polynomial regression models are usually fit with the method of least squares. If we assume that the relationship is linear and there is only one independent variable, then we can use the linear equation given below.
y = f(x) = w₀ + w₁x
In the above equation, y is the dependent variable and x is the independent variable; w₀ and w₁ are coefficients that need to be determined through training of the model. If we have two independent variables, the equation becomes y = w₀ + w₁x₁ + w₂x₂.
We use linear_model.LinearRegression() to create a linear regression object. We then use the fit()
method to train the linear regression model. This method takes dependent and independent variables as
input. Finally, we predict values of dependent variable by providing values of independent variable as
input.
Example
import numpy as np
from sklearn.linear_model import LinearRegression
x = np.array([2, 3, 5, 8, 9, 11, 15, 12, 19, 17])
y = np.array([5, 7, 11, 17, 19, 23, 31, 25, 39, 35])
test = np.array([4, 10, 13])
# Reshaping data in column vector form
x = x.reshape((len(x), 1))
test = test.reshape((len(test), 1))
lr = LinearRegression()
lr.fit(x, y)
pred = lr.predict(test)
print("Test Data:", *test)
print("Predicted Values:", pred)
Logistic Regression
Logistic regression is one of the most popular machine learning algorithms for binary classification.
This is because it is a simple algorithm that performs very well on a wide range of problems. We want to predict a variable ŷ ∈ {0, 1}, where 0 is called the negative class, while 1 is called the positive class. Such a task is known as binary classification. The heart of the logistic regression technique is the logistic function, defined as given in equation (1). The logistic function transforms the input into the range [0, 1]. Large negative numbers result in values close to zero and large positive numbers result in values close to one.
f(x) = 1 / (1 + e^(−x))    (1)
If there are two input variables, logistic regression has two coefficients just like linear regression.
y = w₀ + w₁x₁ + w₂x₂
Unlike linear regression, the output is transformed into a probability using the logistic function.
ŷ = 1 / (1 + e^(−y))
If the probability is > 0.5 we can take the output as a prediction for the class 1, otherwise the prediction
is for the class 0. The job of the learning algorithm will be to discover the best values for the
coefficients (w0, w1, and w2) based on the training data.
We use linear_model.LogisticRegression() to create a logistic regression object. We then use the fit()
method to train the logistic regression model. This method takes target and independent variables as
input. Finally, we predict values of target variable by providing values of independent variable as input.
Example
import numpy as np
from sklearn.linear_model import LogisticRegression
x = np.array([3.78, 2.44, 2.09, 0.14, 1.72, 1.65, 4.92, 4.37, 4.96, 4.52, 3.69, 5.88])
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
test = np.array([2.93, 1.86, 5.24, 6.32])
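# A sketch of the remaining steps (reshaping to column vectors, fitting, predicting):
x = x.reshape((len(x), 1))
test = test.reshape((len(test), 1))
lr = LogisticRegression()
lr.fit(x, y)
pred = lr.predict(test)        # predicted class labels (0 or 1)
prob = lr.predict_proba(test)  # predicted probabilities of each class
print("Test Data:", *test)
print("Predicted Classes:", pred)
print("Predicted Probabilities:", prob)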
The example given below creates a Naïve Bayes classifier model using the above training data and then predicts the class label of the tuple X = (age = youth, income = medium, student = yes, credit_rating = fair) using the model.
Example
from sklearn.naive_bayes import GaussianNB
from sklearn import preprocessing
import pandas as pd
Age = ['Youth','Youth','Middle_Aged','Senior','Senior','Senior','Middle_Aged','Youth','Youth','Senior','Youth','Middle_Aged','Middle_Aged','Senior','Youth']
Income = ['High','High','High','Medium','Low','Low','Low','Medium','Low','Medium','Medium','Medium','High','Medium','Medium']
Student = ['No','No','No','No','Yes','Yes','Yes','No','Yes','Yes','Yes','No','Yes','No','Yes']
Credit_Rating = ['Fair','Excellent','Fair','Fair','Fair','Excellent','Excellent','Fair','Fair','Fair','Excellent','Excellent','Fair','Excellent','Fair']
Buys = ['No','No','Yes','Yes','Yes','No','Yes','No','Yes','Yes','Yes','Yes','Yes','No','?']
le = preprocessing.LabelEncoder()
a = list(le.fit_transform(Age))
i = list(le.fit_transform(Income))
s = list(le.fit_transform(Student))
cr = list(le.fit_transform(Credit_Rating))
b = list(le.fit_transform(Buys))
d = {'Age': a, 'Income': i, 'Student': s, 'Credit_Rating': cr, 'Buys_Computer': b}
df = pd.DataFrame(d)
print(df)
x = df[['Age', 'Income', 'Student', 'Credit_Rating']]
y = df['Buys_Computer']
trainx = x[0:14]
trainy = y[0:14]
testx = x[14:15]
model = GaussianNB()
model.fit(trainx, trainy)
predicted = model.predict(testx)
if predicted == 1:
    pred = 'No'
else:
    pred = 'Yes'
print("Predicted Value:", pred)
Once the decision tree is learned, in order to make a prediction for a tuple, the attributes of the tuple are tested against the decision tree. A path is traced from the root to a leaf node, which determines the predicted class for that tuple. Constructing a decision tree uses a greedy algorithm. The tree is constructed in a top-down, divide-and-conquer manner. A high-level algorithm for decision tree construction is presented below.
1. At start, all the training tuples are at the root
2. Tuples are partitioned recursively based on selected attributes
3. If all samples for a given node belong to the same class
• Label the class
4. Else if there are no remaining attributes for further partitioning
• Majority voting is employed for assigning class label to the leaf
5. Else
• Go to step 2
There are many variations of decision-tree algorithms. Some of them are: ID3 (Iterative Dichotomiser
3), C4.5 (successor of ID3), CART (Classification and Regression Tree) etc. There are different attribute
selection measures used by decision tree classifiers. Some of them are: Information Gain, Gain Ratio,
Gini Index, etc. ID3 stands for Iterative Dichotomiser 3. It uses a top-down greedy approach to build the decision tree model. This algorithm computes the information gain for each attribute and then selects the attribute with the highest information gain. Information gain measures the reduction in entropy after a data transformation. It is calculated by comparing the entropy of the dataset before and after the transformation. Entropy is a measure of the homogeneity of the sample. The entropy, or expected information, of dataset D is calculated using Equation (1) given below.
E(D) = −∑ pᵢ log₂(pᵢ), where the sum runs over i = 1 to m    (1)
Where pᵢ is the probability of a tuple in D belonging to class Cᵢ and is estimated using Equation (2).
pᵢ = |Cᵢ,D| / |D|    (2)
Where |Cᵢ,D| is the number of tuples in D belonging to class Cᵢ and |D| is the total number of tuples in D.
Suppose we have to partition the tuples in D on some attribute A having v distinct values. The attribute A can be used to split D into v partitions {D1, D2, …, Dv}. Now, the total entropy of the data partitions while partitioning D around attribute A is calculated using Equation (3).
E_A(D) = ∑ (|Dⱼ| / |D|) × E(Dⱼ), where the sum runs over j = 1 to v    (3)
Finally, the information gain achieved after partitioning D on attribute A is calculated using Equation (4).
IG(A) = E(D) − E_A(D)    (4)
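As an illustration of Equations (1)-(4), a small sketch that computes the entropy of a dataset and the information gain of a binary split; the class counts used here are hypothetical.
import numpy as np

def entropy(counts):
    # Entropy E(D) from class counts, as in Equation (1)
    p = np.array(counts) / sum(counts)
    p = p[p > 0]  # avoid log2(0)
    return -np.sum(p * np.log2(p))

# Hypothetical dataset D: 9 tuples of one class and 5 of the other
E_D = entropy([9, 5])
# Splitting on a hypothetical attribute A with two values gives partitions with these class counts
partitions = [[6, 2], [3, 3]]
n = sum(sum(part) for part in partitions)
E_A = sum(sum(part) / n * entropy(part) for part in partitions)  # Equation (3)
IG = E_D - E_A                                                   # Equation (4)
print("E(D) =", round(E_D, 3), "E_A(D) =", round(E_A, 3), "IG(A) =", round(IG, 3))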
In Python, we create an instance of the DecisionTreeClassifier class in the sklearn.tree module. The instance can be trained using a training dataset to learn a predictive model in the form of a decision tree structure. The model can then be used to predict class labels for new data tuples. We have to supply the parameter value criterion = "entropy" to create an instance of the ID3 decision tree.
Example
Use the dataset given below to train ID3 decision tree classifier and predict class label for the input
tuple {Outlook=Sunny, Temperature=Hot, Humidity=Normal, Windy=Strong}.
from sklearn.tree import DecisionTreeClassifier
from sklearn import preprocessing
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn import tree
outlook = ['Sunny','Sunny','Overcast','Rainy','Rainy','Rainy','Overcast','Sunny','Sunny','Rainy','Sunny','Overcast','Overcast','Rainy','Sunny']
temp = ['Hot','Hot','Hot','Mild','Cool','Cool','Cool','Mild','Cool','Mild','Mild','Mild','Hot','Mild','Hot']
humidity = ['High','High','High','High','Normal','Normal','Normal','High','Normal','Normal','Normal','High','Normal','High','Normal']
wind = ['Weak','Strong','Weak','Weak','Weak','Strong','Strong','Weak','Weak','Weak','Strong','Strong','Weak','Strong','Strong']
play = ['No','No','Yes','Yes','Yes','No','Yes','No','Yes','Yes','Yes','Yes','Yes','No','?']
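# A minimal sketch of the (assumed) encoding and train/test preparation steps,
# label-encoding each column as in the Naive Bayes example:
le = preprocessing.LabelEncoder()
d = {'Outlook': le.fit_transform(outlook),
     'Temp': le.fit_transform(temp),
     'Humidity': le.fit_transform(humidity),
     'Wind': le.fit_transform(wind)}
x = pd.DataFrame(d)
Le = LabelEncoder()                # separate encoder kept to decode the prediction later
y = Le.fit_transform(play)
trainx, trainy = x[0:14], y[0:14]  # first 14 tuples form the training set
testx = x[14:15]                   # the 15th tuple is the one to classify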
dt = DecisionTreeClassifier(criterion='entropy')
dt.fit(trainx, trainy)
p = dt.predict(testx)
p = Le.inverse_transform(p)
print("Predicted Label:", p)
Unit 4 Unsupervised Learning
Clustering
The process of grouping a set of physical or abstract objects into classes of similar objects is called
clustering. It is an unsupervised learning technique. A cluster is a collection of data objects that are
similar to one another within the same cluster and are dissimilar to the objects in other clusters.
Clustering can also be used for outlier detection, where outliers may be more interesting than common
cases. Applications of outlier detection include the detection of credit card fraud and the monitoring of
criminal activities in electronic commerce. For example, exceptional cases in credit card transactions,
such as very expensive and frequent purchases, may be of interest as possible fraudulent activity.
Categories of Clustering Algorithms
Many clustering algorithms exist in the literature. In general, the major clustering methods can be
classified into the following categories.
1. Partitioning methods: Given a database of n objects or data tuples, a partitioning method
constructs k partitions of the data, where each partition represents a cluster and k <n. Given
k, the number of partitions to construct, a partitioning method creates an initial partitioning. It
then uses an iterative relocation technique that attempts to improve the partitioning by moving
objects from one group to another.
2. Hierarchical methods: A hierarchical method creates a hierarchical decomposition of the
given set of data objects. A hierarchical method can be classified as being either
agglomerative or divisive. The agglomerative approach follows the bottom-up approach. It
starts with each object forming a separate group. It successively merges the objects or groups
that are close to one another, until a termination condition holds. The divisive approach follows
the top-down approach. It starts with all of the objects in the same cluster. In each successive
iteration, a cluster is split up into smaller clusters, until a termination condition holds.
3. Density-based methods: Most partitioning methods cluster objects based on the distance
between objects. Such methods can find only spherical-shaped clusters and encounter difficulty
at discovering clusters of arbitrary shapes. Other clustering methods have been developed
based on the notion of density. Their general idea is to continue growing the given cluster as
long as the density (number of objects or data points) in the neighborhood exceeds some
threshold.
4. Model-based methods: Model-based methods hypothesize a model for each of the clusters and
find the best fit of the data to the given model. EM is an algorithm that performs
expectation-maximization analysis based on statistical modeling.
Measures of Similarity
Distance measures are used in order to find similarity or dissimilarity between data objects. The
most popular distance measure is the Euclidean distance, which is defined as below.
d(x, y) = √((x₂ − x₁)² + (y₂ − y₁)²)
Where x = (x₁, y₁) and y = (x₂, y₂).
Another well-known metric is the Manhattan (or city block) distance, defined as below.
d(x, y) = |x₂ − x₁| + |y₂ − y₁|
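A small sketch of both distance measures, applied to two of the 2-D points used in the clustering examples below (assuming NumPy):
import numpy as np

p = np.array([2, 8])
q = np.array([3, 2])
euclidean = np.sqrt(np.sum((p - q) ** 2))  # straight-line distance
manhattan = np.sum(np.abs(p - q))          # city-block distance
print("Euclidean distance =", euclidean)
print("Manhattan distance =", manhattan)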
from sklearn.cluster import KMeans
data = [(2,8),(3,2),(1,4),(4,6),(3,5),(2,3),(5,7),(4,8),(4,2),(1,3),(4,5),(3,4),(7,4),(2,1)]
km = KMeans(n_clusters=2, init='random')
km.fit(data)
centers = km.cluster_centers_
labels = km.labels_
print("Cluster Centers:", *centers)
print("Cluster Labels:", *labels)
# Displaying clusters
cluster1 = []
cluster2 = []
for i in range(len(labels)):
    if labels[i] == 0:
        cluster1.append(data[i])
    else:
        cluster2.append(data[i])
print("Cluster 1:", cluster1)
print("Cluster 2:", cluster2)
Numerical Example (K-Medoids)
Divide the data points {(2,10), (2,5), (8,4), (5,8), (7,5), (6,4)} into two clusters.
Solution
Let p1=(2,10) p2=(2,5) p3=(8,4) p4=(5,8) p5=(7,5)
p6=(6,4)
Initial step
Let m1=(2,5) and m2=(6,4) are two initial cluster centers (medoid).
Iteration 1
Calculate distance between medoids and each data points d(m1,p1)=5
d(m2,p1)=10
d(m1,p2)=0 d(m2,p2)=5
d(m1,p3)=7 d(m2,p3)=2
d(m1,p4)=6 d(m2,p4)=5
d(m1,p5)=5 d(m2,p5)=2
d(m1,p6)=5 d(m2,p6)=0
Thus, Cluster1={p1,p2}, Cluster2={p3,p4,p5,p6}
Total Cost = 5+0+2+5+2+0 = 14
Iteration 2:
Swap m1 with p1, m1 =(2,10) m2=(6,4)
Calculate distance between medoids and each data points d(m1,p1)=0
d(m2,p1)=10
d(m1,p2)=5 d(m2,p2)=5
d(m1,p3)=12 d(m2,p3)=2
d(m1,p4)=5 d(m2,p4)=5
d(m1,p5)=10 d(m2,p5)=2
d(m1,p6)=10 d(m2,p6)=0
Thus, Cluster1={p1,p2,p4}, Cluster2={p3,p5,p6}
Total Cost = 0+5+2+5+2+0 = 14
Iteration 3:
Swap m1 with p3, m1 =(8,4) m2=(6,4)
Calculate distance between medoids and each data points d(m1,p1)=12
d(m2,p1)=10
d(m1,p2)=7 d(m2,p2)=5
d(m1,p3)=0 d(m2,p3)=2
d(m1,p4)=7 d(m2,p4)=5
d(m1,p5)=2 d(m2,p5)=2
d(m1,p6)=2 d(m2,p6)=0
Thus, Cluster1={p3,p5}, Cluster2={p1,p2,p4,p6}
Total Cost = 10+5+0+5+2+0 = 22 => Undo swapping
Continue this process…
In Python, the KMedoids method from the sklearn_extra.cluster module is used to create an instance of the K-Medoids algorithm. This module may not come with the default Python installation; therefore, we may need to install it. One of the major parameters of this method is n_clusters, which is used to specify the value of k (the number of clusters).
Once the instance of KMedoids is created, like KMeans, the fit() method is used to compute the clusters. This method accepts the dataset as its input argument and stores the final cluster centers in the cluster_centers_ variable and the cluster labels of the dataset in the labels_ variable.
Example
import numpy as np
import matplotlib.pyplot as plt
from sklearn_extra.cluster import KMedoids
data = [(2,8),(3,2),(1,4),(4,6),(3,5),(2,3),(5,7),(4,8),(4,2),(1,3),(4,5),(3,4),(7,4),(2,1)]
km = KMedoids(n_clusters=2)
km.fit(data)
centers = km.cluster_centers_
labels = km.labels_
print("Cluster Centers:", *centers)
print("Cluster Labels:", *labels)
# Displaying clusters
cluster1 = []
cluster2 = []
for i in range(len(labels)):
    if labels[i] == 0:
        cluster1.append(data[i])
    else:
        cluster2.append(data[i])
print("Cluster 1:", cluster1)
print("Cluster 2:", cluster2)
Hierarchical Clustering
In data mining and statistics, hierarchical clustering (also called hierarchical cluster analysis or HCA)
is a method of cluster analysis which seeks to build a hierarchy of clusters. Strategies for hierarchical
clustering generally fall into two types: agglomerative and divisive. Agglomerative clustering is a
bottom-up approach. Initially, each observation is placed in a separate cluster, and pairs of clusters
are merged as one moves up the hierarchy. This process continues until a single cluster, or the
required number of clusters, is formed. A distance matrix is used to decide which clusters to merge.
A cluster hierarchy can also be generated top-down. This variant of hierarchical clustering is called
top-down or divisive clustering. We start at the top with all data in one cluster. The cluster is split
into two clusters such that the objects in one subgroup are far from the objects in the other. This
procedure is applied recursively until the required number of clusters is formed. This method is not
considered attractive because there exist O(2^n) ways of splitting each cluster.
Example: Consider six objects A, B, C, D, E, and F together with a given distance matrix of pairwise distances.
The closest clusters are {D} and {F}, with the shortest distance of 0.5. Thus, we group D and F into
a single cluster {D, F}.
Update the distance matrix.
Now the distance between cluster {A} and cluster {B} is the minimum, with distance 0.71. Thus, we
group cluster {A} and cluster {B} into a single cluster named {A, B}.
Update the distance matrix.
Next, the distance between cluster {E} and cluster {D, F} is the minimum, with distance 1.00. Thus,
we group them together into cluster {D, E, F}.
Update the distance matrix.
After that, we merge cluster {D, E, F} and cluster {C} into a new cluster {C, D, E, F} because
cluster {D, E, F} and cluster {C} are the closest clusters, with distance 1.41.
Update the distance matrix.
Example
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering
data=[(2,8),(3,2),(1,4),(4,6),(3,5),(2,3),(5,7),(4,8),(4,2),(1,3),(4,5),(3,4),(7,4),(2,1)]
# Create an agglomerative clustering instance with 2 clusters
ac=AgglomerativeClustering(n_clusters=2)
ac.fit(data)
labels = ac.labels_
print("Cluster Labels:",*labels)
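The merge order produced by agglomerative clustering can also be visualized as a dendrogram using SciPy. This is an optional sketch that reuses the same data list as the example above; the choice of Ward linkage is an assumption, not part of the original example.
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
data=[(2,8),(3,2),(1,4),(4,6),(3,5),(2,3),(5,7),(4,8),(4,2),(1,3),(4,5),(3,4),(7,4),(2,1)]
# Compute the sequence of merges (Ward linkage, Euclidean distance)
Z=linkage(data, method='ward')
# The y-axis of the dendrogram shows the distance at which clusters are merged
dendrogram(Z)
plt.xlabel("Data point index")
plt.ylabel("Merge distance")
plt.show()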
Lower Casing
Lower casing is a common text preprocessing technique. The idea is to convert the input text into the same
casing format so that 'test', 'Test' and 'TEST' are treated the same way. This is particularly helpful for text
featurization techniques such as frequency counts and TF-IDF, as it combines the same words together, thereby
reducing duplication and giving correct counts/TF-IDF values. We can convert text to lower case simply
by calling the string's lower() method.
Example
import numpy as np
import pandas as pd
df = pd.read_csv("/content/drive/My Drive/sample.csv")
df = df[["text"]]
text=df["text"][0]
print("Original Text")
print(text)
text=text.lower()
df["text"][0]=text
print("After Converting into Lower Case")
print(text)
Removal of Punctuations
Another common text preprocessing technique is to remove the punctuation marks from the text data. This is
again a text standardization process that helps to treat 'hurray' and 'hurray!' in the same way.
We also need to choose the list of punctuation marks to exclude carefully, depending on the use case. For
example, string.punctuation in Python contains the following punctuation symbols:
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~. We can add or remove punctuation marks as per our need.
Example
import pandas as pd
import string
df = pd.read_csv("/content/drive/My Drive/sample.csv")
df = df[["text"]]
text=df["text"][2]
print("Original Data")
print(text)
ps=string.punctuation
print("Punctuation Symbols:",ps)
# Build a new string containing only the non-punctuation characters
new_text=""
for c in text:
    if c not in ps:
        new_text=new_text+c
df["text"][2]=new_text
print("After Removal of Punctuation Symbols")
print(new_text)
Removing Numbers
Sometimes words and digits are written combined in the text, which creates a problem for machines to
understand. Hence, we need to remove words and digits that are combined, such as game57 or game5ts7.
This type of word is difficult to process, so it is better to remove it or replace it with an empty string.
We can replace digits and words containing digits by using the sub() method of the re module. The syntax
of the method is given below.
re.sub(pat, replacement, str)
This function searches for the specified pattern in the given string and replaces the matches with the
specified replacement.
Example: Removing digits
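A minimal sketch is given below; the sample sentence is illustrative and is not taken from the course dataset.
import re
# Sample text containing plain digits and words combined with digits (illustrative)
text="The player scored 42 points in game57 and again in game5ts7"
print("Original Text")
print(text)
# Remove standalone digits and any word that contains a digit
text=re.sub(r"\S*\d\S*","",text)
# Collapse the extra spaces left behind by the removals
text=re.sub(r"\s+"," ",text).strip()
print("After Removal of Digits and Words Containing Digits")
print(text)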
Removing URLs
The next preprocessing step is to remove any URLs present in the text data. If we scraped text data
from the web, there is a good chance that it will contain some URLs, and we might need to remove
them for further analysis. URLs can also be removed from the text by using the sub() method of the
re module.
Example
import pandas as pd
import re
df = pd.read_csv("/content/drive/My Drive/sample.csv")
df = df[["text"]]
text=df["text"][7]
print("Original Text")
print(text)
toks=text.split()
new_toks=[]
for t in toks:
    # \S represents any character except white space characters
    t=re.sub(r"https?://\S+|www\.\S+","",t)
    new_toks.append(t)
text=" ".join(new_toks)
df["text"][7]=text
print("After Removal of URLs")
print(text)
Removal of HTML Tags
Another common preprocessing technique that comes in handy in multiple places is the removal of
HTML tags. This is especially useful if we scrape data from different websites; we might end up
having HTML strings as part of our text. We can remove the HTML tags using regular expressions.
Example
import pandas as pd
import re
text="The HTML <b> element defines bold text, without any extra importance."
print("Original Text Data")
print(text)
new_toks=[]
tokens=text.split()
for t in tokens:
    t=re.sub("<.*>","",t)
    new_toks.append(t)
text=" ".join(new_toks)
print("Text Data After Removal of HTML Tags")
print(text)
Removal of Emojis
With more and more usage of social media platforms, there has been an explosion in the use of emojis
in our day-to-day lives as well. We might need to remove these emojis for some of our textual
analysis. We have to use the 'u' literal to create a Unicode string. Also, we should pass the
re.UNICODE flag and convert our input data to Unicode.
Example
import pandas as pd
import re
df = pd.read_csv("/content/drive/My Drive/sample.csv")
df = df[["text"]]
text=df["text"][0]
print("Original Text")
print(text)
toks=text.split()
new_toks=[]
# Pattern covering common emoji Unicode ranges
pattern = re.compile("["
    u"\U0001F600-\U0001F64F"  # emoticons
    u"\U0001F300-\U0001F5FF"  # symbols & pictographs
    u"\U0001F680-\U0001F6FF"  # transport & map symbols
    u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
    u"\U00002702-\U000027B0"  # dingbats
    u"\U000024C2-\U0001F251"  # enclosed characters
    "]+", flags=re.UNICODE)
for t in toks:
    t=re.sub(pattern,"",t)
    new_toks.append(t)
text=" ".join(new_toks)
df["text"][0]=text
print("After Removal of Emojis")
print(df["text"][0])
Stemming
Stemming is the process of converting a word to its most general form, or stem. This helps in reducing
the size of our vocabulary. Consider the words learn, learning, learned and learnt. All these words are
stemmed from their common root, learn. However, in some cases the stemming process produces words
that are not correct spellings of the root word, for example happi. That is because it chooses the most
common stem for related words. For example, consider the set of words that comprises the different
forms of happy: happy, happiness and happier. We can see that the prefix happi is more commonly
used. We cannot choose happ because it is the stem of unrelated words like happen. NLTK has
different modules for stemming; we will use the PorterStemmer module, which uses the Porter
Stemming Algorithm.
Example
import numpy as np
import pandas as pd
from nltk.stem.porter import PorterStemmer
df = pd.read_csv("/content/drive/My Drive/sample.csv")
df = df[["text"]]
text=df["text"][0]
text=text.lower()
print("Original Text")
print(text)
stemmer = PorterStemmer()
toks=text.split()
new_toks=[]
for t in toks:
    rw=stemmer.stem(t)   # reduce each token to its stem
    new_toks.append(rw)
text=" ".join(new_toks)
df["text"][0]=text
print("After Stemming")
print(text)
Lemmatization
Lemmatization is a text preprocessing technique used in natural language processing (NLP)
models to break a word down to its root meaning to identify similarities. For example, a
lemmatization algorithm would reduce the word better to its root word, or lemma, good.
In stemming, a part of the word is simply chopped off at the tail end to arrive at the stem of the word.
There are different algorithms to decide how many characters have to be chopped off, but these
algorithms do not actually know the meaning of the word in the language it belongs to. In
lemmatization, the algorithms do have this knowledge; in fact, they refer to a dictionary to
understand the meaning of the word before reducing it to its root word, or lemma. Stemming, by
contrast, reduces the size of text data massively and is therefore faster when processing large
amounts of text; however, it may result in meaningless words.
So a lemmatization algorithm would know that the word better is derived from the word good, and
hence the lemma is good, but a stemming algorithm would not be able to do the same. There could
be over-stemming or under-stemming, and the word better could be reduced to bet or bett, or just
retained as better, but there is no way stemming can reduce better to its root word good. This is the
difference between stemming and lemmatization: lemmatization preserves the meaning of words,
but it is computationally more expensive and provides less dimensionality reduction than stemming.
Example
import numpy as np
import pandas as pd
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')   # the WordNet corpus is required by the lemmatizer
df = pd.read_csv("/content/drive/My Drive/sample.csv")
df = df[["text"]]
text=df["text"][1]
text=text.lower()
print("Original Text")
print(text)
lemmatizer = WordNetLemmatizer()
toks=text.split()
new_toks=[]
for t in toks:
    rw=lemmatizer.lemmatize(t)   # look up the lemma of each token
    new_toks.append(rw)
text=" ".join(new_toks)
df["text"][1]=text
print("After Lemmatization")
print(text)