
Unit 1

Review of NumPy Arrays


NumPy is a wonderful Python package, which has been created fundamentally for scientific
computing. It helps handle large multidimensional arrays and matrices, and provides a large library of
high-level mathematical functions to operate on these arrays. A NumPy array requires much less
memory than a Python list to store the same amount of data, which also makes reading from and
writing to the array faster.
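For instance, a rough comparison of memory use can be made as follows (a small illustrative sketch, not part of the original text):
import sys
import numpy as np
lst = list(range(1000))
arr = np.arange(1000)
# Approximate size of the list: the list object plus the int objects it references
list_bytes = sys.getsizeof(lst) + sum(sys.getsizeof(v) for v in lst)
print("Python list (approx.):", list_bytes, "bytes")
print("NumPy array buffer:", arr.nbytes, "bytes")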
Creating an Array
A list of numbers can be passed to the array function, as shown below, to create a NumPy array object.
import numpy as np
a = np.array([[0, 1, 2, 3], [4, 5, 6, 7],[8, 9, 10, 11]])
A NumPy array object has a number of attributes, which help in giving information about the array.
Here are its important attributes.
1 ndim: This gives the number of dimensions of the array. The following shows that the array
that we defined had two dimensions.
a.ndim
2 shape: This gives the size of each dimension of the array.
a.shape
3 size: This gives the number of elements.
a.size
4 dtype: This gives the data type of the elements in the array:
a.dtype.name
Example: Create a NumPy array of dimension 3x3 and display all of its attributes.
import numpy as np
a = np.array([[1, 3, 6], [2, 4, 7], [2, 5, 9]])
print(a)
print("#Dimensions:", a.ndim)
print("Shape:", a.shape)
print("#Elements:", a.size)
print("Data Type:", a.dtype.name)
Mathematical Operations
When we have an array of data, we would like to perform certain mathematical operations on it.
We will now discuss a few of the important ones in this section.
1. Array Subtraction
The following commands subtract the b array from the a array to get the resultant c array. The
subtraction happens element by element:
a = np.array([11, 12, 13, 14])
b = np.array([1, 2, 3, 4])
c = a - b
2. Squaring an Array
The following command raises each element to the power of 2 to obtain this result:
r = b**2
3. Trigonometric Function Performed on the Array
The following command applies cosine to each of the values in the b array to obtain the
following result:
r=np.cos(b)
4. Conditional Operations
The following command will apply a conditional operation to each of the elements of the b
array, in order to generate the respective Boolean values.
r=b<2
5. Matrix Multiplication
Two matrices can be multiplied element by element or in a dot product. The following
commands will perform the element-by-element multiplication:
a = np.array([[1, 1], [0, 1]])
b = np.array([[2, 0], [3, 4]])
c = a * b
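For comparison, the dot (matrix) product of the same two arrays can be obtained with np.dot; this is a small added sketch, not part of the original commands:
import numpy as np
a = np.array([[1, 1], [0, 1]])
b = np.array([[2, 0], [3, 4]])
d = np.dot(a, b)  # matrix product: [[5, 4], [3, 4]]
print(d)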
Example: Create two NumPy arrays of dimension 3x4, say a and b, and then perform the following
operations.
a. Find their sum and difference,
b. Find square of elements in first array and cube of elements in second array
c. Find r=sin(a)+cos(b)
d. Create a Boolean array for the first array, where entries are True if the element is
non-negative (>= 0) and False otherwise.
e. Find transpose of first array
f. Find element-wise multiplication of arrays and multiplication of both matrices.
import numpy as np
a=np.array([[1,-3,4,7],[2,4,-5,9],[0,-1,8,3]])
b=np.array([[2,3,6,7],[2,5,5,8],[3,1,7,3]])
r=a+b print("a+b=",r) r=a-b
print("a-b=",r) r=a**2
print("a^2=",r) r=b**3
print("b^3=",r) r=np.sin(a)
+np.cos(b)
print("sin(a)+cos(b)=",r) r=(a>=0)
print("(a>=0)=",r) r=a*b
print("Element wise Multiplication of a and b=",r) a=np.transpose(a)
print("Transpose of a=",a) r=np.dot(a,b)
print("Matrix Multiplication of a and b=",r)

6. Indexing and Slicing


If we want to select a particular element of an array, it can be achieved using indexes.
a[0,1]
The preceding command will select the first row and then select the second value in the row.
It can also be seen as an intersection of the first row and the second column of the matrix. If a
range of values has to be selected on a row, then we can use the following command.
a[0 , 0:3 ]
The 0:3 value selects the first three values of the first row. The whole row of values can be
selected with the following command.
a[ 0 , : ]
An entire column of values can be selected using the following command.
a[ : , 1 ]
7. Shape Manipulation
Once the array has been created, we can change the shape of it too. The following command
flattens the array.
a.ravel()
The following command reshapes the array into a six-rows-by-two-columns format. Also,
note that when reshaping, the new shape should have the same number of elements as the
previous one.
a.shape = (6,2)
The array can be transposed too:
a.transpose()
Example: Create a NumPy array of dimension 4x3 and then perform following operations.
a. Display 3rd element of second row
b. Display first two elements of second row
c. Display 2nd to 3rd rows of the array
d. Display 2x2 slice of top left part of the array
e. Display first two columns of the array
f. Convert array to 1D
g. Convert array to dimension 3x4
import numpy as np
a = np.random.rand(4, 3)
print(a)
print("Third Element of Second Row=",a[1,2])
print("First Two Elements of Second Row=",a[1,0:2])
print("Second to Third Row:",a[1:3,:])
print("2x2 Slice of Top-Left Part=",a[0:2,0:2])
print("Fisr Two Columns=",a[:,0:2])
a=a.ravel()
print("1D Array:",a)
a.shape=(3,4)
print("a=",a)

Review of Pandas Data Structures


The pandas library is an open source Python library, specially designed for data analysis. It has been
built on NumPy and makes it easy to handle data. The pandas library brings the richness of R to the
world of Python for handling data. It has efficient data structures to process data, perform fast joins,
and read data from various sources, to name a few. The pandas library essentially has three data
structures: Series, DataFrame, and Panel.
Series
Series is a one-dimensional array, which can hold any type of data, such as integers, floats, strings,
and Python objects too. A series can be created by calling the following.
import pandas as pd
import numpy as np
s=pd.Series(np.random.randn(5))
print(s)

The random.randn function is part of the NumPy package and it generates random numbers. The
Series constructor creates a pandas Series that consists of an index, which is the first column, while the
second column consists of the random values. At the bottom of the output is the datatype of the series.
The index of the series can be customized by calling the following:
import pandas as pd
import numpy as np
s=pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
print(s)

A series can be derived from a Python dict too as below:


import pandas as pd
d = {'A': 10, 'B': 20, 'C': 30}
s = pd.Series(d)
print(s)

DataFrame
DataFrame is a 2D data structure with columns that can be of different data types. It can be seen as a
table. A DataFrame can be formed from the following data structures: A NumPy array, Lists, Dicts,
Series, etc.
A DataFrame can be created from a dictionary of series as below:
import pandas as pd
d = {'c1': pd.Series(['A', 'B', 'C']), 'c2': pd.Series([1, 2, 3, 4])}
df = pd.DataFrame(d)
print(df)

import pandas as pd
d = {'c1': ['A', 'B', 'C', 'D'], 'c2': [1, 2, 3, 4]}
df = pd.DataFrame(d)
print(df)

We can also convert NumPy arrays into DataFrames.

import pandas as pd
import numpy as np
a = np.random.randn(3, 4)
df = pd.DataFrame(a)
print(df)

Inserting and Exporting Data


The data is stored in various forms, such as CSV, TSV (Tab Separated Values), databases, and so on.
The pandas library makes it convenient to read data from these formats or to export to these formats.
CSV
To read data from a .csv file, the read_csv function can be used. To write a data to the .csv file, the
to_csv function can be used.
Example: Write a program that reads the emp.csv file, displays its content, stores Eid, Ename, and Age in
another dataframe, and writes it to the emptest.csv file.
import pandas as pd
emp = pd.read_csv('/content/drive/My Drive/Data/emp.csv')
print(emp)
d = emp[["Eid", "Ename", "Age"]]
print(d)
d.to_csv("/content/drive/My Drive/Data/emptest.csv")

XLS
To read data from an Excel file, the read_excel() function can be used, and the to_excel() function can
be used to write an Excel file.
Example: Write a program that reads the Book1.xlsx file, displays its content, stores Sid and Grade in another
dataframe, and writes it to the Book.xlsx file.
import pandas as pd
book = pd.read_excel('/content/drive/My Drive/Data/Book1.xlsx')
print(book)
b = book[["Sid", "Grade"]]
print(b)
b.to_excel("/content/drive/My Drive/Data/Book.xlsx")

JSON Data
JSON is a syntax for storing and exchanging data. JSON is text, written with JavaScript object notation.
Python has a built-in package called json, which can be used to work with JSON data. If we have a
JSON string, we can parse it by using the json.loads() method. The result will be a Python
dictionary. If we have a Python object, we can convert it into a JSON string by using the
json.dumps() method.
Example: Represent the Id, Name, and Email of 3 persons in JSON format, load it into a Python object, and
display it. Again, represent the Name, Age, and City of 3 persons in a dictionary, convert it into JSON
format, and display it.
import json
# JSON Data
x = """[
{"ID":101,"name":"Ram", "email":"[email protected]"},
{"ID":102,"name":"Bob", "email":"[email protected]"},
{"ID":103,"name":"Hari", "email":"[email protected]"}
]"""
# loads method converts x into a list of dictionaries
y = json.loads(x)
print(y)
# Displaying Email of all persons from the list
for r in y:
    print(r["email"])

# a Python object (dict):
x = {"Name": ["Ram", "Hari", "Sita"], "Age": [30, 40, 27], "City": ["KTM", "PKR", "DHN"]}

# dumps method converts dictionary into JSON string
y = json.dumps(x)
print(y)
print(x["Name"])   # No error because x is a dictionary
# print(y["Name"]) # Error because y is a JSON string, not a dictionary
Database
SQL-based relational databases (such as SQL Server, PostgreSQL, and MySQL) are in wide use,
and many alternative NoSQL databases have become quite popular. Loading data from SQL into a
DataFrame is fairly straightforward, and pandas has some functions to simplify the process. As an
example, an in-memory SQLite database using Python’s built-in sqlite3 driver is presented below.
Example
import sqlite3
query = "CREATE TABLE Student (Sid Varchar(10), Sname VARCHAR(20), GPA Real, Age
Integer);"
con = sqlite3.connect(':memory:')
con.execute(query)
con.commit()
data = [('S1', 'Ram', 3.25, 23),('S2', 'Hari', 3.4, 24),('S3', 'Sita', 3.7, 22)]
stmt = "Insert Into Student Values(?, ?, ?, ?)"
con.executemany(stmt, data)
con.commit()
cursor = con.execute('select * from Student where GPA>3.3')
rows = cursor.fetchall()
print(rows)
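As a small additional sketch (assuming the same in-memory connection con created above), the query result can also be loaded directly into a pandas DataFrame with read_sql_query:
import pandas as pd
# Load the query result into a DataFrame using the existing sqlite3 connection
df = pd.read_sql_query('select * from Student where GPA > 3.3', con)
print(df)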

Data Cleansing
Data cleansing, sometimes known as data cleaning or data scrubbing, denotes the procedure of
rectifying inaccurate, unfinished, duplicated, or other flawed data within a dataset. This task entails
detecting data discrepancies and subsequently modifying, enhancing, or eliminating the data to
rectify them. Through data cleansing, data quality is enhanced, thereby furnishing more precise,
uniform, and dependable information crucial for organizational decision-making.

Checking the Missing Data


Generally, most data will have some missing values. There could be various reasons for this: the source
system which collects the data might not have collected the values or the values may never have existed.
Once you have the data loaded, it is essential to check the missing elements in the data. Depending on
the requirements, the missing data needs to be handled. It can be handled by removing a row or
replacing a missing value with an alternative value. Commands isnull(), notnull(), dropna() are
widely used for checking null values.
 isnull(): It examines columns for NULL values and generates a Boolean series where
True represents NaN values and False signifies non-null values.
 notnull(): It examines columns for NULL values and generates a Boolean series where True represents non-null values and False signifies NULL values.
 dropna(): It eliminates all rows containing NULL values from the DataFrame.
Example
import pandas as pd
emp = pd.read_csv('/content/drive/My Drive/Data/employees.csv')
for c in emp.columns:
    print(emp[c].isnull().value_counts())

emp = emp.dropna()
print("Cleaned Data")
for c in emp.columns:
    print(emp[c].isnull().value_counts())

Filling Missing Values


To address null values within datasets, we employ functions like fillna(), replace(), and interpolate().
These functions substitute NaN values with specific values.
fillna(): The fillna() function substitutes NULL values with a designated value. By default, it
generates a new DataFrame object unless the inplace parameter is set to True, in which case it performs
the replacement within the original DataFrame.
Example
import pandas as pd
emp = pd.read_csv('/content/drive/My Drive/Data/employees.csv')
emp["First Name"].fillna(value="Unknown", inplace=True)
emp["Gender"].fillna(value="Unknown", inplace=True)
emp["Salary"].fillna(value=emp["Salary"].mean(), inplace=True)
emp["Bonus %"].fillna(value=emp["Bonus %"].mean(), inplace=True)
emp["Team"].fillna(value="Unknown", inplace=True)
for c in emp.columns:
    print(emp[c].isnull().value_counts())
The Pandas interpolate() function is utilized to populate NaN values within the DataFrame or Series
by employing different interpolation techniques aimed at filling the missing values in the data.
import pandas as pd
import numpy as np
emp = pd.read_csv('/content/drive/My Drive/Data/employees.csv')
print(*emp["Salary"])
emp["Salary"].interpolate(inplace=True, method="linear")
print(*emp["Salary"])

Merging Data
To combine datasets together, the concat function of pandas can be utilized. We can concatenate two
or more dataframes together.
Example
import pandas as pd
import numpy as np
emp = pd.read_csv('/content/drive/My Drive/Data/employees.csv')
e1 = emp[0:5]
print("First 5 Rows of Dataframe:")
print(e1)
e2 = emp[10:15]
print("Rows 10-15 of Dataframe:")
print(e2)
print("Concatenated Dataframe:")
e=pd.concat([e1,e2])
print(e)

Data operations
Once the missing data is handled, various operations such as aggregate operations, joins etc. can be
performed on the data.
Aggregation Operations
There are a number of aggregation operations, such as average, sum, and so on, which we would
like to perform on a numerical field. These aggregate methods are discussed below.
 Average: The mean() method of a pandas dataframe is used for finding the average of the specified numerical field of the dataframe.
 Sum: The sum() method of a pandas dataframe is used for finding the total of the specified numerical field of the dataframe.
 Max: The max() method of a pandas dataframe is used for finding the maximum value of the specified numerical field of the dataframe.
 Min: The min() method of a pandas dataframe is used for finding the minimum value of the specified numerical field of the dataframe.
 Standard Deviation: The std() method of a pandas dataframe is used for finding the standard deviation of the specified numerical field of the dataframe.
 Count: The count() method of a pandas dataframe is used for counting the number of values in the specified field of the dataframe.
Example
import pandas as pd
import numpy as np
emp = pd.read_csv('/content/drive/My Drive/Data/employees.csv')
emp.drop_duplicates(inplace=True)
avgsal = emp["Salary"].mean()
print("Average Salary=", avgsal)
totsal = emp["Salary"].sum()
print("Total Salary=", totsal)
maxsal = emp["Salary"].max()
print("Maximum Salary=", maxsal)
minsal = emp["Salary"].min()
print("Minimum Salary=", minsal)
nemp = emp["First Name"].count()
print("#Employees=", nemp)
teams = emp["Team"].drop_duplicates().count()
print("#Teams=", teams)
std = emp["Bonus %"].std()
print("Standard Deviation of Bonus=", std)

groupby Function
A groupby operation involves some combination of splitting the object, applying a function, and
combining the results. This can be used to group large amounts of data and compute operations on these
groups.
import pandas as pd
import numpy as np
emp = pd.read_csv('/content/drive/My Drive/Data/employees.csv')
emp.drop_duplicates(inplace=True)
emp.dropna(inplace=True)
avgsal = emp[["Team", "Salary"]].groupby(["Team"]).mean()
print("Average Salary For Each Team")
print(avgsal)
gencount = emp.groupby(["Gender"]).count()
print("#Employees Gender Wise")
print(gencount)
minbonus = emp[["Gender", "Bonus %"]].groupby(["Gender"]).min()
print("Minimum Bonus% For Each Gender")
print(minbonus)

Various Forms of Distribution


There are various kinds of probability distributions, and each distribution shows the probability of
different outcomes for a random experiment.
Normal Distribution
Normal distribution, also known as the Gaussian distribution, is a probability distribution that is
symmetric about the mean, showing that data near the mean are more frequent in occurrence than data
far from the mean. In graphical form, the normal distribution appears as a "bell curve" and is
completely determined by two parameters: its mean μ and its standard deviation σ. The mean
indicates where the bell is centered, and the standard deviation indicates how "wide" the curve is.
We say the data is "normally distributed" if the data exhibit the following properties:
 mean = median = mode
 Data are symmetric about the center, that is, 50% of values are less than the mean and 50% are greater than the mean.

The probability density function of the normal distribution is given as below:

f(x) = (1 / (σ√(2π))) × e^( −(x − μ)² / (2σ²) )
For all normal distributions, 68.2% of the observations will appear within plus or minus one
standard deviation of the mean; 95.4% of the observations will fall within +/- two standard
deviations; and 99.7% within +/- three standard deviations. This fact is sometimes referred to as the
"empirical rule," a heuristic that describes where most of the data in a normal distribution will appear.
This means that data falling outside of three standard deviations ("3-sigma") would signify rare
occurrences.
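As a quick numerical check of the empirical rule (a small added sketch using SciPy's standard normal distribution):
from scipy import stats
# Proportion of a normal distribution within 1, 2, and 3 standard deviations of the mean
for k in [1, 2, 3]:
    p = stats.norm.cdf(k) - stats.norm.cdf(-k)
    print(f"Within +/- {k} SD: {p:.4f}")  # about 0.6827, 0.9545, 0.9973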

Example
from scipy import stats
import matplotlib.pyplot as plt
import numpy as np
dist = stats.norm(loc=5.6, scale=1)  # Here 5.6 is the mean and 1 is the SD
# Generate a sample of 100 random penguin heights
heights = dist.rvs(size=100)
heights = heights.round(2)
heights = np.sort(heights)
prob = dist.pdf(x=5.2)
print("Probability density at height 5.2=", prob)
probs = dist.pdf(x=[4.5, 5, 5.5, 6, 6.5])
print("Probability densities of heights=", probs)
probs = dist.pdf(x=heights)
# plotting histogram with density curve
plt.figure(figsize=(6, 4))
plt.hist(heights, bins=20, density=True)
plt.title("Height Histogram and Density Curve")
plt.xlabel("Height")
plt.ylabel("Frequency")
plt.plot(heights, probs)
plt.show()
The pdf() method from the norm class returns the value of the probability density function at specific
values of a normal distribution, which describes the relative likelihood of those values. PDF stands for
probability density function.
The norm method cdf() helps us to calculate the proportion of a normally distributed population
that is less than or equal to a given value. CDF stands for Cumulative Distribution Function.
Example
from scipy import stats
import matplotlib.pyplot as plt
import numpy as np

dist = stats.norm(loc=5.5, scale=2)

# Generate a sample of 1000 random penguin heights
heights = dist.rvs(size=1000)
heights = heights.round(2)
heights = np.sort(heights)
prob = dist.pdf(x=6)
print("Probability density at Height=6:", prob)
prob = dist.cdf(x=6)
print("Probability of Height<=6:", prob)

Z-score
Z-score is a statistical measurement that describes a value's relationship to the mean of a group of
values. Z-score is measured in terms of standard deviations from the mean. If a Z-score is 0, it
indicates that the data point's score is identical to the mean value. A Z-score of 1.0 would indicate
a value that is one standard deviation from the mean. Z-scores may be positive or negative, with a
positive value indicating the value is larger than the mean and a negative z-score indicating it is
smaller than the mean. It is calculated by using the formula given below.
z= (x − μ)/σ
Here, x is the value in the distribution, μ is the mean of the distribution, and σ is the standard
deviation of the distribution. Conversely, if x is a normal random variable with mean μ and
standard deviation σ, it is calculated as below.
x = σz + μ
Numerical Example
A survey of daily travel time had these results (in minutes): 26, 33, 65, 28, 34, 55, 25, 44,
50, 36, 26, 37, 43, 62, 35, 38, 45, 32, 28, 34. Convert the values to z-scores ("standard scores").
Solution
μ = 38.8   σ = 11.4
Original Value Standard Score (z-score)
26 (26-38.8) / 11.4 = −1.12
33 (33-38.8) / 11.4 = −0.51
65 (65-38.8) / 11.4 = 2.30
Example
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
dist = stats.norm(loc=50, scale=10)
scores = dist.rvs(size=100)
scores = scores.round()
print(*scores)
plt.hist(scores, bins=30)
plt.title("Histogram of Original Scores")
plt.show()
# Converting scores to z-scores
z = stats.zscore(scores).round(3)
print(*z)
plt.hist(z, bins=30)
plt.title("Histogram of Z-values of Scores")
plt.show()
# converting z-scores back to values in the distribution
s = (scores.std()*z + scores.mean()).round()
print(*s)
print(*scores)

The StandardScaler calculates the z-score for every data point while normalizing data, and the
normalized data can be inverse-scaled back to the original values, as mentioned above.
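A minimal sketch of this idea, assuming scikit-learn is available (not part of the original example), showing scaling and inverse scaling with StandardScaler:
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative data: one numeric feature as a column vector
data = np.array([[26.0], [33.0], [65.0], [28.0], [34.0]])
scaler = StandardScaler()
z = scaler.fit_transform(data)            # z-scores of each value
original = scaler.inverse_transform(z)    # back to the original scale
print(z.ravel())
print(original.ravel())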
Binomial Distribution
Binomial distribution is a probability distribution that summarizes the likelihood that a variable will take
one of two independent values under a given set of parameters. The distribution is obtained by
performing a number of Bernoulli trials. A Bernoulli trial is assumed to meet each of these criteria.
 There must be only 2 possible outcomes.
 Each outcome has a fixed probability of occurring. A success has the probability of p, and a
failure has the probability of 1 – p.
 Each trial is completely independent of all others.

For example, the probability of getting a head or a tail on a fair coin is 50%. If we take the same coin and
flip it n times, the probability of getting x heads can be computed using the probability mass function
(PMF) of the binomial distribution. The binomial distribution formula, for any random variable x, is
given by

P(x) = (n! / (x!(n − x)!)) × p^x × q^(n−x)

Where, n is the number of times the coin is flipped, p is the probability of success, q = 1 − p is the
probability of failure, and x is the number of successes desired.
Numerical Example: If a coin is tossed 5 times, find the probability of: (a) Exactly 2 heads and (b)
at least 4 heads.
Solution
Number of trials: n=5
Probability of head: p= 1/2 and hence the probability of tail, q =1/2
For exactly two heads: x = 2

P(x = 2) = (5! / (2! × 3!)) × 0.5² × 0.5³ = 10 × 0.03125 = 0.3125

Again, for at least 4 heads: x ≥ 4

P(x ≥ 4) = P(x = 4) + P(x = 5)
         = (5! / (4! × 1!)) × 0.5⁴ × 0.5¹ + (5! / (5! × 0!)) × 0.5⁵ × 0.5⁰
         = 0.15625 + 0.03125 = 0.1875

Example
from scipy import stats
import matplotlib.pyplot as plt
dist = stats.binom(n=5, p=0.5)
prob = dist.pmf(k=2)
print("Probability of Two Heads=", prob)
prob = dist.pmf(k=4) + dist.pmf(k=5)
print("Probability of at least 4 heads=", prob)

Note!!!
A probability mass function is a function that gives the probability that a discrete random variable is
exactly equal to some value. A probability mass function differs from a probability density function
(PDF) in that the latter is associated with continuous rather than discrete random variables. A PDF
must be integrated over an interval to yield a probability.
Poisson Distribution
Poisson distribution is a Discrete Distribution. It estimates how many times an event can happen in
a specified time provided the mean occurrence of the event in the interval. For example, if someone
eats twice a day, what is the probability he will eat thrice? If lambda (λ) is the mean occurrence of the
events per interval, then the probability of having k occurrences within a given interval is given by
the following formula.

P(k) = (λ^k × e^(−λ)) / k!
Where, e is the Euler's number, k is the number of occurrences for which the probability is going to
be determined, and lambda is the mean number of occurrences.
Numerical Example
In the World Cup, an average of 2.5 goals are scored in each game. Modeling this situation with a
Poisson distribution, what is the probability that 3 goals are scored in a game? What is the
probability that 5 goals are scored in a game?
Solution
Given, λ = 2.5

P(x = 3) = (2.5³ × e^(−2.5)) / 3! = 0.214

P(x = 5) = (2.5⁵ × e^(−2.5)) / 5! = 0.0668
Example
from scipy import stats
dist = stats.poisson(2.5)  # 2.5 is the average value
prob = dist.pmf(k=1)
print("Probability of having 1 goal=", prob)
prob = dist.pmf(k=3)
print("Probability of having 3 goals=", prob)
prob = dist.pmf(k=5)
print("Probability of having 5 goals=", prob)

P-value
The P-value is known as the probability value. It is defined as the probability of getting a result that is
either the same as or more extreme than the actual observations. A p-value is used in statistical testing to
determine whether the null hypothesis is rejected or not. The null hypothesis is a statement that says that
there is no difference between two measures. For example, if the hypothesis is that people who clock in 4
hours of study every day score more than 90 marks out of 100, the null hypothesis would be that there is no
relation between the number of hours clocked in and the marks scored. If the p-value is equal to or
less than the significance level, then the null hypothesis is inconsistent with the data and needs to be rejected.
The P-value table below shows the hypothesis interpretations.
P-value          Decision
P-value > 0.05   The result is not statistically significant; hence, accept the null hypothesis.
P-value < 0.05   The result is statistically significant. Generally, reject the null hypothesis in favor of the alternative hypothesis.
P-value < 0.01   The result is highly statistically significant, and thus reject the null hypothesis in favor of the alternative hypothesis.

Suppose the null hypothesis is "It is common for students to score 68 marks in mathematics." Let's define
the significance level at 5%. If the p-value is less than 5%, then the null hypothesis is rejected and it is
not common to score 68 marks in mathematics. First calculate the z-score of 68 marks (say z68), and then
calculate the p-value for the given z-score as below.
pv = p(z ≥ z68)
This means pv*100% of the students score above the specified mark of 68.
import numpy as np
from scipy import stats
# Generate 100 random scores with mean=50 and SD=10
dist = stats.norm(loc=50, scale=10)
scores = dist.rvs(size=100)
mean = scores.mean()
SD = scores.std()
z = (68 - mean) / SD  # z-value of score=68
print("Z-value of score=68:", z)
p = stats.norm.cdf(z)  # probability of score<68
pv = 1 - p             # probability of score>=68
print(pv)
pvp = np.round(pv*100, 2)
print(f"p-value={pvp}%")
if (pv > 0.05):
    print("Null hypothesis is accepted: It is common to score 68 marks in mathematics.")
else:
    print("Null hypothesis is rejected: It is not common to score 68 marks in mathematics.")
One-tailed and Two-tailed Tests
A one-tailed test may be either left-tailed or right-tailed. A left-tailed test is used when the
alternative hypothesis states that the true value of the parameter specified in the null hypothesis is less
than the null hypothesis claims. A right-tailed test is used when the alternative hypothesis states that the
true value of the parameter specified in the null hypothesis is greater than the null hypothesis claims.

The main difference between one-tailed and two-tailed tests is that one-tailed tests will only have one
critical region whereas two-tailed tests will have two critical regions. If we require a 100(1-α)%
confidence interval we have to make some adjustments when using a two-tailed test. The confidence
interval must remain a constant size, so if we are performing a two-tailed test, as there are twice as many
critical regions then these critical regions must be half the size. This means that when performing a two-
tailed test, we need to consider α/2 significance level rather than α.

Example: A light bulb manufacturer claims that its energy-saving light bulbs last an average of 60 days.
Set up a hypothesis test to check this claim and comment on what sort of test we need to use.
The example in the previous section was an instance of a one-tailed test where the null hypothesis
is rejected or accepted based on one direction of the normal distribution. In a two-tailed test, both
the tails of the null hypothesis are used to test the hypothesis. In a two-tailed test, when a
significance level of 5% is used, then it is distributed equally in the both directions, that is, 2.5% of
it in one direction and 2.5% in the other direction.
Let's understand this with an example. The mean score of the mathematics exam at a national level is
60 marks and the standard deviation is 3 marks. The mean marks of a class are 53. The null
hypothesis is that the mean marks of the class are similar to the national average.
from scipy import stats
zs = (53 - 60) / 3.0
print(f"z-score={zs}")
pv = stats.norm.cdf(zs)
print(f"p-value={pv}")
pv = (pv*100).round(2)
print(f"p-value={pv}%")

So, the p-value is 0.98%. For the null hypothesis to be rejected, the p-value should be less than
2.5% in either direction of the bell curve. Since the p-value is less than 2.5%, we reject the null
hypothesis and clearly state that the average marks of the class are significantly different from the
national average.
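Equivalently, the two-tailed p-value can be computed directly and compared with the full 5% significance level; this is a small added sketch extending the example above:
from scipy import stats
zs = (53 - 60) / 3.0
# Two-tailed p-value: probability mass in both tails beyond |zs|
pv_two_tailed = 2 * stats.norm.cdf(-abs(zs))
print(f"two-tailed p-value={pv_two_tailed:.4f}")  # about 0.0196, which is below 0.05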
Type 1 and Type 2 Errors
A type 1 error appears when the null hypothesis of an experiment is true, but still, it is rejected. A
type 1 error is often called a false positive. Consider the following example. There is a new drug
that is being developed and it needs to be tested on whether it is effective in combating diseases.
The null hypothesis is that "it is not effective in combating diseases." The significance level is kept
at 5% so that the null hypothesis can be accepted confidently 95% of the time. However, 5% of the
time we will reject the null hypothesis even though it should have been accepted, which means that
even though the drug is ineffective, it is concluded to be effective. The Type 1 error is controlled by
controlling the significance level, α. α is the highest probability of having a Type 1 error.
The lower the α, the lower will be the Type 1 error.

The Type 2 error is the kind of error that occurs when we do not reject a null hypothesis that is
false. A type 2 error is also known as a false negative. This kind of error occurs in the drug scenario
when the drug is accepted as ineffective but it is actually effective. The probability of a type 2
error is β. Beta depends on the power of the test; the probability of not committing a type
2 error is equal to 1 − β. There are 3 parameters that can
affect the power of a test: sample size (n), significance level of the test (α), and the "true" value of the tested parameter, as described below.
 Sample size (n): Other things being equal, the greater the sample size, the greater the power of
the test.
 Significance level (α): The lower the significance level, the lower the power of the test. If we
reduce the significance level (e.g., from 0.05 to 0.01), the region of acceptance gets bigger.
As a result, we are less likely to reject the null hypothesis. This means we are less likely to
reject the null hypothesis when it is false, so we are more likely to make a Type II error. In
short, the power of the test is reduced when we reduce the significance level; and vice versa.
 The "true" value of the parameter being tested: The greater the difference between the
"true" value of a parameter and the value specified in the null hypothesis, the greater the
power of the test.
These errors cannot both be controlled at once: if one of them is lowered, the other one increases.
Which error should be reduced depends on the use case and the problem statement that the analysis is
trying to address. In the case of this drug scenario, typically, the Type 1 error should be lowered,
because it is better to ship only a drug that is confidently effective.

Confidence Interval
When we make an estimate in statistics, whether it is a summary statistic or a test statistic, there is
always uncertainty around that estimate because the number is based on a sample of the population we
are studying. A confidence interval is the mean of our estimate plus and minus the variation in that
estimate. This is the range of values we expect our estimate to fall between if we experiment again or
re-sample the population in the same way.
The confidence level is the percentage of times we expect to reproduce an estimate between the upper
and lower bounds of the confidence interval. For example, if we construct a confidence interval with a
95% confidence level, you are confident that 95 out of 100 times the estimate will fall between the upper
and lower values specified by the confidence interval. Our desired confidence level is usually one minus
the alpha (α) value we used in our statistical test:
Confidence level = 1 − α
So if we use an alpha value of p < 0.05 for statistical significance, then our confidence level would
be 1 − 0.05 = 0.95, or 95%.

Confidence interval for sample data is calculated as below:

 Find the sample mean as: x̄ = (x1 + x2 + x3 + ... + xn) / n
 Calculate the standard deviation as: SD = √( Σ(xi − x̄)² / (n − 1) )
 Find the standard error: The standard error of the mean is the deviation of the sample mean from the population mean. It is defined using the following formula: SE = SD / √n
 Finally, find the confidence interval as: Upper/Lower limit = x̄ ± z × SE, where z is the z-score of the given confidence level.
Note: Z-score of various confidence levels is given below.

Confidence Level Z-score


90% 1.645
95% 1.96
98% 2.33
99% 2.575
Numerical Example: Consider the following exam scores of 10 students { 80, 95, 90, 90, 95, 75, 75, 85,
90 , 80}. What will be the confidence interval for the confidence level 95%?
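A possible worked check of this numerical example in Python (a small added sketch, using the sample standard deviation and the z-score 1.96 for the 95% confidence level):
import numpy as np
scores = np.array([80, 95, 90, 90, 95, 75, 75, 85, 90, 80])
mean = scores.mean()                 # 85.5
SD = scores.std(ddof=1)              # sample standard deviation
SE = SD / np.sqrt(len(scores))       # standard error of the mean
ll = mean - 1.96 * SE
ul = mean + 1.96 * SE
print(f"Confidence Interval=({ll:.2f},{ul:.2f})")  # roughly (80.78, 90.22)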
Example. Generate heights of 50 persons randomly such that the heights have normal distribution
with mean=165 and SD=20. Calculate Confidence interval for the dataset for the confidence level
95%.
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Generate 50 random heights with mean=165 and SD=20
dist = stats.norm(loc=165, scale=20)
heights = dist.rvs(size=50)
mean = heights.mean()
print("Average Height", mean)
SE = stats.sem(heights)  # sem calculates the standard error of the mean
print("Standard Error=", SE)
ul = mean + 1.96*SE
ll = mean - 1.96*SE
print(f"Confidence Interval=({ll},{ul})")

Correlation
Correlation is a statistical measure that expresses the extent to which two variables are linearly related
(meaning they change together at a constant rate). It’s a common tool for describing simple relationships
without making a statement about cause and effect. The sample correlation coefficient, r, quantifies the
strength and direction of the relationship. Correlation coefficient quite close to 0, but either positive or
negative, implies little or no relationship between the two variables. A correlation coefficient close to
plus 1 means a positive relationship between the two variables, with increases in one of the variables
being associated with increment in the other variable. A correlation coefficient close to - 1 indicates a
negative relationship between two variables, with an increase in one of the variables being associated
with a decrease in the other variable. The most common formula is the Pearson correlation coefficient,
used for linear dependency between the data sets, and is given as below.

r = (n Σxy − Σx Σy) / ( √(n Σx² − (Σx)²) × √(n Σy² − (Σy)²) )

Numerical Example
Calculate the coefficient of correlation for the following two data sets: x = (41, 19, 23, 40, 55, 57, 33)
and y = (94, 60, 74, 71, 82, 76, 61).

Σx = 41 + 19 + 23 + 40 + 55 + 57 + 33 = 268
Σy = 94 + 60 + 74 + 71 + 82 + 76 + 61 = 518
Σxy = (41 × 94) + (19 × 60) + ... + (33 × 61) = 20,391
Σx² = 41² + 19² + ... + 33² = 11,534
Σy² = 94² + 60² + ... + 61² = 39,174

r = (7 × 20,391 − 268 × 518) / ( √(7 × 11,534 − 268²) × √(7 × 39,174 − 518²) ) = 0.54

Example
from scipy import stats
import numpy as np
x = np.array([2, 4, 3, 9, 7, 6, 5])
y = np.array([5, 7, 7, 18, 15, 11, 10])
r = stats.pearsonr(x, y)  # computes Pearson correlation coefficient
print("Result:", r)
print("Correlation Coefficient:", r[0])

T-test
A t-test is a statistical test that is used to compare the means of two groups. It is often used in
hypothesis testing to determine whether two groups are different from one another. A t-test can
only be used when comparing the means of two groups. If we want to compare more than two
groups, we should use an ANOVA test. When choosing a t-test, we will need to consider two things: whether
the groups being compared come from a single population or two different populations, and
whether we want to test the difference in a specific direction.
 If the groups come from a single population perform a paired sample t test.
 If the groups come from two different populations perform a two-sample t-test.
 If there is one group being compared against a standard value, perform a one-sample t test.

Paired Sample T-Test


This hypothesis testing is conducted when two groups belong to the same population. The groups
are studied either at two different times or under two varied conditions. They could be pre-test and
post-test results from the same people. The formula used to obtain the t-value is:

t = d̄ / (s / √n)

Where, d = difference between paired samples, d̄ is the mean of d, s is the standard deviation of the
differences, and n is the sample size.

Example
An instructor takes two exams of the students. Scores of both exams are given in the table below.
He/She wants to know if the exams are equally difficult.
Student Exam1 Score(x) Exam2 Score(y)
S1 63 69
S2 65 65
S3 56 62
S4 100 91
S5 88 78
S6 83 87
S7 77 79
S8 92 88
S9 90 85
S10 84 92
S11 68 69
S12 74 81
S13 87 84
S14 64 75
S15 71 84
S16 88 82

Solution

Student Exam1 Score(x) Exam2 Score(y) d=y-x (y-x)²


S1 63 69 6 36
S2 65 65 0 0
S3 56 62 6 36
S4 100 91 -9 81
S5 88 78 -10 100
S6 83 87 4 16
S7 77 79 2 4
S8 92 88 -4 16
S9 90 85 -5 25
S10 84 92 8 64
S11 68 69 1 1
S12 74 81 7 49
S13 87 84 -3 9
S14 64 75 11 121
S15 71 84 13 169
S16 88 82 -6 36

Now,
d̄ = Σd / n = 21 / 16 = 1.31
s = √( Σ(d − d̄)² / (n − 1) ) ≈ 7.00
Now,
t = d̄ / (s / √n) = 1.31 / (7.00 / √16) ≈ 0.75
Let's assume, Significance level (α) = 0.05
Degrees of freedom (df) = n − 1 = 15
The tabulated t-value with α = 0.05 and 15 degrees of freedom is 2.131.
Because 0.75 < 2.131, we accept the null hypothesis. This means the mean scores of the two exams are
similar.
 Exams are equally difficult.
In Python, we can perform a paired t-test using the scipy.stats.ttest_rel() function. It performs the t-
test on two related samples of scores.
Example
from scipy import stats

# Scores of the two exams
Exam1_Score = [63, 65, 56, 100, 88, 83, 77, 92, 90, 84, 68, 74, 87, 64, 71, 88]
Exam2_Score = [69, 65, 62, 91, 78, 87, 79, 88, 85, 92, 69, 81, 84, 75, 84, 82]
# Perform paired t-test
tv, pv = stats.ttest_rel(Exam1_Score, Exam2_Score)
print('t-statistic:', tv)
print('p-value:', pv)
if (pv > 0.05):
    print("Null Hypothesis is Accepted. This means there is no difference between the mean scores of the two exams")
else:
    print("Null Hypothesis is Rejected. This means there is a difference between the mean scores of the two exams")

Output
t-statistic: -0.7497768853141169
p-value: 0.4649871003972206
Null Hypothesis is Accepted. This means there is no difference between the mean scores of the two exams

One-Sample T-test
The One Sample t-Test examines whether the mean of a sample is statistically different from a known
or hypothesized population mean. It is calculated as below.

t = (x̄ − μ) / (s / √n)
Numerical Example
Imagine a company wants to test the claim that their batteries last more than 40 hours. Using a
simple random sample of 15 batteries yielded a mean of 44.9 hours, with a standard deviation of
8.9 hours. Test this claim using a significance level of 0.05.
Solution
t = (44.9 − 40) / (8.9 / √15) = 2.13
Given, significance level (α) = 0.05
Degrees of freedom (df) = n − 1 = 14
The tabulated t-value with α = 0.05 and 14 degrees of freedom is 1.761.
Because 2.13 > 1.761, we reject the null hypothesis and conclude that batteries last more than 40 hours.
In Python, we can perform a one-sample t-test using the scipy.stats.ttest_1samp() function.
Example
from scipy import stats
battery_hour = [40, 50, 55, 38, 48, 62, 44, 52, 46, 44, 37, 42, 46, 38, 45]

# One-sample t-test
tv, pv = stats.ttest_1samp(battery_hour, 40)
print('t-statistic:', tv)
print('p-value:', pv)
if (pv > 0.05):
    print("Null Hypothesis is Accepted. This means batteries last for 40 hours")
else:
    print("Null Hypothesis is Rejected. This means batteries last for more or less than 40 hours")

Two-Sample T-test
The two-sample t-test (also known as the independent samples t-test) is a method used to test
whether the unknown population means of two groups are equal or not. We can use the test when
our data values are independent, are randomly sampled from two normal populations, and the two
groups are independent. It is carried out as below.

t = (x̄1 − x̄2) / ( sp × √(1/n1 + 1/n2) )

where x̄1 and x̄2 are the sample means, n1 and n2 are the sample sizes, and sp is the pooled standard
deviation, calculated as below.

sp = √( ((n1 − 1)s1² + (n2 − 1)s2²) / (n1 + n2 − 2) )

Example
Our sample data is from a group of men and women who did workouts at a gym three times a week for
a year. Then, their trainer measured the body fat. The table below shows the data.

Group Body Fat Percentage


Men 13.3,6.0, 20.0, 8.0, 14.0, 19.0, 18.0, 25.0, 16.0,24.0,15.0, 1.0, 15.0
Women 22.0, 16.0, 21.7, 21.0, 30.0, 26.0, 12.0, 23.2, 28.0, 23.0
Determine whether the underlying populations of men and women at the gym have the same mean
body fat.
Solution
We can use the ttest_ind() function in Python to perform a two-sample t-test.
Example
from scipy import stats
men = [13.3, 6.0, 20.0, 8.0, 14.0, 19.0, 18.0, 25.0, 16.0, 24.0, 15.0, 1.0, 15.0]
women = [22.0, 16.0, 21.7, 21.0, 30.0, 26.0, 12.0, 23.2, 28.0, 23.0]
# Two-sample t-test
tv, pv = stats.ttest_ind(women, men)
print('t-statistic:', tv)
print('p-value:', pv)
if (pv > 0.05):
    print("Null Hypothesis is Accepted. This means Body Fat percentage of Men and Women is similar")
else:
    print("Null Hypothesis is Rejected. This means Body Fat percentage of Men and Women is different")

T-test vs Z-Test
The difference between a t-test and a z-test hinges on the differences in their respective
distributions. As mentioned, a z-test uses the Standard Normal Distribution, which is defined as
having a population mean, μ, of 0 and a population standard deviation, σ, of 1. It is calculated
using the following formula.

z = (x̄ − μ) / (σ / √n)

Where, x̄ is a measured sample mean, μ is the hypothesized population mean, σ is the population
standard deviation, and n is the sample size.
Notice that this distribution uses a known population standard deviation for a data set to
approximate the population mean. However, the population standard deviation is not always
known, and the sample standard deviation, s, is not always a good approximation. In these
instances, it is better to use the T-test.
The T-Distribution looks a lot like a Standard Normal Distribution. In fact, the larger a sample is,
the more it looks like the Standard Normal Distribution, and at sample sizes larger than 30, they
are very, very similar. Like the Standard Normal Distribution, the T-Distribution is defined as
having a mean μ = 0, but its standard deviation, and thus the width of its graph, varies according to
the sample size of the data set used for the hypothesis test. It is calculated using the following formula:

t = (x̄ − μ) / (s / √n)

The standard normal or z-distribution assumes that you know the population standard deviation, while the
t-distribution is based on the sample standard deviation. The t-distribution is similar to a
normal distribution: it is symmetric about a mean of 0, but it has heavier tails, and it approaches the
standard normal distribution as the sample size (degrees of freedom) increases.
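To illustrate how the t-distribution approaches the standard normal distribution as the sample size grows, the two-tailed 5% critical values can be compared (a small added sketch):
from scipy import stats
# 97.5th percentile = two-tailed 5% critical value
print("z critical value:", round(stats.norm.ppf(0.975), 3))          # about 1.960
for df in [5, 15, 30, 100]:
    print(f"t critical value (df={df}):", round(stats.t.ppf(0.975, df), 3))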

The table below presents the key differences between the two statistical methods, Z-test and T-test.

 Z-Test: Used for large sample sizes (n ≥ 30). T-Test: Used for small to moderate sample sizes (n < 30).
 Z-Test: Requires knowledge of the population standard deviation (σ). T-Test: Performed when the population standard deviation is unknown.
 Z-Test: Does not involve the sample standard deviation. T-Test: Involves the sample standard deviation (s).
 Z-Test: Assumes a standard normal distribution. T-Test: Assumes a t-distribution, which varies with degrees of freedom.

Chi-square Distribution
If we repeatedly take samples and compute the chi-square statistic, then we can form a chi-square
distribution. A chi-square (χ²) distribution is a continuous probability distribution that is used in many
hypothesis tests. The shape of a chi-square distribution is determined by the parameter k, which
represents the degrees of freedom. The graph below shows examples of chi-square distributions
with different values of k.
There are two main types of Chi-Square tests namely: Chi-Square for the Goodness-of-Fit
and Chi-Square for the test of Independence.
Chi-Square for the Goodness-of-Fit
A chi-square test is a statistical test that is used to compare observed and expected results. The goal of
this test is to identify whether a disparity between actual and predicted data is due to chance or to a
link between the variables under consideration. As a result, the chi-square test is an ideal choice for
aiding in our understanding and interpretation of the connection between our two categorical variables.
Pearson’s chi-square test was the first chi-square test to be discovered and is the most widely used.
Pearson's chi-square test statistic is given as below.

χ² = Σ (O − E)² / E

Where, O is the observed frequency and E is the expected frequency.


Suppose a dice is rolled 36 times; the probability that each face turns upwards is 1/6. So, the
expected frequency of each face is 36 × 1/6 = 6, i.e., the expected distribution is [6, 6, 6, 6, 6, 6].

Suppose the observed distribution is [7, 5, 3, 9, 6, 6].

The null hypothesis in the chi-square test is that the observed value is similar to the expected value. The
chi-square can be performed using the chisquare function in the SciPy package. The function gives
chisquare value and p-value as output. By looking at the p-value, we can reject or accept the null
hypothesis.
Example
from scipy import stats
import numpy as np
expected = np.array([6, 6, 6, 6, 6, 6])
observed = np.array([7, 5, 3, 9, 6, 6])
cp = stats.chisquare(observed, expected)
print(cp)
Output: P-value = 0.65
Conclusion: Since the p-value > 0.05, the null hypothesis is accepted. Thus, we conclude that the observed
distribution of the dice is the same as the expected distribution.
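For reference, the same test statistic can be computed by hand from the formula above (a small added sketch):
import numpy as np
expected = np.array([6, 6, 6, 6, 6, 6])
observed = np.array([7, 5, 3, 9, 6, 6])
chi2 = np.sum((observed - expected)**2 / expected)
print("Chi-square statistic:", chi2)  # 20/6, about 3.33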

Chi-square Test of Independence


The Chi-Square test of independence is used to determine if there is a significant relationship between
two nominal (categorical) variables. For example, say a researcher wants to examine the relationship
between gender (male vs. female) and empathy (high vs. low). The chi-square test of independence can
be used to examine this relationship. The null hypothesis for this test is that there is no relationship
between gender and empathy. The alternative hypothesis is that there is a relationship between gender
and empathy (e.g. there are more high-empathy females than high-empathy males). The Chi-Square test
of independence can be performed using the chi2_contingency function in the SciPy package.
Example
Suppose the researcher collected data about the empathy of males and females. He/She has collected data
about 300 males and 200 females as given in the table.

Gender     High Empathy    Low Empathy
Male       180             120
Female     140             60

Null Hypothesis (H0) =There is no relationship between gender and empathy.


Python program to test above null hypothesis
from scipy import stats
import numpy as np
male_female = np.array([[180, 120], [140, 60]])
x = stats.chi2_contingency(male_female)
print(x)

Output: P-value = 0.029
Conclusion: Since the p-value is less than 0.05, our null hypothesis is rejected. Thus, we conclude that
empathy is related to gender.
ANOVA
ANOVA stands for Analysis of Variance. It is a statistical method used to analyze the differences
between the means of two or more groups. It is often used to determine whether there are any
statistically significant differences between the means of different groups. ANOVA compares the
variation between group means to the variation within the groups. If the variation between group means
is significantly larger than the variation within groups, it suggests a significant difference between the
means of the groups.
ANOVA calculates an F-statistic by comparing between-group variability to within-group variability. If
the F-statistic exceeds a critical value, it indicates significant differences between group means. Types of
ANOVA include one-way (for comparing means of groups) and two-way (for examining effects of two
independent variables on a dependent variable). To perform the one-way ANOVA, we can use the
f_oneway() function of the SciPy package.
Example
Suppose we want to know whether or not three different exam prep programs lead to different mean
scores on a certain exam. To test this, we recruit 30 students to participate in a study and split them into
three groups. The students in each group are randomly assigned to use one of the three exam prep
programs for the next three weeks to prepare for an exam. At the end of the three weeks, all of the
students take the same exam. The exam scores for each group are shown below.
Python program to solve above problem
from scipy import stats
import numpy as np
sg1 = [85, 86, 88, 75, 78, 94, 98, 79, 71, 80]
sg2 = [91, 92, 93, 85, 87, 84, 82, 88, 95, 96]
sg3 = [79, 78, 88, 94, 92, 85, 83, 85, 82, 81]
r = stats.f_oneway(sg1, sg2, sg3)
print("P-value=", r.pvalue)

Output: P-value= 0.113


Since the p-value > 0.05, the null hypothesis is accepted. Thus we conclude that the three different exam prep
programs lead to similar mean scores on the exam.
Suppose a scientist is interested in how a person's marital status affects weight. They have only one
factor to examine so the scientist would use a one-way ANOVA. Now assume that another scientist is
interested in how a person's marital status and income affect their weight. In this case, there are two
factors to consider; therefore a two-way ANOVA will be performed.
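SciPy itself does not provide a two-way ANOVA function. A rough sketch of how such a two-way ANOVA could be run, assuming the statsmodels package is available and using a tiny made-up dataset purely for illustration:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Made-up illustrative data: weight by marital status and income level
df = pd.DataFrame({
    "marital": ["single", "single", "married", "married", "single", "married", "single", "married"],
    "income":  ["low", "high", "low", "high", "high", "low", "low", "high"],
    "weight":  [68, 72, 75, 80, 70, 77, 66, 82],
})
# Two-way ANOVA: weight explained by marital status and income
model = ols("weight ~ C(marital) + C(income)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))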
Unit 2
Data Mining and Data Visualization

Controlling Line Properties of Charts
Matplotlib is a data visualization library in Python. The pyplot, a sublibrary of Matplotlib, is a
collection of functions that helps in creating a variety of charts. Line charts can be created simply
by using plot() method of pyplot library.
import matplotlib.pyplot as plt
x=[1,2,3,4,5,6,7]
y=[3,5,7,9,11,13,15]
plt.plot(x,y)
plt.xlabel("x")
plt.ylabel("y") plt.title("Line Chart Example")
plt.show()

There are many properties of a line that can be set, such as the color, dashes etc. There are essentially
three ways of doing this: using keyword arguments, using setter methods, and using setp() command.
Using Keyword Arguments
Keyword arguments (or named arguments) are values that, when passed into a function, are identified by
specific parameter names. These arguments can be sent using key = value syntax. We can use keyword
arguments to change default value of properties of line charts as below. Major keyword arguments
supported by plot() methods are: linewidth, color, linestyle, label, alpha, etc.
import matplotlib.pyplot as plt
x=[1,2,3,4,5,6,7]
y=[3,5,7,9,11,13,15]
plt.plot(x,y, linewidth=4, linestyle="--", color="red", label="y=2x+1")
plt.xlabel("x")
plt.ylabel("y") plt.title("Line Chart
Example") plt.legend(loc='upper center')
plt.show()

Using Setter Methods


The plot function returns a list of line objects; for example, line, = plot(x, y) returns a single line object and
line1, line2 = plot(x1, y1, x2, y2) returns a list of multiple line objects. Then, using the setter methods of the
line objects, we can define the property that needs to be set. Major setter methods supported by line objects
are set_label(), set_linewidth(), set_linestyle(), set_color(), etc.
Example
import matplotlib.pyplot as plt
x=[1,2,3,4,5,6,7]
y=[3,5,7,9,11,13,15]
line, = plt.plot(x, y)
print(line)
line.set_label("y=2x+1")
line.set_linewidth(4)
line.set_linestyle("-")
line.set_color("green")
plt.xlabel("x")
plt.ylabel("y") plt.title("Line Chart
Example") plt.legend(loc='upper center')
plt.show()
Using setp Command
The setp() function in the pyplot module of the matplotlib library can also be used to set the properties of
line objects. We can either use Python keyword arguments or string/value pairs to set properties of line
objects.
Example
import matplotlib.pyplot as plt
x=[1,2,3,4,5,6,7]
y=[3,5,7,9,11,13,15]
line,=plt.plot(x,y)
plt.setp(line,linewidth=4,linestyle="dashdot",label="y=2x+1",color="red")
plt.xlabel("x")
plt.ylabel("y") plt.title("Line Chart
Example") plt.legend(loc='upper center')
plt.show()

Creating Multiple Plots


One very useful feature of matplotlib is that it makes it easy to plot multiple plots, which can be
compared to each other. In Matplotlib, we can achieve this using the subplot() function. The
subplot() function places a plot within a grid of subplots inside a single figure. We can specify the number of
rows and columns in the grid, as well as the index of the current plot.
plt.subplot(211)
A subplot value of 211 means that the grid has two rows and one column, and that this is the first plot in the grid.
Example
import matplotlib.pyplot as plt
import numpy as np
x = [0, 0.52, 1.04, 1.57, 2.09, 2.62, 3.14]
y = np.sin(x)
plt.subplot(211)
plt.plot(x, y, linestyle="dashed", linewidth=2, label="sin(x)")
plt.xlabel("x")
plt.ylabel("sin(x)")
plt.title("Sin(x) vs. Cos(x) Curve")
plt.legend(loc='upper left')
plt.subplot(212)
y = np.cos(x)
plt.plot(x, y, linestyle="dashed", color="red", linewidth=2, label="cos(x)")
plt.xlabel("x")
plt.ylabel("cos(x)")
plt.legend(loc='upper center')
plt.show()

Playing With Text


The matplotlib.pyplot.text() function is used to add text inside the plot. It adds text at an arbitrary
location of the axes. It also supports mathematical expressions.
import matplotlib.pyplot as plt
import numpy as np

# Generate Data for Parabola
x = np.arange(-20, 21, 1)
y = 2*x**2

# adding text inside the plot
plt.text(-10, 400, 'Parabola Y = 2x^2', fontsize=20)
plt.plot(x, y, color='green')
plt.xlabel("x")
plt.ylabel("y=2x^2")
plt.show()
We can also add mathematical equations as text inside the plot by following LaTeX syntax. This can be
done by enclosing the text in $ symbols.
Example
import matplotlib.pyplot as plt
import numpy as np

# Generate Data for Parabola
x = np.arange(-20, 21, 1)
y = 2*x**2

# adding text with a LaTeX expression inside the plot
plt.text(-10, 400, 'Parabola $Y = 2x^2$', fontsize=20)
plt.plot(x, y, color='green')
plt.xlabel("x")
plt.ylabel("y=2x^2")
plt.show()

We can use the text() method to display text over the columns in a bar chart, placing the text at a
specific location on each bar.
Example
import matplotlib.pyplot as plt
x = ['A', 'B', 'C', 'D', 'E']
y = [1, 3, 2, 5, 4]
percentage = [10, 30, 20, 50, 40]
plt.figure(figsize=(3, 4))
plt.bar(x, y)
for i in range(len(x)):
    plt.text(x[i], y[i], percentage[i])
plt.show()

The annotate() function in pyplot module of matplotlib library is used to annotate the point xy with
specified text. In order to add text annotations to a matplotlib chart we need to set at least, the text, the
coordinates of the plot to be highlighted with an arrow (xy), the coordinates of the text (xytext) and
the properties of the arrow (arrowprops).
Example
import numpy as np
import matplotlib.pyplot as plt
x = np.arange(0, 10, 0.25)
y = np.sin(x)
plt.plot(x, y)
plt.annotate('Minimum', xy=(4.75, -1), xytext=(4.75, 0.2),
             arrowprops=dict(facecolor='black', width=0.2),
             horizontalalignment='center')
plt.show()

Styling Plots
Matplotlib provides a number of built-in stylesheets that change the overall look of plots. These options can
be accessed by executing the command plt.style.available. This gives a list of all the available stylesheet
option names that can be used as an argument to plt.style.use().
Example
import matplotlib.pyplot as plt
ls = plt.style.available
print("Number of Styles:", len(ls))
print("List of Styles:", ls)
ggplot is a popular data visualization package in R programming. It stands for "Grammar of
Graphics plot". To apply ggplot styling to a plot created in Matplotlib, we can use the following
syntax:
plt.style.use('ggplot')
This style adds a light grey background with white gridlines and uses slightly larger axis tick labels.
The statement plt.style.use(‘ggplot’) can be used to apply ggplot styling to any plot in Matplotlib.
Example
from scipy import stats
import matplotlib.pyplot as plt
dist=stats.norm(loc=150,scale=20)
data=dist.rvs(size=1000)
plt.style.use('ggplot')
plt.hist(data,bins=100,color='blue')
plt.show()

The FiveThirtyEight Style is another way of styling plots in matplotlib.pyplot. It is based on the
popular American blog FiveThirtyEight which provides economic, sports, and political analysis.
The FiveThirtyEight stylesheet in Matplotlib has gridlines on the plot area with bold x and y ticks.
The colors of the bars in Bar plot or Lines in the Line chart are usually bright and distinguishable.
Syntax of using the style is as below.
Example
import matplotlib.pyplot as plt
plt.style.use("fivethirtyeight")
a = [2, 3, 4, 3, 4, 5, 3]
b = [4, 5, 5, 7, 9, 8, 6]
plt.figure(figsize = (4,3))
plt.plot(a, marker='o',linewidth=1,color='blue')
plt.plot(b, marker='v',linewidth=1,color='red')
plt.show()

The dark_background stylesheet is a third popular style, based on dark mode. Applying this
stylesheet makes the plot background black and the ticks white, for contrast. In the foreground,
the bars and/or lines are drawn in lighter, grey-based colors to increase the aesthetics and readability of the plot.
Example
import matplotlib.pyplot as plt
plt.style.use("dark_background")
a = [1, 2, 3, 4, 5, 6, 7]
b = [1, 4, 9, 16, 25, 36, 49]
plt.figure(figsize = (4,3))
plt.plot(a, marker='o',linewidth=1,color='blue')
plt.plot(b, marker='v',linewidth=1,color='red')
plt.show()

Box Plots
A box plot is a way to visualize the distribution of data by using a box and some vertical lines; it is
also known as a box-and-whisker plot. The data can be summarized by five key values, which are as
follows:
 Minimum: Q1-1.5*IQR
 1st quartile (Q1): 25th percentile
 Median:50th percentile
 3rd quartile(Q3):75th percentile
 Maximum: Q3+1.5*IQR
Here IQR represents the InterQuartile Range which starts from the first quartile (Q1) and ends at the
third quartile (Q3). Thus, IQR=Q3-Q1.
In the box plot, those points which are out of range are called outliers. We can create the box plot of the
data to determine the following.
 The number of outliers in a dataset
 Is the data skewed or not
 The range of the data
The range of the data from minimum to maximum is called the whisker limit. In Python, we will
use matplotlib's pyplot module, which has an inbuilt function named boxplot() that can create the
box plot of any dataset. Multiple boxes can be created just by sending a list of datasets to the
boxplot() method.
Example
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
dist = stats.norm(100, 30)
data=dist.rvs(size=500)
plt.figure(figsize =(6, 4))
plt.boxplot(data)
plt.show()

Example 2
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
dist = stats.norm(100, 50)
data1=dist.rvs(size=500)
data2=dist.rvs(size=500)
data3=dist.rvs(size=500)
plt.figure(figsize=(6, 4))
plt.boxplot([data1,data2,data3])
plt.show()

Horizontal box plots can be created by setting vert=0 while creating box plots. Boxes in the plot
can be filled by setting patch_artist=True. The boxplot() function returns a Python dictionary with keys
such as boxes, whiskers, fliers, caps, and medians. We can change the properties of these dictionary
objects by calling their set() method.
Example
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
dist = stats.norm(100, 50)
data=dist.rvs(size=500)
plt.figure(figsize=(6, 4))
bp=plt.boxplot(data,vert=0,patch_artist=True)
for b in bp['boxes']:
    b.set(color='blue',facecolor='cyan',linewidth=2)
for w in bp['whiskers']:
    w.set(linestyle='--',linewidth=1, color='green')
for f in bp['fliers']:
    f.set(marker='D', color='black',alpha=1)
for m in bp['medians']:
    m.set(color='yellow',linewidth=2)
for c in bp['caps']:
    c.set(color='red')
plt.show()

Heatmaps
A heatmap (or heat map) is a graphical representation of data where values are depicted by color. A
simple heat map provides an immediate visual summary of information across two axes, allowing
users to quickly grasp the most important or relevant data points. More elaborate heat maps allow the
viewer to understand complex data sets. All heat maps share one thing in common: they use
different colors or different shades of the same color to represent different values and to communicate
the relationships that may exist between the variables plotted on the x-axis and y-axis. Usually, a
darker color or shade represents a higher or greater quantity of the value being represented in the heat
map. For instance, a heat map showing the rain distribution (range of values) of a city grouped by
month may use varying shades of red, yellow and blue. The months may be mapped on the y axis and
the rain ranges on the x axis. The lightest color (i.e., blue) would represent the lower rainfall. In
contrast, yellow and red would represent increasing rainfall values, with red indicating the highest
values.
When using matplotlib we can create a heat map with the imshow() function. In order to create a
default heat map you just need to input an array of m×n dimensions, where the first dimension
defines the rows and the second the columns of the heat map. We can choose different colors for
the heatmap using the cmap parameter, which accepts a Colormap instance or a registered colormap
name. Some of the possible values of cmap are: ‘pink’, ‘spring’, ‘summer’, ‘autumn’, ‘winter’, ‘cool’,
‘Wistia’, ‘hot’, ‘copper’, etc.
Example
import numpy as np
import matplotlib.pyplot as plt
data = np.random.random((12, 12))
plt.imshow(data, cmap='autumn')
plt.title("2-D Heat Map")
plt.show()

Heat maps usually provide a legend, called a color bar, for better interpretation of the colors of the
cells. We can add a color bar to the heatmap using plt.colorbar(). We can also add ticks and labels
to our heatmap using the xticks() and yticks() methods.
Example

import numpy as np
import matplotlib.pyplot as plt
teams = ["A", "B", "C", "D","E", "F", "G"]
year= ["2022", "2021", "2020", "2019", "2018", "2017", "2016"]
games_won = np.array([[82, 63, 83, 92, 70, 45, 64],
[86, 48, 72, 67, 46, 42, 71],
[76, 89, 45, 43, 51, 38, 53],
[54, 56, 78, 76, 72, 80, 65],
[67, 49, 91, 56, 68, 40, 87],
[45, 70, 53, 86, 59, 63, 97],
[97, 67, 62, 90, 67, 78, 39]])
plt.figure(figsize = (4,4))
plt.imshow(games_won,cmap='spring')
plt.colorbar()
plt.xticks(np.arange(len(teams)), labels=teams)
plt.yticks(np.arange(len(year)), labels=year)
plt.title("Games Won By Teams")
plt.show()

A heatmap is also commonly used to plot the correlation between the columns of a dataset, which
shows how strongly the columns are related to one another.
Example
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
df=pd.DataFrame({"x":[2,3,4,5,6],"y":[5,8,9,13,15],"z":[0,4,5,6,7]})
corr=df.corr(method='pearson')
plt.figure(figsize = (4,4))
plt.imshow(corr,cmap='spring')
plt.colorbar()
plt.xticks(np.arange(len(df.columns)), labels=df.columns,rotation=65)
plt.yticks(np.arange(len(df.columns)), labels=df.columns)
plt.show()

Scatter Plots with Histograms


We can combine a simple scatter plot with histograms for each axis. These kinds of plots help us
see the distribution of the values along each axis. Sometimes when we make a scatter plot with a lot of
data points, overplotting can be an issue. Overlapping data points can make it difficult to fully
interpret the data. Having marginal histograms on the side along with the scatter plot can help with
overplotting. To make the simplest marginal plot, we provide the x and y variables to Seaborn’s
jointplot() function.
Example
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn import datasets
df = datasets.load_iris()
df = df.data[:,0:2]
df = pd.DataFrame({'SepalLength': df[:,0], 'SepalWidth': df[:,1]})
sns.jointplot(x="SepalLength", y="SepalWidth", edgecolor="white", data=df)
plt.title("Scatter Plot with Histograms")
plt.show()
The simplest plotting method, JointGrid.plot(), accepts a pair of functions: one for the joint axes
and one for both marginal axes. Some other keyword arguments accepted by the method are listed
below.
 height: Size of each side of the figure in inches (it will be square).
 ratio: Ratio of joint axes height to marginal axes height.
 space: Space between the joint and marginal axes
Example
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn import datasets
df = datasets.load_iris()
df = df.data[:,0:2]
df = pd.DataFrame({'SepalLength': df[:,0], 'SepalWidth': df[:,1]})
g = sns.JointGrid(data=df, x="SepalLength", y="SepalWidth", height=4, ratio=2, space=0)
g.plot(sns.scatterplot, sns.histplot)
plt.show()
Unit 3
Supervised Learning
Machine learning is an application of AI that enables systems to learn and improve from experience
without being explicitly programmed. Machine learning focuses on developing computer programs
that can access data and use it to learn for themselves. The machine learning process begins with
observations or data, such as examples, direct experience or instruction. It looks for patterns in data so
it can later make inferences based on the examples provided. The primary aim of ML is to allow
computers to learn autonomously without human intervention or assistance and adjust actions
accordingly. The learning system of a machine learning algorithm can be broken down into three main
parts.
 Decision Process: In general, machine learning algorithms are used to make a prediction or
classification. Based on some input data, which can be labelled or unlabeled, ML algorithm
will produce an estimate about a pattern in the data.
 Error Function: An error function serves to evaluate the prediction of the model. If there are
known examples, an error function can make comparison to assess the accuracy of the model.
 Model Optimization Process: If the model does not fit the data points in the training set well,
then its parameters are adjusted to reduce the discrepancy between the known examples and the
model estimate. The algorithm repeats this evaluate-and-optimize process until a threshold of
accuracy has been met.
Types of Machine Learning
Machine learning methods fall into three primary categories: Supervised, Unsupervised, and
Reinforcement.
Supervised Learning
In this learning paradigm, we present examples of correct input-output pairs to the ML algorithms
during the training phase. This training set of examples is equivalent to the teacher for the ML
algorithms. During training under supervised learning, the ML algorithm takes an input vector
and computes an output vector. An error signal is generated if there is a difference between the computed
output and the desired output vector. On the basis of this error signal, the model parameters are adjusted
until the actual output matches the desired output. Supervised machine learning is used for
performing tasks like regression and classification. The Naïve Bayes classifier, logistic regression, and
decision tree are examples of classification algorithms. Linear regression and multiple regression
are examples of regression algorithms.

Unsupervised Learning
In unsupervised learning, the ML algorithm is provided with a dataset without the desired output. The ML
algorithm then attempts to find patterns in the data by extracting useful features and analyzing its
structure. Unsupervised learning algorithms are widely used for tasks like: clustering, dimensionality
reduction, association mining etc. K-Means algorithm, K-Medoid algorithm, Agglomerative algorithm
etc. are examples of clustering algorithms.

Reinforcement Learning
In reinforcement learning, we do not provide the machine with examples of correct input-output
pairs, but we do provide a method for the machine to quantify its performance in the form of a
reward signal. Reinforcement learning methods resemble how humans and animals learn: the
machine tries a bunch of different things and is rewarded with a performance signal. Reinforcement
learning algorithms are widely used for training agents that interact with their environments.

Classification vs. Prediction


Classification and prediction are two major categories of prediction problems which are usually
dealt with in data mining and machine learning. The terms prediction and regression are used
synonymously in data mining. Both of them are supervised learning approaches. Classification is
the process of finding or discovering a model or function which helps to predict class label for a
given data. Prediction is the process of finding a model or function which is used to predict
continuous real-valued output for a given data.
For example, we can build a classification model to categorize bank loan applications as either safe or
risky. We can also construct a classification model to identify digits. On the other hand, we can build a
regression model to predict the expenditures of potential customers on computer equipment given
their income and occupation. We can also build a prediction model to predict stock price given
historical trading data.
Working of Classification Algorithms
The classification process works in the following two steps: the learning step and the testing step.
 Learning Step: This step is also called the training step. In this step the learning algorithm builds a
model on the basis of the relationship between inputs and outputs in the training dataset. This dataset
contains input attributes along with a class label for every input tuple. Because the class label of
each training tuple is provided, this step is also known as supervised learning.
 Testing Step: In this step, the model is used for prediction. Here the test dataset is
used to estimate the accuracy of the model. This dataset contains values of the input attributes
along with the class label of the output attribute. However, the model only takes values of the input
attributes and predicts the class label of each input tuple. Then, the accuracy of the model is
computed by comparing the predicted class labels with the actual class labels of the test dataset. The
model can be applied to new data tuples if the accuracy is considered acceptable. A minimal
sketch of these two steps is given after the list.

Linear Regression
Regression analysis is the process of curve fitting in which the relationship between the independent
variables and the dependent variable is modeled as an mth degree polynomial. Polynomial regression
models are usually fit with the method of least mean squares (LMS). If we assume that the relationship
is a linear one with only one independent variable, then we can use the linear equation given below.

𝑦 = 𝑓(𝑥) = 𝑤0 + 𝑤1𝑥

In the above equation, y is the dependent variable and x is the independent variable; w0 and w1 are
coefficients that need to be determined through training of the model. If we have two independent
variables, the linear regression equation can be written as below.

𝑦 = 𝑓(𝑥) = 𝑤0 + 𝑤1𝑥1 + 𝑤2𝑥2

We use linear_model.LinearRegression() to create a linear regression object. We then use the fit()
method to train the linear regression model. This method takes dependent and independent variables as
input. Finally, we predict values of dependent variable by providing values of independent variable as
input.
Example
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([2,3,5,8,9,11,15,12,19,17])
y = np.array([5,7,11,17,19,23,31,25,39,35])
test=np.array([4,10,13])

#reshaping data in column vector form
x=x.reshape((len(x),1))
test=test.reshape((len(test),1))

lr=LinearRegression()
lr.fit(x,y)
pred=lr.predict(test)
print("Test Data:",test)
print("Predicted Values:",pred)
Logistic Regression
Logistic regression is one of the most popular machine learning algorithms for binary classification.

This is because it is a simple algorithm that performs very well on a wide range of problems. We
want to predict a variable ŷ ∈ {0,1}, where 0 is called the negative class, while 1 is called the positive
class. Such a task is known as binary classification. The heart of the logistic regression technique is
the logistic function, defined as given in equation (1). The logistic function transforms the input into
the range [0, 1]. The smallest negative numbers result in values close to zero and the largest positive
numbers result in values close to one.

f(x) = 1 / (1 + e^(-x))                                (1)

If there are two input variables, logistic regression has two coefficients just like linear regression.

y = w0 + w1x1 + w2x2

Unlike linear regression, the output is transformed into a probability using the logistic function.

ŷ = σ(y) = 1 / (1 + e^(-y))
If the probability is > 0.5 we can take the output as a prediction for the class 1, otherwise the prediction
is for the class 0. The job of the learning algorithm will be to discover the best values for the
coefficients (w0, w1, and w2) based on the training data.
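To make the computation concrete, the following sketch evaluates a logistic regression prediction by hand. The coefficient values w0, w1, and w2 are assumed purely for illustration; in practice they are learned from the training data.
import numpy as np

#Assumed (illustrative) coefficient values
w0, w1, w2 = -4.0, 1.5, 0.75

def predict(x1, x2):
    y = w0 + w1*x1 + w2*x2            #linear combination of the inputs
    prob = 1.0/(1.0 + np.exp(-y))     #logistic function maps y into [0, 1]
    return prob, 1 if prob > 0.5 else 0

prob, label = predict(2.0, 3.0)
print("Probability:", prob, "Predicted Class:", label)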
We use linear_model.LogisticRegression() to create a logistic regression object. We then use the fit()
method to train the logistic regression model. This method takes target and independent variables as
input. Finally, we predict values of target variable by providing values of independent variable as input.
Example
import numpy as np
from sklearn.linear_model import LogisticRegression

x = np.array([3.78, 2.44, 2.09, 0.14, 1.72, 1.65, 4.92, 4.37, 4.96, 4.52, 3.69, 5.88])
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
test=np.array([2.93,1.86,5.24,6.32])

#reshaping data in column vector form
x=x.reshape((len(x),1))
test=test.reshape((len(test),1))

lr=LogisticRegression()
lr.fit(x,y)
pred=lr.predict(test)
print("Test Data:",test)
print("Predicted Values:",pred)

Naïve Bayes Classification


Naïve Bayes classification is also called Bayesian classification. It is a classification technique based
on Bayes’ Theorem, which assumes that the input features of the model are independent of each other.
Bayes’ theorem is stated as below.
P(H|X) = P(X|H)P(H) / P(X)
Bayes’ theorem is useful in that it provides a way of calculating the posterior probability, P(H|X),
from P(H), P(X|H), and P(X). Here P(X) and P(H) are prior probabilities and P(X|H) is the likelihood.
Let D be a database and C1, C2, ..., Cm be m classes. Now the above Bayes rule can be written as
below.
P(Ci|X) = P(X|Ci)P(Ci) / P(X)
Given a tuple, X, the classifier will predict that X belongs to the class having the highest posterior
probability, conditioned on X. Thus we need to maximize P(Ci|X). As P(X) is constant for all
classes, only P(X|Ci)P(Ci) needs to be maximized. Let X be the set of attributes {x1, x2, x3, ..., xn},
where the attributes are independent of one another. Now the probability P(X|Ci) is given by the
equation below.
P(X|Ci) = ∏ (k=1 to n) P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × ... × P(xn|Ci)
In Python, the GaussianNB class of the sklearn.naive_bayes module is used to create a Naïve Bayes
classifier instance and perform classification using the model. Consider the dataset given below.

The example given below creates a Naïve Bayes classifier model using the above training data and then
predicts the class label of the tuple: X = (age = youth, income = medium, student = yes, credit_rating =
fair) using the model.
Example
from sklearn.naive_bayes import GaussianNB
from sklearn import preprocessing
import pandas as pd

Age=['Youth','Youth','Middle_Aged','Senior','Senior','Senior','Middle_Aged','Youth','Youth','Senior','Youth','Middle_Aged','Middle_Aged','Senior','Youth']
Income=['High','High','High','Medium','Low','Low','Low','Medium','Low','Medium','Medium','Medium','High','Medium','Medium']
Student=['No','No','No','No','Yes','Yes','Yes','No','Yes','Yes','Yes','No','Yes','No','Yes']
Credit_Rating=['Fair','Excellent','Fair','Fair','Fair','Excellent','Excellent','Fair','Fair','Fair','Excellent','Excellent','Fair','Excellent','Fair']
Buys=['No','No','Yes','Yes','Yes','No','Yes','No','Yes','Yes','Yes','Yes','Yes','No','?']

le = preprocessing.LabelEncoder()
a=list(le.fit_transform(Age))
i=list(le.fit_transform(Income))
s=list(le.fit_transform(Student))
cr=list(le.fit_transform(Credit_Rating))
b=list(le.fit_transform(Buys))
d={'Age':a,'Income':i,'Student':s,'Credit_Rating':cr,'Buys_Computer':b}
df=pd.DataFrame(d)
print(df)

x=df[['Age','Income','Student','Credit_Rating']]
y=df['Buys_Computer']
trainx=x[0:14]
trainy=y[0:14]
testx=x[14:15]

model = GaussianNB()
model.fit(trainx,trainy)
predicted = model.predict(testx)
if(predicted==1):
    pred='No'
else:
    pred='Yes'
print("Predicted Value:", pred)

Decision Tree Classifier


The decision tree classification algorithm is the learning of decision trees from class-labeled training tuples.
A decision tree is a flowchart-like tree structure where internal nodes (non-leaf nodes) denote a test on an
attribute, branches represent outcomes of the tests, and leaf nodes (terminal nodes) hold class labels.

Once the decision tree is learned, in order to make prediction for a tuple, the attributes of a tuple are
tested against the decision tree. A path is traced from the root to a leaf node which determines the
predicted class for that tuple. Constructing a decision tree uses a greedy algorithm. The tree is
constructed in a top-down, divide-and-conquer manner. A high-level algorithm for decision tree
construction is presented below.
1. At start, all the training tuples are at the root
2. Tuples are partitioned recursively based on selected attributes
3. If all samples for a given node belong to the same class
• Label the class
4. Else if there are no remaining attributes for further partitioning
• Majority voting is employed for assigning class label to the leaf
5. Else
• Go to step 2

There are many variations of decision-tree algorithms. Some of them are: ID3 (Iterative Dichotomiser
3), C4.5 (successor of ID3), CART (Classification and Regression Tree) etc. There are different attribute
selection measures used by decision tree classifiers. Some of them are: Information Gain, Gain Ratio,
Gini Index etc. ID3 stands for Iterative Dichotomiser 3. It uses a top-down greedy approach to build
the decision tree model. This algorithm computes the information gain for each attribute and then selects
the attribute with the highest information gain. Information gain measures the reduction in entropy after a data
transformation and is calculated by comparing the entropy of the dataset before and after the transformation.
Entropy is a measure of the homogeneity of the sample. The entropy, or expected information, of dataset D is
calculated using equation (1) given below.

E(D) = − Σ (i=1 to m) pi log2(pi)                                  (1)

Where pi is the probability of a tuple in D belonging to class Ci and is estimated using
equation (2).

pi = |Ci,D| / |D|                                                  (2)

Where |Ci,D| is the number of tuples in D belonging to class Ci and |D| is the number
of tuples in D.
Suppose we have to partition the tuples in D on some attribute A having v distinct values. The
attribute A can be used to split D into v partitions {D1, D2, ..., Dv}. Now, the total entropy of the data
partitions while partitioning D around attribute A is calculated using equation (3).

EA(D) = Σ (j=1 to v) (|Dj| / |D|) × E(Dj)                          (3)

Finally, the information gain achieved after partitioning D on attribute A is calculated using
equation (4).

IG(A) = E(D) − EA(D)                                               (4)
In Python, we create an instance of the DecisionTreeClassifier class from the sklearn.tree module. The instance can
be trained using the training dataset to learn a predictive model in the form of a decision tree structure.
The model can then be used to predict class labels for new data tuples. We have to supply the
parameter value criterion="entropy" to create an instance of an ID3-style decision tree.
Example
Use the dataset given below to train ID3 decision tree classifier and predict class label for the input
tuple {Outlook=Sunny, Temperature=Hot, Humidity=Normal, Windy=Strong}.
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder
import pandas as pd

outlook=['Sunny','Sunny','Overcast','Rainy','Rainy','Rainy','Overcast','Sunny','Sunny','Rainy','Sunny','Overcast','Overcast','Rainy','Sunny']
temp=['Hot','Hot','Hot','Mild','Cool','Cool','Cool','Mild','Cool','Mild','Mild','Mild','Hot','Mild','Hot']
humidity=['High','High','High','High','Normal','Normal','Normal','High','Normal','Normal','Normal','High','Normal','High','Normal']
wind=['Weak','Strong','Weak','Weak','Weak','Strong','Strong','Weak','Weak','Weak','Strong','Strong','Weak','Strong','Strong']
play=['No','No','Yes','Yes','Yes','No','Yes','No','Yes','Yes','Yes','Yes','Yes','No','?']

d={'Outlook':outlook,'Temperature':temp,'Humidity':humidity,'Windy':wind,'Play_Tennis':play}
df=pd.DataFrame(d)

Le = LabelEncoder()
df['Outlook'] = Le.fit_transform(df['Outlook'])
df['Temperature'] = Le.fit_transform(df['Temperature'])
df['Humidity'] = Le.fit_transform(df['Humidity'])
df['Windy'] = Le.fit_transform(df['Windy'])
df['Play_Tennis'] = Le.fit_transform(df['Play_Tennis'])
print(df)

x=df[['Outlook','Temperature','Humidity','Windy']]
y=df['Play_Tennis']
trainx,trainy=x[0:14],y[0:14]
testx,testy=x[14:15],y[14:15]

dt = DecisionTreeClassifier(criterion='entropy')
dt.fit(trainx,trainy)
p = dt.predict(testx)
p = Le.inverse_transform(p)
print("Predicted Label:",p)
Unit 4
Unsupervised Learning
Clustering
The process of grouping a set of physical or abstract objects into classes of similar objects is called
clustering. It is an unsupervised learning technique. A cluster is a collection of data objects that are
similar to one another within the same cluster and are dissimilar to the objects in other clusters.
Clustering can also be used for outlier detection, where outliers may be more interesting than common
cases. Applications of outlier detection include the detection of credit card fraud and the monitoring of
criminal activities in electronic commerce. For example, exceptional cases in credit card transactions,
such as very expensive and frequent purchases, may be of interest as possible fraudulent activity.
Categories of Clustering Algorithms
Many clustering algorithms exist in the literature. In general, the major clustering methods can be
classified into the following categories.
1. Partitioning methods: Given a database of n objects or data tuples, a partitioning method
constructs k partitions of the data, where each partition represents a cluster and k <n. Given
k, the number of partitions to construct, a partitioning method creates an initial partitioning. It
then uses an iterative relocation technique that attempts to improve the partitioning by moving
objects from one group to another.
2. Hierarchical methods: A hierarchical method creates a hierarchical decomposition of the
given set of data objects. A hierarchical method can be classified as being either
agglomerative or divisive. The agglomerative approach follows the bottom-up approach. It
starts with each object forming a separate group. It successively merges the objects or groups
that are close to one another, until a termination condition holds. The divisive approach follows
the top-down approach. It starts with all of the objects in the same cluster. In each successive
iteration, a cluster is split up into smaller clusters, until a termination condition holds.
3. Density-based methods: Most partitioning methods cluster objects based on the distance
between objects. Such methods can find only spherical-shaped clusters and encounter difficulty
at discovering clusters of arbitrary shapes. Other clustering methods have been developed
based on the notion of density. Their general idea is to continue growing the given cluster as
long as the density (number of objects or data points) in the neighborhood exceeds some
threshold.
4. Model-based methods: Model-based methods hypothesize a model for each of the clusters and
find the best fit of the data to the given model. EM is an algorithm that performs
expectation-maximization analysis based on statistical modeling.
Measures of Similarity
Distance measures are used in order to find similarity or dissimilarity between data objects. The
most popular distance measure is Euclidean distance, which is defined as below.
d(x, y) = √((x2 − x1)² + (y2 − y1)²)

Where x = (x1, y1) and y = (x2, y2) are two data points.
Another well-known metric is Manhattan (or city block) distance, defined as below.

d(x, y) = |x2 − x1| + |y2 − y1|

Minkowski distance is a generalization of both Euclidean distance and Manhattan distance. It is
defined as below.

d(x, y) = (|x2 − x1|^p + |y2 − y1|^p)^(1/p)

Where p is a positive integer; such a distance is also called the Lp norm in some literature.
It represents the Manhattan distance when p = 1 (i.e., the L1 norm) and the Euclidean distance when
p = 2 (i.e., the L2 norm).
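These three measures can be computed directly with NumPy, as in the short sketch below; the two points are chosen only for illustration.
import numpy as np

x = np.array([2, 10])
y = np.array([5, 8])

euclidean = np.sqrt(np.sum((x - y)**2))            #L2 norm
manhattan = np.sum(np.abs(x - y))                  #L1 norm
p = 3
minkowski = np.sum(np.abs(x - y)**p)**(1/p)        #Lp norm

print("Euclidean:", euclidean)
print("Manhattan:", manhattan)
print("Minkowski (p=3):", minkowski)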
K-Means Algorithm
K-Means is one of the simplest partitioning-based clustering algorithms. The procedure follows a
simple and easy way to group a given data set into a certain number of clusters (assume k clusters)
fixed a priori. The main idea is to define k centers, one for each cluster. These centers should be
selected cleverly because different locations cause different results. So, the better choice is to
place them as far away from each other as possible.
Algorithm
Let X = {x1, x2, x3, ..., xn} be the set of data points and C = {c1, c2, ..., ck} be the cluster
centers.
1. Select k cluster centers randomly
2. Calculate the distance between each data point and cluster centers.
3. Assign the data point to the cluster which is closest to the data point.
4. If No data is reassigned
• Display Clusters
• Terminate
5. Else
• Calculate centroid of each cluster and set cluster centers to centroids.
• Go to step 2
Numerical Example
Divide the data points {(2,10), (2,5), (8,4), (5,8), (7,5), (6,4)} into two clusters.
Solution
Let p1=(2,10) p2=(2,5) p3=(8,4) p4=(5,8) p5=(7,5) p6=(6,4)
Initial step
Choose Cluster centers randomly
Let c1=(2,5) and c2=(6,4) are two initial cluster centers.
Iteration 1
Calculate distance between clusters centers and each data points d(c1,p1)=5
d(c2,p1)=7.21
d(c1,p2)=0 d(c2,p2)=4.12
d(c1,p3)=6.08 d(c2,p3)=2
d(c1,p4)=4.24 d(c2,p4)=4.12
d(c1,p5)=5 d(c2,p5)=1.41
d(c1,p6)=4.12 d(c2,p6)=0
Thus, Cluster1={p1,p2} cluster2={p3,p4,p5,p6}
Iteration 2
New Cluster centers: c1=(2,7.5) c2=(6.5,5.25)
Again, Calculate distance between clusters centers and each data points
d(c1,p1)=2.5 d(c2,p1)=6.54
d(c1,p2)=2.5 d(c2,p2)=4.51
d(c1,p3)=6.95 d(c2,p3)=1.95
d(c1,p4)=3.04 d(c2,p4)=3.13
d(c1,p5)=4.59 d(c2,p5)=0.56
d(c1,p6)=5.32 d(c2,p6)=1.35
Thus, Cluster1={p1,p2,p4} cluster2={p3,p5,p6}
Iteration 3
New Cluster centers: c1=(3,7.67) c2=(7,4.33)
Again, Calculate distance between clusters centers and each data points d(c1,p1)=2.54
d(c2,p1)=7.56
d(c1,p2)=2.85 d(c2,p2)=5.04
d(c1,p3)=6.2 d(c2,p3)=1.05
d(c1,p4)=2.03 d(c2,p4)=4.18
d(c1,p5)=4.81 d(c2,p5)=0.67
d(c1,p6)=4.74 d(c2,p6)=1.05
Thus, Cluster1={p1,p2,p4} cluster2={p3,p5,p6} No data points are re-
assigned
Thus, final clusters are: Cluster1={p1,p2,p4} cluster2={p3,p5,p6}
In Python, the KMeans class from the sklearn.cluster module is used to create an instance of the K-Means
algorithm. Two major parameters of this method are n_clusters and init. The parameter n_clusters is
used to specify the value of k (number of clusters) and the init parameter is used to specify the
initialization method; init='random' selects random initial centers. Once the instance of KMeans is created,
the fit() method is used to compute the clusters by the K-Means algorithm. This method accepts the dataset as an input
argument and stores the final cluster centers in the cluster_centers_ variable and the cluster labels of the dataset
in the labels_ variable.
Example
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

data=[(2,8),(3,2),(1,4),(4,6),(3,5),(2,3),(5,7),(4,8),(4,2),(1,3),(4,5),(3,4),(7,4),(2,1)]
km=KMeans(n_clusters=2,init='random')
km.fit(data)
centers = km.cluster_centers_
labels = km.labels_
print("Cluster Centers:",*centers)
print("Cluster Labels:",*labels)

#Displaying clusters
cluster1=[]
cluster2=[]
for i in range(len(labels)):
    if (labels[i]==0):
        cluster1.append(data[i])
    else:
        cluster2.append(data[i])
print("Cluster 1:",cluster1)
print("Cluster 2:",cluster2)

KMedoid Clustering Algorithm


K-Medoids is also a partitioning-based clustering algorithm. It is also called the partitioning around
medoids (PAM) algorithm. A medoid can be defined as the point in the cluster whose dissimilarity
with all the other points in the cluster is minimum. It majorly differs from the K-Means algorithm in
terms of the way it selects the cluster centers. The K-Means algorithm selects the average of a cluster’s
points as its center, whereas the K-Medoids algorithm always picks actual data points from the
clusters as their centers.
The K-Medoids algorithm selects k medoids (cluster centers) randomly and swaps each medoid with each
non-medoid data point. The swap is accepted only when the total cost is decreased. The total cost is the sum of
the distances from all the data points to their medoids and is calculated as below.

C = Σ (over all medoids mi) Σ (over all points pi assigned to mi) |pi − mi|

Where mi is a medoid point and pi is a non-medoid data point assigned to mi.
Algorithm
1. Select k medoids randomly.
2. Assign each data point to the closest medoid.
3. Compute Total cost
4. For each medoid m
For each non-medoid point p
 Swap m and p
 Assign each data point to the closest medoid
 Compute total cost
 If the total cost is more than that in the previous step
 Undo the swap.
5. Display clusters

Numerical Example
Divide the data points {(2,10), (2,5), (8,4), (5,8), (7,5), (6,4)} into two clusters.
Solution
Let p1=(2,10) p2=(2,5) p3=(8,4) p4=(5,8) p5=(7,5)
p6=(6,4)
Initial step
Let m1=(2,5) and m2=(6,4) are two initial cluster centers (medoid).
Iteration 1
Calculate distance between medoids and each data points d(m1,p1)=5
d(m2,p1)=10
d(m1,p2)=0 d(m2,p2)=5
d(m1,p3)=7 d(m2,p3)=2
d(m1,p4)=6 d(m2,p4)=5
d(m1,p5)=5 d(m2,p5)=2
d(m1,p6)=5 d(m2,p6)=0
Thus, Cluster1={p1,p2} Cluster2={p3,p4,p5,p6}
Total Cost=5+0+2+5+2+0=14
Iteration 2:
Swap m1 with p1, m1 =(2,10) m2=(6,4)
Calculate distance between medoids and each data points d(m1,p1)=0
d(m2,p1)=10
d(m1,p2)=5 d(m2,p2)=5
d(m1,p3)=12 d(m2,p3)=2
d(m1,p4)=5 d(m2,p4)=5
d(m1,p5)=10 d(m2,p5)=2
d(m1,p6)=10 d(m2,p6)=0
Thus, Cluster1={p1,p2,p4} cluster2={p3,p5,p6} Total
Cost=0+5+2+5+2+0=14
Iteration 3:
Swap m1 with p3, m1 =(8,4) m2=(6,4)
Calculate distance between medoids and each data points d(m1,p1)=12
d(m2,p1)=10
d(m1,p2)=7 d(m2,p2)=5
d(m1,p3)=0 d(m2,p3)=2
d(m1,p4)=7 d(m2,p4)=5
d(m1,p5)=2 d(m2,p5)=2
d(m1,p6)=2 d(m2,p6)=0
Thus, Cluster1={p3,p5} Cluster2={p1,p2,p4,p6}
Total Cost=10+5+0+5+2+0=22 => Undo the swap
Continue this process…
In Python, the KMedoids class from the sklearn_extra.cluster module is used to create an instance of the K-Medoids
algorithm. This module does not come with the default Python installation, so we may need to
install it first. One of the major parameters of this method is n_clusters, which is used to specify the
value of k (number of clusters).
Once the instance of KMedoids is created, like KMeans, the fit() method is used to compute the clusters by the
K-Medoids algorithm. This method accepts the dataset as an input argument and stores the final cluster centers
in the cluster_centers_ variable and the cluster labels of the dataset in the labels_ variable.
Example
import numpy as np
import matplotlib.pyplot as plt
from sklearn_extra.cluster import KMedoids

data=[(2,8),(3,2),(1,4),(4,6),(3,5),(2,3),(5,7),(4,8),(4,2),(1,3),(4,5),(3,4),(7,4),(2,1)]
km=KMedoids(n_clusters=2)
km.fit(data)
centers = km.cluster_centers_
labels = km.labels_
print("Cluster Centers:",*centers)
print("Cluster Labels:",*labels)

#Displaying clusters
cluster1=[]
cluster2=[]
for i in range(len(labels)):
    if (labels[i]==0):
        cluster1.append(data[i])
    else:
        cluster2.append(data[i])
print("Cluster 1:",cluster1)
print("Cluster 2:",cluster2)

Hierarchical Clustering
In data mining and statistics, hierarchical clustering (also called hierarchical cluster analysis or HCA)
is a method of cluster analysis which seeks to build a hierarchy of clusters. Strategies for hierarchical
clustering generally fall into two types: agglomerative and divisive. Agglomerative clustering is a
bottom-up approach. Initially, each observation is considered a separate cluster, and pairs of
clusters are merged as one moves up the hierarchy. This process continues until a single cluster or
the required number of clusters is formed. A distance matrix is used for deciding which clusters to
merge.

A cluster hierarchy can also be generated top-down. This variant of hierarchical clustering is called top-
down clustering or divisive clustering. We start at the top with all data in one cluster. The cluster is split
into two clusters such that the objects in one subgroup are far from the objects in the other. This procedure is
applied recursively until the required number of clusters is formed. This method is not considered
attractive because there exist O(2^n) ways of splitting each cluster.

Agglomerative Clustering Algorithm


1. Compute the distance matrix between the input data points
2. Let each data point be a cluster
3. Repeat steps 4 and 5 until only K clusters remain
4. Merge the two closest clusters
5. Update the distance matrix
Example
Cluster the data points (1,1), (1.5,1.5), (5,5), (3,4), (4,4), (3, 3.5) into two clusters.
Solution
A=(1,1), B= (1.5,1.5), C=(5,5), D=(3,4), E=(4,4), F=(3,3.5)
Distance Matrix

The closest clusters are {F} and {D} with the shortest distance of 0.5. Thus, we group clusters
D and F into a single cluster {D, F}.
Update the Distance Matrix
We can see that the distance between cluster {B} and cluster {A} is minimum with distance
0.71. Thus, we group cluster {A} and cluster {B} into a single cluster named {A, B}.

Updated Distance Matrix

We can see that the distance between clusters {E} and cluster {D, F} is minimum with distance
1.00. Thus, we group them together into cluster {D, E, F}.
Updated Distance Matrix

After that, we merge cluster {D, E, F} and cluster {C} into a new cluster {C, D, E, F} because
cluster {D, E, F} and cluster {C} are the closest clusters with distance 1.41.
Updated Distance Matrix

Now, we have only two clusters.


Thus, Final clusters are: {A, B} and {C, D, E, F}
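The merging sequence obtained above can also be visualized as a dendrogram. The following is a minimal sketch using SciPy's linkage() and dendrogram() functions; it assumes single linkage, which merges the pair of clusters whose closest members are nearest, and reuses the six points from this example.
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

data = [(1,1), (1.5,1.5), (5,5), (3,4), (4,4), (3,3.5)]
Z = linkage(data, method='single')   #merge history of the agglomerative process
dendrogram(Z, labels=['A','B','C','D','E','F'])
plt.title("Dendrogram of the Sample Points")
plt.show()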
In Python, the AgglomerativeClustering class from the sklearn.cluster module is used to create an instance of
the agglomerative clustering algorithm. One of the major parameters of this method is n_clusters, which
is used to specify the value of k (number of clusters). Once the instance of
AgglomerativeClustering is created, the fit() method is used to compute the clusters by the agglomerative
clustering algorithm. This method accepts the dataset as an input argument and stores the cluster labels of the
dataset in the labels_ variable.
Example

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering

data=[(2,8),(3,2),(1,4),(4,6),(3,5),(2,3),(5,7),(4,8),(4,2),(1,3),(4,5),(3,4),(7,4),(2,1)]
ac=AgglomerativeClustering(n_clusters=2)
ac.fit(data)
labels = ac.labels_
print("Cluster Labels:",*labels)

#Displaying clusters
cluster1=[]
cluster2=[]
for i in range(len(labels)):
    if (labels[i]==0):
        cluster1.append(data[i])
    else:
        cluster2.append(data[i])
print("Cluster 1:",cluster1)
print("Cluster 2:",cluster2)
Unit 5
Text Mining and Big Data
Text Preprocessing
Text mining (also known as text analysis), is the process of transforming unstructured text into
structured data for easy analysis. Text mining uses natural language processing (NLP), allowing
machines to understand the human language and process it automatically. Natural Language
Toolkit (NLTK) package of Python is widely used for text mining.
To prepare the text data for the model building we perform text preprocessing. It is the very first
step of NLP projects. Some of the preprocessing steps are: Tokenization, Lower casing, Removing
punctuations, Removing URLs, Removing Numbers, Removing Stop words, Removing HTML Tags etc.
Tokenization
Tokenization is nothing but splitting the raw text into small chunks of words or sentences, called
tokens. If the text is split into words, then it is called as Word Tokenization and if it's split into
sentences then it is called as Sentence Tokenization. Generally white space character is used to
perform the word tokenization and characters like periods, exclamation point, question mark and
newline character are used for Sentence Tokenization.
Example: Word Tokenization
text = """There are multiple ways we can perform tokenization on given text data. We
can choose any method based on langauge, library and purpose of modeling."""
# Split text by whitespace tokens = text.split()
print(tokens)

Example 2: Sentence Tokenization


text = """A regular expression is a sequence of characters that define a search
pattern.Using Regular expression we can match character combinations in string and
perform word/sentence tokenization.""" sentences=text.split(".")
print(sentences)Natural Language Toolkit (NLTK) is library written in python for natural language processing.
NLTK has module word_tokenize() for word tokenization and sent_tokenize() for sentence tokenization.
Example
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize

text = """Characters like periods, exclamation point and newline char are used to
separate the sentences. But one drawback with split() method, that we can only use
one separator at a time! So sentence tokenization wont be foolproof with split()
method."""
tokens = word_tokenize(text)
print("Words as Tokens")
print(tokens)
tokens = sent_tokenize(text)
print("Sentences as Tokens")
print(tokens)

Lower Casing
Lower casing is a common text preprocessing technique. The idea is to convert the input text into same
casing format so that 'test', 'Test' and 'TEST' are treated the same way. This is more helpful for text
featurization techniques like frequency, TF-IDF as it helps to combine the same words together thereby
reducing the duplication and getting correct counts/TF-IDF values. We can convert text to lower case simply
by calling the string's lower() method.
Example
import numpy as np
import pandas as pd

df = pd.read_csv("/content/drive/My Drive/sample.csv")
df = df[["text"]]
text=df["text"][0]
print("Original Text")
print(text)
text=text.lower()
df["text"][0]=text
print("After Converting into Lower Case")
print(text)

Removal of Punctuations
Another common text preprocessing technique is to remove the punctuations from the text data. This is
again a text standardization process that will help to treat 'hurray' and 'hurray!' in the same way.
We also need to carefully choose the list of punctuations to exclude depending on the use case. For
example, the string.punctuation in python contains the following punctuation symbols: !"#$
%&\'()*+,-./:;<=>?@[\\]^_{|}~`.We can add or remove more punctuations as per our need.
Example
import pandas as pd
import string

df = pd.read_csv("/content/drive/My Drive/sample.csv")
df = df[["text"]]
text=df["text"][2]
print("Original Data")
print(text)
ps=string.punctuation
print("Punctuation Symbols:",ps)
new_text=""
for c in text:
    if c not in ps:
        new_text=new_text+c
df["text"][2]=new_text
print("After Removal of Punctuation Symbols")
print(new_text)

Removing Numbers
Sometimes words and digits are written combined in the text, which creates a problem for
machines to understand. Hence, we need to remove digits and words combined with digits, like
game57 or game5ts7. This type of word is difficult to process, so it is better to remove it or replace
it with an empty string. We can replace digits and words containing digits by using the sub()
method of the re module. The syntax of the method is given below.
re.sub(pat, replacement, str)
This function searches for the specified pattern in the given string and replaces the matches with the
specified replacement.
Example: Removing digits

import pandas as pd
import re

df = pd.read_csv("/content/drive/My Drive/sample.csv")
df = df[["text"]]
text=df["text"][7]
print("Original Data")
print(text)
text=re.sub("[0-9]","",text)
df["text"][7]=text
print("After Removal of Digits")
print(text)
Example 2: Removing Words Containing Digits
import pandas as pd
import re

df = pd.read_csv("/content/drive/My Drive/sample.csv")
df = df[["text"]]
text=df["text"][7]
print("Original data")
print(text)
toks=text.split()
new_toks=[]
for w in toks:
    w=re.sub(".*[0-9].*","",w)
    new_toks.append(w)
text=" ".join(new_toks)
df["text"][7]=text
print("After Removal of Words Containing Digits")
print(text)

Removing Stop Words


Stop words are commonly occurring words in a language like 'the', 'a' and so on. They can be
removed from the text most of the time, as they don't provide valuable information for downstream
analysis. In cases like Part of Speech (POS) tagging, we should not remove them, as they provide very
valuable information about the POS.
These stop word lists are already compiled for different languages and we can safely use them. For
example, the stop word list for English language from the NLTK package can be displayed and
removed from text data as below.
Example
from nltk.corpus import stopwords
import numpy as np
import pandas as pd

df = pd.read_csv("/content/drive/My Drive/sample.csv")
df = df[["text"]]
text=df["text"][0]
text=text.lower()
print("Original Text")
print(text)
sw=stopwords.words('english')
print("List of stop words:",sw)
tokens=text.split()
new_tokens=[w for w in tokens if w not in sw]
text=" ".join(new_tokens)
df["text"]=text
print("After Removal of stop words")
print(text)

Removing URLs
Next preprocessing step is to remove any URLs present in the text data. If we scraped text data
from web, then there is a good chance that the text data will have some URL in it. We might need
to remove them for our further analysis. We can also replace URLs from the text by using sub()
method of re module.
Example
import pandas as pd
import re

df = pd.read_csv("/content/drive/My Drive/sample.csv")
df = df[["text"]]
text=df["text"][7]
print("Original Text")
print(text)
toks=text.split()
new_toks=[]
for t in toks:
    #\S represents any character except white space characters
    t=re.sub("https?://\S+|www\.\S+","",t)
    new_toks.append(t)
text=" ".join(new_toks)
df["text"][7]=text
print("After Removal of URLs")
print(text)
Removal of HTML Tags
One other common preprocessing technique that comes in handy in multiple places is the removal of
HTML tags. This is especially useful if we scrape data from different websites, as we might end up
having HTML strings as part of our text. We can remove the HTML tags using regular expressions.
Example
import pandas as pd
import re

text="The HTML <b> element defines bold text, without any extra importance."
print("Original Text Data")
print(text)
new_toks=[]
tokens=text.split()
for t in tokens:
    t=re.sub("<.*>","",t)
    new_toks.append(t)
text=" ".join(new_toks)
print("Text Data After Removal of HTML Tags")
print(text)

Removal of Emojis
With more and more usage of social media platforms, there is an explosion in the usage of emojis
in our day-to-day life as well. We might need to remove these emojis for some of our
textual analysis. We have to use the 'u' literal to create a Unicode string. Also, we should pass the
re.UNICODE flag and convert our input data to Unicode.
Example
import pandas as pd
import re

df = pd.read_csv("/content/drive/My Drive/sample.csv")
df = df[["text"]]
text=df["text"][0]
print("Original Text")
print(text)
toks=text.split()
new_toks=[]
pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002702-\U000027B0"
        u"\U000024C2-\U0001F251"
        "]+", flags=re.UNICODE)
for t in toks:
    t=re.sub(pattern,"",t)
    new_toks.append(t)
text=" ".join(new_toks)
df["text"][0]=text
print("After Removal of Emojis")
print(text)
Stemming
Stemming is the process of converting a word to its most general form, or stem. This helps in reducing
the size of our vocabulary. Consider the words: learn, learning, learned and learnt. All these words are
stemmed from their common root learn. However, in some cases, the stemming process produces words
that are not correct spellings of the root word, for example happi. That's because it chooses the most
common stem for related words. For example, we can look at the set of words that comprises the
different forms of happy: happy, happiness and happier. We can see that the prefix happi is more
commonly used. We cannot choose happ because it is the stem of unrelated words like happen. NLTK
has different modules for stemming and we will use the PorterStemmer module which uses the Porter
Stemming Algorithm.
Example
import numpy as np
import pandas as pd
from nltk.stem.porter import PorterStemmer

df = pd.read_csv("/content/drive/My Drive/sample.csv")
df = df[["text"]]
text=df["text"][0]
text=text.lower()
print("Original Text")
print(text)
stemmer = PorterStemmer()
toks=text.split()
new_toks=[]
for t in toks:
    rw=stemmer.stem(t)
    new_toks.append(rw)
text=" ".join(new_toks)
df["text"][0]=text
print("After Stemming")
print(text)

Lemmatization
Lemmatization is a text pre-processing technique used in natural language processing (NLP)
models to break a word down to its root meaning to identify similarities. For example, a
lemmatization algorithm would reduce the word better to its root word, or lemma, good.
In stemming, a part of the word is just chopped off at the tail end to arrive at the stem of the word.
There are different algorithms used to find out how many characters have to be chopped off, but the
algorithms don’t actually know the meaning of the word in the language it belongs to. In
lemmatization, the algorithms do have this knowledge. In fact, you can even say that these
algorithms refer to a dictionary to understand the meaning of the word before reducing it to its root
word, or lemma. Stemming, on the other hand, reduces the size of the text data massively and hence is faster
at processing a large amount of text data; however, stemming may result in meaningless words.
So, a lemmatization algorithm would know that the word better is derived from the word good, and
hence, the lemma is good. But a stemming algorithm wouldn't be able to do the same. There could
be over-stemming or under-stemming, and the word better could be reduced to either bet, or bett, or
just retained as better. But there is no way in stemming to reduce better to its root word good. This
is the difference between stemming and lemmatization. Lemmatization preserves the meaning of
words; however, it may be computationally more expensive because its reduction of the text is less
aggressive than stemming. A small comparison of the two is sketched below.
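The following is a minimal sketch (not from the text) comparing the Porter stemmer and the WordNet lemmatizer on a couple of words. Note that lemmatize() needs the part of speech (pos='a' for adjective) to map better to good, and the WordNet corpus must be downloaded once beforehand.
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
#import nltk; nltk.download('wordnet')            #required once for the lemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("happiness"))                  #stemming chops the suffix: happi
print(lemmatizer.lemmatize("happiness"))          #lemmatization keeps the valid word: happiness
print(lemmatizer.lemmatize("better", pos="a"))    #maps the adjective better to its lemma: good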
Example
import numpy as np
import pandas as pd
from nltk.stem import WordNetLemmatizer

df = pd.read_csv("/content/drive/My Drive/sample.csv") df =
df[["text"]]
text=df["text"][1] text=text.lower()
print("Original Text") print(text)
lemmatizer = WordNetLemmatizer() toks=text.split()
new_toks=[] for t in
toks:
rw=lemmatizer.lemmatize(t)
new_toks.append(rw)
df["text"][1]=text text="
".join(new_toks)
print("After Lemmatization") print(text)
