EXP NO:1a MATRIX MANIPULATION USING NUMPY
DATE:
AIM:
To write a Python program to perform matrix operations using NumPy.
ALGORITHM:
1. Start the program.
2. Import Numpy library.
3. Get the input matrices x and y.
4. Perform the matrix operations add, subtract, multiply, divide and dot.
5. Display the output.
6. Stop.
PROGRAM:
import numpy
# initializing matrices
x = numpy.array([[1, 2], [4, 5]])
y = numpy.array([[7, 8], [9, 10]])
# using add() to add matrices
print ("The element wise addition of matrix is : ")
print (numpy.add(x,y))
# using subtract() to subtract matrices
print ("The element wise subtraction of matrix is : ")
print (numpy.subtract(x,y))
# using divide() to divide matrices
print ("The element wise division of matrix is : ")
print (numpy.divide(x,y))
# using multiply() to multiply matrices element wise
print ("The element wise multiplication of matrix is : ")
print (numpy.multiply(x,y))
# using dot() to multiply matrices
print ("The product of matrices is : ")
print (numpy.dot(x,y))
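As a side note, for 2-D arrays NumPy's @ operator computes the same matrix product as numpy.dot; continuing with the x and y defined above:
print(x @ y)  # same result as numpy.dot(x, y) for 2-D arrays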
OUTPUT:
The element wise addition of matrix is:
[[ 8 10]
[13 15]]
The element wise subtraction of matrix is:
[[-6 -6]
[-5 -5]]
The element wise division of matrix is:
[[0.14285714 0.25]
[0.44444444 0.5]]
The element wise multiplication of matrix is:
[[ 7 16]
[36 50]]
The product of matrices is:
[[25 28]
[73 82]]
RESULT:
EX.NO 1b AGGREGATE AND STATISTICAL FUNCTIONS USING NUMPY
DATE:
AIM:
To write a Python program for aggregate and statistical functions using NumPy.
ALGORITHM:
1. Start the program
2. Import the Numpy library.
3. Get the input array.
4. Using the appropriate methods, calculate the mean, standard deviation, variance, sum and product.
5. Display the output.
6. Stop the program.
PROGRAM:
import numpy as np
array1 = np.array([[10, 20, 30], [40, 50, 60]])
print("Mean: ", np.mean(array1))
print("Std: ", np.std(array1))
print("Var: ", np.var(array1))
print("Sum: ", np.sum(array1))
print("Prod: ", np.prod(array1))
OUTPUT:
Mean: 35.0
Std: 17.07825127659933
Var: 291.6666666666667
Sum: 210
Prod: 720000000
RESULT:
EX.NO1c: RESHAPE USING NUMPY
DATE:
AIM:
To write a Python program for reshaping an array using NumPy.
ALGORITHM:
1. Start the program
2. Import the numpy library.
3. Get the input array.
4. Call the reshape() function.
5. Display the output.
PROGRAM
import numpy as np
thearray = np.array([1, 2, 3, 4, 5, 6, 7, 8])
thearray = thearray.reshape(2, 4)
print(thearray)
print("-" * 10)
thearray = thearray.reshape(4, 2)
print(thearray)
print("-" * 10)
thearray = thearray.reshape(8, 1)
print(thearray)
OUTPUT:
[[1 2 3 4]
[5 6 7 8]]
----------
[[1 2]
[3 4]
[5 6]
[7 8]]
----------
[[1]
[2]
[3]
[4]
[5]
[6]
[7]
[8]]
RESULT:
EX.NO2a: CREATING DATAFRAMES USING LIST
DATE:
AIM:
To write a Python program for creating data frames using lists.
ALGORITHM:
1. Start the program.
2. Import the pandas package as pd.
3. Declare the input as a list.
4. Load the data into the data frame.
5. Display the result.
PROGRAM 1:
import pandas as pd
# string values in the list
lst = ['Java', 'Python', 'C', 'C++',
'JavaScript', 'Swift', 'Go']
# Calling DataFrame constructor on list
dframe = pd.DataFrame(lst)
print(dframe)
PROGRAM 2:
import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
#load data into a DataFrame object:
df = pd.DataFrame(data)
print(df)
OUTPUT 1:
            0
0        Java
1      Python
2           C
3         C++
4  JavaScript
5       Swift
6          Go
OUTPUT 2:
   calories  duration
0       420        50
1       380        40
2       390        45
RESULT:
EX.NO2b: HIERARCHICAL INDEXING USING PANDAS
DATE:
AIM:
To write a Python program to create hierarchical indexing using pandas DataFrames.
ALGORITHM:
1. Start the program.
2. Import the library pandas as pd.
3. Create the data frames.
4. Use the set_index function to create the hierarchical indexing.
5. Display the index using the index attribute.
6. Create the hierarchical indexing without dropping the columns, using the set_index function
with drop=False.
7. Display the result.
8. Stop the program.
PROGRAM:
import pandas as pd
import numpy as np
#Create a DataFrame
d={
'Name':['Alisa','Bobby','Cathrine','Alisa','Bobby','Cathrine',
'Alisa','Bobby','Cathrine','Alisa','Bobby','Cathrine'],
'Exam':['Semester 1','Semester 1','Semester 1','Semester 1','Semester 1','Semester 1',
'Semester 2','Semester 2','Semester 2','Semester 2','Semester 2','Semester 2'],
'Subject':['Mathematics','Mathematics','Mathematics','Science','Science','Science',
'Mathematics','Mathematics','Mathematics','Science','Science','Science'],
'Score':[62,47,55,74,31,77,85,63,42,67,89,81]}
df = pd.DataFrame(d,columns=['Name','Exam','Subject','Score'])
df
# multiple indexing or hierarchical indexing
df1=df.set_index(['Exam', 'Subject'])
df1
# View index
df1.index
# Swap the column in multiple index
df1.swaplevel('Subject','Exam')
# multiple indexing or hierarchical indexing with drop=False
df1=df.set_index(['Exam', 'Subject'],drop=False)
df1
OUTPUT:
Hierarchical Indexing:
View Index:
MultiIndex([('Semester 1', 'Mathematics'),
('Semester 1', 'Mathematics'),
('Semester 1', 'Mathematics'),
('Semester 1', 'Science'),
('Semester 1', 'Science'),
('Semester 1', 'Science'),
('Semester 2', 'Mathematics'),
('Semester 2', 'Mathematics'),
('Semester 2', 'Mathematics'),
('Semester 2', 'Science'),
('Semester 2', 'Science'),
('Semester 2', 'Science')],
names=['Exam', 'Subject'])
SWAP LEVEL:
Hierarchical indexing or multiple indexing without dropping:
RESULT:
EX.NO 3a LINE GRAPH
DATE:
AIM:
To write a Python program to plot a line graph using the Matplotlib library.
ALGORITHM:
1.Start
2.Import matplotlib library
3.Assign values for an array x and y
4.Assign the label for x-axis
5.Assign the label for y-axis
6.Assign the title for the graph
7.Show the plotted graph.
8.Stop
PROGRAM:
import matplotlib.pyplot as plt
# x axis values
x = [1,2,3,4]
# corresponding y axis values
y = [2,4,1,5]
# plotting the points
plt.plot(x, y)
# naming the x axis
plt.xlabel('x - axis')
# naming the y axis
plt.ylabel('y - axis')
# giving a title to my graph
plt.title('Plot graph!')
# function to show the plot
plt.show()
OUTPUT:
RESULT:
EX.NO:3b SINE WAVE GRAPH
DATE:
AIM:
To write a Python program to plot a sine wave graph using the Matplotlib library.
ALGORITHM:
1.Start
2.Import matplotlib library
3.Import numpy library
4.Import math
5.Assign the values of x and y
6.Plot the graph
7.Assign the label for X-axis and Y-axis respectively
8.Assign the title for the graph
9.Show the plotted graph
10.stop
PROGRAM:
from matplotlib import pyplot as plt
import numpy as np
import math #needed for definition of pi
x = np.arange(0, math.pi*2, 0.05)
y = np.sin(x)
plt.plot(x,y)
plt.xlabel('angle')
plt.ylabel('sine')
plt.title('sine wave')
plt.show()
OUTPUT:
RESULT:
EX.NO3c: MULTIPLOT GRAPH
DATE:
AIM:
To write a Python program to plot a multiplot graph using the Matplotlib library.
ALGORITHM:
1.Start
2.Import matplotlib library
3.Create lists a and b with values
4.Plot a, b and list(range(0,22,3))
5.Assign the names for x and y axis
6.Create an array c with values
7.Plot c and label c
8.Get the current axes using the gca command
9.Using the current axes, hide the right and top boundary lines and set the bounds of the left one
10.Set the interval for x and y axis
11.Assign the names for legend
12.Assign the names for title
13.Show the plotted graph
14.Stop
PROGRAM:
import matplotlib.pyplot as plt
a = [1, 2, 3, 4, 5]
b = [0, 0.6, 0.2, 15, 10, 8, 16, 21]
plt.plot(a)
# o is for circles and r is for red
plt.plot(b, 'or')
plt.plot(list(range(0, 22, 3)))
# naming the x-axis
plt.xlabel('Day ->')
# naming the y-axis
plt.ylabel('Temp ->')
c = [4, 2, 6, 8, 3, 20, 13, 15]
plt.plot(c, label = '4th Rep')
# get current axes command
ax = plt.gca()
# get command over the individual
# boundary line of the graph body
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
# set the range or the bounds of
# the left boundary line to fixed range
ax.spines['left'].set_bounds(-3, 40)
# set the interval by which
# the x-axis set the marks
plt.xticks(list(range(-3, 10)))
# set the intervals by which y-axis set the marks
plt.yticks(list(range(-3, 20, 3)))
# legend denotes what each color signifies
ax.legend(['1st Rep','2nd Rep','3rd Rep','4th Rep'])
# annotate command helps to write
# ON THE GRAPH any text xy denotes
# the position on the graph
plt.annotate('Temperature V / s Days', xy = (1.01, -2.15))
# gives a title to the Graph
plt.title('All Features Discussed')
plt.show()
OUTPUT:
RESULT:
EX.NO3d: PIE CHART
DATE:
AIM:
To write a Python program to plot a pie chart using the Matplotlib library.
ALGORITHM:
1.Start
2.Import matplotlib library
3.Import numpy
4.Declare the values for an array ‘y’
5.Declare a list ‘mylabels’
6.Assign the values for the pie function.
7.Assign the title for the legend function
8.Show the plotted graph
9.Stop
PROGRAM:
import matplotlib.pyplot as plt
import numpy as np
y = np.array([35, 25, 25, 15])
mylabels = ['Apples', 'Bananas', 'Cherries', 'Dates']
plt.pie(y, labels = mylabels)
plt.legend(title = 'Four Fruits:')
plt.show()
OUTPUT:
RESULT:
EX.NO:3e SUBPLOT
DATE:
AIM:
To write a Python program to plot a subplot using the Matplotlib library.
ALGORITHM:
1.Start
2.Import matplotlib library
3.Import numpy
4.Assign the values for x and y array
5.Declare the subplot function
6.Plot the graph
7.Assign the title for plot as ‘sales’
8.Again, assign the values for x and y array for another graph
9.Declare the subplot function
10.Plot the graph
11.Assign the title for plot as ‘income’
12.Assign the suptitle as ‘my shop’
13.Show the plotted graph
14.Stop
PROGRAM:
import matplotlib.pyplot as plt
import numpy as np
#plot 1:
x = np.array([0, 1, 2, 3])
y = np.array([3, 8, 1, 10])
plt.subplot(1, 2, 1)
plt.plot(x,y)
plt.title('SALES')
#plot 2:
x = np.array([0, 1, 2, 3])
y = np.array([10, 20, 30, 40])
plt.subplot(1, 2, 2)
plt.plot(x,y)
plt.title('INCOME')
plt.suptitle('MY SHOP')
plt.show()
OUTPUT:
RESULT:
EX.NO3f: HISTOGRAM
DATE:
AIM:
To write a Python program to plot a histogram using the Matplotlib library.
ALGORITHM:
1.Start
2.Import matplotlib and numpy libraries
3.Using the subplots command, create the figure and axes
4.Create an array a with values and plot the histogram
5.Assign the name for title
6.Set the interval for x-axis
7.Assign the names for x and y axis
8.Show the plotted graph
9.Stop
PROGRAM:
from matplotlib import pyplot as plt
import numpy as np
fig,ax = plt.subplots(1,1)
a = np.array([22,87,5,43,56,73,55,54,11,20,51,5,79,31,27])
ax.hist(a, bins = [0,25,50,75,100])
ax.set_title('histogram of result')
ax.set_xticks([0,25,50,75,100])
ax.set_xlabel('marks')
ax.set_ylabel('no. of students')
plt.show()
OUTPUT:
RESULT:
EX.NO:3g BAR CHART
DATE:
AIM:
To write a Python program to plot a bar chart using the Matplotlib library.
ALGORITHM:
1.Start
2.Import matplotlib library
3.With the current axes command add axes
4.Create the arrays langs and students and assign the values
5.Plot the bar chart using langs and students
6.Show the plotted graph
7.Stop
PROGRAM:
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(8,6))
ax = fig.add_axes([0.15,0.1,0.7,0.74])
langs = ['C','C++','Java','Python','PHP']
students = [23,17,35,29,12]
ax.bar(langs,students,color='green',width=0.4)
plt.xlabel("Languages available")
plt.ylabel("Number of students selected the languages")
plt.title("Bar graph for the languages opted by students")
plt.show()
OUTPUT:
RESULT:
EX.NO:3h SCATTER PLOT
DATE:
AIM:
To write a Python program to plot a scatter plot using the Matplotlib library.
ALGORITHM:
1.Start
2.Import matplotlib library
3.Create the arrays x and y with values
4.Plot the scatter plot with the color “blue”
5.Assign the names for x and y axis
6.Assign the name for legend functions
7.Show the plotted graph
8.Stop
PROGRAM:
import matplotlib.pyplot as plt
x = [5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6]
y = [99, 86, 87, 88, 100, 86, 103, 87, 94, 78, 77, 85, 86]
plt.scatter(x, y, c ='blue')
plt.xlabel('X-values')
plt.ylabel('Y-values')
plt.legend(['plot values'])
# To show the plot
plt.show()
OUTPUT:
RESULT:
EX.NO:4a FREQUENCY DISTRIBUTIONS
DATE:
AIM:
To write a Python program to implement frequency tables using frequency distributions.
ALGORITHM:
1. Start the program.
2. Import the pandas library.
3. Create a .csv file with values and set the path.
4. Make a frequency table of the Pos (position) column from the dataset.
5. Find the frequency table of height column from the file.
6. Use the Series.sort_index() method to sort the table by its index.
7. To represent the data in descending order, set the ascending parameter to False.
8. Display the result.
9. Stop the program
PROGRAM:
import pandas as pd
wnba = pd.read_csv('wnba.csv')
freq_dis_pos = wnba['Pos'].value_counts()
freq_dis_pos
freq_dis_height = wnba["Height"].value_counts()
freq_dis_height
freq_dis_height =wnba["Height"].value_counts().sort_index(ascending= False)
freq_dis_height
freq_dis_height = wnba["Height"].value_counts().sort_index()
freq_dis_height
OUTPUT:
G 60
F 33
C 25
G/F 13
F/C 12
Name: Pos, dtype: int64
188 20
193 18
175 16
185 15
173 11
183 11
191 11
196 9
178 8
180 7
170 6
198 5
168 2
201 2
165 1
206 1
Name: Height, dtype: int64
206 1
201 2
198 5
196 9
193 18
191 11
188 20
185 15
183 11
180 7
178 8
175 16
173 11
170 6
168 2
165 1
Name: Height, dtype: int64
165 1
168 2
170 6
173 11
175 16
178 8
180 7
183 11
185 15
188 20
191 11
193 18
196 9
198 5
201 2
206 1
Name: Height, dtype: int64
RESULT:
EX.NO:4b RELATIVE FREQUENCY AND PERCENTAGE FREQUENCY
DATE:
AIM:
To write a Python program for relative frequencies and percentile ranks using pandas.
ALGORITHM:
1. Start the program.
2. Import the pandas library.
3. Create a .csv file with values and set the path.
4. Make a percentage (relative frequency) table of the Age column from the dataset.
5. From scipy.stats, import percentileofscore.
6. Find the percentile of a given age from the file.
7. Display the output.
8. Stop the program.
PROGRAM:
import pandas as pd
wnba = pd.read_csv("C:\\Users\\HARSHI\\Downloads\\wnba.csv")
wnba["Age"].value_counts() / len(wnba)
percentages_pos = wnba["Age"].value_counts(normalize=True).sort_index() * 100
percentages_pos
from scipy.stats import percentileofscore
percentile_of_25 = percentileofscore(wnba["Age"], 25, kind='weak')
percentile_of_25
percentiles = wnba["Age"].describe()
percentiles = wnba["Age"].describe(percentiles = [.1, .15, .33, .5, .592, .85, .9])
percentiles
OUTPUT:
24 0.111888
23 0.104895
25 0.104895
28 0.097902
27 0.090909
26 0.083916
22 0.069930
30 0.062937
29 0.055944
31 0.055944
32 0.055944
34 0.034965
35 0.027972
33 0.020979
21 0.013986
36 0.006993
Name: Age, dtype: float64
21 1.398601
22 6.993007
23 10.489510
24 11.188811
25 10.489510
26 8.391608
27 9.090909
28 9.790210
29 5.594406
30 6.293706
31 5.594406
32 5.594406
33 2.097902
34 3.496503
35 2.797203
36 0.699301
Name: Age, dtype: float64
40.55944055944056
count 143.000000
mean 27.076923
std 3.679170
min 21.000000
10% 23.000000
15% 23.000000
33% 25.000000
50% 27.000000
59.2% 28.000000
85% 31.000000
90% 32.000000
max 36.000000
Name: Age, dtype: float64
RESULT:
EX.NO: 4c AVERAGES
DATE:
AIM:
To write a Python program to compute the average of given values.
ALGORITHM:
1. Start the program.
2. Import the statistics package.
3. Use the mean method to calculate the average of the given data.
4. Display the result.
5. Stop the program.
PROGRAM:
import statistics
# list of positive integer numbers
data1 = [1, 3, 4, 5, 7, 9, 2]
x = statistics.mean(data1)
# Printing the mean
print("Mean is :", x)
OUTPUT:
Mean is : 4.428571428571429
RESULT:
Ex.No: 4d) VARIABILITY USING DATA VALUES
Date:
Aim:
To write a Python program to compute the variability of data values.
Algorithm:
Step 1: Start.
Step 2: Import statistics library.
Step 3: Create a sample data
Step 4: Print the variance of the sample data.
Step 5: Stop.
Program:
import statistics
sample = [2.74, 1.23, 2.63, 2.22, 3, 1.98]
print("Variance of sample set is % s"%(statistics.variance(sample)))
Output:
Variance of sample set is 0.40924
Result:
Ex.No: 4e) VARIABILITY USING LIST
Date:
Aim:
To write a Python program to compute the variability of values in a list.
Algorithm:
Step 1: Start.
Step 2: Import the statistics library.
Step 3: Create a list with values.
Step 4: Calculate the mean of the value.
Step 5: Calculate the variance of the value.
Step 6: Print.
Step 7: Stop.
Program:
import statistics
sample = (1, 1.3, 1.2, 1.9, 2.5, 2.2)
m = statistics.mean(sample)
print("Variance of Sample set is % s"%(statistics.variance(sample, xbar = m)))
Output:
Variance of Sample set is 0.3656666666666667
Result:
Ex.No: 4f VARIABILITY USING PANDAS
Date:
Aim:
To write a Python program to compute variability using pandas.
Algorithm:
Step 1: Start.
Step 2: Import pandas library.
Step 3: Create a list with values.
Step 4: Assign the values in series to sample.
Step 5: Print the type of the value.
Step 6: Print the mean of the value.
Step 7: Print the median of the value.
Step 8: Print the standard deviation of the value.
Step 9: Print the variance of the value.
Step 10: Stop.
Program:
import pandas as pd
lst = [33219, 36254, 38801, 46335, 46840, 47596, 55130, 56863, 78070, 88830]
sample = pd.Series(lst)
print(type(sample))
print(sample.mean())
print(sample.median())
print(sample.std(ddof=0))
print(sample.var(ddof=0))
print(sample.var(ddof=1))
print((sample - sample.mean()).abs().mean())  # mean absolute deviation; Series.mad() was removed in pandas 2.0
Output:
<class 'pandas.core.series.Series'>
52793.8
47218.0
17076.965197598784
291622740.36
324025267.06666666
13543.560000000001
Result:
Ex.No: 5a NORMAL CURVES
Date:
Aim:
To write a Python program to plot normal curves.
Procedure:
Step 1: Start the program.
Step 2: Import the Numpy Library.
Step 3: Import matplotlib.
Step 4: Import norm from scipy.stats.
Step 5: Assign the value of an array x.
Step 6: Plot the graph.
Step 7: show the plotted graph.
Step 8: Stop the program.
Program:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
x = np.arange(-3, 3, 0.001)
plt.plot(x, norm.pdf(x, 0, 1))
plt.show()
Output:
Result:
Ex.No: 5b CORRELATION AND SCATTER PLOTS
Date:
Aim:
To write a Python program to compute correlations and draw scatter plots.
Procedure:
Step 1: Start.
Step 2: Import sklearn.
Step 3: Import Numpy Libraries.
Step 4: Import matplotlib.
Step 5: Import pandas Libraries.
Step 6: Assign the values in series to x and y.
Step 7: Assign the value of correlation of x and y to correlation.
Step 8: Assign the title to the plot and plot the scatter plot.
Step 9: Label the x and y axis.
Step 10: Show the plotted graph.
Step 11: Stop the program.
Program:
import sklearn
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
y = pd.Series([1, 2, 3, 4, 3, 5, 4])
x = pd.Series([1, 2, 3, 4, 5, 6, 7])
correlation = y.corr(x)
plt.title('Correlation')
plt.scatter(x, y)
plt.plot(np.unique(x), np.poly1d(np.polyfit(x, y, 1))(np.unique(x)), color='red')
plt.xlabel('x axis')
plt.ylabel('y axis')
plt.show()
Output:
Result:
Ex.No: 5c CORRELATION COEFFICIENT USING NUMPY
Date:
Aim:
To write a Python program to compute the correlation coefficient using NumPy.
Procedure:
Step 1: Start the program.
Step 2: Import Numpy.
Step 3: Assign the value of x and y.
Step 4: Compute the r value using the correlation coefficient function.
Step 5: Print r.
Step 6: Stop the program.
Program:
import numpy as np
x = np.arange(10, 20)
y = np.array([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])
r = np.corrcoef(x, y)
print(r)
Output:
[[1. 0.75864029]
[0.75864029 1. ]]
Result:
Ex.No: 5d CORRELATION COEFFICIENT USING SCIPY
Date:
Aim:
To write a Python program to compute the correlation coefficient using SciPy.
Procedure:
Step 1: Start the program.
Step 2: Import Numpy Libraries.
Step 3: Import scipy.
Step 4: Assign the values of x and y.
Step 5: Print pearsonr value.
Step 6: Print spearmanr value.
Step 7: Print kendalltau value.
Step 8: Stop the program.
Program:
import numpy as np
import scipy.stats
x = np.arange(10, 20)
y = np.array([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])
print(scipy.stats.pearsonr(x, y))
print(scipy.stats.spearmanr(x, y))
print(scipy.stats.kendalltau(x, y))
Output:
(0.758640289091187, 0.010964341301680813)
SpearmanrResult(correlation=0.9757575757575757, pvalue=1.4675461874042197e-06)
KendalltauResult(correlation=0.911111111111111, pvalue=2.9761904761904762e-05)
Result:
Ex.No: 6 REGRESSION
Date:
Aim:
To write a Python program to perform linear regression.
Procedure:
Step 1: Start the program.
Step 2: Import Numpy library.
Step 3: Define the function to estimate the coefficients.
Step 4: Return b0 and b1.
Step 5: Define the plot regression line function.
Step 6: Display the plot.
Step 7: Define the main function.
Step 8: Assign the values of x and y.
Step 9: Assign the value of estimate coefficient to b.
Step 10: Print the estimated coefficient.
Step 11: Stop the program.
Program:
import numpy as np
import matplotlib.pyplot as plt

def estimate_coef(x, y):
    # number of observations/points
    n = np.size(x)
    # means of the x and y vectors
    m_x = np.mean(x)
    m_y = np.mean(y)
    # cross-deviation and deviation about x
    SS_xy = np.sum(y*x) - n*m_y*m_x
    SS_xx = np.sum(x*x) - n*m_x*m_x
    # regression coefficients
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1*m_x
    return (b_0, b_1)

def plot_regression_line(x, y, b):
    # plot the actual points as a scatter plot
    plt.scatter(x, y, color = "m", marker = "o", s = 30)
    # predicted response vector
    y_pred = b[0] + b[1]*x
    # plot the regression line
    plt.plot(x, y_pred, color = "g")
    plt.xlabel('x')
    plt.ylabel('y')
    plt.show()

def main():
    # observations
    x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])
    # estimate the coefficients
    b = estimate_coef(x, y)
    print("Estimated coefficients:\nb_0 = {}\nb_1 = {}".format(b[0], b[1]))
    # plot the regression line
    plot_regression_line(x, y, b)

if __name__ == "__main__":
    main()
Output:
Estimated coefficients:
b_0 = 1.2363636363636363
b_1 = 1.1696969696969697
Result:
Ex. No: 7 Z TEST CASE STUDIES
Date:
AIM:
To write a Python program for the Z-test (both one-tailed and two-tailed hypothesis tests).
EXPLANATION:
The Z-test is a test for proportions. In other words, it is a statistical test that helps us
evaluate our beliefs about certain proportions in the population based on the sample at hand.
It can help us answer questions like:
Is the proportion of female students at SKEMA equal to 0.5?
Is the proportion of smokers in France equal to 0.15?
For conducting a Z-test you do not need many calculations on your sample data. The only thing
you need to know is the proportion of observations that qualify to belong to the sub-sample
you are interested in (e.g. a “female SKEMA student”, or a “French smoker” in the examples
above). We will use a dataset on cars in the US for learning purposes. It contains a list of
32 cars and their characteristics.
In the simplest example involving the data at hand, we can ask whether the share of cars
with variable “am” equal to 0 is equal to 50%.
The function used for z-testing is scipy.stats.binom_test. It requires three arguments: x, the
number of qualified observations in our data (19 in our case); n, the total number of
observations (32 in our case); and p, the null hypothesis on the share of qualified data (0.5 in
our case).
The output of the test gives rich information:
It specifies the alternative hypothesis (by default the test is two-sided, so the alternative
hypothesis is that the share is not equal to the proportion specified in the null hypothesis;
however, we will see how to adjust this in the next chapter).
It specifies the confidence level and interval.
However, by default, the function only returns the most important piece of information: the
p-value of the test.
This value can be understood as the probability that we are making a mistake if we reject the
null hypothesis in favor of the alternative one. In this case the probability is 38%, which is
very high (anything above 10% is high), which would prompt us to conclude that we do not
have enough statistical evidence to claim that the share of cars with am=0 is not 50% in the
population.
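As a concrete sketch of this proportion test (a minimal example; note that scipy.stats.binom_test is the older name, and recent SciPy versions provide the same test as scipy.stats.binomtest):
from scipy import stats
x, n, p0 = 19, 32, 0.5  # qualified observations, total observations, null share
p_value = stats.binom_test(x, n, p0)  # two-sided by default
print(p_value)  # roughly 0.38 here, so we cannot reject the 50% null hypothesis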
PROGRAM:
import seaborn as sns
import scipy.stats as stats
import numpy as np
import random
import warnings
import matplotlib.pyplot as plt
%matplotlib inline
sns.set(rc={'figure.figsize':(13, 7.5)})
sns.set_context('talk')
warnings.filterwarnings('ignore')
# Visualization of a one-tail test
values = np.random.normal(loc=0, scale=10, size=6000)
two_std_from_mean = np.mean(values) + np.std(values)*1.645
kde = stats.gaussian_kde(values)
pos = np.linspace(np.min(values), np.max(values), 10000)
plt.plot(pos, kde(pos), color='teal')
shade = np.linspace(two_std_from_mean, 40, 300)
plt.fill_between(shade, kde(shade), alpha=0.45, color='teal')
plt.title("Sampling Distribution for One-Tail Hypothesis Test", y=1.015, fontsize=20)
plt.xlabel("sample mean value", labelpad=14)
plt.ylabel("frequency of occurrence", labelpad=14);
round(1-stats.norm.cdf(1.645), 2)
round(1-stats.norm.cdf(2.33), 2)
round(1-stats.norm.cdf(3.1), 3)
# Two-tailed hypothesis tests
values = np.random.normal(loc=0, scale=10, size=6000)
alpha_05_positive = np.mean(values) + np.std(values)*1.96
alpha_05_negative = np.mean(values) - np.std(values)*1.96
kde = stats.gaussian_kde(values)
pos = np.linspace(np.min(values), np.max(values), 10000)
plt.plot(pos, kde(pos), color='dodgerblue')
shade = np.linspace(alpha_05_positive, 40, 300)
plt.fill_between(shade, kde(shade), alpha=0.45, color='dodgerblue')
shade2 = np.linspace(alpha_05_negative, -40, 300)
plt.fill_between(shade2, kde(shade2), alpha=0.45, color='dodgerblue')
plt.title("Sampling Distribution for Two-Tail Hypothesis Test", y=1.015, fontsize=20)
plt.xlabel("sample mean value", labelpad=14)
plt.ylabel("frequency of occurrence", labelpad=14);
round(1-stats.norm.cdf(1.96), 3)
round(1-stats.norm.cdf(2.575), 3)
round(1-stats.norm.cdf(3.29), 3)
population_mean_pounds = 160
population_size = 5500
population_std_dev_pounds = 22
np.random.seed(50)
population_gym_goers_mass = np.random.normal(loc=population_mean_pounds, scale=population_std_dev_pounds, size=population_size)
n = 30
treatment_sample_mean_pounds = 169
np.random.seed(50)
sample_means = []
for sample in range(0, 500):
    sample_values = np.random.choice(a=population_gym_goers_mass, size=n)
    sample_mean = np.mean(sample_values)
    sample_means.append(sample_mean)
# sampling distribution
sns.distplot(sample_means, color='darkviolet')
plt.title("Sampling Distribution ($n=30$) of Gym Goers' Mass in Pounds", y=1.015, fontsize=20)
plt.xlabel("sample mean mass [pounds]", labelpad=14)
plt.ylabel("frequency of occurrence", labelpad=14);
standard_error_pounds = population_std_dev_pounds / np.sqrt(n)
standard_error_pounds
sample_mean_at_positive_z_critical = population_mean_pounds + 1.96*standard_error_pounds
sample_mean_at_positive_z_critical
# note the minus sign for the lower critical value
sample_mean_at_negative_z_critical = population_mean_pounds - 1.96*standard_error_pounds
sample_mean_at_negative_z_critical
kde = stats.gaussian_kde(sample_means)
pos = np.linspace(np.min(sample_means), np.max(sample_means), 10000)
plt.plot(pos, kde(pos), color='darkviolet')
shade = np.linspace(sample_mean_at_positive_z_critical, 175, 300)
plt.fill_between(shade, kde(shade), alpha=0.45, color='darkviolet')
shade2 = np.linspace(sample_mean_at_negative_z_critical, 145, 300)
plt.fill_between(shade2, kde(shade2), alpha=0.45, color='darkviolet')
plt.axvline(x=treatment_sample_mean_pounds, linestyle='--', linewidth=2.5, label="sample mean with Joe personal trainer", c='purple')
plt.title("Sampling Distribution ($n=30$) of Gym Goers' Mass in Pounds", y=1.015, fontsize=20)
plt.xlabel("sample mean mass [pounds]", labelpad=14)
plt.ylabel("probability of occurrence", labelpad=14)
plt.legend();
# z-score of the treatment sample mean under the null hypothesis
z_score = (treatment_sample_mean_pounds - population_mean_pounds) / standard_error_pounds
p_value = round(1-stats.norm.cdf(z_score), 3)
p_value
true_population_mean_pounds_with_joe_training = 162
z_true = (treatment_sample_mean_pounds - true_population_mean_pounds_with_joe_training)/standard_error_pounds
z_true
plt.plot(pos, kde(pos), color='darkviolet')
shade = np.linspace(sample_mean_at_positive_z_critical, 175, 300)
plt.fill_between(shade, kde(shade), alpha=0.45, color='darkviolet')
shade2 = np.linspace(sample_mean_at_negative_z_critical, 145, 300)
plt.fill_between(shade2, kde(shade2), alpha=0.45, color='darkviolet')
plt.axvline(x=treatment_sample_mean_pounds, linestyle='--', linewidth=2.5, label="sample mean with Joe personal trainer", c='purple')
plt.axvline(x=true_population_mean_pounds_with_joe_training, linestyle='--', linewidth=2.5, label="true population mean with Joe's training", c='c')
plt.xlabel("sample mean mass [pounds]", labelpad=14)
plt.ylabel("probability of occurrence", labelpad=14)
plt.legend();
OUTPUT:
RESULT:
Thus, the program for the Z-test (both one-tailed and two-tailed) has been studied and
executed, and the output has been verified successfully.
Ex. No: 8 T-TEST CASE STUDIES
Date:
AIM:
To write a Python program for the T-test (both one-tailed and two-tailed hypothesis tests).
EXPLANATION:
A T-test is among the most frequently utilized procedures in statistics. However, many people
who use the T-test frequently do not know precisely what happens to their data in the
background when it is processed by applications such as R and Python. The T-test compares
two averages, also known as means, and tells us whether they differ from each other or not.
The T-test is also known as Student's T-test, and it also tells us how significant the
differences are. In other terms, it tells us whether those differences could have occurred by
chance.
The ratio of the difference between two groups to the difference within the groups is known
as the T-score. A larger T-score means that there is more difference between the groups,
while a smaller T-score signifies similarity between the groups. A T-score of three (3)
indicates that the groups are three times as different from each other as they are within each
other. A bigger T-value from a T-test makes it more likely that the outcomes are repeatable.
Thus, we can conclude the following:
A large T-score implies that the groups are different from each other.
A small T-score implies that the groups are similar.
Now, let us understand the T-values and P-values.
Understanding T-values and P-values
Every T-value has a P-value to go with it. A P-value is the probability that the outcomes
from the sample data happened coincidentally. P-values range from 0% to 100% and are
generally written as decimals; for instance, a P-value of 10% is 0.1. It is good to have low
P-values: they indicate that the data did not happen coincidentally. For instance, a P-value of
0.01 indicates that there is only a 1% probability that the experiment's outcomes occurred
coincidentally. Generally, in many cases, a P-value of 5%, that is 0.05, is accepted as the
threshold for the data to be considered valid.
There are three significant types of T-test:
Independent Samples T-test: This test is used to compare the averages or means for two
groups.
Paired Sample T-test: This test is used to compare means from the same group at different
times (For example, one year apart).
One Sample T-test: This test is used to test the mean of a single group against an
acknowledged mean.
Performing a Sample T-test
Suppose that we need to test if the men's height in the population differs from the women's
height in general. Thus, we will take a sample from the population and utilize the T-test to
check whether the result is significant or not.
Step 1: Determining a Null and Alternate Hypothesis
Step 2: Collecting Sample data
Step 3: Determining a Confidence Interval and Degrees of Freedom
Step 4: Calculating the T-Statistics
Step 5: Calculating the critical T-value from the T-Distribution
Step 6: Comparing the critical T-values with the calculated T-Statistics
Determining a null and alternate hypothesis
Starting by defining a null and an alternate hypothesis is necessary. In general, the null
hypothesis will state that the two populations being tested have no statistically significant
difference. On the other hand, the alternate hypothesis will state that there is one present.
For this example, we can conclude the following statements:
1. Null Hypothesis: The height of men & women is the same.
2. Alternate Hypothesis: The height of men differs from the height of women.
Collecting sample data
Once we have determined the hypothesis, we start collecting the data from each population
group. For this example, we will collect two sets of data: one containing the heights of men
and the other containing the heights of women. The sample sizes ideally need to be identical;
however, they can differ. Suppose that the sizes of the sample data are nx and ny.
Determining a Confidence interval and degrees of freedom
The confidence level is tied to alpha (α), the significance level. The typical value of alpha (α)
is 0.05, which implies that there is 95% confidence in the validity of the test's conclusion. We
can define the degrees of freedom using the formula:
df = nx + ny − 2
Calculating the T-Statistic
For two groups of equal size n, we can calculate the T-statistic using the following formula:
t = (Mx − My) / (S · √(2/n)), where S = √((Sx² + Sy²)/2) is the pooled standard deviation.
Here:
n = number of scores per group
x = individual scores
M = mean of a group
Moreover, Mx and My are the mean values of the two samples (female and male), nx and ny
are the sample sizes of the two samples, and S is the standard deviation.
Calculating the critical T-value from the T-Distribution
We require two quantities in order to calculate the critical t-value: the chosen value of alpha,
and the degrees of freedom. The formula for the critical t-value is complex; however, it is
fixed for a given pair of degrees of freedom and value of alpha, so a table is traditionally used
to look up the critical t-value. However, Python provides a function in the SciPy library that
serves the same purpose (see the sketch after this paragraph).
Comparing the critical T-value with the calculated T-statistic: once the critical T-value is
calculated, we compare it with the T-statistic computed earlier. If the critical t-value is less
than the calculated T-statistic, the test deduces that a statistically significant difference is
present between the two populations; hence, we reject the null hypothesis that no significant
difference is present between the two samples, and we accept the alternate hypothesis,
implying that the heights of men and women are statistically different. In the other case,
where the critical t-value is greater than the calculated T-statistic, the test fails to reject the
null hypothesis, and we conclude that there is no statistically significant difference between
the two populations.
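As a hedged sketch of that SciPy lookup (assuming a two-tailed test at α = 0.05 with two groups of 10 observations each):
from scipy import stats
alpha = 0.05
dof = 18  # nx + ny - 2 for two groups of 10
t_critical = stats.t.ppf(1 - alpha / 2, df=dof)  # upper critical t-value, two-tailed
print(t_critical)  # approximately 2.101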
PROGRAM:
# Importing the required libraries and packages
import numpy as np
from scipy import stats
# Defining two random distributions
# Sample Size
N = 10
# Gaussian distributed data with mean = 2 and var = 1
x = np.random.randn(N) + 2
# Gaussian distributed data with mean = 0 and var = 1
y = np.random.randn(N)
# Calculating the Standard Deviation
# Calculating the variance to get the standard deviation
var_x = x.var(ddof = 1)
var_y = y.var(ddof = 1)
# Standard Deviation
SD = np.sqrt((var_x + var_y) / 2)
print("Standard Deviation =", SD)
# Calculating the T-Statistics
tval = (x.mean() - y.mean()) / (SD * np.sqrt(2 / N))
# Comparing with the critical T-Value
# Degrees of freedom
dof = 2 * N - 2
# p-value after comparison with the T-Statistics
pval = 1 - stats.t.cdf( tval, df = dof)
print("t = " + str(tval))
print("p = " + str(2 * pval))
## Cross Checking using the internal function from SciPy Package
tval2, pval2 = stats.ttest_ind(x, y)
print("t = " + str(tval2))
print("p = " + str(pval2))
OUTPUT:
Standard Deviation = 1.0840799841818152
t = 4.4083686523600845
p = 0.00033912894968146645
t = 4.408368652360084
p = 0.0003391289496815314
RESULT:
Thus, the program for the T-test (both one-tailed and two-tailed) has been studied and
executed, and the output has been verified successfully.
Ex. No: 9 ANOVA CASE STUDIES
Date:
AIM:
To write a Python program for the ANOVA test.
EXPLANATION:
ANOVA (ANalysis Of VAriance):
The ANOVA test is used to compare the means of more than two groups (a t-test can be
used to compare two groups).
Group mean differences are inferred by analyzing variances.
ANOVA uses a variance-based F test to check the equality of group means.
Sometimes the ANOVA F test is also called an omnibus test, as it tests a non-specific null
hypothesis, i.e. that all group means are equal.
Main types: one-way (one factor) and two-way (two factor) ANOVA (a factor is an
independent variable).
It is also called univariate ANOVA, as there is only one dependent variable in the model.
MANOVA is used when there are multiple dependent variables in the dataset. If there is an
additional continuous independent variable in the model, then ANCOVA is used.
If you have repeated measurements for treatments or time on the same subjects, you should
use repeated-measures ANOVA.
ANOVA Hypotheses:
Null hypothesis: group means are equal (no variation in the means of the groups),
H0: μ1 = μ2 = … = μp
Alternative hypothesis: at least one group mean is different from the other groups,
H1: all μ are not equal
ANOVA Assumptions:
Residuals (experimental error) are approximately normally distributed (Shapiro-Wilk test or
histogram).
Homoscedasticity or homogeneity of variances (variances are equal between treatment
groups) (Levene's, Bartlett's, or Brown-Forsythe test).
Observations are sampled independently from each other (no relation in observations
between the groups or within the groups), i.e., each subject should have only one response.
The dependent variable should be continuous. If the dependent variable is ordinal or
rank-based (e.g. Likert item data), it is more likely to violate the assumptions of normality
and homogeneity of variances. If these assumptions are violated, you should consider
non-parametric tests.
How ANOVA works:
Check sample sizes: an equal number of observations in each group.
Calculate the Mean Square (MS) for each group (SS of group / (levels − 1)); levels − 1 is the
degrees of freedom (df) for a group.
Calculate the Mean Square error (MSE) (SS error / df of residuals).
Calculate the F value (MS of group / MSE).
Calculate the p value based on the F value and the degrees of freedom (df).
One-way (one factor) ANOVA:
The ANOVA table represents the between- and within-group sources of variation, their
associated degrees of freedom, the sums of squares (SS), and the mean squares (MS). The
total variation is the sum of the between- and within-group variances. The F value is the
ratio of the between- and within-group mean squares (MS), and the p value is estimated
from the F value and the degrees of freedom, as sketched below.
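To make the between/within-group computation concrete, a minimal sketch with made-up sample values (the groups and numbers here are illustrative only):
import numpy as np
groups = [np.array([25.0, 30.0, 28.0]),
          np.array([31.0, 39.0, 38.0]),
          np.array([24.0, 30.0, 28.0])]
grand_mean = np.mean(np.concatenate(groups))
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)  # SS of groups
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)             # SS error
df_between = len(groups) - 1                                             # k - 1
df_within = sum(len(g) for g in groups) - len(groups)                    # N - k
f_value = (ss_between / df_between) / (ss_within / df_within)            # MS of group / MSE
print(f_value)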
Two-way (two factor) ANOVA (factorial design):
PROGRAM:
import pandas as pd
# load data file
df = pd.read_csv("https://reneshbedre.github.io/assets/posts/anova/onewayanova.txt", sep="\t")
# reshape the dataframe into a form suitable for the statsmodels package
df_melt = pd.melt(df.reset_index(), id_vars=['index'], value_vars=['A', 'B', 'C', 'D'])
# replace column names
df_melt.columns = ['index', 'treatments', 'value']
# generate a boxplot to see the data distribution by treatments; using a boxplot,
# we can easily detect the differences between different treatments
import matplotlib.pyplot as plt
import seaborn as sns
ax = sns.boxplot(x='treatments', y='value', data=df_melt, color='#99c2a2')
ax = sns.swarmplot(x="treatments", y="value", data=df_melt, color='#7d0013')
plt.show()
import scipy.stats as stats
# the stats f_oneway function takes the groups as input and returns the ANOVA F and p values
fvalue, pvalue = stats.f_oneway(df['A'], df['B'], df['C'], df['D'])
print(fvalue, pvalue)
# 17.492810457516338 2.639241146210922e-05
# get an ANOVA table as R-like output
import statsmodels.api as sm
from statsmodels.formula.api import ols
# Ordinary Least Squares (OLS) model
model = ols('value ~ C(treatments)', data=df_melt).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
anova_table
# ANOVA table using bioinfokit v1.0.3 or later (it uses a wrapper script for anova_lm)
from bioinfokit.analys import stat
res = stat()
res.anova_stat(df=df_melt, res_var='value', anova_model='value ~ C(treatments)')
res.anova_summary
from bioinfokit.analys import stat
# perform multiple pairwise comparisons (Tukey's HSD)
# for unequal sample size data, tukey_hsd uses the Tukey-Kramer test
res = stat()
res.tukey_hsd(df=df_melt, res_var='value', xfac_var='treatments', anova_model='value ~ C(treatments)')
res.tukey_summary
# QQ PLOT
import statsmodels.api as sm
import matplotlib.pyplot as plt
# res.anova_std_residuals are standardized residuals obtained from ANOVA (check above)
sm.qqplot(res.anova_std_residuals, line='45')
plt.xlabel("Theoretical Quantiles")
plt.ylabel("Standardized Residuals")
plt.show()
# histogram
plt.hist(res.anova_model_out.resid, bins='auto', histtype='bar', ec='k')
plt.xlabel("Residuals")
plt.ylabel('Frequency')
plt.show()
Two-way (two factor) ANOVA
import pandas as pd
import seaborn as sns
# load data file
d = pd.read_csv("https://reneshbedre.github.io/assets/posts/anova/twowayanova.txt", sep="\t")
# reshape the d dataframe into a form suitable for the statsmodels package
# you do not need to reshape if your data is already in stacked format; compare the
# d and d_melt tables for a detailed understanding
d_melt = pd.melt(d, id_vars=['Genotype'], value_vars=['1_year', '2_year', '3_year'])
# replace column names
d_melt.columns = ['Genotype', 'years', 'value']
d_melt.head()
import statsmodels.api as sm
from statsmodels.formula.api import ols
model = ols('value ~ C(Genotype) + C(years) + C(Genotype):C(years)', data=d_melt).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
anova_table
from bioinfokit.analys import stat
res = stat()
res.anova_stat(df=d_melt, res_var='value', anova_model='value ~ C(Genotype) + C(years) + C(Genotype):C(years)')
res.anova_summary
from statsmodels.graphics.factorplots import interaction_plot
import matplotlib.pyplot as plt
fig = interaction_plot(x=d_melt['Genotype'], trace=d_melt['years'], response=d_melt['value'], colors=['#4c061d', '#d17a22', '#b4c292'])
plt.show()
Multiple pairwise comparisons
from bioinfokit.analys import stat
# perform multiple pairwise comparisons (Tukey HSD)
# for unequal sample size data, tukey_hsd uses the Tukey-Kramer test
res = stat()
# for the main effect Genotype
res.tukey_hsd(df=d_melt, res_var='value', xfac_var='Genotype', anova_model='value ~ C(Genotype) + C(years) + C(Genotype):C(years)')
res.tukey_summary
# for the main effect years
res.tukey_hsd(df=d_melt, res_var='value', xfac_var='years', anova_model='value ~ C(Genotype) + C(years) + C(Genotype):C(years)')
res.tukey_summary
# for the interaction effect between genotype and years
res.tukey_hsd(df=d_melt, res_var='value', xfac_var=['Genotype', 'years'], anova_model='value ~ C(Genotype) + C(years) + C(Genotype):C(years)')
res.tukey_summary.head()
# QQ-plot
import statsmodels.api as sm
import matplotlib.pyplot as plt
# res.anova_std_residuals are standardized residuals obtained from the two-way ANOVA (check above)
sm.qqplot(res.anova_std_residuals, line='45')
plt.xlabel("Theoretical Quantiles")
plt.ylabel("Standardized Residuals")
plt.show()
# histogram
plt.hist(res.anova_model_out.resid, bins='auto', histtype='bar', ec='k')
plt.xlabel("Residuals")
plt.ylabel('Frequency')
plt.show()
# if you have a stacked table, you can use bioinfokit v1.0.3 or later for Levene's test
from bioinfokit.analys import stat
res = stat()
res.levene(df=d_melt, res_var='value', xfac_var=['Genotype', 'years'])
res.levene_summary
OUTPUT:
17.492810457516338 2.639241146210922e-05
group1 group2 Diff Lower Upper q-value p-value
0 A B 15.4 1.692871 29.107129 4.546156 0.025070
1 A C 1.6 -12.107129 15.307129 0.472328 0.900000
2 A D 30.4 16.692871 44.107129 8.974231 0.001000
3 B C 13.8 0.092871 27.507129 4.073828 0.048178
4 B D 15.0 1.292871 28.707129 4.428074 0.029578
5 C D 28.8 15.092871 42.507129 8.501903 0.001000
TWO WAY ANOVA TEST
  Genotype   years  value
0        A  1_year   1.53
1        A  1_year   1.83
2        A  1_year   1.38
3        B  1_year   3.60
4        B  1_year   2.94
                  Parameter    Value
0       Test statistics (W)   1.6849
1  Degrees of freedom (Df)  17.0000
2                   p value   0.0927
RESULT:
Thus, the program for the ANOVA test has been studied and executed, and the output has
been verified successfully.
Ex. No: 10 BUILDING AND VALIDATING LINEAR MODELS
Date:
AIM:
To write a Python program for building and validating linear models.
EXPLANATION:
A linear regression is one of the simplest statistical models in machine learning, and
understanding its algorithm is a crucial part of a data science course curriculum. It is used to
show the linear relationship between a dependent variable and one or more independent
variables.
Importing the dataset
Import the dataset using pandas, and also import other libraries such as NumPy and
Matplotlib. dataset.head() shows the first few rows of our dataset.
Data Preprocessing
X is the independent variable array and y is the dependent variable vector. Note the
difference between an array and a vector: the dependent variable must be a vector and the
independent variable must be an array.
Splitting the dataset
We need to split our dataset into the test and train sets. Generally, we follow the 20-80 or
the 30-70 policy (test to train) respectively. This is because we wish to train our model on
the years of experience and the salary, and then test it on the test set: we check whether the
predictions made by the model on the test-set data match what was given in the dataset. If
they match, it implies that our model is accurate and is making the right predictions.
Fitting the linear regression model to the training set
From sklearn's linear_model library, import the LinearRegression class and create an object
of it called regressor. To fit the regressor to the training set, we call the fit method, which
fits the regressor to the training data. We fit X_train (the training data of the matrix of
features) to the target values y_train. Thus the model learns the correlation and learns how
to predict the dependent variable based on the independent variable.
Predicting the test set results
We create a vector containing all the predictions of the test-set salaries. The predicted
salaries are put into a vector called y_pred (which contains the predictions for all
observations in the test set). The predict method makes the predictions for the test set, so
its input is the test set; the parameter for predict must be an array or sparse matrix, hence
the input is X_test.
Visualizing the results
To visualize the data, we plot graphs using Matplotlib.
1. Plotting the real observation points, i.e. the real given values: the X-axis has the years of
experience and the Y-axis has the salaries. plt.scatter plots a scatter plot of the data.
Parameters include:
X coordinate (X_train: number of years)
Y coordinate (y_train: real salaries of the employees)
Color (observation points in red; the regression line is drawn in blue)
2. Plotting the regression line
plt.plot has the following parameters:
X coordinates (X_train) – number of years
Y coordinates (predict on X_train) – prediction of X_train (based on the number of years)
Steps to build a Linear Regression model
Step 1: Importing the dataset
Step 2: Data pre-processing
Step 3: Splitting the test and train sets
Step 4: Fitting the linear regression model to the training set
Step 5: Predicting test results
Step 6: Visualizing the test results
PROGRAM:
# importing the dataset
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
dataset = pd.read_csv('Salary_Data.csv')
dataset.head()
# data preprocessing
X = dataset.iloc[:, :-1].values #independent variable array
y = dataset.iloc[:,1].values #dependent variable vector
# splitting the dataset
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=1/3,random_state=0)
# fitting the regression model
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train,y_train) #actually produces the linear eqn for the data
# predicting the test set results
y_pred = regressor.predict(X_test)
y_pred
y_test
# visualizing the results
#plot for the TRAIN
plt.scatter(X_train, y_train, color='red') # plotting the observation line
plt.plot(X_train, regressor.predict(X_train), color='blue') # plotting the regression line
plt.title("Salary vs Experience (Training set)") # stating the title of the graph
plt.xlabel("Years of experience") # adding the name of x-axis
plt.ylabel("Salaries") # adding the name of y-axis
plt.show() # specifies end of graph
#plot for the TEST
plt.scatter(X_test, y_test, color='red')
plt.plot(X_train, regressor.predict(X_train), color='blue') # plotting the regression line
plt.title("Salary vs Experience (Testing set)")
plt.xlabel("Years of experience")
plt.ylabel("Salaries")
plt.show()
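The plots above validate the model visually. As an additional hedged sketch (reusing the regressor, X_test, y_test and y_pred defined above), the fit can also be scored numerically with the coefficient of determination:
from sklearn.metrics import r2_score
print(r2_score(y_test, y_pred))  # R^2 on unseen test data; closer to 1 is better
print(regressor.score(X_test, y_test))  # equivalent convenience method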
OUTPUT:
RESULT:
Thus, the program for building and validating linear models has been studied and executed,
and the output has been verified successfully.
Ex. No: 11 LOGISTIC REGRESSIONS: HANDWRITING RECOGNITION
Date:
AIM:
To write a Python program for logistic regression: handwriting recognition.
EXPLANATION:
Logistic Regression is a Machine Learning algorithm used to make predictions to find the
value of a dependent variable such as the condition of a tumour (malignant or benign),
classification of email (spam or not spam), or admission into a university (admitted or not
admitted) by learning from independent variables. Logistic Regression is a supervised
Machine Learning algorithm, which means the data provided for training is labelled i.e.,
answers are already provided in the training set. The algorithm learns from those examples
and their corresponding answers (labels) and then uses that to classify new examples. In
mathematical terms, suppose the dependent variable is Y and the set of independent
variables is X, then logistic regression will predict the dependent variable P(Y=1) as a
function of X, the set of independent variables.
It is a technique to analyse a data-set which has a dependent variable and one or more
independent variables to predict the outcome in a binary variable, meaning it will have only
two outcomes. The dependent variable is categorical in nature. Dependent variable is also
referred as target variable and the independent variables are called the predictors. Logistic
regression is a special case of linear regression where we only predict the outcome in a
categorical variable. It predicts the probability of the event using the log function. We use the
Sigmoid function/curve to predict the categorical value. The threshold value decides the
outcome (win/lose). Linear regression equation: y = β0 + β1X1 + β2X2 + … + βnXn.
Y stands for the dependent variable that needs to be predicted.
β0 is the Y-intercept, which is basically the point on the line which touches the y-axis.
β1 is the slope of the line (the slope can be negative or positive depending on the
relationship between the dependent variable and the independent variable.)
X here represents the independent variable that is used to predict our resultant
dependent value.
Sigmoid function: p = 1 / 1 + e-y. Apply sigmoid function on the linear regression equation.
The goal is to find the logistic regression function 𝑝(𝐱) such that the predicted responses
𝑝(𝐱ᵢ) are as close as possible to the actual response 𝑦ᵢ for each observation 𝑖 = 1, …, 𝑛.
Remember that the actual response can be only 0 or 1 in binary classification problems! This
means that each 𝑝(𝐱ᵢ) should be close to either 0 or 1. That's why it's convenient to use the
sigmoid function. Once you have the logistic regression function 𝑝(𝐱), you can use it to
predict the outputs for new and unseen inputs, assuming that the underlying mathematical
dependence is unchanged.
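As a quick illustration of the sigmoid (a minimal sketch, separate from the prescribed
program; the sample logit values are made up):

import numpy as np

def sigmoid(z):
    # squashes any real-valued logit z into the open interval (0, 1)
    return 1 / (1 + np.exp(-z))

print(sigmoid(0))    # 0.5, exactly on the decision boundary
print(sigmoid(4))    # ~0.982, confidently class 1
print(sigmoid(-4))   # ~0.018, confidently class 0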
METHODOLOGY:
Logistic regression is a linear classifier, so you'll use a linear function
𝑓(𝐱) = 𝑏₀ + 𝑏₁𝑥₁ + ⋯ + 𝑏ᵣ𝑥ᵣ, also called the logit. The variables 𝑏₀, 𝑏₁, …, 𝑏ᵣ are the
estimators of the regression coefficients, which are also called the predicted weights or just
coefficients. The logistic regression function 𝑝(𝐱) is the sigmoid function of 𝑓(𝐱):
𝑝(𝐱) = 1 / (1 + exp(−𝑓(𝐱))). As such, it's often close to either 0 or 1. The function 𝑝(𝐱) is
often interpreted as the predicted probability that the output for a given 𝐱 is equal to 1.
Therefore, 1 − 𝑝(𝐱) is the probability that the output is 0. Logistic regression determines
the best predicted weights 𝑏₀, 𝑏₁, …, 𝑏ᵣ such that the function 𝑝(𝐱) is as close as possible
to all actual responses 𝑦ᵢ, 𝑖 = 1, …, 𝑛, where 𝑛 is the number of observations. The process of
calculating the best weights using available observations is called model training or fitting.
To get the best weights, you usually maximize the log-likelihood function (LLF) for all
observations 𝑖 = 1, …, 𝑛. This method is called maximum likelihood estimation and is
represented by the equation LLF = Σᵢ(𝑦ᵢ log(𝑝(𝐱ᵢ)) + (1 − 𝑦ᵢ) log(1 − 𝑝(𝐱ᵢ))).
When 𝑦ᵢ = 0, the LLF for the corresponding observation is equal to log(1 − 𝑝(𝐱ᵢ)). If 𝑝(𝐱ᵢ) is
close to 𝑦ᵢ = 0, then log(1 − 𝑝(𝐱ᵢ)) is close to 0. This is the result you want. If 𝑝(𝐱ᵢ) is far from
0, then log(1 − 𝑝(𝐱ᵢ)) drops significantly. You don't want that result because your goal is to
obtain the maximum LLF. Similarly, when 𝑦ᵢ = 1, the LLF for that observation is 𝑦ᵢ log(𝑝(𝐱ᵢ)).
If 𝑝(𝐱ᵢ) is close to 𝑦ᵢ = 1, then log(𝑝(𝐱ᵢ)) is close to 0. If 𝑝(𝐱ᵢ) is far from 1, then log(𝑝(𝐱ᵢ)) is a
large negative number. Once you determine the best weights that define the function 𝑝(𝐱),
you can get the predicted outputs 𝑝(𝐱ᵢ) for any given input 𝐱ᵢ. For each observation 𝑖 = 1,
…, 𝑛, the predicted output is 1 if 𝑝(𝐱ᵢ) > 0.5 and 0 otherwise. The threshold doesn't have to be
0.5, but it usually is; you might define a lower or higher value if that's more convenient for
your situation. There's one more important relationship between 𝑝(𝐱) and 𝑓(𝐱), which is that
log(𝑝(𝐱) / (1 − 𝑝(𝐱))) = 𝑓(𝐱). This equality explains why 𝑓(𝐱) is the logit. It implies that
𝑝(𝐱) = 0.5 when 𝑓(𝐱) = 0, and that the predicted output is 1 if 𝑓(𝐱) > 0 and 0 otherwise.
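A small numeric sketch of the LLF (illustrative only; the responses and probabilities below
are assumed values, not taken from the digits data):

import numpy as np

y_true = np.array([0, 0, 1, 1])       # assumed actual responses
p = np.array([0.1, 0.4, 0.8, 0.95])   # assumed predicted probabilities p(x_i)

# LLF = sum of y*log(p) + (1 - y)*log(1 - p); a better fit pushes it toward 0
llf = np.sum(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
print(llf)  # about -0.89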
CLASSIFICATION PERFORMANCE:
Binary classification has four possible types of results:
True negatives: correctly predicted negatives (zeros)
True positives: correctly predicted positives (ones)
False negatives: incorrectly predicted negatives (zeros)
False positives: incorrectly predicted positives (ones)
The most straightforward indicator of classification accuracy is the ratio of the number of
correct predictions to the total number of predictions (or observations). Other indicators of
binary classifiers include the following:
The positive predictive value is the ratio of the number of true positives to the sum of
the numbers of true and false positives.
The negative predictive value is the ratio of the number of true negatives to the sum of
the numbers of true and false negatives.
The sensitivity (also known as recall or true positive rate) is the ratio of the number of
true positives to the number of actual positives.
The specificity (or true negative rate) is the ratio of the number of true negatives to the
number of actual negatives.
The most suitable indicator depends on the problem of interest. In this experiment, we
use the most straightforward indicator: classification accuracy.
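A minimal sketch of these indicators computed from assumed counts (the numbers are made up
for illustration):

# assumed counts of the four result types
tp, tn, fp, fn = 40, 45, 5, 10

accuracy = (tp + tn) / (tp + tn + fp + fn)  # 0.85, correct / all predictions
ppv = tp / (tp + fp)                        # ~0.889, positive predictive value
npv = tn / (tn + fn)                        # ~0.818, negative predictive value
sensitivity = tp / (tp + fn)                # 0.8, true positive rate (recall)
specificity = tn / (tn + fp)                # 0.9, true negative rate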
This example is about image recognition. To be more precise, you’ll work on the recognition
of handwritten digits. You’ll use a dataset with 1797 observations, each of which is an image
of one handwritten digit. Each image has 64 pixels, arranged as 8 px wide by 8 px high.
The inputs (𝐱 ) are vectors with 64 dimensions or values. Each input vector describes one
image. Each of the 64 values represents one pixel of the image. The input values are the
integers between 0 and 16, depending on the shade of gray for the corresponding pixel. The
output (𝑦) for each observation is an integer between 0 and 9, consistent with the digit on the
image. There are ten classes in total, each corresponding to one digit.
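To see one observation as an image (a short sketch using the same load_digits data the
program below works with), reshape its 64 values back into an 8 x 8 grid:

import matplotlib.pyplot as plt
from sklearn.datasets import load_digits

x, y = load_digits(return_X_y=True)
print(x.shape)  # (1797, 64): 1797 images, 64 pixel values each
plt.imshow(x[0].reshape(8, 8), cmap='gray_r')  # first image, drawn as 8 x 8
plt.title('Label: {}'.format(y[0]))            # its label is the digit 0
plt.show()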
PROGRAM:
Import Packages:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
Get Data:
x, y = load_digits(return_X_y=True)
x
y
Split Data:
x_train, x_test, y_train, y_test =\
train_test_split(x, y, test_size=0.2, random_state=0)
Scale Data:
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
Create a Model and Train It:
model = LogisticRegression(C=0.05, class_weight=None, dual=False, fit_intercept=True,
                           intercept_scaling=1, l1_ratio=None, max_iter=100,
                           multi_class='ovr', n_jobs=None, penalty='l2', random_state=0,
                           solver='liblinear', tol=0.0001, verbose=0, warm_start=False)
model.fit(x_train, y_train)
Evaluate the Model:
x_test = scaler.transform(x_test)
y_pred = model.predict(x_test)
model.score(x_train, y_train) # accuracy on the training set
model.score(x_test, y_test) # accuracy on the test set
confusion_matrix(y_test, y_pred)
Visualization:
cm = confusion_matrix(y_test, y_pred)
font_size = 15 # shared font size for the axis labels
fig, ax = plt.subplots(figsize=(8, 8))
ax.imshow(cm)
ax.grid(False)
ax.set_xlabel('Predicted outputs', fontsize=font_size, color='black')
ax.set_ylabel('Actual outputs', fontsize=font_size, color='black')
ax.xaxis.set(ticks=range(10))
ax.yaxis.set(ticks=range(10))
ax.set_ylim(9.5, -0.5)
for i in range(10):
    for j in range(10):
        ax.text(j, i, cm[i, j], ha='center', va='center', color='white')
plt.show()
Classification report:
print(classification_report(y_test, y_pred))
OUTPUT:
x
array([[ 0., 0., 5., ..., 0., 0., 0.],
[ 0., 0., 0., ..., 10., 0., 0.],
[ 0., 0., 0., ..., 16., 9., 0.],
...,
[ 0., 0., 1., ..., 6., 0., 0.],
[ 0., 0., 2., ..., 12., 0., 0.],
[ 0., 0., 10., ..., 12., 1., 0.]])
y
array([0, 1, 2, ..., 8, 9, 8])
Confusion matrix:
array([[27, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 32, 0, 0, 0, 0, 1, 0, 1, 1],
[ 1, 1, 33, 1, 0, 0, 0, 0, 0, 0],
[ 0, 0, 1, 28, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 0, 29, 0, 0, 1, 0, 0],
[ 0, 0, 0, 0, 0, 39, 0, 0, 0, 1],
[ 0, 1, 0, 0, 0, 0, 43, 0, 0, 0],
[ 0, 0, 0, 0, 0, 0, 0, 39, 0, 0],
[ 0, 2, 1, 2, 0, 0, 0, 1, 33, 0],
[ 0, 0, 0, 1, 0, 1, 0, 2, 1, 36]])
Classification Report
RESULT:
Thus, the program for Logistic Regression: Handwriting Recognition has been studied and
executed, and the output has been verified successfully.
EX.NO:12 TIME SERIES ANALYSIS
DATE:
AIM:
To write the python program for Time Series Analysis.
EXPLANATION:
Time series is a sequence of observations recorded at regular time intervals. Depending on
the frequency of observations, a time series may typically be hourly, daily, weekly, monthly,
quarterly, or annual. Sometimes you might have second- or minute-wise time series as well,
such as the number of clicks or user visits per minute. Time series analysis comprises
methods for analyzing time series data in order to extract meaningful statistics and other
characteristics of the data. Time series forecasting is the use of a model to predict future
values based on previously observed values. Time series methods are widely applied to
non-stationary data such as economic, weather, stock price, and retail sales data; here we
work with retail sales.
DATASET: Superstore sales data
There are several product categories in the Superstore sales data; we start with time series
analysis and forecasting for furniture sales.
DATA PREPROCESSING:
This step includes removing columns we do not need, checking for missing values, aggregating
sales by date, and so on.
INDEXING WITH TIME SERIES DATA:
Our current datetime data can be tricky to work with. Therefore, we will use the average
daily sales value for each month instead, using the start of each month as the timestamp,
as the sketch below illustrates.
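A minimal sketch of this resampling idea (the daily series below is made up; the real
program applies the same 'MS' month-start rule to the Superstore data):

import pandas as pd

idx = pd.date_range('2017-01-01', periods=60, freq='D')
daily = pd.Series(range(60), index=idx, name='Sales')  # made-up daily sales

# 'MS' groups the observations by calendar month, stamps each group at the
# start of the month, and .mean() gives the average daily sales for the month
monthly = daily.resample('MS').mean()
print(monthly)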
VISUALIZING FURNITURE SALES TIME SERIES DATA:
Some distinguishable patterns appear when we plot the data. The time series has a seasonal
pattern: sales are always low at the beginning of the year and high at the end of the year,
and there is an upward trend within any single year with a couple of low months in the
middle of the year. We can also visualize our data using a method called time-series
decomposition, which decomposes the time series into three distinct components:
trend, seasonality, and noise.
TIME SERIES FORECASTING WITH ARIMA:
One of the most commonly used methods for time-series forecasting is ARIMA, which stands
for AutoRegressive Integrated Moving Average. ARIMA models are denoted ARIMA(p, d, q), where
p is the autoregressive order, d is the degree of differencing, and q is the moving-average
order; together these capture the trend and noise in the data, and the seasonal variant
fitted below (SARIMAX) adds a seasonal order (P, D, Q, s). This step is parameter selection
for our furniture sales ARIMA time series model. Our goal here is to use a "grid search" to
find the optimal set of parameters that yields the best performance for our model.
VALIDATING FORECASTS:
To help us understand the accuracy of our forecasts, we compare the predicted sales to the
real sales of the time series, setting the forecasts to start at 2017-01-01 and run to the
end of the data. The line plot shows the observed values compared to the one-step-ahead
forecast predictions. Overall, our forecasts align with the true values very well, capturing
both the upward trend from the beginning of the year and the seasonality toward the end of
the year.
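The error measures used below are the mean squared error and its square root; a toy sketch
with assumed values:

import numpy as np

y_truth = np.array([100.0, 120.0, 130.0])       # assumed observed sales
y_forecasted = np.array([110.0, 115.0, 128.0])  # assumed forecasted sales

mse = ((y_forecasted - y_truth) ** 2).mean()  # average squared error: 43.0
rmse = np.sqrt(mse)                           # back in sales units: ~6.56
print(round(mse, 2), round(rmse, 2))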
DATA EXPLORATION:
We are going to compare the two categories' sales over the same time period. This means
combining the two data frames into one and plotting both categories' time series on a
single plot.
TIME SERIES MODELING WITH PROPHET:
Released by Facebook in 2017, the forecasting tool Prophet is designed for analyzing time
series that display patterns on different time scales, such as yearly, weekly, and daily. It
also has advanced capabilities for modeling the effects of holidays on a time series and
implementing custom changepoints. Therefore, we use Prophet to get a model up and running.
PROGRAM:
import warnings
import itertools
import numpy as np
import matplotlib.pyplot as plt
warnings.filterwarnings("ignore")
plt.style.use('fivethirtyeight')
import pandas as pd
import statsmodels.api as sm
import matplotlib
matplotlib.rcParams['axes.labelsize'] = 14
matplotlib.rcParams['xtick.labelsize'] = 12
matplotlib.rcParams['ytick.labelsize'] = 12
matplotlib.rcParams['text.color'] = 'k'
df = pd.read_excel("Superstore.xls")
furniture = df.loc[df['Category'] == 'Furniture']
furniture['Order Date'].min(), furniture['Order Date'].max()
cols = ['Row ID', 'Order ID', 'Ship Date', 'Ship Mode', 'Customer ID', 'Customer Name',
        'Segment', 'Country', 'City', 'State', 'Postal Code', 'Region', 'Product ID',
        'Category', 'Sub-Category', 'Product Name', 'Quantity', 'Discount', 'Profit']
furniture.drop(cols, axis=1, inplace=True)
furniture = furniture.sort_values('Order Date')
furniture.isnull().sum()
furniture = furniture.groupby('Order Date')['Sales'].sum().reset_index()
furniture = furniture.set_index('Order Date')
furniture.index
y = furniture['Sales'].resample('MS').mean()
y['2017':]
y.plot(figsize=(15, 6))
plt.show()
from pylab import rcParams
rcParams['figure.figsize'] = 18, 8
decomposition = sm.tsa.seasonal_decompose(y, model='additive')
fig = decomposition.plot()
plt.show()
p = d = q = range(0, 2)
pdq = list(itertools.product(p, d, q))
seasonal_pdq = [(x[0], x[1], x[2], 12) for x in list(itertools.product(p, d, q))]
print('Examples of parameter combinations for Seasonal ARIMA...')
print('SARIMAX: {} x {}'.format(pdq[1], seasonal_pdq[1]))
print('SARIMAX: {} x {}'.format(pdq[1], seasonal_pdq[2]))
print('SARIMAX: {} x {}'.format(pdq[2], seasonal_pdq[3]))
print('SARIMAX: {} x {}'.format(pdq[2], seasonal_pdq[4]))
for param in pdq:
    for param_seasonal in seasonal_pdq:
        try:
            mod = sm.tsa.statespace.SARIMAX(y,
                                            order=param,
                                            seasonal_order=param_seasonal,
                                            enforce_stationarity=False,
                                            enforce_invertibility=False)
            results = mod.fit()
            print('ARIMA{}x{}12 - AIC:{}'.format(param, param_seasonal, results.aic))
        except:
            continue
mod = sm.tsa.statespace.SARIMAX(y,
                                order=(1, 1, 1),
                                seasonal_order=(1, 1, 0, 12),
                                enforce_stationarity=False,
                                enforce_invertibility=False)
results = mod.fit()
print(results.summary().tables[1])
results.plot_diagnostics(figsize=(16, 8))
plt.show()
pred = results.get_prediction(start=pd.to_datetime('2017-01-01'), dynamic=False)
pred_ci = pred.conf_int()
ax = y['2014':].plot(label='observed')
pred.predicted_mean.plot(ax=ax, label='One-step ahead Forecast', alpha=.7, figsize=(14, 7))
ax.fill_between(pred_ci.index,
                pred_ci.iloc[:, 0],
                pred_ci.iloc[:, 1], color='k', alpha=.2)
ax.set_xlabel('Date')
ax.set_ylabel('Furniture Sales')
plt.legend()
plt.show()
y_forecasted = pred.predicted_mean
y_truth = y['2017-01-01':]
mse = ((y_forecasted - y_truth) ** 2).mean()
print('The Mean Squared Error of our forecasts is {}'.format(round(mse, 2)))
print('The Root Mean Squared Error of our forecasts is {}'.format(round(np.sqrt(mse), 2)))
pred_uc = results.get_forecast(steps=100)
pred_ci = pred_uc.conf_int()
ax = y.plot(label='observed', figsize=(14, 7))
pred_uc.predicted_mean.plot(ax=ax, label='Forecast')
ax.fill_between(pred_ci.index,
                pred_ci.iloc[:, 0],
                pred_ci.iloc[:, 1], color='k', alpha=.25)
ax.set_xlabel('Date')
ax.set_ylabel('Furniture Sales')
plt.legend()
plt.show()
furniture = df.loc[df['Category'] == 'Furniture']
office = df.loc[df['Category'] == 'Office Supplies']
furniture.shape, office.shape
cols = ['Row ID', 'Order ID', 'Ship Date', 'Ship Mode', 'Customer ID', 'Customer Name',
        'Segment', 'Country', 'City', 'State', 'Postal Code', 'Region', 'Product ID',
        'Category', 'Sub-Category', 'Product Name', 'Quantity', 'Discount', 'Profit']
furniture.drop(cols, axis=1, inplace=True)
office.drop(cols, axis=1, inplace=True)
furniture = furniture.sort_values('Order Date')
office = office.sort_values('Order Date')
furniture = furniture.groupby('Order Date')['Sales'].sum().reset_index()
office = office.groupby('Order Date')['Sales'].sum().reset_index()
furniture = furniture.set_index('Order Date')
office = office.set_index('Order Date')
y_furniture = furniture['Sales'].resample('MS').mean()
y_office = office['Sales'].resample('MS').mean()
furniture = pd.DataFrame({'Order Date':y_furniture.index, 'Sales':y_furniture.values})
office = pd.DataFrame({'Order Date': y_office.index, 'Sales': y_office.values})
store = furniture.merge(office, how='inner', on='Order Date')
store.rename(columns={'Sales_x': 'furniture_sales', 'Sales_y': 'office_sales'}, inplace=True)
store.head()
plt.figure(figsize=(20, 8))
plt.plot(store['Order Date'], store['furniture_sales'], 'b-', label = 'furniture')
plt.plot(store['Order Date'], store['office_sales'], 'r-', label = 'office supplies')
plt.xlabel('Date'); plt.ylabel('Sales'); plt.title('Sales of Furniture and Office Supplies')
plt.legend();
first_date = store.loc[np.min(list(np.where(store['office_sales'] > store['furniture_sales'])[0])), 'Order Date']
print("The first time office supplies produced higher sales than furniture is {}.".format(first_date.date()))
from fbprophet import Prophet
furniture = furniture.rename(columns={'Order Date': 'ds', 'Sales': 'y'})
furniture_model = Prophet(interval_width=0.95)
furniture_model.fit(furniture)
office = office.rename(columns={'Order Date': 'ds', 'Sales': 'y'})
office_model = Prophet(interval_width=0.95)
office_model.fit(office)
furniture_forecast = furniture_model.make_future_dataframe(periods=36, freq='MS')
furniture_forecast = furniture_model.predict(furniture_forecast)
office_forecast = office_model.make_future_dataframe(periods=36, freq='MS')
office_forecast = office_model.predict(office_forecast)
plt.figure(figsize=(18, 6))
furniture_model.plot(furniture_forecast, xlabel = 'Date', ylabel = 'Sales')
plt.title('Furniture Sales');
plt.figure(figsize=(18, 6))
office_model.plot(office_forecast, xlabel = 'Date', ylabel = 'Sales')
plt.title('Office Supplies Sales');
furniture_names = ['furniture_%s' % column for column in furniture_forecast.columns]
office_names = ['office_%s' % column for column in office_forecast.columns]
merge_furniture_forecast = furniture_forecast.copy()
merge_office_forecast = office_forecast.copy()
merge_furniture_forecast.columns = furniture_names
merge_office_forecast.columns = office_names
forecast = pd.merge(merge_furniture_forecast, merge_office_forecast, how = 'inner', left_on = '
furniture_ds', right_on = 'office_ds')
forecast = forecast.rename(columns={'furniture_ds': 'Date'}).drop('office_ds', axis=1)
forecast.head()
plt.figure(figsize=(10, 7))
plt.plot(forecast['Date'], forecast['furniture_trend'], 'b-')
plt.plot(forecast['Date'], forecast['office_trend'], 'r-')
plt.legend(); plt.xlabel('Date'); plt.ylabel('Sales')
plt.title('Furniture vs. Office Supplies Sales Trend');
plt.figure(figsize=(10, 7))
plt.plot(forecast['Date'], forecast['furniture_yhat'], 'b-')
plt.plot(forecast['Date'], forecast['office_yhat'], 'r-')
plt.legend(); plt.xlabel('Date'); plt.ylabel('Sales')
plt.title('Furniture vs. Office Supplies Estimate');
furniture_model.plot_components(furniture_forecast);
office_model.plot_components(office_forecast);
OUTPUT:
Timestamp('2014-01-06 00:00:00'), Timestamp('2017-12-30 00:00:00')
Data Preprocessing
Indexing with Time Series Data:
DatetimeIndex(['2014-01-06', '2014-01-07', '2014-01-10', '2014-01-11',
'2014-01-13', '2014-01-14', '2014-01-16', '2014-01-19',
'2014-01-20', '2014-01-21',
...
'2017-12-18', '2017-12-19', '2017-12-21', '2017-12-22',
'2017-12-23', '2017-12-24', '2017-12-25', '2017-12-28',
'2017-12-29', '2017-12-30'],
dtype='datetime64[ns]', name='Order Date', length=889,
freq=None)
2017 furniture sales data:
Order Date
2017-01-01 397.602133
2017-02-01 528.179800
2017-03-01 544.672240
2017-04-01 453.297905
2017-05-01 678.302328
2017-06-01 826.460291
2017-07-01 562.524857
2017-08-01 857.881889
2017-09-01 1209.508583
2017-10-01 875.362728
2017-11-01 1277.817759
2017-12-01 1256.298672
Freq: MS, Name: Sales, dtype: float64
Visualizing Furniture Sales Time Series Data
Time series forecasting with ARIMA:
Examples of parameter combinations for Seasonal ARIMA...
SARIMAX: (0, 0, 1) x (0, 0, 1, 12)
SARIMAX: (0, 0, 1) x (0, 1, 0, 12)
SARIMAX: (0, 1, 0) x (0, 1, 1, 12)
SARIMAX: (0, 1, 0) x (1, 0, 0, 12)
ARIMA(0, 0, 1)x(0, 0, 1, 12)12 - AIC:2931.4459685689417
ARIMA(0, 0, 1)x(0, 1, 0, 12)12 - AIC:466.5607429809145
/usr/local/lib/python3.7/dist-packages/statsmodels/base/model.py:512: ConvergenceWarning:
Maximum Likelihood optimization failed to converge. Check mle_retvals
"Check mle_retvals", ConvergenceWarning)
/usr/local/lib/python3.7/dist-packages/statsmodels/base/model.py:512: ConvergenceWarning:
Maximum Likelihood optimization failed to converge. Check mle_retvals
"Check mle_retvals", ConvergenceWarning)
ARIMA(0, 0, 1)x(1, 0, 0, 12)12 - AIC:499.588499811078
ARIMA(0, 0, 1)x(1, 0, 1, 12)12 - AIC:2578.407685878101
ARIMA(0, 0, 1)x(1, 1, 0, 12)12 - AIC:319.9884876946868
ARIMA(0, 1, 0)x(0, 0, 0, 12)12 - AIC:677.8947668259312
ARIMA(0, 1, 0)x(0, 0, 1, 12)12 - AIC:1363.5571341107245
ARIMA(0, 1, 0)x(0, 1, 0, 12)12 - AIC:486.6378567269187
ARIMA(0, 1, 0)x(1, 0, 0, 12)12 - AIC:497.78896630044073
/usr/local/lib/python3.7/dist-packages/statsmodels/base/model.py:512: ConvergenceWarning:
Maximum Likelihood optimization failed to converge. Check mle_retvals
"Check mle_retvals", ConvergenceWarning)
ARIMA(0, 1, 0)x(1, 0, 1, 12)12 - AIC:1379.5770594611533
ARIMA(0, 1, 0)x(1, 1, 0, 12)12 - AIC:319.7714068109212
ARIMA(0, 1, 1)x(0, 0, 0, 12)12 - AIC:649.9056176817331
ARIMA(0, 1, 1)x(0, 0, 1, 12)12 - AIC:2704.9650459821123
ARIMA(0, 1, 1)x(0, 1, 0, 12)12 - AIC:458.87055484827687
ARIMA(0, 1, 1)x(1, 0, 0, 12)12 - AIC:486.18329774425456
ARIMA(0, 1, 1)x(1, 0, 1, 12)12 - AIC:2560.808670239328
ARIMA(0, 1, 1)x(1, 1, 0, 12)12 - AIC:310.75743684172687
ARIMA(1, 0, 0)x(0, 0, 0, 12)12 - AIC:692.1645522067713
ARIMA(1, 0, 0)x(0, 0, 1, 12)12 - AIC:1355.136316958002
ARIMA(1, 0, 0)x(0, 1, 0, 12)12 - AIC:479.4632147852136
/usr/local/lib/python3.7/dist-packages/statsmodels/base/model.py:512: ConvergenceWarning:
Maximum Likelihood optimization failed to converge. Check mle_retvals
"Check mle_retvals", ConvergenceWarning)
ARIMA(1, 0, 0)x(1, 0, 0, 12)12 - AIC:480.92593679352154
ARIMA(1, 0, 0)x(1, 0, 1, 12)12 - AIC:1334.896860563096
/usr/local/lib/python3.7/dist-packages/statsmodels/base/model.py:512: ConvergenceWarning:
Maximum Likelihood optimization failed to converge. Check mle_retvals
"Check mle_retvals", ConvergenceWarning)
ARIMA(1, 0, 0)x(1, 1, 0, 12)12 - AIC:304.4664675084565
ARIMA(1, 0, 1)x(0, 0, 0, 12)12 - AIC:665.7794442185481
ARIMA(1, 0, 1)x(0, 0, 1, 12)12 - AIC:82103.26964285906
ARIMA(1, 0, 1)x(0, 1, 0, 12)12 - AIC:468.36851958149913
ARIMA(1, 0, 1)x(1, 0, 0, 12)12 - AIC:482.5763323876879
ARIMA(1, 0, 1)x(1, 0, 1, 12)12 - AIC:2519.493065167048
ARIMA(1, 0, 1)x(1, 1, 0, 12)12 - AIC:306.0156002122771
ARIMA(1, 1, 0)x(0, 0, 0, 12)12 - AIC:671.2513547541902
ARIMA(1, 1, 0)x(0, 0, 1, 12)12 - AIC:1345.8589896655533
/usr/local/lib/python3.7/dist-packages/statsmodels/base/model.py:512: ConvergenceWarning:
Maximum Likelihood optimization failed to converge. Check mle_retvals
"Check mle_retvals", ConvergenceWarning)
ARIMA(1, 1, 0)x(0, 1, 0, 12)12 - AIC:479.2003422281136
ARIMA(1, 1, 0)x(1, 0, 0, 12)12 - AIC:475.34036587860555
ARIMA(1, 1, 0)x(1, 0, 1, 12)12 - AIC:1912.1819232761209
ARIMA(1, 1, 0)x(1, 1, 0, 12)12 - AIC:300.6270901345412
ARIMA(1, 1, 1)x(0, 0, 0, 12)12 - AIC:649.0318019835189
ARIMA(1, 1, 1)x(0, 0, 1, 12)12 - AIC:2516.1759453415243
ARIMA(1, 1, 1)x(0, 1, 0, 12)12 - AIC:460.4762687609516
/usr/local/lib/python3.7/dist-packages/statsmodels/base/model.py:512: ConvergenceWarning:
Maximum Likelihood optimization failed to converge. Check mle_retvals
"Check mle_retvals", ConvergenceWarning)
ARIMA(1, 1, 1)x(1, 0, 0, 12)12 - AIC:469.5250354660858
ARIMA(1, 1, 1)x(1, 0, 1, 12)12 - AIC:nan
/usr/local/lib/python3.7/dist-packages/statsmodels/base/model.py:512: ConvergenceWarning:
Maximum Likelihood optimization failed to converge. Check mle_retvals
"Check mle_retvals", ConvergenceWarning)
ARIMA(1, 1, 1)x(1, 1, 0, 12)12 - AIC:297.78754395474454
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1          0.0146      0.342      0.043      0.966      -0.655       0.684
ma.L1         -1.0000      0.360     -2.781      0.005      -1.705      -0.295
ar.S.L12      -0.0253      0.042     -0.609      0.543      -0.107       0.056
sigma2      2.958e+04   1.22e-05   2.43e+09      0.000    2.96e+04    2.96e+04
==============================================================================
Validating forecasts
Mean Squared Error of our forecasts:
The Mean Squared Error of our forecasts is 22993.58
Root Mean Squared Error of our forecasts:
The Root Mean Squared Error of our forecasts is 151.64
Producing and visualizing forecasts:
Data Exploration:
Order Date furniture_sales office_sales
0 2014-01-01 480.194231 285.357647
1 2014-02-01 367.931600 63.042588
2 2014-03-01 857.291529 391.176318
3 2014-04-01 567.488357 464.794750
4 2014-05-01 432.049188 324.346545
Time Series Modeling with Prophet:
Compare Forecasts:
Trend and Forecast Visualization:
RESULT:
Thus, the program for Time Series Analysis has been studied and executed, and the
output has been verified successfully.