Lab Manual


SYLLABUS

AD3411 DATA SCIENCE AND ANALYTICS LABORATORY                L T P C
                                                            0 0 4 2

COURSE OBJECTIVES


• To develop data analytic code in Python
• To be able to use Python libraries for handling data
• To develop analytical applications using Python
• To perform data visualization using plots

Tools: Python, NumPy, SciPy, Matplotlib, Pandas, Statsmodels, Seaborn, Plotly, Bokeh

SUGGESTED EXERCISES:

1. Working with Pandas data frames.


2. Basic plots using Matplotlib.
3. Frequency distributions, Averages, Variability.
4. Normal curves, Correlation and scatter plots, Correlation coefficient.
5. Regression.
6. Z-test.
7. T-test.
8. ANOVA.
9. Building and validating linear models.
10. Building and validating logistic models.
11. Time Series Analysis.

TOTAL: 60 PERIODS
HARDWARE
• Standalone desktops with Windows OS

SOFTWARE
• Python with statistical Packages
Tools: Python, NumPy, SciPy, Matplotlib, Pandas, Statsmodels, Seaborn, Plotly, Bokeh; working with NumPy arrays

S.No  Date  Name of the Experiment                                                   Page No  Marks (100)  Staff Signature
1           Working with Pandas data frames
2           Basic plots using Matplotlib
3           Frequency distributions, Averages, Variability
4           Normal curves, Correlation and scatter plots, Correlation coefficient
5           Regression
6           Z-test
7           T-test
8           ANOVA
9           Building and validating linear models
10          Building and validating logistic models
11          Time series analysis
Experiment No:1
WORKING WITH PANDAS DATA FRAMES
Date:

AIM:

To work with Pandas data frames

ALGORITHM:

Step1: Start
Step2: import numpy and pandas module
Step3: Create a dataframe using the dictionary
Step4: Print the output
Step5: Stop

PROGRAM:
import pandas as pd

data = {"calories": [420, 380, 390], "duration": [50, 40, 45]}

# load data into a DataFrame object:
df = pd.DataFrame(data)
print(df.loc[0])
OUTPUT:
calories    420
duration     50
Name: 0, dtype: int64
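As a small extension (not part of the recorded output above), the same DataFrame can be inspected further; the sketch below reuses the calories/duration data, and the derived column name cal_per_min is only illustrative.

import pandas as pd

data = {"calories": [420, 380, 390], "duration": [50, 40, 45]}
df = pd.DataFrame(data)

# print the whole DataFrame
print(df)

# select a single column (returns a Series)
print(df["calories"])

# add a derived column: calories burned per minute (illustrative)
df["cal_per_min"] = df["calories"] / df["duration"]
print(df)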

RESULT:

Thus the working with Pandas data frames was successfully completed.
Experiment No: 2
BASIC PLOTS USING MATPLOTLIB
Date :

AIM:
To draw basic plots in Python using Matplotlib

ALGORITHM:
Step1: Start
Step2: import Matplotlib module
Step3: Create basic plots using Matplotlib
Step4: Print the output
Step5: Stop
PROGRAM:
import matplotlib.pyplot as plt

a = [1, 2, 3, 4, 5]
b = [0, 0.6, 0.2, 15, 10, 8, 16, 21]

plt.plot(a)

# "o" is for circles and "r" is for red
plt.plot(b, "or")
plt.plot(list(range(0, 22, 3)))

# naming the x-axis
plt.xlabel('Day ->')
# naming the y-axis
plt.ylabel('Temp ->')

c = [4, 2, 6, 8, 3, 20, 13, 15]
plt.plot(c, label='4th Rep')

# get the current axes
ax = plt.gca()

# get control over the individual boundary lines of the graph body
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)

# set the bounds of the left boundary line to a fixed range
ax.spines['left'].set_bounds(-3, 40)

# set the interval at which the x-axis places its marks
plt.xticks(list(range(-3, 10)))

# set the interval at which the y-axis places its marks
plt.yticks(list(range(-3, 20, 3)))

# the legend denotes what each colour signifies
ax.legend(['1st Rep', '2nd Rep', '3rd Rep', '4th Rep'])

# annotate writes text on the graph; xy denotes the position
plt.annotate('Temperature V / s Days', xy=(1.01, -2.15))

# gives a title to the graph
plt.title('All Features Discussed')
plt.show()
OUTPUT:

RESULT:

Thus the basic plots using Matplotlib in Python were drawn successfully.
Experiment No: 3a
FREQUENCY DISTRIBUTIONS
Date :

AIM:

To count the frequency of occurrence of each word in a body of text, as is often needed during text
processing.

ALGORITHM:

Step 1: Start the Program


Step 2: Create text file blake-poems.txt
Step 3: Import the word_tokenize function and gutenberg
Step 4: Write the code to count the frequency of occurrence of a word in a body of text
Step 5: Print the result
Step 6: Stop the process
PROGRAM:

from nltk.tokenize import word_tokenize
from nltk.corpus import gutenberg

sample = gutenberg.raw("blake-poems.txt")

token = word_tokenize(sample)
wlist = []

# take the first 50 tokens
for i in range(50):
    wlist.append(token[i])

# count how often each of those tokens occurs in the list
wordfreq = [wlist.count(w) for w in wlist]

print("Pairs\n" + str(list(zip(token, wordfreq))))
OUTPUT:

Pairs
[('[', 1), ('Poems', 1), ('by', 1), ('William', 1), ('Blake', 1), ('1789', 1), (']', 1), ('SONGS', 2),
('OF', 3), ('INNOCENCE', 2), ('AND', 1), ('OF', 3), ('EXPERIENCE', 1), ('and', 1), ('THE', 1),
('BOOK', 1), ('of', 2), ('THEL', 1), ('SONGS', 2), ('OF', 3), ('INNOCENCE', 2), ('INTRODUCTION', 1),
('Piping', 2), ('down', 1), ('the', 1), ('valleys', 1), ('wild', 1), (',', 3), ('Piping', 2), ('songs', 1),
('of', 2), ('pleasant', 1), ('glee', 1), (',', 3), ('On', 1), ('a', 2), ('cloud', 1), ('I', 1), ('saw', 1),
('a', 2), ('child', 1), (',', 3), ('And', 1), ('he', 1), ('laughing', 1), ('said', 1), ('to', 1), ('me', 1),
(':', 1), ('``', 1)]
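NLTK also ships a FreqDist class for this task; the short sketch below is an optional alternative to the list-counting approach above (it assumes the same blake-poems.txt corpus and that the NLTK data packages are installed).

from nltk.tokenize import word_tokenize
from nltk.corpus import gutenberg
from nltk import FreqDist

sample = gutenberg.raw("blake-poems.txt")
tokens = word_tokenize(sample)

# FreqDist counts every token in one pass
fdist = FreqDist(tokens)
print(fdist.most_common(10))   # the 10 most frequent tokens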

RESULT:
Thus the frequency of occurrence of each word in a body of text was counted successfully.
Experiment No: 3b AVERAGES

Date :

AIM:
To compute weighted averages in Python either defining your own functions or using Numpy

ALGORITHM :

Step 1: Start the Program


Step 2: Create the employees_salary table and save as .csv file
Step 3: Import packages (pandas and numpy) and the employees_salary table itself:
Step 4: Calculate weighted sum and average using Numpy Average() Function
Step 5 : Stop the process

PROGRAM:

# Method using the NumPy average() function
# (assumes numpy is imported as np and the employees_salary table
#  has already been loaded into the DataFrame df)

weighted_avg_m3 = round(np.average(df['salary_p_year'],
                                   weights=df['employees_number']), 2)
weighted_avg_m3
OUTPUT:

44225.35
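For completeness, a minimal end-to-end sketch of the algorithm's steps is given below; the file name employees_salary.csv and its column names are assumptions consistent with the fragment above.

import pandas as pd
import numpy as np

# Step 3: load the employees_salary table (file name assumed)
df = pd.read_csv('employees_salary.csv')

# Step 4: weighted average of yearly salary, weighted by employee head count
weighted_avg = round(np.average(df['salary_p_year'],
                                weights=df['employees_number']), 2)
print(weighted_avg)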

RESULT:
Thus the compute weighted averages in Python either defining your own functions
or using Numpy was successfully completed.
Experiment No: 3c VARIABILITY

Date :

AIM:

To write a python program to calculate the variance.

ALGORITHM :

Step 1: Start the Program


Step 2: Import statistics module from statistics import variance
Step 3: Import fractions as parameter values from fractions import Fraction as fr
Step 4: Create tuple of a set of positive and negative numbers
Step 5: Print the variance of each samples
Step 6: Stop the process
PROGRAM:

# Python code to demonstrate variance()
# on varying ranges of data types

# importing the statistics module
from statistics import variance

# importing Fraction for fractional parameter values
from fractions import Fraction as fr

# tuple of a set of positive integers
# (numbers are spread apart, but not very much)
sample1 = (1, 2, 5, 4, 8, 9, 12)

# tuple of a set of negative integers
sample2 = (-2, -4, -3, -1, -5, -6)

# tuple of a set of positive and negative numbers
# (data points are spread apart considerably)
sample3 = (-9, -1, -0, 2, 1, 3, 4, 19)

# tuple of a set of fractional numbers
sample4 = (fr(1, 2), fr(2, 3), fr(3, 4), fr(5, 6), fr(7, 8))

# tuple of a set of floating point values
sample5 = (1.23, 1.45, 2.1, 2.2, 1.9)

# print the variance of each sample
print("Variance of Sample1 is %s" % (variance(sample1)))
print("Variance of Sample2 is %s" % (variance(sample2)))
print("Variance of Sample3 is %s" % (variance(sample3)))
print("Variance of Sample4 is %s" % (variance(sample4)))
print("Variance of Sample5 is %s" % (variance(sample5)))
OUTPUT:

Variance of Sample 1 is 15.80952380952381


Variance of Sample 2 is 3.5
Variance of Sample 3 is 61.125
Variance of Sample 4 is 1/45
Variance of Sample 5 is 0.17613000000000006

RESULT:

Thus the computation for variance was successfully completed.


Experiment No: 4a
NORMAL CURVES
Date :

AIM:

To create a normal curve using a Python program.

ALGORITHM:

Step 1: Start the Program


Step 2: Import packages scipy and call function scipy.stats
Step 3: Import packages numpy, matplotlib and seaborn
Step 4: Create the distribution
Step 5: Visualizing the distribution
Step 6: Stop the process
PROGRAM:

# import required libraries
from scipy.stats import norm
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb

# creating the distribution
data = np.arange(1, 10, 0.01)
pdf = norm.pdf(data, loc=5.3, scale=1)

# visualizing the distribution
sb.set_style('whitegrid')
sb.lineplot(x=data, y=pdf, color='black')
plt.xlabel('Heights')
plt.ylabel('Probability Density')
plt.show()
OUTPUT:

RESULT:

Thus the normal curve using python program was successfully completed.
Experiment No: 4b
CORRELATION AND SCATTER PLOTS
Date :

AIM:

To write a Python program for correlation with a scatter plot.

ALGORITHM :

Step 1: Start the Program


Step 2: Create variable y1, y2
Step 3: Create variable x, y3 using random function
Step 4: plot the scatter plot
Step 5: Print the result
Step 6: Stop the process
PROGRAM:

# Scatterplot and Correlations
import numpy as np
import matplotlib.pyplot as plt

# Data
x = np.random.randn(100)
y1 = x * 5 + 9
y2 = -5 * x
y3 = np.random.randn(100)

# Plot
plt.rcParams.update({'figure.figsize': (10, 8), 'figure.dpi': 100})
plt.scatter(x, y1, label=f'y1, Correlation = {np.round(np.corrcoef(x, y1)[0, 1], 2)}')
plt.scatter(x, y2, label=f'y2, Correlation = {np.round(np.corrcoef(x, y2)[0, 1], 2)}')
plt.scatter(x, y3, label=f'y3, Correlation = {np.round(np.corrcoef(x, y3)[0, 1], 2)}')

# Title, legend and display
plt.title('Scatterplot and Correlations')
plt.legend()
plt.show()
OUTPUT:

RESULT:

Thus the Correlation and scatter plots using python program was successfully completed.
Experiment No: 4c
CORRELATION COEFFICIENT
Date :

AIM:
To write a python program to compute correlation coefficient.

ALGORITHM :

Step 1: Start the Program


Step 2: Import math package
Step 3: Define correlation coefficient function
Step 4: Calculate correlation using formula
Step 5:Print the result
Step 6 : Stop the process
PROGRAM:

# Python program to find the correlation coefficient
import math

# function that returns the correlation coefficient
def correlationCoefficient(X, Y, n):
    sum_X = 0
    sum_Y = 0
    sum_XY = 0
    squareSum_X = 0
    squareSum_Y = 0

    i = 0
    while i < n:
        # sum of elements of array X
        sum_X = sum_X + X[i]

        # sum of elements of array Y
        sum_Y = sum_Y + Y[i]

        # sum of X[i] * Y[i]
        sum_XY = sum_XY + X[i] * Y[i]

        # sum of squares of array elements
        squareSum_X = squareSum_X + X[i] * X[i]
        squareSum_Y = squareSum_Y + Y[i] * Y[i]

        i = i + 1

    # use the formula for calculating the correlation coefficient
    corr = (float)(n * sum_XY - sum_X * sum_Y) / \
           (float)(math.sqrt((n * squareSum_X - sum_X * sum_X) *
                             (n * squareSum_Y - sum_Y * sum_Y)))
    return corr

# Driver code
X = [15, 18, 21, 24, 27]
Y = [25, 25, 27, 31, 32]

# find the size of the arrays
n = len(X)

# function call to correlationCoefficient
print('{0:.6f}'.format(correlationCoefficient(X, Y, n)))
OUTPUT:

0.953463
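As a quick cross-check (not part of the original listing), NumPy's built-in corrcoef gives the same value for these arrays:

import numpy as np

X = [15, 18, 21, 24, 27]
Y = [25, 25, 27, 31, 32]

# the off-diagonal entry of the 2x2 correlation matrix is r(X, Y)
print(round(np.corrcoef(X, Y)[0, 1], 6))   # expected: 0.953463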

RESULT:
Thus the computation for the correlation coefficient was successfully completed.
Experiment No: 5
REGRESSION
Date :

AIM:

To write a Python program for Simple Linear Regression.

ALGORITHM :

Step 1: Start the Program


Step 2: Import numpy and matplotlib package
Step 3: Define coefficient function
Step 4: Calculate cross-deviation and deviation about x
Step 5: Calculate regression coefficients
Step 6: Plot the Linear regression and define main function
Step 7: Print the result
Step 8: Stop the process
PROGRAM:

import numpy as np
import matplotlib.pyplot as plt

def estimate_coef(x, y):
    # number of observations/points
    n = np.size(x)

    # mean of the x and y vectors
    m_x = np.mean(x)
    m_y = np.mean(y)

    # calculating cross-deviation and deviation about x
    SS_xy = np.sum(y * x) - n * m_y * m_x
    SS_xx = np.sum(x * x) - n * m_x * m_x

    # calculating regression coefficients
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1 * m_x

    return (b_0, b_1)

def plot_regression_line(x, y, b):
    # plotting the actual points as a scatter plot
    plt.scatter(x, y, color="m", marker="o", s=30)

    # predicted response vector
    y_pred = b[0] + b[1] * x

    # plotting the regression line
    plt.plot(x, y_pred, color="g")

    # putting labels
    plt.xlabel('x')
    plt.ylabel('y')

    # function to show the plot
    plt.show()

def main():
    # observations / data
    x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])

    # estimating coefficients
    b = estimate_coef(x, y)
    print("Estimated coefficients:\nb_0 = {}\nb_1 = {}".format(b[0], b[1]))

    # plotting the regression line
    plot_regression_line(x, y, b)

if __name__ == "__main__":
    main()
OUTPUT:

Estimated coefficients:
b_0 = -0.0586206896552
b_1 = 1.45747126437

Graph:
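Optionally, the coefficients can be cross-checked with SciPy's built-in linregress (a sketch, assuming SciPy is available); its intercept and slope should agree with the b_0 and b_1 returned by estimate_coef for the same data.

import numpy as np
from scipy import stats

x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])

res = stats.linregress(x, y)
print("intercept (b_0) =", res.intercept)
print("slope     (b_1) =", res.slope)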

RESULT:
Thus the computation for Simple Linear Regression was successfully completed.
Experiment No: 6 Z-TEST
Date :

AIM:

To Perform Z-test

ALGORITHM:

Step 1: Start
Step 2: Import math, numpy and ztest from statsmodels
Step 3: Generate the data, run the z-test and print the result
Step 4: Stop
PROGRAM:

# imports
import math
import numpy as np
from numpy.random import randn
from statsmodels.stats.weightstats import ztest

# Generate a random array of 50 numbers having mean 110 and sd 15,
# similar to the IQ scores data we assume above
mean_iq = 110
sd_iq = 15 / math.sqrt(50)
alpha = 0.05
null_mean = 100
data = sd_iq * randn(50) + mean_iq

# print mean and sd
print('mean=%.2f stdv=%.2f' % (np.mean(data), np.std(data)))

# Now we perform the test. We pass the data; in the value parameter we pass
# the mean under the null hypothesis, and with alternative='larger' we check
# whether the true mean is larger than that value.
ztest_Score, p_value = ztest(data, value=null_mean, alternative='larger')

# The function outputs a z-score and the corresponding p-value. We compare the
# p-value with alpha: if it is less than alpha we reject the null hypothesis,
# otherwise we fail to reject it.
if p_value < alpha:
    print("Reject Null Hypothesis")
else:
    print("Fail to Reject Null Hypothesis")
OUTPUT:

Reject Null Hypothesis

RESULT:
Thus the program for Z-Test case studies has been executed and verified successfully.
Experiment No: 7
T-TEST
Date :

AIM:
To Perform T-test for sampling distribution.

ALGORITHM:

Step 1: Start
Step 2: Import numpy and scipy.stats
Step 3: Calculate the standard deviation and the t-statistic
Step 4: Stop
PROGRAM:

# Importing the required libraries and packages
import numpy as np
from scipy import stats

# Defining two random distributions
# Sample size
N = 10

# Gaussian distributed data with mean = 2 and var = 1
x = np.random.randn(N) + 2
# Gaussian distributed data with mean = 0 and var = 1
y = np.random.randn(N)

# Calculating the standard deviation
# Calculating the variance to get the standard deviation
var_x = x.var(ddof=1)
var_y = y.var(ddof=1)

# Pooled standard deviation
SD = np.sqrt((var_x + var_y) / 2)
print("Standard Deviation =", SD)

# Calculating the t-statistic
tval = (x.mean() - y.mean()) / (SD * np.sqrt(2 / N))

# Comparing with the critical t-value
# Degrees of freedom
dof = 2 * N - 2

# p-value after comparison with the t-statistic
pval = 1 - stats.t.cdf(tval, df=dof)
print("t = " + str(tval))
print("p = " + str(2 * pval))

# Cross-checking using the built-in function from the SciPy package
tval2, pval2 = stats.ttest_ind(x, y)
print("t = " + str(tval2))
print("p = " + str(pval2))
OUTPUT:
Standard Deviation = 0.7642398582227466
t = 4.87688162540348
p = 0.0001212767169695983
t = 4.876881625403479
p = 0.00012127671696957205

RESULT:
Thus the program for T-test case studies has been executed and verified successfully.
Experiment No: 8
ANOVA
Date :

AIM:
To Perform ANOVA test.

ALGORITHM:
Step 1: Start
Step 2: Import scipy
Step 3: Import statsmodels
Step 4: Calculate the ANOVA F and p values
Step 5: Stop
PROGRAM:

# Installing the package
install.packages("dplyr")

# Loading the package
library(dplyr)

# Variance in mean within groups and between groups
boxplot(mtcars$disp ~ factor(mtcars$gear),
        xlab = "gear", ylab = "disp")

# Step 1: Set up the null hypothesis and the alternate hypothesis
# H0: mu1 = mu2 = mu3 (there is no difference between the average
#     displacement for different gears)
# H1: not all means are equal

# Step 2: Calculate the test statistic using the aov function
mtcars_aov <- aov(mtcars$disp ~ factor(mtcars$gear))
summary(mtcars_aov)

# Step 3: Determine the F-critical value
# For a 0.05 significance level, alpha = 0.05

# Step 4: Compare the test statistic with the F-critical value and
# conclude: if p < alpha, reject the null hypothesis
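Note that the listing above is written in R. Since the algorithm calls for scipy and statsmodels, a minimal Python sketch of the same one-way ANOVA idea is added here using scipy.stats.f_oneway; the three sample groups are illustrative values, not data from the manual.

from scipy import stats

# three hypothetical groups (illustrative values only)
group1 = [25, 30, 28, 36, 29]
group2 = [45, 55, 29, 56, 40]
group3 = [30, 29, 33, 37, 27]

# one-way ANOVA: F statistic and p-value
f_stat, p_value = stats.f_oneway(group1, group2, group3)
print("F =", f_stat, " p =", p_value)

# if p_value < 0.05, reject the null hypothesis that all group means are equal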
OUTPUT:

RESULT:
Thus the program for ANOVA case studies has been executed and verified successfully.
Experiment No: 9
BUILDING AND VALIDATING LINEAR MODELS
Date :

AIM:
To Perform Linear Regression

ALGORITHM

Step 1: Start
Step 2: Import numpy, pandas, seaborn, matplotlib and sklearn
Step 3: Calculate the linear regression using the appropriate functions
Step 4: Display the result
Step 5: Stop
PROGRAM:

# Importing the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_boston

sns.set(style="ticks", color_codes=True)
plt.rcParams['figure.figsize'] = (8, 5)
plt.rcParams['figure.dpi'] = 150

# loading the data
boston = load_boston()

# You can check the available keys with the following code:
print(boston.keys())

The output will be as follows:

dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])

print(boston.DESCR)

You will find these details in the output:

Attribute Information (in order):
- CRIM     per capita crime rate by town
- ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS    proportion of non-retail business acres per town
- CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX      nitric oxides concentration (parts per 10 million)
- RM       average number of rooms per dwelling
- AGE      proportion of owner-occupied units built prior to 1940
- DIS      weighted distances to five Boston employment centres
- RAD      index of accessibility to radial highways
- TAX      full-value property-tax rate per $10,000
- PTRATIO  pupil-teacher ratio by town
- B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT    % lower status of the population
- MEDV     Median value of owner-occupied homes in $1000's
Missing Attribute Values: None

df = pd.DataFrame(boston.data, columns=boston.feature_names)
df.head()

# print the columns present in the dataset
print(df.columns)

# print the top 5 rows in the dataset
print(df.head())
OUTPUT:
First five records from data set

# plotting a heatmap for the overall data set
sns.heatmap(df.corr(), square=True, cmap='RdYlGn')

Heat map of overall data set

So let's plot a regression plot to see the correlation between RM and MEDV.

sns.lmplot(x='RM', y='MEDV', data=df)

Regression plot with RM and MEDV
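The listing above stops at the exploratory plots; a minimal sketch of actually fitting and validating a linear model (the subject of this experiment) is added below. It reuses the boston data and df built above, chooses RM as the single predictor shown in the regression plot, and uses a scikit-learn train/test split; none of this appears in the original listing.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# target column: median home value from the dataset
df['MEDV'] = boston.target

X = df[['RM']]      # single predictor chosen from the heatmap / lmplot
y = df['MEDV']

# hold out 20% of the rows for validation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression()
model.fit(X_train, y_train)

# validate on the unseen test split
y_pred = model.predict(X_test)
print("R^2 on test data:", r2_score(y_test, y_pred))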

RESULT:
Thus the program for Linear Regression has been executed and verified successfully.
Experiment No: 10
BUILDING AND VALIDATING LOGISTIC MODELS
Date :

AIM:
To Perform Logistic Regression

ALGORITHM:

Step 1: Start
Step 2: Import numpy, pandas, seaborn, matplotlib and sklearn
Step 3: Calculate the logistic regression using the appropriate functions
Step 4: Display the result
Step 5: Stop

PROGRAM:
Building the Logistic Regression model:

# importing libraries
import statsmodels.api as sm
import pandas as pd

# loading the training dataset
df = pd.read_csv('logit_train1.csv', index_col=0)

# defining the dependent and independent variables
Xtrain = df[['gmat', 'gpa', 'work_experience']]
ytrain = df[['admitted']]

# building the model and fitting the data
log_reg = sm.Logit(ytrain, Xtrain).fit()
OUTPUT :
Optimization terminated successfully.
         Current function value: 0.352707
         Iterations 8

# printing the summary table
print(log_reg.summary())

                           Logit Regression Results
==============================================================================
Dep. Variable:               admitted   No. Observations:                  30
Model:                          Logit   Df Residuals:                      27
Method:                           MLE   Df Model:                           2
Date:                Wed, 15 Jul 2020   Pseudo R-squ.:                 0.4912
Time:                        16:09:17   Log-Likelihood:               -10.581
converged:                       True   LL-Null:                      -20.794
Covariance Type:            nonrobust   LLR p-value:                3.668e-05
===================================================================================
                      coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------
gmat               -0.0262      0.011     -2.383      0.017      -0.048      -0.005
gpa                 3.9422      1.964      2.007      0.045       0.092       7.792
work_experience     1.1983      0.482      2.487      0.013       0.254       2.143
===================================================================================

Predicting on New Data :

# loading the testing dataset
df = pd.read_csv('logit_test1.csv', index_col=0)

# defining the dependent and independent variables
Xtest = df[['gmat', 'gpa', 'work_experience']]
ytest = df['admitted']

# performing predictions on the test dataset
yhat = log_reg.predict(Xtest)
prediction = list(map(round, yhat))

# comparing original and predicted values of y
print('Actual values :', list(ytest.values))
print('Predictions :', prediction)

OUTPUT:
Optimization terminated successfully.
Current function value: 0.352707
Iterations 8
Actual values [0, 0, 0, 0, 0, 1, 1, 0, 1, 1]
Predictions : [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
Testing the accuracy of the model :

from sklearn.metrics import (confusion_matrix, accuracy_score)

# confusion matrix
cm = confusion_matrix(ytest, prediction)
print("Confusion Matrix : \n", cm)

# accuracy score of the model
print('Test accuracy = ', accuracy_score(ytest, prediction))

OUTPUT:

Confusion Matrix :
[[6 0]
[2 2]]
Test accuracy = 0.8

RESULT:
Thus the program for Logistic Regression has been executed and verified successfully.
Experiment No: 11
TIME SERIES ANALYSIS

Date:

AIM:

To Perform Time series analysis.

ALGORITHM:

Step 1: Start

Step 2: Import numpy, pandas, matplotlib and statsmodels

Step 3: Draw the plot

Step 4: Display the plot

Step 5: Stop
PROGRAM:

We are using the Superstore sales data.

import warnings
import itertools
import numpy as np
import matplotlib.pyplot as plt
warnings.filterwarnings("ignore")
plt.style.use('fivethirtyeight')
import pandas as pd
import statsmodels.api as sm
import matplotlib

matplotlib.rcParams['axes.labelsize'] = 14
matplotlib.rcParams['xtick.labelsize'] = 12
matplotlib.rcParams['ytick.labelsize'] = 12
matplotlib.rcParams['text.color'] = 'k'

We start from time series analysis and forecasting for furniture sales.

df = pd.read_excel("Superstore.xls")
furniture = df.loc[df['Category'] == 'Furniture']

A good four years of furniture sales data:

furniture['Order Date'].min(), furniture['Order Date'].max()
(Timestamp('2014-01-06 00:00:00'), Timestamp('2017-12-30 00:00:00'))

Data Preprocessing
This step includes removing columns we do not need, checking for missing values, aggregating sales by date, and so on.

cols = ['Row ID', 'Order ID', 'Ship Date', 'Ship Mode', 'Customer ID',
        'Customer Name', 'Segment', 'Country', 'City', 'State', 'Postal Code',
        'Region', 'Product ID', 'Category', 'Sub-Category', 'Product Name',
        'Quantity', 'Discount', 'Profit']

furniture.drop(cols, axis=1, inplace=True)
furniture = furniture.sort_values('Order Date')
furniture.isnull().sum()
furniture = furniture.groupby('Order Date')['Sales'].sum().reset_index()

Order Date    0
Sales         0
dtype: int64

Figure 1

Indexing with Time Series Data

furniture = furniture.set_index('Order Date')
furniture.index

Figure 2

We will use the average daily sales value for each month instead, and we are using the start of each month as the timestamp.

y = furniture['Sales'].resample('MS').mean()

Have a quick peek at the 2017 furniture sales data:

y['2017':]

Figure 3
OUTPUT:

Visualizing Furniture Sales Time Series Data


y.plot(figsize=(15, 6))
plt.show()
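The listing ends with the plot above; as an optional further step of the analysis, the monthly series y can be decomposed into trend, seasonal, and residual components (a sketch reusing the y built above):

import statsmodels.api as sm
import matplotlib.pyplot as plt

# additive decomposition of the resampled monthly series y
decomposition = sm.tsa.seasonal_decompose(y, model='additive')
decomposition.plot()
plt.show()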

RESULT:

Thus the program for Time series analysis has been executed and verified successfully.
