DSA Lab Manual Programs - Final

The document outlines a series of experiments related to Data Science and Analytics, detailing aims, algorithms, and Python programs for various tasks such as working with Pandas data frames, creating plots with Matplotlib, and performing statistical analyses including regression and correlation. Each experiment includes a program example and results confirming successful execution. The experiments cover a range of topics essential for data analysis, including data manipulation, visualization, and statistical testing.



LIST OF EXPERIMENTS

1. Working with Pandas data frames
2. Basic plots using Matplotlib
3. Frequency distributions, averages, variability
4. Normal curves, correlation and scatter plots, correlation coefficient
5. Regression
6. Z-test
7. T-test
8. ANOVA
9. Building and validating linear models
10. Building and validating logistic models
11. Time series analysis
12. Content beyond syllabus 1: Heatmaps
13. Content beyond syllabus 2: Interactive visualization with Bokeh


Exp No : 1 WORKING WITH PANDAS DATA FRAMES


Date:

AIM:
To write a python program to work with pandas data frames

ALGORITHM:
1. Import the pandas library: To use pandas, you need to import the library first.
2. Read data into a pandas data frame: Use the read_csv(), read_excel(), or read_sql() functions
to read data from a file or database into a pandas data frame.
3. Explore the data: Use functions like head(), tail(), info(), describe(), and shape to get a sense
of the data you're working with.
4. Select and filter data: Use indexing and slicing to select subsets of the data. You can use the
loc[] and iloc[] methods to select rows and columns based on labels or indices. You can also
filter rows based on conditions using the query() or loc[] method.
5. Manipulate the data: Use pandas functions to manipulate the data, such as adding or dropping
columns, renaming columns, and aggregating data.
6. Handle missing data: Use functions like isnull(), dropna(), fillna(), and interpolate() to handle
missing data in the data frame.
7. Save the data: Use the to_csv(), to_excel(), or to_sql() methods to save the data frame to a file
or database.
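The program that follows illustrates steps 4 and 5 on an in-memory data frame. As a complement, here is a minimal, hedged sketch of steps 2, 3, 6, and 7; the file name students.csv and its Age column are hypothetical examples, not part of the original program:

import pandas as pd

# Step 2: read data from a CSV file ("students.csv" is a hypothetical example file)
df = pd.read_csv("students.csv")

# Step 3: explore the data
print(df.head())
print(df.describe())

# Step 6: handle missing data, here by filling gaps in Age with the column mean
df["Age"] = df["Age"].fillna(df["Age"].mean())

# Step 7: save the cleaned data frame
df.to_csv("students_clean.csv", index=False)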


Program:
import pandas as pd

# Create a data frame from a dictionary of lists
data = {'Name': ['Alice', 'Bob', 'Charlie', 'Dave', 'Emily'],
        'Age': [25, 30, 35, 40, 45],
        'Country': ['USA', 'Canada', 'USA', 'Canada', 'USA']}
df = pd.DataFrame(data)

# Print the data frame
print("Original data frame:")
print(df)

# Select rows based on a condition
print("\nSelect rows where Age > 30:")
print(df[df['Age'] > 30])

# Add a new column
df['Gender'] = ['F', 'M', 'M', 'M', 'F']
print("\nData frame with Gender column:")
print(df)

# Group the data by Country and calculate the mean age
grouped = df.groupby('Country')
mean_age = grouped['Age'].mean()
print("\nMean age by Country:")
print(mean_age)

# Sort the data frame by Age in descending order
sorted_df = df.sort_values('Age', ascending=False)
print("\nData frame sorted by Age in descending order:")
print(sorted_df)


OUTPUT:
Original data frame:
Name Age Country
0 Alice 25 USA
1 Bob 30 Canada
2 Charlie 35 USA
3 Dave 40 Canada
4 Emily 45 USA
Select rows where Age > 30:
Name Age Country
2 Charlie 35 USA
3 Dave 40 Canada
4 Emily 45 USA

Data frame with Gender column:


Name Age Country Gender
0 Alice 25 USA F
1 Bob 30 Canada M
2 Charlie 35 USA M
3 Dave 40 Canada M
4 Emily 45 USA F

Mean age by Country:


Country
Canada 35.0
USA 35.0
Name: Age, dtype: float64
Data frame sorted by Age in descending order:
Name Age Country Gender
4 Emily 45 USA F
3 Dave 40 Canada M
2 Charlie 35 USA M
1 Bob 30 Canada M
0 Alice 25 USA F

RESULT:
Thus the python program to work with pandas dataframes is executed successfully.


Exp No : 2 BASIC PLOTS USING MATPLOTLIB


Date:

AIM:
To write a python program to draw basic plots using matplotlib.

ALGORITHM:
1. Import the necessary libraries: You'll need to import both NumPy and Matplotlib in order to
create plots.
2. Create data: You need data to plot! Create your data as NumPy arrays.
3. Create a figure and axes: Before you can plot anything, you need to create a figure and axes.
The figure is the canvas that holds your plot, while the axes are the actual plot area.
4. Plot the data: Now it's time to plot the data. You can do this using the plot() function.
5. Customize the plot: You can customize your plot in many ways, including adding a title,
changing the axis labels, and adding a legend.
6. Save the plot: Finally, you can save your plot to a file using the savefig() function.
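The program that follows exercises steps 4 and 5 in detail. The explicit figure/axes creation of step 3 and the savefig() call of step 6 do not appear there, so here is a minimal sketch of those two steps; the data and the file name sine.png are made up for illustration:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)       # step 2: create data
y = np.sin(x)

fig, ax = plt.subplots()          # step 3: create a figure and axes
ax.plot(x, y, label="sin(x)")     # step 4: plot the data

ax.set_title("Sine wave")         # step 5: customize the plot
ax.set_xlabel("x")
ax.set_ylabel("sin(x)")
ax.legend()

fig.savefig("sine.png")           # step 6: save the plot to a file
plt.show()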


PROGRAM:
import matplotlib.pyplot as plt

a = [1, 2, 3, 4, 5]
b = [0, 0.6, 0.2, 15, 10, 8, 16, 21]
plt.plot(a)

# "o" is for circles and "r" is for red
plt.plot(b, "or")
plt.plot(list(range(0, 22, 3)))

# naming the x-axis
plt.xlabel('Day ->')
# naming the y-axis
plt.ylabel('Temp ->')

c = [4, 2, 6, 8, 3, 20, 13, 15]
plt.plot(c, label='4th Rep')

# get the current axes
ax = plt.gca()

# take control of the individual boundary lines of the graph body
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)

# fix the range (bounds) of the left boundary line
ax.spines['left'].set_bounds(-3, 40)

# set the marks (ticks) on the x-axis
plt.xticks(list(range(-3, 10)))

# set the marks (ticks) on the y-axis
plt.yticks(list(range(-3, 20, 3)))

# the legend indicates which color signifies which series
ax.legend(['1st Rep', '2nd Rep', '3rd Rep', '4th Rep'])

# annotate() writes text on the graph; xy gives the position
plt.annotate('Temperature V/s Days', xy=(1.01, -2.15))

# give a title to the graph
plt.title('All Features Discussed')
plt.show()

OUTPUT

RESULT:
Thus the python program to draw basic plots using matplotlib is executed successfully.


Exp No : 3 FREQUENCY DISTRIBUTIONS, AVERAGES, VARIABILITY


Date:

AIM:
To write a python program for finding out frequency distributions, averages, variability.

ALGORITHM:
1. Initialize an empty dictionary called frequency_distribution.
2. Calculate the sum of the data and store it in a variable called total_sum
3. Calculate the length of the data and store it in a variable called n
4. Calculate the mean by dividing total_sum by n and store it in a variable called mean
5. Calculate the sum of the squared differences between each value in the data and the mean and
store it in a variable called squared_diff_sum
6. Calculate the variance by dividing squared_diff_sum by n-1 and store it in a variable called
variance
7. Loop over each value in the data:
   a. If the value is not in the frequency_distribution dictionary, add it with a value of 1.
   b. If the value is already in the frequency_distribution dictionary, increment its value by 1.
8. Return the frequency_distribution, mean, and variance
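The program that follows computes the averages and variability with NumPy shortcuts. For completeness, here is a small sketch that follows the algorithm above literally in plain Python; the function name describe is our own choice:

def describe(data):
    frequency_distribution = {}                              # step 1
    total_sum = sum(data)                                    # step 2
    n = len(data)                                            # step 3
    mean = total_sum / n                                     # step 4
    squared_diff_sum = sum((x - mean) ** 2 for x in data)    # step 5
    variance = squared_diff_sum / (n - 1)                    # step 6: sample variance
    for value in data:                                       # step 7
        frequency_distribution[value] = frequency_distribution.get(value, 0) + 1
    return frequency_distribution, mean, variance            # step 8

print(describe([2, 4, 4, 4, 5, 5, 7, 9]))

Note that step 6 divides by n-1 (sample variance), whereas NumPy's var() in the program below divides by n by default, so the two variances differ slightly.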


PROGRAM:
# Importing the NumPy module
import numpy as np

# Getting the average of a list
values = [2, 40, 2, 502, 177, 7, 9]
# Calculating the average using average()
print(np.average(values))

# Getting the variance of a list
values = [2, 4, 4, 4, 5, 5, 7, 9]
# Calculating the variance using var()
print(np.var(values))

# Getting the standard deviation of a list
values = [290, 124, 127, 899]
# Calculating the standard deviation using std()
print(np.std(values))

OUTPUT:
105.57142857142857
4.0
318.35750344541907

RESULT:
Thus the python program for finding out frequency distributions, averages, variability is executed
successfully.


Exp No : 4 NORMAL CURVES, CORRELATION AND SCATTER PLOTS, CORRELATION COEFFICIENT


Date:

AIM:
The aim of finding normal curves, correlation and scatter plots, and correlation coefficients is to better
understand the relationship between variables in a dataset.

ALGORITHM:
Normal Curves:
To find a normal curve, we need to first calculate the mean and standard deviation of the dataset.
Then, we can use a statistical software or calculator to generate a graph of the normal distribution.
The algorithm for finding a normal curve is as follows:
a. Calculate the mean (µ) and standard deviation (σ) of the dataset.
b. Use a statistical software or calculator to generate a graph of the normal distribution.
c. Plot the normal curve on a graph with the x-axis representing the variable of interest and the
y-axis representing the frequency or probability.

Scatter Plots:
To create a scatter plot, we need to first collect data for two variables that we want to investigate.
Then, we can plot the data points on a graph and look for patterns or trends. The algorithm for
creating a scatter plot is as follows:
a. Collect data for two variables.
b. Plot the data points on a graph with one variable on the x-axis and the other variable on the y-
axis.
c. Look for patterns or trends in the data points.

Correlation Coefficient:
The correlation coefficient is a measure of the strength and direction of the relationship between two
variables. To calculate the correlation coefficient, we need to use a statistical formula or software. The
algorithm for calculating the correlation coefficient is as follows:
a. Collect data for two variables.
b. Calculate the mean (µ) and standard deviation (σ) of each variable.
c. Calculate the covariance (cov) between the two variables.
d. Calculate the correlation coefficient (r) using the formula:
r = cov / (σ1 * σ2)
where cov is the covariance between the two variables, σ1 is the standard deviation of variable 1, and
σ2 is the standard deviation of variable 2.
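The two programs below cover the normal-curve part. For the scatter plot and correlation coefficient, here is a minimal sketch using NumPy's built-in corrcoef(); the data values are made up purely for illustration:

import numpy as np
import matplotlib.pyplot as plt

# two small made-up variables
x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([2, 4, 5, 4, 5, 7])

# scatter plot of y against x
plt.scatter(x, y)
plt.xlabel("x")
plt.ylabel("y")
plt.show()

# Pearson correlation coefficient r = cov / (σ1 * σ2)
r = np.corrcoef(x, y)[0, 1]
print("r =", r)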


PROGRAM 1:
import numpy as np
import matplotlib.pyplot as plt

# Generating some random data for an example
data = np.random.normal(170, 10, 250)

# Plotting the histogram
plt.hist(data, bins=25, density=True, alpha=0.6, color='b')

plt.show()

Output:

PROGRAM 2:
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

# Generate some data for this demonstration.
data = np.random.normal(170, 10, 250)

# Fit a normal distribution to the data:
# mean and standard deviation
mu, std = norm.fit(data)

# Plot the histogram.
plt.hist(data, bins=25, density=True, alpha=0.6, color='b')

# Plot the PDF.
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = norm.pdf(x, mu, std)
plt.plot(x, p, 'k', linewidth=2)

title = "Fit Values: {:.2f} and {:.2f}".format(mu, std)
plt.title(title)

plt.show()

OUTPUT:

RESULT:

Thus the python program for finding out normal curves, correlation and scatter plots, and the correlation
coefficient is executed successfully.


Exp No : 5 REGRESSION
Date:

AIM:
The aim of regression analysis is to examine the relationship between a dependent variable and one or
more independent variables.

ALGORITHM:
1. Collect data: Collect data on the dependent variable and one or more independent variables.
2. Check for linearity: Check whether there is a linear relationship between the dependent
variable and the independent variables. This can be done by plotting the data on a scatter plot
and looking for a linear pattern.
3. Determine the regression equation: Determine the regression equation that best fits the data.
This can be done using the least squares method.
4. Test the regression equation: Test the regression equation to see if it accurately predicts the
value of the dependent variable. This can be done by comparing the predicted values with the
actual values.
5. Interpret the results: Interpret the results to draw conclusions about the relationship between
the dependent variable and the independent variables.
6. The formula for the regression equation is: y = b0 + b1x1 + b2x2 + ... + bnxn

PROGRAM:
import numpy as np
import matplotlib.pyplot as plt

def estimate_coef(x, y):
    # number of observations/points
    n = np.size(x)
    # mean of the x and y vectors
    m_x = np.mean(x)
    m_y = np.mean(y)
    # calculating cross-deviation and deviation about x
    SS_xy = np.sum(y*x) - n*m_y*m_x
    SS_xx = np.sum(x*x) - n*m_x*m_x
    # calculating regression coefficients
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1*m_x
    return (b_0, b_1)


def plot_regression_line(x, y, b):
    # plotting the actual points as a scatter plot
    plt.scatter(x, y, color="m", marker="o", s=30)
    # predicted response vector
    y_pred = b[0] + b[1]*x
    # plotting the regression line
    plt.plot(x, y_pred, color="g")
    # putting labels
    plt.xlabel('x')
    plt.ylabel('y')
    # function to show plot
    plt.show()

def main():
    # observations / data
    x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])
    # estimating coefficients
    b = estimate_coef(x, y)
    print("Estimated coefficients:\nb_0 = {}\nb_1 = {}".format(b[0], b[1]))
    # plotting regression line
    plot_regression_line(x, y, b)

main()


OUTPUT:

RESULT:

Thus the python program for finding out the regression is executed successfully.


Exp No: 6 Z-TEST


Date :

AIM:
The aim of a z-test is to determine whether the mean of a sample is statistically significantly different
from the known or hypothesized population mean.

ALGORITHM:
1. State the null and alternative hypotheses: The null hypothesis (H0) is that there is no
significant difference between the sample mean and the population mean, while the
alternative hypothesis (Ha) is that there is a significant difference.
2. Determine the level of significance: Choose the level of significance, α, that will be used to
test the hypothesis. Typically, α is set at 0.05 or 0.01.
3. Collect data: Collect a random sample from the population of interest, and calculate the
sample mean and sample standard deviation.
4. Calculate the test statistic: Calculate the z-test statistic using the formula: z = (x̄ - μ) / (σ / √n),
where x̄ is the sample mean, μ is the population mean, σ is the population standard deviation, and
n is the sample size.
5. Determine the critical value: Determine the critical value of z at the chosen level of significance.
6. Compare the test statistic to the critical value: If the test statistic is greater than the critical
value, reject the null hypothesis. If the test statistic is less than the critical value, fail to reject
the null hypothesis.
7. Interpret the results: If the null hypothesis is rejected, it can be concluded that the sample
mean is significantly different from the population mean at the chosen level of significance. If
the null hypothesis is not rejected, it can be concluded that there is not enough evidence to
support the alternative hypothesis.
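The program that follows uses the ztest() helper from statsmodels. As a cross-check, the test statistic of step 4 can also be computed directly; this short sketch uses assumed illustrative values rather than real data:

from math import sqrt
from scipy.stats import norm

x_bar, mu, sigma, n = 110.5, 100, 15, 50   # assumed illustrative values
z = (x_bar - mu) / (sigma / sqrt(n))       # step 4: z-test statistic
p_value = 1 - norm.cdf(z)                  # one-sided ('larger') p-value
print("z =", z, "p =", p_value)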

PROGRAM:
import math
import numpy as np
from numpy.random import randn
from statsmodels.stats.weightstats import ztest

# Generate a random array of 50 numbers with mean 110, similar to the IQ-score
# data we assume above (sd_iq = 15/√50 is the standard error for sd 15 and n = 50)
mean_iq = 110
sd_iq = 15/math.sqrt(50)
alpha = 0.05
null_mean = 100
data = sd_iq*randn(50) + mean_iq

# print mean and sd
print('mean=%.2f stdv=%.2f' % (np.mean(data), np.std(data)))

# Now we perform the test. We pass the data, the mean under the null hypothesis
# in the value parameter, and alternative='larger' to test whether the true
# mean is larger than the null mean.
ztest_Score, p_value = ztest(data, value=null_mean, alternative='larger')

# The function returns a z-score and the corresponding p-value; we compare the
# p-value with alpha: if it is smaller than alpha we reject the null
# hypothesis, else we fail to reject it.
if p_value < alpha:
    print("Reject Null Hypothesis")
else:
    print("Fail to Reject Null Hypothesis")

OUTPUT:
mean=110.55 stdv=1.50
Reject Null Hypothesis

RESULT:
Thus the python program for performing z-test is executed successfully.


Exp No: 7 T-TEST


Date:

AIM:
The aim of a t-test is to determine whether the mean of a sample is statistically significantly different
from the hypothesized population mean.

ALGORITHM:
The algorithm for a one-sample t-test involves the following steps:
1. State the null and alternative hypotheses: The null hypothesis (H0) is that there is no
significant difference between the sample mean and the population mean, while the
alternative hypothesis (Ha) is that there is a significant difference.
2. Determine the level of significance: Choose the level of significance, α, that will be used to
test the hypothesis. Typically, α is set at 0.05 or 0.01.
3. Collect data: Collect a random sample from the population of interest, and calculate the
sample mean and sample standard deviation.
4. Calculate the test statistic: Calculate the t-test statistic using the formula: t = (x̄ - μ) / (s / √n),
where x̄ is the sample mean, μ is the hypothesized population mean, s is the sample standard
deviation, and n is the sample size.
5. Determine the degrees of freedom: Determine the degrees of freedom for the t-distribution
using the formula: df = n - 1.
6. Determine the critical value: Determine the critical value of t at the chosen level of
significance and degrees of freedom.
7. Compare the test statistic to the critical value: If the absolute value of the test statistic is
greater than the critical value, reject the null hypothesis. If the absolute value of the test
statistic is less than the critical value, fail to reject the null hypothesis.
8. Interpret the results: If the null hypothesis is rejected, it can be concluded that the sample
mean is significantly different from the hypothesized population mean at the chosen level of
significance. If the null hypothesis is not rejected, it can be concluded that there is not enough
evidence to support the alternative hypothesis.

PROGRAM:
# Importing the required libraries and packages
import numpy as np
from scipy import stats

# Defining two random distributions
# Sample size
N = 10
# Gaussian distributed data with mean = 2 and var = 1
x = np.random.randn(N) + 2
# Gaussian distributed data with mean = 0 and var = 1
y = np.random.randn(N)

# Calculating the variance to get the standard deviation
var_x = x.var(ddof=1)
var_y = y.var(ddof=1)
# Pooled standard deviation
SD = np.sqrt((var_x + var_y) / 2)
print("Standard Deviation =", SD)

# Calculating the t-statistic
tval = (x.mean() - y.mean()) / (SD * np.sqrt(2 / N))

# Comparing with the critical t-value
# Degrees of freedom
dof = 2 * N - 2
# p-value after comparison with the t-statistic
pval = 1 - stats.t.cdf(tval, df=dof)
print("t = " + str(tval))
print("p = " + str(2 * pval))

# Cross-checking using the built-in function from the SciPy package
tval2, pval2 = stats.ttest_ind(x, y)
print("t = " + str(tval2))
print("p = " + str(pval2))

OUTPUT:
Standard Deviation = 0.7642398582227466
t = 4.87688162540348
p = 0.0001212767169695983
t = 4.876881625403479
p = 0.00012127671696957205

RESULT:
Thus the python program for performing t-test is executed successfully.


Exp No: 8 ANOVA


Date:

AIM:
The aim of an ANOVA (Analysis of Variance) is to determine whether there is a significant
difference between the means of three or more groups.

ALGORITHM:
The algorithm for a one-way ANOVA involves the following steps:
1. State the null and alternative hypotheses: The null hypothesis (H0) is that there is no
significant difference between the means of the groups, while the alternative hypothesis (Ha)
is that there is a significant difference.
2. Determine the level of significance: Choose the level of significance, α, that will be used to
test the hypothesis. Typically, α is set at 0.05 or 0.01.
3. Collect data: Collect data from three or more groups, and calculate the mean and variance for
each group.
4. Calculate the sum of squares between groups using the formula: SS_between = ∑ n_i (x̄_i - x̄)²,
where n_i is the sample size for group i, x̄_i is the mean of group i, and x̄ is the overall mean.
5. Calculate the sum of squares within groups using the formula: SS_within = ∑_j ∑_i (x_ij - x̄_j)²,
where x_ij is the value of the ith observation in the jth group and x̄_j is the mean of group j.
6. Calculate the F-statistic using the formula: F = (SS_between / (k-1)) / (SS_within / (N-k)),
where k is the number of groups and N is the total number of observations.
7. Determine the critical value: Determine the critical value of F at the chosen level of
significance and degrees of freedom.
8. Compare the F-statistic to the critical value: If the F-statistic is greater than the critical value,
reject the null hypothesis. If the F-statistic is less than the critical value, fail to reject the null
hypothesis.
9. Interpret the results: If the null hypothesis is rejected, it can be concluded that there is a
significant difference between the means of the groups at the chosen level of significance. If
the null hypothesis is not rejected, it can be concluded that there is not enough evidence to
support the alternative hypothesis.
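The program that follows relies on SciPy's f_oneway() for the one-way test. To connect it with the formulas above, this sketch computes SS_between, SS_within, and F by hand for the same engine-oil data:

import numpy as np

groups = [np.array([89, 89, 88, 78, 79]),
          np.array([93, 92, 94, 89, 88]),
          np.array([89, 88, 89, 93, 90]),
          np.array([81, 78, 81, 92, 82])]
k = len(groups)                                    # number of groups
N = sum(len(g) for g in groups)                    # total number of observations
grand_mean = np.concatenate(groups).mean()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
F = (ss_between / (k - 1)) / (ss_within / (N - k))
print(F)   # agrees with f_oneway's statistic below (about 4.625)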


PROGRAM:
One Way ANOVA:
# Importing library
from scipy.stats import f_oneway

# Performance when each of the engine oils is applied
performance1 = [89, 89, 88, 78, 79]
performance2 = [93, 92, 94, 89, 88]
performance3 = [89, 88, 89, 93, 90]
performance4 = [81, 78, 81, 92, 82]

# Conduct the one-way ANOVA
print(f_oneway(performance1, performance2, performance3, performance4))

Two Way ANOVA:
import numpy as np
import pandas as pd

# create data
df = pd.DataFrame({'water': np.repeat(['daily', 'weekly'], 15),
                   'sun': np.tile(np.repeat(['low', 'med', 'high'], 5), 2),
                   'height': [6, 6, 6, 5, 6, 5, 5, 6, 4, 5, 6, 6, 7, 8, 7,
                              3, 4, 4, 4, 5, 4, 4, 4, 4, 4, 5, 6, 6, 7, 8]})

# view the first ten rows of data
print(df[:10])

import statsmodels.api as sm
from statsmodels.formula.api import ols

# perform the two-way ANOVA
model = ols('height ~ C(water) + C(sun) + C(water):C(sun)', data=df).fit()
print(sm.stats.anova_lm(model, typ=2))


OUTPUT:
One Way ANOVA:
F_onewayResult(statistic=4.625000000000002, pvalue=0.016336459839780215)

Two Way ANOVA:
                    sum_sq    df        F    PR(>F)
C(water)          8.533333   1.0  16.0000  0.000527
C(sun)           24.866667   2.0  23.3125  0.000002
C(water):C(sun)   2.466667   2.0   2.3125  0.120667
Residual         12.800000  24.0      NaN       NaN

RESULT:
Thus the python program for performing ANOVA is executed successfully.


Exp No: 9 BUILDING AND VALIDATING LINEAR MODELS


Date:

AIM:
The aim of building and validating linear models is to create a model that accurately describes the
relationship between a dependent variable and one or more independent variables, and to determine
whether the model is a good fit for the data.

ALGORITHM:
1. Collect data: Collect data on the dependent variable and one or more independent variables.
2. Choose a linear model: Choose a linear model that describes the relationship between the
dependent variable and independent variable(s). A simple linear model has one independent
variable, while a multiple linear model has two or more independent variables.
3. Estimate model coefficients: Use a statistical software package to estimate the coefficients of
the linear model that best fit the data. The most common method for doing this is least
squares regression.
4. Evaluate model fit: Evaluate the fit of the model by examining the residual plots, which show
the difference between the predicted and actual values of the dependent variable. A good
model will have residuals that are randomly distributed around zero, with no discernible
patterns.
5. Test for significance: Test the significance of the model by calculating the p-value for the
overall F-test of the model. A low p-value indicates that the model is a good fit for the data.
6. Evaluate individual coefficients: Evaluate the significance of individual coefficients in the
model by calculating their t-values and p-values. A low p-value indicates that the coefficient
is significant and should be included in the model.
7. Validate the model: Validate the model by testing it on new data that was not used to estimate
the coefficients. This can be done by using a hold-out sample, or by using cross- validation
techniques.
8. Refine the model: Refine the model by making adjustments to the model specification, such
as adding or removing variables, transforming variables, or adding interaction terms.
9. Interpret the results: Interpret the coefficients of the model in terms of the relationship
between the dependent variable and independent variable(s).
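The program that follows fits and scores the model on the same data it was trained on. Step 7 (validation on unseen data) is not shown there, so here is a hedged hold-out sketch on the same toy data, assuming scikit-learn's train_test_split is available:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

x = np.array([5, 15, 25, 35, 45, 55]).reshape((-1, 1))
y = np.array([5, 20, 14, 32, 22, 38])

# hold out a third of the observations for validation
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=0)
model = LinearRegression().fit(x_train, y_train)
print("R^2 on held-out data:", model.score(x_test, y_test))

With only six observations this is purely illustrative; real validation needs substantially more data.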


PROGRAM
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

x = np.array([5, 15, 25, 35, 45, 55]).reshape((-1, 1))
y = np.array([5, 20, 14, 32, 22, 38])
model = LinearRegression().fit(x, y)
r_sq = model.score(x, y)
print(f"coefficient of determination: {r_sq}")
print(f"intercept: {model.intercept_}")
print(f"slope: {model.coef_}")
new_model = LinearRegression().fit(x, y.reshape((-1, 1)))
print(f"intercept: {new_model.intercept_}")
print(f"slope: {new_model.coef_}")
y_pred = model.predict(x)
print(f"predicted response:\n{y_pred}")
y_pred = model.intercept_ + model.coef_ * x
print(f"predicted response:\n{y_pred}")
plt.scatter(x, y_pred)   # plot the predicted response against x
plt.show()


OUTPUT:
coefficient of determination: 0.7158756137479542
intercept: 5.633333333333329
slope: [0.54]
intercept: [5.63333333]
slope: [[0.54]]
predicted response:
[ 8.33333333 13.73333333 19.13333333 24.53333333 29.93333333 35.33333333]
predicted response:
[[ 8.33333333]
[13.73333333]
[19.13333333]
[24.53333333]
[29.93333333]
[35.33333333]]

RESULT:

Thus the python program for building and validating linear models is executed successfully.


Exp No: 10 BUILDING AND VALIDATING LOGISTIC MODELS


Date:

AIM:
The aim of building and validating logistic models is to create a model that accurately predicts the
probability of a binary outcome (e.g., success or failure) based on one or more independent variables,
and to determine whether the model is a good fit for the data.

ALGORITHM:
The algorithm for building and validating logistic models involves the following steps:
1. Collect data: Collect data on the binary outcome variable and one or more independent
variables.
2. Choose a logistic model: Choose a logistic model that describes the relationship between the
dependent variable and independent variable(s). A simple logistic model has one independent
variable, while a multiple logistic model has two or more independent variables.
3. Estimate model coefficients: Use a statistical software package to estimate the coefficients of
the logistic model that best fit the data. The most common method for doing this is maximum
likelihood estimation.
4. Evaluate model fit: Evaluate the fit of the model by examining the goodness-of-fit statistics,
such as the deviance, the Akaike Information Criterion (AIC), and the Bayesian Information
Criterion (BIC). A good model will have a low deviance and low values of AIC and BIC.
5. Test for significance: Test the significance of the model by calculating the p-value for the
overall chi-square test of the model. A low p-value indicates that the model is a good fit for
the data.
6. Evaluate individual coefficients: Evaluate the significance of individual coefficients in the
model by calculating their Wald test statistics and p-values. A low p-value indicates that the
coefficient is significant and should be included in the model.
7. Validate the model: Validate the model by testing it on new data that was not used to estimate
the coefficients. This can be done by using a hold-out sample, or by using cross- validation
techniques.
8. Refine the model: Refine the model by making adjustments to the model specification, such
as adding or removing variables, transforming variables, or adding interaction terms.
9. Interpret the results: Interpret the coefficients of the model in terms of the relationship
between the independent variable(s) and the probability of the binary outcome.
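The program that follows fits the model and converts log-odds to probabilities, but does not perform step 7 (validation). Here is a minimal cross-validation sketch on the same data, assuming scikit-learn's cross_val_score is available:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X = np.array([3.78, 2.44, 2.09, 0.14, 1.72, 1.65, 4.92, 4.37, 4.96, 4.52, 3.69, 5.88]).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

# 3-fold cross-validated accuracy on held-out folds
scores = cross_val_score(LogisticRegression(), X, y, cv=3)
print("accuracy per fold:", scores)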


PROGRAM:
import numpy
import matplotlib.pyplot as plt
from sklearn import linear_model

X = numpy.array([3.78, 2.44, 2.09, 0.14, 1.72, 1.65, 4.92, 4.37, 4.96, 4.52, 3.69, 5.88]).reshape(-1, 1)
y = numpy.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

logr = linear_model.LogisticRegression()
logr.fit(X, y)

def logit2prob(logr, X):
    log_odds = logr.coef_ * X + logr.intercept_
    odds = numpy.exp(log_odds)
    probability = odds / (1 + odds)
    return probability

print(logit2prob(logr, X))
plt.scatter(logit2prob(logr, X), X)   # plot the probabilities against the X values
plt.show()


OUTPUT:
[[0.60749955]
[0.19268876]
[0.12775886]
[0.00955221]
[0.08038616]
[0.07345637]
[0.88362743]
[0.77901378]
[0.88924409]
[0.81293497]
[0.57719129]
[0.96664243]]

RESULT:

Thus the python program for building and validating logistic models is executed successfully.


Exp No: 11 TIME SERIES ANALYSIS


Date:

AIM:
The aim of performing time series analysis is to model and forecast the behavior of a time series data
over a period of time, using statistical methods, in order to identify patterns, trends, and seasonality in
the data.

ALGORITHM:
The algorithm for performing time series analysis involves the following steps:
1. Collect data: Collect data on the time series variable over a period of time.
2. Visualize the data: Plot the time series data to identify patterns, trends, and seasonality.
3. Decompose the time series: Decompose the time series into its components, which are
trend, seasonality, and residual variation. This can be done using techniques such as moving
averages, exponential smoothing, or the Box-Jenkins method.
4. Model the trend: Model the trend component of the time series using techniques such as linear
regression, exponential smoothing, or ARIMA models.
5. Model the seasonality: Model the seasonality component of the time series using techniques
such as seasonal decomposition, dummy variables, or Fourier series.
6. Model the residual variation: Model the residual variation component of the time series using
techniques such as autoregressive models, moving average models, or ARIMA models.
7. Choose the best model: Evaluate the fit of the different models using measures such as AIC,
BIC, and RMSE, and choose the model that best fits the data.
8. Forecast future values: Use the chosen model to forecast future values of the time series
variable.
9. Validate the model: Validate the model by comparing the forecasted values with actual values
from a hold-out sample, or by using cross-validation techniques.
10. Refine the model: Refine the model by making adjustments to the model specification, such
as adding or removing variables, transforming variables, or adding interaction terms.
11. Interpret the results: Interpret the results of the time series analysis in terms of the patterns,
trends, and seasonality of the data, and use the forecasted values to make predictions and
inform decision-making.
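The program that follows covers steps 1 and 2 (collecting and visualizing the series). Step 3 (decomposition) can be sketched with statsmodels' seasonal_decompose; the monthly series here is synthetic, made up purely for illustration:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

# three years of synthetic monthly data: a linear trend plus a yearly cycle
idx = pd.date_range("2021-01-01", periods=36, freq="MS")
values = np.arange(36) + 5 * np.sin(2 * np.pi * np.arange(36) / 12)
series = pd.Series(values, index=idx)

# split the series into trend, seasonal, and residual components
result = seasonal_decompose(series, model="additive", period=12)
result.plot()
plt.show()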


PROGRAM:
# import modules
import pandas as pd
import matplotlib.pyplot as plt
import datetime
import numpy as np

# create the dataframe
dataframe = pd.DataFrame({'date_of_week': np.array([datetime.datetime(2021, 11, i+1) for i in range(7)]),
                          'classes': [5, 6, 8, 2, 3, 7, 4]})

# Plotting the time series of the given dataframe
plt.plot(dataframe.date_of_week, dataframe.classes)
# Giving a title to the chart using plt.title
plt.title('Classes by Date')
# rotating the x-axis tick labels 30 degrees towards the right
plt.xticks(rotation=30, ha='right')
# Providing x and y labels for the chart
plt.xlabel('Date')
plt.ylabel('Classes')
plt.show()

OUTPUT:

RESULT:

Thus the python program for performing time series analysis is executed successfully.


Exp No: 12 CONTENT BEYOND SYLLABUS - HEATMAPS


Date:

AIM:
To make a graphical representation of data that uses colors to visualize the values of a matrix.

ALGORITHM:
1. Import the required libraries: NumPy for generating the data, seaborn for the heatmap, and
Matplotlib for displaying it.
2. Prepare the data: Arrange the values to be visualized as a 2-D matrix (here, a 10x10 matrix of
random integers).
3. Plot the heatmap: Pass the matrix to seaborn's heatmap() function, which maps each value to a
color.
4. Customize the plot: Optionally adjust options such as the color map and annotations.
5. Display the plot: Use Matplotlib's show() function to display the heatmap.
PROGRAM:
# importing the modules
import numpy as np
import seaborn as sn
import matplotlib.pyplot as plt

# generating a 2-D 10x10 matrix of random integers
# from 1 to 99 (high=100 is exclusive)
data = np.random.randint(low=1, high=100, size=(10, 10))
print("The data to be plotted:\n")
print(data)

# plotting the heatmap
hm = sn.heatmap(data=data)
# displaying the plotted heatmap
plt.show()


OUTPUT:
The data to be plotted:

[[46 30 55 86 42 94 31 56 21 7]
[68 42 95 28 93 13 90 27 14 65]
[73 84 92 66 16 15 57 36 46 84]
[ 7 11 41 37 8 41 96 53 51 72]
[52 64 1 80 33 30 91 80 28 88]
[19 93 64 23 72 15 39 35 62 3]
[51 45 51 17 83 37 81 31 62 10]
[ 9 28 30 47 73 96 10 43 30 2]
[74 28 34 26 2 70 82 53 97 96]
[86 13 60 51 95 26 22 29 14 29]]

RESULT:
Thus the python program for plotting a heatmap is executed successfully.


Exp No: 13 CONTENT BEYOND SYLLABUS – INTERACTIVE VISUALIZATION WITH BOKEH


Date:

AIM:
To make an interactive data visualization with Bokeh.

ALGORITHM:
1. Import the Bokeh plotting functions: Import figure, output_file, and show from
bokeh.plotting.
2. Prepare the data: Create the lists (or arrays) of x and y values to be plotted.
3. Choose an output target: Use output_file() to write the interactive plot to a standalone HTML
file.
4. Create a figure: Call figure() with the desired title, size, and toolbar location.
5. Add glyphs: Add circle (or other) glyphs to the figure to represent the data points.
6. Show the plot: Call show() to render the interactive plot in the browser.

PROGRAM:
from bokeh.plotting import figure, output_file, show

x = [1, 2, 3, 4, 5]
y = [6, 7, 6, 4, 5]

output_file("demo.html")
p = figure(title='demo', width=300, height=300,
toolbar_location="below")
p.circle(x, y)

show(p)


OUTPUT:

RESULT:
Thus the python program for interactive data visualization with Bokeh is executed successfully.
