DSA Lab Manual Programs - Final

The document outlines a series of experiments related to Data Science and Analytics, detailing aims, algorithms, and Python programs for various tasks such as working with Pandas data frames, creating plots with Matplotlib, and performing statistical analyses including regression and correlation. Each experiment includes a program example and results confirming successful execution. The experiments cover a range of topics essential for data analysis, including data manipulation, visualization, and statistical testing.



LIST OF EXPERIMENTS

1. Working with Pandas data frames
2. Basic plots using Matplotlib
3. Frequency distributions, averages, variability
4. Normal curves, correlation and scatter plots, correlation coefficient
5. Regression
6. Z-test
7. T-test
8. ANOVA
9. Building and validating linear models
10. Building and validating logistic models
11. Time series analysis
12. Content beyond syllabus 1: Heatmaps
13. Content beyond syllabus 2: Interactive visualization with Bokeh


Exp No : 1 WORKING WITH PANDAS DATA FRAMES


Date:

AIM:
To write a python program to work with pandas data frames

ALGORITHM:
1. Import the pandas library: To use pandas, you need to import the library first.
2. Read data into a pandas data frame: Use the read_csv(), read_excel(), or read_sql() functions
to read data from a file or database into a pandas data frame.
3. Explore the data: Use functions like head(), tail(), info(), describe(), and shape to get a sense
of the data you're working with.
4. Select and filter data: Use indexing and slicing to select subsets of the data. You can use the
loc[] and iloc[] methods to select rows and columns based on labels or indices. You can also
filter rows based on conditions using the query() or loc[] method.
5. Manipulate the data: Use pandas functions to manipulate the data, such as adding or dropping
columns, renaming columns, and aggregating data.
6. Handle missing data: Use functions like isnull(), dropna(), fillna(), and interpolate() to handle
missing data in the data frame.
7. Save the data: Use the to_csv(), to_excel(), or to_sql() methods to save the data frame to a file
or database.
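The program that follows illustrates steps 4 and 5 on an in-memory data frame. As a complement, here is a minimal, hedged sketch of steps 2, 3, 6, and 7; the file name students.csv and its Age column are hypothetical examples, not part of the original program:

import pandas as pd

# Step 2: read data from a CSV file ("students.csv" is a hypothetical example file)
df = pd.read_csv("students.csv")

# Step 3: explore the data
print(df.head())
print(df.describe())

# Step 6: handle missing data, here by filling gaps in Age with the column mean
df["Age"] = df["Age"].fillna(df["Age"].mean())

# Step 7: save the cleaned data frame
df.to_csv("students_clean.csv", index=False)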


Program:
import pandas as pd

# Create a data frame from a dictionary of lists
data = {'Name': ['Alice', 'Bob', 'Charlie', 'Dave', 'Emily'],
        'Age': [25, 30, 35, 40, 45],
        'Country': ['USA', 'Canada', 'USA', 'Canada', 'USA']}
df = pd.DataFrame(data)

# Print the data frame
print("Original data frame:")
print(df)

# Select rows based on a condition
print("\nSelect rows where Age > 30:")
print(df[df['Age'] > 30])

# Add a new column
df['Gender'] = ['F', 'M', 'M', 'M', 'F']
print("\nData frame with Gender column:")
print(df)

# Group the data by Country and calculate the mean age
grouped = df.groupby('Country')
mean_age = grouped['Age'].mean()
print("\nMean age by Country:")
print(mean_age)

# Sort the data frame by Age in descending order
sorted_df = df.sort_values('Age', ascending=False)
print("\nData frame sorted by Age in descending order:")
print(sorted_df)


OUTPUT:
Original data frame:
Name Age Country
0 Alice 25 USA
1 Bob 30 Canada
2 Charlie 35 USA
3 Dave 40 Canada
4 Emily 45 USA
Select rows where Age > 30:
Name Age Country
2 Charlie 35 USA
3 Dave 40 Canada
4 Emily 45 USA

Data frame with Gender column:


Name Age Country Gender
0 Alice 25 USA F
1 Bob 30 Canada M
2 Charlie 35 USA M
3 Dave 40 Canada M
4 Emily 45 USA F

Mean age by Country:


Country
Canada 35.0
USA 35.0
Name: Age, dtype: float64
Data frame sorted by Age in descending order:
Name Age Country Gender
4 Emily 45 USA F
3 Dave 40 Canada M
2 Charlie 35 USA M
1 Bob 30 Canada M
0 Alice 25 USA F

RESULT:
Thus the python program to work with pandas dataframes is executed successfully.


Exp No : 2 BASIC PLOTS USING MATPLOTLIB


Date:

AIM:
To write a python program to draw basic plots using matplotlib.

ALGORITHM:
1. Import the necessary libraries: You'll need to import both NumPy and Matplotlib in order to
create plots.
2. Create data: You need data to plot! Create your data as NumPy arrays.
3. Create a figure and axes: Before you can plot anything, you need to create a figure and axes.
The figure is the canvas that holds your plot, while the axes are the actual plot area.
4. Plot the data: Now it's time to plot the data. You can do this using the plot() function.
5. Customize the plot: You can customize your plot in many ways, including adding a title,
changing the axis labels, and adding a legend.
6. Save the plot: Finally, you can save your plot to a file using the savefig() function.
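The program that follows exercises steps 4 and 5 in detail. The explicit figure/axes creation of step 3 and the savefig() call of step 6 do not appear there, so here is a minimal sketch of those two steps; the data and the file name sine.png are made up for illustration:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)       # step 2: create data
y = np.sin(x)

fig, ax = plt.subplots()          # step 3: create a figure and axes
ax.plot(x, y, label="sin(x)")     # step 4: plot the data

ax.set_title("Sine wave")         # step 5: customize the plot
ax.set_xlabel("x")
ax.set_ylabel("sin(x)")
ax.legend()

fig.savefig("sine.png")           # step 6: save the plot to a file
plt.show()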


PROGRAM:
import matplotlib.pyplot as plt

a = [1, 2, 3, 4, 5]
b = [0, 0.6, 0.2, 15, 10, 8, 16, 21]
plt.plot(a)

# "o" is for circles and "r" is for red
plt.plot(b, "or")
plt.plot(list(range(0, 22, 3)))

# naming the x-axis
plt.xlabel('Day ->')
# naming the y-axis
plt.ylabel('Temp ->')

c = [4, 2, 6, 8, 3, 20, 13, 15]
plt.plot(c, label='4th Rep')

# get the current axes
ax = plt.gca()

# take control of the individual boundary lines of the graph body
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)

# fix the range (bounds) of the left boundary line
ax.spines['left'].set_bounds(-3, 40)

# set the marks (ticks) on the x-axis
plt.xticks(list(range(-3, 10)))

# set the marks (ticks) on the y-axis
plt.yticks(list(range(-3, 20, 3)))

# the legend indicates which color signifies which series
ax.legend(['1st Rep', '2nd Rep', '3rd Rep', '4th Rep'])

# annotate() writes text on the graph; xy gives the position
plt.annotate('Temperature V/s Days', xy=(1.01, -2.15))

# give a title to the graph
plt.title('All Features Discussed')
plt.show()

OUTPUT

RESULT:
Thus the python program to draw basic plots using matplotlib is executed successfully.


Exp No : 3 FREQUENCY DISTRIBUTIONS, AVERAGES, VARIABILITY


Date:

AIM:
To write a python program for finding out frequency distributions, averages, variability.

ALGORITHM:
1. Initialize an empty dictionary called frequency_distribution.
2. Calculate the sum of the data and store it in a variable called total_sum
3. Calculate the length of the data and store it in a variable called n
4. Calculate the mean by dividing total_sum by n and store it in a variable called mean
5. Calculate the sum of the squared differences between each value in the data and the mean and
store it in a variable called squared_diff_sum
6. Calculate the variance by dividing squared_diff_sum by n-1 and store it in a variable called
variance
7. Loop over each value in the data:
   a. If the value is not in the frequency_distribution dictionary, add it with a value of 1.
   b. If the value is already in the frequency_distribution dictionary, increment its value by 1.
8. Return the frequency_distribution, mean, and variance
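The program that follows computes the averages and variability with NumPy shortcuts. For completeness, here is a small sketch that follows the algorithm above literally in plain Python; the function name describe is our own choice:

def describe(data):
    frequency_distribution = {}                              # step 1
    total_sum = sum(data)                                    # step 2
    n = len(data)                                            # step 3
    mean = total_sum / n                                     # step 4
    squared_diff_sum = sum((x - mean) ** 2 for x in data)    # step 5
    variance = squared_diff_sum / (n - 1)                    # step 6: sample variance
    for value in data:                                       # step 7
        frequency_distribution[value] = frequency_distribution.get(value, 0) + 1
    return frequency_distribution, mean, variance            # step 8

print(describe([2, 4, 4, 4, 5, 5, 7, 9]))

Note that step 6 divides by n-1 (sample variance), whereas NumPy's var() in the program below divides by n by default, so the two variances differ slightly.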


PROGRAM:
# Importing the NumPy module
import numpy as np

# Getting the average of a list
values = [2, 40, 2, 502, 177, 7, 9]
# Calculating the average using average()
print(np.average(values))

# Getting the variance of a list
values = [2, 4, 4, 4, 5, 5, 7, 9]
# Calculating the variance using var()
print(np.var(values))

# Getting the standard deviation of a list
values = [290, 124, 127, 899]
# Calculating the standard deviation using std()
print(np.std(values))

OUTPUT:
105.57142857142857
4.0
318.35750344541907

RESULT:
Thus the python program for finding out frequency distributions, averages, variability is executed
successfully.


Exp No : 4 NORMAL CURVES, CORRELATION AND SCATTER PLOTS, CORRELATION COEFFICIENT


Date:

AIM:
The aim of finding normal curves, correlation and scatter plots, and correlation coefficients is to better
understand the relationship between variables in a dataset.

ALGORITHM:
Normal Curves:
To find a normal curve, we need to first calculate the mean and standard deviation of the dataset.
Then, we can use a statistical software or calculator to generate a graph of the normal distribution.
The algorithm for finding a normal curve is as follows:
a. Calculate the mean (µ) and standard deviation (σ) of the dataset.
b. Use a statistical software or calculator to generate a graph of the normal distribution.
c. Plot the normal curve on a graph with the x-axis representing the variable of interest and the
y-axis representing the frequency or probability.

Scatter Plots:
To create a scatter plot, we need to first collect data for two variables that we want to investigate.
Then, we can plot the data points on a graph and look for patterns or trends. The algorithm for
creating a scatter plot is as follows:
a. Collect data for two variables.
b. Plot the data points on a graph with one variable on the x-axis and the other variable on the y-
axis.
c. Look for patterns or trends in the data points.

Correlation Coefficient:
The correlation coefficient is a measure of the strength and direction of the relationship between two
variables. To calculate the correlation coefficient, we need to use a statistical formula or software. The
algorithm for calculating the correlation coefficient is as follows:
a. Collect data for two variables.
b. Calculate the mean (µ) and standard deviation (σ) of each variable.
c. Calculate the covariance (cov) between the two variables.
d. Calculate the correlation coefficient (r) using the formula:
r = cov / (σ1 * σ2)
where cov is the covariance between the two variables, σ1 is the standard deviation of variable 1, and
σ2 is the standard deviation of variable 2.
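The two programs below cover the normal-curve part. For the scatter plot and correlation coefficient, here is a minimal sketch using NumPy's built-in corrcoef(); the data values are made up purely for illustration:

import numpy as np
import matplotlib.pyplot as plt

# two small made-up variables
x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([2, 4, 5, 4, 5, 7])

# scatter plot of y against x
plt.scatter(x, y)
plt.xlabel("x")
plt.ylabel("y")
plt.show()

# Pearson correlation coefficient r = cov / (σ1 * σ2)
r = np.corrcoef(x, y)[0, 1]
print("r =", r)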


PROGRAM 1:
import numpy as np
import matplotlib.pyplot as plt

# Generating some random data for an example
data = np.random.normal(170, 10, 250)

# Plotting the histogram
plt.hist(data, bins=25, density=True, alpha=0.6, color='b')

plt.show()

Output:

PROGRAM 2:
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

# Generate some data for this demonstration.
data = np.random.normal(170, 10, 250)

# Fit a normal distribution to the data:
# mean and standard deviation
mu, std = norm.fit(data)

# Plot the histogram.
plt.hist(data, bins=25, density=True, alpha=0.6, color='b')

# Plot the PDF.
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = norm.pdf(x, mu, std)
plt.plot(x, p, 'k', linewidth=2)

title = "Fit Values: {:.2f} and {:.2f}".format(mu, std)
plt.title(title)

plt.show()

OUTPUT:

RESULT:

Thus the python program for finding out normal curves, correlation and scatter plots, and the correlation
coefficient is executed successfully.


Exp No : 5 REGRESSION
Date:

AIM:
The aim of regression analysis is to examine the relationship between a dependent variable and one or
more independent variables.

ALGORITHM:
1. Collect data: Collect data on the dependent variable and one or more independent variables.
2. Check for linearity: Check whether there is a linear relationship between the dependent
variable and the independent variables. This can be done by plotting the data on a scatter plot
and looking for a linear pattern.
3. Determine the regression equation: Determine the regression equation that best fits the data.
This can be done using the least squares method.
4. Test the regression equation: Test the regression equation to see if it accurately predicts the
value of the dependent variable. This can be done by comparing the predicted values with the
actual values.
5. Interpret the results: Interpret the results to draw conclusions about the relationship between
the dependent variable and the independent variables.
6. The formula for the regression equation is: y = b0 + b1x1 + b2x2 + ... + bnxn

PROGRAM:
import numpy as np
import matplotlib.pyplot as plt

def estimate_coef(x, y):
    # number of observations/points
    n = np.size(x)
    # mean of the x and y vectors
    m_x = np.mean(x)
    m_y = np.mean(y)
    # calculating cross-deviation and deviation about x
    SS_xy = np.sum(y*x) - n*m_y*m_x
    SS_xx = np.sum(x*x) - n*m_x*m_x
    # calculating regression coefficients
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1*m_x
    return (b_0, b_1)


def plot_regression_line(x, y, b):
    # plotting the actual points as a scatter plot
    plt.scatter(x, y, color="m", marker="o", s=30)
    # predicted response vector
    y_pred = b[0] + b[1]*x
    # plotting the regression line
    plt.plot(x, y_pred, color="g")
    # putting labels
    plt.xlabel('x')
    plt.ylabel('y')
    # function to show plot
    plt.show()

def main():
    # observations / data
    x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])
    # estimating coefficients
    b = estimate_coef(x, y)
    print("Estimated coefficients:\nb_0 = {}\nb_1 = {}".format(b[0], b[1]))
    # plotting regression line
    plot_regression_line(x, y, b)

main()


OUTPUT:

RESULT:

Thus the python program for finding out the regression is executed successfully.


Exp No: 6 Z-TEST


Date :

AIM:
The aim of a z-test is to determine whether the mean of a sample is statistically significantly different
from the known or hypothesized population mean.

ALGORITHM:
1. State the null and alternative hypotheses: The null hypothesis (H0) is that there is no
significant difference between the sample mean and the population mean, while the
alternative hypothesis (Ha) is that there is a significant difference.
2. Determine the level of significance: Choose the level of significance, α, that will be used to
test the hypothesis. Typically, α is set at 0.05 or 0.01.
3. Collect data: Collect a random sample from the population of interest, and calculate the
sample mean and sample standard deviation.
4. Calculate the test statistic: Calculate the z-test statistic using the formula: z = (x̄ - μ) / (σ / √n),
where x̄ is the sample mean, μ is the population mean, σ is the population standard deviation, and
n is the sample size.
5. Determine the critical value: Determine the critical value of z at the chosen level of significance.
6. Compare the test statistic to the critical value: If the test statistic is greater than the critical
value, reject the null hypothesis. If the test statistic is less than the critical value, fail to reject
the null hypothesis.
7. Interpret the results: If the null hypothesis is rejected, it can be concluded that the sample
mean is significantly different from the population mean at the chosen level of significance. If
the null hypothesis is not rejected, it can be concluded that there is not enough evidence to
support the alternative hypothesis.
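The program that follows uses the ztest() helper from statsmodels. As a cross-check, the test statistic of step 4 can also be computed directly; this short sketch uses assumed illustrative values rather than real data:

from math import sqrt
from scipy.stats import norm

x_bar, mu, sigma, n = 110.5, 100, 15, 50   # assumed illustrative values
z = (x_bar - mu) / (sigma / sqrt(n))       # step 4: z-test statistic
p_value = 1 - norm.cdf(z)                  # one-sided ('larger') p-value
print("z =", z, "p =", p_value)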

PROGRAM:
import math
import numpy as np
from numpy.random import randn
from statsmodels.stats.weightstats import ztest

# Generate a random array of 50 numbers with mean 110, similar to the IQ-score
# data we assume above (sd_iq = 15/√50 is the standard error for sd 15 and n = 50)
mean_iq = 110
sd_iq = 15/math.sqrt(50)
alpha = 0.05
null_mean = 100
data = sd_iq*randn(50) + mean_iq

# print mean and sd
print('mean=%.2f stdv=%.2f' % (np.mean(data), np.std(data)))

# Now we perform the test. We pass the data, the mean under the null hypothesis
# in the value parameter, and alternative='larger' to test whether the true
# mean is larger than the null mean.
ztest_Score, p_value = ztest(data, value=null_mean, alternative='larger')

# The function returns a z-score and the corresponding p-value; we compare the
# p-value with alpha: if it is smaller than alpha we reject the null
# hypothesis, else we fail to reject it.
if p_value < alpha:
    print("Reject Null Hypothesis")
else:
    print("Fail to Reject Null Hypothesis")

OUTPUT:
mean=110.55 stdv=1.50
Reject Null Hypothesis

RESULT:
Thus the python program for performing z-test is executed successfully.


Exp No: 7 T-TEST


Date:

AIM:
The aim of a t-test is to determine whether the mean of a sample is statistically significantly different
from the hypothesized population mean.

ALGORITHM:
The algorithm for a one-sample t-test involves the following steps:
1. State the null and alternative hypotheses: The null hypothesis (H0) is that there is no
significant difference between the sample mean and the population mean, while the
alternative hypothesis (Ha) is that there is a significant difference.
2. Determine the level of significance: Choose the level of significance, α, that will be used to
test the hypothesis. Typically, α is set at 0.05 or 0.01.
3. Collect data: Collect a random sample from the population of interest, and calculate the
sample mean and sample standard deviation.
4. Calculate the test statistic: Calculate the t-test statistic using the formula: t = (x̄ - μ) / (s / √n),
where x̄ is the sample mean, μ is the hypothesized population mean, s is the sample standard
deviation, and n is the sample size.
5. Determine the degrees of freedom: Determine the degrees of freedom for the t-distribution
using the formula: df = n - 1.
6. Determine the critical value: Determine the critical value of t at the chosen level of
significance and degrees of freedom.
7. Compare the test statistic to the critical value: If the absolute value of the test statistic is
greater than the critical value, reject the null hypothesis. If the absolute value of the test
statistic is less than the critical value, fail to reject the null hypothesis.
8. Interpret the results: If the null hypothesis is rejected, it can be concluded that the sample
mean is significantly different from the hypothesized population mean at the chosen level of
significance. If the null hypothesis is not rejected, it can be concluded that there is not enough
evidence to support the alternative hypothesis.

PROGRAM:
# Importing the required libraries and packages
import numpy as np
from scipy import stats

# Defining two random distributions
# Sample size
N = 10
# Gaussian distributed data with mean = 2 and var = 1
x = np.random.randn(N) + 2
# Gaussian distributed data with mean = 0 and var = 1
y = np.random.randn(N)

# Calculating the variance to get the standard deviation
var_x = x.var(ddof=1)
var_y = y.var(ddof=1)
# Pooled standard deviation
SD = np.sqrt((var_x + var_y) / 2)
print("Standard Deviation =", SD)

# Calculating the t-statistic
tval = (x.mean() - y.mean()) / (SD * np.sqrt(2 / N))

# Comparing with the critical t-value
# Degrees of freedom
dof = 2 * N - 2
# p-value after comparison with the t-statistic
pval = 1 - stats.t.cdf(tval, df=dof)
print("t = " + str(tval))
print("p = " + str(2 * pval))

# Cross-checking using the built-in function from the SciPy package
tval2, pval2 = stats.ttest_ind(x, y)
print("t = " + str(tval2))
print("p = " + str(pval2))

OUTPUT:
Standard Deviation = 0.7642398582227466
t = 4.87688162540348
p = 0.0001212767169695983
t = 4.876881625403479
p = 0.00012127671696957205

RESULT:
Thus the python program for performing t-test is executed successfully.


Exp No: 8 ANOVA


Date:

AIM:
The aim of an ANOVA (Analysis of Variance) is to determine whether there is a significant
difference between the means of three or more groups.

ALGORITHM:
The algorithm for a one-way ANOVA involves the following steps:
1. State the null and alternative hypotheses: The null hypothesis (H0) is that there is no
significant difference between the means of the groups, while the alternative hypothesis (Ha)
is that there is a significant difference.
2. Determine the level of significance: Choose the level of significance, α, that will be used to
test the hypothesis. Typically, α is set at 0.05 or 0.01.
3. Collect data: Collect data from three or more groups, and calculate the mean and variance for
each group.
4. Calculate the sum of squares between groups using the formula: SS_between = ∑ n_i (x̄_i - x̄)²,
where n_i is the sample size for group i, x̄_i is the mean of group i, and x̄ is the overall mean.
5. Calculate the sum of squares within groups using the formula: SS_within = ∑_j ∑_i (x_ij - x̄_j)²,
where x_ij is the value of the ith observation in the jth group and x̄_j is the mean of group j.
6. Calculate the F-statistic using the formula: F = (SS_between / (k-1)) / (SS_within / (N-k)),
where k is the number of groups and N is the total number of observations.
7. Determine the critical value: Determine the critical value of F at the chosen level of
significance and degrees of freedom.
8. Compare the F-statistic to the critical value: If the F-statistic is greater than the critical value,
reject the null hypothesis. If the F-statistic is less than the critical value, fail to reject the null
hypothesis.
9. Interpret the results: If the null hypothesis is rejected, it can be concluded that there is a
significant difference between the means of the groups at the chosen level of significance. If
the null hypothesis is not rejected, it can be concluded that there is not enough evidence to
support the alternative hypothesis.
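The program that follows relies on SciPy's f_oneway() for the one-way test. To connect it with the formulas above, this sketch computes SS_between, SS_within, and F by hand for the same engine-oil data:

import numpy as np

groups = [np.array([89, 89, 88, 78, 79]),
          np.array([93, 92, 94, 89, 88]),
          np.array([89, 88, 89, 93, 90]),
          np.array([81, 78, 81, 92, 82])]
k = len(groups)                                    # number of groups
N = sum(len(g) for g in groups)                    # total number of observations
grand_mean = np.concatenate(groups).mean()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
F = (ss_between / (k - 1)) / (ss_within / (N - k))
print(F)   # agrees with f_oneway's statistic below (about 4.625)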


PROGRAM:
One Way ANOVA:
# Importing library
from scipy.stats import f_oneway

# Performance when each of the engine oils is applied
performance1 = [89, 89, 88, 78, 79]
performance2 = [93, 92, 94, 89, 88]
performance3 = [89, 88, 89, 93, 90]
performance4 = [81, 78, 81, 92, 82]

# Conduct the one-way ANOVA
print(f_oneway(performance1, performance2, performance3, performance4))

Two Way ANOVA:
import numpy as np
import pandas as pd

# create data
df = pd.DataFrame({'water': np.repeat(['daily', 'weekly'], 15),
                   'sun': np.tile(np.repeat(['low', 'med', 'high'], 5), 2),
                   'height': [6, 6, 6, 5, 6, 5, 5, 6, 4, 5, 6, 6, 7, 8, 7,
                              3, 4, 4, 4, 5, 4, 4, 4, 4, 4, 5, 6, 6, 7, 8]})

# view the first ten rows of data
print(df[:10])

import statsmodels.api as sm
from statsmodels.formula.api import ols

# perform the two-way ANOVA
model = ols('height ~ C(water) + C(sun) + C(water):C(sun)', data=df).fit()
print(sm.stats.anova_lm(model, typ=2))


OUTPUT:
One Way ANOVA:
F_onewayResult(statistic=4.625000000000002, pvalue=0.016336459839780215)

Two Way ANOVA:
                    sum_sq    df        F    PR(>F)
C(water)          8.533333   1.0  16.0000  0.000527
C(sun)           24.866667   2.0  23.3125  0.000002
C(water):C(sun)   2.466667   2.0   2.3125  0.120667
Residual         12.800000  24.0      NaN       NaN

RESULT:
Thus the python program for performing ANOVA is executed successfully.


Exp No: 9 BUILDING AND VALIDATING LINEAR MODELS


Date:

AIM:
The aim of building and validating linear models is to create a model that accurately describes the
relationship between a dependent variable and one or more independent variables, and to determine
whether the model is a good fit for the data.

ALGORITHM:
1. Collect data: Collect data on the dependent variable and one or more independent variables.
2. Choose a linear model: Choose a linear model that describes the relationship between the
dependent variable and independent variable(s). A simple linear model has one independent
variable, while a multiple linear model has two or more independent variables.
3. Estimate model coefficients: Use a statistical software package to estimate the coefficients of
the linear model that best fit the data. The most common method for doing this is least
squares regression.
4. Evaluate model fit: Evaluate the fit of the model by examining the residual plots, which show
the difference between the predicted and actual values of the dependent variable. A good
model will have residuals that are randomly distributed around zero, with no discernible
patterns.
5. Test for significance: Test the significance of the model by calculating the p-value for the
overall F-test of the model. A low p-value indicates that the model is a good fit for the data.
6. Evaluate individual coefficients: Evaluate the significance of individual coefficients in the
model by calculating their t-values and p-values. A low p-value indicates that the coefficient
is significant and should be included in the model.
7. Validate the model: Validate the model by testing it on new data that was not used to estimate
the coefficients. This can be done by using a hold-out sample, or by using cross- validation
techniques.
8. Refine the model: Refine the model by making adjustments to the model specification, such
as adding or removing variables, transforming variables, or adding interaction terms.
9. Interpret the results: Interpret the coefficients of the model in terms of the relationship
between the dependent variable and independent variable(s).
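The program that follows fits and scores the model on the same data it was trained on. Step 7 (validation on unseen data) is not shown there, so here is a hedged hold-out sketch on the same toy data, assuming scikit-learn's train_test_split is available:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

x = np.array([5, 15, 25, 35, 45, 55]).reshape((-1, 1))
y = np.array([5, 20, 14, 32, 22, 38])

# hold out a third of the observations for validation
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=0)
model = LinearRegression().fit(x_train, y_train)
print("R^2 on held-out data:", model.score(x_test, y_test))

With only six observations this is purely illustrative; real validation needs substantially more data.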


PROGRAM
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

x = np.array([5, 15, 25, 35, 45, 55]).reshape((-1, 1))
y = np.array([5, 20, 14, 32, 22, 38])
model = LinearRegression().fit(x, y)
r_sq = model.score(x, y)
print(f"coefficient of determination: {r_sq}")
print(f"intercept: {model.intercept_}")
print(f"slope: {model.coef_}")
new_model = LinearRegression().fit(x, y.reshape((-1, 1)))
print(f"intercept: {new_model.intercept_}")
print(f"slope: {new_model.coef_}")
y_pred = model.predict(x)
print(f"predicted response:\n{y_pred}")
y_pred = model.intercept_ + model.coef_ * x
print(f"predicted response:\n{y_pred}")
plt.scatter(x, y_pred)   # plot the predicted response against x
plt.show()


OUTPUT:
coefficient of determination: 0.7158756137479542
intercept: 5.633333333333329
slope: [0.54]
intercept: [5.63333333]
slope: [[0.54]]
predicted response:
[ 8.33333333 13.73333333 19.13333333 24.53333333 29.93333333 35.33333333]
predicted response:
[[ 8.33333333]
[13.73333333]
[19.13333333]
[24.53333333]
[29.93333333]
[35.33333333]]

RESULT:

Thus the python program for building and validating linear models is executed successfully.


Exp No: 10 BUILDING AND VALIDATING LOGISTIC MODELS


Date:

AIM:
The aim of building and validating logistic models is to create a model that accurately predicts the
probability of a binary outcome (e.g., success or failure) based on one or more independent variables,
and to determine whether the model is a good fit for the data.

ALGORITHM:
The algorithm for building and validating logistic models involves the following steps:
1. Collect data: Collect data on the binary outcome variable and one or more independent
variables.
2. Choose a logistic model: Choose a logistic model that describes the relationship between the
dependent variable and independent variable(s). A simple logistic model has one independent
variable, while a multiple logistic model has two or more independent variables.
3. Estimate model coefficients: Use a statistical software package to estimate the coefficients of
the logistic model that best fit the data. The most common method for doing this is maximum
likelihood estimation.
4. Evaluate model fit: Evaluate the fit of the model by examining the goodness-of-fit statistics,
such as the deviance, the Akaike Information Criterion (AIC), and the Bayesian Information
Criterion (BIC). A good model will have a low deviance and low values of AIC and BIC.
5. Test for significance: Test the significance of the model by calculating the p-value for the
overall chi-square test of the model. A low p-value indicates that the model is a good fit for
the data.
6. Evaluate individual coefficients: Evaluate the significance of individual coefficients in the
model by calculating their Wald test statistics and p-values. A low p-value indicates that the
coefficient is significant and should be included in the model.
7. Validate the model: Validate the model by testing it on new data that was not used to estimate
the coefficients. This can be done by using a hold-out sample, or by using cross- validation
techniques.
8. Refine the model: Refine the model by making adjustments to the model specification, such
as adding or removing variables, transforming variables, or adding interaction terms.
9. Interpret the results: Interpret the coefficients of the model in terms of the relationship
between the independent variable(s) and the probability of the binary outcome.
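The program that follows fits the model and converts log-odds to probabilities, but does not perform step 7 (validation). Here is a minimal cross-validation sketch on the same data, assuming scikit-learn's cross_val_score is available:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X = np.array([3.78, 2.44, 2.09, 0.14, 1.72, 1.65, 4.92, 4.37, 4.96, 4.52, 3.69, 5.88]).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

# 3-fold cross-validated accuracy on held-out folds
scores = cross_val_score(LogisticRegression(), X, y, cv=3)
print("accuracy per fold:", scores)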


PROGRAM:
import numpy
import matplotlib.pyplot as plt
from sklearn import linear_model

X = numpy.array([3.78, 2.44, 2.09, 0.14, 1.72, 1.65, 4.92, 4.37, 4.96, 4.52, 3.69, 5.88]).reshape(-1, 1)
y = numpy.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

logr = linear_model.LogisticRegression()
logr.fit(X, y)

def logit2prob(logr, X):
    log_odds = logr.coef_ * X + logr.intercept_
    odds = numpy.exp(log_odds)
    probability = odds / (1 + odds)
    return probability

print(logit2prob(logr, X))
plt.scatter(logit2prob(logr, X), X)   # plot the probabilities against the X values
plt.show()


OUTPUT:
[[0.60749955]
[0.19268876]
[0.12775886]
[0.00955221]
[0.08038616]
[0.07345637]
[0.88362743]
[0.77901378]
[0.88924409]
[0.81293497]
[0.57719129]
[0.96664243]]

RESULT:

Thus the python program for building and validating logistic models is executed successfully.


Exp No: 11 TIME SERIES ANALYSIS


Date:

AIM:
The aim of performing time series analysis is to model and forecast the behavior of a time series data
over a period of time, using statistical methods, in order to identify patterns, trends, and seasonality in
the data.

ALGORITHM:
The algorithm for performing time series analysis involves the following steps:
1. Collect data: Collect data on the time series variable over a period of time.
2. Visualize the data: Plot the time series data to identify patterns, trends, and seasonality.
3. Decompose the time series: Decompose the time series into its components, which are
trend, seasonality, and residual variation. This can be done using techniques such as moving
averages, exponential smoothing, or the Box-Jenkins method.
4. Model the trend: Model the trend component of the time series using techniques such as linear
regression, exponential smoothing, or ARIMA models.
5. Model the seasonality: Model the seasonality component of the time series using techniques
such as seasonal decomposition, dummy variables, or Fourier series.
6. Model the residual variation: Model the residual variation component of the time series using
techniques such as autoregressive models, moving average models, or ARIMA models.
7. Choose the best model: Evaluate the fit of the different models using measures such as AIC,
BIC, and RMSE, and choose the model that best fits the data.
8. Forecast future values: Use the chosen model to forecast future values of the time series
variable.
9. Validate the model: Validate the model by comparing the forecasted values with actual values
from a hold-out sample, or by using cross-validation techniques.
10. Refine the model: Refine the model by making adjustments to the model specification, such
as adding or removing variables, transforming variables, or adding interaction terms.
11. Interpret the results: Interpret the results of the time series analysis in terms of the patterns,
trends, and seasonality of the data, and use the forecasted values to make predictions and
inform decision-making.
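The program that follows covers steps 1 and 2 (collecting and visualizing the series). Step 3 (decomposition) can be sketched with statsmodels' seasonal_decompose; the monthly series here is synthetic, made up purely for illustration:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

# three years of synthetic monthly data: a linear trend plus a yearly cycle
idx = pd.date_range("2021-01-01", periods=36, freq="MS")
values = np.arange(36) + 5 * np.sin(2 * np.pi * np.arange(36) / 12)
series = pd.Series(values, index=idx)

# split the series into trend, seasonal, and residual components
result = seasonal_decompose(series, model="additive", period=12)
result.plot()
plt.show()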


PROGRAM:
# import modules
import pandas as pd
import matplotlib.pyplot as plt
import datetime
import numpy as np

# create the dataframe
dataframe = pd.DataFrame({'date_of_week': np.array([datetime.datetime(2021, 11, i+1) for i in range(7)]),
                          'classes': [5, 6, 8, 2, 3, 7, 4]})

# Plotting the time series of the given dataframe
plt.plot(dataframe.date_of_week, dataframe.classes)
# Giving a title to the chart using plt.title
plt.title('Classes by Date')
# rotating the x-axis tick labels 30 degrees towards the right
plt.xticks(rotation=30, ha='right')
# Providing x and y labels for the chart
plt.xlabel('Date')
plt.ylabel('Classes')
plt.show()

OUTPUT:

RESULT:

Thus the python program for performing time series analysis is executed successfully.


Exp No: 12 CONTENT BEYOND SYLLABUS - HEATMAPS


Date:

AIM:
To make a graphical representation of data that uses colors to visualize the values of a matrix.

ALGORITHM:
1. Import the required libraries: NumPy for generating the data, seaborn for the heatmap, and
Matplotlib for displaying it.
2. Prepare the data: Arrange the values to be visualized as a 2-D matrix (here, a 10x10 matrix of
random integers).
3. Plot the heatmap: Pass the matrix to seaborn's heatmap() function, which maps each value to a
color.
4. Customize the plot: Optionally adjust options such as the color map and annotations.
5. Display the plot: Use Matplotlib's show() function to display the heatmap.
PROGRAM:
# importing the modules
import numpy as np
import seaborn as sn
import matplotlib.pyplot as plt

# generating a 2-D 10x10 matrix of random integers
# from 1 to 99 (high=100 is exclusive)
data = np.random.randint(low=1, high=100, size=(10, 10))
print("The data to be plotted:\n")
print(data)

# plotting the heatmap
hm = sn.heatmap(data=data)
# displaying the plotted heatmap
plt.show()


OUTPUT:
The data to be plotted:

[[46 30 55 86 42 94 31 56 21 7]
[68 42 95 28 93 13 90 27 14 65]
[73 84 92 66 16 15 57 36 46 84]
[ 7 11 41 37 8 41 96 53 51 72]
[52 64 1 80 33 30 91 80 28 88]
[19 93 64 23 72 15 39 35 62 3]
[51 45 51 17 83 37 81 31 62 10]
[ 9 28 30 47 73 96 10 43 30 2]
[74 28 34 26 2 70 82 53 97 96]
[86 13 60 51 95 26 22 29 14 29]]

RESULT:
Thus the python program for plotting a heatmap is executed successfully.


Exp No: 13 CONTENT BEYOND SYLLABUS – INTERACTIVE VISUALIZATION WITH BOKEH


Date:

AIM:
To make an interactive data visualization with Bokeh.

ALGORITHM:
1. Import the Bokeh plotting functions: Import figure, output_file, and show from
bokeh.plotting.
2. Prepare the data: Create the lists (or arrays) of x and y values to be plotted.
3. Choose an output target: Use output_file() to write the interactive plot to a standalone HTML
file.
4. Create a figure: Call figure() with the desired title, size, and toolbar location.
5. Add glyphs: Add circle (or other) glyphs to the figure to represent the data points.
6. Show the plot: Call show() to render the interactive plot in the browser.

PROGRAM:
from bokeh.plotting import figure, output_file, show

x = [1, 2, 3, 4, 5]
y = [6, 7, 6, 4, 5]

output_file("demo.html")
p = figure(title='demo', width=300, height=300,
toolbar_location="below")
p.circle(x, y)

show(p)


OUTPUT:

RESULT:
Thus the python program for interactive data visualization with Bokeh is executed successfully.
