AD3411 DATA SCIENCE AND ANALYTICS LAB (2) - Removed
DATE:
AIM:
To write a Python program for working with data frames using pandas.
ALGORITHM:
Step 1: Start
Step 2: Import the pandas module as pd
Step 3: Declare the data as a dictionary of named columns
Step 4: Pass the data to the DataFrame constructor
Step 5: Print the data frame
Step 6: Stop
PROGRAM:
CREATE A SIMPLE PANDAS DATA FRAME:
import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
#load data into a DataFrame object:
df = pd.DataFrame(data)
print(df)
OUTPUT:
   calories  duration
0       420        50
1       380        40
2       390        45
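Individual rows can also be pulled out of the frame by label. The following is a minimal sketch that rebuilds the same frame and uses loc to select rows; the variable names first_row and first_two are illustrative and not part of the original program.

```python
import pandas as pd

data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
}
df = pd.DataFrame(data)

# Select the row labelled 0; the result is a Series indexed by column name.
first_row = df.loc[0]
print(first_row)

# Passing a list of labels keeps the result a DataFrame.
first_two = df.loc[[0, 1]]
print(first_two)
```

Selecting a single label returns a Series named after that label, which is why its printout lists each column on its own line.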
RESULT:
Thus the Python Program for working with data frames using pandas has been
executed successfully.
EXP NO:2 BASIC PLOTS USING MATPLOTLIB
DATE:
AIM:
To write a Python program for working with basic plots using Matplotlib.
ALGORITHM:
Step 1: Start
Step 2: Import pyplot from the matplotlib module as plt
Step 3: Declare the data lists to be plotted
Step 4: Give the necessary x and y plot values
Step 5: Display the basic plots
Step 6: Stop
PROGRAM:
import matplotlib.pyplot as plt
a = [1, 2, 3, 4, 5]
b = [0, 0.6, 0.2, 15, 10, 8, 16, 21]
plt.plot(a)
plt.plot(b, "or")
plt.plot(list(range(0, 22, 3)))
plt.xlabel('Day ->')
plt.ylabel('Temp ->')
c = [4, 2, 6, 8, 3, 20, 13, 15]
plt.plot(c, label='4th Rep')
ax = plt.gca()
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.spines['left'].set_bounds(-3, 40)
plt.xticks(list(range(-3, 10)))
plt.yticks(list(range(-3, 20, 3)))
ax.legend(['1st Rep', '2nd Rep', '3rd Rep', '4th Rep'])
plt.annotate('Temperature V / s Days', xy=(1.01, -2.15))
plt.title('BASIC PLOTS')
plt.show()
OUTPUT:
RESULT:
Thus the Python Program for working with Basic Plots Using Matplotlib
has been executed successfully.
EXP NO:3 FREQUENCY DISTRIBUTIONS, AVERAGES, VARIABILITY
DATE:
AIM:
To write a Python program for frequency distributions, averages, and variability.
ALGORITHM:
PROGRAM:
# Python program to get the average of a list
import numpy as np
# named "values" so the built-in list type is not shadowed
values = [2, 40, 2, 502, 177, 7, 9]
print(np.average(values))
Output:
105.57142857142857
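The program above covers only the average. As a rough sketch of the other two topics in the experiment title, the same list can be tallied into a frequency distribution and summarised with a few variability measures. The use of collections.Counter and the variable names are assumptions made for illustration, not part of the original program.

```python
from collections import Counter

import numpy as np

values = [2, 40, 2, 502, 177, 7, 9]

# Frequency distribution: how many times each distinct value occurs.
frequency = Counter(values)
for value, count in sorted(frequency.items()):
    print(value, count)

# Variability: range, population variance and population standard deviation.
data_range = max(values) - min(values)
variance = np.var(values)
std_dev = np.std(values)
print("range =", data_range)
print("variance =", variance)
print("standard deviation =", std_dev)
```

np.var and np.std divide by n by default; passing ddof=1 gives the sample versions instead.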
RESULT:
Thus the Python Program for working with Frequency Distributions,
Averages, Variability has been executed successfully.
EXP NO:4 NORMAL CURVES, CORRELATION AND SCATTER
PLOTS, CORRELATION COEFFICIENT
DATE:
AIM:
To Write a Python Program for Normal Curves, Correlation And Scatter Plots,
Correlation Coefficient
ALGORITHM:
PROGRAM:
#Normal curves
import matplotlib.pyplot as plt
import numpy as np
mu, sigma = 0.5, 0.1
s = np.random.normal(mu, sigma, 1000)
# Create the bins and histogram (density=True replaces the removed normed=True argument)
count, bins, ignored = plt.hist(s, 20, density=True)
plt.show()
Output:
Output:
0.8603090020146067
#Correlation coefficient
import math
def correlationCoefficient(X, Y, n):
    sum_X = 0
    sum_Y = 0
    sum_XY = 0
    squareSum_X = 0
    squareSum_Y = 0
    i = 0
    while i < n:
        # sum of elements of array X.
        sum_X = sum_X + X[i]
        # sum of elements of array Y.
        sum_Y = sum_Y + Y[i]
        # sum of X[i] * Y[i].
        sum_XY = sum_XY + X[i] * Y[i]
        # sum of square of array elements.
        squareSum_X = squareSum_X + X[i] * X[i]
        squareSum_Y = squareSum_Y + Y[i] * Y[i]
        i = i + 1
    # use formula for calculating correlation coefficient.
    corr = float(n * sum_XY - sum_X * sum_Y) / float(math.sqrt(
        (n * squareSum_X - sum_X * sum_X) * (n * squareSum_Y - sum_Y * sum_Y)))
    return corr
X = [15, 18, 21, 24, 27]
Y = [25, 25, 27, 31, 32]
# Find the size of array.
n = len(X)
# Function call to correlationCoefficient.
print('{0:.6f}'.format(correlationCoefficient(X, Y, n)))
OUTPUT:
0.953463
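As a cross-check on the hand-written formula, NumPy's corrcoef computes the same Pearson coefficient. This is only a sketch using the X and Y lists from the program; the variable name r is illustrative.

```python
import numpy as np

X = [15, 18, 21, 24, 27]
Y = [25, 25, 27, 31, 32]

# corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is r(X, Y).
r = np.corrcoef(X, Y)[0, 1]
print('{0:.6f}'.format(r))  # 0.953463
```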
RESULT:
Thus the Python Program for Normal Curves, Correlation And Scatter Plots,
Correlation Coefficient has been executed successfully.
EXP NO:5 REGRESSION
DATE:
AIM:
To Write a Python Program for Regression concept.
ALGORITHM:
PROGRAM:
import numpy as np
import matplotlib.pyplot as plt
def estimate_coef(x, y):
    # number of observations/points
    n = np.size(x)
    # mean of x and y vector
    m_x = np.mean(x)
    m_y = np.mean(y)
    # calculating cross-deviation and deviation about x
    SS_xy = np.sum(y*x) - n*m_y*m_x
    SS_xx = np.sum(x*x) - n*m_x*m_x
    # calculating regression coefficients
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1*m_x
    return (b_0, b_1)
def plot_regression_line(x, y, b):
    # plotting the actual points as scatter plot
    plt.scatter(x, y, color = "m", marker = "o", s = 30)
    # predicted response vector
    y_pred = b[0] + b[1]*x
    # plotting the regression line
    plt.plot(x, y_pred, color = "g")
    # putting labels
    plt.xlabel('x')
    plt.ylabel('y')
    # function to show plot
    plt.show()
def main():
    # observations / data
    x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])
    # estimating coefficients
    b = estimate_coef(x, y)
    print("Estimated coefficients:\nb_0 = {}\nb_1 = {}".format(b[0], b[1]))
    # plotting regression line
    plot_regression_line(x, y, b)
if __name__ == "__main__":
    main()
OUTPUT:
Estimated coefficients:
b_0 = 1.23636363636
b_1 = 1.1696969697
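NumPy's polyfit fits the same least-squares line, so it can serve as an independent check on the coefficients for the sample data. This is a minimal sketch; the names slope and intercept are illustrative.

```python
import numpy as np

x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])

# A degree-1 fit returns the coefficients highest power first: [slope, intercept].
slope, intercept = np.polyfit(x, y, 1)
print("b_0 =", intercept)
print("b_1 =", slope)
```

For these ten points SS_xy = 96.5 and SS_xx = 82.5, so the slope is 96.5 / 82.5, roughly 1.17, and the intercept is roughly 1.24.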
RESULT:
Thus the Python Program for Regression has been executed successfully
EXP NO:6 Z-TEST
DATE:
AIM:
To Write a Python Program for z-test concept.
ALGORITHM:
Step 1: Evaluate the data distribution.
Step 2: Formulate Hypothesis statement symbolically
Step 3: Define the level of significance (alpha)
Step 4: Calculate Z test statistic or Z score.
Step 5: Derive P-value for the Z score calculated.
Step 6: Make decision:
Step 6.1: If P-Value <= alpha, reject H0.
Step 6.2: If P-Value > alpha, fail to reject H0.
PROGRAM:
import math
import numpy as np
from numpy.random import randn
from statsmodels.stats.weightstats import ztest
# Generate a random array of 50 numbers with mean 110 and sd 15/sqrt(50),
# similar to the IQ scores data we assume above
mean_iq = 110
sd_iq = 15/math.sqrt(50)
alpha = 0.05
null_mean = 100
data = sd_iq*randn(50)+mean_iq
# print mean and sd
print('mean=%.2f stdv=%.2f' % (np.mean(data), np.std(data)))
ztest_Score, p_value = ztest(data, value=null_mean, alternative='larger')
if(p_value < alpha):
    print("Reject Null Hypothesis")
else:
    print("Fail to Reject Null Hypothesis")
OUTPUT:
Reject Null Hypothesis
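Steps 4 to 6 of the algorithm can also be worked by hand. The sketch below uses only the standard library and a small hypothetical sample of IQ scores (made up for illustration, not taken from the experiment) and tests against the same null mean of 100. The upper-tail p-value comes from the standard normal distribution through math.erfc.

```python
import math
import statistics

# Hypothetical sample of ten IQ scores (illustrative values only).
scores = [112, 108, 115, 109, 111, 107, 113, 110, 114, 106]
null_mean = 100
alpha = 0.05

n = len(scores)
sample_mean = statistics.mean(scores)
sample_sd = statistics.stdev(scores)  # sample standard deviation (divides by n - 1)

# Step 4: Z score = (sample mean - hypothesised mean) / standard error.
z_score = (sample_mean - null_mean) / (sample_sd / math.sqrt(n))

# Step 5: upper-tail p-value, P(Z > z) = 0.5 * erfc(z / sqrt(2)).
p_value = 0.5 * math.erfc(z_score / math.sqrt(2))

# Step 6: decision.
if p_value <= alpha:
    decision = "Reject Null Hypothesis"
else:
    decision = "Fail to Reject Null Hypothesis"
print("z = %.3f, p = %.3g" % (z_score, p_value))
print(decision)
```

A sample of ten is small for a z-test in practice; it is kept short here only so the arithmetic is easy to follow.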
RESULT:
Thus the Python Program for Z TEST has been executed successfully
EXP NO:7 T-TEST
DATE:
AIM:
To Write a Python Program for T-test concept.
ALGORITHM:
Step 1: Create some dummy age data for the population of voters in the entire
country.
Step 2: Create a sample of voters in Minnesota and test whether the average age
of voters in Minnesota differs from the population.
Step 3: Conduct a t-test at a 95% confidence level and see if it correctly
rejects the null hypothesis that the sample comes from the same distribution as
the population.
Step 4: If the t-statistic lies outside the quantiles of the t-distribution
corresponding to our confidence level and degrees of freedom, we reject the
null hypothesis.
Step 5: Calculate the chances of seeing a result as extreme as the one being
observed (known as the p-value) by passing the t-statistic in as the quantile to
the stats.t.cdf() function.
PROGRAM:
import numpy as np
from scipy import stats
# Defining two random distributions
# Sample Size
N = 10
# Gaussian distributed data with mean = 2 and var = 1
x = np.random.randn(N) + 2
# Gaussian distributed data with mean = 0 and var = 1
y = np.random.randn(N)
# Calculating the Standard Deviation
# Calculating the variance to get the standard deviation
var_x = x.var(ddof = 1)
var_y = y.var(ddof = 1)
# Standard Deviation
SD = np.sqrt((var_x + var_y) / 2)
print("Standard Deviation =", SD)
# Calculating the T-Statistics
tval = (x.mean() - y.mean()) / (SD * np.sqrt(2 / N))
# Comparing with the critical T-Value
# Degrees of freedom
dof = 2 * N - 2
# p-value after comparison with the T-Statistics
pval = 1 - stats.t.cdf(tval, df = dof)
print("t = " + str(tval))
print("p = " + str(2 * pval))
## Cross Checking using the internal function from SciPy Package
tval2, pval2 = stats.ttest_ind(x, y)
print("t = " + str(tval2))
print("p = " + str(pval2))
OUTPUT:
Standard Deviation = 0.7642398582227466
t = 4.87688162540348
p = 0.0001212767169695983
t = 4.876881625403479
p = 0.00012127671696957205
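The algorithm describes a one-sample test of a sample against a population mean, whereas the program compares two samples. A sketch of the one-sample variant follows, assuming SciPy is available. The age data are hypothetical Poisson draws from a seeded generator, and all variable names are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical ages: a large "population" and a smaller regional sample.
population_ages = rng.poisson(lam=43, size=10000)
sample_ages = rng.poisson(lam=39, size=50)

pop_mean = population_ages.mean()
n = sample_ages.size
dof = n - 1

# t-statistic: difference in means divided by the sample standard error.
t_manual = (sample_ages.mean() - pop_mean) / (sample_ages.std(ddof=1) / np.sqrt(n))

# Two-sided critical value at the 95% confidence level.
t_crit = stats.t.ppf(0.975, df=dof)
reject = bool(abs(t_manual) > t_crit)

# Two-sided p-value obtained from the t-distribution CDF.
p_manual = 2 * stats.t.cdf(-abs(t_manual), df=dof)

# Cross-check with SciPy's built-in one-sample test.
t_scipy, p_scipy = stats.ttest_1samp(sample_ages, popmean=pop_mean)

print("t =", t_manual, "critical value =", t_crit)
print("p =", p_manual, "reject H0:", reject)
```

Rejecting when the statistic falls outside the critical quantiles is equivalent to rejecting when the two-sided p-value is below 0.05.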
RESULT:
Thus the Python Program for T TEST has been executed successfully
EXP NO:8 ANOVA TEST
DATE:
AIM:
To Write a Python Program for ANOVA.
ALGORITHM:
PROGRAM:
# Installing the package
install.packages("dplyr")
# Loading the package
library(dplyr)
# Variance in mean within group and between group
boxplot(mtcars$disp~factor(mtcars$gear),
xlab = "gear", ylab = "disp")
# Step 1: Setup Null Hypothesis and Alternate Hypothesis
# H0 = mu = mu01 = mu02 (There is no difference
# between average displacement for different gear)
# H1 = Not all means are equal
# Step 2: Calculate test statistics using aov function
mtcars_aov <- aov(mtcars$disp~factor(mtcars$gear))
summary(mtcars_aov)
# Step 3: Calculate F-Critical Value
# For 0.05 Significant value, critical value = alpha = 0.05
# Step 4: Compare test statistics with F-Critical value
# and conclude test p < alpha, Reject Null Hypothesis
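The listing above is written in R. For readers working in Python, a rough equivalent of the one-way ANOVA can be sketched with scipy.stats.f_oneway. The three groups below are hypothetical displacement readings made up for illustration (they are not the mtcars data), and the F statistic is also computed by hand from the between-group and within-group sums of squares as a check.

```python
import numpy as np
from scipy import stats

# Hypothetical displacement readings for three gear groups (not mtcars).
gear3 = np.array([300.0, 320.0, 280.0, 350.0, 310.0])
gear4 = np.array([150.0, 120.0, 140.0, 160.0, 130.0])
gear5 = np.array([200.0, 180.0, 250.0, 220.0, 210.0])
groups = [gear3, gear4, gear5]

# One-way ANOVA: H0 says all group means are equal.
f_stat, p_value = stats.f_oneway(*groups)

# Manual F: mean square between groups divided by mean square within groups.
all_values = np.concatenate(groups)
grand_mean = all_values.mean()
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
df_between = len(groups) - 1
df_within = all_values.size - len(groups)
f_manual = (ss_between / df_between) / (ss_within / df_within)

alpha = 0.05
print("F =", f_stat, "p =", p_value)
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```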
OUTPUT:
RESULT:
Thus the Python Program for ANOVA has been executed successfully
EXP NO:9 BUILDING AND VALIDATING LINEAR MODELS
DATE:
AIM:
To Write a Python Program to build and validate linear models
ALGORITHM:
Step1: Consider a set of values x, y.
Step2: Take the linear equation y = a + bx.
Step3: Compute the values of a and b from the given values:
b = (nΣxy − (Σx)(Σy)) / (nΣx² − (Σx)²), a = (Σy − b(Σx)) / n.
Step4: Substitute the values of a and b into the equation y = a + bx.
Step5: Regress the value of y for any x.
PROGRAM:
# Importing the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Note: load_boston was removed in scikit-learn 1.2, so this listing needs an older release
from sklearn.datasets import load_boston
sns.set(style="ticks", color_codes=True)
plt.rcParams['figure.figsize'] = (8,5)
plt.rcParams['figure.dpi'] = 150
# loading the data
boston = load_boston()
You can check the keys of the loaded dataset with the following code.
print(boston.keys())
The output will be as follows:
dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])
print(boston.DESCR)
You will find these details in the output:
Attribute Information (in order):
- CRIM per capita crime rate by town
- ZN proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS proportion of non-retail business acres per town
- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX nitric oxides concentration (parts per 10 million)
- RM average number of rooms per dwelling
- AGE proportion of owner-occupied units built prior to 1940
- DIS weighted distances to five Boston employment centres
- RAD index of accessibility to radial highways
- TAX full-value property-tax rate per $10,000
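The listing stops after loading and describing the data. The closed-form steps in the algorithm can be sketched independently of any dataset loader. The example below is a minimal sketch that reuses the ten sample observations from the regression experiment (EXP NO:5), computes a and b from the summation formulas, and validates the fit with the coefficient of determination R². The variable names are illustrative assumptions.

```python
import numpy as np

# Sample observations from the regression experiment.
x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=float)
y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12], dtype=float)
n = x.size

# Step 3: slope b and intercept a from the summation formulas.
b = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x ** 2) - np.sum(x) ** 2)
a = (np.sum(y) - b * np.sum(x)) / n

# Steps 4 and 5: plug a and b into y = a + bx and predict.
y_pred = a + b * x

# Validation: R^2 = 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - np.mean(y)) ** 2)
r_squared = 1 - ss_res / ss_tot

print("a =", a)
print("b =", b)
print("R^2 =", r_squared)
```

An R² close to 1 indicates that the fitted line explains most of the variation in y. Validating on held-out data rather than the training points would be the natural next step.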
RESULT:
Thus the Python Program for building and validating linear models
has been executed successfully
EXP NO:10 BUILDING AND VALIDATING LOGISTIC MODELS
DATE:
AIM:
To Write a Python Program to build and validate logistic models
ALGORITHM:
Step1: Initialize the variables
Step2: Set the Data frame
Step3: Split the data set into training and testing sets.
Step4: Fit the data into logistic regression function.
Step5: Predict the test data set.
Step6: Print the results.
PROGRAM:
Building the Logistic Regression model:
# importing libraries
import statsmodels.api as sm
import pandas as pd
# loading the training dataset
df = pd.read_csv('logit_train1.csv', index_col = 0)
# defining the dependent and independent variables
Xtrain = df[['gmat', 'gpa', 'work_experience']]
ytrain = df[['admitted']]
# building the model and fitting the data
log_reg = sm.Logit(ytrain, Xtrain).fit()
Output :
Optimization terminated successfully.
Current function value: 0.352707
Iterations 8
# printing the summary table
print(log_reg.summary())
Output :
Logit Regression Results
=============================================================
Dep. Variable: admitted No. Observations: 30
Model: Logit Df Residuals: 27
Method: MLE Df Model: 2
Date: Wed, 15 Jul 2020 Pseudo R-squ.: 0.4912
Time: 16:09:17 Log-Likelihood: -10.581
Output :
Optimization terminated successfully.
Current function value: 0.352707
Iterations 8
Actual values [0, 0, 0, 0, 0, 1, 1, 0, 1, 1]
Predictions : [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
Output :
Confusion Matrix :
[[6 0]
[2 2]]
Test accuracy = 0.8
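The confusion matrix and accuracy follow directly from the actual and predicted labels printed above. A minimal sketch in plain Python, with illustrative variable names, counts each (actual, predicted) pair:

```python
actual = [0, 0, 0, 0, 0, 1, 1, 0, 1, 1]
predicted = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# matrix[i][j] counts the cases whose actual class is i and predicted class is j.
matrix = [[0, 0], [0, 0]]
for a, p in zip(actual, predicted):
    matrix[a][p] += 1

# Accuracy is the share of cases on the diagonal (correct predictions).
accuracy = (matrix[0][0] + matrix[1][1]) / len(actual)

print("Confusion Matrix :", matrix)  # [[6, 0], [2, 2]]
print("Test accuracy =", accuracy)   # 0.8
```

Six rejections and two admissions are predicted correctly, and two actual admissions are missed, which gives 8 correct out of 10.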
RESULT:
Thus the Python Program for building and validating logistic models
has been executed successfully
EXP NO:11 TIME SERIES ANALYSIS
DATE:
AIM:
To Write a Python Program for Time Series Analysis
ALGORITHM:
Step1: Loading time series dataset correctly in Pandas
Step2: Indexing in Time-Series Data
Step4: Time-Resampling using Pandas
Step5: Rolling Time Series
Step6: Plotting Time-series Data using Pandas
PROGRAM:
Data Preprocessing
This step includes removing columns we do not need, checking for missing values,
aggregating sales by date, and so on.
cols = ['Row ID', 'Order ID', 'Ship Date', 'Ship Mode', 'Customer ID', 'Customer Name',
'Segment', 'Country', 'City', 'State', 'Postal Code', 'Region', 'Product ID',
'Category', 'Sub-Category', 'Product Name', 'Quantity', 'Discount', 'Profit']
furniture.drop(cols, axis=1, inplace=True)
furniture = furniture.sort_values('Order Date')
furniture.isnull().sum()
Order Date 0
Sales 0
dtype: int64
furniture = furniture.groupby('Order Date')['Sales'].sum().reset_index()
Figure 1
Figure 2
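We will use the average daily sales value for each month instead, and we use
the start of each month as the timestamp.
# resample needs a DatetimeIndex, so index the frame by order date first
furniture = furniture.set_index('Order Date')
y = furniture['Sales'].resample('MS').mean()
Have a quick peek at the 2017 furniture sales data.
y['2017':]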
Figure 3
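The indexing, resampling and rolling steps of the algorithm can be tried on a small synthetic table. The sketch below assumes hypothetical sales figures (not the furniture data used in the experiment) and illustrative variable names. It indexes by date, averages sales per month with the month start as timestamp, and then takes a two-month rolling mean.

```python
import pandas as pd

# Hypothetical sales records (illustrative values only).
sales = pd.DataFrame({
    "Order Date": pd.to_datetime(
        ["2017-01-05", "2017-01-20", "2017-02-10", "2017-03-15", "2017-03-25"]
    ),
    "Sales": [100.0, 200.0, 50.0, 80.0, 120.0],
})

# Indexing: use the dates as the index so time-based operations work.
sales = sales.set_index("Order Date").sort_index()

# Time resampling: mean sales per month, stamped at the month start ('MS').
monthly = sales["Sales"].resample("MS").mean()

# Rolling: mean over a window of two consecutive months.
rolling = monthly.rolling(window=2).mean()

print(monthly)
print(rolling)
```

The first rolling value is NaN because a two-month window is not yet full at the first month.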
RESULT:
Thus the Python Program for Time Series Analysis has been executed successfully