PSIT1P1
Research in Computing
Practical
Table of Contents

Sr. No.  Practical  Name of the Practical
1)   1A  Write a program for obtaining descriptive statistics of data.
2)   1B  Import data from different data sources (from Excel, csv, mysql, sql server, oracle to R/Python/Excel)
3)   2A  Design a survey form for a given case study, collect the primary data and analyze it.
4)   2B  Perform suitable analysis of given secondary data.
5)   3A  Perform testing of hypothesis using one sample t-test.
6)   3B  Perform testing of hypothesis using two sample t-test.
7)   3C  Perform testing of hypothesis using paired t-test.
8)   4A  Perform testing of hypothesis using chi-squared goodness-of-fit test.
9)   4B  Perform testing of hypothesis using chi-squared Test of Independence.
10)  5   Perform testing of hypothesis using Z-test.
11)  6A  Perform testing of hypothesis using one-way ANOVA.
12)  6B  Perform testing of hypothesis using two-way ANOVA.
13)  6C  Perform testing of hypothesis using multivariate ANOVA (MANOVA).
14)  7A  Perform the Random sampling for the given data and analyse it.
15)  7B  Perform the Stratified sampling for the given data and analyse it.
16)  8   Compute different types of correlation.
17)  9A  Perform linear regression for prediction.
18)  9B  Perform polynomial regression for prediction.
19)  10A Perform multiple linear regression.
20)  10B Perform Logistic regression.
21)      List of supporting files
Practical 1:
A. Write a program for obtaining descriptive statistics of data.
################################################################
#Practical 1A: Write a python program on descriptive statistics analysis.
################################################################
import pandas as pd
#Create a Dictionary of series
d = {'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}
#Create a DataFrame
df = pd.DataFrame(d)
print(df)
print('############ Sum ########## ')
print (df.sum())
print('############ Mean ########## ')
print (df.mean())
print('############ Standard Deviation ########## ')
print (df.std())
print('############ Descriptive Statistics ########## ')
print (df.describe())
Output:
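As a small, hedged extension of the listing above, the same DataFrame supports further descriptive measures beyond sum, mean and standard deviation; the sketch below uses only standard pandas methods on the existing df.
# Sketch: additional descriptive statistics on the same DataFrame df.
print('############ Median ########## ')
print(df.median())
print('############ Variance ########## ')
print(df.var())
print('############ Skewness ########## ')
print(df.skew())
print('############ Minimum and Maximum ########## ')
print(df.min())
print(df.max())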
B. Import data from different data sources (from Excel, csv, mysql, sql server, oracle to R/Python/Excel)
MySQL
import mysql.connector
conn = mysql.connector.connect(host='localhost',
                               database='DataScience',
                               user='root',
                               password='root')
if conn.is_connected():
    print('###### Connection With MySQL Established Successfully ##### ')
else:
    print('Not Connected -- Check Connection Properties')
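Once the connection is open, table contents can be pulled into a DataFrame with pandas; the snippet below is only a sketch and assumes a table named employee exists in the DataScience database (the table name is illustrative).
# Sketch: read a MySQL table into pandas over the open connection (hypothetical table name).
import pandas as pd
query = 'SELECT * FROM employee'
df_mysql = pd.read_sql(query, con=conn)   # run the query and load the rows
print(df_mysql.head())                    # preview of the imported data
conn.close()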
Microsoft Excel
##################Retrieve-Country-Currency.py
################################################################
# -*- coding: utf-8 -*-
################################################################
import os
import pandas as pd
################################################################
Base='C:/VKHCG'
################################################################
sFileDir=Base + '/01-Vermeulen/01-Retrieve/01-EDS/02-Python'
#if not os.path.exists(sFileDir):
#os.makedirs(sFileDir)
################################################################
CurrencyRawData = pd.read_excel('C:/VKHCG/01-Vermeulen/00-RawData/Country_Currency.xlsx')
sColumns = ['Country or territory', 'Currency', 'ISO-4217']
CurrencyData = CurrencyRawData[sColumns]
CurrencyData.rename(columns={'Country or territory': 'Country', 'ISO-4217': 'CurrencyCode'}, inplace=True)
CurrencyData.dropna(subset=['Currency'], inplace=True)
CurrencyData['Country'] = CurrencyData['Country'].map(lambda x: x.strip())
CurrencyData['Currency'] = CurrencyData['Currency'].map(lambda x: x.strip())
CurrencyData['CurrencyCode'] = CurrencyData['CurrencyCode'].map(lambda x: x.strip())
print(CurrencyData)
print('~~~~~~ Data from Excel Sheet Retrieved Successfully ~~~~~~~ ')
################################################################
OUTPUT:
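The practical also lists CSV as a source; a minimal sketch following the same pattern with pandas.read_csv is given below. The file name is only illustrative and is assumed to sit in the same raw-data folder.
# Sketch: importing a CSV source into pandas (illustrative file name).
import pandas as pd
csv_path = 'C:/VKHCG/01-Vermeulen/00-RawData/Country_Code.csv'
CountryData = pd.read_csv(csv_path, encoding='latin-1')
print(CountryData.head())
print('~~~~~~ Data from CSV File Retrieved Successfully ~~~~~~~ ')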
Case 1:
Case 2:
A research agency wants to study the perception of app-based taxi services in
Mumbai, Thane and Navi Mumbai. The survey focuses on customers' attitudes towards
app-based taxi services as well as towards regular taxi cabs.
Design a questionnaire that seeks information about the target taxi service, the
respondent's experience using taxi services, access, support available, obstacles and
some personal background information, with the following objectives:
1. To find out customer satisfaction with the app-based taxi services.
2. To find the level of convenience and comfort with app-based taxi services.
3. To know their opinion about the tariff system and promptness of service.
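Once responses are collected, a short pandas sketch such as the one below can summarise them; the file and column names (taxi_survey.csv, City, Satisfaction, Comfort) are hypothetical and stand in for whatever the final questionnaire uses.
# Sketch: summarising primary survey data (hypothetical file and column names).
import pandas as pd
responses = pd.read_csv('taxi_survey.csv')                    # exported questionnaire responses
print(responses['Satisfaction'].describe())                   # overall satisfaction scores
print(responses.groupby('City')['Satisfaction'].mean())       # compare Mumbai / Thane / Navi Mumbai
print(pd.crosstab(responses['City'], responses['Comfort']))   # distribution of comfort ratings by city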
Example: Analyze the given Population Census Data for Planning and Decision
Making by using the size and composition of populations.
To calculate the percent of males in cell E4, enter the formula =-1*100*B4/$D$22 and
copy the formula in cell E4 down to cell E21.
To calculate the percent of females in cell F4, enter the formula =100*C4/$D$22.
Copy the formula in cell F4 down to cell F21.
To build the population pyramid, we need to choose a horizontal bar chart with two
series of data (% male and % female) and the age labels in column A as the Category
X-axis labels. Highlight the range A3:A21, hold down the CTRL key and highlight the
range E3:F21
Under the Insert tab, under horizontal bar charts, select Clustered Bar.
Put the tip of your mouse arrow on the Y-axis (vertical axis) so it says "Category
Axis", right-click and choose Format Axis.
Choose the Axis Options tab, set the major and minor tick mark type to None and Axis
labels to Low, and click OK.
Click on any of the bars in your pyramid, right-click and select "Format Data Series".
Set the Overlap to 100 and Gap Width to 0. Click OK.
Output:
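The same pyramid can be reproduced outside Excel; the sketch below assumes the age-group labels and the male/female percentage columns have already been computed (the values shown are placeholders only).
# Sketch: population pyramid with matplotlib (placeholder values).
import matplotlib.pyplot as plt
age_groups = ['0-4', '5-9', '10-14']    # placeholder age labels
pct_male = [-5.2, -4.8, -4.5]           # negated so that male bars extend to the left
pct_female = [5.0, 4.7, 4.4]
plt.barh(age_groups, pct_male, color='steelblue', label='% Male')
plt.barh(age_groups, pct_female, color='salmon', label='% Female')
plt.xlabel('Percent of population')
plt.legend()
plt.show()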
Experimental Data
To calculate the mean, go to cell A22 and type =SUM(A2:A21)/20.
To calculate the standard deviation, go to cell A23 and type =STDEV(A2:A21).
Comparison Data
To calculate the mean, go to cell B22 and type =SUM(B2:B21)/20.
Our calculated value is larger than the tabled value at alpha = .01, so we reject the null
hypothesis and accept the alternative hypothesis, namely, that the difference in gain
scores is likely the result of the experimental treatment and not the result of chance
variation.
import numpy as np
from scipy import stats
from numpy.random import randn
#a = [35,40,12,15,21,14,46,10,28,48,16,30, 32,48,31,22,12,39,19,25]
#b = [2,27,31,38,1,19,1,34,3,1,2,1,3,1,2,1,3,29,37,2]
a = 5 * randn(100) + 50
b = 5 * randn(100) + 51
N = len(a)                        # sample size of each group
var_a = a.var(ddof=1)
var_b = b.var(ddof=1)
s = np.sqrt((var_a + var_b)/2)    # pooled standard deviation (equal group sizes)
t = (a.mean() - b.mean())/(s*np.sqrt(2/N))
df = 2*N - 2
#p-value after comparison with the t distribution (two-tailed)
p = 2*(1 - stats.t.cdf(abs(t), df=df))
print('t = {:.3f}, p = {:.4f}'.format(t, p))
Output:
Example 1: We have to test whether the height of men in the population is different
from the height of women in general. So we take a sample from the population and use
the t-test to see if the result is significant.
Example 2: Design a survey form to get the grades of students who have passed B. Sc. IT
and B. Sc. CS from the same University. Perform a t-test to test the given hypothesis:
H0 – Scores of students in the two courses are the same.
H1 – Scores of students in the two courses are different.
Example 3: Collect sample data on the use of online food-ordering apps to
compare whether the usage is equal or different.
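For Example 2, a direct route is scipy's independent two-sample t-test; the sketch below assumes the grade lists are already available (the numbers shown are placeholders, not collected data).
# Sketch: two-sample t-test for Example 2 (placeholder scores).
from scipy import stats
bsc_it = [65, 72, 58, 80, 69, 75, 61, 77]   # placeholder B.Sc. IT scores
bsc_cs = [70, 68, 74, 66, 79, 73, 71, 64]   # placeholder B.Sc. CS scores
t_stat, p_val = stats.ttest_ind(bsc_it, bsc_cs)
print('t =', t_stat, ' p =', p_val)
if p_val < 0.05:
    print('Reject H0: scores differ between the two courses')
else:
    print('Fail to reject H0: no significant difference')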
Program Code:
# -*- coding: utf-8 -*-
"""
Created on Mon Dec 16 19:49:23 2019
@author: MyHome
"""
import pandas as pd
df = pd.read_csv("blood_pressure.csv")
print(df[['bp_before','bp_after']].describe())
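The listing above only summarises the two columns; the paired t-test itself can be added with scipy, as in this sketch that continues from the same df.
# Sketch: paired t-test on the before/after blood pressure readings.
from scipy import stats
t_stat, p_val = stats.ttest_rel(df['bp_before'], df['bp_after'])
print('t =', t_stat, ' p =', p_val)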
Output:
A paired sample t-test was used to analyze the blood pressure before and after the
intervention, to test whether the intervention had a significant effect on the blood pressure. The
blood pressure before the intervention was higher (156.45 ± 11.39 units) compared to
the blood pressure post intervention (151.36 ± 14.18 units); there was a statistically
significant decrease in blood pressure (t(119) = 3.34, p = 0.0011) of 5.09 units.
Output:
         O     A    B    C    D     Total   (O-E)^2/E
Girls    11    7    5    5    11    39      6.075
Boys     30    4    3    10   14    61      6.075
Total    41    11   8    15   25    100     12.150
Ei       20.5  5.5  4    7.5  12.5  50
Prepare a contingency table as shown above.
To calculate Girls Students with ‘O’ Grade
Go to Cell N6 and type =COUNTIF($J$2:$K$40,"O")
Now calculate the chi-square contribution of each row.
Go to cell T6 and type
=SUM((N6-$N$9)^2/$N$9,(O6-$O$9)^2/$O$9,(P6-$P$9)^2/$P$9,(Q6-$Q$9)^2/$Q$9,(R6-$R$9)^2/$R$9)
Go to cell T7 and type
=SUM((N7-$N$9)^2/$N$9,(O7-$O$9)^2/$O$9,(P7-$P$9)^2/$P$9,(Q7-$Q$9)^2/$Q$9,(R7-$R$9)^2/$R$9)
To get the table (critical) value go to cell T11 and type =CHIINV(0.05,4)
Go to cell O13 and type =IF(T8>=T11,"H0 is Rejected","H0 is Accepted"), since H0 is
rejected when the calculated statistic (T8) is greater than or equal to the critical value (T11).
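The goodness-of-fit test can also be run in Python; the sketch below uses scipy.stats.chisquare and, for illustration, tests whether the combined grade totals from the table above are uniformly distributed across the five grades.
# Sketch: chi-squared goodness-of-fit with scipy, using the Total row of the table above.
from scipy import stats
observed = [41, 11, 8, 15, 25]      # total counts for grades O, A, B, C, D
expected = [20, 20, 20, 20, 20]     # equal expected counts for 100 students
chi2, p = stats.chisquare(f_obs=observed, f_exp=expected)
print('chi-square =', chi2, ' p =', p)
if p < 0.05:
    print('Reject H0: grades are not uniformly distributed')
else:
    print('Fail to reject H0')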
import numpy as np
import pandas as pd
from scipy import stats
np.random.seed(10)
stud_grade = np.random.choice(a=["O","A","B","C","D"],
                              p=[0.20, 0.20, 0.20, 0.20, 0.20], size=100)
stud_gen = np.random.choice(a=["Male","Female"], p=[0.5, 0.5], size=100)
mscpart1 = pd.DataFrame({"Grades":stud_grade, "Gender":stud_gen})
print(mscpart1)
stud_tab = pd.crosstab(mscpart1.Grades, mscpart1.Gender, margins=True)
# crosstab sorts its labels alphabetically, so rename in that order
stud_tab.columns = ["Female", "Male", "row_totals"]
stud_tab.index = ["A", "B", "C", "D", "O", "col_totals"]
observed = stud_tab.iloc[0:5, 0:2]
print(observed)
expected = np.outer(stud_tab["row_totals"][0:5],
                    stud_tab.loc["col_totals"][0:2]) / 100
print(expected)
chi_squared_stat = (((observed-expected)**2)/expected).sum().sum()
print('Calculated : ', chi_squared_stat)
crit = stats.chi2.ppf(q=0.95, df=4)   # critical value at 5% significance, df = (5-1)*(2-1)
print('Critical   : ', crit)
if chi_squared_stat >= crit:
    print('H0 is Rejected ')
else:
    print('H0 is Accepted ')
Practice Questions
1. Anita claims that girls take more normal and filter-applied selfies than boys, but
Karan does not agree with her, so he conducts a survey and collects the following data.
Would it be correct to say that he should reject Anita's claim that gender affects the
tendency to take selfies?
H0 - Gender does not affect the tendency to take selfies.

         Normal Selfie   Apply Filter   Total
Female   72              489            561
Male     48              530            578
TOTAL    120             1019           1139
2. Ketan claims that single people prefer different pizzas than married people do.
Ketan's brother Anand doesn't think that is true, so he conducts some research of
his own and collects the data below.
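A hedged sketch of how question 1 could be tested in Python, using scipy's chi-square test of independence on the table above:
# Sketch: chi-squared test of independence for the selfie survey (question 1).
from scipy.stats import chi2_contingency
observed = [[72, 489],   # Female: normal selfie, filter applied
            [48, 530]]   # Male:   normal selfie, filter applied
stat, pval, dof, expected = chi2_contingency(observed)
print('chi-square =', stat, ' p =', pval)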
if pval<0.05:
print("reject null hypothesis")
else:
print("accept null hypothesis")
Output:
if pval<0.05:
print("reject null hypothesis")
else:
print("accept null hypothesis")
From our data exploration, we can see that the average SAT scores are quite different
for each district. Since we have five different groups, we cannot use the t-test; we use the
one-way ANOVA test instead.
H0 - There are no significant differences between the groups' mean SAT scores.
µ1 = µ2 = µ3 = µ4 = µ5
H1 - There is a significant difference between the groups' mean SAT scores.
If there is at least one group with a significant difference from another group, the null
hypothesis will be rejected.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
data = pd.read_csv("scores.csv")
data.head()
data['Borough'].value_counts()
# Group the total SAT score of each school by its borough (district)
districts = data['Borough'].unique()
district_dict = {d: data[data['Borough'] == d]['total_score'] for d in districts}
x = list(districts)
y = []
yerror = []
#Assigns the mean score and 95% confidence limit to each district
for district in x:
    y.append(district_dict[district].mean())
    yerror.append(1.96*district_dict[district].std()/np.sqrt(district_dict[district].shape[0]))
    print(district + '_std : {}'.format(district_dict[district].std()))
sns.set(font_scale=1.8)
fig = plt.figure(figsize=(10,5))
ax = sns.barplot(x, y, yerr=yerror)
ax.set_ylabel('Average Total SAT Score')
plt.show()
ss_b = 0
for d in districts:
ss_b += district_dict[d].shape[0] * \
np.sum((district_dict[d].mean() - data['total_score'].mean())**2)
ss_w = 0
for d in districts:
ss_w += np.sum((district_dict[d] - district_dict[d].mean())**2)
msb = ss_b/4
msw = ss_w/(len(data)-5)
f=msb/msw
print('F_statistic: {}'.format(f))
ss_t = np.sum((data['total_score']-data['total_score'].mean())**2)
eta_squared = ss_b/ss_t
print('eta_squared: {}'.format(eta_squared))
Output:
Since the resulting p-value is less than 0.05, the null hypothesis is rejected and we conclude
that there is a significant difference between the SAT scores for each district.
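The same conclusion can be cross-checked in one line with scipy's built-in one-way ANOVA; the sketch below reuses district_dict and districts from the listing above.
# Sketch: one-way ANOVA with scipy, cross-checking the manual F computation.
from scipy import stats
f_stat, p_val = stats.f_oneway(*[district_dict[d] for d in districts])
print('F =', f_stat, ' p =', p_val)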
SUMMARY
Groups                        Count   Sum      Average    Variance
Average Score (SAT Math)      375     162354   432.944    5177.144
Average Score (SAT Reading)   375     159189   424.504    3829.267
Average Score (SAT Writing)   375     156922   418.4587   4166.522

ANOVA
Source of Variation   SS         df     MS         F          P-value   F crit
Between Groups        39700.57   2      19850.28   4.520698   0.01108   3.003745
Within Groups         4926677    1122   4390.977
Since the resulting p-value is less than 0.05, the null hypothesis (H0) is rejected and we
conclude that there is a significant difference between the SAT scores for each subject.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.graphics.factorplots import interaction_plot
def eta_squared(aov):
    aov['eta_sq'] = 'NaN'
    aov['eta_sq'] = aov[:-1]['sum_sq']/sum(aov['sum_sq'])
    return aov
def omega_squared(aov):
    mse = aov['sum_sq'][-1]/aov['df'][-1]
    aov['omega_sq'] = 'NaN'
    aov['omega_sq'] = (aov[:-1]['sum_sq']-(aov[:-1]['df']*mse))/(sum(aov['sum_sq'])+mse)
    return aov
datafile = "ToothGrowth.csv"
data = pd.read_csv(datafile)
fig = interaction_plot(data.dose, data.supp, data.len,
                       colors=['red','blue'], markers=['D','^'], ms=10)
N = len(data.len)
df_a = len(data.supp.unique()) - 1
df_b = len(data.dose.unique()) - 1
df_axb = df_a*df_b
df_w = N - (len(data.supp.unique())*len(data.dose.unique()))
grand_mean = data['len'].mean()
#Sum of Squares A – supp
ssq_a = sum([(data[data.supp ==l].len.mean()-grand_mean)**2 for l in data.supp])
#Sum of Squares B – dose
ssq_b = sum([(data[data.dose ==l].len.mean()-grand_mean)**2 for l in data.dose])
#Sum of Squares Total
ssq_t = sum((data.len - grand_mean)**2)
vc = data[data.supp == 'VC']
oj = data[data.supp == 'OJ']
vc_dose_means = [vc[vc.dose == d].len.mean() for d in vc.dose]
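The manual sum-of-squares computation is left unfinished in the listing above; as a hedged alternative, statsmodels can produce the full two-way ANOVA table for the same ToothGrowth data and feed it to the effect-size helper defined earlier.
# Sketch: two-way ANOVA with statsmodels on the same ToothGrowth data.
import statsmodels.api as sm
from statsmodels.formula.api import ols
model = ols('len ~ C(supp) + C(dose) + C(supp):C(dose)', data=data).fit()
aov_table = sm.stats.anova_lm(model, typ=2)
print(aov_table)
print(eta_squared(aov_table))   # reuse the eta-squared helper defined above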
Output:
Using Excel:
Go to the Data tab and select Data Analysis.
Output:
Anova: Two-Factor With Replication
Count 30 30 60
Sum 619.9 35 654.9
Average 20.66333 1.166667 10.915
Variance 43.63344 0.402299 118.2854
Total
Count 60 60
Sum 1128.8 70
Average 18.81333 1.166667
Variance 58.51202 0.39548
ANOVA
Source of Variation   SS         df    MS         F          P-value    F crit
Sample                102.675    1     102.675    3.642079   0.058808   3.922879
Columns               9342.145   1     9342.145   331.3838   8.55E-36   3.922879
Interaction           102.675    1     102.675    3.642079   0.058808   3.922879
Within                3270.193   116   28.19132
Total                 12817.69   119
The p-values appear in the ANOVA Source of Variation table at the bottom of the output.
Because the p-value for the Columns factor (the medicine dose) is far below our
significance level, that factor is statistically significant. On the other hand, the Sample
factor and the interaction effect are not significant, because their p-values (0.0588) are
greater than our significance level. Since the interaction effect is not significant, we can
focus on the main effects alone and need not consider the interaction of the dose.
Or
http://www.real-statistics.com/wp-content/uploads/2019/11/XRealStats.xlam
Install the add-in in Excel: select File > Help|Options > Add-Ins and click on the Go button at the
bottom of the window (see Figure 1).
A study was conducted to see the impact of socio-economic class (rich, middle, poor) and gender
(male, female) on kindness and optimism, using a sample of 24 people, based on the data in Figure
1.
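For the MANOVA itself, a hedged Python sketch with statsmodels is given below; the column names (social_class, gender, kindness, optimism) and the CSV file name are assumptions standing in for the Figure 1 layout.
# Sketch: MANOVA with statsmodels (hypothetical column and file names for the Figure 1 data).
import pandas as pd
from statsmodels.multivariate.manova import MANOVA
df = pd.read_csv('kindness_optimism.csv')      # hypothetical export of the Figure 1 data
maov = MANOVA.from_formula('kindness + optimism ~ social_class + gender', data=df)
print(maov.mv_test())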
Output:
You need to run the sampling data analysis tool twice, once to create Group 1 and again
to create Group 2. For Group 1 you select all 20 population cells as the Input Range and
Random as the Sampling Method with 6 for the Random Number of Samples. For Group
2 you select the 10 cells in the Women column as Input Range and Periodic with Period
3.
Output:
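The same two samplings can also be done in Python; the sketch below assumes the population sits in a DataFrame with a Gender column (the file and column names are illustrative) and uses pandas sample and groupby for random and stratified sampling respectively.
# Sketch: random and stratified sampling with pandas (illustrative names).
import pandas as pd
population = pd.read_csv('population.csv')             # hypothetical data file
# Random sampling: draw 6 rows without replacement
random_sample = population.sample(n=6, random_state=1)
print(random_sample)
# Stratified sampling: draw 30% from every Gender stratum
stratified_sample = population.groupby('Gender', group_keys=False).apply(
    lambda g: g.sample(frac=0.3, random_state=1))
print(stratified_sample)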
Program Code:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12
import seaborn as sns
color = sns.color_palette()
sns.set_style('darkgrid')
import sklearn
from sklearn.model_selection import train_test_split
housing = pd.read_csv('housing.csv')
print(housing.head())
print(housing.info())
corr = housing.corr()
print(corr['median_house_value'].sort_values(ascending=False))
sns.distplot(housing.median_income)
plt.show()
Output:
There is a lot of information we can mine from the heatmap above: a couple of strongly
positively correlated features and a couple of negatively correlated features. Take a look
at the small bright box right in the middle of the heatmap, from total_rooms on the
y-axis down to households, and note how bright the box is; these are highly positively
correlated attributes. Also note that median_income is the feature most correlated with the
target, which is median_house_value.
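The heatmap described above is not part of the listing; a short sketch of how it could be produced from the same corr matrix is:
# Sketch: correlation heatmap for the housing features, using the corr matrix computed above.
plt.figure(figsize=(10, 8))
sns.heatmap(corr, cmap='viridis')
plt.show()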
From the image above, we can see that most median incomes are clustered between
$20,000 and $50,000 with some outliers going far beyond $60,000 making the
distribution skew to the right.
Positive Correlation:
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(1)
# 1000 random integers between 0 and 50
x = np.random.randint(0, 50, 1000)
# y rises with x, plus some noise (relationship chosen for illustration)
y = x + np.random.normal(0, 10, 1000)
np.corrcoef(x, y)
plt.scatter(x, y)
plt.show()
Output:
Negative Correlation:
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(1)
# 1000 random integers between 0 and 50
x = np.random.randint(0, 50, 1000)
# y falls as x rises, plus some noise (relationship chosen for illustration)
y = 100 - x + np.random.normal(0, 5, 1000)
np.corrcoef(x, y)
plt.scatter(x, y)
plt.show()
Output:
No/Weak Correlation:
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(1)
x = np.random.randint(0, 50, 1000)
y = np.random.randint(0, 50, 1000)
np.corrcoef(x, y)
plt.scatter(x, y)
plt.show()
Output:
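Since the practical asks for different types of correlation, a small hedged sketch computing Pearson, Spearman and Kendall coefficients on the same x and y arrays with pandas follows.
# Sketch: Pearson, Spearman and Kendall correlation for the x, y arrays above.
import pandas as pd
xs, ys = pd.Series(x), pd.Series(y)
print('Pearson :', xs.corr(ys, method='pearson'))
print('Spearman:', xs.corr(ys, method='spearman'))
print('Kendall :', xs.corr(ys, method='kendall'))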
import quandl
import math
import datetime
import numpy as np
from sklearn import preprocessing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from matplotlib import style
style.use('ggplot')
df = quandl.get("WIKI/GOOGL")
df = df[['Adj. Open', 'Adj. High', 'Adj. Low', 'Adj. Close', 'Adj. Volume']]
df['HL_PCT'] = (df['Adj. High'] - df['Adj. Low']) / df['Adj. Close'] * 100.0
df['PCT_change'] = (df['Adj. Close'] - df['Adj. Open']) / df['Adj. Open'] * 100.0
# Label: the adjusted close shifted forecast_out days into the future
forecast_out = int(math.ceil(0.01 * len(df)))
df['label'] = df['Adj. Close'].shift(-forecast_out)
X = np.array(df.drop(columns=['label']))
X = preprocessing.scale(X)
X_lately = X[-forecast_out:]
X = X[:-forecast_out]
df.dropna(inplace=True)
y = np.array(df['label'])
# Fit a linear regression model and evaluate it on held-out data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf = LinearRegression()
clf.fit(X_train, y_train)
print('R^2 on test data:', clf.score(X_test, y_test))
forecast_set = clf.predict(X_lately)
df['Forecast'] = np.nan
last_date = df.iloc[-1].name
last_unix = last_date.timestamp()
one_day = 86400
next_unix = last_unix + one_day
for i in forecast_set:
next_date = datetime.datetime.fromtimestamp(next_unix)
next_unix += 86400
df.loc[next_date] = [np.nan for _ in range(len(df.columns)-1)]+[i]
df['Adj. Close'].plot()
df['Forecast'].plot()
plt.legend(loc=4)
plt.xlabel('Date')
plt.ylabel('Price')
plt.show()
Output:
import numpy as np
import matplotlib.pyplot as plt
def estimate_coef(x, y):
    # number of observations/points
    n = np.size(x)
    # means of the x and y vectors
    m_x, m_y = np.mean(x), np.mean(y)
    # cross-deviation and deviation about x
    SS_xy = np.sum(y*x) - n*m_y*m_x
    SS_xx = np.sum(x*x) - n*m_x*m_x
    # calculating the regression coefficients
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1*m_x
    return (b_0, b_1)
def plot_regression_line(x, y, b):
    # plotting the actual points as scatter plot
    plt.scatter(x, y, color = "m",
                marker = "o", s = 30)
    # predicted response vector and fitted regression line
    y_pred = b[0] + b[1]*x
    plt.plot(x, y_pred, color = "g")
    # putting labels
    plt.xlabel('x')
    plt.ylabel('y')
    plt.show()
def main():
    # observations
    x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])
    # estimating coefficients
    b = estimate_coef(x, y)
    print("Estimated coefficients:\nb_0 = {} b_1 = {}".format(b[0], b[1]))
    # plotting the regression line
    plot_regression_line(x, y, b)
if __name__ == "__main__":
    main()
Output:
import numpy as np
import matplotlib as mpl
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
def generate_dataset(n):
    # build n observations with two predictors x1, x2 and a linear response y
    x = []
    y = []
    random_x1 = np.random.rand()
    random_x2 = np.random.rand()
    for i in range(n):
        x1 = i
        x2 = i/2 + np.random.rand()*n
        x.append([1, x1, x2])
        y.append(random_x1 * x1 + random_x2 * x2 + 1)
    return np.array(x), np.array(y)
x, y = generate_dataset(200)
x, y = generate_dataset(200)
mpl.rcParams['legend.fontsize'] = 12
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.scatter(x[:, 1], x[:, 2], y, label ='y', s = 5)
ax.legend()
ax.view_init(45, 0)
plt.show()
def mse(coef, x, y):
    # mean squared error of the linear model: predictions x.dot(coef) against y
    return np.mean((np.dot(x, coef) - y)**2)
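The listing stops at the error function; as a hedged completion, the regression coefficients for this design matrix can be obtained directly with NumPy's least-squares solver and checked with mse.
# Sketch: fit the multiple linear regression coefficients by ordinary least squares.
coef, *_ = np.linalg.lstsq(x, y, rcond=None)
print('Estimated coefficients:', coef)      # intercept, b1, b2
print('Training MSE:', mse(coef, x, y))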
Program Code:
import os
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import scipy.stats as stats
from sklearn import linear_model
from sklearn import preprocessing
from sklearn import metrics
matplotlib.style.use('ggplot')
# Logistic (sigmoid) function used to draw the S-shaped curve
def sigmoid(t):
    return 1/(1 + np.exp(-t))
plot_range = np.arange(-6, 6, 0.1)
plt.figure(figsize=(9,9))
y_values = sigmoid(plot_range)
# Plot curve
plt.plot(plot_range,   # X-axis range
         y_values,     # Predicted values
         color="red")
plt.show()
titanic_train = pd.read_csv("titanic_train.csv")           # Read the data
char_cabin = titanic_train["Cabin"].astype(str)            # Convert cabin to str
new_Cabin = np.array([cabin[0] for cabin in char_cabin])   # Take first letter
titanic_train["Cabin"] = pd.Categorical(new_Cabin)         # Save the cleaned cabin variable
new_age_var = titanic_train["Age"].fillna(titanic_train["Age"].median())  # Fill missing ages with the median
titanic_train["Age"] = new_age_var
label_encoder = preprocessing.LabelEncoder()
encoded_sex = label_encoder.fit_transform(titanic_train["Sex"])
# Fit a logistic regression model on sex alone
log_model = linear_model.LogisticRegression()
log_model.fit(X=pd.DataFrame(encoded_sex), y=titanic_train["Survived"])
# Make predictions
preds = log_model.predict_proba(X= pd.DataFrame(encoded_sex))
preds = pd.DataFrame(preds)
preds.columns = ["Death_prob", "Survival_prob"]
print(pd.crosstab(titanic_train["Sex"], preds.loc[:, "Survival_prob"]))
# Refit the model on class, cabin, sex and age together
encoded_class = label_encoder.fit_transform(titanic_train["Pclass"])
encoded_cabin = label_encoder.fit_transform(titanic_train["Cabin"])
train_features = pd.DataFrame([encoded_class,
                               encoded_cabin,
                               encoded_sex,
                               titanic_train["Age"]]).T
log_model.fit(X=train_features, y=titanic_train["Survived"])
# Make predictions
preds = log_model.predict(X= train_features)
print(log_model.score(X = train_features ,
                      y = titanic_train["Survived"]))
# Apply the same preprocessing to the test data before predicting on it
titanic_test = pd.read_csv("titanic_test.csv")
char_cabin = titanic_test["Cabin"].astype(str)
titanic_test["Cabin"] = pd.Categorical(np.array([cabin[0] for cabin in char_cabin]))
titanic_test["Age"] = titanic_test["Age"].fillna(titanic_train["Age"].median())
encoded_class_test = label_encoder.fit_transform(titanic_test["Pclass"])
encoded_cabin_test = label_encoder.fit_transform(titanic_test["Cabin"])
encoded_sex_test = label_encoder.fit_transform(titanic_test["Sex"])
test_features = pd.DataFrame([encoded_class_test,
                              encoded_cabin_test, encoded_sex_test, titanic_test["Age"]]).T
print(log_model.predict(X=test_features))
Output:
The table shows that the model predicted a survival chance of roughly 19% for males
and 73% for females.
This logistic regression model has an accuracy score of 0.75598, which is actually
worse than the accuracy of the simplistic "women survive, men die" model (0.76555).
Example 2:
The dataset is related to direct marketing campaigns (phone calls) of a Portuguese
banking institution. The classification goal is to predict whether the client will
subscribe (1/0) to a term deposit (variable y). The dataset provides the bank customers’
information. It includes 41,188 records and 21 fields.
Input variables
1. age (numeric)
2. job : type of job (categorical: “admin”, “blue-collar”, “entrepreneur”,
“housemaid”, “management”, “retired”, “self-employed”, “services”, “student”,
“technician”, “unemployed”, “unknown”)
3. marital : marital status (categorical: “divorced”, “married”, “single”,
“unknown”)
4. education (categorical: “basic.4y”, “basic.6y”, “basic.9y”, “high.school”,
“illiterate”, “professional.course”, “university.degree”, “unknown”)
5. default: has credit in default? (categorical: “no”, “yes”, “unknown”)
6. housing: has housing loan? (categorical: “no”, “yes”, “unknown”)
7. loan: has personal loan? (categorical: “no”, “yes”, “unknown”)
8. contact: contact communication type (categorical: “cellular”, “telephone”)
9. month: last contact month of year (categorical: “jan”, “feb”, “mar”, …, “nov”,
“dec”)
10. day_of_week: last contact day of the week (categorical: “mon”, “tue”, “wed”,
“thu”, “fri”)
11. duration: last contact duration, in seconds (numeric). Important note: this
attribute highly affects the output target (e.g., if duration=0 then y='no'). The
duration is not known before a call is performed; also, after the end of the call, y
is obviously known. Thus, this input should only be included for benchmarking
purposes and should be discarded if the intention is to have a realistic predictive
model.
Program Code:
# -*- coding: utf-8 -*-
"""
Created on Mon Dec 16 22:24:44 2019
@author: MyHome
"""
import pandas as pd
import numpy as np
from sklearn import preprocessing
import matplotlib.pyplot as plt
plt.rc("font", size=14)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import seaborn as sns
sns.set(style="white")
sns.set(style="whitegrid", color_codes=True)
data = pd.read_csv('bank.csv', header=0)
data = data.dropna()
print(data.shape)
print(list(data.columns))
data['education'].unique()
data['education']=np.where(data['education'] =='basic.9y', 'Basic', data['education'])
data['education']=np.where(data['education'] =='basic.6y', 'Basic', data['education'])
count_no_sub = len(data[data['y']==0])
count_sub = len(data[data['y']==1])
pct_of_no_sub = count_no_sub/(count_no_sub+count_sub)
print("percentage of no subscription is", pct_of_no_sub*100)
pct_of_sub = count_sub/(count_no_sub+count_sub)
print("percentage of subscription", pct_of_sub*100)
data.groupby('y').mean()
data.groupby('job').mean()
data.groupby('marital').mean()
data.groupby('education').mean()
pd.crosstab(data.day_of_week,data.y).plot(kind='bar')
plt.title('Purchase Frequency for Day of Week')
plt.xlabel('Day of Week')
plt.ylabel('Frequency of Purchase')
plt.savefig('pur_dayofweek_bar')
Output:
percentage of no subscription is 88.73458288821988
percentage of subscription 11.265417111780131
Our classes are imbalanced, and the ratio of no-subscription to subscription instances is 89:11.
• The average age of customers who bought the term deposit is higher than that of the
customers who didn't.
• The pdays (days since the customer was last contacted) is understandably lower for the
customers who bought it. The lower the pdays, the better the memory of the last call
and hence the better chances of a sale.
• Surprisingly, campaigns (the number of contacts or calls made during the current
campaign) are lower for customers who bought the term deposit.
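The listing for this example stops at data exploration; a hedged sketch of the actual logistic regression step (assuming the categorical columns are one-hot encoded into X and the target y is coded either as 0/1 or as no/yes) would look like this, reusing the imports from the listing above.
# Sketch: fit and evaluate a logistic regression model on the bank data.
X = pd.get_dummies(data.drop('y', axis=1), drop_first=True)   # one-hot encode the predictors
y = (data['y'] == 1) | (data['y'] == 'yes')                   # target as boolean, whichever coding the file uses
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)
print('Accuracy on test set: {:.2f}'.format(logreg.score(X_test, y_test)))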
Dear Teacher,
Please send your valuable feedback and contribution to make this manual more
effective.