EX. NO: 3 Performing Statistical Analysis On A Dataset DATE: 21/08/2024
AIM:
To perform statistical analysis on a dataset, including multiple linear regression and statistical tests (t-test, Z-test and ANOVA).
CODE:
import pandas as pd
df = pd.read_csv('/content/Student_Performance.csv')
df
OUTPUT:
CODE:
df.info()
OUTPUT:
CODE:
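from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
# Encode the categorical column as integers
label_encoder = LabelEncoder()
df['Extracurricular Activities'] = label_encoder.fit_transform(df['Extracurricular Activities'])
df['Extracurricular Activities'].unique()
# Features are all columns except the last; the target is the last column
x = df.iloc[:, 0:-1]
y = df.iloc[:, -1]
# The split is not shown in the original record; test_size and random_state are assumed values
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
r2_score(y_test, y_pred)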
OUTPUT:
0.9880686410711422
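The score reported by r2_score is the coefficient of determination, 1 - SS_res / SS_tot. A minimal sketch of that arithmetic, using only the standard library and made-up numbers (the function name r2 and the values below are illustrative and are not taken from the dataset):

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: squared errors of the predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: squared deviations from the mean of y_true
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up values, for illustration only
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.5]
score = r2(y_true, y_pred)
print(score)
```

A score close to 1 means the predictions explain most of the variance in the target.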
CODE:
city_hall_dataset = pd.read_csv('/content/train.csv')
city_hall_dataset
OUTPUT:
CODE:
import numpy as np
# Replace SalePrice with log(1 + SalePrice)
city_hall_dataset['SalePrice'] = np.log1p(city_hall_dataset['SalePrice'])
logged_budget = np.log1p(120000)  # logged $120 000 is about 11.695
logged_budget
OUTPUT:
11.695255355062795
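log1p(x) computes ln(1 + x) and expm1 is its inverse, so a logged value can be converted back to dollars. A small sketch using the standard-library counterparts of the NumPy functions (the variable names are illustrative):

```python
import math

budget = 120000
logged = math.log1p(budget)      # ln(1 + 120000)
recovered = math.expm1(logged)   # exp(logged) - 1, back on the dollar scale
print(logged, recovered)
```

Because SalePrice is logged in the same way, the budget and the prices are compared on the same scale.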
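Question to answer: how does a budget of $120 000 compare with the average SalePrice of a house in Ames?
Is $120 000 (about 11.7 after the log transform) different from the mean SalePrice of the population?
We take a sample of 25 observations and perform a one-sample t-test.
Null Hypothesis : mean SalePrice = budget
Alternative Hypothesis : mean SalePrice ≠ budget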
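The statistic behind a one-sample t-test is t = (sample mean - hypothesised mean) / (s / sqrt(n)). A minimal sketch of that calculation with the standard library, assuming a made-up sample of five logged prices (the helper one_sample_t, the sample values and the hard-coded critical value are illustrative; the record itself uses stats.ttest_1samp on 25 observations):

```python
import math
import statistics

def one_sample_t(sample, popmean):
    """Return (t statistic, degrees of freedom) for a one-sample t-test."""
    n = len(sample)
    mean = statistics.mean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(n)  # stdev divides by n - 1
    return (mean - popmean) / std_err, n - 1

# Made-up logged prices, for illustration only (not from train.csv)
toy_sample = [11.9, 12.1, 12.0, 12.2, 11.8]
t_stat, dof = one_sample_t(toy_sample, math.log1p(120000))

# Two-sided critical value of Student's t for 4 degrees of freedom at the 5% level
T_CRIT_DF4 = 2.776
reject_null = abs(t_stat) > T_CRIT_DF4
print(t_stat, dof, reject_null)
```

The null hypothesis is rejected when the absolute value of t exceeds the critical value, which is equivalent to a p-value below 0.05.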
CODE:
from scipy import stats
# Draw a random sample of 25 houses
sample = city_hall_dataset.sample(n=25)
p = {}
p['value1'], p['value2'] = sample['SalePrice'].mean(), logged_budget
p['score'], p['p_value'] = stats.ttest_1samp(sample['SalePrice'], popmean=logged_budget)
# results() is a helper that is not shown in this record
results(p)
OUTPUT:
INFERENCE:
The budget is different from the average price of homes in Ames
CODE:
OUTPUT:
INFERENCE:
The alternative hypothesis is accepted. Hence, we can say with 95% confidence that our
budget is not enough.
Houses may be small or large, so we can divide the population into two groups.
Null Hypothesis : SalePrice of smaller houses = SalePrice of larger houses
Alternative Hypothesis : SalePrice of smaller houses ≠ SalePrice of larger houses
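The two-group comparison above can be sketched as a pooled-variance two-sample t-test. This is an illustration with the standard library and made-up values, assuming equal variances in both groups (the helper pooled_t, the toy data and the hard-coded critical value are not part of the original record):

```python
import math
import statistics

def pooled_t(a, b):
    """Return (t statistic, degrees of freedom) for a pooled two-sample t-test."""
    n1, n2 = len(a), len(b)
    # Pooled variance: weighted average of the two sample variances
    pooled_var = ((n1 - 1) * statistics.variance(a)
                  + (n2 - 1) * statistics.variance(b)) / (n1 + n2 - 2)
    std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (statistics.mean(a) - statistics.mean(b)) / std_err
    return t, n1 + n2 - 2

# Made-up logged prices, for illustration only
toy_small = [11.6, 11.8, 11.7, 11.9, 11.5]
toy_large = [12.2, 12.4, 12.3, 12.5, 12.1]
t_stat, dof = pooled_t(toy_small, toy_large)

# Two-sided critical value of Student's t for 8 degrees of freedom at the 5% level
T_CRIT_DF8 = 2.306
reject_null = abs(t_stat) > T_CRIT_DF8
print(t_stat, dof, reject_null)
```

A negative t statistic here indicates that the first group has the lower mean.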
CODE:
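# Assumed import: the three return values (statistic, p-value, df) match statsmodels' ttest_ind
from statsmodels.stats.weightstats import ttest_ind
# Sort by living area; the first 730 rows form the smaller group, the rest the larger group
by_area = city_hall_dataset.sort_values('GrLivArea')
smaller_houses = by_area[:730].sample(n=25)
larger_houses = by_area[730:].sample(n=25)
p['value1'], p['value2'] = smaller_houses['SalePrice'].mean(), larger_houses['SalePrice'].mean()
p['score'], p['p_value'], p['df'] = ttest_ind(smaller_houses['SalePrice'], larger_houses['SalePrice'])
results(p)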
OUTPUT:
INFERENCE:
There is a difference in sale price between small houses and large houses.
CODE:
OUTPUT:
INFERENCE:
Larger houses are more expensive.
We now use a larger sample size to draw conclusions. With 100 observations per group, the sample
means are approximately normally distributed, hence a Z-test is used.
Null Hypothesis : SalePrice of smaller houses >= SalePrice of larger houses
Alternative Hypothesis : SalePrice of smaller houses < SalePrice of larger houses
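The one-sided test stated above can be sketched by hand: compute z from the difference of the sample means and take the lower-tail probability of the standard normal distribution as the p-value. This is an illustration with the standard library and simulated values, assuming the pooled-variance form of the statistic (the helper z_test_smaller and the simulated data are not part of the original record):

```python
import math
import random
import statistics

def z_test_smaller(a, b):
    """One-sided two-sample z-test; alternative: mean(a) < mean(b)."""
    n1, n2 = len(a), len(b)
    pooled_var = ((n1 - 1) * statistics.variance(a)
                  + (n2 - 1) * statistics.variance(b)) / (n1 + n2 - 2)
    std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    z = (statistics.mean(a) - statistics.mean(b)) / std_err
    p_value = statistics.NormalDist().cdf(z)  # lower-tail probability
    return z, p_value

# Simulated logged prices, for illustration only (100 per group, fixed seed)
rng = random.Random(1)
toy_small = [rng.gauss(11.7, 0.3) for _ in range(100)]
toy_large = [rng.gauss(12.3, 0.3) for _ in range(100)]

z_stat, p_value = z_test_smaller(toy_small, toy_large)
reject_null = p_value < 0.05
print(z_stat, p_value, reject_null)
```

A strongly negative z with a small p-value leads to rejecting the null hypothesis in favour of smaller houses being cheaper.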
CODE:
# Assumed import: ztest with alternative='smaller' matches statsmodels' API
from statsmodels.stats.weightstats import ztest
# Sort by living area, then draw 100 houses from each half (fixed seed for repeatability)
by_area = city_hall_dataset.sort_values('GrLivArea')
smaller_houses = by_area[:730].sample(n=100, random_state=1)
larger_houses = by_area[730:].sample(n=100, random_state=1)
p['value1'], p['value2'] = smaller_houses['SalePrice'].mean(), larger_houses['SalePrice'].mean()
p['score'], p['p_value'] = ztest(smaller_houses['SalePrice'], larger_houses['SalePrice'], alternative='smaller')
results(p)
OUTPUT:
CODE:
OUTPUT:
INFERENCE:
This means $120 000 cannot buy a small house (on average).
ANOVA Test
CODE:
# Map zoning codes to their full names
replacement = {'FV': "Floating Village Residential", 'C (all)': "Commercial",
               'RH': "Residential High Density", 'RL': "Residential Low Density",
               'RM': "Residential Medium Density"}
smaller_houses['MSZoning_FullName'] = smaller_houses['MSZoning'].replace(replacement)
mean_price_by_zone = smaller_houses.groupby('MSZoning_FullName')['SalePrice'].mean().to_frame()
CODE:
sh = smaller_houses.copy()
p['score'], p['p_value'] = stats.f_oneway(sh.loc[sh.MSZoning=='FV', 'SalePrice'],
sh.loc[sh.MSZoning=='C (all)', 'SalePrice'],
sh.loc[sh.MSZoning=='RH', 'SalePrice'],
sh.loc[sh.MSZoning=='RL', 'SalePrice'],
sh.loc[sh.MSZoning=='RM', 'SalePrice'],)
results(p)[['score', 'p_value', 'hypothesis_accepted']]
OUTPUT:
INFERENCE:
SalePrice varies based on Zone
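One-way ANOVA compares the variation between group means with the variation within the groups: F = (SS_between / (k - 1)) / (SS_within / (N - k)). A minimal sketch of that calculation with the standard library and made-up numbers (the helper f_statistic, the toy groups and the hard-coded critical value are illustrative; the record itself uses stats.f_oneway on the zoning groups):

```python
def f_statistic(groups):
    """One-way ANOVA F statistic for a list of groups of numbers."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Between-group variation: group means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-group variation: observations around their own group mean
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

# Made-up values for three groups, for illustration only
toy_groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat = f_statistic(toy_groups)

# Critical value of F with (2, 6) degrees of freedom at the 5% level
F_CRIT_2_6 = 5.14
reject_null = f_stat > F_CRIT_2_6
print(f_stat, reject_null)
```

An F statistic above the critical value indicates that at least one group mean differs from the others.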
RESULT:
The t-test, ANOVA test and other statistical tests were performed successfully.