EX. NO: 3 Performing Statistical Analysis On A Dataset DATE: 21/08/2024

AIM:

To perform statistical analysis like multiple regression and various statistical tests.

CODE:

import numpy as np
import pandas as pd
from scipy import stats  # ttest_1samp, f_oneway
import statsmodels.api as sm
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from statsmodels.stats.weightstats import ttest_ind, ztest

df = pd.read_csv('/content/Student_Performance.csv')
df

OUTPUT:

CODE:

df.info()

OUTPUT:

CODE:

label_encoder = LabelEncoder()
df['Extracurricular Activities'] = label_encoder.fit_transform(df['Extracurricular Activities'])
df['Extracurricular Activities'].unique()

x = df.iloc[:, 0:-1]
y = df.iloc[:, -1]

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

model = LinearRegression()
model.fit(x_train, y_train)

y_pred = model.predict(x_test)

r2_score(y_test, y_pred)

OUTPUT:

0.9880686410711422
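
R² alone says little about the size of the prediction errors. As an illustration only, the sketch below computes MAE, MSE and R² by hand using the standard library. The lists y_true and y_hat are made-up numbers, not values from Student_Performance.csv, and regression_metrics is a hypothetical helper; on the real data the imported mean_absolute_error, mean_squared_error and r2_score would be called on y_test and y_pred instead.

```python
# Hypothetical values for illustration only (not from the student dataset)
y_true = [3.0, 5.0, 7.0, 9.0]
y_hat = [2.5, 5.0, 7.5, 9.0]

def regression_metrics(actual, predicted):
    """Return (MAE, MSE, R^2) for two equal-length lists of numbers."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n          # mean absolute error
    mse = sum(e * e for e in errors) / n           # mean squared error
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return mae, mse, r2

mae, mse, r2 = regression_metrics(y_true, y_hat)
print(mae, mse, r2)
```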

CODE:

city_hall_dataset = pd.read_csv('/content/train.csv')
city_hall_dataset

OUTPUT:

CODE:

def results(p):
    # Report which hypothesis is accepted at the 5% significance level
    if p['p_value'] < 0.05:
        p['hypothesis_accepted'] = 'alternative'
    else:
        p['hypothesis_accepted'] = 'null'
    df = pd.DataFrame(p, index=[''])
    cols = ['value1', 'value2', 'score', 'p_value', 'hypothesis_accepted']
    return df[cols]

city_hall_dataset['SalePrice'] = np.log1p(city_hall_dataset['SalePrice'])
logged_budget = np.log1p(120000) #logged $120 000 is 11.695
logged_budget

OUTPUT:

11.695255355062795
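
Because SalePrice is log-transformed, every comparison that follows is made on the log scale. A minimal sketch, using only the standard library and the same $120 000 budget, of what log1p does and how expm1 undoes it (the variable names are chosen for illustration):

```python
import math

budget = 120000
logged = math.log1p(budget)      # natural log of (1 + budget)
restored = math.expm1(logged)    # inverse transform: exp(logged) - 1

print(round(logged, 3))          # logged budget, rounded to 3 decimals
print(round(restored))           # back on the dollar scale
```

Any mean reported on the log scale can be brought back to dollars in the same way, by applying expm1 to it.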

One Sample T Test - 2 Tails

Question to answer: how does a budget of $120 000 compare with the average Ames house
SalePrice?
Is 120 000 (11.7 logged) any different from the mean SalePrice of the population?
We take a sample of 25 observations and perform a one-sample t-test.
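
For reference, the one-sample t statistic is the distance between the sample mean and the hypothesised mean, measured in standard errors. The sketch below computes it by hand with the standard library; one_sample_t is a hypothetical helper and the prices list holds made-up logged prices, not rows from train.csv. It returns only the statistic and the degrees of freedom; the p-value still comes from a t distribution, for example via stats.ttest_1samp.

```python
import math
import statistics

def one_sample_t(values, popmean):
    """Return (t statistic, degrees of freedom) for a one-sample t-test."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)              # sample standard deviation (n - 1)
    t = (mean - popmean) / (sd / math.sqrt(n)) # difference in standard-error units
    return t, n - 1

# Hypothetical logged prices for illustration only
prices = [11.5, 11.9, 12.1, 12.3, 12.2]
t_stat, dof = one_sample_t(prices, popmean=11.695)
print(t_stat, dof)
```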

CODE:

sample = city_hall_dataset.sample(n=25)
p = {}
p['value1'], p['value2'] = sample['SalePrice'].mean(), logged_budget
p['score'], p['p_value'] = stats.ttest_1samp(sample['SalePrice'],
                                             popmean=logged_budget)
results(p)
OUTPUT:

INFERENCE:
The budget is different from the average price of homes in Ames

One sample T-test | One-tailed

Question: is the budget of $120 000 less than the mean SalePrice?

CODE:

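p['value1'], p['value2'] = sample['SalePrice'].mean(), logged_budget
p['score'], p['p_value'] = stats.ttest_1samp(sample['SalePrice'], popmean=logged_budget)
# Halve the two-tailed p-value for the one-tailed test; this is valid only when
# the t statistic is positive, i.e. the sample mean lies above the budget
p['p_value'] = p['p_value']/2
results(p)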

OUTPUT:

INFERENCE:
The alternative hypothesis is accepted at the 5% significance level. Hence, the
budget is below the mean SalePrice, i.e. our budget is not enough.
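
Halving a two-tailed p-value gives the one-tailed p-value only when the test statistic points in the direction of the alternative hypothesis; otherwise the one-tailed p-value is 1 minus that half. A small sketch of the conversion for a symmetric test such as the t-test follows. The function one_sided_p is a hypothetical helper, and the numbers passed to it are made up.

```python
def one_sided_p(two_sided_p, statistic, alternative):
    """Convert a two-sided p-value from a symmetric test to a one-sided one.

    alternative: 'larger'  -> H1: mean is greater than the reference value
                 'smaller' -> H1: mean is less than the reference value
    """
    if alternative not in ('larger', 'smaller'):
        raise ValueError("alternative must be 'larger' or 'smaller'")
    # Does the statistic point the way the alternative hypothesis claims?
    in_direction = statistic > 0 if alternative == 'larger' else statistic < 0
    half = two_sided_p / 2
    return half if in_direction else 1 - half

# Made-up inputs: same two-sided p-value, opposite signs of the statistic
p_agree = one_sided_p(0.04, 2.2, 'larger')
p_disagree = one_sided_p(0.04, -2.2, 'larger')
print(p_agree, p_disagree)
```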

Two sample T-test | Two-tailed | Means

Houses may be small or large, hence we can divide the population into 2 groups.
Null Hypothesis : SalePrice of smaller houses = SalePrice of larger houses
Alternative Hypothesis : SalePrice of smaller houses ≠ SalePrice of larger houses

CODE:

smaller_houses = city_hall_dataset.sort_values('GrLivArea')[:730].sample(n=25)
larger_houses = city_hall_dataset.sort_values('GrLivArea')[730:].sample(n=25)
p['value1'], p['value2'] = smaller_houses['SalePrice'].mean(), larger_houses['SalePrice'].mean()
p['score'], p['p_value'], p['df'] = ttest_ind(smaller_houses['SalePrice'],
larger_houses['SalePrice'])
results(p)

OUTPUT:

INFERENCE:
There is a difference in sale price between small houses and large houses.
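
The two-sample comparison can also be followed by hand. The sketch below computes a pooled-variance t statistic and its degrees of freedom (n1 + n2 - 2) with the standard library, assuming both groups share a common variance. pooled_t is a hypothetical helper and the two lists are made-up logged prices, so this is an illustration of the formula rather than a reproduction of the result above.

```python
import math
import statistics

def pooled_t(a, b):
    """Return (t statistic, degrees of freedom) for a pooled two-sample t-test."""
    n1, n2 = len(a), len(b)
    m1, m2 = statistics.mean(a), statistics.mean(b)
    v1, v2 = statistics.variance(a), statistics.variance(b)  # sample variances
    dof = n1 + n2 - 2
    pooled_var = ((n1 - 1) * v1 + (n2 - 1) * v2) / dof       # weighted average
    t = (m1 - m2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t, dof

# Hypothetical logged prices for illustration only
small = [11.4, 11.6, 11.8]
large = [12.0, 12.2, 12.4]
t_two, dof_two = pooled_t(small, large)
print(t_two, dof_two)
```

A negative statistic here means the first group (smaller houses) has the lower mean.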

Two sample T-test | One-tailed | Means

Null Hypothesis : SalePrice of smaller houses >= SalePrice of larger houses


Alternative Hypothesis : SalePrice of smaller houses < SalePrice of larger houses

CODE:

p['value1'], p['value2'] = smaller_houses['SalePrice'].mean(), larger_houses['SalePrice'].mean()


p['score'], p['p_value'], p['df'] = ttest_ind(smaller_houses['SalePrice'],
larger_houses['SalePrice'], alternative='smaller')
results(p)

OUTPUT:

INFERENCE:
Larger houses are more expensive.

Two sample Z-test | One-tailed | Means

A larger sample (n = 100 per group) is used to draw conclusions. At this sample size the
sample means are approximately normally distributed, hence a Z-test is used.
Null Hypothesis : SalePrice of smaller houses >= SalePrice of larger houses
Alternative Hypothesis : SalePrice of smaller houses < SalePrice of larger houses
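
With a Z-test the p-value comes from the standard normal distribution, so the whole one-tailed test can be sketched with the standard library. The version below assumes a pooled variance and the alternative "first group mean is smaller"; ztest_smaller is a hypothetical helper, the data are made up, and it is meant as an illustration rather than a copy of the statsmodels implementation.

```python
import math
from statistics import NormalDist, mean, variance

def ztest_smaller(a, b):
    """One-tailed two-sample z-test, H1: mean(a) < mean(b). Returns (z, p)."""
    n1, n2 = len(a), len(b)
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    z = (mean(a) - mean(b)) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    p_value = NormalDist().cdf(z)   # lower-tail probability P(Z <= z)
    return z, p_value

# Hypothetical logged prices for illustration only
group_small = [11.4, 11.6, 11.8]
group_large = [12.0, 12.2, 12.4]
z_stat, p_val = ztest_smaller(group_small, group_large)
print(z_stat, p_val)
```

Swapping the two arguments flips the sign of z and pushes the p-value towards 1, which is why the order of the samples matters in a one-tailed test.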

CODE:

smaller_houses = city_hall_dataset.sort_values('GrLivArea')[:730].sample(n=100,
random_state=1)
larger_houses = city_hall_dataset.sort_values('GrLivArea')[730:].sample(n=100,
random_state=1)
p['value1'], p['value2'] = smaller_houses['SalePrice'].mean(), larger_houses['SalePrice'].mean()
p['score'], p['p_value'] = ztest(smaller_houses['SalePrice'], larger_houses['SalePrice'],
alternative='smaller')
results(p)

OUTPUT:

One sample Z-test | One-tailed

Null Hypothesis : Mean SalePrice of smaller houses <= 11.695

Alternative Hypothesis : Mean SalePrice of smaller houses > 11.695

CODE:

p['value1'], p['value2'] = smaller_houses['SalePrice'].mean(), logged_budget


p['score'], p['p_value'] = ztest(smaller_houses['SalePrice'], value=logged_budget,
alternative='larger')
results(p)

OUTPUT:

INFERENCE:

This means $120 000 can't buy a small house (on average).

ANOVA Test

Null Hypothesis : No difference between SalePrice means across zoning classes (MSZoning)

Alternative Hypothesis : Difference between SalePrice means across zoning classes

CODE:

replacement = {'FV': "Floating Village Residential", 'C (all)': "Commercial",
               'RH': "Residential High Density", 'RL': "Residential Low Density",
               'RM': "Residential Medium Density"}

smaller_houses['MSZoning_FullName'] = smaller_houses['MSZoning'].replace(replacement)
mean_price_by_zone = smaller_houses.groupby('MSZoning_FullName')['SalePrice'].mean().to_frame()

CODE:

sh = smaller_houses.copy()
p['score'], p['p_value'] = stats.f_oneway(sh.loc[sh.MSZoning=='FV', 'SalePrice'],
                                          sh.loc[sh.MSZoning=='C (all)', 'SalePrice'],
                                          sh.loc[sh.MSZoning=='RH', 'SalePrice'],
                                          sh.loc[sh.MSZoning=='RL', 'SalePrice'],
                                          sh.loc[sh.MSZoning=='RM', 'SalePrice'])
results(p)[['score', 'p_value', 'hypothesis_accepted']]

OUTPUT:

INFERENCE:
SalePrice varies based on Zone
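
The F statistic behind a one-way ANOVA compares the variation between group means with the variation inside the groups. A minimal hand computation follows, using only the standard library; f_statistic is a hypothetical helper and the three groups are made-up numbers standing in for zones, not values from train.csv. It returns the statistic only; the p-value comes from an F distribution with (k - 1, n - k) degrees of freedom.

```python
def f_statistic(groups):
    """Return the one-way ANOVA F statistic for a list of groups of numbers."""
    all_values = [v for g in groups for v in g]
    k, n = len(groups), len(all_values)
    grand_mean = sum(all_values) / n
    group_means = [sum(g) / len(g) for g in groups]
    # Between-group sum of squares: group means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Within-group sum of squares: observations around their own group mean
    ss_within = sum((v - m) ** 2
                    for g, m in zip(groups, group_means) for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical groups for illustration only
zones = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_value = f_statistic(zones)
print(f_value)
```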

RESULT:

Multiple regression, t-tests, Z-tests and an ANOVA test were performed successfully.
