Data Science Assignment Submission

The document contains code to analyze the Titanic dataset and verify various claims. It summarizes the analysis of 8 claims about the Titanic passengers and outcomes. For each claim, the code generates a visualization, compares the results to the claim, and states whether the claim is accepted or rejected based on the analysis. The document also contains code to plot stock price data over time for 4 companies and analyze weight data to check if it follows a normal distribution.

Uploaded by

Sneha

#*****************************************************

#This assignment is to be done using Python

#*****************************************************

"""Part 1: #Refer to the Titanic dataset

Below is a series of claims that you need to accept or reject based on visualizations of the data.
For each one, copy and paste your final code, along with the visualization, and a statement rejecting or accepting the claim into your Word document"""

import pandas as pd
import matplotlib.pyplot as mat
import seaborn as sns
df_titanic = pd.read_csv(r'C:\Users\sneha\Desktop\MISM 6212 Data Mining\Assignment\Week 3\Titanic.csv')

# =======================================
#Claim 1: More people died than survived
# =======================================
df_titanic_survived = df_titanic.groupby(['Survived']).size().reset_index(name='Counts')

# The default integer index (0, 1) matches the Survived codes, so it labels the bars
ax = df_titanic_survived['Counts'].plot(kind='bar', title="Number of people died vs survived",
                                        figsize=(15, 10), label="0 = Died, 1 = Survived", fontsize=12)
ax.set_xlabel("Survived", fontsize=12)
ax.set_ylabel("Count", fontsize=12)
mat.legend()
mat.show()
Result: Accept the claim that more people died than survived, since the bar for Survived = 0 is taller than the bar for Survived = 1.
# =======================================
#Claim 2: Females were more likely to survive than males
# =======================================
# Keep only the survivors, then count them by sex
df_titanic_surv = df_titanic.loc[df_titanic['Survived']==1].reset_index(drop=True)
df_titanic_sex = df_titanic_surv.groupby(['Sex']).size().reset_index(name='Counts')

ax = df_titanic_sex.plot.bar(x='Sex', y='Counts', rot=0, title="Number of females vs males among those who survived",
                             figsize=(15, 10), fontsize=12)
ax.set_xlabel("Sex", fontsize=12)
ax.set_ylabel("Count", fontsize=12)
mat.show()
Result: Accept the claim since there were more females than males among those who survived. Because females were also the smaller group on board, the higher survivor count implies a higher survival rate.
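Since the claim is about likelihood, the survival rate within each sex is a more direct measure than the survivor count. The following is a minimal sketch on made-up toy data (not the Titanic file; the `toy` frame and its values are assumptions for illustration) showing how the two measures can point in different directions:

```python
import pandas as pd

# Illustrative toy data, NOT the Titanic file: 4 females and 6 males.
toy = pd.DataFrame({
    "Sex": ["female"] * 4 + ["male"] * 6,
    "Survived": [1, 1, 1, 0, 1, 1, 1, 1, 0, 0],
})

# Number of survivors per sex (what a bar chart of survivor counts compares).
survivor_counts = toy.loc[toy["Survived"] == 1].groupby("Sex").size()

# Share of each sex that survived: the mean of a 0/1 column is a rate.
survival_rate = toy.groupby("Sex")["Survived"].mean()

print(survivor_counts)
print(survival_rate)
```

In this toy example males have more survivors (4 vs 3) but females have the higher survival rate (0.75 vs about 0.67), which is why the rate is the safer basis for a "more likely" statement.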

# =======================================
#Claim 3: The third class passengers had the highest chance of survival
# =======================================

df_titanic_cls = df_titanic.groupby(['Pclass', 'Survived']).size().reset_index(name='Counts')
# Percentage of each outcome within its passenger class
df_titanic_cls['Pct'] = 100*df_titanic_cls['Counts']/df_titanic_cls.groupby('Pclass')['Counts'].transform('sum')

df_titanic_cls_1 = df_titanic_cls.loc[df_titanic_cls['Survived']==1]
df_titanic_cls_0 = df_titanic_cls.loc[df_titanic_cls['Survived']==0]

# Stacked bars: for each class the survivors are drawn on top of the deaths
mat.bar(x=df_titanic_cls_0['Pclass'], height=df_titanic_cls_0['Pct'], label='Died')
mat.bar(x=df_titanic_cls_1['Pclass'], height=df_titanic_cls_1['Pct'],
        bottom=df_titanic_cls_0['Pct'].values, label='Survived')

mat.title("Percentage survived vs died in each Pclass")

mat.xlabel("Pclass")
mat.ylabel("Percentage(%)", fontsize=12)
mat.xticks(df_titanic_cls['Pclass'])
mat.legend()
mat.show()
Result: Reject the claim since third class passengers had the lowest percentage of survivors of the three classes.
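The per-class percentages can also be obtained in one step with `pd.crosstab`. This is only a sketch of an alternative on made-up toy data (the `toy` frame is an assumption, not the Titanic file):

```python
import pandas as pd

# Illustrative toy data, NOT the Titanic file.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 1, 3, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0, 0],
})

# normalize="index" divides each row by its row total, so every class sums to 100%.
pct = pd.crosstab(toy["Pclass"], toy["Survived"], normalize="index") * 100

print(pct)
```

The resulting table has one row per class and one column per outcome, so `pct.plot.bar(stacked=True)` would draw a stacked chart of the same kind directly from it.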

# =======================================
#Claim 4: Majority of the people in Titanic were older than 40 years
# =======================================

# Drop passengers with a missing age: NaN <= 40 evaluates to False, which would
# otherwise count them in the "Greater than 40 years" group
df_titanic_agegrp = df_titanic.dropna(subset=['Age']).copy()
df_titanic_agegrp['Age_group'] = df_titanic_agegrp['Age'].le(40).map(
    {True: "<= 40 years", False: "Greater than 40 years"})

df_titanic_age = df_titanic_agegrp.groupby(['Age_group']).size().reset_index(name='Counts')

ax = df_titanic_age.plot.bar(x='Age_group', y='Counts', rot=0, title="Number of people in Titanic across age groups",
                             figsize=(15, 10), fontsize=12)
ax.set_xlabel("Age group", fontsize=12)
ax.set_ylabel("Count", fontsize=12)
mat.show()
Result: Reject the claim since more people are in the age group of 40 years or younger.
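One detail worth checking when grouping by age is how missing values are treated. The sketch below uses plain Python and hypothetical ages (not the Titanic file) to show how a comparison against NaN behaves:

```python
import math

ages = [22.0, math.nan, 50.0]  # hypothetical ages; nan marks a missing value

# Any ordering comparison with NaN is False, so an if/else on `age <= 40`
# sends missing ages to the else branch.
naive_labels = ["<= 40" if a <= 40 else "> 40" for a in ages]
print(naive_labels)  # ['<= 40', '> 40', '> 40']

# Filtering out missing values first keeps them out of both groups.
known = [a for a in ages if not math.isnan(a)]
clean_labels = ["<= 40" if a <= 40 else "> 40" for a in known]
print(clean_labels)  # ['<= 40', '> 40']
```

If the Age column contains missing entries, the naive labelling inflates the older group, so it is safer to exclude or separately label those rows before counting.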

# =======================================
#Claim 5: Majority of people paid more than 100$ for buying the ticket
# =======================================

# Label each passenger by fare group (vectorized instead of a row-by-row loop)
df_titanic_faregrp = df_titanic.copy()
df_titanic_faregrp['Fare_group'] = df_titanic_faregrp['Fare'].le(100).map(
    {True: "<= 100$", False: "Greater than 100$"})

df_titanic_faregrp.groupby(['Fare_group']).size().plot.pie(
    autopct='%1.1f%%', startangle=270, title="Percentage of people in Titanic across fare groups",
    figsize=(15, 10), fontsize=12, label='Percentage of people')
mat.legend()
mat.show()
Result: Reject the claim since the majority (94%) of people paid a fare of less than or equal to $100.

# =======================================
#Claim 6: Females on average paid more than males for buying the ticket
# =======================================
ax = sns.boxplot(x="Sex", y="Fare", data=df_titanic, showmeans = True,
meanprops = {"marker":"s", "markerfacecolor":"black",
"markeredgecolor":"white"})
mat.title("Boxplot of Passenger Fare across Gender")

Result: Accept the claim since the mean fare for females is higher than for males (as seen through the black square markers: 44.48 > 25.63)
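The means marked on a boxplot can be read off numerically with a group-wise mean. A minimal sketch on made-up fares (the `toy` frame and its values are assumptions, not the Titanic file):

```python
import pandas as pd

# Illustrative toy data, NOT the Titanic file.
toy = pd.DataFrame({
    "Sex":  ["female", "female", "male", "male"],
    "Fare": [80.0, 20.0, 30.0, 10.0],
})

# Mean fare per sex: the value that showmeans=True marks on each box.
mean_fare = toy.groupby("Sex")["Fare"].mean()

print(mean_fare.round(2))
```

Reporting the numbers alongside the plot makes the comparison easier to verify than reading marker positions off the axis.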

# =======================================
#Claim 7: Passengers in Pclass 3 were younger on average than other classes
# =======================================

ax = sns.boxplot(x="Pclass", y="Age", data=df_titanic, showmeans = True,
meanprops = {"marker":"s", "markerfacecolor":"black",
"markeredgecolor":"white"})
mat.title("Boxplot of Passenger Age across Pclass")

Result: Accept the claim since the mean age of passengers in Pclass 3 is lower than in the other classes (as seen through the black square markers)

# =======================================
#Claim 8: Passengers in the first class paid the highest fare
# =======================================
sns.boxplot(x='Pclass',y='Fare',data=df_titanic,showmeans=True)
mat.title("Boxplot of Passenger Fare across Pclass")
Result: Accept the claim as the boxplot indicates that the mean, median and range of fares are all higher for first class passengers

"""Part 2"""
#Download data for 4 of your favorite stocks
#starting date: 01-01-2019
#end date: today (date you attempt the question)
#plot them on the same graph (only the column "Open" for each stock)
#Use appropriate names for x label, ylabel, and title
#Follow the following specifications
#Figure size: 10*10
#Title font: 25
#xticks and yticks fontsize: 15
#xlabel and ylabel font: 20
#Location of legend: upper left
#Fontsize of legend: 15

import yfinance as yf
df_amzn = yf.download("AMZN",start = "2019-01-01")
df_google = yf.download("GOOG",start = "2019-01-01")
df_tsla = yf.download("TSLA",start = "2019-01-01")
df_wmt = yf.download("WMT",start = "2019-01-01")

mat.figure(figsize=(10,10))
mat.plot(df_amzn['Open'], color = "orange",label = "Amazon")
mat.plot(df_google['Open'], color = "red",label = "Google")
mat.plot(df_tsla['Open'], color = "blue",label = "Tesla")
mat.plot(df_wmt['Open'], color = "green",label = "Walmart")
mat.title("Opening Stock price across time", fontsize = 25)
mat.xlabel("Date", fontsize = 20)
mat.ylabel("Stock price (USD)",fontsize = 20)
mat.xticks(fontsize=15)
mat.yticks(fontsize=15)
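# "upper left" (or the integer code 2) is a valid location; the string "2" is not
mat.legend(loc="upper left", fontsize=15)
mat.show()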
"""Part 3: Refer to file weights. It contains the weight (lbs) of
randomly selected males from United States,
Verify whether the weights seem to be normally distributed"""
#Hint: Check if the distribution of data looks like a bell shaped curve
#check that the mean and median are equal (approximately)
#Check if the data follows the empirical rule
#Empirical rule: For a normal distribution about 68% of the data falls within one standard deviation,
#about 95% within two standard deviations, and about 99.7% within three standard deviations from the mean.

df_weights = pd.read_csv(r'C:\Users\sneha\Desktop\MISM 6212 Data Mining\Assignment\Week 3\weights.csv')
df_weights.columns

# sns.distplot is deprecated in recent seaborn releases; kdeplot with fill=True draws the same shaded density
sns.kdeplot(df_weights['Weight'], color='darkblue', fill=True)
weight_mean = df_weights['Weight'].mean()
weight_std = df_weights['Weight'].std()
mat.axvline(x=weight_mean, color='red')
mat.axvline(x=df_weights['Weight'].median(), color='blue', lw=0.5, ls='--')
# Dashed guides at 1, 2 and 3 standard deviations on either side of the mean
for k in (1, 2, 3):
    mat.axvline(x=weight_mean + k*weight_std, color='black', lw=0.5, ls='--')
    mat.axvline(x=weight_mean - k*weight_std, color='black', lw=0.5, ls='--')

# Distance of each weight from the mean, in units of standard deviation
z_abs = (df_weights['Weight'] - df_weights['Weight'].mean()).abs() / df_weights['Weight'].std()
# Note: the last group is a catch-all, so it also holds any weights beyond 3 std deviations
df_weights['Group'] = pd.cut(
    z_abs, bins=[0, 1, 2, float('inf')], include_lowest=True,
    labels=["Within 1 std deviation", "Within 2 std deviation", "Within 3 std deviation"]).astype(str)

df_weights_dist = df_weights.groupby(['Group']).size().reset_index(name='Counts')
df_weights_dist['Cumulative_counts'] = df_weights_dist['Counts'].cumsum()
df_weights_dist['Cumulative_pct'] = 100*df_weights_dist['Cumulative_counts']/df_weights_dist['Counts'].sum()

Result:
The weights seem to be normally distributed, as the density plot looks like a bell shaped curve. The mean and median are approximately equal at 187.0 (the red mean line and the blue dashed median line overlap). The data also follows the empirical rule: 68.72% of observations fall within one standard deviation of the mean and 95.32% within two, close to the 68% and 95% expected for a normal distribution.

Index  Group                   Counts  Cumulative_counts  Cumulative_pct
0      Within 1 std deviation  3436    3436               68.72
1      Within 2 std deviation  1330    4766               95.32
2      Within 3 std deviation  234     5000               100.0
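The empirical-rule check can also be written as a small function. The sketch below is an illustration only: it uses simulated data (mean 187 lbs as reported above, with an assumed standard deviation of 20 lbs) rather than weights.csv, and the function name `empirical_rule_shares` is hypothetical. It counts each band against its own limit, so values beyond three standard deviations are not folded into the last band.

```python
import random
import statistics


def empirical_rule_shares(values):
    """Return the share of values within 1, 2 and 3 sample std devs of the mean."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)  # sample std, the same convention as pandas .std()
    n = len(values)
    return {k: sum(abs(v - mean) <= k * std for v in values) / n for k in (1, 2, 3)}


# Simulated weights: the mean comes from the result above, the spread is assumed.
random.seed(0)
sample = [random.gauss(187.0, 20.0) for _ in range(5000)]

shares = empirical_rule_shares(sample)
for k, share in shares.items():
    print(f"within {k} std: {share:.2%}")
```

For normally distributed data the three shares should come out near 68%, 95% and 99.7%; a third share well below 99.7% would point to heavier tails than a normal distribution.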

"""Part 4"""
Please submit the code for plotting the following graphs using the pokemon data:
df_pokemon = pd.read_csv(r'C:\Users\sneha\Desktop\MISM 6212 Data Mining\Data\pokemon_data.csv')
df_pokemon.columns

##Melting the Dataframe to a long format for visualization

df_pokemon_1 = df_pokemon[['HP','Attack','Defense','Sp. Atk','Sp. Def','Speed','Generation']]
df_pokemon_melted = df_pokemon_1.melt(id_vars=['Generation'])

sns.boxplot(x="variable", y="value", hue='Generation', data=df_pokemon_melted)
mat.legend(loc = 'upper right', title = "Generation")

df_pokemon = pd.read_csv(r'C:\Users\sneha\Desktop\MISM 6212 Data Mining\Data\pokemon_data.csv')
##Melting the Dataframe to a long format for visualization
df_pokemon_2 = df_pokemon[['HP','Attack','Defense','Sp. Atk','Sp. Def','Speed','Legendary']]
df_melted_2 = df_pokemon_2.melt(id_vars=["Legendary"],
                                value_vars=["HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed"])

##Plotting the visualization

sns.barplot(x="variable",y="value",data=df_melted_2,hue="Legendary")
mat.legend(loc='upper right',title='Legendary')
