
I2IT DataVisualizationI - JupyterLab

The document analyzes a dataset from the Titanic using Seaborn and Pandas. It generates various plots to analyze patterns in passenger data like age, fare, class, gender and survival rates. Key insights include higher survival rates for 1st class passengers, males on average being older than females, and most passengers embarking from Southampton.

Uploaded by

Piyush Shastri

Problem Statement

*Data Visualization I*

*1. Use the inbuilt dataset 'titanic'. The dataset contains 891 rows and contains information about the
passengers who boarded the unfortunate Titanic ship. Use the Seaborn library to see if we can find
any patterns in the data.*

*2. Write a code to check how the price of the ticket (column name: 'fare') for each passenger is
distributed by plotting a histogram.*

In [2]: import numpy as np
        import pandas as pd
        import matplotlib.pyplot as plt
        import seaborn as sns

In [3]: df = sns.load_dataset('titanic')

In [4]: df.notnull()

Out[4]: survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_t

0 True True True True True True True True True True True False

1 True True True True True True True True True True True True

2 True True True True True True True True True True True False

3 True True True True True True True True True True True True

4 True True True True True True True True True True True False

... ... ... ... ... ... ... ... ... ... ... ... ...

886 True True True True True True True True True True True False

887 True True True True True True True True True True True True

888 True True True False True True True True True True True False

889 True True True True True True True True True True True True

890 True True True True True True True True True True True False

891 rows × 15 columns

In [5]: df.head(3)
Out[5]: survived pclass sex age sibsp parch fare embarked class who adult_male deck emb

0 0 3 male 22.0 1 0 7.2500 S Third man True NaN Sou

1 1 1 female 38.0 1 0 71.2833 C First woman False C

2 1 3 female 26.0 0 0 7.9250 S Third woman False NaN Sou

In [6]: df.columns

Out[6]: Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',
               'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town',
               'alive', 'alone'],
              dtype='object')

In [7]: df['pclass'].unique()

Out[7]: array([3, 1, 2], dtype=int64)

In [8]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 survived 891 non-null int64
1 pclass 891 non-null int64
2 sex 891 non-null object
3 age 714 non-null float64
4 sibsp 891 non-null int64
5 parch 891 non-null int64
6 fare 891 non-null float64
7 embarked 889 non-null object
8 class 891 non-null category
9 who 891 non-null object
10 adult_male 891 non-null bool
11 deck 203 non-null category
12 embark_town 889 non-null object
13 alive 891 non-null object
14 alone 891 non-null bool
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB
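
The boolean frame returned by df.notnull() is hard to scan, whereas the df.info() output already implies the gaps: of 891 rows, age has 714 non-null values (177 missing), deck has 203 (688 missing), and embarked and embark_town have 889 each (2 missing). A minimal sketch of counting missing values directly follows. The helper name `missing_counts` and the small `sample` frame are illustrative and not part of the original notebook; the sample reuses the three rows shown in Out[5] with only a subset of columns.

```python
import numpy as np
import pandas as pd


def missing_counts(df: pd.DataFrame) -> pd.Series:
    """Return the number of missing values per column, omitting complete columns."""
    counts = df.isna().sum()
    return counts[counts > 0].sort_values(ascending=False)


# Three rows copied from Out[5]; only a subset of the columns is reproduced.
sample = pd.DataFrame({
    "survived": [0, 1, 1],
    "pclass": [3, 1, 3],
    "sex": ["male", "female", "female"],
    "age": [22.0, 38.0, 26.0],
    "fare": [7.25, 71.2833, 7.925],
    "deck": [np.nan, "C", np.nan],
})

sample_missing = missing_counts(sample)
print(sample_missing)
```

On the full dataset, the same helper could be called as missing_counts(df) right after In [3].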

Scatter Plot: Relationship between Age and Fare


Inference: The scatter plot shows no clear linear relationship between age and fare. The points, coloured by
sex, are scattered without a distinct upward or downward trend, so fare cannot be predicted from age in any
straightforward way.

In [9]: sns.scatterplot(data = df, x = 'age', y = 'fare', hue = 'sex')

Out[9]: <Axes: xlabel='age', ylabel='fare'>
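
The visual impression of "no clear linear correlation" can be backed by a number. The sketch below computes the Pearson correlation between age and fare after dropping rows with a missing value. The function name `age_fare_correlation` is an assumption made for illustration, and the sample rows are the three shown in Out[5], which are far too few to say anything about the full dataset.

```python
import pandas as pd


def age_fare_correlation(df: pd.DataFrame) -> float:
    """Pearson correlation between age and fare, ignoring rows with a missing value."""
    pair = df[["age", "fare"]].dropna()
    return float(pair["age"].corr(pair["fare"]))


# Ages and fares of the three passengers shown in Out[5].
sample = pd.DataFrame({"age": [22.0, 38.0, 26.0], "fare": [7.25, 71.2833, 7.925]})
r_sample = age_fare_correlation(sample)
print(f"Pearson r on the sample rows: {r_sample:.3f}")
```

A value near 0 on the full dataset would support the inference; a value near 1 or -1 would contradict it.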


Histogram Plot: Age
Inference: Ages are concentrated between roughly 20 and 40 years. The histogram is right-skewed, with a longer
tail towards older ages, so the majority of passengers were relatively young.

In [10]: sns.histplot(df, x = 'age')

Out[10]: <Axes: xlabel='age', ylabel='Count'>
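
The problem statement also asks for a histogram of the ticket price (column 'fare'), which the notebook does not plot. A hedged sketch follows: `fare_bin_counts` computes the bin counts that a histogram would draw, and `plot_fare_histogram` draws the plot with sns.histplot. Both function names are hypothetical, and the sample fares are the three values visible in Out[5].

```python
import numpy as np
import pandas as pd


def fare_bin_counts(fares: pd.Series, bins: int = 10):
    """Return (counts, edges) for equal-width bins over the non-missing fares."""
    counts, edges = np.histogram(fares.dropna().to_numpy(), bins=bins)
    return counts, edges


def plot_fare_histogram(df: pd.DataFrame):
    """Draw the fare histogram with Seaborn and return the Axes."""
    import seaborn as sns  # imported here so the helper above needs only NumPy and pandas
    return sns.histplot(data=df, x="fare")


# Fares of the three passengers shown in Out[5].
fares = pd.Series([7.25, 71.2833, 7.925])
counts, edges = fare_bin_counts(fares, bins=4)
print(counts, edges)
```

In the notebook itself, plot_fare_histogram(df) with the full Titanic frame would produce the requested plot.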


Distribution Plot: Age by Sex and Passenger Class
Inference: Age distributions differ across passenger classes and sexes. Class 3 has the most passengers,
followed by class 1 and then class 2. Males outnumber females in every class; the gap is widest in class 3
and much narrower in classes 1 and 2.

In [11]: sns.displot(df, x = 'age', hue = 'sex', col = 'pclass')

Out[11]: <seaborn.axisgrid.FacetGrid at 0x173ef826290>

Inference: These panels show the age distributions by sex for each embarkation town. Most passengers boarded
at Southampton, and most of them were male. Cherbourg is a distant second in passenger numbers, followed by
Queenstown.
In [12]: sns.displot(df, x = 'age', hue = 'sex', col = 'embark_town')

Out[12]: <seaborn.axisgrid.FacetGrid at 0x173f2356450>

Bar Plot: Relationship between Sex and Age


Inference: Visualizing the average age for each gender, this bar plot highlights a central tendency comparison.
The plot indicates that, on average, male passengers tend to be older than female passengers in the dataset.

In [13]: sns.barplot(data = df, x = 'sex', y = 'age')

Out[13]: <Axes: xlabel='sex', ylabel='age'>

Inference: This bar plot depicts the average age for each passenger class. The analysis indicates a clear age
hierarchy, with the average age highest for Class 1 passengers (38.2 years), followed by Class 2 passengers
(29.9 years), and Class 3 passengers having the lowest average age (25.1 years).
In [14]: sns.barplot(data = df, x = 'pclass', y = 'age')

Out[14]: <Axes: xlabel='pclass', ylabel='age'>

Inference: In all passenger classes, the bar plot with age on the y-axis, differentiated by class and further by
gender, reveals a consistent pattern. The average age of male passengers is higher than that of their female
counterparts within each class.

In [18]: sns.barplot(data = df, x = 'pclass', y = 'age', hue = 'sex' )

Out[18]: <Axes: xlabel='pclass', ylabel='age'>
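
The bar heights in the age bar plots above are group means, so the quoted averages can be checked with a groupby. This is a minimal sketch: `mean_age_by` is an illustrative name, and the sample frame holds only the three rows from Out[5], so its output does not reproduce the full-dataset figures.

```python
import pandas as pd


def mean_age_by(df: pd.DataFrame, keys):
    """Mean age per group; missing ages are skipped, as in the bar plots."""
    return df.groupby(keys)["age"].mean()


# Three rows copied from Out[5].
sample = pd.DataFrame({
    "pclass": [3, 1, 3],
    "sex": ["male", "female", "female"],
    "age": [22.0, 38.0, 26.0],
})

by_class = mean_age_by(sample, "pclass")
by_class_sex = mean_age_by(sample, ["pclass", "sex"])
print(by_class)
print(by_class_sex)
```

Passing "sex", "pclass", or ["pclass", "sex"] mirrors the x and hue choices used in In [13], In [14], and In [18].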


Inference: This bar plot shows the mean fare for each passenger class, split by embarkation town; it does not
show passenger counts (those appear in the count plot in In [21]). Mean fare drops sharply from Class 1 to
Class 3 regardless of embarkation town. Within Class 1, passengers who embarked at Cherbourg paid the highest
mean fare.

In [19]: sns.barplot(x='pclass',y='fare',hue= 'embark_town',data=df, errorbar = None)

Out[19]: <Axes: xlabel='pclass', ylabel='fare'>


Count Plot: Relationship between pclass and survived
Inference: This count plot, organized by passenger class and differentiated by survival status, shows the
number of survivors and non-survivors within each class. Class 1 is the only class in which survivors
outnumber non-survivors, indicating a higher survival rate in this class.

In [17]: sns.countplot(x='pclass',data=df,hue='survived')

Out[17]: <Axes: xlabel='pclass', ylabel='count'>
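
The count plot shows raw counts, while the inference is about survival rate. Because 'survived' is coded 0/1, its mean per class is the survival rate. A hedged sketch, with `survival_rate_by_class` as an assumed helper name and the three rows from Out[5] as sample data:

```python
import pandas as pd


def survival_rate_by_class(df: pd.DataFrame) -> pd.Series:
    """Fraction of passengers who survived in each passenger class."""
    return df.groupby("pclass")["survived"].mean()


# Three rows copied from Out[5].
sample = pd.DataFrame({"pclass": [3, 1, 3], "survived": [0, 1, 1]})
rates = survival_rate_by_class(sample)
print(rates)
```

Applied to the full frame, the resulting rates make the class comparison explicit without reading bar heights off the plot.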


Inference: This count plot, categorized by passenger class and further distinguished by gender, illustrates the
gender distribution within each class. In all classes there are more males than females, with Class 3
showing a notably wider gender gap than the other classes.

In [20]: sns.countplot(x='pclass',data=df,hue='sex')

Out[20]: <Axes: xlabel='pclass', ylabel='count'>


Inference: This countplot provides insights into the distribution of passengers based on both class and
embarkation location. Southampton appears to be the dominant embarkation town across all passenger
classes, followed by varying proportions from Cherbourg and Queenstown.

In [21]: sns.countplot(x='pclass',data=df,hue='embark_town')

Out[21]: <Axes: xlabel='pclass', ylabel='count'>
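
The count plots by class can be read as contingency tables. The sketch below builds such a table with pd.crosstab. The name `count_table` is hypothetical, and the sample uses the 'embarked' port codes from Out[5] because the 'embark_town' column is truncated in that output.

```python
import pandas as pd


def count_table(df: pd.DataFrame, row: str, col: str) -> pd.DataFrame:
    """Number of passengers for each combination of two categorical columns."""
    return pd.crosstab(df[row], df[col])


# Three rows copied from Out[5].
sample = pd.DataFrame({
    "pclass": [3, 1, 3],
    "sex": ["male", "female", "female"],
    "embarked": ["S", "C", "S"],
})

by_sex = count_table(sample, "pclass", "sex")
by_port = count_table(sample, "pclass", "embarked")
print(by_sex)
print(by_port)
```

On the full dataset, count_table(df, 'pclass', 'embark_town') gives the numbers behind the plot in In [21].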


Inference: Southampton exhibits a higher count of both survivors and non-survivors, while Cherbourg and
Queenstown show relatively lower counts. The plot provides a perspective on the distribution of survival
outcomes across different embarkation towns.

In [24]: sns.catplot(x='embark_town',kind='count',data=df,hue='survived')

Out[24]: <seaborn.axisgrid.FacetGrid at 0x173f450d410>


Inference: The catplot allows for a detailed comparison of the count of survivors and non-survivors across
different embarkation towns for each passenger class. Patterns of survival outcomes vary across embarkation
towns and passenger classes.

In [26]: sns.catplot(x='embark_town',kind='count',data=df,hue='survived',col='pclass')

Out[26]: <seaborn.axisgrid.FacetGrid at 0x173f4517bd0>

Iris Dataset
In [27]: df = sns.load_dataset('iris')
In [28]: df.head(3)

Out[28]:    sepal_length  sepal_width  petal_length  petal_width species
         0           5.1          3.5           1.4          0.2  setosa
         1           4.9          3.0           1.4          0.2  setosa
         2           4.7          3.2           1.3          0.2  setosa

In [29]: df.columns

Out[29]: Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
                'species'],
               dtype='object')

Inference: The displot allows for a visual comparison of sepal length distributions among the three species of
Iris: Setosa, Versicolor, and Virginica. Setosa tends to have the shortest sepals of the three species.

In [35]: sns.displot(x = 'sepal_length',data = df, col = 'species')

Out[35]: <seaborn.axisgrid.FacetGrid at 0x173f8d453d0>

Inference: The displot visually compares the sepal width distributions for the three Iris species: Setosa,
Versicolor, and Virginica. Setosa tends to have wider sepals than the other two species, while Versicolor
and Virginica have relatively similar sepal widths.

In [36]: sns.displot(x = 'sepal_width',data = df, col = 'species')

Out[36]: <seaborn.axisgrid.FacetGrid at 0x173f879dc90>


Inference: Of the three Iris species (Setosa, Versicolor, and Virginica), Setosa has the shortest petals and
Virginica has the longest.

In [38]: sns.displot(x = 'petal_length',data = df, col = 'species')

Out[38]: <seaborn.axisgrid.FacetGrid at 0x173f95a3710>

Inference: The displot with 'petal_width' on the x-axis, organized into separate columns for each species in the
Iris dataset, shows how petal width is distributed within each species. Setosa has the narrowest petals and
Virginica has the widest.

In [39]: sns.displot(x = 'petal_width',data = df, col = 'species')

Out[39]: <seaborn.axisgrid.FacetGrid at 0x173f9cc5c10>
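
The species comparisons read off the faceted displots can be summarised numerically. The following sketch reports the minimum, mean, and maximum of one feature per species. The name `species_summary` is illustrative only. The sample rows are the three shown in Out[28], all of which are Setosa, so a real comparison needs the full Iris frame.

```python
import pandas as pd


def species_summary(df: pd.DataFrame, feature: str) -> pd.DataFrame:
    """Minimum, mean, and maximum of one measurement for each species."""
    return df.groupby("species")[feature].agg(["min", "mean", "max"])


# Three rows copied from Out[28]; all three are Setosa.
sample = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7],
    "sepal_width": [3.5, 3.0, 3.2],
    "petal_length": [1.4, 1.4, 1.3],
    "petal_width": [0.2, 0.2, 0.2],
    "species": ["setosa", "setosa", "setosa"],
})

summary = species_summary(sample, "petal_length")
print(summary)
```

Calling species_summary(df, 'sepal_width') on the full dataset would show whether Setosa does have the widest sepals.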


Inference: The histogram depicting sepal length reveals a relatively normal distribution centered around 5.8
cm, with the majority of iris flowers having lengths ranging from approximately 4.5 cm to 7.0 cm. The
distribution exhibits fewer instances of extremely short or long sepal lengths, evident from the lower
frequencies observed at the tails of the distribution.

In [40]: sns.histplot(x='sepal_length',data=df)

Out[40]: <Axes: xlabel='sepal_length', ylabel='Count'>

Inference: The sepal width histogram demonstrates a roughly normal distribution with some variability. The
predominant sepal width is approximately 3.0 cm, showcasing a spread across the range of 2.0 cm to 4.5 cm.
Despite this variation, there's a subtle skew towards higher sepal widths, evident from the slightly elongated
right tail of the distribution.

In [41]: sns.histplot(x='sepal_width',data=df)
Out[41]: <Axes: xlabel='sepal_width', ylabel='Count'>

Inference: The petal length histogram pools all three species, so it does not identify species by itself; it
shows a cluster of short petals separated from a broader group of longer ones. Read together with the
per-species displot in In [38], the short cluster (about 1-2 cm) corresponds to Setosa, Versicolor falls
mainly within 3-5 cm, and Virginica has the longest petals, from about 4.5 cm to 7.0 cm.

In [42]: sns.histplot(x='petal_length',data=df)

Out[42]: <Axes: xlabel='petal_length', ylabel='Count'>


Inference: The petal width histogram, like the one for petal length, pools all three species. Read together
with the per-species displot in In [39], Setosa has the narrowest petals, clustering around 0.2-0.3 cm;
Versicolor has wider petals, mostly within 1.0-1.5 cm; and Virginica has the widest petals of the three
species.

In [43]: sns.histplot(x='petal_width',data=df)

Out[43]: <Axes: xlabel='petal_width', ylabel='Count'>
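
The plain histograms above pool all species, so per-species statements cannot be read from them directly. One option is sns.histplot(data=df, x='petal_length', hue='species'), which colours the bars by species. The sketch below computes the same idea numerically: per-species counts over bin edges shared by all species. The function name `histogram_by_species` is an assumption, and the sample contains only the three Setosa rows from Out[28].

```python
import numpy as np
import pandas as pd


def histogram_by_species(df: pd.DataFrame, feature: str, bins: int = 10):
    """Return shared bin edges and a dict of per-species bin counts."""
    edges = np.histogram_bin_edges(df[feature].dropna().to_numpy(), bins=bins)
    counts = {
        name: np.histogram(group[feature].dropna().to_numpy(), bins=edges)[0]
        for name, group in df.groupby("species")
    }
    return edges, counts


# Petal lengths of the three Setosa rows shown in Out[28].
sample = pd.DataFrame({
    "petal_length": [1.4, 1.4, 1.3],
    "species": ["setosa", "setosa", "setosa"],
})

edges, counts = histogram_by_species(sample, "petal_length", bins=2)
print(edges, counts)
```

Sharing the bin edges keeps the species directly comparable, which is also why a single plot with hue is easier to read than separate histograms.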
