Lab Manual
B.E. Semester 5
(Computer Science & Engineering (Data Science))
Institute logo
Directorate of Technical Education, Gandhinagar, Gujarat
Vishwakarma Government Engineering College,
Chandkheda-Ahmedabad
Certificate
Place: __________________
Date: __________________
Preface
The main motive of any laboratory/practical/field work is to enhance the required skills as well as to create the ability amongst students to solve real-time problems by developing the relevant competencies in the psychomotor domain. Keeping this in view, GTU has designed a competency-focused, outcome-based curriculum for engineering degree programmes in which sufficient weightage is given to practical work. This reflects the importance of skill enhancement amongst students and of utilising every second of the time allotted for practicals, so that students, instructors and faculty members achieve the relevant outcomes by performing experiments rather than carrying out merely study-type experiments. For effective implementation of a competency-focused, outcome-based curriculum, it is essential that every practical is carefully designed to serve as a tool to develop and enhance the relevant competencies required by industry in every student. These psychomotor skills are very difficult to develop through the traditional chalk-and-board method of content delivery in the classroom. Accordingly, this lab manual is designed to focus on industry-defined, relevant outcomes rather than the old practice of conducting practicals merely to prove concepts and theory.
By using this lab manual, students can go through the relevant theory and procedure in advance of the actual performance, which creates interest and gives them a basic idea prior to the performance. This, in turn, enhances the pre-determined outcomes amongst students. Each experiment in this manual begins with the competency, industry-relevant skills, course outcomes and practical outcomes (objectives). The students are also made aware of safety measures and the necessary precautions to be taken while performing the practical.
This manual also provides guidelines to faculty members to facilitate student-centric lab activities through each experiment by arranging and managing the necessary resources, so that the students follow the procedures with the required safety and necessary precautions to achieve the outcomes. It also gives an idea of how students will be assessed by providing rubrics.
Statistics and Exploratory Data Analysis is a fundamental course which deals with importing, organising, summarising and visualising data. It provides a platform for students to work with data frames and vectors, compute measures of central tendency and spread, and study the correlation between variables. Students also learn to explore real-world datasets through frequency tables, scatter plots, box plots and histograms.
Utmost care has been taken while preparing this lab manual; however, there is always scope for improvement. Therefore, we welcome constructive suggestions for improvement and the removal of errors, if any.
Statistics and Exploratory Data (3154301) Enrollment No. 220170146048
Index
(Progressive Assessment Sheet)
Sr. No. | Objective(s) of Experiment | Page No. | Date of performance | Date of submission | Assessment Marks | Sign. of Teacher with date | Remarks
1 Download a data-set from UCI repository/
Kaggle which has data of mix type (i.e.
string, float, date, etc.). Perform the following
operations in Python.
(i) Import the data, create the data frame
(ii) View the data
(iii) Print first few and last few records
(iv) Create a data frame which holds subset
of original data
(v) Create a vector which holds values of
one of the columns
2 (i) Create a vector having following marks:
56, 89, 76, 54, 79, 90, 75, 48.
(ii) Create a vector having following
students: Karan, Rakesh, Hiren, Rudra,
Himesh, Shantanu, Rohan, Sidhdhesh.
(iii) Create a data frame from above two
vectors.
(iv) Update the marks of Rakesh by 7 more
marks.
(v) Retrieve the students who secured
more than 85% of marks.
(vi) Sort the student list by descending
order of their marks.
3 Import the GDP dataset from kaggle and
compute the difference in GDP between
2007 and 2017 for each country. Also
create a subset of countries that saw an
increase of over one trillion dollars.
4 For the GDP dataset, compute the measures
of central tendency, mean, median, range
and quantile for year 2017.
5 Import Winter Olympics dataset from any
of the public dataset repository.
(i) Sort data by total medals and country
and save it in new data frame.
(ii) Check mean and median of number of
gold, silver, bronze and total medals.
Experiment No: 1
Download a data-set from the UCI repository/Kaggle which has data of mixed types (i.e. string, float, date, etc.). Perform the following operations in Python.
Date:
Relevant CO:1
Objectives:
Theory:
Data frame:
A DataFrame is a two-dimensional, tabular data structure, similar to an Excel spreadsheet or an SQL table, in which data is organized into rows and columns. Data analysis and manipulation tools such as pandas (in Python) and R use it as a fundamental data structure.
Each column in a pandas DataFrame can be given a distinct name, and different columns can hold data of different types, including strings, integers, floats and booleans. In addition, each row of the DataFrame can be assigned a distinct index or label that can be used to identify it.
DataFrames can be derived from many different sources, such as CSV files, Excel
spreadsheets, SQL databases, and Python dictionaries, amongst others. For the purpose of
performing data analysis and manipulation operations, they can be sliced, filtered, sorted,
merged, joined, and aggregated in a variety of different ways.
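As a small illustration (the column names and values below are made up), a DataFrame can be created from a Python dictionary and then inspected and filtered:
import pandas as pd

# build a small DataFrame from a dictionary (illustrative data)
records = {
    'Name': ['Asha', 'Ravi', 'Meena'],
    'Age': [21, 22, 20],
    'City': ['Ahmedabad', 'Surat', 'Rajkot'],
}
df = pd.DataFrame(records)

print(df.head())             # view the first few rows
print(df.dtypes)             # each column can hold a different data type
print(df[df['Age'] > 20])    # filter rows with a Boolean condition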
Sequence:
A sequence is an ordered collection of elements that can be indexed and sliced. Python has three primary built-in kinds of sequences:
Lists: A list is a mutable sequence that can contain elements of different types, including other lists. A list is created using square brackets [].
Tuples: A tuple is an immutable sequence that can contain items of various types, including other tuples. Tuples are created using parentheses ().
Strings: A string is an immutable sequence of characters, created using single or double quotes.
All three kinds of sequences support the indexing and slicing syntax. Indexing is used to access a particular element of a sequence based on its position, while slicing is used to access a subsequence of elements defined by a start index and an end index.
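A short sketch of indexing and slicing, using the marks and student names that appear later in this manual:
marks = [56, 89, 76, 54, 79]             # a list (mutable sequence)
names = ('Karan', 'Rakesh', 'Hiren')     # a tuple (immutable sequence)

print(marks[0])       # indexing: first element -> 56
print(marks[-1])      # negative index: last element -> 79
print(marks[1:4])     # slicing: elements at positions 1, 2, 3 -> [89, 76, 54]
print(names[:2])      # slicing works the same way on tuples -> ('Karan', 'Rakesh')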
(i) Import the data, create the data frame
Code:
import pandas as pd
data = pd.read_csv('healthcare_dataset.csv')
Output:
(ii) View the data
Code:
print(data)
Output:
(iii) Print first few and last few records
Code:
print(data.head())
print(data.tail())
Output:
(iv) Create a data frame which holds subset of original data
Code:
subset_data = data[['Name', 'Age', 'Gender']]   # column names assumed from the healthcare dataset; 'Age' is used below
print(subset_data)
Output:
(v) Create a vector which holds values of one of the columns
Code:
column_vector = subset_data['Age']
print(column_vector)
Output:
Concluding Remarks:
This exploration of mixed data types demonstrates how to work with real-world
datasets and prepare them for further analysis and machine learning models.
Quiz:
1. What are some common methods for manipulating and analyzing data in a DataFrame?
Suggested Reference:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html
https://pythonbasics.org/pandas-dataframe/
https://realpython.com/pandas-dataframe/
Rubrics 1 2 3 4 5 Total
References
Experiment No: 2
(i) Create a vector having following marks: 56, 89, 76, 54, 79, 90, 75, 48.
(ii) Create a vector having following students: Karan, Rakesh, Hiren, Rudra, Himesh, Shantanu, Rohan, Sidhdhesh.
(iii) Create a data frame from above two vectors.
(iv) Update the marks of Rakesh by 7 more marks.
(v) Retrieve the students who secured more than 85% of marks.
(vi) Sort the student list by descending order of their marks.
Date:
Relevant CO: 1
Objectives:
Theory:
Sorting:
Sorting is the process of putting data in a particular order, based on one or more columns of a dataset. It is one of the most common data-manipulation operations in data analysis and computer science.
There exist various sorting methods, as listed below. Each has its own pros and cons.
Bubble sort
Selection sort
Insertion sort
Merge sort
Quick sort
Heap sort
Counting sort
Bucket sort
Radix sort
This is not an exhaustive list of sorting techniques, but these are some of the most popular methods for sorting data. We discuss bubble sort here to get an idea of how sorting can be done.
Bubble sort:
Bubble sort is a simple sorting procedure that iteratively moves through a list of elements, compares neighbouring elements, and swaps them if they are in the wrong order. The method derives its name from the way the largest elements progressively "bubble" their way to the end of the list as the algorithm runs.
Consider the data A = [55, 33, 22, 11, 44]. In each pass, neighbouring elements are compared and swapped until the list is sorted, as the sketch below illustrates.
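A minimal Python sketch of bubble sort applied to the data A above:
def bubble_sort(data):
    # repeatedly compare neighbouring elements and swap them when out of order
    items = list(data)
    n = len(items)
    for i in range(n - 1):
        for j in range(n - 1 - i):
            if items[j] > items[j + 1]:
                items[j], items[j + 1] = items[j + 1], items[j]   # swap
    return items

print(bubble_sort([55, 33, 22, 11, 44]))   # -> [11, 22, 33, 44, 55]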
(i) Create a vector having following marks: 56, 89, 76, 54, 79, 90, 75, 48.
Code:
marks = [56, 89, 76, 54, 79, 90, 75, 48]
print(marks)
Output:
(ii) Create a vector having following students: Karan, Rakesh, Hiren, Rudra,
Himesh, Shantanu, Rohan, Sidhdhesh.
Code:
students = ['Karan', 'Rakesh', 'Hiren', 'Rudra', 'Himesh', 'Shantanu', 'Rohan', 'Sidhdhesh']
print(students)
Output:
(iii) Create a data frame from above two vectors.
Code:
import pandas as pd
df = pd.DataFrame({'Student': students, 'Marks': marks})   # column names chosen for illustration
print(df)
Output:
(iv) Update the marks of Rakesh by 7 more marks.
Code:
df.loc[df['Student'] == 'Rakesh', 'Marks'] += 7
print(df)
Output:
(v) Retrieve the students who secured more than 85% of marks.
Code:
top_students = df[df['Marks'] > 85]
print(top_students)
Output:
(vi) Sort the student list by descending order of their marks.
Code:
sorted_df = df.sort_values(by='Marks', ascending=False)
print(sorted_df)
Output:
Concluding Remarks:
These steps highlight essential data operations: creation, updating, filtering, and
sorting.
Quiz:
1. How do you merge two data frames with different column names in pandas?
Suggested Reference:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html
https://pythonbasics.org/pandas-dataframe/
https://realpython.com/pandas-dataframe/
Rubrics 1 2 3 4 5 Total
References
Experiment No: 3
Import the GDP dataset from Kaggle and compute the difference in GDP between 2007 and 2017 for each country. Also create a subset of countries that saw an increase of over one trillion dollars.
Date:
Relevant CO: 1
Objectives:
Theory:
Subsetting a data frame means selecting a subset of rows and/or columns from a larger data frame. This can be helpful for analyzing a particular section of the data or for generating a more compact data frame that contains only the essential columns.
The following is a list of some of the more frequent ways to subset a data frame in Python by
using pandas:
1. Selecting particular columns: You can pick particular columns by passing a list of
column names to the data frame indexing operator (for example, df[['col1', 'col2']], to
select only those two columns). This will return a new data frame that just contains
the columns that were selected.
2. Filtering rows based on a condition: You can build a Boolean mask from a condition and pass it to the data frame indexing operator (for example, df[df['col1'] > 50]). This returns a new data frame containing only the rows where the condition is met.
3. Slicing rows by position: You can slice rows by their position using the data frame slicing operator (for example, df[1:5]). This returns a new data frame containing the rows from the start position up to, but not including, the end position, following standard Python slicing.
4. Dropping columns: You can use the drop() method and pass the column names to drop (for example, df.drop(['col1', 'col2'], axis=1)). This returns a new data frame that does not contain the dropped columns. A short illustrative sketch of these operations is given below.
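A short illustrative sketch of these subsetting operations on a small, made-up data frame (the column names are only examples):
import pandas as pd

df = pd.DataFrame({'col1': [10, 60, 45, 80],
                   'col2': ['a', 'b', 'c', 'd'],
                   'col3': [1.5, 2.5, 3.5, 4.5]})

print(df[['col1', 'col2']])        # 1. select particular columns
print(df[df['col1'] > 50])         # 2. filter rows with a Boolean mask
print(df[1:3])                     # 3. slice rows by position (end position excluded)
print(df.drop(['col3'], axis=1))   # 4. drop columns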
i. Import the GDP dataset from Kaggle
Code:
import pandas as pd
df = pd.read_csv('GDP_data_P3.csv')
print(df.head())
Output:
ii. compute the difference in GDP between 2007 and 2017 for each country
Code:
df['GDP_Difference'] = df['2017'] - df['2007']   # assumes the year columns are named '2007' and '2017'
print(df[['Country','GDP_Difference']].head())
Output:
iii. create a subset of countries that saw an increase of over one trillion dollars.
Code:
trillion_increase_df = df[df['GDP_Difference'] > 1e12]   # countries with an increase of over one trillion dollars
trillion_increase_df = trillion_increase_df[['Country', 'GDP_Difference']]
print(trillion_increase_df.head())
Output:
Concluding Remarks:
Quiz:
1. How do you use .loc[] to select specific rows and columns in a data frame?
2. How do you use .iloc[] to select specific rows and columns in a data frame?
3. How do you use .loc[] to add new rows and columns to a data frame?
Suggested Reference:
https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html
https://www.digitalocean.com/community/tutorials/create-subset-of-python-dataframe
https://www.askpython.com/python/examples/subset-a-dataframe
Rubrics 1 2 3 4 5 Total
References
Experiment No: 4
For the GDP dataset, compute the measures of central tendency, mean, median, range
and quantile for year 2017.
Date:
Relevant CO: 2
Objectives:
Theory:
Mean:
We can compute the mean for only numeric data. The average value of a group of numbers is
referred to as the mean of those numbers. A data set's mean can be calculated by first adding
up all of the values in the set and then dividing that sum by the total number of values.
Let X = <x_1, x_2, …, x_n> be a vector of n numbers. The mean is computed as:

$\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$
Median:
When a set of data is sorted in order from lowest to highest (or highest to lowest), the value
that falls in the middle of the set is referred to as the median. When there are an even
number of values, the median is determined by taking the average of the two values that are
in the middle of the set.
To compute the median of data set X = <22, 44, 33, 11, 55>, we shall first arrange it in
ascending or descending order:
X = <11, 22, 33, 44, 55>
There are 5 elements in the sorted array X, so the median is the middle (third) element, which is 33.
For a data set with an even number of elements, say Y = <11, 22, 33, 44, 55, 66>, the median is the average of the two middle values: (33 + 44) / 2 = 38.5.
Mode:
The mode of a data set is the most frequent value within it. If two or more values occur with the same highest frequency, the data set has more than one mode. In other words, the mode is the element that occurs the maximum number of times.
For X = <11, 33, 66, 55, 22, 11, 66, 44, 11, 33, 55, 11>, element 11 appears the maximum number of times (4 times), so 11 is the mode of this dataset.
For Y = <33, 66, 55, 22, 11, 33, 44, 11, 33, 55, 11>, elements 11 and 33 appear the maximum number of times (3 times each), so 11 and 33 are both modes of this dataset.
In statistics, two measures of variability, the range and the quartiles, are used to describe the spread of a dataset. The range of a dataset is the difference between its maximum and minimum values. Although it is simple to calculate and understand, it is easily distorted by outliers and does not give a complete picture of the variability in the dataset.
On the other hand, quartiles split a dataset into four equal parts, with each section having 25
percent of the total data. The following represents each quartile:
The first quartile (Q1) is the 25th percentile, the value below which 25% of the data falls.
The second quartile (Q2) is the median, the middle value of the data.
The third quartile (Q3) is the 75th percentile, the value below which 75% of the data falls.
For X = <11, 33, 66, 55, 22, 11, 66, 44, 11, 33, 55, 11>, minimum value is 11 and maximum is
66, so the range of X is 66 – 11 = 55
Let us sort X: X = <11, 11, 11, 11, 22, 33, 33, 44, 55, 55, 66, 66>. There are 12 elements in the data set, so Q2 (the median) is the average of the 6th and 7th values, (33 + 33) / 2 = 33; taking the median of each half gives Q1 = 11 and Q3 = 55. A short pandas sketch of these computations follows.
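A minimal pandas sketch of the range and quartiles of the data set X above; with pandas' default linear-interpolation method the quartiles come out to 11, 33 and 55:
import pandas as pd

x = pd.Series([11, 33, 66, 55, 22, 11, 66, 44, 11, 33, 55, 11])

print(x.max() - x.min())               # range -> 66 - 11 = 55
print(x.quantile([0.25, 0.5, 0.75]))   # Q1, Q2 (median), Q3 -> 11.0, 33.0, 55.0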
Code:
import pandas as pd
df = pd.read_csv('GDP_data_P3.csv')
gdp_2017 = df[['Country','2017']]
print(gdp_2017.head())
mean = gdp_2017['2017'].mean()
median = gdp_2017['2017'].median()
range_2017 = gdp_2017['2017'].max() - gdp_2017['2017'].min()
quantiles = gdp_2017['2017'].quantile([0.25, 0.5, 0.75])
print(mean, median, range_2017)
print(quantiles)
Output:
Concluding Remarks:
These statistical measures are essential for understanding the overall distribution
and spread of GDP values in 2017.
Quiz:
2. What is the mean, median and mode of the following set of data: 4, 7, 9, 9, 11, 11, 11,
13?
3. Which measure of central tendency is preferred when the data set has extreme
values?
Suggested Reference:
https://codecrucks.com/mean-median-mode-variance-discovering-statistical-properties-of-data
https://www.techtarget.com/searchdatacenter/definition/statistical-mean-median-mode-and-range
https://www.statisticshowto.com/probability-and-statistics/statistics-definitions/mean-median-mode/
https://www.twinkl.co.in/teaching-wiki/mean-median-mode-and-range
Rubrics 1 2 3 4 5 Total
References
Experiment No: 5
Import the Winter Olympics dataset from any public dataset repository.
(i) Sort data by total medals and country and save it in new data frame.
(ii) Check mean and median of number of gold, silver, bronze and total medals.
(iii) Find which region won the highest mean total medals.
(iv) What is the maximum number of medals won? Which country won it?
(v) Find correlations between total medals and number of gold and bronze.
Date:
Relevant CO: 2, 3
Objectives:
Theory:
Correlation
The strength and direction of the relationship between two variables can be quantified with the help of a statistical measure called correlation. It measures the extent to which two variables change together.
Positive and negative correlations are the two main forms. A positive correlation indicates that when the value of one variable rises, the value of the other variable also tends to rise. A negative correlation indicates that when one variable grows, the other variable tends to decrease.
The Pearson correlation coefficient is computed as:

$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$

where
r is the correlation coefficient,
$x_i$ is the i-th data point of vector X,
$y_i$ is the i-th data point of vector Y,
$\bar{x}$ is the mean of vector X, and
$\bar{y}$ is the mean of vector Y.
The correlation between two variables is often evaluated using a correlation coefficient, the
value of which can range anywhere from minus one to plus one. If the correlation coefficient
is -1, it indicates an absolutely perfect negative correlation; if it is +1, it suggests an
absolutely perfect positive correlation; and if it is 0, it indicates that there is no association.
Setting up Plot:
Consider X = <1, 2, 3, 4, 5> and Y = <12, 13, 11, 14, 17>. Create a scatter plot and draw the correlation (regression) line; compute the slope and Y-intercept to draw the line, as in the sketch below.
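A minimal sketch of this setup, using numpy's polyfit to obtain the slope and Y-intercept of the least-squares line (for this data the slope comes out to about 1.1 and the intercept to about 10.1):
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1, 2, 3, 4, 5])
y = np.array([12, 13, 11, 14, 17])

slope, intercept = np.polyfit(x, y, 1)    # least-squares fit: slope ~ 1.1, intercept ~ 10.1
print(slope, intercept)

plt.scatter(x, y)                          # scatter plot of the data points
plt.plot(x, slope * x + intercept, color='red', label='Correlation line')
plt.xlabel('X')
plt.ylabel('Y')
plt.legend()
plt.show()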
Import the Winter Olympics dataset from any public dataset repository.
Code:
import pandas as pd
df = pd.read_csv('winter_olympics.csv')   # file name assumed for the downloaded dataset
print(df.head())
Output:
(i) Sort data by total medals and country and save it in new data frame.
Code:
# 'Total' is used elsewhere in this experiment; the country column name is assumed
sorted_df = df.sort_values(by=['Total', 'Country'], ascending=[False, True])
print(sorted_df.head())
Output:
(ii) Check mean and median of number of gold, silver, bronze and total medals.
Code:
mean_total = df['Total'].mean()
median_total = df['Total'].median()
print(mean_total, median_total)   # repeat with the 'Gold', 'Silver' and 'Bronze' columns (names assumed)
Output:
(iii) Find that which region won the highest mean total medals?
Code:
highest_mean_region = df.groupby('NOCCode')['Total'].mean().idxmax()
print(highest_mean_region)
Output:
(iv) What is the maximum number of medals won? Which country won it?
Code:
max_medals = df['Total'].max()
print(max_medals, df.loc[df['Total'].idxmax()])   # maximum medals together with the corresponding country's row
Output:
(v) Find correlations between total medals and number of gold and bronze.
Code:
# 'Gold' and 'Bronze' column names are assumed
print(df['Total'].corr(df['Gold']))
print(df['Total'].corr(df['Bronze']))
Output:
Code:
correlation_rank_total = df['Rank'].corr(df['Total'])
print(correlation_rank_total)
Output:
Concluding Remarks:
Quiz:
2. How is correlation calculated and what does a correlation coefficient tell us?
3. What are some real-world examples of positive and negative correlations?
4. How do outliers affect correlation and what can be done to address this issue?
Suggested Reference:
https://en.wikipedia.org/wiki/Correlation
https://www.mathsisfun.com/data/correlation.html
https://conjointly.com/kb/correlation-statistic/
Rubrics 1 2 3 4 5 Total
References
Experiment No: 6
(ii) Redo the scatter plot, adjusting scales, divide by 1000, 100,000 and 1,000,000
(iii) Find the correlation between tickets sold and sales? Is it expected?
(iv) Make scatter plot with millions scale and also add a regression line with label to
all axes and title to plot.
(vi) Make individual histogram of type of films, gross sales and ticket sales.
Date:
Relevant CO: 2, 3, 4
Objectives:
3. To understand histogram
Theory:
Scatter plot
The relationship between two quantitative variables can be graphically represented using
something called a scatter plot. It is a method of putting into perspective how one variable is
influenced by another variable.
In a scatter plot, the values of the two variables for each individual data point are
represented by individual points on the plot. One of the variables is represented by the
horizontal axis, and the other is represented by the vertical axis in this graph. Plotting each
data point as a separate point on the graph, with the value of one variable serving as the x-
coordinate and the value of the other variable serving as the y-coordinate, results in the
creation of the plot.
The usage of scatter plots is helpful in determining if there are any patterns or links between
two variables. They are able to assist us in determining if the data exhibit a positive or
negative correlation, a linear or non-linear relationship, as well as whether or not the data
contain any outliers or clusters.
Table 1: dataset
X Data Y Data
1 12
2 8
3 7
4 11
5 9
6 13
7 10
8 14
Box Plot:
A box plot is a graphical depiction of a set of data that shows its distribution and reveals
potential outliers. This type of plot is also known as a box-and-whisker plot.
In a box plot, the "box" represents the middle 50% of the data, while the "whiskers" extend
from the box to the minimum and maximum values of the data, omitting any outliers.
Together, these two parts of the plot are referred to as the "box and whiskers." The box itself
is defined by its lower and higher quartiles, where the lower quartile (Q1) is the median of
the lower half of the data, and the upper quartile (Q3) is the median of the upper half of the
data. In other words, the box is defined by the median of the lower half of the data and the
median of the upper half of the data. The line that runs through the middle of the box
illustrates the data's median value.
Box plots are a useful tool for visualizing changes in data over time as well as comparing
how the distribution of data varies between groups of people. They are extremely helpful
when working with huge data sets that may contain many outliers since they provide a
quick and straightforward approach to identify these outliers. This makes them an
especially handy tool.
Table 2: dataset
Histogram:
The distribution of a given collection of numerical data can be graphically represented using
something called a histogram. It provides an indication of the frequency with which each
value or range of values appears in the data.
The data are separated into intervals or "bins" along the horizontal axis of a histogram, and
the frequency or count of data points that fall into each bin is represented by the height of a
bar along the vertical axis of the histogram. In most cases, the bars are located next to one
another with no spaces in between them, and the width of each bar is proportional to the
size of the bin.
Histograms can be used to determine the form of the distribution of the data, such as
whether it is skewed, symmetric, bimodal, or unimodal. Another use of histograms is to
determine the amount of variation in the data. In addition to that, you may use them to find
gaps in the data or outliers in the data.
Table 3: dataset
1 3 3 6 6 7 8 9 9 9 9 10 10 10 10 11 11 12 12 17 18 18
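A minimal matplotlib sketch of a histogram for the Table 3 data (the bin edges are chosen only for illustration):
import matplotlib.pyplot as plt

data = [1, 3, 3, 6, 6, 7, 8, 9, 9, 9, 9, 10, 10, 10, 10, 11, 11, 12, 12, 17, 18, 18]

plt.hist(data, bins=[0, 3, 6, 9, 12, 15, 18, 21], edgecolor='black')   # illustrative bin edges
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Histogram of the Table 3 data')
plt.show()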
Code:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('movie.csv')
print(df.head())
Output:
(i) Make a scatter plot of tickets sold versus gross sales.
Code:
plt.figure(figsize=(6, 6))
plt.scatter(df['Tickets Sold'], df['2014 Gross'])   # column names assumed from the axis labels below
plt.xlabel('Tickets Sold')
plt.ylabel('2014 Gross')
plt.grid()
plt.show()
Output:
(ii) Redo the scatter plot, adjusting scales, divide by 1000, 100,000 and
1,000,000
Code:
plt.figure(figsize=(6, 6))
plt.scatter(df['Tickets Sold'] / 1_000_000, df['2014 Gross'] / 1_000_000)   # millions scale; column names assumed
plt.xlabel('Tickets Sold (millions)')
plt.ylabel('2014 Gross (millions)')
plt.grid()
plt.show()
Output:
(iii) Find the correlation between tickets sold and sales? Is it expected?
Code:
correlation = df['Tickets Sold'].corr(df['2014 Gross'])   # column names assumed
print(correlation)
Output:
(iv) Make scatter plot with millions scale and also add a regression line with
label to all axes and title to plot.
Code:
plt.figure(figsize=(10, 6))
plt.grid()
plt.show()
Output:
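The block above omits the plotting and regression calls. A fuller sketch of this step, assuming the movie dataset has columns named 'Tickets Sold' and '2014 Gross' (as used elsewhere in this experiment) and using numpy's polyfit for the least-squares line, might look like:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('movie.csv')
x = df['Tickets Sold'] / 1_000_000        # millions scale (column names assumed)
y = df['2014 Gross'] / 1_000_000

slope, intercept = np.polyfit(x, y, 1)    # least-squares regression line

plt.figure(figsize=(10, 6))
plt.scatter(x, y, label='Movies')
plt.plot(x, slope * x + intercept, color='red', label='Regression line')
plt.xlabel('Tickets Sold (millions)')
plt.ylabel('2014 Gross (millions)')
plt.title('Tickets Sold vs 2014 Gross')
plt.legend()
plt.grid()
plt.show()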
(v) Make a box plot of tickets sold by genre.
Code:
plt.figure(figsize=(9, 5))
sns.boxplot(data=df, x='Genre', y='Tickets Sold')
plt.xticks(rotation=90)
plt.grid()
plt.show()
Output:
(vi) Make individual histogram of type of films, gross sales and ticket sales.
Code:
fig, axs = plt.subplots(1, 3, figsize=(15, 5))
df['Genre'].value_counts().plot(kind='bar', ax=axs[0])   # film types are categorical, so plot counts per genre
axs[0].set_xlabel('Genre')
axs[0].set_ylabel('Count')
axs[1].hist(df['2014 Gross'], bins=10)
axs[1].set_xlabel('Gross Sales')
axs[1].set_ylabel('Frequency')
axs[2].hist(df['Tickets Sold'], bins=10)
axs[2].set_xlabel('Tickets Sold')
axs[2].set_ylabel('Frequency')
plt.tight_layout()
plt.show()
Output:
Concluding Remarks:
The visualizations help to easily identify trends and outliers, offering valuable
insights into the factors affecting ticket sales and gross revenues in the film industry.
This approach supports better decision-making and strategic planning for movie
production and marketing.
Quiz:
4. How can you use a histogram plot to identify the shape of a distribution?
Suggested Reference:
https://www.simplypsychology.org/boxplots.html
https://builtin.com/data-science/boxplot
https://www.investopedia.com/terms/h/histogram.asp
https://statistics.laerd.com/statistical-guides/understanding-histograms.php
Rubrics 1 2 3 4 5 Total
References
Experiment No: 7
For the GDP and Life Expectancy datasets from Kaggle, produce the following plots that describe the GDP and Life Expectancy during the year 2016.
Date:
Relevant CO: 4
Objectives:
2. To understand histogram
For the GDP and Life Expectancy datasets from Kaggle, produce the following plots that describe the GDP and Life Expectancy during the year 2016.
Code:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

gdp_data = pd.read_csv('gdp.csv')
life_expectancy_data = pd.read_csv('life_expectancy.csv')

# assumed: both files have a 'Country' column and a '2016' column; keep only those and merge
merged_data = pd.merge(gdp_data[['Country', '2016']],
                       life_expectancy_data[['Country', '2016']],
                       on='Country', suffixes=('_GDP', '_Life_Expectancy'))
df = merged_data.dropna()
print(df.head())
print('*****************************************************************')
df.info()
Output:
Code:
plt.figure(figsize=(10, 6))
plt.scatter(df['2016_GDP'], df['2016_Life_Expectancy'], alpha=0.6)
plt.xlabel('GDP')
plt.ylabel('Life Expectancy')
plt.grid()
plt.show()
Output:
Code:
plt.figure(figsize=(10, 6))
plt.hist(df['2016_GDP'], bins=20)
plt.xlabel('GDP')
plt.ylabel('Frequency')
plt.grid()
plt.show()
Output:
Code:
plt.figure(figsize=(10, 6))
sns.boxplot(y=df['2016_Life_Expectancy'])
plt.ylabel('Life Expectancy')
plt.grid()
plt.show()
Output:
Concluding Remarks:
Quiz:
1. How can you use a scatter plot to compare two or more datasets?
4. How do you interpret the median, upper quartile, and lower quartile in a box plot?
Suggested Reference:
https://www.simplypsychology.org/boxplots.html
https://builtin.com/data-science/boxplot
https://www.investopedia.com/terms/h/histogram.asp
https://statistics.laerd.com/statistical-guides/understanding-histograms.php
Rubrics 1 2 3 4 5 Total
References
Experiment No: 8
(i) Look at the column names and change names to more meaningful names.
(ii) Make two histograms on one page: summer games (total), winter games (total).
(iii) Make two histograms on one page: total summer, total winter medals won
(iv) Check whether there is a correlation between the number of medals won in winter and summer.
(v) Check whether there is a correlation between the number of games each country competes in during winter and summer.
(vi) Make 6 histograms on one page showing the distribution of each of the types of
medals by seasons.
Date:
Objectives:
Import the Olympics dataset from any public dataset repository and
(i) Look at the column names and change names to more meaningful names.
Code:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('olympics.csv')
df = df.drop(df.index[0])
df = df.drop(df.index[-1])
df = df.reset_index(drop=True)
# assumed mapping: rename the original CSV headers to the more meaningful names used below
df.columns = ['Country', 'Summer_games', 'Summer_Gold', 'Summer_Silver', 'Summer_Bronz', 'Summer_Total',
              'Winter_games', 'Winter_Gold', 'Winter_Silver', 'Winter_Bronz', 'Winter_Total',
              'Total_games', 'Gold', 'Silver', 'Bronz', 'Total']
df['Gold'] = df['Gold'].astype(int)
df['Silver'] = df['Silver'].astype(int)
df['Bronz'] = df['Bronz'].astype(int)
df['Total'] = df['Total'].astype(int)
df['Summer_games'] = df['Summer_games'].astype(int)
df['Winter_games'] = df['Winter_games'].astype(int)
df['Total_games'] = df['Total_games'].astype(int)
df['Summer_Gold'] = df['Summer_Gold'].astype(int)
df['Summer_Silver'] = df['Summer_Silver'].astype(int)
df['Summer_Bronz'] = df['Summer_Bronz'].astype(int)
df['Summer_Total'] = df['Summer_Total'].astype(int)
df['Winter_Gold'] = df['Winter_Gold'].astype(int)
df['Winter_Silver'] = df['Winter_Silver'].astype(int)
df['Winter_Bronz'] = df['Winter_Bronz'].astype(int)
df['Winter_Total'] = df['Winter_Total'].astype(int)
print(df.head())
df.info()
Output:
(ii) Make two histograms on one page: summer games (total), winter games
(total).
Code:
plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
plt.hist(df['Summer_games'], bins=10, color='blue', edgecolor='black')
plt.xlabel('Number of Games')
plt.ylabel('Frequency')
plt.subplot(1, 2, 2)
plt.hist(df['Winter_games'], bins=10, color='green', edgecolor='black')
plt.xlabel('Number of Games')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()
Output:
(iii) Make two histograms on one page: total summer, total winter medals won
Code:
plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
plt.hist(df['Summer_Total'], bins=10, color='blue', edgecolor='black')
plt.xlabel('Total Medals')
plt.ylabel('Frequency')
plt.subplot(1, 2, 2)
plt.hist(df['Winter_Total'], bins=10, color='green', edgecolor='black')
plt.xlabel('Total Medals')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()
Output:
(iv) Check whether there is a correlation between the number of medals won in winter and summer.
Code:
correlation_medals = df['Summer_Total'].corr(df['Winter_Total'])
print(correlation_medals)
Output:
(v) Check whether there is a correlation between the number of games each country competes in during winter and summer.
Code:
correlation_games = df['Summer_games'].corr(df['Winter_games'])
print(correlation_games)
Output:
(vi) Make 6 histograms on one page showing the distribution of each of the
types of medals by seasons.
Code:
plt.figure(figsize=(15, 10))
plt.subplot(2, 3, 1)
plt.hist(df['Summer_Gold'], bins=10)
plt.subplot(2, 3, 2)
plt.hist(df['Summer_Silver'], bins=10)
plt.subplot(2, 3, 3)
plt.hist(df['Summer_Bronz'], bins=10)
plt.subplot(2, 3, 4)
plt.hist(df['Winter_Gold'], bins=10)
plt.subplot(2, 3, 5)
plt.hist(df['Winter_Silver'], bins=10)
plt.subplot(2, 3, 6)
plt.hist(df['Winter_Bronz'], bins=10)
plt.tight_layout()
plt.show()
Output:
Concluding Remarks:
Quiz:
1. What are the advantages and disadvantages of using a histogram plot to visualize data?
2. How do you choose the appropriate bin width for a histogram plot?
Suggested Reference:
https://www.mathsisfun.com/data/correlation.html
https://www.investopedia.com/terms/c/correlation.asp
https://www.cuemath.com/data/histograms/
Rubrics 1 2 3 4 5 Total
References
Experiment No: 9
For the GDP, Life Expectancy and Employment datasets from Kaggle, perform the following tasks.
(i) Merge the columns for the year 2016 for the 3 datasets into a new data frame
(ii) Rename the columns to "country", "gdp", "life_expectancy" and "employment"
(iii) Create a frequency table for each variable
(iv) Draw histograms for each variable
Date:
Relevant CO: 4
Objectives:
Theory:
Frequency Table
A frequency table is a statistical tool for classifying data into groups and showing how often
each group occurs. In other words, it illustrates the recurrence rates of individual values or
sets of values within a dataset.
Choosing the appropriate categories or intervals for your data is the first step in making a
frequency table. Then, you create a table to keep track of how many data points belong to
each category. The frequency with which each category appears in the dataset is displayed
in the table that results.
For example, we can create a frequency table for the weights of 50 school students. Assume the weights vary from 25 kg to 65 kg. Grouping the students by weight range, a frequency table can be built as in the sketch below.
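A minimal pandas sketch of building such a frequency table (the weight values below are made up for illustration):
import pandas as pd

weights = pd.Series([28, 34, 41, 47, 52, 58, 63, 45, 39, 55, 30, 49])   # hypothetical student weights (kg)

bins = [25, 35, 45, 55, 65]                                  # weight ranges 25-35, 35-45, 45-55, 55-65 kg
freq = pd.cut(weights, bins=bins).value_counts().sort_index()
print(freq)                                                  # number of students in each weight range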
Import the GDP, Life Expectancy and Employment datasets from Kaggle and
(i) Merge the columns for the year 2016 for 3 datasets into a new data frame
Code:
import pandas as pd
import matplotlib.pyplot as plt

gdp_data = pd.read_csv('gdp.csv')
life_expectancy_data = pd.read_csv('life_expectancy.csv')
employment_df = pd.read_csv('employment.csv')

# assumed: each file has a 'Country' column and a '2016' column; keep only those and merge on the country
merged_df = pd.merge(gdp_data[['Country', '2016']],
                     life_expectancy_data[['Country', '2016']],
                     on='Country', suffixes=('_gdp', '_life'))
merged_df = pd.merge(merged_df, employment_df[['Country', '2016']], on='Country')
df = merged_df.dropna()
(ii) Rename the columns to "country", "gdp", "life_expectancy" and "employment"
Code:
df.columns = ['country','gdp','life_expectancy','employment']
print(df.head())
Output:
(iii) Create a frequency table for each variable
Code:
gdp_bins = pd.cut(df['gdp'],
                  bins=[0, 100e9, 200e9, 300e9, 400e9, 500e9, 600e9, 700e9, 800e9, 900e9, 1000e9, float('inf')],
                  labels=['0 - 100B', '100B - 200B', '200B - 300B', '300B - 400B', '400B - 500B',
                          '500B - 600B', '600B - 700B', '700B - 800B', '800B - 900B', '900B - 1000B', '1000B+'])
gdp_freq = gdp_bins.value_counts().sort_index()
gdp_frequency_table = pd.DataFrame({
'Label': gdp_freq.index,
'Count': gdp_freq.values
})
# Print the frequency table
print(gdp_frequency_table)
Output:
Code:
life_bins = pd.cut(df['life_expectancy'], bins=[0, 30, 50, 60, 70, 80, 90, 100])
life_freq = life_bins.value_counts().sort_index()
life_frequency_table = pd.DataFrame({
'Label': life_freq.index,
'Count': life_freq.values
})
# Print the frequency table
print(life_frequency_table)
Output:
Code:
employment_bins = pd.cut(
    df['employment'],
    bins=[0, 1e9, 2e9, 3e9, 4e9, 5e9, 6e9, 7e9, 8e9, 9e9, 10e9, float('inf')])
employment_freq = employment_bins.value_counts().sort_index()
frequency_table = pd.DataFrame({
'Label': employment_freq.index,
'Count': employment_freq.values
})
print(frequency_table)
Output:
(iv) Draw histograms for each variable
Code:
# Create histograms
plt.figure(figsize=(12, 8))
# GDP Histogram
plt.subplot(3, 1, 1)
plt.hist(df['gdp'], bins=20)
plt.xlabel('GDP')
plt.ylabel('Frequency')
# Life Expectancy Histogram
plt.subplot(3, 1, 2)
plt.hist(df['life_expectancy'], bins=20)
plt.xlabel('Life Expectancy')
plt.ylabel('Frequency')
# Employment Histogram
plt.subplot(3, 1, 3)
plt.hist(df['employment'], bins=20)
plt.xlabel('Employment')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()
Output:
Concluding Remarks:
Such analyses are vital for uncovering trends and making informed decisions based
on the interconnectedness of data. Overall, this approach underscores the importance
of data integration and visualization in understanding complex societal dynamics.
Quiz:
2. What is a relative frequency table and how is it different from a standard frequency table?
Suggested Reference:
https://statisticsbyjim.com/basics/frequency-table/
https://www.vedantu.com/maths/understanding-frequency-tables
Rubrics 1 2 3 4 5 Total
References
Experiment No: 10
Import any real-life dataset from any public dataset repository. Create a program that allows the user to interactively display a box plot based on the selection of a value of another column.
Date:
Relevant CO: 4
Objectives:
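No code is listed for this experiment; a minimal interactive sketch is given below. It assumes seaborn's sample 'tips' dataset as a stand-in for any real-life dataset and takes the filtering column and value from the user with input():
import matplotlib.pyplot as plt
import seaborn as sns

# load a stand-in dataset; any real-life dataset with numeric and categorical columns works
df = sns.load_dataset('tips')
print("Columns available for selection:", df.select_dtypes(exclude='number').columns.tolist())

column = input("Enter a column to filter on (e.g. 'day'): ")
value = input("Enter a value of that column (e.g. 'Sun'): ")

subset = df[df[column].astype(str) == value]   # keep only the rows matching the selected value
sns.boxplot(data=subset, y='total_bill')       # numeric column assumed for the box plot
plt.title("Box plot of total_bill where " + column + " == " + value)
plt.show()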
Concluding Remarks:
Quiz:
1. What is a data frame and how is it used to organize and analyze data?
2. What are some common methods for cleaning and preprocessing data in a data
frame?
3. How can you use a data frame to filter and subset data?
Suggested Reference:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html
https://pythonbasics.org/pandas-dataframe/
https://realpython.com/pandas-dataframe/
References used by the students:
Rubrics 1 2 3 4 5 Total
References