
Lab Manual

Lab manual

Uploaded by

Rani Shah
A Laboratory Manual for

Statistics and Exploratory Data Analysis
(3154301)

B.E. Semester 5
(Computer Science & Engineering (Data Science))

Institute logo

Directorate of Technical Education, Gandhinagar, Gujarat
Vishwakarma Government Engineering College,
Chandkheda-Ahmedabad

Certificate

This is to certify that Mr./Ms. ___________________________________________

Enrollment No. _______________ of B.E. Semester V Computer Science &
Engineering (Data Science) of this Institute (GTU Code: 046) has
satisfactorily completed the Practical / Tutorial work for the subject
Statistics and Exploratory Data Analysis (3154301) for the academic
year 2022-23.

Place: __________________
Date: __________________

Name and Sign of Faculty member

Head of the Department


Statistics and Exploratory Data (3154301) Enrollment No. 220170146048

Preface

The main purpose of any laboratory, practical, or field work is to enhance the required
skills and to build students' ability to solve real-time problems by developing relevant
competencies in the psychomotor domain. With this in view, GTU has designed a
competency-focused, outcome-based curriculum for engineering degree programs in which
sufficient weightage is given to practical work. This underlines the importance of skill
enhancement among students and encourages students, instructors, and faculty members to
use every second of the time allotted for practicals to achieve relevant outcomes by
performing the experiments rather than merely studying them. For effective implementation
of a competency-focused, outcome-based curriculum, every practical must be carefully
designed to serve as a tool to develop and enhance, in every student, the competencies
required by industry. These psychomotor skills are very difficult to develop through the
traditional chalk-and-board method of content delivery in the classroom. Accordingly, this
lab manual is designed to focus on industry-defined relevant outcomes, rather than the old
practice of conducting practicals to prove a concept or theory.

By using this lab manual, students can go through the relevant theory and procedure
before the actual performance, which creates interest and gives them a basic idea of the
experiment in advance. This in turn helps achieve the pre-determined outcomes. Each
experiment in this manual begins with the competency, industry-relevant skills, course
outcomes, and practical outcomes (objectives). Students will also learn the safety
measures and necessary precautions to be taken while performing the practical.

This manual also provides guidelines to faculty members for facilitating student-centric
lab activities through each experiment by arranging and managing the necessary resources,
so that students follow the procedures with the required safety and necessary precautions
to achieve the outcomes. It also explains how students will be assessed by providing
rubrics.

Statistics and Exploratory Data Analysis is a fundamental course that deals with the
collection, analysis, and visualization of data. It provides a platform for students to
identify the objectives of statistical analysis, apply methods of data collection and
analysis, and use correlation and regression analysis to understand data. Students also
learn to visualize data to support decision making and to apply various hypothesis
testing techniques.

Utmost care has been taken while preparing this lab manual; however, there is always
scope for improvement. We therefore welcome constructive suggestions for improvement and
for the removal of errors, if any.

Practical – Course Outcome matrix

Course Outcomes (COs):


CO-1: Identify objectives of statistical analysis
CO-2: Apply methods of data collection & analysis
CO-3: Use correlation and regression analysis to understand data
CO-4: Visualize the data to gain insights that support the decision-making
process
CO-5: Learn and apply various hypothesis testing techniques
Columns: Sr. No.; Objective(s) of Experiment; CO 1; CO 2; CO 3; CO 4; CO 5
(√ marks each CO mapped to the experiment)

1. Download a dataset from the UCI repository/Kaggle that has data of mixed types
   (e.g. string, float, date). Perform the following operations in Python. √
   (i) Import the data, create the data frame
   (ii) View the data
   (iii) Print the first few and last few records
   (iv) Create a data frame which holds a subset of the original data
   (v) Create a vector which holds the values of one of the columns

2. (i) Create a vector with the following marks: 56, 89, 76, 54, 79, 90, 75, 48. √
   (ii) Create a vector with the following students: Karan, Rakesh, Hiren, Rudra,
   Himesh, Shantanu, Rohan, Sidhdhesh.
   (iii) Create a data frame from the above two vectors.
   (iv) Increase the marks of Rakesh by 7.
   (v) Retrieve the students who secured more than 85% of marks.
   (vi) Sort the student list in descending order of their marks.

3. Import the GDP dataset from Kaggle and compute the difference in GDP between
   2007 and 2017 for each country. Also create a subset of countries that saw an
   increase of over one trillion dollars. √

4. For the GDP dataset, compute the measures of central tendency, mean, median,
   range and quantile for the year 2017. √

5. Import a Winter Olympics dataset from any public dataset repository. √ √
   (i) Sort the data by total medals and country and save it in a new data frame.
   (ii) Check the mean and median of the number of gold, silver, bronze and total
   medals.
   (iii) Find which region won the highest mean total medals.
   (iv) What is the maximum number of medals won? Which country won it?
   (v) Find correlations between total medals and the number of gold and bronze.
   (vi) Find the correlation between rank and total medals.

6. Import a Movies dataset from the Kaggle/UCI public dataset repository. √ √ √
   (i) Make a scatter plot of tickets sold versus gross.
   (ii) Redo the scatter plot, adjusting scales: divide by 1000, 100,000 and
   1,000,000.
   (iii) Find the correlation between tickets sold and sales. Is it expected?
   (iv) Make a scatter plot with a millions scale and also add a regression line,
   with labels on all axes and a title on the plot.
   (v) Make a boxplot.
   (vi) Make individual histograms of type of films, gross sales and ticket sales.

7. For the GDP and Life Expectancy datasets from Kaggle, produce the following
   plots that describe GDP and Life Expectancy during the year 2016. √
   (i) Create a scatter plot of GDP to Life Expectancy
   (ii) Create a histogram of GDP
   (iii) Create a box and whisker plot of Life Expectancy

8. Import an Olympics dataset from any public dataset repository. √ √
   (i) Look at the column names and change them to more meaningful names.
   (ii) Make two histograms on one page: summer games (total), winter games
   (total).
   (iii) Make two histograms on one page: total summer, total winter medals won.
   (iv) Check whether there is a correlation between the number of medals won in
   winter and summer.
   (v) Check whether there is a correlation between the number of games each
   country competes in, in winter and summer.
   (vi) Make 6 histograms on one page showing the distribution of each of the
   types of medals by season.

9. For the GDP, Life Expectancy and Employment datasets from Kaggle, perform the
   following tasks. √
   (i) Merge the columns for the year 2016 from the 3 datasets into a new data
   frame
   (ii) Rename the columns to “country”, “gdp”, “life_expectancy” and
   “employment”
   (iii) Create a frequency table for each variable
   (iv) Draw histograms for each variable

10. Import any real-life dataset from any public dataset repository. Create a
    program that allows the user to interactively display a box plot based on
    the selected value of another column. √

Industry Relevant Skills

The following industry-relevant competencies are expected to be developed in the
student by undertaking the practical work of this laboratory.
1. Basic data science skills for the collection and analysis of statistical data.
2. Graphical and modelling techniques for exploring data, with an
emphasis on visualization, interpretation, and clear communication of findings.
3. Application of hypothesis testing techniques.

Guidelines for Faculty members


1. The teacher should provide guidelines, with a demonstration of the practical and
all its features, to the students.
2. The teacher shall explain the basic concepts/theory related to the experiment to
the students before the start of each practical.
3. Involve all the students in the performance of each experiment.
4. The teacher is expected to share the skills and competencies to be developed in
the students and to ensure that these skills and competencies are developed
after completion of the experimentation.
5. The teacher should give students the opportunity for hands-on experience after
the demonstration.
6. The teacher may provide additional knowledge and skills that are not covered in
the manual but are expected from the students by the concerned industry.
7. Give practical assignments and assess the performance of students on the
assigned task to check whether it follows the instructions.
8. The teacher is expected to refer to the complete curriculum of the course and
follow the guidelines for implementation.

Instructions for Students


1. Students are expected to listen carefully to all the theory classes delivered by the
faculty members and understand the COs, content of the course, teaching and
examination scheme, skill set to be developed, etc.
2. Students shall organize the work in groups and keep a record of all observations.
3. Students shall develop maintenance skills as expected by industry.
4. Students shall attempt to develop related hands-on skills and build confidence.
5. Students shall develop the habit of evolving more ideas, innovations, skills, etc.
beyond those included in the scope of the manual.
6. Students shall refer to technical magazines and data books.
7. Students should develop the habit of submitting the experimentation work as per the
schedule and should be well prepared for the same.

Index
(Progressive Assessment Sheet)
Columns: Sr. No.; Objective(s) of Experiment; Page No.; Date of performance;
Date of submission; Assessment Marks; Sign. of Teacher with date; Remarks
1 Download a dataset from the UCI repository/
Kaggle that has data of mixed types (e.g.
string, float, date). Perform the following
operations in Python.
(i) Import the data, create the data frame
(ii) View the data
(iii) Print first few and last few records
(iv) Create a data frame which holds subset
of original data
(v) Create a vector which holds values of
one of the columns
2 (i) Create a vector having following marks:
56, 89, 76, 54, 79, 90, 75, 48.
(ii) Create a vector having following
students: Karan, Rakesh, Hiren, Rudra,
Himesh, Shantanu, Rohan, Sidhdhesh.
(iii) Create a data frame from above two
vectors.
(iv) Increase the marks of Rakesh by 7.
(v) Retrieve the students who secured
more than 85% of marks.
(vi) Sort the student list by descending
order of their marks.
3 Import the GDP dataset from kaggle and
compute the difference in GDP between
2007 and 2017 for each country. Also
create a subset of countries that saw an
increase of over one trillion dollars.
4 For the GDP dataset, compute the measures
of central tendency, mean, median, range
and quantile for year 2017.
5 Import Winter Olympics dataset from any
of the public dataset repository.
(i) Sort data by total medals and country
and save it in new data frame.
(ii) Check mean and median of number of
gold, silver, bronze and total medals.
(iii) Find which region won the highest
mean total medals.
(iv) What is the maximum number of
medals won? Which country won it?
(v) Find correlations between total medals
and number of gold and bronze.
(vi) Find correlation between rank and
total medals
6 Import Movies dataset from kaggle/uci
public dataset repository.
(i) Make scatter plot of tickets sold versus
gross.
(ii) Redo the scatter plot, adjusting scales,
divide by 1000, 100,000 and 1,000,000
(iii) Find the correlation between tickets
sold and sales? Is it expected?
(iv) Make scatter plot with millions scale
and also add a regression line with label to
all axes and title to plot.
(v) Make boxplot
(vi) Make individual histogram of type of
films, gross sales and ticket sales.
7 For the GDP and Life Expectancy datasets
from kaggle, produce following plots that
describe the GDP and Life Expectancy
during the year 2016.
(i) Create a scatter plot of GDP to Life
Expectancy
(ii) Create a histogram of GDP
(iii) Create a box and whisker plot of Life
Expectancy
8 Import Olympics dataset from any of the
public dataset repository. (i) Look at the
column names and change names to more
meaningful names.
(ii) Make two histograms on one page:
summer games (total), winter games
(total).
(iii) Make two histograms on one page:
total summer, total winter medals won
(iv) Check whether there is a correlation
between the number of medals won in winter
and summer.
(v) Check whether there is a correlation between
the number of games each country competes in,
in winter and summer.
(vi) Make 6 histograms on one page
showing the distribution of each of the
types of medals by seasons.
9 For the GDP, Life Expectancy and
Employment datasets from kaggle, perform
following tasks.
(i) Merge the columns for the year 2016 for
3 datasets into a new data frame (ii)
Rename the columns to “country”, “gdp”,
“life_expectancy” and “employment”
(iii) Create a frequency table for each
variable
(iv) Draw histograms for each variable
10 Import any real life data set from any
public data set repository. Create program
that allows the user to interactively display
a box plot based on selection of value of
other column.
Total
Experiment No: 1

Download a dataset from the UCI repository/Kaggle that has data of mixed types (e.g.
string, float, date). Perform the following operations in Python.

(i) Import the data, create the data frame

(ii) View the data

(iii) Print first few and last few records

(iv) Create a data frame which holds subset of original data

(v) Create a vector which holds values of one of the columns

Date:

Competency and Practical Skills:

Relevant CO:1

Objectives:

1. To understand how to load the data from dataset

2. To learn how to extract subset of data frame

3. To learn to create series from dataframe

Theory:

Data frame:

A DataFrame is a two-dimensional tabular data structure, similar to an Excel spreadsheet
or a SQL table, in which data is organized into rows and columns. It is a fundamental data
structure in data analysis tools such as the pandas library for Python and the R language.

Each column in a pandas DataFrame can be given a distinct name, and different columns
can hold data of different types, including strings, integers, floats, and booleans, among
others. In addition, each row in the DataFrame can be assigned an index label that can be
used to identify it.

DataFrames can be derived from many different sources, such as CSV files, Excel
spreadsheets, SQL databases, and Python dictionaries, amongst others. For the purpose of
performing data analysis and manipulation operations, they can be sliced, filtered, sorted,
merged, joined, and aggregated in a variety of different ways.
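
As a minimal sketch of the ideas above, the following builds a small DataFrame from a Python dictionary, inspects the column types, and then filters and sorts it. The column names and values are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical records; names and values are illustrative only.
records = {
    "name": ["Asha", "Ravi", "Meera"],
    "age": [34, 58, 47],
    "bill": [1250.50, 980.00, 1430.75],
    "admitted": pd.to_datetime(["2023-01-05", "2023-02-11", "2023-03-20"]),
}
df = pd.DataFrame(records)

# Each column keeps its own data type (string/object, integer, float, datetime).
print(df.dtypes)

# Filter rows on a condition, then sort the result by a column.
older = df[df["age"] > 40].sort_values("bill", ascending=False)
print(older)
```

The same pattern applies when the DataFrame comes from a CSV file or a database query instead of a dictionary.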

Sequence:

A sequence is an ordered collection of elements that can be indexed and sliced.
Python's three most commonly used built-in sequence types are:

• Lists: A list is a mutable sequence that can contain items of different types,
including other lists. Lists are created with square brackets [].

• Tuples: A tuple is an immutable sequence that can contain items of different types,
including other tuples. Tuples are created with parentheses ().

• Strings: A string is an immutable sequence of characters. Strings are created with
either single quotes '' or double quotes "".

Indexing and slicing syntax works on all three kinds of sequences. Indexing accesses a
single element by its position. Slicing accesses a subsequence defined by a start index
and an end index; the element at the end index is not included.
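
The short sketch below illustrates indexing, slicing, and mutability on the three sequence types. The variable names and values are made up for the example.

```python
marks = [56, 89, 76, 54, 79]   # list: mutable
point = (3, 4)                 # tuple: immutable
word = "pandas"                # string: immutable

first_mark = marks[0]    # indexing starts at 0
last_mark = marks[-1]    # negative indices count from the end
middle = marks[1:4]      # slice: positions 1, 2 and 3 (end index excluded)
prefix = word[:3]        # slicing works the same way on strings

marks[0] = 60            # allowed, because lists are mutable

try:
    point[0] = 9         # not allowed, because tuples are immutable
except TypeError as exc:
    error_message = str(exc)

print(first_mark, last_mark, middle, prefix, marks)
```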

(i) Import the data, create the data frame

Code:

import pandas as pd

data = pd.read_csv('healthcare_dataset.csv')

Output:

(ii) View the data

Code:

print(data)

Output:
(iii) Print first few and last few records

Code:

print(data.head())

print(data.tail())

Output:

(iv) Create a data frame which holds subset of original data

Code:

# Keep only the rows with Age above 50 and only the columns of interest.
subset_data = data.loc[data['Age'] > 50, ['Name', 'Age', 'Discharge Date', 'Gender']]

print(subset_data)

Output:
(v) Create a vector which holds values of one of the columns

Code:

column_vector = subset_data['Age']

print(column_vector)

Output:
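
The five steps above can also be tried without downloading a file. The sketch below reads a tiny in-memory CSV instead of 'healthcare_dataset.csv'; the rows are invented for illustration, and the column names are an assumption modelled on the ones used above.

```python
import io
import pandas as pd

# Invented sample rows; not taken from the real dataset.
csv_text = """Name,Age,Gender,Billing Amount,Discharge Date
A. Patel,62,Female,1820.75,2023-02-14
B. Shah,45,Male,940.10,2023-03-02
C. Mehta,71,Male,2510.00,2023-03-19
"""

# (i) Import the data; parse_dates turns the date column into datetime values.
data = pd.read_csv(io.StringIO(csv_text), parse_dates=["Discharge Date"])

# (ii), (iii) View the data and its first and last records.
print(data.dtypes)
print(data.head(2))
print(data.tail(2))

# (iv) Subset: rows with Age above 50, selected columns only.
subset = data.loc[data["Age"] > 50, ["Name", "Age", "Discharge Date"]]

# (v) A single column is returned as a pandas Series (a one-dimensional vector).
ages = subset["Age"]
print(ages)
```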

Concluding Remarks:
This exploration of mixed data types demonstrates how to work with real-world
datasets and prepare them for further analysis and machine learning models.

Quiz:

1. What are some common methods for manipulating and analyzing data in a DataFrame?

2. How can you select specific rows or columns from a DataFrame?

3. What is the difference between a DataFrame and a Series in pandas?

4. What is the difference between a list and a tuple in Python?

Suggested Reference:

• https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html

• https://pythonbasics.org/pandas-dataframe/

• https://realpython.com/pandas-dataframe/

References used by the students:

Rubric wise marks obtained:

Rubrics   Marks criteria
1         Complete implementation as asked
2         Complete implementation as asked; Correct Result
3         Complete implementation as asked; Correct Result; Conclusions
4         Complete implementation as asked; Correct Result; Conclusions;
          Correct answer to all questions
5         Complete implementation as asked; Correct Result; Conclusions;
          Correct answer to all questions; References
Total
Experiment No: 2

(i) Create a vector with the following marks: 56, 89, 76, 54, 79, 90, 75, 48.

(ii) Create a vector with the following students: Karan, Rakesh, Hiren, Rudra, Himesh,
Shantanu, Rohan, Sidhdhesh.

(iii) Create a data frame from the above two vectors.

(iv) Increase the marks of Rakesh by 7.

(v) Retrieve the students who secured more than 85% of marks.

(vi) Sort the student list in descending order of their marks.

Date:

Competency and Practical Skills:

Relevant CO: 1

Objectives:

1. To learn to modify data frame

2. To understand how to sort the data

Theory:

Sorting:

The process of putting data in a certain order based on one or more columns in a dataset is
referred to as "sorting" the data. In the fields of data analysis and computer science, one of
the most popular types of data manipulation is sorting.

Various sorting methods exist, as listed below. Each has its own pros and cons.

• Bubble sort
• Selection sort
• Insertion sort
• Merge sort
• Quick sort
• Heap sort
• Counting sort
• Bucket sort
• Radix sort

This is not an exhaustive list of sorting techniques, but these are some of the most
popular methods. Bubble sort is discussed here to give an idea of how sorting can be
done.

Bubble sort:

Bubble sort is a simple sorting method that repeatedly passes through a list of
elements, compares neighbouring elements, and swaps them if they are in the wrong order.
The method derives its name from the way the largest remaining element "bubbles" to the
end of the list in each pass.

Consider the data A = [55, 33, 22, 11, 44]. In each iteration, data movement happens as
follows:

Iteration 1:
[55, 33, 22, 11, 44]
[33, 55, 22, 11, 44]
[33, 22, 55, 11, 44]
[33, 22, 11, 55, 44]
[33, 22, 11, 44, 55]

Iteration 2:
[33, 22, 11, 44, 55]
[22, 33, 11, 44, 55]
[22, 11, 33, 44, 55]
[22, 11, 33, 44, 55]

Iteration 3:
[22, 11, 33, 44, 55]
[11, 22, 33, 44, 55]
[11, 22, 33, 44, 55]

Iteration 4:
[11, 22, 33, 44, 55]
[11, 22, 33, 44, 55]

Iteration 5:
[11, 22, 33, 44, 55]
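
The passes traced above can be sketched in Python as follows. This is one common way to write bubble sort, not the only one; the function name bubble_sort is chosen here for illustration. It adds an early exit, so it stops as soon as a pass makes no swaps instead of always running every iteration shown in the trace.

```python
def bubble_sort(values):
    """Return a new list with the values sorted in ascending order."""
    items = list(values)          # copy, so the input is left unchanged
    n = len(items)
    # After each pass the largest remaining element sits at position `end`.
    for end in range(n - 1, 0, -1):
        swapped = False
        for i in range(end):
            if items[i] > items[i + 1]:
                # Neighbours are in the wrong order: swap them.
                items[i], items[i + 1] = items[i + 1], items[i]
                swapped = True
        if not swapped:
            break                 # no swaps in this pass: already sorted
    return items


sorted_a = bubble_sort([55, 33, 22, 11, 44])
print(sorted_a)
```

In practice, pandas users rarely write a sort by hand; DataFrame.sort_values performs the sorting, as used later in this experiment.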

(i) Create a vector having following marks: 56, 89, 76, 54, 79, 90, 75, 48.

Code:

marks = [56, 89, 76, 54, 79, 90, 75, 48]

Output:

(ii) Create a vector having following students: Karan, Rakesh, Hiren, Rudra,
Himesh, Shantanu, Rohan, Sidhdhesh.

Code:

students = ['Karan', 'Rakesh', 'Hiren', 'Rudra', 'Himesh', 'Shantanu', 'Rohan', 'Sidhdhesh']

Output:

(iii) Create a data frame from above two vectors

Code:

import pandas as pd

df = pd.DataFrame({'Student': students, 'Marks': marks})

print(df)
Output:

(iv) Increase the marks of Rakesh by 7.

Code:

df.loc[df['Student'] == 'Rakesh', 'Marks'] += 7

print(df)

Output:

(v) Retrieve the students who secured more than 85% of marks.

Code:

top_students = df[df['Marks'] > 85]

print(top_students)

Output:
(vi) Sort the student list by descending order of their marks.

Code:

sorted_df = df.sort_values(by='Marks', ascending=False)

print(sorted_df)

Output:

Concluding Remarks:

These steps highlight essential data operations: creation, updating, filtering, and
sorting.

Quiz:

1. How do you merge two data frames with different column names in pandas?

2. How do you sort data in a data frame in pandas?

3. How do you group data by a specific column in a data frame in pandas?

4. How do you rename columns in a data frame in pandas?

Suggested Reference:

• https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html

• https://pythonbasics.org/pandas-dataframe/

• https://realpython.com/pandas-dataframe/

References used by the students:

Rubric wise marks obtained:

Rubrics   Marks criteria
1         Complete implementation as asked
2         Complete implementation as asked; Correct Result
3         Complete implementation as asked; Correct Result; Conclusions
4         Complete implementation as asked; Correct Result; Conclusions;
          Correct answer to all questions
5         Complete implementation as asked; Correct Result; Conclusions;
          Correct answer to all questions; References
Total
Experiment No: 3

Import the GDP dataset from kaggle and compute the difference in GDP between 2007
and 2017 for each country. Also create a subset of countries that saw an increase of
over one trillion dollars.

Date:

Competency and Practical Skills:

Relevant CO: 1

Objectives:

1. To understand the difference operation on data frame columns

2. To study slicing/subsetting of a data frame

Theory:

Subsetting a data frame means selecting a subset of rows and/or columns from a larger
data frame. This is helpful for analyzing a particular section of the data or for
producing a more compact data frame that contains only the essential columns.

The following are some of the more common ways to subset a data frame in Python using
pandas:

1. Selecting particular columns: Pass a list of column names to the data frame
indexing operator (for example, df[['col1', 'col2']] selects only those two columns).
This returns a new data frame that contains only the selected columns.

2. Filtering rows based on a condition: Use a Boolean expression to build a mask and
pass the mask to the data frame indexing operator (for example, df[df['col1'] > 50]).
This returns a new data frame that contains only the rows where the condition is met.

3. Slicing rows by position: Use the data frame slicing operator (for example,
df[1:5]) to select rows by their position. This returns a new data frame that
contains the rows from the start position up to, but not including, the end position.

4. Selecting rows and columns concurrently: Use the '.loc[]' or '.iloc[]' indexer to
pick rows and columns simultaneously. The '.loc[]' indexer is label-based and takes
row and column labels as arguments. In contrast, the '.iloc[]' indexer is
integer-based and accepts row and column positions.

5. Dropping columns: Use the drop() method with the column names to drop (for
example, df.drop(['col1', 'col2'], axis=1)). This returns a new data frame that does
not contain the dropped columns.
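
The subsetting methods listed above can be sketched on a small, made-up data frame. The column names col1, col2 and col3 follow the examples in the text; the values are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": [10, 60, 55, 20, 90],
    "col2": ["a", "b", "c", "d", "e"],
    "col3": [1.5, 2.5, 3.5, 4.5, 5.5],
})

cols = df[["col1", "col2"]]            # 1. select particular columns
filtered = df[df["col1"] > 50]         # 2. filter rows with a Boolean mask
sliced = df[1:3]                       # 3. rows at positions 1 and 2 (end excluded)

# 4. Rows and columns together.
by_label = df.loc[1:3, ["col1", "col3"]]   # label slice: the end label 3 IS included
by_pos = df.iloc[1:3, [0, 2]]              # position slice: the end position is excluded

dropped = df.drop(["col1", "col2"], axis=1)  # 5. drop columns

print(filtered)
print(by_label)
print(by_pos)
```

Note the difference in step 4: with the default integer index, df.loc[1:3] returns three rows while df.iloc[1:3] returns two.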

i. Import the GDP dataset from kaggle

Code:

import pandas as pd

df = pd.read_csv('GDP_data_P3.csv')

print(df.head())

Output:

ii. compute the difference in GDP between 2007 and 2017 for each country

Code:

df['GDP_Difference'] = df['2017'] - df['2007']

print(df[['Country','GDP_Difference']].head())

Output:
iii. create a subset of countries that saw an increase of over one trillion dollars.

Code:

#One trillion written in exponential form is (10^12)

trillion_increase_df = df[df['GDP_Difference'] > 1e12]

trillion_increase_df=trillion_increase_df[['Country', 'GDP_Difference']]

print(trillion_increase_df.head())

Output:

Concluding Remarks:

This demonstrates the ability to perform time-based computations on economic data
and highlights key countries with significant economic growth. Such analysis is
crucial for identifying global economic trends over a decade.

Quiz:

1. How do you use .loc[] to select specific rows and columns in a data frame?

2. How do you use .iloc[] to select specific rows and columns in a data frame?

3. How do you use .loc[] to add new rows and columns to a data frame?

4. How do you use .iloc[] to select rows based on a condition?


Suggested Reference:

• https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html

• https://www.digitalocean.com/community/tutorials/create-subset-of-python-dataframe

• https://www.askpython.com/python/examples/subset-a-dataframe

References used by the students: (Sufficient space to be provided)

Rubric wise marks obtained:

Rubrics   Marks criteria
1         Complete implementation as asked
2         Complete implementation as asked; Correct Result
3         Complete implementation as asked; Correct Result; Conclusions
4         Complete implementation as asked; Correct Result; Conclusions;
          Correct answer to all questions
5         Complete implementation as asked; Correct Result; Conclusions;
          Correct answer to all questions; References
Total
Experiment No: 4

For the GDP dataset, compute the measures of central tendency, mean, median, range
and quantile for year 2017.

Date:

Competency and Practical Skills:

Relevant CO: 2

Objectives:

1. To understand central tendency of data

2. To learn about range and quartile

Theory:

Mean:

We can compute the mean for only numeric data. The average value of a group of numbers is
referred to as the mean of those numbers. A data set's mean can be calculated by first adding
up all of the values in the set and then dividing that sum by the total number of values.

Let X = <x1, x2, …, xn> be a vector of n numbers. The mean is given by:

\[ \mu = \frac{1}{n} \sum_{i=1}^{n} x_i \]

If X = <22, 44, 33, 11, 55>,

\[ \mu = \frac{22 + 44 + 33 + 11 + 55}{5} = \frac{165}{5} = 33 \]

Median:

When a set of data is sorted in order from lowest to highest (or highest to lowest), the value
that falls in the middle of the set is referred to as the median. When there are an even
number of values, the median is determined by taking the average of the two values that are
in the middle of the set.

To compute the median of data set X = <22, 44, 33, 11, 55>, we shall first arrange it in
ascending or descending order:
X = <11, 22, 33, 44, 55>

There are 5 elements in X, so the median is the element in the third position (index 2
in Python's zero-based indexing), which is 33.

For Y = <11, 22, 33, 44, 55, 66>

There are 6 elements in Y, so the median of this data set is (33 + 44) / 2 =
38.5
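
The worked examples for mean and median can be checked with Python's standard statistics module. This is a small sketch that reuses the vectors X and Y from the text.

```python
from statistics import mean, median

x = [22, 44, 33, 11, 55]
y = [11, 22, 33, 44, 55, 66]

mean_x = mean(x)        # (22 + 44 + 33 + 11 + 55) / 5
median_x = median(x)    # middle value of the sorted data (odd count)
median_y = median(y)    # average of the two middle values (even count)

print(mean_x, median_x, median_y)
```

The median function sorts the data internally, so the input does not have to be sorted first.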

Mode:

The mode of a data set is the value that occurs most frequently. If two or more values
share the highest frequency, the data set has more than one mode.

For X = <11, 33, 66, 55, 22, 11, 66, 44, 11, 33, 55, 11>, element 11 appears the maximum
number of times (4 times), so 11 is the mode of this dataset.

For Y = <33, 66, 55, 22, 11, 33, 44, 11, 33, 55, 11>, elements 11 and 33 appear the
maximum number of times (3 times each), so 11 and 33 are the modes of this dataset.
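
As a quick check of the two mode examples, the sketch below uses statistics.multimode (available from Python 3.8), which returns every value that shares the highest frequency.

```python
from statistics import multimode

x = [11, 33, 66, 55, 22, 11, 66, 44, 11, 33, 55, 11]
y = [33, 66, 55, 22, 11, 33, 44, 11, 33, 55, 11]

modes_x = multimode(x)   # a single mode
modes_y = multimode(y)   # two modes, listed in order of first appearance

print(modes_x, modes_y)
```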

Range and quartile:

In statistics, range and quartiles are two measures of variability that describe the
spread of a dataset. The range of a dataset is the difference between its largest value
and its smallest value. Although it is simple to calculate and understand, it is
sensitive to outliers and does not give a complete picture of the variability in the
dataset.

Quartiles, on the other hand, split a sorted dataset into four equal parts, each
containing 25 percent of the data. The quartiles are defined as follows:

• The first quartile (Q1), also known as the 25th percentile, is the value below
which 25% of the data falls.

• The second quartile (Q2) is the middle value of the data, also known as the median.

• The third quartile (Q3), also known as the 75th percentile, is the value below
which 75% of the data falls.

For X = <11, 33, 66, 55, 22, 11, 66, 44, 11, 33, 55, 11>, minimum value is 11 and maximum is
66, so the range of X is 66 – 11 = 55

Let us sort X, so X = <11, 11, 11, 11, 22, 33, 33, 44, 55, 55, 66, 66>. There are 12 elements in
the data set.
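
Continuing with the same X, the range and quartiles can be computed with pandas as sketched below. The sketch assumes the default behaviour of Series.quantile, which interpolates linearly between neighbouring sorted values; other quartile conventions can give slightly different results on other data.

```python
import pandas as pd

x = pd.Series([11, 33, 66, 55, 22, 11, 66, 44, 11, 33, 55, 11])

value_range = x.max() - x.min()              # 66 - 11
quartiles = x.quantile([0.25, 0.5, 0.75])    # Q1, Q2 (median), Q3

print("Range:", value_range)
print(quartiles)
```

For this data set, Q1 = 11, Q2 = 33 and Q3 = 55, because the neighbouring sorted values at each quartile position are equal.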
Code:

import pandas as pd

df = pd.read_csv('GDP_data_P3.csv')

gdp_2017 = df[['Country','2017']]

print(gdp_2017.head())

mean = gdp_2017['2017'].mean()
median = gdp_2017['2017'].median()
# Named gdp_range so that Python's built-in range() is not shadowed.
gdp_range = gdp_2017['2017'].max() - gdp_2017['2017'].min()
quantiles = gdp_2017['2017'].quantile([0.25, 0.5, 0.75])

# Display the statistics
print()
print("Mean GDP in 2017 : ", mean)
print("Median GDP in 2017 : ", median)
print("Range of GDP in 2017 : ", gdp_range)
print("Quantiles of GDP in 2017 (25%, 50%, 75%):")
print(quantiles)

Output:
Concluding Remarks:

These statistical measures are essential for understanding the overall distribution
and spread of GDP values in 2017.

Quiz:

1. What is the difference between mean and median?

2. What is the mean, median and mode of the following set of data: 4, 7, 9, 9, 11, 11, 11,
13?

3. Which measure of central tendency is preferred when the data set has extreme
values?

4. If a set of data has two modes, what is it called?

Suggested Reference:

• https://codecrucks.com/mean-median-mode-variance-discovering-statistical-properties-of-data

• https://www.techtarget.com/searchdatacenter/definition/statistical-mean-median-mode-and-range

• https://www.statisticshowto.com/probability-and-statistics/statistics-definitions/mean-median-mode/

• https://www.twinkl.co.in/teaching-wiki/mean-median-mode-and-range

References used by the students:

Rubric wise marks obtained:

Rubrics 1 2 3 4 5 Total

Marks Complete Complete Complete Complete Complete


implementation implementation implementation implementation implementation
as asked as asked as asked as asked as asked

Correct Result Correct Result Correct Result Correct Result

Conclusions Conclusions Conclusions

Correct answer Correct answer


to all questions to all questions

References
Experiment No: 5

Import Winter Olympics dataset from any of the public dataset repository.

(i) Sort data by total medals and country and save it in new data frame.

(ii) Check mean and median of number of gold, silver, bronze and total medals.

(iii) Find that which region won the highest mean total medals?

(iv) What is the maximum number of medals won? Which country won it?

(v) Find correlations between total medals and number of gold and bronze.

(vi) Find correlation between rank and total medals

Date:

Competency and Practical Skills:

Relevant CO: 2, 3

Objectives:

1. To understand the concept of correlation

Theory:

Correlation

Correlation is a statistical measure that quantifies the strength and direction of the
relationship between two variables.

Correlations are either positive or negative. A positive correlation indicates that as one
variable increases, the other tends to increase as well. A negative correlation indicates that
as one variable increases, the other tends to decrease.

The correlation between two vectors X and Y is computed as follows:

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

Where,
• $r$ is the correlation coefficient
• $x_i$ is the i-th data point of vector X
• $y_i$ is the i-th data point of vector Y
• $\bar{x}$ is the mean of vector X
• $\bar{y}$ is the mean of vector Y
• $n$ is the number of data points in each vector

The correlation coefficient ranges from -1 to +1. A value of -1 indicates a perfect negative
correlation, +1 indicates a perfect positive correlation, and 0 indicates no linear
association.

Scatter of data and its correlation r

Setting up Plot:

Consider X = <1, 2, 3, 4, 5> and Y = <12, 13, 11, 14, 17>. Create a scatter plot and draw the
correlation (best-fit) line through it. Compute the slope and Y-intercept to draw the line.
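For these two vectors, the correlation coefficient, slope and Y-intercept can be computed by hand from the sums of deviations. The sketch below assumes the line is the ordinary least-squares fit of Y on X; the names s_xy, s_xx and s_yy are introduced here for illustration only.

```python
from math import sqrt

X = [1, 2, 3, 4, 5]
Y = [12, 13, 11, 14, 17]

n = len(X)
x_bar = sum(X) / n   # mean of X = 3.0
y_bar = sum(Y) / n   # mean of Y = 13.4

# Sums of products and squares of the deviations from the means
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))
s_xx = sum((x - x_bar) ** 2 for x in X)
s_yy = sum((y - y_bar) ** 2 for y in Y)

r = s_xy / sqrt(s_xx * s_yy)        # correlation coefficient
slope = s_xy / s_xx                 # slope of the least-squares line
intercept = y_bar - slope * x_bar   # Y-intercept of the least-squares line

print(f"r = {r:.4f}")
print(f"Line: y = {slope:.2f} * x + {intercept:.2f}")
```

Here the sums are s_xy = 11, s_xx = 10 and s_yy = 21.2, so the line is y = 1.1x + 10.1 and r is about 0.76, a fairly strong positive correlation. The line can then be drawn over the scatter plot by plotting slope * x + intercept for each x.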
Import Winter Olympics dataset from any of the public dataset repository.

Code:

import pandas as pd

df = pd.read_csv('Tokyo 2021 dataset.csv')

print(df.head())

Output:

(i) Sort data by total medals and country and save it in new data frame.

Code:

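# Sort by total medals (highest first), breaking ties alphabetically by country
sorted_df = df.sort_values(by=['Total', 'Team/NOC'], ascending=[False, True])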


print(sorted_df.head(10))

Output:

(ii) Check mean and median of number of gold, silver, bronze and total medals.

Code:

mean_gold = df['Gold Medal'].mean()

mean_silver = df['Silver Medal'].mean()

mean_bronze = df['Bronze Medal'].mean()

mean_total = df['Total'].mean()

median_gold = df['Gold Medal'].median()

median_silver = df['Silver Medal'].median()

median_bronze = df['Bronze Medal'].median()

median_total = df['Total'].median()

print(f"Mean Gold : {mean_gold}, Median Gold : {median_gold}")

print(f"Mean Silver : {mean_silver}, Median Silver : {median_silver}")

print(f"Mean Bronze : {mean_bronze}, Median Bronze : {median_bronze}")

print(f"Mean Total Medals: {mean_total}, Median Total Medals: {median_total}")

Output:
(iii) Find that which region won the highest mean total medals?

Code:

highest_mean_region = df.groupby('NOCCode')['Total'].mean().idxmax()

print(f"Region with the highest mean total medals: {highest_mean_region}")

Output:

(iv) What is the maximum number of medals won? Which country won it?

Code:

max_medals = df['Total'].max()

max_medal_country = df[df['Total'] == max_medals]['Team/NOC'].values[0]

print(f"Maximum Medals: {max_medals}, Country: {max_medal_country}")

Output:

(v) Find correlations between total medals and number of gold and bronze.

Code:

correlation_gold_total = df['Gold Medal'].corr(df['Total'])

correlation_bronze_total = df['Bronze Medal'].corr(df['Total'])


print(f"Correlation between Gold and Total Medals : {correlation_gold_total}")

print(f"Correlation between Bronze and Total Medals: {correlation_bronze_total}")

Output:

(vi) Find correlation between rank and total medals

Code:

correlation_rank_total = df['Rank'].corr(df['Total'])

print(f"Correlation between Rank and Total Medals: {correlation_rank_total}")

correlation_rankbytotal_total = df['Rank by Total'].corr(df['Total'])

print(f"Correlation between Rank by total and Total Medals: {correlation_rankbytotal_total}")

Output:

Concluding Remarks:

These summary statistics and correlations facilitate comparisons across countries and
regions, offering a comprehensive overview of medal achievements in the Olympics dataset.

Quiz:

1. What is correlation and why is it important in statistics?

2. How is correlation calculated and what does a correlation coefficient tell us?
3. What are some real-world examples of positive and negative correlations?

4. How do outliers affect correlation and what can be done to address this issue?

Suggested Reference:

• https://en.wikipedia.org/wiki/Correlation

• https://www.mathsisfun.com/data/correlation.html

• https://conjointly.com/kb/correlation-statistic/

References used by the students:

Rubric wise marks obtained:

Rubrics 1 2 3 4 5 Total

Marks Complete Complete Complete Complete Complete


implementation implementation implementation implementation implementation
as asked as asked as asked as asked as asked

Correct Result Correct Result Correct Result Correct Result

Conclusions Conclusions Conclusions

Correct answer Correct answer


to all questions to all questions

References
Experiment No: 6

Import Movies dataset from kaggle/uci public dataset repository.

(i) Make scatter plot of tickets sold versus gross.

(ii) Redo the scatter plot, adjusting scales, divide by 1000, 100,000 and 1,000,000

(iii) Find the correlation between tickets sold and sales? Is it expected?

(iv) Make scatter plot with millions scale and also add a regression line with label to
all axes and title to plot.

(v) Make boxplot

(vi) Make individual histogram of type of films, gross sales and ticket sales.

Date:

Competency and Practical Skills:

Relevant CO: 2, 3, 4

Objectives:

1. To understand scatter plot

2. To understand box plot

3. To understand histogram

Theory:

Scatter plot

A scatter plot is a graphical representation of the relationship between two quantitative
variables. It shows how the values of one variable vary with the values of the other.

Each data point is drawn as a single point on the plot: the value of one variable gives its
x-coordinate (horizontal axis) and the value of the other variable gives its y-coordinate
(vertical axis).

Scatter plots help reveal patterns or links between two variables. They show whether the
data exhibit a positive or negative correlation, whether the relationship is linear or
non-linear, and whether the data contain any outliers or clusters.

Consider the following two-dimensional data table.

Table 1: dataset

X Data Y Data
1 12
2 8
3 7
4 11
5 9
6 13
7 10
8 14

The scatter plot for this data is shown here:

Scatter plot for data in Table 1

Box Plot:

A box plot, also known as a box-and-whisker plot, is a graphical depiction of a data set
that shows its distribution and reveals potential outliers.

In a box plot, the "box" represents the middle 50% of the data, and the "whiskers" extend
from the box to the minimum and maximum values of the data, excluding any outliers. The
box is bounded by the lower quartile (Q1), which is the median of the lower half of the
data, and the upper quartile (Q3), which is the median of the upper half of the data. The
line inside the box marks the median of the data.

Box plots are useful for visualizing changes in data over time and for comparing how the
distribution of data varies between groups. They are especially helpful with large data
sets that may contain many outliers, because they provide a quick and straightforward way
to identify them.

Consider following data:

Table 2: dataset

Series1 Series2 Series3


Class 1 -7 -3 -24
Class 1 -10 1 11
Class 1 -28 -6 34
Class 1 20 10 -19
Class 2 -35 6 50
Class 2 47 31 91
Class 2 -24 3 -8
Class 3 -18 16 48
Class 3 19 10 23
Class 3 -26 23 23
Class 3 -20 16 -18
Box plot for given data is shown here:

Box plot for data in Table 2
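The numbers behind one box of such a plot can be computed directly. The sketch below uses the Series1 column of Table 2 and the "median of each half" rule for Q1 and Q3. The 1.5 × IQR outlier rule is a common convention assumed here; it is not stated in this manual.

```python
from statistics import median

# Series1 column of Table 2 (all classes together)
series1 = [-7, -10, -28, 20, -35, 47, -24, -18, 19, -26, -20]

xs = sorted(series1)
n = len(xs)
half = n // 2

q1 = median(xs[:half])       # median of the lower half
q2 = median(xs)              # overall median
q3 = median(xs[n - half:])   # median of the upper half
iqr = q3 - q1

# Assumed convention: points more than 1.5 * IQR beyond the box are outliers.
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in xs if x < low_fence or x > high_fence]
inside = [x for x in xs if low_fence <= x <= high_fence]
whisker_low, whisker_high = min(inside), max(inside)

print("Q1, median, Q3:", q1, q2, q3)
print("Whiskers      :", whisker_low, whisker_high)
print("Outliers      :", outliers)
```

For Series1 this gives Q1 = -26, median = -18 and Q3 = 19, with whiskers at -35 and 47 and no outliers. Plotting libraries usually compute quartiles by interpolation, so the box edges they draw may differ slightly from these values.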

Histogram:

A histogram is a graphical representation of the distribution of a set of numerical data. It
shows how frequently each value or range of values appears in the data.

The data are divided into intervals, or "bins", along the horizontal axis, and the frequency
(count) of data points falling into each bin is represented by the height of a bar along the
vertical axis. The bars are usually drawn next to one another with no gaps between them,
and the width of each bar corresponds to the width of its bin.

Histograms can be used to determine the shape of the distribution, such as whether it is
skewed or symmetric, unimodal or bimodal. They also show the amount of variation in the
data and can reveal gaps or outliers.

Consider following data:

Table 3: dataset

1 3 3 6 6 7 8 9 9 9 9 10 10 10 10 11 11 12 12 17 18 18

Histogram for above data is shown here:

Histogram for data in Table 3
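The bar heights of a histogram are simply the counts of values falling in each bin. As a minimal sketch, the Table 3 data can be binned by hand; the bin width of 5 is an assumption made for illustration, since the manual does not state the width used in the figure.

```python
from collections import Counter

data = [1, 3, 3, 6, 6, 7, 8, 9, 9, 9, 9, 10, 10, 10, 10,
        11, 11, 12, 12, 17, 18, 18]
bin_width = 5  # assumed bin width, not specified in the manual

# Map each value to the lower edge of its bin, e.g. 7 -> 5 and 12 -> 10.
counts = Counter((value // bin_width) * bin_width for value in data)

# Print one text "bar" per bin; each bin is [lower, upper).
for lower in range(0, max(data) + 1, bin_width):
    upper = lower + bin_width
    print(f"[{lower:2d}, {upper:2d}): {'#' * counts[lower]} ({counts[lower]})")
```

With this bin width the counts are 3, 8, 8 and 3, which suggests a roughly symmetric, unimodal distribution. A different bin width would change the bar heights, so it is worth trying more than one.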

Import Movies dataset from kaggle/uci public dataset repository.

Code:

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

df = pd.read_csv('movie.csv')

print(df.head())
Output:

(i) Make scatter plot of tickets sold versus gross.

Code:

plt.figure(figsize=(6, 6))

plt.scatter(df['Tickets Sold'], df['2014 Gross'], alpha=0.5)

plt.title('Scatter Plot of Tickets Sold vs Gross')

plt.xlabel('Tickets Sold')

plt.ylabel('2014 Gross')

plt.grid()

plt.show()

Output:

(ii) Redo the scatter plot, adjusting scales, divide by 1000, 100,000 and
1,000,000
Code:

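# Redraw the scatter plot at three scales: thousands, hundred-thousands and millions
scales = {'Thousands': 1_000, 'Hundred Thousands': 100_000, 'Millions': 1_000_000}

fig, axs = plt.subplots(1, 3, figsize=(18, 6))

for ax, (label, factor) in zip(axs, scales.items()):
    ax.scatter(df['Tickets Sold'] / factor, df['2014 Gross'] / factor, alpha=0.5)
    ax.set_title(f'Tickets Sold vs Gross ({label} Scale)')
    ax.set_xlabel(f'Tickets Sold ({label})')
    ax.set_ylabel(f'2014 Gross ({label})')
    ax.grid()

plt.tight_layout()

plt.show()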

Output:

(iii) Find the correlation between tickets sold and sales? Is it expected?

Code:

correlation = df['Tickets Sold'].corr(df['2014 Gross'])

print(f'Correlation between Tickets Sold and Gross: {correlation}')

Output:
(iv) Make scatter plot with millions scale and also add a regression line with
label to all axes and title to plot.

Code:

plt.figure(figsize=(10, 6))

sns.regplot(x=df['Tickets Sold'] / 1_000_000, y=df['2014 Gross'] / 1_000_000,
            scatter_kws={'alpha': 0.5})

plt.title('Scatter Plot with Regression Line (Millions Scale)')

plt.xlabel('Tickets Sold (Millions)')

plt.ylabel('2014 Gross (Millions)')

plt.grid()

plt.show()

Output:

(v) Make boxplot

Code:

plt.figure(figsize=(9, 5))
sns.boxplot(data=df, x='Genre', y='Tickets Sold')

plt.title('Boxplot of Tickets sold by Genre')

plt.xticks(rotation=90)

plt.grid()

plt.show()

Output:

(vi) Make individual histogram of type of films, gross sales and ticket sales.

Code:

fig, axs = plt.subplots(3, 1, figsize=(19, 19))

# Histogram for Genre

sns.histplot(data=df, x='Genre', ax=axs[0], discrete=True)

axs[0].set_title('Histogram of Film Types')

axs[0].set_xlabel('Genre')

axs[0].set_ylabel('Count')

# Histogram for 2014 Gross

sns.histplot(data=df, x='2014 Gross', ax=axs[1], bins=30)

axs[1].set_title('Histogram of Gross Sales')

axs[1].set_xlabel('Gross Sales')

axs[1].set_ylabel('Frequency')


# Histogram for Tickets Sold

sns.histplot(data=df, x='Tickets Sold', ax=axs[2], bins=30)

axs[2].set_title('Histogram of Tickets Sold')

axs[2].set_xlabel('Tickets Sold')

axs[2].set_ylabel('Frequency')

plt.tight_layout()

plt.show()

Output:
Concluding Remarks:

The visualizations help to easily identify trends and outliers, offering valuable
insights into the factors affecting ticket sales and gross revenues in the film industry.
This approach supports better decision-making and strategic planning for movie
production and marketing.

Quiz:

1. What are the five key components of a box plot?


2. What are the advantages and disadvantages of using box plots to visualize data?

3. What is the difference between a histogram plot and a bar plot?

4. How can you use a histogram plot to identify the shape of a distribution?

5. How can you use a scatter plot to identify patterns in data?

Suggested Reference:

• https://www.simplypsychology.org/boxplots.html

• https://builtin.com/data-science/boxplot

• https://www.investopedia.com/terms/h/histogram.asp

• https://statistics.laerd.com/statistical-guides/understanding-histograms.php

References used by the students:

Rubric wise marks obtained:

Rubrics 1 2 3 4 5 Total

Marks Complete Complete Complete Complete Complete


implementation implementation implementation implementation implementation
as asked as asked as asked as asked as asked

Correct Result Correct Result Correct Result Correct Result

Conclusions Conclusions Conclusions

Correct answer Correct answer


to all questions to all questions

References
Experiment No: 7

For the GDP and Life Expectancy datasets from kaggle, produce following plots that
describe the GDP and Life Expectancy during the year 2016.

(i) Create a scatter plot of GDP to Life Expectancy

(ii) Create a histogram of GDP

(iii) Create a box and whisker plot of Life Expectancy

Date:

Competency and Practical Skills:

Relevant CO: 4

Objectives:

1. To understand scatter plot

2. To understand histogram

3. To understand box plot and whisker plot

For the GDP and Life Expectancy datasets from Kaggle, produce the following plots that
describe GDP and life expectancy during the year 2016.

Code:

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

gdp_data = pd.read_csv('gdp.csv')

gdp_data= gdp_data[['Country Name', '2016']]

life_expectancy_data = pd.read_csv('life_expectancy.csv')

life_expectancy_data=life_expectancy_data[['Country Name', '2016']]

# Merge the two datasets on their common 'Country Name' column
merged_data = pd.merge(gdp_data, life_expectancy_data, on='Country Name',
                       suffixes=('_GDP', '_Life_Expectancy'))

# Drop rows with missing values
df = merged_data.dropna()

print(df.head())

print('*****************************************************************')

df.info()

Output:

(i) Create a scatter plot of GDP to Life Expectancy

Code:

plt.figure(figsize=(10, 6))
plt.scatter(df['2016_GDP'], df['2016_Life_Expectancy'], alpha=0.6)

plt.title('Scatter Plot of GDP vs Life Expectancy (2016)')

plt.xlabel('GDP')

plt.ylabel('Life Expectancy')

plt.grid()

plt.show()

Output:

(ii) Create a histogram of GDP

Code:

plt.figure(figsize=(10, 6))

sns.histplot(df['2016_GDP'], bins=100, kde=True)

plt.title('Histogram of GDP (2016)')

plt.xlabel('GDP')

plt.ylabel('Frequency')

plt.grid()

plt.show()
Output:

(iii) Create a box and whisker plot of Life Expectancy

Code:

plt.figure(figsize=(10, 6))

sns.boxplot(y=df['2016_Life_Expectancy'])

plt.title('Box and Whisker Plot of Life Expectancy (2016)')

plt.ylabel('Life Expectancy')

plt.grid()

plt.show()

Output:
Concluding Remarks:

Collectively, these plots facilitate a deeper understanding of how economic factors
might influence health outcomes globally.

Quiz:

1. How can you use a scatter plot to compare two or more datasets?

2. What is a bubble plot and how is it different from a scatter plot?

3. What is the difference between a symmetric and a skewed distribution in a histogram
plot?

4. How do you interpret the median, upper quartile, and lower quartile in a box plot?

Suggested Reference:

• https://www.simplypsychology.org/boxplots.html

• https://builtin.com/data-science/boxplot

• https://www.investopedia.com/terms/h/histogram.asp

• https://statistics.laerd.com/statistical-guides/understanding-histograms.php

References used by the students:

Rubric wise marks obtained:

Rubrics 1 2 3 4 5 Total

Marks Complete Complete Complete Complete Complete


implementation implementation implementation implementation implementation
as asked as asked as asked as asked as asked

Correct Result Correct Result Correct Result Correct Result

Conclusions Conclusions Conclusions

Correct answer Correct answer


to all questions to all questions

References
Experiment No: 8

Import Olympics dataset from any of the public dataset repository.

(i) Look at the column names and change names to more meaningful names.

(ii) Make two histograms on one page: summer games (total), winter games (total).

(iii) Make two histograms on one page: total summer, total winter medals won

(iv) Check that is there a correlation between number of medals won in winter and
summer?

(v) Check that is there correlation between number of games each country competes
in winter and summer.

(vi) Make 6 histograms on one page showing the distribution of each of the types of
medals by seasons.

Date:

Competency and Practical Skills:

Relevant CO: 3,4

Objectives:

1. To understand the use of histogram in data analysis

2. To learn use of correlation

Import Olympics dataset from any of the public dataset repository and

(i) Look at the column names and change names to more meaningful names.

Code:

import pandas as pd

df = pd.read_csv('olympics.csv')

df.columns = ['Country', 'Summer_games', 'Summer_Gold', 'Summer_Silver', 'Summer_Bronz',
              'Summer_Total', 'Winter_games', 'Winter_Gold', 'Winter_Silver', 'Winter_Bronz',
              'Winter_Total', 'Total_games', 'Gold', 'Silver', 'Bronz', 'Total']

df = df.drop(df.index[0])

df = df.drop(df.index[-1])

df = df.reset_index(drop=True)

# Convert every column except 'Country' to integers
numeric_cols = df.columns.drop('Country')

df[numeric_cols] = df[numeric_cols].astype(int)

print(df.head())

df.info()

Output:
(ii) Make two histograms on one page: summer games (total), winter games
(total).

Code:

import matplotlib.pyplot as plt

plt.figure(figsize=(10, 5))

plt.subplot(1, 2, 1)

plt.hist(df['Summer_games'], bins=10, color='blue', edgecolor='black')

plt.title('Summer Games Participation')

plt.xlabel('Number of Games')

plt.ylabel('Frequency')

plt.subplot(1, 2, 2)
plt.hist(df['Winter_games'], bins=10, color='green', edgecolor='black')

plt.title('Winter Games Participation')

plt.xlabel('Number of Games')

plt.ylabel('Frequency')

plt.tight_layout()

plt.show()

Output:

(iii) Make two histograms on one page: total summer, total winter medals won

Code:

plt.figure(figsize=(10, 5))

plt.subplot(1, 2, 1)

plt.hist(df['Summer_Total'], bins=10, color='orange', edgecolor='black')

plt.title('Total Summer Medals')

plt.xlabel('Total Medals')

plt.ylabel('Frequency')
plt.subplot(1, 2, 2)

plt.hist(df['Winter_Total'], bins=10, color='purple', edgecolor='black')

plt.title('Total Winter Medals')

plt.xlabel('Total Medals')

plt.ylabel('Frequency')

plt.tight_layout()

plt.show()

Output:

(iv) Check that is there a correlation between number of medals won in winter
and summer?

Code:

correlation_medals = df['Summer_Total'].corr(df['Winter_Total'])

print(f"Correlation between Summer and Winter Total Medals: {correlation_medals:.2f}")

Output:

(v) Check that is there correlation between number of games each country
competes in winter and summer.
Code:

correlation_games = df['Summer_games'].corr(df['Winter_games'])

print(f"Correlation between Summer and Winter Games Participation: {correlation_games:.2f}")

Output:

(vi) Make 6 histograms on one page showing the distribution of each of the
types of medals by seasons.

Code:

plt.figure(figsize=(15, 10))

# Summer Gold Medals

plt.subplot(2, 3, 1)

plt.hist(df['Summer_Gold'], bins=10, color='gold', edgecolor='black')

plt.title('Summer Gold Medals')

# Summer Silver Medals

plt.subplot(2, 3, 2)

plt.hist(df['Summer_Silver'], bins=10, color='silver', edgecolor='black')

plt.title('Summer Silver Medals')

# Summer Bronze Medals

plt.subplot(2, 3, 3)

plt.hist(df['Summer_Bronz'], bins=10, color='brown', edgecolor='black')

plt.title('Summer Bronze Medals')


# Winter Gold Medals

plt.subplot(2, 3, 4)

plt.hist(df['Winter_Gold'], bins=10, color='gold', edgecolor='black')

plt.title('Winter Gold Medals')

# Winter Silver Medals

plt.subplot(2, 3, 5)

plt.hist(df['Winter_Silver'], bins=10, color='silver', edgecolor='black')

plt.title('Winter Silver Medals')

# Winter Bronze Medals

plt.subplot(2, 3, 6)

plt.hist(df['Winter_Bronz'], bins=10, color='brown', edgecolor='black')

plt.title('Winter Bronze Medals')

plt.tight_layout()

plt.show()

Output:
Concluding Remarks:

These analyses provide a comprehensive overview of Olympic participation and
performance across different seasons.
Quiz:

1. What are the advantages and disadvantages of using a histogram plot to visualize
data?

2. How do you choose the appropriate bin width for a histogram plot?

3. What is correlation and how is it used to analyze data?

4. What is a scatter plot and how can it be used to visualize correlation?

Suggested Reference:

• https://www.mathsisfun.com/data/correlation.html

• https://www.investopedia.com/terms/c/correlation.asp

• https://www.cuemath.com/data/histograms/

References used by the students:

Rubric wise marks obtained:

Rubrics 1 2 3 4 5 Total

Marks Complete Complete Complete Complete Complete


implementation implementation implementation implementation implementation
as asked as asked as asked as asked as asked

Correct Result Correct Result Correct Result Correct Result

Conclusions Conclusions Conclusions

Correct answer Correct answer


to all questions to all questions

References
Experiment No: 9

For the GDP, Life Expectancy and Employment datasets from kaggle, perform
following tasks.

(i) Merge the columns for the year 2016 for 3 datasets into a new data frame

(ii) Rename the columns to “country”, “gdp”, “life_expectancy” and “employment”

(iii) Create a frequency table for each variable

(iv) Draw histograms for each variable

Date:

Competency and Practical Skills:

Relevant CO: 4

Objectives:

1. To learn to create a frequency table

Theory:

Frequency Table

A frequency table is a statistical tool for classifying data into groups and showing how often
each group occurs. In other words, it illustrates the recurrence rates of individual values or
sets of values within a dataset.

Choosing the appropriate categories or intervals for your data is the first step in making a
frequency table. Then, you create a table to keep track of how many data points belong to
each category. The frequency with which each category appears in the dataset is displayed
in the table that results.

For example, consider the weights of 50 school students, which vary from 25 kg to 65 kg.
Grouped by weight range, the frequency table looks like this:

Weight range (kg)    Frequency
25 – 35              12
35 – 45              13
45 – 55              10
55 – 65              15

From this frequency table, 12 students weigh between 25 and 35 kg, 13 between 35 and
45 kg, 10 between 45 and 55 kg, and 15 between 55 and 65 kg.
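A frequency table is often extended with relative and cumulative frequencies. The sketch below does this for the student weight counts given above; the column layout and variable names are chosen here purely for illustration.

```python
weight_ranges = ["25 - 35", "35 - 45", "45 - 55", "55 - 65"]
frequencies = [12, 13, 10, 15]

total = sum(frequencies)  # 50 students in all

# Build one row per weight range:
# (label, frequency, relative frequency, cumulative frequency)
rows = []
cumulative = 0
for label, count in zip(weight_ranges, frequencies):
    cumulative += count
    rows.append((label, count, count / total, cumulative))

print(f"{'Weight (kg)':<12}{'Freq':>6}{'Rel. freq':>11}{'Cum. freq':>11}")
for label, count, rel, cum in rows:
    print(f"{label:<12}{count:>6}{rel:>11.2f}{cum:>11}")
```

The relative frequency is the share of the total in each range (for example 12 / 50 = 0.24 for the 25 to 35 kg range), and the last cumulative frequency always equals the total number of observations.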

Import the GDP, Life Expectancy and Employment datasets from Kaggle and

(i) Merge the columns for the year 2016 for 3 datasets into a new data frame

Code:

import pandas as pd

gdp_data = pd.read_csv('gdp.csv')

life_expectancy_data = pd.read_csv('life_expectancy.csv')

employment_df = pd.read_csv('employment.csv')

# Merge the three datasets on the 'Country Name' column
merged_df = gdp_data[['Country Name', '2016']].merge(
    life_expectancy_data[['Country Name', '2016']], on='Country Name')

merged_df = merged_df.merge(
    employment_df[['Country Name', '2016.00']], on='Country Name')

# Drop rows with missing values
df = merged_df.dropna()

(ii) Rename the columns to “country”, “gdp”, “life_expectancy” and “employment”

Code:

df.columns = ['country','gdp','life_expectancy','employment']

print(df.head())

Output:
(iii) Create a frequency table for each variable

Code:

# Define bins for GDP

gdp_bins = pd.cut(
    df['gdp'],
    bins=[0, 100e9, 200e9, 300e9, 400e9, 500e9, 600e9,
          700e9, 800e9, 900e9, 1000e9, float('inf')],
    labels=['0 - 100B', '100B - 200B', '200B - 300B', '300B - 400B',
            '400B - 500B', '500B - 600B', '600B - 700B', '700B - 800B',
            '800B - 900B', '900B - 1000B', '1000B+'])

# Frequency table for GDP with ranges

gdp_freq = gdp_bins.value_counts().sort_index()

# Create a DataFrame for better formatting

gdp_frequency_table = pd.DataFrame({

'Label': gdp_freq.index,

'Count': gdp_freq.values

})
# Print the frequency table

print("Frequency Table for GDP (with ranges):")

print(gdp_frequency_table)

Output:

Code:

# Define bins for Life Expectancy

life_bins = pd.cut(df['life_expectancy'], bins=[0, 30, 50, 60, 70, 80, 90, 100],

labels=['0-30', '30-50', '50-60', '60-70', '70-80', '80-90', '90-100'])

# Frequency table for Life Expectancy with ranges

life_freq = life_bins.value_counts().sort_index()

# Create a DataFrame for better formatting

life_frequency_table = pd.DataFrame({

'Label': life_freq.index,

'Count': life_freq.values

})
# Print the frequency table

print("\nFrequency Table for Life Expectancy (with ranges):")

print(life_frequency_table)

Output:

Code:

# Define bins for Employment

employment_bins = pd.cut(
    df['employment'],
    bins=[0, 1e9, 2e9, 3e9, 4e9, 5e9, 6e9, 7e9, 8e9, 9e9, 10e9, float('inf')],
    labels=['0-1B', '1-2B', '2-3B', '3-4B', '4-5B', '5-6B',
            '6-7B', '7-8B', '8-9B', '9-10B', '10B+'])

# Frequency table for Employment with ranges
employment_freq = employment_bins.value_counts().sort_index()

frequency_table = pd.DataFrame({
    'Label': employment_freq.index,
    'Count': employment_freq.values
})

print("\nFrequency Table for Employment (in 1-billion ranges):")

print(frequency_table)

Output:
(iv) Draw histograms for each variable

Code:

import matplotlib.pyplot as plt

# Create histograms

plt.figure(figsize=(12, 8))

# GDP Histogram

plt.subplot(3, 1, 1)

plt.hist(df['gdp'], bins=10, color='blue', edgecolor='black')

plt.title('GDP Distribution (2016)')

plt.xlabel('GDP')

plt.ylabel('Frequency')

# Life Expectancy Histogram


plt.subplot(3, 1, 2)

plt.hist(df['life_expectancy'], bins=10, color='green', edgecolor='black')

plt.title('Life Expectancy Distribution (2016)')

plt.xlabel('Life Expectancy')

plt.ylabel('Frequency')

# Employment Histogram

plt.subplot(3, 1, 3)

plt.hist(df['employment'], bins=10, color='purple', edgecolor='black')

plt.title('Employment Distribution (2016)')

plt.xlabel('Employment')

plt.ylabel('Frequency')

# Adjust layout and show the plots

plt.tight_layout()

plt.show()

Output:
Concluding Remarks:

Such analyses are vital for uncovering trends and making informed decisions based
on the interconnectedness of data. Overall, this approach underscores the importance
of data integration and visualization in understanding complex societal dynamics.

Quiz:

1. How can you use a frequency table to analyze data?

2. What is a relative frequency table and how is it different from a standard frequency
table?

3. How do you create a frequency table from a dataset?

Suggested Reference:

• https://statisticsbyjim.com/basics/frequency-table/

• https://www.vedantu.com/maths/understanding-frequency-tables

References used by the students:

Rubric wise marks obtained:

Rubrics 1 2 3 4 5 Total

Marks Complete Complete Complete Complete Complete


implementation implementation implementation implementation implementation
as asked as asked as asked as asked as asked

Correct Result Correct Result Correct Result Correct Result

Conclusions Conclusions Conclusions

Correct answer Correct answer


to all questions to all questions

References
Experiment No: 10

Import any real life data set from any public data set repository. Create program that
allows the user to interactively display a box plot based on selection of value of other
column.

Date:

Competency and Practical Skills:

Relevant CO: 4

Objectives:

1. To load data and process interactively

2. To observe and understand data distribution through the box plot
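One possible approach, given only as a hedged sketch, is to let the user type a value of one column and then draw a box plot of another column for the matching rows. The helper names values_for and interactive_boxplot, and the console prompt, are assumptions made for illustration; they are not prescribed by this manual.

```python
import pandas as pd


def values_for(df, filter_col, filter_value, value_col):
    """Return the non-missing values of value_col for rows where
    filter_col equals filter_value (compared as text)."""
    mask = df[filter_col].astype(str) == str(filter_value)
    return df.loc[mask, value_col].dropna()


def interactive_boxplot(csv_path, filter_col, value_col):
    """Repeatedly ask for a value of filter_col and show a box plot
    of value_col for the selected rows."""
    # Imported here so that values_for() can be used without a display.
    import matplotlib.pyplot as plt

    df = pd.read_csv(csv_path)
    options = sorted(df[filter_col].dropna().astype(str).unique())

    while True:
        print("Available values:", ", ".join(options))
        choice = input(f"Select {filter_col} (blank to quit): ").strip()
        if not choice:
            break
        values = values_for(df, filter_col, choice, value_col)
        if values.empty:
            print(f"No rows found for {choice!r}.")
            continue
        plt.figure(figsize=(6, 6))
        plt.boxplot(values.to_numpy())
        plt.title(f"{value_col} for {filter_col} = {choice}")
        plt.ylabel(value_col)
        plt.grid()
        plt.show()
```

For example, with the movies file and column names used in Experiment 6, calling interactive_boxplot('movie.csv', 'Genre', 'Tickets Sold') would prompt for a genre and show the distribution of tickets sold for that genre. In a notebook, the prompt could be replaced by a dropdown widget.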

Concluding Remarks:

Quiz:

1. What is a data frame and how is it used to organize and analyze data?

2. What are some common methods for cleaning and preprocessing data in a data
frame?

3. How can you use a data frame to filter and subset data?

Suggested Reference:

• https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html

• https://pythonbasics.org/pandas-dataframe/

• https://realpython.com/pandas-dataframe/
References used by the students:

Rubric wise marks obtained:

Rubrics 1 2 3 4 5 Total

Marks Complete Complete Complete Complete Complete


implementation implementation implementation implementation implementation
as asked as asked as asked as asked as asked

Correct Result Correct Result Correct Result Correct Result

Conclusions Conclusions Conclusions

Correct answer Correct answer


to all questions to all questions

References
