EDA Report
Introduction:
The United States of America has recently had the most reported COVID-19 cases. The dataset
I have chosen gives detailed county-level information: county, state, male and female
population, age group, and geographic information such as latitude and longitude. I used
this dataset to perform this research.
DATASET LINK:
https://drive.google.com/drive/folders/1RfLhJVOK45x9oGBmOKyZEpBAaHuITYaw
US_COUNTY.CSV
The main objective of this analysis is to find patterns within the dataset and gain a
better understanding of the data. I also want to use these findings to choose a machine
learning algorithm for predicting the survival rate of patients during the COVID-19 period.
The dataset consists of demographic information: population figures (such as male and
female counts and the female percentage) and age information.
Data attributes: Fips, County, State, State code, male, female, median age, population,
female_percentage, lat, long.
In total, the dataset has 3220 rows and 11 columns with no null values. Every column has a
title/heading, which makes the data readable.
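The row count, column count, and absence of null values stated above can be checked with a few pandas calls. The following is a minimal sketch, not the code used for this report: the helper name summarize_frame is hypothetical, and the small toy table stands in for the real file with made-up values.

```python
import pandas as pd


def summarize_frame(df):
    """Return (rows, columns, total missing values) for a DataFrame."""
    n_rows, n_cols = df.shape
    # isnull() marks missing cells; summing twice totals them over all columns
    n_missing = int(df.isnull().sum().sum())
    return n_rows, n_cols, n_missing


# Toy stand-in for us_county.csv; these values are made up for illustration
toy = pd.DataFrame({
    "county": ["A", "B", "C"],
    "male": [100, 250, 75],
    "female": [110, 240, 80],
})

print(summarize_frame(toy))  # (3, 3, 0)
```

Calling summarize_frame(data_frame) on the loaded dataset should report the shape and the number of missing values in one step.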
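Explanation 1: This code shows how often each distinct value of the male column occurs
across the counties.
Code:
print(data_frame["male"].value_counts())
Explanation 2: This code shows how often each distinct value of the female column occurs
across the counties.
Code:
print(data_frame["female"].value_counts())
Explanation 3: This code shows how often each distinct value of the population column
occurs across the counties.
Code:
print(data_frame["population"].value_counts())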
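If the goal is the total male, female, and overall population per state, a groupby with a sum is one way to get it, since value_counts() only reports how often each value appears. This is a sketch under assumptions: the column name "state" is assumed from the attribute list, and the toy table uses made-up values rather than rows from the dataset.

```python
import pandas as pd

# Toy stand-in for the county table; values are made up for illustration.
# The "state" column name is an assumption based on the attribute list.
toy = pd.DataFrame({
    "state": ["Alpha", "Alpha", "Beta"],
    "male": [100, 250, 75],
    "female": [110, 240, 80],
    "population": [210, 490, 155],
})

# Add up the county rows that belong to the same state
state_totals = toy.groupby("state")[["male", "female", "population"]].sum()
print(state_totals)
```

The same line applied to data_frame would give one row per state, provided the column names match those in the file.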
Important note:
Before running this code, we need to download the dataset and upload it to the Google Colab
environment.
Code: This code reads a CSV or Excel file so that EDA can be performed on it.
import pandas as pd
import matplotlib.pyplot as plt

def read_csv_or_excel(file_path):
    """
    Reads a CSV or Excel file based on the file extension.

    Args:
        file_path (str): The path to the CSV or Excel file.

    Returns:
        pd.DataFrame: A Pandas DataFrame containing the data from the file.

    Raises:
        ValueError: If the file is neither a .csv nor a .xlsx file.

    If the file has the wrong format:
    >>> read_csv_or_excel('data.txt')
    Traceback (most recent call last):
        ...
    ValueError: This file format is incorrect. Please provide a CSV or Excel file.
    """
    if file_path.endswith('.csv'):
        # Read a CSV file
        df = pd.read_csv(file_path)
    elif file_path.endswith('.xlsx'):
        # Read an Excel file
        df = pd.read_excel(file_path)
    else:
        # Reject any other file extension
        raise ValueError(
            "This file format is incorrect. Please provide a CSV or Excel file."
        )
    return df

file_path = '/content/us_county.csv'
data_frame = read_csv_or_excel(file_path)
print(data_frame)
Output:
Boxplot Graph:
This graph gives a clear picture of the male and female ratio.
import matplotlib.pyplot as plt
# output
plt.show()
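The plotting call itself is not shown in the code above. One possible way to draw such a boxplot is sketched here; it is an assumption about how the figure could be produced, not the code used for this report. The function name plot_male_female_box is hypothetical and the toy values are made up.

```python
import pandas as pd
import matplotlib.pyplot as plt


def plot_male_female_box(df):
    """Draw side-by-side boxplots of the male and female columns."""
    fig, ax = plt.subplots()
    # One box per column; boxes are placed at x positions 1 and 2
    ax.boxplot([df["male"].to_numpy(), df["female"].to_numpy()])
    ax.set_xticks([1, 2])
    ax.set_xticklabels(["male", "female"])
    ax.set_ylabel("Residents per county")
    ax.set_title("Male and female counts by county")
    return ax


# Toy stand-in for the county table; values are made up for illustration
toy = pd.DataFrame({
    "male": [100, 250, 75, 400, 120],
    "female": [110, 240, 80, 420, 115],
})
ax = plot_male_female_box(toy)
```

Passing data_frame instead of the toy table and then calling plt.show() would display the figure in Colab.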
Scatterplot:
This graph gives a clear picture of the relationship between the male and female counts.
import matplotlib.pyplot as plt
# Replace with the path to your CSV or Excel file
file_path = '/content/us_county.csv'
data_frame = read_csv_or_excel(file_path)
#output
plt.show()
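The scatter call is likewise missing from the code above. A minimal sketch of a male versus female scatterplot follows, assuming the male and female columns are the ones of interest; the function name plot_male_female_scatter is hypothetical and the toy values are made up.

```python
import pandas as pd
import matplotlib.pyplot as plt


def plot_male_female_scatter(df):
    """Plot one point per county: male count on x, female count on y."""
    fig, ax = plt.subplots()
    ax.scatter(df["male"], df["female"], alpha=0.6)
    ax.set_xlabel("Male residents")
    ax.set_ylabel("Female residents")
    ax.set_title("Male vs. female counts by county")
    return ax


# Toy stand-in for the county table; values are made up for illustration
toy = pd.DataFrame({
    "male": [100, 250, 75, 400, 120],
    "female": [110, 240, 80, 420, 115],
})
ax = plot_male_female_scatter(toy)
```

Points that fall close to a straight line through the origin would suggest a roughly constant male-to-female ratio across counties.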
Histogram:
This graph shows how the values of the population column are distributed.
import matplotlib.pyplot as plt
data_to_plot = data_frame['population']
# output
plt.show()
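The code above selects the population column but does not show the histogram call. One way to plot it is sketched below; the function name plot_population_hist, the default bin count, and the toy values are all assumptions made for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt


def plot_population_hist(df, bins=5):
    """Draw a histogram of the population column and return the bin counts."""
    fig, ax = plt.subplots()
    counts, edges, _ = ax.hist(df["population"], bins=bins)
    ax.set_xlabel("Population per county")
    ax.set_ylabel("Number of counties")
    ax.set_title("Distribution of county population")
    return ax, counts


# Toy stand-in for the county table; values are made up for illustration
toy = pd.DataFrame({"population": [210, 490, 155, 820, 235, 300]})
ax, counts = plot_population_hist(toy)
```

Because county populations tend to be strongly skewed, it may be worth trying a larger bin count or a logarithmic x-axis on the real data.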
Important Links:
Dataset Link:
https://drive.google.com/drive/folders/1RfLhJVOK45x9oGBmOKyZEpBAaHuITYaw
https://docs.google.com/spreadsheets/d/1OVgcN0T2npE5nRc9RTND8tUP9znStHVZJwMrOthtqDo/edit#gid=1650272371
GitHub Link:
https://github.com/santhiya-hds5210/ORES-5160-EDA
Drive Link:
https://drive.google.com/drive/folders/1W8AiXxbgTYK-HOXSPKjee9qGdj_Ari1O