EDA structuring with Python
You've learned about how structuring data can help professionals analyze, understand, and learn more about their data. Now, let's use a Python notebook and discover how it works in practice. We'll continue using our NOAA lightning strike dataset. For this video, we'll consider the data for 2018 and use our structuring tools to learn more about whether lightning strikes are more prevalent on some days than others. Before we do anything else, let's import our Python packages and libraries. These are all packages and libraries you're familiar with: pandas, NumPy, seaborn, datetime, and matplotlib.pyplot.
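Here's a minimal sketch of what that import cell might look like:

```python
# Import the packages and libraries used throughout this notebook.
import datetime

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
```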
For a quick refresher, let's convert our date column to datetime and take a look at our column headers. We do this to get our dates ready for any future string manipulation we may want to do, and to remind ourselves of what is in our data. As you remember, there are three columns in the dataset: date, number of strikes, and center point geom, which you'll find after running the head function.
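Assuming the DataFrame is named df and the columns use snake_case identifiers (date, number_of_strikes, center_point_geom), that cell might look roughly like this:

```python
# Convert the date column from strings to datetime objects.
df['date'] = pd.to_datetime(df['date'])

# Preview the first few rows and the column headers.
df.head()
```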
Next, let's learn about the shape of our data by using df.shape. When we run this cell, we get (3401012, 3). Take a moment to picture the shape of this dataset. We're talking about only three columns wide and nearly 3.5 million rows long. That's incredibly long and thin.
We'll use a function for finding any duplicates. When we enter df.drop_duplicates() with an empty argument field, followed by .shape, the notebook will return the number of rows and columns remaining after duplicates are removed. Because this returns the exact same numbers as df.shape, we know there are no duplicate values.
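A quick sketch of those two checks, using the same df as before:

```python
# Shape of the full dataset as a (rows, columns) tuple.
print(df.shape)

# Shape after dropping duplicate rows; matching numbers mean no duplicates.
print(df.drop_duplicates().shape)
```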
Let's discuss some of those structuring concepts we learned about earlier. Let's start with sorting. We'll sort the number of strikes column in descending order, or most to least. While we do this, let's consider the dates with the highest number of strikes. We'll input df.sort_values. Then in the argument field, type by, then the equals sign. Next, we input the column we want to sort, number of strikes, followed by ascending equals False. If we add the head function to the end, the notebook outputs the top 10 rows for us to analyze.
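Assuming the snake_case column name number_of_strikes, the sorting cell might look like this:

```python
# Sort days from most to fewest strikes and show the top 10.
df.sort_values(by='number_of_strikes', ascending=False).head(10)
```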
We find that the highest numbers of strikes are in the lower 2000s. It does seem like a lot of lightning strikes in just one day, but given that it happened in August, when storms are likely, it is probable these 2000-plus strikes were counted during a storm.
Next, let's look at the number of strikes based on the geographic coordinates, latitude, and longitude. We can do this by using the value_counts function. We type in df, followed by the center point geom column. Then we type in .value_counts with an empty argument field.
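Assuming the column identifier is center_point_geom, that cell is a single line:

```python
# Count how many strikes were recorded at each latitude/longitude point.
df.center_point_geom.value_counts()
```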
Based on our result, we learned that the locations with the most strikes have lightning, on average, every one in three days, with counts in the low 100s. Meanwhile, some locations are reporting only one lightning strike for the entire year of 2018. We also want to learn if we have an even distribution of values, or whether 108 is a notably high value for lightning strikes in the US. To do this, copy the same value_counts function, but input a colon, 20 in the brackets so that you can see the first 20 lines. The rest of the coding here is to help present the data clearly. We rename the axis and index to unique values and counts, respectively. Lastly, we'll add a gradient background to the counts column for visual effect.
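Here's one way that cell might look. The exact chaining is a sketch, but each step matches the description above:

```python
# Top 20 locations: rename the axis and index, then add a gradient background.
(
    df.center_point_geom.value_counts()[:20]
    .rename_axis('unique_values')      # name the index axis
    .reset_index(name='counts')        # turn the counts into a column
    .style.background_gradient()       # shade the counts column
)
```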
After running the cell, we discover zero notable drops in lightning strike counts among the top 20 locations. This suggests that there are zero notably high lightning strike data points, and that the data values are evenly distributed.
Next, let's use another structuring method: grouping. You'll often find stories hidden among different groups in your data, like the most profitable times of day for a retail store, for instance. For this dataset, one useful grouping is categorizing lightning strikes by day of the week, which will tell us whether any particular day has fewer or more lightning strikes than others.
Let's first create some new data columns. We create a column called week by inputting df.date.dt.isocalendar. Let's leave the argument field blank and add a .week at the end. This will create a column assigning a week number, 1 through 52, to each of the days in the year 2018. Let's also add a column that names the weekday. Type in df.date.dt.day_name, leaving the argument field blank. For this last part, let's input df.head.
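As a sketch, again assuming the snake_case column names used earlier, those steps might look like this:

```python
# ISO week number and weekday name for each date.
df['week'] = df.date.dt.isocalendar().week
df['weekday'] = df.date.dt.day_name()

# Confirm the new columns appear alongside the original ones.
df.head()
```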
Again, you'll discover the dates now have week numbers and assigned weekdays. We have some new columns, so let's group the number of strikes by weekday to determine whether any particular day of the week has more lightning strikes than others.
Let's create a DataFrame with just the weekday and number of lightning strikes. We'll do this by inputting df, double bracket, weekday, comma, number of strikes, both in single quotes, followed by closing double brackets. Next, we'll add one of our structuring functions, groupby, with weekday in the argument field, followed by dot mean. What we're telling the notebook here is to create a DataFrame with weekday and number of strikes, but then also group the total number of strikes by day of the week, giving us the mean number of strikes for that day.
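That translates to roughly the following, with the column names assumed as before:

```python
# Mean number of strikes for each day of the week.
df[['weekday', 'number_of_strikes']].groupby(['weekday']).mean()
```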
To understand what this data is telling us, let's plot a box plot chart. A box plot is a data visualization that depicts the locality, spread, and skew of groups of values within quartiles. For this dataset and notebook, a box plot visualization will be the most helpful, because it will tell us a lot about the distribution of lightning strike values. Most of the lightning strike values will be shown as grouped into colored boxes, which is why this visualization is called a box plot. The rest of the values will string out to either side with a straight line that ends in a T. We will discuss more about box plots in an upcoming video.
Now, before we plot, let's set the weekday order to start with Monday. To code that, input g, equals sign, sns.boxplot. Next, in the argument field, let's have x equal weekday and y equal number of strikes. For order, let's use weekday order, and for the showfliers field, let's input False. Showfliers refers to outliers that may or may not be included in the box plot. If you input True, outliers are included; if you input False, outliers are left off the box plot chart. Keep in mind, we aren't deleting any outliers from the dataset when we create this chart. We're only excluding them from our visualization to get a good sense of the distribution of strikes across the days of the week. Lastly, we will plug in our visualization title, Lightning distribution per weekday for 2018, and run the cell.
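Here's a sketch of that cell. The weekday_order list is something we define ourselves, and the DataFrame and column names are the same assumptions as before:

```python
# Define the weekday order so the x-axis starts with Monday.
weekday_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                 'Friday', 'Saturday', 'Sunday']

# Box plot of strike counts by weekday, with outliers hidden.
g = sns.boxplot(
    data=df,
    x='weekday',
    y='number_of_strikes',
    order=weekday_order,
    showfliers=False,
)
g.set_title('Lightning distribution per weekday (2018)')
plt.show()
```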
Now you'll discover something really interesting. The median, indicated by these horizontal black lines, remains the same on all of the days of the week. As for Saturday and Sunday, however, the distributions are both lower than the rest of the week. Let's consider why that is. What do you think is more likely: that lightning strikes across the United States take a break on the weekends, or that people do not report as many lightning strikes on weekends? While we don't know for sure, we have clear data suggesting the total quantity of weekend lightning strikes is lower than on weekdays. We've also learned a story about our dataset that we didn't know before we tried grouping it in this way.
Let's get back into our notebook and learn some more about our lightning data. One common structuring method we learned about in another video was merging, which, you'll remember, means combining two different data sources into one. We'll need to know how to perform this method in Python if we want to learn more about our data across multiple years. Let's add two more years to our data: 2016 and 2017. To merge three years of data together, we need to make sure each dataset is formatted the same. The new datasets do not have the extra columns week and weekday that we created earlier. To merge them successfully, we need to either remove the new columns or add them to the new datasets. There's an easy way to merge the three years of data and remove the extra columns at the same time.
Let's call our new data frame union_df. We'll use the pandas function concat to merge, or more accurately concatenate, the three years of data. Inside the concat argument field, we'll type in df.drop to pull the weekday and week columns out. We also input the axis we want to drop along, which is 1 for columns. Lastly, and most essentially, we add the name of the data frame we are concatenating with, df_2. We also input True for ignore_index, because the two data frames will already align along their first columns. And now you've just learned to merge three years of data.
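Assuming the 2016 and 2017 records have already been read into a data frame called df_2, as named above, the concatenation might look like this:

```python
# Combine 2018 (df) with 2016-2017 (df_2), dropping the extra columns first.
union_df = pd.concat(
    [df.drop(['weekday', 'week'], axis=1), df_2],
    ignore_index=True,
)
```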
To help us with the next part of structuring, create three date columns following the same steps you used previously. We've already added the columns for year, month, and month_text to the code.
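Those columns can be created like this; month_text holding the month's name is an assumption based on the naming used here:

```python
# Year, month number, and month name for each strike record.
union_df['year'] = union_df.date.dt.year
union_df['month'] = union_df.date.dt.month
union_df['month_text'] = union_df.date.dt.month_name()
```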
Now let's add all the lightning strikes together by year so we can compare them. We can do this by simply taking the two columns we want to look at, year and number of strikes, and grouping them by year with the function .sum on the end.
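That grouping is one line:

```python
# Total number of strikes recorded in each year.
union_df[['year', 'number_of_strikes']].groupby(['year']).sum()
```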
You'll find that 2017 did have fewer total strikes than 2016 and 2018. Because the totals are different, it might be interesting, as part of our analysis, to see lightning strike percentages by month of each year. Let's call this lightning_by_month, grouping our union data frame by month_text and year. Additionally, let's aggregate the number of strikes column by using the pandas function NamedAgg. In the argument field, we place our column name and set our aggregate function equal to sum, so that we get the totals for each of the months in all three years.
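A sketch of that aggregation, assuming the column names used above:

```python
# Monthly strike totals for each year, using a named aggregation.
lightning_by_month = union_df.groupby(['month_text', 'year']).agg(
    number_of_strikes=pd.NamedAgg(column='number_of_strikes', aggfunc='sum')
).reset_index()

lightning_by_month.head()
```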
When we input the head function, we have the months in alphabetical order, along with the sums for each month. We can do the same aggregation for year and year strikes, to review those same numbers we saw before, with 2017 having fewer strikes than the two other years.
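The year-level version follows the same pattern; year_strikes as the output column name is an assumption:

```python
# Yearly strike totals, aggregated the same way.
lightning_by_year = union_df.groupby(['year']).agg(
    year_strikes=pd.NamedAgg(column='number_of_strikes', aggfunc='sum')
).reset_index()
```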
We created those two data frames, lightning_by_month and lightning_by_year, in order to derive our percentages of lightning strikes by month and year. We can get those percentages by typing lightning_by_month.merge, with lightning_by_year and on equals year in the argument field. You'll find that the merge function is merging lightning_by_year into our lightning_by_month data frame according to the year. Lastly, we can create a percentage lightning per month column by dividing the number of strikes column by the year strikes column, after which we'll add an asterisk 100 to give us a percentage.
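Put together, and still assuming the column names above, the cell might look like this:

```python
# Merge yearly totals into the monthly totals, matching rows on year.
percentage_lightning = lightning_by_month.merge(lightning_by_year, on='year')

# Express each month's strikes as a percentage of that year's total.
percentage_lightning['percentage_lightning_per_month'] = (
    percentage_lightning.number_of_strikes
    / percentage_lightning.year_strikes
    * 100.0
)

percentage_lightning.head()
```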
Now, when we use our head function, we have a restructured data frame.
To more easily review our percentages by month, let's plot a data visualization. For this one, a simple grouped bar graph will work well. We'll adjust our figure size to 10 by 6 first. Then we use the seaborn library's barplot, with our x-axis as month_text and our y-axis as percentage lightning per month. For some color, we'll have our hue change according to the year column, with the bars following the month order. Finally, let's input our x and y labels and our title, and run the cell.
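Here's a rough sketch of that plotting cell; the month_order list and the label text are our own choices:

```python
# Order the months chronologically for the x-axis.
month_order = ['January', 'February', 'March', 'April', 'May', 'June',
               'July', 'August', 'September', 'October', 'November', 'December']

plt.figure(figsize=(10, 6))
sns.barplot(
    data=percentage_lightning,
    x='month_text',
    y='percentage_lightning_per_month',
    hue='year',
    order=month_order,
)
plt.xlabel('Month')
plt.ylabel('% of lightning strikes')
plt.title('Percentage of lightning strikes by month (2016-2018)')
plt.show()
```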
When you analyze the bar chart, August 2018 really stands out. In fact, more than one-third of the lightning strikes for 2018 occurred in August of that year. The next step for a data professional trying to understand these findings might be to research storm and hurricane data, to learn whether those factors contributed to a greater number of lightning strikes for this particular month. Now that you've learned some of the Python code for the EDA practice of structuring, you'll have time to try it out yourself. Good luck finding those stories about your data.