
Data Analysis Dummy Report


This is an exemplary report for a data analysis. Its purpose is to give the participants of the hackathon an idea of what a final report could
look like. Note that this is just an example referring to a completely different dataset, so the ideas and analyses might go in a completely
different direction than the ones in the hackathon.

Note: This is just one way to present one's final results and is by no means a perfect solution.

First, we import all necessary libraries used in the analysis.

In [1]: import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns 
import cartopy.crs as ccrs 
import cartopy.feature 
import warnings 

0. Data import and cleaning

In a first step, we import the data provided in the form of a .csv file into a Pandas DataFrame.

In [2]: data = pd.read_csv("meteorites.csv") 

In [3]: data.head() 

Out[3]:
name id name_type class mass fall year lat long geolocation

0 Aachen 1 Valid L5 21.0 Fell 1880.0 50.77500 6.08333 (50.775, 6.08333)

1 Aarhus 2 Valid H6 720.0 Fell 1951.0 56.18333 10.23333 (56.18333, 10.23333)

2 Abee 6 Valid EH4 107000.0 Fell 1952.0 54.21667 -113.00000 (54.21667, -113.0)

3 Acapulco 10 Valid Acapulcoite 1914.0 Fell 1976.0 16.88333 -99.90000 (16.88333, -99.9)

4 Achiras 370 Valid L6 780.0 Fell 1902.0 -33.16667 -64.95000 (-33.16667, -64.95)

We only continue with data points marked as "Valid".

In [4]: data = data.groupby('name_type').get_group('Valid').copy() 

Since we mainly want to study the mass as well as the geographical and temporal distribution of the meteorites, we begin by analysing the dataset
with respect to missing data points for those parameters.

In [5]: data[data['mass'].isna()].count() 

Out[5]: name           82 
id             82 
name_type      82 
class          82 
mass            0 
fall           82 
year           67 
lat            70 
long           70 
geolocation    70 
dtype: int64

In [6]: data[data['long'].isna()].count() 

Out[6]: name           7310 
id             7310 
name_type      7310 
class          7310 
mass           7298 
fall           7310 
year           7197 
lat               0 
long              0 
geolocation       0 
dtype: int64

In [7]: data[data['lat'].isna()].count() 

Out[7]: name           7310 
id             7310 
name_type      7310 
class          7310 
mass           7298 
fall           7310 
year           7197 
lat               0 
long              0 
geolocation       0 
dtype: int64

As indicated above, there are several rows within the DataFrame with missing data in the interesting parameters. To clean the dataset, we
drop all rows containing missing data.

In [8]: data.dropna(inplace=True) 
data.info() 

<class 'pandas.core.frame.DataFrame'> 
Int64Index: 38094 entries, 0 to 45715 
Data columns (total 10 columns): 
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   name         38094 non-null  object  
 1   id           38094 non-null  int64   
 2   name_type    38094 non-null  object  
 3   class        38094 non-null  object  
 4   mass         38094 non-null  float64 
 5   fall         38094 non-null  object  
 6   year         38094 non-null  float64 
 7   lat          38094 non-null  float64 
 8   long         38094 non-null  float64 
 9   geolocation  38094 non-null  object  
dtypes: float64(4), int64(1), object(5) 
memory usage: 3.2+ MB 

To check the last step, we can count how many NaN values are left in our dataset.

In [9]: print(data.isna().sum().sum()) 

Lastly, it can be the case that the geographical coordinates are not NaN but are nevertheless unknown. This could be indicated by the fact that both
are just set to zero.

In [10]: test_for_zeros = data[(data['lat'] == 0.0) & (data['long'] == 0.0)] 
test_for_zeros 

Out[10]:
name id name_type class mass fall year lat long geolocation

597 Mason Gully 53653 Valid H5 24.54 Fell 2010.0 0.0 0.0 (0.0, 0.0)

1655 Allan Hills 09004 52119 Valid Howardite 221.70 Found 2009.0 0.0 0.0 (0.0, 0.0)

1656 Allan Hills 09005 55797 Valid L5 122.30 Found 2009.0 0.0 0.0 (0.0, 0.0)

1657 Allan Hills 09006 55798 Valid H5 104.30 Found 2009.0 0.0 0.0 (0.0, 0.0)

1658 Allan Hills 09008 55799 Valid H5 31.30 Found 2009.0 0.0 0.0 (0.0, 0.0)

... ... ... ... ... ... ... ... ... ... ...

45655 Yamato 984144 40764 Valid H6 37.44 Found 1998.0 0.0 0.0 (0.0, 0.0)

45656 Yamato 984145 40765 Valid L6 54.80 Found 1998.0 0.0 0.0 (0.0, 0.0)

45657 Yamato 984146 40766 Valid H3 19.32 Found 1998.0 0.0 0.0 (0.0, 0.0)

45658 Yamato 984147 40767 Valid LL6 118.90 Found 1998.0 0.0 0.0 (0.0, 0.0)

45659 Yamato 984148 40768 Valid L5 4.59 Found 1998.0 0.0 0.0 (0.0, 0.0)

6186 rows × 10 columns

In fact, we see that there are rows that seem to have no information about the geographical location of the events. Since those are of no
interest for us, we remove them from the dataset.

In [11]: data = data[(data['lat'] != 0.0) & (data['long'] != 0.0)] 

Lastly, note that the meteorites' mass is given in grams [g]. To convert it to the SI unit kilogram [kg], we divide all data points
by 1000.

In [12]: data['mass'] /= 1000 

1. Basic statistical key figures

Before we dive deeper into the dataset, it is always useful to get a first feeling for the data provided. One way to do this is to look
at a statistical overview. The latter can also help to spot outliers or odd data, as shown below.

In [13]: data.describe() 

Out[13]:
id mass year lat long

count 31684.000000 31684.000000 31684.000000 31684.000000 31684.000000

mean 20731.522156 18.685962 1987.075748 -47.657051 73.474518

std 14954.674207 689.498111 26.795643 46.663352 83.430027

min 1.000000 0.000010 860.000000 -87.366670 -165.433330

25% 9185.750000 0.006550 1983.000000 -79.683330 26.000000

50% 18502.500000 0.030150 1991.000000 -72.000000 57.159545

75% 27286.250000 0.205025 2000.000000 18.375000 159.416360

max 57455.000000 60000.000000 2013.000000 81.166670 178.200000

The statistical summary of the DataFrame indicates that the average mass of a meteorite is about 18.7 kg, which is quite heavy. However,
the average is only this high due to some very massive events. To see this better, it is natural to study the empirical distribution of the mass. First,
we check the heaviest and lightest meteorite stored in the dataset.
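Before doing so, a quick sanity check of the skewness claim can be obtained by comparing the mean and the median of the mass column. This is a small sketch (not one of the original cells) that reuses the cleaned data DataFrame from above:

# Compare mean and median mass (in kg); a large gap indicates a right-skewed distribution
print("Mean mass [kg]:  ", data['mass'].mean())
print("Median mass [kg]:", data['mass'].median())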

In [14]: print("The leightest mass point is equal to " + str(min(data['mass'])) + "kg and the heaviest to " + st
r(max(data['mass'])) + "kg.") 

The leightest mass point is equal to 1e­05kg and the heaviest to 60000.0kg. 

We notice that this is quite a huge range. So to illustrate the distribution better, it can be useful to look at the log-mass. Although this would be
problematic for data points with a mass equal to zero, we can easily get around this issue by removing such points, since a mass of zero is unrealistic
anyway.

In [15]: data = data.loc[data['mass'] > 0.0] 
log_mass = np.log10(data['mass'].copy()) 

In [16]: sns.distplot(log_mass) 
plt.axvline(0.0, 0, 1.0, linestyle = "--", color = 'red', linewidth = 0.5) 
plt.title("Histogram and KDE of the log-mass") 
plt.xlabel("Log-Mass") 
plt.ylabel('Normalised number of appearances') 
plt.legend(["Origin", "Histogram"]) 
plt.show() 

We can see in the plot of the empirical distribution of the log-mass that the vast majority of the data points lie below zero, which corresponds
to a mass smaller than 1 kg. This illustrates the fat-tailed property of the mass distribution and puts the average mass into better context.
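To put a number on this observation, a short sketch (again reusing the data DataFrame, not part of the original notebook) could compute the share of meteorites lighter than 1 kg:

# Fraction of meteorites with a mass below 1 kg (i.e. log10-mass below zero)
fraction_below_1kg = (data['mass'] < 1.0).mean()
print("Share of meteorites lighter than 1 kg: {:.1%}".format(fraction_below_1kg))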

In addition to the mass distribution, it can also be useful to analyse the temporal distribution of the appearances. Due to the fact that science
evolves roughly exponentially, we assume that it is only relevant to consider the years after 1800. In particular, it is clear that events of small
mass cannot be measured a posteriori, so the number of appearances in earlier periods cannot be compared with the ones today.

In [17]: years = data['year'].copy() 
years_g_1800 = years[years >= 1800] 

In [18]: sns.distplot(years_g_1800, bins = 30) 
plt.title("Histogram and KDE of the years (later than 1800)") 
plt.xlabel("Years") 
plt.ylabel('Normalised number of appearances') 
plt.legend(["Histogram"]) 
plt.show() 

The histogram illustrates that the bulk of the events clearly lies between the 1950s and today. However, this probably does not indicate
that the number of meteorites hitting the Earth increased drastically. In fact, it merely reflects the improved ways of measuring those
events.
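One way to make this statement more concrete (a sketch, not from the original notebook) is to count the recorded events per decade, using the years_g_1800 series defined above:

# Number of recorded events per decade (years after 1800)
decades = (years_g_1800 // 10 * 10).astype(int)
print(decades.value_counts().sort_index())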

2. Analysis of the latitudinal appearance of meteorites over time

Now that we have a better impression of the data, we can study how the meteorite impacts are distributed latitudinally. Note at this point that
the degrees of latitude range from -90° to 90°. In a scatter plot, we illustrate all events after 1800 with respect to their latitudinal coordinate.

To make sure we do not get problems with missing year values, we have already removed all rows with missing data for the time parameter.

In [19]: years_arr = np.array(data["year"].copy()) 
lat_arr = np.array(data['lat'].copy()) 

In [20]: plt.figure(figsize=(10,6)) 
plt.scatter(years_arr, lat_arr, color='g',alpha=0.4) 
 
plt.xlim(1800,2010) 
plt.ylim(-90,90) 
plt.ylabel('Latitude') 
plt.xlabel('Year') 
plt.title('Meteorite recorded latitude vs year') 
plt.show() 

We can observe that there are many "landings" between the latitudes 20° and 60°, which corresponds to the height of North Africa and
Europe. This indicates that there are probably more recorded meteorite impacts on the Northern hemisphere. Moreover, we see that especially in the
far South the events were recorded much later compared to the others. However, this can probably be explained by the fact that
measurement techniques were established later in Antarctica than in the more populated parts of the world.
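To back up the observation about the far South, one could compare the median recording year of the Antarctic events with the rest (a hedged sketch based on the cleaned data; the -60° threshold is our assumption for delimiting Antarctica):

# Compare the median recorded year south of 60°S (roughly Antarctica) with the rest of the world
antarctic_years = data.loc[data['lat'] < -60, 'year']
other_years = data.loc[data['lat'] >= -60, 'year']
print("Median year, south of 60°S:", antarctic_years.median())
print("Median year, elsewhere:    ", other_years.median())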

3. Analysis of the found meteorites

In this section, we want to illustrate the impact zones of the meteorites. Moreover, we want to analyse which of them were actually
found. To do this, we use the Python library Cartopy, which makes it easy to plot geographical data.

In [22]: founds = data.loc[data['fall'] == "Found"] 
fells = data.loc[data['fall'] == "Fell"] 

In [23]: long_found = list(founds['long']) 
lat_found = list(founds['lat']) 
long_fell = list(fells['long']) 
lat_fell = list(fells['lat']) 
sst = founds['mass'] 
sst2 = fells['mass'] 

For the illustration, we overlay a scatter plot on a map of the world. Each bubble refers to a meteorite, and its size scales with the meteorite's
mass (the marker size is set to the square root of the mass).

In [24]: warnings.filterwarnings("ignore") 
 
fig = plt.figure(figsize=(20,8)) 
ax = fig.add_subplot(1, 1, 1, projection=ccrs.PlateCarree()) 
ax.set_global() 
ax.coastlines() 
 
plt.scatter(long_found, lat_found, color='green', alpha = 0.5, s=np.sqrt(sst), label = "Found") 
plt.scatter(long_fell, lat_fell, color='blue', alpha = 0.5, s=np.sqrt(sst2), label = "Fell") 
 
plt.title("Location of all measured meteorites", fontweight='bold', fontsize=15) 
plt.legend(loc = 'center left', prop={'size': 12}) 
plt.show() 

As already illustrated in the scatter plot before, we can see that most of the measured impacts occurred on the Northern hemisphere. Moreover,
we also notice that most of the big bubbles (heavy meteorites) were found, which is quite a natural observation. Furthermore, we can see that
most of the meteorites that were only seen falling were recorded in highly populated places. This also makes sense, since a small meteorite over a
desert or rainforest is less likely to be spotted.
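As a numerical complement to the map (a sketch, not part of the original cells), one could compare the mass statistics of both groups directly via a groupby:

# Compare count, median and maximum mass (in kg) of found vs. observed-fall meteorites
print(data.groupby('fall')['mass'].describe()[['count', '50%', 'max']])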

To reinforce the last statement, we only plot Europe and Northern Africa.

In [61]: warnings.filterwarnings("ignore") 
 
fig = plt.figure(figsize=(20,8)) 
ax = fig.add_subplot(1, 1, 1, projection=ccrs.PlateCarree()) 
ax.set_global() 
ax.coastlines() 
ax.set_extent([-20, 60, 10, 90]) 
 
plt.scatter(long_found, lat_found, color='green', alpha = 0.5, s=np.sqrt(sst), label = "Found") 
plt.scatter(long_fell, lat_fell, color='blue', alpha = 0.5, s=np.sqrt(sst2), label = "Fell") 
 
plt.title("Location of all measured meteorites (Europe & NA)", fontweight='bold', fontsize=15) 
plt.legend(loc = 'best', prop={'size': 12}) 
plt.show() 

Here it is quite well observable that in the desert regions of Africa meteorites are less often only seen falling than in densely populated
Europe. Also note an oddly high density of meteorite impacts on the Arabian Peninsula.

4. Analysis of the geographical distribution of the heaviest meteorites

In this last section, we want to study the ten heaviest meteorites in more detail.

In [28]: biggest_mass = data.sort_values(['mass'])[-10:] 

In [29]: long_biggest_mass = list(biggest_mass['long']) 
lat_biggest_mass = list(biggest_mass['lat']) 

In [62]: warnings.filterwarnings("ignore") 
 
fig = plt.figure(figsize=(20,8)) 
ax = fig.add_subplot(1, 1, 1, projection=ccrs.PlateCarree()) 
ax.set_global() 
ax.coastlines() 
 
names = biggest_mass['name'].to_list() 
appearances = biggest_mass['year'].astype(int).to_list() 
text = [names[i] + " (" + str(appearances[i]) + ")  " for i in range(len(names))] 
masses = np.array(biggest_mass['mass']) 
 
plt.scatter(long_biggest_mass, lat_biggest_mass, s=masses/10, alpha=0.4, color='r', 
            transform=ccrs.Geodetic(), label='Mass') 
plt.plot(long_biggest_mass, lat_biggest_mass, color='k', linewidth=0, marker='o',  
         markersize=3, transform=ccrs.Geodetic(), label='Location') 
 
zip_object = zip(long_biggest_mass, lat_biggest_mass, text) 
for (lg, lt, txt) in zip_object: ax.text(lg, lt, txt, va='center', ha='right',  
                                        transform=ccrs.Geodetic(), fontsize=10.5) 
     
lgnd = plt.legend(markerscale = 0.2, loc = 'upper right') 
 
plt.title("Location of the ten heaviest meteorites", fontweight='bold', fontsize=15) 
plt.show() 

We can see that the locations of the heaviest impacts are spread quite evenly over all continents - only Europe (as a small continent) was
spared. It is interesting to see that many of those events occurred in the 19th or early 20th century. Since the masses are quite high
(about 60 tonnes for the heaviest, Hoba), those meteorites should have been possible to find even without modern techniques at the time.
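For reference, the names, years, and masses of these ten events could also be printed directly (a sketch reusing the biggest_mass DataFrame from above; the conversion to tonnes is ours):

# List the ten heaviest meteorites, heaviest first, with the mass converted to tonnes
overview = biggest_mass[['name', 'year', 'mass']].copy()
overview['mass_t'] = overview['mass'] / 1000
print(overview.sort_values('mass', ascending=False)[['name', 'year', 'mass_t']].to_string(index=False))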

Ending words:

This report is an example of a final report for the hackathon. Obviously, the above is just one way to analyse the dataset, and there
are probably many more interesting points to extract from the data.
