Data Analysis Dummy Report: 0. Data Import and Cleaning
Note: This is just one way to present one's final results and is by no means a perfect solution.
In [1]: import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import cartopy.crs as ccrs
import cartopy.feature
import warnings
In a first step, we import the data, provided in the form of a .csv file, into a Pandas DataFrame.
In [2]: data = pd.read_csv("meteorites.csv")
In [3]: data.head()
Out[3]:
name id name_type class mass fall year lat long geolocation
2 Abee 6 Valid EH4 107000.0 Fell 1952.0 54.21667 -113.00000 (54.21667, -113.0)
3 Acapulco 10 Valid Acapulcoite 1914.0 Fell 1976.0 16.88333 -99.90000 (16.88333, -99.9)
4 Achiras 370 Valid L6 780.0 Fell 1902.0 -33.16667 -64.95000 (-33.16667, -64.95)
In [4]: data = data.groupby('name_type').get_group('Valid').copy()
Having restricted the dataset to the entries with name_type 'Valid', we turn to the parameters of interest. Since we mainly want to study the mass and the geographical and temporal distribution of the meteorites, we begin by analysing the dataset
with respect to missing datapoints for those parameters.
In [5]: data[data['mass'].isna()].count()
Out[5]: name 82
id 82
name_type 82
class 82
mass 0
fall 82
year 67
lat 70
long 70
geolocation 70
dtype: int64
In [6]: data[data['long'].isna()].count()
Out[6]: name 7310
id 7310
name_type 7310
class 7310
mass 7298
fall 7310
year 7197
lat 0
long 0
geolocation 0
dtype: int64
In [7]: data[data['lat'].isna()].count()
Out[7]: name 7310
id 7310
name_type 7310
class 7310
mass 7298
fall 7310
year 7197
lat 0
long 0
geolocation 0
dtype: int64
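As an aside, the same overview of the missing values can be obtained more compactly in a single call; a minimal sketch, using only the columns of interest named above:
# Count the missing values per parameter of interest in one call
data[['mass', 'year', 'lat', 'long']].isna().sum()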
As indicated above, there are several rows within the DataFrame with missing data in the parameters of interest. To clean the dataset, we
drop all rows containing missing data.
In [8]: data.dropna(inplace=True)
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 38094 entries, 0 to 45715
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   name         38094 non-null  object
 1   id           38094 non-null  int64
 2   name_type    38094 non-null  object
 3   class        38094 non-null  object
 4   mass         38094 non-null  float64
 5   fall         38094 non-null  object
 6   year         38094 non-null  float64
 7   lat          38094 non-null  float64
 8   long         38094 non-null  float64
 9   geolocation  38094 non-null  object
dtypes: float64(4), int64(1), object(5)
memory usage: 3.2+ MB
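Note that dropna without arguments removes a row as soon as any column contains a missing value. If one only cares about the parameters studied here, a subset-restricted variant would retain more rows; a sketch, assuming the column names from above:
# Drop only rows lacking one of the parameters we actually analyse
data = data.dropna(subset=['mass', 'year', 'lat', 'long'])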
To check the last step, we can count how many NaN values are left in our dataset.
In [9]: print(data.isna().sum().sum())
0
Lastly, it can be the case that the geographical coordinates are not NaN but are nevertheless unknown. This could be indicated by both
coordinates being set to exactly zero.
In [10]: test_for_zeros = data[(data['lat'] == 0.0) & (data['long'] == 0.0)]
test_for_zeros
Out[10]:
name id name_type class mass fall year lat long geolocation
597 Mason Gully 53653 Valid H5 24.54 Fell 2010.0 0.0 0.0 (0.0, 0.0)
1655 Allan Hills 09004 52119 Valid Howardite 221.70 Found 2009.0 0.0 0.0 (0.0, 0.0)
1656 Allan Hills 09005 55797 Valid L5 122.30 Found 2009.0 0.0 0.0 (0.0, 0.0)
1657 Allan Hills 09006 55798 Valid H5 104.30 Found 2009.0 0.0 0.0 (0.0, 0.0)
1658 Allan Hills 09008 55799 Valid H5 31.30 Found 2009.0 0.0 0.0 (0.0, 0.0)
... ... ... ... ... ... ... ... ... ... ...
45655 Yamato 984144 40764 Valid H6 37.44 Found 1998.0 0.0 0.0 (0.0, 0.0)
45656 Yamato 984145 40765 Valid L6 54.80 Found 1998.0 0.0 0.0 (0.0, 0.0)
45657 Yamato 984146 40766 Valid H3 19.32 Found 1998.0 0.0 0.0 (0.0, 0.0)
45658 Yamato 984147 40767 Valid LL6 118.90 Found 1998.0 0.0 0.0 (0.0, 0.0)
45659 Yamato 984148 40768 Valid L5 4.59 Found 1998.0 0.0 0.0 (0.0, 0.0)
In fact, we see that there are rows that carry no information about the geographical location of the events. Since those are of no
interest to us, we remove them from the dataset.
In [11]: data = data[~((data['lat'] == 0.0) & (data['long'] == 0.0))]
Lastly, note that the meteorites' masses are given in grams [g]. To convert them to the SI unit kilogram [kg], we divide all datapoints
by 1000.
In [12]: data['mass'] /= 1000
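As a quick sanity check of the conversion, the heaviest entry should now be on the order of tens of thousands of kilograms (the roughly 60-ton Hoba meteorite, see below) rather than tens of millions:
# After the division, masses are in kg; the maximum should be about 6e4
print(data['mass'].max())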
Before we dive deeper into the dataset, it is always useful to get a first feeling for the data provided. One way to do this is to look
at a statistical overview. The latter can also help to spot outliers or odd data, as shown below.
In [13]: data.describe()
Out[13]:
[Statistical summary (count, mean, std, min, quartiles, max) of the columns id, mass, year, lat and long]
The statistical summary of the DataFrame indicates that the average mass of a meteorite is about 15.6 kg, which is quite heavy. However,
the average is only this high due to a few massive events. To see this better, it is natural to study the empirical distribution of the mass. First,
we check the heaviest and lightest meteorites stored in the dataset.
In [14]: print("The lightest mass point is equal to " + str(min(data['mass'])) + "kg and the heaviest to " + str(max(data['mass'])) + "kg.")
The lightest mass point is equal to 1e-05kg and the heaviest to 60000.0kg.
We notice that this is quite a huge range. So to illustrate the distribution better, it can be useful to look at the log-mass. Although this would be
problematic for any datapoint equal to zero, we can easily get around this issue by removing such points, since a mass of zero is unrealistic
anyway.
In [15]: data = data.loc[data['mass'] > 0.0]
log_mass = np.log10(data['mass'].copy())
In [16]: sns.distplot(log_mass)
plt.axvline(0.0, 0, 1.0, linestyle = "--", color = 'red', linewidth = 0.5)
plt.title("Histogram and KDE of the log-mass")
plt.xlabel("Log-mass")
plt.ylabel('Normalised number of appearances')
plt.legend(["Origin", "Histogram"])
plt.show()
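Note that sns.distplot is deprecated in recent seaborn releases (0.11 and later); on such versions, essentially the same figure can be produced with histplot:
# Equivalent plot on newer seaborn versions, where distplot is deprecated
sns.histplot(log_mass, kde=True, stat='density')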
In the plot of the empirical distribution of the log-mass, we can see that the vast majority of the datapoints lie below zero, which
corresponds to a mass smaller than 1 kg. This illustrates the fat-tail property of the mass distribution and puts the average mass into better context.
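This claim can also be quantified directly; a one-line sketch computing the share of meteorites lighter than 1 kg (i.e. with a log-mass below zero):
# Fraction of datapoints with mass below 1 kg (log10-mass < 0)
print((data['mass'] < 1.0).mean())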
In addition to the mass distribution, it can also be useful to analyse the distribution of the temporal appearance. Since science has evolved
roughly exponentially, we assume that only the years after 1800 are relevant. In particular, events of small
mass cannot be measured a posteriori, so the number of appearances in earlier times cannot be compared with the ones today.
In [17]: years = data['year'].copy()
years_g_1800 = years[years >= 1800]
In [18]: sns.distplot(years_g_1800, bins = 30)
plt.title("Histogram and KDE of the years (later than 1800)")
plt.xlabel("Years")
plt.ylabel('Normalised number of appearances')
plt.legend(["Histogram"])
plt.show()
The histogram illustrates that the bulk of the events clearly lies between the 1950s and today. However, this probably does not indicate
that the number of meteorites hitting the earth increased drastically. Rather, it only reflects the improved ways of measuring those
events.
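To back this impression up numerically, one can compute which fraction of the post-1800 events was recorded from 1950 onwards; a minimal sketch:
# Share of the post-1800 events recorded in 1950 or later
print((years_g_1800 >= 1950).mean())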
Now that we have a better impression of the data, we can study how the meteorite impacts are distributed latitudinally. Note at this point that
the degrees of latitude range from -90° to 90°. In a scatter plot, we illustrate all events after 1800 with respect to their latitudinal coordinate.
To make sure we do not get problems with missing year entries, recall that all rows with missing data have already been removed.
In [19]: years_arr = np.array(data["year"].copy())
lat_arr = np.array(data['lat'].copy())
In [20]: plt.figure(figsize=(10,6))
plt.scatter(years_arr, lat_arr, color='g',alpha=0.4)
plt.xlim(1800,2010)
plt.ylim(-90,90)
plt.ylabel('Latitude')
plt.xlabel('Year')
plt.title('Meteorite recorded latitude vs year')
plt.show()
We can observe that there are many "landings" between the latitudes 20° and 60°, which corresponds to the height of North Africa and
Europe. This indicates that there are probably more meteorite impacts recorded on the Northern hemisphere. Moreover, we see that especially in the
far South, the events happened much later compared to the others. However, this can probably be explained by the fact that
measurement techniques were set up later in Antarctica than in the more populated parts of the world.
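The hemispheric imbalance suggested by the scatter plot can be checked directly; a short sketch:
# Fraction of recorded events on the Northern hemisphere (lat > 0)
print((data['lat'] > 0).mean())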
In this section, we want to illustrate the impact zones of the meteorites. Moreover, we want to analyse which of them were actually
found. To do this, we use the Python library Cartopy, which makes it easy to plot geographical data.
In [22]: founds = data.loc[data['fall'] == "Found"]
fells = data.loc[data['fall'] == "Fell"]
In [23]: long_found = list(founds['long'])
lat_found = list(founds['lat'])
long_fell = list(fells['long'])
lat_fell = list(fells['lat'])
sst = founds['mass']
sst2 = fells['mass']
For the illustration, we implement some kind of scatter plot laid over a map of the world. Each bubble refers to a meteorite and its size is
proportional to its mass.
In [24]: warnings.filterwarnings("ignore")
fig = plt.figure(figsize=(20,8))
ax = fig.add_subplot(1, 1, 1, projection=ccrs.PlateCarree())
ax.set_global()
ax.coastlines()
plt.scatter(long_found, lat_found, color='green', alpha = 0.5, s=np.sqrt(sst), label = "Found")
plt.scatter(long_fell, lat_fell, color='blue', alpha = 0.5, s=np.sqrt(sst2), label = "Fell")
plt.title("Location of all measured meteorites", fontweight='bold', fontsize=15)
plt.legend(loc = 'center left', prop={'size': 12})
plt.show()
As already indicated by the scatter plot before, we can see that most of the recorded impacts occurred on the Northern hemisphere. Moreover,
we also notice that most of the big bubbles (heavy meteorites) were found, which is quite natural. Furthermore, most of the meteorites
that were only seen falling were recorded in highly populated places. This also makes sense, since a small meteorite over a
desert or rainforest is less likely to be spotted.
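For reference, the relative sizes of the two groups can be read off directly:
# Number of meteorites per 'fall' category (Found vs. Fell)
print(data['fall'].value_counts())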
To reinforce the last statement, we plot only Europe and Northern Africa.
In [61]: warnings.filterwarnings("ignore")
fig = plt.figure(figsize=(20,8))
ax = fig.add_subplot(1, 1, 1, projection=ccrs.PlateCarree())
ax.set_global()
ax.coastlines()
ax.set_extent([-20, 60, 10, 60])
plt.scatter(long_found, lat_found, color='green', alpha = 0.5, s=np.sqrt(sst), label = "Found")
plt.scatter(long_fell, lat_fell, color='blue', alpha = 0.5, s=np.sqrt(sst2), label = "Fell")
plt.title("Location of all measured meteorites (Europe & NA)", fontweight='bold', fontsize=15)
plt.legend(loc = 'best', prop={'size': 12})
plt.show()
Here it is quite well observable that in the desert regions of Africa, meteorites are less often only seen falling than in densely populated
Europe. Also note the oddly high density of meteorite impacts on the Arabian Peninsula.
In this last section, we want to study the ten heaviest meteorites in more detail.
In [28]: biggest_mass = data.sort_values('mass')[-10:]
In [29]: long_biggest_mass = list(biggest_mass['long'])
lat_biggest_mass = list(biggest_mass['lat'])
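As a side note, the same selection can be written more directly with pandas' nlargest:
# Alternative to the ascending sort above; returns the heaviest entries first
biggest_mass = data.nlargest(10, 'mass')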
In [62]: warnings.filterwarnings("ignore")
fig = plt.figure(figsize=(20,8))
ax = fig.add_subplot(1, 1, 1, projection=ccrs.PlateCarree())
ax.set_global()
ax.coastlines()
names = biggest_mass['name'].to_list()
appearances = biggest_mass['year'].astype(int).to_list()
text = [names[i] + " (" + str(appearances[i]) + ") " for i in range(len(names))]
masses = np.array(biggest_mass['mass'])
plt.scatter(long_biggest_mass, lat_biggest_mass, s=masses/10, alpha=0.4, color='r',
            transform=ccrs.Geodetic(), label='Mass')
plt.plot(long_biggest_mass, lat_biggest_mass, color='k', linewidth=0, marker='o',
         markersize=3, transform=ccrs.Geodetic(), label='Location')
zip_object = zip(long_biggest_mass, lat_biggest_mass, text)
for (lg, lt, txt) in zip_object:
    ax.text(lg, lt, txt, va='center', ha='right', transform=ccrs.Geodetic(), fontsize=10.5)
lgnd = plt.legend(markerscale = 0.2, loc = 'upper right')
plt.title("Location of the ten heaviest meteorites", fontweight='bold', fontsize=15)
plt.show()
We can see that the locations of the heaviest impacts are spread out over almost all continents - only Europe (as a small continent) was
spared. It is interesting to see that many of those events occurred in the 19th or early 20th century. Since the masses are quite high
(about 60 tons for the heaviest, Hoba), it should have been possible to find those meteorites even without the modern techniques of today.
Ending words:
This report is an example of a final report for the hackathon. Obviously, the above is just one way to analyse the dataset, and there are
certainly many more interesting points to extract from the data.