P4DS A2 Data Analysis Project
Email: [email protected]
Project Plan
The Data (15 marks)
This project uses data from NUFORC (the National UFO Reporting Center). NUFORC is one of many
sources used to compile reports of UFO activity throughout the last century, and its data is
available on Kaggle in two versions: ‘scrubbed’ and ‘complete’. The scrubbed version contains
about 80,000 records and excludes entries with no location data or no time data; the complete
version retains them. The file “complete.csv” is a large file of approximately 15.35 MB. It
contains data about UFO sightings in 11 columns, each describing a different aspect of a
sighting.
The columns include the date and time of the sighting, the city and state where it happened, a
description of the event, the reported duration of the sighting, and geographical information,
mainly latitude and longitude. As noted above, some entries have missing or incorrect location
or time values. Furthermore, because the collection includes reports from the 20th century,
some older data points may have lost information or become distorted with time. Findings
should therefore be interpreted with caution, taking into account potential biases and
limitations in the data.
UFO data is valuable for its scientific, security, cultural and societal implications, which
makes it a subject of interest not only for scholars but also for policymakers and the general
public. Governments often take a keen interest in investigating UFO sighting reports because
they treat them as a matter of national security and aviation safety. Knowing what leads to UFO
encounters helps governments identify potential threats and act accordingly. While UFO
sightings generate curiosity among large sections of the public, others are disturbed by such
unexplained aerial phenomena. Analysis of this data can therefore satisfy the curious while
reassuring those distressed by these events. Overall, this dataset is a valuable resource for
studying UFO encounters and associated questions such as geographical distribution, temporal
trends, and potential relationships with other variables. Researchers can gain insight into
this perplexing issue by using analytical techniques and visualization tools.
Specific Objective(s)
Objectives:
Architecture
The architecture of my code is as follows. Data processing and analysis proceed in several
steps to extract knowledge about UFO sightings from the dataset. First, the dataset is loaded
from a CSV file and preprocessed: data types are converted, missing values are handled, and the
data is cleaned. The cleaned dataset is then used for correlation analysis and visualisation:
correlation coefficients are calculated between numerical variables, and the correlation matrix
is drawn as a heatmap. Relationships among variables can also be studied through scatter plots
and box plots. In short, the architecture runs in a straight line from the loading and
preprocessing stage to the correlation analysis and visualisation stage.
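To make the correlation step concrete, here is a minimal sketch of the Pearson correlation coefficient in plain Python. The function name pearson and the sample values are hypothetical and are not taken from the project code or the dataset; in pandas, DataFrame.corr() computes the same coefficient for every pair of numeric columns.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (covariance up to a constant factor).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values, for illustration only (not taken from the dataset).
latitude = [47.6, 33.7, 40.7, 29.8, 41.9]
duration_seconds = [120.0, 30.0, 600.0, 15.0, 300.0]

r = pearson(latitude, duration_seconds)
print(f"r = {r:.3f}")
```

A value of r near 1 or -1 indicates a strong linear relationship, and a value near 0 indicates little or none.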
import csv
The original CSV file produced errors when parsed because some lines were misformatted and
contained extra fields. To address this, a new CSV file is created in which the formatting
issues are corrected or the problematic entries are excluded. The code reads the data with
Python's csv module, which helps preserve data integrity while the misformatted fields are
cleaned.
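One way this cleaning step could be done is sketched below, under the assumption that a well-formed row has exactly 11 fields (the column count given in the data section) and that any other row is excluded. The helper names clean_rows and clean_csv are hypothetical, and the in-memory example uses 3 columns to stay short.

```python
import csv
import io

EXPECTED_FIELDS = 11  # the report states that the file has 11 columns


def clean_rows(rows, expected_fields=EXPECTED_FIELDS):
    """Split rows into those with the expected field count and a count of the rest."""
    kept, dropped = [], 0
    for row in rows:
        if len(row) == expected_fields:
            kept.append(row)
        else:
            dropped += 1
    return kept, dropped


def clean_csv(src, dst, expected_fields=EXPECTED_FIELDS):
    """Copy well-formed rows from file object src to dst; return the number dropped."""
    kept, dropped = clean_rows(csv.reader(src), expected_fields)
    csv.writer(dst, lineterminator="\n").writerows(kept)
    return dropped


# Tiny in-memory example with 3 columns instead of 11.
raw = io.StringIO("a,b,c\n1,2,3\n4,5,6,extra\n7,8,9\n")
out = io.StringIO()
dropped = clean_csv(raw, out, expected_fields=3)
print("rows dropped:", dropped)
print(out.getvalue())
```

With real files, both would be opened with newline="" as the csv module documentation recommends, for example open("complete.csv", newline="", encoding="utf-8") for the input.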
import pandas as pd
In this code block, I view the cleaned data as a pandas DataFrame (df) and display the first
few entries using df.head(). The data and its columns are discussed in the data section above.
total_sightings = country_counts.sum()
country_counts_sorted = country_counts.sort_values(ascending=False)
# Display the top countries with the highest number of sightings and their percentages
top_countries = country_counts_sorted.head(5)
top_countries_percentages = country_percentages.head(5)
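The fragment above uses country_counts and country_percentages without showing where they come from. The following is a minimal sketch of how they could be defined, assuming the cleaned data is held in a DataFrame df with a country column. The sample values are hypothetical, and the two added definitions are an assumption rather than the project's actual code.

```python
import pandas as pd

# Hypothetical sample; in the report the values come from the cleaned CSV.
df = pd.DataFrame({"country": ["us", "us", "us", "ca", "gb", None]})

# Assumed definition: count sightings per country (missing values are skipped).
country_counts = df["country"].value_counts()
total_sightings = country_counts.sum()
country_counts_sorted = country_counts.sort_values(ascending=False)

# Assumed definition: share of all sightings per country, in percent.
country_percentages = country_counts_sorted / total_sightings * 100

top_countries = country_counts_sorted.head(5)
top_countries_percentages = country_percentages.head(5)

print(pd.DataFrame({
    "sightings": top_countries,
    "percent": top_countries_percentages.round(2),
}))
```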
The analysis showed the number and proportion of UFO sightings by country. The following are
the countries with the most UFO sightings:
# Calculate statistics
num_sightings = len(duration_seconds)
avg_duration_sec = sum(duration_seconds) / num_sightings
min_duration_sec = min(duration_seconds)
max_duration_sec = max(duration_seconds)
avg_latitude = sum(latitude) / num_sightings
avg_longitude = sum(longitude) / num_sightings
# Print statistics
print("Number of UFO sightings:", num_sightings)
print("Average duration of UFO sightings (seconds):",
avg_duration_sec)
print("Minimum duration of UFO sightings (seconds):",
min_duration_sec)
print("Maximum duration of UFO sightings (seconds):",
max_duration_sec)
print("Average latitude of UFO sightings:", avg_latitude)
print("Average longitude of UFO sightings:", avg_longitude)
The code above reads the cleaned UFO sightings from a CSV file and performs several
calculations on the numerical fields. It creates lists to store the duration (in seconds),
latitude and longitude. It then opens the CSV file and, for each row, retrieves the values from
the relevant columns (indices 5, 9 and 10) and appends them to the respective lists. Inside a
try block the values are converted to float, and any ValueError raised by a non-numeric value
is caught. After all rows have been processed, the code calculates the number of sightings, the
average, minimum and maximum duration, and the average latitude and longitude. Finally, these
statistics are printed to the screen. This code therefore gives an idea of the distribution of
UFO sightings based on the cleaned numerical data.
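The loading loop described above is not shown in this report, so the following is only a sketch of what it might look like. The function name load_numeric_columns and the sample rows are hypothetical. One design choice here is an assumption: all three values are converted before any is appended, so a row with one bad value is skipped entirely and the three lists stay the same length.

```python
import csv
import io


def load_numeric_columns(rows):
    """Collect duration (index 5), latitude (index 9) and longitude (index 10) as floats."""
    duration_seconds, latitude, longitude = [], [], []
    for row in rows:
        try:
            duration = float(row[5])
            lat = float(row[9])
            lon = float(row[10])
        except (ValueError, IndexError):
            # Non-numeric value (including the header row) or a short row: skip it.
            continue
        duration_seconds.append(duration)
        latitude.append(lat)
        longitude.append(lon)
    return duration_seconds, latitude, longitude


# Hypothetical sample with the 11 columns in the order listed in this report.
sample = "\n".join([
    "datetime,city,state,country,shape,duration (seconds),"
    "duration (hours/min),comments,date posted,latitude,longitude",
    "10/10/1999 20:30,city a,xx,us,circle,120,2 minutes,comment,10/12/1999,40.5,-80.25",
    "1/1/2001 21:00,city b,yy,us,light,n/a,unknown,comment,1/3/2001,35.0,-90.0",
    "5/5/2005 22:00,city c,zz,ca,disk,30,30 seconds,comment,5/6/2005,50.5,-100.75",
])

duration_seconds, latitude, longitude = load_numeric_columns(csv.reader(io.StringIO(sample)))
print(len(duration_seconds), "valid rows")
```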
import csv
import matplotlib.pyplot as plt
The plot resembles a global map since latitude and longitude coordinates are commonly used to
illustrate locations on Earth's surface. Plotting latitude and longitude on the y- and x-axes,
respectively, allows the sightings to be effectively mapped into a two-dimensional
representation of the Earth's surface.
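A plot of this kind can be sketched as follows. The function name plot_sightings, the styling, and the sample coordinates are hypothetical and are not the project's own plotting code. The axis limits are fixed to the full longitude and latitude ranges so that the scatter keeps the proportions of a world map.

```python
import matplotlib.pyplot as plt


def plot_sightings(latitude, longitude):
    """Scatter sightings with longitude on the x-axis and latitude on the y-axis."""
    fig, ax = plt.subplots(figsize=(10, 6))
    ax.scatter(longitude, latitude, s=5, alpha=0.5, color="skyblue")
    ax.set_title("Reported UFO Sightings by Location")
    ax.set_xlabel("Longitude")
    ax.set_ylabel("Latitude")
    # Full coordinate ranges, so the plot is framed like a world map.
    ax.set_xlim(-180, 180)
    ax.set_ylim(-90, 90)
    ax.grid(linestyle="--", alpha=0.7)
    return ax


# Hypothetical coordinates, for illustration only.
latitude = [47.6, 33.7, 51.5, -33.9]
longitude = [-122.3, -84.4, -0.1, 151.2]
ax = plot_sightings(latitude, longitude)
```

Calling plt.show() afterwards displays the figure.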
datetime                 object
city                     object
state                    object
country                  object
shape                    object
duration (seconds)      float64
duration (hours/min)     object
comments                 object
date posted              object
latitude                 object
longitude               float64
dtype: object
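In the listing above, latitude has dtype object rather than float64, which suggests that at least one value in that column could not be parsed as a number. A minimal sketch of one way to convert such a column is given below. The sample values, including the malformed one, are hypothetical, and dropping the affected rows is an assumption rather than the project's documented approach.

```python
import pandas as pd

# Hypothetical sample in which one latitude value is malformed.
df = pd.DataFrame({
    "latitude": ["47.6", "33x.2", "40.7"],
    "longitude": [-122.3, -84.4, -74.0],
})

# Convert to numbers; values that cannot be parsed become NaN instead of raising.
df["latitude"] = pd.to_numeric(df["latitude"], errors="coerce")

# Drop rows whose coordinates are missing after the conversion.
df = df.dropna(subset=["latitude", "longitude"])

print(df.dtypes)
```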
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
Visualisation
The following bar chart shows the countries with the highest number of UFO sightings.
# Plot the distribution of sightings by country
plt.figure(figsize=(10, 6))
ax = top_countries.plot(kind='bar', color='skyblue', alpha=0.7)
plt.title('Top Countries with the Highest Number of UFO Sightings')
plt.xlabel('Country')
plt.ylabel('Number of Sightings')
plt.xticks(rotation=45)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
Objective 2: Analyze the dataset to determine the average
values of the various data points that collectively represent
a single UFO sighting
Explanation of Results
To get a general understanding of UFO sightings, the goal of this analysis is to determine the
average characteristics across multiple data points. A total of 88,673 sightings were recorded.
The average duration of a sighting was estimated at 8,392 seconds, or about 140 minutes. Note,
however, that this estimate hides large disparities, from split-second sightings to events
lasting over a day. We also computed averages of the geographic coordinates. The average
latitude is close to 37.45, and the average longitude is near -85.02. These values give a
picture of where sightings are most prevalent.
In summary, I attempted to construct a picture of an average UFO sighting from the average
values of several characteristics in the dataset. According to the data, a UFO sighting usually
occurs in the USA and lasts about 8,392 seconds, or 140 minutes, although both brief glimpses
and extremely long events are possible. As for location, the average latitude is about 37.45
and the average longitude is about -85.02. Of course, sightings vary widely, so none of these
values should be treated as reliable for any individual sighting: they are intended as a
general overview of the fields, and they underline the wide range of variation in sightings.
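The effect of very long events on an average can be illustrated with a small example. The durations below are hypothetical and were chosen only to show the effect; the median is included for comparison and is not a statistic reported in this project.

```python
import statistics

# Hypothetical durations in seconds, chosen only to illustrate the effect;
# they are not taken from the dataset.
durations = [2, 30, 60, 120, 300, 600, 86400]

mean_duration = statistics.mean(durations)
median_duration = statistics.median(durations)

print("mean:  ", mean_duration)
print("median:", median_duration)
```

A single day-long event pulls the mean far above the median, which is why an average duration on its own says little about a typical sighting.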
Visualisation
import matplotlib.pyplot as plt
Visualisation
import csv
import matplotlib.pyplot as plt
# Initialize lists to store latitude and longitude data
latitude = []
longitude = []
Limitations
Despite the insights gained, the current analysis has several limitations that should be kept
in mind. The dataset cannot cover all UFO sightings because it contains only reported ones, and
such reports are prone to biases and errors. The sightings are also heavily USA-centric.
Furthermore, as the dataset covers only sightings reported to a single organization, UFO
activity may have gone unregistered by it. Finally, the dataset may contain inaccuracies and
missing data that affect the analysis. In addition, as the collection includes reports from the
20th century, some older data points may have lost information or become distorted with time.
Findings should therefore be interpreted with caution, taking into account potential biases and
limitations in the data.
Future Work
Future research could diversify the sources of data. A more detailed examination of trends in
UFO sightings, in particular chronological and regional trends, could form part of further
investigation. It would also be interesting to consider the social and cultural factors that
drive UFO reporting, as they could shed light on why sightings are reported less often in
non-Western parts of the world. Data scientists and machine learning engineers could also apply
AI or machine learning methods to the dataset to strengthen UFO research.
Video Presentation
I have submitted a video with my voiceover, providing a concise explanation of my project's
design, key findings, successful aspects, and any challenges encountered.