IT445 Project
Project
Deadline: Sunday 08/05/2022 @ 23:59
[Total Mark for this Project is 15]
Group Details: CRN: 21907
Name: Maha AlMutairi. ID: S180056413
Name: Hanan AlEnazi. ID: S160165963
Name: Doaa Zainaddin. ID: S180225092
Name: Somaya AlShehri. ID: S170068838
Name: Saadah AlMutairi. ID: S170281918
Instructions:
You must submit two separate copies (one Word file and one PDF file) using the Assignment Template on
Blackboard via the allocated folder. These files must not be in compressed format.
It is your responsibility to check and make sure that you have uploaded both the correct files.
Zero mark will be given if you try to bypass the Safe Assign (e.g. misspell words, remove spaces between
words, hide characters, use different character sets, convert text into image or languages other than English
or any kind of manipulation).
Email submission will not be accepted.
You are advised to make your work clear and well-presented. This includes filling your information on the
cover page.
You must use this template, failing which will result in zero mark.
You MUST show all your work, and text must not be converted into an image, unless specified otherwise by
the question.
Late submission will result in ZERO mark.
The work should be your own, copying from students or other resources will result in ZERO mark.
Use Times New Roman font for all your answers.
Pg. 01 Project
Learning Outcome(s):
CLO 1, 2, 5
1, Demonstrate an understanding of the concepts of decision analysis and decision support systems (DSS)
including probability, modelling, decisions under uncertainty, and real-world problems.
2, Describe advanced Business Intelligence, Business Analytics, Data Visualization, and Dashboards.
5, Improve hands-on skills using Excel, and Orange for building Decision Support Systems.
Project 15 Marks
Students can form groups consisting of three students and send their names to the instructor before 3rd March
and select one dataset from the datasets provided in the below link. Otherwise, the instructors will form the
groups automatically, and assign the unselected datasets to the groups.
https://www.coursera.org/articles/data-analytics-projects-for-beginners
"10 free public datasets for EDA"
Use only selected or assigned dataset and analyze the data using Microsoft Excel to discover the structure of
data, trends, patterns, or any anomalies in the data based on your own hypothesis. Perform the following tasks.
You should use visualization to aid your answer.
Your project will include two main parts:
1. The final project report which must incorporate all the following 5 tasks and written using the provided
template. (10 marks distributed among the below tasks).
2. A presentation that illustrates your 5 tasks. (5 marks)
==========================================================
Task 1: Understand and describe the nature and structure of the selected dataset.
(2 marks)
values, etc. You can also generate a new feature from any of the provided
features that may support your hypothesis. Due to the limited processing
power of some devices, you can reduce your dataset to 1000 tuples. (2 marks)
Task 3: Provide descriptive statistics for some features using statistical methods to
understand the dataset better and answer the following analysis questions: (3
marks)
(You are encouraged to pose other analysis questions based on any trend
you notice in the dataset).
Task 5: Show visual representation of your analysis (hint: use the right
chart/graph for your data analysis). (1 mark)
1. Introduction
Asteroids are objects in space, some of which pass close to planets such as Earth. An
asteroid may or may not be hazardous, depending on several attributes and variables. We
found a dataset that records some of these attributes in a form ready for analysis. In this
project, the dataset was obtained, preprocessed, analyzed, and visualized to test the
relationship between two variables: the estimated dimension of an asteroid and whether it
is classified as hazardous.
2. Body section
2.1 Data
The dataset chosen for this project is NASA's data about asteroids, published through
its Near-Earth Object Web Service. The dataset has 40 features and 4687 rows. The
features describe each object: names, IDs, and descriptive measurements such as
dimensions and distances. Feature types include categorical data, such as the name, and
numerical data, such as the ID, the estimated size, and the distance. The data can be
downloaded through this link:
https://www.kaggle.com/datasets/shrutimehta/nasa-asteroids-classification
Hypothesis:
Larger asteroids are generally considered more dangerous than smaller ones. In this
dataset, the asteroids classified as hazardous have a mean diameter of 0.70 kilometers,
while the non-hazardous asteroids have a mean diameter of 0.40 kilometers. We therefore
hypothesize that the estimated minimum size of an asteroid is positively correlated with
its danger level.
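The comparison of mean diameters behind this hypothesis can be sketched outside Excel as well. The snippet below is a minimal illustration, not the procedure used in the report: the helper name `mean_by_class` is hypothetical, and the toy rows are invented values that only show the mechanics.

```python
from statistics import mean


def mean_by_class(rows, value_key, class_key):
    """Return {class label: mean of value_key} for a list of dict rows."""
    groups = {}
    for row in rows:
        groups.setdefault(row[class_key], []).append(float(row[value_key]))
    return {label: mean(values) for label, values in groups.items()}


# Toy rows, invented for illustration only (not taken from the dataset).
toy_rows = [
    {"Est Dia in KM(min)": 0.9, "Hazardous": True},
    {"Est Dia in KM(min)": 0.5, "Hazardous": True},
    {"Est Dia in KM(min)": 0.2, "Hazardous": False},
    {"Est Dia in KM(min)": 0.4, "Hazardous": False},
]
toy_means = mean_by_class(toy_rows, "Est Dia in KM(min)", "Hazardous")
```

With the real CSV, the rows could come from `csv.DictReader`, in which case the class labels would be strings rather than booleans.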
Descriptive Statistics:
The following table shows the descriptive statistics of the minimum and maximum
estimated dimensions, in kilometres, of the detected objects.
The following table shows the descriptive statistics of the estimated minimum
dimension of the asteroids.
We also provide descriptive statistics for other features. Two features worth noting
are the relative velocity and the miss distance, shown in the following table:
                      Relative Velocity km per sec    Miss Dist.(kilometers)
Mean                  13.97081106                     38413467
Standard Error        0.106530016                     318588.5
Median                12.91788922                     39647712
Mode                  12.28855526                     5421689
Standard Deviation    7.293222605                     21811098
Sample Variance       53.19109596                     4.76E+14
Kurtosis              0.81028737                      -1.1896
Skewness              0.887879907                     -0.10239
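The statistics in this table can be reproduced for any numeric column with a short script. The sketch below is an assumption-laden illustration rather than the Excel procedure itself: the function name `describe` is hypothetical, and the skewness and kurtosis use the bias-corrected sample formulas, which are assumed here to match the definitions Excel uses. When several values tie for the mode, it returns the first one encountered, which may differ from Excel's behaviour.

```python
import math
from statistics import mean, median, multimode, stdev, variance


def describe(values):
    """Summary statistics for a numeric list (needs at least 4 values)."""
    n = len(values)
    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 denominator)
    z3 = sum(((x - m) / s) ** 3 for x in values)
    z4 = sum(((x - m) / s) ** 4 for x in values)
    # Bias-corrected sample skewness and excess kurtosis.
    skew = n / ((n - 1) * (n - 2)) * z3
    kurt = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * z4
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))
    return {
        "Mean": m,
        "Standard Error": s / math.sqrt(n),
        "Median": median(values),
        "Mode": multimode(values)[0],  # first of the most frequent values
        "Standard Deviation": s,
        "Sample Variance": variance(values),
        "Kurtosis": kurt,
        "Skewness": skew,
    }


# Toy values, invented for illustration only.
stats = describe([2.0, 2.0, 4.0, 6.0, 6.0])
```

The Standard Error row is the standard deviation divided by the square root of the row count; with 4687 rows, 7.293222605 / sqrt(4687) is about 0.1065, in line with the table.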
The Hazardous feature takes two values, true and false. Their shares are shown in the pie
chart below:
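The shares that such a pie chart displays are simply the class counts divided by the total. A minimal sketch, assuming the labels are read as strings; `class_shares` is a hypothetical helper and the toy labels are invented:

```python
from collections import Counter


def class_shares(labels):
    """Return {label: fraction of rows carrying that label}."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Toy labels, invented for illustration only.
toy_shares = class_shares(["True", "False", "False", "False"])
```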
2.2 Methods
The data was obtained online from the Kaggle website, which hosts thousands of datasets.
The dataset was downloaded as a CSV (comma-separated values) file and opened in MS Excel
for analysis, pre-processing, and preparation. The selected dataset has been processed and
studied multiple times by researchers in the field. The method used in this research is
quantitative: the numerical features are analysed to determine the "class" of each detected
object.
2.3 Analysis
Data Pre-processing
The original dataset has some features that we do not need for analysis and classification,
including Name and New reference ID. We deleted these features from the dataset.
The feature named "Close Approach Date" was also deleted, since the date is not related to
this analysis and adds nothing to the classification. The same applies to the feature named
"Orbit Determination Date".
The original dataset also has a feature named "Orbiting body" that holds a single value,
"earth". A feature with one constant value cannot distinguish between objects, so we
deleted it. The same applies to the feature named "Equinox", which also contains only one
value.
The remaining features matter to the analysis but include redundant data that has to be
deleted. These are features that record the same quantity in several measurement units,
like the ones below:
"Est Dia in KM(min)", "Est Dia in KM(max)", "Est Dia in M(min)", "Est Dia in M(max)",
"Est Dia in Miles(min)", "Est Dia in Miles(max)", "Est Dia in Feet(min)", "Est Dia in
Feet(max)".
We deleted all of these features except the ones measured in KM.
We also noticed that the class in our dataset, which records whether the object is
hazardous, is written in true and false form. We converted it into numbers with two
values: 1 for true and 0 for false.
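The pre-processing steps above (dropping unneeded or redundant columns and encoding the class as 0 and 1) can be sketched as follows. This is an illustration under assumptions, not the Excel workflow used in the report: `preprocess` is a hypothetical helper, the column names are copied from the text and may not match the CSV header exactly, and the toy rows are invented.

```python
def preprocess(rows, drop_columns, class_key="Hazardous"):
    """Drop the listed columns and encode the true/false class as 1/0."""
    cleaned = []
    for row in rows:
        kept = {k: v for k, v in row.items() if k not in drop_columns}
        is_true = str(row[class_key]).strip().lower() == "true"
        kept[class_key] = 1 if is_true else 0
        cleaned.append(kept)
    return cleaned


# Column names as spelled in the text; check them against the real header.
DROP = {
    "Name", "Close Approach Date", "Orbit Determination Date",
    "Orbiting body", "Equinox",
    "Est Dia in M(min)", "Est Dia in M(max)",
    "Est Dia in Miles(min)", "Est Dia in Miles(max)",
    "Est Dia in Feet(min)", "Est Dia in Feet(max)",
}

# Toy rows, invented for illustration only.
toy = [
    {"Name": "A", "Est Dia in KM(min)": "0.1", "Est Dia in M(min)": "100",
     "Orbiting body": "earth", "Hazardous": "True"},
    {"Name": "B", "Est Dia in KM(min)": "0.3", "Est Dia in M(min)": "300",
     "Orbiting body": "earth", "Hazardous": "False"},
]
cleaned = preprocess(toy, DROP)
```

With the real file, `rows` could be `list(csv.DictReader(f))`, which yields one dictionary of strings per CSV row.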
Correlation of features:
In this correlation analysis, we focus on the size (dimension) of the asteroids and
whether they are dangerous. In Excel, we created a correlation matrix showing the
relationship between the dimensions (min, max) and the Hazardous feature. The results show
that asteroids with larger estimated minimum and maximum dimensions are more likely to be
classified as hazardous: the coefficients in the table are positive, meaning that the
correlation is positive. The two features are also strongly correlated, as the diagram
shows.
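Each cell of such a correlation matrix is a Pearson correlation coefficient between two columns. The sketch below computes it by hand for one pair, assuming the class has already been encoded as 0 and 1; `pearson` is a hypothetical helper and the toy values are invented.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Toy values, invented for illustration only.
toy_diameters = [0.1, 0.2, 0.3, 0.4]
toy_hazardous = [0, 0, 1, 1]
toy_r = pearson(toy_diameters, toy_hazardous)
```

The coefficient ranges from -1 to 1. When one of the two variables is a 0/1 class, as here, it is also known as the point-biserial correlation.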
We also created a correlation matrix for all variables (features) in the dataset, shown in
the following table:
The matrix covers all features. Positive values indicate a positive correlation and
negative values indicate a negative correlation. We highlighted the Hazardous feature to
show which features are correlated with this class.
We ran a regression analysis on the two features (Est Dia in KM(min)) and (Hazardous) to
see the nature of the relationship between them. The regression line has a positive slope,
as shown in the following diagram.
The following table summarizes the regression, including the R-Square value.
Regression Statistics
Multiple R 0.350919113
R Square 0.123144224
Observations 999
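The slope, intercept, and R Square of a single-predictor regression follow from ordinary least squares. The sketch below is a hand-written illustration, not Excel's regression tool: `fit_line` is a hypothetical helper and the toy values are invented.

```python
def fit_line(xs, ys):
    """Least-squares line y = intercept + slope * x, plus R Square."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    # R Square = 1 - (residual sum of squares / total sum of squares).
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot


# Toy values, invented for illustration only.
toy_slope, toy_intercept, toy_r_square = fit_line(
    [0.1, 0.2, 0.3, 0.4], [0, 0, 1, 1]
)
```

For a regression with one predictor, R Square equals the square of Multiple R, which is why 0.350919113 squared gives 0.123144224 in the table above. Multiple R is in turn the absolute value of the Pearson correlation computed on the same rows.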
2.4 Results
We conducted a correlation analysis and a regression analysis with an R-Square value.
The correlation analysis gives a coefficient of 0.132424352, which indicates a positive
but weak correlation between the two variables. For the regression analysis, the diagram
shows a positive trend. The R-Square value is 0.123144224, meaning that the estimated
minimum dimension explains about 12% of the variation in the hazardous class.
3. Conclusion
The goal of this analysis was to find the relationship between two variables (features)
in the dataset: the estimated dimension of the asteroids and their hazardous
classification. We found a positive correlation and a positive regression slope, with an
R Square value of 0.123144224. Other correlations between the remaining features are
also shown in this report, but we focused on these two variables to test the hypothesis,
which the obtained dataset supports. The results are consistent with the general theory
that asteroids with larger dimensions are more dangerous than smaller ones.
For future analysis, researchers can use the same dataset to test the relationship
between asteroid hazardousness and other features.