ML Project Final
Section : K21UG
Registration no. : 12107709
Roll no. : RK21URA11
Course : CSM354
B.Tech CSE AI & ML
Project of Machine Learning
Topic : Asteroids Classification
Under the guidance of
(Shivangini Gupta)
Teaching Assistant
Acknowledgement
Success often requires preparation, hard work, and perspiration. The path
to success is a long journey that calls for tremendous effort with many bitter
and sweet experiences. This can only be achieved by the Graceful Blessing
from the Almighty on everybody. I want to submit everything beneath the
feet of God.
I want to express my regards to my teacher, Ms. Shivangini Gupta,
for her constant support and guidance throughout my training. I would also
like to thank HOD Ms. Harjeet Kaur, School of Computer Science and
Engineering for introducing such a great program.
I would be failing in my duty if I did not thank my parents for their constant
support, suggestions, inspiration, encouragement, and best wishes for my
success. I am thankful for their supreme sacrifice, eternal benediction, and
ocean-like bowls full of love and affection.
Abstract
This project focuses on the classification of asteroids based on their hazardous
potential using machine learning algorithms. The dataset utilized in this study is
provided by NASA's Near-Earth Object Program and contains various attributes of
asteroids such as size, velocity, orbit, and hazardous classification.
The primary objective of the project is to develop predictive models capable of
accurately identifying asteroids that pose a potential threat to Earth. To achieve this
goal, several machine learning algorithms including Logistic regression, Linear
Regression, Logarithmic and Polynomial Regression, K-Nearest Neighbors (KNN),
and Naive Bayes classifiers are implemented and evaluated.
The methodology involves data preprocessing, exploratory data analysis, model
development, and evaluation. Through rigorous analysis, the project aims to
determine the most effective approach for asteroid classification and contribute to
enhancing planetary defense capabilities against potential asteroid impacts on
Earth.
The findings of this project have significant implications for planetary defense
strategies and highlight the importance of leveraging machine learning techniques
for space science applications. By accurately identifying potentially hazardous
asteroids, decision-makers can prioritize monitoring and mitigation efforts, thereby
reducing the risk of catastrophic impacts on Earth.
Introduction
Asteroids, remnants of the early formation of our solar system, pose potential
hazards to Earth due to their unpredictable trajectories. The National Aeronautics
and Space Administration (NASA) continuously monitors and assesses Near-Earth
Objects (NEOs) to mitigate potential threats. In this report, we delve into the
classification of asteroids based on their hazardous potential using machine learning
algorithms.
The dataset utilized in this analysis comprises comprehensive information about
various asteroids, including their physical characteristics, orbital parameters, and
close approach details. With the aid of machine learning techniques, we aim to
develop predictive models capable of discerning hazardous asteroids from non-
hazardous ones.
The primary objective of this project is to leverage the dataset to train and evaluate
several machine learning algorithms for asteroid classification. By employing
algorithms such as Logistic Regression, Linear Regression, Logarithmic and
Polynomial Regression, K-Nearest Neighbors, and Naive Bayes, we intend to
identify the most effective approach for accurately categorizing asteroids based on
their potential threat to Earth.
This analysis holds significant importance in enhancing our understanding of
asteroid dynamics and improving our ability to identify and prioritize potentially
hazardous objects. By developing robust classification models, we aim to contribute
to NASA's ongoing efforts in planetary defense and risk assessment.
Through this report, we provide insights into the methodology, results, and
implications of employing machine learning algorithms for NASA asteroid
classification. It is our hope that this study will aid in advancing our capabilities for
asteroid monitoring and ultimately safeguarding our planet from potential impacts.
Data Understanding
1. id: An identifier for each entry in the dataset.
2. spkid: The object's SPK-ID, a numeric identifier assigned to it in NASA/JPL's
small-body catalog.
3. full_name: The full name of the celestial object.
4. pdes: The object's primary designation.
5. name: The name of the object.
6. prefix: Any prefix associated with the object's name.
7. neo: A binary flag (stored as a text column) denoting whether the object is a
Near-Earth Object (NEO).
8. pha: A binary flag (also stored as text) indicating whether the object is a
Potentially Hazardous Asteroid (PHA).
9. H: The absolute magnitude of the object, providing information about its intrinsic
brightness.
10. diameter: The diameter of the object, in kilometers.
11. albedo: The albedo of the object, indicating how reflective its surface is.
12. diameter_sigma: Uncertainty or error associated with the diameter
measurement.
13. orbit_id: Identifier for the object's orbit.
14. epoch: The epoch of the orbital elements.
15. epoch_mjd: The epoch expressed as Modified Julian Date (MJD).
16. epoch_cal: The epoch expressed in calendar date format.
17. equinox: The equinox used for the orbital elements.
18. e: Eccentricity of the object's orbit.
19. a: Semi-major axis of the object's orbit.
20. q: Perihelion distance of the object's orbit.
21. i: Inclination of the object's orbit.
22. om: Longitude of the ascending node of the object's orbit.
23. w: Argument of perihelion of the object's orbit.
24. ma: Mean anomaly of the object's orbit.
25. ad: Aphelion distance of the object's orbit.
26. n: Mean motion of the object's orbit.
27. tp: Time of perihelion passage of the object's orbit.
28. tp_cal: Time of perihelion passage in calendar date format.
29. per: Orbital period of the object's orbit.
30. per_y: Orbital period in years.
31. moid: Minimum Orbit Intersection Distance, a measure of how close an
asteroid's orbit approaches Earth's orbit.
32. moid_ld: MOID expressed in Lunar Distance units.
33. sigma_e: Uncertainty or error associated with eccentricity.
34. sigma_a: Uncertainty or error associated with semi-major axis.
35. sigma_q: Uncertainty or error associated with perihelion distance.
36. sigma_i: Uncertainty or error associated with inclination.
37. sigma_om: Uncertainty or error associated with longitude of the ascending
node.
38. sigma_w: Uncertainty or error associated with argument of perihelion.
39. sigma_ma: Uncertainty or error associated with mean anomaly.
40. sigma_ad: Uncertainty or error associated with aphelion distance.
41. sigma_n: Uncertainty or error associated with mean motion.
42. sigma_tp: Uncertainty or error associated with time of perihelion passage.
43. sigma_per: Uncertainty or error associated with orbital period.
44. class: Classification of the celestial object.
45. rms: Root Mean Square residual of the fit of the object's orbit.
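Several of the columns above are tied together by simple orbital relations, which is useful when checking the data or choosing features. The sketch below is illustrative only: the function names are hypothetical, the lunar-distance constant is an approximate assumed value, and the hazard rule is the conventional PHA definition (MOID of at most 0.05 AU and H of at most 22), not something computed in this project.

```python
import math

AU_KM = 149_597_870.7   # kilometres per astronomical unit
LD_KM = 384_400.0       # approximate mean Earth-Moon distance in km (assumed value)


def derived_elements(a_au, e):
    """Return perihelion distance q, aphelion distance ad (AU) and period (years).

    Valid for closed heliocentric orbits (0 <= e < 1).
    """
    q = a_au * (1.0 - e)      # perihelion distance
    ad = a_au * (1.0 + e)     # aphelion distance
    per_y = a_au ** 1.5       # Kepler's third law with a in AU and period in years
    return q, ad, per_y


def moid_in_lunar_distances(moid_au):
    """Convert a MOID given in AU to lunar distances (the moid_ld column)."""
    return moid_au * AU_KM / LD_KM


def is_potentially_hazardous(moid_au, h_mag):
    """Conventional PHA rule: MOID <= 0.05 AU and absolute magnitude H <= 22."""
    return moid_au <= 0.05 and h_mag <= 22.0


# Hypothetical main-belt-like orbit used only as an example
q, ad, per_y = derived_elements(a_au=2.5, e=0.2)
print(q, ad, per_y)
print(moid_in_lunar_distances(0.05))
print(is_potentially_hazardous(moid_au=0.03, h_mag=20.1))
```

Because q, ad, per and per_y are functions of a and e, these columns are strongly redundant and are natural candidates for removal during feature selection.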
Methodology
1. Data Acquisition and Preprocessing:
- Acquire the NASA asteroid dataset containing information about various asteroid
attributes.
- Preprocess the data by handling missing values, encoding categorical variables,
and scaling numerical features if required.
2. Exploratory Data Analysis (EDA):
- Perform EDA to gain insights into the distribution and characteristics of the
dataset.
- Visualize the data using histograms, scatter plots, and correlation matrices to
identify patterns and relationships between variables.
3. Feature Selection:
- Identify the most relevant features for asteroid classification.
- Utilize techniques such as correlation analysis, feature importance, or domain
knowledge to select the subset of features.
4. Model Training:
- Split the dataset into training and testing sets.
- Train various machine learning algorithms on the training data, including:
- Logistic Regression
- Linear Regression
- Logarithmic and Polynomial Regression
- K-Nearest Neighbors (KNN)
- Naive Bayes Classifier
- Tune hyperparameters using techniques like grid search or random search to
optimize model performance.
5. Model Evaluation:
- Evaluate the performance of each model using metrics such as accuracy,
precision, recall, F1-score, and ROC-AUC curve.
- Compare the performance of different algorithms to determine the most effective
approach for asteroid classification.
6. Results Interpretation and Discussion:
- Interpret the results obtained from model evaluation and discuss the strengths
and weaknesses of each algorithm.
- Analyze the factors contributing to the predictive performance and provide
insights into the classification process.
7. Conclusion:
- Summarize the findings of the analysis and highlight the significance of the
results.
- Discuss the implications of the study for NASA's asteroid classification efforts and
future research directions.
8. Report Writing:
- Compile the results, methodology, and discussions into a comprehensive report
format.
- Include visualizations, tables, and figures to support the analysis and conclusions.
The source code relies on the following scikit-learn components:
- train_test_split: This function is used to split datasets into training and testing
subsets.
- LinearRegression: This class implements a linear regression model.
- metrics: This module contains various functions for evaluating the
performance of machine learning models, such as accuracy, precision, recall,
etc.
- StandardScaler: This class standardizes features by removing the mean and
scaling to unit variance.
- PolynomialFeatures: This class generates polynomial features for a given
degree.
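As a rough illustration of how these components fit together, the following sketch chains them on synthetic data. The data, the variable names, and the choice of degree 2 are assumptions made for the example; it is not the pipeline used in this project.

```python
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic quadratic relationship with a little noise (illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(200, 1))
y = 3.0 * X[:, 0] ** 2 - 2.0 * X[:, 0] + rng.normal(0.0, 1.0, size=200)

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Expand the single feature into [1, x, x^2]
poly = PolynomialFeatures(degree=2)
X_train_p = poly.fit_transform(X_train_s)
X_test_p = poly.transform(X_test_s)

# Ordinary least squares on the expanded features
model = LinearRegression().fit(X_train_p, y_train)
r2 = metrics.r2_score(y_test, model.predict(X_test_p))
print(f"Test R^2: {r2:.4f}")
```

Fitting the scaler and the polynomial expansion on the training split only keeps information from the test split out of the model.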
Source Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
import warnings
warnings.filterwarnings('ignore')
# Preview of the loaded DataFrame astr (only the last two of its 45 columns are shown)
class rms
0 MBA 0.43301
1 MBA 0.35936
2 MBA 0.33848
3 MBA 0.39980
4 MBA 0.52191
... ... ...
958519 MBA 0.23839
958520 MBA 0.53633
958521 APO 0.51556
958522 MBA 0.25641
958523 MBA 0.26980
astr.head()
[5 rows x 45 columns]
astr.shape
(958524, 45)
astr.describe()
[8 rows x 35 columns]
astr.dtypes
id object
spkid int64
full_name object
pdes object
name object
prefix object
neo object
pha object
H float64
diameter float64
albedo float64
diameter_sigma float64
orbit_id object
epoch float64
epoch_mjd int64
epoch_cal float64
equinox object
e float64
a float64
q float64
i float64
om float64
w float64
ma float64
ad float64
n float64
tp float64
tp_cal float64
per float64
per_y float64
moid float64
moid_ld float64
sigma_e float64
sigma_a float64
sigma_q float64
sigma_i float64
sigma_om float64
sigma_w float64
sigma_ma float64
sigma_ad float64
sigma_n float64
sigma_tp float64
sigma_per float64
class object
rms float64
dtype: object
astr.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 958524 entries, 0 to 958523
Data columns (total 45 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 958524 non-null object
1 spkid 958524 non-null int64
2 full_name 958524 non-null object
3 pdes 958524 non-null object
4 name 22064 non-null object
5 prefix 18 non-null object
6 neo 958520 non-null object
7 pha 938603 non-null object
8 H 952261 non-null float64
9 diameter 136209 non-null float64
10 albedo 135103 non-null float64
11 diameter_sigma 136081 non-null float64
12 orbit_id 958524 non-null object
13 epoch 958524 non-null float64
14 epoch_mjd 958524 non-null int64
15 epoch_cal 958524 non-null float64
16 equinox 958524 non-null object
17 e 958524 non-null float64
18 a 958524 non-null float64
19 q 958524 non-null float64
20 i 958524 non-null float64
21 om 958524 non-null float64
22 w 958524 non-null float64
23 ma 958523 non-null float64
24 ad 958520 non-null float64
25 n 958524 non-null float64
26 tp 958524 non-null float64
27 tp_cal 958524 non-null float64
28 per 958520 non-null float64
29 per_y 958523 non-null float64
30 moid 938603 non-null float64
31 moid_ld 958397 non-null float64
32 sigma_e 938602 non-null float64
33 sigma_a 938602 non-null float64
34 sigma_q 938602 non-null float64
35 sigma_i 938602 non-null float64
36 sigma_om 938602 non-null float64
37 sigma_w 938602 non-null float64
38 sigma_ma 938602 non-null float64
39 sigma_ad 938598 non-null float64
40 sigma_n 938602 non-null float64
41 sigma_tp 938602 non-null float64
42 sigma_per 938598 non-null float64
43 class 958524 non-null object
44 rms 958522 non-null float64
dtypes: float64(33), int64(2), object(10)
memory usage: 329.1+ MB
astr.nunique()
id 958524
spkid 958524
full_name 958524
pdes 958524
name 22064
prefix 1
neo 2
pha 2
H 9489
diameter 16591
albedo 1057
diameter_sigma 3054
orbit_id 4690
epoch 5246
epoch_mjd 5246
epoch_cal 5246
equinox 1
e 958444
a 958509
q 958509
i 958414
om 958518
w 958519
ma 958519
ad 958505
n 958514
tp 958519
tp_cal 958499
per 958510
per_y 958511
moid 314300
moid_ld 314301
sigma_e 254740
sigma_a 273297
sigma_q 248138
sigma_i 215741
sigma_om 223155
sigma_w 262719
sigma_ma 266816
sigma_ad 269241
sigma_n 251750
sigma_tp 291246
sigma_per 282687
class 13
rms 64386
dtype: int64
astr.isnull().sum()
id 0
spkid 0
full_name 0
pdes 0
name 936460
prefix 958506
neo 4
pha 19921
H 6263
diameter 822315
albedo 823421
diameter_sigma 822443
orbit_id 0
epoch 0
epoch_mjd 0
epoch_cal 0
equinox 0
e 0
a 0
q 0
i 0
om 0
w 0
ma 1
ad 4
n 0
tp 0
tp_cal 0
per 4
per_y 1
moid 19921
moid_ld 127
sigma_e 19922
sigma_a 19922
sigma_q 19922
sigma_i 19922
sigma_om 19922
sigma_w 19922
sigma_ma 19922
sigma_ad 19926
sigma_n 19922
sigma_tp 19922
sigma_per 19926
class 0
rms 2
dtype: int64
astr.index
astr.columns
categorical_columns = astr.select_dtypes(include=['object']).columns
print(categorical_columns)
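The null counts above show that a few columns (name, prefix, diameter, albedo, diameter_sigma) are mostly empty. One possible way to act on that, sketched here on a small hypothetical frame with made-up values and an assumed 50% threshold, is to drop very sparse columns first and only then drop incomplete rows.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical frame standing in for the real dataset
toy = pd.DataFrame({
    "H": [3.4, 4.2, np.nan, 5.3, 6.0],
    "albedo": [0.09, np.nan, np.nan, 0.21, 0.15],
    "diameter": [939.4, np.nan, np.nan, 246.6, 200.0],
    "prefix": [np.nan, np.nan, np.nan, np.nan, "A"],
})

# Fraction of missing values per column
missing_fraction = toy.isnull().mean()

# Drop columns that are missing in more than half of the rows (assumed threshold)
sparse_columns = missing_fraction[missing_fraction > 0.5].index.tolist()
reduced = toy.drop(columns=sparse_columns)

# Keep only the rows that are complete in the remaining columns
complete_rows = reduced.dropna()

print(missing_fraction)
print(sparse_columns)
print(complete_rows)
```

Dropping sparse columns before calling dropna avoids discarding almost every row because of a column that is rarely filled in.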
Linear Regression
1. Data Selection and Cleaning:
# Selecting relevant columns and dropping rows with missing values
ad = astr[["H", "albedo","diameter"]].dropna()
2. Data Preparation:
# Splitting the data into features (x) and target variable (y)
X = ad['albedo'].values.reshape(-1, 1)
y = ad['H']
3. Train-Test Split:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2,
random_state = 42)
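4. Model Training:
# Fitting the linear regression model on the training data
model = LinearRegression()
model.fit(X_train, y_train)
5. Making Predictions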
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)
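The evaluation step that sits between making predictions and plotting can be sketched as follows. This is a minimal example on synthetic, albedo-like data with hypothetical variable names; it is not the project's own evaluation code, and the numbers it prints are unrelated to the reported results.

```python
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in: a weak linear trend buried in noise
rng = np.random.default_rng(1)
X_demo = rng.uniform(0.02, 0.6, size=(500, 1))
y_demo = 15.0 - 4.0 * X_demo[:, 0] + rng.normal(0.0, 1.5, size=500)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
demo_model = LinearRegression().fit(Xd_train, yd_train)
yd_test_pred = demo_model.predict(Xd_test)

# Common regression metrics from sklearn.metrics
r2_test = metrics.r2_score(yd_test, yd_test_pred)
mse_test = metrics.mean_squared_error(yd_test, yd_test_pred)
mae_test = metrics.mean_absolute_error(yd_test, yd_test_pred)

print(f"R^2: {r2_test:.4f}  MSE: {mse_test:.4f}  MAE: {mae_test:.4f}")
```

For a regressor, `model.score(X, y)` returns the same R² value as `metrics.r2_score`, so a regression "score" expressed as a percentage is a coefficient of determination rather than a classification accuracy.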
7. Data Visualization
# Plotting the results
plt.figure(figsize=(10, 6))
plt.scatter(X_train, y_train, color='#DB7348', label='Training data')
plt.scatter(X_test, y_test, color='#48B1DB', label='Testing data')
plt.plot(X_train, y_train_pred, color='black', label='Linear Regression')
plt.xlabel('Albedo')
plt.ylabel('Absolute Magnitude (H)')
plt.title('Linear Regression: Absolute Magnitude as a Function of Albedo')
plt.legend()
plt.show()
Logarithmic Regression
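# Logarithmic and Polynomial Regression Models
# Logarithmic model: diameter as a linear function of log(H); requires H > 0
x_log = np.log(ad['H'].values).reshape(-1, 1)
y = ad['diameter'].values
# Train-test split (parameters assumed to match the other models in this report)
x_train_log, x_test_log, y_train, y_test = train_test_split(x_log, y, test_size=0.2, random_state=42)
model_log = LinearRegression()
model_log.fit(x_train_log, y_train)
# Polynomial model: diameter as a degree-5 polynomial in H
degree = 5
poly = PolynomialFeatures(degree)
x_poly = poly.fit_transform(ad['H'].values.reshape(-1, 1))
x_train_poly, x_test_poly, y_train_poly, y_test_poly = train_test_split(x_poly, y, test_size=0.2, random_state=42)
model_poly = LinearRegression()
model_poly.fit(x_train_poly, y_train_poly)
# Evaluate both fitted curves on a grid spanning the observed range of H
x_plot = np.linspace(ad['H'].min(), ad['H'].max(), 200).reshape(-1, 1)
y_plot_log = model_log.predict(np.log(x_plot))
y_plot_poly = model_poly.predict(poly.transform(x_plot))
plt.figure(figsize=(10, 6))
plt.scatter(ad['H'], ad['diameter'], color='black', label='Original Data')
plt.plot(x_plot, y_plot_log, color='blue', label='Logarithmic Regression')
plt.plot(x_plot, y_plot_poly, color='green', label=f'Polynomial Regression (Degree {degree})')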
plt.xlim([ad['H'].min(), ad['H'].max()])
plt.ylim([y.min(), y.max()])
plt.xlabel('Absolute Magnitude (H)')
plt.ylabel('Diameter')
plt.title('Comparison of Regression Models')
plt.legend()
plt.show()
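For context on why the shape of the curve matters, the standard photometric relation between diameter D (km), geometric albedo p, and absolute magnitude H is D = 1329 / sqrt(p) * 10^(-H/5). Taking log10 of both sides gives log10(D) = log10(1329) - 0.5 * log10(p) - 0.2 * H, which is linear in H and log10(p). The sketch below checks this on synthetic data generated from the formula itself; it is an illustration of the relation, not a model fitted in this project, and all variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic asteroids generated directly from the photometric relation
rng = np.random.default_rng(2)
H_syn = rng.uniform(8.0, 20.0, size=1000)
albedo_syn = rng.uniform(0.03, 0.5, size=1000)
diameter_syn = 1329.0 / np.sqrt(albedo_syn) * 10.0 ** (-0.2 * H_syn)

# Regress log10(diameter) on H and log10(albedo)
features = np.column_stack([H_syn, np.log10(albedo_syn)])
target = np.log10(diameter_syn)
log_model = LinearRegression().fit(features, target)

coef_H, coef_log_albedo = log_model.coef_
intercept = log_model.intercept_
print(coef_H, coef_log_albedo, 10.0 ** intercept)
```

On real measurements the fit would not be exact, but the relation suggests that transforming the target (log of diameter) may suit the data better than transforming H alone.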
Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
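# x (the albedo feature) and Y (the binary target) are prepared earlier in the
# notebook; their construction is not shown in this report.
# Train-test split
x_train, x_test, Y_train, Y_test = train_test_split(x, Y, test_size=0.2, random_state=42)
# Model training
model_logistic = LogisticRegression()
model_logistic.fit(x_train, Y_train)
# Predictions
Y_train_pred = model_logistic.predict(x_train)
Y_test_pred = model_logistic.predict(x_test)
# Model evaluation
train_accuracy = accuracy_score(Y_train, Y_train_pred)
test_accuracy = accuracy_score(Y_test, Y_test_pred)
conf_matrix = confusion_matrix(Y_test, Y_test_pred)
class_report = classification_report(Y_test, Y_test_pred)
print("\nConfusion Matrix:")
print(conf_matrix)
print("\nClassification Report:")
print(class_report)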
Confusion Matrix:
[[ 4376 7367]
[ 2693 11812]]
Classification Report:
precision recall f1-score support
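# Predicted probability of class 1 over a grid of albedo values
x_values = np.linspace(np.asarray(x).min(), np.asarray(x).max(), 300).reshape(-1, 1)
Y_probabilities = model_logistic.predict_proba(x_values)[:, 1]
plt.figure(figsize=(10, 6))
plt.scatter(x_test, Y_test, color='#30CCCF', label='Testing data')
plt.plot(x_values, Y_probabilities, color='red', label='Logistic Regression')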
plt.axhline(0.5, color='black', linestyle='--', label='Decision Boundary')
plt.xlabel('Albedo')
plt.ylabel('Probability of Class 1')
plt.title('Logistic Regression: Decision Boundary and Predictions')
plt.legend()
plt.show()
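With a single input feature, the point where the fitted probability crosses 0.5 can be read directly from the model coefficients: the probability is sigmoid(w * x + b), so the crossing is at x = -b / w. The following sketch demonstrates this on synthetic data; the data and names are assumptions for illustration and do not reproduce the project's target variable.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic one-feature problem: class 1 becomes more likely as the feature grows
rng = np.random.default_rng(3)
x_demo = rng.uniform(0.0, 1.0, size=(400, 1))
p_true = 1.0 / (1.0 + np.exp(-(8.0 * x_demo[:, 0] - 4.0)))
Y_demo = (rng.uniform(size=400) < p_true).astype(int)

clf = LogisticRegression().fit(x_demo, Y_demo)

# For one feature, P(class 1) = sigmoid(w * x + b), which equals 0.5 at x = -b / w
w = clf.coef_[0, 0]
b = clf.intercept_[0]
boundary = -b / w
prob_at_boundary = clf.predict_proba([[boundary]])[0, 1]

print(f"w={w:.3f}, b={b:.3f}, boundary at x={boundary:.3f}")
print(f"P(class 1) at the boundary: {prob_at_boundary:.3f}")
```

The horizontal line at 0.5 in the plot above marks the probability threshold; the corresponding feature value is the vertical position `boundary` computed here.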
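Naive Bayes Classifier
# Assumption: Gaussian Naive Bayes; the report does not state which variant was used
from sklearn.naive_bayes import GaussianNB
# Train-test split
x_train, x_test, Y_train, Y_test = train_test_split(x, Y, test_size=0.2, random_state=42)
# Model training
classifier = GaussianNB()
classifier.fit(x_train, Y_train)
# Predictions
Y_train_pred = classifier.predict(x_train)
Y_test_pred = classifier.predict(x_test)
# Model evaluation
train_accuracy = accuracy_score(Y_train, Y_train_pred)
test_accuracy = accuracy_score(Y_test, Y_test_pred)
conf_matrix = confusion_matrix(Y_test, Y_test_pred)
class_report = classification_report(Y_test, Y_test_pred)
print("\nConfusion Matrix:")
print(conf_matrix)
print("\nClassification Report:")
print(class_report)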
Confusion Matrix:
[[ 3766 7977]
[ 2346 12159]]
Classification Report:
precision recall f1-score support
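# Predictions over a grid of albedo values
albedo_values = np.linspace(np.asarray(x).min(), np.asarray(x).max(), 300).reshape(-1, 1)
Y_pred = classifier.predict(albedo_values)
# Plot decision boundary
plt.plot(albedo_values, Y_pred, color='blue', linewidth=3, label='Decision Boundary')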
plt.xlabel('Albedo')
plt.ylabel('Target')
plt.title('Naive Bayes Decision Boundary')
plt.legend()
plt.show()
KNN Classifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.neighbors import KNeighborsClassifier
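# Train-test split
x_train, x_test, Y_train, Y_test = train_test_split(x, Y, test_size=0.2, random_state=42)
# Model training (scikit-learn's default n_neighbors=5 is assumed; the report does not state k)
classifier = KNeighborsClassifier()
classifier.fit(x_train, Y_train)
# Predictions
Y_train_pred = classifier.predict(x_train)
Y_test_pred = classifier.predict(x_test)
# Model evaluation
train_accuracy = accuracy_score(Y_train, Y_train_pred)
test_accuracy = accuracy_score(Y_test, Y_test_pred)
conf_matrix = confusion_matrix(Y_test, Y_test_pred)
class_report = classification_report(Y_test, Y_test_pred)
print("\nConfusion Matrix:")
print(conf_matrix)
print("\nClassification Report:")
print(class_report)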
Confusion Matrix:
[[ 5243 6500]
[ 4476 10029]]
Classification Report:
precision recall f1-score support
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, classification_report
from matplotlib.colors import ListedColormap
# Predictions
Y_pred = classifier.predict(albedo_values)
# Plot decision boundary
plt.plot(albedo_values, Y_pred, color='blue', linewidth=3, label='Decision Boundary')
plt.xlabel('Albedo')
plt.ylabel('Target')
plt.title('K-Nearest Neighbors Decision Boundary')
plt.legend()
plt.show()
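The classification report tables did not survive in this document, but the per-class figures can be recomputed from the three confusion matrices printed above. The sketch below assumes scikit-learn's layout (rows are true classes, columns are predicted classes, so each matrix reads [[TN, FP], [FN, TP]]) and attributes the second matrix to Naive Bayes because its accuracy matches the 60.67% reported for that model. The function name is hypothetical.

```python
# Confusion matrices copied from the printed outputs, in order of appearance
confusion_matrices = {
    "Logistic Regression": [[4376, 7367], [2693, 11812]],
    "Naive Bayes": [[3766, 7977], [2346, 12159]],
    "KNN": [[5243, 6500], [4476, 10029]],
}


def summarize(matrix):
    """Derive accuracy, class-1 precision/recall/F1 and a majority-class baseline."""
    (tn, fp), (fn, tp) = matrix
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tn + tp) / total,
        "precision_1": precision,
        "recall_1": recall,
        "f1_1": 2 * precision * recall / (precision + recall),
        # Accuracy obtained by always predicting the more frequent true class
        "baseline": max(tn + fp, fn + tp) / total,
    }


summary = {name: summarize(m) for name, m in confusion_matrices.items()}
for name, stats in summary.items():
    print(name, {key: round(value, 4) for key, value in stats.items()})
```

The accuracies come out at about 61.67%, 60.67%, and 58.18%, while always predicting the majority class would already reach about 55.26% on this test split, so the three classifiers improve on a trivial baseline by only a few percentage points.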
Results and Discussion
After implementing various machine learning algorithms for NASA asteroid
classification, we obtained insightful results that shed light on the effectiveness of
different models in categorizing asteroids based on their hazardous potential.
Data Pre-processing:
Data Inspection:
- The dataset contains 958524 rows and 45 columns.
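- Albedo is used as the input feature for the classifiers; it measures surface
reflectivity rather than hazard. The hazard indicator in the dataset is the pha column.
- Several columns contain missing values (for example, diameter and albedo are
missing for more than 820,000 rows), so rows with missing H, albedo, or diameter
were dropped before modeling.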
Dimensionality Reduction:
- Features highly correlated with each other were removed to reduce redundancy.
- Features such as 'prefix' and 'equinox' were removed because each contains
only a single distinct value across all observations.
- The hazard indicator 'pha' was encoded into binary values (1 for hazardous, 0 for
non-hazardous).
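The two reduction steps described here (dropping single-valued columns and dropping one of each highly correlated pair) could look roughly like the following. The toy frame, the 0.95 threshold, and the variable names are assumptions for illustration; the report does not show the code or threshold actually used.

```python
import numpy as np
import pandas as pd

# Toy frame: per_y is a function of a, and equinox never changes
rng = np.random.default_rng(4)
a = rng.uniform(2.0, 3.5, size=300)
toy = pd.DataFrame({
    "a": a,
    "per_y": a ** 1.5,
    "e": rng.uniform(0.0, 0.3, size=300),
    "equinox": ["J2000"] * 300,
})

# Step 1: drop columns that hold a single distinct value
constant_cols = [c for c in toy.columns if toy[c].nunique(dropna=False) <= 1]
toy = toy.drop(columns=constant_cols)

# Step 2: for each pair with |correlation| above the threshold, drop the later column
corr = toy.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
threshold = 0.95  # assumed cut-off
correlated_cols = [c for c in upper.columns if (upper[c] > threshold).any()]
reduced = toy.drop(columns=correlated_cols)

print(constant_cols, correlated_cols, list(reduced.columns))
```

Looking only at the upper triangle of the correlation matrix ensures that one column of each correlated pair is kept rather than both being dropped.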
Models and Scores:
Logistic Regression:
- Achieved an accuracy score of approximately 61.67%.
Logarithmic and Polynomial Regression:
- Achieved a test score (R²) of approximately 39.12% for logarithmic regression.
- Achieved a test score (R²) of approximately 86.12% for the polynomial model of
degree five.
Linear Regression:
- Achieved a test R² score of approximately 4.57%.
K-Nearest Neighbors (KNN):
- Achieved an accuracy score of approximately 58.18%.
- KNN scored lower than the other classifiers, suggesting that it
might not be the best choice for this dataset.
Naive Bayes:
- Achieved an accuracy score of approximately 60.67%.
- Naive Bayes performed similarly to Logistic Regression.
Conclusion:
- Logistic Regression performed moderately well, but there may be room for
improvement.
- Logarithmic regression performed poorly, while polynomial regression with a
degree of five showed promising performance, indicating that higher-order
polynomial features might capture the underlying patterns better.
- Linear regression performed poorly, suggesting that the relationship between
the features and the target variable may not be linear.
- KNN scored lower than the other classifiers, indicating that it
might not be the best choice for this dataset. This could be due to noisy data or
to the limited information carried by a single input feature.
- Naive Bayes performed similarly to Logistic Regression, indicating that it's a
suitable choice for this dataset. However, its performance could still be
improved.
Overall, polynomial regression with a degree of five achieved the highest score
among the models tested. Its score measures how well diameter is fitted from H,
so it is not directly comparable with the classification accuracies of logistic
regression and Naive Bayes, which were the strongest classifiers. Linear
regression and KNN performed relatively poorly, suggesting that they may not be
well-suited for this dataset. Further experimentation with feature engineering,
model tuning, and potentially exploring other algorithms could lead
to better performance.
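As one possible form of the model tuning mentioned here, the number of neighbours in KNN can be chosen by cross-validated grid search. The sketch below uses synthetic two-feature data, an assumed grid of candidate values, and F1 as the selection metric; none of these choices come from the project itself.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary problem with two informative features (illustrative only)
rng = np.random.default_rng(5)
X_syn = rng.normal(size=(600, 2))
Y_syn = (X_syn[:, 0] + 0.5 * X_syn[:, 1] + rng.normal(0.0, 0.5, size=600) > 0).astype(int)

Xs_train, Xs_test, Ys_train, Ys_test = train_test_split(
    X_syn, Y_syn, test_size=0.2, random_state=42, stratify=Y_syn
)

# Scaling lives inside the pipeline so it is re-fitted within every CV fold
pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())
param_grid = {"kneighborsclassifier__n_neighbors": [3, 5, 11, 21, 41]}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="f1")
search.fit(Xs_train, Ys_train)

best_k = search.best_params_["kneighborsclassifier__n_neighbors"]
test_f1 = search.score(Xs_test, Ys_test)  # uses the same F1 scorer
print(f"Best k: {best_k}, test F1: {test_f1:.3f}")
```

Selecting k on cross-validation folds of the training split keeps the test split untouched for the final estimate.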
Objective and Scope of the Project
Objective:
The primary objective of this project is to utilize machine learning algorithms for the
classification of asteroids based on their hazardous potential. By analyzing a dataset
provided by NASA containing various attributes of asteroids, the project aims to
develop predictive models capable of identifying asteroids that pose a potential
threat to Earth.
Scope:
1. Data Acquisition and Preprocessing:
- Acquiring the NASA asteroid dataset containing information on asteroid
attributes such as size, velocity, orbit, and hazardous classification.
- Preprocessing the dataset to handle missing values, encode categorical variables,
and scale numerical features as necessary.
2. Exploratory Data Analysis (EDA):
- Conducting exploratory data analysis to gain insights into the distribution and
characteristics of the dataset.
- Visualizing data using histograms, scatter plots, and correlation matrices to
identify patterns and relationships between variables.
3. Model Development:
- Implementing various machine learning algorithms, including but not limited to
logistic regression, Logarithmic and Polynomial Regression, Linear Regression, K-
Nearest Neighbors (KNN), and naive Bayes classifiers.
- Training these models on the dataset to predict the H, Albedo and diameter of
asteroids.
4. Model Evaluation:
- Evaluating the performance of each model using metrics such as accuracy,
precision, recall, and F1-score.
- Comparing the performance of different algorithms to identify the most effective
approach for asteroid classification.
5. Results Interpretation and Discussion:
- Interpreting the results obtained from model evaluation and discussing the
strengths and weaknesses of each algorithm.
- Analyzing factors contributing to predictive performance and providing insights
into the classification process.
6. Conclusion and Recommendations:
- Summarizing the findings of the analysis and highlighting the significance of the
results.
- Discussing the implications of the study for NASA's asteroid classification efforts
and suggesting potential future research directions.
By achieving the objectives outlined above within the defined scope, the project aims
to contribute to the field of asteroid classification and assist in enhancing planetary
defense capabilities against potential asteroid impacts on Earth.
Conclusion
Links
1. Ipynb file:
https://drive.google.com/file/d/1vXbtjbKXmt5Z7ih4oHCqpf2M4wineEXe/view?usp=sharing
2. Dataset file:
https://drive.google.com/file/d/1OVbeaxUM81hU2KaAfg82EzJSaA7Iuqpm/view?usp=sharing