2.1. Acquisition and Processing of Data
The first step was data acquisition, with data retrieved from the Gaia DR3 catalog. Initially, the stars were identified using SIMBAD (https://simbad.cds.unistra.fr/simbad/, accessed on 16 January 2024), a dynamic database that provides information on astronomical objects published in scientific articles and in free databases [13].
Subsequently, a crossmatch was performed against the xp_continuous_mean_spectrum table within Gaia DR3 to retrieve the astronomical objects by star type. This table contains the mean BP and RP spectra based on the continuous representation in basis functions [3].
Table 1 displays some of its columns, which include the necessary information used to reconstruct the calibrated spectra of the astronomical objects.
The calibrated spectra are represented as a linear combination of basis functions instead of using the conventional flux and wavelength table. This approach helps to avoid potential loss of information when sampling the spectra [4]. The pseudowavelength, denoted as $u$, is used to represent the spectrum $h_{s,k}(u)$ of a source $s$ observed in calibration unit $k$. In this representation, the spectrum is transformed into a linear combination of bases, which defines the mean spectrum and can be expressed by the following equation:

$$h_{s,k}(u) = \sum_{n} b_{s,n} \, (\varphi_n * K_k)(u - u_0),$$

where:
$b_{s,n}$ represents the spectral coefficients of the source spectrum $s$;
$\varphi_n$ is a linear combination of basis functions;
$K_k$ represents the convolution kernel, which can be expressed as a linear combination of polynomial basis functions;
$u_0$ is a conveniently chosen reference pseudowavelength;
the kernel coefficients are defined as a polynomial in the AC (Across Scan) coordinate [4].
The following quantities of spectra were retrieved per star type from the Gaia DR3 (GDR3) catalog, available on the Gaia Archive website: 201 symbiotic stars, 574 planetary nebulae, and 69,146 red giants. This count resulted from the crossmatch between the SIMBAD and GDR3 databases. The number of red giants is considerably higher than that of the other star types because they have a much greater representation in our galaxy. Therefore, a subset of these red giants was selected, specifically a sample of 1200.
Table 2 displays a comparative class distribution between originally downloaded stellar spectra and those selected for the initial analysis dataset, illustrating the imbalance in the representation of different star types.
2.2. Data Preprocessing
The raw downloaded spectra were internally calibrated within each wavelength range of BP and RP. These were processed using the GaiaXPy library, where each spectrum is calibrated and sampled to a default uniform wavelength grid using the calibrate routine, resulting in a single spectrum on the wavelength range covered by BP and RP.
This calibration and sampling process generates flux values for all the sampled absolute spectra, resulting in a total of 343 values per spectrum. The default sampling was used, resulting in a wavelength range from 336 to 1020 nm, with a 2 nm increment between each sampling point.
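For reference, a minimal sketch of the default sampling grid is shown below. The GaiaXPy `calibrate` call itself is left as a comment because it requires the downloaded coefficient table (the file name used there is a placeholder, not the actual file from this study):

```python
import numpy as np

# The calibration step uses GaiaXPy's `calibrate` routine on the downloaded
# coefficient table (the file name below is a placeholder):
#   from gaiaxpy import calibrate
#   calibrated_df, sampling = calibrate("xp_continuous_mean_spectrum.csv")
#
# With default settings, each absolute spectrum is sampled on a uniform grid
# from 336 to 1020 nm in 2 nm steps, i.e. 343 flux values per spectrum:
default_sampling = np.arange(336, 1022, 2)  # wavelengths in nm

print(default_sampling.size)                     # 343
print(default_sampling[0], default_sampling[-1]) # 336 1020
```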
To improve the performance and stability of machine learning algorithms during training and inference, min–max normalization was applied to the flux values of each spectrum, setting a scale of 0–1 [14]. This approach expresses all spectrum values as a fraction of the flux range, establishing a common scale across different spectra (see Figure 1). This prevents certain variables from dominating others due to their absolute values. The following equation shows how min–max normalization is applied:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
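As a minimal sketch of this step (the function name is illustrative):

```python
import numpy as np

def minmax_normalize(flux):
    """Rescale a spectrum's flux values to the 0-1 range."""
    fmin, fmax = flux.min(), flux.max()
    return (flux - fmin) / (fmax - fmin)

spectrum = np.array([2.0, 5.0, 11.0])
normalized = minmax_normalize(spectrum)
print(normalized)  # values: 0.0, 0.333..., 1.0
```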
The normalization process allows for a clearer distinction between different types of spectra, even at a glance. The spectrum of a planetary nebula (PN) is predominantly composed of emission lines, while red giant (RG) spectra mainly exhibit a continuum with many absorption lines and/or bands. Symbiotic star (SS) spectra, in turn, represent a combination of both, displaying emission lines and absorption bands in the IR region of the spectrum (700 to 1000 nm).
This dataset consists of 1975 records representing the spectra of the target stars. Each record is composed of 343 features, corresponding to the normalized flux values within the wavelength range of 336 to 1020 nm.
Furthermore, an extra column was incorporated in the dataset, which contains the corresponding labels for the star types. These labels are crucial for identifying and classifying each spectrum based on its category, enabling the utilization of supervised algorithms for training and prediction purposes.
The data exhibit a notable class imbalance, as there is a significant difference in the number of spectra for each star type. Data imbalance can have a detrimental impact on the performance of machine learning algorithms because they may struggle to learn patterns for minority classes and make inaccurate decisions. This assertion is validated in Section 3.3, which further substantiates the analysis through weighted loss calculations and presents the outcomes of a ten-fold cross-validation procedure, assessing the findings by computing mean and standard deviation values.
To mitigate this issue and improve classification accuracy, data balancing techniques were implemented, such as oversampling the minority class and undersampling the majority class, ensuring an equitable distribution between both classes [15]. This approach allows classification algorithms to receive a balanced representation of the classes during training.
In addition to the data balancing techniques applied in this study, such as oversampling of minority classes and synthetic data generation with noise, it is important to consider other methods to address the imbalance. An alternative approach that can be effective is the use of weighted loss during model training. This technique assigns a higher weight to minority classes in the algorithm’s loss function, thus compensating for the disproportion in class representation without modifying the original dataset [16].
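As an illustration of this alternative, per-class weights can be derived directly from label frequencies. The sketch below uses scikit-learn's `compute_class_weight` with toy labels mirroring the downloaded counts (it is not the authors' exact procedure):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels mirroring the counts used here (RG: 1200 selected, PN: 574, SS: 201)
y = np.array(["RG"] * 1200 + ["PN"] * 574 + ["SS"] * 201)
classes = np.unique(y)  # alphabetical order: PN, RG, SS

weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
class_weight = dict(zip(classes, weights))
# The rarest class (SS) receives the largest weight; this mapping can be
# passed via the `class_weight` argument of classifiers such as SVC.
print(class_weight)
```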
Our study will adopt a comparative approach, first analyzing the results obtained with the original imbalanced dataset and then comparing them with the outcomes after applying class balancing techniques. This methodology will allow us to objectively assess the impact of class imbalance on our specific problem and justify any decisions regarding the use of data balancing techniques.
The original dataset exhibits a significant imbalance in the representation of different star types, as illustrated in Table 2. This imbalance poses potential challenges for training machine learning algorithms, as it could lead to bias towards the majority class (red giant stars) and poor performance in classifying minority classes (symbiotic stars and planetary nebulae).
To address this issue, we propose a two-phase approach instead of immediately applying class balancing techniques:
Initial analysis with imbalanced data—First, we will train and evaluate our models using the original imbalanced dataset. This will allow us to establish a baseline performance and assess the actual impact of class imbalance on our specific problem;
Comparison with balanced data—If significant bias or poor performance is observed in the minority classes, we will proceed to apply class balancing techniques. We will use the oversampling method for minority classes, as described earlier, and compare the results with the original dataset.
As will be demonstrated in the following section, due to the class imbalance and the suboptimal performance exhibited by some algorithms on the imbalanced dataset, a decision was made to construct a balanced dataset. This new dataset ensures that each star type is represented by 1000 samples. The choice of selecting one thousand objects per class is based on several key factors. Firstly, this number is large enough to provide a representative and robust sample of each class, allowing machine learning algorithms to adequately capture the features and variability of the data. Additionally, having one thousand objects per class ensures a proper balance, mitigating bias toward the majority class and enhancing the model’s ability to generalize and correctly recognize objects from minority classes [17].
This balanced approach aims to address both the inherent class imbalance in the original data and the performance issues observed with certain algorithms (SVM and Naive Bayes), potentially leading to more accurate and reliable classification results across all star types. The comparative results of the algorithms’ performance on both the original imbalanced dataset and this new balanced dataset will be presented and discussed in detail in the Results section.
In the case of red giants, there was no issue, as the recovered quantity exceeded this number; samples were therefore randomly selected until the desired quantity was reached. For symbiotic stars and planetary nebulae, however, the number of samples was insufficient, so new spectra were generated from the original ones by adding white noise. A sequence of random numbers was generated, following a normal distribution with mean 0 and a variable standard deviation ranging from 0.01 to 0.05.
The process of generating new spectra involved combining the original data with the generated white noise (see Figure 2), and this allowed for an expansion of the dataset and a balancing of the classes, ensuring that all categories were adequately represented.
Applying the same range of standard deviations to all spectra ensures that all samples receive a comparable amount of added random variability. This avoids possible biases or excessive differences between the generated spectra, which could affect the interpretation and comparison of the results. These standard deviation values are in line with typical noise fluctuations observed in many scientific experiments and spectroscopic measurements.
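The augmentation step can be sketched as follows (variable names are illustrative, and a fixed seed is used for reproducibility):

```python
import numpy as np

rng = np.random.default_rng(42)

def augment_spectrum(flux, sigma):
    """Return a synthetic spectrum: the original flux plus Gaussian
    white noise with mean 0 and standard deviation sigma."""
    return flux + rng.normal(0.0, sigma, size=flux.shape)

original = rng.random(343)  # placeholder for one normalized spectrum
# Standard deviations drawn from the 0.01-0.05 range described above
sigmas = rng.uniform(0.01, 0.05, size=5)
synthetic = [augment_spectrum(original, s) for s in sigmas]
```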
This new balanced dataset, like the previous one, would consist of 343 features representing the flux values, and an additional column representing the spectrum label. In this case, a final count of 3000 spectra was achieved, with an equal distribution of 1000 spectra for each object type. This balanced dataset ensures that each type of star is adequately and proportionally represented, which is crucial to avoiding biases and enabling more accurate analysis and modeling.
Table 3 presents the distribution of stellar spectra in the balanced dataset, categorized by star type. It illustrates the composition of each class, distinguishing between original spectra obtained from observations and those synthetically generated to achieve balance.
2.3. Exploring Class Differences through t-SNE Visualization
To analyze potential differences between classes in our study, we employed the t-SNE (t-Distributed Stochastic Neighbor Embedding) algorithm. t-SNE is a popular unsupervised machine learning technique for data visualization and dimensionality reduction [18]. We applied t-SNE to our dataset, projecting it into a lower-dimensional space of two dimensions. By analyzing the resulting plots, we were able to identify clusters or groupings of samples that shared similar characteristics. These clusters provided insights into the presence of distinct classes and shed light on the differences between them. Additionally, t-SNE enabled the identification of outliers or samples that deviated from the main clusters.
The t-SNE projection of the balanced data shows a significant improvement in the separation and definition of class clusters. This indicates that the classes are more distinguishable from each other compared to the unbalanced data. The resulting visualization provides a clearer representation of the inherent differences between the classes (see Figure 3).
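A minimal sketch of the projection step, with random data standing in for the spectra matrix:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.random((300, 343))  # placeholder for the normalized-flux matrix

# Project the 343-dimensional spectra onto two dimensions for plotting
embedding = TSNE(n_components=2, random_state=0).fit_transform(X)
print(embedding.shape)  # (300, 2)
```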
However, despite this improvement, the presence of overlapping points between the classes can still be observed. This suggests that there may be inherent similarities or shared characteristics between certain samples from different classes. These areas of overlap indicate that the boundaries between the classes are not clearly defined and may represent cases where classification is more challenging.
It is important to note that, when applying machine learning algorithms to classify these classes, it is possible to achieve good overall results due to the improved separation and definition of the clusters, but it is also normal to expect some errors in classification due to the presence of overlapping points and similarities between the classes. However, the performance is expected to improve relative to the t-SNE projection, since the classifiers use all of the features, whereas the t-SNE visualization uses only two embedded dimensions.
2.4. Analysis and Selection of Algorithms
The formed datasets were divided into two subsets each. The first subset, representing 80% of the total samples, would be the training set, which was used to train various machine learning algorithms. This 80/20 split was chosen based on the Pareto principle or 80/20 rule, a common practice in machine learning. This division strikes a balance between having enough data to train robust models and retaining an adequate amount for subsequent validation [19]. During the training process, the algorithms learn patterns and relationships in the data to make predictions or decisions based on new data. The goal is for the algorithms to capture the underlying patterns in the training data and be able to generalize that knowledge to unseen data.
The other subset, representing the remaining 20%, was reserved for testing purposes. This dataset is used exclusively for evaluation and is not used during training. The aim of testing is to determine whether the algorithms have successfully learned and generalized without overfitting. Using a separate test set helps detect whether the algorithm has overfitted the training data, and provides a more realistic estimation of its performance.
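The split can be sketched with scikit-learn as follows (random data stands in for the spectra; the stratified option is an assumption added here to preserve class proportions, not something stated in the text):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((3000, 343))              # placeholder flux matrix
y = np.repeat(["SS", "PN", "RG"], 1000)  # placeholder labels

# 80/20 split; stratify keeps the class proportions in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
print(X_train.shape, X_test.shape)  # (2400, 343) (600, 343)
```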
For the analysis, the following supervised ML algorithms were used for classification: Random Forest, Support Vector Machine, Artificial Neural Networks, Gradient Boosting, and Naive Bayes. The selection of these algorithms provides a diverse combination of classification approaches, allowing for the evaluation and comparison of their performance on the test set. This enables us to obtain a more comprehensive understanding of their classification capability and to determine which one best suits our specific problem.
2.4.1. Algorithm Random Forest
Random Forest is a supervised machine learning algorithm that combines tree predictors in a way that each tree depends on the values of a randomly sampled vector, independently and with the same distribution for all trees in the forest [20]. Decision trees tend to overfit, meaning they learn the training data accurately but struggle to apply that knowledge to new data. However, it is possible to enhance their generalization ability by combining multiple trees into a set. This technique, known as an ensemble, has been proven to be highly effective in various problems, striking a balance between ease of use, flexibility, and the ability to apply learning to different situations.
An advantage of this algorithm is that it does not require scaled data. However, in our case, the data were normalized, which allows all parameters to have equal importance. Several training tests were conducted by varying the parameters provided to the algorithm in each case (see Table 4).
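A hedged sketch of one such training run, with synthetic data standing in for the spectra (the parameter grids actually explored are those listed in Table 4):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the three-class, 343-feature spectral dataset
X, y = make_classification(n_samples=300, n_features=343, n_informative=20,
                           n_classes=3, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.score(X, y))  # training accuracy
```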
2.4.2. Algorithm Support Vector Machine
Support Vector Machine (SVM) is a supervised machine learning algorithm primarily used for data classification. Instead of directly operating on the original data, SVM represents them as points in a multi-dimensional space [21]. Each feature becomes a coordinate of these points, enabling us to visualize and analyze the relationships between variables. The goal of SVM is to find the hyperplane that optimally separates the classes.
Different parameter tests were conducted, using different kernels for each one.
Table 5 displays the analyzed configurations.
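A kernel sweep in the spirit of those tests can be sketched as follows (illustrative kernels and synthetic data only; the configurations actually evaluated are in Table 5):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=343, n_informative=20,
                           n_classes=3, random_state=0)

# Fit one SVM per kernel and record its training accuracy
scores = {}
for kernel in ("linear", "rbf", "poly"):
    scores[kernel] = SVC(kernel=kernel).fit(X, y).score(X, y)
print(scores)
```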
2.4.3. Algorithm Artificial Neural Networks
Artificial Neural Networks (ANN) are a subset of machine learning tools and are at the core of deep learning algorithms. Their name and structure are inspired by the human brain, trying to reproduce the way biological neurons send signals to each other. They consist of several layers of nodes, including an input layer, one or more hidden layers, and an output layer. ANNs possess high processing speeds and the ability to learn the solution to a problem from a set of examples [22].
The designed neural network has the following topology: an input layer of 64 neurons, followed by three hidden layers of 32 neurons each. All layers are dense, meaning all neurons are fully connected, and they use the ReLU activation function to introduce nonlinearity into the data. After each dense layer, a Dropout layer is added, which randomly deactivates 10% of the neurons during training. This helps prevent overfitting and improves the generalization ability of the model. The output layer consists of three neurons and uses the softmax activation function, commonly used in multiclass classification problems.
To compile the model, the “adam” optimizer is used, which is an optimization algorithm that adjusts the weights of the neural network during training. The loss function is set as “sparse_categorical_crossentropy”, which is suitable for multiclass classification problems with integer labels.
Table 6 showcases the neural network configuration.
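The topology described above can be sketched in Keras as follows (a sketch based on the stated description, not the authors' exact code):

```python
from tensorflow import keras
from tensorflow.keras import layers

# 343 normalized flux values in; a 64-unit layer, three 32-unit hidden
# layers (all ReLU), 10% dropout after each dense layer, 3-way softmax out
model = keras.Sequential([
    layers.Input(shape=(343,)),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.1),
    layers.Dense(32, activation="relu"),
    layers.Dropout(0.1),
    layers.Dense(32, activation="relu"),
    layers.Dropout(0.1),
    layers.Dense(32, activation="relu"),
    layers.Dropout(0.1),
    layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```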
2.4.4. Algorithm Gradient Boosting
Gradient Boosting is an algorithm that focuses on numerical optimization in function space rather than in parameter space. It is based on additive stage-wise expansions and aims to find an approximation of the objective function that minimizes a specific loss function. It works iteratively, where, at each stage, a new component is added to the existing approximation, adjusting it based on the gradient of the loss function. This allows for a gradual improvement of the approximation, and achieves competitive results in both regression and classification problems [23].
Different parameter combinations were tested, using various loss functions and different learning rates, among others, resulting in the following configurations (see Table 7).
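One such run can be sketched as follows (illustrative parameter values on synthetic data; the combinations actually evaluated are those in Table 7):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=300, n_features=50, n_informative=10,
                           n_classes=3, random_state=0)

# Each boosting stage fits a new tree to the gradient of the loss
clf = GradientBoostingClassifier(learning_rate=0.1, n_estimators=50,
                                 random_state=0).fit(X, y)
print(clf.score(X, y))  # training accuracy
```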
2.4.5. Algorithm Naive Bayes
The Naive Bayes classifier is a mathematical classification technique widely used in machine learning. It is based on Bayes’ Theorem and uses probabilistic calculations to find the most appropriate classification for a given dataset within a problem domain. It is very useful for cases where the number of target classifications is greater than two, making it more suitable for real-life classification applications [24]. The algorithm was trained with the configuration shown in Table 8.
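A minimal sketch of this classifier on synthetic stand-in data (the actual configuration is the one in Table 8):

```python
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=300, n_features=50, n_informative=10,
                           n_classes=3, random_state=0)

# GaussianNB models each feature as normally distributed within each class
clf = GaussianNB().fit(X, y)
print(clf.score(X, y))  # training accuracy
```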