Machine_Learning_Algorithms_Overview
## Abstract
## 1. Introduction
Our goal is not to present novel research but rather to consolidate existing
knowledge in an accessible framework that facilitates algorithm selection and
implementation.
Supervised learning, where algorithms learn from labeled training data, represents
the most widely deployed paradigm in practical applications. We examine key
algorithms in this category:
Linear regression remains one of the most interpretable and widely used algorithms
for predicting continuous variables. The model takes the form:
$$\hat{y} = \beta_0 + \beta_1x_1 + \beta_2x_2 + \dots + \beta_nx_n$$
where $\hat{y}$ is the predicted value, $x_i$ are features, and $\beta_i$ are model
parameters.
**Key Properties**:
- Closed-form solution exists for ordinary least squares
- Computational complexity: O(nd² + d³) for the normal-equations solution with n samples and d features
- Assumes linear relationship between features and target
- Highly interpretable; when features are standardized, coefficient magnitudes indicate each feature's relative influence
- Susceptible to outliers and multicollinearity
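As a minimal sketch of the closed-form solution mentioned above, the single-feature case of ordinary least squares can be computed directly from sample means, with no iterative optimization. The function name `fit_simple_ols` and the toy data are illustrative assumptions, not taken from the original text.

```python
def fit_simple_ols(xs, ys):
    """Closed-form OLS for the model y_hat = b0 + b1 * x (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the sample covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    # Intercept makes the fitted line pass through the point of means.
    b0 = mean_y - b1 * mean_x
    return b0, b1


# Toy data generated from y = 1 + 2x (hypothetical example values).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
b0, b1 = fit_simple_ols(xs, ys)
print(b0, b1)
```

For several features the same idea generalizes to solving the normal equations, which is where the matrix-based cost of the closed-form approach comes from.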
Despite its name, logistic regression is a classification algorithm that models the
probability of an observation belonging to a particular class:
$$P(y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1x_1 + \beta_2x_2 + \dots + \beta_nx_n)}}$$
**Key Properties**:
- No closed-form solution; typically trained using gradient descent
- Provides probability estimates rather than just classifications
- Naturally extends to multi-class classification using one-vs-rest or softmax
approaches
- Prone to underperforming with imbalanced datasets
- Less prone to overfitting than decision trees, but may underfit complex
relationships
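Because logistic regression has no closed-form solution, training is iterative. The following is a rough sketch of batch gradient descent on the log-loss for a single feature; the learning rate, epoch count, function names, and data are all assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Batch gradient descent on the mean log-loss for p = sigmoid(b0 + b1 * x)."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            # Gradient of the log-loss w.r.t. the linear score is (p - y).
            err = sigmoid(b0 + b1 * x) - y
            g0 += err
            g1 += err * x
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1


# Hypothetical one-dimensional data: negative x -> class 0, positive x -> class 1.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]
b0, b1 = fit_logistic(xs, ys)
p_neg = sigmoid(b0 + b1 * -1.5)  # estimated P(class 1) for a negative input
p_pos = sigmoid(b0 + b1 * 1.5)   # estimated P(class 1) for a positive input
print(p_neg, p_pos)
```

The outputs are probabilities rather than hard labels; a threshold (commonly 0.5) turns them into classifications.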
Decision trees partition the feature space into regions using a series of decision
rules, creating an intuitive hierarchical structure.
**Key Properties**:
- Training involves greedy optimization using metrics like Gini impurity or
information gain
- Prone to overfitting without pruning or depth limitations
- Handle nonlinear relationships and feature interactions naturally
- No feature scaling required
- Computational complexity: roughly O(d · n log n) for training a balanced tree with n samples and d features
- Limited in capturing additive structures efficiently
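The greedy optimization step can be sketched for a single numeric feature: every midpoint between adjacent values is tried as a threshold, and the one with the lowest weighted Gini impurity wins. The helper names `gini` and `best_split` and the data are hypothetical; a full tree would apply this search recursively over all features.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(xs, ys):
    """Return (threshold, weighted_gini) for the best rule x <= threshold."""
    best = (None, gini(ys))  # baseline: no split at all
    values = sorted(set(xs))
    n = len(xs)
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2  # candidate threshold between adjacent values
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (t, score)
    return best


xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = ["a", "a", "a", "b", "b", "b"]
threshold, impurity = best_split(xs, ys)
print(threshold, impurity)  # 6.5 0.0
```

Note that the split depends only on the ordering of feature values, which is why no feature scaling is required.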
**Key Properties**:
- Reduced variance compared to individual trees
- Feature importance can be derived from how often, and how effectively, features are used in splits (for example, mean decrease in impurity)
- Training can be parallelized
- Typically outperforms single decision trees
- Less interpretable than individual trees
- Memory-intensive for large forests
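The properties above appear to describe random forests. As a simplified sketch of the bagging idea behind them, the code below trains one-split trees ("stumps") on bootstrap samples and combines them by majority vote. It deliberately omits the random feature subsets that a real random forest also uses, and it picks splits by misclassification count rather than Gini impurity; all names and data are assumptions.

```python
import random
from collections import Counter


def majority(labels):
    return Counter(labels).most_common(1)[0][0]


def fit_stump(xs, ys):
    """Fit a one-split tree; returns (threshold, left_label, right_label)."""
    fallback = majority(ys)
    best = (None, fallback, fallback)
    best_errors = sum(y != fallback for y in ys)
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = majority([y for x, y in zip(xs, ys) if x <= t])
        right = majority([y for x, y in zip(xs, ys) if x > t])
        errors = sum((left if x <= t else right) != y for x, y in zip(xs, ys))
        if errors < best_errors:
            best, best_errors = (t, left, right), errors
    return best


def predict_stump(stump, x):
    t, left, right = stump
    if t is None:  # the bootstrap sample admitted no useful split
        return left
    return left if x <= t else right


def fit_bagged_stumps(xs, ys, n_trees=25, seed=0):
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest


def predict_forest(forest, x):
    return majority([predict_stump(stump, x) for stump in forest])


xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = ["a", "a", "a", "b", "b", "b"]
forest = fit_bagged_stumps(xs, ys)
print(predict_forest(forest, 2.0), predict_forest(forest, 11.0))
```

Each tree is fitted independently of the others, which is what makes training straightforward to parallelize.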
Gradient boosting builds an ensemble sequentially, with each new model correcting
errors made by the combined existing models.
**Key Properties**:
- Often achieves state-of-the-art performance on structured data
- Implementations include XGBoost, LightGBM, and CatBoost with various
optimizations
- More prone to overfitting than random forests
- Requires careful tuning of hyperparameters
- Can handle mixed data types and missing values (implementation dependent)
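The sequential error-correcting idea can be sketched for squared-error regression, where each new weak learner is a one-split regression tree fitted to the current residuals. This is a bare-bones illustration, not how XGBoost, LightGBM, or CatBoost are implemented; the function names, learning rate, and data are assumptions.

```python
def fit_regression_stump(xs, residuals):
    """Best single split by squared error; returns (threshold, left_mean, right_mean)."""
    best, best_sse = None, float("inf")
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if sse < best_sse:
            best, best_sse = (t, lm, rm), sse
    return best


def boost(xs, ys, n_rounds=50, learning_rate=0.3):
    base = sum(ys) / len(ys)  # initial model: predict the mean
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        # For squared loss, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lm, rm = fit_regression_stump(xs, residuals)
        stumps.append((t, lm, rm))
        preds = [p + learning_rate * (lm if x <= t else rm) for x, p in zip(xs, preds)]
    return base, stumps, preds


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.2, 3.1, 3.0, 5.2, 5.0]
base, stumps, preds = boost(xs, ys)
initial_mse = mse(ys, [base] * len(ys))
final_mse = mse(ys, preds)
print(initial_mse, final_mse)
```

The learning rate and the number of rounds trade off against each other, which is one reason careful hyperparameter tuning matters here.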
Support Vector Machines (SVMs) find the hyperplane that maximizes the margin
between classes in the feature space.
**Key Properties**:
- Effective in high-dimensional spaces
- Memory efficient at prediction time, as the decision function depends only on the support vectors
- Versatile through different kernel functions (linear, polynomial, RBF)
- Computational complexity: O(n²) to O(n³) depending on implementation
- Less effective for large datasets due to scaling issues
- Requires feature scaling for optimal performance
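As a rough sketch of margin maximization in the linear case, the code below runs batch subgradient descent on the soft-margin objective (an L2 penalty plus the mean hinge loss). Production SVM solvers typically use other optimizers, and kernels are not covered; the function names, hyperparameters, and data are assumptions.

```python
def train_linear_svm(points, labels, lam=0.01, lr=0.1, epochs=100):
    """Minimize lam/2 * ||w||^2 + mean(max(0, 1 - y * (w.x + b)))."""
    dim = len(points[0])
    n = len(points)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [lam * wi for wi in w]  # gradient of the L2 penalty
        grad_b = 0.0
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:  # hinge loss is active for this point
                for k in range(dim):
                    grad_w[k] -= y * x[k] / n
                grad_b -= y / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Hypothetical linearly separable data with labels in {-1, +1}.
points = [(2.0, 2.0), (3.0, 3.0), (2.0, 3.0),
          (-2.0, -2.0), (-3.0, -3.0), (-2.0, -3.0)]
labels = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(points, labels)
predictions = [predict(w, b, x) for x in points]
print(w, b, predictions)
```

Only points with a margin below 1 contribute to the gradient, which mirrors the role support vectors play in the final decision function.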
**Key Properties**:
- Universal function approximators (with enough hidden units, they can in theory approximate any continuous function on a bounded domain to arbitrary accuracy)
- Trained using backpropagation and gradient descent variants
- Require substantial data to generalize well
- Computationally intensive, but parallelizable on GPUs
- Hyperparameter tuning can be challenging
- Prone to local minima and vanishing/exploding gradients
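The vanishing-gradient issue listed above can be illustrated with a deliberately tiny model: a chain of scalar sigmoid units, where backpropagation multiplies one local derivative per layer. Since the sigmoid derivative never exceeds 0.25, the product shrinks geometrically with depth. The function `chain_gradient` and its scalar-weight setup are assumptions made for illustration only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def chain_gradient(depth, x=0.0, weight=1.0):
    """Derivative of the output w.r.t. the input for `depth` stacked units
    a_next = sigmoid(weight * a), computed with the chain rule."""
    grad = 1.0
    a = x
    for _ in range(depth):
        pre = weight * a
        grad *= weight * sigmoid_grad(pre)  # local derivative of this layer
        a = sigmoid(pre)
    return grad


shallow = chain_gradient(1)
deep = chain_gradient(10)
print(shallow, deep)
```

With larger weights the same product can instead grow with depth, which is the exploding-gradient counterpart.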
**Key Properties**:
- Computational complexity: O(nkdi) for n samples, k clusters, d dimensions, i
iterations
- Assumes spherical clusters of similar size
- Sensitive to initialization and outliers
- Requires pre-specification of the number of clusters
- Extensions include k-means++ for better initialization and mini-batch k-means for
large datasets
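The properties above describe k-means. A minimal sketch of the standard alternating procedure (Lloyd's algorithm) follows: assign each point to its nearest centroid, then move each centroid to the mean of its points. Initial centroids are passed in explicitly to keep the example deterministic; the function names and data are assumptions.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(cluster):
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def kmeans(points, centroids, n_iters=10):
    """Each iteration costs on the order of n * k * d distance operations."""
    for _ in range(n_iters):
        clusters = [[] for _ in centroids]
        # Assignment step: attach every point to its nearest centroid.
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda c: squared_distance(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [mean_point(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, clusters = kmeans(points, [points[0], points[3]])
print(centroids)
```

Starting both centroids inside the same group can yield a different result on less well-separated data, which is the initialization sensitivity that k-means++ is meant to reduce.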
**Key Properties**:
- Agglomerative (bottom-up) or divisive (top-down) approaches
- No need to specify number of clusters in advance
- Computational complexity: O(n³) for naive implementations
- Results can be visualized as a dendrogram
- Various linkage criteria (single, complete, average, Ward) affect cluster shapes
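The properties above describe hierarchical clustering. As a sketch of the naive agglomerative approach with single linkage, the code below repeatedly merges the two clusters whose closest members are nearest to each other. It works on one-dimensional points for brevity and stops at a requested number of clusters instead of building a full dendrogram; the function name and data are assumptions.

```python
def single_linkage(points, n_clusters):
    """Naive bottom-up clustering; clusters are lists of point indices."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None  # (distance, index_a, index_b) of the closest pair
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(abs(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


points = [0.0, 0.5, 1.0, 8.0, 8.4, 20.0]
clusters = single_linkage(points, n_clusters=3)
print(clusters)  # [[0, 1, 2], [3, 4], [5]]
```

Swapping `min` for `max` in the linkage line would give complete linkage, which tends to produce more compact clusters.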
**Key Properties**:
- Does not require pre-specifying number of clusters
- Can find arbitrarily shaped clusters
- Robust to outliers, which are identified as noise
- Struggles with clusters of varying densities
- Less effective in high-dimensional spaces due to the "curse of dimensionality"
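The properties above match density-based clustering as done by DBSCAN. Assuming that is the algorithm intended, a compact sketch on one-dimensional points follows. The parameter names `eps` and `min_pts` follow common usage, and the neighbor count here includes the point itself; the data are hypothetical.

```python
from collections import deque

NOISE = -1


def dbscan(points, eps, min_pts):
    n = len(points)
    labels = [None] * n  # None means "not yet visited"

    def neighbors(i):
        return [j for j in range(n) if abs(points[i] - points[j]) <= eps]

    cluster_id = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        labels[i] = cluster_id
        queue = deque(seeds)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: joins but does not expand
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:  # core point: keep expanding
                queue.extend(j_neighbors)
        cluster_id += 1
    return labels


points = [0.0, 0.3, 0.6, 5.0, 5.2, 5.4, 20.0]
labels = dbscan(points, eps=0.5, min_pts=2)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

The number of clusters is never specified; it emerges from `eps` and `min_pts`, and the isolated point is reported as noise.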
PCA transforms data into a new coordinate system where the greatest variance lies
on the first coordinate (principal component).
**Key Properties**:
- Linear transformation that preserves maximal variance
- Computational complexity: O(nd² + d³) for n samples and d dimensions (covariance computation plus eigendecomposition)
- Assumes linear relationships between variables
- Orthogonal components facilitate interpretation
- Sensitive to feature scaling
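As a sketch of how the first principal component can be found, the code below centers the data, builds the sample covariance matrix, and applies power iteration to approximate its leading eigenvector. Power iteration is swapped in here to keep the example dependency-free; library implementations typically use a full eigendecomposition or SVD. The function name and data are assumptions, and the starting vector is assumed not to be orthogonal to the leading eigenvector.

```python
import math


def first_principal_component(data, n_iters=100):
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in data]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0] * d
    for _ in range(n_iters):
        # Power iteration: multiply by the covariance matrix, then renormalize.
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Variance captured along v (the Rayleigh quotient of a unit vector).
    variance = sum(v[a] * cov[a][b] * v[b] for a in range(d) for b in range(d))
    return v, variance


# Hypothetical data lying exactly on the line y = 2x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
component, variance = first_principal_component(data)
print(component, variance)
```

Rescaling one column of `data` changes the covariance matrix and therefore the component, which is the scaling sensitivity noted above.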
**Key Properties**:
- Preserves local structure and reveals clusters
- Computationally intensive: O(n²) for the exact algorithm; the Barnes-Hut approximation reduces this to O(n log n)
- Non-deterministic results
- Hyperparameter sensitive (perplexity)
- Generally not suitable as a preprocessing step for downstream models, since the embedding provides no mapping for new data points
**Key Properties**:
- Computational complexity: O(n log n)
- Effective in high-dimensional spaces
- Does not make assumptions about data distribution
- Works well with numerical data
- May struggle with very low-dimensional data
## 4. Reinforcement Learning
**Key Properties**:
- Model-free approach that learns action-value function
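- Converges to the optimal action-value function in the tabular setting, provided every state-action pair is visited infinitely often and the learning rate decays appropriately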
- Struggles with large state-action spaces (curse of dimensionality)
- Tends to overestimate action values
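A small sketch of the tabular Q-learning update follows, on a hypothetical four-state chain where reaching the last state yields a reward of 1. To keep the example deterministic, exploration is replaced by sweeping over every state-action pair, and a fixed learning rate is used; the environment, constants, and names are all assumptions.

```python
GAMMA = 0.9          # discount factor
ALPHA = 0.5          # learning rate
N_STATES = 4         # states 0..3; state 3 is terminal
ACTIONS = (-1, 1)    # move left or move right


def step(state, action):
    """Deterministic chain: reward 1.0 on entering the terminal state."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done


def q_update(q, s, a, r, s2, done):
    # Target bootstraps from the best next action (off-policy max).
    target = r if done else r + GAMMA * max(q[(s2, b)] for b in ACTIONS)
    q[(s, a)] += ALPHA * (target - q[(s, a)])


q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
for _ in range(200):
    for s in range(N_STATES - 1):
        for a in ACTIONS:
            s2, r, done = step(s, a)
            q_update(q, s, a, r, s2, done)

greedy_policy = [max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)]
print(greedy_policy)  # [1, 1, 1]
```

The `max` in the target is also the source of the overestimation tendency: with noisy estimates, taking a maximum is biased upward.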
DQN extends Q-learning by using deep neural networks to approximate the Q-function.
**Key Properties**:
- Can handle high-dimensional state spaces
- Uses experience replay to break correlations in training data
- Employs target networks to reduce training instability
- Often requires substantial computational resources
- Has inspired numerous extensions (Double DQN, Dueling DQN, etc.)
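Experience replay can be sketched independently of the neural network: transitions go into a fixed-size buffer, and training batches are drawn uniformly at random so that consecutive, highly correlated steps are not learned from together. The class below is an illustrative assumption, not the structure of any particular DQN implementation.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state, done)."""

    def __init__(self, capacity, seed=0):
        self.buffer = deque(maxlen=capacity)  # oldest entries drop out first
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


buffer = ReplayBuffer(capacity=100)
for t in range(150):
    # Hypothetical transitions along a line: state t -> state t + 1.
    buffer.push(t, 0, 0.0, t + 1, False)

batch = buffer.sample(32)
print(len(buffer), len(batch))  # 100 32
```

In a full agent, each sampled batch would feed one gradient step on the Q-network, with targets computed from a separate, periodically synchronized target network.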
Policy gradient methods directly optimize the policy by gradient ascent on the
expected reward.
**Key Properties**:
- Can learn stochastic policies
- Naturally handles continuous action spaces
- Often suffer from high variance in gradient estimates
- REINFORCE algorithm provides foundational approach
- Extensions include Actor-Critic methods that combine value and policy approaches
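For reference, the gradient estimated by REINFORCE can be written as follows. The notation is an assumption introduced here: $\pi_\theta$ is the policy with parameters $\theta$, $J(\theta)$ the expected return, $\gamma$ the discount factor, and $G_t$ the return from time step $t$ in an episode of length $T$.

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right], \qquad G_t = \sum_{k=t}^{T-1} \gamma^{k-t} r_{k+1}$$

Because $G_t$ is a single sampled return, the estimate is unbiased but noisy. Subtracting a baseline from $G_t$, such as a learned value function in Actor-Critic methods, is a common way to reduce that variance.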
CNNs leverage spatial structure through convolutional layers, making them ideal for
image processing.
**Key Properties**:
- Parameter sharing and local connectivity reduce model size
- Convolutions are translation-equivariant; combined with pooling, this yields approximate translation invariance, so visual patterns are detected regardless of position
- Hierarchical feature learning from simple to complex patterns
- Typical components include convolutional layers, pooling layers, and fully
connected layers
- Influential architectures include AlexNet, VGG, ResNet, and EfficientNet
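Parameter sharing and the effect of translation can be seen in a bare-bones convolution (strictly, cross-correlation, as in most deep learning libraries) with no padding and stride 1. The same small kernel is reused at every position, and shifting the input shifts the response by the same amount. The function name, kernel, and images are illustrative assumptions.

```python
def conv2d(image, kernel):
    """Valid cross-correlation of a 2D image (list of rows) with a 2D kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]


# A 1x2 kernel that responds to a dark-to-bright step between neighboring columns.
edge_kernel = [[1, -1]]

image = [[0, 0, 1, 1, 1]] * 3          # vertical edge after column 1
shifted = [[0, 0, 0, 1, 1]] * 3        # same edge, one column to the right

response = conv2d(image, edge_kernel)
shifted_response = conv2d(shifted, edge_kernel)
print(response[0])          # [0, -1, 0, 0]
print(shifted_response[0])  # [0, 0, -1, 0]
```

The kernel has only two weights regardless of image size, which is the model-size reduction that parameter sharing and local connectivity provide.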
**Key Properties**:
- Can handle variable-length sequences
- Suffer from vanishing/exploding gradients in practice
- Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) cells mitigate
gradient problems
- Applications include language modeling, speech recognition, and time series
forecasting
- Bidirectional variants process sequences in both directions
**Key Properties**:
- Parallelizable training unlike RNNs
- Effectively captures long-range dependencies
- Forms the foundation for models like BERT, GPT, and T5
- Computational and memory cost of self-attention scales quadratically with sequence length
- Requires substantial data and computing resources
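The properties above describe the Transformer architecture, whose core operation is scaled dot-product attention (Vaswani et al., 2017). A dependency-free sketch of a single attention head without learned projections or masking follows; the function names and toy vectors are assumptions. Every query is scored against every key, which is where the quadratic dependence on sequence length comes from.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention; returns (outputs, attention_weights)."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # One score per key: an n x n table of scores over the whole sequence.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Three hypothetical token vectors used as queries, keys, and values (self-attention).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(tokens, tokens, tokens)
print(weights)
```

Each row of the loop is independent of the others, so all positions can be processed in parallel, unlike the step-by-step recurrence of an RNN.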
## 6. Implementation Considerations
## 7. Ethical Considerations
Machine learning algorithms inherit biases from training data and can amplify
societal inequities if deployed carelessly.
## 9. Conclusion