AML Assignment 1 1
We collected historical stock data for Apple (AAPL) from 2015 to 2024 using the yfinance
library. The dataset included fields such as Open, High, Low, Close prices, and Volume. We
ensured the data was clean by removing missing values, formatting date columns correctly,
and sorting the data chronologically. To prepare the data for machine learning models, we
applied MinMaxScaler to normalize the 'Close' prices between 0 and 1. Lag features were
added to capture trends from the previous 5 days, which helps the model learn from past
patterns. The dataset was then split into 80% training data and 20% testing data.
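The lag-feature and split steps described above can be sketched as follows. This is a minimal illustration and not the code used for the assignment: the synthetic price series, the helper name make_lag_features, and the choice of a chronological (unshuffled) split are all assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic random-walk prices standing in for the AAPL 'Close' column (assumption)
rng = np.random.default_rng(0)
close = pd.Series(100 + rng.normal(0, 1, 200).cumsum(), name="Close")

def make_lag_features(series: pd.Series, n_lags: int = 5) -> pd.DataFrame:
    """Return a frame with the target 'y' and columns lag_1..lag_n."""
    frame = pd.DataFrame({"y": series})
    for k in range(1, n_lags + 1):
        frame[f"lag_{k}"] = series.shift(k)  # value from k rows (days) earlier
    return frame.dropna()  # the first n_lags rows lack a full history

features = make_lag_features(close, n_lags=5)

# Chronological 80/20 split: no shuffling, so every test row is later than every training row
split = int(len(features) * 0.8)
train, test = features.iloc[:split], features.iloc[split:]
X_train, y_train = train.drop(columns="y"), train["y"]
X_test, y_test = test.drop(columns="y"), test["y"]
```

Keeping the split chronological avoids training on days that come after the days being predicted. For the same reason, a scaler would normally be fitted on the training rows only and then applied to the test rows.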
Model Training and Evaluation
We tested three machine learning models: Linear Regression, Decision Tree Regressor, and a
Multi-layer Perceptron (MLP) neural network.
Linear Regression assumes a linear relationship between the lagged prices and the next price.
Decision Tree Regressor can model non-linear patterns but is prone to overfitting.
MLPRegressor is a feed-forward neural network (scikit-learn's multi-layer perceptron) that
can learn non-linear patterns in the data.
After training on the prepared dataset, all models were evaluated using two key metrics:
Mean Squared Error (MSE): Measures the average of the squares of the errors.
R-squared (R²): Indicates how well the model fits the actual data, with values closer
to 1 being better.
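Both metrics can be written out directly, which makes their behaviour easier to see. The sketch below uses made-up numbers and hypothetical helper names (mse, r2); scikit-learn's mean_squared_error and r2_score compute the same quantities.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares around the mean)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Illustrative values only
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 5.0]

print(mse(y_true, y_pred))  # 0.25
print(r2(y_true, y_pred))   # 0.8
```

An R² of 0 corresponds to always predicting the mean of the actual values, and a negative R² means the model does worse than that baseline.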
Among the three, the neural network achieved the lowest MSE and the highest R² score,
indicating superior prediction accuracy on test data.
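A comparison of this kind might look like the sketch below. It runs on a synthetic random-walk series rather than the AAPL data, and the hyperparameters (tree depth, hidden layer size, iteration limit, random seeds) are assumptions, so its scores say nothing about the results reported here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.neural_network import MLPRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic series, min-max scaled to [0, 1] (stand-in for the scaled 'Close' prices)
rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(0, 1, 300))
series = (series - series.min()) / (series.max() - series.min())

# Each row holds 5 consecutive values; the target is the value that follows them
n_lags = 5
X = np.column_stack([series[i : len(series) - n_lags + i] for i in range(n_lags)])
y = series[n_lags:]

# Chronological 80/20 split
split = int(len(X) * 0.8)
X_train, X_test, y_train, y_test = X[:split], X[split:], y[:split], y[split:]

models = {
    "LinearRegression": LinearRegression(),
    "DecisionTree": DecisionTreeRegressor(max_depth=5, random_state=0),
    "MLP": MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = (mean_squared_error(y_test, pred), r2_score(y_test, pred))

for name, (mse_value, r2_value) in results.items():
    print(f"{name}: MSE={mse_value:.5f}, R2={r2_value:.3f}")
```

Because prices on consecutive days are strongly correlated, even a simple model can score a high R² on this kind of task, so it is worth comparing against a naive "tomorrow equals today" baseline before drawing conclusions.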
Example Code
import yfinance as yf
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error, r2_score

# Download daily AAPL prices; the end date is exclusive
data = yf.download('AAPL', start='2015-01-01', end='2024-01-01')
# Forward-fill any gaps in the closing price (fillna(method='ffill') is deprecated in pandas)
data['Close'] = data['Close'].ffill()
# Scale the closing price to the range [0, 1]
scaler = MinMaxScaler()
data['Close_scaled'] = scaler.fit_transform(data[['Close']])
Output:
MSE: 0.00038, R²: 0.972
2. Analyzing Customer Purchasing Behavior using Data Mining
Data Preparation
We used the Online Retail dataset, available from the UCI Machine Learning Repository and Kaggle. This dataset includes
transactions with details like Invoice ID, Customer ID, Description, Quantity, Unit Price,
Date, and Country. We cleaned the dataset by removing rows with missing Customer IDs and
transactions with negative or zero quantities (likely returns). We also created a new column
called TotalPrice by multiplying Quantity and UnitPrice.
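The cleaning steps above can be illustrated on a few hand-made rows. The column names follow the UCI Online Retail file; the sample values are invented for the example.

```python
import pandas as pd

# Tiny invented sample using the Online Retail column names
raw = pd.DataFrame({
    "InvoiceNo": ["536365", "536366", "C536367", "536368", "536369"],
    "CustomerID": [17850.0, None, 13047.0, 13047.0, 12583.0],
    "Quantity": [6, 2, -1, 0, 3],
    "UnitPrice": [2.55, 3.39, 4.25, 1.85, 7.95],
})

# 1. Drop rows with no customer to attribute the purchase to
clean = raw.dropna(subset=["CustomerID"])
# 2. Keep only strictly positive quantities (negative ones are likely returns)
clean = clean[clean["Quantity"] > 0].copy()
# 3. Revenue per line item
clean["TotalPrice"] = clean["Quantity"] * clean["UnitPrice"]

print(clean)
```

Only the first and last sample rows survive: one row has no CustomerID, one has a negative quantity, and one has a zero quantity.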
RFM Analysis and Clustering
RFM stands for Recency, Frequency, and Monetary value. These three metrics are widely
used in customer segmentation:
Recency: Number of days since the customer’s last purchase
Frequency: Number of transactions made by the customer
Monetary: Total spending across all transactions
Using RFM scores, we applied K-Means clustering to divide customers into different groups
such as high-value customers, frequent buyers, and new or inactive users. This helped in
understanding customer loyalty and engagement levels.
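One way to build the RFM table and cluster it is sketched below on invented transactions. The snapshot date (one day after the last purchase), the standardization step, and the choice of 3 clusters for this tiny sample are assumptions made for the illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Invented transactions: one row per invoice
tx = pd.DataFrame({
    "CustomerID": [1, 1, 2, 3, 3, 3, 4, 5, 5, 6],
    "InvoiceNo": ["A1", "A2", "B1", "C1", "C2", "C3", "D1", "E1", "E2", "F1"],
    "InvoiceDate": pd.to_datetime([
        "2011-11-01", "2011-12-01", "2011-06-15", "2011-10-01", "2011-11-15",
        "2011-12-09", "2011-01-10", "2011-12-05", "2011-12-08", "2011-03-20",
    ]),
    "TotalPrice": [20.0, 30.0, 15.0, 100.0, 150.0, 200.0, 10.0, 60.0, 40.0, 25.0],
})

# Recency is measured against a reference date just after the last purchase
snapshot = tx["InvoiceDate"].max() + pd.Timedelta(days=1)

rfm = tx.groupby("CustomerID").agg(
    Recency=("InvoiceDate", lambda d: (snapshot - d.max()).days),
    Frequency=("InvoiceNo", "nunique"),
    Monetary=("TotalPrice", "sum"),
)

# Standardize so that Monetary (largest numbers) does not dominate the distances
scaled = StandardScaler().fit_transform(rfm[["Recency", "Frequency", "Monetary"]])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
rfm["Cluster"] = kmeans.fit_predict(scaled)

print(rfm)
```

K-Means relies on Euclidean distance, so without scaling the cluster assignment would be driven almost entirely by whichever column has the widest numeric range.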
Association Rule Mining
We used the Apriori algorithm to discover patterns in product purchases. For example, if
customers often buy bread and butter together, this relationship can be captured using
association rules with support, confidence, and lift metrics. This technique is commonly used
for cross-selling and recommendation systems.
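The three rule metrics can be computed by hand for the bread and butter example. The sketch below is not the full Apriori search (which also enumerates frequent itemsets level by level); it only evaluates one candidate rule on invented baskets, and the helper names are hypothetical.

```python
# Invented shopping baskets
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def rule_metrics(antecedent, consequent, baskets):
    """Support, confidence and lift for the rule antecedent -> consequent."""
    s_a = support(antecedent, baskets)
    s_c = support(consequent, baskets)
    s_ac = support(set(antecedent) | set(consequent), baskets)
    confidence = s_ac / s_a   # P(consequent | antecedent)
    lift = confidence / s_c   # how much more likely than by chance
    return {"support": s_ac, "confidence": confidence, "lift": lift}

metrics = rule_metrics({"bread"}, {"butter"}, transactions)
print(metrics)
```

Here the rule bread -> butter has support 0.6 and confidence 0.75, but its lift is about 0.94. A lift below 1 means that butter is slightly less likely in baskets with bread than in baskets overall, which shows why confidence alone can be misleading when the consequent is already common.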
Predictive Modeling
# K-Means Clustering
from sklearn.cluster import KMeans

# Assign each customer in the RFM table to one of 4 clusters
kmeans = KMeans(n_clusters=4)
rfm['Cluster'] = kmeans.fit_predict(rfm)
print(rfm.head())
Output:
MSE: 567.7678584861414, R²: -1.8388392924307073
            Recency  Frequency  Monetary  Cluster
CustomerID
1                 5          1      10.0        1
2                 4          1      40.0        1
3                 3          1      90.0        3
4                 2          1     160.0        0
5                 1          1     250.0        2
Visualization
We used bar charts to show the most purchased items, pie charts for customer distribution by
country, and line graphs to visualize sales trends. Heatmaps were created to analyze
correlation among RFM values and customer clusters. These visualizations helped uncover
actionable insights and patterns in customer behavior.