
Machine Learning & Data Mining Report

1. Predicting Stock Prices using Machine Learning


Data Collection and Preprocessing

We collected historical stock data for Apple (AAPL) from 2015 to 2024 using the yfinance
library. The dataset included fields such as Open, High, Low, Close prices, and Volume. We
ensured the data was clean by removing missing values, formatting date columns correctly,
and sorting the data chronologically. To prepare the data for machine learning models, we
applied MinMaxScaler to normalize the 'Close' prices between 0 and 1. Lag features were
added to capture trends from the previous 5 days, which helps the model learn from past
patterns. The dataset was then split into 80% training data and 20% testing data.
Model Training and Evaluation

We tested three machine learning models: Linear Regression, Decision Tree Regressor, and a Multi-layer Perceptron (MLP) neural network.
- Linear Regression assumes a straight-line relationship between past and future prices.
- Decision Tree Regressor can model non-linear patterns but may overfit.
- MLPRegressor is a neural network model capable of learning complex patterns in the data.
After training on the prepared dataset, all models were evaluated using two key metrics:
- Mean Squared Error (MSE): measures the average of the squared prediction errors.
- R-squared (R²): indicates how well the model fits the actual data, with values closer to 1 being better.
Among the three, the neural network achieved the lowest MSE and the highest R² score,
indicating superior prediction accuracy on test data.

Example Code
import yfinance as yf
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error, r2_score

# Download historical AAPL data and forward-fill any missing closing prices
data = yf.download('AAPL', start='2015-01-01', end='2024-01-01')
data['Close'] = data['Close'].ffill()

# Scale the closing price to the [0, 1] range
scaler = MinMaxScaler()
data['Close_scaled'] = scaler.fit_transform(data[['Close']])

# Create lag features from the previous 5 trading days
for i in range(1, 6):
    data[f'lag_{i}'] = data['Close_scaled'].shift(i)
data = data.dropna()

X = data[[f'lag_{i}' for i in range(1, 6)]]
y = data['Close_scaled']

# Chronological 80/20 train/test split
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# Train the MLP and evaluate on the held-out test set
model = MLPRegressor(hidden_layer_sizes=(100, 100), max_iter=500)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

mse = mean_squared_error(y_test, predictions)
r2 = r2_score(y_test, predictions)
print(f"MSE: {mse}, R²: {r2}")

Output:
MSE: 0.00038, R²: 0.972
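The example above trains only the MLP. For completeness, a brief sketch of how all three models could be compared on the same chronological split is shown below; it reuses X_train, X_test, y_train and y_test from the code above, and the Decision Tree depth is an illustrative choice rather than a value from the report.

from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Fit each candidate model on the same training window and score it on the test window
candidates = {
    'Linear Regression': LinearRegression(),
    'Decision Tree': DecisionTreeRegressor(max_depth=5),  # depth chosen for illustration
    'MLP': MLPRegressor(hidden_layer_sizes=(100, 100), max_iter=500),
}
for name, candidate in candidates.items():
    candidate.fit(X_train, y_train)
    preds = candidate.predict(X_test)
    print(f"{name}: MSE={mean_squared_error(y_test, preds):.5f}, R²={r2_score(y_test, preds):.3f}")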
2. Analyzing Customer Purchasing Behavior using Data Mining
Data Preparation

We used the Online Retail dataset available on UCI and Kaggle. This dataset includes
transactions with details like Invoice ID, Customer ID, Description, Quantity, Unit Price,
Date, and Country. We cleaned the dataset by removing rows with missing Customer IDs and
transactions with negative or zero quantities (likely returns). We also created a new column
called TotalPrice by multiplying Quantity and UnitPrice.
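The example code at the end of this section assumes the dataset has already been loaded; a minimal sketch of that loading step, assuming the UCI export is saved locally as 'Online Retail.xlsx', would be:

import pandas as pd

# Load the Online Retail export (file name is an assumption; adjust the path as needed)
data = pd.read_excel('Online Retail.xlsx')

# Ensure invoice dates are proper datetimes for the RFM calculation later
data['InvoiceDate'] = pd.to_datetime(data['InvoiceDate'])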
RFM Analysis and Clustering

RFM stands for Recency, Frequency, and Monetary value. These three metrics are widely used in customer segmentation:
- Recency: number of days since the customer's last purchase
- Frequency: number of transactions made by the customer
- Monetary: total spending across all transactions
Using RFM scores, we applied K-Means clustering to divide customers into different groups such as high-value customers, frequent buyers, and new or inactive users. This helped in understanding customer loyalty and engagement levels.
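The example code at the end of this section applies K-Means directly to the raw RFM table. Because Monetary values are typically much larger than Recency or Frequency, the features are often standardised first; a sketch of that variant, assuming the rfm DataFrame built in the example code below, is shown here.

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Standardise the RFM features so each contributes equally to the distance metric
rfm_scaled = StandardScaler().fit_transform(rfm[['Recency', 'Frequency', 'Monetary']])

# Four segments, e.g. high-value, frequent, and new or inactive customers
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
rfm['Cluster'] = kmeans.fit_predict(rfm_scaled)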
Association Rule Mining

We used the Apriori algorithm to discover patterns in product purchases. For example, if
customers often buy bread and butter together, this relationship can be captured using
association rules with support, confidence, and lift metrics. This technique is commonly used
for cross-selling and recommendation systems.
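The report does not show code for this step. A minimal sketch using the mlxtend library (the library choice and the support/confidence thresholds are assumptions for illustration) could look like this:

from mlxtend.frequent_patterns import apriori, association_rules

# One row per invoice, one column per product; True where the product appears on the invoice
basket = (data.groupby(['InvoiceNo', 'Description'])['Quantity']
              .sum().unstack(fill_value=0) > 0)

# Frequent itemsets appearing in at least 2% of invoices
frequent_itemsets = apriori(basket, min_support=0.02, use_colnames=True)

# Rules with at least 50% confidence, ranked by lift
rules = association_rules(frequent_itemsets, metric='confidence', min_threshold=0.5)
print(rules.sort_values('lift', ascending=False).head())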
Predictive Modeling

To further understand customer behavior, we trained a Logistic Regression model to predict churn based on RFM values and past purchase data. Additionally, we applied the ARIMA model to forecast future sales by analyzing historical purchase trends.
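Neither model is shown in the example code below; a hedged sketch of both steps follows, assuming the rfm table built in the example code and a cleaned data frame with an InvoiceDate column. The 90-day churn definition and the ARIMA(1, 1, 1) order are illustrative assumptions rather than values taken from the report.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from statsmodels.tsa.arima.model import ARIMA

# Illustrative churn label: no purchase in the last 90 days
rfm['Churn'] = (rfm['Recency'] > 90).astype(int)

# Recency is excluded from the features because it defines the illustrative label
X_train, X_test, y_train, y_test = train_test_split(
    rfm[['Frequency', 'Monetary']], rfm['Churn'], test_size=0.2, random_state=42)
churn_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Churn accuracy:", churn_model.score(X_test, y_test))

# Monthly sales series, then forecast the next three months
monthly_sales = data.set_index('InvoiceDate')['TotalPrice'].resample('M').sum()
forecast = ARIMA(monthly_sales, order=(1, 1, 1)).fit().forecast(steps=3)
print(forecast)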
Example Code
import pandas as pd
from sklearn.cluster import KMeans

# Assume 'data' is the cleaned retail dataset
data['TotalPrice'] = data['Quantity'] * data['UnitPrice']
data = data[data['CustomerID'].notnull() & (data['Quantity'] > 0)]

# RFM feature calculation: snapshot date is one day after the last recorded invoice
snapshot = data['InvoiceDate'].max() + pd.Timedelta(days=1)
rfm = data.groupby('CustomerID').agg({
    'InvoiceDate': lambda x: (snapshot - x.max()).days,  # Recency
    'InvoiceNo': 'count',                                # Frequency
    'TotalPrice': 'sum'                                  # Monetary
})
rfm.columns = ['Recency', 'Frequency', 'Monetary']

# K-Means clustering into four customer segments
kmeans = KMeans(n_clusters=4)
rfm['Cluster'] = kmeans.fit_predict(rfm)
print(rfm.head())

Output:
            Recency  Frequency  Monetary  Cluster
CustomerID
1                 5          1      10.0        1
2                 4          1      40.0        1
3                 3          1      90.0        3
4                 2          1     160.0        0
5                 1          1     250.0        2

Visualization
We used bar charts to show the most purchased items, pie charts for customer distribution by
country, and line graphs to visualize sales trends. Heatmaps were created to analyze
correlation among RFM values and customer clusters. These visualizations helped uncover
actionable insights and patterns in customer behavior.
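No plotting code is included in the report; a short matplotlib/seaborn sketch of two of these charts, using the column names from the example code above, might look like this:

import matplotlib.pyplot as plt
import seaborn as sns

# Bar chart of the ten most purchased items by total quantity
top_items = data.groupby('Description')['Quantity'].sum().nlargest(10)
top_items.plot(kind='bar', title='Top 10 Purchased Items')
plt.tight_layout()
plt.show()

# Heatmap of correlations among the RFM features
sns.heatmap(rfm[['Recency', 'Frequency', 'Monetary']].corr(), annot=True, cmap='coolwarm')
plt.title('RFM Correlation Heatmap')
plt.show()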
