Title :
Stock Market Price Prediction Using Machine Learning
Group Members :
Dibyajyoti Satpathy [SIC: 22BCTE32, GROUP: CST(A3) ROLL NO: 20]
Bhabani Charan Panda [SIC: 22BCTC48, GROUP: CST(A3) ROLL NO: 18]
Suraj Kumar Sikhar [SIC: 22BCTG89, GROUP: CST(A3) ROLL NO: 15]
Guided By :
MR ASIT KUMAR DAS
[Assistant Professor, Silicon University, Bhubaneswar]
Contents
1. Objective
2. Introduction
I. Role of Machine Learning in Stock Market
II. Overview of ITC Stock Market
3. Literature Review
4. About the Model
5. Workflow
6. Tools and Techniques
7. Dataset
8. Implementation
9. Evaluation
10. Code Snippet
11. Result Analysis
12. Future Work
13. Conclusion
14. References
Objective
The main objective of this project is to develop a machine learning-based
predictive model that can accurately predict stock prices
based on historical data. By analyzing patterns and trends in
stock prices, the goal is to provide an efficient and reliable tool
for investors and traders. The specific objectives of the project
include:
Libraries:
o Pandas: Used for data manipulation and
preprocessing, enabling efficient handling of large
datasets.
o NumPy: Provides numerical computation tools for
efficient array operations and mathematical functions.
o Matplotlib: Enables the creation of static,
animated, and interactive visualizations, aiding
in data exploration.
o Scikit-learn: Includes machine learning tools for
implementing Linear Regression, SVR models, and
evaluation metrics.
o Seaborn: Used for advanced visualization,
providing better aesthetics and additional insight
extraction from data.
o Statsmodels: Facilitates statistical tests and
regression analysis for more in-depth data
understanding.
Environment:
o Jupyter Notebook: An interactive platform for
coding, analysis, and visualization.
o Google Colab: Employed for experimentation with
larger datasets and computationally intensive tasks,
leveraging cloud resources.
These tools enable efficient data handling, model training, and
performance evaluation, forming the foundation for this machine
learning project.
Dataset
The dataset used in this project comprises historical stock price
data for ITC Limited, a leading company in the Indian stock
market. The dataset was carefully curated to analyze trends and
predict future stock prices.
1. Source:
The dataset was sourced from a CSV file named ITC.NS.csv,
containing historical stock price data for ITC Limited. This
dataset was obtained from Yahoo Finance, a widely used
platform for financial data.
2. Data Set Information:
This dataset contains daily trading data for ITC Limited, including
crucial attributes that influence stock price movements. The
dataset provides comprehensive information about the stock’s
performance over time, enabling the development of predictive
models.
3. Size and Structure:
Total Samples: 6,890 rows (as reported by df.info() and df.shape in the notebook output)
Features: 7 columns:
o Date: The trading day.
o Open: Opening price of the stock.
o High: Highest price during the trading session.
o Low: Lowest price during the trading session.
o Close: Closing price of the stock.
o Adj Close: Adjusted closing price (accounting for splits, dividends, etc.).
o Volume: Total number of shares traded.
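As a minimal sketch of how a file with this structure could be loaded and inspected with Pandas: the sample rows below are hypothetical values embedded in a string so the example runs without the real ITC.NS.csv, and the helper name load_prices is illustrative. The DD-MM-YYYY date format is an assumption based on the first row shown in the notebook output.

```python
import io
import pandas as pd

# Hypothetical sample rows standing in for ITC.NS.csv (values are illustrative only).
SAMPLE_CSV = """Date,Open,High,Low,Close,Adj Close,Volume
01-01-1996,5.55,5.60,5.53,5.58,3.10,985500
02-01-1996,5.58,5.62,5.50,5.52,3.07,1200000
03-01-1996,,,,,,
"""

def load_prices(source):
    """Read a Yahoo-Finance-style CSV and parse the Date column (assumed DD-MM-YYYY)."""
    df = pd.read_csv(source)
    df["Date"] = pd.to_datetime(df["Date"], format="%d-%m-%Y")
    return df

df = load_prices(io.StringIO(SAMPLE_CSV))
print(df.shape)           # (rows, columns)
print(df.dtypes)          # Date is parsed as a datetime, the other columns as floats
print(df.isnull().sum())  # missing values per column
```

With the actual file, the path to ITC.NS.csv would be passed to load_prices in place of the in-memory string.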
4. Data Preprocessing:
Handling Missing Values:
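Checked for missing values: the null counts in the notebook output show
11 missing entries in each price and volume column and none in Date.
These entries were handled before model training.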
Feature Scaling:
Scaled numerical features (e.g., prices and volume) to
standardize the data and improve model performance.
Statistical Insights:
Analyzed the mean, standard deviation, and other metrics to
understand the data distribution and patterns.
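The preprocessing steps above can be sketched as follows. This is an illustration on hypothetical values, not the project's notebook code: the report does not state how the missing rows were filled, so the forward-fill used here is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical price/volume frame with one missing row (illustrative values).
prices = pd.DataFrame({
    "Open":   [5.55, 5.58, np.nan, 5.60, 5.65],
    "High":   [5.60, 5.62, np.nan, 5.70, 5.72],
    "Low":    [5.53, 5.50, np.nan, 5.55, 5.60],
    "Volume": [985500.0, 1200000.0, np.nan, 870000.0, 910000.0],
})

print(prices.isnull().sum())   # count of missing values per column

# Assumption: fill each gap with the previous trading day's values.
filled = prices.ffill()

# Standardize every column to zero mean and unit variance.
scaler = StandardScaler()
scaled = scaler.fit_transform(filled)

print(scaled.mean(axis=0).round(6))  # approximately 0 for every column
print(scaled.std(axis=0).round(6))   # approximately 1 for every column
```

In a train/test workflow the scaler would be fitted on the training set only and then applied to both sets, which avoids leaking test-set statistics into training.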
Implementation
Data Loading and Preprocessing:
Data is imported using Pandas, enabling efficient handling of large
datasets. Missing values are addressed through imputation
techniques or removal, ensuring data integrity.
Exploratory Data Analysis (EDA):
Line plots and histograms are used to visualize stock price trends
and distributions. Correlation matrices identify relationships
between features, aiding in feature selection.
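A small sketch of the correlation step, assuming Close is the target column as in the notebook heatmap. The data is a synthetic random walk rather than ITC prices, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic random-walk prices (not ITC data), used only to illustrate the calculation.
rng = np.random.default_rng(0)
close = 100 + np.cumsum(rng.normal(0, 1, 250))
frame = pd.DataFrame({
    "Open": close + rng.normal(0, 0.5, 250),
    "High": close + np.abs(rng.normal(0, 1, 250)),
    "Low": close - np.abs(rng.normal(0, 1, 250)),
    "Volume": rng.integers(100_000, 1_000_000, 250).astype(float),
    "Close": close,
})

# Pearson correlation of every column with the target, strongest first.
corr_with_close = frame.corr()["Close"].sort_values(ascending=False)
print(corr_with_close)
```

Features whose correlation with the target is close to zero (here, the randomly generated Volume) are candidates for removal during feature selection.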
Model Training:
The dataset is split into training and test sets, ensuring unbiased
evaluation. A Linear Regression model is trained using Scikit-learn,
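mapping feature values to stock prices. An SVR model is trained
with a linear kernel (C=50, epsilon=0.2). A linear kernel fits a linear
relationship between features and price; capturing non-linear
dependencies would require a non-linear kernel such as RBF.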
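The training step can be sketched as below on synthetic data (not ITC data). The 80/20 split matches the train and test shapes reported in the notebook output (5512 and 1378 rows), and the SVR parameters match the fitted model shown there; the random_state values and the synthetic target are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic features and target: a noisy linear combination of four features,
# mirroring the four-column feature matrix used in the notebook.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))
y = X @ np.array([3.0, -2.0, 1.5, 0.5]) + 10 + rng.normal(0, 0.1, 500)

# 80/20 split; random_state is an assumed value for reproducibility.
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

# Fit the scaler on the training set only, then apply it to both sets.
scaler = StandardScaler().fit(x_train)
x_train_s, x_test_s = scaler.transform(x_train), scaler.transform(x_test)

lr_model = LinearRegression().fit(x_train_s, y_train)
svr_model = SVR(kernel="linear", C=50, epsilon=0.2).fit(x_train_s, y_train)

print("Linear Regression R^2:", lr_model.score(x_test_s, y_test))
print("SVR (linear) R^2:", svr_model.score(x_test_s, y_test))
```

Because the synthetic target is linear in the features, both models fit it well; on real price data the two scores may differ more.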
Model Testing:
Predictions are generated on the test dataset for both models.
Performance is evaluated using metrics such as Mean Squared Error
(MSE), Mean Absolute Error (MAE), Root Mean Squared Error
(RMSE), and R-squared value.
Evaluation
The performance of the Linear Regression and SVR models is
assessed using the following metrics:
Mean Squared Error (MSE): Quantifies the average squared
difference between predicted and actual values, reflecting
prediction accuracy.
Mean Absolute Error (MAE): Represents the average absolute
difference between predicted and actual values, providing a
straightforward measure of prediction error.
Root Mean Squared Error (RMSE): The square root of the MSE. It
expresses the average prediction error in the same units as the
stock price, which makes it easier to interpret than the MSE.
R-squared Value: Measures the proportion of variance in the
dependent variable explained by the independent variables,
indicating model fit.
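The four metrics above can be written out directly with NumPy. This is a sketch for illustration, with a hypothetical helper name and made-up prices; the project itself imports the equivalent functions from sklearn.metrics.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute MSE, MAE, RMSE and R-squared from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mse = np.mean(errors ** 2)                       # average squared error
    mae = np.mean(np.abs(errors))                    # average absolute error
    rmse = np.sqrt(mse)                              # same units as the target
    ss_res = np.sum(errors ** 2)                     # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot                       # share of variance explained
    return {"MSE": mse, "MAE": mae, "RMSE": rmse, "R2": r2}

# Made-up prices: every prediction is off by exactly 1.
actual = [100.0, 102.0, 104.0, 106.0]
predicted = [101.0, 101.0, 105.0, 105.0]
result = regression_metrics(actual, predicted)
print(result)
```

For these values every error has magnitude 1, so MSE, MAE and RMSE all equal 1, and R-squared is 1 - 4/20 = 0.8.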
Importing the Dependencies
In [1]: import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
# root_mean_squared_error is available in scikit-learn 1.4 and later
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error, root_mean_squared_error
In [4]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6890 entries, 0 to 6889
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype
 0   Date       6890 non-null   object
 1   Open       6879 non-null   float64
 2   High       6879 non-null   float64
 3   Low        6879 non-null   float64
 4   Close      6879 non-null   float64
 5   Adj Close  6879 non-null   float64
 6   Volume     6879 non-null   float64
dtypes: float64(6), object(1)
memory usage: 376.9+ KB
In [5]: df.shape
Out[5]: (6890, 7)
In [6]: df.describe()
Out[9]:
         Date  Open  High   Low    Volume  Close
0  01-01-1996  5.55  5.60  5.53  985500.0   5.58
Out[10]:
         0
Date     0
Open    11
High    11
Low     11
Volume  11
Close   11
dtype: int64
Data Visualization
In [14]: df.set_index('Date', inplace=True)
# Plot the 'Open' prices with the Date as the x-axis
df['Open'].plot(figsize=(16, 6), title='Open Prices Over Time', legend=True)
plt.xlabel('Date') # Label the x-axis
plt.ylabel('Open Price') # Label the y-axis
plt.grid(True)
plt.show()
In [16]: # Plot the heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(df.corr()[['Close']].sort_values(by='Close', ascending=False),
            annot=True, cmap='coolwarm', fmt='.6f')
plt.title("Correlation of Features with Target Variable")
plt.show()
In [23]: x_train.shape,x_test.shape,y_train.shape,y_test.shape
Out[23]: ((5512, 4), (1378, 4), (5512,), (1378,))
Data Standardization
In [24]: scaler = StandardScaler()
scaler.fit(x_train)
Out[24]: StandardScaler()
In [29]: y_pred.shape
Out[29]: (1378,)
In [30]: dframe = pd.DataFrame({'Actual':y_test,'Predicted':y_pred})
dframe.head(5)
Out[30]: Actual Predicted
Date
In [31]: graph = dframe.head(25)
graph.plot(kind='bar',figsize=(16,5))
plt.title("ITC STOCK : Actual Price vs Predicted Price", fontsize=15)
plt.xlabel("Date", fontsize=12)
plt.ylabel("Price", fontsize=12)
plt.show()
Out[36]: SVR(C=50, epsilon=0.2, kernel='linear')
In [37]: prediction = svr_model.predict(x_test)
print(prediction)
[ 15.78699691 182.18550435 16.50436631 ... 114.10188452 272.44207497
64.00406944]
In [38]: dframeS = pd.DataFrame({'Actual':y_test,'Predicted':prediction})
dframeS.head(5)
Out[38]: Actual Predicted
Date
In [39]: graphS = dframeS.head(25)
graphS.plot(kind='bar',figsize=(16,5))
plt.title("ITC STOCK : Actual Price vs Predicted Price", fontsize=15)
plt.xlabel("Date", fontsize=12)
plt.ylabel("Price", fontsize=12)
plt.show()
Future Work
While the models implemented in this project provide promising results,
there is substantial room for improvement and further exploration:
1. Incorporation of External Data: Integrating additional datasets
such as news sentiment, macroeconomic indicators, and
financial reports could enhance prediction accuracy by
considering external factors influencing stock prices.
2. Feature Engineering: Advanced techniques like feature selection
algorithms and principal component analysis (PCA) could improve
model efficiency by reducing dimensionality and noise.
3. Model Enhancement: Experimenting with more sophisticated
machine learning algorithms, such as Random Forests, Gradient
Boosting (e.g., XGBoost, LightGBM), and deep learning models such
as LSTMs, could capture complex patterns in stock price data.
4. Visualization Dashboards: Creating interactive dashboards to
provide users with clear insights into model predictions, stock
trends, and analysis results.
5. Robust Evaluation: Conducting cross-validation and robustness
checks under varying market conditions to assess model reliability
in different scenarios.
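As one possible form of the cross-validation mentioned in item 5, the sketch below uses scikit-learn's TimeSeriesSplit, which keeps every test block later in time than its training data. This is a suggested technique rather than something the project implemented, and the data is synthetic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit

# Synthetic, time-ordered data (not ITC data).
rng = np.random.default_rng(7)
X = rng.normal(size=(300, 4))
y = X @ np.array([2.0, 1.0, -1.0, 0.5]) + rng.normal(0, 0.2, 300)

# Each fold trains on an initial segment and tests on the block that follows,
# so the model is never evaluated on rows that precede its training rows.
splitter = TimeSeriesSplit(n_splits=5)
fold_mae = []
for train_idx, test_idx in splitter.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    fold_mae.append(mean_absolute_error(y[test_idx], model.predict(X[test_idx])))

print("MAE per fold:", np.round(fold_mae, 3))
print("Mean MAE:", np.mean(fold_mae))
```

A large spread in the per-fold errors would suggest that the model's reliability depends on the market period it is tested on.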
References
ITC Stock Data Set
A copy of the ITC stock price dataset is publicly available on Kaggle.
https://www.kaggle.com/datasets/tejasurya/itc-stock-price-prediction
Scikit-learn Documentation
Scikit-learn, the Python machine learning library used for model
implementation, provides comprehensive resources for algorithms
and tools for data preprocessing, model evaluation, and more.
https://scikit-learn.org/
Pandas Documentation
For data manipulation and analysis, the Pandas library was used.
https://pandas.pydata.org/
NumPy Documentation
NumPy is a core library for numerical computing in Python,
used for array manipulation and mathematical operations in
this project.
https://numpy.org/
Support Vector Machines - A Practical Guide
This guide provides a detailed explanation of Support Vector
Machines, the model used for regression in this project.
https://www.svm-tutorial.com/
Matplotlib Documentation
Matplotlib was used for data visualization in the project. The
official site contains extensive guides for plotting and
customizing graphs.
https://matplotlib.org/
"Machine Learning Yearning" by Andrew Ng
This book by Andrew Ng is a key reference for understanding
machine learning workflows and practical approaches for model
development.
https://www.deeplearning.ai/machine-learning-yearning/