DS Report
Submitted By
undergone at
CERTIFICATE
This is to certify that the Course Project Work Report entitled “High Frequency Price Prediction
of Index Futures” is submitted by the group mentioned below -
This report is a record of the work carried out by them as part of the course Data Science (IT258)
during the semester Dec 2023 - Apr 2024. It is accepted as the Course Project Report submission in
partial fulfillment of the requirements for the award of the degree of Bachelor of Technology in
Artificial Intelligence.
We hereby declare that the project report entitled “High Frequency Price Prediction of Index
Futures”, submitted by us for the course Data Science (IT258) during the semester Dec 2023 - Apr
2024 in partial fulfillment of the course requirements for the award of the degree of Bachelor of Technology
in Artificial Intelligence at NITK Surathkal, is our original work. We declare that the project has not
formed the basis for the award of any degree, associateship, fellowship or any other similar title
elsewhere.
After identifying the outliers, our initial approach was to trim them, but this proved fatal as a significant amount of data was lost in the process. Instead, we adopted a more robust approach, capping outliers by replacing them with the value at Q3, preserving most of the data while minimizing the impact of extreme values.

3) Feature Construction: We introduced a new feature, the “Liquidity Indicator,” which utilizes the bid-ask spread, calculated as the difference between the best ask and best bid prices, and incorporates the volume available at the best bid and ask prices. By capturing nuanced aspects such as liquidity, this feature improves the model’s ability to forecast price changes accurately.

4) Feature Scaling: Initially, we explored robust scaling methods such as RobustScaler, but since the outliers had already been handled and given the imputation technique adopted in this work, we ultimately used StandardScaler to standardize all features before missing value imputation.

5) Missing Value Imputation: Initially, techniques such as deletion and disregarding the affected feature did not yield satisfactory results and also led to significant data loss. In the EDA phase we had studied the correlation between the features and found that the feature containing the missing values had only a weak positive correlation with the others, so we turned to imputing the missing values instead. The methods considered for imputation were mean/median/mode imputation, regression imputation, and K-Nearest Neighbors (KNN) imputation. KNN imputation gave the most satisfactory results while preserving the overall dispersion of the features, so we implemented the K-Nearest Neighbors (KNN) algorithm for missing value imputation. Within KNN we tried both distance-based and uniform averaging of the neighbouring values; distance-based averaging should theoretically give better results, and this was confirmed when implemented in practice.

Fig. 8: Missing value dataset (MAR)
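To make the preprocessing pipeline above concrete, the following is a minimal sketch in Python with pandas and scikit-learn. The column names best_bid, best_ask, bid_volume and ask_volume are hypothetical placeholders for the corresponding order-book fields, and the number of neighbours is illustrative rather than the tuned value used in the project.

# Sketch of the preprocessing steps described above (illustrative column names).
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()

    # Feature construction: liquidity indicator from the bid-ask spread
    # and the volume available at the best quotes.
    df["spread"] = df["best_ask"] - df["best_bid"]
    df["liquidity_indicator"] = df["bid_volume"] + df["ask_volume"]

    # Outlier capping: values above the third quartile (Q3) are replaced
    # by Q3 instead of dropping rows, so no samples are lost.
    numeric_cols = df.select_dtypes("number").columns
    q3 = df[numeric_cols].quantile(0.75)
    df[numeric_cols] = df[numeric_cols].clip(upper=q3, axis=1)

    # Standardize all features before imputation (NaNs are ignored here).
    scaled = StandardScaler().fit_transform(df[numeric_cols])

    # KNN imputation with distance-weighted averaging of the neighbours.
    imputed = KNNImputer(n_neighbors=5, weights="distance").fit_transform(scaled)
    return pd.DataFrame(imputed, columns=numeric_cols, index=df.index)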
6) Advanced Exploratory Data Analysis: Before the final model development, advanced Exploratory Data Analysis (EDA) techniques can provide deeper insights into the data, uncover complex relationships, and aid in the selection and engineering of relevant features for modeling. Some of the advanced EDA methods are:

• Visualize correlation matrices using heatmaps or similar techniques for easy interpretation.

• Analyze covariance matrices to understand the spread and direction of feature relationships.

• Identify and potentially remove highly correlated or redundant features to avoid multicollinearity issues.

Principal component analysis (PCA), one of the most popular techniques for finding the principal components of a dataset that capture the maximum variance in the data, is used by us to reduce our high-dimensional dataset. It should be noted that even though we possess enough samples to support the high number of features in this dataset, we still reduce the dimensionality for computational reasons:

• High-dimensional data often contains redundant or highly correlated features, leading to the curse of dimensionality and computational inefficiencies.

• PCA is used to find the directions of maximum variance in the data (also known as principal components) and projects the data onto a lower-dimensional subspace spanned by these components.

• This projection captures the most important patterns and structures in the data while reducing noise and redundancy.

2) Mathematical Formulation:

• Let X be an n x p data matrix, where n is the number of samples and p is the number of features.
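The remainder of the formulation follows the standard PCA construction; the following is a sketch stated for completeness, where every symbol beyond X, n and p is an assumption of this sketch rather than the report's own notation:

\[
C = \frac{1}{n-1} X^{\top} X, \qquad C v_i = \lambda_i v_i, \quad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p,
\]
\[
W_k = [\, v_1 \; v_2 \; \dots \; v_k \,], \qquad Z = X W_k,
\]

where X is assumed to be centered (column means subtracted), C is the p x p covariance matrix, the eigenvectors v_i are the principal components, and Z is the n x k projection of the data onto the first k components. The fraction of variance explained by the first k components is \(\sum_{i=1}^{k} \lambda_i \big/ \sum_{j=1}^{p} \lambda_j\).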
3) Implementation Steps:

a. Data Preprocessing: Center the data by subtracting the mean from each feature, or standardize (scale) the data if features have different units or scales.

b. Compute the Covariance Matrix: Calculate the covariance matrix of the preprocessed data.
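A minimal NumPy sketch of these steps, together with the projection that follows them, assuming the feature matrix X has already been standardized as described in the feature scaling step; the function name and the parameter k are illustrative:

# Minimal PCA implementation sketch: center, covariance, eigendecomposition, projection.
import numpy as np

def pca_project(X: np.ndarray, k: int) -> np.ndarray:
    # a. Preprocessing: center each feature (standardization assumed done earlier).
    Xc = X - X.mean(axis=0)

    # b. Covariance matrix of the preprocessed data (p x p).
    cov = np.cov(Xc, rowvar=False)

    # Eigendecomposition: eigenvectors are the principal components,
    # eigenvalues are the variances captured along them.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]        # sort by descending variance
    eigvecs = eigvecs[:, order]

    # Project the data onto the first k principal components.
    return Xc @ eigvecs[:, :k]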
4) Determining Principal Components:

• The choice of k (the number of principal components to retain) is crucial and depends on the desired trade-off between dimensionality reduction and information preservation.

• Common approaches include retaining enough components to reach a target cumulative explained variance, or inspecting a scree plot of the eigenvalues for an elbow (see the sketch below).

Fig. 11: Post PCA on test dataset
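A short sketch of the cumulative-explained-variance approach, using the eigenvalues computed in the previous sketch; the 95% threshold is an arbitrary example, not the value used in the project:

# Choose k as the smallest number of components explaining >= threshold of the variance.
import numpy as np

def choose_k(eigvals: np.ndarray, threshold: float = 0.95) -> int:
    eigvals = np.sort(eigvals)[::-1]                 # descending variances
    explained = np.cumsum(eigvals) / eigvals.sum()   # cumulative explained variance
    return int(min(np.searchsorted(explained, threshold) + 1, len(eigvals)))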
Hyperparameter Tuning: For Random Forest and Logistic Regression, GridSearchCV/RandomizedSearchCV evaluated hyperparameter combinations systematically. For the gradient boosting models, randomized search strategies were used to maximize the validation metrics.

Logistic Regression: As the main objective is to predict the fall or rise of the market mid-price, it is feasible to use a logistic regression model on this dataset. Logistic regression is a statistical model that predicts the probability of one of two outcomes using one or more predictor variables. It uses the logistic (sigmoid) function, the inverse of the logit, to map a linear combination of the predictors to a probability in [0, 1]. We applied hyperparameter tuning using grid search.
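As a minimal illustration of this setup, the sketch below fits a scikit-learn LogisticRegression inside a GridSearchCV. The label definition (mid-price up vs. down at the next tick), the parameter grid and the chronological train/test split are assumptions made for the example, not the exact configuration used in the project.

# Hedged sketch: mid-price direction classification with grid-searched logistic regression.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

def fit_direction_classifier(X: np.ndarray, mid_price: np.ndarray) -> LogisticRegression:
    # Binary target: 1 if the mid-price rises at the next tick, 0 otherwise.
    y = (np.diff(mid_price) > 0).astype(int)
    X = X[:-1]                                   # align features with next-tick labels

    # Chronological split (no shuffling) to respect the time ordering of HFT data.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, shuffle=False)

    # Illustrative grid over the regularization strength.
    grid = {"C": [0.01, 0.1, 1.0, 10.0]}
    search = GridSearchCV(LogisticRegression(max_iter=1000), grid,
                          cv=5, scoring="accuracy")
    search.fit(X_train, y_train)

    print("best params:", search.best_params_)
    print("held-out accuracy:", search.best_estimator_.score(X_test, y_test))
    return search.best_estimator_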