
Course Project Report

High Frequency Price Prediction of Index Futures

Submitted By

Abhaysingh Rajput (221AI002)


Rohil Sharma (221AI033)

as part of the requirements of the course

Data Science (IT258) [Dec 2023 - Apr 2024]

in partial fulfillment of the requirements for the award of the degree of

Bachelor of Technology in Artificial Intelligence

under the guidance of

Dr. Sowmya Kamath S, Dept of IT, NITK Surathkal

undergone at

DEPARTMENT OF INFORMATION TECHNOLOGY

NATIONAL INSTITUTE OF TECHNOLOGY KARNATAKA, SURATHKAL

DEC 2023 - APR 2024


DEPARTMENT OF INFORMATION TECHNOLOGY
National Institute of Technology Karnataka, Surathkal

CERTIFICATE

This is to certify that the Course Project Work Report entitled “High Frequency Price Prediction
of Index Futures” is submitted by the group mentioned below -

Details of Project Group


Name of the Student Register No. Signature with Date

Abhaysingh Rajput 2210587

Rohil Sharma 2210455

This report is a record of the work carried out by them as part of the course Data Science (IT258)
during the semester Dec 2023 - Apr 2024. It is accepted as the Course Project Report submission in
partial fulfillment of the requirements for the award of the degree of Bachelor of Technology in
Artificial Intelligence.

(Name and Signature of Course Instructor)


Dr. Sowmya Kamath S
Associate Professor, Dept. of IT, NITK
DECLARATION

We hereby declare that the project report entitled “High Frequency Price Prediction of Index
Futures” submitted by us for the course Data Science (IT258) during the semester Dec 2023 - Apr
2024, as part of the partial course requirements for the award of the degree of Bachelor of Technology
in Artificial Intelligence at NITK Surathkal is our original work. We declare that the project has not
formed the basis for the award of any degree, associateship, fellowship or any other similar titles
elsewhere.

Details of Project Group

Name of the Student Register No. Signature with Date

1. Abhaysingh Rajput 2210587

2. Rohil Sharma 2210455

Place: NITK, Surathkal


Date: Update date of submission
High Frequency Price Prediction of Index Futures
Abhaysingh Rajput, Rohil Sharma

Abstract— Modeling the short-term dynamics of prices in financial markets from high-frequency trading data requires careful data processing. In this paper we predict the direction of 1-second-ahead movements of the order book mid price of an index futures contract. A detailed data preparation procedure is described. To understand data characteristics such as distributions, patterns, and relationships among variables, the data must first be analyzed and visualized. Techniques such as correlation analysis help identify the most important factors, and additional indicator variables are engineered from the available data to improve the accuracy of the price-change forecasts. Missing values are filled in using information from similar data points, so the original distribution of the data is maintained. Before modeling, the data are normalized one last time to put all variables on the same scale. Through these steps, raw order book data is prepared to feed machine learning models such as gradient boosting and random forests.

I. INTRODUCTION

Forecasting future price fluctuations in the financial markets is a highly sought-after goal. The emergence of high-frequency trading and the accessibility of detailed order book information have created new opportunities to investigate the complex dynamics and patterns that influence market behavior. This paper explores the challenging problem of predicting 1-second price fluctuations for a futures contract by utilizing the abundance of data seen in high-frequency order book snapshots.

II. LITERATURE SURVEY

A. Mostafa Goudarzi / 2023

The paper identifies High-Frequency Trading (HFT) activity using selected features and a suitable machine learning algorithm, and performs Automated Market-Making (AMM) classification using fuzzy logic. Once the optimal algorithm is identified, it is tested on data from different trading days and the results are interpreted.

1) Advantages: The analysis of the samples confirmed that, among several supervised learning strategies, the cluster model combined with random undersampling is the most efficient method for identifying HFTs in order book data.

2) Limitation: A comparative study is still needed to identify the optimal algorithm.

B. Rajan Lakshmi A / 2017

This paper explores the impact of low-latency trading on Indian capital markets and discusses the merits and demerits of high-frequency trading.

1) Advantages: Benefits of HFT algorithms include increased liquidity, better efficiency, increased returns for investors, and lower volatility.

2) Limitation: Harmful HFT algorithms exist that can disrupt the market, for example through high intra-day volatility and a high order-to-trade ratio.

C. Spyros Makridakis, Evangelos Spiliotis / March 27, 2018

The methodology involves data preprocessing to enhance forecasting accuracy through techniques like seasonal adjustments, transformations, and trend removal. It also explores multiple-horizon forecasting approaches, including iterative, direct, and multi-neural-network methods, evaluating their accuracy and computational complexity across various forecasting horizons. Challenges such as overfitting and computational complexity are addressed, with future research directions proposed for improvement.

1) Advantages: The methodology improves forecasting accuracy through diverse preprocessing techniques and multiple-horizon forecasting approaches while addressing challenges like overfitting and computational complexity, laying the groundwork for further advancements in forecasting methodologies.

2) Limitation: The methodology faces challenges including computational complexity, potential overfitting, manual preprocessing requirements, and a lack of uncertainty estimation in ML forecasts.

III. OUTCOME OF LITERATURE SURVEY

1) Paper-1: The analysis of the samples confirmed that, among several supervised learning strategies, the cluster model combined with random undersampling is the most efficient method for detecting HFTs in order book data. The model also showed the best classification performance on the other five trading days of the week, performing well on all remaining days.
2) Paper-2: With the rise of algorithmic trading and high-frequency trading (HFT), there have been notable improvements in various aspects of market dynamics. These include reduced transaction costs, lower volatility, and a more balanced buy-sell ratio. Market efficiency has improved, aiding better price discovery. Colocation services have also contributed by minimizing latency and creating a level playing field for HFT market participants.

However, leveraging advanced technology for algorithmic trading and HFT requires substantial technical expertise and resources. The lack of proper controls has introduced systemic risks, with the potential for significant deviations from healthy market prices due to errors or faulty algorithms. This poses a challenge for traditional investors who prioritize fundamental analysis.

HFT firms often utilize specialized services such as co-location facilities and have direct access to raw market data. While HFT can contribute to price fluctuations and short-term volatility, it also brings efficiencies to market operations.

3) Paper-3: The literature survey highlights the comparative forecasting performance of machine learning (ML) and statistical methods. It reveals that while ML methods offer computational advantages, they often fall short in accuracy compared to simpler statistical models. The survey underscores the need for ML methods to improve accuracy, reduce computational complexity, automate preprocessing, and provide uncertainty estimates for practical forecasting applications.

IV. PROBLEM STATEMENT

To identify and apply suitable data preprocessing methodologies; to meticulously analyze, visualize, and refine the data; to perform advanced EDA in order to understand the underlying intuition behind the data and explore ways to improve it, making it suitable for predictive modeling; and to choose a suitable model for prediction.

V. OBJECTIVES

• Perform exploratory data analysis (EDA) to understand the distributions, explore patterns and relationships within the order book data, and visualize key insights using descriptive statistics and data visualization techniques.

• Preprocess the data by handling outliers and missing values and standardizing features, while also exploring feature engineering methods to enhance predictive power and ensure data quality.

• Develop and evaluate machine learning models to predict the probability of future price changes, using appropriate algorithms and hyperparameter tuning, while validating the effectiveness of preprocessing techniques through cross-validation and performance metrics.

VI. EXISTING METHODOLOGY

Lee and Mykland developed the LM estimator, a tool in financial econometrics for estimating the integrated volatility of asset prices. It is particularly useful in noisy market conditions. A simplified overview:

1. Input:
- High-frequency price data (e.g., stock prices).
- Sampling frequency (e.g., tick-by-tick data).

2. Preprocessing:
- Remove outliers.
- Ensure evenly sampled data.

3. Initialization:
- Set parameters such as the observation count and the time-window length.
- Decide how to handle irregularities such as price jumps.

4. Estimation Algorithm:
- Calculate the realized volatility for each observation.
- Smooth the volatility using a kernel function to reduce noise.
- Aggregate the smoothed estimates over a time window.

5. Output:
- Integrated volatility estimates for each time period.

The LM estimator involves more complex mathematics, but these steps give a basic understanding of how it works.
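As an illustration of the estimation steps above, the following Python sketch computes a realized-volatility series with kernel smoothing and rolling aggregation. It is not the original LM implementation; the column name price, the window sizes, and the Gaussian kernel are assumptions.

import numpy as np
import pandas as pd

def integrated_volatility(prices, window=100, bandwidth=10):
    # Sketch of steps 4a-4c: squared log-returns, kernel smoothing, rolling aggregation.
    log_ret = np.log(prices).diff().dropna()            # per-observation log-returns
    local_vol = log_ret ** 2                            # squared returns as a local volatility proxy
    smoothed = local_vol.rolling(bandwidth, min_periods=1,
                                 win_type="gaussian").mean(std=bandwidth / 3)  # kernel smoothing
    return smoothed.rolling(window, min_periods=1).sum()   # aggregate over a time window

# usage, assuming a DataFrame df with a tick-level 'price' column:
# df["integrated_vol"] = integrated_volatility(df["price"])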
VII. PROPOSED ENHANCEMENTS / NOVELTY

VIII. DATASET

A. About The Data

The raw dataset used in this paper can be accessed at: https://www.kaggle.com/competitions/caltech-cs155-2020

The dataset under consideration poses a distinct prediction problem: each row represents a single point in time and the state of the order book in that particular 5 ms interval. The main objective is to build a predictive framework that can precisely predict the direction of the "mid" price, a critical statistic that denotes the midpoint between the best bid price (usually the first bid, bid1) and the best ask price (in most cases simply the first ask, ask1). In particular, each timestep must be categorized as either a potentially increasing trajectory (designated as "1") or a static or falling trajectory (designated as "0").
B. Terminologies in Dataset

• id - The timestep ID of the order book features.

• last_price - The price at which the most recent order fill occurred.

• mid - The "mid" price, which is the average of the best bid1 and the best ask1 prices.

• opened_position_qty - In the past 500ms, how many buy orders were filled?

• closed_position_qty - In the past 500ms, how many sell orders were filled?

• transacted_qty - In the past 500ms, how many buy+sell orders were filled?

• d_open_interest - In the past 500ms, what is (# buy orders filled) - (# sell orders filled)?

• bid1 - What is the 1st bid price (the best/highest one)?

• bid[2, 3, 4, 5] - What is the [2nd, 3rd, 4th, 5th] best/highest bid price?

• ask1 - What is the 1st ask price (the best/lowest/cheapest one)?

• ask[2, 3, 4, 5] - What is the [2nd, 3rd, 4th, 5th] best/lowest/cheapest ask price?

• bid1vol - What is the volume of contracts in the order book at the 1st bid price (the best/highest one)?

• bid[2,3,4,5]vol - What is the quantity of contracts in the order book at the [2,3,4,5]th bid price (the [2,3,4,5]th best/highest one)?

• ask1vol - What is the quantity of contracts in the order book at the 1st ask price (the best/lowest/cheapest one)?

• ask[2,3,4,5]vol - What is the quantity of contracts in the order book at the [2,3,4,5]th ask price (the [2,3,4,5]th best/lowest/cheapest one)?

• y (unique to training data) - What is the change in the mid price from now to 2 timesteps (approx. 1 second) in the future? "1" means this change is strictly positive, and "0" means the change is 0 or negative.
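For concreteness, the snippet below shows one way to recompute the mid price and a y-style label from these columns. It is a sketch under the assumption that the training data is available locally as train.csv with the column names listed above.

import pandas as pd

df = pd.read_csv("train.csv")                       # assumed local file name

df["mid_check"] = (df["bid1"] + df["ask1"]) / 2     # mid = average of best bid and best ask
# y-style label: 1 if the mid price 2 timesteps (~1 second) ahead is strictly higher, else 0
df["y_check"] = (df["mid"].shift(-2) > df["mid"]).astype(int)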
IX. METHODOLOGY

A. Roadmap

To guarantee the quality and usefulness of the data for analysis, a thorough data preprocessing pipeline is necessary before starting the modeling step. This procedure includes feature engineering, data cleansing, addressing missing values, and possible scaling or transformations to improve the features' predictive power. Furthermore, correcting any imbalances in the target variable and investigating resampling strategies would be required to lessen biases and enhance model functionality. The study will use a range of machine learning algorithms to train and assess predictive models once the data has undergone thorough preprocessing. During this iterative process, essential metrics including precision, recall, F1-score, and accuracy are carefully chosen, hyperparameters are adjusted, and model performance is rigorously assessed. The final objective is to create a strong predictive framework that, using the complex patterns seen in the order book data, can reliably predict future price changes.

B. Data Preprocessing

1) Exploratory Data Analysis (EDA): We started by calculating central tendency measures like the mean, median, and mode to understand the core values of our features. To gauge data spread, we computed the variance and standard deviation. Visual aids such as histograms and box plots helped us grasp the data's shape and identify potential outliers. We used hexbin plots, which merge scatter plot and histogram principles, to visualize bivariate distributions in the dataset. Density contour plots offered another view by outlining dense regions with contour lines. Heatmaps displayed feature correlations visually, while violin plots depicted density profiles, emphasizing areas with more data points and the skewness of each feature in the dataset.

Fig. 1: histogram
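A minimal sketch of these visualizations, assuming the DataFrame df from above plus matplotlib and seaborn; the chosen columns are illustrative, not prescriptive.

import matplotlib.pyplot as plt
import seaborn as sns

df[["mid", "last_price", "transacted_qty"]].hist(bins=50)      # univariate distributions
plt.figure()
plt.hexbin(df["bid1vol"], df["ask1vol"], gridsize=40)           # hexbin: bivariate density
plt.figure()
sns.violinplot(data=df[["bid1vol", "ask1vol"]])                 # density profiles and skewness
plt.figure()
sns.heatmap(df.corr(), cmap="coolwarm")                         # feature correlation heatmap
plt.show()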
Fig. 4: Conducting EDA on the features

Fig. 5: Violin Plots

2) Outlier Detection and Handling: From the previous EDA we learned that the data is highly skewed and that there were a lot of outliers and missing values. For detecting the outliers we used the IQR method from the box plot analysis.

Fig. 2: Outlier detection using Box Plot

Fig. 3: IQR calculations

After identifying the outliers, our initial approach was to trim them, but this proved fatal as a significant amount of data was lost in the process. Instead, we adopted a more robust approach, capping outliers by replacing them with the value at Q3 obtained from the IQR analysis, preserving most of the data while minimizing the impact of extreme values.
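A sketch of this capping step; the 1.5×IQR fence is the usual convention and an assumption here, as is the list numeric_cols of feature names.

def cap_outliers_iqr(s):
    # Cap values above the upper IQR fence at Q3 instead of dropping the rows.
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    upper_fence = q3 + 1.5 * (q3 - q1)
    return s.mask(s > upper_fence, q3)   # NaNs are left untouched for later imputation

# for col in numeric_cols:
#     df[col] = cap_outliers_iqr(df[col])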
3) Feature Construction: We introduced a new feature, the "Liquidity Indicator", which uses the bid-ask spread, calculated as the difference between the best bid and best ask prices, and incorporates the volume at the best bid and ask prices. By capturing nuanced aspects like liquidity, it improves the model's ability to forecast price changes accurately.
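One possible sketch of such features; the exact formula for the combined indicator is not specified in the text, so the version below (best-level volume divided by the spread) is only an assumed illustration.

df["spread"] = df["ask1"] - df["bid1"]                                   # bid-ask spread
df["best_depth"] = df["bid1vol"] + df["ask1vol"]                         # volume at the best quotes
df["liquidity_indicator"] = df["best_depth"] / (df["spread"] + 1e-9)     # assumed definition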
4) Feature Scaling: Initially, we explored robust scaling methods like RobustScaler, but since we had already handled the outliers, and given the imputation techniques we use in this paper, we ultimately used StandardScaler to standardize all the features before missing value imputation.
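A sketch of this step; feature_cols, the list of predictor columns, is an assumption. Note that scikit-learn's StandardScaler ignores NaNs when fitting and preserves them, which is what allows scaling before imputation.

from sklearn.preprocessing import StandardScaler

feature_cols = [c for c in df.columns if c not in ("id", "y")]   # assumed predictor columns
scaler = StandardScaler()                                        # z = (x - mean) / std per feature
df[feature_cols] = scaler.fit_transform(df[feature_cols])        # NaNs ignored during fit, kept in output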
5) Missing Value Imputation: Initially, techniques such as deletion and disregarding the affected features did not yield satisfactory results and also led to significant data loss; moreover, in the EDA we had studied the correlations between the features and found that the features containing missing values had only a weak positive correlation with the rest. We therefore turned to imputing the missing values instead. The methods considered were mean/median/mode imputation, regression imputation, and K-Nearest Neighbors (KNN) imputation. KNN imputation gave the most satisfactory results while preserving the overall dispersion of the features, so we implemented the K-Nearest Neighbors (KNN) algorithm for missing value imputation. Within KNN we tried both distance-based and uniform averaging of the neighbor values; distance-based averaging should in theory give better results, and this was confirmed in practice.
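A sketch of the imputation step with scikit-learn's KNNImputer; n_neighbors=5 is an assumed value.

from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=5, weights="distance")    # distance-weighted neighbor averaging
df[feature_cols] = imputer.fit_transform(df[feature_cols])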
6) Advanced Exploratory Data Analysis: Before the final model development, advanced Exploratory Data Analysis (EDA) techniques can provide deeper insights into the data, uncover complex relationships, and aid in the selection and engineering of relevant features for modeling. Here are some advanced EDA methods:

Fig. 6: Density Contour Plots

Fig. 7: Skewness Reduction

Fig. 8: Missing value dataset MAR

Fig. 9: Sample of transformed dataset

Correlation and Covariance Analysis:

• Compute correlation matrices to identify linear relationships between pairs of features.

• Visualize correlation matrices using heatmaps or similar techniques for easy interpretation.

• Analyze covariance matrices to understand the spread and direction of feature relationships.

• Identify and potentially remove highly correlated or redundant features to avoid multicollinearity issues.
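As a sketch of how such a correlation screen might look in code; the 0.95 threshold is an arbitrary assumption.

import numpy as np

corr = df[feature_cols].corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))   # keep the upper triangle only
to_drop = [c for c in upper.columns if (upper[c] > 0.95).any()]     # features highly correlated with another
print("candidate redundant features:", to_drop)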
Principal component analysis (PCA) is one of the most popular techniques for finding the principal components of a dataset that capture the maximum variance in the data, and we use it to reduce our high-dimensional dataset. It should be noted that even though we possess enough samples to support the large number of features in this dataset, due to computational constraints we would not be able to use all of the data samples, so we need to reduce the dimensionality of the dataset so that the new principal components capture as much as possible of the variation shared between the different features. The goal of PCA is to transform a high-dimensional dataset into a low-dimensional subspace, preserving as much of the original information, or variance, as possible. Here is a more detailed explanation of PCA and its implementation:

1) Motivation and Rationale:

• High-dimensional datasets often contain redundant or highly correlated features, leading to the curse of dimensionality and computational inefficiencies.

• PCA finds the directions of maximum variance in the data (also known as principal components) and projects the data onto a lower-dimensional subspace spanned by these components.

• This projection captures the most important patterns and structures in the data while reducing noise and redundancy.

2) Mathematical Formulation:

• Let X be an n x p data matrix, where n represents the number of observations and p is the number of features.

• PCA seeks to find a linear transformation that maps the original p-dimensional data onto a k-dimensional subspace (where k < p) while minimizing the reconstruction error.

• The principal components are the orthogonal directions (eigenvectors) of the covariance or correlation matrix of X, corresponding to the largest eigenvalues.

3) Implementation Steps:

a. Data Preprocessing: Center the data by subtracting the mean from each feature, or standardize (scale) the data if features have different units or scales.

b. Compute the Covariance Matrix: Calculate the covariance matrix of the preprocessed data.

c. Perform eigendecomposition on the covariance matrix to obtain the eigenvectors and corresponding eigenvalues.

d. Sort the eigenvectors by their corresponding eigenvalues in descending order. The top k eigenvectors with the largest eigenvalues are chosen as the principal components.

e. Project the original data onto the k-dimensional subspace spanned by the selected principal components by multiplying the data matrix X with the matrix of the k principal component vectors.

4) Determining Principal Components:

• The choice of k (the number of PCs to retain) is crucial and depends on the desired trade-off between dimensionality reduction and information preservation.

• Common approaches include:

– Explained Variance Ratio: Retain enough components to account for a certain percentage (e.g., 90% or 95%) of the total variance in the data.

– Scree Plot: Plot the eigenvalues in descending order and look for an "elbow" or sudden flattening, indicating that subsequent components contribute little to the overall variance.

– Domain Knowledge: Choose k based on prior knowledge or interpretability requirements of the specific problem domain.

Fig. 10: Heat map

Fig. 11: Post PCA on test dataset
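These steps correspond closely to scikit-learn's PCA. The sketch below, which retains enough components for 95% of the variance (an assumed threshold), illustrates steps a-e together with the explained-variance criterion.

from sklearn.decomposition import PCA

pca = PCA(n_components=0.95)                       # keep enough PCs to explain ~95% of the variance
X_reduced = pca.fit_transform(df[feature_cols])    # assumes the features were already standardized
print(pca.explained_variance_ratio_.cumsum())      # cumulative explained variance per component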
C. Model Development and Evaluation:

Model Selection: We evaluated several machine learning algorithms to choose the best approach for forecasting 1-second price changes in futures contracts.

Random Forest: We selected Random Forest due to its capability to handle high-dimensional data and capture complex non-linear relationships with ease. Random Forest is an ensemble of decision trees. It uses bootstrap sampling and random feature subsets to build diverse trees. For prediction, it aggregates the outputs of all trees: a majority vote for classification. The ensemble reduces variance and overfitting. Hyperparameters like max_depth and n_estimators were fine-tuned using techniques like GridSearchCV.

C(x) = majority_vote(C_1(x), C_2(x), ..., C_B(x))

where C(x) is the predicted class, C_b(x) is the prediction of the b-th tree, and B is the total number of trees (for regression, the predicted value f(x) would instead average the per-tree predictions f_b(x)).
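A sketch of this model selection step; the parameter grid and the reuse of X_reduced from the PCA step are assumptions.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {"n_estimators": [100, 300], "max_depth": [5, 10, None]}
grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid, cv=5, scoring="accuracy")
grid.fit(X_reduced, df["y"])                 # y is the binary mid-price direction label
print(grid.best_params_, grid.best_score_)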
Gradient Boosting: Models like CatBoost were explored for their proficiency in capturing intricate patterns in HFT data. Hyperparameters such as learning_rate and max_depth were optimized through randomized search.

Model Evaluation: Stratified k-fold cross-validation assessed model generalization. We computed accuracy, precision, recall, and F1-score to measure predictive capabilities.

Hyperparameter Tuning: For Random Forest and Logistic Regression, GridSearchCV/RandomizedSearchCV evaluated hyperparameter combinations systematically. For gradient boosting models, randomized search strategies maximized validation metrics.
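A sketch of this evaluation protocol, again assuming X_reduced and the label column y; the 5-fold split is an assumption.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_validate(LogisticRegression(max_iter=1000), X_reduced, df["y"],
                        cv=cv, scoring=["accuracy", "precision", "recall", "f1"])
print({k: v.mean() for k, v in scores.items() if k.startswith("test_")})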
Logistic Regression: As the main objective is to predict the fall or rise of the market mid price, it is feasible to use a logistic regression model on this dataset. Logistic regression is a statistical model that predicts the probability of one of two outcomes using one or more predictor variables. It uses the logistic (sigmoid) function to map a linear combination of the predictors to a probability in [0, 1]. We applied hyperparameter tuning using grid search.

X. RESULTS OBTAINED

The results obtained with all the techniques are discussed below; the performance metric used here is accuracy.

Models                      Accuracy
Logistic Regression         0.663
Random Forest Classifier    0.667
CatBoost                    0.671

REFERENCES

[1] Jonathan Brogaard, Allen Carrion, Thibaut Moyaert, Ryan Riordan, Andriy Shkilko, Konstantin Sokolov. (2018). High frequency trading and extreme price movements.
[2] A. Lakshmi and Sailaja Vedala. (2017). A study on low latency trading in Indian stock markets. International Journal of Civil Engineering and Technology, 8(12), 733-743.
[3] Peter Gomber, Björn Arndt, Marco Lutat, and Tim Uhle. (2011). High-Frequency Trading. SSRN Electronic Journal. doi: 10.2139/ssrn.1858626.
[4] C. Dutta, K. Karpman, S. Basu, et al. (2023). Review of Statistical Approaches for Modeling High-Frequency Trading Data. Sankhya B, 85(Suppl 1), 1–48.
[5] G.P.M. Virgilio. (2019). High-frequency trading: a literature review. Financ Mark Portf Manag, 33, 183–208. doi: 10.1007/s11408-019-00331-6.