LLM Code TAs in Trading Strategies
Miquel Noguer i Alonso, Hanane Dupouy
March 9, 2024
1 Abstract
In this paper, we perform a comprehensive evaluation of various Large Language Models
(LLMs) for their efficacy in generating Python code specific to algorithmic trading strategies.
Our study encompasses a broad spectrum of LLMs, including GPT-4-Turbo, Gemini-Pro,
Mistral, Llama2, and Codellama, assessing their performance across a series of task-specific
prompts designed to elicit precise code implementations for various technical indicators
commonly used in the financial trading sector.
A principal component of our methodology involves the creation of a detailed prompt struc-
ture that adapts to the unique capabilities of each LLM. For OpenAI’s Assistant AI, we
leverage an intricate prompt design that integrates templated responses, zero-shot task-
specific prompts, and prompt chaining to guide the models through a step-by-step reasoning
process, ensuring the generation of executable and accurate Python code. This structured
approach not only facilitates the model’s comprehension of the task at hand but also allows
for the nuanced adaptation of prompts to cater to the distinct processing styles of different
LLMs.
Our evaluation framework is grounded in a comparison against baseline results obtained from
widely recognized libraries such as TALib, as well as a comprehensive Python implementa-
tion of the indicators. Through a meticulous process of parsing our code and constructing
data frames that encapsulate function names, parameters, and documentation, we establish
a foundational prompt that guides the LLMs to propose viable Python code implementations.
This zero-shot task-specific approach is crucial in enabling the LLMs to methodically navi-
gate through the tasks, thereby enhancing the accuracy and relevance of the generated code.
The findings indicate that GPT-4-Turbo, Codellama-70B, and Gemini-Pro yield encouraging
results relative to baseline computations, with GPT-4-Turbo achieving identical implemen-
tations to the baseline in certain instances.
• ReplicateMeta [Replicate, a]
• ReplicateMistral [Replicate, b]
• StockCharts [StockCharts.com, ]
• “PromptChainer: Chaining Large Language Model Prompts through Visual Program-
ming” discusses an interface and methodology for creating complex applications by
chaining prompts of large language models (LLMs). It addresses the challenge of
executing multi-step tasks that a single LLM prompt cannot handle efficiently. By
enabling the chaining of prompts where the output of one prompt serves as the input
for the next, it facilitates the building of more complex and nuanced AI-driven appli-
cations. This approach also aids in debugging and refining the AI’s output at various
stages of the process, making it more transparent and controllable for users, particu-
larly those who are not AI experts. The study highlights user needs for transforming
data between steps and debugging chains, proposing a solution through a visually
programmed interface designed to make the process more intuitive and accessible.
[Wu et al., 2022]
• Chain of Thought: We explore how generating a chain of thought – a series of inter-
mediate reasoning steps – significantly improves the ability of large language models
to perform complex reasoning. In particular, we show how such reasoning abilities
emerge naturally in sufficiently large language models via a simple method called chain
of thought prompting, where a few chain of thought demonstrations are provided as
exemplars in prompting. Experiments on three large language models show that chain
of thought prompting improves performance on a range of arithmetic, commonsense,
and symbolic reasoning tasks. The empirical gains can be striking. For instance,
prompting a 540B-parameter language model with just eight chain of thought ex-
emplars achieves state of the art accuracy on the GSM8K benchmark of math word
problems, surpassing even finetuned GPT-3 with a verifier. [Wei et al., 2023]
3 Methodology
We introduce an empirical framework to evaluate the LLMs in their ability to generate
correct Python code for algorithmic trading strategies.
The trading strategies evaluated fall into the momentum and trend-following categories. Some
examples of the evaluated indicators are the Moving Average Convergence Divergence (MACD),
the Relative Strength Index (RSI), and the Stochastic Oscillator.
The LLM’s Python code implementation is then compared to a baseline implementation
from the TALib library or to our own code implementation (which produces the same results
as the TALib one).
For information about the indicators, see [StockCharts.com, ]. The goal is to assess whether
the model provides the correct implementation of the strategy, matching the way it is coded
in the TALib library [Benediktsson, ].
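For illustration, a minimal sketch of how such a baseline can be computed with the TA-Lib Python wrapper is given below; the CSV file name and the 'Close' column are assumptions, and the paper's own baseline code may differ in its details:

import pandas as pd
import talib

# Load historical prices (hypothetical file name; any OHLCV CSV works).
prices = pd.read_csv("historical_prices.csv", parse_dates=["Date"], index_col="Date")
close = prices["Close"].values

# Baseline indicators computed with TA-Lib.
baseline_rsi = talib.RSI(close, timeperiod=14)
baseline_macd, baseline_signal, baseline_hist = talib.MACD(
    close, fastperiod=12, slowperiod=26, signalperiod=9
)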
4 What Prompt Design for what LLM?
In our evaluation of various LLMs across different APIs, it becomes necessary to tailor our
prompt techniques to suit each model’s unique characteristics and capabilities.
While utilizing the OpenAI API, we leverage its extensive array of tools and features to
facilitate code execution, file storage, and comparison. This necessitates customizing our
prompts to align with the capabilities and nuances of this environment.
However, with other APIs such as GenAI and Replicate, we employ different prompts tailored
to the more straightforward interaction these models offer.
3. Should you encounter the word “Apolog” within your message, conclude the answer
with: “Processing incomplete, I encounter issues. Please stand by.”
4. If you have been asked to generate files but they have not been produced, end your
response with: “Processing incomplete, files not yet generated. Please stand by.”
Employing this prompt technique enables us to identify when to stand by and when to
prompt the assistant to revisit the task that hasn’t been completed.
The zero-shot task-specific prompts enable the LLM to tackle various small tasks
step-by-step, such as “reading the text”, “extracting the name of the algo”, “identifying which
type of algo it is”, “proposing a Python implementation”, and “storing the data in a JSON file
with given keys”. This prompt design enables the model to think step-by-step and improves its
accuracy in inferring the tasks.
The prompt chaining technique enables the LLM to break down tasks into several
task-clusters, using the results from each task-cluster as input for the subsequent ones. This
approach is designed to allow the LLM time to process information.
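As an illustration only, and not the authors’ exact prompts, the two techniques can be combined roughly as follows, with the output of a first task-specific call fed into a second call. The sketch uses the simpler chat completions endpoint rather than the Assistant API, and the model name and prompt wording are assumptions:

from openai import OpenAI

client = OpenAI()

def ask(prompt):
    # One chat-completion call; "gpt-4-turbo" is used here as an example model name.
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Zero-shot task-specific prompt: small, explicit sub-tasks solved step by step.
first_prompt = (
    "Let's think step-by-step:\n"
    "1. Read the strategy description below.\n"
    "2. Extract the name of the algo and which type of algo it is.\n"
    "3. Propose a Python implementation.\n"
    "4. Return a JSON object with the keys 'algo_name', 'category_name', "
    "'python_code' and 'method_name'.\n"
    "Description: Relative Strength Index, function get_rsi(close, period=14)."
)
first_answer = ask(first_prompt)

# Prompt chaining: the first answer becomes the input of the next task cluster.
second_prompt = (
    "From the JSON object below, extract the value of the key 'python_code' and "
    "explain, step by step, which columns the code will produce.\n" + first_answer
)
second_answer = ask(second_prompt)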
4.2 Gemini-Pro
While employing a single-shot call to the GenAI API for accessing Gemini-Pro (with the
methodology detailed subsequently), we adopt the zero-shot task-specific prompt technique.
5 Evaluation Metrics
The primary factor for assessment is the executability of the code. It was noted that some
LLMs, when tasked with specific algorithmic strategies, generate code that fails to run. This
issue persists even after multiple attempts. For instance, the generated code might include
references to ’pd.rolling’ or ’pd.Date,’ which are not valid functions in the pandas library.
This is known as hallucination.
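As a concrete illustration of this failure mode: pandas exposes rolling computations as a method on a Series or DataFrame rather than as a top-level function, so a hallucinated call such as pd.rolling(...) raises an AttributeError, whereas the valid idiom is:

import pandas as pd

close = pd.Series([100.0, 101.5, 102.0, 101.0, 103.2])

# pd.rolling(close, 3) fails: module 'pandas' has no attribute 'rolling'.
# The valid pandas idiom calls rolling() on the Series itself.
sma = close.rolling(window=3).mean()
print(sma)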
The second evaluation, conducted after executing the code, involves using the Root
Mean Squared Error (RMSE) metric. This step entails comparing the outcomes of the
specified strategy, for which the LLM has proposed the code, against the baseline results,
which could be derived from TALib or from our comprehensive implementation.
The Root Mean Square Error (RMSE) metric computes the square root of the average squared
error between the true values $y_i$ and the predicted values $\hat{y}_i$, where the predicted
values are those computed by the code proposed by the model.
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}$$
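A direct translation of this metric into code, as applied to the indicator series (a minimal sketch; NaN values produced by indicator warm-up periods are assumed to be dropped before the comparison):

import numpy as np

def rmse(y_true, y_pred):
    # Root Mean Squared Error between baseline values and LLM-generated values.
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Ignore positions where either series is NaN (e.g. the indicator warm-up period).
    mask = ~(np.isnan(y_true) | np.isnan(y_pred))
    return np.sqrt(np.mean((y_true[mask] - y_pred[mask]) ** 2))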
The third evaluation involves effectively comparing the generated code. Some LLMs de-
clare parameters in the function definition but fail to use them within the actual code. For
instance, they might specify “close”, “high”, “low” as parameters, but in practice, only
“close” is utilized in the code. This does not render the code inexecutable, but it leads to
highly inaccurate results.
Some LLMs employ moving-average methods that differ from those used in the base model.
These methods can produce initial values that differ from those of the base algorithm; however,
the values tend to converge with the base values over time.
The details of these calculations will be thoroughly analyzed in the results section.
6 Prompting Framework
To construct the primary prompt for our study, we first need to extract the relevant infor-
mation pertaining to a specific trading strategy. Subsequently, we tailor the design of the
prompts according to the LLM being evaluated.
Based on our own code:
We extract all functions and store the information in a dataframe (a sketch of this extraction
step follows the list below). The stored information includes:
• Name of the method
• Name of the algo
• Parameters
• Docstring
• Code
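A minimal sketch of this extraction step, using Python’s standard inspect module on a hypothetical module named own_indicators (the real code base and the exact fields may differ):

import inspect
import pandas as pd

import own_indicators  # hypothetical module containing our indicator functions

rows = []
for method_name, func in inspect.getmembers(own_indicators, inspect.isfunction):
    rows.append({
        "method_name": method_name,
        "algo_name": method_name.replace("get_", ""),  # crude guess at the algo name
        "parameters": list(inspect.signature(func).parameters),
        "docstring": inspect.getdoc(func),
        "code": inspect.getsource(func),
    })

functions_df = pd.DataFrame(rows)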
In our case, we used the code interpreter tool, which allows the model to write and exe-
cute Python code in a sandboxed execution environment.
OpenAI offers various features including the ability to upload and store files, as well as
the capability to generate and maintain files within OpenAI’s system. To effectively uti-
lize these features, we made calls to the Assistant API, requesting it to generate Python
code, execute it on the uploaded file, and then produce and store the results in JSON and
CSV files. Additionally, we asked it to plot graphs for result comparison and store these
as well. See Appendix C, which describes the call steps adopted with the Assistant AI API.
As the history of conversations is stored in memory within a thread used by a given as-
sistant, employing certain prompt techniques becomes relevant and contributes significantly
to the stability of the process for generating the requested Python code and the necessary
files. Prompt techniques such as zero-shot task-specific prompts and prompt chaining, as
previously explained, were then utilized. However, we encountered some issues when trying
to automate the calls: the LLM requires time to process information.
To automate the process, we introduced a delay to ensure we could collect the final
answers. At times, the responses can be misleading. To address this, we added extra
safeguards and directives to the main prompt of the assistant to categorize the types of the
responses. Depending on the response category, we would then take appropriate actions.
For example, the LLM might confirm that the CSV file has been successfully generated and
even provide the correct name as requested. However, upon checking the annotation object
from the call where the file ID should be stored, it could be found to be empty. We called
this prompt technique “templated response”, as previously explained.
With the new classification of responses, it became possible to take action by seeking
further clarification or additional inputs from the model, thereby enhancing the robustness of
the automated process. Furthermore, safeguard checks are implemented on the assistant’s
final responses to identify when files are missing, prompting a request for the model to
generate them again.
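A minimal sketch of how this classification can be automated on the assistant’s final message, keying off the sentinel phrases defined in the templated-responses prompt of Appendix D; the function and category names are illustrative:

def classify_response(message_text, file_ids):
    # Categorize the assistant's answer using the templated sentinel phrases.
    if "Processing incomplete, I encounter issues" in message_text:
        return "error_retry"
    if "Processing incomplete, files not yet generated" in message_text or not file_ids:
        return "missing_files_retry"
    if "Processing incomplete, I'm still thinking" in message_text:
        return "wait"
    if "The response is complete" in message_text:
        return "done"
    return "unknown"

# An answer that claims success but carries no file id is sent back for another attempt.
print(classify_response("The response is complete", file_ids=[]))  # -> missing_files_retry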
The strategy description is based on our baseline implementation, which is con-
structed using our code and yields results equal to those of the TALib library.
The following details are included in the strategy description:
1. Name of the Strategy
2. Name of our function
3. The parameters of the function
Using this description, we construct our initial zero-shot task-specific prompt. This approach
directs the model to provide precise responses by instructing it to employ step-by-step rea-
soning in addressing various questions.
We request the model to perform several tasks, including proposing a Python code imple-
mentation and saving the information in a JSON file. Here is the prompt used in this
step:
You can find an example of the final answer from GPT-4-Turbo using a JSON format in
Appendix D.
Prompt chaining
After receiving the final response and generating the JSON file, we issue a subsequent
prompt instructing the model to run the provided code and save the computed indicator
results in a CSV file for future use. We also guide the model through step-by-step reasoning,
allowing it time to process and clarify the various tasks it needs to complete. The details of
the prompt used in this step are included in Appendix E.
This final response provides us with the strategy outcomes derived from the code sug-
gested by the model. In the next section, we will ask the model to execute our own code.
We guide the model using step-by-step task-specific requests, including executing our
own code and saving the results. Here is the prompt used in this step:
Prompt
3. Generate a csv file with the results and make sure to store it. Call it 'results_own_{method_name}.csv'.
Make sure to add a 'Date' column in the generated csv.
4. Retrieve the column names from this csv file, and count the number of columns.
5. Store all this information in a json file to be called 'results_own_{method_name}.json' with the following keys:
'column_names', 'columns_number' and 'file_id' which is the id of the CSV file you've just generated

Code:
```\n{own_code}\n```
"""
1 - Compare the files within the OpenAI environment: Since both files are already
stored in OpenAI’s system, the model can be requested to compare them. However, since
each execution of the code may produce CSV files with different names for the indicators,
how can the system determine which columns from both files should be compared?
This is the reason why the names and the number of columns from both files are stored. Using
this data, a new prompt can be constructed to read the JSON files where the information
is stored. Subsequently, the model can be instructed to conduct a similarity search for each
pair of columns, excluding non-relevant ones such as date, close, open, high, etc.
Therefore, the columns with the highest similarity can be identified, and the model can be
asked to plot their figures and even calculate the Mean Squared Error (MSE) between each
relevant pair of columns.
Here is a similarity comparison example related to the Moving Average Convergence Divergence
(MACD) strategy. Different similarity measures have been used: cosine similarity, a modified
Hamming distance, and the Szymkiewicz-Simpson overlap coefficient.
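As an illustration, a sketch of the Szymkiewicz-Simpson overlap coefficient applied to column-name matching at the character level; the exact tokenization is not detailed here, so this character-set variant is only an assumption:

def overlap_coefficient(a, b):
    # Szymkiewicz-Simpson overlap: |A ∩ B| / min(|A|, |B|), here on character sets.
    set_a, set_b = set(a.lower()), set(b.lower())
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / min(len(set_a), len(set_b))

print(overlap_coefficient("signal_line", "SignalLine"))  # 1.0 -> strong match
print(overlap_coefficient("signal_line", "K"))           # 0.0 -> no useful signal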
However, this approach is found to be time-consuming and sometimes irrelevant for cer-
tain algorithms. For example, when the indicators are named simply “K” and “D” as in the
Stochastic Oscillator, the similarity score becomes irrelevant.
2 - Files are compared outside the OpenAI environment using the conventional method,
which involves reading the files with pandas, comparing the pertinent columns and comput-
ing the RMSE.
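A minimal sketch of this second option; the file and column names below are illustrative:

import numpy as np
import pandas as pd

# CSV files previously generated for our own code and for the LLM's code.
own = pd.read_csv("results_own_get_macd.csv", parse_dates=["Date"], index_col="Date")
llm = pd.read_csv("results_llm_get_macd.csv", parse_dates=["Date"], index_col="Date")

# Align the two files on dates and compare the pertinent columns.
joined = own[["macd"]].join(llm[["MACD"]], how="inner").dropna()
rmse = np.sqrt(np.mean((joined["macd"] - joined["MACD"]) ** 2))
print(f"RMSE: {rmse:.6f}")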
The reason for using this basic feature is that the Gemini-pro API does not support code execution
in a sandboxed environment, unlike OpenAI.
We simply need to obtain the code suggested by Gemini-pro, convert it into a JSON object,
store it for subsequent auditing, and then execute it.
As described in the GPT-4-Turbo section, in the prompt we provide the name of the strategy,
the name of our function, and the parameters.
We request the LLM to generate a flat, full Python implementation of the chosen trading
indicator, using a prompt similar to the one detailed in the zero-shot task-specific section
for GPT-4-Turbo.
It is important to provide this description with the parameters in order to constrain the LLM
to use the same notation, so that the code can be executed without any additional transformations.
In this prompt, we employed a zero-shot task-specific technique to allow the LLM to tackle
one task at a time, providing it with the necessary time to process each step.
You can find an example of Gemini-pro’s answer in Appendix G.
After receiving the LLM’s response, despite specifically requesting a JSON object, we
encountered issues loading it as such. This issue occurred with almost all LLMs (except
GPT-4-Turbo). To address this, we made a call to the OpenAI API using the chat completion
method and gpt-3.5-turbo, converting the response into a valid, loadable JSON format.
This object was then saved, and the Python code executed locally, with the results stored
accordingly.
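A sketch of this repair step; the instruction wording is an assumption, but the call mirrors the chat completion pattern shown in Appendix D:

import json
from openai import OpenAI

client = OpenAI()

def repair_json(raw_answer):
    # Ask gpt-3.5-turbo to turn a near-JSON answer into a valid, loadable JSON object.
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You output only valid JSON, nothing else."},
            {"role": "user", "content": f"Convert this into a valid JSON object: ```{raw_answer}```"},
        ],
    )
    return json.loads(completion.choices[0].message.content)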
Following this, we execute the Python code on the historical prices file and compute the
RMSE against the base implementation.
6.4 Building the Main Prompt for the Other LLMs from Meta and
Mistral
We used the Replicate API for llama2 (7b, 13b, 70b chat versions), codellama (7b, 13b, 34b,
70b instruct versions) and Mistral (7b, 8x7b instruct versions). Replicate hosts Meta and
Mistral LLMs, among others.
https://replicate.com/meta
https://replicate.com/mistralai
We construct a prompt similar to the one used in the Gemini-pro section but with fewer
instructions. This adjustment is made because we’ve observed that the codellama model
sometimes has difficulty interpreting and understanding more complex instructions.
We shorten the prompt even more when using codellama for some strategies.
As with Gemini-pro, we observed that the JSON object provided by these models is invalid.
Consequently, an additional call to OpenAI’s chat.completion method using gpt-3.5-turbo
was necessary to create a valid JSON object, which could then be stored and executed.
7 Results
We assessed six algorithmic trading strategies using the following models: GPT-4-Turbo,
Gemini-pro, Llama2-7b-chat, Llama2-13b-chat, Llama2-70b-chat, Codellama-7b-instruct,
Codellama-13b-instruct, Codellama-34b-instruct, Codellama-70b-instruct, Mistral-7b-instruct-
v0.2, Mixtral-8x7b-instruct-v0.1.
The first phase of the assessment is dedicated to verifying the code’s executability. Once
confirmed that the code operates without issues, the subsequent phase entails calculating
the Root Mean Squared Error (RMSE) to compare the performance of the baseline model
against the output from the LLM.
Armed with these findings, a comprehensive analysis will be undertaken to evaluate the
Root Mean Squared Error (RMSE), determining its significance in terms of being notably
high or low.
As part of this analysis, we also check:
• Whether the LLM implementation yields only a single value (1, 0, NaN, or a floating-point
number). In such cases, the response is inaccurate, and the evaluation cannot proceed.
• Whether the LLM uses the whole set of mandatory parameters, e.g. high and low, when
the indicator needs these values to be computed (see the sketch after this list).
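A minimal sketch of how these two checks can be automated on a generated function; the checks may equally be done by inspection, and the helper below is illustrative:

import inspect
import pandas as pd

def check_generated_indicator(func, result):
    # Check 1: the result should be a rolling series of values, not a single scalar.
    returns_series = isinstance(result, (pd.Series, pd.DataFrame)) and len(result) > 1

    # Check 2: every declared parameter should actually appear in the function body.
    source = inspect.getsource(func)
    body = source.split(":", 1)[1]  # crude split: everything after the signature
    params = list(inspect.signature(func).parameters)
    unused = [p for p in params if p not in body]

    return returns_series, unused

# Example usage, assuming a generated get_stoch(high, low, close) and price series:
# series_ok, unused_params = check_generated_indicator(get_stoch, get_stoch(high, low, close))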
For GPT-4-Turbo and Gemini-pro, 100% of the suggested code for each strategy is exe-
cutable. In addition, 100% of the generated results contained several distinct values (rolling
calculations), and 100% of the strategies used the parameters declared in their function in
the core of the code implementation.
For Codellama-13b, 100% of the suggested code is executable, 83% of the strategies pro-
duced coherent values, and 100% of the strategies used the declared parameters.
For Codellama-70b, these numbers fall to 83% executable code, 67% of the strategies produce
rolling values, and 83% include the parameters declared in their functions. The same results
are observed for Mixtral-8x7b.
For Mistral-7b, 100% of the proposed code is executable, but only 50% of the strategies lead
to rolling values that can be evaluated against the baseline computation.
Llama2 and Codellama-7b exhibit the lowest percentage of executable code that yields
coherent result values.
Executable code does not necessarily equate to desirable outcomes. The subsequent section
will explore how Codellama-70b, despite only achieving a 67% rate of executable code that
yields coherent values, exhibits commendable performance across diverse strategies. This is
in contrast to Mixtral-8X7b, which, while displaying comparable rates of executable code,
differs in effectiveness.
In the subsequent phase of our analysis, we will assess the concordance between the
outcomes generated by the executable code and the baseline computations.
Table 2 displays the Root Mean Square Error (RMSE) evaluation results for six trading
strategies, with certain strategies yielding multiple indicators.

Table 2: Evaluation of RMSE for Eight Trading Strategies, Including Main and Intermediate
Indicators, Across Various LLMs.

In three of the eight indicators assessed, GPT-4-Turbo yields values that precisely match
those of the baseline computations.
Furthermore, Gemini-Pro and Codellama-70b also produce exact matches in some strategies,
while exhibiting minimal errors in others.
In Table 3, GPT-4-Turbo demonstrates the lowest RMSE in five of the eight strategies
evaluated, and in two others, it presents the second lowest RMSE. Codellama-70b ranks
as the second most performant model, delivering the best results in three out of eight
indicators. Additionally, Gemini-Pro achieves the second-best performance in four out of
the eight instances.
Table 3: Evaluation of RMSE: Number of times an LLM is ranked N°1 or N°2. N°1: meaning
the LLM has the lowest RMSE. N°2: meaning the LLM has the second lowest RMSE.
As shown in Table 2, Llama2-70b shows one of the largest errors for MACD. In
Figure 3, you can find an illustration of the model’s results compared to the baseline, GPT-4-
Turbo and Gemini-pro:
Figure 1: Computation Results: Baseline, GPT-4-Turbo, Gemini-pro, Codellama-70b for
MACD strategy
Why do GPT-4 and Gemini-Pro initially diverge from the baseline model and subse-
quently show a tendency to converge shortly thereafter?
• You can see that the code uses the same initialization logic as the baseline.
• Then, it computes an EWM average (see the sketch after this list).
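A plausible explanation, sketched below, is that the exponential moving averages underlying MACD can be seeded differently, for example from the first price versus from an initial simple moving average. Because the weight of the initial condition decays geometrically, implementations that differ only in this seeding diverge at the start of the series and converge afterwards. The sketch is illustrative and is not the exact code of any of the evaluated models:

import numpy as np
import pandas as pd

def ema_from_first_value(close, span):
    # Recursive EMA started from the first observation (pandas ewm with adjust=False).
    return close.ewm(span=span, adjust=False).mean()

def ema_from_sma_seed(close, span):
    # Recursive EMA seeded with the simple average of the first `span` observations,
    # a convention used by several technical-analysis libraries.
    alpha = 2.0 / (span + 1.0)
    values = np.full(len(close), np.nan)
    values[span - 1] = close.iloc[:span].mean()  # SMA seed
    for i in range(span, len(close)):
        values[i] = alpha * close.iloc[i] + (1 - alpha) * values[i - 1]
    return pd.Series(values, index=close.index)

# The two conventions differ at the start of the series, and the gap shrinks geometrically,
# which is consistent with the initial divergence and later convergence seen in the figures.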
Figure 2: Computation Results: Baseline, GPT-4-Turbo, Gemini-pro, Mixtral-8x7b for
MACD strategy
9 Conclusion
In conclusion, our comprehensive evaluation of various Large Language Models (LLMs), in-
cluding GPT-4-Turbo, Gemini-Pro, Mistral, Llama2, and Codellama, has yielded significant
insights into their capabilities for generating Python code tailored to algorithmic trading
strategies. By employing a detailed prompt structure that caters to the unique attributes of
each LLM, we have been able to guide these models in generating executable and accurate
Python code for a wide array of technical indicators crucial to the financial trading sector.
Our methodology, which combines templated responses, zero-shot task-specific prompts,
and prompt chaining, has proven effective in enhancing the LLMs’ understanding of the
tasks at hand. This approach allowed for the nuanced adaptation of prompts to match the
distinct processing styles of different LLMs, thereby optimizing their performance.
The evaluation framework, grounded in a comparison against baseline results from estab-
lished libraries such as TALib and a comprehensive Python implementation of the indicators,
has enabled us to measure the efficacy of the LLMs accurately. Our findings suggest that
models like GPT-4-Turbo, Codellama-70B, and Gemini-Pro exhibit promising capabilities,
with GPT-4-Turbo achieving identical implementations to the baseline in certain cases.
This study not only underscores the potential of LLMs in automating and enhancing
the development of algorithmic trading strategies but also highlights the importance of
structured prompt design in unlocking the full capabilities of these models. The encouraging
results obtained from GPT-4-Turbo, Codellama-70B, and Gemini-Pro indicate a forward
path in the application of LLMs within the domain of financial trading, suggesting that
with further refinement and adaptation, these models could become indispensable tools for
financial analysts and traders alike.
References
[Benediktsson, ] Benediktsson, J. ta-lib-python.
[Replicate, a] Replicate. Replicate meta. https://replicate.com/meta. Accessed: 2024-
02-06.
[Replicate, b] Replicate. Replicate mistral ai. https://replicate.com/mistralai. Ac-
cessed: 2024-02-06.
[StockCharts.com, ] StockCharts.com. Technical indicators and overlays. https://school.
stockcharts.com/doku.php?id=technical_indicators. Accessed: 2024-02-06.
[Wei et al., 2023] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi,
E., Le, Q., and Zhou, D. (2023). Chain-of-thought prompting elicits reasoning in large
language models.
[Wu et al., 2022] Wu, T., Jiang, E., Donsbach, A., Gray, J., Molina, A., Terry, M., and Cai,
C. J. (2022). Promptchainer: Chaining large language model prompts through visual
programming.
10 Appendix
A Llama, Codellama, Mistral: Versions
Here are the various versions used in the experiments. For each model, we used the latest
version available in the Replicate API:
LLAMA2 7B CHAT =
“meta/llama-2-7b-chat:13c3cdee13ee059ab779f0291d29054dab00a47dad8261375654de5540165fb0”
LLAMA2 13B CHAT =
“meta/llama-2-13b-chat:f4e2de70d66816a838a89eeeb621910adffb0dd0baba3976c96980970978018d”
LLAMA2 70B CHAT =
“meta/llama-2-70b-chat:02e509c789964a7ea8736978a43525956ef40397be9033abf9fd2badfe68c9e3”
CODELLAMA 7B INSTRUCT =
“meta/codellama-7b-instruct:aac3ab196f8a75729aab9368cd45ea6ad3fc793b6cda93b1ded17299df369332”
CODELLAMA 13B INSTRUCT =
“meta/codellama-13b-instruct:a5e2d67630195a09b96932f5fa541fe64069c97d40cd0b69cdd91919987d0e7f”
CODELLAMA 34B INSTRUCT =
“meta/codellama-34b-instruct:eeb928567781f4e90d2aba57a51baef235de53f907c214a4ab42adabf5bb9736”
MISTRAL 7B =
“mistralai/mistral-7b-instruct-v0.2:79052a3adbba8116ebc6697dcba67ad0d58feff23e7aeb2f103fc9aa545f9269”
MISTRAL 8x7B =
“mistralai/mixtral-8x7b-instruct-v0.1:7b3212fbaf88310cfef07a061ce94224e82efc8403c26fc67e8f6c065de51f21”
B Python Code Implementation Generated by GPT-4-Turbo
The following code is proposed by GPT-4-Turbo to implement the Relative Strength Index
(RSI) indicator:
import pandas as pd

def get_rsi(close, period=14):
    # Calculate price differences
    delta = close.diff()

    # Separate the gains and losses
    gains = delta.where(delta > 0, 0)
    losses = -delta.where(delta < 0, 0)

    # Calculate the Exponential Moving Averages (EMA) of the gains and losses
    avg_gain = gains.ewm(com=period - 1, min_periods=period).mean()
    avg_loss = losses.ewm(com=period - 1, min_periods=period).mean()

    # Calculate the Relative Strength (RS)
    rs = avg_gain / avg_loss

    # Calculate the Relative Strength Index (RSI)
    rsi = 100 - (100 / (1 + rs))

    return rsi
{
  "algo_name": "Relative Strength Index (RSI)",
  "category_name": "Momentum",
  "python_code": "\nimport pandas as pd\n\n
    def get_rsi(close, period=14):\n
        # Calculate price differences\n
        delta = close.diff()

        # Separate the gains and losses\n
        gains = delta.where(delta > 0, 0)
        losses = -delta.where(delta < 0, 0)

        # Calculate the Exponential Moving Averages (EMA) of the gains and losses\n
        avg_gain = gains.ewm(com=period - 1, min_periods=period).mean()
        avg_loss = losses.ewm(com=period - 1, min_periods=period).mean()

        # Calculate the Relative Strength (RS)
        rs = avg_gain / avg_loss

        # Calculate the Relative Strength Index (RSI)
        rsi = 100 - (100 / (1 + rs))

        return rsi",
  "method_name": "get_rsi"
}
C Call Steps
• Extract data from our own code and build a knowledge database
• Fetch historical prices using the yfinance library (see the sketch after this list)
• Call to OpenAI
• Upload the file
• Build the main prompt to provide to the assistant, using templated responses
• Create an assistant, thread, messages and run (Assistant AI components)
• First call to ask the model to suggest a Python code given a short description of
a strategy, and store the response
• Second call to request the model to execute the code and store the results
• Third call to request the model to execute our own code and store the results in
files.
• Fourth call (Optional), we ask the model to read both results files from our own
code and the model’s code, and compare them.
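As referenced above, a minimal sketch of the data-fetch and file-upload steps; the ticker, period, and file name are illustrative:

import yfinance as yf
from openai import OpenAI

client = OpenAI()

# Fetch historical prices (illustrative ticker and period) and save them to CSV.
prices = yf.download("BTC-USD", period="2y", interval="1d")
prices.to_csv("historical_prices.csv")

# Upload the file so the assistant's code interpreter tool can read it.
uploaded = client.files.create(
    file=open("historical_prices.csv", "rb"),
    purpose="assistants",
)
print(uploaded.id)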
D Templated-responses Prompt
Refining the initial instructions to the Assistant AI using the templated-responses prompt:
prompt = """
You are an algorithmic trader specialized in cryptocurrency and a data scientist.
You master Python, algorithmic trading strategies and cryptocurrency field.
You'll be given a file with historical prices and volume.
You'll also be given a Python code for algorithmic trading strategies.
You'll be asked to execute this code on the provided file and store it as a csv file.
You'll answer the questions based only on the file with historical prices and the Python code I'll give you.

If you've finished answering, print this message at the end of your answer : "The response is complete" .

If you didn't fully answer my question, and you need more time, please add at the end of each of your intermediate responses:
"Processing incomplete, I'm still thinking. Please stand by."

If the word "Apolog" appears in your message, please add at the end of your answer:
"Processing incomplete, I encounter issues. Please stand by"

If I ask you to generate files (json, csv, image), and you didn't, please proceed until you generate them.
If I ask you to generate files (json, csv, image), and you didn't, add at the end of your answer:
"Processing incomplete, files not yet generated. Please stand by"
"""

completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "you are a helpful assistant."},
        {"role": "user", "content": f"""This is a prompt I want to give to an assistant. Could you improve it, to make the assistant understand it very well, with very clear instructions, and make it concise: ```{prompt}``` """},
    ],
)
print(completion.choices[0].message.content)
GPT-3.5-Turbo’s answer:
new_prompt = """
As an algorithmic trader with a specialty in cryptocurrencies and possessing extensive expertise as a data scientist, you possess advanced skills in Python and algorithmic trading strategies, along with a comprehensive knowledge of the cryptocurrency market.

Your primary task involves the following:

Data Analysis:
1. You will receive a file containing historical data, which will include information such as prices, trading volumes.
2. You will be handed by a Python script with pre-defined algorithmic trading strategies.

Comparison Task:
1. You are required to execute the provided Python script using the historical data file.

Responding to Queries:
1. Answer any questions posed to you using only the information provided in these files.

Response Protocol:
1. Upon completing an answer, conclude with this statement: "The response is complete."
2. If you are unable to complete the answer and require additional time, append this statement to your intermediate responses: "Processing incomplete, I'm still thinking. Please stand by."
3. Should you encounter the word "Apolog" within your message, conclude the answer with: "Processing incomplete, I encounter issues. Please stand by."

File Generation:
1. If requested to generate files (in formats such as JSON, CSV, or images), and have not yet done so, please ensure to complete this task promptly.
2. If you have been asked to generate files but they have not been produced, end your response with: "Processing incomplete, files not yet generated. Please stand by."
"""
E Prompt Chaining
message_user = f""" From the json file you have just created, called 'results_llm_{method_name}.json', Let's think step-by-step:
1. Extract the value of the keys called 'python_code' and 'algo_name' .
2. Execute the extracted python code on the historical data file uploaded at the beginning in the assistant.
3. Generate a csv file with the results and call it 'results_llm_{method_name}.csv'.
Make sure to add a 'Date' column in the generated CSV file.
Make sure to store it.
4. Retrieve the column names from this CSV file, and count the number of columns.
5. Modify the json file called 'results_llm_{method_name}.json' to add the following keys:
'column_names', 'columns_number' and 'file_id' which is the id of the CSV file
"""
        avg_loss[i] = (1 - alpha) * avg_loss[i - 1] + down[i-1] * alpha

    # Calculate the Relative Strength (RS) and RSI
    rs = avg_gain / avg_loss
    rsi = 100 - (100 / (1 + rs))
    return rsi
{
  "algo_name": "Relative Strength Index (RSI),
  "category_name": "Momentum",
  "python_code": "
    def get_rsi(close, period = 14):
        # Calculate the average gain and loss
        average_gain = 0
        average_loss = 0
        for i in range(1, period + 1):
            if close[i] > close[i - 1]:
                average_gain += close[i]
        # Calculate the relative strength
        relative_strength = average_gain / average_loss
        # calculate the RSI
        rsi = 100 - (100 / (1 + relative_strength))
        return rsi"
  , "method_name": "get_rsi"
}
        if i < period:
            continue
        elif avg_loss[i] == 0:
            rs = inf
        else:
            rs = avg_gain[i] / abs(avg_loss[i])
        rsi.append(100 - 100 / (1 + rs))
    return pd.Series(rsi, index=close.index)
Baseline vs GPT-4-Turbo:
Baseline vs Gemini-Pro:
Baseline vs Codellama-70b:
Figure 5: Computation Results: Baseline vs Gemini-Pro for MACD strategy
    # MACD Histogram calculation:
    macd_histogram = macd - signal_line
    return macd, signal_line, macd_histogram
# codellama_70_b
def get_macd(close, period_fast=12, period_slow=26, period_signal=9):
    EMAFast = [0] * len(close)
    EMASlow = [0] * len(close)
    MACD = [0] * len(close)
    SignalLine = [0] * len(close)
    for i in range(len(close)):
        if (i < period_fast):
            EMAFast[i] = sum(close[:period_fast]) / float(period_fast)
        else:
            EMAFast[i] = ((EMAFast[i-1]*(period_fast-1)+2*close[i])/float(period_fast+1))
    for i in range(len(close)):
        if (i < period_slow):
            EMASlow[i] = sum(close[:period_slow]) / float(period_slow)
        else:
            EMASlow[i] = ((EMASlow[i-1]*(period_slow-1)+2*close[i])/float(period_slow+1))
    for i in range(len(close)):
        MACD[i] = EMAFast[i] - EMASlow[i]
        if (i < period_signal):
            SignalLine[i] = sum(MACD[:period_signal]) / float(period_signal)
        else:
            SignalLine[i] = ((SignalLine[i-1]*(period_signal-1)+2*MACD[i])/float(period_signal+1))
    return MACD, SignalLine