Keywords: Code Generation; Forecasting Time Series Data; Deep Learning Models; Long Short-Term Memory (LSTM); Prompt Engineering; Falcon; LLama-2; GPT-3; PaLM

While the capabilities of generative AI technologies in generating textual information have already been demonstrated in several application domains, their abilities in generating complex models and executable code still need to be explored. An intriguing case is the goodness of the machine learning and deep learning models generated by large language models (LLMs) for automated scientific data analysis, where a data analyst may not have enough expertise to manually code and optimize complex deep learning models and may therefore opt to leverage LLMs to generate the required models. This paper investigates and compares the performance of mainstream LLMs, such as ChatGPT, PaLM, LLama, and Falcon, in generating deep learning models for analyzing time series data, an important and popular data type with prevalent applications in many domains, including the financial and stock markets. This research conducts a set of controlled experiments in which the prompts for generating deep learning-based models are controlled with respect to sensitivity levels of four criteria: 1) Clarity and Specificity, 2) Objective and Intent, 3) Contextual Information, and 4) Format and Style. While the results are relatively mixed, we observe some distinct patterns. We notice that, using LLMs, we are able to generate deep learning-based models with executable code for each dataset separately whose performance is comparable with that of manually crafted and optimized LSTM models built for predicting the whole time series dataset. We also notice that ChatGPT outperforms the other LLMs in generating more accurate models. Furthermore, we observe that the goodness of the generated models varies with respect to the "temperature" parameter used in configuring the LLMs. The results can be beneficial for data analysts and practitioners who would like to leverage generative AIs to produce good prediction models with acceptable goodness.
• Meta AI's Llama 2 [23] is a series of large language models with 7 to 70 billion parameters, and it is excellent on knowledge, reasoning, and code benchmarks.
• Google's PaLM [4], a 540B-parameter model, is highly proficient across various applications such as translation, QA pipelines, and even arithmetic.

These large language models have been leveraged to perform analysis and model-building tasks automatically by providing prompts in a natural language format. LLMs such as GPT-3.5-Turbo, Falcon, LLama 2, and PaLM are extensively used for code generation [14], text generation [25], and image generation [18], where the instructions are provided through prompts in text.
An interesting question is whether LLMs can also be leveraged by professional data analysts with expertise in certain domains (e.g., the financial market) to help them generate a relatively good model (e.g., Long Short-Term Memory, LSTM) and the corresponding executable code (e.g., Python) automatically, without any additional need for learning the complex syntax and semantics of developing these deep learning-based forecasting models (e.g., LSTM) from scratch. In the context of anomaly detection on time series [8] [11], URL detection [10], and vulnerability detection in smart contracts [9], the LSTM model demonstrates exceptional performance, which is the primary reason for choosing the LSTM model in this work. This paper conducts an exploratory analysis to investigate the performance of generative AIs, and in particular LLMs, and to assess the goodness of the deep learning code generated by LLMs for building deep learning-based models for forecasting time series data. The underlying motivation is that most data analysts who deal with time series data may need to design and develop their own complex deep learning-based code. However, individuals not familiar with complex deep learning models often have limited knowledge of building and training such models. They can instead generate code to build and train such models by prompting these large language models. The prompting approach makes deep learning more accessible to individuals who might have little or no experience but can take advantage of deep learning to work on their time series data.
The paper explores prompts with a controlled sensitivity analysis based on categorical levels to study the goodness of the models and code generated by LLMs for deep learning-based time series analysis. The goal is to comprehensively evaluate and comprehend the influence of the various category levels defined for the prompts. Each prompt for generating time series analysis deep learning-based code is crafted based on four criteria: 1) Clarity and Specificity, 2) Objective and Intent, 3) Contextual Information, and 4) Format and Style. Furthermore, to assess the impact of each criterion on the goodness of the models created, we consider three classes of intensity or sensitivity level as expressed in each prompt, namely 1) high, 2) medium, and 3) low intensity, where intensity refers to the amount of information given to the LLMs through prompts. This paper makes the following key contributions:
1. We conduct a number of experiments where the sensitivity levels (i.e., Low, Medium, and High) of four criteria, namely Clarity and Specificity, Objective and Intent, Contextual Information, and Format and Style, are controlled.
2. We report that LLMs are capable of generating relatively good models that are comparable with manually coded and optimized models and code.
3. We report that, amongst the LLMs studied, ChatGPT outperformed the others in most cases, generating more accurate models for predicting time series data in the context of financial and stock data.
4. The results also show that the performance of LLMs varies with respect to the temperature parameter in the configuration when generating deep learning-based prediction models.
5. We also report that we did not observe a clear benefit of crafting more complex and detailed prompts in generating better and more accurate models. The results are mixed: in some cases, models generated with simple prompts outperform models generated with more complex prompts. The results seem to depend on the setting of the temperature parameter.

The rest of this paper is structured into the following sections. Section 2 discusses relevant research studies. Section 3 contains the preliminary background related to the LLMs studied (i.e., GPT-3.5-Turbo, Falcon, LLama-2, and PaLM). Section 4 outlines the research questions addressed in this work. Section 5 presents the experimental design, including the dataset, prompt framework, LLM configurations, and performance metrics used to evaluate the deep learning-based code and models generated by LLMs for time series analysis. Our methodology is discussed in Section 6. Section 7 reports the results and discussion obtained by each LLM across the categories and levels. Section 8 presents the limitations of the paper, and Section 9 summarizes the conclusions of the work.

2. Related Work
In November 2022, OpenAI released ChatGPT [15]. In February 2023, Meta released LLaMa [22], followed by the Technology Innovation Institute (TII) introducing "Falcon LLM" [16], a foundational Large Language Model (LLM) with 40 billion parameters, in March 2023. In May 2023, Google joined the race and announced PaLM [4]. Moreover, Meta continued to release models, offering a set of models in July 2023 under the name Llama 2 [23], with parameter counts ranging from 7 billion to 70 billion. Since then, major high-tech companies have continued improving their LLMs by adding additional features and capabilities.
The idea of leveraging generative AIs in building executable code and models has been discussed in several research papers. Vaithilingam et al. [24] conducted a study
with 24 volunteers to evaluate the usability of GitHub Copilot, a code generation tool that employs sophisticated language models. Participants in the research completed Python programming tasks using both Copilot and a controlled condition that used VSCode's default IntelliSense functionality. The research sought to ascertain the influence of these tools on programming experience, error detection, problem-solving tactics, and barriers to their adoption. According to the authors' quantitative analysis, there was no significant difference in task completion times between the Copilot and IntelliSense groups. However, it was discovered that Copilot users had more failures, which were mostly related to Copilot's incorrect suggestions. Despite this, the majority of participants (19 out of 24) preferred Copilot because of its ability to give a useful and informative starting point that eliminates the need for frequent Web searches. However, several participants had difficulty comprehending and debugging the code generated by Copilot.
Destefanis et al. [7] studied and compared the performance of two AI models, GPT-3.5 and Bard, in generating code for Java functions. The Java functions and their descriptions were sourced from the CodingBat website, a platform for practicing programming problems. The evaluation of the Java code generated by the models was based on correctness, which was verified using CodingBat's test cases. The results of the evaluation showed that GPT-3.5 outperformed Bard in code generation, producing accurate code for around 90.6% of the functions, while Bard achieved correctness for only 53.1% of the functions. Both AI models displayed strengths and weaknesses. GPT-3.5 consistently performed better across most problem categories, except for functional programming, where both models showed similar performance.
Liu et al. [12] proposed EvalPlus, a framework for rigorously evaluating the functional correctness of code generated by large language models (LLMs). The framework addresses the issue of insufficient test coverage in current coding benchmarks such as HumanEval, which employ only a few manually written test cases and consequently miss numerous problems in LLM-generated code. EvalPlus is built around an automated test input generator that combines LLM-based and mutation-based methods. It begins by using ChatGPT to generate high-quality seed inputs focusing on edge cases. The seeds are then mutated using type-aware operators to quickly produce a large number of new test cases. The findings showed that inadequate benchmark testing can have a significant impact on claimed performance. EvalPlus also found flaws in 11% of the original HumanEval solutions. Through automated testing, the study points in the direction of thoroughly analyzing and refining programming benchmarks for LLM-based code generation.
Ni et al. [14] proposed LEVER, a method for improving language-to-code generation by code language models (LLMs) using trained verifiers. They trained verifier models based on the natural language input, the program code, and the execution outcomes to determine the validity of generated programs. LEVER was tested on four language-to-code datasets: Spider, WikiTableQuestions, GSM8k, and MBPP, in the fields of semantic parsing, table question answering, arithmetic reasoning, and basic Python programming. LEVER enhanced execution accuracy over strong baselines by 4.6-10.9% when paired with Codex and achieved new state-of-the-art outcomes on all datasets. The relevance of execution results for verification became clear through an ablation study, and the technique kept its strong performance even in circumstances with limited resources and without supervision. The findings showed that training compact verifiers on benchmark datasets increased the performance of various LLMs in the field of language-to-code generation.
Denny et al. [6] proposed "Prompt Problems", a novel educational idea aimed at teaching students how to create effective natural language prompts for large language models (LLMs) with the objective of generating executable code. The authors created Promptly, a web-based application that allows students to iteratively tweak prompts based on test case output until the LLM produces accurate code. They used Promptly in classroom research with 54 beginning Python students and discovered that the tool exposes students to new programming structures and promotes computational thinking, despite the fact that some students were hesitant to utilize LLMs. The research looked at prompt length and iteration counts, as well as student opinions based on open-ended feedback. Overall, the work presents preliminary evidence that Prompt Problems warrant more investigation as an approach to developing the growing skill of prompt engineering.
Becker et al. [2] investigated the revolutionary impact of AI-driven code generation tools like OpenAI Codex, DeepMind AlphaCode, and Amazon CodeWhisperer. These tools possess the remarkable ability to translate natural language prompts into functional code, heralding a potential revolution in the realm of programming education. While acknowledging their potential, the authors argue for urgent discussions within the computer science education community in order to overcome difficulties and properly utilize these technologies. The study provided an overview of important code generation models—Codex, AlphaCode, and CodeWhisperer—that were trained on massive public code repositories. These models excel at creating code in several programming languages and go beyond coding by providing features such as code explanations and language translation. From examples, answers, and different problem-solving methodologies to scalable learning materials and an emphasis on higher-level topics, code-generating tools offer potential in education. The authors underline the need for educators to proactively integrate these technologies, anticipating ethical concerns and a trend toward code analysis.
Zamfirescu-Pereira et al. [26] conducted a study whose findings shed some light on the difficulties that non-AI specialists have when attempting to provide effective prompts for large language models like GPT-3. These individuals frequently use an impromptu and ad hoc approach rather than a systematic one, which is hampered by a tendency
to overgeneralize from limited experiences and is based on human-human communication conventions. The authors developed BotDesigner, a no-code chatbot design tool for iterative prompt development and assessment. This tool helps with a variety of tasks, including conversation design, error detection, and rapid testing of prompt alterations. Participants in a user study adjusted prompts and assessed modifications well, but with limited systematic testing and difficulties in understanding prompt efficacy. These difficulties originate from a tendency to overgeneralize and to expect human-like behaviors. Through pattern and cause analysis, the study proposed potential for further training and tool development to encourage systematic testing, moderate expectations, and give assistance, while noting persistent uncertainty regarding generalizability and social bias consequences. This experiment highlights the difficulties that non-experts have in prompt engineering and points to opportunities for more accessible language model tools.
Zhou et al. [27] introduce a novel approach called the Automatic Prompt Engineer (APE), designed to facilitate the automatic generation and selection of effective natural language prompts. The primary goal is to guide large language models (LLMs) towards desired behaviors. APE tackles this challenge by framing prompt generation as a natural language program synthesis problem. It treats LLMs as black-box computers capable of proposing and evaluating prompt candidates. The APE method leverages LLMs in three distinct roles: 1) as inference models for suggesting prompt candidates, 2) as scoring models to assess these candidates, and 3) as execution models to test the selected prompts. Prompt candidates are generated either directly through inference or recursively by creating variations of highly rated prompts. The final selection of the most suitable prompt is determined by maximizing metrics such as execution accuracy on a separate validation set. Importantly, APE achieves these outcomes without the need for gradient access or fine-tuning, relying instead on a direct search within the discrete prompt space. In the experimental phase, APE was put to the test across a range of tasks. It successfully addressed 24 instruction induction tasks, exhibiting performance on par with or surpassing human capabilities on all of them. Additionally, APE demonstrated its effectiveness on a subset of 21 BIG-Bench tasks, outperforming human prompts in 17 out of 21 cases.

3. Large Language Models Studied
This paper compares the performance of four LLMs: GPT, Falcon, LLama-2, and PaLM.

3.1. GPT-3.5-Turbo
GPT-3.5-Turbo is an OpenAI-developed variant of the Generative Pre-trained Transformer 3. GPT-3.5 models include a wide variety of capabilities, including natural language and code comprehension and generation. GPT-3.5-Turbo is the standout model in this series, known for its exceptional capabilities and low cost of ownership. The GPT-3.5 model, designed for chat interactions, boasts exceptional capabilities while being remarkably cost-effective, priced at only one-tenth of the cost of the text-davinci-003 model.

3.2. Falcon
The Technology Innovation Institute, located in Abu Dhabi, created the Falcon LLM [16], a significant advancement in AI language processing. Within the Falcon series, the distinct versions, namely Falcon-40B and Falcon-7B, each possess specific merits and contribute to making Falcon LLM an inventive and adaptable solution suitable for diverse uses.
Falcon's creation involved tailored tools and a distinctive data flow approach. This system extracts valuable Web information for customized training, differing from the methods of NVIDIA, Microsoft, and HuggingFace. A focus on large-scale data quality was critical, recognizing LLMs' sensitivity to data quality. Thus, an adept pipeline was built for rapidly processing quality content from Web sources. Falcon's architecture was meticulously optimized for efficiency. Coupled with high-caliber data, this enables Falcon to notably surpass GPT-3 while utilizing fewer resources.
Falcon is a decoder-only model with 40 billion parameters trained on 1 trillion tokens. The training took two months and made use of 384 GPUs on AWS. After rigorous filtration and de-duplication of data from CommonCrawl, the model's pretraining dataset was generated using web crawls with roughly five trillion tokens. Falcon's capabilities were also expanded by incorporating certain sources such as academic papers and social media debates. The model's performance was then evaluated using open-source benchmarks such as EAI Harness, HELM, and BigBench.

3.3. LLama-2
Meta AI created Llama 2 [23], a new family of pretrained and fine-tuned large language models (LLMs). Llama 2 comes in sizes ranging from 7 billion to 70 billion parameters. The pre-trained models are designed for a wide range of natural language tasks, whilst the fine-tuned versions, known as Llama 2-Chat, are designed for dialogue. Llama 2 was pretrained on 2 trillion publicly accessible tokens utilizing an improved transformer architecture with advantages such as extended context and grouped-query attention. On knowledge, reasoning, and code benchmarks, Llama 2 surpassed other open-source pretrained models such as Llama 1, Falcon, and MPT. Llama 2-Chat aligns the models to be helpful and safe in dialogue by using supervised fine-tuning and reinforcement learning with human feedback (RLHF). Over 1 million fresh human preference annotations were collected in order to train and fine-tune reward models. To increase multi-turn dialogue consistency, techniques such as Ghost Attention were created. Ghost Attention (GAtt) is a straightforward technique influenced by Context Distillation [1]. GAtt manipulates the fine-tuning stage to guide attention concentration through a step-by-step approach.
3.4. PaLM
Google's PaLM [4] was trained using the Pathways system and its accelerators on 6144 TPU v4 processors. This allows the training of such a big model without the need for pipeline parallelism.
PaLM is evaluated over a wide range of tasks and datasets, proving its strong performance across several domains. PaLM 540B achieved an outstanding score of 92.6 on the SuperGLUE benchmark after fine-tuning, essentially putting it alongside top models such as T5-11B. In the field of question answering, PaLM 540B outperformed previous models by earning F1 scores of 81.4 on the Natural Questions and TriviaQA datasets in a few-shot setting. The model's abilities extend to mathematical reasoning, where it achieved an astounding 58% accuracy on the difficult GSM8K math word problem dataset using chain-of-thought cues. PaLM-Coder 540B goes even further, reaching 88.4% success with a pass@100 criterion on HumanEval and 80.8% success with a pass@80 criterion on MBPP.
PaLM 540B excels at translation, earning a notable BLEU score of 38.4 in zero-shot translation on the WMT English-French dataset, outperforming other significant language models. The model's responsible behavior is evident in toxicity evaluations, with an average toxicity probability of 0.46 on the RealToxicity dataset. Finally, using the WinoGrande coreference dataset, PaLM 540B achieved an accuracy of 85.1%, demonstrating its capacity to mitigate gender bias. These extensive findings highlight the PaLM model's adaptability and efficacy across a wide range of language-related tasks.

4. Research Questions
Recent advances in large language models like GPT-3 (Brown et al. [3]) have demonstrated impressive text generation capabilities when the models are provided with well-designed prompt instructions. However, best practices for prompt engineering are still developing. As Reynolds and McDonell [19] discuss in the context of prompt programming, it is important to be aware that prompts fit within the concept of natural language.
This study will perform a systematic sensitivity analysis to identify the most sensitive prompt components for text generation using large language models. Following the workflow outlined by Saltelli et al. [21], each input factor will be varied individually while holding the others constant to isolate its impact. The text outputs will be analyzed to measure sensitivity.
The findings will provide prompt engineers with guidance. In line with [17], this research aims to demystify prompts through empirical testing of LM prompts and few-shot settings together with hyperparameters.
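To make the one-factor-at-a-time design described above concrete, the following is a minimal sketch (with hypothetical helper names; the actual prompt wordings are those listed in Table 3) of enumerating level assignments in which a single criterion is lowered while the other three are held at High. This enumeration reproduces the level patterns of Prompts 1 and 4-11 in Table 3; Prompts 2 and 3 are the all-Medium and all-Low baselines.

from typing import Dict, Iterator

CRITERIA = ("CS", "OI", "CI", "FS")  # Clarity/Specificity, Objective/Intent,
                                     # Contextual Information, Format/Style

def one_factor_at_a_time(baseline: str = "High",
                         levels: tuple = ("Medium", "Low")) -> Iterator[Dict[str, str]]:
    """Yield level assignments in which exactly one criterion deviates from the baseline."""
    yield {c: baseline for c in CRITERIA}          # all-High reference assignment (Prompt 1)
    for criterion in CRITERIA:
        for level in levels:
            assignment = {c: baseline for c in CRITERIA}
            assignment[criterion] = level          # vary one factor, hold the others constant
            yield assignment

for variant in one_factor_at_a_time():
    print(variant)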
5. Experimental Setup
The experiment was conducted in two parts. First, the LLMs were run in Google Colab Pro using a GPU with high RAM. Second, the outputs from the LLMs (i.e., the generated code) were run on a MacBook Pro with an M1 Max chip and 64 GB of memory.

5.1. Dataset
The daily financial time series data from January 01, 2022, through April 23, 2022, were collected from Yahoo Finance (https://finance.yahoo.com/). The dataset, described in Table 1, includes a diverse selection of stocks and indices. Each dataset is a stock or index with a number of data points, a date range, a sector, and a country of origin. Stocks were chosen based on market capitalization and sector representation. As Table 1 shows, the selected datasets represent a variety of sectors (indices, technology, e-commerce, and automakers) and countries (USA, Japan, Hong Kong, China), and thus provide a basis to test the LLM-generated models on datasets with variety and geographical diversity.
In addition, major indices such as the S&P 500 (GSPC), Dow Jones Industrial Average (DJI), Nasdaq Composite (IXIC), Nikkei 225 (N225), and Hang Seng Index (HSI) were included. Giant high-tech companies such as Apple (AAPL) and Microsoft (MSFT) were also included in the dataset. Furthermore, e-commerce giants such as Amazon (AMZN) and Alibaba (BABA) were incorporated to make the dataset more diverse. Lastly, automakers like Tesla (TSLA) were added with the goal of investigating the performance of each model generated by LLMs for each industry sector.
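As an illustration only, the following sketch shows how such daily series could be retrieved. The paper states that the data came from Yahoo Finance but does not name a retrieval tool, so the yfinance package and the caret-prefixed index symbols below are assumptions.

# Illustration only: yfinance and the caret-prefixed index symbols are assumptions.
import yfinance as yf

TICKERS = ["^GSPC", "^DJI", "^IXIC", "^N225", "^HSI",   # indices
           "AAPL", "MSFT", "AMZN", "BABA", "TSLA"]      # stocks

frames = {}
for ticker in TICKERS:
    # Daily bars for January 01, 2022 through April 23, 2022 ('end' is exclusive).
    frames[ticker] = yf.download(ticker, start="2022-01-01", end="2022-04-24", interval="1d")

print(frames["AAPL"][["Close"]].head())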
5.2. LLMs Configuration
The granularity and diversity of responses generated by LLMs can be controlled via several configuration parameters:

1. Temperature controls the randomization of the responses generated by a large language model, where a temperature score close to 1 indicates increased randomness (see the sketch after this list);
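A minimal sketch of setting these generation parameters follows, assuming the OpenAI Python client for GPT-3.5-Turbo; the paper does not specify a client library, and other providers (PaLM, Falcon, Llama 2) expose analogous parameters under different names. The values shown are the fixed configuration later used in Section 7.1, and prompt_text stands for one of the prompts of Table 3.

import re
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt_text = "I need a Python script for LSTM. ..."  # e.g., Prompt 3 of Table 3 with the data inserted

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt_text}],
    temperature=0.1,   # close to 0: less random; close to 1: more random
    top_p=0.6,         # nucleus-sampling cutoff
    max_tokens=1024,   # upper bound on the length of the generated reply
)

reply = response.choices[0].message.content
# The generated script is usually wrapped in a fenced code block; extract it if so.
match = re.search(r"```(?:python)?\s*(.*?)```", reply, re.DOTALL)
generated_code = match.group(1) if match else reply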
Table 3
Prompts and Categorical Sensitivity Levels (in the original, colored Green: High; Orange: Medium; Red: Low). Abbreviations: CS = Clarity and Specificity, OI = Objective and Intent, CI = Contextual Information, FS = Format and Style.

Prompt 1 [CS: High, OI: High, CI: High, FS: High]: Can you assist me in creating a comprehensive Python script to build an LSTM architecture using the time series dataset enclosed within double backticks ``{data}``? My objective is to execute steps such as preprocessing, splitting, building, compiling, training, and evaluating models using RMSE.
Prompt 2 [CS: Medium, OI: Medium, CI: Medium, FS: Medium]: Could you assist me in generating a Python script to build an LSTM model using the provided time series dataset enclosed within double backticks ``{data}``? My goal is to perform preprocessing, splitting the given data, creating the model, compiling it, training the model, and assessing its performance using RMSE.
Prompt 3 [CS: Low, OI: Low, CI: Low, FS: Low]: I need a Python script for LSTM. The dataset is in ``{data}``. I want to process, split, build, compile, train model, and evaluate model.
Prompt 4 [CS: Medium, OI: High, CI: High, FS: High]: Could you give me a code for setting up a LSTM? I have a time series dataset enclosed within double backticks ``{data}``. My goal is to process the data, split it, build the model, compile it, train, and evaluate using RMSE.
Prompt 5 [CS: Low, OI: High, CI: High, FS: High]: Could you help me out with crafting some kind of Python code to establish an LSTM architecture using the enclosed within double backticks ``{data}``? To execute thorough preprocessing, split, build, compilation, training, and evaluation.
Prompt 6 [CS: High, OI: Medium, CI: High, FS: High]: Can you help me with creating a Python script for an LSTM architecture using the time series dataset enclosed within double backticks ``{data}``? If possible I would like to perform preprocessing, data splitting, model construction, compilation, training, and evaluation using RMSE using the code.
Prompt 7 [CS: High, OI: Low, CI: High, FS: High]: Could you maybe assist me with making a Python script to create an LSTM architecture using the time series dataset enclosed within double backticks ``{data}``? To perform preprocessing, splitting, building, compiling, training, and testing using RMSE.
Prompt 8 [CS: High, OI: High, CI: Medium, FS: High]: Can you help me to establish an LSTM architecture in Python using the enclosed within double backticks ``{data}`` to forecast stock prices? My aim is to perform thorough preprocessing, divide the data, construct the model, and evaluate its performance using RMSE.
Prompt 9 [CS: High, OI: High, CI: Low, FS: High]: Could you help me in making a comprehensive Python script to build an LSTM architecture using the dataset ``{data}``? My aim is to execute carefully preprocessing, construct the architecture, and assess performance using RMSE.
Prompt 10 [CS: High, OI: High, CI: High, FS: Medium]: Would you be able to help me in generating a Python to set an LSTM architecture using the time series dataset enclosed within double backticks ``{data}``? My steps include preprocessing, dividing, constructing, compiling, training, and evaluating using RMSE.
Prompt 11 [CS: High, OI: High, CI: High, FS: Low]: Could you please help me with generating a script to build an LSTM architecture using the time series dataset enclosed within double backticks ``{data}``? Perform preprocessing, division of data, construction of the model, compilation, training, and evaluation the model.
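For illustration, the sketch below shows how one of the Table 3 templates might be instantiated with a ticker's data before being sent to an LLM. The exact serialization of the data placed between the double backticks is not specified in the paper; CSV text of the closing prices is an assumption, and the helper name is ours.

import pandas as pd

PROMPT_1 = (
    "Can you assist me in creating a comprehensive Python script to build an LSTM "
    "architecture using the time series dataset enclosed within double backticks "
    "``{data}``? My objective is to execute steps such as preprocessing, splitting, "
    "building, compiling, training, and evaluating models using RMSE."
)

def instantiate_prompt(template: str, series: pd.Series) -> str:
    # Serialize the closing-price series to CSV text and drop it into the template.
    return template.format(data=series.to_csv())

# Example with placeholder values standing in for one ticker's closing prices:
closes = pd.Series([171.2, 172.9, 170.1], name="Close")
print(instantiate_prompt(PROMPT_1, closes))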
6.1. Sensitivity Analysis for Designing Prompts
We crafted eleven prompts ranging from simple to complex. The design of these eleven prompts was based on a pair-wise sensitivity analysis in which one factor is changed while the remaining factors are kept constant. Pair-wise analysis, also known as pairwise comparison, is a method for comparing and evaluating many items or criteria by comparing each item to every other item in a methodical and systematic manner. Here, the phrase "pair-wise analysis" refers to the process of analyzing and comparing the distinct criterion levels (i.e., Low, Medium, and High) against each other for each individual criterion and determining their impact on the results.
The pair-wise analysis helps in evaluating the quality, significance, or applicability of multiple characteristics by directly comparing them to one another, allowing for a more systematic and thorough review process.
To help trace the sensitivity levels, a coloring scheme is employed where the green, orange, and red colors in Table 3 represent sensitivity levels of high, medium, and low, respectively.

6.2. Manual Creation and Optimization of a Model
The experiments were executed on an Apple M1 Max with 64 GB of memory and a GPU. The dataset is split into 80% for training and 20% for testing. In the preprocessing, the data is scaled using MinMaxScaler, which linearly transforms the features into the range from 0 to 1. After the scaling, the data is prepared into sequences of length 5 to predict the next day's (1) value, and these sequences are fed into the model for training. The manually created model consists of an LSTM architecture with TensorFlow as the backend. The model contains one LSTM layer with 50 units and the 'relu' activation function. The model is trained for 100 epochs with a batch size of 1, using 'adam' as the optimizer and 'mse' as the loss function. The hyperparameters were chosen based on various observations during the experiments. The preprocessing steps and model building described here are only relevant for the manual creation, as the LLMs are given the raw data for code generation.
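A minimal sketch of the manual model described above follows, assuming Keras with the TensorFlow backend (the paper names TensorFlow as the backend); the helper names, the placeholder data, and the RMSE computation on the scaled test windows are ours.

import numpy as np
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

def make_windows(values, window=5):
    X, y = [], []
    for i in range(len(values) - window):
        X.append(values[i:i + window])
        y.append(values[i + window])          # predict the next day's value
    return np.array(X), np.array(y)

closes = np.random.rand(120, 1)               # placeholder for one ticker's closing prices
scaler = MinMaxScaler(feature_range=(0, 1))   # scale linearly into [0, 1]
scaled = scaler.fit_transform(closes)

X, y = make_windows(scaled.flatten(), window=5)
X = X.reshape((X.shape[0], X.shape[1], 1))    # (samples, timesteps, features)

split = int(0.8 * len(X))                     # 80% training, 20% testing
X_train, X_test, y_train, y_test = X[:split], X[split:], y[:split], y[split:]

model = Sequential([
    LSTM(50, activation="relu", input_shape=(5, 1)),  # one LSTM layer, 50 units, 'relu'
    Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X_train, y_train, epochs=100, batch_size=1, verbose=0)

preds = model.predict(X_test, verbose=0).flatten()
rmse = float(np.sqrt(np.mean((y_test - preds) ** 2)))
print(f"RMSE on the held-out 20%: {rmse:.4f}")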
7. Results
Table 4 reports the performance of the deep learning-based models generated by the LLMs for time series data analysis. Each model is evaluated using the RMSE value for each stock dataset. The PaLM model achieved the lowest RMSE value of 0.0023 for the BABA ticker, while Falcon achieved the lowest RMSE value of 0.0041 for the GSPC ticker. LLama 2 did not achieve the lowest RMSE on any ticker, whereas GPT 3.5 has the lowest RMSE for eight tickers. However, the manually developed and optimized model achieved the lowest RMSE compared to the LLM-generated models across all tickers.
In Table 4, the best RMSE value obtained by each language model is highlighted in gray. Moreover, the best RMSE values among all models for each stock dataset and across all 11 different prompts are highlighted in dark color.
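For reference, the RMSE reported throughout Tables 4 and 5 follows the standard definition, where y_i and \hat{y}_i denote the actual and predicted values over the n test points:

RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}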
Table 4
RMSE values for models generated using LLMs and controlled prompts, with the LLM configurations detailed in Table 2 and the prompts in Table 3. Each row group covers two tickers side by side; within a block, the columns give the prompt number followed by the RMSE of the PaLM-, falcon-, LLama 2-, and GPT 3.5-generated models, and the two ticker symbols (e.g., GSPC and AAPL) label the left and right halves of the block.
Ticker | Prompt | PaLM | falcon | LLama 2 | GPT 3.5 || Ticker | Prompt | PaLM | falcon | LLama 2 | GPT 3.5
1 0.0323 0.0331 0.0389 NA 1 0.0360 NA 0.0390 0.0384
2 0.0368 0.4893 0.0394 0.0479 2 0.0365 0.5292 0.0403 0.0417
3 0.0388 0.0359 0.0413 NA 3 0.1039 NA 0.0426 0.0483
4 0.0318 0.1992 0.0367 NA 4 0.0369 NA 0.0392 0.0404
5 0.0216 NA 0.0410 0.0633 5 0.1096 0.6225 0.0432 0.1135
GSPC AAPL
6 0.0314 0.0356 0.0376 NA 6 0.0373 NA 0.0382 0.0396
7 0.0348 0.0041 0.0390 NA 7 0.0366 0.6267 0.0412 0.0962
8 0.0331 0.4649 0.0330 0.1043 8 0.0351 NA 0.0364 0.1745
9 0.0320 NA 0.0456 0.0411 9 0.0366 NA 0.0408 0.1221
10 0.0335 0.0355 0.0381 0.0376 10 0.0379 0.7085 0.0397 0.0062
11 0.0353 0.0905 0.0429 NA 11 0.0381 NA 0.0409 0.0641
Avg. 0.03285454545 0.154233333 0.03940909091 0.05882 Avg. 0.0495 0.6217 0.04013636364 0.07136363636
Manual Manually Developed & Optimized Model: 0.0058 Manual Manually Developed & Optimized Model: 0.0094
1 0.0386 NA 0.0399 0.0464 1 0.0407 0.0618 0.0415 0.0918
2 0.0392 NA 0.0556 0.2476 2 0.0414 NA 0.0686 0.1749
3 0.0444 0.4935 0.0462 0.3364 3 0.1154 NA 0.0423 0.2328
4 0.0425 0.0409 0.0439 0.0567 4 0.0400 0.0045 0.0505 0.0041
5 0.0385 NA 0.0472 0.0555 5 0.2358 NA 0.0417 0.2681
DJI MSFT
6 0.0438 NA 0.0445 NA 6 0.0389 0.3535 0.0476 0.1263
7 0.0395 NA 0.0425 NA 7 0.0391 NA 0.0485 0.1219
8 0.0365 0.4642 0.0413 0.2446 8 0.0369 NA 0.0375 0.1811
9 0.0394 NA 0.0487 0.0097 9 0.0361 0.0405 0.0390 0.1217
10 0.0417 NA 0.0473 0.2960 10 0.0403 NA 0.0425 0.1368
11 0.4061 NA 0.0544 NA 11 0.0343 0.3160 0.0426 0.1170
Avg. 0.07365454545 0.333 0.0465 0.1616125 Avg. 0.06353636364 0.15526 0.04566363636 0.14331818182
Manual Manually Developed & Optimized Model: 0.0053 Manual Manually Developed & Optimized Model: 0.0065
1 0.0259 NA 0.0313 0.1000 1 0.0421 0.6003 0.0463 0.0462
2 0.0285 NA 0.0324 NA 2 0.0430 0.5393 0.0480 0.1571
3 0.0284 NA 0.0371 NA 3 0.0797 NA 0.0451 0.0481
4 0.0303 0.0319 0.0327 NA 4 0.0432 NA 0.0455 0.0445
5 0.0300 0.1574 0.0311 0.0325 5 0.1146 0.5977 0.0430 0.0814
IXIC AMZN
6 0.0289 0.3752 0.0316 NA 6 0.0423 NA 0.0447 0.0727
7 0.0299 NA 0.0348 NA 7 0.1632 NA 0.0478 0.0813
8 0.0343 0.4024 0.0285 0.0310 8 0.0400 NA 0.0375 0.1673
9 0.0299 0.2674 0.0323 NA 9 0.0432 NA 0.0472 0.0277
10 0.0294 NA 0.0321 0.0041 10 0.0424 NA 0.0435 0.0704
11 0.0286 NA 0.0316 NA 11 0.0417 NA 0.0439 0.0797
Avg. 0.02946363636 0.24686 0.05882727273 0.0419 Avg. 0.06321818182 0.579 0.04477272727 0.07967272727
Manual Manually Developed & Optimized Model: 0.0070 Manual Manually Developed & Optimized Model: 0.0062
1 NA 0.0801 0.0317 NA 1 0.0238 0.3414 0.0346 0.0864
2 0.0291 NA 0.0319 NA 2 0.0241 NA 0.0392 0.0871
3 NA NA 0.0358 0.0286 3 0.1494 NA 0.0389 0.2412
4 0.0307 0.3902 0.0325 NA 4 0.0257 NA 0.0366 0.0370
5 0.0209 0.4513 0.0424 0.1568 5 0.2449 0.0486 0.0288 0.0659
N225 BABA
6 0.0353 NA 0.0412 0.0355 6 0.0243 0.0673 0.0300 0.1559
7 0.0182 NA 0.0465 NA 7 0.0242 0.3419 0.0359 0.0623
8 NA NA 0.0305 0.0281 8 0.0223 NA 0.0246 0.2253
9 0.0351 NA 0.0432 0.0039 9 0.0246 NA 0.0256 NA
10 0.0323 0.0199 0.0348 0.0049 10 0.0233 0.0237 0.0437 0.0286
11 0.0415 NA 0.0322 NA 11 0.0228 NA 0.0631 0.0674
Avg. 0.0303875 0.2354 0.03660909091 0.042967 Avg. 0.0554 0.16458 0.03645454545 0.10571
Manual Manually Developed & Optimized Model: 0.0034 Manual Manually Developed & Optimized Model: 0.0268
1 NA 0.3867 0.0183 0.0067 1 0.0345 NA 0.0424 0.0466
2 NA 0.4212 0.0193 0.0073 2 0.0347 0.1398 0.0403 0.1009
3 NA NA 0.0307 NA 3 0.5289 NA 0.0564 0.0027
4 NA NA 0.0196 0.0056 4 0.0376 0.0359 0.0414 0.0044
5 NA NA 0.0271 0.0212 5 0.0354 NA 0.0519 0.0023
HSI TSLA
6 NA NA 0.0193 0.0053 6 0.0361 0.0048 0.0373 0.0540
7 NA 0.5887 0.0171 0.0274 7 0.0388 NA 0.0390 0.0031
8 0.0455 NA 0.0259 0.0278 8 0.0360 0.0556 0.0343 0.0820
9 0.0280 NA 0.0488 0.0413 9 0.0334 0.0356 0.0376 0.1634
10 NA NA 0.0179 0.0178 10 0.0363 0.0655 0.0387 0.0785
11 0.0196 NA 0.0283 NA 11 0.0384 NA 0.0483 0.0897
Avg. 0.115 0.466 0.02475454545 0.017822222 Avg. 0.08091818182 0.0562 0.04250909091 0.05705454545
Manual Manually Developed & Optimized Model: 0.0354 Manual Manually Developed & Optimized Model: 0.0100
The cells with NA indicate that the models generated by the underlying LLM were meaningless: the LLM produced some other type of model, such as a regression model, instead of a deep learning-based model for forecasting time series data. In other words, the NA values represent outputs of the LLMs with no code related to the LSTM model, or code not related to predicting time series. This might be due to the hallucination problem known in language models, where the underlying LLM confuses leveraging its training data with properly responding to queries and prompts. A noticeable case is Falcon, where the number of NA values (i.e., irrelevant responses to prompts) outnumbers the expected responses. This may indicate that the Falcon large language model suffers from the hallucination problem more than the other LLMs.
I) The Performance of Generated Models Across LLMs. As Table 4 indicates, on average (i.e., the last rows of each stock data block) there is no clear winner among the language models for the eleven prompts studied. The deep learning-based models generated by each LLM are rather comparably competitive. However, as Table 4 shows, the models generated by GPT 3.5 on prompts 9 and 10 outperform the other generated models (i.e., the dark cells in the table) except for the GSPC and BABA stock data.
II) The Performance of Generated Models Across Prompts. We observe that, for the case of GPT 3.5, the best models with minimal RMSE values are produced by prompts 8, 9, and 10, where three criteria, namely 1) Clarity and Specificity, 2) Objective and Intent, and 3) Format and Style, are set high.
For the case of LLama 2, we observe that the language model generates the best model using prompt 8, where the same three criteria, 1) Clarity and Specificity, 2) Objective and Intent, and 3) Format and Style, are set high (in most cases). We also observe a similar pattern for the models generated by PaLM through prompts 7, 8, and 9. For the models generated by Falcon, there is no clear pattern of any prompt standing out in the comparison; the results are mixed.
While the results and performance of models and prompts are dispersed, we observe a clear pattern where prompts 8 and 9 seem to produce the best results in generating more accurate models for forecasting time series, where the three criteria 1) Clarity and Specificity, 2) Objective and Intent, and 3) Format and Style are set high.
III) The Performance of Generated Models Across the Time Series Datasets. As Table 4 and the dark cells indicate, the best results across the different datasets are produced by GPT 3.5, mostly by prompts 8, 9, and 10. This may indicate that, at least for GPT 3.5, a more clear and specific prompt (CS), a more objectively crafted prompt with clear intention (OI), and a clear expression of the desired output and format (FS) will yield better and more accurate models for forecasting time series data.
IV) The Performance of Generated Models and the Manually Developed and Optimized Model. The most important observation concerns the accuracy of the models created and optimized manually in comparison with the models generated through prompts.
It is important to note that the manually created and optimized model was created based on all the data, and thus there is only one single manually created and optimized model to compare the results with. More specifically, we did not manually craft and optimize separate deep learning-based models for each dataset; we created a single optimized model for all datasets together. Figure 1 depicts the RMSE values of the manually crafted and optimized single model obtained for each dataset. In Figure 1, the manually implemented LSTM model achieved its highest RMSE value of 0.0354 on HSI and its lowest RMSE value of 0.0034 on N225.
While the manually crafted and optimized model outperforms on three sets of stock data, the models generated by LLMs outperform the manually crafted and optimized model on the other seven sets of stock data. More specifically, we observe that the manually created and optimized model outperforms the models generated by LLMs for DJI, N225, and AMZN, whereas the models created through prompts outperform the manually created and optimized model for GSPC, IXIC, HSI, AAPL, MSFT, BABA, and TSLA.
It is important to note that the results are compared based on the best results obtained by the prompts, and the variances of the RMSE values among different prompts and LLMs still play an important role in making the final judgment. However, given that prompts 8, 9, and 10 outperform the other prompts in most cases, one can conclude that, to generate a comparably good model, it is better to set Clarity and Specificity (CS), Objective and Intent (OI), and Format and Style (FS) high and to use GPT 3.5 as the language model to generate deep learning-based models that can be comparable with a manually crafted and optimized model for forecasting time series data.

7.1. Fixed/Consistent Configurations of LLMs
The results reported in Table 4 are based on the configurations and settings of the LLM parameters listed in Table 2, where each model was fine-tuned empirically to obtain the best results. To investigate whether different configurations and parameter settings for the LLMs have any effect on the results, we replicated the study with fixed and consistent parameter settings for all LLMs.
The results in Table 5 are obtained using the same set of parameters but with consistent and fixed values as follows: 1) temperature = 0.1, 2) max token_size = 1,024, and 3) top_p = 0.6 in all models, including GPT 3.5 Turbo, Falcon, Llama-2, and PaLM. This setting primarily means reducing the randomness in generating responses to queries or prompts.
As Table 5 indicates, the reduction in randomness through minimizing the temperature value has some impact on the performance of each prompt. The table demonstrates that the GPT 3.5 model achieved the lowest RMSE for nine tickers, whereas PaLM achieved the lowest RMSE value of 0.0357 for the MSFT ticker.
I) The Performance of Generated Models Across LLMs. A detailed view of both Tables 4 and 5 indicates that lower values of the temperature make the accuracy of the models slightly better. In particular, we observe that GPT still outperforms the other LLMs.
II) The Performance of Generated Models Across Prompts. We observe that the best models generated by GPT are the ones generated by simpler prompts such as Prompts 2, 3, and 4, where the criteria (i.e., Clarity and Specificity, Objective and Intent, Contextual Information, and Format and Style) are all kept consistent at the level of either low, medium, or high.
III) The Performance of Generated Models Across the Time Series Datasets. A similar pattern is observed: mixed results, but consistent with the results observed in Table 4.
IV) The Performance of Generated Models and the Manually Developed and Optimized Model. As shown in both Tables 4 and 5, we observe slightly better models for the case where the temperature parameter is kept low.
Tables 4 and 5 clearly demonstrate that the Falcon model generates more valid and correct models when the temperature parameter is configured at 0.7 (high) compared to 0.1 (low). The results show that the number of invalid models labeled with "NA" at the higher temperature of 0.7 is lower than at the lower temperature, as the higher temperature leads to more exploration in the model's predictions.
Table 5
RMSE values for models generated using LLMs and controlled prompts with fixed LLM configurations (1) temperature = 0.1, (2) max token_size = 1,024, and (3) top_p = 0.6, and the prompts listed in Table 3. The layout mirrors Table 4: each row group covers two tickers side by side, with the RMSE per prompt (1-11) per LLM.
Ticker | Prompt | PaLM | falcon | LLama 2 | GPT 3.5 || Ticker | Prompt | PaLM | falcon | LLama 2 | GPT 3.5
1 0.0354 NA 0.0404 0.0409 1 0.0366 NA 0.0512 0.0564
2 0.0356 0.0416 0.4802 0.0036 2 0.0380 NA 0.0589 0.1250
3 0.0508 NA 0.4663 0.0626 3 0.1288 NA 0.0565 0.1144
4 0.0353 0.0365 0.0381 0.0756 4 0.0371 0.0388 0.0396 0.0029
5 0.2506 NA 0.0372 0.0964 5 0.0411 NA 0.0412 0.4194
GSPC AAPL
6 0.0345 0.3862 0.0377 0.0403 6 0.0366 NA 0.0458 0.4048
7 0.0352 NA 0.0376 0.0641 7 0.0358 NA 0.0414 0.4194
8 0.0346 NA 0.0428 0.0403 8 0.0359 NA 0.0428 0.1224
9 0.0357 NA 0.0365 0.0617 9 0.0360 0.0368 0.0448 0.4194
10 0.0343 NA 0.0490 0.1311 10 0.0360 NA 0.0457 0.0050
11 0.0354 NA 0.0427 0.0627 11 0.0361 NA 0.0522 0.4194
Avg. 0.05612727273 0.155 0.11895454545 0.06175454545 Avg. 0.04527272727 0.04 0.04728181818 0.22804545455
Manual Manually Developed & Optimized Model: 0.0058 Manual Manually Developed & Optimized Model: 0.0094
1 0.0383 NA 0.0571 0.0705 1 0.0399 NA 0.0468 0.2605
2 0.0422 NA 0.1304 0.1021 2 0.0367 NA 0.0428 0.1850
3 0.0996 0.0550 0.0660 0.0602 3 0.1264 NA 0.0444 0.0450
4 0.0416 NA 0.0423 0.0034 4 0.0357 0.0591 0.0461 0.2456
5 0.0373 NA 0.0494 0.1270 5 0.0363 NA 0.0436 0.1241
DJI MSFT
6 0.0418 NA 0.0462 0.0547 6 0.0407 NA 0.0482 0.2557
7 0.0402 0.0984 0.0521 0.0462 7 0.0392 NA 0.0419 0.1270
8 0.0382 NA 0.0542 0.0531 8 0.0401 NA 0.0495 0.2060
9 0.0402 NA 0.0470 0.0631 9 0.0418 NA 0.0475 0.1937
10 0.0404 NA 0.0687 0.0490 10 0.0392 NA 0.0469 0.0478
11 0.0707 NA 0.0514 0.0493 11 0.0390 NA 0.0460 0.1581
Avg. 0.04822727273 0.08 0.06043636364 0.06169090909 Avg. 0.04681818182 0.1 0.04579090909 0.16804545455
Manual Manually Developed & Optimized Model: 0.0053 Manual Manually Developed & Optimized Model: 0.0065
1 0.0259 NA 0.0372 0.1052 1 0.0414 0.0399 0.0438 0.2763
2 0.0299 NA 0.0324 0.0739 2 0.0422 NA 0.0494 0.4927
3 0.0300 NA 0.0318 0.0059 3 0.1270 NA 0.0472 0.0046
4 0.0307 0.1158 0.0327 0.0829 4 0.0417 0.0657 0.0477 0.0043
5 0.1676 0.0530 0.0339 0.1044 5 0.1017 NA 0.0455 0.0866
IXIC AMZN
6 0.0298 NA 0.0328 0.0372 6 0.0420 NA 0.0487 0.0545
7 0.0397 NA 0.0305 0.0697 7 0.0426 NA 0.0462 0.4144
8 0.0281 NA 0.0360 0.1886 8 0.0426 NA 0.0485 0.1512
9 0.0394 NA 0.0344 0.0345 9 0.0425 NA 0.0470 0.4655
10 0.0295 NA 0.0344 0.0323 10 0.0430 NA 0.0456 0.0054
11 0.0405 NA 0.0470 0.1280 11 0.0418 NA 0.0461 0.4771
Avg. 0.04464545455 0.08 0.03482727273 0.07841818182 Avg. 0.05531818182 0.05 0.04688181818 0.22114545455
Manual Manually Developed & Optimized Model: 0.0070 Manual Manually Developed & Optimized Model: 0.0062
1 0.0275 NA 0.0511 0.0071 1 0.0274 NA 0.0258 0.0292
2 0.0307 NA 0.0398 0.0365 2 0.0229 NA 0.0299 0.0668
3 0.0311 NA 0.0636 0.1199 3 0.1577 NA 0.0291 0.0211
4 0.0257 NA 0.0391 0.0038 4 0.0247 NA 0.0306 0.0295
5 0.0305 NA 0.0657 0.0410 5 0.0780 0.1597 0.0279 0.0738
N225 BABA
6 0.0321 NA 0.0419 0.2064 6 0.0229 NA 0.0278 0.0444
7 0.0270 NA 0.0268 0.0312 7 0.0245 0.0225 0.0287 0.1610
8 0.0313 NA 0.0552 0.1309 8 0.0232 NA 0.0220 0.2473
9 0.0368 NA 0.0599 0.0334 9 0.0252 NA 0.0248 0.2474
10 0.0326 NA 0.0363 0.0311 10 0.0243 NA 0.0296 0.2042
11 0.0211 NA 0.0582 0.1748 11 0.0234 NA 0.0879 0.1755
Avg. 0.02967272727 NA 0.04887272727 0.07419090909 Avg. 0.04129090909 0.09 0.0331 0.1182
Manual Manually Developed & Optimized Model: 0.0034 Manual Manually Developed & Optimized Model: 0.0268
1 0.0264 NA 0.0265 0.1928 1 0.0325 NA 0.0430 0.0024
2 0.0230 0.0225 0.0343 0.0314 2 0.0349 0.1266 0.0484 0.0423
3 0.1250 NA 0.0599 0.0428 3 0.0695 NA 0.0603 0.0968
4 0.0272 NA 0.0186 0.0068 4 0.0348 NA 0.0445 0.0076
5 0.2486 NA 0.0500 0.1063 5 NA NA 0.0402 0.0630
HSI TSLA
6 0.0278 0.0196 0.0240 0.0398 6 0.0330 0.0327 0.0375 0.0024
7 0.0306 NA 0.0485 0.0367 7 0.0369 NA 0.0565 0.0052
8 0.0273 NA 0.0184 0.0268 8 0.0364 0.5569 0.0405 0.0432
9 0.0309 NA 0.0205 0.0622 9 0.0353 NA 0.0525 0.1006
10 0.0217 NA 0.0248 0.2108 10 0.0353 0.5663 0.0414 NA
11 0.0302 0.1508 0.0354 0.0434 11 0.0379 NA 0.0377 0.4147
Avg. 0.05624545455 0.064 0.03280909091 0.07270909091 Avg. 0.03865 0.3206 0.04568181818 0.07782
Manual Manually Developed & Optimized Model: 0.0354 Manual Manually Developed & Optimized Model: 0.0100
By increasing the temperature, the model is encouraged to introduce more randomness into its predictions, reducing the likelihood of exhibiting the hallucination phenomenon.
In contrast, for the GPT 3.5 Turbo model the number of invalid models (i.e., "NA") is lower with the temperature parameter set to low (i.e., 0.1) instead of high (i.e., 0.7). The simpler prompts with a lower temperature yield better results because the model produces more coherent and relevant responses. In the case of complex prompts with a higher temperature, the GPT 3.5 model explores a wider range of possibilities and generates more diverse responses because of the higher randomness. Complex prompts may contain unclear information, making it difficult for the model to provide appropriate outputs with high confidence. In such circumstances, greater temperatures allow the model to experiment with different variations of the prompt, resulting in responses that represent the input's nuances.

7.2. Model Architecture of Generated Models
Given the variation in the performance of the models generated by LLMs, it is important to investigate the cause of such differences. One of the key factors in deep learning-based models, including LSTM, that plays an important role in the performance is the architecture (e.g., the number of layers and nodes) of the generated models. To compare the architecture of the models generated by LLMs with the architecture of our manually created and optimized LSTM model, this section reports the architecture metadata of all models.
Table 6 reports the configuration of the models generated by the LLMs using the prompts listed in Table 3. The configurations are set differently for each LSTM model, with a number of hyperparameters to analyze. The configuration consists of 1) the number of LSTM layers, 2) the number of units, 3) the activation function, 4) the batch size, and 5) the number of epochs.
In Table 6, we see a summary of the LSTM model architecture configurations produced by the different LLMs with their different prompts. It specifies parameters such as the number of LSTM layers, the number of units per layer, and the activation functions used, as well as the batch sizes and the number of epochs. These configurations are crucial: fewer layers and units give less capacity, while more layers and units provide more capacity but, on the flip side, are more likely to overfit. The learning dynamics also depend on the choice of activation function, batch size, and number of epochs, which determine training stability and efficiency. The large variation in these architectural choices emphasizes the role that prompt design plays in determining model performance as measured by the corresponding RMSE values.
The architecture of the manually created and optimized model is configured as one LSTM layer of 50 units (i.e., nodes), with 'relu' used as the activation function, and with the batch size and number of epochs set to 1 and 100, respectively. The manually created and optimized model is based on all the data studied in this work, implying that only one model was manually created and optimized to represent the entire collection of datasets.
As Table 6 indicates, in most cases the models generated by LLMs contain 1 or 2 LSTM layers (i.e., the first component of the architecture notation), which is relatively consistent with the architecture of the manual model, where the number of LSTM layers is set to 1.
The key difference between the models generated by LLMs and the manual model architecture is the number of nodes (i.e., units), the second component of the notation. The LSTM-based models generated by PaLM and Falcon use a large number of units or nodes in their LSTM models (e.g., 128, 100, 64). On the other hand, the number of units or nodes in the manually created model is set to 50. A quick inspection of the number of nodes chosen by LLama 2 and GPT 3.5 indicates that these two LLMs set the number of nodes to 50, which is similar to the number of nodes used in the manual model. This observation may explain the better performance and accuracy obtained by the LSTM models generated by LLama 2 and GPT 3.5 compared to PaLM and Falcon.
The employed activation function in most cases is either 'relu' or NA. As a result, this parameter of the architecture cannot be used for comparison purposes. On the other hand, the batch size parameter is where we observe some "additional" improvement being achieved. Most of the batch sizes set by LLama 2 and GPT 3.5 (the two outperforming LLMs in generating better models) are 32, whereas the batch size in our manually created and optimized model is set to 1. From the literature, we know that a smaller batch size helps in training models more profoundly. The epoch parameter is mostly set to 100, 50, or 10 by all LLMs. In particular, the value of epochs is set to 100 in all model instances generated by PaLM, without any variation.
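To make the [layers, units, activation, batch, epochs] notation of Table 6 concrete, here is a sketch, assuming Keras, of one configuration that appears repeatedly in the LLama 2 column ([2, [50, 32], NA, 32, 100]) next to the manual configuration ([1, 50, 'relu', 1, 100]); we read 'NA' as the activation being left unspecified, in which case Keras defaults to 'tanh'.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# A configuration appearing repeatedly in Table 6 for LLama 2: [2, [50, 32], NA, 32, 100].
generated = Sequential([
    LSTM(50, return_sequences=True, input_shape=(5, 1)),  # activation unspecified -> Keras default 'tanh'
    LSTM(32),
    Dense(1),
])

# The manually created and optimized model: [1, 50, 'relu', 1, 100].
manual = Sequential([
    LSTM(50, activation="relu", input_shape=(5, 1)),
    Dense(1),
])

for model in (generated, manual):
    model.compile(optimizer="adam", loss="mse")

# Training then differs mainly in batch size (32 for the generated model, 1 for the manual one)
# and, for some generated models, in the number of epochs (100, 50, or 10).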
The takeaway lessons are:
Table 6
Model architecture details: P. = Prompt; format = [LSTM layers, units, activation, batch size, epochs]; manually created and optimized model: [1, 50, 'relu', 1, 100]. As in Tables 4 and 5, each row group covers two tickers side by side.
Ticker | P. | PaLM | falcon | LLama 2 | GPT 3.5 || Ticker | P. | PaLM | falcon | LLama 2 | GPT 3.5
1 [1, 128, NA, 32, 100], [1,64,NA, 16, 100] [1, 50, NA, 32, 100] NA 1 [1,100,NA,32,100] NA [1,50, NA,32,100] [1,50,NA,NA,100]
2 [1,50, ’relu’,32,100] [1,128,NA, 128,1] [2, 50, NA, 32, 100] [1, 50, Na, 32, 50] 2 [1,128, NA, NA, 100] [1,128, NA, NA, NA] [1,50,’relu’,32,100] [2,[50,50], NA, 16, 100]
3 [1,128,NA, NA, 100] [3, 128, NA, 256, 100] [2, [50,64], ’relu’, 32, 100] NA 3 [1,128, NA,32,10] NA [2,[50,64],NA,NA,100] [1,50, NA,32,10]
4 [1,100,’relu’,32,100] [1, 64, NA, 128,10] [1, 50, ’relu’, 32, 100] NA 4 [1,100,’relu’,32, 100] NA [1,50, NA,NA,100] [1,50, NA,32, 50]
5 [1,128,’relu’,32,100] NA [2, [50,32], NA, 32, 100] [1, 50, NA, 32, 10] 5 [1,100, ’relu’, 32, 10] [1,32,NA, 10, NA] [2,[50,32],NA,NA,100] [1,128, NA, 32, 10]
GSPC AAPL
6 [1,100,’relu’,16,100], [2, 128, NA, 128, 128] [1, 50, NA, NA, 100] NA 6 [1,100,NA,32,10] NA [1,50, NA,NA,100] [1,50,NA,32,100]
7 [1,100,’relu’,20,100], [2, 128, NA, 32, 500] [2, 50, NA, NA, 100] NA 7 [1,100,NA,16,100] [2,10,NA,32, NA] [1,50, NA,1,100] [1,50,NA,16,10]
8 [1,128, NA,32,100] [1, 128, NA, NA, NA] [1, 128, NA, NA, 100] [1,50,’relu’,32, 10] 8 [1,100,’relu’,32,10] NA [1,128, NA,NA,100] [2,[50,50],NA,32,10]
9 [1,100,’relu’,32,100] NA [2, [50,32], NA, 32, 100] [1,50,’relu’,32, 50] 9 [1,128,’relu’,32,100] NA [2,[50,32],NA,NA,100] [1,100,NA,32,10]
10 [1,128,NA,32,100] [1, 128, NA, 128, 100] [1, 50, NA, 32, 100] [1,50,NA,1, 100] 10 [1,100,NA,1,100] [1,1,NA,32,NA] [[1,50,’relu’,1,100]] [2,[50,50],NA,1,100]
11 [1,100,’relu’,NA,100] [1, 10, NA, 256, 100] [2, [50,32], NA, 32, 50] NA 11 [1,100,’relu’,1,10] NA [2,[50,32],NA,32,50] [1,64,NA,32,10]
1 [1,100, ’relu’,32,100] NA [1,50, NA,32,100] [2,[50,50], NA,1,100] 1 [1,100,’relu’,32,100] [1,128,NA, 128, 100] [1,50, NA,NA,100] [2,[50,50],NA,32,10]
2 [1,100, NA,32,100] NA [1,50,’relu’,32,100] [1,50,’relu’,32,10] 2 [1,100,NA,NA,100] NA [1,50, NA,32,100] [1,50, ’relu’,16,10]
3 [1,128, NA,32,100] [1, 128, NA, 32,NA] [2,[50,64],NA,32,100] [1,50, NA,32,50] 3 [1,128,NA,32,10] NA [2,[50,64],NA,32,100] [1,128,NA,32,10]
4 [1,100, NA,16,100] [1, 128, NA, NA, 100] [1,50, NA,NA,100] [1,50, NA,32,10] 4 [1,100,NA,32,100] [1,99,NA,1,100] [1,50, ’relu’,NA,100] [1,50, ’relu’,1,100]
5 [1,128, ’relu’,32,100] NA [2,[50,32],NA,32,100] [1,64, NA,32,10] 5 [2,[128,128],NA,32,10] NA [2,[50,32],NA,32,100] [1,64,NA,32,10]
DJI MSFT
6 [1,100, ’relu’,30,100] NA [1,50, NA,NA,100] NA 6 [1,50, ’relu’,32,10] [1,100,NA,100,NA] [[1,50, NA,1,100]] [1,50, ’relu’,32,10]
7 [1,100, ’relu’,NA,100] NA [2,50, NA,32,100] NA 7 [1,100, ’relu’,NA,100] NA [1,50, ’relu’,1,100] [1,50, ’relu’,NA,10]
8 [1,128, NA,32,100] [1, 256, NA, 256, NA] [1,128, NA,NA,100] [1,50, ’relu’,NA,10] 8 [1,100,NA,32,10] NA [[1,128, ’relu’,NA,100]] [2,[50,50],NA,32,10]
9 [1,128,’relu’,32,100] NA [2,[50,32], NA,NA,100] [2,[50,50], ’relu’,1,100] 9 [1,128,NA,32,100] [1,512,NA,1,NA] [2,[50,32],NA,32,100] [2,[50,50],’relu’,32,10]
10 [2,[128,64], NA,32,100] NA [1,50, NA,NA,100] [1,50, NA,16,10] 10 [1,128, ’relu’,NA,100] NA [[1,50,NA,NA,10]] [1,64, ’relu’,32,10]
11 [1,128, NA,NA,100] NA [2,[50,32], NA,32,50] NA 11 [1,128, ’relu’,32,100] [1,64, NA, 32, NA] [2,[50,32],NA,32,50] [1,50, ’relu’,32,100]
1 [1,128,’relu’,32,100] NA [1,50, NA,32,100] [2,[50,50], NA,32,10] 1 [1,100, NA,32,100] [1,32,NA,256,NA] [1,50,’relu’,32,100] [1,50, NA,32,50]
2 [1,128, NA,32,100] NA 1,50, NA,NA,100 NA 2 [1,100,NA,32,10] [1,128,NA,32,NA] [1,50,NA,32,100] [2,[50,50],NA,32,10]
3 1,128, NA,NA,100 NA [2,[50,64], NA,32,100] NA 3 [1,128,NA,32,10] NA [2,[50,64],NA,32,100] [1,50,NA,32,100]
4 [1,100,’relu’,32,100] [1,100, NA, 256, 100] [1,50, ’relu’,32,100] NA 4 [1,100, ’relu’,32,100] NA [1,50,NA,16,100] [1,50, ’relu’,32,100]
5 [1,128, NA,NA,100] [1,100,NA, 32, 10] [2,[50,32], NA,32,100] [1,64, NA,32,10] 5 [1,128,’relu’,32,10] [1,32,NA,32,NA] [2,[50,32],NA,32,100] [1,128,NA,32,100]
IXIC AMZN
6 [1,128, NA,NA,NA] [3,[128,10,10],NA,256,100] [1,50, NA,NA,100] NA 6 [1,100,’relu’,32,10] NA [1,50,NA,NA,100] [1,50, ’relu’,32,10]
7 [1,100, NA,32,100] NA [1,50,’relu’,NA,100] NA 7 [1,100, ’relu’, NA,100] NA [1,50,NA,1,100] [1,50, ’relu’, NA,100]
8 [1,50, NA,32,100] [1,128,NA,10,NA] [1,128, NA,NA,100] [2,[50,50],NA,32,100] 8 [1,128,NA,32,100] NA [1,128,NA,NA,100] [1,50,NA,1,10]
9 [1,50,’relu’,32,100] [4,[100,200,300,400] NA, NA,NA] [2,[50,32],NA,NA,10] NA 9 [1,100,NA,1,100] NA [2,[50,32],NA,NA,100] [1,4,NA,1,100]
10 [1,50, NA,NA,100] NA [1,50,NA,NA,100] [1,50, NA,1,100] 10 [1,100, ’relu’,1,10] NA [1,50, NA, 16,100] [1,50, ’relu’,16,10]
11 1,128, NA,16,100 NA [2,[50,32],NA,32,50] NA 11 [1,100, NA,16,10] NA [1,[50,32],NA,32,50] [1,50, NA,16,100]
1 NA [1,1,NA,32,100] [1,50, NA,32,100] NA 1 [1,100,’relu’,32,100] [1,16,NA,32,NA] [1,50,NA,32,100] [2,[50,50],NA,32,10]
2 [2,[128,64], NA,NA,100] NA [1,50, NA,1,100] NA 2 [1,100, ’relu’,32,10] NA [1,50, ’relu’,NA,100] [1,50, ’relu’,32,10]
3 NA NA [2,[50,64],NA,NA,100] [1,50, NA,32,100] 3 [1,128,NA,32,10] NA [2,[50,64],NA,32,100] [1,50,NA,32,10]
4 [2,[128,64],’relu’,32,100] [1, 32, NA, 32,NA] [1,50, ’relu’,32,100] NA 4 [1,100, ’relu’,NA,100] NA [1,50, ’relu’,32,100] [1,50, ’relu’,32,100]
5 [2,[128,64],’linear’,32,100] [3,[128,256,256],NA, NA, NA] [2,[50,32],NA,32,100] [2,[64,64], NA,32,10] 5 [1,128, ’relu’,32,10] [1,10,NA,10,10] [2,[50,32],NA,32,100] [1,128, ’relu’,32,10]
N225 BABA
6 [2,[128,64], NA,NA,100] NA [1,50, NA,NA,100] [1,50,’relu’,32,50] 6 [1,64, NA,32,10] [1,128,NA,1,NA] [1,50, ’relu’,1,100] [1,50, ’relu’,NA,10]
7 [2,[128,64],’relu’,NA,100] NA [1,50,’relu’,NA,100] NA 7 [1,100, NA,1,10] [1,32,NA,128,NA] [1,50,NA,1,100] [1,50, NA,16,10]
8 NA NA [1,128, NA,NA,100] [2,[50,50],NA,32,100] 8 [1,128,NA,32,100] NA [1,128,NA,NA,100] [2,[50,50],NA,32,10]
9 [2,[128,64],’relu’,16,100] NA [2,[50,32],’relu’,NA,100] [2,[50,50], NA,1,100] 9 [1,100,NA,NA,100] NA [2,[50,32],NA,100] NA
10 [1,128,’relu’,NA,100] [1,2,NA, 32,1000] [1,50, NA,32,50] [1,50, NA,1,100] 10 [1,100,NA,NA,10] [1,64,NA,64,1000] [1,50, ’relu’,16,100] [2,[50,50],NA,32,100]
11 [1,128, NA,NA,100] NA [2,[50,32],NA,32,50] NA 11 [1,128,NA,NA,100] NA [2,[50,32],NA,32,50] [1,64,NA,32,10]
1 NA [1,128,NA,10,NA] [1,50, NA,32,100] [1,50, NA,1,100] 1 [1,100,NA,32,100] NA [1,50, ’relu’,32,100] [2,[50,50],NA,32,10]
2 NA [2,[64,128],NA, 32, NA] [1,50,’relu’,32,100] [1,50, ’relu’,1,100] 2 [1,100,’relu’,32,100] [1,128,NA,64,NA] [1,50, ’relu’,NA,100] [2,[50,50],’relu’,32,10]
3 NA NA [2,[50,64],NA,32,100] NA 3 [2,[128,10],NA,NA,10] NA [2,[50,64],NA,32,100] [1,50,NA,1,100]
4 NA NA [1,50, NA,NA,100] [1,50, NA,NA,100] 4 [1,100,’relu’,NA,100] [1,128,NA,256,100] [1,50,NA,1,100] [1,50,’relu’,1,100]
5 NA NA [2,[50,32],NA,NA,100] [1,50, NA,32,100] 5 [1,128,’relu’,32,100] NA [2,[50,32],NA,32,100] [1,64,NA,1,100]
HSI TSLA
6 NA NA [1,50, NA,NA,100] [1,50, ’relu’,NA,100] 6 [1,100,’relu’,32,10] [1,128, NA, 128, 100] [1,50,NA,NA,100] [1,50,’relu’,32,10]
7 NA [1,10,NA,1,NA] [1,50,’relu’,NA,100] [2,[50,50],NA,32,100] 7 [1,100,NA,NA,100] NA [1,50,NA,16,100] [1,50,NA,1,10]
8 [1,100,NA,32,100] NA [1,128, NA,32,100] [2,[50,50],’relu’,32,100] 8 [1,100,NA,32,10] [2,[32,32],NA,NA,NA] [1,128,NA,NA,100] [2,[50,50],NA,16,10]
9 [2,100,NA,32,100] NA [2,[50,32],NA,NA,100] [1,4, NA,1,100] 9 [1,128,NA,NA,100] [1,32,NA,32,100] [2,[50,32],NA,NA,100] [1,50,NA,16,10]
10 NA NA [2,50,NA,32,100] [1,50, NA,32,100] 10 [1,100,NA,16,100] [1,256,NA,NA,NA] [1,50, ’relu’,16,100] [1,50,NA,32,100]
11 [1,50,NA,32,100] NA [2,[50,32],NA,32,50] NA 11 [1,100,’relu’,16,100] NA [2,[50,32],NA,32,50] [2,[50,50],NA,32,100]
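For readers unfamiliar with the notation above, each tuple can be read as [number of LSTM layers, units per layer (a list when there is more than one layer), activation function, batch size, epochs], with NA indicating that the generated code left the value unspecified. The following is a minimal sketch, assuming TensorFlow/Keras and an illustrative helper name, of how such a tuple maps to an executable LSTM forecaster; it illustrates the notation only and is not the exact code produced by any of the LLMs.

# Illustrative sketch only; assumes TensorFlow/Keras. A tuple is read as
# [num_layers, units (int or list), activation, batch_size, epochs];
# None stands in for the NA entries (framework defaults are used instead).
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

def build_from_tuple(spec, input_shape=(10, 1)):
    """Build an LSTM forecaster from a [layers, units, activation, batch, epochs] tuple."""
    num_layers, units, activation, batch_size, epochs = spec
    units = units if isinstance(units, list) else [units] * num_layers
    model = Sequential()
    for i, u in enumerate(units):
        kwargs = {"activation": activation or "tanh",       # Keras default when NA
                  "return_sequences": i < num_layers - 1}    # stack all but the last layer
        if i == 0:
            kwargs["input_shape"] = input_shape              # lookback window of 10 steps, 1 feature
        model.add(LSTM(u, **kwargs))
    model.add(Dense(1))                                      # single-step price forecast
    model.compile(optimizer="adam", loss="mse")
    return model, (batch_size or 32), (epochs or 100)        # assumed fallbacks for NA entries

# Example: the recurring GPT 3.5 configuration [1, 100, 'relu', 32, 100]
model, batch_size, epochs = build_from_tuple([1, 100, "relu", 32, 100])
# Example: a stacked two-layer configuration such as [2, [50, 32], NA, 32, 100]
model2, _, _ = build_from_tuple([2, [50, 32], None, 32, 100])

With this mapping, entries whose units field is a two-element list (e.g., [2,[50,32],...]) correspond to stacked LSTM layers, which occur only rarely in the generated models. From the architectures tabulated above, we draw the following observations.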
1. GPT 3.5 generates LSTM-based models with model architectures that are relatively similar to the architecture of the manually created and optimized model. GPT 3.5 is followed by LLama 2 in generating the most similar architectures. On the other hand, the architectures of the models generated by PaLM and Falcon are less similar to the model manually created and optimized by an expert.
2. Most LLMs generate deep learning LSTM-based models with the number of layers equal to 1 or, in some rare cases, 2, which is consistent with the architecture of our manually created model.
3. The architectural parameter that appears to contribute most significantly to the accuracy of the generated models is the number of nodes or units per layer. While PaLM and Falcon choose a large value for the number of units or nodes (e.g., 128), the models generated by GPT 3.5 and LLama 2 use a number of nodes and units (i.e., 50) similar to that of the manually created models. This observation indicates that the number of nodes plays a key role in improving the accuracy of the generated models, and GPT 3.5 and LLama 2 set this parameter better than the other two LLMs, namely PaLM and Falcon.
4. The second key contributor to the accuracy of the models appears to be the batch size. While the two best-performing LLMs set the batch size to 32, the manually created and optimized model sets the batch size to 1, capturing additional and in-depth patterns in the data. An interesting observation is that, while there are some benefits to using smaller batch sizes, smaller batch sizes may also increase the risk of overfitting; see the sketch after this list.
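As a concrete illustration of the batch-size observation in item 4, the sketch below (assuming TensorFlow/Keras, a synthetic random-walk series standing in for a closing-price dataset, and illustrative helper names) trains the same single-layer LSTM with a batch size of 32 and of 1, using early stopping on a validation split as one simple guard against the overfitting risk of very small batches.

# Illustrative sketch only, not the paper's exact training code; assumes
# TensorFlow/Keras and a synthetic random-walk series as placeholder data.
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.callbacks import EarlyStopping

def make_windows(series, lookback=10):
    """Slice a 1-D series into (samples, lookback, 1) windows with next-step targets."""
    X = np.array([series[i:i + lookback] for i in range(len(series) - lookback)])
    return X[..., np.newaxis], series[lookback:]

series = np.cumsum(np.random.randn(500))       # placeholder for a closing-price series
X, y = make_windows(series)

def build_lstm(units=50):
    model = Sequential([LSTM(units, input_shape=X.shape[1:]), Dense(1)])
    model.compile(optimizer="adam", loss="mse")
    return model

# batch_size=32: the setting most LLM-generated models chose.
build_lstm().fit(X, y, epochs=100, batch_size=32, validation_split=0.2, verbose=0)

# batch_size=1: the manually tuned setting; early stopping on the validation
# split is one way to limit the overfitting risk of very small batches.
build_lstm().fit(X, y, epochs=100, batch_size=1, validation_split=0.2, verbose=0,
                 callbacks=[EarlyStopping(patience=5, restore_best_weights=True)])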
8. Limitations

The initial assumption of this work is that average users in many areas, including finance and economics, are mostly interested in simple forms of deep learning models such as LSTM. Consequently, it may be relatively challenging for these users to build and fine-tune deep learning models with complex architectures. To reflect this assumption, this paper also keeps the deep learning models simple, without adding additional complexity to the architecture of the models built. In addition, it is important to note that, to have a fair comparison between the models generated by LLMs, possible bias introduced by the complexity of deep learning architectures must be avoided. To prevent such an unfair comparison, the work keeps the models consistent and simple so that the results can be interpreted without bias.
Furthermore, different application domains may exhibit different results. This paper focuses only on financial data as an important application domain; including additional experiments and their results from other application domains would make the paper lengthy and difficult to follow. Additional replication of the work performed here in different application domains is therefore necessary. According to our initial assumption, in-depth analysis might not be of interest to average researchers or developers with little background in this domain. As such, this paper focuses only on the type of analysis that is often required by the average data analyst in application domains such as finance.

9. Conclusion and Future Work

This paper reports the results of a number of controlled experiments to study the effect of various prompts with different sensitivity levels and configuration parameters of LLMs on the goodness of the deep learning-based models generated for forecasting time series data. As a representative application domain, the paper studied the problem of forecasting financial time series data. The paper first created and optimized a manual LSTM-based model to forecast financial and stock time series data. We then controlled each prompt with respect to four criteria, including 1) Clarity and Specificity, 2) Objective and Intent, 3) Contextual Information, and 4) Format and Style, where the sensitivity of these criteria was controlled in terms of being low, medium, and high.
The results provided interesting insights regarding the accuracy of the forecasting models generated by generative AI and LLMs. Most notably, we observed that these generative AIs are capable of producing comparable forecasting models when queried using simple or complex prompts with additional details. We compared the accuracy of the models with a single manually crafted and optimized LSTM-based forecasting model that was trained and built on all datasets together. According to our results, we did not observe a significant influence of complex prompts on producing better and more accurate models. In some cases, simpler prompts produced better and more accurate models, whereas in other cases, more complex prompts generated more accurate forecasting models. It is apparent that the value of the temperature parameter used in configuring LLMs has a direct impact on whether simple or more complex prompts generate more accurate forecasting models.
As for statistical performance, we observed that the RMSE values for the models produced by LLMs are quite strong and the models remain robust. Additional statistical testing found the differences between LLM-generated and manually coded models to be statistically significant, with particularly strong differences when the datasets were more complex.
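For readers who want to reproduce this kind of comparison, the following is a minimal sketch, assuming NumPy and SciPy, of computing RMSE and running a paired significance test over per-dataset errors; the Wilcoxon signed-rank test and the numbers shown are illustrative placeholders rather than the exact procedure or values used in this paper.

# Illustrative sketch only; the arrays below are hypothetical placeholders,
# not results from this paper. Assumes NumPy and SciPy.
import numpy as np
from scipy.stats import wilcoxon

def rmse(y_true, y_pred):
    """Root mean squared error between actual and forecast values."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical per-dataset RMSE values for an LLM-generated model and the
# manually optimized LSTM (one entry per index or ticker studied).
rmse_llm    = np.array([1.21, 0.98, 1.35, 1.10, 1.42, 0.87, 1.05, 1.18])
rmse_manual = np.array([1.05, 0.92, 1.20, 1.01, 1.25, 0.85, 0.99, 1.07])

# Paired test over the same datasets; a small p-value would suggest a
# statistically significant difference between the two model families.
stat, p_value = wilcoxon(rmse_llm, rmse_manual)
print(f"Wilcoxon statistic = {stat:.3f}, p-value = {p_value:.4f}")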
Moreover, we found that the models generated by different LLMs used drastically different architectures, in terms of the number of layers, the number of units, and the activation functions. The differences in performance attributable to this variability may be an indication that prompt engineering is still an important factor in leveraging LLMs for deep learning tasks.
The results reported in this paper are particularly useful for data analysts and practitioners who have little experience with programming and coding for developing complex deep learning-based models such as LSTM for forecasting time series data. The paper poses an interesting research problem that needs additional studies which expand the performance analysis to incorporate further metrics and statistical measures, and which compare the LSTM model against models such as ARIMA and other conventional models to further validate the results reported in this paper.

10. Acknowledgement

This research is partially supported by the U.S. National Science Foundation Award: 2319802.

References
[1] Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., et al., 2022. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073.
[2] Becker, B.A., Denny, P., Finnie-Ansley, J., Luxton-Reilly, A., Prather, J., Santos, E.A., 2023. Programming is hard - or at least it used to be: Educational opportunities and challenges of ai code generation, in: Proceedings of the 54th ACM Technical Symposium on Computer Science Education V. 1, pp. 500–506.
[3] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al., 2020. Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901.
[4] Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H.W., Sutton, C., Gehrmann, S., et al., 2022. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311.
[5] Chui, M., Hazan, E., Roberts, R., Singla, A., Smaje, K., Sukharevsky, A., Yee, L., Zemmel, R., 2023. The economic potential of generative ai: The next productivity frontier. https://www.mckinsey.com/capabilities/mckinsey-digital/our-insights/the-economic-potential-of-generative-ai-the-next-productivity-frontier#introduction.
[6] Denny, P., Leinonen, J., Prather, J., Luxton-Reilly, A., Amarouche, T., Becker, B.A., Reeves, B.N., 2023. Promptly: Using prompt problems to teach learners how to effectively utilize ai code generators. arXiv preprint arXiv:2307.16364.
[7] Destefanis, G., Bartolucci, S., Ortu, M., 2023. A preliminary analysis on the code generation capabilities of gpt-3.5 and bard ai models for java functions. arXiv preprint arXiv:2305.09402.
[8] Gopali, S., Abri, F., Siami-Namini, S., Namin, A.S., 2021. A comparison of tcn and lstm models in detecting anomalies in time series data, in: 2021 IEEE International Conference on Big Data (Big Data), pp. 2415–2420. doi:10.1109/BigData52589.2021.9671488.
[9] Gopali, S., Khan, Z.A., Chhetri, B., Karki, B., Namin, A.S., 2022. Vulnerability detection in smart contracts using deep learning, in: 2022 IEEE 46th Annual Computers, Software, and Applications Conference (COMPSAC), pp. 1249–1255. doi:10.1109/COMPSAC54236.2022.00197.
[10] Gopali, S., Namin, A.S., Abri, F., Jones, K.S., 2024. The performance of sequential deep learning models in detecting phishing websites using contextual features of urls, in: Proceedings of the 39th ACM/SIGAPP Symposium on Applied Computing, Association for Computing Machinery, New York, NY, USA. pp. 1064–1066. URL: https://doi.org/10.1145/3605098.3636164, doi:10.1145/3605098.3636164.
[11] Gopali, S., Siami Namin, A., 2022. Deep learning-based time-series analysis for detecting anomalies in internet of things. Electronics 11. URL: https://www.mdpi.com/2079-9292/11/19/3205, doi:10.3390/electronics11193205.
[12] Liu, J., Xia, C.S., Wang, Y., Zhang, L., 2023. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation. arXiv preprint arXiv:2305.01210.
[13] Markets, Ltd, M.R.P., 2023. Generative ai market worth $51.8 billion by 2028, growing at a cagr of 35.6%: Report by