Assessing Large Language Models for Code Generation: A Comprehensive Framework

This paper presents a comprehensive framework for assessing the code generation capabilities of Large Language Models (LLMs) like ChatGPT, Claude, Spark, and Bing AI, focusing on six key aspects: validity, correctness, complexity, dependability, security, and readability. Using a dataset of 45 coding problems, the study evaluates the performance of these models to identify their strengths and weaknesses in generating accurate and reliable code. The findings highlight the varying abilities of LLMs across different coding tasks and emphasize the need for standardized evaluation methods to enhance their development and application in real-world scenarios.


Assessing Large Language Models for Code Generation:

A Comprehensive Framework
Mohit M., Saahil D., Dhruvi D., Karan P., Darpan P., Dr. Alok S.

Abstract:-
Large Language Models (LLMs) have shown exceptional ability across several domains, including code recommendation. However, code production ability varies considerably across the LLMs on the market, and many of the challenges these models can face have not been examined closely. This paper presents a complete approach to assessing code generation skill along six aspects: code validity, correctness, complexity, reliability, security and readability. The assessment covers several models available on the market, namely ChatGPT, Claude, Spark and Bing AI, and is based on a dataset called LLMC, which contains around 45 different coding problems. The experiment gives a picture of which LLMs have robust skills for generating accurate code for a given problem; any inaccuracy in the generated code is treated as a code fault rather than a comprehension problem.
Introduction:-

There has been enormous progress in the development of large language models, which has produced a great leap in natural language processing ability. These models help machines understand text and produce output that is almost at human level in fluidity and understanding. This ability is showcased on various platforms, including debate and analytical tasks; trained on large amounts of textual data, these models have become a great asset in both the technology and language fields.

Pre-trained models such as ChatGPT, Claude and Spark are exposed to large amounts of mostly textual data, which allows them to develop a deep understanding of human language and to generate output that is logical and relevant to the context. With this capability, the models can better understand and interpret the needs of a specification and transform the given prompt into executable code.

But even the most advanced technologies have their own kinds of limitations. These models can offer very accurate code suggestions, yet sometimes the suggested block is not even relevant to the work being done, and they may also produce incomplete output, which is not acceptable. Still, LLMs combined with NLP have helped natural language comprehension meet these challenges and have opened development opportunities that may revolutionize the development process and reduce the workload of developers.

In this era of code generation rather than manual coding, a practice that can be traced back to the term program synthesis, LLMs have a significant impact. Generating code from user requirements is still a very hard, open problem, and it has driven the development of many kinds of tools and approaches. Prominent examples are Codex, GitHub Copilot and Blackbox, which give code suggestions and search abilities to boost the productivity of coders.

Search-based language models such as Bing AI take a different approach: they retrieve relevant data from the web to generate a response to the user's prompt. By combining this search functionality with a language model such as GPT-4, these systems can improve their code generation ability by drawing on a very large repository of current coding resources and samples available on the web.

Despite these capabilities, evaluation methods are necessary to assess the models' effectiveness at generating code and to identify potential areas for improvement. At the moment there is a shortage of standardized ways to evaluate the code generation capabilities of such models and the quality of the code they produce.

To overcome this problem, we develop a new method for evaluating LLMs' code generation abilities. The technique uses six main criteria: code validity, correctness, complexity, reliability, security and readability. With this evaluation framework, we aim to measure the strengths and weaknesses of the latest and most popular LLMs, including ChatGPT, Claude, Spark and Bing AI, in code generation tasks.

To support the evaluation process, a dataset called LLMC (Large Language Model Code Generation Capability) was developed. It comprises 45 coding questions covering various aspects of Python programming. The questions are designed to test the LLMs' ability to generate different kinds of programming constructs, data structures and algorithms, and they include a range of problem-solving scenarios.

We also carry out an experimental evaluation and data analysis to investigate several key questions:

- Do the provided code suggestions pass their test cases without any issues?
- What is the level of complexity, reliability, security, and readability of the code generated by LLMs?
- In what types of coding tasks do LLMs perform well or struggle?
- When an LLM's answer is wrong, is the failure more likely a fault in the generated code or a failure to understand the task?

By answering the questions formulated above, we hope to bring useful knowledge about the current level of LLMs' code generation capabilities and to identify areas where there is room for improvement, which will allow more complex and sophisticated models to be created for code generation purposes.

In addition, the proposed evaluation approach and its outcomes will make a considerable contribution to the overall discussion of the potential and risks of large language models in professional work. The models under consideration are likely to see further development and deployment in multiple areas, which underlines the need for an effective and comprehensive evaluation framework to regulate their use and support responsible output.

The subsequent sections are structured as follows: Section 2 details the evaluation strategy and the factors and approaches used to determine the scores. Section 3 summarizes the experiments and the LLMC dataset. Section 4 explains the process of obtaining code recommendations from the selected models and the use of the SonarQube code analysis tool, and presents a comprehensive examination of the evaluation results with specific focus on the critical areas. Finally, Section 5 summarizes the findings and discusses the limitations of the LLMs and potential research directions for the near future.
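The evaluation loop implied by this framework (pose each LLMC question to a model, then score the answer against its test cases) could look roughly like the following sketch; `query_model` and `run_tests` are hypothetical placeholder callables, since the paper does not publish its harness:

```python
def evaluate(problems, query_model, run_tests):
    """Score one model on a list of LLMC-style problems."""
    results = []
    for problem in problems:
        code = query_model(problem["prompt"])              # ask the model for a solution
        passed, total = run_tests(code, problem["tests"])  # run predefined test cases
        results.append({"id": problem["id"], "pass_rate": passed / total})
    return results
```

Each of the six quality aspects would add a further scoring step inside the same loop.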
Body:-
Proposed Evaluation Methodology

The proposed evaluation method provides a structured and systematic approach to checking Large Language Models' (LLMs') code generation abilities. It consists of two steps: obtaining relevant code suggestions from the LLMs and analyzing the generated code in multiple areas.

First, the basis for soliciting code suggestions from the various LLMs is the LLMC dataset. The dataset is made up of 45 different code challenges based on the Python programming language. The challenges span a variety of principles, were taken from numerous sources, mostly LeetCode, and are specifically organized to include ideas from both basic and challenging disciplines. Additionally, every challenge in the dataset was created in a manner that evaluates how well an LLM can understand a valid, human-readable description and convert it into operational code.

After obtaining the code recommendations from the LLMs, the next phase is an analysis of the generated code along the six aspects already listed, which together offer a good view of the quality of the code and its applicability to real-world use:

Code Validity: The goal of the validity check is to guarantee that the generated code is syntactically correct and will execute without errors. Validity depends heavily on avoiding syntax errors, since any fault in the coding language's syntax makes the code unusable in that language and produces basic errors that cannot be tolerated.

Code Correctness: To check whether the generated code is correct, it must pass a series of test cases that verify its behavior against predetermined specifications. The code is checked both for error-free compilation and for its ability to generate the wanted result under the proper input conditions.

Code Complexity: An important part of code quality testing, the complexity check examines the structure of the generated code. The extent of sophistication in the codebase is determined through metrics such as cognitive complexity and cyclomatic complexity. Code that is shorter, more adaptable and easier to understand has a lower complexity score, which increases extensibility and supportability.

Code Reliability: The aim of the reliability test is to find possible errors or weak points in the generated code and to check how much work is required to fix them. Strong code generation should yield outputs that are immune to typical programming errors, reducing the amount of debugging and troubleshooting required.

Code Security: To lessen potential security risks, code produced by LLMs should follow best practices. Security checks are important in software development: weaknesses such as buffer overflows, injection attacks and other general security errors are carefully checked for in the generated code, with an emphasis on strengthening it against any kind of exploit.

Code Readability: The readability test checks whether the generated code is clear by examining elements such as naming conventions, code structure and comments. Properly explained and commented code is easier to understand and encourages teamwork in development, which increases the overall quality of the code.
Every query in the LLMC dataset is sorted according to its degree of difficulty, pass rate and applicability in actual programming. This ensures a meaningful test of LLM abilities across a wide range of programming scenarios.

SonarQube and the LLMC Dataset

The LLMC dataset supports the evaluation of LLMs' code generation abilities. It has 45 code questions that range over different aspects of Python programming. The dataset was carefully selected to cover a broad range of programming ideas, from data types and control structures to more complex subjects such as object-oriented programming and exception handling. Every question in the dataset is designed with care to assess an LLM's ability to understand requirements written for humans and convert them into executable code, giving a thorough assessment of its code generation skills.

SonarQube, an open-source tool for continuous code quality inspection, is used to examine the generated code in a number of ways and so support the evaluation process. Researchers can obtain important insight into the dependability and quality of the generated code from SonarQube's reports on code quality metrics, which include duplicated code, coding standards, unit tests, code coverage, code complexity, comments, bugs and security issues.

Four well-known LLMs, namely ChatGPT, Claude, Spark and Bing AI, were chosen for our project so that their code generation abilities could be studied during the experimental stage. Bing AI is a search-based LLM that uses the very new GPT-4 to gather knowledge and answer our questions; the others are recent pre-trained generative LLMs. SonarQube was then used to assess the quality of their code recommendations.

Leading the pack, Bing AI achieves the greatest average pass rate and correctness rate on the test cases, demonstrating its excellent ability to produce dependable and accurate code suggestions. Compared to the other LLMs, Bing AI also shows reduced code complexity metrics, which may indicate more legible and maintainable code output. The majority of the code recommendations made by the assessed LLMs are rated highly for security and dependability, demonstrating their efficiency in generating reliable and secure code outputs. Furthermore, compared to Claude and Spark, ChatGPT and Bing AI demonstrate better code explanation capabilities, highlighting their capacity to offer succinct and understandable explanations for the created code.

The way LLMs perform varies depending on the type of coding issue; certain models perform well in some scenarios but poorly in others. Questions about file handling and exception handling are clearly better handled by Bing AI, whereas questions about data types, operators and control structures are better handled by ChatGPT, Claude and Bing AI.

Consequences & Upcoming Projects

This study should strongly influence how LLMs are developed and how they grow through real-world use. Researchers can address the weaknesses and growth of LLMs by modifying the code and observing the models' ability to perform tasks and generate relevant code from the instructions a user gives. Secondarily, the methodology offers a comparison of different LLMs, enabling comparative analysis that points to better areas of development for natural language processing and code generation.
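A leaderboard like the pass-rate comparison described above can be derived from per-model results with a few lines of Python; the structure of `per_model_results` mirrors the hypothetical harness sketched earlier, and any concrete numbers are placeholders, not the paper's measurements:

```python
def summarize(per_model_results):
    """Average pass rate per model, sorted best-first."""
    averages = {
        model: sum(r["pass_rate"] for r in results) / len(results)
        for model, results in per_model_results.items()
    }
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)
```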
It is also important to recognize the limitations of this research: the LLMC dataset is small, and it covers only Python code generation. Adding more questions to the dataset would give wider coverage of languages and areas; future projects should try to improve on the metrics reported here, evaluate new LLMs as they appear, and study the LLMs examined here in other programming languages such as Java and C++.

Conclusion:-

The presented approach offers a thorough and organized framework for testing LLMs' ability at code creation, giving insight into their advantages, their disadvantages and where they can be improved. Through a proper evaluation of the generated code in multiple dimensions, practitioners gain a more profound insight into LLMs' capabilities in code generation assignments, which supports well-informed choices and guides the construction of more dependable models for real-world implementations.

Important elements including the readability, security, complexity, correctness and validity of the code are carefully tested at every stage of the assessment process to produce a proper and concise review of the code's quality and applicability. Using the LLMC dataset ensures a thorough and accurate assessment of LLM abilities in a variety of programming situations and problem domains.

Four well-known LLMs, ChatGPT, Claude, Spark and Bing AI, were experimentally tested to obtain important information on how well they performed in various areas. Bing AI led the group and was quite good at producing accurate code recommendations; it was particularly good at handling files and managing exceptions. However, all tested LLMs had difficulties answering questions about objects, classes, modules and functions, pointing to areas that need more improvement.

These findings will have a big impact on how LLMs are developed and used in the actual world going forward. Through a thorough evaluation of the advantages and disadvantages of these models in various areas, professionals can make proper choices about the selection and use of these models for particular generation tasks. Nevertheless, it is necessary to recognize a number of research limitations: the LLMC dataset's relatively limited size and its exclusive focus on Python programming may limit how broadly applicable the results can be.

To curb these limits, the dataset must be enlarged to include a wider variety of code queries covering several programming languages. Differences in code generation across other programming languages should also be checked, and new LLMs should be evaluated continuously.

In conclusion, with continued study and development LLMs have the ability to completely change the way software development processes are done, giving more dependable, scalable and effective solutions for a wide range of problems.
