
LongCoder: Open-Source Model for Code Completion with Long Code Input

Introduction

Code completion is a task that aims to generate the next token or statement given a partial code input. It is widely used in modern integrated development environments (IDEs) and code editors to assist programmers in writing code. Code completion can help programmers save time, avoid typos, and discover new APIs or libraries.

However, most existing code completion models are based on standard Transformer models, which have limitations in handling long code input. Transformer models use a self-attention mechanism to compute the relevance between every pair of tokens in the input sequence. This mechanism has two drawbacks: (1) it has quadratic complexity with respect to the sequence length, which makes it computationally expensive and memory-intensive for long sequences; (2) it treats every token equally, which may introduce noise or redundancy for long sequences.
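To see why the quadratic cost matters at long lengths, here is a quick back-of-the-envelope comparison. The 4,096-token input matches LongCoder's maximum length mentioned below; the 512-token window is an illustrative choice for this article, not necessarily the paper's setting:

```python
# Back-of-the-envelope comparison of attention costs (illustrative only).
seq_len = 4096
window = 512  # hypothetical local window size, chosen for illustration

full_attention_scores = seq_len * seq_len   # every token attends to every token
sliding_window_scores = seq_len * window    # every token attends to a local window

print(f"full self-attention: {full_attention_scores:,} pair scores")   # 16,777,216
print(f"sliding window:      {sliding_window_scores:,} pair scores")   # 2,097,152
print(f"reduction:           {full_attention_scores / sliding_window_scores:.0f}x")  # 8x
```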

To overcome these limitations, researchers from Microsoft Research Asia and the University of California San Diego proposed LongCoder, a long-range pre-trained language model for code completion.

What is LongCoder?

LongCoder is a sparse Transformer model that can handle long code input for code completion tasks and capture both local and global information. It employs a sliding window mechanism for self-attention and introduces two types of globally accessible tokens, bridge tokens and memory tokens, to improve performance and efficiency.

Key Features of LongCoder

LongCoder has several key features that make it a novel and effective
model for code completion tasks.

1. LongCoder can handle long code input of up to 4,096 tokens, far longer than previous models, which typically handle only up to 512 tokens.
2. LongCoder can capture both local and global information in the code using a sliding window mechanism and bridge tokens. It can also memorize important statements using memory tokens.
3. LongCoder is pre-trained on a large-scale corpus of Python code from GitHub repositories with a left-to-right (causal) language modeling objective, so it can leverage the general knowledge and syntax of Python code learned during pre-training.
4. LongCoder can be fine-tuned on specific code completion tasks using different datasets, allowing it to adapt to different domains and scenarios of code completion.
5. LongCoder achieves superior performance on code completion tasks compared to previous models while maintaining comparable computational efficiency during inference.


Capabilities/Use Cases of LongCoder

LongCoder can be used for various code completion tasks, such as:

● Token-level code completion: given a partial code input, generate the next token that is most likely to follow.
● Statement-level code completion: given a partial code input, generate the next statement that is most likely to follow (the short sketch after this list illustrates the difference between the two granularities).
● Code refinement: given a partial or incorrect code input, generate the correct or improved code output.
● Code suggestion: given a partial or incomplete code input, generate multiple possible code outputs that can complete or extend the input.
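To make the first two granularities concrete, here is a purely hypothetical illustration; the prefix and completions are invented for this article and are not actual LongCoder output:

```python
# Hypothetical illustration of completion granularities (not real model output).
partial_code = (
    "def area(radius):\n"
    "    return 3.14159 * radius *"
)

# Token-level completion predicts only the next token:
#     "radius"
#
# Statement-level completion predicts the rest of the statement, yielding:
#     "return 3.14159 * radius * radius"
```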

LongCoder can also be used for other related tasks, such as:

● Code summarization: given a code input, generate a natural-language summary that describes its functionality or purpose.
● Code documentation: given a code input, generate natural-language documentation that explains its usage or parameters.
● Code generation: given a natural-language input, generate a code output that implements its functionality or logic.

How does LongCoder work?

LongCoder is a model for code completion that can handle long code input. It is a decoder-only sparse Transformer: the model encodes the input code tokens into hidden states and generates output code tokens autoregressively, conditioning on those hidden states and the previously generated tokens.

Each sparse Transformer block combines three components: (1) self-attention with a sliding window; (2) globally accessible bridge tokens and memory tokens that participate in the same attention; (3) a feed-forward network.


In the sliding window self-attention, each token attends only to a fixed-size window of nearby tokens rather than to the entire sequence. This reduces the computation and memory cost of self-attention from quadratic to linear in the sequence length, while preserving the local information around each token.
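As a minimal sketch of the idea, the mask below lets each position attend only to itself and the few positions before it; the tiny window size is chosen for readability, and the real model uses a much larger one:

```python
import torch

# Causal sliding-window attention mask: mask[i, j] is True when query
# position i may attend to key position j.
def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    i = torch.arange(seq_len).unsqueeze(1)  # query positions (column vector)
    j = torch.arange(seq_len).unsqueeze(0)  # key positions (row vector)
    causal = j <= i                 # never attend to future tokens
    local = (i - j) < window        # only the most recent `window` tokens
    return causal & local

print(sliding_window_mask(seq_len=8, window=4).int())
```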

Bridge tokens and memory tokens capture the global information that the local window misses. They are special, globally accessible tokens that can attend to, and be attended by, the other tokens in the input code. Bridge tokens are inserted throughout the sequence to aggregate local information and facilitate global interaction, while memory tokens highlight important statements, such as imports and class or function definitions, that may be referenced much later.
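Continuing the sketch above, the helper below marks a few positions as globally accessible; the positions are hypothetical, and for readability the sketch ignores the causal restriction a real decoder would still enforce on these tokens:

```python
import torch

# Make selected positions globally accessible, as bridge and memory
# tokens are. Builds on sliding_window_mask() from the previous sketch.
def add_global_tokens(mask: torch.Tensor, global_positions: list[int]) -> torch.Tensor:
    mask = mask.clone()
    for p in global_positions:
        mask[:, p] = True  # every token can attend to the global token
        mask[p, :] = True  # the global token can attend to every token
    return mask

local_mask = sliding_window_mask(seq_len=8, window=4)
print(add_global_tokens(local_mask, global_positions=[0, 5]).int())
```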

The feed-forward network applies a non-linear transformation to each token after the attention sub-layer, helping the model learn richer features from the token representations. As in standard Transformers, each block also uses layer normalization and residual connections, and the model ends with an output layer that predicts the next token.
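For orientation, this is what a standard position-wise feed-forward block looks like; the hidden sizes are illustrative defaults, not LongCoder's actual configuration:

```python
import torch.nn as nn

# Position-wise feed-forward block, applied independently to each token.
class FeedForward(nn.Module):
    def __init__(self, d_model: int = 768, d_ff: int = 3072):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),                 # non-linear transformation
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):
        return self.net(x)
```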

Performance evaluation with other models

(performance comparison table; source - https://arxiv.org/pdf/2306.14893.pdf)

On the LCC benchmark (shown in the table above), the sparse models (Longformer and LongCoder) outperform the non-sparse models on both the Exact Match (EM) and Edit Similarity metrics, while maintaining a similar inference speed. Here, Longformer refers to a version of UniXcoder modified to use a sliding window attention mechanism, which helps the model process longer code input faster and more accurately. This confirms the usefulness of sliding window attention for code completion tasks.

LongCoder improves upon Longformer by adding bridge tokens and memory tokens, which help the model capture more global information and important statements in the code. LongCoder improves the EM score by 0.8%–1.3% and the Edit Similarity score by 4.0%–6.0% compared to other sparse models (as shown in the table above), demonstrating the effectiveness of the proposed tokens.

(CodeXGLUE results table; source - https://arxiv.org/pdf/2306.14893.pdf)

LongCoder also achieves the best performance on the CodeXGLUE code completion benchmarks (shown in the table above), even though these benchmarks involve much shorter code inputs. Holding an advantage over UniXcoder even in this short-input setting suggests that LongCoder has potential for more complex scenarios as well.

How to access and use LongCoder?

LongCoder is available on GitHub, where you can find the codes and
data for pre-training and fine-tuning LongCoder, as well as the
instructions for running the experiments.

LongCoder is also available on Hugging Face, where you can load and
use LongCoder using PyTorch Transformers library. You can also use
LongCoder for feature extraction or fine-tuning on your own datasets.
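As a minimal sketch, the snippet below loads the checkpoint named in the Source section for feature extraction, assuming it loads through the standard Transformers Auto classes; for the authors' exact fine-tuning and generation scripts, see the GitHub repo:

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Load the pre-trained LongCoder checkpoint from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("microsoft/longcoder-base")
model = AutoModel.from_pretrained("microsoft/longcoder-base")

code = "def fibonacci(n):\n    if n <= 1:\n        return n\n"
inputs = tokenizer(code, return_tensors="pt")

# Extract contextual token representations without computing gradients.
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)  # (batch, num_tokens, hidden_size)
```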


LongCoder is open-source and free to use for research purposes. However, if you want to use LongCoder for commercial purposes, you need to obtain a license from Microsoft.

If you are interested in learning more about the LongCoder model, all relevant links are provided under the 'Source' section at the end of this article.

Limitations

LongCoder is a novel and effective model for code completion tasks, but it also has some limitations that need to be addressed in future work.

● LongCoder is currently only pre-trained on Python code, which may limit its generalization to other programming languages. It would be interesting to explore how to pre-train LongCoder on multiple languages or cross-lingual code corpora.
● LongCoder uses a heuristic rule to select memory tokens based on keywords (a simplified version is sketched after this list), which may not capture all the important statements in the code. More sophisticated selection methods based on semantic or syntactic analysis would be worth exploring.
● LongCoder uses a fixed number of bridge tokens and memory tokens, which may not adapt well to code inputs of different lengths or complexities. Dynamically adjusting the number or position of these tokens based on the input context would be another interesting direction.
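For a sense of what such a keyword heuristic looks like, here is a simplified, hypothetical version; the paper describes selecting important statements such as imports and class or function definitions, but the exact rule below is invented for illustration:

```python
# Hypothetical keyword-based heuristic for choosing "memory" statements.
MEMORY_KEYWORDS = ("import ", "from ", "def ", "class ")

def select_memory_lines(code: str) -> list[int]:
    """Return the indices of lines that look worth memorizing."""
    return [
        idx
        for idx, line in enumerate(code.splitlines())
        if line.lstrip().startswith(MEMORY_KEYWORDS)
    ]

sample = (
    "import math\n"
    "\n"
    "class Circle:\n"
    "    def area(self, r):\n"
    "        return math.pi * r * r\n"
)
print(select_memory_lines(sample))  # [0, 2, 3]
```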

Conclusion

LongCoder is a promising model that can handle long code input and generate accurate and relevant code outputs for code completion tasks. It can help programmers write code faster and with fewer errors, as well as discover new APIs or libraries. However, LongCoder also has some limitations that need to be addressed in future work, such as generalizing to other languages, selecting memory tokens more effectively, and adapting to code inputs of different lengths and complexities.
Source
research paper - https://arxiv.org/abs/2306.14893
GitHub repo - https://github.com/microsoft/CodeBERT/tree/master/LongCoder
parent GitHub repo - https://github.com/microsoft/CodeBERT/
Hugging Face LongCoder base - https://huggingface.co/microsoft/longcoder-base
Microsoft Research - https://www.microsoft.com/en-us/research/publication/longcoder-a-long-range-pre-trained-language-model-for-code-completion/

To read more such articles, please visit our blog https://socialviews81.blogspot.com/
