E1. Code Language Models
Yale NLP
Ansong Ni
Yale University
[email protected]
04/05/2024
Why Build Code Language Models
• Quick Poll
• GitHub Copilot
• OpenAI ChatGPT
1
Why Build Code Language Models
• Automatically writing programs is one of the oldest and hardest problems in AI and CS:
This process of constructing instruction tables should be very fascinating. There need be no real
danger of it ever becoming a drudge, for any processes that are quite mechanical may be turned
over to the machine itself. — Alan Turing (1945)
2
Why Build Code Language Models
• They relate to several important areas in CS
• Programming Languages (PL)
• Software Engineering (SE)
• Machine Learning (ML)
• Natural Language Processing (NLP)
• Human-Computer Interaction (HCI)
• …
[Venn diagram: Code AI at the intersection of PL & SE, ML, NLP, and HCI]
3
Why Build Code Language Models
• Code generation is a great testbed for intelligence:
• language understanding
• symbolic reasoning
• planning & search
• interactive learning
• …
4
Why Build Code Language Models
• FlashFill (Excel)
• AI-assisted Programming
• Virtual Assistants
• Robotics Control
• Database Query and Visualization
6
Preliminaries
• Assume basic knowledge of NLP terms and terms related to LLMs
• E.g., BERT, GPT, prompting, autoregressive, retrieval, etc.
• Mixing of terms
• Foundation Models ≈ LM ≈ LLM
• Code LM/LLM: Language models that have seen code during training
• Code and Math LMs
• They are deeply connected:
• Both are formal languages;
• Both require symbolic reasoning
• This lecture mostly focuses on code LMs, but many methods apply to math LMs as well
7
Outline
• A brief history of code LMs
• Data collection, filtering and tokenization
• Training of code LLMs
• Decoder-only models and code infilling
• Encoder-only models;
• Encoder-decoder models;
• Reinforcement Learning
• Post-training methods for code LLMs
• Neuro-symbolic approaches
• Prompting methods for code
• Retrieval-augmented generation for code
8
A Brief History of LMs for Code
9
Key Events (2020-2021)
• Feb 2020: CodeBERT [1]
• First attempt -- 16 months after original BERT paper
• 125M parameters
• May 2020: GPT-3 [2]
• People find that GPT-3 has some coding abilities
• Though it is not specifically trained on code
• Jun 2021: GitHub Copilot
• Revolutionary performance
• Multi-line, whole function completion for the first time
• Jul 2021: Codex [3]
• First 10B+ model trained specifically for code
• Hero behind GitHub Copilot
[1] Feng et al. (2020), “CodeBERT: A Pre-Trained Model for Programming and Natural Languages.”
[2] Brown et al. (2020), “Language Models are Few-Shot Learners.”
[3] Chen et al. (2021), “Evaluating Large Language Models Trained on Code.”
10
Key Events (2022)
• Feb 2022: AlphaCode [1]
• Claims an average ranking within the top 54.3% in programming competitions with human participants
• Up to 41B parameters; model neither released nor publicly accessible
• Mar 2022: CodeGen [2]
• Open-source 10B+ code LM
• Later found to be severely under-trained (addressed in CodeGen2)
• Apr 2022: PaLM [3]
• PaLM-Coder is a 540B code model
• The models are also severely under-trained (addressed in PaLM-2)
• Nov 2022: The Stack [5]
• 3TB of permissively licensed code data
• Foundational data work for many code LMs in the future
[1] Li et al. (2022), “Competition-Level Code Generation with AlphaCode.”
[2] Nijkamp et al. (2023), “CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis.”
[3] Chowdhery et al. (2022), “PaLM: Scaling Language Modeling with Pathways.”
[5] Kocetkov et al. (2022), “The Stack: 3 TB of Permissively Licensed Source Code.”
13
Data Collection, Filtering and
Tokenization
14
Code Data Collection and Filtering
• Data Sources:
• Mostly GitHub and similar platforms;
• More recently:
• Kaggle Notebooks
• Software Documentation
• Commits, issues, pull requests
• Quality Filtering (take [1] as an example):
• GitHub stars >= 5
• 1% <= Comment-to-code ratio <= 80%
• License:
• Only permissively licensed open-source repos may be used;
• E.g., MIT, Apache 2.0
[1] Ben Allal et al. (2023), “SantaCoder: Don’t Reach for the Stars!”
15
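To make the filtering criteria above concrete, here is a minimal sketch of a per-file filter. The license allowlist and the line-based comment-to-code approximation are illustrative assumptions, not the exact heuristics used by SantaCoder.

```python
# A minimal sketch of heuristic quality filtering, mirroring the thresholds
# on the slide above; license allowlist and the line-based comment ratio
# are illustrative assumptions.
def keep_file(source: str, repo_stars: int, license_id: str) -> bool:
    permissive = {"mit", "apache-2.0", "bsd-3-clause"}  # illustrative subset
    if license_id.lower() not in permissive:
        return False
    if repo_stars < 5:
        return False
    lines = source.splitlines() or [""]
    comment_lines = sum(1 for l in lines if l.lstrip().startswith(("#", "//")))
    ratio = comment_lines / len(lines)
    return 0.01 <= ratio <= 0.80
```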
Deduplication and De-contamination
• Deduplication:
• Remove (near-)duplicated files from the training data;
• Why: repeated training data can significantly hurt the performance [1]
• Decontamination:
• Remove the files that contain solutions to benchmarks used for evaluation;
• Why: better measure generalization ability of trained LMs
• Methods:
• Exact match
• Near-deduplication
[1] Hernandez et al. (2022), “Scaling Laws and Interpretability of Learning from Repeated Data.”
[2] Ben Allal et al. (2023), “SantaCoder: Don’t Reach for the Stars!”
16
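A minimal sketch of the exact-match variant: hash a lightly normalized copy of each file and keep only the first occurrence. Near-deduplication (e.g., MinHash over token shingles) follows the same keep-first pattern with a fuzzier key.

```python
# A minimal sketch of exact-match deduplication for a list of file contents.
import hashlib

def dedup_exact(files: list[str]) -> list[str]:
    seen, kept = set(), []
    for content in files:
        # Normalize whitespace so trivially reformatted copies also collide.
        key = hashlib.sha256(" ".join(content.split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(content)
    return kept
```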
Tokenization for Code LM (1)
• Tokenization for LMs
17
Tokenization for Code LM (2)
• Tokenization is a big deal for coding tasks
• Code is similar to, but also very different from, natural language:
• Similar: semantic meaning of variable/function/class names
• E.g., “is_correct”, “AttentionLayer”, “compute_perplexity”
• Different: whitespace characters, punctuation, indentation
• E.g., “df.shape[1]”, “def f(x):\n\tif x>0:\n\t\treturn x\n\telse:\n\t\treturn x+1”
• Trade-off between:
• Vocabulary size
• # tokens needed to encode the same sequence
• Generalization ability for different tasks
18
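To see the trade-off above in practice, a minimal sketch comparing how many tokens the same code snippet costs under a natural-language BPE versus a code-trained BPE. It assumes the HuggingFace transformers package; "gpt2" and "bigcode/santacoder" are illustrative tokenizer choices.

```python
# A minimal sketch comparing tokenizers on a code snippet (assumes the
# HuggingFace `transformers` package; model names are illustrative choices).
from transformers import AutoTokenizer

code = "def f(x):\n\tif x > 0:\n\t\treturn x\n\telse:\n\t\treturn x + 1"

nl_tok = AutoTokenizer.from_pretrained("gpt2")                  # natural-language BPE
code_tok = AutoTokenizer.from_pretrained("bigcode/santacoder")  # code-trained BPE

# A code-trained vocabulary typically merges indentation runs and common
# punctuation patterns, so the same snippet costs fewer tokens.
print(len(nl_tok.tokenize(code)), "tokens with the natural-language BPE")
print(len(code_tok.tokenize(code)), "tokens with the code-trained BPE")
```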
Tokenization for Code LM (3)
• Trade-off between:
• Vocabulary size
• # tokens needed to encode the same sequence
• Generalization ability for different tasks → downstream performance
[1] Chirkova and Troshin (2023), “CodeBPE: Investigating Subtokenization Options for Large Language Model Pretraining on Source Code.”
19
Training of Code LLMs
20
Decoder-only (GPT) Models
• Model architecture and pretraining objectives:
• Mostly follow those of general-purpose LLMs, e.g., Codex follows GPT-3
• Multi-stage training:
• Some models are based on a general-purpose LM
• E.g., [1] CodeGen-NL → CodeGen-Multi → CodeGen-Mono
• E.g., [2] Llama 2 → Code Llama
[1] Nijkamp et al. (2023), “CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis.”
[2] Rozière et al. (2023), “Code Llama: Open Foundation Models for Code.”
21
Code Infilling: Fill in the middle
• Infilling task:
• <prefix>, <suffix> → <middle>
• Trained via data augmentation [1]:
• Preprocessing:
• Special tokens <IF>
• <prefix>, <middle>, <suffix>
• <prefix>, <IF>, <suffix>, <IF>, <middle>
• Mixing with original data
• Training with normal autoregressive
objectives
A use case of infilling [2]
[1] Bavarian et al. (2022), “Efficient Training of Language Models to Fill in the Middle.”
[2] Fried et al. (2022), “InCoder: A Generative Model for Code Infilling and Synthesis.”
22
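A minimal sketch of the fill-in-the-middle augmentation described above: split a document at two random points, reorder it into prefix + suffix + middle behind sentinel tokens, and mix with unchanged documents. The sentinel strings are illustrative placeholders, not the exact special tokens of any released model.

```python
# A minimal sketch of FIM data augmentation; sentinel strings are
# illustrative placeholders, not the exact tokens used by any model.
import random

PRE, SUF, MID = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"

def to_fim(document: str, fim_rate: float = 0.5) -> str:
    """With probability `fim_rate`, reorder a document into prefix + suffix + middle
    so that a left-to-right LM learns to infill; otherwise keep it unchanged."""
    if len(document) < 2 or random.random() > fim_rate:
        return document  # mixing with original data
    i, j = sorted(random.sample(range(len(document) + 1), 2))
    prefix, middle, suffix = document[:i], document[i:j], document[j:]
    # The result is trained on with the normal autoregressive objective.
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"
```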
Encoder (BERT) Models for Code (1)
[1] Wang et al. (2022), “CODE-MVP: Learning to Represent Source Code from Multiple Views with Contrastive Pre-Training.”
23
Encoder (BERT) Models for Code (2)
• Code is multi-modal
• Natural language;
• Surface form;
• Control flow graph;
• Abstract syntax tree (AST);
• Data flow graph;
• Dependency graph;
• Compiled machine code;
• …
Using Data Flow Graph
[1] Wang et al. (2021), “CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation.”
25
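To make "code is multi-modal" concrete, a minimal sketch extracting one additional view, the AST, with Python's built-in ast module; structured views like this can be fed to a model alongside the surface form.

```python
# A minimal sketch of one extra "view" of code: its abstract syntax tree,
# extracted with Python's built-in `ast` module.
import ast

source = "def f(x):\n    return x + 1"
tree = ast.parse(source)
# ast.dump linearizes the tree; such structured views can serve as an
# auxiliary signal next to the raw source text.
print(ast.dump(tree, indent=2))
```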
Reinforcement Learning (1)
• Code generation is a natural task for applying RL, as we can automatically obtain feedback from the computer:
• Pass/fail a parser;
• Pass/fail compilation;
• With/without runtime error;
• Pass/fail test cases
Rewards used for CodeRL
• Examples:
• CodeRL [1] (offline actor-critic)
• RLTF [2] (online w/ feedback from compiler)
[1] Le et al. (2022), “CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning.”
[2] Liu et al. (2023), “RLTF: Reinforcement Learning from Unit Test Feedback.”
26
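A minimal sketch of mapping the feedback levels above to a scalar reward by actually running the candidate; the numeric values and the helper interface are illustrative, not the exact constants used by CodeRL or RLTF.

```python
# A minimal sketch of execution-based rewards; numeric values are illustrative.
import subprocess, sys, tempfile

def execution_reward(program: str, test_code: str, timeout: float = 5.0) -> float:
    """Run a candidate program against unit tests and map the outcome to a reward."""
    try:
        compile(program, "<candidate>", "exec")  # does it even parse?
    except SyntaxError:
        return -1.0
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program + "\n" + test_code)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path], capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return -0.6  # treat hangs like runtime errors
    if result.returncode != 0:
        # Distinguish failed assertions (unit tests) from other runtime errors.
        return -0.3 if b"AssertionError" in result.stderr else -0.6
    return 1.0  # all tests passed
```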
Reinforcement Learning (2)
• Benefits of using RL:
• Not limited to learning from a single solution from the dataset;
• Removes the dependency on annotated solutions;
• Able to directly incorporate fine-grained preferences as a reward function;
• Limitations:
• Insufficient test cases may lead to false positives [1]
• Rewards are typically sparse and underspecified [2];
• Especially if we start with a weaker model
• It usually involves exploration (sampling) with LMs, which is expensive
[1] Smith et al. (2015), “Is the Cure Worse Than the Disease? Overfitting in Automated Program Repair.”
[2] Agarwal et al. (2019), “Learning to Generalize from Sparse and Underspecified Rewards.”
27
Post-Training Methods for Code LLMs
28
Neuro-Symbolic Approaches (1): Incorporating Code Execution
• Methods:
• Sampling + filtering (Codex [1])
• Sampling + filtering + clustering (AlphaCode [2])
• Sample lots of diversified program candidates (e.g., up to 1M)
• Filter using open test cases
• Diversify the picked candidates by clustering and selecting from different clusters
[1] Chen et al. (2021), “Evaluating Large Language Models Trained on Code.”
[2] Li et al. (2022), “Competition-Level Code Generation with AlphaCode.”
30
Neuro-Symbolic Approaches (1): Incorporating Code Execution
• Methods:
• Sampling + filtering (Codex [1])
• Sampling + filtering + clustering (AlphaCode [2])
• Sampling + verification + voting (LEVER [3])
• Train a verifier to verify the program with its execution results
• Aggregate the probability from programs that reach the same execution results
[1] Chen et al. (2021), “Evaluating Large Language Models Trained on Code.”
[2] Li et al. (2022), “Competition-Level Code Generation with AlphaCode.”
[3] Ni et al. (2023), “LEVER: Learning to Verify Language-to-Code Generation using Execution.”
31
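A minimal sketch of the shared sample → execute → aggregate recipe: drop candidates that crash, then pick from the largest cluster of candidates that agree on their execution result. The `run_candidate` helper and the `solve` entry point are illustrative stand-ins for a real sandboxed executor, not the exact pipelines of Codex, AlphaCode, or LEVER.

```python
# A minimal sketch of sample -> execute -> aggregate; `run_candidate` and the
# `solve` entry point are illustrative stand-ins for a sandboxed executor.
from collections import defaultdict

def run_candidate(program: str, test_input):
    env: dict = {}
    exec(program, env)               # never exec untrusted code outside a sandbox
    return env["solve"](test_input)  # assumes each candidate defines `solve`

def select_by_execution(candidates: list[str], test_input) -> str:
    clusters = defaultdict(list)
    for prog in candidates:
        try:
            result = run_candidate(prog, test_input)
        except Exception:
            continue                 # filtering: drop candidates that crash
        clusters[repr(result)].append(prog)
    # Voting: return one program from the largest agreement cluster.
    return max(clusters.values(), key=len)[0]
```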
Neuro-Symbolic Approaches (2): Constraint Decoding
[1] Poesia et al. (2022), “Synchromesh: Reliable code generation from pre-trained language models.”
32
Neuro-Symbolic Approaches (3): Planning and Search
[1] Zelikman et al. (2022), “Parsel : Algorithmic Reasoning with Language Models by Composing Decomposition.”
33
Neuro-Symbolic Approaches (3): Planning and Search
[1] Ye et al. (2023), “SatLM: Satisfiability-Aided Language Models Using Declarative Prompting.”
34
Prompting Methods using Code for LLMs
• Chain-of-thought (CoT) prompting [1]
• Explicitly write the reasoning process as natural
language
• Program-of-thought (PoT) prompting [2] and
Program-aided LM (PAL) [3]
• Explicitly write the reasoning process as a program
• Use program execution to obtain the final answer
• Works well with math and other symbolic
reasoning tasks
• Also closely related to tool-use of LLMs
[1] Wei et al. (2022), “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.”
[2] Chen et al. (2022), “Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks.”
[3] Gao et al. (2022), “PAL: Program-aided Language Models.”
35
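A minimal sketch of PoT/PAL-style prompting, where the model emits Python and execution, not the LM, produces the final answer; `call_llm` and the prompt format are hypothetical placeholders for whatever LLM API is available.

```python
# A minimal sketch of program-of-thought / PAL-style prompting; `call_llm`
# and the prompt layout are hypothetical placeholders for a real LLM API.
POT_PROMPT = """\
Q: A bakery sold 23 cakes on Monday and twice as many on Tuesday. How many in total?
# Python that stores the result in `answer`:
monday = 23
tuesday = 2 * monday
answer = monday + tuesday

Q: {question}
# Python that stores the result in `answer`:
"""

def answer_with_pot(question: str, call_llm):
    generated = call_llm(POT_PROMPT.format(question=question))
    env: dict = {}
    exec(generated, env)  # execution, not the LM, produces the final answer
    return env["answer"]
```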
Retrieval Augmented Generation for Code
• Retrieval-augmented generation (RAG)
• Retrieves relevant pieces of information from some knowledge base and includes them in the prompt
• When programmers write code, they look at:
• Current file (e.g., defined variables, function, classes)
• Documentation of external libraries “DocPrompting” [1]
• Definitions of imported functions and classes “Repo-level Prompt Generator” [2]
• GitHub, StackOverflow, GeeksforGeeks… “REDCODER” [3]
[1] Zhou et al. (2022), “DocPrompting: Generating Code by Retrieving the Docs.”
[2] Shrivastava et al. (2023), “Repository-Level Prompt Generation for Large Language Models of Code.”
[3] Parvez et al. (2021), “Retrieval Augmented Code Generation and Summarization.”
36
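A minimal sketch of the retrieve-then-prompt pattern behind the systems above: score documentation snippets against the query (here with simple token overlap) and prepend the top hits to the code-generation prompt. The scoring function and prompt layout are illustrative simplifications, not the methods of DocPrompting or REDCODER.

```python
# A minimal sketch of retrieval-augmented prompting for code; token-overlap
# scoring and the prompt layout are illustrative simplifications.
def retrieve(query: str, docs: list[str], k: int = 3) -> list[str]:
    q = set(query.lower().split())
    # Rank documents by how many query tokens they share (a crude retriever).
    scored = sorted(docs, key=lambda d: len(q & set(d.lower().split())), reverse=True)
    return scored[:k]

def build_prompt(query: str, docs: list[str]) -> str:
    context = "\n\n".join(retrieve(query, docs))
    return f"# Relevant documentation:\n{context}\n\n# Task: {query}\n# Solution:\n"
```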
Summary
• A brief history of code LMs
• Data collection, filtering and tokenization
• Training of code LLMs
• Decoder-only models and code infilling
• Encoder-only models;
• Encoder-decoder models;
• Reinforcement Learning
• Post-training methods for code LLMs
• Neuro-symbolic approaches
• Prompting methods for code
• Retrieval-augmented generation for code
37
Extended Readings
• Interdisciplinary applications
• Code as Policies: Language Model Programs for Embodied Control (2023)
• Large Language Models for Compiler Optimization (2023)
• Self-Improvement with code LLMs
• STaR: Bootstrapping Reasoning With Reasoning (2022)
• CodeT: Code Generation with Generated Tests (2022)
• Teaching Large Language Models to Self-Debug (2023)
• DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines (2023)
• More ways to learn a code LLM
• Show Your Work: Scratchpads for Intermediate Computation with Language Models (2021)
• Learning Math Reasoning from Self-Sampled Correct and Partially-Correct Solutions (2022)
38
Hope you enjoyed the lecture!
Questions?
39