Programming Language Processing:
Machine Learning for Source Code Understanding and Generation
Saikat Chakraborty
Ph.D. Candidate
Columbia University, New York
Source: https://thenewstack.io/how-much-time-do-developers-spend-actually-writing-code/
My Research Interests
• Code Summarization
• Code Editing / Repair
• Bug Detection
• Code Search
• Code Generation / Synthesis
• Other Source Code Analysis
Identify Bugs
x = parameter(free)
y = trace_in_dataflow(x)
if x or y is modified: WARNING!!
[e.g., Alloy model checking, KLEE, SAGE, CUTE, AFL fuzzers, Java PathFinder, LLVM passes]
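The recipe above amounts to a use-after-free check. As a toy illustration only, here is a minimal Python sketch over a hypothetical tuple-based IR; the IR format and helper names are invented for this example, and real analyzers work on far richer program representations.

import warnings  # not required; plain prints below

# Minimal use-after-free checker over a toy straight-line IR (hypothetical).
def find_use_after_free(instructions):
    """instructions: list of (op, args) tuples, e.g. ('free', ('p',))."""
    freed = set()                        # x = parameter(free)
    findings = []
    for idx, (op, args) in enumerate(instructions):
        if op == 'free':
            freed.add(args[0])
        elif op == 'assign':
            dst, src = args
            if src in freed:
                freed.add(dst)           # y = trace_in_dataflow(x): follow aliases
            else:
                freed.discard(dst)       # dst now holds a fresh value
        elif op == 'use':
            if args[0] in freed:         # if x or y is used after free: WARNING!!
                findings.append((idx, args[0]))
    return findings

prog = [('free', ('p',)), ('assign', ('q', 'p')), ('use', ('q',))]
print(find_use_after_free(prog))         # [(2, 'q')]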
Automated Edit / Program Repair
l = location(gets)
v = parameters(gets)
s = memory_size(v)
new_stmt = "fgets(v, s, stdin)"
Update l with new_stmt
[1] SimFix – Jiang et al., 2018
[2] ELIXIR – Saha et al., 2017
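As a toy illustration of this repair pattern, a Python sketch that performs the gets -> fgets rewrite textually; the buffer-size table is a hypothetical stand-in for the memory_size(v) lookup, which real tools recover from declarations or memory models.

import re

buffer_sizes = {'buf': 'sizeof(buf)'}                 # stand-in for s = memory_size(v)

def repair_gets(line):
    m = re.search(r'\bgets\s*\(\s*(\w+)\s*\)', line)  # l = location(gets)
    if not m:
        return line                                   # pattern does not apply
    v = m.group(1)                                    # v = parameters(gets)
    s = buffer_sizes.get(v, f'sizeof({v})')           # s = memory_size(v)
    return line[:m.start()] + f'fgets({v}, {s}, stdin)' + line[m.end():]

print(repair_gets('    gets(buf);'))                  # fgets(buf, sizeof(buf), stdin);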
Pattern/Rule Based Automation

Limitations of pattern/rule-based automation:
• Too many different patterns
• Low tolerance for deviation

Bug-detection rule [3]:
x = parameter(free)
y = trace_in_dataflow(x)
if x or y is modified: WARNING!!

Repair pattern [4]:
l = location(gets)
v = parameters(gets)
s = memory_size(v)
new_stmt = "fgets(v, s, stdin)"
Update l with new_stmt

[3] Johnson et al., 2013
[4] Saha et al., 2017
Data-Driven Automation

x = parameter(free)
y = trace_in_dataflow(x)
if x or y is modified: WARNING!!

l = location(gets)
v = parameters(gets)
s = memory_size(v)
new_stmt = "fgets(v, s, stdin)"
Update l with new_stmt
Data-Driven Automation – Specification Mining

x = parameter(free)
y = trace_in_dataflow(x)
if x or y is modified: WARNING!!

l = location(gets)
v = parameters(gets)
s = memory_size(v)
new_stmt = "fgets(v, s, stdin)"
Update l with new_stmt

[5] Pradel et al., 2012
[6] Le et al., 2018
[7] Meng et al., 2017
Data-Driven Automation – Automated Analyzer
Artificial Intelligence for Software Engineering
Curate and Preprocess Data → Design Model → Training → AI-based Automation Tool
Challenges of AI for SE
• Understanding Source Code – understanding the structure and functionality of source code.
• Generating Source Code – ensuring the syntactic and semantic correctness of generated source code.
Programming Language Processing
Program Analysis + Machine Learning
Code Understanding Task – Example
Understand source code: Vulnerability Detection
Code Generation Tasks – Example
Generate source code: Code Translation, Code Editing / Program Repair
Understanding Source Code – Requirements
• Syntax Structure
• Control Dependency
• Data Dependency
(An annotated example follows this list.)
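As a toy illustration (invented for this write-up, not taken from the talk), the fragment below marks where each requirement surfaces:

def running_sum(n):
    total = 0              # defines `total`: a data-flow source
    if n > 0:              # control dependency: the loop runs only when n > 0
        for i in range(n):
            total += i     # data dependency: reads and writes `total`
    return total           # data dependency on `total`; the whole body must also parse (syntax structure)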
Generating Source Code – Requirements
1. Syntactic correctness
2. Semantic correctness
(Toy examples of each failure mode follow.)
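A minimal illustration of the two failure modes, with snippets invented for this example:

import ast

syntactically_incorrect = "def add(a, b) return a + b"   # missing ':' -> the parser rejects it

def add(a, b):            # parses fine, but semantically incorrect:
    return a - b          # it subtracts instead of adding

try:
    ast.parse(syntactically_incorrect)
except SyntaxError as e:
    print("syntax error:", e.msg)
print("add(2, 2) =", add(2, 2))   # 0, not the intended 4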
Programming Language Processing
Encoding Syntax and Semantics of Source Code
Encoding Syntax and Semantics
Code Modeling:
• Explicit Encoding
• Implicit Encoding

Ways of Encoding:
• Natural Channel Mutation
• Formal Channel Mutation
Explicit Encoding
(Code modeling map: Explicit Encoding highlighted.)

[8] Learning to Represent Programs with Graphs – Allamanis et al., 2017
[9] Learning to Represent Edits – Yin et al., 2019
[10] HOPPITY – Dinella et al., 2020
CODIT: Code Editing with Tree-Based Neural Models (TSE'20)
Explicit encoding of code structure for code editing
Automated Code Editing
[Figure: example code edits – code before edit vs. code after edit]
CODIT: Code Editing with Tree-Based Models
CODIT Step 1: Tree Translation
Translate the rule sequence of the syntax tree before the edit into the rule sequence of the syntax tree after the edit.
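A minimal sketch of the representation this stage translates: a syntax tree serialized as a preorder sequence of grammar productions. CODIT itself operates on Java parse trees; using Python's ast module here is purely illustrative.

import ast

def rule_sequence(node):
    # preorder serialization: one production per internal node
    children = list(ast.iter_child_nodes(node))
    if not children:
        return []
    rules = [f"{type(node).__name__} -> {' '.join(type(c).__name__ for c in children)}"]
    for child in children:
        rules.extend(rule_sequence(child))
    return rules

for rule in rule_sequence(ast.parse("x = a + b")):
    print(rule)
# Module -> Assign
# Assign -> Name BinOp
# Name -> Store
# BinOp -> Name Add Name
# ...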
CODIT Step 2: Token Translation
Concretize tokens using reachability analysis at the edit location.
Reachable variables at the edit location: {inst, object, tmp}
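A minimal sketch of this scoping step. The line-based notion of "defined before the edit location" is a deliberate simplification invented for this example; CODIT performs the analysis on Java scopes.

import ast

def reachable_variables(source, edit_line):
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            if node.lineno <= edit_line:     # defined at or before the edit location
                names.add(node.id)
    return names

code = "inst = make()\nobject = inst.load()\ntmp = 0\nprint(tmp)\n"
print(reachable_variables(code, edit_line=4))   # {'inst', 'object', 'tmp'}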
CODIT: Code Editing with Tree-Based Models

Data Set                                  | # of Projects | # of Edit Examples | # of Tokens (Max / Avg) | # of Nodes (Max / Avg)
Generic Code Edits from GitHub            | 48            | 32,473             | 38 / 15                 | 47 / 20
Pull Request Edits [Tufano et al., 2019]  | 3             | 5,546              | 34 / 17                 | 47 / 23
CODIT: Code Editing with Tree-Based Models

Category        | Method         | Generic Code Edits | Pull Request Edits
Sequence-Based  | LSTM-Seq2Seq   | 3.77%              | 11.26%
Sequence-Based  | Tufano et al.  | 6.57%              | 23.65%
Sequence-Based  | SequenceR      | 9.76%              | 26.43%
Tree-Based      | Tree2Seq       | 11.04%             | 23.49%
Tree-Based      | CODIT          | 15.94%             | 28.87%
CODIT – Results
CODIT fixes 15 bugs completely and 10 bugs partially out of 80 bugs in Defects4J.
[Example: Closure Compiler, Bug #3]
Explicitly Encoding PL Knowledge into the Model
• Reachability Analysis
• Context-Free Grammar
ReVeal (TSE'21)
Explicit encoding of code structure for vulnerability detection
ReVeal
Pipeline: Training Data → Code Property Graph (node features) → GGNN → Graph Embedding → Trained ReVeal model
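A minimal sketch of one gated message-passing round over a code property graph, in the spirit of the GGNN ReVeal uses. The toy graph, random weights, and the simplified gate (standing in for a full GRU cell) are assumptions for illustration, not the paper's implementation.

import numpy as np

rng = np.random.default_rng(0)
num_nodes, dim = 4, 8
H = rng.normal(size=(num_nodes, dim))        # initial node features from the CPG
edges = [(0, 1), (1, 2), (1, 3)]             # AST/CFG/DFG edges (hypothetical)
W_msg = rng.normal(size=(dim, dim))
W_gate = rng.normal(size=(2 * dim, dim))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ggnn_step(H):
    M = np.zeros_like(H)
    for src, dst in edges:                   # each node sums messages from neighbors
        M[dst] += H[src] @ W_msg
    z = sigmoid(np.concatenate([H, M], axis=1) @ W_gate)   # update gate
    return z * np.tanh(M) + (1 - z) * H      # gated state update

H = ggnn_step(H)                             # one of several propagation rounds
graph_embedding = H.sum(axis=0)              # pool node states into a graph embedding
print(graph_embedding.shape)                 # (8,)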
ReVeal – Results
[Figure: F1 scores of ReVeal vs. token-based and graph-based [10] baselines, on Chromium & Debian and on FFmpeg & Qemu]
[10] Devign – Zhou et al., 2019
Explicitly Encoding PL Knowledge into the Model
• Code Property Graph
Explicit Encoding – Takeaways

Pros:
• Precision
• Guarantees
• Explainable

Cons:
• Model design overhead
• Limited transferability
Implicit Encoding

Train on a large data corpus: Mutated Input → De-Mutation

[11] CodeBERT – Feng et al., 2020
[12] GraphCodeBERT – Guo et al., 2020
[13] Codex – Chen et al., 2021
[14] CodeT5 – Wang et al., 2021
Dual Channel Hypothesis for Source Code [15]
Natural Channel [16] | Formal Channel

[15] Casalnuovo et al., 2020
[16] Karampatsis et al., 2020
Implicit Encoding
(Code modeling map: Implicit Encoding via a mutator on the natural channel highlighted.)
PLBART: Unified Pre-training for Program Understanding and Generation (NAACL'21)
Implicit encoding by mutating the natural channel
PLBART – What Is It?
An encoder–decoder model.
Pretraining – an unsupervised way of learning code patterns.
PLBART – Components

Encoder:
1. Reads code.
2. Understands code.
3. Reasons about errors in the code.
4. Learns robust representations.

Decoder:
1. Generates code.
2. Learns coding patterns.
PLBART – Pretraining
Correct Code → Noise Injector → Noisy Code → Encoder → Decoder → Correct Code (reconstructed)
Noising operations: Token Masking, Token Deletion, Token Infilling
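A minimal sketch of the three noising operations on a token list. The <mask> symbol and the 35% noise ratio follow BART-style denoising; the exact values here are illustrative assumptions rather than PLBART's released configuration.

import random

rng = random.Random(0)

def token_masking(tokens, p=0.35):
    return ['<mask>' if rng.random() < p else t for t in tokens]

def token_deletion(tokens, p=0.35):
    return [t for t in tokens if rng.random() >= p]

def token_infilling(tokens, span=2):
    i = rng.randrange(len(tokens) - span)
    return tokens[:i] + ['<mask>'] + tokens[i + span:]   # one mask replaces a whole span

code = 'public static int add ( int a , int b ) { return a + b ; }'.split()
print(token_masking(code))
print(token_deletion(code))
print(token_infilling(code))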
PLBART – Jointly Learning to Understand and Generate
[Figures: denoising with Token Masking, Token Deletion, and Token Infilling]
PLBART – Pretraining

• Noise properties:
  • Mutates the natural channel
  • Likely to break syntax

• Training objectives:
  • Generate the whole original code
  • Learn syntax implicitly
  • Learn coding patterns

• Multi-lingual training:
  • Java
  • Python
  • NL from Stack Overflow
PLBART – Experiments (Applications)
PLBART – Results – Code Generation from NL
Dataset: CONCODE
Metrics: EM (exact match %), BLEU-4 (max 100), CodeBLEU (max 100)

CodeBLEU = 0.25 * token_match + 0.25 * keywords_match + 0.25 * syntax_match + 0.25 * dataflow_match
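Restated as code, the combination above is a uniform weighted average. The component scores are assumed to be precomputed values in [0, 1]; real implementations (e.g., the CodeXGLUE release) derive them from n-grams, keywords, ASTs, and data flow.

def code_bleu(token_match, keywords_match, syntax_match, dataflow_match):
    return 0.25 * token_match + 0.25 * keywords_match \
         + 0.25 * syntax_match + 0.25 * dataflow_match

print(code_bleu(0.80, 0.90, 0.75, 0.60))   # 0.7625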
PLBART – Search-Augmented Code Generation (EMNLP'21)
✓ Up to 21% CodeBLEU improvement over GraphCodeBERT.
✓ Up to 12% improvement due to search augmentation.
PLBART – Multi-Modal Code Editing (ASE'21)
✓ Inputs: the edit location (code to be edited), its context, and a summary of the edit.
✓ Up to 34% improvement over CodeBERT and 30% improvement over CodeGPT.
✓ Up to 16% improvement from multi-modality.
PLBART – Results – Code Translation
Dataset: CodeXGLUE [Lu et al., 2021]
Metrics: EM (exact match %), BLEU-4 (max 100), CodeBLEU (max 100, as defined above)
An Interesting Example of Code Translation
[Figure: input C# code vs. the better code PLBART generates]
Implicit Encoding
(Code modeling map: Implicit Encoding via a mutator on the formal channel highlighted.)
NatGen: Implicit Encoding by Mutating the Formal Channel
NatGen: Generative Pre-training by "Naturalizing" Source Code
Write semantically / functionally equivalent code in a "more natural" way.
[Figure: a de-naturalizing transformation applied to source code]
NatGen: De-Naturalizing Transformations
• Confusing Statements [17]
• Dead Code Insertion
• Operand Swap
(A sketch of the last two follows this list.)

[17] Gopstein et al., 2020
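A minimal sketch of two of these transformations, written against Python's ast module purely for illustration; NatGen itself applies richer, multi-language versions of these rewrites.

import ast

class OperandSwap(ast.NodeTransformer):
    # a < b  ->  b > a : semantics-preserving, but less "natural"
    def visit_Compare(self, node):
        self.generic_visit(node)
        if len(node.ops) == 1 and isinstance(node.ops[0], ast.Lt):
            return ast.Compare(left=node.comparators[0],
                               ops=[ast.Gt()], comparators=[node.left])
        return node

def insert_dead_code(tree):
    # prepend an unreachable branch to every function body
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            node.body.insert(0, ast.parse("if False:\n    pass").body[0])
    return tree

tree = ast.parse("def f(a, b):\n    if a < b:\n        return a\n    return b")
tree = insert_dead_code(OperandSwap().visit(tree))
print(ast.unparse(ast.fix_missing_locations(tree)))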
NatGen: Generative Pre-training by "Naturalizing" Source Code
[Figure: NatGen's encoder–decoder architecture]
NatGen: Some Initial Results – Few-Shot Learning
Implicit Encoding – Takeaways

Pros:
• Little overhead
• Unsupervised
• Transferable

Cons:
• No guarantees
• Potential bias from mutation
My Research Applications

Programming Language Processing (PLP):
• Code Summarization – NeuralCodeSum (ACL'20), PLBART (NAACL'21)
• Vulnerability Detection – ReVeal (TSE'21), DISCO (ACL'22)
• Code Editing – CODIT (TSE'20), MODIT (ASE'21), DiffBERT (Facebook)
• Code Generation and Translation – PLBART (NAACL'21), SNG-PLBART (NAACL'22, under review), DataTypeLM ForCode (ACL'18)
• Code Search and Synthesis – RedCoder (EMNLP'21), CodePanda (W.I.P.)
What’s Next? (Short Term Goal)
➢ API driven Program Synthesis
➢ Improving Semantic Code Search with RL
Synthesizer
68
What’s Next? (Short Term Goal)
➢ Representing Code Context as Dynamic
Graph
➢ Learning Code Syntax and Semantics with
Reinforcement Learning (RL)
➢ Other Information Modalities
1. Other Software Metadata – Comments, Commits, Code Evolution metadata.
2. Analysis of Program Binaries.
3. Dynamic Analysis of Program – based on execution behavior.
69
What’s Next? (Long Term Goal)
➢ Code Generation
Formal Analysis Probabilistic models
Guarantee for the Analysis Scalable and Transferrable
Noise Intolerant No theoretical Guarantee
➢ Developer Feedback Oriented Automation
70
Baishakhi Ray – Adviser

Wasi Ahmad (UCLA / Amazon), Miltos Allamanis (MSR), Kai-Wei Chang (UCLA), Yanju Chen (UCSB), Prem Devanbu (UC Davis), Yangruibo Ding (Columbia), Yu Feng (UCSB), Rahul Krishna (Columbia / IBM), Toufiq Parag (UC Davis), Rizwan Parvez (UCLA)