Programming Language Processing
Machine Learning for Source Code Understanding and Generation
Saikat Chakraborty
Ph.D. Candidate
Columbia University, New York
https://thenewstack.io/how-much-time-do-developers-spend-actually-writing-code/
My Research Interest
Identify Bugs
Rule 1 – detect misuse of a freed pointer:
    x = parameter(free)
    y = trace_in_dataflow(x)
    if x or y is modified:
        WARNING!!

Rule 2 – replace unsafe gets with fgets:
    l = location(gets)
    v = parameters(gets)
    s = memory_size(v)
    new_stmt = "fgets(v, s, stdin)"
    Update l with new_stmt
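Rule 2 can be sketched as a naive line-based scanner; a toy Python illustration of my own (a real tool works on the parse tree, and guessing the size as `sizeof(v)` is an assumption about how `memory_size` would be recovered):

```python
import re

def check_gets(lines):
    """Locate unsafe gets(v) calls and build fgets replacements.

    A toy, line-based version of Rule 2: find gets(v), guess the
    buffer size as sizeof(v), and emit the safer statement.
    """
    fixes = []
    for i, line in enumerate(lines):
        m = re.search(r"\bgets\s*\(\s*(\w+)\s*\)", line)
        if m:
            v = m.group(1)
            fixes.append((i, f"fgets({v}, sizeof({v}), stdin)"))
    return fixes

code = ["char buf[64];", "gets(buf);"]
print(check_gets(code))  # [(1, 'fgets(buf, sizeof(buf), stdin)')]
```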
Too Many Different Patterns [3]
Low Tolerance for Deviation [4]
Data-Driven Automation – Specification Mining
Data-Driven Automation – Automated Analyzer
Artificial Intelligence for Software Engineering
Curate and Preprocess Data → Design Model → Training → AI-based Automation Tool
Challenges of AI for SE
• Understanding Source Code: understanding the structure and functionality of source code.
• Generating Source Code: ensuring the syntactic and semantic correctness of generated source code.
Programming Language Processing
Code Understanding Task – Example: Vulnerability Detection
Code Generation Tasks – Examples: Code Translation; Code Editing / Program Repair
Understanding Source Code - Requirement
• Control Dependency
• Syntax Structure
• Data Dependency
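As a toy illustration of two of these requirements, Python's stdlib `ast` exposes the syntax structure, and a rough def-use walk over it approximates data dependency (this sketch is mine, not from the talk, and it ignores control flow and reassignment):

```python
import ast

src = "x = 1\ny = x + 2\nprint(y)"
tree = ast.parse(src)          # syntax structure: the abstract syntax tree

defs, uses = {}, []
for node in ast.walk(tree):
    if isinstance(node, ast.Assign):
        for t in node.targets:                 # variable definitions
            if isinstance(t, ast.Name):
                defs[t.id] = node.lineno
    elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load):
        uses.append((node.id, node.lineno))    # variable reads

# data dependency: a read of a name depends on the line defining it
deps = sorted((name, defs[name], line) for name, line in uses if name in defs)
print(deps)  # [('x', 1, 2), ('y', 2, 3)]
```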
Generating Source Code - Requirement
Programming Language Processing
[Diagram: encoding the syntax and semantics of source code]
Encoding Syntax and Semantics
Code Modeling
├── Explicit Encoding
└── Implicit Encoding
    ├── Natural Channel Mutation
    └── Formal Channel Mutation
Explicit Encoding
[8] Allamanis et al., Learning to Represent Programs with Graphs, 2017
[9] Yin et al., Learning to Represent Edits, 2019
[10] Dinella et al., HOPPITY, 2020
CODIT: Code Editing with Tree-Based Neural Models (TSE’20)
Automated Code Editing
CODIT: Code Editing with Tree-Based Models
Code Before Edit → Code After Edit
CODIT Step 1: Tree Translation
CODIT Step 2: Token Translation
Reachable Variables in the Edit Location:
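The idea can be sketched with Python's `ast` (an approximation of my own; CODIT itself works on Java parse trees with proper scope analysis): when generating concrete tokens for an edit, restrict identifier choices to variables already defined at the edit location.

```python
import ast

def reachable_variables(src, edit_line):
    """Names assigned on lines before edit_line (a rough scope proxy)."""
    names = set()
    for node in ast.walk(ast.parse(src)):
        if (isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store)
                and node.lineno < edit_line):
            names.add(node.id)
    return names

src = "total = 0\ncount = 1\nresult = total\n"
print(sorted(reachable_variables(src, edit_line=3)))  # ['count', 'total']
```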
CODIT – Results
CODIT fixes 15 bugs completely and 10 bugs partially, out of 80 bugs in Defects4J.
Example: Closure Compiler, Bug 3
Explicitly Encoding PL Knowledge into the Model
ReVeal (TSE’21): Explicit Encoding of Code Structure for Vulnerability Detection
ReVeal
Training Data → Code Property Graph → GGNN (node features → graph embedding) → Trained ReVeal model
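One message-passing step of this pipeline can be sketched in pure Python (a simplification of my own: a real GGNN uses GRU updates and per-edge-type weights over the code property graph, and the tiny graph and feature sizes here are made up):

```python
import math
import random

random.seed(0)
n, dim = 4, 4
edges = [(0, 1), (1, 2), (1, 3)]      # e.g. AST / data-flow edges
h = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]

# each node aggregates its predecessors' features ...
msgs = [[0.0] * dim for _ in range(n)]
for src, dst in edges:
    for k in range(dim):
        msgs[dst][k] += h[src][k]

# ... then applies a (simplified, non-GRU) update
h = [[math.tanh(h[i][k] + msgs[i][k]) for k in range(dim)] for i in range(n)]

# pool node states into a single graph embedding for classification
graph_embedding = [sum(h[i][k] for i in range(n)) for k in range(dim)]
print(len(graph_embedding))  # 4
```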
ReVeal – Results
[Chart: ReVeal F1 score]
Explicit Encoding – Takeaways
Pros:
• Precision
• Guarantee
• Explainable
Cons:
• Model design overhead
• Transferability
Implicit Encoding
PLBART: Unified Pre-training for Program Understanding and Generation (NAACL’21)
PLBART – What Is It?
An encoder-decoder model.
Pretraining – an unsupervised way of learning code patterns.
PLBART – Components
Encoder:
1. Reads code.
2. Understands code.
3. Reasons about any errors in the code.
4. Learns a robust representation.
PLBART – Pretraining
Correct Code → Noise Injector → Noisy Code → Encoder → Decoder → Correct Code (reconstructed)
PLBART – Jointly Learning to Understand and Generate
Denoising pretraining with three noising strategies:
• Token Masking
• Token Deletion
• Token Infilling
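The three strategies can be sketched as follows (a simplified illustration of mine: PLBART actually samples mask-span lengths from a Poisson distribution over subword tokens, so the fixed probability and span size below are placeholders):

```python
import random

random.seed(7)
MASK = "<mask>"

def token_masking(tokens, p=0.35):
    # replace individual tokens with <mask>
    return [MASK if random.random() < p else t for t in tokens]

def token_deletion(tokens, p=0.35):
    # drop tokens; the model must also decide *where* text is missing
    return [t for t in tokens if random.random() >= p]

def token_infilling(tokens, span=2):
    # replace a whole span by a single <mask>; the model infers its length
    i = random.randrange(len(tokens) - span)
    return tokens[:i] + [MASK] + tokens[i + span:]

code = "def add ( a , b ) : return a + b".split()
print(token_masking(code))
print(token_deletion(code))
print(token_infilling(code))
```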
PLBART – Pretraining
• Multi-lingual training
  • Java
  • Python
  • Natural language from Stack Overflow
PLBART – Experiment (Applications)
PLBART – Results – Code Generation from NL
Dataset: Concode
Metrics: EM (exact match %), BLEU-4 (max 100), CodeBLEU (max 100)
CodeBLEU =
0.25 * token_match +
0.25 * keywords_match +
0.25 * syntax_match +
0.25 * dataflow_match
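The composite score is just an equal-weight average of the four component matches; for example, with made-up component scores:

```python
def code_bleu(token, keywords, syntax, dataflow):
    # equal 0.25 weights, as in the formula above
    return 0.25 * token + 0.25 * keywords + 0.25 * syntax + 0.25 * dataflow

print(code_bleu(40.0, 60.0, 80.0, 70.0))  # 62.5
```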
PLBART – Search-Augmented Code Generation (EMNLP’21)
PLBART – Multi-Modal Code Edit (ASE’21)
An Interesting Example of Code Translation
Implicit Encoding
NatGen: Generative Pre-training by “Naturalizing” Source Code
De-Naturalizing Transformation
NatGen: De-Naturalizing Transformations
De-Naturalized Code → Encoder → Decoder → Natural Code (NatGen)
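One de-naturalizing transformation, sketched in Python (a toy of my own; NatGen applies a suite of semantics-preserving transforms over real parse trees, not regexes): rewrite the idiomatic `x += c` into the equivalent but less natural `x = x + c`; the model is then pre-trained to map the de-naturalized form back to the natural original.

```python
import re

def denaturalize_augmented_assign(line):
    # x += c  →  x = x + c   (semantics preserved, "naturalness" reduced)
    return re.sub(r"\b(\w+)\s*\+=\s*(\w+)", r"\1 = \1 + \2", line)

natural = "count += 1"
denatured = denaturalize_augmented_assign(natural)
print(denatured)                      # count = count + 1
training_pair = (denatured, natural)  # (input, target) for pre-training
```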
NatGen: Some Initial Results – Few-Shot Learning
Implicit Encoding – Takeaways
Pros:
• Little overhead
• Unsupervised
• Transferable
Cons:
• No guarantee
• Potential bias from mutation
My Research Applications
Programming Language Processing
• Code Summarization: NeuralCodeSum (ACL’20), PLBART (NAACL’21)
• Vulnerability Detection: ReVeal (TSE’21), DISCO (ACL’22)
• Code Editing: CODIT (TSE’20), MODIT (ASE’21), DiffBERT (Facebook)
What’s Next? (Short-Term Goal)
➢ Representing Code Context as a Dynamic Graph
What’s Next? (Long-Term Goal)
➢ Code Generation
Baishakhi Ray (Adviser)
Collaborators: Wasi Ahmad (UCLA / Amazon), Miltos Allamanis (MSR), Kai-Wei Chang (UCLA), Yanju Chen (UCSB), Prem Devanbu (UC Davis)