This document summarizes Saikat Chakraborty's research interests, which include code summarization, code editing/repair, bug detection, code search, and code generation/synthesis using techniques from program analysis and machine learning. It gives examples of identifying bugs with automated tools and of automatically editing/repairing programs by replacing code snippets based on learned patterns or specifications. It discusses the challenges of understanding and generating source code, and how programming-language knowledge can be explicitly encoded in models through techniques like reachability analysis and context-free grammars to help with tasks such as code editing and vulnerability detection.


Programming Language Processing:
Machine Learning for Source Code Understanding and Generation

Saikat Chakraborty
Ph.D. Candidate
Columbia University, New York
https://thenewstack.io/how-much-time-do-developers-spend-actually-writing-code/
2
My Research Interests

- Code Summarization
- Code Editing / Repair
- Bug Detection
- Code Search
- Code Generation / Synthesis
- Other Source Code Analysis

3
Identify Bugs

x = parameter(free)
y = trace_in_dataflow(x)
if x or y is modified:
    WARNING
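A minimal sketch of how such a rule-based checker might flag a use-after-free, assuming code has been abstracted into an ordered list of (operation, variable) events (the event format and function name are illustrative, not from any of the tools cited below):

    def check_use_after_free(events):
        # events: ordered (op, var) pairs, e.g. ("free", "x") or ("use", "x").
        freed = set()
        warnings = []
        for i, (op, var) in enumerate(events):
            if op == "free":
                freed.add(var)
            elif op == "use" and var in freed:
                warnings.append(f"event {i}: use of {var} after free")
        return warnings

    print(check_use_after_free([("free", "x"), ("use", "x")]))
    # -> ['event 1: use of x after free']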

[e.g., Alloy model checking, KLEE, SAGE, CUTE, AFL fuzzers, Java PathFinder, LLVM passes]
4
Automated Edit / Program Repair

l = location(gets)
v = parameters(gets)
s = memory_size(v)
new_stmt = "fgets(v, s, stdin)"
update l with new_stmt
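A minimal sketch of this rewrite rule as a textual transformation, assuming the C source is available as a string and buffer sizes are known (the function name and size map are illustrative; real repair tools operate on parse trees, not regexes):

    import re

    def repair_gets(source, buffer_sizes):
        # Rewrite each gets(v) as fgets(v, s, stdin), where s is the
        # recorded memory size of the buffer v.
        def rewrite(match):
            v = match.group(1).strip()
            s = buffer_sizes.get(v, f"sizeof({v})")
            return f"fgets({v}, {s}, stdin)"
        return re.sub(r"gets\s*\(([^)]+)\)", rewrite, source)

    print(repair_gets("gets(buf);", {"buf": "64"}))
    # -> fgets(buf, 64, stdin);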

[1] SimFix, Jiang et al. 2018
[2] Elixir, Saha et al. 2017
5
Pattern/Rule Based Automation

Too many different patterns [4]
Low tolerance for deviation [3]

l = location(gets)
v = parameters(gets)
s = memory_size(v)
new_stmt = "fgets(v, s, stdin)"
update l with new_stmt

x = parameter(free)
y = trace_in_dataflow(x)
if x or y is modified:
    WARNING

[3] Johnson et al. 2013
[4] Saha et al. 2017
6
Data-Driven Automation

x = parameter(free)
y = trace_in_dataflow(x)
if x or y is modified:
    WARNING

l = location(gets)
v = parameters(gets)
s = memory_size(v)
new_stmt = "fgets(v, s, stdin)"
update l with new_stmt

7
Data-Driven Automation – Specification Mining

x = parameter(free)
y = trace_in_dataflow(x)
if x or y is modified:
    WARNING

l = location(gets)
v = parameters(gets)
s = memory_size(v)
new_stmt = "fgets(v, s, stdin)"
update l with new_stmt
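A minimal sketch of the data-driven idea, assuming a corpus of (before, after) edit pairs: count which API-call replacements recur, so frequent pairs like gets -> fgets emerge as candidate repair specifications (the data format and function names are illustrative, not from the works cited below):

    import re
    from collections import Counter

    def mine_call_replacements(edit_pairs):
        # Count (removed_call, added_call) pairs across many human edits;
        # frequent pairs become candidate repair specifications.
        patterns = Counter()
        for before, after in edit_pairs:
            calls_before = set(re.findall(r"(\w+)\s*\(", before))
            calls_after = set(re.findall(r"(\w+)\s*\(", after))
            for removed in calls_before - calls_after:
                for added in calls_after - calls_before:
                    patterns[(removed, added)] += 1
        return patterns.most_common()

    pairs = [("gets(buf);", "fgets(buf, 64, stdin);")] * 3
    print(mine_call_replacements(pairs))
    # -> [(('gets', 'fgets'), 3)]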

[5] Pradel et al. 2012
[6] Le et al. 2018
[7] Meng et al. 2017
8
Data-Driven Automation – Automated Analyzer

10
Artificial Intelligence for Software Engineering

Curate and Preprocess Data → Design Model → Training → AI-based Automation Tool
11
Challenges of AI for SE

Understanding Source Code: understanding the structure and functionality of source code.
Generating Source Code: ensuring the syntactic and semantic correctness of generated source code.

12
Programming Language Processing

Program Analysis + Machine Learning

13
Code Understanding Task - Example

Understand Source Code

Vulnerability Detection

14
Code Generation Tasks - Example

Generate Source Code:
- Code Translation
- Code Editing / Program Repair

15
Understanding Source Code - Requirement
- Syntax Structure
- Control Dependency
- Data Dependency

16
Generating Source Code - Requirement

1. Syntactic correctness
2. Semantic correctness

(Examples: syntactically incorrect code; semantically incorrect code)
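Illustrative Python examples of the two failure modes (my examples, not the ones pictured on the slide):

    # Syntactically incorrect: does not parse (missing ":").
    #     def add(a, b) return a + b
    #
    # Semantically incorrect: parses and runs, but computes the wrong result.
    def add(a, b):
        return a - b  # intended: a + b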

17
Programming Language Processing
Encoding both the syntax and the semantics of code.

18
Encoding Syntax and Semantics
Ways of encoding the syntax and semantics of code:

Code Modeling
    Explicit Encoding
    Implicit Encoding
        Natural Channel Mutation
        Formal Channel Mutation

19
Explicit Encoding

Code Modeling → Explicit Encoding
(vs. Implicit Encoding via natural- or formal-channel mutation)

[8] Learning to Represent Programs with Graphs, Allamanis et al. 2017
[9] Learning to Represent Edits, Yin et al. 2019
[10] HOPPITY, Dinella et al. 2020
20
CODIT: Code Editing with Tree-Based Neural Models (TSE'20)

Explicit Encoding of Code Structure for Code Edit

21
Automated Code Editing

Example Code Edits

Code Before Edit Code After Edit

22
CODIT: Code Editing With Tree Based Models
Code Before Edit Code After Edit

23
CODIT Step 1: Tree Translation

Translate the rule sequence of the syntax tree before the edit into the rule sequence of the syntax tree after the edit.
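A minimal sketch of viewing a syntax tree as a sequence of grammar-rule expansions (a toy grammar and tree encoding; CODIT uses the target language's full context-free grammar):

    def tree_to_rules(node):
        # node: (label, children); a leaf has an empty children list.
        # Pre-order traversal turns a tree into the rule sequence deriving it.
        label, children = node
        if not children:
            return []
        rules = [f"{label} -> {' '.join(c[0] for c in children)}"]
        for child in children:
            rules.extend(tree_to_rules(child))
        return rules

    tree = ("expr", [("expr", []), ("+", []), ("term", [])])
    print(tree_to_rules(tree))  # -> ['expr -> expr + term']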

26
CODIT Step 2: Token Translation

29
CODIT Step 2: Token Translation

Reachable variables at the edit location: {inst, object, tmp}

Reachability analysis at the edit location determines which identifiers the generated tokens may use.
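A minimal sketch of that constraint, assuming the decoder exposes per-token scores over a vocabulary (the interface is illustrative; CODIT integrates scoping constraints into its decoding):

    def pick_token(scores, vocab, reachable_vars):
        # Choose the highest-scoring token, but only allow identifiers
        # that are reachable (in scope) at the edit location.
        ranked = sorted(range(len(vocab)), key=lambda i: -scores[i])
        for i in ranked:
            token = vocab[i]
            if not token.isidentifier() or token in reachable_vars:
                return token
        return vocab[ranked[0]]  # fall back to the raw best token

    vocab = ["inst", "object", "tmp", "foo", "(", ")"]
    print(pick_token([0.1, 0.2, 0.3, 0.9, 0.05, 0.0], vocab,
                     {"inst", "object", "tmp"}))
    # -> "tmp": "foo" scores highest but is not reachable at the edit site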


30
CODIT: Code Editing With Tree Based Models

Data Set                                  # Projects   # Edit Examples   # Tokens (Max/Avg)   # Nodes (Max/Avg)
Generic Code Edits from GitHub            48           32,473            38 / 15              47 / 20
Pull Request Edits [Tufano et al. 2019]   3            5,546             34 / 17              47 / 23

31
CODIT: Code Editing With Tree Based Models

Category    Method          Generic Code Edits   Pull Request Edits
Sequence    LSTM-Seq2Seq    3.77%                11.26%
Sequence    Tufano et al.   6.57%                23.65%
Sequence    SequenceR       9.76%                26.43%
Tree        Tree2Seq        11.04%               23.49%
Tree        CODIT           15.94%               28.87%

32
CODIT – Results

CODIT fixes 15 bugs completely and 10 bugs partially, out of 80 bugs in Defects4J.

Example: Closure Compiler, Bug 3.

33
Explicit Encoding of PL Knowledge into the Model

Reachability Analysis
Context-Free Grammar

34
ReVeal (TSE’21)
Explicit Encoding of Code Structure for Vulnerability
Detection

35
ReVeal

Training Data → Code Property Graph → Node Features → GGNN → Graph Embedding → Trained ReVeal
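A minimal sketch of one gated graph neural network (GGNN) message-passing step over a code property graph, assuming dense adjacency matrices per edge type; the shapes and the simplified gate are illustrative, not ReVeal's exact implementation:

    import numpy as np

    def ggnn_step(h, adj_by_type, w_by_type):
        # h: (nodes, dim) node states; one adjacency and one weight matrix
        # per edge type (e.g., AST, control-flow, and data-flow edges).
        m = sum(A @ h @ W for A, W in zip(adj_by_type, w_by_type))
        z = 1.0 / (1.0 + np.exp(-(m + h)))   # simplified update gate
        return z * np.tanh(m) + (1.0 - z) * h

    rng = np.random.default_rng(0)
    n, d = 4, 8
    h = rng.normal(size=(n, d))
    h = ggnn_step(h, [np.eye(n)] * 3, [np.eye(d)] * 3)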
36
ReVeal - Results

[Chart: ReVeal's F1 score compared with token-based and graph-based [10] baselines, on Chromium & Debian and on FFmpeg & QEMU]

[10] Devign, Zhou et al. 2019
37
Explicit Encoding of PL Knowledge into the Model

Code Property Graph

38
Explicit Encoding – Takeaways

Pros:
- Precision
- Guarantees
- Explainable

Cons:
- Model design overhead
- Transferability

39
Implicit Encoding

Code Modeling → Implicit Encoding: train on a large data corpus of mutated inputs, learning to de-mutate them.

[11] CodeBERT, Feng et al. 2020
[12] GraphCodeBERT, Guo et al. 2020
[13] Codex, Chen et al. 2021
[14] CodeT5, Wang et al. 2021
40
Dual Channel Hypothesis for Source Code [15]

Natural Channel [16]: identifiers, comments, and conventions that humans read.
Formal Channel: the syntax and semantics that machines parse and execute.

[15] Casalnuovo et al. 2020
[16] Karampatsis et al. 2020
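A small illustrative example (mine, not from the slides): these two functions are identical in the formal channel, same structure and semantics, but differ in the natural channel:

    def average(values):
        # Natural channel carries intent: the names say what this does.
        return sum(values) / len(values)

    def f(a):
        # Formally identical computation; the natural channel is opaque.
        return sum(a) / len(a)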
41
Implicit Encoding

Code Modeling → Implicit Encoding → Natural Channel Mutation
(a mutator corrupts the natural channel; the model learns by undoing the mutation)

42
PLBART: Unified Pre-training for Program Understanding and Generation (NAACL'21)

Implicit Encoding by mutating the natural channel

43
PLBART – What Is It?

Encoder → Decoder

Pretraining: an unsupervised way of learning code patterns.

44
PLBART – Components

Encoder:
1. Reads code.
2. Understands code.
3. Reasons about errors in the code.
4. Learns a robust representation.

Decoder:
1. Generates code.
2. Learns coding patterns.

45
PLBART – Pretraining

Noise Injector: correct code → noisy code

PLBART: noisy code → Encoder → Decoder → reconstructed correct code

Noise types: Token Masking, Token Deletion, Token Infilling
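A minimal sketch of the three noise functions, operating on a token list (simplified; PLBART's actual noising follows the BART recipe with sentinel tokens and sampled span lengths):

    import random

    def token_masking(tokens, p=0.15):
        # Replace random tokens with a mask symbol.
        return [t if random.random() > p else "<MASK>" for t in tokens]

    def token_deletion(tokens, p=0.15):
        # Drop random tokens entirely.
        return [t for t in tokens if random.random() > p]

    def token_infilling(tokens, span=3):
        # Replace a whole span of tokens with a single mask symbol.
        i = random.randrange(max(1, len(tokens) - span))
        return tokens[:i] + ["<MASK>"] + tokens[i + span:]

    code = "public int add ( int a , int b ) { return a + b ; }".split()
    noisy = token_infilling(code)  # the model is trained to reconstruct `code`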

46
PLBART – Jointly Learning to Understand and Generate
Token Masking

PLBART
47
PLBART – Jointly Learning to Understand and Generate
Token Deletion

PLBART
48
PLBART – Jointly Learning to Understand and Generate
Token Infilling

PLBART
49
PLBART – Pretraining

• Noise Properties
  • Mutates the natural channel
  • Likely to break syntax

• Training Objectives
  • Generate the whole code
  • Learn syntax implicitly
  • Learn coding patterns

• Multi-lingual Training
  • Java
  • Python
  • NL from Stack Overflow

50
PLBART – Experiment (Applications)

51
PLBART – Results – Code Generation from NL

Dataset: CONCODE
Metric: EM (exact match %), BLEU-4 (max 100), CodeBLEU (max 100)

CodeBLEU =
0.25 * token_match +
0.25 * keywords_match +
0.25 * syntax_match +
0.25 * dataflow_match
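A minimal sketch of the CodeBLEU combination above, assuming the four component scores (each in [0, 1]) have already been computed; the official CodeXGLUE scripts derive the components from n-grams, AST subtrees, and data-flow edges:

    def code_bleu(token_match, keywords_match, syntax_match, dataflow_match,
                  weights=(0.25, 0.25, 0.25, 0.25)):
        # Weighted sum of the four component scores.
        components = (token_match, keywords_match, syntax_match, dataflow_match)
        return sum(w * c for w, c in zip(weights, components))

    print(code_bleu(0.80, 0.75, 0.90, 0.60))  # -> 0.7625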

52
PLBART – Search-Augmented Code Generation (EMNLP'21)

✓ Up to 21% CodeBLEU improvement over GraphCodeBERT.
✓ Up to 12% improvement due to search augmentation.

53
PLBART – Multi-Modal Code Edit (ASE'21)

✓ Inputs: the edit location (the code to be edited), its context, and a summary of the edit.
✓ Up to 34% improvement over CodeBERT, and 30% improvement over CodeGPT.
✓ Up to 16% improvement from multi-modality.


54
PLBART – Results – Code Translation

Dataset: CodeXGLUE [Lu et al. 2021]
Metric: EM (exact match %), BLEU-4 (max 100), CodeBLEU (max 100)

CodeBLEU =
0.25 * token_match +
0.25 * keywords_match +
0.25 * syntax_match +
0.25 * dataflow_match

55
An interesting example of Code Translation

Input C# code vs. PLBART-generated (better) code

56
Implicit Encoding

Code Modeling → Implicit Encoding → Formal Channel Mutation
(the mutator now corrupts the formal channel)

59
NatGen:

Implicit Encoding by mutating the formal channel

60
NatGen: Generative Pre-training by "Naturalizing" Source Code

Write semantically / functionally equivalent code in a "more natural" way.

61
NatGen: Generative pre-training by
“Naturalizing” source code

De-Naturalizing Transformation

62
NatGen: De-Naturalizing Transformations

- Confusing statements [17]
- Dead code insertion
- Operand swap

[17] Gopstein et al. 2020
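A minimal sketch of two de-naturalizing transformations at the string level (illustrative only; NatGen applies such transformations on parse trees, so the rewrites are semantics-preserving by construction):

    def operand_swap(condition):
        # "i < n" -> "n > i": same semantics, a less natural form.
        left, op, right = condition.split()
        flipped = {"<": ">", ">": "<", "<=": ">=", ">=": "<=", "==": "=="}
        return f"{right} {flipped[op]} {left}"

    def insert_dead_code(block):
        # Prepend a branch that can never execute.
        return "if (false) { int unused = 0; }\n" + block

    print(operand_swap("i < n"))          # -> "n > i"
    print(insert_dead_code("sum += x;"))  # NatGen learns to undo such edits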


63
NatGen: Generative Pre-training by "Naturalizing" Source Code

Encoder → Decoder

NatGen

64
NatGen: Some Initial Results – Few Shot Learning

65
Implicit Encoding – Takeaways

Pros:
- Little overhead
- Unsupervised
- Transferable

Cons:
- No guarantees
- Potential bias from mutation

66
My Research Applications
Programming Language Processing (PLP)

- Code Summarization: NeuralCodeSum (ACL'20), PLBART (NAACL'21)
- Vulnerability Detection: ReVeal (TSE'21), DISCO (ACL'22)
- Code Editing: CODIT (TSE'20), MODIT (ASE'21), DiffBERT (Facebook)
- Code Generation and Translation: PLBART (NAACL'21), SNG-PLBART (NAACL'22, under review), DataTypeLM ForCode (ACL'18)
- Code Search and Synthesis: RedCoder (EMNLP'21), CodePanda (W.I.P.)
67
What’s Next? (Short Term Goal)
➢ API driven Program Synthesis

➢ Improving Semantic Code Search with RL


Synthesizer

68
What’s Next? (Short Term Goal)
➢ Representing Code Context as Dynamic
Graph

➢ Learning Code Syntax and Semantics with


Reinforcement Learning (RL)

➢ Other Information Modalities


1. Other Software Metadata – Comments, Commits, Code Evolution metadata.
2. Analysis of Program Binaries.
3. Dynamic Analysis of Program – based on execution behavior.

69
What’s Next? (Long Term Goal)
➢ Code Generation

Formal Analysis Probabilistic models


Guarantee for the Analysis Scalable and Transferrable
Noise Intolerant No theoretical Guarantee

➢ Developer Feedback Oriented Automation

70
Baishakhi Ray
Adviser

Wasi Ahmad (UCLA / Amazon), Miltos Allamanis (MSR), Kai-Wei Chang (UCLA), Yanju Chen (UCSB), Prem Devanbu (UC Davis)

Yangruibo Ding (Columbia), Yu Feng (UCSB), Rahul Krishna (Columbia / IBM), Toufiq Parag (UC Davis), Rizwan Parvez (UCLA)

71