Programming Language Processing:
Machine Learning for Source Code Understanding and Generation
Saikat Chakraborty
Ph.D. Candidate
Columbia University, New York
Source: https://thenewstack.io/how-much-time-do-developers-spend-actually-writing-code/
My Research Interests
• Code Summarization
• Code Editing / Repair
• Bug Detection
• Code Search
• Code Generation / Synthesis
• Other Source Code Analysis
Identify Bugs
x = parameter(free)
y = trace_in_dataflow(x)
if x or y is modified: WARNING!!
[e.g., Alloy model checking, KLEE, SAGE, CUTE, AFL fuzzers, Java PathFinder, LLVM passes]
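The recipe above amounts to a use-after-free check. As a toy illustration only, here is a minimal Python sketch over a hypothetical tuple-based IR; the IR format and helper names are invented for this example, and real analyzers work on far richer program representations.

import warnings  # not required; plain prints below

# Minimal use-after-free checker over a toy straight-line IR (hypothetical).
def find_use_after_free(instructions):
    """instructions: list of (op, args) tuples, e.g. ('free', ('p',))."""
    freed = set()                        # x = parameter(free)
    findings = []
    for idx, (op, args) in enumerate(instructions):
        if op == 'free':
            freed.add(args[0])
        elif op == 'assign':
            dst, src = args
            if src in freed:
                freed.add(dst)           # y = trace_in_dataflow(x): follow aliases
            else:
                freed.discard(dst)       # dst now holds a fresh value
        elif op == 'use':
            if args[0] in freed:         # if x or y is used after free: WARNING!!
                findings.append((idx, args[0]))
    return findings

prog = [('free', ('p',)), ('assign', ('q', 'p')), ('use', ('q',))]
print(find_use_after_free(prog))         # [(2, 'q')]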
Automated Edit / Program Repair
l = location(gets)
v = parameters(gets)
s = memory_size(v)
new_stmt = "fgets(v, s, stdin)"
Update l with new_stmt
[1] SimFix – Jiang et al., 2018
[2] ELIXIR – Saha et al., 2017
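As a toy illustration of this repair pattern, a Python sketch that performs the gets -> fgets rewrite textually; the buffer-size table is a hypothetical stand-in for the memory_size(v) lookup, which real tools recover from declarations or memory models.

import re

buffer_sizes = {'buf': 'sizeof(buf)'}                 # stand-in for s = memory_size(v)

def repair_gets(line):
    m = re.search(r'\bgets\s*\(\s*(\w+)\s*\)', line)  # l = location(gets)
    if not m:
        return line                                   # pattern does not apply
    v = m.group(1)                                    # v = parameters(gets)
    s = buffer_sizes.get(v, f'sizeof({v})')           # s = memory_size(v)
    return line[:m.start()] + f'fgets({v}, {s}, stdin)' + line[m.end():]

print(repair_gets('    gets(buf);'))                  # fgets(buf, sizeof(buf), stdin);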
Pattern/Rule Based Automation

Limitations of pattern/rule-based automation:
• Too many different patterns
• Low tolerance for deviation

Bug-detection rule [3]:
x = parameter(free)
y = trace_in_dataflow(x)
if x or y is modified: WARNING!!

Repair pattern [4]:
l = location(gets)
v = parameters(gets)
s = memory_size(v)
new_stmt = "fgets(v, s, stdin)"
Update l with new_stmt

[3] Johnson et al., 2013
[4] Saha et al., 2017
Data-Driven Automation

x = parameter(free)
y = trace_in_dataflow(x)
if x or y is modified: WARNING!!

l = location(gets)
v = parameters(gets)
s = memory_size(v)
new_stmt = "fgets(v, s, stdin)"
Update l with new_stmt
Data-Driven Automation – Specification Mining

x = parameter(free)
y = trace_in_dataflow(x)
if x or y is modified: WARNING!!

l = location(gets)
v = parameters(gets)
s = memory_size(v)
new_stmt = "fgets(v, s, stdin)"
Update l with new_stmt

[5] Pradel et al., 2012
[6] Le et al., 2018
[7] Meng et al., 2017
Data-Driven Automation – Automated Analyzer
Artificial Intelligence for Software Engineering
Curate and Preprocess Data → Design Model → Training → AI-based Automation Tool
Challenges of AI for SE
• Understanding Source Code – understanding the structure and functionality of source code.
• Generating Source Code – ensuring the syntactic and semantic correctness of generated source code.
Programming Language Processing
Program Analysis + Machine Learning
Code Understanding Task – Example
Understand source code: Vulnerability Detection
Code Generation Tasks – Example
Generate source code: Code Translation, Code Editing / Program Repair
Understanding Source Code – Requirements
• Syntax Structure
• Control Dependency
• Data Dependency
(An annotated example follows this list.)
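As a toy illustration (invented for this write-up, not taken from the talk), the fragment below marks where each requirement surfaces:

def running_sum(n):
    total = 0              # defines `total`: a data-flow source
    if n > 0:              # control dependency: the loop runs only when n > 0
        for i in range(n):
            total += i     # data dependency: reads and writes `total`
    return total           # data dependency on `total`; the whole body must also parse (syntax structure)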
Generating Source Code – Requirements
1. Syntactic correctness
2. Semantic correctness
(Toy examples of each failure mode follow.)
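A minimal illustration of the two failure modes, with snippets invented for this example:

import ast

syntactically_incorrect = "def add(a, b) return a + b"   # missing ':' -> the parser rejects it

def add(a, b):            # parses fine, but semantically incorrect:
    return a - b          # it subtracts instead of adding

try:
    ast.parse(syntactically_incorrect)
except SyntaxError as e:
    print("syntax error:", e.msg)
print("add(2, 2) =", add(2, 2))   # 0, not the intended 4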
Programming Language Processing
Encoding Syntax and Semantics of Source Code
Encoding Syntax and Semantics
Code Modeling:
• Explicit Encoding
• Implicit Encoding

Ways of Encoding:
• Natural Channel Mutation
• Formal Channel Mutation
Explicit Encoding
(Code modeling map: Explicit Encoding highlighted.)

[8] Learning to Represent Programs with Graphs – Allamanis et al., 2017
[9] Learning to Represent Edits – Yin et al., 2019
[10] HOPPITY – Dinella et al., 2020
CODIT: Code Editing with Tree-Based Neural Models (TSE'20)
Explicit encoding of code structure for code editing
Automated Code Editing
[Figure: example code edits – code before edit vs. code after edit]
CODIT: Code Editing with Tree-Based Models
CODIT Step 1: Tree Translation
Translate the rule sequence of the syntax tree before the edit into the rule sequence of the syntax tree after the edit.
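A minimal sketch of the representation this stage translates: a syntax tree serialized as a preorder sequence of grammar productions. CODIT itself operates on Java parse trees; using Python's ast module here is purely illustrative.

import ast

def rule_sequence(node):
    # preorder serialization: one production per internal node
    children = list(ast.iter_child_nodes(node))
    if not children:
        return []
    rules = [f"{type(node).__name__} -> {' '.join(type(c).__name__ for c in children)}"]
    for child in children:
        rules.extend(rule_sequence(child))
    return rules

for rule in rule_sequence(ast.parse("x = a + b")):
    print(rule)
# Module -> Assign
# Assign -> Name BinOp
# Name -> Store
# BinOp -> Name Add Name
# ...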
CODIT Step 2: Token Translation
Concretize tokens using reachability analysis at the edit location.
Reachable variables at the edit location: {inst, object, tmp}
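A minimal sketch of this scoping step. The line-based notion of "defined before the edit location" is a deliberate simplification invented for this example; CODIT performs the analysis on Java scopes.

import ast

def reachable_variables(source, edit_line):
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            if node.lineno <= edit_line:     # defined at or before the edit location
                names.add(node.id)
    return names

code = "inst = make()\nobject = inst.load()\ntmp = 0\nprint(tmp)\n"
print(reachable_variables(code, edit_line=4))   # {'inst', 'object', 'tmp'}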
CODIT: Code Editing with Tree-Based Models

Data Set                                  | # of Projects | # of Edit Examples | # of Tokens (Max / Avg) | # of Nodes (Max / Avg)
Generic Code Edits from GitHub            | 48            | 32,473             | 38 / 15                 | 47 / 20
Pull Request Edits [Tufano et al., 2019]  | 3             | 5,546              | 34 / 17                 | 47 / 23
CODIT: Code Editing with Tree-Based Models

Category        | Method         | Generic Code Edits | Pull Request Edits
Sequence-Based  | LSTM-Seq2Seq   | 3.77%              | 11.26%
Sequence-Based  | Tufano et al.  | 6.57%              | 23.65%
Sequence-Based  | SequenceR      | 9.76%              | 26.43%
Tree-Based      | Tree2Seq       | 11.04%             | 23.49%
Tree-Based      | CODIT          | 15.94%             | 28.87%
CODIT – Results
CODIT fixes 15 bugs completely and 10 bugs partially out of 80 bugs in Defects4J.
[Example: Closure Compiler, Bug #3]
Explicitly Encoding PL Knowledge into the Model
• Reachability Analysis
• Context-Free Grammar
ReVeal (TSE'21)
Explicit encoding of code structure for vulnerability detection
ReVeal
Pipeline: Training Data → Code Property Graph (node features) → GGNN → Graph Embedding → Trained ReVeal model
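A minimal sketch of one gated message-passing round over a code property graph, in the spirit of the GGNN ReVeal uses. The toy graph, random weights, and the simplified gate (standing in for a full GRU cell) are assumptions for illustration, not the paper's implementation.

import numpy as np

rng = np.random.default_rng(0)
num_nodes, dim = 4, 8
H = rng.normal(size=(num_nodes, dim))        # initial node features from the CPG
edges = [(0, 1), (1, 2), (1, 3)]             # AST/CFG/DFG edges (hypothetical)
W_msg = rng.normal(size=(dim, dim))
W_gate = rng.normal(size=(2 * dim, dim))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ggnn_step(H):
    M = np.zeros_like(H)
    for src, dst in edges:                   # each node sums messages from neighbors
        M[dst] += H[src] @ W_msg
    z = sigmoid(np.concatenate([H, M], axis=1) @ W_gate)   # update gate
    return z * np.tanh(M) + (1 - z) * H      # gated state update

H = ggnn_step(H)                             # one of several propagation rounds
graph_embedding = H.sum(axis=0)              # pool node states into a graph embedding
print(graph_embedding.shape)                 # (8,)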
ReVeal – Results
[Figure: F1 scores of ReVeal vs. token-based and graph-based [10] baselines, on Chromium & Debian and on FFmpeg & Qemu]
[10] Devign – Zhou et al., 2019
Explicitly Encoding PL Knowledge into the Model
• Code Property Graph
Explicit Encoding – Takeaways

Pros:
• Precision
• Guarantees
• Explainable

Cons:
• Model design overhead
• Limited transferability
Implicit Encoding

Train on a large data corpus: Mutated Input → De-Mutation

[11] CodeBERT – Feng et al., 2020
[12] GraphCodeBERT – Guo et al., 2020
[13] Codex – Chen et al., 2021
[14] CodeT5 – Wang et al., 2021
Dual Channel Hypothesis for Source Code [15]
Natural Channel [16] | Formal Channel

[15] Casalnuovo et al., 2020
[16] Karampatsis et al., 2020
Implicit Encoding
(Code modeling map: Implicit Encoding via a mutator on the natural channel highlighted.)
PLBART: Unified Pre-training for Program Understanding and Generation (NAACL'21)
Implicit encoding by mutating the natural channel
PLBART – What Is It?
An encoder–decoder model.
Pretraining – an unsupervised way of learning code patterns.
PLBART – Components

Encoder:
1. Reads code.
2. Understands code.
3. Reasons about errors in the code.
4. Learns robust representations.

Decoder:
1. Generates code.
2. Learns coding patterns.
PLBART – Pretraining
Correct Code → Noise Injector → Noisy Code → Encoder → Decoder → Correct Code (reconstructed)
Noising operations: Token Masking, Token Deletion, Token Infilling
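A minimal sketch of the three noising operations on a token list. The <mask> symbol and the 35% noise ratio follow BART-style denoising; the exact values here are illustrative assumptions rather than PLBART's released configuration.

import random

rng = random.Random(0)

def token_masking(tokens, p=0.35):
    return ['<mask>' if rng.random() < p else t for t in tokens]

def token_deletion(tokens, p=0.35):
    return [t for t in tokens if rng.random() >= p]

def token_infilling(tokens, span=2):
    i = rng.randrange(len(tokens) - span)
    return tokens[:i] + ['<mask>'] + tokens[i + span:]   # one mask replaces a whole span

code = 'public static int add ( int a , int b ) { return a + b ; }'.split()
print(token_masking(code))
print(token_deletion(code))
print(token_infilling(code))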
PLBART – Jointly Learning to Understand and Generate
[Figures: denoising with Token Masking, Token Deletion, and Token Infilling]
PLBART – Pretraining

• Noise properties:
  • Mutates the natural channel
  • Likely to break syntax

• Training objectives:
  • Generate the whole original code
  • Learn syntax implicitly
  • Learn coding patterns

• Multi-lingual training:
  • Java
  • Python
  • NL from Stack Overflow
PLBART – Experiments (Applications)
PLBART – Results – Code Generation from NL
Dataset: CONCODE
Metrics: EM (exact match %), BLEU-4 (max 100), CodeBLEU (max 100)

CodeBLEU = 0.25 * token_match + 0.25 * keywords_match + 0.25 * syntax_match + 0.25 * dataflow_match
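Restated as code, the combination above is a uniform weighted average. The component scores are assumed to be precomputed values in [0, 1]; real implementations (e.g., the CodeXGLUE release) derive them from n-grams, keywords, ASTs, and data flow.

def code_bleu(token_match, keywords_match, syntax_match, dataflow_match):
    return 0.25 * token_match + 0.25 * keywords_match \
         + 0.25 * syntax_match + 0.25 * dataflow_match

print(code_bleu(0.80, 0.90, 0.75, 0.60))   # 0.7625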
PLBART – Search-Augmented Code Generation (EMNLP'21)
✓ Up to 21% CodeBLEU improvement over GraphCodeBERT.
✓ Up to 12% improvement due to search augmentation.
PLBART – Multi-Modal Code Editing (ASE'21)
✓ Inputs: the edit location (code to be edited), its context, and a summary of the edit.
✓ Up to 34% improvement over CodeBERT and 30% improvement over CodeGPT.
✓ Up to 16% improvement from multi-modality.
PLBART – Results – Code Translation
Dataset: CodeXGLUE [Lu et al., 2021]
Metrics: EM (exact match %), BLEU-4 (max 100), CodeBLEU (max 100, as defined above)
An Interesting Example of Code Translation
[Figure: input C# code vs. the better code PLBART generates]
Implicit Encoding
(Code modeling map: Implicit Encoding via a mutator on the formal channel highlighted.)
NatGen: Implicit Encoding by Mutating the Formal Channel
NatGen: Generative Pre-training by "Naturalizing" Source Code
Write semantically / functionally equivalent code in a "more natural" way.
[Figure: a de-naturalizing transformation applied to source code]
NatGen: De-Naturalizing Transformations
• Confusing Statements [17]
• Dead Code Insertion
• Operand Swap
(A sketch of the last two follows this list.)

[17] Gopstein et al., 2020
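A minimal sketch of two of these transformations, written against Python's ast module purely for illustration; NatGen itself applies richer, multi-language versions of these rewrites.

import ast

class OperandSwap(ast.NodeTransformer):
    # a < b  ->  b > a : semantics-preserving, but less "natural"
    def visit_Compare(self, node):
        self.generic_visit(node)
        if len(node.ops) == 1 and isinstance(node.ops[0], ast.Lt):
            return ast.Compare(left=node.comparators[0],
                               ops=[ast.Gt()], comparators=[node.left])
        return node

def insert_dead_code(tree):
    # prepend an unreachable branch to every function body
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            node.body.insert(0, ast.parse("if False:\n    pass").body[0])
    return tree

tree = ast.parse("def f(a, b):\n    if a < b:\n        return a\n    return b")
tree = insert_dead_code(OperandSwap().visit(tree))
print(ast.unparse(ast.fix_missing_locations(tree)))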
NatGen: Generative Pre-training by "Naturalizing" Source Code
[Figure: NatGen's encoder–decoder architecture]
NatGen: Some Initial Results – Few-Shot Learning
Implicit Encoding – Takeaways

Pros:
• Little overhead
• Unsupervised
• Transferable

Cons:
• No guarantees
• Potential bias from mutation
My Research Applications

Programming Language Processing (PLP):
• Code Summarization – NeuralCodeSum (ACL'20), PLBART (NAACL'21)
• Vulnerability Detection – ReVeal (TSE'21), DISCO (ACL'22)
• Code Editing – CODIT (TSE'20), MODIT (ASE'21), DiffBERT (Facebook)
• Code Generation and Translation – PLBART (NAACL'21), SNG-PLBART (NAACL'22, under review), DataTypeLM ForCode (ACL'18)
• Code Search and Synthesis – RedCoder (EMNLP'21), CodePanda (W.I.P.)
What’s Next? (Short Term Goal)
➢ API driven Program Synthesis
➢ Improving Semantic Code Search with RL
Synthesizer
68
What’s Next? (Short Term Goal)
➢ Representing Code Context as Dynamic
Graph
➢ Learning Code Syntax and Semantics with
Reinforcement Learning (RL)
➢ Other Information Modalities
1. Other Software Metadata – Comments, Commits, Code Evolution metadata.
2. Analysis of Program Binaries.
3. Dynamic Analysis of Program – based on execution behavior.
69
What’s Next? (Long Term Goal)
➢ Code Generation
Formal Analysis Probabilistic models
Guarantee for the Analysis Scalable and Transferrable
Noise Intolerant No theoretical Guarantee
➢ Developer Feedback Oriented Automation
70
Baishakhi Ray – Adviser

Wasi Ahmad (UCLA / Amazon), Miltos Allamanis (MSR), Kai-Wei Chang (UCLA), Yanju Chen (UCSB), Prem Devanbu (UC Davis), Yangruibo Ding (Columbia), Yu Feng (UCSB), Rahul Krishna (Columbia / IBM), Toufiq Parag (UC Davis), Rizwan Parvez (UCLA)