CLAWSAT: Towards Both Robust and Accurate Code Models

Jia, Jinghan; Srikant, Shashank; Mitrovska, Tamara; Gan, Chuang; Chang, Shiyu; Liu, Sijia; O'Reilly, Una-May

arXiv:2211.11711 (cs)

[Submitted on 21 Nov 2022 (v1), last revised 3 Mar 2023 (this version, v5)]

Abstract: We integrate contrastive learning (CL) with adversarial learning to co-optimize the robustness and accuracy of code models. Unlike existing work, we show that code obfuscation, a standard code transformation operation, provides a novel means of generating complementary `views' of a piece of code, which enable us to achieve code models that are both robust and accurate. To the best of our knowledge, this is the first systematic study to explore and exploit the robustness and accuracy benefits of (multi-view) code obfuscations in code models. Specifically, we first adopt adversarial codes as robustness-promoting views in CL at the self-supervised pre-training phase. This yields improved robustness and transferability for downstream tasks. Next, at the supervised fine-tuning stage, we show that adversarial training with a proper temporally-staggered schedule of adversarial code generation can further improve the robustness and accuracy of the pre-trained code model. Building on these two modules, we develop CLAWSAT, a novel self-supervised learning (SSL) framework for code that integrates $\underline{\textrm{CL}}$ with $\underline{\textrm{a}}$dversarial vie$\underline{\textrm{w}}$s (CLAW) with $\underline{\textrm{s}}$taggered $\underline{\textrm{a}}$dversarial $\underline{\textrm{t}}$raining (SAT). Evaluating on three downstream tasks across Python and Java, we show that CLAWSAT consistently yields the best robustness and accuracy ($\textit{e.g.}$, 11$\%$ in robustness and 6$\%$ in accuracy on the code summarization task in Python). We additionally demonstrate the effectiveness of adversarial learning in CLAW by analyzing the characteristics of the loss landscape and the interpretability of the pre-trained models.
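As a rough illustration of what an obfuscation-based `view' of a program might look like, the sketch below renames parameters and local variables in a small Python function while leaving its behavior unchanged. The abstract does not say which obfuscations CLAWSAT uses, so identifier renaming is an assumption chosen for familiarity, and `rename_identifiers`, `call_total_price`, and the `total_price` sample are hypothetical names introduced only for this example. It is a toy transformation (it ignores globals, keyword arguments, and shadowed builtins), not the authors' pipeline.

```python
import ast

def rename_identifiers(source: str) -> str:
    """Return a semantically equivalent view of `source` with opaque variable names.

    Toy obfuscation (an assumption for illustration): only parameters and
    assigned names are renamed; function names and builtins are left alone.
    Requires Python 3.9+ for ast.unparse.
    """
    tree = ast.parse(source)

    # Pass 1: collect every parameter name and every name that is assigned to.
    targets = []
    for node in ast.walk(tree):
        if isinstance(node, ast.arg):
            name = node.arg
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            name = node.id
        else:
            continue
        if name not in targets:
            targets.append(name)
    mapping = {name: f"v{i}" for i, name in enumerate(targets)}

    # Pass 2: rewrite all occurrences (definitions and uses) consistently.
    for node in ast.walk(tree):
        if isinstance(node, ast.arg):
            node.arg = mapping[node.arg]
        elif isinstance(node, ast.Name) and node.id in mapping:
            node.id = mapping[node.id]
    return ast.unparse(tree)

def call_total_price(source: str, prices, tax):
    """Execute `source` in a fresh namespace and call its total_price function."""
    namespace = {}
    exec(source, namespace)
    return namespace["total_price"](prices, tax)

ORIGINAL = """
def total_price(prices, tax):
    subtotal = 0
    for p in prices:
        subtotal = subtotal + p
    return subtotal * (1 + tax)
"""

# The obfuscated text differs from the original but computes the same result.
OBFUSCATED = rename_identifiers(ORIGINAL)
```

In a contrastive setup, `ORIGINAL` and `OBFUSCATED` would form a positive pair: two surface forms of one program that an encoder is trained to map to nearby representations. Choosing the transformation adversarially, rather than at random as here, is the part the abstract attributes to CLAW.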

Comments:	Accepted by SANER2023 Research Track
Subjects:	Machine Learning (cs.LG); Programming Languages (cs.PL); Software Engineering (cs.SE)
Cite as:	arXiv:2211.11711 [cs.LG]
	(or arXiv:2211.11711v5 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2211.11711

Submission history

From: Jinghan Jia
[v1] Mon, 21 Nov 2022 18:32:50 UTC (7,253 KB)
[v2] Tue, 22 Nov 2022 03:38:36 UTC (7,253 KB)
[v3] Sat, 17 Dec 2022 21:04:13 UTC (7,253 KB)
[v4] Fri, 10 Feb 2023 15:48:16 UTC (8,049 KB)
[v5] Fri, 3 Mar 2023 20:03:49 UTC (7,319 KB)
