Many Natural Language Processing (NLP) tasks, such as sentiment analysis or syntactic parsing, have benefited from the development of word embedding models. In particular, regardless of the training algorithm, the learned embeddings have often been shown to generalize across different NLP tasks. In contrast, despite recent momentum on word embeddings for source code, the literature lacks evidence of their generalizability beyond the example task they were trained for.
In this experience paper, we identify three potential downstream tasks, namely code comment generation, code authorship identification, and code clone detection, to which source code token embedding models can be applied. We empirically assess a recently proposed code token embedding model, namely code2vec's token embeddings. Code2vec was trained on the task of predicting method names, and while the vectors it learns have potential for use on other tasks, this has not been explored in the literature. We therefore fill this gap by focusing on its generalizability to the tasks we have identified. Ultimately, we show that source code token embeddings cannot be readily leveraged for these downstream tasks. Our experiments even show that our attempts to use them do not yield any improvement over less sophisticated methods. We call for more research into effective and general use of code embeddings.
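As a minimal sketch of how token embeddings might be reused downstream, consider clone detection: represent each code fragment as the average of its token vectors and compare fragments by cosine similarity. The tiny 4-dimensional embedding table below is hypothetical (code2vec's released token vectors are much higher-dimensional), but the arithmetic is the same.

```python
import math

# Hypothetical 4-dimensional token embeddings, standing in for a real
# pretrained table such as code2vec's token vectors.
EMBEDDINGS = {
    "for":    [0.9, 0.1, 0.0, 0.2],
    "while":  [0.8, 0.2, 0.1, 0.2],
    "i":      [0.1, 0.9, 0.3, 0.0],
    "j":      [0.1, 0.8, 0.4, 0.1],
    "return": [0.0, 0.1, 0.9, 0.7],
}

def fragment_vector(tokens):
    """Average the embeddings of a fragment's tokens (unknown tokens skipped)."""
    vecs = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    dim = len(vecs[0])
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Two loop-like fragments should score higher than a loop vs. a return.
loop_a = fragment_vector(["for", "i"])
loop_b = fragment_vector(["while", "j"])
ret    = fragment_vector(["return"])
sim_clone    = cosine(loop_a, loop_b)
sim_nonclone = cosine(loop_a, ret)
```

Whether such similarity scores carry task-relevant signal is precisely the question the paper investigates; the experiments suggest that this naive reuse does not beat simpler baselines.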
Tue 12 Nov, 10:40 - 12:20 (Tijuana, Baja California time)
Assessing the Generalizability of code2vec Token Embeddings
Hong Jin Kang (School of Information Systems, Singapore Management University), Tegawendé F. Bissyandé (SnT, University of Luxembourg), David Lo (Singapore Management University)