Continuous Incident Triage for Large-Scale Online Service Systems (ASE 2019 - Research Papers)

Who

Junjie Chen, Xiaoting He, Qingwei Lin, Hongyu Zhang, Dan Hao, Feng Gao, Zhangwei Xu, Yingnong Dang, Dongmei Zhang

Track

ASE 2019 Research Papers

When

Wed 13 Nov 2019 11:40 - 12:00 at Hillcrest - Cloud and Online Services Chair(s): Dan Hao

Abstract

[Experience Paper] In recent years, online service systems have become increasingly popular. Incidents in these systems can cause significant economic loss and customer dissatisfaction. Incident triage, the process of assigning a new incident to the responsible team, is vitally important for quick recovery of the affected service. Our industry experience shows that in practice, incident triage is not conducted only once at the beginning, but is a continuous process in which engineers from different teams discuss an incident intensively among themselves and continuously refine the incident-triage result until the correct assignment is reached. In particular, our empirical study on 8 real online service systems shows that the percentage of incidents that were reassigned ranges from 5.43% to 68.26%, and the number of discussion items before achieving the correct assignment is up to 11.32 on average. To improve the existing incident triage process, in this paper we propose DeepCT, a Deep learning based approach to automated Continuous incident Triage. DeepCT incorporates a novel GRU-based model with an attention mechanism and a revised loss function, which can incrementally learn knowledge from discussions and update incident-triage results. Using DeepCT, the correct incident assignment can be achieved with fewer discussions. We conducted an extensive evaluation of DeepCT on 14 large-scale online service systems in a multinational technology company M. The results show that DeepCT achieves more accurate and efficient incident triage; e.g., the average accuracy of identifying the responsible team precisely ranges from 0.641 to 0.729 as the number of discussion items increases from 1 to 5. Also, DeepCT statistically significantly outperforms the state-of-the-art bug triage approach.

Junjie Chen (Tianjin University, China)
Xiaoting He (Microsoft)
Qingwei Lin (Microsoft Research, China)
Hongyu Zhang (The University of Newcastle, Australia)
Dan Hao (Peking University, China)
Feng Gao (Microsoft)
Zhangwei Xu (Microsoft)
Yingnong Dang (Microsoft Azure, United States)
Dongmei Zhang (Microsoft Research, China)

Session Program

Wed 13 Nov
Displayed time zone: (GMT-08:00) Tijuana, Baja California

10:40 - 12:20	Cloud and Online Services (Journal First Presentations / Research Papers / Demonstrations / Papers) at Hillcrest. Chair(s): Dan Hao (Peking University)

10:40 20m Talk	Understanding Exception-Related Bugs in Large-Scale Cloud Systems (Research Papers): Haicheng Chen (The Ohio State University), Wensheng Dou (Institute of Software, Chinese Academy of Sciences), Yanyan Jiang (Nanjing University), Feng Qin (Ohio State University, USA)
11:00 20m Talk	iFeedback: Exploiting User Feedback for Real-time Issue Detection in Large-Scale Online Service Systems (Research Papers): Wujie Zheng (Tencent, Inc.), Haochuan Lu (Fudan University), Yangfan Zhou (Fudan University), Jianming Liang (Tencent), Haibing Zheng (Tencent), Yuetang Deng (Tencent, Inc.)
11:20 20m Talk	Software Microbenchmarking in the Cloud. How Bad is it Really? (Journal First Presentations): Christoph Laaber (University of Zurich), Joel Scheuner (Chalmers | University of Gothenburg), Philipp Leitner (Chalmers University of Technology & University of Gothenburg)
11:40 20m Talk	Continuous Incident Triage for Large-Scale Online Service Systems (Research Papers): Junjie Chen (Tianjin University), Xiaoting He (Microsoft), Qingwei Lin (Microsoft Research, China), Hongyu Zhang (The University of Newcastle), Dan Hao (Peking University), Feng Gao (Microsoft), Zhangwei Xu (Microsoft), Yingnong Dang (Microsoft Azure), Dongmei Zhang (Microsoft Research, China)
12:00 10m Demonstration	Kotless: a Serverless Framework for Kotlin (Demonstrations): Vladislav Tankov (JetBrains, ITMO University), Yaroslav Golubev (JetBrains Research), Timofey Bryksin (JetBrains Research, Saint-Petersburg State University)
12:10 10m Demonstration	FogWorkflowSim: An Automated Simulation Toolkit for Workflow Performance Evaluation in Fog Computing (Demonstrations): Xiao Liu (School of Information Technology, Deakin University), Lingmin Fan (School of Computer Science and Technology, Anhui University), Jia Xu (School of Computer Science and Technology, Anhui University), Xuejun Li (School of Computer Science and Technology, Anhui University), Lina Gong (School of Computer Science and Technology, Anhui University), John Grundy (Monash University), Yun Yang (Swinburne University of Technology)