Code search is a core software engineering task. Effective code search tools can help developers substantially improve their software development efficiency and effectiveness. In recent years, many code search studies have leveraged different techniques, such as deep learning and information retrieval approaches, to retrieve expected code from a large-scale codebase.
However, there is a lack of a comprehensive comparative summary of existing code search approaches. To understand the research trends in existing code search studies, Dr Xiaohu Yang and his co-authors systematically reviewed 81 relevant studies on challenges and future opportunities in the aspects of benchmarks, learning models, model fusion, cross-language searches and search tasks.
They investigated the publication trends of code search studies, analyzed key components, such as codebase, query, and modeling technique used to build code search tools to enhance the understanding of their characteristics, and classified existing tools into focusing on supporting seven different search tasks by analyzing the relationships between tools for each category as a basis for future comparisons and benchmarks. In summary, they concluded by answering the three questions:
1. What are the emerging publication trends for code search studies?
Code search started to be considered in the software engineering research literature in 2002, and its popularity continues to increase with a current peak so far in 2019. 60% of the studies are published in conferences (rather than journals), and 67 of the studies propose new code search tools.
2. What are the most important factors that contribute to existing code search tools?
74% of tools searched source code with queries written in natural language (i.e., text, API name, input/output example), and Deep learning is the most popular modeling technique in the last two years. Inverted Index was frequently used for accelerating code search efficiency; researchers also leveraged other auxiliary techniques (i.e., query reformulation, code clustering, active learning) to improve the search accuracy. However, they found only 12 code search studies shared accessible replication package links in their papers or provided source code in GitHub.
3. How do studies evaluate code search tools?
Most reviewed study codebases were built with large-scale method-level source code written in Java, which were collected from public code repositories (e.g., GitHub and FDroid), and most code search studies have tested their proposed tools with top-n frequently used text-based queries collected from Q&A forums (e.g., Stack Overflow). Performance of 55% of code search tools were estimated using manual analysis, and most of the reviewed studies assessed tool performance with ranking metrics (e.g., MRR and NDCG).
Based on findings, they identified a set of outstanding challenges in existing studies including the diversity of the codebase, limited queries, issues with model construction and evaluation, replication respectively, and limited performance measures. On the other hand, the opportunities lie in better benchmarks, fusion of different types of models, development of multi-language tool and new tools for new code search tasks.
It is recommended that further studies to build a better benchmark with large-scale code written in multiple programming languages, including various queries covering not only top frequently used examples but also domain-specific cases, and a standard automated tool evaluation method. DL-based tools require further improvements, such as better code representation, higher quality of training data, and speed improvements. It is further recommended to fuse them with other models such as traditional IR-based and heuristic models. It is difficult to deploy an existing tool to search code written in multiple programming languages. Therefore, a multi-language tool and new code search tasks such as searching UI code or code in programming videos are also worthy of development.
The work was published by ACM Computing Surveys and could be access at https://doi.org/10.1145/3480027.
About Professor Yang
Xiaohu Yang is a jointly appointed professor at College of Computer Science & Technology and SIAS, Zhejiang University. He is the Director of Blockchain Research Center and Vice Director of Computer Software Institute at Zhejiang University.
He is the co-founder of State Street Zhejiang University Technology Center, a joint research center set up in 2001 by State Street Corporation and Zhejiang University, for advanced research and development of global financial software systems and technologies. Since then, he has been leading the Technology Center, and brought it up from 15 people to more than thousand people up-to-date.
His research interests include software engineering, blockchain, and cloud computing. He received the B.S. degree, the M.S. degree and the Ph.D. degree all in computer science at Zhejiang University in 1988, 1990, and 1993 respectively.
Shanghai Institute for Advanced Study of Zhejiang University (SIAS) is a jointly launched new institution of research and development by Shanghai Municipal Government and Zhejiang University in June, 2020. The platform represents an intersection of technology and economic development, serving as a market leading trail blazer to cultivate a novel community for innovation amongst enterprises.
SIAS is seeking top talents working on the frontiers of computational sciences who can envision and actualize a research program that will bring out new solutions to areas include, but not limited to, Artificial Intelligence, Computational Biology, Computational Engineering and Fintech.