Will pre-indexing improve the deep code search? – Institut für Data Science

Will pre-indexing improve the deep code search? [B.Sc.]

Context

A huge number of open-source and industrial software systems have been developed. Their source code is usually stored in large code repositories, such as GitHub, and can be treated as an important reusable asset for developers. Previously written programs can help developers understand how others addressed similar problems and can serve as a basis for writing new programs. Thus, there is great demand for automated tools that help developers search a large codebase for code relevant to a specific programming task. Naive approaches treat source code as textual documents and use information retrieval models to retrieve code snippets that match a given query; they rely mainly on the textual similarity between the source code and the natural language query. The state-of-the-art approach uses neural networks to embed code snippets and natural language descriptions into a high-dimensional vector space [1]. Code search is then performed based on the similarity of the feature vectors, so that the system captures the semantics of queries and source code better than the naive methods do. However, this line of work focuses on building the deep learning model and neglects fundamental information retrieval processes such as indexing, which have the potential to improve both the effectiveness and the efficiency of search. In particular, as code repositories keep growing, an efficient indexing method might be necessary to make search on Big Code feasible.
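To make the vector-based search described above concrete, the following is a minimal sketch of retrieval by feature vector similarity without any index, assuming cosine similarity as the measure. The three-dimensional vectors and the snippet names are made up for illustration; in a real system the vectors would come from a trained embedding model, and every snippet in the repository would be scored for every query.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def brute_force_search(query_vec, code_vecs, k=2):
    """Score every snippet against the query and return the top-k names."""
    scored = [(cosine(query_vec, vec), name) for name, vec in code_vecs.items()]
    # Highest similarity first; ties are broken by name for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored[:k]]


# Hypothetical toy embeddings (assumption: not produced by any real model).
code_vecs = {
    "read_file": [0.9, 0.1, 0.0],
    "write_file": [0.7, 0.6, 0.1],
    "sort_list": [0.0, 0.2, 0.9],
}
query_vec = [1.0, 0.2, 0.0]

top = brute_force_search(query_vec, code_vecs, k=2)
```

The cost of this scan grows linearly with the number of snippets, which is the efficiency concern that motivates indexing.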

Problem Definition

In this work, we will explore whether pre-indexing can improve the performance of deep code search. We will use DeepCS as the core of the search engine [1]. A code search engine retrieves code snippets from a code repository that are related to an input natural language query. DeepCS is a code search engine based on deep learning. It uses a Code-Description Embedding Neural Network to embed a code snippet and its natural language description into the same feature vector space. Using this unified vector representation, code snippets related to a natural language query can be retrieved according to their vectors. We will first obtain a general understanding of DeepCS and then implement one or more indexing methods on the code repository. Candidate indexing methods include hashing, text-based, and code structure-based indexing. We will build an index filter underneath the DeepCS system that pre-selects the candidates. This filter component is independent of DeepCS, so we can directly use the open-source code of DeepCS as the search engine [2] and only take care of the data pre-processing phase. The research questions we want to answer in the end are: Can pre-indexing improve the efficiency and effectiveness of deep code search? If so, how? If possible, we will also investigate which indexing method performs best.
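One possible shape of such a filter, sketched under stated assumptions: a text-based inverted index over snippet descriptions pre-selects candidates, and only those candidates are ranked by vector similarity. This is not the DeepCS API; the function names, the toy descriptions, and the toy vectors are all hypothetical, and cosine similarity stands in for whatever scoring the search engine actually uses.

```python
import math
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase a text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def build_inverted_index(descriptions):
    """Map each token to the ids of the snippets whose description contains it."""
    index = defaultdict(set)
    for snippet_id, text in descriptions.items():
        for token in tokenize(text):
            index[token].add(snippet_id)
    return index


def preselect(index, query):
    """Return the ids of snippets sharing at least one token with the query."""
    candidates = set()
    for token in tokenize(query):
        candidates |= index.get(token, set())
    return candidates


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def search(query, query_vec, index, vectors, k=2):
    """Rank only the pre-selected candidates; also report how many were scored."""
    candidates = preselect(index, query)
    scored = sorted(
        ((cosine(query_vec, vectors[c]), c) for c in candidates),
        key=lambda item: (-item[0], item[1]),
    )
    return [c for _, c in scored[:k]], len(candidates)


# Hypothetical toy data (assumption: stands in for a real repository and model).
descriptions = {
    "read_file": "read text from a file",
    "write_file": "write text to a file",
    "sort_list": "sort a list of numbers",
}
vectors = {
    "read_file": [0.9, 0.1, 0.0],
    "write_file": [0.7, 0.6, 0.1],
    "sort_list": [0.0, 0.2, 0.9],
}
index = build_inverted_index(descriptions)

results, n_scored = search("read file contents", [1.0, 0.2, 0.0], index, vectors)
```

In this toy run only two of the three snippets are scored. A keyword filter like this can also exclude relevant snippets that share no token with the query, which is why effectiveness has to be evaluated alongside efficiency.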

Prerequisites

Programming skills
Interest in big data processing
Interest in information retrieval

Advisor and Contact

Binger Chen (chen@dbs.uni-hannover.de)

Prof. Dr. Ziawasch Abedjan (abedjan@dbs.uni-hannover.de)

References

[1] Xiaodong Gu, Hongyu Zhang, Sunghun Kim: Deep code search. ICSE 2018: 933-944

[2] github.com/guxd/deep-code-search