Databases and Information Systems Group: Theses
From mining naming conventions in data science projects to suggesting variable names [MSc]



Naming conventions are rules for how user-defined program components should be named. Programmers often neglect them because names do not affect code semantics. However, consistent and appropriate variable names let programmers focus on the actual functionality of the code and make it easier for others to understand, revise, or reproduce it. In open-source data science projects, variable naming problems are more severe than in the average software engineering project [1]. Data science projects also tend to have distinctive naming characteristics. For example, they contain more short names, because implementations of algorithms often mirror mathematical notation [2]. As the reuse of large online code resources ("big code") becomes increasingly important in data science, it is valuable to analyze the naming conventions in data science projects and to suggest better variable names.
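To make the point about math-like short names concrete, the following sketch shows the same gradient descent step for least-squares regression written twice: once with names that mirror the formula w ← w − lr · (2/n) · Xᵀ(Xw − y), and once with descriptive names. Both functions and all identifiers are hypothetical illustrations and are not taken from any of the cited studies.

```python
# Math-style names, mirroring the textbook formula
#   w <- w - lr * (2/n) * X^T (Xw - y)
def step(w, X, y, lr):
    n = len(X)
    # r: residuals Xw - y, one entry per sample
    r = [sum(wj * xj for wj, xj in zip(w, x)) - yi for x, yi in zip(X, y)]
    # g: gradient of the mean squared error with respect to w
    g = [2.0 / n * sum(ri * x[j] for ri, x in zip(r, X)) for j in range(len(w))]
    return [wj - lr * gj for wj, gj in zip(w, g)]


# The same computation with descriptive names.
def gradient_descent_step(weights, features, targets, learning_rate):
    num_samples = len(features)
    residuals = [
        sum(weight * value for weight, value in zip(weights, row)) - target
        for row, target in zip(features, targets)
    ]
    gradient = [
        2.0 / num_samples
        * sum(residual * row[j] for residual, row in zip(residuals, features))
        for j in range(len(weights))
    ]
    return [
        weight - learning_rate * grad for weight, grad in zip(weights, gradient)
    ]
```

The first version is compact and easy to check against the formula, but names such as `r` and `g` carry no meaning outside that context, and the capitalized argument `X` deviates from the lowercase naming that PEP 8 recommends for function arguments.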

Related Work:

Existing methods for suggesting variable names are based on machine learning models and statistical language processing, but they are designed only for general-purpose code [3] [4]; none of them focuses on data science projects. Recent work has begun to examine the particular coding style of Python code and data science projects [5] [2]. However, these studies only analyzed general coding conventions in the respective code repositories and did not provide practical assistance for writing and reading code.

Problem / Task:  

The task is to mine a large open-source code repository to discover the naming conventions used in data science projects and to suggest better variable names. We aim to answer the following research questions:

  • [What] Which naming conventions from current Python code style guides do data science projects comply with, and which do they violate? How can we efficiently mine these patterns in a big code repository?
  • [Why] How do data science projects differ from other projects with regard to naming conventions? Is there a relationship between these naming conventions and other code features?
  • [How] How effective are existing methods for general projects, e.g., [3] [4], on data science projects? How can the mined naming conventions improve variable name suggestion?
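As a starting point for the [What] question, a first pass over a single Python file might look like the sketch below. It uses the standard-library `ast` module to collect every name that is assigned to and sorts the names into coarse style categories. The function names (`classify_name`, `assigned_names`, `style_histogram`), the category labels, and the regular expressions are assumptions made for illustration; they only approximate the naming styles described in PEP 8 and are not the method to be developed in the thesis.

```python
import ast
import re
from collections import Counter

# Simplified patterns for common naming styles (illustrative assumptions).
SNAKE_CASE = re.compile(r"^_*[a-z][a-z0-9]*(_[a-z0-9]+)*_*$")
CAMEL_CASE = re.compile(r"^[a-z]+(?:[A-Z][a-z0-9]*)+$")
UPPER_CASE = re.compile(r"^[A-Z][A-Z0-9]*(_[A-Z0-9]+)*$")
CAP_WORDS = re.compile(r"^(?:[A-Z][a-z0-9]+)+$")


def classify_name(name):
    """Return a coarse style label for a single identifier."""
    if len(name) == 1:
        return "single_letter"
    if SNAKE_CASE.match(name):
        return "snake_case"
    if CAMEL_CASE.match(name):
        return "camel_case"
    if UPPER_CASE.match(name):
        return "upper_case"
    if CAP_WORDS.match(name):
        return "cap_words"
    return "other"


def assigned_names(source):
    """Collect the distinct names that appear as assignment targets."""
    tree = ast.parse(source)
    return {
        node.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store)
    }


def style_histogram(source):
    """Count how many distinct assigned names fall into each style."""
    return Counter(classify_name(name) for name in assigned_names(source))
```

Applied to a snippet that assigns `X_train`, `w`, `learningRate`, `num_epochs`, and `MAX_ITER`, the histogram contains one name per category, with `X_train` landing in `other` because it mixes an uppercase letter with an underscore-separated lowercase word. Scaling this to a whole repository would additionally require handling files that fail to parse, function arguments, and attribute names, which the sketch leaves out.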


Prerequisites:

  • Experience and interest in programming
  • Experience in machine learning
  • Interest in big data mining
  • Interest in natural language processing 


[1] Jiawei Wang, Li Li, Andreas Zeller: Better code, better sharing: on the need of analyzing Jupyter notebooks. ICSE (NIER) 2020.

[2] Andrew J. Simmons, Scott Barnett, Jessica Rivera-Villicana, Akshat Bajaj, Rajesh Vasa: A large-scale comparative analysis of coding standard conformance in open-source data science projects. ESEM 2020.

[3] Uri Alon, Meital Zilberstein, Omer Levy, Eran Yahav: code2vec: learning distributed representations of code. POPL 2019.

[4] Uri Alon, Meital Zilberstein, Omer Levy, Eran Yahav: A general path-based representation for predicting program properties. PLDI 2018.

[5] Nikolaos Bafatakis, Niels Boecker, Wenjie Boon, Martin Cabello Salazar, Jens Krinke, Gazi Oznacar, Robert White: Python coding style compliance on Stack Overflow. MSR 2019.

For a detailed introduction to the topic, please contact Binger Chen by email.

Advisor and Contact:

Binger Chen <chen@dbs.uni-hannover.de> (TU Berlin)

Prof. Dr. Ziawasch Abedjan <abedjan@dbs.uni-hannover.de> (LUH)