CLEAN AUTOML [BSC]

Context: 

Current AutoML systems, such as AutoSklearn [1], are a first step toward making the tool of machine learning (ML) accessible to a broader audience. Typically, AutoML systems select preprocessing steps, ML models, and their parameters by cleverly choosing and evaluating different combinations. One important preprocessing step in ML is data quality assurance. Data errors range from typographical errors and missing values to functional dependency violations [2]. However, current AutoML systems only address missing values.
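To make the search described above concrete, here is a toy, stdlib-only sketch of the combinatorial selection an AutoML system performs. All names and the dataset are made up for illustration; real systems like AutoSklearn use Bayesian optimization and meta-learning over a much richer pipeline space rather than this exhaustive grid:

```python
import itertools
import random

random.seed(0)

# Toy 1-D dataset: class 0 centered at 0.0, class 1 centered at 1.0.
data = [(random.gauss(y, 0.3), y) for y in (0, 1) for _ in range(50)]
train, valid = data[::2], data[1::2]

def scale(x, factor):        # a stand-in "preprocessing" choice
    return x * factor

def threshold_model(x, t):   # a stand-in "model" with one hyperparameter t
    return 1 if x > t else 0

def accuracy(factor, t, split):
    return sum(threshold_model(scale(x, factor), t) == y for x, y in split) / len(split)

# Grid search over (preprocessing, hyperparameter) combinations,
# selected on training data -- the essence of the AutoML search loop.
grid = itertools.product([0.5, 1.0, 2.0], [0.0, 0.25, 0.5, 0.75, 1.0])
best = max(grid, key=lambda cfg: accuracy(cfg[0], cfg[1], train))
print("best (scale factor, threshold):", best,
      "validation accuracy:", accuracy(best[0], best[1], valid))
```

The project's question is what happens when data-cleaning decisions become one more axis of exactly this kind of search.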

Problem / Task:  

The task is to apply state-of-the-art data cleaning systems, such as Raha [3] and Baran [4], to generate and extract cleaning transformations, and to incorporate those into AutoSklearn. We want to analyze whether incorporating cleaning as an additional optimization dimension improves ML performance on dirty data, and how this compares to a sequential approach in which a dataset is first cleaned and then used for machine learning. Devising methods to select subsets of cleaning transformations, or to improve the runtime of the AutoML optimization, is a plus.
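The two setups under comparison can be sketched as follows in a stdlib-only toy. The cleaning functions here are hypothetical placeholders for the transformations Raha/Baran would detect and generate, and the threshold classifier stands in for the AutoSklearn search space:

```python
import random

random.seed(1)

# Toy dirty dataset: a label-correlated feature with injected errors.
def make_dirty(n=200):
    rows = []
    for _ in range(n):
        y = random.randint(0, 1)
        x = random.gauss(y, 0.3)
        r = random.random()
        if r < 0.1:
            x *= 100      # injected outlier ("typo")
        elif r < 0.2:
            x = None      # injected missing value
        rows.append((x, y))
    return rows

train, valid = make_dirty(), make_dirty()

# Hypothetical cleaning transformations (placeholders for Raha/Baran output).
def impute_mean(rows):
    vals = [x for x, _ in rows if x is not None]
    m = sum(vals) / len(vals)
    return [(m if x is None else x, y) for x, y in rows]

def clip_outliers(rows):
    return [(None if x is None else max(-10.0, min(10.0, x)), y) for x, y in rows]

cleaners = {
    "impute_only": impute_mean,
    "clip_then_impute": lambda rows: impute_mean(clip_outliers(rows)),
}

def accuracy(t, rows):   # threshold classifier with hyperparameter t
    return sum((x > t) == bool(y) for x, y in rows) / len(rows)

thresholds = [0.25, 0.5, 0.75]

# (a) Sequential: clean once with a fixed strategy, then tune the model.
seq_train, seq_valid = impute_mean(train), impute_mean(valid)
seq_best = max(thresholds, key=lambda t: accuracy(t, seq_train))
seq_acc = accuracy(seq_best, seq_valid)

# (b) Joint: the cleaning strategy is one more dimension of the search.
joint = max(
    ((name, t) for name in cleaners for t in thresholds),
    key=lambda cfg: accuracy(cfg[1], cleaners[cfg[0]](train)),
)
joint_acc = accuracy(joint[1], cleaners[joint[0]](valid))

print(f"sequential: acc={seq_acc:.2f}  joint: {joint}, acc={joint_acc:.2f}")
```

Because the sequential configuration is contained in the joint search space, the joint search can never do worse on the data it selects on; the open question the thesis would study is whether this advantage carries over to real dirty datasets, real cleaning systems, and AutoSklearn's much larger search space, and at what runtime cost.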

Prerequisites:

  • Programming experience in Python (+ sklearn)
  • Interest in working with large datasets
  • Interest in machine learning & database technologies

Related Work:

[1] Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Tobias Springenberg, Manuel Blum, Frank Hutter: Efficient and Robust Automated Machine Learning. NeurIPS 2015: 2962-2970

[2] Ziawasch Abedjan, Xu Chu, Dong Deng, Raul Castro Fernandez, Ihab F. Ilyas, Mourad Ouzzani, Paolo Papotti, Michael Stonebraker, Nan Tang: Detecting Data Errors: Where are we and what needs to be done? Proc. VLDB Endow. 9(12): 993-1004 (2016)

[3] Mohammad Mahdavi, Ziawasch Abedjan, Raul Castro Fernandez, Samuel Madden, Mourad Ouzzani, Michael Stonebraker, Nan Tang: Raha: A Configuration-Free Error Detection System. SIGMOD Conference 2019: 865-882

[4] Mohammad Mahdavi, Ziawasch Abedjan: Baran: Effective Error Correction via a Unified Context Representation and Transfer Learning. Proc. VLDB Endow. 13(11): 1948-1961 (2020) 

For a detailed introduction to the topic, please get in contact via email.

Advisor and Contact:

Felix Neutatz <f.neutatz@tu-berlin.de> (TU Berlin)

Prof. Dr. Ziawasch Abedjan <abedjan@dbs.uni-hannover.de> (LUH)