SCALABLE ERROR DETECTION [BSC]

Context:

Data cleaning aims at solving data quality problems. Data cleaning consists of error detection and error correction tasks. Our previous systems Raha [4], and Baran [2], aim to solve these tasks, respectively. These two systems can be used together to build an end-to-end data cleaning pipeline [3]. They are configuration-free and leverage a small number of user labels to clean data in a semi-supervised manner. Both systems internally combine various base detectors and correctors that cover a wide range of data quality problems.

In this project, we focus on error detection the corresponding system, Raha. One of the current issues in Raha is its scalability. The output of its internal detectors are used for features to cluster values of a data column. Currently, Raha uses hierarchical agglomerate clustering. Generally, considering n as the number of data points, the hierarchical agglomerate clustering algorithm has O(n3) time complexity. Besides, the space complexity of this algorithm is O(n2). The time and space complexity of this algorithm limits the size of input data, and hence it is not sufficient for large datasets [7].

Problem/Task: 

The task is to upgrade Raha with at least two scalable, parallel clustering algorithms to improve the efficiency by reducing the time and space complexity. For the first step, it is essential to become familiar with data cleaning. In this step, you will concentrate on error detection and figure out the internals of Raha. Then, you will investigate different clustering algorithms and select at least two of them that fit into Raha. You will implement and parallelize them using appropriate python libraries such as Scikit-learn [5], Scipy [8], Apache Spark(PySpark) [9], or Dask [6, 1].

Prerequisites:

  • Programming experience in Python (Being familiar with libraries and frameworks such as Scikit-learn, Scipy, PySpark, and Dask is a plus),
  • Interest in scalable data mining algorithms,
  • Interest in data cleaning.

Advisor and Contact:

In the case of any questions, please get in contact via email.

Fatemeh Ahmadi, ahmadi@dbs.uni-hannover.de

Prof. Dr. Ziawasch Abedjan, abedjan@dbs.uni-hannover.de

References:

[1] Dask Development Team. Dask: Library for dynamic task scheduling, 2016.

[2] Mohammad Mahdavi and Ziawasch Abedjan. Baran: Effective error correction via a unified context representation and transfer learning. Proc. VLDB Endow., 13(11):1948–1961, 2020.

[3] Mohammad Mahdavi and Ziawasch Abedjan. Semi-supervised data cleaning withraha and baran. In 11th Conference on Innovative Data Systems Research, CIDR2021, Virtual Event, January 11-15, 2021, Online Proceedings. www.cidrdb.org, 2021.

[4] Mohammad Mahdavi, Ziawasch Abedjan, Raul Castro Fernandez, Samuel Madden, Mourad Ouzzani, Michael Stonebraker, and Nan Tang. Raha: A configuration-free error detection system. In Peter A. Boncz, Stefan Manegold, Anastasia Ailamaki, Amol Deshpande, and Tim Kraska, editors, Proceedings of the 2019 International Conference on Management of Data, SIGMOD Conference 2019, Amsterdam, The Netherlands, June 30 - July 5, 2019, pages 865–882. ACM, 2019.

[5] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel,Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake VanderPlas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Edouard Duchesnay. Scikit-learn: Machine learning in python. J. Mach. Learn. Res., 12:2825–2830, 2011.

[6] Matthew Rocklin. Dask: Parallel computation with blocked algorithms and taskscheduling. In Kathryn Huff and James Bergstra, editors, Proceedings of the 14th Python in Science Conference, pages 130 – 136, 2015.

[7] Pang-Ning Tan, Michael S. Steinbach, Anuj Karpatne, and Vipin Kumar. Introduction to Data Mining (Second Edition). Pearson, 2019.

[8] Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, TylerReddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, Stéfan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jarrod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric Jones, Robert Kern,Eric Larson, C J Carey, ̇Ilhan Polat, Yu Feng, Eric W. Moore, Jake VanderPlas, Denis Laxalde, Josef Perktold, Robert Cimrman, Ian Henriksen, E. A. Quintero,Charles R. Harris, Anne M. Archibald, Antônio H. Ribeiro, Fabian Pedregosa, Paulvan Mulbregt, and SciPy 1.0 Contributors. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods, 17:261–272, 2020.

[9] Matei Zaharia, Reynold S. Xin, Patrick Wendell, Tathagata Das, Michael Armbrust, Ankur Dave, Xiangrui Meng, Josh Rosen, Shivaram Venkataraman, Michael J. Franklin, Ali Ghodsi, Joseph Gonzalez, Scott Shenker, and Ion Stoica. Apache spark: a unified engine for big data processing. Commun. ACM, 59(11):56–65, 2016.