Multi-Attribute Join Search with Map-Reduce [BSc]

Context: 

Given an input table, finding the tables that can be joined with it is one of the most important pre-processing tasks [1]. Existing systems find joinable tables based on a single-attribute join key, that is, a key consisting of exactly one column. These systems use an inverted index to look up where each key value occurs in the corpus and then report the number of joinable rows per table.
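The single-attribute lookup described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the corpus is a small in-memory dictionary, and the names `build_inverted_index`, `joinable_rows_per_table`, and the example tables are hypothetical and not taken from any existing system.

```python
from collections import Counter, defaultdict


def build_inverted_index(corpus):
    """Map each cell value to the set of (table, column, row_id) locations holding it."""
    index = defaultdict(set)
    for table_name, rows in corpus.items():
        for row_id, row in enumerate(rows):
            for column, value in row.items():
                index[value].add((table_name, column, row_id))
    return index


def joinable_rows_per_table(index, key_values):
    """Count, per (table, column), the rows whose value equals some query key value."""
    counts = Counter()
    for value in set(key_values):
        for table_name, column, _row_id in index.get(value, ()):
            counts[(table_name, column)] += 1
    return counts


# Hypothetical toy corpus: table name -> list of rows (column -> value).
corpus = {
    "cities": [
        {"name": "Hannover", "country": "DE"},
        {"name": "Paris", "country": "FR"},
    ],
    "teams": [
        {"city": "Hannover", "league": "2"},
        {"city": "Hannover", "league": "3"},
        {"city": "Berlin", "league": "1"},
    ],
}

index = build_inverted_index(corpus)
# Values of the single join-key column of the input table.
counts = joinable_rows_per_table(index, ["Hannover", "Paris"])
```

Here both `cities.name` and `teams.city` end up with two joinable rows each, while columns without any matching value do not appear in the result.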

Problem / Task:  

A more general problem is to find joinable tables based on a composite key, where a row is joinable only if it matches on more than one column. This makes the search computationally more expensive than the single-attribute case. In this thesis, we would like to implement a system using the map-reduce paradigm [2] that takes a dataset and a composite key as input and finds the joinable tables in a large corpus. The system should first retrieve the joinable tables for each key column individually and then compute the number of joinable rows by merging these per-column results. The results of the map-reduce implementation should be compared with hash-based approaches.
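One possible shape of such a pipeline is sketched below as a single-process simulation of the map, shuffle, and reduce steps in plain Python. It is not the intended design of the thesis system: all function names and the toy data are hypothetical, and the sketch assumes that a candidate row is joinable when every value of a query row's composite key occurs somewhere in that row. It does not check that the values come from distinct or consistently aligned columns, which a real implementation would have to handle.

```python
from collections import defaultdict


def build_inverted_index(corpus):
    """Map each cell value to the list of (table, column, row_id) locations holding it."""
    index = defaultdict(list)
    for table, rows in corpus.items():
        for row_id, row in enumerate(rows):
            for column, value in row.items():
                index[value].append((table, column, row_id))
    return index


def map_phase(query_rows, key_columns, index):
    """Emit ((table, row_id, query_row_id), key_position) for every single-column hit."""
    for query_row_id, query_row in enumerate(query_rows):
        for position, key_column in enumerate(key_columns):
            for table, _column, row_id in index.get(query_row[key_column], ()):
                yield (table, row_id, query_row_id), position


def shuffle(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    groups = defaultdict(set)
    for key, value in pairs:
        groups[key].add(value)
    return groups


def reduce_phase(groups, key_width):
    """Merge per-column hits: keep rows that matched every key position, count per table."""
    joinable = defaultdict(set)
    for (table, row_id, _query_row_id), positions in groups.items():
        if len(positions) == key_width:
            joinable[table].add(row_id)
    return {table: len(rows) for table, rows in joinable.items()}


# Hypothetical toy corpus and input table with the composite key (first_name, last_name).
corpus = {
    "employees": [
        {"first": "Ada", "last": "Lovelace", "dept": "R&D"},
        {"first": "Ada", "last": "Byron", "dept": "R&D"},
        {"first": "Alan", "last": "Turing", "dept": "Ops"},
    ],
    "visitors": [
        {"given": "Alan", "family": "Turing"},
        {"given": "Grace", "family": "Lovelace"},
    ],
}
query_rows = [
    {"first_name": "Ada", "last_name": "Lovelace"},
    {"first_name": "Alan", "last_name": "Turing"},
]
key_columns = ["first_name", "last_name"]

index = build_inverted_index(corpus)
groups = shuffle(map_phase(query_rows, key_columns, index))
result = reduce_phase(groups, key_width=len(key_columns))
```

In this toy run, `employees` has two joinable rows and `visitors` has one. Rows that match only one part of the key, such as ("Ada", "Byron") or ("Grace", "Lovelace"), are dropped in the reduce step.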

Prerequisites:

  • Familiarity with database concepts, especially joins
  • Programming experience in Python, Scala, or Java
  • Interest in data integration and the courage to work with large data
  • Interest and experience in distributed computing

Related Work:

[1] Zhu, Erkang, et al. "Josie: Overlap set similarity search for finding joinable tables in data lakes." Proceedings of the 2019 International Conference on Management of Data. 2019.

[2] Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: Simplified data processing on large clusters." Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI). 2004.

For a detailed introduction to the topic, please contact the advisors by email.

Advisor and Contact:

Mahdi Esmailoghli <esmailoghli@dbs.uni-hannover.de> (LUH)

Prof. Dr. Ziawasch Abedjan <abedjan@dbs.uni-hannover.de> (LUH)