COLUMN SPLITTER WITH RECORD-MATCHING [B.SC.]

Context

Data is fundamental to a large number of applications. Apart from commercial datasets created by large enterprises or institutes, open databases, such as Azure Open Datasets [3], and web tables [2] are popular data sources for applications and research projects. However, although they cover a wide array of information, their data representation might not be optimal for certain downstream applications, which makes them hard to use efficiently. The simple example below shows a case where the existing value representation has to be transformed.

Example 1: Consider a sub-system of a large international business that summarizes the sales of each state or equivalent administrative region. The address table shown on the left of Table 1 contains the state, which needs to be extracted from each address. After splitting the source “Customer Address” column into the target “Detail Address”, “City” and “State” columns as shown in the right table, the “State” column can be used directly to identify the region. The last two rows need to be split the same way as the first two rows.

Record matching [1] is an algorithm that can solve the column split problem. It looks for exact-matching tokens between source and target strings and outputs the set of rules that covers the greatest number of tokens. In our example, these rules would be learned from the first two split examples and applied automatically to the remaining address records. However, because all information in the target columns comes from the source string through static rules, inconsistencies between source strings negatively affect the accuracy of such a rule-based system. Examples include reordering of different parts of the information, mixing common abbreviations with full names, and different formatting styles. These inconsistencies limit the applicability of the exact-matching algorithm. In this work, we focus on the reordering problem.
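To make the idea of learning static rules from split examples concrete, the following is a minimal sketch and not the algorithm from [1]. It assumes that source strings are comma-delimited and that each target value matches exactly one source token; the function names and the address strings are hypothetical and chosen only for illustration.

```python
def tokenize(source, delimiter=","):
    """Split a source string into trimmed tokens (assumed delimiter: comma)."""
    return [token.strip() for token in source.split(delimiter)]


def learn_position_rule(examples, delimiter=","):
    """Learn which source token position feeds each target column.

    `examples` is a list of (source_string, target_values) pairs.
    The rule is a tuple of token indices, one per target column.
    """
    if not examples:
        raise ValueError("at least one split example is required")
    rule = None
    for source, targets in examples:
        tokens = tokenize(source, delimiter)
        # Exact matching: list.index raises ValueError if a target is not a token.
        positions = tuple(tokens.index(target) for target in targets)
        if rule is None:
            rule = positions
        elif rule != positions:
            raise ValueError("examples disagree on token positions")
    return rule


def apply_rule(rule, source, delimiter=","):
    """Apply a learned position rule to a new source string."""
    tokens = tokenize(source, delimiter)
    return tuple(tokens[index] for index in rule)


# Hypothetical split examples in the order (detail address, city, state).
examples = [
    ("12 Main St, Springfield, Illinois", ("12 Main St", "Springfield", "Illinois")),
    ("48 Oak Ave, Austin, Texas", ("48 Oak Ave", "Austin", "Texas")),
]
rule = learn_position_rule(examples)

# A record in the same order is split as intended.
consistent = apply_rule(rule, "7 Elm Rd, Boston, Massachusetts")

# A reordered record is split by position, so the state lands in the
# detail address column: the static rule cannot detect the reordering.
reordered = apply_rule(rule, "Texas, Austin, 5 Oak Ave")
```

The reordered record illustrates the limitation discussed above: a purely positional rule places each sub-string in the wrong column as soon as the order of the parts changes.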

As shown in Example 1, the detail address, city and state do not appear in the same order in all address records. This ordering inconsistency makes the column split more difficult and requires an additional metric to assign the split sub-strings to the corresponding columns. Therefore, metrics that check the consistency of each column, such as Minimum Pair-wise Edit-Distance [4], can contribute to a higher matching accuracy.
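One possible way to use an edit-distance-based consistency check for the assignment step is sketched below. This is an illustrative reading, not the procedure defined in [4]: it assumes that some already split column values are available as reference, and it picks the ordering of the sub-strings whose summed minimum Levenshtein distance to the reference columns is smallest. All names and sample values are hypothetical.

```python
from itertools import permutations


def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                 # deletion
                current[j - 1] + 1,              # insertion
                previous[j - 1] + substitution,  # match or substitution
            ))
        previous = current
    return previous[-1]


def min_distance_to_column(value, column_values):
    """Smallest edit distance between a value and any value already in a column."""
    return min(edit_distance(value, existing) for existing in column_values)


def assign_to_columns(parts, reference_columns):
    """Order `parts` so that each part is as close as possible to its column.

    Tries every permutation, which is only feasible for a handful of columns.
    """
    return min(
        permutations(parts),
        key=lambda ordering: sum(
            min_distance_to_column(part, column)
            for part, column in zip(ordering, reference_columns)
        ),
    )


# Hypothetical reference values, one list per target column.
reference_columns = [
    ["12 Main St", "48 Oak Ave"],  # detail address
    ["Springfield", "Austin"],     # city
    ["Illinois", "Texas"],         # state
]

# Sub-strings of a reordered record: state first, detail address last.
assigned = assign_to_columns(("Texas", "Austin", "7 Oak Ave"), reference_columns)
```

In this toy case the state and city already occur in the reference columns, so the intended ordering has the lowest total distance. On real data the reference values will rarely match this closely, which is one reason to compare several metric functions.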

Problem Definition

The task is to make the column splitting approach less prone to order inconsistencies. For this purpose, one can compare the accuracy of column splitting using different metric functions (e.g., Minimum Pair-wise Edit-Distance [4]) after record matching. To achieve this, you first need to re-implement the record matching algorithm. The algorithm should be able to perform the column split for consistent inputs based on the examples you provide. Then, at least one metric should be implemented to handle inconsistent inputs. Finally, an experiment is expected to compare the accuracy of the column split with and without the selected metric function.
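The comparison in the final experiment could be computed along the following lines. The sketch assumes that accuracy means the fraction of records whose complete split equals a manually created gold split; the task description does not fix this definition, and the rows below are made-up toy values that only demonstrate the calculation.

```python
def split_accuracy(predicted_rows, gold_rows):
    """Fraction of records whose predicted split equals the gold split exactly."""
    if len(predicted_rows) != len(gold_rows):
        raise ValueError("predicted and gold must contain the same number of records")
    if not gold_rows:
        return 0.0
    correct = sum(
        1
        for predicted, gold in zip(predicted_rows, gold_rows)
        if tuple(predicted) == tuple(gold)
    )
    return correct / len(gold_rows)


# Toy gold standard in the order (detail address, city, state).
gold = [
    ("12 Main St", "Springfield", "Illinois"),
    ("48 Oak Ave", "Austin", "Texas"),
    ("5 Oak Ave", "Austin", "Texas"),
    ("9 Lake Dr", "Springfield", "Illinois"),
]

# Toy output of a purely positional split: the last two records were reordered
# in the source, so their values end up in the wrong columns.
without_metric = [
    ("12 Main St", "Springfield", "Illinois"),
    ("48 Oak Ave", "Austin", "Texas"),
    ("Texas", "Austin", "5 Oak Ave"),
    ("Illinois", "Springfield", "9 Lake Dr"),
]

# Toy output after a metric-based re-assignment of the sub-strings.
with_metric = list(gold)

accuracy_without = split_accuracy(without_metric, gold)
accuracy_with = split_accuracy(with_metric, gold)
```

A cell-level variant, which counts correctly placed values instead of complete records, would be an equally reasonable choice and reports partial successes more finely.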

Prerequisites 

•     Enthusiasm for data integration

•     Solid fundamentals in databases

•     Programming experience in any programming language; Python is preferred

Advisor and Contact

Dakai Men (men@dbs.uni-hannover.de)

Prof. Dr. Ziawasch Abedjan (abedjan@dbs.uni-hannover.de)

References

[1]   Arvind Arasu, Surajit Chaudhuri, and Raghav Kaushik. Learning string transformations from examples. Proc. VLDB Endow., 2(1):514–525, Aug. 2009.

[2]   Michael J. Cafarella, Alon Halevy, Daisy Zhe Wang, Eugene Wu, and Yang Zhang. WebTables: Exploring the power of tables on the web. Proc. VLDB Endow., 1(1):538–549, Aug. 2008.

[3]   Microsoft. Azure open datasets, 2022. Available at docs.microsoft.com/en-us/azure/open-datasets/dataset-catalog.

[4]   Pei Wang and Yeye He. Uni-Detect: A unified approach to automated error detection in tables. In Proceedings of the 2019 International Conference on Management of Data, SIGMOD ’19, pages 811–828, New York, NY, USA, 2019. Association for Computing Machinery.