10 July 2018
Instituto de Ciencias de la Computación ICC (UBA - CONICET), Universidad de Buenos Aires
Departamento de Razonamiento y Lógica
Vanina Martinez

The entity resolution problem in traditional databases, also known as deduplication, seeks to map multiple virtual objects to their corresponding set of real-world entities. Though the problem is challenging, it can be tackled in a variety of ways by leveraging several simplifying assumptions, such as the assumption that the multiple virtual objects arise from name or attribute ambiguity, clerical errors in data entry or formatting, missing or changing values, or abbreviations. In cyber security domains, however, the entity resolution problem takes on a very different form: malicious actors that operate in environments such as hacker forums and markets are highly motivated to remain semi-anonymous. Though they wish to keep their true identities secret from law enforcement, they also have a reputation to maintain with their customers. The above simplifying assumptions cannot be made in this setting, and we therefore coin the term “adversarial deduplication” for it. In this paper, we propose the use of probabilistic existential rules (also known as Datalog+/–) to model knowledge engineering solutions to this problem; we show that tuple-generating dependencies can be used to generate probabilistic deduplication hypotheses, and that equality-generating dependencies can later be applied to leverage existing data towards grounding such hypotheses. The main advantage over existing deduplication tools is that our model operates under the open-world assumption and is thus capable of modeling hypotheses over unknown objects, which can later become known if new data becomes available.
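
To make the interplay between the two kinds of dependencies concrete, the following is a minimal, non-probabilistic sketch and not the model from the paper. It assumes a hypothetical set of forum accounts, each advertising a PGP key, and two illustrative rules: a TGD-like rule that invents an unknown person (a labeled null) for every account, and an EGD-like rule that equates two such persons when their accounts share a key. The account names, keys, and function names are all invented for illustration, and probabilities are omitted entirely.

```python
from itertools import count

# Hypothetical facts: forum accounts and the PGP key each one advertises.
accounts = {"darkfox": "KEY-A", "n1ghtowl": "KEY-B", "d4rkf0x": "KEY-A"}

_fresh = count(1)  # source of fresh labeled nulls


def apply_tgd(accounts):
    """TGD-like step: account(U) -> exists P. operates(P, U).

    Every account is assigned a fresh null standing for an unknown person,
    which is possible only because we reason under the open-world assumption.
    """
    return {user: f"_:p{next(_fresh)}" for user in accounts}


def apply_egd(operates, accounts):
    """EGD-like step:
    operates(P1, U1), operates(P2, U2), key(U1, K), key(U2, K) -> P1 = P2.

    Nulls are merged with a tiny union-find structure.
    """
    parent = {p: p for p in operates.values()}

    def find(p):
        while parent[p] != p:
            p = parent[p]
        return p

    first_by_key = {}
    for user, key in accounts.items():
        p = find(operates[user])
        if key in first_by_key:
            # Same key seen before: equate this person with the earlier one.
            parent[p] = find(first_by_key[key])
        else:
            first_by_key[key] = p
    return {user: find(p) for user, p in operates.items()}


hypotheses = apply_tgd(accounts)          # one unknown person per account
resolved = apply_egd(hypotheses, accounts)  # persons merged on shared keys
```

In this sketch, the two accounts that share a key end up mapped to the same unknown person, while the third keeps its own. A probabilistic version would attach a confidence to each generated hypothesis rather than merging unconditionally.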

Institution department: Departamento de Computación