Probabilistic Record Linkage (Theme)


This section explores the problem of probabilistic record linkage, which can also be viewed as weighted matching with an explicit use of probabilities. Generally speaking, record linkage (or object matching; see also the module on object matching) can be defined as the set of methods and practices that aim to identify, accurately and quickly, whether two or more records stored in sources of various types represent the same real-world entity. Because data sources are usually hard to integrate owing to errors or missing information in the record identifiers, record linkage can be seen as a complex process consisting of several phases that involve different knowledge areas. The research literature often distinguishes between deterministic approaches (matching on identifiers) and probabilistic approaches (matching with matching weights): the former are associated with formal decision rules, while the latter make explicit use of probabilities to decide whether a given pair of records is actually a match. In practice, however, a clear separation between the two approaches is very difficult.

Compared with the deterministic approach, the probabilistic one can cope with problems caused by poor-quality data and can be helpful when variables are spelled differently, swapped or misreported in the two data files. This section is devoted only to the probabilistic record linkage approach, which also makes it possible to evaluate linkage errors by calculating the likelihood of a correct match.
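As an illustration of how matching weights can tolerate differently spelled values, the following is a minimal sketch in Python. It assumes a common log-likelihood-ratio formulation of the weights (in the style of Fellegi and Sunter), which this section does not prescribe. The variable names, the m- and u-probabilities, the similarity threshold and the helper functions `agrees` and `match_weight` are all hypothetical and chosen only for the example.

```python
import math
from difflib import SequenceMatcher

# Hypothetical probabilities per variable (assumed values, not estimated from data):
#   m = P(values agree | the pair is a true match)
#   u = P(values agree | the pair is not a match)
M_U = {
    "surname": (0.95, 0.01),
    "name": (0.90, 0.05),
    "birth_year": (0.98, 0.02),
}


def agrees(a, b, threshold=0.85):
    """Treat two values as agreeing when their string similarity is high.

    The threshold is an assumption; it lets small spelling differences
    still count as agreement.
    """
    ratio = SequenceMatcher(None, str(a).lower(), str(b).lower()).ratio()
    return ratio >= threshold


def match_weight(rec_a, rec_b):
    """Sum of log2 likelihood ratios over the compared variables."""
    weight = 0.0
    for field, (m, u) in M_U.items():
        if agrees(rec_a[field], rec_b[field]):
            weight += math.log2(m / u)              # agreement adds evidence
        else:
            weight += math.log2((1 - m) / (1 - u))  # disagreement subtracts it
    return weight
```

Under these assumed values, a pair that agrees on every variable (allowing for a misspelled surname such as "Rosi" for "Rossi") receives a positive weight, while a pair that disagrees on every variable receives a negative one. In practice the probabilities would be estimated from the data rather than fixed by hand.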

Generally speaking, the deterministic and probabilistic approaches can be combined in a two-step process: first, the deterministic method is applied to the high-quality variables; then the probabilistic approach is applied to the residuals, i.e. the units not linked in the first step. Whether the two techniques are used jointly, however, depends on the aims of the whole linkage project.
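The two-step process described above can be sketched as follows. This is only an illustrative outline under stated assumptions: the high-quality identifier `tax_id`, the comparison variables, the (m, u) probabilities, the threshold and the greedy choice of the best-scoring candidate are all hypothetical and are not taken from this section.

```python
import math

# Hypothetical (m, u) probabilities for the variables used in step 2.
M_U = {"surname": (0.95, 0.01), "birth_year": (0.98, 0.02)}


def weight(a, b):
    """Matching weight as a sum of log2 likelihood ratios (exact comparison)."""
    total = 0.0
    for field, (m, u) in M_U.items():
        if a[field] == b[field]:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


def two_step_link(file_a, file_b, key="tax_id", threshold=5.0):
    """Link two lists of records: deterministic first, probabilistic on residuals."""
    # Step 1: deterministic linkage on the high-quality identifier.
    index_b = {r[key]: j for j, r in enumerate(file_b) if r.get(key)}
    links, used_b, residual_a = [], set(), []
    for i, rec in enumerate(file_a):
        j = index_b.get(rec.get(key))
        if j is not None and j not in used_b:
            links.append((i, j, "deterministic"))
            used_b.add(j)
        else:
            residual_a.append(i)

    # Step 2: probabilistic linkage on the units not linked in step 1.
    # Greedy best-candidate selection is a simplification for this sketch.
    for i in residual_a:
        candidates = [
            (weight(file_a[i], file_b[j]), j)
            for j in range(len(file_b))
            if j not in used_b
        ]
        if candidates:
            best_weight, best_j = max(candidates)
            if best_weight >= threshold:
                links.append((i, best_j, "probabilistic"))
                used_b.add(best_j)
    return links


file_a = [
    {"tax_id": "A1", "surname": "Rossi", "birth_year": 1980},
    {"tax_id": None, "surname": "Bianchi", "birth_year": 1975},
    {"tax_id": None, "surname": "Verdi", "birth_year": 1990},
]
file_b = [
    {"tax_id": "A1", "surname": "Rossi", "birth_year": 1980},
    {"tax_id": None, "surname": "Bianchi", "birth_year": 1975},
    {"tax_id": None, "surname": "Neri", "birth_year": 1960},
]
links = two_step_link(file_a, file_b)
```

In this toy example the first pair is linked deterministically through the identifier, the second pair is linked probabilistically because its weight exceeds the assumed threshold, and the third record of each file remains unlinked.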


