

ESSnet Statistical Methodology Project on Integration of Survey and Administrative Data


Report of WP3. Software tools for integration methodologies


LIST OF CONTENTS


Preface

  1. Software tools for record linkage (Monica Scannapieco – Istat)

    1. Comparison criteria for record linkage software tools (Monica Scannapieco – Istat)

    2. Probabilistic tools for record linkage

      1. Automatch (Nicoletta Cibella – Istat)

      2. Febrl (Miguel Guigo – INE)

      3. GRLS (Nicoletta Cibella – Istat)

      4. LinkageWiz (Monica Scannapieco – Istat)

      5. RELAIS (Monica Scannapieco – Istat)

      6. DataFlux (Monica Scannapieco – Istat)

      7. Link King (Marco Fortini – Istat)

      8. Trillium Software (Miguel Guigo – INE)

      9. Link Plus (Tiziana Tuoto – Istat)

    3. Summary tables and comparisons

      1. General features

      2. Strengths and weaknesses

  2. Software tools for statistical matching (Mauro Scanu – Istat)

    1. Comparison criteria for statistical matching software tools (Mauro Scanu – Istat)

    2. Statistical matching tools

      1. SAMWIN (Mauro Scanu – Istat)

      2. R codes (Marcello D’Orazio – Istat)

      3. SAS codes (Mauro Scanu – Istat)

      4. S-Plus codes (Marco Di Zio – Istat)

    3. Comparison tables

  3. Commercial software tools for data quality and record linkage in the process of microintegration (Jaroslav Kraus and Ondřej Vozár – CZSO)

    1. Data quality standardization requirements

    2. Data quality assessment

    3. Summary tables: Oracle, Netrics and SAS/DataFlux

      1. Oracle

      2. Netrics

      3. SAS

  4. Documentation, literature and references

    1. Bibliography for Section 1

    2. Bibliography for Section 2

    3. Bibliography for Section 3


Preface


This document is the deliverable of the third work package (WP) of the Centre of Excellence on Statistical Methodology. The objective of this WP is to review some existing software tools for the application of probabilistic record linkage and statistical matching methods.

The document is organized in three chapters.

The first chapter is on software tools for record linkage. On the basis of the underlying research paradigm, three major categories of record linkage tools can be identified:

  • Tools for probabilistic record linkage, mostly based on the Fellegi and Sunter model (Fellegi and Sunter, 1969).

  • Tools for empirical record linkage, which are mainly focused on performance issues and hence on reducing the search space of the record linkage problem by means of algorithmic techniques such as sorting, tree traversal, neighbour comparison, and pruning.

  • Tools for knowledge-based linkage, in which domain knowledge is extracted from the files involved and reasoning strategies are applied to make the decision process more effective.
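The search-space reduction mentioned for empirical tools can be illustrated with a minimal sketch of sorting followed by neighbour comparison. This is not taken from any of the reviewed tools: the function name, the record layout, the sorting key and the window size are all assumptions chosen for illustration.

```python
def sorted_neighbourhood_pairs(records, key, window=3):
    """Return candidate pairs of record ids that fall within a sliding
    window after sorting. All names here are hypothetical."""
    ordered = sorted(records, key=key)
    pairs = set()
    for i, rec in enumerate(ordered):
        # Compare each record only with the (window - 1) records that follow it.
        for other in ordered[i + 1 : i + window]:
            pairs.add(tuple(sorted((rec["id"], other["id"]))))
    return pairs


# Illustrative records (invented data).
records = [
    {"id": 1, "surname": "Rossi"},
    {"id": 2, "surname": "Rosi"},
    {"id": 3, "surname": "Bianchi"},
    {"id": 4, "surname": "Bianci"},
    {"id": 5, "surname": "Verdi"},
]

candidates = sorted_neighbourhood_pairs(
    records, key=lambda r: r["surname"], window=2
)
all_pairs = len(records) * (len(records) - 1) // 2

print(len(candidates), "candidate pairs instead of", all_pairs)
```

With a window of 2, only adjacent records in the sorted order are compared, so similar surnames end up as candidate pairs while most of the full cross product is never examined. A wider window trades more comparisons for a lower risk of missing true matches.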

Given this variety of proposals, this document restricts its attention to record linkage tools that have the following characteristics:

  • They have been explicitly developed for record linkage;

  • They are based on a probabilistic paradigm.

Two sets of comparison criteria were used to compare several probabilistic record linkage tools. The first set considers general characteristics of the software: its cost; its domain specificity (i.e. whether the tool was developed ad hoc for a specific type of data and applications); and its maturity (or level of adoption, i.e. frequency of usage, where available, and the number of years the tool has been around). The second set considers which functionalities the tool performs: preprocessing/standardization; profiling; comparison functions; decision method.
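To make the terms "comparison functions" and "decision method" concrete, the following is a minimal sketch of a decision rule in the spirit of the Fellegi and Sunter model. It assumes simple agree/disagree comparisons, conditional independence between fields, and invented values for the m-probabilities (agreement among true matches), the u-probabilities (agreement among non-matches) and the two thresholds. It does not reproduce the implementation of any reviewed tool.

```python
import math


def fs_weight(agreements, m, u):
    """Sum the log2 likelihood-ratio weights over the compared fields.

    agreements maps field name -> True (values agree) or False.
    m[field] = P(agree | true match), u[field] = P(agree | non-match).
    """
    total = 0.0
    for field, agrees in agreements.items():
        if agrees:
            total += math.log2(m[field] / u[field])
        else:
            total += math.log2((1 - m[field]) / (1 - u[field]))
    return total


def decide(weight, lower, upper):
    """Classify a pair using two thresholds (values are assumptions)."""
    if weight >= upper:
        return "link"
    if weight <= lower:
        return "non-link"
    return "possible link"


# Hypothetical parameters, for illustration only.
m = {"surname": 0.9, "birth_year": 0.8}
u = {"surname": 0.1, "birth_year": 0.2}
LOWER, UPPER = -3.0, 4.0

both_agree = fs_weight({"surname": True, "birth_year": True}, m, u)
mixed = fs_weight({"surname": True, "birth_year": False}, m, u)
both_differ = fs_weight({"surname": False, "birth_year": False}, m, u)

for w in (both_agree, mixed, both_differ):
    print(round(w, 2), decide(w, LOWER, UPPER))
```

Pairs whose total weight falls between the two thresholds are left as possible links, which in practice are usually sent to clerical review.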

Chapter 2 deals with software tools for statistical matching. Software solutions for statistical matching are not as widespread as those for record linkage, because statistical matching projects are still quite rare in practice. Almost all applications are carried out by means of ad hoc codes. Sometimes, when the objective is micro (that is, the construction of a complete synthetic data set), it is possible to use general-purpose imputation software tools. On the other hand, if the objective is macro (that is, the estimation of parameters of the joint distribution), it is possible to adopt general statistical analysis tools that are able to deal with missing data.

In this chapter, the available tools explicitly devoted to statistical matching were reviewed. Only one of them (SAMWIN) is a software tool that can be used without any programming skills, while the others are software codes that can be used only by those with knowledge of the corresponding language (R, S-Plus, SAS) as well as a sound knowledge of statistical methodology.

The criteria used for comparing the software tools for statistical matching were slightly different from those for record linkage. For the general characteristics, attention is restricted to the cost, domain specificity and maturity of the software tool. As far as the software functionalities are concerned, the focus is on: i) the inclusion of pre-processing and standardization tools; ii) the capacity to create a complete and synthetic data set by the fusion of the two data sources to be integrated; iii) the capacity to estimate parameters of the joint distribution of variables never jointly observed; iv) the assumptions on the model of the variables of interest under which the software tool works (the best known is the conditional independence assumption of the variables not jointly observed given the common variables in the two data sources); v) the presence of any quality assessment of the results.
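The conditional independence assumption mentioned in point iv) can be written out as follows. The notation is an assumption introduced here for illustration: X denotes the variables common to the two data sources, Y the variables observed only in the first source, and Z the variables observed only in the second.

```latex
% Conditional independence of Y and Z given X:
f(x, y, z) = f(y \mid x)\, f(z \mid x)\, f(x)
```

Under this assumption the joint distribution is identified from the two sources, because f(y | x) can be estimated from the first source, f(z | x) from the second, and f(x) from either or both.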

Furthermore, the software tools are compared according to the implemented methodologies. Strengths and weaknesses of each software tool are highlighted at the end.

Chapter 3 focuses on commercial software tools for data quality and record linkage in the process of microintegration. Vendors in the data quality market are often classified according to their overall position in the IT business, where a focus on specific business knowledge and experience in a specific business domain plays an important role. The quality of vendors and of their products on the market is characterized by: i) product features and relevant services; ii) vendor characteristics, domain business understanding, business strategy, creativity, innovation; iii) sales characteristics, licensing, prices; iv) customer experience, reference projects; v) data quality tools and frameworks.

Software vendors in the statistics-oriented “data quality market” propose solutions addressing all the tasks in the entire life cycle of data-oriented management programs and projects: data preparation, survey data collection, improvement of quality and integrity, preparation of reports and studies, etc.

Regardless of the software/application category, tools that perform or support data-oriented record linkage projects in statistics should share several characteristics:

  1. portability, i.e. the ability to function with statistical researchers' current arrangement of computer systems and languages,

  2. flexibility in handling different linkage strategies, and

  3. low operational expenses, i.e. low costs in terms of TCO (Total Cost of Ownership) and in terms of both computing time and researchers' effort.

In this chapter the evaluation focused on three commercial software packages from vendors that, according to their data quality scoring position in the Gartner reports (the so-called “magic quadrants”, available on the web page http://www.gartner.com), are important in this area. The three vendors are: Oracle (which represents the convergence of tools and services in the software market), SAS/DataFlux (a data quality, data integration and BI (Business Intelligence) player on the market), and Netrics (which offers advanced technology complementing the Oracle data quality and integration tools).

The set of comparison tables was prepared according to the following structure: linkage methodology, data management, post-linkage function, standardization, costs and empirical testing, case studies and examples.