Duplicate detection and removal process

Data is sent to SUS using one of two different protocols:

  • Net Interchange allows the sender to add or to identify and replace individual records.
  • Bulk Interchange allows the sender to add or replace a set of records covering a specified period.

If the user intends to replace a pre-existing record or set of records but does not identify them correctly, the replacement records are added to the dataset alongside the originals, creating duplicates. This is referred to as a failure to overwrite, and it is what the HES duplicate detection and deletion process focuses on.

Because the fields that make up each HES collection (inpatient, outpatient and A&E) differ, the process for detecting duplicates varies slightly between collections, although all follow the same broad pattern. The specific details for each commissioning data set (CDS) type can be found in the 'Further information' section (see below).

Failure to overwrite

A failure to overwrite occurs when errors or changes in the overwriting keys prevent the pre-existing data from being identified and overwritten during subsequent CDS submissions, creating duplicates within SUS. These duplicates are not present within the submitting organisation's patient administration system (PAS).

Failures to overwrite occur most often in bulk submissions. Net and bulk interchanges use different sets of overwriting keys. Bulk interchanges use sender code, prime recipient, bulk start date and bulk end date. Net interchanges use sender code and the CDS unique ID.
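
As a rough illustration of why a changed key produces a duplicate, the two key sets can be modelled as tuples that must match exactly for an overwrite to happen. This is a minimal Python sketch, not the SUS implementation: the class names, attribute names, the `overwrites` helper and the sample codes are all hypothetical.

```python
from datetime import date
from typing import NamedTuple


class BulkKey(NamedTuple):
    # Key components named in the text; the attribute names are assumptions.
    sender_code: str
    prime_recipient: str
    bulk_start_date: date
    bulk_end_date: date


class NetKey(NamedTuple):
    sender_code: str
    cds_unique_id: str


def overwrites(existing_key, incoming_key):
    """Assumed rule: data is replaced only when every key component matches."""
    return type(existing_key) is type(incoming_key) and existing_key == incoming_key


# Made-up codes: the resubmission covers a slightly different period.
first = BulkKey("AAA", "BBB", date(2011, 4, 1), date(2011, 4, 30))
resent = BulkKey("AAA", "BBB", date(2011, 4, 1), date(2011, 5, 1))

print(overwrites(first, first))   # True: same keys, old data is replaced
print(overwrites(first, resent))  # False: keys differ, records are added alongside
```

In the second case the earlier April records would stay in the dataset next to the resubmitted ones, which is the situation the rest of this page describes.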

For more information about interchanges, including technical details, please see the Connecting for Health website.

Implications of provider code mapping

HES performs provider mapping to ensure that all records have a valid provider code. If records have been submitted under both an obsolete and current provider code, this may lead to the creation of duplicates; these duplicates are detected and removed during this process.
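
The effect of mapping can be sketched in a few lines of Python. The provider codes, field names and the `map_provider` helper below are invented for illustration, and the real HES mapping tables are not reproduced here.

```python
# Hypothetical mapping from an obsolete provider code to its current code.
PROVIDER_MAP = {"OLD01": "NEW01"}


def map_provider(record):
    """Return a copy of the record carrying a current provider code."""
    mapped = dict(record)
    mapped["provider"] = PROVIDER_MAP.get(record["provider"], record["provider"])
    mapped["provider_mapped"] = mapped["provider"] != record["provider"]
    return mapped


submitted = [
    {"provider": "OLD01", "patient": "P1", "arrival": "2011-04-01 09:30"},
    {"provider": "NEW01", "patient": "P1", "arrival": "2011-04-01 09:30"},
]
mapped = [map_provider(r) for r in submitted]

# Before mapping the two records look distinct; afterwards they collide.
print(len({(r["provider"], r["patient"], r["arrival"]) for r in submitted}))  # 2
print(len({(r["provider"], r["patient"], r["arrival"]) for r in mapped}))     # 1
```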

Please consult 'Provider code mapping within HES' in our 'Quick links' section for further information.

Detecting failures to overwrite at an individual record level

For each CDS type, a set of circumstances is identified in which it is highly unlikely that a single individual would appear twice for the same provider. For example, it is highly unlikely that somebody would arrive in A&E at the same provider twice on the same day at the same time. You can see the selection procedures for each collection in the 'Further information' section (below).

If a single individual, identified by NHS number or local patient ID, appears more than once in such circumstances, it is likely that a failure to overwrite has occurred. This collection of records, for the same patient activity, is referred to as a 'matched set of records'.
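
The grouping step can be sketched as follows, using the A&E fields as the example. This is only an illustration under stated assumptions: the field names, the `patient_id` and `matched_sets` helpers, and the rule of preferring NHS number over local patient ID are not taken from the HES specification.

```python
from collections import defaultdict


def patient_id(record):
    # Assumption: prefer the NHS number and fall back to the local patient ID.
    if record.get("nhs_number"):
        return ("nhs", record["nhs_number"])
    return ("local", record["local_patient_id"])


def matched_sets(records, activity_fields):
    """Group records by provider, patient and activity; keep groups of 2 or more."""
    groups = defaultdict(list)
    for record in records:
        activity = tuple(record[field] for field in activity_fields)
        groups[(record["provider"], patient_id(record)) + activity].append(record)
    return [group for group in groups.values() if len(group) > 1]


ae_records = [
    {"provider": "NEW01", "nhs_number": "N1", "arrival_date": "2011-04-01", "arrival_time": "09:30"},
    {"provider": "NEW01", "nhs_number": "N1", "arrival_date": "2011-04-01", "arrival_time": "09:30"},
    {"provider": "NEW01", "nhs_number": "N1", "arrival_date": "2011-04-01", "arrival_time": "17:45"},
]
sets = matched_sets(ae_records, ("arrival_date", "arrival_time"))
print(len(sets), len(sets[0]))  # 1 2
```

The third record is left out because a second attendance later the same day is plausible; only the two records with identical arrival date and time form a matched set.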

The detection process then differs for bulk and net submissions, because their overwriting rules are very different.

Identifying failures to overwrite in net submissions

The basic assumption is that the record with the latest submission date within any 'matched set of records' is the correct one. Records with an earlier submission date are marked as potential duplicates.

This method misidentifies some records as potential duplicates. However, when dealing with net submissions, no deletions are performed without further manual investigation. See the 'Removing records' section (below).
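
A minimal Python sketch of the net rule is shown below. The function name and record fields are hypothetical, and the handling of ties (records that share the latest submission date are all kept) is an assumption that the text does not address.

```python
from datetime import date


def flag_net_potential_duplicates(matched_set):
    """Return one flag per record: True marks a potential duplicate."""
    latest = max(record["submission_date"] for record in matched_set)
    # Assumption: records sharing the latest date are all treated as retained.
    return [record["submission_date"] < latest for record in matched_set]


matched_set = [
    {"cds_unique_id": "X1", "submission_date": date(2011, 5, 3)},
    {"cds_unique_id": "X2", "submission_date": date(2011, 6, 7)},
]
print(flag_net_potential_duplicates(matched_set))  # [True, False]
```

The flags only mark candidates for manual review; nothing in this sketch deletes a record.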

Flagging failures to overwrite in bulk submissions

This is a two-stage process.

Stage 1: Examining individual records

The latest submission date within any 'matched set of records' is accepted as the correct submission and as such retained. The overwriting keys (see 'failure to overwrite', above) are checked for any evidence of a failure to overwrite (or for evidence that HES provider code mapping has occurred). If evidence can be found then records other than the one with the latest submission date are identified as 'potential duplicate records'.
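
Stage 1 might look something like the Python sketch below. The text does not define what counts as evidence, so the `has_evidence` test here (any overwriting key differs from the retained record, or a record carries a hypothetical `provider_mapped` marker) is an assumption, as are all field and function names.

```python
from datetime import date

BULK_KEYS = ("sender_code", "prime_recipient", "bulk_start_date", "bulk_end_date")


def has_evidence(matched_set, latest):
    """Assumed test: keys differ from the latest record, or a provider code was mapped."""
    keys_differ = any(r[k] != latest[k] for r in matched_set for k in BULK_KEYS)
    mapped = any(r.get("provider_mapped", False) for r in matched_set)
    return keys_differ or mapped


def stage1_flags(matched_set):
    """Return one flag per record: True marks a potential duplicate."""
    latest = max(matched_set, key=lambda r: r["submission_date"])
    if not has_evidence(matched_set, latest):
        return [False] * len(matched_set)
    return [r is not latest for r in matched_set]


base = {"sender_code": "AAA", "prime_recipient": "BBB", "bulk_start_date": date(2011, 4, 1)}
older = dict(base, bulk_end_date=date(2011, 4, 30), submission_date=date(2011, 5, 3))
newer = dict(base, bulk_end_date=date(2011, 5, 1), submission_date=date(2011, 6, 7))
print(stage1_flags([older, newer]))  # [True, False]
```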

Stage 2: Examining entire submissions

Each CDS type uses a different field as its activity date. For example, A&E records are grouped by attendance date; outpatient records are grouped by appointment date; inpatient data is grouped by episode end date.

Submissions where 'potential duplicate' records have been identified are investigated individually. Each activity date occurring within the submission is considered in turn and if over 50 per cent of the records have been detected as duplicates, then all of the records (irrespective of whether they were detected as duplicates or not during the individual records stage) are marked as 'definite duplicates'.
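
The 50 per cent rule can be sketched in Python as below. The function name, field names and sample data are illustrative assumptions; "over 50 per cent" is read as strictly greater than one half.

```python
from collections import defaultdict


def definite_duplicates(submission_records, threshold=0.5):
    """Return every record on an activity date where flagged records exceed the threshold."""
    by_date = defaultdict(list)
    for record in submission_records:
        by_date[record["activity_date"]].append(record)

    definite = []
    for records in by_date.values():
        flagged = sum(1 for r in records if r["potential_duplicate"])
        if flagged / len(records) > threshold:
            # The whole activity date is marked, including unflagged records.
            definite.extend(records)
    return definite


submission = [
    {"activity_date": "2011-04-01", "potential_duplicate": True},
    {"activity_date": "2011-04-01", "potential_duplicate": True},
    {"activity_date": "2011-04-01", "potential_duplicate": False},
    {"activity_date": "2011-04-02", "potential_duplicate": True},
    {"activity_date": "2011-04-02", "potential_duplicate": False},
]
print(len(definite_duplicates(submission)))  # 3
```

Two of the three records on 1 April are flagged (about 67 per cent), so all three become definite duplicates. On 2 April exactly half are flagged, which does not exceed the threshold, so none are marked.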

Removing records

All 'definite duplicates' are removed automatically - this only occurs for bulk submissions.

Any 'potential duplicates' (bulk or net submissions) are manually investigated to identify patterns that indicate failures to overwrite. This is done by displaying aggregate figures for each provider by activity date, submission date and the results of the duplicates flags. This allows patterns to be observed based around the detection of 'potential duplicates'.
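
The aggregate view used for manual investigation could be approximated with a simple count, as in the Python sketch below. The field names and the `aggregate_flags` helper are hypothetical, and the actual reports used by HES are not described in this page.

```python
from collections import Counter


def aggregate_flags(records):
    """Count records per provider, activity date, submission date and flag value."""
    return Counter(
        (r["provider"], r["activity_date"], r["submission_date"], r["potential_duplicate"])
        for r in records
    )


records = [
    {"provider": "NEW01", "activity_date": "2011-04-01", "submission_date": "2011-05-03", "potential_duplicate": True},
    {"provider": "NEW01", "activity_date": "2011-04-01", "submission_date": "2011-05-03", "potential_duplicate": True},
    {"provider": "NEW01", "activity_date": "2011-04-01", "submission_date": "2011-06-07", "potential_duplicate": False},
]
for key, count in sorted(aggregate_flags(records).items()):
    print(key, count)
```

A run of activity dates where an earlier submission date is almost entirely flagged would be the kind of pattern a reviewer looks for.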

Feedback and improvement

During the processing of monthly HES data all duplicates are removed from HES without further consultation.

Details of all deletions undertaken are shared with trusts so that they can request data deletions from SUS to help ensure SUS remains accurate.

Further information: Details of duplicate detection

Each duplicate flag uses a slightly different detection method:

  • Inpatient: episode start date and episode end date are used as the basis for sorting records.
  • Outpatient: appointment date, main specialty, treatment specialty and consultant code are used as the basis for sorting records.
  • A&E: arrival date and arrival time are used as the basis for sorting records.
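
The three field lists above can be captured in a small lookup table, sketched here in Python. The dictionary keys, the snake_case field names and the `sort_records` helper are assumptions made for illustration only.

```python
# Sort fields per collection, taken from the list above; names are assumed.
SORT_FIELDS = {
    "inpatient": ("episode_start_date", "episode_end_date"),
    "outpatient": ("appointment_date", "main_specialty", "treatment_specialty", "consultant_code"),
    "ae": ("arrival_date", "arrival_time"),
}


def sort_records(records, cds_type):
    """Sort records by the fields used for the given collection."""
    fields = SORT_FIELDS[cds_type]
    return sorted(records, key=lambda r: tuple(r[f] for f in fields))


ae = [
    {"arrival_date": "2011-04-01", "arrival_time": "17:45"},
    {"arrival_date": "2011-04-01", "arrival_time": "09:30"},
]
print([r["arrival_time"] for r in sort_records(ae, "ae")])  # ['09:30', '17:45']
```

Sorting on these fields places records for the same activity next to each other, so repeated entries are easy to compare.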


Copyright © 2005-2012, The Health and Social Care Information Centre. All Rights Reserved.