How to de-duplicate records using a SQL query

All about how to de-dupe records using SQL.

How to de-duplicate records using a SQL query
Recently was asked the question - what are the options/practices for using SQL to deduplicate records?

This is one of those discussions that will be of interest to those that find questions like this interesting, so if you like to put records into tables and you find that SQL is part of your workflow for doing that then read on.

If de-duping with SQL is not part of your day-to-day, then don't despair because the discussion is designed to be easy an ease read for both beginner and expert alike.

This discussion is split into three parts:

  1. Simple example of using SQL to transform a current to future state.
  2. Advanced real world how to SQL de-duplication in practice use cases.
  3. Suggestions, questions and comments.

If you prefer to watch visit the YouTube, otherwise read on.


Simple example of using SQL to transform a current to future state

For the purpose of demonstration an example will be used to explore records in a table with similar colours. In practice, duplicates may contain account, name, email, phone and address details which we will explore in advanced SQL use cases.


The current (or initial) state is a table with many duplicate records each containing a colour

Pictured we see a table containing many records, with each record containing similar records.

Our current or initial state is a table with many records containing many duplication records.

To help us visualise this consider a table with many records where the many duplicate records each contains a colour.

In the pictured example there are sixteen records representative of eight colours consisting:

  1. Brown records occurring once.
  2. Grey records occurring once.
  3. Blue records occurring twice.
  4. Green records occurring four times.
  5. Purple records occurring twice
  6. Red records occurring trice.
  7. Orange records occurring twice.
  8. Yellow records occurring once.

Future (or desired) state of table with one record per colour

Pictured we see a table containing many records, with each record containing records that are different.

Once again, to help us visualise this consider a table where there is only one record for each of the eight colours.

In our future or desired state there is a table with eight records each containing only one colour.

Other way of considering the future state is that there is a table containing no duplicate records for any one colour.


SQL used for transformation of the simple current to future states

From a SQL point of view the query required will be as follows.


DROP TABLE IF EXISTS "source";
CREATE TABLE source ( id VARCHAR(5) , colour VARCHAR(12) );

DROP TABLE IF EXISTS "destination";
CREATE TABLE destination ( id VARCHAR(5) , colour VARCHAR(12) );

INSERT INTO source (id,colour)
VALUES ("1","BROWN"),("2","GREY"),("3","BLUE"),("4","BLUE"),("5","GREEN"),("6","GREEN"),("7","GREEN"),("8","GREEN"),("9","PURPLE"),("10","PURPLE"),("11","RED"),("12","RED"),("13","RED"),("14","ORANGE"),("15","ORANGE"),("16","YELLOW");

INSERT INTO destination  
SELECT id,colour 
FROM source a
WHERE a.ROWID IN
  (SELECT ROWID FROM
     (SELECT
      ROWID,
      ROW_NUMBER() OVER
        (PARTITION BY colour
         ORDER BY colour) dup
    FROM source)
    WHERE dup = 1);
    

SELECT id,colour FROM destination;
This can be run as is in mySQL and msSQL.

The result of this statement will be that only one record for each colour will be projected into the destination table.

💡
If you are using a platform like Salesforce Marketing Cloud then the SQL INSERT INTO destination statement will be replaced with the target data extension in the query.

Advanced SQL deduplication use cases

From organisation to chaos

Before we dive into the how, let's look at the why do we find this deduplication requirement exists ...

Well it goes something like this ... person one has a way of doing things, then person two comes along and does it their way, and then person three joins the party and so on and so forth. When this get's applied to sending email campaigns and everyone is selecting different fields to base the SubscriberKey on then things can quickly become very chaotic with respect to how the marketing platform determines the individual subscriber.

💡
TAKEAWAY - make sure there are some rules in place so that everyone knows what field will be used as the SubscriberKey.

Now that we have that out of the way.

Generally are three reasons for deduplication:

  1. Improve the customer experience - make it so that only one personalised email is sent to the customer instead of two, three or more.
  2. Ensure that the customer does not get spammed - if there are duplicate SubscriberKeys for the same account, and the subscriber has opted out of one, then the subscriber may need to be opted out of all.
  3. Reduce costs - with most marketing cloud platforms using a per contact pricing model it makes sense to remove any unnecessary duplicates.

For the purposes of this discussion we will from an implementation point of view consider that the result of the de-duplication processes needs to update a destination table. Once the records are in the destination table then some other next action can be taken, for example updating subscriber opt-in/out status, deleting contacts and updating external platforms updated accordingly which will be explored separately.

The previous "simple" example was used for the purpose of demonstration to explored the SQL for deduplicating records with similar colours.

In practice, whilst duplicate records are often more complex as they may contain the same or similar account, email and opt-in/out in different fields the procedure is much the same as deduplicating similar colours.

The following is a de-duping process flow to demonstrate this.

Deduplicating process flow

In the case of deduplicating consider four records that contain the same values in each of the [Source].email and [Source].ContactID fields.

For this more advanced use case then two fields will be considered for candidates of the deduplication.

DROP TABLE IF EXISTS "source";
CREATE TABLE source ( email VARCHAR(50) , ContactID VARCHAR(50) );

DROP TABLE IF EXISTS "destination";
CREATE TABLE destination ( dedupeKey VARCHAR(50)  NOT NULL, PRIMARY KEY (dedupeKey) );

INSERT INTO source (email,ContactID)
VALUES ("[email protected]",""),("","[email protected]"),("[email protected]",""),("","[email protected]");

INSERT or REPLACE INTO destination  
SELECT email AS dedupeKey 
FROM source a
WHERE a.ROWID IN
  (SELECT ROWID FROM
     (SELECT
      ROWID,
      ROW_NUMBER() OVER
        (PARTITION BY email
         ORDER BY email) dup
    FROM source)
    WHERE dup = 1);


INSERT or REPLACE INTO destination  
SELECT ContactID as dedupeKey 
FROM source a
WHERE a.ROWID IN
  (SELECT ROWID FROM
     (SELECT
      ROWID,
      ROW_NUMBER() OVER
        (PARTITION BY ContactID
         ORDER BY ContactID) dup
    FROM source)
    WHERE dup = 1);


SELECT dedupeKey FROM destination;
This can be run as is in mySQL and msSQL.

The result will be a destination table containing two rows with the field dedupeKey having the values:

  1. [email protected]
  2. [email protected]
💡
Once again, if you are using a platform like Salesforce Marketing Cloud then the SQL INSERT INTO destination statement will be replaced with the target data extension in the query.

So for the purposes of this discussion, hopefully there is some insight into how to recover from potential duplication complexity chaos to easy de-duplicated order.

Potential duplication complexity chaos to easy de-duplicated order

Future discussions will dive into how to deduplicate contacts, updating a single source of truth and pricing models used by marketing clouds.


Suggestions and questions all welcome

Have you got something that can help others or a question? Feel very welcome to share your thoughts in the comments.