Finding Duplicates using SQL

While migrating old data to a new database schema I was unable to activate a PRIMARY KEY on the legacy_customer_id field because there were duplicates. It turned out that the old application did not clean the data really well, and it allowed duplicates to be created where one of the customer_ids had a space character making it unique.

I used the following query to test for others:

SELECT
    customer_id,
    COUNT(customer_id)
FROM
    customers
GROUP BY
    customer_id
HAVING
    (COUNT(customer_id) > 1);

This allowed me to find all customer_ids that had duplicates and clean things up.

MySQL FIND_IN_SET

I was faced with a field in the database that had a comma separated list of INT’s with a space after the comma. The application searched this field to generate the recordset to display on a web page.

First I tried using LIKE, with something along this line: ($state_id is being passed by PHP)

SELECT
    *
WHERE
    `assoc_states` LIKE '%$state_id%';

Well, as you can guess if I was searching for an INT less than two numbers I would get anything that had that digit in it. (Ex. – Searching for ‘%4%’ would also give me entries of 4, 14, 24, 34, 42, 44, etc.)

I may have been able to use RLIKE, but I didn’t fully understand it and there were no good examples of doing what I needed.

That was when I stumbled across FIND_IN_SET. Here is what MySQL.com has to say about it:

Returns a value in the range of 1 to N if the string str is in the string list strlist consisting of N substrings. A string list is a string composed of substrings separated by “,” characters. If the first argument is a constant string and the second is a column of type SET, the FIND_IN_SET() function is optimized to use bit arithmetic. Returns 0 if str is not in strlist or if strlist is the empty string. Returns NULL if either argument is NULL. This function does not work properly if the first argument contains a comma (“,”) character.

SELECT
    FIND_IN_SET('b','a,b,c,d');
 
// output -> 2

Here is what I ended up with: ($state_id is being passed by PHP)

SELECT
    *
WHERE
    FIND_IN_SET('$state_id', `assoc_state`) > 0;