How to Use a CTE to Remove Duplicates

Removing duplicate records from a table is a common task in database management. One efficient and readable way to achieve this is by using Common Table Expressions (CTEs) in T-SQL. This article demonstrates how to use a CTE to identify and delete duplicate rows from a table.

Step 1: Create the Table

First, let’s create a table called Employee and populate it with some data, including duplicate rows:

CREATE TABLE Employee (name NVARCHAR(100), age INT, salary INT);

INSERT INTO Employee
VALUES 
    ('Mark', 42, 120000),
    ('Susan', 29, 125000),
    ('Mark', 42, 120000);

Now query the table:

SELECT * FROM Employee;

You’ll see that the table contains duplicate rows:

name	age	salary
Mark	42	120000
Susan	29	125000
Mark	42	120000

Step 2: Use a CTE to Identify Duplicate Rows

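We can use a CTE to number the rows within each group of identical records. ROW_NUMBER() with PARTITION BY name, age, salary restarts the numbering at 1 for every distinct combination of values, so any row numbered 2 or higher is a duplicate. ROW_NUMBER() requires an ORDER BY clause, but because it lists the same columns as the partition, every row in a group ties and which copy receives number 1 is arbitrary. That is acceptable here, since the copies are identical. Here’s the query: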

WITH CTE_Employee(name, age, salary, rownumber) AS (
    SELECT 
        name,   
        age,   
        salary,   
        ROW_NUMBER() OVER (
            PARTITION BY name, age, salary 
            ORDER BY name, age, salary
        ) AS rownumber
    FROM Employee
)

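On its own, a CTE is not a complete statement. It must be followed immediately by the statement that consumes it, which Step 3 provides. In this CTE:

PARTITION BY groups rows with the same name, age, and salary.
ROW_NUMBER() assigns a sequential number, starting at 1, to each row within a partition.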

For duplicate records, all rows except the first one will have a rownumber greater than 1.
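
To see the numbering before deleting anything, the CTE can be paired with a plain SELECT. This is a minimal sketch against the Employee table from Step 1; the final ORDER BY is an addition here, used only to make the output order predictable.

```sql
-- Preview the row numbers without modifying the table.
WITH CTE_Employee AS (
    SELECT
        name,
        age,
        salary,
        ROW_NUMBER() OVER (
            PARTITION BY name, age, salary
            ORDER BY name, age, salary
        ) AS rownumber
    FROM Employee
)
SELECT name, age, salary, rownumber
FROM CTE_Employee
ORDER BY name, age, salary, rownumber;
```

With the sample data, the two Mark rows should receive rownumber 1 and 2, while the single Susan row receives 1.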

Step 3: Delete Duplicate Rows

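Once the duplicates are identified, we can delete every row whose rownumber is not 1. The DELETE must directly follow the CTE definition from Step 2 as part of the same statement, because a CTE exists only for the single statement it is attached to: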

DELETE
FROM CTE_Employee
WHERE rownumber <> 1;

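Because the CTE reads from a single table, deleting through it removes the rows from the underlying Employee table. The query removes all duplicate rows while retaining one instance of each unique record.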
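
Put together, Steps 2 and 3 form a single statement. The sketch below assumes the Employee table from Step 1 and wraps the delete in a transaction, which is an optional precaution added here rather than part of the original steps, so the result can be inspected before it becomes permanent.

```sql
BEGIN TRANSACTION;

-- The CTE and the DELETE are one statement: the CTE numbers the rows,
-- and the DELETE removes every copy after the first in each group.
WITH CTE_Employee AS (
    SELECT
        name,
        age,
        salary,
        ROW_NUMBER() OVER (
            PARTITION BY name, age, salary
            ORDER BY name, age, salary
        ) AS rownumber
    FROM Employee
)
DELETE FROM CTE_Employee
WHERE rownumber > 1;

-- Inspect the result; run ROLLBACK TRANSACTION instead of COMMIT to undo.
SELECT * FROM Employee;

COMMIT TRANSACTION;
```

The condition rownumber > 1 selects the same rows as rownumber <> 1, since ROW_NUMBER() never returns a value below 1.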

Step 4: Verify the Results

After running the deletion query, you can verify that the duplicates have been removed by querying the table again:

SELECT * FROM Employee;

You should now see only unique records in the Employee table.
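
On larger tables, scanning the output by eye is impractical. One way to check, sketched here under the assumption that a duplicate means identical name, age, and salary values, is to group on those columns and look for any group with more than one row. The alias copies is illustrative.

```sql
-- List any combination of values that still appears more than once.
SELECT name, age, salary, COUNT(*) AS copies
FROM Employee
GROUP BY name, age, salary
HAVING COUNT(*) > 1;
```

An empty result means no duplicates remain.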

Using CTEs to manage duplicate records in T-SQL is both powerful and readable. The ROW_NUMBER() function, combined with PARTITION BY, makes it easy to identify and handle duplicate rows efficiently. Try applying this approach to your datasets to simplify data cleansing tasks.
