Removing duplicate records from a table is a common task in database management. One efficient and readable way to achieve this is by using Common Table Expressions (CTEs) in T-SQL. This article demonstrates how to use a CTE to identify and delete duplicate rows from a table.
Step 1: Create the Table
First, let’s create a table called Employee and populate it with some data, including duplicate rows:
CREATE TABLE Employee (
    name NVARCHAR(100),
    age INT,
    salary INT
);
INSERT INTO Employee (name, age, salary) VALUES
    ('Mark', 42, 120000),
    ('Susan', 29, 125000),
    ('Mark', 42, 120000);
At this point, if you query the table using:
SELECT * FROM Employee;
You’ll see that the table contains duplicate rows:
name | age | salary |
---|---|---|
Mark | 42 | 120000 |
Susan | 29 | 125000 |
Mark | 42 | 120000 |
Step 2: Use a CTE to Identify Duplicate Rows
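We can use a CTE to number each record within its group of identical rows. Partitioning by name, age, and salary places exact duplicates in the same partition. ROW_NUMBER() requires an ORDER BY clause; because every row in a partition is identical here, ordering by the same columns is enough, and it does not matter which copy is numbered 1. A CTE cannot be run on its own, so the query below follows the definition with a SELECT that displays the numbered rows:
WITH CTE_Employee (name, age, salary, rownumber) AS (
    SELECT name, age, salary,
           ROW_NUMBER() OVER (
               PARTITION BY name, age, salary
               ORDER BY name, age, salary
           ) AS rownumber
    FROM Employee
)
SELECT * FROM CTE_Employee;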
In this CTE:
- PARTITION BY groups rows with the same name, age, and salary.
- ROW_NUMBER() assigns a sequential number, starting at 1, to each row within a partition.
For duplicate records, all rows except the first one will have a rownumber greater than 1.
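To preview exactly which rows are flagged before changing anything, the same CTE can be followed by a SELECT that filters on rownumber. This is a minimal sketch that assumes the Employee table and sample data from Step 1; the ORDER BY on the outer query is only there to make the output order predictable.

```sql
-- Preview only the rows that would be treated as duplicates.
-- Assumes the Employee table and sample data from Step 1.
WITH CTE_Employee (name, age, salary, rownumber) AS (
    SELECT name, age, salary,
           ROW_NUMBER() OVER (
               PARTITION BY name, age, salary
               ORDER BY name, age, salary
           ) AS rownumber
    FROM Employee
)
SELECT name, age, salary, rownumber
FROM CTE_Employee
WHERE rownumber > 1          -- rows numbered 2, 3, ... are the extra copies
ORDER BY name, age, salary, rownumber;
```

With the sample data, this should return a single row: the second copy of the Mark record, with rownumber 2.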
Step 3: Delete Duplicate Rows
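Once the duplicates are identified, we can delete every row whose rownumber is not 1, which keeps exactly one copy of each record. A CTE exists only for the single statement that immediately follows it, so the CTE definition and the DELETE must be run together as one statement:
WITH CTE_Employee (name, age, salary, rownumber) AS (
    SELECT name, age, salary,
           ROW_NUMBER() OVER (
               PARTITION BY name, age, salary
               ORDER BY name, age, salary
           ) AS rownumber
    FROM Employee
)
DELETE FROM CTE_Employee WHERE rownumber <> 1;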
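This query removes all duplicate rows while retaining one instance of each unique record. Because the CTE selects from a single base table, deleting from CTE_Employee deletes the corresponding rows from the Employee table itself.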
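Because a DELETE is hard to undo, one cautious way to try the statement is inside an explicit transaction, checking the result before deciding whether to keep it. The sketch below is an illustration rather than part of the original steps. It assumes the Employee table from Step 1 and ends with ROLLBACK, so nothing changes until you swap in COMMIT.

```sql
-- The statement before WITH must end with a semicolon.
BEGIN TRANSACTION;

WITH CTE_Employee (name, age, salary, rownumber) AS (
    SELECT name, age, salary,
           ROW_NUMBER() OVER (
               PARTITION BY name, age, salary
               ORDER BY name, age, salary
           ) AS rownumber
    FROM Employee
)
DELETE FROM CTE_Employee
WHERE rownumber > 1;

-- @@ROWCOUNT reports the rows affected by the previous statement,
-- so it has to be read immediately after the DELETE.
SELECT @@ROWCOUNT AS deleted_rows;

-- Inspect the table as it would look after the delete.
SELECT name, age, salary FROM Employee;

-- Undo the delete; use COMMIT TRANSACTION instead to make it permanent.
ROLLBACK TRANSACTION;
```

With the sample data, deleted_rows should be 1, since only the second copy of the Mark record is removed.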
Step 4: Verify the Results
After running the deletion query, you can verify that the duplicates have been removed by querying the table again:
SELECT * FROM Employee;
You should now see only unique records in the Employee table.
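On a larger table, scanning the full output by eye is impractical. As a complementary check, a GROUP BY with HAVING (a different technique from the ROW_NUMBER() approach used above) can report any combination of values that still occurs more than once. This sketch assumes the same Employee table and treats rows as duplicates only when name, age, and salary all match.

```sql
-- List any (name, age, salary) combination that still appears more than once.
-- Assumes the Employee table from Step 1.
SELECT name, age, salary, COUNT(*) AS copies
FROM Employee
GROUP BY name, age, salary
HAVING COUNT(*) > 1;
```

An empty result set indicates that no exact duplicates remain.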
Using CTEs to manage duplicate records in T-SQL is both powerful and readable. The ROW_NUMBER() function, combined with PARTITION BY, makes it easy to identify and handle duplicate rows efficiently. Try applying this approach to your datasets to simplify data cleansing tasks.