Removing duplicate records from a table is a common task in database management. One efficient and readable way to do this in T-SQL is with a Common Table Expression (CTE). This article demonstrates how to use a CTE to identify and delete duplicate rows from a table.
Step 1: Create the Table
First, let’s create a table called Employee and populate it with some data, including duplicate rows:
CREATE TABLE Employee (name NVARCHAR(100), age INT, salary INT);
INSERT INTO Employee
VALUES
('Mark', 42, 120000),
('Susan', 29, 125000),
('Mark', 42, 120000);
At this point, if you query the table using:
SELECT * FROM Employee;
You’ll see that the table contains duplicate rows:
| name | age | salary |
|---|---|---|
| Mark | 42 | 120000 |
| Susan | 29 | 125000 |
| Mark | 42 | 120000 |
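On a larger table, spotting duplicates by eye is impractical. As a rough sketch, a grouped count can list each combination that occurs more than once; the column alias copies is just an illustrative name, not part of the original example.

```sql
-- List every (name, age, salary) combination that appears more than once,
-- together with how many copies exist. "copies" is an illustrative alias.
SELECT
    name,
    age,
    salary,
    COUNT(*) AS copies
FROM Employee
GROUP BY name, age, salary
HAVING COUNT(*) > 1;
```

With the sample data above, this returns a single row for Mark (42, 120000) with copies equal to 2.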
Step 2: Use a CTE to Identify Duplicate Rows
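We can use a CTE to assign a row number to each record. Partitioning the data by name, age, and salary places identical rows in the same group, and ROW_NUMBER() then numbers the rows within each group. ROW_NUMBER() requires an ORDER BY clause; because every row in a partition has the same values in these columns, the order within a partition is arbitrary, which is fine here since it does not matter which copy is kept. Here’s the query: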
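WITH CTE_Employee(name, age, salary, rownumber) AS (
    SELECT
        name,
        age,
        salary,
        ROW_NUMBER() OVER (
            PARTITION BY name, age, salary
            ORDER BY name, age, salary
        ) AS rownumber
    FROM Employee
)
-- A CTE must be followed by a statement that uses it; here we simply inspect the numbering.
SELECT * FROM CTE_Employee;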
In this CTE:
- PARTITION BY groups rows with the same name, age, and salary.
- ROW_NUMBER() assigns a sequential number, starting at 1, to each row within a partition.
For duplicate records, all rows except the first one will have a rownumber greater than 1.
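Before deleting anything, it can be worth previewing exactly which rows would be removed. The following is a minimal sketch that reuses the same CTE and filters on rownumber; it only reads from the table and changes nothing.

```sql
-- Preview only: show the rows that a later DELETE would remove.
WITH CTE_Employee(name, age, salary, rownumber) AS (
    SELECT
        name,
        age,
        salary,
        ROW_NUMBER() OVER (
            PARTITION BY name, age, salary
            ORDER BY name, age, salary
        ) AS rownumber
    FROM Employee
)
SELECT name, age, salary, rownumber
FROM CTE_Employee
WHERE rownumber > 1;  -- every copy after the first in each partition
```

With the sample data, this returns one row: the second copy of Mark (42, 120000), with rownumber equal to 2.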
Step 3: Delete Duplicate Rows
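Once the duplicates are identified, we can delete every row whose rownumber is not 1, which keeps exactly one copy of each record. A CTE is visible only to the single statement that immediately follows its definition, so the DELETE must be placed directly after the closing parenthesis of the CTE from Step 2, as part of the same statement:
DELETE FROM CTE_Employee WHERE rownumber <> 1;
Because the CTE is defined over a single table, deleting through it removes the corresponding rows from Employee itself. The statement removes all duplicate rows while retaining one instance of each unique record.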
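Put together, the CTE definition and the DELETE form one statement. The sketch below shows the combined form; the leading semicolon is a common defensive habit (not something the original example uses) that guards against a preceding statement in the same batch that was not terminated.

```sql
-- Complete statement: the CTE definition and the DELETE run as a single unit.
-- The leading semicolon terminates any preceding statement in the batch.
;WITH CTE_Employee(name, age, salary, rownumber) AS (
    SELECT
        name,
        age,
        salary,
        ROW_NUMBER() OVER (
            PARTITION BY name, age, salary
            ORDER BY name, age, salary
        ) AS rownumber
    FROM Employee
)
DELETE FROM CTE_Employee
WHERE rownumber > 1;  -- keep the first row of each partition, delete the rest
```

With the sample data, this deletes one row: the second copy of Mark.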
Step 4: Verify the Results
After running the deletion query, you can verify that the duplicates have been removed by querying the table again:
SELECT * FROM Employee;
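You should now see only unique records in the Employee table (row order may vary):
| name | age | salary |
|---|---|---|
| Mark | 42 | 120000 |
| Susan | 29 | 125000 |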
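For tables too large to check by eye, one possible sanity check is to compare the total row count with the number of distinct rows. The aliases total_rows, distinct_rows, and d below are illustrative names chosen for this sketch.

```sql
-- If no duplicates remain, both counts are equal.
SELECT
    (SELECT COUNT(*) FROM Employee) AS total_rows,
    (SELECT COUNT(*)
     FROM (SELECT DISTINCT name, age, salary FROM Employee) AS d) AS distinct_rows;
```

After the deletion in Step 3, both values should be 2 for the sample data.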
Using CTEs to manage duplicate records in T-SQL is both powerful and readable. The ROW_NUMBER() function, combined with PARTITION BY, makes it easy to identify and handle duplicate rows efficiently. Try applying this approach to your datasets to simplify data cleansing tasks.