
How to Use a CTE to Remove Duplicates

Removing duplicate records from a table is a common task in database management. One efficient and readable way to achieve this is by using Common Table Expressions (CTEs) in T-SQL. This article demonstrates how to use a CTE to identify and delete duplicate rows from a table.

Step 1: Create the Table

First, let’s create a table called Employee and populate it with some data, including duplicate rows:

CREATE TABLE Employee (name NVARCHAR(100), age INT, salary INT);

INSERT INTO Employee
VALUES 
    ('Mark', 42, 120000),
    ('Susan', 29, 125000),
    ('Mark', 42, 120000);

At this point, if you query the table using:

SELECT * FROM Employee;

you’ll see that the table contains duplicate rows:

name    age    salary
Mark    42     120000
Susan   29     125000
Mark    42     120000

Step 2: Use a CTE to Identify Duplicate Rows

We can use a CTE to assign a row number to each record within its group of identical values. By partitioning the data by name, age, and salary, and ordering by the same columns, we can easily spot the duplicates. Note that a CTE is only valid for the single statement that immediately follows it, so the query below ends with a SELECT that lets you inspect the assigned row numbers:

WITH CTE_Employee(name, age, salary, rownumber) AS (
    SELECT
        name,
        age,
        salary,
        ROW_NUMBER() OVER (
            PARTITION BY name, age, salary
            ORDER BY name, age, salary
        ) AS rownumber
    FROM Employee
)
SELECT * FROM CTE_Employee;

In this CTE:

  • PARTITION BY groups rows with the same name, age, and salary.
  • ROW_NUMBER() assigns a unique number to each row within a partition.

For duplicate records, all rows except the first one will have a rownumber greater than 1.
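For the sample data, the query should return something like this (the order of rows within the result may vary):

name    age    salary    rownumber
Mark    42     120000    1
Mark    42     120000    2
Susan   29     125000    1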

Step 3: Delete Duplicate Rows

Once the duplicates are identified, we can delete them by keeping only the rows where rownumber is equal to 1. Because the CTE exists only for the statement that immediately follows it, the DELETE must be attached to the same WITH clause, replacing the SELECT from Step 2:

WITH CTE_Employee(name, age, salary, rownumber) AS (
    SELECT
        name,
        age,
        salary,
        ROW_NUMBER() OVER (
            PARTITION BY name, age, salary
            ORDER BY name, age, salary
        ) AS rownumber
    FROM Employee
)
DELETE
FROM CTE_Employee
WHERE rownumber <> 1;

Deleting through the CTE removes the corresponding rows from the underlying Employee table, so all duplicates are removed while one instance of each unique record is retained.
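The ORDER BY inside ROW_NUMBER() decides which copy in each group gets rownumber 1 and therefore survives. If your table has a key or timestamp column, you can order by it to control which row is kept. Here is a minimal sketch, assuming a hypothetical CreatedDate column that does not exist in the sample table:

WITH CTE_Employee AS (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY name, age, salary
            ORDER BY CreatedDate DESC  -- hypothetical column; keeps the most recent copy
        ) AS rownumber
    FROM Employee
)
DELETE
FROM CTE_Employee
WHERE rownumber <> 1;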

Step 4: Verify the Results

After running the deletion query, you can verify that the duplicates have been removed by querying the table again:

SELECT * FROM Employee;

You should now see only unique records in the Employee table.
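With the sample data, the result should contain one row per unique employee:

name    age    salary
Mark    42     120000
Susan   29     125000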

CTEs offer a powerful and readable way to manage duplicate records in T-SQL. The ROW_NUMBER() function, combined with PARTITION BY, makes it easy to identify and handle duplicate rows efficiently. Try applying this approach to your own datasets to simplify data cleansing tasks.
