How to Find Duplicate Values Across Multiple Columns in SQL?
In SQL, identifying duplicate entries across multiple columns is crucial for ensuring data integrity and quality. Whether we're working with large datasets or trying to clean up existing data, we can efficiently find duplicates using the GROUP BY clause together with the COUNT() function.
In this article, we'll focus on how to find duplicate entries across multiple columns in SQL using a practical example. This is particularly helpful when working with datasets that involve multiple attributes, like officer assignments in different locations, which can have duplicate values in more than one column.
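Before we turn to the example, the general pattern is worth sketching on its own. The table and column names below (my_table, col1, col2) are hypothetical placeholders; substitute your own table and whichever columns together define a "duplicate".

Query:
-- General pattern: group by the columns that define a duplicate,
-- then keep only the groups that occur more than once.
SELECT col1, col2, COUNT(*) AS qty
FROM my_table
GROUP BY col1, col2
HAVING COUNT(*) > 1;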
Examples of Finding Duplicates Across Multiple Columns
We'll use a sample table named POSTINGS inside the GeeksForGeeks database. This table tracks postings of officers in different countries, with columns for posting ID, officer name, team size, and posting location.
Query:
CREATE TABLE POSTINGS(
    POSTING_ID INT,
    OFFICER_NAME VARCHAR(10),
    TEAM_SIZE INT,
    POSTING_LOCATION VARCHAR(10)
);
INSERT INTO POSTINGS VALUES(1,'RYAN',10,'GERMANY');
INSERT INTO POSTINGS VALUES(2,'JACK',6,'ROMANIA');
INSERT INTO POSTINGS VALUES(3,'JANE',4,'HAWAII');
INSERT INTO POSTINGS VALUES(4,'JIM',10,'GERMANY');
INSERT INTO POSTINGS VALUES(5,'TIM',10,'GERMANY');
INSERT INTO POSTINGS VALUES(6,'RYAN',11,'GERMANY');
INSERT INTO POSTINGS VALUES(7,'RYAN',10,'GERMANY');
INSERT INTO POSTINGS VALUES(8,'RYAN',10,'GERMANY');
INSERT INTO POSTINGS VALUES(9,'JACK',6,'CUBA');
INSERT INTO POSTINGS VALUES(10,'JACK',6,'HAITI');
SELECT * FROM POSTINGS;
Output

| POSTING_ID | OFFICER_NAME | TEAM_SIZE | POSTING_LOCATION |
|---|---|---|---|
| 1 | RYAN | 10 | GERMANY |
| 2 | JACK | 6 | ROMANIA |
| 3 | JANE | 4 | HAWAII |
| 4 | JIM | 10 | GERMANY |
| 5 | TIM | 10 | GERMANY |
| 6 | RYAN | 11 | GERMANY |
| 7 | RYAN | 10 | GERMANY |
| 8 | RYAN | 10 | GERMANY |
| 9 | JACK | 6 | CUBA |
| 10 | JACK | 6 | HAITI |

1. Duplicates in 3 Columns: OFFICER_NAME, TEAM_SIZE, and POSTING_LOCATION
To find duplicates across these three columns, we group the rows by OFFICER_NAME, TEAM_SIZE, and POSTING_LOCATION, then filter the groups with a HAVING clause that uses the COUNT() function.
Query:
SELECT OFFICER_NAME, TEAM_SIZE, POSTING_LOCATION, COUNT(*) AS QTY
FROM POSTINGS
GROUP BY OFFICER_NAME, TEAM_SIZE, POSTING_LOCATION
HAVING COUNT(*) > 1;
Output
| OFFICER_NAME | TEAM_SIZE | POSTING_LOCATION | QTY |
|---|---|---|---|
| RYAN | 10 | GERMANY | 3 |
Explanation:
The output shows that RYAN has been assigned to GERMANY with a team size of 10 on three different occasions (posting IDs 1, 7, and 8), indicating duplicate records.
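The GROUP BY query reports only one summary row per duplicate group. If we also want to see the duplicated rows themselves, including their POSTING_ID values, one option is a window function. The sketch below assumes a DBMS that supports COUNT(*) OVER (...) (PostgreSQL, SQL Server, and MySQL 8+ all do); DUPES is just a hypothetical alias for the derived table.

Query:
-- List the duplicated rows themselves, not just the groups.
SELECT *
FROM (
    SELECT P.*,
           COUNT(*) OVER (
               PARTITION BY OFFICER_NAME, TEAM_SIZE, POSTING_LOCATION
           ) AS QTY
    FROM POSTINGS P
) AS DUPES
WHERE QTY > 1;

For the sample data, this returns the full rows for posting IDs 1, 7, and 8, each with QTY = 3.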
2. Duplicates in 2 Columns: TEAM_SIZE and POSTING_LOCATION
In this example, we want to find duplicates based only on TEAM_SIZE and POSTING_LOCATION, ignoring the officer's name. This helps identify whether multiple officers were assigned to the same team size and location.
Query:
SELECT TEAM_SIZE, POSTING_LOCATION, COUNT(*) AS QTY
FROM POSTINGS
GROUP BY TEAM_SIZE, POSTING_LOCATION
HAVING COUNT(*) > 1;
Output
| TEAM_SIZE | POSTING_LOCATION | QTY |
|---|---|---|
| 10 | GERMANY | 5 |
Explanation:
The output shows that a team size of 10 is assigned to GERMANY five times (posting IDs 1, 4, 5, 7, and 8), indicating duplicate assignments across different officers.
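To see which officers make up such a group, the names can be collected with a string-aggregation function. This is a sketch assuming PostgreSQL or SQL Server 2017+, where the function is STRING_AGG(); MySQL uses GROUP_CONCAT() instead.

Query:
-- Same grouping, but also list the officers in each duplicate group.
SELECT TEAM_SIZE, POSTING_LOCATION,
       COUNT(*) AS QTY,
       STRING_AGG(OFFICER_NAME, ', ') AS OFFICERS
FROM POSTINGS
GROUP BY TEAM_SIZE, POSTING_LOCATION
HAVING COUNT(*) > 1;

For the sample data, the OFFICERS column would contain RYAN, JIM, TIM, RYAN, RYAN (in no guaranteed order).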
3. Duplicates in 2 Columns: OFFICER_NAME and TEAM_SIZE
In this scenario, we focus on finding duplicates where the same officer has been assigned to the same team size multiple times, possibly in different locations.
Query:
SELECT OFFICER_NAME, TEAM_SIZE, COUNT(*) AS QTY
FROM POSTINGS
GROUP BY OFFICER_NAME, TEAM_SIZE
HAVING COUNT(*) > 1;
Output
| OFFICER_NAME | TEAM_SIZE | QTY |
|---|---|---|
| JACK | 6 | 3 |
| RYAN | 10 | 3 |
Explanation:
The result shows that RYAN has been assigned to a team size of 10 three times (all in GERMANY), and JACK to a team size of 6 three times (in ROMANIA, CUBA, and HAITI). Because POSTING_LOCATION is not part of the grouping, duplicates are reported even when the locations differ.
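To tell apart officers repeated at a single location from officers repeated across several, a COUNT(DISTINCT ...) column can be added to the same query; COUNT(DISTINCT ...) is standard SQL, so this sketch should run on most systems.

Query:
-- How many distinct locations does each duplicate pair span?
SELECT OFFICER_NAME, TEAM_SIZE,
       COUNT(*) AS QTY,
       COUNT(DISTINCT POSTING_LOCATION) AS LOCATIONS
FROM POSTINGS
GROUP BY OFFICER_NAME, TEAM_SIZE
HAVING COUNT(*) > 1;

Here RYAN (team size 10) would show LOCATIONS = 1, while JACK (team size 6) would show LOCATIONS = 3.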
Conclusion
Finding duplicates across multiple columns in SQL is essential for maintaining clean, accurate, and consistent data. By using the GROUP BY and HAVING clauses with COUNT(), SQL allows us to easily identify and filter duplicate entries based on one or more columns. These techniques are highly valuable for database cleanup, reporting, and ensuring data integrity. With the ability to search for duplicates in different combinations of columns, we can quickly analyze our data and pinpoint redundancies.
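Since cleanup is a common follow-up, here is one way to remove the duplicate rows while keeping one per group. This sketch assumes POSTING_ID uniquely identifies each row and a DBMS with CTE and ROW_NUMBER() support (e.g. PostgreSQL or SQL Server); MySQL restricts deleting from a table referenced in the same statement, so the syntax there differs.

Query:
-- Keep the lowest POSTING_ID in each duplicate group, delete the rest.
WITH RANKED AS (
    SELECT POSTING_ID,
           ROW_NUMBER() OVER (
               PARTITION BY OFFICER_NAME, TEAM_SIZE, POSTING_LOCATION
               ORDER BY POSTING_ID
           ) AS RN
    FROM POSTINGS
)
DELETE FROM POSTINGS
WHERE POSTING_ID IN (SELECT POSTING_ID FROM RANKED WHERE RN > 1);

On the sample data this would delete posting IDs 7 and 8, leaving a single RYAN/10/GERMANY row.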