How to Find Duplicate Values Across Multiple Columns in SQL?
In SQL, identifying duplicate entries across multiple columns is crucial for ensuring data integrity and quality. Whether we're working with large datasets or trying to clean up existing data, we can efficiently find duplicates using the GROUP BY clause together with the COUNT() function.
In this article, we'll focus on how to find duplicate entries across multiple columns in SQL using a practical example. This is particularly helpful when working with datasets that involve multiple attributes, like officer assignments in different locations, which can have duplicate values in more than one column.
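Before we turn to the example, the general pattern is worth sketching on its own. The table and column names below (my_table, col1, col2) are hypothetical placeholders; substitute your own table and whichever columns together define a "duplicate".

Query:
-- General pattern: group by the columns that define a duplicate,
-- then keep only the groups that occur more than once.
SELECT col1, col2, COUNT(*) AS qty
FROM my_table
GROUP BY col1, col2
HAVING COUNT(*) > 1;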
Examples of Finding Duplicates Across Multiple Columns
We'll use a sample table named POSTINGS inside the GeeksForGeeks database. This table tracks postings of officers in different countries, with columns for posting ID, officer name, team size, and posting location.
Query:
CREATE TABLE POSTINGS(
    POSTING_ID INT,
    OFFICER_NAME VARCHAR(10),
    TEAM_SIZE INT,
    POSTING_LOCATION VARCHAR(10)
);
INSERT INTO POSTINGS VALUES(1,'RYAN',10,'GERMANY');
INSERT INTO POSTINGS VALUES(2,'JACK',6,'ROMANIA');
INSERT INTO POSTINGS VALUES(3,'JANE',4,'HAWAII');
INSERT INTO POSTINGS VALUES(4,'JIM',10,'GERMANY');
INSERT INTO POSTINGS VALUES(5,'TIM',10,'GERMANY');
INSERT INTO POSTINGS VALUES(6,'RYAN',11,'GERMANY');
INSERT INTO POSTINGS VALUES(7,'RYAN',10,'GERMANY');
INSERT INTO POSTINGS VALUES(8,'RYAN',10,'GERMANY');
INSERT INTO POSTINGS VALUES(9,'JACK',6,'CUBA');
INSERT INTO POSTINGS VALUES(10,'JACK',6,'HAITI');
SELECT * FROM POSTINGS;
Output

| POSTING_ID | OFFICER_NAME | TEAM_SIZE | POSTING_LOCATION |
|---|---|---|---|
| 1 | RYAN | 10 | GERMANY |
| 2 | JACK | 6 | ROMANIA |
| 3 | JANE | 4 | HAWAII |
| 4 | JIM | 10 | GERMANY |
| 5 | TIM | 10 | GERMANY |
| 6 | RYAN | 11 | GERMANY |
| 7 | RYAN | 10 | GERMANY |
| 8 | RYAN | 10 | GERMANY |
| 9 | JACK | 6 | CUBA |
| 10 | JACK | 6 | HAITI |

1. Duplicates in 3 Columns: OFFICER_NAME, TEAM_SIZE, and POSTING_LOCATION
To find duplicates across these three columns, we group the rows by OFFICER_NAME, TEAM_SIZE, and POSTING_LOCATION, then filter the groups with a HAVING clause that uses the COUNT() function.
Query:
SELECT OFFICER_NAME, TEAM_SIZE, POSTING_LOCATION, COUNT(*) AS QTY
FROM POSTINGS
GROUP BY OFFICER_NAME, TEAM_SIZE, POSTING_LOCATION
HAVING COUNT(*) > 1;
Output
| OFFICER_NAME | TEAM_SIZE | POSTING_LOCATION | QTY |
|---|---|---|---|
| RYAN | 10 | GERMANY | 3 |
Explanation:
The output shows that RYAN has been assigned to GERMANY with a team size of 10 on three different occasions (posting IDs 1, 7, and 8), indicating duplicate records.
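The GROUP BY query reports only one summary row per duplicate group. If we also want to see the duplicated rows themselves, including their POSTING_ID values, one option is a window function. The sketch below assumes a DBMS that supports COUNT(*) OVER (...) (PostgreSQL, SQL Server, and MySQL 8+ all do); DUPES is just a hypothetical alias for the derived table.

Query:
-- List the duplicated rows themselves, not just the groups.
SELECT *
FROM (
    SELECT P.*,
           COUNT(*) OVER (
               PARTITION BY OFFICER_NAME, TEAM_SIZE, POSTING_LOCATION
           ) AS QTY
    FROM POSTINGS P
) AS DUPES
WHERE QTY > 1;

For the sample data, this returns the full rows for posting IDs 1, 7, and 8, each with QTY = 3.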
2. Duplicates in 2 Columns: TEAM_SIZE and POSTING_LOCATION
In this example, we want to find duplicates based only on TEAM_SIZE and POSTING_LOCATION, ignoring the officer's name. This helps identify whether multiple officers were assigned to the same team size and location.
Query:
SELECT TEAM_SIZE, POSTING_LOCATION, COUNT(*) AS QTY
FROM POSTINGS
GROUP BY TEAM_SIZE, POSTING_LOCATION
HAVING COUNT(*) > 1;
Output
| TEAM_SIZE | POSTING_LOCATION | QTY |
|---|---|---|
| 10 | GERMANY | 5 |
Explanation:
The output shows that a team size of 10 is assigned to GERMANY five times (posting IDs 1, 4, 5, 7, and 8), indicating duplicate assignments across different officers.
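To see which officers make up such a group, the names can be collected with a string-aggregation function. This is a sketch assuming PostgreSQL or SQL Server 2017+, where the function is STRING_AGG(); MySQL uses GROUP_CONCAT() instead.

Query:
-- Same grouping, but also list the officers in each duplicate group.
SELECT TEAM_SIZE, POSTING_LOCATION,
       COUNT(*) AS QTY,
       STRING_AGG(OFFICER_NAME, ', ') AS OFFICERS
FROM POSTINGS
GROUP BY TEAM_SIZE, POSTING_LOCATION
HAVING COUNT(*) > 1;

For the sample data, the OFFICERS column would contain RYAN, JIM, TIM, RYAN, RYAN (in no guaranteed order).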
3. Duplicates in 2 Columns: OFFICER_NAME and TEAM_SIZE
In this scenario, we focus on finding duplicates where the same officer has been assigned to the same team size multiple times, possibly in different locations.
Query:
SELECT OFFICER_NAME, TEAM_SIZE, COUNT(*) AS QTY
FROM POSTINGS
GROUP BY OFFICER_NAME, TEAM_SIZE
HAVING COUNT(*) > 1;
Output
| OFFICER_NAME | TEAM_SIZE | QTY |
|---|---|---|
| JACK | 6 | 3 |
| RYAN | 10 | 3 |
Explanation:
The result shows that RYAN has been assigned to a team size of 10 three times (all in GERMANY), and JACK to a team size of 6 three times (in ROMANIA, CUBA, and HAITI). Because POSTING_LOCATION is not part of the grouping, duplicates are reported even when the locations differ.
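To tell apart officers repeated at a single location from officers repeated across several, a COUNT(DISTINCT ...) column can be added to the same query; COUNT(DISTINCT ...) is standard SQL, so this sketch should run on most systems.

Query:
-- How many distinct locations does each duplicate pair span?
SELECT OFFICER_NAME, TEAM_SIZE,
       COUNT(*) AS QTY,
       COUNT(DISTINCT POSTING_LOCATION) AS LOCATIONS
FROM POSTINGS
GROUP BY OFFICER_NAME, TEAM_SIZE
HAVING COUNT(*) > 1;

Here RYAN (team size 10) would show LOCATIONS = 1, while JACK (team size 6) would show LOCATIONS = 3.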
Conclusion
Finding duplicates across multiple columns in SQL is essential for maintaining clean, accurate, and consistent data. By using the GROUP BY and HAVING clauses with COUNT(), SQL allows us to easily identify and filter duplicate entries based on one or more columns. These techniques are highly valuable for database cleanup, reporting, and ensuring data integrity. With the ability to search for duplicates in different combinations of columns, we can quickly analyze our data and pinpoint redundancies.
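Since cleanup is a common follow-up, here is one way to remove the duplicate rows while keeping one per group. This sketch assumes POSTING_ID uniquely identifies each row and a DBMS with CTE and ROW_NUMBER() support (e.g. PostgreSQL or SQL Server); MySQL restricts deleting from a table referenced in the same statement, so the syntax there differs.

Query:
-- Keep the lowest POSTING_ID in each duplicate group, delete the rest.
WITH RANKED AS (
    SELECT POSTING_ID,
           ROW_NUMBER() OVER (
               PARTITION BY OFFICER_NAME, TEAM_SIZE, POSTING_LOCATION
               ORDER BY POSTING_ID
           ) AS RN
    FROM POSTINGS
)
DELETE FROM POSTINGS
WHERE POSTING_ID IN (SELECT POSTING_ID FROM RANKED WHERE RN > 1);

On the sample data this would delete posting IDs 7 and 8, leaving a single RYAN/10/GERMANY row.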