To quickly see the duplicate rows you can run a single simple query
Here I am querying the table and listing all duplicate rows with same user_id, market_place and sku:
select user_id, market_place,sku, count(id)as totals from sku_analytics group by user_id, market_place,sku having count(id)>1;
To delete the duplicate row you have to decide which row you want to delete. Eg the one with lower id (usually older) or maybe some other date information. In my case I just want to delete the lower id since the newer id is latest information.
First double check if the right records will be deleted. Here I am selecting the record among duplicates which will be deleted (by unique id).
select a.user_id, a.market_place,a.sku from sku_analytics a inner join sku_analytics b where a.id< b.id and a.user_id= b.user_id and a.market_place= b.market_place and a.sku = b.sku;
Then I run the delete query to delete the dupes:
delete a from sku_analytics a inner join sku_analytics b where a.id< b.id and a.user_id= b.user_id and a.market_place= b.market_place and a.sku = b.sku;
Backup, Double check, verify, verify backup then execute.
Сборник запросов для поиска, изменения и удаления дублей в таблице MySQL по одному и нескольким полям. В примерах все запросы будут применятся к следующий таблице:
1
Поиск дубликатов
Подсчет дублей
Запрос подсчитает количество всех записей с одинаковыми значениями в поле `col1`
.
SELECT
`col1`,
COUNT(`col1`) AS `count`
FROM
`table`
GROUP BY
`col1`
HAVING
`count` > 1
SQL
Подсчет дубликатов по нескольким полям:
SELECT
`col1`,
`col2`,
COUNT(*) AS `count`
FROM
`table`
GROUP BY
`col1`,`col2`
HAVING
`count` > 1
SQL
Все записи с одинаковыми значениями
Запрос найдет все записи с одинаковыми значениями в `col1`
.
SELECT
*
FROM
`table`
WHERE
`col1` IN (SELECT `col1` FROM `table` GROUP BY `col1` HAVING COUNT(*) > 1)
ORDER BY
`col1`
SQL
Для одинаковых значений в `col1`
и `col2`
:
SELECT
*
FROM
`table`
WHERE
`col1` IN (SELECT `col1` FROM `table` GROUP BY `col1` HAVING COUNT(*) > 1)
AND `col2` IN (SELECT `col2` FROM `table` GROUP BY `col2` HAVING COUNT(*) > 1)
ORDER BY
`col1`
SQL
Получить только дубликаты
Запрос получит только дубликаты, в результат не попадают записи с самым ранним `id`
.
SELECT
`table`.*
FROM
`table`
LEFT OUTER JOIN
(SELECT MIN(`id`) AS `id`, `col1` FROM `table` GROUP BY `col1`) AS `tmp`
ON
`table`.`id` = `tmp`.`id`
WHERE
`tmp`.`id` IS NULL
SQL
Для нескольких полей:
SELECT
`table`.*
FROM
`table`
LEFT OUTER JOIN
(SELECT MIN(`id`) AS `id`, `col1`, `col2` FROM `table` GROUP BY `col1`, `col2`) AS `tmp`
ON
`a`.`id` = `tmp`.`id`
WHERE
`tmp`.`id` IS NULL
SQL
2
Уникализация записей
Запрос сделает уникальные названия только у дублей, дописав `id`
в конец `col1`
.
UPDATE
`table`
LEFT OUTER JOIN
(SELECT MIN(`id`) AS `id`, `col1` FROM `table` GROUP BY `col1`) AS `tmp`
ON
`table`.`id` = `tmp`.`id`
SET
`table`.`col1` = CONCAT(`table`.`col1`, '-', `table`.`id`)
WHERE
`tmp`.`id` IS NULL
SQL
По нескольким полям:
UPDATE
`table`
LEFT OUTER JOIN
(SELECT MIN(`id`) AS `id`, `col1`, `col2` FROM `table` GROUP BY `col1`, `col2`) AS `tmp`
ON
`table`.`id` = `tmp`.`id`
SET
`table`.`col1` = CONCAT(`table`.`col1`, '-', `table`.`id`)
WHERE
`tmp`.`id` IS NULL
SQL
3
Удаление дубликатов
Удаление дублирующихся записей, останутся только уникальные.
DELETE
`table`
FROM
`table`
LEFT OUTER JOIN
(SELECT MIN(`id`) AS `id`, `col1` FROM `table` GROUP BY `col1`) AS `tmp`
ON
`table`.`id` = `tmp`.`id`
WHERE
`tmp`.`id` IS NULL
SQL
По нескольким полям:
DELETE
`table`
FROM
`table`
LEFT OUTER JOIN
(SELECT MIN(`id`) AS `id`, `col1`, `col2` FROM `table` GROUP BY `col1`, `col2`) AS `tmp`
ON
`table`.`id` = `tmp`.`id`
WHERE
`tmp`.`id` IS NULL
SQL
I have a SQL Server database of organizations, and there are many duplicate rows. I want to run a select statement to grab all of these and the amount of dupes, but also return the ids that are associated with each organization.
A statement like:
SELECT orgName, COUNT(*) AS dupes
FROM organizations
GROUP BY orgName
HAVING (COUNT(*) > 1)
Will return something like
orgName | dupes
ABC Corp | 7
Foo Federation | 5
Widget Company | 2
But I’d also like to grab the IDs of them. Is there any way to do this? Maybe like a
orgName | dupeCount | id
ABC Corp | 1 | 34
ABC Corp | 2 | 5
...
Widget Company | 1 | 10
Widget Company | 2 | 2
The reason being that there is also a separate table of users that link to these organizations, and I would like to unify them (therefore remove dupes so the users link to the same organization instead of dupe orgs). But I would like part manually so I don’t screw anything up, but I would still need a statement returning the IDs of all the dupe orgs so I can go through the list of users.
asked Jan 21, 2010 at 20:29
select o.orgName, oc.dupeCount, o.id
from organizations o
inner join (
SELECT orgName, COUNT(*) AS dupeCount
FROM organizations
GROUP BY orgName
HAVING COUNT(*) > 1
) oc on o.orgName = oc.orgName
answered Jan 21, 2010 at 20:32
D’Arcy RittichD’Arcy Rittich
166k40 gold badges287 silver badges283 bronze badges
5
You can run the following query and find the duplicates with max(id)
and delete those rows.
SELECT orgName, COUNT(*), Max(ID) AS dupes
FROM organizations
GROUP BY orgName
HAVING (COUNT(*) > 1)
But you’ll have to run this query a few times.
Bellash
7,4426 gold badges52 silver badges85 bronze badges
answered Jan 21, 2010 at 20:32
2
You can do it like this:
SELECT
o.id, o.orgName, d.intCount
FROM (
SELECT orgName, COUNT(*) as intCount
FROM organizations
GROUP BY orgName
HAVING COUNT(*) > 1
) AS d
INNER JOIN organizations o ON o.orgName = d.orgName
If you want to return just the records that can be deleted (leaving one of each), you can use:
SELECT
id, orgName
FROM (
SELECT
orgName, id,
ROW_NUMBER() OVER (PARTITION BY orgName ORDER BY id) AS intRow
FROM organizations
) AS d
WHERE intRow != 1
Edit: SQL Server 2000 doesn’t have the ROW_NUMBER() function. Instead, you can use:
SELECT
o.id, o.orgName, d.intCount
FROM (
SELECT orgName, COUNT(*) as intCount, MIN(id) AS minId
FROM organizations
GROUP BY orgName
HAVING COUNT(*) > 1
) AS d
INNER JOIN organizations o ON o.orgName = d.orgName
WHERE d.minId != o.id
answered Jan 21, 2010 at 20:37
PaulPaul
16.1k13 gold badges41 silver badges51 bronze badges
4
You can try this , it is best for you
WITH CTE AS
(
SELECT *,RN=ROW_NUMBER() OVER (PARTITION BY orgName ORDER BY orgName DESC) FROM organizations
)
select * from CTE where RN>1
go
Soner Gönül
96.7k102 gold badges205 silver badges360 bronze badges
answered Nov 7, 2013 at 8:01
code savecode save
1,0441 gold badge9 silver badges15 bronze badges
1
The solution marked as correct didn’t work for me, but I found this answer that worked just great: Get list of duplicate rows in MySql
SELECT n1.*
FROM myTable n1
INNER JOIN myTable n2
ON n2.repeatedCol = n1.repeatedCol
WHERE n1.id <> n2.id
answered May 20, 2013 at 20:27
ecairolecairol
6,1731 gold badge27 silver badges25 bronze badges
2
If you want to delete duplicates:
WITH CTE AS(
SELECT orgName,id,
RN = ROW_NUMBER()OVER(PARTITION BY orgName ORDER BY Id)
FROM organizations
)
DELETE FROM CTE WHERE RN > 1
answered Jun 17, 2016 at 9:51
akdakd
6,45816 gold badges68 silver badges111 bronze badges
select * from [Employees]
For finding duplicate Record
1)Using CTE
with mycte
as
(
select Name,EmailId,ROW_NUMBER() over(partition by Name,EmailId order by id) as Duplicate from [Employees]
)
select * from mycte
2)By Using GroupBy
select Name,EmailId,COUNT(name) as Duplicate from [Employees] group by Name,EmailId
MD XF
7,8027 gold badges40 silver badges71 bronze badges
answered Nov 18, 2016 at 12:44
1
Select * from (Select orgName,id,
ROW_NUMBER() OVER(Partition By OrgName ORDER by id DESC) Rownum
From organizations )tbl Where Rownum>1
So the records with rowum> 1 will be the duplicate records in your table. ‘Partition by’ first group by the records and then serialize them by giving them serial nos.
So rownum> 1 will be the duplicate records which could be deleted as such.
answered Mar 10, 2015 at 5:58
Mike ClarkMike Clark
1,85014 silver badges21 bronze badges
1
select column_name, count(column_name)
from table_name
group by column_name
having count (column_name) > 1;
Src : https://stackoverflow.com/a/59242/1465252
answered Mar 17, 2015 at 4:01
iCrazybestiCrazybest
2,9152 gold badges23 silver badges24 bronze badges
1
select a.orgName,b.duplicate, a.id
from organizations a
inner join (
SELECT orgName, COUNT(*) AS duplicate
FROM organizations
GROUP BY orgName
HAVING COUNT(*) > 1
) b on o.orgName = oc.orgName
group by a.orgName,a.id
Roman Marusyk
23.1k24 gold badges71 silver badges114 bronze badges
answered Sep 15, 2015 at 7:01
select orgname, count(*) as dupes, id
from organizations
where orgname in (
select orgname
from organizations
group by orgname
having (count(*) > 1)
)
group by orgname, id
answered Jan 21, 2010 at 20:49
JordãoJordão
55.1k12 gold badges112 silver badges144 bronze badges
You have several way for Select duplicate rows
.
for my solutions , first consider this table for example
CREATE TABLE #Employee
(
ID INT,
FIRST_NAME NVARCHAR(100),
LAST_NAME NVARCHAR(300)
)
INSERT INTO #Employee VALUES ( 1, 'Ardalan', 'Shahgholi' );
INSERT INTO #Employee VALUES ( 2, 'name1', 'lname1' );
INSERT INTO #Employee VALUES ( 3, 'name2', 'lname2' );
INSERT INTO #Employee VALUES ( 2, 'name1', 'lname1' );
INSERT INTO #Employee VALUES ( 3, 'name2', 'lname2' );
INSERT INTO #Employee VALUES ( 4, 'name3', 'lname3' );
First solution :
SELECT DISTINCT *
FROM #Employee;
WITH #DeleteEmployee AS (
SELECT ROW_NUMBER()
OVER(PARTITION BY ID, First_Name, Last_Name ORDER BY ID) AS
RNUM
FROM #Employee
)
SELECT *
FROM #DeleteEmployee
WHERE RNUM > 1
SELECT DISTINCT *
FROM #Employee
Secound solution : Use identity
field
SELECT DISTINCT *
FROM #Employee;
ALTER TABLE #Employee ADD UNIQ_ID INT IDENTITY(1, 1)
SELECT *
FROM #Employee
WHERE UNIQ_ID < (
SELECT MAX(UNIQ_ID)
FROM #Employee a2
WHERE #Employee.ID = a2.ID
AND #Employee.FIRST_NAME = a2.FIRST_NAME
AND #Employee.LAST_NAME = a2.LAST_NAME
)
ALTER TABLE #Employee DROP COLUMN UNIQ_ID
SELECT DISTINCT *
FROM #Employee
and end of all solution use this command
DROP TABLE #Employee
answered Nov 7, 2013 at 7:06
Ardalan ShahgholiArdalan Shahgholi
11.8k20 gold badges107 silver badges144 bronze badges
i think i know what you need
i needed to mix between the answers and i think i got the solution he wanted:
select o.id,o.orgName, oc.dupeCount, oc.id,oc.orgName
from organizations o
inner join (
SELECT MAX(id) as id, orgName, COUNT(*) AS dupeCount
FROM organizations
GROUP BY orgName
HAVING COUNT(*) > 1
) oc on o.orgName = oc.orgName
having the max id will give you the id of the dublicate and the one of the original which is what he asked for:
id org name , dublicate count (missing out in this case)
id doublicate org name , doub count (missing out again because does not help in this case)
only sad thing you get it put out in this form
id , name , dubid , name
hope it still helps
answered Oct 1, 2014 at 8:13
Suppose we have table the table ‘Student’ with 2 columns:
student_id int
-
student_name varchar
Records: +------------+---------------------+ | student_id | student_name | +------------+---------------------+ | 101 | usman | | 101 | usman | | 101 | usman | | 102 | usmanyaqoob | | 103 | muhammadusmanyaqoob | | 103 | muhammadusmanyaqoob | +------------+---------------------+
Now we want to see duplicate records
Use this query:
select student_name,student_id ,count(*) c from student group by student_id,student_name having c>1;
+---------------------+------------+---+
| student_name | student_id | c |
+---------------------+------------+---+
| usman | 101 | 3 |
| muhammadusmanyaqoob | 103 | 2 |
+---------------------+------------+---+
Rob♦
27.3k16 gold badges82 silver badges97 bronze badges
answered Feb 9, 2018 at 7:28
I got a better option to get the duplicate records in a table
SELECT x.studid, y.stdname, y.dupecount
FROM student AS x INNER JOIN
(SELECT a.stdname, COUNT(*) AS dupecount
FROM student AS a INNER JOIN
studmisc AS b ON a.studid = b.studid
WHERE (a.studid LIKE '2018%') AND (b.studstatus = 4)
GROUP BY a.stdname
HAVING (COUNT(*) > 1)) AS y ON x.stdname = y.stdname INNER JOIN
studmisc AS z ON x.studid = z.studid
WHERE (x.studid LIKE '2018%') AND (z.studstatus = 4)
ORDER BY x.stdname
Result of the above query shows all the duplicate names with unique student ids and number of duplicate occurances
Click here to see the result of the sql
answered Jun 25, 2018 at 9:22
/*To get duplicate data in table */
SELECT COUNT(EmpCode),EmpCode FROM tbl_Employees WHERE Status=1
GROUP BY EmpCode HAVING COUNT(EmpCode) > 1
answered Apr 26, 2019 at 14:01
CodeCode
6595 silver badges9 bronze badges
I use two methods to find duplicate rows.
1st method is the most famous one using group by and having.
2nd method is using CTE – Common Table Expression.
As mentioned by @RedFilter this way is also right. Many times I find CTE method is also useful for me.
WITH TempOrg (orgName,RepeatCount)
AS
(
SELECT orgName,ROW_NUMBER() OVER(PARTITION by orgName ORDER BY orgName)
AS RepeatCount
FROM dbo.organizations
)
select t.*,e.id from organizations e
inner join TempOrg t on t.orgName= e.orgName
where t.RepeatCount>1
In the example above we collected the result by finding repeat occurrence using ROW_NUMBER and PARTITION BY. Then we applied where clause to select only rows which are on repeat count more than 1. All the result is collected CTE table and joined with Organizations table.
Source : CodoBee
David Buck
3,71335 gold badges31 silver badges35 bronze badges
answered May 5, 2020 at 9:40
IshrarIshrar
1051 silver badge6 bronze badges
Try
SELECT orgName, id, count(*) as dupes
FROM organizations
GROUP BY orgName, id
HAVING count(*) > 1;
answered Jan 22, 2014 at 6:29
- Updated: June 12, 2022
- Initial: September 8, 2021
Aah…. duplicates! They are everywhere! Look around you – multiple charger cables, headphones, pictures in your smartphone! But we are not here to talk about those duplicates. No, Sir! We are here to address the duplicates in sql, how to find them and possibly resolve them in your SQL code.
In this SQL find duplicates post, let us look at 3 ways to identify duplicate rows/columns and then conclude by looking at 2 ways to mitigate them.
- Using Count
- MINUS Function
- Analytic Functions
Let us start by looking at a very simple database table, USER_DIET. The below listed table shows the Fruit consumption of Sam and John over two days.
Just by looking at the data can you tell if there are duplicates in the table, say for the column “NAME”?
NAME | FRUIT | DAY |
John | Apple | Monday |
Sam | Orange | Monday |
John | Orange | Tuesday |
Sam | Banana | Tuesday |
John | Peach | Wednesday |
Sam | Banana | Wednesday |
The most obvious answer is YES! John occurs 3 times and so does Sam.
How about if we were to look at columns NAME and FRUIT? Once again, the answer would be YES, because “Sam” and “Banana” occurs twice. Apparently, Sam loves bananas, while John prefers a different fruit every day.
Finally, let’s look at columns NAME, FRUIT and DAY. Do you see any duplicates now?
The answer is NO. There are no duplicates because both Sam and John had a different fruit on each day.
The point I would like to drive home is this! To truly understand if data is duplicate, you need to understand the context and the functionality behind it.
Note: All SQL examples below use Oracle SQL syntax. However, they should work across most relational databases with minimal changes.
Related post: Apache Spark SQL date functions
1. SQL Find Duplicates using Count
The most common method to find duplicates in sql is using the count function in a select statement. There are two other clauses that are key to finding duplicates: GROUP BY and HAVING.
Let us continue using the database table (USER_DIET) from the previous example and see if we can find duplicates for the NAME column.
a. Duplicates in a single column
SELECT name,count(*)
FROM user_diet
GROUP BY name
HAVING count(*)>1;
Output from SQL statement:
NAME COUNT(*)
John 3
Sam 3
In this second example, let us look at finding duplicates in multiple columns: NAME and FRUIT.
Lets think this thru and put things in context before diving into our select statement. As yourself, what am I trying to find here ?
We are trying to find if any of the users, in this case, Sam/John had the same fruit twice. That it ! This context is based on the two fields NAME and FRUIT.
b. Duplicates in multiple columns
SELECT name, fruit, count(*)
FROM user_diet
GROUP BY name, fruit
HAVING count(*)>1;
Output from SQL statement:
NAME FRUIT COUNT(*)
Sam Banana 2
Key to remember, the columns in the select statement, excluding the count(*) should be the exact same in the group by clause as well.
Also note that using the count(*) function gives you a count of the number of occurrences of a value. In this case, “Sam” + “Banana” occurs twice in the table, but in actuality we only have one duplicate row.
c. SQL to find duplicate rows
The SQL to find duplicate rows in a table is not the same as checking for duplicates in a column.
Ideally, if the database table has the right combination of key columns, you should not have duplicate rows. Regardless, if you are suspicious that your table has duplicate rows, perform the below steps.
- Determine they Key columns on your table.
- If the table does not have keys defined, determine which column(s) makes a row unique. Often times this depends on the functional use case of the data.
- Add the fields from Step 1 or Step 2 to your SQL COUNT(*) clause.
Using the USER_DIET table above, lets assume no keys were defined on the table. Our next option would be determining which column(s) makes a row unique.
Note that the table has 3 rows. If Sam or Jon had the same fruit more than once on the same day, this would create a duplicate row.
Could Sam or Jon eating different fruits on the same day be considered a duplicate row?
The answer – Maybe! It depends on the functional use case of the data.
The SQL to find duplicate rows syntax is as shown below.
SELECT name, fruit, day, count(*) from user_diet GROUP BY name, fruit, day HAVING count(*)>1;
2. SQL Find Duplicates using MINUS function
The MINUS function works on two tables ( or datasets) and returns rows from the first table that does not belong in the second table. This option using the MINUS function in SQL, to find duplicates, is specific to Oracle. Use it for awareness and to validate your results using the count(*) method.
Find duplicates using MINUS function and rowid
SELECT name, rowid FROM user_diet
MINUS
SELECT name, MIN(rowid) FROM user_diet
GROUP BY name;
Output from SQL statement:
NAME COUNT(*)
Sam 2
ROWID is a pseudo column in Oracle and contains a distinct ID for each row in a table.
The first select statement (before the MINUS function) returns 6 rows containing NAME and a distinct value for the ROWID column. The second select statement on the other hand returns 2 rows, one for Sam and one for John. Why do you think that is ?
It’s because of the min function on the ROWID column.
The final output contains the “actual” number of duplicate rows, and not the total number of rows like the count(*) function.
Find duplicates using MINUS function and rownum
SELECT name, rownum FROM user_diet
MINUS
SELECT name, rownum FROM
(SELECT DISTINCT name FROM user_diet);
Output from SQL statement:
NAME COUNT(*)
Sam 2
In this second example, we used ROWNUM, which is a pseudo column used to uniquely identify the order of each row in a select statement.
So, what’s the difference between ROWNUM and ROWID in our example?
They are both pseudo columns in Oracle.
ROWNUM is a number and is generated on the result of the SQL statement. ROWID on the other hand is associated with each row of a table.
3. Find Duplicates in SQL using Analytic functions
Analytic functions are used to perform calculations on a grouping of data, normally called a “window”. This technique can be a bit confusing if you are just starting off with SQL, but it’s definitely worth knowing.
SELECT name, ROW_NUMBER() OVER ( PARTITION BY ssn ORDER BY ssn) AS rnum
FROM user_diet;
Output from SQL statement:
NAME RNUM
John 1
John 2
John 3
Sam 1
Sam 2
Sam 3
What are we doing here?
We are attempting to find if any duplicates exist for the column NAME.
Let’s break down this SQL and make sense of it.
The function ROW_NUMBER() assigns a number starting at 1 to the rows returned by the PARTITION window.
In our case, since we partitioned our dataset on the NAME column, we have 2 datasets: one for Sam and one for John. ROW_NUMBER() now assigns a unique number to each of the 3 rows for Sam, resets the counter and then does the same for John.
The resulting output is as shown on the right side of the query.
One of the reasons I love this technique is because I can turn the above SQL into a nested subquery and get a distinct set of records as shown below.
SELECT name FROM (
SELECT name, ROW_NUMBER() OVER ( PARTITION BY name ORDER BY name) AS rnum FROM user_diet)
WHERE rnum = 1;
Conclusion
A final tidbit, SQL is not limited to transactional databases.
Apache Spark has a module called Spark SQL to handle structured data. AWS Athena even lets you write SQL against files!
The demand for SQL skills is endless. So play around with what you learned here. Try selecting multiple columns, switch the PARTITIONS, change the SORT order. Practice is the best way to master something !
SQL helpful links
Table of Contents
Interested in our services ?
email us at : info@obstkel.com
Copyright 2022 © OBSTKEL LLC. All rights Reserved
We use cookies on our website to give you the most relevant experience by remembering your preferences and repeat visits. By clicking “Accept”, you consent to the use of ALL the cookies.
Время от времени возникает такая задача как поиск дубликатов в выборке в SQL.
Добиться этого можно следующим образом.
Допустим у нас есть некая таблица со значениями, которые могут повторяться.
Тогда делаем следующее:
-- выбираем значение и считаем сколько раз такое значение всречается в таблице
SELECT
value, COUNT(value)
FROM
table
-- группируем выборку по значению
GROUP BY
value
-- фильтруем выборку по количеству
HAVING
COUNT(value) > 1
Вот так можно найти дупликаты в выборке с помощью SQL.
-
Создано 20.08.2019 10:59:59
-
Михаил Русаков
Копирование материалов разрешается только с указанием автора (Михаил Русаков) и индексируемой прямой ссылкой на сайт (http://myrusakov.ru)!
Добавляйтесь ко мне в друзья ВКонтакте: http://vk.com/myrusakov.
Если Вы хотите дать оценку мне и моей работе, то напишите её в моей группе: http://vk.com/rusakovmy.
Если Вы не хотите пропустить новые материалы на сайте,
то Вы можете подписаться на обновления: Подписаться на обновления
Если у Вас остались какие-либо вопросы, либо у Вас есть желание высказаться по поводу этой статьи, то Вы можете оставить свой комментарий внизу страницы.
Если Вам понравился сайт, то разместите ссылку на него (у себя на сайте, на форуме, в контакте):
-
Кнопка:
Она выглядит вот так:
-
Текстовая ссылка:
Она выглядит вот так: Как создать свой сайт
- BB-код ссылки для форумов (например, можете поставить её в подписи):