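I have a table from which I want to remove duplicate rows based on five fields: origin_id, destination_id, market_id, cabin, and triptype. For each combination of those five fields I want to delete every record except the one with the greatest created_at, so that only one record remains per combination. Once the duplicates are gone I will create a unique index, but right now the duplicates prevent me from doing so.
This is the query I have so far, but it does not seem to work: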
DELETE FROM fares WHERE id NOT IN (
    SELECT f1.id FROM (SELECT * FROM fares) AS f1
    INNER JOIN (
        SELECT origin_id, destination_id, market_id, cabin, tripType, MAX(created_at) AS maxDate
        FROM fares
        GROUP BY origin_id, destination_id, market_id, cabin, tripType
    ) AS f2 ON f2.origin_id = f1.origin_id AND f2.destination_id = f1.destination_id
           AND f2.market_id = f1.market_id AND f2.cabin = f1.cabin AND f2.tripType = f1.tripType
    WHERE f1.created_at = f2.maxDate
    GROUP BY f1.origin_id, f1.destination_id, f1.market_id, f1.cabin, f1.tripType
)
The query above deleted only 500 rows, but I have 8K duplicates. I counted them with the following query:
SELECT SUM(f.numberOfFares) AS duplicateFares FROM (
    SELECT origin_id, destination_id, market_id, cabin, tripType, COUNT(1) AS numberOfFares
    FROM fares
    GROUP BY origin_id, destination_id, market_id, cabin, tripType
    HAVING COUNT(1) > 1
) AS f
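The result of the query above is as follows.
I want exactly one record for each (origin_id, destination_id, market_id, cabin, triptype) group.
My guess is that the problem is records that share the same created_at value.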
SQLFiddle
The pattern you are looking for is called deduplication. Basically, you join the table to itself, compare the rows, and delete the ones you do not want:
delete F2
from Fares F1
join Fares F2 on F1.origin_id = F2.origin_id
and F1.destination_id = F2.destination_id
and F1.market_id = F2.market_id
and F1.cabin = F2.cabin
and F1.tripType = F2.tripType
where F2.created_at < F1.created_at
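Note that a strict < comparison leaves rows in place when several rows in a group share the same latest created_at, which is what the question suspects. A minimal sketch of a tie-breaking variant, assuming the id column used in the question's query is unique; keeping the larger id among tied rows is an arbitrary choice made here, not something the question specifies:

```sql
-- Sketch: same self-join, but ties on created_at are broken by id.
-- Assumes fares.id is unique (it is used as the key in the question's query).
DELETE F2
FROM fares F1
JOIN fares F2 ON F1.origin_id = F2.origin_id
             AND F1.destination_id = F2.destination_id
             AND F1.market_id = F2.market_id
             AND F1.cabin = F2.cabin
             AND F1.tripType = F2.tripType
-- F2 is removed whenever some F1 in the same group is newer,
-- or equally new but with a larger id.
WHERE F2.created_at < F1.created_at
   OR (F2.created_at = F1.created_at AND F2.id < F1.id);
```

Rows with NULL in any of the five columns never match under =, so they would be left untouched by this statement.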
Having an index on the columns being compared will speed this up.
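As a sketch of what that could look like, assuming MySQL syntax and hypothetical index names (idx_fares_dedup and uq_fares_route are not from the question): a plain index on the five columns helps the self-join, and once the duplicates are gone it can be swapped for the unique index the question is aiming for.

```sql
-- Hypothetical index name; supports the five-column self-join.
CREATE INDEX idx_fares_dedup
    ON fares (origin_id, destination_id, market_id, cabin, tripType);

-- ... run the DELETE here ...

-- After deduplication, replace it with a unique index
-- so that duplicates cannot be inserted again.
DROP INDEX idx_fares_dedup ON fares;
CREATE UNIQUE INDEX uq_fares_route
    ON fares (origin_id, destination_id, market_id, cabin, tripType);
```

If the CREATE UNIQUE INDEX statement fails with a duplicate-key error, some duplicates are still present and the count query from the question can be rerun to find them.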
WITH temp_cte
AS (
SELECT ROW_NUMBER() OVER(PARTITION by origin_id, destination_id, market_id, cabin, tripType ORDER BY created_at desc) AS rowNum
FROM Fares
)
DELETE FROM temp_cte WHERE rowNum > 1;
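The statement above deletes through the CTE, which SQL Server accepts. As far as I know MySQL does not allow a DELETE that targets a CTE, so here is a hedged equivalent, assuming MySQL 8.0 or later (for ROW_NUMBER) and the unique id column from the question. The id DESC tie-breaker is an addition that makes the choice among rows with identical created_at deterministic.

```sql
-- Sketch for MySQL 8.0+: rank rows per group in a derived table,
-- then delete every row that is not ranked first in its group.
DELETE f
FROM fares AS f
JOIN (
    SELECT id,
           ROW_NUMBER() OVER (
               PARTITION BY origin_id, destination_id, market_id, cabin, tripType
               ORDER BY created_at DESC, id DESC
           ) AS rowNum
    FROM fares
) AS ranked ON ranked.id = f.id
WHERE ranked.rowNum > 1;
```

Because ROW_NUMBER assigns exactly one row per group the value 1, this keeps a single row per group even when created_at values are tied.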