Asked by: 小点点

Removing duplicate rows from a table


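I have a table from which I want to remove duplicate rows based on 5 fields: origin_id, destination_id, market_id, cabin, and tripType. For every set of records that share the same values in these 5 fields, I want to delete all of them except the one with the latest created_at, so that each combination of the 5 fields has exactly one record. Once the duplicates are gone I will create a unique index, but the duplicates currently prevent me from doing that.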

Here is the query I have so far, but it doesn't seem to work:

DELETE FROM fares WHERE id NOT IN(
     SELECT f1.id FROM (SELECT * FROM fares) AS f1
     INNER JOIN (
          SELECT origin_id,destination_id,market_id,cabin,tripType,MAX(created_at) AS maxDate FROM fares
          GROUP BY origin_id,destination_id,market_id,cabin,tripType
     ) AS f2 ON f2.origin_id=f1.origin_id AND f2.destination_id=f1.destination_id AND 
     f2.market_id=f1.market_id AND f2.cabin=f1.cabin AND f2.tripType=f1.tripType
     WHERE f1.created_at=f2.maxDate
     GROUP BY f1.origin_id,f1.destination_id,f1.market_id,f1.cabin,f1.tripType
)

The query above deleted only 500 rows, but I have 8K duplicates. I counted them with the following query:

SELECT SUM(f.numberOfFares) AS duplicateFares FROM (
    SELECT origin_id,destination_id,market_id,cabin,tripType,COUNT(1) AS numberOfFares FROM fares
    GROUP BY origin_id,destination_id,market_id,cabin,tripType
    HAVING count(1)>1
) AS f

The query above gives the following result:

I want exactly 1 record for each origin_id, destination_id, market_id, cabin, tripType group.

My guess is that the problem is records that share the same created_at value.
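That guess can be checked directly. The following is only a sketch, using the table and columns already named above, that lists every case where several rows share both the five fields and the same created_at:

```sql
-- List groups in which more than one row has the same five fields
-- AND the same created_at, i.e. rows that created_at alone cannot separate.
SELECT origin_id, destination_id, market_id, cabin, tripType,
       created_at, COUNT(*) AS tiedRows
FROM fares
GROUP BY origin_id, destination_id, market_id, cabin, tripType, created_at
HAVING COUNT(*) > 1;
```

If this returns any rows, a cleanup that compares on created_at alone cannot tell those rows apart and would need an extra tie-breaker, for example id.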

SQLFiddle


2 answers

Anonymous user

The pattern you're looking for is called "deduplication". Basically, you join the table to itself, compare the rows, and delete the ones you don't want:

delete F2
from Fares F1
join Fares F2 on F1.origin_id = F2.origin_id 
    and F1.destination_id = F2.destination_id 
    and F1.market_id = F2.market_id 
    and F1.cabin = F2.cabin 
    and F1.tripType = F2.tripType 
where F2.created_at < F1.created_at

An index on the columns being compared will speed this up.
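One caveat with the strict `<` comparison: if two rows in a group share the same latest created_at, neither is deleted, which is exactly the tie case the question suspects. Below is a minimal sketch of a tie-breaking variant. It assumes MySQL's multi-table DELETE syntax, a non-NULL created_at, and that id (the column used in the question's own query) is unique. The index name uq_fares_route is made up for illustration.

```sql
-- Delete F2 whenever another row F1 in the same group is "later":
-- either a newer created_at, or the same created_at and a higher id.
-- Assuming id is unique, exactly one row per group is left.
DELETE F2
FROM fares F1
JOIN fares F2 ON F1.origin_id = F2.origin_id
    AND F1.destination_id = F2.destination_id
    AND F1.market_id = F2.market_id
    AND F1.cabin = F2.cabin
    AND F1.tripType = F2.tripType
WHERE F2.created_at < F1.created_at
   OR (F2.created_at = F1.created_at AND F2.id < F1.id);

-- Once no duplicates remain, the unique index can be created.
-- uq_fares_route is a hypothetical name.
CREATE UNIQUE INDEX uq_fares_route
    ON fares (origin_id, destination_id, market_id, cabin, tripType);
```

Rows with NULL in any of the five columns never match under `=`, so this leaves them alone. If that matters, MySQL's null-safe `<=>` operator is one option for the join conditions.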

Anonymous user

WITH temp_cte
AS ( 
 SELECT  ROW_NUMBER() OVER(PARTITION by origin_id, destination_id, market_id, cabin, tripType ORDER BY created_at desc) AS rowNum 
 FROM Fares  
 )
DELETE  FROM temp_cte WHERE rowNum  > 1;
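Deleting through a CTE like this is SQL Server syntax. If the table is in MySQL (an assumption; the question does not name the database), a CTE cannot be the target of a DELETE. One possible adaptation, assuming MySQL 8.0 or later for ROW_NUMBER() and a unique id column as in the question, joins the ranked rows back to the table:

```sql
-- Rank the rows inside each group, newest first; id DESC makes the
-- choice among rows with an identical created_at deterministic.
DELETE f
FROM fares AS f
JOIN (
    SELECT id,
           ROW_NUMBER() OVER (
               PARTITION BY origin_id, destination_id, market_id, cabin, tripType
               ORDER BY created_at DESC, id DESC
           ) AS rowNum
    FROM fares
) AS ranked ON ranked.id = f.id
-- Keep only the first-ranked row of each group.
WHERE ranked.rowNum > 1;
```

Because ROW_NUMBER() gives every row in a partition a distinct number, rows tied on created_at are still reduced to a single survivor.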