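Is there a way to remove duplicate and consecutive words/phrases in a string?

Question:

Is there a way to remove duplicated, consecutive words/phrases from a string? For example:

[in]: foo foo bar bar foo bar

[out]: foo bar foo bar

I have tried this: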

>>> s = 'this is a foo bar bar black sheep , have you any any wool woo , yes sir yes sir three bag woo wu wool'
>>> [i for i,j in zip(s.split(),s.split()[1:]) if i!=j]
['this', 'is', 'a', 'foo', 'bar', 'black', 'sheep', ',', 'have', 'you', 'any', 'wool', 'woo', ',', 'yes', 'sir', 'yes', 'sir', 'three', 'bag', 'woo', 'wu']
>>> " ".join([i for i,j in zip(s.split(),s.split()[1:]) if i!=j]+[s.split()[-1]])
'this is a foo bar black sheep , have you any wool woo , yes sir yes sir three bag woo wu wool'
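
For the single-word case, the same result can probably be reached without zipping the list against itself. The following is a minimal sketch using `itertools.groupby` from the standard library; the function name `dedupe_adjacent_words` is made up for illustration and is not part of the original attempt.

```python
from itertools import groupby

def dedupe_adjacent_words(text):
    """Collapse runs of identical adjacent whitespace-separated tokens."""
    # groupby yields one (key, run) pair per run of equal adjacent items,
    # so keeping only the keys drops the consecutive repeats.
    return " ".join(word for word, _run in groupby(text.split()))

print(dedupe_adjacent_words('foo foo bar bar foo bar'))
# foo bar foo bar
```

This only handles repeated single words; repeated multi-word phrases need a different approach.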

What happens when it gets a little more complicated and I want to remove repeated phrases (assume a phrase can be made up of at most 5 words)? How can that be done? For example:

[in]: foo bar foo bar foo bar

[out]: foo bar

Another example:

[in]: this is a sentence sentence sentence this is a sentence where phrases phrases duplicate where phrases duplicate . sentence are not prhases .

[out]: this is a sentence where phrases duplicate . sentence are not prhases .

Answer:

You can use the re module.

>>> import re
>>> s = 'foo foo bar bar'
>>> re.sub(r'\b(.+)\s+\1\b', r'\1', s)
'foo bar'

>>> s = 'foo bar foo bar foo bar'
>>> re.sub(r'\b(.+)\s+\1\b', r'\1', s)
'foo bar foo bar'

If you want to match any number of consecutive occurrences:

>>> s = 'foo bar foo bar foo bar'
>>> re.sub(r'\b(.+)(\s+\1\b)+', r'\1', s)
'foo bar'
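
The question caps a phrase at five words, which the pattern above does not enforce, and `\b` does not treat a standalone token such as `,` as a word. A possible variant that addresses both is sketched below. It assumes tokens are separated by whitespace; the names `MAX_WORDS`, `PHRASE_REPEAT` and `collapse_once` are illustrative and do not come from the answer.

```python
import re

MAX_WORDS = 5  # limit taken from the question; adjust as needed

# (?<!\S) and (?!\S) anchor on whitespace rather than \b, so a token
# like "," counts as a word. The group captures 1 to MAX_WORDS tokens,
# and the tail matches one or more immediate repeats of that capture.
PHRASE_REPEAT = re.compile(
    r'(?<!\S)(\S+(?:\s+\S+){0,%d})(?:\s+\1(?!\S))+' % (MAX_WORDS - 1)
)

def collapse_once(text):
    """Replace each run of a repeated phrase with a single copy (one pass)."""
    return PHRASE_REPEAT.sub(r'\1', text)

print(collapse_once('foo bar foo bar foo bar'))
# foo bar
print(collapse_once('have you any any wool , , yes'))
# have you any wool , yes
```

Like the original pattern, this is a single pass, and the repeated copy has to use the same whitespace between words as the first one.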

Edit: an addition for the last example. To handle it, you have to keep calling re.sub for as long as a repeated phrase remains. So:

>>> s = 'this is a sentence sentence sentence this is a sentence where phrases phrases duplicate where phrases duplicate'
>>> while re.search(r'\b(.+)(\s+\1\b)+', s):
...   s = re.sub(r'\b(.+)(\s+\1\b)+', r'\1', s)
...
>>> s
'this is a sentence where phrases duplicate'
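
The loop above scans the string twice per iteration, once with `re.search` and once with `re.sub`. One way to avoid that, sketched here as an option rather than as the answer's own method, is `re.subn`, which returns the new string together with the number of substitutions made. The function name `collapse_repeats` is hypothetical.

```python
import re

# Same pattern as in the answer, compiled once for reuse.
REPEAT = re.compile(r'\b(.+)(\s+\1\b)+')

def collapse_repeats(text):
    """Apply the substitution repeatedly until nothing changes."""
    while True:
        # subn returns (new_string, number_of_substitutions_made)
        text, count = REPEAT.subn(r'\1', text)
        if count == 0:
            return text

s = ('this is a sentence sentence sentence this is a sentence '
     'where phrases phrases duplicate where phrases duplicate')
print(collapse_repeats(s))
# this is a sentence where phrases duplicate
```

Each successful substitution shortens the string, so the loop is guaranteed to stop.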
