Special-character problem with Python Unicode


Question:
#!/usr/bin/env python
# -*- coding: utf_8 -*-

import re


def splitParagraphIntoSentences(paragraph):
    ''' break a paragraph into sentences
        and return a list '''
    # to split by multiple characters,
    # regular expressions are easiest (and fastest)
    # flags belong in compile(); the second argument of split() is maxsplit
    sentenceEnders = re.compile(r'[.!?][\s]{1,2}(?=[A-Z])', re.UNICODE)
    sentenceList = sentenceEnders.split(paragraph)
    return sentenceList


if __name__ == '__main__':
    p = "While other species (e.g. horse mango, M. foetida) are also grown ,Mangifera indica – the common mango or Indian mango – Sheffield’s only mango tree is valued at £9.2 billion."

    sentences = splitParagraphIntoSentences(p)
    for s in sentences:
        print s.strip()

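Expected output: While other species (e.g. horse mango, M. foetida) are also grown ,Mangifera indica – the common mango or Indian mango – Sheffield’s only mango tree is valued at £9.2 billion.

Actual output: the same sentence, but with the special characters garbled.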

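Ignore the meaning of the sentence; the main point is that it does not work with special characters such as “–”, “£” and “’”. I tried setting other encodings (ascii, utf-32, cp-500, iso8859_15 and utf-8) in the sitecustomize.py file and in this code, but could not solve it. Sorry, I am new to Python. Thanks in advance for your help.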


Answer:

Found a solution.

The code below works correctly.

p = p.encode('utf-8') if isinstance(p, unicode) else p
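
For anyone who wants to reuse that one-liner, here is a minimal sketch that wraps it in a function. The names to_utf8_bytes and text_type are hypothetical and not from the original post. The version check is an assumption added so the same file also runs on Python 3, where the built-in unicode type no longer exists and str plays its role.

```python
# -*- coding: utf-8 -*-
import sys

# On Python 2 the text type is `unicode`; on Python 3 it is `str`.
# `unicode` is only evaluated on Python 2, so this line is safe on both.
text_type = str if sys.version_info[0] >= 3 else unicode  # noqa: F821


def to_utf8_bytes(value):
    """Hypothetical helper: encode text as UTF-8, pass byte strings through."""
    return value.encode('utf-8') if isinstance(value, text_type) else value


if __name__ == '__main__':
    # Escapes are used so the example does not depend on the file's encoding.
    p = u"Sheffield\u2019s only mango tree is valued at \u00a39.2 billion."
    encoded = to_utf8_bytes(p)
    # A value that is already a byte string is returned unchanged.
    assert to_utf8_bytes(encoded) == encoded
    print(repr(encoded))
```

Note that this only changes values that are text objects. A plain "..." literal in a Python 2 source file is already a byte string, so it passes through untouched.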