Get the text of an HTML tag without the text of its inner child tags


Question

Example:

Sometimes the HTML is:

<div id="1">
    <div id="2">
        this is the text i do NOT want
    </div>
    this is the text i want here
</div>

Other times it is just:

<div id="1">
    this is the text i want here
</div>

I only want the text that sits directly inside a tag, ignoring all of its child tags. If I use the .text property, I get both pieces of text.


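Answer:

Updated to use a more generic approach (see the edit history for the original answer):

You can extract the outer div's own text by testing whether each of its children is an instance of NavigableString: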

from bs4 import BeautifulSoup, NavigableString

html = '''<div id="1">
    <div id="2">
        this is the text i do NOT want
    </div>
    this is the text i want here
</div>'''

# Name the parser explicitly; otherwise bs4 picks one and emits a warning.
soup = BeautifulSoup(html, "html.parser")
outer = soup.div
inner_text = [element for element in outer if isinstance(element, NavigableString)]

This yields a list of the strings that are direct children of the outer div element.

>>> inner_text
['\n    ', '\n    this is the text i want here\n']
>>> ''.join(inner_text)
'\n    \n    this is the text i want here\n'
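The joined string still carries the newlines and indentation from the markup. Below is a minimal sketch of one way to tidy it. It assumes the html.parser backend and that leading and trailing whitespace in each fragment is not significant; the names fragments and clean_text are only illustrative.

```python
from bs4 import BeautifulSoup, NavigableString

html = '''<div id="1">
    <div id="2">
        this is the text i do NOT want
    </div>
    this is the text i want here
</div>'''

soup = BeautifulSoup(html, "html.parser")
outer = soup.div

# Keep only the direct string children and strip the surrounding whitespace.
fragments = [str(child).strip() for child in outer
             if isinstance(child, NavigableString)]

# Drop fragments that were whitespace only, then join the rest with a space.
clean_text = " ".join(fragment for fragment in fragments if fragment)
print(clean_text)
```

If whitespace inside the text matters for your use case, skip the strip step and work with the raw strings instead.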

For the second example:

html = '''<div id="1">
    this is the text i want here
</div>'''
soup2 = BeautifulSoup(html, "html.parser")
outer = soup2.div
inner_text = [element for element in outer if isinstance(element, NavigableString)]

>>> inner_text
['\n    this is the text i want here\n']

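This also works in other cases, for example when the outer div's text comes before any child tag, sits between child tags, is split into several text elements, or is absent altogether.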
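The cases listed above can be checked with a small sketch. The helper names direct_text and outer_div are hypothetical, not part of Beautiful Soup, and the sample markup is made up for illustration. The sketch assumes the html.parser backend and drops whitespace-only fragments.

```python
from bs4 import BeautifulSoup, NavigableString


def direct_text(tag):
    """Return the non-blank strings that are direct children of tag.

    Hypothetical helper: same isinstance test as above, plus stripping.
    """
    return [str(child).strip() for child in tag
            if isinstance(child, NavigableString) and child.strip()]


def outer_div(markup):
    # Assumes html.parser; the first <div> in the markup is the outer one.
    return BeautifulSoup(markup, "html.parser").div


# Text before any child tag.
before = direct_text(outer_div('<div>first<p>skip</p></div>'))
# Text between two child tags.
between = direct_text(outer_div('<div><p>skip</p>middle<p>skip</p></div>'))
# Several text elements separated by child tags.
several = direct_text(outer_div('<div>one<p>skip</p>two<p>skip</p>three</div>'))
# No direct text at all.
none_at_all = direct_text(outer_div('<div><p>skip</p></div>'))

print(before, between, several, none_at_all)
```

Note that Comment is a subclass of NavigableString, so an HTML comment placed directly inside the div would also pass the isinstance test. Filter it out separately if that matters.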