Get the text of an HTML tag without the text of its inner child tags
Question:
Example:
Sometimes the HTML is:
<div id="1">
<div id="2">
this is the text i do NOT want
</div>
this is the text i want here
</div>
Other times it is just:
<div id="1">
this is the text i want here
</div>
I only want the text that sits directly inside one tag, ignoring the text of all its child tags. If I use the .text attribute, I get both.
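Answer:
Update: switched to a more general approach (see the edit history for the original answer):
You can extract the strings by testing whether each child of the outer div is an instance of NavigableString.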
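from bs4 import BeautifulSoup, NavigableString
html = '''<div id="1">
<div id="2">
this is the text i do NOT want
</div>
this is the text i want here
</div>'''
# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(html, 'html.parser')
outer = soup.div
# Iterating over a tag yields its direct children; keep only the text nodes and skip child tags.
inner_text = [element for element in outer if isinstance(element, NavigableString)]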
This yields a list of the strings contained directly in the outer div element.
>>> inner_text
[u'\n', u'\n this is the text i want here\n']
>>> ''.join(inner_text)
u'\n\n this is the text i want here\n'
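The joined result above still carries the newlines and padding from the markup. As a minimal sketch of one way to tidy that up, the same NavigableString test can be wrapped in a small helper that strips each piece and drops the empty ones. The name direct_text is hypothetical and not part of BeautifulSoup, and html.parser is assumed as the parser.

```python
from bs4 import BeautifulSoup, NavigableString


def direct_text(tag):
    """Return the text directly inside `tag`, ignoring text in child tags.

    Hypothetical helper for illustration; not part of the bs4 API.
    """
    # Strip each direct text node; whitespace-only nodes become empty strings.
    parts = [
        child.strip()
        for child in tag.children
        if isinstance(child, NavigableString)
    ]
    # Drop the empty pieces and join the rest with single spaces.
    return " ".join(part for part in parts if part)


html = '''<div id="1">
<div id="2">
this is the text i do NOT want
</div>
this is the text i want here
</div>'''

soup = BeautifulSoup(html, "html.parser")
result = direct_text(soup.div)
print(result)
```

Joining with a single space is a design choice for readability; if the exact original whitespace matters, the plain ''.join(inner_text) shown above is the safer option.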
For the second example:
html = '''<div id="1">
this is the text i want here
</div>'''
soup2 = BeautifulSoup(html, 'html.parser')
outer = soup2.div
inner_text = [element for element in outer if isinstance(element, NavigableString)]
>>> inner_text
[u'\n this is the text i want here\n']
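This also works in other cases, for example when the outer div's text comes before any child tag, sits between child tags, is split into several text nodes, or is not there at all.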
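The cases listed above can be checked with a short sketch. The markup strings and the own_strings helper are made up for illustration (they are not from the original answer), and html.parser is assumed as the parser.

```python
from bs4 import BeautifulSoup, NavigableString


def own_strings(tag):
    """Return the stripped, non-empty text nodes that are direct children of `tag`.

    Hypothetical helper for illustration; not part of the bs4 API.
    """
    return [
        child.strip()
        for child in tag.children
        if isinstance(child, NavigableString) and child.strip()
    ]


# Illustrative markup for each case; "skip" marks text inside child tags.
cases = {
    "before": "<div>first<span>skip</span></div>",
    "between": "<div><b>skip</b>middle<i>skip</i></div>",
    "multiple": "<div>one<b>skip</b>two<i>skip</i>three</div>",
    "none": "<div><b>skip</b></div>",
}

results = {
    name: own_strings(BeautifulSoup(markup, "html.parser").div)
    for name, markup in cases.items()
}

for name, strings in results.items():
    print(name, strings)
```

One thing to keep in mind: in bs4, Comment is a subclass of NavigableString, so an HTML comment placed directly inside the div would also pass the isinstance test. If that matters for your markup, filter those out explicitly.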