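Navigating through pagination with Selenium in Python
Question:
I'm scraping this site with Python and Selenium. My code works, but it currently scrapes only the first page. I'd like to iterate through all the pages and scrape every one, but the site handles pagination in an odd way. How can I go through the pages and scrape them one by one?
Pagination HTML: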
<div class="pagination">
<a href="/PlanningGIS/LLPG/WeeklyList/41826123,1" title="Go to first page">First</a>
<a href="/PlanningGIS/LLPG/WeeklyList/41826123,1" title="Go to previous page">Prev</a>
<a href="/PlanningGIS/LLPG/WeeklyList/41826123,1" title="Go to page 1">1</a>
<span class="current">2</span>
<a href="/PlanningGIS/LLPG/WeeklyList/41826123,3" title="Go to page 3">3</a>
<a href="/PlanningGIS/LLPG/WeeklyList/41826123,4" title="Go to page 4">4</a>
<a href="/PlanningGIS/LLPG/WeeklyList/41826123,3" title="Go to next page">Next</a>
<a href="/PlanningGIS/LLPG/WeeklyList/41826123,4" title="Go to last page">Last</a>
</div>
Scraper:
import re
import json
import requests
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.chrome.options import Options

options = Options()
# options.add_argument('--headless')
options.add_argument("start-maximized")
options.add_argument('disable-infobars')
driver=webdriver.Chrome(chrome_options=options,
                        executable_path=r'/Users/weaabduljamac/Downloads/chromedriver')

url = 'https://services.wiltshire.gov.uk/PlanningGIS/LLPG/WeeklyList'
driver.get(url)

def getData():
    data = []
    rows = driver.find_element_by_xpath('//*[@id="form1"]/table/tbody').find_elements_by_tag_name('tr')
    for row in rows:
        app_number = row.find_elements_by_tag_name('td')[1].text
        address = row.find_elements_by_tag_name('td')[2].text
        proposals = row.find_elements_by_tag_name('td')[3].text
        status = row.find_elements_by_tag_name('td')[4].text
        data.append({"CaseRef": app_number, "address": address, "proposals": proposals, "status": status})
    print(data)
    return data

def main():
    all_data = []
    select = Select(driver.find_element_by_xpath("//select[@class='formitem' and @id='selWeek']"))
    list_options = select.options

    for item in range(len(list_options)):
        select = Select(driver.find_element_by_xpath("//select[@class='formitem' and @id='selWeek']"))
        select.select_by_index(str(item))
        driver.find_element_by_css_selector("input.formbutton#csbtnSearch").click()
        all_data.extend( getData() )
        driver.find_element_by_xpath('//*[@id="form1"]/div[3]/a[4]').click()
        driver.get(url)

    with open( 'wiltshire.json', 'w+' ) as f:
        json.dump( all_data, f )
    driver.quit()

if __name__ == "__main__":
    main()
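Answer:
Before automating any scenario, always write down the manual steps you would take to carry it out. The manual steps you want to perform (as I understand them from the question) are:
1) Go to the site: https://services.wiltshire.gov.uk/PlanningGIS/LLPG/WeeklyList
2) Select the first week option
3) Click Search
4) Get the data from every page
5) Load the URL again
6) Select the second week option
7) Click Search
8) Get the data from every page
.. and so on.
You have a loop that selects the different weeks, but inside each iteration of that week loop you also need a loop that iterates over all the pages. Because you don't do that, your code only returns the data from the first page.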
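The steps above boil down to two nested loops. Here is a minimal, browser-free sketch of that structure. The names `scrape_all`, `pages_for_week`, `scrape_page`, and the `fake_site` data are hypothetical stand-ins for illustration, not part of the original scraper:

```python
def scrape_all(weeks, pages_for_week, scrape_page):
    """Collect rows from every page of every week."""
    all_rows = []
    for week in weeks:  # outer loop: one search per week option
        # inner loop: visit every result page for that week
        for page in range(1, pages_for_week(week) + 1):
            all_rows.extend(scrape_page(week, page))
    return all_rows


# Hypothetical stand-in for the site: week -> list of pages -> list of rows
fake_site = {
    "week 1": [["a", "b"], ["c"]],
    "week 2": [["d"]],
}

rows = scrape_all(
    fake_site,
    pages_for_week=lambda week: len(fake_site[week]),
    scrape_page=lambda week, page: fake_site[week][page - 1],
)
print(rows)
```

In the real scraper, the outer loop corresponds to selecting a week and clicking Search, and the inner loop corresponds to calling getData() and clicking Next.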
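Another problem is how you locate the Next button:
driver.find_element_by_xpath('//*[@id="form1"]/div[3]/a[4]').click()
Selecting the fourth <a> element is not reliable, because the index of the Next button differs from page to page. Use a better locator instead:
driver.find_element_by_xpath("//a[contains(text(),'Next')]").click()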
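To see why the positional locator is fragile, here is a small sketch that runs both kinds of lookup against the pagination HTML from the question. It uses the standard library's ElementTree rather than a browser. ElementTree supports only a subset of XPath, so `a[.='Next']` stands in here for `contains(text(),'Next')`:

```python
import xml.etree.ElementTree as ET

# Pagination markup copied from the question (the current page is page 2)
PAGINATION_HTML = """<div class="pagination">
<a href="/PlanningGIS/LLPG/WeeklyList/41826123,1" title="Go to first page">First</a>
<a href="/PlanningGIS/LLPG/WeeklyList/41826123,1" title="Go to previous page">Prev</a>
<a href="/PlanningGIS/LLPG/WeeklyList/41826123,1" title="Go to page 1">1</a>
<span class="current">2</span>
<a href="/PlanningGIS/LLPG/WeeklyList/41826123,3" title="Go to page 3">3</a>
<a href="/PlanningGIS/LLPG/WeeklyList/41826123,4" title="Go to page 4">4</a>
<a href="/PlanningGIS/LLPG/WeeklyList/41826123,3" title="Go to next page">Next</a>
<a href="/PlanningGIS/LLPG/WeeklyList/41826123,4" title="Go to last page">Last</a>
</div>"""

pagination = ET.fromstring(PAGINATION_HTML)

# Positional lookup: the fourth <a> child, whatever it happens to be
fourth_link = pagination.find("a[4]")

# Text-based lookup: the <a> child whose text is exactly "Next"
next_link = pagination.find("a[.='Next']")

print(fourth_link.text)                       # 3
print(next_link.text, next_link.get("href"))  # Next /PlanningGIS/LLPG/WeeklyList/41826123,3
```

On this page the fourth link is the link to page 3, not Next, while the text-based lookup finds Next wherever it sits in the list.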
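Logic for looping over the pages:
First, you need the number of pages. I got it by locating the <a> element immediately before the Next button. In the pagination HTML from the question, the text of that element is the number of pages (4 in that example).
I did it with the following code:
number_of_pages = int(driver.find_element_by_xpath("//a[contains(text(),'Next')]/preceding-sibling::a[1]").text)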
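As a possible alternative, the page count might also be read from the href of the Last link. This is only a sketch, and it rests on an assumption drawn from the question's HTML: that the number after the final comma in a pagination href is the page number (the link titled "Go to page 3" ends in ",3"). The helper name `page_number_from_href` is hypothetical:

```python
def page_number_from_href(href):
    """Return the integer after the last comma of a pagination href.

    Assumption: pagination hrefs on this site end in ",<page number>".
    """
    return int(href.rsplit(",", 1)[1])


# href of the "Last" link in the question's pagination HTML
last_href = "/PlanningGIS/LLPG/WeeklyList/41826123,4"

number_of_pages = page_number_from_href(last_href)
print(number_of_pages)  # 4
```

With Selenium, the href would come from calling get_attribute("href") on the Last link. That returns an absolute URL, which this helper handles the same way because it only looks at the text after the last comma.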
Once you have number_of_pages, you only need to click the Next button number_of_pages - 1 times!
Final code for the main function:
import time  # needed for time.sleep below

def main():
    all_data = []
    select = Select(driver.find_element_by_xpath("//select[@class='formitem' and @id='selWeek']"))
    list_options = select.options

    for item in range(len(list_options)):
        # Re-locate the dropdown: the page is reloaded at the end of every iteration
        select = Select(driver.find_element_by_xpath("//select[@class='formitem' and @id='selWeek']"))
        select.select_by_index(item)
        driver.find_element_by_css_selector("input.formbutton#csbtnSearch").click()
        number_of_pages = int(driver.find_element_by_xpath("//a[contains(text(),'Next')]/preceding-sibling::a[1]").text)
        for j in range(number_of_pages):
            all_data.extend(getData())
            # Click Next on every page except the last, so the last page is scraped too
            if j < number_of_pages - 1:
                driver.find_element_by_xpath("//a[contains(text(),'Next')]").click()
                time.sleep(1)
        driver.get(url)

    with open('wiltshire.json', 'w+') as f:
        json.dump(all_data, f)
    driver.quit()
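If counting pages up front turns out to be awkward, another option is to keep clicking Next until there is no Next link left. The sketch below is not the method used in the answer above. It assumes that the site stops rendering Next as a link on the last page, which the question's HTML does not confirm, so a `max_pages` guard bounds the loop. All function names are hypothetical. The Selenium wiring uses the Selenium 4 `By` API (the `find_element_by_*` helpers were removed in Selenium 4.3) and an explicit wait in place of `time.sleep(1)`; it is not exercised by the demo underneath, which uses a fake pager:

```python
def scrape_until_last_page(scrape_page, go_to_next_page, max_pages=1000):
    """Scrape the current page, then advance until there is no next page."""
    rows = []
    for _ in range(max_pages):  # guard against a Next link that never disappears
        rows.extend(scrape_page())
        if not go_to_next_page():
            break
    return rows


def make_selenium_next(driver, timeout=10):
    """Build a go_to_next_page callable for a Selenium 4 driver."""
    # Imported lazily so the demo below runs without Selenium installed
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    def go_to_next_page():
        links = driver.find_elements(
            By.XPATH, "//div[@class='pagination']/a[text()='Next']"
        )
        if not links:
            return False  # assumption: no Next link means this is the last page
        links[0].click()
        # Wait for the old link to go stale, i.e. the new page has replaced it
        WebDriverWait(driver, timeout).until(EC.staleness_of(links[0]))
        return True

    return go_to_next_page


# Demo with a fake three-page result set instead of a browser
pages = [["r1", "r2"], ["r3"], ["r4"]]
state = {"index": 0}

def fake_scrape():
    return pages[state["index"]]

def fake_next():
    if state["index"] + 1 >= len(pages):
        return False
    state["index"] += 1
    return True

rows = scrape_until_last_page(fake_scrape, fake_next)
print(rows)
```

In the real scraper, the call would look like scrape_until_last_page(getData, make_selenium_next(driver)), placed inside the week loop in place of the page-count loop.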