Constructing a regular expression for URLs in the start_urls list in Python's Scrapy framework


Question

I am very new to Scrapy, and I have never used regular expressions before.

Here is my spider.py code:

class ExampleSpider(BaseSpider):
    name = "test_code"
    allowed_domains = ["www.example.com"]
    start_urls = [
        "http://www.example.com/bookstore/new/1?filter=bookstore",
        "http://www.example.com/bookstore/new/2?filter=bookstore",
        "http://www.example.com/bookstore/new/3?filter=bookstore",
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)

Now, if we look at start_urls, all three URLs are identical except for the integer part (2, 3, and so on), and in practice there is no limit to how many such URLs exist on the site. I gather that we can use a CrawlSpider and construct a regular expression for these URLs, something like the following:

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    import re

    class ExampleSpider(CrawlSpider):
        name = 'example.com'
        allowed_domains = ['example.com']
        start_urls = [
            "http://www.example.com/bookstore/new/1?filter=bookstore",
            "http://www.example.com/bookstore/new/2?filter=bookstore",
            "http://www.example.com/bookstore/new/3?filter=bookstore",
        ]

        rules = (
            Rule(SgmlLinkExtractor(allow=(........)),),
        )

        def parse(self, response):
            hxs = HtmlXPathSelector(response)

Could you guide me on how to construct a crawl spider Rule for the start_urls list above?


Answer

If I understand your requirement correctly, you want a large number of start URLs that follow a particular pattern.

If so, you can override the BaseSpider.start_requests method:

from scrapy.spider import BaseSpider

class ExampleSpider(BaseSpider):
    name = "test_code"
    allowed_domains = ["www.example.com"]

    def start_requests(self):
        # Generate one request per page number instead of listing URLs by hand
        for i in xrange(1000):
            yield self.make_requests_from_url("http://www.example.com/bookstore/new/%d?filter=bookstore" % i)

    ...
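Alternatively, if you do want to stay with the CrawlSpider approach from the question, the `allow` argument of the link extractor takes regular expressions that are matched against candidate URLs. A minimal sketch of a pattern for the URLs above (the name `BOOKSTORE_RE` is my own; `\d+` matches any page number and `\?` escapes the literal question mark), checked here with plain `re` outside of Scrapy:

```python
import re

# Hypothetical pattern name for the bookstore paging URLs:
# \d+ matches any integer page number, \? escapes the "?" metacharacter.
BOOKSTORE_RE = r"/bookstore/new/\d+\?filter=bookstore"

urls = [
    "http://www.example.com/bookstore/new/1?filter=bookstore",
    "http://www.example.com/bookstore/new/42?filter=bookstore",
]
for url in urls:
    print(bool(re.search(BOOKSTORE_RE, url)))  # both URLs match the pattern
```

In the spider from the question, this pattern would be passed as `Rule(SgmlLinkExtractor(allow=(BOOKSTORE_RE,)))`, so the crawler follows only links matching it.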