Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
733 views
in Technique[技术] by (71.8m points)

python - Scrapy not crawling subsequent pages in order

I am writing a crawler to get the names of items from an website. The website has got 25 items per page and multiple pages (200 for some item types).

Here is the code:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from lonelyplanet.items import LonelyplanetItem

class LonelyplanetSpider(CrawlSpider):
    name = "lonelyplanetItemName_spider"
    allowed_domains = ["lonelyplanet.com"]
    def start_requests(self):
        for i in xrange(8):
            yield self.make_requests_from_url("http://www.lonelyplanet.com/europe/sights?page=%d" % i)

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//h2')
    items = []
    for site in sites:
        item = LonelyplanetItem()
        item['name'] = site.select('a[@class="targetUrl"]/text()').extract()
        items.append(item)
    return items

When I run the crawler and store the data in csv format the data is not stored in order, i.e. - page 2 data is stored before page 1 or page 3 gets stored before page 2 and similarly. Also sometimes before all the data of a particular page is stored the data of another page comes in and them the rest of the data of the former page is stored again.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

scrapy is an asynchronous framework. It uses non-blocking IO, so it doesn't wait for a request to finish before starting the next one.

And since multiple requests can be made at a time, it is impossible to know the exact order the parse() method will be getting the responses.

My point is, scrapy is not meant to extract data in a particular order. If you absolutely need to preserve order, there are some ideas here: Scrapy Crawl URLs in Order


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...