
python - Order a json by field using scrapy

I have created a spider to scrape problems from projecteuler.net, and I concluded my answer to a related question with the following:

I launch this with the command scrapy crawl euler -o euler.json, and it outputs an array of unordered JSON objects, each corresponding to a single problem. This is fine for me, because I'm going to process the file with JavaScript, even though I suspect resolving the ordering problem within Scrapy could be very simple.

Unfortunately, ordering the items that Scrapy writes to JSON (I need ascending order by the id field) does not seem to be so simple. I've studied every single component (middlewares, pipelines, exporters, signals, etc.), but none of them seems useful for this purpose. I've arrived at the conclusion that no solution to this problem exists in Scrapy at all (except, maybe, via a very elaborate trick), and that you are forced to order things in a second phase (see the sketch below for what I mean). Do you agree, or do you have an idea? I copy the code of my scraper below.
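To be concrete, this is the kind of second-phase sort I mean; a minimal sketch, assuming the output file is named euler.json as in the command above:

import json

# Load the unordered array that Scrapy produced.
with open('euler.json') as f:
    problems = json.load(f)

# Ascending order by the numeric id field.
problems.sort(key=lambda p: p['id'])

with open('euler_sorted.json', 'w') as f:
    json.dump(problems, f, indent=2)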

Spider:

# -*- coding: utf-8 -*-
import scrapy
from eulerscraper.items import Problem
from scrapy.loader import ItemLoader


class EulerSpider(scrapy.Spider):
    name = 'euler'
    allowed_domains = ['projecteuler.net']
    start_urls = ["https://projecteuler.net/archives"]

    def parse(self, response):
        # Read the highest page number from the pagination links.
        numpag = response.css("div.pagination a[href]::text").extract()
        maxpag = int(numpag[-1])

        # Follow every problem link on the first archive page.
        for href in response.css("table#problems_table a::attr(href)").extract():
            next_page = "https://projecteuler.net/" + href
            yield response.follow(next_page, self.parse_problems)

        # Queue the remaining archive pages for parse_next.
        for i in range(2, maxpag + 1):
            next_page = "https://projecteuler.net/archives;page=" + str(i)
            yield response.follow(next_page, self.parse_next)

    def parse_next(self, response):
        for href in response.css("table#problems_table a::attr(href)").extract():
            next_page = "https://projecteuler.net/" + href
            yield response.follow(next_page, self.parse_problems)

    def parse_problems(self, response):
        loader = ItemLoader(item=Problem(), response=response)
        loader.add_css("title", "h2")
        loader.add_css("id", "#problem_info")
        loader.add_css("content", ".problem_content")

        yield loader.load_item()

Item:

import re

import scrapy
from scrapy.loader.processors import MapCompose, Compose
from w3lib.html import remove_tags


def extract_first_number(text):
    # Note the raw string: the original 'd+' pattern would match the
    # literal letter "d", not a run of digits.
    match = re.search(r'\d+', text)
    return int(match.group())


def array_to_value(element):
    # Processors receive a list of values; keep only the first one.
    return element[0]


class Problem(scrapy.Item):
    id = scrapy.Field(
        input_processor=MapCompose(remove_tags, extract_first_number),
        output_processor=Compose(array_to_value)
    )
    title = scrapy.Field(input_processor=MapCompose(remove_tags))
    content = scrapy.Field()
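For illustration, here is how the id field's processors transform a scraped value, using the functions defined above (the sample HTML string is hypothetical):

# MapCompose applies each function to every element of the extracted
# list; Compose then collapses that list to a single value.
pipeline_in = MapCompose(remove_tags, extract_first_number)
pipeline_out = Compose(array_to_value)

values = pipeline_in(['<h3 id="problem_info">Problem 42</h3>'])  # [42]
print(pipeline_out(values))  # 42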

1 Reply

If I needed my output file to be sorted (I will assume you have a valid reason to want this), I'd probably write a custom exporter.

Take a look at how Scrapy's built-in JsonItemExporter is implemented.
With a few simple changes, you can modify it to collect the items in a list in export_item(), and then sort that list and write out the file in finish_exporting().

Since you're only scraping a few hundred items, the downsides of storing them all in a list and not writing anything to the file until the crawl is done shouldn't be a problem for you.
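A minimal sketch of that exporter, assuming JsonItemExporter behaves as described above (the class name and the id sort key are my own choices):

from scrapy.exporters import JsonItemExporter


class SortedJsonItemExporter(JsonItemExporter):
    """Buffer every item, then write them sorted by 'id'."""

    def __init__(self, file, **kwargs):
        super().__init__(file, **kwargs)
        self._items = []

    def export_item(self, item):
        # Buffer instead of writing immediately.
        self._items.append(item)

    def finish_exporting(self):
        # The crawl is over: sort the buffer, then let the parent
        # class handle the actual JSON serialization and commas.
        self._items.sort(key=lambda i: i['id'])
        for item in self._items:
            super().export_item(item)
        super().finish_exporting()

You would then register it for the json format in settings.py (the module path eulerscraper.exporters is an assumption about your project layout):

FEED_EXPORTERS = {
    'json': 'eulerscraper.exporters.SortedJsonItemExporter',
}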

