lachenmayer/graphql-scraper: Extract structured data from the web using GraphQL.


Open-source project name:

lachenmayer/graphql-scraper

Open-source project URL:

https://github.com/lachenmayer/graphql-scraper

Open-source programming language:

TypeScript 92.6%

Open-source project introduction:

graphql-scraper

GraphQL lets us query all sorts of graph-shaped data - so why not use it to query the world's most useful graph, the web?

graphql-scraper is a command-line tool and reusable GraphQL schema which lets you easily extract data from HTML.

Check out a live demo here. You can easily spin up your own by using graphql-scraper-server.

The command-line tool

npx graphql-scraper <query-file>

or

npm install -g graphql-scraper
graphql-scraper <query-file>

Reads a GraphQL query from the path query-file, and prints the result.

If query-file is not given, reads the query from stdin.

Command-line options

  • --json: Returns the result in JSON format, for use in other tools.
  • --help: Prints a help string.

Variables

Any other named options you pass to the CLI will be used as a query variable.

For example, if you want to reuse the same query on several pages, you could write the following query file (query.graphql):

query ExampleQueryWithVariable($page: String) {
  page(url: $page) {
    items: queryAll(selector: "tr.athing") {
      rank: text(selector: "td span.rank")
      title: text(selector: "td.title a")
      sitebit: text(selector: "span.comhead a")
      url: attr(selector: "td.title a", name: "href")
      attrs: next {
        score: text(selector: "span.score")
        user: text(selector: "a:first-of-type")
        comments: text(selector: "a:nth-of-type(3)")
      }
    }
  }
}

...and execute the query like this:

graphql-scraper query.graphql --page="https://news.ycombinator.com/"
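
To make the rule above concrete, here is a minimal sketch of how named options could be turned into a variables object. It is an illustration only, not graphql-scraper's actual implementation: the helper name optionsToVariables and the RESERVED_FLAGS set are hypothetical, and the sketch assumes only the --name=value form shown in the example.

```typescript
// Hypothetical helper, not taken from graphql-scraper's source.
type Variables = Record<string, string>

// Flags documented above as command-line options rather than query variables.
const RESERVED_FLAGS = new Set(['json', 'help'])

function optionsToVariables(args: string[]): Variables {
  const variables: Variables = {}
  for (const arg of args) {
    // Only the --name=value form is handled. The shell strips the quotes,
    // so --page="https://..." arrives here as --page=https://...
    const match = /^--([^=]+)=(.*)$/.exec(arg)
    if (match === null) continue // positional argument or bare flag
    const [, name, value] = match
    if (RESERVED_FLAGS.has(name)) continue
    variables[name] = value
  }
  return variables
}

const exampleVariables = optionsToVariables([
  'query.graphql',
  '--json',
  '--page=https://news.ycombinator.com/',
])
console.log(exampleVariables) // { page: 'https://news.ycombinator.com/' }
```

The resulting object has the shape a GraphQL executor expects for variable values, so $page in the query above would resolve to the given URL.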

The schema

You can check out an auto-generated schema description here, but I recommend trying out the graphql-scraper-server example and exploring the types interactively. You can also play around with the schema in the live demo.

Re-using the schema in your own projects

The npm package exports the GraphQL schema which is used by the command-line tool. This is an instance of the graphql-js GraphQLSchema class, which you can use anywhere that expects a schema, for example apollo-server or graphql-yoga.

Use npm install graphql-scraper or yarn add graphql-scraper to add the schema to your project.

Basic example with graphql

import { graphql } from 'graphql'
import schema from 'graphql-scraper'
// You can also import it as follows:
// const schema = require('graphql-scraper')


const query = `
{
  page(url: "http://news.ycombinator.com") {
    items: queryAll(selector: "tr.athing") {
      rank: text(selector: "td span.rank")
      title: text(selector: "td.title a")
      sitebit: text(selector: "span.comhead a")
      url: attr(selector: "td.title a", name: "href")
      attrs: next {
        score: text(selector: "span.score")
        user: text(selector: "a:first-of-type")
        comments: text(selector: "a:nth-of-type(3)")
      }
    }
  }
}
`

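// Pass an arguments object: graphql-js v16 dropped positional arguments,
// so graphql(schema, query) only works with older releases.
graphql({ schema, source: query }).then(response => {
  console.log(response)
})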
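
The response follows the standard GraphQL result shape, with data and, on failure, errors. As a sketch of how it could be consumed, the types below mirror the aliases in the query above. The names Item, ScrapeResult and extractItems are hypothetical, the nullability of each field is an assumption (a selector that matches nothing is treated as null), and the sample values are made up for illustration rather than real scraped output.

```typescript
// One entry of `items`, mirroring the aliases used in the query.
// Nullable fields are an assumption, not taken from the published schema.
interface Item {
  rank: string | null
  title: string | null
  sitebit: string | null
  url: string | null
  attrs: { score: string | null; user: string | null; comments: string | null } | null
}

// Standard GraphQL result envelope: `data` plus optional `errors`.
interface ScrapeResult {
  data?: { page: { items: Item[] | null } | null } | null
  errors?: ReadonlyArray<{ message: string }>
}

// Throws if the result carries errors, otherwise returns the items (or []).
function extractItems(result: ScrapeResult): Item[] {
  if (result.errors && result.errors.length > 0) {
    throw new Error(result.errors.map(e => e.message).join('; '))
  }
  return result.data?.page?.items ?? []
}

// Made-up sample data, used only to exercise the helper.
const sampleResult: ScrapeResult = {
  data: {
    page: {
      items: [
        {
          rank: '1.',
          title: 'Example story',
          sitebit: 'example.com',
          url: 'https://example.com/',
          attrs: { score: '10 points', user: 'someone', comments: '2 comments' },
        },
      ],
    },
  },
}

console.log(extractItems(sampleResult).map(item => item.title)) // [ 'Example story' ]
```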

Background

This project was inspired by gdom, which is written in Python and uses the Graphene GraphQL library.

If you want to switch over from gdom, please note some schema changes:

  • query(selector: String!) now returns a single Element rather than a list (like document.querySelector). A new queryAll(selector: String!): [Element] field behaves like document.querySelectorAll.
  • is(selector: String!) is renamed to has(selector: String!).
  • children, parent, siblings, next etc. no longer have a selector argument. If you need to select children with a specific selector, use child selectors (.foo > .bar).
  • parents is removed.
  • prev[All] is renamed to previous[All].
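
The pure renames in the list above can be captured in a small lookup table, which may help when porting gdom queries by hand. This is a sketch only: migrateFieldName is a hypothetical helper, not part of graphql-scraper. It covers the renamed and removed fields and deliberately leaves query untouched, because that field keeps its name and only its return type changes.

```typescript
// Renames taken from the migration notes above.
// A Map avoids accidental hits on Object.prototype keys such as "constructor".
const RENAMED_FIELDS = new Map<string, string>([
  ['is', 'has'],
  ['prev', 'previous'],
  ['prevAll', 'previousAll'],
])

// Fields with no replacement in graphql-scraper.
const REMOVED_FIELDS = new Set(['parents'])

// Hypothetical helper: maps a gdom field name to its graphql-scraper name.
// It cannot fix semantic changes: `query` now returns a single Element,
// and selector arguments on children, parent, siblings, next etc. must be
// rewritten by hand as child selectors.
function migrateFieldName(gdomName: string): string {
  if (REMOVED_FIELDS.has(gdomName)) {
    throw new Error(`"${gdomName}" has no equivalent in graphql-scraper`)
  }
  return RENAMED_FIELDS.get(gdomName) ?? gdomName
}

console.log(['is', 'prevAll', 'text'].map(migrateFieldName)) // [ 'has', 'previousAll', 'text' ]
```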

Maintainers

@lachenmayer

Contribute

PRs accepted.

License

MIT © 2018 harry lachenmayer



