fimad/scalpel: A high level web scraping library for Haskell.


Open source project name:

fimad/scalpel

Open source project URL:

https://github.com/fimad/scalpel

Open source programming language:

Haskell 98.9%

Open source project introduction:

Scalpel

Scalpel is a web scraping library inspired by libraries like Parsec and Perl's Web::Scraper. Scalpel builds on top of TagSoup to provide a declarative and monadic interface.

There are two general mechanisms provided by this library that are used to build web scrapers: Selectors and Scrapers.

Selectors

Selectors describe a location within an HTML DOM tree. The simplest selector that can be written is a plain string value. For example, the selector "div" matches every div node in a DOM. Selectors can be combined using tag combinators. The // operator defines nested relationships within a DOM tree. For example, the selector "div" // "a" matches all anchor tags nested arbitrarily deep within a div tag.
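As a small offline sketch of the nesting combinator, the snippet below runs a selector against an in-memory string with scrapeStringLike instead of fetching a page. The HTML fragment and the names sampleHtml and linksInDivs are made up for illustration.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Text.HTML.Scalpel

-- Hypothetical fragment: one anchor nested two levels inside a div,
-- and one anchor that is not inside any div.
sampleHtml :: String
sampleHtml =
    "<div><p><a href='/inside'>inside</a></p></div><a href='/outside'>outside</a>"

-- Collect the text of every anchor that has a div somewhere above it.
linksInDivs :: Maybe [String]
linksInDivs = scrapeStringLike sampleHtml (texts ("div" // "a"))
```

Only the first anchor sits below a div, so linksInDivs should evaluate to Just ["inside"].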

In addition to describing the nested relationships between tags, selectors can also include predicates on the attributes of a tag. The @: operator creates a selector that matches a tag based on the name and various conditions on the tag's attributes. An attribute predicate is just a function that takes an attribute and returns a boolean indicating whether the attribute matches a criterion. There are several attribute operators that can be used to generate common predicates. The @= operator creates a predicate that matches the name and value of an attribute exactly. For example, the selector "div" @: ["id" @= "article"] matches div tags where the id attribute is equal to "article".
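The attribute predicate from the paragraph above can be tried out in the same offline fashion. This is only a sketch: the two-div fragment and the name articleText are assumptions chosen for the example.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Text.HTML.Scalpel

-- Hypothetical fragment with two divs that differ only in their id.
sampleHtml :: String
sampleHtml =
    "<div id='sidebar'>links</div><div id='article'>body text</div>"

-- Text of the first div whose id attribute is exactly "article".
articleText :: Maybe String
articleText = scrapeStringLike sampleHtml (text ("div" @: ["id" @= "article"]))
```

The sidebar div is skipped because its id does not match, so articleText should be Just "body text".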

Scrapers

Scrapers are values that are parameterized over a selector and produce a value from an HTML DOM tree. The Scraper type takes two type parameters. The first is the string-like type used to store the text values within a DOM tree; any string-like type supported by Text.StringLike is valid. The second is the type of value that the scraper produces.

There are several scraper primitives that take selectors and extract content from the DOM. Each primitive defined by this library comes in two variants: singular and plural. The singular variants extract the first instance matching the given selector, while the plural variants match every instance.
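To make the singular/plural distinction concrete, here is a minimal sketch that applies text and texts to the same selector. The list markup and the names firstItem and allItems are illustrative assumptions.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Text.HTML.Scalpel

-- Hypothetical list with three items.
sampleHtml :: String
sampleHtml = "<ul><li>one</li><li>two</li><li>three</li></ul>"

-- Singular variant: only the first matching tag is used.
firstItem :: Maybe String
firstItem = scrapeStringLike sampleHtml (text "li")

-- Plural variant: every matching tag contributes one list element.
allItems :: Maybe [String]
allItems = scrapeStringLike sampleHtml (texts "li")
```

Here firstItem should be Just "one" and allItems should be Just ["one","two","three"].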

Example

Complete examples can be found in the examples folder in the scalpel git repository.

The following is an example that demonstrates most of the features provided by this library. Suppose you have the following hypothetical HTML located at "http://example.com/article.html" and you would like to extract a list of all of the comments.

<html>
  <body>
    <div class='comments'>
      <div class='comment container'>
        <span class='comment author'>Sally</span>
        <div class='comment text'>Woo hoo!</div>
      </div>
      <div class='comment container'>
        <span class='comment author'>Bill</span>
        <img class='comment image' src='http://example.com/cat.gif' />
      </div>
      <div class='comment container'>
        <span class='comment author'>Susan</span>
        <div class='comment text'>WTF!?!</div>
      </div>
    </div>
  </body>
</html>

The following snippet defines a function, allComments, that will download the web page, and extract all of the comments into a list:

{-# LANGUAGE OverloadedStrings #-}

import Control.Applicative ((<|>))
import Text.HTML.Scalpel

type Author = String

data Comment
    = TextComment Author String
    | ImageComment Author URL
    deriving (Show, Eq)

allComments :: IO (Maybe [Comment])
allComments = scrapeURL "http://example.com/article.html" comments
   where
       comments :: Scraper String [Comment]
       comments = chroots ("div" @: [hasClass "container"]) comment

       comment :: Scraper String Comment
       comment = textComment <|> imageComment

       textComment :: Scraper String Comment
       textComment = do
           author      <- text $ "span" @: [hasClass "author"]
           commentText <- text $ "div"  @: [hasClass "text"]
           return $ TextComment author commentText

       imageComment :: Scraper String Comment
       imageComment = do
           author   <- text       $ "span" @: [hasClass "author"]
           imageURL <- attr "src" $ "img"  @: [hasClass "image"]
           return $ ImageComment author imageURL
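Because "http://example.com/article.html" is hypothetical, it can be handy to exercise the same scraper without any network access. The sketch below is one way to do that, assuming an abbreviated in-memory copy of the page and using scrapeStringLike in place of scrapeURL; the names page and offlineComments are not part of the original example, and the scrapers are restated in applicative style only to keep the snippet short.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Control.Applicative ((<|>))
import Text.HTML.Scalpel

data Comment
    = TextComment String String
    | ImageComment String URL
    deriving (Show, Eq)

-- Abbreviated copy of the hypothetical page, kept in memory.
page :: String
page = concat
    [ "<div class='comments'>"
    , "<div class='comment container'>"
    , "<span class='comment author'>Sally</span>"
    , "<div class='comment text'>Woo hoo!</div>"
    , "</div>"
    , "<div class='comment container'>"
    , "<span class='comment author'>Bill</span>"
    , "<img class='comment image' src='http://example.com/cat.gif' />"
    , "</div>"
    , "</div>"
    ]

-- Same structure as the comments scraper above: try a text comment
-- first and fall back to an image comment.
comments :: Scraper String [Comment]
comments = chroots ("div" @: [hasClass "container"]) (textComment <|> imageComment)
  where
    textComment = TextComment
        <$> text ("span" @: [hasClass "author"])
        <*> text ("div" @: [hasClass "text"])
    imageComment = ImageComment
        <$> text ("span" @: [hasClass "author"])
        <*> attr "src" ("img" @: [hasClass "image"])

-- Run the scraper against the in-memory page instead of a URL.
offlineComments :: Maybe [Comment]
offlineComments = scrapeStringLike page comments
```

Evaluating offlineComments in GHCi should yield one TextComment for Sally followed by one ImageComment for Bill.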

Tips & Tricks

The primitives provided by scalpel are intentionally minimal, on the assumption that users can build up complex functionality by combining them with functions that work on existing type classes (Monad, Applicative, Alternative, etc.).

This section gives examples of common tricks for building up more complex behavior from the simple primitives provided by this library.

OverloadedStrings

Selector, TagName and AttributeName are all IsString instances, and thus it is convenient to use scalpel with OverloadedStrings enabled. If not using OverloadedStrings, all tag names must be wrapped with tagSelector.
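For comparison, this is roughly what selectors look like with OverloadedStrings turned off. It is a sketch under one assumption: that the TagString and AttributeString constructors of TagName and AttributeName are exported alongside AnyTag and AnyAttribute, which is worth checking against the scalpel version in use. The names headingSelector, articleSelector and headingInArticle are made up.

```haskell
import Text.HTML.Scalpel

-- No OverloadedStrings here, so a plain String has to be converted
-- into a Selector explicitly.
headingSelector :: Selector
headingSelector = tagSelector "h1"

-- (@:) expects a TagName and (@=) an AttributeName, so the
-- constructors are spelled out (assumed to be exported).
articleSelector :: Selector
articleSelector = TagString "div" @: [AttributeString "id" @= "article"]

-- Text of the first h1 nested inside the article div.
headingInArticle :: Maybe String
headingInArticle =
    scrapeStringLike
        "<div id='article'><h1>Title</h1></div>"
        (text (articleSelector // headingSelector))
```

With the extension enabled, the same selector collapses to "div" @: ["id" @= "article"] // "h1".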

Matching Wildcards

Scalpel has 3 different wildcard values each corresponding to a distinct use case.

anySelector is used to match all tags:

textOfAllTags = texts anySelector

AnyTag is used when matching all tags with some attribute constraint. For example, to match all tags with the attribute class equal to "button":

textOfTagsWithClassButton = texts $ AnyTag @: [hasClass "button"]

AnyAttribute is used when matching tags with some arbitrary attribute equal to a particular value. For example, to match all tags with some attribute equal to "button":

textOfTagsWithAnAttributeWhoseValueIsButton = texts $ AnyTag @: [AnyAttribute @= "button"]
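The difference between the last two wildcards is easiest to see on a concrete fragment. The following sketch assumes a made-up snippet of markup in which one tag carries class "button" and another carries a different attribute with the same value; the names classButtons and anyAttrButtons are illustrative.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Text.HTML.Scalpel

-- Hypothetical fragment: "button" appears once as a class and once as
-- the value of an unrelated attribute.
sampleHtml :: String
sampleHtml =
    "<div><a class='button'>Save</a><span data-role='button'>Cancel</span><p>Help</p></div>"

-- Any tag, but the class attribute must contain "button".
classButtons :: Maybe [String]
classButtons = scrapeStringLike sampleHtml (texts (AnyTag @: [hasClass "button"]))

-- Any tag with any attribute whose value is exactly "button".
anyAttrButtons :: Maybe [String]
anyAttrButtons =
    scrapeStringLike sampleHtml (texts (AnyTag @: [AnyAttribute @= "button"]))
```

classButtons should contain only "Save", while anyAttrButtons should contain both "Save" and "Cancel".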

Complex Predicates

It is possible to run into scenarios where the name and attributes of a tag are not sufficient to isolate interesting tags and properties of child tags need to be considered.

In these cases the guard function of the Alternative type class can be combined with chroot and anySelector to implement predicates of arbitrary complexity.

Building off the above example, consider a use case where we would like to find the HTML contents of a comment that mentions the word "cat".

The strategy will be the following:

1. Isolate the comment div using chroot.
2. Then within the context of that div the textual contents can be retrieved with text anySelector. This works because the first tag within the current context is the div tag selected by chroot, and the anySelector selector will match the first tag within the current context.
3. Then the predicate that "cat" appears in the text of the comment will be enforced using guard. If the predicate fails, scalpel will backtrack and continue the search for divs until one is found that matches the predicate.
4. Return the desired HTML content of the comment div.

import Control.Monad (guard)
import Data.List (isInfixOf)

catComment :: Scraper String String
catComment =
    -- 1. First narrow the current context to the div containing the comment's
    --    textual content.
    chroot ("div" @: [hasClass "comment", hasClass "text"]) $ do
        -- 2. anySelector can be used to access the root tag of the current context.
        contents <- text anySelector
        -- 3. Skip comment divs that do not contain "cat".
        guard ("cat" `isInfixOf` contents)
        -- 4. Generate the desired value.
        html anySelector

For the full source of this example, see complex-predicates in the examples directory.
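The backtracking behaviour can be observed offline with a page where the first comment div fails the predicate. This is a sketch only: catComment is repeated so the snippet stands alone, and the page string and the name firstCatComment are assumptions.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Control.Monad (guard)
import Data.List (isInfixOf)
import Text.HTML.Scalpel

-- Same scraper as above, repeated so that this snippet is self-contained.
catComment :: Scraper String String
catComment =
    chroot ("div" @: [hasClass "comment", hasClass "text"]) $ do
        contents <- text anySelector
        guard ("cat" `isInfixOf` contents)
        html anySelector

-- Hypothetical page: the first comment does not mention "cat".
page :: String
page =
    "<div class='comment text'>Woo hoo!</div>"
        ++ "<div class='comment text'>I love my cat</div>"

-- The first div is rejected by guard, so the second one is returned.
firstCatComment :: Maybe String
firstCatComment = scrapeStringLike page catComment
```

firstCatComment should be a Just holding the markup of the second div; the exact attribute quoting in the rendered HTML depends on how TagSoup renders tags, so it is not spelled out here.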

Generalized Repetition

The pluralized versions of the primitive scrapers (texts, attrs, htmls) allow the user to extract content from all of the tags matching a given selector. For more complex scraping tasks it will at times be desirable to be able to extract multiple values from the same tag.

Like the previous example, the trick here is to use a combination of the chroots function and the anySelector selector.

Consider an extension to the original example where image comments may contain some alt text and the desire is to return a tuple of the alt text and the URLs of the images.

The strategy will be the following:

1. Isolate each img tag using chroots.
2. Then within the context of each img tag, use the anySelector selector to extract the alt and src attributes from the current tag.
3. Create and return a tuple of the extracted attributes.

altTextAndImages :: Scraper String [(String, URL)]
altTextAndImages =
    -- 1. First narrow the current context to each img tag.
    chroots "img" $ do
        -- 2. Use anySelector to access all the relevant content from the currently
        -- selected img tag.
        altText <- attr "alt" anySelector
        srcUrl  <- attr "src" anySelector
        -- 3. Combine the retrieved content into the desired final result.
        return (altText, srcUrl)

For the full source of this example, see generalized-repetition in the examples directory.
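One consequence of this pattern is that an img tag without an alt attribute makes the inner scraper fail, so that tag is left out of the result. If that is not wanted, a possible variation (a sketch, not part of the original example) is to fall back to an empty string with the Alternative operator. The names altTextAndImagesLenient, page and lenientResult are hypothetical.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Control.Applicative ((<|>))
import Text.HTML.Scalpel

-- Variation on altTextAndImages: a missing alt attribute becomes ""
-- instead of causing the whole img tag to be skipped.
altTextAndImagesLenient :: Scraper String [(String, URL)]
altTextAndImagesLenient =
    chroots "img" $ do
        altText <- attr "alt" anySelector <|> pure ""
        srcUrl  <- attr "src" anySelector
        return (altText, srcUrl)

-- Hypothetical page: the second image has no alt text.
page :: String
page = "<p><img src='a.png' alt='first' /><img src='b.png' /></p>"

lenientResult :: Maybe [(String, URL)]
lenientResult = scrapeStringLike page altTextAndImagesLenient
```

With this page, lenientResult should be Just [("first","a.png"),("","b.png")].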

Operating with other monads inside the Scraper

ScraperT is a monad transformer scraper: it allows lifting m a operations inside a ScraperT str m a with functions like:

-- Particularizes to 'm a -> ScraperT str m a'
lift :: (MonadTrans t, Monad m) => m a -> t m a

-- Particularizes to things like 'IO a -> ScraperT str IO a'
liftIO :: MonadIO m => IO a -> m a

Example: Perform HTTP requests on page images as you scrape:

1. Isolate images using chroots.
2. Within the context of an img tag, obtain the src attribute containing the location of the file.
3. Perform an IO operation to request metadata headers from the source.
4. Use the data to build and return more complex data.

-- Holds original link and data if it could be fetched
data Image = Image String (Maybe Metadata)
  deriving Show

-- Holds mime type and file size
data Metadata = Metadata String Int
  deriving Show

-- Scrape the page for images: get their metadata
scrapeImages :: URL -> ScraperT String IO [Image]
scrapeImages topUrl = do
    chroots "img" $ do
        source <- attr "src" "img"
        guard . not . null $ source
        -- getImageMeta is called via liftIO because ScraperT transforms over IO
        liftM (Image source) $ liftIO (getImageMeta topUrl source)

For the full source of this example, see downloading data.

For more documentation on monad transformers, see the Hackage page.
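Since getImageMeta is not shown here, the following sketch swaps the HTTP request for a harmless IO action so that the liftIO pattern can be run locally. It assumes scrapeStringLikeT is available for running a ScraperT against an in-memory string; the IORef log and the names loggedSources and runLoggedSources are stand-ins invented for this example.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Control.Monad.IO.Class (liftIO)
import Data.IORef (IORef, modifyIORef', newIORef, readIORef)
import Text.HTML.Scalpel

-- Record every src in a mutable log while scraping. The IORef update
-- stands in for a real side effect such as a metadata request.
loggedSources :: IORef [String] -> ScraperT String IO [String]
loggedSources logRef =
    chroots "img" $ do
        source <- attr "src" anySelector
        liftIO (modifyIORef' logRef (source :))
        return source

-- Run the transformer over a hypothetical two-image fragment and
-- return both the scraped result and what the IO action recorded.
runLoggedSources :: IO (Maybe [String], [String])
runLoggedSources = do
    logRef <- newIORef []
    result <- scrapeStringLikeT
        "<img src='a.png' /><img src='b.png' />"
        (loggedSources logRef)
    logged <- readIORef logRef
    return (result, logged)
```

The first component should hold both src values, and the log should have received one entry per image.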

scalpel-core

The scalpel package depends on 'http-client' and 'http-client-tls' to provide networking support. For projects with an existing HTTP client these dependencies may be unnecessary.

For these scenarios users can instead depend on scalpel-core which does not provide networking support and has minimal dependencies.
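A minimal sketch of that setup follows. It assumes the scalpel-core module is Text.HTML.Scalpel.Core and that the page body has already been fetched by whatever HTTP client the project uses; the function name titleOf is hypothetical.

```haskell
import Text.HTML.Scalpel.Core

-- The markup is assumed to have been downloaded elsewhere; this
-- function only parses it and extracts the text of the title tag.
titleOf :: String -> Maybe String
titleOf body = scrapeStringLike body (text (tagSelector "title"))
```

For example, titleOf "<html><head><title>Hi</title></head></html>" should give Just "Hi".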

Troubleshooting

My Scraping Target Doesn't Return The Markup I Expected

Some websites return different markup depending on the user agent sent along with the request. In some cases, this even means returning no markup at all in an effort to prevent scraping.

To work around this, you can add your own user agent string.

#!/usr/local/bin/stack
-- stack runghc --resolver lts-6.24 --install-ghc --package scalpel-0.6.0
{-# LANGUAGE NamedFieldPuns #-}
{-# LANGUAGE OverloadedStrings #-}

import Text.HTML.Scalpel
import qualified Network.HTTP.Client as HTTP
import qualified Network.HTTP.Client.TLS as HTTP
import qualified Network.HTTP.Types.Header as HTTP


-- Create a new manager settings based on the default TLS manager that updates
-- the request headers to include a custom user agent.
managerSettings :: HTTP.ManagerSettings
managerSettings = HTTP.tlsManagerSettings {
  HTTP.managerModifyRequest = \req -> do
    req' <- HTTP.managerModifyRequest HTTP.tlsManagerSettings req
    return $ req' {
      HTTP.requestHeaders = (HTTP.hUserAgent, "My Custom UA")
                          : HTTP.requestHeaders req'
    }
}

main = do
    manager <- Just <$> HTTP.newManager managerSettings
    html <- scrapeURLWithConfig (def { manager }) url $ htmls anySelector
    maybe printError printHtml html
  where
    url = "https://www.google.com"
    printError = putStrLn "Failed"
    printHtml = mapM_ putStrLn

A list of user agent strings can be found here.
