Open-source project: fimad/scalpel
Repository URL: https://github.com/fimad/scalpel
Language: Haskell (98.9%)

# Scalpel

Scalpel is a web scraping library inspired by libraries like Parsec and Perl's Web::Scraper. Scalpel builds on top of TagSoup to provide a declarative and monadic interface.

There are two general mechanisms provided by this library that are used to build web scrapers: Selectors and Scrapers.

## Selectors

Selectors describe a location within an HTML DOM tree. The simplest selector that can be written is a plain string value. For example, the selector `"div"` matches every single div node in a DOM tree.

Selectors can be combined with the `//` operator to describe nested relationships between tags. For example, the selector `"div" // "a"` matches all anchor tags nested (at any depth) within a div tag.
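This behavior can be checked without any network access. The quick sketch below uses scalpel's `scrapeStringLike` to run a scraper over an in-memory string instead of a live URL; the markup is invented for illustration, and `texts` is the plural text-extraction primitive covered below:

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Text.HTML.Scalpel

main :: IO ()
main = do
    let page = "<div><p><a href='#'>nested</a></p></div><a href='#'>top level</a>" :: String
    -- "a" matches every anchor tag in the document.
    print $ scrapeStringLike page (texts "a")            -- Just ["nested","top level"]
    -- "div" // "a" matches only anchors nested (at any depth) within a div.
    print $ scrapeStringLike page (texts ("div" // "a")) -- Just ["nested"]
```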
In addition to describing the nested relationships between tags, selectors can also include predicates on the attributes of a tag. The `@:` operator creates a selector that pairs a tag name with a list of predicates over the tag's attributes. For example, the selector `"div" @: [hasClass "container"]` matches div tags whose class attribute includes "container".

## Scrapers

Scrapers are values that are parameterized over a selector and produce a value from an HTML DOM tree. The `Scraper` type takes two type parameters: the first is the string-like type used to store the text values within the DOM tree, and the second is the type of value that the scraper produces.

There are several scraper primitives that take selectors and extract content from the DOM. Each primitive defined by this library comes in two variants: singular and plural. The singular variants extract the first instance matching the given selector, while the plural variants match every instance.
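As a quick illustration of the singular/plural split, the following sketch (again using `scrapeStringLike` on a made-up string) contrasts `text` with `texts`:

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Text.HTML.Scalpel

main :: IO ()
main = do
    let page = "<ul><li>one</li><li>two</li></ul>" :: String
    -- The singular variant extracts the first matching instance.
    print $ scrapeStringLike page (text "li")   -- Just "one"
    -- The plural variant extracts every matching instance.
    print $ scrapeStringLike page (texts "li")  -- Just ["one","two"]
```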
## Example

Complete examples can be found in the examples folder in the scalpel git repository.

The following example demonstrates most of the features provided by this library. Suppose you have the following hypothetical HTML located at `http://example.com/article.html` and you would like to extract a list of the comments:

```html
<html>
    <body>
        <div class='comments'>
            <div class='comment container'>
                <span class='comment author'>Sally</span>
                <div class='comment text'>Woo hoo!</div>
            </div>
            <div class='comment container'>
                <span class='comment author'>Bill</span>
                <img class='comment image' src='http://example.com/cat.gif' />
            </div>
            <div class='comment container'>
                <span class='comment author'>Susan</span>
                <div class='comment text'>WTF!?!</div>
            </div>
        </div>
    </body>
</html>
```

The following snippet defines a function, `allComments`, that will download the web page and extract all of the comments into a list:

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Control.Applicative ((<|>))
import Text.HTML.Scalpel

type Author = String
data Comment
    = TextComment Author String
    | ImageComment Author URL
    deriving (Show, Eq)
allComments :: IO (Maybe [Comment])
allComments = scrapeURL "http://example.com/article.html" comments
  where
    comments :: Scraper String [Comment]
    comments = chroots ("div" @: [hasClass "container"]) comment

    comment :: Scraper String Comment
    comment = textComment <|> imageComment

    textComment :: Scraper String Comment
    textComment = do
        author      <- text $ "span" @: [hasClass "author"]
        commentText <- text $ "div"  @: [hasClass "text"]
        return $ TextComment author commentText

    imageComment :: Scraper String Comment
    imageComment = do
        author   <- text $ "span" @: [hasClass "author"]
        imageURL <- attr "src" $ "img" @: [hasClass "image"]
        return $ ImageComment author imageURL
```

## Tips & Tricks

The primitives provided by scalpel are intentionally minimalistic, with the assumption that users will be able to build up complex functionality by combining them with functions that work on existing type classes (`Monad`, `Applicative`, `Alternative`, etc.).

This section gives examples of common tricks for building up more complex behavior from the simple primitives provided by this library.

### OverloadedStrings

`Selector`, `TagName` and `AttributeName` are all `IsString` instances, so it is convenient to use scalpel with the `OverloadedStrings` language extension enabled. Without `OverloadedStrings`, tag and attribute names must be wrapped explicitly, for example with `tagSelector`.
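For example, a minimal sketch without the extension: here a bare string literal is a `String` rather than a `Selector`, so it has to be wrapped with `tagSelector`:

```haskell
import Text.HTML.Scalpel

-- No OverloadedStrings here, so "a" must be converted into a Selector
-- explicitly with tagSelector.
links :: Scraper String [String]
links = texts (tagSelector "a")

main :: IO ()
main = print $ scrapeStringLike "<a>one</a><a>two</a>" links
```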
### Matching Wildcards

Scalpel has three different wildcard values, each corresponding to a distinct use case; the sketch after this list exercises them.

- `anySelector` matches all tags: `text anySelector`.
- `AnyTag` is used when matching all tags that satisfy some attribute constraint. For example, to match all tags with the class "button": `text $ AnyTag @: [hasClass "button"]`.
- `AnyAttribute` is used when matching tags where any attribute has a given value. For example, to match all tags with any attribute equal to "button": `text $ AnyTag @: [AnyAttribute @= "button"]`.
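The following sketch runs these wildcards against an in-memory string (markup invented for illustration):

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Text.HTML.Scalpel

main :: IO ()
main = do
    let page = "<span class='button'>Go</span><a rel='link'>Stop</a>" :: String
    -- Match any tag carrying the class "button".
    print $ scrapeStringLike page (texts (AnyTag @: [hasClass "button"]))
    -- Just ["Go"]
    -- Match any tag where *some* attribute has the value "button"; the span
    -- qualifies because its class attribute equals "button".
    print $ scrapeStringLike page (texts (AnyTag @: [AnyAttribute @= "button"]))
    -- Just ["Go"]
```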
### Complex Predicates

It is possible to run into scenarios where the name and attributes of a tag are not sufficient to isolate interesting tags, and properties of child tags need to be considered. In these cases the `chroot` and `anySelector` primitives can be combined to implement predicates of arbitrary complexity.

Building off the above example, consider a use case where we would like to find the HTML contents of a comment that mentions the word "cat". The strategy, mirrored by the numbered comments in the snippet below, will be the following:

1. Narrow the current context to the div containing the comment's textual content using `chroot`.
2. Use `anySelector` to access the root tag of the current context.
3. Use `guard` to skip comment divs that do not contain "cat".
4. Generate the desired value with `html anySelector`.

```haskell
catComment :: Scraper String String
catComment =
    -- 1. First narrow the current context to the div containing the comment's
    --    textual content.
    chroot ("div" @: [hasClass "comment", hasClass "text"]) $ do
        -- 2. anySelector can be used to access the root tag of the current
        --    context.
        contents <- text anySelector
        -- 3. Skip comment divs that do not contain "cat".
        guard ("cat" `isInfixOf` contents)
        -- 4. Generate the desired value.
        html anySelector
```

For the full source of this example, see complex-predicates in the examples directory.

### Generalized Repetition

The pluralized versions of the primitive scrapers (`texts`, `attrs`, `htmls`) allow the context of a selector to be repeated multiple times, mapping a single primitive scraper over each instance. But what if you want to run an arbitrary scraper over each instance matched by a selector? Like the previous example, the trick here is to use a combination of the `chroots` function and the `anySelector` value.
Consider an extension to the original example where image comments may contain some alt text, and the desire is to return a tuple of the alt text and the URL of each image. The strategy, mirrored by the numbered comments in the snippet below, will be the following:

1. Use `chroots` to narrow the current context to each img tag.
2. Use `anySelector` to access the relevant content of the currently selected img tag.
3. Combine the retrieved content into the desired final result.

```haskell
altTextAndImages :: Scraper String [(String, URL)]
altTextAndImages =
    -- 1. First narrow the current context to each img tag.
    chroots "img" $ do
        -- 2. Use anySelector to access all the relevant content from the
        --    currently selected img tag.
        altText <- attr "alt" anySelector
        srcUrl  <- attr "src" anySelector
        -- 3. Combine the retrieved content into the desired final result.
        return (altText, srcUrl)
```

For the full source of this example, see generalized-repetition in the examples directory.

### Operating with other monads inside the Scraper

`ScraperT` is a monad transformer version of `Scraper`: it allows lifting operations in an underlying monad `m` into a `ScraperT str m a` with functions like:

```haskell
-- Particularizes to 'm a -> ScraperT str m a'
lift :: (MonadTrans t, Monad m) => m a -> t m a
-- Particularizes to things like 'IO a -> ScraperT str IO a'
liftIO :: MonadIO m => IO a -> m a
```
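For instance, here is a minimal sketch (using `scrapeStringLikeT`, the `ScraperT` analogue of `scrapeStringLike`) that logs each link to stdout as it is scraped:

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Control.Monad.IO.Class (liftIO)
import Text.HTML.Scalpel

-- Collect hrefs, printing each one as a side effect in IO.
loggedLinks :: ScraperT String IO [String]
loggedLinks = chroots "a" $ do
    link <- attr "href" anySelector
    liftIO $ putStrLn ("found link: " ++ link)
    return link

main :: IO ()
main = do
    result <- scrapeStringLikeT "<a href='/one'>1</a><a href='/two'>2</a>" loggedLinks
    print result  -- Just ["/one","/two"]
```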
Example: perform HTTP requests on page images as you scrape:

```haskell
-- Holds original link and data if it could be fetched
data Image = Image String (Maybe Metadata)
    deriving Show

-- Holds mime type and file size
data Metadata = Metadata String Int
    deriving Show
-- Scrape the page for images: get their metadata
scrapeImages :: URL -> ScraperT String IO [Image]
scrapeImages topUrl = do
    chroots "img" $ do
        source <- attr "src" "img"
        guard . not . null $ source
        -- getImageMeta is called via liftIO because ScraperT transforms over IO
        liftM (Image source) $ liftIO (getImageMeta topUrl source)
```

For the full source of this example, see downloading data in the examples directory.

For more documentation on monad transformers, see the hackage page.

## scalpel-core

The scalpel package depends on http-client and http-client-tls to provide networking support. For projects that only need to scrape raw HTML, this may be unwanted. For these scenarios users can instead depend on scalpel-core, which does not provide networking support and has minimal dependencies.

## Troubleshooting

### My Scraping Target Doesn't Return The Markup I Expected

Some websites return different markup depending on the user agent sent along with the request. In some cases, this even means returning no markup at all in an effort to prevent scraping. To work around this, you can add your own user agent string:
```haskell
#!/usr/local/bin/stack
-- stack runghc --resolver lts-6.24 --install-ghc --package scalpel-0.6.0
{-# LANGUAGE NamedFieldPuns #-}
{-# LANGUAGE OverloadedStrings #-}
import Text.HTML.Scalpel
import qualified Network.HTTP.Client as HTTP
import qualified Network.HTTP.Client.TLS as HTTP
import qualified Network.HTTP.Types.Header as HTTP
-- Create a new manager settings based on the default TLS manager that updates
-- the request headers to include a custom user agent.
managerSettings :: HTTP.ManagerSettings
managerSettings = HTTP.tlsManagerSettings {
    HTTP.managerModifyRequest = \req -> do
        req' <- HTTP.managerModifyRequest HTTP.tlsManagerSettings req
        return $ req' {
            HTTP.requestHeaders = (HTTP.hUserAgent, "My Custom UA")
                                : HTTP.requestHeaders req'
        }
    }
main = do
    manager <- Just <$> HTTP.newManager managerSettings
    html <- scrapeURLWithConfig (def { manager }) url $ htmls anySelector
    maybe printError printHtml html
  where
    url = "https://www.google.com"
    printError = putStrLn "Failed"
    printHtml = mapM_ putStrLn
```

A list of user agent strings can be found here.