
kohn/HttpProxyMiddleware: A middleware for scrapy. Used to change HTTP proxy fro ...


Project name:

kohn/HttpProxyMiddleware

Project URL:

https://github.com/kohn/HttpProxyMiddleware

Programming language:

Python 100.0%

Project description:

HttpProxyMiddleware

A middleware for scrapy. Used to change HTTP proxy from time to time.

The initial proxies are stored in a file. At runtime, the middleware fetches new proxies whenever it finds that too few valid proxies remain.
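The refill behaviour described above could be sketched roughly as follows. This is not the middleware's actual code: the `ProxyPool` class, the `min_valid` threshold, the `fetch_new` callback, and the one-proxy-per-line file format are all assumptions made for illustration.

```python
from typing import Callable, Iterable, List


def load_proxies(lines: Iterable[str]) -> List[str]:
    """Parse one proxy URL per line, skipping blanks and '#' comments.

    The file format is an assumption; the real proxy file may differ.
    """
    stripped = (line.strip() for line in lines)
    return [s for s in stripped if s and not s.startswith("#")]


class ProxyPool:
    """Hypothetical pool that refills itself when valid proxies run low."""

    def __init__(self, initial: Iterable[str],
                 fetch_new: Callable[[], Iterable[str]],
                 min_valid: int = 2) -> None:
        # dict.fromkeys removes duplicates while keeping the original order.
        self.proxies: List[str] = list(dict.fromkeys(initial))
        self.fetch_new = fetch_new    # stand-in for a free-proxy fetcher
        self.min_valid = min_valid    # assumed refill threshold

    def discard(self, proxy: str) -> None:
        """Drop a proxy judged invalid, then refill if too few remain."""
        if proxy in self.proxies:
            self.proxies.remove(proxy)
        if len(self.proxies) < self.min_valid:
            for candidate in self.fetch_new():
                if candidate not in self.proxies:
                    self.proxies.append(candidate)
```

In this sketch, `fetch_new` plays the role that a script such as fetch_free_proxyes.py would play in the real project.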

Related blog: http://www.kohn.com.cn/wordpress/?p=208

fetch_free_proxyes.py

Fetches free proxies from the Internet. You can modify it yourself to use other proxy sources.

Usage

settings.py

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 351,
    # put this middleware after RetryMiddleware
    'crawler.middleware.HttpProxyMiddleware': 999,
}

DOWNLOAD_TIMEOUT = 10           # in practice, 10-15 seconds is a reasonable timeout

change proxy

You will often want to switch to a new proxy when your spider gets banned. Detect that your IP has been banned, then yield a new Request from your Spider.parse method with:

request.meta["change_proxy"] = True

Some proxies may return invalid HTML. If you get any exception while parsing a response, yield a new request in the same way, with:

request.meta["change_proxy"] = True
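A minimal sketch of the two steps above (recognising a ban, then flagging the retry) might look like this. Only the `change_proxy` meta key comes from this README; `BAN_MARKER`, `looks_banned`, `meta_with_new_proxy`, and the 403 check are hypothetical and should be adapted to the site you crawl.

```python
from typing import Any, Dict

# Hypothetical marker; replace it with text that the target site's ban page contains.
BAN_MARKER = "access denied"


def looks_banned(status: int, body: str) -> bool:
    """Guess whether a response is a ban page (illustrative heuristic only)."""
    return status == 403 or BAN_MARKER in body.lower()


def meta_with_new_proxy(meta: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of a request's meta with the change_proxy flag set."""
    new_meta = dict(meta)              # copy so the original meta is untouched
    new_meta["change_proxy"] = True    # the flag this middleware looks for
    return new_meta
```

Inside Spider.parse, one way to use these helpers would be `yield response.request.replace(meta=meta_with_new_proxy(response.meta), dont_filter=True)` whenever `looks_banned(response.status, response.text)` is true or parsing raises an exception. Passing `dont_filter=True` keeps Scrapy's duplicate filter from dropping the repeated URL.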

spider.py

Your spider should specify a list of the status codes it may encounter while crawling. Any status code that is neither 200 nor in this list is treated as the result of an invalid proxy, and that proxy is discarded. For example:

website_possible_httpstatus_list = [404]

This line tells the middleware that the website you're crawling may legitimately return a response with status code 404, so the proxy used for that request should not be discarded.
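The rule described above can be written as a small predicate. This is a sketch of the stated rule only, not the middleware's own code; the function name `proxy_should_be_discarded` is made up for illustration.

```python
from typing import Iterable


def proxy_should_be_discarded(status: int, possible_statuses: Iterable[int]) -> bool:
    """Apply the rule: discard unless the status is 200 or explicitly listed."""
    return status != 200 and status not in set(possible_statuses)


# Mirrors the spider attribute shown above.
website_possible_httpstatus_list = [404]

for code in (200, 404, 503):
    verdict = proxy_should_be_discarded(code, website_possible_httpstatus_list)
    print(code, "discard proxy" if verdict else "keep proxy")
```

With the list set to `[404]`, a 503 response counts against the proxy, while 200 and 404 do not.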

Test

Update HttpProxyMiddleware.py path in HttpProxyMiddlewareTest/settings.py.

cd HttpProxyMiddlewareTest
scrapy crawl test

The testing server is hosted on my VPS, so take it easy… DO NOT waste too much of my data plan.

You may start your own testing server using IPBanTest which is powered by Django.



