Open-source software name: hu17889/go_spider
Open-source software address: https://github.com/hu17889/go_spider
Open-source programming language: Go 100.0%
Open-source software introduction:

go_spider

A crawler for vertical communities, written in Go. Latest stable release: Version 1.2 (Sep 23, 2014).

Features
Requirements
Documentation

Installation

This project is based on simplejson and goquery. If you are in China, you can download the packages from http://gopm.io/.

Use example

Here is an example that crawls GitHub content. You can use it to try out the crawl process.
More examples here: examples.

Make your spider

    // Spider input:
    //   PageProcesser;
    //   Task name used in Pipeline for record;
    spider.NewSpider(NewMyPageProcesser(), "TaskName").
        AddUrl("https://github.com/hu17889?tab=repositories", "html"). // Start URL; "html" is the response type ("html" or "json")
        AddPipeline(pipeline.NewPipelineConsole()).                    // Print results to the screen
        SetThreadnum(3).                                               // Crawl requests with three goroutines
        Run()
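
The SetThreadnum(3) call above limits the crawl to three concurrent goroutines. As a minimal sketch of that idea, independent of go_spider, a fixed pool of workers can drain a channel of URLs. The names crawlAll and fetch are hypothetical and introduced only for illustration; this is not how the library itself is implemented.

```go
package main

import (
	"fmt"
	"sync"
)

// crawlAll runs fetch over urls using exactly `workers` goroutines.
// fetch is a hypothetical stand-in for a real download step.
func crawlAll(urls []string, workers int, fetch func(string) string) map[string]string {
	jobs := make(chan string)
	results := make(map[string]string)
	var mu sync.Mutex
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Each worker pulls URLs until the channel is closed.
			for u := range jobs {
				body := fetch(u)
				mu.Lock()
				results[u] = body
				mu.Unlock()
			}
		}()
	}

	for _, u := range urls {
		jobs <- u
	}
	close(jobs)
	wg.Wait()
	return results
}

func main() {
	urls := []string{"a", "b", "c", "d"}
	out := crawlAll(urls, 3, func(u string) string { return "body of " + u })
	fmt.Println(len(out), "pages fetched")
}
```

The unbuffered channel means a URL is only handed out when a worker is free, so no more than three fetches are in flight at once.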
Just copy the default modules and modify them!

If you make your own Downloader, Pipeline, or Scheduler module, you can plug it into your spider.

Extensions

The extensions folder includes modules and other tools that people have shared. You can push your code once it is free of bugs.

Modules

Spider

Summary: Crawler initialization, concurrency management, default modules, module management, and config settings.

Functions:

Downloader

Summary: The Spider takes a Request from the Scheduler containing the URL to crawl. The Downloader then downloads the result (html, json, jsonp, or text) of that Request. The result is saved in a Page for parsing by the PageProcesser. HTML parsing is based on the goquery package. JSON parsing is based on the simplejson package. JSONP is converted to JSON. The text form is plain text content that is not parsed.

Functions:
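
To illustrate the "JSONP is converted to JSON" step mentioned in the Downloader summary, here is a minimal sketch that strips a callback wrapper before parsing. The function name stripJSONP is hypothetical, and the standard encoding/json package is used here instead of simplejson to keep the example self-contained; go_spider's own conversion may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// stripJSONP removes a callback wrapper such as `cb({...});` and returns
// the inner JSON text. Input without a wrapper is returned with surrounding
// whitespace (and any trailing semicolon) trimmed.
func stripJSONP(body string) string {
	s := strings.TrimSpace(body)
	s = strings.TrimSpace(strings.TrimSuffix(s, ";"))
	start := strings.Index(s, "(")
	if start < 0 || !strings.HasSuffix(s, ")") {
		return s // not wrapped in a callback
	}
	return strings.TrimSpace(s[start+1 : len(s)-1])
}

func main() {
	raw := stripJSONP(`callback({"name":"go_spider","stars":3});`)

	var v map[string]interface{}
	if err := json.Unmarshal([]byte(raw), &v); err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(v["name"])
}
```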

PageProcesser

Summary: The PageProcesser module only parses results. It extracts results (key-value pairs) and the URLs to crawl in the next step. The key-value pairs are saved in PageItems, and the URLs are pushed to the Scheduler.

Functions:
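
The division of labor described in the PageProcesser summary (parse only, then hand off key-value pairs and next URLs) can be sketched as follows. The Page struct and process function are hypothetical simplifications, and regular expressions stand in for goquery so the example needs only the standard library; real HTML should be parsed with a proper parser.

```go
package main

import (
	"fmt"
	"regexp"
)

// Page is a hypothetical, pared-down page: the downloaded body plus
// whatever the processor extracted from it.
type Page struct {
	Body    string
	Items   map[string]string // key-value results, destined for a pipeline
	Targets []string          // URLs to crawl next, destined for a scheduler
}

var (
	titleRe = regexp.MustCompile(`<title>([^<]*)</title>`)
	hrefRe  = regexp.MustCompile(`href="([^"]+)"`)
)

// process only parses: it fills p.Items and p.Targets and performs no I/O.
func process(p *Page) {
	if p.Items == nil {
		p.Items = make(map[string]string)
	}
	if m := titleRe.FindStringSubmatch(p.Body); m != nil {
		p.Items["title"] = m[1]
	}
	for _, m := range hrefRe.FindAllStringSubmatch(p.Body, -1) {
		p.Targets = append(p.Targets, m[1])
	}
}

func main() {
	p := &Page{Body: `<title>Repos</title><a href="/a">A</a><a href="/b">B</a>`}
	process(p)
	fmt.Println(p.Items["title"], p.Targets)
}
```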

Page

Summary: Saves information about the request.

Functions:

Scheduler

Summary: The Scheduler module is a Request queue. URLs parsed in the PageProcesser are pushed into the queue.

Functions:
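
A Request queue of the kind the Scheduler summary describes can be sketched as a mutex-protected FIFO. The type and method names below (Request, QueueScheduler, Push, Poll, Count) are assumptions chosen for illustration and are not taken from go_spider's source.

```go
package main

import (
	"fmt"
	"sync"
)

// Request is a hypothetical, pared-down request: a URL and a response type.
type Request struct {
	URL      string
	RespType string
}

// QueueScheduler is a FIFO queue of requests that is safe for use
// from multiple goroutines.
type QueueScheduler struct {
	mu    sync.Mutex
	queue []*Request
}

// Push appends a request to the back of the queue.
func (s *QueueScheduler) Push(r *Request) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.queue = append(s.queue, r)
}

// Poll removes and returns the request at the front, or nil if the queue is empty.
func (s *QueueScheduler) Poll() *Request {
	s.mu.Lock()
	defer s.mu.Unlock()
	if len(s.queue) == 0 {
		return nil
	}
	r := s.queue[0]
	s.queue = s.queue[1:]
	return r
}

// Count reports how many requests are waiting.
func (s *QueueScheduler) Count() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return len(s.queue)
}

func main() {
	s := &QueueScheduler{}
	s.Push(&Request{URL: "https://github.com/hu17889?tab=repositories", RespType: "html"})
	s.Push(&Request{URL: "https://github.com/hu17889/go_spider", RespType: "html"})
	for r := s.Poll(); r != nil; r = s.Poll() {
		fmt.Println(r.URL, "remaining:", s.Count())
	}
}
```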

Pipeline

Summary: The Pipeline module outputs the results and saves them wherever you want. The default modules are PipelineConsole (output to stdout) and PipelineFile (output to file).

Functions:
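
The idea of one output interface with console and file variants can be sketched with an io.Writer. The Pipeline interface, WriterPipeline type, and render helper below are hypothetical and do not reproduce the signatures of PipelineConsole or PipelineFile.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"sort"
	"strings"
)

// Pipeline is a hypothetical output stage that receives key-value results.
type Pipeline interface {
	Process(items map[string]string) error
}

// WriterPipeline writes each pair as a "key: value" line to W,
// in sorted key order so the output is deterministic.
type WriterPipeline struct {
	W io.Writer
}

func (p WriterPipeline) Process(items map[string]string) error {
	keys := make([]string, 0, len(items))
	for k := range items {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		if _, err := fmt.Fprintf(p.W, "%s: %s\n", k, items[k]); err != nil {
			return err
		}
	}
	return nil
}

// render runs a WriterPipeline against an in-memory buffer and returns the text.
func render(items map[string]string) (string, error) {
	var sb strings.Builder
	err := WriterPipeline{W: &sb}.Process(items)
	return sb.String(), err
}

func main() {
	items := map[string]string{"repo": "go_spider", "owner": "hu17889"}

	// Console variant: write to stdout.
	var console Pipeline = WriterPipeline{W: os.Stdout}
	if err := console.Process(items); err != nil {
		fmt.Println("pipeline error:", err)
	}
}
```

Passing an *os.File obtained from os.Create instead of os.Stdout gives the file variant without changing the Process code.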

Request

Summary: The Request module holds the configuration for an HTTP request, such as the URL, headers, and cookies.

Functions:
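
For orientation, the same three pieces of configuration (URL, header, cookie) look like this with Go's standard net/http package. This sketch shows only the standard library; buildRequest and its parameter values are hypothetical, and go_spider's own Request type has its own constructor and fields.

```go
package main

import (
	"fmt"
	"net/http"
)

// buildRequest assembles a GET request carrying a custom User-Agent header
// and a single cookie. It does not send the request.
func buildRequest(url, userAgent, cookieName, cookieValue string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("User-Agent", userAgent)
	req.AddCookie(&http.Cookie{Name: cookieName, Value: cookieValue})
	return req, nil
}

func main() {
	req, err := buildRequest(
		"https://github.com/hu17889?tab=repositories",
		"example-crawler/0.1", // hypothetical user agent
		"session", "abc123",   // hypothetical cookie
	)
	if err != nil {
		fmt.Println("build error:", err)
		return
	}
	fmt.Println(req.Method, req.URL.String())
	fmt.Println("User-Agent:", req.Header.Get("User-Agent"))
	fmt.Println("Cookie:", req.Header.Get("Cookie"))
}
```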

License

go_spider is licensed under the Mozilla Public License Version 2.0.

Mozilla summarizes the license scope as follows:

That means:

Please read the MPL 2.0 FAQ if you have further questions regarding the license. You can read the full terms here: LICENSE.