Project name: bda-research/node-crawler
Project URL: https://github.com/bda-research/node-crawler
Primary language: JavaScript (75.3%)

Introduction:

Most powerful, popular and production crawling/scraping package for Node, happy hacking :)

Features:
Here is the CHANGELOG.

Thanks to Authuir, we have Chinese docs. Other languages are welcome!

Table of Contents
Get started

Install

$ npm install crawler

Basic usage

const Crawler = require('crawler');
const c = new Crawler({
maxConnections: 10,
// This will be called for each crawled page
callback: (error, res, done) => {
if (error) {
console.log(error);
} else {
const $ = res.$;
// $ is Cheerio by default
//a lean implementation of core jQuery designed specifically for the server
console.log($('title').text());
}
done();
}
});
// Queue just one URL, with default callback
c.queue('http://www.amazon.com');
// Queue a list of URLs
c.queue(['http://www.google.com/','http://www.yahoo.com']);
// Queue URLs with custom callbacks & parameters
c.queue([{
uri: 'http://parishackers.org/',
jQuery: false,
// The global callback won't be called
callback: (error, res, done) => {
if (error) {
console.log(error);
} else {
console.log('Grabbed', res.body.length, 'bytes');
}
done();
}
}]);
// Queue some HTML code directly without grabbing (mostly for tests)
c.queue([{
html: '<p>This is a <strong>test</strong></p>'
}]);

Slow down

Use rateLimit to slow down when you are visiting web sites.

const Crawler = require('crawler');
const c = new Crawler({
rateLimit: 1000, // `maxConnections` will be forced to 1
callback: (err, res, done) => {
console.log(res.$('title').text());
done();
}
});
c.queue(tasks); // between any two tasks, the minimum time gap is 1000 ms

Custom parameters

Sometimes you have to access variables from a previous request/response session; to do so, just pass them in the same way as the other options:

c.queue({
uri: 'http://www.google.com',
parameter1: 'value1',
parameter2: 'value2',
parameter3: 'value3'
});

Then access them in the callback via:

console.log(res.options.parameter1);

Crawler picks out only the options needed by request, so don't worry about the redundancy.
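For illustration, here is a minimal end-to-end sketch of the round trip; the URL and the parameter name are placeholders:

const Crawler = require('crawler');

const c = new Crawler({
    callback: (error, res, done) => {
        if (error) {
            console.log(error);
        } else {
            // the custom parameter travels with the task and comes back on res.options
            console.log(res.options.parameter1, '->', res.$('title').text());
        }
        done();
    }
});

c.queue({
    uri: 'http://www.google.com',
    parameter1: 'value1'
});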
Raw body

If you are downloading files like images, PDFs, Word documents, etc., you have to save the raw response body, which means Crawler shouldn't convert it to a string. To make that happen, set encoding to null.

const Crawler = require('crawler');
const fs = require('fs');
const c = new Crawler({
encoding: null,
jQuery: false,// set false to suppress warning message.
callback: (err, res, done) => {
if (err) {
console.error(err.stack);
} else {
fs.createWriteStream(res.options.filename).write(res.body);
}
done();
}
});
c.queue({
uri: 'https://nodejs.org/static/images/logos/nodejs-1920x1200.png',
filename: 'nodejs-1920x1200.png'
});

preRequest

If you want to do something either synchronously or asynchronously before each request, you can try the code below. Note that direct requests won't trigger preRequest.

const c = new Crawler({
preRequest: (options, done) => {
// 'options' here is not the 'options' you pass to 'c.queue'; instead, it's the options object that is going to be passed to the 'request' module
console.log(options);
// when done is called, the request will start
done();
},
callback: (err, res, done) => {
if (err) {
console.log(err);
} else {
console.log(res.statusCode);
}
}
});
c.queue({
uri: 'http://www.google.com',
// this will override the 'preRequest' defined in crawler
preRequest: (options, done) => {
setTimeout(() => {
console.log(options);
done();
}, 1000);
}
});

Advanced

Send request directly

In case you want to send a request directly without going through the scheduler in Crawler, try the code below.

crawler.direct({
uri: 'http://www.google.com',
skipEventRequest: false, // defaults to true; direct requests won't trigger the 'request' event
callback: (error, response) => {
if (error) {
console.log(error)
} else {
console.log(response.statusCode);
}
}
});

Work with Http2

Node-crawler now supports HTTP/2 requests. Proxy functionality for HTTP/2 requests is not included yet; it will be added in the future.

crawler.queue({
// the unit tests work against the httpbin HTTP/2 server, which can also be used for testing here
uri: 'https://nghttp2.org/httpbin/status/200',
method: 'GET',
http2: true, // setting http2 to true makes the request go over HTTP/2
callback: (error, response, done) => {
if (error) {
console.error(error);
return done();
}
console.log(`inside callback`);
console.log(response.body);
return done();
}
});

Work with bottleneck

Control the rate limit with limiters. All tasks submitted to a limiter will abide by its rateLimit and maxConnections restrictions.

const Crawler = require('crawler');
const c = new Crawler({
rateLimit: 2000,
maxConnections: 1,
callback: (error, res, done) => {
if (error) {
console.log(error);
} else {
const $ = res.$;
console.log($('title').text());
}
done();
}
});
// if you want to crawl some website with a 2000ms gap between requests
c.queue('http://www.somewebsite.com/page/1');
c.queue('http://www.somewebsite.com/page/2');
c.queue('http://www.somewebsite.com/page/3');
// if you want to crawl some website through proxies, with a 2000ms gap between requests for each proxy
c.queue({
uri:'http://www.somewebsite.com/page/1',
limiter:'proxy_1',
proxy:'proxy_1'
});
c.queue({
uri:'http://www.somewebsite.com/page/2',
limiter:'proxy_2',
proxy:'proxy_2'
});
c.queue({
uri:'http://www.somewebsite.com/page/3',
limiter:'proxy_3',
proxy:'proxy_3'
});
c.queue({
uri:'http://www.somewebsite.com/page/4',
limiter:'proxy_1',
proxy:'proxy_1'
});

Normally, all limiter instances in the limiter cluster of a crawler are instantiated with the options specified in the crawler constructor. You can change a property of any limiter with the call below. Currently, only the 'rateLimit' property of a limiter can be changed. Note that the default limiter can be accessed by the name 'default'.

const c = new Crawler({});
c.setLimiterProperty('limiterName', 'propertyName', value);
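For example, to slow the 'proxy_1' limiter from the proxy example above down to a five-second gap (the limiter name and the new value are purely illustrative):

c.setLimiterProperty('proxy_1', 'rateLimit', 5000);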
Class: Crawler

Event: 'schedule'

Emitted when a task is being added to the scheduler.

crawler.on('schedule', (options) => {
options.proxy = 'http://proxy:port';
});

Event: 'limiterChange'

Emitted when the limiter has been changed.
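A listener sketch; the handler arguments shown (the task options and the name of the new limiter) are an assumption based on the event's description, not confirmed by this document:

crawler.on('limiterChange', (options, limiter) => {
    // assumed arguments: the task options plus the name of the limiter now in charge
    console.log('limiter changed to', limiter, 'for', options.uri);
});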
Event: 'request'

Emitted when the crawler is ready to send a request. If you are going to modify options at the last stage before requesting, just listen on this event.

crawler.on('request', (options) => {
options.qs.timestamp = new Date().getTime();
});

Event: 'drain'

Emitted when the queue is empty.

crawler.on('drain', () => {
// For example, release a connection to database.
db.end();// close connection to MySQL
});

crawler.queue(uri|options)

Enqueue a task and wait for it to be executed.

crawler.queueSize

Size of the queue, read-only.
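A quick sketch of how queueSize behaves around queue() and the 'drain' event; the URLs are placeholders:

const Crawler = require('crawler');

const c = new Crawler({
    callback: (error, res, done) => done()
});

c.queue(['http://example.com/a', 'http://example.com/b']);
console.log(c.queueSize); // 2: both tasks are still scheduled

c.on('drain', () => {
    console.log(c.queueSize); // 0: every task has finished
});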
Options reference

You can pass these options to the Crawler() constructor if you want them to be global, or as items in the queue() calls if you want them to be specific to that item (overriding global options).

This options list is a strict superset of mikeal's request options and is directly passed to the request() method.

Basic request options
Callbacks
Schedule options
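As noted above, a per-item option overrides the global one; a minimal sketch (the URL is a placeholder):

const c = new Crawler({
    jQuery: true, // global default: parse every page with Cheerio
    callback: (error, res, done) => {
        console.log(res.options.uri);
        done();
    }
});

// this item overrides the global jQuery option for itself only
c.queue({
    uri: 'http://example.com/raw',
    jQuery: false
});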