The shouldCrawl event belongs to js-crawler's config object, and the callback function assigned to it must return a boolean that tells the crawler whether or not to crawl the URL it receives as an argument.
I'm using axios with the HEAD method to retrieve the resource's headers. I want shouldCrawl to return true only when the content-type contains text/html, to prevent the crawler from downloading files and other garbage.
My code:
this.crawler = new Crawler().configure({
  shouldCrawl: async (sUrl) => {
    const crawlWhenHtml = async () => { //return false;
      return axios({
        url: sUrl,
        method: 'head'
      }).then(res => {
        return (res.headers['content-type'].indexOf('text/html') >= 0 ? true : false);
      }).catch(error => {
        return false;
      });
    }
    return await crawlWhenHtml();
  }
});
I can't get shouldCrawl and crawlWhenHtml in sync.
If I make the callback return false (see the commented-out statement), shouldCrawl ignores it and crawls the URL anyway. This started happening when I made the callback async.
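Here is a minimal sketch of what I suspect is going on. It assumes the crawler calls shouldCrawl synchronously and only checks whether the return value is truthy, which I have not confirmed in js-crawler's source. The checkSync helper is hypothetical and only stands in for such a caller. An async function always returns a Promise, and a Promise object is truthy even when it will later resolve to false:

```javascript
// Hypothetical stand-in for a caller that expects a synchronous boolean.
// This is NOT js-crawler code, only an illustration of the assumption above.
function checkSync(predicate, url) {
  return predicate(url) ? 'crawl' : 'skip';
}

// A plain callback returns the boolean itself.
const syncFalse = (url) => false;

// An async callback returns a Promise that resolves to false later.
const asyncFalse = async (url) => false;

const syncDecision = checkSync(syncFalse, 'http://example.com/');
const asyncDecision = checkSync(asyncFalse, 'http://example.com/');
const returnsPromise = asyncFalse('http://example.com/') instanceof Promise;

console.log(syncDecision);   // skip
console.log(asyncDecision);  // crawl
console.log(returnsPromise); // true
```

If that assumption holds, the false inside the async callback never reaches the crawler as a boolean, which would match the behaviour I see.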
But without making it async, I can't wait for axios to complete the request before returning a boolean to shouldCrawl.
How can I unravel this?
question from:
https://stackoverflow.com/questions/65848240/crawlers-shouldcrawl-event-requires-boolean-returned-from-axios-async-functio