Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

javascript - How to deal with the captcha when doing Web Scraping in Puppeteer?

I'm using Puppeteer for web scraping, and I have just noticed that the website I'm trying to scrape sometimes asks for a captcha because of the number of visits I'm making from my computer. The captcha form looks like this one:

[Image: captcha]

So I need help with how to handle this. I have been thinking about sending the captcha form to the client side, since I use Express and EJS to send the values to my index page, but I don't know if Puppeteer can send something like that.

Any ideas?


1 Reply

This is reCAPTCHA (version 2). It is shown to you because the owner of the page does not want you to crawl the page automatically.

Your options are the following:

Option 1: Stop crawling or try to use an official API

As the owner of the page does not want you to crawl that page, you could simply respect that decision and stop crawling. Maybe there is a documented API that you can use.

Option 2: Automate/Outsource the captcha solving

There is an entire industry which has people (often in developing countries) solving captchas for other people's bots. I will not link to any particular site, but you can check out the other answer from Md. Abu Taher for more information on the topic or search for "captcha solver".

Option 3: Solve the captcha yourself

For this, let me explain how reCAPTCHA works and what happens when you visit a page using it.


How reCAPTCHA (v2) works

Each site that uses reCAPTCHA has an ID (the site key), which you can find by looking at the page source, for example:

<div class="g-recaptcha form-field" data-sitekey="ID_OF_THE_WEBSITE_LONG_RANDOM_STRING"></div>

When the reCAPTCHA code loads, it adds an empty response textarea to the form. It looks like this:

<textarea id="g-recaptcha-response" name="g-recaptcha-response" class="g-recaptcha-response" style="... display: none;"></textarea>

After you solve the challenge, reCAPTCHA writes a very long string into this textarea. When the form is submitted, that string is sent along with it and can then be checked by the server (via the reCAPTCHA service) in the backend.
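To make the mechanism above concrete, here is a minimal sketch that inspects a page for the two elements just described. It assumes `page` is a Puppeteer `Page`; the function name `inspectRecaptcha` and the shape of the returned object are made up for illustration, and the selectors are the ones shown in the snippets above.

```javascript
// Sketch only: `page` is assumed to be a Puppeteer Page object.
// inspectRecaptcha is a hypothetical helper, not part of Puppeteer.
async function inspectRecaptcha(page) {
  // The callback runs inside the browser, where `document` is available.
  return page.evaluate(() => {
    const widget = document.querySelector('.g-recaptcha');
    const field = document.querySelector('#g-recaptcha-response');
    return {
      // The site key from the data-sitekey attribute, or null if no widget.
      sitekey: widget ? widget.getAttribute('data-sitekey') : null,
      // True once the reCAPTCHA script has added the hidden textarea.
      hasResponseField: field !== null,
      // True once the textarea holds a response string.
      solved: field !== null && field.value.length > 0,
    };
  });
}
```

With a real page this would be called as `const state = await inspectRecaptcha(page);` after `page.goto(...)` has finished.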


How to solve the captcha yourself

By copying the value of the textarea field you can transfer the "solved challenge" from one browser to another (this is also what the solving services do for you). The full process looks like this:

  1. Detect if the page uses reCAPTCHA (e.g. check for .g-recaptcha) in the "crawling" browser
  2. Open a second browser in non-headless mode with the same URL
  3. Solve the captcha yourself
  4. Read the value from: document.querySelector('#g-recaptcha-response').value
  5. Put that value into the first browser: document.querySelector('#g-recaptcha-response').value = '...'
  6. Submit the form
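The six steps above could be sketched in Puppeteer roughly as follows. This is an illustration under assumptions, not a tested recipe for any particular site: the helper names are invented, `puppeteer` is the imported module passed in by the caller, and `formSelector` is a hypothetical selector for whichever form contains the captcha on your target page.

```javascript
// Sketch only. `puppeteer` is the imported module, `page` a Puppeteer Page.
const RESPONSE_SELECTOR = '#g-recaptcha-response';

// Step 1: detect the widget in the "crawling" browser.
async function usesRecaptcha(page) {
  return page.evaluate(() => document.querySelector('.g-recaptcha') !== null);
}

// Steps 2-4: open a visible browser at the same URL, wait until a human
// has solved the challenge, then read the value of the textarea.
async function solveInVisibleBrowser(puppeteer, url) {
  const browser = await puppeteer.launch({ headless: false });
  try {
    const page = await browser.newPage();
    await page.goto(url);
    await page.waitForFunction(
      (selector) => {
        const field = document.querySelector(selector);
        return field !== null && field.value.length > 0;
      },
      { timeout: 0 }, // no timeout, because a person is solving the challenge
      RESPONSE_SELECTOR
    );
    return await page.evaluate(
      (selector) => document.querySelector(selector).value,
      RESPONSE_SELECTOR
    );
  } finally {
    await browser.close();
  }
}

// Step 5: put the value into the first (crawling) browser.
async function injectToken(page, token) {
  await page.evaluate(
    (selector, value) => {
      document.querySelector(selector).value = value;
    },
    RESPONSE_SELECTOR,
    token
  );
}

// Step 6: submit the form. formSelector is a hypothetical placeholder;
// the caller may also need to wait for the resulting navigation.
async function submitForm(page, formSelector) {
  await page.evaluate((selector) => {
    document.querySelector(selector).submit();
  }, formSelector);
}

// Runs all six steps; returns false if the page shows no reCAPTCHA.
async function handleCaptcha(puppeteer, page, formSelector) {
  if (!(await usesRecaptcha(page))) return false;
  const token = await solveInVisibleBrowser(puppeteer, page.url());
  await injectToken(page, token);
  await submitForm(page, formSelector);
  return true;
}
```

Assuming Puppeteer is installed, a call might look like `await handleCaptcha(require('puppeteer'), page, 'form')`. The response string is short-lived (Google documents a validity of two minutes), so the form should be submitted soon after the challenge is solved.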

Further information/reading

There is not much public information from Google on how exactly reCAPTCHA works, as this is a cat-and-mouse game between bot creators and Google's detection algorithms, but there are some resources online with more information:

  • Official docs from Google: Obviously, they just explain the basics and not how it works behind the scenes.
  • InsideReCaptcha: This is a project from 2014 which tries to "reverse-engineer" reCAPTCHA. Although it is quite old, there is still a lot of useful information on the page.
  • Another question on Stack Overflow: This question contains some useful information about reCAPTCHA, but also many speculative (and very likely outdated) approaches on how to fool a reCAPTCHA.
