Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
1.0k views
in Technique[技术] by (71.8m points)

python - Headless Chrome Driver not working for Selenium

I am current having an issue with my scraper when I set options.add_argument("--headless"). However, it works perfectly fine when it is removed. Could anyone advise how I can achieve the same results with headless mode?

Below is my python code:

from seleniumwire import webdriver as wireDriver
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.chrome.options import Options
    
chromedriverPath = '/Users/applepie/Desktop/chromedrivermac'

    def scraper(search):

    mit = "https://orbit-kb.mit.edu/hc/en-us/search?utf8=?&query="  # Empty search on mit site
    mit += "+".join(search) + "&commit=Search"
    results = []

    options = Options()
    options.add_argument("--headless")
    options.add_argument("--window-size=1440, 900")
    driver = webdriver.Chrome(options=options, executable_path= chromedriverPath)

    driver.get(mit)
    # Wait 20 seconds for page to load
    timeout = 20
    try:
        WebDriverWait(driver, timeout).until(EC.visibility_of_element_located((By.CLASS_NAME, "header")))
        search_results = driver.find_element_by_class_name("search-results")
        for result in search_results.find_elements_by_class_name("search-result"):
            resultObject = {
                "url": result.find_element_by_class_name('search-result-link').get_attribute("href")
            }
            results.append(resultObject)
        driver.quit()
    except TimeoutException:
        print("Timed out waiting for page to load")
        driver.quit()

    return results

Here is also a screenshot of when I print(driver.page_source) after get():

enter image description here


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

This screenshot...

screenshot

...implies that the Cloudflare have detected your requests to the website as an automated bot and subsequently denying you the access to the application.


Solution

In these cases the a potential solution would be to use the undetected-chromedriver in headless mode to initialize the browsing context.

undetected-chromedriver is an optimized Selenium Chromedriver patch which does not trigger anti-bot services like Distill Network / Imperva / DataDome / Botprotect.io. It automatically downloads the driver binary and patches it.

  • Code Block:

    import undetected_chromedriver as uc
    from selenium import webdriver
    
    options = webdriver.ChromeOptions() 
    options.headless = True
    driver = uc.Chrome(options=options)
    driver.get(url)
    

References

You can find a couple of relevant detailed discussions in:


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...