
python - scraping 50 webpages each containing 10 webpage links

I need to scrape 50 main pages, each containing 10 article links. The date and author are scraped from the main page, while the headline, vertical, and description are scraped by visiting each article link. After scraping the 10 links on a page, I need to click through to the next page, and the cycle continues for 50 pages. Please help me; here is my code.

# Importing essential libraries required for scraping articles.
import pandas as pd
import selenium
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import xml.etree.ElementTree as ET
from selenium.common.exceptions import StaleElementReferenceException
from selenium.webdriver.common.action_chains import ActionChains


driver = webdriver.Chrome(r"C:\Users\Scp\Desktop\fliprobo\chromedriver.exe")

Dates = []
Authors = []
Verticals = []
Headlines = []
Descriptions = []
Hrefs = []


driver.get("https://www.ebmnews.com/2020/page/948/")
start = 948
end = 997
for page in range(start, end + 1):
    authors = driver.find_elements_by_xpath('//i[@class="post-author author"]')
    for i in authors:
        Authors.append(i.text)
    dates = driver.find_elements_by_xpath('//time[@class="post-published updated"]')
    for i in dates:
        Dates.append(i.text)
    urls = driver.find_elements_by_xpath('//a[@class="post-url post-title"]')
    for i in urls:
        driver.get(i.get_attribute('href'))
        headlines = driver.find_elements_by_xpath('//*[@id="post-99531"]/div[1]/h1/span')
        for j in headlines:
            Headlines.append(j.text)
        desc = driver.find_elements_by_xpath('//*[@id="post-99531"]/div[2]/p/span')
        for j in desc:
            Descriptions.append(j.text)
        verticals = driver.find_elements_by_xpath('//*[@id="post-99531"]/div[1]/div[1]/div/span/a')
        for j in verticals:
            Verticals.append(j.text)
        driver.back()
    try:
        element = driver.find_element_by_xpath('//*[text()=" Older Posts"]')
        webdriver.ActionChains(driver).move_to_element(element).click(element).perform()
    except StaleElementReferenceException as e:
        old_post_btn = driver.find_element_by_xpath('//*[text()=" Older Posts"]')
        old_post_btn.click()
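As an aside, the StaleElementReferenceException handling at the end points at the core problem: driver.get() replaces the page, so every WebElement collected from the listing (including urls and the "Older Posts" button) is stale by the time the loop returns to it. A minimal sketch of one common workaround, assuming the same XPaths and the selenium 3 API used above, is to read the hrefs out as plain strings before navigating, and to paginate through the URL instead of clicking:

# Sketch only: collect hrefs as strings first, so nothing goes stale,
# and use the page number in the URL instead of the "Older Posts" button.
for page in range(948, 998):
    driver.get("https://www.ebmnews.com/2020/page/%d/" % page)
    urls = driver.find_elements_by_xpath('//a[@class="post-url post-title"]')
    hrefs = [a.get_attribute('href') for a in urls]  # plain strings survive navigation
    for href in hrefs:
        driver.get(href)
        # ... scrape headline, description and vertical here ...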


1 Reply


If I understand your question correctly, the following should serve the purpose. I used the requests module instead of selenium to make the script more robust; fetching each listing page by its URL also sidesteps the stale-element errors your Selenium loop runs into after navigating away from the listing.

import requests
from bs4 import BeautifulSoup

url = 'https://www.ebmnews.com/2020/page/{}/'

current_page = 948

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'

    while current_page != 998:  # 997 is the highest page to traverse
        r = s.get(url.format(current_page))
        soup = BeautifulSoup(r.text, "html.parser")

        for item in soup.select('article.listing-item'):
            # Author and date are available on the listing page itself.
            try:
                post_author = item.select_one("i.post-author").get_text(strip=True)
            except AttributeError:
                post_author = ""
            try:
                post_date = item.select_one("span.time > time").get_text(strip=True)
            except AttributeError:
                post_date = ""
            inner_link = item.select_one("h2.title > a").get("href")

            # Headline and description require visiting the article page.
            res = s.get(inner_link)
            sauce = BeautifulSoup(res.text, "html.parser")
            title = sauce.select_one("span[itemprop='headline']").get_text(strip=True)
            desc = ' '.join([p.get_text(strip=True) for p in sauce.select(".entry-content > p")])
            print(post_author, post_date, title, desc)

        current_page += 1
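If you want the results in a table rather than printed (your code already imports pandas), you can append each record to a list of dicts and build a DataFrame at the end. A minimal sketch, where the rows list and the output filename are illustrative:

import pandas as pd

rows = []  # fill this inside the article loop instead of printing

# where print(post_author, post_date, title, desc) is, do:
# rows.append({"Author": post_author, "Date": post_date,
#              "Headline": title, "Description": desc})

df = pd.DataFrame(rows, columns=["Author", "Date", "Headline", "Description"])
df.to_csv("ebmnews_articles.csv", index=False)  # illustrative filename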
