
python - scraping 50 webpages each containing 10 webpage links

I need to scrape 50 listing pages, each containing 10 article links. The date and author are scraped from the listing page, while the vertical and description are scraped by visiting each article link. After scraping the 10 links on the first listing page I need to click through to the next page, and the cycle continues for 50 pages. Please help me; here is my code.

    # Importing the libraries required for scraping the articles.
    import pandas as pd
    from selenium import webdriver
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.common.exceptions import StaleElementReferenceException
    from selenium.webdriver.common.action_chains import ActionChains

    driver = webdriver.Chrome(r"C:\Users\Scp\Desktop\fliprobo\chromedriver.exe")

    Dates = []
    Authors = []
    Verticals = []
    Headlines = []
    Descriptions = []
    Hrefs = []

    driver.get("https://www.ebmnews.com/2020/page/948/")
    start = 948
    end = 997
    for page in range(start, end + 1):
        # Authors and dates are available on the listing page itself.
        authors = driver.find_elements_by_xpath('//i[@class="post-author author"]')
        for i in authors:
            Authors.append(i.text)
        dates = driver.find_elements_by_xpath('//time[@class="post-published updated"]')
        for i in dates:
            Dates.append(i.text)
        # Collect every article link before navigating away; the elements
        # go stale as soon as the browser leaves the listing page.
        listing_url = driver.current_url
        urls = driver.find_elements_by_xpath('//a[@class="post-url post-title"]')
        hrefs = [u.get_attribute('href') for u in urls]
        Hrefs.extend(hrefs)
        for href in hrefs:
            driver.get(href)
            # Note: the hard-coded id "post-99531" matches only one specific
            # article; these XPaths need a selector that works on every post.
            headlines = driver.find_elements_by_xpath('//*[@id="post-99531"]/div[1]/h1/span')
            for i in headlines:
                Headlines.append(i.text)
            desc = driver.find_elements_by_xpath('//*[@id="post-99531"]/div[2]/p/span')
            for i in desc:
                Descriptions.append(i.text)
            verticals = driver.find_elements_by_xpath('//*[@id="post-99531"]/div[1]/div[1]/div/span/a')
            for i in verticals:
                Verticals.append(i.text)
        # Return to the listing page, then advance to the next one.
        driver.get(listing_url)
        try:
            element = driver.find_element_by_xpath('//*[text()=" Older Posts"]')
            ActionChains(driver).move_to_element(element).click(element).perform()
        except StaleElementReferenceException:
            old_post_btn = driver.find_element_by_xpath('//*[text()=" Older Posts"]')
            old_post_btn.click()
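
As a side note: the listing pages follow a sequential URL pattern, so one way to sidestep the stale-element and "Older Posts" clicking problems entirely is to load each page URL directly. A minimal sketch, assuming the pages run from 948 to 997 as in the code above:

    # Sketch: drive pagination by URL instead of clicking "Older Posts".
    for page in range(948, 998):
        driver.get("https://www.ebmnews.com/2020/page/{}/".format(page))
        urls = driver.find_elements_by_xpath('//a[@class="post-url post-title"]')
        hrefs = [u.get_attribute('href') for u in urls]
        # ...scrape the listing fields here, then visit each href as above...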


1 Reply


If I understand your question correctly, the following should serve the purpose. I used the requests module instead of selenium to make the script faster and more robust.

import requests
from bs4 import BeautifulSoup

url = 'https://www.ebmnews.com/2020/page/{}/'

current_page = 948

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'

    while current_page != 998:  # 997 is the last page to traverse
        r = s.get(url.format(current_page))
        soup = BeautifulSoup(r.text, "html.parser")
        # Each article block on the listing page carries its author, date and link.
        for item in soup.select('article.listing-item'):
            try:
                post_author = item.select_one("i.post-author").get_text(strip=True)
            except AttributeError:
                post_author = ""
            try:
                post_date = item.select_one("span.time > time").get_text(strip=True)
            except AttributeError:
                post_date = ""
            inner_link = item.select_one("h2.title > a").get("href")

            # Visit the article itself for the headline and description.
            res = s.get(inner_link)
            sauce = BeautifulSoup(res.text, "html.parser")
            title = sauce.select_one("span[itemprop='headline']").get_text(strip=True)
            desc = ' '.join(p.get_text(strip=True) for p in sauce.select(".entry-content > p"))
            print(post_author, post_date, title, desc)

        current_page += 1
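
To persist the scraped rows instead of printing them, they can be collected and written out with pandas, which the question already imports. A sketch, assuming each print(...) in the loop above is replaced by an append to rows (the output filename is hypothetical):

import pandas as pd

rows = []
# In the loop above, replace print(post_author, post_date, title, desc) with:
#     rows.append({"author": post_author, "date": post_date,
#                  "title": title, "description": desc})

df = pd.DataFrame(rows, columns=["author", "date", "title", "description"])
df.to_csv("ebmnews_articles.csv", index=False)  # hypothetical output filename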
