javascript - Dynamic Data Web Scraping with Python, BeautifulSoup

Question

posted Oct 24, 2021 in Technique by 深蓝 (71.8m points)

I am trying to extract a number from the HTML of many pages; the value is different on each page. I expected soup.select('span[class="pull-right"]') to give me the number, but it returns only tags, not the value. I believe this is because the page fills in the value with JavaScript. In the HTML below, 180,476 is the value I want, and I need it for many pages:

<div class="legend-block--body">
        <div class="linear-legend--counts">
          Pageviews:
          <span class="pull-right">
            180,476
          </span>
        </div>
        <div class="linear-legend--counts">
          Daily average:
          <span class="pull-right">
            8,594
          </span>
        </div></div>

My code (this runs in a loop over many pages):

import bs4
import requests

# wiki_page is the URL of one page, set by the surrounding loop
res = requests.get(wiki_page, timeout=None)
soup = bs4.BeautifulSoup(res.text, 'html.parser')
ab = soup.select('span[class="pull-right"]')
print(ab)

Output:

[<span class="pull-right">
<label class="logarithmic-scale">
<input class="logarithmic-scale-option" type="checkbox"/>
        Logarithmic scale      
</label>
</span>, <span class="pull-right">
<label class="begin-at-zero">
<input class="begin-at-zero-option" type="checkbox"/>
        Begin at zero      </label>
</span>, <span class="pull-right">
<label class="show-labels">
<input class="show-labels-option" type="checkbox"/>
        Show values      </label>
</span>]

Example URL: https://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&range=latest-20&pages=Star_Wars:_The_Last_Jedi

I want the Pageviews value.

1 Reply

深蓝 · Answer 1 · 2021-10-23T21:31:51+0000

JavaScript is not executed when you fetch a page with requests.get, so the response contains only the static HTML. Use Selenium instead: it opens the page in a real browser, as a user would, so the page's JavaScript runs and fills in the values.

To get started, install Selenium with pip install selenium. Then retrieve the element with the code below:

from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Firefox()
# Each entry pairs a page URL with the CSS selector of the element to retrieve.
wiki_pages = [("https://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&range=latest-20&pages=Star_Wars:_The_Last_Jedi",
               ".summary-column--container .legend-block--pageviews .linear-legend--counts:first-child span.pull-right"),]
for url, selector in wiki_pages:
    # Load the page in the browser so its JavaScript runs.
    browser.get(url)
    # find_element(By.CSS_SELECTOR, ...) replaces find_element_by_css_selector,
    # which recent Selenium 4 releases removed.
    page_views_count = browser.find_element(By.CSS_SELECTOR, selector)
    print(page_views_count.text)
browser.quit()
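
Since the question asks for the same value on many pages, the wiki_pages list can be generated rather than typed by hand. The sketch below is only one possible way to do that: it assumes every page uses the same query parameters as the example URL, and the names build_pageviews_url, parse_count, BASE_URL, SELECTOR and titles are illustrative, not part of the original answer. parse_count turns the element text (for example "180,476") into an integer.

```python
from urllib.parse import urlencode

BASE_URL = "https://tools.wmflabs.org/pageviews/"
# Selector copied from the answer's code; assumed to be the same on every page.
SELECTOR = (".summary-column--container .legend-block--pageviews "
            ".linear-legend--counts:first-child span.pull-right")


def build_pageviews_url(page_title):
    """Build a URL with the same query parameters as the example URL."""
    params = {
        "project": "en.wikipedia.org",
        "platform": "all-access",
        "agent": "user",
        "range": "latest-20",
        "pages": page_title,
    }
    # safe=":" leaves the colon in titles unescaped, as in the example URL.
    return BASE_URL + "?" + urlencode(params, safe=":")


def parse_count(text):
    """Convert element text such as ' 180,476 ' to the integer 180476."""
    return int(text.strip().replace(",", ""))


# Illustrative list of article titles; extend it with the pages you need.
titles = ["Star_Wars:_The_Last_Jedi"]
wiki_pages = [(build_pageviews_url(title), SELECTOR) for title in titles]
```

The resulting wiki_pages list has the same (url, selector) shape as the one in the answer, and parse_count can be applied to the text of the element that Selenium returns.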

NOTE: If you need to run the browser headless, consider PyVirtualDisplay (a wrapper for Xvfb) to run WebDriver without a display; see 'How do I run Selenium in Xvfb?' for more information.
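
As an alternative to a virtual display, Firefox can be started in its own headless mode. The sketch below assumes Selenium 4 and a local Firefox installation; make_headless_firefox and wait_for_text are hypothetical helper names, not part of the answer. Because the page fills in the count with JavaScript after it loads, the second helper also waits until the element actually contains text instead of reading it immediately.

```python
def make_headless_firefox():
    """Start Firefox without a visible window (assumes Selenium 4 and Firefox)."""
    # Imported inside the function so this module loads even where
    # Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument("-headless")
    return webdriver.Firefox(options=options)


def wait_for_text(browser, selector, timeout=15):
    """Return the stripped text of the first matching element once it is non-empty."""
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    # WebDriverWait retries the callable until it returns a truthy value;
    # a missing element (NoSuchElementException) is ignored and retried.
    # If the timeout elapses first, a TimeoutException is raised.
    return WebDriverWait(browser, timeout).until(
        lambda driver: driver.find_element(By.CSS_SELECTOR, selector).text.strip()
    )
```

A possible usage is browser = make_headless_firefox(), then browser.get(url) followed by wait_for_text(browser, selector), and finally browser.quit(). The 15-second timeout is an arbitrary default; adjust it to how slowly the pages load.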
