Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - Webscraping not working with BeautifulSoup

In advance: Sorry for any bad formatting, this is my very first post!

I'm trying to create a program that scrapes "CoinMarketCap" and compares the prices from a South African exchange (Luno) and all the other Bitcoin exchanges.

Sadly, it doesn't work on the https://coinmarketcap.com/de/currencies/bitcoin/markets/ page. It works on the https://coinmarketcap.com/de/exchanges/luno/ page though.

Any suggestions? Here is my code:

from bs4 import BeautifulSoup 
import requests
from time import sleep
from random import randint

def scrapeWebsite(link):
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}

    results = requests.get(link, headers=headers)

    src = results.content

    soup = BeautifulSoup(src,features="html.parser")

    items = []

    print(soup.prettify())

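    # Join the cell texts of each table row with "/" and keep the row
    # once a cell reading "Kürzlich" ("Recently") is reached.
    for tr in soup.find_all("tr"):
        line = ""
        for td in tr.find_all("td"):
            line = line + td.text + "/"
            if td.text == "Kürzlich":
                items.append(line)
    return items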



itemsLuno = scrapeWebsite("https://coinmarketcap.com/de/currencies/bitcoin/markets/")

#Coins on Luno are: Bitcoin, Ethereum, Litecoin and ripple

for item in itemsLuno:
    print(item)
question from:https://stackoverflow.com/questions/65861022/webscraping-not-working-with-beautfiulsoup


1 Reply


The content of the first page is generated by JavaScript, so when you fetch the page with requests you get the initial, unmodified HTML: the response exactly as the server sent it, before any JavaScript has run in a browser. You can confirm this by looking at the raw response.
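One rough way to check which situation you are in is to count the table rows in the HTML you actually received. The sketch below uses only the standard library's html.parser instead of BeautifulSoup so that it runs without extra installs; the class name, the function name, and the two sample strings are made up for illustration and are not taken from CoinMarketCap.

```python
from html.parser import HTMLParser


class TableRowCollector(HTMLParser):
    """Collects the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def _finish_cell(self):
        if self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None

    def _finish_row(self):
        self._finish_cell()
        if self._row:  # rows without any <td> (e.g. header rows) are skipped
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._finish_row()
            self._row = []
        elif tag == "td" and self._row is not None:
            self._finish_cell()
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td":
            self._finish_cell()
        elif tag == "tr":
            self._finish_row()


def extract_rows(html):
    """Return the <td> text of each table row found in an HTML string."""
    parser = TableRowCollector()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical stand-ins for the two kinds of response:
static_page = "<table><tr><th>Pair</th></tr><tr><td>BTC/ZAR</td><td>Luno</td></tr></table>"
js_page = "<div id='__next'></div><script>/* rows are built here at runtime */</script>"

print(len(extract_rows(static_page)))  # 1
print(len(extract_rows(js_page)))      # 0
```

If the same check on results.text from your script finds no rows, the table is being built in the browser and requests alone will never see it.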
In your case, you need to render the JavaScript content before you crawl the page. You can do that with Selenium, or with the Scrapy framework (which needs a rendering add-on such as Splash for this). For example, with Selenium:

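from selenium import webdriver
import time

url = "https://coinmarketcap.com/de/currencies/bitcoin/markets/"

driver = webdriver.Firefox()
driver.get(url)
time.sleep(5)  # crude fixed wait so the JavaScript can fill in the page
html = driver.page_source  # HTML after rendering; pass this to BeautifulSoup
driver.quit()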
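A fixed time.sleep(5) can be too short on a slow connection and wasteful on a fast one. As a minimal sketch of the alternative, the helper below polls until the page looks ready or a timeout expires. wait_for_html, get_html and is_ready are hypothetical names invented for this example, and the demo uses a fake page source rather than a real browser.

```python
import time


def wait_for_html(get_html, is_ready, timeout=15.0, poll_interval=0.5):
    """Poll get_html() until is_ready(html) is true or the timeout expires.

    get_html and is_ready are callables supplied by the caller.
    Returns the first HTML string that passes the check.
    """
    deadline = time.monotonic() + timeout
    while True:
        html = get_html()
        if is_ready(html):
            return html
        if time.monotonic() >= deadline:
            raise TimeoutError(f"page not ready after {timeout} seconds")
        time.sleep(poll_interval)


# Demo with a fake page source that only contains table cells on the third read.
snapshots = iter([
    "<div></div>",
    "<div></div>",
    "<table><tr><td>BTC/ZAR</td></tr></table>",
])
html = wait_for_html(
    lambda: next(snapshots),
    lambda h: "<td" in h,
    timeout=5,
    poll_interval=0.01,
)
print(html)
```

With the Selenium driver, the call would look something like wait_for_html(lambda: driver.page_source, lambda h: "<td" in h), made before driver.quit(). Selenium also ships its own explicit-wait helper, WebDriverWait in selenium.webdriver.support.ui, which is probably the better choice in real code.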
