Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - Handling an infinite scroll UI in BeautifulSoup

I'm trying to scrape my LinkedIn connections page (https://www.linkedin.com/mynetwork/invite-connect/connections/), but that seems impossible because the page uses infinite scroll. How can I deal with it? I don't want to use Selenium, because I want to run this as a web service later on.

from bs4 import BeautifulSoup
import requests

def scraping(webpage):
    # Send a browser-like User-Agent with the request.
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
    response = requests.get(webpage, headers=headers, timeout=10)
    # Parse only the HTML returned by this single request.
    soup = BeautifulSoup(response.text, "html.parser")
    print(soup)

scraping('https://www.linkedin.com/mynetwork/invite-connect/connections')

1 Reply


BeautifulSoup can only parse the HTML you give it, and the content you want isn't in the HTML you have, so you need to get LinkedIn to return more. In the browser, LinkedIn's JavaScript is probably detecting that you are scrolling, fetching more content, and injecting the resulting HTML into the page. You need to replicate that content fetch somehow.
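To make that concrete, here is a small sketch showing that BeautifulSoup parses markup without running any scripts in it. The HTML string, the list id, and the names are made up for illustration and have nothing to do with LinkedIn's real markup.

```python
from bs4 import BeautifulSoup

# Hypothetical page: two items are in the markup, and a script would add a
# third one, but only if something actually executed the JavaScript.
html = """
<ul id="connections">
  <li>Alice</li>
  <li>Bob</li>
</ul>
<script>
  var li = document.createElement("li");
  li.textContent = "Carol";
  document.getElementById("connections").appendChild(li);
</script>
"""

soup = BeautifulSoup(html, "html.parser")

# BeautifulSoup never runs the script, so only the static items are found.
names = [li.get_text(strip=True) for li in soup.find_all("li")]
print(names)  # ['Alice', 'Bob']
```

"Carol" never shows up, which is the same situation as the items an infinite-scroll page loads later.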

Bad news: BeautifulSoup isn't aware of APIs or JavaScript. You'll need another tool.

Good news: there are tools for this! You could certainly use Selenium. That would probably be the simplest way to solve this, since it replicates the browser environment well enough for these purposes.
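A rough sketch of the Selenium route might look like the following: keep scrolling to the bottom until the page height stops growing, then hand the rendered HTML to BeautifulSoup. The helper name scroll_to_bottom, the pause length, and the round limit are my own assumptions. The function only relies on the driver's execute_script method, and the commented usage assumes Selenium and a Chrome driver are installed and that you are already logged in.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=30):
    """Scroll until the page height stops growing; return the rounds used.

    `driver` is expected to be a Selenium WebDriver (anything that offers
    execute_script works). `pause` and `max_rounds` are illustrative guesses.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        # Jump to the bottom, which should trigger the next batch to load.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # crude wait for the new content to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return rounds  # nothing new was added, so assume we hit the end
        last_height = new_height
    return max_rounds


# Possible usage (not run here):
#   from selenium import webdriver
#   from bs4 import BeautifulSoup
#   driver = webdriver.Chrome()
#   driver.get("https://www.linkedin.com/mynetwork/invite-connect/connections/")
#   scroll_to_bottom(driver)
#   soup = BeautifulSoup(driver.page_source, "html.parser")
#   driver.quit()
```

A fixed sleep is the simplest thing that could work. On a slow connection it may stop too early, so an explicit wait on the number of loaded items would be more robust.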

If you are absolutely committed to not using Selenium, I recommend digging into the LinkedIn site to figure out which bits of JavaScript are responsible for fetching more data, then replicating the network requests they make and parsing the returned data yourself.
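As a sketch of what replicating those requests could look like, suppose the browser's network tab shows the page fetching JSON in batches from some endpoint. Everything specific below is hypothetical: the "start" and "count" parameters, the "elements" key, and the function names are placeholders, not LinkedIn's actual API. A real site will usually also expect the cookies and headers the browser sends.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def fetch_json(url, params, headers=None):
    """GET url with query parameters and decode the JSON response body."""
    request = Request(f"{url}?{urlencode(params)}", headers=headers or {})
    with urlopen(request, timeout=10) as response:
        return json.load(response)


def fetch_all_items(url, page_size=40, max_pages=50, get_json=fetch_json):
    """Request batch after batch until a short (or empty) batch comes back.

    Assumes a hypothetical endpoint that takes "start" and "count" and
    returns {"elements": [...]}; adjust to whatever the site really uses.
    """
    items = []
    for page in range(max_pages):
        data = get_json(url, {"start": page * page_size, "count": page_size})
        batch = data.get("elements", [])
        items.extend(batch)
        if len(batch) < page_size:
            break  # a short batch means there is nothing left to fetch
    return items
```

The get_json parameter is there so the loop can be tried against a stub before pointing it at a real endpoint. Since the endpoint returns data directly, there may be no HTML left for BeautifulSoup to parse at all.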

For most people, though, Selenium will be the right answer.



...