Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - Asking the user to input something and use Beautiful Soup to parse a website

I am supposed to use Beautiful Soup 4 to obtain course information off of my school's website as an exercise.

I have been at this for the past few days and my code still does not work.

The first thing I ask the user to do is input the course catalog abbreviation.

For example, ICS is the abbreviation for Information and Computer Sciences.

The program is then supposed to use Beautiful Soup 4 to list all of the courses and how many students are enrolled in each.

While I was able to get the input portion to work, I still get errors or the program just stops.

Question: Is there a way for Beautiful Soup to accept user input so that when the user inputs ICS, the output would be a list of all courses that are related to ICS?

Here is the code and my attempt at it:

from bs4 import BeautifulSoup
import requests
import re

#get input for course
course = input('Enter the course:')
#Here is the page link
BASE_AVAILABILITY_URL = f"https://www.sis.hawaii.edu/uhdad/avail.classes?i=MAN&t=202010&s={course}"


#get request and response
page_response = requests.get(BASE_AVAILABILITY_URL)
#getting Beautiful Soup to gather the html content
page_content = BeautifulSoup(page_response.content, 'html.parser')
#getting course information
main = page_content.find_all(class_='parent clearfix')
main_p = "".join(str (x) for x in main)
#get the course anchor tags
main_q = BeautifulSoup(main_p, "html.parser")
courses = main.find('a', href = True)
#get each course name
#empty dictionary for course list
courses_list = []
for a in courses:
    courses_list.append(a.text)
    search = input('Enter the course title:')
for course in courses_list:
    if re.search(search, course, re.IGNORECASE):
        print(course)

This is the original code that was provided in a Jupyter Notebook:

import requests, bs4

BASE_AVAILABILITY_URL = f"https://www.sis.hawaii.edu/uhdad/avail.classes?i=MAN&t=202010&s={course}"
#get input for course
course = input('Enter the course:')



def scrape_availability(text):
    soup = bs4.BeautifulSoup(text)
    r = requests.get(str(BASE_AVAILABILITY_URL)  + str(course))
    rows = soup.select('.listOfClasses tr')

    for row in rows[1:]:
        columns = row.select('td')
        class_name = columns[2].contents[0]
        if len(class_name) > 1 and class_name != b'xa0':
            print(class_name)
            print(columns[4].contents[0])
            print(columns[7].contents[0])
            print(columns[8].contents[0])

What's odd is that if the user saves the HTML file, uploads it into Jupyter Notebook, and then opens the file to be read, the courses are displayed.

But for this task, the user cannot save files; the output must come directly from the user's input.

asked by usukidoll, translated from Stack Overflow

1 Reply

The problem with your code is that page_content.find_all(class_='parent clearfix') returns an empty list, [].

So that's the first thing you need to change.
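
One way to confirm that a lookup matches nothing is to check the result before using it. The following is a minimal sketch, assuming a small made-up HTML snippet in place of the real page; SAMPLE_HTML and describe_matches are hypothetical names, not part of the original code.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the downloaded page; the real markup may differ.
SAMPLE_HTML = """
<table class="listOfClasses">
  <tr><th>Course</th></tr>
  <tr><td>ICS 101</td></tr>
</table>
"""

def describe_matches(soup, **filters):
    """Report how many elements find_all() returns for the given filters."""
    matches = soup.find_all(**filters)
    if not matches:
        # An empty result means the filter does not fit the page's markup.
        return "no matches"
    return f"{len(matches)} match(es)"

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(describe_matches(soup, class_="parent clearfix"))
print(describe_matches(soup, class_="listOfClasses"))
```

Note also that find_all() returns a list-like ResultSet, so calling .find() on it, as main.find('a', href = True) does in the question, raises an AttributeError even when there are matches.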

Looking at the HTML, you'll want to look for the <table>, <tr>, and <td> tags.
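
As a rough illustration of walking those tags, the sketch below collects the text of every <td> cell, row by row. It assumes a table with the class listOfClasses (the selector used in the provided code) and runs on invented sample data; SAMPLE_HTML, table_rows, and the cell values are hypothetical.

```python
from bs4 import BeautifulSoup

# Invented sample data; the real page has more columns and different values.
SAMPLE_HTML = """
<table class="listOfClasses">
  <tr><th>CRN</th><th>Course</th><th>Title</th></tr>
  <tr><td>12345</td><td>ICS 101</td><td>Sample Title A</td></tr>
  <tr><td>23456</td><td>ICS 111</td><td>Sample Title B</td></tr>
</table>
"""

def table_rows(html):
    """Return the text of each <td> cell, one list per table row."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.select(".listOfClasses tr"):
        cells = [td.get_text(strip=True) for td in tr.select("td")]
        if cells:  # the header row holds <th> cells, so it yields no <td>
            rows.append(cells)
    return rows

for cells in table_rows(SAMPLE_HTML):
    print(cells)
```

Using get_text(strip=True) instead of contents[0] is a design choice here: it returns plain text even when a cell wraps its value in another tag.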

Working off the original code that was provided, you just need to alter a few things so that it flows logically:

I'll point out what I changed:

import requests, bs4

BASE_AVAILABILITY_URL = f"https://www.sis.hawaii.edu/uhdad/avail.classes?i=MAN&t=202010&s={course}"
#get input for course
course = input('Enter the course:')



def scrape_availability(text):
    soup = bs4.BeautifulSoup(text)   #<-- you need to get the HTML text before creating a bs4 object, so I moved the request (line below) before this and also adjusted the parameter for this function.
                                     # the rest of the code is fine
    r = requests.get(str(BASE_AVAILABILITY_URL)  + str(course))
    rows = soup.select('.listOfClasses tr')

    for row in rows[1:]:
        columns = row.select('td')
        class_name = columns[2].contents[0]
        if len(class_name) > 1 and class_name != b'xa0':
            print(class_name)
            print(columns[4].contents[0])
            print(columns[7].contents[0])
            print(columns[8].contents[0])

This will give you:

import requests, bs4

BASE_AVAILABILITY_URL = "https://www.sis.hawaii.edu/uhdad/avail.classes?i=MAN&t=202010&s="
#get input for course
course = input('Enter the course:')

#append the user's input to the base URL
url = BASE_AVAILABILITY_URL + course

def scrape_availability(url):
    #request the page first, then pass its HTML text to Beautiful Soup
    r = requests.get(url)
    soup = bs4.BeautifulSoup(r.text, 'html.parser')
    rows = soup.select('.listOfClasses tr')

    #skip the first (header) row
    for row in rows[1:]:
        columns = row.select('td')
        #defensive check: skip rows that lack the columns read below
        if len(columns) < 9:
            continue
        class_name = columns[2].contents[0]
        #skip cells that hold only a non-breaking space
        if len(class_name) > 1 and class_name != '\xa0':
            print(class_name)
            print(columns[4].contents[0])
            print(columns[7].contents[0])
            print(columns[8].contents[0])


scrape_availability(url)
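
Because the course abbreviation is typed by the user, it may help to strip stray whitespace and URL-encode it rather than concatenating it directly. This is a minimal sketch using only the standard library; AVAILABILITY_ENDPOINT and build_availability_url are hypothetical names, and the i and t values are simply copied from the URL above.

```python
from urllib.parse import urlencode

AVAILABILITY_ENDPOINT = "https://www.sis.hawaii.edu/uhdad/avail.classes"

def build_availability_url(subject):
    """Return the availability URL for one subject abbreviation, e.g. ICS."""
    # i and t are copied from the URL in the post; only s comes from the user.
    query = urlencode({"i": "MAN", "t": "202010", "s": subject.strip()})
    return f"{AVAILABILITY_ENDPOINT}?{query}"

print(build_availability_url("ICS"))
```

The resulting string can be passed to scrape_availability() in place of the concatenated url.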
