Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - Import RSS with FeedParser and Get Both Posts and General Information to Single Pandas DataFrame

As a Python novice, I am working on an exercise to practice importing data in Python. Eventually I want to analyze data from different podcasts (information on the podcast itself and on every episode) by putting the data into one coherent DataFrame and working on it with NLP.

So far I have managed to read a list of RSS feeds and get the information on every single episode of the RSS feed (a post).

But I am having trouble finding an integrated workflow in Python that gathers both

  1. information on every single episode of the RSS feed (a post)
  2. and general information about the RSS feed (like title of the podcast) in one go.

Code: This is what I have so far:

import feedparser
import pandas as pd

# Number of feeds is reduced for testing
rss_feeds = [
    'http://feeds.feedburner.com/TEDTalks_audio',
    'https://joelhooks.com/rss.xml',
    'https://www.sciencemag.org/rss/podcast.xml',
]

posts = []
for url in rss_feeds:
    # Parse one feed and collect a tuple per episode (post)
    feed = feedparser.parse(url)
    for post in feed.entries:
        posts.append((post.title, post.link, post.summary))

df = pd.DataFrame(posts, columns=['title', 'link', 'summary'])

Output: The DataFrame has 652 non-null values in each of the three columns (as intended), i.e. one row for every post made in every podcast. The title column holds the title of the episode, not the title of the podcast (which in this example is 'TED Talks Daily').

                                               title                                               link                                            summary
0  3 questions to ask yourself about everything y...  https://www.ted.com/talks/stacey_abrams_3_ques...  How you respond to setbacks is what defines yo...
1  What your sleep patterns say about your relati...  https://www.ted.com/talks/tedx_shorts_what_you...  Wendy Troxel looks at the cultural expectation...
2  How we can actually pay people enough -- with ...  https://www.ted.com/talks/ted_business_how_we_...  Capitalism urgently needs an upgrade, says Pay...

1 Reply

The feed-level title is available as feed.feed.title, so it can be added to every row while looping over the entries:

# ...
for url in rss_feeds:
    feed = feedparser.parse(url)
    for post in feed.entries:
        posts.append((feed.feed.title, post.title, post.link, post.summary))

df = pd.DataFrame(posts, columns=['feed_title', 'title', 'link', 'summary'])
df

Output:

          feed_title            title             link          summary
0    TED Talks Daily  3 ways compa...  https://www....  When we expe...
1    TED Talks Daily  How we could...  https://www....  Concrete is ...
2    TED Talks Daily  3 questions ...  https://www....  How you resp...
3    TED Talks Daily  What your sl...  https://www....  Wendy Troxel...
4    TED Talks Daily  How we can a...  https://www....  Capitalism u...
..               ...              ...              ...              ...
649  Science Maga...  Science Podc...  https://traf...  Fear-enhance...
650  Science Maga...  Science Podc...  https://traf...  Discussing t...
651  Science Maga...  Science Podc...  https://traf...  Talking kids...
652  Science Maga...  Science Podc...  https://traf...  The minimum ...
653  Science Maga...  Science Podc...  https://traf...  The origin o...
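
One caveat with the loop above: attribute access such as post.summary raises an AttributeError when a feed omits that element. Since the objects returned by feedparser behave like dictionaries, .get() can be used instead. The following is only a minimal sketch under that assumption; feed_to_rows, build_dataframe, COLUMNS and the feed_link column are hypothetical names introduced here for illustration, not part of feedparser or of the original answer.

```python
from typing import Any, Dict, List, Mapping

# Hypothetical column layout: two feed-level fields, then the episode fields.
COLUMNS = ["feed_title", "feed_link", "title", "link", "summary"]


def feed_to_rows(parsed: Mapping[str, Any]) -> List[Dict[str, Any]]:
    """Flatten one parsed feed into one dict per episode.

    Feed-level fields are repeated on every row. Missing elements
    become None instead of raising an error.
    """
    feed_info = parsed.get("feed", {})
    rows = []
    for post in parsed.get("entries", []):
        rows.append({
            "feed_title": feed_info.get("title"),
            "feed_link": feed_info.get("link"),
            "title": post.get("title"),
            "link": post.get("link"),
            "summary": post.get("summary"),
        })
    return rows


def build_dataframe(urls):
    """Parse every URL and return a single DataFrame with all episodes."""
    # Imported here so feed_to_rows stays usable without these packages.
    import feedparser
    import pandas as pd

    rows = []
    for url in urls:
        rows.extend(feed_to_rows(feedparser.parse(url)))
    return pd.DataFrame(rows, columns=COLUMNS)
```

Calling build_dataframe(rss_feeds) with the list from the question should then give one row per episode, with the podcast-level columns filled in alongside the episode columns. Other feed-level fields can be added to the dict in the same way if the feeds provide them.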
