Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


Setting the Host header for redirected URLs with Python requests module

I'm working on a web scraping project in Python. I get a daily email from a service that has links in it. A typical link looks like:

http://clicks.serviceprovider.com/track/click/12345/www.serviceprovider.com?p=eyJzI...<snip>...JdfSJ9

In a browser, I can see that the server redirects from http://clicks.serviceprovider.com to https://www.serviceprovider.com?pageId=12345. Naturally, I want to scrape pageId 12345 with my Python code.

If I just call requests.get(url), the server never responds. I suspect, but can't confirm, that this is because requests isn't sending a Host header.

If I set headers={'Host': 'clicks.serviceprovider.com'}, I get an HTTP 403 error instead. My guess, which I can't demonstrate, is that requests sends the original http GET, receives the HTTP 301 redirect, and then issues the GET for the redirected https:// page while still using the Host header for clicks.serviceprovider.com rather than www.serviceprovider.com from the redirected URL.

How can I tell requests to change the Host header with the redirect?
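
One possible approach, sketched here under assumptions rather than confirmed against this particular server: requests normally derives the Host header from the URL it is fetching, while an explicitly supplied headers dict is reused on every redirect hop, which would be consistent with the 403 described above. Following the redirects by hand, without setting Host at all, lets each hop get a Host that matches its own URL. The helper name follow_redirects, its parameters, and the default timeout are hypothetical, chosen only for illustration; the session argument is assumed to be a requests.Session (anything with a compatible get method works).

```python
from urllib.parse import urljoin

# Status codes treated as redirects in this sketch.
REDIRECT_CODES = {301, 302, 303, 307, 308}


def follow_redirects(session, url, max_hops=10, timeout=10):
    """Follow redirects one hop at a time.

    Returns (final_response, hops), where hops is a list of
    (status_code, url) tuples in the order they were requested.
    """
    hops = []
    for _ in range(max_hops):
        # No explicit Host header is passed: with a requests.Session,
        # Host is derived from `url` separately for every hop.
        resp = session.get(url, allow_redirects=False, timeout=timeout)
        hops.append((resp.status_code, url))
        location = resp.headers.get("Location")
        if resp.status_code not in REDIRECT_CODES or not location:
            return resp, hops
        # Location may be relative, so resolve it against the current URL.
        url = urljoin(url, location)
    raise RuntimeError(f"gave up after {max_hops} redirects")
```

With requests installed, this could be called as follow_redirects(requests.Session(), url). The returned hops list shows which URL produced which status code, and the timeout turns a server that never responds into a requests.exceptions.Timeout instead of an indefinite hang, which may help confirm or rule out the Host header theory.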



1 Reply

Waiting for an expert to answer.
