python - Stop Scrapy crawler from external domains

I am new to Scrapy and trying to run a crawler on a few websites. My allowed domain and start URL look like this:

allowed_domains = ['www.siemens.com']
start_urls = ['https://www.siemens.com/']

The problem is that the website also contains links to other domains, such as siemens.fr and siemens.de, and I don't want Scrapy to scrape those as well. Any suggestion on how to tell the spider not to crawl them? I am trying to build a more general spider, so that it is applicable to other websites too.
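(For context: Scrapy's built-in offsite filtering is driven by allowed_domains; requests whose host is not covered by that list are dropped by the default-enabled OffsiteMiddleware. A minimal sketch of the intended setup, with illustrative spider and callback names:)

import scrapy

class SiemensSpider(scrapy.Spider):
    # Hypothetical minimal spider: the offsite middleware drops requests
    # to hosts outside allowed_domains (subdomains remain allowed).
    name = 'siemens'
    allowed_domains = ['siemens.com']  # registered domain only, no scheme or path
    start_urls = ['https://www.siemens.com/']

    def parse(self, response):
        # Follow every in-page link; links to siemens.fr / siemens.de
        # are filtered out before they are downloaded.
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse)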

Update #2

As suggested by Felix Eklöf, I tried to adjust my code and change some settings. This is what the code looks like now.

The spider

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

class webSpider(scrapy.Spider):
    name = 'web'
    allowed_domains = ['eaton.com']
    start_urls = ['https://www.eaton.com/us/']

    # include_patterns = ['']
    exclude_patterns = ['.*\.(css|js|gif|jpg|jpeg|png)']
    # proxies = 'proxies.txt'
    response_type_whitelist = ['text/html']
    # response_type_blacklist = []
    rules = [Rule(LinkExtractor(allow=allowed_domains), callback='parse_item', follow=True)]

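(Note that LinkExtractor's allow argument expects regular expressions, so passing allowed_domains there treats 'eaton.com' as a pattern in which the dot matches any character. For domain-based filtering, the extractor has a dedicated allow_domains parameter; a sketch of how that rule might look:)

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

# Sketch: filter extracted links by domain instead of by regex.
rules = [
    Rule(
        LinkExtractor(allow_domains=['eaton.com']),  # domain filter, not a pattern
        callback='parse_item',
        follow=True,
    ),
]

Also note that rules are only honoured by CrawlSpider, which Update #3 below switches to.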
And the settings look like this:

SPIDER_MIDDLEWARES = {
    'smartspider.middlewares.SmartspiderSpiderMiddleware': 543,
    # 'scrapy_testmaster.TestMasterMiddleware': 950
}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'smartspider.middlewares.SmartspiderDownloaderMiddleware': 543,
    'smartspider.middlewares.FilterResponses': 543,
    'smartspider.middlewares.RandomProxyForReDirectedUrls': 650,
    'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': 543,
}

ITEM_PIPELINES = {
    'smartspider.pipelines.SmartspiderPipeline': 300,
}

Please let me know if any of these settings are interfering with the spider sticking to internal links and staying within the given domain.
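(One thing worth checking in these settings: as of Scrapy 2.x, scrapy.spidermiddlewares.offsite.OffsiteMiddleware is a spider middleware and is enabled by default, so listing it under DOWNLOADER_MIDDLEWARES as above most likely has no effect. If it were configured explicitly, it would belong here:)

# Sketch: OffsiteMiddleware belongs in SPIDER_MIDDLEWARES; it is already
# enabled by default at priority 500, so this entry is usually redundant.
SPIDER_MIDDLEWARES = {
    'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': 500,
}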

Update #3

As suggested by @Felix, I updated the spider, which now looks like this:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class WebSpider(CrawlSpider):
    name = 'web'
    allowed_domains = ['eaton.com']
    start_urls = ['https://www.eaton.com/us/']
    # include_patterns = ['']
    exclude_patterns = ['.*\.(css|js|gif|jpg|jpeg|png)']

    response_type_whitelist = ['text/html']
    rules = [Rule(LinkExtractor(), callback='parse_item', follow=True)]

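(parse_item is referenced by the rule but not shown anywhere in the snippets; a minimal hypothetical version, living inside the spider class and just recording each crawled page, might look like this:)

    def parse_item(self, response):
        # Hypothetical callback: yield one item per crawled page.
        yield {
            'url': response.url,
            'title': response.css('title::text').get(),
        }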
The settings now look like this:

#SPIDER_MIDDLEWARES = {
#    'smartspider.middlewares.SmartspiderSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    # 'smartspider.middlewares.SmartspiderDownloaderMiddleware': 543,
    'smartspider.middlewares.FilterResponses': 543,
    'smartspider.middlewares.RandomProxyForReDirectedUrls': 650,
}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'smartspider.pipelines.SmartspiderPipeline': 300,
#}

But the spider is still crawling other domains.

However, when run against another website (thalia.de), the logs show that offsite requests are being filtered:

2021-01-04 19:46:42 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.rtbhouse.com': <GET https://www.rtbhouse.com/privacy-center/>
2021-01-04 19:46:42 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.quicklizard.com': <GET https://www.quicklizard.com/terms-of-service/>
2021-01-04 19:46:42 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.thalia.de/shop/hilfe-gutschein/show/> (referer: https://www.thalia.de/)
2021-01-04 19:46:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.thalia.de/shop/hilfe-kaufen/show/> (referer: https://www.thalia.de/)
2021-01-04 19:46:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.thalia.de/shop/home/login/login/?source=%2Fde.buch.shop%2Fshop%2F2%2Fhome%2Fkundenbewertung%2Fschreiben%3Fartikel%3D149426569&jumpId=2610518> (referer: https://www.thalia.de/shop/home/artikeldetails/ID149426569.html)
2021-01-04 19:46:43 [scrapy.extensions.logstats] INFO: Crawled 453 pages (at 223 pages/min), scraped 0 items (at 0 items/min)
2021-01-04 19:46:43 [scrapy.spidermiddlewares.depth] DEBUG: Ignoring link (depth > 2): https://www.thalia.de/shop/home/show/ 

Is the spider working as expected, or is the problem with a specific website?
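(Side note: the last log line, "Ignoring link (depth > 2)", comes from DepthMiddleware, not from offsite filtering; it indicates that a depth limit of 2 is active, presumably via this setting:)

# 0 means no limit; any positive value caps how deep the crawl goes.
DEPTH_LIMIT = 2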


1 Reply


Try removing "www." from allowed_domains.

According to the Scrapy docs, you should do it like this:

Let’s say your target url is https://www.example.com/1.html, then add 'example.com' to the list.

So, in your case:

allowed_domains = ['siemens.com']
start_urls = ['https://www.siemens.com/']
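With allowed_domains = ['siemens.com'], subdomains such as www.siemens.com are still crawled, while separately registered domains like siemens.fr and siemens.de are filtered by OffsiteMiddleware. Combined with the CrawlSpider from the updates, a sketch (class and callback names are illustrative):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class SiemensSpider(CrawlSpider):
    name = 'siemens'
    allowed_domains = ['siemens.com']        # no 'www.', no scheme
    start_urls = ['https://www.siemens.com/']

    # Follow every link; the offsite middleware drops anything
    # whose host is not siemens.com or a subdomain of it.
    rules = [Rule(LinkExtractor(), callback='parse_item', follow=True)]

    def parse_item(self, response):
        yield {'url': response.url}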
