I am new to Scrapy and am trying to run a crawler on a few websites. My allowed domains and start URLs look like this:

    allowed_domains = ['www.siemens.com']
    start_urls = ['https://www.siemens.com/']

The problem is that the website also contains links to other domains, such as "siemens.fr" and "siemens.de", and I don't want Scrapy to crawl those sites as well. Any suggestions on how to tell the spider not to crawl them?

I am trying to build a more general spider, so that it can also be applied to other websites.
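
To make the intended rule concrete, here is a minimal sketch in plain Python of the "stay on this domain" check described above. It is not Scrapy's own implementation; the helper name `is_onsite` and the choice of `siemens.com` (without `www.`) as the allowed domain are assumptions made for illustration.

```python
from urllib.parse import urlparse


def is_onsite(url, allowed_domains):
    """Return True if the URL's host is an allowed domain or one of its subdomains.

    Hypothetical helper for illustration only, not part of Scrapy.
    """
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain.lower() or host.endswith("." + domain.lower())
        for domain in allowed_domains
    )


# Assumed allowed domain: the bare domain, so that subdomains such as "www." also match.
allowed = ["siemens.com"]

print(is_onsite("https://www.siemens.com/", allowed))  # same site
print(is_onsite("https://siemens.fr/", allowed))       # different top-level domain
```

Under this rule a link to `siemens.fr` or `siemens.de` would be treated as offsite, while any `*.siemens.com` host would still be followed.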
Update #2
As suggested by Felix Eklöf, I adjusted my code and changed some settings. This is what the code looks like now.
The spider:
    class webSpider(scrapy.Spider):
        name = 'web'
        allowed_domains = ['eaton.com']
        start_urls = ['https://www.eaton.com/us/']
        # include_patterns = ['']
        exclude_patterns = ['.*.(css|js|gif|jpg|jpeg|png)']
        #proxies = 'proxies.txt'
        response_type_whitelist = ['text/html']
        # response_type_blacklist = []
        rules = [Rule(LinkExtractor(allow = (allowed_domains)), callback='parse_item', follow=True)]
And the settings look like this:
    SPIDER_MIDDLEWARES = {
        'smartspider.middlewares.SmartspiderSpiderMiddleware': 543,
        #'scrapy_testmaster.TestMasterMiddleware': 950
    }
    # Enable or disable downloader middlewares
    # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
    DOWNLOADER_MIDDLEWARES = {
        'smartspider.middlewares.SmartspiderDownloaderMiddleware': 543,
        'smartspider.middlewares.FilterResponses': 543,
        'smartspider.middlewares.RandomProxyForReDirectedUrls': 650,
        "scrapy.spidermiddlewares.offsite.OffsiteMiddleware": 543
    }
    ITEM_PIPELINES = {
        'smartspider.pipelines.SmartspiderPipeline': 300,
    }
Please let me know if any of these settings could be preventing the spider from following only internal links and staying within the given domain.
Update #3
As suggested by @Felix, I updated the spider, which now looks like this:
    class WebSpider(CrawlSpider):
        name = 'web'
        allowed_domains = ['eaton.com']
        start_urls = ['https://www.eaton.com/us/']
        # include_patterns = ['']
        exclude_patterns = ['.*.(css|js|gif|jpg|jpeg|png)']
        response_type_whitelist = ['text/html']
        rules = [Rule(LinkExtractor(), callback='parse_item', follow=True)]
The settings look like this:
    #SPIDER_MIDDLEWARES = {
    #    'smartspider.middlewares.SmartspiderSpiderMiddleware': 543,
    #}
    # Enable or disable downloader middlewares
    # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
    DOWNLOADER_MIDDLEWARES = {
        # 'smartspider.middlewares.SmartspiderDownloaderMiddleware': 543,
        'smartspider.middlewares.FilterResponses': 543,
        'smartspider.middlewares.RandomProxyForReDirectedUrls': 650,
    }
    # Enable or disable extensions
    # See https://docs.scrapy.org/en/latest/topics/extensions.html
    #EXTENSIONS = {
    #    'scrapy.extensions.telnet.TelnetConsole': None,
    #}
    # Configure item pipelines
    # See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
    #ITEM_PIPELINES = {
    #    'smartspider.pipelines.SmartspiderPipeline': 300,
    #}
But the spider is still crawling other domains.
However, when I run it against another website (thalia.de), the logs show that it is filtering offsite requests:
2021-01-04 19:46:42 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.rtbhouse.com': <GET https://www.rtbhouse.com/privacy-center/>
2021-01-04 19:46:42 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.quicklizard.com': <GET https://www.quicklizard.com/terms-of-service/>
2021-01-04 19:46:42 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.thalia.de/shop/hilfe-gutschein/show/> (referer: https://www.thalia.de/)
2021-01-04 19:46:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.thalia.de/shop/hilfe-kaufen/show/> (referer: https://www.thalia.de/)
2021-01-04 19:46:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.thalia.de/shop/home/login/login/?source=%2Fde.buch.shop%2Fshop%2F2%2Fhome%2Fkundenbewertung%2Fschreiben%3Fartikel%3D149426569&jumpId=2610518> (referer: https://www.thalia.de/shop/home/artikeldetails/ID149426569.html)
2021-01-04 19:46:43 [scrapy.extensions.logstats] INFO: Crawled 453 pages (at 223 pages/min), scraped 0 items (at 0 items/min)
2021-01-04 19:46:43 [scrapy.spidermiddlewares.depth] DEBUG: Ignoring link (depth > 2): https://www.thalia.de/shop/home/show/
Is the spider working as expected, or is the problem specific to a particular website?
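
One way to check this from the logs is to tally which hosts were actually crawled and which were filtered as offsite. The sketch below is a plain-Python helper, not a Scrapy feature. The function name `tally_hosts` and the two regular expressions are assumptions based only on the log line formats shown above.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Patterns inferred from the log excerpt above (assumed formats, not an official spec).
CRAWLED = re.compile(r"Crawled \(\d+\) <GET (\S+)>")
FILTERED = re.compile(r"Filtered offsite request to '([^']+)'")


def tally_hosts(log_lines):
    """Count crawled hosts and filtered offsite hosts in Scrapy debug log lines."""
    crawled, filtered = Counter(), Counter()
    for line in log_lines:
        match = CRAWLED.search(line)
        if match:
            crawled[urlparse(match.group(1)).hostname] += 1
            continue
        match = FILTERED.search(line)
        if match:
            filtered[match.group(1)] += 1
    return crawled, filtered


# Two lines copied from the log excerpt in the question.
sample = [
    "2021-01-04 19:46:42 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite "
    "request to 'www.rtbhouse.com': <GET https://www.rtbhouse.com/privacy-center/>",
    "2021-01-04 19:46:42 [scrapy.core.engine] DEBUG: Crawled (200) "
    "<GET https://www.thalia.de/shop/hilfe-gutschein/show/> (referer: https://www.thalia.de/)",
]

crawled, filtered = tally_hosts(sample)
print("crawled:", dict(crawled))
print("filtered:", dict(filtered))
```

If every host in the crawled tally belongs to the allowed domain, the offsite filtering is working for that site. Any other host appearing there would point to requests that bypass the filter.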