How to set parameters for a Scrapy spider?

Scrapy spider parameters are configured in the project's settings.py file. Here are some commonly used settings:

1. ROBOTSTXT_OBEY: whether to respect the target site's robots.txt rules; set it to False to ignore them. Default in a generated project is True.
2. DOWNLOAD_DELAY: the delay in seconds between consecutive requests, used to avoid putting excessive load on the site. Default is 0 (no delay).
3. USER_AGENT: the User-Agent header sent with requests; set it to a browser string to simulate different browsers. The default identifies the client as Scrapy.
4. COOKIES_ENABLED: whether cookies are enabled. Keep it True (the default) if the site requires login or session cookies; set it to False to disable cookies.
5. CONCURRENT_REQUESTS: the maximum number of requests Scrapy sends concurrently. Default is 16.
6. DOWNLOAD_TIMEOUT: how long the downloader waits before timing out a request. Default is 180 seconds.
7. CONCURRENT_REQUESTS_PER_DOMAIN: the maximum number of concurrent requests allowed per domain. Default is 8.
8. ITEM_PIPELINES: the pipelines that process scraped items. Default is empty; set it when you define custom pipelines.
9. LOG_LEVEL: the logging level, one of 'CRITICAL', 'ERROR', 'WARNING', 'INFO', or 'DEBUG'. Default is 'DEBUG'.
10. DEPTH_LIMIT: the maximum crawl depth; links beyond this depth are not followed. Default is 0 (unlimited).

These are only the most common settings; many other parameters can be configured in settings.py as needed. A sketch of such a file follows below.
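For illustration, here is a minimal settings.py sketch using the settings above. The project name myproject and the pipeline class MyPipeline are hypothetical placeholders, and the values are examples rather than recommendations:

```python
# settings.py -- illustrative values only, adjust to your project

BOT_NAME = "myproject"  # hypothetical project name

ROBOTSTXT_OBEY = False                 # ignore robots.txt restrictions
DOWNLOAD_DELAY = 2                     # wait 2 seconds between requests
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"  # simulate a browser
COOKIES_ENABLED = True                 # keep cookies for sites needing a session
CONCURRENT_REQUESTS = 32               # total simultaneous requests
DOWNLOAD_TIMEOUT = 60                  # give up on a request after 60 seconds
CONCURRENT_REQUESTS_PER_DOMAIN = 4     # cap per-domain concurrency
ITEM_PIPELINES = {
    # hypothetical pipeline class; the integer (0-1000) sets execution order
    "myproject.pipelines.MyPipeline": 300,
}
LOG_LEVEL = "INFO"                     # suppress DEBUG-level noise
DEPTH_LIMIT = 3                        # do not follow links deeper than 3 levels
```

Any of these can also be overridden for a single run on the command line, e.g. `scrapy crawl myspider -s DOWNLOAD_DELAY=5` (where myspider is a hypothetical spider name).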
