What is the usage of CrawlSpider in Python?
CrawlSpider in the Scrapy framework is a higher-level spider class that offers a more convenient way to write web scrapers, and it is especially suited to websites where you need to follow links from page to page.
To use CrawlSpider, you’ll need to create a new spider class that inherits from CrawlSpider and define some rules for specifying how to follow links and extract data. Here is a simple example:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'myspider'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        Rule(LinkExtractor(allow=(r'category\.php',)), callback='parse_category'),
        Rule(LinkExtractor(allow=(r'item\.php',)), callback='parse_item'),
    )

    def parse_category(self, response):
        # Handle the response for a category page
        pass

    def parse_item(self, response):
        # Handle the response for an item (product) page
        pass
In the example above, allowed_domains specifies the domains the spider is allowed to crawl, while start_urls specifies the URLs where crawling begins.
rules is a tuple containing a series of Rule objects, where each rule consists of a LinkExtractor and a callback function. The LinkExtractor defines which links should be followed, and it can use regular expressions to match URLs. The callback function handles the response downloaded for each matched link.
In the example above, the first rule matches all links containing “category.php” and passes the response to the parse_category method for processing. The second rule matches all links containing “item.php” and passes the response to the parse_item method for processing.
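A callback like parse_item typically extracts data from the response and yields it as an item. The sketch below illustrates this; the CSS selectors and field names are hypothetical and would need to match the actual markup of the site being scraped:

    def parse_item(self, response):
        # Hypothetical selectors -- adjust them to the real page structure
        yield {
            'title': response.css('h1.product-title::text').get(),
            'price': response.css('span.price::text').get(),
            'url': response.url,
        }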
The above is the basic usage of CrawlSpider. You can define more rules and callback functions as needed to handle different types of links and data.
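For instance, Rule also accepts a follow argument that controls whether CrawlSpider keeps extracting links from the pages matched by that rule (when a callback is given, follow defaults to False). The sketch below extends the hypothetical rules above so that category pages are both parsed and followed further:

    rules = (
        # Parse category pages and also keep following links found on them
        Rule(LinkExtractor(allow=(r'category\.php',)), callback='parse_category', follow=True),
        # Item pages are only parsed, not followed further
        Rule(LinkExtractor(allow=(r'item\.php',)), callback='parse_item'),
    )

The spider is then run as usual, for example with scrapy crawl myspider.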