How to crawl the entire content of a webpage using XPath?

2 years ago

Jackson Davis

To scrape the entire content of a webpage with XPath in Python, first send an HTTP request with the requests library to retrieve the page source, then parse that source with the lxml library, and finally extract the content you need with XPath expressions.

Here is example code that uses XPath to scrape the full content of a webpage:

import requests
from lxml import etree

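# Send an HTTP request to fetch the page source
url = 'http://example.com'
response = requests.get(url, timeout=10)  # time out instead of hanging forever
response.raise_for_status()  # stop early on an HTTP error status (4xx/5xx)
html = response.text

# Parse the page source into an element tree
tree = etree.HTML(html)

# Use an XPath expression to select the page content
content = tree.xpath('//*')  # "*" matches every element (tag) in the page

# Print each selected element, serialized with its attributes and descendants
for tag in content:
    print(etree.tostring(tag, encoding='utf-8').decode('utf-8'))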

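Running the code above fetches the page and prints every element it contains, one element per loop iteration. Note that etree.tostring serializes each element together with its attributes and all of its descendants, so nested content appears more than once in the output: first inside each of its ancestors, then on its own. To serialize the page only once, select just the root element (for example with '/*'); to get the text without markup, select text nodes with an expression such as '//text()'. Depending on the structure of the page, you may need more specific XPath expressions to extract exactly the content you want.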
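
As a rough illustration of more targeted XPath expressions, here is a minimal sketch that pulls out the title, the visible text, and the link targets from a page. The SAMPLE_HTML string and the extract_page helper are hypothetical names invented for this example (the page is inlined so the sketch runs without a network request); only etree.HTML, xpath, and etree.tostring come from lxml.

```python
from lxml import etree

# Hypothetical sample page, inlined so the sketch needs no network access.
SAMPLE_HTML = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <h1 class="headline">Hello</h1>
    <p>First paragraph with a <a href="/docs">link</a>.</p>
    <p>Second paragraph.</p>
  </body>
</html>
"""


def extract_page(html):
    """Hypothetical helper: collect a few views of a page with XPath."""
    tree = etree.HTML(html)

    # '/*' selects only the root element, so the page is serialized once.
    root = tree.xpath('/*')[0]
    markup = etree.tostring(root, encoding='unicode', method='html')

    # '//body//text()' selects text nodes; drop whitespace-only ones.
    texts = [t.strip() for t in tree.xpath('//body//text()') if t.strip()]

    # '@name' selects attribute values rather than elements.
    links = tree.xpath('//a/@href')
    heading_classes = tree.xpath('//h1/@class')

    # string(...) returns the text content of the first matching node.
    title = tree.xpath('string(//title)')

    return {
        'markup': markup,
        'texts': texts,
        'links': links,
        'heading_classes': heading_classes,
        'title': title,
    }


result = extract_page(SAMPLE_HTML)
print(result['title'])
print(result['texts'])
print(result['links'])
```

To apply the same idea to a live page, the HTML string could come from response.text of a requests.get call instead of SAMPLE_HTML; the expressions themselves would need adjusting to that page's structure.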