How to use XPath to parse HTML in Python?
To parse HTML using XPath, you can use the lxml library in Python. Here is a simple example:
- First, make sure you have installed the lxml library. You can install it using the following command:
pip install lxml
- Import the lxml library and requests library in Python code (used for fetching HTML pages).
import requests
from lxml import etree
- Obtain the content of an HTML page using the requests library.
url = 'https://example.com' # 要解析的网页URL
response = requests.get(url)
html = response.text
- Convert HTML content into a parseable object using the etree module from lxml.
tree = etree.HTML(html)
- The path in XML documents used to navigate and locate specific elements.
# 例如,获取所有的标题元素
titles = tree.xpath('//h1')
- Iterating through the returned list of elements and extracting the necessary content.
# 例如,提取所有标题的文本内容
for title in titles:
print(title.text)
By following the steps above, you can use XPath to parse HTML and extract the desired content. In XPath expressions, various syntax can be used to locate elements, such as tag names, attributes, hierarchical relationships, etc. Specific XPath syntax can be found in XPath tutorials.