How to extract specified content in bulk from Word usin…

2 years ago

Isabella Edwards

2 minutes

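To extract specific content from multiple Word documents in bulk, you can use the python-docx library for Python (installed with pip install python-docx and imported as docx). Here is a simple example that collects every paragraph containing a keyword from a single document: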

from docx import Document

def extract_content_from_docx(file_path, keyword):
    doc = Document(file_path)
    extracted_content = []

    for paragraph in doc.paragraphs:
        if keyword in paragraph.text:
            extracted_content.append(paragraph.text)

    return extracted_content

# Example usage
file_path = "path/to/your/document.docx"
keyword = "specified content"
content = extract_content_from_docx(file_path, keyword)
for paragraph in content:
    print(paragraph)

In the example above, we first import the Document class from the docx package. Next, we define the function extract_content_from_docx, which takes two parameters: file_path (the path to the Word document) and keyword (the text a paragraph must contain in order to be extracted).

Within the function, we use the Document class to load the Word document at file_path and create an empty list, extracted_content, to hold the matching paragraphs.

Next, we iterate through each paragraph in the document (obtained through the doc.paragraphs attribute) and check whether the paragraph's text contains the keyword. This is a plain, case-sensitive substring test. If the keyword is found, we add the text of that paragraph to the extracted_content list.
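
Because the check in the loop is case-sensitive, a keyword such as "invoice" would not match a paragraph containing "Invoice". Here is a minimal sketch of a more forgiving check, assuming case-insensitive matching is what you want. The helper name paragraph_matches is hypothetical and is not part of python-docx:

```python
def paragraph_matches(text, keyword, ignore_case=True):
    """Return True if keyword occurs in text (hypothetical helper)."""
    if not keyword:
        # An empty keyword would match every paragraph; treat it as no match.
        return False
    if ignore_case:
        # casefold() is a stronger form of lower() meant for caseless comparison.
        return keyword.casefold() in text.casefold()
    return keyword in text
```

Inside extract_content_from_docx, the condition would then read: if paragraph_matches(paragraph.text, keyword).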

Finally, we return the extracted_content list as the extracted result.

In the example usage, we provide the path of the Word document to be processed and the keyword to search for. Then we call the extract_content_from_docx function, loop over the extracted paragraphs, and print each one.
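
The example handles one file at a time. To process documents in bulk, the same function can be applied to every .docx file in a folder. The following is a sketch under some assumptions: all documents sit directly in one folder, and the helper name extract_from_folder and its extractor parameter are hypothetical names introduced here for illustration.

```python
from pathlib import Path


def extract_content_from_docx(file_path, keyword):
    # Same logic as the single-file example. python-docx is imported here so
    # the folder helper below can also be called with a different extractor.
    from docx import Document

    doc = Document(str(file_path))
    return [p.text for p in doc.paragraphs if keyword in p.text]


def extract_from_folder(folder, keyword, extractor=extract_content_from_docx):
    """Map each .docx file name in folder to its matching paragraphs."""
    results = {}
    for path in sorted(Path(folder).glob("*.docx")):
        # Word creates temporary lock files such as "~$report.docx" while a
        # document is open; they are not valid documents, so skip them.
        if path.name.startswith("~$"):
            continue
        matches = extractor(path, keyword)
        if matches:
            # Only files with at least one match appear in the result.
            results[path.name] = matches
    return results
```

Calling extract_from_folder("path/to/your/folder", "specified content") returns a dictionary keyed by file name, which you can loop over and print in the same way as the single-file result. Subfolders are not searched in this sketch.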

Please note that the code above is only a basic example. It looks only at the body paragraphs returned by doc.paragraphs, so text inside tables, headers, and footers is not searched. In real applications, you will likely need to adjust and optimize the extraction logic to fit your specific requirements.
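
One common adjustment is to include table text, since doc.paragraphs returns only body paragraphs. The sketch below walks the tables as well, using the tables, rows, cells, and paragraphs attributes that python-docx exposes. The function names iter_document_text and extract_including_tables are hypothetical, and doc is assumed to be the object returned by Document(file_path).

```python
def iter_document_text(doc):
    """Yield paragraph text from the body, then from every top-level table cell."""
    for paragraph in doc.paragraphs:
        yield paragraph.text
    for table in doc.tables:
        for row in table.rows:
            for cell in row.cells:
                # Tables nested inside a cell are not visited in this sketch.
                for paragraph in cell.paragraphs:
                    yield paragraph.text


def extract_including_tables(doc, keyword):
    # Keep every piece of text that contains the keyword.
    return [text for text in iter_document_text(doc) if keyword in text]
```

Body paragraphs come first and table text second, so the result does not preserve the order in which paragraphs and tables appear in the document.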

#Development #guide #programming #technology #tutorial