docx_parser_converter.docx_parsers.document.document_parser module

class docx_parser_converter.docx_parsers.document.document_parser.DocumentParser(source: bytes | str | None = None)

Bases: object

Parses the main document.xml part of a DOCX file.

This class extracts the document.xml part from a DOCX file and parses it into a structured DocumentSchema.
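For orientation, a DOCX file is a ZIP archive and the main document part is stored at word/document.xml. The sketch below shows one way to read that part using only the standard library. It does not use DocumentParser itself; the helper name read_document_xml and the in-memory demo archive are illustrative assumptions, not part of this package.

```python
import io
import zipfile


def read_document_xml(docx_bytes: bytes) -> bytes:
    """Return the raw bytes of word/document.xml from a DOCX archive.

    Hypothetical helper for illustration; not part of docx_parser_converter.
    """
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        return archive.read("word/document.xml")


# Build a tiny in-memory archive for demonstration.
# It only mimics the DOCX layout and is not a valid Word file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", "<w:document/>")

demo_xml = read_document_xml(buffer.getvalue())
```

How DocumentParser locates and loads this part internally is not described on this page, so treat the snippet as background rather than as the class's implementation.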

extract_elements() → List[Paragraph | Table]

Extracts elements (paragraphs and tables) from the document XML.

Returns:

The list of extracted elements.

Return type:

List[Union[Paragraph, Table]]

Example

The following is an example of the body element in a document.xml file:

<w:body>
    <w:p>
        <!-- Paragraph properties and content here -->
    </w:p>
    <w:tbl>
        <!-- Table properties and content here -->
    </w:tbl>
</w:body>
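
As a rough illustration of what walking the body element involves, the sketch below iterates over the direct children of w:body with the standard library and labels each one as a paragraph or a table, preserving document order. The function list_body_elements and the sample XML are assumptions made for this example; the library's own method returns Paragraph and Table models rather than strings.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, bound to the "w" prefix in document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

SAMPLE = f"""<w:document xmlns:w="{W_NS}">
  <w:body>
    <w:p/>
    <w:tbl/>
    <w:p/>
  </w:body>
</w:document>"""


def list_body_elements(xml_text: str) -> list[str]:
    """Label each direct child of w:body as 'paragraph' or 'table'.

    Hypothetical helper for illustration; other child elements are skipped.
    """
    root = ET.fromstring(xml_text)
    body = root.find(f"{{{W_NS}}}body")
    if body is None:
        return []
    kinds = []
    for child in body:
        if child.tag == f"{{{W_NS}}}p":
            kinds.append("paragraph")
        elif child.tag == f"{{{W_NS}}}tbl":
            kinds.append("table")
    return kinds


body_kinds = list_body_elements(SAMPLE)
```
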
extract_margins() → DocMargins | None

Extracts margins from the document XML.

Returns:

The extracted margins or None if not found.

Return type:

Optional[DocMargins]

Example

The following is an example of the section properties with margins in a document.xml file:

<w:sectPr>
    <w:pgMar w:left="1134" w:right="1134" w:gutter="0" w:header="0" w:top="1134" w:footer="0" w:bottom="1134"/>
</w:sectPr>
get_document_schema() → DocumentSchema

Gets the parsed document schema.

Returns:

The document schema.

Return type:

DocumentSchema

parse() → DocumentSchema

Parses the document XML into a DocumentSchema.

Returns:

The parsed document schema.

Return type:

DocumentSchema