docx_parser_converter.docx_to_txt.docx_processor module

class docx_parser_converter.docx_to_txt.docx_processor.DocxProcessor[source]

Bases: object

Class to process the DOCX file and extract the schemas.

static get_default_numbering_schema() NumberingSchema[source]

Returns the default numbering schema.

Returns:

The default numbering schema.

Return type:

NumberingSchema

Example

default_numbering_schema = DocxProcessor.get_default_numbering_schema()

static get_default_styles_schema() StylesSchema[source]

Returns the default styles schema.

Returns:

The default styles schema.

Return type:

StylesSchema

Example

default_styles_schema = DocxProcessor.get_default_styles_schema()

static process_docx(source: bytes | Dict[str, str]) tuple[DocumentSchema, StylesSchema, NumberingSchema][source]

Process the DOCX file or XML content to extract document, styles, and numbering schemas.

Parameters:

source (Union[bytes, Dict[str, str]]) –

Either:

  • The DOCX file content as bytes, or

  • A dictionary containing XML content as strings with keys 'document', 'styles', and 'numbering'.

Returns:

A tuple containing DocumentSchema, StylesSchema, and NumberingSchema.

Return type:

tuple[DocumentSchema, StylesSchema, NumberingSchema]

Raises:
  • Exception – If document.xml cannot be parsed.

  • ValueError – If the source dictionary is missing required keys.

Example

Given either the bytes of a DOCX file or a dictionary of its XML parts, this method returns the corresponding schemas:

# Using bytes from a DOCX file
docx_file = read_binary_from_file_path('path_to_docx_file.docx')
document_schema, styles_schema, numbering_schema = DocxProcessor.process_docx(docx_file)

# Using XML strings
xml_content = {
    'document': '<w:document>...</w:document>',
    'styles': '<w:styles>...</w:styles>',
    'numbering': '<w:numbering>...</w:numbering>'
}
document_schema, styles_schema, numbering_schema = DocxProcessor.process_docx(xml_content)
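
The dictionary form above can also be built from a real file. A DOCX file is a ZIP archive whose main parts conventionally live at word/document.xml, word/styles.xml, and word/numbering.xml. The following is a minimal sketch, not part of this module: extract_xml_parts and PART_PATHS are hypothetical names, and the parts are assumed to be UTF-8 encoded.

```python
import io
import zipfile

# Hypothetical mapping from the dictionary keys process_docx expects
# to the conventional archive paths inside a DOCX file.
PART_PATHS = {
    "document": "word/document.xml",
    "styles": "word/styles.xml",
    "numbering": "word/numbering.xml",
}


def extract_xml_parts(docx_bytes: bytes) -> dict[str, str]:
    """Hypothetical helper: read the three XML parts from DOCX bytes as strings."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        # ZipFile.read raises KeyError when a member is absent from the archive.
        return {
            key: archive.read(path).decode("utf-8")  # assumes UTF-8 parts
            for key, path in PART_PATHS.items()
        }
```

The returned dictionary has the shape shown for xml_content above, so it could be passed to DocxProcessor.process_docx. Documents that contain no lists may lack word/numbering.xml, in which case this sketch raises KeyError instead of returning a partial dictionary.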