docx_parser_converter.docx_to_txt.docx_processor module
- class docx_parser_converter.docx_to_txt.docx_processor.DocxProcessor[source]
Bases:
object
Class to process the DOCX file and extract the schemas.
- static get_default_numbering_schema() NumberingSchema[source]
Returns the default numbering schema.
- Returns:
The default numbering schema.
- Return type:
NumberingSchema
Example
default_numbering_schema = DocxProcessor.get_default_numbering_schema()
- static get_default_styles_schema() StylesSchema[source]
Returns the default styles schema.
- Returns:
The default styles schema.
- Return type:
StylesSchema
Example
default_styles_schema = DocxProcessor.get_default_styles_schema()
- static process_docx(source: bytes | Dict[str, str]) tuple[DocumentSchema, StylesSchema, NumberingSchema][source]
Process the DOCX file or XML content to extract document, styles, and numbering schemas.
- Parameters:
source (Union[bytes, Dict[str, str]]) –
Either:
- The DOCX file content as bytes, or
- A dictionary containing XML content as strings with keys: 'document', 'styles', and 'numbering'
- Returns:
A tuple containing DocumentSchema, StylesSchema, and NumberingSchema.
- Return type:
tuple[DocumentSchema, StylesSchema, NumberingSchema]
- Raises:
Exception – If the document.xml cannot be parsed.
ValueError – If the source dictionary is missing required keys.
Example
Given a DOCX file, this method processes the file and returns the corresponding schemas:
# Using bytes from a DOCX file
docx_file = read_binary_from_file_path('path_to_docx_file.docx')
document_schema, styles_schema, numbering_schema = DocxProcessor.process_docx(docx_file)

# Using XML strings
xml_content = {
    'document': '<w:document>...</w:document>',
    'styles': '<w:styles>...</w:styles>',
    'numbering': '<w:numbering>...</w:numbering>'
}
document_schema, styles_schema, numbering_schema = DocxProcessor.process_docx(xml_content)
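For the dictionary form of source, the XML strings have to be obtained first. A DOCX file is a ZIP archive, and the three parts conventionally live at word/document.xml, word/styles.xml, and word/numbering.xml. The following is a minimal sketch, not part of this module, of reading them with the standard library. The names extract_xml_parts and PART_PATHS are hypothetical, and the sketch assumes the parts are UTF-8 encoded and that all three are present in the archive.

```python
import io
import zipfile

# Conventional OOXML part paths inside a .docx archive, mapped to the
# dictionary keys documented for process_docx.
PART_PATHS = {
    "document": "word/document.xml",
    "styles": "word/styles.xml",
    "numbering": "word/numbering.xml",
}


def extract_xml_parts(docx_bytes: bytes) -> dict[str, str]:
    """Hypothetical helper: read the three XML parts out of DOCX bytes."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        names = set(archive.namelist())
        # Report every absent part at once instead of failing on the first.
        missing = [key for key, path in PART_PATHS.items() if path not in names]
        if missing:
            raise ValueError(f"DOCX archive is missing parts: {missing}")
        # Assumption: the parts are UTF-8 encoded, as Word normally writes them.
        return {
            key: archive.read(path).decode("utf-8")
            for key, path in PART_PATHS.items()
        }
```

The returned dictionary has the keys 'document', 'styles', and 'numbering', so it should be usable as the source argument of DocxProcessor.process_docx. Documents without lists may omit word/numbering.xml, in which case this sketch raises ValueError before the processor is called.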