feat(uploads): add pymupdf4llm PDF converter with auto-fallback and async offload (#1727)
* feat(uploads): add pymupdf4llm PDF converter with auto-fallback and async offload - Introduce pymupdf4llm as an optional PDF converter with better heading detection and table preservation than MarkItDown - Auto mode: prefer pymupdf4llm when installed; fall back to MarkItDown when output is suspiciously sparse (image-based / scanned PDFs) - Sparsity check uses chars-per-page (< 50 chars/page) rather than an absolute threshold, correctly handling both short and long documents - Large files (> 1 MB) are offloaded to asyncio.to_thread() to avoid blocking the event loop (related: #1569) - Add UploadsConfig with pdf_converter field (auto/pymupdf4llm/markitdown) - Add pymupdf4llm as optional dependency: pip install deerflow-harness[pymupdf] - Add 14 unit tests covering sparsity heuristic, routing logic, and async path * fix(uploads): address Copilot review comments on PDF converter - Fix docstring: MIN_CHARS_PYMUPDF -> _MIN_CHARS_PER_PAGE (typo) - Fix file handle leak: wrap pymupdf.open in try/finally to ensure doc.close() - Fix silent fallback gap: _convert_pdf_with_pymupdf4llm now catches all conversion exceptions (not just ImportError), so encrypted/corrupt PDFs fall back to MarkItDown instead of propagating - Tighten type: pdf_converter field changed from str to Literal[auto|pymupdf4llm|markitdown] - Normalize config value: _get_pdf_converter() strips and lowercases the raw config string, warns and falls back to 'auto' on unknown values
This commit is contained in:
@@ -369,6 +369,15 @@ tool_search:
|
||||
|
||||
# Option 1: Local Sandbox (Default)
|
||||
# Executes commands directly on the host machine
|
||||
uploads:
|
||||
# PDF-to-Markdown converter used when a PDF is uploaded.
|
||||
# auto — prefer pymupdf4llm when installed; fall back to MarkItDown for
|
||||
# image-based or encrypted PDFs (recommended default).
|
||||
# pymupdf4llm — always use pymupdf4llm (must be installed: uv add pymupdf4llm).
|
||||
# Better heading/table extraction; faster on most files.
|
||||
# markitdown — always use MarkItDown (original behaviour, no extra dependency).
|
||||
pdf_converter: auto
|
||||
|
||||
sandbox:
|
||||
use: deerflow.sandbox.local:LocalSandboxProvider
|
||||
# Host bash execution is disabled by default because LocalSandboxProvider is
|
||||
|
||||
Reference in New Issue
Block a user