Document Intelligence
Documents in, documents out. Pauhu understands document structure, preserves formatting, and reconstructs translations with pixel-perfect accuracy. Tables, headers, footnotes - everything stays in place.
| Category | Formats | Preservation |
| Office | DOCX, XLSX, PPTX, ODT, ODS | Full formatting |
| PDF | PDF, PDF/A, PDF/UA | Layout + accessibility |
| Text | TXT, RTF, HTML, XML | Structure |
| Publishing | InDesign, QuarkXPress | Professional layout |
| CAD | AutoCAD, SolidWorks | Technical drawings |
| Legal | Contracts, filings | Clause structure |
| Medical | HL7, FHIR, CDA | Clinical standards |
| Subtitles | SRT, VTT, ASS | Timing preserved |
How It Works
graph LR
A[Input Document] --> B[Structure Analysis]
B --> C[Text Extraction]
C --> D[Translation Engine]
D --> E[Reconstruction]
E --> F[Output Document]
B --> G[Layout Map]
G --> E
C --> H[Style Extraction]
H --> E
Python SDK
Translate a Document
from pauhu import Pauhu
client = Pauhu()
# Translate PDF document
result = client.translate_document(
file_path="contract.pdf",
target="fi",
preserve_layout=True
)
# Save translated document
result.save("contract_fi.pdf")
# Check what was preserved
print(result.metadata.preserved)
# ['tables', 'headers', 'footers', 'images', 'fonts']
Stream Large Documents
# For documents >100 pages, use streaming
async for page in client.translate_document_stream(
file_path="annual_report.pdf",
target="fi"
):
print(f"Page {page.number}: {page.word_count} words")
page.save(f"annual_report_fi_page_{page.number}.pdf")
Layout Preservation
Tables
# Tables maintain cell structure and formatting
result = client.translate_document(
file_path="financial_report.xlsx",
target="fi",
preserve={
"cell_formatting": True,
"formulas": True,
"charts": True,
"merged_cells": True
}
)
# Headers/footers translated separately
result = client.translate_document(
file_path="legal_brief.docx",
target="fi",
translate_headers=True,
translate_footers=True,
preserve_page_numbers=True
)
Images with Text
# OCR and translate text in images
result = client.translate_document(
file_path="brochure.pdf",
target="fi",
ocr_images=True, # Extract text from images
recreate_images=True # Overlay translated text
)
Document Types
Legal Documents
# Specialized handling for contracts
result = client.translate_document(
file_path="contract.pdf",
target="fi",
domain="12 Law",
document_type="contract",
preserve={
"clause_numbers": True,
"definitions": True,
"cross_references": True
}
)
Financial Reports
# Handle financial tables and figures
result = client.translate_document(
file_path="q4_report.xlsx",
target="fi",
domain="24 Finance",
preserve={
"number_formats": True, # EUR 1,234.56 -> 1 234,56 EUR
"formulas": True,
"conditional_formatting": True
}
)
Technical Manuals
# Preserve technical diagrams and references
result = client.translate_document(
file_path="user_manual.docx",
target="fi",
domain="64 Production",
preserve={
"diagrams": True,
"callouts": True,
"index": True,
"cross_references": True
}
)
Batch Processing
import os
from pauhu import Pauhu
client = Pauhu()
# Process entire folder
input_folder = "/docs/english"
output_folder = "/docs/finnish"
results = client.translate_documents(
input_folder=input_folder,
output_folder=output_folder,
target="fi",
recursive=True,
parallel=True # Use all CPU cores
)
for result in results:
print(f"{result.filename}: {result.status}")
Quality Assurance
Automated Checks
result = client.translate_document(
file_path="document.pdf",
target="fi",
quality_checks=[
"completeness", # All text translated
"formatting", # Layout preserved
"terminology", # Consistent terms
"numbers", # Numbers unchanged
"placeholders" # Tags preserved
]
)
# Review issues
for issue in result.quality_issues:
print(f"Page {issue.page}: {issue.type} - {issue.message}")
Side-by-Side Comparison
# Generate comparison document
result.generate_comparison(
output_path="comparison.pdf",
highlight_changes=True
)
API Endpoints
REST API
# Upload and translate
curl -X POST https://api.pauhu.ai/v1/documents \
-H "Authorization: Bearer pk_..." \
-F "file=@contract.pdf" \
-F "target=fi" \
-F "preserve_layout=true"
# Check status
curl https://api.pauhu.ai/v1/documents/{job_id}/status \
-H "Authorization: Bearer pk_..."
# Download result
curl https://api.pauhu.ai/v1/documents/{job_id}/download \
-H "Authorization: Bearer pk_..." \
-o contract_fi.pdf
| Document Type | Pages | Time | Accuracy |
| Contract (PDF) | 20 | 15s | 99.8% |
| Annual Report (PDF) | 100 | 45s | 99.5% |
| User Manual (DOCX) | 50 | 25s | 99.7% |
| Spreadsheet (XLSX) | 10 sheets | 10s | 99.9% |
Getting Started
from pauhu import Pauhu
client = Pauhu()
# Translate your first document
result = client.translate_document(
file_path="document.pdf",
target="fi"
)
result.save("document_fi.pdf")
print(f"Translated {result.page_count} pages")