AI Term Base
AI discovers terminology for you. Pauhu's AI term base automatically extracts domain-specific terms from your documents, suggests additions to your term base, and learns terminology patterns over time.
Bidirectional Semantic Flow
graph LR
A[Documents] -->|Extract terms| B[AI Term Base]
B -->|Suggest additions| C[Term Base]
C -->|Enforce consistency| D[AI Translation]
D -->|Learn patterns| E[AI Memory]
E -->|Improve extraction| B

The unique Pauhu advantage: AI and traditional term bases work together.
Automatic Term Extraction
from pauhu import Pauhu
client = Pauhu()
# Upload document - AI extracts terms automatically
doc = client.documents.upload(
    file="eu-ai-act.pdf",
    domain="10 European Union"
)
# AI extracts terminology
terms = doc.extract_terms(
    min_frequency=2,         # Appears at least twice
    min_confidence=0.85,     # 85% confidence
    include_compounds=True   # Multi-word terms
)
for term in terms:
    print(f"{term.source} → {term.target} ({term.confidence:.2f})")
# "artificial intelligence system" → "tekoälyjärjestelmä" (0.98)
# "high-risk AI" → "korkean riskin tekoäly" (0.95)
# "conformity assessment" → "vaatimustenmukaisuuden arviointi" (0.92)
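To make the `min_frequency` and `include_compounds` ideas concrete, here is a minimal standalone sketch of frequency-based candidate extraction. It is not Pauhu's extraction model: the function `candidate_terms` and its `max_words` parameter are hypothetical, it counts plain word n-grams, and it does no stop-word filtering, translation, or confidence scoring.

```python
import re
from collections import Counter


def candidate_terms(text, min_frequency=2, max_words=3):
    """Count word n-grams (1..max_words) and keep those seen often enough."""
    # Lowercase and keep hyphenated words such as "high-risk" together.
    words = re.findall(r"[a-z][a-z-]*", text.lower())
    counts = Counter()
    for n in range(1, max_words + 1):
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return {term: c for term, c in counts.items() if c >= min_frequency}


sample = (
    "A conformity assessment is required for each high-risk system. "
    "The conformity assessment must be completed before deployment."
)
terms = candidate_terms(sample)
print(sorted(terms.items()))
# [('assessment', 2), ('conformity', 2), ('conformity assessment', 2)]
```

A real extractor would also have to rank the compound above its component words and discard function words; this sketch only shows why a frequency threshold removes one-off phrases.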
Smart Term Recognition
AI identifies domain-specific terms automatically:
Single-Word Terms
# Legal domain
"controller" → "rekisterinpitäjä" (not "ohjain")
"processor" → "henkilötietojen käsittelijä" (not "prosessori")
Compound Terms
# Multi-word technical terms
"machine learning algorithm" → "koneoppimisen algoritmi"
"data protection impact assessment" → "tietosuojaa koskeva vaikutustenarviointi"
Context-Aware Disambiguation
# Same word, different domains
"bank" in Finance → "pankki"
"bank" in Geography → "törmä" (river bank)
# AI uses document domain context
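One simple way to picture domain-based disambiguation is a glossary keyed by both term and domain. The sketch below is an illustration only: `GLOSSARY` and `translate_term` are hypothetical names, and the entries are copied from the examples above rather than from any Pauhu data.

```python
# Hypothetical glossary keyed by (term, domain); entries mirror the examples above.
GLOSSARY = {
    ("bank", "Finance"): "pankki",
    ("bank", "Geography"): "törmä",
}


def translate_term(term, domain, glossary=GLOSSARY):
    """Return the domain-specific translation, or None if the pair is unknown."""
    return glossary.get((term.lower(), domain))


print(translate_term("bank", "Finance"))    # pankki
print(translate_term("Bank", "Geography"))  # törmä
print(translate_term("bank", "Law"))        # None
```

Returning `None` for an unknown domain keeps the caller in control of the fallback, for example deferring to a general-purpose translation.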
Term Suggestion Workflow
# AI suggests new terms for approval
project = client.projects.create(
    name="Legal Translation Project",
    domain="12 Law"
)
# Translate documents (`documents` is assumed to be defined elsewhere)
for doc in documents:
    project.translate_document(doc)
# Review the terms the AI discovered
suggestions = project.terms.suggestions()
for suggestion in suggestions:
    print(f"\nTerm: {suggestion.source}")
    print(f"Translation: {suggestion.target}")
    print(f"Frequency: {suggestion.frequency}×")
    print(f"Confidence: {suggestion.confidence:.1%}")
    print(f"First seen: {suggestion.first_document}")
    # Approve only high-confidence suggestions
    if suggestion.confidence > 0.90:
        project.terms.approve(suggestion)
Example output:
Term: legitimate interest
Translation: oikeutettu etu
Frequency: 15×
Confidence: 98.5%
First seen: gdpr-article-6.pdf

Term: data subject
Translation: rekisteröity
Frequency: 47×
Confidence: 99.2%
First seen: gdpr-article-4.pdf
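The approval step can be sketched without the SDK as a split into an auto-approve list and a manual-review list. The `Suggestion` dataclass and `triage` function below are hypothetical stand-ins, not Pauhu types; the first two records reuse the example output above and the third is a made-up low-confidence entry.

```python
from dataclasses import dataclass


@dataclass
class Suggestion:
    source: str
    target: str
    frequency: int
    confidence: float


def triage(suggestions, auto_approve=0.90):
    """Split suggestions into (approved, needs_review) by confidence."""
    approved = [s for s in suggestions if s.confidence > auto_approve]
    needs_review = [s for s in suggestions if s.confidence <= auto_approve]
    return approved, needs_review


suggestions = [
    Suggestion("legitimate interest", "oikeutettu etu", 15, 0.985),
    Suggestion("data subject", "rekisteröity", 47, 0.992),
    Suggestion("example term", "esimerkkitermi", 2, 0.70),  # made-up entry
]
approved, needs_review = triage(suggestions)
print([s.source for s in approved])      # ['legitimate interest', 'data subject']
print([s.source for s in needs_review])  # ['example term']
```

Keeping the rejected candidates in a separate list, rather than dropping them, leaves room for a human reviewer to rescue rare but valid terms.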
Terminology Consistency Checking
# AI checks for inconsistent translations
report = project.terms.consistency_check()
for issue in report.issues:
    print("\nInconsistency found:")
    print(f"Term: {issue.source}")
    print(f"Translation A: {issue.variant_a} ({issue.count_a}×)")
    print(f"Translation B: {issue.variant_b} ({issue.count_b}×)")
    print(f"Recommendation: {issue.recommendation}")
Example:
Inconsistency found:
Term: artificial intelligence
Translation A: tekoäly (87×)
Translation B: keinoäly (3×)
Recommendation: Standardize to "tekoäly" (more common, IATE-approved)
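A consistency check of this kind can be approximated by counting how often each source term maps to each translation. The sketch below assumes the input is a flat list of (source, target) pairs and recommends the most frequent variant; `find_inconsistencies` is a hypothetical helper, and the IATE check mentioned in the recommendation above is not modelled.

```python
from collections import Counter, defaultdict


def find_inconsistencies(pairs):
    """Report source terms that were translated in more than one way."""
    variants = defaultdict(Counter)
    for source, target in pairs:
        variants[source][target] += 1

    issues = []
    for source, counts in variants.items():
        if len(counts) > 1:
            preferred, _ = counts.most_common(1)[0]
            issues.append({
                "source": source,
                "variants": dict(counts),
                "recommendation": preferred,
            })
    return issues


pairs = (
    [("artificial intelligence", "tekoäly")] * 87
    + [("artificial intelligence", "keinoäly")] * 3
    + [("data subject", "rekisteröity")] * 47
)
issues = find_inconsistencies(pairs)
print(issues[0]["recommendation"])  # tekoäly
```

Majority voting is only a starting point: a rarely used variant can still be the officially approved one, which is why the example report also cites IATE.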
Domain-Specific Learning
AI learns terminology patterns per EuroVoc domain:
| Domain | Terms Learned | Top Pattern |
|---|---|---|
| 10 European Union | 1,200+ | "directive" → "direktiivi" |
| 12 Law | 1,500+ | "-assessment" → "-arviointi" |
| 36 Science | 900+ | "quantum-" → "kvantti-" prefix |
# Domain-aware term extraction
tech_doc = client.documents.upload(
    file="quantum-computing.pdf",
    domain="36 Science"
)
terms = tech_doc.extract_terms()
# AI recognizes: "quantum entanglement" → "kvanttilomittuminen"
# (learned from 36 Science domain patterns)
Multilingual Term Extraction
Extract terms across all 24 EU languages:
# Extract multilingual terminology
multilingual_terms = project.terms.extract_multilingual(
    source_language="en",
    target_languages=["fi", "sv", "de", "fr"]
)
for term in multilingual_terms:
    print(f"\n{term.source}:")
    for lang, translation in term.translations.items():
        print(f"  {lang}: {translation}")
Example output:
artificial intelligence:
fi: tekoäly
sv: artificiell intelligens
de: künstliche Intelligenz
fr: intelligence artificielle

GDPR:
fi: tietosuoja-asetus
sv: dataskyddsförordningen
de: Datenschutz-Grundverordnung
fr: RGPD
Integration with IATE
AI enriches IATE terms with usage context:
# AI augments IATE term with real-world usage
iate_term = client.iate.lookup("data controller")
print(f"IATE: {iate_term.official_translation}")
# "rekisterinpitäjä"
# AI adds context from your documents
ai_context = client.ai_terms.get_context("data controller")
print("\nAI learned usage patterns:")
for pattern in ai_context.patterns:
    print(f"  {pattern.phrase} → {pattern.translation}")
# "data controller shall" → "rekisterinpitäjän on"
# "joint controllers" → "yhteiset rekisterinpitäjät"
# "controller and processor" → "rekisterinpitäjä ja henkilötietojen käsittelijä"
Quality Metrics
# AI term base quality report
quality = project.ai_terms.quality_report()
print(f"Total terms: {quality.total_terms}")
print(f"High confidence (>95%): {quality.high_confidence}")
print(f"Medium confidence (85-95%): {quality.medium_confidence}")
print(f"Low confidence (<85%): {quality.low_confidence}")
print(f"\nIATE matches: {quality.iate_matches}")
print(f"EuroVoc matches: {quality.eurovoc_matches}")
print(f"Custom terms: {quality.custom_terms}")
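The three confidence bands in the report can be reproduced with a small bucketing function. This is a sketch under the thresholds printed above (above 95%, 85-95%, below 85%); the `bucket` helper is hypothetical, and treating exactly 0.95 and 0.85 as "medium" is an assumption about how the boundaries are handled.

```python
from collections import Counter


def bucket(confidence):
    """Map a confidence score in [0, 1] to the report's three bands."""
    if confidence > 0.95:
        return "high"
    if confidence >= 0.85:
        return "medium"
    return "low"


scores = [0.98, 0.992, 0.95, 0.90, 0.85, 0.70]
counts = Counter(bucket(s) for s in scores)
print(f"High confidence (>95%): {counts['high']}")        # 2
print(f"Medium confidence (85-95%): {counts['medium']}")  # 3
print(f"Low confidence (<85%): {counts['low']}")          # 1
```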
Export and Backup
# Export AI term base (TBX format)
project.ai_terms.export(
    file_path="./ai-termbase.tbx",
    format="tbx",          # ISO 30042 standard
    min_confidence=0.85,
    include_metadata=True
)
# Merge with main term base
project.terms.merge_from_ai(
    min_confidence=0.90,
    require_review=True
)
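For readers unfamiliar with TBX, the sketch below builds a heavily simplified TBX-like XML tree with Python's standard library, applying the same `min_confidence` cut-off. It is not the output of `project.ai_terms.export`: the `build_tbx` function is hypothetical, the element layout loosely follows the `martif`/`termEntry`/`langSet`/`tig`/`term` structure of the 2008 edition of ISO 30042, and the required header is omitted, so the result would not validate against the standard. The third entry is made up to show the filter.

```python
import xml.etree.ElementTree as ET

# ElementTree serializes this qualified name as the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tbx(entries, source_lang="en", target_lang="fi", min_confidence=0.85):
    """Build a simplified TBX-like tree from (source, target, confidence) tuples."""
    root = ET.Element("martif", {"type": "TBX", XML_LANG: source_lang})
    body = ET.SubElement(ET.SubElement(root, "text"), "body")
    for i, (source, target, confidence) in enumerate(entries, start=1):
        if confidence < min_confidence:
            continue  # skip low-confidence terms
        entry = ET.SubElement(body, "termEntry", {"id": f"t{i}"})
        for lang, term in ((source_lang, source), (target_lang, target)):
            lang_set = ET.SubElement(entry, "langSet", {XML_LANG: lang})
            ET.SubElement(ET.SubElement(lang_set, "tig"), "term").text = term
    return root


entries = [
    ("data subject", "rekisteröity", 0.992),
    ("legitimate interest", "oikeutettu etu", 0.985),
    ("example term", "esimerkkitermi", 0.70),  # made-up entry, filtered out
]
root = build_tbx(entries)
print(ET.tostring(root, encoding="unicode"))
```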
Privacy
Client-Side Extraction
# AI term extraction happens client-side
client = Pauhu(
    client_side_encryption=True,
    ai_term_extraction="local"  # Extract on your device
)
# Terms are extracted locally
# Only encrypted term patterns sent to server
# Source documents never leave your device
Data Retention
What's stored:
- ✅ Term source/target pairs
- ✅ Frequency counts
- ✅ Confidence scores
- ✅ Domain classifications

What's NOT stored:
- ❌ Source document content
- ❌ Full sentences
- ❌ Personal data
- ❌ Confidential information
Performance
AI term extraction speed:
| Document Size | Terms Extracted | Time |
|---|---|---|
| 10 pages | 50-100 terms | 2-3 seconds |
| 100 pages | 300-500 terms | 15-20 seconds |
| 1,000 pages | 2,000-3,000 terms | 2-3 minutes |

These timings are based on FP32 ONNX models running in the browser.
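The table can be turned into rough throughput figures with simple arithmetic. The sketch below only divides the page counts by the quoted time ranges (with 2-3 minutes taken as 120-180 seconds); the `throughput` helper is hypothetical and no new measurements are implied.

```python
def throughput(pages, min_seconds, max_seconds):
    """Return (slowest, fastest) pages per second for a quoted time range."""
    return pages / max_seconds, pages / min_seconds


# (pages, min_seconds, max_seconds) taken from the table above
TIMINGS = [(10, 2, 3), (100, 15, 20), (1000, 120, 180)]

for pages, min_s, max_s in TIMINGS:
    slow, fast = throughput(pages, min_s, max_s)
    print(f"{pages:>5} pages: {slow:.1f}-{fast:.1f} pages/s")
#    10 pages: 3.3-5.0 pages/s
#   100 pages: 5.0-6.7 pages/s
#  1000 pages: 5.6-8.3 pages/s
```

Read this way, the quoted figures suggest per-page speed improves somewhat on larger documents, which would be consistent with a fixed start-up cost being amortized.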
Getting Started
from pauhu import Pauhu
client = Pauhu()
# Create project with AI term extraction
project = client.projects.create(
    name="Technical Documentation",
    domain="36 Science",
    ai_term_extraction=True  # Enable AI term discovery
)
# Upload documents - AI extracts terms automatically
# (`documents` is assumed to be defined elsewhere)
for doc in documents:
    project.upload(doc)
# Review AI-discovered terms
suggestions = project.terms.suggestions()
# Approve high-confidence terms
for term in suggestions:
    if term.confidence > 0.90:
        project.terms.approve(term)
Further Reading
- Term Base - Terminology management
- AI Memory - How AI learns from context
- Translation Memory - Historical translations
- Quality Assurance - Consistency enforcement