
AI Term Base

AI discovers terminology for you. Pauhu's AI term base automatically extracts domain-specific terms from your documents, suggests additions to your term base, and learns terminology patterns over time.


Bidirectional Semantic Flow

graph LR
    A[Documents] -->|Extract terms| B[AI Term Base]
    B -->|Suggest additions| C[Term Base]
    C -->|Enforce consistency| D[AI Translation]
    D -->|Learn patterns| E[AI Memory]
    E -->|Improve extraction| B

The unique Pauhu advantage: AI and traditional term bases work together.


Automatic Term Extraction

from pauhu import Pauhu

client = Pauhu()

# Upload document - AI extracts terms automatically
doc = client.documents.upload(
    file="eu-ai-act.pdf",
    domain="10 European Union"
)

# AI extracts terminology
terms = doc.extract_terms(
    min_frequency=2,        # Appears at least twice
    min_confidence=0.85,    # 85% confidence
    include_compounds=True  # Multi-word terms
)

for term in terms:
    print(f"{term.source} → {term.target} ({term.confidence:.2f})")
# "artificial intelligence system" → "tekoälyjärjestelmä" (0.98)
# "high-risk AI" → "korkean riskin tekoäly" (0.95)
# "conformity assessment" → "vaatimustenmukaisuuden arviointi" (0.92)
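
Under the hood, the extraction thresholds act as a simple filter over candidate term pairs. The sketch below illustrates that filtering logic with hypothetical candidate data and field names; it is not the SDK's internal implementation:

```python
# Illustrative candidate terms (data and structure are hypothetical).
candidates = [
    {"source": "artificial intelligence system", "target": "tekoälyjärjestelmä",
     "frequency": 14, "confidence": 0.98},
    {"source": "the proposal", "target": "ehdotus",
     "frequency": 9, "confidence": 0.41},   # generic phrase: low confidence
    {"source": "conformity assessment", "target": "vaatimustenmukaisuuden arviointi",
     "frequency": 1, "confidence": 0.92},   # appears only once: too rare
]

def filter_terms(cands, min_frequency=2, min_confidence=0.85):
    """Keep only candidates that meet both thresholds."""
    return [c for c in cands
            if c["frequency"] >= min_frequency and c["confidence"] >= min_confidence]

print([t["source"] for t in filter_terms(candidates)])
# ['artificial intelligence system']
```

Lowering `min_frequency` to 1 would also admit "conformity assessment", which clears the confidence bar but not the default frequency bar.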

Smart Term Recognition

AI identifies domain-specific terms automatically:

Single-Word Terms

# Legal domain
"controller" → "rekisterinpitäjä" (not "ohjain")
"processor" → "henkilötietojen käsittelijä" (not "prosessori")

Compound Terms

# Multi-word technical terms
"machine learning algorithm" → "koneoppimisen algoritmi"
"data protection impact assessment" → "tietosuojaa koskeva vaikutustenarviointi"

Context-Aware Disambiguation

# Same word, different domains
"bank" in Finance → "pankki"
"bank" in Geography → "törmä" (river bank)

# AI uses document domain context
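
One way to picture domain-aware disambiguation is a lookup keyed by both term and domain, so the same source word maps to different targets depending on the document's domain. The table and function below are an illustrative sketch, not the actual model:

```python
# Hypothetical (term, domain) → translation table for illustration only.
TERM_TABLE = {
    ("bank", "Finance"): "pankki",
    ("bank", "Geography"): "törmä",
    ("controller", "Law"): "rekisterinpitäjä",
    ("controller", "Technology"): "ohjain",
}

def translate_term(term, domain):
    """Return the domain-specific translation, or None if unknown."""
    return TERM_TABLE.get((term, domain))

print(translate_term("bank", "Finance"))    # pankki
print(translate_term("bank", "Geography"))  # törmä
```

In practice the model infers the domain from document context rather than requiring an explicit key, but the effect is the same: domain selects among competing translations.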

Term Suggestion Workflow

# AI suggests new terms for approval
project = client.projects.create(
    name="Legal Translation Project",
    domain="12 Law"
)

# Translate documents
for doc in documents:
    project.translate_document(doc)

# AI suggests terms discovered
suggestions = project.terms.suggestions()

for suggestion in suggestions:
    print(f"\nTerm: {suggestion.source}")
    print(f"Translation: {suggestion.target}")
    print(f"Frequency: {suggestion.frequency}×")
    print(f"Confidence: {suggestion.confidence:.2%}")
    print(f"First seen: {suggestion.first_document}")

    # Approve or reject
    if suggestion.confidence > 0.90:
        project.terms.approve(suggestion)

Example output:

Term: legitimate interest
Translation: oikeutettu etu
Frequency: 15×
Confidence: 98.5%
First seen: gdpr-article-6.pdf

Term: data subject
Translation: rekisteröity
Frequency: 47×
Confidence: 99.2%
First seen: gdpr-article-4.pdf


Terminology Consistency Checking

# AI checks for inconsistent translations
report = project.terms.consistency_check()

for issue in report.issues:
    print(f"\nInconsistency found:")
    print(f"Term: {issue.source}")
    print(f"Translation A: {issue.variant_a} ({issue.count_a}×)")
    print(f"Translation B: {issue.variant_b} ({issue.count_b}×)")
    print(f"Recommendation: {issue.recommendation}")

Example:

Inconsistency found:
Term: artificial intelligence
Translation A: tekoäly (87×)
Translation B: keinoäly (3×)
Recommendation: Standardize to "tekoäly" (more common, IATE-approved)
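
At its core, a consistency check like this counts translation variants per source term and flags any term with more than one, recommending the majority variant. A self-contained sketch of that logic (the observation data and output fields are illustrative):

```python
from collections import Counter

# Observed (source term, translation) pairs from a translated corpus (illustrative).
observations = (
    [("artificial intelligence", "tekoäly")] * 87
    + [("artificial intelligence", "keinoäly")] * 3
)

def find_inconsistencies(pairs):
    """Group translations per source term; flag terms with competing variants."""
    by_term = {}
    for source, target in pairs:
        by_term.setdefault(source, Counter())[target] += 1
    issues = []
    for source, counts in by_term.items():
        if len(counts) > 1:
            (major, _), (minor, n) = counts.most_common(2)
            issues.append({"term": source, "recommended": major,
                           "variant": minor, "variant_count": n})
    return issues

for issue in find_inconsistencies(observations):
    print(f'{issue["term"]}: standardize to "{issue["recommended"]}" '
          f'(minority variant: {issue["variant"]} ×{issue["variant_count"]})')
```

The real report additionally weighs authority sources such as IATE approval, not just raw frequency.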


Domain-Specific Learning

AI learns terminology patterns per EuroVoc domain:

| Domain | Terms Learned | Top Pattern |
| --- | --- | --- |
| 10 European Union | 1,200+ | "directive" → "direktiivi" |
| 12 Law | 1,500+ | "-assessment" → "-arviointi" |
| 36 Science | 900+ | "quantum-" → "kvantti-" prefix |

# Domain-aware term extraction
tech_doc = client.documents.upload(
    file="quantum-computing.pdf",
    domain="36 Science"
)

terms = tech_doc.extract_terms()
# AI recognizes: "quantum entanglement" → "kvanttilomittuminen"
# (learned from 36 Science domain patterns)

Multilingual Term Extraction

Extract terms across all 24 EU languages:

# Extract multilingual terminology
multilingual_terms = project.terms.extract_multilingual(
    source_language="en",
    target_languages=["fi", "sv", "de", "fr"]
)

for term in multilingual_terms:
    print(f"\n{term.source}:")
    for lang, translation in term.translations.items():
        print(f"  {lang}: {translation}")

Example output:

artificial intelligence:
  fi: tekoäly
  sv: artificiell intelligens
  de: künstliche Intelligenz
  fr: intelligence artificielle

GDPR:
  fi: tietosuoja-asetus
  sv: dataskyddsförordningen
  de: Datenschutz-Grundverordnung
  fr: RGPD


Integration with IATE

AI enriches IATE terms with usage context:

# AI augments IATE term with real-world usage
iate_term = client.iate.lookup("data controller")

print(f"IATE: {iate_term.official_translation}")
# "rekisterinpitäjä"

# AI adds context from your documents
ai_context = client.ai_terms.get_context("data controller")

print(f"\nAI learned usage patterns:")
for pattern in ai_context.patterns:
    print(f"  {pattern.phrase} → {pattern.translation}")
# "data controller shall" → "rekisterinpitäjän on"
# "joint controllers" → "yhteiset rekisterinpitäjät"
# "controller and processor" → "rekisterinpitäjä ja henkilötietojen käsittelijä"

Quality Metrics

# AI term base quality report
quality = project.ai_terms.quality_report()

print(f"Total terms: {quality.total_terms}")
print(f"High confidence (>95%): {quality.high_confidence}")
print(f"Medium confidence (85-95%): {quality.medium_confidence}")
print(f"Low confidence (<85%): {quality.low_confidence}")
print(f"\nIATE matches: {quality.iate_matches}")
print(f"EuroVoc matches: {quality.eurovoc_matches}")
print(f"Custom terms: {quality.custom_terms}")

Export and Backup

# Export AI term base (TBX format)
project.ai_terms.export(
    file_path="./ai-termbase.tbx",
    format="tbx",  # ISO 30042 standard
    min_confidence=0.85,
    include_metadata=True
)

# Merge with main term base
project.terms.merge_from_ai(
    min_confidence=0.90,
    require_review=True
)
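
TBX (ISO 30042) is an XML interchange format, so the export can be round-tripped with other CAT tools. The sketch below builds a minimal TBX-like entry for a single term pair to show the basic structure (simplified for illustration; the real export carries more metadata than this):

```python
import xml.etree.ElementTree as ET

# Minimal TBX-style entry: one termEntry with one langSet per language.
entry = ET.Element("termEntry", id="t1")
for lang, term in [("en", "data controller"), ("fi", "rekisterinpitäjä")]:
    lang_set = ET.SubElement(entry, "langSet", {"xml:lang": lang})
    tig = ET.SubElement(lang_set, "tig")     # term information group
    ET.SubElement(tig, "term").text = term

print(ET.tostring(entry, encoding="unicode"))
```

Filtering with `min_confidence` at export time keeps speculative AI suggestions out of files you share with translators.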

Privacy

Client-Side Extraction

# AI term extraction happens client-side
client = Pauhu(
    client_side_encryption=True,
    ai_term_extraction="local"  # Extract on your device
)

# Terms are extracted locally
# Only encrypted term patterns sent to server
# Source documents never leave your device

Data Retention

What's stored:

- ✅ Term source/target pairs
- ✅ Frequency counts
- ✅ Confidence scores
- ✅ Domain classifications

What's NOT stored:

- ❌ Source document content
- ❌ Full sentences
- ❌ Personal data
- ❌ Confidential information
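
Put differently, a retained term reduces to a small metadata record with no document text attached. A sketch of what such a record might contain (field names are illustrative, not the stored schema):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StoredTerm:
    # Only aggregate metadata survives; no sentences or document content.
    source: str
    target: str
    frequency: int
    confidence: float
    domain: str

record = StoredTerm("data subject", "rekisteröity", 47, 0.992, "12 Law")
print(record)
```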


Performance

AI term extraction speed:

| Document Size | Terms Extracted | Time |
| --- | --- | --- |
| 10 pages | 50-100 terms | 2-3 seconds |
| 100 pages | 300-500 terms | 15-20 seconds |
| 1000 pages | 2,000-3,000 terms | 2-3 minutes |

Based on FP32 ONNX models running in-browser
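
The timings above scale roughly linearly, around 150-200 ms per page for small and medium documents (a rate read off the table, not a documented guarantee). A quick estimator under that linear-scaling assumption:

```python
def estimated_seconds(pages, ms_per_page=(150, 200)):
    """Rough lower/upper extraction-time estimate, assuming linear scaling."""
    lo, hi = ms_per_page
    return pages * lo / 1000, pages * hi / 1000

print(estimated_seconds(100))  # (15.0, 20.0)
```

Larger documents run slightly faster per page (2-3 minutes for 1000 pages), so treat the estimate as an upper bound at scale.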


Getting Started

from pauhu import Pauhu

client = Pauhu()

# Create project with AI term extraction
project = client.projects.create(
    name="Technical Documentation",
    domain="36 Science",
    ai_term_extraction=True  # Enable AI term discovery
)

# Upload documents - AI extracts terms automatically
for doc in documents:
    project.upload(doc)

# Review AI-discovered terms
suggestions = project.terms.suggestions()

# Approve high-confidence terms
for term in suggestions:
    if term.confidence > 0.90:
        project.terms.approve(term)

Further Reading