12 min · Sofía + Adrián

Legal Corpus Quality

How to build and maintain a reliable legal corpus for RAG: sources, deduplication, and updates.

What Is a Legal Corpus?

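A legal corpus is the collection of legal documents that a RAG (retrieval-augmented generation) system retrieves from. Its quality sets the ceiling for answer quality: the system cannot cite a regulation that is missing from the corpus, and it will cite a repealed one if that is what the index contains.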

Sources of a Legal Corpus

  1. Legislation: BOE, regional official gazettes, EUR-Lex
  2. Case law: CENDOJ, constitutional court databases
  3. Administrative doctrine: DGT resolutions, DGRN (now DGSJFP) resolutions, circulars
  4. Sectoral regulation: rules issued by regulators (CNMC, Bank of Spain, CNMV)

Construction Process

  1. Ingestion: automated download from official sources
  2. Parsing: extraction of structured text (articles for legislation; legal grounds and ruling for judgments)
  3. Chunking: division into semantically coherent fragments
  4. Embedding: vectorization of each fragment
  5. Indexing: storage in a vector database (pgvector)
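
The parsing and chunking steps can be sketched as follows. This is a minimal illustration, not the pipeline described above: it assumes each article starts on its own line with a heading such as "Artículo 12." or "Art. 12 bis.", and the names `Chunk` and `chunk_by_article` are hypothetical. Embedding and indexing are left out because they depend on the model and database client in use.

```python
import re
from dataclasses import dataclass

# Assumption: article headings open a line, e.g. "Artículo 1." or "Art. 2 bis."
ARTICLE_HEADING = re.compile(
    r"^(?:Artículo|Art\.)\s+(\d+(?:\s+(?:bis|ter|quater))?)\.?",
    re.MULTILINE | re.IGNORECASE,
)


@dataclass
class Chunk:
    source_id: str  # identifier of the regulation (hypothetical field)
    article: str    # article number, e.g. "2 bis"
    text: str       # full text of the article, heading included


def chunk_by_article(source_id: str, text: str) -> list[Chunk]:
    """Split a regulation into one chunk per article.

    Text before the first heading (title, preamble) is dropped here;
    a real pipeline would probably keep it as a separate chunk.
    """
    matches = list(ARTICLE_HEADING.finditer(text))
    chunks = []
    for i, match in enumerate(matches):
        # An article runs until the next heading or the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        body = text[match.start():end].strip()
        chunks.append(Chunk(source_id, match.group(1), body))
    return chunks
```

Splitting on article boundaries keeps each fragment a self-contained legal unit, which is one way to read "semantically coherent". A line inside an article that happens to start with "Artículo N" would be split incorrectly, so the heading pattern needs tuning per source.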

Quality Problems

  • Duplicates: the same regulation appears multiple times with slightly different wording
  • Obsolete versions: repealed or superseded regulations still indexed as if they were in force
  • Encoding: special characters garbled by decoding with the wrong character set (mojibake), e.g. "resoluciÃ³n" instead of "resolución"
  • Stubs: empty or incomplete records that contaminate results
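
Three of these problems can be flagged with simple heuristics. The sketch below is illustrative only: near-duplicates are scored with Jaccard similarity over word 3-grams of normalized text, mojibake is detected by the typical residue of UTF-8 bytes read as Latin-1, and stubs by a minimum word count. The thresholds and function names are assumptions, not values taken from the process described here. Detecting obsolete versions needs repeal metadata from the source and is not covered.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Set of overlapping k-word sequences of the normalized text."""
    words = normalize(text).split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: str, b: str, k: int = 3) -> float:
    """Similarity in [0, 1]; values near 1 suggest a near-duplicate."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


# Typical residue of UTF-8 text decoded as Latin-1 / Windows-1252,
# plus the Unicode replacement character. Heuristic, not exhaustive.
MOJIBAKE = re.compile("[ÃÂ][\u00a0-\u00bf]|â€|\ufffd")


def looks_mojibake(text: str) -> bool:
    return MOJIBAKE.search(text) is not None


def is_stub(text: str, min_words: int = 20) -> bool:
    """Assumed threshold: fewer than 20 words counts as a stub."""
    return len(normalize(text).split()) < min_words
```

Pairwise Jaccard comparison is quadratic in the number of documents; for a large corpus a sketch like this would usually be combined with a blocking or hashing step so that only likely candidates are compared.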

Quality Metrics

| Metric | Description | Target |
|---|---|---|
| Coverage | % of key regulations indexed | >95% |
| Freshness | Average update time | <48h |
| Deduplication | Ratio of duplicates removed | >99% |
| Encoding | % of texts without encoding errors | 100% |
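
As a rough sketch, the four metrics could be computed like this. The definitions are assumptions: freshness is read as the average lag between official publication and indexing, and deduplication as duplicates removed divided by duplicates detected. The `Record` fields and function names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Record:
    doc_id: str
    published_at: datetime     # date of official publication
    indexed_at: datetime       # date the document entered the index
    has_encoding_error: bool   # result of an encoding check


def coverage(indexed_ids: set[str], key_ids: set[str]) -> float:
    """Share of the key regulations that are present in the index."""
    if not key_ids:
        return 1.0
    return len(indexed_ids & key_ids) / len(key_ids)


def average_freshness(records: list[Record]) -> timedelta:
    """Mean lag between publication and indexing (assumed definition)."""
    if not records:
        raise ValueError("no records to measure")
    lags = [r.indexed_at - r.published_at for r in records]
    return sum(lags, timedelta()) / len(lags)


def dedup_ratio(detected: int, removed: int) -> float:
    """Duplicates removed over duplicates detected (assumed definition)."""
    return removed / detected if detected else 1.0


def clean_encoding_share(records: list[Record]) -> float:
    """Share of records without encoding errors."""
    if not records:
        return 1.0
    clean = sum(1 for r in records if not r.has_encoding_error)
    return clean / len(records)
```

Each function returns a plain number or `timedelta`, so the targets in the table become direct comparisons, for example `average_freshness(records) < timedelta(hours=48)`.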

Continuous Maintenance

A legal corpus is not static. It requires:

  • Daily monitoring of BOE and official gazettes
  • Periodic deduplication
  • Re-indexing after corrections
  • Quality benchmarks against exam-type questions
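
Two of these tasks lend themselves to a short sketch: deciding what to re-index by comparing content hashes, and scoring retrieval against a benchmark of questions. Both functions are hypothetical illustrations. The `retrieve` callable stands in for whatever search function the RAG system exposes, and the benchmark is assumed to pair each question with the ID of the document that should be retrieved.

```python
import hashlib
from typing import Callable


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(
    indexed_hashes: dict[str, str],  # doc_id -> hash stored at last indexing
    fresh_texts: dict[str, str],     # doc_id -> text just downloaded
) -> dict[str, list[str]]:
    """Compare a fresh download with the index and list what changed."""
    plan: dict[str, list[str]] = {"add": [], "update": [], "review": []}
    for doc_id, text in fresh_texts.items():
        old_hash = indexed_hashes.get(doc_id)
        if old_hash is None:
            plan["add"].append(doc_id)
        elif old_hash != content_hash(text):
            plan["update"].append(doc_id)  # corrected or amended text
    # Indexed but no longer published: possibly repealed, review by hand.
    plan["review"] = [d for d in indexed_hashes if d not in fresh_texts]
    return plan


def hit_rate_at_k(
    benchmark: list[tuple[str, str]],       # (question, expected doc_id)
    retrieve: Callable[[str], list[str]],   # question -> ranked doc_ids
    k: int = 5,
) -> float:
    """Share of questions whose expected document is in the top k."""
    if not benchmark:
        return 0.0
    hits = sum(1 for question, expected in benchmark
               if expected in retrieve(question)[:k])
    return hits / len(benchmark)
```

Hashing the normalized text rather than the raw download would avoid re-indexing on whitespace-only changes; which one fits depends on how corrections are published by each source.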
