Why you should NOT upload client data to ChatGPT
When you use ChatGPT, GPT-4, or any public cloud AI service, the data you input may be:
- Processed on servers outside the EU (typically in the USA).
- Used to improve the model (unless you use the API with explicit opt-out).
- Accessible to employees of the provider company during security audits.
- Stored for indeterminate periods in service logs.
This makes the lawyer the data controller (or processor) under GDPR, with all corresponding obligations.
The real problem: professional secrecy + data protection
The lawyer's professional secrecy duty (Art. 542.3 LOPJ and Art. 32 EGA in Spain; similar provisions exist across EU jurisdictions) requires that client information not be disclosed to third parties without consent. By uploading data to a cloud LLM:
- You disclose information to a third party (the AI provider).
- You have no guarantee it won't be used for other purposes.
- You may be transferring data outside the EEA without adequate legal basis.
- Client consent doesn't necessarily cover this use.
GDPR applied to AI use in law firms
Relevant principles (Art. 5 GDPR)
| Principle | AI implication |
|---|---|
| Data minimization | Only input data strictly necessary for the query |
| Purpose limitation | Client data is for their legal matter, not model training |
| Integrity and confidentiality | You must ensure data doesn't leak through third-party services |
| Accountability | You must be able to demonstrate you take adequate measures |
Do you need a DPIA (Data Protection Impact Assessment)?
Under Art. 35 GDPR, you likely need a DPIA if:
- You process special categories of data (health, criminal records).
- The processing is systematic and large-scale.
- You use new technologies that may pose a high risk.
Generative AI applied to legal client data meets all three criteria in many cases.
Practical obligations
- Data processing agreement (Art. 28 GDPR) with the AI provider.
- Record of processing activities that includes AI use.
- Data subject information (Art. 13-14 GDPR): clients must know you use AI.
- International transfer assessment: Do data leave the EEA?
- Security measures: encryption, anonymization, pseudonymization.
Anonymization techniques for legal AI
Before entering any case data into an AI tool, you must anonymize it. Main techniques:
1. Pseudonymization
Replace identifying data with codes:
- "John Smith" → "Party A" or "[CLAIMANT]"
- "23 Main Street, London" → "[ADDRESS_1]"
- "ID 12345678-A" → "[ID_REDACTED]"
Advantage: Maintains text coherence (you can follow the case logic).
Limitation: Not full anonymization; it remains reversible for anyone who holds the mapping table.
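The replacement scheme above can be sketched in Python. This is a minimal illustration, not a vetted redaction tool: the regex patterns for IDs, emails, and phone numbers are assumptions for demonstration, and real case files also contain names and addresses that only NER-based detection catches reliably.

```python
import re

# Illustrative patterns only; real matters need entity recognition
# for names and addresses, which regexes alone cannot provide.
PATTERNS = {
    "ID": re.compile(r"\b\d{8}-?[A-Z]\b"),                # DNI-style ID numbers
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "PHONE": re.compile(r"(?<!\w)\+?\d[\d\s-]{7,}\d\b"),
}

def pseudonymize(text: str) -> tuple[str, dict[str, str]]:
    """Replace each match with a numbered placeholder; return text + mapping.

    Keeping the mapping table makes this pseudonymization (Art. 4(5) GDPR),
    NOT anonymization: store the table separately from the redacted text,
    and never send it to the AI tool.
    """
    mapping: dict[str, str] = {}   # placeholder -> original value
    counters: dict[str, int] = {}

    for label, pattern in PATTERNS.items():
        def repl(m: re.Match, label: str = label) -> str:
            value = m.group(0)
            for placeholder, original in mapping.items():
                if original == value:          # reuse placeholder for repeats
                    return placeholder
            counters[label] = counters.get(label, 0) + 1
            placeholder = f"[{label}_{counters[label]}]"
            mapping[placeholder] = value
            return placeholder
        text = pattern.sub(repl, text)
    return text, mapping
```

Because repeated values reuse the same placeholder, the redacted text keeps its internal coherence ("[ID_1]" always refers to the same person), which is exactly the advantage pseudonymization offers over plain deletion.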
2. True anonymization
Completely remove data that allows identification:
- Remove names, IDs, addresses, phone numbers, emails.
- Generalize: "47-year-old man from Madrid" → "middle-aged person from a large city".
- Remove unique data: case file number, vehicle registration.
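The removal-plus-generalization approach can be sketched as follows. The patterns and the age-band rule are illustrative assumptions; a human review of the output remains essential before anything leaves the firm.

```python
import re

def anonymize(text: str) -> str:
    """Strip direct identifiers and generalize quasi-identifiers.

    Unlike pseudonymization, no mapping table is kept: the result is
    meant to be irreversible. Patterns are illustrative, not exhaustive.
    """
    # 1. Remove direct identifiers outright (no placeholders, no mapping).
    text = re.sub(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b", "", text)       # emails
    text = re.sub(r"\b\d{8}-?[A-Z]\b", "", text)                  # ID numbers
    text = re.sub(r"(?<!\w)\+?\d[\d\s-]{7,}\d\b", "", text)       # phone numbers

    # 2. Generalize: exact age -> ten-year age band.
    def age_band(m: re.Match) -> str:
        low = (int(m.group(1)) // 10) * 10
        return f"person aged {low}-{low + 9}"
    text = re.sub(r"\b(\d{1,3})-year-old (?:man|woman|person)\b", age_band, text)

    # 3. Tidy up whitespace left behind by the removals.
    return re.sub(r"[ \t]{2,}", " ", text).strip()
```

Note the difference from the pseudonymization sketch: matches are deleted or generalized, never mapped, so the operation cannot be reversed from the output alone.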
3. Synthetic data
Generate fictional data that maintains the statistical properties of the case without revealing real data:
- Use coherent fictional names.
- Substitute real amounts with approximate ones.
- Change dates while maintaining relative deadlines.
When to use each technique
| Technique | When to use | Example |
|---|---|---|
| Pseudonymization | Internal analysis where you need coherence | Preparing procedural strategy with AI |
| True anonymization | Sharing with cloud tools | Asking ChatGPT about a type of case |
| Synthetic data | Training, demos, testing | Training the team in legal AI use |
Own infrastructure vs public cloud
Model A: Public cloud (ChatGPT, Claude API, etc.)
- Pros: Easy to use, always updated, no maintenance.
- Cons: Data leaves your control, possible training on your data, complex GDPR compliance.
- Suitable for: Generic queries without real data, training, brainstorming.
Model B: API with data processing agreement
- Pros: Training opt-out, DPA contract, better control.
- Cons: Data still leaves your infrastructure, possible international transfer.
- Suitable for: Professional use with pseudonymized data and signed DPA.
Model C: Own infrastructure / on-premise
- Pros: Total control, data never leaves, simplified GDPR compliance.
- Cons: Requires investment in hardware/infrastructure, own models may be less capable.
- Suitable for: Large firms with very sensitive data and infrastructure budget.
Model D: Specialized legal tool with EU infrastructure
- Pros: Combines model quality with controlled infrastructure, DPA included, designed for compliance.
- Cons: Subscription cost, provider dependency.
- Suitable for: Most firms wanting to use AI professionally and safely.
Case study: data breach in a law firm
Scenario
A lawyer from a mid-size firm copies the full text of a complaint (with names, IDs, addresses, bank details of the claimant) and pastes it into ChatGPT to ask for a summary.
Potential consequences
- GDPR infringement (Art. 83): fine of up to €20 million or 4% of annual worldwide turnover, whichever is higher.
- Breach of professional secrecy: possible disciplinary proceedings from the Bar Association.
- Civil liability: if the client discovers the leak, they can claim damages.
- Reputational damage: loss of client trust and damage to the firm's brand.
How it should have been done
- Pseudonymize the text before entering it.
- Use a legal tool with DPA and EU infrastructure.
- Verify the record of processing activities includes AI use.
- Inform the client that AI tools are used in the firm (clause in the service agreement).
Module summary
| Concept | Key takeaway |
|---|---|
| Public cloud data | Never upload real client data to ChatGPT/GPT-4 without anonymizing |
| GDPR | Using AI with personal data requires legal basis, DPA, and records |
| Anonymization | Pseudonymize at minimum; fully anonymize for public cloud |
| Professional secrecy | Extends to AI tool usage: the lawyer is always responsible |
| Infrastructure | Prefer tools with EU infrastructure and signed DPA |