Data Sanitization

Artificial Intelligence

Jean-Baptiste Lacombe Lavigne Founder and lead-Consultant| Humanitarian Partners International Montréal, Quebec, Canada

When creating a custom GPT for a team on top of an existing public LLM, what is the best process for sanitizing the data fed to the LLM?

Posted: Jan 21, 2026 4:04 PM

Guillaume Baron

Community Champion

Project Manager| CREOS Bertrange, Luxembourg

Hello Jean-Baptiste,

I would suggest looking at ISO/IEC 27001 Annex A, control 8.11, which covers data masking.
The actions associated with some of those controls could help sanitize the data.
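
As a rough illustration of what masking can look like before text reaches an LLM, here is a minimal Python sketch. The two patterns and the placeholder labels are assumptions chosen for the example, not something prescribed by the standard, and real data would need patterns tuned to its own formats.

```python
import re

# Illustrative patterns only (assumptions for this sketch).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def mask_text(text: str) -> str:
    """Replace e-mail addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


if __name__ == "__main__":
    print(mask_text("Contact jane.doe@example.org or +1 514 555 0199."))
```

Pattern-based masking only catches what the patterns describe, so it works best as one layer alongside the classification and governance steps discussed in the replies below.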

Cheers

Posted: Jan 22, 2026 2:16 AM

Luis Branco CEO| Business Insight, Consultores de Gestão, Ldª Carcavelos, Lisboa, Portugal

The real question is not only how to sanitize data, but why, and at what level of risk.

In team and project contexts, three distinct areas need to be considered separately.

First, data inputs.
This includes removing or anonymizing personal data, contractual information, financial data, and indirect identifiers.
This is not merely a technical task. It is a governance decision.
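
To make the anonymizing step a little more concrete, here is a minimal Python sketch of pseudonymizing structured records with a keyed hash, so the same value always maps to the same token without exposing the original. The field names in `SENSITIVE_FIELDS` and the token format are hypothetical. Which fields belong on that list is exactly the governance decision described above, and indirect identifiers are not handled here.

```python
import hashlib
import hmac

# Hypothetical field list; the real one comes from the team's governance decision.
SENSITIVE_FIELDS = {"name", "email", "contract_id"}


def pseudonymize(record: dict, key: bytes) -> dict:
    """Return a copy of record with sensitive fields replaced by keyed tokens."""
    out = {}
    for field, value in record.items():
        if field in SENSITIVE_FIELDS:
            digest = hmac.new(key, str(value).encode("utf-8"), hashlib.sha256).hexdigest()
            # Keep a short, stable token so records can still be correlated.
            out[field] = f"{field}_{digest[:12]}"
        else:
            out[field] = value
    return out


if __name__ == "__main__":
    secret = b"replace-with-a-key-kept-outside-the-LLM-workflow"
    row = {"name": "Jane Doe", "email": "jane.doe@example.org", "status": "active"}
    print(pseudonymize(row, secret))
```

Because the tokens depend on a secret key, whoever holds the key can regenerate tokens for known values. That makes this pseudonymization rather than full anonymization, and the key needs the same protection as the data.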

Second, context and prompts.
Many data leaks do not come from the raw data itself, but from the context embedded in prompts. Sanitization here depends on conscious prompt design, not only on automated filters.

Third, persistence and logs.
Even when a model is stateless, the surrounding systems often are not.
Logs, caches, and integrations require clear rules for retention and access.
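
One way to picture a rule for logs is a redaction filter on the application's own logging, sketched below in Python with the standard `logging` module. The e-mail pattern, the `RedactingFilter` class, and the `make_logger` helper are assumptions for illustration. Retention and access rules would still have to be set separately on wherever the logs are stored.

```python
import logging
import re

# Illustrative pattern only (an assumption for this sketch).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class RedactingFilter(logging.Filter):
    """Mask e-mail addresses in log records before they are written."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Render the final message first so values passed as arguments are covered.
        message = record.getMessage()
        record.msg = EMAIL_RE.sub("[EMAIL]", message)
        record.args = ()
        return True  # keep the record, now redacted


def make_logger(name: str, stream) -> logging.Logger:
    """Build a logger whose handler redacts before writing to stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.addFilter(RedactingFilter())
    logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    import sys

    log = make_logger("prompt_audit", sys.stdout)
    log.info("Prompt received from %s", "jane.doe@example.org")
```

The filter is attached to the handler rather than the logger so that it applies to every record that handler writes.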

A minimum first step before any customization is to classify data by sensitivity and explicitly define what must never enter a prompt.
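
As a sketch of how such a classification could be enforced in practice, the Python below tags each field with a sensitivity level and refuses to build prompt context if anything exceeds an allowed level. The four level names, the `MAX_PROMPT_LEVEL` threshold, and the `Field` type are hypothetical; a real scheme would come from the organization's own policy.

```python
from dataclasses import dataclass
from enum import IntEnum


class Sensitivity(IntEnum):
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


# Hypothetical policy: the highest level that may appear in a prompt.
MAX_PROMPT_LEVEL = Sensitivity.INTERNAL


@dataclass(frozen=True)
class Field:
    name: str
    value: str
    level: Sensitivity


class PromptPolicyError(ValueError):
    """Raised when data above the allowed level is about to enter a prompt."""


def build_prompt_context(fields: list[Field]) -> str:
    """Join fields into prompt context, failing if any field is too sensitive."""
    blocked = [f.name for f in fields if f.level > MAX_PROMPT_LEVEL]
    if blocked:
        raise PromptPolicyError(f"Fields not allowed in prompts: {', '.join(blocked)}")
    return "\n".join(f"{f.name}: {f.value}" for f in fields)


if __name__ == "__main__":
    fields = [
        Field("project", "Water access survey", Sensitivity.PUBLIC),
        Field("beneficiary_id", "B-1042", Sensitivity.RESTRICTED),
    ]
    try:
        build_prompt_context(fields)
    except PromptPolicyError as exc:
        print(exc)
```

Raising an error instead of silently dropping the field is a deliberate choice in this sketch: it makes a policy violation visible to the team rather than hiding it.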

The most common mistake is treating this as a purely AI or information security problem. In practice, it is a matter of risk management, ethical decision making, and system design, just like in any other project.

If a team cannot clearly explain, in simple terms, what data goes in, why it goes in, and with what consequences, then it is not yet ready to use a custom GPT in production.

Posted: Jan 22, 2026 5:32 AM

Kiron Bondale Retired | Mentor| Retired Welland, Ontario, Canada

Jean-Baptiste -

A lot will depend on the nature of the data itself, privacy regulations, and the organization's own privacy and security policy.

While two different data elements might both represent personally identifiable information, the risk of a data breach related to one might be much greater than for the other, so the level of data scrubbing will vary accordingly.

Kiron

Posted: Jan 22, 2026 7:04 AM

Sergio Luis Conte Helping to create solutions for everyone| Worldwide based Organizations Buenos Aires, Argentina

Perhaps I did not understand your statement well, but you do not need to run a sanity check on the LLMs you will use as the foundation model. In fact, you will not be able to do that, and believe me, it is a waste of time. What you can do is use some technique to "customize" the model and adapt it to your needs. I am writing this because it is what we do to create solutions for ourselves and for our customers.

Posted: Jan 25, 2026 5:04 AM (Updated by author: Jan 25, 2026 5:05 AM)

Lissette Indhira Pimentel Sosa

Community Champion

Program Manager| HARPER SRL Santo Domingo / Distrito Nacional, Dominican Republic

For me, the starting point is data classification, not tools. Before feeding anything into a public LLM, the team needs a clear rule on what data is allowed, what must be anonymized, and what should never be used at all.
Sanitization then becomes a mix of masking sensitive fields, being careful with prompt context, and controlling logs and retention. It’s less an AI problem and more a risk and governance decision.

Posted: Jan 26, 2026 10:36 AM
