Project Management

Data Sanitization

Artificial Intelligence
Jean-Baptiste Lacombe Lavigne Founder and lead-Consultant| Humanitarian Partners International Montréal, Quebec, Canada

When creating a custom GPT for a team using an existing public LLM, what are the best processes to sanitize the data fed to the LLM?

Guillaume Baron
Community Champion
Project Manager| CREOS Bertrange, Luxembourg
Hello Jean-Baptiste,

I would suggest looking at ISO 27001 Annex A control 8.11, which covers data masking.
The actions related to some of the controls could help to sanitize the data.
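
As a rough illustration of what data masking can look like before text reaches an LLM, here is a minimal Python sketch. The two patterns and the placeholder labels are assumptions chosen for the example, not something prescribed by the standard, and real data would need patterns tuned to its own formats.

```python
import re

# Illustrative patterns only (assumptions for this sketch); they will miss
# many real-world formats and should be tuned to the data actually in use.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def mask_text(text: str) -> str:
    """Replace e-mail addresses and phone-like numbers with fixed placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


if __name__ == "__main__":
    print(mask_text("Contact ana@example.org or +352 123 456 789."))
```

Pattern matching like this only catches direct identifiers with a recognizable shape; names, addresses, and indirect identifiers need other techniques and human review.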

Cheers
Luis Branco CEO| Business Insight, Consultores de Gestão, Ldª Carcavelos, Lisboa, Portugal
The real question is not only how to sanitize data, but why, and at what level of risk.

In team and project contexts, three distinct areas need to be considered separately.

First, data inputs.
This includes removing or anonymizing personal data, contractual information, financial data, and indirect identifiers.
This is not merely a technical task. It is a governance decision.

Second, context and prompts.
Many data leaks do not come from the raw data itself, but from the context embedded in prompts. Sanitization here depends on conscious prompt design, not only on automated filters.

Third, persistence and logs.
Even when a model is stateless, the surrounding systems often are not.
Logs, caches, and integrations require clear rules for retention and access.
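
One way to apply such a rule to application logs is to redact identifiers before a record is written. The sketch below uses Python's standard `logging.Filter`; the logger name `gpt_gateway` and the e-mail pattern are hypothetical, and retention and access rules would still have to be set in whatever system stores the logs.

```python
import logging
import re

# Illustrative pattern (an assumption for this sketch).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class RedactEmails(logging.Filter):
    """Rewrite each log record so e-mail addresses never reach a handler."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Format the message first so identifiers passed as arguments are caught too.
        record.msg = EMAIL_RE.sub("[EMAIL]", record.getMessage())
        record.args = ()
        return True  # keep the record, now redacted


# Hypothetical logger name. A filter attached to a logger applies only to
# records logged directly on that logger, not on its child loggers.
logger = logging.getLogger("gpt_gateway")
logger.addFilter(RedactEmails())
```

With this in place, a call such as `logger.warning("prompt sent by %s", user_email)` would be stored with the address replaced by `[EMAIL]`.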

A minimum first step before any customization is to classify data by sensitivity and explicitly define what must never enter a prompt.
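
To make that first step concrete, here is a minimal Python sketch of a classification gate. The field names and sensitivity levels are hypothetical; the design choice worth noting is that any field nobody has classified is treated as restricted by default.

```python
from enum import Enum


class Sensitivity(Enum):
    PUBLIC = 1
    INTERNAL = 2
    RESTRICTED = 3  # must never enter a prompt


# Hypothetical classification agreed by the team (an assumption for this sketch).
FIELD_CLASSIFICATION = {
    "project_name": Sensitivity.PUBLIC,
    "milestone_notes": Sensitivity.INTERNAL,
    "beneficiary_name": Sensitivity.RESTRICTED,
    "bank_account": Sensitivity.RESTRICTED,
}


def build_prompt_fields(record: dict) -> dict:
    """Return only the fields cleared for prompts.

    Fields missing from the classification are treated as restricted.
    """
    allowed = {}
    for field, value in record.items():
        level = FIELD_CLASSIFICATION.get(field, Sensitivity.RESTRICTED)
        if level is not Sensitivity.RESTRICTED:
            allowed[field] = value
    return allowed
```

The table itself is the governance decision; the code only enforces it.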

The most common mistake is treating this as a purely AI or information security problem. In practice, it is a matter of risk management, ethical decision making, and system design, just like in any other project.

If a team cannot clearly explain, in simple terms, what data goes in, why it goes in, and with what consequences, then it is not yet ready to use a custom GPT in production.
Kiron Bondale Retired | Mentor| Retired Welland, Ontario, Canada
Jean-Baptiste -

A lot will depend on the nature of the data itself, privacy regulations, and the organization's own privacy and security policy.

While two different data elements might both represent personally identifiable information, the risk of a data breach related to one might be much greater than the other; hence, the level of data scrubbing will vary accordingly.
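
A minimal Python sketch of scrubbing that varies by data element follows. The field names, the choice of treatment for each field, and the key are all hypothetical; which treatment fits which element is for the organization's policy to decide.

```python
import hashlib
import hmac

# Hypothetical key for the example; in practice it would come from a secrets
# manager and stay out of source control.
SECRET_KEY = b"replace-with-a-managed-secret"


def pseudonymize(value: str) -> str:
    """Replace a value with a stable keyed hash so records can still be linked."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return "id_" + digest[:12]


def year_only(iso_date: str) -> str:
    """Generalize an ISO date (YYYY-MM-DD) to its year."""
    return iso_date[:4]


def drop(_value: str) -> None:
    """Remove the value entirely."""
    return None


# Higher-risk elements get stronger treatment (assumed mapping for this sketch).
SCRUBBERS = {
    "name": pseudonymize,
    "birth_date": year_only,
    "passport_number": drop,
}


def scrub(record: dict) -> dict:
    """Apply the per-field treatment; fields not listed pass through unchanged."""
    out = {}
    for field, value in record.items():
        action = SCRUBBERS.get(field)
        if action is None:
            out[field] = value
            continue
        result = action(value)
        if result is not None:
            out[field] = result
    return out
```

Keyed hashing is pseudonymization rather than anonymization: whoever holds the key can regenerate the mapping, so the key needs the same protection as the original data.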

Kiron
Sergio Luis Conte Helping to create solutions for everyone| Worldwide based Organizations Buenos Aires, Argentina

Perhaps I did not understand your statement well, but you do not need to run a sanity check on the LLMs you will use as the foundation model. In fact, you will not be able to do that, and believe me, it is a waste of time. What you can do is use some technique to "customize" the model and adapt it to your needs. I am writing this because it is what we do to create solutions for ourselves and for our customers.

Lissette Indhira Pimentel Sosa
Community Champion
Program Manager| HARPER SRL Santo Domingo / Distrito Nacional, Dominican Republic
For me, the starting point is data classification, not tools. Before feeding anything into a public LLM, the team needs a clear rule on what data is allowed, what must be anonymized, and what should never be used at all.
Sanitization then becomes a mix of masking sensitive fields, being careful with prompt context, and controlling logs and retention. It’s less an AI problem and more a risk and governance decision.
