I would suggest looking at ISO 27001 Annex A control 8.11, which covers data masking. The actions related to that control could help to sanitize the data.
Cheers
Luis Branco, CEO | Business Insight, Consultores de Gestão, Ldª | Carcavelos, Lisboa, Portugal
The real question is not only how to sanitize data, but why, and at what level of risk.
In team and project contexts, three distinct areas need to be considered separately.
First, data inputs. This includes removing or anonymizing personal data, contractual information, financial data, and indirect identifiers. This is not merely a technical task. It is a governance decision.
Second, context and prompts. Many data leaks do not come from the raw data itself, but from the context embedded in prompts. Sanitization here depends on conscious prompt design, not only on automated filters.
Third, persistence and logs. Even when a model is stateless, the surrounding systems often are not. Logs, caches, and integrations require clear rules for retention and access.
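As a rough illustration of the logging point, here is a minimal sketch that redacts email addresses before a log record is written. It assumes Python's standard `logging` module; the `EMAIL_RE` pattern, the `RedactEmails` class, and the `redact` helper are hypothetical names chosen for this example, and the pattern is deliberately simple rather than a complete PII detector.

```python
import logging
import re

# Hypothetical pattern: a simple email matcher, for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact(text: str) -> str:
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_RE.sub("[EMAIL]", text)


class RedactEmails(logging.Filter):
    """Logging filter that redacts email addresses before a record is emitted."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Render the final message first so that formatting arguments
        # are redacted along with the message template.
        record.msg = redact(record.getMessage())
        record.args = None
        return True  # keep the record, now redacted
```

Attaching the filter to a handler with `handler.addFilter(RedactEmails())` applies it to everything that handler writes. Retention and access rules for the log files themselves still have to be decided separately; a filter like this only limits what reaches them.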
A minimum first step before any customization is to classify data by sensitivity and explicitly define what must never enter a prompt.
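To make that first step concrete, here is one possible sketch of a classification gate. Everything in it is an assumption for illustration: the three `Sensitivity` levels, the field names in `FIELD_POLICY`, and the `prepare_for_prompt` function are hypothetical, and a real classification would come from the team's own data policy.

```python
from enum import Enum


class Sensitivity(Enum):
    PUBLIC = "public"          # may be sent as-is
    INTERNAL = "internal"      # must be masked before sending
    RESTRICTED = "restricted"  # must never enter a prompt


# Hypothetical classification; a real one comes from the team's data policy.
FIELD_POLICY = {
    "project_name": Sensitivity.PUBLIC,
    "client_name": Sensitivity.INTERNAL,
    "contract_value": Sensitivity.RESTRICTED,
}


def prepare_for_prompt(record: dict) -> dict:
    """Return a copy of `record` that is safe to place in a prompt.

    Unclassified fields are treated as RESTRICTED (deny by default).
    """
    safe = {}
    for field, value in record.items():
        level = FIELD_POLICY.get(field, Sensitivity.RESTRICTED)
        if level is Sensitivity.RESTRICTED:
            raise ValueError(f"field {field!r} must never enter a prompt")
        safe[field] = value if level is Sensitivity.PUBLIC else "[MASKED]"
    return safe
```

Failing loudly on restricted or unclassified fields, instead of silently dropping them, is a design choice: it forces someone to make the classification decision before the data goes anywhere.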
The most common mistake is treating this as a purely AI or information security problem. In practice, it is a matter of risk management, ethical decision making, and system design, just like in any other project.
If a team cannot clearly explain, in simple terms, what data goes in, why it goes in, and with what consequences, then it is not yet ready to use a custom GPT in production.
A lot will depend on the nature of the data itself, privacy regulations, and the organization's own privacy and security policy.
While two different data elements might both represent personally identifiable information, the risk of a data breach related to one might be much greater than for the other; hence the level of data scrubbing will vary accordingly.
Kiron
Sergio Luis Conte, Helping to create solutions for everyone | Worldwide based Organizations | Buenos Aires, Argentina
Perhaps I did not understand your statement well, but you do not need to run a sanity check on the LLMs you will use as the foundation model. In fact, you will not be able to do that. And believe me, it is a waste of time. What you can do is use some technique to "customize" the model and adapt it to your needs. I am writing this because it is what we do to create solutions for ourselves and for our customers.
Program Manager | HARPER SRL | Santo Domingo / Distrito Nacional, Dominican Republic
For me, the starting point is data classification, not tools. Before feeding anything into a public LLM, the team needs a clear rule on what data is allowed, what must be anonymized, and what should never be used at all. Sanitization then becomes a mix of masking sensitive fields, being careful with prompt context, and controlling logs and retention. It’s less an AI problem and more a risk and governance decision.
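For the masking part, one option is pseudonymization, where each sensitive value is swapped for a stable placeholder so the prompt still reads coherently. The sketch below is only an illustration under assumptions: it handles email addresses alone, and `EMAIL_RE` and `pseudonymize_emails` are hypothetical names, not part of any library.

```python
import re

# Hypothetical pattern: a simple email matcher, for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def pseudonymize_emails(text: str) -> tuple[str, dict[str, str]]:
    """Replace each distinct email address with a stable placeholder.

    Returns the masked text and the email-to-placeholder mapping. The
    mapping stays on the caller's side and is never sent to the model.
    """
    mapping: dict[str, str] = {}

    def replace(match: re.Match) -> str:
        email = match.group(0)
        if email not in mapping:
            mapping[email] = f"<PERSON_{len(mapping) + 1}>"
        return mapping[email]

    return EMAIL_RE.sub(replace, text), mapping
```

The same address always maps to the same placeholder within one call, so references stay consistent in the masked text. The mapping is itself sensitive and falls under the same retention and access rules as the original data.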