Jun 21, 2026

We graded 205 AI apps on data transparency. Most of them failed.

We scored the privacy policies and terms of 205 widely used AI apps on what they disclose about training, retention, deletion and sharing, using one consistent rubric for every app. Only 23% earned an A or B, half are silent on whether your data trains their models or reserve the right to train with no opt-out, and nearly one in three reserves a clause we treat as a dealbreaker. Here is what the data shows, and how the index works.

You probably could not say, off the top of your head, whether the AI app you used this morning trains its models on what you typed into it. Neither could most people. The answer is usually written down, in a privacy policy or a set of terms almost nobody reads, and the wording ranges from a clear commitment to a deliberate silence.

So we built something to score them at scale. For 205 of the most widely used AI apps, from consumer chat assistants to the enterprise tools that sit on top of corporate data, we captured the policies and graded each one against a single rubric covering training, retention, deletion, sharing and security. The results are published as the VerifyWise AI Trust & Transparency Index.

The headline is not flattering to the industry. Only 23% of the apps earned an A or B. More than half landed at D or F. This post walks through what the data shows, why a transparency index is worth building, and how the scoring works so you can judge it for yourself.

What the grades look like

Across 205 apps, the distribution skews low. Eleven apps earned an A and thirty-six a B. The remaining 158 fell into the C, D and F bands, with 52, 57 and 49 apps respectively.

Bar chart of AI Trust Index grades across 205 apps, showing 11 A, 36 B, 52 C, 57 D and 49 F
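
The shares quoted above can be re-derived from the chart's counts. The short Python sketch below uses only the five published counts; the variable and function names are illustrative and not part of the index itself.

```python
# Grade counts as reported in the chart above.
grade_counts = {"A": 11, "B": 36, "C": 52, "D": 57, "F": 49}

total = sum(grade_counts.values())


def share(*grades):
    """Return the percentage of apps holding any of the given grades."""
    return 100 * sum(grade_counts[g] for g in grades) / total


print(total)                    # 205
print(round(share("A", "B")))   # 23
print(round(share("D", "F")))   # 52
```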

A grade here measures one specific thing: the quality and substance of what an app commits to in writing. It is not a safety rating, and it is not a claim about what a company does behind closed doors. A strong policy can sit on top of weak practice, and a thin policy can hide a careful one. What the grade captures is whether the documents you agree to make clear, specific commitments, or whether they lean on silence and vague boilerplate.

Read that way, a C or D is not an accusation of bad behaviour. It is a statement that the disclosure is incomplete. For a category built entirely on trust, incomplete disclosure is its own kind of risk.

The training-data blind spot

The single most revealing finding sits in the first scoring domain: whether an app trains its models on the data you give it, and whether you can stop it.

Stacked bar showing 50% of apps silent or training with no opt-out, 35% vague, and 14% clearly limiting training

Half of the apps we graded are effectively silent on the question, or reserve the right to train with no opt-out you can point to. Another 35% address it, but vaguely or conditionally. Only 14% make a clear commitment, either stating they do not train on your inputs or naming a concrete opt-out mechanism.

This matters because training use is close to irreversible. Once your prompts, documents or recordings are folded into a model's weights, there is no practical way to extract them again. An app that stays quiet on training is not necessarily doing anything wrong, but it is leaving you without the one disclosure that would let you make an informed choice.

One in three reserve a dealbreaker clause

Some clauses are serious enough that we flag them no matter how good the rest of a policy reads. We treat four as dealbreakers: training on your data with no way to opt out, a broad and perpetual licence to your content, a refusal to delete, and selling or sharing your data for advertising with no opt-out.

Sixty-five of the 205 apps, just under a third, reserve at least one of these. The most common by far is the content licence: 57 apps claim broad, often perpetual and sublicensable rights over what you upload, frequently in tools aimed squarely at business users. Training with no opt-out is a distant second at 16 apps. Selling or sharing data for advertising (5 apps) and refusing deletion (2 apps) are rarer still.

Bar chart of dealbreaker flag triggers: 57 apps with a broad or perpetual content licence, 16 training with no opt-out, 5 selling data, 2 refusing deletion
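
The four flag counts add up to more than 65 because a single app can trigger more than one flag. A quick check in Python, using only the figures reported above and assuming each app is counted at most once per flag (the variable names are ours):

```python
# Flag triggers as reported in the chart above.
flag_triggers = {
    "broad or perpetual content licence": 57,
    "training with no opt-out": 16,
    "selling data": 5,
    "refusing deletion": 2,
}
flagged_apps = 65   # apps reserving at least one dealbreaker clause
total_apps = 205

total_triggers = sum(flag_triggers.values())
# Triggers beyond the first flag on an app; nonzero means flags overlap.
overlapping_triggers = total_triggers - flagged_apps
flagged_share = 100 * flagged_apps / total_apps

print(total_triggers)            # 80
print(overlapping_triggers)      # 15
print(round(flagged_share, 1))   # 31.7
```

Eighty triggers across 65 apps means some of the flagged apps carry two or more flags.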

A flag does not mean an app is unsafe. It means a specific clause in its own terms gives it a right that most users would not knowingly accept. The grade and the quoted clause are published together, so you can read the language and decide.

Why build a transparency index at all

There is no shortage of AI tools claiming to be private, secure and responsible. What is missing is a consistent, documented way to check those claims against the only public evidence available, which is the text of the policies.

Most "trust" signals in this market are self-attested. A vendor writes "enterprise-grade security" on a landing page, and you either believe it or you do not. An index that scores the actual written commitments, against a fixed rubric, on the same scale for every app, turns a marketing claim into something comparable. It will not tell you whether a company honours its policy. It will tell you whether the company was willing to commit to it in writing in the first place, which is a meaningful filter on its own.

This is the same problem organizations face internally with the AI tools their own staff adopt. The difference is that inside a company, the unapproved tools are often invisible until something goes wrong. We wrote about that pattern separately in shadow AI detection, and the data exposure question is identical: you cannot govern what you have not read.

Govern the AI your organization runs. The same transparency gaps in these policies show up in the AI tools your teams adopt every day. VerifyWise helps you inventory those tools, assess vendor risk, and keep a defensible record of every governance decision. Talk to us about governing your AI systems.

How the scoring works

The method is built to be reproducible. Scoring is automated, applying a fixed prompt at temperature zero so that anyone with the same policy text and the same rubric arrives at the same letter grade. That is the property a self-attested badge can never offer.

For each app, we capture the privacy policy and, where a distinct terms-of-service document exists, the terms as well, saving each as a dated snapshot. Across the full index that came to more than 400 documents. The text is then scored across seven data-governance domains:

Training-data use
Data-subject rights
Retention and minimization
Third-party sharing and transfers
Transparency and AI disclosure
Sensitive data and children
Security and accountability

Each domain holds several indicators, thirty in total, and every indicator awards full, half or no credit against a quoted clause or a recorded silence.

The point budget for each domain is published, so a reader can re-add the numbers by hand instead of trusting a hidden formula. A clause earns full credit only when it is specific, naming a mechanism, a concrete timeframe such as "30 days," or a defined standard such as "AES-256." Named-but-vague boilerplate, like "industry-standard security" or "as long as necessary," scores half at most. Silence scores zero, and the published record distinguishes a silent indicator from one where the app actively reserves a harmful right.
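
The full, half or no credit rule can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the index's actual implementation: the indicator names, the point values and the `Indicator` and `domain_score` names are all hypothetical, and the real budgets are the ones published with the methodology.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Credit(Enum):
    FULL = 1.0   # specific: names a mechanism, timeframe or standard
    HALF = 0.5   # named but vague boilerplate
    NONE = 0.0   # silence


@dataclass(frozen=True)
class Indicator:
    name: str
    points: float              # hypothetical share of the domain budget
    credit: Credit
    evidence: Optional[str]    # quoted clause, or None for a recorded silence
    reserves_harmful_right: bool = False  # kept distinct from plain silence


def domain_score(indicators):
    """Return (earned, budget) so the sum can be re-added by hand."""
    earned = sum(i.points * i.credit.value for i in indicators)
    budget = sum(i.points for i in indicators)
    return earned, budget


# Hypothetical retention-domain findings for one app.
retention = [
    Indicator("retention window named", 4, Credit.FULL, "deleted within 30 days"),
    Indicator("retention of account data", 4, Credit.HALF, "as long as necessary"),
    Indicator("backup deletion", 2, Credit.NONE, None),
]

earned, budget = domain_score(retention)
print(f"{earned:g} / {budget:g}")   # 6 / 10
```

Keeping the quoted clause next to each credit is what lets a reader check a score against the policy text instead of taking it on trust.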

One design choice shapes the whole index: length is never scored, only substance. A short policy that explicitly grants deletion, names a retention window, and says it does not train on your data earns those full slices whatever its word count. This keeps the index from rewarding the longest, most lawyered documents over the clearest ones.

Where the disclosure breaks down

Average the scores by domain and a clear pattern appears. Apps are reasonably good at spelling out your formal rights: access, deletion, correction and the like. They are far quieter about what they do with your data once they have it.

Horizontal bar chart of average disclosure by domain: data-subject rights 73%, third-party sharing 59%, transparency 56%, security 38%, sensitive data 32%, retention 25%, training-data use 23%

Data-subject rights score 73% on average, the strongest of the seven domains. Training-data use and retention sit at the bottom, around 23% and 25%. Put plainly, the typical policy is happy to tell you how to file an access request, and reluctant to tell you whether your inputs train its models or how long it keeps them. Security disclosure is thin too: only 22% of apps commit to a breach-notification timeline.
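
For readers who want to reproduce this kind of per-domain view from the published scores, the averaging step might look like the sketch below. The three apps and their percentages are invented for illustration and are not taken from the index; only the domain names come from the rubric.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-app results: percentage of each domain's points earned.
app_scores = {
    "app_one": {"data-subject rights": 80, "retention": 20, "training-data use": 0},
    "app_two": {"data-subject rights": 70, "retention": 40, "training-data use": 30},
    "app_three": {"data-subject rights": 60, "retention": 0, "training-data use": 15},
}

# Group every app's score under its domain.
by_domain = defaultdict(list)
for scores in app_scores.values():
    for domain, pct in scores.items():
        by_domain[domain].append(pct)

averages = {domain: mean(values) for domain, values in by_domain.items()}

# Print the weakest domain first.
for domain, avg in sorted(averages.items(), key=lambda item: item[1]):
    print(f"{domain}: {avg:.0f}%")
```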

Why the strongest apps cluster at B

A pattern worth understanding is that very few apps reach an A, and the best general-purpose assistants tend to stop at B. That is not an accident of the curve. Most mainstream assistants train on your conversations by default and offer an opt-out, rather than not training at all. We treat "we do not train on your data," or opt-in only, as clearly stronger than "we train by default, opt out if you can find the setting."

So even a strong assistant cannot top the scale on training alone, and a cluster of otherwise solid products lands at B. The index is reflecting how the current market works, not penalising apps arbitrarily.

What the index does not claim

Stating the limits plainly is part of making the method trustworthy.

It scores disclosure, not behaviour. A vendor can write the right words without doing them, and a careful drafter can score well on language alone. The weights and thresholds are declared editorial judgements, calibrated in a pilot rather than derived from first principles. And the index is a snapshot in time: a grade reflects each policy as captured on its assessment date, and policies change, which is why re-publication triggers a re-score. None of this is hidden. The full reasoning lives on the methodology page, alongside the disclaimer.

We also remove apps that no longer exist as standalone products, for example after an acquisition folds them into a larger platform, so the index reflects tools you can evaluate and use today.

Now ask the same questions inside your organization

Scoring 205 policies against the same rubric makes one thing obvious: transparency is uneven, and the gaps are rarely where you would guess. Some enterprise tools handling sensitive corporate data disclose less than consumer apps a fraction of their size. The only way to see that is to hold every app to a consistent standard.

That is also the harder, quieter half of AI governance inside an organization. The questions the index asks of a public app (what does it train on, how long does it keep data, who does it share with) are the same questions a governance team has to answer for every AI tool, model and vendor it adopts. Doing that systematically, with documented evidence behind every decision, is what frameworks like the EU AI Act and ISO 42001 increasingly require, and it is the work VerifyWise is built to support through model inventories, vendor risk assessments and evidence management.

You can browse the full results, filter by grade or category, and read the quoted clause behind every score at the AI Trust & Transparency Index. If you run an AI app and think a grade misreads your policy, the methodology page tells you how to ask for a re-grade. We update grades when the evidence supports it.

About the VerifyWise team

VerifyWise builds source-available AI governance software that helps organizations manage risk, compliance and oversight across their AI portfolios. Our editorial team draws on hands-on experience implementing governance workflows for regulated industries and fast-growing AI teams.
