AI app trust & transparency index
Methodology
This page explains exactly how every grade is produced: what we read, how we score it, what each grade means, and what the index deliberately does not claim. The method is designed to be reproducible: anyone with the same policy text and this rubric should arrive at the same letter.
What this measures, and what it does not
The index scores what an AI app discloses about its data-governance practices in its public privacy policy and terms. It measures the quality and substance of those written commitments. It does not measure whether the commitments are true, whether they are enforced, or whether the underlying AI is “safe.”
A strong policy can hide weak practice, and a thin policy can hide good practice. Every grade is a statement about documents, not about behaviour.
We express confidence from positive public evidence. When an app clearly documents a good practice, it earns points. When a practice is undocumented, it earns nothing, and we make no assumption that the practice is bad. Absence of evidence lowers the score and the confidence figure. It is not an accusation.
The seven domains (10 points total)
Each app is scored across seven data-governance domains. A domain’s point budget is its weight, made visible, so a reader can re-add the points by hand instead of trusting a hidden formula. The allocation is a declared editorial judgement by VerifyWise about which disclosures matter most.
| ID | Domain | What we look for | Points |
|---|---|---|---|
| D1 | Training-data use | Whether user inputs train models, opt-out or opt-in, human-reviewer access, and whether the user owns generated outputs. | 2.0 |
| D2 | Data-subject rights & user control | Access, deletion, portability, correction, and the right to object or opt out, each with a named mechanism. | 2.2 |
| D3 | Retention, deletion & minimization | Named retention periods, deletion timelines, shorter windows for AI conversation logs, and a data-minimization commitment. | 1.5 |
| D4 | Third-party sharing, sub-processors & transfers | Categories of recipients, a sub-processor list or DPA, whether data is sold or shared, international-transfer safeguards, and government-access standards. | 1.5 |
| D5 | Transparency & AI disclosure | Disclosure that you are interacting with AI, marking of AI-generated output, the data categories collected, legal bases, and policy versioning. | 1.5 |
| D6 | Sensitive data, children & automated decisions | Special-category data limits, biometric governance, children's-data protections, and disclosure of consequential automated decisions. | 0.7 |
| D7 | Security & accountability | Named security controls, a breach-notification commitment, and named certifications or a privacy contact. | 0.6 |
| | Total | | 10.0 |
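Re-adding the point budgets by hand can also be done in a few lines. The sketch below copies the seven budgets from the table, keyed by their D1 to D7 identifiers; the dictionary name is illustrative and not part of the index's own tooling.

```python
from decimal import Decimal

# Point budgets copied from the table above, keyed by domain ID.
# Decimal avoids float rounding noise when adding tenths.
DOMAIN_POINTS = {
    "D1": Decimal("2.0"),  # Training-data use
    "D2": Decimal("2.2"),  # Data-subject rights & user control
    "D3": Decimal("1.5"),  # Retention, deletion & minimization
    "D4": Decimal("1.5"),  # Third-party sharing, sub-processors & transfers
    "D5": Decimal("1.5"),  # Transparency & AI disclosure
    "D6": Decimal("0.7"),  # Sensitive data, children & automated decisions
    "D7": Decimal("0.6"),  # Security & accountability
}

total = sum(DOMAIN_POINTS.values())
print(total)  # 10.0
```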
How points are earned
Each domain holds several indicators (29 in total). Every indicator awards all, half, or none of its point slice:
- Full. The clause explicitly grants the right or limit with specifics: a named mechanism, a concrete timeframe (a number), or a defined scope, such as “30 days,” “Standard Contractual Clauses,” or “AES-256.”
- Half. The topic is addressed, but vaguely, conditionally, or only on a paid tier (for example “industry-standard security” or “as long as necessary”). Named-but-vague boilerplate always scores half, never zero.
- Zero. The policy is silent on the topic, or it explicitly reserves the right to the harmful behaviour. Both earn zero, but the published record marks which zeros are silent and which are adverse.
The total is the sum of awarded points, scaled to 100 over the indicators that apply to that app. Indicators that depend on a capability the app does not have, such as synthetic-media marking for a text-only app, are removed from both the earned points and the maximum, so no app is penalised for a capability it lacks.
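The scaling rule above can be sketched as follows. This is a minimal illustration, not the index's actual code: the indicator names and slice sizes are hypothetical, since this page does not publish how each domain budget is divided among its indicators. An indicator whose level is `None` is treated as not applicable and dropped from both the earned points and the maximum.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Level(Enum):
    FULL = 1.0   # explicit grant with specifics
    HALF = 0.5   # addressed, but vague or conditional
    ZERO = 0.0   # silent or adverse


@dataclass
class Indicator:
    name: str                # hypothetical label, for illustration only
    slice_points: float      # hypothetical share of the domain budget
    level: Optional[Level]   # None means the indicator does not apply


def score_out_of_100(indicators: List[Indicator]) -> float:
    """Sum awarded points and scale to 100 over applicable indicators."""
    applicable = [i for i in indicators if i.level is not None]
    max_points = sum(i.slice_points for i in applicable)
    if max_points == 0:
        raise ValueError("no applicable indicators to score")
    earned = sum(i.slice_points * i.level.value for i in applicable)
    return 100 * earned / max_points


example = [
    Indicator("deletion right", 0.5, Level.FULL),
    Indicator("retention period", 0.5, Level.HALF),
    Indicator("synthetic-media marking", 0.5, None),  # text-only app
]
print(score_out_of_100(example))  # 75.0
```

In the example, the text-only app earns 0.75 of a possible 1.0 points, because the synthetic-media indicator is excluded from the maximum instead of counting as a zero.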
Length is never scored, only documented substance. A short policy that explicitly grants deletion, names a retention window, and says it does not train on your data earns those full slices whatever its length.
From score to grade
The score, on a scale of 0 to 100, is banded into five letter grades:
The raw score and the per-domain points are always published next to the letter, so you can see the distance between two apps in the same band and re-derive the total yourself.
Dealbreaker flags
Some clauses are dealbreakers for trust no matter how good the rest of the policy reads. When an app explicitly reserves one of these, we raise a prominent flag and cap its displayed grade at B. The underlying score is never changed, so the number stays accurate while the warning stays visible:
- Trains on your data with no way to opt out.
- Claims a broad, perpetual licence to your content.
- Refuses deletion or asserts indefinite retention with no deletion right.
- Sells or shares your data for advertising with no opt-out.
This cap rarely fires. None of the apps reviewed so far stated an explicit dealbreaker clause. Modern policies tend to push risk into silence and vagueness instead of openly reserving a harmful right, and the point sum already penalises that silence.
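The cap described above can be sketched as a function that changes only the displayed letter and never touches the score. The grade ordering A, B, C, D, F is an assumption made for illustration; this page states that there are five letter grades but does not list them here. The enum member names are likewise hypothetical.

```python
from enum import Enum
from typing import Set


class Dealbreaker(Enum):
    TRAINS_WITHOUT_OPT_OUT = "trains on your data with no way to opt out"
    PERPETUAL_LICENCE = "broad, perpetual licence to your content"
    NO_DELETION = "refuses deletion or retains indefinitely"
    AD_SHARING_WITHOUT_OPT_OUT = "sells or shares data for ads, no opt-out"


# Assumed ordering, best grade first (not confirmed by the source).
GRADE_ORDER = ["A", "B", "C", "D", "F"]
CAP = "B"


def displayed_grade(earned_grade: str, flags: Set[Dealbreaker]) -> str:
    """Return the letter to display; the numeric score is left untouched."""
    if not flags:
        return earned_grade
    # With any flag raised, show whichever of the earned grade and the
    # cap sits lower on the scale (a higher index means a worse grade).
    return max(earned_grade, CAP, key=GRADE_ORDER.index)


print(displayed_grade("A", {Dealbreaker.PERPETUAL_LICENCE}))  # B
print(displayed_grade("C", {Dealbreaker.NO_DELETION}))        # C
```

The cap only ever lowers a displayed grade: an app already below B keeps its earned letter.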
Confidence
Alongside each grade we publish a confidence band of High, Medium, or Low, based on how much of the judgement rests on quoted evidence versus recorded silence. An app graded low on many silent indicators carries lower confidence than one graded low on quoted adverse clauses. We therefore state how much of each grade is backed by quoted text, something a self-attested label never does.
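One way to picture the quoted-versus-silent comparison is as a simple share. The sketch below is an assumption-laden illustration: it counts indicators equally, whereas the index may weight them by points, and it stops short of mapping the share to High, Medium, or Low because this page does not state the band cutoffs.

```python
from enum import Enum
from typing import List


class Evidence(Enum):
    QUOTED = "quoted"  # judgement backed by a quoted clause, good or adverse
    SILENT = "silent"  # the policy says nothing on the topic


def quoted_share(evidence: List[Evidence]) -> float:
    """Fraction of indicator judgements that rest on quoted policy text."""
    if not evidence:
        raise ValueError("no indicator judgements recorded")
    quoted = sum(1 for e in evidence if e is Evidence.QUOTED)
    return quoted / len(evidence)


# Two hypothetical low-scoring apps with five indicators each.
mostly_silent = [Evidence.SILENT] * 4 + [Evidence.QUOTED]
mostly_adverse = [Evidence.QUOTED] * 4 + [Evidence.SILENT]

print(quoted_share(mostly_silent))   # 0.2
print(quoted_share(mostly_adverse))  # 0.8
```

Both hypothetical apps could score poorly, but the second rests on quoted clauses and would carry the higher confidence.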
Why even the best apps cluster at B
Most general-purpose assistants train on your conversations by default and offer an opt-out, rather than not training at all. We treat “we do not train on your data” (or opt-in only) as clearly stronger than “we train by default, opt out if you find the setting.” So even the strongest such app loses points in the training domain, which alone keeps it off the top of the scale, and a group of strong assistants lands at B. That reflects how the current market actually works.
Which apps are included
Apps are drawn from a citable third-party ranking, the a16z Top 100 Gen AI Consumer Apps, 6th Edition (Mar 2026), so VerifyWise does not choose which apps are in scope. The ranking decides the list; VerifyWise decides the score. Capability metadata, such as whether an app generates images, processes biometrics, or operates internationally, comes from that same source and is never inferred from policy text.
Reproducibility, freshness & disputes
- Reproducible. Scoring is automated with a fixed prompt at temperature 0, under rubric version 1.2. The same policy snapshot and rubric yield the same grade.
- A snapshot in time. A score reflects each policy as captured on its assessed date; policies change, and re-publication triggers a re-score.
- Region. We score the global / US-default version most users receive, not the strongest regional carve-out.
- Independence. No app can pay to change its grade. A score changes only on document evidence, such as a new or corrected clause, never on a vendor’s claim about its own behaviour.
Known limitations
- It scores disclosure, not behaviour. A vendor can write the right words without acting on them.
- It can be gamed by careful drafting. A well-written policy scores well whatever the practice behind it.
- Weights and thresholds are declared editorial judgements, calibrated in a pilot rather than derived empirically.
- This is an early batch. More apps are being added, and grades may be revised as the rubric evolves; every change is versioned.
Assessed 2026-06-20. Rubric version 1.2.