Back to EvalWise

Complete feature reference

EvalWise features

Everything you need to systematically test, evaluate, and secure your Large Language Models. From red teaming to compliance, explore the complete EvalWise capability set.

Evaluation metrics

6 core metrics for comprehensive scoring

Built-in scorers cover the most critical evaluation dimensions. Each metric produces a 0.0-1.0 score with configurable thresholds.

Answer relevancy

Measures if the response directly addresses the question asked. Catches off-topic, evasive, or tangential responses.

Threshold:0.7+ for production
Use case:All LLM applications

Bias detection

Identifies discriminatory or unfair content including gender, racial, political, and age-based bias patterns.

Threshold:0.9+ for user-facing apps
Use case:Customer service, HR tools, public-facing AI

Toxicity detection

Flags harmful, offensive, abusive, or threatening language. Includes profanity, insults, and harmful content.

Threshold:0.95+ for public apps
Use case:Chat applications, content moderation

Faithfulness

Evaluates whether responses are grounded in the provided context. Essential for RAG systems to prevent fabrication.

Threshold:0.8+ for knowledge bases
Use case:RAG systems, document Q&A

Hallucination detection

Identifies fabricated facts, unsupported claims, and confident but false statements without explicit context requirement.

Threshold:0.85+ for factual apps
Use case:Research assistants, factual Q&A

Contextual relevancy

Tests RAG retrieval quality by evaluating if the retrieved context actually answers the question asked.

Threshold:0.7+ for retrieval systems
Use case:RAG optimization, search quality

Conversational AI

Multi-turn conversation evaluation

Go beyond single-turn testing. Evaluate how your chatbots and agents maintain context, coherence, and helpfulness across extended dialogues.

1

Turn relevancy

Evaluates if each response is relevant to the current turn in the conversation context.

2

Knowledge retention

Measures how well the model remembers and uses information from earlier in the conversation.

3

Conversation coherence

Assesses the logical flow and consistency of responses across the entire dialogue.

4

Helpfulness

Scores how effectively the assistant addresses user needs and provides actionable responses.

5

Task completion

Tracks whether multi-step tasks are successfully completed across conversation turns.

6

Per-turn safety

Applies bias and toxicity evaluation to each individual assistant response in the conversation.

Dataset type support

Single-turn

Isolated question-answer pairs for testing individual responses.

prompt → response evaluation
Multi-turn

Full conversation threads with context across multiple exchanges.

conversation history → response evaluation
Simulated

Auto-generated conversations for comprehensive coverage testing.

model generates responses for evaluation
RAG datasets

Prompts with retrieval context for grounding evaluation.

context + prompt → response evaluation

Security testing

50+ red teaming scenarios

Identify vulnerabilities before they reach production with comprehensive adversarial testing across jailbreaks, privacy probes, and safety boundaries.

Jailbreak attacks

  • DAN (Do Anything Now) variations
  • Role-playing persona attacks
  • Hypothetical scenario framing
  • Academic/research pretexts
  • Translation detour attacks

Privacy probes

  • PII extraction attempts
  • Training data recovery
  • Membership inference
  • Model inversion attacks
  • Prompt injection for data leakage

Safety boundary testing

  • Authority impersonation
  • Benign preamble masking
  • Instruction hierarchy bypass
  • System prompt extraction
  • Guardrail circumvention

Domain-specific threats

  • Financial fraud scenarios
  • Medical misinformation
  • Legal advice boundaries
  • Dangerous content requests
  • Custom scenario builder

Dual LLM architecture

Separate target and evaluator models prevent self-evaluation bias. Test GPT-4 with Claude as judge for objective scoring.

Schedule demo

Regulatory alignment

Built-in compliance frameworks

Map evaluation results directly to regulatory requirements with pre-configured rubrics and automated documentation generation.

ISO 42001

AI Management System certification alignment

Evaluation dimensions

Robustness
Safety
Transparency
Accountability

EU AI Act

High-risk AI system requirements

Evaluation dimensions

Bias & Fairness
Accuracy
Human Oversight
Risk Management
Transparency

NIST AI RMF

Risk Management Framework compliance

Evaluation dimensions

Govern
Map
Measure
Manage

Automated documentation

Generate audit-ready reports with one click

Gap tracking

Identify compliance gaps across your model portfolio

Custom rubrics

Define domain-specific evaluation criteria

Integrations

Works with your LLM stack

Connect to major LLM providers as both target and judge models. Bring your own models with OpenAI-compatible endpoints.

OpenAI

OpenAI

  • GPT-4
  • GPT-4 Turbo
  • GPT-4o
  • GPT-3.5 Turbo
Anthropic

Anthropic

  • Claude 3 Opus
  • Claude 3 Sonnet
  • Claude 3 Haiku
Google

Google

  • Gemini Pro
  • Gemini Ultra
  • Gemini Flash
Mistral

Mistral

  • Mistral Large
  • Mistral Medium
  • Mistral Small
Azure

Azure OpenAI

  • All OpenAI models via Azure
Ollama

Ollama

  • Llama 3
  • Mistral
  • Phi-3
  • Local models
HuggingFace

HuggingFace

  • Open-source models
  • Custom fine-tunes
Groq

Groq

  • Llama 3
  • Mixtral
  • Gemma

Plus xAI Grok, OpenRouter, Replicate, and any OpenAI-compatible API endpoint

Enterprise ready

Built for production workloads

Multi-tenant architecture, comprehensive security, and enterprise-grade reliability for mission-critical AI evaluation.

Multi-tenant architecture

  • Organization-based data isolation
  • Role-based access control (RBAC)
  • User management per organization
  • Encrypted API key storage (AES)

Security & compliance

  • JWT token authentication
  • Complete audit trail logging
  • CORS protection
  • Tenant middleware validation

Analytics & monitoring

  • Real-time dashboard metrics
  • Performance trend analysis
  • Cost tracking per evaluation
  • Export capabilities

API & integration

  • RESTful API endpoints
  • Webhook notifications
  • CI/CD pipeline integration
  • Custom adapter support

Workflow

4-step experiment wizard

Configure and run evaluations with an intuitive step-by-step workflow.

1

Configure model

Select provider, model, and generation parameters (temperature, max tokens).

2

Select dataset

Choose from built-in datasets, upload custom data, or filter by category.

3

Configure judge

Select independent evaluator model with optimized settings for consistent scoring.

4

Select metrics

Enable scorers, set thresholds, and configure custom rubrics.

Ready to secure your AI?

Get started with comprehensive LLM testing. Schedule a demo to see EvalWise in action.

EvalWise Features - LLM Security & Evaluation Platform | VerifyWise