Complete feature reference
Everything you need to systematically test, evaluate, and secure your Large Language Models. From red teaming to compliance, explore the complete EvalWise capability set.
Evaluation metrics
Built-in scorers cover the most critical evaluation dimensions. Each metric produces a 0.0-1.0 score with configurable thresholds.
Measures whether the response directly addresses the question asked, catching off-topic, evasive, or tangential responses.
Identifies discriminatory or unfair content, including gender, racial, political, and age-based bias patterns.
Flags harmful, offensive, abusive, or threatening language, including profanity, insults, and other harmful content.
Evaluates whether responses are grounded in the provided context. Essential for RAG systems to prevent fabrication.
Identifies fabricated facts, unsupported claims, and confident but false statements, without requiring explicit context.
Tests RAG retrieval quality by evaluating whether the retrieved context actually answers the question asked.
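Because every scorer emits a value between 0.0 and 1.0 and is gated by a configurable threshold, pass/fail logic stays uniform across metrics. The snippet below is a minimal sketch of that gating, assuming a simple result container; the MetricResult class, metric names, and threshold values are illustrative, not the EvalWise API.

```python
from dataclasses import dataclass

@dataclass
class MetricResult:
    """Illustrative container for one metric outcome (not the EvalWise API)."""
    name: str
    score: float       # every scorer produces a value in [0.0, 1.0]
    threshold: float   # configurable per metric

    @property
    def passed(self) -> bool:
        # A test case passes a metric when its score meets the threshold.
        return self.score >= self.threshold

# Hypothetical results for a single test case, with assumed thresholds.
results = [
    MetricResult("answer_relevancy", score=0.91, threshold=0.7),
    MetricResult("faithfulness",     score=0.62, threshold=0.8),
]

for r in results:
    print(f"{r.name}: {r.score:.2f} ({'pass' if r.passed else 'fail'} at {r.threshold})")
```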
Conversational AI
Go beyond single-turn testing. Evaluate how your chatbots and agents maintain context, coherence, and helpfulness across extended dialogues.
Evaluates whether each response is relevant to the current turn, given the surrounding conversation context.
Measures how well the model remembers and uses information from earlier in the conversation.
Assesses the logical flow and consistency of responses across the entire dialogue.
Scores how effectively the assistant addresses user needs and provides actionable responses.
Tracks whether multi-step tasks are successfully completed across conversation turns.
Applies bias and toxicity evaluation to each individual assistant response in the conversation.
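Per-turn checks such as the bias and toxicity evaluation above are applied to each assistant message and then summarized across the dialogue. The sketch below illustrates that pattern under assumed structures; the conversation format and the score_turn_safety stub are placeholders, not EvalWise code.

```python
from typing import Callable

# A conversation as (role, content) turns; the structure is illustrative only.
conversation = [
    ("user", "Can you help me draft a complaint email?"),
    ("assistant", "Of course. What is the complaint about?"),
    ("user", "A late delivery."),
    ("assistant", "Here is a polite draft you can adapt: ..."),
]

def score_per_turn(turns, scorer: Callable[[str], float]) -> dict:
    """Apply a 0.0-1.0 scorer to every assistant turn and aggregate."""
    scores = [scorer(content) for role, content in turns if role == "assistant"]
    return {
        "per_turn": scores,
        "worst": min(scores),               # conversations are often gated on the worst turn
        "mean": sum(scores) / len(scores),
    }

def score_turn_safety(text: str) -> float:
    # Stub standing in for a configured bias/toxicity scorer (1.0 = fully safe).
    return 1.0

print(score_per_turn(conversation, score_turn_safety))
```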
Test formats
Isolated question-answer pairs for testing individual responses (prompt → response evaluation).
Full conversation threads with context across multiple exchanges (conversation history → response evaluation).
Auto-generated conversations for comprehensive coverage testing (the model generates responses for evaluation).
Prompts with retrieval context for grounding evaluation (context + prompt → response evaluation).
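These formats differ mainly in which fields a test case carries. The records below sketch one plausible representation; all field names are assumptions rather than the EvalWise schema.

```python
# Illustrative test-case shapes; field names are assumptions, not the EvalWise schema.

single_turn = {
    "prompt": "What is the capital of France?",
    "expected_output": "Paris",                      # optional reference answer
}

multi_turn = {
    "conversation": [                                # full thread with context
        {"role": "user", "content": "Book me a table for two."},
        {"role": "assistant", "content": "Sure, for which evening?"},
        {"role": "user", "content": "Friday at 7pm."},
    ],
}

simulated = {
    "persona": "frustrated customer",                # the model generates the responses
    "goal": "obtain a refund for a damaged item",
    "max_turns": 6,
}

rag = {
    "prompt": "What is our refund window?",
    "context": ["Refunds are accepted within 30 days of delivery."],  # retrieval context
}
```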
Security testing
Identify vulnerabilities before they reach production with comprehensive adversarial testing across jailbreaks, privacy probes, and safety boundaries.
Separate target and evaluator models prevent self-evaluation bias: test GPT-4 with Claude as the judge for objective scoring.
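Keeping the target and the judge as distinct models is the core of the LLM-as-a-judge safeguard described above. The sketch below shows one way such a judging request could be assembled; the model identifiers, prompt wording, and build_judge_request helper are hypothetical, not the EvalWise implementation.

```python
# Schematic LLM-as-a-judge setup: the judge is never the model being tested.
TARGET_MODEL = "gpt-4"             # model under test
JUDGE_MODEL = "claude-3-5-sonnet"  # independent evaluator (placeholder id)

JUDGE_PROMPT = """You are grading another model's answer.
Question: {question}
Answer: {answer}
Rate how well the answer addresses the question on a scale from 0.0 to 1.0.
Reply with the number only."""

def build_judge_request(question: str, answer: str) -> dict:
    # Using a different model as judge avoids the leniency a model tends to
    # show toward its own outputs (self-evaluation bias).
    assert TARGET_MODEL != JUDGE_MODEL
    return {
        "model": JUDGE_MODEL,
        "temperature": 0.0,  # deterministic judging for consistent scoring
        "messages": [
            {"role": "user", "content": JUDGE_PROMPT.format(question=question, answer=answer)}
        ],
    }
```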
Regulatory alignment
Map evaluation results directly to regulatory requirements with pre-configured rubrics and automated documentation generation.
ISO/IEC 42001: AI Management System certification alignment.
EU AI Act: High-risk AI system requirements.
NIST AI RMF: Risk Management Framework compliance.
Each framework maps to a dedicated set of evaluation dimensions.
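Pre-configured rubrics work by tying individual scorers to a framework's requirements. The mapping below is a purely illustrative sketch of that idea; the requirement labels and metric groupings are assumptions, not the rubrics EvalWise ships.

```python
# Illustrative framework mapping: which evaluation dimensions feed which requirement.
# Requirement labels and groupings are assumptions, not shipped EvalWise rubrics.
framework_map = {
    "EU AI Act (high-risk systems)": {
        "accuracy_and_robustness": ["answer_relevancy", "faithfulness", "hallucination"],
        "non_discrimination": ["bias"],
        "safety": ["toxicity"],
    },
    "NIST AI RMF": {
        "measure": ["answer_relevancy", "faithfulness"],
        "manage": ["bias", "toxicity"],
    },
}

def coverage(results: dict[str, float], framework: str) -> dict[str, bool]:
    """Mark a requirement as covered when every mapped metric has a recorded score."""
    return {
        requirement: all(metric in results for metric in metrics)
        for requirement, metrics in framework_map[framework].items()
    }
```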
Generate audit-ready reports with one click
Identify compliance gaps across your model portfolio
Define domain-specific evaluation criteria
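Domain-specific criteria are usually expressed as a rubric the judge model scores like any built-in metric. The definition below is a hypothetical example for a customer-support domain; its structure and field names are assumptions.

```python
# Hypothetical custom rubric; structure and field names are illustrative only.
custom_rubric = {
    "name": "support_tone",
    "description": "Responses must stay professional and policy-compliant.",
    "criteria": [
        "Acknowledges the customer's issue before proposing a fix",
        "Never promises refunds outside the published policy",
        "Avoids blaming the customer",
    ],
    "scale": (0.0, 1.0),   # scored by the judge model like any built-in metric
    "threshold": 0.8,      # configurable pass mark
}
```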
Integrations
Connect to major LLM providers as both target and judge models. Bring your own models with OpenAI-compatible endpoints.
Plus xAI Grok, OpenRouter, Replicate, and any OpenAI-compatible API endpoint
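Any endpoint that implements the OpenAI chat-completions protocol can therefore serve as a target or judge. The snippet below uses the official openai Python client pointed at a self-hosted endpoint; the base URL, API key, and model name are placeholders for your own deployment.

```python
from openai import OpenAI

# Point the standard OpenAI client at any OpenAI-compatible endpoint.
# The URL, key, and model name below are placeholders.
client = OpenAI(
    base_url="https://llm.internal.example.com/v1",
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="my-finetuned-model",
    messages=[{"role": "user", "content": "What is the refund policy?"}],
    temperature=0.0,
)
print(response.choices[0].message.content)
```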
Enterprise ready
Multi-tenant architecture, comprehensive security, and enterprise-grade reliability for mission-critical AI evaluation.
Workflow
Configure and run evaluations with an intuitive step-by-step workflow; a configuration sketch follows the steps below.
Select provider, model, and generation parameters (temperature, max tokens).
Choose from built-in datasets, upload custom data, or filter by category.
Select independent evaluator model with optimized settings for consistent scoring.
Enable scorers, set thresholds, and configure custom rubrics.
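Taken together, the steps amount to a single run configuration: a target model, a dataset, an evaluator, and a set of scorers with thresholds. The sketch below shows one assumed shape for that configuration; the keys and values are illustrative, not the EvalWise workflow API.

```python
# Assumed shape of a complete evaluation run; keys and values are illustrative.
evaluation_run = {
    # Target model and generation parameters (temperature, max tokens).
    "target": {"provider": "openai", "model": "gpt-4", "temperature": 0.2, "max_tokens": 512},
    # Dataset: built-in, custom upload, or filtered by category.
    "dataset": {"source": "built_in", "name": "customer_support", "category": "rag"},
    # Independent evaluator (judge) model with settings for consistent scoring.
    "judge": {"provider": "anthropic", "model": "claude-3-5-sonnet", "temperature": 0.0},
    # Enabled scorers with thresholds, plus any custom rubrics.
    "scorers": {"answer_relevancy": 0.7, "faithfulness": 0.8, "toxicity": 0.9},
    "custom_rubrics": ["support_tone"],
}
```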
Get started with comprehensive LLM testing. Schedule a demo to see EvalWise in action.