Complete feature reference
Everything you need to systematically test, evaluate, and secure your Large Language Models. From red teaming to compliance, explore the complete EvalWise capability set.
Evaluation metrics
Built-in scorers cover the most critical evaluation dimensions. Each metric produces a 0.0-1.0 score with configurable thresholds.
Measures whether the response directly addresses the question asked, catching off-topic, evasive, or tangential responses.
Identifies discriminatory or unfair content, including gender, racial, political, and age-based bias patterns.
Flags harmful, offensive, abusive, or threatening language, including profanity, insults, and other harmful content.
Evaluates whether responses are grounded in the provided context. Essential for RAG systems to prevent fabrication.
Identifies fabricated facts, unsupported claims, and confident but false statements, without requiring explicit context.
Tests RAG retrieval quality by evaluating whether the retrieved context actually answers the question asked.
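Because every scorer emits a value between 0.0 and 1.0 and is gated by a configurable threshold, pass/fail logic stays uniform across metrics. The snippet below is a minimal sketch of that gating, assuming a simple result container; the MetricResult class, metric names, and threshold values are illustrative, not the EvalWise API.

```python
from dataclasses import dataclass

@dataclass
class MetricResult:
    """Illustrative container for one metric outcome (not the EvalWise API)."""
    name: str
    score: float       # every scorer produces a value in [0.0, 1.0]
    threshold: float   # configurable per metric

    @property
    def passed(self) -> bool:
        # A test case passes a metric when its score meets the threshold.
        return self.score >= self.threshold

# Hypothetical results for a single test case, with assumed thresholds.
results = [
    MetricResult("answer_relevancy", score=0.91, threshold=0.7),
    MetricResult("faithfulness",     score=0.62, threshold=0.8),
]

for r in results:
    print(f"{r.name}: {r.score:.2f} ({'pass' if r.passed else 'fail'} at {r.threshold})")
```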
Conversational AI
Go beyond single-turn testing. Evaluate how your chatbots and agents maintain context, coherence, and helpfulness across extended dialogues.
Evaluates whether each response is relevant to the current turn, given the surrounding conversation context.
Measures how well the model remembers and uses information from earlier in the conversation.
Assesses the logical flow and consistency of responses across the entire dialogue.
Scores how effectively the assistant addresses user needs and provides actionable responses.
Tracks whether multi-step tasks are successfully completed across conversation turns.
Applies bias and toxicity evaluation to each individual assistant response in the conversation.
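Per-turn checks such as the bias and toxicity evaluation above are applied to each assistant message and then summarized across the dialogue. The sketch below illustrates that pattern under assumed structures; the conversation format and the score_turn_safety stub are placeholders, not EvalWise code.

```python
from typing import Callable

# A conversation as (role, content) turns; the structure is illustrative only.
conversation = [
    ("user", "Can you help me draft a complaint email?"),
    ("assistant", "Of course. What is the complaint about?"),
    ("user", "A late delivery."),
    ("assistant", "Here is a polite draft you can adapt: ..."),
]

def score_per_turn(turns, scorer: Callable[[str], float]) -> dict:
    """Apply a 0.0-1.0 scorer to every assistant turn and aggregate."""
    scores = [scorer(content) for role, content in turns if role == "assistant"]
    return {
        "per_turn": scores,
        "worst": min(scores),               # conversations are often gated on the worst turn
        "mean": sum(scores) / len(scores),
    }

def score_turn_safety(text: str) -> float:
    # Stub standing in for a configured bias/toxicity scorer (1.0 = fully safe).
    return 1.0

print(score_per_turn(conversation, score_turn_safety))
```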
Test formats
Isolated question-answer pairs for testing individual responses (prompt → response evaluation).
Full conversation threads with context across multiple exchanges (conversation history → response evaluation).
Auto-generated conversations for comprehensive coverage testing (the model generates responses for evaluation).
Prompts with retrieval context for grounding evaluation (context + prompt → response evaluation).
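These formats differ mainly in which fields a test case carries. The records below sketch one plausible representation; all field names are assumptions rather than the EvalWise schema.

```python
# Illustrative test-case shapes; field names are assumptions, not the EvalWise schema.

single_turn = {
    "prompt": "What is the capital of France?",
    "expected_output": "Paris",                      # optional reference answer
}

multi_turn = {
    "conversation": [                                # full thread with context
        {"role": "user", "content": "Book me a table for two."},
        {"role": "assistant", "content": "Sure, for which evening?"},
        {"role": "user", "content": "Friday at 7pm."},
    ],
}

simulated = {
    "persona": "frustrated customer",                # the model generates the responses
    "goal": "obtain a refund for a damaged item",
    "max_turns": 6,
}

rag = {
    "prompt": "What is our refund window?",
    "context": ["Refunds are accepted within 30 days of delivery."],  # retrieval context
}
```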
Security testing
Identify vulnerabilities before they reach production with comprehensive adversarial testing across jailbreaks, privacy probes, and safety boundaries.
Separate target and evaluator models prevent self-evaluation bias: test GPT-4 with Claude as the judge for objective scoring.
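Keeping the target and the judge as distinct models is the core of the LLM-as-a-judge safeguard described above. The sketch below shows one way such a judging request could be assembled; the model identifiers, prompt wording, and build_judge_request helper are hypothetical, not the EvalWise implementation.

```python
# Schematic LLM-as-a-judge setup: the judge is never the model being tested.
TARGET_MODEL = "gpt-4"             # model under test
JUDGE_MODEL = "claude-3-5-sonnet"  # independent evaluator (placeholder id)

JUDGE_PROMPT = """You are grading another model's answer.
Question: {question}
Answer: {answer}
Rate how well the answer addresses the question on a scale from 0.0 to 1.0.
Reply with the number only."""

def build_judge_request(question: str, answer: str) -> dict:
    # Using a different model as judge avoids the leniency a model tends to
    # show toward its own outputs (self-evaluation bias).
    assert TARGET_MODEL != JUDGE_MODEL
    return {
        "model": JUDGE_MODEL,
        "temperature": 0.0,  # deterministic judging for consistent scoring
        "messages": [
            {"role": "user", "content": JUDGE_PROMPT.format(question=question, answer=answer)}
        ],
    }
```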
Regulatory alignment
Map evaluation results directly to regulatory requirements with pre-configured rubrics and automated documentation generation.
ISO/IEC 42001: AI Management System certification alignment.
EU AI Act: High-risk AI system requirements.
NIST AI RMF: Risk Management Framework compliance.
Each framework maps to a dedicated set of evaluation dimensions.
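Pre-configured rubrics work by tying individual scorers to a framework's requirements. The mapping below is a purely illustrative sketch of that idea; the requirement labels and metric groupings are assumptions, not the rubrics EvalWise ships.

```python
# Illustrative framework mapping: which evaluation dimensions feed which requirement.
# Requirement labels and groupings are assumptions, not shipped EvalWise rubrics.
framework_map = {
    "EU AI Act (high-risk systems)": {
        "accuracy_and_robustness": ["answer_relevancy", "faithfulness", "hallucination"],
        "non_discrimination": ["bias"],
        "safety": ["toxicity"],
    },
    "NIST AI RMF": {
        "measure": ["answer_relevancy", "faithfulness"],
        "manage": ["bias", "toxicity"],
    },
}

def coverage(results: dict[str, float], framework: str) -> dict[str, bool]:
    """Mark a requirement as covered when every mapped metric has a recorded score."""
    return {
        requirement: all(metric in results for metric in metrics)
        for requirement, metrics in framework_map[framework].items()
    }
```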
Generate audit-ready reports with one click
Identify compliance gaps across your model portfolio
Define domain-specific evaluation criteria
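Domain-specific criteria are usually expressed as a rubric the judge model scores like any built-in metric. The definition below is a hypothetical example for a customer-support domain; its structure and field names are assumptions.

```python
# Hypothetical custom rubric; structure and field names are illustrative only.
custom_rubric = {
    "name": "support_tone",
    "description": "Responses must stay professional and policy-compliant.",
    "criteria": [
        "Acknowledges the customer's issue before proposing a fix",
        "Never promises refunds outside the published policy",
        "Avoids blaming the customer",
    ],
    "scale": (0.0, 1.0),   # scored by the judge model like any built-in metric
    "threshold": 0.8,      # configurable pass mark
}
```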
Integrations
Connect to major LLM providers as both target and judge models. Bring your own models with OpenAI-compatible endpoints.
Plus xAI Grok, OpenRouter, Replicate, and any OpenAI-compatible API endpoint
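Any endpoint that implements the OpenAI chat-completions protocol can therefore serve as a target or judge. The snippet below uses the official openai Python client pointed at a self-hosted endpoint; the base URL, API key, and model name are placeholders for your own deployment.

```python
from openai import OpenAI

# Point the standard OpenAI client at any OpenAI-compatible endpoint.
# The URL, key, and model name below are placeholders.
client = OpenAI(
    base_url="https://llm.internal.example.com/v1",
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="my-finetuned-model",
    messages=[{"role": "user", "content": "What is the refund policy?"}],
    temperature=0.0,
)
print(response.choices[0].message.content)
```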
Enterprise ready
Multi-tenant architecture, comprehensive security, and enterprise-grade reliability for mission-critical AI evaluation.
Workflow
Configure and run evaluations with an intuitive step-by-step workflow; a configuration sketch follows the steps below.
Select provider, model, and generation parameters (temperature, max tokens).
Choose from built-in datasets, upload custom data, or filter by category.
Select independent evaluator model with optimized settings for consistent scoring.
Enable scorers, set thresholds, and configure custom rubrics.
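Taken together, the steps amount to a single run configuration: a target model, a dataset, an evaluator, and a set of scorers with thresholds. The sketch below shows one assumed shape for that configuration; the keys and values are illustrative, not the EvalWise workflow API.

```python
# Assumed shape of a complete evaluation run; keys and values are illustrative.
evaluation_run = {
    # Target model and generation parameters (temperature, max tokens).
    "target": {"provider": "openai", "model": "gpt-4", "temperature": 0.2, "max_tokens": 512},
    # Dataset: built-in, custom upload, or filtered by category.
    "dataset": {"source": "built_in", "name": "customer_support", "category": "rag"},
    # Independent evaluator (judge) model with settings for consistent scoring.
    "judge": {"provider": "anthropic", "model": "claude-3-5-sonnet", "temperature": 0.0},
    # Enabled scorers with thresholds, plus any custom rubrics.
    "scorers": {"answer_relevancy": 0.7, "faithfulness": 0.8, "toxicity": 0.9},
    "custom_rubrics": ["support_tone"],
}
```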
Get started with comprehensive LLM testing. Schedule a demo to see EvalWise in action.