
Overview

Evaluations in Pulze provide a systematic way to test and benchmark AI performance. They go beyond one-off testing of individual models, letting you assess:
  • AI Models: Test and compare different language models
  • AI Agents/Assistants: Evaluate specialized agents with their tools and capabilities
  • Entire Spaces: Test complete space configurations including models, assistants, data, and permissions
This comprehensive evaluation system helps you validate your AI systems before deployment and track performance over time.

Evaluation Workflow

Pulze evaluations use a two-part approach:
  1. Evaluation Templates: Reusable configurations that define how to evaluate performance
  2. Evaluation Runs: Actual test executions using templates against datasets
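To make the split concrete, here is a minimal sketch of that relationship; the field names are illustrative and do not mirror the actual Pulze API schema:

from dataclasses import dataclass, field

@dataclass
class EvaluationTemplate:
    """Reusable definition of HOW to evaluate (field names are illustrative)."""
    name: str
    rater_model: str                 # the LLM-as-a-judge model
    pass_threshold: float = 0.7      # minimum overall score to pass
    metrics: list[str] = field(default_factory=lambda: ["accuracy", "relevance", "helpfulness"])
    evaluation_prompt: str = ""      # instructions for the rater model

@dataclass
class EvaluationRun:
    """A concrete execution: one template applied to datasets and targets."""
    template: EvaluationTemplate
    dataset_ids: list[str]           # one or more datasets to test against
    targets: list[str]               # models, assistants, or spaces under evaluation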

Evaluation Templates

Templates are the foundation of your evaluation strategy. They define:
  • What to evaluate: Models, spaces, or specific configurations
  • How to evaluate: Rater model, evaluation criteria, and scoring thresholds
  • Advanced settings: Feature flags, space impersonation, custom headers
Key Benefits:
  • Reusable: Create once, use multiple times
  • Consistent: Ensure the same evaluation criteria across runs
  • Customizable: Tailor evaluation logic to your specific needs
  • Version-controlled: Track changes to evaluation standards over time

Open-Source Evaluation Templates

Pulze provides evaluation templates from the Pulze Evals open-source repository. These templates include:
  • Industry-standard evaluation rubrics used across the AI community
  • The same templates used to build Pulze routers - our routers were trained and optimized using these exact evaluation criteria
  • Community-contributed templates for diverse use cases
The evaluation templates in the Pulze Evals repository represent battle-tested criteria. They’re the same rubrics we use internally to ensure our Pulze routers deliver high-quality results.
We welcome contributions! You can:
  • Use existing templates from the repository
  • Customize templates for your specific needs
  • Contribute your own evaluation templates back to the community
  • Build specialized templates for your domain

Contribute Evaluation Templates

Visit the repository to explore templates or submit your own

Creating Evaluation Templates

Basic Template Configuration

Evaluation templates consist of several key components:

1. Template Identity

  • Name: Descriptive identifier (e.g., “Customer Support Quality Assessment”)
  • Description: Purpose and use case explanation

2. Rater Model (LLM-as-a-Judge)

The rater model is an AI model that scores the responses produced by the models under evaluation. This “LLM-as-a-judge” approach allows for:
  • Automated, consistent evaluation at scale
  • Nuanced assessment of quality dimensions
  • Cost-effective alternative to human evaluation
Choosing a Rater Model:
  • Select from any available model in your organization
  • Consider using stronger models (e.g., GPT-4, Claude) for more reliable judgments
  • Balance cost vs. accuracy for your use case
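To illustrate the LLM-as-a-judge pattern outside the UI, the sketch below sends a question/answer pair to a rater model through an OpenAI-compatible chat-completions call. The endpoint URL, model name, and API-key variable are placeholders, not Pulze specifics:

import os
import requests

RATER_MODEL = "gpt-4o"  # placeholder; use any rater model available in your organization

def judge_response(question: str, answer: str, evaluation_prompt: str) -> str:
    """Ask the rater model to score a candidate answer; returns its raw reply."""
    resp = requests.post(
        "https://api.example.com/v1/chat/completions",  # placeholder endpoint
        headers={"Authorization": f"Bearer {os.environ['API_KEY']}"},
        json={
            "model": RATER_MODEL,
            "messages": [
                {"role": "system", "content": evaluation_prompt},
                {"role": "user", "content": f"Question:\n{question}\n\nModel response:\n{answer}"},
            ],
            "temperature": 0,  # keep judgments as deterministic as possible
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]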

3. Pass Threshold

Set the minimum score (0.0 to 1.0) required for an evaluation to pass:
  • 0.0-0.3: Poor/failing responses
  • 0.4-0.6: Acceptable but needs improvement
  • 0.7-0.9: Good quality responses
  • 0.9-1.0: Excellent responses
Start with a threshold around 0.7 and adjust based on your quality requirements. A threshold that is too strict (above 0.9) may flag acceptable responses as failures.
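Mechanically, the pass decision is a comparison between the averaged metric scores and the threshold, as in this small sketch:

def passes(metric_scores: dict[str, float], threshold: float = 0.7) -> bool:
    """Average the per-metric scores and compare against the pass threshold."""
    overall = sum(metric_scores.values()) / len(metric_scores)
    return overall >= threshold

# Example: accuracy 0.8, relevance 0.9, helpfulness 0.6 -> overall ~0.77 -> passes at 0.7
print(passes({"accuracy": 0.8, "relevance": 0.9, "helpfulness": 0.6}))  # True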

4. Metrics Configuration

Metrics define what dimensions to evaluate. Pulze provides predefined metrics and supports custom metrics.
Predefined Metrics:
  • Accuracy: Factual correctness of the response
  • Relevance: How well the response addresses the question
  • Helpfulness: Practical value and usefulness to the user
Adding Custom Metrics: You can add your own metrics to evaluate specific aspects:
  1. Click “Add Metric” in the template editor
  2. Enter your metric name (e.g., “professionalism”, “conciseness”, “creativity”)
  3. The evaluation prompt automatically updates to include your metric
  4. The rater model will score responses on all defined metrics
Example Custom Metrics:
  • Tone: Professional vs. casual communication style
  • Conciseness: Brevity and clarity
  • Empathy: Understanding of user emotions
  • Technical Depth: Level of technical detail
  • Safety: Absence of harmful or biased content
The evaluation JSON structure automatically adapts to include all your metrics, ensuring consistent scoring across dimensions.
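For instance, adding conciseness and tone as custom metrics extends the expected scoring object to roughly the shape below (illustrative values, shown as a Python dict mirroring the JSON format in the next section):

# Illustrative shape of the rater's output once custom metrics are added.
expected_evaluation = {
    "accuracy": 0.85,       # predefined metric
    "relevance": 0.90,      # predefined metric
    "helpfulness": 0.80,    # predefined metric
    "conciseness": 0.70,    # custom metric
    "tone": 0.95,           # custom metric
    "overall_score": 0.84,  # average of all metric scores
    "reasoning": "Clear and correct, slightly verbose.",
    "passed": True,         # overall_score >= pass threshold (e.g. 0.7)
}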

5. Evaluation Prompt

The evaluation prompt instructs the rater model on how to assess responses. A good evaluation prompt includes:
  • Clear criteria: Specific dimensions to evaluate
  • Scoring scale: How to assign scores (0.0-1.0)
  • Output format: JSON structure for consistent parsing
  • Examples (optional): Sample evaluations for clarity
Default Prompt Structure:
You are an expert evaluator. Please evaluate the model response based on [metrics].

Rate the response on a scale of 0.0 to 1.0 where:
- 0.0-0.3: Poor response (incorrect, irrelevant, or unhelpful)
- 0.4-0.6: Average response (partially correct or somewhat helpful)
- 0.7-0.9: Good response (mostly correct and helpful)
- 0.9-1.0: Excellent response (accurate, relevant, and very helpful)

Please provide your evaluation as a JSON object:
{
  "accuracy": <float 0-1>,
  "relevance": <float 0-1>,
  "helpfulness": <float 0-1>,
  "overall_score": <average of all scores>,
  "reasoning": "<detailed explanation>",
  "passed": <boolean>
}
You can customize this prompt to match your evaluation criteria and domain.
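If you consume rater output yourself (for example via the API), a defensive parse of that JSON reply might look like this sketch; it assumes the default prompt above and is not Pulze's internal parser:

import json
import re

def parse_judgement(raw: str, threshold: float = 0.7) -> dict:
    """Pull the JSON object out of the rater's reply and recompute overall/pass."""
    match = re.search(r"\{.*\}", raw, re.DOTALL)  # tolerate prose around the JSON
    if not match:
        raise ValueError("No JSON object found in rater output")
    data = json.loads(match.group(0))
    metric_scores = [v for k, v in data.items()
                     if isinstance(v, (int, float)) and not isinstance(v, bool)
                     and k != "overall_score"]
    data["overall_score"] = sum(metric_scores) / len(metric_scores)
    data["passed"] = data["overall_score"] >= threshold
    return data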

Advanced Template Modes

Pulze evaluation templates support powerful advanced configurations:

1. Space Impersonation

Evaluate entire space configurations by selecting a space to impersonate during evaluation:
  • Tests with the space’s specific permissions
  • Uses the space’s enabled models and routers
  • Accesses the space’s uploaded data and documents
  • Leverages the space’s configured AI agents and tools
Use Case: Validate that your production space configuration works correctly before deploying to users.
When you select a space for evaluation, Pulze automatically generates a temporary API key with that space’s permissions. This ensures evaluations run exactly as they would for real users of that space.

2. Agentic Feature Flags

Control automatic AI agent behavior during evaluations:
Auto Tools 🛠️
  • Automatically selects appropriate tools to help generate responses
  • Tests whether your AI agents choose the right tools for each task
  • Validates tool integration and execution
Smart Learn 🧠
  • Uses learned patterns from liked responses to improve outputs
  • Tests how well the system adapts to successful patterns
  • Validates learning system effectiveness
Feature flags let you test different AI behaviors. For example, compare performance with and without automatic tool selection to measure the impact of agentic features.

3. Additional Advanced Features

Beyond basic configuration, evaluation templates support sophisticated testing scenarios:
Targeting Specific Tools: Use custom headers to test specific tool usage:
  • Target particular tools for AI agents to use
  • Validate tool selection and execution
  • Test tool integration in different scenarios
Targeting Specific Assistants: Evaluate performance of specific AI agents:
  • Test individual assistants within a space
  • Compare assistant configurations
  • Validate assistant behavior with different prompts
Custom Headers and Payloads: Add custom headers or payload modifications for both:
  • Model Being Evaluated: Configure the model/space you’re testing
  • Rater Model: Customize how the evaluation model behaves
Feature Flag Examples: Headers for controlling AI behavior:
{
  "pulze-feature-flags": {
    "auto_tools": true,
    "smart_learn": false
  }
}
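As a rough usage sketch, the flags above can travel as a request header. The header name comes from the JSON example; the endpoint, model name, and JSON serialization of the value are assumptions to verify against the Developer Guide:

import json
import os
import requests

# Serialize the flag object from the example above into a request header value.
# Confirm the exact header name and format against the Developer Guide.
headers = {
    "Authorization": f"Bearer {os.environ['PULZE_API_KEY']}",
    "pulze-feature-flags": json.dumps({"auto_tools": True, "smart_learn": False}),
}

# Endpoint and payload shape are assumptions based on an OpenAI-compatible API.
resp = requests.post(
    "https://api.pulze.ai/v1/chat/completions",
    headers=headers,
    json={"model": "pulze", "messages": [{"role": "user", "content": "Hello"}]},
    timeout=60,
)
print(resp.status_code)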
Common Advanced Use Cases:
  • Custom routing headers for specific model behavior
  • Temperature or parameter overrides for consistency
  • Organization-specific configuration testing
  • A/B testing different AI configurations
  • Tool availability and selection validation
  • Assistant-specific prompt testing
  • Multi-assistant comparison within spaces

Developer Guide - Feature Flags

See practical examples of using feature flags and custom headers in the Developer Guide

Running Evaluations

Once you have templates and datasets, you can run evaluations:

Single Dataset Evaluation

  1. Navigate to Evals → Evaluations
  2. Click Run Evaluation
  3. Select your evaluation template
  4. Choose one dataset
  5. Select models or spaces to evaluate
  6. Run the evaluation

Multi-Dataset Evaluation

Evaluate across multiple datasets simultaneously:
  1. Select multiple datasets when configuring your run
  2. Each dataset contributes to the overall score
  3. View aggregated results across all datasets
  4. Compare performance on different types of questions
Benefits:
  • Comprehensive Coverage: Test across diverse scenarios
  • Balanced Assessment: No single dataset dominates the score
  • Efficiency: Run once instead of multiple single-dataset evaluations
Multi-dataset evaluations automatically calculate total scores by averaging performance across all selected datasets. This gives you a holistic view of model performance.
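The aggregation works out to an average of dataset averages, so each dataset carries equal weight regardless of its size. A minimal sketch of that arithmetic:

def total_score(per_item_scores_by_dataset: dict[str, list[float]]) -> float:
    """Average each dataset's item scores, then average the dataset scores."""
    dataset_scores = [sum(scores) / len(scores)
                      for scores in per_item_scores_by_dataset.values()]
    return sum(dataset_scores) / len(dataset_scores)

# Example: two datasets of different sizes contribute equally to the total.
print(total_score({"support_faq": [0.9, 0.8, 0.7], "edge_cases": [0.6, 0.5]}))  # 0.675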

Evaluation Results

Automatic Scoring

Pulze automatically calculates scores for each evaluation run:
  • Per-Item Scores: Individual question/response scores (0.0-1.0)
  • Dataset Scores: Average across all items in a dataset
  • Total Score: Overall average when using multiple datasets
  • Pass/Fail Status: Based on your configured threshold

Results Dashboard

The evaluation results view shows:
  • Model Rankings: See which models perform best
  • Score Distributions: Understand performance patterns
  • Pass Rates: Track how many responses met your threshold
  • Detailed Analysis: Drill down into individual responses

Comparing Models

Evaluate multiple models simultaneously to compare:
  • Side-by-side scores: See which model performed better
  • Cost analysis: Compare performance relative to cost
  • Speed metrics: Track response times
  • Quality trends: Identify consistent performers

Evaluation Purposes

1. Model Selection

Compare different AI models to find the best fit:
  • Test GPT-4, Claude, Gemini, or other models
  • Evaluate proprietary vs. open-source options
  • Balance performance, cost, and speed

2. Assistant Validation

Test AI agents and assistants with their full capabilities:
  • Validate tool usage and selection
  • Ensure agents follow instructions correctly
  • Test multi-step reasoning and planning

3. Space Configuration Testing

Validate entire space setups before deployment:
  • Test with specific data access and permissions
  • Verify assistant configurations
  • Ensure tool integrations work correctly

4. Regression Testing

Catch performance degradation after changes:
  • Run evaluations before and after updates
  • Compare results to detect regressions
  • Maintain quality standards over time

5. Quality Assurance

Maintain consistent behavior across your AI systems:
  • Define quality standards via thresholds
  • Ensure responses meet requirements
  • Track quality metrics over time

Best Practices

Start with Clear Objectives: Define what you want to measure before creating evaluation templates. Are you testing accuracy, helpfulness, tool usage, or something else?

For Templates

  1. Descriptive Names: Use clear names like “Customer Support Quality” instead of “Template 1”
  2. Detailed Prompts: Provide comprehensive evaluation criteria to the rater model
  3. Appropriate Thresholds: Set realistic pass/fail thresholds based on your use case
  4. Relevant Metrics: Choose metrics that align with your goals

For Evaluation Runs

  1. Representative Datasets: Use datasets that reflect real-world usage
  2. Multiple Datasets: Combine different dataset types for comprehensive testing
  3. Regular Cadence: Schedule periodic evaluations to catch issues early
  4. Baseline Comparisons: Always compare against a baseline or previous version

For Space Evaluations

  1. Test Production Config: Use space impersonation to test exactly as users will experience
  2. Validate Permissions: Ensure data access controls work as expected
  3. Check Tool Integration: Verify AI agents use tools correctly
  4. Monitor Agent Behavior: Track how agents make decisions with feature flags

Evaluation Templates Library

Pulze provides predefined evaluation templates to get you started quickly:
  • Quality Assessment: General-purpose quality evaluation
  • Factual Accuracy: Tests for correct information
  • Instruction Following: Measures adherence to prompts
  • Custom Templates: Create your own for specific use cases
You can also import and customize any existing template to fit your needs.

Integration with Data

Evaluations are tightly integrated with Pulze’s data ecosystem:

With Datasets

  • Use any dataset type (Manual, Learning, or Benchmark)
  • Combine multiple datasets in one evaluation
  • Create datasets specifically for evaluation purposes

With Spaces

  • Evaluate using space-specific data and documents
  • Test with space permissions and access controls
  • Validate space configurations before user deployment

Results Storage

  • All evaluation results are stored and versioned
  • Track performance trends over time
  • Export results for external analysis
  • Share findings with your team

Advanced Evaluation Patterns

A/B Testing Configurations

Use evaluation templates to A/B test different configurations:
  1. Create two templates with different settings (e.g., with/without auto_tools)
  2. Run both against the same datasets
  3. Compare results to determine which configuration performs better

Continuous Evaluation

Integrate evaluations into your CI/CD pipeline:
  1. Create datasets that represent your test cases
  2. Set up evaluation templates for your standards
  3. Run evaluations automatically on changes
  4. Block deployments that don’t meet thresholds
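A CI gate built on evaluations could look roughly like the sketch below. The endpoint paths and response fields are hypothetical placeholders, not the documented Pulze API; the point is the block-on-threshold pattern:

#!/usr/bin/env python3
"""Hypothetical CI gate: fail the pipeline if the evaluation score is below threshold."""
import os
import sys
import requests

API = "https://api.example.com"  # placeholder base URL, not the real Pulze API
HEADERS = {"Authorization": f"Bearer {os.environ['PULZE_API_KEY']}"}
THRESHOLD = 0.7

# Trigger a run (endpoint paths and payload fields are illustrative assumptions).
run = requests.post(f"{API}/evaluations/runs", headers=HEADERS, json={
    "template_id": os.environ["EVAL_TEMPLATE_ID"],
    "dataset_ids": os.environ["EVAL_DATASET_IDS"].split(","),
}, timeout=60).json()

# Fetch the finished result; real code would poll until the run reports completion.
result = requests.get(f"{API}/evaluations/runs/{run['id']}", headers=HEADERS, timeout=60).json()

score = result["total_score"]
print(f"Evaluation total score: {score:.2f} (threshold {THRESHOLD})")
sys.exit(0 if score >= THRESHOLD else 1)  # non-zero exit blocks the deployment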

Progressive Testing

Test changes incrementally:
  1. Start with a small dataset to validate basic functionality
  2. Expand to larger datasets for comprehensive testing
  3. Run space-impersonated evaluations for final validation
  4. Deploy with confidence

Monitoring and Alerts

Set up monitoring for evaluation results to catch performance degradation early. If scores drop below your threshold, investigate before the issue affects users.
Track evaluation metrics over time:
  • Set up alerts for failing evaluations
  • Monitor score trends
  • Track pass rates across models
  • Identify degradation patterns
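For the score-trend part, a simple rolling check over recent run scores can flag degradation; the sketch below is pure illustration to wire into whatever alerting you already use:

def degradation_alert(recent_scores: list[float], threshold: float = 0.7, window: int = 3) -> bool:
    """Alert if the rolling average of the last `window` runs falls below the threshold."""
    if len(recent_scores) < window:
        return False
    rolling = sum(recent_scores[-window:]) / window
    return rolling < threshold

print(degradation_alert([0.82, 0.80, 0.71, 0.66, 0.64]))  # True: recent average 0.67 < 0.7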

API Access

Evaluations are accessible via the Pulze API for automation:
  • Create and manage templates programmatically
  • Trigger evaluation runs automatically
  • Retrieve results for custom analytics
  • Integrate with external monitoring systems

Next Steps

To get started with evaluations:
  1. Create datasets with representative test cases (Manual, Learning, or Benchmark)
  2. Design evaluation templates that define your quality standards
  3. Run your first evaluation to establish baselines
  4. Compare results across models, assistants, or configurations
  5. Iterate and improve based on evaluation insights