
Overview

Evaluations in Pulze provide a systematic way to test and benchmark AI performance. They go beyond one-off testing of individual models, letting you assess:
  • AI Models: Test and compare different language models
  • AI Agents/Assistants: Evaluate specialized agents with their tools and capabilities
  • Entire Spaces: Test complete space configurations including models, assistants, data, and permissions
This comprehensive evaluation system helps you validate your AI systems before deployment and track performance over time.

Evaluation Workflow

Pulze evaluations use a two-part approach:
  1. Evaluation Templates: Reusable configurations that define how to evaluate performance
  2. Evaluation Runs: Actual test executions using templates against datasets
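To make the split concrete, here is a minimal sketch of that relationship; the field names are illustrative and do not mirror the actual Pulze API schema:

from dataclasses import dataclass, field

@dataclass
class EvaluationTemplate:
    """Reusable definition of HOW to evaluate (field names are illustrative)."""
    name: str
    rater_model: str                 # the LLM-as-a-judge model
    pass_threshold: float = 0.7      # minimum overall score to pass
    metrics: list[str] = field(default_factory=lambda: ["accuracy", "relevance", "helpfulness"])
    evaluation_prompt: str = ""      # instructions for the rater model

@dataclass
class EvaluationRun:
    """A concrete execution: one template applied to datasets and targets."""
    template: EvaluationTemplate
    dataset_ids: list[str]           # one or more datasets to test against
    targets: list[str]               # models, assistants, or spaces under evaluation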

Evaluation Templates

Templates are the foundation of your evaluation strategy. They define:
  • What to evaluate: Models, spaces, or specific configurations
  • How to evaluate: Rater model, evaluation criteria, and scoring thresholds
  • Advanced settings: Feature flags, space impersonation, custom headers
Key Benefits:
  • Reusable: Create once, use multiple times
  • Consistent: Ensure the same evaluation criteria across runs
  • Customizable: Tailor evaluation logic to your specific needs
  • Version-controlled: Track changes to evaluation standards over time

Open-Source Evaluation Templates

Pulze provides evaluation templates from the Pulze Evals open-source repository. These templates include:
  • Industry-standard evaluation rubrics used across the AI community
  • The same templates used to build Pulze routers - our routers were trained and optimized using these exact evaluation criteria
  • Community-contributed templates for diverse use cases
The evaluation templates in the Pulze Evals repository represent battle-tested criteria. They’re the same rubrics we use internally to ensure our Pulze routers deliver high-quality results.
We welcome contributions! You can:
  • Use existing templates from the repository
  • Customize templates for your specific needs
  • Contribute your own evaluation templates back to the community
  • Build specialized templates for your domain

Contribute Evaluation Templates

Visit the repository to explore templates or submit your own

Creating Evaluation Templates

Basic Template Configuration

Evaluation templates consist of several key components:

1. Template Identity

  • Name: Descriptive identifier (e.g., “Customer Support Quality Assessment”)
  • Description: Purpose and use case explanation

2. Rater Model (LLM-as-a-Judge)

The rater model is an AI model that scores the responses produced by the models under evaluation. This “LLM-as-a-judge” approach allows for:
  • Automated, consistent evaluation at scale
  • Nuanced assessment of quality dimensions
  • Cost-effective alternative to human evaluation
Choosing a Rater Model:
  • Select from any available model in your organization
  • Consider using stronger models (e.g., GPT-4, Claude) for more reliable judgments
  • Balance cost vs. accuracy for your use case
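To illustrate the LLM-as-a-judge pattern outside the UI, the sketch below sends a question/answer pair to a rater model through an OpenAI-compatible chat-completions call. The endpoint URL, model name, and API-key variable are placeholders, not Pulze specifics:

import os
import requests

RATER_MODEL = "gpt-4o"  # placeholder; use any rater model available in your organization

def judge_response(question: str, answer: str, evaluation_prompt: str) -> str:
    """Ask the rater model to score a candidate answer; returns its raw reply."""
    resp = requests.post(
        "https://api.example.com/v1/chat/completions",  # placeholder endpoint
        headers={"Authorization": f"Bearer {os.environ['API_KEY']}"},
        json={
            "model": RATER_MODEL,
            "messages": [
                {"role": "system", "content": evaluation_prompt},
                {"role": "user", "content": f"Question:\n{question}\n\nModel response:\n{answer}"},
            ],
            "temperature": 0,  # keep judgments as deterministic as possible
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]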

3. Pass Threshold

Set the minimum score (0.0 to 1.0) required for an evaluation to pass:
  • 0.0-0.3: Poor/failing responses
  • 0.4-0.6: Acceptable but needs improvement
  • 0.7-0.9: Good quality responses
  • 0.9-1.0: Excellent responses
Start with a threshold around 0.7 and adjust based on your quality requirements. A threshold that is too strict (above 0.9) may flag acceptable responses as failures.
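Mechanically, the pass decision is a comparison between the averaged metric scores and the threshold, as in this small sketch:

def passes(metric_scores: dict[str, float], threshold: float = 0.7) -> bool:
    """Average the per-metric scores and compare against the pass threshold."""
    overall = sum(metric_scores.values()) / len(metric_scores)
    return overall >= threshold

# Example: accuracy 0.8, relevance 0.9, helpfulness 0.6 -> overall ~0.77 -> passes at 0.7
print(passes({"accuracy": 0.8, "relevance": 0.9, "helpfulness": 0.6}))  # True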

4. Metrics Configuration

Metrics define what dimensions to evaluate. Pulze provides predefined metrics and supports custom metrics.
Predefined Metrics:
  • Accuracy: Factual correctness of the response
  • Relevance: How well the response addresses the question
  • Helpfulness: Practical value and usefulness to the user
Adding Custom Metrics: You can add your own metrics to evaluate specific aspects:
  1. Click “Add Metric” in the template editor
  2. Enter your metric name (e.g., “professionalism”, “conciseness”, “creativity”)
  3. The evaluation prompt automatically updates to include your metric
  4. The rater model will score responses on all defined metrics
Example Custom Metrics:
  • Tone: Professional vs. casual communication style
  • Conciseness: Brevity and clarity
  • Empathy: Understanding of user emotions
  • Technical Depth: Level of technical detail
  • Safety: Absence of harmful or biased content
The evaluation JSON structure automatically adapts to include all your metrics, ensuring consistent scoring across dimensions.
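For instance, adding conciseness and tone as custom metrics extends the expected scoring object to roughly the shape below (illustrative values, shown as a Python dict mirroring the JSON format in the next section):

# Illustrative shape of the rater's output once custom metrics are added.
expected_evaluation = {
    "accuracy": 0.85,       # predefined metric
    "relevance": 0.90,      # predefined metric
    "helpfulness": 0.80,    # predefined metric
    "conciseness": 0.70,    # custom metric
    "tone": 0.95,           # custom metric
    "overall_score": 0.84,  # average of all metric scores
    "reasoning": "Clear and correct, slightly verbose.",
    "passed": True,         # overall_score >= pass threshold (e.g. 0.7)
}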

5. Evaluation Prompt

The evaluation prompt instructs the rater model on how to assess responses. A good evaluation prompt includes:
  • Clear criteria: Specific dimensions to evaluate
  • Scoring scale: How to assign scores (0.0-1.0)
  • Output format: JSON structure for consistent parsing
  • Examples (optional): Sample evaluations for clarity
Default Prompt Structure:
You are an expert evaluator. Please evaluate the model response based on [metrics].

Rate the response on a scale of 0.0 to 1.0 where:
- 0.0-0.3: Poor response (incorrect, irrelevant, or unhelpful)
- 0.4-0.6: Average response (partially correct or somewhat helpful)
- 0.7-0.9: Good response (mostly correct and helpful)
- 0.9-1.0: Excellent response (accurate, relevant, and very helpful)

Please provide your evaluation as a JSON object:
{
  "accuracy": <float 0-1>,
  "relevance": <float 0-1>,
  "helpfulness": <float 0-1>,
  "overall_score": <average of all scores>,
  "reasoning": "<detailed explanation>",
  "passed": <boolean>
}
You can customize this prompt to match your evaluation criteria and domain.
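If you consume rater output yourself (for example via the API), a defensive parse of that JSON reply might look like this sketch; it assumes the default prompt above and is not Pulze's internal parser:

import json
import re

def parse_judgement(raw: str, threshold: float = 0.7) -> dict:
    """Pull the JSON object out of the rater's reply and recompute overall/pass."""
    match = re.search(r"\{.*\}", raw, re.DOTALL)  # tolerate prose around the JSON
    if not match:
        raise ValueError("No JSON object found in rater output")
    data = json.loads(match.group(0))
    metric_scores = [v for k, v in data.items()
                     if isinstance(v, (int, float)) and not isinstance(v, bool)
                     and k != "overall_score"]
    data["overall_score"] = sum(metric_scores) / len(metric_scores)
    data["passed"] = data["overall_score"] >= threshold
    return data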

Advanced Template Modes

Pulze evaluation templates support powerful advanced configurations:

1. Space Impersonation

Evaluate entire space configurations by selecting a space to impersonate during evaluation:
  • Tests with the space’s specific permissions
  • Uses the space’s enabled models and routers
  • Accesses the space’s uploaded data and documents
  • Leverages the space’s configured AI agents and tools
Use Case: Validate that your production space configuration works correctly before deploying to users.
When you select a space for evaluation, Pulze automatically generates a temporary API key with that space’s permissions. This ensures evaluations run exactly as they would for real users of that space.

2. Agentic Feature Flags

Control automatic AI agent behavior during evaluations:
Auto Tools 🛠️
  • Automatically selects appropriate tools to help generate responses
  • Tests whether your AI agents choose the right tools for each task
  • Validates tool integration and execution
Smart Learn 🧠
  • Uses learned patterns from liked responses to improve outputs
  • Tests how well the system adapts to successful patterns
  • Validates learning system effectiveness
Feature flags let you test different AI behaviors. For example, compare performance with and without automatic tool selection to measure the impact of agentic features.

3. Additional Advanced Features

Beyond basic configuration, evaluation templates support sophisticated testing scenarios:
Targeting Specific Tools: Use custom headers to test specific tool usage:
  • Target particular tools for AI agents to use
  • Validate tool selection and execution
  • Test tool integration in different scenarios
Targeting Specific Assistants: Evaluate performance of specific AI agents:
  • Test individual assistants within a space
  • Compare assistant configurations
  • Validate assistant behavior with different prompts
Custom Headers and Payloads: Add custom headers or payload modifications for both:
  • Model Being Evaluated: Configure the model/space you’re testing
  • Rater Model: Customize how the evaluation model behaves
Feature Flag Examples: Headers for controlling AI behavior:
{
  "pulze-feature-flags": {
    "auto_tools": true,
    "smart_learn": false
  }
}
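As a rough usage sketch, the flags above can travel as a request header. The header name comes from the JSON example; the endpoint, model name, and JSON serialization of the value are assumptions to verify against the Developer Guide:

import json
import os
import requests

# Serialize the flag object from the example above into a request header value.
# Confirm the exact header name and format against the Developer Guide.
headers = {
    "Authorization": f"Bearer {os.environ['PULZE_API_KEY']}",
    "pulze-feature-flags": json.dumps({"auto_tools": True, "smart_learn": False}),
}

# Endpoint and payload shape are assumptions based on an OpenAI-compatible API.
resp = requests.post(
    "https://api.pulze.ai/v1/chat/completions",
    headers=headers,
    json={"model": "pulze", "messages": [{"role": "user", "content": "Hello"}]},
    timeout=60,
)
print(resp.status_code)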
Common Advanced Use Cases:
  • Custom routing headers for specific model behavior
  • Temperature or parameter overrides for consistency
  • Organization-specific configuration testing
  • A/B testing different AI configurations
  • Tool availability and selection validation
  • Assistant-specific prompt testing
  • Multi-assistant comparison within spaces

Developer Guide - Feature Flags

See practical examples of using feature flags and custom headers in the Developer Guide

Running Evaluations

Once you have templates and datasets, you can run evaluations:

Single Dataset Evaluation

  1. Navigate to Evals → Evaluations
  2. Click Run Evaluation
  3. Select your evaluation template
  4. Choose one dataset
  5. Select models or spaces to evaluate
  6. Run the evaluation

Multi-Dataset Evaluation

Evaluate across multiple datasets simultaneously:
  1. Select multiple datasets when configuring your run
  2. Each dataset contributes to the overall score
  3. View aggregated results across all datasets
  4. Compare performance on different types of questions
Benefits:
  • Comprehensive Coverage: Test across diverse scenarios
  • Balanced Assessment: No single dataset dominates the score
  • Efficiency: Run once instead of multiple single-dataset evaluations
Multi-dataset evaluations automatically calculate total scores by averaging performance across all selected datasets. This gives you a holistic view of model performance.
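The aggregation works out to an average of dataset averages, so each dataset carries equal weight regardless of its size. A minimal sketch of that arithmetic:

def total_score(per_item_scores_by_dataset: dict[str, list[float]]) -> float:
    """Average each dataset's item scores, then average the dataset scores."""
    dataset_scores = [sum(scores) / len(scores)
                      for scores in per_item_scores_by_dataset.values()]
    return sum(dataset_scores) / len(dataset_scores)

# Example: two datasets of different sizes contribute equally to the total.
print(total_score({"support_faq": [0.9, 0.8, 0.7], "edge_cases": [0.6, 0.5]}))  # 0.675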

Evaluation Results

Automatic Scoring

Pulze automatically calculates scores for each evaluation run:
  • Per-Item Scores: Individual question/response scores (0.0-1.0)
  • Dataset Scores: Average across all items in a dataset
  • Total Score: Overall average when using multiple datasets
  • Pass/Fail Status: Based on your configured threshold

Results Dashboard

The evaluation results view shows:
  • Model Rankings: See which models perform best
  • Score Distributions: Understand performance patterns
  • Pass Rates: Track how many responses met your threshold
  • Detailed Analysis: Drill down into individual responses

Comparing Models

Evaluate multiple models simultaneously to compare:
  • Side-by-side scores: See which model performed better
  • Cost analysis: Compare performance relative to cost
  • Speed metrics: Track response times
  • Quality trends: Identify consistent performers

Evaluation Purposes

1. Model Selection

Compare different AI models to find the best fit:
  • Test GPT-4, Claude, Gemini, or other models
  • Evaluate proprietary vs. open-source options
  • Balance performance, cost, and speed

2. Assistant Validation

Test AI agents and assistants with their full capabilities:
  • Validate tool usage and selection
  • Ensure agents follow instructions correctly
  • Test multi-step reasoning and planning

3. Space Configuration Testing

Validate entire space setups before deployment:
  • Test with specific data access and permissions
  • Verify assistant configurations
  • Ensure tool integrations work correctly

4. Regression Testing

Catch performance degradation after changes:
  • Run evaluations before and after updates
  • Compare results to detect regressions
  • Maintain quality standards over time

5. Quality Assurance

Maintain consistent behavior across your AI systems:
  • Define quality standards via thresholds
  • Ensure responses meet requirements
  • Track quality metrics over time

Best Practices

Start with Clear Objectives: Define what you want to measure before creating evaluation templates. Are you testing accuracy, helpfulness, tool usage, or something else?

For Templates

  1. Descriptive Names: Use clear names like “Customer Support Quality” instead of “Template 1”
  2. Detailed Prompts: Provide comprehensive evaluation criteria to the rater model
  3. Appropriate Thresholds: Set realistic pass/fail thresholds based on your use case
  4. Relevant Metrics: Choose metrics that align with your goals

For Evaluation Runs

  1. Representative Datasets: Use datasets that reflect real-world usage
  2. Multiple Datasets: Combine different dataset types for comprehensive testing
  3. Regular Cadence: Schedule periodic evaluations to catch issues early
  4. Baseline Comparisons: Always compare against a baseline or previous version

For Space Evaluations

  1. Test Production Config: Use space impersonation to test exactly as users will experience
  2. Validate Permissions: Ensure data access controls work as expected
  3. Check Tool Integration: Verify AI agents use tools correctly
  4. Monitor Agent Behavior: Track how agents make decisions with feature flags

Evaluation Templates Library

Pulze provides predefined evaluation templates to get you started quickly:
  • Quality Assessment: General-purpose quality evaluation
  • Factual Accuracy: Tests for correct information
  • Instruction Following: Measures adherence to prompts
  • Custom Templates: Create your own for specific use cases
You can also import and customize any existing template to fit your needs.

Integration with Data

Evaluations are tightly integrated with Pulze’s data ecosystem:

With Datasets

  • Use any dataset type (Manual, Learning, or Benchmark)
  • Combine multiple datasets in one evaluation
  • Create datasets specifically for evaluation purposes

With Spaces

  • Evaluate using space-specific data and documents
  • Test with space permissions and access controls
  • Validate space configurations before user deployment

Results Storage

  • All evaluation results are stored and versioned
  • Track performance trends over time
  • Export results for external analysis
  • Share findings with your team

Advanced Evaluation Patterns

A/B Testing Configurations

Use evaluation templates to A/B test different configurations:
  1. Create two templates with different settings (e.g., with/without auto_tools)
  2. Run both against the same datasets
  3. Compare results to determine which configuration performs better

Continuous Evaluation

Integrate evaluations into your CI/CD pipeline:
  1. Create datasets that represent your test cases
  2. Set up evaluation templates for your standards
  3. Run evaluations automatically on changes
  4. Block deployments that don’t meet thresholds
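A CI gate built on evaluations could look roughly like the sketch below. The endpoint paths and response fields are hypothetical placeholders, not the documented Pulze API; the point is the block-on-threshold pattern:

#!/usr/bin/env python3
"""Hypothetical CI gate: fail the pipeline if the evaluation score is below threshold."""
import os
import sys
import requests

API = "https://api.example.com"  # placeholder base URL, not the real Pulze API
HEADERS = {"Authorization": f"Bearer {os.environ['PULZE_API_KEY']}"}
THRESHOLD = 0.7

# Trigger a run (endpoint paths and payload fields are illustrative assumptions).
run = requests.post(f"{API}/evaluations/runs", headers=HEADERS, json={
    "template_id": os.environ["EVAL_TEMPLATE_ID"],
    "dataset_ids": os.environ["EVAL_DATASET_IDS"].split(","),
}, timeout=60).json()

# Fetch the finished result; real code would poll until the run reports completion.
result = requests.get(f"{API}/evaluations/runs/{run['id']}", headers=HEADERS, timeout=60).json()

score = result["total_score"]
print(f"Evaluation total score: {score:.2f} (threshold {THRESHOLD})")
sys.exit(0 if score >= THRESHOLD else 1)  # non-zero exit blocks the deployment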

Progressive Testing

Test changes incrementally:
  1. Start with a small dataset to validate basic functionality
  2. Expand to larger datasets for comprehensive testing
  3. Run space-impersonated evaluations for final validation
  4. Deploy with confidence

Monitoring and Alerts

Set up monitoring for evaluation results to catch performance degradation early. If scores drop below your threshold, investigate before the issue affects users.
Track evaluation metrics over time:
  • Set up alerts for failing evaluations
  • Monitor score trends
  • Track pass rates across models
  • Identify degradation patterns
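For the score-trend part, a simple rolling check over recent run scores can flag degradation; the sketch below is pure illustration to wire into whatever alerting you already use:

def degradation_alert(recent_scores: list[float], threshold: float = 0.7, window: int = 3) -> bool:
    """Alert if the rolling average of the last `window` runs falls below the threshold."""
    if len(recent_scores) < window:
        return False
    rolling = sum(recent_scores[-window:]) / window
    return rolling < threshold

print(degradation_alert([0.82, 0.80, 0.71, 0.66, 0.64]))  # True: recent average 0.67 < 0.7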

API Access

Evaluations are accessible via the Pulze API for automation:
  • Create and manage templates programmatically
  • Trigger evaluation runs automatically
  • Retrieve results for custom analytics
  • Integrate with external monitoring systems

Next Steps

To get started with evaluations:
  1. Create datasets with representative test cases (Manual, Learning, or Benchmark)
  2. Design evaluation templates that define your quality standards
  3. Run your first evaluation to establish baselines
  4. Compare results across models, assistants, or configurations
  5. Iterate and improve based on evaluation insights