EvalSet

An EvalSet is a named, versioned collection of test cases that can be evaluated, filtered, and shared as a unit.

Import

from openstackai.evaluation import EvalSet

Constructor

EvalSet(
    name: str,                   # Evaluation set name
    test_cases: list[TestCase],  # List of test cases
    description: str = None,     # Description
    version: str = "1.0",        # Version identifier
    metadata: dict = None        # Additional metadata
)

Creating EvalSets

Basic Creation

from openstackai.evaluation import EvalSet, TestCase

eval_set = EvalSet(
    name="Math Evaluation",
    test_cases=[
        TestCase(input="2+2", expected_output="4"),
        TestCase(input="5*5", expected_output="25"),
        TestCase(input="10/2", expected_output="5"),
    ],
    description="Basic arithmetic tests"
)

From YAML File

# eval_set.yaml
name: Customer Support Evaluation
description: Tests for customer support agent
version: "2.0"

test_cases:
  - input: "I need help with my order"
    criteria: [helpfulness, tone]
    tags: [orders]

  - input: "How do I return an item?"
    expected_output_contains: "return policy"
    criteria: [accuracy, clarity]
    tags: [returns]

  - input: "I'm very angry about this!"
    criteria: [empathy, de-escalation]
    tags: [complaints]

eval_set = EvalSet.from_yaml("eval_set.yaml")

From JSON

eval_set = EvalSet.from_json("eval_set.json")
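
The JSON schema is not shown on this page. Assuming it mirrors the YAML layout above (an assumption, not something this page confirms), a file for from_json could be produced and sanity-checked with the standard library alone. The sketch below only writes and re-reads the file; it does not call openstackai.

```python
import json
import tempfile
from pathlib import Path

# Assumed layout: the same keys as the YAML example (not confirmed by this page).
eval_set_data = {
    "name": "Customer Support Evaluation",
    "description": "Tests for customer support agent",
    "version": "2.0",
    "test_cases": [
        {
            "input": "I need help with my order",
            "criteria": ["helpfulness", "tone"],
            "tags": ["orders"],
        },
        {
            "input": "How do I return an item?",
            "expected_output_contains": "return policy",
            "criteria": ["accuracy", "clarity"],
            "tags": ["returns"],
        },
    ],
}

# Write to a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "eval_set.json"
    path.write_text(json.dumps(eval_set_data, indent=2), encoding="utf-8")
    loaded = json.loads(path.read_text(encoding="utf-8"))

print(loaded["name"], len(loaded["test_cases"]))
```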

Methods

add_test_case()

Add a single test case:

eval_set.add_test_case(
    TestCase(input="New test", expected_output="Expected")
)

filter_by_tags()

Filter test cases by tags:

# Get only math-related tests
math_tests = eval_set.filter_by_tags(["math"])

# Exclude certain tags
no_advanced = eval_set.filter_by_tags(exclude=["advanced"])
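
This page does not say whether a test case must carry any or all of the requested tags. The sketch below illustrates one plausible reading, "keep a case if it shares at least one included tag and no excluded tag", using a hypothetical `Case` dataclass and a free function rather than the library's own classes.

```python
from dataclasses import dataclass, field


@dataclass
class Case:
    """Hypothetical stand-in for TestCase, holding only what the sketch needs."""
    input: str
    tags: list[str] = field(default_factory=list)


def filter_by_tags(cases, include=None, exclude=None):
    """Keep cases that share a tag with `include` (if given) and none with `exclude`."""
    include_set = set(include or [])
    exclude_set = set(exclude or [])
    kept = []
    for case in cases:
        case_tags = set(case.tags)
        if include_set and not (case_tags & include_set):
            continue  # an include list was given and nothing matched
        if case_tags & exclude_set:
            continue  # at least one excluded tag is present
        kept.append(case)
    return kept


cases = [
    Case("2+2", tags=["math"]),
    Case("integrate x^2", tags=["math", "advanced"]),
    Case("capital of France", tags=["geography"]),
]
math_only = filter_by_tags(cases, include=["math"])
no_advanced = filter_by_tags(cases, exclude=["advanced"])
print([c.input for c in math_only])
print([c.input for c in no_advanced])
```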

split()

Split into training/validation sets:

train_set, val_set = eval_set.split(ratio=0.8)
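
Whether split() shuffles before cutting, or accepts a seed, is not documented here. As a rough illustration of what a 0.8 ratio means, the following standalone sketch shuffles a copy with a fixed seed and cuts it at the floor of `len * ratio`; the `seed` parameter is an assumption of the sketch, not a documented argument.

```python
import random


def split(items, ratio=0.8, seed=0):
    """Shuffle a copy of `items`, then cut it at int(len(items) * ratio)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = int(len(shuffled) * ratio)
    return shuffled[:cut], shuffled[cut:]


cases = [f"case-{i}" for i in range(10)]
train, val = split(cases, ratio=0.8)
print(len(train), len(val))
```

With 10 cases and a ratio of 0.8, the first part holds 8 cases and the second holds 2, and no case appears in both.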

sample()

Random sampling:

# Get 10 random test cases
sample = eval_set.sample(n=10)
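
For repeatable experiments it helps to know how a sample was drawn. The page does not state whether sample() takes a seed or how it behaves when n exceeds the set size, so the sketch below shows one possible behaviour using only the standard library: sampling without replacement, capped at the number of available cases.

```python
import random


def sample(items, n, seed=None):
    """Return up to `n` distinct items chosen without replacement."""
    pool = list(items)
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return rng.sample(pool, min(n, len(pool)))


cases = [f"case-{i}" for i in range(25)]
picked = sample(cases, n=10, seed=42)
print(len(picked), len(set(picked)))
```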

Serialization

Save to File

eval_set.to_yaml("output.yaml")
eval_set.to_json("output.json")

Export for Sharing

# Export with all metadata
eval_set.export("benchmark_v1.zip")

Properties

Property     Type   Description
name         str    Set name
test_cases   list   All test cases
size         int    Number of test cases
tags         set    All unique tags
version      str    Version string
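
The derived properties size and tags can be understood as simple computations over test_cases. The following is a minimal sketch with hypothetical `Case` and `MiniEvalSet` classes, not the library's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Case:
    """Hypothetical stand-in for TestCase."""
    input: str
    tags: list[str] = field(default_factory=list)


@dataclass
class MiniEvalSet:
    """Hypothetical stand-in showing how size and tags could be derived."""
    name: str
    test_cases: list[Case]
    version: str = "1.0"

    @property
    def size(self) -> int:
        # Number of test cases in the set.
        return len(self.test_cases)

    @property
    def tags(self) -> set[str]:
        # Union of the tags on every test case.
        return {tag for case in self.test_cases for tag in case.tags}


mini = MiniEvalSet("Demo", [Case("a", ["orders"]), Case("b", ["orders", "returns"])])
print(mini.size, sorted(mini.tags))
```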

Built-in Benchmarks

from openstackai.evaluation.benchmarks import (
    MMLU,
    HellaSwag,
    TruthfulQA,
    HumanEval
)

# Load standard benchmark
mmlu = MMLU.load(subset="computer_science")

# `evaluator` and `my_agent` are assumed to be created elsewhere
results = evaluator.evaluate(mmlu, agent=my_agent)
