# Running Evaluations
> **Experimental SDK**: The dataset management SDK is under `logfire.experimental.api_client`. The API may change in future releases.
Once you have a hosted dataset (created via the Web UI or SDK), you can fetch it as a typed `pydantic_evals.Dataset` and use it to evaluate your AI system.
## Getting a typed pydantic-evals Dataset
The `get_dataset` method fetches all cases and returns a typed `pydantic_evals.Dataset` that you can use directly for evaluation:
```python
from dataclasses import dataclass

from pydantic_evals import Dataset

from logfire.experimental.api_client import LogfireAPIClient


@dataclass
class QuestionInput:
    question: str
    context: str | None = None


@dataclass
class AnswerOutput:
    answer: str
    confidence: float


@dataclass
class CaseMetadata:
    category: str
    difficulty: str
    reviewed: bool = False


with LogfireAPIClient(api_key='your-api-key') as client:
    dataset: Dataset[QuestionInput, AnswerOutput, CaseMetadata] = client.get_dataset(
        'qa-golden-set',
        input_type=QuestionInput,
        output_type=AnswerOutput,
        metadata_type=CaseMetadata,
    )
    print(f'Fetched {len(dataset.cases)} cases')
    print(f'First case input type: {type(dataset.cases[0].inputs).__name__}')
```
If you have custom evaluator types stored with your cases, pass them via `custom_evaluator_types` so they can be deserialized:
```python
dataset = client.get_dataset(
    'qa-golden-set',
    input_type=QuestionInput,
    output_type=AnswerOutput,
    custom_evaluator_types=[MyCustomEvaluator],
)
```
Without type arguments, `get_dataset` returns the raw dict in a pydantic-evals-compatible format:
```python
raw_data = client.get_dataset('qa-golden-set')
# raw_data is a dict with 'name', 'cases', etc.
```
## Running the Evaluation
Use the dataset with pydantic-evals to evaluate your AI system:
```python
import asyncio

from pydantic_evals import Dataset

from logfire.experimental.api_client import LogfireAPIClient


async def my_qa_task(inputs: QuestionInput) -> AnswerOutput:
    """The AI system under test."""
    # Your AI logic here: call an LLM, run an agent, etc.
    ...


async def run_evaluation():
    with LogfireAPIClient(api_key='your-api-key') as client:
        # Get the dataset
        dataset: Dataset[QuestionInput, AnswerOutput, CaseMetadata] = client.get_dataset(
            'qa-golden-set',
            input_type=QuestionInput,
            output_type=AnswerOutput,
            metadata_type=CaseMetadata,
        )

        # Run the evaluation
        report = await dataset.evaluate(my_qa_task)
        report.print()


asyncio.run(run_evaluation())
```
## Viewing Results in the Evals Tab
With Logfire tracing enabled, the evaluation results appear automatically in the Evals tab, where you can compare experiments and analyze performance over time.
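If your evaluation script doesn't already configure tracing, a minimal setup might look like this (it assumes a write token is available, e.g. via the `LOGFIRE_TOKEN` environment variable):

```python
import logfire

# Send spans to Logfire so evaluation runs show up in the Evals tab.
# The write token is read from the environment or saved credentials.
logfire.configure()
```

Call `logfire.configure()` once, before running the evaluation, so the spans emitted by pydantic-evals are exported.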
## The Evaluation Workflow
This creates a continuous improvement loop:

1. Observe production behavior in Live View.
2. Curate test cases by adding interesting traces to a dataset.
3. Evaluate your system against the dataset using pydantic-evals.
4. Analyze the results in the Logfire Evals tab.
5. Improve your system and repeat.