Evals

View and analyze your evaluation results in Pydantic Logfire's web interface. Evals provide observability into how your AI systems perform across different test cases and experiments over time.

To get started with creating and running evals, see the Pydantic Evals docs. For managing datasets in Logfire, see the Datasets guide.

The Datasets List

Click Evals in the sidebar to open the datasets list. This is the top-level view for all evaluation activity in your project.

Each dataset row shows:

Name --- the dataset identifier (with a hosted badge for hosted datasets)
Pass rate --- aggregate pass rate from the most recent experiments
Last run --- when the most recent experiment was run
Experiments --- total number of experiment runs
Cases --- number of test cases in the dataset
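As a rough illustration of what an aggregate pass rate summarizes, the sketch below computes the fraction of passing assertions across the cases of one run. The `CaseResult` class and the `pass_rate` function are hypothetical names for this example, and the formula (passed assertions divided by total assertions) is an assumption; Logfire may aggregate differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CaseResult:
    """Hypothetical record of one test case in one experiment run."""

    name: str
    assertions: dict[str, bool]  # evaluator name -> whether the assertion passed


def pass_rate(results: list[CaseResult]) -> Optional[float]:
    """Return the fraction of passing assertions, or None if there are none."""
    outcomes = [ok for result in results for ok in result.assertions.values()]
    if not outcomes:
        return None  # nothing to aggregate yet
    return sum(outcomes) / len(outcomes)


# Illustrative data only: two cases, four assertions, three of which pass.
run = [
    CaseResult("capital_of_france", {"exact_match": True, "is_str": True}),
    CaseResult("capital_of_peru", {"exact_match": False, "is_str": True}),
]
print(pass_rate(run))  # 0.75
```

Returning `None` for an empty run keeps "no data yet" distinct from a genuine 0% pass rate.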

Use the All, Hosted, and Local tabs to filter by dataset source. The search bar filters by name.

If you haven't created any datasets yet, the empty state guides you through two paths: creating a hosted dataset in the UI, or running evals from code with pydantic-evals.

Dataset Detail Page

Click a dataset name to open its detail page. The page has tabs for:

Experiments --- all evaluation runs against this dataset
Cases --- test cases (editable for hosted datasets)
Schema --- input, output, and metadata schemas

The header shows the dataset name, experiment count, case count, and aggregate pass rate. Use the Export button to download cases, the Edit button to modify the dataset, or the <> SDK button to view code snippets for working with this dataset programmatically.

If the dataset has no experiments yet, the empty state walks you through setup: define your schema, add test cases, then run your first experiment from code.

Viewing Experiments

The Experiments tab lists all evaluation runs for the dataset. Each experiment shows results in a grid with inputs, outputs, scores, and assertion results.

Click any experiment row to see detailed results including:

Test cases with inputs, expected outputs, and actual outputs
Assertion results --- pass/fail status for each evaluator
Performance metrics --- duration, token usage, and custom scores
Evaluation scores --- detailed scoring from all evaluators

Comparing Experiments

To compare multiple runs side by side:

1. Select experiments from the list using the checkboxes
2. Click Compare
3. View side-by-side results for the same test cases

The comparison view highlights differences in outputs, score variations, performance changes, and regressions between runs.
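To make the idea of a regression between runs concrete, here is a minimal sketch that diffs two runs by case name and reports assertions that passed in the baseline but fail in the candidate. The data layout (`case name -> evaluator name -> passed`) and the `find_regressions` helper are assumptions made for this example, not the comparison view's actual logic.

```python
# Hypothetical layout: case name -> {evaluator name -> passed?}
Run = dict[str, dict[str, bool]]


def find_regressions(baseline: Run, candidate: Run) -> list[tuple[str, str]]:
    """Return (case, evaluator) pairs that passed in baseline but fail in candidate."""
    regressions = []
    for case, checks in baseline.items():
        new_checks = candidate.get(case, {})
        for evaluator, passed in checks.items():
            # Only an explicit failure counts; a missing result is not a regression.
            if passed and new_checks.get(evaluator) is False:
                regressions.append((case, evaluator))
    return regressions


# Illustrative data only.
baseline = {
    "greeting": {"exact_match": True},
    "farewell": {"exact_match": False},
}
candidate = {
    "greeting": {"exact_match": False},  # was passing, now failing
    "farewell": {"exact_match": True},   # was failing, now passing
}
print(find_regressions(baseline, candidate))  # [('greeting', 'exact_match')]
```

Matching on case name is what makes a side-by-side view meaningful: only results for the same test case are compared.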

Integration with Traces

Every evaluation experiment generates OpenTelemetry traces that appear in Logfire. Each trace is made up of the following spans:

Experiment span --- root span containing all evaluation metadata
Case execution spans --- individual test case runs with full context
Task function spans --- detailed tracing of your AI system under test
Evaluator spans --- scoring and assessment execution details

Navigate from experiment results to full trace details using the span links.
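The span types listed above can be pictured as a tree. The sketch below models that tree with a small hypothetical `Span` class and prints it with indentation. The span names are placeholders, and the exact nesting (task function and evaluator spans as children of a case span) is an assumption for illustration; check the trace view for the real structure.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """Hypothetical stand-in for a span in an experiment trace."""

    name: str
    children: list["Span"] = field(default_factory=list)


def render(span: Span, depth: int = 0) -> list[str]:
    """Return one indented line per span, parents before children."""
    lines = ["  " * depth + span.name]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


# Placeholder names; real span names will differ.
trace = Span(
    "experiment",
    [
        Span(
            "case: capital_of_france",
            [Span("task function"), Span("evaluator: exact_match")],
        ),
    ],
)
print("\n".join(render(trace)))
```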