Example 1: Evaluating a synthetic version of the T1 dataset
We use the attraction recommendation subset of the T1 benchmark. We have uploaded the demo T1 files here. Each trace contains a conversation between a user and an agent, a sequence of tool calls, and a final recommendation output.
Step 1: Upload datasets and configure
Leave the import format as SynAE format.
Upload the original T1 CSV as the original dataset and the desired synthetic CSV(s) as synthetic datasets.
Then fill in the benchmark-specific configuration:
Data column name: Data
Tool calls column name: Tool Call
Output column name: Output
Attribute column names: city, type
List of tools: search_attractions, filter_attractions, search_nearest, sort_results, get_results_from_cache, save_to_cache
Benchmark task description: an attractions recommendation assistant
Alternatively, select T1 from the "Select example config" dropdown to fill these in automatically.
Step 2: Select metrics
All metrics are enabled by default. For T1, all fidelity, validity, and diversity metrics apply since the dataset has instructions, tool calls, and outputs.
Step 3: Run and load results
Click Show command to get the evaluation command. Run it locally, then upload the output JSON to view the metric report.
Example 2: Evaluating traces exported from LangSmith, MLflow, or OpenAI
If your traces are stored in a tracing platform, export them as JSON and select the matching import format. SynAE will convert them to the SynAE CSV format automatically before evaluation.
Step 1: Upload datasets and configure
Set the import format to LangSmith Traces, MLflow Traces, or OpenAI Traces before uploading.
Upload your exported JSON files as the original and synthetic datasets.
The column names will be pre-filled to match the converter output (Data, Tool Calls, Output).
Fill in any remaining benchmark-specific fields such as the list of tools and task description.
Step 2: Select metrics
Select the metrics that apply to your dataset. For example, if your traces only have tool calls and no final output, uncheck all output-related metrics.
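The selection rule above can be sketched as a small helper. This is only an illustration: the function and the group names (`instruction`, `tool_calls`, `output`) are hypothetical and are not SynAE identifiers.

```python
def applicable_metric_groups(has_tool_calls: bool, has_output: bool) -> list[str]:
    """Return the (hypothetical) metric groups that fit the columns present."""
    if not (has_tool_calls or has_output):
        raise ValueError("need Tool Calls, Output, or both alongside Data")
    groups = ["instruction"]  # the Data column is always required
    if has_tool_calls:
        groups.append("tool_calls")
    if has_output:
        groups.append("output")
    return groups

# Traces with tool calls but no final output: output metrics are left out.
print(applicable_metric_groups(has_tool_calls=True, has_output=False))
```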
Step 3: Run and load results
Click Run evaluation. Results appear automatically when done.
See the Import Formats section below for the expected JSON structure for each platform.
SynAE Input Format
Each dataset (original and synthetic) is a CSV file with the columns described below.
All datasets must have the same columns.
Data (required)
The instruction or conversation text for each trace. The column name is configurable.
Tool Calls (optional)
The sequence of tool calls made by the agent for each trace. A dataset must include Tool Calls, Output, or both alongside Data.
Output (optional)
The agent's final output or response for each trace.
Attribute columns (optional)
Any number of additional columns that label or categorize traces, such as city or type. These are benchmark-specific and used for attribute match and diversity metrics.
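Before uploading, the column rules above can be checked locally. The following is a minimal sketch using only the Python standard library. The helper names are hypothetical, the default column names are the ones used in this document, and this is not SynAE's own validation logic.

```python
import csv
import io

def read_header(csv_text: str) -> list[str]:
    """Return the column names from the first row of a CSV document."""
    return next(csv.reader(io.StringIO(csv_text)))

def check_datasets(original_csv, synthetic_csv, data_col="Data",
                   tool_col="Tool Calls", output_col="Output"):
    """Return a list of problems; an empty list means the layout looks fine."""
    orig, synth = read_header(original_csv), read_header(synthetic_csv)
    problems = []
    if set(orig) != set(synth):
        problems.append("original and synthetic columns differ")
    if data_col not in orig:
        problems.append(f"missing required column: {data_col}")
    if tool_col not in orig and output_col not in orig:
        problems.append("need a tool calls column, an output column, or both")
    return problems

# Toy data: the synthetic file is missing the "type" attribute column.
original = "Data,Tool Calls,Output,city,type\nhello,[],ok,Paris,museum\n"
synthetic = "Data,Tool Calls,Output,city\nhi,[],ok,Rome\n"
print(check_datasets(original, synthetic))
```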
Benchmark-Specific Configuration
Data column name (required)
The name of the column containing the instruction or conversation text.
Tool calls column name (optional)
The name of the column containing the agent's tool call sequence. Leave blank if your dataset has no tool calls.
Output column name (optional)
The name of the column containing the agent's final output or response. Leave blank if your dataset has no output column.
Attribute column names (optional, comma-separated)
Names of columns that label or categorize each trace, e.g. city, type. Used for attribute match and diversity metrics.
List of tools available to the agent (optional, comma-separated)
The full set of tool names the agent can call as part of the benchmark, e.g. search_attractions, filter_attractions. Used for tool call validity evaluation.
Benchmark task description (optional)
A short natural-language description of the agent's task, e.g. an attractions recommendation assistant. Used as context for LLM-based validity metrics.
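For scripted setups, the fields above could be kept in a small structure. The sketch below is an assumption for illustration only: `BenchmarkConfig` and `split_csv_field` are hypothetical names, not part of SynAE. The values are the T1 settings from Example 1.

```python
from dataclasses import dataclass, field

def split_csv_field(value: str) -> list[str]:
    """Split a comma-separated form field, dropping blanks and whitespace."""
    return [item.strip() for item in value.split(",") if item.strip()]

@dataclass
class BenchmarkConfig:
    data_column: str                 # required
    tool_calls_column: str = ""      # blank means no tool calls column
    output_column: str = ""          # blank means no output column
    attribute_columns: list[str] = field(default_factory=list)
    tools: list[str] = field(default_factory=list)
    task_description: str = ""

t1 = BenchmarkConfig(
    data_column="Data",
    tool_calls_column="Tool Call",
    output_column="Output",
    attribute_columns=split_csv_field("city, type"),
    tools=split_csv_field(
        "search_attractions, filter_attractions, search_nearest, "
        "sort_results, get_results_from_cache, save_to_cache"
    ),
    task_description="an attractions recommendation assistant",
)
print(t1.attribute_columns, len(t1.tools))
```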
Import Formats
When an import format is selected, uploaded files are treated as JSON trace exports and converted to the SynAE CSV format automatically.
Both the original and synthetic files must use the same format.
All converters produce three columns: Data, Tool Calls, and Output.
LangSmith
Reference: docs.langchain.com/langsmith/run-data-format
MLflow
Reference: mlflow.org - mlflow.entities.Trace
OpenAI Agents SDK
Reference: https://openai.github.io/openai-agents-python/ref/tracing/traces/
Fidelity
Instruction: Key Node Dependency (KND) (lower is better)
Do synthetic instructions and responses depend on each other the same way as in the real data?
Instruction: Attribute Match (AM) (lower is better)
Do synthetic traces share the same surface-level properties as real ones, such as conversation length, token count, and attribute distributions?
Instruction: KNN-Precision (higher is better)
Are the synthetic instructions semantically realistic, i.e., do they resemble real instructions?
Instruction: KNN-Recall (higher is better)
Do the synthetic instructions cover the full range of topics and styles present in the real data?
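To make the two KNN metrics concrete, here is a minimal pure-Python sketch. It assumes the k-nearest-neighbour manifold estimate commonly used for generative models: a sample counts as covered if it lies within the k-NN radius of at least one point in the other set. Whether SynAE uses exactly this definition, which embedding model it applies, and which k it picks are not stated here. The 2-D points stand in for text embeddings.

```python
import math

def knn_radius(points, k):
    """Distance from each point to its k-th nearest neighbour in the same set."""
    radii = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        radii.append(dists[k - 1])
    return radii

def coverage(queries, support, k):
    """Fraction of queries inside at least one support point's k-NN ball."""
    radii = knn_radius(support, k)
    hits = sum(
        any(math.dist(q, s) <= r for s, r in zip(support, radii))
        for q in queries
    )
    return hits / len(queries)

def knn_precision_recall(real, synthetic, k=1):
    precision = coverage(synthetic, real, k)  # synthetic samples near real data
    recall = coverage(real, synthetic, k)     # real samples near synthetic data
    return precision, recall

# Toy embeddings: two synthetic points sit near the real data, two are far away.
real = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
synthetic = [(0.0, 0.1), (0.0, 0.3), (3.0, 3.0), (3.0, 3.2)]
print(knn_precision_recall(real, synthetic, k=1))
```

In this toy case half of the synthetic points look realistic (precision 0.5), but they cluster tightly and cover only one of the four real points (recall 0.25).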
Instruction: FID (lower is better)
How similar are the overall semantic distributions of synthetic and real instructions?
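For reference, the standard Fréchet distance between two Gaussians fitted to embeddings is shown below. The symbols are introduced here for illustration: $\mu_r, \Sigma_r$ are the mean and covariance of the real embeddings and $\mu_s, \Sigma_s$ those of the synthetic ones. Which embedding model SynAE uses is not stated in this document.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_s \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_s - 2\left( \Sigma_r \Sigma_s \right)^{1/2} \right)
```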
Tool Calls: Tool Usage Match (TUM) (lower is better)
Does the synthetic data call each tool at roughly the same frequency as the real data?
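One way to picture this comparison is to compute per-tool relative frequencies on both datasets and measure the gap between them. The sketch below uses total variation distance as an assumed stand-in. The distance SynAE actually reports is not specified here, and the traces are toy data already parsed into lists of tool names.

```python
from collections import Counter

def tool_frequencies(traces):
    """Relative frequency of each tool across all traces."""
    counts = Counter(tool for trace in traces for tool in trace)
    total = sum(counts.values())
    return {tool: n / total for tool, n in counts.items()}

def total_variation(p, q):
    """Total variation distance between two discrete distributions (0 = identical)."""
    return 0.5 * sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in set(p) | set(q))

real = [["search_attractions", "sort_results"],
        ["search_attractions", "save_to_cache"]]
synthetic = [["search_attractions", "sort_results"],
             ["search_attractions", "sort_results"]]
print(total_variation(tool_frequencies(real), tool_frequencies(synthetic)))
```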
Tool Calls: Tool Call Number Match (TCNM) (lower is better)
Does the synthetic data make a similar number of tool calls per task as the real data?
Tool Calls: k-Step Planning (lower is better)
Does the synthetic data follow similar tool-call sequences and planning patterns as the real data?
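A simple way to compare planning patterns is to look at runs of k consecutive tool calls. The sketch below is an assumption for illustration, not SynAE's definition of k-Step Planning: it builds the distribution of k-grams in each dataset and compares them with total variation distance.

```python
from collections import Counter

def k_step_distribution(traces, k):
    """Relative frequency of every run of k consecutive tool calls."""
    counts = Counter(
        tuple(trace[i:i + k])
        for trace in traces
        for i in range(len(trace) - k + 1)
    )
    total = sum(counts.values())
    return {gram: n / total for gram, n in counts.items()}

def k_step_gap(real, synthetic, k=2):
    """Total variation distance between the two k-gram distributions."""
    p, q = k_step_distribution(real, k), k_step_distribution(synthetic, k)
    return 0.5 * sum(abs(p.get(g, 0.0) - q.get(g, 0.0)) for g in set(p) | set(q))

real = [["search_attractions", "filter_attractions", "sort_results"],
        ["search_attractions", "sort_results"]]
synthetic = [["search_attractions", "filter_attractions", "sort_results"],
             ["search_attractions", "filter_attractions", "sort_results"]]
print(round(k_step_gap(real, synthetic, k=2), 3))
```

Here the synthetic traces never go straight from `search_attractions` to `sort_results`, so a third of the real 2-step patterns has no synthetic counterpart.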
Output: KNN-Precision (higher is better)
Are the synthetic final outputs semantically realistic, i.e., do they resemble real outputs?
Output: KNN-Recall (higher is better)
Do the synthetic outputs cover the full range of responses present in the real data?
Output: FID (lower is better)
How similar are the overall semantic distributions of synthetic and real outputs?
Downstream: Task Difficulty Difference (TDD) (lower is better)
Are the synthetic tasks about as hard for an agent to complete as the real tasks?
Downstream: Ranking Divergence (RD) (lower is better)
Does the synthetic benchmark rank agents in the same order as the real benchmark?
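As an illustration of comparing two agent rankings, the sketch below computes the normalized Kendall tau distance: the fraction of agent pairs that the two rankings order differently, where 0 means they agree on every pair. This is an assumed stand-in. The exact divergence SynAE computes is not specified here, and the agent names are hypothetical.

```python
from itertools import combinations

def kendall_tau_distance(ranking_a, ranking_b):
    """Fraction of agent pairs that the two rankings order differently."""
    pos_a = {agent: i for i, agent in enumerate(ranking_a)}
    pos_b = {agent: i for i, agent in enumerate(ranking_b)}
    pairs = list(combinations(ranking_a, 2))
    discordant = sum(
        (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0 for x, y in pairs
    )
    return discordant / len(pairs)

real_ranking = ["agent_a", "agent_b", "agent_c"]       # best to worst on real tasks
synthetic_ranking = ["agent_a", "agent_c", "agent_b"]  # best to worst on synthetic tasks
print(round(kendall_tau_distance(real_ranking, synthetic_ranking), 3))
```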
Validity
Tool Call Validity Rate (higher is better)
What proportion of synthetic traces have tool calls that are consistent with the task instructions?
Output Validity Rate (higher is better)
What proportion of synthetic traces have a final output that appropriately addresses the task?
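The validity metrics rely on LLM-based judgment, which a short snippet cannot reproduce. As a rough local pre-check, though, one can at least measure how many traces call only tools from the configured list. The helper below is a hypothetical sketch of that simpler check, and `book_hotel` is a made-up tool name used to trigger a failure.

```python
def tool_name_validity_rate(traces, allowed_tools):
    """Share of traces whose tool calls all use names from the allowed list."""
    allowed = set(allowed_tools)
    valid = sum(all(tool in allowed for tool in trace) for trace in traces)
    return valid / len(traces)

allowed = ["search_attractions", "filter_attractions", "sort_results"]
traces = [
    ["search_attractions", "sort_results"],
    ["search_attractions", "book_hotel"],  # not in the allowed list
]
print(tool_name_validity_rate(traces, allowed))
```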
Diversity
Vendi Score (Instruction, Tool Calls, Output) (higher is better)
How varied are the synthetic samples from one another?
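For reference, the Vendi Score of $n$ samples is usually defined from a positive semidefinite similarity matrix $K$ with $K_{ii} = 1$, where $\lambda_1, \dots, \lambda_n$ are the eigenvalues of $K/n$ and $0 \log 0$ is taken as $0$. The similarity function SynAE uses for each field is not stated in this document.

```latex
\mathrm{VS}(K) = \exp\!\left( - \sum_{i=1}^{n} \lambda_i \log \lambda_i \right)
```

The score ranges from 1 (all samples identical) to $n$ (all samples mutually dissimilar), so it can be read as an effective number of distinct samples.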
Attribute Diversity (higher is better)
How evenly are the predefined attribute categories (e.g., task type, city) represented across the synthetic data?
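Evenness over categories is often measured with normalized Shannon entropy, which is 1 when every category appears equally often and falls toward 0 as one category dominates. The sketch below assumes that formulation. SynAE's exact formula is not given here, and the function name is hypothetical.

```python
import math
from collections import Counter

def attribute_diversity(values, categories):
    """Shannon entropy of the observed values, normalized by log(#categories)."""
    counts = Counter(values)
    total = len(values)
    entropy = -sum(
        (counts[c] / total) * math.log(counts[c] / total)
        for c in categories
        if counts[c] > 0  # categories that never appear contribute nothing
    )
    return entropy / math.log(len(categories))

cities = ["Paris", "Rome"]  # predefined categories; at least two are needed
balanced = ["Paris", "Paris", "Rome", "Rome"]
skewed = ["Paris", "Paris", "Paris", "Rome"]
print(round(attribute_diversity(balanced, cities), 3))
print(round(attribute_diversity(skewed, cities), 3))
```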