Evaluator

class mixedvoices.evaluation.evaluator.Evaluator(eval_id: str, project_id: str, metric_names: List[str], test_cases: List[str], created_at: int | None = None, eval_runs: dict[str, EvalRun] | None = None)

Bases: object

An Evaluator is a reusable collection of test cases and metrics used to test model performance. It can be run multiple times across different versions to track performance.

property id: str: Get the id of the Evaluator

property info: Dict[str, Any]: Get the info of the evaluator as a dictionary

list_eval_runs(version_id: str | None = None) → List[EvalRun]: List of eval runs

load_eval_run(run_id: str) → EvalRun

Load an eval run by its id.

Parameters: run_id (str) – The id of the eval run

property metric_names: List[str]: List of metric names to be evaluated

property project_id: str: Get the name of the Project

run(version: Version, agent_class: Type[BaseAgent], agent_starts: bool | None, verbose: bool = True, **kwargs) → EvalRun

Runs the evaluator and saves the results.

Parameters:

version (Version) – The version of the project to evaluate
agent_class (Type[BaseAgent]) – The agent class to evaluate
agent_starts (Optional[bool]) – Whether the agent starts the conversation. If True, the agent starts the conversation. If False, the evaluator starts the conversation. If None, the starter is chosen at random.
verbose (bool) – Whether to print testing conversation and scores. Defaults to True
**kwargs – Keyword arguments to pass to the agent class
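
The parameter semantics above can be illustrated with a minimal, self-contained sketch. It does not use mixedvoices itself: resolve_agent_starts, build_agent, and EchoAgent are hypothetical stand-ins, and the assumption that **kwargs are forwarded to the agent class constructor is inferred from the parameter description rather than confirmed by the library source.

```python
import random
from typing import Optional


def resolve_agent_starts(
    agent_starts: Optional[bool], rng: Optional[random.Random] = None
) -> bool:
    """Hypothetical helper: turn the tri-state flag into a concrete choice.

    True  -> the agent starts the conversation
    False -> the evaluator starts the conversation
    None  -> pick one of the two at random
    """
    if agent_starts is None:
        rng = rng or random.Random()
        return rng.choice([True, False])
    return agent_starts


class EchoAgent:
    """Stand-in agent class for illustration (not mixedvoices' BaseAgent)."""

    def __init__(self, greeting: str = "Hello"):
        self.greeting = greeting


def build_agent(agent_class, **kwargs):
    # Assumption: extra keyword arguments given to run() reach the
    # agent class constructor unchanged.
    return agent_class(**kwargs)


if __name__ == "__main__":
    agent = build_agent(EchoAgent, greeting="Hi there")
    print(agent.greeting, resolve_agent_starts(None))
```

Passing an explicit True or False for agent_starts keeps runs comparable across versions, while None may be useful when both conversation openings should be exercised over repeated runs.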

property test_cases: List[str]: List of test cases to be evaluated