Concepts

Definitions of the terms used in Modelmetry, from its core concepts to its Guardrail and Observability concepts.

Modelmetry is a platform designed to enhance the safety, quality, and appropriateness of data and models in applications that use Large Language Models (LLMs), such as chatbots. It integrates numerous tools to evaluate and monitor your LLM pipelines, ensuring they meet high standards of operation.

Core concepts include:

  • payloads (data to be evaluated),

  • evaluators (analyzing aspects like safety, sentiment, or custom metrics),

  • instances (customized evaluator setups),

  • entries (records of evaluations), and

  • findings (observable facts that are numeric, boolean, or labeled).

Guardrail terms include:

  • guardrails (groups of evaluators forming custom evaluation frameworks),

  • checks (records of guardrail verifications), and

  • outcomes (whether a check passed, failed, or errored).

Observability terms include:

  • traces (representing requests),

  • spans (logging specific tasks),

  • events (logging actions), and

  • metrics (quantifying performance aspects).

Payload

In Modelmetry, a payload is the data that needs to be evaluated or checked. It can consist of Input, Output, and Options.

The Input and Output can be plain text (for simpler cases) or a set of messages (for multi-turn LLM use cases). The options are the usual generative AI request options such as model name, temperature, tools, etc.
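
As a rough illustration, a payload could be modeled like this. This is a hypothetical TypeScript sketch; the field names are assumptions, not Modelmetry's exact schema.

```typescript
// Hypothetical sketch of a payload; field names are assumptions,
// not Modelmetry's exact schema.
type Message = { role: "system" | "user" | "assistant"; content: string };

interface Payload {
  // Plain text for simple cases, or a list of messages for multi-turn use cases.
  input: string | Message[];
  // The candidate output to evaluate (may be absent when only the input is checked).
  output?: string | Message[];
  // The usual generative AI request options.
  options?: {
    model?: string;       // e.g. "gpt-4o"
    temperature?: number; // e.g. 0.2
    tools?: unknown[];    // tool/function definitions, if any
  };
}
```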

Evaluators

Evaluators are at the heart of Modelmetry. They analyze payloads and focus on specific aspects of them, such as safety and security, sentiment and emotions, helpfulness and quality, or even custom analyses via your own HTTP endpoint or LLM-as-a-Judge evaluators. An evaluator can output built-in metrics as well as custom ones, and it can also give a verdict and a score.

Instances

Nearly all evaluators offer settings to allow you to personalize and fine-tune the evaluations. An Instance refers to a specific setup of an evaluator, customized through a configuration schema to suit particular assessment needs. Each instance operates under a unique identifier and a defined set of options that dictate how evaluations are conducted, including thresholds and conditions that trigger different outcomes.
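
For example, an instance of a toxicity evaluator could be configured along these lines. This is a hypothetical sketch; the identifier format and option names are assumptions.

```typescript
// Hypothetical instance configuration; the identifier format and
// option names are assumptions, not Modelmetry's schema.
const strictToxicityInstance = {
  instanceId: "ist_toxicity_strict", // unique identifier for this instance
  evaluator: "toxicity",             // the underlying evaluator
  options: {
    threshold: 0.7,      // score above which the evaluation fails
    failOnError: false,  // whether an evaluator error counts as a failure
  },
};
```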

Entry

An evaluation, or entry, is a record that contains the details of a specific evaluation (see the sketch after this list):

  • the evaluator

  • the instance's configuration

  • the payload

  • the output

    • outcome (pass, fail, error)

    • score

    • findings (e.g., metrics, labels)
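
Put together, an entry might look roughly like the following. This is a hypothetical shape; the field names are assumptions.

```typescript
// Hypothetical shape of an entry (evaluation record); field names are assumptions.
interface Entry {
  evaluator: string;                       // which evaluator ran
  instanceConfig: Record<string, unknown>; // the instance's configuration
  payload: unknown;                        // the payload that was evaluated
  output: {
    outcome: "pass" | "fail" | "error";
    score?: number;
    findings: unknown[];                   // numeric, boolean, or labeled findings
  };
}
```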

Findings

In Modelmetry, findings are observable facts, each classified as one of the following:

  • numeric finding – a quantifiable metric

  • boolean finding – a true/false or yes/no observation

  • labeled finding – a categorical or enumerable observation

Most evaluators also emit findings, which can then be tracked over time to spot trends, or used within guardrail checks to ensure a specific metric doesn't exceed a particular threshold (and fail the guardrail check if it does).
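
A finding can therefore be thought of as a small tagged union, sketched below in hypothetical TypeScript; the field names are assumptions.

```typescript
// Hypothetical sketch of the three kinds of findings; names are assumptions.
type Finding =
  | { kind: "numeric"; name: string; value: number }  // e.g. a toxicity score of 0.12
  | { kind: "boolean"; name: string; value: boolean } // e.g. contains_pii: false
  | { kind: "label"; name: string; value: string };   // e.g. sentiment: "positive"

// A guardrail-style condition: fail when a numeric finding exceeds a threshold.
const exceeds = (finding: Finding, threshold: number): boolean =>
  finding.kind === "numeric" && finding.value > threshold;
```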

Guardrail

A guardrail is a set of safety and quality controls. In Modelmetry, it's a group of multiple evaluator instances that form a comprehensive evaluation framework for your payloads. You can send payloads to your guardrail from your codebase so it can check whether they pass or fail your set of evaluations.
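
To make this concrete, the sketch below shows what sending a payload to a guardrail could look like. The function name, the guardrail identifier, and the result shape are assumptions made for illustration, not Modelmetry's actual SDK.

```typescript
// Hypothetical sketch of checking a payload against a guardrail.
// `checkWithGuardrail` and the identifier "grd_customer_support" are
// illustrative assumptions, not Modelmetry's actual SDK.
interface CheckResult {
  outcome: "pass" | "fail" | "error";
  findings: unknown[];
}

// Stand-in for whatever call your Modelmetry integration exposes.
declare function checkWithGuardrail(
  guardrailId: string,
  payload: unknown,
): Promise<CheckResult>;

async function reviewAnswer(question: string, answer: string): Promise<CheckResult> {
  return checkWithGuardrail("grd_customer_support", {
    input: [{ role: "user", content: question }],
    output: answer,
    options: { model: "gpt-4o", temperature: 0.2 },
  });
}
```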

Check

A check is the record created when your codebase sends data to a guardrail for verification. It contains an outcome (pass, fail, or error) and all the findings generated by the guardrail's evaluators. It also embeds an entry for each evaluation performed by each instance.

Outcome

The outcome of a guardrail check can be pass, fail, or error. You should handle all possible outcomes in your codebase when checking a payload. We generally recommend continuing the pipeline as normal if it's a pass or error, and focusing on handling fail outcomes (e.g., sending an email alert to a team lead, interrupting the chatbot thread, or even requesting a human to take over).
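
A minimal sketch of handling the three outcomes, assuming a check result with an outcome field; the names and the alerting helper are illustrative assumptions.

```typescript
// Hypothetical outcome handling: keep going on "pass" and "error",
// handle "fail" explicitly. Names are assumptions.
type Outcome = "pass" | "fail" | "error";

// Stand-in for your own alerting or escalation logic.
declare function notifyTeamLead(reason: string): void;

function shouldContinue(outcome: Outcome): boolean {
  switch (outcome) {
    case "pass":
    case "error":
      // Continue the pipeline as normal; an errored check should not block users.
      return true;
    case "fail":
      // e.g. alert a team lead, interrupt the chatbot thread,
      // or request that a human take over.
      notifyTeamLead("guardrail check failed");
      return false;
  }
}
```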

Trace

A trace represents a single request or operation in the context of LLM observability. It captures the operation's overall input and output, along with metadata such as user information, session details, and tags.

A trace serves as a high-level representation of a request, containing multiple observations that log the individual steps or events occurring during that request. Traces help in capturing the full context of execution, such as API calls, prompt details, and other relevant data points.
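
As an illustration, a trace could carry data along these lines. This is a hypothetical shape; the field names are assumptions, not Modelmetry's exact schema.

```typescript
// Hypothetical shape of a trace; field names are assumptions.
interface Trace {
  traceId: string;
  name: string;                       // e.g. "answer-support-question"
  input: unknown;                     // the request's overall input
  output?: unknown;                   // the request's overall output
  metadata?: Record<string, unknown>; // user information, session details, ...
  tags?: string[];
  startedAt: Date;
  endedAt?: Date;
}
```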

Span

A span is a specific type of observation within a trace, representing a duration of work or a particular task. Spans are used to log discrete units of work, such as function calls or interactions within the broader trace. They help you understand the timing, sequencing, and nested structure of operations, making it easier to pinpoint where specific tasks start and end within a trace. Spans can represent tasks like API calls, data processing, or model generation events.

Events

Events log specific actions or incidents during an LLM application's execution flow, helping to provide detailed insight into its behavior. They can be combined with other observations, like spans and generations, to provide a comprehensive view of an application's performance.
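
The sketch below shows, under assumed names, how spans and events could nest within a trace; none of these field names are taken from Modelmetry's actual schema.

```typescript
// Hypothetical shapes showing how spans and events nest within a trace;
// field names are assumptions, not Modelmetry's exact schema.
interface SpanEvent {
  name: string;                 // e.g. "cache-miss" or "user-feedback"
  timestamp: Date;
  attributes?: Record<string, unknown>;
}

interface Span {
  spanId: string;
  traceId: string;              // the trace this span belongs to
  parentSpanId?: string;        // allows spans to nest within other spans
  name: string;                 // e.g. "retrieve-documents" or "generate-answer"
  startedAt: Date;
  endedAt?: Date;
  events?: SpanEvent[];         // actions logged while the span was active
}
```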

Metrics

Metrics are "numeric findings" used to quantify various aspects of the application's performance, such as model usage, quality metrics, token counts, processing times, and costs. These measurements help developers optimize their LLM applications by identifying performance bottlenecks, tracking costs, and monitoring overall efficiency.
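
For instance, the kinds of metrics mentioned above could be recorded as numeric findings like these (illustrative names and values):

```typescript
// Illustrative numeric findings a trace or span might carry;
// the names and values are assumptions.
const metrics = [
  { kind: "numeric", name: "prompt_tokens", value: 412 },
  { kind: "numeric", name: "completion_tokens", value: 188 },
  { kind: "numeric", name: "latency_ms", value: 1430 },
  { kind: "numeric", name: "cost_usd", value: 0.0042 },
];
```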
