YAML Metadata Warning: empty or missing yaml metadata in repo card

Check out the documentation for more information.

Environments

This folder contains installable example environments that showcase common usage patterns in Verifiers. Each module exposes a load_environment(...) function that returns a ready-to-use vf.Environment object.

Quick start

  • Install an environment from this GitHub repo: vf-install math-python --from-repo
  • Evaluate: vf-eval math-python (defaults to gpt-4.1-mini, small sample)

Common usage patterns and examples

SingleTurnEnv (prompt β†’ single response)

  • gsm8k: Classic QA with exact-match reward; toggles ThinkParser vs Parser and format reward.
  • math: Hendrycks MATH dataset with MathRubric reward (using HuggingFace's math-verify scorer).
  • reverse_text: XML formatting with non-binary LCS reward + format reward.
  • gpqa: Multiple-choice; demonstrates optional judge-based secondary scoring via RubricGroup.
  • simpleqa: Judge-graded A/B/C classification using JudgeRubric rewards.
  • summarize_text: Multiple rewards (length/format + similarity) combined in one Rubric.
  • continuation_quality: Completion-style generation (message_type="completion") judged for prose quality with JudgeRubric.
  • mmmu: Multimodal inputs (image + text) packed in chat content; single-turn boxed-answer check.

SingleTurnEnv subclass (custom dataset/scoring wrappers)

  • reasoning_gym_env: Wraps reasoning_gym procedural datasets, converts to HF datasets, uses XMLParser and task-specific scoring.

MultiTurnEnv (custom interaction protocols)

  • doublecheck: Simple follow-up turn ("Are you sure?") with math rewards; minimal is_completed/env_response implementation.
  • sentence_repeater: Multi-turn Q/A over a paragraph; rewards compare assistant messages to expected answers.
  • wordle: Game-style interaction via TextArenaEnv; multiple rewards (correctness, partial credit, few-turn bonus) and XML formatting.

Tool use

  • ToolEnv (native function-calling)

    • tool_test: Validates parallel tool calls and checks exact tool usage via ToolRubric + custom reward.
    • wiki_search: Multi-tool retrieval (search/view/read) with ToolEnv; final judgment combined via RubricGroup with a JudgeRubric.
  • XML tool calling (roll-your-own on MultiTurnEnv)

    • xml_tool_env: Parses <tool>{...}</tool> commands with XMLParser, executes Python functions, and returns <result>...</result> via env_response.
    • xlam_function_calling: Single-turn XML tool-call verification (no execution) that checks called tools match the ground truth list.
    • smolagents_math_tools: Integrates Smolagents Tool objects and a custom parser for tool/answer XML; demonstrates external tool frameworks.

Sandboxes

  • PythonEnv (ipython-style REPL)
    • math_python: Solve math problems using Python in a sandbox environment.

Composition

  • EnvGroup

    • math_group: Groups two SingleTurnEnv tasks (GSM8K + Math) into one environment with shared interface.
  • RubricGroup

    • math_python: ToolRubric (tool adherence) + MathRubric (answer correctness).
    • gpqa: Adds a JudgeRubric alongside base rubric for auxiliary scoring.
    • wiki_search: Merges judge scoring with the tool-use rubric.

Judge-based evaluation (LLM-as-judge)

  • simpleqa: Judge rubric maps graded letters to reward.
  • continuation_quality: Judge rubric extracts <grade> and maps A–F to a continuous score.
  • toxicity_explanation: Judge rubric returns 0–10 normalized score for both classification correctness and explanation quality.
  • self_reward: pattern for SingleTurnEnv with only a JudgeRubric over a dataset that supplies question/answer; intended for online RL where model acts as its own judge.

Parsers and formatting

  • ThinkParser: Used in gsm8k, wiki_search to separate reasoning from final answers.
  • XMLParser: Used in reverse_text, wordle, summarize_text, reasoning_gym_env, xml_tool_env, xlam_function_calling to enforce structured outputs and enable format rewards.
  • Custom parsers: smolagents_math_tools defines a bespoke parser to interoperate with external tool schemas.

Multimodal inputs

  • mmmu: Demonstrates passing images via chat content items with {type: "image_url", image_url: {url: ...}} and standard answer parsing.

What to look at for each pattern

  • Minimal SingleTurnEnv: reverse_text, gsm8k
  • JudgeRubric end-to-end: simpleqa, continuation_quality, toxicity_explanation, self_reward
  • ToolEnv with real tools: wiki_search, math_python
  • Custom MultiTurnEnv: doublecheck, sentence_repeater, wordle
  • XML tools without native function-calling: xml_tool_env, xlam_function_calling
  • Environment and rubric composition: math_group, math_python, gpqa, wiki_search
  • Procedural datasets: reasoning_gym_env
  • Multimodal: mmmu

Running examples

All environments export load_environment(...).

In-line usage:

import verifiers as vf
from openai import AsyncOpenAI
vf_env = vf.load_environment("reverse-text")
results = vf_env.evaluate(client=AsyncOpenAI(), model="gpt-4.1-mini", num_examples=25)

CLI usage:

vf-install reverse-text --from-repo
vf-eval reverse-text -n 50 -r 1

If you are building a new environment, prefer starting from vf-init and consult the top-level README and docs for dataset format, parser/rubric design, and rollout constraints.

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support