Environments

This folder contains installable example environments that showcase common usage patterns in Verifiers. Each module exposes a load_environment(...) function that returns a ready-to-use vf.Environment object.

Quick start

Install an environment from this GitHub repo: vf-install math-python --from-repo
Evaluate: vf-eval math-python (defaults to gpt-4.1-mini, small sample)

Common usage patterns and examples

SingleTurnEnv (prompt → single response)

gsm8k: Classic QA with exact-match reward; toggles ThinkParser vs Parser and format reward.
math: Hendrycks MATH dataset with MathRubric reward (using HuggingFace's math-verify scorer).
reverse_text: XML formatting with non-binary LCS reward + format reward.
gpqa: Multiple-choice; demonstrates optional judge-based secondary scoring via RubricGroup.
simpleqa: Judge-graded A/B/C classification using JudgeRubric rewards.
summarize_text: Multiple rewards (length/format + similarity) combined in one Rubric.
continuation_quality: Completion-style generation (message_type="completion") judged for prose quality with JudgeRubric.
mmmu: Multimodal inputs (image + text) packed in chat content; single-turn boxed-answer check.

SingleTurnEnv subclass (custom dataset/scoring wrappers)

reasoning_gym_env: Wraps reasoning_gym procedural datasets, converts to HF datasets, uses XMLParser and task-specific scoring.

MultiTurnEnv (custom interaction protocols)

doublecheck: Simple follow-up turn ("Are you sure?") with math rewards; minimal is_completed/env_response implementation.
sentence_repeater: Multi-turn Q/A over a paragraph; rewards compare assistant messages to expected answers.
wordle: Game-style interaction via TextArenaEnv; multiple rewards (correctness, partial credit, few-turn bonus) and XML formatting.

Tool use

ToolEnv (native function-calling)
- tool_test: Validates parallel tool calls and checks exact tool usage via ToolRubric + custom reward.
- wiki_search: Multi-tool retrieval (search/view/read) with ToolEnv; final judgment combined via RubricGroup with a JudgeRubric.
XML tool calling (roll-your-own on MultiTurnEnv)
- xml_tool_env: Parses <tool>{...}</tool> commands with XMLParser, executes Python functions, and returns <result>...</result> via env_response.
- xlam_function_calling: Single-turn XML tool-call verification (no execution) that checks called tools match the ground truth list.
- smolagents_math_tools: Integrates Smolagents Tool objects and a custom parser for tool/answer XML; demonstrates external tool frameworks.

Sandboxes

PythonEnv (ipython-style REPL)
- math_python: Solve math problems using Python in a sandbox environment.

Composition

EnvGroup
- math_group: Groups two SingleTurnEnv tasks (GSM8K + Math) into one environment with shared interface.
RubricGroup
- math_python: ToolRubric (tool adherence) + MathRubric (answer correctness).
- gpqa: Adds a JudgeRubric alongside base rubric for auxiliary scoring.
- wiki_search: Merges judge scoring with the tool-use rubric.

Judge-based evaluation (LLM-as-judge)

simpleqa: Judge rubric maps graded letters to reward.
continuation_quality: Judge rubric extracts <grade> and maps A–F to a continuous score.
toxicity_explanation: Judge rubric returns 0–10 normalized score for both classification correctness and explanation quality.
self_reward: pattern for SingleTurnEnv with only a JudgeRubric over a dataset that supplies question/answer; intended for online RL where model acts as its own judge.

Parsers and formatting

ThinkParser: Used in gsm8k, wiki_search to separate reasoning from final answers.
XMLParser: Used in reverse_text, wordle, summarize_text, reasoning_gym_env, xml_tool_env, xlam_function_calling to enforce structured outputs and enable format rewards.
Custom parsers: smolagents_math_tools defines a bespoke parser to interoperate with external tool schemas.

Multimodal inputs

mmmu: Demonstrates passing images via chat content items with {type: "image_url", image_url: {url: ...}} and standard answer parsing.

What to look at for each pattern

Minimal SingleTurnEnv: reverse_text, gsm8k
JudgeRubric end-to-end: simpleqa, continuation_quality, toxicity_explanation, self_reward
ToolEnv with real tools: wiki_search, math_python
Custom MultiTurnEnv: doublecheck, sentence_repeater, wordle
XML tools without native function-calling: xml_tool_env, xlam_function_calling
Environment and rubric composition: math_group, math_python, gpqa, wiki_search
Procedural datasets: reasoning_gym_env
Multimodal: mmmu

Running examples

All environments export load_environment(...).

In-line usage:

import verifiers as vf
from openai import AsyncOpenAI
vf_env = vf.load_environment("reverse-text")
results = vf_env.evaluate(client=AsyncOpenAI(), model="gpt-4.1-mini", num_examples=25)

CLI usage:

vf-install reverse-text --from-repo
vf-eval reverse-text -n 50 -r 1

If you are building a new environment, prefer starting from vf-init and consult the top-level README and docs for dataset format, parser/rubric design, and rollout constraints.

Downloads last month: -; Downloads are not tracked for this model. How to track

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support