# Environments

This folder contains installable example environments that showcase common usage patterns in Verifiers. Each module exposes a `load_environment(...)` function that returns a ready-to-use `vf.Environment` object.
## Quick start

- Install an environment from this GitHub repo: `vf-install math-python --from-repo`
- Evaluate: `vf-eval math-python` (defaults to gpt-4.1-mini, small sample)
## Common usage patterns and examples
### SingleTurnEnv (prompt → single response)

- `gsm8k`: Classic QA with exact-match reward; toggles `ThinkParser` vs `Parser` and format reward.
- `math`: Hendrycks MATH dataset with `MathRubric` reward (using HuggingFace's `math-verify` scorer).
- `reverse_text`: XML formatting with non-binary LCS reward + format reward.
- `gpqa`: Multiple-choice; demonstrates optional judge-based secondary scoring via `RubricGroup`.
- `simpleqa`: Judge-graded A/B/C classification using `JudgeRubric` rewards.
- `summarize_text`: Multiple rewards (length/format + similarity) combined in one `Rubric`.
- `continuation_quality`: Completion-style generation (`message_type="completion"`) judged for prose quality with `JudgeRubric`.
- `mmmu`: Multimodal inputs (image + text) packed in chat content; single-turn boxed-answer check.
### SingleTurnEnv subclass (custom dataset/scoring wrappers)

- `reasoning_gym_env`: Wraps `reasoning_gym` procedural datasets, converts them to HF datasets, and uses `XMLParser` with task-specific scoring.
### MultiTurnEnv (custom interaction protocols)

- `doublecheck`: Simple follow-up turn ("Are you sure?") with math rewards; minimal `is_completed`/`env_response` implementation.
- `sentence_repeater`: Multi-turn Q/A over a paragraph; rewards compare assistant messages to expected answers.
- `wordle`: Game-style interaction via `TextArenaEnv`; multiple rewards (correctness, partial credit, few-turn bonus) and XML formatting.
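The multi-turn contract boils down to two hooks: `is_completed` decides when a rollout stops, and `env_response` produces the environment's next message. A dependency-free sketch of the "Are you sure?" follow-up pattern (a hypothetical class, not the real `MultiTurnEnv` base):

```python
class DoubleCheckSketch:
    """Toy version of a doublecheck-style protocol: one follow-up turn, then stop."""

    FOLLOW_UP = "Are you sure?"

    def is_completed(self, messages: list[dict]) -> bool:
        # Done once the environment has asked its follow-up and the model replied.
        assistant_turns = sum(1 for m in messages if m["role"] == "assistant")
        return assistant_turns >= 2

    def env_response(self, messages: list[dict]) -> dict:
        # After the first assistant answer, press the model to double-check.
        return {"role": "user", "content": self.FOLLOW_UP}

env = DoubleCheckSketch()
history = [
    {"role": "user", "content": "What is 7 * 8?"},
    {"role": "assistant", "content": "56"},
]
print(env.is_completed(history))           # False
history.append(env.env_response(history))  # environment asks "Are you sure?"
history.append({"role": "assistant", "content": "Yes, 56."})
print(env.is_completed(history))           # True
```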
## Tool use

### ToolEnv (native function-calling)

- `tool_test`: Validates parallel tool calls and checks exact tool usage via `ToolRubric` + a custom reward.
- `wiki_search`: Multi-tool retrieval (search/view/read) with `ToolEnv`; final judgment combined via `RubricGroup` with a `JudgeRubric`.
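Native function-calling means tools are described to the model as JSON schemas (the OpenAI chat-completions `tools` format) and calls come back as JSON argument strings. A self-contained sketch of that round trip, with a hypothetical `search` stub:

```python
import json

# Hypothetical tool; the schema below follows the OpenAI chat-completions
# "tools" format that native function-calling environments typically rely on.
def search(query: str, max_results: int = 5) -> list[str]:
    """Search a corpus and return matching snippets (stub)."""
    return [f"result for {query!r}"][:max_results]

search_schema = {
    "type": "function",
    "function": {
        "name": "search",
        "description": "Search a corpus and return matching snippets.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "max_results": {"type": "integer", "default": 5},
            },
            "required": ["query"],
        },
    },
}

# A model's tool call arrives as JSON arguments; dispatch it back to the function:
raw_call = {"name": "search", "arguments": json.dumps({"query": "wiki", "max_results": 1})}
result = search(**json.loads(raw_call["arguments"]))
print(result)  # ["result for 'wiki'"]
```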
### XML tool calling (roll-your-own on MultiTurnEnv)

- `xml_tool_env`: Parses `<tool>{...}</tool>` commands with `XMLParser`, executes Python functions, and returns `<result>...</result>` via `env_response`.
- `xlam_function_calling`: Single-turn XML tool-call verification (no execution) that checks called tools against the ground-truth list.
- `smolagents_math_tools`: Integrates Smolagents `Tool` objects and a custom parser for tool/answer XML; demonstrates external tool frameworks.
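The roll-your-own variant skips native function-calling entirely: the environment parses a `<tool>{...}</tool>` block out of the assistant message, dispatches it, and replies with `<result>...</result>`. A dependency-free sketch (the tool registry and JSON payload shape are illustrative assumptions):

```python
import json
import re

# Hypothetical tool registry; tag names mirror the <tool>/<result> convention above.
TOOLS = {"add": lambda a, b: a + b}

def env_response_sketch(assistant_message: str) -> str:
    """Parse a <tool>{...}</tool> command, run it, and wrap the output in <result>."""
    match = re.search(r"<tool>(.*?)</tool>", assistant_message, re.DOTALL)
    if match is None:
        return "<result>error: no tool call found</result>"
    call = json.loads(match.group(1))
    output = TOOLS[call["name"]](**call["args"])
    return f"<result>{output}</result>"

msg = '<tool>{"name": "add", "args": {"a": 2, "b": 3}}</tool>'
print(env_response_sketch(msg))  # <result>5</result>
```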
## Sandboxes

### PythonEnv (ipython-style REPL)

- `math_python`: Solve math problems using Python in a sandbox environment.
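The key property of a sandboxed REPL is that generated code runs outside the evaluator process, with at least a time limit. A minimal sketch using a subprocess (real sandboxes add stronger isolation such as containers; this only demonstrates the shape):

```python
import subprocess
import sys

def run_python_sandboxed(code: str, timeout: float = 5.0) -> str:
    """Run model-generated code in a separate interpreter process with a timeout.
    Not a real sandbox: it limits time, not filesystem or network access."""
    proc = subprocess.run(
        [sys.executable, "-I", "-c", code],  # -I: isolated mode, ignores user site/env
        capture_output=True, text=True, timeout=timeout,
    )
    return proc.stdout.strip()

print(run_python_sandboxed("print(sum(range(10)))"))  # 45
```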
## Composition

### EnvGroup

- `math_group`: Groups two `SingleTurnEnv` tasks (GSM8K + MATH) into one environment with a shared interface.

### RubricGroup

- `math_python`: `ToolRubric` (tool adherence) + `MathRubric` (answer correctness).
- `gpqa`: Adds a `JudgeRubric` alongside the base rubric for auxiliary scoring.
- `wiki_search`: Merges judge scoring with the tool-use rubric.
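Conceptually, rubric composition reduces to evaluating several independent reward functions on the same rollout and combining their scores. A toy sketch using a weighted sum (the combining scheme and reward functions are illustrative, not the library's API):

```python
def combine_rewards(reward_fns, weights, completion, answer):
    """Weighted sum of independent reward functions (illustrative only)."""
    return sum(w * fn(completion, answer) for fn, w in zip(reward_fns, weights))

def correctness(completion, answer):
    # Crude containment check standing in for a real correctness rubric.
    return 1.0 if answer in completion else 0.0

def brevity(completion, answer):
    # Auxiliary shaping reward: prefer short completions.
    return 1.0 if len(completion) <= 80 else 0.0

score = combine_rewards([correctness, brevity], [0.8, 0.2], "The answer is 42.", "42")
print(score)  # 1.0
```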
## Judge-based evaluation (LLM-as-judge)

- `simpleqa`: Judge rubric maps graded letters to reward.
- `continuation_quality`: Judge rubric extracts `<grade>` and maps A–F to a continuous score.
- `toxicity_explanation`: Judge rubric returns a 0–10 normalized score for both classification correctness and explanation quality.
- `self_reward`: Pattern for a `SingleTurnEnv` with only a `JudgeRubric` over a dataset that supplies `question`/`answer`; intended for online RL where the model acts as its own judge.
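The common thread is mapping a judge model's discrete verdict onto a scalar reward. A sketch of the `<grade>` extraction pattern described for `continuation_quality` (the exact letter-to-score scale is an assumed example):

```python
import re

def grade_to_reward(judge_output: str) -> float:
    """Map a judge's <grade>A..F</grade> verdict onto [0, 1].
    Scale is illustrative; E is skipped as in US-style letter grading."""
    scale = {"A": 1.0, "B": 0.75, "C": 0.5, "D": 0.25, "F": 0.0}
    match = re.search(r"<grade>\s*([A-F])\s*</grade>", judge_output)
    return scale.get(match.group(1), 0.0) if match else 0.0

print(grade_to_reward("Coherent and on-style. <grade>B</grade>"))  # 0.75
```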
## Parsers and formatting

- `ThinkParser`: Used in `gsm8k` and `wiki_search` to separate reasoning from final answers.
- `XMLParser`: Used in `reverse_text`, `wordle`, `summarize_text`, `reasoning_gym_env`, `xml_tool_env`, and `xlam_function_calling` to enforce structured outputs and enable format rewards.
- Custom parsers: `smolagents_math_tools` defines a bespoke parser to interoperate with external tool schemas.
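A think-style parser essentially splits a completion at a `</think>` boundary, keeping the reasoning and the final answer separately. An illustrative sketch (details of the real `ThinkParser` may differ):

```python
def split_think(completion: str) -> tuple[str, str]:
    """Split a completion into (reasoning, final answer) at the </think> boundary.
    Illustrative only; not the library's actual parser."""
    marker = "</think>"
    if marker not in completion:
        return "", completion.strip()
    reasoning, _, final = completion.partition(marker)
    return reasoning.replace("<think>", "", 1).strip(), final.strip()

reasoning, answer = split_think("<think>9 * 6 = 54</think>The answer is 54.")
print(answer)  # The answer is 54.
```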
## Multimodal inputs

- `mmmu`: Demonstrates passing images via chat `content` items with `{type: "image_url", image_url: {url: ...}}` and standard answer parsing.
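That content-list message shape can be built with plain dicts; a short sketch (the URL is a placeholder):

```python
# A multimodal chat message mixing text and an image reference, in the
# content-list shape the mmmu example uses (URL is a placeholder).
question = "What shape is shown in the figure?"
image_url = "https://example.com/figure.png"

message = {
    "role": "user",
    "content": [
        {"type": "text", "text": question},
        {"type": "image_url", "image_url": {"url": image_url}},
    ],
}

# The text parts can be recovered for logging or answer parsing:
text_parts = [part["text"] for part in message["content"] if part["type"] == "text"]
print(text_parts)  # ['What shape is shown in the figure?']
```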
## What to look at for each pattern

- Minimal SingleTurnEnv: `reverse_text`, `gsm8k`
- JudgeRubric end-to-end: `simpleqa`, `continuation_quality`, `toxicity_explanation`, `self_reward`
- ToolEnv with real tools: `wiki_search`, `math_python`
- Custom MultiTurnEnv: `doublecheck`, `sentence_repeater`, `wordle`
- XML tools without native function-calling: `xml_tool_env`, `xlam_function_calling`
- Environment and rubric composition: `math_group`, `math_python`, `gpqa`, `wiki_search`
- Procedural datasets: `reasoning_gym_env`
- Multimodal: `mmmu`
## Running examples

All environments export `load_environment(...)`.

In-line usage:

```python
import verifiers as vf
from openai import AsyncOpenAI

vf_env = vf.load_environment("reverse-text")
results = vf_env.evaluate(client=AsyncOpenAI(), model="gpt-4.1-mini", num_examples=25)
```

CLI usage:

```bash
vf-install reverse-text --from-repo
vf-eval reverse-text -n 50 -r 1
```
If you are building a new environment, prefer starting from `vf-init`, and consult the top-level README and docs for dataset format, parser/rubric design, and rollout constraints.