AVI-Edit: Audio-sync Video Instance Editing with Granularity-Aware Mask Refiner
AVI-Edit is a framework for audio-sync video instance editing. It introduces two components: a granularity-aware mask refiner, which iteratively refines coarse user-provided masks into precise instance-level regions, and a self-feedback audio agent, which curates high-quality audio guidance for fine-grained temporal control.
To set up the environment, follow these steps from the official repository:
git clone https://github.com/suimuc/AVI-Edit-Framework.git
cd AVI-Edit-Framework
conda create -n avi_edit python=3.10
conda activate avi_edit
pip install -r requirements.txt
pip install -e .
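After installation, a quick pre-flight check can catch the most common setup mistakes. The sketch below is not part of the repository: `check_setup` is a hypothetical helper that only confirms the interpreter is Python 3.10 (the version used in the `conda create` command) and that the two scripts named in this README exist under the repository root.

```python
import sys
from pathlib import Path

# Script paths taken from the usage commands in this README.
EXPECTED_SCRIPTS = (
    "scripts/inference.py",
    "scripts/inference_with_edited_audio.py",
)


def check_setup(repo_root=".", version=None):
    """Return a list of problems found; an empty list means the checks passed.

    Hypothetical helper, not shipped with AVI-Edit.
    """
    # Default to the running interpreter's (major, minor) version.
    version = tuple(version or sys.version_info[:2])
    problems = []
    if version[:2] != (3, 10):
        problems.append(f"expected Python 3.10, found {version[0]}.{version[1]}")
    root = Path(repo_root)
    for rel in EXPECTED_SCRIPTS:
        if not (root / rel).is_file():
            problems.append(f"missing script: {rel}")
    return problems


if __name__ == "__main__":
    # Run from the AVI-Edit-Framework directory inside the activated environment.
    for problem in check_setup():
        print("WARNING:", problem)
```

Run from the cloned `AVI-Edit-Framework` directory, the script prints nothing when both checks pass.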
The framework supports inference using either a pre-edited audio track or an automated audio agent.
Use this script when you already have the edited audio:
python scripts/inference_with_edited_audio.py \
--video-path /path/to/input_video.mp4 \
--audio-path /path/to/edited_audio.wav \
--mask-path /path/to/mask.mp4 \
--prompt "Describe the edited scene here." \
--output-dir /path/to/output_dir
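When processing several clips, it can be convenient to drive the command above from Python. The following is a minimal sketch, assuming it is run from the repository root; the function names are hypothetical, and only the script path and flags come from the command above.

```python
import subprocess
import sys
from pathlib import Path


def build_edited_audio_command(video, audio, mask, prompt, output_dir):
    """Assemble the argument list for scripts/inference_with_edited_audio.py."""
    return [
        sys.executable, "scripts/inference_with_edited_audio.py",
        "--video-path", str(video),
        "--audio-path", str(audio),
        "--mask-path", str(mask),
        "--prompt", prompt,
        "--output-dir", str(output_dir),
    ]


def missing_inputs(*paths):
    """Return the given paths that do not exist as regular files."""
    return [str(p) for p in paths if not Path(p).is_file()]


def run_edited_audio_inference(video, audio, mask, prompt, output_dir):
    """Validate the inputs, then launch the inference script (hypothetical wrapper)."""
    missing = missing_inputs(video, audio, mask)
    if missing:
        raise FileNotFoundError(f"missing input files: {missing}")
    cmd = build_edited_audio_command(video, audio, mask, prompt, output_dir)
    # check=True raises CalledProcessError if the script exits with a non-zero status.
    subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when the prompt contains spaces or quotation marks.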
Use this script to generate replacement audio automatically from the video, mask, and edit prompt:
python scripts/inference.py \
--video-path /path/to/input_video.mp4 \
--mask-path /path/to/mask.mp4 \
--prompt "Describe the edited scene here." \
--output-dir /path/to/output_dir \
--dashscope-api-key "<YOUR_QWEN_OR_OPENAI_COMPATIBLE_API_KEY>" \
--eleven-api-key "<YOUR_ELEVENLABS_API_KEY>"
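To keep API keys out of shell history and saved scripts, the command above can be assembled from environment variables. This is a sketch under stated assumptions: the variable names `DASHSCOPE_API_KEY` and `ELEVEN_API_KEY` and the function `build_agent_command` are hypothetical and are not read by the repository itself; only the script path and flags come from the command above.

```python
import os
import sys

# Mapping from the script's flags to assumed (hypothetical) environment variable names.
KEY_FLAGS = (
    ("--dashscope-api-key", "DASHSCOPE_API_KEY"),
    ("--eleven-api-key", "ELEVEN_API_KEY"),
)


def build_agent_command(video, mask, prompt, output_dir, env=None):
    """Assemble the argument list for scripts/inference.py, reading keys from env."""
    env = os.environ if env is None else env
    cmd = [
        sys.executable, "scripts/inference.py",
        "--video-path", str(video),
        "--mask-path", str(mask),
        "--prompt", prompt,
        "--output-dir", str(output_dir),
    ]
    for flag, var in KEY_FLAGS:
        value = env.get(var)
        if not value:
            # Fail early instead of starting a run that cannot reach the APIs.
            raise RuntimeError(f"environment variable {var} is not set")
        cmd += [flag, value]
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root. The keys still appear in the process arguments while the script runs, so this only avoids writing them into files or shell history.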
@article{avi-edit,
title={Audio-sync Video Instance Editing with Granularity-Aware Mask Refiner},
author={Zheng, Haojie and Weng, Shuchen and Liu, Jingqi and Yang, Siqi and Shi, Boxin and Wang, Xinlong},
journal={arXiv preprint arXiv:2512.10571},
year={2025}
}