How to input a video?
According to this model card, Phi-4-MM is able to handle video. However, I can't find how to do it. Can anyone help me?
I don't think videos are supported... at least that's how I understand the model card. Only pictures and audio are possible, but not videos.
@maltoseflower Currently, Phi-4-MM supports multi-image input, so if you convert a video into frames, it will work. But it does not support multi-image + audio + text as input at the same time.
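For example, here is a minimal frame-extraction sketch (assuming OpenCV is installed; the one-frame-per-second sampling rate and the ./extracted_frames output directory are just illustrative choices):

import os
import cv2  # pip install opencv-python

def extract_frames(video_path, out_dir="./extracted_frames", every_n_sec=1.0):
    # Save roughly one frame every `every_n_sec` seconds as PNG files.
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(int(fps * every_n_sec), 1)
    idx = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            cv2.imwrite(os.path.join(out_dir, f"frame_{saved:04d}.png"), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved

extract_frames("my_video.mp4")  # hypothetical input path

Each saved frame can then be referenced as one <|image_i|> placeholder in the prompt.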
So, as of now, it's not possible to directly embed a video, right? But if we want to work with videos, we can extract frames and use them as multiple images instead. This method should work, right?
When I try 4 images as multi-image input, why is the GPU VRAM usage so high, 21 GB+ (tried on an A30)? Is that normal? Here is my code:
import os
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

model_path = "microsoft/Phi-4-multimodal-instruct"

# Load the processor and the model (flash-attention-2 needs a compatible GPU).
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
    _attn_implementation='flash_attention_2',
).cuda()
generation_config = GenerationConfig.from_pretrained(model_path)

# Collect the extracted video frames.
image_dir = "./extracted_frames"
image_files = [
    f for f in os.listdir(image_dir)
    if f.endswith(".png") or f.endswith(".jpg")
]

# Load each frame and build the <|image_1|><|image_2|>... placeholder string.
images = []
placeholder = ""
for i, filename in enumerate(sorted(image_files), start=1):
    image_path = os.path.join(image_dir, filename)
    img = Image.open(image_path)
    images.append(img)
    placeholder += f"<|image_{i}|>"

messages = [
    {
        "role": "user",
        "content": (
            placeholder
            + "Please describe or summarize the content of these images."
        )
    }
]

prompt = processor.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

inputs = processor(prompt, images=images, return_tensors='pt').to('cuda:0')

generation_args = {
    "max_new_tokens": 512,
    "temperature": 0.5,
    "do_sample": True,
}

generate_ids = model.generate(
    **inputs,
    **generation_args,
    generation_config=generation_config,
)
# Strip the prompt tokens and decode only the newly generated part.
generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(
    generate_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)[0]
print(response)
GPU memory usage depends on the number of frames and the resolution of each frame. If you want to reduce GPU memory, you can try reducing the number of frames or their resolution.
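For example, one way to cut the per-frame cost is to downscale each frame before handing it to the processor. A rough sketch (the load_resized helper and the 448-pixel cap are my own illustrative choices, not values from the model card):

from PIL import Image

def load_resized(image_path, max_side=448):
    # Shrink a frame so its longest side is at most `max_side` pixels.
    img = Image.open(image_path).convert("RGB")
    scale = max_side / max(img.size)
    if scale < 1.0:
        img = img.resize((int(img.width * scale), int(img.height * scale)), Image.BILINEAR)
    return img

# In the script above, replace Image.open(image_path) with load_resized(image_path).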
Their model's only video test set was evaluated with 16 frames per video, which leads me to think that it can't handle more than that.
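If you want to stay within that 16-frame budget regardless of video length, you can thin out the extracted frames by sampling them uniformly. A small sketch (sample_uniform is just an illustrative helper; 16 matches the number mentioned above):

def sample_uniform(items, max_frames=16):
    # Keep at most `max_frames` items, spaced evenly across the list.
    if len(items) <= max_frames:
        return items
    step = len(items) / max_frames
    return [items[int(i * step)] for i in range(max_frames)]

# e.g. image_files = sample_uniform(sorted(image_files), max_frames=16)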