How to input a video?
According to this model card, Phi-4-MM is able to handle video. However, I can't find how to do it. Can anyone help me?
I don't think videos are supported... at least that's how I understand the model card. Only pictures and audio are possible, but not videos.
@maltoseflower Currently, Phi-4-MM supports multi-image input, so if you convert a video into frames, it will work. But it does not support multi-image + audio + text as input at the same time.
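For example, here is a minimal frame-extraction sketch (assuming OpenCV is installed; the one-frame-per-second sampling rate and the ./extracted_frames output directory are just illustrative choices):

import os
import cv2  # pip install opencv-python

def extract_frames(video_path, out_dir="./extracted_frames", every_n_sec=1.0):
    # Save roughly one frame every `every_n_sec` seconds as PNG files.
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(int(fps * every_n_sec), 1)
    idx = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            cv2.imwrite(os.path.join(out_dir, f"frame_{saved:04d}.png"), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved

extract_frames("my_video.mp4")  # hypothetical input path

Each saved frame can then be referenced as one <|image_i|> placeholder in the prompt.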
So, as of now, it's not possible to directly embed a video, right? But if we want to work with videos, we can extract frames and use them as multiple images instead. This method should work, right?
When I try 4 images as multi-image input, why is the GPU VRAM usage so high, 21 GB+ (tried on an A30)? Is that normal? Here is my code:
import os
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

model_path = "microsoft/Phi-4-multimodal-instruct"

# Load the processor and the model (flash-attention-2 needs a compatible GPU).
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
    _attn_implementation='flash_attention_2',
).cuda()
generation_config = GenerationConfig.from_pretrained(model_path)

# Collect the extracted video frames.
image_dir = "./extracted_frames"
image_files = [
    f for f in os.listdir(image_dir)
    if f.endswith(".png") or f.endswith(".jpg")
]

# Load each frame and build the <|image_1|><|image_2|>... placeholder string.
images = []
placeholder = ""
for i, filename in enumerate(sorted(image_files), start=1):
    image_path = os.path.join(image_dir, filename)
    img = Image.open(image_path)
    images.append(img)
    placeholder += f"<|image_{i}|>"

messages = [
    {
        "role": "user",
        "content": (
            placeholder
            + "Please describe or summarize the content of these images."
        )
    }
]

prompt = processor.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

inputs = processor(prompt, images=images, return_tensors='pt').to('cuda:0')

generation_args = {
    "max_new_tokens": 512,
    "temperature": 0.5,
    "do_sample": True,
}

generate_ids = model.generate(
    **inputs,
    **generation_args,
    generation_config=generation_config,
)
# Strip the prompt tokens and decode only the newly generated part.
generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(
    generate_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)[0]
print(response)
GPU memory usage depends on the number of frames and the resolution of each frame. If you want to reduce GPU memory, you can try reducing the number of frames or their resolution.
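For example, one way to cut the per-frame cost is to downscale each frame before handing it to the processor. A rough sketch (the load_resized helper and the 448-pixel cap are my own illustrative choices, not values from the model card):

from PIL import Image

def load_resized(image_path, max_side=448):
    # Shrink a frame so its longest side is at most `max_side` pixels.
    img = Image.open(image_path).convert("RGB")
    scale = max_side / max(img.size)
    if scale < 1.0:
        img = img.resize((int(img.width * scale), int(img.height * scale)), Image.BILINEAR)
    return img

# In the script above, replace Image.open(image_path) with load_resized(image_path).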
Their model's only video test set was evaluated with 16 frames per video, which leads me to think that it can't handle more than that.
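If you want to stay within that 16-frame budget regardless of video length, you can thin out the extracted frames by sampling them uniformly. A small sketch (sample_uniform is just an illustrative helper; 16 matches the number mentioned above):

def sample_uniform(items, max_frames=16):
    # Keep at most `max_frames` items, spaced evenly across the list.
    if len(items) <= max_frames:
        return items
    step = len(items) / max_frames
    return [items[int(i * step)] for i in range(max_frames)]

# e.g. image_files = sample_uniform(sorted(image_files), max_frames=16)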