Skip to content

Instantly share code, notes, and snippets.

@c0m4r
Last active April 6, 2026 15:16
Show Gist options
  • Select an option

  • Save c0m4r/e5f982e475b358351b6cd4cbac09b8fc to your computer and use it in GitHub Desktop.

Select an option

Save c0m4r/e5f982e475b358351b6cd4cbac09b8fc to your computer and use it in GitHub Desktop.
Video input prompt in Gemma4

Video input prompt in Gemma4

https://huggingface.co/google/gemma-4-E4B

Setup

wget https://upload.wikimedia.org/wikipedia/commons/7/78/%22Arya%22_Cat_plays_with_Acalypha_indica-Pilangsari-2019.webm?download
python3 -m venv .venv
source .venv/bin/activate
pip install -U pip
pip install -U transformers torchcodec torch torchvision librosa accelerate
# or last known working versions
# pip install transformers==5.5.0 torchcodec==0.11.0 torch==2.11.0 torchcodec==0.11.0 torchvision==0.26.0 torchvision==0.26.0 librosa==0.11.0 accelerate==1.13.0
# Split between GPU VRAM + system RAM automatically
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
python3 video.py | sed 's/\\n/\n/g;'

Sample output

python3 video.py | sed 's/\\n/\n/g;'
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
Loading weights: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2130/2130 [00:00<00:00, 2794.83it/s]
Some parameters are on the meta device because they were offloaded to the cpu.
{'role': 'assistant', 'content': 'This video captures a sequence of shots featuring an **orange and white cat** interacting with some **green leaves** on a **concrete or stone surface**.

Here is a detailed description of the progression:

* **Beginning (00:00 - 00:01):** The cat is initially visible, looking down at the ground.
* **Interaction (00:01 - 00:05):** The cat is clearly seen sniffing, nibbling, or eating the small bunch of green leaves placed on the concrete next to it.
* **Continued Eating (00:05 - 00:17):** The video continues to show the cat persistently engaging with the leaves. It moves from sniffing to actively eating, sometimes bending low and stretching out its paws as it consumes the foliage.
* **Relaxing/Lying Down (00:17 - 00:37):** As the video progresses, the cat transitions from actively eating to relaxing. It moves from a crouched position to lying down on the concrete, sometimes with the leaves still near it or having finished eating. It spends several shots stretched out, resting in various positions across the frame.

**Overall impression:** The video is a gentle, short observation of a domestic cat enjoying a snack of fresh greens on a patio or outdoor concrete area.'}
from transformers import AutoProcessor, AutoModelForMultimodalLM
MODEL_ID = "google/gemma-4-E4B-it"
# Load model
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForMultimodalLM.from_pretrained(
MODEL_ID,
dtype="auto",
device_map="auto"
)
# Prompt - add video before text
messages = [
{
'role': 'user',
'content': [
{"type": "video", "video": "_Arya__Cat_plays_with_Acalypha_indica-Pilangsari-2019.webm"},
{'type': 'text', 'text': 'Describe this video.'}
]
}
]
# Process input
inputs = processor.apply_chat_template(
messages,
tokenize=True,
return_dict=True,
return_tensors="pt",
add_generation_prompt=True,
).to(model.device)
input_len = inputs["input_ids"].shape[-1]
# Generate output
outputs = model.generate(**inputs, max_new_tokens=512)
response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)
# Parse output
print(processor.parse_response(response))
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment