google-gemini-media

Use the Gemini API (Nano Banana image generation, Veo video, Gemini TTS speech, and audio understanding) to deliver end-to-end multimodal media workflows and code templates for "generation + understanding".
Install via ClawdBot CLI:
clawdbot install Xsir0/google-gemini-media

This Skill consolidates six Gemini API capabilities into reusable workflows and implementation templates:
Convention: this Skill follows the official Google Gen AI SDK (Node.js/REST) as its primary reference; currently only Node.js/REST examples are provided. If your project already wraps another language or framework, map this Skill's request structure, model selection, and I/O spec onto your wrapper layer.
1) Do you need to produce images?
2) Do you need to understand images?
3) Do you need to produce video?
4) Do you need to understand video?
5) Do you need to read text aloud?
6) Do you need to understand audio?
npm install @google/genai
REST: curl; if you need to parse image Base64, install jq (optional).
Authentication: set the GEMINI_API_KEY environment variable and pass it in the x-goog-api-key: $GEMINI_API_KEY header.
Two ways to pass media:
Inline (embedded bytes/Base64)
Files API (upload then reference)
1. files.upload(...) (SDK) or POST /upload/v1beta/files (REST resumable)
2. Use file_data / file_uri in generateContent
Engineering suggestion: implement ensure_file_uri() so that when a file exceeds a threshold (for example, warn at 10-15MB) or is reused across requests, you automatically route it through the Files API.
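The routing rule above can be sketched as a small helper. chooseTransport and ensureFilePart are hypothetical names, and the 10MB limit below is an assumed threshold, not an official one:

```javascript
import * as fs from "node:fs";

// Assumed threshold; tune to your actual payload limits.
const INLINE_LIMIT_BYTES = 10 * 1024 * 1024;

// Pure decision rule: large or reused files go through the Files API.
function chooseTransport(sizeBytes, willReuse = false) {
  if (willReuse) return "files_api"; // a stable file URI avoids re-sending bytes
  return sizeBytes > INLINE_LIMIT_BYTES ? "files_api" : "inline";
}

// Build a content part, uploading via the Files API when routed there.
async function ensureFilePart(ai, path, mimeType, willReuse = false) {
  const size = fs.statSync(path).size;
  if (chooseTransport(size, willReuse) === "inline") {
    return { inlineData: { mimeType, data: fs.readFileSync(path).toString("base64") } };
  }
  const uploaded = await ai.files.upload({ file: path });
  return { fileData: { mimeType: uploaded.mimeType, fileUri: uploaded.uri } };
}
```

The decision is kept in a separate pure function so the threshold logic can be unit-tested without touching the network.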
Image output: returned as inline_data (Base64) in the response parts; decode the Base64 and save it as PNG/JPG (the Python SDK also offers part.as_image()).
Audio output (TTS): raw PCM; save as .pcm or wrap it into a .wav container (commonly 24kHz, 16-bit, mono).
Important: model names, versions, limits, and quotas can change over time. Verify against the official docs before use. Last updated: 2026-01-22.
Default model choices:
gemini-2.5-flash-image (Nano Banana) for image generation and editing.
gemini-3-flash-preview for image, video, and audio understanding (choose stronger models as needed for quality/cost).
veo-3.1-generate-preview (generates 8-second videos and can natively generate audio).
gemini-2.5-flash-preview-tts (native TTS, currently in preview).

Image generation: SDK (Node.js) minimal template
import { GoogleGenAI } from "@google/genai";
import * as fs from "node:fs";
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
const response = await ai.models.generateContent({
  model: "gemini-2.5-flash-image",
  contents:
    "Create a picture of a nano banana dish in a fancy restaurant with a Gemini theme",
});
const parts = response.candidates?.[0]?.content?.parts ?? [];
for (const part of parts) {
  if (part.text) console.log(part.text);
  if (part.inlineData?.data) {
    fs.writeFileSync("out.png", Buffer.from(part.inlineData.data, "base64"));
  }
}
REST (with imageConfig) minimal template
curl -s -X POST "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash-image:generateContent" \
  -H "x-goog-api-key: $GEMINI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "contents":[{"parts":[{"text":"Create a picture of a nano banana dish in a fancy restaurant with a Gemini theme"}]}],
    "generationConfig": {"imageConfig": {"aspectRatio":"16:9"}}
  }'
REST image parsing (Base64 decode)
curl -s -X POST "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash-image:generateContent" \
-H "x-goog-api-key: $GEMINI_API_KEY" \
-H "Content-Type: application/json" \
-d '{"contents":[{"parts":[{"text":"A minimal studio product shot of a nano banana"}]}]}' \
| jq -r '.candidates[0].content.parts[] | select(.inlineData) | .inlineData.data' \
| base64 --decode > out.png
# macOS can use: base64 -D > out.png
Use case: given an image, add/remove/modify elements, change style, color grading, etc.
SDK (Node.js) minimal template
import { GoogleGenAI } from "@google/genai";
import * as fs from "node:fs";
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
const prompt =
  "Add a nano banana on the table, keep lighting consistent, cinematic tone.";
const imageBase64 = fs.readFileSync("input.png").toString("base64");
const response = await ai.models.generateContent({
  model: "gemini-2.5-flash-image",
  contents: [
    { text: prompt },
    { inlineData: { mimeType: "image/png", data: imageBase64 } },
  ],
});
const parts = response.candidates?.[0]?.content?.parts ?? [];
for (const part of parts) {
  if (part.inlineData?.data) {
    fs.writeFileSync("edited.png", Buffer.from(part.inlineData.data, "base64"));
  }
}
Best practice: use chat for continuous iteration (for example: generate first, then "only edit a specific region/element", then "make variants in the same style").
To output mixed "text + image" results, set responseModalities (response_modalities in the Python SDK) to ["TEXT", "IMAGE"] in the request config.
You can set these in generationConfig.imageConfig (REST) or the SDK config:
aspectRatio: e.g. 16:9, 1:1.
imageSize: e.g. 2K, 4K (higher resolution is usually slower/more expensive, and model support can vary).

Image understanding: SDK (Node.js) minimal template (inline bytes)
import { GoogleGenAI } from "@google/genai";
import * as fs from "node:fs";
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
const imageBase64 = fs.readFileSync("image.jpg").toString("base64");
const response = await ai.models.generateContent({
  model: "gemini-3-flash-preview",
  contents: [
    { inlineData: { mimeType: "image/jpeg", data: imageBase64 } },
    { text: "Caption this image, and list any visible brands." },
  ],
});
console.log(response.text);
SDK (Node.js) template via the Files API
import { GoogleGenAI, createPartFromUri, createUserContent } from "@google/genai";
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
const uploaded = await ai.files.upload({ file: "image.jpg" });
const response = await ai.models.generateContent({
  model: "gemini-3-flash-preview",
  contents: createUserContent([
    createPartFromUri(uploaded.uri, uploaded.mimeType),
    "Caption this image.",
  ]),
});
console.log(response.text);
Append multiple images as multiple Part entries in the same contents; you can mix uploaded references and inline bytes.
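As a sketch, that mixed-parts request body can be assembled by a small helper. buildMixedParts is a hypothetical name; the part shapes follow the SDK examples above:

```javascript
// Assemble one user turn that mixes a Files API reference, inline bytes, and text.
function buildMixedParts(fileUri, fileMime, inlineBase64, inlineMime, question) {
  return [
    { fileData: { fileUri, mimeType: fileMime } },                // uploaded reference
    { inlineData: { mimeType: inlineMime, data: inlineBase64 } }, // inline bytes
    { text: question },
  ];
}

// Usage sketch (variables as in the templates above):
// const response = await ai.models.generateContent({
//   model: "gemini-3-flash-preview",
//   contents: [{ role: "user", parts: buildMixedParts(uploaded.uri, uploaded.mimeType, b64, "image/png", "Compare these images.") }],
// });
```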
Video generation (Veo): SDK (Node.js) minimal template
import { GoogleGenAI } from "@google/genai";
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
const prompt =
  "A cinematic shot of a cat astronaut walking on the moon. Include subtle wind ambience.";
let operation = await ai.models.generateVideos({
  model: "veo-3.1-generate-preview",
  prompt,
  config: { resolution: "1080p" },
});
while (!operation.done) {
  await new Promise((resolve) => setTimeout(resolve, 10_000));
  operation = await ai.operations.getVideosOperation({ operation });
}
const video = operation.response?.generatedVideos?.[0]?.video;
if (!video) throw new Error("No video returned");
await ai.files.download({ file: video, downloadPath: "out.mp4" });
Key point: Veo REST uses :predictLongRunning to return an operation name, then poll GET /v1beta/{operation_name}; once done, download from the video URI in the response.
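A minimal REST sketch of that flow follows. The endpoint paths match the description above; the jq paths into the operation response are assumptions to verify against the official docs:

```shell
# 1) Start the long-running generation; the response contains an operation "name".
OP_NAME=$(curl -s -X POST \
  "https://generativelanguage.googleapis.com/v1beta/models/veo-3.1-generate-preview:predictLongRunning" \
  -H "x-goog-api-key: $GEMINI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"instances":[{"prompt":"A cinematic shot of a cat astronaut walking on the moon"}]}' \
  | jq -r '.name')

# 2) Poll the operation until "done" is true.
curl -s "https://generativelanguage.googleapis.com/v1beta/$OP_NAME" \
  -H "x-goog-api-key: $GEMINI_API_KEY" | jq '.done'

# 3) When done, download from the video URI inside the operation response
#    (the exact response field layout is an assumption; check the docs).
```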
Veo config options:
aspectRatio: "16:9" or "9:16"
resolution: "720p" | "1080p" | "4k" (higher resolutions are usually slower/more expensive)

Polling fallback (with timeout/backoff) pseudocode
const deadline = Date.now() + 300_000; // 5 min
let sleepMs = 2000;
while (!operation.done && Date.now() < deadline) {
  await new Promise((resolve) => setTimeout(resolve, sleepMs));
  sleepMs = Math.min(Math.floor(sleepMs * 1.5), 15_000);
  operation = await ai.operations.getVideosOperation({ operation });
}
if (!operation.done) throw new Error("video generation timed out");
Video understanding: SDK (Node.js) minimal template
import { GoogleGenAI, createPartFromUri, createUserContent } from "@google/genai";
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
const uploaded = await ai.files.upload({ file: "sample.mp4" });
const response = await ai.models.generateContent({
  model: "gemini-3-flash-preview",
  contents: createUserContent([
    createPartFromUri(uploaded.uri, uploaded.mimeType),
    "Summarize this video. Provide timestamps for key events.",
  ]),
});
console.log(response.text);
Text-to-speech (TTS): SDK (Node.js) minimal template
import { GoogleGenAI } from "@google/genai";
import * as fs from "node:fs";
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
const response = await ai.models.generateContent({
  model: "gemini-2.5-flash-preview-tts",
  contents: [{ parts: [{ text: "Say cheerfully: Have a wonderful day!" }] }],
  config: {
    responseModalities: ["AUDIO"],
    speechConfig: {
      voiceConfig: {
        prebuiltVoiceConfig: { voiceName: "Kore" },
      },
    },
  },
});
const data =
  response.candidates?.[0]?.content?.parts?.[0]?.inlineData?.data ?? "";
if (!data) throw new Error("No audio returned");
fs.writeFileSync("out.pcm", Buffer.from(data, "base64"));
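Since the TTS output is raw PCM, a small helper can wrap it into the .wav container mentioned earlier. pcmToWav is a hypothetical helper; the 24kHz/16-bit/mono defaults match the commonly returned format:

```javascript
// Wrap raw 16-bit PCM into a minimal 44-byte-header WAV container.
function pcmToWav(pcm, sampleRate = 24000, channels = 1, bitsPerSample = 16) {
  const blockAlign = (channels * bitsPerSample) / 8;
  const byteRate = sampleRate * blockAlign;
  const header = Buffer.alloc(44);
  header.write("RIFF", 0);
  header.writeUInt32LE(36 + pcm.length, 4); // RIFF chunk size
  header.write("WAVE", 8);
  header.write("fmt ", 12);
  header.writeUInt32LE(16, 16);             // fmt chunk size for PCM
  header.writeUInt16LE(1, 20);              // audio format: linear PCM
  header.writeUInt16LE(channels, 22);
  header.writeUInt32LE(sampleRate, 24);
  header.writeUInt32LE(byteRate, 28);
  header.writeUInt16LE(blockAlign, 32);
  header.writeUInt16LE(bitsPerSample, 34);
  header.write("data", 36);
  header.writeUInt32LE(pcm.length, 40);     // data chunk size
  return Buffer.concat([header, pcm]);
}
```

Usage with the template above: fs.writeFileSync("out.wav", pcmToWav(Buffer.from(data, "base64"))).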
Requirements and notes:
For multi-speaker output, use multiSpeakerVoiceConfig instead of voiceConfig.
voiceName supports 30 prebuilt voices (for example Zephyr, Puck, Charon, Kore).
Provide controllable directions for style, pace, accent, etc., but avoid over-constraining the prompt.
Audio understanding: SDK (Node.js) minimal template
import { GoogleGenAI, createPartFromUri, createUserContent } from "@google/genai";
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
const uploaded = await ai.files.upload({ file: "sample.mp3" });
const response = await ai.models.generateContent({
  model: "gemini-3-flash-preview",
  contents: createUserContent([
    "Describe this audio clip.",
    createPartFromUri(uploaded.uri, uploaded.mimeType),
  ]),
});
console.log(response.text);
Workflow: product imagery with self-check
1) Generate product images with Nano Banana (require negative space, consistent lighting).
2) Use image understanding for self-check: verify text clarity, brand spelling, and unsafe elements.
3) If not satisfied, feed the generated image into text+image editing and iterate.
Workflow: short video with narration
1) Generate an 8-second shot with Veo (include dialogue or SFX).
2) Download and save (respect retention window).
3) Upload video to video understanding to produce a storyboard + timestamps + narration copy (then feed to TTS).
Workflow: meeting audio to broadcast summary
1) Upload meeting audio and transcribe the full content.
2) Transcribe or summarize specific time ranges.
3) Use TTS to generate a "broadcast" version of the summary.
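For step 2 of the workflow above, the audio models accept MM:SS timestamps in the prompt; a tiny formatter keeps them consistent. mmss and rangePrompt are hypothetical helpers, not SDK functions:

```javascript
// Format whole seconds as MM:SS for use in audio prompts.
function mmss(totalSeconds) {
  const m = Math.floor(totalSeconds / 60);
  const s = totalSeconds % 60;
  return String(m).padStart(2, "0") + ":" + String(s).padStart(2, "0");
}

// Build a prompt targeting a specific time range of the uploaded audio.
function rangePrompt(startSec, endSec) {
  return `Transcribe the audio from ${mmss(startSec)} to ${mmss(endSec)} with speaker labels.`;
}

// Usage sketch: pass rangePrompt(150, 209) alongside the uploaded file part,
// as in the audio-understanding template above.
```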
Generated Mar 1, 2026
Example use cases:
Generate high-quality images of products for online stores, such as custom-designed items or food dishes, to enhance listings and marketing materials. This reduces the need for expensive photoshoots and allows rapid iteration on visual concepts.
Create narrated videos and images for e-learning platforms, explaining complex topics with AI-generated visuals and speech. This automates content production for courses, tutorials, and interactive learning modules.
Produce short promotional videos and images for social media campaigns, using text-to-video and image generation to quickly create engaging content. This streamlines ad creation and allows for personalized messaging at scale.
Analyze customer-uploaded images, videos, or audio to provide automated support, such as identifying product issues or transcribing support calls. This improves response times and reduces manual effort in helpdesk operations.
Edit existing images or extend videos for film, television, or digital media projects, using AI to enhance or modify visual content efficiently. This accelerates post-production workflows and reduces costs for studios.
Monetization models:
Offer a cloud-based platform where users pay a monthly fee to access AI-powered media generation and understanding tools via API. This provides recurring revenue and scales with customer usage across various industries.
Charge customers based on the number of API calls or media processing tasks, such as per image generated or minute of video analyzed. This model appeals to businesses with variable needs and allows low entry costs.
License the skill package to other companies for integration into their own products, such as marketing tools or content management systems, with custom branding. This generates upfront licensing fees and ongoing support contracts.
💬 Integration Tip
Implement a file size threshold (e.g., 10MB) to automatically switch between inline and Files API modes for optimal performance and compliance with request limits.