ai-multimodal

Name: ai-multimodal
Rating: 5 (713 reviews)
Author: mrgoonie

Process and generate multimedia content using the Google Gemini API. Capabilities include analyzing audio files (transcription with timestamps, summarization, speech understanding, music/sound analysis; up to 9.5 hours), understanding images (captioning, object detection, OCR, visual Q&A, segmentation), processing videos (scene detection, Q&A, temporal analysis, YouTube URLs; up to 6 hours), extracting content from documents (PDF tables, forms, charts, diagrams, multi-page files), and generating images (text-to-image, editing, composition, refinement). Use it when working with audio/video files, analyzing images or screenshots, processing PDF documents, extracting structured data from media, creating images from text prompts, or implementing multimodal AI features. Supports multiple models (Gemini 2.5/2.0) with context windows of up to 2M tokens.

TL;DR

Process and generate multimedia content using the Google Gemini API. Supports audio, image, video, and document analysis, plus image generation, across multiple models and formats.

What Is It

Analyze audio, images, videos, and documents with AI
Generate images from text prompts
Extract structured data from PDFs and media
Supports multimodal tasks across formats

Key Features

Transcribe audio with timestamps (up to 9.5 hours)
Detect objects and segment images (2.5+ models)
Summarize videos and extract tables from PDFs
Supports up to 3,600 images per request
Generate images with controllable style and aspect ratios
Process videos up to 6 hours (low-res)
OCR and visual Q&A for images
Multi-page document understanding

How to Use

1. Set GEMINI_API_KEY via the environment or a .env file
2. Use an appropriate model (e.g., gemini-2.5-flash for general use)
3. Run batch processing with the --task and --model flags
4. Optimize large media files before processing
5. Convert documents to structured output (e.g., JSON, Markdown)
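
Step 1 can be sketched as follows. This is a minimal illustration, not the skill's own code: the helper name load_gemini_api_key and the simple KEY=value parsing of .env are assumptions, and the skill's scripts may resolve the key differently. It checks the process environment first and falls back to a .env file.

```python
import os
from pathlib import Path


def load_gemini_api_key(env_file=".env"):
    """Return GEMINI_API_KEY from the environment, else from a .env file.

    Hypothetical helper: the name and the parsing rules are assumptions.
    """
    key = os.environ.get("GEMINI_API_KEY")
    if key:
        return key

    path = Path(env_file)
    if path.is_file():
        for raw in path.read_text(encoding="utf-8").splitlines():
            line = raw.strip()
            # Skip blank lines, comments, and lines without an assignment.
            if not line or line.startswith("#") or "=" not in line:
                continue
            name, _, value = line.partition("=")
            if name.strip() == "GEMINI_API_KEY":
                value = value.strip().strip("'\"")
                if value:
                    return value

    raise RuntimeError(f"GEMINI_API_KEY not found in the environment or in {env_file}")
```

Checking the environment first means a key exported in the shell overrides whatever is stored in the .env file.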

Guardrails & Gotchas

Max 9.5 hours of audio per request
Max 6 hours of video at low resolution, or 2 hours at default resolution
Max 1,000 pages per PDF
YouTube URLs work only for public videos
No support for text extraction from video
Image generation requires the gemini-2.5-flash-image model
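
The limits above can be checked locally before a request is sent. The sketch below is an illustration only: check_request, its parameters, and the kind labels ("audio", "video", "pdf", "image-generation") are hypothetical names, and the numbers are simply copied from the list above rather than read from the API.

```python
# Limits copied from the Guardrails & Gotchas list.
MAX_AUDIO_HOURS = 9.5
MAX_VIDEO_HOURS_DEFAULT = 2.0
MAX_VIDEO_HOURS_LOW_RES = 6.0
MAX_PDF_PAGES = 1000
IMAGE_GENERATION_MODEL = "gemini-2.5-flash-image"


def check_request(kind, *, hours=0.0, pages=0, low_res=False, model=""):
    """Return a list of problems; an empty list means the request is within limits.

    Hypothetical helper: the `kind` labels are assumptions made for this sketch.
    """
    problems = []
    if kind == "audio":
        if hours > MAX_AUDIO_HOURS:
            problems.append(f"audio is {hours} h, limit is {MAX_AUDIO_HOURS} h")
    elif kind == "video":
        limit = MAX_VIDEO_HOURS_LOW_RES if low_res else MAX_VIDEO_HOURS_DEFAULT
        if hours > limit:
            problems.append(f"video is {hours} h, limit is {limit} h")
    elif kind == "pdf":
        if pages > MAX_PDF_PAGES:
            problems.append(f"PDF has {pages} pages, limit is {MAX_PDF_PAGES}")
    elif kind == "image-generation":
        if model != IMAGE_GENERATION_MODEL:
            problems.append(f"image generation requires {IMAGE_GENERATION_MODEL}")
    else:
        raise ValueError(f"unknown request kind: {kind!r}")
    return problems
```

For example, check_request("video", hours=3) reports a problem at default resolution, while the same call with low_res=True returns an empty list.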

GitHub Stats

Author: @mrgoonie
Repository: mrgoonie/claudekit-skills
Stars: 713
