The VideoDB Dispatch #001
This issue: agents recording videos at scale, our Gemini video-understanding paper, and builder updates from VideoDB.
In this issue
Welcome to the first issue of The VideoDB Dispatch: our notes on video AI, multimodal models, agents, and what we are building at VideoDB.
This first issue is about a shift we are seeing everywhere: agents will record videos at scale. Agents are no longer just reading text or calling tools. They are beginning to research, operate browsers and terminals, capture what happened, and turn that work into videos people can watch.
That is the world we are building VideoDB for.
The main signal: agents will record videos at scale
The next wave of video will not only be humans recording for humans. It will be agents recording videos at scale.
Give an agent a topic, repo, market, product, meeting, or workflow. It researches, opens tools, uses the browser and terminal, gathers evidence, narrates what happened, and returns a video you can watch, share, audit, or index.
That is the idea behind Agentic Streams: agents that replace feeds with generated video briefings. Ask for a topic and the agent researches the web, filters noise, gathers real assets, writes a script, assembles the video, and streams it back.
research → gather assets → script → assemble video → stream
We are also experimenting with RepoFilm: from Slack or X, ask an agent to try any GitHub repo. It clones the repo, uses a real desktop with terminal and browser, runs the project, records the session, and adds first-person commentary over the walkthrough.
The signal is bigger than either launch: agents are becoming video workers. They will make product demos, repo walkthroughs, research briefings, bug reports, market updates, onboarding clips, and “what happened?” recaps without a human opening a recorder.
For video infrastructure, this changes the job. The stack needs to support capture, understanding, editing, narration, evidence, and streaming as one loop — not just upload and playback.
Model watch
A few model launches and research directions we have been watching.
ChatGPT Images 2.0: visual generation gets more controllable
OpenAI’s image update emphasizes stronger typography, multilingual rendering, visual layouts, and more coherent style control.
This matters beyond “make me a nice image.” If generated visuals become more reliable, they become building blocks for interfaces, explainers, storyboards, thumbnails, educational media, and video composition.
The boundary between image generation, UI generation, and video generation keeps getting thinner. For builders, that means more media can be generated just-in-time around the user’s context.
Reference: OpenAI announcement
If you are benchmarking new models for video generation, multimodal retrieval, coding agents, or long-context workflows, we are collecting notes and may share a more detailed benchmark digest in a future issue.
We published a paper on Gemini thinking for video understanding
The paper tackles a practical question in video understanding:
Does giving a vision-language model more “thinking” actually improve scene understanding?
In our benchmark of Gemini vision-language models on scenes extracted from 100 hours of video, we looked at how internal reasoning traces (what we call thought streams) affect the final scene outputs.
A few early takeaways:
- More thinking helps, but gains plateau quickly in this setup.
- Most quality improvement happens in the first few hundred thought tokens; beyond about 700 tokens, extra thinking adds cost with smaller gains.
- Flash Lite 1024 was the quality leader in the benchmark while using fewer thought tokens than Flash Dynamic.
- Tight reasoning budgets increase compression-step hallucination: the final output more often includes details that were not explicitly present in the thought stream.
- Flash and Flash Lite think about many of the same things, with cross-tier thought-stream similarity nearly as high as a single model's run-to-run similarity.
- Flash Lite was more token-efficient, spending less of its budget on process narration and more on scene content.
Production video understanding is not only about model quality. It is also about cost, latency, and trust. If a model uses 5x more thinking tokens but gives only a small quality bump, that changes how you design indexing pipelines.
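The budgeting decision above can be sketched as a simple marginal-gain rule. This is a toy illustration: the quality curve and all numbers are made up for the sketch, not figures from the paper.

```python
# Toy sketch: pick a thinking-token budget where the marginal quality
# gain per extra token stops justifying the cost. The curve and the
# threshold are illustrative, not measurements from our benchmark.

def quality(tokens):
    # Toy diminishing-returns curve: most gains arrive early, then plateau.
    return 1.0 - 0.5 ** (tokens / 350)

def pick_budget(budgets, min_marginal_gain=0.0005):
    """Return the smallest budget at which stepping up to the next
    candidate budget yields less than min_marginal_gain per token."""
    for lo, hi in zip(budgets, budgets[1:]):
        marginal = (quality(hi) - quality(lo)) / (hi - lo)
        if marginal < min_marginal_gain:
            return lo
    return budgets[-1]

budgets = [100, 300, 500, 700, 1000, 2000]
best = pick_budget(budgets)  # -> 700 with this toy curve
```

With these toy numbers, budgets past roughly 700 tokens stop paying for themselves, which mirrors the plateau we saw in the benchmark; in a real pipeline you would fit the curve from your own eval data.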
Read the full paper: Thought streams for video scene understanding
Things we liked
Flipbook: a generative visual internet
Flipbook is a delightful experiment: an infinite visual browser where every page is generated on demand as an image. Click anywhere to explore that part of the image in more depth.
It feels like a preview of a more visual web, especially with their experimental live video stream mode where generated images, interactive browsing, and video streams start blending together.
What’s new in VideoDB
Organization management for VideoDB Console
We shipped Organization Management in VideoDB Console for Pro users.
You can now invite team members into your organization so they can access shared VideoDB assets, chat with those assets in VideoDB Console, and manage API keys together.
Why it helps: video AI projects are rarely solo for long. Teams need shared access to media, indexes, API keys, experiments, and agent workflows. Organization Management makes VideoDB more usable for production teams collaborating on the same media layer.
Build idea: create your own video briefing agent
This week’s build idea: create a personal video briefing agent for any topic you care about.
For example:
“Create a 3-minute video briefing on the top open-source AI launches this week.”
A simple architecture:
- Research: use browser/search tools to collect links, videos, tweets, charts, and screenshots.
- Filter: remove low-signal or duplicate sources.
- Script: generate a short narration with citations.
- Assemble: use VideoDB to compose clips, screenshots, text, voiceover, and music.
- Verify: run scene understanding over the output to check whether visuals match narration.
- Stream: publish the final briefing as a playable stream.
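The six steps above can be sketched as one orchestration loop. Every helper below is a placeholder with an illustrative name, not a real VideoDB, search, or LLM API; swap in real tools where the comments indicate.

```python
# Hypothetical briefing-agent loop following the steps above.
# All function names and data are illustrative placeholders.

def research(topic):
    # Placeholder for browser/search tools collecting raw sources.
    return [
        {"url": "https://example.com/launch-a", "signal": 0.9},
        {"url": "https://example.com/launch-a", "signal": 0.9},  # duplicate
        {"url": "https://example.com/spam", "signal": 0.1},      # low signal
    ]

def filter_sources(sources, min_signal=0.5):
    # Drop low-signal and duplicate sources.
    seen, kept = set(), []
    for s in sources:
        if s["signal"] >= min_signal and s["url"] not in seen:
            seen.add(s["url"])
            kept.append(s)
    return kept

def write_script(topic, sources):
    # Placeholder for LLM-generated narration with citations.
    cites = ", ".join(s["url"] for s in sources)
    return f"{topic}: short narration. Sources: {cites}"

def assemble(script, sources):
    # Placeholder for VideoDB composition (clips, text, voiceover, music).
    return {"script": script, "assets": [s["url"] for s in sources]}

def verify(video):
    # Placeholder for scene understanding: do visuals match the narration?
    return all(url in video["script"] for url in video["assets"])

def stream(video):
    # Placeholder for publishing the briefing as a playable stream.
    return {"status": "published", "video": video}

def briefing_agent(topic):
    sources = filter_sources(research(topic))
    script = write_script(topic, sources)
    video = assemble(script, sources)
    if not verify(video):
        raise RuntimeError("narration does not match gathered assets")
    return stream(video)
```

The verify step is the interesting design choice: running scene understanding over your own output turns the agent into its own reviewer, catching narration that drifted from the evidence before anything is published.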
If you want to try the open-source examples, start here:
npx skills add video-db/skills
export VIDEO_DB_API_KEY=your_key_here
Then give the agent a topic and ask it to produce a video report.
Explore, ship, collaborate
Two updates for builders who want to explore, ship, or collaborate with VideoDB.
VideoDB for Developers
Our developer page is the best place to connect with the VideoDB builder ecosystem, follow upcoming events, and explore ways to collaborate with us.
Explore it here: VideoDB for Developers
Growth Forge
We also launched Growth Forge: a 14-day sprint for five builders to build a growth agent for our agents.
The idea is simple: growth is becoming less like campaigns and more like loops. We want to work with builders who can design, ship, and prove an agentic growth engine.
Apply here: Growth Forge
Upcoming builder mornings
Coffee, laptops, and 90 minutes of building together. No talks, no slides — just a small room of people shipping.