
Introducing Video Support: Computer Vision as Infrastructure
Full video and image understanding, now as infrastructure
For the past year, we have focused on a single, massive problem: turning the world's messy, unstructured documents into structured, reliable data.
We proved that you don't need a human to type data from a PDF invoice, a messy bill of lading, or a handwritten contract. You need a deterministic pipeline that transforms that chaos into a database row.
But the operational reality of our customers isn't static. It moves.
A pallet falling over in a refrigerated truck is data.
A customer grabbing the last energy drink from a shelf is data.
An influencer reviewing a product on TikTok is data.
Until now, capturing that data meant building brittle computer vision scripts, managing heavy GPU clusters, or relying on armies of humans to watch feeds. It was the "Dark Data" of the enterprise.
Today, we are officially launching Video & Audio Support on the bem platform.
Computer Vision as Infrastructure
We aren't launching this to transcribe your Zoom meetings. We are launching this to bring the same rigor we applied to documents to the physical world.
We believe that video shouldn't be a black box. It should be just another input type—like a PDF or a JSON file—that you feed into a deterministic pipeline.
When you treat video as infrastructure, you unlock workflows that were previously impossible to automate.
The Use Cases: From Seed to Fortune 50
This isn't an experiment. We have been deploying these multimodal capabilities with our most advanced customers—ranging from agile startups to Fortune 50 enterprises.
Here is what "Video Infrastructure" looks like in production:
1. Supply Chain: The "Zero-Touch" Incident Report
The Problem: A driver swerves at 4:15 AM. Cargo shifts. Usually, this results in damaged goods showing up days later with no explanation.
The bem Workflow: We ingest the dashcam footage and the driver’s voice note simultaneously.
- Visual Analysis: Detects the sudden shock, identifies the environment ("Refrigerated"), and counts the tipped pallets.
- Audio Context: Transcribes the driver’s report ("Swerved for deer").
- Enrichment: Maps "Unit 42" to the specific driver ID and route in the TMS.
The Output: A fully populated, structured Insurance Claim JSON posted to the ERP before the truck even stops rolling.
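The steps above reduce to merging three structured extractions into one claim record. A minimal sketch of that merge step, where every field name, ID, and value is illustrative rather than bem's actual output format:

```python
# Sketch: assembling a structured insurance-claim record from
# multimodal extractions. All field names and values are illustrative.

def build_claim(visual: dict, audio: dict, tms: dict) -> dict:
    """Merge visual analysis, audio context, and TMS enrichment into one claim."""
    return {
        "incident_type": "cargo_shift",
        "environment": visual["environment"],         # e.g. "refrigerated"
        "tipped_pallets": visual["tipped_pallet_count"],
        "driver_statement": audio["transcript"],
        "driver_id": tms["driver_id"],                # mapped from "Unit 42"
        "route_id": tms["route_id"],
        "occurred_at": visual["timestamp"],
    }

claim = build_claim(
    visual={"environment": "refrigerated", "tipped_pallet_count": 3,
            "timestamp": "2025-01-14T04:15:00Z"},
    audio={"transcript": "Swerved for deer"},
    tms={"driver_id": "D-1042", "route_id": "R-88"},
)
```

The resulting `claim` dict is what would be validated against the target schema and posted to the ERP.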
2. Retail Intelligence: The Retail Agent
The Problem: Retailers lose billions when inventory is in the backroom but not on the shelf.
The bem Workflow:
- Visual Analysis: Detects a gap on the shelf at Row 3, Slot 2.
- Enrichment: Queries the Planogram Database to identify that slot as SKU: YM-BLUE-16 (High Velocity Item). Checks the WMS and sees 45 units in the back.
- Action: Triggers a "Critical Restock" task for the floor staff.
One API for Everything
The most important part of this launch isn't the AI. It's the Architecture.
We didn't build a separate tool for video. We integrated it into the core bem pipeline. You use the exact same API endpoints, the same schemas, and the same webhooks whether you are processing a 50-page PDF contract or a 30-second CCTV clip.
- Async by Default: Process large video files without timeouts.
- Schema-Driven: You define the JSON structure you want; we force the video reality to match it.
- Secure: Enterprise-grade encryption and zero-retention modes for sensitive feeds.
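In practice, "same API" means the request shape stays constant across input types; only the file and MIME type change. A hypothetical sketch of that idea (the payload fields, webhook URL, and schema are assumptions, not bem's documented API):

```python
# Sketch: one schema-driven request shape for any input type.
# Every field name and URL here is hypothetical.

OUTPUT_SCHEMA = {
    "type": "object",
    "properties": {
        "summary": {"type": "string"},
        "events": {"type": "array", "items": {"type": "string"}},
    },
}

def build_request(file_url: str, mime_type: str) -> dict:
    """Same payload whether the input is a PDF contract or a CCTV clip."""
    return {
        "input": {"url": file_url, "mime_type": mime_type},
        "schema": OUTPUT_SCHEMA,  # you define the JSON you want back
        "webhook": "https://example.com/hooks/bem",  # async result delivery
    }

pdf_req = build_request("s3://bucket/contract.pdf", "application/pdf")
video_req = build_request("s3://bucket/dock-cam.mp4", "video/mp4")
```

The two requests differ only in their `input` block; the schema and webhook handling are identical, which is what lets one pipeline serve both documents and video.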
See it built live
The best way to understand this is to see code.
This Thursday at 2:30 PM PST, our CS Engineer Matt Foley and I will host a live engineering session. We will take raw video feeds—from logistics cameras to retail shelves—and build fully functional extraction pipelines in real time.
Stop building brittle scripts. Start building resilient infrastructure.
Want to see it in action?
Talk to our team to walk through how bem can work inside your stack.

