Twelve Labs’ Soyoung Lee: Video Foundation Models Finally Enable ‘Holistic’ Contextual Understanding

RANCHO PALOS VERDES, CALIF. — There’s a lot more to analyzing streaming video than meets the eye. And while all kinds of software promise to reveal which ads are working in real time, genuine understanding of the content itself is where multimodal video foundation models like those from Twelve Labs come in.

“Video is just really complex. There’s sound, there’s visual, there’s language dialogue, there’s also time. And everything in the human world is very nuanced,” Soyoung Lee, co-founder and head of GTM at Twelve Labs, told Beet.TV contributor David Kaplan at the Beet Retreat LA. “You really need to understand all of these different data modalities together and be able to capture it in a holistic way that almost replicates the way that the human mind works.”

Previous approaches relied on transcribing audio or extracting objects from individual frames without understanding how actions unfold over time, limiting contextual accuracy for advertising and content applications.

Video requires native models

Large language model-based approaches that analyze video frames struggle because video differs fundamentally from text data formats, requiring models built from the ground up to handle moving images.

“When we speak or write, every word that we spit out is done intentionally,” Lee said. “For video, it’s very different where not every frame is useful, right? Not every moment or context aggregates into meaning.”

AI must continuously watch videos at scale to identify meaningful seconds, frames, and moments that formulate memory and true context rather than treating every frame equally.

Video embeddings unify metadata

The advertising industry uses text embeddings to unify metadata taxonomies across platforms and stakeholders, and video embeddings now enable similar standardization for brand creative and addressable publisher content.

“You can actually start to unify all of the contextual information across the industry and have the data speak in the same language that becomes semantically accessible for any productization or service offering that can be created downstream,” Lee noted.

These multimodal video embeddings power semantic search, classification, and insight generation across applications.

Episode-level targeting

Rich contextual descriptions at the episode level enable advertisers to buy based on content context rather than on genre categories or on behavioral targeting, which raises privacy concerns.

“For years, contextual advertising has been a hot topic, but in order to really access episode level targeting as opposed to behavioral targeting, there’s always a question of privacy,” Lee said.

Publishers benefit through customized platform experiences, including enhanced content discovery, personalized recommendations based on viewers’ moment-level preferences, and optimized trailers that maintain engagement.

Brand safety is a driving force

Understanding true video context reveals that not all news content carries risk, expanding viable inventory for advertisers who previously avoided entire categories.

“Brand safety is the number one fastest use case that drives adoption of truly understanding context. News is interesting because not all news is risky. If you can actually understand the context of what’s there, there’s actually a lot more inventory that can exist that’s viable,” Lee said.

Publishers also analyze creative performance by cohort to identify commonalities in successful ads, providing actionable insights that weren’t previously possible at scale.

Delivering immediate impact

Publishers see the fastest results from platform experience optimization, which keeps users engaged through better recommendations, enriched discovery algorithms, and trailers generated from finished content.

“The most impactful ones have been how you optimize the platform experience and optimize that experience for your user. That’s everyone’s challenge of keeping the user engaged,” Lee said.

Creative analysis helps advertisers understand cultural elements and performance attributes to optimize future assets, with all use cases powered by the same underlying video understanding technology that analyzes temporal context across multiple modalities.

“The ability for an AI to continuously watch a video and many videos at scale and to be able to understand what are the meaningful seconds, frames, moments that need to be put together in order to formulate a memory and true context has to be a model that’s built from the ground up to tackle video,” Lee said.

Beet.TV