Online video consumption has skyrocketed. A staggering 1.8 billion people globally subscribed to streaming services in 2023 [1], and 92% of internet users worldwide watched online videos every month in 2024 [2]. This growth creates a significant opportunity for advertisers who want to reach their customers with great creative, but ineffective ad placement can disrupt their customers’ viewing experiences.

An important way to deliver a better ad experience is seamless ad integration, which means placing ads at natural breaks in video content to avoid interrupting the narrative flow. Scene change detection technology identifies these natural breaks by analyzing a video’s visual, audio, and textual elements. Applying Google’s AI models such as Gemini to this task offers a win-win for viewers and advertisers:

  • Increased viewer engagement: Seamless ad integration minimizes disruption and enhances the viewing experience.

  • Higher ad revenue: More relevant ads lead to better click-through rates and increased advertiser ROI.

  • Simplified workflows: Google Cloud’s Vertex AI platform streamlines the entire video monetization process, from scene detection to ad placement.

To help you maximize the potential of your ad inventory, we’ll share how Google Cloud’s generative AI revolutionizes scene detection, leading to more effective ad placement, improved reach, higher viewer engagement, and ultimately, increased revenue for publishers.

The challenges of traditional ad break detection 

Traditional ad break detection methods, designed primarily for structured television content with fade-outs and fixed commercial breaks, often struggle to identify ideal ad placement points in today’s diverse video landscape. These methods—including shot boundary detection, motion analysis, audio analysis, and rule-based systems—can miss subtle transitions, misinterpret rapid movement, operate independently of visual context, lack flexibility, and rely on manual tagging. This is where Google’s Gemini models can help.
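To make the limitation concrete, here is a minimal sketch of one traditional technique, histogram-based shot boundary detection. It is an illustration only: frames are modelled as flat lists of grayscale pixel values, and the bin count and threshold are assumed values, not figures from any production system.

```python
from typing import List, Sequence


def histogram(frame: Sequence[int], bins: int = 8) -> List[float]:
    """Normalized grayscale histogram of a frame given as 0-255 pixel values."""
    counts = [0] * bins
    for pixel in frame:
        counts[min(pixel * bins // 256, bins - 1)] += 1
    return [count / len(frame) for count in counts]


def detect_hard_cuts(
    frames: Sequence[Sequence[int]], threshold: float = 0.5, bins: int = 8
) -> List[int]:
    """Return each index i where frame i differs sharply from frame i - 1."""
    if not frames:
        return []
    cuts = []
    previous = histogram(frames[0], bins)
    for index in range(1, len(frames)):
        current = histogram(frames[index], bins)
        # Half the L1 distance between histograms, which lies in [0, 1].
        difference = sum(abs(a - b) for a, b in zip(previous, current)) / 2
        if difference > threshold:
            cuts.append(index)
        previous = current
    return cuts


# Three dark frames followed by three bright frames: an abrupt cut.
hard_cut = [[10] * 10] * 3 + [[240] * 10] * 3
# A dissolve: the share of bright pixels grows a little with every frame.
dissolve = [[240] * k + [10] * (10 - k) for k in range(0, 11, 2)]

print(detect_hard_cuts(hard_cut))
print(detect_hard_cuts(dissolve))
```

The abrupt cut is flagged, while the dissolve produces no detection at all because no single frame-to-frame difference crosses the threshold. This is the kind of subtle transition that purely visual, rule-based methods tend to miss.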

Intelligent scene detection with Google’s Gemini models

Gemini’s multimodal capabilities can analyze video, audio, and text simultaneously, enabling a level of nuanced scene understanding that was previously impossible to achieve efficiently. We can now ask Gemini to interpret the nuances of video content and generate highly granular contextual metadata for each candidate ad break.

Here are some examples of how Gemini identifies ad breaks and provides detailed contextual metadata:

| Ad Break Example | Transition Feeling | Transition Type | Narrative Type | Prior Scene Summary |
| --- | --- | --- | --- | --- |
| Daytime to Evening Dinner | Cheerful, relaxed | Outdoor to indoor | Scene transition from plot to end | A group of friends enjoying dinner at a restaurant. |
| End of Tense Dialogue Scene | Tense, dramatic | Fade-out | Scene of rising conflict | Two characters arguing over a specific issue. |
| Busy Street to Quiet Cafe | Neutral | Hard cut, outdoor to indoor | Scene transition | A character walking along a busy street. |

This enriched metadata allows for precise matching of the right ad to the right user at the right time. For example, the first ad break (Daytime to Evening Dinner), with its “cheerful, relaxed” sentiment, might be ideal for advertisements that resonate with those feelings, such as travel, entertainment, or leisure products, rather than just a product tied to the setting, like cookware. By understanding not just the basic context but also the emotional tone of a scene, Gemini facilitates a new level of contextual advertising that is far more engaging for the viewer.
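As a rough illustration of mood-based matching, the sketch below ranks candidate ads by how many of an ad break's transition feelings they target. The `Ad` class, the ad names, and the target feelings are hypothetical examples; a real ad decisioning system would weigh many more signals.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Ad:
    name: str
    category: str
    target_feelings: FrozenSet[str]  # moods this ad is meant to resonate with


def rank_ads(break_feelings: Iterable[str], ads: Iterable[Ad]) -> List[str]:
    """Return names of ads sharing a feeling with the break, best match first."""
    feelings = {feeling.strip().lower() for feeling in break_feelings}
    scored = [(len(feelings & ad.target_feelings), ad) for ad in ads]
    matching = [(score, ad) for score, ad in scored if score > 0]
    # Highest overlap first; ties are broken alphabetically for stable output.
    matching.sort(key=lambda pair: (-pair[0], pair[1].name))
    return [ad.name for _, ad in matching]


# Hypothetical inventory.
ads = [
    Ad("beach-resort", "travel", frozenset({"cheerful", "relaxed"})),
    Ad("streaming-comedy", "entertainment", frozenset({"cheerful"})),
    Ad("cookware-set", "home", frozenset({"neutral"})),
]

# "Daytime to Evening Dinner" break from the table above.
print(rank_ads(["Cheerful", "relaxed"], ads))
```

With this inventory, the dinner scene's break favors the travel and entertainment ads over cookware, mirroring the example in the text.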

Image 1 – Sample of detected scene change with corresponding metadata from Ep12 Pororo – Pretty, The Great Storyteller

Proof point: The Google Cloud architecture 

Google Cloud, powered by the Gemini 1.5 Pro model, delivers a robust and scalable solution for intelligent ad break detection. The model’s multimodal analysis processes video, audio, and text simultaneously to detect even subtle transitions, enabling seamless ad integration. Its context window of up to 2 million tokens supports comprehensive analysis of long videos across diverse genres with minimal retraining, offering versatility for media providers. This large context window allows the model to analyze approximately 2 hours of video and audio content in a single pass, which significantly reduces processing time and complexity compared to methods that require breaking videos into smaller chunks.
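The two figures above (a 2 million token window and roughly 2 hours of video per pass) imply an average budget of about 278 tokens per second of content. The sketch below uses that implied average for back-of-the-envelope planning. It is an assumption derived from this article's numbers, not an official tokenization rate, and the function names are illustrative.

```python
import math

CONTEXT_WINDOW_TOKENS = 2_000_000  # context window size cited in the text
SINGLE_PASS_SECONDS = 2 * 60 * 60  # "approximately 2 hours" from the text


def estimate_tokens(duration_seconds: int) -> int:
    """Rough token estimate using the average rate implied by the two figures."""
    return math.ceil(duration_seconds * CONTEXT_WINDOW_TOKENS / SINGLE_PASS_SECONDS)


def passes_needed(duration_seconds: int, prompt_tokens: int = 0) -> int:
    """How many requests a video needs if the prompt also consumes tokens."""
    budget = CONTEXT_WINDOW_TOKENS - prompt_tokens
    if budget <= 0:
        raise ValueError("prompt leaves no room for video tokens")
    return max(1, math.ceil(estimate_tokens(duration_seconds) / budget))


print(estimate_tokens(45 * 60))    # a 45-minute episode
print(passes_needed(45 * 60))      # fits in one request
print(passes_needed(3 * 60 * 60))  # a 3-hour film needs splitting
```

Under these assumptions a 45-minute episode uses well under half the window, while a 3-hour film would still have to be split. Note that the instructions sent with the video also count against the budget, so a video right at the 2-hour mark may not fit in one request.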

The architecture ensures high performance and reliability through these key stages:

Image 2 – Architecture diagram for the scene change detection

1. Video Ingestion and Storage (GCS): Videos are ingested and stored in Google Cloud Storage (GCS), a highly scalable and durable object storage service offering various storage classes to optimize cost and performance. GCS ensures high availability and accessibility for processing. Robust security measures, including Identity and Access Management (IAM) roles and fine-grained access controls, are in place.

2. Orchestration and simultaneous processing (Vertex AI pipelines & Gemini): Vertex AI pipelines orchestrate the end-to-end video analysis process, ensuring seamless execution of each stage. Vertex AI manages simultaneous processing of multiple videos using Google Gemini’s multimodal analysis, significantly accelerating the workflow while maintaining scalability. This includes built-in safety filters powered by Gemini, which perform a nuanced contextual analysis of video, audio, and text to discern potentially inappropriate content. The results are returned in JSON format, detailing scene change timestamps, video metadata, and contextual insights.
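A single analysis request of this kind might look like the sketch below. It assumes the Vertex AI SDK for Python (`google-cloud-aiplatform`) is installed and authenticated. The prompt wording, the JSON keys, the `gemini-1.5-pro` model ID, and the default region are illustrative assumptions rather than the pipeline's actual configuration.

```python
import json
from typing import Any, Dict, List

# Hypothetical prompt; the key names mirror the metadata columns shown earlier.
PROMPT = (
    "Identify natural ad break points in this video. Return a JSON array in "
    "which each element has the keys: timestamp, transition_feeling, "
    "transition_type, narrative_type, prior_scene_summary."
)


def detect_ad_breaks(
    gcs_uri: str, project: str, location: str = "us-central1"
) -> List[Dict[str, Any]]:
    """Send one video stored in GCS to Gemini and return the parsed ad breaks."""
    # Imported lazily so the parsing helper below works without the SDK.
    import vertexai
    from vertexai.generative_models import GenerationConfig, GenerativeModel, Part

    vertexai.init(project=project, location=location)
    model = GenerativeModel("gemini-1.5-pro")
    response = model.generate_content(
        [Part.from_uri(gcs_uri, mime_type="video/mp4"), PROMPT],
        generation_config=GenerationConfig(response_mime_type="application/json"),
    )
    return parse_ad_breaks(response.text)


def parse_ad_breaks(raw_json: str) -> List[Dict[str, Any]]:
    """Parse the model's JSON text and keep only object-shaped entries."""
    data = json.loads(raw_json)
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of ad breaks")
    return [item for item in data if isinstance(item, dict)]
```

Keeping the parsing step separate from the API call makes it possible to test the response handling with recorded outputs, without issuing a request.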

Post-processing is then applied to the JSON output to structure the data in a tabular format, ensuring compatibility with downstream storage and analysis tools. This includes:

  • Standardizing timestamps: Ensuring uniform time formats for consistent querying and integration.

  • Metadata mapping: Beyond basic metadata extraction, this stage classifies scenes (or entire video programs) into industry-standard taxonomies, such as the IAB’s, or into the customer’s own custom taxonomies. This allows video content to be organized more granularly by type and makes ad targeting easier.

  • Error handling and data validation: Filtering out incomplete or invalid entries to maintain data quality.
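The three post-processing steps above can be sketched as a single function. The input keys (`timestamp`, `topic`, `prior_scene_summary`) and the `TAXONOMY` mapping are hypothetical placeholders, not real IAB identifiers or the actual schema of the pipeline's JSON output.

```python
from typing import Any, Dict, List, Optional

# Placeholder mapping standing in for an IAB or customer-defined taxonomy.
TAXONOMY = {"dining": "Food & Drink", "argument": "Drama"}


def to_seconds(timestamp: Any) -> Optional[int]:
    """Accept 'HH:MM:SS', 'MM:SS', 'SS', or a non-negative number of seconds."""
    if isinstance(timestamp, bool):
        return None
    if isinstance(timestamp, (int, float)):
        # The comparison also rejects NaN and infinity.
        return int(timestamp) if 0 <= timestamp < float("inf") else None
    if not isinstance(timestamp, str):
        return None
    parts = timestamp.strip().split(":")
    if not 1 <= len(parts) <= 3 or not all(part.isdigit() for part in parts):
        return None
    seconds = 0
    for part in parts:
        seconds = seconds * 60 + int(part)
    return seconds


def format_hms(total_seconds: int) -> str:
    """Render seconds as a zero-padded HH:MM:SS string."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def post_process(records: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Standardize timestamps, map topics to a taxonomy, and drop bad rows."""
    rows = []
    for record in records:
        seconds = to_seconds(record.get("timestamp"))
        summary = record.get("prior_scene_summary")
        if seconds is None or not summary:
            continue  # data validation: skip incomplete or invalid entries
        topic = str(record.get("topic", "")).strip().lower()
        rows.append(
            {
                "timestamp": format_hms(seconds),
                "offset_seconds": seconds,
                "category": TAXONOMY.get(topic, "Uncategorized"),
                "prior_scene_summary": summary,
            }
        )
    return rows
```

Each returned dictionary has the same keys, so the result maps directly onto the rows of a table.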

3. Structured data storage and enrichment (BigQuery): The structured data resulting from Gemini’s scene change detection analysis, including timestamps, metadata, and contextual insights, is stored in BigQuery. BigQuery ML can leverage this integrated data to build predictive models for ad placement optimization. For example, you can schedule a 15-second action-themed ad during a scene change in an action sequence, targeting viewers who frequently watch action movies in the evening.
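Once the rows are in BigQuery, candidate ad slots for a given content category could be retrieved with a parameterized query such as the sketch below. It assumes the `google-cloud-bigquery` client library and default credentials. The table name and column names are hypothetical.

```python
from typing import Any, Dict, List

# Hypothetical table; replace with the dataset the pipeline writes to.
TABLE = "my-project.video_ads.scene_changes"

CANDIDATE_SLOTS_SQL = f"""
SELECT video_id, offset_seconds, transition_feeling
FROM `{TABLE}`
WHERE category = @category
ORDER BY video_id, offset_seconds
"""


def fetch_candidate_slots(category: str) -> List[Dict[str, Any]]:
    """Return scene changes in the given category, e.g. 'Action'."""
    # Imported lazily so the module loads without the client library.
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("category", "STRING", category)
        ]
    )
    rows = client.query(CANDIDATE_SLOTS_SQL, job_config=job_config).result()
    return [dict(row.items()) for row in rows]
```

Passing the category as a query parameter rather than formatting it into the SQL string avoids injection issues. Joining the result against viewer habits, such as evening viewing of action titles, would be a further step on top of this query.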

4. Monitoring and logging (GCP operations suite): The GCP operations suite provides comprehensive monitoring and alerting for the entire pipeline, including real-time visibility into job progress and system health. This includes detailed logging, automated alerts for failures, and dashboards for key performance indicators. This proactive approach ensures timely issue resolution and maximizes system reliability.
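One common way to feed such monitoring is to emit structured logs from each pipeline step. The sketch below uses only Python's standard `logging` module and writes one JSON object per line with a `severity` field. It assumes the runtime forwards the container's output to Cloud Logging and parses JSON lines. The `stage` and `video_id` fields and the logger name are illustrative.

```python
import io
import json
import logging
from typing import Optional, TextIO


class JsonFormatter(logging.Formatter):
    """Format each record as a single JSON line with a severity field."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "severity": record.levelname,
            "message": record.getMessage(),
            # Optional context passed through the `extra` argument.
            "stage": getattr(record, "stage", None),
            "video_id": getattr(record, "video_id", None),
        }
        return json.dumps(payload)


def build_logger(stream: Optional[TextIO] = None) -> logging.Logger:
    """Create a logger that writes JSON lines to `stream` (stderr by default)."""
    logger = logging.getLogger("scene_detection_pipeline")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


# Usage: capture one failure entry in memory and inspect it.
buffer = io.StringIO()
log = build_logger(buffer)
log.error("Gemini request failed", extra={"stage": "analysis", "video_id": "ep12"})
entry = json.loads(buffer.getvalue())
print(entry["severity"], entry["stage"])
```

Consistent fields such as `stage` make it straightforward to build failure alerts and dashboards on top of the logs.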

Conclusion: A win-win for viewers and advertisers

Ready to transform your video ad strategy? Learn more about Google Cloud, Gemini, and BigQuery. For developers looking to get hands-on experience, you can also explore this notebook detailing how to use the Gemini API for video analysis.


1. Statista. (2024). Online video viewers worldwide quarterly.
2. Exploding Topics. (2024). 50+ video streaming stats: Key trends in 2024.