Hackathon March 1, 2024

2nd Place at Digital Breakthrough — Multimodal Video AI

Built a multimodal video classification system combining visual and audio features at one of Russia's top hackathons.

2nd place

Computer Vision Multimodal AI Video Processing Deep Learning

Secured 2nd place at Digital Breakthrough — one of Russia’s largest national hackathons, organized by the ANO “Russia — Land of Opportunity” platform. The event draws thousands of participants across Russia and covers AI, design, and product challenges.

The challenge

Build a multimodal video classification system that accurately categorizes video content by combining all available signals: visual frames, audio, and associated metadata. The key difficulty was that no single modality was sufficient on its own — some content types are visually ambiguous but distinctive by audio, and vice versa.

Our approach

Designed a late-fusion multimodal pipeline:

  • Visual stream — sampled keyframes at regular intervals, extracted deep visual embeddings using a pretrained CNN backbone
  • Audio stream — converted audio to Mel spectrograms, processed with an audio classification model to capture tonality, speech vs. music, ambient sound
  • Text/metadata stream — embedded titles and descriptions to capture semantic context
  • Fusion — combined the three embedding streams with learned attention weights that let the model downweight unreliable modalities per sample

The attention-weighted fusion was the key differentiator. A simple concatenation of streams tends to overfit to the dominant modality; learned fusion adapts the contribution of each stream based on sample-level confidence.

Tech stack

PyTorch · Hugging Face Transformers · torchaudio · OpenCV · librosa

Takeaway

This competition pushed me deeper into multimodal AI — understanding how to fuse signals from fundamentally different data types into a single coherent representation. That same design pattern now shows up in the document processing and data intelligence work we do at HermesOps, where inputs often combine structured data, unstructured text, and attached files.

For more on the AI competition journey: 1st place at AIIJC and top 5% in Google’s Kaggle competition.