HueShift: Breathing Life into Every Frame

HueShift transforms grayscale videos into vibrant color using two different deep learning approaches: Diffusion Models and Generative Adversarial Networks (GANs). Our goal was to develop and compare these methods, with a special focus on ensuring temporal consistency and realistic coloration across frames.

Our Dual Approach

1. Diffusion-Based Approach

Our diffusion model works by iteratively adding and removing noise:

  • LAB Color Space Processing - Separates luminance (the "L" channel) from color (the "A" and "B" channels), allowing us to denoise only the color components
  • U-Net Architecture - A 3-channel LAB input with noised AB channels passes through an encoder, bottleneck, and decoder with skip connections
  • Noise Scheduling - A carefully calibrated variance schedule controls the noise addition/removal process
  • Forward Process - Gaussian noise is added to the color channels in increasing amounts over T timesteps
  • Reverse Process - The model learns to predict and remove noise iteratively, conditioned on the grayscale input (see the sketch after this list)
  • Resolution Enhancement - Bicubic interpolation upscales low-resolution outputs while preserving the original grayscale details
  • Neural Deflickering - A flawed-atlas approach identifies and corrects temporal inconsistencies between frames
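
As a rough, hedged sketch of the forward and reverse processes above, the snippet below adds noise to the AB channels and trains a denoiser conditioned on the L channel. It assumes a PyTorch setup; names such as `denoiser`, `T`, and the linear schedule are illustrative, not the project's actual code.

```python
import torch
import torch.nn.functional as F

T = 1000                                    # number of diffusion timesteps (assumed)
betas = torch.linspace(1e-4, 0.02, T)       # linear variance schedule (assumed)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def forward_noise(ab, t):
    """Closed-form forward process: noise the AB channels at timestep t."""
    noise = torch.randn_like(ab)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    ab_t = a_bar.sqrt() * ab + (1.0 - a_bar).sqrt() * noise
    return ab_t, noise

def training_step(denoiser, l_channel, ab, t):
    """Predict the injected noise from [L, noised AB], conditioned on grayscale."""
    ab_t, noise = forward_noise(ab, t)
    x = torch.cat([l_channel, ab_t], dim=1)  # 3-channel LAB input to the U-Net
    pred_noise = denoiser(x, t)
    return F.mse_loss(pred_noise, noise)
```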

Our model was trained on vast.ai GPUs using keyframes from the UCF101 dataset. The diffusion process allows for high-quality colorization by gradually learning to reverse the noise addition process, guided by the grayscale input.

Figure: Diffusion Sample 3 - Drumming (final output)
Figures: Diffusion Process, DDPM Architecture, DDPM Process, Time Attention, GAN Architecture, Discriminator, Generator Architecture

2. GAN-Based Approach

Our GAN implementation uses saliency maps to guide colorization:

  • SCGAN-Based Generator - Modified with channel reduction at deeper layers for improved training stability
  • Saliency Detection - Pyramid Feature Attention Network identifies visually important regions
  • 70×70 PatchGAN Discriminators:
    • Standard discriminator - Enforces global color realism
    • Attention discriminator - Focuses on salient regions for detail refinement
  • Loss Functions - Balanced combination of adversarial, L1, and perceptual losses
  • Optical Flow - FastFlowNet tracks motion between frames to maintain color consistency
  • Adaptive Color Propagation - Warps colors from keyframes to subsequent frames based on motion vectors (see the sketch after this list)
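
As a rough illustration of the flow-guided propagation step above, the sketch below warps the AB channels of a colorized keyframe toward a later frame using a dense flow field. The `flow` tensor is assumed to come from FastFlowNet; shapes and names are illustrative rather than the project's actual code.

```python
import torch
import torch.nn.functional as F

def warp_colors(keyframe_ab, flow):
    """Warp keyframe AB channels to a later frame using backward optical flow.

    keyframe_ab: (1, 2, H, W) colorized AB channels of the keyframe
    flow:        (1, 2, H, W) per-pixel displacement from the target frame to the keyframe
    """
    _, _, h, w = keyframe_ab.shape
    # Base sampling grid in pixel coordinates
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0)   # (1, 2, H, W)
    # Shift the grid by the flow, then normalize to [-1, 1] for grid_sample
    coords = grid + flow
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    norm_grid = torch.stack((coords_x, coords_y), dim=-1)      # (1, H, W, 2)
    return F.grid_sample(keyframe_ab, norm_grid, align_corners=True)
```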

We deviated from the original SCGAN design by reducing channel counts as network depth increased, improving efficiency and reducing overfitting. The dual-discriminator setup enhances both global color realism and local detail accuracy.
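A hedged sketch of how such a combined objective could be assembled, assuming two PatchGAN discriminators and a precomputed saliency map; the helper names and default weights below are illustrative, not the project's actual values.

```python
import torch
import torch.nn.functional as F

def generator_loss(d_global, d_attn, perc_features, gray, fake_ab, real_ab, saliency,
                   w_adv=1.0, w_l1=100.0, w_perc=1.0):
    """Adversarial + L1 + perceptual objective with global and attention discriminators.

    d_global / d_attn : 70x70 PatchGAN discriminators (callables returning patch logits)
    perc_features     : frozen feature extractor used for the perceptual term
    saliency          : (N, 1, H, W) map from the saliency network, used to focus d_attn
    """
    fake = torch.cat([gray, fake_ab], dim=1)   # full predicted LAB image
    real = torch.cat([gray, real_ab], dim=1)   # ground-truth LAB image

    # Standard discriminator enforces overall color realism
    logits_global = d_global(fake)
    adv_global = F.binary_cross_entropy_with_logits(
        logits_global, torch.ones_like(logits_global))

    # Attention discriminator only sees the salient regions
    logits_attn = d_attn(fake * saliency)
    adv_attn = F.binary_cross_entropy_with_logits(
        logits_attn, torch.ones_like(logits_attn))

    # Pixel-level and feature-level reconstruction terms
    l1 = F.l1_loss(fake_ab, real_ab)
    perc = F.l1_loss(perc_features(fake), perc_features(real))

    return w_adv * (adv_global + adv_attn) + w_l1 * l1 + w_perc * perc
```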

Figure: GAN Sample 12 - BenchPress

Sample Results

Figures: sample outputs from the Diffusion Model and the GAN Model

Technical Challenges

Achieving Temporal Consistency

Figure: Global info visualization - Cricket Shot

Our videos were deflickered through a two-stage neural approach:

Atlas Generation Stage

Video frames were processed by mapping each pixel's (x,y,t) coordinates to a consistent 2D atlas space using a 6-layer MLP network without positional encoding. Colors were reconstructed using an 8-layer MLP with positional encoding at 10 frequency bands. This mapping was optimized through a combination of RGB reconstruction loss (weight 5000), optical flow consistency loss (weight 500), rigidity loss (weight 1.0), and gradient preservation loss (weight 1000) for 10,000 iterations at 768×432 resolution.
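
The weighted objective described above can be pictured roughly as below; the positional-encoding helper and loss placeholders are illustrative, and only the weights come from the text.

```python
import torch

def positional_encoding(x, num_bands=10):
    """Encode coordinates with sin/cos at 10 frequency bands (used by the color MLP only)."""
    feats = [x]
    for k in range(num_bands):
        feats.append(torch.sin((2.0 ** k) * torch.pi * x))
        feats.append(torch.cos((2.0 ** k) * torch.pi * x))
    return torch.cat(feats, dim=-1)

def atlas_loss(rgb_loss, flow_loss, rigidity_loss, gradient_loss):
    """Weighted atlas-optimization objective using the weights listed above."""
    return (5000.0 * rgb_loss          # RGB reconstruction
            + 500.0 * flow_loss        # optical-flow consistency
            + 1.0 * rigidity_loss      # mapping rigidity
            + 1000.0 * gradient_loss)  # gradient preservation
```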

Neural Filtering Stage

The second stage applied a U-Net-based neural filter with 32 initial features, followed by a TransformNet with a ConvLSTM for temporal consistency. This refined the atlas-reconstructed frames, preserving detail while keeping colors stable across frames. The final output, stored in the "final" directory at the original resolution, shows the video with flickering removed and natural motion and detail preserved.
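
A loose sketch of how a recurrent filter of this kind can process a clip, carrying a ConvLSTM state across frames; `unet`, `conv_lstm`, and the tensor shapes here are assumptions rather than the project's actual modules.

```python
import torch

def filter_clip(unet, conv_lstm, atlas_frames, input_frames):
    """Refine atlas-reconstructed frames while carrying temporal state across the clip.

    atlas_frames / input_frames : lists of (1, 3, H, W) tensors, one per frame
    """
    state = None          # ConvLSTM hidden/cell state, created on the first frame
    outputs = []
    for atlas_f, input_f in zip(atlas_frames, input_frames):
        # The U-Net refines the flicker-free but imperfect atlas reconstruction
        refined = unet(torch.cat([atlas_f, input_f], dim=1))
        # The ConvLSTM blends the refinement with the running temporal state
        refined, state = conv_lstm(refined, state)
        outputs.append(refined)
    return outputs
```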