LTX-2: The First Open-Source Model That Generates Synchronized Video and Audio

Lightricks releases LTX-2 with 14B video + 5B audio parameters. Native 4K at 50fps with lip sync, foley, and ambient sound. Fully open weights.

Text-to-video models have achieved remarkable visual quality, yet they produce silent results.1 The semantic, emotional, and atmospheric cues that audio provides remain absent. LTX-2 changes this equation entirely. Released January 6, 2026, Lightricks' new model generates synchronized video and audio in a single unified pass, delivering native 4K resolution at 50 frames per second with expressive sound, accurate lip sync, and rich ambient audio.2

TL;DR

LTX-2 introduces an asymmetric dual-stream transformer architecture: 14 billion parameters for video, 5 billion for audio, coupled through bidirectional cross-attention.3 The model generates up to 20 seconds of audiovisual content at 4K resolution.4 Beyond speech, LTX-2 produces coherent audio tracks including background sounds, foley effects, and environmental ambience that match scene content.5 Evaluations show state-of-the-art audiovisual quality among open-source systems and results comparable to proprietary models at a fraction of computational cost.6 All model weights, inference code, and training code have been publicly released under permissive licensing.7

The Silent Video Problem

Current text-to-video models like Sora, Runway Gen-3, and previous LTX versions produce visually impressive results but generate no audio.8 This limitation forces creators into multi-step workflows:

  1. Generate video from text prompt
  2. Analyze video content
  3. Generate or source matching audio separately
  4. Manually synchronize audio and video
  5. Adjust for timing mismatches

Each step introduces latency, complexity, and potential synchronization errors.9

Why Audio Matters

Audio provides critical information that video alone cannot convey:10

Audio Element  | Function
Speech         | Character dialogue and narration
Foley          | Physical action sounds (footsteps, doors)
Ambience       | Environmental context (wind, traffic, room tone)
Music          | Emotional tone and pacing
Sound Effects  | Non-diegetic emphasis and impact

Missing audio transforms potentially usable video into raw material requiring substantial post-production.

LTX-2 Architecture

The model introduces several architectural innovations to achieve unified audiovisual generation:11

Asymmetric Dual-Stream Design

Rather than treating audio and video equally, LTX-2 allocates capacity proportionally to task complexity:12

Stream | Parameters | Rationale
Video  | 14B        | Higher dimensional, more complex generation
Audio  | 5B         | Lower dimensional, leverages video conditioning

This asymmetry improves efficiency without sacrificing audio quality, since the audio stream benefits from strong conditioning signals from video.13

Bidirectional Cross-Attention

The streams communicate through bidirectional cross-attention layers with temporal positional embeddings:14

Video Stream ←→ Cross-Attention ←→ Audio Stream
                    ↓
            Temporal Alignment

This design ensures audio events align with their visual counterparts. Lip movements synchronize with speech. Footsteps occur when feet hit ground. Doors sound when they close on screen.15
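The coupling can be illustrated with a minimal numpy sketch of one bidirectional cross-attention step. All names and shapes are illustrative, and the single-matrix key/value simplification is an assumption for brevity, not the model's actual projection scheme:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, keys_values, d):
    # Tokens from one stream attend over the other stream's tokens.
    # (Real models project separate keys and values; merged here for brevity.)
    scores = queries @ keys_values.T / np.sqrt(d)
    return softmax(scores) @ keys_values

d = 64
video_tokens = np.random.randn(100, d)  # hypothetical video latent tokens
audio_tokens = np.random.randn(40, d)   # hypothetical audio latent tokens

# Bidirectional: each stream queries the other, with a residual connection.
video_updated = video_tokens + cross_attend(video_tokens, audio_tokens, d)
audio_updated = audio_tokens + cross_attend(audio_tokens, video_tokens, d)
```

Because information flows in both directions, audio events can shape visual timing just as visuals shape the audio.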

Cross-Modality AdaLN

Shared timestep conditioning through cross-modality Adaptive Layer Normalization (AdaLN) ensures both streams progress through the diffusion process in sync.16 Without this coordination, audio and video could denoise at different rates, destroying synchronization.
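The idea can be sketched in numpy: both streams are modulated by scale and shift vectors derived from the same timestep embedding, so they always see the same point in the denoising schedule. The projection matrices and dimensions here are hypothetical:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def adaln(x, scale, shift):
    # Adaptive LayerNorm: normalize, then modulate with
    # timestep-conditioned scale and shift vectors.
    return layer_norm(x) * (1 + scale) + shift

d = 64
t_embed = np.random.randn(d)  # ONE shared diffusion-timestep embedding

# Per-stream projections of the same embedding (weights are illustrative),
# so both streams follow a single denoising schedule.
W_video = np.random.randn(d, 2 * d)
W_audio = np.random.randn(d, 2 * d)
v_scale, v_shift = np.split(t_embed @ W_video, 2)
a_scale, a_shift = np.split(t_embed @ W_audio, 2)

video_out = adaln(np.random.randn(100, d), v_scale, v_shift)
audio_out = adaln(np.random.randn(40, d), a_scale, a_shift)
```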

Multilingual Text Encoder

LTX-2 employs a multilingual text encoder for broader prompt understanding.17 This enables audiovisual generation from prompts in multiple languages, expanding accessibility beyond English-only systems.

Capabilities

Resolution and Duration

LTX-2 generates content at specifications competitive with dedicated video-only models:18

Specification      | Value
Maximum resolution | 4K native
Frame rate         | Up to 50 fps
Maximum duration   | 20 seconds
Audio sample rate  | 48 kHz (inferred from quality descriptions)
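A quick sanity check on what these specifications imply per clip (assuming "4K" means 3840x2160 UHD, which the release materials do not spell out):

```python
frames = 20 * 50                 # 20 s at 50 fps -> 1,000 frames
pixels_per_frame = 3840 * 2160   # 4K UHD (assumed) -> 8,294,400 pixels
audio_samples = 20 * 48_000      # 20 s at 48 kHz -> 960,000 samples
print(frames, pixels_per_frame, audio_samples)  # 1000 8294400 960000
```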

Synchronized Speech

The model generates speech synchronized with character lip movements:19

  • Accurate lip sync timing
  • Expressive vocal delivery
  • Multiple speakers supported
  • Emotional range (excitement, sadness, anger, etc.)

Rich Audio Tracks

Beyond speech, LTX-2 produces comprehensive audio:20

Foley Effects: Physical sounds matching on-screen actions

  • Footsteps on various surfaces
  • Object interactions (doors, switches, materials)
  • Movement sounds (clothing rustle, body motion)

Environmental Ambience: Background sounds matching scene context

  • Indoor room tone
  • Outdoor environments (wind, birds, traffic)
  • Weather effects (rain, thunder)

Emotional Scoring: Audio elements that reinforce scene mood

  • Tension-building sounds
  • Atmospheric drones
  • Scene-appropriate musical elements

Modality-Aware Guidance

LTX-2 introduces modality-aware classifier-free guidance (modality-CFG) for improved control:21

Users can adjust the relative influence of audio versus video guidance, enabling:

  • Stronger audio adherence to prompts
  • Video-primary generation with supportive audio
  • Balanced multimodal generation
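The mechanism can be sketched as standard classifier-free guidance applied with an independent scale per modality. The function, shapes, and scale values below are illustrative, not the release's actual API or recommended settings:

```python
import numpy as np

def cfg(cond, uncond, scale):
    # Classifier-free guidance: push the conditional prediction
    # away from the unconditional one by `scale`.
    return uncond + scale * (cond - uncond)

# Hypothetical denoiser outputs at one step (conditional / unconditional).
video_cond, video_uncond = np.random.randn(2, 100, 64)
audio_cond, audio_uncond = np.random.randn(2, 40, 64)

# Separate guidance scales per modality -- the knob modality-CFG exposes.
video_pred = cfg(video_cond, video_uncond, scale=7.0)
audio_pred = cfg(audio_cond, audio_uncond, scale=4.0)
```

Raising one scale relative to the other shifts generation toward stronger prompt adherence in that modality.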

Performance Evaluation

Comparison to Open-Source

Among open-source systems, LTX-2 achieves state-of-the-art results on audiovisual quality and prompt adherence metrics:22

Model                | Visual Quality   | Audio Quality    | Sync Accuracy
LTX-2                | State-of-the-art | State-of-the-art | State-of-the-art
Previous open models | Video only       | N/A              | N/A

No previous open model offered comparable audiovisual generation.

Comparison to Proprietary

LTX-2 delivers "results comparable to proprietary models at a fraction of their computational cost and inference time."23

While specific proprietary comparisons require careful interpretation (companies rarely publish detailed benchmarks), the claim suggests LTX-2 approaches commercial quality with open availability.

Inference Efficiency

The NVFP8 quantized version reduces model size by approximately 30% while speeding up inference by up to 2x:24

Version         | Size     | Speed
Full precision  | Baseline | Baseline
NVFP8 quantized | -30%     | Up to 2x faster

This optimization enables deployment on more modest hardware configurations.

Open Source Release

Lightricks emphasizes the fully open nature of the release:25

What's Included

  • Model Weights: Full 14B+5B parameter model
  • Inference Code: Complete generation pipeline
  • Training Code: Full training implementation
  • LoRA Training: Fine-tuning support package

Licensing

The release uses permissive licensing enabling:26

  • Commercial use
  • Modification and redistribution
  • Integration into products
  • Research applications

Repository Structure

The GitHub organization includes:27

  • LTX-2: Main model repository
  • ltx-core: Core inference package
  • Documentation and examples
  • Community contribution guidelines

Technical Implementation Details

Video Encoding

LTX-2 uses a 3D variational autoencoder for video compression, encoding both spatial and temporal dimensions:28

  • Spatial compression for resolution efficiency
  • Temporal compression for duration handling
  • Latent space enabling efficient diffusion
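A back-of-the-envelope sketch makes the compression idea concrete. The 8x spatial and 8x temporal factors below are hypothetical placeholders; the paper's actual compression ratios may differ:

```python
# Hypothetical compression factors -- illustrative only.
spatial, temporal = 8, 8

frames, height, width = 1000, 2160, 3840     # 20 s of 4K UHD at 50 fps
latent_frames = frames // temporal           # 1000 / 8  -> 125
latent_h = height // spatial                 # 2160 / 8  -> 270
latent_w = width // spatial                  # 3840 / 8  -> 480
print(latent_frames, latent_h, latent_w)     # 125 270 480
```

Diffusion then runs over this much smaller latent grid instead of raw pixels, which is what makes long, high-resolution generation tractable.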

Audio Encoding

The audio stream operates on learned audio representations:29

  • High-fidelity reconstruction at 48kHz (inferred)
  • Temporal alignment with video latents
  • Efficient compression for long-form generation

Diffusion Process

The dual-stream diffusion process denoises video and audio jointly:30

  1. Sample noise for both streams
  2. Apply shared timestep conditioning
  3. Cross-attend between streams at each step
  4. Decode both streams simultaneously
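The four steps above can be sketched as a joint sampling loop. The denoiser here is a toy placeholder standing in for the dual-stream transformer, and all shapes and step counts are illustrative:

```python
import numpy as np

def toy_denoiser(video, audio, t):
    # Placeholder for the dual-stream transformer: in the real model,
    # cross-attention and shared AdaLN conditioning act here.
    return video * 0.9, audio * 0.9

steps = 50                           # illustrative number of diffusion steps
video = np.random.randn(100, 64)     # 1. sample noise for both streams
audio = np.random.randn(40, 64)

for t in reversed(range(steps)):     # 2. shared timestep conditioning
    video, audio = toy_denoiser(video, audio, t)  # 3. joint cross-attending update

# 4. in the real pipeline, both latents are now decoded simultaneously.
```

The key point is that there is one loop, not two: audio and video never drift onto separate denoising schedules.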

Use Cases

Content Creation

  • Short-form video: Social media content with complete audio
  • Prototyping: Quick audiovisual drafts for larger productions
  • Animation: Animated content with synchronized dialogue

Accessibility

  • Multilingual content: Generate videos in multiple languages
  • Audio descriptions: Create content with integrated narration
  • Educational materials: Instructional videos with clear audio

Research

  • Multimodal learning: Study audio-visual relationships
  • Generative modeling: Advance diffusion techniques
  • Evaluation methods: Develop audiovisual quality metrics

Limitations and Considerations

Current Constraints

The paper and release materials suggest several limitations:31

  • Duration: 20-second maximum may constrain some use cases
  • Consistency: Long-form narrative coherence remains challenging
  • Music: Full musical composition likely limited compared to speech/foley
  • Compute: 19B total parameters require substantial GPU resources

Responsible Use

Generated audiovisual content raises considerations around:32

  • Synthetic media identification
  • Consent for voice/likeness generation
  • Misinformation potential
  • Creative attribution

Community Response

The January 6 release generated immediate attention:33

  • Trending on Hugging Face papers
  • Active GitHub engagement
  • Research community discussion
  • Integration work beginning

The combination of audiovisual capability and fully open weights positions LTX-2 as a significant milestone for accessible AI video generation.

Key Takeaways

LTX-2 represents multiple firsts for open-source generative AI:

  1. First Open Audiovisual Model: Synchronized video and audio generation in a single model
  2. Native 4K Output: High-resolution generation without upscaling
  3. Comprehensive Audio: Speech, foley, ambience, and emotional elements
  4. Fully Open Release: Weights, inference code, and training code all available
  5. Efficient Architecture: Asymmetric design allocates parameters appropriately
  6. Production Ready: Quantization and optimization for practical deployment

The silent video era of generative AI may be ending.


References


  1. OpenAI. "Video generation models as world simulators." February 2024. https://openai.com/index/video-generation-models-as-world-simulators/ 

  2. Lightricks. "Lightricks Open-Sources LTX-2, the First Production-Ready Audio and Video Generation Model With Truly Open Weights." GlobeNewswire. January 6, 2026. https://www.globenewswire.com/news-release/2026/01/06/3213304/0/en/Lightricks-Open-Sources-LTX-2-the-First-Production-Ready-Audio-and-Video-Generation-Model-With-Truly-Open-Weights.html 

  3. "LTX-2: Efficient Joint Audio-Visual Foundation Model." arXiv:2601.03233. January 2026. https://arxiv.org/abs/2601.03233 

  4. Ibid., Abstract. 

  5. Ibid. 

  6. Ibid., Section 4: Experiments. 

  7. Lightricks. "LTX-2." GitHub. https://github.com/Lightricks/LTX-2 

  8. OpenAI, op. cit. 

  9. LTX-2 paper, op. cit., Section 1: Introduction. 

  10. Ibid. 

  11. Ibid., Section 2: Method. 

  12. Ibid., Section 2.1: Dual-Stream Architecture. 

  13. Ibid. 

  14. Ibid., Section 2.2: Cross-Attention. 

  15. Ibid. 

  16. Ibid., Section 2.3: Shared Conditioning. 

  17. Ibid., Section 2.4: Text Encoding. 

  18. Lightricks press release, op. cit. 

  19. LTX-2 paper, op. cit., Section 3: Capabilities. 

  20. Ibid. 

  21. Ibid., Section 2.5: Modality-CFG. 

  22. Ibid., Section 4.1: Open-Source Comparison. 

  23. Ibid., Abstract. 

  24. NVIDIA. "Open-Source AI Tool Upgrades Speed Up LLM and Diffusion Models on NVIDIA RTX PCs." NVIDIA Developer Blog. https://developer.nvidia.com/blog/open-source-ai-tool-upgrades-speed-up-llm-and-diffusion-models-on-nvidia-rtx-pcs/ 

  25. Lightricks, GitHub, op. cit. 

  26. Ibid., LICENSE file. 

  27. Ibid. 

  28. LTX-2 paper, op. cit., Section 2.6: Video Encoding. 

  29. Ibid., Section 2.7: Audio Encoding. 

  30. Ibid., Section 2.8: Joint Diffusion. 

  31. Ibid., Section 6: Limitations. 

  32. Ibid., Section 7: Responsible Use. 

  33. Hugging Face. "Paper page - LTX-2." https://huggingface.co/papers/2601.03233 

  34. Lightricks. "LTX-2." Hugging Face Model Card. https://huggingface.co/Lightricks/LTX-2 

  35. Open Source For You. "LTX-2 From Lightricks Delivers Native 4K Audio-Video With Fully Open Weights." January 2026. https://www.opensourceforu.com/2026/01/ltx-2-from-lightricks-delivers-native-4k-audio-video-with-fully-open-weights/ 
