LTX-2: The First Open-Source Model That Generates Synchronized Video and Audio

Lightricks releases LTX-2 with 14B video + 5B audio parameters. Native 4K at 50fps with lip sync, foley, and ambient sound. Fully open weights.

Text-to-video models have achieved remarkable visual quality, yet they produce silent results.1 The semantic, emotional, and atmospheric cues that audio provides remain absent. LTX-2 changes this equation entirely. Released January 6, 2026, Lightricks' new model generates synchronized video and audio in a single unified pass, delivering native 4K resolution at 50 frames per second with expressive sound, accurate lip sync, and rich ambient audio.2

TL;DR

LTX-2 introduces an asymmetric dual-stream transformer architecture: 14 billion parameters for video, 5 billion for audio, coupled through bidirectional cross-attention.3 The model generates up to 20 seconds of audiovisual content at 4K resolution.4 Beyond speech, LTX-2 produces coherent audio tracks including background sounds, foley effects, and environmental ambience that match scene content.5 Evaluations show state-of-the-art audiovisual quality among open-source systems and results comparable to proprietary models at a fraction of computational cost.6 All model weights, inference code, and training code have been publicly released under permissive licensing.7

The Silent Video Problem

Current text-to-video models like Sora, Runway Gen-3, and previous LTX versions produce visually impressive results but generate no audio.8 This limitation forces creators into multi-step workflows:

  1. Generate video from text prompt
  2. Analyze video content
  3. Generate or source matching audio separately
  4. Manually synchronize audio and video
  5. Adjust for timing mismatches

Each step introduces latency, complexity, and potential synchronization errors.9

Why Audio Matters

Audio provides critical information that video alone cannot convey:10

Audio Element  | Function
Speech         | Character dialogue and narration
Foley          | Physical action sounds (footsteps, doors)
Ambience       | Environmental context (wind, traffic, room tone)
Music          | Emotional tone and pacing
Sound Effects  | Non-diegetic emphasis and impact

Missing audio transforms potentially usable video into raw material requiring substantial post-production.

LTX-2 Architecture

The model introduces several architectural innovations to achieve unified audiovisual generation:11

Asymmetric Dual-Stream Design

Rather than treating audio and video equally, LTX-2 allocates capacity proportionally to task complexity:12

Stream | Parameters | Rationale
Video  | 14B        | Higher dimensional, more complex generation
Audio  | 5B         | Lower dimensional, leverages video conditioning

This asymmetry improves efficiency without sacrificing audio quality, since the audio stream benefits from strong conditioning signals from video.13

Bidirectional Cross-Attention

The streams communicate through bidirectional cross-attention layers with temporal positional embeddings:14

Video Stream ←→ Cross-Attention ←→ Audio Stream
                    ↓
            Temporal Alignment

This design ensures audio events align with their visual counterparts. Lip movements synchronize with speech. Footsteps occur when feet hit ground. Doors sound when they close on screen.15
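The coupling can be illustrated with a minimal numpy sketch of one bidirectional cross-attention step. All names and shapes are illustrative, and the single-matrix key/value simplification is an assumption for brevity, not the model's actual projection scheme:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, keys_values, d):
    # Tokens from one stream attend over the other stream's tokens.
    # (Real models project separate keys and values; merged here for brevity.)
    scores = queries @ keys_values.T / np.sqrt(d)
    return softmax(scores) @ keys_values

d = 64
video_tokens = np.random.randn(100, d)  # hypothetical video latent tokens
audio_tokens = np.random.randn(40, d)   # hypothetical audio latent tokens

# Bidirectional: each stream queries the other, with a residual connection.
video_updated = video_tokens + cross_attend(video_tokens, audio_tokens, d)
audio_updated = audio_tokens + cross_attend(audio_tokens, video_tokens, d)
```

Because information flows in both directions, audio events can shape visual timing just as visuals shape the audio.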

Cross-Modality AdaLN

Shared timestep conditioning through cross-modality Adaptive Layer Normalization (AdaLN) ensures both streams progress through the diffusion process in sync.16 Without this coordination, audio and video could denoise at different rates, destroying synchronization.
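The idea can be sketched in numpy: both streams are modulated by scale and shift vectors derived from the same timestep embedding, so they always see the same point in the denoising schedule. The projection matrices and dimensions here are hypothetical:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def adaln(x, scale, shift):
    # Adaptive LayerNorm: normalize, then modulate with
    # timestep-conditioned scale and shift vectors.
    return layer_norm(x) * (1 + scale) + shift

d = 64
t_embed = np.random.randn(d)  # ONE shared diffusion-timestep embedding

# Per-stream projections of the same embedding (weights are illustrative),
# so both streams follow a single denoising schedule.
W_video = np.random.randn(d, 2 * d)
W_audio = np.random.randn(d, 2 * d)
v_scale, v_shift = np.split(t_embed @ W_video, 2)
a_scale, a_shift = np.split(t_embed @ W_audio, 2)

video_out = adaln(np.random.randn(100, d), v_scale, v_shift)
audio_out = adaln(np.random.randn(40, d), a_scale, a_shift)
```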

Multilingual Text Encoder

LTX-2 employs a multilingual text encoder for broader prompt understanding.17 This enables audiovisual generation from prompts in multiple languages, expanding accessibility beyond English-only systems.

Capabilities

Resolution and Duration

LTX-2 generates content at specifications competitive with dedicated video-only models:18

Specification      | Value
Maximum resolution | 4K native
Frame rate         | Up to 50 fps
Maximum duration   | 20 seconds
Audio sample rate  | 48 kHz (inferred from quality descriptions)
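A quick sanity check on what these specifications imply per clip (assuming "4K" means 3840x2160 UHD, which the release materials do not spell out):

```python
frames = 20 * 50                 # 20 s at 50 fps -> 1,000 frames
pixels_per_frame = 3840 * 2160   # 4K UHD (assumed) -> 8,294,400 pixels
audio_samples = 20 * 48_000      # 20 s at 48 kHz -> 960,000 samples
print(frames, pixels_per_frame, audio_samples)  # 1000 8294400 960000
```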

Synchronized Speech

The model generates speech synchronized with character lip movements:19

  • Accurate lip sync timing
  • Expressive vocal delivery
  • Multiple speakers supported
  • Emotional range (excitement, sadness, anger, etc.)

Rich Audio Tracks

Beyond speech, LTX-2 produces comprehensive audio:20

Foley Effects: Physical sounds matching on-screen actions

  • Footsteps on various surfaces
  • Object interactions (doors, switches, materials)
  • Movement sounds (clothing rustle, body motion)

Environmental Ambience: Background sounds matching scene context

  • Indoor room tone
  • Outdoor environments (wind, birds, traffic)
  • Weather effects (rain, thunder)

Emotional Scoring: Audio elements that reinforce scene mood

  • Tension-building sounds
  • Atmospheric drones
  • Scene-appropriate musical elements

Modality-Aware Guidance

LTX-2 introduces modality-aware classifier-free guidance (modality-CFG) for improved control:21

Users can adjust the relative influence of audio versus video guidance, enabling:

  • Stronger audio adherence to prompts
  • Video-primary generation with supportive audio
  • Balanced multimodal generation
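The mechanism can be sketched as standard classifier-free guidance applied with an independent scale per modality. The function, shapes, and scale values below are illustrative, not the release's actual API or recommended settings:

```python
import numpy as np

def cfg(cond, uncond, scale):
    # Classifier-free guidance: push the conditional prediction
    # away from the unconditional one by `scale`.
    return uncond + scale * (cond - uncond)

# Hypothetical denoiser outputs at one step (conditional / unconditional).
video_cond, video_uncond = np.random.randn(2, 100, 64)
audio_cond, audio_uncond = np.random.randn(2, 40, 64)

# Separate guidance scales per modality -- the knob modality-CFG exposes.
video_pred = cfg(video_cond, video_uncond, scale=7.0)
audio_pred = cfg(audio_cond, audio_uncond, scale=4.0)
```

Raising one scale relative to the other shifts generation toward stronger prompt adherence in that modality.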

Performance Evaluation

Comparison to Open-Source

Among open-source systems, LTX-2 achieves state-of-the-art results on audiovisual quality and prompt adherence metrics:22

Model                | Visual Quality   | Audio Quality    | Sync Accuracy
LTX-2                | State-of-the-art | State-of-the-art | State-of-the-art
Previous open models | Video only       | N/A              | N/A

No previous open model offered comparable audiovisual generation.

Comparison to Proprietary

LTX-2 delivers "results comparable to proprietary models at a fraction of their computational cost and inference time."23

While specific proprietary comparisons require careful interpretation (companies rarely publish detailed benchmarks), the claim suggests LTX-2 approaches commercial quality with open availability.

Inference Efficiency

The NVFP8 quantized version reduces model size by approximately 30% while speeding up inference by up to 2x:24

Version         | Size     | Speed
Full precision  | Baseline | Baseline
NVFP8 quantized | -30%     | Up to 2x faster

This optimization enables deployment on more modest hardware configurations.

Open Source Release

Lightricks emphasizes the fully open nature of the release:25

What's Included

  • Model Weights: Full 14B+5B parameter model
  • Inference Code: Complete generation pipeline
  • Training Code: Full training implementation
  • LoRA Training: Fine-tuning support package

Licensing

The release uses permissive licensing enabling:26

  • Commercial use
  • Modification and redistribution
  • Integration into products
  • Research applications

Repository Structure

The GitHub organization includes:27

  • LTX-2: Main model repository
  • ltx-core: Core inference package
  • Documentation and examples
  • Community contribution guidelines

Technical Implementation Details

Video Encoding

LTX-2 uses a 3D variational autoencoder for video compression, encoding both spatial and temporal dimensions:28

  • Spatial compression for resolution efficiency
  • Temporal compression for duration handling
  • Latent space enabling efficient diffusion
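A back-of-the-envelope sketch makes the compression idea concrete. The 8x spatial and 8x temporal factors below are hypothetical placeholders; the paper's actual compression ratios may differ:

```python
# Hypothetical compression factors -- illustrative only.
spatial, temporal = 8, 8

frames, height, width = 1000, 2160, 3840     # 20 s of 4K UHD at 50 fps
latent_frames = frames // temporal           # 1000 / 8  -> 125
latent_h = height // spatial                 # 2160 / 8  -> 270
latent_w = width // spatial                  # 3840 / 8  -> 480
print(latent_frames, latent_h, latent_w)     # 125 270 480
```

Diffusion then runs over this much smaller latent grid instead of raw pixels, which is what makes long, high-resolution generation tractable.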

Audio Encoding

The audio stream operates on learned audio representations:29

  • High-fidelity reconstruction at 48kHz (inferred)
  • Temporal alignment with video latents
  • Efficient compression for long-form generation

Diffusion Process

The dual-stream diffusion process denoises video and audio jointly:30

  1. Sample noise for both streams
  2. Apply shared timestep conditioning
  3. Cross-attend between streams at each step
  4. Decode both streams simultaneously
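The four steps above can be sketched as a joint sampling loop. The denoiser here is a toy placeholder standing in for the dual-stream transformer, and all shapes and step counts are illustrative:

```python
import numpy as np

def toy_denoiser(video, audio, t):
    # Placeholder for the dual-stream transformer: in the real model,
    # cross-attention and shared AdaLN conditioning act here.
    return video * 0.9, audio * 0.9

steps = 50                           # illustrative number of diffusion steps
video = np.random.randn(100, 64)     # 1. sample noise for both streams
audio = np.random.randn(40, 64)

for t in reversed(range(steps)):     # 2. shared timestep conditioning
    video, audio = toy_denoiser(video, audio, t)  # 3. joint cross-attending update

# 4. in the real pipeline, both latents are now decoded simultaneously.
```

The key point is that there is one loop, not two: audio and video never drift onto separate denoising schedules.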

Use Cases

Content Creation

  • Short-form video: Social media content with complete audio
  • Prototyping: Quick audiovisual drafts for larger productions
  • Animation: Animated content with synchronized dialogue

Accessibility

  • Multilingual content: Generate videos in multiple languages
  • Audio descriptions: Create content with integrated narration
  • Educational materials: Instructional videos with clear audio

Research

  • Multimodal learning: Study audio-visual relationships
  • Generative modeling: Advance diffusion techniques
  • Evaluation methods: Develop audiovisual quality metrics

Limitations and Considerations

Current Constraints

The paper and release materials suggest several limitations:31

  • Duration: 20-second maximum may constrain some use cases
  • Consistency: Long-form narrative coherence remains challenging
  • Music: Full musical composition likely limited compared to speech/foley
  • Compute: 19B total parameters require substantial GPU resources

Responsible Use

Generated audiovisual content raises considerations around:32

  • Synthetic media identification
  • Consent for voice/likeness generation
  • Misinformation potential
  • Creative attribution

Community Response

The January 6 release generated immediate attention:33

  • Trending on Hugging Face papers
  • Active GitHub engagement
  • Research community discussion
  • Integration work beginning

The combination of audiovisual capability and fully open weights positions LTX-2 as a significant milestone for accessible AI video generation.

Key Takeaways

LTX-2 represents multiple firsts for open-source generative AI:

  1. First Open Audiovisual Model: Synchronized video and audio generation in a single model
  2. Native 4K Output: High-resolution generation without upscaling
  3. Comprehensive Audio: Speech, foley, ambience, and emotional elements
  4. Fully Open Release: Weights, inference code, and training code all available
  5. Efficient Architecture: Asymmetric design allocates parameters appropriately
  6. Production Ready: Quantization and optimization for practical deployment

The silent video era of generative AI may be ending.


References


  1. OpenAI. "Video generation models as world simulators." February 2024. https://openai.com/index/video-generation-models-as-world-simulators/ 

  2. Lightricks. "Lightricks Open-Sources LTX-2, the First Production-Ready Audio and Video Generation Model With Truly Open Weights." GlobeNewswire. January 6, 2026. https://www.globenewswire.com/news-release/2026/01/06/3213304/0/en/Lightricks-Open-Sources-LTX-2-the-First-Production-Ready-Audio-and-Video-Generation-Model-With-Truly-Open-Weights.html 

  3. "LTX-2: Efficient Joint Audio-Visual Foundation Model." arXiv:2601.03233. January 2026. https://arxiv.org/abs/2601.03233 

  4. Ibid., Abstract. 

  5. Ibid. 

  6. Ibid., Section 4: Experiments. 

  7. Lightricks. "LTX-2." GitHub. https://github.com/Lightricks/LTX-2 

  8. OpenAI, op. cit. 

  9. LTX-2 paper, op. cit., Section 1: Introduction. 

  10. Ibid. 

  11. Ibid., Section 2: Method. 

  12. Ibid., Section 2.1: Dual-Stream Architecture. 

  13. Ibid. 

  14. Ibid., Section 2.2: Cross-Attention. 

  15. Ibid. 

  16. Ibid., Section 2.3: Shared Conditioning. 

  17. Ibid., Section 2.4: Text Encoding. 

  18. Lightricks press release, op. cit. 

  19. LTX-2 paper, op. cit., Section 3: Capabilities. 

  20. Ibid. 

  21. Ibid., Section 2.5: Modality-CFG. 

  22. Ibid., Section 4.1: Open-Source Comparison. 

  23. Ibid., Abstract. 

  24. NVIDIA. "Open-Source AI Tool Upgrades Speed Up LLM and Diffusion Models on NVIDIA RTX PCs." NVIDIA Developer Blog. https://developer.nvidia.com/blog/open-source-ai-tool-upgrades-speed-up-llm-and-diffusion-models-on-nvidia-rtx-pcs/ 

  25. Lightricks, GitHub, op. cit. 

  26. Ibid., LICENSE file. 

  27. Ibid. 

  28. LTX-2 paper, op. cit., Section 2.6: Video Encoding. 

  29. Ibid., Section 2.7: Audio Encoding. 

  30. Ibid., Section 2.8: Joint Diffusion. 

  31. Ibid., Section 6: Limitations. 

  32. Ibid., Section 7: Responsible Use. 

  33. Hugging Face. "Paper page - LTX-2." https://huggingface.co/papers/2601.03233 

  34. Lightricks. "LTX-2." Hugging Face Model Card. https://huggingface.co/Lightricks/LTX-2 

  35. Open Source For You. "LTX-2 From Lightricks Delivers Native 4K Audio-Video With Fully Open Weights." January 2026. https://www.opensourceforu.com/2026/01/ltx-2-from-lightricks-delivers-native-4k-audio-video-with-fully-open-weights/ 
