Model Versioning Infrastructure: Managing ML Artifacts at Scale
Updated December 11, 2025
December 2025 Update: MLflow 3.0 extends the model registry to generative AI applications and AI agents, connecting models to code versions, prompts, evaluation runs, and deployment metadata. Model versioning now tracks not just weights but also fine-tuned adapters, prompt templates, and retrieval configurations. LLMs with hundreds of gigabytes of weights require specialized infrastructure beyond Git.
MLflow 3.0 extended its model registry to handle generative AI applications and AI agents, connecting models to exact code versions, prompt configurations, evaluation runs, and deployment metadata.¹ The evolution reflects a fundamental shift in what "model versioning" means—from tracking simple pickle files to managing complex systems with multiple fine-tuned adapters, prompt templates, and retrieval configurations. Organizations running production AI need infrastructure that versions not just weights, but the entire context required to reproduce and deploy models reliably.
Unlike traditional software versioning, ML model versioning involves tracking massive binary files, complex training configurations, dataset versions, and evaluation metrics—all while maintaining reproducibility and compliance requirements.² The challenge compounds for LLMs where fine-tuned models proliferate rapidly and prompt engineering adds another layer of artifacts requiring version control.
Why model versioning matters
Production ML systems fail silently. Models degrade over time, fine-tuned versions underperform unexpectedly, and without proper versioning, teams cannot identify what changed or roll back to known-good states.
The versioning challenge
Binary artifacts: Model weights range from megabytes for classical ML to hundreds of gigabytes for large language models. Git cannot handle these files efficiently; specialized infrastructure becomes essential.
Configuration explosion: A single model involves training code, hyperparameters, data preprocessing, feature engineering, and deployment configuration. Any change potentially affects model behavior.
Dataset dependency: Model quality depends on training data. Without dataset versioning, reproducing a model becomes impossible even with identical code.
Evaluation coupling: Performance metrics on specific test sets determine deployment decisions. Those metrics must link permanently to model versions.
Business requirements
Reproducibility: Regulatory requirements in finance and healthcare demand the ability to recreate the exact model version deployed at any point in time.³
Auditability: Compliance requires tracing deployed models back to training data, code, and decision-makers who approved deployment.
Rollback capability: Production incidents require reverting to previous model versions within minutes, not hours.
Collaboration: Multiple data scientists working on the same model need clear ownership and conflict resolution for model artifacts.
Model registry architecture
A model registry serves as the central repository managing the lifecycle of ML models from development through production:⁴
Core components
Version control: Each model version receives a unique identifier, typically combining model name with semantic version (v1.2.3) or hash-based identifiers.
Metadata storage: Training parameters, evaluation metrics, data lineage, and deployment history persist alongside model artifacts.
Artifact storage: Model weights, configuration files, and associated assets are stored in scalable object storage (S3, GCS, Azure Blob).
Lifecycle management: Models transition through stages—development, staging, production, archived—with governance controls at each transition.
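The platforms below each define their own schema, but the per-version record they keep is broadly similar. A minimal sketch of such a record; field names are illustrative, not any specific registry's API:

from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ModelVersionRecord:
    """Illustrative record a registry might keep for one model version."""
    name: str                     # e.g. "fraud-detection-model"
    version: str                  # semantic version or content hash
    artifact_uri: str             # pointer into object storage (S3, GCS, Azure Blob)
    stage: str = "development"    # development | staging | production | archived
    metrics: dict = field(default_factory=dict)   # evaluation metrics
    params: dict = field(default_factory=dict)    # hyperparameters
    data_version: str = ""        # dataset version or hash used for training
    code_commit: str = ""         # git commit of the training code
    registered_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))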
Registry workflow
Training Job (metrics logged) → Register Model (version ID generated) → Staging Review (approvals recorded) → Production Deployment (traffic routing monitored)
Registration: Training pipelines automatically register successful models with associated metadata:
- Training run ID and experiment context
- Hyperparameters and configuration
- Evaluation metrics on held-out data
- Data version references
- Code commit hash

Staging: Candidate models undergo validation before production:
- Automated testing against benchmarks
- Human review for sensitive applications
- A/B testing against the current production model
- Performance profiling for inference latency

Promotion: Approved models deploy to production:
- Traffic gradually shifts to the new version
- Monitoring detects degradation
- Rollback triggers if metrics decline
Platform comparison
MLflow
MLflow provides the most comprehensive open-source model registry:⁵
Model Registry features:
- Centralized model store with versioning and aliasing
- Lineage tracking (experiment → run → model)
- Stage transitions (Staging, Production, Archived)
- Annotations and metadata tagging
- REST API for programmatic access

MLflow 3.0 enhancements:
- LoggedModel entity connects models to code, prompts, and evaluations
- Enhanced tracing for generative AI applications
- Agent support for complex AI systems
- Databricks provides a managed enterprise version
Example workflow:
import mlflow
# Log model during training
with mlflow.start_run() as run:
    mlflow.log_params({"learning_rate": 0.001, "epochs": 10})
    mlflow.log_metrics({"accuracy": 0.95, "f1": 0.92})
    # trained_model: an mlflow.pyfunc.PythonModel (or framework model) built earlier
    mlflow.pyfunc.log_model("model", python_model=trained_model)

# Register to model registry
model_uri = f"runs:/{run.info.run_id}/model"
result = mlflow.register_model(model_uri, "fraud-detection-model")

# Promote to production
client = mlflow.tracking.MlflowClient()
client.transition_model_version_stage(
    name="fraud-detection-model",
    version=result.version,
    stage="Production",
)
Best for: Organizations wanting comprehensive MLOps capabilities with open-source flexibility.
Weights & Biases
W&B emphasizes experiment tracking with strong artifact versioning:⁶
Key capabilities:
- Experiment tracking with rich visualization
- Artifact versioning with lineage graphs
- Model registry with aliases (@champion, @production)
- Collaboration features for team workflows
- Integration with major ML frameworks
Artifact versioning:
import wandb
run = wandb.init(project="nlp-models")
# Log model as artifact
artifact = wandb.Artifact("bert-classifier", type="model")
artifact.add_file("model.pt")
run.log_artifact(artifact)
# Link to registry with alias
run.link_artifact(
    artifact,
    "model-registry/bert-classifier",
    aliases=["latest", "production"],
)
Considerations: Cloud-first architecture requires sending data to external servers, which may conflict with strict data privacy requirements.
Best for: Teams prioritizing experiment tracking and collaboration with minimal setup overhead.
DVC (Data Version Control)
DVC extends Git for large files and datasets:⁷
Architecture:
- Git-like commands (dvc add, dvc push, dvc pull)
- Metadata files tracked in Git, large files in remote storage
- Pipeline definitions for reproducible experiments
- Multiple storage backends (S3, GCS, Azure, SSH)
Recent development: DVC joined the lakeFS family, with lakeFS serving as the enterprise standard for petabyte-scale data versioning.
Example workflow:
# Add large model file to DVC
dvc add models/bert-finetuned.pt
# Commit metadata to Git
git add models/bert-finetuned.pt.dvc .gitignore
git commit -m "Add fine-tuned BERT model v1.0"
# Push to remote storage
dvc push
# Reproduce from any commit
git checkout v1.0
dvc checkout
Best for: Teams with existing Git workflows wanting lightweight data and model versioning.
Cloud-native registries
Vertex AI Model Registry (Google Cloud):⁸
- Native GCP integration
- Direct deployment to endpoints
- Automatic lineage tracking
- Integration with Vertex AI Pipelines

Amazon SageMaker Model Registry:
- AWS ecosystem integration
- Approval workflows
- Cross-account model sharing
- Integration with SageMaker Pipelines

Azure ML Model Registry:
- Azure integration
- MLflow compatibility
- Managed endpoints deployment
Best for: Organizations committed to specific cloud providers wanting native integration.
LLM-specific considerations
Large language models present unique versioning challenges beyond traditional ML:⁹
What to version
Base models: Track which foundation model (Llama 3.1-8B, GPT-4, Claude) serves as the starting point.
Fine-tuned weights: Full fine-tuning produces entirely new weight files; LoRA adapters produce small delta files referencing base models.
Prompt templates: System prompts, few-shot examples, and instruction formats significantly affect model behavior.
Retrieval configurations: RAG systems require versioning of embedding models, chunking strategies, and retrieval parameters.
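Prompt templates and retrieval parameters are plain data, so one lightweight approach is to pin them in a config file and derive a version ID from a content hash. A minimal sketch, with placeholder model names and parameters:

import hashlib
import json

# Illustrative prompt + retrieval configuration; values are placeholders.
release_config = {
    "base_model": "llama-3.1-8b",
    "adapter": "customer-support-v1",
    "system_prompt": "You are a concise, polite support assistant.",
    "retrieval": {
        "embedding_model": "text-embedding-3-small",
        "chunk_size": 512,
        "chunk_overlap": 64,
        "top_k": 5,
    },
}

# Canonical JSON guarantees that identical configs hash to the same ID,
# so any prompt or retrieval change produces a new, traceable version.
canonical = json.dumps(release_config, sort_keys=True).encode()
config_version = hashlib.sha256(canonical).hexdigest()[:12]

with open(f"release-{config_version}.json", "w") as f:
    json.dump(release_config, f, indent=2, sort_keys=True)
print("Pinned generation config:", config_version)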
Semantic versioning for LLMs
Adopt semantic versioning to communicate change significance:¹⁰
Major version (v2.0.0):
- Different base model
- Architecture changes
- Breaking API changes

Minor version (v1.3.0):
- Fine-tuning on new data
- Significant performance improvements
- New capabilities added

Patch version (v1.2.1):
- Bug fixes
- Minor optimizations
- Configuration updates
Adapter management
LoRA and QLoRA create proliferating adapter files requiring systematic organization:
base-model/
├── llama-3.1-8b/
│   └── v1.0.0/
│       ├── weights/
│       └── config.json
└── adapters/
    ├── customer-support-v1/
    │   ├── adapter_model.bin
    │   └── adapter_config.json
    ├── code-generation-v2/
    └── summarization-v1/
Adapter versioning strategy:
- Version adapters independently from base models
- Document compatible base model versions
- Track training data and hyperparameters per adapter
- Enable rapid switching between adapters in serving (see the loading sketch below)
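As a concrete illustration, the sketch below loads a versioned LoRA adapter onto a pinned base model and switches adapters at serving time. It assumes the Hugging Face transformers and peft libraries and the directory layout above; model IDs and paths are illustrative:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Pin the base model version explicitly; adapters reference it in adapter_config.json.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

# Attach a versioned adapter directory (adapter_model + adapter_config).
model = PeftModel.from_pretrained(base, "adapters/customer-support-v1")

# Rapid switching between adapters in serving without reloading the base weights.
model.load_adapter("adapters/code-generation-v2", adapter_name="code-generation-v2")
model.set_adapter("code-generation-v2")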
Deployment strategies
Canary deployments
Route small traffic percentage to new model version before full rollout:¹¹
# Kubernetes canary configuration (Istio VirtualService)
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: model-service
spec:
  hosts:
    - model-service
  http:
    - route:
        - destination:
            host: model-service
            subset: v1
          weight: 90
        - destination:
            host: model-service
            subset: v2
          weight: 10
Process:
1. Deploy the new version alongside production
2. Route 5-10% of traffic to the new version
3. Monitor metrics (latency, error rate, business metrics)
4. Gradually increase traffic if metrics hold
5. Complete the rollout or roll back based on results
Tooling: Istio, Argo Rollouts, and Flagger automate progressive delivery with automatic rollback on metric degradation.
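A sketch of the progressive rollout loop those tools automate, assuming two hypothetical helpers: set_canary_weight(), which would update the VirtualService weights above, and error_rate(), which would query your metrics backend:

import time

def run_canary(set_canary_weight, error_rate, max_error_rate=0.01):
    """Progressively shift traffic to the canary, rolling back on degradation."""
    for weight in (5, 10, 25, 50, 100):
        set_canary_weight(weight)      # e.g. patch the VirtualService weights
        time.sleep(300)                # soak period before evaluating metrics
        if error_rate("v2") > max_error_rate:
            set_canary_weight(0)       # automatic rollback to the stable subset
            return "rolled back"
    return "promoted"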
A/B testing
Compare model versions to measure business impact:¹²
Key differences from canary:
- Canary detects problems (minutes to hours)
- A/B testing measures impact (days to weeks)
- Statistical significance required for A/B conclusions

Implementation:
- Hash user IDs for consistent routing (sketched below)
- Track conversion metrics per variant
- Run until statistical significance is achieved
- Document results for future reference
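A minimal sketch of consistent hash-based routing; the experiment name and traffic split are placeholders:

import hashlib

def assign_variant(user_id: str, experiment: str = "model-ab-test",
                   treatment_share: float = 0.5) -> str:
    """Deterministically assign a user to a variant; the same user always
    sees the same model version for the lifetime of the experiment."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF   # uniform value in [0, 1]
    return "model-v2" if bucket < treatment_share else "model-v1"

print(assign_variant("user-12345"))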
Shadow deployment
Route production traffic to new model without serving responses:
Benefits:
- Test with real traffic patterns
- Compare outputs without user impact
- Identify edge cases before deployment

Implementation:
- Production model serves responses
- Shadow model processes the same requests (as in the sketch below)
- Outputs compared but not returned to users
- Discrepancies trigger investigation
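A minimal sketch of the shadow pattern, assuming both models expose a predict() method (an assumption, not any specific framework's API); the shadow call runs off the request path and only logs disagreements:

import logging
from concurrent.futures import ThreadPoolExecutor

logger = logging.getLogger("shadow")
executor = ThreadPoolExecutor(max_workers=4)

def handle_request(request, production_model, shadow_model):
    """Serve the production prediction; score the shadow model off the
    critical path and log disagreements for investigation."""
    response = production_model.predict(request)

    def score_shadow():
        try:
            shadow_response = shadow_model.predict(request)
            if shadow_response != response:
                logger.warning("shadow mismatch: prod=%s shadow=%s",
                               response, shadow_response)
        except Exception:
            logger.exception("shadow model failed")

    executor.submit(score_shadow)   # never blocks or affects the user response
    return response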
Rollback procedures
Every deployment needs rollback capability:
Immediate rollback:
# Traffic routing rollback (point the deployment back at the previous image)
kubectl set image deployment/model-service model=model:v1.2.0

# Feature flag rollback (application-level pseudocode)
feature_flags.disable("new_model_v2")
Registry-based rollback:
# MLflow stage transition
client.transition_model_version_stage(
    name="fraud-detection",
    version=2,  # Previous production version
    stage="Production",
)
client.transition_model_version_stage(
    name="fraud-detection",
    version=3,  # Problematic version
    stage="Archived",
)
Implementation best practices
Metadata standards
Define consistent metadata for all registered models:¹³
Required fields:
- Model name and version
- Training data version/hash
- Code commit hash
- Framework and version
- Training duration and hardware
- Primary evaluation metrics

Optional fields:
- Dataset statistics
- Hyperparameter search results
- Fairness and bias metrics
- Resource consumption
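One way to enforce the required fields is to attach them as tags when a version is registered. A sketch using MLflow's client API; the field values are placeholders:

from mlflow.tracking import MlflowClient

REQUIRED_FIELDS = {
    "data_version": "dataset-v1.2.0",
    "code_commit": "9f3c2ab",
    "framework": "pytorch-2.4",
    "training_hardware": "8x H100, 6h 12m",
    "primary_metric": "f1=0.92",
}

client = MlflowClient()
for key, value in REQUIRED_FIELDS.items():
    # Tags persist alongside the model version and surface in the registry UI.
    client.set_model_version_tag(
        name="fraud-detection-model", version="3", key=key, value=value
    )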
Lineage tracking
Maintain complete lineage from data to deployed model:
Raw Data (v1.2.0) → Processed Data (v1.2.0-p1) → Features (v2.1.0) → Training (run-xyz) → Model (v3.0.0) → Deployment (prod-1234)

Lineage enables:
- Impact analysis when data sources change
- Debugging by tracing errors to training data
- Compliance reporting for audits
- Reproducibility for scientific validation
Governance workflows
Implement approval processes for production deployment:¹⁴
Stage gates:
- Automated testing must pass
- Metrics must exceed minimum thresholds
- Peer review required for production
- Stakeholder sign-off for high-impact models

Audit trails:
- Log all stage transitions
- Record approver identity
- Document justification for decisions
- Retain history for compliance
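A sketch of an automated stage gate that CI could run before a registry transition; the thresholds, approval source, and function itself are illustrative policy, not a library API:

# Hypothetical gate check run before a registry stage transition.
MINIMUM_METRICS = {"accuracy": 0.93, "f1": 0.90}

def approve_promotion(metrics: dict, approvers: list[str],
                      tests_passed: bool) -> bool:
    """Return True only if every stage gate is satisfied."""
    meets_thresholds = all(
        metrics.get(name, 0.0) >= floor for name, floor in MINIMUM_METRICS.items()
    )
    has_peer_review = len(approvers) >= 1
    return tests_passed and meets_thresholds and has_peer_review

# Example: the decision (and approver list) would be written to the audit trail.
decision = approve_promotion({"accuracy": 0.95, "f1": 0.92},
                             approvers=["ml-lead@example.com"],
                             tests_passed=True)
print("promotion approved:", decision)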
Storage strategy
Balance cost and access patterns:
Hot storage: Recent model versions, actively deployed models
Warm storage: Historical versions, staging candidates
Cold storage: Archived models, compliance retention

Cost optimization:
- Lifecycle policies move artifacts between tiers
- Deduplicate base model weights shared across adapters
- Compress historical artifacts
- Retain metadata indefinitely, artifacts per policy
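Lifecycle tiering is usually configured directly on the object store. A sketch using boto3 against S3, with an example bucket name, prefix, and transition schedule:

import boto3

s3 = boto3.client("s3")

# Move aging model artifacts to cheaper tiers; bucket and prefix are examples.
s3.put_bucket_lifecycle_configuration(
    Bucket="ml-model-artifacts",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "warm-then-cold",
                "Filter": {"Prefix": "models/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},   # warm tier
                    {"Days": 180, "StorageClass": "GLACIER"},      # cold tier
                ],
            }
        ]
    },
)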
Organizations building model versioning infrastructure can leverage Introl's global expertise for distributed storage and compute deployment across 257 locations worldwide.
Operational monitoring
Model performance tracking
Connect deployed model versions to production metrics:
Technical metrics:
- Inference latency (p50, p95, p99)
- Throughput (requests per second)
- Error rates
- Resource utilization

Business metrics:
- Conversion rates
- User engagement
- Revenue impact
- Customer satisfaction

Drift detection:
- Feature distribution changes
- Prediction distribution changes
- Performance degradation over time
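A minimal drift check for a single numeric feature, using a two-sample Kolmogorov-Smirnov test from scipy; the threshold and synthetic data are illustrative:

import numpy as np
from scipy.stats import ks_2samp

def feature_drifted(training_values: np.ndarray,
                    production_values: np.ndarray,
                    alpha: float = 0.01) -> bool:
    """Flag drift when the production distribution differs significantly
    from the training distribution (two-sample Kolmogorov-Smirnov test)."""
    statistic, p_value = ks_2samp(training_values, production_values)
    return p_value < alpha

# Example with synthetic data: a shifted feature triggers the alert.
rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, 5000)
current = rng.normal(0.4, 1.0, 5000)
print("drift detected:", feature_drifted(baseline, current))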
Alerting integration
Trigger alerts when metrics breach thresholds:
# Prometheus alerting rule
groups:
  - name: model-alerts
    rules:
      - alert: ModelLatencyHigh
        expr: histogram_quantile(0.95, rate(model_latency_seconds_bucket[5m])) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Model latency exceeds SLA"
The versioning imperative
Model versioning infrastructure determines whether ML systems operate reliably or fail unpredictably. The investment in proper tooling—model registries, artifact storage, deployment automation—pays returns through faster debugging, reliable rollbacks, and confident deployments.
Organizations beginning model versioning should start with MLflow for comprehensive open-source capabilities or their cloud provider's native registry for tight integration. Add DVC for data versioning if datasets change frequently. Layer in deployment automation as model deployment frequency increases.
For LLM systems, extend traditional versioning to cover prompts, adapters, and retrieval configurations. The proliferation of fine-tuned models and adapter variants demands systematic organization that basic version control cannot provide.
The goal remains consistent: any deployed model version can be reproduced, any production incident can be rolled back, and any model's lineage traces back to the data and code that created it. Infrastructure that achieves these goals enables the rapid iteration that production ML requires while maintaining the governance that enterprises demand.
References
1. Sparity. "MLflow in 2025: The New Backbone of Enterprise MLOps." 2025. https://www.sparity.com/blogs/mlflow-3-0-enterprise-mlops/
2. Collabnix. "LLM Model Versioning: Best Practices and Tools for Production MLOps." 2025. https://collabnix.com/llm-model-versioning-best-practices-and-tools-for-production-mlops/
3. Neptune.ai. "ML Model Registry: The Ultimate Guide." 2025. https://neptune.ai/blog/ml-model-registry
4. MLflow. "MLflow Model Registry." 2025. https://mlflow.org/docs/latest/ml/model-registry/
5. ———. "MLflow Model Registry."
6. Weights & Biases. "Intro to MLOps: Data and Model Versioning." 2025. https://wandb.ai/site/articles/intro-to-mlops-data-and-model-versioning/
7. DVC. "Data Version Control." 2025. https://dvc.org/
8. Google Cloud. "Introduction to Vertex AI Model Registry." 2025. https://docs.cloud.google.com/vertex-ai/docs/model-registry/introduction
9. MarkAICode. "MLflow Integration: Track LLM Experiments and Model Versioning for Production Success." 2025. https://markaicode.com/mlflow-llm-experiment-tracking-model-versioning/
10. Rohan Paul. "Plan for versioning and potentially rolling back an LLM deployment." 2025. https://www.rohan-paul.com/p/plan-for-versioning-and-potentially
11. Buzdugan, Sebastian. "Day 60/100: Canary Deployments and A/B Testing – Safer, Smarter Model Rollouts." Medium. 2025. https://medium.com/@sebuzdugan/day-60-100-canary-deployments-and-a-b-testing-safer-smarter-model-rollouts-d9245042baf9
12. Qwak. "ML Models for AB testing - Advanced Model Deployment Patterns." 2025. https://www.qwak.com/post/ml-models-for-ab-testing-advanced-model-deployment-patterns
13. JFrog. "What is a ML Model Registry?" 2025. https://jfrog.com/learn/mlops/model-registry/
14. AWS Well-Architected. "MLREL-11: Use an appropriate deployment and testing strategy." 2025. https://docs.aws.amazon.com/wellarchitected/latest/machine-learning-lens/mlrel-11.html
SEO Elements
Squarespace Excerpt (158 characters)
Model versioning infrastructure enables reliable ML deployments with rollback capability. Complete guide to MLflow, W&B, DVC, and production deployment strategies.
SEO Title (52 characters)
Model Versioning: MLOps Artifact Management Guide
SEO Description (154 characters)
Master ML model versioning with MLflow, Weights & Biases, and DVC. Learn registry architecture, LLM versioning strategies, and production deployment patterns.
Title Review
Current title "Model Versioning Infrastructure: Managing ML Artifacts at Scale" works at 57 characters. Alternatives:
- "ML Model Versioning: Registry & Artifact Management Guide" (54 chars)
- "Model Registry Infrastructure: MLOps Versioning Guide 2025" (55 chars)
URL Slug Recommendations
Primary: model-versioning-infrastructure-mlops-artifact-management-guide-2025
Alternative 1: ml-model-registry-mlflow-wandb-dvc-comparison-guide
Alternative 2: llm-model-versioning-production-deployment-strategies
Alternative 3: mlops-model-artifact-management-enterprise-guide
Key takeaways
For ML platform teams:
- MLflow 3.0: LoggedModel entity connects models to code, prompts, evaluations—critical for LLM versioning
- Binary artifacts: model weights range from MB (classical ML) to 100s of GB (LLMs)—Git cannot handle these
- DVC joined the lakeFS family; lakeFS serves as the enterprise standard for petabyte-scale data versioning

For LLM teams:
- Version base models, fine-tuned weights, LoRA adapters, prompt templates, and RAG configurations separately
- Semantic versioning: Major (different base model), Minor (new fine-tuning), Patch (config updates)
- Adapter management: version independently from base models, document compatible base versions

For deployment strategy:
- Canary: 5-10% traffic to new version, automatic analysis, progressive rollout with auto-rollback
- Shadow: process production requests without serving responses—catches 91% of issues pre-deployment
- Every deployment needs immediate rollback capability: traffic routing or feature flag toggle

For governance:
- Stage gates: automated testing, metric thresholds, peer review, stakeholder sign-off for high-impact models
- Audit trails: log all transitions, record approver identity, document justification, retain for compliance
- Storage tiering: hot (active), warm (staging), cold (archived); lifecycle policies move artifacts between tiers