MLOps in the Age of Foundation Models: Evolving Infrastructure for LLMs and Beyond
The field of machine learning has undergone a seismic shift with the rise of large-scale foundation models - from large language models (LLMs) like GPT-4 to image diffusion models like Stable Diffusion. As a result, the way we build and operate ML systems (MLOps) looks very different today than it did just a few years ago. In this post, we'll explore how ML infrastructure and MLOps practices have evolved, contrasting the "classic" era of MLOps with the modern paradigms emerging to support foundation models. We'll highlight what has changed and which new patterns and workflows have emerged.
1. The MLOps Landscape a Few Years Ago (Pre-Foundation Model Era)
A few years back, MLOps primarily meant applying DevOps principles to ML: automating the model lifecycle from data preparation to deployment and monitoring. Typical ML systems were built around relatively small models, often trained from scratch (or with moderate pre-training) on domain-specific data. Key characteristics of this "classic" MLOps era included:
1.1. End-to-End Pipelines
Teams set up end-to-end pipelines for data extraction, training, validation, and deployment. Tools like Apache Airflow orchestrated ETL and training workflows, while CI/CD systems ran automated tests and pushed models to production. The focus was on reproducibility and automation - packaging models (e.g. in Docker containers) and deploying them via REST microservices or batch jobs.
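To make this concrete, here is a minimal sketch of what such an orchestrated pipeline might look like as an Airflow DAG; the task callables, DAG id, and schedule are hypothetical placeholders rather than a reference implementation.

```python
# Minimal sketch of a classic retraining pipeline in Apache Airflow.
# The extract/train/evaluate callables and the schedule are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_data():
    ...  # pull the latest training data from the warehouse


def train_model():
    ...  # fit the model and write the artifact to object storage


def evaluate_model():
    ...  # compute validation metrics and fail the task if they regress


with DAG(
    dag_id="nightly_retraining",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_data)
    train = PythonOperator(task_id="train", python_callable=train_model)
    evaluate = PythonOperator(task_id="evaluate", python_callable=evaluate_model)

    extract >> train >> evaluate
```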
1.2. Experiment Tracking and Model Versioning
Even then, managing experiments and versions was critical. Platforms such as MLflow or Weights & Biases (W&B) gained popularity to log training runs, hyperparameters, and metrics. This allowed data scientists to compare experiments and reliably reproduce results. Models were registered in model registries with version numbers, making it easier to roll back to a good model if a new one underperformed.
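A minimal sketch of what this tracking looked like in practice with MLflow, assuming a tracking server and model registry are configured; the model, hyperparameters, and registered model name are illustrative.

```python
# Minimal sketch of experiment tracking and model registration with MLflow.
import mlflow
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1_000, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

with mlflow.start_run(run_name="rf-baseline"):
    params = {"n_estimators": 200, "max_depth": 8}
    mlflow.log_params(params)

    model = RandomForestClassifier(**params, random_state=42).fit(X_train, y_train)
    mlflow.log_metric("val_accuracy", model.score(X_val, y_val))

    # Registering the model makes it easy to roll back to a known-good version later.
    mlflow.sklearn.log_model(
        model, artifact_path="model", registered_model_name="churn-rf"
    )
```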
1.3. Continuous Training & CI/CD
Classic MLOps pipelines emphasized continuous integration of new data and models. For instance, a pipeline might retrain a model nightly or weekly as new data arrived, then run a battery of tests. If tests passed, the new model would be deployed via a CI/CD pipeline. Automation tools (Jenkins, GitLab CI/CD, etc.) were configured to ensure that any change in data or code would trigger the pipeline and deliver updated models reliably.
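As an illustration, a CI job might run a small promotion gate like the sketch below after retraining; the metric files, thresholds, and promotion rule are hypothetical, and a real pipeline would typically pull the production baseline from a model registry.

```python
# Minimal sketch of a "promotion gate" run by a CI/CD job after retraining.
import json
import sys

MIN_ACCURACY = 0.90      # absolute floor for the candidate model
MAX_REGRESSION = 0.01    # allowed drop versus the current production model


def should_promote(candidate: dict, production: dict) -> bool:
    if candidate["accuracy"] < MIN_ACCURACY:
        return False
    return candidate["accuracy"] >= production["accuracy"] - MAX_REGRESSION


if __name__ == "__main__":
    with open("candidate_metrics.json") as f:
        candidate = json.load(f)
    with open("production_metrics.json") as f:
        production = json.load(f)

    if not should_promote(candidate, production):
        print("Candidate model rejected; keeping current production model.")
        sys.exit(1)  # a non-zero exit fails the CI job and blocks deployment
    print("Candidate model approved for deployment.")
```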
1.4. Infrastructure and Serving
In the pre-LLM era, serving a model in production often meant a relatively small footprint - perhaps a few CPU cores or a single GPU for real-time inference. Kubernetes and Docker became the de facto way to deploy scalable inference services, allowing organizations to replicate model instances to handle load. Monitoring focused on uptime and performance metrics (latency, throughput) as well as model-specific metrics like prediction accuracy on a rolling window of data, concept drift detection, etc.
1.5. Feature Stores and Data Management
For many ML applications (especially in industries like finance or e-commerce), engineered features were as important as models. Feature stores were introduced to provide a central place to manage features used in models, ensuring consistency between training and serving. The emphasis was on structured data pipelines and feature engineering, whereas unstructured data (text, images) often required custom handling outside these stores.
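For example, a serving path might pull features from a store like Feast at request time, a rough sketch of which is shown below; the feature view, feature names, and entity key are hypothetical.

```python
# Minimal sketch of online feature retrieval with Feast at inference time.
# The feature view name, feature names, and entity key are hypothetical.
from feast import FeatureStore

store = FeatureStore(repo_path=".")  # points at a feature repo defined elsewhere

features = store.get_online_features(
    features=[
        "user_stats:purchases_last_30d",
        "user_stats:avg_order_value",
    ],
    entity_rows=[{"user_id": 1234}],
).to_dict()

# The same feature definitions drive offline (training) retrieval,
# which is what keeps training and serving consistent.
print(features)
```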
In summary, "classic" MLOps revolved around relatively small-to-medium models and explicit feature engineering. The tooling was geared toward managing many experiments and deployments, and scaling out a large number of models (for different tasks) rather than scaling one enormous model. This paradigm worked well - until models started growing dramatically in size and capability, ushering in a new era.
2. The Paradigm Shift: Rise of Large-Scale Foundation Models
Around 2018-2020, researchers began introducing foundation models - extremely large models pretrained on vast corpora, capable of being adapted to many tasks. Examples include BERT and GPT-2 (NLP), followed by GPT-3 and PaLM, as well as image models like BigGAN and later diffusion models (DALL-E, Stable Diffusion). By 2023-2024, these models became ubiquitous in ML workflows. As one practitioner noted in early 2024, "Today, foundational models are everywhere - from Hugging Face to built-in models in services like AWS Bedrock - a stark change from just two years ago". This rise of foundation models led to major shifts in ML infrastructure:
2.1. Pretrained > From Scratch
Instead of developing many models from scratch, teams began with powerful pretrained models and fine-tuned them for specific tasks. This dramatically cut down training time and data needs for new tasks. It also meant that the largest models (with billions of parameters) were often reused via fine-tuning or even used as-is via APIs. As a result, the skillset for ML engineers started to include how to leverage and integrate these foundation models (sometimes via simple API calls) rather than only how to build new models. In fact, by 2024 some discussions suggested that ML/MLOps engineers should focus on integrating foundation models via their APIs - treating the model as a service - rather than reinventing the wheel.
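A minimal sketch of this adapt-don't-rebuild workflow using the Hugging Face transformers Trainer; the checkpoint, dataset, and hyperparameters are illustrative, and a real fine-tune would add evaluation and tuning.

```python
# Minimal sketch of fine-tuning a pretrained model instead of training from scratch.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

dataset = load_dataset("imdb")
tokenized = dataset.map(
    lambda batch: tokenizer(
        batch["text"], truncation=True, padding="max_length", max_length=256
    ),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="out", num_train_epochs=1, per_device_train_batch_size=16
    ),
    # small subsample just to illustrate the workflow
    train_dataset=tokenized["train"].shuffle(seed=42).select(range(2_000)),
)
trainer.train()
```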
2.2. Model Size and Computational Demands
The sheer scale of these models introduced new challenges:
A model with billions of parameters cannot be handled with the same infrastructure as a model with millions. Training and even just deploying such models require powerful hardware (GPUs, TPUs) and often distributed computing. This gave rise to new techniques like model parallelism (sharding a single model across multiple GPUs) and distributed data parallel training (synchronizing multiple GPU workers). Libraries and optimizations like DeepSpeed and ZeRO (Zero Redundancy Optimizer) were developed to make training giant models feasible. Even for inference, serving a large model often meant using multiple GPUs or specialized inference runtimes to keep latency acceptable.
2.3. Emergence of LLMOps
It became clear that operating these large models in production required extensions to classic MLOps - leading to what many call LLMOps (Large Language Model Ops). LLMOps is essentially MLOps specialized for large models, especially LLMs. It builds on the same principles but "addresses the unique challenges of deploying large language model... These models require substantial computational resources, prompt engineering, and ongoing monitoring to manage performance, ethics, and latency". In other words, things that barely registered as issues for smaller models (like a single model's inference possibly producing biased text or leaking private training data) became major considerations when using LLMs at scale.
This diagram from NVIDIA illustrates how general MLOps (outer circle) has branched into specialized subfields like generative AI operations (for all generative models), LLMOps (for large language models), and even RAGOps for retrieval-augmented generation. The concentric circles indicate that these specializations build on the foundation of classic MLOps.
2.4. Foundation Models as a Service
Another shift was the increasing availability of models via API or model hubs. Companies like OpenAI, Cohere, and AI21 Labs offered hosted LLMs accessible through an API call, which many applications use directly rather than deploying their own model. Likewise, the open-source community (notably Hugging Face) created hubs where thousands of pretrained models can be downloaded or even run in the cloud. This changed ML system architecture - a production pipeline might call out to an external API for inference, which introduces new considerations (latency, cost, data privacy) but saves the effort of managing the model's infrastructure. Even big cloud platforms integrated foundation models: e.g. Google's Vertex AI added Model Garden to provide pretrained LLMs and diffusion models out-of-the-box, with tools to fine-tune them on custom data, all managed on Google's infrastructure.
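Consuming such a hosted model can be as simple as the sketch below, shown here with the OpenAI Python SDK; the model name and prompt are illustrative, and the same pattern applies to other providers' APIs.

```python
# Minimal sketch of consuming a foundation model as a hosted service via an API.
# OPENAI_API_KEY must be set in the environment; the model name is illustrative.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You classify support tickets by urgency."},
        {"role": "user", "content": "My payment failed twice and I'm locked out."},
    ],
)

print(response.choices[0].message.content)
```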
In essence, the rise of foundation models shifted ML development from a paradigm of "your data + your model code = trained model" to "your data + adaptation of a massive pre-trained model = fine-tuned model" (or sometimes simply prompting the model with your data). It also meant MLOps had to handle far more computationally intensive processes and new workflows like prompt engineering and continuous monitoring for things like ethical compliance.
3. New Requirements and Capabilities in Modern ML Infrastructure
With foundation models and LLMs becoming central, today's ML infrastructure must support capabilities that were niche or non-existent in the past. Let's discuss some of the key new requirements and how they differ from the classic era:
3.1. Distributed Training and Model Parallelism
Training a model with hundreds of millions or billions of parameters is beyond the capacity of a single machine in most cases. Modern ML infrastructure often needs to orchestrate distributed training, where either the data or the model is split across multiple nodes. Model parallelism in particular is critical for large models - it involves splitting the model's layers or parameters across multiple GPUs or TPUs so that each accelerator handles a part of the model. Frameworks like PyTorch Lightning, Horovod, and libraries from hardware vendors (NVIDIA's Megatron-LM, Google's JAX/TPU ecosystem) help manage this. This contrasts with a few years ago when most teams could train models on a single server or small cluster without these complexities. Now, an ML platform is expected to handle launching jobs on GPU clusters, managing faults, and aggregating gradients from many workers seamlessly.
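A minimal sketch of the data-parallel half of this picture with PyTorch DistributedDataParallel, meant to be launched via torchrun; the tiny model and random data are placeholders, and sharding a single huge model would additionally involve libraries like DeepSpeed or Megatron-LM.

```python
# Minimal sketch of data-parallel training with PyTorch DDP.
# Launch with: torchrun --nproc_per_node=<num_gpus> train.py
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 10).cuda(local_rank)  # stand-in for a real network
model = DDP(model, device_ids=[local_rank])
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for step in range(100):
    x = torch.randn(32, 1024, device=local_rank)
    y = torch.randint(0, 10, (32,), device=local_rank)

    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()            # gradients are all-reduced across workers here
    optimizer.step()
    optimizer.zero_grad()

dist.destroy_process_group()
```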
3.2. Efficient Fine-Tuning Techniques
As training from scratch is often impractical with huge models, fine-tuning is the name of the game. However, even fine-tuning a multi-billion parameter model on new data can be extremely resource-intensive. This has led to new techniques like LoRA (Low-Rank Adaptation) - a method that "reduces the computational cost of fine-tuning LLMs by updating only a small subset of the model's parameters (adapters) instead of the entire network". Fine-tuning today might also involve methods like Prompt Tuning or Adapter modules, which allow adding task-specific capabilities without full re-training. ML infrastructure now must support these workflows - for example, loading a giant base model from a model hub, applying a delta of fine-tuned weights, and deploying that combined model. Traditional training pipelines had to evolve significantly to accommodate such multi-step model customization.
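A rough sketch of a LoRA fine-tune using the Hugging Face peft library; the base checkpoint and target module names are illustrative and depend on the architecture being adapted.

```python
# Minimal sketch of parameter-efficient fine-tuning with LoRA via `peft`.
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in for a larger LLM

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["c_attn"],  # attention projection module in GPT-2
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # typically a small fraction of the base model

# From here the model trains like any other, but only the adapter weights update,
# and only the small adapter needs to be stored and shipped per task.
```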
3.3. Prompt Engineering & Management
One surprising new "artifact" in modern ML pipelines is the prompt. When using LLMs, a lot of the model's behavior is controlled through the text prompt or input format given to it. Engineering these prompts (and possibly chains of prompts) has become a new discipline. Teams now maintain prompt libraries or templates, and even use version control and A/B testing for prompts similar to code. This is quite different from classical ML, where the input to a model was usually just data features without this intermediate layer of natural language instructions. As a result, some MLOps setups treat prompts as a configurable part of the deployment - possibly storing prompt versions alongside model versions, and using prompt-management tools. We even see early frameworks that facilitate prompt design and optimization as first-class citizens of the ML pipeline (for example, prompt optimization modules in LangChain or prompt testing frameworks).
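A minimal sketch of treating prompts as versioned artifacts; in practice teams often keep such templates in git or a dedicated prompt-management tool, and the template names, versions, and wording here are hypothetical.

```python
# Minimal sketch of versioned prompt templates managed alongside the application.
PROMPT_TEMPLATES = {
    ("summarize_ticket", "v1"): "Summarize this support ticket in one sentence:\n{ticket}",
    ("summarize_ticket", "v2"): (
        "You are a support analyst. Summarize the ticket below in one sentence, "
        "mentioning the product area and urgency.\n\nTicket:\n{ticket}"
    ),
}


def render_prompt(name: str, version: str, **variables: str) -> str:
    """Look up a specific prompt version and fill in its variables."""
    template = PROMPT_TEMPLATES[(name, version)]
    return template.format(**variables)


# An A/B test can pin different traffic slices to different prompt versions
# and compare output quality, much like comparing two model versions.
print(render_prompt("summarize_ticket", "v2", ticket="Checkout page times out on mobile."))
```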
3.4. Retrieval-Augmented Generation (RAG)
Foundation models like GPT-3 have a fixed knowledge cutoff (training data) and context window. To keep responses up-to-date and accurate with external knowledge, a common pattern called retrieval-augmented generation has emerged. RAG involves using a vector database or search index to retrieve relevant documents which are then provided to the LLM as additional context (usually appended to the prompt). This pattern was rare a few years ago, but now it's "becoming a best practice for improving LLM applications", as noted in industry discussions. Instead of continuously retraining the model on new data (which is costly and slow), RAG allows the model to fetch information at query time. ML infrastructure has adapted by integrating new components: vector databases (like Pinecone, Weaviate, FAISS, or Milvus) are now part of the stack to handle fast similarity search on embeddings. Managing these embedding indexes and keeping them in sync with the latest data is a new responsibility for MLOps. In fact, in generative AI pipelines, "using embeddings and vector databases replaces feature stores that were relevant to classic MLOps", since unstructured data and semantic search have taken center stage over manual feature engineering.
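A minimal RAG sketch using sentence-transformers embeddings and a FAISS index; the documents, embedding model, and the call_llm hook at the end are illustrative placeholders.

```python
# Minimal sketch of retrieval-augmented generation: embed documents, index them,
# retrieve the nearest ones for a query, and prepend them to the LLM prompt.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "The 2024 refund policy allows returns within 60 days.",
    "Support is available 24/7 via chat for enterprise customers.",
    "The mobile app requires iOS 16 or Android 13 and above.",
]
doc_vectors = embedder.encode(documents, normalize_embeddings=True)

index = faiss.IndexFlatIP(doc_vectors.shape[1])  # inner product on normalized vectors = cosine
index.add(np.asarray(doc_vectors, dtype="float32"))

query = "How long do customers have to return a product?"
query_vector = embedder.encode([query], normalize_embeddings=True)
_, neighbor_ids = index.search(np.asarray(query_vector, dtype="float32"), 2)

context = "\n".join(documents[i] for i in neighbor_ids[0])
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
# call_llm(prompt)  # hypothetical call to whichever hosted or self-hosted LLM is in use
```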
3.5. Data Streaming and Real-Time Data Feeds
Many modern applications (especially those involving LLM-powered assistants or real-time personalization) continuously ingest data - chat conversations, sensor data, event streams - and need to update either the model's knowledge (via RAG) or trigger model responses in real-time. While classic ML pipelines often assumed periodic batch processing of data (e.g. a daily training job), today's systems might need to handle streaming data in real-time. This has led to increased use of technologies like Kafka or real-time databases in ML pipelines, and online feature stores or caches that update continuously. The boundary between data engineering and MLOps blurs further when dealing with streaming data for model consumption.
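A rough sketch of streaming ingestion feeding a retrieval index using kafka-python; the topic, brokers, and message schema are hypothetical, and the embedding/index setup mirrors the RAG sketch above.

```python
# Minimal sketch of a streaming consumer that keeps a vector index up to date:
# each event is embedded and added as it arrives, instead of a nightly batch job.
import json

import faiss
from kafka import KafkaConsumer
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
index = faiss.IndexFlatIP(embedder.get_sentence_embedding_dimension())

consumer = KafkaConsumer(
    "support-tickets",
    bootstrap_servers=["localhost:9092"],
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for message in consumer:
    ticket = message.value                    # e.g. {"id": "...", "text": "..."}
    vector = embedder.encode([ticket["text"]], normalize_embeddings=True)
    index.add(vector.astype("float32"))       # new knowledge becomes searchable within seconds
```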
3.6. Scalable and Specialized Serving Infrastructure
Serving a massive model is a challenge in itself. Modern ML infrastructure must support:
- High-Throughput, Low-Latency Serving
For applications like interactive chatbots or image generators, users expect prompt responses. This often requires serving infrastructure that can utilize GPUs (or specialized ASICs like TPUs) to perform inference quickly. Techniques like model quantization (reducing precision to speed up inference) and GPU batching (serving multiple requests in parallel on one GPU) are employed. Some companies use model-specific serving optimizations, for example NVIDIA's TensorRT or Triton Inference Server for optimized GPU inference, or DeepSpeed-Inference for accelerated transformer model serving. A minimal quantization sketch follows this list.
- Serverless and Elastic Scaling
Interestingly, we see a trend toward serverless ML services for inference. Platforms like Modal have emerged, which "is similar to AWS Lambda but with GPU support - a serverless platform where you provide the code and they handle infrastructure and scaling for you". In such a setup, you don't have an always-running server for your model; instead, the platform spins up compute (with GPUs if needed) on-demand to handle requests, scaling to zero when idle. This is a departure from the always-on, containerized microservice model of the past. It promises cost savings (pay only per execution) and easier scaling, though one has to manage cold-start latency and statelessness in these systems. Modern MLOps may leverage such serverless inference for irregular workloads or use cases where managing GPU clusters is overhead.
- Distributed Model Serving
If a model is too large for one machine or one GPU to serve, inference itself can be distributed. There are frameworks to shard the model across multiple machines for serving (similar to training) so that each handles part of the forward pass. This is complex, but needed for extreme cases - serving a 175B-parameter GPT-3-class model on-premises, for example, might require multiple GPUs working together. Today's ML infra must be capable of launching such distributed inference replicas and routing requests appropriately.
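As referenced above, here is a minimal sketch of one of the simpler serving optimizations - post-training dynamic quantization in PyTorch, which stores Linear weights in int8 for CPU inference. The checkpoint is illustrative, gains vary by model and hardware, and large LLMs typically rely on more specialized stacks (TensorRT, Triton, DeepSpeed-Inference).

```python
# Minimal sketch of post-training dynamic quantization for cheaper CPU inference.
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")
model.eval()

# Replace Linear layers with int8 dynamic-quantized equivalents.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# The quantized model is a drop-in replacement at inference time.
inputs = torch.randint(0, 1000, (1, 32))  # toy token IDs
with torch.no_grad():
    logits = quantized_model(input_ids=inputs).logits
print(logits.shape)
```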
3.7. Monitoring, Observability, and Guardrails
With great power comes great responsibility - large models can generate incorrect or inappropriate outputs in ways small models typically did not. Modern ML systems need nuanced monitoring:
- Performance and Reliability
Of course, we still monitor latency, throughput, memory usage, etc., since large models can be resource hogs. Ensuring an LLM-based service meets an SLA might involve autoscaling GPUs, or falling back to a smaller model if load is high.
- Output Quality and Safety
We now also monitor the content of model outputs. For example, filtering for hate speech, PII, or other harmful content is a standard part of deploying generative models. Many pipelines include an automated moderation step - e.g., OpenAI's moderation API or custom filters - to catch problematic outputs. Bias evaluation tools might run in the background to detect drift in the model's responses or flag when the model starts producing biased results. These are part of the "guardrails" that have become essential in LLMOps, intercepting adversarial inputs and ensuring outputs stay within acceptable bounds; a minimal filter sketch follows this list.
- Feedback Loops
The notion of continuous improvement has extended to user feedback on model outputs. Modern MLOps may incorporate a human feedback loop, or at least a mechanism to collect user interactions (likes, corrections) with the model's outputs. This data can then be used to further fine-tune the model or adjust prompts. In the LLM world, techniques like Reinforcement Learning from Human Feedback (RLHF) explicitly use human ratings to refine model behavior. So the infrastructure must support collecting and managing this feedback data securely and effectively.
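As referenced above, a minimal sketch of an output guardrail: a post-generation filter that redacts or blocks responses before they reach the user. The regexes and blocklist are illustrative only; production systems typically pair such rules with a hosted moderation API or dedicated classifiers.

```python
# Minimal sketch of a post-generation guardrail filter.
import re

EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE_PATTERN = re.compile(r"\+?\d[\d\s().-]{7,}\d")
BLOCKED_TERMS = {"internal-only", "api_key"}  # placeholder policy terms


def apply_guardrails(text: str) -> tuple[str, bool]:
    """Return (possibly redacted text, allowed flag) for a model response."""
    lowered = text.lower()
    if any(term in lowered for term in BLOCKED_TERMS):
        return "I'm sorry, I can't share that.", False

    redacted = EMAIL_PATTERN.sub("[redacted email]", text)
    redacted = PHONE_PATTERN.sub("[redacted phone]", redacted)
    return redacted, True


response, allowed = apply_guardrails("You can reach the on-call engineer at jane@example.com.")
print(allowed, response)  # True  You can reach the on-call engineer at [redacted email].
```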
In summary, today's ML infrastructure goes far beyond training and deploying a single model artifact. It needs to manage entire ecosystems of model components - the base model, fine-tuning adapters, prompt templates, retrieval indexes, monitoring detectors, and more - and orchestrate them to work together. The complexity is higher, but so is the capability unlocked.
4. Evolving System Architecture and Design Patterns
Given those new requirements, how are ML system architectures structured today? Let's highlight a few design patterns and compare them to earlier approaches:
4.1. Modular Pipelines & Orchestration
Orchestration frameworks from the past are still around - you might use Kubeflow Pipelines or Apache Airflow/Beam to orchestrate the fine-tuning process or batch scoring jobs. But for inference time orchestration (which needs low latency), lightweight frameworks or application code often replace heavyweight workflow engines. Also, new MLOps orchestration tools (like Metaflow, Flyte, or ZenML) have gained traction by focusing on Pythonic workflows that integrate well with modern ML libraries. They help manage the flow from data to deployment without forcing engineers to step out of their normal development environment.
4.2. Model Hubs and Registries
Model management has evolved with the rise of hubs like Hugging Face. Instead of every team hosting their own model registry, many share and fetch models from centralized hubs. Internally, companies still maintain model registries (MLflow Registry, Amazon SageMaker Model Registry, etc.) for their bespoke models, but they might also pull foundation models from an external source. Hugging Face Hub not only hosts thousands of models but also versioned datasets and scripts, becoming a one-stop shop for ML components. This encourages a more plug-and-play architecture - e.g., a sentiment analysis service might directly fetch a pre-trained model checkpoint from a hub at startup. The ease of discovering and sharing models has accelerated the pace of development and also influenced design: engineers now plan for how to fine-tune and update third-party models rather than building everything in-house.
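A minimal sketch of the fetch-at-startup pattern using huggingface_hub; the repository ID is illustrative, and downloads are cached locally between restarts.

```python
# Minimal sketch of pulling a pretrained checkpoint from the Hugging Face Hub
# at service startup instead of bundling weights with the application.
from huggingface_hub import snapshot_download
from transformers import pipeline

local_dir = snapshot_download(repo_id="distilbert-base-uncased-finetuned-sst-2-english")
sentiment = pipeline("sentiment-analysis", model=local_dir)

print(sentiment("The new release fixed our latency issues!"))
```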
4.3. Feature Stores vs. Vector Databases
As mentioned, the importance of traditional feature stores has slightly declined in apps where text and images are primary data. In their place, vector databases have become a new pillar in the architecture. Vector DBs (such as Pinecone, Weaviate, Chroma, or Milvus) are specialized for storing high-dimensional embeddings and performing similarity search quickly. They are used for semantic search, deduplication, recommendation, and as part of RAG for LLMs. In modern pipelines, you might see a vector DB alongside a more classical data warehouse - the former serving unstructured semantic lookup needs, the latter serving structured data analytics. An ML system might vectorize incoming data using an embedding model and continually update the vector index, enabling any LLM queries to fetch relevant context. This is a new pattern that didn't exist in older MLOps, and it's now quite common for AI applications dealing with text or image data.
4.4. Unified Platforms (End-to-End)
The complexity of handling all these pieces has given momentum to end-to-end ML platforms that abstract many details. Cloud platforms like Google Vertex AI, AWS SageMaker, and Azure Machine Learning have each evolved to support foundation model workflows. For example, Vertex AI offers training services that automatically distribute models across TPU pods, hosts a Model Garden with popular LLMs available, and provides endpoints to deploy models with one click (including scaling on GPUs). They also integrate data tools and monitoring for drift, etc. Similarly, SageMaker has added features for large model training (distributed training jobs, model parallel libraries), and AWS exposes proprietary and third-party foundation models through its Bedrock service. These platforms embody the evolved best practices, providing building blocks like "fine-tune this 20B parameter model on your data" or "embed and index your text data for retrieval" as managed services. Meanwhile, open-source initiatives and startups also offer integrated solutions - for instance, MosaicML (now part of Databricks) provided tooling to train and deploy large models efficiently, Argilla and Label Studio help with data labeling and prompt dataset creation for LLMs, and ClearML or MLflow tie together experiment tracking with pipeline execution.
4.5. Inference Gateways and APIs
The proliferation of models and model sizes has led to architectures that include a model inference gateway or router. Companies might deploy multiple models (e.g., a small fast model and a large accurate model) and route requests to one or the other based on context (latency requirements, user subscription level, etc.). There are open-source tools and design patterns for these gateways, sometimes using a service mesh or simple web services to forward requests. The idea is to decouple the client-facing API from the actual model implementation behind it. This also helps in A/B testing models - a fraction of traffic can be served by a new model to compare outcomes. In legacy setups, one might simply deploy a new model at a new REST endpoint for testing. Now, more sophisticated routing is common.
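A minimal sketch of such a gateway with FastAPI; the tier field and the small/large model backends are hypothetical placeholders standing in for real inference services.

```python
# Minimal sketch of an inference gateway that routes requests to a small, fast
# model or a large, higher-quality one based on request metadata.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class CompletionRequest(BaseModel):
    prompt: str
    tier: str = "free"  # e.g. "free" vs "premium" subscription


async def call_small_model(prompt: str) -> str:
    return f"[small model] {prompt[:50]}..."  # placeholder backend


async def call_large_model(prompt: str) -> str:
    return f"[large model] {prompt[:50]}..."  # placeholder backend


@app.post("/v1/complete")
async def complete(request: CompletionRequest) -> dict:
    # Route premium traffic (or latency-tolerant requests) to the larger model;
    # the same hook is a natural place for A/B tests between model versions.
    if request.tier == "premium":
        text = await call_large_model(request.prompt)
    else:
        text = await call_small_model(request.prompt)
    return {"completion": text}
```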
4.6. Agentic Systems
A cutting-edge pattern is the rise of "agent" architectures, where an AI system can dynamically choose sequences of actions (invoking different models or tools) to accomplish a task. This goes beyond static chains. For example, an AI agent might decide it needs to call an external calculator or search engine in the middle of answering a query. Frameworks enabling this (like LangChain's agent mode, OpenAI's function calling, etc.) are nascent, but point towards future ML systems that are even more complex - essentially a workflow decided at runtime by the model. MLOps is beginning to account for such systems (sometimes dubbed "AgentOps" in concept), which require robust monitoring to ensure the agent doesn't take unwanted actions and logging to trace its decisions. While not widespread in production yet, this is an emerging design pattern fueled by the flexibility of large models.
5. Conclusion: From MLOps to LLMOps and Beyond
In just a few years, we've witnessed a transformation in how we approach machine learning in production. Classic MLOps principles - automation, reproducibility, collaboration between data science and engineering - still apply, but they have been extended and reshaped to handle the scale and scope of modern ML tasks. Large foundation models brought incredible capabilities, but also complexity that demanded new solutions. This gave rise to what is now called LLMOps, a specialization of MLOps, to manage the lifecycle of these powerful models. It's not just hype - the differences are tangible in day-to-day workflows, from how we fine-tune models, to how we deploy and monitor them with new infrastructure components (like vector databases or GPU clusters).
The evolution is ongoing. As models continue to grow and as AI systems become more "agentic" or autonomous, we'll likely see further specialization (there's already talk of "AgentOps" for AI agents). However, the end goal remains the same: to reliably deliver the benefits of machine learning to end-users and business applications, at scale and with trustworthiness.
Teams that successfully navigate this evolution are able to harness foundation models to build products faster than ever - while maintaining the reliability and efficiency that good operations provide.
References
The insights and examples in this post are supported by recent research and industry sources, including an MDPI review on transitioning from MLOps to LLMOps, NVIDIA's technical blogs on GenAIOps and LLMOps, and various practitioner articles and discussions capturing the state of ML in 2024. Platforms like Modal and Ray have published guides showing new deployment patterns (serverless GPUs, distributed serving) in action.