A Practical Deep Dive into LlamaStack, MaaS, llm-d, and vLLM
As AI systems transition from proof-of-concept to production, architectural complexity increases significantly. Responsibilities that appear well-defined in isolation, such as authentication, rate limiting, and request routing, often become duplicated or conflated across multiple layers. The result is not a shortage of capable components, but a lack of clearly delineated boundaries between them.
When the model serving layer, API gateway, orchestration platform, and runtime each intercept the same inference request, overlapping concerns are inevitable. Debugging in these environments frequently traces back to architectural ambiguity rather than implementation defects.
This post presents a structured reference model for the modern AI inference stack, organized around four distinct layers:
- Control Plane: governance and policy enforcement
- Data Plane: intelligent request routing
- Runtime: efficient execution
- Platform Layer: workflow orchestration
The discussion examines where LlamaStack, MaaS, llm-d, and vLLM fit within this architecture, and why maintaining clear separation of concerns is essential for building scalable, maintainable AI infrastructure.
Why the Confusion Exists
Traditional cloud architectures have well-established patterns. Kubernetes handles container orchestration. API gateways deal with authentication and routing. Application code implements business logic. These boundaries are clear because each layer operates on different abstractions.
AI infrastructure doesn’t play by the same rules. Consider what happens during a single inference request:
- The runtime needs to inspect the prompt for batching decisions
- The scheduler examines model type and payload size for routing
- The orchestration layer might track conversation state
- The control plane enforces quotas based on token counts
- Safety systems scan content at multiple checkpoints
Every layer touches the request. Every layer “cares” about what’s inside. This creates natural tension: should the runtime handle rate limiting since it knows about token throughput? Should the orchestration layer manage GPU selection since it understands workload patterns?
The answer isn’t to merge everything into a monolithic system. Experience from distributed systems teaches us that consolidation trades flexibility for short-term convenience. What we need instead is deliberate separation of concerns, even when those concerns operate on the same data.
Here’s a visual representation of how a single inference request flows through each layer, and what each layer extracts or modifies:
flowchart LR
subgraph Request_Journey
R1["User Query:<br/>'Explain quantum computing'"]
R2["+ API Key:<br/>tenant-abc-123"]
R3["+ Routing Hints:<br/>model=llama3-70b<br/>prefix_hash=7a3f2b"]
R4["+ Batch Context:<br/>tokens=[15234, 8991, ...]<br/>kv_pages=[0x2A, 0x3B]"]
R5["Response:<br/>'Quantum computing...'"]
end
R1 -->|LlamaStack adds| R2
R2 -->|MaaS validates & adds| R3
R3 -->|llm-d routes to best replica| R4
R4 -->|vLLM executes efficiently| R5
style R1 fill:#e1f5ff
style R2 fill:#fff3e0
style R3 fill:#f3e5f5
style R4 fill:#e8f5e9
style R5 fill:#fce4ec
Each layer enriches the request with metadata needed for downstream processing, then strips unnecessary information on the return path. This keeps interfaces clean and prevents information leakage across security boundaries.
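To make the enrich-and-strip pattern concrete, here is a minimal sketch of a request envelope in Python. The class, field names, and helper data are hypothetical illustrations of the idea, not an actual wire format used by any of these projects.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional

# Hypothetical tenant registry standing in for the control plane's auth store.
TENANTS = {"key-abc-123": "tenant-abc"}

@dataclass
class InferenceEnvelope:
    """Illustrative envelope: each layer attaches only the fields it owns."""
    prompt: str
    model: str
    tenant_id: Optional[str] = None     # attached by the control plane
    prefix_hash: Optional[str] = None   # attached by the data plane
    replica: Optional[str] = None       # chosen by the data plane

def control_plane_enrich(env: InferenceEnvelope, api_key: str) -> InferenceEnvelope:
    env.tenant_id = TENANTS[api_key]    # identity resolved once, up front
    return env

def data_plane_enrich(env: InferenceEnvelope) -> InferenceEnvelope:
    # Hash only a bounded prefix: enough signal for cache-affinity routing
    # without carrying the full prompt into scheduler state.
    env.prefix_hash = hashlib.sha256(env.prompt[:512].encode()).hexdigest()[:12]
    return env

def strip_for_client(generated_text: str) -> dict:
    # Internal metadata (tenant, prefix hash, replica) never crosses back out.
    return {"output": generated_text}
```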
The Four-Layer Architecture
Here’s the mental model I’ve found most useful when thinking about production AI infrastructure:
flowchart TB
App[Applications / Agents]
subgraph Platform_Layer
LS[LlamaStack]
end
subgraph Control_Plane
MAAS[MaaS]
end
subgraph Data_Plane
LLD[llm-d]
end
subgraph Runtime
VLLM[vLLM]
GPU[GPU Nodes]
end
App --> LS
LS --> MAAS
MAAS --> LLD
LLD --> VLLM
VLLM --> GPU
Reading from top to bottom: applications interact with LlamaStack for workflow orchestration. MaaS enforces governance policies. llm-d makes intelligent routing decisions. vLLM handles efficient execution on GPU hardware.
Think of each layer as answering a single question:
| Layer | Primary Responsibility |
|---|---|
| LlamaStack | What intelligent workflow is being executed? |
| MaaS | Who is allowed and under what policy? |
| llm-d | Where should this request run? |
| vLLM | How should it execute efficiently? |
When each component owns exactly one concern, the system becomes easier to reason about, debug, and scale. This isn’t theoretical; these boundaries emerge naturally from production experience.
To make the separation concrete, here’s what each layer “sees” and what it deliberately ignores:
graph TB
subgraph LlamaStack_View
LS1[Agent Workflow State]
LS2[Tool Invocation History]
LS3[Memory Context]
LS4[RAG Retrieval Results]
LS_NO[❌ GPU Utilization<br/>❌ Token Quotas<br/>❌ KV Cache State]
end
subgraph MaaS_View
M1[API Keys & Tenants]
M2[Token Usage Counters]
M3[Rate Limit Windows]
M4[SLA Thresholds]
M_NO[❌ Prompt Content<br/>❌ GPU Selection<br/>❌ Batch Composition]
end
subgraph llmd_View
L1[Replica Telemetry]
L2[KV Cache Occupancy]
L3[Prompt Prefix Hashes]
L4[Queue Lengths]
L_NO[❌ User Identity<br/>❌ Billing Data<br/>❌ GPU Kernels]
end
subgraph vLLM_View
V1[Active Batches]
V2[KV Page Tables]
V3[Token Generation State]
V4[GPU Memory Layout]
V_NO[❌ Authorization<br/>❌ Routing Logic<br/>❌ Agent Workflows]
end
This separation isn’t arbitrary; it’s what allows teams to work independently. The LlamaStack team can add new agent patterns without understanding vLLM internals. The MaaS team can update quota logic without touching the scheduler. The vLLM maintainers can optimize GPU kernels without breaking the control plane.
Control Plane: MaaS
Once AI moves from a team experiment to an organizational capability, governance becomes non-negotiable. You need to know who’s calling what, how much they’re using, and whether they’re allowed to do so.
Model as a Service (MaaS) provides this control layer by hosting pre-trained models on shared infrastructure and exposing them through governed APIs. The control plane handles everything you’d expect from enterprise infrastructure:
- Authentication and authorization
- Multi-tenant isolation
- Rate limiting and quotas
- Usage tracking and chargeback
- SLA enforcement
- Audit logging
According to Red Hat’s enterprise AI documentation, MaaS implementations typically integrate an API gateway with the AI infrastructure to “manage and monitor AI use at a very granular level.” This centralization streamlines operations while maintaining the security focus and compliance requirements that enterprises demand.
Here’s how the control plane components interact:
flowchart TB
Client[Client Request] --> APIGateway[API Gateway]
APIGateway --> Auth[Authentication Service]
Auth --> AuthZ[Authorization & Policy Engine]
AuthZ --> Quota[Quota Manager]
Quota --> RateLimit[Rate Limiter]
RateLimit --> Metrics[Usage Metrics Collector]
Metrics --> Forward[Forward to Data Plane]
subgraph MaaS_Control_Plane
APIGateway
Auth
AuthZ
Quota
RateLimit
Metrics
end
AuthDB[(Auth DB)] -.-> Auth
PolicyDB[(Policy DB)] -.-> AuthZ
QuotaDB[(Quota DB)] -.-> Quota
MetricsDB[(Metrics DB)] -.-> Metrics
Forward --> DataPlane[llm-d Scheduler]
Each control plane component has a specific role: the API gateway terminates TLS and validates request format, authentication verifies identity, authorization checks permissions against policies, quota management enforces consumption limits, rate limiting prevents abuse, and metrics collection enables billing and monitoring.
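As a rough sketch of the kind of gate the control plane applies before anything reaches a GPU, here is a hedged example of a per-tenant token budget combined with a simple per-minute rate limit. The data structures, limits, and field names are hypothetical, not MaaS’s actual implementation.

```python
import time
from dataclasses import dataclass, field

@dataclass
class TenantQuota:
    monthly_token_budget: int
    tokens_used: int = 0
    requests_per_minute: int = 60
    window_start: float = field(default_factory=time.monotonic)
    window_count: int = 0

def admit(quota: TenantQuota, estimated_tokens: int) -> bool:
    """Return True if the request may proceed to the data plane."""
    now = time.monotonic()
    if now - quota.window_start >= 60:            # roll the rate-limit window
        quota.window_start, quota.window_count = now, 0
    if quota.window_count >= quota.requests_per_minute:
        return False                              # rate limited
    if quota.tokens_used + estimated_tokens > quota.monthly_token_budget:
        return False                              # quota exhausted
    quota.window_count += 1
    quota.tokens_used += estimated_tokens         # reconciled later with actual usage
    return True
```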
What MaaS explicitly does not handle:
- GPU replica selection
- KV cache management
- Request batching optimization
- Agent workflow execution
These concerns belong in other layers. The control plane makes authorization decisions before the request enters the data plane. This separation prevents governance logic from leaking into performance-critical code paths. You don’t want authentication checks mixed with GPU scheduling decisions.
Data Plane: llm-d
After authorization clears, you face a different problem: which GPU should handle this request?
At small scale, round-robin load balancing works fine. But as you add replicas, increase concurrency, and mix different workload types (chat, embeddings, structured generation), naive scheduling leaves performance on the table.
llm-d is a Kubernetes-native inference scheduler built specifically for LLM workloads. Developed collaboratively by IBM, Google, and Red Hat, it makes routing decisions based on inference-specific signals rather than generic infrastructure metrics.
flowchart LR
Req[Request] --> Analyze[Inspect Model & Prompt]
Analyze --> Score[Score Replicas]
Score --> Route[Select Optimal Instance]
Route --> Execute[vLLM Instance]
The scheduling process works like this:
Inspect Model & Prompt: The scheduler extracts routing signals from the request without full deserialization. It identifies the target model, prompt prefix patterns, and generation parameters.
Score Replicas: Available instances get scored using inference-aware metrics: current KV cache occupancy, decode queue backlog, prompt prefix overlap, and recent token latency. Recent llm-d benchmarks show that cache-aware routing achieves ~90% KV cache hit rates compared to ~45% with standard load balancing, a 70% reduction in compute time for repeated prompts.
Select Optimal Instance: The replica with the best score gets the request. This often means routing to the instance already holding relevant cached prefixes, dramatically improving both latency and throughput.
The llm-d 0.5 release adds hierarchical KV offloading, which decouples cache capacity from GPU memory by using CPU and filesystem tiers. This enables much larger effective cache pools without proportionally increasing GPU costs.
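Conceptually, hierarchical offloading turns a cache lookup into a walk down the tiers. The sketch below models the tiers as plain dictionaries purely to illustrate the idea; it is not llm-d’s cache code, and a real implementation would manage capacity, eviction, and transfer costs.

```python
from typing import Optional

# Illustrative tiers, fastest to slowest. In a real system the GPU tier would be
# device memory pages and the filesystem tier a local or networked block store.
gpu_tier: dict[str, bytes] = {}
cpu_tier: dict[str, bytes] = {}
disk_tier: dict[str, bytes] = {}

def lookup_kv(prefix_hash: str) -> Optional[bytes]:
    """Check each tier in order; naively promote hits toward the GPU tier."""
    for tier in (gpu_tier, cpu_tier, disk_tier):
        blob = tier.get(prefix_hash)
        if blob is not None:
            gpu_tier[prefix_hash] = blob   # promotion sketch, ignores capacity limits
            return blob
    return None                            # miss: prefill must recompute the prefix
```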
Here’s a detailed view of the llm-d scheduling decision process:
flowchart TD
Request[Incoming Request] --> Extract[Extract Routing Metadata]
Extract --> ModelID[Model ID]
Extract --> PrefixHash[Prompt Prefix Hash]
Extract --> GenParams[Generation Parameters]
ModelID --> Filter[Filter Compatible Replicas]
Filter --> Replica1[vLLM Replica 1]
Filter --> Replica2[vLLM Replica 2]
Filter --> Replica3[vLLM Replica 3]
Replica1 --> Score1[Score Calculation]
Replica2 --> Score2[Score Calculation]
Replica3 --> Score3[Score Calculation]
subgraph Scoring_Factors
KVHit[KV Cache Hit Probability: +50]
QueueLen[Queue Length: -20]
DecodeLoad[Decode Backlog: -15]
TokenLat[Recent Token Latency: -10]
end
Score1 -.-> Scoring_Factors
Score2 -.-> Scoring_Factors
Score3 -.-> Scoring_Factors
Score1 --> Compare{Compare Scores}
Score2 --> Compare
Score3 --> Compare
Compare -->|Highest Score| Selected[Selected Replica]
Selected --> Route[Route Request]
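Translating those factors into code, here is a hedged sketch of a replica scorer that reuses the illustrative weights from the diagram above. The telemetry fields and weights are examples for exposition, not llm-d’s exact scoring function.

```python
from dataclasses import dataclass

@dataclass
class ReplicaTelemetry:
    name: str
    kv_hit_probability: float   # 0.0-1.0, estimated from prefix-hash overlap
    queue_length: int           # requests waiting ahead of this one
    decode_backlog: int         # sequences currently mid-generation
    recent_token_latency_ms: float

def score(r: ReplicaTelemetry) -> float:
    # Reward likely cache hits; penalize queuing, decode pressure, and slow tokens.
    return (50 * r.kv_hit_probability
            - 20 * r.queue_length
            - 15 * r.decode_backlog
            - 10 * (r.recent_token_latency_ms / 100))

def select_replica(replicas: list[ReplicaTelemetry]) -> ReplicaTelemetry:
    return max(replicas, key=score)

# Example: the replica with a warm prefix cache wins despite a slightly longer queue.
replicas = [
    ReplicaTelemetry("replica-1", 0.9, 1, 1, 80),
    ReplicaTelemetry("replica-2", 0.1, 0, 0, 60),
]
print(select_replica(replicas).name)  # replica-1
```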
Kubernetes schedules pods based on CPU and RAM. llm-d schedules inference based on token generation dynamics. That’s the fundamental difference.
Runtime: vLLM
Once a request lands on a GPU, the runtime determines how efficiently that hardware gets used. This is where the actual model execution happens: prompt processing, KV cache management, token generation, and response streaming.
vLLM is a high-throughput, memory-efficient inference engine originally developed at UC Berkeley’s Sky Computing Lab. It’s become the de facto standard for LLM serving, offering up to 24x throughput improvements compared to naive implementations.
flowchart TB
Request --> Queue
Queue --> Batch
Batch --> KVCache
KVCache --> TokenLoop
TokenLoop --> Response
The execution flow:
Requests enter a queue managed by the vLLM scheduler. Continuous batching dynamically groups compatible requests. Unlike traditional batching that waits for a full batch to form, continuous batching adds and removes requests as they arrive and complete.
PagedAttention, vLLM’s core innovation, treats KV cache as pageable memory blocks rather than contiguous tensors. This eliminates fragmentation and allows much higher effective batch sizes. Cache pages get reused across requests when prompts share prefixes.
Token generation runs iteratively, with the scheduler making batching decisions at each step. As tokens stream back, completed requests leave the batch and new ones enter, keeping GPU utilization high.
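To illustrate the difference from fixed-size batching, here is a toy continuous-batching loop. It is a deliberately simplified sketch, not vLLM’s scheduler: a real engine also tracks KV memory, distinguishes prefill from decode, and can preempt requests.

```python
from collections import deque

def continuous_batching(incoming: deque, step_fn, max_batch: int = 8):
    """Toy loop: admit waiting requests each iteration, retire finished ones."""
    active = []
    while incoming or active:
        # Admit new requests up to the batch limit; never wait for a full batch.
        while incoming and len(active) < max_batch:
            active.append(incoming.popleft())
        step_fn(active)                      # one decode step for the whole batch
        for r in [r for r in active if r["done"]]:
            yield r                          # stream results as they complete
        active = [r for r in active if not r["done"]]

def demo_step(batch):
    # Stand-in for a model forward pass: append one "token" per request.
    for r in batch:
        r["tokens"].append("tok")
        r["done"] = len(r["tokens"]) >= r["target_len"]

requests = deque({"tokens": [], "target_len": n, "done": False} for n in (2, 4))
for completed in continuous_batching(requests, demo_step):
    print(len(completed["tokens"]))   # prints 2, then 4
```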
Recent work on NVIDIA’s Blackwell architecture shows 38% throughput gains and 13% latency improvements through kernel fusion via torch.compile and tight FlashInfer integration.
Here’s how vLLM’s internal architecture handles concurrent requests efficiently:
flowchart TB
subgraph Request_Lifecycle
Req1[Request 1: Prefill] --> Queue[Request Queue]
Req2[Request 2: Decode] --> Queue
Req3[Request 3: Prefill] --> Queue
Req4[Request 4: Decode] --> Queue
end
Queue --> Scheduler[vLLM Scheduler]
Scheduler --> BatchDecision{Batching Decision}
BatchDecision --> Batch[Active Batch]
subgraph GPU_Execution
Batch --> PagedAttn[PagedAttention]
PagedAttn --> KVMgr[KV Cache Manager]
KVMgr --> GPUMem[GPU Memory Pages]
KVMgr --> Reuse[Prefix Reuse]
GPUMem --> Compute[GPU Kernel Execution]
Reuse --> Compute
Compute --> TokenGen[Token Generation]
end
TokenGen --> StreamOut[Stream Output]
TokenGen --> Continue{More Tokens?}
Continue -->|Yes| Batch
Continue -->|No| Complete[Request Complete]
Complete --> FreePages[Free KV Pages]
FreePages --> KVMgr
StreamOut --> Client[Client Response]
The diagram shows how vLLM continuously manages multiple requests in different states: some in prefill (processing the initial prompt), others in decode (generating tokens). The PagedAttention mechanism allows efficient memory sharing and reuse, while the scheduler makes per-iteration decisions about which requests to include in the current batch.
vLLM optimizes execution within a single instance. It doesn’t make routing decisions; that’s the data plane’s job. It doesn’t enforce quotas; that’s the control plane’s responsibility. It focuses entirely on extracting maximum performance from available GPU resources.
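For a feel of the runtime in isolation, here is a minimal offline-inference example using vLLM’s Python API. The model name is just a small placeholder; production deployments typically run vLLM’s OpenAI-compatible server behind the layers described above rather than embedding the engine directly.

```python
from vllm import LLM, SamplingParams

# Small placeholder model so the example runs on modest hardware.
llm = LLM(model="facebook/opt-125m")
params = SamplingParams(temperature=0.8, max_tokens=64)

outputs = llm.generate(["Explain quantum computing in one sentence."], params)
for out in outputs:
    print(out.outputs[0].text)
```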
Platform Layer: LlamaStack
The layers we’ve discussed so far (control, data, and runtime) handle the mechanics of serving models. But modern AI applications need more than raw inference. They need multi-step reasoning, tool use, memory management, and RAG patterns.
LlamaStack is an open-source platform designed to define and standardize the core building blocks for AI application development, exposing them through a unified, consistent interface.
The platform handles:
- Agent workflows: multi-step task decomposition and planning
- Tool invocation: structured calling of external APIs and functions
- Memory systems: conversation history and context management
- RAG orchestration: retrieval and generation coordination
- Provider abstraction: swappable backends for different deployment scenarios
flowchart TB
User --> AgentPlan
AgentPlan --> ToolCall
ToolCall --> Memory
Memory --> Model_Invocation
Model_Invocation --> Response
An agent receives a task and breaks it into steps. It might call external tools, retrieve relevant memory, and invoke models multiple times before synthesizing a response. The key is that LlamaStack operates at the workflow level, not the infrastructure level.
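As a rough illustration of the level LlamaStack operates at, here is a generic plan-act loop in Python. It is not LlamaStack’s client API; the function names and the stand-in LLM and tool are hypothetical, and the point is only that the platform layer composes inference calls, tools, and memory without touching infrastructure concerns.

```python
def run_agent(task: str, llm, tools: dict, memory: list, max_steps: int = 5) -> str:
    """Generic plan-act loop: decide, optionally call a tool, repeat, then answer."""
    context = list(memory)
    for _ in range(max_steps):
        decision = llm(f"Task: {task}\nContext: {context}\nNext action?")
        if decision.startswith("TOOL:"):
            name, _, args = decision[5:].partition(" ")
            context.append(tools[name](args))   # call an external tool
        else:
            memory.append(decision)             # persist the final answer
            return decision
    return "Stopped: step budget exhausted."

# Hypothetical usage with stand-in components.
answer = run_agent(
    "What is 2 + 2?",
    llm=lambda prompt: "TOOL:calc 2+2" if "Context: []" in prompt else "The answer is 4.",
    tools={"calc": lambda expr: str(eval(expr))},  # toy calculator; never eval untrusted input
    memory=[],
)
print(answer)  # The answer is 4.
```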
LlamaStack doesn’t manage GPUs; that’s the runtime’s job. It doesn’t enforce quotas; that’s the control plane. It orchestrates intelligent behavior by composing lower-level primitives into coherent applications.
How the Layers Collaborate
Theory aside, let’s trace what actually happens when a user sends a request through this stack.
flowchart LR
User --> LlamaStack
LlamaStack --> MaaS
MaaS --> llm-d
llm-d --> vLLM
vLLM --> GPU
GPU --> vLLM
vLLM --> LlamaStack
LlamaStack --> User
A user asks a complex question that requires multi-step reasoning. LlamaStack receives this, identifies it as an agent task, and begins orchestrating a plan. That plan includes calling an LLM for reasoning.
Before the inference happens, the request passes through MaaS. The control plane validates the API key, checks that this user hasn’t exceeded their monthly quota, and confirms they’re authorized for this model tier. If everything checks out, it forwards the request with minimal added latency.
Now llm-d gets involved. It inspects the request metadata (target model, prompt prefix, expected output length) and scores the available vLLM replicas. One replica already has a relevant prefix cached from a similar recent query. That replica gets the request.
vLLM receives it, recognizes the prefix cache hit, skips re-processing those tokens, and begins generation. Continuous batching lets it handle this alongside other active requests. Tokens stream back as they’re generated.
The response flows back through LlamaStack, which incorporates it into the agent’s workflow. Depending on the result, the agent might trigger additional tool calls or more inference requests, each following the same path.
Notice what didn’t happen: no layer tried to do another layer’s job. MaaS didn’t attempt scheduling. llm-d didn’t manage KV cache. vLLM didn’t enforce quotas. LlamaStack didn’t optimize GPU kernels. Clean separation of concerns scales far better than clever consolidation.
Here’s a detailed sequence diagram showing the timing and interactions:
sequenceDiagram
participant User
participant LlamaStack
participant MaaS
participant llmd as llm-d
participant vLLM
participant GPU
User->>LlamaStack: Complex query requiring reasoning
activate LlamaStack
Note over LlamaStack: Decompose into agent workflow<br/>Identify LLM inference needed
LlamaStack->>MaaS: Inference request + API key
activate MaaS
Note over MaaS: Validate API key<br/>Check quota (tokens/month)<br/>Verify model access<br/>Record usage
MaaS-->>LlamaStack: Authorized ✓
deactivate MaaS
LlamaStack->>llmd: Inference request + metadata
activate llmd
Note over llmd: Extract model ID<br/>Hash prompt prefix<br/>Query replica telemetry<br/>Score candidates<br/>Select best replica
llmd-->>LlamaStack: Route to Replica 2
deactivate llmd
LlamaStack->>vLLM: Execute inference on Replica 2
activate vLLM
Note over vLLM: Add to request queue<br/>Prefix cache HIT<br/>Add to active batch
vLLM->>GPU: Batch execution
activate GPU
Note over GPU: Process batch<br/>Generate tokens<br/>Update KV cache
GPU-->>vLLM: Token stream
deactivate GPU
vLLM-->>LlamaStack: Stream tokens
deactivate vLLM
LlamaStack->>LlamaStack: Incorporate into workflow<br/>May trigger tool calls<br/>or additional inference
LlamaStack-->>User: Final response
deactivate LlamaStack
This sequence illustrates several key points: authorization happens before expensive operations, routing decisions use real-time telemetry, cache hits avoid redundant computation, and each layer completes its work before the next layer takes over. The total latency is the sum of these steps, which is why keeping each layer focused and efficient matters.
Deployment Modes
Not every deployment needs every layer. The architecture should match your current scale and complexity, not some theoretical future state.
Development & Prototyping
flowchart LR
App --> LlamaStack --> vLLM
When you’re building a prototype or running local development, just connect your application to LlamaStack backed by a single vLLM instance. No governance layer, no scheduler, no complexity.
This works perfectly fine for:
- Individual developer workflows
- Proof-of-concept demos
- Small team experiments
- Single-user applications
You’re giving up features you don’t need yet in exchange for operational simplicity. That’s a good trade.
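Under the hood, this mode boils down to OpenAI-compatible calls against a single vLLM instance. The sketch below shows such a call directly; the base URL, port, and model name are assumptions for a typical local `vllm serve` run, so adjust them to your environment.

```python
# Assumes a local vLLM server started with something like:
#   vllm serve meta-llama/Llama-3.1-8B-Instruct
# The base URL, port, and model name are placeholders for a typical local setup.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain quantum computing briefly."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```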
Multi-Replica Production
flowchart LR
App --> LlamaStack --> llm-d --> vLLM
As traffic grows and you add GPU replicas, naive load balancing starts showing cracks. You see high P99 latencies despite available capacity. GPU utilization is uneven. Cache hit rates are poor.
This is when llm-d makes sense. You get:
- Inference-aware request routing
- KV cache affinity scheduling
- Better resource utilization across replicas
- Lower tail latencies under load
The added complexity pays for itself through efficiency gains.
Enterprise Deployment
flowchart LR
App --> LlamaStack --> MaaS --> llm-d --> vLLM --> GPU
Multi-tenant environments, chargeback requirements, compliance mandates, and SLA contracts all point toward needing a proper control plane. This is the full stack.
MaaS gives you:
- Per-tenant quota enforcement
- Usage-based billing data
- Audit trails for compliance
- SLA monitoring and alerting
- Multi-tier access controls
Deploy all four layers when organizational requirements demand it, not before.
Here’s a comprehensive architectural diagram showing all components and their interactions in an enterprise deployment:
flowchart TB
subgraph Client_Applications
App1[Web App]
App2[Mobile App]
App3[CLI Tool]
end
subgraph Platform_Layer [" Platform Layer: LlamaStack "]
Agent[Agent Orchestrator]
Tools[Tool Registry]
Memory[Memory Manager]
RAG[RAG Coordinator]
end
subgraph Control_Plane [" Control Plane: MaaS "]
Gateway[API Gateway]
AuthN[Authentication]
AuthZ[Authorization]
Quota[Quota Manager]
Billing[Billing & Metrics]
end
subgraph Data_Plane [" Data Plane: llm-d "]
InfGW[Inference Gateway]
Scheduler[Request Scheduler]
Telemetry[Telemetry Collector]
Router[Replica Router]
end
subgraph Runtime_Layer [" Runtime Layer: vLLM "]
direction LR
subgraph Replica1 [" vLLM Replica 1 "]
R1Sched[Scheduler]
R1KV[KV Manager]
R1GPU1[GPU 0]
end
subgraph Replica2 [" vLLM Replica 2 "]
R2Sched[Scheduler]
R2KV[KV Manager]
R2GPU2[GPU 1]
end
subgraph Replica3 [" vLLM Replica 3 "]
R3Sched[Scheduler]
R3KV[KV Manager]
R3GPU3[GPU 2]
end
end
App1 & App2 & App3 --> Agent
Agent --> Tools & Memory & RAG
Agent --> Gateway
Gateway --> AuthN --> AuthZ --> Quota --> Billing
Billing --> InfGW
InfGW --> Scheduler --> Telemetry --> Router
Router -->|High KV hit| Replica1
Router -->|Low queue| Replica2
Router -->|Best latency| Replica3
R1Sched --> R1KV --> R1GPU1
R2Sched --> R2KV --> R2GPU2
R3Sched --> R3KV --> R3GPU3
R1GPU1 & R2GPU2 & R3GPU3 -.->|Metrics| Telemetry
style Platform_Layer fill:#e3f2fd
style Control_Plane fill:#fff3e0
style Data_Plane fill:#f3e5f5
style Runtime_Layer fill:#e8f5e9
This complete view shows how independent scaling works: you can add more vLLM replicas without touching the control plane, update quota policies without redeploying the scheduler, or enhance agent capabilities without modifying the runtime layer.
Common Questions
“Doesn’t vLLM only handle chat completion?”
No. vLLM supports multiple inference modes: chat completion, text generation, embeddings, structured output generation, and streaming. It’s a general-purpose LLM runtime, not a chat-specific server.
“Why does llm-d need to inspect request payloads? Isn’t that a security concern?”
llm-d extracts lightweight routing metadata (target model, prompt prefix patterns, expected output length) without logging or persisting sensitive content. It needs just enough information to make intelligent scheduling decisions. Think of it like a Layer 7 load balancer that routes based on HTTP headers without storing request bodies.
“Isn’t LlamaStack competing with MaaS?”
They solve different problems. LlamaStack orchestrates intelligent workflows (agents, tools, memory). MaaS enforces governance (quotas, auth, billing). You’d typically use both: LlamaStack for application logic, MaaS for operational control.
“Can’t Kubernetes handle scheduling instead of llm-d?”
Kubernetes schedules pods based on CPU, memory, and custom resource requests. It doesn’t understand KV cache occupancy, prompt prefix overlap, or decode queue backlog. llm-d makes decisions based on inference-specific telemetry that Kubernetes can’t see. They’re complementary: Kubernetes manages the pod lifecycle, llm-d routes requests within those pods.
Closing Thoughts
Production AI infrastructure works best when it’s organized around clear architectural boundaries. The four-layer model presented here (platform, control, data, and runtime) emerges naturally from the distinct concerns that real deployments face.
LlamaStack orchestrates intelligent workflows at the application level. MaaS enforces governance and policy. llm-d makes inference-aware routing decisions. vLLM optimizes GPU execution efficiency.
None of these components is particularly complicated on its own. The complexity comes from unclear boundaries: when scheduling logic leaks into the runtime, when governance checks scatter across layers, when each component tries to solve problems outside its core responsibility.
Getting the architecture right doesn’t eliminate complexity, but it contains it. Each layer can evolve independently. Teams can debug issues without understanding the entire stack. New capabilities slot into the appropriate layer without sprawling refactors.
If you’re building AI infrastructure, start simple. Add layers as actual requirements emerge. But when you do add them, put them in the right place. The architectural clarity pays compounding dividends as the system scales.