
Fine-Tuning Llama 3.2 1B: GPU vs CPU Deployment on Cloud Run

I fine-tuned a Llama 3.2 1B model on portfolio data and deployed it to production with two strategies: vLLM on GPU, which reaches 30 tokens/sec, and llama.cpp on CPU, which cuts costs significantly. Here's the complete implementation, covering both deployment paths.

Dataset Preparation

Used the OpenAI chat-messages format for training data - simple conversation pairs about portfolio content:

{"messages": [{"role": "user", "content": "What tech stack does Sherlock use?"}, {"role": "assistant", "content": "Sherlock is built with FastAPI, Milvus, Celery and other technologies."}]}
{"messages": [{"role": "user", "content": "who are you"}, {"role": "assistant", "content": "I'm Sarim Ahmed, a Senior Software Engineer at Mira."}]}

Generated a synthetic dataset by using Ollama to prompt for question variations over resume/portfolio data. Total: ~200 conversation pairs.
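Each synthetic pair ends up as one JSON line in the format shown above. A minimal serialization helper (function and file names are my own, for illustration):

```python
import json

# Hypothetical helper: writes (question, answer) pairs as JSONL records
# in the {"messages": [...]} chat format expected by the fine-tuning step.
def write_training_file(pairs, path):
    with open(path, "w") as f:
        for question, answer in pairs:
            record = {"messages": [
                {"role": "user", "content": question},
                {"role": "assistant", "content": answer},
            ]}
            f.write(json.dumps(record) + "\n")

pairs = [("What tech stack does Sherlock use?",
          "Sherlock is built with FastAPI, Milvus, Celery and other technologies.")]
write_training_file(pairs, "train.jsonl")
```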

Fine-Tuning with Unsloth on Colab

Used Google Colab's free T4 GPU. Complete notebook:

Setup

!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps trl peft accelerate bitsandbytes

from unsloth import FastLanguageModel
import torch

max_seq_length = 2048
dtype = None  # Auto-detect
load_in_4bit = True  # Memory efficiency

# Load Llama 3.2 1B
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-1B-Instruct",
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

LoRA Configuration

model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                   "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
)
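For a sense of scale: LoRA with rank r adds r·(d_in + d_out) parameters per adapted matrix. Plugging in the published Llama 3.2 1B dimensions (hidden size 2048, intermediate size 8192, 8 KV heads of dim 64, 16 layers - worth double-checking against the model's config.json), the adapter comes to roughly 11M trainable parameters, well under 1% of the base model:

```python
r = 16
hidden, intermediate, kv_dim, layers = 2048, 8192, 512, 16

# LoRA adds two low-rank factors per matrix: r*d_in + r*d_out parameters.
def lora_params(d_in, d_out, r):
    return r * (d_in + d_out)

per_layer = (
    lora_params(hidden, hidden, r)          # q_proj
    + lora_params(hidden, kv_dim, r)        # k_proj (GQA: smaller output dim)
    + lora_params(hidden, kv_dim, r)        # v_proj
    + lora_params(hidden, hidden, r)        # o_proj
    + lora_params(hidden, intermediate, r)  # gate_proj
    + lora_params(hidden, intermediate, r)  # up_proj
    + lora_params(intermediate, hidden, r)  # down_proj
)
total = per_layer * layers
print(f"{total:,} trainable LoRA parameters")
```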

Training

from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=False,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=10,
        num_train_epochs=10,
        learning_rate=5e-4,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="cosine",
        seed=3407,
        output_dir="outputs",
    ),
)

trainer.train()

Training completed in ~15 minutes on T4.
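For reference, the run length follows directly from the arguments above, assuming the ~200-pair dataset:

```python
samples = 200
per_device_batch = 2
grad_accum = 4
epochs = 10

# Gradient accumulation multiplies the effective batch size.
effective_batch = per_device_batch * grad_accum   # 8 samples per optimizer step
steps_per_epoch = -(-samples // effective_batch)  # ceiling division
total_steps = steps_per_epoch * epochs
print(effective_batch, steps_per_epoch, total_steps)
```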

Export to GGUF

model.save_pretrained_gguf("sarim_portfolio", tokenizer, quantization_method="q4_k_m")

Upload to Hugging Face

After exporting, merged the LoRA adapters and pushed to HuggingFace:

# Merge adapters locally
python merge_adapters.py

# Push to HF
huggingface-cli upload YOUR_USERNAME/YOUR_MODEL ./merged_model

vLLM Docker Configuration

Created optimized Docker image for vLLM serving:

FROM python:3.11-slim

# Install system dependencies
RUN apt-get update && apt-get install -y \
    gcc \
    g++ \
    && rm -rf /var/lib/apt/lists/*

# Install vLLM and dependencies
RUN pip install --no-cache-dir \
    vllm \
    torch \
    transformers \
    huggingface-hub

# Environment variables
ENV PORT=8080
ENV MODEL_NAME="YOUR_USERNAME/YOUR_MODEL"
ENV HF_HOME=/tmp/huggingface

# Create non-root user for security
RUN useradd -m -u 1000 appuser && \
    mkdir -p /tmp/huggingface && \
    chown -R appuser:appuser /tmp/huggingface

USER appuser

# Start vLLM server
CMD python -m vllm.entrypoints.openai.api_server \
    --model $MODEL_NAME \
    --host 0.0.0.0 \
    --port $PORT \
    --max-model-len 2048 \
    --dtype auto \
    --download-dir /tmp/huggingface

Deploy to Cloud Run

Build and Push Image

# Build image
docker build -f Dockerfile.vllm -t gcr.io/YOUR_PROJECT/vllm-server .

# Push to GCR
docker push gcr.io/YOUR_PROJECT/vllm-server

Deploy with GPU

gcloud run deploy vllm-portfolio \
    --image gcr.io/YOUR_PROJECT/vllm-server \
    --platform managed \
    --region us-central1 \
    --gpu=1 \
    --gpu-type=nvidia-l4 \
    --memory=16Gi \
    --cpu=4 \
    --max-instances=1 \
    --min-instances=0 \
    --port=8080 \
    --allow-unauthenticated

Converting to GGUF for CPU Deployment

After fine-tuning, we converted the model to GGUF format for efficient CPU inference using llama.cpp:

Conversion Pipeline

Created a Jupyter notebook for the conversion process:

# Download model from HuggingFace
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="username/portfolio-model",
    local_dir="./original_model",
    token=HF_TOKEN
)

# Convert to GGUF F16 format
!python llama.cpp/convert_hf_to_gguf.py \
    ./original_model \
    --outfile model-f16.gguf \
    --outtype f16

# Quantize to Q4_K_M (4-bit quantization)
!./llama.cpp/llama-quantize \
    model-f16.gguf \
    portfolio-Q4_K_M.gguf \
    Q4_K_M

The quantization reduced the model size from 2.5GB to 808MB while maintaining output quality.
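The numbers line up with the parameter count (assuming the reported ~1.24B parameters for Llama 3.2 1B; Q4_K_M mixes 4- and 6-bit blocks plus higher-precision outliers, so the average lands above 4 bits per weight):

```python
params = 1.24e9        # approximate Llama 3.2 1B parameter count
fp16_bytes = 2.5e9     # F16 GGUF size from the conversion step
q4km_bytes = 808e6     # Q4_K_M GGUF size

reduction = 1 - q4km_bytes / fp16_bytes       # fraction of size saved
bits_per_weight = q4km_bytes * 8 / params     # effective average precision
print(f"{reduction:.0%} smaller, ~{bits_per_weight:.1f} bits/weight")
```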

Upload GGUF to HuggingFace

from huggingface_hub import HfApi

api = HfApi()
api.upload_file(
    path_or_fileobj="portfolio-Q4_K_M.gguf",
    path_in_repo="portfolio-Q4_K_M.gguf",
    repo_id="username/portfolio-model-GGUF",
    token=HF_TOKEN
)

CPU Deployment with llama.cpp

Docker Configuration for CPU

FROM python:3.11-slim

# Install dependencies
RUN apt-get update && apt-get install -y \
    build-essential cmake gcc g++ libgomp1 curl \
    && rm -rf /var/lib/apt/lists/*

# Install llama-cpp-python with OpenAI API server
RUN pip install --no-cache-dir \
    'llama-cpp-python[server]==0.2.90' \
    huggingface-hub

# Download GGUF model during build
ARG HF_TOKEN
RUN python -c "from huggingface_hub import hf_hub_download; \
    hf_hub_download( \
        repo_id='username/portfolio-model-GGUF', \
        filename='portfolio-Q4_K_M.gguf', \
        token='${HF_TOKEN}', \
        local_dir='/app' \
    )"

# Start OpenAI-compatible server
CMD python -m llama_cpp.server \
    --model /app/portfolio-Q4_K_M.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    --n_ctx 2048 \
    --n_threads 4

Deploy CPU Version to Cloud Run

# cloudbuild-cpu.yaml
steps:
  - name: 'gcr.io/cloud-builders/docker'
    args:
      - 'build'
      - '--build-arg'
      - 'HF_TOKEN=${_HF_TOKEN}'  # user-defined Cloud Build substitutions must start with "_"
      - '-t'
      - 'gcr.io/$PROJECT_ID/sarim-portfolio-ai-cpu:latest'
      - '-f'
      - 'Dockerfile.llamacpp'
      - '.'

  - name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'
    entrypoint: gcloud
    args:
      - 'run'
      - 'deploy'
      - 'sarim-portfolio-ai-cpu'
      - '--image=gcr.io/$PROJECT_ID/sarim-portfolio-ai-cpu:latest'
      - '--region=us-central1'
      - '--cpu=2'
      - '--memory=4Gi'
      - '--min-instances=0'
      - '--max-instances=3'

Performance Comparison: GPU vs CPU

| Metric | vLLM (GPU - L4) | llama.cpp (CPU) | Difference |
| --- | --- | --- | --- |
| Inference speed | 30 tokens/sec | ~5 tokens/sec (measured) | 6x slower |
| Sampling speed | - | 424 tokens/sec | - |
| Per-token latency | - | 2.36 ms | - |
| Total response time | <1 second | 4-5 seconds | 4-5x slower |
| Cold start | ~45 seconds | ~10 seconds | 4.5x faster |
| Memory usage | 16GB allocated | 4GB allocated | 75% less |
| Model size | 2.5GB (FP16) | 808MB (Q4_K_M) | 68% smaller |
| Instance type | L4 GPU + 4 vCPU | 2 vCPU only | - |
| Cloud Run pricing | GPU + CPU costs | CPU only | Significantly cheaper |

Cost Analysis

Based on Google Cloud Run pricing:

GPU Deployment (vLLM)

  • NVIDIA L4 GPU + 4 vCPU + 16GB memory
  • Scales to zero when idle
  • Higher cost per hour when active

CPU Deployment (llama.cpp)

  • 2 vCPU + 4GB memory only
  • Scales to zero when idle
  • Significantly lower cost per hour

The CPU deployment offers substantial cost savings for use cases where 4-5 second response times are acceptable, making LLM deployment viable for personal projects and low-traffic applications.
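A back-of-the-envelope comparison makes the tradeoff concrete. The per-second rates below are placeholders, not real Cloud Run prices - check the current pricing page - but the structure of the calculation holds, and with scale-to-zero both paths cost nothing while idle:

```python
def hourly_cost(cpu, memory_gib, gpu_rate=0.0,
                cpu_rate=0.000024, mem_rate=0.0000025):
    """Cost per active hour. All rates are hypothetical $/resource-second."""
    per_second = cpu * cpu_rate + memory_gib * mem_rate + gpu_rate
    return per_second * 3600

gpu = hourly_cost(cpu=4, memory_gib=16, gpu_rate=0.0002)  # placeholder L4 rate
cpu_only = hourly_cost(cpu=2, memory_gib=4)
print(f"GPU path ~${gpu:.2f}/h, CPU path ~${cpu_only:.2f}/h while serving")
```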

Client Integration

We've made our Cloudflare Workers endpoint OpenAI-compatible:

import requests

response = requests.post(
    "https://your-worker.workers.dev/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Who are you?"}],
        "stream": True,
        "temperature": 0.7,
        "max_tokens": 200
    },
    stream=True,  # required so requests yields the SSE chunks as they arrive
)

for line in response.iter_lines():
    if line:
        print(line.decode())

Production Architecture

Our dual-deployment strategy provides flexibility for different use cases:

High-Performance Path (GPU + vLLM)

  • Use Case: Real-time chat, customer-facing applications
  • API Gateway: Cloudflare Workers handle routing
  • Backend: vLLM on NVIDIA L4 GPU
  • Response Time: Sub-second responses
  • Resources: L4 GPU + 4 vCPU + 16GB RAM

Cost-Optimized Path (CPU + llama.cpp)

  • Use Case: Internal tools, batch processing, development
  • API Gateway: Same Cloudflare Workers interface
  • Backend: llama.cpp on CPU only
  • Response Time: 4-5 seconds
  • Resources: 2 vCPU + 4GB RAM

Both deployments share:

  • OpenAI-compatible API: Same client code works with both
  • Rate Limiting: 30 requests/hour per user using Cloudflare KV
  • Authentication: API key validation for secure access
  • Auto-scaling: Scale to zero when idle; capped at 1 GPU instance and 3 CPU instances

This hybrid architecture allows dynamic routing based on request priority, ensuring optimal cost-performance balance.
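The routing decision itself is simple. A sketch in Python for illustration (the real gateway is a Cloudflare Worker; the URLs and thresholds here are hypothetical):

```python
# Hypothetical backend URLs for the two Cloud Run services.
GPU_BACKEND = "https://vllm-portfolio-xxxx.run.app"
CPU_BACKEND = "https://sarim-portfolio-ai-cpu-xxxx.run.app"

def pick_backend(priority: str, interactive: bool) -> str:
    """Route latency-sensitive traffic to the GPU path, everything else to CPU."""
    if priority == "high" or interactive:
        return GPU_BACKEND
    return CPU_BACKEND

print(pick_backend("high", interactive=False))  # GPU path
print(pick_backend("low", interactive=False))   # CPU path
```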

Optimizations Applied

GPU Deployment (vLLM)

  1. Continuous Batching: vLLM's scheduler interleaves concurrent requests for steady throughput
  2. PagedAttention: efficient KV-cache management behind the 30 tokens/sec figure
  3. FP16 Inference: half-precision weights for reduced memory usage
  4. NVIDIA L4 GPU: good cost/performance fit for a 1B model; no tensor parallelism needed on a single GPU

CPU Deployment (llama.cpp)

  1. 4-bit Quantization: Q4_K_M reduces model to 808MB
  2. SIMD Optimization: CPU-specific vectorization
  3. Thread Pooling: Optimal thread allocation for Cloud Run
  4. Memory Mapping: Efficient model loading

Both Deployments

  1. Scale to Zero: No cost when idle
  2. OpenAI Compatibility: Standardized API interface
  3. Container Optimization: Minimal Docker images
  4. Cloud Build CI/CD: Automated deployment pipeline

When to Use Each Deployment

Choose GPU (vLLM) When:

  • Response time is critical (<1 second required)
  • High throughput needed (>10 concurrent users)
  • Customer-facing applications
  • Real-time interactive experiences
  • Budget allows for premium performance

Choose CPU (llama.cpp) When:

  • Cost optimization is priority
  • Response time of 4-5 seconds is acceptable
  • Low traffic (<100 requests/day)
  • Internal tools or development environments
  • Batch processing workloads

Key Learnings

  1. Fine-tuning: Unsloth makes QLoRA accessible on free Colab GPUs
  2. Model Size: 1B parameters can be surprisingly capable with quality data
  3. GPU Performance: vLLM achieves 30 tokens/sec on L4 GPU
  4. CPU Viability: llama.cpp makes CPU deployment practical with 4-5 second response times
  5. GGUF Format: Quantization maintains quality while reducing size by 68%
  6. Hybrid Strategy: Different deployments for different use cases maximizes value
  7. Cloud Run: Serverless GPU/CPU support simplifies production deployment

Conclusion

By implementing both GPU and CPU deployment strategies, we achieved the best of both worlds: blazing-fast inference when needed and significant cost savings when response time requirements are flexible. The CPU deployment dramatically reduces costs, making LLM hosting accessible for personal projects, while the GPU option remains available for demanding use cases.

The complete codebase, including Dockerfiles, conversion scripts, and deployment configurations, demonstrates that production LLM deployment can be both performant and cost-effective. With careful optimization and the right tools, you can serve custom fine-tuned models that fit your specific performance and budget requirements.