
Fine-Tuning Llama 3.2 1B: GPU vs CPU Deployment on Cloud Run

I fine-tuned a Llama 3.2 1B model on portfolio data and deployed it to production with two strategies: vLLM on GPU, which reaches 30 tokens/sec, and llama.cpp on CPU, which cuts costs significantly. Here's the complete implementation, covering both deployment paths.

Dataset Preparation

Used the OpenAI chat-messages format for training data - simple conversation pairs about portfolio content:

{"messages": [{"role": "user", "content": "What tech stack does Sherlock use?"}, {"role": "assistant", "content": "Sherlock is built with FastAPI, Milvus, Celery and other technologies."}]}
{"messages": [{"role": "user", "content": "who are you"}, {"role": "assistant", "content": "I'm Sarim Ahmed, a Senior Software Engineer at Mira."}]}

Generated a synthetic dataset by using Ollama to prompt for question variations over resume/portfolio data. Total: ~200 conversation pairs.
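Each synthetic pair ends up as one JSON line in the format shown above. A minimal serialization helper (function and file names are my own, for illustration):

```python
import json

# Hypothetical helper: writes (question, answer) pairs as JSONL records
# in the {"messages": [...]} chat format expected by the fine-tuning step.
def write_training_file(pairs, path):
    with open(path, "w") as f:
        for question, answer in pairs:
            record = {"messages": [
                {"role": "user", "content": question},
                {"role": "assistant", "content": answer},
            ]}
            f.write(json.dumps(record) + "\n")

pairs = [("What tech stack does Sherlock use?",
          "Sherlock is built with FastAPI, Milvus, Celery and other technologies.")]
write_training_file(pairs, "train.jsonl")
```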

Fine-Tuning with Unsloth on Colab

Used Google Colab's free T4 GPU. Complete notebook:

Setup

!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps trl peft accelerate bitsandbytes

from unsloth import FastLanguageModel
import torch

max_seq_length = 2048
dtype = None  # Auto-detect
load_in_4bit = True  # Memory efficiency

# Load Llama 3.2 1B
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-1B-Instruct",
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

LoRA Configuration

model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                   "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
)
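For a sense of scale: LoRA with rank r adds r·(d_in + d_out) parameters per adapted matrix. Plugging in the published Llama 3.2 1B dimensions (hidden size 2048, intermediate size 8192, 8 KV heads of dim 64, 16 layers - worth double-checking against the model's config.json), the adapter comes to roughly 11M trainable parameters, well under 1% of the base model:

```python
r = 16
hidden, intermediate, kv_dim, layers = 2048, 8192, 512, 16

# LoRA adds two low-rank factors per matrix: r*d_in + r*d_out parameters.
def lora_params(d_in, d_out, r):
    return r * (d_in + d_out)

per_layer = (
    lora_params(hidden, hidden, r)          # q_proj
    + lora_params(hidden, kv_dim, r)        # k_proj (GQA: smaller output dim)
    + lora_params(hidden, kv_dim, r)        # v_proj
    + lora_params(hidden, hidden, r)        # o_proj
    + lora_params(hidden, intermediate, r)  # gate_proj
    + lora_params(hidden, intermediate, r)  # up_proj
    + lora_params(intermediate, hidden, r)  # down_proj
)
total = per_layer * layers
print(f"{total:,} trainable LoRA parameters")
```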

Training

from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=False,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=10,
        num_train_epochs=10,
        learning_rate=5e-4,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="cosine",
        seed=3407,
        output_dir="outputs",
    ),
)

trainer.train()

Training completed in ~15 minutes on T4.
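For reference, the run length follows directly from the arguments above, assuming the ~200-pair dataset:

```python
samples = 200
per_device_batch = 2
grad_accum = 4
epochs = 10

# Gradient accumulation multiplies the effective batch size.
effective_batch = per_device_batch * grad_accum   # 8 samples per optimizer step
steps_per_epoch = -(-samples // effective_batch)  # ceiling division
total_steps = steps_per_epoch * epochs
print(effective_batch, steps_per_epoch, total_steps)
```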

Export to GGUF

model.save_pretrained_gguf("sarim_portfolio", tokenizer, quantization_method="q4_k_m")

Upload to Hugging Face

After exporting, merged the LoRA adapters and pushed to HuggingFace:

# Merge adapters locally
python merge_adapters.py

# Push to HF
huggingface-cli upload YOUR_USERNAME/YOUR_MODEL ./merged_model

vLLM Docker Configuration

Created optimized Docker image for vLLM serving:

FROM python:3.11-slim

# Install system dependencies
RUN apt-get update && apt-get install -y \
    gcc \
    g++ \
    && rm -rf /var/lib/apt/lists/*

# Install vLLM and dependencies
RUN pip install --no-cache-dir \
    vllm \
    torch \
    transformers \
    huggingface-hub

# Environment variables
ENV PORT=8080
ENV MODEL_NAME="YOUR_USERNAME/YOUR_MODEL"
ENV HF_HOME=/tmp/huggingface

# Create non-root user for security
RUN useradd -m -u 1000 appuser && \
    mkdir -p /tmp/huggingface && \
    chown -R appuser:appuser /tmp/huggingface

USER appuser

# Start vLLM server
CMD python -m vllm.entrypoints.openai.api_server \
    --model $MODEL_NAME \
    --host 0.0.0.0 \
    --port $PORT \
    --max-model-len 2048 \
    --dtype auto \
    --download-dir /tmp/huggingface

Deploy to Cloud Run

Build and Push Image

# Build image
docker build -f Dockerfile.vllm -t gcr.io/YOUR_PROJECT/vllm-server .

# Push to GCR
docker push gcr.io/YOUR_PROJECT/vllm-server

Deploy with GPU

gcloud run deploy vllm-portfolio \
    --image gcr.io/YOUR_PROJECT/vllm-server \
    --platform managed \
    --region us-central1 \
    --gpu=1 \
    --gpu-type=nvidia-l4 \
    --memory=16Gi \
    --cpu=4 \
    --max-instances=1 \
    --min-instances=0 \
    --port=8080 \
    --allow-unauthenticated

Converting to GGUF for CPU Deployment

After fine-tuning, we converted the model to GGUF format for efficient CPU inference using llama.cpp:

Conversion Pipeline

Created a Jupyter notebook for the conversion process:

# Download model from HuggingFace
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="username/portfolio-model",
    local_dir="./original_model",
    token=HF_TOKEN
)

# Convert to GGUF F16 format
!python llama.cpp/convert_hf_to_gguf.py \
    ./original_model \
    --outfile model-f16.gguf \
    --outtype f16

# Quantize to Q4_K_M (4-bit quantization)
!./llama.cpp/llama-quantize \
    model-f16.gguf \
    portfolio-Q4_K_M.gguf \
    Q4_K_M

The quantization reduced the model size from 2.5GB to 808MB while maintaining output quality.
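The numbers line up with the parameter count (assuming the reported ~1.24B parameters for Llama 3.2 1B; Q4_K_M mixes 4- and 6-bit blocks plus higher-precision outliers, so the average lands above 4 bits per weight):

```python
params = 1.24e9        # approximate Llama 3.2 1B parameter count
fp16_bytes = 2.5e9     # F16 GGUF size from the conversion step
q4km_bytes = 808e6     # Q4_K_M GGUF size

reduction = 1 - q4km_bytes / fp16_bytes       # fraction of size saved
bits_per_weight = q4km_bytes * 8 / params     # effective average precision
print(f"{reduction:.0%} smaller, ~{bits_per_weight:.1f} bits/weight")
```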

Upload GGUF to HuggingFace

from huggingface_hub import HfApi

api = HfApi()
api.upload_file(
    path_or_fileobj="portfolio-Q4_K_M.gguf",
    path_in_repo="portfolio-Q4_K_M.gguf",
    repo_id="username/portfolio-model-GGUF",
    token=HF_TOKEN
)

CPU Deployment with llama.cpp

Docker Configuration for CPU

FROM python:3.11-slim

# Install dependencies
RUN apt-get update && apt-get install -y \
    build-essential cmake gcc g++ libgomp1 curl \
    && rm -rf /var/lib/apt/lists/*

# Install llama-cpp-python with OpenAI API server
RUN pip install --no-cache-dir \
    'llama-cpp-python[server]==0.2.90' \
    huggingface-hub

# Download GGUF model during build
ARG HF_TOKEN
RUN python -c "from huggingface_hub import hf_hub_download; \
    hf_hub_download( \
        repo_id='username/portfolio-model-GGUF', \
        filename='portfolio-Q4_K_M.gguf', \
        token='${HF_TOKEN}', \
        local_dir='/app' \
    )"

# Start OpenAI-compatible server
CMD python -m llama_cpp.server \
    --model /app/portfolio-Q4_K_M.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    --n_ctx 2048 \
    --n_threads 4

Deploy CPU Version to Cloud Run

# cloudbuild-cpu.yaml
steps:
  - name: 'gcr.io/cloud-builders/docker'
    args:
      - 'build'
      - '--build-arg'
      - 'HF_TOKEN=${_HF_TOKEN}'  # user-defined Cloud Build substitutions must start with "_"
      - '-t'
      - 'gcr.io/$PROJECT_ID/sarim-portfolio-ai-cpu:latest'
      - '-f'
      - 'Dockerfile.llamacpp'
      - '.'

  - name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'
    entrypoint: gcloud
    args:
      - 'run'
      - 'deploy'
      - 'sarim-portfolio-ai-cpu'
      - '--image=gcr.io/$PROJECT_ID/sarim-portfolio-ai-cpu:latest'
      - '--region=us-central1'
      - '--cpu=2'
      - '--memory=4Gi'
      - '--min-instances=0'
      - '--max-instances=3'

Performance Comparison: GPU vs CPU

| Metric | vLLM (GPU - L4) | llama.cpp (CPU) | Difference |
| --- | --- | --- | --- |
| Inference speed | 30 tokens/sec | ~5 tokens/sec (measured) | 6x slower |
| Sampling speed | - | 424 tokens/sec | - |
| Per-token latency | - | 2.36 ms | - |
| Total response time | <1 second | 4-5 seconds | 4-5x slower |
| Cold start | ~45 seconds | ~10 seconds | 4.5x faster |
| Memory usage | 16GB allocated | 4GB allocated | 75% less |
| Model size | 2.5GB (FP16) | 808MB (Q4_K_M) | 68% smaller |
| Instance type | L4 GPU + 4 vCPU | 2 vCPU only | - |
| Cloud Run pricing | GPU + CPU costs | CPU only | Significantly cheaper |

Cost Analysis

Based on Google Cloud Run pricing:

GPU Deployment (vLLM)

  • NVIDIA L4 GPU + 4 vCPU + 16GB memory
  • Scales to zero when idle
  • Higher cost per hour when active

CPU Deployment (llama.cpp)

  • 2 vCPU + 4GB memory only
  • Scales to zero when idle
  • Significantly lower cost per hour

The CPU deployment offers substantial cost savings for use cases where 4-5 second response times are acceptable, making LLM deployment viable for personal projects and low-traffic applications.
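A back-of-the-envelope comparison makes the tradeoff concrete. The per-second rates below are placeholders, not real Cloud Run prices - check the current pricing page - but the structure of the calculation holds, and with scale-to-zero both paths cost nothing while idle:

```python
def hourly_cost(cpu, memory_gib, gpu_rate=0.0,
                cpu_rate=0.000024, mem_rate=0.0000025):
    """Cost per active hour. All rates are hypothetical $/resource-second."""
    per_second = cpu * cpu_rate + memory_gib * mem_rate + gpu_rate
    return per_second * 3600

gpu = hourly_cost(cpu=4, memory_gib=16, gpu_rate=0.0002)  # placeholder L4 rate
cpu_only = hourly_cost(cpu=2, memory_gib=4)
print(f"GPU path ~${gpu:.2f}/h, CPU path ~${cpu_only:.2f}/h while serving")
```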

Client Integration

We've made our Cloudflare Workers endpoint OpenAI-compatible:

import requests

response = requests.post(
    "https://your-worker.workers.dev/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Who are you?"}],
        "stream": True,
        "temperature": 0.7,
        "max_tokens": 200
    },
    stream=True,  # required so requests yields the SSE chunks as they arrive
)

for line in response.iter_lines():
    if line:
        print(line.decode())

Production Architecture

Our dual-deployment strategy provides flexibility for different use cases:

High-Performance Path (GPU + vLLM)

  • Use Case: Real-time chat, customer-facing applications
  • API Gateway: Cloudflare Workers handle routing
  • Backend: vLLM on NVIDIA L4 GPU
  • Response Time: Sub-second responses
  • Resources: L4 GPU + 4 vCPU + 16GB RAM

Cost-Optimized Path (CPU + llama.cpp)

  • Use Case: Internal tools, batch processing, development
  • API Gateway: Same Cloudflare Workers interface
  • Backend: llama.cpp on CPU only
  • Response Time: 4-5 seconds
  • Resources: 2 vCPU + 4GB RAM

Both deployments share:

  • OpenAI-compatible API: Same client code works with both
  • Rate Limiting: 30 requests/hour per user using Cloudflare KV
  • Authentication: API key validation for secure access
  • Auto-scaling: Scale to zero when idle; capped at 1 GPU instance and 3 CPU instances

This hybrid architecture allows dynamic routing based on request priority, ensuring optimal cost-performance balance.
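The routing decision itself is simple. A sketch in Python for illustration (the real gateway is a Cloudflare Worker; the URLs and thresholds here are hypothetical):

```python
# Hypothetical backend URLs for the two Cloud Run services.
GPU_BACKEND = "https://vllm-portfolio-xxxx.run.app"
CPU_BACKEND = "https://sarim-portfolio-ai-cpu-xxxx.run.app"

def pick_backend(priority: str, interactive: bool) -> str:
    """Route latency-sensitive traffic to the GPU path, everything else to CPU."""
    if priority == "high" or interactive:
        return GPU_BACKEND
    return CPU_BACKEND

print(pick_backend("high", interactive=False))  # GPU path
print(pick_backend("low", interactive=False))   # CPU path
```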

Optimizations Applied

GPU Deployment (vLLM)

  1. Continuous Batching: vLLM's scheduler interleaves concurrent requests for steady throughput
  2. PagedAttention: efficient KV-cache management behind the 30 tokens/sec figure
  3. FP16 Inference: half-precision weights for reduced memory usage
  4. NVIDIA L4 GPU: good cost/performance fit for a 1B model; no tensor parallelism needed on a single GPU

CPU Deployment (llama.cpp)

  1. 4-bit Quantization: Q4_K_M reduces model to 808MB
  2. SIMD Optimization: CPU-specific vectorization
  3. Thread Pooling: Optimal thread allocation for Cloud Run
  4. Memory Mapping: Efficient model loading

Both Deployments

  1. Scale to Zero: No cost when idle
  2. OpenAI Compatibility: Standardized API interface
  3. Container Optimization: Minimal Docker images
  4. Cloud Build CI/CD: Automated deployment pipeline

When to Use Each Deployment

Choose GPU (vLLM) When:

  • Response time is critical (<1 second required)
  • High throughput needed (>10 concurrent users)
  • Customer-facing applications
  • Real-time interactive experiences
  • Budget allows for premium performance

Choose CPU (llama.cpp) When:

  • Cost optimization is priority
  • Response time of 4-5 seconds is acceptable
  • Low traffic (<100 requests/day)
  • Internal tools or development environments
  • Batch processing workloads

Key Learnings

  1. Fine-tuning: Unsloth makes QLoRA accessible on free Colab GPUs
  2. Model Size: 1B parameters can be surprisingly capable with quality data
  3. GPU Performance: vLLM achieves 30 tokens/sec on L4 GPU
  4. CPU Viability: llama.cpp makes CPU deployment practical with 4-5 second response times
  5. GGUF Format: Quantization maintains quality while reducing size by 68%
  6. Hybrid Strategy: Different deployments for different use cases maximizes value
  7. Cloud Run: Serverless GPU/CPU support simplifies production deployment

Conclusion

By implementing both GPU and CPU deployment strategies, we achieved the best of both worlds: blazing-fast inference when needed and significant cost savings when response time requirements are flexible. The CPU deployment dramatically reduces costs, making LLM hosting accessible for personal projects, while the GPU option remains available for demanding use cases.

The complete codebase, including Dockerfiles, conversion scripts, and deployment configurations, demonstrates that production LLM deployment can be both performant and cost-effective. With careful optimization and the right tools, you can serve custom fine-tuned models that fit your specific performance and budget requirements.