Fine-Tuning LFM2 With QLoRA And DPO: A Hands-On Guide On Google Colab

This guide walks you through fine-tuning Liquid AI’s LFM2 using a fully open-source pipeline. We begin by loading the base LFM2 model with QLoRA, then prepare a supervised fine-tuning dataset in a conversational chat format. Next, we train a lightweight LoRA adapter using TRL and PEFT, and merge the adapter weights back into the main model. Additionally, the workflow includes an optional DPO stage to refine output preferences using chosen and rejected responses. By the end, you’ll have an end-to-end pipeline that transforms the original LFM2 model into an SFT-trained, preference-optimized version—ready for evaluation or deployment.

!pip install -q -U "transformers>=4.55" "trl>=0.12" "peft>=0.13" "datasets>=2.20" "accelerate>=0.34" bitsandbytes

import torch, gc
from datasets import load_dataset, Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training
from trl import SFTConfig, SFTTrainer, DPOConfig, DPOTrainer

MODEL_ID    = "LiquidAI/LFM2-1.2B"
USE_4BIT    = True
RUN_DPO     = True
SFT_SAMPLES = 500
SFT_STEPS   = 60
DPO_STEPS   = 40
MAX_LEN     = 1024

BF16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()
DTYPE = torch.bfloat16 if BF16 else torch.float16
assert torch.cuda.is_available(), "No GPU detected — set Runtime > Change runtime type > GPU"
print(f"GPU: {torch.cuda.get_device_name(0)} | dtype={DTYPE} | 4bit={USE_4BIT}")

Install the libraries needed to fine-tune LFM2 in Google Colab, then import the core components from PyTorch, Transformers, TRL, PEFT, and datasets (bitsandbytes is used indirectly through BitsAndBytesConfig). Set the key training parameters, confirm that a GPU is present, and select bf16 precision when the GPU supports it, falling back to fp16 otherwise.
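The bf16-versus-fp16 choice in the setup cell comes down to dynamic range versus precision. The sketch below is a minimal illustration and not part of the original notebook: it derives the largest finite value and the machine epsilon of each format from its bit layout (fp16 has 5 exponent bits and 10 mantissa bits; bf16 has 8 and 7). The helper name `float_format_stats` is hypothetical.

```python
def float_format_stats(exp_bits: int, mantissa_bits: int) -> dict:
    """Largest finite value and machine epsilon of an IEEE-style binary format."""
    bias = 2 ** (exp_bits - 1) - 1           # exponent bias, also the max unbiased exponent
    max_finite = (2 - 2 ** -mantissa_bits) * 2.0 ** bias
    epsilon = 2.0 ** -mantissa_bits          # gap between 1.0 and the next representable value
    return {"max": max_finite, "eps": epsilon}

fp16 = float_format_stats(exp_bits=5, mantissa_bits=10)
bf16 = float_format_stats(exp_bits=8, mantissa_bits=7)

print(f"fp16: max={fp16['max']:.0f}  eps={fp16['eps']:.6f}")
print(f"bf16: max={bf16['max']:.3e}  eps={bf16['eps']:.6f}")
```

bf16 trades precision for a far wider range, which tends to make it less prone to overflow during training. That is presumably why the notebook prefers it when the GPU supports it.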

def load_base(four_bit: bool):
    quant_cfg = None
    if four_bit:
        quant_cfg = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_use_double_quant=True,
            bnb_4bit_compute_dtype=DTYPE,
        )
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        device_map="auto",
        dtype=DTYPE,
        quantization_config=quant_cfg,
    )
    model.config.use_cache = False
    return model

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = load_base(USE_4BIT)

@torch.no_grad()
def chat(m, user_msg, system=None, max_new_tokens=200):
    # Build the conversation: optional system message first, then the user turn.
    msgs = [{"role": "system", "content": system}] if system else []
    msgs.append({"role": "user", "content": user_msg})
    inputs = tokenizer.apply_chat_template(
        msgs,
        add_generation_prompt=True,
        return_tensors="pt",
        tokenize=True,
        return_dict=True,
    ).to(m.device)
    m.config.use_cache = True
    out = m.generate(
        **inputs,
        max_new_tokens=max_new_tokens, do_sample=True,
        temperature=0.3, min_p=0.15, repetition_penalty=1.05,
        pad_token_id=tokenizer.pad_token_id,
    )
    m.config.use_cache = False
    prompt_len = inputs["input_ids"].shape[-1]
    return tokenizer.decode(out[0, prompt_len:], skip_special_tokens=True)

PROBE = "Explain what makes the LFM2 architecture good for on-device AI, in 2 sentences."
print("\n=== BASELINE (before fine-tuning) ===\n", chat(model, PROBE))

Load the base LFM2 model with optional 4-bit quantization to minimize GPU memory consumption. Prepare the tokenizer, assign a padding token if needed, and create a helper chat function to test model outputs. Run a baseline prompt now so you can compare how the model improves after training.
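To make the memory argument for 4-bit loading concrete, here is a back-of-the-envelope sketch. It assumes roughly 1.2 billion parameters (read off the model name, not measured) and counts weight storage only. It ignores quantization constants, any layers kept in higher precision, activations, and optimizer state, so treat the numbers as rough lower bounds. The helper `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

# Assumption: ~1.2e9 parameters, inferred from the name "LFM2-1.2B".
N_PARAMS = 1.2e9

for label, bits in [("16-bit (bf16/fp16)", 16), ("4-bit (NF4)", 4)]:
    print(f"{label:>20}: {weight_memory_gb(N_PARAMS, bits):.2f} GB")
```

Under these assumptions the weights shrink to about a quarter of their 16-bit size, leaving more GPU memory for activations and the LoRA optimizer state.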

sft_ds = load_dataset("HuggingFaceTB/smoltalk", "all", split=f"train[:{SFT_SAMPLES}]")
sft_ds = sft_ds.select_columns(["messages"])
print("\nSFT example messages:", sft_ds[0]["messages"][:2])

lora_sft = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, bias="none",
    task_type="CAUSAL_LM", target_modules="all-linear",
)

sft_cfg = SFTConfig(
    output_dir="outputs/sft/lfm2_demo",
    max_length=MAX_LEN,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    max_steps=SFT_STEPS,
    logging_steps=10,
    save_strategy="no",
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
    bf16=BF16, fp16=not BF16,
    optim="paged_adamw_8bit" if USE_4BIT else "adamw_torch",
    packing=False,
    report_to="none",
)

sft_trainer = SFTTrainer(
    model=model,
    args=sft_cfg,
    train_dataset=sft_ds,
    peft_config=lora_sft,
    processing_class=tokenizer,
)
sft_trainer.train()
sft_trainer.save_model("outputs/sft/lfm2_adapter")
print("\n=== AFTER SFT ===\n", chat(sft_trainer.model, PROBE))

Load a conversational supervised fine-tuning dataset and keep only the messages field. Set up a lightweight LoRA configuration that targets all linear layers, then define the SFT training setup. Train with SFT, save the resulting LoRA adapter, and rerun the probe prompt to compare the output against the baseline.
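Because the run is capped by max_steps rather than epochs, it helps to check how much data the SFT stage actually consumes. The sketch below plugs in the values from the configuration above. The helper `training_budget` is hypothetical, it assumes a single GPU, and the warmup rounding (ceiling of ratio times total steps) is an assumption about how the trainer converts warmup_ratio into steps.

```python
import math

def training_budget(n_samples, max_steps, per_device_batch, grad_accum,
                    warmup_ratio, n_devices=1):
    """Summarize how much data a step-limited training run consumes."""
    effective_batch = per_device_batch * grad_accum * n_devices
    samples_seen = effective_batch * max_steps
    return {
        "effective_batch": effective_batch,
        "samples_seen": samples_seen,
        "epochs": samples_seen / n_samples,
        # Assumption: warmup steps are rounded up from ratio * total steps.
        "warmup_steps": math.ceil(max_steps * warmup_ratio),
    }

# Values taken from SFT_SAMPLES, SFT_STEPS, and sft_cfg in the notebook.
budget = training_budget(n_samples=500, max_steps=60, per_device_batch=2,
                         grad_accum=4, warmup_ratio=0.03)
print(budget)
```

With these settings the run sees 480 of the 500 samples, slightly under one epoch, which is enough for a demonstration but small for a production fine-tune.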

del sft_trainer, model
gc.collect(); torch.cuda.empty_cache()

# Reload the base model unquantized (bf16 or fp16, per DTYPE) before merging the adapter.
base_fp16 = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto", dtype=DTYPE)
sft_merged = PeftModel.from_pretrained(base_fp16, "outputs/sft/lfm2_adapter").merge_and_unload()
sft_merged.save_pretrained("outputs/sft/lfm2_merged")
tokenizer.save_pretrained("outputs/sft/lfm2_merged")
print("Merged SFT model saved -> outputs/sft/lfm2_merged")

To free GPU memory, we delete the trainer and the quantized model. We then reload the base LFM2 model, this time unquantized in bf16 or fp16, and attach the previously trained SFT LoRA adapter. Next, we merge the adapter weights into the base model to create a single unified checkpoint. Finally, we save the merged SFT model so it can be used in the next stage of the pipeline.
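Merging a LoRA adapter amounts to folding the low-rank update into each weight matrix: W' = W + (alpha / r) * B @ A. The toy sketch below uses plain Python lists to show that the merged matrix gives the same output as running the base and adapter paths separately. All matrices and the helper names are hypothetical. The scale alpha / r = 2 matches the ratio of lora_alpha=32 to r=16 used above.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(W, A, B, alpha, r):
    """Return the merged weights W + (alpha / r) * B @ A."""
    scale = alpha / r
    delta = matmul(B, A)                     # (d_out x r) @ (r x d_in) -> d_out x d_in
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

# Toy shapes: d_out = 2, d_in = 3, rank r = 1.
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]
A = [[0.5, 0.0, 1.0]]         # r x d_in
B = [[2.0], [4.0]]            # d_out x r
x = [[1.0], [2.0], [3.0]]     # input column vector
alpha, r = 2, 1

# Adapter path: base output plus the scaled low-rank output.
base_out = matmul(W, x)
lora_out = matmul(B, matmul(A, x))
adapter_path = [[b[0] + (alpha / r) * l[0]] for b, l in zip(base_out, lora_out)]

# Merged path: one matmul with the merged weights.
merged_path = matmul(merge_lora(W, A, B, alpha, r), x)
print(adapter_path, merged_path)
```

After merging, inference needs no extra adapter matmuls, and the checkpoint can be loaded like any ordinary model.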

if RUN_DPO:
   pref_rows = [
       {"prompt":  [{"role": "user", "content": "Reply to a customer whose order is late."}],
        "chosen":  [{"role": "assistant", "content": "I'm sorry your order is delayed. I've checked your tracking and it will arrive within 2 days — here's a 10% credit for the inconvenience."}],
        "rejected":[{"role": "assistant", "content": "Orders are sometimes late. Please wait."}]},
       {"prompt":  [{"role": "user", "content": "Summarize the benefit of edge AI in one line."}],
        "chosen":  [{"role": "assistant", "content": "Edge AI runs models locally, giving low latency, offline reliability, and stronger privacy."}],
        "rejected":[{"role": "assistant", "content": "Edge AI is AI on the edge of things and it is good."}]},
       {"prompt":  [{"role": "user", "content": "Decline a meeting politely."}],
        "chosen":  [{"role": "assistant", "content": "Thanks for the invite — I have a conflict then. Could we find another slot this week?"}],
        "rejected":[{"role": "assistant", "content": "No."}]},
   ] * 20
   pref_ds = Dataset.from_list(pref_rows)

   lora_dpo = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, bias="none",
                         task_type="CAUSAL_LM", target_modules="all-linear")
   dpo_cfg = DPOConfig(
       output_dir="outputs/dpo/lfm2_demo",
       per_device_train_batch_size=1,
       gradient_accumulation_steps=4,
       learning_rate=5e-6,
       beta=0.1,
       max_length=MAX_LEN,
       max_prompt_length=512,
       max_steps=DPO_STEPS,
       logging_steps=10,
       save_strategy="no",
       gradient_checkpointing=True,
       gradient_checkpointing_kwargs={"use_reentrant": False},
       bf16=BF16, fp16=not BF16,
       report_to="none",
   )
   dpo_trainer = DPOTrainer(
       model=sft_merged,
       ref_model=None,
       args=dpo_cfg,
       train_dataset=pref_ds,
       processing_class=tokenizer,
       peft_config=lora_dpo,
   )
   dpo_trainer.train()
   # merge_and_unload() folds the DPO adapter into the weights and returns the plain model.
   final = dpo_trainer.model.merge_and_unload()
   final.save_pretrained("outputs/final/lfm2_sft_dpo")
   tokenizer.save_pretrained("outputs/final/lfm2_sft_dpo")
   print("\n=== AFTER SFT + DPO ===\n", chat(final, PROBE))
   print("Final model saved -> outputs/final/lfm2_sft_dpo")

print("\nDone. Compare the BASELINE vs AFTER-SFT(+DPO) outputs above.")

As an optional next step, we can apply Direct Preference Optimization (DPO), which uses pairs of preferred and non-preferred responses to further align the model’s outputs. We set up a new LoRA adapter specifically for preference training and then fine-tune the previously merged SFT model using these contrastive examples. After training, we merge this DPO adapter into the model, save the fully refined checkpoint, and then evaluate the final output to see how it compares to the original baseline.
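For intuition about what the DPO stage optimizes, here is a minimal sketch of the standard sigmoid DPO loss for a single preference pair, computed from summed sequence log-probabilities. It is a stand-alone illustration, not the TRL implementation. The function name and the log-probability values are hypothetical, and beta=0.1 mirrors the configuration above.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Sigmoid DPO loss for one (chosen, rejected) pair."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) == softplus(-margin), written in a numerically stable form.
    z = -margin
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

# Policy identical to the reference: margin is 0, so the loss is log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))

# Policy raises the chosen response and lowers the rejected one: the loss drops.
print(dpo_loss(-8.0, -15.0, -10.0, -12.0))
```

The loss falls only when the policy widens the gap between chosen and rejected responses relative to the reference model, and beta controls how strongly that gap is rewarded.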

In summary, this tutorial demonstrates a complete, open-source fine-tuning workflow for the LFM2 model using only freely available tools: Transformers, TRL, PEFT, datasets, and bitsandbytes. We used QLoRA to train efficiently on Colab GPUs, performed supervised fine-tuning on chat-formatted data, merged the resulting adapter into the base model, and optionally applied DPO for additional refinement. The process offers a practical view of modern LLM fine-tuning, from initial model loading to a final, deployable checkpoint whose outputs can be compared directly against the original model.
