This guide walks you through fine-tuning Liquid AI’s LFM2 using a fully open-source pipeline. We begin by loading the base LFM2 model in 4-bit precision for QLoRA, then prepare a supervised fine-tuning (SFT) dataset in a conversational chat format. Next, we train a lightweight LoRA adapter using TRL and PEFT, and merge the adapter weights back into the base model. The workflow also includes an optional DPO stage that refines output preferences using chosen and rejected responses. By the end, you’ll have an end-to-end pipeline that turns the original LFM2 model into an SFT-trained, preference-optimized version, ready for evaluation or deployment.
!pip install -q -U "transformers>=4.55" "trl>=0.12" "peft>=0.13" "datasets>=2.20" "accelerate>=0.34" bitsandbytes
import torch, gc
from datasets import load_dataset, Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training
from trl import SFTConfig, SFTTrainer, DPOConfig, DPOTrainer
MODEL_ID = "LiquidAI/LFM2-1.2B"
USE_4BIT = True
RUN_DPO = True
SFT_SAMPLES = 500
SFT_STEPS = 60
DPO_STEPS = 40
MAX_LEN = 1024
BF16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()
DTYPE = torch.bfloat16 if BF16 else torch.float16
assert torch.cuda.is_available(), "No GPU detected — set Runtime > Change runtime type > GPU"
print(f"GPU: {torch.cuda.get_device_name(0)} | dtype={DTYPE} | 4bit={USE_4BIT}")
Install all necessary libraries for fine-tuning LFM2 in Google Colab. Import core components from Transformers, TRL, PEFT, datasets, bitsandbytes, and PyTorch. Configure key training parameters, detect your GPU, and set the optimal computational precision for efficient training.
def load_base(four_bit: bool):
    quant_cfg = None
    if four_bit:
        quant_cfg = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_use_double_quant=True,
            bnb_4bit_compute_dtype=DTYPE,
        )
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        device_map="auto",
        dtype=DTYPE,
        quantization_config=quant_cfg,
    )
    model.config.use_cache = False
    return model
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = load_base(USE_4BIT)
@torch.no_grad()
def chat(m, user_msg, system=None, max_new_tokens=200):
    # Optional system message first, then the user turn.
    msgs = [{"role": "system", "content": system}] if system else []
    msgs.append({"role": "user", "content": user_msg})
    inputs = tokenizer.apply_chat_template(
        msgs,
        add_generation_prompt=True,
        return_tensors="pt",
        tokenize=True,
        return_dict=True,
    ).to(m.device)
    # Enable the KV cache for generation, then switch it back off for training.
    m.config.use_cache = True
    out = m.generate(
        **inputs,
        max_new_tokens=max_new_tokens, do_sample=True,
        temperature=0.3, min_p=0.15, repetition_penalty=1.05,
        pad_token_id=tokenizer.pad_token_id,
    )
    m.config.use_cache = False
    # Decode only the newly generated tokens, not the prompt.
    prompt_len = inputs["input_ids"].shape[-1]
    return tokenizer.decode(out[0, prompt_len:], skip_special_tokens=True)
PROBE = "Explain what makes the LFM2 architecture good for on-device AI, in 2 sentences."
print("\n=== BASELINE (before fine-tuning) ===\n", chat(model, PROBE))
Load the base LFM2 model with optional 4-bit quantization to minimize GPU memory consumption. Prepare the tokenizer, assign a padding token if needed, and create a helper chat function to test model outputs. Run a baseline prompt now so you can compare how the model improves after training.
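As a rough illustration of why 4-bit loading helps, the sketch below estimates the memory needed for the weights alone. It assumes about 1.2 billion parameters (inferred from the model name), 16 bits per parameter for fp16/bf16, and 4 bits per parameter for NF4. It ignores quantization constants, any layers left unquantized, activations, and optimizer state, so real usage will be higher. The helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for model weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

N_PARAMS = 1.2e9  # assumption: parameter count implied by "LFM2-1.2B"

half = weight_memory_gb(N_PARAMS, 16)  # fp16 / bf16 weights
nf4 = weight_memory_gb(N_PARAMS, 4)    # 4-bit weights, ignoring quantization overhead

print(f"16-bit weights: ~{half:.1f} GB")
print(f"4-bit weights:  ~{nf4:.1f} GB")
```

Under these assumptions the 4-bit weights take about a quarter of the half-precision footprint, which leaves more room for activations and optimizer state on a small Colab GPU.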
sft_ds = load_dataset("HuggingFaceTB/smoltalk", "all", split=f"train[:{SFT_SAMPLES}]")
sft_ds = sft_ds.select_columns(["messages"])
print("\nSFT example messages:", sft_ds[0]["messages"][:2])
lora_sft = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, bias="none",
    task_type="CAUSAL_LM", target_modules="all-linear",
)
sft_cfg = SFTConfig(
    output_dir="outputs/sft/lfm2_demo",
    max_length=MAX_LEN,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    max_steps=SFT_STEPS,
    logging_steps=10,
    save_strategy="no",
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
    bf16=BF16, fp16=not BF16,
    optim="paged_adamw_8bit" if USE_4BIT else "adamw_torch",
    packing=False,
    report_to="none",
)
sft_trainer = SFTTrainer(
    model=model,
    args=sft_cfg,
    train_dataset=sft_ds,
    peft_config=lora_sft,
    processing_class=tokenizer,
)
sft_trainer.train()
sft_trainer.save_model("outputs/sft/lfm2_adapter")
print("\n=== AFTER SFT ===\n", chat(sft_trainer.model, PROBE))
Load a conversational supervised fine-tuning dataset and keep only the messages field. Set up a lightweight LoRA configuration for efficient adapter training, then define the full SFT training setup. Train the model using SFT, save the resulting LoRA adapter, and rerun the probe prompt so you can compare the output against the baseline.
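To make the SFT settings more concrete, here is a small back-of-the-envelope sketch. It counts the parameters LoRA adds to a single linear layer at rank 16 and works out how much data the run sees. The 2048 x 2048 layer size is a hypothetical value chosen for illustration, not LFM2's actual dimensions, and the budget calculation assumes a single GPU.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one linear layer: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

R, ALPHA = 16, 32        # values from the LoRA config in this tutorial
D_IN = D_OUT = 2048      # hypothetical layer size, for illustration only

full = D_IN * D_OUT
added = lora_params(D_IN, D_OUT, R)
print(f"full layer: {full:,} weights, LoRA adds {added:,} ({added / full:.2%})")
print(f"LoRA scaling alpha / r = {ALPHA / R}")

# Training budget implied by the SFT settings (single GPU assumed).
BATCH, GRAD_ACCUM, STEPS, SAMPLES = 2, 4, 60, 500
per_step = BATCH * GRAD_ACCUM
seen = per_step * STEPS
print(f"sequences per optimizer step: {per_step}")
print(f"sequences seen: {seen} of {SAMPLES} ({seen / SAMPLES:.0%} of one pass)")
```

With these settings the run covers slightly less than one pass over the 500 samples, so it is best read as a quick demonstration rather than a full training run.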
del sft_trainer, model
gc.collect(); torch.cuda.empty_cache()
base_fp16 = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", dtype=DTYPE
)
sft_merged = PeftModel.from_pretrained(base_fp16, "outputs/sft/lfm2_adapter").merge_and_unload()
sft_merged.save_pretrained("outputs/sft/lfm2_merged")
tokenizer.save_pretrained("outputs/sft/lfm2_merged")
print("Merged SFT model saved -> outputs/sft/lfm2_merged")
To free GPU memory, we delete the trainer and the quantized model. We then reload the base LFM2 model in half precision (bf16 where supported, otherwise fp16) rather than 4-bit, and attach the trained SFT LoRA adapter. Merging the adapter weights into the base model produces a single standalone checkpoint, which we save for the next stage of the pipeline.
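The merge step can be illustrated with a toy example. The sketch below uses plain Python lists and arbitrary small matrices (not real model weights) to show the standard LoRA identity: running the frozen weight plus the scaled low-rank path gives the same output as a single merged matrix W + (alpha / r) * B @ A. The helper names `matmul` and `merge_lora` are hypothetical and stand in for what a library merge does internally.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(w, a, b, alpha, r):
    """Return W + (alpha / r) * B @ A as a new dense matrix."""
    scale = alpha / r
    delta = matmul(b, a)
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

# Toy sizes: d_out=2, d_in=3, rank r=1 (values are arbitrary).
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]  # frozen base weight (d_out x d_in)
A = [[0.5, -0.5, 1.0]]                    # LoRA A (r x d_in)
B = [[2.0], [1.0]]                        # LoRA B (d_out x r)
ALPHA, R = 2, 1
x = [[1.0], [2.0], [3.0]]                 # one input column vector

# Adapter path: base output plus the scaled low-rank update.
scale = ALPHA / R
base_out = matmul(W, x)
lora_out = matmul(B, matmul(A, x))
adapter_y = [[bo[0] + scale * lo[0]] for bo, lo in zip(base_out, lora_out)]

# Merged path: one dense matrix, no adapter needed at inference time.
merged_y = matmul(merge_lora(W, A, B, ALPHA, R), x)

print("adapter path:", adapter_y)
print("merged path: ", merged_y)
```

Because the two paths agree, the merged checkpoint can be loaded and served like an ordinary model without PEFT installed.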
if RUN_DPO:
    # Toy preference set: three hand-written pairs, repeated 20 times.
    pref_rows = [
        {"prompt": [{"role": "user", "content": "Reply to a customer whose order is late."}],
         "chosen": [{"role": "assistant", "content": "I'm sorry your order is delayed. I've checked your tracking and it will arrive within 2 days — here's a 10% credit for the inconvenience."}],
         "rejected": [{"role": "assistant", "content": "Orders are sometimes late. Please wait."}]},
        {"prompt": [{"role": "user", "content": "Summarize the benefit of edge AI in one line."}],
         "chosen": [{"role": "assistant", "content": "Edge AI runs models locally, giving low latency, offline reliability, and stronger privacy."}],
         "rejected": [{"role": "assistant", "content": "Edge AI is AI on the edge of things and it is good."}]},
        {"prompt": [{"role": "user", "content": "Decline a meeting politely."}],
         "chosen": [{"role": "assistant", "content": "Thanks for the invite — I have a conflict then. Could we find another slot this week?"}],
         "rejected": [{"role": "assistant", "content": "No."}]},
    ] * 20
    pref_ds = Dataset.from_list(pref_rows)
    lora_dpo = LoraConfig(
        r=16, lora_alpha=32, lora_dropout=0.05, bias="none",
        task_type="CAUSAL_LM", target_modules="all-linear",
    )
    dpo_cfg = DPOConfig(
        output_dir="outputs/dpo/lfm2_demo",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        learning_rate=5e-6,
        beta=0.1,
        max_length=MAX_LEN,
        max_prompt_length=512,
        max_steps=DPO_STEPS,
        logging_steps=10,
        save_strategy="no",
        gradient_checkpointing=True,
        gradient_checkpointing_kwargs={"use_reentrant": False},
        bf16=BF16, fp16=not BF16,
        report_to="none",
    )
    dpo_trainer = DPOTrainer(
        model=sft_merged,
        ref_model=None,
        args=dpo_cfg,
        train_dataset=pref_ds,
        processing_class=tokenizer,
        peft_config=lora_dpo,
    )
    dpo_trainer.train()
    # Fold the DPO adapter into the weights and evaluate the merged model.
    final = dpo_trainer.model.merge_and_unload()
    final.save_pretrained("outputs/final/lfm2_sft_dpo")
    tokenizer.save_pretrained("outputs/final/lfm2_sft_dpo")
    print("\n=== AFTER SFT + DPO ===\n", chat(final, PROBE))
    print("Final model saved -> outputs/final/lfm2_sft_dpo")
print("\nDone. Compare the BASELINE vs AFTER-SFT(+DPO) outputs above.")
As an optional next step, we apply Direct Preference Optimization (DPO), which uses pairs of preferred and rejected responses to further align the model’s outputs. We set up a new LoRA adapter for preference training and fine-tune the merged SFT model on these contrastive examples. The preference set here is a toy one, three hand-written pairs repeated 20 times, so treat the result as a smoke test of the pipeline rather than a measure of alignment. After training, we merge the DPO adapter into the model, save the final checkpoint, and run the probe prompt again to compare against the original baseline.
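For intuition about what the DPO stage optimizes, the sketch below computes the standard DPO loss for a single preference pair from summed log-probabilities. The log-probability values are made up for illustration, and the function name `dpo_loss` is hypothetical; only `beta=0.1` comes from the configuration in this tutorial. In TRL, leaving `ref_model=None` while passing a `peft_config` is generally understood to use the model with the adapter disabled as the reference, which is the role the `ref_*` arguments play here.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one pair, given summed log-probabilities of each response."""
    # How much more the policy prefers "chosen" over "rejected" than the reference does.
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    # -log(sigmoid(beta * margin)), written as log(1 + exp(-beta * margin)).
    return math.log1p(math.exp(-beta * margin))

# Hypothetical log-probabilities, for illustration only.
start = dpo_loss(-20.0, -25.0, -20.0, -25.0)   # policy identical to the reference
better = dpo_loss(-15.0, -30.0, -20.0, -25.0)  # policy now favors the chosen response more

print(f"policy == reference: {start:.4f}")
print(f"after improvement:   {better:.4f}")
```

When the policy matches the reference the margin is zero and the loss equals log 2. The loss falls as the policy raises the chosen response and lowers the rejected one relative to the reference, with `beta` controlling how strongly that margin is rewarded.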
In summary, this tutorial demonstrates a complete, open-source fine-tuning workflow for the LFM2 model using only freely available tools: Transformers, TRL, PEFT, datasets, and bitsandbytes. We used QLoRA to train efficiently on Colab GPUs, performed supervised fine-tuning on chat-formatted data, merged the resulting adapter into the base model, and optionally applied DPO for additional refinement. The process gives a practical view of modern LLM fine-tuning, from initial model loading to a final, deployable checkpoint whose outputs you can compare directly against the original model.