Easy Methods To Align Giant Language Fashions With Human Preferences Utilizing Direct Desire Optimization, QLoRA, And Extremely-Suggestions

On this tutorial, we implement an end-to-end Direct Desire Optimization workflow to align a big language mannequin with human preferences with out utilizing a reward mannequin. We mix TRL’s DPOTrainer with QLoRA and PEFT to make preference-based alignment possible on a single Colab GPU. We prepare instantly on the UltraFeedback binarized dataset, the place every immediate has a selected and a rejected response, permitting us to form mannequin conduct and magnificence reasonably than simply factual recall.

import os
import math
import random
import torch


!pip -q set up -U "transformers>=4.45.0" "datasets>=2.19.0" "accelerate>=0.33.0" "trl>=0.27.0" "peft>=0.12.0" "bitsandbytes>=0.43.0" "sentencepiece" "evaluate"


SEED = 42
random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)


MODEL_NAME = os.environ.get("MODEL_NAME", "Qwen/Qwen2-0.5B-Instruct")
DATASET_NAME = "HuggingFaceH4/ultrafeedback_binarized"
OUTPUT_DIR = "dpo_ultrafeedback_qlora"


MAX_TRAIN_SAMPLES = 8000
MAX_EVAL_SAMPLES  = 200
MAX_PROMPT_LEN = 512
MAX_COMPLETION_LEN = 256


BETA = 0.1
LR = 2e-4
EPOCHS = 1
PER_DEVICE_BS = 2
GRAD_ACCUM = 8


LOGGING_STEPS = 10
SAVE_STEPS = 200


machine = "cuda" if torch.cuda.is_available() else "cpu"
print("Device:", machine, "GPU:", torch.cuda.get_device_name(0) if machine == "cuda" else "None")

We arrange the execution surroundings and set up all required libraries for DPO, PEFT, and quantized coaching. We outline all international hyperparameters, dataset limits, and optimization settings in a single place. We additionally initialize the random quantity generator and make sure GPU availability to make sure reproducible runs.

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig


bnb_config = BitsAndBytesConfig(
   load_in_4bit=True,
   bnb_4bit_quant_type="nf4",
   bnb_4bit_use_double_quant=True,
   bnb_4bit_compute_dtype=torch.bfloat16 if torch.cuda.is_available() and torch.cuda.get_device_capability(0)[0] >= 8 else torch.float16,
)


tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
if tokenizer.pad_token is None:
   tokenizer.pad_token = tokenizer.eos_token


mannequin = AutoModelForCausalLM.from_pretrained(
   MODEL_NAME,
   quantization_config=bnb_config,
   torch_dtype=torch.bfloat16 if torch.cuda.is_available() and torch.cuda.get_device_capability(0)[0] >= 8 else torch.float16,
   device_map="auto",
)
mannequin.config.use_cache = False

We load the tokenizer and the bottom language mannequin utilizing 4-bit quantization to attenuate reminiscence utilization. We configure bitsandbytes to allow environment friendly QLoRA-style computation on Colab GPUs. We put together the mannequin for coaching by disabling cache utilization to keep away from incompatibilities throughout backpropagation.

from peft import LoraConfig, get_peft_model


lora_config = LoraConfig(
   r=16,
   lora_alpha=32,
   lora_dropout=0.05,
   bias="none",
   task_type="CAUSAL_LM",
   target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "up_proj", "down_proj", "gate_proj"],
)


mannequin = get_peft_model(mannequin, lora_config)
mannequin.print_trainable_parameters()


mannequin.gradient_checkpointing_enable()

We connect LoRA adapters to the mannequin’s consideration and feed-forward projection layers. We limit coaching to a small set of parameters to make fine-tuning environment friendly and secure. We allow gradient checkpointing to additional scale back GPU reminiscence consumption throughout coaching.

from datasets import load_dataset


ds = load_dataset(DATASET_NAME)


train_split = "train_prefs" if "train_prefs" in ds else ("train" if "train" in ds else listing(ds.keys())[0])
test_split  = "test_prefs" if "test_prefs" in ds else ("test" if "test" in ds else None)


train_raw = ds[train_split]
test_raw = ds[test_split] if test_split will not be None else None


print("Splits:", ds.keys())
print("Using train split:", train_split, "size:", len(train_raw))
if test_raw will not be None:
   print("Using test split:", test_split, "size:", len(test_raw))


def _extract_last_user_and_assistant(messages):
   last_user_idx = None
   last_asst_idx = None
   for i, m in enumerate(messages):
       if m.get("role") == "user":
           last_user_idx = i
       if m.get("role") == "assistant":
           last_asst_idx = i


   if last_user_idx is None or last_asst_idx is None:
       return None, None


   prompt_messages = messages[: last_user_idx + 1]
   assistant_text = messages[last_asst_idx].get("content", "")
   return prompt_messages, assistant_text


def format_example(ex):
   chosen_msgs = ex["chosen"]
   rejected_msgs = ex["rejected"]


   prompt_msgs_c, chosen_text = _extract_last_user_and_assistant(chosen_msgs)
   prompt_msgs_r, rejected_text = _extract_last_user_and_assistant(rejected_msgs)


   if prompt_msgs_c is None or prompt_msgs_r is None:
       return {"prompt": None, "chosen": None, "rejected": None}


   prompt_text = tokenizer.apply_chat_template(
       prompt_msgs_c, tokenize=False, add_generation_prompt=True
   )


   return {
       "prompt": prompt_text,
       "chosen": chosen_text.strip(),
       "rejected": rejected_text.strip(),
   }


train_raw = train_raw.shuffle(seed=SEED)
train_raw = train_raw.choose(vary(min(MAX_TRAIN_SAMPLES, len(train_raw))))


train_ds = train_raw.map(format_example, remove_columns=train_raw.column_names)
train_ds = train_ds.filter(lambda x: x["prompt"] will not be None and len(x["chosen"]) > 0 and len(x["rejected"]) > 0)


if test_raw will not be None:
   test_raw = test_raw.shuffle(seed=SEED)
   test_raw = test_raw.choose(vary(min(MAX_EVAL_SAMPLES, len(test_raw))))
   eval_ds = test_raw.map(format_example, remove_columns=test_raw.column_names)
   eval_ds = eval_ds.filter(lambda x: x["prompt"] will not be None and len(x["chosen"]) > 0 and len(x["rejected"]) > 0)
else:
   eval_ds = None


print("Train examples:", len(train_ds), "Eval examples:", len(eval_ds) if eval_ds will not be None else 0)
print(train_ds[0])

We load the UltraFeedback binarized dataset and dynamically choose the suitable prepare and take a look at splits. We extract immediate, chosen, and rejected responses from multi-turn conversations and format them utilizing the mannequin’s chat template. We shuffle, filter, and subsample the information to create clear and environment friendly coaching and analysis datasets.

from trl import DPOTrainer, DPOConfig


use_bf16 = torch.cuda.is_available() and torch.cuda.get_device_capability(0)[0] >= 8
use_fp16 = torch.cuda.is_available() and never use_bf16


training_args = DPOConfig(
   output_dir=OUTPUT_DIR,
   beta=BETA,
   per_device_train_batch_size=PER_DEVICE_BS,
   gradient_accumulation_steps=GRAD_ACCUM,
   num_train_epochs=EPOCHS,
   learning_rate=LR,
   lr_scheduler_type="cosine",
   warmup_ratio=0.05,
   logging_steps=LOGGING_STEPS,
   save_steps=SAVE_STEPS,
   save_total_limit=2,
   bf16=use_bf16,
   fp16=use_fp16,
   optim="paged_adamw_8bit",
   max_length=MAX_PROMPT_LEN + MAX_COMPLETION_LEN,
   max_prompt_length=MAX_PROMPT_LEN,
   report_to="none",
)


coach = DPOTrainer(
   mannequin=mannequin,
   args=training_args,
   processing_class=tokenizer,
   train_dataset=train_ds,
   eval_dataset=eval_ds,
)


coach.prepare()


coach.save_model(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)


print("Saved to:", OUTPUT_DIR)

We configure the DPO coaching goal with fastidiously chosen optimization and scheduling parameters. We initialize the DPOTrainer to instantly optimize choice pairs with out a reward mannequin. We prepare the LoRA adapters and save the aligned mannequin artifacts for later inference.

from peft import PeftModel
from transformers import pipeline


def generate_text(model_for_gen, immediate, max_new_tokens=180):
   model_for_gen.eval()
   inputs = tokenizer(immediate, return_tensors="pt", truncation=True, max_length=MAX_PROMPT_LEN).to(model_for_gen.machine)
   with torch.no_grad():
       out = model_for_gen.generate(
           **inputs,
           max_new_tokens=max_new_tokens,
           do_sample=True,
           temperature=0.7,
           top_p=0.95,
           pad_token_id=tokenizer.eos_token_id,
       )
   return tokenizer.decode(out[0], skip_special_tokens=True)


base_model = AutoModelForCausalLM.from_pretrained(
   MODEL_NAME,
   quantization_config=bnb_config,
   torch_dtype=torch.bfloat16 if use_bf16 else torch.float16,
   device_map="auto",
)
base_model.config.use_cache = True


dpo_model = PeftModel.from_pretrained(base_model, OUTPUT_DIR)
dpo_model.config.use_cache = True


sample_pool = eval_ds if eval_ds will not be None and len(eval_ds) > 0 else train_ds
samples = [sample_pool[i] for i in random.pattern(vary(len(sample_pool)), okay=min(3, len(sample_pool)))]


for i, ex in enumerate(samples, 1):
   immediate = ex["prompt"]
   print("n" + "="*90)
   print(f"Sample #{i}")
   print("- Prompt:n", immediate)


   base_out = generate_text(base_model, immediate)
   dpo_out  = generate_text(dpo_model, immediate)


   print("n- Base model output:n", base_out)
   print("n- DPO (LoRA) output:n", dpo_out)


print("nDone.")

We reload the bottom mannequin and connect the educated DPO LoRA adapters for inference. We generate responses from each the unique and aligned fashions utilizing the identical prompts for comparability. We qualitatively consider how choice optimization adjustments mannequin conduct by inspecting the outputs facet by facet.

In conclusion, we demonstrated how DPO supplies a secure and environment friendly different to RLHF by instantly optimizing choice pairs with a easy, well-defined goal. We confirmed that parameter-efficient fine-tuning with LoRA and 4-bit quantization permits sensible experimentation even underneath tight compute constraints. We qualitatively validated alignment by evaluating generations earlier than and after DPO coaching, confirming that the mannequin learns to choose higher-quality responses whereas remaining light-weight and deployable.

Take a look at the FULL CODES right here. Additionally, be at liberty to comply with us on Twitter and don’t neglect to affix our 100k+ ML SubReddit and Subscribe to our Publication. Wait! are you on telegram? now you’ll be able to be a part of us on telegram as properly.

Top Posts

The Digital Siege: 390 Breaches, SonicWall Crises, and the AI Arms Race Unleashed

Your Intimate Hims Orders Exposed: FTC Blow the Whistle on Meta’s Data Grab

Silencing the Echo: Why Protecting Voices Matters More Than Asking Them to Speak Up

Easy methods to Align Giant Language Fashions with Human Preferences Utilizing Direct Desire Optimization, QLoRA, and Extremely-Suggestions

DeepMind’s Physical AI Trio: Whole Body Control, Dexterity & Robot Teamwork Unleashed

7 Machine Learning Titans Still Forging the Future

Google Nest vs. Sonos SL1: Which Smart Speaker Truly Wins?

Coding Agents Supercharge Science: How OpenAI’s Vision Accelerates Software Discovery

Beyond Prompts: The Untapped Frontier of Master Prompt Governance

Beyond the REPL: Automating Code Creation with Kimi CLI, JSONL Streams, and Persistent Memory

The Digital Siege: 390 Breaches, SonicWall Crises, and the AI Arms Race Unleashed

Your Intimate Hims Orders Exposed: FTC Blow the Whistle on Meta’s Data Grab

Silencing the Echo: Why Protecting Voices Matters More Than Asking Them to Speak Up

Berg Insight Forecasts Wireless Revolution Transforming Industrial Automation

DeepMind’s Physical AI Trio: Whole Body Control, Dexterity & Robot Teamwork Unleashed

ProteinGuide: Orchestrating Directed Evolution for Next-Generation Protein Generative Models

“GhostKeys: The AI Time Bomb Hiding in Forgotten Internet Addresses”

The 2007 Yield Ghost: Treasury Bonds Challenge the Fed’s Grip

Trending

The Digital Siege: 390 Breaches, SonicWall Crises, and the AI Arms Race Unleashed

Your Intimate Hims Orders Exposed: FTC Blow the Whistle on Meta’s Data Grab

Latest Posts

Not More Data, but Better World Models – Unite.AI

OpenAI Is Hiring Head of Preparedness, Amid AI Cyberattack Fears

Subscribe to Updates

Top Posts

Easy methods to Align Giant Language Fashions with Human Preferences Utilizing Direct Desire Optimization, QLoRA, and Extremely-Suggestions

Related Posts