In this tutorial, we learn how to run OpenAI's open-weight GPT-OSS models in Google Colab with a strong focus on their technical behavior, deployment requirements, and practical inference workflows. We begin by setting up the exact dependencies needed for Transformers-based execution, verifying GPU availability, and loading openai/gpt-oss-20b with the correct configuration using native MXFP4 quantization and torch.bfloat16 activations. As we move through the tutorial, we work directly with core capabilities such as structured generation, streaming, multi-turn dialogue handling, tool execution patterns, and batch inference, while keeping in mind how open-weight models differ from closed hosted APIs in terms of transparency, controllability, memory constraints, and local execution trade-offs. Throughout, we treat GPT-OSS not just as a chatbot, but as a technically inspectable open-weight LLM stack that we can configure, prompt, and extend within a reproducible workflow.
print("🔧 Step 1: Installing required packages...")
print("=" * 70)
!pip install -q --upgrade pip
!pip install -q "transformers>=4.51.0" accelerate sentencepiece protobuf
!pip install -q huggingface_hub gradio ipywidgets
!pip install -q openai-harmony
import transformers
print(f"✅ Transformers version: {transformers.__version__}")
import torch
print(f"\n🖥️ System Information:")
print(f"  PyTorch version: {torch.__version__}")
print(f"  CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(0)
    gpu_memory = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"  GPU: {gpu_name}")
    print(f"  GPU Memory: {gpu_memory:.2f} GB")
    if gpu_memory < 15:
        print(f"\n⚠️ WARNING: gpt-oss-20b requires ~16GB VRAM.")
        print(f"  Your GPU has {gpu_memory:.1f}GB. Consider using Colab Pro for T4/A100.")
    else:
        print(f"\n✅ GPU memory sufficient for gpt-oss-20b")
else:
    print("\n❌ No GPU detected!")
    print("  Go to: Runtime → Change runtime type → Select 'T4 GPU'")
    raise RuntimeError("GPU required for this tutorial")
print("\n" + "=" * 70)
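The VRAM checks above can also be captured as a small pure helper, which is easier to reason about and test. This is our own sketch, not part of the tutorial code, and the cutoffs are rough assumptions based on the warnings printed above:

```python
# Our own helper (not part of the official tutorial code): map available VRAM
# to the GPT-OSS variant it can plausibly host. Thresholds are rough heuristics.
def recommend_model(gpu_memory_gb: float) -> str:
    """Pick a gpt-oss checkpoint based on available VRAM (rough heuristic)."""
    if gpu_memory_gb >= 60:
        return "openai/gpt-oss-120b"   # needs ~80GB-class GPUs (H100/A100)
    if gpu_memory_gb >= 15:
        return "openai/gpt-oss-20b"    # fits in ~16GB with MXFP4 weights
    return "none (insufficient VRAM; consider Colab Pro or a smaller model)"

print(recommend_model(16.0))  # openai/gpt-oss-20b
```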
print("📚 PART 2: Loading GPT-OSS Model (Correct Method)")
print("=" * 70)
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch
MODEL_ID = "openai/gpt-oss-20b"
print(f"\n🔄 Loading model: {MODEL_ID}")
print(" This may take several minutes on first run...")
print(" (Model size: ~40GB download, uses native MXFP4 quantization)")
tokenizer = AutoTokenizer.from_pretrained(
    MODEL_ID,
    trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)
print("✅ Model loaded successfully!")
print(f"   Model dtype: {model.dtype}")
print(f"   Device: {model.device}")
if torch.cuda.is_available():
    allocated = torch.cuda.memory_allocated() / 1e9
    reserved = torch.cuda.memory_reserved() / 1e9
    print(f"   GPU Memory Allocated: {allocated:.2f} GB")
    print(f"   GPU Memory Reserved: {reserved:.2f} GB")
print("\n" + "=" * 70)
print("💬 PART 3: Basic Inference Examples")
print("=" * 70)
def generate_response(messages, max_new_tokens=256, temperature=0.8, top_p=1.0):
    """
    Generate a response using gpt-oss with recommended parameters.
    OpenAI recommends: temperature=1.0, top_p=1.0 for gpt-oss
    """
    output = pipe(
        messages,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=temperature,
        top_p=top_p,
        pad_token_id=tokenizer.eos_token_id,
    )
    return output[0]["generated_text"][-1]["content"]
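The indexing in `generate_response` relies on the chat pipeline returning the full message list (prompt plus reply) under `generated_text`. A small sketch on a mocked output makes that structure explicit; the mock is illustrative, not real model output:

```python
# Sketch: the chat pipeline returns the whole conversation under
# "generated_text", so the assistant's reply is the last message's "content".
def last_assistant_message(pipe_output):
    """Pull the final assistant turn out of a chat-pipeline result."""
    return pipe_output[0]["generated_text"][-1]["content"]

# Mocked output shaped like a transformers text-generation chat result.
mock_output = [{
    "generated_text": [
        {"role": "user", "content": "Hi"},
        {"role": "assistant", "content": "Hello! How can I help?"},
    ]
}]
print(last_assistant_message(mock_output))  # Hello! How can I help?
```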
print("\n📝 Example 1: Simple Question Answering")
print("-" * 50)
messages = [
{"role": "user", "content": "What is the Pythagorean theorem? Explain briefly."}
]
response = generate_response(messages, max_new_tokens=150)
print(f"User: {messages[0]['content']}")
print(f"\nAssistant: {response}")
print("\n\n📝 Example 2: Code Generation")
print("-" * 50)
messages = [
    # example prompt; the original request was lost, any code-generation task works here
    {"role": "user", "content": "Write a Python function that checks whether a string is a palindrome."}
]
response = generate_response(messages, max_new_tokens=300)
print(f"User: {messages[0]['content']}")
print(f"\nAssistant: {response}")
print("\n\n📝 Example 3: Creative Writing")
print("-" * 50)
messages = [
{"role": "user", "content": "Write a haiku about artificial intelligence."}
]
response = generate_response(messages, max_new_tokens=100, temperature=1.0)
print(f"User: {messages[0]['content']}")
print(f"\nAssistant: {response}")

We set up the complete Colab environment required to run GPT-OSS properly and verify that the system has a compatible GPU with enough VRAM. We install the core libraries, check the PyTorch and Transformers versions, and confirm that the runtime is suitable for loading an open-weight model like gpt-oss-20b. We then load the tokenizer, initialize the model with the correct technical configuration, and run a few basic inference examples to confirm that the open-weight pipeline works end to end.
print("\n" + "=" * 70)
print("🧠 PART 4: Configurable Reasoning Effort")
print("=" * 70)
print("""
GPT-OSS supports different reasoning effort levels:
• LOW    - Fast, concise answers (fewer tokens, faster)
• MEDIUM - Balanced reasoning and response
• HIGH   - Deep thinking with full chain-of-thought
The reasoning effort is controlled by system prompts and generation parameters.
""")
class ReasoningEffortController:
    """
    Controls reasoning effort levels for gpt-oss generations.
    """
    EFFORT_CONFIGS = {
        "low": {
            "system_prompt": "You are a helpful assistant. Be concise and direct.",
            "max_tokens": 200,
            "temperature": 0.7,
            "description": "Quick, concise answers"
        },
        "medium": {
            "system_prompt": "You are a helpful assistant. Think through problems step by step and provide clear, well-reasoned answers.",
            "max_tokens": 400,
            "temperature": 0.8,
            "description": "Balanced reasoning"
        },
        "high": {
            "system_prompt": """You are a helpful assistant with advanced reasoning capabilities.
For complex problems:
1. First, analyze the problem thoroughly
2. Consider multiple approaches
3. Show your full chain of thought
4. Provide a comprehensive, well-reasoned answer
Take your time to think deeply before responding.""",
            "max_tokens": 800,
            "temperature": 1.0,
            "description": "Deep chain-of-thought reasoning"
        }
    }

    def __init__(self, pipeline, tokenizer):
        self.pipe = pipeline
        self.tokenizer = tokenizer

    def generate(self, user_message: str, effort: str = "medium") -> dict:
        """Generate a response with the specified reasoning effort."""
        if effort not in self.EFFORT_CONFIGS:
            raise ValueError(f"Effort must be one of: {list(self.EFFORT_CONFIGS.keys())}")
        config = self.EFFORT_CONFIGS[effort]
        messages = [
            {"role": "system", "content": config["system_prompt"]},
            {"role": "user", "content": user_message}
        ]
        output = self.pipe(
            messages,
            max_new_tokens=config["max_tokens"],
            do_sample=True,
            temperature=config["temperature"],
            top_p=1.0,
            pad_token_id=self.tokenizer.eos_token_id,
        )
        return {
            "effort": effort,
            "description": config["description"],
            "response": output[0]["generated_text"][-1]["content"],
            "max_tokens_used": config["max_tokens"]
        }
reasoning_controller = ReasoningEffortController(pipe, tokenizer)
test_question = "If it takes 5 machines 5 minutes to make 5 widgets, how long would it take 100 machines to make 100 widgets?"  # example puzzle; any reasoning prompt works
print(f"\n🧩 Logic Puzzle: {test_question}\n")
for effort in ["low", "medium", "high"]:
    result = reasoning_controller.generate(test_question, effort)
    print(f"━━━ {effort.upper()} ({result['description']}) ━━━")
    print(f"{result['response'][:500]}...")
    print()
print("\n" + "=" * 70)
print("📋 PART 5: Structured Output Generation (JSON Mode)")
print("=" * 70)
import json
import re
class StructuredOutputGenerator:
    """
    Generate structured JSON outputs with schema validation.
    """
    def __init__(self, pipeline, tokenizer):
        self.pipe = pipeline
        self.tokenizer = tokenizer

    def generate_json(self, prompt: str, schema: dict, max_retries: int = 2) -> dict:
        """
        Generate JSON output conforming to a specified schema.
        Args:
            prompt: The user's request
            schema: JSON schema description
            max_retries: Number of retries on parse failure
        """
        schema_str = json.dumps(schema, indent=2)
        system_prompt = f"""You are a helpful assistant that ONLY outputs valid JSON.
Your response must exactly match this JSON schema:
{schema_str}
RULES:
- Output ONLY the JSON object, nothing else
- No markdown code blocks (no ```)
- No explanations before or after
- Ensure all required fields are present
- Use correct data types as specified"""
        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt}
        ]
        for attempt in range(max_retries + 1):
            output = self.pipe(
                messages,
                max_new_tokens=500,
                do_sample=True,
                temperature=0.3,
                top_p=1.0,
                pad_token_id=self.tokenizer.eos_token_id,
            )
            response_text = output[0]["generated_text"][-1]["content"]
            cleaned = self._clean_json_response(response_text)
            try:
                parsed = json.loads(cleaned)
                return {"success": True, "data": parsed, "attempts": attempt + 1}
            except json.JSONDecodeError as e:
                if attempt == max_retries:
                    return {
                        "success": False,
                        "error": str(e),
                        "raw_response": response_text,
                        "attempts": attempt + 1
                    }
                messages.append({"role": "assistant", "content": response_text})
                messages.append({"role": "user", "content": f"That wasn't valid JSON. Error: {e}. Please try again with ONLY valid JSON."})

    def _clean_json_response(self, text: str) -> str:
        """Remove markdown code blocks and extra whitespace."""
        text = re.sub(r'^```(?:json)?\s*', '', text.strip())
        text = re.sub(r'\s*```$', '', text)
        return text.strip()
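Because the schema here is only a textual hint to the model, a cheap standalone check helps catch malformed outputs before downstream code consumes them. This sketch is our own addition (not part of the class): it repeats the same fence-stripping regexes and adds a simple key-presence test rather than full JSON Schema validation:

```python
import json
import re

def clean_json_response(text: str) -> str:
    """Strip markdown fences the model sometimes adds despite instructions."""
    text = re.sub(r'^```(?:json)?\s*', '', text.strip())
    text = re.sub(r'\s*```$', '', text)
    return text.strip()

def has_required_keys(data: dict, schema: dict) -> bool:
    """Cheap structural check: every schema key must appear in the output."""
    return all(key in data for key in schema)

# A fenced response, as models often produce even when told not to.
raw = '```json\n{"name": "Tesla, Inc.", "type": "company"}\n```'
parsed = json.loads(clean_json_response(raw))
print(has_required_keys(parsed, {"name": "string", "type": "string"}))  # True
```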
json_generator = StructuredOutputGenerator(pipe, tokenizer)
print("\n📝 Example 1: Entity Extraction")
print("-" * 50)
entity_schema = {
"name": "string",
"type": "string (person/company/place)",
"description": "string (1-2 sentences)",
"key_facts": ["list of strings"]
}
entity_result = json_generator.generate_json(
"Extract information about: Tesla, Inc.",
entity_schema
)
if entity_result["success"]:
    print(json.dumps(entity_result["data"], indent=2))
else:
    print(f"Error: {entity_result['error']}")
print("\n\n📝 Example 2: Recipe Generation")
print("-" * 50)
recipe_schema = {
"name": "string",
"prep_time_minutes": "integer",
"cook_time_minutes": "integer",
"servings": "integer",
"difficulty": "string (easy/medium/hard)",
"ingredients": [{"item": "string", "amount": "string"}],
"steps": ["string"]
}
recipe_result = json_generator.generate_json(
"Create a simple recipe for chocolate chip cookies",
recipe_schema
)
if recipe_result["success"]:
    print(json.dumps(recipe_result["data"], indent=2))
else:
    print(f"Error: {recipe_result['error']}")

We build more advanced generation controls by introducing configurable reasoning effort and a structured JSON output workflow. We define different effort modes to vary how deeply the model reasons, how many tokens it uses, and how detailed its answers are during inference. We also create a JSON generation utility that guides the open-weight model toward schema-like outputs, cleans the returned text, and retries when the response isn't valid JSON.
print("\n" + "=" * 70)
print("💬 PART 6: Multi-turn Conversations with Memory")
print("=" * 70)
class ConversationManager:
    """
    Manages multi-turn conversations with context memory.
    Implements the Harmony format pattern used by gpt-oss.
    """
    def __init__(self, pipeline, tokenizer, system_message: str = None):
        self.pipe = pipeline
        self.tokenizer = tokenizer
        self.history = []
        if system_message:
            self.system_message = system_message
        else:
            self.system_message = "You are a helpful, friendly AI assistant. Remember the context of our conversation."

    def chat(self, user_message: str, max_new_tokens: int = 300) -> str:
        """Send a message and get a response, maintaining conversation history."""
        messages = [{"role": "system", "content": self.system_message}]
        messages.extend(self.history)
        messages.append({"role": "user", "content": user_message})
        output = self.pipe(
            messages,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.8,
            top_p=1.0,
            pad_token_id=self.tokenizer.eos_token_id,
        )
        assistant_response = output[0]["generated_text"][-1]["content"]
        self.history.append({"role": "user", "content": user_message})
        self.history.append({"role": "assistant", "content": assistant_response})
        return assistant_response

    def get_history_length(self) -> int:
        """Get the number of turns in the conversation."""
        return len(self.history) // 2

    def clear_history(self):
        """Clear conversation history."""
        self.history = []
        print("🗑️ Conversation history cleared.")

    def get_context_summary(self) -> str:
        """Get a summary of the conversation context."""
        if not self.history:
            return "No conversation history yet."
        summary = f"Conversation has {self.get_history_length()} turns:\n"
        for i, msg in enumerate(self.history):
            role = "👤 User" if msg["role"] == "user" else "🤖 Assistant"
            preview = msg["content"][:50] + "..." if len(msg["content"]) > 50 else msg["content"]
            summary += f"  {i+1}. {role}: {preview}\n"
        return summary
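One thing the manager above does not do is bound the history, which will eventually overflow the context window in a long session. A common pattern is to keep only the most recent turns; this is a minimal standalone sketch (our own addition; the turn budget is an arbitrary assumption):

```python
def trim_history(history, max_turns=5):
    """Keep only the most recent user/assistant pairs (2 messages per turn)."""
    return history[-2 * max_turns:]

# Build a synthetic 8-turn history to show the trimming behavior.
history = []
for i in range(8):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

trimmed = trim_history(history, max_turns=3)
print(len(trimmed))           # 6
print(trimmed[0]["content"])  # question 5
```

A token-budget-based trim (counting with `tokenizer`) would be more precise, but turn counting is often good enough for a demo.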
convo = ConversationManager(pipe, tokenizer)
print("\n🗣️ Multi-turn Conversation Demo:")
print("-" * 50)
conversation_turns = [
    "Hi! My name is Alex and I'm a software engineer.",
    "I'm working on a machine learning project. What framework would you recommend?",
    "Good suggestion! What's my name, by the way?",
    "Can you remember what field I work in?"
]
for turn in conversation_turns:
    print(f"\n👤 User: {turn}")
    response = convo.chat(turn)
    print(f"🤖 Assistant: {response}")
print(f"\n📊 {convo.get_context_summary()}")
print("\n" + "=" * 70)
print("⚡ PART 7: Streaming Token Generation")
print("=" * 70)
from transformers import TextIteratorStreamer
from threading import Thread
import time
def stream_response(prompt: str, max_tokens: int = 200):
    """
    Stream tokens as they are generated for real-time output.
    """
    messages = [{"role": "user", "content": prompt}]
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt"
    ).to(model.device)
    streamer = TextIteratorStreamer(
        tokenizer,
        skip_prompt=True,
        skip_special_tokens=True
    )
    generation_kwargs = {
        "input_ids": inputs,
        "streamer": streamer,
        "max_new_tokens": max_tokens,
        "do_sample": True,
        "temperature": 0.8,
        "top_p": 1.0,
        "pad_token_id": tokenizer.eos_token_id,
    }
    thread = Thread(target=model.generate, kwargs=generation_kwargs)
    thread.start()
    print("📝 Streaming: ", end="", flush=True)
    full_response = ""
    for token in streamer:
        print(token, end="", flush=True)
        full_response += token
        time.sleep(0.01)
    thread.join()
    print("\n")
    return full_response
print("\n🔄 Streaming Demo:")
print("-" * 50)
streamed = stream_response(
"Count from 1 to 10, with a brief comment about each number.",
max_tokens=250
)
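The streamer pattern above is really just a producer/consumer queue: `model.generate` pushes tokens from a background thread while the main thread consumes and prints them. The same mechanics can be sketched without the model, using a plain `queue.Queue` and a fake generator (both are illustrative stand-ins, not transformers APIs):

```python
import queue
import threading
import time

def fake_generate(stream, tokens):
    """Stand-in for model.generate: pushes tokens, then a sentinel."""
    for tok in tokens:
        stream.put(tok)
        time.sleep(0.001)  # simulate per-token decoding latency
    stream.put(None)  # end-of-stream marker

stream = queue.Queue()
worker = threading.Thread(target=fake_generate, args=(stream, ["Hel", "lo", "!"]))
worker.start()

# Main thread consumes tokens as they arrive, exactly like iterating a streamer.
received = []
while True:
    tok = stream.get()
    if tok is None:
        break
    received.append(tok)
worker.join()
print("".join(received))  # Hello!
```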
We move from single prompts to stateful interactions by creating a conversation manager that stores multi-turn chat history and reuses that context in subsequent responses. We demonstrate how we maintain memory across turns, summarize prior context, and make the interaction feel like a persistent assistant instead of a one-off generation call. We also implement streaming generation so we can watch tokens arrive in real time, which helps us understand the model's live decoding behavior more clearly.
print("\n" + "=" * 70)
print("🔧 PART 8: Function Calling / Tool Use")
print("=" * 70)
import math
from datetime import datetime
class ToolExecutor:
    """
    Manages tool definitions and execution for gpt-oss.
    """
    def __init__(self):
        self.tools = {}
        self._register_default_tools()

    def _register_default_tools(self):
        """Register built-in tools."""
        @self.register("calculator", "Perform mathematical calculations")
        def calculator(expression: str) -> str:
            """Evaluate a mathematical expression."""
            try:
                allowed_names = {
                    k: v for k, v in math.__dict__.items()
                    if not k.startswith("_")
                }
                allowed_names.update({"abs": abs, "round": round})
                result = eval(expression, {"__builtins__": {}}, allowed_names)
                return f"Result: {result}"
            except Exception as e:
                return f"Error: {str(e)}"

        @self.register("get_time", "Get current date and time")
        def get_time() -> str:
            """Get the current date and time."""
            now = datetime.now()
            return f"Current time: {now.strftime('%Y-%m-%d %H:%M:%S')}"

        @self.register("weather", "Get weather for a city (simulated)")
        def weather(city: str) -> str:
            """Get weather information (simulated)."""
            import random
            temp = random.randint(60, 85)
            conditions = random.choice(["sunny", "partly cloudy", "cloudy", "rainy"])
            return f"Weather in {city}: {temp}°F, {conditions}"

        @self.register("search", "Search for information (simulated)")
        def search(query: str) -> str:
            """Search the web (simulated)."""
            return f"Search results for '{query}': [Simulated results - in production, connect to a real search API]"

    def register(self, name: str, description: str):
        """Decorator to register a tool."""
        def decorator(func):
            self.tools[name] = {
                "function": func,
                "description": description,
                "name": name
            }
            return func
        return decorator

    def get_tools_prompt(self) -> str:
        """Generate a tools description for the system prompt."""
        tools_desc = "You have access to the following tools:\n\n"
        for name, tool in self.tools.items():
            tools_desc += f"- {name}: {tool['description']}\n"
        tools_desc += """
To use a tool, respond with:
TOOL: <tool_name>
ARGS: <arguments as JSON>
After receiving the tool result, provide your final answer to the user."""
        return tools_desc

    def execute(self, tool_name: str, args: dict) -> str:
        """Execute a tool with given arguments."""
        if tool_name not in self.tools:
            return f"Error: Unknown tool '{tool_name}'"
        try:
            func = self.tools[tool_name]["function"]
            if args:
                result = func(**args)
            else:
                result = func()
            return result
        except Exception as e:
            return f"Error executing {tool_name}: {str(e)}"

    def parse_tool_call(self, response: str) -> tuple:
        """Parse a tool call from the model response."""
        if "TOOL:" not in response:
            return None, None
        lines = response.split("\n")
        tool_name = None
        args = {}
        for line in lines:
            if line.startswith("TOOL:"):
                tool_name = line.replace("TOOL:", "").strip()
            elif line.startswith("ARGS:"):
                try:
                    args_str = line.replace("ARGS:", "").strip()
                    args = json.loads(args_str) if args_str else {}
                except json.JSONDecodeError:
                    args = {"expression": args_str} if tool_name == "calculator" else {"query": args_str}
        return tool_name, args
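The TOOL:/ARGS: convention is plain text, so the parsing step is easy to exercise in isolation. This standalone sketch mirrors the logic of `parse_tool_call` (with a simplified fallback for non-JSON arguments, which is our own variation) to show the round trip:

```python
import json

def parse_tool_call(response: str):
    """Parse the plain-text TOOL:/ARGS: convention used in this tutorial."""
    if "TOOL:" not in response:
        return None, None
    tool_name, args = None, {}
    for line in response.split("\n"):
        if line.startswith("TOOL:"):
            tool_name = line.replace("TOOL:", "").strip()
        elif line.startswith("ARGS:"):
            args_str = line.replace("ARGS:", "").strip()
            try:
                args = json.loads(args_str) if args_str else {}
            except json.JSONDecodeError:
                args = {"raw": args_str}  # simplified fallback for non-JSON args
    return tool_name, args

name, args = parse_tool_call('TOOL: calculator\nARGS: {"expression": "15 * 23 + 7"}')
print(name, args)  # calculator {'expression': '15 * 23 + 7'}
```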
tools = ToolExecutor()

def chat_with_tools(user_message: str) -> str:
    """
    Chat with tool-use capability.
    """
    system_prompt = f"""You are a helpful assistant with access to tools.
{tools.get_tools_prompt()}
If the user's request can be answered directly, do so.
If you need to use a tool, indicate which tool and with what arguments."""
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_message}
    ]
    output = pipe(
        messages,
        max_new_tokens=200,
        do_sample=True,
        temperature=0.7,
        pad_token_id=tokenizer.eos_token_id,
    )
    response = output[0]["generated_text"][-1]["content"]
    tool_name, args = tools.parse_tool_call(response)
    if tool_name:
        tool_result = tools.execute(tool_name, args)
        messages.append({"role": "assistant", "content": response})
        messages.append({"role": "user", "content": f"Tool result: {tool_result}\n\nNow provide your final answer."})
        final_output = pipe(
            messages,
            max_new_tokens=200,
            do_sample=True,
            temperature=0.7,
            pad_token_id=tokenizer.eos_token_id,
        )
        return final_output[0]["generated_text"][-1]["content"]
    return response
print("\n🔧 Tool Use Examples:")
print("-" * 50)
tool_queries = [
    "What is 15 * 23 + 7?",
    "What time is it right now?",
    "What's the weather like in Tokyo?",
]
for query in tool_queries:
    print(f"\n👤 User: {query}")
    response = chat_with_tools(query)
    print(f"🤖 Assistant: {response}")
print("\n" + "=" * 70)
print("📦 PART 9: Batch Processing for Efficiency")
print("=" * 70)
def batch_generate(prompts: list, batch_size: int = 2, max_new_tokens: int = 100) -> list:
    """
    Process multiple prompts in batches for efficiency.
    Args:
        prompts: List of prompts to process
        batch_size: Number of prompts per batch
        max_new_tokens: Maximum tokens per response
    Returns:
        List of responses
    """
    results = []
    total_batches = (len(prompts) + batch_size - 1) // batch_size
    for i in range(0, len(prompts), batch_size):
        batch = prompts[i:i + batch_size]
        batch_num = i // batch_size + 1
        print(f"  Processing batch {batch_num}/{total_batches}...")
        batch_messages = [
            [{"role": "user", "content": prompt}]
            for prompt in batch
        ]
        for messages in batch_messages:
            output = pipe(
                messages,
                max_new_tokens=max_new_tokens,
                do_sample=True,
                temperature=0.7,
                pad_token_id=tokenizer.eos_token_id,
            )
            results.append(output[0]["generated_text"][-1]["content"])
    return results
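Note that `batch_generate` still runs prompts one at a time inside each batch; true batched decoding would left-pad and stack the tokenized inputs before a single `model.generate` call. The chunking logic itself, though, is pure Python and easy to verify in isolation (a minimal sketch):

```python
def chunk(prompts, batch_size):
    """Split a prompt list into consecutive batches of at most batch_size."""
    return [prompts[i:i + batch_size] for i in range(0, len(prompts), batch_size)]

batches = chunk(["a", "b", "c", "d", "e"], batch_size=2)
print(batches)  # [['a', 'b'], ['c', 'd'], ['e']]
```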
print("\n📝 Batch Processing Example:")
print("-" * 50)
batch_prompts = [
"What is the capital of France?",
"What is 7 * 8?",
"Name a primary color.",
"What season comes after summer?",
"What is H2O commonly called?",
]
print(f"Processing {len(batch_prompts)} prompts...\n")
batch_results = batch_generate(batch_prompts, batch_size=2)
for prompt, result in zip(batch_prompts, batch_results):
    print(f"Q: {prompt}")
    print(f"A: {result[:100]}...\n")

We extend the tutorial to include tool use and batch inference, enabling the open-weight model to support more realistic application patterns. We define a lightweight tool execution framework, let the model choose tools through a structured text pattern, and then feed the tool results back into the generation loop to produce a final answer. We also add batch processing to handle multiple prompts efficiently, which is useful for testing throughput and reusing the same inference pipeline across multiple tasks.
print("\n" + "=" * 70)
print("🤖 PART 10: Interactive Chatbot Interface")
print("=" * 70)
import gradio as gr
def create_chatbot():
    """Create a Gradio chatbot interface for gpt-oss."""
    def respond(message, history):
        """Generate chatbot response."""
        messages = []
        for user_msg, assistant_msg in history:
            messages.append({"role": "user", "content": user_msg})
            if assistant_msg:
                messages.append({"role": "assistant", "content": assistant_msg})
        messages.append({"role": "user", "content": message})
        output = pipe(
            messages,
            max_new_tokens=400,
            do_sample=True,
            temperature=0.8,
            top_p=1.0,
            pad_token_id=tokenizer.eos_token_id,
        )
        return output[0]["generated_text"][-1]["content"]

    demo = gr.ChatInterface(
        fn=respond,
        title="🚀 GPT-OSS Chatbot",
        description="Chat with OpenAI's open-weight GPT-OSS model!",
        examples=[
            "Explain quantum computing in simple terms.",
            "What are the benefits of open-source AI?",
            "Tell me a fun fact about space.",
        ],
        theme=gr.themes.Soft(),
    )
    return demo

print("\n🚀 Creating Gradio chatbot interface...")
chatbot = create_chatbot()
print("\n" + "=" * 70)
print("🎁 PART 11: Utility Helpers")
print("=" * 70)
class GptOssHelpers:
    """Collection of utility functions for common tasks."""
    def __init__(self, pipeline, tokenizer):
        self.pipe = pipeline
        self.tokenizer = tokenizer

    def summarize(self, text: str, max_words: int = 50) -> str:
        """Summarize text to a specified length."""
        messages = [
            {"role": "system", "content": f"Summarize the following text in {max_words} words or less. Be concise."},
            {"role": "user", "content": text}
        ]
        output = self.pipe(messages, max_new_tokens=150, temperature=0.5, pad_token_id=self.tokenizer.eos_token_id)
        return output[0]["generated_text"][-1]["content"]

    def translate(self, text: str, target_language: str) -> str:
        """Translate text to a target language."""
        messages = [
            {"role": "user", "content": f"Translate to {target_language}: {text}"}
        ]
        output = self.pipe(messages, max_new_tokens=200, temperature=0.3, pad_token_id=self.tokenizer.eos_token_id)
        return output[0]["generated_text"][-1]["content"]

    def explain_simply(self, concept: str) -> str:
        """Explain a concept in simple terms."""
        messages = [
            {"role": "system", "content": "Explain concepts simply, as if to a curious 10-year-old. Use analogies and examples."},
            {"role": "user", "content": f"Explain: {concept}"}
        ]
        output = self.pipe(messages, max_new_tokens=200, temperature=0.8, pad_token_id=self.tokenizer.eos_token_id)
        return output[0]["generated_text"][-1]["content"]

    def extract_keywords(self, text: str, num_keywords: int = 5) -> list:
        """Extract key topics from text."""
        messages = [
            {"role": "user", "content": f"Extract exactly {num_keywords} keywords from this text. Return only the keywords, comma-separated:\n\n{text}"}
        ]
        output = self.pipe(messages, max_new_tokens=50, temperature=0.3, pad_token_id=self.tokenizer.eos_token_id)
        keywords = output[0]["generated_text"][-1]["content"]
        return [k.strip() for k in keywords.split(",")]
helpers = GptOssHelpers(pipe, tokenizer)
print("\n📝 Helper Functions Demo:")
print("-" * 50)
sample_text = """
Artificial intelligence has transformed many industries in recent years.
From healthcare diagnostics to autonomous vehicles, AI systems are becoming
increasingly capable and widely deployed.
"""
print("\n1️⃣ Summarization:")
summary = helpers.summarize(sample_text, max_words=20)
print(f"   {summary}")
print("\n2️⃣ Simple Explanation:")
explanation = helpers.explain_simply("neural networks")
print(f"   {explanation[:200]}...")
print("\n" + "=" * 70)
print("✅ TUTORIAL COMPLETE!")
print("=" * 70)
print("""
🎉 You have learned how to use GPT-OSS on Google Colab!
WHAT YOU LEARNED:
✓ Correct model loading (no load_in_4bit - uses native MXFP4)
✓ Basic inference with proper parameters
✓ Configurable reasoning effort (low/medium/high)
✓ Structured JSON output generation
✓ Multi-turn conversations with memory
✓ Streaming token generation
✓ Function calling and tool use
✓ Batch processing for efficiency
✓ Interactive Gradio chatbot
KEY TAKEAWAYS:
• GPT-OSS uses native MXFP4 quantization (don't use bitsandbytes)
• Recommended: temperature=1.0, top_p=1.0
• gpt-oss-20b fits on a T4 GPU (~16GB VRAM)
• gpt-oss-120b requires H100/A100 (~80GB VRAM)
• Always use trust_remote_code=True
RESOURCES:
📚 GitHub:
📚 Hugging Face:
📚 Model Card:
📚 Harmony Format:
📚 Cookbook:
ALTERNATIVE INFERENCE OPTIONS (for better performance):
• vLLM: Production-ready, OpenAI-compatible server
• Ollama: Easy local deployment
• LM Studio: Desktop GUI application
""")
if torch.cuda.is_available():
    print(f"\n📊 Final GPU Memory Usage:")
    print(f"   Allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
    print(f"   Reserved: {torch.cuda.memory_reserved() / 1e9:.2f} GB")
print("\n" + "=" * 70)
print("🚀 Launch the chatbot by running: chatbot.launch(share=True)")
print("=" * 70)

We turn the model pipeline into a usable application by building a Gradio chatbot interface and then adding helper utilities for summarization, translation, simplified explanation, and keyword extraction. We show how the same open-weight model can support both interactive chat and reusable task-specific functions within a single Colab workflow. We end by summarizing the tutorial, reviewing the key technical takeaways, and reinforcing how GPT-OSS can be loaded, controlled, and extended as a practical open-weight system.
In conclusion, we built a comprehensive hands-on understanding of how to use GPT-OSS as an open-weight language model rather than a black-box endpoint. We loaded the model with the correct inference path, avoiding incorrect low-bit loading approaches, and worked through important implementation patterns, including configurable reasoning effort, JSON-constrained outputs, Harmony-style conversational formatting, token streaming, lightweight tool-use orchestration, and Gradio-based interaction. In doing so, we saw the real advantage of open-weight models: we can directly control model loading, inspect runtime behavior, shape generation flows, and design custom utilities on top of the base model without depending entirely on managed infrastructure.