In this tutorial, we explore how to use Google's LangExtract library to transform unstructured text into structured, machine-readable information. We begin by installing the required dependencies and securely configuring our OpenAI API key to leverage powerful language models for extraction tasks. We then build a reusable extraction pipeline that lets us process a variety of document types, including contracts, meeting notes, product announcements, and operational logs. Through carefully designed prompts and example annotations, we demonstrate how LangExtract can identify entities, actions, deadlines, risks, and other structured attributes while grounding them to their exact source spans. We also visualize the extracted information and organize it into tabular datasets, enabling downstream analytics, automation workflows, and decision-making systems.
!pip -q install -U "langextract[openai]" pandas IPython
import os
import json
import textwrap
import getpass
import pandas as pd
OPENAI_API_KEY = getpass.getpass("Enter OPENAI_API_KEY: ")
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
import langextract as lx
from IPython.display import display, HTML

We install the required libraries, including LangExtract, Pandas, and IPython, so that our Colab environment is ready for structured extraction tasks. We securely request the OpenAI API key from the user and store it as an environment variable for safe access during runtime. We then import the core libraries needed to run LangExtract, display results, and handle structured outputs.
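If you rerun the notebook often, prompting for the key every time gets tedious. A small optional variant (`get_openai_key` is our own helper, not part of LangExtract) reuses an already-set environment variable and prompts only as a fallback:

```python
import os
import getpass

def get_openai_key():
    # Reuse an existing environment variable when present;
    # fall back to an interactive prompt otherwise.
    key = os.environ.get("OPENAI_API_KEY")
    if not key:
        key = getpass.getpass("Enter OPENAI_API_KEY: ")
        os.environ["OPENAI_API_KEY"] = key
    return key
```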
MODEL_ID = "gpt-4o-mini"
def run_extraction(
    text_or_documents,
    prompt_description,
    examples,
    output_stem,
    model_id=MODEL_ID,
    extraction_passes=1,
    max_workers=4,
    max_char_buffer=1800,
):
    result = lx.extract(
        text_or_documents=text_or_documents,
        prompt_description=prompt_description,
        examples=examples,
        model_id=model_id,
        api_key=os.environ["OPENAI_API_KEY"],
        fence_output=True,
        use_schema_constraints=False,
        extraction_passes=extraction_passes,
        max_workers=max_workers,
        max_char_buffer=max_char_buffer,
    )
    jsonl_name = f"{output_stem}.jsonl"
    html_name = f"{output_stem}.html"
    lx.io.save_annotated_documents([result], output_name=jsonl_name, output_dir=".")
    html_content = lx.visualize(jsonl_name)
    with open(html_name, "w", encoding="utf-8") as f:
        if hasattr(html_content, "data"):
            f.write(html_content.data)
        else:
            f.write(html_content)
    return result, jsonl_name, html_name
def extraction_rows(result):
    rows = []
    for ex in result.extractions:
        start_pos = None
        end_pos = None
        if getattr(ex, "char_interval", None):
            start_pos = ex.char_interval.start_pos
            end_pos = ex.char_interval.end_pos
        rows.append({
            "class": ex.extraction_class,
            "text": ex.extraction_text,
            "attributes": json.dumps(ex.attributes or {}, ensure_ascii=False),
            "start": start_pos,
            "end": end_pos,
        })
    return pd.DataFrame(rows)
def preview_result(title, result, html_name, max_rows=50):
    print("=" * 80)
    print(title)
    print("=" * 80)
    print(f"Total extractions: {len(result.extractions)}")
    df = extraction_rows(result)
    display(df.head(max_rows))
    display(HTML(f'<a href="{html_name}" target="_blank">Open interactive visualization: {html_name}</a>'))

We define the core utility functions that power the entire extraction pipeline. We create a reusable run_extraction function that sends text to the LangExtract engine and generates both JSONL and HTML outputs. We also define helper functions to convert the extraction results into tabular rows and preview them interactively in the notebook.
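Because extraction_rows serializes each extraction's attributes to a JSON string, downstream filtering often wants them back as real columns. A minimal sketch of how that could look (`expand_attributes` is a hypothetical helper of ours, not a LangExtract API):

```python
import json

import pandas as pd

def expand_attributes(df):
    # Parse the JSON-encoded "attributes" column into separate columns,
    # keeping the remaining columns (class, text, start, end) alongside.
    attrs = df["attributes"].apply(json.loads).apply(pd.Series)
    return pd.concat([df.drop(columns=["attributes"]), attrs], axis=1)

rows = pd.DataFrame({
    "class": ["penalty"],
    "text": ["2% monthly penalty"],
    "attributes": ['{"category": "late_payment", "risk_level": "high"}'],
    "start": [120],
    "end": [138],
})
wide = expand_attributes(rows)  # columns now include category and risk_level
```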
contract_prompt = textwrap.dedent("""
Extract contract-risk information in order of appearance.
Rules:
1. Use exact text spans from the source. Do not paraphrase extraction_text.
2. Extract the following classes when present:
- party
- obligation
- deadline
- payment_term
- penalty
- termination_clause
- governing_law
3. Add useful attributes:
- party_name for obligations or payment terms when relevant
- risk_level as low, medium, or high
- category for the business meaning
4. Keep output grounded to the exact wording in the source.
5. Do not merge non-contiguous spans into one extraction.
""")
contract_examples = [
lx.data.ExampleData(
text=(
"Acme Corp shall deliver the equipment by March 15, 2026. "
"The Client must pay within 10 days of invoice receipt. "
"Late payment incurs a 2% monthly penalty. "
"This agreement is governed by the laws of Ontario."
),
extractions=[
lx.data.Extraction(
extraction_class="party",
extraction_text="Acme Corp",
attributes={"category": "supplier", "risk_level": "low"}
),
lx.data.Extraction(
extraction_class="obligation",
extraction_text="shall deliver the equipment",
attributes={"party_name": "Acme Corp", "category": "delivery", "risk_level": "medium"}
),
lx.data.Extraction(
extraction_class="deadline",
extraction_text="by March 15, 2026",
attributes={"category": "delivery_deadline", "risk_level": "medium"}
),
lx.data.Extraction(
extraction_class="party",
extraction_text="The Client",
attributes={"category": "customer", "risk_level": "low"}
),
lx.data.Extraction(
extraction_class="payment_term",
extraction_text="must pay within 10 days of invoice receipt",
attributes={"party_name": "The Client", "category": "payment", "risk_level": "medium"}
),
lx.data.Extraction(
extraction_class="penalty",
extraction_text="2% monthly penalty",
attributes={"category": "late_payment", "risk_level": "high"}
),
lx.data.Extraction(
extraction_class="governing_law",
extraction_text="laws of Ontario",
attributes={"category": "legal_jurisdiction", "risk_level": "low"}
),
]
)
]
contract_text = """
BluePeak Analytics shall provide a production-ready dashboard and underlying ETL pipeline no later than April 30, 2026.
North Ridge Manufacturing will remit payment within 7 calendar days after final acceptance.
If payment is delayed beyond 15 days, BluePeak Analytics may suspend support services and charge interest at 1.5% per month.
This Agreement shall be governed by the laws of British Columbia.
"""
contract_result, contract_jsonl, contract_html = run_extraction(
text_or_documents=contract_text,
prompt_description=contract_prompt,
examples=contract_examples,
output_stem="contract_risk_extraction",
extraction_passes=2,
max_workers=4,
max_char_buffer=1400,
)
preview_result("USE CASE 1 — Contract risk extraction", contract_result, contract_html)

We build a contract intelligence extraction workflow by defining a detailed prompt and structured examples. We provide LangExtract with annotated training-style examples so that it understands how to identify entities such as obligations, deadlines, penalties, and governing laws. We then run the extraction pipeline on a contract text and preview the structured risk-related outputs.
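Once the risk_level attribute is populated, the tabular output can be triaged directly, for example to surface only high-risk clauses. A short sketch under the attribute names used in our examples (`high_risk` is our own helper, not part of LangExtract):

```python
import json

import pandas as pd

def high_risk(df):
    # Keep only rows whose JSON attributes mark them as high risk.
    mask = df["attributes"].apply(
        lambda s: json.loads(s).get("risk_level") == "high"
    )
    return df[mask]

sample = pd.DataFrame({
    "class": ["penalty", "party"],
    "text": ["1.5% per month", "BluePeak Analytics"],
    "attributes": ['{"risk_level": "high"}', '{"risk_level": "low"}'],
})
flagged = high_risk(sample)  # only the penalty row remains
```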
meeting_prompt = textwrap.dedent("""
Extract action items from meeting notes in order of appearance.
Rules:
1. Use exact text spans from the source. No paraphrasing in extraction_text.
2. Extract these classes when present:
- assignee
- action_item
- due_date
- blocker
- decision
3. Add attributes:
- priority as low, medium, or high
- workstream when inferable from local context
- owner for action_item when tied to a named assignee
4. Keep all spans grounded to the source text.
5. Preserve order of appearance.
""")
meeting_examples = [
lx.data.ExampleData(
text=(
"Sarah will finalize the launch email by Friday. "
"The team decided to postpone the webinar. "
"Blocked by missing legal approval."
),
extractions=[
lx.data.Extraction(
extraction_class="assignee",
extraction_text="Sarah",
attributes={"priority": "medium", "workstream": "marketing"}
),
lx.data.Extraction(
extraction_class="action_item",
extraction_text="will finalize the launch email",
attributes={"owner": "Sarah", "priority": "high", "workstream": "marketing"}
),
lx.data.Extraction(
extraction_class="due_date",
extraction_text="by Friday",
attributes={"priority": "medium", "workstream": "marketing"}
),
lx.data.Extraction(
extraction_class="decision",
extraction_text="decided to postpone the webinar",
attributes={"priority": "medium", "workstream": "events"}
),
lx.data.Extraction(
extraction_class="blocker",
extraction_text="missing legal approval",
attributes={"priority": "high", "workstream": "compliance"}
),
]
)
]
meeting_text = """
Arjun will prepare the revised pricing sheet by Tuesday evening.
Mina to confirm the enterprise customer's data residency requirements this week.
The team agreed to ship the pilot only for the Oman region first.
Blocked by pending security review from the client's IT team.
Ravi will draft the rollback plan before the production cutover.
"""
meeting_result, meeting_jsonl, meeting_html = run_extraction(
text_or_documents=meeting_text,
prompt_description=meeting_prompt,
examples=meeting_examples,
output_stem="meeting_action_extraction",
extraction_passes=2,
max_workers=4,
max_char_buffer=1400,
)
preview_result("USE CASE 2 — Meeting notes to action tracker", meeting_result, meeting_html)

We design a meeting intelligence extractor that focuses on action items, decisions, assignees, and blockers. We again provide example annotations to help the model structure meeting information consistently. We execute the extraction on meeting notes and display the resulting structured task tracker.
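The extracted rows can then be rolled up into a per-owner checklist. A minimal sketch, assuming the owner attribute our meeting examples attach to action items (`action_tracker` is a hypothetical helper):

```python
import json

import pandas as pd

def action_tracker(df):
    # Map each owner to the list of action items assigned to them.
    items = df[df["class"] == "action_item"].copy()
    items["owner"] = items["attributes"].apply(
        lambda s: json.loads(s).get("owner", "unassigned")
    )
    return items.groupby("owner")["text"].apply(list).to_dict()

sample = pd.DataFrame({
    "class": ["assignee", "action_item"],
    "text": ["Arjun", "will prepare the revised pricing sheet"],
    "attributes": ['{}', '{"owner": "Arjun"}'],
})
tracker = action_tracker(sample)
```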
longdoc_prompt = textwrap.dedent("""
Extract product launch intelligence in order of appearance.
Rules:
1. Use exact text spans from the source.
2. Extract:
- company
- product
- launch_date
- region
- metric
- partnership
3. Add attributes:
- category
- significance as low, medium, or high
4. Keep the extraction grounded in the original text.
5. Do not paraphrase the extracted span.
""")
longdoc_examples = [
lx.data.ExampleData(
text=(
"Nova Robotics launched Atlas Mini in Europe on 12 January 2026. "
"The company reported 18% faster picking speed and partnered with Helix Warehousing."
),
extractions=[
lx.data.Extraction(
extraction_class="company",
extraction_text="Nova Robotics",
attributes={"category": "vendor", "significance": "medium"}
),
lx.data.Extraction(
extraction_class="product",
extraction_text="Atlas Mini",
attributes={"category": "product_name", "significance": "high"}
),
lx.data.Extraction(
extraction_class="region",
extraction_text="Europe",
attributes={"category": "market", "significance": "medium"}
),
lx.data.Extraction(
extraction_class="launch_date",
extraction_text="12 January 2026",
attributes={"category": "timeline", "significance": "medium"}
),
lx.data.Extraction(
extraction_class="metric",
extraction_text="18% faster picking speed",
attributes={"category": "performance_claim", "significance": "high"}
),
lx.data.Extraction(
extraction_class="partnership",
extraction_text="partnered with Helix Warehousing",
attributes={"category": "go_to_market", "significance": "medium"}
),
]
)
]
long_text = """
Vertex Dynamics launched FleetSense 3.0 for industrial logistics teams across the GCC on 5 February 2026.
The company said the release improves the accuracy of route deviation detection by 22% and reduces manual review time by 31%.
In the first rollout phase, the platform will support Oman and the United Arab Emirates.
Vertex Dynamics also partnered with Falcon Telematics to integrate live driver behavior events into the dashboard.
A week later, FleetSense 3.0 added a risk-scoring module for safety managers.
The update gives supervisors a daily ranked list of high-risk trips and exception events.
The company described the module as especially valuable for oilfield transport operations and contractor fleet audits.
By late February 2026, the team announced a pilot with Desert Haul Services.
The pilot covers 240 heavy vehicles and focuses on speeding up incident triage, compliance review, and evidence retrieval.
Internal testing showed analysts could assemble review packets in under 8 minutes instead of the previous 20 minutes.
"""
longdoc_result, longdoc_jsonl, longdoc_html = run_extraction(
text_or_documents=long_text,
prompt_description=longdoc_prompt,
examples=longdoc_examples,
output_stem="long_document_extraction",
extraction_passes=3,
max_workers=8,
max_char_buffer=1000,
)
preview_result("USE CASE 3 — Long-document extraction", longdoc_result, longdoc_html)
batch_docs = [
"""
The supplier must replace defective batteries within 14 days of written notice.
Any unresolved safety issue may trigger immediate suspension of shipments.
""",
"""
Priya will circulate the revised onboarding checklist tomorrow morning.
The team approved the API deprecation plan for the legacy endpoint.
""",
"""
Orbit Health launched a remote triage assistant in Singapore on 14 March 2026.
The company claims the assistant reduces nurse intake time by 17%.
"""
]
batch_prompt = textwrap.dedent("""
Extract operationally useful spans in order of appearance.
Allowed classes:
- obligation
- deadline
- penalty
- assignee
- action_item
- decision
- company
- product
- launch_date
- metric
Use exact text only and attach a simple attribute:
- source_type
""")
batch_examples = [
lx.data.ExampleData(
text="Jordan will submit the report by Monday. Late delivery incurs a service credit.",
extractions=[
lx.data.Extraction(
extraction_class="assignee",
extraction_text="Jordan",
attributes={"source_type": "meeting"}
),
lx.data.Extraction(
extraction_class="action_item",
extraction_text="will submit the report",
attributes={"source_type": "meeting"}
),
lx.data.Extraction(
extraction_class="deadline",
extraction_text="by Monday",
attributes={"source_type": "meeting"}
),
lx.data.Extraction(
extraction_class="penalty",
extraction_text="service credit",
attributes={"source_type": "contract"}
),
]
)
]
batch_results = []
for idx, doc in enumerate(batch_docs, start=1):
    res, jsonl_name, html_name = run_extraction(
        text_or_documents=doc,
        prompt_description=batch_prompt,
        examples=batch_examples,
        output_stem=f"batch_doc_{idx}",
        extraction_passes=2,
        max_workers=4,
        max_char_buffer=1200,
    )
    df = extraction_rows(res)
    df.insert(0, "document_id", idx)
    batch_results.append(df)
    print(f"Finished document {idx} -> {html_name}")
batch_df = pd.concat(batch_results, ignore_index=True)
print("\nCombined batch output")
display(batch_df)
print("\nContract extraction counts by class")
display(
    extraction_rows(contract_result)
    .groupby("class", as_index=False)
    .size()
    .sort_values("size", ascending=False)
)
print("\nMeeting action items only")
meeting_df = extraction_rows(meeting_result)
display(meeting_df[meeting_df["class"] == "action_item"])
print("\nLong-document metrics only")
longdoc_df = extraction_rows(longdoc_result)
display(longdoc_df[longdoc_df["class"] == "metric"])
final_df = pd.concat([
extraction_rows(contract_result).assign(use_case="contract_risk"),
extraction_rows(meeting_result).assign(use_case="meeting_actions"),
extraction_rows(longdoc_result).assign(use_case="long_document"),
], ignore_index=True)
final_df.to_csv("langextract_tutorial_outputs.csv", index=False)
print("\nSaved CSV: langextract_tutorial_outputs.csv")
print("\nGenerated files:")
for name in [
    contract_jsonl, contract_html,
    meeting_jsonl, meeting_html,
    longdoc_jsonl, longdoc_html,
    "langextract_tutorial_outputs.csv",
]:
    print(" -", name)

We implement a long-document intelligence pipeline capable of extracting structured insights from long narrative text. We run the extraction across product launch reports and operational documents, and also demonstrate batch processing across multiple documents. We then analyze the extracted results, filter key classes, and export the structured dataset to a CSV file for downstream analysis.
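Since every extraction carries character offsets, the exported rows can be audited against the source text to confirm the grounding claim. A small sketch (`verify_grounding` is our own helper, not a LangExtract API):

```python
import pandas as pd

def verify_grounding(df, source_text):
    # Return the extracted texts whose (start, end) offsets do not
    # reproduce the same characters in the source document.
    mismatched = []
    for _, row in df.iterrows():
        if row["start"] is None or row["end"] is None:
            continue
        if source_text[int(row["start"]):int(row["end"])] != row["text"]:
            mismatched.append(row["text"])
    return mismatched

source = "Late payment incurs a 2% monthly penalty."
df = pd.DataFrame({
    "class": ["penalty"],
    "text": ["2% monthly penalty"],
    "start": [22],
    "end": [40],
})
issues = verify_grounding(df, source)  # empty list means all spans are grounded
```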
In conclusion, we built an advanced LangExtract workflow that converts complex text documents into structured datasets with traceable source grounding. We ran several extraction scenarios, including contract risk analysis, meeting action tracking, long-document intelligence extraction, and batch processing across multiple documents. We also visualized the extractions and exported the final structured results to a CSV file for further analysis. Through this process, we saw how prompt design, example-based extraction, and scalable processing settings allow us to build robust information extraction systems with minimal code.