In this tutorial, we build a complete, hands-on understanding of DuckDB-Python by working through its features directly in code on Colab. We begin with the basics of connection management and data generation, then move into real analytical workflows, including querying Pandas, Polars, and Arrow objects without manual loading, transforming results across multiple formats, and writing expressive SQL for window functions, pivots, macros, recursive CTEs, and joins. As we progress, we also explore performance-oriented capabilities such as bulk insertion, profiling, partitioned storage, multi-threaded access, remote file querying, and efficient export patterns, so we not only learn what DuckDB can do, but also how to use it as a serious analytical engine inside Python.
import subprocess, sys
for pkg in ["duckdb", "pandas", "pyarrow", "polars"]:
    try:
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", "-q", pkg],
            stderr=subprocess.DEVNULL,
        )
    except subprocess.CalledProcessError:
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", "-q", "--break-system-packages", pkg],
            stderr=subprocess.DEVNULL,
        )
import duckdb
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import polars as pl
import numpy as np
import json, os, time, threading, tempfile
from datetime import date, datetime, timedelta
print(f"DuckDB version : {duckdb.__version__}")
print(f"Pandas version : {pd.__version__}")
print(f"PyArrow version: {pa.__version__}")
print(f"Polars version : {pl.__version__}")
print("=" * 72)
WORKDIR = tempfile.mkdtemp(prefix="duckdb_tutorial_")
os.chdir(WORKDIR)
print(f"Working directory: {WORKDIR}\n")
print("=" * 72)
print("SECTION 1: Connection Management")
print("=" * 72)
con = duckdb.connect()
print(con.sql("SELECT 'Hello from in-memory DuckDB!' AS greeting").fetchone()[0])
DB_PATH = os.path.join(WORKDIR, "tutorial.duckdb")
pcon = duckdb.connect(DB_PATH)
pcon.sql("CREATE OR REPLACE TABLE persisted(id INT, val TEXT)")
pcon.sql("INSERT INTO persisted VALUES (1,'alpha'), (2,'beta')")
print("Persisted rows:", pcon.sql("SELECT count(*) FROM persisted").fetchone()[0])
pcon.close()
pcon2 = duckdb.connect(DB_PATH)
print("After re-open :", pcon2.sql("SELECT * FROM persisted ORDER BY id").fetchall())
pcon2.close()
con_cfg = duckdb.connect(config={
    "threads": 2,
    "memory_limit": "512MB",
    "default_order": "DESC",
})
print("Configured threads:", con_cfg.sql("SELECT current_setting('threads')").fetchone()[0])
con_cfg.close()
with duckdb.connect() as ctx:
    ctx.sql("SELECT 42 AS answer").show()
print()
print("=" * 72)
print("SECTION 2: Synthetic Data Generation")
print("=" * 72)
con = duckdb.connect()
con.sql("""
CREATE OR REPLACE TABLE sales AS
SELECT
    i AS order_id,
    '2023-01-01'::DATE + (i % 365)::INT AS order_date,
    CASE (i % 5)
        WHEN 0 THEN 'Electronics'
        WHEN 1 THEN 'Clothing'
        WHEN 2 THEN 'Groceries'
        WHEN 3 THEN 'Furniture'
        ELSE 'Books'
    END AS category,
    ROUND(10 + random() * 990, 2) AS amount,
    CASE (i % 3)
        WHEN 0 THEN 'US'
        WHEN 1 THEN 'EU'
        ELSE 'APAC'
    END AS region,
    CASE WHEN random() < 0.1 THEN TRUE ELSE FALSE END AS returned
FROM generate_series(1, 100000) t(i)
""")
con.sql("SUMMARIZE sales").show()
print()
print("=" * 72)
print("SECTION 3: Zero-Copy DataFrame Integration")
print("=" * 72)
pdf = pd.DataFrame({
"product": ["Widget", "Gadget", "Doohickey", "Thingamajig"],
"price": [9.99, 24.50, 4.75, 15.00],
"stock": [120, 45, 300, 78],
})
print("Query Pandas DF directly:")
con.sql("SELECT product, price * stock AS inventory_value FROM pdf ORDER BY inventory_value DESC").show()
plf = pl.DataFrame({
"city": ["Montreal", "Toronto", "Vancouver", "Calgary"],
"temp_c": [-12.5, -5.0, 3.0, -18.0],
})
print("Query Polars DF directly:")
con.sql("SELECT city, temp_c, temp_c * 9/5 + 32 AS temp_f FROM plf WHERE temp_c < 0").show()
arrow_tbl = pa.table({
"sensor_id": [1, 2, 3, 4, 5],
"reading": [23.1, 47.8, 12.3, 99.0, 55.5],
})
print("Query Arrow table directly:")
con.sql("SELECT sensor_id, reading FROM arrow_tbl WHERE reading > 30").show()
print()
print("=" * 72)
print("SECTION 4: Result Conversion")
print("=" * 72)
q = "SELECT category, SUM(amount) AS total FROM sales GROUP BY category ORDER BY total DESC"
print("→ Python list :", con.sql(q).fetchall()[:2], "...")
print("→ Pandas DF :\n", con.sql(q).df().head(3))
print("→ Polars DF :\n", con.sql(q).pl().head(3))
print("→ Arrow Table :", con.sql(q).arrow().schema)
print("→ NumPy arrays :", {k: v[:2] for k, v in con.sql(q).fetchnumpy().items()})
print()
We set up the full DuckDB-Python environment by installing the required libraries and importing all the modules the tutorial needs. We create our working directory, initialize DuckDB connections, and explore both in-memory and persistent database usage along with basic configuration options. We also generate a large synthetic sales dataset and begin working with DuckDB's direct integration with Pandas, Polars, and PyArrow, which shows us how naturally DuckDB fits into Python-based data workflows.
print("=" * 72)
print("SECTION 5: Relational API")
print("=" * 72)
rel = (
    con.table("sales")
    .filter("NOT returned")
    .aggregate("category, region, SUM(amount) AS revenue, COUNT(*) AS orders")
    .filter("revenue > 1000000")
    .order("revenue DESC")
    .limit(10)
)
print("Relational API result:")
rel.show()
top_cats = con.sql("SELECT DISTINCT category FROM sales ORDER BY category LIMIT 3")
print("Top categories relation fed into next query:")
con.sql("SELECT s.* FROM sales s SEMI JOIN top_cats ON s.category = top_cats.category LIMIT 5").show()
print()
print("=" * 72)
print("SECTION 6: Window Functions & Advanced SQL")
print("=" * 72)
con.sql("""
WITH daily AS (
    SELECT
        order_date,
        region,
        SUM(amount) AS daily_rev
    FROM sales
    WHERE NOT returned
    GROUP BY order_date, region
)
SELECT
    order_date,
    region,
    daily_rev,
    SUM(daily_rev) OVER (
        PARTITION BY region ORDER BY order_date
    ) AS cum_revenue,
    AVG(daily_rev) OVER (
        PARTITION BY region ORDER BY order_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS rolling_7d_avg
FROM daily
QUALIFY row_number() OVER (PARTITION BY region ORDER BY order_date DESC) <= 3
ORDER BY region, order_date DESC
""").show()
print("PIVOT table:")
con.sql("""
PIVOT sales
ON region
USING SUM(amount)
GROUP BY category
ORDER BY category
""").show()
print()
print("=" * 72)
print("SECTION 7: Complex / Nested Types")
print("=" * 72)
con.sql("""
CREATE OR REPLACE TABLE users AS
SELECT
    i AS user_id,
    {'first': 'User' || i::TEXT, 'last': 'Demo'} AS name,  -- struct literal reconstructed; the original expression was lost
    [i * 10, i * 20, i * 30] AS scores,
    MAP {'tier': CASE WHEN i % 2 = 0 THEN 'gold' ELSE 'silver' END,
         'region': CASE WHEN i % 3 = 0 THEN 'US' ELSE 'EU' END} AS metadata
FROM generate_series(1, 5) t(i)
""")
print("Struct field access, list indexing, map extraction:")
con.sql("""
SELECT
    user_id,
    name.first AS first_name,
    scores[1] AS first_score,
    list_aggregate(scores, 'sum') AS total_score,
    metadata['tier'] AS tier
FROM users
""").show()
print("Unnesting a list column:")
con.sql("""
SELECT user_id, unnest(scores) AS individual_score
FROM users
WHERE user_id <= 3
""").show()
print()
print("=" * 72)
print("SECTION 8: Python UDFs")
print("=" * 72)
def celsius_to_fahrenheit(c):
    return c * 9 / 5 + 32
con.create_function("c2f", celsius_to_fahrenheit, ["DOUBLE"], "DOUBLE")
con.sql("SELECT city, temp_c, c2f(temp_c) AS temp_f FROM plf").show()
import pyarrow.compute as pc
def vectorized_discount(prices):
    """Apply a 15% discount to all prices."""
    return pc.multiply(prices, 0.85)
con.create_function(
    "discount",
    vectorized_discount,
    ["DOUBLE"],
    "DOUBLE",
    type="arrow",
)
print("Vectorized UDF (discount):")
con.sql("SELECT product, price, discount(price) AS sale_price FROM pdf").show()
print()
print("=" * 72)
print("SECTION 9: File I/O")
print("=" * 72)
con.sql("COPY (SELECT * FROM sales LIMIT 1000) TO 'sales_sample.csv' (HEADER, DELIMITER ',')")
con.sql("COPY (SELECT * FROM sales LIMIT 1000) TO 'sales_sample.parquet' (FORMAT PARQUET)")
con.sql("COPY (SELECT * FROM sales LIMIT 100) TO 'sales_sample.json' (FORMAT JSON, ARRAY true)")
print("Files written:", [f for f in os.listdir('.') if 'sales_sample' in f])
print("\nCSV row count :", con.sql("SELECT count(*) FROM 'sales_sample.csv'").fetchone()[0])
print("Parquet row count :", con.sql("SELECT count(*) FROM 'sales_sample.parquet'").fetchone()[0])
print("JSON row count :", con.sql("SELECT count(*) FROM 'sales_sample.json'").fetchone()[0])
print("\nParquet file metadata:")
con.sql("SELECT * FROM parquet_metadata('sales_sample.parquet')").show()
print()
We move from basic querying into more expressive analytical patterns using DuckDB's relational API and advanced SQL features. We work with filtered and aggregated sales relations, apply window functions for cumulative and rolling calculations, and reshape data using a pivot table for cross-category analysis. We also explore complex nested types, list unnesting, Python UDFs, and file I/O, which help us see how DuckDB handles both structured analytics and practical data engineering tasks in one place.
print("=" * 72)
print("SECTION 10: Hive-Partitioned Parquet")
print("=" * 72)
con.sql("""
COPY sales TO 'partitioned_data' (
    FORMAT PARQUET,
    PARTITION_BY (region, category),
    OVERWRITE_OR_IGNORE
)
""")
print("Partitioned directory tree (first 15 items):")
for i, (root, dirs, files) in enumerate(os.walk("partitioned_data")):
    for f in files:
        fp = os.path.join(root, f)
        print(f"  {fp}")
    if i > 15:
        print("  ...")
        break
print("\nReading partitioned data back (US + Electronics only):")
con.sql("""
SELECT count(*) AS cnt, MIN(amount), MAX(amount)
FROM read_parquet('partitioned_data/**/*.parquet', hive_partitioning = true)
WHERE region = 'US' AND category = 'Electronics'
""").show()
print()
print("=" * 72)
print("SECTION 11: Prepared Statements")
print("=" * 72)
result = con.execute(
    "SELECT * FROM sales WHERE category = $1 AND amount > $2 LIMIT 5",
    ["Electronics", 500.0]
).fetchdf()
print("Parameterized query result:\n", result)
con.sql("SET VARIABLE target_region = 'EU'")
con.sql("""
SELECT category, AVG(amount) AS avg_amt
FROM sales
WHERE region = getvariable('target_region')
GROUP BY category
ORDER BY avg_amt DESC
""").show()
print()
print("=" * 72)
print("SECTION 12: Transactions")
print("=" * 72)
con.sql("CREATE OR REPLACE TABLE accounts(id INT, balance DOUBLE)")
con.sql("INSERT INTO accounts VALUES (1, 1000), (2, 500)")
con.begin()
try:
    con.sql("UPDATE accounts SET balance = balance - 200 WHERE id = 1")
    con.sql("UPDATE accounts SET balance = balance + 200 WHERE id = 2")
    con.commit()
    print("Transaction committed successfully")
except Exception as e:
    con.rollback()
    print(f"Transaction rolled back: {e}")
con.sql("SELECT * FROM accounts").show()
print()
print("=" * 72)
print("SECTION 13: Appender (Bulk Insert)")
print("=" * 72)
con.sql("CREATE OR REPLACE TABLE fast_load(id INT, name TEXT, value DOUBLE)")
bulk_df = pd.DataFrame({
    "id": range(50_000),
    "name": [f"item_{i}" for i in range(50_000)],
    "value": [float(i) * 1.1 for i in range(50_000)],
})
start = time.perf_counter()
con.append("fast_load", bulk_df)
elapsed = time.perf_counter() - start
print(f"Bulk-inserted {con.sql('SELECT count(*) FROM fast_load').fetchone()[0]:,} rows in {elapsed:.4f}s")
con.sql("SELECT * FROM fast_load LIMIT 5").show()
print()
print("=" * 72)
print("SECTION 14: Replacement Scans (query any Python var)")
print("=" * 72)
my_dict_data = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [30, 25, 35],
})
con.sql("SELECT * FROM my_dict_data WHERE age > 28").show()
print()
We focus on storage and execution patterns that are especially useful in real workflows, starting with Hive-style partitioned Parquet output and selective reads from partitioned data. We then use parameterized queries, runtime variables, and transaction control to make our queries safer, more dynamic, and more reliable. Finally, we test high-speed bulk insertion via the appender interface and use replacement scans to query Python objects directly by name, further reinforcing DuckDB's tight integration with the Python runtime.
print("=" * 72)
print("SECTION 15: SQL Macros")
print("=" * 72)
con.sql("""
CREATE OR REPLACE MACRO revenue_tier(amt) AS
    CASE
        WHEN amt > 800 THEN 'High'
        WHEN amt > 400 THEN 'Medium'
        ELSE 'Low'
    END
""")
print("Scalar macro:")
con.sql("SELECT category, amount, revenue_tier(amount) AS tier FROM sales LIMIT 8").show()
con.sql("""
CREATE OR REPLACE MACRO top_by_category(cat, n) AS TABLE
    SELECT * FROM sales WHERE category = cat ORDER BY amount DESC LIMIT n
""")
print("Table macro — top 3 Electronics:")
con.sql("SELECT * FROM top_by_category('Electronics', 3)").show()
print()
print("=" * 72)
print("SECTION 16: Recursive CTE")
print("=" * 72)
con.sql("""
CREATE OR REPLACE TABLE employees(id INT, name TEXT, manager_id INT);
INSERT INTO employees VALUES
    (1, 'CEO', NULL),
    (2, 'VP Eng', 1),
    (3, 'VP Sales', 1),
    (4, 'Sr Eng', 2),
    (5, 'Jr Eng', 4),
    (6, 'Sales Rep', 3);
""")
print("Org chart via recursive CTE:")
con.sql("""
WITH RECURSIVE org AS (
    SELECT id, name, manager_id, 0 AS depth, name AS path
    FROM employees WHERE manager_id IS NULL
    UNION ALL
    SELECT e.id, e.name, e.manager_id, o.depth + 1,
           o.path || ' → ' || e.name
    FROM employees e JOIN org o ON e.manager_id = o.id
)
SELECT repeat('  ', depth) || name AS hierarchy, path
FROM org
ORDER BY path
""").show()
print()
print("=" * 72)
print("SECTION 17: Full-Text Search (FTS)")
print("=" * 72)
try:
    con.install_extension("fts")
    con.load_extension("fts")
    con.sql("""
        CREATE OR REPLACE TABLE documents(id INT, body TEXT);
        INSERT INTO documents VALUES
            (1, 'DuckDB is a fast in-process analytical database'),
            (2, 'Python integration allows querying Pandas DataFrames'),
            (3, 'Parquet files can be read directly without loading'),
            (4, 'Window functions and CTEs make complex analytics easy'),
            (5, 'The columnar engine processes data blazingly fast');
    """)
    con.sql("PRAGMA create_fts_index('documents', 'id', 'body', stemmer='english')")
    print("FTS search for 'fast analytical':")
    con.sql("""
        SELECT id, body, fts_main_documents.match_bm25(id, 'fast analytical') AS score
        FROM documents
        WHERE score IS NOT NULL
        ORDER BY score DESC
    """).show()
except Exception as e:
    print(f"(Skipped — FTS extension not available: {e})")
print()
print("=" * 72)
print("SECTION 18: AsOf Joins")
print("=" * 72)
con.sql("""
CREATE OR REPLACE TABLE stock_prices AS
SELECT * FROM (VALUES
    ('2024-01-01 09:30'::TIMESTAMP, 'AAPL', 150.0),
    ('2024-01-01 10:00'::TIMESTAMP, 'AAPL', 152.5),
    ('2024-01-01 10:30'::TIMESTAMP, 'AAPL', 151.0),
    ('2024-01-01 11:00'::TIMESTAMP, 'AAPL', 153.0)
) AS t(ts, ticker, price);
CREATE OR REPLACE TABLE trades AS
SELECT * FROM (VALUES
    ('2024-01-01 09:45'::TIMESTAMP, 'AAPL', 100),
    ('2024-01-01 10:15'::TIMESTAMP, 'AAPL', 200),
    ('2024-01-01 10:50'::TIMESTAMP, 'AAPL', 150)
) AS t(trade_ts, ticker, shares);
""")
print("AsOf Join — match each trade to the most recent price:")
con.sql("""
SELECT
    t.trade_ts,
    t.shares,
    p.ts AS price_ts,
    p.price,
    t.shares * p.price AS trade_value
FROM trades t
ASOF JOIN stock_prices p
    ON t.ticker = p.ticker AND t.trade_ts >= p.ts
""").show()
print()
We build reusable query logic using scalar and table macros, reducing repetition and making our SQL more modular. We then create a recursive CTE to traverse an employee hierarchy, showing how DuckDB can handle structured recursive logic cleanly within SQL. After that, we explore full-text search and AsOf joins, demonstrating that DuckDB supports not only standard analytics but also more advanced search and time-aware matching.
print("=" * 72)
print("SECTION 19: Query Profiling")
print("=" * 72)
print("EXPLAIN output:")
con.sql("EXPLAIN SELECT category, SUM(amount) FROM sales GROUP BY category").show()
con.sql("PRAGMA enable_profiling = 'json'")
con.sql("PRAGMA profiling_output = 'profile.json'")
con.sql("SELECT category, AVG(amount), STDDEV(amount) FROM sales GROUP BY category")
con.sql("PRAGMA disable_profiling")
if os.path.exists("profile.json"):
    with open("profile.json") as f:
        profile = json.load(f)
    print(f"Query timing: {profile.get('timing', 'N/A')}s")
print()
print("=" * 72)
print("SECTION 20: Multi-Threaded Queries")
print("=" * 72)
results = {}
def worker(thread_id, db_path=None):
    """Each thread creates its own connection."""
    local_con = duckdb.connect()
    local_con.sql(f"""
        CREATE TABLE t AS SELECT i, i*{thread_id} AS val
        FROM generate_series(1, 10000) t(i)
    """)
    total = local_con.sql("SELECT SUM(val) FROM t").fetchone()[0]
    results[thread_id] = total
    local_con.close()
threads = [threading.Thread(target=worker, args=(tid,)) for tid in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("Thread results:", results)
print()
print("=" * 72)
print("SECTION 21: Quick Benchmark — DuckDB vs Pandas")
print("=" * 72)
N = 1_000_000
big_df = pd.DataFrame({
    "group": np.random.choice(["A","B","C","D","E"], N),
    "value": np.random.randn(N) * 100,
    "ts": pd.date_range("2020-01-01", periods=N, freq="s"),
})
start = time.perf_counter()
_ = big_df.groupby("group")["value"].agg(["sum","mean","std","min","max"])
pd_time = time.perf_counter() - start
start = time.perf_counter()
_ = con.sql("""
    SELECT "group",
           SUM(value), AVG(value), STDDEV(value), MIN(value), MAX(value)
    FROM big_df
    GROUP BY "group"
""").fetchall()
duck_time = time.perf_counter() - start
print(f"Pandas : {pd_time:.4f}s")
print(f"DuckDB : {duck_time:.4f}s")
print(f"Speedup : {pd_time/duck_time:.1f}x")
print()
print("=" * 72)
print("SECTION 22: Reading Remote Parquet over HTTPS")
print("=" * 72)
try:
    con.install_extension("httpfs")
    con.load_extension("httpfs")
    print("Querying a remote Parquet file (DuckDB taxi sample):")
    con.sql("""
        SELECT count(*) AS total_rows
        FROM read_parquet(
            '<remote-parquet-url>'  -- the original URL was lost in extraction
        )
    """).show()
except Exception as e:
    print(f"(Skipped — httpfs not available or network restricted: {e})")
print()
print("=" * 72)
print("SECTION 23: Custom Enum Types")
print("=" * 72)
con.sql("CREATE TYPE mood AS ENUM ('happy', 'neutral', 'sad')")
con.sql("""
CREATE OR REPLACE TABLE survey(
    respondent_id INT,
    feeling mood,
    comment TEXT
)
""")
con.sql("""
INSERT INTO survey VALUES
    (1, 'happy', 'Great product!'),
    (2, 'neutral', 'It''s okay'),
    (3, 'sad', 'Didn''t work'),
    (4, 'happy', 'Love it')
""")
con.sql("SELECT feeling, count(*) AS cnt FROM survey GROUP BY feeling ORDER BY feeling").show()
print()
We examine how DuckDB executes queries by using EXPLAIN plans and JSON profiling output to better understand performance behavior. We also demonstrate thread-based parallel execution by creating separate DuckDB connections in multiple worker threads and safely collecting their results. To round out the performance angle, we benchmark DuckDB against Pandas on a large grouped aggregation, then extend the workflow with remote Parquet access over HTTPS and custom ENUM type creation.
print("=" * 72)
print("SECTION 24: Lambda Functions in SQL")
print("=" * 72)
con.sql("""
SELECT
    [1, 2, 3, 4, 5] AS original,
    list_transform([1, 2, 3, 4, 5], x -> x * x) AS squared,
    list_filter([1, 2, 3, 4, 5], x -> x % 2 = 0) AS evens_only,
    list_reduce([1, 2, 3, 4, 5], (a, b) -> a + b) AS total
""").show()
print()
print("=" * 72)
print("SECTION 25: Multi-Format Export")
print("=" * 72)
summary_q = """
SELECT
    category,
    region,
    COUNT(*) AS orders,
    SUM(amount) AS revenue,
    AVG(amount) AS avg_order
FROM sales
WHERE NOT returned
GROUP BY category, region
ORDER BY revenue DESC
"""
con.sql(f"COPY ({summary_q}) TO 'summary.csv' (HEADER)")
con.sql(f"COPY ({summary_q}) TO 'summary.parquet' (FORMAT PARQUET, COMPRESSION ZSTD)")
con.sql(f"COPY ({summary_q}) TO 'summary.json' (FORMAT JSON, ARRAY TRUE)")
sizes = {f: os.path.getsize(f) for f in ['summary.csv','summary.parquet','summary.json']}
print("File sizes:")
for name, sz in sizes.items():
    print(f"  {name:25s} {sz:>8,} bytes")
print()
con.close()
print("=" * 72)
print("TUTORIAL COMPLETE")
print("=" * 72)
print("""
Sections covered:
  1. Connection management (in-memory, persistent, config)
  2. Synthetic data generation with generate_series()
  3. Zero-copy querying of Pandas, Polars, PyArrow
  4. Result conversion (.df(), .pl(), .arrow(), .fetchnumpy())
  5. Relational API (filter / aggregate / order / limit)
  6. Window functions, QUALIFY, PIVOT
  7. Complex types (struct, list, map, unnest)
  8. Python UDFs (scalar & vectorized)
  9. File I/O (CSV, Parquet, JSON)
 10. Hive-partitioned Parquet writes & reads
 11. Prepared statements & parameterized queries
 12. Transactions (BEGIN / COMMIT / ROLLBACK)
 13. Appender for high-speed bulk insert
 14. Replacement scans (query Python dicts by name)
 15. SQL macros (scalar + table macros)
 16. Recursive CTEs
 17. Full-text search (FTS extension)
 18. AsOf joins
 19. Query profiling & EXPLAIN
 20. Multi-threaded access
 21. Performance benchmark (DuckDB vs Pandas)
 22. Remote Parquet over HTTPS (httpfs)
 23. Custom ENUM types
 24. Lambda functions in SQL
 25. Multi-format export with compression
""")
We explore DuckDB's support for lambda-powered list transformations directly in SQL, providing a concise way to manipulate list data within queries. We then export a summarized analytical result to multiple file formats, including CSV, Parquet, and JSON, highlighting DuckDB's flexibility for downstream data delivery. We close the tutorial by disconnecting and printing a structured recap of all covered sections, providing a clear end-to-end picture of the full DuckDB-Python workflow.
In conclusion, we presented a practical, code-level view of how DuckDB-Python fits into modern data workflows. We saw that it is not just a SQL engine embedded in Python, but a highly versatile analytical system that works smoothly with DataFrames, Arrow tables, local files, remote datasets, custom functions, and advanced SQL constructs in a single environment. We moved from simple in-memory queries to more production-relevant patterns such as persistent databases, parameterized execution, transactions, partitioned Parquet, profiling, and performance benchmarking, which helps us understand both the breadth and depth of DuckDB's capabilities. Finally, we came away with a reusable Colab-ready workflow that shows how to combine Python's ecosystem with DuckDB's speed, SQL expressiveness, and interoperability to build fast, elegant, and scalable data analysis pipelines.