Snowflake-Labs/sfguide-lakehouse-iceberg-production-pipelines
Description: The companion repo for lakehouse-iceberg-production-pipelines quickstart
Language: Python
License: Apache-2.0
Stars: 1
Forks: 1
Open issues: 0
Created: 2026-04-14T12:07:46Z
Pushed: 2026-05-27T05:01:04Z
Default branch: main
Fork: no
Archived: no
README:
Lakehouse Transformations: Build Production Pipelines for your Iceberg Tables
Stop pipeline sprawl and the cost of data duplication. This lab shows how to perform secure, in-place transformations across your entire data estate: connect externally managed Iceberg tables with Catalog Linked Databases so you always work on fresh data without ETL; build efficient, declarative pipelines with Dynamic Tables for Iceberg while preserving multi-engine access to your data; and implement business continuity so your production data is always available.
> The companion Snowflake Quickstart walks through the same steps in a guided format. Link will be added on publish.
Architecture
---
config:
  theme: mc
  layout: elk
---
flowchart TB
    subgraph Generation["Data Generation"]
        PyGen["Streaming Balloon Pop Events"]
    end
    subgraph AWS["AWS"]
        GlueCat["Glue Data Catalog balloon_game_events table"]
        S3["S3 Warehouse s3://balloon_pops/iceberg/"]
        LF["Lake Formation"]
    end
    subgraph Snowflake["Snowflake"]
        CI["Catalog Integration Glue Iceberg REST + SigV4"]
        CLD["Catalog Linked Database (CLD) balloon_game_events"]
        DTs["Dynamic Iceberg Tables silver pipelines"]
        ExtVol["Snowflake Storage (PuPr)"]
        SiS["Streamlit in Snowflake"]
        HIRC["Horizon Iceberg REST Catalog (PuPr)"]
    end
    PyGen -- PyIceberg write --> GlueCat
    GlueCat --> S3
    CI --> CLD
    CLD --> DTs
    DTs -- writes Iceberg --> ExtVol
    DTs --> SiS
    DTs -.-> HIRC
    HIRC <--> DuckDB["DuckDB Cross-Engine Access"]
    LF -- vended credentials --> CI
    LF --> S3
Lab Layers
| Layer | Technology | What it does |
|-------|------------|--------------|
| Bronze | Glue + S3 + PyIceberg | Loads raw game events as Iceberg in AWS |
| Catalog | Snowflake Catalog Integration | Connects Snowflake to Glue Iceberg REST with SigV4 + LF vended credentials |
| CLD | Catalog-Linked Database | Mirrors Glue namespaces and tables as Snowflake schemas; no data copy |
| Silver | Dynamic Iceberg Tables | Transforms JSON bronze into 5 aggregation tables; writes Iceberg back to S3 |
| Dashboard | Streamlit in Snowflake | Live dashboard over silver DTs; zero local server |
| Cross-engine | DuckDB via HIRC | Queries silver Iceberg tables through Snowflake's Horizon REST Catalog |
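To make the Bronze-to-Silver step concrete, here is a minimal Python sketch of one aggregation over raw JSON events. It only illustrates the idea: in the lab this work is done declaratively by Dynamic Iceberg Tables inside Snowflake, not in Python. The event fields (player, balloon_color) and the sample rows are assumptions, since the real event schema is not shown in this excerpt.

```python
import json
from collections import Counter

# Hypothetical bronze rows: each one is a raw JSON string.
# The field names ("player", "balloon_color") are assumptions for illustration only.
BRONZE_ROWS = [
    '{"player": "ana", "balloon_color": "red"}',
    '{"player": "ben", "balloon_color": "blue"}',
    '{"player": "ana", "balloon_color": "red"}',
]


def pops_per_player(rows):
    """Parse raw JSON events and count pops per player (one silver-style aggregate)."""
    counts = Counter()
    for raw in rows:
        event = json.loads(raw)
        counts[event["player"]] += 1
    return dict(counts)


if __name__ == "__main__":
    print(pops_per_player(BRONZE_ROWS))
```

A Dynamic Table expresses the same kind of grouping as a query that Snowflake keeps refreshed, so there is no loop to schedule or rerun by hand.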
---
Prerequisites
Clone the Repository
git clone https://github.com/Snowflake-Labs/sfguide-lakehouse-iceberg-production-pipelines.git
cd sfguide-lakehouse-iceberg-production-pipelines
Accounts and Permissions
- AWS account with a named profile (AWS_PROFILE) that can create and update Glue databases, manage IAM roles, and access S3
- Snowflake account with ACCOUNTADMIN or a role with CREATE INTEGRATION, CREATE DATABASE, and CREATE STREAMLIT privileges
- Snowflake CLI connection configured for that account: snow connection list and snow connection test both succeed
Required Tools
This repo targets Python 3.12+. uv manages the interpreter and all dependencies.
| Tool | Role | macOS | Linux (Debian/Ubuntu) | Windows |
|------|------|-------|-----------------------|---------|
| Git | Clone the repository | brew install git | sudo apt install git | Git for Windows |
| uv | Python deps and uv run entrypoints | brew install uv | Astral installer | PowerShell installer |
| Task | task bronze:*, task check-tools | brew install go-task | Install script | scoop install task |
| AWS CLI v2 | Glue, S3, STS; S3 Tables needs v2.34+ | brew install awscli | AWS bundled installer | AWS MSI |
| Snowflake CLI | Snowflake steps; also available via uv sync | Snowflake CLI docs | Snowflake CLI docs | Snowflake CLI docs |
| envsubst | Renders IAM policy templates (gettext package) | brew install gettext | sudo apt install gettext-base | WSL2 recommended |
| jq | JSON checks at the shell | brew install jq | sudo apt install jq | scoop install jq |
Windows note: If task check-tools fails only on envsubst, use WSL2 or run uv run bronze-cli render-iam instead.
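For readers unfamiliar with what envsubst does to a template, the following Python sketch shows the same kind of placeholder substitution with the standard library. It is not the implementation behind bronze-cli render-iam, which is not shown in this excerpt, and the policy template below is a hypothetical example. BRONZE_BUCKET_NAME is taken from the environment variables this lab uses.

```python
import os
from string import Template

# Hypothetical policy template; the repo's real IAM templates are not shown here.
POLICY_TEMPLATE = """{
  "Effect": "Allow",
  "Action": "s3:GetObject",
  "Resource": "arn:aws:s3:::${BRONZE_BUCKET_NAME}/*"
}"""


def render(template_text, env=None):
    """Replace $VAR and ${VAR} placeholders with values from env (default: os.environ)."""
    values = os.environ if env is None else env
    # substitute() raises KeyError on a missing variable instead of leaving a blank.
    return Template(template_text).substitute(values)


if __name__ == "__main__":
    print(render(POLICY_TEMPLATE, {"BRONZE_BUCKET_NAME": "example-bucket"}))
```

One deliberate difference from envsubst: an unset variable raises an error here, whereas envsubst substitutes an empty string. Failing early is usually the safer choice for IAM policies, because an empty bucket name would produce a syntactically valid but wrong resource ARN.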
Recommended Tools
| Tool | Why | macOS | Linux | Windows |
|------|-----|-------|-------|---------|
| direnv | Auto-loads .env when you cd into the repo | brew install direnv | sudo apt install direnv | WSL2 |
| curl | Scripts and health checks | pre-installed | pre-installed | curl.se |
| openssl | TLS and crypto one-liners | pre-installed | pre-installed | OpenSSL binaries |
Verify Installation
Sync Python dependencies:
uv sync
Set your AWS profile and run the prerequisite check:
export AWS_PROFILE=your-profile
task check-tools
task check-tools runs tools/check_lab_prereqs.py: it verifies required binaries on PATH and runs aws sts get-caller-identity. Fix any missing entries and refresh credentials if STS fails, then re-run until you see All required tools are available.
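The two checks described above (binaries on PATH, then an STS identity call) can be sketched in a few lines of Python. This is an illustration of the approach, not the contents of tools/check_lab_prereqs.py. The tool list is an assumption based on the Required Tools table, and the function names are hypothetical.

```python
import shutil
import subprocess

# Binary names are assumptions drawn from the Required Tools table;
# the real script's list may differ.
REQUIRED_TOOLS = ["git", "uv", "task", "aws", "snow", "envsubst", "jq"]


def missing_tools(tools, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


def sts_identity_ok(run=subprocess.run):
    """Return True when `aws sts get-caller-identity` exits with status 0."""
    try:
        result = run(
            ["aws", "sts", "get-caller-identity"],
            capture_output=True,
            text=True,
            check=False,
        )
    except FileNotFoundError:
        return False  # the aws binary is not installed
    return result.returncode == 0


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Missing tools:", ", ".join(missing))
    elif not sts_identity_ok():
        print("AWS credentials check failed; refresh credentials and retry.")
    else:
        print("All required tools are available.")
```

The which and run parameters exist only so the checks can be exercised without touching the real PATH or AWS credentials.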
---
Environment Setup
Copy .env.example to .env and fill in your values. Never commit .env.
cp .env.example .env
Key variables by phase:
| Variable | Phase | Default | Notes |
|----------|-------|---------|-------|
| AWS_PROFILE | 1 | required | AWS named profile for all bronze tasks |
| AWS_REGION | 1 | required | Keeps all API calls in one region |
| LAB_USERNAME | 1 | none | Workshop shared accounts; drives bucket/database name derivation |
| BRONZE_BUCKET_NAME | 1 | derived | S3 warehouse bucket; iceberg/ becomes the Glue warehouse URI |
| BRONZE_S3TABLES_BUCKET_NAME |…
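Before running any phase 1 task, it can help to confirm that the variables marked required are actually set in .env. The sketch below is a simplified check under stated assumptions: it handles plain KEY=VALUE lines only (no export prefixes or multi-line values), it is not part of the repo's tooling, and the function names are hypothetical. It does not attempt to reproduce how the derived defaults are computed, since that logic is not shown here.

```python
import os

# Variables the table marks as required for phase 1.
REQUIRED_VARS = ("AWS_PROFILE", "AWS_REGION")


def parse_env_text(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_required(values, required=REQUIRED_VARS):
    """Return required variable names that are unset or empty."""
    return [name for name in required if not values.get(name)]


if __name__ == "__main__":
    if os.path.exists(".env"):
        with open(".env", encoding="utf-8") as handle:
            env_values = parse_env_text(handle.read())
    else:
        env_values = {}
    print("Missing required variables:", missing_required(env_values))
```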
Excerpt shown — open the source for the full document.