Repo: DeepSeek, published Feb 24, 2025

deepseek-ai/smallpond

Python

Description: A lightweight data processing framework built on DuckDB and 3FS.

Language: Python

License: MIT

Stars: 4960

Forks: 444

Open issues: 32

Created: 2025-02-24T09:28:17Z

Pushed: 2025-03-05T18:23:54Z

Default branch: main

Fork: no

Archived: no

README:

smallpond

[CI](https://github.com/deepseek-ai/smallpond/actions/workflows/ci.yml)

A lightweight data processing framework built on [DuckDB] and [3FS].

Features

  • 🚀 High-performance data processing powered by DuckDB
  • 🌍 Scalable to handle PB-scale datasets
  • 🛠️ Easy operations with no long-running services

Installation

Python 3.8 to 3.12 is supported.

pip install smallpond
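
Because only Python 3.8 to 3.12 is supported, it can help to check the interpreter before installing. The following is a minimal sketch of such a check. The helper name `is_supported_python` is hypothetical and is not part of smallpond.

```python
import sys

# Supported range as stated in this README (inclusive on both ends).
MIN_SUPPORTED = (3, 8)
MAX_SUPPORTED = (3, 12)


def is_supported_python(version=None):
    """Return True if (major, minor) falls within the supported range.

    Hypothetical helper for illustration only; not part of smallpond.
    """
    if version is None:
        version = sys.version_info
    major_minor = tuple(version[:2])
    return MIN_SUPPORTED <= major_minor <= MAX_SUPPORTED


if __name__ == "__main__":
    if is_supported_python():
        print("This interpreter is within the supported range.")
    else:
        print("Unsupported Python version: %d.%d" % tuple(sys.version_info[:2]))
```

Comparing `(major, minor)` tuples avoids the pitfalls of comparing version strings, where `"3.10"` would sort before `"3.8"`.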

Quick Start

# Download example data (run in a shell)
wget https://duckdb.org/data/prices.parquet

import smallpond

# Initialize session
sp = smallpond.init()

# Load data
df = sp.read_parquet("prices.parquet")

# Process data
df = df.repartition(3, hash_by="ticker")
df = sp.partial_sql("SELECT ticker, min(price), max(price) FROM {0} GROUP BY ticker", df)

# Save results
df.write_parquet("output/")
# Show results
print(df.to_pandas())
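
The Quick Start repartitions by `ticker` before running the `GROUP BY` query. Assuming `partial_sql` runs the query on each partition separately (the name suggests this, but this README does not spell it out), hash partitioning on the grouping key is what makes the per-partition results add up to the correct global answer: every row for a given ticker lands in the same partition. The sketch below illustrates that idea in plain Python. It is not smallpond's implementation, the function names are hypothetical, and the sample rows are made up.

```python
import zlib
from collections import defaultdict


def hash_partition(rows, num_partitions, key):
    """Assign each row to a partition using a stable hash of row[key]."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        # crc32 is deterministic across runs, unlike built-in hash() on str.
        index = zlib.crc32(str(row[key]).encode("utf-8")) % num_partitions
        partitions[index].append(row)
    return partitions


def min_max_by_ticker(rows):
    """Plain-Python stand-in for:
    SELECT ticker, min(price), max(price) FROM ... GROUP BY ticker
    """
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["ticker"]].append(row["price"])
    return {t: (min(p), max(p)) for t, p in grouped.items()}


# Made-up sample rows; these are NOT the contents of prices.parquet.
rows = [
    {"ticker": "AAA", "price": 10.0},
    {"ticker": "BBB", "price": 7.5},
    {"ticker": "AAA", "price": 12.0},
    {"ticker": "CCC", "price": 3.0},
    {"ticker": "BBB", "price": 9.0},
]

partitions = hash_partition(rows, 3, key="ticker")

# Aggregate each partition independently, then take the union of the results.
combined = {}
for part in partitions:
    combined.update(min_max_by_ticker(part))

# Per-partition aggregation matches aggregating all rows at once.
print(combined == min_max_by_ticker(rows))
```

If the rows were split arbitrarily instead of by `ticker`, the same ticker could appear in several partitions and the per-partition minima and maxima would need a second merge step.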

Documentation

For detailed guides and API reference:

  • [Getting Started](docs/source/getstarted.rst)
  • [API Reference](docs/source/api.rst)

Performance

We evaluated smallpond using the [GraySort benchmark] ([script]) on a cluster comprising 50 compute nodes and 25 storage nodes running [3FS]. The benchmark sorted 110.5 TiB of data in 30 minutes and 14 seconds, achieving an average throughput of 3.66 TiB/min.

Details can be found in [3FS - Gray Sort].
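
As a sanity check on the reported throughput, the arithmetic can be reproduced from the published figures. This is only a back-of-the-envelope sketch using the rounded numbers quoted above, not the benchmark script itself.

```python
# Figures as reported in the benchmark paragraph above.
data_tib = 110.5
elapsed_minutes = 30 + 14 / 60  # 30 minutes and 14 seconds

throughput_tib_per_min = data_tib / elapsed_minutes
print(f"{throughput_tib_per_min:.2f} TiB/min")
```

With these rounded inputs the result comes to about 3.65 TiB/min, which agrees with the reported 3.66 TiB/min to within rounding of the published data size and duration.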

[DuckDB]: https://duckdb.org/
[3FS]: https://github.com/deepseek-ai/3FS
[GraySort benchmark]: https://sortbenchmark.org/
[script]: benchmarks/gray_sort_benchmark.py
[3FS - Gray Sort]: https://github.com/deepseek-ai/3FS?tab=readme-ov-file#2-graysort

Development

# quote the extras so shells such as zsh do not treat [] as a glob pattern
pip install ".[dev]"

# run unit tests
pytest -v tests/test*.py

# build documentation
pip install ".[docs]"
cd docs
make html
python -m http.server --directory build/html

License

This project is licensed under the [MIT License](LICENSE).

Notability

notability 8.0/10

Notable repo by DeepSeek with high stars.