coreweave/dataset-downloader
Go
Language: Go
License: MIT
Stars: 0
Forks: 2
Open issues: 2
Created: 2022-12-05T19:30:17Z
Pushed: 2026-04-29T19:19:31Z
Default branch: main
Fork: no
Archived: no
README:
dataset-downloader
Contains code that builds into Docker images that can be used to download datasets for training machine learning models.
Contents:
- [smashwords-downloader](#smashwords-downloader)
smashwords-downloader
This script downloads plain text files of Western Romance books publicly available on Smashwords. This website has been used to create popular machine learning datasets like BookCorpus.
The source code is located in cmd/smashwords-downloader. It can be built into an executable with the command go build -o main *.go.
The main.go script takes the following arguments:
- -data_dir string: directory that the book files will download to (default "./data")
- -id integer: the ID corresponding to the Smashwords URL you want to scrape; https://www.smashwords.com/books/category/1105/downloads/0/free would have an ID of 1105 (default is 1245 == western romance)
- -pageitems integer: the number of items Smashwords has per page; shouldn't need to be changed (default is 20)
- -pages integer: the number of pages you want to download (default is 7)
- -format string: the format of text you want to download; options are all, txt, epub (default is all, for .txt and .epub files). Note: not all books have all formats, so you may get significantly fewer books downloaded than specified, depending on the file format.
- -overwriteSource bool: if you are downloading in a format other than txt (e.g. EPUB), set this to true if you don't want to keep the source files and just want to keep the .txt files (default true)
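The arguments above map naturally onto Go's standard flag package. The following is a minimal sketch of how they could be declared, not the repository's actual main.go: the options struct, its field names, and the parseOptions helper are hypothetical, and only the flag names, defaults, and allowed format values are taken from the list above.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// options mirrors the documented arguments. The struct and its field names
// are hypothetical; only the flag names and defaults come from the README.
type options struct {
	dataDir         string
	id              int
	pageItems       int
	pages           int
	format          string
	overwriteSource bool
}

// parseOptions registers the documented flags on a fresh FlagSet, parses
// args, and rejects format values other than all, txt, or epub.
func parseOptions(args []string) (options, error) {
	var o options
	fs := flag.NewFlagSet("smashwords-downloader", flag.ContinueOnError)
	fs.StringVar(&o.dataDir, "data_dir", "./data", "directory that the book files will download to")
	fs.IntVar(&o.id, "id", 1245, "Smashwords category ID (1245 == western romance)")
	fs.IntVar(&o.pageItems, "pageitems", 20, "number of items per page")
	fs.IntVar(&o.pages, "pages", 7, "number of pages to download")
	fs.StringVar(&o.format, "format", "all", "format to download: all, txt, or epub")
	fs.BoolVar(&o.overwriteSource, "overwriteSource", true, "keep only the .txt files")
	if err := fs.Parse(args); err != nil {
		return o, err
	}
	switch o.format {
	case "all", "txt", "epub":
		// Supported value, nothing to do.
	default:
		return o, fmt.Errorf("unsupported format %q (want all, txt, or epub)", o.format)
	}
	return o, nil
}

func main() {
	o, err := parseOptions(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	fmt.Printf("%+v\n", o)
}
```

One detail of the flag package worth knowing: boolean flags must be written as -overwriteSource=false (with the equals sign) to turn them off. Writing -overwriteSource false leaves the flag set to true and treats false as a positional argument.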
Example Execution
Download Western Romance novels in .txt format to directory data
> ./main -data_dir data
Download Adventure novels to directory data, with 20 items per page (don't change this) and 10 pages (200 items total), in epub format, converting to text and overwriting the source files
> ./main -data_dir data -id 1105 -pageitems 20 -pages 10 -format epub -overwriteSource=true
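To make the arithmetic in this example concrete, here is a small sketch relating the -id, -pages, and -pageitems values to the category URL and to the maximum number of books requested. The helper names categoryURL and maxItems are hypothetical, and the URL pattern is taken only from the single example URL in the argument description; how the tool addresses later pages is not documented here, so the sketch covers the first listing page only.

```go
package main

import "fmt"

// categoryURL builds the listing URL for a category ID, following the one
// example URL given in the README. Pagination is not covered because the
// README does not document how later pages are addressed.
func categoryURL(id int) string {
	return fmt.Sprintf("https://www.smashwords.com/books/category/%d/downloads/0/free", id)
}

// maxItems is the upper bound on books requested: pages times items per page.
// The actual count can be lower, since not every book has every format.
func maxItems(pages, pageItems int) int {
	return pages * pageItems
}

func main() {
	fmt.Println(categoryURL(1105)) // the category used in the example above
	fmt.Println(maxItems(10, 20))  // 200, the example above
	fmt.Println(maxItems(7, 20))   // 140, the documented defaults
}
```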