What does this fork signal mean?

StepFun forked stepfun-ai/llama.cpp (forked from ggml-org/llama.cpp). This fork signal points to upstream code the lab may be inspecting, patching, or building on. High-signal details: repo stepfun-ai/llama.cpp · parent ggml-org/llama.cpp · Routine fork with minimal traction. onlylabs links this event to 1 captured evidence page and 1 related fork signal.

StepFun Fork: stepfun-ai/llama.cpp

Captured source

source ↗

GitHub/github.com/stepfun-ai/llama.cpp

stepfun-ai/llama.cpp repository metadata

Source ↗

published Mar 16, 2026seen Jun 5captured Jun 11http 200method plain

stepfun-ai/llama.cpp

Description: LLM inference in C/C++

Language: C++

License: MIT

Stars: 5

Forks: 1

Open issues: 1

Created: 2026-03-16T07:20:37Z

Pushed: 2026-06-09T12:23:15Z

Default branch: master

Fork: yes

Parent repository: ggml-org/llama.cpp

Archived: no

README:

llama.cpp

!llama

![Server](https://github.com/ggml-org/llama.cpp/actions/workflows/server.yml)

Manifesto / ggml / ops

LLM inference in C/C++

Recent API changes

Hot topics

Hugging Face cache migration: models downloaded with `-hf` are now stored in the standard Hugging Face cache directory, enabling sharing with other HF tools.
[guide : using the new WebUI of llama.cpp](https://github.com/ggml-org/llama.cpp/discussions/16938)
guide : running gpt-oss with llama.cpp
[[FEEDBACK] Better packaging for llama.cpp to support downstream consumers 🤗](https://github.com/ggml-org/llama.cpp/discussions/15313)
Support for the gpt-oss model with native MXFP4 format has been added | PR | Collaboration with NVIDIA | Comment
Multimodal support arrived in llama-server: #12898 | [documentation](./docs/multimodal.md)
VS Code extension for FIM completions: https://github.com/ggml-org/llama.vscode
Vim/Neovim plugin for FIM completions: https://github.com/ggml-org/llama.vim
Hugging Face Inference Endpoints now support GGUF out of the box! https://github.com/ggml-org/llama.cpp/discussions/9669
Hugging Face GGUF editor: discussion | tool

----

Quick start

Getting started with llama.cpp is straightforward. Here are several ways to install it on your machine:

Install llama.cpp using [brew, nix or winget](docs/install.md)
Run with Docker - see our [Docker documentation](docs/docker.md)
Download pre-built binaries from the releases page
Build from source by cloning this repository - check out [our build guide](docs/build.md)

Once installed, you'll need a model to work with. Head to the [Obtaining and quantizing models](#obtaining-and-quantizing-models) section to learn more.

Example command:

# Use a local model file
llama-cli -m my_model.gguf

# Or download and run a model directly from Hugging Face
llama-cli -hf ggml-org/gemma-3-1b-it-GGUF

# Launch OpenAI-compatible API server
llama-server -hf ggml-org/gemma-3-1b-it-GGUF

Description

The main goal of llama.cpp is to enable LLM inference with minimal setup and state-of-the-art performance on a wide range of hardware - locally and in the cloud.

Plain C/C++ implementation without any dependencies
Apple silicon is a first-class citizen - optimized via ARM NEON, Accelerate and Metal frameworks
AVX, AVX2, AVX512 and AMX support for x86 architectures
RVV, ZVFH, ZFH, ZICBOP and ZIHINTPAUSE support for RISC-V architectures
1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory use
Custom CUDA kernels for running LLMs on NVIDIA GPUs (support for AMD GPUs via HIP and Moore Threads GPUs via MUSA)
Vulkan and SYCL backend support
CPU+GPU hybrid inference to partially accelerate models larger than the total VRAM capacity

The llama.cpp project is the main playground for developing new features for the ggml library.

Models

Typically finetunes of the base models below are supported as well.

Instructions for adding support for new models: [HOWTO-add-model.md](docs/development/HOWTO-add-model.md)

Text-only

[X] LLaMA 🦙
[x] LLaMA 2 🦙🦙
[x] LLaMA 3 🦙🦙🦙
[X] Mistral 7B
[x] Mixtral MoE
[x] DBRX
[x] Jamba
[X] Falcon
[X] Chinese LLaMA / Alpaca and Chinese LLaMA-2 / Alpaca-2
[X] Vigogne (French)
[X] BERT
[X] Koala
[X] Baichuan 1 & 2 + derivations
[X] Aquila 1 & 2
[X] Starcoder models
[X] Refact
[X] MPT
[X] Bloom
[x] Yi models
[X] StableLM models
[x] Deepseek models
[x] Qwen models
[x] PLaMo-13B
[x] Phi models
[x] PhiMoE
[x] GPT-2
[x] Orion 14B
[x] InternLM2
[x] CodeShell
[x] Gemma
[x] Mamba
[x] Grok-1
[x]...

Excerpt shown — open the source for the full document.

Notability

notability 2.0/10

Routine fork with minimal traction