llama.cpp is a local inference engine, not a service
llama.cpp is an open-source LLM inference engine written in C/C++ and maintained in the open by the ggml-org project. Its stated goal is "to enable LLM inference with minimal setup and state-of-the-art performance on a wide range of hardware - locally and in the cloud." On the path described here it runs locally: models are loaded and served by a process on your own machine, so the "data retention" framing that applies to OpenAI, Anthropic, or Google does not apply in the same shape.
There is no llama.cpp-operated inference cloud. When an application calls llama.cpp for a summary, the request is handled on-device. No prompts or completions cross the network boundary unless the surrounding application explicitly forwards them.
Retention, training, and ZDR
- Retention: None by default. llama.cpp does not run a remote service, so there is no off-device store of prompts or completions to configure a retention period for.
- Training: llama.cpp does not train models. It serves pre-trained open-weight models in GGUF format that were trained elsewhere by their publishers (Meta, Mistral AI, Alibaba, and others).
- Zero data retention: Available by construction - inference is local, so off-device transmission is zero.
The local OpenAI-compatible server
llama.cpp ships llama-server, described in its documentation as "a lightweight, OpenAI API compatible, HTTP server for serving LLMs." By default it listens on 127.0.0.1:8080 - a loopback address that is not reachable from outside the machine. It exposes OpenAI-compatible endpoints including /v1/chat/completions, /v1/completions, /v1/embeddings, and /v1/models.
This matters for privacy because any tool that speaks the OpenAI API can point its base URL at the local server (for example http://127.0.0.1:8080/v1) and get on-device inference without code changes. The server documentation describes no telemetry and no remote data transmission; the only network activity is user-initiated, such as a one-time model download from Hugging Face.
Open source under MIT
llama.cpp is distributed under the MIT license, one of the most permissive open-source licenses. The full source is on GitHub, so the engine and its server can be read, built, and audited independently rather than trusted on the basis of a published policy.
How Meetily uses llama.cpp
Meetily's transcription path is local-by-default. Because llama-server exposes a local OpenAI-compatible endpoint, Meetily's BYOK summary provider can point at that local base URL, so summaries run on the same machine. The result is that the entire pipeline - audio capture, transcription, summary - stays on the device. Audio never leaves the machine, and on this path neither do transcripts or summaries.
This is the simplest answer to "does the summarization step have a retention policy I need to read?" because the answer becomes "no, there is no remote service in the loop." For organizations subject to data-residency or processor-disclosure obligations, this path removes a class of compliance questions entirely.
References
- "llama.cpp: LLM inference in C/C++." ggml-org, GitHub. https://github.com/ggml-org/llama.cpp (accessed 2026-06-29). Confirms the MIT license, the project goal of local on-device inference with minimal setup, and that the project bundles an OpenAI-compatible HTTP server.
- "LLaMA.cpp HTTP Server (tools/server/README.md)." ggml-org, GitHub. https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md (accessed 2026-06-29). Documents llama-server as "a lightweight, OpenAI API compatible, HTTP server," the default
127.0.0.1:8080bind address, and the OpenAI-compatible endpoints (/v1/chat/completions,/v1/completions,/v1/embeddings,/v1/models); contains no mention of telemetry or remote data transmission. - "llama.cpp." Wikipedia. https://en.wikipedia.org/wiki/Llama.cpp (accessed 2026-06-29). Corroborates that llama.cpp is open-source software released under the MIT license for performing inference on large language models.