Skip to main content
LocalLast verified June 29, 2026

llama.cpp data retention policy

llama.cpp runs LLMs entirely on your own hardware. There is no remote service, no telemetry, and no training on your prompts. Its bundled llama-server exposes a local OpenAI-compatible API that Meetily BYOK can point at, keeping the whole pipeline on-device.

Quick policy snapshot

Default retention
None (runs locally)
Zero data retention available
Yes
Trains on API customer data by default
No

llama.cpp is a local inference engine, not a service

llama.cpp is an open-source LLM inference engine written in C/C++ and maintained in the open by the ggml-org project. Its stated goal is "to enable LLM inference with minimal setup and state-of-the-art performance on a wide range of hardware - locally and in the cloud." On the path described here it runs locally: models are loaded and served by a process on your own machine, so the "data retention" framing that applies to OpenAI, Anthropic, or Google does not apply in the same shape.

There is no llama.cpp-operated inference cloud. When an application calls llama.cpp for a summary, the request is handled on-device. No prompts or completions cross the network boundary unless the surrounding application explicitly forwards them.

Retention, training, and ZDR

  • Retention: None by default. llama.cpp does not run a remote service, so there is no off-device store of prompts or completions to configure a retention period for.
  • Training: llama.cpp does not train models. It serves pre-trained open-weight models in GGUF format that were trained elsewhere by their publishers (Meta, Mistral AI, Alibaba, and others).
  • Zero data retention: Available by construction - inference is local, so off-device transmission is zero.

The local OpenAI-compatible server

llama.cpp ships llama-server, described in its documentation as "a lightweight, OpenAI API compatible, HTTP server for serving LLMs." By default it listens on 127.0.0.1:8080 - a loopback address that is not reachable from outside the machine. It exposes OpenAI-compatible endpoints including /v1/chat/completions, /v1/completions, /v1/embeddings, and /v1/models.

This matters for privacy because any tool that speaks the OpenAI API can point its base URL at the local server (for example http://127.0.0.1:8080/v1) and get on-device inference without code changes. The server documentation describes no telemetry and no remote data transmission; the only network activity is user-initiated, such as a one-time model download from Hugging Face.

Open source under MIT

llama.cpp is distributed under the MIT license, one of the most permissive open-source licenses. The full source is on GitHub, so the engine and its server can be read, built, and audited independently rather than trusted on the basis of a published policy.

How Meetily uses llama.cpp

Meetily's transcription path is local-by-default. Because llama-server exposes a local OpenAI-compatible endpoint, Meetily's BYOK summary provider can point at that local base URL, so summaries run on the same machine. The result is that the entire pipeline - audio capture, transcription, summary - stays on the device. Audio never leaves the machine, and on this path neither do transcripts or summaries.

This is the simplest answer to "does the summarization step have a retention policy I need to read?" because the answer becomes "no, there is no remote service in the loop." For organizations subject to data-residency or processor-disclosure obligations, this path removes a class of compliance questions entirely.

References

  1. "llama.cpp: LLM inference in C/C++." ggml-org, GitHub. https://github.com/ggml-org/llama.cpp (accessed 2026-06-29). Confirms the MIT license, the project goal of local on-device inference with minimal setup, and that the project bundles an OpenAI-compatible HTTP server.
  2. "LLaMA.cpp HTTP Server (tools/server/README.md)." ggml-org, GitHub. https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md (accessed 2026-06-29). Documents llama-server as "a lightweight, OpenAI API compatible, HTTP server," the default 127.0.0.1:8080 bind address, and the OpenAI-compatible endpoints (/v1/chat/completions, /v1/completions, /v1/embeddings, /v1/models); contains no mention of telemetry or remote data transmission.
  3. "llama.cpp." Wikipedia. https://en.wikipedia.org/wiki/Llama.cpp (accessed 2026-06-29). Corroborates that llama.cpp is open-source software released under the MIT license for performing inference on large language models.

Last verified: June 29, 2026. Policy source: llama.cpp policy

Frequently asked questions

Is llama.cpp fully local?
Yes. llama.cpp is an open-source C/C++ inference engine that runs models on your own hardware. Its stated goal is LLM inference with minimal setup locally and in the cloud, and on the BYOK path described here it runs locally. There is no llama.cpp-operated inference service, so prompts and completions stay on the machine running it.
Does any of my data leave my device when I use llama.cpp?
No, not for inference. When you run a model with llama.cpp or its bundled llama-server, requests are served by a process on your own machine. The only network activity is user-initiated, such as downloading a model file once from Hugging Face. The server documentation describes no telemetry or remote data transmission.
Does llama.cpp expose a local OpenAI-compatible server?
Yes. llama.cpp ships llama-server, described in its docs as a lightweight, OpenAI API compatible HTTP server. By default it listens on 127.0.0.1:8080 and exposes endpoints including /v1/chat/completions, /v1/completions, and /v1/embeddings. Meetily's BYOK summary path can point at this local base URL, so summaries run on-device.
Is llama.cpp open source, and under what license?
Yes. llama.cpp is open source under the MIT license, developed in the open by the ggml-org project on GitHub. You can read, build, and audit the entire codebase yourself.
Does llama.cpp train models on my data?
No. llama.cpp is an inference engine, not a model trainer. It loads pre-trained open-weight models (in GGUF format) that were trained elsewhere by their publishers and serves them locally. Your prompts are never used to train anything.
Is zero data retention available with llama.cpp?
Yes, by construction. Because inference happens on your own hardware, no prompts or completions are transmitted off-device, which is the strongest form of zero data retention available. There is no remote service to retain anything.
How does Meetily use llama.cpp?
Meetily transcription is always 100% local. When you point Meetily's BYOK summary provider at a local llama-server base URL, summaries also run on-device. Audio, transcripts, and summaries never leave your machine on this path - the entire pipeline stays local.

Run summaries locally with Meetily + llama.cpp

Meetily transcription is always 100% local. Pair it with llama.cpp and your meeting summaries never leave your machine either. No retention, no training, no policy to read.