Thursday, October 2, 2025

Kevin Anderson

Essential Guide: How to Run LLM Locally for Optimal Performance

Running a large language model (LLM) on your own machine gives you control, privacy, and predictable costs. You can iterate faster, keep confidential data in-house, and tailor the stack to your project without vendor limits. This tutorial-style article walks through environment setup, minimal code, data handling, performance tuning, and advanced tricks like RAG—using approachable tools such as Ollama, LangChain, and a vector store.

Local LLMs aren’t just for demos. With the right configuration, a laptop-class GPU can handle real workloads, and a single workstation can serve multiple users with low latency.

You’ll see how to create a runnable baseline, then test and optimize it. We’ll also note common “missing dependency” pitfalls and how to debug them from the terminal.

Note: All examples aim to be copy-paste friendly. If a command fails, carefully read the error message—it usually points to a package or driver version mismatch.



Introduction to LLMs

Large language models (LLMs) are AI systems that process and generate human-like language for chatbots, translation, and summarization.

Modern LLMs expose a context window that buffers recent conversation, enabling coherent multi-turn dialog and long-form reasoning.

Running LLMs locally keeps sensitive data off the cloud while eliminating per-token fees. You can also pin a specific version of a model for reproducible behavior. For more insights on AI strategies, best practices, and technology trends, explore our resources.

Tools like Ollama make local serving simple: they handle model downloads, GPU utilization, and a lightweight HTTP server with a clean Python client.


Quick Start: Minimal Example with Ollama and Python

The fastest path from zero to response is a tiny Python script that calls a local Llama model via Ollama.

Install dependencies and start the background service.

# 1) Install Python deps (sentence-transformers and aiohttp are used in the appendix snippets)
pip install ollama==0.2.* chromadb langchain-community sentence-transformers aiohttp

# 2) Start the Ollama server (new shell/tab)
ollama serve
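
Optionally, confirm the server is listening before moving on (this assumes the default port, 11434):

# Optional: verify the server responds (returns locally pulled models as JSON)
curl http://localhost:11434/api/tags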

Pull a Llama family model (substitute another if you like). The first pull downloads several gigabytes, so wait for it to complete.

# 3) Download a model (example: Llama 3 Instruct)
ollama pull llama3
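
Once the pull finishes, you can confirm the model is available:

# Optional: list models available locally
ollama list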

Make a tiny script: hello_llm.py.

import asyncio
import ollama

async def main():
    rsp = await asyncio.to_thread(
        ollama.chat,
        model="llama3",
        messages=[{"role":"user","content":"Say hi and then count 1..3"}],
    )
    print(rsp["message"]["content"])  # display result

if __name__ == "__main__":
    asyncio.run(main())

Run it from your terminal:

python hello_llm.py

You should see a friendly greeting followed by numbers. You just ran a local LLM session end-to-end—no cloud, no API keys, largely free aside from electricity.


Setting Up the Environment

Confirm you have recent GPU drivers and a compatible CUDA/ROCm stack; otherwise the runtime falls back to CPU, which is fine for test prompts but too slow for production.

Install Python 3.10+ and create a virtualenv so packages don’t conflict with other projects.
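
For example, on macOS or Linux (adapt the activation step for your shell or OS):

# Create and activate an isolated environment
python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip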

If a model refuses to load, check VRAM. Smaller Llama variants run on 8–12 GB; larger models require more. For ultra-tight VRAM, prefer quantized builds (e.g., Q4_K_M).
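
Quantized builds are published as model tags; the tag below is only an example, so check the model’s page in the Ollama library for the tags that actually exist.

# Example: pull a 4-bit quantized build (tag names vary by model)
ollama pull llama3:8b-instruct-q4_K_M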

When a dependency is missing, re-read the error, then install, upgrade, or pin that package. Keep a requirements.txt in your repo so teammates join with identical environments.
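
A pinned requirements.txt might look like this; the version numbers below are placeholders, so pin whichever versions you actually tested with.

# requirements.txt (placeholder versions -- pin what you test with)
ollama==0.2.1
chromadb==0.5.3
langchain-community==0.2.10
sentence-transformers==3.0.1
aiohttp==3.9.5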


LLM Architecture (What You’re Actually Running)

A local LLM stack typically includes:

  • An embedding component (optional) for retrieval tasks.

  • A transformer generator (e.g., Llama) that produces output tokens conditioned on the input context.

  • A thin serving layer (Ollama’s server) that streams tokens to your app.

  • For RAG, add a vector store and a collection to index documents and ground answers in your data.


Data Storage and Retrieval

Local RAG boosts factuality and reduces hallucinations without re-training.

Use Chroma (or similar) to initialize a persistent collection, store embeddings, and run similarity search.

from chromadb import PersistentClient
client = PersistentClient(path="./chroma_db")
col = client.get_or_create_collection("docs")   # initialize collection

You can ingest .pdf, .md, or .txt files via LangChain loaders, then create embeddings and upsert them. At query time, retrieve the top-k chunks and pass them into the prompt.
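
Here is a minimal ingestion sketch. It chunks by hand to avoid extra dependencies, and the file name notes.txt, the chunk sizes, and the collection name are illustrative; swap in a LangChain loader and splitter if you prefer.

from chromadb import PersistentClient
from langchain_community.embeddings import HuggingFaceEmbeddings

def chunk(text, size=800, overlap=100):
    # split text into overlapping character windows
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

client = PersistentClient(path="./chroma_db")
col = client.get_or_create_collection("docs")
emb = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

text = open("notes.txt", encoding="utf-8").read()   # illustrative input file
pieces = chunk(text)
col.add(
    ids=[f"notes-{i}" for i in range(len(pieces))],
    documents=pieces,
    embeddings=emb.embed_documents(pieces),         # one vector per chunk
)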

RAG lets the model generate grounded responses even with a small base model.


Integrating with Chat History

For chatbots, persist conversation state.

You can store chat messages per user in SQLite, Redis, or flat JSON for simplicity.

At inference, prepend the last N turns into the input context so the model keeps continuity and consistent responses.

If context windows are small, summarize older turns into compact notes before appending them, which keeps the prompt within budget without losing intent.
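
A minimal in-memory sketch of that pattern is below; MAX_TURNS and history are illustrative names, and a real app would persist the list to SQLite, Redis, or JSON as described above.

import ollama

MAX_TURNS = 6          # how many recent messages to replay each call
history = []           # [{"role": ..., "content": ...}, ...]

def chat(user_text):
    history.append({"role": "user", "content": user_text})
    rsp = ollama.chat(model="llama3", messages=history[-MAX_TURNS:])
    reply = rsp["message"]["content"]
    history.append({"role": "assistant", "content": reply})
    return reply

print(chat("My name is Sam."))
print(chat("What is my name?"))   # the replayed history lets the model answer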


Optimizing Performance

Throughput and latency hinge on a few levers.

Batching / streaming. Stream tokens as they become available to improve perceived speed and show partial results immediately (see the streaming sketch at the end of this section).

Quantization. Run Q4/Q5/Q8 builds to reduce VRAM usage; for many tasks the loss in output quality is small.

Temperature & sampling. For more deterministic, concise responses, use a lower temperature (e.g., 0.2). For more diverse outputs, raise it (0.7–0.9).

Async I/O. Use asyncio to overlap retrieval and generation. A non-blocking event loop can keep the UI snappy.

Pin versions. Lock the model version and your dependencies for stable results across deployments.
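
The streaming sketch mentioned above, assuming the Python client's stream=True option, which yields response chunks as a generator:

import ollama

stream = ollama.chat(
    model="llama3",
    messages=[{"role": "user", "content": "Explain quantization in two sentences."}],
    stream=True,
)
for chunk in stream:
    print(chunk["message"]["content"], end="", flush=True)   # print partial text as it arrives
print()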


Running Local LLMs

Ollama exposes both CLI and HTTP endpoints. For quick checks:

# Simple prompt from the terminal
ollama run llama3 "Write a 2-line haiku about local AI."

For structured apps, call the REST API or Python client to write tools that create summaries, draft emails, or answer FAQs.
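
For example, a single non-streaming request against the default REST endpoint:

# One-shot request to the REST API (stream disabled to get a single JSON response)
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3", "prompt": "Draft a short FAQ answer about local LLMs.", "stream": false}'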

If you need agents (tool-using pipelines), orchestrate the model with LangChain or LlamaIndex. Start simple: retrieval + function calling + deterministic generation.

You can run multiple models on one machine, but keep an eye on VRAM; too many concurrent models will spill out of GPU memory and slow everything down.


Advanced Techniques

Retrieval-Augmented Generation (RAG). Chunk your corpus, embed it, store it in a collection, then stuff the top-k matches into the prompt. This grounds the output in your documents and makes it auditable.

Local tools & agents. Expose functions (search files, call calculators) and let the model plan-act-observe. Keep guardrails tight to prevent destructive actions.

Serving at scale. For multiple users, run Ollama behind a reverse proxy and autoscale via containers. A single box can serve many chats with careful batching.

If you’re packaging for teammates, create a small repository with a Makefile that installs deps, launches the server, and runs smoke tests.
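
A hypothetical Makefile along those lines; the target names and file names are illustrative (remember that recipe lines must be indented with tabs).

# Makefile
.PHONY: setup serve smoke

setup:
	pip install -r requirements.txt

serve:
	ollama serve

smoke:
	python hello_llm.py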


Conclusion and Future Directions

Running an LLM locally is now practical: install Ollama, pull Llama, write a few lines of code, and you have a private assistant that can draft, translate, and answer domain questions.

For production, layer in RAG, quantization, and async orchestration to boost efficiency and reliability. Keep an eye on driver version drift and library updates.

As models improve, expect bigger context windows, tighter tool support, and better small-model quality. Start with this tutorial, then iterate—profile, tune, and test until the UX feels instant.


Appendix: Reference Snippets (Copy–Paste Friendly)

A) Simple chat with temperature control

import ollama
resp = ollama.chat(
    model="llama3",
    messages=[{"role":"user","content":"Give 3 bullet points on on-prem LLMs."}],
    options={"temperature":0.2}  # small temperature -> concise responses
)
print(resp["message"]["content"])

B) RAG skeleton (ingest & retrieve)

from chromadb import PersistentClient
from langchain_community.embeddings import HuggingFaceEmbeddings

client = PersistentClient("./chroma_db")
col = client.get_or_create_collection("kb")

emb = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

docs = [
    {"id":"guide-1", "text": open("local_guide.txt").read()},  # data.txt or any txt
]

for d in docs:
    v = emb.embed_documents([d["text"]])[0]   # embed the document text
    col.add(ids=[d["id"]], documents=[d["text"]], embeddings=[v])

q = "How do I start the Ollama server?"
qv = emb.embed_query(q)
results = col.query(query_embeddings=[qv], n_results=3)
context = "\n\n".join(results["documents"][0])

from textwrap import dedent
prompt = dedent(f"""
Use the context to answer:
Context:
{context}

Question: {q}
""")

import ollama
print(ollama.generate(model="llama3", prompt=prompt)["response"])

C) Async streaming (don’t block the UI)

import asyncio, json, sys
import aiohttp

async def stream(prompt):
    async with aiohttp.ClientSession() as s:
        async with s.post("http://localhost:11434/api/generate",
                          json={"model":"llama3","prompt":prompt,"stream":True}) as r:
            async for line in r.content:                     # one JSON object per line
                chunk = json.loads(line)
                sys.stdout.write(chunk.get("response", ""))  # display text as it arrives
                sys.stdout.flush()

asyncio.run(stream("Summarize the benefits of local LLMs."))

Final note: local inference is about control. Initialize once, keep sessions short, and measure. If you hit a ceiling, swap models, adjust sampling (a higher temperature lets the model pick less probable tokens), or add retrieval. With steady iteration, your local stack will deliver consistent, relevant responses and give you practical, private AI on your own hardware.

