Run Your Own AI: Self-Hosting LLMs with Ollama and LM Studio

Every time you send a message to ChatGPT or Claude, you're paying for it one way or another: through a subscription, through API credits, or by handing over your data to train the next model. But here's the thing: you can run surprisingly capable AI models on hardware you already own. No internet required. No API keys. No one reading your prompts.

I've been running local LLMs for over a year now, and with the recent release of LM Studio 0.4 (which just blew up on Hacker News), it's never been easier to get started. Let me walk you through the options.

Why Bother Running AI Locally?

Before we dive into the how, let's talk about the why. There are three main reasons people go local:

Privacy That Actually Means Something

When you run a model locally, your prompts never leave your machine. This matters more than you might think. Maybe you're working with sensitive code, personal journals, business documents, or just don't want your conversations analyzed and stored on someone else's servers. Local means local. Your data stays yours.

No More API Bills

If you've ever built something with the OpenAI API, you know how quickly those costs add up. A busy application can burn through hundreds of dollars a month. With local models, your only cost is electricity. After the initial hardware investment, every query is essentially free.

Works Offline

Internet goes down? Cloud service has an outage? Doesn't matter. Your local model keeps running. This is huge for reliability and for working in environments with spotty connectivity.

The Two Main Players: Ollama vs LM Studio

When it comes to running LLMs locally, two tools dominate the conversation. They take different approaches, and the right choice depends on how you want to work.

Ollama: The CLI-First Powerhouse

Ollama is what I reach for when I want something that just works and plays nice with other tools. It runs as a background service and exposes an API that's compatible with OpenAI's format. This means you can point almost any AI-powered tool at your local Ollama instance and it'll work.

Best for:

  • Developers who want API access
  • Integration with existing tools and scripts
  • Running models headlessly on a server
  • People comfortable with the terminal

LM Studio: The GUI Experience

LM Studio just dropped version 0.4, and it's genuinely impressive. If Ollama is a power tool, LM Studio is the friendly workshop. It gives you a beautiful interface to browse, download, and chat with models. The new version added a built-in reasoning engine, better model management, and significant performance improvements.

Best for:

  • People who prefer graphical interfaces
  • Experimenting with different models quickly
  • Non-technical users who want to try local AI
  • Anyone who wants ChatGPT-style conversations locally

Here's my hot take: install both. Use LM Studio when you want to chat and experiment. Use Ollama when you need to integrate AI into your workflow or run it on a headless server.

Hardware Requirements: Can Your Machine Handle It?

This is the question everyone asks first. The honest answer: it depends on which models you want to run.

Minimum Specs (7B-8B Parameter Models)

For smaller models like Llama 3.1 8B or Mistral 7B:

  • RAM: 16GB minimum (8GB usable for the model)
  • GPU: Optional, but a 6GB+ VRAM card helps significantly
  • Storage: 20-30GB free for a few models
  • CPU: Modern quad-core or better

On CPU alone, expect 5-15 tokens per second, which is usable but not snappy.

Recommended Specs (Larger Models)

For the good stuff:

  • RAM: 32GB+ (64GB for 70B models)
  • GPU: RTX 3080/4070 or better with 12GB+ VRAM
  • Storage: 100GB+ SSD space
  • CPU: Modern 8-core for faster prompt processing

With a decent GPU, you'll see 30-80+ tokens per second, which feels instant.
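If you want to check whether a given model will fit before downloading it, a back-of-envelope rule works well. The figures here are assumptions, not exact numbers: a Q4_K_M file averages roughly 0.6 bytes per parameter, and you should budget a couple of gigabytes of headroom for the runtime and KV cache.

```shell
# Rough fit check: Q4_K_M quantization averages ~0.6 bytes/parameter
# (assumed figure), plus ~2 GB headroom for the runtime and KV cache.
params_m=8000                          # model size in millions (8B)
weights_mb=$(( params_m * 6 / 10 ))    # ~4800 MB of weights
total_gb=$(( (weights_mb + 2048) / 1024 ))
echo "~${weights_mb} MB weights, budget ~${total_gb} GB RAM/VRAM"
```

Swap in `params_m=70000` and the budget lands around 43GB, which is exactly why 70B models want a 48GB card or spillover into system RAM.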

The GPU Question

Can you run LLMs on CPU only? Yes, absolutely. Should you? For small models and occasional use, it's fine. But if you plan to use local AI regularly, a GPU makes a massive difference. Even an older RTX 3060 12GB will transform the experience.

If you don't have the hardware, consider a cloud option. A Hetzner GPU server with an RTX 4000 can run Ollama and give you remote access to your own AI setup without the upfront hardware cost.

Setting Up Ollama

Let's get practical. Ollama is ridiculously easy to install.

Installation

On Linux/macOS:

curl -fsSL https://ollama.com/install.sh | sh

On Windows:

Download the installer from ollama.com/download and run it. Done.

Running Your First Model

Once installed, pull and run a model with a single command:

# Pull and run Llama 3.3 (70B - needs beefy hardware)
ollama run llama3.3

# Or start smaller with Llama 3.1 (8B - runs on most machines)
ollama run llama3.1

# Try Mistral for a fast, capable option
ollama run mistral

That's it. Ollama downloads the model and drops you into a chat. Type your prompt, hit enter, get a response.

Using the API

The real power of Ollama is the API. It runs on port 11434 by default:

# Simple completion
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Explain Docker in one paragraph"
}'

# Chat format (OpenAI-compatible endpoint)
curl http://localhost:11434/v1/chat/completions -d '{
  "model": "llama3.2",
  "messages": [{"role": "user", "content": "Hello!"}]
}'

Because it's OpenAI-compatible, you can use it with most AI tools by just changing the base URL.
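For example, tools built on OpenAI's official SDKs can often be redirected with nothing more than two environment variables. The Python SDK picks both of these up from the environment; other tools may want the URL in their own settings, so check their docs:

```shell
# Point OpenAI-SDK-based tools at the local Ollama server instead.
# (OPENAI_BASE_URL and OPENAI_API_KEY are read by OpenAI's Python SDK;
# other tools may use their own config.)
export OPENAI_BASE_URL="http://localhost:11434/v1"
export OPENAI_API_KEY="ollama"   # any non-empty string; Ollama ignores it
```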

Setting Up LM Studio

LM Studio is even simpler for beginners.

Installation

  1. Download from lmstudio.ai
  2. Install and launch
  3. That's literally it

Downloading Models

Click the search icon in the sidebar and browse available models. LM Studio shows you which ones will fit in your VRAM and estimates performance. Click download on any model that catches your eye.

Pro tip: Look for GGUF format models. They're optimized for local inference and LM Studio handles them beautifully.

Start Chatting

Select your downloaded model and start typing. LM Studio gives you a ChatGPT-like experience with conversation history, system prompts, and parameter tweaking all in a clean interface.

LM Studio's Local Server

Need API access? LM Studio 0.4 includes a built-in server. Go to the "Local Server" tab, select a model, and start it. You get the same OpenAI-compatible API as Ollama, running on port 1234 by default.
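Since both servers speak the same dialect, switching between them is just a base-URL change. A sketch, with the default ports from above; the `model` field has to name a model you've actually pulled or loaded in each app:

```shell
# The same OpenAI-style request body works against either local server;
# only the URL (and the loaded model's name) changes.
OLLAMA_URL="http://localhost:11434/v1/chat/completions"
LMSTUDIO_URL="http://localhost:1234/v1/chat/completions"
body='{"model": "llama3.2", "messages": [{"role": "user", "content": "Hello!"}]}'
# curl "$OLLAMA_URL" -d "$body"    # or $LMSTUDIO_URL with its model name
```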

Level Up: OpenWebUI for a ChatGPT-Like Interface

Ollama's command line is great, but sometimes you want a proper web interface. Enter OpenWebUI, an open-source frontend that looks and feels like ChatGPT but talks to your local models.

Quick Setup with Docker

# Make sure Docker is installed, then:
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main

Open http://localhost:3000 in your browser. Create an account (stored locally), and OpenWebUI automatically detects your running Ollama instance.

Why OpenWebUI?

  • Multi-user support with separate chat histories
  • RAG (Retrieval Augmented Generation) for chatting with your documents
  • Model switching mid-conversation
  • Prompt templates and presets
  • Mobile-friendly responsive design

It's the missing piece that makes local AI feel polished.

Model Recommendations: What Should You Actually Run?

The model zoo is overwhelming. Here's what I actually use:

Llama 3.3 70B - The New King

If you have the hardware (48GB+ VRAM or 64GB+ RAM for CPU inference), this is the one to run. It rivals GPT-4 on many benchmarks and handles complex reasoning, coding, and creative writing beautifully. Quantized versions (Q4_K_M) fit in 40GB VRAM.

Llama 3.1 8B - The Everyday Workhorse

Fast, capable, runs on modest hardware. This is my go-to for quick questions, writing help, and code explanations. The 8B size hits a sweet spot of quality and speed.

Mistral 7B - Fast and Punchy

Mistral punches above its weight. It's quick, handles instructions well, and fits in 8GB VRAM easily. Great for coding assistance.

Qwen 2.5 - The Multilingual Option

Alibaba's Qwen models excel at multilingual tasks and coding. The 72B version is excellent, and the 7B/14B versions are solid choices for everyday use. Particularly good if you work with Chinese or other non-English text.

DeepSeek Coder V2 - Code Specialist

If your primary use case is coding, DeepSeek's models are specifically trained for it. They understand code context exceptionally well and generate clean, functional code.

My Typical Setup

I keep Llama 3.1 8B loaded for quick tasks and spin up Llama 3.3 70B when I need serious reasoning power. OpenWebUI lets me switch between them based on the task.

Tips From the Trenches

After a year of running local models, here's what I've learned:

Quantization matters. You don't need full precision. Q4_K_M quantization gives you 95% of the quality at a fraction of the memory footprint. Always look for quantized versions.

Context length is expensive. Larger context windows eat VRAM. If you're running tight on memory, stick to 4K or 8K context instead of maxing out at 128K.
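The context cost is easy to underestimate, so here's the arithmetic. For a Llama-3-8B-class model, the figures below (32 layers, 8 KV heads, head dimension 128, fp16 cache) are assumptions based on the published architecture:

```shell
# KV cache grows linearly with context: 2 (K and V) x layers x
# kv_heads x head_dim x 2 bytes (fp16) per token of context.
layers=32; kv_heads=8; head_dim=128; bytes=2
per_token=$(( 2 * layers * kv_heads * head_dim * bytes ))  # 131072 B
echo "8K context:   $(( per_token * 8192   / 1024 / 1024 )) MiB"
echo "128K context: $(( per_token * 131072 / 1024 / 1024 / 1024 )) GiB"
```

A full gigabyte at 8K, sixteen at 128K. That's why maxing out the context window can push the model weights themselves out of VRAM.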

Temperature affects everything. For factual tasks, use 0.1-0.3. For creative work, bump it to 0.7-0.9. The default of 0.7 is a reasonable middle ground.

System prompts work locally too. Don't forget you can give your local models personas and instructions just like the cloud versions.
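These tips combine into a single request. This is a sketch of the body only, with illustrative values; `system`, `stream`, and `options` (including `temperature` and `num_ctx`) are fields of Ollama's /api/generate schema:

```shell
# Temperature, context size, and a persona in one request body.
payload='{
  "model": "llama3.2",
  "system": "You are a terse senior sysadmin.",
  "prompt": "How do I check disk usage on Linux?",
  "stream": false,
  "options": { "temperature": 0.2, "num_ctx": 8192 }
}'
# send with: curl http://localhost:11434/api/generate -d "$payload"
```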

The Bottom Line

Running local LLMs isn't some complicated homelab project anymore. Between Ollama's simplicity and LM Studio 0.4's polish, you can have a working setup in under five minutes. The models are good enough for real work, the tools are mature, and the community is thriving.

Start with whatever hardware you have. Download Llama 3.2 or Mistral. See what it can do. You might be surprised how little you reach for the cloud APIs after that.

Your prompts, your hardware, your AI. That's the whole point of self-hosting.