
Day 5: Silly Tavern with local models


After Day 4, try local models—AI that runs on your Mac without sending chat to a cloud API. Once weights are downloaded, extra inference is free (aside from power) and private.

This article covers what local models are, Ollama setup, which models to try, and the tradeoffs versus cloud, with 2026-oriented picks. By the end of Day 5 you will have local inference working and Silly Tavern usable in an offline-capable setup.


What is a local model? | AI on your machine

A local model runs entirely on your computer—cloud is “library checkout”; local is “your bookshelf.”

How it works

You download weights (often multi-GB) and run inference with RAM/GPU (or CPU). Performance scales with hardware.

Cloud vs local

| Topic | Local | Cloud |
| --- | --- | --- |
| Internet | Not required after download | Required |
| Cost | Free inference (you own the hardware) | Often paid or capped |
| Privacy | Data stays on device | Sent to provider |
| Quality | Depends on model + hardware | Often flagship-class |
| Setup | More moving parts | Usually simpler |
| Hardware | Higher bar | Lower bar |

💡 Tip: Many people start with cloud (Day 4), then add local when ready.


Pros and cons

Pros

  1. No per-token API bill after download
  2. Privacy—no provider sees prompts
  3. Offline use after model pull
  4. No cloud rate limits
  5. Tunable stacks for enthusiasts

Cons

  1. RAM/GPU demands
  2. Setup can confuse newcomers
  3. Speed suffers on weak machines
  4. Large disk use
  5. Quality may trail top cloud models

Ollama | Simple local runner

Ollama downloads and serves models with minimal ceremony—like an app store for local LLMs.

Highlights

  • Quick install (e.g. Homebrew on Mac)
  • One-command pulls and runs
  • Apple Silicon friendly
  • Memory-aware defaults
  • Silly Tavern connects via standard endpoints



Install Ollama on macOS

If you installed Homebrew on Day 2:

bash
brew install ollama
ollama --version


Start the service

bash
ollama serve

API typically listens on http://localhost:11434.
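
To confirm the server is up before touching Silly Tavern, you can hit the HTTP API directly; a quick check (adjust the URL if you changed the default port):

bash
# Should print "Ollama is running"
curl http://localhost:11434
# Lists models you have pulled (empty until the next section)
curl http://localhost:11434/api/tags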


💡 Tip: Keep that terminal session running, or run Ollama as a background service per Ollama docs.
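
If you installed via Homebrew as above, one way to run it in the background is Homebrew's service manager; a minimal sketch:

bash
# Start Ollama as a Homebrew-managed background service
brew services start ollama
# Stop it again when you no longer need it
brew services stop ollama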


Choosing a model (2026-oriented table)

| Model | Size (approx.) | RAM hint | Notes |
| --- | --- | --- | --- |
| Qwen3.5 7B | ~4.7 GB | 8 GB | Strong multilingual / Japanese; common beginner pick |
| Mistral Small 3.1 | ~4.5 GB | 8 GB | General-purpose, fast daily chat |
| DeepSeek-R1 7B | ~5.2 GB | 10 GB | Reasoning-heavy tasks |
| Nemotron Mini 4B | ~2.7 GB | 6 GB | Lighter footprint |
| Phi-4 Mini 3.8B | ~2.5 GB | 6 GB | Efficient small model |

Names and tags on the Ollama library change—verify current tags at ollama.com/library.
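
Not sure which RAM row you fall into? On macOS you can check from the terminal with standard built-in tools:

bash
# Installed RAM in bytes (divide by 1073741824 for GiB)
sysctl -n hw.memsize
# Human-readable hardware summary including memory
system_profiler SPHardwareDataType | grep -i memory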

Pull example

bash
ollama pull qwen3.5:7b
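
A quick way to confirm the download worked before wiring up Silly Tavern; the qwen3.5:7b tag is just the example from the table above, so substitute whatever tag you actually pulled:

bash
# List installed models and their sizes on disk
ollama list
# Chat with the model directly in the terminal; type /bye to exit
ollama run qwen3.5:7b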



Connect Silly Tavern to Ollama

  1. Silly Tavern → http://localhost:8000
  2. API Connections → Chat Completion
  3. Select Ollama
  4. URL: http://localhost:11434 (default)
  5. Model: the tag you pulled (e.g. qwen3.5:7b)
  6. Connect

First reply can be slow while weights load into RAM.
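
If you want to rule Silly Tavern out while debugging, you can also talk to Ollama directly via its generate endpoint; a minimal sketch using the example tag from above:

bash
curl http://localhost:11434/api/generate -d '{
  "model": "qwen3.5:7b",
  "prompt": "Say hello in one short sentence.",
  "stream": false
}'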

💡 Tip: If replies lag, close other heavy apps or pick a smaller quantized tag.


Performance tips

Low RAM

  • Smaller models (4B class)
  • Quantized variants when offered (-q4_K_M, etc.)
  • Close background apps

Apple Silicon

Metal acceleration is usually on by default; optional env tweaks exist for power users:

bash
export OLLAMA_GPU_LAYERS=999

Quantization

Quantization shrinks weights (e.g. 4-bit, 8-bit) to save RAM at some quality cost. Ollama tags often encode the variant.
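
In practice this just means choosing a different tag when you pull. The tag below is hypothetical; browse the model's page on ollama.com/library for the quantized variants that actually exist:

bash
# Hypothetical quantized tag; check the library page for real variant names
ollama pull qwen3.5:7b-q4_K_M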


Other local stacks

LM Studio

GUI-first runner; OpenAI-compatible local API for Silly Tavern.
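
Once LM Studio's local server is started, Silly Tavern (or curl) can reach it like any OpenAI-style endpoint; port 1234 is LM Studio's usual default, but verify it in the app's server tab:

bash
# Lists whatever model the LM Studio server currently exposes
curl http://localhost:1234/v1/models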

KoboldAI

Story-focused local stack.

Oobabooga Text Generation WebUI

Highly configurable; OpenAI-compatible modes available.



Cloud vs local—who should pick what?

Prefer cloud if

  • You want top flagship quality
  • Low-spec machine
  • You want minimum setup
  • Small monthly API budget is OK

Prefer local if

  • Privacy is critical
  • You want $0 marginal cost per token
  • You need offline
  • You have 16 GB+ RAM (ideal) or can use small quants

Hybrid

Many users mix: sensitive chats local, hard tasks cloud.


Troubleshooting

Out of memory

Smaller model, quantization, fewer parallel apps.

Very slow replies

Smaller model, ensure GPU path on Apple Silicon, reduce background load.
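
Recent Ollama versions include an ollama ps command that shows which model is loaded and whether it is running on the GPU or spilling to CPU; worth checking before blaming the model:

bash
# PROCESSOR column shows e.g. "100% GPU" vs a CPU/GPU split
ollama ps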

Cannot connect

Confirm ollama serve is running and port 11434 is free.
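
Two quick terminal checks (the curl line repeats the sanity check from the install section; lsof is a standard macOS tool):

bash
# Should print "Ollama is running" if the server is up
curl http://localhost:11434
# Shows which process, if any, is holding the port
lsof -i :11434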

Model missing

Run ollama list to check what is installed; if the tag is missing, run ollama pull <name> again.


Next steps | Advanced customization

Continue with the next article in the series, which covers advanced customization.

Easier path: MiniTavern uses cloud backends without local stack setup.


Summary

You learned local models with Silly Tavern—Ollama install, model choice, ST connection, performance notes, and alternatives. Local runs unlock privacy and no API bill for inference once models are on disk.



About the author


花(Hana)

A specialist in AI tool evaluation, based around Shinjuku-sanchome in Tokyo, reviewing the latest AI applications and tools through hands-on use.


FAQ

Q1: Is local inference completely free?

There is no cloud API fee per token, but you still pay for the hardware and electricity. Model downloads themselves are free from the public Ollama library.

Q2: How much RAM?

8 GB minimum for many 7B quants; 16 GB+ more comfortable. Tiny models can run on ~4–6 GB at a quality tradeoff.

Q3: M1/M2/M3?

Yes—Ollama is optimized for Apple Silicon.

Q4: Weaker than cloud?

Often yes at equal parameter counts, but good for daily RP and privacy-first use.

Q5: Fully offline?

Yes after ollama pull completes—no internet needed for inference.

Q6: Best starter model?

Many English/Japanese users start with a Qwen or Mistral family 7B-class tag; check Ollama’s library for current names.

Q7: GPU required?

No, but GPU acceleration (Metal on Apple Silicon) greatly speeds things up versus CPU-only.

Q8: Multiple models at once?

Usually one loaded model per modest machine; switch tags as needed.

Q9: Japanese support?

Qwen and several Mistral-family builds handle Japanese well—verify per model card on Ollama.

Q10: Windows?

Yes—Ollama supports Windows, macOS, and Linux.


Published: March 15, 2026
Last updated: March 27, 2026


