Day 5: Silly Tavern with local models

After Day 4, the next step is local models: AI that runs on your Mac without sending your chats to a cloud API. Once the weights are downloaded, additional inference costs nothing beyond electricity and stays private.
This article covers what local models are, how to set up Ollama, which models to try, and the tradeoffs versus cloud, with 2026-oriented picks. By the end of Day 5 you will have local inference working and Silly Tavern usable offline.
What is a local model? | AI on your machine
A local model runs entirely on your computer—cloud is “library checkout”; local is “your bookshelf.”
How it works
You download weights (often multi-GB) and run inference with RAM/GPU (or CPU). Performance scales with hardware.
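As a preview of the workflow covered step by step below, a minimal Ollama session looks like this. The tag is this article's example pick from the model table later on, so verify current tags on ollama.com/library before pulling:
```bash
# Download the weights once (multi-GB), then chat locally with no network needed
ollama pull qwen3.5:7b
ollama run qwen3.5:7b "Introduce yourself in one sentence."
```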
Cloud vs local
| Topic | Local | Cloud |
|---|---|---|
| Internet | Not required after download | Required |
| Cost | Free inference (you own hardware) | Often paid or capped |
| Privacy | Data stays on device | Sent to provider |
| Quality | Depends on model + hardware | Often flagship-class |
| Setup | More moving parts | Usually simpler |
| Hardware | Higher bar | Lower bar |
💡 Tip: Many people start cloud (Day 4), then add local when ready.
Pros and cons
Pros
- No per-token API bill after download
- Privacy—no provider sees prompts
- Offline use after model pull
- No cloud rate limits
- Tunable stacks for enthusiasts
Cons
- RAM/GPU demands
- Setup can confuse newcomers
- Speed suffers on weak machines
- Large disk use
- Quality may trail top cloud models
Ollama | Simple local runner
Ollama downloads and serves models with minimal ceremony—like an app store for local LLMs.
Highlights
- Quick install (e.g. Homebrew on Mac)
- One-command pulls and runs
- Apple Silicon friendly
- Memory-aware defaults
- Silly Tavern connects via standard endpoints

Install Ollama on macOS
If you installed Homebrew on Day 2:
```bash
brew install ollama
ollama --version
```
Start the service
```bash
ollama serve
```
The API typically listens on http://localhost:11434.
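To confirm the server is up, you can hit the API directly; /api/tags is Ollama's endpoint for listing locally installed models (it returns an empty list before your first pull):
```bash
# Should return JSON; an empty "models" array just means nothing is pulled yet
curl http://localhost:11434/api/tags
```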

💡 Tip: Keep that terminal session running, or run Ollama as a background service per Ollama docs.
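If you installed Ollama with the Homebrew formula, Homebrew's service manager is one way to do that. This is a sketch; check the Ollama docs if your install method differs:
```bash
# Start Ollama as a background service managed by Homebrew
brew services start ollama

# Stop it again when you no longer need it
brew services stop ollama
```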
Choosing a model (2026-oriented table)
| Model | Size (approx.) | Suggested RAM | Notes |
|---|---|---|---|
| Qwen3.5 7B | ~4.7 GB | 8 GB | Strong multilingual / Japanese; common beginner pick |
| Mistral Small 3.1 | ~4.5 GB | 8 GB | General, fast daily chat |
| DeepSeek-R1 7B | ~5.2 GB | 10 GB | Reasoning-heavy tasks |
| Nemotron Mini 4B | ~2.7 GB | 6 GB | Lighter footprint |
| Phi-4 Mini 3.8B | ~2.5 GB | 6 GB | Efficient small model |
Names and tags on the Ollama library change—verify current tags at ollama.com/library.
Pull example
```bash
ollama pull qwen3.5:7b
```
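Before wiring up Silly Tavern, a quick sanity check in the terminal confirms the model loads and replies. The tag below is the same example as above; swap in whatever tag you actually pulled:
```bash
# One-shot prompt; the first run is slower while weights load into memory
ollama run qwen3.5:7b "Say hello in one short sentence."
```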
Connect Silly Tavern to Ollama
- Open Silly Tavern at http://localhost:8000
- Go to API Connections → Chat Completion
- Select Ollama
- URL: http://localhost:11434 (default)
- Model: the tag you pulled (e.g. qwen3.5:7b)
- Connect
First reply can be slow while weights load into RAM.
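If the connection fails, testing Ollama's HTTP API directly helps isolate whether the problem is Ollama or Silly Tavern. This uses the same example tag assumption as above:
```bash
# Roughly mirrors what a frontend sends; expect a JSON reply with a "message" field
curl http://localhost:11434/api/chat -d '{
  "model": "qwen3.5:7b",
  "messages": [{"role": "user", "content": "Hello!"}],
  "stream": false
}'
```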
💡 Tip: If replies lag, close other heavy apps or pick a smaller quantized tag.
Performance tips
Low RAM
- Smaller models (4B class)
- Quantized variants when offered (-q4_K_M, etc.)
- Close background apps
Apple Silicon
Metal acceleration is usually on by default; optional env tweaks exist for power users:
```bash
export OLLAMA_GPU_LAYERS=999
```
Quantization
Quantization shrinks weights (e.g. 4-bit, 8-bit) to save RAM at some quality cost. Ollama tags often encode the variant.
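On the Ollama library, the quantization level is usually encoded in the tag itself. The tag below is illustrative only, following the common -q4_K_M naming pattern; confirm the real tag on the model's library page before pulling:
```bash
# Hypothetical tag shown for the naming pattern; check ollama.com/library for real tags
ollama pull qwen3.5:7b-q4_K_M
```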
Other local stacks
LM Studio
GUI-first runner; OpenAI-compatible local API for Silly Tavern.
KoboldAI
Story-focused local stack.
Oobabooga Text Generation WebUI
Highly configurable; OpenAI-compatible modes available.
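Both LM Studio and Oobabooga can expose OpenAI-compatible endpoints that Silly Tavern's Chat Completion (custom) source can talk to. As a rough sketch, assuming LM Studio's default local server port of 1234 and a placeholder model name (check your own setup for both):
```bash
# Assumes LM Studio's default port (1234) and that a model is loaded in its server tab
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local-model",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```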

Cloud vs local—who should pick what?
Prefer cloud if
- You want top flagship quality
- Low-spec machine
- You want minimum setup
- Small monthly API budget is OK
Prefer local if
- Privacy is critical
- You want $0 marginal cost per token
- You need offline
- You have 16 GB+ RAM (ideal) or can use small quants
Hybrid
Many users mix: sensitive chats local, hard tasks cloud.
Troubleshooting
Out of memory
Smaller model, quantization, fewer parallel apps.
Very slow replies
Smaller model, ensure GPU path on Apple Silicon, reduce background load.
Cannot connect
Confirm ollama serve is running and port 11434 is free.
Model missing
Run ollama list to check what is installed; if the tag is missing, run ollama pull <name> again.
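A quick diagnostic pass from the terminal covers most of the cases above (macOS commands shown, using the default port from this article):
```bash
# Is anything listening on Ollama's default port?
lsof -i :11434

# Does the API respond, and which models does it see?
curl -s http://localhost:11434/api/tags
ollama list
```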
Next steps | Advanced customization
Continue with:
Easier path: MiniTavern uses cloud backends without local stack setup.
Summary
You learned local models with Silly Tavern—Ollama install, model choice, ST connection, performance notes, and alternatives. Local runs unlock privacy and no API bill for inference once models are on disk.
FAQ
Q1: Is local inference completely free?
There is no cloud API fee per token, but you still pay for the hardware and electricity. Model downloads themselves are free from public Ollama hubs.
Q2: How much RAM?
8 GB minimum for many 7B quants; 16 GB+ more comfortable. Tiny models can run on ~4–6 GB at a quality tradeoff.
Q3: M1/M2/M3?
Yes—Ollama is optimized for Apple Silicon.
Q4: Weaker than cloud?
Often yes at equal parameter counts, but good for daily RP and privacy-first use.
Q5: Fully offline?
Yes after ollama pull completes—no internet needed for inference.
Q6: Best starter model?
Many English/Japanese users start with a Qwen or Mistral family 7B-class tag; check Ollama’s library for current names.
Q7: GPU required?
No, but GPU/Neural Engine (Apple) greatly speeds things up vs CPU-only.
Q8: Multiple models at once?
Usually one loaded model per modest machine; switch tags as needed.
Q9: Japanese support?
Qwen and several Mistral-family builds handle Japanese well—verify per model card on Ollama.
Q10: Windows?
Yes—Ollama supports Windows, macOS, and Linux.
Published: March 15, 2026
Last updated: March 27, 2026
