Robot Terror

Tech experiments, travel adventures, and code explorations by Robert Taylor

30 May 2025

Building a Complete Local LLM Ecosystem: Stack Deployment and Performance Benchmarking

by Robert Taylor

Running large language models locally has become increasingly practical, especially on Apple Silicon hardware with its unified memory architecture and Metal Performance Shaders acceleration. However, setting up a reliable local LLM environment and understanding its performance characteristics requires the right tooling. Today I’m sharing two complementary repositories that address these needs: a complete deployment stack and a comprehensive benchmarking suite.

The Challenge of Local LLM Deployment

While cloud-based LLM services offer convenience, they come with inherent limitations: API costs, data privacy concerns, network dependencies, and usage restrictions. Local deployment solves these issues but introduces new challenges around setup complexity, performance optimization, and system monitoring.

The ideal local LLM environment is performant, maintainable, and observable, and those three qualities shaped the design of both projects described below.

Introducing LLM-Stack: Complete Local Deployment

LLM-Stack provides a turnkey solution for running a complete local LLM environment. The stack combines the power of Ollama’s native Metal acceleration with Docker-based Open WebUI, creating a hybrid approach that maximizes both performance and maintainability.

Architecture Overview

The stack leverages a mixed deployment strategy:

Native Components (Ollama): the LLM runtime runs directly on macOS, so inference gets full Metal acceleration with no virtualization layer in the way.

Containerized Components (Open WebUI): the web interface and supporting services run in Docker, which keeps them easy to update, isolate, and tear down.

This hybrid approach avoids the performance penalties of containerizing the LLM runtime while maintaining the operational benefits of containerization for the web interface and supporting services.
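
To make the hybrid pattern concrete, here is a minimal sketch of bringing up a similar setup by hand. It uses Open WebUI's documented Docker image and OLLAMA_BASE_URL setting rather than the stack's own scripts, so treat the exact names, ports, and model choice as illustrative:

# Ollama runs natively on macOS so inference gets Metal acceleration
brew install ollama
ollama serve &                 # API listens on http://localhost:11434 by default
ollama pull mistral            # fetch a model to serve

# Open WebUI runs in Docker and talks to the host's Ollama instance
docker run -d --name open-webui -p 3000:8080 \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  -v open-webui:/app/backend/data \
  ghcr.io/open-webui/open-webui:main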

Key Features

In short: native, Metal-accelerated inference through Ollama, a containerized Open WebUI front end, and built-in monitoring for keeping an eye on the running services.

Introducing LLM-Bench: Performance Analysis and Optimization

Understanding how different models perform on your specific hardware is crucial for making informed deployment decisions. LLM-Bench provides comprehensive benchmarking capabilities specifically designed for Apple Silicon systems running Ollama.

Benchmarking Capabilities

The benchmarking suite measures key performance indicators (a minimal measurement sketch follows this list):

Token Generation Speed: how many tokens per second each model produces during inference.

Memory Utilization: how much of the unified memory pool a model occupies while loaded and generating.

CPU and GPU Utilization: processor load and GPU power draw while a model is running.
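
As a rough illustration of how the first of these can be measured, Ollama's own API reports token counts and timings for each request. The snippet below is a sketch rather than LLM-Bench's actual code; it assumes curl and jq are available and that a model tagged mistral has already been pulled:

MODEL=mistral
RESP=$(curl -s http://localhost:11434/api/generate \
  -d "{\"model\": \"$MODEL\", \"prompt\": \"Explain unified memory in one paragraph.\", \"stream\": false}")

# eval_count is the number of generated tokens; eval_duration is reported in nanoseconds
echo "$RESP" | jq '.eval_count / (.eval_duration / 1e9)'    # tokens per second

# GPU power draw can be sampled separately with the built-in powermetrics tool
sudo powermetrics --samplers gpu_power -n 1 -i 1000 | grep -i "GPU Power"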

Real-World Performance Results

To demonstrate the practical value of systematic benchmarking, here are results from testing on a Mac Studio with an M2 Ultra and 128 GB of RAM, a reference platform representing high-end Apple Silicon deployment:

[Charts: tokens per second and GPU power usage for each benchmarked model (assets/images/tokens_per_second.png, assets/images/gpu_power_usage.png)]

Performance Hierarchy:

Mistral 7B leads on raw throughput, Mixtral 8x7B lands mid-tier, and the 70B+ models (Llama3.1 70B and Qwen2.5 72B) bring up the rear with token generation speeds comparable to each other.

Key Insights from Benchmarking:

The data reveals important performance characteristics that directly impact deployment decisions. Mistral 7B delivers exceptional speed while maintaining relatively low power consumption, making it ideal for applications requiring rapid response times. The large 70B+ parameter models show interesting power efficiency differences - Llama3.1 70B uses nearly 500W less GPU power than Qwen2.5 72B while delivering comparable token generation speed.

Perhaps most importantly, the performance-vs-power analysis shows that the relationship isn’t linear. Mixtral 8x7B, despite being a mixture-of-experts model with fewer active parameters, consumes the most power while delivering mid-tier performance, suggesting that model architecture significantly impacts hardware utilization on Apple Silicon.

Visualization and Analysis

Performance data is captured and visualized through detailed charts and metrics, enabling side-by-side comparison of models and data-driven decisions about which ones to deploy on a given machine.

The Complementary Ecosystem

These repositories work together to provide a complete local LLM solution:

  1. Deploy with LLM-Stack: Get a working environment with optimized performance
  2. Benchmark with LLM-Bench: Understand your system’s capabilities and model performance
  3. Optimize: Use benchmark data to select the best models for your use case
  4. Monitor: Leverage built-in monitoring to maintain optimal performance

This approach enables data-driven decisions about model selection, hardware utilization, and system scaling.
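
As a sketch of what the optimize step might look like, the same Ollama API can be queried in a loop to compare throughput across whatever models you have installed. The model tags below are illustrative, not a prescribed list:

for MODEL in mistral mixtral llama3.1:70b qwen2.5:72b; do
  RESP=$(curl -s http://localhost:11434/api/generate \
    -d "{\"model\": \"$MODEL\", \"prompt\": \"Summarize the benefits of local LLMs.\", \"stream\": false}")
  echo "$MODEL: $(echo "$RESP" | jq '.eval_count / (.eval_duration / 1e9)') tokens/sec"
done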

Apple Silicon Optimization

Both repositories are specifically optimized for Apple Silicon hardware, taking advantage of the unified memory architecture and Metal Performance Shaders acceleration that make local inference practical on these machines.
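
If you want to see those characteristics on your own machine, macOS ships the tools to inspect them. These are standard built-in commands shown purely for illustration, not part of either repository:

sysctl -n machdep.cpu.brand_string                      # e.g. "Apple M2 Ultra"
echo "$(( $(sysctl -n hw.memsize) / 1024 / 1024 / 1024 )) GB of unified memory"
system_profiler SPDisplaysDataType | grep Metal         # reports the supported Metal version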

Getting Started

The repositories are designed to work independently or together:

For a complete LLM environment:

git clone https://github.com/rjamestaylor/llm-stack
cd llm-stack
# Follow setup instructions for your environment

For performance benchmarking:

git clone https://github.com/rjamestaylor/llm-bench
cd llm-bench
# Run benchmarks against your Ollama installation

Future Directions

Both repositories are under active development, with further enhancements planned for each.

Conclusion

Local LLM deployment doesn’t have to be complex or poorly understood. By combining a robust deployment stack with comprehensive performance analysis, these tools enable confident local AI deployment with clear visibility into system behavior and capabilities.

Whether you’re running personal AI assistants, developing applications, or conducting research, having both reliable deployment and performance insight is essential for success. These repositories provide that foundation, specifically optimized for the growing ecosystem of Apple Silicon-based development and deployment.

The combination of native performance optimization and containerized operational simplicity represents a pragmatic approach to local AI infrastructure that balances performance, maintainability, and observability.

tags: ollama - metal-acceleration - benchmarking - local-ai - open-webui