by Robert Taylor
Running large language models locally has become increasingly practical, especially on Apple Silicon hardware with its unified memory architecture and Metal Performance Shaders acceleration. However, setting up a reliable local LLM environment and understanding its performance characteristics requires the right tooling. Today I’m sharing two complementary repositories that address these needs: a complete deployment stack and a comprehensive benchmarking suite.
While cloud-based LLM services offer convenience, they come with inherent limitations: API costs, data privacy concerns, network dependencies, and usage restrictions. Local deployment solves these issues but introduces new challenges around setup complexity, performance optimization, and system monitoring.
The ideal local LLM environment should be:
LLM-Stack provides a turnkey solution for running a complete local LLM environment. The stack combines the power of Ollama’s native Metal acceleration with Docker-based Open WebUI, creating a hybrid approach that maximizes both performance and maintainability.
The stack leverages a mixed deployment strategy:
Native Components (Ollama):
Containerized Components (Open WebUI):
This hybrid approach avoids the performance penalties of containerizing the LLM runtime while maintaining the operational benefits of containerization for the web interface and supporting services.
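As a rough illustration of the containerized half (the exact image tag, ports, and volume names used by llm-stack may differ), Open WebUI can be pointed at the host's native Ollama server through its OLLAMA_BASE_URL setting, with host.docker.internal resolving to the macOS host from inside the container:
# Sketch: run Open WebUI in Docker against the natively installed Ollama on the host
docker run -d \
  --name open-webui \
  -p 3000:8080 \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  -v open-webui:/app/backend/data \
  --restart always \
  ghcr.io/open-webui/open-webui:main
Ollama itself stays outside Docker (installed via the official macOS package or Homebrew), so inference runs directly on Metal rather than inside a Linux VM.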
Understanding how different models perform on your specific hardware is crucial for making informed deployment decisions. LLM-Bench provides comprehensive benchmarking capabilities specifically designed for Apple Silicon systems running Ollama.
The benchmarking suite measures key performance indicators:
Token Generation Speed:
Memory Utilization:
CPU and GPU Utilization:
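The suite's own collection code isn't reproduced here, but as a minimal sketch of where throughput numbers come from: a non-streaming call to Ollama's /api/generate endpoint returns eval_count and eval_duration (in nanoseconds), and tokens per second is simply their ratio. The model name and prompt below are placeholders:
# Sketch: one-off throughput measurement against a local Ollama instance
curl -s http://localhost:11434/api/generate \
  -d '{"model": "mistral:7b", "prompt": "Explain unified memory briefly.", "stream": false}' \
  | jq '{model, tokens: .eval_count, seconds: (.eval_duration / 1e9), tokens_per_sec: (.eval_count / (.eval_duration / 1e9))}'
The same response also carries prompt_eval_count, prompt_eval_duration, and load_duration, which cover prompt-processing speed and cold-start load time.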
To demonstrate the practical value of systematic benchmarking, here are results from testing on a Mac Studio with an M2 Ultra and 128GB of RAM, a reference platform representing high-end Apple Silicon deployment:
Performance Hierarchy:
Key Insights from Benchmarking:
The data reveals important performance characteristics that directly impact deployment decisions. Mistral 7B delivers exceptional speed while maintaining relatively low power consumption, making it ideal for applications requiring rapid response times. The large 70B+ parameter models show notable power-efficiency differences: Llama3.1 70B uses nearly 500W less GPU power than Qwen2.5 72B while delivering comparable token generation speed.
Perhaps most importantly, the performance-vs-power analysis shows that the relationship isn’t linear. Mixtral 8x7B, despite being a mixture-of-experts model with fewer active parameters, consumes the most power while delivering mid-tier performance, suggesting that model architecture significantly impacts hardware utilization on Apple Silicon.
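How llm-bench captures its power figures is documented in the repository; as a generic way to observe the same thing, macOS ships a powermetrics sampler (root required) that reports Apple Silicon GPU power in milliwatts while a benchmark is running. Output format varies slightly across macOS versions:
# Sketch: sample GPU power once per second for 30 samples during a benchmark run
sudo powermetrics --samplers gpu_power -i 1000 -n 30 | grep -i "GPU Power"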
Performance data is captured and visualized through detailed charts and metrics, enabling:
These repositories work together to provide a complete local LLM solution:
This approach enables data-driven decisions about model selection, hardware utilization, and system scaling.
Both repositories are specifically optimized for Apple Silicon hardware, taking advantage of:
The repositories are designed to work independently or together:
For a complete LLM environment:
git clone https://github.com/rjamestaylor/llm-stack
cd llm-stack
# Follow setup instructions for your environment
For performance benchmarking:
git clone https://github.com/rjamestaylor/llm-bench
cd llm-bench
# Run benchmarks against your Ollama installation
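Before benchmarking, it's also worth confirming that a model is actually GPU-resident rather than spilling to CPU. Assuming a reasonably recent Ollama release, ollama ps reports this per loaded model; the model tag below is just an example:
# Sketch: load a model, then check the PROCESSOR column ("100% GPU" means fully Metal-resident)
ollama run mistral:7b "hello" > /dev/null
ollama ps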
Both repositories are actively developed with planned enhancements including:
Local LLM deployment doesn’t have to be complex or poorly understood. By combining a robust deployment stack with comprehensive performance analysis, these tools enable confident local AI deployment with clear visibility into system behavior and capabilities.
Whether you’re running personal AI assistants, developing applications, or conducting research, having both reliable deployment and performance insight is essential for success. These repositories provide that foundation, specifically optimized for the growing ecosystem of Apple Silicon-based development and deployment.
The combination of native performance optimization and containerized operational simplicity represents a pragmatic approach to local AI infrastructure that balances performance, maintainability, and observability.
tags: ollama - metal-acceleration - benchmarking - local-ai - open-webui