By tomheiber 05/24/2023

Unlocking the Power of AI on Consumer Hardware - State of Local AI, May 2023


In our journey towards creating an AI assistant, we have been exploring what can be achieved on ordinary consumer hardware, at no additional cost and with no reliance on external APIs. The results have been surprising. Here's what we've accomplished.

Local Large Language Models (LLMs) on Consumer-grade Systems

With a home-grade computer with 128GB of RAM and a 24GB graphics card, running the latest version of Ubuntu, we have been able to run impressive AI models locally. Our research has mainly focused on open-source models that are licensed for commercial use, but we've also explored models based on Meta's LLaMA, available through platforms like HuggingFace and GitHub.

To make our research easier, we've referenced a list of commercially available local LLMs. You can find it here: Open local LLMs.

Our Performance Benchmark

Using the oobabooga text-generation-webui as our model loader and text generation UI, we have tested and successfully run the top 30 open-source models. Most of these models are built on the LLaMA model released by Meta, which is unfortunately not licensed for commercial use. We are waiting for OpenLM Research to finish training their 1T OpenLLaMA model, so that we can fine-tune it using datasets similar to those behind the models below:

  • WizardLM-30B-Uncensored-GGML: This model currently leads the open-source models we tested, with the top score in our deductive reasoning tests (15/26 passed) on zero-shot and multi-shot instruct queries.
  • WizardLM-Vicuna 13B: An excellent model for chat, creating action plans, and problem-solving (12/26 passed deductive reasoning tests). It generates about three tokens per second.
  • StarCoder 15.5B: With an 8K-token context window, this model can generate code in over 80 programming languages, find bugs, and complete fill-in-the-middle code tasks. We've further optimized it for Python generation and are considering its potential for the Rust programming language.
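
To make the deductive reasoning scores above easier to compare, they can be expressed as pass rates. The snippet below is a minimal sketch that uses only the two counts quoted in the list; the function name `pass_rate` is our own choice for illustration.

```python
def pass_rate(passed: int, total: int) -> float:
    """Fraction of test questions answered correctly."""
    if total <= 0:
        raise ValueError("total must be positive")
    return passed / total


# Counts quoted in the list above: (passed, total).
results = {
    "WizardLM-30B-Uncensored-GGML": (15, 26),
    "WizardLM-Vicuna 13B": (12, 26),
}

for model, (passed, total) in results.items():
    print(f"{model}: {passed}/{total} = {pass_rate(passed, total):.1%}")
```

This works out to roughly 57.7% for the 30B model and 46.2% for the 13B model.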

Model Size and Deductive Reasoning

Our findings show that models with fewer than 13B parameters, even when fine-tuned on several popular datasets, struggle to perform complex AI tasks. However, 13B models are beginning to show promising results, with some scoring as high as 98% of ChatGPT 3.5's result on deductive reasoning. Models with 30B parameters display more advanced reasoning skills, though they require significantly more time and cost to train and fine-tune.

In an effort to balance intelligence and size, we've tested models both at float16 precision and quantized down to int4. The quantized models are slightly less intelligent but considerably smaller, making them a viable option for systems with memory limitations.
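
The size difference follows from simple arithmetic: the memory needed for the weights is roughly the parameter count times the bits per weight. The sketch below is a back-of-the-envelope estimate, not a measurement. It assumes exactly 16 or 4 bits per weight and ignores the KV cache, activations, and the small per-block scaling data that real quantized formats store, so actual files are somewhat larger. The helper name `weight_memory_gib` is hypothetical.

```python
def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GiB."""
    bytes_total = n_params * bits_per_weight / 8
    return bytes_total / 2**30


# Parameter counts in billions and the two precisions discussed above.
for size_b in (13, 30):
    for name, bits in (("float16", 16), ("int4", 4)):
        gib = weight_memory_gib(size_b * 1e9, bits)
        print(f"{size_b}B @ {name}: ~{gib:.1f} GiB")
```

Under these assumptions a 13B model needs roughly 24 GiB at float16 and about 6 GiB at int4, which is why the quantized version fits comfortably on a 24GB graphics card while the float16 version does not.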

Despite significant progress, the capabilities of local LLMs remain somewhat limited for complex tasks. However, we anticipate the emergence of new 30B and 65B models that may change this.

Context Size

Context size, the amount of prior conversation (earlier output and user input) a model can take into account when generating a new response, is critical for AI assistants. Larger context sizes enable more natural conversations with the AI, rather than being limited to individual prompts. However, these advantages come with trade-offs, namely increased memory usage and longer response generation times.

For instance, a 13B model running entirely on a GPU can produce 3 tokens per second with a 500-token context but slows down to 0.5 tokens per second with a 2000-token context. Moreover, offloading some of the model's layers to system RAM reduces these speeds significantly.
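
To put those speeds in terms of waiting time, the sketch below converts tokens per second into seconds per reply. The two throughput figures come from the paragraph above; the 150-token reply length is an assumption chosen purely for illustration.

```python
def generation_seconds(n_tokens: int, tokens_per_second: float) -> float:
    """Time to generate n_tokens at a steady throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return n_tokens / tokens_per_second


REPLY_TOKENS = 150  # assumed reply length, not a measured value

# (context size in tokens, measured tokens per second) from the text above.
for context, tps in ((500, 3.0), (2000, 0.5)):
    seconds = generation_seconds(REPLY_TOKENS, tps)
    print(f"{context}-token context: {seconds:.0f} s for {REPLY_TOKENS} tokens")
```

With these numbers, the same reply takes 50 seconds at a 500-token context and 300 seconds at a 2000-token context.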

Censorship / Alignment

We believe that an AI model should be unbiased and capable of truthfully answering any question posed. To this end, we've developed a method to measure the degree of censorship in the models we test by posing controversial questions and recording the model's response. We've observed indications that uncensored models may possess a higher capacity for reasoning and problem-solving, although we've yet to independently verify this.
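
One simple way such a measurement could be automated is to count how many responses look like refusals. The snippet below is a hedged sketch of that idea and not the exact method behind our results: the marker phrases, `is_refusal`, and `refusal_rate` are all hypothetical, and keyword matching will miss refusals that are phrased differently.

```python
from typing import Sequence

# Hypothetical phrases that often signal a refusal; tune for your own models.
REFUSAL_MARKERS = (
    "i cannot",
    "i can't",
    "i am unable",
    "i'm sorry",
    "as an ai",
)


def is_refusal(response: str) -> bool:
    """Return True if the response contains any refusal marker."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_rate(responses: Sequence[str]) -> float:
    """Fraction of responses flagged as refusals (0.0 means none)."""
    if not responses:
        raise ValueError("responses must not be empty")
    return sum(is_refusal(r) for r in responses) / len(responses)


# Placeholder responses for illustration only; not real model output.
sample = [
    "I'm sorry, but I cannot help with that request.",
    "Here is a summary of the main arguments on both sides.",
]
print(f"refusal rate: {refusal_rate(sample):.0%}")
```

A human review of flagged and unflagged responses is still needed, since a model can decline politely without using any of the listed phrases.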


The field of LLMs is rapidly evolving, with new models and toolsets being released or updated daily, often breaking backward compatibility. This regularly compels us to revise or rebuild our tools to accommodate the changes.

Our Next Steps

We're moving forward with a clear vision: to train and fine-tune our models to surpass the deductive reasoning scores of GPT-3.5. We're also planning to integrate semantic (vector) databases for long-term memory storage. One of the intriguing ideas we're exploring is to fine-tune active models based on the day's activities, somewhat akin to dreaming.
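
The idea behind semantic long-term memory is to store past text as vectors and retrieve the entries closest to a new query. The toy sketch below illustrates only the retrieval step. It assumes a bag-of-words "embedding" and an in-memory list, whereas a real setup would use a learned embedding model and a vector database; `embed`, `cosine`, and `MemoryStore` are hypothetical names.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real system would use a learned model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class MemoryStore:
    """Minimal in-memory stand-in for a vector database."""

    def __init__(self) -> None:
        self._items = []  # list of (text, vector) pairs

    def add(self, text: str) -> None:
        self._items.append((text, embed(text)))

    def search(self, query: str, k: int = 1) -> list:
        """Return the k stored texts most similar to the query."""
        q = embed(query)
        ranked = sorted(self._items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


memory = MemoryStore()
memory.add("the user prefers rust for systems programming")
memory.add("the meeting is on friday at noon")
print(memory.search("when is the meeting"))
```

The retrieved entries would then be placed into the model's prompt, which lets the assistant recall facts that no longer fit in its context window.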

Further, we're looking at text-to-speech models that can operate alongside local conversational models, paving the way for a more immersive AI experience.