How to run LLMs locally on your machine

Apr 06, 2024#AI#ML#LLM

Large Language Models (LLMs) are advanced AI systems capable of understanding and generating human language. They are trained on vast amounts of text data, which enables them to perform a wide range of natural language processing tasks such as text generation, translation, summarization, and question answering.

LLMs work by predicting the next word or token in a sequence, making them useful for applications like chatbots, virtual assistants, content creation, and more. They are a significant breakthrough in the field of artificial intelligence and continue to evolve, offering more sophisticated interactions and capabilities.

Some of the most popular LLMs are GPT, Gemini, PaLM, Llama, and Claude.

Local deployment provides developers with greater customization and control over the LLM environment. They can fine-tune parameters, configure models according to specific requirements, and integrate with other local tools and libraries as needed.

Quantization is a key technique for optimizing models for running in device with limited memory and computational power. It can also accelerate inference times without a substantial loss in accuracy. Quantized models are a form of machine learning models that use quantization techniques to reduce the precision of the computations and storage of tensors.

Here are some options for platforms and frameworks to run LLMs locally:


Llama.cpp (54.5k ⭐) is a project that enables the inference of various LLMs in pure C/C++. It was developed by Georgi Gerganov and is designed to be lightweight, with minimal setup and high performance. It supports multiple backends like Apple silicon, AVX, CUDA, Vulkan, SYCL, and OpenCL, and can work with models such as LLaMA, Mistral, Falcon, and MPT.

The project is notable for its efficiency and portability, making it easier to integrate LLMs into different programming environments. It’s particularly useful for applications that require quick response times or run on systems with limited computing resources.


Ollama (53k ⭐) is a platform designed to run open-source LLMs locally on your machine. It simplifies the process by bundling model weights, configuration, and data into a single package defined by a Modelfile. The platform supports various models, such as Llama 2, Mistral, Gemma, and others, and provides a command-line interface (CLI), a REST API, and web and desktop integrations for ease of use.

Additionally, Ollama is built on top of llama.cpp, a C++ library that enables running models on CPUs or GPUs, making it efficient for local operations. It’s particularly useful for developers and researchers who need to run and customize LLMs without relying on cloud services.

LM Studio

LM Studio is a platform that allows you to discover, download, and run local LLMs on your computer. It provides a Mac / Windows / Linux application, a JSON configuration file format, and a model catalog repository.

The platform features a browser to search and download LLMs from sources like Hugging Face, an in-app Chat UI, and a runtime for a local server compatible with the OpenAI API. It’s designed to make the use of Text AI super easy for users.


SillyTavern(5.6k ⭐) is a user interface that you can install on your computer or Android phone, allowing you to interact with text generation AIs and chat or roleplay with characters created by you or the community.

It’s a fork of TavernAI 1.2.8, under more active development, and has added many new features. SillyTavern supports multiple languages and offers a mobile-friendly layout, multi-API support, customizable UI, auto-translate, and more prompt options. It also allows for the installation of third-party extensions.


GPT4ALL(63.3k ⭐) is an open-source ecosystem designed to run powerful and customized LLMs locally on consumer-grade CPUs and any GPU. It allows users to download, train, deploy, and interact with LLMs on various platforms and devices.

The project aims to make it easier for individuals and enterprises to utilize LLMs without needing extensive infrastructure or technical know-how.


Llamafile (12.2k ⭐) is a framework that allows you to distribute and run LLMs with a single executable file. It combines the model weights and the necessary code to run the model on your computer, simplifying the process of using LLMs. It’s designed to work on most computers with no installation required, and it supports various operating systems including Linux, macOS, and Windows.


LangChain is a framework that simplifies the development of applications powered by LLMs. It provides tools and APIs to assist developers through every stage of the LLM application lifecycle, including development, productionization, and deployment.

LangChain includes components like langchain-core for base abstractions, langchain-community for third-party integrations, and langserve for deploying chains as REST APIs. It’s designed to help developers build and deploy reliable GenAI apps faster.

Hugging Face Transformers

The Transformers library provides a convenient way to download and run pre-trained models, including LLMs, on your local machine. It supports autoregressive generation, which is essential for text generation with LLMs, and it can be executed on a GPU for better performance.