Supervised Fine-tuning of a Hugging Face LLM

Supervised Fine-tuning (SFT) is used to teach a model new behaviors and skills. SFT and RAG are the two main ways to customize an LLM, although newer approaches are also emerging.
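To give a feel for what SFT involves, here is a minimal sketch using the TRL library's SFTTrainer. The model and dataset below are placeholders I picked for illustration, and the exact API details vary across trl versions:

```python
# A minimal SFT sketch using the TRL library. The model and dataset are
# placeholders, and the SFTConfig/SFTTrainer API varies across trl versions.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# A tiny slice of a public dataset, just to have something to train on.
dataset = load_dataset("imdb", split="train[:1%]")

config = SFTConfig(
    output_dir="./sft-output",
    max_steps=50,
    per_device_train_batch_size=2,
)

trainer = SFTTrainer(
    model="facebook/opt-350m",  # a small model keeps the experiment cheap
    train_dataset=dataset,
    args=config,
)
trainer.train()
```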

A great big thanks to this blog post, which helped me write this article.

Continue reading

Use Quantization with Hugging Face Models

Most LLMs today store their parameters (weights) as 32-bit or 16-bit floating point numbers. Because of the sheer number of parameters involved, working with these models takes a lot of GPU VRAM and regular RAM. Quantization algorithms convert the parameters to 8-bit or 4-bit representations, thereby reducing the memory and computational overhead. There’s some loss in quality, but not as much as you might think; certainly for research and experimentation quantization can be quite useful. In this article we will learn how to perform quantization on Hugging Face models.
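As a taste of what is ahead, here is a sketch of loading a model in 4-bit using bitsandbytes through the transformers API. It assumes a CUDA GPU with the bitsandbytes and accelerate packages installed, and the model name is just an example:

```python
# A sketch of 4-bit quantized loading with bitsandbytes via transformers.
# Assumes a CUDA GPU plus the bitsandbytes and accelerate packages.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for matmuls at runtime
)

model_id = "mistralai/Mistral-7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```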

Continue reading

Do RAG from Scratch

These days you’d probably use a library like LangChain or LlamaIndex to write a RAG application. But there’s nothing special in these libraries that you cannot easily do yourself. In fact, the level of abstraction in these libraries is so high that you lose visibility and control over the low-level operations. In this article I will show you how to do RAG from scratch, without using these libraries.
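For a flavor of the from-scratch approach, here is a bare-bones retrieval step using only sentence-transformers and NumPy. The documents are made up for illustration:

```python
# A bare-bones sketch of the retrieval step, with no RAG framework involved.
# Assumes sentence-transformers is installed; the documents are made up.
import numpy as np
from sentence_transformers import SentenceTransformer

docs = [
    "Chroma is an open source vector database.",
    "Mistral 7B is an open weight language model with 7 billion parameters.",
    "PyTorch is a deep learning framework.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = model.encode(docs, normalize_embeddings=True)

query = "Which model has 7 billion parameters?"
q_vec = model.encode([query], normalize_embeddings=True)[0]

# With normalized vectors, the dot product equals the cosine similarity.
scores = doc_vecs @ q_vec
best = docs[int(np.argmax(scores))]

# Augmentation: stuff the retrieved context into the prompt, which would then
# be sent to an LLM for the generation step.
prompt = f"Answer using this context:\n{best}\n\nQuestion: {query}"
print(prompt)
```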

Continue reading

Write a Custom Embedding Function for Chroma DB

An embedding function is used by a vector database to calculate the embedding vectors of the documents and the query text. The database then calculates the distance between these vectors. Chroma uses all-MiniLM-L6-v2 as the default sentence embedding model and provides many popular embedding functions out of the box. At the time of this writing there is no function available to run a Hugging Face sentence embedding model locally, so we will develop one.
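As a preview, here is a sketch of what such a function looks like. The class name is mine, and Chroma's EmbeddingFunction interface may differ slightly across versions:

```python
# A sketch of a custom Chroma embedding function that runs a sentence
# embedding model locally. The class name is illustrative, and Chroma's
# EmbeddingFunction interface may differ slightly across versions.
from chromadb import Documents, EmbeddingFunction, Embeddings
from sentence_transformers import SentenceTransformer

class LocalSentenceEmbedding(EmbeddingFunction):
    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self._model = SentenceTransformer(model_name)

    def __call__(self, input: Documents) -> Embeddings:
        # Chroma expects plain Python lists, one vector per document.
        return self._model.encode(input).tolist()
```

An instance of this class can then be passed as the embedding_function argument when creating a collection.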

Continue reading

Get to Know Vector Databases

Vector databases specialize in rapidly looking up the text documents that are most relevant to a piece of query text. This is done by calculating the distance between the embeddings of the query text and the documents; a low distance indicates high relevance. Various distance metrics exist, such as squared L2, inner product, and cosine distance. This lookup operation forms the Retrieval part of RAG (Retrieval, Augmentation and Generation). In this article we’ll build a solid foundation on how these databases work. We’ll use Chroma DB as an example.
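Here is a minimal Chroma session to illustrate the idea. The documents and IDs are made up; by default Chroma uses squared L2 distance, and cosine or inner product can be selected per collection:

```python
# A minimal Chroma example of distance-based lookup. Assumes the chromadb
# package is installed; documents and IDs are made up.
import chromadb

client = chromadb.Client()  # in-memory instance, nothing is persisted
collection = client.create_collection("articles")

collection.add(
    ids=["d1", "d2", "d3"],
    documents=[
        "Quantization reduces the memory footprint of a model.",
        "Vector databases retrieve documents by embedding distance.",
        "Mistral 7B is an open weight LLM.",
    ],
)

# The document with the lowest distance to the query comes back first.
results = collection.query(
    query_texts=["How do I find relevant documents?"],
    n_results=1,
)
print(results["documents"], results["distances"])
```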

Continue reading

RAG Using LlamaIndex and Hugging Face Models

LlamaIndex is a framework for running RAG against an LLM. It has a built-in vector store, which makes it easy to build a proof of concept without having to install an actual vector database. In this article we will learn how to use LlamaIndex to do RAG with a model from Hugging Face. We will use the mistralai/Mistral-7B-Instruct-v0.1 model.
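For a quick preview, here is a sketch of the setup. The import paths follow recent llama-index releases (with the llama-index-llms-huggingface and llama-index-embeddings-huggingface packages) and may differ in older versions; the embedding model and the ./data folder are illustrative choices:

```python
# A sketch of RAG with LlamaIndex and local Hugging Face models. Import paths
# follow recent llama-index releases and may differ in older versions.
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.huggingface import HuggingFaceLLM

Settings.llm = HuggingFaceLLM(
    model_name="mistralai/Mistral-7B-Instruct-v0.1",
    tokenizer_name="mistralai/Mistral-7B-Instruct-v0.1",
)
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Index the documents using the built-in in-memory vector store.
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)

response = index.as_query_engine().query("What is this document about?")
print(response)
```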

If you have never used a Hugging Face model locally, I strongly recommend that you read one of my earlier posts.

Continue reading

Start Using Mistral from Hugging Face

Hugging Face is a friendly giant of a platform and library that makes it easier than ever to experiment with various open source models. Today I will discuss how to use the Mistral 7B model to run a question/answer application. Earlier I posted an article on setting up a Windows 11 machine with PyTorch. The Hugging Face library requires either PyTorch or TensorFlow, so please follow that article to set up PyTorch (it is a lot easier to set up PyTorch on Windows than TensorFlow, which requires WSL these days).
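To show where this is headed, here is a minimal sketch using the transformers pipeline API. It assumes a GPU with enough VRAM (or pair it with the quantization post above), plus the accelerate package for device_map:

```python
# A minimal question/answer sketch with Mistral 7B via the transformers
# pipeline API. Assumes a GPU with enough VRAM and the accelerate package.
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.1",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Mistral's instruct models expect the [INST] ... [/INST] prompt format.
prompt = "<s>[INST] What is quantization in the context of LLMs? [/INST]"
result = pipe(prompt, max_new_tokens=128, do_sample=False)
print(result[0]["generated_text"])
```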

Continue reading