Best Intro to LLMs and Llama 2 Code

Let's see what we need to build an LLM

First of all, you need lots and lots of data. Much of it comes from crawling the internet with specialized programs called crawlers, which collect text from as many webpages as they can reach. It also includes things like all of Wikipedia, the text of many books, and a lot more. So we're talking about huge amounts of data.

Plus, you need a transformer architecture. We'll talk a bit more about this further down, but that's an integral component of a large language model.

Then you need to perform pre-training. That is when you take the model architecture and train it on the data you've collected, specifically to predict the next word. You give it lots of different sentences and ask it to predict the next word, then you check whether it predicted correctly and adjust the model so it gets better and better at the task. This is done on GPUs, and you need lots of them running in parallel to make it faster. You also need lots of money: tens or even hundreds of millions of dollars to train a large language model. And it takes lots of time: weeks or even months of training.

And then once it's pre-trained, you can apply things like reinforcement learning from human feedback (RLHF) to help it better understand what responses humans actually expect. Basically, humans check the responses it gives, rate them, and show sample responses that it should be giving.

And then, finally, you can also fine-tune it on domain-specific data for your specific business use case. This is the part that's most interesting to businesses. You might have medical data, financial data, film data, legal data, or an internal knowledge base where your employees normally go to find out how to do certain things in your company. You could fine-tune a large language model on any one of those datasets so that your employees or your customers get better answers straight from the large language model, without having to look them up themselves.

Who invented the LLM?

It all started in 2017, when a team at Google published the paper "Attention Is All You Need".

They introduced the Transformer architecture, which looks like this

It has an encoder and a decoder, and it was designed for translation tasks (machine translation from one language to another).

About a year later, OpenAI researchers showed that if you take this part, which is the encoder, and throw it out, you're left with a decoder-only model architecture.

And this architecture is very good at language generation. And this is exactly the architecture that is used in the most powerful large language models, such as ChatGPT, LLaMA and Claude.

How do they generate texts?

Assume we have provided the input "What is the tallest mountain?" to ChatGPT.

As output, it will produce just one most likely next word: "The".

How does the large language model know the most likely next word?

Well, it's seen lots and lots of text on the internet, text from lots and lots of books, text from Wikipedia and so on. So it's seen lots of different scenarios of how words follow each other. And even though it doesn't know this exact sentence by heart, it does know how to predict what is likely to come next based on everything that it's seen. The most likely next word in this case is the word "The".

Just keep in mind, the most important thing here is that it generates just one word at a time. (Strictly speaking, it generates one token, which can be a whole word or a piece of a word.)

Next, it takes all of the input we gave it as a prompt, plus whatever has been generated so far, which is the word "The", and all of that goes into the input of the large language model.

As the output, it again produces the most likely next word for this new sequence, including the word it has just generated. Now we've got the word "tallest".

Again, it takes everything that we have so far plus the new words, "The tallest", and all of that goes in as input.

The mathematics is carried out, and as the output it produces the most likely next word, which in this case is "mountain".

And it'll keep doing that, taking everything we have as input and producing just one word at a time until it gets to a point where instead of producing the most likely next word, it produces what is called the end of sequence token.

And that's when it'll stop and that's where we will get this output to our prompt.
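The loop described above can be sketched in a few lines of Python. This is only a toy: the lookup table `NEXT_WORD`, the `<eos>` marker and the continuation of the answer are made up for illustration, whereas a real LLM computes the next word with a neural network.

```python
# Toy "model": maps the text so far to one most likely next word.
# The table is hand-written for illustration only (an assumption,
# not how a real LLM stores knowledge).
EOS = "<eos>"  # hypothetical end-of-sequence marker

NEXT_WORD = {
    "What is the tallest mountain?": "The",
    "What is the tallest mountain? The": "tallest",
    "What is the tallest mountain? The tallest": "mountain",
    "What is the tallest mountain? The tallest mountain": "is",
    "What is the tallest mountain? The tallest mountain is": "Mount",
    "What is the tallest mountain? The tallest mountain is Mount": "Everest.",
}


def generate(prompt, max_words=20):
    """Generate one word at a time until the end-of-sequence marker."""
    text = prompt
    output = []
    for _ in range(max_words):
        # The whole sequence so far (prompt + generated words) is the input.
        word = NEXT_WORD.get(text, EOS)
        if word == EOS:
            break
        output.append(word)
        text = text + " " + word
    return " ".join(output)


answer = generate("What is the tallest mountain?")
print(answer)
```

Notice that every step feeds the full text back in, which is why generation gets slower as the answer grows.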

Inside an LLM

Now, inside a large language model, we've got our decoder-only architecture, and we're going to go through the modules one by one from bottom to top. We won't go into mathematical detail, we'll just understand what they're there for.

So the purpose of this first module is to convert words into vectors, where a vector is just a series of numbers. It could be three numbers separated by commas. It could be a hundred numbers separated by commas.

The original Transformer used vectors of 512 numbers. Today's large language models typically use thousands (Llama 2 7B, for example, uses 4,096 numbers in each vector).

Let's have a look at this example, an equation where we're subtracting words from and adding words to each other. Imagine king minus man plus woman. What do you think this equation equals?

Well, even though we are doing arithmetic with words rather than numbers, it's quite intuitively queen. This gives us a hint that there is something about the meaning of words, the essence of words, that we can capture with numbers.

And this module is labeled "output" because all of the output from the transformer goes back into the input.
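To make the word arithmetic concrete, here is a tiny sketch with hand-made vectors. The two numbers per word (roughly "royalty" and "maleness") and their values are assumptions chosen for illustration; real embeddings are learned and have hundreds or thousands of dimensions.

```python
import math

# Hand-made 2-number vectors: (royalty, maleness). Illustrative values only.
VECTORS = {
    "king": (0.9, 0.9),
    "queen": (0.9, 0.1),
    "man": (0.1, 0.9),
    "woman": (0.1, 0.1),
}


def cosine(a, b):
    """Cosine similarity: 1.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


# king - man + woman, computed number by number.
target = tuple(
    k - m + w
    for k, m, w in zip(VECTORS["king"], VECTORS["man"], VECTORS["woman"])
)

# The word whose vector is closest to the result.
nearest = max(VECTORS, key=lambda word: cosine(VECTORS[word], target))
print(nearest)
```

With these toy numbers the closest word comes out as queen, which is the intuition the equation is meant to convey.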

The next module is positional encoding, and its purpose is to add positional information. Look at this sentence, for example: "horses eat apples". It's a completely grammatically correct sentence which makes a lot of sense.

But if you reorder the words (for example, "apples eat horses"), you get a sentence which is also grammatically correct and has the exact same words, but it has a very different meaning and, in fact, it's nonsensical.

That shows us that the order of words is very important in a sentence. So this module helps the large language model remember the order in which the words originally came in.
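As a rough sketch of how position can be turned into numbers, here is the sinusoidal scheme from the original Transformer paper. It is shown only as one possible approach: Llama 2 uses a different scheme (rotary embeddings), and the tiny dimension of 4 is an assumption to keep the output readable.

```python
import math


def positional_encoding(num_positions, dim):
    """Sinusoidal positional encoding (original Transformer paper).

    Even indices use sine, odd indices use cosine, and each pair of
    indices shares one wavelength.
    """
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(dim):
            angle = pos / (10000 ** ((i - i % 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


words = ["horses", "eat", "apples"]
pe = positional_encoding(len(words), dim=4)
for word, row in zip(words, pe):
    print(word, [round(value, 3) for value in row])
```

Each position gets a different row, so when these rows are added to the word vectors, "horses" in first place no longer looks the same as "horses" in third place.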

The next module is the heart of the Transformer. It's called masked multi-head attention, and its purpose is to capture contextual information.

Let's have a look at an example sentence: "The dog didn't cross the street because it was too tired."

What do you think the word "it" refers to? Does it refer to the word "dog" or to the word "street"? Well, it's quite intuitively "dog", because the dog was too tired.

But if we make a small adjustment to this sentence and replace the word "tired" with "wide", we get a different sentence: "The dog didn't cross the street because it was too wide." All of a sudden the meaning of the word "it" has changed. Now it's referring to the word "street".

This shows us that the meaning of individual words in a sentence can be affected by other words in that sentence. In fact, they are affected by words across the sentence, the paragraph, or even the whole text, and that is called contextual meaning.

So semantic meaning is the dictionary meaning of a word, and it's important, but equally important is contextual meaning: what that word is doing in a sentence and how other words are affecting it.
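For the curious, here is a minimal sketch of the "masked" part of attention, assuming a single head and skipping the learned projection matrices that a real model has. The three input vectors are made-up numbers. Each position mixes information only from itself and earlier positions, which is what the mask enforces.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask."""
    dim = len(queries[0])
    outputs, all_weights = [], []
    for i, query in enumerate(queries):
        # Masking: position i may only look at positions 0..i.
        scores = [
            sum(a * b for a, b in zip(query, keys[j])) / math.sqrt(dim)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        weights += [0.0] * (len(queries) - i - 1)  # future positions get 0
        output = [
            sum(w * value[d] for w, value in zip(weights, values))
            for d in range(len(values[0]))
        ]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights


# Three toy word vectors (made-up numbers).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_attention(x, x, x)
for row in weights:
    print([round(w, 3) for w in row])
```

The first row of weights puts everything on the first word, because nothing comes before it. Later rows spread their weight over all the words seen so far.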

The next module is a feedforward neural network, and basically it increases learning capacity. It's just a small neural network with two layers and an activation function between them.

Effectively it adds an extra sprinkle of deep learning to the large language model, allowing it to learn even more complex relationships in the sentences and words it's looking at.
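A two-layer feedforward network is small enough to write out by hand. In this sketch the weights are made-up numbers and ReLU is an assumed choice of activation function (it is the one used in the original Transformer; other models use other activations).

```python
def relu(x):
    """Activation function: negative values become zero."""
    return max(0.0, x)


def feed_forward(vector, w1, b1, w2, b2):
    """Two linear layers with an activation function in between."""
    hidden = [
        relu(sum(x * w for x, w in zip(vector, row)) + bias)
        for row, bias in zip(w1, b1)
    ]
    return [
        sum(h * w for h, w in zip(hidden, row)) + bias
        for row, bias in zip(w2, b2)
    ]


# Made-up weights: 2 inputs -> 3 hidden neurons -> 2 outputs.
w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 1.0, 1.0], [2.0, 0.0, 0.0]]
b2 = [0.0, 0.5]

result = feed_forward([1.0, -2.0], w1, b1, w2, b2)
print(result)
```

In a real model the hidden layer is usually several times wider than the input vector, and the weights are learned during training.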

Finally, the output module produces the result. It takes every word the model knows (picture all of the words of the English language, 200,000-plus of them in total) and gives each one a probability score. In practice the list is a fixed vocabulary of tokens; Llama 2, for example, has 32,000 of them.

The word with the highest probability, let's say in this case it's "Abundance", will be selected as the next word. And as we discussed previously, large language models output just one word at a time and then repeat that same process.
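Here is a small sketch of that last step. The four-word vocabulary and the raw scores are invented for illustration; the softmax function turns scores into probabilities, and we pick the highest one.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up vocabulary and raw scores, for illustration only.
vocabulary = ["Abundance", "apple", "the", "zebra"]
scores = [3.2, 1.0, 2.1, -0.5]

probabilities = softmax(scores)
next_word = vocabulary[probabilities.index(max(probabilities))]

for word, probability in zip(vocabulary, probabilities):
    print(f"{word}: {probability:.3f}")
print("next word:", next_word)
```

Always taking the top word is the simplest strategy. Chat applications often sample from the probabilities instead, which is why the same prompt can give different answers.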

LLM Size

You've probably heard people say this large language model has one billion parameters, or this large language model is even bigger and has 10 billion parameters, and this large language model is even bigger and has 100 billion parameters and so on.

Well, when you look at your large language model architecture, which looks like this, you see the decoder-only architecture from the transformer. In reality, when all of this is coded inside a computer, it's actually a neural network.

Note: this isn't the exact neural network diagram for the large language model we're discussing. It's just meant to illustrate a neural network.

It's a series of neurons all interconnected, and information flows from left to right to get from the input to the output, and then along the way, it's adjusted. And how is it adjusted? Well, it's adjusted with these weights.

So for example, this neuron on the right is connected to these two neurons with weight A and weight B. The values coming from the left are multiplied by these weights, and then they're added together to give the value of this next neuron.

And effectively, these are parameters (Weights A, B).

When people say there are one billion parameters, what they're talking about, for the most part, is that there are one billion weights in this neural network, which is the large language model.

So every large language model is actually a neural network in the background.

When its size is 10 billion parameters, it's 10 times bigger: 10 billion connections between these neurons, and so on. And that's what parameters inside a large language model actually mean.
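To see where the parameter count comes from, here is a small sketch for a plain fully connected network. The layer sizes are made up, and counting one bias per neuron alongside the weights is an assumption about the network (many networks have them, which is why weights are only "most" of the parameters).

```python
def neuron(inputs, weights):
    """One neuron: multiply each input by its weight and add them up."""
    return sum(x * w for x, w in zip(inputs, weights))


def count_parameters(layer_sizes):
    """Weights (one per connection) plus one bias per neuron, per layer."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


# Two inputs with weight A = 0.5 and weight B = -1.0 (made-up values).
value = neuron([2.0, 3.0], [0.5, -1.0])
print(value)  # 2.0 * 0.5 + 3.0 * -1.0

# A tiny network: 2 inputs -> 3 hidden neurons -> 1 output.
total = count_parameters([2, 3, 1])
print(total)  # (2*3 + 3) + (3*1 + 1)
```

Scale the layer sizes up to thousands of neurons and stack dozens of layers, and the count quickly reaches billions.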

LLM Context Window

Assume this is a conversation between a human and an LLM.

Everything within the context window, the large language model can still recall and use in its further responses to the human prompts.

Let's check the example

Assume we have two windows open from the same account.

In the left one, we asked "describe Montserrat".

It's going to tell us about the island in the Caribbean called Montserrat.

Now we're going to ask a similar question in the right one. But before we ask it to describe Montserrat, we're going to ask it to describe Times New Roman, which is a font.

Now we're going to give it exactly the same prompt, "describe Montserrat", and see what happens. In this case, as you can see, it's telling us about Montserrat, a geometric sans-serif typeface. So it's telling us about the font Montserrat.

So even though the prompt was exactly the same in both windows, the difference is that in the left window we didn't have any previous context. So ChatGPT just defaulted to telling us about the island.

But on the right we have previous context from the conversation, where we were talking about fonts.

ChatGPT knows that, and it assumes that we also want to know about the font Montserrat here.
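One way to picture the context window is as a budget on how much of the conversation still fits. The sketch below is a simplification under stated assumptions: it counts words rather than tokens, the function name `visible_context` is hypothetical, and the assistant's reply text is made up.

```python
def visible_context(messages, max_words):
    """Keep the most recent messages that still fit in the window."""
    kept, used = [], 0
    for message in reversed(messages):
        size = len(message.split())  # real models count tokens, not words
        if used + size > max_words:
            break
        kept.append(message)
        used += size
    return list(reversed(kept))


conversation = [
    "describe Times New Roman",
    "Times New Roman is a serif typeface.",
    "describe Montserrat",
]

# Large window: the font discussion is still visible to the model.
print(visible_context(conversation, max_words=50))

# Tiny window: only the last prompt fits, so the font context is lost.
print(visible_context(conversation, max_words=5))
```

With the large window the model can still see that we were talking about fonts. With the tiny one, it only sees "describe Montserrat" and has no reason to prefer the typeface over the island.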

Fine tuning models

So let's say you have a pre-trained large language model, and basically that's a large language model that has been trained on lots and lots of texts on the internet, that has undergone that rigorous training, which is time-consuming, expensive, requires lots of GPUs.

So all of that is done, and this pre-trained LLM is able to understand text and is able to generate the most likely next word very well.

Now at the same time, it's not specialized in anything. It's just a very good generalist at working with language.

What you might want to do is fine-tune it on your specific business data. Let's say you are a hospital with lots of medical data, you need to help your practitioners answer questions better and quicker, and you want to fine-tune the large language model to help them do that. Well, you would train it on all of that medical-specific data about different medical terms and different medical conditions, and then the large language model would be an expert at that.

You could fine-tune the large language model on financial data, to better understand the stock market, so it better understands financial concepts and helps your teams better navigate these areas.

You could train it on movie data so it better understands movies. It's better able to help you brainstorm ideas for scripts or provide critiques of movies, and recall past movies in more detail and depth and specific information that's not generally available or generally known otherwise.

My favorite way of thinking about a fine-tuned LLM is this: your general pre-trained large language model is like an orchestra. The musicians have spent years and years practicing how to play their instruments and how to play together.

Now they're given a symphony, and they're specializing in this specific symphony.

They're going to be playing the symphony.

So the orchestra is like your pre-trained LLM, and the orchestra with the symphony is like a fine-tuned LLM.

Let's fine-tune an LLM

This is Hugging Face, and here we have a bunch of models to work with. We will take a pre-trained LLM, one of the Llama 2 models by Meta.

Also, we have datasets

We will be using this model: aboonaji/llama2finetune-v2

We're gonna retrain this particular model in order to do some knowledge augmentation. This specific model has a lot of general knowledge. It can chat with you about many different topics, but not about very specific topics, like some very specific medical terms. That's exactly the knowledge we'll be adding to this LLM, so that the same LLM can also talk with us about these very specific medical terms.

So, this is going to be our dataset

Here you can see explanations of various medical terms

So that's the source dataset, but we can't use it as it is to retrain and fine-tune our model and add this extra layer of knowledge containing all these medical terms.

What we have to do is process this dataset so that it has the right format expected by our Llama 2 model.

But the good news is we won't have to format it ourselves, because another author on Hugging Face has already formatted it.

You will find the same data there, but in the right format.

It is called wiki_medical_terms_llama2_format, meaning the wiki medical terms dataset that we have here, but reformatted to match the format expected by Llama 2.

If we zoom in,

It starts with <s>, which marks the start of the sequence,

then [INST], which marks the start of the instruction,

then <<SYS>>, which marks the start of the system prompt.

This part is actually optional. The system prompt is just some extra guidance that you give to your LLM, which here is exactly this:

"You are a helpful, respectful, and honest assistant. Always answer as helpfully as possible while being safe. And also ...........please don't share false information."

The system prompt ends with <</SYS>>. Again, the opening and closing system prompt tags, and the system prompt between them, are an optional element of the format.

After that we have the question, i.e. the prompt,

then the end of the instruction, marked with a slash in square brackets, [/INST].

Then the expected output starts, and it ends with </s>.
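If you ever need to produce this format yourself, a small helper along these lines could do it. The function name `format_llama2_example` is hypothetical, and the exact spacing and line breaks are an assumption based on the tags described above, so compare against a row of the real dataset before relying on it.

```python
def format_llama2_example(question, answer, system_prompt=None):
    """Wrap a question/answer pair in the Llama 2 tags described above.

    Spacing and newlines are assumed, not taken from the dataset.
    """
    if system_prompt:
        instruction = f"<<SYS>>\n{system_prompt}\n<</SYS>>\n\n{question}"
    else:
        # The system prompt block is optional.
        instruction = question
    return f"<s>[INST] {instruction} [/INST] {answer} </s>"


example = format_llama2_example(
    question="Define tachycardia.",
    answer="An abnormally fast heart rate.",
)
print(example)

with_system = format_llama2_example(
    question="Define tachycardia.",
    answer="An abnormally fast heart rate.",
    system_prompt="You are a helpful assistant.",
)
print(with_system)
```

Since someone has already published the formatted dataset, we don't need this helper for the tutorial. It is only here to make the structure of one training example easier to see.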

Let's run the code on Google Colab

Installing and importing the libraries

!pip install -q accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.40.2 transformers==4.31.0 trl==0.4.7

!pip install huggingface_hub

import torch

from trl import SFTTrainer

from peft import LoraConfig

from datasets import load_dataset

from transformers import (AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments, pipeline)

Loading the model

You can search the Hugging Face documentation for "AutoModelForCausalLM".

It has multiple methods

We will be using this one

So that's a method from the AutoModelForCausalLM class, and this is exactly the one we're gonna use now to load our model,

Besides, while loading our model, we can do some configuration to reduce memory usage and optimize the upcoming fine-tuning and training process. How are we gonna reduce that memory? How are we gonna optimize all this?

Well, by enabling four-bit precision, meaning that the model weights will be loaded in a four-bit format.
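A quick back-of-the-envelope calculation shows why four-bit loading matters. The figure of 7 billion parameters below is an assumption for illustration (check the model card for the real number), and the estimate covers the weights only, ignoring quantization overhead and the memory needed during training.

```python
def weight_memory_gb(num_parameters, bits_per_weight):
    """Approximate memory needed just to hold the weights, in GB."""
    return num_parameters * bits_per_weight / 8 / 1e9


# Assumed model size, for illustration only.
assumed_parameters = 7_000_000_000

fp16_gb = weight_memory_gb(assumed_parameters, 16)
four_bit_gb = weight_memory_gb(assumed_parameters, 4)

print(f"16-bit weights: about {fp16_gb} GB")
print(f"4-bit weights:  about {four_bit_gb} GB")
```

Under that assumption, four-bit weights need roughly a quarter of the memory of 16-bit weights, which is the difference between fitting and not fitting on a free Colab GPU.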

We will use the pretrained_model_name_or_path parameter, which holds the path to the model,

and quantization_config to enable the four-bit format.

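So, let's start from the class. (This line is only a stepping stone: the class is not meant to be instantiated directly, so don't run it on its own.)

llama_model=AutoModelForCausalLM()

Then we will use its from_pretrained() method: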

llama_model=AutoModelForCausalLM.from_pretrained()

Then let's add two parameters pretrained_model_name_or_path and quantization_config

llama_model = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path=Path, quantization_config = BitsAndBytesConfig())

So, pretrained_model_name_or_path = "aboonaji/llama2finetune-v2" as it requires the path.

and quantization_config is the object of BitsAndBytesConfig()

So, we had quantization_config = BitsAndBytesConfig();

We will use these 3 parameters from this class

So,

llama_model = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path = "aboonaji/llama2finetune-v2",quantization_config = BitsAndBytesConfig(load_in_4bit = True, bnb_4bit_compute_dtype = getattr(torch, "float16"), bnb_4bit_quant_type = "nf4"))

load_in_4bit is set to True so that we do indeed get four-bit quantization.

load_in_4bit = True

Then, bnb_4bit_compute_dtype. Here we use Python's built-in getattr function to fetch the float16 data type from the torch module (it is equivalent to writing torch.float16). We go through torch because, as you will see, loading this fine-tuned Llama 2 model actually loads two PyTorch checkpoint files, so everything needs to be done with torch. So in this getattr function, I first enter torch for the torch library, and then the name of the data type, which is, as we said, float16.

bnb_4bit_compute_dtype = getattr(torch, "float16")

Finally, bnb_4bit_quant_type, in which we'll specify that we want the NF4 quantization data type in the weights of the linear layers of our LLM

bnb_4bit_quant_type = "nf4"

So now we just have two final things to do, again to reduce memory usage and speed up the computations.

The first is to take our newly created object, llama_model, which contains the whole llama2finetune-v2 model. Since it is an object created through the AutoModelForCausalLM class, we can use one of its attributes, config.use_cache, which we're gonna set to False.

llama_model.config.use_cache = False

That means we're not gonna keep in cache memory the outputs computed for earlier tokens. The cache is useful when generating text but is not needed during training, so turning it off reduces memory usage once again.

llama_model.config.pretraining_tp = 1

Setting pretraining_tp to 1 deactivates the more accurate but slower computation of the linear layers, because keeping it activated would considerably slow down the linear layer computations.

Now you can run this.

You will see two PyTorch checkpoint files loading (one of 10 GB and one of 3.5 GB).

Loading the tokenizer

We have just loaded our pre-trained model, the Llama 2 model that we got from Hugging Face, through the from_pretrained method of the AutoModelForCausalLM class.

Now that we have the model, we also need to load a tokenizer that is compatible with this Llama 2 model, while making sure the tokenizer uses the same special tokens as the model and the same padding.

That's extremely important, and that's exactly what we'll make sure to implement in this code cell.

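We start from the AutoTokenizer class (this line is only a stepping stone, since the class is not meant to be instantiated directly):

llama_tokenizer = AutoTokenizer()

Then we use the from_pretrained() method: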

llama_tokenizer = AutoTokenizer.from_pretrained()

The two parameters we're gonna enter here are, first of course, pretrained_model_name_or_path, which exactly like before will be the path to the model, aboonaji/llama2finetune-v2.

The second one is trust_remote_code, which by default is equal to False, but which we will set to True

because we want to allow custom code defined in the model's repository on the Hub (for us, Hugging Face) to be trusted and run.

Now we just need to do something very important, as I said at the beginning of this tutorial, which is to configure the padding.

We're gonna do two things: make sure that the pad token is the same as the end-of-sequence token, and make sure that we have right padding. The padding token is used to fill up sequences to a uniform length.

Sequences in a batch must all have the same length, because that's necessary to process batches of text data. So we set the pad token equal to the end-of-sequence token, which is a special token used to indicate the end of a text sequence:

llama_tokenizer.pad_token = llama_tokenizer.eos_token

Now one final thing: we need to make sure that we have right padding as opposed to left padding. To do this, we take our llama_tokenizer object again and access another attribute called padding_side, which can be "right" or "left".

Sometimes we can have left padding, but here we wanna make sure we have right padding, which makes sense because we're filling up the rest of each sequence with the end-of-sequence token.

llama_tokenizer.padding_side = "right"

So, finally

llama_tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path = "aboonaji/llama2finetune-v2", trust_remote_code = True)

llama_tokenizer.pad_token = llama_tokenizer.eos_token

llama_tokenizer.padding_side = "right"

Run it
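To see what right padding actually does to a batch, here is a plain-Python sketch. It works on lists of words rather than real token IDs, and `pad_batch` is a hypothetical helper written for this illustration, not part of the tokenizer API.

```python
def pad_batch(sequences, pad_token, padding_side="right"):
    """Fill shorter sequences with pad_token so all have equal length."""
    longest = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        filler = [pad_token] * (longest - len(seq))
        if padding_side == "right":
            padded.append(seq + filler)
        else:
            padded.append(filler + seq)
    return padded


batch = [
    ["What", "is", "anemia", "?"],
    ["Define", "asthma"],
]

# The end-of-sequence token doubles as the pad token, as in the cell above.
padded = pad_batch(batch, pad_token="</s>", padding_side="right")
for row in padded:
    print(row)
```

The shorter sequence gets two end-of-sequence tokens appended on the right, so both rows have the same length and can be processed together as one batch.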

Setting the training arguments

Now we're gonna set the training arguments. The first question is, why do we need to set those training arguments separately?

That's simply because in the next step, step five, we're gonna create the supervised fine-tuning trainer as an instance of the SFTTrainer class. This class takes several arguments, one of which is the training arguments.

These training arguments are created as an object of another class, the TrainingArguments class,

which lets us configure the parameters of the training that will happen once we retrain our Llama 2 model on the new dataset containing the medical terms.

The first is per_device_train_batch_size, which is the batch size for training. By default it is equal to eight, but eight is too much for the memory of a Colab notebook. You can try running the training with a batch size of eight, and you will probably get an out-of-memory error. So we're gonna reduce the training batch size to four, meaning that the model will process four training examples in each training iteration, which will not be too much for the memory.

Then we'll also enter output_dir, because it is a compulsory argument. If you don't provide a path to a folder in your notebook, it'll raise an error, because you have to provide an output directory where the model predictions and checkpoints will be written.

The last argument is max_steps. Its default value is minus one, meaning that no maximum number of steps is set by default. As the documentation describes, for a finite dataset, training is reiterated through the dataset until max_steps is reached. We will not keep the default value of minus one. We want a pretty reasonable training run, so we'll choose 100 steps, which limits the training to a maximum of 100 steps.

training_arguments = TrainingArguments(output_dir = "./results", per_device_train_batch_size = 4, max_steps = 100)
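It can help to work out how much data these settings actually cover. The sketch below assumes a single GPU and no gradient accumulation, and the dataset size of 800 rows is purely hypothetical.

```python
# Values taken from the TrainingArguments above.
per_device_train_batch_size = 4
max_steps = 100

# Assumes one GPU and no gradient accumulation.
examples_processed = per_device_train_batch_size * max_steps
print(examples_processed)


def epochs_covered(dataset_size):
    """Fraction of the dataset seen (or number of passes over it)."""
    return examples_processed / dataset_size


# Hypothetical dataset size, for illustration only.
print(epochs_covered(800))
```

So 100 steps with a batch size of four means the model sees 400 training examples in total. Whether that is a full pass over the data depends on how many rows the dataset has.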

Creating the Supervised Fine-Tuning trainer

There are two main techniques to extend your LLM with knowledge augmentation. The first one is supervised fine-tuning, a transfer learning technique in which the weights of a pre-trained model are trained on new data, which here is the data containing all these advanced medical terms.

The second technique is RLHF, reinforcement learning from human feedback, which is a much longer process and which of course requires human feedback.

We will apply the first one.

For this we have to create a trainer, which is exactly what we're gonna do in this new step. We're gonna build that trainer by creating an instance of an amazing tool, the SFTTrainer class from the TRL library.

llama_sft_trainer = SFTTrainer()

The first argument is the model: the one we loaded, not the model directly from Hugging Face.

llama_sft_trainer = SFTTrainer(model = llama_model)

Then args will of course be the training arguments. That's the second element we'll enter.