Generative AI concepts
Fine tuning
A key capability of the OCI generative AI service is the ability to fine-tune these models. And what you do with fine-tuning is you optimize a pre-trained foundational model on a smaller domain-specific data set, as is shown in the illustration here. So you have your own custom data, your own domain-specific data.
And you take the pre-trained foundational model and you train that model with this custom data. And you end up with a custom model through this fine-tuning process. There are two benefits to doing fine-tuning.
The first is you improve model performance on specific tasks. By tailoring the model to domain-specific data, it can better understand and generate contextually relevant responses. And the second benefit is that you can improve model efficiency by reducing the number of tokens, for example. So you improve the model on both these dimensions.
Token
Large language models understand tokens rather than characters.
One token can be part of a word, an entire word, or even a punctuation symbol. A common word such as apple is a single token. Another word such as friendship is made up of two tokens, friend and ship. The number of tokens per word depends on the complexity of the text. So for simple text, you can assume one token per word on average. And for complex text, which is basically text with less common words, you can assume two or three tokens per word on average.
So for example, if you have a sentence like the one shown here, "Many words map to one token, but some don't: indivisible." Take a guess how many tokens there would be in this particular sentence.
For example, this is how the sentence gets broken up into tokens.
So the common words in the sentence, such as many, words, map, and one, are tokens by themselves. And a less common word like indivisible, which is not used that frequently, is actually made up of two tokens, indiv and isible. And also note that some punctuation symbols, like the period, comma, apostrophe, and colon, are tokens by themselves.
So this is an example of how large language models take text as input. And then basically, they tokenize that input and they understand tokens rather than characters.
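To make this concrete, here is a small sketch using the open-source tiktoken tokenizer purely as an illustration; the models behind the OCI Generative AI service use their own tokenizers, so the exact counts and splits will differ.

```python
import tiktoken  # assumption: the tiktoken package is installed; used here only as an illustrative tokenizer

enc = tiktoken.get_encoding("cl100k_base")
sentence = "Many words map to one token, but some don't: indivisible."
token_ids = enc.encode(sentence)

print(len(token_ids))                        # total number of tokens in the sentence
print([enc.decode([t]) for t in token_ids])  # the individual token strings
```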
Context window for a model: A model's context window refers to the number of tokens it's capable of processing at one time. An easy way to remember it: it's the sum of input and output tokens for that particular model. For example, if a model has a 4,096-token context window and your prompt uses 3,000 tokens, at most 1,096 tokens are left for the output.
Some parameters for controlling the model's output
So the first parameter is the maximum output tokens. This is the maximum number of tokens the model generates per response. In case of OCI, the limit is 4,000 tokens.
The second parameter is temperature. This basically determines how creative the model should be. This is very important; it's a close second to prompt engineering in controlling the output of generation models.
Then the next two parameters are top p and top k. These are two additional ways to pick the output tokens besides temperature, so really important. And then there are the penalty parameters, presence and frequency. Basically, they assign a penalty when a token appears frequently, which produces less repetitive text.
And show likelihood is the final parameter; it basically shows how likely each token was to follow the currently generated token.
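As a rough sketch, these parameters map onto fields of the request object in the OCI Python SDK that the inference section below walks through. The field names come from the SDK's CohereLlmInferenceRequest model, and the mapping of show likelihood to return_likelihoods is an assumption to verify.

```python
import oci

# Sketch: the generation parameters above as fields on the request object
# (field names from the OCI SDK's CohereLlmInferenceRequest model; the mapping
# of "show likelihood" to return_likelihoods is an assumption)
request = oci.generative_ai_inference.models.CohereLlmInferenceRequest()
request.prompt = "Tell me a fact about Oracle Cloud."   # example prompt
request.max_tokens = 600           # maximum output tokens per response
request.temperature = 0.75         # how creative / random the output should be
request.top_k = 0                  # pick only from the top k tokens (0 leaves it off)
request.top_p = 0.75               # pick from tokens whose cumulative probability stays within p
request.frequency_penalty = 0.5    # penalize tokens in proportion to how often they appear
request.presence_penalty = 0.0     # penalize tokens that have appeared at all
request.return_likelihoods = "GENERATION"  # show likelihoods for the generated tokens
```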
Temperature parameter
So the first parameter, which is very important, is temperature. Temperature is a hyperparameter that controls the randomness of the LLM output. Now, what do we mean by that?
Well, if you step back and think about how a large language model works, it basically takes a string as input, what you refer to as the prompt, and then predicts what the following words should be. Behind the scenes, it comes up with probabilities for various combinations of words that could follow. The output of the LLM is a giant list of possible words and their probabilities.
It returns only one of those words based on the parameters you set. So for example, consider this phrase: the sky is ___.
And right here, you can see there are lots of candidate words. In your mind, if I give you the phrase the sky is, the next word would be blue or the limit. You wouldn't think of water as the next word.
Now, the temperature setting basically tells it which of these words it can use as the next word. So if you set the temperature to zero, it makes the model deterministic. And what that means is it limits the model to use the word with the highest probability. So in this case, blue.
If you increase the temperature, the distribution is flattened over all words. So what that means is model uses words with lower probability. So it could pick the next word here, which is the limit, which has a lower probability or it could even pick something like water.
And this is when you say the model has become more creative versus more deterministic. So this is a parameter, which is really important and can be used to control the output of your models.
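As an illustration of how temperature sharpens or flattens that distribution, here is a small sketch with made-up scores for the next word after "the sky is":

```python
import numpy as np

# Made-up scores for the next word after "the sky is" (illustrative only)
words = ["blue", "the limit", "clear", "water"]
logits = np.array([4.0, 2.5, 2.0, 0.5])

def next_word_probabilities(logits, temperature):
    # Dividing by the temperature before the softmax sharpens the distribution
    # when temperature is low and flattens it when temperature is high
    scaled = logits / max(temperature, 1e-6)
    exp = np.exp(scaled - scaled.max())
    return exp / exp.sum()

for t in (0.1, 1.0, 5.0):
    probs = next_word_probabilities(logits, t)
    print(f"temperature={t}:", {w: round(float(p), 3) for w, p in zip(words, probs)})
```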
Top K
Top k basically tells the model to pick the next token from the top k tokens in its list, sorted by probability. So consider another phrase: the name of that country is the ___.
And you see all these candidate words, like we saw on the previous slide. United has a probability of 0.12. Netherlands has a lower probability. Czech has an even lower probability, and so on and so forth. So if you set your top k to three, the model will only pick from the top three options, which are United, Netherlands, and Czech, based on their probability, and ignore all the other options.
It will pick United most of the time, but sometimes it can also pick Netherlands or Czech. So this is how top k works.
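A minimal sketch of the top k idea; 0.12 is the probability mentioned above for United, and the remaining numbers are made up for illustration (chosen so United and Netherlands sum to 0.147, matching the top p example below).

```python
# Made-up next-token probabilities for "the name of that country is the ___"
# (0.12 is from the example above; the other numbers are invented)
candidates = {"United": 0.12, "Netherlands": 0.027, "Czech": 0.01, "Russian": 0.005, "best": 0.002}

def top_k_filter(probs, k):
    # Keep only the k most probable tokens and renormalize before sampling
    top = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {word: p / total for word, p in top}

print(top_k_filter(candidates, k=3))  # only United, Netherlands, and Czech remain
```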
Top p is similar to top k, but picks from the top tokens based on the sum of their probabilities.
Top p
Here again we have words with different probabilities. Now, what top p would do: if you set p as 0.15, then it will only pick from United and Netherlands, because the sum total of their probabilities adds up to 0.147, or 14.7%, which is less than 0.15, or 15%. Another way top p is used: if p is set to 0.75, the bottom 25% of probable outputs are always excluded. So this is another way the top p parameter is used.
Think about this as an alternative to temperature. Now, instead of considering all possible tokens, when you set top p, the model considers only a subset of tokens whose cumulative probability adds up to a certain threshold, which is what is specified by top p.
For example, if we set top p as 0.10, then the model considers only the tokens that make up the top 10% of the probability mass for the next token.
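And a matching sketch for top p, following the description above: keep the most probable tokens while their cumulative probability stays within p (real implementations can differ slightly at the boundary).

```python
# Same made-up distribution as in the top k sketch
candidates = {"United": 0.12, "Netherlands": 0.027, "Czech": 0.01, "Russian": 0.005, "best": 0.002}

def top_p_filter(probs, p):
    # Keep tokens, highest probability first, while the running total stays
    # within p, then renormalize before sampling
    kept, cumulative = {}, 0.0
    for word, prob in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        if kept and cumulative + prob > p:
            break
        kept[word] = prob
        cumulative += prob
    total = sum(kept.values())
    return {word: prob / total for word, prob in kept.items()}

print(top_p_filter(candidates, p=0.15))  # United and Netherlands: 0.12 + 0.027 = 0.147 <= 0.15
```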
Temperature
Temperature is a parameter that controls the creativity or randomness of the text generated. A higher temperature results in more diverse and creative output, while a lower temperature makes it more deterministic. Now, in practice what is happening is, and we covered this in the theory lesson, is temperature is affecting the probability distribution over the possible tokens at each step of the generation process.
A temperature of zero would make the model completely deterministic, always choosing the most likely token.
And if I click on generate again, the response which will come back is going to be exactly the same, because the model is completely deterministic.
Now, if I change it all the way to five, you will see the output is going to be different from what the output was when the temperature was set to zero.
So this is one way you can change the temperature and adjust the level of creativity of the output you get out of the model.
Let's go to the playground. Let's copy the same prompt. And here, let us change the temperature to a really small value close to zero, and let's make top p one, meaning it's choosing from all the tokens.
And in the other case, let us make the temperature 1 and top p 0.10. And let's click generate here.
And you will see that it's generating a particular type of response.
If I go back to this other example, where we have altered the values for temperature and top p, it's going to generate a different kind of response. This is the response which got generated when we chose from all the tokens but kept the temperature close to zero.
And in the other case, with temperature 1 and top p 0.10, this is the response which came out.
So the idea is that both temperature and top p are powerful tools for controlling the behavior of large language models. You can use them independently or together when making these API calls. By adjusting these parameters, you can achieve different levels of creativity and control.
Inference API
Assume that, just using the model in the playground, we have got text generated. But how do we get the text using code?
The code is available in both Java and Python.
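In Python, a minimal sketch of what the script could look like, reconstructed from the walkthrough that follows (the prompt text, compartment OCID, and model OCID are placeholders):

```python
import oci

# Placeholders: replace with your own compartment OCID and model OCID
compartment_id = "<your_compartment_ocid>"
CONFIG_PROFILE = "DEFAULT"
config = oci.config.from_file("~/.oci/config", CONFIG_PROFILE)

# Service endpoint for the Generative AI inference service (US Chicago region)
endpoint = "https://inference.generativeai.us-chicago-1.oci.oraclecloud.com"

# Inference client configured with your OCI settings, no retry strategy,
# and default connection/read timeouts
generative_ai_inference_client = oci.generative_ai_inference.GenerativeAiInferenceClient(
    config=config,
    service_endpoint=endpoint,
    retry_strategy=oci.retry.NoneRetryStrategy(),
    timeout=(10, 240),
)

# Text generation request setup
llm_inference_request = oci.generative_ai_inference.models.CohereLlmInferenceRequest()
llm_inference_request.prompt = "Tell me something about Oracle Cloud Infrastructure."  # example prompt
llm_inference_request.max_tokens = 600        # how many tokens we want in the response
llm_inference_request.temperature = 1         # controls the randomness
llm_inference_request.frequency_penalty = 0   # penalizes repetition
llm_inference_request.top_p = 0.75            # impacts the diversity of the text

# Model and compartment setup: on-demand serving mode with the Cohere model's OCID
generate_text_detail = oci.generative_ai_inference.models.GenerateTextDetails()
generate_text_detail.serving_mode = oci.generative_ai_inference.models.OnDemandServingMode(
    model_id="<cohere_command_model_ocid>"
)
generate_text_detail.inference_request = llm_inference_request
generate_text_detail.compartment_id = compartment_id

# Send the request and print the generated text wrapped in a simple header/footer
generate_text_response = generative_ai_inference_client.generate_text(generate_text_detail)
print("************** Generate Text Result **************")
print(generate_text_response.data)
print("***************************************************")
```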
Let's break down the Python code:
Now, the first thing in the code is this import statement. This line imports the OCI SDK for Python, allowing the script to utilize various OCI services, including the Generative AI service.
The next step here is setting up some basic variables. You're setting your compartment ID, which identifies the compartment the request runs against. You're setting your config profile and config. The config profile is the name of the profile in your OCI configuration file that the script will use to authenticate and communicate with OCI services. We are using DEFAULT here, but you could use other profiles, particularly if you have multiple OCI accounts or configurations.
This config here is loading your OCI credentials and configuration settings from the specified profile, this default profile, in the configuration file, enabling the script to authenticate your request. So if you don't set these up properly, your code will not work.
So the next line here is the service endpoint. This is the URL. This URL is the endpoint for the Generative AI service. And you can see here, it's running in the US Chicago region. This is where the script is sending request to generate text. If you're using another region, you should replace US Chicago with that region name here.
The next line here is the Gen AI inference client. Basically, what we are doing here is creating an instance of the Generative AI inference client, configured with your OCI settings.
So here we specify the config and the service endpoint, and we set the retry strategy. Here, it is set to not retry failed requests, so you can see there's no retry strategy. We also set the timeouts for the connection and the read operation. These are default values, so we are just leaving them as is.
And then the next few lines are basically the text generation request setup. Here you can see that we are constructing a request to the Generative AI service to generate text.
We are indicating how many tokens we want in the response. Temperature basically controls the randomness. Frequency penalty basically penalizes repetition. And top p impacts the diversity of the text generated. So we are setting these up here in the code.
And the next few lines are basically setting up the model and the compartment ID. So we set the compartment ID here, and we set the serving mode to on-demand.
And we specify the model, which is identified by its OCID. We are using a Cohere model, so it has a model OCID, which we are specifying here.
And then it assigns the inference request configured earlier and, like I said, sets the compartment ID. So this section here is basically setting up which model to use and the compartment ID.
And then right here is where we are generating the text. This line sends the request to the OCI Generative AI service to generate text based on the provided details, and captures the response. And this print statement here is basically printing the generated text to the console, wrapped in a simple header and footer for clarity.
So the response here, of course, the main thing here is the generated text, which is the response back from the Cohere large language model, the command model.
And right here, you can see the model ID, the OCID for the model, the Cohere command model. You can see the version. And you can also see that the runtime is Cohere.
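As a sketch of how you might pull those individual fields out of the response object (the field names are assumed from the SDK's GenerateTextResult and Cohere response models, so verify them against your SDK version):

```python
# Sketch: reading individual fields from the response
# (field names assumed from GenerateTextResult / CohereLlmInferenceResponse)
result = generate_text_response.data
print(result.model_id)                         # OCID of the Cohere command model
print(result.model_version)                    # model version
print(result.inference_response.runtime_type)  # e.g. "COHERE"
for generated in result.inference_response.generated_texts:
    print(generated.text)                      # the generated text itself
```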
Summarization Model parameters
Word embeddings
Here you can see we have taken some words and generated a vector for each one. Here we just have two dimensions (Age, Size).
But the number of dimensions can be much larger.
Embeddings are simply vectors of numbers. So when you have vectors of numbers, you can also compute numerical similarity.
A similarity measure takes these embeddings which are numbers and returns a number measuring their similarity. This is called numerical similarity.
You are measuring how numerically similar these vectors are, these vectors being just lists of numbers. And there are two techniques which are used: cosine similarity and dot product similarity.
And the thing which is really important here is embeddings that are numerically similar are also semantically similar. Semantically similar basically means how close their meaning is or how closely they are related.
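A small sketch of both similarity measures, using made-up two-dimensional embeddings (real embeddings have hundreds or thousands of dimensions):

```python
import numpy as np

# Made-up two-dimensional embeddings; real embeddings have far more dimensions
puppy      = np.array([0.90, 0.20])
dog        = np.array([0.85, 0.30])
strawberry = np.array([0.10, 0.80])

def cosine_similarity(a, b):
    # Cosine of the angle between the vectors: close to 1.0 means very similar
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(puppy, dog))         # high: semantically similar
print(cosine_similarity(puppy, strawberry))  # low: semantically dissimilar
print(float(np.dot(puppy, dog)))             # dot product similarity (not normalized)
```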
Let us say we have more words. So we have animals here, we have fruits here, and we have cities here. And you can see that the embedding vector of puppy will be more similar to that of dog than to that of lion, or of New York, or of strawberry.
If you are given a new word, such as "tiger," you could place it closer to the animals group, close to the cat family members. This is quite intuitive, because you understand what a tiger is, and so on.
Sentence Embedding
A sentence embedding associates every sentence with a vector of numbers, similar to a word embedding.
So the embedding vector of a phrase such as "canine companions say" will be more similar to the embedding vector of "woof," as you can see here, than to that of "meow." "Meow" would be closer to "feline friends say."
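A sketch of how you could get sentence embeddings from the OCI Generative AI service, assuming the SDK's EmbedTextDetails model and embed_text operation (the endpoint, compartment OCID, and embedding model OCID are placeholders):

```python
import oci

# Sketch: sentence embeddings via the OCI Generative AI service
# (assumes the EmbedTextDetails model and embed_text operation; the endpoint,
# compartment OCID, and embedding model OCID are placeholders)
config = oci.config.from_file("~/.oci/config", "DEFAULT")
endpoint = "https://inference.generativeai.us-chicago-1.oci.oraclecloud.com"
client = oci.generative_ai_inference.GenerativeAiInferenceClient(config=config, service_endpoint=endpoint)

embed_text_detail = oci.generative_ai_inference.models.EmbedTextDetails()
embed_text_detail.inputs = ["canine companions say", "woof", "meow", "feline friends say"]
embed_text_detail.serving_mode = oci.generative_ai_inference.models.OnDemandServingMode(
    model_id="<embedding_model_ocid>"
)
embed_text_detail.compartment_id = "<your_compartment_ocid>"

embed_text_response = client.embed_text(embed_text_detail)
embeddings = embed_text_response.data.embeddings  # one vector per input sentence
```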
The problem and a way to solve it
One of the main challenges faced by today's generative models or embedding models is their inability to connect with your company's data. A promising approach to overcoming this limitation is Retrieval-Augmented Generation, RAG.
So fundamentally, how it works is: you take a large corpus of documents, break it into chunks or paragraphs, generate an embedding for each paragraph, and store all the embeddings in a vector database.
Now, vector databases are capable of automating the cosine similarity computation and doing nearest-match searches through that database for whatever embedding you want to search for. So this basically powers the whole RAG system.
The way it works is, let's say you have a user who has some question which cannot be answered by the LLM alone. Maybe it's related to your customer support calls or something. So the user question is encoded as a vector and sent to the vector database.
Now the vector database can run a nearest-match search to identify the most closely associated documents or paragraphs. It finds the private content which most closely matches the user query. And then it can take those documents or paragraphs and insert them into a prompt to be sent to the large language model. The basic idea is to help answer the user question by changing the prompt. The LLM then uses the content provided by the vector database, plus its general knowledge, to provide an informed answer.
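A minimal sketch of this flow, using an in-memory list as a stand-in for the vector database; embed() and generate_answer() are hypothetical stand-ins for an embedding model and an LLM call:

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(question_embedding, store, top_n=3):
    # Nearest-match search: rank the stored chunks by similarity to the question embedding
    ranked = sorted(store, key=lambda chunk: cosine_similarity(question_embedding, chunk["embedding"]), reverse=True)
    return [chunk["text"] for chunk in ranked[:top_n]]

def answer_question(question, store, embed, generate_answer):
    # embed() and generate_answer() are hypothetical stand-ins for an embedding
    # model and an LLM call; store is a list of {"text": ..., "embedding": ...} chunks
    question_embedding = embed(question)
    context_chunks = retrieve(question_embedding, store)
    # Insert the retrieved private content into the prompt sent to the LLM
    prompt = (
        "Answer the question using only the context below.\n\n"
        "Context:\n" + "\n".join(context_chunks) + "\n\n"
        "Question: " + question
    )
    return generate_answer(prompt)
```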
Prompt engineering
RAG
You can see an example here: a person is chatting with a virtual chat bot. The person says, can I return the dress I just purchased? And the chat bot goes back to the enterprise database, picks up the return policy, and says a couple of things: the return has to be within this window, and items purchased on sale cannot be returned. This is coming from an enterprise database.
And then the user says, how do I know if it was on sale? And she uploads her receipt. And then the chat bot basically goes and again looks up the enterprise database to see when this item was purchased, what the price was, and whether it was on sale or not.
So this is a great example where there is not a human sitting on the other side. It's a virtual chat bot built using this framework, going back to an enterprise data source and really giving grounded responses, because using RAG you can give the model access to private knowledge that it otherwise would not have, like your return policy, the return window, et cetera.
An important thing to keep in mind is that RAG does not require any kind of fine-tuning or custom models.
Customize LLM with your data
So let us look at what a typical journey would look like. You would start at the bottom corner here. You have a prompt, and then you create an evaluation framework, and you figure out what your baseline is. So that's where you start. Start with a simple prompt.
Then you can give some few shot examples of input/output pairs you want your model to follow. So still in prompt engineering, but you're giving some-- you're adding a few shot examples. Then you can add a simple retriever using RAG.
Now let us say these few shot examples increase your model performance. Next, you hook your model up to an enterprise knowledge base and create a RAG system. Now let us say that you are satisfied with your model, but the output is not coming out in the format or style that you really want.
So now you can take this model, which is built on RAG, and you can fine-tune it. Then the output is probably in your style, but maybe you figure out that, once you have done that, the retrieval results are not that good, and so you want to optimize the RAG system further. So you can actually go back and optimize your retrieval, and so on and so forth.
You can see a pattern here where you are literally using all the different techniques, and it depends on your optimization journey when and which technique to use.
Fine tuning and inference for LLM
Fine-tuning is basically taking a pre-trained foundational model and providing additional training using custom data, as is being shown here.
In traditional machine learning terminology, inference refers to the process of using a trained machine learning model to make predictions or decisions based on new input data.
But in the case of large language models, inference refers to the model receiving new text as input and generating output text based on what it has learned during training and fine-tuning. So this is basically what inference and fine-tuning look like in the context of large language models.
Fine tuning workflow