LM means Language Model

Let's see this problem

What it will do is:

It will fill that one using probability.

LLM (Large Language Model)

LLM architecture

Check how many parameters we can have for current LLM models:

Types of architectures:

Encoder Models

What encoder does is takes a word and makes a vector number for that.

Decoder model

Based on a sentence (sequence of token) it creates a single token ( a word)

Encoder and Decoder

Which architecture to choose for which tasks?

Prompting

Let's repeat the problem we had

To impact the probability or guide the model towards right answer , we can follow there 2 ways:

Prompting
Training

Prompting

Now, prompting means adding words to guide towads answer.

Let's add a simple word and see how it affects the probability:

now, you can see most of the probability got changed. Specially dog and cat god increased to 0.45 and 0.4

It's worthwhile to think about how and why the model can do this. Very large decoder-only models are initially trained in a procedure called "Pre-training". During pre-training, a model is fed a tremendous amount of text that is typically quite varied.

Given a sequence of words, the model is trained to guess at every step what the next word is likely to be and how likely. In some sense, during pre-training, the model should learn, among other things, what little animals exist and, thus, know to make the probabilities of the little animals higher and the big animals smaller.

At a more concrete level, we can hypothesize that the bigram little dog and little cat occur much more frequently in the pre-training text than little lion or little panther. And this accounts for the higher probability on the littler animals. While this is an oversimplification of pre-training, it does give a sense of where these probabilities are coming from.

Prompt engineeing

Now when you're doing prompt engineering, you wouldn't necessarily be looking at the distribution over vocabulary. Instead, you'd actually be generating text from a model and seeing whether the generated text looks good. Now a couple of important notes about prompt engineering.

First, it can be quite challenging and very unintuitive. If I make a small change in the input text to the model, for example, even adding a whitespace, that can have a large effect on the distribution over vocabulary words. And you don't really know what the change is going to be.

But at the same time, there have been numerous anecdotal examples that prompt engineering works. So with a very particular task in mind and a very particular model, spending some time trying to come up with the right prompt can be hugely valuable. And there are a number of strategies for doing this that have proved successful in industry and academia.

Strategies for prompt engineering->

In context learning:

In the following slides, I'm going to talk about some of these strategies. Here, I'll focus on the most popular, which is the idea of in-context learning. Perhaps, unfortunately, though, the phrase in-context learning does not actually involve learning in the traditional sense. That is with in-context learning, none of the parameters of the model are changing.

Instead, what this refers to is constructing a prompt that has demonstrations of the task that the model is meant to complete.

K shot learning

I also just want to refine this notion further, with the concept of k-shot prompting, which means including k examples of the task that you want the model to complete in the prompt.

Here is an example from the GPT-3 paper. This came out around 2020, and the goal of this particular prompt is to demonstrate to the model that it should translate English words or phrases into French. And what you'll notice here is that in the prompt, the input to the model literally includes three examples of translation from English to French, followed by a final translation that's incomplete, that we want the model to complete.

This is an example of three-shot prompting. Notice that in the paper itself, they actually labeled different pieces of the input text differently. And this is an example of the term prompt being overloaded. Sometimes, when you hear the word prompt, someone might be referring to the entire input that's being sent to the model.
Zero shot learning

When we don't include any demonstrations in the prompt, it's called zero-shot prompting. In this case, we would have a task description and then jump straight into the task the model is meant to complete.

Examples:

2 shot prompting

The first is a two-shot prompt for addition. In it, we show the model two examples of addition and then ask it to add the numbers 1 and 8. Note that the model will not actually perform the computation. Instead, it will generate a probability over words in its vocabulary most likely to follow the expression 1 plus 8 colon.

MPT Instruct

Second is another very different example that comes from the MPT-Instruct literature. When the MPT-Instruct model is trained, they use the prompt below. It tells the model that, hey, we want you to follow the following instructions, but don't do anything else. Be concise, et cetera

As a third example, here, we have a very long prompt. You may have seen things like this. I think this prompt gained popularity with the rise of Bing Chat.

Because there was some work and speculation about the length and detail that went into that prompt. And here, just to illustrate this, comes an example from the academic literature. The prompt here has a number of very subtle statements and instructions for the model. It's really detailed.

Advance prompting strategies

Chain of thoughts

As we can see, prompting is quite powerful, and there have been a lot of recent advances and strategies developed towards really interesting styles of prompting. . Here, I'm going to give one example called chain-of-thought prompting. This prompting style came out in 2022 and made the rounds.

In fact, it's still pretty popular and is in wide use today. The idea with chain of thought is that if we have a complicated task, here, it's a word problem. In order to solve the word problem, what we're going to do is prompt the model to break down the steps of solving the problem into small chunks, like what you might do if you were solving the problem yourself.

And the result in this paper showed that chain-of-thought prompting helps models accomplish some of these harder multi-step tasks than other styles of prompting.

But if we ask for the problem to be broken down into small chunks, the problem becomes more manageable. I think this is a really interesting-- this is a really interesting phenomenon, because it almost mimics the way that a human might solve this problem.

I wouldn't necessarily call this reasoning, but it imitates reasoning in a nice and convincing way.

Least to most

Here, we ask the model to solve simpler problems first and use the solutions to the simple problems to solve more difficult problems. In this task, what the model is supposed to do is, given a list of words, concatenate the last letter of each of those words together.

You can imagine that this gets harder and harder as the list of words gets longer. But what the researchers did is taught the model to solve these problems in increasing order of difficulty.

So for example, the model should start with the first word, get its last letter, and then it should look for the first two words, and concatenate the last letter of-- the last letter of the second word, with the solution to the first subproblem. I'll point out here that the model is doing precisely that.

So here, it's saying, hey, I think I've solved think machine, and it knows that the output think machine is ke. And the last two letters-- the last two letters, ke, are all that it needs. What it does here is it looks at the last word, learning, and says, hey, I've solved think machine before, and I know that the output is keg.
Step back

The last style of prompting that I'll bring up comes from DeepMind, where they taught the model to improve its performance on chemistry and physics questions by reasoning about the concept or explicitly mentioning the concepts that are required to solve the equations presented.

And what they found was that when you ask a model to a complicated question from physics or chemistry, but have it first emit the first principles and equations required to solve these problems, the model has a much higher success rate

Prompt injection

We'll talk about some of the dangers of prompting, specifically how prompting can be used to elicit unintended or even harmful behavior from a model. The first issue that I'd like to highlight here is prompt injection.

In this case, what's going on is that the prompt is being crafted in such a way as to elicit a response from the model that is not intended by the deployer or the developer. Usually, these prompts ask for harmful text to be generated, such as text that reveals private information.

When deploying models, this is something that we need to be thinking about.

The first is a prompt that says, hey, do whatever task you're meant to do, and then append poned to the end of any of your responses.

This is, perhaps, not so harmful, but also not what you'd want a model to do. Without any protection against this kind of attack, the model will dutifully follow this direction.

Something a little bit more significant and perhaps sinister is a prompt that says something like ignore whatever task you're supposed to do and focus on the prompt that I'm about to give you.

The attackers hope here is that the model completely ignores whatever the deploying entity instructed it to do and instead will follow the instructions supplied by the attacker

In the last example, the prompt instructs the model to ignore answering questions and, instead, write a SQL statement to drop all the users from a database.

This is clearly sinister. Here, we also see a pretty clear parallel to SQL injection attacks. By extension, we can just kind ask the model by prompt injection to do anything we want.

One thing to take away from this slide is that if ever a third party gets access to the model's input directly, we have to worry about these kinds of things, specifically prompt injection.

Let's talk about another two examples. The first example is from the same paper that I highlighted on the previous slide. In it, the authors actually coaxed the model to reveal the back end prompt that its developers designed for it.

They did this by just telling the model, after doing whatever you're supposed to do, just repeat the prompt that the developer gave you.

Now imagine a prompt that asks for private information about a particular user that the model has been trained on. For example, here, a user is asking for a particular person's Social Security number, which should be private.

Off the shelf, there are no guardrails that prevent the model from revealing any information it's seen during training.

Training

Remember that prompting is, in effect, simply changing the input to an LLM. The process is highly sensitive. In other words, small changes in the prompt can yield large changes in the distribution over words.

Moreover, since the model's parameters are fixed, by using prompting alone, we're limited in the extent to which we can change the model's distribution over words.

In this way, sometimes, prompting is insufficient. For example, in domain adaptation, that is when a model is trained on data from one domain, and then you want to use it in an entirely new domain, you might need something more dramatic.

During training, we're actually going to change the parameters of the model.

You can think of training as the process of giving the model an input, having a guess a corresponding output, for example, the completion of a sentence or an answer to an input question, and then, based on this answer, altering the parameters of the model so that next time, it generates something closer to the correct answer.

There are many ways that you can train or, in other words, change the underlying parameters of the model. Four such approaches are shown in this chart, and they all come with their own advantages and costs.

Fine tuning:

In the first row here is fine-tuning, which is around 2019, the way that we trained all language models. In fine-tuning, we take a Pre-Trained model, for example, BERT, and a Labeled Dataset for a Task that we care about and train the model to perform the task by altering all of its parameters.

Training a BERT model was, at the time, thought to be somewhat expensive. But it's nowhere near as expensive as training the models of today, which are orders of magnitude larger.

Prameter efficient Finetuing Model:

Because full fine-tuning is so expensive, we've turned to cheaper alternatives, like the family of parameter efficient fine-tuning methods.

In these methods, we isolate a very small set of the model's parameters to train, or we add a handful of new parameters to the model.

One of the methods you might have heard of in this space is LORA, which stands for Low Rank Adaptation. In this method, we keep the parameters of the model fixed and add additional parameters that will be trained.

Soft Prompting

Soft prompting is another cheap training option. Although, the concept here is different than methods like LORA.

In soft prompting, what we're going to do is actually add parameters to the prompt, which you can think about as adding very specialized, quote, unquote, "words" that will input to the model in order to queue it to perform specific tasks.

Unlike prompting, a soft prompt is learned. Or in other words, the parameters that represent those specialized words we added to the prompt are initialized randomly and iteratively fine-tuned during training.
Pre training:

The last training I want to bring up is continual pretraining, which is similar to fine-tuning in that it changes all the parameters of the model.

So it's expensive, but it's different in that it does not require a label data. Instead of training a model to predict specific labels, during continual pre-training, we just feed in any kind of data that we have for any task that we have and ask the model to continually predict the next word.

This is just to give an idea how much one day would cost with various types of training

Decoding

Let's return to the example we've seen a few times thus far. "I wrote to the zoo to send me a pet. They sent me a--" as we know the LLM produces a distribution over vocabulary words. And the question we're focused on now is, how do we turn this distribution into a word or a sequence of words?

Through the course of this discussion, there are a few things that I'd like to drive home. One is that, in decoding, or the process of generating text, it happens one word at a time.

Specifically, we give the model some input text. It produces a distribution over words in its vocabulary. We select one. It gets appended to the input, and then we feed the revised input back into the model and perform the process again. In particular, the model is not emitting whole sentences or documents in one step. It all happens one word at a time.

OK, with this in mind, once we compute the distribution over words in the vocabulary, how do we actually pick a word to emit?

Well the simplest or most naive, but also effective strategy, we call greedy decoding. And in this strategy, we simply pick the word in the vocabulary with the highest probability. Let's see this method in action.

As we see on this slide, the highest probability word here to fill in the blank is dog. In greedy decoding, we'd select dog, append it to the input, and feed it back to the model.

Here, we've done just that. Notice that, when we send the input with dog appended to the end of it back to the model, the probabilities on the remaining words change.

In fact, they all get much lower. And we see a new token, EOS, which stands for End Of Sentence, or End Of Sequence, with very high probability.

This is the model telling us, hey, it's very likely that the sentence should end after the word dog.

Since we're simulating greedy decoding here, the EOS token is the next word that is selected. After the EOS token, the model is done generating, and the output is returned. Specifically, it's "I wrote to the zoo to send me a pet. They sent me a dog."

Here is another example of decoding. The input is the same as before. And as you might have noticed, I modified the visualized vocabulary words. However, now, instead of picking the highest probability word, we'll pick a word randomly among the visualized choices

Here, we randomly sample the word small. As before, we'll append small to the input and then feed it back to the LLM.

As you'll see, the probability on the vocabulary words are revised given this modified input.

In particular, the probability on the word elephant goes down because elephants are typically not small.

Again, we'll sample a word randomly from the visualized words. Here, we sample the word red.

Finally, you'll notice that the probability on the word panda jumps up because of the existence of an animal known as the red panda.

At the same time, the probabilities of the other words go down, since dogs, cats, and alligators are not typically red.

Eventually, the model selects an EOS token or an EOS word and emits the sentence, "I wrote to the zoo to send me a pet. They sent me a small red panda.

When decoding with a non-deterministic strategy, that is, when you randomly sample words to emit from the LLM, there's an important parameter to know about, which is called temperature.

What temperature does is it modulates the probability distribution over words. Specifically, when you decrease the temperature, you peak the distribution more around the highest probability word.

So for example here, we see that the probability of the word dog has gone up considerably, while the rest of the probabilities have gone down.

On the other hand, when you increase temperature, the probability distribution over words in the vocabulary flattens. That is, all the probabilities get closer together.

The way that temperature affects the output text is that, when temperature increases and your decoding with a non-deterministic strategy, you're more likely to emit rarer words.

So when temperature is decreased, we get closer and closer to greedy decoding,

Specifically, this is when we emit the highest probability word at every step. This tends to result in more typical output from the LLM that is generating text.

On the other hand, when temperature is increased, the rarer words have a higher chance of being generated. This is typically associated with more creative and even interesting output.

Decoding, whether it with low or high temperature or greedily has its place. When answering factoid questions, you might imagine that we want greedy decoding. That is, we want the most likely words to be generated.

On the other hand, if we want the model to generate a story, we want to crank up the temperature and sample some rare words from time to time in order to add intrigue and unpredictability.

Three types of the most common forms of decoding you're likely to encounter. The first is "Greedy", which we spoke about directly.

The second is called "Nucleus Sampling", which is similar to the sampling-based portion of this lesson but with a few additional parameters that govern precisely what portion of the distribution over words you're allowed to sample from

The last type of decoding, which we didn't talk about, is called "Beam search" where we'll actually generate multiple similar sequences simultaneously and continually prune the sequences with low probability. Beam search is very interesting and helpful because it is decidedly not greedy but ends up outputting sequences that have higher joint probability than the sequences that are output as a result of greedy decoding.

Hallucination

Hallucination has been defined a number of slightly different ways. But for the purpose of our discussion, let's define hallucination to be text, generated by a model, that is not grounded by any data the model has been exposed to.

For example, consider the text on the slide, which contains a bolded statement that is not factually correct. Typically, statements like these are considered to be hallucinations.

Additionally, I've heard it said, and I don't to whom this idea is attributed, that all text generated from an LLM is hallucinated. The generations just happen to be correct most of the time. All this is to say that LLM-generated text is somewhat unpredictable. It's often good, fluent, and accurate, but sometimes, it's not factual or even unsafe.

There is no known method that will eliminate hallucination, with 100% certainty. On the other hand, there is a growing set of best practices and typical precautions to take when using LLMs to generate text.

In a related line of work that is growing in popularity, researchers are developing methods for measuring the groundedness of LLM-generated output.

At the same time, new grounded versions of question answering have been proposed, which is the task of answering questions while also citing the sources of the answer being provided. And more work has focused on citation and attribution of LLM-generated content.

In this way, we can see that the research community thinks of hallucination as a serious problem and, as such, is devoting significant resources to study the phenomenon and trying to figure out ways to mitigate or avoid it

LLM Applications

The first system that we'll cover is Retrieval Augmented Generation, otherwise known as RAG. This system is conceptually very simple.

When people talk about RAG systems, typically, they're talking about a system where, first, a system user provides an input, for example, a question. Second, the system transforms that question into a query which will be used to search a database, for example, a corpus of documents.

The hope is that the search will return documents that contain the answer to the question or are otherwise relevant. Finally, the returned documents will be provided to the LLM as input in addition to the question. And the expectation is that the model will generate a correct answer.

If we give the LLM a question and then some text that contains the answer, it should be easier to answer the question by leveraging the text than answering it based solely on the documents it has seen during pre-training

RAG systems are powerful. For example, they can be used in multi-document question-answering. RAG systems are also more and more prevalent. They're used for a variety of tasks, such as dialogue, question-answering, fact-checking and others.

These systems are also elegant because they provide a non-parametric mechanism for improvement. By non-parametric, what I mean is that we don't have to touch the model at all to improve the system. All we have to do is add more documents.

In theory, in the RAG setup, all you need to do is provide the software documentation or manual as the corpus. And your LLM can now answer any question that can be answered with that manual.

In practice, getting these systems to work is not trivial, because there are a few moving parts. But we've already seen a lot of RAG systems deployed in practice. And the performance of these systems seems to be improving, and the systems are ubiquitous across industry and academia. Moreover, we've seen them built on top of off-the-shelf LLMs as well as LLMs trained specifically for RAG.

Code Models

I'll briefly touch on LLMs for code, typically referred to as code models.

As their name implies, code models are LLMs trained on code, comments, and documentation. These models have demonstrated amazing capabilities in terms of code completion and documentation completion. That is, if you provide the model with a description of the function you'd like to write, in many cases, it can just output the function for you.

Some examples of these models you may have heard of include Co-pilot, Codex, and Code Llama.

These models largely eliminate the need to write any boilerplate code and commonly written functions or variables or variations of such functions. Moreover, they also shine when programming in a language that you don't know well. Instead of using a search engine to continually look up language syntax and functions, just ask the model to write the code for you.

On the flip side, more complicated tasks are still difficult or unattainable for code models. While generating code from scratch might be achievable, there is new work that shows that our best models can only automatically patch real bugs less than 15% of the time.

Multi Modals

These models are trained on multiple data modalities, for example text, images, and audio. These models can produce images or even video from textual descriptions and perform similar types of tasks.

In particular, let's discuss diffustion models. If you recall, the way LLMs generate text is one word at a time. By contrast, diffusion models generate usually images all at once rather than one pixel at a time. The way they do this is by starting with an image that is simply noise.

it's not an image of anything at first-- and iteratively refining all the pixels in the image simultaneously until a coherent image emerges.

There have been some attempts at doing such a simultaneous or technically, what we would call, a joint decoding for text. But these approaches haven't achieved state-of-the-art results and are not yet popular.

In fact, jointly decoding text is quite difficult, whereas, an image is a fixed size and has a fixed number of pixels. And we might know that size before beginning to generate.We typically don't know how many words we're going to generate in a sentence that we want to generate, in general. Moreover, whereas the pixels in an image take on continuous color values, words are discrete and, thus cannot be refined in a continuous manner.

Language Agents

Language agents are models that are intended for sequential decision-making scenarios, for example playing chess, operating software autonomously, or browsing the web in search of an item expressed in natural language. Language agents are an extension of the classic work on machine learning agents.

In more detail, these models operate in what's known as an environment and iteratively take actions in pursuit of accomplishing a specific goal.

For example, a language agent tasked with buying an ambiguously described product might take an action corresponding to a search. Every time the model takes an action, like searching, the environment responds, for example, with the search results.

The agent observes the results and takes another action, for example, visiting a page corresponding to a promising item. The model continues taking actions until it thinks that it has achieved its goal, at which point, it terminates. One reason that the interest in language agents is rapidly increasing is because their out-of-the-box capabilities for communication via natural language and their instruction-following capabilities

One of the canonical works in this space is known as ReAct which proposes a framework for leveraging LLMs as language agents. A key ingredient of this work is to prompt the model to emit what they call thoughts, which are summaries of the goal, what steps the model has already accomplished, and what steps the model thinks it needs to take next.

There has been significant study of teaching LLMs how to leverage tools. Tools is used here very broadly, but boils down to using APIs and other programs to perform computation. For example, instead of doing some arithmetic by decoding, an LLM could generate some text expressing the intention to use a calculator, formulate an API call to perform the arithmetic, and then consume the result. The ability to use tools promises to greatly expand the capability of LLMs

Finally, there is a growing body of work developing methods of training LLMs to perform various types of reasoning. LLMs that can reason successfully could be employed as high level planners in these agent systems for accomplishing highly complex, long-horizon tasks. Like humans, agents that can reason could be successful in new environments when trying to accomplish unfamiliar tasks.

Done!!

All about LLM

LLM (Large Language Model)

LLM architecture

Prompting

Prompting

Training

Hallucination

LLM Applications