Introduction

What is Llama 3

Llama 3 is a large language model (LLM) developed by Meta AI and is among the most advanced AI models currently available. It excels at text generation, translation, question answering, and more, producing high-quality, coherent, and contextually relevant text, and it performs strongly compared to other models[1].

Multi-turn Dialogue

Imagine having a conversation with a chatbot that can not only answer your current questions but also remember what you've previously said, allowing for deeper exchanges based on that information. This is the multi-turn dialogue capability of Llama. Specifically, Llama can retain what you've previously mentioned during a multi-turn conversation and connect those points. For example, if you ask, "How's the weather in Beijing?" and then follow up with, "What about Shanghai?" Llama understands that you're inquiring about another city's weather rather than something else.
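
As a rough sketch of how this usually works in practice, the whole message history is simply resent with every request; the `chat` function below is a hypothetical placeholder, not an actual Llama interface.

```python
# Minimal sketch of multi-turn dialogue: the accumulated history is passed on every turn.
# `chat` is a hypothetical stand-in for the real model-serving call.

def chat(history):
    """Send the accumulated message history to the model and return its reply text."""
    return "(model reply)"  # placeholder; a real implementation would call the served model

history = [{"role": "system", "content": "You are a helpful assistant."}]

for user_text in ["How's the weather in Beijing?", "What about Shanghai?"]:
    history.append({"role": "user", "content": user_text})
    reply = chat(history)  # the model sees both questions, so "Shanghai" still means weather
    history.append({"role": "assistant", "content": reply})
```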

The multi-turn dialogue feature makes conversations with Llama more user-friendly and aligns better with how we communicate in daily life. With this functionality, everyone can use our model just like having a regular conversation, without any special preparation or processes.

Summary

Llama can perform simple reasoning based on the provided information. For instance, if you tell it, "I have a cat and a dog," and then ask, "How many animals are at home?" Llama can accurately respond with "two." This capability allows us to use Llama for summarizing user needs by eliminating possible redundancies in conversation, filling in missing content, and automatically identifying key points.

Principles

The Original Text Generator

Recall the operations we performed with Word2Vec: we optimized two models simultaneously—one for encoding words and another for predicting the next word—but ultimately only used the encoding model. By combining both models, we can generate subsequent text based on prior context. This represents the most primitive text generator.
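
As a rough illustration of this idea, the toy sketch below pairs an encoding module with a next-word predictor and generates text greedily. The tiny vocabulary, the averaging of word vectors as "context", and the untrained weights are all simplifications for the sake of the sketch; this is not Llama.

```python
import torch
import torch.nn as nn

# Toy "encode + predict next word" generator, in the spirit of the Word2Vec setup:
# one module turns words into vectors, another predicts the next word from them.
vocab = ["<bos>", "the", "cat", "sat", "on", "mat", "."]
stoi = {w: i for i, w in enumerate(vocab)}

embed = nn.Embedding(len(vocab), 16)   # "encoding" model: word -> vector
predict = nn.Linear(16, len(vocab))    # "prediction" model: vector -> next-word scores

def generate(prompt, steps=5):
    ids = [stoi[w] for w in prompt]
    for _ in range(steps):
        context = embed(torch.tensor(ids)).mean(dim=0)  # crude context: average of word vectors
        next_id = predict(context).argmax().item()      # greedy choice of the next word
        ids.append(next_id)
    return [vocab[i] for i in ids]

print(generate(["<bos>", "the"]))  # untrained, so the continuation is arbitrary
```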

From Continuation to Dialogue

Although continuation and dialogue appear to be two unrelated tasks, a chat model can be derived from the foundational continuation model, i.e., the base model. Simply feeding the base model raw text written in a dialogue format does not guarantee that its continuation stays in dialogue form; however, fine-tuning and reinforcement learning on a large amount of dialogue data can ensure the correct format is maintained.
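
As a rough sketch of the formatting side of this, a dialogue is rendered into a single stretch of text that a continuation model can extend; the role tags below are illustrative placeholders, while real chat models such as Llama 3 use their own special tokens, which the fine-tuning stage teaches the model to respect.

```python
# Sketch: turning a dialogue into plain text that a continuation (base) model can extend.
# The <|role|> tags are illustrative, not the actual Llama 3 chat template.

def render_dialogue(turns):
    text = ""
    for role, content in turns:
        text += f"<|{role}|>\n{content}\n"
    return text + "<|assistant|>\n"   # the base model is asked to continue from here

prompt = render_dialogue([
    ("system", "You are a helpful assistant."),
    ("user", "How's the weather in Beijing?"),
])
print(prompt)  # fine-tuning on many such examples teaches the model to keep this format
```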

In addition to plain text, Llama can also work with images, audio, and video. However, since this capability is not directly relevant to our project, we do not cover it further here.

Transformer

Llama 3 is built on the Transformer model. The Transformer is a deep learning architecture based on the self-attention mechanism and widely used in natural language processing tasks such as text generation; it was first introduced by Vaswani et al. in the 2017 paper "Attention Is All You Need"[2].

The Transformer fundamentally changed the NLP field and, thanks to its high degree of parallelization and strong performance, became the basis for various modern models, including GPT (Generative Pre-trained Transformer) and BERT (Bidirectional Encoder Representations from Transformers).

Structure

The structure of the Transformer mainly consists of an encoder and a decoder. The encoder processes the input sequence and generates context representations, while the decoder uses these representations to generate the output sequence. Unlike traditional recurrent neural networks (RNNs), the Transformer abandons step-by-step processing of sequences, adopting a global perspective to capture relationships between any positions in the sequence. This design allows for parallelization and significantly enhances efficiency and effectiveness when processing long sequences.
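
As a concrete reference point, PyTorch ships this encoder-decoder architecture as a built-in module; the sketch below instantiates it with the layer sizes from the original paper. This illustrates the general architecture rather than Llama 3's own (decoder-only) configuration.

```python
import torch
import torch.nn as nn

# Encoder-decoder Transformer as described in the original paper, via PyTorch's built-in module.
model = nn.Transformer(
    d_model=512,            # representation size at every position
    nhead=8,                # attention heads per layer
    num_encoder_layers=6,   # stacked encoder layers
    num_decoder_layers=6,   # stacked decoder layers
    dim_feedforward=2048,   # hidden size of each feed-forward sublayer
    batch_first=True,
)

src = torch.rand(2, 10, 512)   # (batch, source length, d_model): the input sequence
tgt = torch.rand(2, 7, 512)    # (batch, target length, d_model): the output generated so far
out = model(src, tgt)          # decoder output conditioned on the encoder's representations
print(out.shape)               # torch.Size([2, 7, 512])
```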

Encoder

The encoder is composed of multiple identical layers, each containing two main components: the self-attention mechanism and the feed-forward neural network. Each layer takes the previous layer's output as its input and passes its own output on to the next layer. Through this layered approach, the model can progressively extract features from the input sequence and learn increasingly complex representations.

Self-Attention Mechanism

The self-attention mechanism allows the model to focus on the relevance of different positions in the input sequence. Its calculation involves three matrices: Query, Key, and Value. Given the representation of the input sequence $X$, these three matrices are derived through linear transformations:

$$Q = XW^Q, \quad K = XW^K, \quad V = XW^V$$

Here, $W^Q$, $W^K$, and $W^V$ are the learned weight matrices. The output of self-attention is computed as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

This formula reflects the core idea of the self-attention mechanism, which is to calculate the similarity between queries and keys to weight the contributions of values. The scaling by $\sqrt{d_k}$ is to prevent excessively large values during the softmax computation, which could lead to vanishing or unstable gradients. Through self-attention, the model can dynamically adjust its focus on different parts of the input sequence.
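
A minimal sketch of this computation in PyTorch (a single head, no masking, and random matrices standing in for the learned weights):

```python
import math
import torch

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention for one sequence (single head, no mask)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v                  # linear projections of the input
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)    # similarity of every query with every key
    weights = torch.softmax(scores, dim=-1)              # attention weights sum to 1 per query
    return weights @ V                                   # weighted combination of the values

X = torch.rand(10, 64)                                   # 10 positions, 64-dimensional representations
W_q, W_k, W_v = (torch.rand(64, 64) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)            # torch.Size([10, 64])
```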

Feed-Forward Neural Network

The second component of each encoder layer is a feed-forward neural network, typically comprising two linear transformations and an activation function. For an input $x$, the computation formula is:

$$\mathrm{FFN}(x) = \max(0,\, xW_1 + b_1)\,W_2 + b_2$$

Here, $W_1$ and $W_2$ are weight matrices, while $b_1$ and $b_2$ are biases. The feed-forward neural network processes each position's representation independently, allowing the model to capture higher-level features while maintaining contextual information.
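
A minimal sketch of this sublayer, using the dimensions from the original paper (512 and 2048):

```python
import torch
import torch.nn as nn

# Position-wise feed-forward network: FFN(x) = max(0, x W1 + b1) W2 + b2,
# applied to every position independently.
ffn = nn.Sequential(
    nn.Linear(512, 2048),   # x W1 + b1
    nn.ReLU(),              # max(0, .)
    nn.Linear(2048, 512),   # ... W2 + b2
)

x = torch.rand(10, 512)     # 10 positions in a sequence
print(ffn(x).shape)         # torch.Size([10, 512]) -- each position transformed on its own
```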

Decoder

The structure of the decoder is similar to that of the encoder, but its self-attention is masked so that the model only attends to earlier positions when generating output, and it adds an attention layer over the encoder's output. The decoder not only generates new output but also makes full use of the context information passed from the encoder. Its main components include:

  1. Masked Self-Attention Layer: Prevents the model from seeing future outputs during prediction. By introducing a mask in the self-attention computation, it ensures that only current and prior information is utilized during generation, allowing the content generated by the model to be coherent and grammatically correct.
  2. Encoder-Decoder Attention: Allows the decoder to focus on the encoder's output. This step is crucial for the decoder to obtain context information, enabling it to produce more relevant and consistent text. The overall computation process is similar to that of the self-attention mechanism, adding focus on the encoder's output.

Similar to the encoder, the decoder's self-attention mechanism also employs the forms of Query, Key, and Value, and a mask is applied during the softmax computation to prevent future information leakage, ensuring that each token generated by the model is based on previously generated tokens and the encoder's context information.
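
A minimal sketch of the masking step (single head, random placeholder weights): entries above the diagonal of the score matrix are set to negative infinity before the softmax, so each position receives zero weight from anything that comes after it.

```python
import math
import torch

def masked_self_attention(X, W_q, W_k, W_v):
    """Self-attention with a causal mask: position i may only attend to positions <= i."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    n = scores.size(-1)
    causal = torch.tril(torch.ones(n, n, dtype=torch.bool))   # lower-triangular mask
    scores = scores.masked_fill(~causal, float("-inf"))       # future positions get zero weight
    return torch.softmax(scores, dim=-1) @ V

X = torch.rand(7, 64)
W_q, W_k, W_v = (torch.rand(64, 64) for _ in range(3))
print(masked_self_attention(X, W_q, W_k, W_v).shape)          # torch.Size([7, 64])
```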

Positional Encoding

Since the Transformer model itself has no built-in notion of sequence order, positional encoding is introduced to supply positional information, allowing the model to capture the relative positions of words in a sequence. This handling of positional information lets the Transformer model context effectively even though it processes all positions in parallel, providing the necessary conditions for learning language structure.
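
As a concrete illustration, the sketch below implements the fixed sinusoidal positional encoding from the original Transformer paper and adds it to the token representations. Llama 3 itself uses a different scheme (rotary position embeddings), but the principle of injecting position information is the same.

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len, d_model):
    """Fixed sine/cosine positional encodings from the original Transformer paper."""
    pos = torch.arange(seq_len).unsqueeze(1).float()               # positions 0 .. seq_len-1
    div = torch.exp(torch.arange(0, d_model, 2).float()
                    * (-math.log(10000.0) / d_model))              # frequency for each dimension pair
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)                             # even dimensions
    pe[:, 1::2] = torch.cos(pos * div)                             # odd dimensions
    return pe

x = torch.rand(10, 512)                                            # token representations without order
x = x + sinusoidal_positional_encoding(10, 512)                    # adding the encoding injects position
```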

Optimization

The Transformer model employs residual connections and layer normalization to enhance training effectiveness.

The main idea behind residual connections is to alleviate the vanishing gradient problem by connecting layers through skips. After each sublayer, its input is added to its output to form a residual connection:

$$\mathrm{output} = \mathrm{LayerNorm}\big(x + \mathrm{Sublayer}(x)\big)$$

where $\mathrm{Sublayer}(x)$ denotes the function computed by the sublayer (self-attention or the feed-forward network).

This design facilitates gradient propagation during backpropagation, thereby improving the model's convergence speed.

Layer normalization enhances the training stability of the model by normalizing each layer's input. During training, the combination of residual connections and layer normalization enables the model to effectively utilize deep structures without easily falling into local optima.
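
A minimal sketch of this "add and normalize" step, with a feed-forward block standing in for the sublayer:

```python
import torch
import torch.nn as nn

# Sketch of the residual + layer-norm wrapper around a sublayer:
# output = LayerNorm(x + Sublayer(x))
d_model = 512
norm = nn.LayerNorm(d_model)
sublayer = nn.Sequential(nn.Linear(d_model, 2048), nn.ReLU(), nn.Linear(2048, d_model))

x = torch.rand(10, d_model)
out = norm(x + sublayer(x))   # the skip connection keeps a direct path for gradients
print(out.shape)              # torch.Size([10, 512])
```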

Applications

The Transformer model has been widely applied in numerous NLP tasks, especially in generation tasks, with the GPT series being one of its most successful applications.

ChatGPT Introduction[3] (Please note, the following content was directly generated by ChatGPT o1-preview)

GPT employs a decoder-based Transformer architecture and uses an autoregressive model to generate text. During generation, the model receives the previously generated words (or tokens) as input and keeps predicting the next word until completion. Specifically, the generation process is iterative, with each step relying on the output of the previous steps. This design enhances coherence in generation and allows the model to excel in diversity and creativity. Thanks to its strong generative capabilities, GPT has demonstrated outstanding performance in dialogue systems, content creation, and translation. Leveraging the context it has generated, GPT can produce long texts that fit that context logically, making it one of the most influential natural language processing models today.
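
The sketch below spells this loop out with Hugging Face Transformers. "gpt2" is used only because it is small and publicly available; the same loop applies to any decoder-only model, Llama 3 included (greedy decoding is used here for simplicity).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Autoregressive generation step by step: each new token is predicted from everything
# generated so far and then appended to the input.
tokenizer = AutoTokenizer.from_pretrained("gpt2")    # small public model; stand-in for Llama 3
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tokenizer("The Transformer architecture", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(20):
        logits = model(ids).logits                               # scores for every vocabulary token
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy choice of the next token
        ids = torch.cat([ids, next_id], dim=-1)                  # feed it back in as input
print(tokenizer.decode(ids[0]))
```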

Implementation

We chose to call the original Llama 3 model through an OpenAI-compatible API, reusing the same pipeline to feed in different dialogue histories. This lets a single Llama 3 instance handle multiple conversations simultaneously, reducing data-loading overhead and memory requirements.
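
A rough sketch of this setup, assuming Llama 3 is served behind an OpenAI-compatible endpoint; the base URL, API key, and model name below are placeholders, not our actual deployment details.

```python
from openai import OpenAI

# One client and one served model instance, several independent conversation histories.
# base_url, api_key, and the model name are placeholders for the real deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

conversations = {
    "session-a": [{"role": "user", "content": "How's the weather in Beijing?"}],
    "session-b": [{"role": "user", "content": "Summarize my last request."}],
}

for session_id, history in conversations.items():
    response = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model identifier
        messages=history,
    )
    history.append({"role": "assistant",
                    "content": response.choices[0].message.content})
```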

When fine-tuning Llama 3, we selected a framework that is user-friendly, efficient in fine-tuning, and low-cost, facilitating rapid evaluation of the effects of our fine-tuning on Llama 3 and subsequent improvements. The results of the fine-tuning can be found in Function Extraction.
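
The exact framework and hyperparameters are not spelled out here; purely as an illustration of what a low-cost fine-tuning configuration of this kind can look like, a LoRA setup with the peft library might be sketched as follows (the model name, target modules, and ranks are assumptions, not our actual configuration).

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Illustrative parameter-efficient fine-tuning setup (LoRA); values are assumptions.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

lora = LoraConfig(
    r=16,                                  # rank of the low-rank update matrices
    lora_alpha=32,                         # scaling factor for the update
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()         # only a small fraction of weights is trained
```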

For more specific applications of this model in our project, please refer to the Function Extraction, Part Matching, and Freshman Mode pages.

[1]: https://www.llama.com

[2]: https://arxiv.org/abs/1706.03762

[3]: https://chatgpt.com
