Large Language Model

Definition

A large language model (LLM) is a machine learning model trained on a large corpus of text, such as Wikipedia or web crawls. Such models can generate fluent text that resembles the training data, including novel text that does not appear verbatim in the corpus.
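To make the next-token idea concrete, here is a toy sketch (not a real LLM): a bigram model trained on a tiny made-up corpus predicts each word from the previous one, which is the same next-token objective large models optimize at vastly greater scale.

```python
import random
from collections import defaultdict

# Tiny illustrative corpus; real models train on billions of tokens.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Count how often each word follows each other word.
counts = defaultdict(lambda: defaultdict(int))
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def generate(start, n, seed=0):
    """Sample n words, each conditioned on the previous word."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(n):
        nxt = counts[out[-1]]
        if not nxt:
            break
        words, weights = zip(*nxt.items())
        out.append(rng.choices(words, weights=weights)[0])
    return " ".join(out)

print(generate("the", 5))
```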

History

  • 1980s–1990s: Statistical language models (SLMs), notably IBM’s n-gram models
  • Early 2000s: Neural language models (NLMs)
  • Late 2010s: Pre-trained Transformer models such as GPT-2 (2019) and GPT-3 (2020)

Dataset Preprocessing

Tokenization

Splitting raw text into a sequence of tokens: individual words or sub-word units

  • Byte-Pair Encoding (BPE): A sub-word tokenization scheme derived from a compression algorithm. It repeatedly merges the most frequent adjacent pair of symbols into a new token, so common words end up as single tokens while rare words split into sub-words.
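A minimal sketch of the BPE merge loop, using a made-up word-frequency vocabulary (each word is a tuple of symbols):

```python
from collections import Counter

def most_frequent_pair(vocab):
    """Count adjacent symbol pairs across the vocabulary, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        for pair in zip(word, word[1:]):
            pairs[pair] += freq
    return max(pairs, key=pairs.get)

def merge_pair(vocab, pair):
    """Rewrite every word, fusing each occurrence of `pair` into one symbol."""
    merged = {}
    for word, freq in vocab.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

vocab = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2, ("n", "e", "w"): 3}
for _ in range(2):          # two merge steps: ("l","o") then ("lo","w")
    vocab = merge_pair(vocab, most_frequent_pair(vocab))
```

After two merges the frequent word "low" has become a single token, while "lower" and "new" remain split into sub-words.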

Dataset cleaning

  • Remove stop words: Filtering out very common words (e.g. “the”, “of”) that carry little information
  • Stemming: Reducing words to a root form by stripping suffixes, often crudely (“running” → “run”)
  • Lemmatization: Reducing words to their dictionary base form (lemma), taking part of speech into account
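A naive sketch of the first two steps, with a hand-picked stop-word list and crude suffix stripping (real pipelines would use a library such as NLTK or spaCy, and lemmatization needs a dictionary):

```python
# Illustrative only: tiny stop-word list and rule-based "stemming".
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to"}
SUFFIXES = ("ing", "ed", "s")

def clean(text):
    # Lowercase, split on whitespace, drop stop words.
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    stems = []
    for t in tokens:
        # Strip the first matching suffix, keeping at least 3 characters.
        for suf in SUFFIXES:
            if t.endswith(suf) and len(t) > len(suf) + 2:
                t = t[: -len(suf)]
                break
        stems.append(t)
    return stems

print(clean("The cats are running to the mat"))
```

Note the crude stemmer produces “runn” rather than “run”; proper stemmers like Porter’s handle doubled consonants.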

Synthetic Data

Synthetic data is text generated artificially, for example by an existing model or from templates, rather than collected from human sources. It can augment a small dataset, giving a language model more varied examples to train on.
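A minimal sketch of the template approach, with a made-up question template and seed facts (model-generated synthetic data would replace the template with sampled model outputs):

```python
import random

# Hypothetical template and seed set, purely for illustration.
TEMPLATE = "What is the capital of {country}?"
SEED = {"France": "Paris", "Japan": "Tokyo", "Kenya": "Nairobi"}

def make_examples(seed, n, seed_value=0):
    """Expand a small seed set into n prompt/answer training pairs."""
    rng = random.Random(seed_value)
    items = list(seed.items())
    out = []
    for _ in range(n):
        country, capital = rng.choice(items)
        out.append({"prompt": TEMPLATE.format(country=country), "answer": capital})
    return out

examples = make_examples(SEED, 5)
```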

Training

Cost

  • Training time: How long training takes on the available hardware
  • Memory usage: The memory needed to hold the weights, activations, and optimizer states during training
  • GPU usage: The number of accelerators (and GPU-hours of compute) the training run requires
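These costs can be estimated on the back of an envelope. A widely used rule of thumb puts training compute at roughly 6 × N × D FLOPs for N parameters and D training tokens; the numbers below are illustrative, not from any specific model:

```python
def training_flops(n_params, n_tokens):
    """Rough training compute via the ~6 * N * D FLOPs rule of thumb."""
    return 6 * n_params * n_tokens

def weight_memory_gb(n_params, bytes_per_param=2):
    """Memory for the weights alone; 2 bytes/param corresponds to fp16."""
    return n_params * bytes_per_param / 1e9

N = 7e9    # a hypothetical 7B-parameter model
D = 1e12   # trained on 1 trillion tokens
flops = training_flops(N, D)   # ~4.2e22 FLOPs
mem = weight_memory_gb(N)      # ~14 GB just for fp16 weights
```

Optimizer states and activations multiply the memory figure several times over during training.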

Fine-tuning

Fine-tuning is a technique used to adapt a pre-trained language model to a specific task by continuing training on task-specific data. Some layers may be kept frozen while others are unfrozen and updated; often only a small task head or a subset of the weights is retrained.
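A conceptual sketch of freezing and unfreezing, with each “layer” reduced to a single weight and a trainable flag (a real framework would use a mechanism such as PyTorch’s requires_grad):

```python
# Toy "model": pre-trained layers are frozen, only the new head trains.
layers = [
    {"name": "embedding", "w": 0.5, "trainable": False},
    {"name": "block_1",   "w": 0.1, "trainable": False},
    {"name": "head",      "w": 0.0, "trainable": True},   # task-specific head
]

def sgd_step(layers, grads, lr=0.1):
    """Apply one gradient step, skipping frozen layers."""
    for layer, g in zip(layers, grads):
        if layer["trainable"]:
            layer["w"] -= lr * g

sgd_step(layers, grads=[1.0, 1.0, 1.0])
# Only the head's weight moved; the frozen layers are untouched.
```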

Architecture

Attention Mechanism and context window

The attention mechanism lets the model weigh different parts of the input sequence when producing each output token, attending strongly to relevant tokens and weakly to irrelevant ones. The context window is the maximum number of tokens the model can attend over at once; text outside the window cannot influence the output.
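The core computation is scaled dot-product attention: weights = softmax(q · kᵢ / √d), and the output is the weighted sum of the values. A plain-Python sketch for a single query over a two-token context (vectors are made up):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attend(query, keys, values):
    d = len(query)
    # Dot product of the query with each key, scaled by sqrt(d).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    # Output = attention-weighted sum of the value vectors.
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]

q = [1.0, 0.0]
ks = [[1.0, 0.0], [0.0, 1.0]]     # the first key matches the query
vs = [[10.0, 0.0], [0.0, 10.0]]
out = attend(q, ks, vs)           # pulled toward the first value vector
```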

Mixture of experts

A mixture of experts (MoE) model replaces some layers, typically the feed-forward blocks, with several parallel “expert” networks and a router. For each token, the router activates only a few experts and combines their outputs, so the model can have a very large total parameter count while keeping the compute per token modest.
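A sketch of top-k routing, with simple functions standing in for the expert feed-forward blocks (the expert functions and scores here are made up):

```python
def top_k_route(scores, k=2):
    """Pick the k highest-scoring experts and renormalize their scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    total = sum(scores[i] for i in ranked)
    return [(i, scores[i] / total) for i in ranked]

# Hypothetical experts: trivial functions in place of neural sub-networks.
experts = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x / 2]

def moe_layer(x, router_scores, k=2):
    """Run only the routed experts and blend their outputs by weight."""
    return sum(weight * experts[i](x) for i, weight in top_k_route(router_scores, k))

y = moe_layer(4.0, router_scores=[0.1, 0.6, 0.05, 0.3], k=2)
```

With k=2 only experts 1 and 3 run, so half the experts contribute zero compute for this token.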

Parameter size

The parameter size of a large language model is the number of trainable weights it contains. Together with the numeric precision of those weights, it determines how much memory is needed to store and run the model, and therefore what hardware is required to train or serve it.

  • Quantization: Reducing the model’s memory footprint by representing weights with fewer bits (e.g. 8-bit or 4-bit integers instead of 16- or 32-bit floats), at the cost of some rounding error.
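A sketch of symmetric 8-bit quantization: map float weights onto the integer range [-127, 127] with a single scale factor, then dequantize to see the rounding error (the weight values are made up):

```python
def quantize(weights):
    """Symmetric int8 quantization: one shared scale, integers in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the integers and the scale."""
    return [v * scale for v in q]

w = [0.4, -1.27, 0.03]
q, scale = quantize(w)            # 8-bit integers plus one float scale
approx = dequantize(q, scale)     # close to w, within half a quantization step
```

Storing int8 values instead of fp32 cuts the weight memory by roughly 4x, which is why quantization is popular for inference.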


Artificial Intelligence

Concept

Artificial Intelligence (AI) is intelligence demonstrated by machines, unlike the natural intelligence displayed by humans and animals. It involves the use of computational processes to simulate intelligent behavior. The term “intelligence” refers to the ability of a machine to acquire and apply knowledge and skills, and to reason and problem-solve autonomously.

Techniques

  • Machine Learning: The process of training a machine to learn from experience and improve its performance on a task.
  • Deep Learning: A subset of machine learning that involves training a machine to learn from large datasets using neural networks.
  • Natural Language Processing: The use of AI to understand and manipulate human language.
