Large Language Model
Definition
A large language model (LLM) is a machine learning model trained on a large corpus of text, such as Wikipedia or web crawl data. These models can generate high-quality text in the style of their training data, including text that never appears in it.
History
- 1980s-1990s: statistical language models (SLMs), such as IBM's n-gram-based models
- 2000s: neural language models (NLMs), beginning with feed-forward networks
- 2010s-2020s: large pre-trained Transformer models, e.g., GPT-2 (2019) and GPT-3 (2020)
Dataset Preprocessing
Tokenization
Splitting text into individual words or subword units
- Byte-Pair Encoding (BPE): A subword tokenization method adapted from a compression algorithm. Starting from single characters (or bytes), it repeatedly merges the most frequent adjacent pair of symbols into a new token; see the sketch below.
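A minimal sketch of the BPE merge loop on a toy corpus (the words and frequencies are invented for illustration):

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across the corpus."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, words):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word (split into characters) -> frequency.
words = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6}
for step in range(5):
    pairs = get_pair_counts(words)
    best = max(pairs, key=pairs.get)   # most frequent adjacent pair
    words = merge_pair(best, words)
    print(f"merge {step + 1}: {best}")
```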
Dataset cleaning
- Remove stop words: Common function words (e.g., "the", "is") that carry little information for many tasks
- Stemming: Heuristically stripping affixes to reduce words to a root form (e.g., "studies" → "studi"); the result need not be a real word
- Lemmatization: Reducing words to their dictionary base form, or lemma (e.g., "studies" → "study"), using vocabulary and part-of-speech information; all three steps are contrasted in the sketch below
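A minimal sketch of these cleaning steps, assuming the NLTK library is installed (the token list is illustrative):

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Corpora needed by the stop-word list and the lemmatizer.
nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

tokens = ["the", "studies", "are", "running", "better"]

# Stop-word removal: drop common function words.
stops = set(stopwords.words("english"))
content = [t for t in tokens if t not in stops]

# Stemming: heuristic suffix stripping; may produce non-words.
stemmer = PorterStemmer()
print([stemmer.stem(t) for t in content])   # e.g. ['studi', 'run', 'better']

# Lemmatization: dictionary lookup to the base form (lemma).
lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(t, pos="v") for t in content])
```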
Synthetic Data
Synthetic data is text generated artificially, typically by an existing model, rather than collected from human sources. It can be used to augment a small dataset: a model trained or prompted on the available data generates new examples, which are filtered and added to the training set.
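One common recipe, sketched below with the Hugging Face transformers library, is to sample candidate text from an existing model and keep only outputs that pass a filter; the model choice and the crude length filter here are illustrative, not a prescribed pipeline:

```python
from transformers import pipeline

# Illustrative: sample candidate training texts from an existing model.
generator = pipeline("text-generation", model="gpt2")
prompt = "Question: What is a language model?\nAnswer:"
candidates = generator(prompt, max_new_tokens=40,
                       num_return_sequences=3, do_sample=True)

# Keep only outputs passing a simple quality filter (a real pipeline
# would use stronger filters, e.g. heuristics or a reward model).
synthetic = [c["generated_text"] for c in candidates
             if len(c["generated_text"].split()) > 10]
print(f"kept {len(synthetic)} of {len(candidates)} samples")
```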
Training
Cost
- Training time: The wall-clock time needed to train the model; it grows with parameter count, dataset size, and the hardware available
- Memory usage: The memory needed during training for weights, gradients, optimizer states, and activations (a rough estimate is sketched below)
- GPU usage: The number of GPUs, and total GPU-hours, the training run consumes
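A rough back-of-the-envelope estimate of training memory, assuming fp32 weights plus fp32 gradients and the Adam optimizer's two moment buffers; activation memory, which depends on batch size and sequence length, is deliberately ignored:

```python
def training_memory_gb(n_params, bytes_per_param=4):
    """Rough lower bound: weights + gradients + 2 Adam moment buffers.

    Ignores activations, which often dominate and depend on batch size,
    sequence length, and checkpointing strategy.
    """
    weights = n_params * bytes_per_param
    grads = n_params * bytes_per_param
    optimizer = 2 * n_params * bytes_per_param  # Adam: 1st and 2nd moments
    return (weights + grads + optimizer) / 1e9

# Example: a 7B-parameter model in fp32 needs on the order of
# 7e9 * 16 bytes = 112 GB before activations are even counted.
print(f"{training_memory_gb(7e9):.0f} GB")
```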
Fine-tuning
Fine-tuning adapts a pre-trained language model to a specific task by continuing training on task-specific data. Typically some or all of the pre-trained layers are unfrozen and re-trained, often at a lower learning rate, sometimes together with a newly added task head.
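A minimal sketch using the Hugging Face transformers library: the pre-trained encoder starts frozen and only a new classification head trains (the checkpoint name, label count, and learning rate are illustrative):

```python
import torch
from transformers import AutoModelForSequenceClassification

# Illustrative checkpoint; any pre-trained encoder works the same way.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

# Freeze the pre-trained encoder; only the new classifier head trains.
for param in model.bert.parameters():
    param.requires_grad = False

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3)

# Later, some or all encoder layers can be unfrozen and re-trained at a
# lower learning rate to adapt the whole model to the task.
```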
Architecture
Attention Mechanism and context window
The attention mechanism lets the model weigh different parts of the input sequence when producing each output token: relevant positions receive high attention weights, irrelevant ones low weights. The context window is the maximum number of tokens the model can attend over at once; anything outside it is invisible to the model.
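A minimal NumPy sketch of scaled dot-product attention, the core computation softmax(QKᵀ/√d_k)V; the sequence length and dimensions are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of each query to each key
    weights = softmax(scores, axis=-1)   # how much each position attends to others
    return weights @ V

rng = np.random.default_rng(0)
seq_len, d_k = 4, 8                      # 4 tokens in the context window
Q = rng.standard_normal((seq_len, d_k))
K = rng.standard_normal((seq_len, d_k))
V = rng.standard_normal((seq_len, d_k))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```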
Mixture of experts
A mixture of experts (MoE) replaces a single large feed-forward block with several smaller "expert" networks and a router (gating network). For each token, the router selects a small subset of experts and combines their outputs, so only a fraction of the model's parameters are active per token, decoupling total parameter count from per-token compute.
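A minimal NumPy sketch of top-k routing; the expert count, dimensions, and single-matrix "experts" are simplifications of the feed-forward blocks used in practice:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 4, 2

# Each "expert" is a small feed-forward layer; here just one weight matrix.
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts))

def moe_forward(x):
    """Route token x to its top-k experts and mix their outputs."""
    logits = x @ router                      # router score per expert
    top = np.argsort(logits)[-top_k:]        # indices of the top-k experts
    gates = np.exp(logits[top])
    gates /= gates.sum()                     # softmax over selected experts only
    return sum(g * (x @ experts[i]) for g, i in zip(gates, top))

token = rng.standard_normal(d_model)
print(moe_forward(token).shape)  # (16,): only 2 of the 4 experts ran
```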
Parameter size
The parameter size of a large language model is the number of weights it contains; the memory required to store them is the parameter count times the bytes per parameter. This determines the hardware needed to train and serve the model.
- Quantization: Shrinking the model by representing weights with fewer bits (e.g., 8-bit integers instead of 32-bit floats), trading a small amount of accuracy for memory and speed; see the sketch below.
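A minimal sketch of symmetric int8 quantization of a weight vector; production schemes typically quantize per channel or per group and calibrate scales more carefully:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric int8 quantization: map floats onto [-127, 127]."""
    scale = np.abs(w).max() / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).standard_normal(1000).astype(np.float32)
q, scale = quantize_int8(w)

print(f"fp32: {w.nbytes} bytes, int8: {q.nbytes} bytes")  # 4000 vs 1000
print(f"max error: {np.abs(w - dequantize(q, scale)).max():.4f}")
```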
Artificial Intelligence
Concept
Artificial Intelligence (AI) is intelligence demonstrated by machines, as opposed to the natural intelligence displayed by humans and animals. It uses computational processes to simulate intelligent behavior; here, "intelligence" refers to a machine's ability to acquire and apply knowledge and skills, and to reason and solve problems autonomously.
Techniques
- Machine Learning: Training a system to improve its performance on a task from data and experience, rather than from explicit programming.
- Deep Learning: A subset of machine learning that trains multi-layer neural networks on large datasets.
- Natural Language Processing: The use of AI to understand, generate, and manipulate human language.