LLM Architecture#
Understanding what happens inside the model helps you debug unexpected behavior.
The Transformer#
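The Transformer was introduced in 2017 in the paper “Attention Is All You Need”. The original design paired an encoder with a decoder; later models use one half or both:
- Encoder-only (BERT): Reads the whole input bidirectionally. Good at understanding context and classification.
- Decoder-only (GPT): Generates text left to right through next-token prediction.
- Encoder-decoder (T5, BART): Maps an input sequence to an output sequence. Good at translation and summarization.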
Attention Mechanism#
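Self-attention lets the model weigh the importance of every token in the context window relative to the token currently being processed. Each token's output is a weighted average of the other tokens' value vectors, with weights derived from query-key similarity. In decoder-only models, a causal mask restricts each token to itself and earlier positions.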
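To make the weighting concrete, here is a minimal sketch of scaled dot-product attention in plain Python. It assumes the queries, keys, and values are the raw token vectors themselves; real models first apply learned projection matrices and use multiple heads. The function names and toy vectors are illustrative, not taken from any library.

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values, causal=False):
    """Scaled dot-product attention over lists of vectors (one per token)."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        # Similarity of the current token's query to every key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        if causal:
            # Decoder-style mask: position i may not look at positions > i.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out = [sum(w * v[c] for w, v in zip(weights, values))
               for c in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy example: three 2-dimensional "token" vectors (hypothetical values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens, causal=True)
```

Each row of `weights` sums to 1, and with `causal=True` the first token can only attend to itself. Inspecting these weights is one way to see which context tokens a model is leaning on when its output surprises you.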
Sparse MoE (Mixture of Experts)#
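Models like Mixtral 8x7B don’t use all parameters for every token. In each MoE layer, a router sends the token to a small subset of “expert” feed-forward sub-networks (2 of 8 in Mixtral), so only about 13B of the model's roughly 47B parameters are active per token. This lowers per-token compute compared with a dense model of the same total size while maintaining high capability. All experts must still be held in memory, so the saving is in compute, not in memory footprint.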
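The routing step can be sketched as follows. This is a toy illustration of top-k gating, assuming the router scores are already given; in a real model they come from a learned linear layer applied to the token's hidden state, and each expert is a feed-forward network. The names `top_k_route`, `moe_layer`, and the scaling "experts" are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(router_logits, k=2):
    """Pick the k highest-scoring experts and renormalize their gate weights."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    gates = softmax([router_logits[i] for i in chosen])
    return list(zip(chosen, gates))

def moe_layer(token, experts, router_logits, k=2):
    """Combine the outputs of only the selected experts, weighted by their gates."""
    output = [0.0] * len(token)
    for idx, gate in top_k_route(router_logits, k):
        expert_out = experts[idx](token)  # the other experts are never run
        output = [o + gate * e for o, e in zip(output, expert_out)]
    return output

# Eight toy experts: expert s simply scales its input by s (assumption for the demo).
experts = [lambda x, s=s: [s * v for v in x] for s in range(1, 9)]
# Hand-written router scores for one token (assumption for the demo).
router_logits = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 1.0]

routes = top_k_route(router_logits, k=2)
result = moe_layer([1.0, 2.0], experts, router_logits, k=2)
```

Only two of the eight expert functions are called for this token, which is where the compute saving comes from. The router's choices vary per token and per layer, so different tokens in the same prompt can take different paths through the model.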