LLM Architecture#

Understanding what happens inside the model helps you debug unexpected behavior.

The Transformer#

Introduced in 2017 (“Attention is All You Need”).

  • Encoders (BERT): Good at understanding context and classification.
  • Decoders (GPT): Good at generating text (next-token prediction).
  • Encoder-Decoder (T5, BART): Good at translation and summarization.

Attention Mechanism#

Self-attention allows the model to weigh the importance of every token in the context window relative to the current token being processed.

Sparse MoE (Mixture of Experts)#

Models like Mixtral 8x7B don’t use all parameters for every token. They route tokens to specialized “expert” sub-networks, making inference much faster while maintaining high capability.