A foundation model is a large-scale AI model (typically a deep neural network) pretrained on broad data with self-supervised or unsupervised objectives. Because it can be adapted cheaply to a wide range of downstream tasks, it serves as base infrastructure for many applications.
Core characteristics of foundation models include:
| Feature | Description |
|---|---|
| Scale | Typically have billions (or more) of parameters and are trained on vast corpora (text, code, images, etc.) |
| General-Purpose / Transferable | Unlike narrow AI models, they learn broad patterns and can be fine-tuned or prompted for many downstream tasks (chat, translation, image recognition, robotics, etc.) |
| Self-Supervised Training | Training is usually self-supervised or unsupervised, e.g. predicting masked/next tokens in text, or using contrastive losses for images (see the sketch below the table) |
| Emergent Capabilities | When scaled up, they often display surprising emergent abilities (e.g. in-context learning, few-shot reasoning) that were not explicitly programmed |
| Multimodality | Many recent models handle multiple input/output modalities (text, images, audio, etc.) |
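To make the self-supervised training row concrete, here is a minimal sketch of a next-token-prediction objective in PyTorch. The vocabulary size, dimensions, and random "tokens" are toy placeholders; real foundation models apply the same idea with a full Transformer stack and vastly larger data.

```python
import torch
import torch.nn as nn

# Toy next-token prediction: the targets are just the inputs shifted by one,
# so no human annotation is needed (self-supervision).
vocab_size, d_model, seq_len, batch = 1000, 64, 16, 4

embed = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size)

tokens = torch.randint(0, vocab_size, (batch, seq_len))  # text as token ids
hidden = embed(tokens)                                    # stand-in for a Transformer stack
logits = lm_head(hidden)                                  # (batch, seq_len, vocab_size)

# Predict token t+1 from position t: shift inputs vs. targets by one.
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),
    tokens[:, 1:].reshape(-1),
)
print(loss.item())
```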
Most foundation models are deep neural networks built on the Transformer architecture. Self-attention has become the de facto backbone for language and vision models because it handles large, high-dimensional inputs and scales well with data and compute.
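As a rough illustration of the self-attention operation at the core of the Transformer backbone, here is a minimal single-head sketch (dimensions are arbitrary toy values, and real models add multiple heads, masking, and learned projections inside larger blocks):

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a sequence x of shape (seq, d_model)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / (k.shape[-1] ** 0.5)   # pairwise token similarities
    weights = F.softmax(scores, dim=-1)       # attention distribution per token
    return weights @ v                        # weighted mix of value vectors

seq_len, d_model = 8, 32
x = torch.randn(seq_len, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_model) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)  # torch.Size([8, 32])
```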
The standard adaptation method is fine-tuning on smaller, task-specific datasets. More recently, lightweight adaptation (prompt tuning, adapters, LoRA) and in-context learning (prompt engineering) have enabled even lower-cost specialization.
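A minimal sketch of the LoRA idea mentioned above: instead of updating a large pretrained weight matrix, a small low-rank correction is trained while the base weights stay frozen. The rank, shapes, and scaling here are illustrative, not taken from any specific model.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update (LoRA-style)."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # freeze the pretrained weights
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x):
        # Output = frozen W x + scaled low-rank correction (B A) x
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512))
print(layer(torch.randn(4, 512)).shape)  # torch.Size([4, 512])
```

Because only A and B are trained, the number of trainable parameters drops by orders of magnitude compared with full fine-tuning of the base layer.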
Let’s look at some major foundation models:
GPT (OpenAI): The Generative Pre-trained Transformer series (GPT-2, GPT-3, GPT-4) are text-based LLMs trained on vast corpora. GPT-3 (2020) has 175B parameters and can perform many tasks via prompting alone. GPT-4 (2023) expands capabilities (multimodal text+image input) and underlies ChatGPT. These are prototypical FMs for NLP.
BERT (Google): Bidirectional Encoder Representations from Transformers (2018) pioneered deep bidirectional pretraining with a masked language modeling objective. Although comparatively small (~0.34B parameters), it popularized the pretrain/fine-tune transfer-learning recipe for NLP. Many derivatives (RoBERTa, DeBERTa) followed.
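As a hedged illustration of the pretrain/fine-tune recipe BERT popularized, the sketch below loads a pretrained checkpoint via Hugging Face transformers and attaches a fresh classification head; the toy batch stands in for a real task-specific dataset, and the exact API may differ across library versions.

```python
# pip install transformers torch   (sketch; requires downloading the checkpoint)
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2   # new, randomly initialized classifier head
)

# One forward/backward step on a toy labeled pair; real fine-tuning loops
# over a task-specific dataset (e.g. sentiment-labeled sentences).
batch = tokenizer(["a great movie", "a dull movie"], padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])
loss = model(**batch, labels=labels).loss
loss.backward()
print(loss.item())
```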
PaLM (Google): The Pathways Language Model is a Google LLM family. The original PaLM (2022) has 540B parameters; PaLM 2 (2023) is a reportedly ~340B-parameter model with improved multilingual and reasoning skills. Extensions include PaLM-E (a vision-language version for robotics) and AudioPaLM (speech), demonstrating multimodal expansion.
DALL·E (OpenAI): A text-to-image generator pretrained on text-image pairs (the original DALL·E was Transformer-based; later versions use diffusion). It generates novel images from text prompts. Stable Diffusion (Stability AI) is a similar open image model. Vision-language models like CLIP (OpenAI) are pretrained on captioned images with a contrastive loss, enabling zero-shot vision tasks, and DeepMind's Flamingo fuses language and vision in one model. (These are all foundation models for vision or cross-modal tasks.)
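To show what "zero-shot vision tasks" looks like in practice, here is a rough sketch of zero-shot image classification with CLIP through the Hugging Face transformers wrapper. The image path is a placeholder, and the API details may vary between library versions.

```python
# pip install transformers pillow torch   (sketch; downloads the CLIP checkpoint)
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")   # placeholder image path
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# Image-text similarity scores become probabilities over candidate labels,
# with no task-specific training ("zero-shot" classification).
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```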
LLaMA (Meta): Large Language Model Meta AI (2023) is Meta's LLM series (LLaMA 1 and 2), released with openly available weights. LLaMA 2 offers models from 7B to 70B parameters under a permissive community license. More recently, Llama 3.1 (2024) scales to 405B parameters with open weights. These open models have driven a large body of community research.
Examples span domains: GPT-NeoX and BLOOM (open LLMs from EleutherAI and the Hugging Face-led BigScience project), Gato (DeepMind's multi-task model for text, vision, and RL), GLaM, Megatron, Alpaca, and FLAN (instruction-tuned models), plus specialized ones like Med-PaLM (medical QA) or AlphaFold (a protein-structure predictor, not usually called an FM but a similarly massive pretrained model in biology).
Models like Meta's Llama 3.1 (405B), Mistral Large 2 (123B), and Google's Gemma 2 have been released with open or permissive licenses. Open weights let anyone inspect and modify the model, enable on-premise use (running models locally), and allow customization without reliance on a vendor API. Open models democratize AI: smaller companies and researchers can build products or fine-tune models without needing huge GPU clusters.
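One practical consequence of open weights is that inference can run entirely on local hardware. Below is a rough sketch using Hugging Face transformers; the Llama 2 checkpoint name is illustrative (its weights are gated behind Meta's license), and any openly accessible checkpoint can be substituted.

```python
# pip install transformers accelerate torch   (sketch; downloads weights on first run)
from transformers import pipeline

# Runs fully on-premise once the weights are downloaded; no API calls involved.
# "meta-llama/Llama-2-7b-chat-hf" is gated; swap in any open checkpoint you can access.
generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-2-7b-chat-hf",
    device_map="auto",   # spread across available GPUs/CPU
)

print(generator("Explain what a foundation model is in one sentence.",
                max_new_tokens=60)[0]["generated_text"])
```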
Many leading FMs remain proprietary. OpenAI's GPT-4, Anthropic's Claude, Google's Gemini (formerly Bard), and Microsoft's Azure-hosted models are not open-weight. The companies argue that this protects safety (by controlling misuse) and is needed for commercial viability.