The Five Technologies Driving AI 3.0
A new, third era has begun, led by five AI revolutions projected to increase global GDP by 55% (GitHub, 2022). These will drive significant creation, capture, and consolidation of value across seemingly unrelated markets, geographies, and customer segments, reshaping proptech and driving cross-industry consolidation (Forbes, 2022).
Multimodal, Multi-task, Zero-shot, and Open-Vocabulary
The era of AI 1.0 was narrow. AI 1.0 models were small, trained on hundreds of thousands of private training examples to classify or predict a narrow set of predefined labels. AI 1.0 was driven by the development of new hardware and model architectures that demonstrated state-of-the-art results in image classification, object detection, and semantic segmentation.
In 2018, Howard & Ruder (2018) introduced domain-agnostic pretraining, a technique which allowed researchers to train models on a mix of cheap, ubiquitous domain-agnostic data and expensive, limited datasets covering a single task and a narrow set of predefined labels. This marked the start of AI 2.0: larger datasets enabled the development of larger, more accurate models, still well suited to their narrow, predefined tasks. AI 2.0 allowed researchers to scale models using large datasets while reducing the marginal cost of the task-specific training data required.
AI 3.0 isn’t the era of Large Language Models (LLMs) but the era of Multimodal, Multi-task, Zero-shot and Open-Vocabulary models. These pre-trained models can be designed to take as input images, text, video, depth-maps, or point-clouds (multimodal), to solve a broad range of tasks without a limited set of classification labels (open-vocabulary), and to do so without task-specific training data (zero-shot). In AI 3.0, new techniques allow researchers to amortize the cost of model development and serving across tasks and modalities, reducing the need for task-specific infrastructure or labelled training data.
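To make the zero-shot, open-vocabulary idea concrete, here is a minimal sketch (the helper and prompt wording are illustrative, not any particular model's API): the label set is supplied at inference time, so neither task-specific training data nor a fixed label vocabulary is baked into the model.

```python
def zero_shot_prompt(text: str, candidate_labels: list[str]) -> str:
    """Build a zero-shot classification prompt. The labels arrive at
    inference time (open-vocabulary), and no task-specific training
    examples are required (zero-shot)."""
    labels = ", ".join(candidate_labels)
    return (
        f"Classify the following text into one of these categories: {labels}.\n"
        f"Text: {text}\n"
        "Category:"
    )
```

Swapping in a new label set requires only a new prompt, not new training data or a retrained classifier head.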
Architecture, Quality and Inverse-Scaling
In AI 2.0, modalities were tied to particular model architectures: LSTMs became popular for language modelling, CNNs for image modelling, FFNs for tabular data, and so on. In AI 3.0, the Transformer architecture (Vaswani et al., 2017) has allowed the same architecture to be reused across increasingly larger datasets, spanning a diverse set of modalities from text to images, and even video.
Transformers are not without flaws, however: they are memory-intensive and hard to train. These memory and training requirements have demanded increasingly large datasets and compute budgets, and have practically limited the input length Transformers can ingest. Despite these challenges, Transformers have changed the economics of innovation: improvements like FlashAttention (Dao et al., 2022), ALiBi (Press et al., 2021) and Multi-Query Attention (Shazeer, 2019) that benefit one modality benefit all modalities. This is profound, and it largely characterized the ‘arms race’ between 2017 and 2022, as industrial labs acquired increasingly large data centers to scale their Transformer models to larger and larger datasets.
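A back-of-the-envelope sketch of why input length is limited: self-attention materializes an n×n score matrix per head, so this term alone grows quadratically with sequence length (the head count and byte width below are illustrative defaults, not any specific model's).

```python
def attn_score_bytes(seq_len: int, n_heads: int = 16, bytes_per_el: int = 2) -> int:
    """Memory for the self-attention score matrices alone: one
    (seq_len x seq_len) matrix per head. This quadratic term is what
    practically limits the input length a Transformer can ingest."""
    return n_heads * seq_len * seq_len * bytes_per_el
```

Doubling the context window from 1,024 to 2,048 tokens quadruples this memory, which is exactly the cost that techniques like FlashAttention attack by never materializing the full matrix.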
While these increases in model size, data and compute have all driven progress in the past, it’s not obvious that scale is still the answer. Recent works like Chinchilla (Hoffmann et al., 2022) on model size, Galactica (Taylor et al., 2022), LLaMA (Touvron et al., 2023), and Phi-1 (Gunasekar et al., 2023) on pretraining, and Alpaca (Taori et al., 2023), LIMA (Zhou et al., 2023) and Orca (Mukherjee et al., 2023) on fine-tuning all point to the importance of quality over quantity. Furthermore, beyond the practical limits to data acquisition (Villalobos et al., 2022), papers like Schaeffer et al. (2023) and McKenzie et al. (2023) demonstrate the limits and harms of scale: given the capacity, models tend to memorize responses rather than understand their inputs.
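Chinchilla's headline finding can be summarised as a rule of thumb (the exact ratio varies with the compute budget; 20 tokens per parameter is the commonly quoted approximation): data should scale with model size, rather than parameters growing alone.

```python
def chinchilla_tokens(n_params: int) -> int:
    """Chinchilla's compute-optimal heuristic (Hoffmann et al., 2022),
    as a rule of thumb: train on roughly 20 tokens per parameter."""
    return 20 * n_params
```

By this heuristic a 70-billion-parameter model wants around 1.4 trillion training tokens, which is why data quality and availability, not parameter count, become the binding constraint.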
Retrieval and Prompting
Deep Learning models are simply stacks of smaller, shallow models, optimized jointly during training to minimize the discrepancy between the model’s final predictions and some labels. Each layer extracts increasingly abstract features from the input data, gradually transforming it into a more meaningful representation. The depth of the model allows for hierarchical learning: lower layers capture low-level patterns, while higher layers capture more abstract, semantically meaningful representations.
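As a toy illustration of that stacking (the weights below are arbitrary, chosen by hand rather than learned): each layer is a shallow affine-map-plus-nonlinearity model, and depth is simply composition, with one layer's features feeding the next.

```python
def layer(weights: list[float], bias: float, x: list[float]) -> float:
    """One shallow model: an affine map followed by a ReLU non-linearity."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return max(0.0, z)

def deep_model(x: list[float]) -> float:
    """A 'deep' model is shallow layers composed: lower layers extract
    simple features, higher layers combine them into abstract ones."""
    h1 = [layer([1.0, -1.0], 0.0, x),   # low-level feature: difference
          layer([0.5, 0.5], 0.0, x)]   # low-level feature: average
    return layer([1.0, 1.0], 0.0, h1)  # higher layer combines both
```

During training, all weights in all layers would be adjusted jointly by gradient descent against the final prediction error, which is what makes the intermediate representations emerge.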
With the development of modern vector databases (Bridgwater, 2023), semantic search (Nayak, 2019) arrived in 2019 to disrupt almost 20 years of search stagnation dominated by the BM25 (Robertson et al., 2000) and PageRank (Brin & Page, 1998) algorithms. Now, AI 3.0 is disrupting search again, powering new experiences like multimodal and generative search (Armano, 2023).
While large AI 3.0 models can often complete tasks without task-specific training data, examples are often necessary to reach the levels of performance and reliability needed in end-user applications. Here, while AI models are disrupting search, search is empowering AI models with live knowledge bases and ‘textbooks’ of example responses. These examples prompt the models with context on the style of answer required and provide the up-to-date information needed to give answers that are factually correct.
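A minimal sketch of this retrieve-then-prompt loop (the two-dimensional "embeddings" below are toy stand-ins for what an embedding model and a vector database would provide at scale):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: the standard relevance score in semantic search."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec: list[float], corpus: list[tuple[str, list[float]]]) -> str:
    """Return the most similar document -- the role a vector database
    plays over millions of embeddings."""
    return max(corpus, key=lambda doc: cosine(query_vec, doc[1]))[0]

def build_prompt(question: str, context: str) -> str:
    """Prepend retrieved context so the model can answer from live,
    up-to-date knowledge rather than stale training data."""
    return f"Context: {context}\nQuestion: {question}\nAnswer:"
```

The retrieved passage grounds the model's answer in current facts, while retrieved example responses set the expected style, without any fine-tuning.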
Parameter-Efficient Fine-tuning (PEFT), Adaptation and Pretrained Foundation Models
The trajectory of AI is a trajectory of economics: how can we minimize our cost per unit of accuracy? In AI 2.0 we reduced the marginal cost of data by using large, domain-agnostic datasets for unsupervised pretraining, and we amortized the cost of pretraining across tasks using techniques like transfer learning and fine-tuning to repurpose the lower and intermediate layers of pre-trained AI models. This unlocked fertile ground for pre-trained model repositories like TensorFlow Hub, PyTorch Hub (PyTorch, 2019), Hugging Face, PaddleHub (PaddlePaddle, 2020), timm and Kaggle Models (Kaggle, 2023), and later, in 2020, AdapterHub (Pfeiffer et al., 2020), to share and compare pre-trained and fine-tuned models.
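The fine-tuning economics above reduce to simple arithmetic: freeze the pretrained lower layers and train only the task head (the layer sizes below are illustrative).

```python
def trainable_params(layer_sizes: list[int], n_frozen: int) -> int:
    """Transfer-learning cost sketch: reuse the pretrained lower layers
    (indices < n_frozen, frozen) and train only the layers above them,
    typically a small task-specific head."""
    return sum(layer_sizes[n_frozen:])
```

Freezing two large pretrained layers and training a 1,000-parameter head turns a million-parameter training problem into a thousand-parameter one, which is the amortization that made model hubs so valuable.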
In AI 3.0 we have not only amortized the cost of pre-training but also reduced and amortized the cost of fine-tuning models across modalities and tasks. This shift is the reason we are seeing an explosion of AI-as-a-service platforms like OpenAI’s API, Replicate and OctoML, which allow users to share large, serverless, pre-trained model endpoints.
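One reason fine-tuning has become so cheap is low-rank adaptation: instead of updating a full d_in × d_out weight matrix, PEFT methods like LoRA train two small factors A (d_in × r) and B (r × d_out). A sketch of the parameter arithmetic (the dimensions used below are illustrative):

```python
def lora_trainable(d_in: int, d_out: int, rank: int) -> int:
    """LoRA-style PEFT: the frozen weight W stays fixed, and only the
    low-rank update A @ B is trained, so trainable parameters drop from
    d_in * d_out to rank * (d_in + d_out)."""
    return rank * (d_in + d_out)
```

For a 4096×4096 attention projection at rank 8, that is 65,536 trainable parameters instead of ~16.8 million, a reduction of over 250x per matrix, which is what lets a single shared base model serve many cheap task-specific adapters.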
Quantization, Acceleration and Cost
In the 2000s, the Cloud, Microservices, and Serverless changed the economics of the web, unlocking tremendous value for hardware vendors, big tech and small startups. The Cloud reduced the fixed and upfront costs of web hosting, Microservices reduced the unit of development, and Serverless reduced the unit of scaling. Large Language Models (LLMs), however, cannot simply work with Serverless! Serverless is driven by cold-start times: the cold start of a typical AWS Lambda function is around 250ms (Shilkov, 2023), while the cold start of Banana.dev (a serverless AI hosting platform) is around 14 seconds (Banana.dev, 2023), roughly 50x slower. This is obvious and unavoidable when we consider the size, complexity and dependencies of modern AI models. While the MPT-7B LLM is roughly a third the size of the Yandex codebase, user queries may touch only 0.1% of that codebase at a given time, whereas generating text with MPT-7B requires all 13GB of weights multiple times for each word.

Here, recent innovations in Sparsification (with SparseGPT (Frantar & Alistarh, 2023)), Runtime Compilation (with ONNX (Hugging Face, 2022) and TorchScript), Quantization (with GPTQ (Frantar et al., 2022), QLoRA (Dettmers et al., 2023) and FP8 (Micikevicius et al., 2022)), Hardware (with Nvidia Hopper (Lambda Labs, 2022)) and Frameworks (with Triton (Tillet et al., 2019) and PyTorch 2.0 (PyTorch, 2023)) serve to reduce model size and latency by more than 8x while preserving 98% of model performance on downstream tasks, much as Pruning, Neural Architecture Search (NAS) and Distillation did in AI 2.0. This radically changes the economics of model serving and may be the driver behind OpenAI reducing its API costs twice in one year (Wiggers, 2023).
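A minimal sketch of the quantization idea (symmetric, per-tensor int8; production schemes like GPTQ are considerably more sophisticated): each float32 weight is stored as a single signed byte plus one shared scale, roughly a 4x size reduction before any latency gains.

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric int8 quantization sketch: map floats in
    [-max|w|, +max|w|] onto integers in [-127, 127], storing one
    float scale for the whole tensor."""
    scale = max(abs(w) for w in weights) / 127.0
    return [round(w / scale) for w in weights], scale

def dequantize(q: list[int], scale: float) -> list[float]:
    """Recover approximate float weights at inference time."""
    return [qi * scale for qi in q]
```

Each weight is recovered to within one quantization step (the scale), which is why well-tuned int8 or 4-bit schemes lose so little downstream accuracy while shrinking the bytes that must move through memory for every generated token.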