In a Pickle About AI? Let's Relish the Latest Developments!

5 August 2024 | 2 min

A jar of pickles surrounded by robot arms.

We often have folks reach out to us feeling a bit pickled about the latest AI developments. This reading list provides an overview of key papers and developments in modern AI, focusing on transformers, new architectures, efficient fine-tuning methods, representation learning, and inference optimizations. These resources will help you understand the foundations of current AI models and their practical applications.

1. The Big Dill: Transformers

Let’s start with the big dill: Transformers. These models have really cornered the market lately, with architectures like GPT leaving us in awe. But the innovation doesn’t stop there. New architectures are emerging to challenge Transformers’ dominance:

  • Mamba: A Structured State Space sequence model that replaces both attention and MLP blocks, offering fast inference and impressive performance at reduced model sizes.
  • RetNet: Introduces a Retentive Network block that can be computed in both parallel and recurrent modes, combining the training parallelism of Transformers with the inference speed of RNNs.
  • RWKV: The Receptance Weighted Key Value model allows for parallel training and RNN-like inference, showing comparable performance to state-of-the-art models.

These models are pushing the boundaries of what’s possible beyond the standard Transformer, offering new ways to balance performance and efficiency. They are also opening the door to hybrid models that combine the best of both worlds, and to expanded attention windows that capture longer-range dependencies.
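
To make the parallel-versus-recurrent idea concrete, here is a toy sketch of the kind of state update these architectures exploit. It is a bare-bones linear-attention-style recurrence with a made-up scalar decay, not any paper’s actual parameterization; the point is simply that inference carries a fixed-size state from token to token instead of attending over the whole history.

```python
import torch

def recurrent_step(state, q, k, v, decay=0.9):
    """One RNN-like inference step: fold this token's key/value outer product
    into a fixed-size state, then read the state out with the query.
    (Toy linear-attention-style update; the scalar decay is illustrative,
    not the learned per-head decay used by RetNet or RWKV.)"""
    state = decay * state + torch.outer(k, v)  # (d, d) state: O(1) memory per token
    return state, q @ state                    # this token's output

d = 8
state = torch.zeros(d, d)
for _ in range(16):                            # memory stays constant as the sequence grows
    q, k, v = torch.randn(3, d)
    state, out = recurrent_step(state, q, k, v)
```

During training, the same computation can be unrolled in parallel across the sequence, which is what gives these models their Transformer-like training throughput.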

2. Fine-Tuning: No Need to Feel Sour

Feeling sour about fine-tuning large models? Parameter-efficient methods like LoRA and Prefix-Tuning are here to sweeten the deal. These methods fall into different categories:

  • Re-parameterization: LoRA is a prime example, adding trainable low-rank matrices alongside frozen weights (see the sketch at the end of this section).
  • Additive: Prefix-Tuning prepends trainable prefix vectors (virtual tokens) to the attention inputs at each layer.
  • Selective: Methods that only fine-tune specific layers or components of the model.

These approaches allow for quick adaptation of large models with minimal resources, unlocking the real value of pre-trained Transformers on domain-specific tasks. While the main focus has been on training efficiency, serving optimizations like S-LoRA are also emerging to reduce memory and compute requirements during inference in multi-tenant environments.
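
To show how little machinery the re-parameterization trick needs, here is a minimal LoRA-style sketch in PyTorch. The rank and scaling values are illustrative and this is not the reference implementation; it just shows a frozen linear layer plus a trainable low-rank update.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA-style re-parameterization: a frozen base layer plus a
    trainable low-rank update B @ A, scaled by alpha / r (illustrative values)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")  # a tiny fraction of the frozen base layer
```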

3. Multi-Modal Models: Comparing Apples to Cucumbers

Multi-modal models are also making a splash in the brine. CLIP leads the pack with its ability to create comparable embeddings across images and text. This opens up exciting possibilities for zero-shot learning and cross-modal applications, with extensions such as MobileCLIP, TinyCLIP, and SigLIP.
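
Because the image and text embeddings live in a shared space, zero-shot classification reduces to a cosine-similarity lookup. A minimal sketch, with random tensors standing in for the actual encoder outputs:

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(image_emb, text_embs, labels):
    """Pick the label whose text embedding is closest to the image embedding.
    Assumes both encoders project into a shared space, as CLIP-style models do;
    the embeddings below are random stand-ins for real encoder outputs."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_embs = F.normalize(text_embs, dim=-1)
    sims = text_embs @ image_emb          # cosine similarity per label
    return labels[sims.argmax().item()]

labels = ["a photo of a pickle", "a photo of a cucumber"]
image_emb = torch.randn(512)               # would come from the image encoder
text_embs = torch.randn(len(labels), 512)  # would come from the text encoder
print(zero_shot_classify(image_emb, text_embs, labels))
```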

4. Upcycling: Mixing the Best Ingredients from Different Jars

Upcycling is the approach of combining different pre-trained models to create new models or add functionality to existing ones. Models taking this route include:

  • LLaVA: Combines a decoder-only language model with CLIP’s vision encoder to enable zero-shot image captioning and other vision-language tasks (a minimal version of this wiring is sketched at the end of this section).
  • StableDiffusion: Merges a pre-trained image auto-encoder with CLIP’s text encoder to diffuse images based on text prompts.
  • OWL-ViT: Adapts CLIP for open-vocabulary object detection.
  • CLIPSeg: Uses CLIP embeddings for image segmentation tasks.
  • CLIPMat: Builds on CLIP for reference-guided image matting.

These approaches demonstrate how pre-trained models can be creatively combined and adapted to tackle new challenges efficiently.
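
As a taste of how little glue some of these combinations need, here is a LLaVA-style sketch: a small trainable projection maps frozen vision-encoder features into the language model’s embedding space so they can be consumed like ordinary tokens. The module and dimensions are illustrative, not the actual LLaVA code.

```python
import torch
import torch.nn as nn

class VisionLanguageAdapter(nn.Module):
    """Upcycling sketch: project frozen vision-encoder patch features into the
    language model's embedding space and prepend them to the text prompt.
    Dimensions are illustrative; a real model would load actual CLIP and LM
    checkpoints and train only the projection (and optionally the LM)."""
    def __init__(self, vision_dim=768, lm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, lm_dim)

    def forward(self, patch_features, text_embeddings):
        visual_tokens = self.proj(patch_features)                  # (num_patches, lm_dim)
        return torch.cat([visual_tokens, text_embeddings], dim=0)  # image tokens first, then text

adapter = VisionLanguageAdapter()
fused = adapter(torch.randn(256, 768), torch.randn(10, 4096))  # fed to the LM as one sequence
```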

5. Inference Optimizations: Better than Byte-sized

Lastly, if you’re in a time crunch, keep an eye on new inference-optimization techniques. These significantly improve the speed and efficiency of large language models:

  • Flash Attention uses a tiling approach to reduce memory bottlenecks in attention computation.
  • KV Caching stores previously computed key-value pairs to avoid redundant calculations in autoregressive generation (a toy version is sketched at the end of this section).
  • Speculative Decoding uses a smaller draft model to propose several tokens ahead, which the larger model then verifies in a single parallel pass, potentially speeding up generation.

These optimizations are crucial for deploying large models in real-world applications, especially in domains requiring low-latency responses.
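
For the curious, here is a toy single-head sketch of the KV-caching idea: each decoding step appends its key and value to a cache and attends over the cache, rather than recomputing keys and values for the entire prefix. Shapes and the attention itself are deliberately simplified for illustration.

```python
import torch

def attend(q, K_cache, V_cache, k_new, v_new):
    """One autoregressive step with a KV cache: append this step's key/value,
    then attend over everything cached so far (toy single-head version)."""
    K_cache = torch.cat([K_cache, k_new[None]], dim=0)  # (t, d)
    V_cache = torch.cat([V_cache, v_new[None]], dim=0)
    scores = (K_cache @ q) / q.shape[-1] ** 0.5         # scaled dot-product scores
    out = torch.softmax(scores, dim=0) @ V_cache        # weighted sum of cached values
    return out, K_cache, V_cache

d = 8
K_cache, V_cache = torch.empty(0, d), torch.empty(0, d)
for _ in range(16):
    q, k, v = torch.randn(3, d)
    out, K_cache, V_cache = attend(q, K_cache, V_cache, k, v)
```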