
In a Pickle About AI? Let's Relish the Latest Developments!

Marcus Gawronsky

We often have folks reach out to us feeling a bit pickled about the latest AI developments. This reading list provides an overview of key papers and developments in modern AI, focusing on transformers, new architectures, efficient fine-tuning methods, representation learning, and inference optimizations. These resources will help you understand the foundations of current AI models and their practical applications.

The Big Dill: Transformers


Let’s start with the big dill: Transformers. These models have really cornered the market lately, with architectures like OpenAI’s GPT series leaving us in awe. But the innovation doesn’t stop there. New architectures are emerging to challenge Transformers’ dominance:

  • Mamba (Gu & Dao, 2023): A Structured State Space sequence model that replaces both attention and MLP blocks, offering fast inference and impressive performance at reduced model sizes.
  • RetNet (Sun et al., 2023): Introduces a Retentive Network block that can be used in both parallel and recursive fashions, combining the training benefits of Transformers with the inference speed of RNNs.
  • RWKV (Peng et al., 2023): The Receptance Weighted Key Value model allows for parallel training and RNN-like inference, showing comparable performance to state-of-the-art models.

These models are pushing the boundaries of sequence modeling, offering new ways to balance performance and efficiency. They open the door to hybrid designs that combine the best of both worlds, and to longer context windows that capture longer-range dependencies.
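The parallel/recurrent duality that RetNet and RWKV exploit can be made concrete with a toy sketch. Below is a minimal numpy implementation of a decayed linear-attention ("retention-style") step, computed two ways: as one big matrix product (how you'd train) and as a running state (how you'd serve). This is a simplification that drops RetNet's multi-head structure, rotation, and normalization; both forms compute the same outputs.

```python
import numpy as np

def retention_parallel(Q, K, V, gamma):
    """Training-time form: one matmul over the whole sequence."""
    T = Q.shape[0]
    # Decay mask: position t attends to s <= t with weight gamma**(t - s).
    idx = np.arange(T)
    D = np.tril(gamma ** (idx[:, None] - idx[None, :]))
    return (Q @ K.T * D) @ V

def retention_recurrent(Q, K, V, gamma):
    """Inference-time form: constant-size state per step, like an RNN."""
    d = Q.shape[1]
    S = np.zeros((d, d))
    out = []
    for q, k, v in zip(Q, K, V):
        S = gamma * S + np.outer(k, v)   # fold the new key/value into the state
        out.append(q @ S)
    return np.stack(out)

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((5, 4)) for _ in range(3))
print(np.allclose(retention_parallel(Q, K, V, 0.9),
                  retention_recurrent(Q, K, V, 0.9)))  # True
```

The recurrent form is why these models generate tokens at RNN cost: the state `S` replaces the ever-growing KV cache of a standard Transformer.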

Fine-Tuning: No Need to Feel Sour


Feeling sour about fine-tuning large models? Parameter-efficient methods like LoRA (Hu et al., 2021) and Prefix-Tuning (Li & Liang, 2021) are here to sweeten the deal. These methods fall into different categories:

  • Re-parameterization: LoRA is a prime example, adding low-rank matrices to frozen weights.
  • Additive: Prefix-Tuning adds trainable tokens to the input.
  • Selective: Methods that only fine-tune specific layers or components of the model.
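The re-parameterization idea behind LoRA fits in a few lines of numpy. The sketch below follows the paper's convention (a frozen weight `W`, trainable low-rank factors `A` and `B` with `B` zero-initialised, and an `alpha / r` scaling); the specific sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4                             # hidden size, low rank (r << d)

W = rng.standard_normal((d, d))          # frozen pre-trained weight
A = rng.standard_normal((r, d)) * 0.01   # trainable, small random init
B = np.zeros((d, r))                     # trainable, zero init
alpha = 8.0                              # LoRA scaling hyperparameter

def lora_forward(x):
    # Frozen path plus the low-rank update; only A and B receive gradients,
    # so the number of trained parameters is 2*d*r instead of d*d.
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

x = rng.standard_normal((2, d))
# Because B starts at zero, the adapted model initially matches the base model.
print(np.allclose(lora_forward(x), x @ W.T))  # True
```

After training, `(alpha / r) * B @ A` can be folded back into `W`, so inference adds no latency over the base model.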

These approaches allow for quick adaptation of large models with minimal resources, unlocking the real value of pre-trained Transformers on domain-specific tasks. While the main focus has been on training efficiency, new serving optimizations like S-LoRA (Sheng et al., 2023) reduce memory and compute requirements during inference in multi-tenant environments.

Multi-Modal Models: Comparing Apples to Cucumbers


Multi-modal models are also making a splash in the brine. CLIP (Radford et al., 2021) leads the pack with its ability to create comparable embeddings across images and text. This opens up exciting possibilities for zero-shot learning and cross-modal applications, with extensions such as MobileCLIP (Vasu et al., 2023), TinyCLIP (Wu et al., 2023), and SigLIP (Zhai et al., 2023).
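To make the zero-shot idea concrete: once CLIP has embedded an image and a set of class prompts into the same space, classification is just cosine similarity plus a softmax. The sketch below uses toy vectors as stand-ins for real encoder outputs (actual CLIP embeddings are 512-d or larger, and the 0.01 temperature is illustrative).

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, temperature=0.01):
    """Score one image embedding against class-prompt embeddings."""
    # Normalise so dot products become cosine similarities.
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = img @ txt.T / temperature
    # Numerically stable softmax over the class prompts.
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

# Toy stand-ins for encoder outputs, e.g. "a photo of a {cat, dog, pickle}".
rng = np.random.default_rng(0)
text_embs = rng.standard_normal((3, 8))
image_emb = text_embs[2] + 0.1 * rng.standard_normal(8)  # near the third prompt
probs = zero_shot_classify(image_emb, text_embs)
print(probs.argmax())  # 2
```

No labelled training data for the target classes is needed; swapping the prompt list swaps the classifier.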

Upcycling: Mixing the Best Ingredients from Different Jars


Upcycling is the practice of combining different pre-trained models to create new capabilities or add them to existing models. Models like:

  • LLaVA (Liu et al., 2023): Combines a decoder-only language model with CLIP's vision encoder to enable zero-shot image captioning and other vision-language tasks.
  • Latent Diffusion (Rombach et al., 2022): Merges a pre-trained image auto-encoder with CLIP's text encoder to diffuse images based on text prompts.
  • OWL-ViT (Minderer et al., 2022): Adapts CLIP for open-vocabulary object detection.
  • CLIPSeg (Lüddecke & Ecker, 2022): Uses CLIP embeddings for image segmentation tasks.
  • CLIPMat (Luo et al., 2022): Builds on CLIP for reference-guided Image Matting.

These approaches demonstrate how pre-trained models can be creatively combined and adapted to tackle new challenges efficiently.

Inference Optimizations: Better than Byte-sized


Lastly, if you’re in a time crunch, keep an eye on inference optimizations. These techniques significantly improve the speed and efficiency of serving large language models:

  • Flash Attention (Dao et al., 2022) uses a tiling approach to reduce memory bottlenecks in attention computation.
  • KV Cache Quantization (Hooper et al., 2024) compresses the stored key-value cache to low-bit precision, cutting the memory footprint of long-context autoregressive generation.
  • Speculative Decoding (Leviathan et al., 2022) uses a smaller draft model to propose several tokens ahead, which the larger model then verifies in a single pass, potentially speeding up generation.
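The core of KV cache quantization can be sketched in numpy. KVQuant itself uses more sophisticated per-channel and non-uniform schemes; the simple symmetric per-token int8 version below is just to show where the 4x memory saving comes from (a small per-token scale is kept alongside the integer cache).

```python
import numpy as np

def quantize_per_token(x, bits=8):
    """Symmetric per-token quantization of a cached tensor."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=-1, keepdims=True) / qmax
    q = np.round(x / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
k_cache = rng.standard_normal((1024, 64)).astype(np.float32)  # (tokens, head_dim)

q, scale = quantize_per_token(k_cache)
k_restored = dequantize(q, scale)

print(q.nbytes / k_cache.nbytes)                  # 0.25: int8 vs float32
print(np.abs(k_cache - k_restored).max() < 0.05)  # small reconstruction error
```

Since the KV cache, not the weights, dominates memory at long context lengths, this is what makes million-token contexts plausible on a single accelerator.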

These optimizations are crucial for deploying large models in real-world applications, especially in domains requiring low-latency responses.
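As a closing example, here is a toy sketch of one speculative-decoding round. This is the greedy variant only (the full algorithm accepts draft tokens probabilistically), and the "models" are stand-in lambdas; the point is the propose-then-verify structure.

```python
def speculative_step(draft_next, target_next, prefix, k=4):
    """One round: the draft proposes k tokens, the target verifies them."""
    # Draft model guesses k tokens autoregressively (cheap per step).
    guesses = []
    ctx = list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        guesses.append(t)
        ctx.append(t)
    # Target model scores the guesses -- conceptually one parallel pass,
    # written sequentially here for clarity. Keep the longest agreeing
    # prefix, plus the target's own token at the first mismatch.
    accepted = []
    ctx = list(prefix)
    for g in guesses:
        t = target_next(ctx)
        if t == g:
            accepted.append(g)
            ctx.append(g)
        else:
            accepted.append(t)
            break
    return accepted

# Toy "models": the target counts up by 1; the draft agrees except
# when the context length is a multiple of 3.
target = lambda ctx: ctx[-1] + 1
draft = lambda ctx: ctx[-1] + 1 if len(ctx) % 3 else ctx[-1] + 2
out = speculative_step(draft, target, prefix=[0])
print(out)  # [1, 2, 3]
```

Here two drafted tokens are accepted and a third is corrected, so one expensive verification yields three tokens instead of one.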

Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. Advances in Neural Information Processing Systems. https://arxiv.org/abs/2205.14135
Gu, A., & Dao, T. (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. https://arxiv.org/abs/2312.00752
Hooper, C., Kim, S., Mohammadzadeh, H., Mahoney, M. W., Shao, Y. S., Keutzer, K., & Gholami, A. (2024). KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization. https://arxiv.org/abs/2405.10637
Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2021). LoRA: Low-Rank Adaptation of Large Language Models. https://arxiv.org/abs/2106.09685
Leviathan, Y., Kalman, M., & Matias, Y. (2022). Fast Inference from Transformers via Speculative Decoding. https://arxiv.org/abs/2211.17192
Li, X. L., & Liang, P. (2021). Prefix-Tuning: Optimizing Continuous Prompts for Generation. Association for Computational Linguistics. https://arxiv.org/abs/2101.00190
Liu, H., Li, C., Wu, Q., & Lee, Y. J. (2023). Visual Instruction Tuning. https://arxiv.org/abs/2304.08485
Lüddecke, T., & Ecker, A. S. (2022). Image Segmentation Using Text and Image Prompts. https://arxiv.org/abs/2112.10003
Luo, J., Zhang, J., & Timofte, R. (2022). CLIPMat: Reference-Guided Image Matting. https://arxiv.org/abs/2206.05149
Minderer, M., Gritsenko, A., Stone, A., Neumann, M., Weissenborn, D., Dosovitskiy, A., Mahendran, A., Arnab, A., Dehghani, M., Shen, Z., Wang, X., Zhai, X., Kipf, T., & Houlsby, N. (2022). Simple Open-Vocabulary Object Detection with Vision Transformers. https://arxiv.org/abs/2205.06230
Peng, B., Alcaide, E., Anthony, Q., Albalak, A., Arcadinho, S., & others. (2023). RWKV: Reinventing RNNs for the Transformer Era. https://arxiv.org/abs/2305.13048
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning Transferable Visual Models From Natural Language Supervision. https://arxiv.org/abs/2103.00020
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-Resolution Image Synthesis with Latent Diffusion Models. https://arxiv.org/abs/2112.10752
Sheng, Y., Cao, S., Li, D., Hooper, C., Lee, N., Yang, S., Chou, C., Zhu, B., Zheng, L., Keutzer, K., Gonzalez, J. E., & Stoica, I. (2023). S-LoRA: Serving Thousands of Concurrent LoRA Adapters. https://arxiv.org/abs/2311.03285
Sun, Y., Dong, L., Huang, S., Ma, S., Xia, Y., Xue, J., Wang, J., & Wei, F. (2023). Retentive Network: A Successor to Transformer for Large Language Models. https://arxiv.org/abs/2307.08621
Vasu, P. K. A., Pouransari, H., Faghri, F., Vemulapalli, R., & Tuzel, O. (2023). MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training. https://arxiv.org/abs/2311.17049
Wu, K., Peng, H., Zhou, Z., Xiao, B., Liu, M., Yuan, L., Xuan, H., Valenzuela, M., & Chen, X. (2023). TinyCLIP: CLIP Distillation via Affinity Mimicking and Weight Inheritance. https://arxiv.org/abs/2309.12314
Zhai, X., Mustafa, B., Kolesnikov, A., & Beyer, L. (2023). Sigmoid Loss for Language Image Pre-Training. https://arxiv.org/abs/2303.15343