Introduction to Language-Image Pre-Training
Language-Image Pre-Training (LIP) has become a popular approach for learning robust visual and textual representations. It aligns the representations of paired images and texts, typically with a contrastive objective. The approach was popularized by large-scale models such as CLIP and ALIGN, which demonstrated its viability on web-scale data [1]. As the field progresses, researchers continue to seek ways to make LIP more efficient and effective, addressing challenges such as the reliance on very large batch sizes and the accompanying resource demands.
The Sigmoid Loss Innovation
Key Problem with Contrastive Learning
Traditional contrastive learning relies on a softmax-based loss that normalizes over all image-text pairs in the batch, in both the image-to-text and text-to-image directions. This is computationally and memory intensive: the full pairwise similarity matrix must be computed (in distributed training, by gathering embeddings across devices), and the naive softmax is numerically unstable, so implementations need an additional stabilization pass [1].
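To make the contrast concrete, here is a minimal sketch of the standard softmax (CLIP-style) contrastive loss; it assumes L2-normalized embeddings and a fixed temperature, and the function and variable names are illustrative rather than taken from any particular codebase:

```python
import torch
import torch.nn.functional as F

def softmax_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """CLIP-style contrastive loss. Each image has to pick out its own text
    from the whole batch (and vice versa), so the softmax normalizes over
    all n pairs at once -- the full [n, n] similarity matrix is needed."""
    img = F.normalize(img_emb, dim=-1)           # [n, d]
    txt = F.normalize(txt_emb, dim=-1)           # [n, d]
    logits = img @ txt.T / temperature           # [n, n] pairwise similarities
    targets = torch.arange(img.size(0), device=img.device)
    # Symmetric cross-entropy over the image-to-text and text-to-image directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.T, targets)
    return 0.5 * (loss_i2t + loss_t2i)
```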
Introducing Sigmoid Loss
To address these challenges, researchers at Google DeepMind proposed a simpler alternative: a pairwise Sigmoid loss for Language-Image Pre-Training, termed SigLIP. Rather than normalizing over the batch, the sigmoid loss treats every image-text pair as an independent binary classification problem (matching or not), which simplifies the computation considerably [1].
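A minimal PyTorch sketch of this pairwise formulation, following the loss described in [1] (the learnable log-temperature t_prime and bias are the paper's ingredients; the exact function and variable names here are illustrative):

```python
import torch
import torch.nn.functional as F

def sigmoid_pairwise_loss(img_emb, txt_emb, t_prime, bias):
    """Pairwise sigmoid loss: every image-text pair is an independent binary
    classification (match vs. non-match), so no normalization over the batch
    is needed. t_prime and bias are learnable scalars; the effective
    temperature is exp(t_prime)."""
    n = img_emb.size(0)
    img = F.normalize(img_emb, dim=-1)                   # [n, d]
    txt = F.normalize(txt_emb, dim=-1)                   # [n, d]
    logits = img @ txt.T * torch.exp(t_prime) + bias     # [n, n]
    # +1 on the diagonal (true pairs), -1 everywhere else (negatives).
    labels = 2.0 * torch.eye(n, device=logits.device) - 1.0
    return -F.logsigmoid(labels * logits).sum() / n
```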
Benefits of Sigmoid Loss
- Memory Efficiency: The sigmoid loss needs less memory than the softmax-based contrastive loss, because no global normalization over the batch is required. This makes it possible to scale the batch size further on the same hardware [1].
- Decoupling Batch Size: Because no operation spans the full batch, the sigmoid loss decouples the definition of the task from the batch size, allowing flexible training setups; the loss can even be accumulated chunk by chunk, as sketched below [1].
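Because every pair contributes an independent term, the loss can be accumulated over chunks of the batch without ever materializing the full n-by-n logit matrix; in the paper's distributed implementation this corresponds to rotating per-device shards of text embeddings. A single-device sketch of that chunking idea (chunk_size and the function name are illustrative):

```python
import torch
import torch.nn.functional as F

def chunked_sigmoid_loss(img_emb, txt_emb, t_prime, bias, chunk_size=256):
    """Accumulates the pairwise sigmoid loss over chunks of text embeddings,
    so only an [n, chunk_size] block of logits is in memory at any time."""
    n = img_emb.size(0)
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    total = img.new_zeros(())
    for start in range(0, n, chunk_size):
        txt_chunk = txt[start:start + chunk_size]                   # [c, d]
        logits = img @ txt_chunk.T * torch.exp(t_prime) + bias      # [n, c]
        labels = -torch.ones_like(logits)
        # Positives are the pairs whose global index falls inside this chunk.
        idx = torch.arange(start, min(start + chunk_size, n), device=logits.device)
        labels[idx, idx - start] = 1.0
        total = total - F.logsigmoid(labels * logits).sum()
    return total / n
```

The result matches the all-at-once computation exactly; only the peak memory changes, which is what makes very large batch sizes tractable.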

Experimental Validation
A series of experiments showed that sigmoid-trained models match or outperform their softmax-trained counterparts, with the gap widest at smaller batch sizes. As a headline result, the authors trained a model to 84.5% zero-shot accuracy on ImageNet in two days on only four TPUv4 chips, using the locked-image SigLiT recipe described next [1].
Efficient Training With Locked-Image Tuning
SigLiT Model

The team also introduced SigLiT, which combines the Sigmoid loss with Locked-image Tuning (LiT): a pretrained image encoder is frozen, and only the text tower is trained to align with it. The recipe is remarkably efficient, with a SigLiT model reaching 79.7% zero-shot accuracy on ImageNet in just one day on four TPUv4 chips [1].
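A minimal sketch of the locked-image setup in PyTorch, reusing sigmoid_pairwise_loss from the earlier sketch; the tiny stand-in towers, dimensions, and learning rate are placeholders rather than the paper's actual architecture, and the temperature/bias initialization is set near the values suggested in [1]:

```python
import torch
import torch.nn as nn

# Stand-ins for the real towers: in SigLiT the image tower is a strong
# pretrained vision model and only the text tower is trained.
image_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 512))
text_encoder = nn.Sequential(nn.Embedding(32000, 256), nn.Flatten(), nn.Linear(16 * 256, 512))

# "Locked-image" tuning: freeze every image-tower parameter...
for p in image_encoder.parameters():
    p.requires_grad = False
image_encoder.eval()

# ...and learn only the text tower plus the loss's temperature and bias.
t_prime = nn.Parameter(torch.log(torch.tensor(10.0)))  # temperature starts at 10
bias = nn.Parameter(torch.tensor(-10.0))               # bias starts at -10
optimizer = torch.optim.AdamW(list(text_encoder.parameters()) + [t_prime, bias], lr=1e-4)

# One illustrative step on random data (a batch of 8 image-text pairs).
images = torch.randn(8, 3, 224, 224)
tokens = torch.randint(0, 32000, (8, 16))
with torch.no_grad():
    img_emb = image_encoder(images)   # frozen features, no gradients needed
txt_emb = text_encoder(tokens)
loss = sigmoid_pairwise_loss(img_emb, txt_emb, t_prime, bias)  # from the sketch above
loss.backward()
optimizer.step()
```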
Impact on Batch Size and Training Duration

Through extensive testing, pushing the batch size as high as one million, the researchers found that while larger batches do offer some benefit, the gains diminish quickly beyond a certain point. Perhaps surprisingly, a batch size of 32k turned out to be close to optimal, balancing performance against resource efficiency [1].
Multilingual and Robust Pre-Training

mSigLIP: Multilingual Adaptation
Expanding the approach to multilingual data, the team pre-trained models on datasets covering over 100 languages. They discovered that a batch size of 32k was also sufficient for effective multilingual training, and going beyond this size didn’t yield significant improvements [1].
Robustness to Noise
Another notable advantage of the sigmoid loss is its robustness to noisy training data. Models trained with the sigmoid loss tolerated various kinds of corruption (e.g., random noise in images or scrambled captions) better, maintaining their edge over softmax-trained models even under substantial corruption [1].
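One simple way to simulate this kind of corruption when reproducing such a robustness check is sketched below; it mimics the flavor of the corruptions described (noise images, scrambled captions) but is not the paper's exact protocol, and all names and rates are illustrative:

```python
import torch

def corrupt_batch(images, tokens, noise_frac=0.2):
    """Corrupt a random fraction of the batch: replace some images with
    uniform noise and shuffle the token order of the matching captions.
    Illustrative only -- not the exact corruption scheme used in [1]."""
    images, tokens = images.clone(), tokens.clone()
    n = images.size(0)
    k = int(noise_frac * n)
    if k > 0:
        idx = torch.randperm(n)[:k]
        images[idx] = torch.rand_like(images[idx])      # random-noise images
        for i in idx.tolist():                          # scrambled captions
            tokens[i] = tokens[i][torch.randperm(tokens.size(1))]
    return images, tokens
```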
Conclusion
The introduction of Sigmoid loss for Language-Image Pre-Training marks a significant advance in the efficiency and effectiveness of LIP models. By simplifying the loss computation and decoupling it from batch size requirements, SigLIP and SigLiT models offer compelling performance with reduced computational overhead. These innovations not only facilitate better utilization of limited resources but also present a robust framework adaptable to multilingual contexts and resistant to data noise. This development paves the way for more accessible and scalable language-image pre-training, fostering further exploration and improvement in the field.
By adopting the sigmoid loss, researchers and practitioners can reach strong performance on language-image tasks with far more modest hardware, making this line of work practical for a wider range of applications [1].
