The notebook (link at the end) covers the basic building blocks to adapt LLMs for your own use case:
Data collection/generation/augmentation.
Fine-tuning Gemma with a P100 GPU and LoRA (a sketch follows this list).
Document chunking for RAG.
Chunk ranking using ColBERT and BERT/Gemma embeddings.
Evaluation using an LLM judge (Gemini Pro).
Evaluation using a distance metric.
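To make the fine-tuning step concrete, here is a minimal LoRA sketch using Hugging Face transformers and peft. The checkpoint name, rank, and target modules are illustrative assumptions, not the notebook’s exact settings.

```python
# Minimal LoRA fine-tuning sketch (assumed settings, not the notebook's exact config).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "google/gemma-2b"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

# LoRA trains a small set of low-rank adapter matrices instead of all weights,
# which is what makes fine-tuning feasible on a single P100.
lora_config = LoraConfig(
    r=8,                                   # assumed rank
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],   # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # only the adapter weights are trainable
```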
Here is an excerpt of the main findings so far:
Dataset generation, RAG with ColBERT, and a query strategy yield the best evaluation scores.
Gemma already has Kaggle knowledge out of the box, so fine-tuning with new data didn’t make much difference.
Fine-tuning (FT) is for learning new abstractions rather than memorising new data for QA. Overfitting does help with memorisation, though.
Single-vector search is bad because document pooling operations throw away too much signal. ColBERT is multi-vector, so no pooling is involved and the signal is preserved.
Using a larger, more capable model (Gemini) as a judge is a good evaluation approach but pricey.
RAG + RAW: use RAG and train the model to say “I don’t know” if an answer isn’t in the context, then use the raw model as a fallback (sketched below). This improves evaluation scores.
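A minimal sketch of the RAG + RAW fallback logic; `retrieve`, `rag_answer`, and `raw_answer` are hypothetical helpers standing in for the actual pipeline.

```python
# Sketch of RAG + RAW: answer from retrieved context first, fall back to the raw model.
def answer(question: str, retrieve, rag_answer, raw_answer) -> str:
    context = retrieve(question)               # top-ranked chunks (e.g. via ColBERT)
    rag_reply = rag_answer(question, context)  # model tuned to say "I don't know" when the answer isn't in the context
    if "i don't know" in rag_reply.lower():
        return raw_answer(question)            # fall back to the raw model's parametric knowledge
    return rag_reply
```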
Generative AI is opening the doors to new products. It’s great to witness in real time how the Internet is evolving and to imagine what it might become in the coming years. Search, UI navigation and content generation are the trident technologies leading this evolution.
Search
The Internet acts as the largest repository of knowledge worldwide. Search engines guide us through the Internet, directing us to web pages likely to have the answers we seek.
By 2014, this functionality evolved further, allowing search engines to highlight specific passages within web pages that are most likely to contain the answers to our questions. This means that search engines now fulfil both navigational and informational purposes, making it easier to find information quickly within search engine results pages (SERPs).
In 2018, Large Language Models demonstrated a way to generate answers from the Common Crawl (CC) dataset. CC is a web index with 3B web pages. The technical breakthrough is that LLMs can retrieve relevant bits of information from all those 3B web pages. However, there are three caveats. First, LLMs can’t tell what specific source has been used to generate an answer. Second, when the LLM doesn’t have information about a topic, it hallucinates. Third, LLMs are not suitable to fulfil navigational intents.
In 2020, Meta introduced RAG. It connects LLMs to external data sources, like the Internet, so answers are grounded in sources. This is great because we can reference specific web pages and passages. However, the number of sources that can be used to generate an answer is limited by the LLM’s token context length.
In 2023, the AI community gave LLMs autonomy and the concept of AI agents was born. An objective is given to an agent and it figures out what to do to complete it. This usually involves planning, executing multiple tasks, performing external actions, using memory, etc. For example, AskPandi is an AI search agent.
UI Navigation
Another advancement is in UI navigation, facilitating browser and workflow automation. This development enables us to assign tasks to an agent, which can then execute them on our behalf. As a result, we’re moving from direct human-computer interaction to a more seamless human-assistant model. Here is a little example of automatically dismissing cookie consent banners.
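As a rough illustration of this kind of automation (not the original example), here is a Playwright sketch that clicks a typical consent button; the button labels are assumptions, and a real agent would decide what to click with a model rather than a fixed list.

```python
# Hedged sketch: dismiss a cookie consent banner with Playwright (labels are assumptions).
from playwright.sync_api import sync_playwright

REJECT_LABELS = ["Reject all", "Decline", "Only necessary"]  # hypothetical label list

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")
    for label in REJECT_LABELS:
        button = page.get_by_role("button", name=label)
        if button.count() > 0:      # click the first matching consent button, if any
            button.first.click()
            break
    browser.close()
```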
Content Generation
We have the capability to direct an LLM to produce or modify content on our behalf, making content creation more accessible than ever. However, the key to success lies in generating content that people want. Content that lacks effort or relevance is easily recognisable and less likely to engage readers.
The Generative Internet Is Born
The three advancements in search, UI navigation, and content generation unlock new product opportunities. Search engines evolve into answer engines that interact with the web for us, picking only the relevant bits of information from multiple sources and composing an answer that is essentially an interactive UI.
For example, “suggest a 5-day trip to a surf break in Europe. Also get me flight and accommodation information, including pricing”.
To complete this objective, we need to conduct multiple searches and then combine the outcomes into a single compelling answer.
But, what happens after we find the information we are looking for?
For this particular example, a natural follow-up action might be to book flights and accommodation. In this context, UI navigation proves invaluable because it allows us to instruct an AI agent to automate those tasks for us.
UI navigation facilitates workflow automation, making mundane tasks such as rescheduling a meeting, making a booking, or ordering something executable in natural language. This is akin to iOS Shortcuts, but without the need to write custom integrations because the web is open.
At this point, the user rarely interacts with the web directly; an assistant does it for them.
Finally, machine-generated content has an important part to play because our need to access knowledge is ever growing. In my last research paper, I found out that it only takes 5 hops to find knowledge gaps on the Internet. In addition, new research has found that our knowledge needs are outpacing the amount of content available on the public web, thus giving the impression that search engines are becoming worse. I suspect walled gardens like social media platforms have something to do with it too, because their content isn’t public.
If assistants interact with the web for us, what’s the point of web pages?
Most UI navigation systems are being trained on traditional UIs, so web pages are still relevant. In addition, there are many situations where a user will have to step in, like confirming a purchase, authentication, captchas, or payment details.
However, interacting with the web via API calls instead of UI navigation is faster, so I suspect products that provide an API to an assistant will have an edge here.
What happens to web content creators who rely on ad-traffic?
Ads, if relevant, are good content. Since AI assistants are very good at filtering out irrelevant content, web content creators will need to ensure that their sponsored content is also relevant.
What’s the equivalent of backlinks in a generative Internet?
Backlinks from reputable domains are a good ranking signal in traditional search engines. However, what’s the equivalent of a backlink when assistants generate unique web pages on the fly, so there is no unique link to refer to? It feels like there is a need for better ranking signals. I wouldn’t be surprised if AEO (Answer Engine Optimisation) becomes a thing.
What happens to social media?
We are transitioning from information extraction to information generation. This increases relevance, which is good for users. I think someone will make a social network where 100% of the content is generated.
To sum up, search, UI navigation and content generation are reshaping the Internet as we know it.
LLMs have a limit on how much input they can ingest and how much output they can generate. In retrieval augmented generation (RAG) applications, a set of documents is first retrieved and added to the input alongside an instruction, thus creating the prompt. This is referred to as in-context learning. We can draw an analogy with computer architectures here: the LLM is the CPU, the context is the RAM, and the full document library is the hard disk.
The context length (RAM) is consistently expanding. In my early encounters with Recurrent Neural Networks (RNNs), we could make reliable predictions with input lengths of about 20 words. However, issues such as vanishing gradients led to NaNs (not bread). Nowadays, we have LLMs with a context of up to 200K tokens. Note that tokenisation has also evolved over time, transitioning from words to tokens.
Chunking, a technique used to divide a document into multiple parts so that each fits within the context, has played a pivotal role in handling documents that exceed the context. With a context size of 200K tokens, we can process up to roughly 330 pages within a single chunk for prediction.
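As a back-of-the-envelope check of that figure (the tokens-per-word and words-per-page numbers below are assumptions, not values from the text):

```python
# Roughly how many pages fit in a 200K-token context?
context_tokens = 200_000
words_per_token = 0.75   # common rule of thumb, assumed
words_per_page = 450     # assumed for a dense page
pages = context_tokens * words_per_token / words_per_page
print(round(pages))      # ~333 pages, in line with the ~330 figure above
```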
Initially, when I began working with LLMs, chunking posed a substantial challenge. However, as the context length expanded, this problem gradually receded into the background.
Or did it?
It turns out that processing vast quantities of text introduces latency to the entire system. Consequently, chunking remains a relevant strategy for developing high-performance applications.
From personal experience, the choice of chunking technique depends on the user’s or system’s intent. For example, when summarising a document, it’s necessary to consider all the chunks. Conversely, when seeking specific answers to questions, we only need to select the most relevant chunks. Now, let’s look into a few chunking techniques I’ve used so far.
Static Chunking
One straightforward method of chunking involves aligning the chunk size with the LLM context length. It’s important to note that if a chunk is combined with an instruction in a prompt, the chunk’s length should be at most the context length minus the instruction length.
For instance, if the context length is 200 tokens and the instruction spans 20 tokens, the chunk’s length should be set at 180 tokens.
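A minimal sketch of static chunking along these lines, assuming a Hugging Face-style tokenizer with encode/decode methods:

```python
# Static chunking sketch: fixed-size chunks of (context length - instruction length) tokens.
def static_chunks(text: str, tokenizer, context_len: int, instruction_len: int):
    chunk_len = context_len - instruction_len          # e.g. 200 - 20 = 180 tokens
    token_ids = tokenizer.encode(text)
    return [
        tokenizer.decode(token_ids[i:i + chunk_len])
        for i in range(0, len(token_ids), chunk_len)
    ]
```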
In practice, I haven’t found many use cases where this method performs exceptionally well. It appears to be most suitable for agentic systems where the agent is aware that a document is segmented into multiple chunks, delegating the decision of which chunks to retrieve to the agent.
Dynamic Chunking Based On Traditional IR
This method involves chunking a document based on a predefined condition. In my experience, the most effective approach has been to view chunking as an information retrieval (IR) problem. We can treat a document as a dataset in which each sentence serves as an item. Given a query, we retrieve a list of ranked fragments, with fragment boundaries coinciding with common sentence delimiters. Each fragment essentially becomes a chunk. This can be effortlessly achieved using ElasticSearch (ES) and BM25, ES’s default ranker.
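A minimal sketch of the same idea using the rank_bm25 package instead of ElasticSearch, with naive sentence splitting and whitespace tokenisation as simplifications:

```python
# Treat each sentence as a retrievable item and rank sentences with BM25.
from rank_bm25 import BM25Okapi

document = "First sentence. Second sentence about chunking for RAG. Third sentence."
sentences = [s.strip() for s in document.split(".") if s.strip()]  # naive sentence split

bm25 = BM25Okapi([s.lower().split() for s in sentences])           # whitespace tokenisation
query = "how does chunking work"
top_sentences = bm25.get_top_n(query.lower().split(), sentences, n=2)

chunk = " ".join(top_sentences)  # the top-ranked fragments become the chunk
```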
By approaching chunking as an IR problem, we can also leverage IR techniques like stemming, lemmatisation, synonymy, query expansion, and more.
This represents a straightforward, efficient, and productive dynamic chunking technique. Unlike approaches involving embedding extraction, artificial neural networks (ANN), or neural IR, this method is less resource-intensive to implement. Furthermore, it avoids the pitfalls of Out of Distribution (OOD) shifts.
Dynamic Chunking Based On Neural IR (Embeddings)
Similar to the previous method, this approach involves converting a document into sentence embeddings. We then retrieve relevant chunks based on the similarity between the query and sentence embeddings. While this method is more computationally demanding, it can yield more accurate results, particularly when the embeddings align closely with the application domain. Most modern databases offer support for vector indexing and search.
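A minimal sketch with sentence-transformers; the model name is an assumed general-purpose choice rather than a domain-tuned one:

```python
# Rank sentences by cosine similarity between query and sentence embeddings.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model
sentences = [
    "Chunking splits a document into parts.",
    "ColBERT is a late-interaction ranker.",
    "BM25 is a lexical ranker.",
]
query = "What is ColBERT?"

sentence_emb = model.encode(sentences, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)

scores = util.cos_sim(query_emb, sentence_emb)[0]  # one similarity score per sentence
best = int(scores.argmax())
print(sentences[best])                             # the most relevant sentence becomes the chunk
```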
Dynamic Chunking Based On Neural IR (ColBERT)
Many vector-based search systems necessitate pooling the query and documents into a single vector. However, such pooling operations tend to discard valuable signals. ColBERT introduces a method that eschews pooling, thereby preserving the query and document embedding signals more effectively. Additionally, since there’s no need to segment documents into sentences, as in the previous method, we can automatically identify multi-sentence fragments, delimited by sentence boundaries, akin to the traditional IR approach.
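To make the late-interaction idea concrete, here is a minimal MaxSim scoring sketch over pre-computed token embeddings; random tensors stand in for real ColBERT embeddings, which are also normalised in practice:

```python
# ColBERT-style late interaction: no pooling; score each query token against every
# document token, keep the maximum (MaxSim), then sum over query tokens.
import torch

query_emb = torch.randn(8, 128)      # 8 query tokens, 128-dim embeddings (stand-ins)
doc_emb = torch.randn(300, 128)      # 300 document tokens

sim = query_emb @ doc_emb.T          # token-level similarity matrix (8 x 300)
score = sim.max(dim=1).values.sum()  # MaxSim per query token, summed into one relevance score
```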
However, it’s worth noting that this method comes with higher computational costs and requires the ColBERT ranker to be implemented within the vector database, which not all offer out of the box. Additionally, it is not immune to OOD data shifts.
Summary
While the context length keeps growing, it still makes sense to do chunking to reduce latency.
There are many static and dynamic ways to do chunking. The best one will always depend on the end application.