Working with LLMs is shifting from purely human-machine interactions to a mix of human-machine and machine-machine interactions, which lets LLMs take on ever more complex tasks. This new kind of interactive system has been coined the AI agent.
Threaded conversations lack the structure needed to complete complex tasks, so objective divergence, where the agent drifts away from its original goal, is a common issue with AI agents. Objective divergence is the equivalent of a blue screen in a traditional computer.
I’m currently researching how to make agents more reliable, mainly by providing structure and constraints. Here is a prompt that the MAC-1 agent handles in a few seconds: “Find flights and accommodation for my next trip to Barcelona”
Note what the agent does:
It detects the intent(s) in the prompt.
It interacts with the user to get more information.
It’s stateful, so it is aware of previous actions.
It chains actions.
It operates a browser.
It’s able to map natural language into specific formats: “next Friday” -> “26-01-2024”.
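As a concrete illustration of that last capability, here is a minimal, self-contained sketch (not the agent’s actual implementation) of how a phrase like “next Friday” can be resolved to a concrete date once the intent has been detected:

    import datetime

    def next_weekday(target_weekday, today=None):
        # Monday=0 ... Friday=4; returns the date of the next occurrence of that weekday.
        today = today or datetime.date.today()
        days_ahead = (target_weekday - today.weekday()) % 7 or 7
        return today + datetime.timedelta(days=days_ahead)

    # Relative to Friday 19-01-2024, "next Friday" resolves to 26-01-2024.
    print(next_weekday(4, datetime.date(2024, 1, 19)).strftime("%d-%m-%Y"))

In practice, the LLM would extract the weekday from the prompt and a deterministic helper like this one could keep the final format consistent.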
I’m actively working on this and I’ll share more. Follow me on X.
LLMs have a limit on how much input they can ingest and how much output they can generate. In retrieval augmented generation (RAG) applications, a set of documents is first retrieved and added to the input alongside an instruction, thus creating the prompt. This is referred to as in-context learning. We can draw an analogy with computer architectures here: the LLM is the CPU, the context is the RAM, and the full document library is the hard disk.
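In code, the “loading into RAM” step is simply prompt assembly. A minimal sketch, with a placeholder token counter and an already-retrieved document list:

    def build_rag_prompt(instruction, retrieved_docs, max_context_tokens, count_tokens):
        # Fill the "RAM" (context window) with as many retrieved documents as fit,
        # leaving room for the instruction itself.
        budget = max_context_tokens - count_tokens(instruction)
        context = []
        for doc in retrieved_docs:
            cost = count_tokens(doc)
            if cost > budget:
                break
            context.append(doc)
            budget -= cost
        return instruction + "\n\n" + "\n\n".join(context)

    # A whitespace split stands in for a real tokenizer's token count.
    prompt = build_rag_prompt(
        "Answer the question using the documents below.\nQuestion: ...",
        ["first retrieved document ...", "second retrieved document ..."],
        max_context_tokens=200_000,
        count_tokens=lambda text: len(text.split()),
    )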
The context length (the RAM) is consistently expanding. In my early encounters with Recurrent Neural Networks (RNNs), we could make reliable predictions with input lengths of about 20 words; beyond that, issues such as vanishing gradients led to NaNs (not the bread). Nowadays, we have LLMs with a context of up to 200K tokens. Note that the input unit has also evolved over time, transitioning from words to tokens.
Chunking, a technique that divides a document into multiple parts so that each part fits within the context, has played a pivotal role in handling documents larger than the context. With a context size of 200K tokens, we can process roughly 330 pages within a single chunk for prediction.
Initially, when I began working with LLMs, chunking posed a substantial challenge. However, as the context length expanded, this problem gradually receded into the background.
Or did it?
It turns out that processing vast quantities of text introduces latency to the entire system. Consequently, chunking remains a relevant strategy for developing high-performance applications.
From personal experience, the choice of chunking technique depends on the user’s or system’s intent. For example, when summarising a document, it’s necessary to consider all the chunks. Conversely, when seeking specific answers to questions, we only need to select the most relevant chunks. Now, let’s look into a few chunking techniques I’ve used so far.
Static Chunking
One straightforward method of chunking involves aligning the chunk size with the LLM context length. It’s important to note that if a chunk is combined with an instruction in a prompt, the chunk’s length must be at most the context length minus the instruction length.
For instance, if the context length is 200 tokens and the instruction spans 20 tokens, the chunk’s length should be set at 180 tokens.
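A minimal static chunker along these lines, using a whitespace split as a stand-in for a real tokenizer:

    def static_chunks(document, context_length, instruction_length):
        # Each chunk must fit next to the instruction: context minus instruction tokens.
        chunk_size = context_length - instruction_length  # e.g. 200 - 20 = 180
        tokens = document.split()  # stand-in for tokenizer.encode(document)
        return [
            " ".join(tokens[i:i + chunk_size])
            for i in range(0, len(tokens), chunk_size)
        ]

    chunks = static_chunks("a long document ...", context_length=200, instruction_length=20)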
In practice, I haven’t found many use cases where this method performs exceptionally well. It appears to be most suitable for agentic systems in which the agent knows that a document is segmented into multiple chunks and the decision of which chunks to retrieve is delegated to the agent.
Dynamic Chunking Based On Traditional IR
This method involves chunking a document based on a predefined condition. In my experience, the most effective approach has been to view chunking as an information retrieval (IR) problem. We can treat a document as a dataset in which each sentence serves as an item. Given a query, we retrieve a list of ranked fragments, with fragment boundaries coinciding with common sentence delimiters. Each fragment essentially becomes a chunk. This can be effortlessly achieved using ElasticSearch (ES) and BM25, ES’s default ranker.
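A rough sketch with the Elasticsearch Python client (the index name, field names, and the naive sentence split are placeholders):

    from elasticsearch import Elasticsearch

    document = "First sentence. Second sentence. Third sentence."
    query = "second"

    es = Elasticsearch("http://localhost:9200")

    # Treat the document as a dataset: index each sentence as its own item.
    for i, sentence in enumerate(document.split(". ")):
        es.index(index="doc_sentences", id=i, document={"text": sentence, "position": i})

    # BM25, the default ranker, returns the sentences most relevant to the query;
    # adjacent hits can then be merged into multi-sentence chunks.
    hits = es.search(index="doc_sentences", query={"match": {"text": query}}, size=3)
    chunks = [hit["_source"]["text"] for hit in hits["hits"]["hits"]]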
By approaching chunking as an IR problem, we can also leverage IR techniques like stemming, lemmatisation, synonymy, query expansion, and more.
This represents a straightforward, efficient, and productive dynamic chunking technique. Unlike approaches involving embedding extraction, artificial neural networks (ANN), or neural IR, this method is less resource-intensive to implement. Furthermore, it avoids the pitfalls of Out of Distribution (OOD) shifts.
Dynamic Chunking Based On Neural IR (Embeddings)
Similar to the previous method, this approach involves converting a document into sentence embeddings. We then retrieve relevant chunks based on the similarity between the query and sentence embeddings. While this method is more computationally demanding, it can yield more accurate results, particularly when the embeddings align closely with the application domain. Most modern databases offer support for vector indexing and search.
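A sketch with sentence-transformers (the model name is just a common default, not a recommendation):

    from sentence_transformers import SentenceTransformer, util

    sentences = ["First sentence.", "Second sentence.", "Third sentence."]
    query = "something about the second point"

    model = SentenceTransformer("all-MiniLM-L6-v2")
    sentence_embeddings = model.encode(sentences, convert_to_tensor=True)
    query_embedding = model.encode(query, convert_to_tensor=True)

    # Rank sentences by cosine similarity to the query and keep the top-k as chunks.
    scores = util.cos_sim(query_embedding, sentence_embeddings)[0]
    top_k = scores.argsort(descending=True)[:2]
    chunks = [sentences[int(i)] for i in top_k]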
Dynamic Chunking Based On Neural IR (ColBERT)
Many vector-based search systems require pooling the query and each document into a single vector, and such pooling operations tend to discard valuable signal. ColBERT introduces a method that eschews pooling, thereby preserving the query and document token embeddings more effectively. Additionally, since there is no need to pre-segment documents into sentences, as in the previous method, we can automatically identify multi-sentence fragments, delimited by sentence boundaries, akin to the traditional IR approach.
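For intuition, the late-interaction scoring at the heart of ColBERT is the MaxSim operator: every query token embedding is matched against its best document token embedding and the per-token scores are summed. A toy version in PyTorch, with random embeddings standing in for the real ColBERT encoders:

    import torch
    import torch.nn.functional as F

    def maxsim_score(query_embs, doc_embs):
        # query_embs: (num_query_tokens, dim), doc_embs: (num_doc_tokens, dim),
        # both L2-normalised as in ColBERT.
        similarity = query_embs @ doc_embs.T          # token-to-token similarities
        return similarity.max(dim=1).values.sum()     # best document token per query token

    query_embs = F.normalize(torch.randn(8, 128), dim=-1)
    doc_embs = F.normalize(torch.randn(300, 128), dim=-1)
    print(maxsim_score(query_embs, doc_embs))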
However, it’s worth noting that this method comes with higher computational costs and requires a ColBERT ranker inside the vector database, which not all of them offer out of the box. It is also not immune to OOD data shifts.
Summary
While the context length keeps growing, it still makes sense to do chunking to reduce latency.
There are many static and dynamic ways to do chunking. The best one will always depend on the end application.
🚀 Excited to share new work on “Harnessing Retrieval-Augmented Generation (RAG) for Uncovering Knowledge Gaps”.
In this paper, we simulate how users search the Internet, but instead of retrieving content that already exists, as traditional information retrieval does, we search for the most relevant content even if it doesn’t exist yet.
Therefore, information retrieval shifts from an extraction to a generation problem.
This process is guided by the premise that a well-generalised LLM should provide useful recommendations based on the initial question and answer.
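At a high level, the simulation can be sketched as an iterative loop; llm() and search() below are hypothetical helpers (an LLM call and a web-index lookup), not the paper’s actual code:

    def explore_topic(question, llm, search, max_depth=10):
        # Keep asking for the most relevant deeper follow-up question and check
        # whether the web index still has content for it; stop at the "knowledge wall".
        for depth in range(max_depth):
            answer = llm(f"Answer concisely: {question}")
            follow_up = llm(
                f"Given the question '{question}' and the answer '{answer}', "
                "suggest the most relevant deeper follow-up question."
            )
            if not search(follow_up):  # no existing content found at this depth
                return depth
            question = follow_up
        return max_depth

    # depth = explore_topic("What is RAG?", llm=my_llm, search=my_web_search)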
🔍 Findings:
Using Bing’s web index, we found that typical internet searches hit a ‘knowledge wall’ at just the fifth level of topic depth. This means there’s a lot more we can discover and learn!
💡 What This Means:
This research changes how we use search engines. Imagine getting deeper, more insightful answers to your questions, going beyond what’s already written on the web.
More broadly, this work can be used to improve the completeness of a content library.
Adept’s mission is to enable computers to interact with UIs, which will enhance our productivity and save time.
I have been eagerly awaiting their ACT-1 model for quite some time. While that is being developed, they have released FUYU, a multimodal LLM (MLLM), meaning it can process both text and images. More here: https://www.adept.ai/blog/fuyu-8b.
Adept has introduced an architectural simplification that eliminates the need for the image encoder.
Image patches are instead linearly projected into the first layer of the transformer, bypassing the embedding lookup. We simply treat the normal transformer decoder like an image transformer (albeit with no pooling and causal attention).
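In other words, the image encoder is replaced by a single linear layer applied to raw patches. A toy version of that projection (patch size and model width are illustrative, not FUYU’s exact configuration):

    import torch
    import torch.nn as nn

    patch_size, d_model = 30, 4096                 # illustrative values
    project = nn.Linear(3 * patch_size * patch_size, d_model)

    # An arbitrary-resolution RGB image, e.g. a very wide web page screenshot.
    image = torch.randn(3, 1080, 5400)
    patches = image.unfold(1, patch_size, patch_size).unfold(2, patch_size, patch_size)
    patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, 3 * patch_size * patch_size)

    # Each patch becomes a "token" embedding fed straight into the decoder layers.
    patch_tokens = project(patches)                # (num_patches, d_model)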
Eliminating the need to scale or crop images to meet certain specifications is a significant advantage. This has profound implications for UI navigation, especially since aspect ratios are not always consistent. For instance, 5:1 aspect ratios are frequently seen on web pages. Traditional image preprocessing often discards too much valuable information.
However, a notable limitation is its variable memory requirements, which led me to frequently encounter out-of-memory (OOM) errors. To address the issue of non-uniform aspect ratios, I segmented the images and placed each segment side-by-side to achieve a consistent aspect ratio.
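The preprocessing was roughly along these lines with Pillow (the file name and segment count are placeholders; any remainder rows are dropped for simplicity):

    from PIL import Image

    def rebalance_aspect_ratio(path, segments=3):
        # Cut a tall screenshot into equal-height bands and paste them side-by-side,
        # producing a more uniform aspect ratio before handing the image to the model.
        img = Image.open(path)
        width, height = img.size
        band_height = height // segments
        out = Image.new("RGB", (width * segments, band_height), "white")
        for i in range(segments):
            band = img.crop((0, i * band_height, width, (i + 1) * band_height))
            out.paste(band, (i * width, 0))
        return out

    # rebalance_aspect_ratio("screenshot.png", segments=3).save("screenshot_rebalanced.png")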
Interestingly, when I fed screenshots of web pages to GPT-4V, I also faced OOM errors with images that had irregular aspect ratios. This leads me to speculate that GPT-4V might be built on a decoder-only architecture as well.
On the plus side, FUYU is relatively lightweight. It comfortably fits within an AWS g5.2xlarge instance, which offers 24GB of GPU memory. The only modification I made was to load the model in float16 precision to ensure smooth operation:
    import torch
    from transformers import FuyuForCausalLM

    model = FuyuForCausalLM.from_pretrained(pretrained_path, device_map="cuda:0", torch_dtype=torch.float16)
I’ve assessed FUYU with a focus on UI navigation capabilities, even though this may not align with its original training purpose. While one could argue that this is not an equitable evaluation, FUYU remains the best open-source model available for UI navigation. With Adept’s mission in mind, it’s anticipated that their models will, in time, exhibit specialised capabilities for UI navigation, much like FUYU’s resolution-agnostic feature. This assessment establishes a foundation, highlighting areas where fine-tuning and future models can concentrate.
I’ve captured screenshots of various UIs from mobile websites and posed questions about them. This article will specifically discuss the screenshot from Amazon presented below.
Generate A Web Page Caption
prompt: Generate a caption.
generation: The image shows a screen shot of a “sunnyscopa” product listing, which features “free delivery by amazon”. The screen shot also includes the “home & kitchen” tab, where the user has selected “sunnyscopa waterslide decal paper for laser printer – clear, a4, 100 sheets – personalized waterslide decal paper
comment: The caption focuses on the text in the screenshot. It says nothing about the other visual components or the page layout.
VQA based on text in images
This model is good at VQA based on text in images, presumably due to the OCR capabilities it possesses.
prompt: What’s the price of the product?
generation: Sunnyscopa Waterslide Decal Paper for Laser Printer – Clear, A4, 100 Sheets – Personalized Waterslides
comment: Maybe my prompt was too challenging here. I assumed that because there is only one product listed and one price, the model would figure it out.
———————————————
prompt: What is the price of the sunnyscopa product?
generation: 79.72
comment: The model got the price right whenever I included the product name in the prompt.
VQA based on elements in images
prompt: What it’s in the product thumbnail?
generation: Joan joan amazon
comment: It didn’t understand what a product thumbnail is.
VQA based on Spatial Awareness
This was the area I was looking forward to the most, because UI navigation requires spatial awareness. The model needs to understand how the UI components are laid out and what they are for. For example, buttons can have icons as well as text, and they lead the user to a new web page state.
prompt: What is the colour of the search button?
generation: orange
comment: Good, the model understands what a button is and has figured out that the magnifying glass means search.
———————————————
prompt: What’s the text in the search box?
generation: The search box displays an advertisement for a waterslide decal paper for laser printers.
comment: OK, there are no ads in the search box… Maybe the model is confusing the search result listing with an autocomplete dropdown?
———————————————
prompt: Where is the Amazon logo located?
generation: bottom
comment: It’s actually at the top left.
———————————————
prompt: Where is the Amazon logo located? Generate bounding box coordinates.
generation: SunnySCOPA Waterslide Decal Paper for Laser Printer – Clear, A4, 100 Sheets – Personalized Waterproof Paper
comment: Incorrect. Being able to generate coordinates to locate any element in an image will be necessary. Adept has teased this capability for text in the blog post, but it will also be needed for other UI components. This will be great.
UI Interactivity Awareness
In this scenario we want to check whether the model is aware of the interactive elements on the web page. Interactivity awareness is essential for UI navigation.
prompt: Given this web page screenshot from the Amazon mobile site, what should I do next if I am searching for a water bottle?
generation: In the Amazon app, you can search for Sunnyscopa Waterslide Decal Paper for Laser Printer – Clear, A4, 100 Sheets – Personalized Water Bottle Label. If you can find the product, simply tap or click “GET IT” button.
comment: The model understood that you need to go to an app, search for things, and tap or click to buy. It is aware of mobile/touchscreen vs. desktop usage, referencing both tap (mobile) and click (desktop). However, it glossed over the actual screenshot when doing its reasoning. The right answer would have been along the lines of:
Delete the current text in the search box
Type “water bottle”
Click/Tap on the orange search button
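For reference, once a model can ground those steps to concrete elements, they are trivial to execute with a browser automation library; a rough Playwright sketch with made-up selectors:

    from playwright.sync_api import sync_playwright

    # The selectors below are made up for illustration; a UI-navigation model would
    # have to ground "search box" and "orange search button" to real elements itself.
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("https://www.amazon.com")
        page.fill("input#search-box", "water bottle")  # steps 1-2: clear the box, type the query
        page.click("button#search-button")             # step 3: click/tap the search button
        browser.close()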
———————————————
prompt: What will happen if I click on Joan?
generation: You can now print the Joan ad.
comment: The model doesn’t understand that a menu or new page with user settings will open up.
UI Interactivity Awareness With Markers
Adding markers to the images helps the model to provide more concise references when generating an output. Some new research in this direction: https://som-gpt4v.github.io/
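Adding the markers themselves is straightforward; a minimal Pillow sketch that stamps each known interactive element with its ID (the element boxes would normally come from a DOM dump or a segmentation model, and the coordinates here are invented):

    from PIL import Image, ImageDraw

    # Hypothetical interactive elements: (marker, (x0, y0, x1, y1)).
    elements = [
        ("#text_input_1", (40, 60, 560, 110)),
        ("#button_1", (570, 60, 640, 110)),
        ("#button_2", (600, 10, 640, 50)),
    ]

    img = Image.open("screenshot.png").convert("RGB")
    draw = ImageDraw.Draw(img)
    for marker, box in elements:
        draw.rectangle(box, outline="red", width=3)           # outline the element
        draw.text((box[0], box[1] - 14), marker, fill="red")  # stamp its marker ID
    img.save("screenshot_with_markers.png")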
prompt: I’ve taken a screenshot of a web page. The web page has the following interactive elements #text_input_1, #button_1 and #button_2. I’m looking for a water bottle, what actions should I take next?
generation:1. Select the “Sunnyscopa Waterslide Decal Paper for Laser Printer – Clear, A4, 100 Sheets – Personalized Water Bottle Label” from the available options.\n2. Select the “Joan” option from the available options.\n3. Select the “Free Delivery by Amazon”…..
comment: Adding anchors strengthens the communication between modalities (text-image, image-text), but it adds the overhead of having to do image segmentation and tagging. It did not seem to help in our case.
Summary
Overall, this open-source foundation MLLM is commendable. However, it requires some fine-tuning for UI navigation. I anticipate that Adept will release a UI navigation model soon.
I conducted similar tests with GPT-4V, and it performed well overall. Nonetheless, it’s not an equitable comparison, as GPT-4V has undergone more extensive QA and refinement, and it is closed-source.