The notebook (link at the end) covers the basic building blocks to adapt LLMs for your own use case:
Data collection/generation/augmentation.
Fine-tuning Gemma with a P100 GPU and LoRA.
Document chunking for RAG.
Chunk ranking using ColBERT and BERT/Gemma embeddings.
Evaluation using an LLM judge (Gemini Pro).
Evaluation using a distance metric.
Here is an excerpt of the main findings so far:
Dataset generation, RAG with ColBERT and query strategy yield the best evaluation scores.
Gemma already has Kaggle knowledge out of the box so fine tuning with new data didn’t make much difference.
Fine tuning (FT) is for learning new abstractions rather than memorising new data for QA. Overfitting does help with memorisation though.
Single-vector search is weak because document pooling operations throw away too much signal. ColBERT is multi-vector, so no pooling is involved and the signal is preserved.
A larger, more capable model (Gemini) makes a good judge, but it’s pricey.
RAG + RAW: use RAG and train the model to say “I don’t know” if an answer isn’t in the context, then use the raw model as a fallback. This improves evaluation scores.
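The RAG + RAW fallback described above boils down to a simple control flow. Here is a minimal sketch with stand-in callables for the retriever and the two model calls (all names are hypothetical):

```python
def answer(question, retrieve, rag_answer, raw_answer):
    """RAG + RAW: try grounded generation first, fall back to the raw model.

    The fine-tuned model is assumed to reply "I don't know" when the
    answer isn't in the retrieved context."""
    context = retrieve(question)
    grounded = rag_answer(question, context)
    if grounded.strip().lower() == "i don't know":
        return raw_answer(question)  # fall back to parametric knowledge
    return grounded  # grounded answer, with sources available


# Toy stand-ins, just to show the control flow:
kb = {"What GPU was used?": "A P100 GPU was used with LoRA."}
retrieve = lambda q: kb.get(q, "")
rag_answer = lambda q, ctx: ctx if ctx else "I don't know"
raw_answer = lambda q: "(raw model guess)"

print(answer("What GPU was used?", retrieve, rag_answer, raw_answer))
print(answer("Unknown question?", retrieve, rag_answer, raw_answer))
```

The point of the pattern is that the grounded path keeps its citations, while the raw path is only used when retrieval comes up empty.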
MLLMs (Multimodal Large Language Models) such as GPT-4V and Gemini can ingest data in multiple modalities: text, video, sound and images. Personally, one of the most useful applications of MLLMs is UI navigation. As a SWE, you could have a web-based agent that runs Gherkin-like syntax tests without having to write any code. Or, you could instruct your browser to book a flight for you, re-schedule a meeting, book a table, etc.
However, the ability to do VQA (Visual Question Answering) with MLLMs is conditioned by how much signal is preserved when downscaling images to meet MLLM input size requirements. This is particularly challenging when images have an odd aspect ratio, like a webpage screenshot. Ideally, we don’t want to lose any signal. As a comparison, there is no equivalent signal loss for text in LLMs.
Google came up with ViT (Vision Transformer). It splits an image into a grid of fixed-size 16×16-pixel patches, and each patch becomes a token. Image tokens are linearly projected into a transformer.
A fixed patch grid is great for images with a 1:1 aspect ratio but not so great for other aspect ratios. I also thought that linearly projecting the patches would remove the need to downscale, but it doesn’t. In the case of GPT-4V, the image needs to be downscaled to fit in a 2048×2048px grid, and each patch/token covers 512×512 pixels, so this is a 4×4 grid.
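The grid arithmetic can be made concrete with a small helper (a sketch; the 2048px canvas and 512px patch size are the GPT-4V figures quoted above):

```python
def patch_grid(width, height, patch_px):
    """Number of patches along each axis when an image is cut into
    square patches of `patch_px` pixels (ceiling to cover the edges)."""
    cols = -(-width // patch_px)   # ceiling division
    rows = -(-height // patch_px)
    return cols, rows

# GPT-4V: a 2048x2048 canvas with 512px patches is a 4x4 grid (16 tokens)
print(patch_grid(2048, 2048, 512))   # -> (4, 4)
# A tall 5:1 screenshot downscaled into the same canvas wastes most cells
print(patch_grid(410, 2048, 512))    # -> (1, 4)
```

The second call shows why odd aspect ratios hurt: after downscaling, a tall page only occupies a thin column of the patch grid.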
We need MLLMs that can handle multiple aspect ratios without having to downscale images so there is no information loss.
Firstly, I asked ChatGPT to generate an image caption:
The caption is good. After that, I asked if there is a 100g variation for this product and I got:
This is not true, there is a 100g option that is clearly visible.
Following that, I took a screenshot of the section of the web page that lists the multiple options and asked ChatGPT to generate a caption. This section has a friendlier aspect ratio of ~2:1.
ChatGPT got it right this time: it identified that there is a 100g option priced at £13.99. I think this shows that MLLMs are good at handling images with an aspect ratio close to 1:1 that need minimal downscaling.
Gemini
Let’s now look at how Gemini responds to the prompt “describe this image” given the webpage screenshot.
This isn’t good. This is a product page about green tea, not kitchen hardware… For the sake of evaluation, I also prompted Gemini with “Is there a 100g option?”
Gemini couldn’t answer my question. I then took a screenshot of the section with multiple weight options and a friendlier aspect ratio.
Similarly to GPT-4V, this also worked well. However, it shows that Gemini is not good at dealing with images that have an odd aspect ratio, like web page screenshots.
GPT-4V API AND IMAGE CHUNKING
The examples above were carried out using the ChatGPT and Gemini web apps. I also tried to do the same via the GPT-4V API with the high-fidelity flag set. The documentation recommends:
For high res mode, the short side of the image should be less than 768px and the long side should be less than 2,000px.
As suggested, I downscaled the image from 3840×20294 to 387×2048. After that, I encoded the image to base64 and fed it to the model alongside the prompt “Is there a 100g option?”.
Unsurprisingly, downscaling makes large images lose too much signal, so the model is not able to answer our prompt. How do we solve this? Similarly to RAG’s chunking, we can also chunk images. For this particular example, I chunked the screenshot into 7 chunks, then downscaled each one as per OAI requirements. Each chunk is 1017×768. GPT-4V can handle multiple image inputs:
The Chat Completions API is capable of taking in and processing multiple image inputs in both base64 encoded format or as an image URL. The model will process each image and use the information from all of them to answer the question.
OAI Documentation
Therefore, all the image chunks can be fed in one request with minimal signal loss, alongside the text prompt “Is there a 100g option?”
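Here is a sketch of how such a request body can be assembled (payload construction only, no API call; `b64_chunks` stands in for the base64-encoded chunks, and the model name is an assumption):

```python
def build_payload(prompt, b64_chunks, model="gpt-4-vision-preview"):
    """Pack a text prompt plus several base64 image chunks into one
    Chat Completions request body."""
    content = [{"type": "text", "text": prompt}]
    for b64 in b64_chunks:
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{b64}",
                          "detail": "high"},  # high-fidelity mode
        })
    return {"model": model, "messages": [{"role": "user", "content": content}]}

payload = build_payload("Is there a 100g option?", ["chunkA", "chunkB"])
# One user message containing 1 text part + 2 image parts
print(len(payload["messages"][0]["content"]))  # -> 3
```

The body would then be POSTed to the Chat Completions endpoint with the usual auth headers.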
This worked well, which shows that image chunking is a good strategy. Note that for image chunking, we want to pick a number of chunks that minimises the signal loss when downscaling. This might not be trivial at first, and it might be domain-specific. For web pages, 5-10 chunks seems to work well across the board.
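The chunk-and-downscale arithmetic can be sketched in a few lines (the 768px/2,000px bounds are the documented limits quoted earlier; the crop boxes could then be passed to e.g. PIL’s `Image.crop`):

```python
def chunk_boxes(width, height, n_chunks):
    """Split a tall image into `n_chunks` horizontal bands,
    as (left, top, right, bottom) crop boxes."""
    band = height // n_chunks
    return [(0, i * band, width, height if i == n_chunks - 1 else (i + 1) * band)
            for i in range(n_chunks)]

def fit_dims(width, height, short_max=768, long_max=2000):
    """Largest dimensions that keep the aspect ratio with the short
    side <= short_max and the long side <= long_max."""
    short, long_ = sorted((width, height))
    scale = min(short_max / short, long_max / long_, 1.0)
    return round(width * scale), round(height * scale)

boxes = chunk_boxes(3840, 20294, 7)   # the screenshot from this example
print(fit_dims(3840, 20294 // 7))     # -> (1017, 768), matching the chunks above
```

Picking `n_chunks` so that `fit_dims` barely needs to scale at all is the cheap way to minimise signal loss.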
Overall, the ability to do VQA with existing MLLMs is conditioned by how much signal is preserved when downscaling. I’ll upload the code to do image chunking and post it soon.
Generative AI is opening the doors to new products. It’s great to witness in real time how the Internet is evolving and to imagine what it might become in the coming years. Search, UI navigation and content generation are the trident technologies leading this evolution.
Search
The Internet acts as the largest repository of knowledge worldwide. Search engines guide us through the Internet, directing us to web pages likely to have the answers we seek.
By 2014, this functionality evolved further, allowing search engines to highlight specific passages within web pages that are most likely to contain the answers to our questions. This means that search engines now fulfil both navigational and informational purposes, making it easier to find information quickly within SERPs.
In 2018, Large Language Models demonstrated a way to generate answers from the Common Crawl (CC) dataset. CC is a web index with 3B web pages. The technical breakthrough is that LLMs can retrieve relevant bits of information from all those 3B web pages. However, there are three caveats. First, LLMs can’t tell what specific source has been used to generate an answer. Second, when the LLM doesn’t have information about a topic, it hallucinates. Third, LLMs are not suitable to fulfil navigational intents.
In 2020, Meta introduced RAG. It connects LLMs to external data sources like the Internet, so answers are grounded in sources. This is great because we can reference specific web pages and passages. However, the number of sources that can be used to generate an answer is limited by the LLM’s context length.
In 2023, the AI community gave LLMs autonomy and the concept of AI agents was born. An objective is given to an agent and it figures out what to do to complete it. This usually involves planning, multiple task execution, performing external actions, using memory, etc. For example, AskPandi is an AI search agent.
UI Navigation
Another advancement is in UI navigation, facilitating browser and workflow automation. This development enables us to assign tasks to an agent, which can then execute them on our behalf. As a result, we’re moving from direct human-computer interaction to a more seamless human-assistant model. Here is a little example to automatically dismiss cookie consents.
Content Generation
We have the capability to direct a LLM to produce or modify content on our behalf, making content creation more accessible than ever. However, the key to success lies in generating content that people want. Content that lacks effort or relevance is easily recognisable and less likely to engage readers.
The Generative Internet Is Born
The three advancements in search, UI navigation, and content generation unlock new product opportunities. Search engines evolve into answer engines that interact with the web for us, picking only the relevant bits of information from multiple sources and composing an answer that is essentially an interactive UI.
For example, “suggest a 5 day trip to a surf break in Europe. Also get me flight and accommodation information, including pricing”.
To complete this objective, we need to conduct multiple searches and then combine the outcomes into a single compelling answer.
But, what happens after we find the information we are looking for?
For this particular example, a natural follow-up action might be to book flights and accommodation. In this context, UI navigation proves invaluable because it allows us to instruct an AI agent to automate those tasks for us.
UI navigation facilitates workflow automation, making mundane tasks such as rescheduling a meeting, bookings, or ordering something executable in natural language. This is akin to iOS shortcuts, but without the need to write custom integrations because the web is open.
At this point, the user rarely interacts with the web, an assistant does it for us.
Finally, machine-generated content has an important part to play because our need to access knowledge is ever-growing. In my last research paper, I found that it only takes 5 hops to find knowledge gaps on the Internet. In addition, new research has found that our knowledge needs are outpacing the amount of content available on the public web, giving the impression that search engines are getting worse. I suspect walled gardens like social media platforms have something to do with it too, because their content isn’t public.
If assistants interact with the web for us, what’s the point of web pages?
Most UI navigation systems are being trained on traditional UIs, so web pages are still relevant. In addition, there are many situations where a user will have to step in, like confirming a purchase, authentication, captchas, or entering payment details.
However, interacting with the web via API calls instead of UI navigation is faster so I suspect products that provide an API to an assistant will have an edge here.
What happens to web content creators who rely on ad-traffic?
Ads, if relevant, are good content. Since AI assistants are very good at filtering out irrelevant content, web content creators will need to ensure that their sponsored content is also relevant.
What’s the equivalent of backlinks in a generative Internet?
Backlinks from reputable domains are a good ranking signal in traditional search engines. However, what’s the equivalent of a backlink when assistants generate unique web pages on the fly so there is no unique link to refer to? It feels there is a need for better ranking signals. I wouldn’t be surprised if AEO (Answer Engine Optimisation) becomes a thing.
What happens to social media?
We are transitioning from information extraction to information generation. This increases relevancy, which is good for users. I think someone will make a social network where 100% of the content is generated.
To sum up, search, UI navigation and content generation are reshaping the Internet as we know it.
Working with LLMs is shifting from human-machine interactions to human-machine and machine-machine interactions. This allows LLMs to do ever more complex tasks. This new mode of interaction has been coined the AI agent.
Threaded conversations lack the structure needed to complete complex tasks. Therefore, objective divergence is a common issue with AI agents. Objective divergence is the equivalent of a blue screen in traditional computers.
I’m currently researching how to make agents more reliable, mainly by providing structure and constraints. Here is a prompt that the MAC-1 agent handles in a few seconds: “Find flights and accommodation for my next trip to Barcelona”
Note what the agent does:
It detects the intent(s) in the prompt.
It interacts with the user to get more information.
It’s stateful, so it’s aware of previous actions.
It chains actions.
It operates a browser.
It’s able to map natural language into specific formats: “next Friday” -> “26-01-2024”.
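The last capability — resolving relative dates into absolute ones — can be sketched with the standard library (the reference date is an assumption for the example; a real agent would use today’s date):

```python
from datetime import date, timedelta

def next_friday(today):
    """Date of the next Friday strictly after `today` (Friday = weekday 4)."""
    days_ahead = (4 - today.weekday()) % 7
    return today + timedelta(days=days_ahead or 7)

# If "today" is Friday 19 Jan 2024, "next Friday" resolves to 26 Jan 2024
print(next_friday(date(2024, 1, 19)).strftime("%d-%m-%Y"))  # -> 26-01-2024
```

In practice the LLM itself usually does this mapping; a deterministic helper like this is useful as a guard against the model getting the calendar arithmetic wrong.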
I’m actively working on this and I’ll share more. Follow me on X.
LLMs have a limit on the amount of text they can take as input and produce as output. In retrieval-augmented generation (RAG) applications, a set of documents is first retrieved and added to the input alongside an instruction, thus creating the prompt. This is referred to as in-context learning. We can draw an analogy with computer architectures here: the LLM is the CPU, the context is the RAM, and the full document library is the hard disk.
The context length (RAM) is consistently expanding. In my early encounters with Recurrent Neural Networks (RNNs), we could make reliable predictions with input lengths of about 20 words; beyond that, issues such as vanishing/exploding gradients led to NaNs (not bread). Nowadays, we have LLMs with a context of up to 200K tokens. Note that tokenisation has also evolved over time, transitioning from words to sub-word tokens.
Chunking, a technique used to divide a document into multiple parts so each fits within the context, has played a pivotal role in handling documents larger than the context. With a context size of 200K tokens, we can process up to ~330 pages within a single chunk for prediction.
Initially, when I began working with LLMs, chunking posed a substantial challenge. However, as the context length expanded, this problem gradually receded into the background.
Or did it?
It turns out that processing vast quantities of text introduces latency to the entire system. Consequently, chunking remains a relevant strategy for developing high-performance applications.
From personal experience, the choice of chunking technique depends on the user’s or system’s intent. For example, when summarising a document, it’s necessary to consider all the chunks. Conversely, when seeking specific answers to questions, we only need to select the most relevant chunks. Now, let’s look into a few chunking techniques I’ve used so far.
Static Chunking
One straightforward method of chunking involves aligning the chunk size with the LLM context length. It’s important to note that if a chunk is combined with an instruction in a prompt, the chunk’s length should be smaller than the context length minus the instruction length.
For instance, if the context length is 200 tokens and the instruction spans 20 tokens, the chunk’s length should be set at 180 tokens.
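The sizing rule above can be sketched as follows (tokens are represented as a plain list; a real tokenizer would produce them):

```python
def static_chunks(tokens, context_len, instruction_len):
    """Split `tokens` into fixed-size chunks that, together with the
    instruction, fit the context: chunk_len = context_len - instruction_len."""
    chunk_len = context_len - instruction_len
    if chunk_len <= 0:
        raise ValueError("instruction alone exceeds the context length")
    return [tokens[i:i + chunk_len] for i in range(0, len(tokens), chunk_len)]

doc = list(range(500))                 # a 500-token "document"
chunks = static_chunks(doc, 200, 20)   # 180-token chunks, as in the example
print([len(c) for c in chunks])        # -> [180, 180, 140]
```

Note the trailing chunk is shorter; depending on the application you may want to pad it or merge it with its neighbour.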
In practice, I haven’t found many use cases where this method performs exceptionally well. It appears to be most suitable for agentic systems where the agent is aware that a document is segmented into multiple chunks, delegating the decision of which chunks to retrieve to the agent.
Dynamic Chunking Based On Traditional IR
This method involves chunking a document based on a predefined condition. In my experience, the most effective approach has been to view chunking as an information retrieval (IR) problem. We can treat a document as a dataset in which each sentence serves as an item. Given a query, we retrieve a list of ranked fragments, with fragment boundaries coinciding with common sentence delimiters. Each fragment essentially becomes a chunk. This can be effortlessly achieved using ElasticSearch (ES) and BM25, ES’s default ranker.
By approaching chunking as an IR problem, we can also leverage IR techniques like stemming, lemmatisation, synonymy, query expansion, and more.
This represents a straightforward, efficient, and productive dynamic chunking technique. Unlike approaches involving embedding extraction, artificial neural networks (ANN), or neural IR, this method is less resource-intensive to implement. Furthermore, it avoids the pitfalls of Out of Distribution (OOD) shifts.
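Here is a self-contained sketch of the idea using a toy BM25 scorer over sentences — in practice ES and BM25 replace all of this, and adjacent hits would be merged into multi-sentence fragments:

```python
import math, re

def bm25_rank(query, sentences, k1=1.5, b=0.75):
    """Score each sentence against the query with BM25 and return the
    sentences sorted by descending score."""
    docs = [re.findall(r"\w+", s.lower()) for s in sentences]
    avgdl = sum(len(d) for d in docs) / len(docs)
    n = len(docs)
    scores = []
    for d in docs:
        score = 0.0
        for term in re.findall(r"\w+", query.lower()):
            df = sum(1 for doc in docs if term in doc)       # document frequency
            idf = math.log(1 + (n - df + 0.5) / (df + 0.5))  # BM25 idf
            tf = d.count(term)
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return [s for _, s in sorted(zip(scores, sentences), reverse=True)]

sents = ["ColBERT preserves token-level signal.",
         "Static chunking uses a fixed size.",
         "BM25 is ElasticSearch's default ranker."]
print(bm25_rank("default ranker", sents)[0])  # -> "BM25 is ElasticSearch's default ranker."
```

With ES, the same behaviour falls out of a plain match query against an index of sentences, plus highlighting to get fragment boundaries.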
Dynamic Chunking Based On Neural IR (Embeddings)
Similar to the previous method, this approach involves converting a document into sentence embeddings. We then retrieve relevant chunks based on the similarity between the query and sentence embeddings. While this method is more computationally demanding, it can yield more accurate results, particularly when the embeddings align closely with the application domain. Most modern databases offer support for vector indexing and search.
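The same retrieval loop with embeddings can be sketched with a stand-in bag-of-words encoder (a real system would call a neural sentence-embedding model instead of `embed` below):

```python
import math, re

def embed(text, vocab):
    """Toy bag-of-words 'embedding'; stands in for a neural sentence encoder."""
    words = re.findall(r"\w+", text.lower())
    return [words.count(w) for w in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_chunks(query, sentences, k=2):
    """Rank sentences by cosine similarity to the query and keep the top k."""
    vocab = sorted({w for s in sentences + [query] for w in re.findall(r"\w+", s.lower())})
    qv = embed(query, vocab)
    ranked = sorted(sentences, key=lambda s: cosine(qv, embed(s, vocab)), reverse=True)
    return ranked[:k]

sents = ["The cat sat on the mat.",
         "LLM context lengths keep growing.",
         "Embeddings enable semantic search."]
print(top_chunks("semantic search with embeddings", sents, k=1))
```

Swapping the toy encoder for real sentence embeddings, and the linear scan for a vector index, gives the production version of the same loop.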
Dynamic Chunking Based On Neural IR (ColBERT)
Many vector-based search systems necessitate pooling the query and documents into a single vector. However, such pooling operations tend to discard valuable signals. ColBERT introduces a method that eschews pooling, thereby preserving the query and document embedding signals more effectively. Additionally, since there’s no need to segment documents into sentences, as in the previous method, we can automatically identify multi-sentence fragments, delimited by sentence boundaries, akin to the traditional IR approach.
However, it’s worth noting that this method comes with higher computational costs and requires the ColBERT ranker to be implemented within the vector database, which not all of them offer out of the box. Additionally, it is not immune to OOD data shifts.
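ColBERT’s late-interaction scoring (MaxSim) can be sketched in a few lines: for each query token embedding, take its best match over the document token embeddings, and sum. The 2-d vectors below are toys; real ColBERT uses learned per-token BERT embeddings.

```python
def maxsim(query_vecs, doc_vecs):
    """ColBERT late interaction: sum over query tokens of the max
    dot-product similarity with any document token (no pooling)."""
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

query = [(1.0, 0.0), (0.0, 1.0)]   # two query-token embeddings
doc_a = [(0.9, 0.1), (0.1, 0.9)]   # each query token has a strong match
doc_b = [(0.5, 0.5), (0.5, 0.5)]   # a "pooled-looking" doc, weaker matches
print(maxsim(query, doc_a))  # -> 1.8
print(maxsim(query, doc_b))  # -> 1.0
```

The example shows why skipping pooling helps: doc_a matches each query token individually and outscores doc_b, even though a mean-pooled vector of doc_a would look identical to doc_b.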
Summary
While the context length keeps growing, it still makes sense to do chunking to reduce latency.
There are many static and dynamic ways to do chunking. The best one will always depend on the end application.
🚀 Excited to share new work on “Harnessing Retrieval-Augmented Generation (RAG) for Uncovering Knowledge Gaps”.
In this paper, I simulate how users search the Internet but instead of searching for content that exists through traditional information retrieval methods, we search for the most relevant content, even if it doesn’t exist.
Therefore, information retrieval shifts from an extraction to a generation problem.
This process is guided by the premise that a well-generalised LLM should provide useful recommendations based on the initial question and answer.
🔍 Findings:
Using Bing’s web index, we found that typical internet searches hit a ‘knowledge wall’ at just the fifth level of topic depth. This means there’s a lot more we can discover and learn!
💡 What This Means:
This research changes how we use search engines. Imagine getting deeper, more insightful answers to your questions, going beyond what’s already written on the web.
More broadly, this work can be used to improve the completeness of a content library.
Adept’s mission to enable computers to interact with UIs will enhance our productivity and save time.
I have been eagerly awaiting their ACT-1 model for quite some time. However, while that is being developed, they have released FUYU, a multimodal LLM or MLLM. The term ‘multimodal’ implies its ability to process both text and images. More here: https://www.adept.ai/blog/fuyu-8b.
Adept has introduced an architectural simplification that eliminates the need for the image encoder.
Image patches are instead linearly projected into the first layer of the transformer, bypassing the embedding lookup. We simply treat the normal transformer decoder like an image transformer (albeit with no pooling and causal attention).
Eliminating the need to scale or crop images to meet certain specifications is a significant advantage. This has profound implications for UI navigation, especially since aspect ratios are not always consistent. For instance, 5:1 aspect ratios are frequently seen on web pages. Traditional image preprocessing often discards too much valuable information.
However, a notable limitation is its variable memory requirements, which led me to frequently encounter out-of-memory (OOM) errors. To address the issue of non-uniform aspect ratios, I segmented the images and placed each segment side-by-side to achieve a consistent aspect ratio.
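The side-by-side trick is pure geometry: cutting a tall W×H image into k vertical segments laid side by side gives a (k·W)×(H/k) canvas, and k ≈ √(H/W) lands closest to square. A sketch (function names are my own):

```python
import math

def side_by_side_dims(width, height):
    """Choose the number of vertical segments that, laid side by side,
    brings a tall image closest to a 1:1 aspect ratio."""
    k = max(1, round(math.sqrt(height / width)))
    seg_h = height // k
    return k, width * k, seg_h  # segments, new canvas width, new canvas height

# A 5:1 page screenshot (e.g. 800x4000): 2 segments side by side -> 1600x2000
print(side_by_side_dims(800, 4000))  # -> (2, 1600, 2000)
```

The actual cropping and pasting can then be done with PIL, using the returned segment height as the band size.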
Interestingly, when I fed screenshots of web pages to GPT-4V, I faced OOM errors with images that also had irregular aspect ratios. This leads me to speculate whether GPT-4V might operate on a decoder-only model.
On the plus side, FUYU is relatively lightweight. It comfortably fits within an AWS g5.2xlarge instance, which offers 24GB of GPU memory. The only modification I made was to adjust the precision to float16 to ensure smooth operation:
model = FuyuForCausalLM.from_pretrained(pretrained_path, device_map="cuda:0", torch_dtype=torch.float16)
I’ve assessed FUYU with a focus on UI navigation capabilities, even though this may not align with its original training purpose. While one could argue that this is not an equitable evaluation, FUYU remains the best open-source model available for UI navigation. With Adept’s mission in mind, it’s anticipated that their models will, in time, exhibit specialised capabilities for UI navigation, much like FUYU’s resolution-agnostic feature. This assessment establishes a foundation, highlighting areas where fine-tuning and future models can concentrate.
I’ve captured screenshots of various UIs from mobile websites and posed questions about them. This article will specifically discuss the screenshot from Amazon presented below.
Generate A Web Page Caption
prompt: Generate a caption.
generation: The image shows a screen shot of a “sunnyscopa” product listing, which features “free delivery by amazon”. The screen shot also includes the “home & kitchen” tab, where the user has selected “sunnyscopa waterslide decal paper for laser printer – clear, a4, 100 sheets – personalized waterslide decal paper
comment: The caption focuses on the text in the screenshot; it says nothing about the other visual components or the page layout.
VQA based on text in images
This model is good at doing VQA based on text in images, presumably due to its OCR capabilities.
prompt: What’s the price of the product?
generation: Sunnyscopa Waterslide Decal Paper for Laser Printer – Clear, A4, 100 Sheets – Personalized Waterslides
comment: Maybe my prompt was too challenging here. I assumed that because there is only one product listed and one price, the model would figure it out.
———————————————
prompt: What is the price of the sunnyscopa product?
generation: 79.72
comment: The model got the price right whenever I included the product name in the prompt.
VQA based on elements in images
prompt: What it’s in the product thumbnail?
generation: Joan joan amazon
comment: It didn’t understand what a product thumbnail is.
VQA based on Spatial Awareness
This was the area I was looking forward to the most, because UI navigation requires spatial awareness. This means the model needs to understand how the UI components are laid out and what they are for. For example, buttons can have icons as well as text, and they lead the user to a new web page state.
prompt: What is the colour of the search button?
generation: orange
comment: Good, the model understands what a button is and has figured out that the magnifying glass means search.
———————————————
prompt: What’s the text in the search box?
generation: The search box displays an advertisement for a waterslide decal paper for laser printers.
comment: OK, there are no ads in the search box… Maybe the model is confusing the search result listing with an autocomplete dropdown?
———————————————
prompt: Where is the Amazon logo located?
generation: bottom
comment: It’s at the top left
———————————————
prompt: Where is the Amazon logo located? Generate bounding box coordinates.
generation: SunnySCOPA Waterslide Decal Paper for Laser Printer – Clear, A4, 100 Sheets – Personalized Waterproof Paper
comment: Incorrect. Being able to generate coordinates to locate any element in an image will be necessary. Adept has teased this capability for text in their blog post, but it will also be needed for other UI components. This will be great.
UI Interactivity Awareness
In this scenario we want to check whether the model is aware of the interactive elements on the web page. Interactivity awareness is essential for UI navigation.
prompt: “Given this web page screenshot from the Amazon mobile site, what should I do next if I am searching for a water bottle?”
generation: In the Amazon app, you can search for Sunnyscopa Waterslide Decal Paper for Laser Printer – Clear, A4, 100 Sheets – Personalized Water Bottle Label. If you can find the product, simply tap or click “GET IT” button.
comment: The model understood that you need to go to an app, search for things, and tap or click to buy. The model is aware of mobile/touchscreen vs. desktop, referencing both tap (mobile) and click (desktop). However, it skimmed the screenshot to do the reasoning. The right answer would have been along the lines of:
Delete the current text in the search box
Type “water bottle”
Click/Tap on the orange search button
———————————————
prompt: What will happen if I click on Joan?
generation: You can now print the Joan ad.
comment: The model doesn’t understand that a menu or new page with user settings will open up.
UI Interactivity Awareness With Markers
Adding markers to the images helps the model to provide more concise references when generating an output. Some new research in this direction: https://som-gpt4v.github.io/
prompt: I’ve taken a screenshot of a web page. The web page has the following interactive elements #text_input_1, #button_1 and #button_2. I’m looking for a water bottle, what actions should I take next?
generation:1. Select the “Sunnyscopa Waterslide Decal Paper for Laser Printer – Clear, A4, 100 Sheets – Personalized Water Bottle Label” from the available options.\n2. Select the “Joan” option from the available options.\n3. Select the “Free Delivery by Amazon”…..
comment: Adding anchors strengthens the communication between modalities (text-image, image-text), but it adds the overhead of having to do image segmentation and tagging. It did not seem to help in our case.
Summary
Overall, this open-source foundation MLLM is commendable. However, it requires some fine-tuning for UI navigation. I anticipate that Adept will release a UI navigation model soon.
I conducted similar tests with GPT-4V, and it performed well overall. Nonetheless, it’s not an equitable comparison, as GPT-4V has undergone more extensive QA and refinement, and it is closed-source.