The notebook (link at the end) covers the basic building blocks to adapt LLMs for your own use case:
Data collection/generation/augmentation.
Fine-tuning Gemma with a P100 GPU and LoRA (a sketch follows this list).
Document chunking for RAG.
Chunk ranking using ColBERT and BERT/Gemma embeddings.
Evaluation using an LLM judge (Gemini Pro).
Evaluation using a distance metric.
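To make the fine-tuning step concrete, here is a minimal LoRA sketch using Hugging Face transformers and peft. The checkpoint name, rank, and target modules are illustrative assumptions, not the notebook’s exact settings.

```python
# Minimal LoRA fine-tuning sketch (assumed settings, not the notebook's exact config).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "google/gemma-2b"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

# LoRA trains a small set of low-rank adapter matrices instead of all weights,
# which is what makes fine-tuning feasible on a single P100.
lora_config = LoraConfig(
    r=8,                                   # assumed rank
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],   # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # only the adapter weights are trainable
```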
Here is an excerpt of the main findings so far:
Dataset generation, RAG with ColBERT, and a query strategy yield the best evaluation scores.
Gemma already has Kaggle knowledge out of the box, so fine-tuning with new data didn’t make much difference.
Fine-tuning (FT) is for learning new abstractions rather than memorising new data for QA. Overfitting does help with memorisation, though.
Single-vector search is bad because document pooling operations throw away too much signal. ColBERT is multi-vector, so no pooling is involved and the signal is preserved.
Using a larger, more capable model (Gemini) as a judge is a good evaluation approach but pricey.
RAG + RAW: use RAG and train the model to say “I don’t know” if an answer isn’t in the context, then use the raw model as a fallback (sketched below). This improves evaluation scores.
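A minimal sketch of the RAG + RAW fallback logic; `retrieve`, `rag_answer`, and `raw_answer` are hypothetical helpers standing in for the actual pipeline.

```python
# Sketch of RAG + RAW: answer from retrieved context first, fall back to the raw model.
def answer(question: str, retrieve, rag_answer, raw_answer) -> str:
    context = retrieve(question)               # top-ranked chunks (e.g. via ColBERT)
    rag_reply = rag_answer(question, context)  # model tuned to say "I don't know" when the answer isn't in the context
    if "i don't know" in rag_reply.lower():
        return raw_answer(question)            # fall back to the raw model's parametric knowledge
    return rag_reply
```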
Generative AI is opening the doors to new products. It’s great to witness in real time how the Internet is evolving and to imagine what it might become in the coming years. Search, UI navigation and content generation are the trident technologies leading this evolution.
Search
The Internet acts as the largest repository of knowledge worldwide. Search engines guide us through the Internet, directing us to web pages likely to have the answers we seek.
By 2014, this functionality evolved further, allowing search engines to highlight specific passages within web pages that are most likely to contain the answers to our questions. This means that search engines now fulfil both navigational and informational purposes, making it easier to find information quickly within search engine results pages (SERPs).
In 2018, Large Language Models demonstrated a way to generate answers from the Common Crawl (CC) dataset. CC is a web index with 3B web pages. The technical breakthrough is that LLMs can retrieve relevant bits of information from all those 3B web pages. However, there are three caveats. First, LLMs can’t tell what specific source has been used to generate an answer. Second, when the LLM doesn’t have information about a topic, it hallucinates. Third, LLMs are not suitable to fulfil navigational intents.
In 2020, Meta introduced RAG. It connects LLMs to external data sources, like the Internet, so answers are grounded in sources. This is great because we can reference specific web pages and passages. However, the number of sources that can be used to generate an answer is limited by the LLM’s token context length.
In 2023, the AI community gave LLMs autonomy and the concept of AI agents was born. An objective is given to an agent and it figures out what to do to complete it. This usually involves planning, executing multiple tasks, performing external actions, using memory, etc. For example, AskPandi is an AI search agent.
UI Navigation
Another advancement is in UI navigation, facilitating browser and workflow automation. This development enables us to assign tasks to an agent, which can then execute them on our behalf. As a result, we’re moving from direct human-computer interaction to a more seamless human-assistant model. Here is a little example of automatically dismissing cookie consent banners.
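As a rough illustration of this kind of automation (not the original example), here is a Playwright sketch that clicks a typical consent button; the button labels are assumptions, and a real agent would decide what to click with a model rather than a fixed list.

```python
# Hedged sketch: dismiss a cookie consent banner with Playwright (labels are assumptions).
from playwright.sync_api import sync_playwright

REJECT_LABELS = ["Reject all", "Decline", "Only necessary"]  # hypothetical label list

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")
    for label in REJECT_LABELS:
        button = page.get_by_role("button", name=label)
        if button.count() > 0:      # click the first matching consent button, if any
            button.first.click()
            break
    browser.close()
```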
Content Generation
We have the capability to direct an LLM to produce or modify content on our behalf, making content creation more accessible than ever. However, the key to success lies in generating content that people want. Content that lacks effort or relevance is easily recognisable and less likely to engage readers.
The Generative Internet Is Born
The three advancements in search, UI navigation, and content generation unlock new product opportunities. Search engines evolve into answer engines that interact with the web for us, picking only the relevant bits of information from multiple sources and composing an answer that is essentially an interactive UI.
For example, “suggest a 5-day trip to a surf break in Europe. Also get me flight and accommodation information, including pricing”.
To complete this objective, we need to conduct multiple searches and then combine the outcomes into a single compelling answer.
But, what happens after we find the information we are looking for?
For this particular example, a natural follow-up action might be to book flights and accommodation. In this context, UI navigation proves invaluable because it allows us to instruct an AI agent to automate those tasks for us.
UI navigation facilitates workflow automation, making mundane tasks such as rescheduling a meeting, making a booking, or ordering something executable in natural language. This is akin to iOS Shortcuts, but without the need to write custom integrations because the web is open.
At this point, the user rarely interacts with the web directly; an assistant does it for them.
Finally, machine-generated content has an important part to play because our need to access knowledge is ever growing. In my last research paper, I found out that it only takes 5 hops to find knowledge gaps on the Internet. In addition, new research has found that our knowledge needs are outpacing the amount of content available on the public web, thus giving the impression that search engines are becoming worse. I suspect walled gardens like social media platforms have something to do with it too, because their content isn’t public.
If assistants interact with the web for us, what’s the point of web pages?
Most UI navigation systems are being trained on traditional UIs, so web pages are still relevant. In addition, there are many situations where a user will have to step in, like confirming a purchase, authentication, captchas, or payment details.
However, interacting with the web via API calls instead of UI navigation is faster, so I suspect products that provide an API to an assistant will have an edge here.
What happens to web content creators who rely on ad-traffic?
Ads, if relevant, are good content. Since AI assistants are very good at filtering out irrelevant content, web content creators will need to ensure that their sponsored content is also relevant.
What’s the equivalent of backlinks in a generative Internet?
Backlinks from reputable domains are a good ranking signal in traditional search engines. However, what’s the equivalent of a backlink when assistants generate unique web pages on the fly, so there is no unique link to refer to? It feels like there is a need for better ranking signals. I wouldn’t be surprised if AEO (Answer Engine Optimisation) becomes a thing.
What happens to social media?
We are transitioning from information extraction to information generation. This increases relevance, which is good for users. I think someone will make a social network where 100% of the content is generated.
To sum up, search, UI navigation and content generation are reshaping the Internet as we know it.
LLMs have a limit on how much input they can ingest and how much output they can generate. In retrieval augmented generation (RAG) applications, a set of documents is first retrieved and added to the input alongside an instruction, thus creating the prompt. This is referred to as in-context learning. We can draw an analogy with computer architectures here: the LLM is the CPU, the context is the RAM, and the full document library is the hard disk.
The context length (RAM) is consistently expanding. In my early encounters with Recurrent Neural Networks (RNNs), we could make reliable predictions with input lengths of about 20 words. However, issues such as vanishing gradients led to NaNs (not bread). Nowadays, we have LLMs with a context of up to 200K tokens. Note that tokenisation has also evolved over time, transitioning from words to tokens.
Chunking, a technique used to divide a document into multiple parts so that each fits within the context, has played a pivotal role in handling documents that exceed the context. With a context size of 200K tokens, we can process up to roughly 330 pages within a single chunk for prediction.
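As a back-of-the-envelope check of that figure (the tokens-per-word and words-per-page numbers below are assumptions, not values from the text):

```python
# Roughly how many pages fit in a 200K-token context?
context_tokens = 200_000
words_per_token = 0.75   # common rule of thumb, assumed
words_per_page = 450     # assumed for a dense page
pages = context_tokens * words_per_token / words_per_page
print(round(pages))      # ~333 pages, in line with the ~330 figure above
```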
Initially, when I began working with LLMs, chunking posed a substantial challenge. However, as the context length expanded, this problem gradually receded into the background.
Or did it?
It turns out that processing vast quantities of text introduces latency to the entire system. Consequently, chunking remains a relevant strategy for developing high-performance applications.
From personal experience, the choice of chunking technique depends on the user’s or system’s intent. For example, when summarising a document, it’s necessary to consider all the chunks. Conversely, when seeking specific answers to questions, we only need to select the most relevant chunks. Now, let’s look into a few chunking techniques I’ve used so far.
Static Chunking
One straightforward method of chunking involves aligning the chunk size with the LLM context length. It’s important to note that if a chunk is combined with an instruction in a prompt, the chunk’s length should be at most the context length minus the instruction length.
For instance, if the context length is 200 tokens and the instruction spans 20 tokens, the chunk’s length should be set at 180 tokens.
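A minimal sketch of static chunking along these lines, assuming a Hugging Face-style tokenizer with encode/decode methods:

```python
# Static chunking sketch: fixed-size chunks of (context length - instruction length) tokens.
def static_chunks(text: str, tokenizer, context_len: int, instruction_len: int):
    chunk_len = context_len - instruction_len          # e.g. 200 - 20 = 180 tokens
    token_ids = tokenizer.encode(text)
    return [
        tokenizer.decode(token_ids[i:i + chunk_len])
        for i in range(0, len(token_ids), chunk_len)
    ]
```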
In practice, I haven’t found many use cases where this method performs exceptionally well. It appears to be most suitable for agentic systems where the agent is aware that a document is segmented into multiple chunks, delegating the decision of which chunks to retrieve to the agent.
Dynamic Chunking Based On Traditional IR
This method involves chunking a document based on a predefined condition. In my experience, the most effective approach has been to view chunking as an information retrieval (IR) problem. We can treat a document as a dataset in which each sentence serves as an item. Given a query, we retrieve a list of ranked fragments, with fragment boundaries coinciding with common sentence delimiters. Each fragment essentially becomes a chunk. This can be effortlessly achieved using ElasticSearch (ES) and BM25, ES’s default ranker.
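A minimal sketch of the same idea using the rank_bm25 package instead of ElasticSearch, with naive sentence splitting and whitespace tokenisation as simplifications:

```python
# Treat each sentence as a retrievable item and rank sentences with BM25.
from rank_bm25 import BM25Okapi

document = "First sentence. Second sentence about chunking for RAG. Third sentence."
sentences = [s.strip() for s in document.split(".") if s.strip()]  # naive sentence split

bm25 = BM25Okapi([s.lower().split() for s in sentences])           # whitespace tokenisation
query = "how does chunking work"
top_sentences = bm25.get_top_n(query.lower().split(), sentences, n=2)

chunk = " ".join(top_sentences)  # the top-ranked fragments become the chunk
```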
By approaching chunking as an IR problem, we can also leverage IR techniques like stemming, lemmatisation, synonymy, query expansion, and more.
This represents a straightforward, efficient, and productive dynamic chunking technique. Unlike approaches involving embedding extraction, artificial neural networks (ANN), or neural IR, this method is less resource-intensive to implement. Furthermore, it avoids the pitfalls of Out of Distribution (OOD) shifts.
Dynamic Chunking Based On Neural IR (Embeddings)
Similar to the previous method, this approach involves converting a document into sentence embeddings. We then retrieve relevant chunks based on the similarity between the query and sentence embeddings. While this method is more computationally demanding, it can yield more accurate results, particularly when the embeddings align closely with the application domain. Most modern databases offer support for vector indexing and search.
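A minimal sketch with sentence-transformers; the model name is an assumed general-purpose choice rather than a domain-tuned one:

```python
# Rank sentences by cosine similarity between query and sentence embeddings.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model
sentences = [
    "Chunking splits a document into parts.",
    "ColBERT is a late-interaction ranker.",
    "BM25 is a lexical ranker.",
]
query = "What is ColBERT?"

sentence_emb = model.encode(sentences, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)

scores = util.cos_sim(query_emb, sentence_emb)[0]  # one similarity score per sentence
best = int(scores.argmax())
print(sentences[best])                             # the most relevant sentence becomes the chunk
```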
Dynamic Chunking Based On Neural IR (ColBERT)
Many vector-based search systems necessitate pooling the query and documents into a single vector. However, such pooling operations tend to discard valuable signals. ColBERT introduces a method that eschews pooling, thereby preserving the query and document embedding signals more effectively. Additionally, since there’s no need to segment documents into sentences, as in the previous method, we can automatically identify multi-sentence fragments, delimited by sentence boundaries, akin to the traditional IR approach.
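To make the late-interaction idea concrete, here is a minimal MaxSim scoring sketch over pre-computed token embeddings; random tensors stand in for real ColBERT embeddings, which are also normalised in practice:

```python
# ColBERT-style late interaction: no pooling; score each query token against every
# document token, keep the maximum (MaxSim), then sum over query tokens.
import torch

query_emb = torch.randn(8, 128)      # 8 query tokens, 128-dim embeddings (stand-ins)
doc_emb = torch.randn(300, 128)      # 300 document tokens

sim = query_emb @ doc_emb.T          # token-level similarity matrix (8 x 300)
score = sim.max(dim=1).values.sum()  # MaxSim per query token, summed into one relevance score
```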
However, it’s worth noting that this method comes with higher computational costs and requires the ColBERT ranker to be implemented within the vector database, which not all offer out of the box. Additionally, it is not immune to OOD data shifts.
Summary
While the context length keeps growing, it still makes sense to do chunking to reduce latency.
There are many static and dynamic ways to do chunking. The best one will always depend on the end application.