The quest to build the universal personal assistant (UPA) has been reignited, with OpenAI and Google going head to head. But what would the ultimate personal assistant look like? Would it be an app, a browser, an OS, or even a physical robot? Let’s think first about the ideal product requirements for a universal personal assistant.
Introduction to Language-Image Pre-Training

Language-Image Pre-Training (LIP) has become a popular approach to obtaining robust visual and textual representations. It involves aligning the representations of paired images and texts, typically using a contrastive objective. The method was popularized by large-scale models like CLIP and ALIGN, which demonstrated the viability of this approach at massive scale.
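The contrastive alignment described above can be sketched as a symmetric InfoNCE loss over a batch of paired embeddings. This is a minimal NumPy illustration of the idea, not the actual CLIP or ALIGN training code; the function name and the whitespace-level details are my own.

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss for paired image/text embeddings.

    image_emb, text_emb: (batch, dim) arrays where row i of each matrix
    comes from the same image-text pair.
    """
    # Normalize rows so dot products become cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    logits = image_emb @ text_emb.T / temperature  # (batch, batch) similarities
    labels = np.arange(len(logits))                # i-th image matches i-th text

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)       # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # Average the image-to-text and text-to-image directions, as in CLIP.
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))
```

Minimizing this loss pulls each image embedding toward its paired text embedding while pushing it away from every other text in the batch.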
Large Language Models (LLMs) have transformed artificial intelligence, driving significant advances in conversational AI and related applications. However, these models face limitations due to fixed-length context windows, which impede their performance on tasks that require handling extensive conversations and analyzing lengthy documents.

Limitations of Current LLMs

Despite their revolutionary impact, current LLMs are hindered by constrained context windows.
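A common workaround for the fixed context window in long conversations is a sliding window over the message history: keep only the most recent messages that fit a token budget. This is a naive sketch of that idea; the whitespace-split token counter is a stand-in for a real tokenizer, not an actual API.

```python
def fit_to_context(messages, max_tokens, count_tokens=lambda m: len(m.split())):
    """Keep the most recent messages that fit within a fixed token budget.

    messages: conversation turns, oldest first.
    count_tokens: hypothetical token counter (whitespace split by default).
    Returns the kept messages, still ordered oldest first.
    """
    kept, budget = [], max_tokens
    # Walk backwards from the newest message, stopping once the budget is spent.
    for msg in reversed(messages):
        cost = count_tokens(msg)
        if cost > budget:
            break
        kept.append(msg)
        budget -= cost
    return list(reversed(kept))
```

The trade-off is obvious: anything outside the window is simply forgotten, which is exactly the limitation that motivates long-context and memory-augmented approaches.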
Introduction to ColPali

ColPali is a novel document retrieval model that significantly improves the efficiency and accuracy of matching user queries to relevant documents. It leverages the advanced document understanding capabilities of recent Vision Language Models (VLMs) to produce high-quality, contextualized embeddings derived solely from images of document pages.

The Challenge in Modern Document Retrieval

Modern
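ColPali scores pages with ColBERT-style late interaction: each query token embedding is matched against every patch embedding of a page image, the best match per token is kept (MaxSim), and the maxima are summed. A minimal NumPy sketch of that scoring, assuming the embeddings have already been produced by the VLM (the function names here are illustrative, not from the ColPali codebase):

```python
import numpy as np

def maxsim_score(query_vecs, page_vecs):
    """Late-interaction (MaxSim) score between one query and one page.

    query_vecs: (n_query_tokens, dim) query token embeddings.
    page_vecs:  (n_patches, dim) patch embeddings of one page image.
    """
    sims = query_vecs @ page_vecs.T      # (n_query_tokens, n_patches)
    # For each query token, keep its best-matching patch, then sum.
    return sims.max(axis=1).sum()

def rank_pages(query_vecs, pages):
    """Return page indices sorted by descending MaxSim score."""
    scores = [maxsim_score(query_vecs, p) for p in pages]
    return sorted(range(len(pages)), key=lambda i: -scores[i])
```

Because each page keeps a bag of patch vectors rather than a single pooled vector, fine-grained matches (a table cell, a figure label) can dominate the score for the query tokens that care about them.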
I’ve put together this presentation to help anyone with a technical background build their own LLM-powered applications. The topic is quite broad, but the presentation covers all the basic moving parts you need to bear in mind.
Sharing my notebook for the Google AI Assistants For Data Task With Gemma competition. The notebook (link at the end) covers the basic building blocks for adapting LLMs to your own use case. Here is an excerpt of the main findings so far: dataset generation, RAG with ColBERT, and query strategy yield the best
MLLMs (Multimodal Large Language Models) such as GPT-4V and Gemini can ingest data in multiple modalities: text, video, sound, and images. Personally, one of the most useful applications of MLLMs is UI navigation. As a SWE, you could have a web-based agent that runs Gherkin-like syntax tests without having
Generative AI is opening the doors to new products. It’s great to witness in real time how the Internet is evolving and to imagine what it might become in the coming years. Search, UI navigation, and content generation are the trident of technologies leading this evolution.

Search

The Internet acts as the largest repository of knowledge
Working with LLMs is shifting from pure human-machine interactions to a mix of human-machine and machine-machine interactions, which allows LLMs to take on ever more complex tasks. This new mode of interaction has been coined the AI agent. Threaded conversations lack the structure needed to complete complex tasks, so objective divergence is a common issue with AI agents. Objective divergence is the equivalent of