Most current retrieval-augmented generation (RAG) models are designed to retrieve relevant text documents based on user queries. In real-world scenarios, however, users often need to retrieve multimodal documents that combine text, images, tables, and charts. This calls for a more sophisticated retrieval system that can handle multimodal information and return relevant documents or passages for a given query. Multimodal document retrieval will help AI chatbots, search engines, and other applications provide more accurate and relevant information to users.
The Multimodal Document Retrieval Task focuses on modeling passages from multimodal documents or web pages, leveraging both textual and other multimodal information to build passage embeddings. The ultimate goal is to retrieve the relevant multimodal document or passage for a user's text or multimodal query.
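For concreteness, the following is a minimal sketch of embedding-based retrieval, assuming precomputed query and passage embeddings (the random vectors below are placeholders for the output of a multimodal encoder). It is not the organizers' baseline, just an illustration of ranking passages by cosine similarity and returning the top-k passage ids.

```python
# Minimal sketch: rank passages by cosine similarity to a query embedding.
# The embeddings here are random placeholders, not real model outputs.
import numpy as np

def retrieve_top_k(query_emb: np.ndarray,
                   passage_embs: np.ndarray,
                   passage_ids: list[str],
                   k: int = 5) -> list[str]:
    """Return the ids of the k passages most similar to the query."""
    # Normalize so that the dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb)
    p = passage_embs / np.linalg.norm(passage_embs, axis=1, keepdims=True)
    scores = p @ q
    top_idx = np.argsort(-scores)[:k]
    return [passage_ids[i] for i in top_idx]

# Toy usage with random vectors standing in for a multimodal encoder's output.
rng = np.random.default_rng(0)
query = rng.normal(size=768)
passages = rng.normal(size=(1000, 768))
ids = [f"passage_{i}" for i in range(1000)]
print(retrieve_top_k(query, passages, ids, k=5))
```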
Please go to the challenge portal: https://www.kaggle.com/competitions/multimodal-document-retrieval-challenge
This task is designed to evaluate the ability of retrieval systems to identify visually-rich information within documents. The MMDocIR evaluation set includes 313 long documents averaging 65.1 pages, categorized into diverse domains: research reports, administration, industry, tutorials, workshops, academic papers, brochures, financial reports, guidebooks, government documents, laws, and news articles. Different domains feature distinct distributions of multimodal information.
Sub-tasks:
In this challenge, the MMDocIR evaluation set will be used for evaluation.
Reference: https://huggingface.co/MMDocIR
This task evaluates the ability of retrieval systems to retrieve visually-rich information in open-domain scenarios, including Wikipedia web pages. It involves diverse topics, forms (graphics, tables, text), and languages. The original M2KR dataset includes only the text extracted from the Wikipedia pages; we have extended the dataset to include screenshots of those pages.
Sub-tasks:
In this challenge, a privately reserved test subset will be used for evaluation.
Reference: PreFLMR and M2KR Project Page
Challenge Dataset: M2KR-Challenge
Participants need to use one unified model to perform retrieval on both Task 1 and Task 2. Submissions will be evaluated on their ability to accurately retrieve relevant multimodal documents or passages based on user queries.
Recall@k measures the proportion of relevant documents retrieved in the top-k results. We will measure the average of Recall@1, Recall@3, and Recall@5 for each task, and the final score is the average of the two task scores.
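The sketch below illustrates this scoring scheme, assuming one common definition of Recall@k in which a query counts as a hit if any of its relevant passages appears in the top-k predictions; the data structures (query id mapped to gold passage ids, query id mapped to ranked predictions) are hypothetical.

```python
# Sketch of the scoring scheme: recall@k for k in {1, 3, 5}, averaged over
# queries and then over k; the final score averages the two task scores.
def recall_at_k(gold: dict[str, set[str]],
                ranked_preds: dict[str, list[str]],
                k: int) -> float:
    """Fraction of queries whose top-k predictions contain a relevant passage."""
    hits = sum(
        1 for qid, relevant in gold.items()
        if relevant & set(ranked_preds.get(qid, [])[:k])
    )
    return hits / len(gold)

def task_score(gold: dict[str, set[str]],
               ranked_preds: dict[str, list[str]]) -> float:
    """Average of recall@1, recall@3, and recall@5 for one task."""
    return sum(recall_at_k(gold, ranked_preds, k) for k in (1, 3, 5)) / 3

# Final score: mean of the two task scores, e.g.
# final = (task_score(gold_t1, preds_t1) + task_score(gold_t2, preds_t2)) / 2
```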
A private test set is released for each dataset. Participants must upload the retrieved top-5 passage_id values for each test sample; a sketch of one possible submission writer is shown below.
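The following sketch writes such a submission file. The column names and the space-separated id format are assumptions made for illustration; follow the exact format specified on the Kaggle challenge portal.

```python
# Hypothetical sketch of writing a submission file: one row per test sample
# with its top-5 retrieved passage_ids. Column names and formatting are
# assumptions; check the challenge portal for the required format.
import csv

def write_submission(predictions: dict[str, list[str]],
                     path: str = "submission.csv") -> None:
    """predictions maps each question id to its ranked top-5 passage_ids."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["question_id", "passage_ids"])  # assumed header
        for qid, passage_ids in predictions.items():
            writer.writerow([qid, " ".join(passage_ids[:5])])
```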
To participate in this challenge, researchers are required to:
This track aims to push forward the boundaries of multimodal document retrieval, encouraging innovation in embedding modeling and search efficiency. Join us to contribute to the next-generation advancements in this exciting space!
Task Submission Start: Jan. 27th, 2025
Code Submission Start: Mar. 7th, 2025
Final Submission End: Mar. 15th, 2025
Awarded Teams Release: Mar. 18th, 2025