GENIUS generates discrete identifiers from multimodal queries, enabling universal retrieval across diverse data.

Modality-decoupled Semantic Quantization

Examples of our quantization scheme with five levels of codes. The first code indicates the modality (0: image, 1: text, 2: image-text pair) and adapts to the instruction. The second code represents primary objects or dominant semantics, while the third captures key attributes (e.g., 'two', 'red'). Subsequent levels encode progressively finer details.

Examples of modality-decoupled semantic quantization
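
To make the ID layout concrete, here is a minimal sketch of how a five-level code could be read. The numeric values and level names are illustrative assumptions, not codes from the actual codebooks.

```python
# Hypothetical reading of a 5-level discrete ID (values are illustrative).
MODALITY = {0: "image", 1: "text", 2: "image-text pair"}

def describe_id(code):
    """Render a 5-level ID as (modality, object, attribute, details...)."""
    assert len(code) == 5
    parts = [
        f"modality={MODALITY[code[0]]}",           # level 1: modality
        f"object/semantics={code[1]}",             # level 2: primary object
        f"attribute={code[2]}",                    # level 3: key attribute
        f"detail={code[3]}", f"detail={code[4]}",  # levels 4-5: finer details
    ]
    return ", ".join(parts)

print(describe_id((0, 137, 42, 910, 5)))  # e.g., an image ID
```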

Abstract

Generative retrieval is an emerging approach in information retrieval that generates identifiers (IDs) of target data in response to a query, providing an efficient alternative to traditional embedding-based retrieval methods. However, existing models are task-specific and fall short of embedding-based retrieval in performance. This paper proposes GENIUS, a universal generative retrieval framework supporting diverse tasks across multiple modalities and domains. At its core, GENIUS introduces modality-decoupled semantic quantization, which transforms multimodal data into discrete IDs encoding both modality and semantics. Moreover, to enhance generalization, we propose a query augmentation technique that interpolates between a query and its target, allowing GENIUS to adapt to varied query forms. Evaluated on the M-BEIR benchmark, GENIUS surpasses prior generative methods by a clear margin. Unlike embedding-based retrieval, GENIUS maintains consistently high retrieval speed across database sizes, with competitive performance across multiple benchmarks. With additional re-ranking, GENIUS often achieves results close to those of embedding-based methods while preserving efficiency.
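
For intuition, the query augmentation described above can be pictured as linear interpolation in embedding space. The sketch below reflects that reading; the mixing range and function names are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def augment_query(q_emb, t_emb, low=0.0, high=0.5, rng=None):
    """Move a query embedding partway toward its target embedding.

    A minimal sketch of interpolation-based query augmentation; the
    sampling range for `lam` is an assumption for illustration.
    """
    rng = rng or np.random.default_rng()
    lam = rng.uniform(low, high)              # interpolation weight
    return (1.0 - lam) * q_emb + lam * t_emb  # augmented query embedding
```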

GENIUS Framework

GENIUS Framework Architecture

The GENIUS framework has three components: image and text encoders, a modality-decoupled quantization module, and an autoregressive decoder. The encoders are pre-trained to improve instruction comprehension. Residual quantization assigns discrete IDs to embeddings, with the first level encoding modality and subsequent levels capturing semantic details; the decoder learns to generate these IDs from queries. During inference, GENIUS produces IDs via Trie-constrained beam search, optionally followed by embedding-based re-ranking for higher accuracy.
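
As a rough illustration of the quantization step, the sketch below implements plain residual quantization: each level picks the nearest centroid and passes the residual on. Codebook shapes are assumed, and the modality-dedicated first level of GENIUS is not modeled here.

```python
import numpy as np

def residual_quantize(x, codebooks):
    """Assign a multi-level discrete ID to an embedding `x`.

    `codebooks` is a list of (K, d) centroid arrays, one per level.
    Each level stores the index of the nearest centroid and subtracts
    it, so later levels quantize progressively finer residuals.
    """
    residual = x.copy()
    codes = []
    for cb in codebooks:
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        codes.append(idx)
        residual = residual - cb[idx]  # what this level failed to explain
    return tuple(codes)

# Toy usage with random centroids (dimensions are illustrative).
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(256, 64)) for _ in range(4)]
print(residual_quantize(rng.normal(size=64), codebooks))
```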


Experimental Results

M-BEIR: task-specific information retrieval

Task-specific Information Retrieval Results

Performance on the M-BEIR dataset (task-specific pool). $\mathcal{R}$ signifies re-ranking with embedding vectors. Datasets include VN (VisualNews), F200K (Fashion200K), InfoS (InfoSeek), and FIQ (FashionIQ).


M-BEIR: universal information retrieval

Universal Information Retrieval Results

Recall@5 results (Recall@10 for Fashion200K and FashionIQ) on a global, multi-modal pool. $\mathcal{R}$ indicates re-ranking using embedding vectors within predicted candidates.


Text-to-image retrieval (compared to generative methods)

Text-to-image Retrieval Results

Text-to-image retrieval performance on Flickr30K and MS-COCO benchmarks. $\mathcal{R}$ denotes re-ranking. All models, including GENIUS, are trained and evaluated per dataset.


Efficiency of GENIUS

Throughput efficiency

Throughput Efficiency

We compare retrieval efficiency between an embedding-based method (CLIP) and generative methods (GRACE, GENIUS) by measuring queries per second. As the candidate database grows, CLIP's throughput declines due to the rising cost of nearest-neighbor search, whereas the generative methods remain nearly constant. GENIUS achieves roughly 4 times higher throughput than GRACE.
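
One way to see why generative throughput stays flat: decoding consults only a Trie of valid ID prefixes, so the per-step lookup cost depends on code length rather than database size. Below is a minimal sketch of such a prefix constraint; the data structure and function names are ours for illustration (in practice a callback of this kind plugs into beam search as an allowed-tokens filter).

```python
def build_trie(ids):
    """Nested-dict Trie over discrete ID sequences."""
    root = {}
    for seq in ids:
        node = root
        for tok in seq:
            node = node.setdefault(tok, {})
    return root

def allowed_next_tokens(trie, prefix):
    """Tokens that legally extend `prefix`.

    Lookup walks len(prefix) dictionary hops, independent of how many
    IDs the Trie stores, hence near-constant decoding throughput.
    """
    node = trie
    for tok in prefix:
        node = node.get(tok)
        if node is None:
            return []  # dead prefix: no valid continuation
    return list(node.keys())

trie = build_trie([(0, 137, 42), (0, 137, 7), (1, 5, 9)])
print(allowed_next_tokens(trie, (0, 137)))  # -> [42, 7]
```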


Storage efficiency

Storage Efficiency

Storage efficiency comparison between CLIP and GENIUS. GENIUS achieves over 99% reduction in storage, significantly enhancing scalability for large-scale retrieval.
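
For a back-of-the-envelope sense of where the saving comes from (sizes below are assumptions for illustration, not the paper's exact accounting): a 512-dim float32 embedding costs 2048 bytes per item, while a short discrete ID costs only a few bytes.

```python
# Illustrative storage arithmetic; sizes are assumed, not from the paper.
dim, bytes_per_float = 512, 4
embedding_bytes = dim * bytes_per_float  # 2048 B per item (CLIP-style vector)

levels, bytes_per_code = 5, 2            # 2 B/code supports codebooks up to 65,536
id_bytes = levels * bytes_per_code       # 10 B per item (discrete ID)

print(f"reduction: {1 - id_bytes / embedding_bytes:.2%}")  # ~99.5%
```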


Acknowledgements

Part of this work was done while Sungyeon Kim was an intern at Amazon. Sungyeon Kim and Suha Kwak were supported by NRF grants (RS-2021-NR059830: 30%, RS-2022-II220926: 30%) and IITP grants (RS-2019-II191906: 10%, AI Graduate School Program at POSTECH) funded by the Ministry of Science and ICT, Korea.