amazon-science/rmir
Python
Description: Code for the paper RMIR: A Benchmark Dataset for Reasoning-Intensive Multimodal Image Retrieval.
Language: Python
License: Apache-2.0
Stars: 4
Forks: 0
Open issues: 0
Created: 2026-05-01T16:28:38Z
Pushed: 2026-05-12T09:57:08Z
Default branch: main
Fork: no
Archived: no
README:
RMIR: A Benchmark Dataset for Reasoning-Intensive Multimodal Image Retrieval
Overview
This package hosts the code for a pipeline to generate a dataset of multimodal queries requiring reasoning. The pipeline is described as part of our paper **`RMIR`: A Benchmark Dataset for Reasoning-Intensive Multimodal Image Retrieval** (CVPR 2026).
While the instructions below show how to use the query generation pipeline with Visual Genome and Open Images v7 as the source datasets, it can also be used with any other source dataset of images to generate additional multimodal reasoning-intensive queries.
Downloading the Curated RMIR Dataset
In conjunction with the paper and code, we also release a pre-curated benchmark dataset, RMIR, of $1634$ queries requiring reasoning across three categories: functional (object affordances), temporal (time-based relationships), and causal (cause-effect reasoning). Each query combines visual and textual inputs that demand robust visual understanding together with logical inference, beyond surface-level matching, to identify correct target images. The contents of the benchmark dataset were generated using the query generation pipeline described in sections 1 through 4 below. Instructions on how to acquire our pre-curated MBEIR-format RMIR dataset for benchmarking retrievers are in rmir_dataset/README-DATA.md.
Instructions for Running the Query Generation Pipeline
1. Environment Setup
Please install the three conda envs below using the corresponding yml files in the envs directory:
1. faiss
2. image_emb_gen
3. text_emb_gen
The code in this repo was run on an ml.g5.24xlarge SageMaker notebook instance.
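As a minimal sketch, the environment setup can be scripted by building one `conda env create -f` command per environment. The helper name `conda_create_commands` is hypothetical, and the sketch assumes each yml file is named after its environment (for example `envs/faiss.yml`); check the envs directory for the actual filenames before running anything.

```python
from pathlib import Path

# Environment names listed in the README.
ENV_NAMES = ["faiss", "image_emb_gen", "text_emb_gen"]


def conda_create_commands(envs_dir="envs", names=ENV_NAMES):
    """Build one `conda env create` command per environment.

    Assumption (not confirmed by the README): each yml file is named
    after its environment, e.g. envs/faiss.yml.
    """
    return [
        ["conda", "env", "create", "-f", str(Path(envs_dir) / f"{name}.yml")]
        for name in names
    ]


if __name__ == "__main__":
    for cmd in conda_create_commands():
        # Print rather than execute, so the commands can be reviewed first.
        print(" ".join(cmd))
```

The commands are only printed here; passing each list to `subprocess.run(cmd, check=True)` would execute them once the filenames are confirmed.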
2. Set Up the Seed Images Dataset
2.1 Download Open Source Images
> *Please use the conda env faiss installed in Step 1 to run the notebooks in sections 2.1.1 and 2.1.2.*
2.1.1 Visual Genome
visual_genome.ipynb has the code and instructions for downloading and preprocessing images from Visual Genome. From Section 2.2 onwards, the instructions assume that the 108,077 images in the Visual Genome dataset have been downloaded following the instructions in the notebook and placed in the local directory /home/ec2-user/SageMaker/datasets/source/visual_genome/visual_genome_images. The notebook generates two files, visual_genome_annotations_no_images.jsonl and visual_genome_enriched_annotations_no_images.jsonl, which are used later in embedding generation.
2.1.2 Open Images V7
open_images_v7.ipynb has the code and instructions for downloading and preprocessing images from Open Images V7. From Section 2.2 onwards, the instructions assume that the "densely annotated" subset of 1,910,098 images in Open Images V7 has been stored locally, with the training, validation, and test set images merged into a single directory following the notebook's guidance: /home/ec2-user/SageMaker/datasets/source/open_images_v7/download/combined_train_val_test_images. The notebook generates the file open_images_v7_annotations_combined.jsonl, which is used later in embedding generation.
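Before moving on to embedding generation, it may help to sanity-check the annotation files the notebooks produce. The sketch below counts the records in a JSONL file and confirms that each line parses as JSON. It does not inspect any fields, because the record schema is not described here. The helper name `count_jsonl_records` is hypothetical, and comparing against the image counts quoted above assumes one record per image, which the README does not confirm.

```python
import json


def count_jsonl_records(path):
    """Count non-empty lines in a JSONL file, checking that each parses as JSON."""
    count = 0
    with open(path, "r", encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                json.loads(line)
            except json.JSONDecodeError as err:
                raise ValueError(f"{path}:{line_number} is not valid JSON") from err
            count += 1
    return count


# Image counts quoted in the README. Treating them as expected record
# counts assumes one JSONL record per image (an assumption).
EXPECTED_IMAGE_COUNTS = {
    "visual_genome_annotations_no_images.jsonl": 108_077,
    "open_images_v7_annotations_combined.jsonl": 1_910_098,
}
```

A mismatch between `count_jsonl_records(path)` and the corresponding entry in `EXPECTED_IMAGE_COUNTS` would suggest an incomplete download or a different record granularity than assumed.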
2.2 Generate Embeddings
2.2.1 Generate Image-based Embeddings
- `jsonl` points to the input dataset of images you create in Sections 2.1.1 and 2.1.2.
- Set `out_dir` to the local directory where you want to store the generated embeddings in the form of partitioned parquet files.
- Set `num_processes` to the number of GPUs.
    # visual genome
    source activate image_emb_gen && accelerate launch \
        --num_processes 4 --num_machines 1 --mixed_precision fp16 --dynamo_backend no \
        batch_gen_siglip2_img_embeddings.py \
        --jsonl /home/ec2-user/SageMaker/datasets/source/visual_genome/visual_genome_annotations_no_images.jsonl \
        --out_dir /home/ec2-user/SageMaker/datasets/embeddings/visual_genome/image_embeddings \
        --batch_size 256 --dtype float16

    # open images v7
    source activate image_emb_gen && accelerate launch \
        --num_processes 4 --num_machines 1 --mixed_precision fp16 --dynamo_backend no \
        batch_gen_siglip2_img_embeddings.py \
        --jsonl /home/ec2-user/SageMaker/datasets/source/open_images_v7/annotations/open_images_v7_annotations_combined.jsonl \
        --out_dir /home/ec2-user/SageMaker/datasets/embeddings/open_images_v7/image_embeddings \
        --batch_size 224 --dtype float16
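The two invocations above differ only in `--jsonl`, `--out_dir`, and `--batch_size`, so they can be assembled from parameters. The sketch below is one way to do that. The function name `build_embedding_command` is hypothetical, every flag is copied from the commands above, and the image_emb_gen env is assumed to be active already when the command runs.

```python
import shlex


def build_embedding_command(jsonl, out_dir, num_processes, batch_size=256):
    """Assemble the `accelerate launch` invocation as an argv list."""
    return [
        "accelerate", "launch",
        "--num_processes", str(num_processes),  # set to the number of GPUs
        "--num_machines", "1",
        "--mixed_precision", "fp16",
        "--dynamo_backend", "no",
        "batch_gen_siglip2_img_embeddings.py",
        "--jsonl", jsonl,
        "--out_dir", out_dir,
        "--batch_size", str(batch_size),
        "--dtype", "float16",
    ]


if __name__ == "__main__":
    cmd = build_embedding_command(
        jsonl="/home/ec2-user/SageMaker/datasets/source/visual_genome/"
              "visual_genome_annotations_no_images.jsonl",
        out_dir="/home/ec2-user/SageMaker/datasets/embeddings/"
                "visual_genome/image_embeddings",
        num_processes=4,
    )
    # Print a shell-safe string for review instead of executing it.
    print(shlex.join(cmd))
```

Returning a list instead of a string keeps paths with spaces intact if the command is later passed to `subprocess.run`.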
2.2.2 Generate Text-based Embeddings
Prerequisite: Install the Huggingface text-embeddings-inference toolkit: https://github.com/huggingface/text-embeddings-inference
Launch 1 endpoint per GPU serving the Qwen3 Embed model using Text Embeddings Inference toolkit from Huggingface. For instance, if your machine has 4 GPUs, you can launch 4 docker containers by updating the device identifier and the port as follows.
    docker run --rm --gpus '"device=0"' -p 8080:80 \
        -v /home/ec2-user/SageMaker/qwen3_working_dir/data:/data \
        --pull always ghcr.io/huggingface/text-embeddings-inference:86-1.8 \
        --model-id Qwen/Qwen3-Embedding-0.6B --dtype float16 \
        --max-batch-tokens 65536 --max-concurrent-requests 1024 \
        --max-client-batch-size 1024 --payload-limit 100000000

    docker run --rm --gpus '"device=1"' -p 8081:80 \
        -v /home/ec2-user/SageMaker/qwen3_working_dir/data:/data \
        --pull always ghcr.io/huggingface/text-embeddings-inference:86-1.8 \
        --model-id Qwen/Qwen3-Embedding-0.6B --dtype float16 \
        --max-batch-tokens 65536 --max-concurrent-requests 1024 \
        --max-client-batch-size 1024 --payload-limit 100000000

    docker run --rm --gpus '"device=2"' -p 8082:80 \
        -v /home/ec2-user/SageMaker/qwen3_working_dir/data:/data \
        --pull always ghcr.io/huggingface/text-embeddings-inference:86-1.8 \
        --model-id Qwen/Qwen3-Embedding-0.6B --dtype float16 \
        --max-batch-tokens 65536 --max-concurrent-requests 1024 \
        --max-client-batch-size 1024 --payload-limit 100000000

    docker run --rm --gpus '"device=3"' -p 8083:80 \
        -v /home/ec2-user/SageMaker/qwen3_working_dir/data:/data \
        --pull always ghcr.io/huggingface/text-embeddings-inference:86-1.8 \
        --model-id Qwen/Qwen3-Embedding-0.6B --dtype float16 \
        --max-batch-tokens 65536 --max-concurrent-requests 1024 \
        --max-client-batch-size 1024 --payload-limit 100000000
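Since the four commands differ only in the device identifier and the host port, they can be generated for any GPU count. The sketch below is a hedged illustration. The names `tei_docker_command` and `endpoint_urls` are hypothetical, the flags and image tag are copied from the commands above, and `localhost` is an assumption about where the client runs.

```python
import shlex

TEI_IMAGE = "ghcr.io/huggingface/text-embeddings-inference:86-1.8"
DATA_DIR = "/home/ec2-user/SageMaker/qwen3_working_dir/data"


def tei_docker_command(gpu_index, base_port=8080):
    """Return the docker argv for one endpoint pinned to one GPU."""
    return [
        "docker", "run", "--rm",
        # Docker expects the literal double quotes around device=N.
        "--gpus", f'"device={gpu_index}"',
        "-p", f"{base_port + gpu_index}:80",
        "-v", f"{DATA_DIR}:/data",
        "--pull", "always",
        TEI_IMAGE,
        "--model-id", "Qwen/Qwen3-Embedding-0.6B",
        "--dtype", "float16",
        "--max-batch-tokens", "65536",
        "--max-concurrent-requests", "1024",
        "--max-client-batch-size", "1024",
        "--payload-limit", "100000000",
    ]


def endpoint_urls(num_gpus, base_port=8080, host="localhost"):
    """List the base URL of each endpoint (host is an assumption)."""
    return [f"http://{host}:{base_port + i}" for i in range(num_gpus)]


if __name__ == "__main__":
    for gpu in range(4):
        # Print a shell-safe command per GPU for review.
        print(shlex.join(tei_docker_command(gpu)))
```

Each container occupies its terminal while running, so the printed commands are meant to be started in separate shells or with a process manager of your choice.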
Then run the script below to generate text embeddings for images in Visual Genome and Open Images v7…