Repo: Amazon (Nova), published May 1, 2026

amazon-science/rmir

Python


Description: Code for the paper RMIR: A Benchmark Dataset for Reasoning-Intensive Multimodal Image Retrieval.

Language: Python

License: Apache-2.0

Stars: 4

Forks: 0

Open issues: 0

Created: 2026-05-01T16:28:38Z

Pushed: 2026-05-12T09:57:08Z

Default branch: main

Fork: no

Archived: no

README:

RMIR: A Benchmark Dataset for Reasoning-Intensive Multimodal Image Retrieval

Overview

This package hosts the code for a pipeline that generates a dataset of reasoning-intensive multimodal queries. The pipeline is described in our paper **`RMIR`: A Benchmark Dataset for Reasoning-Intensive Multimodal Image Retrieval** (CVPR 2026).

The instructions below use Visual Genome and Open Images V7 as the source datasets, but the query generation pipeline can be applied to any other source dataset of images to generate additional reasoning-intensive multimodal queries.

Downloading the Curated RMIR Dataset

In conjunction with the paper and code, we also release a pre-curated benchmark dataset, RMIR, of 1,634 queries requiring reasoning across three categories: functional (object affordances), temporal (time-based relationships), and causal (cause-effect reasoning). Each query combines visual and textual inputs that demand robust visual understanding together with logical inference, beyond surface-level matching, to identify correct target images. The contents of the benchmark dataset were generated using the query generation pipeline described in sections 1 through 4 below. Instructions on how to acquire our pre-curated MBEIR-format RMIR dataset for benchmarking retrievers are in rmir_dataset/README-DATA.md.

Instructions for Running the Query Generation Pipeline

1. Environment Setup

Please install the three conda envs below using the corresponding yml files in the envs directory:

  1. faiss
  2. image_emb_gen
  3. text_emb_gen
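As a minimal sketch, the three environments could be created in a loop. It assumes each yml file is named after its environment (for example envs/faiss.yml), which this README does not state, so adjust the paths to the actual file names. `conda env create -f` is the standard conda command for building an environment from a file.

```python
from pathlib import Path

# Environment names come from the README; the "<name>.yml" file naming is an assumption.
ENV_NAMES = ["faiss", "image_emb_gen", "text_emb_gen"]


def conda_create_commands(envs_dir="envs", env_names=ENV_NAMES):
    """Return one `conda env create` argument list per environment file."""
    commands = []
    for name in env_names:
        yml_path = Path(envs_dir) / f"{name}.yml"
        commands.append(["conda", "env", "create", "-f", str(yml_path)])
    return commands


if __name__ == "__main__":
    import shlex

    for cmd in conda_create_commands():
        # Print the commands for review; to execute one, pass the list to
        # subprocess.run(cmd, check=True).
        print(shlex.join(cmd))
```

Printing the commands first makes it easy to check the file paths before anything is installed.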

The code in this repo was run on an ml.g5.24xlarge SageMaker notebook instance.

2. Set Up the Seed Images Dataset

2.1 Download Open Source Images

> *Please use the conda env faiss installed in Step 1 to run the notebooks in sections 2.1.1 and 2.1.2.*

2.1.1 Visual Genome

visual_genome.ipynb has the code and instructions for downloading and preprocessing images from Visual Genome. From Section 2.2 onwards, the instructions assume that the 108,077 images in the Visual Genome dataset have been downloaded following the instructions in the notebook and placed into the local directory /home/ec2-user/SageMaker/datasets/source/visual_genome/visual_genome_images. The notebook generates two files, visual_genome_annotations_no_images.jsonl and visual_genome_enriched_annotations_no_images.jsonl, which are used later in embedding generation.

2.1.2 Open Images V7

open_images_v7.ipynb has the code and instructions for downloading and preprocessing images from Open Images V7. From Section 2.2 onwards, the instructions assume that the "densely annotated" subset of Open Images V7 (1,910,098 images) has been stored locally, with the training, validation, and test set images merged into a single directory following the notebook's guidance: /home/ec2-user/SageMaker/datasets/source/open_images_v7/download/combined_train_val_test_images. The notebook generates the file open_images_v7_annotations_combined.jsonl, which is used later in embedding generation.
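Before generating embeddings, it can help to confirm that an annotations file and its image directory agree. The following is a minimal sketch and not code from this repo. The record field name `image_path` is a hypothetical placeholder, since the JSONL schema written by the notebooks is not shown here; replace it with the key the notebooks actually use.

```python
import json
from pathlib import Path


def summarize_jsonl(jsonl_path, image_dir=None, path_key="image_path"):
    """Count records in a JSONL file and, optionally, records whose image is missing.

    `path_key` is a hypothetical field name; replace it with whatever key
    the notebooks actually write.
    """
    total = 0
    missing = 0
    with open(jsonl_path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            total += 1
            if image_dir is not None:
                file_name = record.get(path_key)
                # A record counts as missing if it has no path or the file is absent.
                if file_name is None or not (Path(image_dir) / file_name).is_file():
                    missing += 1
    return {"records": total, "missing_images": missing}
```

For Visual Genome, for instance, the record count could be compared against the 108,077 images the instructions expect.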

2.2 Generate Embeddings

2.2.1 Generate Image-based Embeddings

  • Set jsonl to the annotations file for the image dataset you created in Section 2.1.1 or 2.1.2.
  • Set out_dir to the local directory where the generated embeddings should be stored as partitioned parquet files.
  • Set num_processes to the number of GPUs on your machine.
# visual genome
source activate image_emb_gen && accelerate launch --num_processes 4 --num_machines 1 --mixed_precision fp16 --dynamo_backend no batch_gen_siglip2_img_embeddings.py --jsonl /home/ec2-user/SageMaker/datasets/source/visual_genome/visual_genome_annotations_no_images.jsonl --out_dir /home/ec2-user/SageMaker/datasets/embeddings/visual_genome/image_embeddings --batch_size 256 --dtype float16

# open images v7
source activate image_emb_gen && accelerate launch --num_processes 4 --num_machines 1 --mixed_precision fp16 --dynamo_backend no batch_gen_siglip2_img_embeddings.py --jsonl /home/ec2-user/SageMaker/datasets/source/open_images_v7/annotations/open_images_v7_annotations_combined.jsonl --out_dir /home/ec2-user/SageMaker/datasets/embeddings/open_images_v7/image_embeddings --batch_size 224 --dtype float16
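The two commands above differ only in their input file, output directory, and batch size. As a sketch, the same invocation could be assembled from those parameters in Python. The function name is hypothetical, and every flag is copied from the commands above rather than from the script's own argument parser. The image_emb_gen environment still needs to be activated before the command is run.

```python
import shlex


def build_image_embedding_command(jsonl, out_dir, num_gpus, batch_size=256):
    """Assemble the accelerate launch command shown above as an argument list."""
    return [
        "accelerate", "launch",
        "--num_processes", str(num_gpus),  # one process per GPU
        "--num_machines", "1",
        "--mixed_precision", "fp16",
        "--dynamo_backend", "no",
        "batch_gen_siglip2_img_embeddings.py",
        "--jsonl", jsonl,
        "--out_dir", out_dir,
        "--batch_size", str(batch_size),
        "--dtype", "float16",
    ]


if __name__ == "__main__":
    cmd = build_image_embedding_command(
        jsonl="/home/ec2-user/SageMaker/datasets/source/visual_genome/visual_genome_annotations_no_images.jsonl",
        out_dir="/home/ec2-user/SageMaker/datasets/embeddings/visual_genome/image_embeddings",
        num_gpus=4,
    )
    # Print the command for review; subprocess.run(cmd, check=True) would execute it.
    print(shlex.join(cmd))
```

Keeping the arguments as a list avoids shell-quoting problems if a path contains spaces.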

2.2.2 Generate Text-based Embeddings

Prerequisite: Install the Hugging Face text-embeddings-inference toolkit: https://github.com/huggingface/text-embeddings-inference

Launch one endpoint per GPU serving the Qwen3-Embedding-0.6B model with the Text Embeddings Inference toolkit from Hugging Face. For instance, if your machine has 4 GPUs, you can launch 4 docker containers, changing the device identifier and the host port for each one as follows.

docker run --rm --gpus '"device=0"' -p 8080:80 -v /home/ec2-user/SageMaker/qwen3_working_dir/data:/data --pull always ghcr.io/huggingface/text-embeddings-inference:86-1.8 --model-id Qwen/Qwen3-Embedding-0.6B --dtype float16 --max-batch-tokens 65536 --max-concurrent-requests 1024 --max-client-batch-size 1024 --payload-limit 100000000

docker run --rm --gpus '"device=1"' -p 8081:80 -v /home/ec2-user/SageMaker/qwen3_working_dir/data:/data --pull always ghcr.io/huggingface/text-embeddings-inference:86-1.8 --model-id Qwen/Qwen3-Embedding-0.6B --dtype float16 --max-batch-tokens 65536 --max-concurrent-requests 1024 --max-client-batch-size 1024 --payload-limit 100000000

docker run --rm --gpus '"device=2"' -p 8082:80 -v /home/ec2-user/SageMaker/qwen3_working_dir/data:/data --pull always ghcr.io/huggingface/text-embeddings-inference:86-1.8 --model-id Qwen/Qwen3-Embedding-0.6B --dtype float16 --max-batch-tokens 65536 --max-concurrent-requests 1024 --max-client-batch-size 1024 --payload-limit 100000000

docker run --rm --gpus '"device=3"' -p 8083:80 -v /home/ec2-user/SageMaker/qwen3_working_dir/data:/data --pull always ghcr.io/huggingface/text-embeddings-inference:86-1.8 --model-id Qwen/Qwen3-Embedding-0.6B --dtype float16 --max-batch-tokens 65536 --max-concurrent-requests 1024 --max-client-batch-size 1024 --payload-limit 100000000
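Since the four commands above differ only in the device identifier and the host port, they can be generated for any GPU count. The sketch below is illustrative and not part of the repo. The function name and the `base_port` parameter are hypothetical, and all docker and toolkit arguments are copied from the commands above.

```python
import shlex

IMAGE = "ghcr.io/huggingface/text-embeddings-inference:86-1.8"
MODEL_ID = "Qwen/Qwen3-Embedding-0.6B"
DATA_DIR = "/home/ec2-user/SageMaker/qwen3_working_dir/data"


def tei_docker_commands(num_gpus, base_port=8080):
    """Return one docker run argument list per GPU, with ports base_port, base_port + 1, ..."""
    commands = []
    for gpu in range(num_gpus):
        commands.append([
            "docker", "run", "--rm",
            "--gpus", f'"device={gpu}"',      # literal double quotes, as in the commands above
            "-p", f"{base_port + gpu}:80",    # host port differs per container
            "-v", f"{DATA_DIR}:/data",
            "--pull", "always",
            IMAGE,
            "--model-id", MODEL_ID,
            "--dtype", "float16",
            "--max-batch-tokens", "65536",
            "--max-concurrent-requests", "1024",
            "--max-client-batch-size", "1024",
            "--payload-limit", "100000000",
        ])
    return commands


if __name__ == "__main__":
    for cmd in tei_docker_commands(num_gpus=4):
        print(shlex.join(cmd))
```

With `num_gpus=4` and the default port, the printed lines should match the four commands listed above.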

Then run the script below to generate text embeddings for images in Visual Genome and Open Images v7…


Notability

notability 2.0/10

Low-star new repo, no notable traction.

Amazon (Nova) has a repo signal matching data demand, evals and quality.