What does this repo signal mean?

Amazon (Nova) published amazon-science/RecArena (Python). This repository signal exposes tooling, eval, infrastructure, or model-adjacent work before it may appear in a launch post. High-signal details: repo amazon-science/RecArena · language Python · Recommendation system evaluation arena from Amazon Science. onlylabs links this event to 1 captured evidence page and 6 related repo signals.

Amazon (Nova) Repo: amazon-science/RecArena

Captured source

GitHub: github.com/amazon-science/RecArena

amazon-science/RecArena repository metadata

published May 6, 2026 · seen Jun 5 · captured Jun 11 · http 200 · method plain

amazon-science/RecArena

Language: Python

License: Apache-2.0

Stars: 1

Forks: 0

Open issues: 10

Created: 2026-05-06T06:43:02Z

Pushed: 2026-06-11T01:23:28Z

Default branch: main

Fork: no

Archived: no

README:

RecArena

> Note: This repository is part of a larger reproducibility effort for recommendation research. RecArena is a self-contained snippet of that framework, providing everything needed to reproduce the experiments and results of the two papers below.

A modular recommendation system benchmark framework for reproducible research in sequential recommendation.

This repository contains the code and experiment scripts for two papers submitted to RecSys 2026:

1. On the Transferability of Modern Transformer Design Choices to Sequential Recommendation — A factorial ablation of RoPE, LiGR, and RMSNorm applied to SASRec across 7 datasets with multi-seed significance testing.

2. Revisiting Negative Sampling for Sequential Recommendation: A Systematic Comparison — A comprehensive comparison of full cross-entropy, sampled softmax, BCE, and gBCE loss functions across 6 datasets.

Installation

git clone
cd RecArena

python -m venv .venv
source .venv/bin/activate

pip install -e ".[analysis]"

Requirements

Python ≥ 3.10
PyTorch ≥ 2.0 with CUDA support
NVIDIA GPU with ≥ 16 GB VRAM (A10G or better recommended)

Data Preparation

All datasets must be preprocessed before running experiments. The preparation script downloads raw data, applies k-core filtering, performs leave-one-out splitting, and saves the results.

# Prepare all auto-downloadable datasets locally
python -m rec_arena.experiments.prepare_datasets \
--datasets ml_100k ml_1m ml_20m amazon_beauty_2014 gowalla steam \
--output-dir data/

# Or upload to S3 for multi-machine experiments
python -m rec_arena.experiments.prepare_datasets \
--datasets ml_100k ml_1m ml_20m amazon_beauty_2014 gowalla steam \
--s3-path s3://your-bucket/recarena

# List all available datasets and download instructions
python -m rec_arena.experiments.prepare_datasets --list

RateBeer, Goodreads, and Twitch require manual download due to licensing. Run the preparation script with --list for download instructions.

If using S3, set the environment variable:

export RECARENA_S3_BUCKET="your-bucket/recarena"
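
The k-core filtering and leave-one-out splitting steps named in this section can be sketched as follows. This is a minimal illustration and not the repository's implementation: the function names, the (user, item, timestamp) tuple layout, the value of k, and the extra validation holdout are all assumptions.

```python
from collections import Counter, defaultdict


def k_core_filter(interactions, k=5):
    """Repeatedly drop users and items with fewer than k interactions.

    `interactions` is a list of (user, item, timestamp) tuples (assumed layout).
    Removing a sparse item can push a user below k, so we loop until stable.
    """
    current = list(interactions)
    while True:
        user_counts = Counter(u for u, _, _ in current)
        item_counts = Counter(i for _, i, _ in current)
        kept = [
            (u, i, t)
            for u, i, t in current
            if user_counts[u] >= k and item_counts[i] >= k
        ]
        if len(kept) == len(current):
            return kept
        current = kept


def leave_one_out_split(interactions):
    """Per user: last item is the test target, second to last is validation.

    The validation holdout is an assumption; the source only says
    "leave-one-out splitting".
    """
    by_user = defaultdict(list)
    for u, i, t in sorted(interactions, key=lambda row: (row[0], row[2])):
        by_user[u].append(i)
    train, valid, test = {}, {}, {}
    for u, items in by_user.items():
        if len(items) < 3:
            continue  # too short to split three ways
        train[u], valid[u], test[u] = items[:-2], items[-2], items[-1]
    return train, valid, test
```

The loop in `k_core_filter` matters because a single pass is not enough: dropping a rare item may leave some user with fewer than k interactions, which in turn may affect other items.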

Reproducing Paper 1: Architecture Ablation

This paper evaluates all 8 combinations of {RoPE, Learnable} × {LiGR, Standard FFN} × {RMSNorm, LayerNorm} on SASRec. The full experiment grid runs across 7 datasets (ml_100k, ml_1m, ml_20m, amazon_beauty_2014, ratebeer, goodreads, netflix) with 3 seeds per configuration.
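
To make the size of this grid concrete, the sketch below enumerates the factorial design with `itertools.product`. The option labels and seed values are placeholders chosen for illustration (only seeds 42 and 43 appear in this README); the dataset names are the ones listed above.

```python
from itertools import product

# Three binary design choices give 2 * 2 * 2 = 8 architecture configurations.
positional = ["rope", "learnable"]
ffn = ["ligr", "standard_ffn"]
norm = ["rmsnorm", "layernorm"]

datasets = [
    "ml_100k", "ml_1m", "ml_20m", "amazon_beauty_2014",
    "ratebeer", "goodreads", "netflix",
]
seeds = [42, 43, 44]  # placeholder values; the README states 3 seeds per config

configs = list(product(positional, ffn, norm))
runs = list(product(datasets, configs, seeds))

print(len(configs))  # 8 configurations
print(len(runs))     # 7 datasets * 8 configs * 3 seeds = 168 training runs
```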

Full experiment (all datasets, all seeds, all configs)

# Single GPU — runs sequentially
python -m rec_arena.experiments.significance_study \
--num-gpus 1 \
--output-dir results/significance

# Multi-GPU — one experiment per GPU, round-robin assignment
python -m rec_arena.experiments.significance_study \
--num-gpus 8 \
--output-dir results/significance
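
The "round-robin assignment" mentioned in the multi-GPU comment can be pictured as dealing experiments out to GPUs like cards. The sketch below is only an illustration of that idea; `assign_round_robin` is a hypothetical helper, and how the script actually schedules work is not shown in this README.

```python
def assign_round_robin(experiments, num_gpus):
    """Map each experiment to GPU (index mod num_gpus), preserving order."""
    queues = {gpu: [] for gpu in range(num_gpus)}
    for index, experiment in enumerate(experiments):
        queues[index % num_gpus].append(experiment)
    return queues


experiments = [f"run_{n}" for n in range(10)]
queues = assign_round_robin(experiments, num_gpus=4)
for gpu, queue in queues.items():
    print(gpu, queue)
# GPU 0 receives run_0, run_4, run_8; GPU 1 receives run_1, run_5, run_9;
# GPUs 2 and 3 receive two runs each.
```

Round-robin keeps queue lengths within one of each other, but it does not balance by runtime, so a GPU that draws several large-dataset runs may finish later than the others.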

To run a subset (e.g. for debugging):

python -m rec_arena.experiments.significance_study \
--datasets ml_100k ml_1m \
--seeds 42 43 \
--num-gpus 1 \
--output-dir results/significance

Per-user predictions can optionally be saved to S3 for offline bootstrap analysis:

python -m rec_arena.experiments.significance_study \
--num-gpus 8 \
--s3-path s3://your-bucket/recarena/predictions \
--output-dir results/significance

Robustness study (model scale ablation)

Tests whether architecture effects hold across different model sizes and depths.

# Embedding dimension ablation (d ∈ {64, 128, 256, 512})
python -m rec_arena.experiments.robustness_ablation \
--experiment size --output-dir results/

# Depth ablation (L ∈ {1, 2, 4})
python -m rec_arena.experiments.robustness_ablation \
--experiment depth --output-dir results/

Bootstrap significance analysis

Computes bootstrap confidence intervals and paired significance tests from saved per-user predictions.

# From local predictions
python -m rec_arena.experiments.bootstrap_analysis \
--input-dir results/significance/predictions \
--output-dir results/significance/analysis

# Or directly from S3
python -m rec_arena.experiments.bootstrap_analysis \
--input-dir s3://your-bucket/recarena/predictions \
--output-dir results/significance/analysis
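
As a rough picture of what a paired per-user bootstrap computes, the sketch below resamples per-user metric differences between two configurations and reports a percentile confidence interval. It assumes each input is a list of per-user scores (for example NDCG@10) aligned by user; `paired_bootstrap` is a hypothetical name, and the percentile method is an assumption, since the README does not say which interval the script uses.

```python
import random


def paired_bootstrap(metric_a, metric_b, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-user difference (a - b).

    Users are resampled with replacement; pairing is kept by resampling
    the differences rather than each list independently.
    """
    if len(metric_a) != len(metric_b):
        raise ValueError("inputs must be aligned per user")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(metric_a, metric_b)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        sample = rng.choices(diffs, k=n)
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    observed = sum(diffs) / n
    return observed, lower, upper
```

Under this sketch, a difference would be read as significant at level alpha when the interval excludes zero. The default of 10,000 resamples mirrors the count stated under Key Design Decisions.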

Reproducing Paper 2: Loss Function Study

This paper compares full cross-entropy, sampled softmax, BCE, and gBCE across 6 datasets with negative sample counts ranging from 16 to 2048.

Loss function ablation

python -m rec_arena.experiments.sasrec_ablation \
--experiment loss \
--datasets ml_100k ml_1m ml_20m amazon_beauty_2014 ratebeer twitch \
--output-dir results/
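
For orientation, three of the compared losses can be written in a few lines for a single prediction step. This is a plain-Python sketch under stated assumptions, not the repository's code: `scores` is a list with one logit per item, the sampled softmax shown is the uncorrected form (no sampling-probability correction), and gBCE is left out because its exact form is not given in this README.

```python
import math


def log_softmax_at(logits, index):
    """Numerically stable log-softmax value at one index."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return logits[index] - log_z


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def full_cross_entropy(scores, target):
    """Softmax over the whole item vocabulary."""
    return -log_softmax_at(scores, target)


def sampled_softmax(scores, target, negatives):
    """Softmax over the target plus a sampled subset of negatives."""
    logits = [scores[target]] + [scores[j] for j in negatives]
    return -log_softmax_at(logits, 0)


def bce(scores, target, negatives):
    """Binary cross-entropy: target labeled 1, each sampled negative labeled 0."""
    loss = -log_sigmoid(scores[target])
    loss += sum(-log_sigmoid(-scores[j]) for j in negatives)  # log(1 - sigmoid(s))
    return loss
```

With every non-target item used as a negative, `sampled_softmax` equals `full_cross_entropy`; with fewer negatives the normalizer shrinks, so the sampled loss is smaller. That gap is one reason the number of negatives (16 to 2048 here) is a natural axis to study.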

Negative sampling strategy ablation

Compares per-position vs. batch-shared negative sampling with uniform and popularity-based strategies.

python -m rec_arena.experiments.negative_sampling_ablation \
--output-dir results/
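
The two axes compared here (who shares the negatives, and how they are drawn) can be sketched as below. All function names and the rejection-sampling approach are assumptions made for illustration; the README does not describe how the repository implements sampling.

```python
import random


def sample_negatives(rng, num_items, count, exclude, weights=None):
    """Draw `count` item ids outside `exclude`.

    Uniform when `weights` is None; otherwise proportional to `weights`
    (for example item interaction counts, for popularity-based sampling).
    Assumes enough non-excluded items with nonzero weight exist.
    """
    population = range(num_items)
    negatives = []
    while len(negatives) < count:
        if weights is None:
            candidate = rng.randrange(num_items)
        else:
            candidate = rng.choices(population, weights=weights, k=1)[0]
        if candidate not in exclude:
            negatives.append(candidate)
    return negatives


def per_position_negatives(rng, targets, num_items, count, weights=None):
    """Each target position gets its own independent set of negatives."""
    return [sample_negatives(rng, num_items, count, {t}, weights) for t in targets]


def batch_shared_negatives(rng, targets, num_items, count, weights=None):
    """One set of negatives is drawn once and reused for every position."""
    shared = sample_negatives(rng, num_items, count, set(targets), weights)
    return [shared for _ in targets]
```

Batch-shared sampling draws far fewer ids per step, at the cost of every position seeing the same negatives; per-position sampling gives more diverse negatives but scales with the number of positions.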

Project Structure

src/rec_arena/
├── configs/ # Model configurations (dataclass-based)
├── datasets/ # Dataset loading, splitting, negative sampling
├── experiments/ # Experiment scripts for both papers
│ ├── prepare_datasets.py # Data download & preprocessing
│ ├── significance_study.py # Paper 1: factorial architecture ablation
│ ├── robustness_ablation.py # Paper 1: model scale robustness
│ ├── bootstrap_analysis.py # Paper 1: bootstrap significance tests
│ ├── sasrec_ablation.py # Paper 2: loss function ablation
│ └── negative_sampling_ablation.py # Paper 2: sampling strategy comparison
├── losses/ # CE, BCE, sampled softmax, gBCE, BPR
├── metrics/ # NDCG, HR, MRR, Recall, Precision
├── models/ # SASRec (+ other models for general use)
├── modules/ # Transformer blocks, RoPE, SwiGLU, RMSNorm
└── utils/ # Reproducibility, logging, profiling

Key Design Decisions

Full-vocabulary cross-entropy as the default loss (no sampled metrics)
Leave-one-out evaluation with full ranking over all items
Deterministic per-parameter initialization — shared weights (item embeddings, attention, FFN) are identical across architecture configs for the same seed, isolating the effect of each component
Paired significance testing — t-tests across seeds + per-user bootstrap with 10,000 resamples
FP16 mixed precision and Flash Attention for training efficiency
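
One way to get the deterministic per-parameter initialization described in the list above is to derive each parameter's random stream from the seed and the parameter's name, rather than drawing from a single global stream. The sketch below illustrates that idea only; the hashing scheme, the parameter names, and the Gaussian scale are assumptions, not the repository's method.

```python
import hashlib
import random


def init_parameter(name, shape, seed, scale=0.02):
    """Initialize one 2-D parameter from an RNG keyed by (seed, name)."""
    digest = hashlib.sha256(f"{seed}:{name}".encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    rows, cols = shape
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def build_model(param_shapes, seed):
    """Initialize every parameter independently of the others."""
    return {
        name: init_parameter(name, shape, seed)
        for name, shape in param_shapes.items()
    }


# Hypothetical configs: the baseline has a learned position table, the
# RoPE variant does not, but both share the item embedding and FFN weight.
baseline = build_model(
    {"position_embedding": (50, 8), "item_embedding": (4, 8), "ffn.w1": (8, 16)},
    seed=42,
)
rope_variant = build_model(
    {"item_embedding": (4, 8), "ffn.w1": (8, 16)},
    seed=42,
)
print(baseline["item_embedding"] == rope_variant["item_embedding"])  # True
```

With a single global stream, removing `position_embedding` would shift every later draw, so the shared weights would differ between configurations and the comparison would mix architecture effects with initialization noise.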

Hardware

All experiments were conducted on NVIDIA A10G GPUs (24 GB VRAM) with AMD EPYC 7R32 CPUs...

Excerpt shown — open the source for the full document.

Notability

notability 3.0/10

New repo with low stars, routine research release