SOTAVerified

Reinforcement Learning (RL)

Reinforcement Learning (RL) trains an agent to take actions in an environment so as to maximize a cumulative reward signal. The agent interacts with the environment and learns from feedback in the form of rewards or penalties for its actions. The goal is to find an optimal policy, that is, a decision-making strategy that maximizes the expected long-term reward.
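
The interaction loop described above can be illustrated with a minimal sketch. It uses tabular Q-learning, one common RL algorithm chosen here only for illustration, on a hypothetical five-state chain environment. The environment, the function names (`step`, `train`, `greedy_policy`), and all hyperparameters are assumptions made for this example and do not come from any paper listed on this page.

```python
import random


def step(state, action, n_states=5):
    """Hypothetical chain environment (an assumption for this sketch).

    Action 1 moves right, action 0 moves left. Reaching the last state
    gives a reward of 1 and ends the episode; every other step gives 0.
    """
    if action == 1:
        next_state = min(state + 1, n_states - 1)
    else:
        next_state = max(state - 1, 0)
    done = next_state == n_states - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def train(episodes=500, n_states=5, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning with an epsilon-greedy behaviour policy."""
    rng = random.Random(seed)
    # q[state][action] estimates the discounted long-term reward.
    q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        state = 0
        for _ in range(100):  # cap the episode length
            if rng.random() < epsilon:
                action = rng.randrange(2)  # explore
            else:
                # Exploit, breaking ties between equal estimates at random.
                best = max(q[state])
                action = rng.choice([a for a in (0, 1) if q[state][a] == best])
            next_state, reward, done = step(state, action, n_states)
            # Bootstrapped target: immediate reward plus discounted best
            # estimate of the next state (zero if the episode has ended).
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if done:
                break
    return q


def greedy_policy(q):
    """Return the action with the highest estimated value in each state."""
    return [max((0, 1), key=lambda a: q_s[a]) for q_s in q]


if __name__ == "__main__":
    q_table = train()
    print(greedy_policy(q_table))
```

In this toy setting the learned greedy policy should choose "move right" in every non-terminal state, since that is the only way to collect the reward. The entry for the terminal state is never updated and carries no meaning.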

Papers

Showing 301-350 of 15113 papers

| Title | Status | Hype |
|---|---|---|
| Co-Reinforcement Learning for Unified Multimodal Understanding and Generation | Code | 1 |
| Reinforcement Learning for Ballbot Navigation in Uneven Terrain | Code | 1 |
| The Cell Must Go On: Agar.io for Continual Reinforcement Learning | Code | 1 |
| Arctic-Text2SQL-R1: Simple Rewards, Strong Reasoning in Text-to-SQL | Code | 3 |
| RAP: Runtime-Adaptive Pruning for LLM Inference | | 0 |
| Backdoors in DRL: Four Environments Focusing on In-distribution Triggers | | 0 |
| Control of Renewable Energy Communities using AI and Real-World Data | | 0 |
| DeepRec: Towards a Deep Dive Into the Item Space with Large Language Model Based Recommendation | | 0 |
| SophiaVL-R1: Reinforcing MLLMs Reasoning with Thinking Reward | Code | 2 |
| Delving into RL for Image Generation with CoT: A Study on DPO vs. GRPO | Code | 4 |
| LARES: Latent Reasoning for Sequential Recommendation | | 0 |
| Think-RM: Enabling Long-Horizon Reasoning in Generative Reward Models | Code | 1 |
| AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning | | 0 |
| Think Silently, Think Fast: Dynamic Latent Compression of LLM Reasoning Chains | | 0 |
| R1-ShareVL: Incentivizing Reasoning Capability of Multimodal Large Language Models via Share-GRPO | Code | 3 |
| Mesh-RFT: Enhancing Mesh Generation via Fine-grained Reinforcement Fine-Tuning | | 0 |
| SATURN: SAT-based Reinforcement Learning to Unleash Language Model Reasoning | Code | 0 |
| SSR-Zero: Simple Self-Rewarding Reinforcement Learning for Machine Translation | Code | 0 |
| Dynamic Sampling that Adapts: Iterative DPO for Self-Aware Mathematical Reasoning | | 0 |
| Reinforcement Learning for Stock Transactions | | 0 |
| PyTupli: A Scalable Infrastructure for Collaborative Offline Reinforcement Learning Projects | Code | 0 |
| Reward-Aware Proto-Representations in Reinforcement Learning | | 0 |
| WebAgent-R1: Training Web Agents via End-to-End Multi-Turn Reinforcement Learning | Code | 2 |
| Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models | Code | 2 |
| Strategically Linked Decisions in Long-Term Planning and Reinforcement Learning | | 0 |
| Distilling the Implicit Multi-Branch Structure in LLMs' Reasoning via Reinforcement Learning | | 0 |
| Divide-Fuse-Conquer: Eliciting "Aha Moments" in Multi-Scenario Games | | 0 |
| ARPO:End-to-End Policy Optimization for GUI Agents with Experience Replay | Code | 2 |
| Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement Learning | Code | 3 |
| Meta-reinforcement learning with minimum attention | | 0 |
| Raw2Drive: Reinforcement Learning with Aligned World Models for End-to-End Autonomous Driving (in CARLA v2) | | 0 |
| SWE-Dev: Evaluating and Training Autonomous Feature-Driven Software Development | Code | 2 |
| Find the Fruit: Designing a Zero-Shot Sim2Real Deep RL Planner for Occlusion Aware Plant Manipulation | | 0 |
| VL-SAFE: Vision-Language Guided Safety-Aware Reinforcement Learning with World Models for Autonomous Driving | | 0 |
| Efficient Online RL Fine Tuning with Offline Pre-trained Policy Only | | 0 |
| Offline Guarded Safe Reinforcement Learning for Medical Treatment Optimization Strategies | | 0 |
| Reward Is Enough: LLMs Are In-Context Reinforcement Learners | | 0 |
| Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning | | 0 |
| GRIT: Teaching MLLMs to Think with Images | | 0 |
| RLBenchNet: The Right Network for the Right Reinforcement Learning Task | Code | 1 |
| From Problem-Solving to Teaching Problem-Solving: Aligning LLMs with Pedagogy using Reinforcement Learning | Code | 1 |
| Multiple Weaks Win Single Strong: Large Language Models Ensemble Weak Reinforcement Learning Agents into a Supreme One | | 0 |
| MMaDA: Multimodal Large Diffusion Language Models | Code | 0 |
| An Empirical Study on Reinforcement Learning for Reasoning-Search Interleaved LLM Agents | Code | 7 |
| VARD: Efficient and Dense Fine-Tuning for Diffusion Models with Value-based RL | | 0 |
| StepSearch: Igniting LLMs Search Ability via Step-Wise Proximal Policy Optimization | Code | 0 |
| Thought-Augmented Policy Optimization: Bridging External Guidance and Internal Capabilities | | 0 |
| ViaRL: Adaptive Temporal Grounding via Visual Iterated Amplification Reinforcement Learning | | 0 |
| STAR-R1: Spacial TrAnsformation Reasoning by Reinforcing Multimodal LLMs | Code | 0 |
| Average Reward Reinforcement Learning for Omega-Regular and Mean-Payoff Objectives | | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | PPG | Mean Normalized Performance | 0.76 | | Unverified |
| 2 | PPO | Mean Normalized Performance | 0.58 | | Unverified |