SOTAVerified

Benchmarking

Papers

Showing 1376-1400 of 5548 papers

| Title | Status | Hype |
| --- | --- | --- |
| Modern, Efficient, and Differentiable Transport Equation Models using JAX: Applications to Population Balance Equations | | 0 |
| Improving Few-Shot Cross-Domain Named Entity Recognition by Instruction Tuning a Word-Embedding based Retrieval Augmented Large Language Model | | 0 |
| MIRFLEX: Music Information Retrieval Feature Library for Extraction | Code | 1 |
| Benchmarking Bias in Large Language Models during Role-Playing | | 0 |
| Cityscape-Adverse: Benchmarking Robustness of Semantic Segmentation with Realistic Scene Modifications via Diffusion-Based Image Editing | Code | 0 |
| LIBMoE: A Library for comprehensive benchmarking Mixture of Experts in Large Language Models | Code | 1 |
| LLM-Inference-Bench: Inference Benchmarking of Large Language Models on AI Accelerators | Code | 2 |
| IdeaBench: Benchmarking Large Language Models for Research Idea Generation | Code | 0 |
| LLM4Mat-Bench: Benchmarking Large Language Models for Materials Property Prediction | Code | 1 |
| Pedestrian Trajectory Prediction with Missing Data: Datasets, Imputation, and Benchmarking | Code | 1 |
| EMGBench: Benchmarking Out-of-Distribution Generalization and Adaptation for Electromyography | Code | 1 |
| Benchmark Data Repositories for Better Benchmarking | | 0 |
| XRDSLAM: A Flexible and Modular Framework for Deep Learning based SLAM | Code | 3 |
| AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents | Code | 3 |
| DetectRL: Benchmarking LLM-Generated Text Detection in Real-World Scenarios | Code | 1 |
| AllClear: A Comprehensive Dataset and Benchmark for Cloud Removal in Satellite Imagery | Code | 1 |
| CALE: Continuous Arcade Learning Environment | Code | 7 |
| Low-Density 3D Point Cloud Classification | | 0 |
| Survey of Cultural Awareness in Language Models: Text and Beyond | Code | 1 |
| NCAdapt: Dynamic adaptation with domain-specific Neural Cellular Automata for continual hippocampus segmentation | Code | 0 |
| VisAidMath: Benchmarking Visual-Aided Mathematical Reasoning | | 0 |
| DexGraspNet 2.0: Learning Generative Dexterous Grasping in Large-scale Synthetic Cluttered Scenes | | 0 |
| InjecGuard: Benchmarking and Mitigating Over-defense in Prompt Injection Guardrail Models | Code | 2 |
| CORAL: Benchmarking Multi-turn Conversational Retrieval-Augmentation Generation | Code | 2 |
| Evaluating Cultural and Social Awareness of LLM Web Agents | | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |