SOTAVerified

Benchmarking

Papers

Showing 676-700 of 5548 papers

Title | Status | Hype
Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions | Code | 1
A Unified Taxonomy and Multimodal Dataset for Events in Invasion Games | Code | 1
Benchmarking Language Model Creativity: A Case Study on Code Generation | Code | 1
Benchmarking Large Vision-Language Models via Directed Scene Graph for Comprehensive Image Captioning | Code | 1
A User-Centric Multi-Intent Benchmark for Evaluating Large Language Models | Code | 1
DetectRL: Benchmarking LLM-Generated Text Detection in Real-World Scenarios | Code | 1
Developing a Scalable Benchmark for Assessing Large Language Models in Knowledge Graph Engineering | Code | 1
Deluca -- A Differentiable Control Library: Environments, Methods, and Benchmarking | Code | 1
Benchmarking Llama2, Mistral, Gemma and GPT for Factuality, Toxicity, Bias and Propensity for Hallucinations | Code | 1
DiagnosisArena: Benchmarking Diagnostic Reasoning for Large Language Models | Code | 1
AutoDetect: Towards a Unified Framework for Automated Weakness Detection in Large Language Models | Code | 1
Demystifying Learning Rate Policies for High Accuracy Training of Deep Neural Networks | Code | 1
Attention, Please! Revisiting Attentive Probing for Masked Image Modeling | Code | 1
Benchmarking Implicit Neural Representation and Geometric Rendering in Real-Time RGB-D SLAM | Code | 1
Benchmarking Meta-embeddings: What Works and What Does Not | Code | 1
Benchmarking LLMs' Swarm intelligence | Code | 1
DivScene: Benchmarking LVLMs for Object Navigation with Diverse Scenes and Objects | Code | 1
Align and Distill: Unifying and Improving Domain Adaptive Object Detection | Code | 1
Deep learning model solves change point detection for multiple change types | Code | 1
Deep Learning-Based Synchronization for Uplink NB-IoT | Code | 1
Automated Model Design and Benchmarking of 3D Deep Learning Models for COVID-19 Detection with Chest CT Scans | Code | 1
Benchmarking Meaning Representations in Neural Semantic Parsing | Code | 1
DocuMint: Docstring Generation for Python using Small Language Models | Code | 1
Do LLMs Recognize Your Preferences? Evaluating Personalized Preference Following in LLMs | Code | 1
A Comprehensive Study on Large-Scale Graph Training: Benchmarking and Rethinking | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified