SOTAVerified

Benchmarking

Papers

Showing 826-850 of 5548 papers

Title | Status | Hype
TAO-Amodal: A Benchmark for Tracking Any Object Amodally | Code | 1
How to Train Neural Field Representations: A Comprehensive Study and Benchmark | Code | 1
Binary Code Summarization: Benchmarking ChatGPT/GPT-4 and Other Large Language Models | Code | 1
How Well Does GPT-4V(ision) Adapt to Distribution Shifts? A Preliminary Investigation | Code | 1
EgoPlan-Bench: Benchmarking Multimodal Large Language Models for Human-Level Planning | Code | 1
Benchmarking Distribution Shift in Tabular Data with TableShift | Code | 1
STREAMLINE: An Automated Machine Learning Pipeline for Biomedicine Applied to Examine the Utility of Photography-Based Phenotypes for OSA Prediction Across International Sleep Centers | Code | 1
Benchmarking and Analysis of Unsupervised Object Segmentation from Real-world Single Images | Code | 1
Can language agents be alternatives to PPO? A Preliminary Empirical Study On OpenAI Gym | Code | 1
BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models | Code | 1
BEDD: The MineRL BASALT Evaluation and Demonstrations Dataset for Training and Benchmarking Agents that Solve Fuzzy Tasks | Code | 1
Let the LLMs Talk: Simulating Human-to-Human Conversational QA via Zero-Shot LLM-to-LLM Interactions | Code | 1
Controlgym: Large-Scale Control Environments for Benchmarking Reinforcement Learning Algorithms | Code | 1
Enhancing Ligand Pose Sampling for Molecular Docking | Code | 1
Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation | Code | 1
Should we be going MAD? A Look at Multi-Agent Debate Strategies for LLMs | Code | 1
UHGEval: Benchmarking the Hallucination of Chinese Large Language Models via Unconstrained Generation | Code | 1
Benchmarking Robustness of Text-Image Composed Retrieval | Code | 1
IMGTB: A Framework for Machine-Generated Text Detection Benchmarking | Code | 1
BEND: Benchmarking DNA Language Models on biologically meaningful tasks | Code | 1
Towards a more inductive world for drug repurposing approaches | Code | 1
LogLead -- Fast and Integrated Log Loader, Enhancer, and Anomaly Detector | Code | 1
Benchmarking Pathology Feature Extractors for Whole Slide Image Classification | Code | 1
TextEE: Benchmark, Reevaluation, Reflections, and Future Challenges in Event Extraction | Code | 1
Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization | Code | 1
Page 34 of 222

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified