SOTAVerified

Benchmarking

Papers

Showing 801–850 of 5548 papers

Title | Status | Hype
Massively Multi-Cultural Knowledge Acquisition & LM Benchmarking | Code | 1
Explainable Global Wildfire Prediction Models using Graph Neural Networks | Code | 1
Retrieve, Merge, Predict: Augmenting Tables with Data Lakes | Code | 1
Improved off-policy training of diffusion samplers | Code | 1
JOBSKAPE: A Framework for Generating Synthetic Job Postings to Enhance Skill Matching | Code | 1
GenFace: A Large-Scale Fine-Grained Face Forgery Benchmark and Cross Appearance-Edge Learning | Code | 1
Benchmarking Transferable Adversarial Attacks | Code | 1
We're Not Using Videos Effectively: An Updated Domain Adaptive Video Segmentation Baseline | Code | 1
Explainable Benchmarking for Iterative Optimization Heuristics | Code | 1
Category-wise Fine-Tuning: Resisting Incorrect Pseudo-Labels in Multi-Label Image Classification with Partial Labels | Code | 1
Machine Translation Meta Evaluation through Translation Accuracy Challenge Sets | Code | 1
Dataset and Benchmark: Novel Sensors for Autonomous Vehicle Perception | Code | 1
SciMMIR: Benchmarking Scientific Multi-modal Information Retrieval | Code | 1
Benchmarking Large Multimodal Models against Common Corruptions | Code | 1
CheX-GPT: Harnessing Large Language Models for Enhanced Chest X-ray Report Labeling | Code | 1
RSUD20K: A Dataset for Road Scene Understanding In Autonomous Driving | Code | 1
CAVIAR: Co-simulation of 6G Communications, 3D Scenarios and AI for Digital Twins | Code | 1
German Text Embedding Clustering Benchmark | Code | 1
FinDABench: Benchmarking Financial Data Analysis Ability of Large Language Models | Code | 1
Benchmarking Large Language Models on Controllable Generation under Diversified Instructions | Code | 1
Benchmarking the CoW with the TopCoW Challenge: Topology-Aware Anatomical Segmentation of the Circle of Willis for CTA and MRA | Code | 1
APTv2: Benchmarking Animal Pose Estimation and Tracking with a Large-scale Dataset and Beyond | Code | 1
Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models | Code | 1
RetailSynth: Synthetic Data Generation for Retail AI Systems Evaluation | Code | 1
FiFAR: A Fraud Detection Dataset for Learning to Defer | Code | 1
TAO-Amodal: A Benchmark for Tracking Any Object Amodally | Code | 1
How to Train Neural Field Representations: A Comprehensive Study and Benchmark | Code | 1
Binary Code Summarization: Benchmarking ChatGPT/GPT-4 and Other Large Language Models | Code | 1
How Well Does GPT-4V(ision) Adapt to Distribution Shifts? A Preliminary Investigation | Code | 1
EgoPlan-Bench: Benchmarking Multimodal Large Language Models for Human-Level Planning | Code | 1
Benchmarking Distribution Shift in Tabular Data with TableShift | Code | 1
STREAMLINE: An Automated Machine Learning Pipeline for Biomedicine Applied to Examine the Utility of Photography-Based Phenotypes for OSA Prediction Across International Sleep Centers | Code | 1
Benchmarking and Analysis of Unsupervised Object Segmentation from Real-world Single Images | Code | 1
Can language agents be alternatives to PPO? A Preliminary Empirical Study On OpenAI Gym | Code | 1
BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models | Code | 1
BEDD: The MineRL BASALT Evaluation and Demonstrations Dataset for Training and Benchmarking Agents that Solve Fuzzy Tasks | Code | 1
Let the LLMs Talk: Simulating Human-to-Human Conversational QA via Zero-Shot LLM-to-LLM Interactions | Code | 1
Controlgym: Large-Scale Control Environments for Benchmarking Reinforcement Learning Algorithms | Code | 1
Enhancing Ligand Pose Sampling for Molecular Docking | Code | 1
Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation | Code | 1
Should we be going MAD? A Look at Multi-Agent Debate Strategies for LLMs | Code | 1
UHGEval: Benchmarking the Hallucination of Chinese Large Language Models via Unconstrained Generation | Code | 1
Benchmarking Robustness of Text-Image Composed Retrieval | Code | 1
IMGTB: A Framework for Machine-Generated Text Detection Benchmarking | Code | 1
BEND: Benchmarking DNA Language Models on biologically meaningful tasks | Code | 1
Towards a more inductive world for drug repurposing approaches | Code | 1
LogLead -- Fast and Integrated Log Loader, Enhancer, and Anomaly Detector | Code | 1
Benchmarking Pathology Feature Extractors for Whole Slide Image Classification | Code | 1
TextEE: Benchmark, Reevaluation, Reflections, and Future Challenges in Event Extraction | Code | 1
Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified