SOTAVerified

Benchmarking

Papers

Showing 1126–1150 of 5548 papers

Title | Status | Hype
Benchmarking LLMs' Swarm intelligence | Code | 1
Benchmarking Robustness of Multimodal Image-Text Models under Distribution Shift | Code | 1
Benchmarking Local Robustness of High-Accuracy Binary Neural Networks for Enhanced Traffic Sign Recognition | Code | 1
Benchmarking the Performance of Bayesian Optimization across Multiple Experimental Materials Science Domains | Code | 1
Benchmarking Low-Shot Robustness to Natural Distribution Shifts | Code | 1
Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions | Code | 1
Benchmarking Segmentation Models with Mask-Preserved Attribute Editing | Code | 1
Are We There Yet? Evaluating State-of-the-Art Neural Network based Geoparsers Using EUPEG as a Benchmarking Platform | Code | 1
Benchmarking Large Language Models on Controllable Generation under Diversified Instructions | Code | 1
AgentQuest: A Modular Benchmark Framework to Measure Progress and Improve LLM Agents | Code | 1
Benchmarking the Robustness of Temporal Action Detection Models Against Temporal Corruptions | Code | 1
Benchmarking Robustness of Machine Reading Comprehension Models | Code | 1
Benchmarking machine learning models on multi-centre eICU critical care dataset | Code | 1
German's Next Language Model | Code | 1
GraphArena: Benchmarking Large Language Models on Graph Computational Problems | Code | 1
HateBench: Benchmarking Hate Speech Detectors on LLM-Generated Content and Hate Campaigns | Code | 1
Hopfield-Enhanced Deep Neural Networks for Artifact-Resilient Brain State Decoding | Code | 1
Are LLMs Capable of Data-based Statistical and Causal Reasoning? Benchmarking Advanced Quantitative Reasoning with Data | Code | 1
Benchmarking Meaning Representations in Neural Semantic Parsing | Code | 1
ARLBench: Flexible and Efficient Benchmarking for Hyperparameter Optimization in Reinforcement Learning | Code | 1
Benchmarking Meta-embeddings: What Works and What Does Not | Code | 1
AgentSense: Benchmarking Social Intelligence of Language Agents through Interactive Scenarios | Code | 1
Benchmarking Micro-action Recognition: Dataset, Methods, and Applications | Code | 1
Generative Wind Power Curve Modeling Via Machine Vision: A Self-learning Deep Convolutional Network Based Method | Code | 1
Benchmarking Large Language Models for News Summarization | Code | 1
Page 46 of 222

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 |  | Unverified