SOTAVerified

Benchmarking

Papers

Showing 551–600 of 5548 papers

Title | Status | Hype
Benchmarking Large Multimodal Models against Common Corruptions | Code | 1
Deep Learning-Based Synchronization for Uplink NB-IoT | Code | 1
Benchmarking Large Language Models for News Summarization | Code | 1
A Benchmarking Study of Embedding-based Entity Alignment for Knowledge Graphs | Code | 1
Decentralized Arena: Towards Democratic and Scalable Automatic Evaluation of Language Models | Code | 1
Benchmarking Large Language Models for Persian: A Preliminary Study Focusing on ChatGPT | Code | 1
A Dataset for Answering Time-Sensitive Questions | Code | 1
dEchorate: a Calibrated Room Impulse Response Database for Echo-aware Signal Processing | Code | 1
Benchmarking Large Language Models for Automated Verilog RTL Code Generation | Code | 1
Data Splits and Metrics for Method Benchmarking on Surgical Action Triplet Datasets | Code | 1
Benchmarking Large Language Models on Controllable Generation under Diversified Instructions | Code | 1
DCL-Net: Deep Correspondence Learning Network for 6D Pose Estimation | Code | 1
Decoding the Enigma: Benchmarking Humans and AIs on the Many Facets of Working Memory | Code | 1
DeID-GPT: Zero-shot Medical Text De-Identification by GPT-4 | Code | 1
Benchmarking Knowledge Boundary for Large Language Models: A Different Perspective on Model Evaluation | Code | 1
Data-Driven Denoising of Stationary Accelerometer Signals | Code | 1
D2S: Document-to-Slide Generation Via Query-Based Text Summarization | Code | 1
CySecBench: Generative AI-based CyberSecurity-focused Prompt Dataset for Benchmarking Large Language Models | Code | 1
DACBench: A Benchmark Library for Dynamic Algorithm Configuration | Code | 1
Data Generating Process to Evaluate Causal Discovery Techniques for Time Series Data | Code | 1
Benchmarking Image Retrieval for Visual Localization | Code | 1
Benchmarking Implicit Neural Representation and Geometric Rendering in Real-Time RGB-D SLAM | Code | 1
Curious Hierarchical Actor-Critic Reinforcement Learning | Code | 1
Benchmarking human visual search computational models in natural scenes: models comparison and reference datasets | Code | 1
Benchmarking Language Model Creativity: A Case Study on Code Generation | Code | 1
Towards Reliable Detection of LLM-Generated Texts: A Comprehensive Evaluation Framework with CUDRT | Code | 1
DataRec: A Python Library for Standardized and Reproducible Data Management in Recommender Systems | Code | 1
Cross-Modal Bidirectional Interaction Model for Referring Remote Sensing Image Segmentation | Code | 1
Analog or Digital In-memory Computing? Benchmarking through Quantitative Modeling | Code | 1
CRoW: Benchmarking Commonsense Reasoning in Real-World Tasks | Code | 1
CriticBench: Benchmarking LLMs for Critique-Correct Reasoning | Code | 1
CryptOpt: Verified Compilation with Randomized Program Search for Cryptographic Primitives (full version) | Code | 1
Benchmarking Graph Neural Networks on Dynamic Link Prediction | Code | 1
Benchmarking Large Vision-Language Models via Directed Scene Graph for Comprehensive Image Captioning | Code | 1
Anabranch Network for Camouflaged Object Segmentation | Code | 1
Benchmarking Graph Neural Networks for FMRI analysis | Code | 1
Benchmarking Language Models for Code Syntax Understanding | Code | 1
CSAW-M: An Ordinal Classification Dataset for Benchmarking Mammographic Masking of Cancer | Code | 1
Dataset and Benchmark: Novel Sensors for Autonomous Vehicle Perception | Code | 1
Deluca -- A Differentiable Control Library: Environments, Methods, and Benchmarking | Code | 1
Do Vision & Language Decoders use Images and Text equally? How Self-consistent are their Explanations? | Code | 1
Benchmarking Generated Poses: How Rational is Structure-based Drug Design with Generative Models? | Code | 1
Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization | Code | 1
A multi-schematic classifier-independent oversampling approach for imbalanced datasets | Code | 1
CHOICE: Benchmarking the Remote Sensing Capabilities of Large Vision-Language Models | Code | 1
CosPGD: an efficient white-box adversarial attack for pixel-wise prediction tasks | Code | 1
Benchmarking for Biomedical Natural Language Processing Tasks with a Domain Specific ALBERT | Code | 1
AdaPool: Exponential Adaptive Pooling for Information-Retaining Downsampling | Code | 1
A Multifaceted Benchmarking of Synthetic Electronic Health Record Generation Models | Code | 1
M4-SAR: A Multi-Resolution, Multi-Polarization, Multi-Scene, Multi-Source Dataset and Benchmark for Optical-SAR Fusion Object Detection | Code | 1
Page 12 of 111

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified