| Understanding the RoPE Extensions of Long-Context LLMs: An Attention Perspective | Jun 19, 2024 | Benchmarking, Continual Pretraining | Unverified | 0 |
| A large-scale multicenter breast cancer DCE-MRI benchmark dataset with expert segmentations | Jun 19, 2024 | Benchmarking | Code Available | 2 |
| Enhancing Distractor Generation for Multiple-Choice Questions with Retrieval Augmented Pretraining and Knowledge Graph Integration | Jun 19, 2024 | Benchmarking, Distractor Generation | Unverified | 0 |
| BeHonest: Benchmarking Honesty in Large Language Models | Jun 19, 2024 | Benchmarking, Misinformation | Code Available | 1 |
| Benchmarking Unsupervised Online IDS for Masquerade Attacks in CAN | Jun 19, 2024 | Benchmarking, Intrusion Detection | Code Available | 0 |
| Towards Robust Evaluation: A Comprehensive Taxonomy of Datasets and Metrics for Open Domain Question Answering in the Era of Large Language Models | Jun 19, 2024 | Benchmarking, Open-Domain Question Answering | Unverified | 0 |
| Comparison of Open-Source and Proprietary LLMs for Machine Reading Comprehension: A Practical Analysis for Industrial Applications | Jun 19, 2024 | Benchmarking, Machine Reading Comprehension | Unverified | 0 |
| M4Fog: A Global Multi-Regional, Multi-Modal, and Multi-Stage Dataset for Marine Fog Detection and Forecasting to Bridge Ocean and Atmosphere | Jun 19, 2024 | Benchmarking, Spatio-Temporal Forecasting | Code Available | 0 |
| GenAI-Bench: Evaluating and Improving Compositional Text-to-Visual Generation | Jun 19, 2024 | Benchmarking, Image Generation | Code Available | 3 |
| Exploring and Benchmarking the Planning Capabilities of Large Language Models | Jun 18, 2024 | Benchmarking, In-Context Learning | Unverified | 0 |
| UBENCH: Benchmarking Uncertainty in Large Language Models with Multiple Choice Questions | Jun 18, 2024 | Benchmarking, Multiple-choice | Code Available | 0 |
| Exploring the Impact of a Transformer's Latent Space Geometry on Downstream Task Performance | Jun 18, 2024 | Benchmarking | Unverified | 0 |
| GeoBench: Benchmarking and Analyzing Monocular Geometry Estimation Models | Jun 18, 2024 | Benchmarking, Depth Estimation | Code Available | 2 |
| TSI-Bench: Benchmarking Time Series Imputation | Jun 18, 2024 | Benchmarking, Deep Learning | Code Available | 3 |
| WebCanvas: Benchmarking Web Agents in Online Environments | Jun 18, 2024 | AI Agent, Benchmarking | Code Available | 3 |
| Automatic benchmarking of large multimodal models via iterative experiment programming | Jun 18, 2024 | Benchmarking, Language Modeling | Code Available | 0 |
| Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning | Jun 18, 2024 | Benchmarking, World Knowledge | Code Available | 0 |
| MultiSocial: Multilingual Benchmark of Machine-Generated Text Detection of Social-Media Texts | Jun 18, 2024 | Articles, Benchmarking | Unverified | 0 |
| OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI | Jun 18, 2024 | Benchmarking, scientific discovery | Code Available | 2 |
| JobFair: A Framework for Benchmarking Gender Hiring Bias in Large Language Models | Jun 17, 2024 | Benchmarking, counterfactual | Unverified | 0 |
| InternalInspector I^2: Robust Confidence Estimation in LLMs through Internal States | Jun 17, 2024 | Benchmarking, Contrastive Learning | Unverified | 0 |
| Unleashing OpenTitan's Potential: a Silicon-Ready Embedded Secure Element for Root of Trust and Cryptographic Offloading | Jun 17, 2024 | Autonomous Vehicles, Benchmarking | Unverified | 0 |
| Job-SDF: A Multi-Granularity Dataset for Job Skill Demand Forecasting and Benchmarking | Jun 17, 2024 | Benchmarking, Demand Forecasting | Code Available | 1 |
| A Systematic Survey of Text Summarization: From Statistical Methods to Large Language Models | Jun 17, 2024 | Benchmarking, Survey | Unverified | 0 |
| Are Large Language Models True Healthcare Jacks-of-All-Trades? Benchmarking Across Health Professions Beyond Physician Exams | Jun 17, 2024 | All, Benchmarking | Code Available | 0 |