SOTAVerified

General Knowledge

This task evaluates a model's ability to answer general-knowledge questions.

Source: BIG-bench

Papers

Showing 1-50 of 399 papers

Title | Status | Hype
Training Compute-Optimal Large Language Models | Code | 6
Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives | Code | 5
Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuray | Code | 3
The Breeze 2 Herd of Models: Traditional Chinese LLMs Based on Llama with Vision-Aware and Function-Calling Capabilities | Code | 3
VoiceBench: Benchmarking LLM-Based Voice Assistants | Code | 3
Parameter-Efficient Fine-Tuning in Spectral Domain for Point Cloud Learning | Code | 3
Cascade Prompt Learning for Vision-Language Model Adaptation | Code | 3
RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework | Code | 3
Beyond Specialization: Assessing the Capabilities of MLLMs in Age and Gender Estimation | Code | 3
SAM Fails to Segment Anything? -- SAM-Adapter: Adapting SAM in Underperformed Scenes: Camouflage, Shadow, Medical Image Segmentation, and More | Code | 3
A Survey of Knowledge Graph Reasoning on Graph Types: Static, Dynamic, and Multimodal | Code | 3
MMRL++: Parameter-Efficient and Interaction-Aware Representation Learning for Vision-Language Models | Code | 2
Keeping Yourself is Important in Downstream Tuning Multimodal Large Language Model | Code | 2
A Survey of Personalized Large Language Models: Progress and Future Directions | Code | 2
LLM-RG4: Flexible and Factual Radiology Report Generation across Diverse Input Contexts | Code | 2
Selective Aggregation for Low-Rank Adaptation in Federated Learning | Code | 2
Prior Knowledge Integration via LLM Encoding and Pseudo Event Regulation for Video Moment Retrieval | Code | 2
F-LMM: Grounding Frozen Large Multimodal Models | Code | 2
CoIN: A Benchmark of Continual Instruction tuNing for Multimodel Large Language Model | Code | 2
CyberMetric: A Benchmark Dataset based on Retrieval-Augmented Generation for Evaluating LLMs in Cybersecurity Knowledge | Code | 2
MMA: Multi-Modal Adapter for Vision-Language Models | Code | 2
Imagine Before Go: Self-Supervised Generative Map for Object Goal Navigation | Code | 2
Exploring the Potential of Large Language Models (LLMs) in Learning on Graphs | Code | 2
Harnessing Explanations: LLM-to-LM Interpreter for Enhanced Text-Attributed Graph Representation Learning | Code | 2
Continual Pre-training of Language Models | Code | 2
Adapting a Language Model While Preserving its General Knowledge | Code | 2
Scaling Language Models: Methods, Analysis & Insights from Training Gopher | Code | 2
ConceptNet at SemEval-2017 Task 2: Extending Word Embeddings with Multilingual Relational Knowledge | Code | 2
ConceptNet 5.5: An Open Multilingual Graph of General Knowledge | Code | 2
TaxoAdapt: Aligning LLM-Based Multidimensional Taxonomy Construction to Evolving Research Corpora | Code | 1
HELM: Hyperbolic Large Language Models via Mixture-of-Curvature Experts | Code | 1
A General Knowledge Injection Framework for ICD Coding | Code | 1
A Dual-Space Framework for General Knowledge Distillation of Large Language Models | Code | 1
FuseChat-3.0: Preference Optimization Meets Heterogeneous Model Fusion | Code | 1
Super-class guided Transformer for Zero-Shot Attribute Classification | Code | 1
RAG with Differential Privacy | Code | 1
SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts | Code | 1
SAFE: Slow and Fast Parameter-Efficient Tuning for Continual Learning with Pre-Trained Models | Code | 1
Point-PRC: A Prompt Learning Based Regulation Framework for Generalizable Point Cloud Analysis | Code | 1
DA-Ada: Learning Domain-Aware Adapter for Domain Adaptive Object Detection | Code | 1
E2Map: Experience-and-Emotion Map for Self-Reflective Robot Navigation with Language Models | Code | 1
Aligning Medical Images with General Knowledge from Large Language Models | Code | 1
How Well Do LLMs Handle Cantonese? Benchmarking Cantonese Capabilities of Large Language Models | Code | 1
DIAGen: Diverse Image Augmentation with Generative Models | Code | 1
Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress? | Code | 1
Can Editing LLMs Inject Harm? | Code | 1
ElecBench: a Power Dispatch Evaluation Benchmark for Large Language Models | Code | 1
Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs | Code | 1
To Forget or Not? Towards Practical Knowledge Unlearning for Large Language Models | Code | 1
Decoupling General and Personalized Knowledge in Federated Learning via Additive and Low-Rank Decomposition | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | Chinchilla-70B (few-shot, k=5) | Accuracy | 94.3 | | Unverified
2 | Gopher-280B (few-shot, k=5) | Accuracy | 93.9 | | Unverified
3 | Chinchilla-70B (few-shot, k=5) | Accuracy | 85.7 | | Unverified
4 | Gopher-280B (few-shot, k=5) | Accuracy | 84.8 | | Unverified
5 | Gopher-280B (few-shot, k=5) | Accuracy | 84.2 | | Unverified
6 | Gopher-280B (few-shot, k=5) | Accuracy | 84.1 | | Unverified
7 | Gopher-280B (few-shot, k=5) | Accuracy | 83.9 | | Unverified
8 | Gopher-280B (few-shot, k=5) | Accuracy | 83.3 | | Unverified
9 | Gopher-280B (few-shot, k=5) | Accuracy | 81.8 | | Unverified
10 | Gopher-280B (few-shot, k=5) | Accuracy | 81 | | Unverified