| HumanMM: Global Human Motion Recovery from Multi-shot Videos | Mar 10, 2025 | Camera Pose Estimation, Motion Generation | Code Available | 2 |
| Griffin: Aerial-Ground Cooperative Detection and Tracking Dataset and Benchmark | Mar 10, 2025 | Autonomous Driving, Benchmarking | Code Available | 2 |
| DistiLLM-2: A Contrastive Approach Boosts the Distillation of LLMs | Mar 10, 2025 | Code Generation, Instruction Following | Code Available | 2 |
| Chameleon: Fast-slow Neuro-symbolic Lane Topology Extraction | Mar 10, 2025 | Autonomous Driving, Scene Understanding | Code Available | 2 |
| Is CLIP ideal? No. Can we fix it? Yes! | Mar 10, 2025 | Attribute, Negation | Code Available | 2 |
| SemEval-2025 Task 11: Bridging the Gap in Text-Based Emotion Detection | Mar 10, 2025 | | Code Available | 2 |
| Controllable 3D Outdoor Scene Generation via Scene Graphs | Mar 10, 2025 | Autonomous Driving, Scene Generation | Code Available | 2 |
| A Multimodal Benchmark Dataset and Model for Crop Disease Diagnosis | Mar 10, 2025 | Question Answering | Code Available | 2 |
| DaD: Distilled Reinforcement Learning for Diverse Keypoint Detection | Mar 10, 2025 | Keypoint Detection, reinforcement-learning | Code Available | 2 |
| MedAgentsBench: Benchmarking Thinking Models and Agent Frameworks for Complex Medical Reasoning | Mar 10, 2025 | Benchmarking, Medical Question Answering | Code Available | 2 |
| Seedream 2.0: A Native Chinese-English Bilingual Image Generation Foundation Model | Mar 10, 2025 | Image Description, Image Generation | Code Available | 2 |
| Similarity-Guided Layer-Adaptive Vision Transformer for UAV Tracking | Mar 9, 2025 | Visual Tracking | Code Available | 2 |
| CLIMB: Data Foundations for Large Scale Multimodal Clinical Foundation Models | Mar 9, 2025 | | Code Available | 2 |
| Learning Few-Step Diffusion Models by Trajectory Distribution Matching | Mar 9, 2025 | Image Generation, Text to Image Generation | Code Available | 2 |
| Axes that matter: PCA with a difference | Mar 9, 2025 | regression | Code Available | 2 |
| Emulating Self-attention with Convolution for Efficient Image Super-Resolution | Mar 9, 2025 | Computational Efficiency, Image Super-Resolution | Code Available | 2 |
| DiffCLIP: Differential Attention Meets CLIP | Mar 9, 2025 | Language Modeling, Language Modelling | Code Available | 2 |
| DiffAtlas: GenAI-fying Atlas Segmentation via Image-Mask Diffusion | Mar 9, 2025 | Image Segmentation, Medical Image Segmentation | Code Available | 2 |
| Agent models: Internalizing Chain-of-Action Generation into Reasoning models | Mar 9, 2025 | Action Generation, Reinforcement Learning (RL) | Code Available | 2 |
| X2I: Seamless Integration of Multimodal Understanding into Diffusion Transformer via Attention Distillation | Mar 8, 2025 | GPU, Image Generation | Code Available | 2 |
| Can Atomic Step Decomposition Enhance the Self-structured Reasoning of Multimodal Large Models? | Mar 8, 2025 | Mathematical Reasoning, Multimodal Reasoning | Code Available | 2 |
| RouterEval: A Comprehensive Benchmark for Routing LLMs to Explore Model-level Scaling Up in LLMs | Mar 8, 2025 | Instruction Following, Mathematical Reasoning | Code Available | 2 |
| USP: Unified Self-Supervised Pretraining for Image Generation and Understanding | Mar 8, 2025 | Image Generation, Representation Learning | Code Available | 2 |
| Large Language Models Post-training: Surveying Techniques from Alignment to Reasoning | Mar 8, 2025 | Survey | Code Available | 2 |
| Next Token Is Enough: Realistic Image Quality and Aesthetic Scoring with Multimodal Large Language Model | Mar 8, 2025 | Image Quality Assessment, Language Modeling | Code Available | 2 |
| A Noise-Robust Turn-Taking System for Real-World Dialogue Robots: A Field Experiment | Mar 8, 2025 | speech-recognition, Speech Recognition | Code Available | 2 |
| Slim attention: cut your context memory in half without loss of accuracy -- K-cache is all you need for MHA | Mar 7, 2025 | All, Decoder | Code Available | 2 |
| EDM: Efficient Deep Feature Matching | Mar 7, 2025 | | Code Available | 2 |
| A Survey of Large Language Model Empowered Agents for Recommendation and Search: Towards Next-Generation Information Retrieval | Mar 7, 2025 | Information Retrieval, Language Modeling | Code Available | 2 |
| Sketch-of-Thought: Efficient LLM Reasoning with Adaptive Cognitive-Inspired Sketching | Mar 7, 2025 | | Code Available | 2 |
| DriveTransformer: Unified Transformer for Scalable End-to-End Autonomous Driving | Mar 7, 2025 | Autonomous Driving, Bench2Drive | Code Available | 2 |
| Encrypted Vector Similarity Computations Using Partially Homomorphic Encryption: Applications and Performance Analysis | Mar 7, 2025 | Image Retrieval, Privacy Preserving | Code Available | 2 |
| PromptPex: Automatic Test Generation for Language Model Prompts | Mar 7, 2025 | Language Modeling, Language Modelling | Code Available | 2 |
| CoMoGaussian: Continuous Motion-Aware Gaussian Splatting from Motion-Blurred Images | Mar 7, 2025 | 3DGS, 3D Scene Reconstruction | Code Available | 2 |
| D2GV: Deformable 2D Gaussian Splatting for Video Representation in 400FPS | Mar 7, 2025 | Denoising, Quantization | Code Available | 2 |
| Linear-MoE: Linear Sequence Modeling Meets Mixture-of-Experts | Mar 7, 2025 | Mixture-of-Experts, State Space Models | Code Available | 2 |
| WritingBench: A Comprehensive Benchmark for Generative Writing | Mar 7, 2025 | Text Generation | Code Available | 2 |
| Omnidirectional Multi-Object Tracking | Mar 6, 2025 | Multi-Object Tracking, Object | Code Available | 2 |
| Bridging the Vision-Brain Gap with an Uncertainty-Aware Blur Prior | Mar 6, 2025 | Image Retrieval | Code Available | 2 |
| Full-Duplex-Bench: A Benchmark to Evaluate Full-duplex Spoken Dialogue Models on Turn-taking Capabilities | Mar 6, 2025 | Language Modeling, Language Modelling | Code Available | 2 |
| ProtComposer: Compositional Protein Structure Generation with 3D Ellipsoids | Mar 6, 2025 | Diversity | Code Available | 2 |
| Real-time Spatial-temporal Traversability Assessment via Feature-based Sparse Gaussian Process | Mar 6, 2025 | Autonomous Navigation, Computational Efficiency | Code Available | 2 |
| AnyAnomaly: Zero-Shot Customizable Video Anomaly Detection with LVLM | Mar 6, 2025 | Anomaly Detection, Language Modeling | Code Available | 2 |
| An Egocentric Vision-Language Model based Portable Real-time Smart Assistant | Mar 6, 2025 | Language Modeling, Language Modelling | Code Available | 2 |
| Scaling Rich Style-Prompted Text-to-Speech Datasets | Mar 6, 2025 | Language Modeling, Language Modelling | Code Available | 2 |
| Generalized Interpolating Discrete Diffusion | Mar 6, 2025 | Language Modeling, Language Modelling | Code Available | 2 |
| Keeping Yourself is Important in Downstream Tuning Multimodal Large Language Model | Mar 6, 2025 | General Knowledge, Image Captioning | Code Available | 2 |
| PDX: A Data Layout for Vector Similarity Search | Mar 6, 2025 | Avg | Code Available | 2 |
| BANet: Bilateral Aggregation Network for Mobile Stereo Matching | Mar 5, 2025 | Stereo Matching | Code Available | 2 |
| Find First, Track Next: Decoupling Identification and Propagation in Referring Video Object Segmentation | Mar 5, 2025 | Object, Referring Video Object Segmentation | Code Available | 2 |