Falcon Perception
TL;DR — Falcon Perception is a 0.6B-parameter early-fusion Transformer for open-vocabulary grounding and segmentation from natural language prompts. The model processes image patches + text in one sequence using a hybrid attention mask, and produces variable numbers of instances with a small, structured token interface and lightweight output heads. On SA-Co, Falcon Perception reaches 68.0 Macro-F1 (vs. 62.3 for SAM 3) with the main remaining gap being presence calibration (MCC 0.64 vs. 0.82). We also introduce PBench, a diagnostic benchmark […]
Read more