Novel object captioning surpasses human performance on benchmarks
Consider for a moment what it takes to visually identify and describe something to another person. Now imagine that the other person can’t see the object or image, so every detail matters. How do you decide what information is important and what’s not? You’ll need to know exactly what everything is, where it is, and what it’s doing in relation to other objects, and you’ll need to note other attributes, such as color or whether an object sits in the foreground or background. This exercise shows that translating images into words is a complex task requiring a wide range of knowledge about many unique things, even though humans do it so often and so innately that it can seem automatic.
In order to translate this skill into artificial intelligence (AI),