Doughnut or Bagel? Helping AI Fill a Hole in Perception

Machine vision can easily confuse doughnuts & bagels, cats & dogs, and other similar-looking things. Why? And how can we fix it?

As National Doughnut Week kicks off, aiming to raise vital funds for The Children’s Trust, there’s a lot more to think about than whether to pick ring-shaped or hole-less, glazed or plain, filled or solid, cream or jelly. And I’m not talking about boldly branching out into the weird and wonderful world of doughnut hybrids — we’re already faced with enough choice without throwing cronuts and duffins into the mix.

No, I speak of the intersection between doughnuts and Artificial Intelligence. Yes, AI and mankind’s favourite sugary treat are entangled in a relationship — albeit a sticky one. You see, to the technology, the humble doughnut presents a big problem: how to distinguish between the sweet fried snack and its savoury baked doppelganger, the bagel.

So, what we do need to think about is how we can improve machine vision, whose current capabilities are astonishing yet hindered by weak spots. More specifically, how can we teach the tech to tell the difference between doughnuts (the holey ones, obviously) and bagels, which look sorta similar at first glance but are wildly different under the surface? Hmm, that’s definitely a tough one, but… Google to the rescue!

Earlier this year, the tech giant bravely slipped on its apron and lunged at the conundrum wielding a rolling pin. It did so with a crowdsourced challenge, CATS4ML (Crowdsourcing Adverse Test Sets for Machine Learning), in which ML experts and amateurs alike were invited to use their intuition to develop fresh ways of discovering AI blindspots.

If you’re unfamiliar with the concept of AI blindspots, these are often called adversarial images, or unknown unknowns: images with visual patterns that AI models struggle to distinguish because they’re rare, tricky, or a combination of both. In other words, they’re images humans can usually identify without issue but that routinely trip up algorithms.

I like to think of them as optical illusions for artificial intelligence, and while they can either be intentionally manipulated to trick AI or left entirely unmanipulated, we’re more concerned with the latter here.

The outcome of unknown unknowns is erroneous confidence in the identification of an image. By contrast, known unknowns are relatively simple to deal with: the algorithm understands that it doesn’t recognise the image in question and flags it for human assessment.
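To make the distinction concrete, here’s a minimal sketch of the triage logic described above. It assumes a hypothetical model that returns a label plus a confidence score; the `triage` function, its threshold value, and the example scores are all my own illustration, not part of CATS4ML or any real system:

```python
# Hypothetical illustration: confidence thresholding catches "known unknowns"
# but not "unknown unknowns", which arrive with high (misplaced) confidence.

def triage(label, confidence, threshold=0.7):
    """Route a model's prediction: accept it, or flag it for human review."""
    if confidence < threshold:
        # Known unknown: the model admits it's unsure, so a human steps in.
        return "flag for human review"
    # Trusted prediction -- rightly or wrongly.
    return f"accept as '{label}'"

# A known unknown: low confidence, so the system escalates to a human.
print(triage("bagel", 0.41))  # flag for human review

# An unknown unknown: a doughnut misread as a bagel *with* high confidence.
# Thresholding alone will never catch this case.
print(triage("bagel", 0.93))  # accept as 'bagel'
```

The second call is exactly the failure mode this post is about: because the model is confidently wrong, no amount of threshold-tuning will surface the error, and only better training data (or challenges like CATS4ML) can.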

I guess my examples of doughnuts and bagels fall under the ‘tricky’ category. This could also include cats and dogs, for instance, which share a number of characteristics. When they’re photographed from certain angles, AI models will just as easily confuse these two animals as they will doughnuts and bagels, whereas humans, who consider everything else in the image, will classify them correctly.

Think of it this way: just like it’s more probable that a sugar doughnut will have a bumpier, duller surface than a smooth, shiny bagel, there’s a better chance of seeing a dog on a leash than a cat. Although I have to say, stranger things have been known.

I may well focus on doughnuts and bagels in this blog, but images can really contain any problematic items. It could be Justin Timberlake’s sleek 90s coif and a portion of dried noodles:

Or even these adorable puppies and yummy fried chicken:

Of course, I’m having a giggle with these choices — even if I had to stare at the puppies and fried chicken for a good few seconds to figure out what I was seeing.

On a more serious note, machine vision can misclassify all kinds of images with confidence. This is incredibly problematic given that we’re now using these systems in all manner of tech, including the autonomous vehicles I wrote about in last week’s Tesla AI Day blog. We trust AI models to label items correctly under the belief that they “see” as we do. But the truth is, they don’t.

To solve the problem, we need more projects like CATS4ML, which help us train better AI models that are tripped up by fewer blindspots, or ideally by none at all. Just like doughnuts and bagels, machine vision systems have holes, only theirs are in their perception of the world. The difference is that the holes in doughnuts and bagels are just fine as they are, whereas the holes in machine vision’s perception need to be filled!