Summary: OpenAI has extended GPT-3 with two new models that combine NLP with image recognition to give its AI a better understanding of everyday concepts.
Original author and publication date: Will Douglas Heaven – January 5, 2021
Futurizonte Editor’s Note: You know you are living in interesting times when an avocado armchair is a key element in the future of AI.
From the article:
With GPT-3, OpenAI showed that a single deep-learning model could be trained to use language in a variety of ways simply by feeding it vast amounts of text. It then showed that by swapping text for pixels, the same approach could be used to train an AI to complete half-finished images. GPT-3 mimics how humans use words; Image GPT predicts what we see.
Now OpenAI has put these ideas together and built two new models, called DALL·E and CLIP, that combine language and images in a way that will make AIs better at understanding both words and what they refer to.
“We live in a visual world,” says Ilya Sutskever, chief scientist at OpenAI. “In the long run, you’re going to have models which understand both text and images. AI will be able to understand language better because it can see what words and sentences mean.”
For all GPT-3’s flair, its output can feel untethered from reality, as if it doesn’t know what it’s talking about. That’s because it doesn’t. By grounding text in images, researchers at OpenAI and elsewhere are trying to give language models a better grasp of the everyday concepts that humans use to make sense of things.
DALL·E and CLIP come at this problem from different directions. At first glance, CLIP (Contrastive Language-Image Pre-training) is yet another image recognition system. Except that it has learned to recognize images not from labeled examples in curated data sets, as most existing models do, but from images and their captions taken from the internet. It learns what’s in an image from a description rather than a one-word label such as “cat” or “banana.”
CLIP is trained by getting it to predict which caption from a random selection of 32,768 is the correct one for a given image. To work this out, CLIP learns to link a wide variety of objects with their names and the words that describe them. This then lets it identify objects in images outside its training set. Most image recognition systems are trained to identify certain types of object, such as faces in surveillance videos or buildings in satellite images. Like GPT-3, CLIP can generalize across tasks without additional training. It is also less likely than other state-of-the-art image recognition models to be led astray by adversarial examples, which have been subtly altered in ways that typically confuse algorithms even though humans might not notice a difference.
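The matching task described above can be sketched in a few lines. This is a hypothetical, simplified illustration (toy 3-dimensional embeddings and a cosine-similarity score), not OpenAI's implementation: real CLIP uses learned neural encoders for images and text and scores the correct caption against 32,768 candidates during training.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pick_caption(image_emb, caption_embs):
    """Return the index of the caption whose embedding is most similar
    to the image embedding -- the prediction CLIP is trained to make."""
    scores = [cosine(image_emb, c) for c in caption_embs]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy embeddings (hypothetical values chosen for illustration).
image_emb = [0.9, 0.1, 0.0]   # stands in for an encoded photo of a cat
captions = [
    [0.8, 0.2, 0.1],          # "a photo of a cat"  (the correct caption)
    [0.0, 0.9, 0.4],          # "a bowl of bananas"
    [0.1, 0.0, 1.0],          # "a satellite image of a city"
]
print(pick_caption(image_emb, captions))  # → 0, the correct caption
```

Because the model learns from full descriptions rather than one-word labels, the same similarity test works for objects and phrasings it never saw during training, which is what lets CLIP generalize across recognition tasks.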
Instead of recognizing images, DALL·E (which I’m guessing is a WALL·E/Dali pun) draws them. This model is a smaller version of GPT-3 that has also been trained on text-image pairs taken from the internet. Given a short natural-language caption, such as “a painting of a capybara sitting in a field at sunrise” or “a cross-section view of a walnut,” DALL·E generates lots of images that match it: dozens of capybaras of all shapes and sizes in front of orange and yellow backgrounds; row after row of walnuts (though not all of them in cross-section).