{"id":119336,"date":"2020-10-14T15:08:53","date_gmt":"2020-10-14T15:08:53","guid":{"rendered":"https:\/\/news.microsoft.com\/?p=439550"},"modified":"2020-10-14T15:08:53","modified_gmt":"2020-10-14T15:08:53","slug":"latest-ai-breakthrough-describes-images-as-well-as-people-do","status":"publish","type":"post","link":"https:\/\/sickgaming.net\/blog\/2020\/10\/14\/latest-ai-breakthrough-describes-images-as-well-as-people-do\/","title":{"rendered":"Latest AI breakthrough describes images as well as people do"},"content":{"rendered":"<h2><strong>Novel object captioning<\/strong><\/h2>\n<p>Image captioning is a core challenge in the discipline of computer vision, one that requires an AI system to understand and describe the salient content, or action, in an image, explained <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/lijuanw\/\">Lijuan Wang<\/a>, a principal research manager in Microsoft\u2019s research lab in Redmond.<\/p>\n<p>\u201cYou really need to understand what is going on, you need to know the relationship between objects and actions and you need to summarize and describe it in a natural language sentence,\u201d she said.<\/p>\n<p>Wang led the research team that <a href=\"https:\/\/aka.ms\/MSRBlogImageCap\">achieved \u2013 and beat \u2013 human parity<\/a> on the novel object captioning at scale, or <a href=\"https:\/\/nocaps.org\/\">nocaps<\/a>, benchmark. 
The benchmark evaluates AI systems on how well they generate captions for objects in images that are not in the dataset used to train them.<\/p>\n<p>Image captioning systems are typically trained with datasets that contain images paired with sentences that describe the images, essentially a dataset of captioned images.<\/p>\n<p>\u201cThe nocaps challenge is really how are you able to describe those novel objects that you haven\u2019t seen in your training data?\u201d Wang said.<\/p>\n<p>To meet the challenge, the Microsoft team pre-trained a large AI model with a rich dataset of images paired with word tags, with each tag mapped to a specific object in an image.<\/p>\n<p>Datasets of images with word tags instead of full captions are more efficient to create, which allowed Wang\u2019s team to feed lots of data into their model. The approach imbued the model with what the team calls a visual vocabulary.<\/p>\n<p>The visual vocabulary pre-training approach, explained Xuedong Huang, a Microsoft technical fellow and the chief technology officer of Azure AI Cognitive Services, is similar to prepping children to read by first using a picture book that associates individual words with images, such as a picture of an apple with the word \u201capple\u201d beneath it and a picture of a cat with the word \u201ccat\u201d beneath it.<\/p>\n<p>\u201cThis visual vocabulary pre-training essentially is the education needed to train the system; we are trying to educate this motor memory,\u201d Huang said.<\/p>\n<p>The pre-trained model is then fine-tuned for captioning on the dataset of captioned images. In this stage of training, the model learns how to compose a sentence. 
When presented with an image containing novel objects, the AI system leverages the visual vocabulary to generate an accurate caption.<\/p>\n<p>\u201cIt combines what is learned in both the pre-training and the fine-tuning to handle novel objects in the testing,\u201d Wang said.<\/p>\n<p>When evaluated on nocaps, the AI system created captions that were more descriptive and accurate than the captions for the same images that were written by people, according to results presented in a <a href=\"https:\/\/arxiv.org\/abs\/2009.13682\">research paper<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Novel object captioning Image captioning is a core challenge in the discipline of computer vision, one that requires an AI system to understand and describe the salient content, or action, in an image, explained Lijuan Wang, a principal research manager in Microsoft\u2019s research lab in Redmond. \u201cYou really need to understand what is going on, [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":119337,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[49],"tags":[135,1102],"class_list":["post-119336","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-microsoft-news","tag-artificial-intelligence","tag-innovation-stories"],"_links":{"self":[{"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/posts\/119336","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/comments?post=119336"}],"version-history":[{"count":0,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/posts\/119336\/revisions"}],"
wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/media\/119337"}],"wp:attachment":[{"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/media?parent=119336"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/categories?post=119336"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/tags?post=119336"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}