{"id":127241,"date":"2022-08-12T13:02:22","date_gmt":"2022-08-12T13:02:22","guid":{"rendered":"https:\/\/news.microsoft.com\/?p=446952"},"modified":"2022-08-12T13:02:22","modified_gmt":"2022-08-12T13:02:22","slug":"just-say-the-magic-word-using-language-to-program-robots","status":"publish","type":"post","link":"https:\/\/sickgaming.net\/blog\/2022\/08\/12\/just-say-the-magic-word-using-language-to-program-robots\/","title":{"rendered":"Just say the magic word: Using language to program robots"},"content":{"rendered":"<p class=\"has-text-align-center\">LaTTe <a href=\"https:\/\/arxiv.org\/abs\/2208.02918\" target=\"_blank\" rel=\"noreferrer noopener\">paper<\/a> and <a href=\"https:\/\/youtu.be\/Kutc_peSrpw\" target=\"_blank\" rel=\"noreferrer noopener\">video<\/a> | Trajectory Transformer <a href=\"https:\/\/arxiv.org\/abs\/2203.13411\" target=\"_blank\" rel=\"noreferrer noopener\">paper<\/a> and <a href=\"https:\/\/youtu.be\/fhSOb3z7aXE\" target=\"_blank\" rel=\"noreferrer noopener\">video<\/a> | <a href=\"https:\/\/github.com\/arthurfenderbucker\/NL_trajectory_reshaper\" target=\"_blank\" rel=\"noreferrer noopener\">Github code<\/a><\/p>\n<p>Language is the most intuitive way for us to express how we feel and what we want. However, despite recent advancements in artificial intelligence, it is still very hard to control a robot using natural language instructions. Free-form commands such as \u201cRobot, please go a little slower when you pass close to my TV\u201d or \u201cStay far away from the swimming pool!\u201d are hard to parse into actionable robot behaviors, and most human-robot interfaces today still rely on complex strategies such as directly programming cost functions that define the desired behavior.&nbsp;<\/p>\n<p>With our latest work, we attempt to change this reality through the introduction of <a href=\"https:\/\/arxiv.org\/abs\/2208.02918\" target=\"_blank\" rel=\"noreferrer noopener\">\u201cLaTTe: Language Trajectory Transformer\u201d<\/a>. 
LaTTe is a deep machine learning model that lets users send language commands to robots in an intuitive way. When given an input sentence by the user, the model fuses it with camera images of objects that the robot observes in its surroundings, and outputs the desired robot behavior.&nbsp;&nbsp;<\/p>\n<p>As an example, think of a user trying to control a robot barista that\u2019s moving a wine bottle. Our method allows a non-technical user to control the robot\u2019s behavior using only words, in a natural and simple interface. We will explain in detail how we achieve this throughout this post.&nbsp;<\/p>\n<figure class=\"wp-block-gallery-1 wp-block-gallery has-nested-images columns-default is-cropped\">\n<figure class=\"wp-block-image\"><img decoding=\"async\" loading=\"lazy\" width=\"600\" height=\"338\" data-id=\"867705\" src=\"https:\/\/www.sickgaming.net\/blog\/wp-content\/uploads\/2022\/08\/just-say-the-magic-word-using-language-to-program-robots.gif\" alt=\"crash\" class=\"wp-image-867705\"><\/figure>\n<figure class=\"wp-block-image\"><img decoding=\"async\" loading=\"lazy\" width=\"600\" height=\"338\" data-id=\"867708\" src=\"https:\/\/www.sickgaming.net\/blog\/wp-content\/uploads\/2022\/08\/just-say-the-magic-word-using-language-to-program-robots-1.gif\" alt=\"nocrash\" class=\"wp-image-867708\"><\/figure>\n<\/figure>\n<p>Continue reading to learn more about this technology, or check out these additional resources:&nbsp;<\/p>\n<p>We also invite the reader to watch the videos describing the papers:&nbsp;<\/p>\n<figure class=\"wp-block-embed is-provider-youtube wp-block-embed-youtube\"><\/figure>\n<figure class=\"wp-block-embed is-provider-youtube wp-block-embed-youtube\"><\/figure>\n<h4><strong>Unlocking the potential of language for robotics<\/strong>&nbsp;<\/h4>\n<p>The field of robotics traditionally uses task-specific programming modules, which need to be re-designed by an expert even if there are minor changes in robot hardware, environment, 
or operational objectives. This inflexible approach is ripe for innovation with the latest advances in machine learning, which emphasizes reusable modules that generalize well over large domains.&nbsp;&nbsp;<\/p>\n<p>Given the intuitive and effective nature of language for general communication, it would be simpler if one could just tell the robot how they want it to behave as opposed to having to reprogram the entire stack every time a change is needed. While large language models such as <a href=\"https:\/\/arxiv.org\/abs\/1810.04805?hl=uk\" target=\"_blank\" rel=\"noreferrer noopener\">BERT<\/a>, <a href=\"https:\/\/openai.com\/api\/\" target=\"_blank\" rel=\"noreferrer noopener\">GPT-3<\/a> and <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model\/\" target=\"_blank\" rel=\"noreferrer noopener\">Megatron-Turing<\/a> have radically improved the quality of machine-generated text and our ability to solve natural language processing tasks, and models like CLIP extend these capabilities to multi-modal domains with vision and language, we still see few examples of language being applied in robotics.&nbsp;<\/p>\n<p>The goal of our work is to leverage information contained in existing vision-language pre-trained models to fill the gap in existing tools for human-robot interaction. Even though natural language is the richest form of communication between humans, modeling human-robot interactions using language is challenging because we often require vast amounts of data to train models or, classically, must force the user to operate within a rigid set of instructions. 
To tackle these challenges, our framework makes use of two key ideas: first, we employ large pre-trained language models to provide rich representations of user intent, and second, we align geometrical trajectory data with natural language using a multi-modal attention mechanism.&nbsp;<\/p>\n<p>We test our model on multiple robotic platforms, from manipulators to drones, and show that its functionality is agnostic of the robot form factor, dynamics, and motion controller. Our goal is to enable a factory worker to quickly reconfigure a robot arm trajectory to move further away from fragile objects, or to allow a drone pilot to command the drone to slow down when close to buildings \u2013 all without requiring immense technical expertise.&nbsp;<\/p>\n<h4><strong>Combining language and geometry into a single robotics model<\/strong>&nbsp;<\/h4>\n<p>Our overall goal is to provide a flexible interface for human-robot interaction within the context of trajectory reshaping that is agnostic to robotic platforms. We assume that the robot\u2019s behavior is expressed through a 3D trajectory over time, and that the user provides a natural language command to reshape its behavior, which relates to particular things in the scene, such as the objects in the robot workspace. Our trajectory generation system outputs a sequence of waypoints in XYZ and velocities, which are calculated by fusing scene geometry, scene images, and the user\u2019s language input. 
The diagram below shows an overview of the system:&nbsp;<\/p>\n<figure class=\"wp-block-gallery-2 wp-block-gallery has-nested-images columns-default is-cropped\">\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" loading=\"lazy\" width=\"1615\" height=\"573\" data-id=\"867654\" src=\"https:\/\/www.sickgaming.net\/blog\/wp-content\/uploads\/2022\/08\/just-say-the-magic-word-using-language-to-program-robots.jpg\" alt=\"architecture\" class=\"wp-image-867654\"><\/figure>\n<\/figure>\n<p>LaTTe is composed of several building blocks, which can be categorized into feature extractors, a geometric encoder, and a final trajectory decoder. We use a pre-trained language model encoder, BERT, to produce semantic features from the user\u2019s input. The use of a large language model creates more flexibility in the natural language input, allowing the use of synonyms and requiring less training data, given that the encoder has already been trained with a massive text corpus. In addition, we use the pre-trained encoders from the vision-language model CLIP to extract latent embeddings from both the user\u2019s text and the pictures of each object in the scene. We then compute a similarity vector between the embeddings, and use this information to identify the target objects the user is referring to through their language command.&nbsp;<\/p>\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" loading=\"lazy\" width=\"2342\" height=\"746\" src=\"https:\/\/www.sickgaming.net\/blog\/wp-content\/uploads\/2022\/08\/just-say-the-magic-word-using-language-to-program-robots-1.jpg\" alt=\"words\" class=\"wp-image-867660\"><\/figure>\n<p>As for the geometric information, we employ a Transformer encoder network to extract features related to the original robot\u2019s trajectory as well as the 3D position of each of the objects in the scene. 
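The object-grounding step described above (comparing the embedding of the user's text against the embeddings of the object pictures to find the referenced object) can be sketched as a simple cosine-similarity lookup. This is a minimal illustration, not the released code: the 4-dimensional toy vectors stand in for real CLIP embeddings, and the function names are hypothetical.

```python
# Sketch of CLIP-style object grounding: pick the scene object whose image
# embedding is most similar to the embedding of the user's command.
# The 4-dim vectors below are toy stand-ins for real CLIP embeddings.
import numpy as np

def cosine_similarities(text_emb, image_embs):
    """Cosine similarity between one text embedding and each image embedding."""
    text_emb = text_emb / np.linalg.norm(text_emb)
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    return image_embs @ text_emb

def ground_target_object(text_emb, image_embs, object_names):
    """Return the name of the best-matching object and the similarity vector."""
    sims = cosine_similarities(text_emb, image_embs)
    return object_names[int(np.argmax(sims))], sims

# Toy example: the command embedding is closest to the "bottle" image embedding.
text_emb = np.array([1.0, 0.0, 0.0, 0.0])
image_embs = np.array([
    [0.9, 0.1, 0.0, 0.0],  # bottle
    [0.0, 1.0, 0.0, 0.0],  # cup
    [0.0, 0.0, 1.0, 0.0],  # plate
])
name, sims = ground_target_object(text_emb, image_embs, ["bottle", "cup", "plate"])
```

In the full system the similarity vector is not only used to pick a single winner; it is passed downstream so the model can weigh how strongly the command relates to each object.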
In a practical scenario, we can use off-the-shelf object detectors to obtain the position and pictures of each significant object.&nbsp;<\/p>\n<p>Finally, all the geometrical, language, and visual information is fused together in a Transformer decoder block. As in a machine translation problem (for example, translating a sentence from English to German), the transformer decoder uses the information from the transformer encoder network to generate the output trajectory one waypoint at a time in a loop. The training process uses a range of procedurally generated synthetic data with multiple trajectory shapes and random object categories. We use multiple images for each object, which we obtain by web crawling through <a href=\"https:\/\/www.bing.com\/images\/feed\" target=\"_blank\" rel=\"noreferrer noopener\">Bing Images<\/a>.&nbsp;<\/p>\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/www.sickgaming.net\/blog\/wp-content\/uploads\/2022\/08\/just-say-the-magic-word-using-language-to-program-robots-2.jpg\" alt=\"chart\" class=\"wp-image-867672\" width=\"633\" height=\"319\"><\/figure>\n<h4><strong>What can we do with this model?<\/strong>&nbsp;<\/h4>\n<p>We conducted several experiments in simulated and real-life environments to test the effectiveness of LaTTe. 
We also tested different form factors (manipulators, drones, and a hexapod robot) in a multitude of scenarios to show the capability of LaTTe to adapt to various robot platforms.&nbsp;<\/p>\n<p>Examples with manipulators:&nbsp;<\/p>\n<figure class=\"wp-block-gallery-3 wp-block-gallery has-nested-images columns-default\">\n<figure class=\"wp-block-image\"><img decoding=\"async\" loading=\"lazy\" width=\"600\" height=\"338\" data-id=\"868086\" src=\"https:\/\/www.sickgaming.net\/blog\/wp-content\/uploads\/2022\/08\/just-say-the-magic-word-using-language-to-program-robots-2.gif\" alt=\"manipulation_1\" class=\"wp-image-868086\"><\/figure>\n<figure class=\"wp-block-image is-style-default\"><img decoding=\"async\" loading=\"lazy\" width=\"600\" height=\"240\" data-id=\"868089\" src=\"https:\/\/www.sickgaming.net\/blog\/wp-content\/uploads\/2022\/08\/just-say-the-magic-word-using-language-to-program-robots-3.gif\" alt=\"manipulation_2\" class=\"wp-image-868089\"><\/figure>\n<\/figure>\n<p>Examples with aerial vehicles:&nbsp;<\/p>\n<figure class=\"wp-block-gallery-4 wp-block-gallery has-nested-images columns-default is-cropped\">\n<figure class=\"wp-block-image\"><img decoding=\"async\" loading=\"lazy\" width=\"600\" height=\"338\" data-id=\"868092\" src=\"https:\/\/www.sickgaming.net\/blog\/wp-content\/uploads\/2022\/08\/just-say-the-magic-word-using-language-to-program-robots-4.gif\" alt=\"drone\" class=\"wp-image-868092\"><\/figure>\n<figure class=\"wp-block-image\"><img decoding=\"async\" loading=\"lazy\" width=\"600\" height=\"338\" data-id=\"868095\" src=\"https:\/\/www.sickgaming.net\/blog\/wp-content\/uploads\/2022\/08\/just-say-the-magic-word-using-language-to-program-robots-5.gif\" alt=\"drone\" class=\"wp-image-868095\"><\/figure>\n<\/figure>\n<p>Examples with a hexapod robot:&nbsp;<\/p>\n<figure class=\"wp-block-gallery-5 wp-block-gallery has-nested-images columns-default is-cropped\">\n<figure class=\"wp-block-image\"><img decoding=\"async\" loading=\"lazy\" 
width=\"600\" height=\"338\" data-id=\"868098\" src=\"https:\/\/www.sickgaming.net\/blog\/wp-content\/uploads\/2022\/08\/just-say-the-magic-word-using-language-to-program-robots-6.gif\" alt=\"hexa\" class=\"wp-image-868098\"><\/figure>\n<figure class=\"wp-block-image\"><img decoding=\"async\" loading=\"lazy\" width=\"600\" height=\"338\" data-id=\"868101\" src=\"https:\/\/www.sickgaming.net\/blog\/wp-content\/uploads\/2022\/08\/just-say-the-magic-word-using-language-to-program-robots-7.gif\" alt=\"hexa\" class=\"wp-image-868101\"><\/figure>\n<\/figure>\n<h4><strong>Bringing robotics to a wider audience<\/strong>&nbsp;<\/h4>\n<p>We are excited to release these technologies with the aim of bringing robotics within the reach of a wider audience. Given the burgeoning applications of robots in several domains, it is imperative to design human-robot interfaces that are intuitive and easy to use. Our goal when designing such interfaces is to afford flexibility and precision of action, while ensuring that little to no technical training is required for novice users. Our Language Trajectory Transformer (LaTTe) framework takes a big step in this direction.&nbsp;<\/p>\n<p><em>This work is being undertaken by a multidisciplinary team at <\/em><a href=\"https:\/\/www.microsoft.com\/en-us\/ai\/autonomous-systems\" target=\"_blank\" rel=\"noreferrer noopener\"><em>Microsoft Autonomous Systems Research<\/em><\/a><em> together with the Munich Institute of Robotics and Machine Intelligence (<\/em><a href=\"https:\/\/www.mirmi.tum.de\/mirmi\/home\/\" target=\"_blank\" rel=\"noreferrer noopener\"><em>MIRMI<\/em><\/a><em>) at TU Munich. 
The researchers included in this project are: <\/em><a href=\"https:\/\/scholar.google.com.br\/citations?user=8cEgwaEAAAAJ&amp;hl=en&amp;oi=ao\" target=\"_blank\" rel=\"noreferrer noopener\"><em>Arthur Bucker<\/em><\/a><em>, <\/em><a href=\"https:\/\/www.mirmi.tum.de\/mirmi\/team\/figueredo-luis\/\" target=\"_blank\" rel=\"noreferrer noopener\"><em>Luis Figueredo<\/em><\/a><em>, <\/em><a href=\"https:\/\/www.professoren.tum.de\/en\/haddadin-sami\" target=\"_blank\" rel=\"noreferrer noopener\"><em>Sami Haddadin<\/em><\/a><em>, <\/em><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/akapoor\/\" target=\"_blank\" rel=\"noreferrer noopener\"><em>Ashish Kapoor<\/em><\/a><em>, <\/em><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/shuama\/\" target=\"_blank\" rel=\"noreferrer noopener\"><em>Shuang Ma<\/em><\/a><em>, <\/em><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/savempra\/\" target=\"_blank\" rel=\"noreferrer noopener\"><em>Sai Vemprala<\/em><\/a><em> and <\/em><a href=\"http:\/\/rogeriobonatti.com\/\" target=\"_blank\" rel=\"noreferrer noopener\"><em>Rogerio Bonatti<\/em><\/a><em>.<\/em>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>LaTTe paper and video | Trajectory Transformer paper and video | Github code Language is the most intuitive way for us to express how we feel and what we want. However, despite recent advancements in artificial intelligence, it is still very hard to control a robot using natural language instructions. 
Free-form commands such as \u201cRobot, [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":127242,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[49],"tags":[117,50],"class_list":["post-127241","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-microsoft-news","tag-machine-learning","tag-recent-news"],"_links":{"self":[{"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/posts\/127241","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/comments?post=127241"}],"version-history":[{"count":0,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/posts\/127241\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/media\/127242"}],"wp:attachment":[{"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/media?parent=127241"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/categories?post=127241"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/tags?post=127241"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}