{"id":105084,"date":"2019-12-05T17:41:10","date_gmt":"2019-12-05T17:41:10","guid":{"rendered":"https:\/\/news.microsoft.com\/?p=435522"},"modified":"2019-12-05T17:41:10","modified_gmt":"2019-12-05T17:41:10","slug":"microsoft-research-open-data-project-evolving-our-standards-for-data-access-and-reproducible-research","status":"publish","type":"post","link":"https:\/\/sickgaming.net\/blog\/2019\/12\/05\/microsoft-research-open-data-project-evolving-our-standards-for-data-access-and-reproducible-research\/","title":{"rendered":"Microsoft Research Open Data Project: Evolving our standards for data access and reproducible research"},"content":{"rendered":"<p><a href=\"https:\/\/www.sickgaming.net\/blog\/wp-content\/uploads\/2019\/12\/microsoft-research-open-data-project-evolving-our-standards-for-data-access-and-reproducible-research.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-625509\" src=\"https:\/\/www.sickgaming.net\/blog\/wp-content\/uploads\/2019\/12\/microsoft-research-open-data-project-evolving-our-standards-for-data-access-and-reproducible-research.png\" alt=\"Datasets compilation for Open Data\" width=\"1400\" height=\"788\"><\/a><\/p>\n<p>Last summer we announced Microsoft Research Open Data\u2014an Azure-based repository-as-a-service for sharing datasets\u2014to encourage the reproducibility of research and make research data assets readily available in the cloud. Among other things, the project started a conversation between the community and Microsoft\u2019s legal team about dataset licensing. 
Inspired by these conversations, our legal team developed a set of brand-new <a href=\"https:\/\/news.microsoft.com\/datainnovation\/#data-use-agreements\">data use agreements<\/a> and released them for public comment on GitHub earlier this year.<\/p>\n<p>Today we\u2019re excited to announce that Microsoft Research Open Data will be adopting these data use agreements for several datasets that we offer.<\/p>\n<h3>Diving a bit deeper on the new data use agreements<\/h3>\n<p>The <a href=\"https:\/\/github.com\/microsoft\/Open-Use-of-Data-Agreement\">Open Use of Data Agreement<\/a> (O-UDA) is intended for use by an individual or organization that is able to distribute data for unrestricted uses, and for which there is no privacy or confidentiality concern. It is not appropriate for datasets that might include materials subject to privacy laws (such as the GDPR or HIPAA) or other unlicensed third-party materials. The O-UDA meets the open definition: it does not impose any restriction with respect to the use or modification of data other than ensuring that attribution and limitation of liability information is passed downstream. In the research context, this implies that users of the data need to cite the corresponding publication with which the data is associated. This aids in findability and reusability of data, an important tenet in the <a href=\"https:\/\/www.go-fair.org\/fair-principles\/\">FAIR guiding principles<\/a> for scientific data management and stewardship.<\/p>\n<p>We also recognize that in certain cases, datasets useful for AI and research analysis cannot be fully \u201copen\u201d under the O-UDA. For example, they may contain third-party copyrighted materials, such as text snippets or images, from publicly available sources. 
The law permits their use for research, so following the principle that research data should be \u201c<a href=\"https:\/\/ec.europa.eu\/research\/participants\/data\/ref\/h2020\/grants_manual\/hi\/oa_pilot\/h2020-hi-oa-pilot-guide_en.pdf\">as open as possible, as closed as necessary<\/a>,\u201d we developed the <a href=\"https:\/\/github.com\/microsoft\/Computational-Use-of-Data-Agreement\">Computational Use of Data Agreement<\/a> (C-UDA) to make data available for research while respecting other interests. We will prefer the O-UDA where possible, but we see the C-UDA as a useful tool for ensuring that researchers continue to have access to important and relevant datasets.<\/p>\n<h3>Datasets that reflect the goals of our project<\/h3>\n<p>The following examples reference datasets that have adopted the Open Use of Data Agreement (O-UDA).<\/p>\n<h4>Location data for geo-privacy research<\/h4>\n<p>Microsoft researcher <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/jckrumm\/\">John Krumm<\/a> and collaborators collected <a href=\"https:\/\/msropendata.com\/datasets\/94d31431-0842-447c-b990-245761b7c5f2\">GPS data<\/a> from 21 people who carried a GPS receiver in the Seattle area. Users who provided their data agreed to it being shared as long as certain geographic regions were deleted. This work covers key research on privacy preservation of GPS data as evidenced in the corresponding paper, \u201c<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2016\/02\/ubicomp243-brush.pdf\">Exploring End User Preferences for Location Obfuscation, Location-Based Services, and the Value of Location<\/a>,\u201d which was accepted at the Twelfth ACM International Conference on Ubiquitous Computing (UbiComp 2010). 
The paper has been cited 147 times, including for research that builds upon this work to further the field of preservation of geo-privacy for location-based services providers.<\/p>\n<h4>Hand gestures data for computer vision<\/h4>\n<p>Another example dataset is that of <a href=\"https:\/\/msropendata.com\/datasets\/d7859d92-56c9-46f2-b217-6adafaa1500f\">labeled hand images and video clips<\/a> collected by researchers <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/eyalk\/\">Eyal Krupka<\/a>, Kfir Karmon, and others. The research addresses an important computer vision and machine learning problem that deals with developing a hand-gesture-based interface language. The data was recorded using depth cameras and has labels that cover joints and fingertips. The two datasets included are FingersData, which contains 3,500 labeled depth frames of various hand poses, and GestureClips, which contains 140 gesture clips (100 of these contain labeled hand gestures and 40 contain non-gesture activity). The research associated with this dataset is available in the paper \u201c<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2017\/05\/Towards-realistic-hands-gesture-interface.pdf\">Toward Realistic Hands Gesture Interface: Keeping it Simple for Developers and Machines<\/a>,\u201d which was published in Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems.<\/p>\n<h4>Question-Answer data for machine reading comprehension<\/h4>\n<p>Finally, the FigureQA dataset generated by researchers Samira Ebrahimi Kahou, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/adatkins\/\">Adam Atkinson<\/a>, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/adtrisch\/\">Adam Trischler<\/a>, Yoshua Bengio and collaborators, introduces a visual reasoning task for research that is specific to graphical plots and figures. 
The <a href=\"https:\/\/msropendata.com\/datasets\/85596452-0fe3-4335-bc00-ae83ee8ffcfd\">dataset<\/a> has 180,000 figures with 1.3 million question-answer pairs in the training set. More details about the dataset are available in the paper \u201c<a href=\"https:\/\/arxiv.org\/abs\/1710.07300\">FigureQA: An Annotated Figure Dataset for Visual Reasoning<\/a>\u201d and the corresponding <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/figureqa-annotated-figure-dataset-visual-reasoning\/\">Microsoft Research Blog post<\/a>. The dataset is pivotal to developing more powerful visual question answering and reasoning models, which could improve the accuracy of AI systems that are involved in decision making based on charts and graphs.<\/p>\n<h3>The data agreements are a part of our larger goals<\/h3>\n<p>The Microsoft Research Open Data project was conceived from the start to reflect Microsoft Research\u2019s commitment to fostering open science and research and to achieve this without compromising the ethics of collecting and sharing data. 
Our goal is to make it easier for researchers to maintain provenance of data while having the ability to reference and build upon it.<\/p>\n<p>The addition of the <a href=\"https:\/\/www.linkedin.com\/pulse\/enabling-data-use-through-power-community-erich-andersen\/\">new data agreements<\/a> to Microsoft Research Open Data\u2019s feature set is an exciting step in furthering our mission.<\/p>\n<p><strong>Acknowledgements:<\/strong> This work would not have been possible without the substantial team effort by Dave Green, Justin Colannino, Gretchen Deo, Sarah Kim, Emily McReynolds, Mario Madden, Emily Schlesinger, Elaine Peterson, Leila Stevenson, Dave Baskin, and Sergio Loscialo.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Last summer we announced Microsoft Research Open Data\u2014an Azure-based repository-as-a-service for sharing datasets\u2014to encourage the reproducibility of research and make research data assets readily available in the cloud. Among other things, the project started a conversation between the community and Microsoft\u2019s legal team about dataset licensing. 
Inspired by these conversations, our legal team developed a [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":105085,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[49],"tags":[159,50],"class_list":["post-105084","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-microsoft-news","tag-microsoft-research","tag-recent-news"],"_links":{"self":[{"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/posts\/105084","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/comments?post=105084"}],"version-history":[{"count":0,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/posts\/105084\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/media\/105085"}],"wp:attachment":[{"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/media?parent=105084"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/categories?post=105084"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/tags?post=105084"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}