{"id":90982,"date":"2019-03-28T12:27:55","date_gmt":"2019-03-28T12:27:55","guid":{"rendered":"http:\/\/www.sickgaming.net\/blog\/2019\/03\/28\/can-better-task-stealing-make-linux-faster\/"},"modified":"2019-03-28T12:27:55","modified_gmt":"2019-03-28T12:27:55","slug":"can-better-task-stealing-make-linux-faster","status":"publish","type":"post","link":"https:\/\/sickgaming.net\/blog\/2019\/03\/28\/can-better-task-stealing-make-linux-faster\/","title":{"rendered":"Can Better Task Stealing Make Linux Faster?"},"content":{"rendered":"<p><em>Oracle Linux kernel developer Steve Sistare contributes this discussion on kernel scheduler improvements.<\/em><\/p>\n<h2 class=\"selectionShareable\">Load balancing via scalable task stealing<\/h2>\n<p class=\"selectionShareable\">The Linux task scheduler balances load across a system by pushing waking tasks to idle CPUs, and by pulling tasks from busy CPUs when a CPU becomes idle. Efficient scaling is a challenge on both the push and pull sides on large systems. For pulls, the scheduler searches all CPUs in successively larger domains until an overloaded CPU is found, and pulls a task from the busiest group. This is very expensive, costing tens to hundreds of microseconds on large systems, so search time is limited by the average idle time, and some domains are not searched. Balance is not always achieved, and idle CPUs go unused.<\/p>\n<p class=\"selectionShareable\">I have implemented an alternative mechanism that is invoked after the existing search in idle_balance() limits itself and finds nothing. I maintain a bitmap of overloaded CPUs, where a CPU sets its bit when its runnable CFS task count exceeds 1. The bitmap is sparse, with a limited number of significant bits per cacheline. This reduces cache contention when many threads concurrently set, clear, and visit elements. There is a bitmap per last-level cache. 
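<\/p>\n<p class=\"selectionShareable\">To make the sparse bitmap concrete, here is a small userspace sketch (my illustration, not the kernel patch itself; the chunk size, CPU count, and names such as overload_set are assumptions): each 64-byte cache line carries only a few significant bits, so CPUs whose bits live in different lines can set and clear them without contending on the same line.<\/p>

```c
/* Userspace sketch of a sparse overloaded-CPU bitmap, as described above.
 * Assumptions, not the kernel's code: 8 significant bits per 64-byte
 * cache line, 128 CPUs, and illustrative names (overload_set, etc.). */
#include <stdatomic.h>

#define CACHELINE     64
#define BITS_PER_LINE 8                  /* significant bits per cache line */
#define MAX_CPUS      128
#define NCHUNKS       (MAX_CPUS / BITS_PER_LINE)

struct overload_mask {
    struct {
        _Atomic unsigned long bits;      /* only low BITS_PER_LINE bits used */
        char pad[CACHELINE - sizeof(_Atomic unsigned long)];
    } chunk[NCHUNKS];
};

/* A CPU's runnable CFS task count went above 1: mark it overloaded. */
static void overload_set(struct overload_mask *m, int cpu)
{
    atomic_fetch_or(&m->chunk[cpu / BITS_PER_LINE].bits,
                    1UL << (cpu % BITS_PER_LINE));
}

/* The count dropped back to 1 or 0: clear the CPU's bit. */
static void overload_clear(struct overload_mask *m, int cpu)
{
    atomic_fetch_and(&m->chunk[cpu / BITS_PER_LINE].bits,
                     ~(1UL << (cpu % BITS_PER_LINE)));
}

/* Return the first overloaded CPU, or -1 if none: the scan an idle CPU
 * performs before stealing. */
static int overload_first(struct overload_mask *m)
{
    for (int c = 0; c < NCHUNKS; c++) {
        unsigned long b = atomic_load(&m->chunk[c].bits);
        if (b)
            return c * BITS_PER_LINE + __builtin_ctzl(b);
    }
    return -1;
}
```

<p class=\"selectionShareable\">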
When a CPU becomes idle, it searches the bitmap to find the first overloaded CPU with a migratable task, and steals it. This simple stealing yields higher CPU utilization than idle_balance() alone, because the search is cheap, costing 1 to 2 microseconds, so it may be called every time the CPU is about to go idle. Stealing does not offload the globally busiest queue, but it is much better than running nothing at all.<\/p>\n<h2 class=\"selectionShareable\">Results<\/h2>\n<p class=\"selectionShareable\">Stealing improves utilization with only a modest CPU overhead in scheduler code. In the following experiment, hackbench is run with varying numbers of groups (40 tasks per group), and the delta in \/proc\/schedstat is shown for each run, averaged per CPU, augmented with these non-standard stats:<\/p>\n<ul>\n<li>%find &#8211; percent of time spent in the old and new functions that search for idle CPUs and tasks to steal, and in setting the overloaded-CPUs bitmap.<\/li>\n<li>steal &#8211; number of times a task is stolen from another CPU.<\/li>\n<\/ul>\n<p class=\"selectionShareable\">Elapsed time improves by 8 to 36%, costing at most 0.4% more find time.<\/p>\n<p class=\"selectionShareable\">CPU busy utilization is close to 100% for the new kernel, as shown by the green curve in the following graph, versus the orange curve for the baseline kernel:<\/p>\n<p class=\"selectionShareable\"><img decoding=\"async\" alt=\"\" src=\"http:\/\/www.sickgaming.net\/blog\/wp-content\/uploads\/2019\/03\/can-better-task-stealing-make-linux-faster.png\"><\/p>\n<p class=\"selectionShareable\">Stealing improves Oracle database OLTP performance by up to 9% depending on load, and we have seen some nice improvements for mysql, pgsql, gcc, java, and networking. 
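<\/p>\n<p class=\"selectionShareable\">In outline, the steal path measured above looks like the following toy, single-threaded C model. The names and the plain dense bitmap are my illustration only; the kernel version is per-LLC, sparse, and protected by run-queue locks.<\/p>

```c
/* Toy single-threaded model of the steal path described earlier.
 * Illustrative names (nr_running, try_steal); not the kernel code. */

#define NCPUS 8

static int nr_running[NCPUS];     /* runnable CFS tasks per CPU */
static unsigned long overloaded;  /* bit i set => CPU i has more than 1 */

static void update_overload(int cpu)
{
    if (nr_running[cpu] > 1)
        overloaded |= 1UL << cpu;
    else
        overloaded &= ~(1UL << cpu);
}

/* Called when dst is about to go idle: scan the bitmap for the first
 * overloaded CPU and pull one task from it.  Returns the source CPU,
 * or -1 if nothing was stolen (dst then idles, as before). */
static int try_steal(int dst)
{
    unsigned long mask = overloaded & ~(1UL << dst);

    while (mask) {
        int src = __builtin_ctzl(mask);

        if (nr_running[src] > 1) {    /* recheck; queue may have drained */
            nr_running[src]--;  update_overload(src);
            nr_running[dst]++;  update_overload(dst);
            return src;
        }
        mask &= mask - 1;             /* clear lowest bit, try the next */
    }
    return -1;
}
```

<p class=\"selectionShareable\">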
In general, stealing is most helpful for workloads with a high context switch rate.<\/p>\n<h2 class=\"selectionShareable\">The code<\/h2>\n<p class=\"selectionShareable\">As of this writing, this work is not yet upstream, but the latest patch series is at <a href=\"https:\/\/lkml.org\/lkml\/2018\/12\/6\/1253\">https:\/\/lkml.org\/lkml\/2018\/12\/6\/1253<\/a>. If your kernel is built with CONFIG_SCHED_DEBUG=y, you can verify that it contains the stealing optimization using:<\/p>\n<pre>\n<code> # grep -q STEAL \/sys\/kernel\/debug\/sched_features &amp;&amp; echo Yes\n<\/code><\/pre>\n<p class=\"selectionShareable\">If you try it, note that stealing is disabled for systems with more than 2 NUMA nodes, because hackbench regresses on such systems, as I explain in <a href=\"https:\/\/lkml.org\/lkml\/2018\/12\/6\/1250\">https:\/\/lkml.org\/lkml\/2018\/12\/6\/1250<\/a>. However, I suspect this effect is specific to hackbench, and that stealing will help other workloads on many-node systems. To try it, reboot with the kernel parameter sched_steal_node_limit=8 (or larger).<\/p>\n<h2 class=\"selectionShareable\">Future work<\/h2>\n<p class=\"selectionShareable\">After the basic stealing algorithm is pushed upstream, I am considering the following enhancements:<\/p>\n<ul>\n<li>If stealing within the last-level cache does not find a candidate, steal across LLCs and NUMA nodes.<\/li>\n<li>Maintain a sparse bitmap to identify stealing candidates in the RT scheduling class. Currently pull_rt_task() searches all run queues.<\/li>\n<li>Remove the core and socket levels from idle_balance(), as stealing handles those levels. 
Remove idle_balance() entirely when stealing across LLC is supported.<\/li>\n<li>Maintain a bitmap to identify idle cores and idle CPUs, for push balancing.<\/li>\n<\/ul>\n<p><em>This article originally appeared at <a href=\"https:\/\/blogs.oracle.com\/linux\/can-better-task-stealing-make-linux-faster\">Oracle Developers Blog<\/a>.<\/em><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Oracle Linux kernel developer Steve Sistare contributes this discussion on kernel scheduler improvements. Load balancing via scalable task stealing The Linux task scheduler balances load across a system by pushing waking tasks to idle CPUs, and by pulling tasks from busy CPUs when a CPU becomes idle. Efficient scaling is a challenge on both the [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":90983,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[40],"tags":[],"class_list":["post-90982","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-linux-freebsd-unix"],"_links":{"self":[{"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/posts\/90982","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/comments?post=90982"}],"version-history":[{"count":0,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/posts\/90982\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/media\/90983"}],"wp:attachment":[{"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/media?parent=90982"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/sickgaming
.net\/blog\/wp-json\/wp\/v2\/categories?post=90982"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/tags?post=90982"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}