gRPC performance improvements in .NET 5


<div><div class="row justify-content-center">
<div class="col-md-4">
<div>
<p>James</p>
</div>
</div>
</div>
<div class="entry-meta entry-meta-layout">
<p>October 27th, 2020</p>
</div>
<p>gRPC is a modern open source remote procedure call framework. There are many exciting features in gRPC: real-time streaming, end-to-end code generation, and great cross-platform support to name a few. The most exciting to me, and consistently mentioned by developers who are interested in gRPC, is performance.</p>
<p>Last year Microsoft contributed a new implementation of gRPC for .NET to the <a href="https://www.cncf.io/">CNCF</a>. Built on top of Kestrel and HttpClient, gRPC for .NET makes gRPC a first-class member of the .NET ecosystem.</p>
<p>In our first <a href="https://docs.microsoft.com/aspnet/core/grpc/">gRPC for .NET</a> release, we focused on gRPC’s core features, compatibility, and stability. In .NET 5, we made gRPC really fast.</p>
<h2 id="grpc-and-net-5-are-fast">gRPC and .NET 5 are fast</h2>
<p>In a <a href="https://github.com/LesnyRumcajs/grpc_bench">community-run benchmark</a> of different gRPC server implementations, .NET achieves the highest requests per second after Rust, and is just ahead of C++ and Go.</p>
<p><a href="https://www.sickgaming.net/blog/wp-content/uploads/2020/10/grpc-performance-improvements-in-net-5-1.png"><img loading="lazy" src="https://www.sickgaming.net/blog/wp-content/uploads/2020/10/grpc-performance-improvements-in-net-5-1.png" alt="gRPC performance comparison" width="1028" height="616" class="alignnone size-full wp-image-24250"></a></p>
<p>This result builds on top of the work done in .NET 5. Our benchmarks show .NET 5 server performance is 60% faster than .NET Core 3.1. .NET 5 client performance is 230% faster than .NET Core 3.1.</p>
<p>Stephen Toub discusses <a href="https://github.com/dotnet/runtime">dotnet/runtime</a> changes in his <a href="https://devblogs.microsoft.com/dotnet/performance-improvements-in-net-5/">Performance Improvements in .NET 5</a> blog post. Check it out to read about improvements in HttpClient and HTTP/2.</p>
<p>In the rest of this blog post I’ll talk about the improvements we made to make gRPC fast in ASP.NET Core.</p>
<h2 id="http-2-allocations-in-kestrel">HTTP/2 allocations in Kestrel</h2>
<p>gRPC uses HTTP/2 as its underlying protocol. A fast HTTP/2 implementation is the most important factor when it comes to performance. Our gRPC server builds on top of Kestrel, an HTTP server written in C# that is designed with performance in mind. Kestrel is a top contender in the <a href="https://www.techempower.com/benchmarks/">TechEmpower benchmarks</a>, and gRPC benefits from a lot of the performance improvements in Kestrel automatically. However, there are many HTTP/2-specific optimizations that were made in .NET 5.</p>
<p>Reducing allocations is a good place to start. Fewer allocations per HTTP/2 request means less time doing garbage collection (GC). And CPU time “wasted” in GC is CPU time not spent serving HTTP/2 requests.</p>
<p><a href="https://www.sickgaming.net/blog/wp-content/uploads/2020/10/grpc-performance-improvements-in-net-5-2.png"><img loading="lazy" src="https://www.sickgaming.net/blog/wp-content/uploads/2020/10/grpc-performance-improvements-in-net-5-2.png" alt=".NET Core 3.1 memory graph" width="1550" height="777" class="alignnone size-full wp-image-24251"></a></p>
<p>The performance profiler above is measuring allocations over 100,000 gRPC requests. The live object graph’s sawtooth-shaped pattern indicates memory building up, then being garbage collected. About 3.9 KB is being allocated per request. Let’s try to get that number down!</p>
<p><a href="https://github.com/dotnet/aspnetcore/pull/18601">dotnet/aspnetcore#18601</a> adds pooling of streams in an HTTP/2 connection. This one change almost cuts allocations per request in half. It enables reuse of internal types like <code>Http2Stream</code>, and publicly accessible types like <code>HttpContext</code> and <code>HttpRequest</code>, across multiple requests.</p>
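<p>To make the pooling idea concrete, here is a minimal sketch of a per-connection pool. It is not Kestrel’s implementation: <code>PooledStream</code>, <code>StreamPool</code> and every member on them are hypothetical names chosen for illustration, and the sketch assumes a single connection loop rents and returns streams, so it does no locking.</p>
<pre><code class="csharp">using System.Collections.Generic;

// Hypothetical stand-in for a per-request object such as Http2Stream.
public sealed class PooledStream
{
    public int StreamId { get; private set; }
    public List&lt;string&gt; Headers { get; } = new List&lt;string&gt;();

    // Prepare the instance for a new request.
    public void Initialize(int streamId) =&gt; StreamId = streamId;

    // Clear per-request state so nothing leaks into the next request.
    public void Reset()
    {
        StreamId = 0;
        Headers.Clear();
    }
}

// Hypothetical per-connection pool. Not thread-safe by design (assumption).
public sealed class StreamPool
{
    private readonly Stack&lt;PooledStream&gt; _idle = new Stack&lt;PooledStream&gt;();
    private readonly int _maxSize;

    public StreamPool(int maxSize) =&gt; _maxSize = maxSize;

    public PooledStream Rent(int streamId)
    {
        // Reuse an idle instance when one exists; allocate only on a miss.
        var stream = _idle.Count &gt; 0 ? _idle.Pop() : new PooledStream();
        stream.Initialize(streamId);
        return stream;
    }

    public void Return(PooledStream stream)
    {
        stream.Reset();
        // Cap the pool so a quiet connection does not hold memory forever.
        if (_idle.Count &lt; _maxSize)
        {
            _idle.Push(stream);
        }
    }
}
</code></pre>
<p>Renting, returning and renting again hands back the same instance with its state cleared, which is where the allocation saving comes from. The reset step is the part that needs care: any field that is not cleared leaks from one request into the next.</p>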
<p>Once streams are pooled, a range of further optimizations becomes available.</p>
<p>There are many smaller allocation savings. <a href="https://github.com/dotnet/aspnetcore/pull/19783">dotnet/aspnetcore#19783</a> removes allocations in Kestrel’s HTTP/2 flow control. A resettable <code>ManualResetValueTaskSourceCore&lt;T&gt;</code> type replaces allocating a new object each time flow control is triggered. <a href="https://github.com/dotnet/aspnetcore/pull/19273">dotnet/aspnetcore#19273</a> replaces an array allocation with <code>stackalloc</code> when validating the HTTP request path. <a href="https://github.com/dotnet/aspnetcore/pull/19277">dotnet/aspnetcore#19277</a> and <a href="https://github.com/dotnet/aspnetcore/pull/19325">dotnet/aspnetcore#19325</a> eliminate some unintended allocations related to logging. <a href="https://github.com/dotnet/aspnetcore/pull/22557">dotnet/aspnetcore#22557</a> avoids allocating a <code>Task&lt;T&gt;</code> if a task is already complete. And finally, <a href="https://github.com/dotnet/aspnetcore/pull/19732">dotnet/aspnetcore#19732</a> saves a string allocation by special-casing a <code>content-length</code> of 0. Because every allocation matters.</p>
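<p>The resettable awaitable pattern mentioned above can be sketched as follows. This is an illustration of how <code>ManualResetValueTaskSourceCore&lt;T&gt;</code> is typically wrapped, not the code from the PR: the <code>ReusableSignal</code> class and its <code>WaitAsync</code> and <code>Signal</code> methods are hypothetical names.</p>
<pre><code class="csharp">using System;
using System.Threading.Tasks;
using System.Threading.Tasks.Sources;

// Hypothetical awaitable that is allocated once and reused for every wait.
public sealed class ReusableSignal : IValueTaskSource&lt;int&gt;
{
    // Mutable struct, so the field must not be readonly.
    private ManualResetValueTaskSourceCore&lt;int&gt; _core;

    // Hand out a ValueTask backed by this object instead of a new Task.
    public ValueTask&lt;int&gt; WaitAsync() =&gt; new ValueTask&lt;int&gt;(this, _core.Version);

    // Complete the pending (or next) wait with a value.
    public void Signal(int value) =&gt; _core.SetResult(value);

    int IValueTaskSource&lt;int&gt;.GetResult(short token)
    {
        try
        {
            return _core.GetResult(token);
        }
        finally
        {
            // Reset so the same object can back the next WaitAsync call.
            _core.Reset();
        }
    }

    ValueTaskSourceStatus IValueTaskSource&lt;int&gt;.GetStatus(short token)
        =&gt; _core.GetStatus(token);

    void IValueTaskSource&lt;int&gt;.OnCompleted(Action&lt;object&gt; continuation, object state,
        short token, ValueTaskSourceOnCompletedFlags flags)
        =&gt; _core.OnCompleted(continuation, state, token, flags);
}
</code></pre>
<p>A caller would write <code>int window = await signal.WaitAsync();</code> and the producer would call <code>signal.Signal(value)</code>. The trade-off is the usual one for <code>ValueTask&lt;T&gt;</code>: each returned task must be awaited exactly once, and only one wait can be outstanding at a time.</p>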
<p><a href="https://www.sickgaming.net/blog/wp-content/uploads/2020/10/grpc-performance-improvements-in-net-5-3.png"><img loading="lazy" src="https://www.sickgaming.net/blog/wp-content/uploads/2020/10/grpc-performance-improvements-in-net-5-3.png" alt=".NET 5 memory" width="1550" height="777" class="alignnone size-full wp-image-24248"></a></p>
<p>Per-request memory in .NET 5 is now just 330 B, a decrease of 92%. The sawtooth pattern has also disappeared. Reduced allocations mean garbage collection didn’t run at all while the server processed 100,000 gRPC calls.</p>
<h2 id="reading-http-headers-in-kestrel">Reading HTTP headers in Kestrel</h2>
<p>A hot path in HTTP/2 is reading and writing HTTP headers. An HTTP/2 connection supports concurrent requests over a TCP socket, a feature called multiplexing. Multiplexing allows HTTP/2 to make efficient use of connections, but only the headers for one request on a connection can be processed at a time. HTTP/2’s <a href="https://tools.ietf.org/html/rfc7541">HPack header compression</a> is stateful and depends on order. Processing HTTP/2 headers is a bottleneck, so it has to be as fast as possible.</p>
<p><a href="https://github.com/dotnet/aspnetcore/pull/23083">dotnet/aspnetcore#23083</a> optimizes the performance of <code>HPackDecoder</code>. The decoder is a state machine that reads incoming HTTP/2 <code>HEADERS</code> frames. The approach is sound, because a state machine allows Kestrel to decode frames as they arrive, but the decoder was checking state after parsing each byte. Another problem was that literal values, the header names and values, were copied multiple times. Optimizations in this PR include:</p>
<ul>
<li>Tighten parsing loops. For example, if we’ve just parsed a header name then the value must come afterwards. There is no need to check the state machine to figure out the next state.</li>
<li>Skip literal parsing altogether. Literals in HPack have a length prefix. If we know the next 100 bytes are a literal then there is no need to inspect each byte. Mark the literal’s location and resume parsing at its end.</li>
<li>Avoid copying literal bytes. Previously literal bytes were always copied to an intermediary array before being passed to Kestrel. Most of the time this isn’t necessary and instead we can just slice the original buffer and pass a <code>ReadOnlySpan&lt;byte&gt;</code> to Kestrel.</li>
</ul>
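<p>The second and third points can be illustrated with a small sketch. It is deliberately simplified and is not the <code>HPackDecoder</code> code: <code>LiteralReader.ReadLiteral</code> is a hypothetical helper, and it assumes the literal is not Huffman encoded and is shorter than 127 bytes, so the length fits in the 7-bit prefix of one byte. Real HPack uses a variable-length integer for the length (RFC 7541, section 5.1).</p>
<pre><code class="csharp">using System;

public static class LiteralReader
{
    // Reads one length-prefixed literal starting at offset and advances
    // offset past it. See the assumptions in the paragraph above.
    public static ReadOnlySpan&lt;byte&gt; ReadLiteral(ReadOnlySpan&lt;byte&gt; buffer, ref int offset)
    {
        // The low 7 bits of the prefix byte hold the length.
        int length = buffer[offset] &amp; 0x7F;
        offset++;

        // No per-byte inspection and no copy: slice a view over the input.
        ReadOnlySpan&lt;byte&gt; literal = buffer.Slice(offset, length);

        // Resume parsing at the first byte after the literal.
        offset += length;
        return literal;
    }
}
</code></pre>
<p>Because the result is a slice of the original buffer, the cost of reading a literal is the same whether it is 3 bytes or 3,000 bytes long.</p>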
<p>Together these changes significantly decrease the time it takes to parse headers. Header size is almost no longer a factor. The decoder marks the start and end position of a value and then slices that range.</p>
<pre><code class="csharp">private HPackDecoder _decoder = CreateDecoder();
private byte[] _smallHeader = new byte[] { /* HPack bytes */ };
private byte[] _largeHeader = new byte[] { /* HPack bytes */ };
private IHttpHeadersHandler _noOpHandler = new NoOpHeadersHandler();

[Benchmark]
public void SmallDecode() =&gt; _decoder.Decode(_smallHeader, endHeaders: true, handler: _noOpHandler);

[Benchmark]
public void LargeDecode() =&gt; _decoder.Decode(_largeHeader, endHeaders: true, handler: _noOpHandler);
</code></pre>
<table>
<thead>
<tr>
<th>Method</th>
<th>Runtime</th>
<th align="right">Mean</th>
<th align="right">Ratio</th>
<th align="right">Allocated</th>
</tr>
</thead>
<tbody>
<tr>
<td>SmallDecode</td>
<td>.NET Core 3.1</td>
<td align="right">111.20 ns</td>
<td align="right">1.00</td>
<td align="right">0 B</td>
</tr>
<tr>
<td>SmallDecode</td>
<td>.NET 5.0</td>
<td align="right">71.90 ns</td>
<td align="right">0.65</td>
<td align="right">0 B</td>
</tr>
<tr>
<td>LargeDecode</td>
<td>.NET Core 3.1</td>
<td align="right">49,083.00 ns</td>
<td align="right">1.00</td>
<td align="right">0 B</td>
</tr>
<tr>
<td>LargeDecode</td>
<td>.NET 5.0</td>
<td align="right">98.68 ns</td>
<td align="right">0.002</td>
<td align="right">0 B</td>
</tr>
</tbody>
</table>
<p>Once headers have been decoded, Kestrel needs to validate and process them. For example, special HTTP/2 headers like <code>:path</code> and <code>:method</code> need to be set onto <code>HttpRequest.Path</code> and <code>HttpRequest.Method</code>, and other headers need to be converted to strings and added to the <code>HttpRequest.Headers</code> collection.</p>
<p>Kestrel has the concept of known request headers. Known headers are a selection of commonly occurring request headers that have been optimized for fast setting and getting. <a href="https://github.com/dotnet/aspnetcore/pull/24730">dotnet/aspnetcore#24730</a> adds an even faster path for setting HPack static table headers to the known headers. The <a href="https://tools.ietf.org/html/rfc7541#appendix-A">HPack static table</a> gives 61 common header names and values a number ID that can be sent instead of the full name. A header with a static table ID can use the optimized path to bypass some validation and quickly be set in the collection based on its ID. <a href="https://github.com/dotnet/aspnetcore/pull/24945">dotnet/aspnetcore#24945</a> adds extra optimization for static table IDs with a name and value.</p>
<h2 id="adding-hpack-response-compression">Adding HPack response compression</h2>
<p>Prior to .NET 5, Kestrel supported reading HPack compressed headers in requests, but it didn’t compress response headers. The obvious advantage of response header compression is less network usage, but there are performance benefits as well. It’s faster to write a couple of bits for a compressed header than it is to encode and write the header’s full name and value as bytes.</p>
<p><a href="https://github.com/dotnet/aspnetcore/pull/19521">dotnet/aspnetcore#19521</a> adds initial HPack static compression. Static compression is pretty simple: if the header is in the <a href="https://tools.ietf.org/html/rfc7541#appendix-A">HPack static table</a> then write the ID to identify the header instead of the longer name.</p>
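<p>As a rough sketch of what static compression means on the wire, the example below writes a single byte for a header whose name and value are both in the static table. The table entries and the indexed header field encoding (high bit set, index in the low 7 bits) come from RFC 7541; the <code>StaticTableEncoder</code> class and its method are hypothetical, only a handful of the 61 entries are included, and the sketch assumes the destination has room for at least one byte.</p>
<pre><code class="csharp">using System;
using System.Collections.Generic;

public static class StaticTableEncoder
{
    // A few entries from the HPack static table (RFC 7541, Appendix A).
    private static readonly Dictionary&lt;(string Name, string Value), int&gt; s_staticTable =
        new Dictionary&lt;(string Name, string Value), int&gt;
        {
            [(":method", "GET")] = 2,
            [(":method", "POST")] = 3,
            [(":status", "200")] = 8,
            [(":status", "404")] = 13,
        };

    // Returns true and writes one byte when the name and value pair is in
    // the table. On false the caller has to write the header as a literal,
    // which this sketch does not show.
    public static bool TryWriteIndexed(string name, string value,
        Span&lt;byte&gt; destination, out int bytesWritten)
    {
        if (!s_staticTable.TryGetValue((name, value), out int index))
        {
            bytesWritten = 0;
            return false;
        }

        // Indexed header field: high bit set, index in the low 7 bits.
        destination[0] = (byte)(0x80 | index);
        bytesWritten = 1;
        return true;
    }
}
</code></pre>
<p>With this sketch, <code>:status: 200</code> is written as the single byte <code>0x88</code> instead of the ten characters of its name and value.</p>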
<p>Dynamic HPack header compression is more complicated, but also provides bigger gains. Response header names and values are tracked in a dynamic table and are each assigned an ID. As a response’s headers are written, the server checks to see if the header name and value are in the table. If there is a match then the ID is written. If there isn’t then the full header is written, and it is added to the table for the next response. There is a maximum size of the dynamic table, so adding a header to it may evict other headers with a first in, first out order.</p>
<p><a href="https://github.com/dotnet/aspnetcore/pull/20058">dotnet/aspnetcore#20058</a> adds dynamic HPack header compression. To quickly search for headers the dynamic table groups header entries using a basic hash table. To track order and evict the oldest headers, entries maintain a linked list. To avoid allocations, removed entries are pooled and reused.</p>
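<p>The lookup and eviction behaviour can be sketched in a few lines. This is a simplification, not Kestrel’s data structure: it swaps the hand-rolled hash table and linked list for a <code>HashSet</code> and a <code>Queue</code>, it leaves out ID assignment and entry pooling, and it assumes ASCII headers so that string length equals byte length. <code>DynamicTableSketch</code> and <code>ContainsOrAdd</code> are hypothetical names. The 32-byte per-entry overhead is the one defined in RFC 7541, section 4.1.</p>
<pre><code class="csharp">using System.Collections.Generic;

public sealed class DynamicTableSketch
{
    private const int EntryOverhead = 32; // RFC 7541, section 4.1

    private readonly HashSet&lt;(string Name, string Value)&gt; _lookup =
        new HashSet&lt;(string Name, string Value)&gt;();
    private readonly Queue&lt;(string Name, string Value)&gt; _order =
        new Queue&lt;(string Name, string Value)&gt;();
    private readonly int _maxSize;
    private int _size;

    public DynamicTableSketch(int maxSize) =&gt; _maxSize = maxSize;

    public int Count =&gt; _order.Count;

    // Returns true when the header is already in the table, meaning an
    // encoder could write its ID. On a miss the header is added so the
    // next response can reference it.
    public bool ContainsOrAdd(string name, string value)
    {
        var entry = (name, value);
        if (_lookup.Contains(entry))
        {
            return true;
        }

        int entrySize = name.Length + value.Length + EntryOverhead;

        // Evict the oldest entries first until the new entry fits.
        while (_order.Count &gt; 0 &amp;&amp; _size + entrySize &gt; _maxSize)
        {
            var oldest = _order.Dequeue();
            _lookup.Remove(oldest);
            _size -= oldest.Name.Length + oldest.Value.Length + EntryOverhead;
        }

        // An entry larger than the whole table is not stored.
        if (entrySize &lt;= _maxSize)
        {
            _order.Enqueue(entry);
            _lookup.Add(entry);
            _size += entrySize;
        }

        return false;
    }
}
</code></pre>
<p>gRPC responses repeat the same few headers on every call, such as <code>content-type: application/grpc</code>, so after the first response most headers are table hits.</p>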
<p><a href="https://www.sickgaming.net/blog/wp-content/uploads/2020/10/grpc-performance-improvements-in-net-5-4.png"><img loading="lazy" src="https://www.sickgaming.net/blog/wp-content/uploads/2020/10/grpc-performance-improvements-in-net-5-4.png" alt="Wireshark HTTP/2 response" width="1370" height="290" class="alignnone size-full wp-image-24252"></a></p>
<p>Using Wireshark, we can see the impact of header compression on response size for this example gRPC call. .NET Core 3.x writes 77 B, while .NET 5 is only 12 B.</p>
<h2 id="protobuf-message-serialization">Protobuf message serialization</h2>
<p>gRPC for .NET uses the <a href="https://www.nuget.org/packages/google.protobuf">Google.Protobuf</a> package as the default serializer for messages. Protobuf is an efficient binary serialization format. Google.Protobuf is designed for performance, using code generation instead of reflection to serialize .NET objects. There are some modern .NET APIs and features that can be added to it to reduce allocations and improve efficiency.</p>
<p>The biggest improvement to Google.Protobuf is support for modern .NET IO types: <code>Span&lt;T&gt;</code>, <code>ReadOnlySequence&lt;T&gt;</code> and <code>IBufferWriter&lt;T&gt;</code>. These types allow gRPC messages to be serialized directly using buffers exposed by Kestrel. This saves Google.Protobuf allocating an intermediary array when serializing and deserializing Protobuf content.</p>
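<p>To show what writing to a caller-supplied buffer looks like, here is a small sketch that encodes a Protobuf varint into an <code>IBufferWriter&lt;byte&gt;</code>. It is not Google.Protobuf’s implementation; <code>VarintWriter.WriteVarint</code> is a hypothetical helper, and the base-128 varint layout is the standard Protobuf wire format.</p>
<pre><code class="csharp">using System;
using System.Buffers;

public static class VarintWriter
{
    // Writes a Protobuf base-128 varint straight into the writer's buffer.
    public static void WriteVarint(IBufferWriter&lt;byte&gt; writer, ulong value)
    {
        // A 64-bit varint needs at most 10 bytes.
        Span&lt;byte&gt; span = writer.GetSpan(10);
        int count = 0;

        while (value &gt;= 0x80)
        {
            // Low 7 bits of the value plus the continuation bit.
            span[count++] = (byte)((value &amp; 0x7F) | 0x80);
            value &gt;&gt;= 7;
        }

        span[count++] = (byte)value;

        // Tell the writer how many bytes were actually produced.
        writer.Advance(count);
    }
}
</code></pre>
<p>Passing an <code>ArrayBufferWriter&lt;byte&gt;</code> and the value 300 produces the two bytes <code>0xAC 0x02</code>. The method never allocates a temporary array; whoever owns the writer decides where the bytes live.</p>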
<p>Support for Protobuf buffer serialization was a multi-year effort between Microsoft and Google engineers. Changes were spread across multiple repositories.</p>
<p><a href="https://github.com/protocolbuffers/protobuf/pull/7351">protocolbuffers/protobuf#7351</a> and <a href="https://github.com/protocolbuffers/protobuf/pull/7576">protocolbuffers/protobuf#7576</a> add support for buffer serialization to Google.Protobuf. This is by far the biggest and most complicated change. Three attempts were made to add this feature before the right balance between performance, backwards compatibility and code reuse was found. Protobuf reading and writing uses many performance-oriented features and APIs added to C# and .NET Core:</p>
<ul>
<li><code>Span&lt;T&gt;</code> and C# <code>ref struct</code> types enable fast and safe access to memory. <code>Span&lt;T&gt;</code> represents a contiguous region of arbitrary memory. Using span lets us serialize to managed .NET arrays, stack allocated arrays, or unmanaged memory, without using pointers. <code>Span&lt;T&gt;</code> and .NET protect us against buffer overflow.</li>
<li><code>stackalloc</code> is used to create stack-based arrays. <code>stackalloc</code> is a useful tool to avoid allocations when a small buffer is required.</li>
<li>Low-level methods such as <code>MemoryMarshal.GetReference()</code>, <code>Unsafe.ReadUnaligned()</code> and <code>Unsafe.WriteUnaligned()</code> convert directly between primitive types and bytes.</li>
<li><code>BinaryPrimitives</code> has helper methods for efficiently converting between .NET primitive types and bytes. For example, <code>BinaryPrimitives.ReadUInt64LittleEndian</code> reads little-endian bytes and returns an unsigned 64-bit number. Methods provided by <code>BinaryPrimitives</code> are heavily optimized and use vectorization.</li>
</ul>
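<p>A short sketch shows how a few of these pieces fit together. It is an illustration only: <code>FixedWidthExample</code> and its methods are hypothetical, while <code>stackalloc</code>, <code>Span&lt;byte&gt;</code> and the <code>BinaryPrimitives</code> calls are the real .NET APIs named in the list above.</p>
<pre><code class="csharp">using System;
using System.Buffers.Binary;

public static class FixedWidthExample
{
    // Writes a 64-bit value to a stack buffer and reads it back.
    public static ulong RoundTrip(ulong value)
    {
        // Eight bytes on the stack: no heap allocation, and no unsafe block
        // is needed because the result is assigned to a Span&lt;byte&gt;.
        Span&lt;byte&gt; buffer = stackalloc byte[sizeof(ulong)];

        // Protobuf fixed64 fields are little-endian on the wire.
        BinaryPrimitives.WriteUInt64LittleEndian(buffer, value);
        return BinaryPrimitives.ReadUInt64LittleEndian(buffer);
    }

    // Returns the first byte on the wire, which for little-endian is the
    // least significant byte of the value.
    public static byte FirstWireByte(ulong value)
    {
        Span&lt;byte&gt; buffer = stackalloc byte[sizeof(ulong)];
        BinaryPrimitives.WriteUInt64LittleEndian(buffer, value);
        return buffer[0];
    }
}
</code></pre>
<p>Indexing past the end of <code>buffer</code> would throw rather than corrupt memory, which is the safety guarantee the first bullet describes.</p>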
<p>A great thing about modern C# and .NET is it is possible to write fast, efficient, low-level libraries without sacrificing memory safety. When it comes to performance, .NET lets you have your cake and eat it too!</p>
<pre><code class="csharp">private TestMessage _testMessage = CreateMessage();
private ReadOnlySequence&lt;byte&gt; _testData = CreateData();
private IBufferWriter&lt;byte&gt; _bufferWriter = CreateWriter();
// Serialize to a newly allocated byte array.
[Benchmark]
public byte[] ToByteArray() =&gt; _testMessage.ToByteArray();
// Serialize directly into an existing buffer writer.
[Benchmark]
public void ToBufferWriter() =&gt; _testMessage.WriteTo(_bufferWriter);
// Parse from a byte array.
[Benchmark]
public IMessage FromByteArray() =&gt; TestMessage.Parser.ParseFrom(CreateBytes());
// Parse directly from a buffer sequence.
[Benchmark]
public IMessage FromSequence() =&gt; TestMessage.Parser.ParseFrom(_testData);
</code></pre>
<table>
<thead>
<tr>
<th>Method</th>
<th>Runtime</th>
<th align="right">Mean</th>
<th align="right">Ratio</th>
<th align="right">Allocated</th>
</tr>
</thead>
<tbody>
<tr>
<td>ToByteArray</td>
<td>.NET 5.0</td>
<td align="right">1,133.82 ns</td>
<td align="right">1.00</td>
<td align="right">184 B</td>
</tr>
<tr>
<td>ToBufferWriter</td>
<td>.NET 5.0</td>
<td align="right">589.05 ns</td>
<td align="right">0.51</td>
<td align="right">64 B</td>
</tr>
<tr>
<td>FromByteArray</td>
<td>.NET 5.0</td>
<td align="right">409.88 ns</td>
<td align="right">1.00</td>
<td align="right">1960 B</td>
</tr>
<tr>
<td>FromSequence</td>
<td>.NET 5.0</td>
<td align="right">381.03 ns</td>
<td align="right">0.92</td>
<td align="right">1776 B</td>
</tr>
</tbody>
</table>
<p>Adding support for buffer serialization to Google.Protobuf is just the first step. More work is required for gRPC for .NET to take advantage of the new capability:</p>
<ul>
<li><a href="https://github.com/grpc/grpc/pull/18865">grpc/grpc#18865</a> and <a href="https://github.com/grpc/grpc/pull/19792">grpc/grpc#19792</a> add <code>ReadOnlySequence&lt;byte&gt;</code> and <code>IBufferWriter&lt;byte&gt;</code> APIs to the gRPC serialization abstraction layer in Grpc.Core.Api.</li>
<li><a href="https://github.com/grpc/grpc/pull/23485">grpc/grpc#23485</a> updates gRPC code generation to glue the changes in Google.Protobuf to Grpc.Core.Api.</li>
<li><a href="https://github.com/grpc/grpc-dotnet/pull/376">grpc/grpc-dotnet#376</a> and <a href="https://github.com/grpc/grpc-dotnet/pull/629">grpc/grpc-dotnet#629</a> update gRPC for .NET to use the new serialization abstractions in Grpc.Core.Api. This code is the integration between Kestrel and gRPC. Because Kestrel’s IO is built on top of <a href="https://devblogs.microsoft.com/dotnet/system-io-pipelines-high-performance-io-in-net/">System.IO.Pipelines</a>, we can use its buffers during serialization.</li>
</ul>
<p>The end result is gRPC for .NET serializes Protobuf messages directly to Kestrel’s request and response buffers. Intermediary array allocations and byte copies have been eliminated from gRPC message serialization.</p>
<h2 id="wrapping-up">Wrapping Up</h2>
<p>Performance is a feature of .NET and gRPC, and as cloud apps scale it is more important than ever. I think all developers can agree it is fun to make fast apps, but performance has real-world impact. Lower latency and higher throughput mean fewer servers. It is an opportunity to save money, reduce power use and build greener apps.</p>
<p><a href="https://www.sickgaming.net/blog/wp-content/uploads/2020/10/grpc-performance-improvements-in-net-5-5.png"><img loading="lazy" src="https://www.sickgaming.net/blog/wp-content/uploads/2020/10/grpc-performance-improvements-in-net-5-5.png" alt=".NET Core 3.1 vs .NET 5 results" width="1372" height="451" class="alignnone size-full wp-image-24253"></a></p>
<p>As is obvious from this tour, a lot of changes have gone into gRPC, Protobuf and .NET aimed at improving performance. Our benchmarks show a 60% improvement in gRPC server RPS and a 230% improvement in gRPC client RPS.</p>
<p><a href="https://dotnet.microsoft.com/download/dotnet/5.0">.NET 5 RC2</a> is available now, and the official .NET 5 release is in November. To try out the performance improvements and to get started using gRPC with .NET, the best place to start is the <a href="https://docs.microsoft.com/aspnet/core/tutorials/grpc/grpc-start">Create a gRPC client and server in ASP.NET Core</a> tutorial.</p>
<p>We look forward to hearing about apps built with gRPC and .NET, and to your future contributions in the <a href="https://github.com/dotnet">dotnet</a> and <a href="https://github.com/grpc">grpc</a> repos!</p>
</div>

