{"id":16493,"date":"2018-04-03T15:07:00","date_gmt":"2018-04-03T15:07:00","guid":{"rendered":"http:\/\/www.gamasutra.com\/view\/news\/316041"},"modified":"2018-04-03T15:07:00","modified_gmt":"2018-04-03T15:07:00","slug":"blog-building-a-fully-managed-game-analytics-pipeline","status":"publish","type":"post","link":"https:\/\/sickgaming.net\/blog\/2018\/04\/03\/blog-building-a-fully-managed-game-analytics-pipeline\/","title":{"rendered":"Blog: Building a fully-managed game analytics pipeline"},"content":{"rendered":"<p><strong><em><small>The following blog post, unless otherwise noted, was written by a member of Gamasutra\u2019s community.<br \/>The thoughts and opinions expressed are those of the writer and not Gamasutra or its parent company.<\/small><\/em><\/strong><\/p>\n<hr \/>\n<p id=\"24f9\">Gathering usage\u00a0data such as player progress in games is invaluable for teams. Typically, entire teams have been dedicated to building and maintaining data pipelines for collecting and storing tracking data for games. However, with many new serverless tools available, the barriers to building an analytics pipeline for collecting game data have been significantly reduced. Managed tools such as Google\u2019s PubSub, DataFlow, and BigQuery have made it possible for a small team to set up analytics pipelines that can scale to a huge volume of events, while requiring minimal operational overhead. This post describes how to build a lightweight game analytics pipeline on the Google Cloud platform (GCP) that is fully-managed (serverless) and auto scales to meet demand.\u00a0<\/p>\n<p id=\"f856\">I was inspired by Google\u2019s\u00a0<a href=\"https:\/\/medium.com\/r\/?url=https%3A%2F%2Fcloud.google.com%2Fsolutions%2Fmobile%2Fmobile-gaming-analysis-telemetry\" rel=\"nofollow\" target=\"_blank\">reference architecture<\/a>\u00a0for mobile game analytics. 
The goal of this post is to show that it\u2019s possible for a small team to build and maintain a\u00a0<a href=\"https:\/\/medium.com\/r\/?url=https%3A%2F%2Fgithub.com%2Fbgweber%2FGameAnalytics\" rel=\"nofollow\" target=\"_blank\">data pipeline<\/a>\u00a0that scales to large event volumes, provides a data lake for data science tasks, provides a query environment for analytics teams, and has extensibility for additional components such as an experiment framework for applications.\u00a0<\/p>\n<p id=\"1fae\">The core piece of technology I\u2019m using to implement this data pipeline is Google\u2019s DataFlow, which is now integrated with the\u00a0<a href=\"https:\/\/medium.com\/r\/?url=https%3A%2F%2Fcloud.google.com%2Fblog%2Fbig-data%2F2016%2F08%2Fcloud-dataflow-apache-beam-and-you\" rel=\"nofollow\" target=\"_blank\">Apache Beam<\/a>\u00a0library. DataFlow tasks define a graph of operations to perform on a collection of events, which can be streaming data sources. This post presents a DataFlow task implemented in Java that streams tracking events from a PubSub topic to a data lake and to BigQuery. An introduction to DataFlow and its concepts is available in\u00a0<a href=\"https:\/\/medium.com\/r\/?url=https%3A%2F%2Fcloud.google.com%2Fdataflow%2Fdocs%2Fconcepts\" rel=\"nofollow\" target=\"_blank\">Google\u2019s documentation<\/a>. While DataFlow tasks are portable, since they are now based on Apache Beam, this post focuses on how to use DataFlow in conjunction with additional managed services on GCP to build a simple, serverless, and scalable data pipeline for storing game events.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/www.sickgamedev.win\/wp-content\/uploads\/2018\/04\/blog-building-a-fully-managed-game-analytics-pipeline.png\" \/><\/p>\n<p><em>My lightweight implementation of the GCP Reference Architecture for Analytics.<\/em><\/p>\n<p id=\"6c1a\">The data pipeline that performs all of this functionality is relatively simple.
The pipeline reads messages from PubSub and then transforms the events for persistence: the BigQuery portion of the pipeline converts messages to TableRow objects and streams directly to BigQuery, while the AVRO portion of the pipeline batches events into discrete windows and then saves the events to Google Storage. The graph of operations is shown in the figure below.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/www.sickgamedev.win\/wp-content\/uploads\/2018\/04\/blog-building-a-fully-managed-game-analytics-pipeline-1.png\" \/><\/p>\n<p><em>The streaming pipeline deployed to Google\u00a0Cloud<\/em><\/p>\n<h2 id=\"caf4\"><strong>Setting up the Environment<\/strong><\/h2>\n<p id=\"047c\">The first step in building a game data pipeline is setting up the dependencies necessary to compile and deploy the project. I used the following Maven dependencies to set up environments for the tracking API that sends events to the pipeline, and the data pipeline that processes events.\u00a0<br \/>\u00a0<\/p>\n<div id=\"3306\"><p>&lt;!-- <em>Dependencies for the Tracking API<\/em> --&gt;<br \/>&lt;dependency&gt;<br \/>\u00a0 &lt;groupId&gt;com.google.cloud&lt;\/groupId&gt;<br \/>\u00a0 &lt;artifactId&gt;google-cloud-pubsub&lt;\/artifactId&gt;<br \/>\u00a0 &lt;version&gt;0.32.0-beta&lt;\/version&gt;<br \/>&lt;\/dependency&gt;<\/p>\n<p>&lt;!-- <em>Dependencies for the data pipeline<\/em> --&gt;<br \/>&lt;dependency&gt;<br \/>\u00a0 &lt;groupId&gt;com.google.cloud.dataflow&lt;\/groupId&gt;<br \/>\u00a0 &lt;artifactId&gt;google-cloud-dataflow-java-sdk-all&lt;\/artifactId&gt;<br \/>\u00a0 &lt;version&gt;2.2.0&lt;\/version&gt;<br \/>&lt;\/dependency&gt;<\/p>\n<\/div>\n<p id=\"1fc6\">I used Eclipse to author and compile the code for this tutorial, since it is open source. However, other IDEs such as\u00a0<a href=\"https:\/\/medium.com\/r\/?url=https%3A%2F%2Fwww.jetbrains.com%2Fidea%2F\" rel=\"nofollow\" target=\"_blank\">IntelliJ<\/a>\u00a0provide additional features for deploying and monitoring DataFlow tasks.
Before you can deploy jobs to Google Cloud, you\u2019ll need to set up a service account for both PubSub and DataFlow. Setting up these credentials is outside the scope of this post, and more details are available in the\u00a0<a href=\"https:\/\/medium.com\/r\/?url=https%3A%2F%2Fcloud.google.com%2Fbigquery%2Fdocs%2Fauthentication%2Fservice-account-file\" rel=\"nofollow\" target=\"_blank\">Google documentation<\/a>.<\/p>\n<p id=\"a41d\">An additional prerequisite for running this data pipeline is setting up a PubSub topic on GCP. I defined a\u00a0<em>raw-events<\/em>\u00a0topic that is used for publishing and consuming messages for the data pipeline. Additional details on creating a PubSub topic are available\u00a0<a href=\"https:\/\/medium.com\/r\/?url=https%3A%2F%2Fcloud.google.com%2Fpubsub%2Fdocs%2Fquickstart-console\" rel=\"nofollow\" target=\"_blank\">here<\/a>.<\/p>\n<p id=\"e7ba\">To deploy this data pipeline, you\u2019ll need to set up a Java environment with the Maven dependencies listed above, set up a GCP\u00a0project with billing enabled, enable the Storage and BigQuery services, and create a PubSub topic for sending and receiving messages. All of these managed services do cost money, but there is a free tier that can be used for prototyping a data pipeline.\u00a0<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/www.sickgamedev.win\/wp-content\/uploads\/2018\/04\/blog-building-a-fully-managed-game-analytics-pipeline-2.png\" \/><\/p>\n<p><em>Sending events from a\u00a0game server to a PubSub\u00a0topic<\/em><\/p>\n<h2 id=\"d5af\">Publishing Events<\/h2>\n<p id=\"7691\">In order to build a usable data pipeline, it\u2019s useful to build APIs that encapsulate the details of sending game events.
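To make the idea of such an encapsulation concrete, the sketch below builds the flat JSON payload that a tracking call would publish to PubSub. The class and method names here are hypothetical stand-ins for illustration, not the post's actual TrackingAPI implementation:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Illustrative sketch of the payload a tracking wrapper might publish.
 * EventPayload and toJson are hypothetical names; the real TrackingAPI
 * in the linked repository handles the PubSub connection as well.
 */
public class EventPayload {

    /** Serialize an event as a flat JSON object (string values only). */
    public static String toJson(String eventType, String eventVersion,
                                Map<String, String> attributes) {
        StringBuilder sb = new StringBuilder("{");
        sb.append("\"eventType\":\"").append(eventType).append("\"");
        sb.append(",\"eventVersion\":\"").append(eventVersion).append("\"");
        for (Map.Entry<String, String> e : attributes.entrySet()) {
            sb.append(",\"").append(e.getKey()).append("\":\"")
              .append(e.getValue()).append("\"");
        }
        return sb.append("}").toString();
    }

    public static void main(String[] args) {
        Map<String, String> attrs = new LinkedHashMap<>();
        attrs.put("userID", "42");
        attrs.put("deviceType", "Android");
        System.out.println(toJson("Session", "V1", attrs));
        // {"eventType":"Session","eventVersion":"V1","userID":"42","deviceType":"Android"}
    }
}
```

Keeping the payload a flat set of string attributes is what lets the downstream pipeline store raw messages without a per-event schema.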
The\u00a0<a href=\"https:\/\/medium.com\/r\/?url=https%3A%2F%2Fgithub.com%2Fbgweber%2FGameAnalytics%2Fblob%2Fmaster%2Fevents%2Ftracking%2FTrackingAPI.java\" rel=\"nofollow\" target=\"_blank\">Tracking API<\/a>\u00a0class provides this functionality, and can be used to send generated event data to the data pipeline. The code below shows the method signature for sending game events, and shows how to generate sample data.\u00a0<br \/>\u00a0<\/p>\n<div id=\"64b7\"><p>\/** Event Signature for the Tracking API *\/<br \/>\/\/ sendEvent(String eventType, String eventVersion, HashMap&lt;String, String&gt; attributes);<\/p>\n<p>\/\/ send a batch of events<br \/>for (int i=0; i&lt;10000; i++) {<\/p>\n<p>\u00a0 \/\/ generate event names<br \/>\u00a0 String eventType = Math.random() &lt; 0.5 ?<br \/>\u00a0\u00a0\u00a0\u00a0\u00a0 &quot;Session&quot; : (Math.random() &lt; 0.5 ? &quot;Login&quot; : &quot;MatchStart&quot;);<\/p>\n<p>\u00a0 \/\/ create attributes to send<br \/>\u00a0 HashMap&lt;String, String&gt; attributes = new HashMap&lt;String, String&gt;();<br \/>\u00a0 attributes.put(&quot;userID&quot;, &quot;&quot; + (int)(Math.random()*10000));<br \/>\u00a0 attributes.put(&quot;deviceType&quot;, Math.random() &lt; 0.5 ?<br \/>\u00a0\u00a0\u00a0\u00a0\u00a0 &quot;Android&quot; : (Math.random() &lt; 0.5 ? &quot;iOS&quot; : &quot;Web&quot;));<\/p>\n<p>\u00a0 \/\/ send the event<br \/>\u00a0 tracking.sendEvent(eventType, &quot;V1&quot;, attributes);<br \/>}<\/p>\n<\/div>\n<p id=\"9343\">The tracking API establishes a connection to a PubSub topic, passes game events in a JSON format, and implements a callback for notification of delivery failures.
The code used to send events is provided below, and is based on Google\u2019s PubSub example provided\u00a0<a href=\"https:\/\/medium.com\/r\/?url=https%3A%2F%2Fcloud.google.com%2Fpubsub%2Fdocs%2Fquickstart-client-libraries\" rel=\"nofollow\" target=\"_blank\">here<\/a>.<br \/>\u00a0<\/p>\n<div id=\"4180\"><p>\/\/ Setup a PubSub connection<br \/>TopicName topicName = TopicName.of(projectID, topicID);<br \/>Publisher publisher = Publisher.newBuilder(topicName).build();<\/p>\n<p>\/\/ Specify an event to send<br \/>String event = &quot;{\\&quot;eventType\\&quot;:\\&quot;session\\&quot;,\\&quot;eventVersion\\&quot;:\\&quot;1\\&quot;}&quot;;<\/p>\n<p>\/\/ Convert the event to bytes<br \/>ByteString data = ByteString.copyFromUtf8(event);<\/p>\n<p>\/\/ schedule a message to be published<br \/>PubsubMessage pubsubMessage =<br \/>\u00a0 PubsubMessage.newBuilder().setData(data).build();<\/p>\n<p>\/\/ publish the message, and add this class as a callback listener<br \/>ApiFuture&lt;String&gt; future = publisher.publish(pubsubMessage);<br \/>ApiFutures.addCallback(future, this);<\/p>\n<\/div>\n<p id=\"466c\">The code above enables games\u00a0to send events to a PubSub topic. The next step is to process these events in a fully-managed environment that can scale as necessary to meet demand.<\/p>\n<h2 id=\"c19b\">Storing Events<\/h2>\n<p id=\"6782\">One of the key functions of a game data pipeline is to make instrumented events available to data science and analytics teams for analysis. The data sources used as endpoints should have low latency and be able to scale up to a massive volume of events.
The data pipeline defined in this tutorial shows how to output events to both BigQuery and a data lake that can be used to support a large number of analytics business users.\u00a0<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/www.sickgamedev.win\/wp-content\/uploads\/2018\/04\/blog-building-a-fully-managed-game-analytics-pipeline-3.png\" \/><\/p>\n<p><em>Streaming event data from PubSub to\u00a0DataFlow<\/em><\/p>\n<p id=\"5a3c\">The first step in this data pipeline is reading events from a PubSub topic and passing ingested messages to the DataFlow process. DataFlow provides a PubSub connector that enables streaming of PubSub messages to other DataFlow components. The code below shows how to instantiate the data pipeline, specify streaming mode, and to consume messages from a specific PubSub topic. The output of this process is a collection of PubSub messages that can be stored for later analysis.<br \/>\u00a0<\/p>\n<div id=\"3d15\">\/\/ set up pipeline options\u00a0\u00a0\u00a0<br \/>Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);\u00a0\u00a0\u00a0<br \/>options.setStreaming(true);\u00a0\u00a0\u00a0<br \/>Pipeline pipeline = Pipeline.create(options);<\/p>\n<p>\u200b\/\/ read game events from PubSub\u00a0\u00a0\u00a0<br \/>PCollection&lt;PubsubMessage&gt; events = pipeline<br \/>\u00a0 .apply(PubsubIO.readMessages().fromTopic(topic));<\/p>\n<\/div>\n<p id=\"a079\">The first way we want to store game events is in a columnar format that can be used to build a data lake. While this post doesn\u2019t show how to utilize these files in downstream ETLs, having a data lake is a great way to maintain a copy of your data set in case you need to make changes to your database. The data lake provides a way to backload your data if necessary due to changes in schemas or data ingestion issues. 
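The AVRO branch of the pipeline, described next, groups events into fixed five-minute windows before writing files. Conceptually, fixed windowing just maps each event timestamp to a window start, and all events sharing a window start end up in the same batch. The plain-Java sketch below illustrates that idea under this simplified model; it is not the Beam API itself:

```java
/**
 * Conceptual sketch of fixed windowing: every event timestamp maps to
 * the start of a five-minute window. Beam's
 * FixedWindows.of(Duration.standardMinutes(5)) performs this grouping
 * (plus trigger and lateness handling) inside the pipeline.
 */
public class FixedWindowDemo {

    static final long WINDOW_MS = 5 * 60 * 1000; // 5 minutes in millis

    /** Return the start (epoch millis) of the window containing the timestamp. */
    public static long windowStart(long timestampMs) {
        return (timestampMs / WINDOW_MS) * WINDOW_MS;
    }

    public static void main(String[] args) {
        long t1 = 1_522_764_000_123L;       // two events ~1s apart
        long t2 = t1 + 1_000L;
        // both land in the same 5 minute window
        System.out.println(windowStart(t1) == windowStart(t2)); // true
        // an event 5 minutes later lands in the next window
        System.out.println(windowStart(t1 + WINDOW_MS) - windowStart(t1)); // 300000
    }
}
```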
The portion of the data pipeline allocated to this process is shown below.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/www.sickgamedev.win\/wp-content\/uploads\/2018\/04\/blog-building-a-fully-managed-game-analytics-pipeline-4.png\" \/><\/p>\n<p><em>Batching events to AVRO format and saving to Google\u00a0Storage<\/em><\/p>\n<p id=\"48b6\">For AVRO, we can\u2019t use a direct streaming approach. We need to group events into batches before we can save to flat files. The way this can be accomplished in DataFlow is by applying a windowing function that groups events into fixed batches. The code below applies transformations that convert the PubSub messages into String objects, group the messages into 5 minute intervals, and output the resulting batches to AVRO files on Google Storage.<br \/>\u00a0<\/p>\n<div id=\"f443\"><p>\/\/ AVRO output portion of the pipeline<br \/>events.apply(&quot;To String&quot;, ParDo.of(new DoFn&lt;PubsubMessage, String&gt;() {<br \/>\u00a0 @ProcessElement<br \/>\u00a0 public void processElement(ProcessContext c) throws Exception {<br \/>\u00a0\u00a0\u00a0 String message = new String(c.element().getPayload());<br \/>\u00a0\u00a0\u00a0 c.output(message);<br \/>\u00a0 }<br \/>}))<\/p>\n<p>\/\/ Batch events into 5 minute windows<br \/>.apply(&quot;Batch Events&quot;, Window.&lt;String&gt;into(<br \/>\u00a0\u00a0\u00a0 FixedWindows.of(Duration.standardMinutes(5)))<br \/>\u00a0 .triggering(AfterWatermark.pastEndOfWindow())<br \/>\u00a0 .discardingFiredPanes()<br \/>\u00a0 .withAllowedLateness(Duration.standardMinutes(5)))<\/p>\n<p>\/\/ Save the events in AVRO format<br \/>.apply(&quot;To AVRO&quot;,
AvroIO.write(String.class)<br \/>\u00a0 .to(&quot;gs:\/\/your_gs_bucket\/avro\/raw-events.avro&quot;)<br \/>\u00a0 .withWindowedWrites()<br \/>\u00a0 .withNumShards(8)<br \/>\u00a0 .withSuffix(&quot;.avro&quot;));<\/p>\n<\/div>\n<p id=\"d51c\">To summarize, the above code batches game events into 5 minute windows and then exports the events to AVRO files on Google Storage.\u00a0<\/p>\n<p id=\"a9a8\">The result of this portion of the data pipeline is a collection of AVRO files on Google Storage that can be used to build a data lake. A new AVRO output is generated every 5 minutes, and downstream ETLs can parse the raw events into processed event-specific table schemas. The image below shows a sample output of AVRO files.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/www.sickgamedev.win\/wp-content\/uploads\/2018\/04\/blog-building-a-fully-managed-game-analytics-pipeline-5.png\" \/><\/p>\n<p><em>AVRO files saved to Google\u00a0Storage<\/em><\/p>\n<p id=\"04ae\">In addition to creating a data lake, we want the events to be immediately accessible in a query environment. DataFlow provides a BigQuery connector which serves this functionality, and data streamed to this endpoint is available for analysis after a short duration. This portion of the data pipeline is shown in the figure below.\u00a0<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/www.sickgamedev.win\/wp-content\/uploads\/2018\/04\/blog-building-a-fully-managed-game-analytics-pipeline-6.png\" \/><\/p>\n<p><em>Streaming events from DataFlow to\u00a0BigQuery<\/em><\/p>\n<p id=\"fb0e\">The data pipeline converts the PubSub messages into TableRow objects, which can be directly inserted into BigQuery. The code below consists of two apply methods: a data transformation and an IO writer.
The transform step reads the message payloads from PubSub, parses the message as a JSON object, extracts the\u00a0<em>eventType<\/em>\u00a0and\u00a0<em>eventVersion<\/em>\u00a0attributes, and creates a TableRow object with these attributes in addition to a timestamp and the message payload. The second apply method tells the pipeline to write the records to BigQuery and to append the events to an existing table.\u00a0<br \/>\u00a0<\/p>\n<div id=\"c23f\"><p>\/\/ parse the PubSub events and create rows to insert into BigQuery<br \/>events.apply(&quot;To Table Rows&quot;, new<br \/>\u00a0 PTransform&lt;PCollection&lt;PubsubMessage&gt;, PCollection&lt;TableRow&gt;&gt;() {<br \/>\u00a0\u00a0\u00a0 public PCollection&lt;TableRow&gt; expand(<br \/>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 PCollection&lt;PubsubMessage&gt; input) {<br \/>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 return input.apply(&quot;To Predictions&quot;, ParDo.of(new<br \/>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 DoFn&lt;PubsubMessage, TableRow&gt;() {<\/p>\n<p>\u00a0\u00a0\u00a0 @ProcessElement<br \/>\u00a0\u00a0\u00a0 public void processElement(ProcessContext c) throws Exception {<br \/>\u00a0\u00a0\u00a0\u00a0\u00a0 String message = new String(c.element().getPayload());<\/p>\n<p>\u00a0\u00a0\u00a0\u00a0\u00a0 \/\/ parse the json message for attributes<br \/>\u00a0\u00a0\u00a0\u00a0\u00a0 JsonObject jsonObject =<br \/>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 new JsonParser().parse(message).getAsJsonObject();<br \/>\u00a0\u00a0\u00a0\u00a0\u00a0 String eventType = jsonObject.get(&quot;eventType&quot;).getAsString();<br \/>\u00a0\u00a0\u00a0\u00a0\u00a0 String eventVersion = jsonObject.<br \/>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 get(&quot;eventVersion&quot;).getAsString();<br \/>\u00a0\u00a0\u00a0\u00a0\u00a0 String serverTime = dateFormat.format(new
Date());<\/p>\n<p>\u00a0\u00a0\u00a0\u00a0 \/\/ create and output the table row<br \/>\u00a0\u00a0\u00a0\u00a0 TableRow record = new TableRow();<br \/>\u00a0\u00a0\u00a0\u00a0 record.set(&quot;eventType&quot;, eventType);<br \/>\u00a0\u00a0\u00a0\u00a0 record.set(&quot;eventVersion&quot;, eventVersion);<br \/>\u00a0\u00a0\u00a0\u00a0 record.set(&quot;serverTime&quot;, serverTime);<br \/>\u00a0\u00a0\u00a0\u00a0 record.set(&quot;message&quot;, message);<br \/>\u00a0\u00a0\u00a0\u00a0 c.output(record);<br \/>\u00a0 }}));<br \/>}})<\/p>\n<p>\/\/ stream the events to BigQuery<br \/>.apply(&quot;To BigQuery&quot;, BigQueryIO.writeTableRows()<br \/>\u00a0 .to(table)<br \/>\u00a0 .withSchema(schema)<br \/>\u00a0 .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)<br \/>\u00a0 .withWriteDisposition(WriteDisposition.WRITE_APPEND));<\/p>\n<\/div>\n<p>To summarize the above code, each game event\u00a0that is consumed from PubSub is converted into a TableRow object with a timestamp and then streamed to BigQuery for storage.\u00a0<\/p>\n<p>The result of this portion of the data pipeline is that game events will be streamed to BigQuery and will be available for analysis in the output table specified by the DataFlow task.
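The transform above depends on the Gson and BigQuery client libraries, so as a self-contained sketch, the plain-Java version below mirrors its logic: pull eventType and eventVersion out of a raw message and build a flat record with a server timestamp and the original payload, the way the pipeline builds a TableRow. The string-based extraction is a simplified stand-in for the JSON parsing used in the pipeline, and the class and method names are illustrative only:

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Plain-Java sketch of the TableRow transform: extract attributes from a
 * raw JSON message and build a flat record. EventRecord is a hypothetical
 * name; the pipeline itself uses Gson's JsonParser and BigQuery's TableRow.
 */
public class EventRecord {

    /** Extract the string value of a top-level key from a flat JSON object. */
    public static String extract(String json, String key) {
        String marker = "\"" + key + "\":\"";
        int start = json.indexOf(marker);
        if (start < 0) return null;
        start += marker.length();
        return json.substring(start, json.indexOf('"', start));
    }

    /** Build a BigQuery-style row: parsed attributes, server time, raw payload. */
    public static Map<String, String> toRecord(String message) {
        Map<String, String> record = new LinkedHashMap<>();
        record.put("eventType", extract(message, "eventType"));
        record.put("eventVersion", extract(message, "eventVersion"));
        record.put("serverTime",
            new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(new Date()));
        record.put("message", message);
        return record;
    }

    public static void main(String[] args) {
        String msg = "{\"eventType\":\"Session\",\"eventVersion\":\"V1\"}";
        System.out.println(toRecord(msg).get("eventType")); // Session
    }
}
```

Keeping the raw message alongside the parsed attributes is what lets downstream ETLs re-parse events into richer schemas later.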
In order to effectively use these events for queries, you\u2019ll need to build additional ETLs for creating processed event tables with schematized records, but you now have a data collection mechanism in place for storing events.\u00a0<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/www.sickgamedev.win\/wp-content\/uploads\/2018\/04\/blog-building-a-fully-managed-game-analytics-pipeline-7.png\" \/><\/p>\n<p><em>Game event records queried from the raw-events table in\u00a0BigQuery<\/em><\/p>\n<h2 id=\"2bad\"><strong>Deploying and Auto\u00a0Scaling<\/strong><\/h2>\n<p id=\"b4f5\">With DataFlow you can test the data pipeline locally or deploy to the cloud. If you run the code samples without specifying additional attributes, then the data pipeline will execute on your local machine. In order to deploy to the cloud and take advantage of the auto scaling capabilities of this data pipeline, you need to specify a new runner class as part of your runtime arguments. In order to run the data pipeline, I used the following runtime arguments:<\/p>\n<p>--runner=org.apache.beam.runners.dataflow.DataflowRunner<br \/>--jobName=game-analytics<br \/>--project=your_project_id<br \/>--tempLocation=gs:\/\/temp-bucket<\/p>\n<p id=\"5c83\">Once the job is deployed, you should see a message that the job has been submitted. You can then click on the\u00a0<a href=\"https:\/\/medium.com\/r\/?url=https%3A%2F%2Fconsole.cloud.google.com%2Fdataflow\" rel=\"nofollow\" target=\"_blank\">DataFlow console<\/a>\u00a0to see the task:\u00a0<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/www.sickgamedev.win\/wp-content\/uploads\/2018\/04\/blog-building-a-fully-managed-game-analytics-pipeline-8.png\" \/><\/p>\n<p><em>The streaming data pipeline running on Google\u00a0Cloud<\/em><\/p>\n<p id=\"8d7c\">The runtime configuration specified above will not default to an auto scaling configuration.
In order to deploy a job that scales up based on demand, you\u2019ll need to specify additional attributes, such as:<br \/>\u00a0<\/p>\n<p>--autoscalingAlgorithm=THROUGHPUT_BASED<br \/>--maxNumWorkers=30<\/p>\n<p id=\"89ce\">Additional details on setting up a DataFlow task to scale to heavy workload conditions are available in\u00a0<a href=\"https:\/\/medium.com\/r\/?url=https%3A%2F%2Fcloud.google.com%2Fblog%2Fbig-data%2F2016%2F03%2Fcomparing-cloud-dataflow-autoscaling-to-spark-and-hadoop\" rel=\"nofollow\" target=\"_blank\">this Google article<\/a>\u00a0and\u00a0<a href=\"https:\/\/medium.com\/r\/?url=https%3A%2F%2Flabs.spotify.com%2F2016%2F03%2F10%2Fspotifys-event-delivery-the-road-to-the-cloud-part-iii%2F\" rel=\"nofollow\" target=\"_blank\">this post<\/a>\u00a0from Spotify. The image below shows how DataFlow can scale up to meet demand as necessary.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/www.sickgamedev.win\/wp-content\/uploads\/2018\/04\/blog-building-a-fully-managed-game-analytics-pipeline-9.png\" \/><\/p>\n<p><em>An example of Dataflow auto scaling<\/em><\/p>\n<h2 id=\"3cfc\"><strong>Conclusion\u00a0<\/strong><\/h2>\n<p id=\"d82c\">There is now a variety of tools available that make it possible to set up an analytics pipeline for a game or web application with minimal effort. Using managed resources enables small teams to take advantage of serverless and autoscaling infrastructure to scale up to massive event volumes with minimal infrastructure management. Rather than using a data vendor\u2019s off-the-shelf solution for collecting data, you can record all relevant data for your app.\u00a0<\/p>\n<p id=\"901f\">The goal of this post was to show how a data lake and query environment can be set up using the GCP stack.
While the approach presented here isn\u2019t directly portable to other clouds, the Apache Beam library used to implement the core functionality of this data pipeline is portable and similar tools can be leveraged to build scalable data pipelines on other cloud providers.<\/p>\n<p id=\"2a37\">This architecture is a minimal implementation of an event collection system that is useful for analytics and data science teams. In order to meet the demands of most analytics teams, the raw events will need to be transformed into processed and cooked events in order to meet business needs. This discussion is outside the scope of this post, but the analytics foundation should now be in place for building out a highly effective data platform.\u00a0<\/p>\n<p id=\"7882\">The full source code for this sample pipeline is available on <a href=\"https:\/\/github.com\/bgweber\/GameAnalytics\">Github<\/a>.\u00a0<a href=\"https:\/\/medium.com\/r\/?url=https%3A%2F%2Fwww.linkedin.com%2Fin%2Fben-weber-3b87482%2F\" rel=\"noopener nofollow noopener noopener nofollow\" target=\"_blank\">Ben Weber<\/a>\u00a0is the lead data scientist at\u00a0<a href=\"https:\/\/medium.com\/r\/?url=https%3A%2F%2Fangel.co%2Fwindfall-data\" rel=\"noopener nofollow noopener noopener nofollow\" target=\"_blank\">Windfall Data<\/a>, where our mission is to build the most accurate and comprehensive model of net worth.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The following blog post, unless otherwise noted, was written by a member of Gamasutra\u2019s community.The thoughts and opinions expressed are those of the writer and not Gamasutra or its parent company. Gathering usage\u00a0data such as player progress in games is invaluable for teams. 
Typically, entire teams have been dedicated to building and maintaining data pipelines [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":16494,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[20],"tags":[],"class_list":["post-16493","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-news"],"_links":{"self":[{"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/posts\/16493","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/comments?post=16493"}],"version-history":[{"count":0,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/posts\/16493\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/media\/16494"}],"wp:attachment":[{"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/media?parent=16493"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/categories?post=16493"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/tags?post=16493"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}