Real Time Big Data

Hey, hey, hey! As we saw in the first part of How To Create A Pokémon Go On AWS, we now have some expertise in generating the map for our users. There is still a lot of work to be done, but in this post we are going to set aside the routine work and focus on another interesting point: how we can receive and process information in (nearly) real time from a million devices, all at the same time.

In particular, we want to be capable of:

- ingesting position updates from around a million devices;
- aggregating, in near real time, how many players are in each map zone;
- serving those aggregates as a heatmap dashboard;
- archiving the raw data for long-term storage and analysis.

Here you have a diagram (courtesy of Cloudcraft) of the architecture we will use to solve this scenario. And yes: AWS has just announced Kinesis Analytics, which will simplify all of this considerably. But, hey, it's not generally available yet.

Pokémon Go Architecture on AWS (II)

Data Ingestion

The service that naturally helps us in this case is called Kinesis Streams. It's a high-throughput queue with guaranteed ordering (within each shard) and a default 24-hour data retention. In plain English, it's a sort of box into which you can drop as many slips of paper as you want and read them back, knowing they keep the order in which they were saved.

We will create a stream (a near-real-time processing flow, implemented as a queue) to which the players will send their positions.

The unit used to measure the throughput of a stream is called a shard. You can write up to one megabyte of data per second to each shard, or up to a thousand records per second, whichever limit you hit first.

The AWS SDK for Android and iOS includes good support for Kinesis Streams. Among other things, it makes it easy to accumulate a series of readings on the smartphone, making the most of each write, and send them to the servers in a single operation.
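To illustrate the batching idea, here is a minimal sketch in Python with boto3 (the mobile SDKs expose the same PutRecords operation); the stream name, record shape, and choice of partition key are all hypothetical:

```python
import json

MAX_BATCH = 500  # PutRecords accepts at most 500 records per call


def build_batch(readings):
    """Turn buffered readings into PutRecords entries.

    Each reading is a dict like {"player": "p1", "lat": ..., "lon": ...}.
    The player id is used as partition key here just for illustration.
    """
    return [
        {
            "Data": json.dumps(r).encode("utf-8"),
            "PartitionKey": r["player"],
        }
        for r in readings[:MAX_BATCH]
    ]


def flush(kinesis_client, stream_name, readings):
    """Send the accumulated readings in one network round trip."""
    if not readings:
        return None
    return kinesis_client.put_records(
        StreamName=stream_name,
        Records=build_batch(readings),
    )
```

Accumulating readings and flushing them in batches of up to 500 records trades a little latency for far fewer network round trips.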

We decided to apply this pattern to the coordinates registry, so we will have to size the shards according to the volume of data sent rather than the number of readings. A very important point is that Kinesis is scalable, so we can tailor the capacity (and therefore the cost) to the load patterns we observe throughout the day.
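As a back-of-the-envelope example of that sizing exercise (the traffic figures below are invented, not from the article), we check both per-shard limits and take whichever binds first:

```python
import math


def shards_needed(players, bytes_per_write, writes_per_player_per_sec):
    """Estimate the shard count from throughput, checking both Kinesis
    limits: 1 MiB/s and 1,000 records/s per shard."""
    total_bytes = players * bytes_per_write * writes_per_player_per_sec
    total_records = players * writes_per_player_per_sec
    by_bytes = total_bytes / (1024 * 1024)   # shards needed for the data volume
    by_records = total_records / 1000        # shards needed for the record count
    return max(1, math.ceil(max(by_bytes, by_records)))


# Hypothetical load: a million players, each sending one batched
# ~2 KB write every 30 seconds. Here the data volume, not the record
# count, is what drives the shard count -- exactly the situation
# described above.
print(shards_needed(1_000_000, 2048, 1 / 30))
```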

Near-real-time Processing

To consume (process) the incoming data we will deploy an Auto Scaling group of EC2 nodes distributed over two Availability Zones. On it we will run a Java app responsible for reading the stream. Why Java? Because the support for this language in Kinesis Streams is excellent thanks to the KCL library, and it integrates well with Apache Storm, which we will get to in a moment. Also, ever since Spring Boot appeared, Java is cool again.

Do you remember the last article, in which we explained how DynamoDB works? We ensured fast access by partitioning data by key, and essentially the same thing happens in Kinesis: we can send the data coming from a given UTM square to a given shard. When an app reads that data, it will retrieve it grouped by map zone, so it will be simple to apply a sliding window to accumulate, in (almost) real time, the number of users found in each of those 1-kilometre squares we've been talking about. Hold on, let me introduce an example. Everything is easier with examples.

Picture this: we have five users in Plaza Catalunya in Barcelona and another five at Puerta del Sol in Madrid. I write five and five because I don't want to offend anybody. We will create two shards in our stream, which we will call A and B (although in reality they are identified by numbers).

Let's say the readings from the Plaza Catalunya players all end up in A, while the Sol players' readings end up in B. The interesting part is that the app reading the first shard will only receive half of the total records, but among them it will find all the records from Plaza Catalunya, so it will have both the data it needs and the computational capacity to count them. The same goes for the app (better said, the "node") that deals with shard B. Do we need more processing capacity? Then we add more shards and nodes.
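What each node does with the shard it owns could be sketched like this: keep a sliding window of recent readings per square and count the distinct players inside it. This is a toy, single-process version of the idea, not how KCL structures the code:

```python
import time
from collections import defaultdict, deque


class SlidingWindowCounter:
    """Count distinct players seen per map square over the last N seconds."""

    def __init__(self, window_seconds=60):
        self.window = window_seconds
        self.events = defaultdict(deque)  # square -> deque of (timestamp, player)

    def record(self, square, player, ts=None):
        """Append one reading; timestamps arrive in shard order."""
        self.events[square].append((time.time() if ts is None else ts, player))

    def count(self, square, now=None):
        """Evict readings older than the window, then count distinct players."""
        now = time.time() if now is None else now
        q = self.events[square]
        while q and q[0][0] < now - self.window:
            q.popleft()
        return len({player for _, player in q})
```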

Once we have this first aggregation tier, we can add up the figures for larger geographical units. I know what you are thinking: Plaza Catalunya's data may live on one node while calle Balmes (very close geographically) could be partitioned somewhere else, so the nodes that process the players' positions must be able to exchange this information easily among themselves to build the next aggregation tier (let's say 100 square kilometres).
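A sketch of how those aggregation tiers could fit together, approximating the 1 km grid by scaling degrees to kilometres instead of doing a proper UTM projection (all names and the key format are illustrative):

```python
import math
from collections import defaultdict


def grid_key(lat, lon, cell_km=1.0):
    """Map a coordinate to a ~1 km grid cell id, usable as partition key."""
    lat_km = 110.574                                # km per degree of latitude
    lon_km = 111.320 * math.cos(math.radians(lat))  # shrinks with latitude
    return f"{int(lat * lat_km // cell_km)}:{int(lon * lon_km // cell_km)}"


def parent_cell(key, factor=10):
    """Roll a 'row:col' cell id up to its 10x10 parent (the ~100 km2 tier)."""
    row, col = (int(part) for part in key.split(":"))
    return f"{row // factor}:{col // factor}"


def aggregate(counts, factor=10):
    """Sum per-cell player counts into the next aggregation tier."""
    totals = defaultdict(int)
    for key, n in counts.items():
        totals[parent_cell(key, factor)] += n
    return dict(totals)
```

Two nodes can each compute their local 1 km counts independently and then exchange only these small per-cell totals to build the larger tier.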

Well, all this networking can be implemented surprisingly easily by leveraging the clustering software I mentioned before: Twitter Storm (or Apache Storm, or whatever they call it this week). In addition, Kinesis Streams integrates easily with it through the Kinesis Storm Spout.

If you want to unleash your inner hipster, try the heir to Storm at Twitter: Heron.


The data we are aggregating can be stored wherever you want (it's no longer big data), but Redis is always a safe bet in these cases. As we don't want to worry too much about its management, we will run this database on the ElastiCache service, configured for high availability.

To access the geolocation of the players, we create a microservice. Its API will be as simple as GET /players/{z}/{x}/{y}, and it only needs to retrieve the last known position of each user present in that tile, or a value that represents some of them if the zoom level doesn't require more precision. You can find more information about this kind of URL in the OSM wiki.
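The {z}/{x}/{y} in that URL are standard OSM "slippy map" tile numbers; converting a player's coordinate to its tile is a well-known formula:

```python
import math


def tile_for(lat, lon, zoom):
    """Convert a WGS84 coordinate to OSM slippy-map tile numbers (z, x, y)."""
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    # Web Mercator: project latitude, then scale to [0, n)
    y = int((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
    return zoom, x, y
```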

Since we have conveniently accumulated the data in ElastiCache, easing this access pattern, the web service implementation is very simple, so we decided to use the Lambda service, publishing the function through API Gateway instead of using EC2. In addition, we can brag at dinners with friends that we use it in production (we do this at CAPSiDE; friends smile courteously and keep drinking).
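A sketch of what the Lambda function behind that API could look like. The key scheme and response shape are invented, and the store is injected so the sketch runs without a real Redis (anything with a .get(key) method works):

```python
import json


def make_handler(store):
    """Build an API Gateway proxy-style handler over a key-value store."""

    def handler(event, context=None):
        p = event["pathParameters"]
        key = f"players:{p['z']}:{p['x']}:{p['y']}"  # hypothetical key scheme
        players = store.get(key) or []
        return {
            "statusCode": 200,
            "body": json.dumps({"players": players}),
        }

    return handler
```

In production the store would be the ElastiCache (Redis) client; in tests, a plain dict does the job.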

Lambda is fully managed (including autoscaling) and, above all, extremely cost-effective. A lot. To the point that a million calls to the API will cost $0.40. Forty cents. Per million.
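To put that figure in perspective, a quick back-of-the-envelope (request charges only; compute time and data transfer are billed separately, and the traffic numbers are invented):

```python
def monthly_request_cost(calls_per_month, cost_per_million=0.40):
    """Request charges at the $0.40-per-million figure quoted above."""
    return calls_per_month / 1_000_000 * cost_per_million


# A million players each refreshing the map 30 times a month:
# 30 million calls, roughly $12 in request charges.
print(monthly_request_cost(30_000_000))
```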

The static part of the microservice (HTML, JS, CSS, images, etc.) can be served directly from S3 static hosting through CloudFront, which will act as a CDN to speed up downloads.

This static part will be built with the JavaScript library Leaflet and its heatmap plugin to draw the points, over our beloved OpenStreetMap cartography.

Well, our dashboard is now complete. By the way, if you want to assemble something similar, you are invited to our free workshops. We announce them well in advance on our Twitter account, so you can arrange your schedule to come.


Now let's solve the long-term storage of all that humongous volume of data we are receiving.

I'm sure you predicted the destination would be S3: it's cost-effective and its durability is exceptionally high. In addition, Kinesis Firehose (a complement to Kinesis Streams) lets you write data flows to S3 very efficiently.

We'll only need a Lambda function to connect Kinesis Streams with Kinesis Firehose, which in turn will send the data to S3 efficiently.
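A sketch of that glue Lambda. The delivery stream name is hypothetical, and the Firehose client is injected so the decoding logic can run standalone:

```python
import base64


def kinesis_to_firehose_records(event):
    """Decode a Kinesis-triggered Lambda event into the record shape
    that Firehose's PutRecordBatch expects."""
    return [
        {"Data": base64.b64decode(r["kinesis"]["data"])}
        for r in event["Records"]
    ]


def handler(event, context=None, firehose=None, stream="positions-archive"):
    """Forward a batch of records from Kinesis Streams to Kinesis Firehose."""
    records = kinesis_to_firehose_records(event)
    if firehose and records:
        firehose.put_record_batch(
            DeliveryStreamName=stream,
            Records=records,
        )
    return len(records)
```

Firehose then takes care of buffering and writing the batches to S3 on its own schedule.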

Indeed, if you want to do data warehousing using SQL, you'll be delighted to know that Kinesis Firehose can also feed Redshift, Amazon's columnar database. But let's say it isn't necessary for the moment.

Not bad at all, is it? We've used a growing number of technologies, but as long as we stick to managed services, the complexity stays under control. Above all, we gain momentum: in very little time we can go from having a great idea to seeing it up and running, supporting a million users. We are living in exciting times.

Ah! I shouldn't forget to mention my own business: at CAPSiDE we can help you with this kind of project, from training to systems operation, and even advanced architecture design. We are already doing it with lots of companies you probably know.


Actually, it seems the rest of the game doesn't need anything special: ELBs connected to EC2 in Auto Scaling groups, some S3 with CloudFront, etc. So this series of posts probably ends here. But you can always speak your mind at @capside and we'll try to please you.

TAGS: AWS, AWS Kinesis, big data, DynamoDB, Pokémon Go

