How to effectively handle web traffic in AWS
*This article was written by a former DevOps Architect at Lab08 – Atanas Dimitrov*
Previously, we covered our approach to product management at Lab08, digging into discovery teams, covering needs, and MVP implementation.
Here we shift the focus to our DevOps stack, and more precisely to how we enable scalable and highly available infrastructures for the products we build. In this article, you can find some tips and tricks that we use in our setups to deliver the best possible experience to our software users.
At Lab08, we cover the entire spectrum of product development. We do it all, from product management and architecture decisions, through coding and setting up infrastructure, to deployment and monitoring. From a technical point of view, such projects are primarily web-based platforms. That being said, our focus in this piece is the request/response pathway from end-consumer devices (such as a browser on a laptop or a mobile app) to the application and back. When we work with clients, we consistently set them up in AWS’s public cloud, as we feel it provides the soundest foundation and infrastructure alternative available.
In the illustration above, you can see a simplified design of the AWS infrastructure, which consists of three main components: 1) AWS’s distributed network, 2) AWS Network Load Balancer, and 3) Nginx or Openresty on an Autoscale group of EC2 instances.
AWS distributed network
At Lab08, we want to provide our customers with sustainable infrastructures that balance solid performance with reasonable cost. This is why, depending on the specific needs, our AWS setup uses either AWS Global Accelerator or CloudFront to deliver fast and secure products and to connect the web platform and end-users in the best way possible. CloudFront is used when accelerating SPA (single-page application) frontends hosted in S3, accelerating APIs, running CloudFront Functions at the edge, enabling a web application firewall at the edge, applying IP limits or content security headers, or leveraging DDoS protection. Conversely, Global Accelerator is preferred when a static IP is required or when the traffic is not HTTP/S.
Beyond the obvious uses of CloudFront, we rely on it for a few specific purposes:
Routing
We use the AWS Network Load Balancer (NLB) rather than the Application Load Balancer (ALB), for reasons explained a bit later. Being a layer 4 load balancer, the NLB lacks a couple of features found in the ALB, for example Layer 7 routing based on hostname or URI path. By adding CloudFront in front of the NLB, we cover the routing at the edge by defining different CloudFront behaviors targeting different origins. We also apply more complex routing using CloudFront Functions and Lambda@Edge when needed.
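To make the idea of behavior-based routing at the edge more concrete, here is a minimal Python sketch of how an ordered list of path patterns can select an origin. The patterns and origin names are hypothetical, and Python's `fnmatchcase` is only an approximation of CloudFront's path pattern matching, so treat this as a mental model rather than a description of CloudFront internals.

```python
from fnmatch import fnmatchcase

# Hypothetical behaviors in precedence order: (path pattern, origin name).
BEHAVIORS = [
    ("/api/*", "nlb-api-listener"),
    ("/assets/*", "s3-static-assets"),
]
# Stand-in for the default behavior that catches everything else.
DEFAULT_ORIGIN = "s3-spa-frontend"


def pick_origin(path):
    """Return the origin of the first behavior whose pattern matches the path."""
    for pattern, origin in BEHAVIORS:
        if fnmatchcase(path, pattern):
            return origin
    return DEFAULT_ORIGIN


if __name__ == "__main__":
    for path in ("/api/v1/users", "/assets/logo.png", "/index.html"):
        print(path, "->", pick_origin(path))
```

The first matching pattern wins, so more specific patterns need to be listed before broader ones.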
Origin Protection
To prevent clients from bypassing CloudFront and hitting the Origin directly, we use the “Custom header” option in the CloudFront Origin section. CloudFront injects the configured custom header into every request it sends to the Origin, which allows the Origin to apply policies based on this header. For example, we set “my-custom-header=12345” in CloudFront, and on the autoscale group of nginx/openresty nodes behind the NLB we apply a simple nginx map and an if block, so that any request that doesn’t carry this header with the previously set value is blocked.
# is_bypassing_cloudfront is set to 0 only when the header my-custom-header is found and equals 12345
map $http_my_custom_header $is_bypassing_cloudfront {
default 1;
12345 0;
}
….
#Then into the server or location nginx section
if ($is_bypassing_cloudfront) {
return 444;
}
An important note on the nginx/openresty config: the header name is defined with “-” in CloudFront’s custom header option, but in nginx it must be written with “_”, i.e. “my_custom_header” and not “my-custom-header.” This is because nginx exposes request headers as variables named $http_ followed by the lowercased header name with dashes replaced by underscores, hence $http_my_custom_header in the map.
While 444 is not a standard response code, it instructs nginx to close the connection without sending any response, which is excellent when dealing with malicious traffic.
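The decision made by the map and the if block can be summarised in a few lines of Python. This is only an illustrative sketch of the logic, not part of the nginx setup; the header name and value are the example ones from the text, and the constant-time comparison via `hmac.compare_digest` is an extra precaution assumed here, not something nginx's map does.

```python
import hmac

EXPECTED_HEADER_NAME = "my-custom-header"  # example name from the text
EXPECTED_HEADER_VALUE = "12345"            # example value from the text


def is_bypassing_cloudfront(headers):
    """Return True when the request lacks the shared header or carries a wrong value."""
    # HTTP header names are case-insensitive, so normalise them first.
    normalized = {name.lower(): value for name, value in headers.items()}
    received = normalized.get(EXPECTED_HEADER_NAME, "")
    matches = hmac.compare_digest(received.encode(), EXPECTED_HEADER_VALUE.encode())
    return not matches


if __name__ == "__main__":
    print(is_bypassing_cloudfront({"My-Custom-Header": "12345"}))  # request via CloudFront
    print(is_bypassing_cloudfront({"Host": "example.com"}))        # direct hit on the origin
```

Since the header value acts as a shared secret between CloudFront and the Origin, it is worth choosing a long random value instead of the short example used here.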
AWS Network Load Balancing
At Lab08, we have steered away from ALBs in favor of the AWS NLB (Network Load Balancer), mainly for its performance benefits. When massive traffic spikes are on the horizon, there is no need for “pre-warm/scale” actions. We work on projects where traffic arrives quite unpredictably, which is where NLBs excel. The static IPs of the NLBs are also quite helpful. However, there are still some features that the NLB is missing.
Layer 7 Routing
We cannot route based on the Host header or the URI path at the NLB layer. However, we can easily overcome this challenge by routing at a different stage, e.g., the CloudFront layer.
Because there are no Layer 7 capabilities for the different services or microservices, we simply use separate listeners on the same NLB. Each listener represents a separate microservice that operates on a different port of the NLB, can use its own certificate, and has its own target group of autoscale nodes, which can serve multiple services. Lastly, we can also route at the Origin’s nginx/openresty stage, on the Amazon EC2 instances behind the NLB.
No security group
At the time of writing, the NLB has no security group, so restricting access at the NLB itself is impossible. We have a few options to overcome this: either A) AWS WAF (web application firewall) with IP sets at the CloudFront layer, or B) the Origin’s nginx/openresty.
In option B), a simple bash script, run as a cron job, uses the AWS CLI to discover the IPs of the EC2 instances that are tagged with the right key/value and need access to the local EC2’s service, and adds those IPs to an nginx map. Every EC2 instance has the appropriate IAM (Identity and Access Management) policy to allow those queries.
Assume we want all instances with the tag “Role = microservice1” to reach our service through a public subnet NLB. Using the script below, we can generate an openresty map that sets the right access:
#!/usr/bin/env bash
# Here, we define the EC2 tag to gather the list of nodes we want access to. Let's call the key "Role" and its value "microservice1"
ROLE=microservice1
# pick the region from the instance-data (the availability zone minus its trailing letter)
REGION=$(curl -s http://instance-data/latest/meta-data/placement/availability-zone | sed 's/.$//')
# generate temporary nginx geo map with the discovered IPs of all
# running instances with tag Role = microservice1
aws ec2 describe-instances --region "${REGION}" --filters "Name=tag:Role,Values=${ROLE}" "Name=instance-state-name,Values=running" --query 'Reservations[*].Instances[*].[PublicIpAddress]' --output text | sed 's/$/ 0;/; 1s/^/geo $non_autoscale_net {\ndefault 1;\n/; $s/$/\n}/' > /tmp/non_autoscale_net.conf
# check if the list has changed and if so reload the proxy
if cmp -s /tmp/non_autoscale_net.conf /etc/openresty/conf.d/non_autoscale_net.conf
then
    exit 0
else
    cp /tmp/non_autoscale_net.conf /etc/openresty/conf.d/non_autoscale_net.conf
    # reload only if the new configuration passes the syntax check
    /usr/bin/openresty -t && systemctl reload openresty.service
fi
This script generates an nginx/openresty geo map, where the variable non_autoscale_net is set to 1 if the source IP of the request does not belong to one of the running instances tagged with Role=microservice1, for example:
geo $non_autoscale_net {
default 1;
13.53.212.185 0;
13.49.125.236 0;
}
This map is subsequently used in the nginx/openresty’s vhost server or location section to block all that’s not coming from the allowed IPs:
if ($non_autoscale_net) {
return 444;
}
This way, no matter what autoscale activity takes place, the access list is brought up to date on the next run of the cron job.
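For readers who find the sed one-liner hard to follow, the rendering and change-detection steps can be sketched in Python as below. This is an illustrative sketch only: the function names are hypothetical, the IP list would in practice come from the AWS CLI query, and the sorting and de-duplication are assumptions added here that the bash script does not perform.

```python
def render_geo_map(allowed_ips, variable="non_autoscale_net"):
    """Render an nginx geo block that yields 0 for allowed IPs and 1 otherwise."""
    lines = [f"geo ${variable} {{", "default 1;"]
    # Sort and de-duplicate so the output is stable between runs.
    lines += [f"{ip} 0;" for ip in sorted(set(allowed_ips))]
    lines.append("}")
    return "\n".join(lines) + "\n"


def needs_reload(new_config, current_config):
    """A reload is only needed when the rendered map actually changed."""
    return new_config != current_config


if __name__ == "__main__":
    rendered = render_geo_map(["13.53.212.185", "13.49.125.236"])
    print(rendered, end="")
    print(needs_reload(rendered, rendered))
```

Keeping the output stable matters because the order in which the API returns instances is not something the script controls, and a reordered but otherwise identical list should not trigger a proxy reload.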
Nginx or Openresty
Perhaps it is already clear that nginx/openresty is a crucial part of our request lifecycle. We choose Openresty when Lua code is involved in a particular use case. Beyond the most popular uses, such as serving static content, rate limiting, and reverse or FastCGI proxying, we also use it for traceability.
Traceability
To maintain a complete view of the request’s workflow, we need traceable information in every request/response, which we can later correlate with other logs, such as those generated by the applications. As nginx/openresty is always in the path of every request, here’s what we do:
# Set the trace ID if it isn't already available in the request. nginx's internal $request_id variable has 32 hex characters, the same length as the default B3 TraceId.
# If the request has no x-b3-traceid header, managed_x_b3_traceid is set to $request_id; otherwise the value of the x-b3-traceid header is copied into managed_x_b3_traceid.
map $http_x_b3_traceid $managed_x_b3_traceid {
"" $request_id;
default $http_x_b3_traceid;
}
However, what if we need a 16-character trace ID instead of a 32-character one? In such a case, we use the previous map to generate a fresh one from just its first 16 characters:
map $managed_x_b3_traceid $short_x_b3_traceid {
default "";
"~^(?<prefix>.{16}).*$" $prefix;
}
If the span ID is available, just use it. If it is missing, set it to the same value as the trace ID.
map $http_x_b3_spanid $managed_x_b3_spanid {
"" $short_x_b3_traceid;
default $http_x_b3_spanid;
}
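The three maps above chain together, and the chain can be read as the following Python sketch. The function names are hypothetical and `uuid.uuid4().hex` is only a stand-in for nginx's `$request_id` (both produce 32 hexadecimal characters); the sketch mirrors the maps' logic and is not something the proxy runs.

```python
import uuid


def new_request_id():
    """Stand-in for nginx's $request_id: 32 hexadecimal characters."""
    return uuid.uuid4().hex


def managed_trace_id(incoming_trace_id, request_id):
    """Use the incoming X-B3-TraceId if present, otherwise the request ID."""
    return incoming_trace_id if incoming_trace_id else request_id


def short_trace_id(trace_id):
    """First 16 characters, or empty when the ID is shorter (the map's default)."""
    return trace_id[:16] if len(trace_id) >= 16 else ""


def managed_span_id(incoming_span_id, short_id):
    """Use the incoming X-B3-SpanId if present, otherwise the short trace ID."""
    return incoming_span_id if incoming_span_id else short_id


if __name__ == "__main__":
    trace = managed_trace_id("", new_request_id())
    short = short_trace_id(trace)
    print(trace, short, managed_span_id("", short))
```

One behaviour worth noticing: because the regex in the second map requires at least 16 characters, an incoming trace ID shorter than that falls through to the default and produces an empty short ID.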
In order to make use of the maps defined above, we add the following into the server or into the proxy’s location block:
proxy_set_header X-B3-Traceid $short_x_b3_traceid;
proxy_set_header X-B3-Spanid $managed_x_b3_spanid;
add_header X-B3-Traceid $short_x_b3_traceid always;
We use custom nginx/openresty log formats that include $short_x_b3_traceid and $managed_x_b3_spanid, and have the logs shipped with Filebeat to a centralized ELK stack (Elasticsearch, Logstash, and Kibana). This way, we can later correlate those IDs in Kibana with the IDs from our application’s logs.
When choosing the block in which to place the proxy_set_header and add_header definitions, keep in mind how nginx inherits them: if you add headers in the server section and also add headers within a location block of that same server, the add_header statements from the server level will not be applied to that location; only the ones within the location will.
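The inheritance rule is easy to get wrong, so here is a tiny Python model of it. It is a simplified sketch with hypothetical names that covers only two levels (server and location): headers from the outer level apply only when the inner level defines none of its own.

```python
def effective_add_headers(server_headers, location_headers):
    """Headers nginx would emit for a location, given both configuration levels.

    Outer-level add_header directives are inherited only if the location
    defines no add_header directives of its own.
    """
    if location_headers:
        return list(location_headers)
    return list(server_headers)


if __name__ == "__main__":
    server = ["X-Frame-Options", "X-B3-Traceid"]
    print(effective_add_headers(server, []))                 # inherited from server
    print(effective_add_headers(server, ["Cache-Control"]))  # server headers are dropped
```

In practice this means that a location adding even one header has to repeat every server-level header it still needs.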
Combining nginx maps is an elegant feature, and we use it frequently. We have also found the Lua extension very useful. A great example is serving an arbitrary branch of a service in our non-production environments. We distribute our applications as docker images. Suppose we have several branches of the same application running on the same host. In that case, we separate those services by binding each docker container to a port that is a simple function of its branch name. For example, take an application called “API” with a docker container running the branch “testing” on the EC2 instance. This container would run on port 3766.
What’s the logic behind that port number? We sum the Unicode code points of the characters in the branch’s name and add the result to a base number (3000 by default, or argv[2] if set). For “testing” the sum is 766, which gives 3000 + 766 = 3766.
#!/usr/bin/env python3
"""
Usage:
    Positional Arguments
    Argument 1: branch name
    Argument 2: starting port, if not set default 3000
"""
import sys

if len(sys.argv[1:]) > 2 or len(sys.argv[1:]) < 1:
    print("Usage: Argument 1: branch name Argument 2: starting port, if not set default 3000")
    sys.exit(1)

word = sys.argv[1]
starting_port = int(sys.argv[2]) if len(sys.argv) >= 3 else 3000
# sum the Unicode code points of every character in the branch name
total = sum(ord(char) for char in word)
print(total + starting_port)
So when we start a new docker container, for which we use ansible roles, we first calculate the port and publish the container’s service on that same port on the host. This way, we can have multiple branches of the same service/application working on different predictable ports on the same EC2 instance.
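One property of a character-sum scheme like this is that branch names made of the same characters in a different order map to the same port. The following sketch, with hypothetical function and branch names, shows one way such collisions could be detected before containers are started; it is an illustration, not part of the setup described here.

```python
from collections import defaultdict


def branch_port(branch, base_port=3000):
    """Port for a branch: base port plus the sum of its characters' code points."""
    return base_port + sum(ord(char) for char in branch)


def find_port_collisions(branches, base_port=3000):
    """Group branch names by port and return only the ports claimed more than once."""
    by_port = defaultdict(list)
    for branch in branches:
        by_port[branch_port(branch, base_port)].append(branch)
    return {port: names for port, names in by_port.items() if len(names) > 1}


if __name__ == "__main__":
    print(branch_port("testing"))
    print(find_port_collisions(["feature-ab", "feature-ba", "testing"]))
```

Running a check like this over the branches deployed to a host would flag any two branches that would compete for the same port.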
We must match each request to its corresponding docker container in the reverse proxy. For example, requests to https://testing.api.projectabc.com must be proxied to localhost:3766, which is where Lua comes in handy. In our nginx’s upstream, we have not a fixed port but rather a calculated one. In the example below, we can also use Lua’s balancer_by_lua_block to have a simple retry over an array of upstreams:
upstream api_proxy {
server 0.0.0.1 fail_timeout=3;
balancer_by_lua_block {
local upstream_servers = {
"127.0.0.1",
-- REPLACE,
-- REPLACE,
-- REPLACE,
-- REPLACE,
-- REPLACE,
}
local balancer = require "ngx.balancer"
local host = upstream_servers[math.random(#upstream_servers)]
local port = 3000
local string_lenght=string.len(ngx.var.branch)
for i=1,string_lenght do
port = port + string.byte(ngx.var.branch, i)
end
if not ngx.ctx.retry then
ngx.ctx.retry = true
local ok, err = balancer.set_more_tries(5)
if not ok then
ngx.log(ngx.ERR, "set_more_tries failed: ", err)
end
end
local ok, err = balancer.set_current_peer(host, port)
if not ok then
ngx.log(ngx.ERR, "failed to set the current peer: ", err)
return ngx.exit(500)
end
}
keepalive 64;
}
In the server section we have:
server_name ~^(?<branch>.+)\.api\.projectabc\.com$;
Nginx matches the Host header and creates a variable “branch” with the value “testing”, which is then used to calculate the port number for the upstream. Because the Lua calculation is the same one we used in the Python code, the request hits its corresponding docker container:
local string_lenght=string.len(ngx.var.branch)
for i=1,string_lenght do
port = port + string.byte(ngx.var.branch, i)
end
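To check that the reverse proxy and the deployment script agree, the host-to-port path can be reproduced in Python. This is a hedged sketch with a hypothetical function name: the regular expression is an anchored Python equivalent of the server_name pattern, and the sum is taken over bytes to mirror Lua's `string.byte`.

```python
import re

# Python spelling of the server_name regex; the named group captures the branch.
HOST_PATTERN = re.compile(r"^(?P<branch>.+)\.api\.projectabc\.com$")


def upstream_port_for_host(host, base_port=3000):
    """Return the upstream port for a Host header, or None if it does not match."""
    match = HOST_PATTERN.match(host)
    if match is None:
        return None
    branch = match.group("branch")
    # Iterating over bytes yields integers, just like string.byte in Lua.
    return base_port + sum(branch.encode("utf-8"))


if __name__ == "__main__":
    print(upstream_port_for_host("testing.api.projectabc.com"))
    print(upstream_port_for_host("example.com"))
```

Note that Lua sums bytes while the Python script sums code points; the two agree for ASCII branch names, which is another reason to keep branch names ASCII-only.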
If you have made it this far, you now have a good idea of how our request lifecycle works. We have developed it to be efficient, quick, and use the least number of elements in the process.
We hope that you enjoyed getting a glimpse of DevOps here at Lab08. If you have any questions, comments, or something interesting to share after reading this piece, you can always reach out to us. We are always happy to discuss further with like-minded people.
References
1. https://kura.gg/2020/08/09/retrying-dynamically-configured-upstreams-with-openresty/
2. http://nginx.org/en/docs/http/ngx_http_map_module.html
3. https://openresty.org/en/
4. https://www.lua.org/
Atanas Dimitrov
Former Senior DevOps Architect at Lab08
Atanas was the DevOps Architect at Lab08 from 2018 to 2023. He handled the setup, maintenance, code deployment, security, troubleshooting, backup, monitoring, and incident reaction for all of our customers’ infrastructure.
His role was to ensure the stability and efficiency of all the platforms that we create together with our clients.