<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[Interact Engineering]]></title><description><![CDATA[Engineering Stories from Interact Software]]></description><link>https://interactengineering.io/</link><image><url>https://interactengineering.io/favicon.png</url><title>Interact Engineering</title><link>https://interactengineering.io/</link></image><generator>Ghost 5.29</generator><lastBuildDate>Fri, 17 Apr 2026 08:56:30 GMT</lastBuildDate><atom:link href="https://interactengineering.io/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[How we moved our entire storage layer to S3]]></title><description><![CDATA[<p>Within our product ecosystem lives a large monolithic application that is responsible for the majority of functionality that our customers use every day. 
Like most technology companies that have undergone sustained, rapid growth over a number of years, this application houses a large number of development patterns, complex business logic,</p>]]></description><link>https://interactengineering.io/how-we-moved-to-s3/</link><guid isPermaLink="false">610a4c0dae97712d02ac33fe</guid><dc:creator><![CDATA[Tom Walters]]></dc:creator><pubDate>Mon, 12 Dec 2022 08:30:00 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1529148482759-b35b25c5f217?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwxMTc3M3wwfDF8c2VhcmNofDIwfHxzdG9yYWdlfGVufDB8fHx8MTYyODA2NDk0OQ&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1529148482759-b35b25c5f217?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwxMTc3M3wwfDF8c2VhcmNofDIwfHxzdG9yYWdlfGVufDB8fHx8MTYyODA2NDk0OQ&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=2000" alt="How we moved our entire storage layer to S3"><p>Within our product ecosystem lives a large monolithic application that is responsible for the majority of functionality that our customers use every day. Like most technology companies that have undergone sustained, rapid growth over a number of years, this application houses a large number of development patterns, complex business logic, and a range of coupled vs. un-coupled facets.</p><p>We run our multi-tenanted application on AWS, and as we&#x2019;ve scaled to over 1000 enterprise customers that rely on the product every day we&#x2019;ve continually upgraded our infrastructure to a highly redundant and performant architecture. At the core of this is a large cluster of EC2 instances that run our primary application. 
Because it&#x2019;s multi-tenanted we can easily scale up and down based on customer traffic (wouldn&#x2019;t you know it, we see cyclical peaks and troughs as the UK, and various time-zones in the US wake up and fire up their intranets alongside their morning coffees!) This works very well, and the remaining elements of the product ecosystem scale in a similar fashion.</p><h2 id="storage">Storage</h2><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1597852074816-d933c7d2b988?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwxMTc3M3wwfDF8c2VhcmNofDN8fGhhcmQlMjBkcml2ZXxlbnwwfHx8fDE2NzM5NTAzNjU&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" class="kg-image" alt="How we moved our entire storage layer to S3" loading="lazy" width="5760" height="3840" srcset="https://images.unsplash.com/photo-1597852074816-d933c7d2b988?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwxMTc3M3wwfDF8c2VhcmNofDN8fGhhcmQlMjBkcml2ZXxlbnwwfHx8fDE2NzM5NTAzNjU&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=600 600w, https://images.unsplash.com/photo-1597852074816-d933c7d2b988?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwxMTc3M3wwfDF8c2VhcmNofDN8fGhhcmQlMjBkcml2ZXxlbnwwfHx8fDE2NzM5NTAzNjU&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1000 1000w, https://images.unsplash.com/photo-1597852074816-d933c7d2b988?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwxMTc3M3wwfDF8c2VhcmNofDN8fGhhcmQlMjBkcml2ZXxlbnwwfHx8fDE2NzM5NTAzNjU&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1600 1600w, https://images.unsplash.com/photo-1597852074816-d933c7d2b988?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwxMTc3M3wwfDF8c2VhcmNofDN8fGhhcmQlMjBkcml2ZXxlbnwwfHx8fDE2NzM5NTAzNjU&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2400 2400w" sizes="(min-width: 720px) 720px"><figcaption>Photo by <a href="https://unsplash.com/@benjaminlehman?utm_source=ghost&amp;utm_medium=referral&amp;utm_campaign=api-credit">benjamin lehman</a> / <a 
href="https://unsplash.com/?utm_source=ghost&amp;utm_medium=referral&amp;utm_campaign=api-credit">Unsplash</a></figcaption></figure><p>However, one aspect that had been nagging the Engineering team for some time was our persistence layer. When the core application was first conceived (in a time before AWS), it was designed to run on a single-server deployment &#x2013; most of our customers would run the application on their own local infrastructure. Customers would use IIS to host the application, and it would store all of the uploaded files alongside any tenant-specific files on the local disk &#x2013; all was well. Over the years, the system grew, more modules were added, and complexity increased. The business kept growing and signed more customers.</p><p>Fast-forward to today and our product stack is hosted entirely on AWS across a large number of virtual machines and other infrastructure. We don&#x2019;t enforce session affinity, which means that a single user likely interacts with hundreds of distinct web servers over the course of a day. At the application level this is completely fine, but at the persistence level we encounter some additional complications.</p><p>For a number of years, we&#x2019;ve utilised shared NFS instances as a central store &#x2013; accessible from all EC2 instances and backed by local caches on those boxes &#x2013; and this has worked extremely well. But as we&#x2019;ve continued to grow, these huge NFS instances have presented us with issues such as <u>kernel-level incompatibility with AV software due to the customised Linux kernels used by AWS</u>, general issues of scale at such high traffic volumes, and their being a single point of failure.</p><h2 id="enter-s3">Enter S3</h2><p>The obvious choice for the next generation of persistence layer is S3. 
But unlike our move to NFS, this would present a number of additional complexities, including:</p><ol><li>Low-level File API usage throughout the application &#x2013; tens of thousands of instances of <code>File.Read</code> and <code>File.Write</code> would have to be inspected and replaced</li><li>Reconciling internal APIs &#x2013; uploads are served via an application controller which checks the permissions of the current user against the access rights &#x2013; we wouldn&#x2019;t be able to simply serve raw S3 URLs for uploads, assets, theme files, etc.</li><li>Migration &#x2013; how do we move the existing files to S3? We manage dozens of terabytes of data across the customer base</li><li>Hot Swap &#x2013; how do we move to S3 without any service interruption?</li><li>Security &#x2013; how do we keep the data isolated for each customer?</li></ol><h2 id="low-level-file-apis">Low-Level File APIs</h2><p>The most challenging and time-consuming part of the migration was always going to be finding and correctly updating all the references to the standard library <code>File</code> operations, without introducing bugs.</p><p>A quick inspection of the usage (<code>File.Read</code>/<code>File.Write</code> etc.) showed that it would be a very large task &#x2013; there were 14k+ instances of <code>File.Read</code> alone in the codebase.</p><p>After numerous whiteboarding sessions, and after defining what problems we wanted to solve along the way, we decided that the best approach was to create a new <strong>central library project</strong> dedicated to handling the <strong>persistence layer operations</strong> (access, writing, exists checks, permission checks, deletions, etc.)</p><p>Any file operations would then simply use the new central library, and the library would internally decide how to handle the files (eg. use local disk, use S3, use another technology, use several at once? 
etc.)</p><p>This design decision had <strong>numerous benefits</strong>:</p><ul><li>Can easily unit test the entire persistence layer</li><li>Can easily introduce new storage backends in the future</li><li>Optimisations and changes (eg. global encryption or compression) can be made in a single central library and take effect across the entire application</li><li>Consolidation and standardisation of file access logic</li><li>Introduces consistency in how files are handled</li><li>Easy for new developers to pick up and interact with the storage layer</li></ul><h2 id="s3-integration">S3 Integration</h2><p>The S3 integration brings some interesting challenges &#x2013; both from an infrastructure/cloud perspective and a code implementation perspective.</p><ul><li>Security configuration - what is secure?</li><li>Costs - S3 traffic generally happens over the internet, meaning you pay for egress and ingress traffic</li><li>Limitations - throughput limitations per prefix - this impacts the bucket structure</li><li>Secure interactions - eg. <a href="https://interactengineering.io/protecting-against-s3-attack-vectors-with-zero-trust-engineering">S3 Bucket Ownership</a> validation on write</li><li>Scanning files - scanning files with AV software on a disk is easy; how do we go about this with a serverless service?</li></ul><p>Finding the right approach for us took a lot of research, small-scale experiments, and a lot of retrospection and refinement.</p><p>Some of the questions we asked were (<u>hopefully, some are useful to anyone currently undergoing a similar journey</u>):</p><ul><li>As a SaaS service, do we use a bucket per customer, or a single bucket with a subfolder/prefix per customer to segregate and isolate data?</li><li>If we choose a bucket per customer, what naming convention do we use, given that bucket names live in a global namespace (not scoped to an individual AWS account)? 
<u>- this gives a nice level of isolation, but there is a limit of 1k buckets per AWS account</u></li><li>If we choose a subfolder/prefix per customer, how do we provide a secure isolation layer for accessing a single customer&apos;s files (eg. so a modified path can&apos;t jump a few folders up and access another customer&apos;s data)? <u>- an S3 to VPC access point scoped to a specific folder allows for this level of isolation</u></li><li>Do we want the architecture to allow BYOK for bucket encryption?</li><li>To avoid additional network costs, shall we use S3 Access Points to connect directly to the VPC accessing the files? <u>- as long as the network traffic doesn&apos;t leave the AWS network, it&apos;s free - so keeping the traffic between your software and your S3 buckets inside the data center can make it a lot cheaper</u></li><li>How do we ensure our backup strategy meets the policies on frequency and retention? <u>- versioning can be used to capture file changes, so there is no need to take backups on a fixed interval. Provided the files don&apos;t change very frequently, this can reduce the backup costs significantly, and combined with Lifecycle Policies you can tailor the correct retention periods, and even customize the storage class for older file versions.</u></li><li>How do we restore from a &quot;backup&quot; when needed? <u>- this can be done manually if targeting specific files, or en masse (some options include pipelines such as </u><a href="https://aws.amazon.com/blogs/storage/point-in-time-restore-for-amazon-s3-buckets/">https://aws.amazon.com/blogs/storage/point-in-time-restore-for-amazon-s3-buckets/</a><u>, or tools like </u><a href="https://github.com/angeloc/s3-pit-restore">https://github.com/angeloc/s3-pit-restore</a><u>)</u></li><li>etc.</li></ul><p>It was a really interesting exercise and took a lot of solution planning - it is one of those exercises where relatively small decisions at the beginning restrict and lock you into specific options in the future, without the ability to easily change your mind.</p><h2 id="reconciling-internal-apis">Reconciling Internal APIs</h2><p>Thankfully, the Pareto principle applies quite universally, and most of the requests triggering file operations were located in just two controllers.</p><p>It made for a great starting point: by modifying the logic underpinning those two controllers to use the new persistence layer, we would cover the vast majority of file operations in the system. 
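</p><p>To make the shape of this concrete, here is a minimal sketch (hypothetical names and an in-memory stand-in for the real backends &#x2013; not our actual code) of a controller-style handler performing its permission check and then delegating to a pluggable storage backend:</p>

```python
from abc import ABC, abstractmethod

class StorageBackend(ABC):
    """Pluggable persistence backend (local disk, NFS, S3, ...)."""
    @abstractmethod
    def read(self, path: str) -> bytes: ...
    @abstractmethod
    def write(self, path: str, data: bytes) -> None: ...
    @abstractmethod
    def exists(self, path: str) -> bool: ...

class InMemoryBackend(StorageBackend):
    """In-memory stand-in so the flow can be shown without real storage."""
    def __init__(self):
        self._files = {}
    def read(self, path: str) -> bytes:
        return self._files[path]
    def write(self, path: str, data: bytes) -> None:
        self._files[path] = data
    def exists(self, path: str) -> bool:
        return path in self._files

def get_document(user, path, backend, can_access):
    """Controller-style handler: the permission check happens first,
    then the storage layer is asked for the bytes -- the controller
    never touches the low-level File APIs directly."""
    if not can_access(user, path):
        return 403, b""
    if not backend.exists(path):
        return 404, b""
    return 200, backend.read(path)
```

<p>The controller stays oblivious to whether the bytes come from local disk, NFS, or S3 &#x2013; swapping the backend is a single construction change, which is what made a phased rollout possible.</p><p>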
80% of the benefit, for 20% of the work.</p><p>If we could get this to work, we could then focus on the remaining work at our leisure.</p><p>Naturally, in a mature system like ours, there are rich permission structures and existing paradigms for access logging, compression, asset caching, etc.</p><p>We wanted to integrate the new persistence storage library in a way that respected the existing logic, so that we didn&apos;t cause issues in the other critical areas of the system (we wanted a controlled implementation that did not grow in scope uncontrollably), and in a way that did not have to be aware of the underlying storage technology (the caller should not need to know whether the files are stored on local disk, a network share, S3, Azure Blob Storage, or anywhere else.)</p><p>Our controllers had various endpoints handling different use cases:</p><ul><li>Get Documents - permission-oriented; we don&apos;t want these cached on any intermediate networking devices en route to the user</li><li>Get Video Stream - you don&apos;t want to return the entire file before any playback starts</li><li>Get Image Assets - we want to ensure appropriate caching, compression, etc. happens</li><li>(there were many more, including more complex CRUD endpoints to handle chunked operations, etc.)</li></ul><p>It would be very easy to just redirect all requests directly to S3 - we&apos;d save ourselves some bandwidth on file transfers - but it would be insecure and inappropriate. 
Instead, we replaced the file access calls with equivalent calls to our new persistence layer library.</p><p>Whilst there, we also added a number of optimisations.</p><p>For example, when reading files and assets from disk and sending them back to the browser, you want to set the <code>Content-Type</code> correctly so that the browser and any receiving client can treat the file correctly.</p><p>Our previous implementation loaded the beginning of the file into a stream (for reasons we won&apos;t get into here) and inspected it to determine the file type. This happened on every read, which led to a lot of wasted processing - <u>the type won&apos;t change unless the file changes</u>.</p><p>So when implementing the new persistent storage layer, and in particular the S3 implementation, we abstracted the content type detection and brought it into the persistence layer library. When writing a file to S3, we use the existing mechanism to determine the content type, but we store the result in the metadata which can be set on S3 objects. On any subsequent read, we simply look the value up in a dictionary, instead of carrying out additional processing to figure it out again.</p><p>We managed to simplify and standardise a lot of file operations in this exercise, and reduce the overall entropy of that part of the system.</p><p>We now had a solid, reusable library and set of patterns for moving the rest of the system across to the new persistence layer, and a system capable of interfacing with different underlying storage platforms.</p><h2 id="migration-hot-swapping">Migration &amp; Hot Swapping</h2><p>When dealing with a live system, there is always a question of data migration. 
How do we go about migrating millions of files and dozens of TBs of data from NFS to S3 in a seamless, safe, fault-tolerant, and reliable way?</p><p>We decided that doing a big all-or-nothing migration was too risky, and would likely not work and/or cause significant disruption.</p><p>We put the S3 storage behind a feature flag so that we could enable it on select sites and control the rate of the rollout. Regardless of whether the flag is on or off, all file interactions go through the new persistent storage layer library, and the library decides internally whether to use NFS or S3.</p><p>This allowed us to start off with everyone using NFS (no change), but with the files being handled by the new central library. This served as a confidence-building exercise for the new library in production, beyond the unit testing and QA testing that had already taken place.</p><p>It also allowed us to phase the S3 storage in at a comfortable pace while we monitored the stability, error rates, and user experience, with the option to fall back to the NFS implementation in case of issues. This ticks a huge box for fault tolerance.</p><p>This does raise the question of migrating data from NFS to S3, and of what happens to data uploaded to S3 if we need to fall back to NFS in case of problems. You don&apos;t want data to go missing, so when you fall back to NFS, you want to ensure all the files are still there.</p><p>This led us to introduce &quot;<strong>storage strategies</strong>&quot; within the persistence layer. The default strategy was configured to &quot;<strong>dual-write</strong>&quot; content to NFS and to S3. This meant that, for the time being, NFS remained the source of truth, while S3 lazily received files on write and on access, slowly populating the storage layer the more the system was used. On-read uploads were done by checking if the requested file existed in S3; if not, it&apos;d be fetched from NFS, uploaded to S3, and then returned to the user. 
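</p><p>Stripped of the production detail, the strategy looks roughly like this (an illustrative sketch with dictionary-backed stand-ins for NFS and S3 &#x2013; in the real system the backfill upload happens asynchronously):</p>

```python
class DictStore:
    """Dictionary-backed stand-in for a real store (NFS or S3)."""
    def __init__(self):
        self._files = {}
    def write(self, path, data):
        self._files[path] = data
    def read(self, path):
        return self._files[path]
    def exists(self, path):
        return path in self._files

class DualWriteStrategy:
    """Dual-write storage strategy: the primary store (NFS) stays the
    source of truth while the secondary (S3) is populated lazily."""
    def __init__(self, primary, secondary):
        self.primary = primary      # e.g. the NFS-backed store
        self.secondary = secondary  # e.g. the S3-backed store

    def write(self, path, data):
        # New content lands in both stores, so falling back to the
        # primary loses nothing.
        self.primary.write(path, data)
        self.secondary.write(path, data)

    def read(self, path):
        # Serve from the secondary if the file has already made it there...
        if self.secondary.exists(path):
            return self.secondary.read(path)
        # ...otherwise fetch from the primary and backfill the secondary
        # so it's there next time.
        data = self.primary.read(path)
        self.secondary.write(path, data)
        return data
```

<p>Once the backfill completes, swapping this strategy for an S3-only one finishes the migration without any caller noticing.</p><p>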
This gave us a self-populating, human-free mechanism for rolling this feature out in a way that is:</p><ul><li>Reliable - no human involvement, and each operation is always handled the same way</li><li>Safe - data never leaves the system and there are no concerns about data handling</li><li>Fault-tolerant - if there is an issue with S3, NFS still contains all the data, so we can easily fall back</li><li>Seamless - the migration was done in a lazy fashion - if the file didn&apos;t exist in S3, we&apos;d grab it from NFS and asynchronously upload it so it&apos;s there next time</li></ul><p>This strategy allows the migration of any files not yet uploaded to S3 to happen seamlessly and asynchronously behind the scenes. Once complete, the strategy in the code can be adjusted from dual-writing to both file systems to using S3 only, completing the migration and the deprecation of the NFS service.</p><h2 id="lessons-and-conclusion">Lessons and Conclusion</h2><ul><li>This was by no means an easy or short project, but we made the transition from NFS to S3 at a global scale seamless, whilst introducing additional improvements to the system and to developer ergonomics along the way.</li><li>The approach taken made the cost of failure very low and the effort to roll back very small, and it has enabled us to introduce further improvements and optimisations (eg. new compression technologies) with ease, as well as build out new modules and features, without much consideration for storage.</li><li>The rollout was seamless and successful, and it shows that the preparation and planning that went into this project led to great results.</li></ul>]]></content:encoded></item><item><title><![CDATA[Protecting against S3 attack vectors with Zero Trust Engineering]]></title><description><![CDATA[<p>Amazon S3 was a game-changer when it came to the market. 
Incredible durability and availability, very high performance, ease of use, cheap and completely serverless (scale as you need, and only pay for what you use.)</p><p>You create a bucket, upload your files, and then serve them from the bucket.</p>]]></description><link>https://interactengineering.io/protecting-against-s3-attack-vectors-with-zero-trust-engineering/</link><guid isPermaLink="false">610a5dceae97712d02ac3544</guid><dc:creator><![CDATA[Daniel Wardin]]></dc:creator><pubDate>Tue, 15 Nov 2022 09:30:00 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1565889673174-ee7391b93d23?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwxMTc3M3wwfDF8c2VhcmNofDN8fGJ1Y2tldHN8ZW58MHx8fHwxNjI4NzA2NTQw&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1565889673174-ee7391b93d23?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwxMTc3M3wwfDF8c2VhcmNofDN8fGJ1Y2tldHN8ZW58MHx8fHwxNjI4NzA2NTQw&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=2000" alt="Protecting against S3 attack vectors with Zero Trust Engineering"><p>Amazon S3 was a game-changer when it came to the market. Incredible durability and availability, very high performance, ease of use, cheap and completely serverless (scale as you need, and only pay for what you use.)</p><p>You create a bucket, upload your files, and then serve them from the bucket. You could run static websites, stream videos, or use it as a backend storage provider for your applications (the most common use.) Extremely versatile.</p><p>It comes built-in with:</p><ul><li>encryption</li><li>version control</li><li>replication</li><li>auditability</li><li>logging</li><li>security</li><li>policy management</li><li>access management</li><li>easy-to-use web-based portal</li></ul><p>However, like any solution/technology, it comes with its share of problems and vulnerabilities. 
It&apos;s not a carefree silver bullet.</p><p>A well-known practice is to lock down your buckets and to (hopefully) align your IAM credentials with least-privilege principles. Great! But what if your files can be funnelled to another account without your knowledge, or if files can be positioned on your domain, regardless of how much you lock down your buckets?</p><p>We&apos;ll cover some of the least-known and more dangerous attack vectors for S3:</p><ul><li>bucket parking, and </li><li>subdomain takeover</li></ul><hr><h2 id="bucket-parking">Bucket Parking</h2><p>As many of us know, S3 bucket names are unique at a global level, not at the account level. This means that it is technically a shared space, and collisions may happen.</p><p><em>Now let&apos;s think about the implications of that fact...</em></p><p>Bucket parking attacks take advantage of that fact, and <strong>collisions are sought out on purpose</strong>.</p><p>Imagine hosting a SaaS application which supports uploading business-critical documents via your interface to an S3 bucket, where each of your customers gets their own bucket with a unique name (eg. a combination of their business name and a pre/postfix of some sort.)</p><h3 id="example">Example</h3><p>Your customer, &quot;Dunder Mifflin&quot;, has an isolated bucket, auto-created by your software and named as follows.</p><!--kg-card-begin: markdown--><p><code>dunder-mifflin-docs</code></p>
<!--kg-card-end: markdown--><p>Anytime they upload a document, it gets safely put into the <strong>dunder-mifflin-docs</strong> bucket using your AWS Credentials.</p><p>The bucket is securely locked down from public access, has restrictive access policies in place, and your AWS Credentials are also heavily locked down to prevent their use for unauthorized activities, aside from a select few actions on the S3 service (such as <strong>PutObject</strong>.)</p><p>If a new customer joins, your code auto-creates a bucket for them, names it according to that naming format, and applies the same rigorous lock-down to the bucket.</p><p>All good so far, <em>right</em>?</p><p>Not really - this is where things get <strong>dangerous</strong> due to the shared-space nature of S3.</p><hr><p><strong>How do you know that this bucket in fact belongs to you</strong> when uploading files, and is not just another bucket in the globally shared namespace of AWS S3 that happens to have this name?</p><p>If it&apos;s a globally shared namespace, any existing bucket with that name could belong to someone else. In a normal scenario, this is fine, as you&apos;d notice the collision and get AccessDenied when trying to upload to this bucket (it&apos;s likely locked down, and you don&apos;t have access.)</p><hr><p>Let&apos;s return to our example, and let&apos;s assume you just signed a new customer, &quot;Vance Refrigeration&quot;.</p><p>Your code will attempt to create a bucket if one doesn&apos;t exist. However, if an attacker has already created a bucket called:</p><!--kg-card-begin: markdown--><p><code>vance-refrigeration-docs</code></p>
<!--kg-card-end: markdown--><p>and applied a very open access policy to the bucket, your code will be told the bucket exists and will be able to successfully upload documents to this bucket, even though it is not yours, and even though your own setup is locked down.</p><p>Intriguing, right?! Regardless of the restrictions you applied to your own buckets and IAM credentials - as long as you can execute <strong>PutObject</strong>, you&apos;re vulnerable.</p><hr><p>Here is a bucket setup that would allow the attacker to accept files from your application without any restrictions - it just needs to get your application to try and put files into a bucket with that name.</p><p>The attacker&apos;s bucket grants Everyone Write access.</p><figure class="kg-card kg-image-card"><img src="https://interactengineering.io/content/images/2021/08/Screenshot-2021-08-11-at-17.00.18.png" class="kg-image" alt="Protecting against S3 attack vectors with Zero Trust Engineering" loading="lazy" width="841" height="453" srcset="https://interactengineering.io/content/images/size/w600/2021/08/Screenshot-2021-08-11-at-17.00.18.png 600w, https://interactengineering.io/content/images/2021/08/Screenshot-2021-08-11-at-17.00.18.png 841w" sizes="(min-width: 720px) 720px"></figure><p>And its &quot;Block public access&quot; settings are turned off.</p><figure class="kg-card kg-image-card"><img src="https://interactengineering.io/content/images/2021/08/Screenshot-2021-08-11-at-16.21.55.png" class="kg-image" alt="Protecting against S3 attack vectors with Zero Trust Engineering" loading="lazy" width="1544" height="468" srcset="https://interactengineering.io/content/images/size/w600/2021/08/Screenshot-2021-08-11-at-16.21.55.png 600w, https://interactengineering.io/content/images/size/w1000/2021/08/Screenshot-2021-08-11-at-16.21.55.png 1000w, https://interactengineering.io/content/images/2021/08/Screenshot-2021-08-11-at-16.21.55.png 1544w" sizes="(min-width: 720px) 720px"></figure><p>Your code will continue to function (the bucket name it was looking for exists, and 
accepts its requests) and you&apos;re none the wiser, but now your and your customer&apos;s data lives in someone else&apos;s account.</p><p>This is what the exploit looks like at a high level.</p><figure class="kg-card kg-image-card"><img src="https://interactengineering.io/content/images/2021/08/bucketparking-1.png" class="kg-image" alt="Protecting against S3 attack vectors with Zero Trust Engineering" loading="lazy" width="712" height="457" srcset="https://interactengineering.io/content/images/size/w600/2021/08/bucketparking-1.png 600w, https://interactengineering.io/content/images/2021/08/bucketparking-1.png 712w"></figure><p>This of course relies on knowing the bucket naming format, which is generally not known but can be extracted in many different ways, and which should generally be considered a <strong>non-secret</strong>. If your security relies on bucket names being un-guessable or unknown, <strong>your mitigation is weak</strong>.</p><p>We have successfully replicated this behaviour in our experiment labs: regardless of the level of lockdown in the intended destination account, the files were funnelled out successfully.</p><h3 id="solution">Solution</h3><p>AWS recognized the risk associated with this and provided an easy way to guard your code against it.</p><p>The main problem is that the code/client assumes that the bucket is in the intended destination account, but this is never made explicit - <em>remember, we&apos;re operating in a shared namespace, not in an isolated space in your AWS Account</em>.</p><p>In September 2020, the following announcement was made about support for <strong>bucket ownership verification headers</strong>, which can place constraints on uploads.</p><figure class="kg-card kg-bookmark-card"><a class="kg-bookmark-container" href="https://aws.amazon.com/about-aws/whats-new/2020/09/amazon-s3-bucket-owner-condition-helps-validate-correct-bucket-ownership/"><div class="kg-bookmark-content"><div 
class="kg-bookmark-title">Amazon S3 bucket owner condition helps to validate correct bucket ownership</div><div class="kg-bookmark-description"></div><div class="kg-bookmark-metadata"><img class="kg-bookmark-icon" src="https://a0.awsstatic.com/libra-css/images/site/touch-icon-ipad-144-smile.png" alt="Protecting against S3 attack vectors with Zero Trust Engineering"><span class="kg-bookmark-author">Amazon Web Services, Inc.</span></div></div><div class="kg-bookmark-thumbnail"><img src="https://a0.awsstatic.com/libra-css/images/logos/aws_logo_smile_1200x630.png" alt="Protecting against S3 attack vectors with Zero Trust Engineering"></div></a></figure><figure class="kg-card kg-bookmark-card"><a class="kg-bookmark-container" href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/bucket-owner-condition.html"><div class="kg-bookmark-content"><div class="kg-bookmark-title">Verifying bucket ownership with bucket owner condition - Amazon Simple Storage Service</div><div class="kg-bookmark-description">Verify bucket owner for Amazon S3 operations.</div><div class="kg-bookmark-metadata"><img class="kg-bookmark-icon" src="https://docs.aws.amazon.com/assets/images/favicon.ico" alt="Protecting against S3 attack vectors with Zero Trust Engineering"><span class="kg-bookmark-author">Amazon Simple Storage Service</span></div></div></a></figure><!--kg-card-begin: html--><table><thead><tr><th>Access method</th><th>Parameter for non-copy operations</th><th>Copy operation source parameter</th><th>Copy operation destination parameter</th></tr></thead><tbody><tr><td>AWS CLI</td><td><code>--expected-bucket-owner</code></td><td><code>--expected-source-bucket-owner</code></td><td><code>--expected-bucket-owner</code></td></tr><tr><td>Amazon S3 REST APIs</td><td><code>x-amz-expected-bucket-owner</code> header</td><td><code>x-amz-source-expected-bucket-owner</code> header</td><td><code>x-amz-expected-bucket-owner</code> header</td></tr></tbody></table><!--kg-card-end: html--><p>A previously successful call to <strong>PutObject</strong>, directed at an open bucket owned by the attacker, would now fail and return <strong>AccessDenied</strong>, because AWS verifies the bucket owner against the supplied header value for us.</p><p>You&apos;re using <strong>Zero Trust</strong> principles to verify whether your requests should in fact proceed, based on constraints you impose on the request itself.</p><figure class="kg-card kg-image-card"><img src="https://interactengineering.io/content/images/2021/08/bucketparking-denied.png" class="kg-image" alt="Protecting against S3 attack vectors with Zero Trust Engineering" loading="lazy" width="712" height="509" srcset="https://interactengineering.io/content/images/size/w600/2021/08/bucketparking-denied.png 600w, https://interactengineering.io/content/images/2021/08/bucketparking-denied.png 712w"></figure><p>This secures your application from being exploited with this attack vector, by embracing <strong>Zero Trust</strong> principles and forcing the requests to validate the destination, instead of assuming that the bucket is always safe to write to as long as you have the permission to do 
so.</p><p><strong>Disclaimer:</strong> At the time of writing, many open-source S3 storage adapters and providers for various systems don&apos;t implement the expected-bucket-owner functionality, and can therefore be vulnerable to the exploit above.</p><hr><h2 id="subdomain-takeover">Subdomain Takeover</h2><p>This is a similar type of attack, where the attacker creates buckets with very specific names to attack otherwise trusted sites. It is largely opportunistic in nature and relies on misconfigurations being present.</p><p>Imagine you have a site that uses S3 static hosting to load your assets. Now imagine you have several of these. Each S3 bucket has a CNAME, allowing it to be served on your trusted domain.</p><figure class="kg-card kg-image-card"><img src="https://interactengineering.io/content/images/2021/08/Subdomain-takeover.png" class="kg-image" alt="Protecting against S3 attack vectors with Zero Trust Engineering" loading="lazy" width="560" height="462"></figure><p>Sometime in the future, you delete a bucket but forget to clean up the CNAME.</p><figure class="kg-card kg-image-card"><img src="https://interactengineering.io/content/images/2021/08/Subdomain-takeover-missing-buckets.png" class="kg-image" alt="Protecting against S3 attack vectors with Zero Trust Engineering" loading="lazy" width="557" height="461"></figure><p>Since S3 bucket names are globally shared, the name is released back into the wild upon deletion.</p><p>Now an attacker can simply take that bucket name, create their own bucket in the same region, and upload malicious files which are now served from your &quot;trusted site/domain&quot;.</p><figure class="kg-card kg-image-card"><img src="https://interactengineering.io/content/images/2021/08/Subdomain-takeover-attack.png" class="kg-image" alt="Protecting against S3 attack vectors with Zero Trust Engineering" loading="lazy" width="574" height="611"></figure><p>This can be utilized
for:</p><ul><li>XSS</li><li>Phishing</li><li>Bypassing domain security</li><li>Stealing sensitive user data, cookies, etc.</li></ul><p>For example, the US Department of Defense suffered from this vulnerability.</p><figure class="kg-card kg-bookmark-card"><a class="kg-bookmark-container" href="https://hackerone.com/reports/918946"><div class="kg-bookmark-content"><div class="kg-bookmark-title">U.S. Dept Of Defense disclosed on HackerOne: Subdomain takeover due...</div><div class="kg-bookmark-description">**Summary:**An unclaimed Amazon S3 bucket on &#x2588;&#x2588;&#x2588;&#x2588;&#x2588;&#x2588;&#x2588;&#x2588;&#x2588; gives an attacker the possibility to gain full control over this subdomain. **Description:**`&#x2588;&#x2588;&#x2588;&#x2588;&#x2588;&#x2588;&#x2588;` pointed to an S3 bucket that did no...</div><div class="kg-bookmark-metadata"><span class="kg-bookmark-author">HackerOne</span></div></div><div class="kg-bookmark-thumbnail"><img src="https://profile-photos.hackerone-user-content.com/variants/000/016/064/46cd0286b1fa224aaa2cb9dfaaca9fa22b5b80b2_original.png/5050d9689b90aee3f5bcd28e0e44e43067b7f21994f12447c87bef07e5a33711" alt="Protecting against S3 attack vectors with Zero Trust Engineering"></div></a></figure><h3 id="solution-1">Solution</h3><p>First and foremost, make sure you have a good S3 bucket housekeeping process, and ensure that your buckets are accounted for.</p><p>Keep an inventory of any buckets which have CNAMEs pointing at them - if you&apos;re gathering this information retrospectively, look first for buckets that have Static Website Hosting enabled.</p><p>Audit your DNS entries, and ensure you don&apos;t have any CNAME records pointing at non-existent (or no-longer-owned) S3 buckets.</p><p><em>This can be done in an automated fashion, by using the Route53 APIs to load all Hosted Zone records, comparing them to the S3 buckets in your account, and flagging any which don&apos;t align.</em></p><p>It comes down to having strong
change management in place for your assets and DNS hosted zones (usually a forgotten or poorly controlled part of the infrastructure).</p><p>Verification of any change is a must:</p><ul><li>verification to ensure correct decommissioning</li><li>verification to ensure the correctness of changes</li><li>verification to ensure the correctness of an implementation</li><li>etc. </li></ul><p>Code reviews and Pull Requests are similarly used to verify someone&apos;s work and changes.</p><p>Infrastructure processes and changes should undergo the same level of scrutiny (the zero trust principle): you&apos;ll find mistakes and opportunities for improvement, and eliminate a huge source of entropy that may lead to security vulnerabilities.</p><h2 id="conclusion">Conclusion</h2><p>Zero Trust Engineering principles can bring a lot of security benefits across the board, as demonstrated by the S3 examples above.</p><p>With the Bucket Parking example, AWS thankfully provides a simple way to verify the bucket owner and prevent accidental file funnelling to external accounts.</p><p>With Subdomain Takeover, a stronger process and housekeeping are required to keep things in order. Zero Trust principles are a great start: simply adding a deliberate verification step after any change will go a long way.</p>]]></content:encoded></item><item><title><![CDATA[How we adopted a cell-based architecture]]></title><description><![CDATA[<p>You&apos;re probably familiar with horizontally scaling stateless applications. Most likely running containerized microservices under the hood, powered by a database engine, and serving traffic through a load balancer. A fairly standard recipe for scale, right? 
</p><hr><h2 id="a-short-story">A Short Story</h2><p>You&apos;re working for a successful SaaS business with</p>]]></description><link>https://interactengineering.io/cell-based-architecture-aws/</link><guid isPermaLink="false">610a59afae97712d02ac34f7</guid><dc:creator><![CDATA[Daniel Wardin]]></dc:creator><pubDate>Wed, 05 Oct 2022 09:25:00 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1603085429201-64dadaec4061?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwxMTc3M3wwfDF8c2VhcmNofDN8fHN3aW1taW5nJTIwcG9vbHN8ZW58MHx8fHwxNjI4MDY4NDE1&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1603085429201-64dadaec4061?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwxMTc3M3wwfDF8c2VhcmNofDN8fHN3aW1taW5nJTIwcG9vbHN8ZW58MHx8fHwxNjI4MDY4NDE1&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=2000" alt="How we adopted a cell-based architecture"><p>You&apos;re probably familiar with horizontally scaling stateless applications. Most likely running containerized microservices under the hood, powered by a database engine, and serving traffic through a load balancer. A fairly standard recipe for scale, right? </p><hr><h2 id="a-short-story">A Short Story</h2><p>You&apos;re working for a successful SaaS business with a superb product offering - customers are queuing around the block to use your services. 
Your product is great, you have a scalable architecture, the sales team is full of closers, and you have a strong, committed engineering team to grow the offering - you&apos;re on to a winner!</p><figure class="kg-card kg-image-card"><img src="https://interactengineering.io/content/images/2021/09/image-1.png" class="kg-image" alt="How we adopted a cell-based architecture" loading="lazy" width="1280" height="720" srcset="https://interactengineering.io/content/images/size/w600/2021/09/image-1.png 600w, https://interactengineering.io/content/images/size/w1000/2021/09/image-1.png 1000w, https://interactengineering.io/content/images/2021/09/image-1.png 1280w" sizes="(min-width: 720px) 720px"></figure><p>Time goes by. With your trusty architecture by your side, you&apos;re scaling smoothly and without any real issues - you&apos;re serving more customers than ever. The infrastructure tackles a growing customer base each day - it doesn&apos;t even break a sweat!</p><p>But then one day, while minding your own business (working on the next killer feature), you see metrics on your dashboards starting to tickle amber territory. Then a buzz. A rapid uptick in the number of emails in your inbox. PagerDuty has broken through your &quot;Do Not Disturb&quot; setting.</p><p>Confusion ensues. You hurriedly jump onto the infrastructure, load the dashboards, and you see your database engine in the red zone! It&apos;s reaching its capacity, and the headroom is shrinking very fast. </p><p>Terraform? Where we&apos;re going, we don&apos;t need <em>Terraform.</em></p><p>ClickOps to the rescue.</p><p>You&apos;ve already upscaled the server several times, so you decide to spin up another one and move half of the data over to spread the load.</p><p>The dashboards promptly return to green, PagerDuty calms down - you&apos;ve saved the day! </p><p>You wipe the sweat off your brow, lower your shoulders from your ears, and relax in your chair. 
You mull over the issue in your mind, but quickly brush this off as <em>just one of those days</em> - the issue was rather simple at heart, you think: more resources were required to deal with the load, that&apos;s all. All is well again - the fire has been put out - you are a hero.</p><figure class="kg-card kg-image-card"><img src="https://interactengineering.io/content/images/2021/09/image-2.png" class="kg-image" alt="How we adopted a cell-based architecture" loading="lazy" width="1065" height="796" srcset="https://interactengineering.io/content/images/size/w600/2021/09/image-2.png 600w, https://interactengineering.io/content/images/size/w1000/2021/09/image-2.png 1000w, https://interactengineering.io/content/images/2021/09/image-2.png 1065w" sizes="(min-width: 720px) 720px"></figure><p>Out of the corner of your eye, you notice that the database servers begin to struggle again. <em><u><strong>Huh!?</strong></u></em> You repeat the exercise and shard the databases across more servers, and the problem goes away. </p><p>...how strange...this came a lot faster than you expected...</p><p>Harmony returns.</p><p>The next morning, you open your laptop ready for another routine software release. You build new images, roll them into production, perform your verification checks, and consider the job done.</p><p>But just a few hours later, as users begin to use the new features, you spot a random web server biting the dust in a flash of red on the dashboard. Must be a coincidence! A few seconds later, you see another server also meet its fate - how bizarre! 
In a few short minutes, all web servers are dead, while new servers valiantly charge forth into the vanguard to absorb fresh waves of users - only to perish in an even shorter time frame - it&apos;s a bloodbath!</p><p>The dashboards have been flooded with crimson - like the Overlook Hotel&apos;s faulty lift - sites are inaccessible, and all customers have just suffered a huge outage.</p><p>All hands are on deck to understand what happened, and you decide to revert the release, which mercifully resolves the issue. No one quite understands the root cause. The engineers put a hold on their current work to deep-dive into the logs and access patterns, and see if they can spot anything. After a little digging, they strike gold!</p><p>There was an issue in the new release, triggered only by a specific series of actions, which kills the process and, by extension, the web server - the patch is easy, and the new version no longer brings production to its knees. The junior engineers breathe a sigh of relief. The seniors continue to look worried - how did such an issue cascade across all of production?</p><p>In this case the estate fell victim to the so-called &quot;poison pill&quot; scenario, wherein a fault triggered by user behavior brings down one instance, and as the user continues to use the system in the same way (or retries their actions), it brings down the other servers now trying to compensate for their fallen colleague. The contagion quickly infects each server until it brings the whole estate down. Nobody is quite sure what made this issue so transmissible.</p><p>Before anyone can get to the bottom of it, a few weeks later your application locks up out of the blue! No one can access the site, web boxes are not particularly strained, database servers still have plenty of headroom... How bizarre! You start digging around and notice you can&apos;t connect to SQL Server at all, but the underlying server itself is responding and working well. 
You dig deeper and find that SQL connection limits were exceeded when traffic reached its peak.</p><p>Little did you know that, with the new database server topology, you&apos;d introduced a source of problems that simply didn&apos;t exist before. Lesson learned the hard way!</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1568430462989-44163eb1752f?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwxMTc3M3wwfDF8c2VhcmNofDF8fHdoYWxlfGVufDB8fHx8MTYzMjQ4MDY3MQ&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=2000" class="kg-image" alt="How we adopted a cell-based architecture" loading="lazy" width="3008" height="2000" srcset="https://images.unsplash.com/photo-1568430462989-44163eb1752f?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwxMTc3M3wwfDF8c2VhcmNofDF8fHdoYWxlfGVufDB8fHx8MTYzMjQ4MDY3MQ&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=600 600w, https://images.unsplash.com/photo-1568430462989-44163eb1752f?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwxMTc3M3wwfDF8c2VhcmNofDF8fHdoYWxlfGVufDB8fHx8MTYzMjQ4MDY3MQ&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=1000 1000w, https://images.unsplash.com/photo-1568430462989-44163eb1752f?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwxMTc3M3wwfDF8c2VhcmNofDF8fHdoYWxlfGVufDB8fHx8MTYzMjQ4MDY3MQ&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=1600 1600w, https://images.unsplash.com/photo-1568430462989-44163eb1752f?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwxMTc3M3wwfDF8c2VhcmNofDF8fHdoYWxlfGVufDB8fHx8MTYzMjQ4MDY3MQ&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=2400 2400w" sizes="(min-width: 720px) 720px"><figcaption>Photo by <a href="https://unsplash.com/@toddcravens?utm_source=ghost&amp;utm_medium=referral&amp;utm_campaign=api-credit">Todd Cravens</a> / <a href="https://unsplash.com/?utm_source=ghost&amp;utm_medium=referral&amp;utm_campaign=api-credit">Unsplash</a></figcaption></figure><p>Your architecture has become a bit of a whale by now, 
with hundreds of instances being balanced very carefully as you serve hundreds of millions of requests. All of these servers, serving traffic in a round-robin fashion and randomly establishing connections to all the different SQL boxes, quickly reach the connection limit, preventing any new connections from being established. Requests awaiting new connections block the server request queues, which quickly become saturated and bring down the service - a self-inflicted denial of service.</p><p>Things are not looking good. You run some scripts to kill SQL connections, and you reduce the connection pool lifetime in the application, along with the max pool size, to restore service and bring the application back to life.</p><p>Time to rethink, and accept things for what they are - the architecture has become the problem and is no longer suitable. As with grief, in production engineering the first step is <u>acceptance</u>: Murphy&apos;s law always applies.</p><p>Your enormous business success has taken you on a wild journey that has left you with a behemoth of an infrastructure to manage, where each day can bring new challenges (which may be exhilarating for the problem-solvers inside us), as each day of growth takes you into unknown, uncharted territory.</p><p>Things that performed well in the past start to struggle, assumptions which held true before are no longer valid, lessons are learned the hard way, issues and outages become more frequent, and you come to the realization that you must make some fundamental changes.</p><p>You begin a new journey of research to discover what could be done to increase fault tolerance and isolation, and to return the service to healthy operating norms.</p><hr><h2 id="enter-cell-based-architecture">Enter Cell-Based Architecture</h2><p>We are a very successful and rapidly growing organization, and this constantly pushes us to get better, to change things, and to improve. 
As our infrastructure started to outgrow itself and become a challenge to manage, we began looking at new approaches and learning from the best, to move to a truly global-scale application and remove the scaling limitations of the prior approach. As a wise man once said, at scale, everything fails all the time - there is no getting around that: you must accept it and build your application accordingly, minimizing the blast radius, increasing fault tolerance, and focusing on improved recovery and reliability. These became our guiding principles in selecting a new architecture.</p><p>Our research took us to Cell-Based Architecture, which AWS has adopted widely for planet-scale, resilient services. The premise is simple and rooted in biological systems: a natural way to grow, and to become resilient to failures via isolation.</p><p>Here are some fantastic resources we found in our research.</p><hr><p><strong>Physalia: Cell-based Architecture to Provide Higher Availability on Amazon EBS</strong> - presented by Werner Vogels - AWS CTO</p><figure class="kg-card kg-embed-card kg-card-hascaption"><iframe width="200" height="113" src="https://www.youtube.com/embed/6IknqRZMFic?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><figcaption>Cell-Based Architecture at AWS</figcaption></figure><p><strong>AWS - How to scale beyond limits with cell-based architectures</strong></p><figure class="kg-card kg-embed-card kg-card-hascaption"><iframe width="200" height="113" src="https://www.youtube.com/embed/HUwz8uko7HY?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><figcaption>AWS - How to scale beyond limits with cell-based architectures</figcaption></figure><p><strong>AWS Aurora also relies on cell-based architecture </strong><a 
href="https://www.allthingsdistributed.com/2019/03/amazon-aurora-design-cloud-native-relational-database.html">https://www.allthingsdistributed.com/2019/03/amazon-aurora-design-cloud-native-relational-database.html</a></p><p><strong>AWS re:Invent deck on reducing blast radius with cell-based architectures</strong></p><p><a href="https://d1.awsstatic.com/events/reinvent/2019/REPEAT_1_Reducing_blast_radius_with_cell-based_architectures_ARC411-R1.pdf">https://d1.awsstatic.com/events/reinvent/2019/REPEAT_1_Reducing_blast_radius_with_cell-based_architectures_ARC411-R1.pdf</a></p><hr><p>A large ecosystem becomes a bit like a whale: very large, not very agile, hard to handle - and if something starts going wrong, you&apos;ve got a huge problem.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://interactengineering.io/content/images/2021/10/image.png" class="kg-image" alt="How we adopted a cell-based architecture" loading="lazy" width="800" height="600" srcset="https://interactengineering.io/content/images/size/w600/2021/10/image.png 600w, https://interactengineering.io/content/images/2021/10/image.png 800w" sizes="(min-width: 720px) 720px"><figcaption>Moby Dick</figcaption></figure><p>A cell-based architecture is more akin to a pod of well-behaved dolphins - life is easier, and when you do have issues, they&apos;re on a much smaller scale and easier to manage. 
</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.unsplash.com/photo-1533336480081-5e9807b88b36?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwxMTc3M3wwfDF8c2VhcmNofDl8fGRvbHBoaW5zfGVufDB8fHx8MTYzNDAzNjI4NQ&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=2000" class="kg-image" alt="How we adopted a cell-based architecture" loading="lazy" width="4000" height="3000" srcset="https://images.unsplash.com/photo-1533336480081-5e9807b88b36?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwxMTc3M3wwfDF8c2VhcmNofDl8fGRvbHBoaW5zfGVufDB8fHx8MTYzNDAzNjI4NQ&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=600 600w, https://images.unsplash.com/photo-1533336480081-5e9807b88b36?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwxMTc3M3wwfDF8c2VhcmNofDl8fGRvbHBoaW5zfGVufDB8fHx8MTYzNDAzNjI4NQ&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=1000 1000w, https://images.unsplash.com/photo-1533336480081-5e9807b88b36?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwxMTc3M3wwfDF8c2VhcmNofDl8fGRvbHBoaW5zfGVufDB8fHx8MTYzNDAzNjI4NQ&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=1600 1600w, https://images.unsplash.com/photo-1533336480081-5e9807b88b36?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwxMTc3M3wwfDF8c2VhcmNofDl8fGRvbHBoaW5zfGVufDB8fHx8MTYzNDAzNjI4NQ&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=2400 2400w" sizes="(min-width: 720px) 720px"><figcaption>Photo by <a href="https://unsplash.com/@courtniebt13?utm_source=ghost&amp;utm_medium=referral&amp;utm_campaign=api-credit">Courtnie Tosana</a> / <a href="https://unsplash.com/?utm_source=ghost&amp;utm_medium=referral&amp;utm_campaign=api-credit">Unsplash</a></figcaption></figure><h2 id="benefits">Benefits</h2><p>There are so many benefits to running a cell-based architecture (or a similar setup).</p><p>If you do it right, all cells are always dealing with a deterministic amount of load, which can be planned for and stress-tested, and you can train for 
real-life activities, as everything becomes very deterministic and manageable.</p><p>Health thresholds and load thresholds can be well defined, which in turn will guide your judgment in creating new cells. You&apos;re no longer venturing into the unknown each time you add more customers to the environment.</p><p>You get a huge benefit from fault isolation - poison pill scenarios can&apos;t take down your whole infrastructure. In the worst case, only a single cell is affected, and if that cell is small enough, this should be a small fraction of your overall traffic. The noisy neighbor impact is also greatly diminished.</p><p>If the cells are small and manageable (if not, you probably want to subdivide further), you can spin up new ones quite quickly in case you need to scale.</p><p>Scaling the capacity (e.g. web servers) within an individual cell also becomes much less time-sensitive. In a &quot;whale&quot; type infrastructure, if a server takes 5 minutes from the start of scaling to serving traffic, then beyond a certain point auto-scaling will not be able to keep up with the speed and growth gradient of incoming traffic. If this happens in an individual cell, it&apos;s usually a fraction of the growth and impact - the 5 minutes for a server to come online is still the same, so the time to scale remains the same, but the demand and impact are greatly reduced and well under control.</p><p>Releases can be rolled out cell-by-cell, rather than all at once, to reduce risk and isolate any faults. You may say: Dan, but we use Rolling Updates, so surely that&apos;s okay? It&apos;s certainly better, but you are still changing the entire environment, so there is always a risk associated with it - doing rolling updates cell-by-cell takes it to another level.</p><p>If you need to roll back an individual cell, it&apos;s much faster than rolling back the whole environment. 
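</p><p>As a sketch, a cell-by-cell rollout with per-cell verification might look like the following - the cell names and the deploy/health-check/rollback helpers here are hypothetical stand-ins for real tooling, not our actual scripts:</p>

```python
import time

# Hypothetical cell inventory and tooling stubs -- replace with your real
# deployment pipeline, dashboard queries, and alarm checks.
CELLS = ["cell-eu-1", "cell-eu-2", "cell-us-1"]

def deploy(cell: str, version: str) -> None:
    print(f"deploying {version} to {cell}")

def is_healthy(cell: str) -> bool:
    # e.g. compare latency and error-rate metrics against known-good baselines
    return True

def rollback(cell: str, version: str) -> None:
    print(f"rolling {cell} back from {version}")

def rollout(version: str, soak_seconds: float = 0) -> bool:
    """Deploy cell-by-cell, verifying each cell before touching the next,
    so a bad release is contained to a single cell."""
    for cell in CELLS:
        deploy(cell, version)
        time.sleep(soak_seconds)  # let real traffic exercise the new build
        if not is_healthy(cell):
            rollback(cell, version)
            return False  # stop the rollout; blast radius is one cell
    return True

print(rollout("v2.0"))
```

<p>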
Just make sure you have proper verification checks in place, and that you know what healthy (and, conversely, unhealthy) looks like.</p><p>It also seems to be a fairly limitless way to scale things (at least for now) - if it works for AWS at their scale, it&apos;ll likely be fine for us for a very long time.</p><p>I personally find it quite fascinating that some of the most revolutionary approaches such as these are rooted in biological systems. Nature is amazing!</p><h2 id="how-we-did-it">How we did it</h2><p>Our move of the whole infrastructure to a cell-based model was quite interesting, and was carried out over a few days.</p><p>It was purely an infrastructure undertaking and required no code changes. No need to change frameworks or languages, or move everything to microservices. We kept our joyful monolith as is, and approached it as a logistics challenge.</p><p>Once all the research, planning, and testing were concluded, we were ready to take it to production.</p><h3 id="day-1">Day 1</h3><p>First, we created an inventory of all the CNAME records we manage for our customer sites, and grouped them by destination load balancer (we manage lots of them, as AWS only supports 25 SSL certs per load balancer).</p><p>Once we knew that, we looked at which SQL Servers each customer was hosted on.</p><p>We then decided that each SQL Server would become the heart of a cell, with its own server fleet and everything else it needs to function (creating an isolated vertical slice through our architecture).</p><p>This is what we were going for at a very high level - the cell you belong to was dictated by the SQL fleet you were hosted on and the CNAME your site was pointed at (both need to match up).</p><figure class="kg-card kg-image-card"><img src="https://interactengineering.io/content/images/2021/10/Cell-based-architecture.drawio.png" class="kg-image" alt="How we adopted a cell-based architecture" loading="lazy" width="740" height="441" 
srcset="https://interactengineering.io/content/images/size/w600/2021/10/Cell-based-architecture.drawio.png 600w, https://interactengineering.io/content/images/2021/10/Cell-based-architecture.drawio.png 740w" sizes="(min-width: 720px) 720px"></figure><p>Each cell would be small, and well within healthy operating parameters.</p><p>To split all of our sites, and their load, across the cells, we needed to define target metrics for each cell - healthy operating parameters. We already knew what healthy latency, error rates, etc. looked like from the years of data we&apos;d monitored in our production environment.</p><p>It came down to a few simple calculations to get us started:</p><ul><li>A maximum number of customers hosted per cell - we don&apos;t want an excessive number of databases per cell.</li><li>A maximum total number of daily requests a cell should cater for - we knew how many requests were made per site, how many each web server could handle, and how quickly requests could ramp up before outpacing the web instance warm-up time.</li></ul><p>We came up with 2 well-defined numbers (which we continuously tune as we learn more about our architecture). 
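</p><p>To illustrate the kind of sizing arithmetic involved - the per-cell limits below are purely hypothetical figures for the sketch, not our production numbers:</p>

```python
import math

# Hypothetical per-cell upper limits -- illustrative only.
MAX_SITES_PER_CELL = 40
MAX_DAILY_REQUESTS_PER_CELL = 50_000_000

def cells_needed(total_sites: int, total_daily_requests: int,
                 headroom: float = 0.15) -> int:
    """Cells required to stay under both limits, inflated by 10-20% headroom."""
    by_sites = total_sites / MAX_SITES_PER_CELL
    by_requests = total_daily_requests / MAX_DAILY_REQUESTS_PER_CELL
    return math.ceil(max(by_sites, by_requests) * (1 + headroom))

# e.g. 1000 sites serving 900M requests/day: 25 cells by the site limit,
# 18 by the request limit -- take the larger figure, then add headroom.
print(cells_needed(1000, 900_000_000))  # -> 29
```

<p>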
These formed our upper limits per cell, which we considered healthy.</p><p>We ran some basic maths on these 2 figures to arrive at the number of cells we needed - inflated by 10-20% for additional headroom.</p><p>We started to set up the underlying infrastructure for these cells, alongside the existing architecture.</p><p>A few hours later, once everything was set up, we started to move the first sites (sandbox sites) to the cells, to build confidence and iron out potential issues (you can&apos;t be too careful).</p><p>The process was simple: mirror the database and files to the new cell fleet, then remap the CNAME (short TTL) to complete the move.</p><p>Everything went great.</p><p>We continued with several dozen more sites, and then wrapped up for the day.</p><h3 id="day-2">Day 2</h3><p>Day 2 was focused primarily on mirroring data, moving sites and associated SSL certs, and remapping CNAMEs. The team was divided into 3 squads, split by responsibility:</p><ul><li>taking backups, and mirroring data to destination cells</li><li>ensuring certs were correctly installed and making CNAME changes</li><li>verifying sites to ensure all was moved correctly</li></ul><p>By the end of day 2, we had moved over 80% of our EU estate to the cell-based architecture, without any issues.</p><h3 id="day-3-5">Day 3-5</h3><p>These days were focused on finalizing the EU re-architecture, and applying the same methodology to the US architecture.</p><p>The original infrastructure was left intact (although drastically scaled down in capacity) until all the migrations were completed.</p><p>Upon successful completion of all EU and US moves, the original &quot;whale&quot;-style infrastructure was deprecated and torn down.</p><h2 id="new-challenges">New Challenges</h2><p>Whilst the new architecture brings a LOT of great benefits, it does introduce new challenges that are worth considering.</p><p>We had to adjust our reporting pipelines as we now had lots of &apos;mini&apos; 
deployments, instead of a few &apos;large&apos; ones. Ensuring configuration consistency, data ingestion consistency, alerting, and adjusting the dashboards to reflect this took a bit of effort and planning - this was largely a one-off effort. We still discover better ways of doing things, but that&apos;s a part of the learning journey.</p><p>The release process needed to be adjusted to cater for deployments to individual cells, rolling out cell by cell rather than to all cells at once.</p><p>We came up with the health targets/target upper limits for cells, focused on 2 main figures - the number of sites, and the total number of requests served per cell. Over time, usage patterns change as customers grow and scale, or their engagement improves as a result of our strategic help and services. This means the starting metrics for each cell slowly drift and change over time.</p><p>This required new dashboards and pipelines to be built to keep track of these metrics and prompt us when changes need to happen - rebalancing a cell to ensure we stay within the upper limits.</p><h2 id="results">Results</h2><p>Overall, the journey was really worthwhile, and we found a lot of benefits we didn&apos;t anticipate from this move.</p><ul><li>We noticed the overall latency metrics (average, p75, p95, and p99) were on average 35% lower on each cell.
This was mainly attributed to less horizontal hopping, and to each cell serving far fewer databases.</li><li>Transient error rates dropped by over 50% - whilst already low prior to the move, errors related to networking and other transient issues were more than halved.</li><li>Ad-hoc spikes in traffic due to large organized events are handled much better within each cell, due to better overall headroom and less noisy neighbor impact.</li><li>The frequency of issues related to latency, error rates, and the like has been dramatically reduced.</li></ul><p>There is still a lot to learn and optimize, but our experience with the cell-based architecture so far has been excellent!</p>]]></content:encoded></item><item><title><![CDATA[Distributed Logging & Tracing within AWS]]></title><description><![CDATA[<p><em>Picture the scene:</em></p><p>It&apos;s 10:30pm on Christmas Eve. You&apos;re wrapped up warm in this year&apos;s hilarious Christmas jumper (it got plenty of laughs from people at the office), you&apos;ve had <em>just</em> <em>enough</em> mince pies but don&apos;t want to move</p>]]></description><link>https://interactengineering.io/distributed-tracing-within-aws/</link><guid isPermaLink="false">610a6e90ae97712d02ac35a7</guid><dc:creator><![CDATA[Daniel Wardin]]></dc:creator><pubDate>Wed, 06 Jul 2022 07:51:00 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1525906336592-11c866dd1d4a?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwxMTc3M3wwfDF8c2VhcmNofDV8fGxvZ3N8ZW58MHx8fHwxNjI4Nzc0MDgy&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1525906336592-11c866dd1d4a?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwxMTc3M3wwfDF8c2VhcmNofDV8fGxvZ3N8ZW58MHx8fHwxNjI4Nzc0MDgy&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=2000" alt="Distributed Logging &amp; Tracing within AWS"><p><em>Picture the scene:</em></p><p>It&apos;s 10:30pm on Christmas Eve.
You&apos;re wrapped up warm in this year&apos;s hilarious Christmas jumper (it got plenty of laughs from people at the office), you&apos;ve had <em>just</em> <em>enough</em> mince pies but don&apos;t want to move too much in case you realise you&apos;ve already gained more timber than the Christmas tree in the corner (the twinkling light from which is reflecting off the TV in front of you as you watch Hans Gruber falling from the top of Nakatomi Plaza.) You&apos;ve notched up at least 100 rounds of Mariah Carey&apos;s most enduring hit, and the Christmas break is stretching out in front of you like the endless school holidays you remember from so long ago. Bliss. </p><p>And then, from the deep recesses of the sofa you <em>feel</em> it. Ever so quietly at first - you think it could be your imagination. Was that some phantom vibration? </p><p>Bzzz. There it is again. This time you <em>hear</em> it too. You fumble for a moment, hand down the side of the cushion before grasping it. Your phone is springing to life. A lock-screen flooded with that damned cheery multi-coloured icon and a series of automated messages.</p><p>And now you regret signing up to being on call this Christmas. You brush aside crumbs on your chest from the pies of early-evening and clear your throat to try and enter <em>the zone</em>. You turn to your partner and utter...</p><blockquote>I&apos;m sorry dear, it&apos;s time for me to troubleshoot our web cluster.</blockquote><hr><p>Did that bring you out in a cold sweat? Perhaps it triggered a memory you&apos;ve tried your hardest to forget.</p><p>Nobody <em>wants</em> to be thrust into troubleshooting mode when you&apos;re 7 mince pies deep, let alone when you&apos;re just about at peak relaxation (before the in-laws have arrived.) But as we know, things break. And they break at very unpredictable times. Having a well-defined process for managing incidents and an on-call rota that works for everyone is key to long-term success. 
And once you&apos;re online and ready to troubleshoot, it&apos;s essential that you have the tools, data, and access you need to resolve issues as quickly and as painlessly as possible.</p><p>So, let&apos;s dive into one of the most important parts of operating always-on software at scale: logging and tracing of large-scale software platforms. Specifically, we&apos;ll focus on introducing comprehensive logging and tracing to a non-cloud-native application within AWS. </p><hr><h2 id="why">Why?</h2><p>Thankfully most troubleshooting can happen in normal working hours. Technical Support and Development teams can use logs to better understand issues that end-users might be facing. As services grow and it becomes increasingly costly and time-consuming to find the root causes of problems, logging is a key resource to help enrich your team&apos;s understanding. It can also be immensely useful in understanding how to better optimize your service to cut down on the cloud bill.</p><h2 id="what">What?</h2><p>Tracing usually refers to end-to-end mapping of requests through your infrastructure (normally across several services) so that you can peer into what goes on under the hood for each request. It is generally achieved by using a correlation ID of some sort to tie all metrics and logs together. Patterns and aggregations are then computed from these traces to provide high-level insights and metrics at each level of the infrastructure. This can help with understanding the behavior of the code or service under the given circumstances (e.g. identifying bugs or understanding oddities), troubleshooting bottlenecks and performance issues (you can see which steps accumulate the most time), and many other things.</p><p>Logging is the art of capturing useful information during the runtime of the application in order to help an observer understand what the application is doing - exposing the internal state and control flow for later analysis.
This can be used to troubleshoot the application, understand the nature of certain behaviors, or help with the auditability of past events. Unfortunately, the engineers writing the logs are rarely the ones who will rely on them when supporting the application. This can lead to logs that lack crucial information, leading to prolonged troubleshooting and resolution times.</p><hr><h2 id="interacts-approach">Interact&apos;s Approach</h2><p>Within our product stack we run a large multi-tenanted, monolithic .NET application which already has plenty of logging via Serilog. These logs are sent to ElasticSearch and are, in turn, consumed through Kibana. As a side note: I strongly recommend that you avoid AWS CloudWatch if you value your money - data ingestion costs at scale will outstrip your hosting bill!</p><p>Like most engineering teams we&apos;ve taken an iterative approach to implementing logging in this application - baking in our experience troubleshooting complex scenarios at scale - to enrich our logging and tracing techniques.
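</p><p>The core pattern - every log event carrying the request&apos;s correlation ID so the whole story can be pulled back out later - can be sketched as follows (shown in Python for brevity; in our stack this lives in the Serilog/.NET pipeline):</p><!--kg-card-begin: markdown-->
```python
import json
import logging
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("app")

def log_event(correlation_id: str, message: str, **context) -> str:
    """Emit one structured log event tagged with the request correlation ID."""
    event = {"CorrelationId": correlation_id, "Message": message, **context}
    line = json.dumps(event)
    logger.info(line)
    return line

# Every line logged for one request shares the same ID, so a single
# query in the log store recovers the full "story" of that request.
cid = str(uuid.uuid4())
log_event(cid, "request received", path="/api/documents")
log_event(cid, "request completed", status=200)
```
<!--kg-card-end: markdown--><p>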
The monolith is surrounded by several services, which generally receive their traffic via Load Balancer routing rules or other proxy services that split traffic based on path patterns.</p><p>If we were to map out a high-level request path into the system (in our case, using an AWS environment) it would look something like the diagram below:</p><p><em>(Architecture below assumes a certain setup, including Application Load Balancers, which relay the traffic to several web servers in a target group - generally, these would be controlled by the Auto Scaling Group.)</em></p><figure class="kg-card kg-image-card"><img src="https://interactengineering.io/content/images/2021/08/distributed-tracing.png" class="kg-image" alt="Distributed Logging &amp; Tracing within AWS" loading="lazy" width="841" height="311" srcset="https://interactengineering.io/content/images/size/w600/2021/08/distributed-tracing.png 600w, https://interactengineering.io/content/images/2021/08/distributed-tracing.png 841w" sizes="(min-width: 720px) 720px"></figure><ol><li>A request originates from a client device</li><li>The request reaches the Application Load Balancer (ALB)</li><li>The ALB checks its route table and evaluates the rules</li><li>If the rule results in a &quot;Forward traffic to a Target Group&quot; action, the target group determines which web server will receive this request</li><li>The web server application processes the request (eg.
IIS, Nginx, Apache, Express, etc.)</li><li>The application code performs the necessary operations on the request - this may involve reaching out to a number of underlying services (especially if the webserver layer is stateless in nature.)</li><li>The application generates a response and passes it back to the webserver application</li><li>The web server returns the response to the Application Load Balancer</li><li>The Application Load Balancer returns the response to the client device</li></ol><h3 id="application-logging">Application Logging</h3><p>Most of the logging in modern applications happens around <strong>step 6</strong>, and sometimes 5 and 7 if the relevant configuration is in place. Because of this, we typically rely on application or product engineers to write logging code, and without a rigorous set of guidelines for writing good logs we often either don&apos;t get any, or end up with low-value logging.</p><p>There are several problems here.</p><p>A lot of the time, a single log message is not particularly useful, and may simply tell you &quot;a problem exists&quot;, but not much beyond that. What is truly needed is more context and a lot more information around the request that caused it.</p><p>You want a few things to be present in your logging strategy:</p><ul><li>A centralized, easy-to-access log querying tool (eg. we use ElasticSearch and Kibana - <u>again, stay away from CloudWatch as a log store if you value your money!</u>) - it shouldn&apos;t be a quest to simply gain access to the logs - the easier it is, the more people will refer to them.</li><li>Correlation ID in the logs - so that you can tie individual log messages together into a story.
Think about 10k requests being logged simultaneously: you want to be able to tie them together into &quot;stories&quot; and follow each request.</li><li>Proper exception handling and logging - no empty catch statements, correct logging levels being used, alarms in place to notify the relevant people of issues, etc.</li><li>Ambient Context being captured - logging the variables, arguments, and anything else useful at the point of failure, to aid in understanding what was inside the function when it failed. Printing the exception alone is rarely enough to understand whether the issue was a poor implementation, bad data, or a cascaded issue from further up the call tree (<u>more on this and how we do it, in a future blog post</u>.)</li></ul><p>All of these underpin Interact&apos;s approach to logging.</p><p>We constantly talk about the Christmas Eve scenario and ask our engineers to think from that vantage point. What would you need to know in order to fix this as quickly as possible? What context is important? What if you&apos;d never worked on this part of the application and had no idea what the control flow was?</p><hr><h3 id="tracing-with-aws-x-ray">Tracing with AWS X-Ray</h3><p>AWS X-Ray provides tracing for distributed applications and yields lots of insights into how your application behaves in development/production, and importantly, how it interacts with other services that it relies on. It gathers information and infers relationships between services based on traffic routes across the AWS infrastructure, providing incredibly useful insights and recommendations that help you fine-tune your application.</p><p>If we refer back to the 9 steps outlined above, X-Ray covers steps 2-9 and can be an extremely powerful tool for debugging issues across multiple services.
It does not, however, replace the need for good application logging and should be used as a complementary service.</p><p>The integration with the codebase is very light-touch, and we&apos;ve found that the value it brings is well worth the effort.</p><figure class="kg-card kg-bookmark-card"><a class="kg-bookmark-container" href="https://aws.amazon.com/blogs/mt/discover-application-issues-get-notifications-aws-x-ray-insights/"><div class="kg-bookmark-content"><div class="kg-bookmark-title">Discover application issues and get notifications with AWS X-Ray Insights | Amazon Web Services</div><div class="kg-bookmark-description">Today, AWS X-Ray is pleased to announce the general availability of Insights, a feature that helps you proactively detect performance issues in your applications. AWS X-Ray helps developers and DevOps engineers analyze and debug production environments and distributed applications, such as those bui&#x2026;</div><div class="kg-bookmark-metadata"><img class="kg-bookmark-icon" src="https://a0.awsstatic.com/main/images/site/touch-icon-ipad-144-smile.png" alt="Distributed Logging &amp; Tracing within AWS"><span class="kg-bookmark-author">Amazon Web Services</span></div></div><div class="kg-bookmark-thumbnail"><img src="https://d2908q01vomqb2.cloudfront.net/972a67c48192728a34979d9a35164c1295401b71/2021/01/21/aws-management-governance-1260x586.png" alt="Distributed Logging &amp; Tracing within AWS"></div></a></figure><p>X-Ray focuses on high-level aggregates, trends, and relationships, but also allows you to drill into specific requests.</p><p>It&apos;s well-suited to microservice architectures wherein requests are fulfilled by distributed services: the architecture map clearly models the processes that happen within the system (you can see the request path between multiple services, and at which point it fails, which makes the troubleshooting process more visual). You can also use it with monolithic applications, where a single &quot;service&quot; performs most
of the logic.</p><p>Whereas in a microservice-based architecture you can quickly navigate the map and find the issue in a specific service which, if small enough, should be easy to troubleshoot and debug, in a monolith the issue is usually going to lie within the application (since this is where most of the magic happens.) What you need here is robust application logs that can be correlated with these additional tools, like X-Ray.</p><hr><h3 id="ops-dashboards">Ops Dashboards</h3><p>We run another instance of ElasticSearch and Kibana to monitor the health of the infrastructure and the overall estate. This contains metrics and information from all the instances, logs from select AWS services or on-instance agents, and a record of all requests that have made it through our AWS Application Load Balancers.</p><p>All of this data is used to create a plethora of dashboards and visualisations that clearly tell us how healthy the infrastructure is at any point in time, as well as multiple weeks into the past, which allows us to draw comparisons, find patterns, and analyse events after they have occurred (we don&apos;t lose that data post-event.)</p><p>The request data from the load balancers is heavily enriched at the point of ingestion. We can use things such as the tenant domain names to further determine and attach information about the database server used in the transaction, the database name, and specific site characteristics (this allows us to look at characteristics of healthy traffic while looking at shared infrastructure components to find patterns and root causes of issues faster.)
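</p><p>As a rough illustration, that enrichment step amounts to a lookup keyed on the tenant domain at ingestion time (the registry and field names below are hypothetical, made up for the sketch):</p><!--kg-card-begin: markdown-->
```python
# Hypothetical tenant registry -- in a real pipeline this would come from a config store.
TENANTS = {
    "acme.example.com": {"db_server": "eu-db-03", "db_name": "acme_prod"},
    "globex.example.com": {"db_server": "eu-db-01", "db_name": "globex_prod"},
}

def enrich(alb_record: dict) -> dict:
    """Attach tenant metadata to a raw ALB log record at the point of ingestion."""
    tenant_meta = TENANTS.get(alb_record.get("host", ""), {})
    return {**alb_record, **tenant_meta}

enriched = enrich({"host": "acme.example.com", "status": 200, "latency_ms": 42})
```
<!--kg-card-end: markdown--><p>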
We can cross-reference information in the ALB logs with other AWS information such as which specific web server served this request (we can see if any hosts have higher than normal latency or error rates), which AMI or software version the instance was running, which load balancer served the request, amongst many other things.</p><p>This makes for a very powerful tool to manage a huge deployment serving billions of web requests.</p><h3 id="consolidation-using-x-amzn-trace-id">Consolidation using X-Amzn-Trace-Id</h3><p>The final step is to tie it all together - to extract more value than a simple sum of the pieces.</p><p>As an example, what if I want to drill into a request that has resulted in a 500 response code? If I found it on the ops dashboard, I want to find the specific logs in the application logs that belong to this request only. If I was prompted by X-Ray about an increasing error rate, I may want to query the ops dashboard for other patterns that may be infrastructure-related, isolated to a specific customer or load balancer, or I may want to take some sample requests from X-Ray and see what was going on in the application at that point in time, and then circle back to X-Ray and the ops dashboard.</p><p>The real value comes from using all 3 tools together.</p><p>Being able to do this gives you powerful forensic capabilities, reduces the first-responder resolution time, and makes for a more transparent experience at scale. You can probe into the issue as it happens, or retrospectively after the fact (depending on how long you retain your data.)</p><p>The main identifier we share across all our logging, tracing, and ops systems is the HTTP header: </p><!--kg-card-begin: markdown--><p><code>X-Amzn-Trace-Id</code></p>
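<p>The header&apos;s value follows a documented shape - for example <code>Root=1-67891233-abcdef012345678912345678;Parent=53995c3f42cd8ad8;Sampled=1</code> (the IDs here are made up) - and splitting it into its fields is trivial:</p>
```python
def parse_trace_header(header: str) -> dict:
    """Split an X-Amzn-Trace-Id header into its semicolon-delimited fields."""
    return dict(field.split("=", 1) for field in header.split(";") if "=" in field)

# Hypothetical header value, following the documented Root=<version>-<epoch>-<random> shape.
trace = parse_trace_header(
    "Root=1-67891233-abcdef012345678912345678;Parent=53995c3f42cd8ad8;Sampled=1"
)
```
<p>The <code>Root</code> component is typically the piece you would index and cross-reference between systems.</p>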
<!--kg-card-end: markdown--><p><strong>Logging: </strong>We capture this in the application and enrich our Serilog logs with this value - allowing us to query by it in ElasticSearch and Kibana.</p><p><strong>Ops Dashboard: </strong>We capture this in the ops dashboards by ensuring we index this value from the Application Load Balancer logs. This allows us to pluck individual requests from the ops dashboards, which contain close to a billion requests in the index at any point in time. We can then use it to refer to the application logs, to inspect exactly what happened.</p><p><strong>AWS X-Ray:</strong> X-Ray uses this Trace ID automatically to track requests travelling through the AWS infrastructure and services, and models the traces based on it. As you drill into the X-Ray metrics and reports, you can take these Trace IDs and cross-reference them with the ops dashboard and the application logs, providing you with a full picture of what has happened.</p><hr><p>It is essential to have a robust observability, logging, and tracing strategy when operating at scale.</p><p>These methods have served Interact well, and we hope you have found this useful and actionable, can apply some of it to your own strategy, and benefit from our learnings and experience.</p><p>Now, back to mince pies and the only Christmas film that matters...</p>]]></content:encoded></item></channel></rss>