How we moved our entire storage layer to S3

Within our product ecosystem lives a large monolithic application that is responsible for the majority of the functionality our customers use every day. Like most technology companies that have undergone sustained, rapid growth over a number of years, this application houses a large number of development patterns, complex business logic, and a mix of tightly and loosely coupled components.

We run our multi-tenanted application on AWS, and as we’ve scaled to over 1,000 enterprise customers that rely on the product every day, we’ve continually upgraded our infrastructure into a highly redundant and performant architecture. At the core of this is a large cluster of EC2 instances that runs our primary application. Because it’s multi-tenanted, we can easily scale up and down based on customer traffic (wouldn’t you know it, we see cyclical peaks and troughs as the UK and the various US time zones wake up and fire up their intranets alongside their morning coffees!). This works very well, and the remaining elements of the product ecosystem scale in a similar fashion.

Storage

However, one aspect which was nagging the Engineering team for some time was our persistence layer. When the core application was first conceived (in a time before AWS), it was designed to run on a single-server deployment – most of our customers would run the application on their own local infrastructure. Customers would use IIS to host the application, and it would store all of the uploaded files alongside any tenant-specific files on the local disk – all was well. Over the years, the system grew, more modules were added, and complexity increased. The business kept growing and signed more customers.

Fast-forward to today and our product stack is hosted entirely on AWS across a large number of virtual machines and other infrastructure. We don’t enforce session affinity, which means that a single user likely interacts with hundreds of distinct web servers over the course of a day. At the application level this is completely fine, but at the persistence level we encounter some additional complications.

For a number of years, we’ve utilised shared NFS instances to act as a central store – accessible from all EC2 instances and backed by local caches on those boxes – and again, this has worked extremely well. But as we’ve continued to grow, these huge NFS instances have presented us with issues: kernel-level incompatibilities between AV software and the customised Linux kernels used by AWS, general problems of scale under such high traffic volumes, and the fact that they are a single point of failure.

Enter S3

The obvious choice for the next generation of persistence layer is S3. But unlike our move to NFS, this would present a number of additional complexities including:

  1. Low-level File API usage throughout the application – tens of thousands of instances of File.Read and File.Write would have to be inspected and replaced
  2. Reconciling internal APIs - uploads are served via an application controller which checks the current user’s permissions against the access rights, so we couldn’t simply serve raw S3 URLs for documents, assets, theme files, etc.
  3. Migration – how do we move the existing files to S3? We manage dozens of terabytes of data across the customer base
  4. Hot Swap – how do we move to S3 without any service interruption?
  5. Security - how do we keep each customer’s data isolated?

Low-Level File APIs

The most challenging and time-consuming part of the migration was always going to be finding and correctly updating all the references to the standard library File operations, without introducing bugs.

A quick inspection of the usage (File.Read, File.Write, etc.) showed that this would be a very large task - there were more than 14,000 instances of File.Read alone in the codebase.

After numerous whiteboarding sessions, and defining what problems we wanted to solve along the way, we decided that the best way to approach this was to create a new central library project dedicated to handling the persistence layer operations (access, writing, exists checks, permission checks, deletions, etc.)

Any file operation would then simply go through the new central library, and the library would internally decide how to handle the files (e.g. use local disk, use S3, use another technology, or even several at once).
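
To make the idea concrete, here is a minimal sketch of what such a library’s public surface could look like, with a local-disk backend shown as one possible implementation. The interface and type names are hypothetical illustrations rather than our actual code; the other sketches in this post build on the same interface.

```csharp
using System.IO;
using System.Threading.Tasks;

// Hypothetical public surface for a central persistence library (illustrative names,
// not our actual API). Callers deal only in tenant ids, relative paths and streams -
// never in File.* calls or the AWS SDK directly.
public interface IFileStore
{
    Task<Stream> OpenReadAsync(string tenantId, string path);
    Task WriteAsync(string tenantId, string path, Stream content);
    Task<bool> ExistsAsync(string tenantId, string path);
    Task DeleteAsync(string tenantId, string path);
}

// A local-disk backend that simply wraps System.IO. An S3 backend (or any other
// technology) implements the same contract, so callers stay storage-agnostic.
public sealed class LocalDiskFileStore : IFileStore
{
    private readonly string _root;

    public LocalDiskFileStore(string root) => _root = root;

    private string Resolve(string tenantId, string path) => Path.Combine(_root, tenantId, path);

    public Task<Stream> OpenReadAsync(string tenantId, string path) =>
        Task.FromResult<Stream>(File.OpenRead(Resolve(tenantId, path)));

    public async Task WriteAsync(string tenantId, string path, Stream content)
    {
        var target = Resolve(tenantId, path);
        Directory.CreateDirectory(Path.GetDirectoryName(target));
        using (var file = File.Create(target))
        {
            await content.CopyToAsync(file);
        }
    }

    public Task<bool> ExistsAsync(string tenantId, string path) =>
        Task.FromResult(File.Exists(Resolve(tenantId, path)));

    public Task DeleteAsync(string tenantId, string path)
    {
        File.Delete(Resolve(tenantId, path));
        return Task.CompletedTask;
    }
}
```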

This design decision had numerous benefits:

  • Can easily unit test the entire persistence layer (see the test sketch after this list)
  • Can easily introduce new storage backends in the future
  • Optimisations and changes (e.g. global encryption or compression) can be made in a single central library and take effect across the entire application
  • Consolidation and standardisation of file access logic
  • Introduces consistency in how files are handled
  • Easy for new developers to pick up and interact with the storage layer
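
As an example of the first point, with everything behind an interface the whole layer can be exercised against a throwaway local-disk (or in-memory) backend. A hypothetical xUnit test against the IFileStore sketch above might look like this:

```csharp
using System.IO;
using System.Text;
using System.Threading.Tasks;
using Xunit;

public class FileStoreTests
{
    [Fact]
    public async Task Write_then_read_round_trips_content()
    {
        // Any IFileStore implementation could be dropped in here - local disk, S3, a fake, etc.
        IFileStore store = new LocalDiskFileStore(
            Path.Combine(Path.GetTempPath(), Path.GetRandomFileName()));

        await store.WriteAsync("tenant-a", "docs/hello.txt",
            new MemoryStream(Encoding.UTF8.GetBytes("hello")));

        Assert.True(await store.ExistsAsync("tenant-a", "docs/hello.txt"));

        using (var reader = new StreamReader(await store.OpenReadAsync("tenant-a", "docs/hello.txt")))
        {
            Assert.Equal("hello", await reader.ReadToEndAsync());
        }
    }
}
```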

S3 Integration

The S3 integration brings about some interesting challenges - both from an infrastructure/cloud perspective, and a code implementation perspective.

  • Security configuration - what is secure?
  • Costs - without the right network setup, S3 traffic leaves your VPC for the public internet, meaning you pay for the data transfer in both directions
  • Limitations - S3 imposes throughput limits per prefix, which impacts the bucket and key structure
  • Secure interactions - e.g. S3 bucket ownership validation on write (see the sketch after this list)
  • Scanning files - running AV scans against files on a local disk is easy; how do we go about this in a serverless storage service like S3?
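
On the “secure interactions” point, the AWS SDK for .NET lets each request assert which account is expected to own the target bucket, so a write can never silently land in a bucket owned by someone else. A rough sketch (the bucket name and account id are placeholders):

```csharp
using System.IO;
using System.Threading.Tasks;
using Amazon.S3;
using Amazon.S3.Model;

public static class OwnershipValidatedWriter
{
    // Placeholder values for illustration only.
    private const string Bucket = "example-tenant-files";
    private const string ExpectedAccountId = "111111111111";

    public static async Task WriteAsync(IAmazonS3 s3, string key, Stream content)
    {
        var request = new PutObjectRequest
        {
            BucketName = Bucket,
            Key = key,
            InputStream = content,
            // The request is rejected if the bucket is not owned by the expected
            // account, guarding against misconfiguration and bucket squatting.
            ExpectedBucketOwner = ExpectedAccountId
        };

        await s3.PutObjectAsync(request);
    }
}
```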

Finding the right approach for us took a lot of research, small-scale experiments, and a lot of retrospection and refinement.

Some of the questions we asked were (hopefully, some are useful to anyone currently undergoing a similar journey):

  • As a SaaS product, do we use a bucket per customer, or a single bucket with a subfolder/prefix per customer to segregate and isolate data?
  • If we choose a bucket per customer, what naming convention do we use, given that bucket names live in a global namespace (not scoped to an individual AWS account)? This gives a nice level of isolation, but there is a limit of 1k buckets per AWS account
  • If we choose a subfolder/prefix per customer, how do we provide a secure isolation layer for accessing a single customer's files (e.g. so a path can't be modified to jump a few folders up and access another customer's data)? An S3 Access Point exposed to the VPC and scoped to a specific prefix allows for this level of isolation (see the sketch after this list)
  • Do we want the architecture to allow BYOK for bucket encryption?
  • To avoid additional network costs, do we use S3 Access Points to connect the buckets directly to the VPC accessing the files? As long as the traffic doesn't leave the AWS network it isn't charged, so keeping the traffic between your software and your S3 buckets inside the data centre can make things a lot cheaper
  • How do we ensure our backup strategy meets the policies on frequency and retention? Versioning can be used to capture file changes as a form of backup, so there is no need to take backups on a fixed interval. Provided the files don't change very frequently, this can reduce backup costs significantly, and combined with Lifecycle Policies you can tailor the retention periods and even move older file versions to a cheaper storage class.
  • How do we restore from a "backup" when needed? This can be done manually when targeting specific files, or en masse (options include pipelines such as https://aws.amazon.com/blogs/storage/point-in-time-restore-for-amazon-s3-buckets/, or tools like https://github.com/angeloc/s3-pit-restore)
  • etc.
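
To illustrate the prefix-per-customer route: the SDK accepts an S3 Access Point ARN anywhere a bucket name is expected, and the access point’s policy (not shown here) can be scoped to a single customer’s prefix and restricted to the VPC. The ARN, account id and key scheme below are hypothetical:

```csharp
using System.Threading.Tasks;
using Amazon.S3;
using Amazon.S3.Model;

public static class TenantScopedReader
{
    public static async Task<GetObjectResponse> GetAsync(IAmazonS3 s3, string tenantId, string path)
    {
        // Hypothetical access point ARN; its policy would pin access to this tenant's
        // prefix and to requests originating from inside our VPC.
        var accessPointArn =
            $"arn:aws:s3:eu-west-1:111111111111:accesspoint/tenant-{tenantId}";

        return await s3.GetObjectAsync(new GetObjectRequest
        {
            // Access point ARNs are accepted wherever a bucket name is expected.
            BucketName = accessPointArn,
            // Keys remain prefixed per tenant, so a path can never "jump up" into
            // another customer's data.
            Key = $"{tenantId}/{path}"
        });
    }
}
```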

It was a really interesting exercise and took a lot of solution planning - it is one of those exercises where relatively small decisions at the beginning restrict you and lock you into specific options later, with no easy way to change your mind.

Reconciling Internal APIs

Thankfully, the Pareto principle applies quite universally, and most of the requests triggering file operations were located in two controllers.

This made for a great starting point: modifying the logic underpinning those two controllers to use the new persistence layer would cover the vast majority of file operations in the system. 80% of the benefit for 20% of the work.

If we could get this to work, we could then tackle the remaining work at our leisure.

Naturally, in many mature systems like ours, there will be rich permission structures and existing paradigms for access logging, compression, asset caching, etc.

We wanted to integrate the new persistent storage library in a way that respected the existing logic, so that we didn't cause issues in all the other critical areas of the system (we wanted a controlled implementation that does not grow in scope uncontrollably), and in a way that does not require awareness of the underlying storage technology (the caller does not need to consider whether the files are stored on local disk, a network share, S3, Azure Blob Storage, or anywhere else).

Our controllers had various endpoints for handling different use cases.

  • Get Documents - permission oriented, don't want these cached on any intermediate networking devices en route to the user
  • Get Video Stream - you don't want to return the entire file before any playback starts (see the range-read sketch after this list)
  • Get Image Assets - want to ensure appropriate caching, compression, etc. happens
  • (there were a lot more, and a lot more complex CRUD endpoints to handle chunked operations, etc.)
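
For the video case flagged above, the underlying call the persistence layer might make is a ranged read, so only the requested slice of the object is transferred and playback can begin immediately. A simplified sketch (bucket and key handling omitted, not our production code):

```csharp
using System.Threading.Tasks;
using Amazon.S3;
using Amazon.S3.Model;

public static class VideoRangeReader
{
    // Fetch only the requested byte range of a video object; the SDK turns this
    // into an HTTP Range request, so playback can start before the file finishes
    // transferring. The response stream is then copied into a 206 Partial Content reply.
    public static async Task<GetObjectResponse> GetRangeAsync(
        IAmazonS3 s3, string bucket, string key, long from, long to)
    {
        return await s3.GetObjectAsync(new GetObjectRequest
        {
            BucketName = bucket,
            Key = key,
            ByteRange = new ByteRange(from, to)
        });
    }
}
```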

It would be very easy to just redirect all requests directly to S3 - we'd save ourselves some bandwidth on file transfers - but it would be insecure and inappropriate. Instead, we replaced the file access calls with equivalent calls to our new persistence layer library.

Whilst there, we also added a number of optimisations.

For example, when reading files and assets from the disk, and sending them back to the browser, you want to set the Content-Type correctly so that the browser and any receiving client can treat the file correctly.

Our previous implementation started by loading the beginning of the file into a stream (for reasons we won't get into here), and inspecting it to determine the file type. This was happening on read, which led to a lot of wasted processing - the type won't change unless the file changes.

So when implementing the new persistent storage layer, and in particular the S3 implementation, we abstracted the content type detection and brought it into the persistence layer library. When writing a file to S3, we use the existing mechanism to determine the content type, but we store the result in the metadata tags that can be set on S3 objects. On any subsequent read, we simply look the value up in a Dictionary instead of carrying out additional processing to figure it out again.
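
In rough terms, the write and read paths might look like the following. The detected content type is passed in from the existing detection logic, and the metadata key name is an illustrative choice rather than our actual one:

```csharp
using System.IO;
using System.Threading.Tasks;
using Amazon.S3;
using Amazon.S3.Model;

public static class ContentTypeAwareStore
{
    public static async Task WriteAsync(
        IAmazonS3 s3, string bucket, string key, Stream content, string detectedContentType)
    {
        var request = new PutObjectRequest
        {
            BucketName = bucket,
            Key = key,
            InputStream = content,
            ContentType = detectedContentType
        };

        // Persist the detection result alongside the object so reads never have to
        // sniff the file again; S3 stores this as "x-amz-meta-detected-content-type".
        request.Metadata.Add("detected-content-type", detectedContentType);

        await s3.PutObjectAsync(request);
    }

    public static async Task<string> GetContentTypeAsync(IAmazonS3 s3, string bucket, string key)
    {
        var head = await s3.GetObjectMetadataAsync(new GetObjectMetadataRequest
        {
            BucketName = bucket,
            Key = key
        });

        // Dictionary-style lookup of the stored value, falling back to the object's
        // Content-Type header if the metadata entry is missing.
        return head.Metadata["detected-content-type"] ?? head.Headers.ContentType;
    }
}
```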

We managed to simplify and standardise a lot of file operations in this exercise, and reduce the overall entropy of that part of the system.

We now had a solid, reusable library and a set of patterns for moving the rest of the system over to the new persistence layer, and a system capable of interfacing with different underlying storage platforms.

Migration & Hot Swapping

When dealing with a live system, there is always a question of data migration. How do we go about migrating millions of files, and dozens of TBs of data from NFS to S3, in a seamless, safe, fault-tolerant, and reliable way?

We decided that doing a big all-or-nothing migration was too risky, and would likely not work and/or cause significant disruption.

We put the S3 storage behind a feature flag so that we can enable it on select sites and control the rate of the rollout. Regardless of whether the flag is on or off, all file interactions go through the new persistent storage layer library, and the library decides internally whether to use NFS or S3.
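
Conceptually, the selection inside the library looked something like the sketch below; the flag lookup is a placeholder, and the dual-write strategy referenced here is described (and sketched) a little later in this post.

```csharp
using System;

// Hypothetical wiring: a per-tenant flag decides which IFileStore implementation
// (from the earlier sketch) the rest of the application receives.
public enum StorageMode { Nfs, DualWrite, S3 }

public sealed class FileStoreSelector
{
    private readonly IFileStore _nfs;
    private readonly IFileStore _s3;
    private readonly Func<string, StorageMode> _flagForTenant; // placeholder flag lookup

    public FileStoreSelector(IFileStore nfs, IFileStore s3, Func<string, StorageMode> flagForTenant)
    {
        _nfs = nfs;
        _s3 = s3;
        _flagForTenant = flagForTenant;
    }

    public IFileStore ForTenant(string tenantId) => _flagForTenant(tenantId) switch
    {
        StorageMode.S3        => _s3,
        StorageMode.DualWrite => new DualWriteFileStore(_nfs, _s3), // sketched later in this post
        _                     => _nfs // flag off: behaviour unchanged
    };
}
```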

This allows us to start off with everyone using NFS (no change), but with the files being handled by the new central library. This serves as a confidence-building exercise for the new library in production, beyond the unit testing and QA testing that have already taken place.

It also allows us to phase the S3 storage in at a comfortable pace while we monitor stability, error rates, and user experience, and to fall back to the NFS implementation in case of issues. This ticks a huge box for fault tolerance.

This does raise the question of how we migrate data from NFS to S3, and what happens to data uploaded to S3 if we need to fall back to NFS because of problems. You don't want data to go missing, so when you fall back to NFS, you want to ensure all the files are still there.

This led us to introduce "storage strategies" within the persistence layer. The default strategy was configured to "dual-write" content to both NFS and S3. This meant that, for the time being, NFS remained the source of truth, and S3 lazily received files on write and on access, slowly populating itself the more the system was used. On-read uploads were done by checking whether the requested file exists in S3; if not, it would be fetched from NFS, uploaded to S3, and then returned to the user. This gave us a self-populating, human-free mechanism for filling S3 and rolling this feature out in a way that is:

  • Reliable - no human involvement and each operation is always handled the same way
  • Safe - data never leaves the system and there are no concerns about data handling
  • Fault-tolerant - if there is an issue with S3, NFS still contains all the data so we can easily fall back
  • Seamless - the migration was done in a lazy fashion - if the file didn't exist in S3, we'd grab it from NFS and asynchronously upload it so it's there next time

This strategy then allows any files not yet uploaded to S3 to be migrated seamlessly and asynchronously behind the scenes. Once that's complete, the strategy in the code can be switched from dual-writing to both file systems to using S3 only, completing the migration and the deprecation of the NFS service.
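
Here is a rough illustration of that dual-write strategy, built on the hypothetical IFileStore interface from earlier. The backfill is shown inline and buffered in memory for brevity; in practice the upload to S3 happened asynchronously, and large files would be streamed rather than buffered:

```csharp
using System.IO;
using System.Threading.Tasks;

// Sketch of the "dual-write" strategy: NFS remains the source of truth while S3
// is populated lazily on every write and on any read that finds the object missing.
public sealed class DualWriteFileStore : IFileStore
{
    private readonly IFileStore _nfs; // source of truth during the migration
    private readonly IFileStore _s3;  // lazily populated secondary store

    public DualWriteFileStore(IFileStore nfs, IFileStore s3)
    {
        _nfs = nfs;
        _s3 = s3;
    }

    public async Task WriteAsync(string tenantId, string path, Stream content)
    {
        // Buffer once so the same content can be written to both backends.
        var buffer = new MemoryStream();
        await content.CopyToAsync(buffer);

        buffer.Position = 0;
        await _nfs.WriteAsync(tenantId, path, buffer);

        buffer.Position = 0;
        await _s3.WriteAsync(tenantId, path, buffer);
    }

    public async Task<Stream> OpenReadAsync(string tenantId, string path)
    {
        if (await _s3.ExistsAsync(tenantId, path))
            return await _s3.OpenReadAsync(tenantId, path);

        // Not yet migrated: serve from NFS and backfill S3 so it is there next time.
        var copy = new MemoryStream();
        using (var source = await _nfs.OpenReadAsync(tenantId, path))
        {
            await source.CopyToAsync(copy);
        }

        copy.Position = 0;
        await _s3.WriteAsync(tenantId, path, copy);

        copy.Position = 0;
        return copy;
    }

    public async Task<bool> ExistsAsync(string tenantId, string path) =>
        await _s3.ExistsAsync(tenantId, path) || await _nfs.ExistsAsync(tenantId, path);

    public async Task DeleteAsync(string tenantId, string path)
    {
        await _nfs.DeleteAsync(tenantId, path);
        await _s3.DeleteAsync(tenantId, path);
    }
}
```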

Lessons and Conclusion

  • This was by no means an easy or short project, but we used technology to make the transition from NFS to S3 at a global scale seamless, whilst introducing additional improvements to the system and to developer ergonomics along the way.
  • The approach taken has made the cost of failure very cheap and rolling back very easy, and it has enabled us to introduce further improvements and optimisations (e.g. new compression technologies) with ease, as well as build out new modules and features without much consideration for storage.
  • The rollout was seamless and successful, which shows that the preparation and planning that went into this project paid off.

Tom Walters
Head of Architecture @ Interact Software
Manchester, UK