I’ve been working with large files for a few years now. These are zip files that are normally over a gigabyte and can easily reach the 50-100 gigabyte range. They contain data that compresses to only about 50% of its original size, so there’s no way to make them much smaller – they are what they are.
These files need to be uploaded from the field and processed in the cloud, and I’ve chosen AWS: S3 for the storage and EC2 for the processing. While we’ve been generally pleased with both, we have seen AWS throttle data to and from S3, and also limit the creation of large EC2 instances.
This means that if I have a 10 gigabyte zip file I need to upload to EC2, even on a network-optimized AWS instance type that supposedly supports over a gigabit of throughput, my upload speeds sit in the 30-70 megabit/sec range. And yes, I’m uploading over a wired 1 gigabit connection, not WiFi; and yes, I’ve tested uploads to various other services, and they all run at about 800 megabits/sec, a reasonable throughput for a 1 gigabit line. The same goes for downloading from S3: my speeds range from 14 to 80 megabits/sec.
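To put those numbers in perspective, here is the back-of-the-envelope arithmetic for that 10 gigabyte file, using the speeds quoted above (these are illustrative figures from this post, not fresh measurements):

```python
def transfer_seconds(size_gb: float, mbit_per_sec: float) -> float:
    """Seconds to move size_gb gigabytes at mbit_per_sec megabits/sec."""
    bits = size_gb * 8 * 1000**3           # decimal gigabytes, as network gear counts them
    return bits / (mbit_per_sec * 1000**2)

size = 10  # gigabytes
for label, speed in [("line rate (~800 Mbit/s)", 800),
                     ("observed S3/EC2 (~50 Mbit/s)", 50)]:
    print(f"{label}: {transfer_seconds(size, speed) / 60:.0f} min")
# line rate (~800 Mbit/s): 2 min
# observed S3/EC2 (~50 Mbit/s): 27 min
```

In other words, a transfer that should take about two minutes at line rate stretches to roughly half an hour at the throttled speeds, and proportionally worse for the 50-100 gigabyte files.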
You’ll also find that when spinning up a large instance, such as one with 64 cores and 192 gigabytes of RAM, AWS often reports back that it is out of capacity in a particular availability zone. Trying other zones yields the same result. AWS is simply under-provisioned for handling customer needs that are “out of the norm”.
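The usual workaround is to cycle through zones until one has capacity. A minimal sketch of that loop follows; `launch_in_zone` is a hypothetical stand-in for a real launch call (e.g. boto3’s `run_instances`), and the exception here simply mirrors the `InsufficientInstanceCapacity` error EC2 actually returns:

```python
class InsufficientCapacity(Exception):
    """Stand-in for EC2's InsufficientInstanceCapacity error."""

def launch_anywhere(zones, launch_in_zone):
    """Try each availability zone in turn; return the first instance
    that launches, or None if every zone is out of capacity."""
    for zone in zones:
        try:
            return launch_in_zone(zone)
        except InsufficientCapacity:
            continue  # this zone is out of capacity; try the next one
    return None
```

As described above, though, this often doesn’t help: when a large instance type is unavailable in one zone, it tends to be unavailable in the others too, and the loop falls through with nothing.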
I do understand that handling multi-gigabyte files is unusual, and so is spinning up an EC2 instance that costs about $100/day, but nowhere does AWS say “don’t tax our systems, because we can’t handle it and will throttle you”.
If you have these needs, you have now been warned about AWS.