How to Save 90% on Your S3 Bill
January 31, 2014

Filed under: Performance Monitoring

AppNeta has used a lot of open source libraries and programs in building and running our architecture. One utility in particular that's provided us with an easy way to slice up and investigate our AWS spending is the awesome Ice. Instead of doing manual tabulation based on the monthly billing email from Amazon, we can easily break down and graph our bill by hour, week, or month. By tagging our resources, we can even group by environment, role, or any other category we want, making it easy to see how much it costs to run our production environment versus our staging environment and other experiments.

Ice also lets us easily spot aberrations in our usage patterns that cost AppNeta money. One of those, flagged by our engineering team, was a somewhat high S3 bill: we were performing a surprising number of S3 List operations.

We use Python pretty heavily, and are big fans of boto. When retrieving and storing objects in S3, we usually execute code similar to:

from boto.exception import S3ResponseError

def _get_bucket(self, object_type):
    """Gets or creates the bucket for an object type."""
    bucket_name = '%s-%s' % (self.bucket_prefix, object_type)
    try:
        bucket = self.s3.get_bucket(bucket_name)
    except S3ResponseError as e:
        if e.status == 404:  # Bucket not found, first time using bucket.
            bucket = self.s3.create_bucket(bucket_name)
        else:
            raise
    return bucket

There’s nothing really out of the ordinary here, but we usually know the bucket name when using this operation, and we make sure to create the bucket first before placing code in production that needs a new bucket. Digging further into the boto code, we found:

def get_bucket(self, bucket_name, validate=True, headers=None):
    """
    Retrieves a bucket by name.

    If the bucket does not exist, an ``S3ResponseError`` will be raised. If
    you are unsure if the bucket exists or not, you can use the
    ``S3Connection.lookup`` method, which will either return a valid bucket
    or ``None``.

    :type bucket_name: string
    :param bucket_name: The name of the bucket

    :type headers: dict
    :param headers: Additional headers to pass along with the request to
        AWS.

    :type validate: boolean
    :param validate: If ``True``, it will try to fetch all keys within the
        given bucket. (Default: ``True``)
    """
    bucket = self.bucket_class(self, bucket_name)
    if validate:
        bucket.get_all_keys(headers, maxkeys=0)
    return bucket

The important parameter to notice here is the default of validate=True, which is also what the first example of getting a bucket in the boto docs uses. It causes the code to call get_all_keys, which issues a GET on the bucket itself and returns a ListBucketResult; in billing terms, a GET on a bucket counts as a List request.
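You can watch this happen by turning on boto's wire-level logging. A minimal sketch (the bucket name is hypothetical, and we assume AWS credentials are already configured):

import boto

# debug=2 makes boto print the raw HTTP requests and responses
conn = boto.connect_s3(debug=2)

# Default validate=True: boto issues a GET on the bucket itself
# (a List request returning a ListBucketResult) before any key I/O.
bucket = conn.get_bucket('example-bucket')

# validate=False: no request is sent at all; boto just hands back a
# Bucket object and trusts that it exists.
bucket = conn.get_bucket('example-bucket', validate=False)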

Now, normally S3 is really, really cheap. But List requests are 12.5 times as expensive as GETs:

PUT, COPY, POST, or LIST requests    $0.005 per 1,000 requests
GET and all other requests           $0.004 per 10,000 requests

At scale, even $0.005 per 1,000 requests adds up.
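To put a hypothetical number on it (our real request volume isn't shown here), suppose a fleet makes 100 million of these get_bucket calls a month. Each call with the default validate=True triggers one avoidable List request:

# Price from the table above, per single List request
LIST_PRICE = 0.005 / 1000.0  # $5.00 per million List requests

calls_per_month = 100 * 1000 * 1000  # hypothetical volume

extra = calls_per_month * LIST_PRICE
print('$%.2f/month in avoidable List charges' % extra)  # $500.00/month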

Since we almost always know the bucket we’re going to be accessing or writing to, we have no need to perform an extra List, so we can modify our code to:

bucket = self.s3.get_bucket(bucket_name, validate=False)
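One caveat: with validate=False, a nonexistent bucket no longer raises S3ResponseError at get_bucket time; the 404 surfaces on the first key operation instead. Since we create our buckets before shipping code that depends on them, the helper simplifies to a sketch like this:

def _get_bucket(self, object_type):
    """Gets a bucket handle without the validating List request.

    With validate=False boto sends no request here at all; a missing
    bucket surfaces as a 404 on the first key operation instead.
    """
    bucket_name = '%s-%s' % (self.bucket_prefix, object_type)
    return self.s3.get_bucket(bucket_name, validate=False)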

Our List usage is now a lot lower, which translates to a lower bill and a happier executive team 🙂

[Graph: s3 get_bucket request volume, dropping sharply after the validate=False change]

For fun, our dev team decided to see how pervasive use of the default validate parameter is in code on GitHub:

[Screenshot: GitHub code search results showing widespread get_bucket calls with the default validate]

Ouch.