How to Save 90% on Your S3 Bill

by Jesse Davis on January 31, 2014


AppNeta has used a lot of open source libraries and programs in building and running our architecture. One utility in particular that’s given us an easy way to slice up and investigate our AWS spending is Netflix’s excellent Ice. Instead of manually tabulating the monthly billing email from Amazon, we can break down and graph our bill by hour, week, or month. By tagging our resources, we can even group by environment, role, or any other category we want, making it easy to see how much it costs to run our production environment versus our staging environment and other experiments.
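The snippet below is a minimal sketch of that kind of tagging with boto’s EC2 API; the region, instance IDs, and tag values are purely illustrative, not our actual setup.

    # Sketch: tag instances so cost tools like Ice can group spend by tag.
    # The region, instance IDs, and tag values here are illustrative only.
    import boto.ec2

    conn = boto.ec2.connect_to_region('us-east-1')
    conn.create_tags(['i-0123abcd', 'i-0456efgh'],
                     {'environment': 'production', 'role': 'collector'})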

It also lets us easily spot aberrations in our usage patterns that cost AppNeta money. One of those, flagged by our engineering team, was a surprisingly high line item in our S3 bill: we were performing a lot of S3 LIST operations.

We use Python pretty heavily, and are big fans of boto. When retrieving and storing objects in S3, we usually execute code similar to:

    from boto.exception import S3ResponseError

    def _get_bucket(self, object_type):
        """Gets or creates the bucket for an object type."""
        bucket_name = '%s-%s' % (self.bucket_prefix, object_type)
        try:
            bucket = self.s3.get_bucket(bucket_name)
        except S3ResponseError as e:
            if e.status == 404:  # Bucket not found; first time using it.
                bucket = self.s3.create_bucket(bucket_name)
            else:
                raise
        return bucket

There’s nothing really out of the ordinary here, but we usually know the bucket name when using this operation, and we make sure to create the bucket first before placing code in production that needs a new bucket. Digging further into the boto code, we found:

def get_bucket(self, bucket_name, validate=True, headers=None):
    """
    Retrieves a bucket by name.

    If the bucket does not exist, an ``S3ResponseError`` will be raised. If
    you are unsure if the bucket exists or not, you can use the
    ``S3Connection.lookup`` method, which will either return a valid bucket
    or ``None``.

    :type bucket_name: string
    :param bucket_name: The name of the bucket

    :type headers: dict
    :param headers: Additional headers to pass along with the request to
        AWS.

    :type validate: boolean
    :param validate: If ``True``, it will try to fetch all keys within the
        given bucket. (Default: ``True``)
    """
    bucket = self.bucket_class(self, bucket_name)
    if validate:
        bucket.get_all_keys(headers, maxkeys=0)
    return bucket

The important parameter to notice here is the default of validate=True, which is also used in the first example of getting a bucket in the boto docs. It causes the code to call get_all_keys, which issues a GET on the bucket (an S3 LIST operation) and returns a ListBucketResult.

Now, normally S3 is really, really cheap. But LISTs are over 12 times more expensive than GETs.

PUT, COPY, POST, or LIST Requests $0.005 per 1,000 requests
GET and all other Requests $0.004 per 10,000 requests

At scale, even $0.005 per 1,000 requests adds up.
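To put rough numbers on it, here is a back-of-the-envelope comparison using the prices quoted above; the request volume is illustrative, not our actual traffic.

    # Back-of-the-envelope comparison using the 2014 prices quoted above.
    # The request volume is illustrative, not our actual traffic.
    requests_per_month = 100 * 1000 * 1000              # 100 million requests

    list_cost = requests_per_month / 1000.0 * 0.005     # $500.00
    get_cost = requests_per_month / 10000.0 * 0.004     # $40.00

    print("LIST: $%.2f  GET: $%.2f  difference: $%.2f"
          % (list_cost, get_cost, list_cost - get_cost))  # difference: $460.00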

Since we almost always know the bucket we’re going to be accessing or writing to, we have no need to perform an extra List, so we can modify our code to:

    bucket = self.s3.get_bucket(bucket_name, validate=False)
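Applied to the helper above, that might look something like the following sketch. With validate=False no request is made at all, so the 404-driven create path goes away; this assumes the bucket already exists, which matches our practice of creating new buckets before shipping code that needs them.

    # Sketch of the revised helper, assuming the bucket was created ahead of time.
    # With validate=False, get_bucket makes no HTTP request -- it just builds a handle.
    def _get_bucket(self, object_type):
        """Returns a handle to the bucket for an object type."""
        bucket_name = '%s-%s' % (self.bucket_prefix, object_type)
        return self.s3.get_bucket(bucket_name, validate=False)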

Our LIST usage is now a lot lower, which translates to a lower bill and a happier executive team :)

[Graph: our S3 get_bucket request volume after the change]

For fun, our dev team decided to see how pervasive use of the default validate parameter is in code on GitHub:

[Screenshot: GitHub code search results for get_bucket calls using the default validate]

Ouch.

  • Justin

I think you meant Netflix Ice, not Asgard

  • Vik

    Ice is the tool that does bill analysis – https://github.com/Netflix/ice

Asgard is for autoscaling, but a great article!

  • Jesse Davis

    You’re both absolutely right. We’re using or evaluating both here; I find myself swapping the two in conversation all the time!

  • nishant

Awesome article, cheers.

    I tried installing Ice but was unsuccessful. If you have any tutorial on installing Ice, please share.

    Thanks

  • Justin

I was in the process of sending a lot of data to S3 when I read this. Apparently my code was doing exactly the wrong thing: I was calling create_bucket once, then starting 40 threads to upload a few thousand files. Unfortunately each thread was calling get_bucket before uploading each file. It never occurred to me that get_bucket would try to validate it… I pushed out a change adding validate=False…

    Upload took 14 seconds
    Upload took 15 seconds
    Upload took 15 seconds
    Upload took 14 seconds
    Upload took 15 seconds
    Upload took 16 seconds
    [restart app]
    Upload took 7 seconds
    Upload took 8 seconds
    Upload took 8 seconds
    Upload took 8 seconds
    Upload took 8 seconds
    Upload took 8 seconds
    Upload took 8 seconds

    I don’t care about the cost savings, but making uploads twice as fast? That’s awesome.
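    A minimal sketch of the pattern described above: call create_bucket once up front, then have each worker build a cheap, non-validated bucket handle instead of re-validating before every upload. The bucket name, key layout, and thread count below are illustrative, not Justin’s actual code.

        # Sketch only: per-thread connection plus a non-validated bucket handle,
        # so no LIST request is issued before each upload.
        import threading
        from boto.s3.connection import S3Connection
        from boto.s3.key import Key

        BUCKET_NAME = 'example-upload-bucket'   # hypothetical bucket, created beforehand

        def upload_files(paths):
            conn = S3Connection()                # one connection per thread
            bucket = conn.get_bucket(BUCKET_NAME, validate=False)
            for path in paths:
                Key(bucket, path).set_contents_from_filename(path)

        def upload_in_threads(all_paths, num_threads=40):
            chunks = [all_paths[i::num_threads] for i in range(num_threads)]
            threads = [threading.Thread(target=upload_files, args=(chunk,))
                       for chunk in chunks if chunk]
            for t in threads:
                t.start()
            for t in threads:
                t.join()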

  • Shlomi Atar

    I’ve written this tiny python library for simple file uploads to S3 (boto is so complicated and full of features).

    It supports futures and has a nice interface for thread-pooled uploads.

    https://www.smore.com/labs/tinys3/

  • Abdelrahman

The JetS3t Java lib’s getBucket does the same thing!

    • http://jamesmurty.com/ James Murty

      Not quite. JetS3t’s #getBucket method lists all the buckets in an account, not the keys inside the bucket.

      Even so, if you know a bucket exists you should avoid doing any unnecessary work by creating a StorageBucket/S3Bucket object directly.

      James — JetS3t author

  • http://toastdriven.com/ Daniel Lindsley

    First, thanks for highlighting this issue. The Boto core team tried to take this very seriously (https://github.com/boto/boto/issues/2078 & https://github.com/boto/boto/pull/2082) & we’ve fixed `get_bucket` to use the much cheaper “HEAD Bucket” alternative (released in Boto 2.25.0).

    Your solution is still very much valid. If you can guarantee the bucket is present server-side, the cheapest thing is to do no work at all. :D

    Thanks for the insight/writeup & hope Boto continues to treat you well!

    • Jesse Davis

      Thanks for the quick fix, Daniel! Hopefully this will make the bills cheaper for a lot of people :)

  • Jeff Barr

    Hi Jesse, greetings from the AWS team!

We saw your post and have addressed this issue in Boto. Here’s the GitHub commit for the version 2.25.0 release notes:

    https://github.com/boto/boto/commit/eff52226d8604630c8669dfd1f5c2bdf949ff90e

    Any code that parses the exception raised by get_bucket will need to be examined and possibly revised.

    Thanks for bringing this to our attention.

  • ryno75

    Good info. This has been addressed in boto 2.25.0… http://docs.pythonboto.org/en/latest/releasenotes/v2.25.0.html