Getting the Sizes of Top-Level Directories in an AWS S3 Bucket with Boto3

I was recently asked to create a report showing the total size of the files within each top-level folder, including all the subdirectories under it, in our S3 buckets.

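The 'files' in an S3 bucket are objects, and each object has a key that contains the full path where the object is stored within the bucket. S3 has no real directories; the folder structure is implied by the '/' separators in the key.
I came up with this function to take a bucket and iterate over the objects within it. For each object, the key is examined to find its top-level folder, and the object's size is added to a running total for that folder, kept in a dictionary.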
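
To make the key handling concrete, here is a minimal sketch of the grouping step on its own, with no AWS calls. The helper name `top_level_of` and the sample keys and sizes are made up for illustration and are not part of the function further down.

```python
def top_level_of(key):
    """Return the top-level folder for an S3 key, or '.' for a root-level object."""
    parts = key.split('/')
    return '.' if len(parts) == 1 else parts[0]


# Hypothetical (key, size in bytes) pairs standing in for a bucket listing.
sample_objects = [
    ('readme.txt', 120),
    ('logs/2019/01/app.log', 2048),
    ('logs/2019/02/app.log', 1024),
    ('images/logo.png', 5000),
]

# Running totals per top-level folder; '.' represents the bucket root.
totals = {'.': 0}
for key, size in sample_objects:
    folder = top_level_of(key)
    totals[folder] = totals.get(folder, 0) + size

print(totals)  # {'.': 120, 'logs': 3072, 'images': 5000}
```

Everything below the first '/' is folded into the top-level folder's total, which is why both log files end up under 'logs'.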

Here’s what I ended up with.

import boto3


def get_top_dir_size_summary(bucket_to_search):
    """
    This function takes in the name of an S3 bucket and returns a dictionary
    containing the top-level dirs as keys and their total file size in bytes as values.
    :param bucket_to_search: a string containing the name of the bucket
    :return: a dict mapping each top-level dir ('.' for the bucket root) to its total size
    """
    # Setup the output dictionary for running totals
    dirsizedict = {}
    # Create one entry for '.' to hold the total for objects stored at the bucket root.
    dirsizedict['.'] = 0

    # ------------
    # Set up the AWS client
    s3client = boto3.client('s3')

    # This is a check to ensure a bad bucket name wasn't passed in. head_bucket raises
    # ClientError if the bucket is missing or access is denied.
    # If you have a better method, please comment on the article.
    try:
        s3client.head_bucket(Bucket=bucket_to_search)
    except s3client.exceptions.ClientError:
        print('Bucket ' + bucket_to_search + ' does not exist or is unavailable. - Exiting')
        raise SystemExit(1)

    # Buckets can hold more than 1000 objects, so use a paginator to list them 1000 at a time.
    paginator = s3client.get_paginator('list_objects_v2')
    pageresponse = paginator.paginate(Bucket=bucket_to_search)

    # Iterate through each page of results from the paginator.
    for pageobject in pageresponse:

        # An empty bucket returns a page with no 'Contents' key, so default to an empty list.
        for file in pageobject.get('Contents', []):

            # Get the top-level directory from the object by splitting the key on '/'.
            keylist = file['Key'].split('/')

            # A key with no '/' splits into one part, meaning the object sits at the bucket root.
            topdir = '.' if len(keylist) == 1 else keylist[0]

            # Each listing entry already includes the object's size in bytes, so no extra
            # request per object is needed. Create the entry if it is missing, then add to it.
            dirsizedict[topdir] = dirsizedict.get(topdir, 0) + file['Size']

    return dirsizedict
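
The totals come back in bytes. For a report, something like the following sketch could sort the folders by size and print human-readable values. The helper names `format_size` and `build_size_report`, the 1024-based units, the column widths, and the sample dictionary are all assumptions for illustration; they are not part of the function above.

```python
def format_size(num_bytes):
    """Convert a byte count to a short human-readable string (1024-based units)."""
    size = float(num_bytes)
    for unit in ('B', 'KiB', 'MiB', 'GiB', 'TiB'):
        # Stop at the first unit that fits, or at the largest unit available.
        if size < 1024 or unit == 'TiB':
            return f'{size:.1f} {unit}'
        size /= 1024


def build_size_report(dirsizedict):
    """Return one formatted line per top-level dir, largest first."""
    rows = sorted(dirsizedict.items(), key=lambda item: item[1], reverse=True)
    return [f'{name:<20} {format_size(total):>12}' for name, total in rows]


# Hypothetical result in the shape returned by get_top_dir_size_summary.
sample = {'.': 120, 'logs': 3072, 'images': 5 * 1024 * 1024}
for line in build_size_report(sample):
    print(line)
```

With real data, the `sample` dictionary would be replaced by the return value of `get_top_dir_size_summary` for the bucket in question.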

That script is probably a little rough to an elite coder, so if you have any thoughts on improvement, let me hear them.


6 Responses to Getting the sizes of Top level Directories in an AWS S3 Bucket with Boto3

  1. siva says:

    Hi,
    Your work in this is awesome, helping a lot to move forward.
    Can you please give a hint on how to extract the security group IDs whose CidrIp is 0.0.0.0/0 in IpRanges in IpPermissions, from a CloudTrail log in JSON format, using boto3 and Python? I tried everything I could think of but was unable to move forward. Thanks in advance.

    • mike says:

      I think you are trying to find security groups with an allow-all rule using 0.0.0.0/0. Why not iterate over all groups in the account and check each rule in each group for a CIDR of 0.0.0.0/0?

  2. Mapes says:

    Hey thanks I know this is kinda old but, it helped me

  3. Lydon says:

    2020 and still great. Consider updating if you ever get the chance 🙂

  4. jagan reddy says:

    Script is not working
