Hosting a static website on S3 with CloudFront and pretty URLs

This post describes how to deploy a static website on AWS, just like the one you are reading right now. The basic scenario is pretty easy, but serving files under pretty URLs is not straightforward, which is why I decided to write this post.

For context, let’s define what a “pretty URL” and an “ugly URL” are. A pretty URL:

https://msucharski.eu/posts/hosting-static-site-s3/

An ugly URL:

https://msucharski.eu/posts/hosting-static-site-s3/index.html

As you can probably guess, it all comes down to URL rewriting rules or default index files for directories. Options for URL customization in CloudFront are pretty limited: the default root object can only be specified for the root, so it won’t work for nested directories. AWS, on its blog, suggests using Lambda@Edge to rewrite URLs and implement default directory indexes.
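
For reference, the Lambda@Edge variant boils down to a tiny origin-request handler that appends the index document to directory-style URIs. The AWS post uses Node.js; the following is only a rough Python sketch of the same idea, not the code from the post:

# Sketch of a Lambda@Edge origin-request handler that rewrites
# directory-style URIs to the corresponding index document.
def handler(event, context):
    request = event['Records'][0]['cf']['request']
    if request['uri'].endswith('/'):
        # /posts/foo/ -> /posts/foo/index.html
        request['uri'] += 'index.html'
    return request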

I find the Lambda solution very flexible, but it is unnecessarily complex and costly for this use case. Fortunately, it is not as bad as it seems! When S3 is used as static site storage (probably the most common scenario on AWS), we can leverage the specific properties of this fantastic key-value store.

The “key-value” property is the point of this article, and you could stop reading right now because you know everything. ;-)

In the S3 console, we see a hierarchical directory structure. However, S3 is not a filesystem: it is an object store. Object keys look like filesystem paths only by accident1, and it is perfectly valid for an object to have a key ending with /. Do you see where this is going? For every foo/bar/index.html object, we can create a copy with the key foo/bar/ (the trailing slash is essential!). Or, if you do not care about ugly paths, you can move the object instead of copying it.
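
In boto3 terms, the trick is a single copy_object call per index document. A minimal sketch, assuming a bucket named my.domain.bucket (the example name used later in this post):

import boto3

s3 = boto3.client('s3')
bucket = 'my.domain.bucket'  # example bucket name

# Duplicate the index document under the directory-style key.
# The trailing slash in the destination key is the essential part.
s3.copy_object(
    Bucket=bucket,
    Key='posts/hosting-static-site-s3/',
    CopySource={'Bucket': bucket, 'Key': 'posts/hosting-static-site-s3/index.html'},
)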

I wrote a short Python script to do this automatically. It handles all the edge cases I have encountered so far, but there may be more.2 Take a look if you are interested:

#!/usr/bin/env python3

import base64
import hashlib
import os
from argparse import ArgumentParser

import boto3

parser = ArgumentParser()
parser.add_argument('public_dir', type=str)
parser.add_argument('bucket', type=str)
parser.add_argument('prefix', type=str)


# Based on: https://stackoverflow.com/a/3431838/1561140
def md5(file_handle):
    file_hash = hashlib.md5()
    for chunk in iter(lambda: file_handle.read(4096), b""):
        file_hash.update(chunk)
    return base64.b64encode(file_hash.digest()).decode('utf-8')


def main(cmdline):
    args = parser.parse_args(cmdline)
    s3 = boto3.client('s3')

    for root, _, files in os.walk(args.public_dir):
        # Strip the local directory prefix from the walked path; lstrip
        # guards against a leading slash when public_dir is passed without
        # a trailing slash.
        rel_root = root.replace(args.public_dir, '', 1).lstrip('/')
        abs_root = os.path.join(args.prefix, rel_root)
        for file_name in files:
            with open(os.path.join(root, file_name), 'rb') as f:
                additional_params = {}
                keys = [os.path.join(abs_root, file_name)]
                if file_name in ['index.html', 'index.htm']:
                    # CloudFront does not support a default root object for
                    # subdirectories, so directory-style keys have to be
                    # created manually.
                    if abs_root:
                        keys.append(abs_root if abs_root.endswith('/') else (abs_root + '/'))
                    if len(abs_root.rstrip('/')) > 0:
                        keys.append(abs_root.rstrip('/'))

                _, ext = os.path.splitext(file_name)
                extension_to_content_type = {
                    '.html': 'text/html',
                    '.htm': 'text/html',
                    '.xml': 'text/xml',
                    '.css': 'text/css',
                    '.js': 'application/javascript',
                    '.png': 'image/png',
                    '.svg': 'image/svg+xml',
                }

                if ext in extension_to_content_type:
                    additional_params['ContentType'] = extension_to_content_type[ext]

                for key in keys:
                    f.seek(0)
                    md5_hash = md5(f)
                    f.seek(0)
                    print(f'Uploading {key}')
                    s3.put_object(
                        Bucket=args.bucket,
                        Key=key,
                        Body=f,
                        ContentMD5=md5_hash,
                        **additional_params,
                    )


if __name__ == '__main__':
    import sys

    main(sys.argv[1:])

Usage:

# ./upload_to_s3.py <path-to-site-files> <bucket-name> <prefix-in-the-bucket>
./upload_to_s3.py public/ my.domain.bucket ""
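
If you want to sanity-check the result, listing the keys should show the duplicated directory-style objects. A quick sketch, again assuming the example bucket name:

import boto3

s3 = boto3.client('s3')
# Directory-style keys created by the script end with a slash.
response = s3.list_objects_v2(Bucket='my.domain.bucket', Prefix='posts/')
for obj in response.get('Contents', []):
    print(obj['Key'])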

The part about pretty URLs ends here. If you already have experience deploying static sites on S3 with CloudFront and Terraform, you probably won’t learn anything new from the following section. Nonetheless, I encourage you to take a look. ;-)

Complete configuration using Terraform

The configuration is simple: we create an S3 bucket and a CloudFront distribution pointing to the bucket. I assume that a hosted zone (i.e., domain) already exists in Route 53.

Let’s start by introducing some variables:

variable "project_name" {
  type = string
}

variable "domain" {
  type = string
}

AWS requires unique names for some resource types. We could achieve this by letting Terraform generate names, but then we would end up with cryptic identifiers; in my experience, even worse than those generated by CloudFormation, at least in the default configuration. For simplicity, we will prefix all resource names with the project_name variable. It won’t protect us from every possible collision, but it is good enough for this example.

The second variable, domain, is the fully qualified name of the target domain. In the case of this blog, it would contain msucharski.eu.

We also need provider and backend configuration:

terraform {
  required_version = ">= 0.13"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 3.54"
    }
  }

  backend "s3" {
    bucket = "<REPLACE_WITH_STATE_BUCKET>"
    key    = "root"
    region = "us-east-1"
  }
}

# CloudFront only accepts ACM certificates issued in us-east-1, so we keep
# all resources in that region.
provider "aws" {
  region = "us-east-1"
}

Next, create the bucket:

resource "aws_s3_bucket" "static_website" {
  bucket = "${var.project_name}.website"
  acl    = "private"
}

And an access policy, so that CloudFront can access our bucket:

resource "aws_cloudfront_origin_access_identity" "s3_identity" {
  comment = "S3 upload access for ${var.project_name} website"
}

data "aws_iam_policy_document" "static_website_policy" {
  statement {
    actions   = ["s3:GetObject"]
    resources = ["${aws_s3_bucket.static_website.arn}/*"]

    principals {
      type        = "AWS"
      identifiers = [aws_cloudfront_origin_access_identity.s3_identity.iam_arn]
    }
  }
}

resource "aws_s3_bucket_policy" "cloudfront_policy_website" {
  bucket = aws_s3_bucket.static_website.id
  policy = data.aws_iam_policy_document.static_website_policy.json
}

In the above code, we give CloudFront read access to every object in the static_website bucket. As long as you don’t use this bucket for any other purpose, it should be secure. If you plan to store non-public objects on S3, you should make the policy stricter or (better) create a second S3 bucket.

We are almost ready to create the CloudFront distribution. There is one more thing: an SSL certificate. Thanks to the AWS Certificate Manager, this is very easy:

resource "aws_acm_certificate" "static_website" {
  domain_name       = var.domain
  validation_method = "DNS"

  lifecycle {
    create_before_destroy = true
  }
}

There is a catch, though: validation is partially manual.3 After creating the certificate resource, you will have to go to the ACM console and follow the domain validation instructions. If you are using Route 53 as a DNS provider, this will be a one-click process.4
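
If you prefer not to click through the console, the DNS record that ACM expects can also be read programmatically. A boto3 sketch (the certificate ARN is a placeholder; take the real one from terraform output or the console):

import boto3

# ACM certificates used by CloudFront live in us-east-1.
acm = boto3.client('acm', region_name='us-east-1')
cert_arn = 'arn:aws:acm:us-east-1:...'  # placeholder ARN

cert = acm.describe_certificate(CertificateArn=cert_arn)['Certificate']
for option in cert['DomainValidationOptions']:
    record = option.get('ResourceRecord', {})
    # Create this CNAME in your DNS zone to pass validation.
    print(record.get('Name'), record.get('Type'), record.get('Value'))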

And, at last, let’s create the CloudFront distribution:

resource "aws_cloudfront_distribution" "website" {
  enabled         = true
  is_ipv6_enabled = true
  comment         = "Website hosting for ${var.domain}"

  default_root_object = "index.html"

  origin {
    domain_name = aws_s3_bucket.static_website.bucket_regional_domain_name
    origin_id   = "default"
    s3_origin_config {
      origin_access_identity = aws_cloudfront_origin_access_identity.s3_identity.cloudfront_access_identity_path
    }
  }

  default_cache_behavior {
    allowed_methods        = ["GET", "HEAD", "OPTIONS"]
    cached_methods         = ["GET", "HEAD", "OPTIONS"]
    target_origin_id       = "default"
    viewer_protocol_policy = "allow-all" # consider "redirect-to-https" to force TLS

    cache_policy_id = aws_cloudfront_cache_policy.default.id
    compress        = true
  }

  restrictions {
    geo_restriction {
      restriction_type = "none"
    }
  }

  aliases = [var.domain]
  viewer_certificate {
    acm_certificate_arn = aws_acm_certificate.static_website.arn
    ssl_support_method  = "sni-only"
  }

  custom_error_response {
    error_code         = 404
    response_page_path = "/404.html"
    response_code      = 404
  }
  custom_error_response {
    error_code         = 403
    response_page_path = "/404.html"
    response_code      = 404
  }
}

resource "aws_cloudfront_cache_policy" "default" {
  name    = "policy-${replace("${var.project_name}-${var.domain}", ".", "-")}"
  comment = "Cache policy for ${var.domain}"

  default_ttl = 60 * 60      # 1 hour
  max_ttl     = 24 * 60 * 60 # 24 hours
  min_ttl     = 0

  parameters_in_cache_key_and_forwarded_to_origin {
    enable_accept_encoding_brotli = true
    enable_accept_encoding_gzip   = true

    cookies_config {
      cookie_behavior = "none"
    }
    headers_config {
      header_behavior = "none"
    }
    query_strings_config {
      query_string_behavior = "none"
    }
  }
}

That’s a big chunk of code. I won’t go into details, because everything is covered in the CloudFront and Terraform documentation, and describing it all here would only bloat the post. Besides, the options are mostly self-descriptive.

Finally, we need to expose our distribution behind a human-readable domain. After all, it is supposed to be a website.

data "aws_route53_zone" "domain" {
  name         = "${var.domain}."
  private_zone = false
}

resource "aws_route53_record" "website" {
  zone_id = data.aws_route53_zone.domain.zone_id
  name    = var.domain
  type    = "A"

  alias {
    evaluate_target_health = false
    name                   = aws_cloudfront_distribution.website.domain_name
    zone_id                = aws_cloudfront_distribution.website.hosted_zone_id
  }
}

The above code reads the hosted zone information from Route 53 and creates the appropriate record: a simple alias pointing at our CloudFront-hosted website.

Warning: I do not take any responsibility for resource costs on AWS. Remember that AWS, in most cases, is not free.

Now that we have configured everything necessary, it is time to deploy our shiny new website. Let’s run terraform apply and see the results. Keep in mind that creating a CloudFront distribution takes around 15 minutes, so be patient.

# Initialize backend
$ terraform init

# Change values to appropriate names for your deployment
$ export PROJECT_NAME=s3-file-hosting-example
$ export DOMAIN=example.org

# Create resources
$ terraform apply -var "project_name=$PROJECT_NAME" -var "domain=$DOMAIN"

The first attempt to create the CloudFront distribution might fail due to an invalid certificate. If that happens, go to the ACM console and complete the domain validation process (validation can be automated3). Once validation has succeeded (check “Validation status” in the ACM console), run terraform apply once again.
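
The validation status can also be polled with boto3 instead of the console (again, the ARN is a placeholder):

import boto3

acm = boto3.client('acm', region_name='us-east-1')
cert_arn = 'arn:aws:acm:us-east-1:...'  # placeholder ARN

# 'ISSUED' means validation succeeded and terraform apply can be re-run.
status = acm.describe_certificate(CertificateArn=cert_arn)['Certificate']['Status']
print(status)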

Now we can use the upload script, presented at the beginning of this post, to upload content to the website (note that the bucket name is the project name with a .website suffix):

$ python ./upload_to_s3.py public/ "$PROJECT_NAME.website" ""

That’s it! You can visit your website and check if everything is working as expected. If you find yourself bored with the website, you can destroy it and all associated resources with a simple:

$ terraform destroy -var "project_name=$PROJECT_NAME" -var "domain=$DOMAIN"

I hope you enjoyed the article!


  1. There are some technical restrictions on the paths, but the hierarchical structure is mostly for humans. ↩︎

  2. If you find anything, please let me know! ↩︎

  3. This is not exactly true. With Terraform you can use the acm_certificate_validation to automate this process. However, I will skip such an approach in this post. ↩︎ ↩︎

  4. ACM can create the necessary CNAME record in the hosted zone for certificate validation as long as Route 53 is the DNS provider. ↩︎