Codelab 2: AWS + S3

Due Date

This codelab is due on Thursday, February 15th at 11:59:59PM. It may look long, but most of it is just setup paired with a casual walkthrough.


In this codelab, you'll get to play around with S3.

  • You'll get up and running on AWS with a Free Tier account and AWS Educate.
  • You'll test out S3 from the AWS CLI.

Setting Up

Before starting this codelab, run git pull in the 389Lspring18 directory to update your local copy of the class repository.

Getting Started with AWS

The first step is going to be to get an AWS account. If you already have an account with your email, then you can go ahead and sign in with the same account. Sign in or Sign up here.

Free Tier Accounts

You'll be using a Free Tier account on AWS. This tier is meant to give developers like you access to AWS to test drive some of its features.

You can learn more about the specifics of what is available on this page.

AWS Educate via MLH

Now that you have signed up for an AWS account, register for AWS Educate here. As university students, you can get access through MLH's partnership with AWS Educate.

You'll need your Account ID during the sign-up process, which you can access from the "My Account" page on AWS.

After you complete this process, you will receive a promotional credit via email. You should redeem this in the Billing > Credits section via AWS's GUI.

Credits Page

If you're concerned about how much you're using, there's more information here.

AWS Educate grants students access to $100 in credit to pay for services not covered under the free tier.


This class will use a combination of the CLI and the GUI (referred to as the AWS Management Console, or just the AWS console). You can do everything you can do in the GUI via the CLI, but the goal is to get you comfortable with each as we'll use both throughout the class.

The two key components to understand here are how to access different services and how to change between regions.

You can see the full list of services from the "Services" dropdown in the top-left corner:

Services Tab

To see which region you are currently in, look for the region tab in the top-right corner:

Regions Tab

Keep in mind that we will be using us-east-1, which is in Northern Virginia.

As mentioned above, we will be using S3 in this codelab, so I would recommend that you go ahead and take a peek at S3 via the AWS Console. Come back throughout the codelab and look at how things change as you create buckets, upload files, etc.

S3 Splash Page


IAM User

To work with the CLI, you'll need to first create an IAM user via the AWS Console. Open the "IAM" service and navigate to "Users" (tab on left) > "Add user".

  1. Pick a name for your user, and check "Programmatic Access". You won't need to sign in to the AWS console with your user.

Create User

  2. You need to specify what permissions your new user has. To do this, we will create an IAM group for admin users. Give this group a name and select the "AdministratorAccess" policy.

Admin Permissions

  3. Click through to the "Complete" page, where AWS will provide you with the access key ID and secret access key. Keep these safe, perhaps saving them in a text file on your machine, because AWS will not show the secret key to you again. These keys are used to sign programmatic requests that you make to AWS. If you lose them, you must start over and create a new IAM user.
Testing the CLI

If you haven't already done so, set up your local development environment. Follow the instructions here: Environment Setup. Upon entering the final command, you will be prompted for the keys you just obtained. Set us-east-1 and json as the default region and output format, respectively.

$ aws configure
AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Default region name [None]: us-east-1
Default output format [None]: json

Now test your credentials with the following command.

$ aws sts get-caller-identity

Make sure the account ID matches and you're good to go. We'll use the CLI more below.
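Since we'll be using boto3 later, here's roughly what the same identity check looks like in Python. This is a minimal sketch; boto3 picks up the credentials you just configured with aws configure.

import boto3

# Sketch: the boto3 equivalent of `aws sts get-caller-identity`.
sts = boto3.client("sts")
identity = sts.get_caller_identity()
print(identity["Account"], identity["Arn"])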

Root User

To clarify the difference between the user you have been using to log in to the AWS Console and the IAM user you just created:

The credentials you've been using to log in are the root user credentials. Essentially, you have two users for your account: the root user, which you use to sign in to the AWS Console, and the IAM user (which has an access key ID and secret access key), which you use to access AWS programmatically. You can't access the API programmatically with the root user because it has no associated access keys. Normally, you would avoid using the root account and instead federate access out to IAM users with limited permissions so that you can audit their usage. However, in this class, we will continue to use the root account for simplicity.

S3 Basics

There are two main ways that you will programmatically interact with AWS. The first is through the CLI and the second is through the boto3 SDK. The CLI is great for one-time changes to your AWS environment. However, if you want to create dynamic scripts, then you will want to use something else -- in this class, that will be Python with Amazon's Python SDK, boto3.

For codelabs, I'll be showing you how to interact with AWS via the CLI, since the CLI is easier to experiment with compared to Python code. I won't go into as much detail with the boto3 library, since I don't believe that explaining both adds much value. The APIs for both are fairly similar, so you should be able to easily pick up the relevant boto3 as long as you understand each codelab.

Feel free to ask on Piazza if you have any trouble or questions about boto3 APIs.


Do reference the aws s3api documentation while working through this codelab. You will need it for successful completion.

CLI Overview

This section relies on having the AWS CLI installed; see Environment Setup.

The aws cli exposes commands for almost every service in the form of:

$ aws [service] [subcommand]

The full list of services is available via tab-completion.

To get help on a command, append the word "help". For example:

$ aws help
$ aws s3api help
Creating a Bucket

Let's start by working with S3 buckets.

The s3api create-bucket command allows you to create S3 buckets from the CLI. Go ahead and create a bucket to use in this codelab:

$ aws s3api create-bucket --bucket cmsc389l-<your directory id>

Remember that these names must be globally unique. What happens if you try to create a bucket that has already been claimed? Try it against the cmsc389l bucket that I created for class:

$ aws s3api create-bucket --bucket cmsc389l
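For reference, here's a rough boto3 sketch of the same bucket creation; the bucket name is a placeholder to replace with your own.

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
try:
    # us-east-1 needs no LocationConstraint; other regions require one.
    s3.create_bucket(Bucket="cmsc389l-your-directory-id")
except ClientError as e:
    # A name someone else already claimed surfaces as an error
    # such as BucketAlreadyExists.
    print(e.response["Error"]["Code"])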
Audit Trail

Throughout the rest of the codelab, you will be playing around with the bucket you just created. For me to grade this codelab, I ask that you enable CloudTrail auditing on this bucket. CloudTrail is a tool that allows you to perform audit logging on different services, which can be useful for debugging problems or for enabling compliance.

For this codelab, you'll submit the audit trail at the end of the assignment. I'm not going to look at the logs in depth; I'll only skim them to verify that you took the time to go through this codelab. So feel free to explore beyond the walkthrough below.

We will enable the audit log via the AWS Console. Open the CloudTrail service and go to "Trails" > "Create Trail". Create a new trail on the bucket you just created (cmsc389l-<directory id>) and save it into a newly created bucket.

CloudTrail Example

The new trail will take a few seconds to appear in the CloudTrail list, but once it does, you are good to go with the remainder of the codelab.

Important: You must enable audit logging (above) before continuing.

Listing Buckets

To check which buckets are owned by your account, use s3api list-buckets:

$ aws s3api list-buckets

You should see both buckets — the first bucket you created, and the one containing audit logs.
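The boto3 version is a single call; a minimal sketch:

import boto3

s3 = boto3.client("s3")
# Each entry includes the bucket's Name and CreationDate.
for bucket in s3.list_buckets()["Buckets"]:
    print(bucket["Name"], bucket["CreationDate"])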

Writing Data

Now that we have an S3 bucket, we are ready to upload content into it.

Let's create a file to upload:

$ echo "Hello World\!" > hello_world.txt
$ ll
total 4.0K
drwxr-xr-x 1 vagrant vagrant 68 Sep 12 07:28 code
-rw-rw-r-- 1 vagrant vagrant 13 Sep 12 13:41 hello_world.txt

Upload it into your bucket with the s3api put-object command. As mentioned in class, notice how you need to specify both the bucket and the key, which together map to this object.

$ aws s3api put-object --bucket cmsc389l-colink --key hello_world.txt --body hello_world.txt

You can think of this as a hash table where together the bucket and key uniquely identify an object:

(bucket1, key1): object,
(bucket1, key2): object,
(bucket2, key1): object,
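In boto3, the same upload is one put_object call. A rough sketch, with the bucket name as a placeholder:

import boto3

s3 = boto3.client("s3")
# (bucket, key) -> object: store the local file under the key hello_world.txt.
with open("hello_world.txt", "rb") as f:
    s3.put_object(Bucket="cmsc389l-your-directory-id", Key="hello_world.txt", Body=f)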
Listing Objects

To verify that the file was uploaded into the bucket, use the s3api list-objects-v2 command:

$ aws s3api list-objects-v2 --bucket cmsc389l-colink

Create and upload another file into this bucket with a different key. Now, you can use the --prefix flag of the list-objects-v2 command to filter which objects to list.


This becomes particularly useful with directories. Let's create a simple directory structure locally:

$ mkdir dir1 && mkdir dir2
$ echo "Directory 1" > dir1/file.txt
$ echo "Directory 2" > dir2/file.txt
$ tree
.
├── dir1
│   └── file.txt
└── dir2
    └── file.txt

2 directories, 2 files

Upload both of these files:

$ aws s3api put-object --bucket cmsc389l-colink --key "dir1/file.txt" --body dir1/file.txt
$ aws s3api put-object --bucket cmsc389l-colink --key "dir2/file.txt" --body dir2/file.txt

Run the list-objects-v2 command again, and you will see all of these files. However, if you wanted to see only the files in one of these directories, you could limit the list operation to those keys with a given prefix:

$ aws s3api list-objects-v2 --bucket cmsc389l-colink --prefix dir1

This effectively recreates a recursive ls operation (ls -R [<prefix>]).
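The equivalent prefix filtering in boto3, as a minimal sketch:

import boto3

s3 = boto3.client("s3")
# List only the keys that start with the "dir1" prefix.
resp = s3.list_objects_v2(Bucket="cmsc389l-your-directory-id", Prefix="dir1")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])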

Reading Objects

Let's inspect the contents of that "Hello World" file that we uploaded.

To read the contents of an object, use the s3api get-object command. This will stream the contents of the file from S3 into a local file.

$ aws s3api get-object --bucket cmsc389l-colink --key hello_world.txt out.txt
$ cat out.txt
Hello World!
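In boto3, get_object returns a streaming body that you read yourself; a rough sketch:

import boto3

s3 = boto3.client("s3")
resp = s3.get_object(Bucket="cmsc389l-your-directory-id", Key="hello_world.txt")
# Stream the object's bytes and decode them as text.
print(resp["Body"].read().decode("utf-8"))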


Deleting Objects

Finally, if you want to delete an object, there is the s3api delete-object command.

$ aws s3api delete-object --bucket cmsc389l-colink --key hello_world.txt

We can verify that the file was removed correctly by attempting to read this data back:

$ aws s3api get-object --bucket cmsc389l-colink --key hello_world.txt out.txt

As you'll see, this request fails with a NoSuchKey error.
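Here's a boto3 sketch of the delete plus the failing read; the NoSuchKey exception class lives on the client object:

import boto3

s3 = boto3.client("s3")
bucket = "cmsc389l-your-directory-id"  # placeholder

s3.delete_object(Bucket=bucket, Key="hello_world.txt")
try:
    s3.get_object(Bucket=bucket, Key="hello_world.txt")
except s3.exceptions.NoSuchKey:
    print("hello_world.txt is gone")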

Wrapping Up

Great, you should now be familiar with the basic operations on S3. You can programmatically create buckets, upload objects, read back data from objects, list keys in a bucket, and delete these objects by key.

Storage Types

In class, we talked about four storage classes: Default, Reduced Redundancy Storage (RRS), Infrequently Accessed (IA), and Glacier. We're not going to cover Glacier here, since its API is fairly complicated. However, let's experiment with moving objects between the other three storage types.

We can toggle between the first three storage classes of an S3 object using the --storage-class flag for the s3api put-object command. Note that moving to Glacier is a separate process. Why? Objects in the first three classes are all accessible in real-time, while objects in Glacier are not.

For this process, let's re-upload a file into our bucket:

$ aws s3api put-object --bucket cmsc389l-colink --key hello_world.txt --body hello_world.txt
Configuring RRS or IA

Before we change the storage class, we can inspect the metadata of the file we just uploaded to determine which storage class it is held in.

$ aws s3api list-objects --bucket cmsc389l-colink --prefix hello_world.txt

There are two ways to modify the storage class. The first involves overwriting the object already in S3 with a new object which has a different storage class set:

$ aws s3api put-object --bucket cmsc389l-colink --key hello_world.txt --body hello_world.txt --storage-class REDUCED_REDUNDANCY

Perform the s3api list-objects command again to double-check that it modified the storage class.

The second way is to use the s3api copy-object command, which avoids the extra bandwidth charges of uploading the object all over again; in this case, it only changes the storage class on the object.

$ aws s3api copy-object --bucket cmsc389l-colink --copy-source cmsc389l-colink/hello_world.txt --key hello_world.txt --storage-class STANDARD_IA

If you perform one last s3api list-objects command, you'll see that it changed the storage class again.
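The same in-place copy in boto3, sketched with a placeholder bucket name:

import boto3

s3 = boto3.client("s3")
bucket = "cmsc389l-your-directory-id"  # placeholder

# Copy the object onto itself, changing only its storage class.
s3.copy_object(
    Bucket=bucket,
    Key="hello_world.txt",
    CopySource={"Bucket": bucket, "Key": "hello_world.txt"},
    StorageClass="STANDARD_IA",
)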

Remember that each of these storage classes still provides real-time access to objects, so we can read them back quickly:

$ time aws s3api get-object --bucket cmsc389l-colink --key hello_world.txt out.txt
aws s3api get-object --bucket cmsc389l-colink --key hello_world.txt out.txt 0.29s user 0.05s system 41% cpu 0.817 total

If you were instead to move these files into Glacier, you would need to trigger a restore job to move the files back into an S3 bucket before reading them. This process usually takes hours.


The tool that you've used so far, aws s3api, is one of two AWS CLI tools for interacting with S3. The other, aws s3, is a higher-level wrapper on top of s3api that provides typical directory traversal operations.

For example, you can ls a directory on S3:

$ aws s3 ls s3://cmsc389l-colink
    PRE dir1/
    PRE dir2/
2017-09-13 11:44:17         13 hello_world.txt
2017-09-13 12:24:14   25165824 output.dat

It provides a number of operations:

cp ls mb mv presign rb rm sync website

I recommend that you play around with this tool a bit more to see how it compares with s3api. As you will see, it is much more limited than s3api, but it is quite useful for when you need to move files between your local computer and an S3 bucket.


Your assignment for this codelab is to re-create part of the s3 sync command using boto3 in Python. You'll be able to use this script to quickly copy files and directories into S3.

Note: You must use boto3, not the s3 sync command. You will not get full credit if you just write a wrapper on top of the sync command.

As an example of how this script works, say you had a directory like so:


Then here are a few examples of the resulting state in S3:

Example 1
$ python <your script>.py a.txt --bucket cmsc389l-colink
Example 2
$ python <your script>.py a.txt --bucket cmsc389l-colink --destination z.txt
Example 3
$ python <your script>.py . --bucket cmsc389l-colink
Example 4
$ python <your script>.py folder1 --bucket cmsc389l-colink --destination uploads

The file in the code/ directory includes some boilerplate to get you started.

Like the previous codelab, you only need to pass the public tests to get full credit.

The tests can be run like so:

$ python <test script>.py

Note that if you kill the testing program, it won't clean up after itself. If you end up with extra testing buckets, you can quickly remove them with the following command:

$ aws s3 ls | grep "test-bucket" | cut -f3 -d ' ' | xargs -I {} aws s3 rb 's3://{}'

Additionally, we have provided you with a Pipfile that contains all the packages we believe you will need to complete the assignment. Pipfiles work just like package.json files in the world of Node and JavaScript, or Gemfiles in the world of Ruby on Rails. When you run pipenv install, the Pipfile is parsed and all the dependencies inside of it are downloaded and installed locally for your project. You can even pin the specific version of a dependency that you are importing. Feel free to add any packages you need to the Pipfile to implement your solution.

Host a Static Website

As mentioned in class, you can use S3 to host static content, like static websites. Let's do that for the STICs website!

At this point in the assignment, you have a tool that you could use to upload static content for a website; however, the tool does not yet set the content type of each file when it is uploaded (files default to a byte-stream type). If you tried to open an HTML file hosted on S3 with this content type, your browser would just download the file instead of rendering it. This is because the browser doesn't use the file extension to determine the type of file, as you might expect; it instead uses the Content-Type header in the response.

So, what we want to do is let S3 know what type of content it's getting during upload. You will do this by updating your tool to detect and set the content type header. S3 will then be able to pass that information to a browser, enabling it to display our content as desired. Take a look at the documentation again and note that path, bucket, acl, and s3_dest are not the only parameters defined for the upload function.

How do we know what to pass as arguments? Why, by using a library of course! It's just 2-3 lines with a package like mimetypes to detect and set the content type, as sketched below.

Note: A previous version of this codelab recommended python-magic, however it fails to identify a handful of content types such as CSS. Use an alternative, like the mimetypes package.
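To give you an idea, here is a minimal sketch. upload_with_type is a hypothetical helper, not part of the provided boilerplate, and your actual upload function takes different parameters; only the mimetypes usage matters here.

import mimetypes

import boto3

s3 = boto3.client("s3")

def upload_with_type(path, bucket, key):
    """Upload a local file, setting Content-Type based on its extension."""
    # guess_type maps e.g. "index.html" -> "text/html", "style.css" -> "text/css".
    content_type, _ = mimetypes.guess_type(path)
    if content_type is None:
        content_type = "application/octet-stream"  # fall back to a byte stream
    with open(path, "rb") as f:
        s3.put_object(Bucket=bucket, Key=key, Body=f, ContentType=content_type)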

You will want to clone the source code for the STICs website from its GitHub repository here.

$ git clone

Then, create a new bucket specifically for this site, such as cmsc389l-<directory id>-website, and enable website hosting on it:

$ aws s3 website s3://cmsc389l-colink-website --index-document index.html

Assuming that your tool properly sets the Content-Type header, you can now copy the source code into the root directory of this new bucket using your tool:

$ python <your script>.py <website directory> --bucket cmsc389l-colink-website --acl public-read
View the site!

To view the site, browse to the S3 website URL of index.html (for us-east-1, that's http://<bucket>.s3-website-us-east-1.amazonaws.com/index.html) and you will see the full STICs site!

Submit a screenshot of the front page of the STICs site hosted on S3. Make sure to include the URL in the screenshot (a screenshot without the S3 URL visible won't count!).


You will submit a zipped directory containing your script, plus your Pipfile listing any extra packages you used. You will also include the screenshot of the STICs homepage and your audit log.

Audit logs are stored as gzip-compressed JSON files in the bucket that you created. Within this bucket, the logs are organized in a directory hierarchy by date, region, and a few other factors (see this documentation page). You'll need to download all of your CloudTrail logs from this bucket for the us-east-1 region. You can use the aws s3 sync command to copy this directory to your local filesystem; you'll need to substitute your own bucket name, account number, and date:

$ # Command Format:
$ aws s3 sync s3://<cloudtrail bucket>/AWSLogs/<account number>/CloudTrail/us-east-1/<year>/<month> <local directory>
$ # Example:
$ aws s3 sync s3://cmsc389l-colink-codelab-02-trail/AWSLogs/800593953112/CloudTrail/us-east-1/2017/09 logs

In your submission, include the logs directory that you just downloaded.

Now that you have these log files locally, you can inspect their contents. To access a log, first unzip it with gunzip <log file name>.json.gz. Then you can pretty-print the audit log with jq (a JSON pretty-printer): cat <log file name>.json | jq . | less.
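If you'd rather poke at the logs with Python, here's a rough sketch (the filename is a placeholder):

import gzip
import json

# Each CloudTrail log file is a gzipped JSON document with a "Records" array.
with gzip.open("logs/<log file name>.json.gz", "rt") as f:
    records = json.load(f)["Records"]

for event in records:
    print(event["eventTime"], event["eventName"])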

Submit this assignment to codelab2 on the submit server. Upload a zipped directory containing the files described above:

<directory id>.zip
