This is just a wee play to see how to set luigi up on AWS. It was driven by my own misunderstanding of how luigi works and the need to get it set up on AWS. Most, if not all, of what's covered below will seem ridiculously obvious to most people.
Launch an AWS instance, dropping this code into the user_data field in the advanced settings. You need to add your own environment name and reference the luigi setup script for git to pull down and run, which basically means adding the code below to the end of the script.
git clone https://github.com/ejokeeffe/boiler_deploy.git
cd boiler_deploy/scripts/luigi
sudo chmod a+rwx setup_luigi.sh
bash setup_luigi.sh
When launching the instance, we need to open up port 8082 so we can reach the scheduler and view the tasks through the web interface. Launch the instance and ssh in. Check the logs to see if everything installed correctly:
cat /var/log/cloud-init-output.log
The setup_luigi.sh file creates the luigi configuration required for it to run. However, you still need to edit it so that it points to the scheduler:
[core]
default-scheduler-host=123.456.789.456
default-scheduler-port=8082
default-scheduler-url=http://123.456.789.456:8082
For more info on setting these up, check out the documentation.
We're going to use the top_artists example from the documentation. If you also add setup_examples.sh to the user data in the launch script, this example will be set up on your instance.
We've avoided it for long enough, so now let's ssh into the instance and run luigi.
# launch screen
screen
# activate the environment
source activate newenviron
# launch the daemon
luigid
This runs the luigi daemon (you can add the --background switch to run it in the background, otherwise it takes over a screen window). You can then view the scheduler at http://your_url:8082.
The next step is to run some tasks.
luigi --module top_artists AggregateArtists --date-interval 2012-09
You should see the results in ~/data/, although following the run through the web interface isn't really possible as it executes so quickly. It should still show the tasks, though, or if they have already been removed you can view them in the history at http://your_url:8082/history.
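For context, here's a stripped-down sketch of what a task along the lines of AggregateArtists looks like, modelled loosely on the Luigi docs example. The Streams stand-in and the exact file paths are my simplifications, not the real example code.

import luigi
from collections import Counter

class Streams(luigi.Task):
    """Simplified stand-in that fakes a day's worth of artist plays."""
    date = luigi.DateParameter()

    def output(self):
        return luigi.LocalTarget(self.date.strftime('data/streams_%Y_%m_%d.tsv'))

    def run(self):
        with self.output().open('w') as out:
            for artist in ['artist_a', 'artist_b', 'artist_a']:
                out.write(artist + '\n')

class AggregateArtists(luigi.Task):
    date_interval = luigi.DateIntervalParameter()

    def requires(self):
        # One Streams task per day in the interval.
        return [Streams(date) for date in self.date_interval.dates()]

    def output(self):
        return luigi.LocalTarget('data/artist_streams_{0}.tsv'.format(self.date_interval))

    def run(self):
        # Count plays per artist across all the daily stream files.
        counts = Counter()
        for input_target in self.input():
            with input_target.open('r') as f:
                for line in f:
                    counts[line.strip()] += 1
        with self.output().open('w') as out:
            for artist, count in counts.items():
                out.write('{0}\t{1}\n'.format(artist, count))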
Here we repeat the example above, but this time using more than one instance. Specifically, we're trying to find out where tasks save their output data. Follow the launch procedure outlined above, but this time point the scheduler settings at the first instance rather than the new one.
Now, when we run
luigi --module top_artists AggregateArtists --date-interval 2012-09
on each of the instances (let's call them scheduler and worker), what happens? Well, they both run all the tasks. I guess that's because, locally, each needs to be able to see the file outputs.
This isn't surprising, as luigi was made with Hadoop in mind (I think!), where output lives in shared storage; with local targets, each worker only sees its own filesystem. So what we want is to run the workers so that they're all pointing at the same storage location. We can do this using Hadoop or S3, which we'll cover next.
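To make the mechanism concrete: luigi treats a task as complete when its output target exists, and a LocalTarget only checks the local filesystem, so each instance decides it still has work to do. A minimal illustration (the path is just the one the example writes to):

import luigi

# A task counts as complete when its output exists; LocalTarget checks
# the filesystem of whichever machine the worker happens to run on.
target = luigi.LocalTarget('data/artist_streams_2012-09.tsv')
print(target.exists())  # False on a fresh instance, so the task re-runs there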
As you've already set up the instances and run the scripts, everything should be in place for this to work with s3, except you'll need to configure an environment variable to store the s3 location.
echo 'export LUIGIS3_EXAMPLES="//Bucket/folder/"' >> ~/.bashrc
Now close the shell and reconnect to pick this up. You obviously need to replace Bucket/folder with your own location. Also, you'll need to configure AWS credentials locally on each instance so it can connect.
Luigi uses boto to access s3, so you will have to edit your ~/.boto file. A default one is copied across during the setup process; you'll need to edit the access key and secret access key.
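After editing, ~/.boto should look something like this (the [Credentials] section is boto's standard config format; the placeholder values are yours to fill in):

[Credentials]
aws_access_key_id = <your access key>
aws_secret_access_key = <your secret key>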
boiler_deploy contains an extended version of top_artists.py, called top_artists_extended.py, which includes the functions for linking to s3. We'll use this to test the s3 functionality. Note the addition of the sleep_seconds parameter, which lets us slow execution down so the work can be spread among a number of workers.
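I won't reproduce top_artists_extended.py verbatim here, but the s3 version of the task essentially boils down to swapping LocalTarget for S3Target. The sketch below is a guess at its shape: the class and parameter names match the command we're about to run, but the body is illustrative rather than the repo's actual code.

import os
import time
import luigi
from luigi.s3 import S3Target  # newer luigi versions: luigi.contrib.s3

# Assumed to hold a full s3://bucket/prefix/ URL (see the export above).
S3_ROOT = os.environ.get('LUIGIS3_EXAMPLES', 's3://Bucket/folder/')

class AggregateArtistsS3(luigi.Task):
    date_interval = luigi.DateIntervalParameter()
    sleep_seconds = luigi.IntParameter(default=0)

    def output(self):
        # Every worker resolves the same s3 key, so the output is genuinely shared.
        return S3Target('{0}artist_streams_{1}.tsv'.format(S3_ROOT, self.date_interval))

    def run(self):
        time.sleep(self.sleep_seconds)  # artificial delay so tasks spread across workers
        with self.output().open('w') as out:
            out.write('some_artist\t100\n')  # the real task aggregates the Streams data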
luigi --module top_artists_extended AggregateArtistsS3 --date-interval 2012-09 --sleep-seconds 10 --workers 3
Run this command on both instances and then open the Luigi Task Status web app. Navigate to the workers tab and you will see 3 workers set up on each IP. The work is shared between all 6 workers across both instances. Finally, you can go to your bucket on s3 and see the files that were created.
This time we're experimenting with connecting to an RDS postgres db. First set up a db, then create the following environment variables on the instance, replacing the <..> placeholders with the credentials used to create the database.
cp ~/.bashrc ~/.bashrc.bkup.luigidbexamples
echo 'export LUIGI_DBHOST=<hostname>' >> ~/.bashrc
echo 'export LUIGI_DBDATABASE=<db>' >> ~/.bashrc
echo 'export LUIGI_DBUSER=<user>' >> ~/.bashrc
echo 'export LUIGI_DBPASSWORD=<pass>' >> ~/.bashrc
As before, I've taken the example code from top_artists.py and extended it to pick up these environment variables. I've adapted Top10Artists so it writes to s3, and added ArtistS3ToDatabase, which writes those results to the database.
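Again, the real code lives in top_artists_extended.py; as a rough sketch, a task like ArtistS3ToDatabase can lean on luigi's postgres CopyToTable helper, which handles the insert and the table_updates bookkeeping for you. The upstream stand-in task and the column layout below are my assumptions, just to keep the sketch self-contained.

import os
import luigi
from luigi import postgres  # newer luigi versions: luigi.contrib.postgres

class FakeTop10Artists(luigi.Task):
    """Stand-in for the upstream s3 task, just to keep this runnable."""
    def output(self):
        return luigi.LocalTarget('data/fake_top_artists.tsv')

    def run(self):
        with self.output().open('w') as out:
            out.write('2012-09-01\t2012-10-01\tsome_artist\t100\n')

class ArtistS3ToDatabase(postgres.CopyToTable):
    # Connection details come from the environment variables created above.
    host = os.environ.get('LUIGI_DBHOST', 'localhost')
    database = os.environ.get('LUIGI_DBDATABASE', 'luigi')
    user = os.environ.get('LUIGI_DBUSER', 'luigi')
    password = os.environ.get('LUIGI_DBPASSWORD', '')

    table = 'artist_streams'
    columns = [('date_from', 'DATE'),
               ('date_to', 'DATE'),
               ('artist', 'TEXT'),
               ('streams', 'INT')]

    def requires(self):
        # The real task requires the s3-backed Top10Artists output instead.
        return FakeTop10Artists()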
luigi --module top_artists_extended ArtistS3ToDatabase --date-interval 2012-09 --sleep-seconds 10 --workers 3
This creates a table in the database called artist_streams, as well as another table called table_updates that records which files were written to it.
It's a little bit of a faff to set up the process on each instance. However, you could launch the main scheduler instance first, with user data scripts to get it started. Once this is up and running, copy its ip/url into the luigi.cfg within the user data for the other instances. That way the whole process can be kicked off using user data without having to ssh in at all.