@ifuller1
Created February 10, 2016 17:49
At Trail we've been using [Pentaho's open source data integration (Kettle) app](http://community.pentaho.com/projects/data-integration/) to design the ETL jobs that power our partner integrations. As we integrate with more partners, some of the more complex integrations have become harder to achieve.
Designing jobs in a GUI is a useful abstraction when they're made up of basic joins and transformations, but as some of our jobs have grown more complex, working with and understanding the tooling has produced diminishing returns.
### AWS Lambda
[AWS Lambda](https://aws.amazon.com/lambda/) is a stateless computation-as-a-service (CaaS?) platform, or as Amazon put it:
> AWS Lambda is a compute service where you can upload your code to AWS Lambda and the service can run the code on your behalf using
> AWS infrastructure. After you upload your code and create what we call a Lambda function, AWS Lambda takes
> care of provisioning and managing the servers that you use to run the code.
Using AWS Lambda we're able to write our more complex ETL jobs in Node - allowing us to leverage modern debugging tools and practices like TDD.
### Lambda as an ETL runner
We actually reviewed AWS Lambda as an ETL job running platform in the middle of 2015 but the lack of scheduling support stopped us moving ahead. Just a few months later, a lifetime in AWS product development, [Amazon introduced CloudWatch Events](https://aws.amazon.com/blogs/aws/new-cloudwatch-events-track-and-respond-to-changes-to-your-aws-resources/) - time based triggers that can be routed to several AWS services including Lambda functions.
Using CloudWatch Events we're able to configure ETL jobs and have them run on a regular interval - meaning our entire ETL platform could be replaced
([and cheaply](https://aws.amazon.com/lambda/pricing/)) with AWS Lambda functions.
### Using AWS Lambda, the basics
Below I'll cover the basics of what we used to set up a Lambda function. If you want to use any of this for yourself
you can use [our 'hello-lambda' repo](https://github.com/trailsuite/hello-lambda) as a starting point.
#### Building
We used [webpack](https://webpack.github.io/) to bundle multiple Lambda functions into individual artifacts that can be deployed to Lambda. You
could probably achieve the same with [Babel](https://babeljs.io/) and [Gulp](http://gulpjs.com/) but we've been using webpack recently and like the functionality it provides.
To bundle our Lambda functions we used multiple entry points by parsing `./src/lambdas` and using [webpack's entry
point configuration](https://github.com/webpack/webpack/tree/master/examples/multiple-entry-points).
```javascript
// webpack.config.js (requires and LAMBDAS_PATH shown here for completeness)
const fs = require("fs");
const path = require("path");

// Directory containing one file per Lambda function
const LAMBDAS_PATH = "./src/lambdas";

...
function createEntryPoints()
{
    // Create a { name: absolutePath } entry for each file in ./src/lambdas
    let entryPointsArray = fs.readdirSync(path.join(__dirname, LAMBDAS_PATH))
        .map(filename =>
        {
            return {
                [filename.replace(".js", "")]: path.join(__dirname, LAMBDAS_PATH, filename)
            };
        });

    // Merge the single-entry objects into one webpack entry map
    return entryPointsArray.reduce((returnObject, entryItem) =>
    {
        return Object.assign(returnObject, entryItem);
    }, {});
}

module.exports = {
    entry: createEntryPoints(),
    ...
```
With this and an entry in the npm `package.json` `scripts` block we can run `npm run-script watch` to start building
`./src/lambdas` for deployment.
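As a rough sketch, the corresponding `scripts` entries could be as simple as the following (assuming webpack picks up the config above; the real project's scripts may differ):
```
"scripts": {
  "build": "webpack",
  "watch": "webpack --watch"
}
```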
#### Running
The Lambda runtime invokes a Lambda function with an event object and a context. The event object carries the details
of whatever triggered the function, and the context provides the callbacks the function uses to signal completion.
To provide parity between your local environment and the Lambda runtime you can create a simple `run.js` as follows:
```javascript
// run.js: invoke the bundled functions locally with a stubbed event and context

// A CloudWatch scheduled-event payload, as Lambda would deliver it
let scheduleEventObject = {
    "account": "123456789012",
    ...
};

// The bundles produced by webpack
let FunctionA = require("./dist/functionA.js");
let FunctionB = require("./dist/functionB.js");

// A stand-in for the context object Lambda normally provides
let lambdaRunningContext = {
    succeed: ...
};

FunctionA.functionA(scheduleEventObject, lambdaRunningContext);
```
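The `succeed: ...` above is elided; a minimal stand-in (our own assumption, not necessarily what the real project uses) only needs to report the outcome:
```javascript
// A hypothetical local stand-in for the Lambda context
let lambdaRunningContext = {
    succeed: result => console.log("Succeeded:", result),
    fail: error => console.error("Failed:", error)
};
```
With that in place, running `node run.js` exercises a function in the same way Lambda will call it.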
#### Using environment variables
Using Lambda as an ETL runner required us to include some secrets, such as SFTP credentials. Unfortunately Lambda
doesn't provide a way to manage environment variables, so we turned to Amazon KMS (Key Management Service) to encrypt them ahead of time and decrypt them at runtime. We've omitted this process from the sample project, but the following code, along with a `no-parse.js`, gives us a workable solution.
```javascript
const AWS = require("aws-sdk");
const fs = require("fs");

function loadEnvironmentVariables(callback)
{
    let kms = new AWS.KMS({region: 'eu-west-1'});

    // The KMS-encrypted variables ship inside the deployment package
    let encryptedSecret = fs.readFileSync("./encrypted.env");
    let decryptedSecret;

    kms.decrypt({CiphertextBlob: encryptedSecret}, (err, decryptedData) =>
    {
        if (err)
        {
            console.log(err, err.stack);
        }
        else
        {
            // Write the decrypted variables to /tmp (the only writable path on
            // Lambda) and load them into process.env with dotenv
            decryptedSecret = decryptedData['Plaintext'].toString();
            fs.writeFileSync('/tmp/.env', decryptedSecret);
            require('dotenv').config({path: '/tmp/.env'});
            callback();
        }
    });
}
```
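A function can then load the variables before doing any real work. As a rough sketch (the handler body and the environment variable names here are illustrative, not from the real project):
```javascript
// Hypothetical handler showing where loadEnvironmentVariables fits
exports.functionB = function (event, context) {
    loadEnvironmentVariables(() => {
        // e.g. process.env.SFTP_HOST, process.env.SFTP_PASSWORD are now set (names illustrative)
        // ... run the ETL job here ...
        context.succeed("job complete");
    });
};
```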
#### Preparing AWS
Before you can deploy your Lambda function you need to configure it in AWS. If, like us, your default region is Ireland, go to [https://eu-west-1.console.aws.amazon.com/lambda/home?region=eu-west-1#/create](https://eu-west-1.console.aws.amazon.com/lambda/home?region=eu-west-1#/create) and create a Lambda function named, for example, `functionA`. The **Handler** setting is the name of the deployed module followed by the exported function: in our example the module is `functionA.js` and the exported function is `functionA`, so the **Handler** should be set to `functionA.functionA`.
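To make the mapping concrete, a module like the following (a minimal sketch; the body is illustrative) would be deployed with a **Handler** of `functionA.functionA`:
```javascript
// src/lambdas/functionA.js, bundled by webpack to dist/functionA.js
exports.functionA = function (event, context) {
    console.log("Received event:", JSON.stringify(event));
    context.succeed("functionA finished");
};
```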
You'll also need a role with the right execution policy. We're using the following:
```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": "arn:aws:logs:*:*:*"
        }
    ]
}
```
You'll need to manually provide the packaged zip file when first configuring the Lambda, but once it's configured you can use the automated deployment below.
*For further detail on creating Lambda functions, take a look at the AWS docs [here](http://docs.aws.amazon.com/lambda/latest/dg/get-started-create-function.html).*
#### Deploying to AWS
We use both [CircleCI](https://circleci.com/) and [Codeship](https://codeship.com/) at Trail. For simplicity we're using Codeship to build, test and
deploy our Lambda functions.
For the deployment to run you'll need to set the following environment variables on Codeship:
```
AWS_ACCESS_KEY_ID
AWS_DEFAULT_REGION
AWS_SECRET_ACCESS_KEY
```
We then use the following **Setup Commands**:
```
pip install awscli
nvm install 4.2.4
npm install
npm run-script build
```
along with a **Test Pipeline** consisting of:
```
npm test
```
to successfully run our tests.
To deploy the built artifacts we created the following script, which zips and deploys a given function to AWS. Using
this script you can set up a deployment pipeline on Codeship, passing the name of the function you want to deploy, e.g. `./deploy.script functionA`.
```
#!/bin/bash
echo "Hello $1"

# Zip the built bundle for the function passed as the first argument
cd ./dist/
zip -r $1.zip ./$1.js

# Push the new code, then invoke the function once as a smoke test
aws lambda update-function-code --function-name "$1" --zip-file fileb://$1.zip
aws lambda get-function --function-name "$1"
aws lambda invoke --function-name "$1" --payload "{}" output.log

if grep -q error output.log; then
    echo "There were errors deploying and running $1 :("
    cat output.log
    exit 1
else
    echo "$1 deployed!"
fi
```
### A note on performance
When we initially deployed our ETL jobs we noticed that network requests were failing or, even worse, that whole jobs were hitting Lambda's five-minute execution limit. After some research we realised that the memory configuration for our Lambda jobs (which defaults to 128MB) also determines the CPU performance of the Lambda environment. Additionally, with our jobs running only every twenty minutes or so, the Lambda environment was usually starting from cold. The solution was simply to increase the memory allocation, which gave us a more performant environment and removed these issues. A simple fix, and the price is still a fraction of the cost of our existing setup.
### Reference
Feel free to take a look at our reference project: [https://github.com/trailsuite/hello-lambda](https://github.com/trailsuite/hello-lambda). We've been impressed
with how easy Lambda is to work with and how much it could improve our pipeline. If you're interested in working with technologies like this to solve difficult problems, then get in touch!