At Trail we've been using [Pentaho's open source data integration (Kettle) app](http://community.pentaho.com/projects/data-integration/) to design the ETL jobs that power our partner integrations. As we integrate with more partners, we've found some of the more complex integrations harder to achieve. Using a GUI to design jobs provides a useful abstraction when they're made up of basic joins and transformations but, as some of our jobs have become more complex, working with and understanding the tooling has produced diminishing returns.
### AWS Lambda

[AWS Lambda](https://aws.amazon.com/lambda/) is a stateless computation-as-a-service (CaaS?) platform, or as Amazon put it:

> AWS Lambda is a compute service where you can upload your code to AWS Lambda and the service can run the code on your behalf using
> AWS infrastructure. After you upload your code and create what we call a Lambda function, AWS Lambda takes
> care of provisioning and managing the servers that you use to run the code.

Using AWS Lambda we're able to write our more complex ETL jobs in Node - allowing us to leverage modern debugging tools and practices like TDD.
### Lambda as an ETL runner

We actually reviewed AWS Lambda as an ETL job running platform in the middle of 2015, but the lack of scheduling support stopped us moving ahead. Just a few months later - a lifetime in AWS product development - [Amazon introduced CloudWatch Events](https://aws.amazon.com/blogs/aws/new-cloudwatch-events-track-and-respond-to-changes-to-your-aws-resources/): time-based triggers that can be routed to several AWS services, including Lambda functions.

Using CloudWatch Events we're able to configure ETL jobs and have them run on a regular interval - meaning our entire ETL platform could be replaced ([and cheaply](https://aws.amazon.com/lambda/pricing/)) with AWS Lambda functions.
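As a rough sketch of the wiring involved (the rule name, schedule and ARN below are illustrative rather than taken from our setup, and in practice you'd also scope the permission with `--source-arn`), a scheduled trigger comes down to three CLI calls:

```
# Create a rule that fires every twenty minutes...
aws events put-rule --name etl-schedule --schedule-expression "rate(20 minutes)"

# ...give CloudWatch Events permission to invoke the function...
aws lambda add-permission --function-name functionA \
    --statement-id etl-schedule --action lambda:InvokeFunction \
    --principal events.amazonaws.com

# ...and point the rule at the function.
aws events put-targets --rule etl-schedule \
    --targets '[{"Id": "1", "Arn": "arn:aws:lambda:eu-west-1:123456789012:function:functionA"}]'
```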
### Using AWS Lambda, the basics

Below I'll cover the basics of what we used to set up a Lambda function. If you want to use any of this for yourself you can use [our 'hello-lambda' repo](https://github.com/trailsuite/hello-lambda) as a starting point.
#### Building

We used [webpack](https://webpack.github.io/) to bundle multiple Lambda functions into individual artifacts that can be deployed to Lambda. You could probably achieve the same with [Babel](https://babeljs.io/) and [Gulp](http://gulpjs.com/), but we've been using webpack recently and like the functionality it provides.

To bundle our Lambda functions we parse `./src/lambdas` and use [webpack's entry point configuration](https://github.com/webpack/webpack/tree/master/examples/multiple-entry-points) to generate multiple entry points:
```javascript
const fs = require("fs");
const path = require("path");

const LAMBDAS_PATH = "./src/lambdas";

// Build a {"functionName": "/path/to/functionName.js"} object for each
// file found in ./src/lambdas.
function createEntryPoints()
{
    let entryPointsArray = fs.readdirSync(path.join(__dirname, LAMBDAS_PATH))
        .map(filename =>
        {
            return {
                [filename.replace(".js", "")]: path.join(__dirname, LAMBDAS_PATH, filename)
            };
        });

    // Merge the single-key objects into one webpack entry object.
    return entryPointsArray.reduce((returnObject, entryItem) =>
    {
        return Object.assign(returnObject, entryItem);
    }, {});
}

module.exports = {
    entry: createEntryPoints(),
    // ...the rest of the webpack configuration (output, loaders, etc.)
};
```
With this and an entry in the npm `package.json` `scripts` block we can run `npm run-script watch` to start building `./src/lambdas` for deployment.
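For reference, the relevant `scripts` entries might look something like the following - a sketch rather than our verbatim setup, since the exact webpack invocation depends on where your config file lives:

```
{
  "scripts": {
    "build": "webpack",
    "watch": "webpack --watch"
  }
}
```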
#### Running

The Lambda runtime calls a Lambda function with an event object and a context. The function depends on these parameters for the event details and for the callbacks it invokes on completion.

To provide parity between your local environment and the Lambda runtime you can create a simple `run.js` as follows:
```javascript
// A minimal CloudWatch Events payload - see the AWS docs for the full shape.
let scheduleEventObject = {
    "account": "123456789012"
    // ...the remaining event fields
};

let FunctionA = require("./dist/functionA.js");
let FunctionB = require("./dist/functionB.js");

// Stub out the completion callbacks the Lambda runtime normally provides.
let lambdaRunningContext = {
    succeed: result => console.log("Succeeded:", result),
    fail: error => console.error("Failed:", error)
};

FunctionA.functionA(scheduleEventObject, lambdaRunningContext);
```
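With that in place, `node run.js` exercises the bundled functions locally in the same way the Lambda runtime would invoke them.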
#### Using environment variables

Using Lambda as an ETL runner required us to include some secure keys, such as SFTP credentials. Unfortunately Lambda doesn't provide a way to manage environment variables, so we looked at Amazon KMS (Key Management Service) to encrypt them ahead of time and decrypt them at runtime. We've omitted this process from the sample project, but the following code along with a `no-parse.js` provides us with a workable solution.
```javascript
const AWS = require("aws-sdk");
const fs = require("fs");

function loadEnvironmentVariables(callback)
{
    let kms = new AWS.KMS({region: 'eu-west-1'});
    let encryptedSecret = fs.readFileSync("./encrypted.env");

    kms.decrypt({CiphertextBlob: encryptedSecret}, (err, decryptedData) =>
    {
        if (err)
        {
            console.log(err, err.stack);
        }
        else
        {
            // Write the decrypted variables to /tmp (the only writable path
            // in the Lambda environment) and load them with dotenv.
            let decryptedSecret = decryptedData['Plaintext'].toString();
            fs.writeFileSync('/tmp/.env', decryptedSecret);
            require('dotenv').config({path: '/tmp/.env'});
            callback();
        }
    });
}
```
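The encryption side isn't shown in the sample project either; one way to produce `encrypted.env` is with the KMS CLI (the key alias here is hypothetical, and your `.env` must fit within KMS's 4KB plaintext limit):

```
aws kms encrypt --key-id alias/my-etl-key --plaintext fileb://.env \
    --query CiphertextBlob --output text | base64 --decode > encrypted.env
```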
#### Preparing AWS

Before you can deploy your Lambda function you need to configure it on AWS. If, like us, your default region is Ireland you would go to [https://eu-west-1.console.aws.amazon.com/lambda/home?region=eu-west-1#/create](https://eu-west-1.console.aws.amazon.com/lambda/home?region=eu-west-1#/create) and create a Lambda function named, for example, `functionA`. The handler name is generated from the imported module name followed by the exported function. In our example we have `functionA.js` as the imported module and `functionA` as the function, which means the **Handler** should be set to `functionA.functionA`.
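Concretely, that naming means the module's exported shape looks something like this minimal sketch:

```javascript
// functionA.js - the filename plus the export name give the
// handler "functionA.functionA".
module.exports.functionA = function(event, context)
{
    // ...do the ETL work, then signal completion
    context.succeed("Done");
};
```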
You'll also need a role with the right execution policy. We're using the following:

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:*:*:*"
    }
  ]
}
```
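The role also needs a trust relationship that allows Lambda to assume it; the standard one looks like this:

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "lambda.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
```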
You'll need to manually provide the packaged zip file when first configuring the Lambda function, but once it's configured you can use the automated deployment below.

*For further detail on creating Lambda functions, take a look at the AWS docs [here](http://docs.aws.amazon.com/lambda/latest/dg/get-started-create-function.html).*
#### Deploying to AWS | |
We use both [CircleCI](https://circleci.com/) and [Codeship](https://codeship.com/) at Trail. For simplicity we're using Codeship to build, test and | |
deploy our Lambda functions. | |
For the deployment to run you'll need to set the following environment variables on Codeship: | |
``` | |
AWS_ACCESS_KEY_ID | |
AWS_DEFAULT_REGION | |
AWS_SECRET_ACCESS_KEY | |
``` | |
We then use the following **Setup Commands**:

```
pip install awscli
nvm install 4.2.4
npm install
npm run-script build
```

along with a **Test Pipeline** consisting of:

```
npm test
```

to successfully run our tests.
To deploy the built artifacts we created the following script, which zips a given function and deploys it to AWS. Using this script you can set up a deployment pipeline on Codeship, providing the name of the function you're trying to deploy, e.g. `./deploy.script functionA`.
```
#!/bin/bash

echo "Hello $1"

# Zip the built artifact and push it to the existing Lambda function.
cd ./dist/
zip -r $1.zip ./$1.js
aws lambda update-function-code --function-name "$1" --zip-file fileb://$1.zip
aws lambda get-function --function-name "$1"

# Invoke the function once as a smoke test and check the output for errors.
aws lambda invoke --function-name "$1" --payload "{}" output.log
if grep -q error output.log; then
  echo "There were errors deploying and running $1 :("
  cat output.log
  exit 1
else
  echo "$1 deployed!"
fi
```
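The final `aws lambda invoke` is what makes this a useful pipeline step: if the function fails, Lambda writes a payload containing an `errorMessage` key to `output.log`, which the `grep` picks up to fail the build.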
### A note on performance

When we initially deployed our ETL jobs we noticed that network requests were failing or, even worse, that whole jobs were hitting Lambda's five minute timeout. After some research we realised that the memory configuration for our Lambda jobs (which defaults to 128MB) is directly tied to the CPU performance of the Lambda environment. Additionally, with our jobs running only every twenty minutes or so, the Lambda environment was starting from cold. The solution was to simply increase the memory allocation; doing so removed these issues by providing a more performant environment. A simple solution, and the price is still a fraction of our existing setup.
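If you'd rather script that change than click through the console, memory and timeout can be set with the CLI (the values here are just an example):

```
aws lambda update-function-configuration --function-name functionA \
    --memory-size 512 --timeout 300
```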
### Reference

Feel free to take a look at our reference project: [https://github.com/trailsuite/hello-lambda](https://github.com/trailsuite/hello-lambda). We've been impressed with how easy Lambda was to work with and how much it could improve our pipeline. If you're interested in working with technologies like this to solve difficult problems, then get in touch!