Simplify Log Management with ELK on AWS

Elasticsearch is a robust search and analytics engine designed for efficient information retrieval across diverse datasets. When combined with Logstash and Kibana, it forms the well-known “ELK stack”, frequently employed for collecting, temporarily storing, analyzing, and visualizing log data. Typically, additional components are also necessary: Filebeat, which ships logs from servers to Logstash, and Elastalert, which generates alerts based on analysis of the data in Elasticsearch.

The ELK Stack: A Powerful Yet Challenging Solution

My experience with ELK for log management has been a mixed bag. While it’s undeniably powerful and feature-rich, its setup and maintenance can be quite complex.

Elasticsearch’s versatility allows its use in various scenarios, even as a search engine. However, this generality means additional configuration is required to tailor it for the specific demands of log management.

Setting up the ELK cluster was no easy feat, demanding considerable parameter tuning to achieve functionality. Configuring the five components (Filebeat, Logstash, Elasticsearch, Kibana, and Elastalert) was tedious, involving extensive documentation reviews and debugging communication issues between them. Even after successfully launching the cluster, ongoing maintenance tasks like patching, upgrading OS packages, monitoring resource usage, and fine-tuning were essential.

A Logstash update that broke my ELK stack due to a seemingly trivial keyword change (pluralization) pushed me to explore alternative solutions better suited to my needs.

My objective was to store logs from Apache, PHP, and Node.js applications, then analyze these logs for patterns indicating software bugs. The solution I opted for involved:

  • Installing CloudWatch Agent on the target.
  • Configuring CloudWatch Agent to transfer logs to CloudWatch Logs.
  • Triggering Lambda functions to process these logs.
  • Configuring the Lambda function to send notifications to a Slack channel upon pattern detection.
  • Implementing filters on CloudWatch Log Groups to minimize Lambda function invocations and control costs.
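The agent side of the first two steps boils down to a small JSON configuration file. Here is a minimal sketch; the file paths and log group names are illustrative assumptions, not values taken from the stack below (the template later in this article doesn't set explicit log group names):

```json
{
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/apache2/access.log",
            "log_group_name": "apache-access",
            "log_stream_name": "{instance_id}"
          },
          {
            "file_path": "/var/log/apache2/error.log",
            "log_group_name": "apache-error",
            "log_stream_name": "{instance_id}"
          }
        ]
      }
    }
  }
}
```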

This serverless solution offers numerous advantages over a server cluster:

  • Routine maintenance tasks are offloaded to the cloud provider, ensuring underlying servers are consistently patched, upgraded, and maintained without user intervention.
  • Cluster monitoring and scaling are also managed by the cloud provider, with serverless setups like the one described scaling automatically.
  • Configuration is streamlined, and the risk of breaking changes in configuration formats from the cloud provider is significantly reduced.
  • Infrastructure-as-code implementation is simplified with CloudFormation templates.

Configuring Slack Alerts

Let’s delve into the specifics of implementing this setup using CloudFormation templates, incorporating Slack webhooks for engineer notifications. The initial step involves configuring Slack.

AWSTemplateFormatVersion: 2010-09-09

Description: Setup log processing

Parameters:
  SlackWebhookHost:
    Type: String
    Description: Host name for Slack web hooks
    Default: hooks.slack.com

  SlackWebhookPath:
    Type: String
    Description: Path part of the Slack webhook URL
    Default: /services/YOUR/SLACK/WEBHOOK

Refer to Slack’s Incoming Webhooks guide for detailed information on setting up your Slack workspace.

After creating your Slack app and configuring an incoming webhook, the webhook URL is used as a parameter in your CloudFormation stack.

Resources:
  ApacheAccessLogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      RetentionInDays: 100  # Or whatever is good for you

  ApacheErrorLogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      RetentionInDays: 100  # Or whatever is good for you

Here, we create two log groups: one for Apache access logs and another for Apache error logs.

Aside from the 100-day retention period, I haven’t configured any lifecycle management for log data in this example. In a practical scenario, you would likely also export older logs to S3 and define lifecycle policies for archiving to Glacier.
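If you do export logs to an S3 bucket, the Glacier transition can be expressed in the same template. A minimal sketch; the resource name, rule ID, and day counts here are my own placeholders, not part of the stack in this article:

```yaml
LogArchiveBucket:
  Type: AWS::S3::Bucket
  Properties:
    LifecycleConfiguration:
      Rules:
        - Id: ArchiveOldLogs
          Status: Enabled
          Transitions:
            - StorageClass: GLACIER
              TransitionInDays: 90
          ExpirationInDays: 365
```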

Lambda Function for Processing Access Logs

Let’s now focus on the Lambda function responsible for processing Apache access logs.

BasicLambdaExecutionRole:
  Type: AWS::IAM::Role
  Properties:
    AssumeRolePolicyDocument:
      Version: 2012-10-17
      Statement:
        - Effect: Allow
          Principal:
            Service: lambda.amazonaws.com
          Action: sts:AssumeRole
    ManagedPolicyArns:
      - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole

This section defines an IAM role, attached to the Lambda functions, granting them necessary permissions. The AWSLambdaBasicExecutionRole is essentially an AWS-provided IAM policy that allows the Lambda function to manage its log group and streams within CloudWatch Logs.

ProcessApacheAccessLogFunction:
  Type: AWS::Lambda::Function
  Properties:
    Handler: index.handler
    Role: !GetAtt BasicLambdaExecutionRole.Arn
    Runtime: python3.7
    Timeout: 10
    Environment:
      Variables:
        SLACK_WEBHOOK_HOST: !Ref SlackWebhookHost
        SLACK_WEBHOOK_PATH: !Ref SlackWebhookPath
    Code:
      ZipFile: |
        import base64
        import gzip
        import json
        import os
        from http.client import HTTPSConnection

        def handler(event, context):
            tmp = event['awslogs']['data']
            # `awslogs.data` is base64-encoded gzip'ed JSON
            tmp = base64.b64decode(tmp)
            tmp = gzip.decompress(tmp)
            tmp = json.loads(tmp)
            events = tmp['logEvents']
            for event in events:
                raw_log = event['message']
                log = json.loads(raw_log)
                if log['status'][0] == "5":
                    # This is a 5XX status code
                    print(f"Received an Apache access log with a 5XX status code: {raw_log}")
                    slack_host = os.getenv('SLACK_WEBHOOK_HOST')
                    slack_path = os.getenv('SLACK_WEBHOOK_PATH')
                    print(f"Sending Slack post to: host={slack_host}, path={slack_path}, content={raw_log}")
                    cnx = HTTPSConnection(slack_host, timeout=5)
                    cnx.request("POST", slack_path, json.dumps({'text': raw_log}))
                    # It's important to read the response; if the cnx is closed
                    # too quickly, Slack might not post the msg
                    resp = cnx.getresponse()
                    resp_content = resp.read()
                    assert resp.status == 200

This defines the Lambda function for processing Apache access logs. Note that I’m using a custom JSON-formatted access log format for easier downstream processing:

LogFormat "{\"vhost\": \"%v:%p\", \"client\": \"%a\", \"user\": \"%u\", \"timestamp\": \"%{%Y-%m-%dT%H:%M:%S}t\", \"request\": \"%r\", \"status\": \"%>s\", \"size\": \"%O\", \"referer\": \"%{Referer}i\", \"useragent\": \"%{User-Agent}i\"}" json

The Lambda function, written in Python 3, receives a log line from CloudWatch and performs pattern matching. This example identifies HTTP requests resulting in 5XX status codes and sends corresponding messages to a Slack channel.
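You can exercise the decoding path of such a handler locally by building an event the way CloudWatch Logs does: JSON, gzip-compressed, then base64-encoded under `awslogs.data`. A small sketch; the `make_cloudwatch_event` helper is my own, not part of the stack above:

```python
import base64
import gzip
import json

def make_cloudwatch_event(log_lines):
    """Wrap raw log lines the way CloudWatch Logs delivers them to Lambda:
    gzip-compressed, base64-encoded JSON under `awslogs.data`."""
    payload = {"logEvents": [{"message": line} for line in log_lines]}
    data = base64.b64encode(gzip.compress(json.dumps(payload).encode())).decode()
    return {"awslogs": {"data": data}}

# A sample JSON access log line shaped like the LogFormat used in this article
sample = json.dumps({"request": "GET /broken HTTP/1.1", "status": "503"})
event = make_cloudwatch_event([sample])

# Decode it exactly as the Lambda handler does
decoded = json.loads(gzip.decompress(base64.b64decode(event["awslogs"]["data"])))
assert decoded["logEvents"][0]["message"] == sample
assert json.loads(decoded["logEvents"][0]["message"])["status"][0] == "5"
```

This round trip makes it easy to unit-test the pattern-matching logic without deploying anything.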

Because the detection logic is ordinary Python, you can implement any pattern matching you like. This is significantly more expressive than the regex patterns available in Logstash or Elastalert configurations.

Revision Control

For smaller utility Lambda functions, embedding code within CloudFormation templates can be convenient. However, larger projects with multiple Lambda functions and layers would benefit from using SAM.

ApacheAccessLogFunctionPermission:
  Type: AWS::Lambda::Permission
  Properties:
    FunctionName: !Ref ProcessApacheAccessLogFunction
    Action: lambda:InvokeFunction
    Principal: logs.amazonaws.com
    SourceArn: !Sub arn:aws:logs:${AWS::Region}:${AWS::AccountId}:log-group:*

This grants CloudWatch Logs permission to invoke the Lambda function. It’s worth noting that using the SourceAccount property along with SourceArn might lead to conflicts. Generally, omitting SourceAccount is recommended when the invoking service resides in the same AWS account. The SourceArn already restricts access from other accounts.

ApacheAccessLogSubscriptionFilter:
  Type: AWS::Logs::SubscriptionFilter
  DependsOn: ApacheAccessLogFunctionPermission
  Properties:
    LogGroupName: !Ref ApacheAccessLogGroup
    DestinationArn: !GetAtt ProcessApacheAccessLogFunction.Arn
    FilterPattern: "{$.status = 5*}"

This defines the subscription filter, acting as the bridge between CloudWatch Logs and the Lambda function. Logs sent to ApacheAccessLogGroup matching the filter pattern are forwarded to the Lambda function. The filter, designed for JSON input, triggers only when the status field starts with “5.”

This selective invocation on 5XX HTTP status codes (indicative of server errors) helps optimize cost efficiency: the Lambda function only runs for log lines that could actually trigger an alert.

Detailed information about CloudWatch filter patterns, which are quite robust though not as comprehensive as Grok, can be found in Amazon CloudWatch documentation.
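For reference, a few shapes the JSON filter syntax supports: a wildcard match like the one above, an exact string match, and compound conditions. The third example assumes a numeric field, which the string-valued `%O` in the access-log format above would not satisfy without changes:

```text
{ $.status = 5* }
{ $.level = "error" }
{ ($.status = 5*) && ($.size > 1000) }
```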

The DependsOn field ensures the invoke permission exists before CloudFormation creates the subscription filter, so CloudWatch Logs is never wired to a function it isn’t yet allowed to call. While this race is unlikely to matter in real-world scenarios, since Apache would have a brief delay before receiving requests, it adds an extra layer of reliability.

Lambda Function for Processing Error Logs

Let’s examine the Lambda function responsible for handling Apache error logs.

ProcessApacheErrorLogFunction:
  Type: AWS::Lambda::Function
  Properties:
    Handler: index.handler
    Role: !GetAtt BasicLambdaExecutionRole.Arn
    Runtime: python3.7
    Timeout: 10
    Environment:
      Variables:
        SLACK_WEBHOOK_HOST: !Ref SlackWebhookHost
        SLACK_WEBHOOK_PATH: !Ref SlackWebhookPath
    Code:
      ZipFile: |
        import base64
        import gzip
        import json
        import os
        from http.client import HTTPSConnection

        def handler(event, context):
            tmp = event['awslogs']['data']
            # `awslogs.data` is base64-encoded gzip'ed JSON
            tmp = base64.b64decode(tmp)
            tmp = gzip.decompress(tmp)
            tmp = json.loads(tmp)
            events = tmp['logEvents']
            for event in events:
                raw_log = event['message']
                log = json.loads(raw_log)
                if log['level'] in ["error", "crit", "alert", "emerg"]:
                    # This is a serious error message
                    msg = log['msg']
                    if msg.startswith("PHP Notice") or msg.startswith("PHP Warning"):
                        print(f"Ignoring PHP notices and warnings: {raw_log}")
                    else:
                        print(f"Received a serious Apache error log: {raw_log}")
                        slack_host = os.getenv('SLACK_WEBHOOK_HOST')
                        slack_path = os.getenv('SLACK_WEBHOOK_PATH')
                        print(f"Sending Slack post to: host={slack_host}, path={slack_path}, content={raw_log}")
                        cnx = HTTPSConnection(slack_host, timeout=5)
                        cnx.request("POST", slack_path, json.dumps({'text': raw_log}))
                        # It's important to read the response; if the cnx is closed
                        # too quickly, Slack might not post the msg
                        resp = cnx.getresponse()
                        resp_content = resp.read()
                        assert resp.status == 200

This Lambda function processes Apache error logs and sends Slack notifications only for critical errors. PHP notices and warnings are not considered severe enough to warrant alerts.

Similar to the previous function, this one assumes JSON-formatted error logs, which can be configured using the following format string:

ErrorLogFormat "{\"vhost\": \"%v\", \"timestamp\": \"%{cu}t\", \"module\": \"%-m\", \"level\": \"%l\", \"pid\": \"%-P\", \"tid\": \"%-T\", \"oserror\": \"%-E\", \"client\": \"%-a\", \"msg\": \"%M\"}"
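The level- and prefix-based filtering in the handler above can be isolated into a small, testable function. A sketch; the `should_alert` helper is my own naming, not part of the deployed code:

```python
import json

# Apache error log levels considered serious enough to alert on
SERIOUS_LEVELS = {"error", "crit", "alert", "emerg"}

def should_alert(raw_log):
    """Replicate the Lambda's decision: alert on serious levels,
    but skip PHP notices and warnings."""
    log = json.loads(raw_log)
    if log["level"] not in SERIOUS_LEVELS:
        return False
    msg = log["msg"]
    return not (msg.startswith("PHP Notice") or msg.startswith("PHP Warning"))

assert should_alert(json.dumps({"level": "crit", "msg": "Segfault in worker"}))
assert not should_alert(json.dumps({"level": "error", "msg": "PHP Notice: undefined index"}))
assert not should_alert(json.dumps({"level": "warn", "msg": "mild issue"}))
```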
ApacheErrorLogFunctionPermission:
  Type: AWS::Lambda::Permission
  Properties:
    FunctionName: !Ref ProcessApacheErrorLogFunction
    Action: lambda:InvokeFunction
    Principal: logs.amazonaws.com
    SourceArn: !Sub arn:aws:logs:${AWS::Region}:${AWS::AccountId}:log-group:*

This grants CloudWatch Logs the necessary permissions to call the Lambda function.

ApacheErrorLogSubscriptionFilter:
  Type: AWS::Logs::SubscriptionFilter
  DependsOn: ApacheErrorLogFunctionPermission
  Properties:
    LogGroupName: !Ref ApacheErrorLogGroup
    DestinationArn: !GetAtt ProcessApacheErrorLogFunction.Arn
    FilterPattern: '{$.msg != "PHP Warning*" && $.msg != "PHP Notice*"}'

This final step links CloudWatch Logs to the Lambda function via a subscription filter for the Apache error log group. The filter pattern ensures that PHP warnings and notices are ignored, preventing unnecessary function invocations.

Final Thoughts, Pricing, and Availability

In terms of cost-effectiveness, this solution significantly outperforms an ELK cluster. CloudWatch Logs storage costs are comparable to S3, and Lambda’s free tier offers one million invocations per month, sufficient for moderately trafficked websites, especially with effective CloudWatch Logs filtering.

Lambda supports 1,000 concurrent executions per account by default. This is a service quota rather than a hard limit, and AWS can raise it on request. Given the 30-40 ms execution time of the functions described, even the default should handle substantial traffic. Workloads exceeding it might require a more sophisticated Kinesis-based solution, which I may explore in a future article.
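A back-of-the-envelope estimate of what that default quota allows, assuming roughly 35 ms per invocation (the midpoint of the 30-40 ms range quoted above):

```python
# Rough capacity estimate under the default 1,000-concurrency quota.
concurrency = 1000
avg_duration_s = 0.035  # assumed average invocation duration

# Little's law: sustainable throughput = concurrency / avg duration
max_invocations_per_s = concurrency / avg_duration_s
print(f"~{max_invocations_per_s:,.0f} filtered log batches per second")
```

Remember that, thanks to the subscription filters, only matching log events count toward this budget.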

Licensed under CC BY-NC-SA 4.0