Elasticsearch is a robust piece of software designed for efficient information retrieval across diverse datasets. When combined with Logstash and Kibana, it forms the commonly known “ELK stack”, frequently employed for collecting, temporarily storing, analyzing, and visualizing log data. Typically, additional components are needed as well, such as Filebeat, which ships logs from the servers to Logstash, and ElastAlert, which generates alerts based on analysis of the data held in Elasticsearch.
The ELK Stack: A Powerful Yet Challenging Solution
My experience with ELK for log management has been a mixed bag. While it’s undeniably powerful and feature-rich, its setup and maintenance can be quite complex.
Elasticsearch’s versatility allows its use in various scenarios, even as a search engine. However, this generality means additional configuration is required to tailor it for the specific demands of log management.
Setting up the ELK cluster was no easy feat, demanding considerable parameter tuning just to get it working. Configuring the five components (Filebeat, Logstash, Elasticsearch, Kibana, and ElastAlert) was tedious, involving extensive documentation reviews and debugging of communication issues between them. Even after the cluster was successfully launched, ongoing maintenance remained essential: patching, upgrading OS packages, monitoring resource usage, and fine-tuning.
A Logstash update that broke my ELK stack due to a seemingly trivial keyword change (pluralization) pushed me to explore alternative solutions better suited to my needs.
My objective was to store logs from Apache, PHP, and Node.js applications, then analyze these logs for patterns indicating software bugs. The solution I opted for involved:
- Installing the CloudWatch Agent on the target servers.
- Configuring the CloudWatch Agent to transfer logs to CloudWatch Logs (a configuration sketch follows this list).
- Triggering Lambda functions to process these logs.
- Configuring the Lambda function to send notifications to a Slack channel upon pattern detection.
- Implementing filters on CloudWatch Log Groups to minimize Lambda function invocations and control costs.
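For the agent configuration step, the CloudWatch Agent reads a JSON configuration file (typically /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json on Linux). A minimal sketch for shipping the Apache logs; the file paths and log group names are illustrative and must match the log groups created later in the CloudFormation template:

```json
{
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/apache2/access.log",
            "log_group_name": "apache/access",
            "log_stream_name": "{instance_id}"
          },
          {
            "file_path": "/var/log/apache2/error.log",
            "log_group_name": "apache/error",
            "log_stream_name": "{instance_id}"
          }
        ]
      }
    }
  }
}
```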
This serverless solution offers numerous advantages over a server cluster:
- Routine maintenance tasks are offloaded to the cloud provider, ensuring underlying servers are consistently patched, upgraded, and maintained without user intervention.
- Cluster monitoring and scaling are also managed by the cloud provider, with serverless setups like the one described scaling automatically.
- Configuration is streamlined, and the risk of breaking changes in configuration formats from the cloud provider is significantly reduced.
- Infrastructure-as-code implementation is simplified with CloudFormation templates.
Configuring Slack Alerts
Let’s delve into the specifics of implementing this setup using CloudFormation templates, incorporating Slack webhooks for engineer notifications. The initial step involves configuring Slack.
Refer to this WebHooks for Slack guide for detailed information on setting up your Slack workspace.
After creating your Slack app and configuring an incoming webhook, the webhook URL is used as a parameter in your CloudFormation stack.
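A sketch of the relevant part of the template; the parameter name (SlackWebhookUrl) and the log group names are assumptions, while the logical name ApacheAccessLogGroup is the one referenced later in this article:

```yaml
Parameters:
  SlackWebhookUrl:
    Type: String
    NoEcho: true   # the incoming webhook URL created in Slack

Resources:
  ApacheAccessLogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName: apache/access   # must match the CloudWatch Agent configuration

  ApacheErrorLogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName: apache/error    # must match the CloudWatch Agent configuration
```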
Here, we create two log groups: one for Apache access logs and another for Apache error logs.
While I haven’t configured any lifecycle management for log data in this example, in a practical scenario, you would likely define a retention period and S3 lifecycle policies for archiving to Glacier.
Lambda Function for Processing Access Logs
Let’s now focus on the Lambda function responsible for processing Apache access logs.
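A minimal sketch of such a role (the logical name LambdaExecutionRole is an assumption):

```yaml
# Goes under the Resources section of the template
LambdaExecutionRole:
  Type: AWS::IAM::Role
  Properties:
    AssumeRolePolicyDocument:
      Version: "2012-10-17"
      Statement:
        - Effect: Allow
          Principal:
            Service: lambda.amazonaws.com
          Action: sts:AssumeRole
    ManagedPolicyArns:
      # Lets the functions write their own logs to CloudWatch Logs
      - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
```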
This section defines an IAM role, attached to the Lambda functions, granting them necessary permissions. The AWSLambdaBasicExecutionRole is essentially an AWS-provided IAM policy that allows the Lambda function to manage its log group and streams within CloudWatch Logs.
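A sketch of the function with the handler code inlined via ZipFile; the logical name, runtime version, and environment variable name are assumptions, and the Slack webhook URL comes from the stack parameter shown earlier:

```yaml
# Goes under the Resources section of the template
ApacheAccessLogProcessor:
  Type: AWS::Lambda::Function
  Properties:
    Runtime: python3.12
    Handler: index.handler
    Role: !GetAtt LambdaExecutionRole.Arn
    Timeout: 10
    Environment:
      Variables:
        SLACK_WEBHOOK_URL: !Ref SlackWebhookUrl
    Code:
      ZipFile: |
        import base64, gzip, json, os, urllib.request

        def handler(event, context):
            # CloudWatch Logs delivers a base64-encoded, gzipped payload
            payload = gzip.decompress(base64.b64decode(event["awslogs"]["data"]))
            data = json.loads(payload)
            for log_event in data["logEvents"]:
                entry = json.loads(log_event["message"])
                # Alert on requests that ended with a 5XX status code
                if entry.get("status", "").startswith("5"):
                    notify_slack(entry)

        def notify_slack(entry):
            text = (f"5XX response: {entry.get('method')} {entry.get('uri')} "
                    f"-> status {entry.get('status')}")
            req = urllib.request.Request(
                os.environ["SLACK_WEBHOOK_URL"],
                data=json.dumps({"text": text}).encode("utf-8"),
                headers={"Content-Type": "application/json"})
            urllib.request.urlopen(req)
```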
This defines the Lambda function for processing Apache access logs. Note that I’m using a custom JSON-formatted access log format to make downstream processing easier.
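A LogFormat directive along these lines produces one JSON object per request; the exact field selection here is illustrative:

```apache
# Sketch of a JSON access-log format; note that Apache does not JSON-escape
# values, so quotes inside headers such as User-Agent can break parsing.
LogFormat "{\"time\":\"%{%Y-%m-%dT%H:%M:%S%z}t\",\"remote_ip\":\"%a\",\"method\":\"%m\",\"uri\":\"%U%q\",\"status\":\"%>s\",\"bytes\":\"%B\",\"user_agent\":\"%{User-Agent}i\"}" json_access
CustomLog "/var/log/apache2/access.log" json_access
```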
The Lambda function, written in Python 3, receives a log line from CloudWatch and performs pattern matching. This example identifies HTTP requests resulting in 5XX status codes and sends corresponding messages to a Slack channel.
Because the function is plain Python, you can implement any pattern-detection logic you like. This offers significantly more power than the regex patterns available in Logstash or ElastAlert configurations.
Revision Control
For smaller utility Lambda functions, embedding code within CloudFormation templates can be convenient. However, larger projects with multiple Lambda functions and layers would benefit from using AWS SAM (the Serverless Application Model).
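A sketch of the permission resource, using the logical names from the earlier sketches:

```yaml
# Goes under the Resources section of the template
ApacheAccessLogInvokePermission:
  Type: AWS::Lambda::Permission
  Properties:
    Action: lambda:InvokeFunction
    FunctionName: !GetAtt ApacheAccessLogProcessor.Arn
    Principal: logs.amazonaws.com
    # Restricts invocations to events delivered from this specific log group
    SourceArn: !GetAtt ApacheAccessLogGroup.Arn
```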
This grants CloudWatch Logs permission to invoke the Lambda function. It’s worth noting that using the SourceAccount property along with SourceArn might lead to conflicts. Generally, omitting SourceAccount is recommended when the invoking service resides in the same AWS account. The SourceArn already restricts access from other accounts.
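And a sketch of the subscription filter itself; the exact filter pattern is an assumption, written for the JSON access-log format shown above:

```yaml
# Goes under the Resources section of the template
ApacheAccessLogSubscriptionFilter:
  Type: AWS::Logs::SubscriptionFilter
  DependsOn:
    - ApacheAccessLogProcessor          # the function must exist first...
    - ApacheAccessLogInvokePermission   # ...and CloudWatch Logs must be allowed to call it
  Properties:
    LogGroupName: !Ref ApacheAccessLogGroup
    DestinationArn: !GetAtt ApacheAccessLogProcessor.Arn
    # Forward only JSON log lines whose status field starts with "5"
    FilterPattern: '{ $.status = "5*" }'
```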
This defines the subscription filter, acting as the bridge between CloudWatch Logs and the Lambda function. Logs sent to ApacheAccessLogGroup matching the filter pattern are forwarded to the Lambda function. The filter, designed for JSON input, triggers only when the status field starts with “5”.
Invoking the function only for 5XX HTTP status codes (indicative of server errors) helps keep costs down.
Detailed information about CloudWatch filter patterns, which are quite robust though not as comprehensive as Grok, can be found in the Amazon CloudWatch documentation.
The DependsOn field ensures the Lambda function is available before CloudWatch Logs attempts invocation. While likely unnecessary in real-world scenarios where Apache would have a brief delay before receiving requests, it adds an extra layer of reliability.
Lambda Function for Processing Error Logs
Let’s examine the Lambda function responsible for handling Apache error logs.
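A sketch along the same lines as the access-log function; the logical name is again an assumption, and the severity handling reflects the behaviour described below:

```yaml
# Goes under the Resources section of the template
ApacheErrorLogProcessor:
  Type: AWS::Lambda::Function
  Properties:
    Runtime: python3.12
    Handler: index.handler
    Role: !GetAtt LambdaExecutionRole.Arn
    Timeout: 10
    Environment:
      Variables:
        SLACK_WEBHOOK_URL: !Ref SlackWebhookUrl
    Code:
      ZipFile: |
        import base64, gzip, json, os, urllib.request

        # Apache log levels considered serious enough to alert on
        ALERT_LEVELS = {"error", "crit", "alert", "emerg"}

        def handler(event, context):
            payload = gzip.decompress(base64.b64decode(event["awslogs"]["data"]))
            data = json.loads(payload)
            for log_event in data["logEvents"]:
                entry = json.loads(log_event["message"])
                message = entry.get("message", "")
                # PHP notices and warnings are not worth an alert
                if "PHP Notice" in message or "PHP Warning" in message:
                    continue
                if entry.get("level") in ALERT_LEVELS:
                    notify_slack(entry)

        def notify_slack(entry):
            text = f"Apache error ({entry.get('level')}): {entry.get('message')}"
            req = urllib.request.Request(
                os.environ["SLACK_WEBHOOK_URL"],
                data=json.dumps({"text": text}).encode("utf-8"),
                headers={"Content-Type": "application/json"})
            urllib.request.urlopen(req)
```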
This Lambda function processes Apache error logs and sends Slack notifications only for critical errors. PHP notices and warnings are not considered severe enough to warrant alerts.
Similar to the previous function, this one assumes JSON-formatted error logs, which can be configured with a format string along the following lines.
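A sketch using Apache 2.4’s ErrorLogFormat directive; the field selection is illustrative:

```apache
# Sketch of a JSON error-log format; as with the access log, values are not
# JSON-escaped by Apache, so unusual messages can still break parsing.
ErrorLogFormat "{\"time\":\"%{cu}t\",\"level\":\"%-l\",\"module\":\"%-m\",\"pid\":\"%-P\",\"client\":\"%-a\",\"message\":\"%M\"}"
```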
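The matching permission, mirroring the one for the access-log function:

```yaml
# Goes under the Resources section of the template
ApacheErrorLogInvokePermission:
  Type: AWS::Lambda::Permission
  Properties:
    Action: lambda:InvokeFunction
    FunctionName: !GetAtt ApacheErrorLogProcessor.Arn
    Principal: logs.amazonaws.com
    SourceArn: !GetAtt ApacheErrorLogGroup.Arn
```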
This grants CloudWatch Logs the necessary permissions to call the Lambda function.
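A sketch of the filter; the pattern shown assumes that PHP notices and warnings surface with Apache’s notice and warn log levels, so only error-level entries and above are forwarded:

```yaml
# Goes under the Resources section of the template
ApacheErrorLogSubscriptionFilter:
  Type: AWS::Logs::SubscriptionFilter
  DependsOn:
    - ApacheErrorLogProcessor
    - ApacheErrorLogInvokePermission
  Properties:
    LogGroupName: !Ref ApacheErrorLogGroup
    DestinationArn: !GetAtt ApacheErrorLogProcessor.Arn
    # Drop warn/notice entries (including PHP notices and warnings) at the
    # filter so the function is not invoked for them at all
    FilterPattern: '{ ($.level = "error") || ($.level = "crit") || ($.level = "alert") || ($.level = "emerg") }'
```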
This final step links CloudWatch Logs to the Lambda function via a subscription filter for the Apache error log group. The filter pattern ensures that PHP warnings and notices are ignored, preventing unnecessary function invocations.
Final Thoughts, Pricing, and Availability
In terms of cost-effectiveness, this solution significantly outperforms an ELK cluster. CloudWatch Logs storage costs are comparable to S3, and Lambda’s free tier offers one million invocations per month, sufficient for moderately trafficked websites, especially with effective CloudWatch Logs filtering.
Lambda supports 1,000 concurrent executions by default, an account-level quota that can be raised on request. Given the 30-40ms execution time of the functions described here, this should handle substantial traffic. Workloads exceeding that limit might require a more sophisticated Kinesis-based solution, which I may explore in a future article.