Technology today enables businesses to do far more than they could a few years ago. Multiple systems support business processes including transactions, analysis, and reporting, and each process relies on a combination of software, hardware, networks, and more, all working to meet the user's requirements. Yet sometimes the tech team faces an extremely difficult question from the business user: “Why is the system not working?” With so many different systems working in tandem, an administrator can take a long time to identify the root cause of an error. Minimizing that time is where a log management system comes in. Ultimately, the logs tell us the story of why the system behaved abnormally at a particular time. Let’s now focus on the challenges of logs and their management, and the possible solutions.
In the real world, a solution is a combination of many components, and each component generates its own logs, stored across a distributed environment. For example, a network device may store SNMP logs on its local storage, while Apache application logs sit on the application server. Further, if the application servers run in a cluster, the logs may be stored locally on each node or in a central location for that cluster.
In the 1990s era of client-server applications, these logs were dumped into one or more co-located files as required. In the 2000s, web applications kept separate application-server and database-server logs. Still manageable, isn’t it?
But with current innovations in technology, we are talking about microservices, on-demand container-based deployments, multiple VMs, service buses, near-real-time streaming engines, and more, each running as a separate component in isolation yet each important to the entire ecosystem.
Not only are these logs distributed across multiple locations, they also come in different formats depending on the device and the OEM, and of course they are not always easily readable. Reading these logs separately and correlating events across them is a daunting task.
So, here is the problem statement and there must be a solution to it!
This is the stepping stone, the starting point, for log management. Ingestion of log files into a central repository can be configured in various ways, such as:
Writing code in the application so that it writes its log files directly to the central path
Scheduling log shippers to push the files to the central path
Polling the devices from an application at the central site to pull the log files
The objective is to land these files on the same storage for further processing.
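The log-shipper approach can be sketched in a few lines of Python. This is a minimal illustration with hypothetical directory paths; a real shipper such as Filebeat would also handle file rotation, partial writes, and checkpointing:

```python
import shutil
from pathlib import Path

def ship_logs(local_dir: Path, central_dir: Path) -> list:
    """Copy any local log file not yet present on the central path.

    De-duplication here is naive (by file name only); it is enough to
    show the push model a scheduled shipper would run periodically.
    """
    central_dir.mkdir(parents=True, exist_ok=True)
    shipped = []
    for log_file in sorted(local_dir.glob("*.log")):
        target = central_dir / log_file.name
        if not target.exists():
            shutil.copy2(log_file, target)  # copy2 preserves timestamps
            shipped.append(log_file.name)
    return shipped
```

A scheduler (cron, systemd timer, or a task queue) would invoke `ship_logs` at a fixed interval for each source host.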
Log ingestion is an ongoing process with two steps:
One-time bulk ingestion of existing logs
Continuous ingestion of newly generated log entries
Once the logs are collected from a source into the central repository, they are processed further into a more structured form. Collecting and ingesting the data is difficult; parsing it, even more so.
Log parsing converts semi-structured or text-based data into more structured data, such as JSON or custom-formatted text output, which can then be used for analysis. During parsing, values are extracted from the text; extraction may be limited to the attributes that matter for the analysis. Log transformation converts the value of a particular field into a more meaningful format or value; a typical example is a date field. Log enrichment adds more insight to your logs; flagging a known bot IP is one of the most common examples of enrichment in the security domain.
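As a sketch of parsing plus transformation, the following Python snippet parses Apache "common" access-log lines into structured records and normalizes the timestamp field. The regex is a minimal illustration, not a production-grade parser:

```python
import re
from datetime import datetime

# Apache "common" access-log format, e.g.:
# 127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+)'
)

def parse_line(line: str) -> dict:
    """Parse one access-log line into a structured record ({} if no match)."""
    match = LOG_PATTERN.match(line)
    if not match:
        return {}
    record = match.groupdict()
    # Transformation: convert the Apache timestamp to ISO 8601.
    record["time"] = datetime.strptime(
        record["time"], "%d/%b/%Y:%H:%M:%S %z"
    ).isoformat()
    record["status"] = int(record["status"])
    record["size"] = int(record["size"])
    return record
```

Enrichment would then be one more step over the same record, for example tagging `record["ip"]` against a known-bot list.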
The log normalization process resolves divergent representations of the same type of data into a common format in the database. This may be achieved, for example, by synchronizing the date and time of events into a common format such as Coordinated Universal Time (UTC).
The benefit of normalization is that uniform information is available across the dataset, which is essential for further analysis.
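A minimal sketch of timestamp normalization in Python, assuming a small list of known source formats (the formats listed are illustrative, not exhaustive):

```python
from datetime import datetime, timezone

# Hypothetical source formats seen across devices; extend as needed.
KNOWN_FORMATS = [
    "%d/%b/%Y:%H:%M:%S %z",   # Apache access logs
    "%Y-%m-%dT%H:%M:%S%z",    # ISO 8601 with offset
    "%b %d %H:%M:%S",         # classic syslog (no year, no zone)
]

def to_utc(timestamp: str) -> str:
    """Normalize a timestamp string from any known format to UTC ISO 8601."""
    for fmt in KNOWN_FORMATS:
        try:
            parsed = datetime.strptime(timestamp, fmt)
        except ValueError:
            continue
        if parsed.tzinfo is None:
            # Naive timestamp: assume it is already UTC (a policy decision).
            parsed = parsed.replace(tzinfo=timezone.utc)
        return parsed.astimezone(timezone.utc).isoformat()
    raise ValueError(f"unrecognized timestamp: {timestamp!r}")
```

After this step, events from an Apache server in one zone and a syslog device in another sort correctly on a single timeline.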
Indexing the normalized data, so that searching, filtering, and presenting it is fast, is a basic requirement of any log management system. We are talking about an ecosystem into which huge volumes of data are pumped daily for near-real-time analysis. Unless there is a robust indexing mechanism, it is next to impossible to run such queries, and attempting to may bring down the entire ecosystem.
Indexing the data has many advantages, although it carries some overhead in storage and in the time taken to insert, update, or delete records.
A word of caution: a poorly constructed index may be more harmful than not creating an index at all.
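To illustrate why indexing makes search feasible, here is a toy inverted index in Python. Real engines such as Elasticsearch use far more sophisticated structures, but the principle is the same: look up matching records by token instead of scanning every log line.

```python
from collections import defaultdict

class InvertedIndex:
    """Minimal inverted index: token -> set of log-record ids."""

    def __init__(self):
        self._index = defaultdict(set)
        self._docs = {}

    def add(self, doc_id: int, text: str) -> None:
        """Insert pays the indexing cost up front: every token is registered."""
        self._docs[doc_id] = text
        for token in text.lower().split():
            self._index[token].add(doc_id)

    def search(self, token: str) -> list:
        """Lookup is a single dictionary access, independent of corpus size."""
        ids = self._index.get(token.lower(), set())
        return [self._docs[i] for i in sorted(ids)]
```

The overhead mentioned above is visible here too: `add` does extra work and `_index` consumes extra memory, which is the price paid for fast `search`.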
Logs should be stored according to requirements. The storage media can be classified into tiers, which may be DAS (local hard drives) or external storage such as SAN, NAS, and so on. The log retention policy for safeguarding and archiving logs should follow organizational policy while adhering to legal requirements.
Logs hold very critical information for any organization, especially one operating in the internet space. They must be protected from attack, so it is important to secure them. Various tools and techniques are available for this, including but not limited to cryptography and masking.
Log visualization and reporting can yield a lot of useful information, but there are prerequisites before logs can be visualized. One such step is aggregating the logs at multiple levels of detail, such as by region or by device type.
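Aggregation at this level can be as simple as counting records per attribute. A minimal Python sketch, assuming the records are the structured dicts produced by the parsing step (field names are illustrative):

```python
from collections import Counter

def aggregate(records: list, field: str) -> Counter:
    """Count log records grouped by one attribute (e.g. region, device type).

    Records missing the attribute are bucketed as "unknown" so the
    totals still add up for a chart.
    """
    return Counter(record.get(field, "unknown") for record in records)
```

The resulting counts feed directly into a bar or pie chart per region, device type, or any other dimension the report needs.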
Log visualization is an art: showing the right information to the right user is what matters most, and that involves selecting the right visualization (bar chart, pie chart, area graph, and so on).
While selecting a visualization tool, one must assess the requirements clearly. Some of these tools offer user-friendly DIY (do it yourself) capabilities.
An alert is a notification of an event, based on a trigger or a condition/rule created in the system. Before generating alerts, it is important to baseline, that is, to define what is normal and what is abnormal in the ecosystem. For example, a user who is generally not very active suddenly becomes very active in the system. This behavioral abnormality can be configured in the system, and an alert generated for the relevant team.
Most products support multiple channels for sending these alerts, be it SMS or email. It is also very important to ensure an alert is delivered only once per causative event, and that each alert is managed through to complete closure.
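A minimal Python sketch of a threshold rule with once-per-event de-duplication; the rule, threshold, and field names are illustrative assumptions, not any product's API:

```python
class ThresholdAlerter:
    """Fire an alert when a rule matches, at most once per causative event."""

    def __init__(self, threshold: int):
        self.threshold = threshold
        self._fired = set()  # sources with an open (unclosed) alert

    def check(self, source: str, error_count: int):
        """Return an alert message, or None if the rule does not fire.

        A source with an open alert is suppressed, so one causative
        event produces exactly one notification.
        """
        if error_count >= self.threshold and source not in self._fired:
            self._fired.add(source)
            return (f"ALERT: {source} logged {error_count} errors "
                    f"(threshold {self.threshold})")
        return None

    def close(self, source: str) -> None:
        """Manage the alert to closure so the rule can fire again later."""
        self._fired.discard(source)
```

The returned message would then be handed to whatever SMS or email gateway the product provides.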
Up to this point, the discussion has covered the reactive side. Once logs have been collected and analyzed over a period of time, patterns emerge. These patterns can drive proactive alerts using analytical models, with the historical data used to train and validate the models initially.
Over time, the models can be improved manually or through machine learning. Once there is enough data in the system, even more patterns can be discovered using artificial intelligence. The final objective is a self-evolving solution that sustains the demands of the ecosystem; one example is predictive maintenance of a server cluster, or a self-sustaining database.
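The baselining idea behind such proactive alerts can be sketched with a simple statistical rule in Python: flag a value that falls far outside the history. Real analytical models are much richer; this only shows the principle, and the three-sigma cutoff is an illustrative default:

```python
from statistics import mean, stdev

def is_anomalous(history: list, current: float, sigma: float = 3.0) -> bool:
    """Flag activity far outside the baseline learned from past observations."""
    if len(history) < 2:
        return False  # not enough data to establish a baseline
    mu, sd = mean(history), stdev(history)
    if sd == 0:
        return current != mu  # flat history: any change is abnormal
    return abs(current - mu) > sigma * sd
```

Fed with, say, a user's daily login counts, this is exactly the "usually quiet user suddenly very active" rule described earlier.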
Multiple log management tools are available today that support the end-to-end log management lifecycle.
In this space there are open source options such as the ELK stack, as well as commercial solutions such as Loggly, Splunk, and ArcSight.
All of these tools provide comprehensive features but I would like to highlight ELK.
ELK stands for Elasticsearch, Logstash, and Kibana, from Elastic.
Elasticsearch is the storage and indexing solution, while Logstash handles ETL. Kibana is a complete visualization tool with the ability to create dashboards, reports, visualizations, and alerts. Security and ML also come as part of a plugin called X-Pack (though it requires a separate license). The ELK stack also ships pre-built log shippers called Beats, which come pre-configured to parse some of the major log formats such as Windows event logs and Apache logs.
I have had an excellent experience using the ELK stack and am happy to recommend it.
Log management is critical for every IT project, be it a DevOps project or a product solution. For the success of any delivery, and of operations and maintenance, it is important to include log management from day one.
If you have a little more time to spare, we would like to know what helped you most in this article. Whether it was a step, a tip, or a paragraph, please mention it in the comments section below.