How to plan and implement observability of a Java application with Elasticsearch — a step-by-step guide

The concept of observability has been around for decades, but it’s a relative newcomer to the world of IT infrastructure. So what is observability in this context? It’s the state of having all of the information about the internals of a system, so that when an issue occurs you can pinpoint the problem and take the right action to resolve it. Notice that I said state. Observability is not a tool or a set of tools — it’s a property of the system that we are managing.

In this article, I will walk through how to plan and implement an observable deployment, including API testing and the collection of logs, metrics, and application performance monitoring (APM) data. I’ll also point you to a number of free, self-paced training courses that will help you develop the skills needed to achieve observable systems with the Elastic Stack.

Three steps to observability

These are the three steps toward observability presented in this article:

1. Plan for success
   - Collect requirements
   - Identify data sources and integrations
2. Deploy Elasticsearch and Kibana
3. Collect data from systems and your services
   - Logs
   - Metrics
   - Application performance monitoring (APM)
   - API synthetic testing

Plan for success

I have been doing fault and performance management for the past twenty years. In my experience, to reliably reach a state of observability, you have to do your homework before getting started. Here’s a condensed list of a few steps I take to set up my deployments for success.

Goals: Talk to everyone and write the goals down

Talk to your stakeholders and identify the goals: “We will know if the user is having a good or bad experience using our service;” “The solution will improve root cause analysis by providing distributed traces;” “When you page me in the middle of the night you will give me the info I need to find the problem;” etc.
Data: Make a list of what data you need and who has it

Make a list of the information (data and metadata) needed to support the goals. Think beyond IT information — include whatever data you need to understand what is happening. For example, if Ops is checking the Weather Channel during their workflow, then consider adding weather data to your list of required information. Snoop around the best problem solver’s desk and find out what they’re looking at during an outage (and how they like their coffee). If your organization does postmortems, take a look at the data that people bring into the room; if it’s valuable for determining the root cause at a finger-pointing session, then it’s so much more valuable in Ops before an outage.

Fix: Think about the solution and information that can speed it up

If Ops needs a hostname, a runbook, some asset info, and a process name to fix the problem, then have that data available in your observability solution and send it over when you page them. Add the required bits of information to the list you started in the previous step.

A good starting point

At this point, you have a list of the data that you need so that when an issue occurs you can pinpoint the problem and take the right action to resolve it.
That list might look something like this:

- Service data
  - User experience data for my service
  - Response time of the application per transaction, and of the components that make up the application (e.g., the front end and the database)
  - Proper API functionality, verified via synthetic testing
- Performance data for my infrastructure
  - Operating system metrics
  - Database metrics
  - Logs from servers and apps
- Inbound integrations
  - History of past incidents
  - Runbooks
  - Asset info
  - Weather or other “non-IT” data
- Outbound integrations
  - Incident management integration for alerting

Elastic Observability

The Elastic Stack — Elasticsearch, Kibana, Beats, and Logstash; formerly known as the ELK Stack — is a set of powerful open source tools for searching, analyzing, and visualizing data in real time. The Elastic Stack is widely used to centralize logs from operational systems. Over time, Elastic has added products for metrics, APM, and uptime monitoring — together these make up the Elastic Observability solution. The value of Elastic Observability is that it brings together all of the types of data you need to make the right operational decisions and achieve a state of observability. Let’s jump into a scenario to demonstrate how to put Elastic Observability into action.

Scenario

I have a simple application to manage. It consists of a Spring Boot application running on a Linux VM in Google Cloud Platform. The application exposes two API endpoints and has a MariaDB back end. You can find the application in the Spring Guides. I have created an Elasticsearch Service deployment in Elastic Cloud, and I will follow the agent install tutorials right in Kibana, the analysis and management UI of the Elastic Stack. The open source agents that will be used are:

- Filebeat for logs
- Metricbeat for metrics
- Heartbeat for API testing and response time monitoring
- Elastic APM Java Agent for distributed tracing of the application

Note: This guide is written for a specific application based on Spring Boot and MySQL.
If you have something else that you want to collect logs, metrics, and APM traces from, you should be able to adapt these instructions to fit. When you open Kibana, you will be greeted with a long list of out-of-the-box observability integrations.

Implementation

In this article I will go over the steps to get the basics done; in future articles I’ll dive into best practices and some of the integrations. Let’s walk through a simple deployment.

Hosted Elasticsearch Service

To follow along with this guide, create a deployment in Elasticsearch Service on Elastic Cloud (a trial account is free). Once you sign up, watch and follow the steps in the “Deploy Elasticsearch in 3 minutes or less” video. A few minutes later you will have a cluster that you can use for the rest of this article. Download the password that is presented to you; you will use it to log in to Kibana and to configure the Beats. The screenshots here are from version 7.8 of the Elastic Stack — your UI may look slightly different depending on your version.

Kibana

Kibana is the visualization and management tool of the Elastic Stack. It will guide us through installing and configuring the Beats and the Elastic APM Java Agent. Launch Kibana from the deployment details and log in with the elastic username and password. If you forget the password, reset it and then open Kibana.

Find your way home

The instructions for everything that you need to install can be found right in your Kibana instance. Over the next few pages I will often direct you to Kibana Home; you can get there by clicking the Elastic icon in the top left of any Kibana page.
Navigation

If you want to dock the navigation menu while you learn your way around Kibana, click the three-line icon at the top left and then the lock icon at the bottom of the menu.

Add integrations

This is the list of what will be collected:

- Logs from the infrastructure and MariaDB
- Metrics from the infrastructure and MariaDB
- API test results and response time measurements
- Distributed tracing of the application, including the database

Kibana guides you through adding logs, metrics, and APM. This video shows how to add MySQL metrics; once you know how to do that, you can follow the same process to add log and APM data.

Logs from the infrastructure and MariaDB

Both MariaDB and MySQL provide logs. I am interested in the error log and the slow log. By default the slow log is not produced. To configure these logs, have a look in the MariaDB docs. For my deployment the configuration file is /etc/mysql/mariadb.conf.d/50-server.cnf. Here are the relevant parts:

```
# This group is only read by MariaDB servers, not by MySQL.
# If you use the same .cnf file for MySQL and MariaDB,
# you can put MariaDB-only options here
[mariadb]
slow_query_log
#
# * Logging and Replication
#
# Both location gets rotated by the cronjob.
# Be aware that this log type is a performance killer.
# As of 5.1 you can enable the log at runtime!
#general_log_file = /var/log/mysql/mysql.log
#general_log     = 1
#
# Error log - should be very few entries.
#
log_error = /var/log/mysql/error.log
#
# Enable the slow query log to see queries with especially long duration
slow_query_log_file = /var/log/mysql/mariadb-slow.log
long_query_time = 0.5
log_slow_rate_limit = 1
log_slow_verbosity = query_plan
#log-queries-not-using-indexes
```

To enable the slow query log, uncomment the lines in the slow query section, as shown above, and adjust the long query time as desired (the default is 10 seconds).
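As a quick sanity check that your edits took effect, you can grep the config file for the now-uncommented directives. The sketch below writes a sample of the fragment above to a temporary file so it is self-contained; on a real server, point cnf at /etc/mysql/mariadb.conf.d/50-server.cnf instead.

```shell
# Self-contained stand-in for /etc/mysql/mariadb.conf.d/50-server.cnf:
# write a sample of the config fragment to a temp file.
cnf=$(mktemp)
cat > "$cnf" <<'EOF'
[mariadb]
slow_query_log
slow_query_log_file = /var/log/mysql/mariadb-slow.log
long_query_time = 0.5
#log-queries-not-using-indexes
EOF

# Print only the active (uncommented) slow-log directives;
# lines starting with '#' are still disabled.
grep -E '^(slow_query_log|long_query_time)' "$cnf"
```

If slow_query_log and long_query_time show up without a leading #, the slow log will be produced after a server restart.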
A quick test of the configuration is to force a slow query with SELECT SLEEP():

```
$ sudo -- sh -c 'echo "select sleep(2);" | mysql'
sleep(2)
0
```

This results in a record being added to the slow log:

```
# Time: 200427 15:19:59
# User@Host: root[root] @ localhost []
# Thread_id: 13  Schema:   QC_hit: No
# Query_time: 2.000173  Lock_time: 0.000000  Rows_sent: 1  Rows_examined: 0
# Rows_affected: 0
SET timestamp=1588000799;
select sleep(2);
```

Install Filebeat

Follow the directions in Kibana Home > Add log data > MySQL logs. When you are instructed to enable and configure the mysql module, refer to these details for additional information.

The filebeat modules enable command takes a list of modules, so save some steps and add system and auditd to the list:

```
sudo filebeat modules enable mysql system auditd
```

When you are instructed to modify the settings in the modules.d/mysql.yml file, note that the slow log I added is not in the default location, so edit the file and specify its location as an entry in the var.paths array:

```
- module: mysql
  # Error logs
  error:
    enabled: true

    # Set custom paths for the log files. If left empty,
    # Filebeat will choose the paths depending on your OS.
    #var.paths:

  # Slow logs
  slowlog:
    enabled: true

    # Set custom paths for the log files. If left empty,
    # Filebeat will choose the paths depending on your OS.
    var.paths:
      - /var/log/mysql/mariadb-slow.log
```

Run the setup command and start Filebeat as directed in Kibana Home > Add log data > MySQL logs. At the bottom of that page is a link to the MySQL dashboard. You should also look at the [Filebeat System] Syslog dashboard ECS and [Filebeat System] Sudo commands ECS dashboards; you can search for these in the dashboard list.

Metrics from the infrastructure and MariaDB

The operating system and MariaDB both expose metrics. No configuration is needed for the OS to expose its metrics.
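A side note on the slow-log record captured earlier: the SET timestamp=... line is the Unix-epoch form of the human-readable "# Time:" header, which is handy when you need to line a slow-log entry up with other data sources. With GNU date you can convert one to the other:

```shell
# Convert the epoch from "SET timestamp=1588000799;" back to the slow log's
# "# Time:" format (YYMMDD HH:MM:SS, in UTC). Requires GNU date (-d @epoch).
date -u -d @1588000799 +'%y%m%d %H:%M:%S'
# Prints: 200427 15:19:59 — matching the "# Time:" header of the same entry
```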
MariaDB serves its metrics over the regular database connection on port 3306 by default, and the connection is password protected once you add a password to MariaDB.

Install Metricbeat

Follow the directions in Kibana Home > Add metric data > MySQL metrics. When you are instructed to enable and configure the mysql module, refer to these details for additional information.

The metricbeat modules enable command takes a list of modules, so save a step and add system to the list:

```
sudo metricbeat modules enable mysql system
```

When you are instructed to modify the settings in the modules.d/mysql.yml file, note that the Metricbeat module for MySQL needs to be configured with the proper host or IP address, port, username, and password. Here is my /etc/metricbeat/modules.d/mysql.yml:

```
- module: mysql
  #metricsets:
  #  - status
  #  - galera_status
  period: 10s

  # Host DSN should be defined as "user:pass@tcp(127.0.0.1:3306)/"
  # I copied the username and password used in the Spring Boot guide;
  # my hostname is roscigno-obs
  hosts: ["springuser:ThePassword@tcp(roscigno-obs:3306)/"]
```

Run the setup command and start Metricbeat as directed in Kibana Home > Add metric data > MySQL metrics. At the bottom of that page is a link to the MySQL dashboard. You should also look at the Metricbeat system dashboards; you can search for these in the dashboard list.

API test results and response time measurements

To verify proper functionality of the API endpoints, we need to POST some URL-encoded data, read the response, and verify it. This is often done manually using curl or the Postman API client. By automating the testing with Heartbeat, the response time and test results become available alongside the logs, APM traces, and other metrics for the service. Heartbeat monitors the availability of services by testing API endpoints for proper responses, checking websites for content and response codes, verifying ICMP pings, and more.

Install Heartbeat

Follow the instructions in Kibana Home > Add metric data > Uptime monitors.
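Before automating the check, it helps to run it once by hand. A minimal sketch: build the form body (the @ in the email must be percent-encoded as %40) and POST it with curl. The curl call is left commented out because it needs the demo application running; the hostname is the one used in this article.

```shell
# Build the URL-encoded form body; '@' must become %40 in a form body.
# (sed is enough for this fixed input; use a proper URL-encoder for
# arbitrary user input.)
email='someemail@someemailprovider.com'
body="name=first&email=$(printf '%s' "$email" | sed 's/@/%40/g')"
echo "$body"

# With the demo app running, POST the body by hand, the same way the
# Heartbeat monitor below will on its schedule:
#   curl -s -X POST 'http://roscigno-obs:8080/demo/add' \
#     -H 'Content-Type: application/x-www-form-urlencoded' \
#     --data "$body"
# A healthy endpoint returns HTTP 200 with "Saved" in the body.
```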
When you are instructed to edit the heartbeat.monitors setting in the heartbeat.yml file, replace the existing monitor with this API test:

```
# Configure monitors inline
heartbeat.monitors:
- type: http
  name: SpringToDoApp
  schedule: '@every 5s'
  urls: ["http://roscigno-obs:8080/demo/add"]
  check.request:
    method: POST
    headers:
      'Content-Type': 'application/x-www-form-urlencoded'
    body: "name=first&email=someemail%40someemailprovider.com"
  check.response:
    status: 200
    body:
      - Saved
      - saved
  response.include_body: 'always'
```

Run the setup command and start Heartbeat as directed in Kibana Home > Add metric data > Uptime monitors. At the bottom of that page is a link to the Uptime app.

Distributed tracing of the application including the database

Elastic APM instruments your applications to ship performance metrics to Elasticsearch for visualization in Kibana with the APM app. By adding the APM agent jar file to the command used to launch the application, I get distributed tracing, so I can see where my app is spending its time (whether in the Java code or in the calls to MariaDB). The process is described in Kibana Home > Add APM > Java and consists of downloading the agent jar file and using the Java instrumentation mechanism (-javaagent) to start the agent.
Open the Kibana Home > Add data > APM page, choose Java, and download the agent jar file. Then load the APM Kibana objects.

I prefer to use environment variables for the javaagent properties, so I take the details provided and set the environment variables:

```
$ cat environment
export ELASTIC_APM_SERVER_URL=https://1530f7c8afdf402eb281750f0b127bc4.apm.us-central1.gcp.cloud.es.io:443
export ELASTIC_APM_SECRET_TOKEN=WjyW67R0eSWDhILWDD
export ELASTIC_APM_SERVICE_NAME=winter-mysql
export ELASTIC_APM_APP_PACKAGES=com.example
```

I am launching the app via ./mvnw spring-boot:run and sourcing the environment variables in the Maven Wrapper, so the exec line in mvnw becomes:

```
exec "$JAVACMD" \
  -javaagent:./elastic-apm-agent.jar \
  -Delastic.apm.service_name=${ELASTIC_APM_SERVICE_NAME:-demo-app} \
  -Delastic.apm.server_url=${ELASTIC_APM_SERVER_URL} \
  -Delastic.apm.secret_token=${ELASTIC_APM_SECRET_TOKEN} \
  -Delastic.apm.application_packages=${ELASTIC_APM_APP_PACKAGES:-com.example} \
  $MAVEN_OPTS \
  -classpath "$MAVEN_PROJECTBASEDIR/.mvn/wrapper/maven-wrapper.jar" \
  "-Dmaven.home=${M2_HOME}" \
  "-Dmaven.multiModuleProjectDirectory=${MAVEN_PROJECTBASEDIR}" \
  ${WRAPPER_LAUNCHER} "$@"
```

As soon as the application is started, the API tests set up earlier with Heartbeat will produce traces in Elasticsearch. Launch the APM app and navigate to your service.

Out of the box, and without any code changes, you will see detailed performance information: response times for incoming requests, database queries, calls to caches, external HTTP requests, JVM metrics, errors, and more. This makes it easy to pinpoint and fix performance problems quickly. Looking into the AddNewUser transactions (which is where Heartbeat is performing its synthetic API testing), I can see the breakdown of a transaction, and digging into the “INSERT INTO user” span I can see the SQL used, the transaction ID, and so on.

There is so much more to do with APM; in this article we are only looking at the out-of-the-box traces with zero code changes. To add custom spans and more, see the documentation.
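A housekeeping note on the environment file used in this section. Writing the settings to one file and sourcing it keeps them consistent across launches. The sketch below fills in only the non-secret values from this article; add your own ELASTIC_APM_SERVER_URL and ELASTIC_APM_SECRET_TOKEN lines to the same file.

```shell
# Write the agent settings once (service name and packages are the values
# used in this article; the server URL and secret token lines are omitted
# here and must be added for a real deployment)...
cat > environment <<'EOF'
export ELASTIC_APM_SERVICE_NAME=winter-mysql
export ELASTIC_APM_APP_PACKAGES=com.example
EOF

# ...then source the file in any shell that launches the app, so the
# -Delastic.apm.* properties in the exec line pick the values up.
. ./environment
echo "service_name=$ELASTIC_APM_SERVICE_NAME packages=$ELASTIC_APM_APP_PACKAGES"
```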
Elastic Common Schema

The Elastic Common Schema (ECS) is a powerful way to correlate information from multiple sources. It specifies a common set of fields for logs and metrics that is used by all of the Elastic Beats, and it is extensible by you. By adopting the Elastic Common Schema in the Spring Boot app, the logs become clear and consistent with the other data sources we are ingesting, which helps correlate a record from the application log with a database metric and an APM trace. You can read about logging best practices for Java apps and watch a video showing a Spring Boot application getting a logging overhaul.

Put it all together

In this article I covered several ways of collecting data from and about your application and infrastructure. You can do the same and see logs, metrics, synthetic testing of your APIs, APM traces, and more. Earlier I said that you should not limit yourself in the type and amount of data that you bring in; decide what will help you prevent downtime and expand your list of sources accordingly. In future articles I will go into topics like alerting, machine learning, and bringing in custom data sources. If you want to discuss what I have shown, head to discuss.elastic.co and start a conversation; mention @DanRoscigno in the post and I will try to help.

More things to consider

I skipped over “When you page me in the middle of the night you will give me the info I need to find the problem” from the goals, along with a few other things that should be part of every project. In a future article I will write about building service-level dashboards, best practices for ingest with Beats, alerting, and real user monitoring. In the meantime, see the links. If you’re ready to try it out yourself, the easiest way to get started is to take advantage of the free 14-day trial of the Elasticsearch Service on Elastic Cloud — the official hosted Elasticsearch offering from Elastic, which includes Kibana.
If you prefer, you can also download Elasticsearch and Kibana to run on your laptop or deploy in a data center. Free hands-on training to build your skills is available at elastic.co/training/free. There is a group of observability-related courses listed there, and I would also suggest the Kibana course on that page, as it walks you through creating your own visualizations.

Dan Roscigno is a customer success manager focused on bringing decades of operations experience to customers in webinars, videos, tutorials, and documentation. He has been an EMT, sysadmin, developer, tech lead in operations, consultant, learning content developer, and product marketing manager. He loves to make ops less stressful. Contact him on Twitter or discuss.elastic.co: @DanRoscigno.

—

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.