Logging System & Observability Practices
title: "Lesson 17 | Logging System & Observability Practices"
summary: "Unveiling the Black Box—Dialogue Trace Replay, Skill Tracking, and Performance Analysis. Enhance control by integrating professional observability platforms."
sortOrder: 170
status: "published"
Learning Objectives
In this lesson, you will delve into the internal workings of the Hermes Agent, learning how to transition from a "black box" user to a "white box" developer who can gain insight into its internal state. After completing this chapter, you will be able to:
- Master the Hermes Agent's built-in logging system: Understand its log structure, levels, and configuration, and use logs for basic troubleshooting.
- Implement dialogue trace replay and skill tracking: Accurately reproduce the Agent's decision path for each interaction and diagnose root causes through detailed log analysis.
- Perform basic performance analysis: Extract key performance indicators (Metrics) from logs, such as response latency and token consumption, to quantify the Agent's performance.
- Integrate with professional observability platforms: Learn how to integrate the Hermes Agent with the OpenTelemetry (OTel) ecosystem to implement distributed tracing, elevating observability to an industrial-grade level.
Core Concepts Explained
In complex AI Agent applications, when a user reports "it didn't do what I said" or "it's responding slowly," the biggest challenge we face is a lack of visibility. The Agent's decision chain (receive information -> retrieve memory -> think -> select skill -> call LLM -> execute skill -> generate response) is hidden behind code execution. Observability is the key to solving this problem. It is more than just monitoring: monitoring tells us whether a system is working, while observability lets us ask arbitrary questions of the data the system produces and find out why it behaved the way it did.
Observability is typically built on three pillars:
Logging
- What it is: Records discrete, timestamped events. Each log line is a snapshot of a fact, such as "User entered a message," "Skill get_weather was called," or "LLM API returned an error."
- How it's implemented in Hermes: The Hermes Agent uses the Loguru library to provide powerful and easily configurable logging. By default, it records the operational status of core components, received messages, executed skills, and other key information. Through configuration, we can output Structured Logging, typically in JSON format, which allows logs to be easily parsed, queried, and analyzed by machines.
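Structured logging can feel abstract, so here is a minimal stdlib-only sketch of the idea: each record is emitted as one JSON object per line. (Loguru's serialized output, shown later in this lesson, is a richer version of the same thing; this snippet is illustrative, not the Hermes implementation.)

```python
import io
import json
import logging

# Minimal stdlib-only sketch of structured (JSON) logging: one JSON object
# per line. Loguru's serialized output is a richer version of the same idea.
class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "time": record.created,
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

buf = io.StringIO()
handler = logging.StreamHandler(buf)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("hermes.demo")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("Agent decided to use skill: %s", "get_weather")

entry = json.loads(buf.getvalue())
print(entry["level"], "-", entry["message"])
# → INFO - Agent decided to use skill: get_weather
```

Because every line is valid JSON, downstream tools can filter on fields instead of grepping free text.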
Tracing
- What it is: Records the complete path of a single request (in the Agent's case, a full dialogue interaction) as it flows through different services or components in the system. A Trace is composed of multiple Spans, with each Span representing a unit of work (e.g., a function call, an API request).
- Why it's important: For an Agent, a single dialogue interaction might involve multiple internal steps: message parsing, memory retrieval, LLM inference, tool (skill) calls, etc. Distributed tracing links these steps together to form a complete call chain. This allows us to clearly see:
- Dialogue trace replay: What does the request's full lifecycle look like?
- Skill tracking: Which Skill was called? What parameters were passed? How long did it take to execute?
- Performance bottleneck identification: In the entire interaction, was the LLM call slow, or was an external API call within a Skill slow?
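To make the Trace/Span relationship concrete, here is a toy tracer (stdlib only, not the OpenTelemetry API): each span records its name, its parent, and its duration, forming exactly the call tree that a tracing backend visualizes.

```python
import time
from contextlib import contextmanager

# Toy tracer (stdlib only, illustrative): each span records a name, its
# parent span, and how long it took, forming a call tree.
spans = []   # finished spans: (name, parent, duration_seconds)
stack = []   # names of currently open spans

@contextmanager
def span(name):
    parent = stack[-1] if stack else None
    stack.append(name)
    start = time.perf_counter()
    try:
        yield
    finally:
        stack.pop()
        spans.append((name, parent, time.perf_counter() - start))

# One "dialogue interaction" with nested units of work
with span("process_user_message"):
    with span("memory_retrieval"):
        time.sleep(0.01)
    with span("brain_think"):
        time.sleep(0.02)

for name, parent, dur in spans:
    print(f"{name} (parent={parent}): {dur * 1000:.1f}ms")
```

Note that child spans finish (and are recorded) before their parent, and the parent's duration covers its children, which is what produces the nested bars in a trace view.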
Metrics
- What it is: Aggregated, quantifiable data over a period of time. For example: messages processed per minute, average response latency, LLM API call success rate, total token consumption, etc.
- Why it's important: Metrics provide a high-level view of the system's health. By observing trends in metrics, we can set up alerts (e.g., when latency exceeds 2 seconds), perform capacity planning, or evaluate the performance impact of model/skill changes.
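Metrics are simply aggregates over events. As a concrete illustration, here is a sketch of computing them from structured log lines; the field names (`event`, `latency_s`, `tokens`) are invented for the example and are not a real Hermes log schema.

```python
import json
import statistics

# Illustrative only: the field names below are invented for this sketch,
# not a real Hermes log schema.
log_lines = [
    '{"event": "llm_call", "latency_s": 1.2, "tokens": 800}',
    '{"event": "llm_call", "latency_s": 1.0, "tokens": 600}',
    '{"event": "skill_call", "latency_s": 0.3}',
]

records = [json.loads(line) for line in log_lines]
llm_calls = [r for r in records if r["event"] == "llm_call"]

# Aggregate the discrete events into metrics
avg_latency = statistics.mean(r["latency_s"] for r in llm_calls)
total_tokens = sum(r["tokens"] for r in llm_calls)

print(f"LLM calls: {len(llm_calls)}, avg latency: {avg_latency:.3f}s, tokens: {total_tokens}")
# → LLM calls: 2, avg latency: 1.100s, tokens: 1400
```

In production you would not aggregate by hand; a metrics pipeline (or the OTel metrics API) does this continuously, but the arithmetic is the same.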
This lesson will start with basic logging practices and gradually guide you toward industrial-grade practices using OpenTelemetry for advanced tracing.
💻 Hands-on Demo
Part 1: Mastering the Hermes Agent's Built-in Logging System
The Hermes Agent's logging configuration is located in the config.yml file in the project's root directory. Let's start there.
1.1 Adjusting Log Levels and Formats
The default configuration might look like this:
# config.yml
log:
  level: "INFO"
  path: "hermes.log"
  rotation: "10 MB"
  retention: "7 days"
level: The verbosity of the logs. Common levels are DEBUG, INFO, WARNING, ERROR, and CRITICAL.
- INFO: The default level; records general operational information.
- DEBUG: The most detailed level, used for development and debugging. It prints a vast amount of information, including the full LLM prompt and the model's thought process.
Hands-on: Enabling DEBUG Mode
To get a deep look into the Agent's "thought process," we'll change the log level to DEBUG.
# config.yml
log:
  level: "DEBUG"
  # ... other settings remain the same
Now, start the Hermes Agent and interact with it:
# Make sure you are in the root directory of the Hermes Agent project
hermes run
In another terminal or through a message gateway (like Telegram), send the Agent a command that requires a Skill, for example: "What's the weather like in Beijing today?"
Then, view the log file hermes.log:
tail -f hermes.log
You will see much more detailed output than at the INFO level, containing crucial debugging information:
...
DEBUG | hermes.core.agent:process_message:123 - Received new message from user: "What's the weather like in Beijing today?"
DEBUG | hermes.core.brain:think:45 - Generating thought process for the user query...
DEBUG | hermes.core.brain:_construct_prompt:67 - Constructed prompt for LLM:
"""
... (The full prompt sent to the LLM will be displayed here) ...
"""
DEBUG | hermes.providers.openai:chat_completion:89 - Calling OpenAI API with model gpt-4-turbo...
INFO | hermes.core.agent:execute_skill:234 - Agent decided to use skill: get_weather
DEBUG | hermes.core.agent:execute_skill:235 - Skill arguments: {'city': 'Beijing'}
INFO | hermes.skills.weather:get_weather:56 - Executing get_weather skill for city: Beijing
... (Internal logs from skill execution) ...
DEBUG | hermes.core.agent:execute_skill:250 - Skill `get_weather` returned: "Today in Beijing is sunny, with a temperature of 25 degrees..."
DEBUG | hermes.core.brain:think:90 - LLM call took 1.234 seconds.
INFO | hermes.core.agent:process_message:150 - Sending final response to user.
...
With DEBUG logs, we have clearly accomplished:
- Dialogue trace replay: Every step from receiving the message to the final response is recorded.
- Skill tracking: We saw that get_weather was called with the arguments {'city': 'Beijing'}.
- Performance analysis: We saw the LLM call duration was 1.234 seconds.
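Even before switching to structured logs, you can script simple measurements against the plain-text output. A small sketch that pulls the LLM duration out of the DEBUG line shown above:

```python
import re

# Parse the LLM call duration out of the plain-text DEBUG log line shown
# above. This is a convenience sketch; structured (JSON) logs make this easier.
line = "DEBUG | hermes.core.brain:think:90 - LLM call took 1.234 seconds."

match = re.search(r"LLM call took ([\d.]+) seconds", line)
latency = float(match.group(1)) if match else None
print(latency)  # → 1.234
```

Regex scraping like this is brittle (it breaks the moment the message wording changes), which is exactly the motivation for the structured logging in the next section.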
1.2 Enabling Structured Logging (JSON)
For automated analysis, plain text logs are difficult to process. Structured logging is a better choice.
Hands-on: Switching to JSON Format
Modify config.yml:
# config.yml
log:
  level: "DEBUG"
  path: "hermes.log"
  rotation: "10 MB"
  retention: "7 days"
  json: true  # Add this line
Restart the Hermes Agent and interact with it again. Now, if you check hermes.log, the content will look like this:
{"text": "...", "record": {"elapsed": {"repr": "...", "seconds": ...}, "exception": null, "extra": {}, "file": {"name": "agent.py", "path": "..."}, "function": "process_message", "level": {"icon": "...", "name": "DEBUG", "no": 10}, "line": 123, "message": "Received new message from user: \"What's the weather like in Beijing today?\"", "module": "agent", "name": "hermes.core.agent", "process": {"id": ..., "name": "..."}, "thread": {"id": ..., "name": "..."}, "time": {"repr": "...", "timestamp": ...}}}
Although readability is worse for humans, we can now easily query and analyze the logs with tools like jq.
Hands-on: Querying All Skill Calls with jq
# Filter for logs recording skill decisions and print the message text
cat hermes.log | jq -r 'select(.record.message | test("Agent decided to use skill:")) | .record.message'
This prints each matching message, for example:
Agent decided to use skill: get_weather
If the Agent binds structured fields to its skill logs (via Loguru's logger.bind()), those fields appear under .record.extra, and replacing .record.message with .record.extra in the query yields machine-friendly output such as:
{
  "skill_name": "get_weather",
  "skill_args": {
    "city": "Beijing"
  }
}
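If you prefer Python to jq, the same query can be expressed with the stdlib json module. The sample line below is abridged to Loguru's serialized record shape, and its extra fields are illustrative (they assume the skill log was written with logger.bind()).

```python
import json

# A stdlib alternative to the jq query: scan serialized log lines and
# collect every skill-selection event. The sample line is abridged and
# its "extra" fields are illustrative, not a guaranteed Hermes schema.
sample_line = (
    '{"record": {"message": "Agent decided to use skill: get_weather", '
    '"extra": {"skill_name": "get_weather", "skill_args": {"city": "Beijing"}}}}'
)

def skill_calls(lines):
    """Yield the extra payload of every skill-selection log record."""
    for line in lines:
        record = json.loads(line)["record"]
        if "Agent decided to use skill:" in record["message"]:
            yield record["extra"]

calls = list(skill_calls([sample_line]))
print(calls[0]["skill_name"], calls[0]["skill_args"])
```

In practice you would pass `open("hermes.log")` as the iterable of lines.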
Part 2: Integrating OpenTelemetry for Advanced Tracing
While log analysis is powerful, manually correlating logs to build a complete request chain in complex, high-concurrency scenarios is very difficult. This is where OpenTelemetry (OTel) comes in. We will use OTel to convert the Hermes Agent's internal calls into visual, distributed trace data.
We will integrate with an open-source OTel backend, SigNoz (you could also use Jaeger, Zipkin, etc.). The advantage of SigNoz is that it combines logs, traces, and metrics into a single platform.
2.1 Deploying SigNoz
We will use Docker Compose to quickly deploy a local SigNoz instance.
Create a docker-compose.yml file:
# docker-compose.yml for SigNoz
version: '3'
services:
  signoz-otel-collector:
    image: signoz/otel-collector:0.93.0
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "4317:4317"  # OTLP gRPC receiver
      - "4318:4318"  # OTLP HTTP receiver
    depends_on:
      - signoz-clickhouse
  signoz-query-service:
    image: signoz/query-service:0.41.0
    ports:
      - "8080:8080"
    command: ["-config=/etc/query-service/config.yaml"]
    volumes:
      - ./query-service-config.yaml:/etc/query-service/config.yaml
    depends_on:
      - signoz-clickhouse
  signoz-clickhouse:
    image: clickhouse/clickhouse-server:23.10.2-alpine
    ports:
      - "8123:8123"
      - "9000:9000"
    volumes:
      - signoz-clickhouse-data:/var/lib/clickhouse
    healthcheck:
      test: ["CMD", "clickhouse-client", "-q", "SELECT 1"]
      interval: 30s
      timeout: 10s
      retries: 5
  signoz-frontend:
    image: signoz/frontend:0.41.0
    ports:
      - "3301:3000"
    depends_on:
      - signoz-query-service
volumes:
  signoz-clickhouse-data:
You will also need otel-collector-config.yaml and query-service-config.yaml. For simplicity, you can get these configuration files from the official SigNoz documentation.
After downloading the configuration files, start SigNoz:
docker-compose up -d
Wait a few minutes, then visit http://localhost:3301, and you should see the SigNoz UI.
2.2 Installing OTel Dependencies for the Hermes Agent
In your Hermes Agent's Python environment, install the necessary OTel libraries.
pip install opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp
2.3 Instrumenting the Hermes Agent with Tracing Code
Now, we need to modify the Hermes Agent's code to create and manage Spans at key points and send the data to SigNoz.
Create a new Python file, for example, hermes/utils/telemetry.py:
# hermes/utils/telemetry.py
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

def setup_tracer(service_name="hermes-agent"):
    """Initialize the OpenTelemetry Tracer."""
    resource = Resource(attributes={
        "service.name": service_name
    })

    # Configure the OTLP Exporter to send data to the SigNoz Collector.
    # The default address is localhost:4317, matching our docker-compose setup.
    otlp_exporter = OTLPSpanExporter(endpoint="localhost:4317", insecure=True)

    # Set up the Tracer Provider and Span Processor
    provider = TracerProvider(resource=resource)
    provider.add_span_processor(BatchSpanProcessor(otlp_exporter))

    # Set the global Tracer Provider
    trace.set_tracer_provider(provider)

    # Return a tracer instance
    return trace.get_tracer(__name__)

# Initialize once at the module level
tracer = setup_tracer()
Now, we can "instrument" the Agent's core logic with tracing code. An ideal place for this is the main function that handles user messages, Agent.process_message.
Modify hermes/core/agent.py:
# hermes/core/agent.py
# ... other imports ...
from opentelemetry import trace  # Needed for trace.Status / trace.StatusCode below
from hermes.utils.telemetry import tracer  # Import the tracer we created

class Agent:
    # ... other methods ...

    async def process_message(self, message: str, user_id: str):
        # Create a top-level Span representing the entire dialogue interaction
        with tracer.start_as_current_span("process_user_message") as parent_span:
            # Add useful attributes to the Span for easier querying
            parent_span.set_attribute("user.id", user_id)
            parent_span.set_attribute("user.message", message)
            try:
                # ... existing process_message logic ...

                # 1. Memory retrieval
                with tracer.start_as_current_span("memory_retrieval") as memory_span:
                    relevant_memory = self.memory.retrieve(user_id, message)
                    memory_span.set_attribute("memory.found_count", len(relevant_memory))

                # 2. Thinking and skill selection
                with tracer.start_as_current_span("brain_think") as brain_span:
                    thought_process, skill_name, skill_args = await self.brain.think(message, user_id)
                    brain_span.set_attribute("llm.thought", thought_process)
                    brain_span.set_attribute("skill.selected", skill_name or "None")

                # 3. Skill execution
                if skill_name:
                    with tracer.start_as_current_span("skill_execution") as skill_span:
                        skill_span.set_attribute("skill.name", skill_name)
                        skill_span.set_attribute("skill.args", str(skill_args))
                        skill_result = await self.execute_skill(skill_name, **skill_args)
                        skill_span.set_attribute("skill.result_length", len(str(skill_result)))
                        # If the skill execution fails, you can record the exception:
                        # skill_span.record_exception(e)
                        # skill_span.set_status(trace.Status(trace.StatusCode.ERROR, "Skill execution failed"))

                # ... subsequent logic for generating the final response ...
                final_response = "..."  # Assuming this is the final response
                parent_span.set_attribute("agent.response", final_response)
                return final_response
            except Exception as e:
                # If an exception occurs during the process, record it in the top-level Span
                parent_span.record_exception(e)
                parent_span.set_status(trace.Status(trace.StatusCode.ERROR, str(e)))
                raise
Finally, ensure tracing is initialized when the application starts. Because hermes/utils/telemetry.py calls setup_tracer() at import time, importing the module in the Hermes Agent's main entry point (main.py or __main__.py) is enough:
# At application startup, importing the telemetry module initializes tracing
import hermes.utils.telemetry  # noqa: F401
# ... subsequent startup code ...
Avoid calling setup_tracer() a second time: the OTel SDK does not allow overriding an already-set global TracerProvider and will log a warning if you try.
2.4 Analyzing the Trace Data
Restart the Hermes Agent and interact with it again. Then open the SigNoz UI (http://localhost:3301).
- Click on the "Traces" tab in the left sidebar.
- You should see a service named hermes-agent.
- In the list, you will see Traces named process_user_message. Click on one of them.
You will see a Flame Graph or Gantt Chart that perfectly visualizes the entire dialogue processing flow:
process_user_message -------------------------------------------------- [2500ms]
|
|- memory_retrieval --- [50ms]
|
|- brain_think --------------------------------------- [1500ms]
|
|- skill_execution (get_weather) -------------------- [800ms]
|
|- ... (response_generation) ... --- [150ms]
With this view, you can:
- Replay the trace at a glance: Clearly see the order of execution and the parent-child relationships.
- Pinpoint performance bottlenecks: In the chart above, brain_think (the LLM call) and skill_execution are the most time-consuming parts. If skill_execution takes an unusually long time, you can drill down further to see which external API call was slow.
- Dive deep into the details: Click on any Span (like skill_execution), and in the details panel on the right you can see all the attributes we set, such as skill.name and skill.args, which is crucial for debugging.
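Reading such a chart is mostly about proportions. A quick sketch of the arithmetic, using the durations from the example trace above:

```python
# Durations (ms) taken from the example trace chart above
spans = {
    "memory_retrieval": 50,
    "brain_think": 1500,
    "skill_execution": 800,
    "response_generation": 150,
}
total = 2500  # duration of process_user_message

# Rank units of work by their share of the whole interaction
for name, ms in sorted(spans.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {ms}ms ({ms / total:.0%} of the interaction)")
```

Here brain_think accounts for 60% of the total, so LLM latency is the first optimization target; the same reasoning scales to real traces with hundreds of spans.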
At this point, you have elevated the observability of your Hermes Agent from the "text log" era to the "distributed tracing" era, significantly improving your ability to understand and control this intelligent agent.
Commands Used
# View real-time logs
tail -f hermes.log
# Query structured logs using jq
cat hermes.log | jq '...'
# Start SigNoz (or another OTel backend)
docker-compose up -d
# Install OpenTelemetry Python libraries
pip install opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp
# Start the Hermes Agent
hermes run
Key Takeaways
- The Three Pillars of Observability: Logging, Tracing, and Metrics are the cornerstones for understanding and debugging complex systems.
- Hermes's Built-in Logging: By adjusting log.level (especially to DEBUG) and log.json in config.yml, you can obtain rich diagnostic information and prepare for automated analysis.
- Logging is Fundamental: DEBUG-level logs allow you to manually perform dialogue trace replay and skill tracking, making them the primary tool for troubleshooting.
- Tracing is the Next Step: When systems become complex or you need to pinpoint performance bottlenecks, integrating OpenTelemetry for distributed tracing is the best practice.
- Spans and Attributes: The core of OTel is creating Spans that represent units of work and attaching rich contextual information (Attributes) to them, which makes trace data highly valuable for analysis.
- From Black Box to White Box: Effectively using observability tools can make the Agent's internal decision-making process transparent, transforming you from a passive user into an active developer who can deeply control and optimize the Agent.