
Using Flame Graphs to Visualize Method Execution Times in Distributed Systems

Artem Vavilov
BitDive core team

At BitDive.io, we leverage flame graphs not for traditional CPU profiling, but for visualizing the time (in milliseconds) that each method in a span consumes. This approach is especially suited for distributed systems with microservices running across different servers and processors. In this post, we will explain why flame graphs are effective for this use case, how they can help uncover bottlenecks in distributed architectures, and why this method could be more suitable for your system’s needs.

What Are Flame Graphs?

Flame graphs were originally developed to visualize CPU time across different functions, allowing engineers to identify performance bottlenecks by showing which functions or methods consume the most CPU cycles. However, at BitDive.io, we apply a modified version of flame graphs: instead of showing CPU time, our graphs reflect the time in milliseconds spent by each method within a span across distributed microservices.

In distributed systems, requests can pass through many services, often on different machines or even different processor architectures. Traditional CPU flame graphs fall short in such environments because they describe resource usage on a single host, and time spent waiting on the network or on another service does not show up as CPU time at all. In contrast, our flame graphs show the wall-clock execution time of each method across the distributed system, giving a more complete picture of how requests are handled end-to-end.
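
To make the idea concrete, here is a minimal sketch of how per-method timings could be turned into the semicolon-separated "folded stack" text that common flame graph renderers accept, using milliseconds as the weight instead of CPU sample counts. This is an illustration only, not BitDive's implementation: the `Call` record, its field names, and the sample trace are assumptions made up for the example, and the self-time calculation assumes child calls run sequentially rather than in parallel.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Call:
    """One recorded method execution inside a trace (hypothetical schema)."""
    call_id: str
    parent_id: Optional[str]  # None for the root call
    service: str
    method: str
    duration_ms: float        # wall-clock time, including time in children


def to_folded_stacks(calls):
    """Return {"svc:method;svc:method;...": self_time_ms} for each call path."""
    by_id = {c.call_id: c for c in calls}

    # Total time spent in the direct children of each call.
    child_time = defaultdict(float)
    for c in calls:
        if c.parent_id is not None:
            child_time[c.parent_id] += c.duration_ms

    folded = defaultdict(float)
    for c in calls:
        # Walk up to the root to build the stack for this call.
        frames = []
        node = c
        while node is not None:
            frames.append(f"{node.service}:{node.method}")
            node = by_id.get(node.parent_id) if node.parent_id else None
        path = ";".join(reversed(frames))
        # Self time = own duration minus time spent in direct children
        # (assumes children do not overlap in time).
        folded[path] += max(c.duration_ms - child_time[c.call_id], 0.0)
    return dict(folded)


# Made-up trace: one request crossing three services.
calls = [
    Call("1", None, "gateway", "handleOrder", 120.0),
    Call("2", "1", "orders", "createOrder", 90.0),
    Call("3", "2", "orders", "saveToDb", 60.0),
    Call("4", "1", "billing", "charge", 20.0),
]
folded = to_folded_stacks(calls)
for path, ms in folded.items():
    print(f"{path} {ms:g}")
```

Each output line is one stack with its self time in milliseconds, so the widths in the resulting graph add up to the duration of the root call rather than to a number of CPU samples.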

Why Flame Graphs Are Ideal for Distributed Systems

  1. Holistic View of Latency: Distributed systems, particularly those based on microservices, often have requests passing through various services, each running on different hardware. Each service might experience delays due to external factors like network latency, database access times, or remote procedure calls (RPCs). By focusing on method execution time rather than CPU consumption, we get a holistic view of where time is being spent across multiple services.

This is essential because in distributed systems, bottlenecks are often caused by delays in specific services, not just CPU-bound processes. Visualizing how time is spent across different services in a single trace allows us to pinpoint slowdowns or inefficiencies at the service level, even if those services are running on vastly different machines.

  2. Span-Based Flame Graphs: In a distributed system, tracing involves breaking down a request into spans that cover individual method executions across services. Each span represents a discrete operation within the larger distributed request, and visualizing these spans as flame graphs provides insight into where time is accumulating within the system. Unlike traditional flame graphs, where CPU cycles are the focus, our flame graphs show time spent in each span, enabling better performance monitoring in systems where delays are more network- or I/O-related than CPU-related.

  3. Scalability and Microservice Architecture: Microservices run on independent servers, which means each service might have its own specific challenges (e.g., some services might be bound by I/O, others by memory). A flame graph that visualizes CPU usage alone cannot adequately capture performance issues in this heterogeneous environment. By using method execution times, you capture a more meaningful metric for distributed systems, allowing you to optimize latency and response times across your entire architecture, not just within a single machine.

  4. Dependency Mapping: Another key advantage of this approach is that flame graphs visually show the hierarchy of method calls across services, making it easy to see which service or method is holding up the entire request. This is crucial in microservices, where understanding service dependencies is key to diagnosing performance bottlenecks.
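
The points above can be sketched in a few lines: given a trace as a tree of spans, print the call hierarchy with bars proportional to duration and sum the self time per service to see where the request actually waits. The `Span` class, the service and method names, and the timings are hypothetical values chosen for illustration, and the self-time calculation again assumes that child spans do not overlap.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Span:
    """A node in a trace tree (hypothetical structure for this example)."""
    service: str
    method: str
    duration_ms: float  # wall-clock time, including children
    children: list = field(default_factory=list)


def render(span, total_ms=None, depth=0, width=40):
    """Return indented text lines, one per span, with a bar scaled to duration."""
    if total_ms is None:
        total_ms = span.duration_ms
    bar = "#" * max(1, round(width * span.duration_ms / total_ms))
    lines = [f"{'  ' * depth}{span.service}.{span.method} {span.duration_ms:g} ms {bar}"]
    for child in span.children:
        lines.extend(render(child, total_ms, depth + 1, width))
    return lines


def self_time_by_service(span, totals=None):
    """Sum self time (duration minus direct children) per service."""
    if totals is None:
        totals = defaultdict(float)
    children_ms = sum(c.duration_ms for c in span.children)
    totals[span.service] += max(span.duration_ms - children_ms, 0.0)
    for child in span.children:
        self_time_by_service(child, totals)
    return totals


# Made-up trace: gateway -> orders -> database write, plus a billing call.
trace = Span("gateway", "handleOrder", 120.0, [
    Span("orders", "createOrder", 90.0, [Span("orders", "saveToDb", 60.0)]),
    Span("billing", "charge", 20.0),
])

lines = render(trace)
print("\n".join(lines))

totals = self_time_by_service(trace)
slowest_service = max(totals, key=totals.get)
print(f"most time spent in: {slowest_service}")
```

In this made-up trace the gateway itself accounts for only 10 ms of the 120 ms request; most of the time sits in the orders service, which is the kind of conclusion the hierarchy is meant to make visible at a glance.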

Is This Always Better Than CPU Flame Graphs?

For diagnosing end-to-end latency in distributed systems, generally yes. Focusing on CPU usage may provide insights for individual nodes or services, but it doesn’t reflect the overall user experience when latency is caused by network calls or inefficient inter-service communication. However, if your system is monolithic, with all components tightly coupled on a single server, or if a particular service is known to be CPU-bound, CPU flame graphs may still provide more immediate value.
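
The gap between the two measurements is easy to reproduce. The sketch below times one function with both a wall-clock timer and a CPU-time timer from Python's standard `time` module; the `simulated_remote_call` function is a stand-in we made up, using a sleep in place of a real network or database wait.

```python
import time


def measure(fn):
    """Return (wall_ms, cpu_ms) for a single call to fn."""
    wall_start = time.perf_counter()  # wall-clock timer
    cpu_start = time.process_time()   # CPU time used by this process
    fn()
    cpu_ms = (time.process_time() - cpu_start) * 1000
    wall_ms = (time.perf_counter() - wall_start) * 1000
    return wall_ms, cpu_ms


def simulated_remote_call():
    # Stand-in for a network or database wait; sleeping uses almost no CPU.
    time.sleep(0.05)


wall_ms, cpu_ms = measure(simulated_remote_call)
print(f"wall-clock: {wall_ms:.1f} ms, CPU: {cpu_ms:.1f} ms")
```

The wall-clock figure reflects the full wait, while the CPU figure stays close to zero. A CPU-based flame graph would barely show this method, even though it accounts for most of the time the caller experiences.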

Why Method-Level Flame Graphs Work for Microservices

Microservices communicate via API calls, each with its own potential for latency based on various factors (network speed, database response time, etc.). By focusing on the time each method takes to execute, developers can identify which services are slowing down the request lifecycle, regardless of where they’re hosted or what hardware they run on. This is critical for improving overall system performance in environments where services are geographically distributed or hosted on different infrastructures.
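
As a rough illustration of what "time each method takes to execute" means in practice, the following sketch wraps methods in a timing decorator that records wall-clock milliseconds tagged with a service name. It is not how BitDive collects data: the `timed` decorator, the in-memory `RECORDS` list, and the example methods are all hypothetical, and a real agent would also propagate trace and parent identifiers between services.

```python
import functools
import time

RECORDS = []  # hypothetical in-memory sink; a real agent would export these


def timed(service):
    """Decorator that records the wall-clock duration of each call in ms."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                # Recorded even if fn raises, so failed calls are timed too.
                elapsed_ms = (time.perf_counter() - start) * 1000
                RECORDS.append(
                    {"service": service, "method": fn.__name__, "ms": elapsed_ms}
                )
        return wrapper
    return decorator


@timed("orders")
def save_order():
    time.sleep(0.02)  # stand-in for a database write
    return "saved"


@timed("orders")
def create_order():
    return save_order()


result = create_order()
for record in RECORDS:
    print(f"{record['service']}.{record['method']}: {record['ms']:.1f} ms")
```

Because the measurement is plain elapsed time, the same record format works no matter which host or hardware the method ran on, which is what lets timings from different services be placed in one graph.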

For a general overview of CPU utilization, visit CPU Flamegraphs.
Explore more on Application Performance Optimization.