Why Perform Performance Tuning?

  1. An online product without performance testing is like a ticking time bomb. You don’t know when it will break down, or what its limits are.
  2. Some performance problems accumulate slowly over time and eventually explode. Many more are caused by fluctuations in traffic.
  3. Performance tuning is a continuous process that helps you identify and address performance issues before they become critical, ensuring a smooth user experience and optimal resource utilization.
  4. Performance tuning can help you save costs by optimizing resource usage and avoiding the need for expensive hardware upgrades.
  5. It can also improve the scalability of your system, allowing it to handle increased traffic and user demand without degradation in performance.

When to Start Performance Tuning?

The common advice is to start performance tuning as early as possible in the development process. However, this doesn’t necessarily mean that you need to start tuning from day one. It’s important to focus on building a solid foundation for your application first, and then gradually optimize performance as needed.

In fact, in the early stages of project development, we don’t need to focus heavily on performance optimization. Premature tuning is exhausting: it rarely improves system performance, and it can slow development progress and introduce new problems. At this stage it is enough to code effectively, such as reducing disk I/O operations, minimizing lock contention, and using efficient algorithms. For more complex business logic, we can use design patterns to keep the code well structured and easier to optimize later.
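As a minimal, hypothetical illustration of choosing an efficient algorithm: replacing a linear scan with a hash-based lookup turns an O(n·m) list intersection into roughly O(n + m). The class and method names here are invented for the example.

```java
import java.util.*;

public class Intersection {
    // O(n*m): scans the whole second list for every element of the first.
    public static List<Integer> slowIntersect(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>();
        for (Integer x : a) {
            if (b.contains(x)) out.add(x); // contains() is a linear scan
        }
        return out;
    }

    // O(n+m): one pass to build the set, one pass to probe it.
    public static List<Integer> fastIntersect(List<Integer> a, List<Integer> b) {
        Set<Integer> lookup = new HashSet<>(b);
        List<Integer> out = new ArrayList<>();
        for (Integer x : a) {
            if (lookup.contains(x)) out.add(x); // O(1) on average
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> a = Arrays.asList(1, 2, 3, 4);
        List<Integer> b = Arrays.asList(3, 4, 5);
        System.out.println(fastIntersect(a, b)); // [3, 4]
    }
}
```

Both methods return the same result; only the cost differs, which is exactly the kind of cheap, local win worth taking during early development.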

After the system coding is completed, we can perform performance testing. At this point the product manager will usually provide expected online data. We can conduct load testing on a provided reference platform, using performance analysis and statistical tools to collect various performance metrics and see if they are within the expected range.

After the project is successfully launched, we also need to monitor system performance under real traffic, using log monitoring and performance statistics tools. Once a problem is found, we must analyze the logs and fix it promptly.

What Factors Should Be Considered When Defining Performance Tuning Standards?

When we talk about performance tuning across different stages of system development, we often mention performance metrics. But what exactly are these metrics, and what factors influence them?

Before diving into metrics, it’s important to understand the key system resources that can become performance bottlenecks.

Key System Bottlenecks

CPU

Some applications are computation-intensive and may occupy CPU resources for long periods. This prevents other tasks from getting CPU time, leading to slow responses. Common causes include:

  • Infinite loops or runaway recursion
  • Regex backtracking issues
  • Frequent Full GC in JVM
  • Excessive context switching in multithreading
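The regex backtracking point is worth seeing concretely. In this small sketch (pattern and input are made up for illustration), a nested quantifier like (a+)+ backtracks exponentially on inputs that almost match:

```java
import java.util.regex.Pattern;

public class RegexBacktracking {
    public static void main(String[] args) {
        // Nested quantifiers like (a+)+ force exponential backtracking on
        // inputs that almost match but fail at the very end.
        Pattern risky = Pattern.compile("(a+)+b");
        String input = "a".repeat(20) + "c"; // no trailing 'b', so no match

        long start = System.nanoTime();
        boolean matched = risky.matcher(input).matches();
        long ms = (System.nanoTime() - start) / 1_000_000;
        System.out.println("matched=" + matched + " after " + ms + " ms");

        // Every extra 'a' roughly doubles the work; a few more characters
        // can pin a CPU core for minutes. The equivalent pattern "a+b"
        // fails immediately because it leaves nothing to backtrack over.
        System.out.println("a+b on same input: " + input.matches("a+b"));
    }
}
```

This is why a single unlucky regex on user-controlled input can saturate a CPU core under load.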

Memory

In Java applications, memory is managed by the JVM, mainly using heap space to store objects. While memory access is fast, it is limited and expensive compared to disk. Problems arise when:

  • Memory is exhausted
  • Objects cannot be garbage collected
  • Memory leaks occur or an OutOfMemoryError is thrown
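One of the most common leak patterns is a static cache that only ever grows. In this hedged sketch (class name and sizes are illustrative), bounding the cache with LRU eviction lets the GC reclaim old entries:

```java
import java.util.*;

public class LeakyCache {
    // Leak pattern: a static map that only grows. The map holds strong
    // references forever, so the GC can never reclaim its entries.
    static final Map<String, byte[]> UNBOUNDED = new HashMap<>();

    // Bounded alternative: an access-ordered LinkedHashMap that evicts
    // the least-recently-used entry once it exceeds a fixed size.
    public static Map<String, byte[]> lruCache(int maxEntries) {
        return new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
                return size() > maxEntries;
            }
        };
    }

    public static void main(String[] args) {
        Map<String, byte[]> cache = lruCache(100);
        for (int i = 0; i < 10_000; i++) {
            cache.put("key-" + i, new byte[1024]); // older entries evicted
        }
        System.out.println("entries retained: " + cache.size()); // 100
    }
}
```

The unbounded version would retain all 10,000 arrays; the bounded one keeps memory usage flat no matter how long the process runs.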

Disk I/O

Disks provide large storage but are significantly slower than memory. Although SSDs improve performance, they still lag behind RAM. Heavy read/write operations can easily become a bottleneck.
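A quick way to feel this cost (a rough sketch; absolute timings depend on the OS and disk): writing bytes one at a time through an unbuffered stream versus batching them with BufferedOutputStream.

```java
import java.io.*;
import java.nio.file.*;

public class BufferedWrite {
    // Writes n bytes one at a time through the given stream.
    public static void writeBytes(OutputStream out, int n) throws IOException {
        for (int i = 0; i < n; i++) out.write(i & 0xFF);
        out.flush();
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("io-demo", ".bin");
        int n = 1_000_000;

        // Unbuffered: every write() call reaches the OS individually.
        long t0 = System.nanoTime();
        try (OutputStream out = new FileOutputStream(file.toFile())) {
            writeBytes(out, n);
        }
        long slow = System.nanoTime() - t0;

        // Buffered: writes are batched in an 8 KB buffer, so the OS sees
        // roughly n/8192 large writes instead of n tiny ones.
        t0 = System.nanoTime();
        try (OutputStream out = new BufferedOutputStream(
                new FileOutputStream(file.toFile()))) {
            writeBytes(out, n);
        }
        long fast = System.nanoTime() - t0;

        System.out.printf("unbuffered: %d ms, buffered: %d ms%n",
                slow / 1_000_000, fast / 1_000_000);
        Files.delete(file);
    }
}
```

The same principle (batch small operations into larger ones) applies to network and database round trips as well.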

Network

Network performance plays a critical role, especially in distributed systems. Limited bandwidth or high concurrency can quickly make the network a bottleneck.

Exceptions

Throwing and handling exceptions in Java is expensive because it involves building stack traces. Frequent exceptions under high concurrency can significantly degrade performance.
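When an exception is used purely as a signal, most of its cost can be avoided by skipping stack-trace capture. A sketch (the class name is illustrative):

```java
public class CheapException extends RuntimeException {
    public CheapException(String message) {
        super(message);
    }

    // Throwable normally captures the full stack in fillInStackTrace()
    // during construction; returning 'this' without walking the stack
    // removes most of the cost of creating the exception.
    @Override
    public synchronized Throwable fillInStackTrace() {
        return this;
    }

    public static void main(String[] args) {
        try {
            throw new CheapException("not found");
        } catch (CheapException e) {
            // No frames were recorded.
            System.out.println("frames: " + e.getStackTrace().length); // 0
        }
    }
}
```

The trade-off is that no stack trace is available for debugging, so this only suits exceptions used for control flow, never for error reporting.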

Database

Most systems rely on databases, and database operations often involve disk I/O. High-frequency read/write operations can lead to:

  • Increased latency
  • I/O bottlenecks
  • Overall system slowdown

Lock Contention

In concurrent programming, locks are used to ensure data consistency. However, they introduce overhead such as context switching. Even though modern JVMs optimize locks (biased locks, lightweight locks, etc.), improper usage can still harm performance.
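For a concrete contrast (a simplified sketch, not a benchmark): a synchronized counter serializes every increment on one monitor, while java.util.concurrent.atomic.LongAdder spreads updates across internal cells so contending threads rarely write the same memory word.

```java
import java.util.concurrent.atomic.LongAdder;

public class Counters {
    private long value;

    // Coarse-grained: every increment serializes on one monitor, so
    // contended threads queue up and pay for context switches.
    public synchronized void incrementLocked() { value++; }
    public synchronized long lockedValue() { return value; }

    public static void main(String[] args) throws InterruptedException {
        // LongAdder shards its state, trading exact-read cost for
        // much lower write contention.
        LongAdder adder = new LongAdder();
        Thread[] workers = new Thread[4];
        for (int t = 0; t < workers.length; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < 100_000; i++) adder.increment();
            });
            workers[t].start();
        }
        for (Thread w : workers) w.join();
        System.out.println(adder.sum()); // 400000
    }
}
```

Both approaches are correct; the difference only shows up under heavy write contention, which is exactly when lock overhead matters.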

Key Performance Metrics

Based on the above factors, we can define several important metrics to evaluate system performance.

1. Response Time

Response time is one of the most critical performance indicators. The shorter the response time, the better the system performance.

  • Database Response Time: Often the most time-consuming part
  • Server Response Time: Includes request handling and business logic execution
  • Network Response Time: Time spent on data transmission
  • Client Response Time: Usually negligible unless heavy client-side logic exists

2. Throughput

Throughput measures how many requests a system can handle per unit time, often expressed as TPS (Transactions Per Second). Higher throughput indicates better performance.

Disk Throughput

  • IOPS (Input/Output Operations Per Second): Number of read/write operations per second (important for random-access workloads like OLTP systems)
  • Data Throughput: Amount of data transferred per second (important for sequential workloads like video streaming)
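The two disk metrics are linked by the I/O size: data throughput ≈ IOPS × I/O size. A back-of-the-envelope sketch (the numbers are illustrative, not measured):

```java
public class DiskThroughput {
    // Data throughput and IOPS are two views of the same activity:
    // throughput (MB/s) = IOPS * I/O size (KB) / 1024.
    public static double throughputMBps(double iops, int ioSizeKB) {
        return iops * ioSizeKB / 1024.0;
    }

    public static void main(String[] args) {
        // OLTP-style workload: many small random 4 KB reads.
        System.out.println(throughputMBps(20_000, 4) + " MB/s");  // 78.125 MB/s
        // Streaming-style workload: fewer, large sequential 1 MB reads.
        System.out.println(throughputMBps(500, 1024) + " MB/s");  // 500.0 MB/s
    }
}
```

This is why a disk with impressive sequential throughput can still be the bottleneck for an OLTP database: the workload is IOPS-bound, not bandwidth-bound.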

Network Throughput

The maximum data transfer rate without packet loss. It depends on:

  • Bandwidth
  • CPU processing power
  • Network hardware
  • System architecture

3. Resource Utilization

This includes:

  • CPU usage
  • Memory usage
  • Disk I/O
  • Network I/O

Think of these like a wooden barrel: the shortest plank determines how much it holds. Whichever resource saturates first becomes the bottleneck that caps overall system performance.

4. Load Capacity

This measures how well a system performs under increasing load.

A key observation:

  • As concurrency increases, response time also increases
  • A well-designed system shows a gradual increase
  • When the system starts failing (e.g., errors spike), it has reached its limit

5. Error Rate

The error rate is the percentage of requests that result in errors. A high error rate indicates that the system is struggling to handle the load, which can lead to poor user experience and potential system instability.
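The metric itself is simple to compute; a trivial sketch (the method name is illustrative):

```java
public class ErrorRate {
    // Error rate as a percentage of all requests in a measurement window.
    public static double errorRatePercent(long errors, long total) {
        if (total == 0) return 0.0; // avoid division by zero
        return 100.0 * errors / total;
    }

    public static void main(String[] args) {
        // 5 failed requests out of 1000:
        System.out.println(errorRatePercent(5, 1000) + " %"); // 0.5 %
    }
}
```

Tracked over time, this number also makes a natural alerting threshold, e.g. paging when the rate stays above an agreed percentage for several minutes.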