Understanding Java Heap Memory: Generations, Garbage Collection, and Troubleshooting

Introduction to Java Heap and GC Principles

The Java Heap, where object instances are stored, is divided into the Young Generation (also called the New Generation) and the Old Generation (also called the Tenured Generation).

Java Heap

The new generation is divided into three areas: one Eden and two Survivor spaces – From Survivor (abbreviated as S0) and To Survivor (abbreviated as S1). The default ratio among them is 8:1:1. Additionally, the default ratio between the new generation and the old generation is 1:2. Typically, when a new object is created, the newly generated object instance is placed into Eden. During garbage collection, the objects in Eden that are not collected are moved into one of the Survivor spaces, let’s say Survivor0.

Eden is then cleared: Eden is now empty, Survivor0 holds the live object instances, and Survivor1 is empty. At the next collection, the live objects in Eden and Survivor0 are copied into Survivor1, the remaining garbage is discarded, and Eden and Survivor0 are cleared. The two Survivor spaces then swap roles, so that one of them is always empty.

After many garbage collection cycles, if certain objects remain in use and cannot be collected, it is assumed that these object instances will likely stay alive for a long time, so they are promoted to the old generation. This keeps them out of the young generation's copying collections and saves a considerable amount of time, because the young generation is collected frequently while the old generation is collected only when specific conditions are met.
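To make the copy-and-swap scheme described above concrete, here is a toy model in Java (a sketch of the bookkeeping only, not how the JVM actually implements it): the two Survivor spaces take turns being the empty "to" space, and each minor collection copies the live objects out of Eden and the "from" space before clearing them.

import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class CopySwapSketch {
    private List<String> eden = new ArrayList<>();
    private List<String> from = new ArrayList<>(); // Survivor0 ("From")
    private List<String> to = new ArrayList<>();   // Survivor1 ("To"), kept empty between collections

    void allocate(String obj) {
        eden.add(obj); // newly created objects go into Eden
    }

    void minorGc(Predicate<String> isLive) {
        // Copy live objects from Eden and the "from" Survivor space into the "to" space.
        for (String obj : eden) {
            if (isLive.test(obj)) to.add(obj);
        }
        for (String obj : from) {
            if (isLive.test(obj)) to.add(obj);
        }
        // Clear Eden and "from", then swap roles so that "to" is empty again.
        eden.clear();
        from.clear();
        List<String> tmp = from;
        from = to;
        to = tmp;
    }
}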

Java Heap: Young Generation

  • All newly created objects are initially placed in the young generation. The goal of the young generation is to quickly collect objects with short life cycles as efficiently as possible.
  • The young generation memory is divided into one Eden space and two Survivor spaces (Survivor0, Survivor1) in an 8:1:1 ratio. Most objects are created in the Eden space. During garbage collection, the live objects in Eden are copied to Survivor0 and Eden is cleared. When Survivor0 also becomes full, the live objects from both Eden and Survivor0 are copied to Survivor1, after which Eden and Survivor0 are cleared. Survivor0 and Survivor1 then swap roles, so one Survivor space is always empty. This process repeats with each collection.
  • When the Survivor space is insufficient to hold the live objects from Eden and the other Survivor space, these live objects are stored directly in the old generation. If the old generation also fills up, a Full GC is triggered, which collects both the young and old generations.
  • The Garbage Collection (GC) occurring in the young generation is also called Minor GC. Minor GC happens quite frequently (and it doesn’t necessarily trigger only when the Eden space is full).

Old Generation in Java Heap

  • During a YGC (Young Garbage Collection), when the Survivor space is insufficient to accommodate the surviving objects, those objects will directly move to the Old Generation.
  • After several Young Garbage Collections (YGC), if the age of a surviving object reaches the set threshold (default is 15), it will be promoted to the old generation.
  • Dynamic age determination rule: if the total size of objects of the same age in the To Survivor space exceeds half of that space, then objects with an age greater than or equal to that age are promoted directly to the old generation, without having to reach the default tenuring threshold.
  • Large Object: Controlled by the -XX:PretenureSizeThreshold startup parameter. If the object size exceeds this value, it will bypass the Young Generation and be directly allocated in the Old Generation.
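A minimal sketch of the large-object rule (assuming the Serial or ParNew young-generation collector, since -XX:PretenureSizeThreshold has no effect with Parallel Scavenge; the flag values and class name below are illustrative only):

// Run with, for example: -XX:+UseParNewGC -XX:PretenureSizeThreshold=4m -Xmn8m
public class PretenureDemo {
    public static void main(String[] args) {
        // 8 MB exceeds the assumed 4 MB threshold above, so this array
        // should be allocated directly in the old generation.
        byte[] largeObject = new byte[8 * 1024 * 1024];
        System.out.println("allocated " + largeObject.length + " bytes");
    }
}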

Java Heap: Permanent Generation

Used for storing static files such as Java classes and methods. (Since Java 8 the permanent generation has been replaced by Metaspace.) The Permanent Generation does not significantly impact garbage collection, but some applications may dynamically generate or invoke classes, such as Hibernate. In such cases, it is necessary to set a relatively large Permanent Generation space to store these newly added classes during execution.

When is YGC triggered?

In most cases, objects are directly allocated in the Eden space of the Young Generation. If there isn’t enough space in the Eden area, it triggers a YGC (Minor GC). The YGC process only handles the Young Generation. Since the majority of objects can be reclaimed in a short period, after YGC, only a small number of objects survive and are moved to the S0 area (utilizing the copy algorithm).

When the next Young Garbage Collection (YGC) is triggered, the live objects in the Eden space and S0 space are moved to the S1 space, and the Eden space and S0 space are then cleared. When YGC is triggered again, it processes the areas of the Eden space and S1 space (i.e., S0 and S1 swap roles). With each YGC cycle, the age of the surviving objects increases by one.

When is FGC triggered?

  • When objects promoted to the old generation exceed the remaining space of the old generation, an FGC (Major GC) is triggered. The FGC process includes both the new generation and the old generation.
  • When the memory usage in the old generation reaches a certain threshold (which can be adjusted via parameters), it directly triggers a Full Garbage Collection (FGC).
  • Space allocation guarantee: Before the Young Generation Collection (YGC), it’s crucial to check whether the largest available contiguous space in the old generation is greater than the total space of all objects in the young generation. If it’s less, this indicates that YGC is unsafe. In such cases, the system will check whether the parameter `HandlePromotionFailure` is set to allow promotion failure. If it does not allow, then a Full GC will be triggered immediately. If it does allow, the system will further check if the largest available contiguous space in the old generation is greater than the average size of objects promoted to the old generation in past instances. If this is also less, it will trigger a Full GC.
  • Metaspace will expand when there is insufficient space, and when it expands to the value specified by the -XX:MetaspaceSize parameter, it will also trigger a Full Garbage Collection (FGC).
  • When `System.gc()` or `Runtime.gc()` is explicitly called, it triggers a Full Garbage Collection (FGC).
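Of these triggers, the explicit-call case is the easiest to reproduce; a minimal sketch follows (note that production JVMs are often started with -XX:+DisableExplicitGC, which turns such calls into no-ops):

public class ExplicitGcDemo {
    public static void main(String[] args) {
        // Explicitly requesting a collection normally results in a Full GC
        // (unless the JVM was started with -XX:+DisableExplicitGC).
        System.gc();
        // Runtime.getRuntime().gc() is equivalent to System.gc().
        Runtime.getRuntime().gc();
    }
}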

Under what circumstances can Garbage Collection (GC) impact the program?

Regardless of whether it’s YGC or FGC, both cause a certain amount of program pause (the Stop-The-World problem: when the GC threads start working, the other worker threads are suspended). Even more advanced garbage collectors such as ParNew, CMS, or G1 only reduce the pause duration; they do not eliminate it completely.

Under what circumstances will GC impact the program? From most to least severe, I believe the following four situations are included:

  • Frequent FGC: Typically, FGC is relatively slow, ranging from a few hundred milliseconds to several seconds. Under normal circumstances, FGC should execute once every few hours or even days, which is generally acceptable in terms of system impact. However, if FGC occurs frequently (such as every few minutes), it indicates a problem. This can cause worker threads to be frequently stopped, making the system appear constantly laggy and degrading the overall performance of the program.
  • YGC takes too long: Generally, the total time consumed by YGC in tens or hundreds of milliseconds is relatively normal. Although it might cause the system to stall for a few milliseconds or tens of milliseconds, this situation is almost imperceptible to the user and has a negligible impact on the program. However, if YGC takes up to 1 second or even several seconds (almost as long as FGC), then the stalling time will increase. Coupled with the frequent occurrence of YGC itself, this can lead to a significant number of service timeout issues.
  • Extended FGC Duration: When FGC takes longer, stutter duration also increases, particularly affecting high concurrency services. This can lead to numerous timeout issues during the FGC, thereby reducing availability. This issue requires attention as well.
  • YGC Occurs Too Frequently: Even if YGC does not cause service timeouts, having YGC occur too frequently can reduce the overall performance of the service. This is something that should be monitored, especially for high-concurrency services.

Java, despite having a Garbage Collector (GC), can still experience memory leak issues.

1. The use of static collection classes such as HashMap and Vector is most likely to cause memory leaks.

2. Various connections, such as database connections, network connections, and I/O streams, that are never explicitly closed (i.e., `close()` is never called) lead to memory leaks, because the associated objects remain reachable and are not reclaimed by the garbage collector (see the sketch after these items).

3. The use of listeners can also lead to memory leaks if they are not properly removed when objects are released.
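A minimal sketch of the first two causes (the class and file name here are illustrative only):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class LeakExamples {

    // Cause 1: a static collection lives as long as the class itself,
    // so everything added to it stays reachable and is never collected.
    private static final List<byte[]> CACHE = new ArrayList<>();

    static void leakyCache() {
        CACHE.add(new byte[1024 * 1024]); // grows forever unless explicitly cleared
    }

    // Cause 2: a connection/stream that is never closed leaks resources;
    // try-with-resources guarantees close() is called even on exceptions.
    static String firstLine(String path) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            return reader.readLine();
        }
    }
}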

Differences Between Memory Leak and Memory Overflow

1. Memory Leak (memory leak)

Memory is requested but never freed. For example, out of a total of 1024M of memory, if 512M is allocated and never reclaimed, only 512M of usable memory remains, as if that portion had leaked away. To put it colloquially, a memory leak is akin to “occupying the toilet without using it.”

2. Memory Overflow (out of memory)

When memory is requested, there is not enough memory available to satisfy the request.

To put it simply, imagine a restroom with three stalls, and two people are just standing there without leaving (memory leaks). This leaves only one stall available, causing the restroom to be under significant stress. Suddenly, two more people come in, and there aren’t enough stalls (memory) available. The memory leak then escalates into a memory overflow.

Object X references object Y, and X’s lifecycle is longer than Y’s lifecycle; thus, when Y’s lifecycle ends, X still holds a reference to Y. At this point, the garbage collector will not reclaim object Y. If object X also references the shorter-lived objects A, B, and C, and object A further references objects a, b, and c, this could lead to a situation where a significant number of obsolete objects remain uncollected. Consequently, these objects consume memory resources, potentially leading to a memory leak and eventually causing memory overflow.
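A small sketch of this pattern, and its fix of clearing the obsolete reference, might look like the following (the names are illustrative only):

public class LifecycleLeak {

    static class Y {
        byte[] payload = new byte[1024 * 1024];
    }

    // X is long-lived (for example, a service object that exists for the whole run).
    static class X {
        private Y y; // becomes an obsolete reference once Y is no longer needed

        void useYOnce() {
            y = new Y();
            // ... work with y ...
            // Without the next line, Y stays reachable through X after its
            // useful life is over and can never be collected:
            y = null; // drop the obsolete reference so the GC can reclaim Y
        }
    }
}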

It is evident that the relationship between memory leaks and memory overflow is such that an increase in memory leaks will eventually lead to memory overflow.

Java Heap

Notice: Anonymous inner classes hold a reference to the enclosing outer class instance, which can potentially lead to memory leaks, whereas static nested classes do not. (Reference: https://mp.weixin.qq.com/s/ZX-BvkQ4B7ql62Mi8v_rLw)
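A small sketch of the difference (illustrative names; the anonymous Runnable captures a hidden reference to its enclosing Outer instance, while the static nested class does not):

public class Outer {
    private final byte[] state = new byte[8 * 1024 * 1024];

    // Anonymous inner class: holds an implicit reference to this Outer instance,
    // so while the returned Runnable is reachable (e.g., queued in an executor),
    // the whole Outer object and its 8 MB of state stay reachable too.
    Runnable anonymousTask() {
        return new Runnable() {
            @Override
            public void run() {
                System.out.println("state size: " + state.length);
            }
        };
    }

    // Static nested class: no hidden reference to the enclosing instance.
    static class StaticTask implements Runnable {
        @Override
        public void run() {
            System.out.println("running");
        }
    }
}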

Check JVM Configuration, Set Java Heap Size

Use the following command to view the JVM startup parameters:

ps aux | grep "applicationName=adsearch"

The heap memory is seen to be 4G, with the young generation at 2G and the old generation also at 2G. The young generation uses the ParNew collector, while the old generation employs the concurrent mark-sweep (CMS) collector. A full garbage collection (FGC) occurs when the memory occupancy rate of the old generation reaches 80%.

-Xms4g -Xmx4g -Xmn2g -Xss1024K 

-XX:ParallelGCThreads=5 

-XX:+UseConcMarkSweepGC 

-XX:+UseParNewGC 

-XX:+UseCMSCompactAtFullCollection 

-XX:CMSInitiatingOccupancyFraction=80

For setting the entire Java heap size, configure Xmx and Xms to be 3-4 times the size of the surviving objects in the old generation after a Full GC, that is, 3-4 times the memory usage of the old generation post-Full GC.

Set PermSize and MaxPermSize to be 1.2-1.5 times the size of the live objects in the old generation.

The setting for the Young Generation Xmn should be 1 to 1.5 times the size of the surviving objects in the Old Generation.

The old generation’s memory size is set to be 2-3 times the size of the surviving objects in the old generation.


Xms = Xmx = (3-4) × (size of live objects in the old generation after Full GC)

Xmn = (1-1.5) × (size of live objects in the old generation after Full GC)
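As a purely hypothetical worked example: if, after observing several Full GCs, the old-generation live set stabilizes at around 600 MB, these rules of thumb would suggest roughly Xms = Xmx = 1.8-2.4 GB, Xmn = 600-900 MB, and an old generation of about 1.2-1.8 GB.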

Add GC Logs in JVM Parameters

-XX:+PrintGC  -XX:+PrintGCDetails  
or  
-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps

The GC log records the memory size of each generation after every Full GC. Observe the old-generation size after Full GC over a period of time (for example, two days), and estimate the size of the surviving objects in the old generation from the old-generation occupancy recorded after multiple Full GCs (for example, by averaging those values).

[GC (Allocation Failure) [PSYoungGen: 228290K->3505K(244224K)] 264814K->43652K(506368K), 0.0368352 secs] [Times: user=0.14 sys=0.00, real=0.04 secs]
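As a worked reading of this sample line: the young generation (PSYoungGen) dropped from 228290K to 3505K out of a 244224K capacity, and the whole heap dropped from 264814K to 43652K out of 506368K, so roughly 43652K - 3505K ≈ 40 MB remained live outside the young generation after this collection, and the pause took about 0.037 seconds. Note that this particular line is a minor GC record; Full GC lines additionally report the old generation and Metaspace, and those are the figures used for the sizing rules above.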

From the GC log we can roughly infer whether young GC and full GC are too frequent or take too long, and then apply the right remedy. Below we analyze these cases for the G1 garbage collector, which we recommend enabling with -XX:+UseG1GC.

YoungGC is too frequent
Frequent young GC usually means there are many short-lived small objects. First consider whether the Eden area / young generation is set too small, and see whether the problem can be solved by adjusting parameters such as -Xmn and -XX:SurvivorRatio. If the parameters look normal but the young GC frequency is still too high, use jmap and MAT to investigate a heap dump further.

YoungGC takes too long
When YGC takes too long, look at where the time goes in the GC log. Taking a G1 log as an example, pay attention to stages such as Root Scanning, Object Copy, and Ref Proc. If Ref Proc takes a long time, look at the referenced objects; if Root Scanning takes a long time, look at the number of threads and cross-generational references; for Object Copy, look at object life cycles. In addition, the timings should be compared horizontally, that is, against other projects or against normal time periods. For example, if Root Scanning increases significantly compared with the normal period, it means too many threads have been started.

Full GC is triggered
In G1, mixed GC is what happens more often, but mixed GC can be investigated with the same approach as young GC. When a genuine full GC is triggered, there is generally a problem: G1 falls back to the Serial collector to perform the collection, and pause times can reach several seconds, essentially bringing the system to its knees.

The causes of full GC may include the following, along with some thoughts on parameter adjustments:

  • Concurrent mode failure: during the concurrent marking phase, if the old generation fills up before the mixed GC can run, G1 abandons the marking cycle. In this case you may need to enlarge the heap or increase the number of concurrent marking threads with -XX:ConcGCThreads.
  • Promotion failure: during a GC there was not enough space for surviving or promoted objects, which triggers a Full GC. You can increase the reserved memory percentage with -XX:G1ReservePercent, lower -XX:InitiatingHeapOccupancyPercent to start marking earlier, or increase the number of marking threads with -XX:ConcGCThreads.
  • Humongous (large object) allocation failure: when a large object cannot find suitable region space to be allocated in, a full GC occurs. In this case you can increase memory or enlarge -XX:G1HeapRegionSize.
  • Don’t randomly invoke System.gc(): Just avoid doing it unnecessarily.

Additionally, we can add -XX:HeapDumpPath=/xxx/dump.hprof to the startup parameters to set where heap dumps related to full GC are written, and use jinfo to enable dumps before and after a Full GC:

jinfo -flag +HeapDumpBeforeFullGC pid 
jinfo -flag +HeapDumpAfterFullGC pid

In this way, you get 2 dump files. After comparison, focus mainly on the problematic objects that have been garbage collected to pinpoint the issue.

2. Steps for Troubleshooting Java Heap FGC Issues

1. Clearly understand from a programming perspective, what are the causes of Full Garbage Collection (FGC)?

  • Large Object: The system loaded too much data into memory at once (for example, SQL queries without pagination), causing the large object to enter the old generation.
  • Memory leak: A large number of objects are frequently created but cannot be reclaimed (for example, IO objects not calling the close method to release resources after use), initially triggering frequent garbage collection (FGC), and eventually leading to an OutOfMemoryError (OOM).
  • Programs frequently generate long-lived objects, and when these objects survive beyond the generational cutoff age, they transition to the old generation, ultimately triggering a Full Garbage Collection (FGC), as exemplified in this article.
  • The bug in the program caused the dynamic generation of numerous new classes, leading to continuous Metaspace occupation, which initially triggered Full Garbage Collection (FGC) and eventually resulted in an OutOfMemoryError (OOM).
  • The code explicitly calls the `gc` method, including in your own code and even within the framework’s code.
  • JVM parameter configuration issues: including the total memory size, the sizes of the young and old generations, the sizes of the Eden and Survivor spaces, the Metaspace size, the garbage collection algorithm, and more.

2. What tools can be used to clearly troubleshoot issues?

  • jstat -gc / jstat -gcutil: observe the GC counts, GC time, and usage of each generation over time.
  • GC logs (-verbose:gc, -XX:+PrintGCDetails, etc.): check how often YGC/FGC occur, how long they take, and the old-generation size after each Full GC.
  • jmap -histo and heap dump files (jmap -dump, -XX:+HeapDumpOnOutOfMemoryError, or the jinfo-enabled dumps around Full GC): locate the objects that occupy the most memory.
  • MAT (Memory Analyzer Tool): analyze the dump files and trace suspicious objects back to the code.
  • ps and monitoring data: check the process start time and elapsed time to estimate the FGC frequency, and compare against normal periods.

3. Commands for Checking Java Heap GC Status

a. Check the Maximum Java Heap Usage for a Specific Process

PID refers to the Process ID, 20 denotes the top twenty, instances indicate the number of instances, and bytes represent the memory usage size (1M=1024KB, 1KB=1024Bytes).

 jmap -histo pid | head -n 20

b. Monitor Java Heap Memory and Check Full GC Frequency

Monitor the JVM, print every 5 seconds, loop 100 times.

jstat -gc pid 5000 100
 
jstat -gcutil pid 5000 100

  • S0C: capacity of the first Survivor space
  • S1C: capacity of the second Survivor space
  • S0U: used size of the first Survivor space
  • S1U: used size of the second Survivor space
  • EC: capacity of the Eden space
  • EU: used size of the Eden space
  • OC: capacity of the old generation
  • OU: used size of the old generation
  • MC: capacity of the method area (Metaspace)
  • MU: used size of the method area (Metaspace)
  • CCSC: capacity of the compressed class space
  • CCSU: used size of the compressed class space
  • YGC: number of young-generation garbage collections
  • YGCT: time spent in young-generation garbage collections
  • FGC: number of Full GCs
  • FGCT: time spent in Full GCs
  • GCT: total garbage collection time

Check the process runtime, frequency = duration / FGC

ps -eo pid,tty,user,comm,lstart,etime | grep 24019

In this sample output, process 24019 has TTY “?”, user admin, and command java; its start time (lstart) is Thu Dec 13 11:17:14 2018 and its elapsed time (etime) is 01:29:43, i.e., it has been running for about 1 hour and 29 minutes. Dividing the elapsed time by the FGC count reported by jstat gives the average Full GC frequency.

4. Troubleshooting Guide for Java Heap

  • Check the monitoring to understand the time when the problem occurred and the current FGC frequency (you can compare it with the normal situation to see if the frequency is normal)
  • Find out whether there were any code releases or base-component upgrades before that point in time.
  • Understand the JVM parameter settings, including the size of each area of the heap and which garbage collectors are used for the young and old generations, and then assess whether those settings are reasonable.
  • Then use the elimination method for the possible reasons listed in step 1, among which the full metaspace, memory leaks, and explicit calls to the gc method in the code are relatively easy to troubleshoot.
  • For FGC caused by large objects or objects with long life cycles, you can use the jmap -histo command and combine it with the dump heap memory file for further analysis. You need to locate the suspicious object first.
  • Locate the specific code through the suspicious object for further analysis. At this time, you need to combine the GC principle and the JVM parameter settings to figure out whether the suspicious object meets the conditions for entering the old generation before you can draw a conclusion.

5. Locating and Analyzing Memory Overflow

Memory overflow is often encountered in actual production environments. For example, continuously writing data into a collection, encountering an infinite loop, or reading an extremely large file can all potentially cause memory overflow.

In the event of a memory overflow, the first step is to pinpoint the phase where the memory overflow occurred and conduct an analysis to determine whether it is a normal or abnormal situation. If it is a legitimate requirement, consider increasing the memory allocation. If it is an abnormal requirement, then code modifications are necessary to fix the bug.

First, we must learn how to identify the problem before we proceed with the analysis. To locate issues, we need to utilize tools like jmap and MAT for precise analysis.

1. Simulate Memory Overflow

Write code that adds ten million strings to a List collection, with each string composed of 1,000 concatenated UUIDs. If the program completed successfully, it would print “ok” at the end.

package com.zn;

import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

public class TestJvmOutOfMemory {
    public static void main(String[] args) {
        List<String> list = new ArrayList<>();
        for (int i = 0; i < 10000000; i++) {
            // Build one large string out of 1000 concatenated UUIDs.
            String str = "";
            for (int j = 0; j < 1000; j++) {
                str += UUID.randomUUID().toString();
            }
            // Keep every string reachable so the heap fills up and OOM is triggered.
            list.add(str);
        }
        System.out.println("ok");
    }
}

2. Configure VM options: -Xms8m -Xmx8m -XX:+HeapDumpOnOutOfMemoryError

3. Run the tests

4. When a memory overflow occurs, a dump file will be created as java_pid65828.hprof.

5. Import into the MAT tool for analysis.

You can see that 87.99% of the memory is occupied by Object[] arrays, which is quite suspicious.

Analysis: This suspicion is justified, as the array already occupies nearly 90% of the memory, making a memory overflow highly likely.

6. View Details

A large number of UUID strings can be seen stored in the collection.

Section 3: Online Incident Troubleshooting

Troubleshooting Ideas for Slow Linux System, 100% CPU Usage, and Excessive Full GC Events

There are two main possible reasons for this situation:

  • A certain portion of the code reads a large amount of data, exhausting system memory, which in turn results in excessive Full GC activity and slows the system down;
  • The code contains CPU-intensive operations, leading to high CPU usage and slow system performance.

Relatively speaking, these are the two most frequent online issues, and they can directly render a system unusable. In addition, there are other situations that can cause a particular function to run slowly without making the whole system unusable:

  • There is a blocking operation somewhere in the code, making calls to that function time-consuming overall, but occurring somewhat randomly.
  • A certain thread enters the WAITING state for some reason, rendering that piece of functionality unavailable, and the problem cannot be reproduced.
  • Due to improper use of locks, multiple threads enter a deadlock, causing the system as a whole to run noticeably slower.

For these three situations, examining CPU and system memory will not reveal the specific problem, because they are blocking issues: CPU and memory usage are not high, yet performance is still slow.

    CPU

    In general, we usually start by troubleshooting CPU-related issues. CPU anomalies are often relatively easy to pinpoint. The causes can include business logic problems (such as time-consuming calculations in the code), frequent garbage collection, and excessive context switching. The most common issue is usually caused by business logic (or framework logic). You can use `jstack` to analyze the corresponding stack trace.

    Using jstack to Analyze CPU Issues

    The `top -H -p pid` command shows the running status of each thread within the given process.

    Convert Thread ID

    In the `jstack` output, thread IDs (the nid field) appear in hexadecimal, so convert the decimal thread ID obtained from top with the following command (or use a calculator):

    # printf "%x\n" pid
    printf "%x\n" 17880          
    45d8

    Use jstack to identify the CPU-consuming thread

    Here, 30 means printing 30 lines of the stack after the match.

    `jstack ProcessID | grep ThreadID -A 30`

    View the stack information: `jstack pid | grep 'nid' -C5 --color`

    You can see that we have found the stack information for nid 0x42; next we just need to analyze it carefully. Typically we pay most attention to the WAITING and TIMED_WAITING states, while BLOCKED is self-explanatory. We can use the command `cat jstack.log | grep "java.lang.Thread.State" | sort -nr | uniq -c` to get an overall picture of the thread states in the jstack output; if there are many WAITING threads, there is likely a problem.

    Using `jstat` to Examine GC Activity

    Before concluding that garbage collection is occurring too frequently, use the `jstat -gc pid 1000` command to observe changes across the GC generations, where 1000 is the sampling interval in milliseconds. S0C/S1C, S0U/S1U, EC/EU, OC/OU, and MC/MU represent the capacity and usage of the two Survivor spaces, the Eden space, the old generation, and Metaspace respectively. YGC/YGCT, FGC/FGCT, and GCT give the counts and time spent on Young GC, Full GC, and the total GC time. If you see that GC is occurring frequently, proceed with a more detailed GC analysis.

    Use vmstat to view context switching

    For frequent context-switching problems, we can use the `vmstat` command to observe them.

    cs (context switch) indicates the number of context switches. If we want to monitor a specific PID, we can use the `pidstat -w pid` command; its cswch and nvcswch columns refer to voluntary and involuntary switches respectively.

    Memory

    Memory issues can be more complex to troubleshoot than CPU issues, and the scenarios are more varied. The main concerns are OOM (Out of Memory), GC problems, and off-heap memory. Generally speaking, we start with the `free` command to get an overview of memory usage.

    Heap memory

    Memory issues are mostly related to heap memory problems. Externally, they mainly manifest as OOM (Out of Memory) and StackOverflow.

    OOM

    Insufficient memory in the JVM can generally be categorized into the following types:

    Exception in thread "main" java.lang.OutOfMemoryError: unable to create new native threadThis means there isn't enough memory space to allocate a Java stack for a thread, which generally indicates an issue with the thread pool code, such as forgetting to shut it down. Therefore, you should first look for problems at the code level using tools like jstack or jmap. If everything is normal, you can address JVM issues by specifying...XssTo reduce the size of an individual thread stack. Additionally, at the system level, you can modify/etc/security/limits.conf`nofile` and `nproc` to increase OS thread limits

    Exception in thread "main" java.lang.OutOfMemoryError: Java heap spaceThis means that the heap memory usage has already reached the maximum value set by -Xmx, which is probably the most common OOM (Out of Memory) error. The solution should initially involve examining the code, suspecting the presence of a memory leak, and using tools like jstack and jmap to pinpoint the issue. If everything appears normal, only then should you consider adjustment.XmxValue to increase memory.

    Caused by: java.lang.OutOfMemoryError: Metaspace. This means usage of the metadata area has reached the maximum set by -XX:MaxMetaspaceSize. The troubleshooting approach is the same as above; at the parameter level it can be adjusted via -XX:MaxMetaspaceSize (we will not discuss the pre-1.8 permanent generation and -XX:MaxPermSize here).

    Stack Overflow

    Stack overflow is something many people encounter frequently. Exception in thread "main" java.lang.StackOverflowError indicates that the memory required by the thread stack exceeds the -Xss value. This should also first be investigated in the code (deep or unbounded recursion is the usual cause); at the parameter level it can be adjusted via -Xss, but making the stack too large may in turn trigger OOM.
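    A minimal sketch that reproduces this error (unbounded recursion; the class name is illustrative only):

    public class StackOverflowDemo {
        static int depth = 0;

        // Each call adds a stack frame until the thread stack (sized via -Xss)
        // is exhausted and StackOverflowError is thrown.
        static void recurse() {
            depth++;
            recurse();
        }

        public static void main(String[] args) {
            try {
                recurse();
            } catch (StackOverflowError e) {
                System.out.println("StackOverflowError at depth " + depth);
            }
        }
    }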

    Using JMAP to Identify Code Memory Leaks

    When troubleshooting code-level issues such as OOM and StackOverflow, we typically use jmap: `jmap -dump:format=b,file=filename pid` exports a heap dump file.

    Alternatively, we can specify -XX:+HeapDumpOnOutOfMemoryError in the startup parameters so that a dump file is saved automatically when an OOM occurs.

    Refer to "Using JMAP and Analyzing Memory Leaks":https://blog.csdn.net/weixin_38004638/article/details/106135505

    GC Issues and Threads

    Heap memory leaks are always accompanied by GC anomalies. However, GC issues are not only related to memory problems; they can also lead to CPU load, network issues, and other complications. It's just that they are more closely related to memory, so we will summarize GC-related problems separately here.

    An excessive number of threads that are not reclaimed in time can also lead to OOM, mostly the unable to create new native thread case mentioned earlier. Besides analyzing the dump file in depth with jstack, we usually first look at the overall number of threads with `pstree -p pid | wc -l`.

    Alternatively, the number of entries under /proc/pid/task corresponds directly to the number of threads.

    Off-heap memory

    If you encounter an off-heap memory overflow, that is truly unfortunate. The first symptom is that physical resident memory grows rapidly, and the errors reported depend on the usage scenario. Off-heap memory overflows are often related to the use of NIO. Generally, we first use pmap to check the process's memory segments: `pmap -x pid | sort -rn -k3 | head -30` shows the 30 largest segments in descending order. You can re-run the command after a while to observe memory growth, or compare against a normal machine to identify suspicious memory segments.

    Typically, when off-heap memory grows gradually until it blows up, you can first establish a baseline with `jcmd pid VM.native_memory baseline` (this requires Native Memory Tracking to be enabled with -XX:NativeMemoryTracking).

    Then, after running for a while, check the memory growth with `jcmd pid VM.native_memory detail.diff` (or `summary.diff`), which gives a detail-level or summary-level diff.

    You can see that jcmd analyzes memory in great detail, including heap, threads, and garbage collection (so other memory anomalies mentioned above can actually be analyzed using NMT). Here, we focus specifically on the growth of Internal memory in off-heap memory. If there's a significant increase, then there is definitely a problem.

    Disk

    Disk issues, like CPU issues, are fairly fundamental. For disk space, we directly use `df -hl` to check the file-system status.

    More often, disk problems are actually performance problems. We can analyze them with iostat: `iostat -d -k -x`.

    The last column, %util, shows how heavily each disk is being used, while rrqm/s and wrqm/s indicate the read and write request rates, which usually helps pinpoint which disk is having problems.

    Additionally, we also need to know which process is performing the read and write operations. Generally speaking, developers have an idea themselves, or you can use the `iotop` command to locate the source of file read and write activities.

    However, what we get here is the tid, which we need to convert into a pid. We can find the pid with readlink: `readlink -f /proc/*/task/tid/../..`.

    Once you have the pid, you can examine the process's specific read and write activity with `cat /proc/pid/io`.

    We can also use the lsof command to determine which files are being read and written: `lsof -p pid`.

    Network

    Issues at the network layer are generally quite complex, involving multiple scenarios and making troubleshooting difficult, turning into a nightmare for most developers. It is arguably the most complicated. Here, examples will be provided along with explanations from the TCP layer, the application layer, and the use of tools.

    Timeout

    Timeout errors primarily occur at the application layer, so it's crucial to focus on understanding the concept. Timeout errors can generally be divided into connection timeouts and read/write timeouts. In some client frameworks that use connection pools, there may also be timeouts for acquiring connections and for idle connection cleanup.

    • Read/Write Timeout. The readTimeout/writeTimeout, sometimes referred to in some frameworks as so_timeout or socketTimeout, all refer to data read/write timeout. Note that most of the timeout mentioned here refers to logical timeout. The timeout in SOA also refers to read timeout. Read/write timeout is usually set only on the client side.
    • Connection timeout usually refers to the maximum time for establishing a connection between a client and a server. On the server side, the term `connectionTimeout` can vary significantly. In Jetty, it indicates the idle connection cleanup time, whereas in Tomcat, it represents the maximum duration a connection is maintained.
    • Other aspects include connection acquisition timeout (connectionAcquireTimeout) and idle connection cleanup timeout (idleConnectionTimeout). These are often used in client or server frameworks that utilize connection pools or queues.
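    As a concrete, hedged illustration of the first two timeout types, the JDK's HttpURLConnection exposes a connect timeout and a read timeout directly; the URL and values below are illustrative only:

    import java.net.HttpURLConnection;
    import java.net.URL;

    public class TimeoutExample {
        public static void main(String[] args) throws Exception {
            URL url = new URL("http://example.com/api/health"); // hypothetical endpoint
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setConnectTimeout(1000); // connection timeout: max time to establish the TCP connection
            conn.setReadTimeout(2000);    // read timeout: max time to wait for data once connected
            System.out.println("HTTP status: " + conn.getResponseCode());
            conn.disconnect();
        }
    }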

    When configuring various timeout settings, it is important to ensure that the client's timeout is shorter than the server's timeout to ensure the connection ends properly.

    In actual development, what concerns us most should be the read and write timeouts of the interface.

    Setting a reasonable interface timeout is a challenge. If the interface timeout is set too long, it may excessively occupy the server's TCP connections. Conversely, if the interface is set too short, timeouts may occur very frequently.

    The server-side interface response time is clearly reduced, but the client still experiences constant timeouts, which is another issue entirely. This problem is actually quite straightforward. The link from the client to the server includes network transmission, queuing, and service processing, each of which could be a factor causing the delay.

    TCP Queue Overflow

    TCP queue overflow is a relatively low-level error, which can lead to more surface-level errors such as timeouts and RSTs. Because the error is more obscure, we should discuss it separately.

    As illustrated in the diagram above, there are two queues: the SYNs queue (half-open queue) and the accept queue (fully established queue). During the three-way handshake, when the server receives a SYN from the client, it places the message in the SYNs queue and responds with a SYN+ACK to the client. When the server receives the client's ACK, if the accept queue is not full at this point, it moves the temporarily stored information from the SYNs queue to the accept queue. Otherwise, it follows the instructions specified by `tcp_abort_on_overflow`.

    `tcp_abort_on_overflow 0` means that if the accept queue is full at the third step of the three-way handshake, the server discards the ACK sent by the client. `tcp_abort_on_overflow 1` means that if the full-connection queue is full at the third step, the server sends an RST packet to the client, indicating that this handshake and connection are being discarded; this can result in many connection reset / connection reset by peer errors in the logs.

    In actual development, how can we quickly pinpoint TCP queue overflow?

    The netstat command: `netstat -s | egrep "listen|LISTEN"` filters the network statistics for the counters related to listen-queue (SYN and accept queue) overflows.

    As shown in the image above, "overflowed" indicates the number of times the fully connected queue has overflowed, while "sockets dropped" represents the number of times the half-open connection queue has overflowed.

    The `ss` command, execute `ss -lnt`.

    As you can see above, Send-Q indicates that the maximum full connection queue on the listen port in the third column is 5, and the first column Recv-Q is how much full connection queue is currently used.
    Next, let's see how the full-connection and half-connection queue sizes are set. The size of the full-connection queue is min(backlog, somaxconn): backlog is passed in when the socket is created, and somaxconn is an OS-level system parameter. The size of the half-connection queue is max(64, /proc/sys/net/ipv4/tcp_max_syn_backlog). In daily development we often use servlet containers as the server, so sometimes we also need to pay attention to the container's connection queue size: in Tomcat the backlog is called acceptCount, and in Jetty it is acceptQueueSize.

    RST exceptions

    An RST packet indicates a connection reset and is used to close useless connections. It usually means an abnormal close, unlike the four-way handshake. In actual development we often see connection reset / connection reset by peer errors, which are caused by RST packets. Common causes include:

    • Port does not exist: if a SYN requesting a connection is sent to a non-existent port, the server simply returns an RST to terminate the connection.
    • Actively terminating a connection with RST instead of FIN: normally a connection is closed with FIN packets, but an RST can be used to terminate it directly. In practice this can be controlled by setting the SO_LINGER value; it is often done deliberately to skip TIME_WAIT and improve efficiency, so use it with caution.
    • One side encounters an exception and sends an RST to tell the other side to close the connection: the TCP queue overflow case above, which sends RST packets, belongs here. This usually happens when one side can no longer handle connections normally (for example, the program crashed or the queue is full) and so tells the peer to close the connection.
    • A TCP segment is received that does not belong to any known connection: for example, one machine loses a TCP segment because of a bad network and the other side closes the connection; if the lost segment arrives much later, the corresponding TCP connection no longer exists, so an RST is sent directly so a new connection can be opened.
    • One side does not receive an acknowledgment from the other side for a long time and sends an RST after a timeout or a number of retransmissions: this is mostly related to the network environment; a poor network can produce more RST packets.

    As mentioned earlier, too many RST packets cause program errors: a read on a closed connection reports connection reset, and a write on a closed connection reports connection reset by peer. You may also see broken pipe errors, which are pipe-level errors meaning reads or writes continue on a pipe that is already closed, often after an RST has been received and connection reset already reported; this is also described in the glibc source comments.

    How do we confirm the presence of RST packets while troubleshooting? Use tcpdump to capture packets and Wireshark for a simple analysis: `tcpdump -i en0 tcp -w xxx.cap`, where en0 is the network interface being monitored.

    Next, let's open the captured packet using Wireshark. As shown in the image, the red ones indicate RST packets.

    `TIME_WAIT` and `CLOSE_WAIT`

    TIME_WAIT and CLOSE_WAIT are concepts most of you are probably familiar with. On a live machine you can directly use the command `netstat -n | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'` to view the number of connections in the TIME_WAIT and CLOSE_WAIT states.

    Using the `ss` command is faster: `ss -ant | awk '{++S[$1]} END {for(a in S) print a, S[a]}'`

    TIME_WAIT

    The existence of `time_wait` serves two main purposes: firstly, to prevent lost packets from being reused by subsequent connections, and secondly, to ensure the connection closes normally within the 2MSL (Maximum Segment Lifetime) time frame. Its existence actually significantly reduces the occurrence of RST (Reset) packets.

    Excessive `time_wait` is more likely to occur in scenarios with frequent short connections. This situation can be optimized on the server-side by adjusting certain kernel parameters:

    # Enables socket reuse in TIME-WAIT state. Allows TIME-WAIT sockets to be reused for new TCP connections; default is 0, which means disabled.
    net.ipv4.tcp_tw_reuse = 1
    # Enables the fast recycling of TIME-WAIT sockets in TCP connections; default is 0, which means disabled.
    net.ipv4.tcp_tw_recycle = 1

    Of course, do not forget the pitfall in NAT environments where packets are rejected because of timestamp discrepancies. Another option is to lower tcp_max_tw_buckets: any TIME_WAIT sockets beyond this value are killed off, although this can also cause time wait bucket table overflow messages to be reported.

    CLOSE_WAIT

    The occurrence of `CLOSE_WAIT` is often due to issues in the application code, where it fails to initiate a FIN packet after receiving an ACK. The likelihood of encountering `CLOSE_WAIT` is even higher than `TIME_WAIT`, and its consequences can be more severe. This situation typically arises when a connection is not properly closed due to some blockage, gradually exhausting all available threads.

    To address such issues, it is best to use `jstack` to analyze thread stacks to troubleshoot the problem. For details, refer to the aforementioned sections. Here, I will provide just one example.

    After the application was deployed, the developers mentioned that the number of CLOSE_WAIT states kept increasing until the application crashed. Upon inspecting with jstack, the most suspicious stack trace revealed that most threads were stuck atcountdownlatch.awaitThe method, after consulting with the development team, revealed that multithreading was used, but exceptions were not caught. After making modifications, it was discovered that the exception was merely the simplest that frequently occurs after upgrading the SDK.class not found.