Java Out of Memory Heap Analysis

Programming Languages | William

Any software developer who has worked with Java-based enterprise-class backend applications would have run into this infamous or awkward error from a customer or QA engineer: java.lang.OutOfMemoryError: Java heap space.

To understand this, we have to go back to a computer science fundamental: the complexity of algorithms, and specifically “space” complexity. Every application has a worst-case performance. In the memory dimension, when demand is unpredictable or spiky, the application may need more memory than the recommended amount allocated to it. That over-uses the allocated heap memory and results in an “out of memory” condition.

The worst part of this condition is that the application cannot recover and will crash. Restarting the application, even with a higher maximum heap (the -Xmx option), is not a long-term solution. Without understanding what caused the heap usage to inflate or spike, memory usage stability (and hence application stability) is not guaranteed. So what is a more methodical approach to understanding the programming problem behind a memory problem? The answer lies in examining the application's heap, and how memory is distributed in it, at the moment the out of memory error happens.

With this prelude, we will focus on the following:

  • Getting a heap dump from a Java process when it goes out of memory.

  • Understanding the type of memory issue the application is suffering from.

  • Analyzing out of memory issues with a heap analyzer, specifically with this great open source project: Eclipse MAT.

Setting Up the Application Ready for Heap Analysis

Non-deterministic or sporadic problems such as an out of memory error are hard to analyze post-mortem. So the best way to handle OOMs is to have the JVM dump its heap to a file at the moment it runs out of memory.

The Sun HotSpot JVM can be instructed to write its heap state to a file when it runs out of memory. The standard format of this file is .hprof. To enable the feature, add -XX:+HeapDumpOnOutOfMemoryError to the JVM startup options. Adding this option is essential on production systems, since an out of memory error can take a long time to occur. The flag adds little or no performance overhead to the application.

If the heap dump .hprof file has to be written to a specific file system location, add the directory path with -XX:HeapDumpPath. Just make sure the application has write permission for the directory given here.
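As a rough illustration, a launch command with both options might look like `java -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/dumps -jar app.jar`, where `/var/dumps` and `app.jar` are placeholders rather than values from the setup described here. The sketch below (the class name `HeapDumpFlagCheck` and its methods are made up for illustration) checks at startup whether the running JVM was actually given these options, by scanning the JVM input arguments. It is a sanity check under the assumption that the options are passed on the command line, not a guarantee that a dump will be written.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

public class HeapDumpFlagCheck {

    // Returns true if the option that enables heap dumps on OOM is present.
    static boolean heapDumpOnOomEnabled(List<String> jvmArgs) {
        return jvmArgs.contains("-XX:+HeapDumpOnOutOfMemoryError");
    }

    // Returns the configured dump path, or null if -XX:HeapDumpPath was not given.
    static String heapDumpPath(List<String> jvmArgs) {
        String prefix = "-XX:HeapDumpPath=";
        for (String arg : jvmArgs) {
            if (arg.startsWith(prefix)) {
                return arg.substring(prefix.length());
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Arguments passed to the JVM itself (not to the application).
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("Heap dump on OOM enabled: " + heapDumpOnOomEnabled(jvmArgs));
        System.out.println("Heap dump path: " + heapDumpPath(jvmArgs));
    }
}
```

Logging these two values once at startup makes it easy to spot a production instance that was launched without the flag.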

Cause Analysis

101: Know the Nature of the Out of Memory Error

The first thing to understand when assessing an out of memory error is the memory growth characteristics. Decide which of the following possibilities applies:

  • Spikes in usage: This type of OOM can be drastic, depending on the type of load. An application can perform well within the memory allocated to the JVM for 20 users, but if the load jumps by the 100th user, it might hit a memory spike that leads to the out of memory error. There are two possibilities for tackling this, covered under Fixing a Memory Issue.

  • Leaks: Memory usage increases steadily over time because of a programming error.

A healthy graph with healthy GC collection.

A leak chart that increases over time after a healthy GC collection pattern.

A memory graph that led to spiky memory usage, leading to OOM.
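The patterns described above can be told apart with very simple checks on a series of heap usage samples. The sketch below is only a hypothetical heuristic (the class `HeapTrendSketch`, its method names, and the spike factor are assumptions made for illustration, not part of any tool mentioned here): a leak-like series is one where the heap level left after each GC keeps climbing, and a spike-like series is one where the latest sample sits far above the earlier average. Real samples could come from GC logs or from `MemoryMXBean`, as the `main` method hints.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class HeapTrendSketch {

    // Leak-like pattern: the heap level left after each GC keeps climbing.
    static boolean looksLikeLeak(long[] usedAfterGc) {
        if (usedAfterGc.length < 3) {
            return false; // too few samples to call it a trend
        }
        for (int i = 1; i < usedAfterGc.length; i++) {
            if (usedAfterGc[i] <= usedAfterGc[i - 1]) {
                return false; // usage dropped or stayed flat at least once
            }
        }
        return true;
    }

    // Spike-like pattern: the latest sample is far above the average of the earlier ones.
    static boolean looksLikeSpike(long[] used, double factor) {
        if (used.length < 2) {
            return false;
        }
        long sum = 0;
        for (int i = 0; i < used.length - 1; i++) {
            sum += used[i];
        }
        double baseline = (double) sum / (used.length - 1);
        return used[used.length - 1] > baseline * factor;
    }

    public static void main(String[] args) {
        // One live sample of the current JVM's heap, for reference.
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        System.out.println("used=" + heap.getUsed() + " max=" + heap.getMax());

        System.out.println(looksLikeLeak(new long[] {100, 120, 150, 190}));
        System.out.println(looksLikeSpike(new long[] {100, 110, 105, 400}, 2.0));
    }
}
```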

Once we understand the nature of the memory issue that caused usage to surge, one of the following fixes can be applied to avoid hitting the OOM error again, depending on what the heap analysis reveals.

Fixing a Memory Issue

  1. Fix the OOM-causing code: If the application keeps adding objects over a period of time without ever clearing its references to them, that programming error has to be fixed. For instance, this could be a hash table that is populated with business objects incrementally, without removing them once the business logic and transaction have completed.

  2. Increase the maximum memory as a fix: After understanding the runtime memory characteristics and the heap, the maximum heap memory might have to be increased to avoid further OOM errors, because the suggested maximum was not enough for application stability. In that case, the application should be run with a higher value for the Java -Xmx flag, chosen according to the assessment made from the heap analysis.
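The hash table case from the first fix can be sketched as follows. The class `TransactionRegistry` and its methods are hypothetical and exist only to show the shape of the bug and of the fix: the leaky variant leaves every business object in a long-lived map, while the fixed variant removes the entry in a `finally` clause once the transaction is done. The last line of `main` prints the maximum heap the JVM will attempt to use, which is a convenient way to confirm that a changed -Xmx value took effect.

```java
import java.util.HashMap;
import java.util.Map;

public class TransactionRegistry {

    // Long-lived map: anything left in here stays reachable and cannot be collected.
    private final Map<String, Object> inFlight = new HashMap<>();

    // Leaky variant: the entry is never removed once the work is done.
    void processLeaky(String txId, Object businessObject) {
        inFlight.put(txId, businessObject);
        // ... business logic would run here ...
    }

    // Fixed variant: the entry is removed even if the business logic throws.
    void processFixed(String txId, Object businessObject) {
        inFlight.put(txId, businessObject);
        try {
            // ... business logic would run here ...
        } finally {
            inFlight.remove(txId);
        }
    }

    int size() {
        return inFlight.size();
    }

    public static void main(String[] args) {
        TransactionRegistry leaky = new TransactionRegistry();
        TransactionRegistry fixed = new TransactionRegistry();
        for (int i = 0; i < 1000; i++) {
            leaky.processLeaky("tx-" + i, new byte[1024]);
            fixed.processFixed("tx-" + i, new byte[1024]);
        }
        System.out.println("leaky map size: " + leaky.size()); // prints 1000
        System.out.println("fixed map size: " + fixed.size()); // prints 0
        System.out.println("max heap (bytes): " + Runtime.getRuntime().maxMemory());
    }
}
```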

Heap Analysis

We will be exploring in detail below how to analyze a heap dump using a heap analysis tool. In our case, we will be using the open source tool MAT by the Eclipse Foundation.

Heap Analysis Using MAT

Now it is time for the deep dive. We will go through a sequence of steps that explores the different features and views of MAT, using an example OOM heap dump, and think through the analysis.

  1. Open the heap (.hprof) generated when the OOM error happened. Make sure to copy the dump file to a dedicated folder since MAT creates lots of index files: File -> open

  2. This opens the dump with options for Leak Suspect Reports and Component Reports. Choose to run the Leak Suspect report.

  3. When the leak suspect chart opens, the pie in the overview pane shows the distribution of retained memory on a per-object basis. It shows the biggest objects in memory (objects that have high retained memory — memory accumulated by it and the objects that it references).

  4. The pie chart above shows 3 problem suspects by aggregating objects which hold the highest aggregated memory references (including shallow and retained).

Let us look at them one at a time and assess whether each could be the root cause of the OOM error:

Suspect 1

454,570 instances of “java.lang.ref.Finalizer”, loaded by “<system class loader>” occupy 790,205,576 (47.96%) bytes.

The above tells us that there were 454,570 JVM finalizer instances occupying almost 50% of the allocated application memory.

Oops! What does this tell us, assuming the reader knows what Java finalizers do?

Read here for a primer: http://stackoverflow.com/questions/2860121/why-do-finalizers-have-a-severe-performance-penalty

Essentially, finalizers are custom cleanup methods written by the developer to release certain resources held by an instance. Instances that have finalizers are not reclaimed by the regular GC path alone: they are placed on a separate queue and can only be freed after their finalizers have run. In other words, this is a longer path to cleanup by the GC. So the question now is: what is waiting to be finalized by these finalizers?

A likely candidate is Suspect 2, sun.security.ssl.SSLSocketImpl, which occupies 20% of the memory. Can we confirm that these are the instances waiting to be cleared by the finalizers?

Suspect 2

Now, let us open the Dominator view, which is under the tool button at the top of MAT. It lists, by class name, all the instances that MAT parsed from the heap dump.

Next, in the Dominator view, we will try to understand the relationship between java.lang.ref.Finalizer and sun.security.ssl.SSLSocketImpl. Right-click the sun.security.ssl.SSLSocketImpl row and choose Path to GC Roots -> exclude soft/weak references.

MAT then starts calculating the memory graph to show the paths to the GC roots through which this instance is referenced. The result opens on another page, showing the reference chain.

As the reference chain shows, the SSLSocketImpl instance is held by a reference from java.lang.ref.Finalizer, which by itself accounts for about 88k of retained heap at that level. We can also see that the finalizer chain is a linked list data structure with next pointers.

INFERENCE: At this point, we have a clear hint that the Java finalizer is trying to collect SSLSocketImpl objects. To explain why so many of them have not been collected, we start reviewing the code.

Inspect code

Code inspection is needed at this point to see whether sockets and I/O streams are closed in finally clauses. In this case, it revealed that all I/O-related streams were, in fact, correctly closed. That made us suspect that the JVM itself was the culprit. And, in fact, that was the case: there was a bug in the GC code of OpenJDK 6.0.XX.
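For reference, this is the kind of pattern such a code inspection looks for. The sketch below uses a made-up `TrackedResource` as a stand-in for a socket or stream (it is not the code from the application discussed here) and shows two ways of guaranteeing that `close()` runs even when the I/O operation fails: an explicit finally clause, and try-with-resources, which is available from Java 7 onward.

```java
import java.io.Closeable;
import java.io.IOException;

public class CloseCheckSketch {

    // Stand-in for a socket or stream; records whether close() was called.
    static class TrackedResource implements Closeable {
        boolean closed = false;

        void use(boolean fail) throws IOException {
            if (fail) {
                throw new IOException("simulated I/O failure");
            }
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    // Classic pattern: close in a finally clause.
    static boolean closedWithFinally(boolean fail) {
        TrackedResource resource = new TrackedResource();
        try {
            resource.use(fail);
        } catch (IOException e) {
            // a real application would log or rethrow here
        } finally {
            resource.close();
        }
        return resource.closed;
    }

    // Java 7+ pattern: try-with-resources closes the resource automatically.
    static boolean closedWithTryWithResources(boolean fail) {
        TrackedResource resource = new TrackedResource();
        try (TrackedResource r = resource) {
            r.use(fail);
        } catch (IOException e) {
            // a real application would log or rethrow here
        }
        return resource.closed;
    }

    public static void main(String[] args) {
        System.out.println(closedWithFinally(true));          // prints true
        System.out.println(closedWithTryWithResources(true)); // prints true
    }
}
```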

I hope this article gives you a model to analyze heap dumps and infer root causes in Java applications. Happy heap analysis!

Extra Reading

Shallow vs. Retained Heap

Shallow heap is the memory consumed by one object. An object needs 32 or 64 bits (depending on the OS architecture) per reference, 4 bytes per Integer, 8 bytes per Long, etc. Depending on the heap dump format, the size may be adjusted (e.g. aligned to 8, etc.) to better model the real consumption of the VM.

A retained set of X is the set of objects that would be removed by GC when X is garbage collected.

A retained heap of X is the sum of shallow sizes of all objects in the retained set of X, i.e. memory kept alive by X.

Generally speaking, the shallow heap of an object is its size in the heap. The retained size of the same object is the amount of heap memory that will be freed when the object is garbage collected.

The retained set for a leading set of objects, such as all objects of a particular class or all objects of all classes loaded by a particular class loader or simply a bunch of arbitrary objects, is the set of objects that is released if all objects of that leading set become inaccessible. The retained set includes these objects as well as all other objects only accessible through these objects. The retained size is the total heap size of all objects contained in the retained set.
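The definitions above can be made concrete with a toy object graph. The sketch below is an illustration only, not how MAT computes these values internally (the class `RetainedHeapSketch`, the object names, and the shallow sizes are all invented): it computes the retained set of X as the objects that are reachable from the GC roots now but would become unreachable if X went away, and sums their shallow sizes to get the retained heap.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class RetainedHeapSketch {

    // Objects reachable from the roots, pretending that `removed` no longer exists.
    static Set<String> reachable(Map<String, List<String>> refs, List<String> roots, String removed) {
        Set<String> seen = new HashSet<>();
        Deque<String> pending = new ArrayDeque<>(roots);
        while (!pending.isEmpty()) {
            String obj = pending.pop();
            if (obj.equals(removed) || !seen.add(obj)) {
                continue; // skip the removed object and anything already visited
            }
            pending.addAll(refs.getOrDefault(obj, Collections.<String>emptyList()));
        }
        return seen;
    }

    // Retained heap of x: sum of shallow sizes of everything that dies together with x.
    static long retainedSize(Map<String, List<String>> refs, Map<String, Long> shallow,
                             List<String> roots, String x) {
        Set<String> before = reachable(refs, roots, null);
        Set<String> after = reachable(refs, roots, x);
        long total = 0;
        for (String obj : before) {
            if (!after.contains(obj)) {
                total += shallow.get(obj);
            }
        }
        return total;
    }

    // Toy graph: root -> A, root -> D, A -> B, A -> C, D -> C.
    static long demoRetainedSizeOfA() {
        Map<String, List<String>> refs = new HashMap<>();
        refs.put("root", Arrays.asList("A", "D"));
        refs.put("A", Arrays.asList("B", "C"));
        refs.put("D", Arrays.asList("C"));

        Map<String, Long> shallow = new HashMap<>();
        shallow.put("root", 16L);
        shallow.put("A", 16L);
        shallow.put("B", 24L);
        shallow.put("C", 32L);
        shallow.put("D", 16L);

        return retainedSize(refs, shallow, Arrays.asList("root"), "A");
    }

    public static void main(String[] args) {
        // A retains itself (16) and B (24); C survives because D still references it.
        System.out.println(demoRetainedSizeOfA()); // prints 40
    }
}
```

In this graph the shallow heap of A is 16 bytes, while its retained heap is 40 bytes, because B is only reachable through A. C is not part of A's retained set, since it stays alive through D.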


via:oschina
