Project background: RuoyiVuePlus (v5.3.1) + Snail-Job (v1.4.0) + Spring Boot Admin (v3.4.5)
Situation: deployed via Docker. After running normally for a while, scheduled jobs suddenly stopped executing. The logs showed an out-of-memory error; combining the Snail-Job logs with a consultation of Google Gemini, the problem appeared to be gRPC-related.
Spring Boot Admin itself started normally, but once memory ran out, the client could no longer connect to it.
2026-02-04 22:03:48 [registrationTask1] ERROR o.s.s.s.TaskUtils$LoggingErrorHandler - Unexpected error occurred in scheduled task
java.lang.OutOfMemoryError: Java heap space
at de.codecentric.boot.admin.client.registration.Application.builder(Application.java:44)
at de.codecentric.boot.admin.client.registration.Application.create(Application.java:57)
at de.codecentric.boot.admin.client.registration.DefaultApplicationFactory.createApplication(DefaultApplicationFactory.java:79)
at de.codecentric.boot.admin.client.registration.DefaultApplicationRegistrator.register(DefaultApplicationRegistrator.java:56)
at de.codecentric.boot.admin.client.registration.RegistrationApplicationListener$$Lambda$1675/0x00007ff2509d0238.run(Unknown Source)
at org.springframework.scheduling.support.DelegatingErrorHandlingRunnable.run(DelegatingErrorHandlingRunnable.java:54)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:840)
Snail-Job related logs:
2026-02-05 14:06:18 [server-node-balance] ERROR c.a.s.s.c.handler.ServerNodeBalance - server node is empty
2026-02-05 14:06:40 [server-node-balance] ERROR c.a.s.s.c.handler.ServerNodeBalance - server node is empty
2026-02-05 14:07:02 [server-node-balance] ERROR c.a.s.s.c.handler.ServerNodeBalance - server node is empty
2026-02-05 14:07:22 [server-node-balance] ERROR c.a.s.s.c.handler.ServerNodeBalance - server node is empty
2026-02-05 14:07:42 [server-node-balance] ERROR c.a.s.s.c.handler.ServerNodeBalance - server node is empty
2026-02-05 14:08:02 [server-node-balance] ERROR c.a.s.s.c.handler.ServerNodeBalance - server node is empty
2026-02-05 14:08:22 [server-node-balance] ERROR c.a.s.s.c.handler.ServerNodeBalance - server node is empty
2026-02-05 14:08:43 [server-node-balance] ERROR c.a.s.s.c.handler.ServerNodeBalance - server node is empty
2026-02-05 14:09:31 [JOB_ACTOR_SYSTEM-pekko.actor.job-task-executor-call-client-dispatcher-16] ERROR i.g.i.ManagedChannelOrphanWrapper - *~*~*~ Previous channel ManagedChannelImpl{logId=7, target=10.0.1.79:27040} was garbage collected without being shut down! ~*~*~*
Make sure to call shutdown()/shutdownNow()
java.lang.RuntimeException: ManagedChannel allocation site
at io.grpc.internal.ManagedChannelOrphanWrapper$ManagedChannelReference.<init>(ManagedChannelOrphanWrapper.java:102)
at io.grpc.internal.ManagedChannelOrphanWrapper.<init>(ManagedChannelOrphanWrapper.java:60)
at io.grpc.internal.ManagedChannelOrphanWrapper.<init>(ManagedChannelOrphanWrapper.java:51)
at io.grpc.internal.ManagedChannelImplBuilder.build(ManagedChannelImplBuilder.java:709)
at io.grpc.ForwardingChannelBuilder2.build(ForwardingChannelBuilder2.java:272)
at com.aizuda.snailjob.server.common.rpc.client.GrpcChannel.connect(GrpcChannel.java:140)
at com.aizuda.snailjob.server.common.rpc.client.GrpcChannel.send(GrpcChannel.java:74)
at com.aizuda.snailjob.server.common.rpc.client.GrpcClientInvokeHandler.lambda$requestRemote$3(GrpcClientInvokeHandler.java:167)
at com.github.rholder.retry.AttemptTimeLimiters$NoAttemptTimeLimit.call(AttemptTimeLimiters.java:78)
at com.github.rholder.retry.Retryer.call(Retryer.java:160)
at com.aizuda.snailjob.server.common.rpc.client.GrpcClientInvokeHandler.requestRemote(GrpcClientInvokeHandler.java:159)
at com.aizuda.snailjob.server.common.rpc.client.GrpcClientInvokeHandler.invoke(GrpcClientInvokeHandler.java:109)
at com.aizuda.snailjob.server.common.rpc.client.GrpcClientInvokeHandler.invoke(GrpcClientInvokeHandler.java:60)
at jdk.proxy2/jdk.proxy2.$Proxy160.dispatch(Unknown Source)
at com.aizuda.snailjob.server.job.task.support.executor.job.RequestClientActor.doExecute(RequestClientActor.java:89)
at com.aizuda.snailjob.server.job.task.support.executor.job.RequestClientActor.lambda$createReceive$0(RequestClientActor.java:53)
at org.apache.pekko.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:33)
at org.apache.pekko.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:29)
at scala.PartialFunction.applyOrElse(PartialFunction.scala:214)
at scala.PartialFunction.applyOrElse$(PartialFunction.scala:213)
at org.apache.pekko.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:29)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:269)
at org.apache.pekko.actor.Actor.aroundReceive(Actor.scala:547)
at org.apache.pekko.actor.Actor.aroundReceive$(Actor.scala:545)
at org.apache.pekko.actor.AbstractActor.aroundReceive(AbstractActor.scala:229)
at org.apache.pekko.actor.ActorCell.receiveMessage(ActorCell.scala:590)
at org.apache.pekko.actor.ActorCell.invoke(ActorCell.scala:557)
at org.apache.pekko.dispatch.Mailbox.processMailbox(Mailbox.scala:273)
at org.apache.pekko.dispatch.Mailbox.run(Mailbox.scala:234)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:840)
2026-02-05 14:09:31 [JOB_ACTOR_SYSTEM-pekko.actor.job-task-executor-call-client-dispatcher-16] ERROR i.g.i.ManagedChannelOrphanWrapper - *~*~*~ Previous channel ManagedChannelImpl{logId=8, target=10.0.1.79:27040} was garbage collected without being shut down! ~*~*~*
Make sure to call shutdown()/shutdownNow()
java.lang.RuntimeException: ManagedChannel allocation site
        ... (stack trace identical to the one above, omitted)
Gemini's analysis:
Judging from the latest logs you provided, the situation is reversed: the culprit behind the out-of-memory error (OOM) is likely not Spring Boot Admin, but Snail-Job's gRPC communication layer.
The ManagedChannelOrphanWrapper errors in the log are a classic symptom of a gRPC channel leak.
Root cause analysis
1. gRPC connections (ManagedChannel) are never shut down
The log says it explicitly: Previous channel ... was garbage collected without being shut down!
A gRPC ManagedChannel is a heavyweight object (it owns thread pools and HTTP/2 connections). During Snail-Job's RPC calls, the code creates channels in large numbers, but shutdown() is never called on them before they become eligible for GC.
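For reference, the expected lifecycle looks like this. A minimal sketch, not Snail-Job code; the target address is simply the one from the logs above:

import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import java.util.concurrent.TimeUnit;

public class ChannelLifecycleSketch {
    public static void main(String[] args) throws InterruptedException {
        ManagedChannel channel = ManagedChannelBuilder
                .forAddress("10.0.1.79", 27040) // address taken from the logs above
                .usePlaintext()
                .build();
        try {
            // ... issue RPC calls through stubs bound to this channel ...
        } finally {
            // Without an explicit shutdown, gRPC's leak detector prints the
            // ManagedChannelOrphanWrapper warning once the object is GC'd.
            channel.shutdown();
            if (!channel.awaitTermination(2, TimeUnit.SECONDS)) {
                channel.shutdownNow(); // force-close if graceful shutdown stalls
            }
        }
    }
}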
2. The retry mechanism (Retryer) amplifies the leak
Note com.github.rholder.retry.Retryer.call in the stack trace.
Because SBA or the network environment is unstable, Snail-Job keeps trying to send data to the remote node over gRPC. On every retry or failure it may create a new GrpcChannel, while the old channels pile up in memory; a schematic sketch of this pattern follows.
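To make the amplification concrete, here is a schematic sketch of the leak pattern. This is not Snail-Job's actual code; the retry settings and target address are made up for illustration:

import com.github.rholder.retry.Retryer;
import com.github.rholder.retry.RetryerBuilder;
import com.github.rholder.retry.StopStrategies;
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

public class RetryLeakSketch {
    public static void main(String[] args) throws Exception {
        // Retry on any exception, like the Retryer seen in the stack trace
        Retryer<Void> retryer = RetryerBuilder.<Void>newBuilder()
                .retryIfException()
                .withStopStrategy(StopStrategies.stopAfterAttempt(3))
                .build();
        retryer.call(() -> {
            // Each attempt builds a brand-new channel...
            ManagedChannel channel = ManagedChannelBuilder
                    .forAddress("10.0.1.79", 27040)
                    .usePlaintext()
                    .build();
            // ...and when the call fails, the reference is simply dropped
            // without shutdown(): one orphaned channel per failed attempt.
            throw new IllegalStateException("server node is empty");
        });
    }
}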
3. A vicious cycle of threads and memory
server node is empty: Snail-Job cannot find any available server node, so job scheduling fails again and again.
ManagedChannel pile-up: every unclosed ManagedChannel holds a non-trivial amount of memory, and because these connections sit in a "connecting" or error state, the resources behind them are never released.
OOM: as the heap fills with these orphaned channels, the JVM runs Full GC after Full GC without being able to reclaim the objects and their sizable native resources in time, and Java heap space is eventually exhausted.
💡 Why did it look like a Spring Boot Admin problem?
Because SBA registration failures and Snail-Job RPC failures usually appear together:
Low-level network jitter: keeps SBA from connecting and, at the same time, keeps Snail-Job from communicating over gRPC.
Resource contention: Snail-Job's gRPC leak exhausts the heap, so SBA's registrationTask throws OOM while allocating memory to build its HTTP registration request. SBA is merely the unlucky messenger; Snail-Job is the real memory killer.
Suggested fixes
1. Fixes on the Snail-Job side (priority: high)
Inspect GrpcChannel.java: the logs show the leak originates in com.aizuda.snailjob.server.common.rpc.client.GrpcChannel.connect(GrpcChannel.java:140). Check whether this version has a known bug, or whether the channel caching/pooling is misconfigured.
Manage the lifecycle explicitly: make sure the channel's shutdown method is called when an RPC call completes or fails.
After reading the GrpcChannel source, one can say with high confidence: this code has an obvious ManagedChannel leak.
Combined with the ManagedChannelOrphanWrapper errors in the log (channels garbage collected before being shut down), here is the concrete chain that leads to the OOM:
1. Core bug analysis
A. Channels are created but never destroyed: no shutdown logic
In the connect method:
public static ManagedChannel connect(String hostId, String ip, Integer port) {
    // ...
    ManagedChannel channel = ManagedChannelBuilder.forAddress(ip, port)
            .build();
    GrpcChannel.setChannel(hostId, ip, channel); // cache the channel in CHANNEL_MAP
    return channel;
}
And in the removeChannel method:
public static void removeChannel(ManagedChannel channel) {
    CHANNEL_MAP.forEach((key, value) -> {
        if (value.equals(channel)) {
            CHANNEL_MAP.remove(key); // only removes the reference from the cache map
        }
    });
}
The problem: when a channel is deemed unusable (isShutdown, or due for replacement) and gets removed from CHANNEL_MAP, the code never calls channel.shutdown().
A gRPC ManagedChannel owns underlying Netty threads and TCP connections. Deleting it from the map does not stop it; it keeps running in the background. When the now-unreferenced object is eventually garbage collected, gRPC's leak detector notices the channel was never shut down and emits the ManagedChannelOrphanWrapper warning you are seeing.
B. Unbounded retries create a "channel creation storm"
Combined with your server node is empty logs:
1. Snail-Job sees no available server node and enters its retry loop.
2. During the retry it calls send, and send finds no usable channel.
3. It calls connect to create a new channel.
4. On the next pass, if the previous channel is in a bad state, it calls removeChannel (note: the old connection is not closed) and then connect again, creating yet another channel.
Looping like this, hundreds or thousands of "zombie" connections pile up in memory until heap space is exhausted, at which point the SBA (Spring Boot Admin) scheduled task reports OOM as collateral damage.
2. The fix: how to change the code
You need to modify GrpcChannel.java so that a channel is forcibly shut down whenever it is removed.
Change 1: harden removeChannel
public static void removeChannel(ManagedChannel channel) {
    if (Objects.isNull(channel)) return;
    CHANNEL_MAP.entrySet().removeIf(entry -> {
        if (entry.getValue().equals(channel)) {
            // Core change: the channel must be shut down explicitly!
            shutdownChannel(entry.getValue());
            return true;
        }
        return false;
    });
}

// Extracted shutdown helper
private static void shutdownChannel(ManagedChannel channel) {
    try {
        if (channel != null && !channel.isShutdown()) {
            channel.shutdownNow(); // initiate an immediate shutdown
            // Optional: wait briefly to make sure it fully terminates
            // channel.awaitTermination(2, TimeUnit.SECONDS);
        }
    } catch (Exception e) {
        log.error("Shutdown grpc channel error", e);
    }
}
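Design note: shutdown() initiates a graceful close that lets in-flight RPCs finish, while shutdownNow() cancels them immediately. For channels that are already considered dead or stale, as here, shutdownNow() is the reasonable default; pair it with awaitTermination if you need to confirm the close actually completed.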
Change 2: fix the logic in send
In the current send method the code calls removeChannel(channel), but at that point channel may already be null.
public static ListenableFuture<GrpcResult> send(...) {
    ManagedChannel channel = CHANNEL_MAP.get(Pair.of(hostId, hostIp));
    if (Objects.isNull(channel) || channel.isShutdown() || channel.isTerminated()) {
        // Change: if an old channel exists but is unusable, shut it down first
        if (Objects.nonNull(channel)) {
            shutdownChannel(channel);
        }
        removeChannel(channel); // then drop it from the map
        channel = connect(hostId, hostIp, port);
        // ...
    }
}
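One further hardening idea of my own, beyond what the logs strictly demand: the check-then-act sequence in send is not atomic, so two threads can still race and each build a channel for the same host. A sketch of an atomic get-or-create, assuming CHANNEL_MAP is a ConcurrentHashMap and Pair is the key type already used in send; getOrCreateChannel is a hypothetical helper:

// Hypothetical helper, assuming
// ConcurrentHashMap<Pair<String, String>, ManagedChannel> CHANNEL_MAP.
private static ManagedChannel getOrCreateChannel(String hostId, String hostIp, Integer port) {
    // compute() runs atomically per key on a ConcurrentHashMap, so only one
    // thread can replace a stale channel for a given host at a time.
    return CHANNEL_MAP.compute(Pair.of(hostId, hostIp), (key, existing) -> {
        if (existing != null && !existing.isShutdown() && !existing.isTerminated()) {
            return existing; // still healthy: reuse it
        }
        if (existing != null) {
            shutdownChannel(existing); // close the stale channel before replacing it
        }
        return ManagedChannelBuilder.forAddress(hostIp, port).build();
    });
}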
3. Why does SBA error out too?
When gRPC ManagedChannels are created nonstop without being closed, they consume large amounts of:
Heap memory: for the channel objects themselves.
Direct memory: used by Netty for network I/O.
Threads: each channel may bring a few threads of its own by default.
SBA's registrationTask1 is an ordinary task that runs every 10 seconds. Once Snail-Job has drained the system's resources, that task can no longer allocate memory while executing Application.create, so it throws the OOM.
Summary of recommendations
Fix the source: patch the shutdown logic in GrpcChannel.java as described above.
Throttle SBA: raise SBA's registration period to 10m (600000 ms) so it stops competing for resources; see the config sketch below.
Treat the root cause: find out why server node is empty in the first place; that condition is the switch that triggers the "connection creation storm".
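For the SBA throttling step, a minimal application.yml sketch, assuming the standard Spring Boot Admin client property spring.boot.admin.client.period (in milliseconds); verify the key against your SBA version:

spring:
  boot:
    admin:
      client:
        period: 600000  # register every 10 minutes instead of the 10-second default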