title: 遇见连接超时 author: Gamehu tags: - SQL categories: - 工作 date: 2024-09-17 19:51:00 ---
离职系列 第九篇
离职系列,想想这几年在公司的成长,在这做个记录。这篇是遇到的一个比较典型的线上问题。
### 问题现象 写了一个每天执行两次的定时任务,该任务会分批对线上所有几百个租户生成《平安通告》,上线1个多月后突然手机收到告警,某几个用户生成失败。 ``` 2024-10-17 17:00:10,125 [safetyNoticeGenerator8] ERROR [com.alibaba.druid.pool.DruidDataSource] DruidDataSource.java:1988 - {conn-110021} discard org.postgresql.util.PSQLException: An I/O error occurred while sending to the backend. at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:395) at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:498) at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:415) at org.postgresql.jdbc.PgPreparedStatement.executeWithFlags(PgPreparedStatement.java:190) at org.postgresql.jdbc.PgPreparedStatement.execute(PgPreparedStatement.java:177) at com.alibaba.druid.pool.DruidPooledPreparedStatement.execute(DruidPooledPreparedStatement.java:483) at org.apache.shardingsphere.driver.jdbc.core.statement.ShardingSpherePreparedStatement$2.executeSQL(ShardingSpherePreparedStatement.java:439) at org.apache.shardingsphere.driver.jdbc.core.statement.ShardingSpherePreparedStatement$2.executeSQL(ShardingSpherePreparedStatement.java:435) at org.apache.shardingsphere.infra.executor.sql.execute.engine.driver.jdbc.JDBCExecutorCallback.execute(JDBCExecutorCallback.java:95) at org.apache.shardingsphere.infra.executor.sql.execute.engine.driver.jdbc.JDBCExecutorCallback.execute(JDBCExecutorCallback.java:75) at org.apache.shardingsphere.infra.executor.kernel.ExecutorEngine.syncExecute(ExecutorEngine.java:135) at org.apache.shardingsphere.infra.executor.kernel.ExecutorEngine.serialExecute(ExecutorEngine.java:121) at org.apache.shardingsphere.infra.executor.kernel.ExecutorEngine.execute(ExecutorEngine.java:115) at org.apache.shardingsphere.infra.executor.sql.execute.engine.driver.jdbc.JDBCExecutor.execute(JDBCExecutor.java:65) at org.apache.shardingsphere.infra.executor.sql.execute.engine.driver.jdbc.JDBCExecutor.execute(JDBCExecutor.java:49) at org.apache.shardingsphere.driver.executor.DriverJDBCExecutor.doExecute(DriverJDBCExecutor.java:156) at org.apache.shardingsphere.driver.executor.DriverJDBCExecutor.execute(DriverJDBCExecutor.java:145) at org.apache.shardingsphere.driver.jdbc.core.statement.ShardingSpherePreparedStatement.execute(ShardingSpherePreparedStatement.java:402) at com.zaxxer.hikari.pool.ProxyPreparedStatement.execute(ProxyPreparedStatement.java:44) at com.zaxxer.hikari.pool.HikariProxyPreparedStatement.execute(HikariProxyPreparedStatement.java) at org.apache.ibatis.executor.statement.PreparedStatementHandler.query(PreparedStatementHandler.java:65) at org.apache.ibatis.executor.statement.RoutingStatementHandler.query(RoutingStatementHandler.java:80) at jdk.internal.reflect.GeneratedMethodAccessor49.invoke(Unknown Source) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base/java.lang.reflect.Method.invoke(Method.java:568) at org.apache.ibatis.plugin.Plugin.invoke(Plugin.java:61) at jdk.proxy2/jdk.proxy2.$Proxy207.query(Unknown Source) at org.apache.ibatis.executor.SimpleExecutor.doQuery(SimpleExecutor.java:65) at org.apache.ibatis.executor.BaseExecutor.queryFromDatabase(BaseExecutor.java:333) at org.apache.ibatis.executor.BaseExecutor.query(BaseExecutor.java:158) at org.apache.ibatis.executor.CachingExecutor.query(CachingExecutor.java:110) at com.baomidou.mybatisplus.extension.plugins.MybatisPlusInterceptor.intercept(MybatisPlusInterceptor.java:81) at org.apache.ibatis.plugin.Plugin.invoke(Plugin.java:59) at jdk.proxy2/jdk.proxy2.$Proxy206.query(Unknown Source) at jdk.internal.reflect.GeneratedMethodAccessor44.invoke(Unknown Source) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base/java.lang.reflect.Method.invoke(Method.java:568) at org.apache.ibatis.plugin.Invocation.proceed(Invocation.java:49) at com.github.yulichang.interceptor.MPJInterceptor.intercept(MPJInterceptor.java:76) at org.apache.ibatis.plugin.Plugin.invoke(Plugin.java:59) at jdk.proxy2/jdk.proxy2.$Proxy206.query(Unknown Source) at org.apache.ibatis.session.defaults.DefaultSqlSession.selectList(DefaultSqlSession.java:154) at org.apache.ibatis.session.defaults.DefaultSqlSession.selectList(DefaultSqlSession.java:147) at org.apache.ibatis.session.defaults.DefaultSqlSession.selectList(DefaultSqlSession.java:142) at jdk.internal.reflect.GeneratedMethodAccessor109.invoke(Unknown Source) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base/java.lang.reflect.Method.invoke(Method.java:568) at org.mybatis.spring.SqlSessionTemplate$SqlSessionInterceptor.invoke(SqlSessionTemplate.java:425) at jdk.proxy2/jdk.proxy2.$Proxy189.selectList(Unknown Source) at org.mybatis.spring.SqlSessionTemplate.selectList(SqlSessionTemplate.java:224) at com.baomidou.mybatisplus.core.override.MybatisMapperMethod.executeForMany(MybatisMapperMethod.java:166) at com.baomidou.mybatisplus.core.override.MybatisMapperMethod.execute(MybatisMapperMethod.java:77) at com.baomidou.mybatisplus.core.override.MybatisMapperProxy$PlainMethodInvoker.invoke(MybatisMapperProxy.java:152) at com.baomidou.mybatisplus.core.override.MybatisMapperProxy.invoke(MybatisMapperProxy.java:89) at jdk.proxy2/jdk.proxy2.$Proxy197.findDistinctCiIdsByIdentityId(Unknown Source) at com.xxx.xxxcloud.manage.service.safety.notice.impl.xxServiceImpl.addAlertData(xxServiceImpl.java:612) at com.xxx.xxxcloud.manage.service.safety.notice.impl.xxServiceImpl.lambda$setJsonField$5(xxServiceImpl.java:368) at java.base/java.util.ArrayList.forEach(ArrayList.java:1511) at com.xxx.xxxcloud.manage.service.safety.notice.impl.xxServiceImpl.setJsonField(xxServiceImpl.java:346) at com.xxx.xxxcloud.manage.service.safety.notice.impl.xxServiceImpl.batchCreate(xxServiceImpl.java:201) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base/java.lang.reflect.Method.invoke(Method.java:568) at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:343) at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:196) at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163) at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:751) at org.springframework.transaction.interceptor.TransactionInterceptor$1.proceedWithInvocation(TransactionInterceptor.java:117) at org.springframework.transaction.interceptor.TransactionAspectSupport.invokeWithinTransaction(TransactionAspectSupport.java:391) at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:119) at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:184) at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:751) at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:703) at com.xxx.xxxcloud.manage.service.safety.notice.impl.xxServiceImpl$$SpringCGLIB$$0.batchCreate() at com.xxx.xxxcloud.manage.service.safety.notice.impl.SafetyNoticeScheduleServiceImpl.processBatch(SafetyNoticeScheduleServiceImpl.java:194) at com.xxx.xxxcloud.manage.service.safety.notice.impl.SafetyNoticeScheduleServiceImpl.lambda$processBatchesInThreadPool$0(SafetyNoticeScheduleServiceImpl.java:112) at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) at java.base/java.lang.Thread.run(Thread.java:840) Caused by: java.net.SocketTimeoutException: Read timed out at java.base/sun.nio.ch.NioSocketImpl.timedRead(NioSocketImpl.java:288) at java.base/sun.nio.ch.NioSocketImpl.implRead(NioSocketImpl.java:314) at java.base/sun.nio.ch.NioSocketImpl.read(NioSocketImpl.java:355) at java.base/sun.nio.ch.NioSocketImpl$1.read(NioSocketImpl.java:808) at java.base/java.net.Socket$SocketInputStream.read(Socket.java:966) at java.base/sun.security.ssl.SSLSocketInputRecord.read(SSLSocketInputRecord.java:484) at java.base/sun.security.ssl.SSLSocketInputRecord.readHeader(SSLSocketInputRecord.java:478) at java.base/sun.security.ssl.SSLSocketInputRecord.bytesInCompletePacket(SSLSocketInputRecord.java:70) at java.base/sun.security.ssl.SSLSocketImpl.readApplicationRecord(SSLSocketImpl.java:1465) at java.base/sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:1069) at org.postgresql.core.VisibleBufferedInputStream.readMore(VisibleBufferedInputStream.java:161) at org.postgresql.core.VisibleBufferedInputStream.ensureBytes(VisibleBufferedInputStream.java:128) at org.postgresql.core.VisibleBufferedInputStream.ensureBytes(VisibleBufferedInputStream.java:113) at org.postgresql.core.VisibleBufferedInputStream.read(VisibleBufferedInputStream.java:73) at org.postgresql.core.PGStream.receiveChar(PGStream.java:465) at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2155) at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:368) ... 82 common frames omitted org.springframework.jdbc.UncategorizedSQLException: ### Error querying database. Cause: java.sql.SQLException: Connection is closed ### The error may exist in class path resource [mapper/TbxxHealthCheckResultMapper.xml] ### The error may involve com.xxx.xxxcloud.manage.mapper.xx.TbxxHealthCheckResultMapper.selectExecuteLatest ### The error occurred while executing a query ### Cause: java.sql.SQLException: Connection is closed ; uncategorized SQLException; SQL state [null]; error code [0]; Connection is closed at org.mybatis.spring.MyBatisExceptionTranslator.translateExceptionIfPossible(MyBatisExceptionTranslator.java:93) at org.mybatis.spring.SqlSessionTemplate$SqlSessionInterceptor.invoke(SqlSessionTemplate.java:439) at java.base/java.lang.Thread.run(Thread.java:840) Caused by: java.sql.SQLException: Connection is closed at com.zaxxer.hikari.pool.ProxyConnection$ClosedConnection.lambda$getClosedConnection$0(ProxyConnection.java:502) at jdk.proxy3/jdk.proxy3.$Proxy173.prepareStatement(Unknown Source) at com.zaxxer.hikari.pool.ProxyConnection.prepareStatement(ProxyConnection.java:327) at com.zaxxer.hikari.pool.HikariProxyConnection.prepareStatement(HikariProxyConnection.java) ``` 从上面有几个关键信息梳理下异常调用链: {% asset_img error.jpg %} 得出几个重要信息: 1. 异常发生的业务代码是在批量处理"xx报告"时,查询告警处 2. ShardingSphere下的连接池hikari抛出了异常Connection is closed 3. Druid连接池也抛出了异常{conn-xxx} discard 4. 底层SocketTimeoutException,表明是客户端等待数据库服务器的响应超时(初步判断为慢sql) 所以我开始从以下几个方面排查: 1. 查业务代码的变更,为什么之前好好的跑了一个多月,突然出问题。 2. 检查数据库连接参数设置 3. 评估查询的数据量是否过大 4. 看下HikariCP和Druid是否有啥联系以及为啥会有两个连接池? 5. 查看数据库负载情况 ### 处理过程 #### 第一步:摸索 怀疑一切,切忌先入为主。 1. 想办法重现问题 1. 提前发通知,下午7点以后会操作线上系统。通常6点半以后几乎就没人用系统了。 2. 因为设计该功能时,留了补偿手动,可手动重新触发报告生成。 3. 反复重试了几次,问题未复现。 2. 查看业务代码变更。 1. 报错处业务代码owner是我,没做任何更改,所以变成重点关注addAlertData方法,也就是与告警相关(重点在数据量) 2. 业务没变更但是加入了**ShardingSphere**(异常中也有这块的信息,先存疑) 3. 检查系统连接池参数 1. druid,几乎都没有做定制,都是使用的默认值。 1. 使用默认值其实存在风险,应该根据业务调整一些参数,因为买的阿里的pg所以咨询了他们拿到了一份他们暴露的参数调优参考,后续可针对性修改) 2. 新增的ShardingSphere的HikariCP也是使用的默认值。(异常中也有这块的信息,先存疑) 4. 观察显示的数据库负载情况(阿里云的监控看板、pg的pg_stat_activity等视图、数据库日志等) 1. 从视图发现确实存在执行时间较长的几条sqlsql,虽然有些慢但是不足以触发异常。慢的原因初步判断为数据量过大(报错的租户行数都在50w左右) 2. 记录下这个sql,拿去控制台执行,EXPLAIN ANALYZE该sql,发现Seq Scan除了时间较长以外还存在索引问题,几乎每次查询都要用到的“告警状态”字段之前没加索引。(可能是原因之一) 3. 查看数据库日志如下,这表明数据库和客户端的连接中断确实是问题的根源,可能是因为网络问题、数据库负载过高、或者连接超时等因素导致的。 1. {%asset_img sql.png%} #### 第二步:验证 验证怀疑的所有点,通常控制变量法进行验证。 因为暂时没有复现,不着急先按兵不动。 1. 前提是要有补偿方案,不能阻断线上使用,特别是这种核心业务。因为当时留了后门可以手动触发某个租户所以没问题。 2. 虽然是线上故障同时也算严重bug,但是也不要着急,胡乱改一通,可能按下葫芦又起瓢,尽可能的找到根因,哪怕不能一次性修复。 等待了两天发现同样的问题又发生了。 1. 但是这次发现了上一次遗漏的一个信息,报错的租户从日志看时间,异常都是在10s以后抛出的。(初步判断是SocketTimeout的超时时间,可能是10s) 2. 拿到同样报错的sql去执行发现虽然跟之前没啥差别,问题还是那些问题,也没超过10s。(怀疑是因为报错时候执行了sql,触发了pg的缓存,所以再去查缓存生效,导致执行时间变短) 验证第一个问题: 因为当前看存在有两个连接池,查HikariCP与Druid的源码: ``` // HikariCP只有connectionTimeout为30s static { CONNECTION_TIMEOUT = TimeUnit.SECONDS.toMillis(30L); VALIDATION_TIMEOUT = TimeUnit.SECONDS.toMillis(5L); SOFT_TIMEOUT_FLOOR = Long.getLong("com.zaxxer.hikari.timeoutMs.floor", 250L); IDLE_TIMEOUT = TimeUnit.MINUTES.toMillis(10L); MAX_LIFETIME = TimeUnit.MINUTES.toMillis(30L); unitTest = false; } // DruidDataSource初始化,socketTimeout是10s public void init() throws SQLException { if (!this.inited) { DruidDriver.getInstance(); ReentrantLock lock = this.lock; try { lock.lockInterruptibly(); } catch (InterruptedException e) { throw new SQLException("interrupt", e); } boolean init = false; if (this.connectTimeout == 0) { this.connectTimeout = 10000; } if (this.socketTimeout == 0) { this.socketTimeout = 10000; } ``` 先不管HikariCP,至少能确定Druid的10s超时确实存在,因为阿里云上我开起了pg的慢sql监控,根据时间刚好查到了确实存在一条告警sql查询执行时间为10s+。 1. 所以初步判断SocketTimeoutException是因为sql执行时间超过了连接池的默认超时。 2. 去查了下该租户下告警的数据总数已经超过了50W,在活5000左右。 #### 第三步:修复方案 方案如下: 1. 加索引,对常用的字段“告警状态”添加索引。 2. 对在活告警查询的地方分批 3. 对在活主告警限制查询的条数而不是查所有。 #### 第四步:方案执行 1. 对所有配置分表的表,检查索引,添加必要的索引。 1. ``` -- ======================================== -- 描述: 创建告警表的“status”字段索引 -- 文件名: 001_create_alert_status_index.sql -- 作者: hht -- 创建日期: 2024-10-31 -- ======================================== DO $$ DECLARE i INTEGER; BEGIN -- 遍历 tb_xx_alert_0 ~ tb_xx_alert_15 FOR i IN 0..15 LOOP -- 动态生成 ALTER TABLE 语句,添加字段 EXECUTE FORMAT(' ALTER TABLE public.tb_xx_alert_%s CREATE INDEX tb_xx_alert_%s_status_index ON tb_xx_alert_%s (status); ', i); END LOOP; END $$; ``` 2. 告警分批且限制条数 1. {% codeblock %} private List findSubscribedAlerts(AlertSubscribedParam param) { try { List> parts = PartsListUtil.getParts(param.getSubscribedCiIds(), BATCH_SIZE); List all = new ArrayList<>(); for (List part : parts) { // 分批查询 param.setSubscribedCiIds(part); // 主告警不超过100条 Page page = new Page<>(1, 100 - all.size()); IPage result = tbXxAlertMapper.findSubscribedAlerts(page, param); if (result.getTotal() == 0) { continue; } // 聚合告警的子告警数据补充 getAlertAggChildren(param.getIdentityId(), result.getRecords()); all.addAll(result.getRecords()); // 如果已经达到100,退出循环 if (all.size() == 100) { break; } } // 最后统一按createTime降序排序 all.sort((a1, a2) -> Long.compare(a2.getCreateTime(), a1.getCreateTime())); return all; } catch (Exception e) { String errorMessage = String.format("查询关注告警失败: identityId=%s, deal=%s", param.getIdentityId(), param.isDeal()); log.error(errorMessage, param.getIdentityId(), e); throw new SafetyNoticeException(errorMessage, e); } } {% endcodeblock %} #### 第五步:监控效果 1. 上线后,连续一周到点蹲守 1. SELECT pg_stat_reset(); -- 重置所有统计信息 2. 数据库负载看板以及pg的pg_stat_activity。 1. 之前的sql执行时间没有再超过2s 2. 看板上慢sql也没有在发现 3. 数据库日志也没有再出现异常 3. pg_stat_user_indexes+EXPLAIN ANALYZE 对应sql,查看索引使用情况。 2. 业务功能正常。 看上去目前超时的问题暂时解决,但是要想更彻底的解,还需要后续对遗留项逐个解决。 ### 遗留项/改进项 1. 连接池的参数调优,虽然默认的看上去没啥问题,但是迟早肯定会出问题,记一个DFX。 2. 多了一个HikariCP连接池,而且通过jconsole看了下,两个连接池都会初始化,这块是否有必要有两个连接池,这儿存在隐患,对ShardingSphere需要深入了解下,记一个DFX。 3. 历史数据的清理,跟PO提出,需要加一个需求不仅仅是告警,可能还有其他数据。 4. 对于核心且经常更新的表是否需要定时REINDEX