接上面的分析,来下下订阅端的报错日志情况。首先在报错日志里面可以看到如下的信息:
BEA-000628 Created "1" resources for pool "DataSource",
out of which "1" are available and "0" are unavailable.
但是该信息没有在Log日志中标记为Error,而只是Info提示。根据该信息搜索网上相关的文章,还是和连接池中的DataSource连接池满有关系。而连接池满最大的原因就是本身我们在设置连接池的时候设置了最大的连接池数量,当并发量大的时候如果超过了连接池的最大数量,那么一定会触发连接池满而无法从连接池中获取到有效连接。而实际上对于JMS订阅本身不存在这种大并发,长耗时的事务处理。因此更大的可能还是我们的代码程序没有有效的关闭连接并放回到连接池中。
在上面信息还会发现如下Log日志信息:
BEA-001128 Connection for pool "DataSource" has been closed. 提示连接池中的数据库连接已经关闭。从这几个关键信息来看可以发现的还是从DataSource数据库连接池中无法获取到有效的连接,导致事务一直处于等待状态,到了超时时间后分布式事务发送超时并进行事务回滚。
对于Weblogic连接池问题,可以参考下面这篇文章详细描述:
https://www.cnblogs.com/kongzhongqijing/p/4431801.html
对于连接关闭,务必注意是需要在Try-Catch-Finally的Finally中进行,示例代码为:
Connection conn = null;
ResultSet rs = null;
preparedStatement pss = null;
try {
conn = dataSource.getConnection(USERID,pASSWORD);
pss = conn.prepareStatement("select * from table1 ?");
pss.setString(1,sessionKey);
rs = pss.executeQuery();
pss.close();
}
catch (Throwable t) {}
finally {
try
{
if (conn != null) conn.close();
}
catch (Exception e){}
}
而在上述出现了BEA-000628的错误日志后,根据同样的BEA-InstanceID号进行搜索,发现都是在140秒后会报错JMS订阅分布式事务在进行Commit事务提交的时候出现超时的异常。
BEA-010026 Exception occurred during commit of transaction
Timeout during commit processing
对于该异常更加详细的Log日志信息如下:
at weblogic.rjvm.ResponseImpl.unmarshalReturn(ResponseImpl.
at weblogic.rmi.internal.BasicRemoteRef.invoke(BasicRemoteRef.
at weblogic.transaction.internal.CoordinatorImpl_12213_WLStub.commit(Unknown Source)
at weblogic.transaction.internal.TransactionImpl$1.run(TransactionImpl.
at weblogic.security.acl.internal.AuthenticatedSubject.doAs(AuthenticatedSubject.
at weblogic.security.service.SecurityManager.runAs(SecurityManager.
at weblogic.transaction.internal.SecureAction.runAction(SecureAction.
at weblogic.transaction.internal.TransactionImpl.commit(TransactionImpl.
at weblogic.transaction.internal.ServerTransactionImpl.internalCommit(ServerTransactionImpl.
at weblogic.transaction.internal.ServerTransactionImpl.commit(ServerTransactionImpl.
at weblogic.ejb.container.internal.MDListener.execute(MDListener.
at weblogic.ejb.container.internal.MDListener.run(MDListener.
at weblogic.work.SelfTuningWorkManagerImpl$WorkAdapterImpl.run(SelfTuningWorkManagerImpl.
at weblogic.work.ExecuteThread.execute(ExecuteThread.
at weblogic.work.ExecuteThread.run(ExecuteThread.
Caused by: javax.transaction.SystemException: Timeout during commit processing
at weblogic.utils.StackTraceDisabled.unknownMethod()
.>
对于 BEA-010026 异常,我们在前面技术问题分析中谈到过,一篇关键的文章如下,和我们现在遇到的场景很类似,具体文章地址在:
这篇文章明确指出是了是由于跨越信任引起的JMS订阅事务超时,并给出了三种解决方法。
在Oracle Support网站我们可以参考如下文章:
https://support.oracle.com/epmos/faces/DocumentDisplay?_afrLoop=529785660949421&parent=SrDetailText&sourceId=3-17517297911&id=1634221.1&_afrWindowMode=0&_adf.ctrl-state=32e46up1r_128
这篇文章对XA分布式事务的Timeout进行了更加详细的阐述,具体如下:
So the Coordinator will behave like following :
1. First, Coordinator will tell all Participants to commit their work and wait all participant response, however, the Coordinator will not wait the response infinitely, by default, it just wait 2 * min( max( 10, JTA timeout ),60) seconds
2.After this period of time, if the Coordinator still do not get all the participant's response, the Coordinator will timeout and give the client an information
3.the Coordinator start a timer and periodical retry the commit work asynchronously behind-the-scenes until the commit work successfully ultimately or after default 24 hours, if the retry commit still can not be successfully, then the transaction will be abandon.
On a conclusion, when we see the exception "javax.transaction.SystemException: Timeout during commit processing" , we do not need be in panic, as that just mean the commit process slow as some participants are slow to do the commit work, anyway, the transaction will be processed well by the server and the transaction will be either committed or abandoned, we still do not get in any data corruption.
The only thing we should think is that why the participants are so slow to do the commit work and response to the Coordinator?
That means we need more thread dump and enable some JTA debug flag to reveal that.
在对 BEA-010026 异常的跟踪过程中,我们仍然在日志里面发现了PeerGoneException和EOFException的异常,这个异常也是一直伴随事务超时和回滚出现,具体异常信息如下:
javax.transaction.SystemException:
Lost connection to server while commit was in progress,
ignoring because initiating server is not coordinating server.
Remote Exception received=weblogic.rjvm.PeerGoneException: ; nested exception is:
java.io.EOFException
at weblogic.transaction.internal.TransactionImpl.commit(TransactionImpl.
at weblogic.transaction.internal.ServerTransactionImpl.internalCommit(ServerTransactionImpl.
at weblogic.transaction.internal.ServerTransactionImpl.commit(ServerTransactionImpl.
at weblogic.ejb.container.internal.MDListener.execute(MDListener.
at weblogic.ejb.container.internal.MDListener.run(MDListener.
at weblogic.work.SelfTuningWorkManagerImpl$WorkAdapterImpl.run(SelfTuningWorkManagerImpl.
at weblogic.work.ExecuteThread.execute(ExecuteThread.
at weblogic.work.ExecuteThread.run(ExecuteThread.
而对于这个异常,我们在前面技术问题分析的时候就谈到过,跟Remote Server机器访问和DNS域名解析关系最大,我们当时做过调整HTTP传输启用隧道传输模式但是后面发现跟这个关系不大。同时检查发现我们所有的Server里面并没有配置DNS域名。
另外在Oracle Support网站,我们找到一篇类似的文章如下:
JMS TopicConnection Causes java.io.EOFException After Login (Doc ID 1056832.1)
WebLogic server is configured with UserAuthentication using Weblogic security providers (custom authentication provider MBean and a principal class). When messages, with authenticated users, are sent to Foreign JMS server, the connection fails to establish and the below exception is seen on the server (client) which is sends the message:
This is a security related issue. This occurs when transferring Weblogic custom Principal object to the remote server. This can be avoided by sending the JMS message without passing the identity as below:
感觉整体还是和跨越安全有关系。