测试环境出现问题
今天,收到项目组通知说发生了死锁,让我查一下死锁出现的原因。
首先,登录数据库,查看trace日志所在路径
-
SYS@cams>show parameter dump;
-
NAME TYPE VALUE
-
------------------------------------ ----------- ------------------------------
-
background_core_dump string partial
-
background_dump_dest string /u01/app/oracle/diag/rdbms/cams/cams/trace
-
core_dump_dest string /u01/app/oracle/diag/rdbms/cams/cams/cdump
-
max_dump_file_size string unlimited
-
shadow_core_dump string PARTIAL
-
user_dump_dest string /u01/app/oracle/diag/rdbms/cams/cams/trace
查看alert日志,搜索00060
-
[oracle@db trace]$ cd /u01/app/oracle/diag/rdbms/cams/cams/trace/
-
[oracle@db trace]$ vi alert_cams.log
找到报错trace日志如下:
-
ORA-00060: Deadlock detected. More info in file /u01/app/oracle/diag/rdbms/cams/cams/trace/cams_ora_12850.trc.
-
Tue Sep 12 13:28:31 2017
-
ORA-00060: Deadlock detected. More info in file /u01/app/oracle/diag/rdbms/cams/cams/trace/cams_ora_12854.trc.
-
Tue Sep 12 13:28:36 2017
-
ORA-00060: Deadlock detected. More info in file /u01/app/oracle/diag/rdbms/cams/cams/trace/cams_ora_12822.trc.
-
Tue Sep 12 13:31:12 2017
打开其中一个trace文件,部分内容如下:
-
Trace file /u01/app/oracle/diag/rdbms/cams/cams/trace/cams_ora_12850.trc
-
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
-
With the Partitioning, OLAP, Data Mining and Real Application Testing options
-
ORACLE_HOME = /u01/app/oracle/product/11.2.0/db_1
-
System name: Linux
-
Node name: db
-
Release: 2.6.32-431.el6.x86_64
-
Version: #1 SMP Sun Nov 10 22:19:54 EST 2013
-
Machine: x86_64
-
VM name: VMWare Version: 6
-
Instance name: cams
-
Redo thread mounted by this instance: 1
-
Oracle process number: 293
-
Unix process pid: 12850, image: oracle@db
-
*** 2017-09-12 13:28:28.401
-
*** SESSION ID:(1139.47765) 2017-09-12 13:28:28.401
-
*** CLIENT ID:() 2017-09-12 13:28:28.401
-
*** SERVICE NAME:(SYS$USERS) 2017-09-12 13:28:28.401
-
*** MODULE NAME:(JDBC Thin Client) 2017-09-12 13:28:28.401
-
*** ACTION NAME:() 2017-09-12 13:28:28.401
-
-
-
*** 2017-09-12 13:28:28.401
-
DEADLOCK DETECTED ( ORA-00060 )
-
-
[Transaction Deadlock]
-
这里指出死锁不是Oracle自身的错误,是用户的设计造成的问题
-
The following deadlock is not an ORACLE error. It is a
-
deadlock due to user error in the design of an application
-
or from issuing incorrect ad-hoc SQL. The following
-
information may aid in determining the deadlock:
-
-
这个死锁图表明两个session各持有一份资源,等待对方持有的资源
-
Deadlock graph:
-
---------Blocker(s)-------- ---------Waiter(s)---------
-
Resource Name process session holds waits process session holds waits
-
TX-00160018-00000adc 293 1139 X 384 65 X
-
TX-0004000f-000064d6 384 65 X 293 1139 X
-
-
session 1139: DID 0001-0125-000074A5 session 65: DID 0001-0180-000037EC
-
session 65: DID 0001-0180-000037EC session 1139: DID 0001-0125-000074A5
-
-
Rows waited on:
-
Session 1139: obj - rowid = 00021C7B - AAAhx7AAHAAARMrAAB
-
(dictionary objn - 138363, file - 7, block - 70443, slot - 1)
-
Session 65: obj - rowid = 00021C7B - AAAhx7AAHAAARMtAAD
-
(dictionary objn - 138363, file - 7, block - 70445, slot - 3)
-
-
----- Information for the OTHER waiting sessions -----
-
Session 65:
-
sid: 65 ser: 27869 audsid: 3026187 user: 111/CAMS_CORE
-
flags: (0x45) USR/- flags_idl: (0x1) BSY/-/-/-/-/-
-
flags2: (0x40009) -/-/INC
-
pid: 384 O/S info: user: oracle, term: UNKNOWN, ospid: 12854
-
image: oracle@db
-
client details:
-
O/S info: user: cams, term: unknown, ospid: 1234
-
machine: yy program: JDBC Thin Client
-
application name: JDBC Thin Client, hash value=2546894660
-
current SQL:
-
导致死锁的SQL语句1
-
select
-
-
SEQ_NO, ACO_AC,AC_DTE,CCY,BAL,FRZ_AMT,
-
CRE_TLR, CRE_DTE,
-
CRE_BR, CRE_BK, UPD_TLR, UPD_DTE, UPD_BR, UPD_BK,TSL_AMT,VPS_BAL
-
-
FROM CD_ACBALANCE
-
WHERE ACO_AC = :1
-
FOR UPDATE
-
-
----- End of information for the OTHER waiting sessions -----
-
-
Information for THIS session:
-
-
----- Current SQL Statement for this session (sql_id=cfy88pmyts0fn) -----
-
导致死锁的SQL语句2
-
select
-
-
SEQ_NO, ACO_AC,AC_DTE,CCY,BAL,FRZ_AMT,
-
CRE_TLR, CRE_DTE,
-
CRE_BR, CRE_BK, UPD_TLR, UPD_DTE, UPD_BR, UPD_BK,TSL_AMT,VPS_BAL
-
-
FROM CD_ACBALANCE
-
WHERE ACO_AC = :1
-
FOR UPDATE
-
===================================================
这里可以明确的是select for update语句造成了死锁。但是考虑到环境为测试环境,没开启归档,不能通过logminer挖掘故障期间的SQL事物,而且开发人员已经通过kill session的方式处理了死锁,所以需要在别的环境中重现该问题,并进行分析。
重现死锁问题
首先连上数据库,启用HR用户
-
[oracle@prod ~]$ sqlplus / as sysdba
-
SQL*Plus: Release 11.2.0.4.0 Production on Wed Sep 13 03:14:17 2017
-
Copyright (c) 1982, 2013, Oracle. All rights reserved.
-
Connected to an idle instance.
-
SYS@PROD>startup;
-
ORACLE instance started.
-
Total System Global Area 1202556928 bytes
-
Fixed Size 2252704 bytes
-
Variable Size 771752032 bytes
-
Database Buffers 419430400 bytes
-
Redo Buffers 9121792 bytes
-
Database mounted.
-
Database opened.
-
SYS@PROD>conn hr/hr
-
ERROR:
-
ORA-28000: the account is locked
-
Warning: You are no longer connected to ORACLE.
-
@>conn / as sysdba
-
Connected.
-
SYS@PROD>alter user hr identified by hr account unlock;
-
User altered.
-
SYS@PROD>conn hr/hr
-
Connected.
选中COUNTRIES表作为测试对象
-
HR@PROD>desc countries;
-
Name Null? Type
-
----------------------------------------- -------- ----------------------------
-
COUNTRY_ID NOT NULL CHAR(2)
-
COUNTRY_NAME VARCHAR2(40)
-
REGION_ID NUMBER
-
HR@PROD>select count(*) from countries;
-
COUNT(*)
-
----------
-
25
-
HR@PROD>select count(distinct(COUNTRY_ID)) from countries;
-
COUNT(DISTINCT(COUNTRY_ID))
-
---------------------------
-
25
经过验证,COUNTRIES表中的COUNTRY_ID字段是唯一的。
用COUNTRIES表新建一个test表,在test表中做测试
-
HR@PROD>create table test as select * from countries;
-
Table created.
-
HR@PROD>desc test
-
Name Null? Type
-
----------------------------------------- -------- ----------------------------
-
COUNTRY_ID NOT NULL CHAR(2)
-
COUNTRY_NAME VARCHAR2(40)
-
REGION_ID NUMBER
-
HR@PROD>select country_id,country_name,region_id from countries;
-
CO COUNTRY_NAME REGION_ID
-
-- ---------------------------------------- ----------
-
AR Argentina 2
-
AU Australia 3
-
BE Belgium 1
-
BR Brazil 2
-
CA Canada 2
-
CH Switzerland 1
-
CN China 3
-
DE Germany 1
-
DK Denmark 1
-
EG Egypt 4
-
FR France 1
-
CO COUNTRY_NAME REGION_ID
-
-- ---------------------------------------- ----------
-
IL Israel 4
-
IN India 3
-
IT Italy 1
-
JP Japan 3
-
KW Kuwait 4
-
ML Malaysia 3
-
MX Mexico 2
-
NG Nigeria 4
-
NL Netherlands 1
-
SG Singapore 3
-
UK United Kingdom 1
-
CO COUNTRY_NAME REGION_ID
-
-- ---------------------------------------- ----------
-
US United States of America 2
-
ZM Zambia 4
-
ZW Zimbabwe 4
-
25 rows selected.
情形1:
两个select for update语句想要同时锁定一条语句。
Session1:
-
HR@PROD>select country_id,country_name,region_id from test where country_id='AU' for update;
-
CO COUNTRY_NAME REGION_ID
-
-- ---------------------------------------- ----------
-
AU Australia 3
Sessions2:
-
HR@PROD>select country_id,country_name,region_id from test where country_id='AU' for update;
这时,session2会一直处于锁等待状态,而不会出现死锁。当把第一个事物commit或者rollback之后,第二个事物会继续执行。
Session1:
-
HR@PROD>rollback;
-
Rollback complete.
Session2:
-
HR@PROD>select country_id,country_name,region_id from test where country_id='AU' for update;
-
CO COUNTRY_NAME REGION_ID
-
-- ---------------------------------------- ----------
-
AU Australia
显然,两个select for update语句想要同时锁定一条语句,并不会出现死锁,而会出现锁等待的现象。
情形2:
第一个select for update事物锁定A row后,想要再锁定B row;第二个select for update事物锁定B row后,想要锁定A row。
Session1:
-
HR@PROD>select country_id,country_name,region_id from test where country_id='AU' for update;
-
CO COUNTRY_NAME REGION_ID
-
-- ---------------------------------------- ----------
-
AU Australia 3
Session 2:
-
HR@PROD>select country_id,country_name,region_id from test where country_id='UK' for update;
-
CO COUNTRY_NAME REGION_ID
-
-- ---------------------------------------- ----------
-
UK United Kingdom 1
如果Session1想要继续持有Session2正在持有的row:
Session1:
-
HR@PROD>select country_id,country_name,region_id from test where country_id='UK' for update;
这时,会出现锁等待的现场,和上一场景类似。如果Session2也想持有Session1正在持有的row:
Session2:
-
HR@PROD>select country_id,country_name,region_id from test where country_id='AU' for update;
这时,Session2还处于等待状态,但是Session1出现死锁:
Session1:
-
HR@PROD>select country_id,country_name,region_id from test where country_id='UK' for update;
-
select country_id,country_name,region_id from test where country_id='UK' for update
-
*
-
ERROR at line 1:
-
ORA-00060: deadlock detected while waiting for resource
这时,死锁的现象已经重现,而且,场景2从头到尾只使用了一个select for update语句,只是换了参数而已。
定位死锁语句
通过检查数据库表,能够检查出是哪一条语句被死锁,产生死锁的机器是哪一台。
1)用dba用户执行以下语句
-
select c.owner, c.object_name, c.object_type, b.sid,
-
b.serial#, b.lockwait, b.status, b.osuser, b.machine, b.process, b.program
-
from v$locked_object a ,
-
v$session b,
-
dba_objects c
-
where b.sid = a.session_id
-
and a.object_id = c.object_id;
如果有输出的结果,则说明有死锁,且能看到死锁的机器是哪一台。字段说明:
OWNER:死锁语句所用的数据库用户,这里是HR用户。
OBJECT_NAME:产生死锁的对象,这里是TEST表。
OBJECT_TYPE:产生死锁的对象类型,这里是TABLE。
SID:SESSION标识,常用于连接其它列。
SERIAL#:SID有可能会重复,当两个session的SID重复时,SERIAL#用来区别session
LOCKWAIT:可以通过这个字段查询出当前正在等待的锁的相关信息,如果有内容表示被死锁或者有锁等待事件。
STATUS:用来判断session状态。ACTIVE:正执行SQL语句。INACTIVE:等待操作。KILLED:被标注为杀死。
OSUSER:客户端操作系统用户名。
MACHINE:客户端操作系统的机器名。
PROCESS:客户端进程的ID。
PROGRAM:客户端执行的应用程序。
2)用dba用户执行以下语句,可以查看到被死锁的语句。
-
select sql_text from v$sql where (address,hash_value) in
-
(select sql_address,sql_hash_value from v$session where sid in
-
(select session_id from v$locked_object));
或者
-
select sql_text from v$sql where (address,hash_value) in
-
(select sql_address,sql_hash_value from v$session where lockwait is not null);
这里查出来出现死锁的语句和之前的测试结果一致。
3)分析trace日志文件
死锁发生时,可以再trace日志文件中找到如下语句:
-
ORA-00060: Deadlock detected. More info in file /u01/app/oracle/diag/rdbms/prod/PROD/trace/PROD_ora_4852.trc.
打开trace文件,可以找到死锁的语句:
-
----- Information for the OTHER waiting sessions -----
-
Session 31:
-
sid: 31 ser: 25 audsid: 110280 user: 84/HR
-
flags: (0x45) USR/- flags_idl: (0x1) BSY/-/-/-/-/-
-
flags2: (0x40009) -/-/INC
-
pid: 23 O/S info: user: oracle, term: UNKNOWN, ospid: 4853
-
image: oracle@prod (TNS V1-V3)
-
client details:
-
O/S info: user: oracle, term: pts/1, ospid: 3149
-
machine: prod program: sqlplus@prod (TNS V1-V3)
-
application name: SQL*Plus, hash value=3669949024
-
current SQL:
-
select country_id,country_name,region_id from test where country_id='AU' for update
-
-
----- End of information for the OTHER waiting sessions -----
-
-
Information for THIS session:
-
-
----- Current SQL Statement for this session (sql_id=6n5kjs2twrwwq) -----
-
select country_id,country_name,region_id from test where country_id='UK' for update
-
===================================================
解决死锁问题
方法一(杀进程)
1)查找死锁的进程:
-
select sid, serial#, username, command, lockwait, osuser from v$session where lockwait is not null;
2)kill掉这个死锁的进程:
-
alter system kill session 'sid, serial#';
这里执行语句为:alter system kill session '29,69';
原先造成死锁的进程被killed。
3)如果还不能解决,使用杀系统进程的方式处理:
这里为了测试,再次重现了死锁,并使用杀进程方式进行处理。
-
select p.spid from v$session s,v$process p where s.sid=XXX and s.paddr=p.addr;
其中,XXX使用第一步查出来的SID参数替代
这里查出来的进程号为:4257
-
[oracle@prod ~]$ ps -ef | grep 4257
-
oracle 4257 3149 0 07:08 ? 00:00:00 oraclePROD (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
-
oracle 4321 4000 0 07:21 pts/4 00:00:00 grep 4257
-
[oracle@prod ~]$ kill -9 4257
-
[oracle@prod ~]$ ps -ef | grep 4257
-
oracle 4327 4000 0 07:21 pts/4 00:00:00 grep 4257
杀完进程之后,造成死锁的进程被杀死
查不到死锁进程
方法二(重启库)
-
SYS@PROD>shutdown immediate;
-
Database closed.
-
Database dismounted.
-
ORACLE instance shut down.
-
SYS@PROD>startup;
-
ORACLE instance started.
-
Total System Global Area 1202556928 bytes
-
Fixed Size 2252704 bytes
-
Variable Size 771752032 bytes
-
Database Buffers 419430400 bytes
-
Redo Buffers 9121792 bytes
-
Database mounted.
-
Database opened.
查看死锁进程:
方法三(commit || rollback)
Session1:
-
HR@PROD>select country_id,country_name,region_id from test where country_id='AU' for update;
-
CO COUNTRY_NAME REGION_ID
-
-- ---------------------------------------- ----------
-
AU Australia 3
Session2:
-
HR@PROD>select country_id,country_name,region_id from test where country_id='UK' for update;
-
CO COUNTRY_NAME REGION_ID
-
-- ---------------------------------------- ----------
-
UK United Kingdom 1
Session1:
-
HR@PROD>select country_id,country_name,region_id from test where country_id='UK' for update;
-
Waiting……
Session2:
-
HR@PROD>select country_id,country_name,region_id from test where country_id='AU' for update;
-
Waiting……
Session1:
-
HR@PROD>select country_id,country_name,region_id from test where country_id='UK' for update;
-
select country_id,country_name,region_id from test where country_id='UK' for update
-
*
-
ERROR at line 1:
-
ORA-00060: deadlock detected while waiting for resource
这里Session1出现死锁,只要执行commit或者rollback就可以解除死锁,只不过事务中第一个SQL执行成功,第二个SQL执行失败。
Session1:
-
HR@PROD>commit;
-
Commit complete.
Session2:
-
HR@PROD>select country_id,country_name,region_id from test where country_id='AU' for update;
-
CO COUNTRY_NAME REGION_ID
-
-- ---------------------------------------- ----------
-
AU Australia 3
此时,死锁状态解除:
问题总结
最后,通过与开发人员交流,得出的结论是出现该问题的原因不是程序设计的问题,而是在开发环境中,有人通过debug模式连到服务器上进行代码调试,有人通过客户端的形式访问服务器上的应用,当两者同时调试某一功能时(通过select for update的语句进行一张表中的数据访问),在特殊的情况下,出现了死锁的情况。
当然,在开发环境中比较容易出现这类情况,如果需要避免,就需要使用该开发环境的开发人员与测试人员做好沟通了。对于容易出现锁表的功能,可以要求测试人员在测试环境中测试,除特殊情况,尽量减少在开发环境中测试的次数,以免锁表影响开发测试进度。