Hi.
I found a bug that causes my application (using the PJSUA API) to deadlock
when the following three events occur simultaneously:
I can reproduce this, and have multiple times, but it requires a heavy load
(and luck) to create the proper event sequencing. I have included some
"gdb" stack back traces below for the three threads. The bug is located in
the processing of the CANCEL message, where the transaction group lock is
attempted to be obtained after the user agent lock. That's the wrong
locking order. Specifically, the CANCEL message is processed by
mod_ua_on_rx_request() where the user agent lock is obtained and
find_dlg_set_for_msg() is called to get the dialog set. The transaction
group lock is attempted to be acquired by pjsip_tsx_layer_find_tsx() which
is called by find_dlg_set_for_msg() to find the INVITE transaction.
This is a possible sequence of events that cause the deadlock (see the
stack back traces below):
Thread 1 calls pjsua_call_hangup() to hang up a call.
Thread 1 obtains the transaction lock pj_grp_lock_acquire(tsx->grp_l
ock).
Thread 2 starts processing a new call (INVITE msg).
Thread 2 obtains the PJSUA lock PJSUA_LOCK().
Thread 3 starts processing a CANCEL msg for the call being "hung up" by
Thread 1.
Thread 3 obtains the UA lock pj_mutex_lock(mod_ua.mutex).
Thread 3 attempts to obtain the transaction lock, but is blocked because
Thread 1 is holding it.
Thread 1 attempts to obtain the PJSUA lock, but is blocked because
Thread 2 is holding it.
Thread 2 attempts to obtain the UA lock, but is blocked because Thread 3
is holding it.
I modified the code to NOT to acquire the transaction group lock and
deadlock no longer occurs (or at least not after 12 hours of heavy load
testing). Based on the comments in the code, the group transaction lock is
obtained in pjsip_tsx_layer_find_tsx() in order to "prevent the transaction
from getting deleted before we have chance to lock it". Clearly the lock
removal stops deadlock, but the transaction deletion concern mentioned in
the code needs to be addressed.
A few other observations:
The function mod_ua_on_rx_request() is checking for *dialog *deadlock, and
has implemented a retry loop for mediation. After I took out the *transaction
*group lock acquisition, I noticed that *dialog *deadlock was being
detected and handled correctly (through the retry loop) for the calls that
would otherwise caused the three thread deadlock described above.
I did not experience deadlock with my application until I modified it to
handle SIP traffic from two UDP transports (from two separate network
interfaces). I don't know how that's related, except that probably made it
possible for the new INVITE (Thread 2) and the CANCEL (Thread 3) messages
to be processed simultaneously. I confirmed that in this case, both
messages came in on separate transports.
I have more detailed 'gdb' output if you are interested.
Thanks,
Dave Richards
Thread 1: My application "hanging up" a call with code "606" (Not
Acceptable)
(this thread is holding the tsx->grp_lock, and blocked on the
PJSUA lock /
#0 __lll_lock_wait () from /lib64/libpthread.so.0
#1 _L_lock_839 () from /lib64/libpthread.so.0
#2 pthread_mutex_lock () from /lib64/libpthread.so.0
#3 pj_mutex_lock (mutex=0x4878238) at ../src/pj/os_core_unix.c:1265
#4 PJSUA_LOCK () at ../include/pjsua-lib/pjsua_internal.h:575
/ blocked on PJSUA_LOCK() /
#5 pjsua_call_on_state_changed () at ../src/pjsua-lib/pjsua_call.c:3807
#6 inv_set_state () at ../src/pjsip-ua/sip_inv.c:317
#7 inv_on_state_incoming () at ../src/pjsip-ua/sip_inv.c:4310
#8 mod_inv_on_tsx_state () at ../src/pjsip-ua/sip_inv.c:717
#9 pjsip_dlg_on_tsx_state () at ../src/pjsip/sip_dialog.c:2064
#10 mod_ua_on_tsx_state () at ../src/pjsip/sip_ua_layer.c:178
#11 tsx_set_state () at ../src/pjsip/sip_transaction.c:1235
#12 tsx_on_state_proceeding_uas () at ../src/pjsip/sip_transaction.c:2819
#13 pjsip_tsx_send_msg () at ../src/pjsip/sip_transaction.c:1751
/ obtained pj_grp_lock_acquire(tsx->grp_lock) */
#14 pjsip_dlg_send_response () at ../src/pjsip/sip_dialog.c:1529
#15 pjsip_inv_send_msg () at ../src/pjsip-ua/sip_inv.c:3227
#16 pjsua_call_hangup (code=606) at ../src/pjsua-lib/pjsua_call.c:2407
#17 timer_callback () at TERMmain.c:313
#18 pj_timer_heap_poll () at ../src/pj/timer.c:643
#19 TIMERpoll ()
#20 main_func ()
#21 pj_run_app ()
#22 main ()
Thread 2: Worker thread handling new call (incoming INVITE)
(this thread is holding the PJSUA lock, blocked on the
mod_ua.mutex lock)
#0 __lll_lock_wait () from /lib64/libpthread.so.0
#1 _L_lock_839 () from /lib64/libpthread.so.0
#2 pthread_mutex_lock () from /lib64/libpthread.so.0
#3 pj_mutex_lock () at ../src/pj/os_core_unix.c:1265
#4 pjsip_ua_register_dlg () at ../src/pjsip/sip_ua_layer.c:298
/* blocked on pj_mutex_lock(mod_ua.mutex) /
#5 create_uas_dialog () at ../src/pjsip/sip_dialog.c:533
#6 pjsip_dlg_create_uas_and_inc_lock () at ../src/pjsip/sip_dialog.c:590
#7 pjsua_call_on_incoming () at ../src/pjsua-lib/pjsua_call.c:1386
/ obtained PJSUA_LOCK() */
#8 mod_pjsua_on_rx_request () at ../src/pjsua-lib/pjsua_core.c:577
#9 pjsip_endpt_process_rx_data () at ../src/pjsip/sip_endpoint.c:887
#10 endpt_on_rx_msg () at ../src/pjsip/sip_endpoint.c:1037
#11 pjsip_tpmgr_receive_packet () at ../src/pjsip/sip_transport.c:1939
#12 udp_on_read_complete () at ../src/pjsip/sip_transport_udp.c:170
#13 ioqueue_dispatch_read_event () at ../src/pj/ioqueue_common_abs.c:605
#14 pj_ioqueue_poll () at ../src/pj/ioqueue_epoll.c:796
#15 pjsip_endpt_handle_events2 () at ../src/pjsip/sip_endpoint.c:742
#16 pjsua_handle_events () at ../src/pjsua-lib/pjsua_core.c:1981
#17 worker_thread () at ../src/pjsua-lib/pjsua_core.c:699
#18 thread_main () at ../src/pj/os_core_unix.c:541
#19 start_thread () from /lib64/libpthread.so.0
#20 clone () from /lib64/libc.so.6
Thread 3: OFFENDING Worker thread handling CANCEL message for call being
"hung up" in Thread 1.
(this thread is holding the mod_ua.mutex lock, blocked on the
tsx->grp_lock lock)
#0 __lll_lock_wait () from /lib64/libpthread.so.0
#1 _L_lock_839 () from /lib64/libpthread.so.0
#2 pthread_mutex_lock () from /lib64/libpthread.so.0
#3 pj_mutex_lock () at ../src/pj/os_core_unix.c:1265
#4 pj_lock_acquire () at ../src/pj/lock.c:180
#5 grp_lock_acquire () at ../src/pj/lock.c:290
#6 pj_grp_lock_acquire () at ../src/pj/lock.c:478
#7 pjsip_tsx_layer_find_tsx () at ../src/pjsip/sip_transaction.c:672 /*
blocked on pj_grp_lock_acquire(tsx->grp_lock) /
#8 find_dlg_set_for_msg () at ../src/pjsip/sip_ua_layer.c:554
#9 mod_ua_on_rx_request () at ../src/pjsip/sip_ua_layer.c:615 /
obtains pj_mutex_lock(mod_ua.mutex) */
#10 pjsip_endpt_process_rx_data () at ../src/pjsip/sip_endpoint.c:887
#11 endpt_on_rx_msg () at ../src/pjsip/sip_endpoint.c:1037
#12 pjsip_tpmgr_receive_packet () at ../src/pjsip/sip_transport.c:1939
#13 udp_on_read_complete () at ../src/pjsip/sip_transport_udp.c:170
#14 ioqueue_dispatch_read_event () at ../src/pj/ioqueue_common_abs.c:605
#15 pj_ioqueue_poll () at ../src/pj/ioqueue_epoll.c:796
#16 pjsip_endpt_handle_events2 () at ../src/pjsip/sip_endpoint.c:742
#17 pjsua_handle_events () at ../src/pjsua-lib/pjsua_core.c:1981
#18 worker_thread () at ../src/pjsua-lib/pjsua_core.c:699
#19 thread_main () at ../src/pj/os_core_unix.c:541
#20 start_thread () from /lib64/libpthread.so.0
#21 clone () from /lib64/libc.so.6
On donderdag 16 maart 2017 14:18:25 CET David Richards wrote:
I found a bug that causes my application (using the PJSUA API) to deadlock
Alex Hermann