Interesting deadlock bug found causing three threads to deadlock on the PJSUA, UA, and transaction locks.

DR
David Richards
Thu, Mar 16, 2017 7:18 PM

Hi.

I found a bug that causes my application (using the PJSUA API) to deadlock
when the following three events occur simultaneously:

  1. A call (received earlier by my application) is being "hung up" by
    calling pjsua_call_hangup(),
  2. The same call is also receiving a CANCEL message, and
  3. An INVITE is being received for a new separate call (as odd as that
    sounds).

I can reproduce this, and have multiple times, but it requires a heavy load
(and luck) to create the proper event sequencing. I have included some
"gdb" stack back traces below for the three threads. The bug is located in
the processing of the CANCEL message, where the transaction group lock is
attempted to be obtained after the user agent lock. That's the wrong
locking order. Specifically, the CANCEL message is processed by
mod_ua_on_rx_request() where the user agent lock is obtained and
find_dlg_set_for_msg() is called to get the dialog set. The transaction
group lock is attempted to be acquired by pjsip_tsx_layer_find_tsx()  which
is called by find_dlg_set_for_msg() to find the INVITE transaction.

This is a possible sequence of events that cause the deadlock (see the
stack back traces below):

  1. Thread 1 calls pjsua_call_hangup() to hang up a call.

  2. Thread 1 obtains the transaction lock pj_grp_lock_acquire(tsx->grp_l
    ock).

  3. Thread 2 starts processing a new call (INVITE msg).

  4. Thread 2 obtains the PJSUA lock PJSUA_LOCK().

  5. Thread 3 starts processing a CANCEL msg for the call being "hung up" by
    Thread 1.

  6. Thread 3 obtains the UA lock pj_mutex_lock(mod_ua.mutex).

  7. Thread 3 attempts to obtain the transaction lock, but is blocked because
    Thread 1 is holding it.

  8. Thread 1 attempts to obtain the PJSUA lock, but is blocked because
    Thread 2 is holding it.

  9. Thread 2 attempts to obtain the UA lock, but is blocked because Thread 3
    is holding it.

I modified the code to NOT to acquire the transaction group lock and
deadlock no longer occurs (or at least not after 12 hours of heavy load
testing). Based on the comments in the code, the group transaction lock is
obtained in pjsip_tsx_layer_find_tsx() in order to "prevent the transaction
from getting deleted before we have chance to lock it". Clearly the lock
removal stops deadlock, but the transaction deletion concern mentioned in
the code needs to be addressed.

A few other observations:

The function  mod_ua_on_rx_request() is checking for *dialog *deadlock, and
has implemented a retry loop for mediation. After I took out the *transaction
*group lock acquisition, I noticed that *dialog *deadlock was being
detected and handled correctly (through the retry loop) for the calls that
would otherwise caused the three thread deadlock described above.

I did not experience deadlock with my application until I modified it to
handle SIP traffic from two UDP transports (from two separate network
interfaces).  I don't know how that's related, except that probably made it
possible for the new INVITE (Thread 2) and the CANCEL (Thread 3) messages
to be processed simultaneously. I confirmed that in this case, both
messages came in on separate transports.

I have more detailed 'gdb' output if you are interested.

Thanks,
Dave Richards

Thread 1: My application "hanging up" a call with code "606" (Not
Acceptable)
(this thread is holding the tsx->grp_lock, and blocked on the
PJSUA lock /
#0  __lll_lock_wait () from /lib64/libpthread.so.0
#1  _L_lock_839 () from /lib64/libpthread.so.0
#2  pthread_mutex_lock () from /lib64/libpthread.so.0
#3  pj_mutex_lock (mutex=0x4878238) at ../src/pj/os_core_unix.c:1265
#4  PJSUA_LOCK () at ../include/pjsua-lib/pjsua_internal.h:575
/
blocked on PJSUA_LOCK() /
#5  pjsua_call_on_state_changed () at ../src/pjsua-lib/pjsua_call.c:3807
#6  inv_set_state () at ../src/pjsip-ua/sip_inv.c:317
#7  inv_on_state_incoming () at ../src/pjsip-ua/sip_inv.c:4310
#8  mod_inv_on_tsx_state () at ../src/pjsip-ua/sip_inv.c:717
#9  pjsip_dlg_on_tsx_state () at ../src/pjsip/sip_dialog.c:2064
#10 mod_ua_on_tsx_state () at ../src/pjsip/sip_ua_layer.c:178
#11 tsx_set_state () at ../src/pjsip/sip_transaction.c:1235
#12 tsx_on_state_proceeding_uas () at ../src/pjsip/sip_transaction.c:2819
#13 pjsip_tsx_send_msg () at ../src/pjsip/sip_transaction.c:1751
/
obtained pj_grp_lock_acquire(tsx->grp_lock) */
#14 pjsip_dlg_send_response () at ../src/pjsip/sip_dialog.c:1529
#15 pjsip_inv_send_msg () at ../src/pjsip-ua/sip_inv.c:3227
#16 pjsua_call_hangup (code=606) at ../src/pjsua-lib/pjsua_call.c:2407
#17 timer_callback () at TERMmain.c:313
#18 pj_timer_heap_poll () at ../src/pj/timer.c:643
#19 TIMERpoll ()
#20 main_func ()
#21 pj_run_app ()
#22 main ()

Thread 2: Worker thread handling new call (incoming INVITE)
(this thread is holding the PJSUA lock, blocked on the
mod_ua.mutex lock)
#0  __lll_lock_wait () from /lib64/libpthread.so.0
#1  _L_lock_839 () from /lib64/libpthread.so.0
#2  pthread_mutex_lock () from /lib64/libpthread.so.0
#3  pj_mutex_lock () at ../src/pj/os_core_unix.c:1265
#4  pjsip_ua_register_dlg () at ../src/pjsip/sip_ua_layer.c:298
/* blocked on pj_mutex_lock(mod_ua.mutex) /
#5  create_uas_dialog () at ../src/pjsip/sip_dialog.c:533
#6  pjsip_dlg_create_uas_and_inc_lock () at ../src/pjsip/sip_dialog.c:590
#7  pjsua_call_on_incoming () at ../src/pjsua-lib/pjsua_call.c:1386
/
obtained PJSUA_LOCK() */
#8  mod_pjsua_on_rx_request () at ../src/pjsua-lib/pjsua_core.c:577
#9  pjsip_endpt_process_rx_data () at ../src/pjsip/sip_endpoint.c:887
#10 endpt_on_rx_msg () at ../src/pjsip/sip_endpoint.c:1037
#11 pjsip_tpmgr_receive_packet () at ../src/pjsip/sip_transport.c:1939
#12 udp_on_read_complete () at ../src/pjsip/sip_transport_udp.c:170
#13 ioqueue_dispatch_read_event () at ../src/pj/ioqueue_common_abs.c:605
#14 pj_ioqueue_poll () at ../src/pj/ioqueue_epoll.c:796
#15 pjsip_endpt_handle_events2 () at ../src/pjsip/sip_endpoint.c:742
#16 pjsua_handle_events () at ../src/pjsua-lib/pjsua_core.c:1981
#17 worker_thread () at ../src/pjsua-lib/pjsua_core.c:699
#18 thread_main () at ../src/pj/os_core_unix.c:541
#19 start_thread () from /lib64/libpthread.so.0
#20 clone () from /lib64/libc.so.6

Thread 3: OFFENDING Worker thread handling CANCEL message for call being
"hung up" in Thread 1.
(this thread is holding the mod_ua.mutex lock, blocked on the
tsx->grp_lock lock)
#0  __lll_lock_wait () from /lib64/libpthread.so.0
#1  _L_lock_839 () from /lib64/libpthread.so.0
#2  pthread_mutex_lock () from /lib64/libpthread.so.0
#3  pj_mutex_lock () at ../src/pj/os_core_unix.c:1265
#4  pj_lock_acquire () at ../src/pj/lock.c:180
#5  grp_lock_acquire () at ../src/pj/lock.c:290
#6  pj_grp_lock_acquire () at ../src/pj/lock.c:478
#7  pjsip_tsx_layer_find_tsx () at ../src/pjsip/sip_transaction.c:672    /*
blocked on pj_grp_lock_acquire(tsx->grp_lock) /
#8  find_dlg_set_for_msg () at ../src/pjsip/sip_ua_layer.c:554
#9  mod_ua_on_rx_request () at ../src/pjsip/sip_ua_layer.c:615          /

obtains pj_mutex_lock(mod_ua.mutex) */
#10 pjsip_endpt_process_rx_data () at ../src/pjsip/sip_endpoint.c:887
#11 endpt_on_rx_msg () at ../src/pjsip/sip_endpoint.c:1037
#12 pjsip_tpmgr_receive_packet () at ../src/pjsip/sip_transport.c:1939
#13 udp_on_read_complete () at ../src/pjsip/sip_transport_udp.c:170
#14 ioqueue_dispatch_read_event () at ../src/pj/ioqueue_common_abs.c:605
#15 pj_ioqueue_poll () at ../src/pj/ioqueue_epoll.c:796
#16 pjsip_endpt_handle_events2 () at ../src/pjsip/sip_endpoint.c:742
#17 pjsua_handle_events () at ../src/pjsua-lib/pjsua_core.c:1981
#18 worker_thread () at ../src/pjsua-lib/pjsua_core.c:699
#19 thread_main () at ../src/pj/os_core_unix.c:541
#20 start_thread () from /lib64/libpthread.so.0
#21 clone () from /lib64/libc.so.6

Hi. I found a bug that causes my application (using the PJSUA API) to deadlock when the following three events occur simultaneously: 1) A call (received earlier by my application) is being "hung up" by calling pjsua_call_hangup(), 2) The same call is also receiving a CANCEL message, and 3) An INVITE is being received for a new separate call (as odd as that sounds). I can reproduce this, and have multiple times, but it requires a heavy load (and luck) to create the proper event sequencing. I have included some "gdb" stack back traces below for the three threads. The bug is located in the processing of the CANCEL message, where the transaction group lock is attempted to be obtained after the user agent lock. That's the wrong locking order. Specifically, the CANCEL message is processed by mod_ua_on_rx_request() where the user agent lock is obtained and find_dlg_set_for_msg() is called to get the dialog set. The transaction group lock is attempted to be acquired by pjsip_tsx_layer_find_tsx() which is called by find_dlg_set_for_msg() to find the INVITE transaction. This is a possible sequence of events that cause the deadlock (see the stack back traces below): 1) Thread 1 calls pjsua_call_hangup() to hang up a call. 2) Thread 1 obtains the transaction lock pj_grp_lock_acquire(tsx->grp_l ock). 3) Thread 2 starts processing a new call (INVITE msg). 4) Thread 2 obtains the PJSUA lock PJSUA_LOCK(). 5) Thread 3 starts processing a CANCEL msg for the call being "hung up" by Thread 1. 6) Thread 3 obtains the UA lock pj_mutex_lock(mod_ua.mutex). 7) Thread 3 attempts to obtain the transaction lock, but is blocked because Thread 1 is holding it. 8) Thread 1 attempts to obtain the PJSUA lock, but is blocked because Thread 2 is holding it. 9) Thread 2 attempts to obtain the UA lock, but is blocked because Thread 3 is holding it. I modified the code to NOT to acquire the transaction group lock and deadlock no longer occurs (or at least not after 12 hours of heavy load testing). Based on the comments in the code, the group transaction lock is obtained in pjsip_tsx_layer_find_tsx() in order to "prevent the transaction from getting deleted before we have chance to lock it". Clearly the lock removal stops deadlock, but the transaction deletion concern mentioned in the code needs to be addressed. A few other observations: The function mod_ua_on_rx_request() is checking for *dialog *deadlock, and has implemented a retry loop for mediation. After I took out the *transaction *group lock acquisition, I noticed that *dialog *deadlock was being detected and handled correctly (through the retry loop) for the calls that would otherwise caused the three thread deadlock described above. I did not experience deadlock with my application until I modified it to handle SIP traffic from two UDP transports (from two separate network interfaces). I don't know how that's related, except that probably made it possible for the new INVITE (Thread 2) and the CANCEL (Thread 3) messages to be processed simultaneously. I confirmed that in this case, both messages came in on separate transports. I have more detailed 'gdb' output if you are interested. Thanks, Dave Richards Thread 1: My application "hanging up" a call with code "606" (Not Acceptable) (this thread is holding the tsx->grp_lock, and blocked on the PJSUA lock */ #0 __lll_lock_wait () from /lib64/libpthread.so.0 #1 _L_lock_839 () from /lib64/libpthread.so.0 #2 pthread_mutex_lock () from /lib64/libpthread.so.0 #3 pj_mutex_lock (mutex=0x4878238) at ../src/pj/os_core_unix.c:1265 #4 PJSUA_LOCK () at ../include/pjsua-lib/pjsua_internal.h:575 /* blocked on PJSUA_LOCK() */ #5 pjsua_call_on_state_changed () at ../src/pjsua-lib/pjsua_call.c:3807 #6 inv_set_state () at ../src/pjsip-ua/sip_inv.c:317 #7 inv_on_state_incoming () at ../src/pjsip-ua/sip_inv.c:4310 #8 mod_inv_on_tsx_state () at ../src/pjsip-ua/sip_inv.c:717 #9 pjsip_dlg_on_tsx_state () at ../src/pjsip/sip_dialog.c:2064 #10 mod_ua_on_tsx_state () at ../src/pjsip/sip_ua_layer.c:178 #11 tsx_set_state () at ../src/pjsip/sip_transaction.c:1235 #12 tsx_on_state_proceeding_uas () at ../src/pjsip/sip_transaction.c:2819 #13 pjsip_tsx_send_msg () at ../src/pjsip/sip_transaction.c:1751 /* obtained pj_grp_lock_acquire(tsx->grp_lock) */ #14 pjsip_dlg_send_response () at ../src/pjsip/sip_dialog.c:1529 #15 pjsip_inv_send_msg () at ../src/pjsip-ua/sip_inv.c:3227 #16 pjsua_call_hangup (code=606) at ../src/pjsua-lib/pjsua_call.c:2407 #17 timer_callback () at TERMmain.c:313 #18 pj_timer_heap_poll () at ../src/pj/timer.c:643 #19 TIMERpoll () #20 main_func () #21 pj_run_app () #22 main () Thread 2: Worker thread handling new call (incoming INVITE) (this thread is holding the PJSUA lock, blocked on the mod_ua.mutex lock) #0 __lll_lock_wait () from /lib64/libpthread.so.0 #1 _L_lock_839 () from /lib64/libpthread.so.0 #2 pthread_mutex_lock () from /lib64/libpthread.so.0 #3 pj_mutex_lock () at ../src/pj/os_core_unix.c:1265 #4 pjsip_ua_register_dlg () at ../src/pjsip/sip_ua_layer.c:298 /* blocked on pj_mutex_lock(mod_ua.mutex) */ #5 create_uas_dialog () at ../src/pjsip/sip_dialog.c:533 #6 pjsip_dlg_create_uas_and_inc_lock () at ../src/pjsip/sip_dialog.c:590 #7 pjsua_call_on_incoming () at ../src/pjsua-lib/pjsua_call.c:1386 /* obtained PJSUA_LOCK() */ #8 mod_pjsua_on_rx_request () at ../src/pjsua-lib/pjsua_core.c:577 #9 pjsip_endpt_process_rx_data () at ../src/pjsip/sip_endpoint.c:887 #10 endpt_on_rx_msg () at ../src/pjsip/sip_endpoint.c:1037 #11 pjsip_tpmgr_receive_packet () at ../src/pjsip/sip_transport.c:1939 #12 udp_on_read_complete () at ../src/pjsip/sip_transport_udp.c:170 #13 ioqueue_dispatch_read_event () at ../src/pj/ioqueue_common_abs.c:605 #14 pj_ioqueue_poll () at ../src/pj/ioqueue_epoll.c:796 #15 pjsip_endpt_handle_events2 () at ../src/pjsip/sip_endpoint.c:742 #16 pjsua_handle_events () at ../src/pjsua-lib/pjsua_core.c:1981 #17 worker_thread () at ../src/pjsua-lib/pjsua_core.c:699 #18 thread_main () at ../src/pj/os_core_unix.c:541 #19 start_thread () from /lib64/libpthread.so.0 #20 clone () from /lib64/libc.so.6 Thread 3: OFFENDING Worker thread handling CANCEL message for call being "hung up" in Thread 1. (this thread is holding the mod_ua.mutex lock, blocked on the tsx->grp_lock lock) #0 __lll_lock_wait () from /lib64/libpthread.so.0 #1 _L_lock_839 () from /lib64/libpthread.so.0 #2 pthread_mutex_lock () from /lib64/libpthread.so.0 #3 pj_mutex_lock () at ../src/pj/os_core_unix.c:1265 #4 pj_lock_acquire () at ../src/pj/lock.c:180 #5 grp_lock_acquire () at ../src/pj/lock.c:290 #6 pj_grp_lock_acquire () at ../src/pj/lock.c:478 #7 pjsip_tsx_layer_find_tsx () at ../src/pjsip/sip_transaction.c:672 /* blocked on pj_grp_lock_acquire(tsx->grp_lock) */ #8 find_dlg_set_for_msg () at ../src/pjsip/sip_ua_layer.c:554 #9 mod_ua_on_rx_request () at ../src/pjsip/sip_ua_layer.c:615 /* obtains pj_mutex_lock(mod_ua.mutex) */ #10 pjsip_endpt_process_rx_data () at ../src/pjsip/sip_endpoint.c:887 #11 endpt_on_rx_msg () at ../src/pjsip/sip_endpoint.c:1037 #12 pjsip_tpmgr_receive_packet () at ../src/pjsip/sip_transport.c:1939 #13 udp_on_read_complete () at ../src/pjsip/sip_transport_udp.c:170 #14 ioqueue_dispatch_read_event () at ../src/pj/ioqueue_common_abs.c:605 #15 pj_ioqueue_poll () at ../src/pj/ioqueue_epoll.c:796 #16 pjsip_endpt_handle_events2 () at ../src/pjsip/sip_endpoint.c:742 #17 pjsua_handle_events () at ../src/pjsua-lib/pjsua_core.c:1981 #18 worker_thread () at ../src/pjsua-lib/pjsua_core.c:699 #19 thread_main () at ../src/pj/os_core_unix.c:541 #20 start_thread () from /lib64/libpthread.so.0 #21 clone () from /lib64/libc.so.6
AH
Alex Hermann
Fri, Mar 17, 2017 8:48 AM

On donderdag 16 maart 2017 14:18:25 CET David Richards wrote:

I found a bug that causes my application (using the PJSUA API) to deadlock

This looks like the same deadlock I reported on 7-3-2017. Unfortunately, I got
no response to it.

Alex Hermann

On donderdag 16 maart 2017 14:18:25 CET David Richards wrote: > I found a bug that causes my application (using the PJSUA API) to deadlock This looks like the same deadlock I reported on 7-3-2017. Unfortunately, I got no response to it. -- Alex Hermann