pj::Call::onCallState() and similar callbacks should be documented as holding the PJSUA_LOCK

MO
Martin Oberhuber
Wed, May 22, 2019 8:19 AM

Dear PJSIP List / Maintainers,

As I've just been through debugging a deadlock in our application using
Pjsip / pjsua2, I would like to suggest a documentation improvement, that
should help other users run into the same trap as we did.

In our case, we've implemented some code in the pj::Call::onCallState()
callback, that used a Mutex to modify our application state. Another
thread, which took the same application state mutex, called out into PJSIP
with the mutex held. It was an innocent call::getInfo() with our own Mutex
held, but PJSIP deadlocked on this pattern in some, very rare, cases.

To make develoeprs aware, I would suggest improving the
pj::Call::onCallState() documentation as follows:

"
virtual void pj::Call::onCallState ( OnCallStateParam
https://www.pjsip.org/pjsip/docs/html/structpj_1_1OnCallStateParam.htm &
prm ) inlinevirtual

Notify application when call state has changed. Application may then query
the call info to get the detail call states by calling getInfo()
https://www.pjsip.org/pjsip/docs/html/classpj_1_1Call.htm#a5057b298a0f5f5a74a8d37a144ca538e
function.

Note that the callback is made with the PJSUA_LOCK held. So the code should
not take any resources that other Threads hold while calling into PJSIP.
Parameters
prm Callback parameter.

References PJ_UNUSED_ARG
https://www.pjsip.org/pjlib/docs/html/group__pj__config.htm#ga7523f8576efd271c22b188eab05d7870
.
"

What do you think?
Martin

Dear PJSIP List / Maintainers, As I've just been through debugging a deadlock in our application using Pjsip / pjsua2, I would like to suggest a documentation improvement, that should help other users run into the same trap as we did. In our case, we've implemented some code in the pj::Call::onCallState() callback, that used a Mutex to modify our application state. Another thread, which took the same application state mutex, called out into PJSIP with the mutex held. It was an innocent call::getInfo() with our own Mutex held, but PJSIP deadlocked on this pattern in some, very rare, cases. To make develoeprs aware, I would suggest improving the pj::Call::onCallState() documentation as follows: " virtual void pj::Call::onCallState ( OnCallStateParam <https://www.pjsip.org/pjsip/docs/html/structpj_1_1OnCallStateParam.htm> & *prm* ) inlinevirtual Notify application when call state has changed. Application may then query the call info to get the detail call states by calling getInfo() <https://www.pjsip.org/pjsip/docs/html/classpj_1_1Call.htm#a5057b298a0f5f5a74a8d37a144ca538e> function. Note that the callback is made with the PJSUA_LOCK held. So the code should not take any resources that other Threads hold while calling into PJSIP. Parameters prm Callback parameter. References PJ_UNUSED_ARG <https://www.pjsip.org/pjlib/docs/html/group__pj__config.htm#ga7523f8576efd271c22b188eab05d7870> . " What do you think? Martin
MO
Martin Oberhuber
Wed, May 22, 2019 8:32 AM

PS here is the backtraces of our deadlock, from pjsip 2.7.1 on Linux/armhf:

Thread 18 "pjsip"

#0  0x76acadf4 in __lll_lock_wait (futex=futex@entry=0x6263c77c,
private=0) at lowlevellock.c:46    # trying to acquire our
application mutex
#1  0x76ac3c64 in __GI___pthread_mutex_lock    #
trying to acquire our application mutex#2  0x00228320 in ?? () # our
application code callback#3  0x000ffd7c in ?? () # our application
code callback; PJSUA_LOCK is apparently held#4  0x76fd0d34 in
pj::Call::processStateChange (this=this@entry=0x6263c6c0, prm=...) at
../src/pjsua2/call.cpp:723#5  0x76face6c in
pj::Endpoint::on_call_state (call_id=<optimized out>, e=0x711fe50c) at
../src/pjsua2/endpoint.cpp:1005#6  0x76eff200 in
pjsua_call_on_state_changed (inv=0x722a612c, e=0x711fe50c) at
../src/pjsua-lib/pjsua_call.c:3812#7  0x76ec192c in inv_set_state
(inv=0x722a612c, state=<optimized out>, e=<optimized out>) at
../src/pjsip-ua/sip_inv.c:317#8  0x76ebd67c in mod_inv_on_tsx_state
(tsx=0x626fcb54, e=0x711fe50c) at ../src/pjsip-ua/sip_inv.c:717#9
0x76e73aa0 in pjsip_dlg_on_tsx_state (dlg=0x6240bd9c, tsx=0x626fcb54,
e=0x711fe50c) at ../src/pjsip/sip_dialog.c:2066#10 0x76e6bce0 in
tsx_set_state (tsx=tsx@entry=0x626fcb54,
state=state@entry=PJSIP_TSX_STATE_PROCEEDING,
event_src_type=event_src_type@entry=PJSIP_EVENT_TX_MSG,
event_src=event_src@entry=0x624095f4, flag=flag@entry=0)    at
../src/pjsip/sip_transaction.c:1267#11 0x76e6e228 in
tsx_on_state_proceeding_uas (tsx=0x626fcb54, event=<optimized out>) at
../src/pjsip/sip_transaction.c:2710#12 0x76e70188 in
pjsip_tsx_send_msg (tsx=tsx@entry=0x626fcb54,
tdata=tdata@entry=0x624095f4) at
../src/pjsip/sip_transaction.c:1789#13 0x76e732dc in
pjsip_dlg_send_response (dlg=0x6240bd9c, tsx=0x626fcb54,
tdata=tdata@entry=0x624095f4) at ../src/pjsip/sip_dialog.c:1531#14
0x76ec16f8 in pjsip_inv_send_msg (inv=0x722a612c, tdata=0x624095f4) at
../src/pjsip-ua/sip_inv.c:3231#15 0x76efb7f8 in pjsua_call_answer2
(call_id=2, opt=0x0, code=180, reason=0x0, msg_data=0x0) at
../src/pjsua-lib/pjsua_call.c:2317#16 0x76efb94c in
process_pending_call_answer (call=call@entry=0x76f3d3b8
<pjsua_var+30440>) at ../src/pjsua-lib/pjsua_call.c:978#17 0x76efd280
in pjsua_call_on_incoming (rdata=rdata@entry=0x626efd4c) at
../src/pjsua-lib/pjsua_call.c:1666#18 0x76f00698 in
mod_pjsua_on_rx_request (rdata=<optimized out>) at
../src/pjsua-lib/pjsua_core.c:582#19 0x76e569fc in
pjsip_endpt_process_rx_data (endpt=0x983094, rdata=0x626efd4c,
p=<optimized out>, p_handled=0x711fe8f4)    at
../src/pjsip/sip_endpoint.c:893#20 0x76e56bd4 in endpt_on_rx_msg
(endpt=0x983094, status=<optimized out>, rdata=0x626efd4c) at
../src/pjsip/sip_endpoint.c:1043#21 0x76e5dfa8 in
pjsip_tpmgr_receive_packet (mgr=0x98e310,
rdata=rdata@entry=0x626efd4c) at ../src/pjsip/sip_transport.c:1965#22
0x76e60d68 in udp_on_read_complete (key=0x98b1f0, op_key=<optimized out>, bytes_read=1058) at ../src/pjsip/sip_transport_udp.c:170#23
0x76cba7b4 in ioqueue_dispatch_read_event (ioqueue=<optimized out>,
h=0x98b1f0) at ../src/pj/ioqueue_common_abs.c:605#24 0x76cbc1c8 in
pj_ioqueue_poll (ioqueue=0x98af40, timeout=timeout@entry=0x711fed88)
at ../src/pj/ioqueue_epoll.c:812#25 0x76e566fc in
pjsip_endpt_handle_events2 (endpt=0x983094,
max_timeout=max_timeout@entry=0x711fedc0,
p_count=p_count@entry=0x711fedbc)    at
../src/pjsip/sip_endpoint.c:744#26 0x76f01704 in pjsua_handle_events
(msec_timeout=<optimized out>) at
../src/pjsua-lib/pjsua_core.c:1988#27 0x76f01744 in worker_thread
(arg=<optimized out>) at ../src/pjsua-lib/pjsua_core.c:704#28
0x76cbdbec in thread_main (param=0x9a93f4) at
../src/pj/os_core_unix.c:541#29 0x76ac107c in start_thread
(arg=0x711ff410) at pthread_create.c:339

Thread 22 "application"

#0  0x76acadf4 in __lll_lock_wait (futex=0x982c8c, private=0) at
lowlevellock.c:46#1  0x76ac3cc8 in __GI___pthread_mutex_lock
(mutex=mutex@entry=0x982c8c) at pthread_mutex_lock.c:116#2  0x76cbd740
in pj_mutex_lock (mutex=0x982c8c) at ../src/pj/os_core_unix.c:1265#3
0x76efa7e0 in PJSUA_LOCK () at
../include/pjsua-lib/pjsua_internal.h:580
#4  pjsua_call_get_info
(call_id=2, info=info@entry=0x733fd2dc) at
../src/pjsua-lib/pjsua_call.c:1805#5  0x76fd14ac in pj::Call::getInfo
(this=0x6263c6c0) at ../src/pjsua2/call.cpp:408#6  0x001c5cbc in ?? ()

our application calling into PJSIP#7  0x000f2588 in ?? () *# holding

our application mutex*

Thanks,
Martin

On Wed, May 22, 2019 at 10:19 AM Martin Oberhuber <mober.at+
eclipse@gmail.com> wrote:

Dear PJSIP List / Maintainers,

As I've just been through debugging a deadlock in our application using
Pjsip / pjsua2, I would like to suggest a documentation improvement, that
should help other users run into the same trap as we did.

In our case, we've implemented some code in the pj::Call::onCallState()
callback, that used a Mutex to modify our application state. Another
thread, which took the same application state mutex, called out into PJSIP
with the mutex held. It was an innocent call::getInfo() with our own Mutex
held, but PJSIP deadlocked on this pattern in some, very rare, cases.

To make develoeprs aware, I would suggest improving the
pj::Call::onCallState() documentation as follows:

"
virtual void pj::Call::onCallState ( OnCallStateParam
https://www.pjsip.org/pjsip/docs/html/structpj_1_1OnCallStateParam.htm
prm ) inlinevirtual

Notify application when call state has changed. Application may then query
the call info to get the detail call states by calling getInfo()
https://www.pjsip.org/pjsip/docs/html/classpj_1_1Call.htm#a5057b298a0f5f5a74a8d37a144ca538e
function.

Note that the callback is made with the PJSUA_LOCK held. So the code
should not take any resources that other Threads hold while calling into
PJSIP.
Parameters
prm Callback parameter.

References PJ_UNUSED_ARG
https://www.pjsip.org/pjlib/docs/html/group__pj__config.htm#ga7523f8576efd271c22b188eab05d7870
.
"

What do you think?
Martin

PS here is the backtraces of our deadlock, from pjsip 2.7.1 on Linux/armhf: Thread 18 "pjsip" #0 0x76acadf4 in __lll_lock_wait (futex=futex@entry=0x6263c77c, private=0) at lowlevellock.c:46 *# trying to acquire our application mutex*#1 0x76ac3c64 in __GI___pthread_mutex_lock # trying to acquire our application mutex#2 0x00228320 in ?? () # our application code callback#3 0x000ffd7c in ?? () # our application code callback; *PJSUA_LOCK is apparently held*#4 0x76fd0d34 in pj::Call::processStateChange (this=this@entry=0x6263c6c0, prm=...) at ../src/pjsua2/call.cpp:723#5 0x76face6c in pj::Endpoint::on_call_state (call_id=<optimized out>, e=0x711fe50c) at ../src/pjsua2/endpoint.cpp:1005#6 0x76eff200 in pjsua_call_on_state_changed (inv=0x722a612c, e=0x711fe50c) at ../src/pjsua-lib/pjsua_call.c:3812#7 0x76ec192c in inv_set_state (inv=0x722a612c, state=<optimized out>, e=<optimized out>) at ../src/pjsip-ua/sip_inv.c:317#8 0x76ebd67c in mod_inv_on_tsx_state (tsx=0x626fcb54, e=0x711fe50c) at ../src/pjsip-ua/sip_inv.c:717#9 0x76e73aa0 in pjsip_dlg_on_tsx_state (dlg=0x6240bd9c, tsx=0x626fcb54, e=0x711fe50c) at ../src/pjsip/sip_dialog.c:2066#10 0x76e6bce0 in tsx_set_state (tsx=tsx@entry=0x626fcb54, state=state@entry=PJSIP_TSX_STATE_PROCEEDING, event_src_type=event_src_type@entry=PJSIP_EVENT_TX_MSG, event_src=event_src@entry=0x624095f4, flag=flag@entry=0) at ../src/pjsip/sip_transaction.c:1267#11 0x76e6e228 in tsx_on_state_proceeding_uas (tsx=0x626fcb54, event=<optimized out>) at ../src/pjsip/sip_transaction.c:2710#12 0x76e70188 in pjsip_tsx_send_msg (tsx=tsx@entry=0x626fcb54, tdata=tdata@entry=0x624095f4) at ../src/pjsip/sip_transaction.c:1789#13 0x76e732dc in pjsip_dlg_send_response (dlg=0x6240bd9c, tsx=0x626fcb54, tdata=tdata@entry=0x624095f4) at ../src/pjsip/sip_dialog.c:1531#14 0x76ec16f8 in pjsip_inv_send_msg (inv=0x722a612c, tdata=0x624095f4) at ../src/pjsip-ua/sip_inv.c:3231#15 0x76efb7f8 in pjsua_call_answer2 (call_id=2, opt=0x0, code=180, reason=0x0, msg_data=0x0) at ../src/pjsua-lib/pjsua_call.c:2317#16 0x76efb94c in process_pending_call_answer (call=call@entry=0x76f3d3b8 <pjsua_var+30440>) at ../src/pjsua-lib/pjsua_call.c:978#17 0x76efd280 in pjsua_call_on_incoming (rdata=rdata@entry=0x626efd4c) at ../src/pjsua-lib/pjsua_call.c:1666#18 0x76f00698 in mod_pjsua_on_rx_request (rdata=<optimized out>) at ../src/pjsua-lib/pjsua_core.c:582#19 0x76e569fc in pjsip_endpt_process_rx_data (endpt=0x983094, rdata=0x626efd4c, p=<optimized out>, p_handled=0x711fe8f4) at ../src/pjsip/sip_endpoint.c:893#20 0x76e56bd4 in endpt_on_rx_msg (endpt=0x983094, status=<optimized out>, rdata=0x626efd4c) at ../src/pjsip/sip_endpoint.c:1043#21 0x76e5dfa8 in pjsip_tpmgr_receive_packet (mgr=0x98e310, rdata=rdata@entry=0x626efd4c) at ../src/pjsip/sip_transport.c:1965#22 0x76e60d68 in udp_on_read_complete (key=0x98b1f0, op_key=<optimized out>, bytes_read=1058) at ../src/pjsip/sip_transport_udp.c:170#23 0x76cba7b4 in ioqueue_dispatch_read_event (ioqueue=<optimized out>, h=0x98b1f0) at ../src/pj/ioqueue_common_abs.c:605#24 0x76cbc1c8 in pj_ioqueue_poll (ioqueue=0x98af40, timeout=timeout@entry=0x711fed88) at ../src/pj/ioqueue_epoll.c:812#25 0x76e566fc in pjsip_endpt_handle_events2 (endpt=0x983094, max_timeout=max_timeout@entry=0x711fedc0, p_count=p_count@entry=0x711fedbc) at ../src/pjsip/sip_endpoint.c:744#26 0x76f01704 in pjsua_handle_events (msec_timeout=<optimized out>) at ../src/pjsua-lib/pjsua_core.c:1988#27 0x76f01744 in worker_thread (arg=<optimized out>) at ../src/pjsua-lib/pjsua_core.c:704#28 0x76cbdbec in thread_main (param=0x9a93f4) at ../src/pj/os_core_unix.c:541#29 0x76ac107c in start_thread (arg=0x711ff410) at pthread_create.c:339 Thread 22 "application" #0 0x76acadf4 in __lll_lock_wait (futex=0x982c8c, private=0) at lowlevellock.c:46#1 0x76ac3cc8 in __GI___pthread_mutex_lock (mutex=mutex@entry=0x982c8c) at pthread_mutex_lock.c:116#2 0x76cbd740 in pj_mutex_lock (mutex=0x982c8c) at ../src/pj/os_core_unix.c:1265#3 0x76efa7e0 in *PJSUA_LOCK () at ../include/pjsua-lib/pjsua_internal.h:580*#4 pjsua_call_get_info (call_id=2, info=info@entry=0x733fd2dc) at ../src/pjsua-lib/pjsua_call.c:1805#5 0x76fd14ac in pj::Call::getInfo (this=0x6263c6c0) at ../src/pjsua2/call.cpp:408#6 0x001c5cbc in ?? () # our application calling into PJSIP#7 0x000f2588 in ?? () *# holding our application mutex* Thanks, Martin On Wed, May 22, 2019 at 10:19 AM Martin Oberhuber <mober.at+ eclipse@gmail.com> wrote: > Dear PJSIP List / Maintainers, > > As I've just been through debugging a deadlock in our application using > Pjsip / pjsua2, I would like to suggest a documentation improvement, that > should help other users run into the same trap as we did. > > In our case, we've implemented some code in the pj::Call::onCallState() > callback, that used a Mutex to modify our application state. Another > thread, which took the same application state mutex, called out into PJSIP > with the mutex held. It was an innocent call::getInfo() with our own Mutex > held, but PJSIP deadlocked on this pattern in some, very rare, cases. > > To make develoeprs aware, I would suggest improving the > pj::Call::onCallState() documentation as follows: > > " > virtual void pj::Call::onCallState ( OnCallStateParam > <https://www.pjsip.org/pjsip/docs/html/structpj_1_1OnCallStateParam.htm> > & *prm* ) inlinevirtual > > Notify application when call state has changed. Application may then query > the call info to get the detail call states by calling getInfo() > <https://www.pjsip.org/pjsip/docs/html/classpj_1_1Call.htm#a5057b298a0f5f5a74a8d37a144ca538e> > function. > > Note that the callback is made with the PJSUA_LOCK held. So the code > should not take any resources that other Threads hold while calling into > PJSIP. > Parameters > prm Callback parameter. > > References PJ_UNUSED_ARG > <https://www.pjsip.org/pjlib/docs/html/group__pj__config.htm#ga7523f8576efd271c22b188eab05d7870> > . > " > > What do you think? > Martin >