Crash in pj_timer_heap_poll

HS
Hugo Sobral
Fri, Jun 28, 2019 6:29 PM

Hi,

Sometimes I get a crash in "pj_timer_heap_poll"
I notice that there was some corrections about that in milestone 2.9

https://trac.pjsip.org/repos/ticket/2172
https://trac.pjsip.org/repos/ticket/2176

But these corrections does not solve my issue.

After some time debugging the issue, it looks like that it happens when the
code:


if (node->cb)
(*node->cb)(ht, node);

in function  'pj_timer_heap_poll' is invoked with the function
'on_cache_timeout' at the same time that 'pj_stun_session' is being
destroyed (this destroy is made by application's code because the call
terminates).

I cannot understand exactly how this happens,
but I think that there is a time gap in function 'pj_timer_heap_poll' after
the 'unlock_timer_heap(ht)' that 'pj_stun_session' can be destroyed and
crash inside 'on_cache_timeout'

Is this a known issue in the specified functions?

Best regards,
Hugo Sobral

Hi, Sometimes I get a crash in "pj_timer_heap_poll" I notice that there was some corrections about that in milestone 2.9 https://trac.pjsip.org/repos/ticket/2172 https://trac.pjsip.org/repos/ticket/2176 But these corrections does not solve my issue. After some time debugging the issue, it looks like that it happens when the code: ------------------------------- if (node->cb) (*node->cb)(ht, node); ------------------------------- in function 'pj_timer_heap_poll' is invoked with the function 'on_cache_timeout' at the same time that 'pj_stun_session' is being destroyed (this destroy is made by application's code because the call terminates). I cannot understand exactly how this happens, but I think that there is a time gap in function 'pj_timer_heap_poll' after the 'unlock_timer_heap(ht)' that 'pj_stun_session' can be destroyed and crash inside 'on_cache_timeout' Is this a known issue in the specified functions? Best regards, Hugo Sobral
NI
Nanang Izzuddin
Wed, Jul 3, 2019 10:03 AM

Hi Hugo,

The STUN tdata timer in STUN session (i.e: tdata->res_timer) is scheduled
using 'pj_stun_session' group lock, so normally it should be safe from race
condition between 'pj_stun_session' destroy and 'on_cache_timeout', but
perhaps there is a corner case, could you also send the call stack traces
when the crash occurs?

BR,
nanang

On Sat, Jun 29, 2019 at 1:30 AM Hugo Sobral hugo.sobral@gmail.com wrote:

Hi,

Sometimes I get a crash in "pj_timer_heap_poll"
I notice that there was some corrections about that in milestone 2.9

https://trac.pjsip.org/repos/ticket/2172
https://trac.pjsip.org/repos/ticket/2176

But these corrections does not solve my issue.

After some time debugging the issue, it looks like that it happens when
the code:


if (node->cb)
(*node->cb)(ht, node);

in function  'pj_timer_heap_poll' is invoked with the function
'on_cache_timeout' at the same time that 'pj_stun_session' is being
destroyed (this destroy is made by application's code because the call
terminates).

I cannot understand exactly how this happens,
but I think that there is a time gap in function 'pj_timer_heap_poll'
after the 'unlock_timer_heap(ht)' that 'pj_stun_session' can be destroyed
and crash inside 'on_cache_timeout'

Is this a known issue in the specified functions?

Best regards,
Hugo Sobral


Visit our blog: http://blog.pjsip.org

pjsip mailing list
pjsip@lists.pjsip.org
http://lists.pjsip.org/mailman/listinfo/pjsip_lists.pjsip.org

Hi Hugo, The STUN tdata timer in STUN session (i.e: tdata->res_timer) is scheduled using 'pj_stun_session' group lock, so normally it should be safe from race condition between 'pj_stun_session' destroy and 'on_cache_timeout', but perhaps there is a corner case, could you also send the call stack traces when the crash occurs? BR, nanang On Sat, Jun 29, 2019 at 1:30 AM Hugo Sobral <hugo.sobral@gmail.com> wrote: > Hi, > > Sometimes I get a crash in "pj_timer_heap_poll" > I notice that there was some corrections about that in milestone 2.9 > > https://trac.pjsip.org/repos/ticket/2172 > https://trac.pjsip.org/repos/ticket/2176 > > But these corrections does not solve my issue. > > After some time debugging the issue, it looks like that it happens when > the code: > > ------------------------------- > if (node->cb) > (*node->cb)(ht, node); > ------------------------------- > in function 'pj_timer_heap_poll' is invoked with the function > 'on_cache_timeout' at the same time that 'pj_stun_session' is being > destroyed (this destroy is made by application's code because the call > terminates). > > I cannot understand exactly how this happens, > but I think that there is a time gap in function 'pj_timer_heap_poll' > after the 'unlock_timer_heap(ht)' that 'pj_stun_session' can be destroyed > and crash inside 'on_cache_timeout' > > > Is this a known issue in the specified functions? > > Best regards, > Hugo Sobral > > > > > > > _______________________________________________ > Visit our blog: http://blog.pjsip.org > > pjsip mailing list > pjsip@lists.pjsip.org > http://lists.pjsip.org/mailman/listinfo/pjsip_lists.pjsip.org >
HS
Hugo Sobral
Fri, Jul 5, 2019 12:09 PM

Hi Nanang,

It was in this line:

PJ_LOG(5,(SNAME(tdata->sess), "Response cache deleted"));

Exception thrown: read access violation.
tdata->sess was 0xFFFFFFFFFFFFFFF7. occurred
pj_wrapper.dll!on_cache_timeout(pj_timer_heap_t * timer_heap,
pj_timer_entry * entry) Line 226
at pjnath\src\pjnath\stun_session.c(226)
pj_wrapper.dll!pj_timer_heap_poll(pj_timer_heap_t * ht, pj_time_val *
next_delay) Line 651
at pjlib\src\pj\timer.c(651)
[Managed to Native Transition]

From debugger, current values:

tdata  0x000001c5f264f828 {prev=0xdddddddddddddddd {prev=??? next=???
pool=??? ...} next=0xdddddddddddddddd {...} ...}  pj_stun_tx_data *
prev  0xdddddddddddddddd {prev=??? next=??? pool=??? ...}
pj_stun_tx_data *
next  0xdddddddddddddddd {prev=??? next=??? pool=??? ...}
pj_stun_tx_data *
pool  0xdddddddddddddddd {prev=??? next=???
obj_name=0xdddddddddddddded <Error reading characters of string.> ...}
pj_pool_t *
sess  0xdddddddddddddddd {cfg=??? pool=??? grp_lock=??? ...}
pj_stun_session *
msg  0xdddddddddddddddd {hdr={type=??? length=??? magic=??? ...}
attr_count=??? attr=0xddddddddddddddf5 {...} }  pj_stun_msg *
token  0xdddddddddddddddd  void *
client_tsx  0xdddddddddddddddd {obj_name=0xdddddddddddddddd <Error reading characters of string.> cb={on_complete=...} ...}
pj_stun_client_tsx *
retransmit  -572662307  int
msg_magic  3722304989  unsigned int
msg_key  0x000001c5f264f868 "ÝÝÝÝÝÝÝÝÝÝÝÝ...  unsigned char[12]
auth_info  {realm={ptr=0xdddddddddddddddd <Error reading characters of string.> slen=-2459565876494606883 } username=...}
pj_stun_req_cred_info
pkt  0xdddddddddddddddd  void *
max_len  3722304989  unsigned int
pkt_size  15987178197214944733  unsigned __int64
addr_len  3722304989  unsigned int
dst_addr  0xdddddddddddddddd  const void *
res_timer  {user_data=0xdddddddddddddddd id=-572662307
cb=0xdddddddddddddddd ...}  pj_timer_entry

tdata->sess  0xdddddddddddddddd {cfg=??? pool=??? grp_lock=??? ...}
pj_stun_session *
cfg  <Unable to read memory>
pool  <Unable to read memory>
grp_lock  <Unable to read memory>
cb  {on_send_msg=??? on_rx_request=??? on_request_complete=??? ...}
pj_stun_session_cb
user_data  <Unable to read memory>
is_destroying  <Unable to read memory>
use_fingerprint  <Unable to read memory>
rx_pool  <Unable to read memory>
dump_buf  0xddddddddddddde2d <Error reading characters of string.>
char[1000]
log_flag  <Unable to read memory>
auth_type  <Unable to read memory>
cred  {type=??? data={static_cred={realm={ptr=??? slen=??? }
username={ptr=??? slen=??? } data_type=??? ...} ...} }  pj_stun_auth_cred
auth_retry  <Unable to read memory>
next_nonce  {ptr=??? slen=??? }  pj_str_t
server_realm  {ptr=??? slen=??? }  pj_str_t
srv_name  {ptr=??? slen=??? }  pj_str_t
pending_request_list  {prev=??? next=??? pool=??? ...}
pj_stun_tx_data
cached_response_list  {prev=??? next=??? pool=??? ...}
pj_stun_tx_data

Best Regards,
Hugo Sobral

On Wed, 3 Jul 2019 at 11:04, Nanang Izzuddin nanang@pjsip.org wrote:

Hi Hugo,

The STUN tdata timer in STUN session (i.e: tdata->res_timer) is scheduled
using 'pj_stun_session' group lock, so normally it should be safe from race
condition between 'pj_stun_session' destroy and 'on_cache_timeout', but
perhaps there is a corner case, could you also send the call stack traces
when the crash occurs?

BR,
nanang

On Sat, Jun 29, 2019 at 1:30 AM Hugo Sobral hugo.sobral@gmail.com wrote:

Hi,

Sometimes I get a crash in "pj_timer_heap_poll"
I notice that there was some corrections about that in milestone 2.9

https://trac.pjsip.org/repos/ticket/2172
https://trac.pjsip.org/repos/ticket/2176

But these corrections does not solve my issue.

After some time debugging the issue, it looks like that it happens when
the code:


if (node->cb)
(*node->cb)(ht, node);

in function  'pj_timer_heap_poll' is invoked with the function
'on_cache_timeout' at the same time that 'pj_stun_session' is being
destroyed (this destroy is made by application's code because the call
terminates).

I cannot understand exactly how this happens,
but I think that there is a time gap in function 'pj_timer_heap_poll'
after the 'unlock_timer_heap(ht)' that 'pj_stun_session' can be destroyed
and crash inside 'on_cache_timeout'

Is this a known issue in the specified functions?

Best regards,
Hugo Sobral


Visit our blog: http://blog.pjsip.org

pjsip mailing list
pjsip@lists.pjsip.org
http://lists.pjsip.org/mailman/listinfo/pjsip_lists.pjsip.org

Hi Nanang, It was in this line: PJ_LOG(5,(SNAME(tdata->sess), "Response cache deleted")); Exception thrown: read access violation. tdata->sess was 0xFFFFFFFFFFFFFFF7. occurred pj_wrapper.dll!on_cache_timeout(pj_timer_heap_t * timer_heap, pj_timer_entry * entry) Line 226 at pjnath\src\pjnath\stun_session.c(226) pj_wrapper.dll!pj_timer_heap_poll(pj_timer_heap_t * ht, pj_time_val * next_delay) Line 651 at pjlib\src\pj\timer.c(651) [Managed to Native Transition] >From debugger, current values: tdata 0x000001c5f264f828 {prev=0xdddddddddddddddd {prev=??? next=??? pool=??? ...} next=0xdddddddddddddddd {...} ...} pj_stun_tx_data * prev 0xdddddddddddddddd {prev=??? next=??? pool=??? ...} pj_stun_tx_data * next 0xdddddddddddddddd {prev=??? next=??? pool=??? ...} pj_stun_tx_data * pool 0xdddddddddddddddd {prev=??? next=??? obj_name=0xdddddddddddddded <Error reading characters of string.> ...} pj_pool_t * sess 0xdddddddddddddddd {cfg=??? pool=??? grp_lock=??? ...} pj_stun_session * msg 0xdddddddddddddddd {hdr={type=??? length=??? magic=??? ...} attr_count=??? attr=0xddddddddddddddf5 {...} } pj_stun_msg * token 0xdddddddddddddddd void * client_tsx 0xdddddddddddddddd {obj_name=0xdddddddddddddddd <Error reading characters of string.> cb={on_complete=...} ...} pj_stun_client_tsx * retransmit -572662307 int msg_magic 3722304989 unsigned int msg_key 0x000001c5f264f868 "ÝÝÝÝÝÝÝÝÝÝÝÝ... unsigned char[12] auth_info {realm={ptr=0xdddddddddddddddd <Error reading characters of string.> slen=-2459565876494606883 } username=...} pj_stun_req_cred_info pkt 0xdddddddddddddddd void * max_len 3722304989 unsigned int pkt_size 15987178197214944733 unsigned __int64 addr_len 3722304989 unsigned int dst_addr 0xdddddddddddddddd const void * res_timer {user_data=0xdddddddddddddddd id=-572662307 cb=0xdddddddddddddddd ...} pj_timer_entry tdata->sess 0xdddddddddddddddd {cfg=??? pool=??? grp_lock=??? ...} pj_stun_session * cfg <Unable to read memory> pool <Unable to read memory> grp_lock <Unable to read memory> cb {on_send_msg=??? on_rx_request=??? on_request_complete=??? ...} pj_stun_session_cb user_data <Unable to read memory> is_destroying <Unable to read memory> use_fingerprint <Unable to read memory> rx_pool <Unable to read memory> dump_buf 0xddddddddddddde2d <Error reading characters of string.> char[1000] log_flag <Unable to read memory> auth_type <Unable to read memory> cred {type=??? data={static_cred={realm={ptr=??? slen=??? } username={ptr=??? slen=??? } data_type=??? ...} ...} } pj_stun_auth_cred auth_retry <Unable to read memory> next_nonce {ptr=??? slen=??? } pj_str_t server_realm {ptr=??? slen=??? } pj_str_t srv_name {ptr=??? slen=??? } pj_str_t pending_request_list {prev=??? next=??? pool=??? ...} pj_stun_tx_data cached_response_list {prev=??? next=??? pool=??? ...} pj_stun_tx_data Best Regards, Hugo Sobral On Wed, 3 Jul 2019 at 11:04, Nanang Izzuddin <nanang@pjsip.org> wrote: > Hi Hugo, > > The STUN tdata timer in STUN session (i.e: tdata->res_timer) is scheduled > using 'pj_stun_session' group lock, so normally it should be safe from race > condition between 'pj_stun_session' destroy and 'on_cache_timeout', but > perhaps there is a corner case, could you also send the call stack traces > when the crash occurs? > > BR, > nanang > > > On Sat, Jun 29, 2019 at 1:30 AM Hugo Sobral <hugo.sobral@gmail.com> wrote: > >> Hi, >> >> Sometimes I get a crash in "pj_timer_heap_poll" >> I notice that there was some corrections about that in milestone 2.9 >> >> https://trac.pjsip.org/repos/ticket/2172 >> https://trac.pjsip.org/repos/ticket/2176 >> >> But these corrections does not solve my issue. >> >> After some time debugging the issue, it looks like that it happens when >> the code: >> >> ------------------------------- >> if (node->cb) >> (*node->cb)(ht, node); >> ------------------------------- >> in function 'pj_timer_heap_poll' is invoked with the function >> 'on_cache_timeout' at the same time that 'pj_stun_session' is being >> destroyed (this destroy is made by application's code because the call >> terminates). >> >> I cannot understand exactly how this happens, >> but I think that there is a time gap in function 'pj_timer_heap_poll' >> after the 'unlock_timer_heap(ht)' that 'pj_stun_session' can be destroyed >> and crash inside 'on_cache_timeout' >> >> >> Is this a known issue in the specified functions? >> >> Best regards, >> Hugo Sobral >> >> >> >> >> >> >> _______________________________________________ >> Visit our blog: http://blog.pjsip.org >> >> pjsip mailing list >> pjsip@lists.pjsip.org >> http://lists.pjsip.org/mailman/listinfo/pjsip_lists.pjsip.org >> > _______________________________________________ > Visit our blog: http://blog.pjsip.org > > pjsip mailing list > pjsip@lists.pjsip.org > http://lists.pjsip.org/mailman/listinfo/pjsip_lists.pjsip.org >
NI
Nanang Izzuddin
Tue, Jul 9, 2019 11:15 AM

Hi Hugo,

Instead of 'pj_stun_session', it is perhaps the 'pj_stun_tx_data' that had
been destroyed when 'on_cache_timeout' was invoked, and this scenario seems
to be possible indeed. So just created a patch (attached), unfortunately I
could not easily reproduce the issue here, so if the issue is reproducible
there, please help us testing the patch, thank you in advance.

BR,
nanang

On Fri, Jul 5, 2019 at 7:10 PM Hugo Sobral hugo.sobral@gmail.com wrote:

Hi Nanang,

It was in this line:

 PJ_LOG(5,(SNAME(tdata->sess), "Response cache deleted"));

Exception thrown: read access violation.
tdata->sess was 0xFFFFFFFFFFFFFFF7. occurred
pj_wrapper.dll!on_cache_timeout(pj_timer_heap_t * timer_heap,
pj_timer_entry * entry) Line 226
at pjnath\src\pjnath\stun_session.c(226)
pj_wrapper.dll!pj_timer_heap_poll(pj_timer_heap_t * ht, pj_time_val *
next_delay) Line 651
at pjlib\src\pj\timer.c(651)
[Managed to Native Transition]

From debugger, current values:

tdata   0x000001c5f264f828 {prev=0xdddddddddddddddd {prev=??? next=???

pool=??? ...} next=0xdddddddddddddddd {...} ...}  pj_stun_tx_data *
prev  0xdddddddddddddddd {prev=??? next=??? pool=??? ...}
pj_stun_tx_data *
next  0xdddddddddddddddd {prev=??? next=??? pool=??? ...}
pj_stun_tx_data *
pool  0xdddddddddddddddd {prev=??? next=???
obj_name=0xdddddddddddddded <Error reading characters of string.> ...}
pj_pool_t *
sess  0xdddddddddddddddd {cfg=??? pool=??? grp_lock=??? ...}
pj_stun_session *
msg  0xdddddddddddddddd {hdr={type=??? length=??? magic=??? ...}
attr_count=??? attr=0xddddddddddddddf5 {...} }  pj_stun_msg *
token  0xdddddddddddddddd  void *
client_tsx  0xdddddddddddddddd {obj_name=0xdddddddddddddddd <Error reading characters of string.> cb={on_complete=...} ...}
pj_stun_client_tsx *
retransmit  -572662307  int
msg_magic  3722304989  unsigned int
msg_key  0x000001c5f264f868 "ÝÝÝÝÝÝÝÝÝÝÝÝ...  unsigned char[12]
auth_info  {realm={ptr=0xdddddddddddddddd <Error reading characters of string.> slen=-2459565876494606883 } username=...}
pj_stun_req_cred_info
pkt  0xdddddddddddddddd  void *
max_len  3722304989  unsigned int
pkt_size  15987178197214944733  unsigned __int64
addr_len  3722304989  unsigned int
dst_addr  0xdddddddddddddddd  const void *
res_timer  {user_data=0xdddddddddddddddd id=-572662307
cb=0xdddddddddddddddd ...}  pj_timer_entry

tdata->sess   0xdddddddddddddddd {cfg=??? pool=??? grp_lock=??? ...}

pj_stun_session *
cfg  <Unable to read memory>
pool  <Unable to read memory>
grp_lock  <Unable to read memory>
cb  {on_send_msg=??? on_rx_request=??? on_request_complete=??? ...}
pj_stun_session_cb
user_data  <Unable to read memory>
is_destroying  <Unable to read memory>
use_fingerprint  <Unable to read memory>
rx_pool  <Unable to read memory>
dump_buf  0xddddddddddddde2d <Error reading characters of string.>
char[1000]
log_flag  <Unable to read memory>
auth_type  <Unable to read memory>
cred  {type=??? data={static_cred={realm={ptr=??? slen=??? }
username={ptr=??? slen=??? } data_type=??? ...} ...} }  pj_stun_auth_cred
auth_retry  <Unable to read memory>
next_nonce  {ptr=??? slen=??? }  pj_str_t
server_realm  {ptr=??? slen=??? }  pj_str_t
srv_name  {ptr=??? slen=??? }  pj_str_t
pending_request_list  {prev=??? next=??? pool=??? ...}
pj_stun_tx_data
cached_response_list  {prev=??? next=??? pool=??? ...}
pj_stun_tx_data

Best Regards,
Hugo Sobral

On Wed, 3 Jul 2019 at 11:04, Nanang Izzuddin nanang@pjsip.org wrote:

Hi Hugo,

The STUN tdata timer in STUN session (i.e: tdata->res_timer) is scheduled
using 'pj_stun_session' group lock, so normally it should be safe from race
condition between 'pj_stun_session' destroy and 'on_cache_timeout', but
perhaps there is a corner case, could you also send the call stack traces
when the crash occurs?

BR,
nanang

On Sat, Jun 29, 2019 at 1:30 AM Hugo Sobral hugo.sobral@gmail.com
wrote:

Hi,

Sometimes I get a crash in "pj_timer_heap_poll"
I notice that there was some corrections about that in milestone 2.9

https://trac.pjsip.org/repos/ticket/2172
https://trac.pjsip.org/repos/ticket/2176

But these corrections does not solve my issue.

After some time debugging the issue, it looks like that it happens when
the code:


if (node->cb)
(*node->cb)(ht, node);

in function  'pj_timer_heap_poll' is invoked with the function
'on_cache_timeout' at the same time that 'pj_stun_session' is being
destroyed (this destroy is made by application's code because the call
terminates).

I cannot understand exactly how this happens,
but I think that there is a time gap in function 'pj_timer_heap_poll'
after the 'unlock_timer_heap(ht)' that 'pj_stun_session' can be destroyed
and crash inside 'on_cache_timeout'

Is this a known issue in the specified functions?

Best regards,
Hugo Sobral


Visit our blog: http://blog.pjsip.org

pjsip mailing list
pjsip@lists.pjsip.org
http://lists.pjsip.org/mailman/listinfo/pjsip_lists.pjsip.org

Hi Hugo, Instead of 'pj_stun_session', it is perhaps the 'pj_stun_tx_data' that had been destroyed when 'on_cache_timeout' was invoked, and this scenario seems to be possible indeed. So just created a patch (attached), unfortunately I could not easily reproduce the issue here, so if the issue is reproducible there, please help us testing the patch, thank you in advance. BR, nanang On Fri, Jul 5, 2019 at 7:10 PM Hugo Sobral <hugo.sobral@gmail.com> wrote: > Hi Nanang, > > It was in this line: > > PJ_LOG(5,(SNAME(tdata->sess), "Response cache deleted")); > > > > Exception thrown: read access violation. > tdata->sess was 0xFFFFFFFFFFFFFFF7. occurred > pj_wrapper.dll!on_cache_timeout(pj_timer_heap_t * timer_heap, > pj_timer_entry * entry) Line 226 > at pjnath\src\pjnath\stun_session.c(226) > pj_wrapper.dll!pj_timer_heap_poll(pj_timer_heap_t * ht, pj_time_val * > next_delay) Line 651 > at pjlib\src\pj\timer.c(651) > [Managed to Native Transition] > > > From debugger, current values: > > tdata 0x000001c5f264f828 {prev=0xdddddddddddddddd {prev=??? next=??? > pool=??? ...} next=0xdddddddddddddddd {...} ...} pj_stun_tx_data * > prev 0xdddddddddddddddd {prev=??? next=??? pool=??? ...} > pj_stun_tx_data * > next 0xdddddddddddddddd {prev=??? next=??? pool=??? ...} > pj_stun_tx_data * > pool 0xdddddddddddddddd {prev=??? next=??? > obj_name=0xdddddddddddddded <Error reading characters of string.> ...} > pj_pool_t * > sess 0xdddddddddddddddd {cfg=??? pool=??? grp_lock=??? ...} > pj_stun_session * > msg 0xdddddddddddddddd {hdr={type=??? length=??? magic=??? ...} > attr_count=??? attr=0xddddddddddddddf5 {...} } pj_stun_msg * > token 0xdddddddddddddddd void * > client_tsx 0xdddddddddddddddd {obj_name=0xdddddddddddddddd <Error > reading characters of string.> cb={on_complete=...} ...} > pj_stun_client_tsx * > retransmit -572662307 int > msg_magic 3722304989 unsigned int > msg_key 0x000001c5f264f868 "ÝÝÝÝÝÝÝÝÝÝÝÝ... unsigned char[12] > auth_info {realm={ptr=0xdddddddddddddddd <Error reading characters > of string.> slen=-2459565876494606883 } username=...} > pj_stun_req_cred_info > pkt 0xdddddddddddddddd void * > max_len 3722304989 unsigned int > pkt_size 15987178197214944733 unsigned __int64 > addr_len 3722304989 unsigned int > dst_addr 0xdddddddddddddddd const void * > res_timer {user_data=0xdddddddddddddddd id=-572662307 > cb=0xdddddddddddddddd ...} pj_timer_entry > > > tdata->sess 0xdddddddddddddddd {cfg=??? pool=??? grp_lock=??? ...} > pj_stun_session * > cfg <Unable to read memory> > pool <Unable to read memory> > grp_lock <Unable to read memory> > cb {on_send_msg=??? on_rx_request=??? on_request_complete=??? ...} > pj_stun_session_cb > user_data <Unable to read memory> > is_destroying <Unable to read memory> > use_fingerprint <Unable to read memory> > rx_pool <Unable to read memory> > dump_buf 0xddddddddddddde2d <Error reading characters of string.> > char[1000] > log_flag <Unable to read memory> > auth_type <Unable to read memory> > cred {type=??? data={static_cred={realm={ptr=??? slen=??? } > username={ptr=??? slen=??? } data_type=??? ...} ...} } pj_stun_auth_cred > auth_retry <Unable to read memory> > next_nonce {ptr=??? slen=??? } pj_str_t > server_realm {ptr=??? slen=??? } pj_str_t > srv_name {ptr=??? slen=??? } pj_str_t > pending_request_list {prev=??? next=??? pool=??? ...} > pj_stun_tx_data > cached_response_list {prev=??? next=??? pool=??? ...} > pj_stun_tx_data > > Best Regards, > Hugo Sobral > > On Wed, 3 Jul 2019 at 11:04, Nanang Izzuddin <nanang@pjsip.org> wrote: > >> Hi Hugo, >> >> The STUN tdata timer in STUN session (i.e: tdata->res_timer) is scheduled >> using 'pj_stun_session' group lock, so normally it should be safe from race >> condition between 'pj_stun_session' destroy and 'on_cache_timeout', but >> perhaps there is a corner case, could you also send the call stack traces >> when the crash occurs? >> >> BR, >> nanang >> >> >> On Sat, Jun 29, 2019 at 1:30 AM Hugo Sobral <hugo.sobral@gmail.com> >> wrote: >> >>> Hi, >>> >>> Sometimes I get a crash in "pj_timer_heap_poll" >>> I notice that there was some corrections about that in milestone 2.9 >>> >>> https://trac.pjsip.org/repos/ticket/2172 >>> https://trac.pjsip.org/repos/ticket/2176 >>> >>> But these corrections does not solve my issue. >>> >>> After some time debugging the issue, it looks like that it happens when >>> the code: >>> >>> ------------------------------- >>> if (node->cb) >>> (*node->cb)(ht, node); >>> ------------------------------- >>> in function 'pj_timer_heap_poll' is invoked with the function >>> 'on_cache_timeout' at the same time that 'pj_stun_session' is being >>> destroyed (this destroy is made by application's code because the call >>> terminates). >>> >>> I cannot understand exactly how this happens, >>> but I think that there is a time gap in function 'pj_timer_heap_poll' >>> after the 'unlock_timer_heap(ht)' that 'pj_stun_session' can be destroyed >>> and crash inside 'on_cache_timeout' >>> >>> >>> Is this a known issue in the specified functions? >>> >>> Best regards, >>> Hugo Sobral >>> >>> >>> >>> >>> >>> >>> _______________________________________________ >>> Visit our blog: http://blog.pjsip.org >>> >>> pjsip mailing list >>> pjsip@lists.pjsip.org >>> http://lists.pjsip.org/mailman/listinfo/pjsip_lists.pjsip.org >>> >> _______________________________________________ >> Visit our blog: http://blog.pjsip.org >> >> pjsip mailing list >> pjsip@lists.pjsip.org >> http://lists.pjsip.org/mailman/listinfo/pjsip_lists.pjsip.org >> > _______________________________________________ > Visit our blog: http://blog.pjsip.org > > pjsip mailing list > pjsip@lists.pjsip.org > http://lists.pjsip.org/mailman/listinfo/pjsip_lists.pjsip.org >
HS
Hugo Sobral
Wed, Jul 10, 2019 5:34 PM

Hi Nanang,

I have the same result but in a different location

Exception thrown: read access violation.
atomic_var was 0xFFFFFFFFFFFFFFFF. occurred
pj_wrapper.dll!pj_atomic_dec_and_get(pj_atomic_t * atomic_var) Line 800
at pjlib\src\pj\os_core_win32.c(800)
pj_wrapper.dll!grp_lock_dec_ref(pj_grp_lock_t * glock) Line 554
at pjlib\src\pj\lock.c(554)
pj_wrapper.dll!pj_grp_lock_dec_ref(pj_grp_lock_t * glock) Line 647
at pjlib\src\pj\lock.c(647)
pj_wrapper.dll!pj_timer_heap_poll(pj_timer_heap_t * ht, pj_time_val *
next_delay) Line 654
at pjlib\src\pj\timer.c(654)

Some vars from debugger:

&atomic_var 0x0000002f293fe670 {0xdddddddddddddddd {value=??? }}
pj_atomic_t * *
0xdddddddddddddddd {value=??? } pj_atomic_t *
value <Unable to read memory>

&atomic_var->value 0xdddddddddddddddd {???} long *
<Unable to read memory> long

On Tue, 9 Jul 2019 at 12:16, Nanang Izzuddin nanang@pjsip.org wrote:

Hi Hugo,

Instead of 'pj_stun_session', it is perhaps the 'pj_stun_tx_data' that had
been destroyed when 'on_cache_timeout' was invoked, and this scenario seems
to be possible indeed. So just created a patch (attached), unfortunately I
could not easily reproduce the issue here, so if the issue is reproducible
there, please help us testing the patch, thank you in advance.

BR,
nanang

On Fri, Jul 5, 2019 at 7:10 PM Hugo Sobral hugo.sobral@gmail.com wrote:

Hi Nanang,

It was in this line:

 PJ_LOG(5,(SNAME(tdata->sess), "Response cache deleted"));

Exception thrown: read access violation.
tdata->sess was 0xFFFFFFFFFFFFFFF7. occurred
pj_wrapper.dll!on_cache_timeout(pj_timer_heap_t * timer_heap,
pj_timer_entry * entry) Line 226
at pjnath\src\pjnath\stun_session.c(226)
pj_wrapper.dll!pj_timer_heap_poll(pj_timer_heap_t * ht, pj_time_val *
next_delay) Line 651
at pjlib\src\pj\timer.c(651)
[Managed to Native Transition]

From debugger, current values:

tdata   0x000001c5f264f828 {prev=0xdddddddddddddddd {prev=??? next=???

pool=??? ...} next=0xdddddddddddddddd {...} ...}  pj_stun_tx_data *
prev  0xdddddddddddddddd {prev=??? next=??? pool=??? ...}
pj_stun_tx_data *
next  0xdddddddddddddddd {prev=??? next=??? pool=??? ...}
pj_stun_tx_data *
pool  0xdddddddddddddddd {prev=??? next=???
obj_name=0xdddddddddddddded <Error reading characters of string.> ...}
pj_pool_t *
sess  0xdddddddddddddddd {cfg=??? pool=??? grp_lock=??? ...}
pj_stun_session *
msg  0xdddddddddddddddd {hdr={type=??? length=??? magic=??? ...}
attr_count=??? attr=0xddddddddddddddf5 {...} }  pj_stun_msg *
token  0xdddddddddddddddd  void *
client_tsx  0xdddddddddddddddd {obj_name=0xdddddddddddddddd <Error reading characters of string.> cb={on_complete=...} ...}
pj_stun_client_tsx *
retransmit  -572662307  int
msg_magic  3722304989  unsigned int
msg_key  0x000001c5f264f868 "ÝÝÝÝÝÝÝÝÝÝÝÝ...  unsigned char[12]
auth_info  {realm={ptr=0xdddddddddddddddd <Error reading characters of string.> slen=-2459565876494606883 } username=...}
pj_stun_req_cred_info
pkt  0xdddddddddddddddd  void *
max_len  3722304989  unsigned int
pkt_size  15987178197214944733  unsigned __int64
addr_len  3722304989  unsigned int
dst_addr  0xdddddddddddddddd  const void *
res_timer  {user_data=0xdddddddddddddddd id=-572662307
cb=0xdddddddddddddddd ...}  pj_timer_entry

tdata->sess   0xdddddddddddddddd {cfg=??? pool=??? grp_lock=??? ...}

pj_stun_session *
cfg  <Unable to read memory>
pool  <Unable to read memory>
grp_lock  <Unable to read memory>
cb  {on_send_msg=??? on_rx_request=??? on_request_complete=???
...}  pj_stun_session_cb
user_data  <Unable to read memory>
is_destroying  <Unable to read memory>
use_fingerprint  <Unable to read memory>
rx_pool  <Unable to read memory>
dump_buf  0xddddddddddddde2d <Error reading characters of string.>
char[1000]
log_flag  <Unable to read memory>
auth_type  <Unable to read memory>
cred  {type=??? data={static_cred={realm={ptr=??? slen=??? }
username={ptr=??? slen=??? } data_type=??? ...} ...} }  pj_stun_auth_cred
auth_retry  <Unable to read memory>
next_nonce  {ptr=??? slen=??? }  pj_str_t
server_realm  {ptr=??? slen=??? }  pj_str_t
srv_name  {ptr=??? slen=??? }  pj_str_t
pending_request_list  {prev=??? next=??? pool=??? ...}
pj_stun_tx_data
cached_response_list  {prev=??? next=??? pool=??? ...}
pj_stun_tx_data

Best Regards,
Hugo Sobral

On Wed, 3 Jul 2019 at 11:04, Nanang Izzuddin nanang@pjsip.org wrote:

Hi Hugo,

The STUN tdata timer in STUN session (i.e: tdata->res_timer) is
scheduled using 'pj_stun_session' group lock, so normally it should be safe
from race condition between 'pj_stun_session' destroy and
'on_cache_timeout', but perhaps there is a corner case, could you also send
the call stack traces when the crash occurs?

BR,
nanang

On Sat, Jun 29, 2019 at 1:30 AM Hugo Sobral hugo.sobral@gmail.com
wrote:

Hi,

Sometimes I get a crash in "pj_timer_heap_poll"
I notice that there was some corrections about that in milestone 2.9

https://trac.pjsip.org/repos/ticket/2172
https://trac.pjsip.org/repos/ticket/2176

But these corrections does not solve my issue.

After some time debugging the issue, it looks like that it happens when
the code:


if (node->cb)
(*node->cb)(ht, node);

in function  'pj_timer_heap_poll' is invoked with the function
'on_cache_timeout' at the same time that 'pj_stun_session' is being
destroyed (this destroy is made by application's code because the call
terminates).

I cannot understand exactly how this happens,
but I think that there is a time gap in function 'pj_timer_heap_poll'
after the 'unlock_timer_heap(ht)' that 'pj_stun_session' can be destroyed
and crash inside 'on_cache_timeout'

Is this a known issue in the specified functions?

Best regards,
Hugo Sobral


Visit our blog: http://blog.pjsip.org

pjsip mailing list
pjsip@lists.pjsip.org
http://lists.pjsip.org/mailman/listinfo/pjsip_lists.pjsip.org

Hi Nanang, I have the same result but in a different location Exception thrown: read access violation. atomic_var was 0xFFFFFFFFFFFFFFFF. occurred pj_wrapper.dll!pj_atomic_dec_and_get(pj_atomic_t * atomic_var) Line 800 at pjlib\src\pj\os_core_win32.c(800) pj_wrapper.dll!grp_lock_dec_ref(pj_grp_lock_t * glock) Line 554 at pjlib\src\pj\lock.c(554) pj_wrapper.dll!pj_grp_lock_dec_ref(pj_grp_lock_t * glock) Line 647 at pjlib\src\pj\lock.c(647) pj_wrapper.dll!pj_timer_heap_poll(pj_timer_heap_t * ht, pj_time_val * next_delay) Line 654 at pjlib\src\pj\timer.c(654) Some vars from debugger: &atomic_var 0x0000002f293fe670 {0xdddddddddddddddd {value=??? }} pj_atomic_t * * 0xdddddddddddddddd {value=??? } pj_atomic_t * value <Unable to read memory> &atomic_var->value 0xdddddddddddddddd {???} long * <Unable to read memory> long On Tue, 9 Jul 2019 at 12:16, Nanang Izzuddin <nanang@pjsip.org> wrote: > Hi Hugo, > > Instead of 'pj_stun_session', it is perhaps the 'pj_stun_tx_data' that had > been destroyed when 'on_cache_timeout' was invoked, and this scenario seems > to be possible indeed. So just created a patch (attached), unfortunately I > could not easily reproduce the issue here, so if the issue is reproducible > there, please help us testing the patch, thank you in advance. > > BR, > nanang > > > On Fri, Jul 5, 2019 at 7:10 PM Hugo Sobral <hugo.sobral@gmail.com> wrote: > >> Hi Nanang, >> >> It was in this line: >> >> PJ_LOG(5,(SNAME(tdata->sess), "Response cache deleted")); >> >> >> >> Exception thrown: read access violation. >> tdata->sess was 0xFFFFFFFFFFFFFFF7. occurred >> pj_wrapper.dll!on_cache_timeout(pj_timer_heap_t * timer_heap, >> pj_timer_entry * entry) Line 226 >> at pjnath\src\pjnath\stun_session.c(226) >> pj_wrapper.dll!pj_timer_heap_poll(pj_timer_heap_t * ht, pj_time_val * >> next_delay) Line 651 >> at pjlib\src\pj\timer.c(651) >> [Managed to Native Transition] >> >> >> From debugger, current values: >> >> tdata 0x000001c5f264f828 {prev=0xdddddddddddddddd {prev=??? next=??? >> pool=??? ...} next=0xdddddddddddddddd {...} ...} pj_stun_tx_data * >> prev 0xdddddddddddddddd {prev=??? next=??? pool=??? ...} >> pj_stun_tx_data * >> next 0xdddddddddddddddd {prev=??? next=??? pool=??? ...} >> pj_stun_tx_data * >> pool 0xdddddddddddddddd {prev=??? next=??? >> obj_name=0xdddddddddddddded <Error reading characters of string.> ...} >> pj_pool_t * >> sess 0xdddddddddddddddd {cfg=??? pool=??? grp_lock=??? ...} >> pj_stun_session * >> msg 0xdddddddddddddddd {hdr={type=??? length=??? magic=??? ...} >> attr_count=??? attr=0xddddddddddddddf5 {...} } pj_stun_msg * >> token 0xdddddddddddddddd void * >> client_tsx 0xdddddddddddddddd {obj_name=0xdddddddddddddddd <Error >> reading characters of string.> cb={on_complete=...} ...} >> pj_stun_client_tsx * >> retransmit -572662307 int >> msg_magic 3722304989 unsigned int >> msg_key 0x000001c5f264f868 "ÝÝÝÝÝÝÝÝÝÝÝÝ... unsigned char[12] >> auth_info {realm={ptr=0xdddddddddddddddd <Error reading >> characters of string.> slen=-2459565876494606883 } username=...} >> pj_stun_req_cred_info >> pkt 0xdddddddddddddddd void * >> max_len 3722304989 unsigned int >> pkt_size 15987178197214944733 unsigned __int64 >> addr_len 3722304989 unsigned int >> dst_addr 0xdddddddddddddddd const void * >> res_timer {user_data=0xdddddddddddddddd id=-572662307 >> cb=0xdddddddddddddddd ...} pj_timer_entry >> >> >> tdata->sess 0xdddddddddddddddd {cfg=??? pool=??? grp_lock=??? ...} >> pj_stun_session * >> cfg <Unable to read memory> >> pool <Unable to read memory> >> grp_lock <Unable to read memory> >> cb {on_send_msg=??? on_rx_request=??? on_request_complete=??? >> ...} pj_stun_session_cb >> user_data <Unable to read memory> >> is_destroying <Unable to read memory> >> use_fingerprint <Unable to read memory> >> rx_pool <Unable to read memory> >> dump_buf 0xddddddddddddde2d <Error reading characters of string.> >> char[1000] >> log_flag <Unable to read memory> >> auth_type <Unable to read memory> >> cred {type=??? data={static_cred={realm={ptr=??? slen=??? } >> username={ptr=??? slen=??? } data_type=??? ...} ...} } pj_stun_auth_cred >> auth_retry <Unable to read memory> >> next_nonce {ptr=??? slen=??? } pj_str_t >> server_realm {ptr=??? slen=??? } pj_str_t >> srv_name {ptr=??? slen=??? } pj_str_t >> pending_request_list {prev=??? next=??? pool=??? ...} >> pj_stun_tx_data >> cached_response_list {prev=??? next=??? pool=??? ...} >> pj_stun_tx_data >> >> Best Regards, >> Hugo Sobral >> >> On Wed, 3 Jul 2019 at 11:04, Nanang Izzuddin <nanang@pjsip.org> wrote: >> >>> Hi Hugo, >>> >>> The STUN tdata timer in STUN session (i.e: tdata->res_timer) is >>> scheduled using 'pj_stun_session' group lock, so normally it should be safe >>> from race condition between 'pj_stun_session' destroy and >>> 'on_cache_timeout', but perhaps there is a corner case, could you also send >>> the call stack traces when the crash occurs? >>> >>> BR, >>> nanang >>> >>> >>> On Sat, Jun 29, 2019 at 1:30 AM Hugo Sobral <hugo.sobral@gmail.com> >>> wrote: >>> >>>> Hi, >>>> >>>> Sometimes I get a crash in "pj_timer_heap_poll" >>>> I notice that there was some corrections about that in milestone 2.9 >>>> >>>> https://trac.pjsip.org/repos/ticket/2172 >>>> https://trac.pjsip.org/repos/ticket/2176 >>>> >>>> But these corrections does not solve my issue. >>>> >>>> After some time debugging the issue, it looks like that it happens when >>>> the code: >>>> >>>> ------------------------------- >>>> if (node->cb) >>>> (*node->cb)(ht, node); >>>> ------------------------------- >>>> in function 'pj_timer_heap_poll' is invoked with the function >>>> 'on_cache_timeout' at the same time that 'pj_stun_session' is being >>>> destroyed (this destroy is made by application's code because the call >>>> terminates). >>>> >>>> I cannot understand exactly how this happens, >>>> but I think that there is a time gap in function 'pj_timer_heap_poll' >>>> after the 'unlock_timer_heap(ht)' that 'pj_stun_session' can be destroyed >>>> and crash inside 'on_cache_timeout' >>>> >>>> >>>> Is this a known issue in the specified functions? >>>> >>>> Best regards, >>>> Hugo Sobral >>>> >>>> >>>> >>>> >>>> >>>> >>>> _______________________________________________ >>>> Visit our blog: http://blog.pjsip.org >>>> >>>> pjsip mailing list >>>> pjsip@lists.pjsip.org >>>> http://lists.pjsip.org/mailman/listinfo/pjsip_lists.pjsip.org >>>> >>> _______________________________________________ >>> Visit our blog: http://blog.pjsip.org >>> >>> pjsip mailing list >>> pjsip@lists.pjsip.org >>> http://lists.pjsip.org/mailman/listinfo/pjsip_lists.pjsip.org >>> >> _______________________________________________ >> Visit our blog: http://blog.pjsip.org >> >> pjsip mailing list >> pjsip@lists.pjsip.org >> http://lists.pjsip.org/mailman/listinfo/pjsip_lists.pjsip.org >> > _______________________________________________ > Visit our blog: http://blog.pjsip.org > > pjsip mailing list > pjsip@lists.pjsip.org > http://lists.pjsip.org/mailman/listinfo/pjsip_lists.pjsip.org >
NI
Nanang Izzuddin
Mon, Aug 26, 2019 8:23 AM

Hi Hugo,

Just got a chance to continue investigation on this issue. Turned out that
the problem is rather complex, for example, cached tx_data may be destroyed
by on_cache_timeout() or by STUN session destroy or even by both (race
condition), so synchronization between STUN session and cached tx_data is
needed. Attached is an updated patch.

BR,
nanang

On Thu, Jul 11, 2019 at 12:35 AM Hugo Sobral hugo.sobral@gmail.com wrote:

Hi Nanang,

I have the same result but in a different location

Exception thrown: read access violation.
atomic_var was 0xFFFFFFFFFFFFFFFF. occurred
pj_wrapper.dll!pj_atomic_dec_and_get(pj_atomic_t * atomic_var) Line 800
at pjlib\src\pj\os_core_win32.c(800)
pj_wrapper.dll!grp_lock_dec_ref(pj_grp_lock_t * glock) Line 554
at pjlib\src\pj\lock.c(554)
pj_wrapper.dll!pj_grp_lock_dec_ref(pj_grp_lock_t * glock) Line 647
at pjlib\src\pj\lock.c(647)
pj_wrapper.dll!pj_timer_heap_poll(pj_timer_heap_t * ht, pj_time_val *
next_delay) Line 654
at pjlib\src\pj\timer.c(654)

Some vars from debugger:

&atomic_var 0x0000002f293fe670 {0xdddddddddddddddd {value=??? }}
pj_atomic_t * *
0xdddddddddddddddd {value=??? } pj_atomic_t *
value <Unable to read memory>

&atomic_var->value 0xdddddddddddddddd {???} long *
<Unable to read memory> long

On Tue, 9 Jul 2019 at 12:16, Nanang Izzuddin nanang@pjsip.org wrote:

Hi Hugo,

Instead of 'pj_stun_session', it is perhaps the 'pj_stun_tx_data' that
had been destroyed when 'on_cache_timeout' was invoked, and this scenario
seems to be possible indeed. So just created a patch (attached),
unfortunately I could not easily reproduce the issue here, so if the issue
is reproducible there, please help us testing the patch, thank you in
advance.

BR,
nanang

On Fri, Jul 5, 2019 at 7:10 PM Hugo Sobral hugo.sobral@gmail.com wrote:

Hi Nanang,

It was in this line:

 PJ_LOG(5,(SNAME(tdata->sess), "Response cache deleted"));

Exception thrown: read access violation.
tdata->sess was 0xFFFFFFFFFFFFFFF7. occurred
pj_wrapper.dll!on_cache_timeout(pj_timer_heap_t * timer_heap,
pj_timer_entry * entry) Line 226
at pjnath\src\pjnath\stun_session.c(226)
pj_wrapper.dll!pj_timer_heap_poll(pj_timer_heap_t * ht, pj_time_val *
next_delay) Line 651
at pjlib\src\pj\timer.c(651)
[Managed to Native Transition]

From debugger, current values:

tdata   0x000001c5f264f828 {prev=0xdddddddddddddddd {prev=???

next=??? pool=??? ...} next=0xdddddddddddddddd {...} ...}  pj_stun_tx_data
*
prev  0xdddddddddddddddd {prev=??? next=??? pool=??? ...}
pj_stun_tx_data *
next  0xdddddddddddddddd {prev=??? next=??? pool=??? ...}
pj_stun_tx_data *
pool  0xdddddddddddddddd {prev=??? next=???
obj_name=0xdddddddddddddded <Error reading characters of string.> ...}
pj_pool_t *
sess  0xdddddddddddddddd {cfg=??? pool=??? grp_lock=??? ...}
pj_stun_session *
msg  0xdddddddddddddddd {hdr={type=??? length=??? magic=??? ...}
attr_count=??? attr=0xddddddddddddddf5 {...} }  pj_stun_msg *
token  0xdddddddddddddddd  void *
client_tsx  0xdddddddddddddddd {obj_name=0xdddddddddddddddd
<Error reading characters of string.> cb={on_complete=...} ...}
pj_stun_client_tsx *
retransmit  -572662307  int
msg_magic  3722304989  unsigned int
msg_key  0x000001c5f264f868 "ÝÝÝÝÝÝÝÝÝÝÝÝ...  unsigned char[12]
auth_info  {realm={ptr=0xdddddddddddddddd <Error reading characters of string.> slen=-2459565876494606883 } username=...}
pj_stun_req_cred_info
pkt  0xdddddddddddddddd  void *
max_len  3722304989  unsigned int
pkt_size  15987178197214944733  unsigned __int64
addr_len  3722304989  unsigned int
dst_addr  0xdddddddddddddddd  const void *
res_timer  {user_data=0xdddddddddddddddd id=-572662307
cb=0xdddddddddddddddd ...}  pj_timer_entry

tdata->sess   0xdddddddddddddddd {cfg=??? pool=??? grp_lock=??? ...}

pj_stun_session *
cfg  <Unable to read memory>
pool  <Unable to read memory>
grp_lock  <Unable to read memory>
cb  {on_send_msg=??? on_rx_request=??? on_request_complete=???
...}  pj_stun_session_cb
user_data  <Unable to read memory>
is_destroying  <Unable to read memory>
use_fingerprint  <Unable to read memory>
rx_pool  <Unable to read memory>
dump_buf  0xddddddddddddde2d <Error reading characters of string.>  char[1000]
log_flag  <Unable to read memory>
auth_type  <Unable to read memory>
cred  {type=??? data={static_cred={realm={ptr=??? slen=??? }
username={ptr=??? slen=??? } data_type=??? ...} ...} }  pj_stun_auth_cred
auth_retry  <Unable to read memory>
next_nonce  {ptr=??? slen=??? }  pj_str_t
server_realm  {ptr=??? slen=??? }  pj_str_t
srv_name  {ptr=??? slen=??? }  pj_str_t
pending_request_list  {prev=??? next=??? pool=??? ...}
pj_stun_tx_data
cached_response_list  {prev=??? next=??? pool=??? ...}
pj_stun_tx_data

Best Regards,
Hugo Sobral

On Wed, 3 Jul 2019 at 11:04, Nanang Izzuddin nanang@pjsip.org wrote:

Hi Hugo,

The STUN tdata timer in STUN session (i.e: tdata->res_timer) is
scheduled using 'pj_stun_session' group lock, so normally it should be safe
from race condition between 'pj_stun_session' destroy and
'on_cache_timeout', but perhaps there is a corner case, could you also send
the call stack traces when the crash occurs?

BR,
nanang

On Sat, Jun 29, 2019 at 1:30 AM Hugo Sobral hugo.sobral@gmail.com
wrote:

Hi,

Sometimes I get a crash in "pj_timer_heap_poll"
I notice that there was some corrections about that in milestone 2.9

https://trac.pjsip.org/repos/ticket/2172
https://trac.pjsip.org/repos/ticket/2176

But these corrections does not solve my issue.

After some time debugging the issue, it looks like that it happens
when the code:


if (node->cb)
(*node->cb)(ht, node);

in function  'pj_timer_heap_poll' is invoked with the function
'on_cache_timeout' at the same time that 'pj_stun_session' is being
destroyed (this destroy is made by application's code because the call
terminates).

I cannot understand exactly how this happens,
but I think that there is a time gap in function 'pj_timer_heap_poll'
after the 'unlock_timer_heap(ht)' that 'pj_stun_session' can be destroyed
and crash inside 'on_cache_timeout'

Is this a known issue in the specified functions?

Best regards,
Hugo Sobral


Visit our blog: http://blog.pjsip.org

pjsip mailing list
pjsip@lists.pjsip.org
http://lists.pjsip.org/mailman/listinfo/pjsip_lists.pjsip.org

Hi Hugo, Just got a chance to continue investigation on this issue. Turned out that the problem is rather complex, for example, cached tx_data may be destroyed by on_cache_timeout() or by STUN session destroy or even by both (race condition), so synchronization between STUN session and cached tx_data is needed. Attached is an updated patch. BR, nanang On Thu, Jul 11, 2019 at 12:35 AM Hugo Sobral <hugo.sobral@gmail.com> wrote: > Hi Nanang, > > I have the same result but in a different location > > Exception thrown: read access violation. > atomic_var was 0xFFFFFFFFFFFFFFFF. occurred > pj_wrapper.dll!pj_atomic_dec_and_get(pj_atomic_t * atomic_var) Line 800 > at pjlib\src\pj\os_core_win32.c(800) > pj_wrapper.dll!grp_lock_dec_ref(pj_grp_lock_t * glock) Line 554 > at pjlib\src\pj\lock.c(554) > pj_wrapper.dll!pj_grp_lock_dec_ref(pj_grp_lock_t * glock) Line 647 > at pjlib\src\pj\lock.c(647) > pj_wrapper.dll!pj_timer_heap_poll(pj_timer_heap_t * ht, pj_time_val * > next_delay) Line 654 > at pjlib\src\pj\timer.c(654) > > > Some vars from debugger: > > &atomic_var 0x0000002f293fe670 {0xdddddddddddddddd {value=??? }} > pj_atomic_t * * > 0xdddddddddddddddd {value=??? } pj_atomic_t * > value <Unable to read memory> > > &atomic_var->value 0xdddddddddddddddd {???} long * > <Unable to read memory> long > > > > > > > On Tue, 9 Jul 2019 at 12:16, Nanang Izzuddin <nanang@pjsip.org> wrote: > >> Hi Hugo, >> >> Instead of 'pj_stun_session', it is perhaps the 'pj_stun_tx_data' that >> had been destroyed when 'on_cache_timeout' was invoked, and this scenario >> seems to be possible indeed. So just created a patch (attached), >> unfortunately I could not easily reproduce the issue here, so if the issue >> is reproducible there, please help us testing the patch, thank you in >> advance. >> >> BR, >> nanang >> >> >> On Fri, Jul 5, 2019 at 7:10 PM Hugo Sobral <hugo.sobral@gmail.com> wrote: >> >>> Hi Nanang, >>> >>> It was in this line: >>> >>> PJ_LOG(5,(SNAME(tdata->sess), "Response cache deleted")); >>> >>> >>> >>> Exception thrown: read access violation. >>> tdata->sess was 0xFFFFFFFFFFFFFFF7. occurred >>> pj_wrapper.dll!on_cache_timeout(pj_timer_heap_t * timer_heap, >>> pj_timer_entry * entry) Line 226 >>> at pjnath\src\pjnath\stun_session.c(226) >>> pj_wrapper.dll!pj_timer_heap_poll(pj_timer_heap_t * ht, pj_time_val * >>> next_delay) Line 651 >>> at pjlib\src\pj\timer.c(651) >>> [Managed to Native Transition] >>> >>> >>> From debugger, current values: >>> >>> tdata 0x000001c5f264f828 {prev=0xdddddddddddddddd {prev=??? >>> next=??? pool=??? ...} next=0xdddddddddddddddd {...} ...} pj_stun_tx_data >>> * >>> prev 0xdddddddddddddddd {prev=??? next=??? pool=??? ...} >>> pj_stun_tx_data * >>> next 0xdddddddddddddddd {prev=??? next=??? pool=??? ...} >>> pj_stun_tx_data * >>> pool 0xdddddddddddddddd {prev=??? next=??? >>> obj_name=0xdddddddddddddded <Error reading characters of string.> ...} >>> pj_pool_t * >>> sess 0xdddddddddddddddd {cfg=??? pool=??? grp_lock=??? ...} >>> pj_stun_session * >>> msg 0xdddddddddddddddd {hdr={type=??? length=??? magic=??? ...} >>> attr_count=??? attr=0xddddddddddddddf5 {...} } pj_stun_msg * >>> token 0xdddddddddddddddd void * >>> client_tsx 0xdddddddddddddddd {obj_name=0xdddddddddddddddd >>> <Error reading characters of string.> cb={on_complete=...} ...} >>> pj_stun_client_tsx * >>> retransmit -572662307 int >>> msg_magic 3722304989 unsigned int >>> msg_key 0x000001c5f264f868 "ÝÝÝÝÝÝÝÝÝÝÝÝ... unsigned char[12] >>> auth_info {realm={ptr=0xdddddddddddddddd <Error reading >>> characters of string.> slen=-2459565876494606883 } username=...} >>> pj_stun_req_cred_info >>> pkt 0xdddddddddddddddd void * >>> max_len 3722304989 unsigned int >>> pkt_size 15987178197214944733 unsigned __int64 >>> addr_len 3722304989 unsigned int >>> dst_addr 0xdddddddddddddddd const void * >>> res_timer {user_data=0xdddddddddddddddd id=-572662307 >>> cb=0xdddddddddddddddd ...} pj_timer_entry >>> >>> >>> tdata->sess 0xdddddddddddddddd {cfg=??? pool=??? grp_lock=??? ...} >>> pj_stun_session * >>> cfg <Unable to read memory> >>> pool <Unable to read memory> >>> grp_lock <Unable to read memory> >>> cb {on_send_msg=??? on_rx_request=??? on_request_complete=??? >>> ...} pj_stun_session_cb >>> user_data <Unable to read memory> >>> is_destroying <Unable to read memory> >>> use_fingerprint <Unable to read memory> >>> rx_pool <Unable to read memory> >>> dump_buf 0xddddddddddddde2d <Error reading characters of >>> string.> char[1000] >>> log_flag <Unable to read memory> >>> auth_type <Unable to read memory> >>> cred {type=??? data={static_cred={realm={ptr=??? slen=??? } >>> username={ptr=??? slen=??? } data_type=??? ...} ...} } pj_stun_auth_cred >>> auth_retry <Unable to read memory> >>> next_nonce {ptr=??? slen=??? } pj_str_t >>> server_realm {ptr=??? slen=??? } pj_str_t >>> srv_name {ptr=??? slen=??? } pj_str_t >>> pending_request_list {prev=??? next=??? pool=??? ...} >>> pj_stun_tx_data >>> cached_response_list {prev=??? next=??? pool=??? ...} >>> pj_stun_tx_data >>> >>> Best Regards, >>> Hugo Sobral >>> >>> On Wed, 3 Jul 2019 at 11:04, Nanang Izzuddin <nanang@pjsip.org> wrote: >>> >>>> Hi Hugo, >>>> >>>> The STUN tdata timer in STUN session (i.e: tdata->res_timer) is >>>> scheduled using 'pj_stun_session' group lock, so normally it should be safe >>>> from race condition between 'pj_stun_session' destroy and >>>> 'on_cache_timeout', but perhaps there is a corner case, could you also send >>>> the call stack traces when the crash occurs? >>>> >>>> BR, >>>> nanang >>>> >>>> >>>> On Sat, Jun 29, 2019 at 1:30 AM Hugo Sobral <hugo.sobral@gmail.com> >>>> wrote: >>>> >>>>> Hi, >>>>> >>>>> Sometimes I get a crash in "pj_timer_heap_poll" >>>>> I notice that there was some corrections about that in milestone 2.9 >>>>> >>>>> https://trac.pjsip.org/repos/ticket/2172 >>>>> https://trac.pjsip.org/repos/ticket/2176 >>>>> >>>>> But these corrections does not solve my issue. >>>>> >>>>> After some time debugging the issue, it looks like that it happens >>>>> when the code: >>>>> >>>>> ------------------------------- >>>>> if (node->cb) >>>>> (*node->cb)(ht, node); >>>>> ------------------------------- >>>>> in function 'pj_timer_heap_poll' is invoked with the function >>>>> 'on_cache_timeout' at the same time that 'pj_stun_session' is being >>>>> destroyed (this destroy is made by application's code because the call >>>>> terminates). >>>>> >>>>> I cannot understand exactly how this happens, >>>>> but I think that there is a time gap in function 'pj_timer_heap_poll' >>>>> after the 'unlock_timer_heap(ht)' that 'pj_stun_session' can be destroyed >>>>> and crash inside 'on_cache_timeout' >>>>> >>>>> >>>>> Is this a known issue in the specified functions? >>>>> >>>>> Best regards, >>>>> Hugo Sobral >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> _______________________________________________ >>>>> Visit our blog: http://blog.pjsip.org >>>>> >>>>> pjsip mailing list >>>>> pjsip@lists.pjsip.org >>>>> http://lists.pjsip.org/mailman/listinfo/pjsip_lists.pjsip.org >>>>> >>>> _______________________________________________ >>>> Visit our blog: http://blog.pjsip.org >>>> >>>> pjsip mailing list >>>> pjsip@lists.pjsip.org >>>> http://lists.pjsip.org/mailman/listinfo/pjsip_lists.pjsip.org >>>> >>> _______________________________________________ >>> Visit our blog: http://blog.pjsip.org >>> >>> pjsip mailing list >>> pjsip@lists.pjsip.org >>> http://lists.pjsip.org/mailman/listinfo/pjsip_lists.pjsip.org >>> >> _______________________________________________ >> Visit our blog: http://blog.pjsip.org >> >> pjsip mailing list >> pjsip@lists.pjsip.org >> http://lists.pjsip.org/mailman/listinfo/pjsip_lists.pjsip.org >> > _______________________________________________ > Visit our blog: http://blog.pjsip.org > > pjsip mailing list > pjsip@lists.pjsip.org > http://lists.pjsip.org/mailman/listinfo/pjsip_lists.pjsip.org >