Segfault in timer.c

RB
Ross Beer
Mon, Apr 8, 2019 4:05 PM

Hi,

We are seeing multiple segfaults while copying nodes:

Thread 1 (Thread 0x7fa977fff700 (LWP 36699)):
#0  0x00007fac956676fe in copy_node (ht=0x16d0410, slot=430, moved_node=0x7fa9c4c04c50) at ../src/pj/timer.c:137
#1  0x00007fac956679d9 in reheap_down (ht=0x16d0410, moved_node=0x7fab60031970, slot=430, child=862) at ../src/pj/timer.c:185
#2  0x00007fac95667d1d in remove_node (ht=0x16d0410, slot=0) at ../src/pj/timer.c:252
parent = 0
moved_node = 0x7fab60031970
removed_node = 0x7fa9b5978ae0
#3  0x00007fac95668634 in pj_timer_heap_poll (ht=0x16d0410, next_delay=0x7fa977ffecd0) at ../src/pj/timer.c:634
node = 0x100000002
grp_lock = 0x31dc8c88
now = {sec = 8157205, msec = 523}
count = 0
#4  0x00007fac955b6b06 in pjsip_endpt_handle_events2 (endpt=0x16d0128, max_timeout=0x7fa977ffed40, p_count=0x0) at ../src/pjsip/sip_endpoint.c:715
timeout = {sec = 0, msec = 0}
count = 0
net_event_count = 0
c = 0
#5  0x00007fac955b6c4b in pjsip_endpt_handle_events (endpt=0x16d0128, max_timeout=0x7fa977ffed40) at ../src/pjsip/sip_endpoint.c:776
#6  0x00007faac8024bcf in monitor_thread_exec (endpt=0x0) at res_pjsip.c:4512
delay = {sec = 0, msec = 10}
#7  0x00007fac9564fcb0 in thread_main (param=0x1815b28) at ../src/pj/os_core_unix.c:541
rec = 0x1815b28
result = 0x0
rc = 0
#8  0x00007fac9342ddd5 in start_thread () at /usr/lib64/libpthread.so.0
#9  0x00007fac927cfead in clone () at /usr/lib64/libc.so.6

Can anyone assist in resolving the issue?

Regards,

Ross

Hi, We are seeing multiple segfaults while copying nodes: Thread 1 (Thread 0x7fa977fff700 (LWP 36699)): #0 0x00007fac956676fe in copy_node (ht=0x16d0410, slot=430, moved_node=0x7fa9c4c04c50) at ../src/pj/timer.c:137 #1 0x00007fac956679d9 in reheap_down (ht=0x16d0410, moved_node=0x7fab60031970, slot=430, child=862) at ../src/pj/timer.c:185 #2 0x00007fac95667d1d in remove_node (ht=0x16d0410, slot=0) at ../src/pj/timer.c:252 parent = 0 moved_node = 0x7fab60031970 removed_node = 0x7fa9b5978ae0 #3 0x00007fac95668634 in pj_timer_heap_poll (ht=0x16d0410, next_delay=0x7fa977ffecd0) at ../src/pj/timer.c:634 node = 0x100000002 grp_lock = 0x31dc8c88 now = {sec = 8157205, msec = 523} count = 0 #4 0x00007fac955b6b06 in pjsip_endpt_handle_events2 (endpt=0x16d0128, max_timeout=0x7fa977ffed40, p_count=0x0) at ../src/pjsip/sip_endpoint.c:715 timeout = {sec = 0, msec = 0} count = 0 net_event_count = 0 c = 0 #5 0x00007fac955b6c4b in pjsip_endpt_handle_events (endpt=0x16d0128, max_timeout=0x7fa977ffed40) at ../src/pjsip/sip_endpoint.c:776 #6 0x00007faac8024bcf in monitor_thread_exec (endpt=0x0) at res_pjsip.c:4512 delay = {sec = 0, msec = 10} #7 0x00007fac9564fcb0 in thread_main (param=0x1815b28) at ../src/pj/os_core_unix.c:541 rec = 0x1815b28 result = 0x0 rc = 0 #8 0x00007fac9342ddd5 in start_thread () at /usr/lib64/libpthread.so.0 #9 0x00007fac927cfead in clone () at /usr/lib64/libc.so.6 Can anyone assist in resolving the issue? Regards, Ross
RB
Ross Beer
Tue, Apr 9, 2019 1:03 PM

Hello,

The previous segfault was with issue #2176 (https://trac.pjsip.org/repos/changeset/5934) patch applied, this patch appears to increase the frequency of segfaults in the 'pj_timer_heap_poll' process instead of mitigating them.

Can anyone offer assistance in getting this resolved?

Regards,

Ross


From: pjsip pjsip-bounces@lists.pjsip.org on behalf of Ross Beer ross.beer@outlook.com
Sent: 08 April 2019 17:05
To: pjsip@lists.pjsip.org
Subject: [pjsip] Segfault in timer.c

Hi,

We are seeing multiple segfaults while copying nodes:

Thread 1 (Thread 0x7fa977fff700 (LWP 36699)):
#0  0x00007fac956676fe in copy_node (ht=0x16d0410, slot=430, moved_node=0x7fa9c4c04c50) at ../src/pj/timer.c:137
#1  0x00007fac956679d9 in reheap_down (ht=0x16d0410, moved_node=0x7fab60031970, slot=430, child=862) at ../src/pj/timer.c:185
#2  0x00007fac95667d1d in remove_node (ht=0x16d0410, slot=0) at ../src/pj/timer.c:252
parent = 0
moved_node = 0x7fab60031970
removed_node = 0x7fa9b5978ae0
#3  0x00007fac95668634 in pj_timer_heap_poll (ht=0x16d0410, next_delay=0x7fa977ffecd0) at ../src/pj/timer.c:634
node = 0x100000002
grp_lock = 0x31dc8c88
now = {sec = 8157205, msec = 523}
count = 0
#4  0x00007fac955b6b06 in pjsip_endpt_handle_events2 (endpt=0x16d0128, max_timeout=0x7fa977ffed40, p_count=0x0) at ../src/pjsip/sip_endpoint.c:715
timeout = {sec = 0, msec = 0}
count = 0
net_event_count = 0
c = 0
#5  0x00007fac955b6c4b in pjsip_endpt_handle_events (endpt=0x16d0128, max_timeout=0x7fa977ffed40) at ../src/pjsip/sip_endpoint.c:776
#6  0x00007faac8024bcf in monitor_thread_exec (endpt=0x0) at res_pjsip.c:4512
delay = {sec = 0, msec = 10}
#7  0x00007fac9564fcb0 in thread_main (param=0x1815b28) at ../src/pj/os_core_unix.c:541
rec = 0x1815b28
result = 0x0
rc = 0
#8  0x00007fac9342ddd5 in start_thread () at /usr/lib64/libpthread.so.0
#9  0x00007fac927cfead in clone () at /usr/lib64/libc.so.6

Can anyone assist in resolving the issue?

Regards,

Ross

Hello, The previous segfault was with issue #2176 (https://trac.pjsip.org/repos/changeset/5934) patch applied, this patch appears to increase the frequency of segfaults in the 'pj_timer_heap_poll' process instead of mitigating them. Can anyone offer assistance in getting this resolved? Regards, Ross ________________________________ From: pjsip <pjsip-bounces@lists.pjsip.org> on behalf of Ross Beer <ross.beer@outlook.com> Sent: 08 April 2019 17:05 To: pjsip@lists.pjsip.org Subject: [pjsip] Segfault in timer.c Hi, We are seeing multiple segfaults while copying nodes: Thread 1 (Thread 0x7fa977fff700 (LWP 36699)): #0 0x00007fac956676fe in copy_node (ht=0x16d0410, slot=430, moved_node=0x7fa9c4c04c50) at ../src/pj/timer.c:137 #1 0x00007fac956679d9 in reheap_down (ht=0x16d0410, moved_node=0x7fab60031970, slot=430, child=862) at ../src/pj/timer.c:185 #2 0x00007fac95667d1d in remove_node (ht=0x16d0410, slot=0) at ../src/pj/timer.c:252 parent = 0 moved_node = 0x7fab60031970 removed_node = 0x7fa9b5978ae0 #3 0x00007fac95668634 in pj_timer_heap_poll (ht=0x16d0410, next_delay=0x7fa977ffecd0) at ../src/pj/timer.c:634 node = 0x100000002 grp_lock = 0x31dc8c88 now = {sec = 8157205, msec = 523} count = 0 #4 0x00007fac955b6b06 in pjsip_endpt_handle_events2 (endpt=0x16d0128, max_timeout=0x7fa977ffed40, p_count=0x0) at ../src/pjsip/sip_endpoint.c:715 timeout = {sec = 0, msec = 0} count = 0 net_event_count = 0 c = 0 #5 0x00007fac955b6c4b in pjsip_endpt_handle_events (endpt=0x16d0128, max_timeout=0x7fa977ffed40) at ../src/pjsip/sip_endpoint.c:776 #6 0x00007faac8024bcf in monitor_thread_exec (endpt=0x0) at res_pjsip.c:4512 delay = {sec = 0, msec = 10} #7 0x00007fac9564fcb0 in thread_main (param=0x1815b28) at ../src/pj/os_core_unix.c:541 rec = 0x1815b28 result = 0x0 rc = 0 #8 0x00007fac9342ddd5 in start_thread () at /usr/lib64/libpthread.so.0 #9 0x00007fac927cfead in clone () at /usr/lib64/libc.so.6 Can anyone assist in resolving the issue? Regards, Ross
M
Ming
Thu, May 30, 2019 6:20 AM

Hi Ross,

  • Is there any steps or specific scenario to reproduce the issue?
  • Which PJSIP version are you using? You mentioned ticket #2176, which
    additional patches do you apply to the base version you're using?
  • Our latest fix with regard to the timer is ticket #2191
    (https://trac.pjsip.org/repos/ticket/2191). Please let us know if the
    problem still persists after this patch. Another ticket which may be
    relevant is ticket #2172 (https://trac.pjsip.org/repos/ticket/2172).
  • For the stack trace which you provided, the crash seems to be caused
    when accessing the node 0x7fa9c4c04c50. So, you  will need to check
    the PJSIP log, with PJ_TIMER_DEBUG on, and see to which component the
    timer entry belongs, and when&how the pool which allocates the entry
    gets destroyed, and why isn't the timer cancelled first before the
    deallocation.

Regards,
Ming

On Tue, Apr 9, 2019 at 9:04 PM Ross Beer ross.beer@outlook.com wrote:

Hello,

The previous segfault was with issue #2176 (https://trac.pjsip.org/repos/changeset/5934) patch applied, this patch appears to increase the frequency of segfaults in the 'pj_timer_heap_poll' process instead of mitigating them.

Can anyone offer assistance in getting this resolved?

Regards,

Ross


From: pjsip pjsip-bounces@lists.pjsip.org on behalf of Ross Beer ross.beer@outlook.com
Sent: 08 April 2019 17:05
To: pjsip@lists.pjsip.org
Subject: [pjsip] Segfault in timer.c

Hi,

We are seeing multiple segfaults while copying nodes:

Thread 1 (Thread 0x7fa977fff700 (LWP 36699)):
#0  0x00007fac956676fe in copy_node (ht=0x16d0410, slot=430, moved_node=0x7fa9c4c04c50) at ../src/pj/timer.c:137
#1  0x00007fac956679d9 in reheap_down (ht=0x16d0410, moved_node=0x7fab60031970, slot=430, child=862) at ../src/pj/timer.c:185
#2  0x00007fac95667d1d in remove_node (ht=0x16d0410, slot=0) at ../src/pj/timer.c:252
parent = 0
moved_node = 0x7fab60031970
removed_node = 0x7fa9b5978ae0
#3  0x00007fac95668634 in pj_timer_heap_poll (ht=0x16d0410, next_delay=0x7fa977ffecd0) at ../src/pj/timer.c:634
node = 0x100000002
grp_lock = 0x31dc8c88
now = {sec = 8157205, msec = 523}
count = 0
#4  0x00007fac955b6b06 in pjsip_endpt_handle_events2 (endpt=0x16d0128, max_timeout=0x7fa977ffed40, p_count=0x0) at ../src/pjsip/sip_endpoint.c:715
timeout = {sec = 0, msec = 0}
count = 0
net_event_count = 0
c = 0
#5  0x00007fac955b6c4b in pjsip_endpt_handle_events (endpt=0x16d0128, max_timeout=0x7fa977ffed40) at ../src/pjsip/sip_endpoint.c:776
#6  0x00007faac8024bcf in monitor_thread_exec (endpt=0x0) at res_pjsip.c:4512
delay = {sec = 0, msec = 10}
#7  0x00007fac9564fcb0 in thread_main (param=0x1815b28) at ../src/pj/os_core_unix.c:541
rec = 0x1815b28
result = 0x0
rc = 0
#8  0x00007fac9342ddd5 in start_thread () at /usr/lib64/libpthread.so.0
#9  0x00007fac927cfead in clone () at /usr/lib64/libc.so.6

Can anyone assist in resolving the issue?

Regards,

Ross


Visit our blog: http://blog.pjsip.org

pjsip mailing list
pjsip@lists.pjsip.org
http://lists.pjsip.org/mailman/listinfo/pjsip_lists.pjsip.org

Hi Ross, - Is there any steps or specific scenario to reproduce the issue? - Which PJSIP version are you using? You mentioned ticket #2176, which additional patches do you apply to the base version you're using? - Our latest fix with regard to the timer is ticket #2191 (https://trac.pjsip.org/repos/ticket/2191). Please let us know if the problem still persists after this patch. Another ticket which may be relevant is ticket #2172 (https://trac.pjsip.org/repos/ticket/2172). - For the stack trace which you provided, the crash seems to be caused when accessing the node 0x7fa9c4c04c50. So, you will need to check the PJSIP log, with PJ_TIMER_DEBUG on, and see to which component the timer entry belongs, and when&how the pool which allocates the entry gets destroyed, and why isn't the timer cancelled first before the deallocation. Regards, Ming On Tue, Apr 9, 2019 at 9:04 PM Ross Beer <ross.beer@outlook.com> wrote: > > Hello, > > The previous segfault was with issue #2176 (https://trac.pjsip.org/repos/changeset/5934) patch applied, this patch appears to increase the frequency of segfaults in the 'pj_timer_heap_poll' process instead of mitigating them. > > Can anyone offer assistance in getting this resolved? > > Regards, > > Ross > > ________________________________ > From: pjsip <pjsip-bounces@lists.pjsip.org> on behalf of Ross Beer <ross.beer@outlook.com> > Sent: 08 April 2019 17:05 > To: pjsip@lists.pjsip.org > Subject: [pjsip] Segfault in timer.c > > Hi, > > We are seeing multiple segfaults while copying nodes: > > Thread 1 (Thread 0x7fa977fff700 (LWP 36699)): > #0 0x00007fac956676fe in copy_node (ht=0x16d0410, slot=430, moved_node=0x7fa9c4c04c50) at ../src/pj/timer.c:137 > #1 0x00007fac956679d9 in reheap_down (ht=0x16d0410, moved_node=0x7fab60031970, slot=430, child=862) at ../src/pj/timer.c:185 > #2 0x00007fac95667d1d in remove_node (ht=0x16d0410, slot=0) at ../src/pj/timer.c:252 > parent = 0 > moved_node = 0x7fab60031970 > removed_node = 0x7fa9b5978ae0 > #3 0x00007fac95668634 in pj_timer_heap_poll (ht=0x16d0410, next_delay=0x7fa977ffecd0) at ../src/pj/timer.c:634 > node = 0x100000002 > grp_lock = 0x31dc8c88 > now = {sec = 8157205, msec = 523} > count = 0 > #4 0x00007fac955b6b06 in pjsip_endpt_handle_events2 (endpt=0x16d0128, max_timeout=0x7fa977ffed40, p_count=0x0) at ../src/pjsip/sip_endpoint.c:715 > timeout = {sec = 0, msec = 0} > count = 0 > net_event_count = 0 > c = 0 > #5 0x00007fac955b6c4b in pjsip_endpt_handle_events (endpt=0x16d0128, max_timeout=0x7fa977ffed40) at ../src/pjsip/sip_endpoint.c:776 > #6 0x00007faac8024bcf in monitor_thread_exec (endpt=0x0) at res_pjsip.c:4512 > delay = {sec = 0, msec = 10} > #7 0x00007fac9564fcb0 in thread_main (param=0x1815b28) at ../src/pj/os_core_unix.c:541 > rec = 0x1815b28 > result = 0x0 > rc = 0 > #8 0x00007fac9342ddd5 in start_thread () at /usr/lib64/libpthread.so.0 > #9 0x00007fac927cfead in clone () at /usr/lib64/libc.so.6 > > > Can anyone assist in resolving the issue? > > Regards, > > Ross > _______________________________________________ > Visit our blog: http://blog.pjsip.org > > pjsip mailing list > pjsip@lists.pjsip.org > http://lists.pjsip.org/mailman/listinfo/pjsip_lists.pjsip.org
RB
Ross Beer
Thu, May 30, 2019 1:15 PM

Hi Ming,

I am using Asterisk bundled, which is PJSIP 2.8 with patches:

"Fixed #2191:

  • Stricter double timer entry scheduling prevention.
  • Integrate group lock in SIP transport, e.g: for add/dec ref,    for timer scheduling."

I am have just had a further crash in a different location, which looks like its crashed due to a timer entry being cancelled and an error is happening while moving nodes. Is more locking required when performing operations on the timer heap?

Thread 1 (Thread 0x7f783ebfa700 (LWP 83829)):
#0  0x00007f7ae3dc0249 in copy_node (ht=0x184cbf0, slot=2305, moved_node=0x7f797732f0c0) at ../src/pj/timer.c:137
#1  0x00007f7ae3dc0677 in reheap_up (ht=0x184cbf0, moved_node=0x7f780e0f2990, slot=2305, parent=1152) at ../src/pj/timer.c:208
#2  0x00007f7ae3dc0882 in remove_node (ht=0x184cbf0, slot=2305) at ../src/pj/timer.c:254
parent = 1152
moved_node = 0x7f780e0f2990
removed_node = 0x7f797f4a9e08
#3  0x00007f7ae3dc0b64 in cancel (ht=0x184cbf0, entry=0x7f797f4a9e08, flags=1) at ../src/pj/timer.c:353
timer_node_slot = 2305
#4  0x00007f7ae3dc1061 in cancel_timer (ht=0x184cbf0, entry=0x7f797f4a9e08, flags=0, id_val=0) at ../src/pj/timer.c:591
count = 32634
#5  0x00007f7ae3dc10ea in pj_timer_heap_cancel (ht=0x184cbf0, entry=0x7f797f4a9e08) at ../src/pj/timer.c:611
#6  0x00007f7ae3d0f0c8 in pjsip_endpt_cancel_timer (endpt=0x184c908, entry=0x7f797f4a9e08) at ../src/pjsip/sip_endpoint.c:848
#7  0x00007f7ae3cf62f6 in set_timer (sub=0x7f797f4a9cc8, timer_id=2, seconds=3600) at ../src/pjsip-simple/evsub.c:508
#8  0x00007f7ae3cf91cc in on_tsx_state_uas (sub=0x7f797f4a9cc8, tsx=0x7f780e0f27d8, event=0x7f783ebf98c0) at ../src/pjsip-simple/evsub.c:2172
expires = 0x7f7a10f9e600
tdata = 0x7f799cbcc068
st_code = 200
st_text = 0x0
body = 0x0
old_state = PJSIP_EVSUB_STATE_ACTIVE
old_state_str = {ptr = 0x7f7ae3dc6f0c "ACTIVE", slen = 6}
reason = {ptr = 0x0, slen = 0}
rdata = 0x7f7a10f95a68
msg = 0x7f7a10f9dac0
res_hdr = {prev = 0x7f783ebf9700, next = 0x7f783ebf9700, type = 284812696, name = {ptr = 0x7f797e162ca8 "", slen = 284809960}, sname = {ptr = 0x7f797f4a9cc8 "evsub0x7f797f4a9cc8", slen = 140154425546736}, vptr = 0x7f7ae3cf7f2f <on_new_transaction+724>}
status = 0
event_hdr = 0x7f7a10f9e548

Kind regards,

Ross


From: pjsip pjsip-bounces@lists.pjsip.org on behalf of Ming ming@teluu.com
Sent: 30 May 2019 07:20
To: pjsip list
Subject: Re: [pjsip] Segfault in timer.c

Hi Ross,

  • Is there any steps or specific scenario to reproduce the issue?
  • Which PJSIP version are you using? You mentioned ticket #2176, which
    additional patches do you apply to the base version you're using?
  • Our latest fix with regard to the timer is ticket #2191
    (https://trac.pjsip.org/repos/ticket/2191). Please let us know if the
    problem still persists after this patch. Another ticket which may be
    relevant is ticket #2172 (https://trac.pjsip.org/repos/ticket/2172).
  • For the stack trace which you provided, the crash seems to be caused
    when accessing the node 0x7fa9c4c04c50. So, you  will need to check
    the PJSIP log, with PJ_TIMER_DEBUG on, and see to which component the
    timer entry belongs, and when&how the pool which allocates the entry
    gets destroyed, and why isn't the timer cancelled first before the
    deallocation.

Regards,
Ming

On Tue, Apr 9, 2019 at 9:04 PM Ross Beer ross.beer@outlook.com wrote:

Hello,

The previous segfault was with issue #2176 (https://trac.pjsip.org/repos/changeset/5934) patch applied, this patch appears to increase the frequency of segfaults in the 'pj_timer_heap_poll' process instead of mitigating them.

Can anyone offer assistance in getting this resolved?

Regards,

Ross


From: pjsip pjsip-bounces@lists.pjsip.org on behalf of Ross Beer ross.beer@outlook.com
Sent: 08 April 2019 17:05
To: pjsip@lists.pjsip.org
Subject: [pjsip] Segfault in timer.c

Hi,

We are seeing multiple segfaults while copying nodes:

Thread 1 (Thread 0x7fa977fff700 (LWP 36699)):
#0  0x00007fac956676fe in copy_node (ht=0x16d0410, slot=430, moved_node=0x7fa9c4c04c50) at ../src/pj/timer.c:137
#1  0x00007fac956679d9 in reheap_down (ht=0x16d0410, moved_node=0x7fab60031970, slot=430, child=862) at ../src/pj/timer.c:185
#2  0x00007fac95667d1d in remove_node (ht=0x16d0410, slot=0) at ../src/pj/timer.c:252
parent = 0
moved_node = 0x7fab60031970
removed_node = 0x7fa9b5978ae0
#3  0x00007fac95668634 in pj_timer_heap_poll (ht=0x16d0410, next_delay=0x7fa977ffecd0) at ../src/pj/timer.c:634
node = 0x100000002
grp_lock = 0x31dc8c88
now = {sec = 8157205, msec = 523}
count = 0
#4  0x00007fac955b6b06 in pjsip_endpt_handle_events2 (endpt=0x16d0128, max_timeout=0x7fa977ffed40, p_count=0x0) at ../src/pjsip/sip_endpoint.c:715
timeout = {sec = 0, msec = 0}
count = 0
net_event_count = 0
c = 0
#5  0x00007fac955b6c4b in pjsip_endpt_handle_events (endpt=0x16d0128, max_timeout=0x7fa977ffed40) at ../src/pjsip/sip_endpoint.c:776
#6  0x00007faac8024bcf in monitor_thread_exec (endpt=0x0) at res_pjsip.c:4512
delay = {sec = 0, msec = 10}
#7  0x00007fac9564fcb0 in thread_main (param=0x1815b28) at ../src/pj/os_core_unix.c:541
rec = 0x1815b28
result = 0x0
rc = 0
#8  0x00007fac9342ddd5 in start_thread () at /usr/lib64/libpthread.so.0
#9  0x00007fac927cfead in clone () at /usr/lib64/libc.so.6

Can anyone assist in resolving the issue?

Regards,

Ross


Visit our blog: http://blog.pjsip.org

pjsip mailing list
pjsip@lists.pjsip.org
http://lists.pjsip.org/mailman/listinfo/pjsip_lists.pjsip.org

Hi Ming, I am using Asterisk bundled, which is PJSIP 2.8 with patches: "Fixed #2191: - Stricter double timer entry scheduling prevention. - Integrate group lock in SIP transport, e.g: for add/dec ref, for timer scheduling." I am have just had a further crash in a different location, which looks like its crashed due to a timer entry being cancelled and an error is happening while moving nodes. Is more locking required when performing operations on the timer heap? Thread 1 (Thread 0x7f783ebfa700 (LWP 83829)): #0 0x00007f7ae3dc0249 in copy_node (ht=0x184cbf0, slot=2305, moved_node=0x7f797732f0c0) at ../src/pj/timer.c:137 #1 0x00007f7ae3dc0677 in reheap_up (ht=0x184cbf0, moved_node=0x7f780e0f2990, slot=2305, parent=1152) at ../src/pj/timer.c:208 #2 0x00007f7ae3dc0882 in remove_node (ht=0x184cbf0, slot=2305) at ../src/pj/timer.c:254 parent = 1152 moved_node = 0x7f780e0f2990 removed_node = 0x7f797f4a9e08 #3 0x00007f7ae3dc0b64 in cancel (ht=0x184cbf0, entry=0x7f797f4a9e08, flags=1) at ../src/pj/timer.c:353 timer_node_slot = 2305 #4 0x00007f7ae3dc1061 in cancel_timer (ht=0x184cbf0, entry=0x7f797f4a9e08, flags=0, id_val=0) at ../src/pj/timer.c:591 count = 32634 #5 0x00007f7ae3dc10ea in pj_timer_heap_cancel (ht=0x184cbf0, entry=0x7f797f4a9e08) at ../src/pj/timer.c:611 #6 0x00007f7ae3d0f0c8 in pjsip_endpt_cancel_timer (endpt=0x184c908, entry=0x7f797f4a9e08) at ../src/pjsip/sip_endpoint.c:848 #7 0x00007f7ae3cf62f6 in set_timer (sub=0x7f797f4a9cc8, timer_id=2, seconds=3600) at ../src/pjsip-simple/evsub.c:508 #8 0x00007f7ae3cf91cc in on_tsx_state_uas (sub=0x7f797f4a9cc8, tsx=0x7f780e0f27d8, event=0x7f783ebf98c0) at ../src/pjsip-simple/evsub.c:2172 expires = 0x7f7a10f9e600 tdata = 0x7f799cbcc068 st_code = 200 st_text = 0x0 body = 0x0 old_state = PJSIP_EVSUB_STATE_ACTIVE old_state_str = {ptr = 0x7f7ae3dc6f0c "ACTIVE", slen = 6} reason = {ptr = 0x0, slen = 0} rdata = 0x7f7a10f95a68 msg = 0x7f7a10f9dac0 res_hdr = {prev = 0x7f783ebf9700, next = 0x7f783ebf9700, type = 284812696, name = {ptr = 0x7f797e162ca8 "", slen = 284809960}, sname = {ptr = 0x7f797f4a9cc8 "evsub0x7f797f4a9cc8", slen = 140154425546736}, vptr = 0x7f7ae3cf7f2f <on_new_transaction+724>} status = 0 event_hdr = 0x7f7a10f9e548 Kind regards, Ross ________________________________ From: pjsip <pjsip-bounces@lists.pjsip.org> on behalf of Ming <ming@teluu.com> Sent: 30 May 2019 07:20 To: pjsip list Subject: Re: [pjsip] Segfault in timer.c Hi Ross, - Is there any steps or specific scenario to reproduce the issue? - Which PJSIP version are you using? You mentioned ticket #2176, which additional patches do you apply to the base version you're using? - Our latest fix with regard to the timer is ticket #2191 (https://trac.pjsip.org/repos/ticket/2191). Please let us know if the problem still persists after this patch. Another ticket which may be relevant is ticket #2172 (https://trac.pjsip.org/repos/ticket/2172). - For the stack trace which you provided, the crash seems to be caused when accessing the node 0x7fa9c4c04c50. So, you will need to check the PJSIP log, with PJ_TIMER_DEBUG on, and see to which component the timer entry belongs, and when&how the pool which allocates the entry gets destroyed, and why isn't the timer cancelled first before the deallocation. Regards, Ming On Tue, Apr 9, 2019 at 9:04 PM Ross Beer <ross.beer@outlook.com> wrote: > > Hello, > > The previous segfault was with issue #2176 (https://trac.pjsip.org/repos/changeset/5934) patch applied, this patch appears to increase the frequency of segfaults in the 'pj_timer_heap_poll' process instead of mitigating them. > > Can anyone offer assistance in getting this resolved? > > Regards, > > Ross > > ________________________________ > From: pjsip <pjsip-bounces@lists.pjsip.org> on behalf of Ross Beer <ross.beer@outlook.com> > Sent: 08 April 2019 17:05 > To: pjsip@lists.pjsip.org > Subject: [pjsip] Segfault in timer.c > > Hi, > > We are seeing multiple segfaults while copying nodes: > > Thread 1 (Thread 0x7fa977fff700 (LWP 36699)): > #0 0x00007fac956676fe in copy_node (ht=0x16d0410, slot=430, moved_node=0x7fa9c4c04c50) at ../src/pj/timer.c:137 > #1 0x00007fac956679d9 in reheap_down (ht=0x16d0410, moved_node=0x7fab60031970, slot=430, child=862) at ../src/pj/timer.c:185 > #2 0x00007fac95667d1d in remove_node (ht=0x16d0410, slot=0) at ../src/pj/timer.c:252 > parent = 0 > moved_node = 0x7fab60031970 > removed_node = 0x7fa9b5978ae0 > #3 0x00007fac95668634 in pj_timer_heap_poll (ht=0x16d0410, next_delay=0x7fa977ffecd0) at ../src/pj/timer.c:634 > node = 0x100000002 > grp_lock = 0x31dc8c88 > now = {sec = 8157205, msec = 523} > count = 0 > #4 0x00007fac955b6b06 in pjsip_endpt_handle_events2 (endpt=0x16d0128, max_timeout=0x7fa977ffed40, p_count=0x0) at ../src/pjsip/sip_endpoint.c:715 > timeout = {sec = 0, msec = 0} > count = 0 > net_event_count = 0 > c = 0 > #5 0x00007fac955b6c4b in pjsip_endpt_handle_events (endpt=0x16d0128, max_timeout=0x7fa977ffed40) at ../src/pjsip/sip_endpoint.c:776 > #6 0x00007faac8024bcf in monitor_thread_exec (endpt=0x0) at res_pjsip.c:4512 > delay = {sec = 0, msec = 10} > #7 0x00007fac9564fcb0 in thread_main (param=0x1815b28) at ../src/pj/os_core_unix.c:541 > rec = 0x1815b28 > result = 0x0 > rc = 0 > #8 0x00007fac9342ddd5 in start_thread () at /usr/lib64/libpthread.so.0 > #9 0x00007fac927cfead in clone () at /usr/lib64/libc.so.6 > > > Can anyone assist in resolving the issue? > > Regards, > > Ross > _______________________________________________ > Visit our blog: http://blog.pjsip.org > > pjsip mailing list > pjsip@lists.pjsip.org > http://lists.pjsip.org/mailman/listinfo/pjsip_lists.pjsip.org _______________________________________________ Visit our blog: http://blog.pjsip.org pjsip mailing list pjsip@lists.pjsip.org http://lists.pjsip.org/mailman/listinfo/pjsip_lists.pjsip.org
M
Ming
Fri, May 31, 2019 5:13 AM

Hi Ross,

I'm afraid the backtrace alone is not sufficient to pinpoint the cause. We
know that the current timer implementation is not easy to troubleshoot, and
we're thinking of changing it, but for now in order to investigate the
issue further, we will need to get additional information which I mentioned
in the previous email, such as the steps to reproduce, and/or the complete
PJSIP logs level 6 compiled with the timer debugging on.

And I don't think adding locks can help here, because it looks like the
issue is caused by a dangling timer entry, i.e. a timer entry whose
owner/scheduler has been deleted.

Best regards,
Ming

On Thu, May 30, 2019 at 9:16 PM Ross Beer ross.beer@outlook.com wrote:

Hi Ming,

I am using Asterisk bundled, which is PJSIP 2.8 with patches:

"Fixed #2191:

  • Stricter double timer entry scheduling prevention.
  • Integrate group lock in SIP transport, e.g: for add/dec ref,    for
    timer scheduling."

I am have just had a further crash in a different location, which looks
like its crashed due to a timer entry being cancelled and an error is
happening while moving nodes. Is more locking required when performing
operations on the timer heap?

Thread 1 (Thread 0x7f783ebfa700 (LWP 83829)):
#0  0x00007f7ae3dc0249 in copy_node (ht=0x184cbf0, slot=2305,
moved_node=0x7f797732f0c0) at ../src/pj/timer.c:137
#1  0x00007f7ae3dc0677 in reheap_up (ht=0x184cbf0,
moved_node=0x7f780e0f2990, slot=2305, parent=1152) at ../src/pj/timer.c:208
#2  0x00007f7ae3dc0882 in remove_node (ht=0x184cbf0, slot=2305) at
../src/pj/timer.c:254
parent = 1152
moved_node = 0x7f780e0f2990
removed_node = 0x7f797f4a9e08
#3  0x00007f7ae3dc0b64 in cancel (ht=0x184cbf0, entry=0x7f797f4a9e08,
flags=1) at ../src/pj/timer.c:353
timer_node_slot = 2305
#4  0x00007f7ae3dc1061 in cancel_timer (ht=0x184cbf0,
entry=0x7f797f4a9e08, flags=0, id_val=0) at ../src/pj/timer.c:591
count = 32634
#5  0x00007f7ae3dc10ea in pj_timer_heap_cancel (ht=0x184cbf0,
entry=0x7f797f4a9e08) at ../src/pj/timer.c:611
#6  0x00007f7ae3d0f0c8 in pjsip_endpt_cancel_timer (endpt=0x184c908,
entry=0x7f797f4a9e08) at ../src/pjsip/sip_endpoint.c:848
#7  0x00007f7ae3cf62f6 in set_timer (sub=0x7f797f4a9cc8, timer_id=2,
seconds=3600) at ../src/pjsip-simple/evsub.c:508
#8  0x00007f7ae3cf91cc in on_tsx_state_uas (sub=0x7f797f4a9cc8,
tsx=0x7f780e0f27d8, event=0x7f783ebf98c0) at
../src/pjsip-simple/evsub.c:2172
expires = 0x7f7a10f9e600
tdata = 0x7f799cbcc068
st_code = 200
st_text = 0x0
body = 0x0
old_state = PJSIP_EVSUB_STATE_ACTIVE
old_state_str = {ptr = 0x7f7ae3dc6f0c "ACTIVE", slen = 6}
reason = {ptr = 0x0, slen = 0}
rdata = 0x7f7a10f95a68
msg = 0x7f7a10f9dac0
res_hdr = {prev = 0x7f783ebf9700, next = 0x7f783ebf9700, type =
284812696, name = {ptr = 0x7f797e162ca8 "", slen = 284809960}, sname = {ptr
= 0x7f797f4a9cc8 "evsub0x7f797f4a9cc8", slen = 140154425546736}, vptr =
0x7f7ae3cf7f2f <on_new_transaction+724>}
status = 0
event_hdr = 0x7f7a10f9e548

Kind regards,

Ross


From: pjsip pjsip-bounces@lists.pjsip.org on behalf of Ming <
ming@teluu.com>
Sent: 30 May 2019 07:20
To: pjsip list
Subject: Re: [pjsip] Segfault in timer.c

Hi Ross,

  • Is there any steps or specific scenario to reproduce the issue?
  • Which PJSIP version are you using? You mentioned ticket #2176, which
    additional patches do you apply to the base version you're using?
  • Our latest fix with regard to the timer is ticket #2191
    (https://trac.pjsip.org/repos/ticket/2191). Please let us know if the
    problem still persists after this patch. Another ticket which may be
    relevant is ticket #2172 (https://trac.pjsip.org/repos/ticket/2172).
  • For the stack trace which you provided, the crash seems to be caused
    when accessing the node 0x7fa9c4c04c50. So, you  will need to check
    the PJSIP log, with PJ_TIMER_DEBUG on, and see to which component the
    timer entry belongs, and when&how the pool which allocates the entry
    gets destroyed, and why isn't the timer cancelled first before the
    deallocation.

Regards,
Ming

On Tue, Apr 9, 2019 at 9:04 PM Ross Beer ross.beer@outlook.com wrote:

Hello,

The previous segfault was with issue #2176 (

https://trac.pjsip.org/repos/changeset/5934) patch applied, this patch
appears to increase the frequency of segfaults in the 'pj_timer_heap_poll'
process instead of mitigating them.

Can anyone offer assistance in getting this resolved?

Regards,

Ross


From: pjsip pjsip-bounces@lists.pjsip.org on behalf of Ross Beer <

Sent: 08 April 2019 17:05
To: pjsip@lists.pjsip.org
Subject: [pjsip] Segfault in timer.c

Hi,

We are seeing multiple segfaults while copying nodes:

Thread 1 (Thread 0x7fa977fff700 (LWP 36699)):
#0  0x00007fac956676fe in copy_node (ht=0x16d0410, slot=430,

moved_node=0x7fa9c4c04c50) at ../src/pj/timer.c:137

#1  0x00007fac956679d9 in reheap_down (ht=0x16d0410,

moved_node=0x7fab60031970, slot=430, child=862) at ../src/pj/timer.c:185

#2  0x00007fac95667d1d in remove_node (ht=0x16d0410, slot=0) at

../src/pj/timer.c:252

     parent = 0
     moved_node = 0x7fab60031970
     removed_node = 0x7fa9b5978ae0

#3  0x00007fac95668634 in pj_timer_heap_poll (ht=0x16d0410,

next_delay=0x7fa977ffecd0) at ../src/pj/timer.c:634

     node = 0x100000002
     grp_lock = 0x31dc8c88
     now = {sec = 8157205, msec = 523}
     count = 0

#4  0x00007fac955b6b06 in pjsip_endpt_handle_events2 (endpt=0x16d0128,

max_timeout=0x7fa977ffed40, p_count=0x0) at ../src/pjsip/sip_endpoint.c:715

     timeout = {sec = 0, msec = 0}
     count = 0
     net_event_count = 0
     c = 0

#5  0x00007fac955b6c4b in pjsip_endpt_handle_events (endpt=0x16d0128,

max_timeout=0x7fa977ffed40) at ../src/pjsip/sip_endpoint.c:776

#6  0x00007faac8024bcf in monitor_thread_exec (endpt=0x0) at

res_pjsip.c:4512

     delay = {sec = 0, msec = 10}

#7  0x00007fac9564fcb0 in thread_main (param=0x1815b28) at

../src/pj/os_core_unix.c:541

     rec = 0x1815b28
     result = 0x0
     rc = 0

#8  0x00007fac9342ddd5 in start_thread () at /usr/lib64/libpthread.so.0
#9  0x00007fac927cfead in clone () at /usr/lib64/libc.so.6

Can anyone assist in resolving the issue?

Regards,

Ross


Visit our blog: http://blog.pjsip.org

pjsip mailing list
pjsip@lists.pjsip.org
http://lists.pjsip.org/mailman/listinfo/pjsip_lists.pjsip.org

Hi Ross, I'm afraid the backtrace alone is not sufficient to pinpoint the cause. We know that the current timer implementation is not easy to troubleshoot, and we're thinking of changing it, but for now in order to investigate the issue further, we will need to get additional information which I mentioned in the previous email, such as the steps to reproduce, and/or the complete PJSIP logs level 6 compiled with the timer debugging on. And I don't think adding locks can help here, because it looks like the issue is caused by a dangling timer entry, i.e. a timer entry whose owner/scheduler has been deleted. Best regards, Ming On Thu, May 30, 2019 at 9:16 PM Ross Beer <ross.beer@outlook.com> wrote: > Hi Ming, > > I am using Asterisk bundled, which is PJSIP 2.8 with patches: > > "Fixed #2191: > - Stricter double timer entry scheduling prevention. > - Integrate group lock in SIP transport, e.g: for add/dec ref, for > timer scheduling." > > I am have just had a further crash in a different location, which looks > like its crashed due to a timer entry being cancelled and an error is > happening while moving nodes. Is more locking required when performing > operations on the timer heap? > > Thread 1 (Thread 0x7f783ebfa700 (LWP 83829)): > #0 0x00007f7ae3dc0249 in copy_node (ht=0x184cbf0, slot=2305, > moved_node=0x7f797732f0c0) at ../src/pj/timer.c:137 > #1 0x00007f7ae3dc0677 in reheap_up (ht=0x184cbf0, > moved_node=0x7f780e0f2990, slot=2305, parent=1152) at ../src/pj/timer.c:208 > #2 0x00007f7ae3dc0882 in remove_node (ht=0x184cbf0, slot=2305) at > ../src/pj/timer.c:254 > parent = 1152 > moved_node = 0x7f780e0f2990 > removed_node = 0x7f797f4a9e08 > #3 0x00007f7ae3dc0b64 in cancel (ht=0x184cbf0, entry=0x7f797f4a9e08, > flags=1) at ../src/pj/timer.c:353 > timer_node_slot = 2305 > #4 0x00007f7ae3dc1061 in cancel_timer (ht=0x184cbf0, > entry=0x7f797f4a9e08, flags=0, id_val=0) at ../src/pj/timer.c:591 > count = 32634 > #5 0x00007f7ae3dc10ea in pj_timer_heap_cancel (ht=0x184cbf0, > entry=0x7f797f4a9e08) at ../src/pj/timer.c:611 > #6 0x00007f7ae3d0f0c8 in pjsip_endpt_cancel_timer (endpt=0x184c908, > entry=0x7f797f4a9e08) at ../src/pjsip/sip_endpoint.c:848 > #7 0x00007f7ae3cf62f6 in set_timer (sub=0x7f797f4a9cc8, timer_id=2, > seconds=3600) at ../src/pjsip-simple/evsub.c:508 > #8 0x00007f7ae3cf91cc in on_tsx_state_uas (sub=0x7f797f4a9cc8, > tsx=0x7f780e0f27d8, event=0x7f783ebf98c0) at > ../src/pjsip-simple/evsub.c:2172 > expires = 0x7f7a10f9e600 > tdata = 0x7f799cbcc068 > st_code = 200 > st_text = 0x0 > body = 0x0 > old_state = PJSIP_EVSUB_STATE_ACTIVE > old_state_str = {ptr = 0x7f7ae3dc6f0c "ACTIVE", slen = 6} > reason = {ptr = 0x0, slen = 0} > rdata = 0x7f7a10f95a68 > msg = 0x7f7a10f9dac0 > res_hdr = {prev = 0x7f783ebf9700, next = 0x7f783ebf9700, type = > 284812696, name = {ptr = 0x7f797e162ca8 "", slen = 284809960}, sname = {ptr > = 0x7f797f4a9cc8 "evsub0x7f797f4a9cc8", slen = 140154425546736}, vptr = > 0x7f7ae3cf7f2f <on_new_transaction+724>} > status = 0 > event_hdr = 0x7f7a10f9e548 > > Kind regards, > > Ross > > ------------------------------ > *From:* pjsip <pjsip-bounces@lists.pjsip.org> on behalf of Ming < > ming@teluu.com> > *Sent:* 30 May 2019 07:20 > *To:* pjsip list > *Subject:* Re: [pjsip] Segfault in timer.c > > Hi Ross, > > - Is there any steps or specific scenario to reproduce the issue? > - Which PJSIP version are you using? You mentioned ticket #2176, which > additional patches do you apply to the base version you're using? > - Our latest fix with regard to the timer is ticket #2191 > (https://trac.pjsip.org/repos/ticket/2191). Please let us know if the > problem still persists after this patch. Another ticket which may be > relevant is ticket #2172 (https://trac.pjsip.org/repos/ticket/2172). > - For the stack trace which you provided, the crash seems to be caused > when accessing the node 0x7fa9c4c04c50. So, you will need to check > the PJSIP log, with PJ_TIMER_DEBUG on, and see to which component the > timer entry belongs, and when&how the pool which allocates the entry > gets destroyed, and why isn't the timer cancelled first before the > deallocation. > > Regards, > Ming > > > On Tue, Apr 9, 2019 at 9:04 PM Ross Beer <ross.beer@outlook.com> wrote: > > > > Hello, > > > > The previous segfault was with issue #2176 ( > https://trac.pjsip.org/repos/changeset/5934) patch applied, this patch > appears to increase the frequency of segfaults in the 'pj_timer_heap_poll' > process instead of mitigating them. > > > > Can anyone offer assistance in getting this resolved? > > > > Regards, > > > > Ross > > > > ________________________________ > > From: pjsip <pjsip-bounces@lists.pjsip.org> on behalf of Ross Beer < > ross.beer@outlook.com> > > Sent: 08 April 2019 17:05 > > To: pjsip@lists.pjsip.org > > Subject: [pjsip] Segfault in timer.c > > > > Hi, > > > > We are seeing multiple segfaults while copying nodes: > > > > Thread 1 (Thread 0x7fa977fff700 (LWP 36699)): > > #0 0x00007fac956676fe in copy_node (ht=0x16d0410, slot=430, > moved_node=0x7fa9c4c04c50) at ../src/pj/timer.c:137 > > #1 0x00007fac956679d9 in reheap_down (ht=0x16d0410, > moved_node=0x7fab60031970, slot=430, child=862) at ../src/pj/timer.c:185 > > #2 0x00007fac95667d1d in remove_node (ht=0x16d0410, slot=0) at > ../src/pj/timer.c:252 > > parent = 0 > > moved_node = 0x7fab60031970 > > removed_node = 0x7fa9b5978ae0 > > #3 0x00007fac95668634 in pj_timer_heap_poll (ht=0x16d0410, > next_delay=0x7fa977ffecd0) at ../src/pj/timer.c:634 > > node = 0x100000002 > > grp_lock = 0x31dc8c88 > > now = {sec = 8157205, msec = 523} > > count = 0 > > #4 0x00007fac955b6b06 in pjsip_endpt_handle_events2 (endpt=0x16d0128, > max_timeout=0x7fa977ffed40, p_count=0x0) at ../src/pjsip/sip_endpoint.c:715 > > timeout = {sec = 0, msec = 0} > > count = 0 > > net_event_count = 0 > > c = 0 > > #5 0x00007fac955b6c4b in pjsip_endpt_handle_events (endpt=0x16d0128, > max_timeout=0x7fa977ffed40) at ../src/pjsip/sip_endpoint.c:776 > > #6 0x00007faac8024bcf in monitor_thread_exec (endpt=0x0) at > res_pjsip.c:4512 > > delay = {sec = 0, msec = 10} > > #7 0x00007fac9564fcb0 in thread_main (param=0x1815b28) at > ../src/pj/os_core_unix.c:541 > > rec = 0x1815b28 > > result = 0x0 > > rc = 0 > > #8 0x00007fac9342ddd5 in start_thread () at /usr/lib64/libpthread.so.0 > > #9 0x00007fac927cfead in clone () at /usr/lib64/libc.so.6 > > > > > > Can anyone assist in resolving the issue? > > > > Regards, > > > > Ross > > _______________________________________________ > > Visit our blog: http://blog.pjsip.org > > > > pjsip mailing list > > pjsip@lists.pjsip.org > > http://lists.pjsip.org/mailman/listinfo/pjsip_lists.pjsip.org > > _______________________________________________ > Visit our blog: http://blog.pjsip.org > > pjsip mailing list > pjsip@lists.pjsip.org > http://lists.pjsip.org/mailman/listinfo/pjsip_lists.pjsip.org > _______________________________________________ > Visit our blog: http://blog.pjsip.org > > pjsip mailing list > pjsip@lists.pjsip.org > http://lists.pjsip.org/mailman/listinfo/pjsip_lists.pjsip.org >