SEGV in pjsip due to race condition (+ patch)

MO
Martin Oberhuber
Wed, Jan 3, 2018 2:58 PM

Dear PJSIP project,

I've come across a SEGV in pjproject-2.6 on Linux under the following
circumstances:

  • An active account re-registers
  • The re-registration fails (in my case, DNS resolution timeout after 10
    seconds)
  • At exactly the same time, the keep-alive timer fires.

The issue was consistently reproducible for me under a very high network
load, with the re-registration interval set to 5 seconds, DNS timeout
default (10 seconds) and keepalive interval default (15 seconds). Since
5+10 seconds == 15 seconds, the 2 events coincide and lead to the following
backtrace:

Thread #1 (Suspended : Signal : SIGSEGV:Segmentation fault)
keep_alive_timer_cb() at pjsua_acc.c:1,981 0x76bcb424
pj_timer_heap_poll() at timer.c:643 0x76e43244
pjsip_endpt_handle_events2() at sip_endpoint.c:713 0x76cc0dd8
worker_thread() at pjsua_core.c:695 0x76bde404

Thread #2
__pthread_mutex_unlock_usercnt() at pthread_mutex_unlock.c:66 0x4940a4d8
__GI___pthread_mutex_unlock() at pthread_mutex_unlock.c:315 0x4940a588
pj_mutex_unlock() at os_core_unix.c:1,323 0x76e4241c
PJSUA_UNLOCK() at pjsua_internal.h:584 0x76be4244
pjsua_acc_set_registration() at pjsua_acc.c:2,682 0x76be4244
pj::Account::getInfo() at account.cpp:737 0x76e86438

Analyzing the backtrace, I found 2 problems which are fixed in attached
patch:

  1. timer.c: pj_timer_heap_poll places the timer onto the freelist and
    releases the global lock before calling the callback -- thus the callback
    may operate on a timer already freed! Proposed fix: keep timer_entry out of
    the freelist until the callback is done.
  2. pjsua_acc.c: Even with the 1st issue fixed, the account registration
    could still be canceled "exactly when the callback fires", because the lock
    is released before the callback ... thus putting NULL into the ka_transport
    thus causing the SEGV. Proposed fix: protect against NULL in ka_transport.

While the patch is against pjroject-2.6 , I believe that the issue is still
in latest trunk as well.
Please let me know what you think about the attached patch, and consider it
for inclusion in pjproject.

Thanks,
Martin

Martin Oberhuber https://at.linkedin.com/in/martin-oberhuber-25b43a8
| Software Architect, Project Lead & Consultant | Austria

Dear PJSIP project, I've come across a SEGV in pjproject-2.6 on Linux under the following circumstances: - An active account re-registers - The re-registration fails (in my case, DNS resolution timeout after 10 seconds) - At exactly the same time, the keep-alive timer fires. The issue was consistently reproducible for me under a very high network load, with the re-registration interval set to 5 seconds, DNS timeout default (10 seconds) and keepalive interval default (15 seconds). Since 5+10 seconds == 15 seconds, the 2 events coincide and lead to the following backtrace: Thread #1 (Suspended : Signal : SIGSEGV:Segmentation fault) keep_alive_timer_cb() at pjsua_acc.c:1,981 0x76bcb424 pj_timer_heap_poll() at timer.c:643 0x76e43244 pjsip_endpt_handle_events2() at sip_endpoint.c:713 0x76cc0dd8 worker_thread() at pjsua_core.c:695 0x76bde404 Thread #2 __pthread_mutex_unlock_usercnt() at pthread_mutex_unlock.c:66 0x4940a4d8 __GI___pthread_mutex_unlock() at pthread_mutex_unlock.c:315 0x4940a588 pj_mutex_unlock() at os_core_unix.c:1,323 0x76e4241c PJSUA_UNLOCK() at pjsua_internal.h:584 0x76be4244 pjsua_acc_set_registration() at pjsua_acc.c:2,682 0x76be4244 pj::Account::getInfo() at account.cpp:737 0x76e86438 Analyzing the backtrace, I found 2 problems which are fixed in attached patch: 1. timer.c: pj_timer_heap_poll places the timer onto the freelist and releases the global lock before calling the callback -- thus the callback may operate on a timer already freed! Proposed fix: keep timer_entry out of the freelist until the callback is done. 2. pjsua_acc.c: Even with the 1st issue fixed, the account registration could still be canceled "exactly when the callback fires", because the lock is released before the callback ... thus putting NULL into the ka_transport thus causing the SEGV. Proposed fix: protect against NULL in ka_transport. While the patch is against pjroject-2.6 , I believe that the issue is still in latest trunk as well. Please let me know what you think about the attached patch, and consider it for inclusion in pjproject. Thanks, Martin -- Martin Oberhuber <https://at.linkedin.com/in/martin-oberhuber-25b43a8> | Software Architect, Project Lead & Consultant | Austria
OB
Oren Barash
Sun, Jan 7, 2018 1:07 PM

Hello All,

I am also experiencing SEGV on Android regarding the same issue.
After phone loses signal, when getting it back the phone might connect to and 2g network, the jump to 3g or 4g.
When it “jumps” between networks, the android ConnectivityManager publishes network disconnection and then connection.

So when we try to remove the account and re create it, in some rear situation PJSIP just crushes with SEGV.

What the PJ owners think about this patch ?

Thanks,
Oren

On 3 Jan 2018, at 16:58, Martin Oberhuber mober.at@gmail.com wrote:

Dear PJSIP project,

I've come across a SEGV in pjproject-2.6 on Linux under the following circumstances:
An active account re-registers
The re-registration fails (in my case, DNS resolution timeout after 10 seconds)
At exactly the same time, the keep-alive timer fires.
The issue was consistently reproducible for me under a very high network load, with the re-registration interval set to 5 seconds, DNS timeout default (10 seconds) and keepalive interval default (15 seconds). Since 5+10 seconds == 15 seconds, the 2 events coincide and lead to the following backtrace:

Thread #1 (Suspended : Signal : SIGSEGV:Segmentation fault)
keep_alive_timer_cb() at pjsua_acc.c:1,981 0x76bcb424
pj_timer_heap_poll() at timer.c:643 0x76e43244
pjsip_endpt_handle_events2() at sip_endpoint.c:713 0x76cc0dd8
worker_thread() at pjsua_core.c:695 0x76bde404
Thread #2
__pthread_mutex_unlock_usercnt() at pthread_mutex_unlock.c:66 0x4940a4d8
__GI___pthread_mutex_unlock() at pthread_mutex_unlock.c:315 0x4940a588
pj_mutex_unlock() at os_core_unix.c:1,323 0x76e4241c
PJSUA_UNLOCK() at pjsua_internal.h:584 0x76be4244
pjsua_acc_set_registration() at pjsua_acc.c:2,682 0x76be4244
pj::Account::getInfo() at account.cpp:737 0x76e86438

Analyzing the backtrace, I found 2 problems which are fixed in attached patch:
timer.c: pj_timer_heap_poll places the timer onto the freelist and releases the global lock before calling the callback -- thus the callback may operate on a timer already freed! Proposed fix: keep timer_entry out of the freelist until the callback is done.
pjsua_acc.c: Even with the 1st issue fixed, the account registration could still be canceled "exactly when the callback fires", because the lock is released before the callback ... thus putting NULL into the ka_transport thus causing the SEGV. Proposed fix: protect against NULL in ka_transport.
While the patch is against pjroject-2.6 , I believe that the issue is still in latest trunk as well.
Please let me know what you think about the attached patch, and consider it for inclusion in pjproject.

Thanks,
Martin

Martin Oberhuber https://at.linkedin.com/in/martin-oberhuber-25b43a8 | Software Architect, Project Lead & Consultant | Austria
<pjproject-2.6-timer.patch>_______________________________________________
Visit our blog: http://blog.pjsip.org

pjsip mailing list
pjsip@lists.pjsip.org
http://lists.pjsip.org/mailman/listinfo/pjsip_lists.pjsip.org

Hello All, I am also experiencing SEGV on Android regarding the same issue. After phone loses signal, when getting it back the phone might connect to and 2g network, the jump to 3g or 4g. When it “jumps” between networks, the android ConnectivityManager publishes network disconnection and then connection. So when we try to remove the account and re create it, in some rear situation PJSIP just crushes with SEGV. What the PJ owners think about this patch ? Thanks, Oren > On 3 Jan 2018, at 16:58, Martin Oberhuber <mober.at@gmail.com> wrote: > > Dear PJSIP project, > > I've come across a SEGV in pjproject-2.6 on Linux under the following circumstances: > An active account re-registers > The re-registration fails (in my case, DNS resolution timeout after 10 seconds) > At exactly the same time, the keep-alive timer fires. > The issue was consistently reproducible for me under a very high network load, with the re-registration interval set to 5 seconds, DNS timeout default (10 seconds) and keepalive interval default (15 seconds). Since 5+10 seconds == 15 seconds, the 2 events coincide and lead to the following backtrace: > > Thread #1 (Suspended : Signal : SIGSEGV:Segmentation fault) > keep_alive_timer_cb() at pjsua_acc.c:1,981 0x76bcb424 > pj_timer_heap_poll() at timer.c:643 0x76e43244 > pjsip_endpt_handle_events2() at sip_endpoint.c:713 0x76cc0dd8 > worker_thread() at pjsua_core.c:695 0x76bde404 > Thread #2 > __pthread_mutex_unlock_usercnt() at pthread_mutex_unlock.c:66 0x4940a4d8 > __GI___pthread_mutex_unlock() at pthread_mutex_unlock.c:315 0x4940a588 > pj_mutex_unlock() at os_core_unix.c:1,323 0x76e4241c > PJSUA_UNLOCK() at pjsua_internal.h:584 0x76be4244 > pjsua_acc_set_registration() at pjsua_acc.c:2,682 0x76be4244 > pj::Account::getInfo() at account.cpp:737 0x76e86438 > > Analyzing the backtrace, I found 2 problems which are fixed in attached patch: > timer.c: pj_timer_heap_poll places the timer onto the freelist and releases the global lock before calling the callback -- thus the callback may operate on a timer already freed! Proposed fix: keep timer_entry out of the freelist until the callback is done. > pjsua_acc.c: Even with the 1st issue fixed, the account registration could still be canceled "exactly when the callback fires", because the lock is released before the callback ... thus putting NULL into the ka_transport thus causing the SEGV. Proposed fix: protect against NULL in ka_transport. > While the patch is against pjroject-2.6 , I believe that the issue is still in latest trunk as well. > Please let me know what you think about the attached patch, and consider it for inclusion in pjproject. > > Thanks, > Martin > -- > Martin Oberhuber <https://at.linkedin.com/in/martin-oberhuber-25b43a8> | Software Architect, Project Lead & Consultant | Austria > <pjproject-2.6-timer.patch>_______________________________________________ > Visit our blog: http://blog.pjsip.org > > pjsip mailing list > pjsip@lists.pjsip.org > http://lists.pjsip.org/mailman/listinfo/pjsip_lists.pjsip.org
M
Ming
Mon, Jan 8, 2018 3:22 AM

Hi Martin,

I've committed the patch in ticket #2079
(https://trac.pjsip.org/repos/ticket/2079).

Thanks for such detailed report and analysis as well as the patch.
And thanks to Oren for bumping this post up.

Best regards,
Ming

On Sun, Jan 7, 2018 at 9:07 PM, Oren Barash oren.elbitsystems@gmail.com wrote:

Hello All,

I am also experiencing SEGV on Android regarding the same issue.
After phone loses signal, when getting it back the phone might connect to
and 2g network, the jump to 3g or 4g.
When it “jumps” between networks, the android ConnectivityManager publishes
network disconnection and then connection.

So when we try to remove the account and re create it, in some rear
situation PJSIP just crushes with SEGV.

What the PJ owners think about this patch ?

Thanks,
Oren

On 3 Jan 2018, at 16:58, Martin Oberhuber mober.at@gmail.com wrote:

Dear PJSIP project,

I've come across a SEGV in pjproject-2.6 on Linux under the following
circumstances:

An active account re-registers
The re-registration fails (in my case, DNS resolution timeout after 10
seconds)
At exactly the same time, the keep-alive timer fires.

The issue was consistently reproducible for me under a very high network
load, with the re-registration interval set to 5 seconds, DNS timeout
default (10 seconds) and keepalive interval default (15 seconds). Since 5+10
seconds == 15 seconds, the 2 events coincide and lead to the following
backtrace:

Thread #1 (Suspended : Signal : SIGSEGV:Segmentation fault)
keep_alive_timer_cb() at pjsua_acc.c:1,981 0x76bcb424
pj_timer_heap_poll() at timer.c:643 0x76e43244
pjsip_endpt_handle_events2() at sip_endpoint.c:713 0x76cc0dd8
worker_thread() at pjsua_core.c:695 0x76bde404

Thread #2
__pthread_mutex_unlock_usercnt() at pthread_mutex_unlock.c:66 0x4940a4d8

__GI___pthread_mutex_unlock() at pthread_mutex_unlock.c:315 0x4940a588
pj_mutex_unlock() at os_core_unix.c:1,323 0x76e4241c
PJSUA_UNLOCK() at pjsua_internal.h:584 0x76be4244
pjsua_acc_set_registration() at pjsua_acc.c:2,682 0x76be4244
pj::Account::getInfo() at account.cpp:737 0x76e86438

Analyzing the backtrace, I found 2 problems which are fixed in attached
patch:

timer.c: pj_timer_heap_poll places the timer onto the freelist and releases
the global lock before calling the callback -- thus the callback may operate
on a timer already freed! Proposed fix: keep timer_entry out of the freelist
until the callback is done.
pjsua_acc.c: Even with the 1st issue fixed, the account registration could
still be canceled "exactly when the callback fires", because the lock is
released before the callback ... thus putting NULL into the ka_transport
thus causing the SEGV. Proposed fix: protect against NULL in ka_transport.

While the patch is against pjroject-2.6 , I believe that the issue is still
in latest trunk as well.
Please let me know what you think about the attached patch, and consider it
for inclusion in pjproject.

Thanks,
Martin

Martin Oberhuber | Software Architect, Project Lead & Consultant | Austria

<pjproject-2.6-timer.patch>_______________________________________________
Visit our blog: http://blog.pjsip.org

pjsip mailing list
pjsip@lists.pjsip.org
http://lists.pjsip.org/mailman/listinfo/pjsip_lists.pjsip.org


Visit our blog: http://blog.pjsip.org

pjsip mailing list
pjsip@lists.pjsip.org
http://lists.pjsip.org/mailman/listinfo/pjsip_lists.pjsip.org

Hi Martin, I've committed the patch in ticket #2079 (https://trac.pjsip.org/repos/ticket/2079). Thanks for such detailed report and analysis as well as the patch. And thanks to Oren for bumping this post up. Best regards, Ming On Sun, Jan 7, 2018 at 9:07 PM, Oren Barash <oren.elbitsystems@gmail.com> wrote: > Hello All, > > I am also experiencing SEGV on Android regarding the same issue. > After phone loses signal, when getting it back the phone might connect to > and 2g network, the jump to 3g or 4g. > When it “jumps” between networks, the android ConnectivityManager publishes > network disconnection and then connection. > > So when we try to remove the account and re create it, in some rear > situation PJSIP just crushes with SEGV. > > What the PJ owners think about this patch ? > > Thanks, > Oren > > > > > On 3 Jan 2018, at 16:58, Martin Oberhuber <mober.at@gmail.com> wrote: > > Dear PJSIP project, > > I've come across a SEGV in pjproject-2.6 on Linux under the following > circumstances: > > An active account re-registers > The re-registration fails (in my case, DNS resolution timeout after 10 > seconds) > At exactly the same time, the keep-alive timer fires. > > The issue was consistently reproducible for me under a very high network > load, with the re-registration interval set to 5 seconds, DNS timeout > default (10 seconds) and keepalive interval default (15 seconds). Since 5+10 > seconds == 15 seconds, the 2 events coincide and lead to the following > backtrace: > > Thread #1 (Suspended : Signal : SIGSEGV:Segmentation fault) > keep_alive_timer_cb() at pjsua_acc.c:1,981 0x76bcb424 > pj_timer_heap_poll() at timer.c:643 0x76e43244 > pjsip_endpt_handle_events2() at sip_endpoint.c:713 0x76cc0dd8 > worker_thread() at pjsua_core.c:695 0x76bde404 > > Thread #2 > __pthread_mutex_unlock_usercnt() at pthread_mutex_unlock.c:66 0x4940a4d8 > > __GI___pthread_mutex_unlock() at pthread_mutex_unlock.c:315 0x4940a588 > pj_mutex_unlock() at os_core_unix.c:1,323 0x76e4241c > PJSUA_UNLOCK() at pjsua_internal.h:584 0x76be4244 > pjsua_acc_set_registration() at pjsua_acc.c:2,682 0x76be4244 > pj::Account::getInfo() at account.cpp:737 0x76e86438 > > Analyzing the backtrace, I found 2 problems which are fixed in attached > patch: > > timer.c: pj_timer_heap_poll places the timer onto the freelist and releases > the global lock before calling the callback -- thus the callback may operate > on a timer already freed! Proposed fix: keep timer_entry out of the freelist > until the callback is done. > pjsua_acc.c: Even with the 1st issue fixed, the account registration could > still be canceled "exactly when the callback fires", because the lock is > released before the callback ... thus putting NULL into the ka_transport > thus causing the SEGV. Proposed fix: protect against NULL in ka_transport. > > While the patch is against pjroject-2.6 , I believe that the issue is still > in latest trunk as well. > Please let me know what you think about the attached patch, and consider it > for inclusion in pjproject. > > Thanks, > Martin > -- > > Martin Oberhuber | Software Architect, Project Lead & Consultant | Austria > > <pjproject-2.6-timer.patch>_______________________________________________ > Visit our blog: http://blog.pjsip.org > > pjsip mailing list > pjsip@lists.pjsip.org > http://lists.pjsip.org/mailman/listinfo/pjsip_lists.pjsip.org > > > > _______________________________________________ > Visit our blog: http://blog.pjsip.org > > pjsip mailing list > pjsip@lists.pjsip.org > http://lists.pjsip.org/mailman/listinfo/pjsip_lists.pjsip.org >