Possible ICE check bug

JR
Jonni Rainisto
Mon, Jan 16, 2017 8:40 AM

Possible ICE check bug

Seems that ICE checking has an issue one of our VPN environment where
mediarelay handles the call proxying in natted env.
Seems ICE check is failing because of timing issues in the code.
Tested also in latest trunk and added some logging on called functions.

In successful case log shows:

09:09:01.021        icetp00  .Check 7: [2] 10.4.1.36:64134-->
10.115.131.146:50391: state changed from In Progress to Succeeded
09:09:01.021        icetp00  .on_check_complete
09:09:01.021        icetp00  .Check 7 is successful
09:09:01.022        icetp00  .on_stun_send_msg
09:09:01.024        icetp00  pj_ice_sess_on_rx_pkt
09:09:01.024        icetp00  .on_stun_rx_request
09:09:01.024        icetp00  ..on_stun_send_msg
09:09:01.024        icetp00  .handle_incoming_check
09:09:01.024        icetp00  .update_comp_check
09:09:01.024        icetp00  .Valid check 0: [2] 10.115.131.146:50389-->
10.115.131.146:50391 is nominated
09:09:01.024        icetp00  .Triggered check for check 7 not performed
because it's completed
09:09:01.024        icetp00  ..on_check_complete
09:09:01.024        icetp00  ..Check 7 is successful  and nominated
09:09:01.025        icetp00  ..Cancelling check 2: [2] 10.4.1.36:64134-->
10.115.5.10:38561 (In Progress)
09:09:01.025        icetp00  ..Check 2: [2] 10.4.1.36:64134-->
10.115.5.10:38561: state changed from In Progress to Failed
09:09:01.025        icetp00  ..Cancelling check 3: [2] 10.4.1.36:64134-->
100.114.251.142:38561 (In Progress)
09:09:01.025        icetp00  ..Check 3: [2] 10.4.1.36:64134-->
100.114.251.142:38561: state changed from In Progress to Failed
09:09:01.025        icetp00  ..Cancelling check 6: [2] 10.4.1.36:64134-->
10.115.131.146:50395 (In Progress)
09:09:01.025        icetp00  ..Check 6: [2] 10.4.1.36:64134-->
10.115.131.146:50395: state changed from In Progress to Failed
09:09:01.026        icetp00  pj_ice_sess_on_rx_pkt
09:09:01.026        icetp00  .on_stun_rx_request
09:09:01.026        icetp00  ..on_stun_send_msg
09:09:01.027        icetp00  .handle_incoming_check
09:09:01.027        icetp00  .Triggered check for check 4 not performed
because it's in progress. Retransmitting
09:09:01.027        icetp00  ...on_stun_send_msg
09:09:01.046        icetp00  pj_ice_sess_on_rx_pkt
09:09:01.046        icetp00  .on_stun_rx_request
09:09:01.046        icetp00  ..on_stun_send_msg
09:09:01.046        icetp00  .handle_incoming_check
09:09:01.046        icetp00  .Triggered check for check 4 not performed
because it's in progress. Retransmitting
09:09:01.046        icetp00  ...on_stun_send_msg
09:09:01.101        icetp00  .on_stun_send_msg
09:09:01.105        icetp00  pj_ice_sess_on_rx_pkt
09:09:01.105        icetp00  .on_stun_request_complete
09:09:01.105        icetp00  .Check 4: [1] 10.4.1.36:64130-->
10.115.131.146:50394 (nominated): connectivity check SUCCESS

So if comp_id=1(check 7) gets "is successful  and nominated" before next
check is triggered, it triggers comp_id=2(Check 4 retransmits) and voice
calls works nicely.
But if Triggered (handle_incoming_check) check happens before "is
successful  and nominated" has happened, then it tries the old Check 7 and
will never trigger on Check 4 which is in Progress state:

12:54:37.239        icetp00  ..on_stun_send_msg
12:54:37.239        icetp00  .handle_incoming_check
12:54:37.239        icetp00  .update_comp_check
12:54:37.239        icetp00  .Valid check 0: [2] 10.115.131.146:50453-->
10.115.131.146:50455 is nominated
12:54:37.239        icetp00  .Triggered check for check 7 not performed
because it's completed
12:54:37.239        icetp00  ..on_check_complete
12:54:37.239        icetp00  ..Check 7 is successful  and nominated
12:54:37.239        icetp00  ..Cancelling check 2: [2] 10.4.1.36:62952-->
10.115.5.10:46907 (In Progress)
12:54:37.239        icetp00  ..Check 2: [2] 10.4.1.36:62952-->
10.115.5.10:46907: state changed from In Progress to Failed
12:54:37.239        icetp00  ..Cancelling check 3: [2] 10.4.1.36:62952-->
100.114.251.142:46907 (In Progress)
12:54:37.240        icetp00  ..Check 3: [2] 10.4.1.36:62952-->
100.114.251.142:46907: state changed from In Progress to Failed
12:54:37.240        icetp00  ..Cancelling check 6: [2] 10.4.1.36:62952-->
10.115.131.146:50459 (In Progress)
12:54:37.240        icetp00  ..Check 6: [2] 10.4.1.36:62952-->
10.115.131.146:50459: state changed from In Progress to Failed
12:54:37.240        icetp00  pj_ice_sess_on_rx_pkt
12:54:37.241        icetp00  pj_ice_sess_on_rx_pkt
12:54:37.241        icetp00  .on_stun_rx_request
12:54:37.241        icetp00  ..on_stun_send_msg
12:54:37.241        icetp00  .handle_incoming_check
12:54:37.241        icetp00  .update_comp_check
12:54:37.242        icetp00  .Valid check 0: [2] 10.115.131.146:50453-->
10.115.131.146:50455 is nominated
12:54:37.242        icetp00  .Triggered check for check 7 not performed
because it's completed
12:54:37.242        icetp00  ..on_check_complete
12:54:37.242        icetp00  ..Check 7 is successful  and nominated
12:54:37.244        icetp00  .on_stun_send_msg

Any ideas where to continue debugging and/or how to fix it? Seems that 20ms
tiggering seems to stops happening when complete check on 1st comp_id takes
too much time.

Br, Jonni

Possible ICE check bug Seems that ICE checking has an issue one of our VPN environment where mediarelay handles the call proxying in natted env. Seems ICE check is failing because of timing issues in the code. Tested also in latest trunk and added some logging on called functions. In successful case log shows: 09:09:01.021 icetp00 .Check 7: [2] 10.4.1.36:64134--> 10.115.131.146:50391: state changed from In Progress to Succeeded 09:09:01.021 icetp00 .on_check_complete 09:09:01.021 icetp00 .Check 7 is successful 09:09:01.022 icetp00 .on_stun_send_msg 09:09:01.024 icetp00 pj_ice_sess_on_rx_pkt 09:09:01.024 icetp00 .on_stun_rx_request 09:09:01.024 icetp00 ..on_stun_send_msg 09:09:01.024 icetp00 .handle_incoming_check 09:09:01.024 icetp00 .update_comp_check 09:09:01.024 icetp00 .Valid check 0: [2] 10.115.131.146:50389--> 10.115.131.146:50391 is nominated 09:09:01.024 icetp00 .Triggered check for check 7 not performed because it's completed 09:09:01.024 icetp00 ..on_check_complete 09:09:01.024 icetp00 ..Check 7 is successful and nominated 09:09:01.025 icetp00 ..Cancelling check 2: [2] 10.4.1.36:64134--> 10.115.5.10:38561 (In Progress) 09:09:01.025 icetp00 ..Check 2: [2] 10.4.1.36:64134--> 10.115.5.10:38561: state changed from In Progress to Failed 09:09:01.025 icetp00 ..Cancelling check 3: [2] 10.4.1.36:64134--> 100.114.251.142:38561 (In Progress) 09:09:01.025 icetp00 ..Check 3: [2] 10.4.1.36:64134--> 100.114.251.142:38561: state changed from In Progress to Failed 09:09:01.025 icetp00 ..Cancelling check 6: [2] 10.4.1.36:64134--> 10.115.131.146:50395 (In Progress) 09:09:01.025 icetp00 ..Check 6: [2] 10.4.1.36:64134--> 10.115.131.146:50395: state changed from In Progress to Failed 09:09:01.026 icetp00 pj_ice_sess_on_rx_pkt 09:09:01.026 icetp00 .on_stun_rx_request 09:09:01.026 icetp00 ..on_stun_send_msg 09:09:01.027 icetp00 .handle_incoming_check 09:09:01.027 icetp00 .Triggered check for check 4 not performed because it's in progress. Retransmitting 09:09:01.027 icetp00 ...on_stun_send_msg 09:09:01.046 icetp00 pj_ice_sess_on_rx_pkt 09:09:01.046 icetp00 .on_stun_rx_request 09:09:01.046 icetp00 ..on_stun_send_msg 09:09:01.046 icetp00 .handle_incoming_check 09:09:01.046 icetp00 .Triggered check for check 4 not performed because it's in progress. Retransmitting 09:09:01.046 icetp00 ...on_stun_send_msg 09:09:01.101 icetp00 .on_stun_send_msg 09:09:01.105 icetp00 pj_ice_sess_on_rx_pkt 09:09:01.105 icetp00 .on_stun_request_complete 09:09:01.105 icetp00 .Check 4: [1] 10.4.1.36:64130--> 10.115.131.146:50394 (nominated): connectivity check SUCCESS So if comp_id=1(check 7) gets "is successful and nominated" before next check is triggered, it triggers comp_id=2(Check 4 retransmits) and voice calls works nicely. But if Triggered (handle_incoming_check) check happens before "is successful and nominated" has happened, then it tries the old Check 7 and will never trigger on Check 4 which is in Progress state: 12:54:37.239 icetp00 ..on_stun_send_msg 12:54:37.239 icetp00 .handle_incoming_check 12:54:37.239 icetp00 .update_comp_check 12:54:37.239 icetp00 .Valid check 0: [2] 10.115.131.146:50453--> 10.115.131.146:50455 is nominated 12:54:37.239 icetp00 .Triggered check for check 7 not performed because it's completed 12:54:37.239 icetp00 ..on_check_complete 12:54:37.239 icetp00 ..Check 7 is successful and nominated 12:54:37.239 icetp00 ..Cancelling check 2: [2] 10.4.1.36:62952--> 10.115.5.10:46907 (In Progress) 12:54:37.239 icetp00 ..Check 2: [2] 10.4.1.36:62952--> 10.115.5.10:46907: state changed from In Progress to Failed 12:54:37.239 icetp00 ..Cancelling check 3: [2] 10.4.1.36:62952--> 100.114.251.142:46907 (In Progress) 12:54:37.240 icetp00 ..Check 3: [2] 10.4.1.36:62952--> 100.114.251.142:46907: state changed from In Progress to Failed 12:54:37.240 icetp00 ..Cancelling check 6: [2] 10.4.1.36:62952--> 10.115.131.146:50459 (In Progress) 12:54:37.240 icetp00 ..Check 6: [2] 10.4.1.36:62952--> 10.115.131.146:50459: state changed from In Progress to Failed 12:54:37.240 icetp00 pj_ice_sess_on_rx_pkt 12:54:37.241 icetp00 pj_ice_sess_on_rx_pkt 12:54:37.241 icetp00 .on_stun_rx_request 12:54:37.241 icetp00 ..on_stun_send_msg 12:54:37.241 icetp00 .handle_incoming_check 12:54:37.241 icetp00 .update_comp_check 12:54:37.242 icetp00 .Valid check 0: [2] 10.115.131.146:50453--> 10.115.131.146:50455 is nominated 12:54:37.242 icetp00 .Triggered check for check 7 not performed because it's completed 12:54:37.242 icetp00 ..on_check_complete 12:54:37.242 icetp00 ..Check 7 is successful and nominated 12:54:37.244 icetp00 .on_stun_send_msg Any ideas where to continue debugging and/or how to fix it? Seems that 20ms tiggering seems to stops happening when complete check on 1st comp_id takes too much time. Br, Jonni