Empathy List Archives

Nowlan, Sean

Thu, Oct 20, 2011 5:36 PM

Hi all,

I'm experiencing underruns when running 500kbps BPSK using GNUradio's benchmark_tx.py, which seems like too low a bandwidth to make an E100 choke. My thoughts on how to deal with this issue:

1) Rebuild GNUradio with ARM NEON extensions (I'm running on a version without these).

2) Switch from COMPLEX_FLOAT32 to COMPLEX_INT16 or COMPLEX_INT8. (What more is involved besides changing the io_type in the UHD sink object instantiation?)

Any other thoughts or comments would be greatly appreciated. Sorry if this was more appropriate to post in discuss-gnuradio; this is one of these issues that could go either way.

Thanks,
Sean
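
For a rough sense of the data rates in question, the arithmetic can be sketched as below. The figure of 2 samples per symbol is an assumption (the thread does not state it), and the byte sizes are simply the widths of the three sample formats named above.

```python
# Back-of-the-envelope host-to-device rates for the formats discussed above.
# ASSUMPTION: 2 samples per symbol; the thread does not state the value used.
BIT_RATE = 500_000          # bits per second
BITS_PER_SYMBOL = 1         # BPSK carries one bit per symbol
SAMPLES_PER_SYMBOL = 2      # assumed, not taken from the thread

# Bytes per complex sample (I and Q components together).
BYTES_PER_SAMPLE = {
    "COMPLEX_FLOAT32": 8,   # 2 x 32-bit float
    "COMPLEX_INT16": 4,     # 2 x 16-bit integer
    "COMPLEX_INT8": 2,      # 2 x 8-bit integer
}


def sample_rate(bit_rate, bits_per_symbol, samples_per_symbol):
    """Complex samples per second needed to carry bit_rate."""
    return bit_rate // bits_per_symbol * samples_per_symbol


def byte_rates(rate):
    """Bytes per second for each sample format at the given sample rate."""
    return {name: rate * size for name, size in BYTES_PER_SAMPLE.items()}


if __name__ == "__main__":
    rate = sample_rate(BIT_RATE, BITS_PER_SYMBOL, SAMPLES_PER_SYMBOL)
    for name, bps in byte_rates(rate).items():
        print(f"{name}: {bps / 1e6:.1f} MB/s at {rate / 1e6:.1f} MS/s")
```

Under that assumption, moving from float32 to int16 halves the bytes handed to the device, but it does not by itself reduce the arithmetic done in the modulator.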


Josh Blum

Thu, Oct 20, 2011 5:47 PM

On 10/20/2011 10:36 AM, Nowlan, Sean wrote:

> Hi all,
>
> I'm experiencing underruns when running 500kbps BPSK using
> GNUradio's benchmark_tx.py, which seems like too low a bandwidth to
> make an E100 choke. My thoughts on how to deal with this issue:
>
> 1) Rebuild GNUradio with ARM NEON extensions (I'm running on a
> version without these).
>
> 2) Switch from COMPLEX_FLOAT32 to COMPLEX_INT16 or COMPLEX_INT8.
> (What more is involved besides changing the io_type in the UHD sink
> object instantiation?)
>
> Any other thoughts or comments would be greatly appreciated. Sorry if
> this was more appropriate to post in discuss-gnuradio; this is one of
> these issues that could go either way.

Well, in general, the benchmark stuff is just an example to demonstrate a
complete rx/tx chain plus MAC layer. But it's actually pretty poor, both
as an example and in terms of performance (even on x86).

I know gnuradio doesn't have real MAC layer support because we still
need to invent message passing, but even so, the de-framer/correlator
for this app is written entirely in Python (not even numpy).

The thing about the IO type is that the benchmark mod/demod chains work
on complex float32, so you have to make a complex int16 version of the
blocks in that chain to use complex int16 as an IO type.

Things you may consider:
Implementing a better packet framer/deframer. Using less floating point:
think NEON-optimized FIR filters and such. This is actually what the
volk component will be used for. So we write a FIR filter
implementation, and then call into a volk dot product kernel. When
somebody finds out that the filter is the bottleneck, you add a NEON or
assembly implementation for that kernel.

-Josh
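
The FIR-on-top-of-a-kernel idea described above can be sketched roughly as follows. Every name here is hypothetical and none of it is the volk API; `dot_unrolled4` merely stands in for an optimized (for example NEON) kernel that returns the same result as the portable one.

```python
# Sketch of a FIR filter that delegates its inner loop to a dot-product kernel.
# All names are hypothetical; this is not the volk API.


def dot_generic(a, b):
    """Portable reference kernel: sum of element-wise products."""
    total = 0.0
    for x, y in zip(a, b):
        total += x * y
    return total


def dot_unrolled4(a, b):
    """Stand-in for an optimized kernel: same result, computed 4 lanes at a time."""
    n = len(a) - len(a) % 4
    acc = [0.0, 0.0, 0.0, 0.0]
    for i in range(0, n, 4):
        for lane in range(4):
            acc[lane] += a[i + lane] * b[i + lane]
    # Handle the leftover elements that do not fill a group of four.
    tail = sum(a[i] * b[i] for i in range(n, len(a)))
    return sum(acc) + tail


def fir_filter(taps, samples, dot=dot_generic):
    """Filter samples with taps, making one kernel call per output sample."""
    reversed_taps = taps[::-1]
    n_out = len(samples) - len(taps) + 1
    return [dot(reversed_taps, samples[i:i + len(taps)]) for i in range(n_out)]
```

The filter code never changes; swapping the `dot` argument is the only step needed to try a faster kernel once profiling shows the filter is the hot spot.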


Nowlan, Sean

Mon, Oct 24, 2011 5:34 PM

Thanks. So what kind of performance gains would a C++ implementation buy me? (I know that question is loaded - it would depend on how it's implemented, of course, and it probably differs depending on the particular application).

Just to make sure, if I instantiate a GNUradio UHD Sink with any of the supported IO types, I just have to make sure I feed it samples of the correct range, i.e., [-1.0,+1.0] for float and [-2^15, +2^15-1] for COMPLEX_INT16?

Do you suspect that the bottleneck is the ARM processor? Will moving the python tx_chain to COMPLEX_INT16 help significantly? I don't care so much about the framer and not at all about the receiver. I just need to TX at a constant bitrate of 500 kbps.

Thanks,
Sean
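
A signed 16-bit integer spans -32768 to 32767. Here is a minimal sketch of scaling float samples into that range, assuming full scale maps to 32767, out-of-range values are clipped, and the samples are packed as interleaved little-endian I/Q pairs. The function names and the byte order are assumptions made for illustration, not something the thread or UHD specifies here.

```python
import struct

INT16_MAX = 32767    # 2**15 - 1
INT16_MIN = -32768   # -(2**15)


def float_to_int16(value, full_scale=INT16_MAX):
    """Scale a float in [-1.0, +1.0] to int16, clipping anything outside."""
    scaled = int(round(value * full_scale))
    return max(INT16_MIN, min(INT16_MAX, scaled))


def pack_complex_int16(samples):
    """Pack complex floats as interleaved int16 I/Q pairs (little-endian assumed)."""
    out = bytearray()
    for s in samples:
        out += struct.pack("<hh", float_to_int16(s.real), float_to_int16(s.imag))
    return bytes(out)
```

The conversion itself costs one multiply and one clip per component, so it only pays off if the blocks upstream of it are also cheaper in integer form.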



Philip Balister

Mon, Oct 24, 2011 6:33 PM

On 10/24/2011 01:34 PM, Nowlan, Sean wrote:

> Thanks. So what kind of performance gains would a C++ implementation buy me? (I know that question is loaded - it would depend on how it's implemented, of course, and it probably differs depending on the particular application).

Yes, a lot depends on the implementation. I very strongly suspect that
there are huge performance improvements available in the benchmark_tx
program. Basically, you want your blocks to do "lots" of processing as
opposed to a flow graph with many blocks each doing a little processing.

> Just to make sure, if I instantiate a GNUradio UHD Sink with any of the supported IO types, I just have to make sure I feed it samples of the correct range, i.e., [-1.0,+1.0] for float and [-2^15, +2^15-1] for COMPLEX_INT16?

Yep.

> Do you suspect that the bottleneck is the ARM processor? Will moving the python tx_chain to COMPLEX_INT16 help significantly? I don't care so much about the framer and not at all about the receiver. I just need to TX at a constant bitrate of 500 kbps.

In this case the bottleneck is the ARM, assuming you are running the 3.0
kernel and recent UHD.

Philip
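
The "few blocks doing lots of work" point can be illustrated with a toy comparison. This is only a sketch: the function names are hypothetical, they are not GNU Radio blocks, and the real per-block cost comes from scheduling and buffer handling, which plain Python generators only loosely mimic.

```python
# Illustration only: the same BPSK mapping, gain and sample repetition done
# either as three small per-item stages or as one stage covering a whole chunk.


def map_bits(bits):
    """Stage 1: map each bit to a +1.0 / -1.0 symbol."""
    for b in bits:
        yield 1.0 if b else -1.0


def apply_gain(symbols, gain):
    """Stage 2: scale each symbol."""
    for s in symbols:
        yield s * gain


def repeat(symbols, sps):
    """Stage 3: repeat each symbol sps times."""
    for s in symbols:
        for _ in range(sps):
            yield s


def chained(bits, gain, sps):
    """Three stages, each doing a little work per item."""
    return list(repeat(apply_gain(map_bits(bits), gain), sps))


def fused(bits, gain, sps):
    """One stage doing all of the work for the chunk in a single pass."""
    out = []
    for b in bits:
        out.extend([gain if b else -gain] * sps)
    return out
```

Both versions produce identical samples; the fused one simply hands each item between stages zero times instead of twice.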



Nowlan, Sean

Mon, Oct 24, 2011 6:43 PM

Yes, I'm on the 3.0 kernel and recent UHD. Can anybody quantify the potential performance gain in using COMPLEX_INT16 over FLOAT32? I'll look into the specs on this OMAP device, but I'm hoping somebody has firsthand experience using ALU over FPU on it with gnuradio.

Sean



Philip Balister

Tue, Oct 25, 2011 3:32 PM

On 10/24/2011 08:43 PM, Nowlan, Sean wrote:

> Yes, I'm on the 3.0 kernel and recent UHD. Can anybody quantify the potential performance gain in using COMPLEX_INT16 over FLOAT32? I'll look into the specs on this OMAP device, but I'm hoping somebody has firsthand experience using ALU over FPU on it with gnuradio.

I would love to see someone try some int16 processing with the NEON
coprocessor. The tricky bit is the dynamic range limitations though.

Philip
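
One place the dynamic range limitation bites is in multiply-accumulate loops. Here is a minimal sketch, assuming the common Q15 convention (an int16 value v stands for v / 32768); the helper names are hypothetical and the code models the arithmetic only, not NEON instructions.

```python
# Q15 fixed point: an int16 value v represents v / 32768.
Q15_ONE = 1 << 15


def to_q15(x):
    """Quantize a float in [-1.0, 1.0) to Q15, saturating at the int16 limits."""
    return max(-32768, min(32767, int(round(x * Q15_ONE))))


def q15_dot(a, b):
    """Dot product of Q15 vectors with a wide accumulator; result is Q15."""
    acc = 0                  # Python ints are unbounded; C would need int32/int64
    for x, y in zip(a, b):
        acc += x * y         # each product is in Q30 and does not fit in 16 bits
    # Shift back to Q15 and saturate, since the sum can exceed full scale.
    return max(-32768, min(32767, acc >> 15))


def fits_int16(v):
    """True if v can be stored in a signed 16-bit integer."""
    return -32768 <= v <= 32767
```

Four products of 0.5 x 0.5 already sum to 1.0, which is past the largest Q15 value and has to saturate, so taps and signal levels need headroom planned in advance.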



NS

Nowlan, Sean

Tue, Oct 25, 2011 4:14 PM

Are you saying that the float range doesn't map linearly to the INT16 range?

-----Original Message-----
From: Philip Balister [mailto:philip@opensdr.com]
Sent: Tuesday, October 25, 2011 11:32 AM
To: Nowlan, Sean
Cc: josh@ettus.com; usrp-users@lists.ettus.com
Subject: Re: [USRP-users] Bandwidth issues on E100

On 10/24/2011 08:43 PM, Nowlan, Sean wrote:

Yes, I'm on the 3.0 kernel and recent UHD. Can anybody quantify the potential performance gain in using COMPLEX_INT16 over FLOAT32? I'll look into the specs on this OMAP device, but I'm hoping somebody has firsthand experience using ALU over FPU on it with gnuradio.

I would love to see someone try some int16 processing with the NEON coprocessor. The tricky bit is the dynamic range limitations though.

Philip

Sean

-----Original Message-----
From: Philip Balister [mailto:philip@opensdr.com]
Sent: Monday, October 24, 2011 2:34 PM
To: Nowlan, Sean
Cc: josh@ettus.com; usrp-users@lists.ettus.com
Subject: Re: [USRP-users] Bandwidth issues on E100

On 10/24/2011 01:34 PM, Nowlan, Sean wrote:

Thanks. So what kind of performance gains would a C++ implementation buy me? (I know that question is loaded - it would depend on how it's implemented, of course, and it probably differs depending on the particular application).

Yes, a lot depends on the implementation. I very strongly suspect that there are huge performance improvements available in the benchmark_tx program. Basically, you want your blocks to do "lots" of processing as opposed to a flow graph with many blocks each doing a little processing.

Just to make sure, if I instantiate a GNUradio UHD Sink with any of the supported IO types, I just have to make sure I feed it samples of the correct range, i.e., [-1.0,+1.0] for float and [-2^16, +2^16-1] for COMPLEX_INT16?

Yep.

Do you suspect that the bottleneck is the ARM processor? Will moving the python tx_chain to COMPLEX_INT16 help significantly? I don't care so much about the framer and not at all about the receiver. I just need to TX at a constant bitrate of 500 kbps.

In this case the bottleneck is the ARM, assuming you are running the 3.0 kernel and recent UHD.

Philip

Thanks,
Sean

-----Original Message-----
From: usrp-users-bounces@lists.ettus.com
[mailto:usrp-users-bounces@lists.ettus.com] On Behalf Of Josh Blum
Sent: Thursday, October 20, 2011 1:48 PM
To: usrp-users@lists.ettus.com
Subject: Re: [USRP-users] Bandwidth issues on E100

On 10/20/2011 10:36 AM, Nowlan, Sean wrote:

Hi all,

I'm experiencing underruns when running 500kbps BPSK using
GNUradio's benchmark_tx.py, which seems like too low a bandwidth to
make an E100 choke. My thoughts on how to deal with this issue:

1) Rebuild GNUradio with ARM NEON extensions (I'm running on a
version without these).

2) Switch from COMPLEX_FLOAT32 to COMPLEX_INT16 or COMPLEX_INT8.
(What more is involved besides changing the io_type in the UHD sink
object instantiation?)

Any other thoughts or comments would be greatly appreciated. Sorry
if this was more appropriate to post in discuss-gnuradio; this is
one of these issues that could go either way.

Well, in general, the benchmark stuff is just an example to demonstrate a complete rx/tx chain + MAC layer. But actually it's pretty poor in terms of being an example and in terms of performance (even on x86).

I know gnuradio doesn't have real MAC layer support because we still need to invent message passing, but even so, the de-framer/correlator for this app is written entirely in python (not even numpy).

The thing about the IO type is that the benchmark mod/demod chains work on complex float32, so you have to make a complex int16 version of the blocks in that chain to use complex int16 as an IO type.

Things you may consider:
Implementing a better packet framer/deframer. Using less floating point; think NEON-optimized FIR filters and such. This is actually what the volk component will be used for. So we write a FIR filter implementation, and then call into a volk dot product kernel. When somebody finds out that the filter is the bottleneck, you add a NEON or assembly implementation for that kernel.
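The FIR-plus-dot-product structure described above can be sketched in a few lines. This is only an illustration of the idea, not GNU Radio or VOLK code: `dot_product` and `fir_filter` are hypothetical names, and the pure-Python `dot_product` merely stands in for whatever optimized kernel would be swapped in later.

```python
def dot_product(a, b):
    """Hypothetical stand-in for an optimized kernel: sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))


def fir_filter(samples, taps):
    """Direct-form FIR filter: y[n] = sum over k of taps[k] * x[n - k].

    Each output sample is one dot product between the reversed taps and a
    sliding window of input history, so all the heavy arithmetic lives in
    dot_product() and can be replaced without touching this function.
    """
    n_taps = len(taps)
    # Zero-pad the start so the first outputs see "silence" before the input.
    history = [0.0] * (n_taps - 1) + list(samples)
    reversed_taps = list(taps)[::-1]
    return [
        dot_product(history[i:i + n_taps], reversed_taps)
        for i in range(len(samples))
    ]


# Two-tap moving average over a short ramp.
print(fir_filter([1.0, 2.0, 3.0, 4.0], [0.5, 0.5]))  # [0.5, 1.5, 2.5, 3.5]
```

The point of the split is that profiling can single out the dot product as the hot spot, and only that one function then needs a NEON or assembly version.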

-Josh

USRP-users mailing list
USRP-users@lists.ettus.com
http://lists.ettus.com/mailman/listinfo/usrp-users_lists.ettus.com


IB

Ian Buckley

Tue, Oct 25, 2011 5:00 PM

No, he's saying that FLOAT32 has 1 sign bit, 8 bits of exponent, and 24 bits of significand precision (23 stored bits plus an implicit leading 1), thus it can represent normalized magnitudes from roughly 2^-126 up to just under 2^128 whilst maintaining 24 bits of precision. INT16 can only represent -2^15 -> (2^15)-1. Thus integer DSP code needs to constantly and explicitly manage word growth to maintain precision whilst avoiding overflow, unlike floating point, where normalization occurs transparently as part of the operation of the arithmetic hardware and precision is maintained automatically.
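To make the word-growth point concrete, here is a minimal sketch of 16-bit fixed-point handling in plain Python. It assumes a Q15 convention (floats in [-1.0, +1.0] scaled by 2^15), which is one common choice and not something the thread specifies; the helper names are hypothetical.

```python
INT16_MIN = -(2 ** 15)   # -32768
INT16_MAX = 2 ** 15 - 1  # 32767


def saturate_int16(value):
    """Clamp an integer to the representable int16 range."""
    return max(INT16_MIN, min(INT16_MAX, value))


def float_to_q15(x):
    """Map a float in [-1.0, +1.0] linearly onto int16, saturating at the ends."""
    return saturate_int16(int(round(x * 2 ** 15)))


def q15_multiply(a, b):
    """Multiply two Q15 values.

    The raw product of two 16-bit values needs about 32 bits, so it is held
    in a wider intermediate and shifted back down by 15 bits before being
    clamped. Floating-point hardware does the equivalent renormalization
    automatically; integer code has to do it by hand.
    """
    wide = a * b
    return saturate_int16(wide >> 15)


half = float_to_q15(0.5)
print(half)                      # 16384
print(q15_multiply(half, half))  # 8192, i.e. 0.25 in Q15
print(float_to_q15(1.0))         # 32767: +1.0 itself does not fit, so it saturates
print(float_to_q15(-1.0))        # -32768
```

So the mapping from float to INT16 is linear, but every intermediate result has to be rescaled and checked against the 16-bit limits explicitly.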

-ian

On Oct 25, 2011, at 9:14 AM, Nowlan, Sean wrote:

Are you saying that the float range doesn't map linearly to the INT16 range?

-----Original Message-----
From: Philip Balister [mailto:philip@opensdr.com]
Sent: Tuesday, October 25, 2011 11:32 AM
To: Nowlan, Sean
Cc: josh@ettus.com; usrp-users@lists.ettus.com
Subject: Re: [USRP-users] Bandwidth issues on E100

On 10/24/2011 08:43 PM, Nowlan, Sean wrote:

Yes, I'm on the 3.0 kernel and recent UHD. Can anybody quantify the potential performance gain in using COMPLEX_INT16 over FLOAT32? I'll look into the specs on this OMAP device, but I'm hoping somebody has firsthand experience using ALU over FPU on it with gnuradio.

I would love to see someone try some int16 processing with the NEON coprocessor. The tricky bit is the dynamic range limitations though.

Philip

Sean

-----Original Message-----
From: Philip Balister [mailto:philip@opensdr.com]
Sent: Monday, October 24, 2011 2:34 PM
To: Nowlan, Sean
Cc: josh@ettus.com; usrp-users@lists.ettus.com
Subject: Re: [USRP-users] Bandwidth issues on E100

On 10/24/2011 01:34 PM, Nowlan, Sean wrote:

Thanks. So what kind of performance gains would a C++ implementation buy me? (I know that question is loaded - it would depend on how it's implemented, of course, and it probably differs depending on the particular application).

Yes, a lot depends on the implementation. I very strongly suspect that there are huge performance improvements available in the benchmark_tx program. Basically, you want your blocks to do "lots" of processing as opposed to a flow graph with many blocks each doing a little processing.

Just to make sure, if I instantiate a GNUradio UHD Sink with any of the supported IO types, I just have to make sure I feed it samples of the correct range, i.e., [-1.0,+1.0] for float and [-2^16, +2^16-1] for COMPLEX_INT16?

Yep.

Do you suspect that the bottleneck is the ARM processor? Will moving the python tx_chain to COMPLEX_INT16 help significantly? I don't care so much about the framer and not at all about the receiver. I just need to TX at a constant bitrate of 500 kbps.

In this case the bottleneck is the ARM, assuming you are running the 3.0 kernel and recent UHD.

Philip

Thanks,
Sean

-----Original Message-----
From: usrp-users-bounces@lists.ettus.com
[mailto:usrp-users-bounces@lists.ettus.com] On Behalf Of Josh Blum
Sent: Thursday, October 20, 2011 1:48 PM
To: usrp-users@lists.ettus.com
Subject: Re: [USRP-users] Bandwidth issues on E100

On 10/20/2011 10:36 AM, Nowlan, Sean wrote:

Hi all,

I'm experiencing underruns when running 500kbps BPSK using
GNUradio's benchmark_tx.py, which seems like too low a bandwidth to
make an E100 choke. My thoughts on how to deal with this issue:

1) Rebuild GNUradio with ARM NEON extensions (I'm running on a
version without these).

2) Switch from COMPLEX_FLOAT32 to COMPLEX_INT16 or COMPLEX_INT8.
(What more is involved besides changing the io_type in the UHD sink
object instantiation?)

Any other thoughts or comments would be greatly appreciated. Sorry
if this was more appropriate to post in discuss-gnuradio; this is
one of these issues that could go either way.

Well, in general, the benchmark stuff is just an example to demonstrate a complete rx/tx chain + MAC layer. But actually it's pretty poor in terms of being an example and in terms of performance (even on x86).

I know gnuradio doesn't have real MAC layer support because we still need to invent message passing, but even so, the de-framer/correlator for this app is written entirely in python (not even numpy).

The thing about the IO type is that the benchmark mod/demod chains work on complex float32, so you have to make a complex int16 version of the blocks in that chain to use complex int16 as an IO type.

Things you may consider:
Implementing a better packet framer/deframer. Using less floating point; think NEON-optimized FIR filters and such. This is actually what the volk component will be used for. So we write a FIR filter implementation, and then call into a volk dot product kernel. When somebody finds out that the filter is the bottleneck, you add a NEON or assembly implementation for that kernel.

-Josh

USRP-users mailing list
USRP-users@lists.ettus.com
http://lists.ettus.com/mailman/listinfo/usrp-users_lists.ettus.com


NS

Nowlan, Sean

Thu, Nov 3, 2011 9:53 PM

On this FAQ page, bandwidth estimates are listed for all devices except for E1xx. What are they?

http://www.ettus.com/faq

Thanks,
Sean

-----Original Message-----
From: Philip Balister [mailto:philip@opensdr.com]
Sent: Monday, October 24, 2011 2:34 PM
To: Nowlan, Sean
Cc: josh@ettus.com; usrp-users@lists.ettus.com
Subject: Re: [USRP-users] Bandwidth issues on E100

On 10/24/2011 01:34 PM, Nowlan, Sean wrote:

Thanks. So what kind of performance gains would a C++ implementation buy me? (I know that question is loaded - it would depend on how it's implemented, of course, and it probably differs depending on the particular application).

Yes, a lot depends on the implementation. I very strongly suspect that there are huge performance improvements available in the benchmark_tx program. Basically, you want your blocks to do "lots" of processing as opposed to a flow graph with many blocks each doing a little processing.

Just to make sure, if I instantiate a GNUradio UHD Sink with any of the supported IO types, I just have to make sure I feed it samples of the correct range, i.e., [-1.0,+1.0] for float and [-2^16, +2^16-1] for COMPLEX_INT16?

Yep.

Do you suspect that the bottleneck is the ARM processor? Will moving the python tx_chain to COMPLEX_INT16 help significantly? I don't care so much about the framer and not at all about the receiver. I just need to TX at a constant bitrate of 500 kbps.

In this case the bottleneck is the ARM, assuming you are running the 3.0 kernel and recent UHD.

Philip

Thanks,
Sean

-----Original Message-----
From: usrp-users-bounces@lists.ettus.com
[mailto:usrp-users-bounces@lists.ettus.com] On Behalf Of Josh Blum
Sent: Thursday, October 20, 2011 1:48 PM
To: usrp-users@lists.ettus.com
Subject: Re: [USRP-users] Bandwidth issues on E100

On 10/20/2011 10:36 AM, Nowlan, Sean wrote:

Hi all,

I'm experiencing underruns when running 500kbps BPSK using
GNUradio's benchmark_tx.py, which seems like too low a bandwidth to
make an E100 choke. My thoughts on how to deal with this issue:

1) Rebuild GNUradio with ARM NEON extensions (I'm running on a
version without these).

2) Switch from COMPLEX_FLOAT32 to COMPLEX_INT16 or COMPLEX_INT8.
(What more is involved besides changing the io_type in the UHD sink
object instantiation?)

Any other thoughts or comments would be greatly appreciated. Sorry if
this was more appropriate to post in discuss-gnuradio; this is one of
these issues that could go either way.

Well, in general, the benchmark stuff is just an example to demonstrate a complete rx/tx chain + MAC layer. But actually it's pretty poor in terms of being an example and in terms of performance (even on x86).

I know gnuradio doesn't have real MAC layer support because we still need to invent message passing, but even so, the de-framer/correlator for this app is written entirely in python (not even numpy).

The thing about the IO type is that the benchmark mod/demod chains work on complex float32, so you have to make a complex int16 version of the blocks in that chain to use complex int16 as an IO type.

Things you may consider:
Implementing a better packet framer/deframer. Using less floating point; think NEON-optimized FIR filters and such. This is actually what the volk component will be used for. So we write a FIR filter implementation, and then call into a volk dot product kernel. When somebody finds out that the filter is the bottleneck, you add a NEON or assembly implementation for that kernel.

-Josh

USRP-users mailing list
USRP-users@lists.ettus.com
http://lists.ettus.com/mailman/listinfo/usrp-users_lists.ettus.com


BH

Ben Hilburn

Thu, Nov 3, 2011 9:59 PM

If you are going to process in the FPGA, you are limited by the clock, so
you are maxing out at 64 MSps. If you plan on doing it all on the GPP, you
should keep it at or below 4 MSps.
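As a rough check of the numbers in this thread against that guidance, the sketch below works out the sample rate and host-side data rate for the 500 kbps BPSK case. The samples-per-symbol value of 2 is an assumption for illustration (the thread does not state what was used), and the byte sizes are simply the storage sizes of a complex sample in each format.

```python
def required_sample_rate(bit_rate, bits_per_symbol, samples_per_symbol):
    """Samples per second needed to carry bit_rate at the given modulation."""
    return bit_rate * samples_per_symbol // bits_per_symbol


BIT_RATE = 500_000           # 500 kbps, from the original question
BITS_PER_SYMBOL = 1          # BPSK carries one bit per symbol
SAMPLES_PER_SYMBOL = 2       # assumption, not stated in the thread
GPP_LIMIT_SPS = 4_000_000    # "at or below 4 MSps" guidance above

BYTES_COMPLEX_FLOAT32 = 8    # two 32-bit floats (I and Q)
BYTES_COMPLEX_INT16 = 4      # two 16-bit integers (I and Q)

rate = required_sample_rate(BIT_RATE, BITS_PER_SYMBOL, SAMPLES_PER_SYMBOL)
print(rate)                          # 1000000 samples per second
print(rate <= GPP_LIMIT_SPS)         # True
print(rate * BYTES_COMPLEX_FLOAT32)  # 8000000 bytes per second
print(rate * BYTES_COMPLEX_INT16)    # 4000000 bytes per second
```

Under these assumptions the requested rate sits well below the 4 MSps figure, which fits the earlier replies pointing at the processing in the flow graph rather than the sample rate itself.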

Cheers,
Ben

On Thu, Nov 3, 2011 at 2:53 PM, Nowlan, Sean <Sean.Nowlan@gtri.gatech.edu> wrote:

On this FAQ page, bandwidth estimates are listed for all devices except
for E1xx. What are they?

http://www.ettus.com/faq

Thanks,
Sean

-----Original Message-----
From: Philip Balister [mailto:philip@opensdr.com]
Sent: Monday, October 24, 2011 2:34 PM
To: Nowlan, Sean
Cc: josh@ettus.com; usrp-users@lists.ettus.com
Subject: Re: [USRP-users] Bandwidth issues on E100

On 10/24/2011 01:34 PM, Nowlan, Sean wrote:

Thanks. So what kind of performance gains would a C++ implementation buy me? (I know that question is loaded - it would depend on how it's implemented, of course, and it probably differs depending on the particular application).

Yes, a lot depends on the implementation. I very strongly suspect that there are huge performance improvements available in the benchmark_tx program. Basically, you want your blocks to do "lots" of processing as opposed to a flow graph with many blocks each doing a little processing.

Just to make sure, if I instantiate a GNUradio UHD Sink with any of the supported IO types, I just have to make sure I feed it samples of the correct range, i.e., [-1.0,+1.0] for float and [-2^16, +2^16-1] for COMPLEX_INT16?

Yep.

Do you suspect that the bottleneck is the ARM processor? Will moving the python tx_chain to COMPLEX_INT16 help significantly? I don't care so much about the framer and not at all about the receiver. I just need to TX at a constant bitrate of 500 kbps.

In this case the bottleneck is the ARM, assuming you are running the 3.0 kernel and recent UHD.

Philip

Thanks,
Sean

-----Original Message-----
From: usrp-users-bounces@lists.ettus.com
[mailto:usrp-users-bounces@lists.ettus.com] On Behalf Of Josh Blum
Sent: Thursday, October 20, 2011 1:48 PM
To: usrp-users@lists.ettus.com
Subject: Re: [USRP-users] Bandwidth issues on E100

On 10/20/2011 10:36 AM, Nowlan, Sean wrote:

Hi all,

I'm experiencing underruns when running 500kbps BPSK using
GNUradio's benchmark_tx.py, which seems like too low a bandwidth to
make an E100 choke. My thoughts on how to deal with this issue:

1) Rebuild GNUradio with ARM NEON extensions (I'm running on a
version without these).

2) Switch from COMPLEX_FLOAT32 to COMPLEX_INT16 or COMPLEX_INT8.
(What more is involved besides changing the io_type in the UHD sink
object instantiation?)

Any other thoughts or comments would be greatly appreciated. Sorry if
this was more appropriate to post in discuss-gnuradio; this is one of
these issues that could go either way.

Well, in general, the benchmark stuff is just an example to demonstrate a complete rx/tx chain + MAC layer. But actually it's pretty poor in terms of being an example and in terms of performance (even on x86).

I know gnuradio doesn't have real MAC layer support because we still need to invent message passing, but even so, the de-framer/correlator for this app is written entirely in python (not even numpy).

The thing about the IO type is that the benchmark mod/demod chains work on complex float32, so you have to make a complex int16 version of the blocks in that chain to use complex int16 as an IO type.

Things you may consider:
Implementing a better packet framer/deframer. Using less floating point; think NEON-optimized FIR filters and such. This is actually what the volk component will be used for. So we write a FIR filter implementation, and then call into a volk dot product kernel. When somebody finds out that the filter is the bottleneck, you add a NEON or assembly implementation for that kernel.

-Josh

USRP-users mailing list
USRP-users@lists.ettus.com
http://lists.ettus.com/mailman/listinfo/usrp-users_lists.ettus.com
