usrp-users@lists.ettus.com

Discussion and technical support related to USRP, UHD, RFNoC

Bandwidth issues on E100

Nowlan, Sean
Thu, Nov 3, 2011 10:47 PM

Thanks for your help. Even modulating a 4 kHz sine wave with the "tx_waveforms" program at 2.37 MSps was causing underruns. This is a pretty simple program that doesn't do any software-side filtering or anything; it just pumps samples from a fixed wave table to the FPGA. What can I change or do to actually see that 4 MSps? I tried changing the process's scheduling priority using the Linux 'chrt' command but found it was already using round-robin scheduling at priority 50 out of 99. Bumping it up higher than 50 didn't seem to get rid of the underruns.

From: Ben Hilburn [mailto:ben.hilburn@ettus.com]
Sent: Thursday, November 03, 2011 5:59 PM
To: Nowlan, Sean
Cc: Philip Balister; usrp-users@lists.ettus.com
Subject: Re: [USRP-users] Bandwidth issues on E100

If you are going to process in the FPGA, you are limited by the clock, so you are maxing out at 64 MSps.  If you plan on doing it all on the GPP, you should keep it at or below 4 MSps.

Cheers,
Ben

On Thu, Nov 3, 2011 at 2:53 PM, Nowlan, Sean <Sean.Nowlan@gtri.gatech.edu> wrote:
On this FAQ page, bandwidth estimates are listed for all devices except for E1xx. What are they?

http://www.ettus.com/faq

Thanks,
Sean

-----Original Message-----
From: Philip Balister [mailto:philip@opensdr.com]
Sent: Monday, October 24, 2011 2:34 PM
To: Nowlan, Sean
Cc: josh@ettus.com; usrp-users@lists.ettus.com
Subject: Re: [USRP-users] Bandwidth issues on E100

On 10/24/2011 01:34 PM, Nowlan, Sean wrote:

Thanks. So what kind of performance gains would a C++ implementation buy me? (I know that question is loaded - it would depend on how it's implemented, of course, and it probably differs depending on the particular application).

Yes, a lot depends on the implementation. I very strongly suspect that there are huge performance improvements available in the benchmark_tx program. Basically, you want your blocks to do "lots" of processing as opposed to a flow graph with many blocks each doing a little processing.

Just to make sure, if I instantiate a GNUradio UHD Sink with any of the supported IO types, I just have to make sure I feed it samples of the correct range, i.e., [-1.0,+1.0] for float and [-2^16, +2^16-1] for COMPLEX_INT16?

Yep.
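
For reference when scaling samples by hand: a signed 16-bit integer holds values in [-2^15, 2^15 - 1], i.e. [-32768, 32767]. Below is a minimal sketch of converting float samples in [-1.0, +1.0] to 16-bit I/Q pairs. The helper names and the choice of 32767 as the scale factor are assumptions for illustration, not the UHD sink's own conversion code.

```python
INT16_MIN = -(2 ** 15)   # -32768
INT16_MAX = 2 ** 15 - 1  # 32767


def float_to_int16(x):
    """Scale one float sample in [-1.0, +1.0] to a clipped 16-bit integer."""
    scaled = int(round(x * INT16_MAX))
    # Clip so out-of-range input saturates instead of wrapping around.
    return max(INT16_MIN, min(INT16_MAX, scaled))


def complex_to_sc16(samples):
    """Convert complex float samples to (I, Q) pairs of 16-bit integers."""
    return [(float_to_int16(s.real), float_to_int16(s.imag)) for s in samples]


if __name__ == "__main__":
    print(complex_to_sc16([1 + 0j, -1j, 2 - 2j]))
```

Clipping rather than wrapping matters here: an input slightly above full scale saturates at the maximum instead of flipping sign.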

Do you suspect that the bottleneck is the ARM processor? Will moving the python tx_chain to COMPLEX_INT16 help significantly? I don't care so much about the framer and not at all about the receiver. I just need to TX at a constant bitrate of 500 kbps.

In this case the bottleneck is the ARM, assuming you are running the 3.0 kernel and a recent UHD.

Philip

Thanks,
Sean

-----Original Message-----
From: usrp-users-bounces@lists.ettus.com
[mailto:usrp-users-bounces@lists.ettus.com] On Behalf Of Josh Blum
Sent: Thursday, October 20, 2011 1:48 PM
To: usrp-users@lists.ettus.com
Subject: Re: [USRP-users] Bandwidth issues on E100

On 10/20/2011 10:36 AM, Nowlan, Sean wrote:

Hi all,

I'm experiencing underruns when running 500kbps BPSK using
GNUradio's benchmark_tx.py, which seems like too low a bandwidth to
make an E100 choke. My thoughts on how to deal with this issue:

1) Rebuild GNUradio with ARM NEON extensions (I'm running on a
version without these).

2) Switch from COMPLEX_FLOAT32 to COMPLEX_INT16 or COMPLEX_INT8.
(What more is involved besides changing the io_type in the UHD sink
object instantiation?)

Any other thoughts or comments would be greatly appreciated. Sorry if
this was more appropriate to post in discuss-gnuradio; this is one of
these issues that could go either way.

Well, in general, the benchmark stuff is just an example to demonstrate a complete rx/tx chain + MAC layer. But actually it's pretty poor both as an example and in terms of performance (even on x86).

I know gnuradio doesn't have real MAC layer support because we still need to invent message passing, but even so, the de-framer/correlator for this app is written entirely in python (not even numpy).

The thing about the IO type is that the benchmark mod/demod chains work on complex float32, so you have to make a complex int16 version of the blocks in that chain to use complex int16 as an IO type.

Things you may consider:
Implementing a better packet framer/deframer. Using less floating point; think NEON-optimized FIR filters and such. This is actually what the volk component will be used for. So we write a FIR filter implementation, and then call into a volk dot product kernel. When somebody finds out that the filter is the bottleneck, you add a NEON or assembly implementation for that kernel.
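
The structure described above (a FIR filter whose inner loop is a swappable dot product kernel) can be sketched roughly as follows. This is plain Python showing the shape of the idea only; the `KERNELS` registry and the function names are hypothetical and are not the volk API.

```python
def dot_product_generic(a, b):
    """Portable reference kernel: sum of element-wise products."""
    total = 0.0
    for x, y in zip(a, b):
        total += x * y
    return total


# Hypothetical registry. An optimized kernel (for example a NEON one)
# would be registered here under its own name, leaving the filter untouched.
KERNELS = {"generic": dot_product_generic}


def fir_filter(samples, taps, kernel="generic"):
    """Filter samples with taps; each output sample is one dot product."""
    dot = KERNELS[kernel]
    ntaps = len(taps)
    reversed_taps = list(taps)[::-1]
    # Prepend zeros so the first outputs see an all-zero history.
    padded = [0.0] * (ntaps - 1) + list(samples)
    return [dot(padded[i:i + ntaps], reversed_taps)
            for i in range(len(samples))]


if __name__ == "__main__":
    print(fir_filter([1.0, 0.0, 0.0, 0.0], [0.5, 0.25]))
```

Because the filter only ever calls the kernel through the registry, swapping in a faster dot product needs no change to the filter code itself.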

-Josh


USRP-users mailing list
USRP-users@lists.ettus.com
http://lists.ettus.com/mailman/listinfo/usrp-users_lists.ettus.com


Josh Blum
Thu, Nov 3, 2011 11:01 PM

On 11/03/2011 03:47 PM, Nowlan, Sean wrote:

Thanks for your help. Even modulating a 4 kHz sine wave with the
"tx_waveforms" program at 2.37 MSps was causing underruns. This is a

tx_waveforms is impressively bad even on x86 machines. I believe the
issue is that the fmod call is iterative; if not that, it's the
double-precision floating-point math on every sample. You should compare
it to gnuradio's sig source, which is implemented much better (though
not perfectly) and can handle 4 Msps.

-Josh

Josh Blum
Thu, Nov 3, 2011 11:12 PM

On 11/03/2011 04:01 PM, Josh Blum wrote:

On 11/03/2011 03:47 PM, Nowlan, Sean wrote:

Thanks for your help. Even modulating a 4 kHz sine wave with the
"tx_waveforms" program at 2.37 MSps was causing underruns. This is a

tx_waveforms is impressively bad even on x86 machines. I believe the
issue is that the fmod call is iterative; if not that, it's the
double-precision floating-point math on every sample. You should compare
it to gnuradio's sig source, which is implemented much better (though
not perfectly) and can handle 4 Msps.

-Josh

I should mention that Philip Balister has contributed a number of
NEON-optimized FIR filters to gnuradio-core. If you schedule NEON calls
right, it's like having an extra parallel processor on your ARM.

If you are interested in this, I highly recommend taking a look at
gnuradio's libvolk and adding a kernel for the various FIR filters you
are interested in. Basically a dot product in NEON.

The plan was to make a new component in gnuradio called gr-filter. Each
block in gr-filter would be very simple; the work function just calls
into one of these volk kernels.

Let me know if you are interested in something like that. I would like
to put together the boilerplate code for the gr-filter component.

-Josh

Josh Blum
Sat, Nov 5, 2011 6:34 PM

On 11/03/2011 03:47 PM, Nowlan, Sean wrote:

Thanks for your help. Even modulating a 4 kHz sine wave with the
"tx_waveforms" program at 2.37 MSps was causing underruns. This is a
pretty simple program that doesn't do any software-side filtering or

I made a simple 1 line diff to remove boost iround from table indexing
function. I can now run tx waveforms at 4Msps without underflow:
http://pastebin.com/1g8YcEJu

Notice that boost iround calls into iterative floating point routines in
libmath. It is bad practice for performance reasons to call into
libmath once per sample. For example, it would not be OK for a FIR
filter to be implemented like this.
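
To illustrate what dropping the rounding step from a table index changes (this is not the contents of the pastebin diff, only a sketch with made-up names and table size): rounding picks the nearest table entry, truncation picks the entry at or below the phase, and the two never differ by more than one entry.

```python
import math

TABLE_LEN = 4096  # assumed table size, for illustration only
WAVE_TABLE = [math.sin(2 * math.pi * i / TABLE_LEN) for i in range(TABLE_LEN)]


def lookup_rounded(phase):
    """Nearest table entry: costs one rounding call per sample."""
    return WAVE_TABLE[int(round(phase)) % TABLE_LEN]


def lookup_truncated(phase):
    """Entry at or below the phase: a plain float-to-int conversion."""
    return WAVE_TABLE[int(phase) % TABLE_LEN]


if __name__ == "__main__":
    phase = 1024.7
    print(lookup_rounded(phase), lookup_truncated(phase))
```

With a 4096-entry sine table, neighbouring entries differ by less than 2*pi/4096 (about 0.0015), so the truncated lookup gives up very little accuracy in exchange for skipping a math-library call on every sample.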

And here is a diff that allowed me to run tx waveforms at 8Msps on the
E100: http://pastebin.com/Zy8ywKks

Notice the use of integer arithmetic and use of the complex<short> as
the IO data type. At 8Msps, staying within memory bandwidth can be
tougher. Using shorts over floats cuts this overhead in half.
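
A rough sketch of both points, again not the pastebin diff itself: the table is precomputed once as 16-bit integer I/Q pairs, the per-sample work is an integer add and a bit mask, and the byte-rate helper shows why shorts halve the traffic. All names, the table size, and the integer `step` parameter are assumptions for illustration.

```python
import math
import struct

TABLE_LEN = 4096          # power of two, so wrapping is a bit mask
FULL_SCALE = 2 ** 15 - 1  # largest signed 16-bit value

# Floating point is used only here, once, to build the integer table.
INT_TABLE = [
    (int(round(FULL_SCALE * math.cos(2 * math.pi * i / TABLE_LEN))),
     int(round(FULL_SCALE * math.sin(2 * math.pi * i / TABLE_LEN))))
    for i in range(TABLE_LEN)
]


def generate(step, count, start=0):
    """Return count (I, Q) samples using only integer index arithmetic."""
    index = start
    out = []
    for _ in range(count):
        out.append(INT_TABLE[index])
        index = (index + step) & (TABLE_LEN - 1)
    return out


def bytes_per_second(sample_rate, fmt):
    """Byte rate when each complex sample is packed with struct format fmt."""
    return sample_rate * struct.calcsize(fmt)


if __name__ == "__main__":
    print(generate(1024, 4))
    print(bytes_per_second(8_000_000, "<2h"))  # two 16-bit shorts per sample
    print(bytes_per_second(8_000_000, "<2f"))  # two 32-bit floats per sample
```

At 8 Msps, two shorts per sample come to 32 MB/s, against 64 MB/s for two 32-bit floats. One trade-off of an integer step is that the tone frequency can only be set in multiples of the sample rate divided by the table length.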

I will probably merge the first diff, because there isn't a strong need
for the iround; it was more of an OCD to pick the best index between two
elements in the table.

-Josh
