High performance interface bonding in Linux

My application:

I’m in the process of playing with DRBD on a couple of spare servers in my garage. The idea is that while my business is far too poor to afford a SAN, a DRBD cluster should be more than enough to cover my needs for semi-redundant storage.

Part of DRBD involves a link between the two servers for synchronization (basically copying over the bits that change). This should ideally be a private link so that your primary connection doesn’t see bandwidth contention during heavy writes, so for testing purposes I tossed a couple of dual-port Intel NICs I had lying around into the servers to give myself a dedicated 2 Gbit synchronization link.

The setup:

I have two identical servers running CentOS 5.5 x86_64 with kernel version 2.6.18-194.32.1.el5. Both servers use dual-port Intel Pro/1000 PCI-X adapters for the replication link.

Bonding in Linux:

On CentOS/RHEL/Fedora specifically, bonding is very easy. Most of the tutorials tell you to edit /etc/modprobe.conf to load the bonding driver (and neglect to mention that you have to do so again for each additional bonding interface), but this is entirely unnecessary.
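
For reference, the modprobe.conf approach those tutorials describe looks roughly like the lines below (repeated with a new alias for each additional bond); with the method that follows you can skip it entirely:

alias bond0 bonding
options bonding mode=4 miimon=100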

Create /etc/sysconfig/network-scripts/ifcfg-bondn, where n is an integer. It should look something like this:

DEVICE=bond0
BONDING_OPTS="mode=4 miimon=100"
IPADDR=192.168.1.2
NETMASK=255.255.0.0
GATEWAY=192.168.1.1
BOOTPROTO=none
ONBOOT=yes
USERCTL=no

The BONDING_OPTS line is what configures the bonding module (instead of setting the options in /etc/modprobe.conf). You can read more about the available options in the kernel’s bonding documentation; what I have specified here is LACP (IEEE 802.3ad) with a link monitor frequency of 100ms.
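
If you want to see the full list of parameters the bonding driver accepts on your kernel, you can pull it straight from the module:

# list the bonding module's available parameters
modinfo -p bonding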

Now for each slave interface edit the ifcfg-ethx file to resemble the following:

DEVICE=eth0
HWADDR=xx:xx:xx:xx:xx:xx
MASTER=bond0
SLAVE=yes
ONBOOT=yes
USERCTL=no

Restart the network service and you should now be running on your new bonded connection. Congratulations!
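
On CentOS 5 that just means the network init script; something along these lines should bring the bond up and let you confirm that the slaves attached:

# restart networking, then check the bond status
service network restart
cat /proc/net/bonding/bond0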

Performance:

Since my primary reason for using NIC bonding in the first place was performance, I used iperf to verify that everything was working as desired.
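
I no longer remember the exact invocation (see the comments below), but it would have been something along the lines of the defaults, with one node listening and the other connecting across the bonded link:

# on the first server
iperf -s

# on the second server, pointed at the first server's bond address
iperf -c 192.168.1.2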

[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  1.10 GBytes   944 Mbits/sec

These were not the results I had in mind. After doing some reading I determined that LACP may not be the best solution for my application: it balances traffic per flow, so a single TCP stream is stuck on one physical link. From my understanding the only bonding mode that offers any performance improvement for a single TCP stream is balance-rr (mode 0 for the bonding driver), which simply tosses packets out in a round-robin fashion across the slaves.

So I altered my BONDING_OPTS as follows:

BONDING_OPTS="mode=0 miimon=100"

Surely this will provide a performance increase?

[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  900 MBytes   720 Mbits/sec

This can’t be right…

I finally fired up tcpdump to take a look at what was happening and saw a ton of packet retransmissions. Literally over half of the packets being transmitted were being discarded due to the limited TCP window size.
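
The capture itself was nothing fancy; something like this on either end is enough to watch the retransmissions scroll by during an iperf run (5001 is iperf's default port):

# watch the bonded link while iperf is running
tcpdump -i bond0 -n tcp port 5001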

The solution:

For a link this fast the default MTU (1500) is far too small and results in the receiving system literally discarding packets because it isn’t ready for them yet. Enabling jumbo frames (an MTU of 9000) should be a good way to increase throughput and reduce packet retransmissions. To enable jumbo frames, simply add this line to your /etc/sysconfig/network-scripts/ifcfg-interface file (it only needs to go in ifcfg-bondn if you’re using interface bonding):

MTU=9000
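
It’s worth confirming that the larger MTU actually took effect end to end (and, as noted in the comments below, that any switch in the path is configured for jumbo frames). After restarting the network service, something like this makes a quick sanity check; 8972 is 9000 minus the 20-byte IP header and 8-byte ICMP header:

# confirm the interface picked up the new MTU
ip link show bond0

# a non-fragmenting ping that only fits inside a jumbo frame
# (replace the address with the other node's bond address)
ping -M do -s 8972 -c 4 192.168.1.1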

Results:

[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  2.28 GBytes  1.96 Gbits/sec

Now that’s what I’m talking about!

Other thoughts:

The bonding driver happens to have a nice interface in /proc that provides a lot of useful information about the trunk and its current status (located at /proc/net/bonding/bondn).

Ethernet Channel Bonding Driver: v3.4.0 (October 7, 2008)
Bonding Mode: load balancing (round-robin)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0
Slave Interface: eth2
MII Status: up
Link Failure Count: 0
Permanent HW addr: xx:xx:xx:xx:xx:xx
Slave Interface: eth3
MII Status: up
Link Failure Count: 0
Permanent HW addr: xx:xx:xx:xx:xx:xx

I personally plan on including bits of this (MII status, Link failure count, status) in my daily logs.
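
A one-liner against the proc file is enough for that; for example:

# pull the status lines worth logging from the bonding proc file
grep -E 'Bonding Mode|MII Status|Link Failure Count' /proc/net/bonding/bond0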

5 Responses to “High performance interface bonding in Linux”

  1. Trung Says:

    Hi, did you configure anything on the network switch to support balance-rr mode? From the network-bonding documentation, I notice that: “The balance-rr, balance-xor and broadcast modes generally require that the switch have the appropriate ports grouped together”. Thanks.

  2. Clint Says:

    Trung,

    Balance-rr does not require any specific changes at the switch level; it simply uses a round-robin algorithm to alternate packets between the network adapters.

    However, switch-level configuration was required in order to use jumbo frames (an MTU of 9000).

    Clint

  3. Jenny Says:

    What iperf parameters did you use?
    Have you thought about testing this on a 4x 1Gbps NIC?

    Great work!

  4. Clint Says:

    I honestly don’t remember the iperf parameters I used; it’s been a while since I wrote the article. From the results I posted it looks like I just used the default interval time (10s), and I doubt that I messed with any other parameters.

    I have not needed to use balance-rr on a quad-port NIC, though I can’t imagine the setup being any different from what I described here. Any time I’ve had a quad-port NIC I’ve typically used LACP, as the applications (firewall, fileserver, etc.) have involved more than one connection at a time (and thus performed well over LACP).

    For this project specifically I actually moved to an InfiniBand interface to improve performance; latency was affecting transfer speeds more than the raw bandwidth of the interface. I may post another article on that at some point.

  5. Scott Says:

    Hi,

    Did you ever encounter any issues copying files via scp between hosts both using an MTU of 9000? I’m seeing the files stall and then fail to transfer. Reducing the MTU back down to 1500 seems to resolve the issue.
