--- zzzz-none-000/linux-3.10.107/Documentation/networking/bonding.txt 2017-06-27 09:49:32.000000000 +0000 +++ scorpion-7490-727/linux-3.10.107/Documentation/networking/bonding.txt 2021-02-04 17:41:59.000000000 +0000 @@ -51,6 +51,7 @@ 3.4 Configuring Bonding Manually via Sysfs 3.5 Configuration with Interfaces Support 3.6 Overriding Configuration for Special Cases +3.7 Configuring LACP for 802.3ad mode in a more secure way 4. Querying Bonding Configuration 4.1 Bonding Configuration @@ -104,8 +105,7 @@ ============================== Most popular distro kernels ship with the bonding driver -already available as a module and the ifenslave user level control -program installed and ready for use. If your distro does not, or you +already available as a module. If your distro does not, or you have need to compile bonding from source (e.g., configuring and installing a mainline kernel from kernel.org), you'll need to perform the following steps: @@ -124,46 +124,13 @@ driver as module since it is currently the only way to pass parameters to the driver or configure more than one bonding device. - Build and install the new kernel and modules, then continue -below to install ifenslave. + Build and install the new kernel and modules. -1.2 Install ifenslave Control Utility +1.2 Bonding Control Utility ------------------------------------- - The ifenslave user level control program is included in the -kernel source tree, in the file Documentation/networking/ifenslave.c. -It is generally recommended that you use the ifenslave that -corresponds to the kernel that you are using (either from the same -source tree or supplied with the distro), however, ifenslave -executables from older kernels should function (but features newer -than the ifenslave release are not supported). Running an ifenslave -that is newer than the kernel is not supported, and may or may not -work. - - To install ifenslave, do the following: - -# gcc -Wall -O -I/usr/src/linux/include ifenslave.c -o ifenslave -# cp ifenslave /sbin/ifenslave - - If your kernel source is not in "/usr/src/linux," then replace -"/usr/src/linux/include" in the above with the location of your kernel -source include directory. - - You may wish to back up any existing /sbin/ifenslave, or, for -testing or informal use, tag the ifenslave to the kernel version -(e.g., name the ifenslave executable /sbin/ifenslave-2.6.10). - -IMPORTANT NOTE: - - If you omit the "-I" or specify an incorrect directory, you -may end up with an ifenslave that is incompatible with the kernel -you're trying to build it for. Some distros (e.g., Red Hat from 7.1 -onwards) do not have /usr/include/linux symbolically linked to the -default kernel source include directory. - -SECOND IMPORTANT NOTE: - If you plan to configure bonding using sysfs or using the -/etc/network/interfaces file, you do not need to use ifenslave. + It is recommended to configure bonding via iproute2 (netlink) +or sysfs, the old ifenslave control utility is obsolete. 2. Bonding Driver Options ========================= @@ -212,6 +179,27 @@ active slave, or the empty string if there is no active slave or the current mode does not use an active slave. +ad_actor_sys_prio + + In an AD system, this specifies the system priority. The allowed range + is 1 - 65535. If the value is not specified, it takes 65535 as the + default value. + + This parameter has effect only in 802.3ad mode and is available through + SysFs interface. + +ad_actor_system + + In an AD system, this specifies the mac-address for the actor in + protocol packet exchanges (LACPDUs). The value cannot be NULL or + multicast. It is preferred to have the local-admin bit set for this + mac but driver does not enforce it. If the value is not given then + system defaults to using the masters' mac address as actors' system + address. + + This parameter has effect only in 802.3ad mode and is available through + SysFs interface. + ad_select Specifies the 802.3ad aggregation selection logic to use. The @@ -254,6 +242,21 @@ This option was added in bonding version 3.4.0. +ad_user_port_key + + In an AD system, the port-key has three parts as shown below - + + Bits Use + 00 Duplex + 01-05 Speed + 06-15 User-defined + + This defines the upper 10 bits of the port key. The values can be + from 0 - 1023. If not given, the system defaults to 0. + + This parameter has effect only in 802.3ad mode and is available through + SysFs interface. + all_slaves_active Specifies that duplicate frames (received on inactive ports) should be @@ -304,16 +307,15 @@ arp_validate Specifies whether or not ARP probes and replies should be - validated in the active-backup mode. This causes the ARP - monitor to examine the incoming ARP requests and replies, and - only consider a slave to be up if it is receiving the - appropriate ARP traffic. + validated in any mode that supports arp monitoring, or whether + non-ARP traffic should be filtered (disregarded) for link + monitoring purposes. Possible values are: none or 0 - No validation is performed. This is the default. + No validation or filtering is performed. active or 1 @@ -327,28 +329,90 @@ Validation is performed for all slaves. - For the active slave, the validation checks ARP replies to - confirm that they were generated by an arp_ip_target. Since - backup slaves do not typically receive these replies, the - validation performed for backup slaves is on the ARP request - sent out via the active slave. It is possible that some - switch or network configurations may result in situations - wherein the backup slaves do not receive the ARP requests; in - such a situation, validation of backup slaves must be - disabled. - - This option is useful in network configurations in which - multiple bonding hosts are concurrently issuing ARPs to one or - more targets beyond a common switch. Should the link between - the switch and target fail (but not the switch itself), the - probe traffic generated by the multiple bonding instances will - fool the standard ARP monitor into considering the links as - still up. Use of the arp_validate option can resolve this, as - the ARP monitor will only consider ARP requests and replies - associated with its own instance of bonding. + filter or 4 + + Filtering is applied to all slaves. No validation is + performed. + + filter_active or 5 + + Filtering is applied to all slaves, validation is performed + only for the active slave. + + filter_backup or 6 + + Filtering is applied to all slaves, validation is performed + only for backup slaves. + + Validation: + + Enabling validation causes the ARP monitor to examine the incoming + ARP requests and replies, and only consider a slave to be up if it + is receiving the appropriate ARP traffic. + + For an active slave, the validation checks ARP replies to confirm + that they were generated by an arp_ip_target. Since backup slaves + do not typically receive these replies, the validation performed + for backup slaves is on the broadcast ARP request sent out via the + active slave. It is possible that some switch or network + configurations may result in situations wherein the backup slaves + do not receive the ARP requests; in such a situation, validation + of backup slaves must be disabled. + + The validation of ARP requests on backup slaves is mainly helping + bonding to decide which slaves are more likely to work in case of + the active slave failure, it doesn't really guarantee that the + backup slave will work if it's selected as the next active slave. + + Validation is useful in network configurations in which multiple + bonding hosts are concurrently issuing ARPs to one or more targets + beyond a common switch. Should the link between the switch and + target fail (but not the switch itself), the probe traffic + generated by the multiple bonding instances will fool the standard + ARP monitor into considering the links as still up. Use of + validation can resolve this, as the ARP monitor will only consider + ARP requests and replies associated with its own instance of + bonding. + + Filtering: + + Enabling filtering causes the ARP monitor to only use incoming ARP + packets for link availability purposes. Arriving packets that are + not ARPs are delivered normally, but do not count when determining + if a slave is available. + + Filtering operates by only considering the reception of ARP + packets (any ARP packet, regardless of source or destination) when + determining if a slave has received traffic for link availability + purposes. + + Filtering is useful in network configurations in which significant + levels of third party broadcast traffic would fool the standard + ARP monitor into considering the links as still up. Use of + filtering can resolve this, as only ARP traffic is considered for + link availability purposes. This option was added in bonding version 3.1.0. +arp_all_targets + + Specifies the quantity of arp_ip_targets that must be reachable + in order for the ARP monitor to consider a slave as being up. + This option affects only active-backup mode for slaves with + arp_validation enabled. + + Possible values are: + + any or 0 + + consider the slave up only when any of the arp_ip_targets + is reachable + + all or 1 + + consider the slave up only when all of the arp_ip_targets + are reachable + downdelay Specifies the time, in milliseconds, to wait before disabling @@ -515,10 +579,10 @@ XOR policy: Transmit based on the selected transmit hash policy. The default policy is a simple [(source - MAC address XOR'd with destination MAC address) modulo - slave count]. Alternate transmit policies may be - selected via the xmit_hash_policy option, described - below. + MAC address XOR'd with destination MAC address XOR + packet type ID) modulo slave count]. Alternate transmit + policies may be selected via the xmit_hash_policy option, + described below. This mode provides load balancing and fault tolerance. @@ -558,13 +622,19 @@ balance-tlb or 5 Adaptive transmit load balancing: channel bonding that - does not require any special switch support. The - outgoing traffic is distributed according to the - current load (computed relative to the speed) on each - slave. Incoming traffic is received by the current - slave. If the receiving slave fails, another slave - takes over the MAC address of the failed receiving - slave. + does not require any special switch support. + + In tlb_dynamic_lb=1 mode; the outgoing traffic is + distributed according to the current load (computed + relative to the speed) on each slave. + + In tlb_dynamic_lb=0 mode; the load balancing based on + current load is disabled and the load is distributed + only using the hash distribution. + + Incoming traffic is received by the current slave. + If the receiving slave fails, another slave takes over + the MAC address of the failed receiving slave. Prerequisite: @@ -629,6 +699,106 @@ swapped with the new curr_active_slave that was chosen. + l2da or 7 + + L2 Destination Address (L2DA) based mode allows bonding to + send packets using different slaves according to L2 Destination + Address of the packets. + + In L2DA mode, the bonding maintains a default slave and + DA/slave map. + + Upon a packet transmission, the bonding examines the DA of the + packet and searches for a slave assigned to the DA within the + DA/slave map. If such mapping exists, the bonding uses this + slave for sending of the packet.Otherwise, it uses default + slave. + + For multicast packet, if BOND_L2DA_OPT_DUP_MC_TX is set, + L2DA mode duplicates it to all slaves upon transmission. + + For RX packets, if BOND_L2DA_OPT_DEDUP_RX is set, L2DA examines + L2 source address and search for slave interface assigned to + the transmitting STA. L2DA then implements following logic based + on the interface that the packet was received on: + 1. RX EAPOLs are unconditionally allowed + 2. if DA/slave map has an entry for the transmitting STA, allow + the RX packet if received from the assigned slave interface. + Drop the RX packet if received from other interface. + 3. else (no slave interface assigned to the transmitting STA), + if default slave configured (l2da_default_slave), allow the Rx + packet if it was received from it. + 4. else (slave interface not assigned and default slave not + configured), allow the RX packet unconditionally. + + In case BOND_L2DA_OPT_DEDUP_RX is not set, L2DA allows all Rx packets. + + L2DA mode implements packet forwarding for received packets + according to the following logic: + 1. for Rx multicast packets: clone and send to all slaves + (except the one it came from) and also deliver to local network + stack. + 2. for Rx unicast packets: search packet's destination address + in L2DA map. If found, send the packet to the corresponding + slave; otherwise - deliver to local network stack. + L2DA packet forwarding functionality is disabled by default and + can be enabled with BOND_L2DA_OPT_FORWARD_RX bond L2DA option. + + In case BOND_OPT_L2DA_MULTIMAC parameter is set, bond slaves can have + different MAC addresses. + + In case BOND_L2DA_OPT_REPLACE_MAC option set, source MAC on TX and dest + MAC on RX in ETH header will be replaced. In case of ARP packet, + replace in payload as well. + + In case active or default slave has been modified and + BOND_L2DA_OPT_AUTO_ARP_ANNOUNCE option is set, ARP announcement will + be sent. + + New sysfs entries added - l2da_default_slave, l2da_table and + l2da_opts. Here is their usage: + + l2da_default_slave entry - set an interface as default: + echo interface_name > l2da_default_slave + + l2da_default_slave entry - show the default interface: + cat l2da_default_slave + + l2da_table entry - add a new entry into the table: + echo "+xx:xx:xx:xx:xx:xx@interface_name" > l2da_table + + l2da_table entry - remove an entry from the table: + echo "-xx:xx:xx:xx:xx:xx" > l2da_table + + l2da_table entry - remove all entries from the table: + echo "-*" > l2da_table + + l2da_table entry - show the table: + cat l2da_table + + l2da_opts entry - set options: + echo options > l2da_opts + + l2da_opts entry - show enabled options: + cat l2da_opts + + l2da_multimac - show multimac status: + cat l2da_multimac + + New bonding specific GENL family ("bond") added. It can be used + to add custom GENL commands (rather than extend the standard + rtnl_link_ops's changelink and fill_info callbacks). + + Following list of GENL commands introduced for L2DA: + BOND_GENL_CMD_L2DA_SET_DEFAULT + BOND_GENL_CMD_L2DA_GET_DEFAULT + BOND_GENL_CMD_L2DA_ADD_MAP_ENTRY + BOND_GENL_CMD_L2DA_DEL_MAP_ENTRY + BOND_GENL_CMD_L2DA_GET_MAP_ENTRY + BOND_GENL_CMD_L2DA_RESET_MAP + BOND_GENL_CMD_L2DA_SET_OPTS + BOND_GENL_CMD_L2DA_GET_OPTS + num_grat_arp num_unsol_na @@ -648,6 +818,15 @@ are generated by the ipv4 and ipv6 code and the numbers of repetitions cannot be set independently. +packets_per_slave + + Specify the number of packets to transmit through a slave before + moving to the next one. When set to 0 then a slave is chosen at + random. + + The valid range is 0 - 65535; the default value is 1. This option + has effect only in balance-rr mode. + primary A string (eth0, eth2, etc) specifying which slave is the @@ -657,7 +836,8 @@ one slave is preferred over another, e.g., when one slave has higher throughput than another. - The primary option is only valid for active-backup mode. + The primary option is only valid for active-backup(1), + balance-tlb (5) and balance-alb (6) mode. primary_reselect @@ -699,6 +879,28 @@ This option was added for bonding version 3.6.0. +tlb_dynamic_lb + + Specifies if dynamic shuffling of flows is enabled in tlb + mode. The value has no effect on any other modes. + + The default behavior of tlb mode is to shuffle active flows across + slaves based on the load in that interval. This gives nice lb + characteristics but can cause packet reordering. If re-ordering is + a concern use this variable to disable flow shuffling and rely on + load balancing provided solely by the hash distribution. + xmit-hash-policy can be used to select the appropriate hashing for + the setup. + + The sysfs entry can be used to change the setting per bond device + and the initial value is derived from the module parameter. The + sysfs entry is allowed to be changed only if the bond device is + down. + + The default value is "1" that enables flow shuffling while value "0" + disables it. This option was added in bonding driver 3.7.1 + + updelay Specifies the time, in milliseconds, to wait before enabling a @@ -732,14 +934,15 @@ xmit_hash_policy Selects the transmit hash policy to use for slave selection in - balance-xor and 802.3ad modes. Possible values are: + balance-xor, 802.3ad, and tlb modes. Possible values are: layer2 - Uses XOR of hardware MAC addresses to generate the - hash. The formula is + Uses XOR of hardware MAC addresses and packet type ID + field to generate the hash. The formula is - (source MAC XOR destination MAC) modulo slave count + hash = source MAC XOR destination MAC XOR packet type ID + slave number = hash modulo slave count This algorithm will place all traffic to a particular network peer on the same slave. @@ -752,21 +955,16 @@ protocol information to generate the hash. Uses XOR of hardware MAC addresses and IP addresses to - generate the hash. The IPv4 formula is + generate the hash. The formula is - (((source IP XOR dest IP) AND 0xffff) XOR - ( source MAC XOR destination MAC )) - modulo slave count + hash = source MAC XOR destination MAC XOR packet type ID + hash = hash XOR source IP XOR destination IP + hash = hash XOR (hash RSHIFT 16) + hash = hash XOR (hash RSHIFT 8) + And then hash is reduced modulo slave count. - The IPv6 formula is - - hash = (source ip quad 2 XOR dest IP quad 2) XOR - (source ip quad 3 XOR dest IP quad 3) XOR - (source ip quad 4 XOR dest IP quad 4) - - (((hash >> 24) XOR (hash >> 16) XOR (hash >> 8) XOR hash) - XOR (source MAC XOR destination MAC)) - modulo slave count + If the protocol is IPv6 then the source and destination + addresses are first hashed using ipv6_addr_hash. This algorithm will place all traffic to a particular network peer on the same slave. For non-IP traffic, @@ -788,21 +986,16 @@ slaves, although a single connection will not span multiple slaves. - The formula for unfragmented IPv4 TCP and UDP packets is - - ((source port XOR dest port) XOR - ((source IP XOR dest IP) AND 0xffff) - modulo slave count + The formula for unfragmented TCP and UDP packets is - The formula for unfragmented IPv6 TCP and UDP packets is + hash = source port, destination port (as in the header) + hash = hash XOR source IP XOR destination IP + hash = hash XOR (hash RSHIFT 16) + hash = hash XOR (hash RSHIFT 8) + And then hash is reduced modulo slave count. - hash = (source port XOR dest port) XOR - ((source ip quad 2 XOR dest IP quad 2) XOR - (source ip quad 3 XOR dest IP quad 3) XOR - (source ip quad 4 XOR dest IP quad 4)) - - ((hash >> 24) XOR (hash >> 16) XOR (hash >> 8) XOR hash) - modulo slave count + If the protocol is IPv6 then the source and destination + addresses are first hashed using ipv6_addr_hash. For fragmented TCP or UDP packets and all other IPv4 and IPv6 protocol traffic, the source and destination port @@ -810,10 +1003,6 @@ formula is the same as for the layer2 transmit hash policy. - The IPv4 policy is intended to mimic the behavior of - certain switches, notably Cisco switches with PFC2 as - well as some Foundry and IBM products. - This algorithm is not fully 802.3ad compliant. A single TCP or UDP conversation containing both fragmented and unfragmented packets will see packets @@ -824,6 +1013,26 @@ conversations. Other implementations of 802.3ad may or may not tolerate this noncompliance. + encap2+3 + + This policy uses the same formula as layer2+3 but it + relies on skb_flow_dissect to obtain the header fields + which might result in the use of inner headers if an + encapsulation protocol is used. For example this will + improve the performance for tunnel users because the + packets will be distributed according to the encapsulated + flows. + + encap3+4 + + This policy uses the same formula as layer3+4 but it + relies on skb_flow_dissect to obtain the header fields + which might result in the use of inner headers if an + encapsulation protocol is used. For example this will + improve the performance for tunnel users because the + packets will be distributed according to the encapsulated + flows. + The default value is layer2. This option was added in bonding version 2.6.3. In earlier versions of bonding, this parameter does not exist, and the layer2 policy is the only policy. The @@ -847,11 +1056,19 @@ This option was added for bonding version 3.7.0. +lp_interval + + Specifies the number of seconds between instances where the bonding + driver sends learning packets to each slaves peer switch. + + The valid range is 1 - 0x7fffffff; the default value is 1. This Option + has effect only in balance-tlb and balance-alb modes. + 3. Configuring Bonding Devices ============================== You can configure bonding using either your distro's network -initialization scripts, or manually using either ifenslave or the +initialization scripts, or manually using either iproute2 or the sysfs interface. Distros generally use one of three packages for the network initialization scripts: initscripts, sysconfig or interfaces. Recent versions of these packages have support for bonding, while older @@ -1160,7 +1377,7 @@ those instances, see the "Configuring Multiple Bonds Manually" section, below. -3.3 Configuring Bonding Manually with Ifenslave +3.3 Configuring Bonding Manually with iproute2 ----------------------------------------------- This section applies to distros whose network initialization @@ -1171,7 +1388,7 @@ The general method for these systems is to place the bonding module parameters into a config file in /etc/modprobe.d/ (as appropriate for the installed distro), then add modprobe and/or -ifenslave commands to the system's global init script. The name of +`ip link` commands to the system's global init script. The name of the global init script differs; for sysconfig, it is /etc/init.d/boot.local and for initscripts it is /etc/rc.d/rc.local. @@ -1183,8 +1400,8 @@ modprobe bonding mode=balance-alb miimon=100 modprobe e100 ifconfig bond0 192.168.1.1 netmask 255.255.255.0 up -ifenslave bond0 eth0 -ifenslave bond0 eth1 +ip link set eth0 master bond0 +ip link set eth1 master bond0 Replace the example bonding module parameters and bond0 network configuration (IP address, netmask, etc) with the appropriate @@ -1371,6 +1588,12 @@ To remove an ARP target: # echo -192.168.0.100 > /sys/class/net/bond0/bonding/arp_ip_target +To configure the interval between learning packet transmits: +# echo 12 > /sys/class/net/bond0/bonding/lp_interval + NOTE: the lp_inteval is the number of seconds between instances where +the bonding driver sends learning packets to each slaves peer switch. The +default interval is 1 second. + Example Configuration --------------------- We begin with the same example that is shown in section 3.3, @@ -1536,6 +1759,53 @@ This feature first appeared in bonding driver version 3.7.0 and support for output slave selection was limited to round-robin and active-backup modes. +3.7 Configuring LACP for 802.3ad mode in a more secure way +---------------------------------------------------------- + +When using 802.3ad bonding mode, the Actor (host) and Partner (switch) +exchange LACPDUs. These LACPDUs cannot be sniffed, because they are +destined to link local mac addresses (which switches/bridges are not +supposed to forward). However, most of the values are easily predictable +or are simply the machine's MAC address (which is trivially known to all +other hosts in the same L2). This implies that other machines in the L2 +domain can spoof LACPDU packets from other hosts to the switch and potentially +cause mayhem by joining (from the point of view of the switch) another +machine's aggregate, thus receiving a portion of that hosts incoming +traffic and / or spoofing traffic from that machine themselves (potentially +even successfully terminating some portion of flows). Though this is not +a likely scenario, one could avoid this possibility by simply configuring +few bonding parameters: + + (a) ad_actor_system : You can set a random mac-address that can be used for + these LACPDU exchanges. The value can not be either NULL or Multicast. + Also it's preferable to set the local-admin bit. Following shell code + generates a random mac-address as described above. + + # sys_mac_addr=$(printf '%02x:%02x:%02x:%02x:%02x:%02x' \ + $(( (RANDOM & 0xFE) | 0x02 )) \ + $(( RANDOM & 0xFF )) \ + $(( RANDOM & 0xFF )) \ + $(( RANDOM & 0xFF )) \ + $(( RANDOM & 0xFF )) \ + $(( RANDOM & 0xFF ))) + # echo $sys_mac_addr > /sys/class/net/bond0/bonding/ad_actor_system + + (b) ad_actor_sys_prio : Randomize the system priority. The default value + is 65535, but system can take the value from 1 - 65535. Following shell + code generates random priority and sets it. + + # sys_prio=$(( 1 + RANDOM + RANDOM )) + # echo $sys_prio > /sys/class/net/bond0/bonding/ad_actor_sys_prio + + (c) ad_user_port_key : Use the user portion of the port-key. The default + keeps this empty. These are the upper 10 bits of the port-key and value + ranges from 0 - 1023. Following shell code generates these 10 bits and + sets it. + + # usr_port_key=$(( RANDOM & 0x3FF )) + # echo $usr_port_key > /sys/class/net/bond0/bonding/ad_user_port_key + + 4 Querying Bonding Configuration ================================= @@ -2144,11 +2414,8 @@ It is possible to adjust TCP/IP's congestion limits by altering the net.ipv4.tcp_reordering sysctl parameter. The - usual default value is 3, and the maximum useful value is 127. - For a four interface balance-rr bond, expect that a single - TCP/IP stream will utilize no more than approximately 2.3 - interface's worth of throughput, even after adjusting - tcp_reordering. + usual default value is 3. But keep in mind TCP stack is able + to automatically increase this when it detects reorders. Note that the fraction of packets that will be delivered out of order is highly variable, and is unlikely to be zero. The level @@ -2216,13 +2483,13 @@ bandwidth. Additionally, the linux bonding 802.3ad implementation - distributes traffic by peer (using an XOR of MAC addresses), - so in a "gatewayed" configuration, all outgoing traffic will - generally use the same device. Incoming traffic may also end - up on a single device, but that is dependent upon the - balancing policy of the peer's 8023.ad implementation. In a - "local" configuration, traffic will be distributed across the - devices in the bond. + distributes traffic by peer (using an XOR of MAC addresses + and packet type ID), so in a "gatewayed" configuration, all + outgoing traffic will generally use the same device. Incoming + traffic may also end up on a single device, but that is + dependent upon the balancing policy of the peer's 8023.ad + implementation. In a "local" configuration, traffic will be + distributed across the devices in the bond. Finally, the 802.3ad mode mandates the use of the MII monitor, therefore, the ARP monitor is not available in this mode.