Copyright © 2002, 2003, 2004, 2005, 2006, 2007 Martin A. Brown
"Mar 2007"
Revision History

| Revision | Date | Author | Changes |
|---|---|---|---|
| 0.4.5 | 2007-03-31 | MAB | corrected DocBook build environment; new mail address |
| 0.4.4 | 2003-04-26 | MAB | added index, began packet filtering chapter |
| 0.4.3 | 2003-04-14 | MAB | ongoing editing, ARP/NAT fixes, routing content |
| 0.4.2 | 2003-03-16 | MAB | ongoing editing; unreleased version |
| 0.4.1 | 2003-02-19 | MAB | major routing revision; better use of callouts |
| 0.4.0 | 2003-02-11 | MAB | major NAT revs; add inline scripts; outline FIB |
| 0.3.9 | 2003-02-05 | MAB | fleshed out bonding; added bridging chapter |
| 0.3.8 | 2003-02-03 | MAB | move to linux-ip.net; use TLDP XSL stylesheets |
| 0.3.7 | 2003-02-02 | MAB | major editing on ARP; minor editing on routing |
| 0.3.6 | 2003-01-30 | MAB | switch to XSLT processing; minor revs; CVS |
| 0.3.5 | 2003-01-08 | MAB | ARP flux complete; ARP filtering touched |
| 0.3.4 | 2003-01-06 | MAB | ARP complete; bridging added; ip neigh complete |
| 0.3.3 | 2003-01-05 | MAB | split into 3 parts; ARP chapter begun |
| 0.3.2 | 2002-12-29 | MAB | links updated; minor editing |
| 0.3.1 | 2002-11-26 | MAB | edited: intro, snat, nat; split advanced in two |
| 0.3.0 | 2002-11-14 | MAB | chapters finally have good HTML names |
| 0.2.9 | 2002-11-11 | MAB | routing chapter heavily edited |
| 0.2.8 | 2002-11-07 | MAB | basic chapter heavily edited |
| 0.2.7 | 2002-11-04 | MAB | routing chapter finished; links rearranged |
| 0.2.6 | 2002-10-29 | MAB | routing chapter continued |
| 0.2.5 | 2002-10-28 | MAB | routing chapter partly complete |
| 0.2.4 | 2002-10-08 | MAB | advanced routing additions and overview |
| 0.2.3 | 2002-09-30 | MAB | minor editing; worked on tools/netstat; advanced routing |
| 0.2.2 | 2002-09-24 | MAB | formalized revisioning; finished basic networking; started netstat |
| 0.2.1 | 2002-09-21 | MAB | added network map to incomplete rough draft |
| 0.2 | 2002-09-20 | MAB | incomplete rough draft released on LARTC list |
| 0.1 | 2002-08-04 | MAB | rough draft begun |
Abstract
This guide provides an overview of many of the tools available for IP network administration under linux kernels in the 2.2 and 2.4 series. It covers Ethernet, ARP, IP routing, NAT, and other topics central to the management of IP networks.
This guide is an overview of the IP networking capabilities of linux kernels 2.2 and 2.4. The target audience is any beginning to advanced network administrator who wants practical examples and explanation of rumoured features of linux. As the Internet is lousy with documentation on the nooks and crannies of linux networking support, I have tried to provide links to existing documentation on IP networking with linux.
The documentation you'll find here covers kernels 2.2 and 2.4, although a good number of the examples and concepts may also apply to older kernels. In the event that I cover a feature that is only present or supported under a particular kernel, I'll identify which kernel supports that feature.
I assume a few things about the reader. First, the reader has a basic understanding (at least) of IP addressing and networking. If this is not the case, or the reader has some trouble following my networking examples, I have provided a section of links to IP layer tutorials and general introductory documentation in the appendix. Second, I assume the reader is comfortable with command line tools and the Linux, Unix, or BSD environments. Finally, I assume the reader has working network cards and a Linux OS. For assistance with Ethernet cards, there exists a good Ethernet HOWTO.
The examples I give are intended as tutorial examples only. The user should understand and accept the ramifications of using these examples on his/her own machines. I recommend that before running any example on a production machine, the user test in a controlled environment. I accept no responsibility for damage, misconfiguration or loss of any kind as a result of referring to this documentation. Proceed with caution at your own risk.
This guide has been written primarily as a companion reference to IP networking on Ethernets. Although I do allude to other link layer types occasionally in this book, the focus has been IP as used in Ethernet. Ethernet is one of the most common networking devices supported under linux, and is practically ubiquitous.
This text was written in DocBook with vim. All formatting has been applied by xsltproc based on DocBook and LDP XSL stylesheets. Typeface formatting and display conventions are similar to most printed and electronically distributed technical documentation. A brief summary of these conventions follows below.
The interactive shell prompt will look like
[root@hostname]#
for the root user and
[user@hostname]$
for non-root users, although most of the operations we will be discussing will require root privileges.
Any commands to be entered by the user will always appear like
{ echo "Hi, I am exiting with a non-zero exit code."; exit 1; }
Output by any program will look something like this:
Hi, I am exiting with a non-zero exit code.
Where possible, an additional convention I have used is the suppression of all hostname lookup. DNS and other naming based schemes often confuse the novice and expert alike, particularly when the name resolver is slow or unreachable. Since the focus of this guide is IP layer networking, DNS names will be used only where absolutely unambiguous.
Perhaps this should be called things that are wrong with this document,
or things which should be improved. See the
src/ROADMAP for notes on what is likely to be
forthcoming in subsequent releases.
The internal document linking, while good, could be better. Especially lame is the lack of an index. External links should be used more commonly where appropriate instead of sending users to the links page.
If you are looking for LARTC topics, you may find some LAR topics here, but you should try the LARTC page itself if you have questions that are more TC than LAR. Consult Appendix I, Links to other Resources for further references to available documentation.
There are many tools available under linux which are also available under other unix-like operating systems, but there are additional tools and specific tools which are available only to users of linux. This guide represents an effort to identify some of these tools. The most concrete example of the difference between linux-only tools and generally available unix-like tools is the difference between the traditional ifconfig and route commands, available under most variants of unix, and the iproute2 command suite, written specifically for linux.
Because this guide concerns itself with the features, strengths, and peculiarities of IP networking with linux, the iproute2 command suite assumes a prominent role. The iproute2 tools expose the strength, flexibility and potential of the linux networking stack.
Many of the tools and concepts introduced here are also detailed in other HOWTOs and guides available at The Linux Documentation Project, in addition to many other places on the Internet and in printed books.
As with many human endeavours, this work is made possible by the efforts of others. For me, this effort represents almost four years of learning and network administration. The knowledge collected here is in large measure a repackaging of disparate resources and my own experiences over time. Without the greater linux community, I would not be able to provide this resource.
I would like to take this opportunity to make a plug for my employer, SecurePipe, Inc. which has provided me stable and challenging employment for these (almost) four years. SecurePipe is a managed security services provider specializing in managed firewall, VPN, and IDS services to small and medium sized companies. They offer me the opportunity to hone my networking skills and explore areas of linux networking unknown to me. Thanks also to SecurePipe, Inc. for hosting this cost-free on their servers.
Over the course of the project, many people have contributed suggestions,
modifications, corrections and additions. I'll acknowledge them briefly
here. For full acknowledgements, see
src/ACKNOWLEDGEMENTS in the DocBook source tree.
Russ Herrold, 2002-09-22
Yann Hirou, 2002-09-26
Julian Anastasov, 2002-10-29
Bert Hubert, 2002-11-14
Tony Kapela, 2002-11-30
George Georgalis, 2003-01-11
Alex Russell, 2003-02-02
giovanni, 2003-02-06
Gilles Douillet, 2003-02-28
Please feel free to point out any irregularities, factual errors,
typographical errors, or logical gaps in this
documentation. If you have rants or raves about this documentation,
please mail me directly at <martin@linux-ip.net>.
Now, let's begin! Let me welcome you to the pleasure and reliability of IP networking with linux.
Internet Protocol (IP) networking is among the most common networking technologies in use today. The IP stack under linux is mature, robust and reliable. This chapter covers the basics of configuring a linux machine or multiple linux machines to join an IP network.
This chapter covers a quick overview of the locations of the networking control files on different distributions of linux. The remainder of the chapter is devoted to outlining the basics of IP networking with linux.
These basics are written in a more tutorial style than the remainder of the first part of the book. Reading and understanding IP addressing and routing information is a key skill to master when beginning with linux. Naturally, the next step is to alter the IP configuration of a machine. This chapter will introduce these two key skills in a tutorial style. Subsequent chapters will engage specific subtopics of linux networking in a more thorough and less tutorial manner.
Different linux distribution vendors put their networking configuration files in different places in the filesystem. Here is a brief summary of the locations of the IP networking configuration information under a few common linux distributions along with links to further documentation.
Location of networking configuration files

| Distribution | Configuration | Location |
|---|---|---|
| RedHat (and Mandrake) | Interface definitions | /etc/sysconfig/network-scripts/ifcfg-* |
| RedHat (and Mandrake) | Hostname and default gateway definition | /etc/sysconfig/network |
| RedHat (and Mandrake) | Definition of static routes | /etc/sysconfig/static-routes |
| SuSE (version >= 8.0) | Interface definitions | /etc/sysconfig/network/ifcfg-* |
| SuSE (version >= 8.0) | Static route definition | /etc/sysconfig/network/routes |
| SuSE (version >= 8.0) | Interface specific static route definition | /etc/sysconfig/network/ifroute-* |
| SuSE (version < 8.0) | Interface and route definitions | /etc/rc.config |
| Debian | Interface and route definitions | /etc/network/interfaces |
| Gentoo | Interface and route definitions | /etc/conf.d/net |
| Slackware | Interface and route definitions | /etc/rc.d/rc.inet1 |
The format of the networking configuration files differs significantly from distribution to distribution, yet the tools used by these scripts are the same. This documentation will focus on these tools and how they instruct the kernel to alter interface and route information. Consult the distribution's documentation for questions of file format and order of operation.
For the remainder of this document, many examples refer to machines in a hypothetical network. Refer to the example network description for the network map and addressing scheme.
Assuming an already configured machine named tristan, let's
look at the IP addressing and routing
table. Next we'll examine how the machine
communicates with computers (hosts) on the locally reachable network. We'll
then send packets through our
default gateway to other networks. After learning what a default
route is, we'll look at a static
route.
One of the first things to learn about a machine attached to an IP
network is its IP address. We'll begin by looking at
a machine named tristan on the main desktop network (192.168.99.0/24).
The machine tristan
is alive on IP 192.168.99.35 and
has been properly configured by the system administrator.
By examining the route and ifconfig output we can learn a good deal about the network to which tristan is connected [1].
Example 1.1. Sample ifconfig output
For the moment, ignore the loopback interface (lo) and concentrate on the Ethernet interface. Examine the output of the ifconfig command. We can learn a great deal about the IP network to which we are connected simply by reading the ifconfig output. For a thorough discussion of ifconfig, see Section C.1, “ifconfig”.
The IP address active on tristan is 192.168.99.35. This means that
any IP packets created by tristan will have a
source address of 192.168.99.35. Similarly any packet received by
tristan will have the destination address of 192.168.99.35.
When creating an outbound packet tristan will set the destination
address to the server's IP. This gives the remote host and the
networking devices in between these hosts enough information to
carry packets between the two devices.
Because tristan will
advertise that it accepts packets with a destination address of
192.168.99.35, any frames (packets) appearing on the Ethernet
bound for 192.168.99.35 will reach tristan. The process of
communicating the ownership of an IP address is called ARP. Read
Section 2.1.1, “Overview of Address Resolution Protocol” for a complete discussion of
this process.
The ability of a host to generate and receive packets on an IP address assigned to it is fundamental to IP networking. This IP address is a unique identifier for the machine on the network to which it is connected.
Common traffic to and from machines today is unicast IP traffic.
Unicast traffic is essentially a conversation between two hosts.
Though there may be routers between them, the two hosts are carrying
on a private conversation. Examples of common unicast traffic
are protocols such as HTTP (web), SMTP (sending mail), POP3 (fetching
mail), IRC (chat), SSH (secure shell), and LDAP (directory access).
To participate in any of these kinds of traffic,
tristan will send and receive packets on 192.168.99.35.
In contrast to unicast traffic, there is another common IP networking technique called broadcasting. Broadcast traffic is a way of addressing all hosts in a given network range with a single destination IP address. To continue the analogy of the unicast conversation, a broadcast is more like shouting in a room. Occasionally, network administrators will refer to broadcast techniques and broadcasting as "chatty network traffic".
Broadcast techniques are used at the Ethernet layer and the IP layer, so the cautious person talks about Ethernet broadcasts or IP broadcasts. Refer to Section 2.1.1, “Overview of Address Resolution Protocol”, for more information on a common use of broadcast Ethernet frames.
IP Broadcast techniques can be used to share information with all partners on a network or to discover characteristics of other members of a network. SMB (Server Message Block) as implemented by Microsoft products and the samba package makes extensive use of broadcasting techniques for discovery and information sharing. Dynamic Host Configuration Protocol (DHCP) also makes use of broadcasting techniques to manage IP addressing.
The IP broadcast address is usually correctly derived from the IP address and network mask, although it can easily be set explicitly to a different address. Because the broadcast address is used for autodiscovery (e.g., SMB under some protocols), an incorrect broadcast address can inhibit a machine's ability to participate in networked communication [2].
The netmask on the interface should match the netmask in the routing table for the locally connected network. Typically, the route and the IP interface definition are calculated from the same configuration data so they should match perfectly.
If you are at all confused about how to address a network or how to read either the traditional notation or the CIDR notation for network addressing, see one of the CIDR/netmask references in Section I.1.3, “General IP Networking Resources”.
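As a worked illustration of how the network and broadcast addresses follow from an IP address and netmask, here is a minimal sketch in Python using the standard ipaddress module. The address is tristan's; the variable names are mine and only for illustration.

```python
import ipaddress

# tristan's address and netmask, written in CIDR notation
iface = ipaddress.ip_interface("192.168.99.35/24")
network = iface.network

# The same derivation by hand: AND with the netmask gives the network
# address; OR with the inverted netmask gives the broadcast address.
ip = int(iface.ip)
mask = int(network.netmask)
network_by_hand = ipaddress.ip_address(ip & mask)
broadcast_by_hand = ipaddress.ip_address(ip | (~mask & 0xFFFFFFFF))

print("netmask:  ", network.netmask)
print("network:  ", network.network_address, "by hand:", network_by_hand)
print("broadcast:", network.broadcast_address, "by hand:", broadcast_by_hand)
```

If the broadcast address configured on an interface differs from the value computed this way, that is usually a hint that the IP address and netmask on the interface do not match.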
We can see from the output above that the IP address 192.168.99.35
falls inside the address space 192.168.99.0/24. We also note that
the machine tristan
will route packets bound for 192.168.99.0/24 directly onto the
Ethernet attached to eth0. This line in the routing table
identifies a network available on the Ethernet attached to eth0
("Iface") by its network address ("Destination") and size ("Genmask").
Every host on the 192.168.99.0/24 network should share the network address and netmask specified above. No two hosts should share the same IP address.
Currently, there are two hosts connected to the example desktop network.
Both tristan and masq-gw are connected to 192.168.99.0/24. Thus,
192.168.99.254 (masq-gw) should be reachable from tristan.
Success of this test provides evidence that tristan is
configured properly. N.B., Assume that the network
administrator has properly configured masq-gw. Since the
default gateway in any
network is an important host, testing reachability of the default
gateway also has a value in determining the proper operation of the
local network.
The ping tool, designed to take advantage of Internet Control Message Protocol (ICMP), can be used to test reachability of IP addresses. For a command summary and examples of the use of ping, see Section G.1, “ping”.
Example 1.2. Testing reachability of a locally connected host with ping
In Section 1.2.1, “Sending Packets to the Local Network”, we verified that hosts connected to the same local network can reach each other and, importantly, the default gateway. Now, let's see what happens to packets which have a destination address outside the locally connected network.
Assuming that the network administrator allows ping packets
from the desktop network into the public network,
ping can be invoked with the
record route option to show the path the packet travels from
tristan to wan-gw and back.
Example 1.3. Testing reachability of non-local hosts
By testing reachability of the local network 192.168.99.0/24 and an IP address outside our local network, we have verified the basic elements of IP connectivity.
To summarize this section, we have:
identified the IP address, network address and netmask in use
on tristan using the tools ifconfig and
route
verified that tristan can reach its default gateway
tested that packets bound for destinations outside our local network reach the intended destination and return
Static routes instruct the kernel to route packets
for a known destination host or network to a router or
gateway different from the default gateway.
In the example network, the desktop machine tristan would need
a static route to reach hosts in the 192.168.98.0/24 network.
Note that the branch office network is reachable over an ISDN line.
The ISDN router's IP in tristan's network is 192.168.99.1. This
means that there are two gateways in the example desktop network,
one connected to a small branch office network, and the other
connected to the Internet.
Without a static route to the branch office network, tristan would
use masq-gw as the gateway, which is not the most efficient path for
packets bound for morgan. Let's examine why a static route would
be better here.
If tristan generates a packet bound for morgan and
sends the packet to the default gateway, masq-gw will forward the
packet to isdn-router as well as generate an ICMP redirect message
to tristan. This ICMP redirect message tells tristan to send
future packets with a destination address of 192.168.98.82 (morgan)
directly to isdn-router. For a fuller discussion of ICMP redirect,
see
Section 4.10.2, “ICMP Redirects and Routing”.
The absence of a static route has caused two extra packets to be
generated on the Ethernet for no benefit. Not only that, but
tristan will eventually expire the temporary route entry
[3]
for 192.168.98.82, which means that subsequent packets bound for
morgan will repeat this process
[4].
To solve this problem, add a static route to tristan's routing
table. Below is a modified routing table (see
Section 1.3, “Changing IP Addresses and Routes” to learn how to change the routing
table).
Example 1.4. Sample routing table with a static route
According to this routing table, any packets with a destination address in the 192.168.98.0/24 network will be routed to the gateway 192.168.99.1 instead of the default gateway. This will prevent unnecessary ICMP redirect messages.
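The route selection described here can be sketched in a few lines of Python. This is only a model of the idea (the most specific matching route wins), not the kernel's implementation. The tuple layout and the lookup() helper are my own invention, and 203.0.113.10 is an arbitrary documentation address standing in for a host on the Internet.

```python
import ipaddress

# (destination network, gateway or None for directly connected, device)
# Modelled on the routing table described in the text; the layout is
# illustrative and not a kernel interface.
ROUTES = [
    ("192.168.99.0/24", None, "eth0"),
    ("192.168.98.0/24", "192.168.99.1", "eth0"),
    ("0.0.0.0/0", "192.168.99.254", "eth0"),
]

def lookup(destination, routes=ROUTES):
    """Return (network, gateway, device) of the most specific matching route."""
    dest = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(net), gw, dev)
        for net, gw, dev in routes
        if dest in ipaddress.ip_network(net)
    ]
    if not matches:
        return None
    # The longest prefix is the most specific route.
    return max(matches, key=lambda route: route[0].prefixlen)

for dest in ("192.168.99.254", "192.168.98.82", "203.0.113.10"):
    net, gw, dev = lookup(dest)
    print(dest, "->", net, "via", gw or "direct delivery", "on", dev)
```

With the static route present, the packet for morgan (192.168.98.82) goes straight to 192.168.99.1, so masq-gw never has to forward it or send a redirect.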
These are the basic tools for inspecting the IP address and the routes on a linux machine. Understanding the output of these tools will help you understand how machines fit into simple networks, and will be a base on which you can build an understanding of more complex networks.
[1] For BSD and UNIX users, the idiom netstat -rn may be more familiar than the common route -n on a linux machine. Both of these commands provide the same basic information although the formatting is a bit different. For a fuller discussion of these, see either Section G.4, “netstat” or Section D.1, “route”. For access to all of the routing features of the linux kernel, use ip route instead.
[2] An incorrect broadcast address often highlights a mismatch of the configured IP address and netmask on an interface. If in doubt, be sure to use an IP calculator to set the correct netmask and broadcast addresses.
[3] If the machine is a linux machine, then the temporary route entry is stored in the routing cache. Consult Section 4.7, “Routing Cache” for more information on the routing cache.
[4] It is quite reasonable to ignore ICMP redirect messages from unknown hosts on the Internet, but ICMP redirect messages on a LAN indicate that a host has mismatched netmasks or missing static routes.
This section introduces changing the IP address on an interface, changing the default gateway, and adding and removing a static route. With the knowledge of ifconfig and route output it's a small step to learn how to change IP configuration with these same tools.
For a practical example, let's say that the branch office server,
morgan, needs to visit the main office for some hardware maintenance.
Since the services on the machine are not in use, it's a convenient
time to fetch some software updates, after configuring the machine to
join the LAN.
Once the machine is booted and connected to the Ethernet, it's ready for IP reconfiguration. In order to join an IP network, the following information is required. Refer to the network map and appendix to gather the required information below.
Example 1.5. ifconfig and route output before the change
The process of readdressing for the new network involves three steps.
It is clear in
Example 1.5, “ifconfig and route
output before the change”, that morgan is configured
for a different network than the main office desktop network.
First, the
active interface must be
brought down, then a
new address must be configured
on the interface and brought up, and finally
a new default route must be
added. If the networking configuration is correct and the
process is successful, the machine should be able to connect to local
and non-local destinations.
This is a fast way to stop networking on a single-homed machine such as a server or workstation. On multi-homed hosts, other interfaces on the machine would be unaffected by this command. This method of bringing down an interface has some serious side effects, which should be understood. Here is a summary of the side effects of bringing down an interface.
Side effects of bringing down an interface with ifconfig
all IP addresses on the specified interface are deactivated and removed
any connections established to or from IPs on the specified interface are broken [7]
all routes to any destinations through the specified interface are removed from the routing tables
the link layer device is deactivated
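To make the list above concrete, here is a toy model in Python of what disappears when an interface goes down. It follows the side effects as listed and does not talk to the kernel. The dictionaries mirror tristan's configuration as described earlier in the chapter, and the bring_down() helper is hypothetical.

```python
# A toy model of the kernel state, not a real API.
addresses = {"lo": ["127.0.0.1"], "eth0": ["192.168.99.35"]}
routes = [
    {"dest": "192.168.99.0/24", "gateway": None, "dev": "eth0"},
    {"dest": "127.0.0.0/8", "gateway": None, "dev": "lo"},
    {"dest": "0.0.0.0/0", "gateway": "192.168.99.254", "dev": "eth0"},
]

def bring_down(dev, addresses, routes):
    """Return (addresses, routes) as they would look after dev goes down."""
    # Addresses on the interface are removed ...
    remaining_addresses = {d: ips for d, ips in addresses.items() if d != dev}
    # ... and so is every route through the interface, default route included.
    remaining_routes = [r for r in routes if r["dev"] != dev]
    return remaining_addresses, remaining_routes

addresses_after, routes_after = bring_down("eth0", addresses, routes)
print(addresses_after)
print(routes_after)
```

Only the loopback address and route survive in this model, which is why a default route has to be added again once the interface comes back up.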
The next step, bringing up the interface, requires the new networking configuration information. It's a good habit to check the interface after configuration to verify settings.
Example 1.7. Bringing up an Ethernet interface with ifconfig
The second call to ifconfig allows verification of the IP addressing information. The currently configured IP address on eth0 is 192.168.99.14. Bringing up an interface also has a small set of side effects.
Side effects of bringing up an interface
the link layer device is activated
the requested IP address is assigned to the specified interface
all local, network, and broadcast routes implied by the IP configuration are added to the routing tables
Use ping to verify the reachability of other locally connected hosts or skip directly to setting the default gateway.
It should come as no surprise to a close reader (hint) that the default route was removed at the execution of ifconfig eth0 down. The crucial final step is configuring the default route.
Example 1.8. Adding a default route with route
The routing table on morgan should look exactly like the initial
routing table on tristan. Compare the routing tables in
Example 1.1, “Sample ifconfig output” and
Example 1.8, “Adding a default route with route”.
These changes to the routing table on morgan will stay in effect
until they are manually changed, the network is restarted, or the
machine reboots. With knowledge of the addressing scheme of a
network, and the use of
ifconfig and
route it's
simple to readdress a machine on just about any Ethernet you can
attach to. The benefits of familiarity with these commands extend to
non-Ethernet IP networks as well, because these commands operate on the
IP layer, independent of the link layer.
Now that morgan has joined the LAN at the main office and can
reach the Internet, a static route to the branch office would be
convenient for accessing resources on that network.
A static route is any route entered into a routing table which specifies at least a destination address and a gateway or device. Static routes are special instructions regarding the path a packet should take to reach a destination and are usually used to specify reachability of a destination through a router other than the default gateway.
As we saw above, in Section 1.2.3, “Static Routes to Networks”, a static route provides a specific route to a known destination. There are several pieces of information we need to know in order to be able to add a static route.
the address of the destination (192.168.98.0)
the netmask of the destination (255.255.255.0)
EITHER the IP address of the router (192.168.99.1) through which the destination is reachable
OR the name of the link layer device to which the destination is directly connected
Example 1.9. Adding a static route with route
Example 1.9, “Adding a static route with route” shows how to add a static route to the 192.168.98.0/24 network. In order to test the reachability of the remote network, ping any machine on the 192.168.98.0/24 network. Routers are usually a good choice, since they rarely have packet filters and are usually alive.
Because a more specific route is always chosen over a less specific route, it is even possible to support host routes. These are routes for destinations which are single IP addresses. This can be accomplished with a manually added static route as below.
Example 1.10. Removing a static network route and adding a static host route
This should serve as an illustration that there is no difference to the kernel in selecting a route between a host route and a network route with a host netmask. If this is a surprise or is at all confusing, review the use of netmasks in IP networking. Some collected links on general IP networking are available in Section I.1.3, “General IP Networking Resources”.
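A short Python sketch may help here. It treats a host route as nothing more than a network with a 255.255.255.255 netmask. I use morgan's address as the host purely as an assumption for the illustration; any single address would do.

```python
import ipaddress

# A "host route" and a "network route", both written as address/netmask.
host_route = ipaddress.ip_network("192.168.98.82/255.255.255.255")
network_route = ipaddress.ip_network("192.168.98.0/255.255.255.0")

print(host_route, host_route.prefixlen, host_route.num_addresses)
print(network_route, network_route.prefixlen, network_route.num_addresses)

# Route selection does not care which kind it is: the most specific
# network containing the destination wins.
dest = ipaddress.ip_address("192.168.98.82")
candidates = [n for n in (network_route, host_route) if dest in n]
best = max(candidates, key=lambda n: n.prefixlen)
print("selected:", best)
```

The host route is simply a /32 network containing exactly one address, so the usual rule of preferring the longest prefix selects it without any special handling.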
[5] The network address can be calculated from the IP address and netmask. Refer to Section H.1, “ipcalc and other IP addressing calculators”. Especially handy is the variable length subnet mask RFC, RFC 1878.
[6] Many networks are configured with the name resolution services on a publicly connected host. See Section 12.6, “DNS Troubleshooting”.
[7] It is possible for a linux box which meets the following three criteria to maintain connections and provide services without having the service IP configured on an interface. It must be functioning as a router, be configured to support non-local binding and be in the route path of the client machine. This is an uncommon need, frequently accomplished by the use of transparent proxying software.
This chapter has introduced the simplest uses of ifconfig and route to view and alter the IP configuration of a host. To reiterate the minimum requirements to create an IP network between two machines:
Requirements for Two Hosts on the Same Ethernet to Communicate Using IP
Each host must have a good connection to the Ethernet. Verify a good connection to the Ethernet with mii-tool, documented in Section B.5, “mii-tool”.
Each host must share IP network space. Practically, this means that each host should have the same network address, netmask, and broadcast address [8].
Each host must have a unique IP address.
Neither host may block the other's IP packets. (Host based packet filtering may hinder connections!)
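The two addressing requirements in this list can be checked mechanically. The following Python sketch is one way to do so, under the assumption that each host is described by an "address/prefix" string. The addressing_problems() helper is hypothetical, and it cannot check cabling or packet filters.

```python
import ipaddress

def addressing_problems(a, b):
    """List addressing problems for two hosts given as 'ip/prefix' strings."""
    host_a, host_b = ipaddress.ip_interface(a), ipaddress.ip_interface(b)
    problems = []
    # Same network address and netmask implies the same broadcast address.
    if host_a.network != host_b.network:
        problems.append("network address or netmask differs")
    if host_a.ip == host_b.ip:
        problems.append("both hosts use the same IP address")
    return problems

# tristan and masq-gw: no problems
print(addressing_problems("192.168.99.35/24", "192.168.99.254/24"))
# tristan and morgan (branch office addressing): different networks
print(addressing_problems("192.168.99.35/24", "192.168.98.82/24"))
# two hosts configured with the same address
print(addressing_problems("192.168.99.35/24", "192.168.99.35/24"))
```

An empty list means the addressing side of the requirements is met; the physical connection and any host based packet filters still need to be verified separately.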
This concludes the tour of basic host networking and IP layer configuration as well as some basic tools available to the linux user. For further documentation on these tools, other tips, tricks, and more advanced content, keep reading!
[8] Technically, the two hosts simply need to have routes to each other, but we are discussing the simplest case here, so we'll leave this for a discussion of shared media.
The most common link layer network in use today is Ethernet. Although there are several common speeds of Ethernet devices, they function identically with regard to higher layer protocols. As this documentation focusses on higher layer protocols (IP), some fine distinctions about different types of Ethernet will be overlooked in favor of depicting the uniform manner in which IP networks overlay Ethernets.
Address Resolution Protocol provides the necessary mapping between link layer addresses and IP addresses for machines connected to Ethernets. Linux offers control of ARP requests and replies via several not-well-known /proc interfaces: net/ipv4/conf/$DEV/proxy_arp, net/ipv4/conf/$DEV/medium_id, and net/ipv4/conf/$DEV/hidden. For even finer control of ARP requests than is available in stock kernels, there are kernel and iproute2 patches.
This chapter will introduce the ARP conversation, discuss the ARP cache, a volatile mapping of the reachable IPs and MAC addresses on a segment, examine the ARP flux problem, and explore several ARP filtering and suppression techniques. A section on VLAN technology and channel bonding will round out the chapter on Ethernet.
Address Resolution Protocol (ARP) hovers in the shadows of most networks. Because of its simplicity, by comparison to higher layer protocols, ARP rarely intrudes upon the network administrator's routine. All modern IP-capable operating systems provide support for ARP. The uncommon alternative to ARP is static link-layer-to-IP mappings.
ARP defines the exchanges between network interfaces connected to an Ethernet media segment in order to map an IP address to a link layer address on demand. Link layer addresses are hardware addresses (although they are not immutable) on Ethernet cards and IP addresses are logical addresses assigned to machines attached to the Ethernet. Subsequently in this chapter, link layer addresses may be known by many different names: Ethernet addresses, Media Access Control (MAC) addresses, and even hardware addresses. Disputably, the correct term from the kernel's perspective is "link layer address" because this address can be changed (on many Ethernet cards) via command line tools. Nevertheless, these terms are not realistically distinct and can be used interchangeably.
Address Resolution Protocol (ARP) exists solely to glue together the IP and Ethernet networking layers. Since networking hardware such as switches, hubs, and bridges operate on Ethernet frames, they are unaware of the higher layer data carried by these frames [9]. Similarly, IP layer devices, operating on IP packets, need to be able to transmit their IP data on Ethernets. ARP defines the conversation by which IP capable hosts can exchange mappings of their Ethernet and IP addressing.
ARP is used to locate the Ethernet address associated with a desired IP
address. When a machine has a packet bound for another IP on a locally
connected Ethernet network, it will send a broadcast Ethernet frame
containing an ARP request onto the Ethernet. All machines with the same
Ethernet broadcast address will receive this packet
[10].
If a machine receives the ARP request and it hosts the IP requested,
it will respond with the link layer address on which it will receive
packets for that IP address.
N.B., the
arp_filter
sysctl will alter this behaviour
somewhat.
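The request described above can be sketched as raw bytes. This is a minimal illustration of the standard ARP-over-Ethernet layout, not a tool from this guide: the function names are invented for the example, and the link layer address for tristan and the IP 192.168.99.254 for the target are assumptions.

```python
import socket
import struct

BROADCAST_MAC = b"\xff" * 6      # Ethernet broadcast, all bits set
ETHERTYPE_ARP = 0x0806


def mac_to_bytes(mac: str) -> bytes:
    """Convert 'aa:bb:cc:dd:ee:ff' into six raw bytes."""
    return bytes.fromhex(mac.replace(":", ""))


def build_arp_request(sender_mac: str, sender_ip: str, target_ip: str) -> bytes:
    """Build a broadcast Ethernet frame carrying an ARP request."""
    sha = mac_to_bytes(sender_mac)
    ethernet_header = BROADCAST_MAC + sha + struct.pack("!H", ETHERTYPE_ARP)
    arp_payload = struct.pack(
        "!HHBBH6s4s6s4s",
        1,                           # hardware type: Ethernet
        0x0800,                      # protocol type: IPv4
        6,                           # link layer address length
        4,                           # IP address length
        1,                           # opcode 1: request
        sha,                         # sender link layer address
        socket.inet_aton(sender_ip),
        b"\x00" * 6,                 # target link layer address: unknown
        socket.inet_aton(target_ip),
    )
    return ethernet_header + arp_payload


# Hypothetical addresses: the MAC and the target IP are assumptions.
frame = build_arp_request("00:80:c8:f8:4a:51", "192.168.99.35", "192.168.99.254")
print(len(frame), frame.hex())
```

The frame is only constructed here, never transmitted; sending it would require a raw socket and root privileges.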
Once the requestor receives the response packet, it associates the MAC address and the IP address. This information is stored in the ARP cache. The ARP cache can be manipulated with the ip neighbor and arp commands. To learn how and when to manipulate the ARP cache, see Section B.1, “arp”.
In Example 1.2, “Testing reachability of a locally connected host with
ping”, we used ping to
test reachability of masq-gw. Using a packet sniffer to capture
the sequence of packets on the Ethernet as a result of tristan's
attempt to ping, provides an example of ARP in flagrante
delicto. Consult the
example network map for a
visual representation of the network layout in which this traffic
occurs.
This is an archetypal conversation between two computers exchanging relevant hardware addressing in order that they can pass IP packets, and comprises two Ethernet frames.
Example 2.1. ARP conversation captured with tcpdump [11]
This broadcast Ethernet frame, identifiable by the destination Ethernet address with all bits set (ff:ff:ff:ff:ff:ff), contains an ARP request from tristan.
The ARP reply from masq-gw is sent to the machine which initiated the ARP request (tristan).
The final two packets in Example 2.1, “ARP conversation captured with tcpdump”, display the link layer header and the encapsulated ICMP packets exchanged between these two hosts. Examining the ARP cache on each of these hosts would reveal entries on each host for the other host's link layer address.
This is the most common form of ARP traffic on an Ethernet. In summary, an ARP request is transmitted in a broadcast Ethernet frame. The ARP reply is a unicast response, containing the desired information, sent to the requestor's link layer address.
A rarer usage of ARP is gratuitous ARP, where a machine announces its ownership of an IP address on a media segment. The arping utility can generate these gratuitous ARP frames. Linux kernels will respect gratuitous ARP frames [12].
Example 2.2. Gratuitous ARP reply frames
|
The frames generated in Example 2.2, “Gratuitous ARP reply frames” are ARP replies to a question never asked. This sort of ARP is common in failover solutions and is also used for more nefarious purposes by tools such as ettercap.
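What sets these frames apart is visible in the ARP payload itself: the sender and target IP addresses are identical. The sketch below parses a 28-byte ARP payload and labels it using this chapter's terms. The function names, the sample link layer address and the choice of target link layer address in each sample are assumptions made for illustration.

```python
import socket
import struct
from typing import NamedTuple

ARP_FORMAT = "!HHBBH6s4s6s4s"    # fixed layout for IPv4 over Ethernet


class ArpPacket(NamedTuple):
    opcode: int                  # 1 = request, 2 = reply
    sender_mac: bytes
    sender_ip: str
    target_mac: bytes
    target_ip: str


def parse_arp(payload: bytes) -> ArpPacket:
    """Decode the first 28 bytes of an ARP payload."""
    _htype, _ptype, _hlen, _plen, opcode, sha, spa, tha, tpa = struct.unpack(
        ARP_FORMAT, payload[:28]
    )
    return ArpPacket(opcode, sha, socket.inet_ntoa(spa), tha, socket.inet_ntoa(tpa))


def describe(packet: ArpPacket) -> str:
    """Label a packet using the terminology of this chapter."""
    if packet.sender_ip != packet.target_ip:
        return "ordinary"
    return "gratuitous reply" if packet.opcode == 2 else "unsolicited request"


# Hypothetical sample data; the MAC address is an assumption.
mac = bytes.fromhex("0080c8f84a51")
own_ip = socket.inet_aton("192.168.99.35")
gateway_ip = socket.inet_aton("192.168.99.254")

reply = struct.pack(ARP_FORMAT, 1, 0x0800, 6, 4, 2, mac, own_ip, mac, own_ip)
request = struct.pack(ARP_FORMAT, 1, 0x0800, 6, 4, 1, mac, own_ip, b"\xff" * 6, own_ip)
lookup = struct.pack(ARP_FORMAT, 1, 0x0800, 6, 4, 1, mac, own_ip, b"\x00" * 6, gateway_ip)

for sample in (reply, request, lookup):
    print(describe(parse_arp(sample)))
```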
Unsolicited ARP request frames, on the other hand, are broadcast ARP requests initiated by a host owning an IP address.
Example 2.3. Unsolicited ARP request frames
|
These two uses of arping can help diagnose Ethernet and ARP problems--particularly hosts replying for addresses which do not belong to them.
To avoid IP address collisions on dynamic networks (where hosts are turning on and off, connecting and disconnecting and otherwise changing IP addresses) duplicate address detection becomes important. Fortunately, arping provides this functionality as well. A startup script could include the arping utility in duplicate address detection mode to select between IP addresses or methods of acquiring an IP address.
Example 2.4. Duplicate Address Detection with ARP
|
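A startup script along the lines suggested above might wrap arping as follows. This is a sketch only: it assumes the iputils version of arping, whose -D mode exits with status 0 when no reply is received, and the helper names and probe count are invented for the example. Running it requires sufficient privileges to send ARP frames.

```python
import subprocess
from typing import List


def dad_command(device: str, address: str, probes: int = 2) -> List[str]:
    """Build an arping invocation in duplicate address detection mode.

    Assumes iputils arping: -D (DAD mode), -q (quiet),
    -c (probe count), -I (outgoing device).
    """
    return ["arping", "-D", "-q", "-c", str(probes), "-I", device, address]


def address_is_free(device: str, address: str) -> bool:
    """Return True if no other host answered for the address.

    Relies on arping -D exiting 0 when no replies were received.
    """
    result = subprocess.run(dad_command(device, address), check=False)
    return result.returncode == 0


if __name__ == "__main__":
    # Only print the command; nothing is sent on the wire here.
    print(" ".join(dad_command("eth0", "192.168.99.35")))
```

A script could call address_is_free() for each candidate address in turn and configure the first one that is unclaimed.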
Address Resolution Protocol, which provides a method to connect physical network addresses with logical network addresses, is a key element in the deployment of IP on Ethernet networks.
In simplest terms, an ARP cache is a stored mapping of IP addresses with link layer addresses. An ARP cache obviates the need for an ARP request/reply conversation for each IP packet exchanged. Naturally, this efficiency comes with a price. Each host maintains its own ARP cache, which can become outdated when a host is replaced, or an IP address moves from one host to another. The ARP cache is also known as the neighbor table.
To display the ARP cache, the venerable and cross-platform arp admirably dispatches its duty. As with many of the iproute2 tools, more information is available via ip neighbor than with arp. Example 2.5, “ARP cache listings with arp and ip neighbor” below illustrates the differences between the output of these two tools.
Example 2.5. ARP cache listings with arp and ip neighbor
|
A major difference between the information reported by ip neighbor and arp is the state of the proxy ARP table. The only way to list permanently advertised entries in the neighbor table (proxy ARP entries) is with arp.
Entries in the ARP cache are periodically and automatically
verified unless continually used. Along with
net/ipv4/neigh/$DEV/gc_stale_time,
there are a number of other parameters in
net/ipv4/neigh/$DEV which control the
expiration of entries in the ARP cache.
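These parameters are ordinary files and can be read directly. The helper below is a small sketch, assuming that the sysctl tree is mounted at /proc/sys and that the named parameter holds a single integer; the function names are hypothetical.

```python
from pathlib import Path


def neigh_param_path(device: str, name: str, proc_root: str = "/proc/sys") -> Path:
    """Return the path of a per-device neighbor table parameter."""
    return Path(proc_root) / "net/ipv4/neigh" / device / name


def read_neigh_param(device: str, name: str, proc_root: str = "/proc/sys") -> int:
    """Read an integer parameter such as gc_stale_time."""
    return int(neigh_param_path(device, name, proc_root).read_text().split()[0])


if __name__ == "__main__":
    path = neigh_param_path("eth0", "gc_stale_time")
    if path.exists():            # eth0 is an assumed device name
        print(path, "=", read_neigh_param("eth0", "gc_stale_time"))
```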
When a host is down or disconnected from the Ethernet, there is a
period of time during which other hosts may have an ARP cache entry
for the disconnected host. Any other machine may display a neighbor
table with the link layer address of the recently disconnected host.
Because there is a recently known-good link layer address on which
the IP was reachable, the entry will abide. At
gc_stale_time the state of the entry will change,
reflecting the need to verify the reachability of the link layer
address. When the disconnected host fails to respond to ARP requests,
the neighbor table entry will be marked as
incomplete.
Here are the possible states for entries in the neighbor table.
Table 2.1. Active ARP cache entry states
| ARP cache entry state | meaning | action if used |
|---|---|---|
| permanent | never expires; never verified | reset use counter |
| noarp | normal expiration; never verified | reset use counter |
| reachable | normal expiration | reset use counter |
| stale | still usable; needs verification | reset use counter; change state to delay |
| delay | schedule ARP request; needs verification | reset use counter |
| probe | sending ARP request | reset use counter |
| incomplete | first ARP request sent | send ARP request |
| failed | no response received | send ARP request |
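The table can be restated as a small lookup structure, which makes the "action if used" column easier to reason about. This is a direct transcription of Table 2.1 and not a model of the kernel's neighbor state machine; where the table records no state change, the sketch simply leaves the state unchanged.

```python
# state: (meaning, state after the entry is used, ARP request sent on use)
NEIGHBOR_STATES = {
    "permanent":  ("never expires; never verified", "permanent", False),
    "noarp":      ("normal expiration; never verified", "noarp", False),
    "reachable":  ("normal expiration", "reachable", False),
    "stale":      ("still usable; needs verification", "delay", False),
    "delay":      ("schedule ARP request; needs verification", "delay", False),
    "probe":      ("sending ARP request", "probe", False),
    "incomplete": ("first ARP request sent", "incomplete", True),
    "failed":     ("no response received", "failed", True),
}


def use_entry(state: str):
    """Return (next_state, sends_arp_request) when an entry is used.

    Only the stale -> delay transition is listed in the table; every
    other state is kept as-is because the table names no change.
    """
    _meaning, next_state, sends_request = NEIGHBOR_STATES[state]
    return next_state, sends_request


for name in NEIGHBOR_STATES:
    print(name, "->", use_entry(name))
```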
To resume, a host (192.168.99.7) in tristan's ARP cache on the
example network has just
been disconnected. There is a series of events which
will occur as tristan's ARP cache entry for 192.168.99.7 expires and
gets scheduled for verification. Imagine that the following commands
are run to capture each of these states immediately before state
change.
Example 2.6. ARP cache timeout
The remaining neighbor table flags are visible when initial ARP
requests are made. If no ARP cache entry exists for a requested
destination IP, the kernel will generate
mcast_solicit ARP requests until receiving an
answer.
During this discovery period, the ARP cache
entry will be listed in an incomplete state. If
the lookup does not succeed after the specified number of ARP
requests, the ARP cache entry will be listed in a
failed state. If the lookup does succeed, the
kernel enters the response into the ARP cache and resets the
confirmation and update timers.
For machines not using a static mapping for link layer and IP addresses, ARP provides on demand mappings. The remainder of this section will cover the methods available under linux to control the address resolution protocol.
Complete ARP suppression is not difficult at all. ARP suppression can be accomplished under linux on a per-interface basis by setting the noarp flag on any Ethernet interface. Disabling ARP will require static neighbor table mappings for all hosts wishing to exchange packets across the Ethernet.
To suppress ARP on an interface simply use ip link set dev $DEV arp off as in Example B.7, “Using ip link set to change device flags” or ifconfig $DEV -arp as in Example C.5, “Setting interface flags with ifconfig”. Complete ARP suppression will prevent the host from sending any ARP requests or responding with any ARP replies.
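Because every peer then needs a static neighbor table entry, it can help to generate the commands from a single mapping. The sketch below only builds command strings and runs nothing. The function name and the sample mapping are hypothetical, and it assumes the ip neighbor add ... nud permanent form from iproute2.

```python
from typing import Dict, List


def static_neighbor_commands(device: str, mappings: Dict[str, str]) -> List[str]:
    """Return the commands to disable ARP and install static mappings.

    mappings maps an IP address to a link layer address.
    """
    commands = [f"ip link set dev {device} arp off"]
    for ip, mac in sorted(mappings.items()):
        commands.append(
            f"ip neighbor add {ip} lladdr {mac} dev {device} nud permanent"
        )
    return commands


# Hypothetical peers on the segment; both addresses are assumptions.
peers = {
    "192.168.99.254": "00:80:c8:f8:5c:73",
    "192.168.99.35": "00:80:c8:f8:4a:51",
}
for command in static_neighbor_commands("eth0", peers):
    print(command)
```

Every host on the segment needs the equivalent set of entries for its own peers, or packets to the unlisted hosts cannot be delivered.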
When a linux box is connected to a network segment with multiple network cards, a potential problem with the link layer address to IP address mapping can occur. The machine may respond to ARP requests from both Ethernet interfaces. On the machine creating the ARP request, these multiple answers can cause confusion, or worse yet, non-deterministic population of the ARP cache. Known as ARP flux [13], this can lead to the possibly puzzling effect that an IP migrates non-deterministically through multiple link layer addresses. It's important to understand that ARP flux typically only affects hosts which have multiple physical connections to the same medium or broadcast domain.
This is a simple illustration of the problem in a network where a server has two Ethernet adapters connected to the same media segment. They need not have IP addresses in the same IP network for the ARP reply to be generated by each interface. Note the first two replies received in response to the ARP broadcast request. These replies arrive from conflicting link layer addresses in response to this request. Also notice the greater time required for the sending and receiving hosts to process the broadcast ARP request frames than the unicast frames which follow (probes two and three).
Example 2.7. ARP flux
|
There are four solutions to this problem. The common solution for
kernel 2.4 harnesses the
arp_filter
sysctl, while the common solution for kernel 2.2 takes
advantage of the
hidden
sysctl. These two solutions alter the behaviour of ARP on a
per interface basis and only if the functionality has been enabled.
Alternate solutions which provide much greater control of ARP (possibly documented here at a later date) include Julian Anastasov's ip arp tool and his noarp route flag. While these tools were conceived in the course of the Linux Virtual Server project, they have practical application outside this realm.
One method for preventing ARP flux involves the use of
net/ipv4/conf/$DEV/arp_filter. In
short, the use of arp_filter causes the recipient
(in the
case below,
real-server) to perform a route lookup to
determine the interface through which to send the
reply, instead of the default behaviour
(shown above), replying
from all Ethernet interfaces which receive the request.
The arp_filter solution can have unintended
effects if the only route to the destination
is through one of the network cards. In
Example 2.8, “Correction of ARP flux with
conf/$DEV/arp_filter”, real-client will
demonstrate this. This instructive example should highlight
the shortcomings of the arp_filter solution in
very complex networks where finer-grained control is required.
In general, the arp_filter solution
sufficiently solves the ARP flux problem. First, hosts do not
generate ARP requests for networks to which they do not have a
direct route (see
Section 4.2, “Routing to Locally Connected Networks”) and second, when such a route
exists, the host normally
chooses a source
address in the same network as the destination. So, the
arp_filter solution is a good general solution,
but does not adequately address the occasional need for more control
over ARP requests and replies.
Example 2.8. Correction of ARP flux with
conf/$DEV/arp_filter
|
Set the sysctl variables to enable the
arp_filter functionality. After this,
you might expect that ARP replies for 10.10.20.67 would only
advertise the link layer address on eth0 (00:80:c8:e8:1e:fc).
|
|
Here is the expected behaviour. Only one reply comes in for
the IP 10.10.20.67 after the arp_filter
sysctl has been enabled. The reply originates from the
interface on real-server which actually hosts the IP
address. Note that the source address on the ARP queries is
10.10.20.33, and that the ARP query causes real-server to
perform a route lookup on 10.10.20.33 to choose an interface
from which to send the reply.
|
|
Here, real-client requests the link layer address of the
host 192.168.100.1, but the source IP on the request packet
(chosen according to the
rules for source
address selection) is 10.10.20.33. When
real-server looks up a route to this destination, it
chooses its eth0, and replies with the link layer address of
its eth0. Conventional networking needs should not run
afoul of this oddity of the arp_filter
ARP flux prevention technique.
|
| Remove the entry in the neighbor table before testing again. |
| By adding an IP address in the same network as the intended destination (which would be rather common where multiple IP networks share the same medium or broadcast domain), the kernel can now select a different source address for the ARP request packets. |
|
Note the source address of the ARP queries is now
192.168.100.2. When real-server performs a route lookup
for the 192.168.100.0/24 destination, the chosen path is
through eth1. The ARP reply packets now have the correct
link layer address.
|
In general, the arp_filter solution should
suffice, but this knowledge can be key in determining whether or not
an alternate solution, such as an
ARP filtering solution,
is necessary.
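The decision that arp_filter introduces can be modelled in a few lines. This is a simplified sketch of the behaviour described in this section, not kernel code: the routing table is reduced to two connected routes on real-server, the /24 prefix for 10.10.20.0 is an assumption, and the function names are invented.

```python
import ipaddress

# Simplified routing table for real-server (prefix lengths partly assumed).
ROUTES = [
    (ipaddress.ip_network("10.10.20.0/24"), "eth0"),
    (ipaddress.ip_network("192.168.100.0/24"), "eth1"),
]
INTERFACES = ["eth0", "eth1"]    # both attached to the same medium


def route_lookup(address: str):
    """Return the device of the longest-prefix route, or None."""
    addr = ipaddress.ip_address(address)
    matches = [(net.prefixlen, dev) for net, dev in ROUTES if addr in net]
    return max(matches)[1] if matches else None


def replying_interfaces(requester_ip: str, arp_filter: bool):
    """List the interfaces which answer an ARP request from requester_ip."""
    if not arp_filter:
        return list(INTERFACES)  # default: every receiving interface replies
    chosen = route_lookup(requester_ip)
    return [dev for dev in INTERFACES if dev == chosen]


print(replying_interfaces("10.10.20.33", arp_filter=False))
print(replying_interfaces("10.10.20.33", arp_filter=True))
print(replying_interfaces("192.168.100.2", arp_filter=True))
```

Note that the requested IP plays no part in the choice; only the source address of the request does, which is the oddity the example highlights.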
The ARP flux problem can also be combatted with a kernel patch by Julian Anastasov, which was incorporated into the 2.2.14+ kernel series, but never into the 2.4+ kernel series. Therefore, the functionality may not be available in all kernels.
The sysctl net/ipv4/conf/$DEV/hidden toggles
the generation of ARP replies for requested IPs. It marks an
interface and all of its IP addresses invisible to other
interfaces for the purpose of ARP
requests. By default, when an ARP request arrives on any interface, the kernel
tests to see if the IP address is locally hosted anywhere on the
machine. If the IP is found on any interface, the kernel will
generate a reply.
Since this is not always desirable, the hidden
sysctl can be employed. This prevents the kernel from finding the
IP address when testing to see what IP addresses are locally hosted.
The kernel can always find IPs hosted on the interface on which the
packet arrived, but it cannot find addresses which are
hidden.
As shown in
Example 2.9, “Correction of ARP flux with
net/$DEV/hidden”, not only can ARP flux be
corrected, but sensitive information about the IP addresses
available on a linux box can be safeguarded
[14].
This makes the hidden sysctl useful for
preventing unwanted IP disclosure via ARP on multi-homed hosts,
in addition to preventing ARP flux on hosts connected to the
same network medium.
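The rule stated above can be written as a short predicate. This follows the wording of this section only and has not been checked against the patch itself; the interface names, addresses and function name are hypothetical, chosen to resemble the masquerading firewall case.

```python
from typing import Dict, Set


def answers_arp(requested_ip: str, arrival_dev: str,
                addresses: Dict[str, Set[str]], hidden: Set[str]) -> bool:
    """Decide whether the host replies to an ARP request.

    addresses maps each device to the IPs it hosts; hidden is the set
    of devices with the hidden sysctl enabled. An IP is found if it
    lives on the arrival device or on any device which is not hidden.
    """
    for dev, ips in addresses.items():
        if requested_ip not in ips:
            continue
        if dev == arrival_dev or dev not in hidden:
            return True
    return False


# Hypothetical firewall: eth0 is public, eth1 is internal.
hosted = {"eth0": {"205.254.211.179"}, "eth1": {"192.168.99.254"}}

print(answers_arp("192.168.99.254", "eth0", hosted, hidden=set()))
print(answers_arp("192.168.99.254", "eth0", hosted, hidden={"eth1"}))
print(answers_arp("192.168.99.254", "eth1", hosted, hidden={"eth1"}))
```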
Example 2.9. Correction of ARP flux with
net/$DEV/hidden
|
These are two examples of methods to prevent ARP flux. Other alternatives for correcting this problem are documented in Section 2.3, “ARP filtering”, where much more sophisticated tools are available for manipulation and control over the ARP functions of linux.
[9] Some networking equipment vendors have built devices which are sold as high performance switches and are capable of performing operations on higher layer contents of Ethernet frames. Typically, however, a switching device is not capable of operating on IP packets.
[10] The kernel uses the Ethernet broadcast address configured on the link layer device. This is rarely anything but ff:ff:ff:ff:ff:ff. In the extraordinary event that this is not the Ethernet broadcast address in your network, see Section B.3.7, “Changing hardware or Ethernet broadcast address with ip link set”.
[11] tcpdump is one of a number of utilities for watching packets visible to an interface. For further introduction to tcpdump, see Section G.5, “tcpdump”.
[12] I have repeatedly tested using arping in gratuitous ARP mode, and have found that linux kernels appear to respect gratuitous ARP. This is a surprise. Does anybody have ideas about this? Must research!
[13] I have seen it called names other than ARP flux--anybody out there heard of this called anything besides ARP flux?
[14] Consider a masquerading firewall which answers ARP requests on a public segment for IPs hosted on an internal interface. This amounts to inadvertent exposure of internal addressing, and can be used by an attacker as part of a data-gathering or reconnaissance operation on a network.
Occasionally, an IP network must be split into separate segments. Proxy ARP can be used for increased control over packets exchanged between two hosts or to limit exposure between two hosts in a single IP network. The technique of proxy ARP is commonly used to interpose a device with higher layer functionality between two other hosts. From a practical standpoint, there is little difference between the functions of a packet-filtering bridge and a firewall performing proxy ARP. The manner by which the interposed device receives the packets, however, is tremendously different.
The device performing proxy ARP (masq-gw) responds for all ARP queries
on behalf of IPs reachable on interfaces other than the interface on
which the query arrives.
FIXME; manual proxy ARP (see also
Section 9.3, “Breaking a network in two with proxy ARP”), kernel proxy ARP, and the newly
supported sysctl net/ipv4/conf/$DEV/medium_id.
For a brief description of the use of medium_id, see Julian's remarks.
FIXME; Kernel proxy ARP with the sysctl
net/ipv4/conf/$DEV/proxy_arp.
Note....until this section is written, this post by Don Cohen is rather instructive.
This section should be part of the "ghetto" which will include documentation on ip arp. There's nothing more to add here at the moment (low priority).
|
The ip arp tool. Patches and code for the noarp route flag.
FIXME; add a few paragraphs on ip arp and the noarp flag.
Virtual LANs are a way to take a single switch and subdivide it into logical media segments. A single switch port in a VLAN-capable switch can carry packets from multiple virtual LANs and linux can understand the format of these Ethernet frames. For more on this, see the linux 802.1q VLAN implementation site.
Kernels in the late 2.4 series have support for VLAN incorporated into the stock release. The vconfig tool, however, needs to be compiled against the kernel source in order to provide userland configurability of the kernel support for VLANs.
There are a few items of note which may prevent quick adoption of VLAN support under linux. Ben McKeegan wrote a good summary of the MTU/MRU issues involved with VLANs and 10/100 Ethernet. Gigabit Ethernet drivers are not hamstrung with this problem. Consider using gigabit Ethernet cards from the outset to avoid these potential problems.
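The size issue is easier to see with the tag laid out in bytes: 802.1q inserts a four-byte tag after the source link layer address, so a tagged frame is four bytes longer than the frame it carries. The sketch below shows that layout; the helper names and sample frame are invented for illustration, and linking this directly to the problems in the summary mentioned above is an assumption.

```python
import struct

TPID_8021Q = 0x8100              # tag protocol identifier for 802.1q


def vlan_tag(vlan_id: int, priority: int = 0) -> bytes:
    """Build the four-byte 802.1q tag for a VLAN ID and priority."""
    if not 1 <= vlan_id <= 4094:     # 0 and 4095 are reserved values
        raise ValueError("VLAN ID must be between 1 and 4094")
    if not 0 <= priority <= 7:
        raise ValueError("priority must be between 0 and 7")
    tci = (priority << 13) | vlan_id     # 3 bits priority, 12 bits VLAN ID
    return struct.pack("!HH", TPID_8021Q, tci)


def tag_frame(frame: bytes, vlan_id: int) -> bytes:
    """Insert the tag after the destination and source addresses."""
    return frame[:12] + vlan_tag(vlan_id) + frame[12:]


# A dummy untagged frame: 12 address bytes, EtherType IPv4, 46 bytes payload.
untagged = bytes(12) + b"\x08\x00" + bytes(46)
tagged = tag_frame(untagged, 7)
print(len(untagged), len(tagged), tagged[12:16].hex())
```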
Example 2.11. Bringing up a VLAN interface
|
Each interface defined using the vconfig utility takes its name from the base device to which it has been bound, and appends the VLAN tag ID, as shown in Example 2.11, “Bringing up a VLAN interface”.
This documentation is sparse. Visit the main site and the VLAN mailing list archives.
Networking vendors have long offered a functionality for aggregating bandwidth across multiple physical links to a switch. This allows a machine (frequently a server) to treat multiple physical connections to switch units as a single logical link. The standard moniker for this technology is IEEE 802.3ad, although it is known by the common names of trunking, port trunking and link aggregation. The conventional use of bonding under linux is an implementation of this link aggregation.
A separate use of the same driver allows the kernel to present a single logical interface for two physical links to two separate switches. Only one link is used at any given time. By using media independent interface signal failure to detect when a switch or link becomes unusable, the kernel can, transparently to userspace and application layer services, fail over to the backup physical connection. Though not common, the failure of switches, network interfaces, and cables can cause outages. As a component of high availability planning, these bonding techniques can help reduce the number of single points of failure.
For more information on bonding, see the
Documentation/networking/bonding.txt from the linux
source code tree.
Bonding for link aggregation must be supported by both endpoints. Two linux machines connected via crossover cables can take advantage of link aggregation. A single machine connected with two physical cables to a switch which supports port trunking can use link aggregation to the switch. Any conventional switch will become ineffably confused by a hardware address appearing on multiple ports simultaneously.
Example 2.12. Link aggregation bonding
|
FIXME; Need an experiment here....maybe a tcpdump to show how the management frames appear on the wire.
This Beowulf software page describes in a bit more detail the rationale and a practical application of linux channel bonding (for link aggregation).
Bonding support under linux is part of a high availability solution. For an entry point into the complexity of high availability in conjunction with linux, see the linux-ha.org site. To guard against layer two (switch) and layer one (cable) failure, a machine can be configured with multiple physical connections to separate switch devices while presenting a single logical interface to userspace.
The name of the interface can be specified by the user. It is
commonly bond0 or something similar. As a
logical interface, it can be used in routing tables and by
tcpdump.
The bond interface, when created, has no link layer address. In the example below, an address is manually added to the interface. See Example 2.12, “Link aggregation bonding” for an example of the bonding driver reporting that it has set the link layer address when the first device is enslaved to the bond (doesn't that sound cruel!).
Example 2.13. High availability bonding
|
Immediately noticeable, there is a new flag in the ip link
show output. The MASTER and
SLAVE flags clearly report the nature of the
relationship between the interfaces. Also, the Ethernet interfaces
indicate the master interface via the keywords master
bond0.
Note also, that all three of the interfaces share the same link layer
address, 00:80:c8:e7:ab:5c.
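These flags and the master keyword are easy to pick out programmatically. The parser below is a sketch which assumes the usual one-line summary format printed by ip link show; the two sample lines are hypothetical and are not taken from the example above.

```python
import re

LINK_LINE = re.compile(
    r"^\d+:\s+(?P<name>[^:@]+)(?:@\S+)?:\s+<(?P<flags>[^>]*)>(?P<rest>.*)$"
)


def parse_link_line(line: str):
    """Return (name, set of flags, master device or None)."""
    match = LINK_LINE.match(line.strip())
    if match is None:
        raise ValueError("not an interface summary line")
    flags = set(filter(None, match.group("flags").split(",")))
    master = re.search(r"\bmaster (\S+)", match.group("rest"))
    return match.group("name"), flags, master.group(1) if master else None


# Hypothetical output lines in the style of ip link show.
samples = [
    "2: eth0: <BROADCAST,MULTICAST,SLAVE,UP> mtu 1500 qdisc pfifo_fast master bond0 qlen 100",
    "4: bond0: <BROADCAST,MULTICAST,MASTER,UP> mtu 1500 qdisc noqueue",
]
for sample in samples:
    print(parse_link_line(sample))
```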
FIXME; What do DEBUG,AUTOMEDIA,PORTSEL,NOTRAILERS mean?
Bridging, once the realm of hardware devices, can also be performed by a linux machine. Along with bridging comes the capability of filtering and transforming frames (or even higher layer protocols) via hooks at the Ethernet layer with the ebtables and iptables commands.
Linux can function as a bridge, the equivalent of an extremely power-thirsty switch. For now, the best place to go is the main linux bridging site.
Often ebtables and bridging are used together.
In order to take advantage of ebtables the machine needs to be running as a bridge. (Accurate, isn't it?)
If you believe in really scary stuff, you can run the bridging code with netfilter, so you can manipulate IP packets transparently on your bridge. For more on this, see the documentation of bridging and firewalling. The firewall and bridge architecture is part of the development branch of the kernel 2.5 series.
Routing is fundamental to the design of the Internet Protocol. IP routing has been cleverly designed to minimize the complexity for leaf nodes and networks. Linux can be used as a leaf node, such as a workstation, where setting the IP address, netmask and default gateway suffices for all routing needs. Alternatively, the same routing subsystem can be used in the core of a network connecting multiple public and private networks.
This chapter will begin with the basics of IP routing with linux, routing to locally connected destinations, routing to destinations through the default gateway, and using linux as a router. Subsequent topics will include the kernel's route selection algorithm, the routing cache, routing tables, the routing policy database, and issues with ICMP and routing.
The precinct of this documentation is primarily static routing. Though dynamic routing is important to large networks, Internet service providers, and backbone providers, this documentation is targeted at smaller networks, particularly networks which use static routing. Nonetheless, the concepts governing the manipulation of a packet in the kernel, and how routing decisions are made by the kernel, are applicable to dynamic routing environments.
The linux routing subsystem has been designed with large scale networks in mind, without forgetting the need for easy configurability for leaf nodes, such as workstations and servers.
The design of IP routing allows for very simple route definitions for small networks, while not hindering the flexibility of routing in complex environments. A key concept in IP routing is the ability to define what addresses are locally reachable as opposed to not directly known destinations. Every IP capable host knows about at least three classes of destination: itself, locally connected computers and everywhere else.
Most fully-featured IP-aware networked operating systems (all unix-like operating systems with IP stacks, modern Macintoshes, and modern Windows) include support for the loopback device and IP. This is an IP and range configured on the host machine itself which allows the machine to talk to itself. Linux systems can communicate over IP on any locally configured IP address, whether on the loopback device or not. This is the first class of destinations: locally hosted addresses.
The second class of IP addresses are addresses in the locally connected network segment. Each machine with a connection to an IP network can reach a subset of the entire IP address space on its directly connected network interface.
All other hosts or destination IPs fall into a third range. Any IP which is not on the machine itself or locally reachable (i.e. connected to the same media segment) is only reachable through an IP routing device. This routing device must have an IP address in a locally reachable IP address range.
All IP networking is a permutation of these three fundamental concepts of reachability. This list summarizes the three possible classifications for reachability of destination IP addresses from any single source machine.
1. The IP address is reachable on the machine itself. Under linux this is considered scope host and is used for IPs bound to any network device including loopback devices, and the network range for the loopback device. Addresses of this nature are called local IPs or locally hosted IPs.
2. The IP address is reachable on the directly connected link layer medium. Addresses of this type are called locally reachable or (preferred) directly reachable IPs.
3. The IP address is ultimately reachable through a router which is reachable on a directly connected link layer medium. This class of IP addresses is only reachable through a gateway.
As a practical description of the above, this partial diagram of the
example network shows two
machines connected to 192.168.99.0/24. On tristan the IP addresses
127.0.0.1 (loopback--not pictured) and 192.168.99.35 are considered
locally hosted IP addresses. The directly reachable IP addresses fall
inside the 192.168.99.0/24 network. Any other destination addresses are
only reachable through a gateway, probably masq-gw.
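The three classifications can be expressed as a short function from tristan's point of view. This is an illustration of the concept, not of the kernel's lookup code; the addresses come from the description above and the function name is hypothetical.

```python
import ipaddress

# tristan's view of the world, taken from the description above.
LOCAL_ADDRESSES = {
    ipaddress.ip_address("127.0.0.1"),
    ipaddress.ip_address("192.168.99.35"),
}
LOOPBACK_NETWORK = ipaddress.ip_network("127.0.0.0/8")
CONNECTED_NETWORKS = [ipaddress.ip_network("192.168.99.0/24")]


def classify(destination: str) -> str:
    """Classify a destination IP by how tristan can reach it."""
    addr = ipaddress.ip_address(destination)
    if addr in LOCAL_ADDRESSES or addr in LOOPBACK_NETWORK:
        return "locally hosted"
    if any(addr in network for network in CONNECTED_NETWORKS):
        return "directly reachable"
    return "reachable through a gateway"


for ip in ("192.168.99.35", "127.0.0.1", "192.168.99.254", "205.254.210.186"):
    print(ip, "is", classify(ip))
```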
Before examining the routing system in more detail, there are some terms to identify and define. These terms are general IP networking terms and should be familiar to users who have used IP on other operating systems and networking equipment.
octet
A single number between decimal 0 and 255, hexadecimal 0x00 and 0xff. An octet is a single byte in size.
Examples: 140, 254, 255, 1, 0, 7.
IP address
A locally unique four octet logical identifier which a machine can use to communicate using the Internet Protocol. This address is determined by combining the network address and the administratively assigned host address. Simply put, the IP address is a unique number identifying a host on a network.
Examples: 192.168.99.35, 140.71.38.7, 205.254.210.186.
host address
The rightmost bits (frequently octets) in an IP address which are not a part of the network address. The part of an IP address which identifies the computer on a network independent of the network.
Examples: 192.168.1.27/24, 10.10.17.24/8, 172.20.158.75/16.
network address
A four octet address and network mask identifying the usable range of IP addresses. Conventional and CIDR notations combine the four bare octets with the netmask or prefix length to define this address. Briefly, a network address is the first address in a range, and is reserved to identify the entire network. [15]
Examples: 192.168.187.0/24, 205.254.211.192/26, 4.20.17.128/255.255.255.248, 10.0.0.0/255.0.0.0, 12.35.17.112/28.
network mask (netmask)
A four octet set of bits which, when AND'd with a particular IP address, produces the network address. Combined with a network address or IP address, the netmask identifies the range of IP addresses which are directly reachable.
Examples: 255.255.255.0, 255.255.0.0, 255.255.192.0, 255.255.255.224, 255.0.0.0.
prefix length
An alternate representation of network mask, this is a single integer between 0 and 32, identifying the number of significant bits in an IP address or network address. This is the "slash-number" component of a CIDR address.
Examples: 4.20.17.0/24, 66.14.17.116/30, 10.158.42.72/29, 10.48.7.198/9, 192.168.154.64/26.
broadcast address
A four octet address derived from an OR operation between the host address portion of a network address and the full broadcast special 255.255.255.255. The broadcast is the highest allowable address in a given network, and is reserved for broadcast traffic.
Examples: 192.168.205.255/24, 172.18.255.255/16, 12.7.149.63/26.
These definitions are common to IP networking in general, and are understood by all in the IP networking community. For less terse introductory material on matters of IP network addressing in general, see Section I.1.3, “General IP Networking Resources”.
As is apparent from the interdependencies amongst the above definitions, each term defines a separate part of the concept of the relationships between an IP address and its network. A good IP calculator can assist in mastering these IP fundamentals.
Example 4.2. Using ipcalc to display IP information
|
A tool similar to the one shown in Example 4.2, “Using ipcalc to display IP information” can assist in visualizing the relationships among IP addressing concepts.
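The same relationships can be worked through with plain bitwise arithmetic, which is all an IP calculator does. The sketch below uses Python's standard ipaddress module for parsing and applies the AND and OR operations from the definitions above; the function name and the dictionary keys are arbitrary choices.

```python
import ipaddress


def describe(cidr: str) -> dict:
    """Derive the addressing terms defined above from one CIDR address."""
    interface = ipaddress.ip_interface(cidr)
    ip = int(interface.ip)
    netmask = int(interface.netmask)
    hostmask = netmask ^ 0xFFFFFFFF          # the bits outside the netmask
    network = ip & netmask                   # IP AND netmask
    broadcast = network | hostmask           # host bits all set
    return {
        "netmask": str(ipaddress.ip_address(netmask)),
        "prefix length": interface.network.prefixlen,
        "network address": str(ipaddress.ip_address(network)),
        "broadcast address": str(ipaddress.ip_address(broadcast)),
        "host address": ip & hostmask,       # host portion as an integer
    }


for key, value in describe("192.168.99.35/24").items():
    print(f"{key:18} {value}")
```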
Subsequently, this chapter will introduce some concrete examples of routing in a real network. The example network illustrates this network and all of the addresses involved.
[15] At least one reader (CAO) has pointed out to me that there is ambiguity in the meaning and common usage of the term network address. While occasionally used to refer to a single IP address at the top of a range of addresses, the primary meaning requires the implicit network mask.
Historically, this term has always meant the IP address at the top of a range AND the netmask identifying the set of available addresses. Without this latter piece of information, the network address is simply an IP address.
Technically, the use of this term to mean a single IP at the top of the range is incorrect, although not uncommon.
Any IP network is defined by two sets of numbers: network address and netmask. By convention, there are two ways to represent these two numbers. Netmask notation is the convention and tradition in IP networking although the more succinct CIDR notation is gaining popularity.
In the
example network, isolde has
IP address 192.168.100.17.
In CIDR notation, isolde's address is 192.168.100.17/24, and in
traditional netmask notation, 192.168.100.17/255.255.255.0.
Any of the
IP calculators confirms that the
first usable IP address is 192.168.100.1 and the last usable IP address
is 192.168.100.254.
Importantly, the IP network address, 192.168.100.0/24, is reachable
through the directly connected Ethernet interface (refer to
classification 2).
Therefore, isolde should be able to reach any IP address in
this range directly on the locally connected Ethernet segment.
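The two notations, and the test for a directly reachable destination, can be sketched in Python with the standard ipaddress module. This only illustrates the arithmetic and says nothing about how the kernel stores addresses. The helper name on_link is hypothetical, 192.168.100.32 stands in for any neighbour on the segment, and 172.18.3.9 is an arbitrary address outside it.

```python
import ipaddress

# isolde's address, written in traditional netmask notation
isolde = ipaddress.ip_interface("192.168.100.17/255.255.255.0")

print(isolde.with_prefixlen)   # the same address in CIDR notation
print(isolde.with_netmask)     # the same address in netmask notation
print(isolde.network)          # the locally connected IP network

def on_link(destination, interface):
    """True if destination falls inside the interface's own IP network."""
    return ipaddress.ip_address(destination) in interface.network

print(on_link("192.168.100.32", isolde))   # a neighbour on the segment
print(on_link("172.18.3.9", isolde))       # must be reached via a router
```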
Below is the routing table for isolde, first shown with the
conventional route -n output [16] and then with the
ip route show [17] command. Each of these tools displays, and
operates on, the same kernel routing table.
For more on the routing table displayed in
Example 4.3, “Identifying the locally connected networks with
route”, consult
Section 4.8.3, “The Main Routing Table”.
Example 4.3. Identifying the locally connected networks with route
In the above example, the locally reachable destination is
192.168.100.0/255.255.255.0 which can also be written 192.168.100.0/24
as in ip route show. In classful networking
terms, the network to which isolde is directly connected is called a
class C sized network.
When a process on isolde needs to send a packet to another
machine on the locally connected network, packets will be sent from
192.168.100.17 (isolde's IP). The kernel will consult
the routing table to determine the route and the source address to use
when sending this packet.
Assuming the destination is 192.168.100.32, the kernel will find that
192.168.100.32 falls inside the IP address range 192.168.100.0/24 and
will select this route for the outbound packet. For further details on
source address selection, see
Section 4.6, “Source Address Selection”. The source address on the
outbound packet conveys vital information to the host receiving the
packet. In order for the packet to be able to return, isolde has to
use an IP address that is locally available, 192.168.100.32 has to have
a route to isolde, and neither host may block the packet.
The packet will be sent to the locally connected network segment
directly, because isolde interprets from the routing table
that 192.168.100.32 is directly reachable through the physical network
connection on eth0.
Occasionally, a machine will be directly connected to two different IP networks on the same device. The routing table will show that both networks are reachable through the same physical device. For more on this topic, see Section 9.2, “Multiple IP Networks on one Ethernet Segment”. Similarly, multi-homed hosts will have routes for all locally connected networks through the locally-connected network interface. For more on this sort of configuration, see Section 9.6, “Multihomed Hosts”.
This covers the classification of IP destinations which are available on a locally connected network. This highlights the importance of an accurate netmask and network address. The next section will cover IP ranges which are neither locally hosted nor fall in the range of the locally reachable networks. These destinations must be reached through a router.
[16] The route -n output can also be produced with netstat -rn and is commonly used by administrators who rely on platform independent behaviour across heterogeneous Unix and Unix-like systems. This traditional routing table output uses conventional netmask notation to denote network size.
By comparison to the total number of publicly accessible hosts on the Internet there is an almost insignificant number of hosts inside any locally reachable network. This means that the majority of potential destinations are only available via a router.
Any machine which will accept and forward packets between two networks is a router. Every router is at least dual-homed; one interface connects to one network, and a second interface connects to another network. This interface is frequently an independent NIC, although it might be a virtual interface, such as a VLAN interface. Machines connected to either network learn by a routing protocol or are statically configured to pass traffic for the other network to the router.
For tristan, there are two different paths out of 192.168.99.0/24.
One path has another leaf network, 192.168.98.0/24, and the other path
has many networks, including the Internet. The routing table on
tristan should then contain two different routes out of the network.
One destination, 192.168.98.0/24, will be reachable through 192.168.99.1.
So, if tristan has a packet with a destination IP address in the range
of the branch office network, it will choose to send the packet directly
to isdn-router.
The default route is another way to say the route for destination 0/0. This is the most general possible route. It is the catch-all route. If no more specific route exists in a routing table, a default route will be used. Many servers and workstations are connected to leaf networks with only one router, hence Example 4.3, “Identifying the locally connected networks with route” shows a very common sort of routing table. There's a route for localhost, for the locally connected IP network, and a default route.
For Internet-connected hosts, the default route is customarily set to the IP of the locally reachable router which has a path to the Internet. Each router in turn has a default gateway pointing to another Internet-connected router until the packet is handed off to an Internet Service Provider's network.
Operating as a router allows a linux machine to accept packets on one interface and transmit them on another. This is the nature of a router. The process of accepting and transmitting IP packets is known as forwarding. IP forwarding is a requirement for many of the networking techniques identified here. Stateless NAT and firewalling, transparent proxying and masquerading all require the support of IP forwarding in order to function correctly.
The sysctl net/ipv4/ip_forward toggles the IP
forwarding functionality on a linux box. Note that setting this sysctl
alters other routing-related sysctl entries, so it is wise to set this
first, and then alter other entries.
Frequently, an administrator will forget this simple and crucial detail
when configuring a new machine to operate as a router only to be
frustrated at the simple error.
The sysctl net/ipv4/conf/$DEV/forwarding defaults to
the value of net/ipv4/ip_forward, but can be
independently modified. In order to allow forwarding of packets between
two interfaces while prohibiting such behaviour on a third interface,
this sysctl can be employed.
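When a new router refuses to forward, checking these sysctls is a sensible first step. Here is a minimal Python sketch which reads them through /proc/sys. The function name forwarding_enabled and the proc_root parameter are my own inventions for illustration. The sketch assumes that the global toggle lives in the file net/ipv4/ip_forward and that the per-interface file is named forwarding.

```python
from pathlib import Path

def forwarding_enabled(dev=None, proc_root="/proc/sys"):
    """Report whether IP forwarding is switched on.

    With dev=None, read the global net/ipv4/ip_forward toggle;
    otherwise read the per-interface entry under net/ipv4/conf/.
    proc_root is a parameter only so the function can be tried
    against a scratch directory.
    """
    root = Path(proc_root)
    if dev is None:
        path = root / "net" / "ipv4" / "ip_forward"
    else:
        path = root / "net" / "ipv4" / "conf" / dev / "forwarding"
    return path.read_text().strip() == "1"

if __name__ == "__main__":
    try:
        print("global forwarding:", forwarding_enabled())
    except OSError:
        print("cannot read /proc/sys (not a linux machine?)")
```

Reading is harmless. Writing to these files requires root, and the global toggle should be set before any per-interface values are adjusted.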
Crucial to the proper ability of hosts to exchange IP packets is the correct selection of a route to the destination. The rules for the selection of route path are traditionally made on a hop-by-hop basis [18] based solely upon the destination address of the packet. Linux behaves as a conventional routing device in this way, but can also provide a more flexible capability. Routes can be chosen and prioritized based on other packet characteristics.
The route selection algorithm under linux has been generalized to enable the powerful latter scenario without complicating the overwhelmingly common case of the former scenario.
The above sections on routing to a local network and the default gateway expose the importance of destination address for route selection. In this simplified model, the kernel need only know the destination address of the packet, which it compares against the routing tables to determine the route by which to send the packet.
The kernel searches for a matching entry for the destination first in the routing cache and then the main routing table. In the case that the machine has recently transmitted a packet to the destination address, the routing cache will contain an entry for the destination. The kernel will select the same route, and transmit the packet accordingly.
If the linux machine has not recently transmitted a packet to this destination address, it will look up the destination in its routing table using a technique known as longest prefix match [19]. In practical terms, the concept of longest prefix match means that the most specific route to the destination will be chosen.
The use of the longest prefix match allows routes for large networks to be overridden by more specific host or network routes, as required in Example 1.10, “Removing a static network route and adding a static host route”, for example. Conversely, it is this same property of longest prefix match which allows routes to individual destinations to be aggregated into larger network addresses. Instead of entering individual routes for each host, large numbers of contiguous network addresses can be aggregated. This is the realized promise of CIDR networking. See Section I.1.3, “General IP Networking Resources” for further details.
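Longest prefix match is simple enough to sketch in a few lines of Python. This is a naive linear scan for illustration and makes no claim about the data structures the kernel actually uses. The routing table below is hypothetical: only the route to 192.168.98.0/24 via 192.168.99.1 comes from the example network, while the default gateway address and the host route are invented.

```python
import ipaddress

def longest_prefix_match(destination, routes):
    """Return the (network, nexthop) pair with the longest matching prefix.

    routes is a list of (prefix, nexthop) string pairs. Returns None
    when no route, not even a default route, covers the destination.
    """
    dest = ipaddress.ip_address(destination)
    best = None
    for prefix, nexthop in routes:
        net = ipaddress.ip_network(prefix)
        if dest in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, nexthop)
    return best

# Hypothetical routing table, loosely modelled on tristan
routes = [
    ("0.0.0.0/0",        "via 192.168.99.254"),  # default route (invented gateway)
    ("192.168.99.0/24",  "dev eth0"),            # locally connected network
    ("192.168.98.0/24",  "via 192.168.99.1"),    # branch office network
    ("192.168.98.42/32", "via 192.168.99.2"),    # invented host route
]

for dst in ("192.168.98.7", "192.168.98.42", "192.168.99.35", "8.8.8.8"):
    net, nexthop = longest_prefix_match(dst, routes)
    print(f"{dst:>15} -> {net} {nexthop}")
```

The host route overrides the network route for 192.168.98.42, and the default route catches everything else, exactly as the description of the default route as 0/0 suggests.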
In the common case, route selection is based completely on the destination address. Conventional (as opposed to policy-based) IP networking relies on only the destination address to select a route for a packet.
Because the majority of linux systems have no need of policy based routing features, they use the conventional routing technique of longest prefix match. While this meets a large subset of linux networking needs, the policy routing features of a machine operating in this fashion go unused.
With the prevalence of low cost bandwidth, easily configured VPN tunnels, and increasing reliance on networks, the technique of selecting a route based solely on the destination IP address range no longer suffices for all situations. The discussion of the common case of route selection under linux neglects one of the most powerful features in the linux IP stack. Since kernel 2.2, linux has supported policy based routing through the use of multiple routing tables and the routing policy database (RPDB). Together, they allow a network administrator to configure a machine to select different routing tables and routes based on a number of criteria.
Selectors available for use in policy-based routing are attributes of a packet passing through the linux routing code. The source address of a packet, the ToS flags, an fwmark (a mark carried through the kernel in the data structure representing the packet), and the interface name on which the packet was received are attributes which can be used as selectors. By selecting a routing table based on packet attributes, an administrator can have granular control over the network path of any packet.
With this knowledge of the RPDB and multiple routing tables, let's revisit in detail the method by which the kernel selects the proper route for a packet. Understanding the series of steps the kernel takes for route selection should demystify advanced routing. In fact, advanced routing could more accurately be called policy-based networking.
When determining the route by which to send a packet, the kernel always consults the routing cache first. The routing cache is a hash table used for quick access to recently used routes. If the kernel finds an entry in the routing cache, the corresponding entry will be used. If there is no entry in the routing cache, the kernel begins the process of route selection. For details on the method of matching a route in the routing cache, see Section 4.7, “Routing Cache”.
The kernel begins iterating by priority through the routing policy database. For each matching entry in the RPDB, the kernel will try to find a matching route to the destination IP address in the specified routing table using the aforementioned longest prefix match selection algorithm. When a matching destination is found, the kernel will select the matching route, and forward the packet. If no matching entry is found in the specified routing table, the kernel will pass to the next rule in the RPDB, until it finds a match or falls through the end of the RPDB and all consulted routing tables.
Here is a snippet of python-esque pseudocode to illustrate the kernel's route selection process again. Each of the lookups below occurs in kernel hash tables which are accessible to the user through the use of various iproute2 tools.
Example 4.4. Routing Selection Algorithm in Pseudo-code
# first, try the routing cache
if packet.routeCacheLookupKey in routeCache :
    route = routeCache[ packet.routeCacheLookupKey ]
else :
    # otherwise, walk the RPDB in priority order
    for rule in rpdb :
        if packet.rpdbLookupKey in rule :
            routeTable = rule[ lookupTable ]
            if packet.routeLookupKey in routeTable :
                route = routeTable[ packet.routeLookupKey ]
                break    # the first matching route wins
This pseudocode provides some explanation of the decisions required to find a route. The final piece of information required to understand the decision making process is the lookup process for each of the three hash table lookups. In Table 4.1, “Keys used for hash table lookups during route selection”, each key is listed in order of importance. Optional keys are listed in italics and represent keys that will be matched if they are present.
Table 4.1. Keys used for hash table lookups during route selection
| route cache | RPDB | route table |
|---|---|---|
| destination | source | destination |
| source | destination | ToS |
| ToS | ToS | scope |
| fwmark | fwmark | oif |
| iif | iif | |
The route cache (also known as the forwarding information base) can be displayed using ip route show cache. The routing policy database (RPDB) can be manipulated with the ip rule utility. Individual route tables can be manipulated and displayed with the ip route command line tool.
Example 4.5. Listing the Routing Policy Database (RPDB)
Observation of the output of ip rule show in Example 4.5, “Listing the Routing Policy Database (RPDB)” on a box whose RPDB has not been changed should reveal a high priority rule, rule 0. This rule, created at RPDB initialization, instructs the kernel to try to find a match for the destination in the local routing table. If there is no match for the packet in the local routing table, then, per rule 32766, the kernel will perform a route lookup in the main routing table. Normally, the main routing table will contain a default route if not a more specific route. Failing a route lookup in the main routing table, the final rule (32767) instructs the kernel to perform a route lookup in table 253, the default routing table.
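The walk through the three default rules can be simulated in runnable Python. This is a toy model under loud assumptions: every rule here matches all packets (as the default rules do), each routing table is a plain dictionary, and the route strings and the gateway 192.168.100.254 are invented. The names lookup and select_route are hypothetical helpers, not kernel or iproute2 functions.

```python
import ipaddress

def lookup(table, destination):
    """Longest prefix match inside one routing table; None if nothing matches."""
    dest = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(prefix), route)
        for prefix, route in table.items()
        if dest in ipaddress.ip_network(prefix)
    ]
    if not matches:
        return None
    return max(matches, key=lambda match: match[0].prefixlen)[1]

def select_route(destination, rpdb, tables):
    """Walk the rules by priority and return the first route found."""
    for priority, table_name in sorted(rpdb):
        route = lookup(tables[table_name], destination)
        if route is not None:
            return priority, table_name, route
    return None   # fell through the end of the RPDB

# The three rules of an unmodified RPDB: (priority, table to consult)
rpdb = [(0, "local"), (32766, "main"), (32767, "default")]

# Invented table contents for a host like isolde
tables = {
    "local":   {"192.168.100.17/32": "local dev eth0",
                "127.0.0.0/8": "local dev lo"},
    "main":    {"192.168.100.0/24": "dev eth0",
                "0.0.0.0/0": "via 192.168.100.254 dev eth0"},
    "default": {},
}

for dst in ("192.168.100.17", "192.168.100.32", "10.1.2.3"):
    print(dst, "->", select_route(dst, rpdb, tables))
```

Removing the default route from the main table in this model makes the lookup for 10.1.2.3 fall through all three rules, which corresponds to a host with no route to the destination.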
A common mistake when working with multiple routing tables involves
forgetting about the statelessness of IP routing. This manifests when
the user configuring the policy routing machine accounts for outbound
packets (via fwmark, or ip rule
selectors), but forgets to account for the return packets.
For more ideas on how to use policy routing, how to work with multiple routing tables, and how to troubleshoot, see Section 10.3, “Using the Routing Policy Database and Multiple Routing Tables”.
Yeah. That's it. So there.
[18] This document could stand to allude to MPLS implementations under linux, for those who want to look at traffic engineering and packet tagging on backbones. This is certainly not in the scope of this chapter, and should be in a separate chapter, which covers developing technologies.
The selection of the correct source address is key to correct communication between hosts with multiple IP addresses. If a host chooses an address from a private network to communicate with a public Internet host, it is likely that the return half of the communication will never arrive.
The initial source address for an outbound packet is chosen according
to the following series of rules. The application can request a
particular IP [20], the kernel will use the src hint from the chosen
route path [21], or, lacking this hint, the kernel will choose the first
address configured on the interface which falls in the same network as
the destination address or the nexthop router.
The following list recapitulates the manner by which the kernel determines the source address of an outbound packet.
The application is already using the socket, in which case, the
source address has been chosen. Also, the application can
specifically request a particular address (not necessarily a
locally hosted IP; see
Section 9.7, “Binding to Non-local Addresses”) using the
bind call.
The kernel performs a
route lookup and finds an
outbound route for the destination. If the route contains the
src parameter, the kernel selects this IP
address for the outbound packet.
Also refer to this excerpt from the iproute2 command reference.
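The order of preference described above can be written down as a small Python function. This is a sketch of the stated rules only, not of the kernel's implementation. The function name choose_source, its parameters, and the second interface address are assumptions made for illustration, and returning None when nothing matches is a simplification of whatever the kernel does in that case.

```python
import ipaddress

def choose_source(dest_or_nexthop, interface_addrs, bound=None, route_src=None):
    """Pick a source address following the three rules in order.

    bound           -- address the application requested with bind(), if any
    route_src       -- src hint on the chosen route, if any
    interface_addrs -- addresses on the outbound interface, in CIDR notation,
                       in the order they were configured
    """
    if bound is not None:          # rule 1: the application already chose
        return bound
    if route_src is not None:      # rule 2: the route carries a src hint
        return route_src
    target = ipaddress.ip_address(dest_or_nexthop)
    for addr in interface_addrs:   # rule 3: first address sharing a network
        iface = ipaddress.ip_interface(addr)
        if target in iface.network:
            return str(iface.ip)
    return None                    # simplification: no candidate found

# isolde's address plus an invented secondary address on the same interface
addrs = ["192.168.100.17/24", "10.10.20.5/24"]
print(choose_source("192.168.100.32", addrs))
print(choose_source("192.168.100.32", addrs, route_src="10.10.20.5"))
```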
[20]
Many networking applications accept a command line option to prefer
a particular source address. The call to select a particular
IP is known as bind(), so the command
line option frequently
contains the word bind, e.g.,
--bind-address.
Examples of command line tools allowing specification of the source
address are nc -s $BINDADDR $DEST $PORT or
socat -
TCP4:$REMOTEHOST:$REMOTEPORT,bind=$BINDADDR.
[21]
In this case, the route has already been selected (see
Section 4.5, “Route Selection”) and the chosen route entry
includes a hint for preferred source address on outbound packets
specifically for this purpose. For examples on configuring the
routing tables to include this parameter, see
Example D.19, “Using src in a routing command with
route add”.
The routing cache is also known as the forwarding information base (FIB). This term may be familiar to users of other routing systems.
The routing cache stores recently used routing entries in a fast and convenient hash lookup table, and is consulted before the routing tables. If the kernel finds a matching entry during route cache lookup, it will forward the packet immediately and stop traversing the routing tables.
Because the routing cache is maintained by the kernel separately from the routing tables, manipulating the routing tables may not have an immediate effect on the kernel's choice of path for a given packet. To avoid a non-deterministic lag between the time that a new route is entered into the kernel routing tables and the time that a new lookup in those route tables is performed, use ip route flush cache. Once the route cache has been emptied, new route lookups (if not by a packet, then manually with ip route get) will result in a new lookup to the kernel routing tables.
The following is a listing of the hash lookup keys in the routing cache and a description of each key. Compare this list with the elements identified in Table 4.1, “Keys used for hash table lookups during route selection”.
The destination IP address of the packet. This is the destination address on the packet at the time of the route lookup. The address is a host address. All 32 bits are significant during this lookup.
The source IP address of the packet. This is the source address on the packet at the time of the route lookup. The address is a host address. All 32 bits are significant during this lookup.
The ToS marking on the packet. If there is no ToS marking on the packet (tos == 0), this lookup key is unused. If there is a ToS marking, the kernel will search for a match with this ToS value. If no matching (dst, src, tos) is found, the kernel will continue the search for a route by traversing the RPDB.
The mark on a packet added administratively by the packet filtering engine (ipchains or iptables). This mark is not part of the physical IP packet, and only exists as part of the data structure held in memory on the routing device to represent the IP packet. If there is no fwmark on the packet, this lookup key is unused. When present, the kernel will search for a matching (dst, src, tos?, fwmark) entry. If no matching entry is found, the kernel will continue the search for a route by traversing the RPDB.
The name of the interface on which the packet arrived.
The following attributes may be stored for each entry in the routing cache.
FIXME. A) I don't know what it is. B) I don't know how to describe it.
FIXME. Gotta find some references to this, too.
Collectively the hash keys uniquely identify routes in the forwarding information base (routing cache) and each entry provides attributes of the route.
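One way to picture the routing cache is as a dictionary keyed on the five lookup keys taken together. The Python below is only a mental model: the names CacheKey, cache_route and cached_route are invented, the packet is represented as a plain dictionary, and the stored route is an arbitrary string. Note that the lookup is an exact match on host addresses, not a prefix match.

```python
from collections import namedtuple

# The five hash keys described above, in order
CacheKey = namedtuple("CacheKey", "dst src tos fwmark iif")

route_cache = {}

def cache_route(packet, route):
    """Remember the route chosen for this exact combination of keys."""
    route_cache[CacheKey(**packet)] = route

def cached_route(packet):
    """Return the cached route, or None to signal a walk through the RPDB."""
    return route_cache.get(CacheKey(**packet))

# A packet as seen by a router (values invented for illustration)
packet = dict(dst="192.168.98.7", src="192.168.99.35", tos=0, fwmark=0, iif="eth0")
cache_route(packet, "via 192.168.99.1 dev eth0")

print(cached_route(packet))                              # cache hit
print(cached_route(dict(packet, src="192.168.99.36")))   # different source: miss
```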
Linux kernel 2.2 and 2.4 support multiple routing tables [22]. Beyond the two commonly used routing tables (the local and main routing tables), the kernel supports up to 252 additional routing tables.
The multiple routing table system provides a flexible infrastructure on top of which to implement policy routing. By allowing multiple traditional routing tables (keyed primarily to destination address) to be combined with the routing policy database (RPDB) (keyed primarily to source address), the kernel supports a well-known and well-understood interface while simultaneously expanding and extending its routing capabilities. Each routing table still operates in the traditional and expected fashion. Linux simply allows you to choose from a number of routing tables, and to traverse routing tables in a user-definable sequence until a matching route is found.
Any given routing table can contain an arbitrary number of entries, each of which is keyed on the following characteristics (cf. Table 4.1, “Keys used for hash table lookups during route selection”):
destination address; a network or host address (primary key)
tos; Type of Service
output interface
For practical purposes, this means that (even) a single routing table can contain multiple routes to the same destination if the ToS differs on each route or if the route applies to a different interface [23].
Kernels supporting multiple routing tables refer to routing tables by
unique integer slots between 0 and 255 [24].
The two routing tables normally employed are table 255, the
local routing table, and table 254, the main routing table. For
examples of using multiple routing tables, see
Chapter 9, Advanced IP Management, in particular,
Example 10.1, “Multiple Outbound Internet links, part I; ip route”,
Example 10.3, “Multiple Outbound Internet links, part III; ip rule” and
Example 10.4, “Multiple Internet links, inbound traffic; using iproute2 only”.
Also be sure to read
Section 10.3, “Using the Routing Policy Database and Multiple Routing Tables” and
Section 4.9, “Routing Policy Database (RPDB)”.
The ip route and ip rule commands
have built in support for the special tables main and local.
Any other routing tables can be referred to by number or an
administratively maintained mapping file,
/etc/iproute2/rt_tables.
The format of this file is extraordinarily simple. Each line represents one mapping of an arbitrary string to an integer. Comments are allowed.
Example 4.6. Typical content of
/etc/iproute2/rt_tables
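Because the format is so simple, a parser fits in a dozen lines of Python. The sketch below assumes each mapping line holds the integer first and the name second, and that a # begins a comment. The function name parse_rt_tables is hypothetical, and the vpn entry in the sample is an invented local addition.

```python
def parse_rt_tables(text):
    """Map table names to integers, ignoring comments and blank lines."""
    tables = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments
        if not line:
            continue
        number, name = line.split()[:2]
        tables[name] = int(number)
    return tables

# Sample content; the last entry is an invented local addition
sample = """\
# reserved values
255 local
254 main
253 default
0   unspec
# local additions
7   vpn   # hypothetical table for tunnel traffic
"""

for name, number in sorted(parse_rt_tables(sample).items(), key=lambda kv: kv[1]):
    print(f"{number:>3} {name}")
```

To inspect a real machine, pass the contents of /etc/iproute2/rt_tables to the function instead of the sample string.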
The routing table manipulated by the conventional
route command
is the main routing table. Additionally, the use of both
ip address and
ifconfig
will cause the kernel to alter the local routing table (and usually the
main routing table). For further documentation on how to manipulate
the other routing tables, see the command description of
ip route.
Each routing table can contain an arbitrary number of route entries. Aside from the local routing table, which is maintained by the kernel, and the main routing table which is partially maintained by the kernel, all routing tables are controlled by the administrator or routing software. All routes on a machine can be changed or removed [25].
Each of the following route types is available for use with the ip route command. Each route type causes a particular sort of behaviour, which is identified in the textual description. Compare the route types described below with the rule types available for use in the RPDB.
A unicast route is the most common route in routing tables. This is a typical route to a destination network address, which describes the path to the destination. Even complex routes, such as nexthop routes, are considered unicast routes. If no route type is specified on the command line, the route is assumed to be a unicast route.
The broadcast route type is used for link layer devices (such as Ethernet cards) which support the notion of a broadcast address. This route type is used only in the local routing table [26] and is typically handled by the kernel.
The kernel will add local route entries into the local routing table when IP addresses are added to an interface. This means that the IPs are locally hosted IPs [27].
A nat route entry is added by the kernel in the local routing table when the user attempts to configure stateless NAT. See Section 5.3, “Stateless NAT with iproute2” for a fuller discussion of network address translation in general [28].
When a request for a routing decision returns a destination with an unreachable route type, an ICMP unreachable is generated and returned to the source address.
When a request for a routing decision returns a destination with a prohibit route type, the kernel generates an ICMP prohibited to return to the source address.
A packet matching a route with the route type blackhole is discarded. No ICMP is sent and no packet is forwarded.
The throw route type is a convenient route type which causes a route lookup in a routing table to fail, returning the routing selection process to the RPDB. This is useful when there are additional routing tables. Note that there is an implicit throw if no default route exists in a routing table, so the route created by the first command in the example is superfluous, although legal.
The power of these route types when combined with the routing policy database can hardly be overstated. All of these route types can be used without the RPDB, although the throw route doesn't make much sense outside of a multiple routing table installation.
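The behaviour of the unreachable, prohibit, blackhole and throw route types can be modelled in a short Python simulation. Everything in it is hypothetical: the table named special, the 10.x prefixes, the gateway addresses, and the result strings. The model consults the tables in a fixed order, standing in for an RPDB whose rules match every packet.

```python
import ipaddress

# What the kernel does for each terminal route type, per the text above
TERMINAL = {
    "unreachable": "send ICMP unreachable",
    "prohibit":    "send ICMP prohibited",
    "blackhole":   "drop silently",
}

def lookup(table, destination):
    """Longest prefix match; returns (route_type, info) or None."""
    dest = ipaddress.ip_address(destination)
    best = None
    for prefix, rtype, info in table:
        net = ipaddress.ip_network(prefix)
        if dest in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, rtype, info)
    return None if best is None else best[1:]

def route_packet(destination, tables_in_order):
    for name, table in tables_in_order:
        hit = lookup(table, destination)
        if hit is None or hit[0] == "throw":
            continue               # lookup fails here; try the next table
        rtype, info = hit
        if rtype in TERMINAL:
            return TERMINAL[rtype]
        return f"forward {info} (table {name})"
    return "no route"

tables_in_order = [
    ("special", [
        ("10.0.0.0/8",   "unicast",     "via 192.168.99.1"),
        ("10.66.0.0/16", "blackhole",   None),
        ("10.77.0.0/16", "prohibit",    None),
        ("10.88.0.0/16", "throw",       None),
        ("10.99.0.0/16", "unreachable", None),
    ]),
    ("main", [("0.0.0.0/0", "unicast", "via 192.168.99.254")]),
]

for dst in ("10.1.1.1", "10.66.1.1", "10.77.1.1", "10.88.1.1", "10.99.1.1"):
    print(f"{dst:>10}: {route_packet(dst, tables_in_order)}")
```

The throw entry is the interesting one: it carves 10.88.0.0/16 out of the broader 10.0.0.0/8 route and hands the decision to the next table, which is why it only makes sense when more than one table is consulted.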
The local routing table is maintained by the kernel. Normally, the local routing table should not be manipulated, but it is available for viewing. In Example D.12, “Viewing the local routing table with ip route show table local”, you'll see two of the common uses of the local routing table. The first common use is the specification of broadcast address, necessary only for link layers which support broadcast addressing. The second common type of entry in a local routing table is a route to a locally hosted IP.
The route types found in the local routing table
are local, nat and
broadcast. These route types are not relevant in
other routing tables, and other route types cannot be used in the
local routing table.
If the machine has several IP addresses on one Ethernet interface, there will be a route to each locally hosted IP in the local routing table. This is a normal side effect of bringing up an IP address on an interface under linux. Maintenance of the broadcast and local routes in the local routing table can only be done by the kernel.
Example 4.15. Kernel maintenance of the local routing table
Note in
Example 4.15, “Kernel maintenance of the local routing table”, that the kernel adds
not only the route for the locally connected network in the main
routing table, but also the three required special addresses in the
local routing table. Any IP addresses which are locally hosted on
the box will have local entries in the local table. The
network
address and
broadcast
address are both entered as broadcast type
addresses on the interface to which they have been bound.
Conceptually, there is significance to the distinction between a
network and broadcast address, but practically, they are treated
analogously, by other networking gear as well as the linux kernel.
There is one other type of route which commonly ends up in the local
routing table. When using iproute2 NAT, there will
be entries in the local routing table for each network address
translation. Refer to
Example D.21, “Creating a NAT route for a single IP with ip route add
nat” and
Example D.22, “Creating a NAT route for an entire network with ip
route add nat” for example output.
The main routing table is the routing table most people think of when
considering a linux routing table. When no table is specified to an
ip route command, the kernel assumes the main
routing table. The route command only manipulates
the main routing table.
Similarly to the local table, the main table is populated
automatically by the kernel when new interfaces are brought up
with IP addresses. Consult the main routing table before and after
ip address add 192.168.254.254/24 brd + dev eth1
in
Example 4.15, “Kernel maintenance of the local routing table” for a concrete example
of this kernel behaviour. Also, visit
this summary of
side effects of interface definition and activation with
ifconfig or ip address.
[22]
The kernel must be compiled with the option
CONFIG_IP_MULTIPLE_TABLES=y. This is common
in vendor and stock kernels, both 2.2 and 2.4.
[23] If somebody has used scope or oif as additional keys in a routing table, and has an example, I'd love to see it, for possible inclusion in this documentation.
[24] Can anybody describe to me what is in table 0? It looks almost like an aggregation of the routing entries in routing tables 254 and 255.
[25] Once again, I recommend caution when altering the local routing table. Removing local route types from the local routing table can break networking in strange and wonderful ways.
[26] OK, I'm not absolutely sure you can't use the broadcast route in other routing tables, but I believe you can't. Testing forthcoming...
[27] Ibid. I'm not sure that local route types can be used in any routing table other than the local routing table. Testing forthcoming...
[28] Ibid. nat route types might be ineffectual outside the local routing table. Testing forthcoming...
The routing policy database (RPDB) controls the order in which the kernel searches through the routing tables. Each rule has a priority, and rules are examined sequentially from rule 0 through rule 32767.
When a new packet arrives for routing (assuming the routing cache is empty), the kernel begins at the highest priority rule in the RPDB--rule 0. The kernel iterates over each rule in turn until the packet to be routed matches a rule. When this happens the kernel follows the instructions in that rule. Typically, this causes the kernel to perform a route lookup in a specified routing table. If a matching route is found in the routing table, the kernel uses that route. If no such route is found, the kernel returns to traverse the RPDB again, until every option has been exhausted.
The priority-based rule system provides a flexible way to define routes while taking advantage of the traditional routing table concept. For a complete picture of the entire route selection process including the RPDB, see the section on routing selection.
There are a number of different rule types available for use in the routing policy database. These rule types have a striking similarity to the route types available for route entries.
A unicast rule entry is the most common rule type. This rule type simply causes the kernel to refer to the specified routing table in the search for a route. If no rule type is specified on the command line, the rule is assumed to be a unicast rule.
The nat rule type is required for correct operation of stateless NAT. This rule is typically coupled with a corresponding nat route entry. The RPDB nat entry causes the kernel to rewrite the source address of an outbound packet. See Section 5.3, “Stateless NAT with iproute2” for a fuller discussion of network address translation in general.
Any route lookup matching a rule entry with an unreachable rule type will cause the kernel to generate an ICMP unreachable to the source address of the packet.
Any route lookup matching a rule entry with a prohibit rule type will cause the kernel to generate an ICMP prohibited to the source address of the packet.
While traversing the RPDB, any route lookup which matches a rule with the blackhole rule type will cause the packet to be dropped. No ICMP will be sent and no packet will be forwarded.
The routing policy database provides the core of functionality around which the policy routing and advanced routing features can be built.
ICMP is a very important part of the communication between hosts on IP networks. Used by routers and endpoints (clients and servers), ICMP communicates error conditions in networks and provides a means for endpoints to receive information about a network path or requested connection.
One of the commonest uses of ICMP by the administrator of a network is the use of ping to detect the state of a machine in the network. There are other types of ICMP which are used for other inter-computer communication. One other common type of ICMP is the ICMP returned by a router or host which is not accepting connections. Essentially, the host returns the ICMP as a polite method of saying “Go away.”
One important use of ICMP, which is completely transparent to most users (and indeed many admins), is the use of ICMP to discover the Path Maximum Transmission Unit (PMTU). By discovering the Path MTU and transmitting packets with this MTU, a host can minimize the delay of traffic due to fragmentation, and (theoretically) attain a more even rate of data transmission. Because each destination may have a different MTU due to different network paths, the MTU is a per-route attribute stored in the routing cache.
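As a rough illustration of how a sender converges on the Path MTU, the sketch below simulates discovery over a list of link MTUs. It is a simplification under stated assumptions: every router that cannot forward a packet reports the MTU of its next-hop link in the ICMP fragmentation needed message, and no ICMP is lost. The function name and the link MTU values are hypothetical.

```python
def discover_path_mtu(link_mtus):
    """Return (path_mtu, packets_sent) for a path whose links have the given MTUs."""
    size = link_mtus[0]  # start with the MTU of the directly attached link
    sent = 0
    while True:
        sent += 1
        for mtu in link_mtus:
            if size > mtu:
                # The router in front of this link drops the packet and
                # reports the smaller next-hop MTU back to the sender.
                size = mtu
                break
        else:
            # The packet crossed every link without needing fragmentation.
            return size, sent


path = [1500, 1492, 1400, 1500]  # hypothetical link MTUs along one path
print(discover_path_mtu(path))
```

The discovered value is simply the smallest MTU of any link on the path, which is why a single hop that blocks the ICMP replies can stall the whole process.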
Path MTU can be quite easily broken if any single hop along the way blocks all ICMP. Be sure to allow ICMP unreachable/fragmentation needed packets into and out of your network. This will prevent you from being one of the unclueful network admins who cause PMTU problems.
An ICMP redirect is a router's way of communicating
that there is a better path out of this network or into another one
than the one the host had chosen. In
the example network,
tristan has a route to the world through masq-gw and a route to
192.168.98.0/24 through isdn-router. If tristan sends a packet
for 192.168.98.0/24 to masq-gw, the optimal outcome is for
masq-gw to suggest with an ICMP redirect that tristan send such
packets via isdn-router instead.
By this method, hosts can learn what networks are reachable through which routers on the local network segment. ICMP redirect messages, however, are easy to forge, and were (at one time) used to subvert poorly configured machines. While this is infrequently a problem on the Internet today, it's still good practice to ignore ICMP redirect messages from public networks. Create static routes where necessary on private and public networks to prevent ICMP redirect messages from being generated on your network.
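The condition under which a router suggests a better path can be modelled in a few lines. This is a minimal sketch, assuming the router issues a redirect when the packet would leave through the interface on which it arrived and the sender is on that directly connected network. The function, the interface names `lan` and `wan`, and the host and gateway addresses are hypothetical.

```python
import ipaddress


def forward_decision(in_iface, src, out_iface, next_hop, iface_networks):
    """Return the actions a router takes for one packet, in order."""
    actions = ["forward via " + next_hop]
    same_iface = in_iface == out_iface
    src_on_link = ipaddress.ip_address(src) in ipaddress.ip_network(iface_networks[in_iface])
    if same_iface and src_on_link:
        # The sender could have reached next_hop directly: tell it so,
        # but still forward the packet it asked us to forward.
        actions.insert(0, "send ICMP redirect: use " + next_hop)
    return actions


NETWORKS = {"lan": "192.168.99.0/24", "wan": "205.254.211.0/24"}

# A packet from the LAN that must go back out of the LAN interface.
print(forward_decision("lan", "192.168.99.35", "lan", "192.168.99.1", NETWORKS))
# A packet from the LAN bound for the outside world.
print(forward_decision("lan", "192.168.99.35", "wan", "205.254.211.254", NETWORKS))
```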
To examine an example of ICMP redirect in action, we simply
need to send a packet directly from tristan to
morgan. We assume that masq-gw has a route to 192.168.98.0/24
via 192.168.99.1 (isdn-router), and that tristan has no
such route.
Example 4.21. ICMP Redirect on the Wire [29]
|
There's a great deal of information above, so let's examine the
important parts. We have the first three packets which passed by our
NIC as a result of this attempt to establish a session. First, we see
a packet from tristan bound for morgan with tristan's source MAC
and masq-gw's destination MAC. Because masq-gw is tristan's
default gateway, tristan will send all packets there.
The next packet is the ICMP redirect, informing tristan of a
better route. It includes several pieces of information.
Implicitly, the source IP indicates what router is suggesting the
alternate route, and the contents specify what the intended
destination was, and what the better route is. Note that masq-gw
suggests using 192.168.99.1 (isdn-router) as the gateway for this
destination.
The final packet is part of the intended session, but has the MAC
address of masq-gw on it. masq-gw has (courteously) informed us
that we should not use it as a route for the intended destination, but
has also (courteously) forwarded the packet as we had requested. In
this small network, it is acceptable to allow ICMP redirect messages,
although these should always be dropped at network borders, both
inbound and outbound.
So, in summary, ICMP redirect messages are not intrinsically dangerous or problematic, but they shouldn't exist in well-maintained networks. If you happen to see them growing in the shadows of your network, some careful observation should show you what hosts are affected and which routing tables could use some attention.
[29] Consult Table A.2, “Example Network; Host Addressing” for details on the IP and MAC addresses of the hosts referred to in this example.
Network Address Translation (NAT) is a deceptively simple concept. NAT is the technique of rewriting addresses on a packet as it passes through a routing device. There are far-reaching ramifications on network design and protocol compatibility wherever NAT is used.
This chapter will introduce two types of NAT available under linux. One, full NAT or stateless NAT, is available under kernel 2.2 and kernel 2.4 via the iproute2 userspace interface. Available only under kernel 2.4, destination NAT (DNAT) is an important derivative of full NAT. DNAT configuration from userspace is accomplished via the iptables utility. The experienced network administrator is probably puzzling about absent references to source NAT (SNAT) and masquerading. These prominent and prevalent uses of NAT are covered in Chapter 6, Masquerading and Source Network Address Translation, although many concepts involved in the special purpose SNAT and masquerading will be introduced in this chapter.
Network address translation is known by a number of names in the networking world: full NAT, one-to-one NAT and inbound NAT. As used in this chapter and throughout this documentation, NAT, when unqualified, will refer to full network address translation or one-to-one NAT. NAT techniques derived from full NAT, such as destination or source NAT, will be described as DNAT (destination NAT) and SNAT (source NAT).
Michael Hasenstein's seminal paper on network address translation is available courtesy of SuSE Linux AG.
Network address translation (NAT) is a technique of transparently mapping an IP address or range to another IP address or range. Any routing device situated between two endpoints can perform this transformation of the packet. Network designers must, however, take one key element into consideration when laying out a network with NAT in mind. The router(s) performing NAT must have an opportunity to rewrite the packet upon entry to the network and upon exit from the network [30].
Because network address translation manipulates the addressing of a packet, the NAT transformation becomes a passive but critical part of the conversation between hosts exchanging packets. NAT is by necessity transparent to the application layer endpoints and operates on any type of IP packet. There are some application and even network layer protocols which will break as a result of this rewriting. Consult Section 5.2, “Application Layer Protocols with Embedded Network Information” for a discussion of these cases.
Here are a few common reasons to consider NAT along with potential NAT solution candidates shown in parentheses.
Publicly accessible services need to be provided on registered Internet IPs which change or might change. NAT allows the separation of internal IP addressing schemes from the public IP space, easing the burden of changing internal addressing or external IPs. (NAT, DNAT, PAT with DNAT, PAT from userspace)
An application requires inbound and outbound connections. In this case SNAT/masquerading will not suffice. See also Section 6.3, “Where Masquerading and SNAT Break”. (NAT, SNAT and application-aware connection tracking)
The network numbering scheme is changing. Clever use of NAT allows reachability of services on both IP addresses or IP address ranges during the network numbering migration. (NAT, DNAT)
Two networks share the same IP addressing space and need to exchange packets. Using network address translation to publish NAT network spaces with different numbering schemes would allow each network to retain its addressing scheme while accessing the other network. (NAT, DNAT, SNAT)
These are the commonest reasons to consider and implement NAT. Other niche applications of NAT, notably as part of load balancing systems, exist although this chapter will concentrate on the use of NAT to hide, isolate or renumber networks. It will also cover inbound connections, leaving the discussion of many-to-one NAT, SNAT and masquerading for Chapter 6, Masquerading and Source Network Address Translation.
One motivator for deploying NAT in a network is the benefit of virtualizing the network. By isolating services provided in one network from changes in other networks, the effects of such changes can be minimized. The disadvantage of virtualizing the network in this way is the increased reliance on the NAT device.
Providing inbound services via NAT can be accomplished in several different ways. Two common techniques are to use iproute2 NAT and netfilter DNAT. Less commonly (and possibly less desirably), one can use port redirection tools. Depending on which tool is employed, different characteristics of a packet can trigger the address transformation.
The simplest form of NAT under linux is
iproute2 NAT.
This type of NAT requires two matching commands, one to cause the kernel
to rewrite the inbound packets (ip route add nat $NATIP
via $REAL)
and one to rewrite the outbound packets (ip rule add from
$REAL nat $NATIP). The router configured in this fashion will
retain no state for connections. It will simply transform any packets
passing through. By contrast, netfilter is capable of retaining state
on connections passing through the router and selecting packets more
granularly than is possible with only iproute2 tools.
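What the two commands accomplish together can be modelled as a pair of pure functions, one per direction. This is a sketch of the behaviour only, not of the kernel code: the `Packet` type and function names are hypothetical, and the addresses are the NAT IP, real IP and client IP used in the examples of this chapter.

```python
from typing import NamedTuple


class Packet(NamedTuple):
    src: str
    dst: str


NAT_IP = "205.254.211.17"   # plays the role of $NATIP
REAL_IP = "192.168.100.17"  # plays the role of $REAL


def rewrite_inbound(pkt):
    """Models the nat route: rewrite the destination of packets sent to the NAT IP."""
    return pkt._replace(dst=REAL_IP) if pkt.dst == NAT_IP else pkt


def rewrite_outbound(pkt):
    """Models the nat rule: rewrite the source of packets sent from the real IP."""
    return pkt._replace(src=NAT_IP) if pkt.src == REAL_IP else pkt


request = rewrite_inbound(Packet(src="64.70.12.210", dst=NAT_IP))
reply = rewrite_outbound(Packet(src=REAL_IP, dst="64.70.12.210"))
print(request, reply)
```

Neither function consults any stored state: every packet is rewritten (or left alone) purely on the basis of its own addresses, which is what makes this form of NAT stateless.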
Before the advent of the netfilter engine in the linux kernel, there were several tools available to administer NAT, DNAT and PAT. These tools were not included in many distributions and weren't adopted broadly in the community. Although you may find references to ipmasqadm, ipnatadm and ipportfw across the Internet in older documentation, these tools have been superseded in functionality and widespread deployment by the netfilter engine and its userspace partner, iptables.
The netfilter engine provides a more flexible language for selection of packets to be transformed than that provided by the iproute2 suite and kernel routing functionality. Additionally, any NAT services provided by the netfilter engine come with the labor-saving and resource-consuming connection tracking mechanism. DNAT translates the address on an inbound packet and creates an entry in the connection tracking state table. For even modest machines, the connection tracking resource consumption should not be problematic.
Netfilter DNAT allows the user to select packets based on characteristics such as destination port. This blurs the distinction between network address translation and port address translation. NAT always transforms the layer 3 contents of a packet. Port redirection operates at layer 4. From a practical perspective, there is little difference between a port redirection and a netfilter DNAT which has selected a single port. The manner in which the packet and contents are retransmitted, however, is tremendously different.
One other less common technique for furnishing inbound services is the use of port redirection. Although there are higher layer tools which can perform transparent application layer proxying (e.g. Squid), these are outside the scope of this documentation.
There are a number of IP addresses involved in any NAT transformations or connection states. The following list identifies these names and the convention used to describe each IP address. Beware that the prevalence of NAT to publish services on the Internet via public IP addresses has led to the server/client lingo common in discussions of NAT.
NAT IP: The IP address to which packets are addressed. This is the address on the packet before the device performing NAT manipulates it. This is frequently also described as the public IP, although any given application of NAT knows no distinction between public and private address ranges.
real IP: The IP address after the NAT device has performed its transformation. Frequently, this is described as the private IP, although any given application of NAT knows no distinction between public and private address ranges.
client IP: The source address of the initial packet. The client IP in a NAT transformation does not change; this IP is the source IP address on any inbound packets both before and after the translation. It is also the destination address on the outbound packet.
The above terms will be used below and in general discussions of NAT.
[30] If using stateless NAT, the inbound and outbound translations can occur on more than one device, provided that all of the devices are performing the same translation.
Network address translation is beautifully invisible when it works, but has adverse effects on some protocols. Some network applications, e.g., FTP, SNMP, H323, LDAP, IRC, make use of embedded IP information in the application layer protocol or data stream. Since the 2.0.x kernel series (which is not covered here), linux has supported modules which inspect and manipulate packet contents on particular types of packets when used with NAT or masquerading.
FTP is the classic example. Within the FTP control channel (usually established to destination port tcp/21) the client and the server exchange IP address and port information. If the network address translation device doesn't manipulate this data, the FTP server will not be able to contact the client to provide the data.
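To make the problem concrete, the sketch below parses and rewrites an FTP `PORT` command, which carries an IP address and port as six comma-separated decimal numbers (the port is the fifth number times 256 plus the sixth). The helper names are hypothetical and the addresses are borrowed from this chapter's examples purely for illustration; a real helper module must also adjust packet lengths and TCP sequence numbers, which this sketch ignores.

```python
def parse_port_command(line):
    """Return (ip, port) embedded in a command such as 'PORT 192,168,100,17,19,137'."""
    args = line.strip().split(" ", 1)[1]
    h1, h2, h3, h4, p1, p2 = (int(field) for field in args.split(","))
    return "{}.{}.{}.{}".format(h1, h2, h3, h4), p1 * 256 + p2


def rewrite_port_command(line, new_ip):
    """Return the same command with the embedded IP address replaced by new_ip."""
    _old_ip, port = parse_port_command(line)
    return "PORT {},{},{}\r\n".format(new_ip.replace(".", ","), port // 256, port % 256)


original = "PORT 192,168,100,17,19,137\r\n"
print(parse_port_command(original))
print(repr(rewrite_port_command(original, "205.254.211.17")))
```

A NAT device which rewrites only the IP header leaves the private address inside the command untouched, so the far end is told to connect to an address it cannot reach.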
Passive mode FTP requires only outbound TCP connections from the client. This results in a more NAT-friendly and firewall-friendly protocol, because all of the connections are initiated from the client.
Not only are there network applications which break when NAT is involved but also network layer protocols. IPSec is a standards-based network-layer security protocol commonly used in VPNs and IPv6 networks. There are many different ways to use IPSec, but, when used in AH (Authentication Header) mode, NAT will break IPSec functionality.
This underscores the importance of determining if NAT is the best solution for the problem. There are kernel modules to help handle many (though not all) of the application layer protocols when using NAT, but some protocols, such as IPSec in AH mode, simply cannot be used with NAT.
Stateless NAT, occasionally maligned as dumb NAT [31], is the simplest form of NAT. It involves rewriting addresses passing through a routing device: inbound packets will undergo destination address rewriting and outbound packets will undergo source address rewriting. The iproute2 suite of tools provides the two commands required to configure the kernel to perform stateless NAT. This section will cover only stateless NAT, which can only be accomplished under linux with the iproute2 tools, although it can be simulated with netfilter.
Creating an iproute2 NAT mapping has the side
effect of causing the kernel to answer ARP requests for the NAT IP.
For more detail on ARP filtering, suppression and conditional ARP, see
Chapter 2, Ethernet. This can be considered, alternatively, a
benefit or a misfeature of the kernel support for NAT.
The nat entry in the local routing table causes
the kernel to reply to ARP requests for the NAT IP.
Conversely,
netfilter DNAT makes no ARP entry or
provision for neighbor advertisement.
Whether or not it is using a packet filter, a linux machine can perform NAT using the iproute2 suite of tools. This chapter will document the use of iproute2 tools for NAT with a simple example and an explanation of the required commands, then an example of using NAT with the RPDB and using NAT with a packet filter.
NAT with iproute2 can be used in conjunction with the routing policy database (cf. RPDB) to support conditional NAT, e.g. only perform NAT if the source IP falls within a certain range. See Section 5.3.3, “Conditional Stateless NAT”.
Assume that example company in example network wants to provide SMTP service on a public IP (205.254.211.0/24) but plans to move to a different IP addressing space in the near future. Network address translation can help example company prepare for the move. The administrator will select an IP on the internal network (192.168.100.0/24) and configure the router to accept and translate packets for the publicly reachable IP into the private IP.
Example 5.1. Stateless NAT Packet Capture [32]
The first packet comes in on eth1, masq-gw's outside interface. The packet is addressed to the NAT IP, 205.254.211.17 on tcp/25. This is the IP/port pair on which our service runs. This is a snapshot of the packet before it has been handled by the NAT code.

The next line is the "same" packet leaving eth0, masq-gw's inside interface, bound for the internal network. The NAT code has substituted the real IP of the server, 192.168.100.17. This rewriting is handled by the nat entry in the local routing table (ip route). See also Example 5.2, “Basic commands to create a stateless NAT”.

The SMTP server then sends a return packet which arrives on eth0. This is the packet before the NAT code on masq-gw has rewritten the outbound packet. This rewriting is handled by the RPDB entry (ip rule). See also Example 5.2, “Basic commands to create a stateless NAT”.

Finally, the return packet is transmitted on eth1 after having been rewritten. The source IP address on the packet is now the public IP on which the service is published.
There are only a few commands which are required to enable stateless
NAT on a linux routing device. The commands below will configure
the host masq-gw (see
Section A.1, “Example Network Map and General Notes” and
Section A.2, “Example Network Addressing Charts”) as shown above in
Example 5.1, “Stateless NAT Packet Capture”.
Example 5.2. Basic commands to create a stateless NAT
This command tells the kernel to perform network address translation on any packet bound for 205.254.211.17. The parameter via tells the NAT code to rewrite the packet bound for 205.254.211.17 with the new destination address 192.168.100.17. Note that this only handles inbound packets; that is, packets whose destination address contains 205.254.211.17.

This command enters the corresponding rule for the outbound traffic into the RPDB (kernel 2.2 and up). This rule will cause the kernel to rewrite any packet from 192.168.100.17 with the specified source address (205.254.211.17). Any packet originating from 192.168.100.17 which passes through this router will trigger this rule. In short, this command rewrites the source address of outbound packets so that they appear to originate from the NAT IP.

The kernel maintains a routing cache to handle routing decisions more quickly (Section 4.7, “Routing Cache”). After making changes to the routing tables on a system, it is good practice to empty the routing cache with ip route flush cache. Once the cache is empty, the kernel is guaranteed to consult the routing tables again instead of the routing cache.

These two commands allow the user to inspect the routing policy database and the local routing table to determine if the NAT routes and rules were added correctly.
NAT introduces a complexity to the network in which it is used because a service is reachable on a public and a private IP. Usually, this is a reasonable tradeoff or else stateless NAT would fail in the selection process. In the case that the linux routing device is connected to a public network and more than one private network, there is more work to do.
Though the service is available to the public network on a public (NAT) IP, internal users may need to connect to the private or internal IP.
This is accomplished by use of the routing policy database (RPDB), which allows conditional routing based on packet characteristics. For a more complete explanation of the RPDB, see Section 4.9, “Routing Policy Database (RPDB)”. The routing policy database can be manipulated with the ip rule command. In order to successfully configure NAT, familiarity with the ip rule command is required.
Example 5.3. Conditional Stateless NAT (not performing NAT for a specified destination network)
|
Note that we now have an entry of higher priority in the RPDB
for any packets returning from 192.168.100.17 bound for
192.168.99.0/24. The rule tells the kernel to find the route
for 192.168.99.0/24 (from 192.168.100.17) in the main
routing table. This exception to the NAT mapping of our public
IP to our internal server will allow the hosts in our second
internal network to reach the host named isolde on
its private IP address.
If tristan were to initiate a connection to isolde now, the
packet would return from IP 192.168.100.17 instead of being
rewritten from 205.254.211.17.
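The effect of the higher priority exception can be modelled by walking an ordered rule list, as the RPDB does. This is only a sketch: the rule tuples, the priority numbers 32000 and 32100, the function name and the address used for tristan are hypothetical, chosen so that the exception is consulted before the nat rule.

```python
import ipaddress

# (priority, from, to, action, nat address); lower priority is consulted first
RULES = [
    (32000, "192.168.100.17/32", "192.168.99.0/24", "lookup main", None),
    (32100, "192.168.100.17/32", "0.0.0.0/0", "nat", "205.254.211.17"),
]


def outbound_source(src, dst):
    """Return the source address a packet carries after leaving the router."""
    src_ip = ipaddress.ip_address(src)
    dst_ip = ipaddress.ip_address(dst)
    for _prio, frm, to, action, nat_ip in sorted(RULES, key=lambda rule: rule[0]):
        if src_ip in ipaddress.ip_network(frm) and dst_ip in ipaddress.ip_network(to):
            # The first matching rule decides; the exception wins for 192.168.99.0/24.
            return nat_ip if action == "nat" else src
    return src


print(outbound_source("192.168.100.17", "192.168.99.35"))  # reply to the internal network
print(outbound_source("192.168.100.17", "64.70.12.210"))   # reply to the outside world
```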
Now we have had success creating a NAT mapping with the iproute2 tools and we have successfully made an exception for another internal network which is connected to our linux router. Now, supposing we learn that we will be losing our IP space next week, we are prepared to change our NAT rules without readdressing our server network.
Naturally, you may not wish to create these rules manually every time you want to use NAT on every device. A standard SysV initialization script and configuration file can ease the burden of managing a number of NAT IPs on your system.
Because NAT rewrites the packet as it passes through the IP stack, packet filtering can become complex. With attentiveness to the addressing of the packet at each stage in its journey through the packet filtering code, you can ease the burden of writing a packet filter.
All of the below requirements can be deduced from an understanding of NAT and the path a packet takes through the kernel. Consult also the ipchains packet path as illustrated in the ipchains HOWTO to understand the packet path when using ipchains. Keep in mind when viewing the ASCII diagram that stateless NAT will always occur in the routing stage. Also consult the kernel packet traveling diagram for a good picture of a 2.4 kernel packet path.
Table 5.1, “Filtering an iproute2 NAT packet with ipchains” identifies the IP addresses on a packet traversing each of the input, forward and output chains in an ipchains installation.
Table 5.1. Filtering an iproute2 NAT packet with ipchains
Inbound to the NAT IP

| Chain | Source IP | Destination IP |
|---|---|---|
| input | 64.70.12.210 | 205.254.211.17 |
| Routing Stage | | |
| forward | 64.70.12.210 | 192.168.100.17 |
| output | 64.70.12.210 | 192.168.100.17 |

Outbound from the real IP

| Chain | Source IP | Destination IP |
|---|---|---|
| input | 192.168.100.17 | 64.70.12.210 |
| Routing Stage | | |
| forward | 205.254.211.17 | 64.70.12.210 |
| output | 205.254.211.17 | 64.70.12.210 |
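The table can be reproduced from one observation: the input chain sees the packet before the routing stage, while the forward and output chains see it afterwards. The helper below is a hypothetical illustration of that rule of thumb, using the addresses from the table.

```python
NAT_IP = "205.254.211.17"
REAL_IP = "192.168.100.17"


def addresses_by_chain(src, dst):
    """Return the (source, destination) pair each ipchains chain sees for one packet."""
    before = (src, dst)
    # Stateless NAT happens in the routing stage, between input and forward.
    after = (NAT_IP if src == REAL_IP else src,
             REAL_IP if dst == NAT_IP else dst)
    return {"input": before, "forward": after, "output": after}


inbound = addresses_by_chain("64.70.12.210", NAT_IP)
outbound = addresses_by_chain(REAL_IP, "64.70.12.210")
print(inbound)
print(outbound)
```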
A firewall implementing a tight policy (deny all, selectively allow) will require a large number of individual rules to allow the NAT packets to traverse the firewall packet filter. Assuming the configuration detailed in Example 5.1, “ Stateless NAT Packet Capture ”, the following set of chains is required and will restrict access to only port 25 [33].
Example 5.4. Using an ipchains packet filter with stateless NAT
|
Please note that the formatting of the commands is simply for display purposes, and to allow for easier reading of a complex set of commands. The above set of rules is 31 individual chains. This is most certainly a complex set of rules. For further details on how to use ipchains please see the ipchains HOWTO. The salient detail you should notice from the above set of rules is the difference between the IPs used in the input and forward chains. Since packets are rewritten by the stateless NAT code in the routing stage, the transformation of the packet will be complete before the forward chain is traversed.
The first two lines cover all inbound TCP packets, the first line as a
special case of the second, indicating (-l) that we
want to log the packet. After successfully traversing the input chain,
the packet is routed, at which point the destination address of the
packet has changed. Now, we need to forward the packet from the public
source address to the private (or real) internal IP address. Finally,
we need to allow the packet out on the internal interface.
The next set of rules handles all of the TCP return packets. On the input rule, we are careful to match only non-SYN packets from our internal server bound for the world. Once again, the packet is rewritten during the routing stage. Now in the forward chain, the packet's source IP is the public IP of the service. Finally, we need to let the packet out on our external interface.
The next series of lines are required ICMP rules to prevent network traffic from breaking terribly. These types of ICMP, particularly destination unreachable (ICMP 3) and source quench (ICMP 4) help to ensure that TCP sessions run with optimized characteristics.
These rules are the minimum set of ipchains rules needed to support a NAT'd TCP service. This concludes our discussion of publishing a service to the world with iproute2 based NAT and protecting the service with ipchains. As you can see, the complexity of supporting NAT with iproute2 can be substantial, which is why we'll examine the benefits of inbound NAT (DNAT) with netfilter in the next section.
[33] I assume here that the user has a restrictive default policy on the firewalling device. I suggest a policy of DENY on each of the built in ipchains chains.
Destination NAT with netfilter is commonly used to publish a service from an internal RFC 1918 network to a publicly accessible IP. To enable DNAT, at least one iptables command is required. The connection tracking mechanism of netfilter will ensure that subsequent packets exchanged in either direction (which can be identified as part of the existing DNAT connection) are also transformed.
In a devilishly subtle difference, netfilter DNAT does not cause the kernel to answer ARP requests for the NAT IP, where iproute2 NAT automatically begins answering ARP requests for the NAT IP.
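The role of connection tracking in DNAT can be sketched with a small state table. This is a toy model, not netfilter's data structure: the `DnatRouter` class and the client addresses are hypothetical, while the NAT and real addresses (10.10.20.99 and 10.10.14.2) match the example that follows.

```python
class DnatRouter:
    """Toy model of DNAT backed by a connection tracking table."""

    def __init__(self, nat_ip, real_ip):
        self.nat_ip = nat_ip
        self.real_ip = real_ip
        self.conntrack = {}  # (client ip, client port, real ip, real port) -> original destination

    def inbound(self, src, sport, dst, dport):
        if dst == self.nat_ip:
            # The first packet of a connection creates the state entry.
            self.conntrack[(src, sport, self.real_ip, dport)] = dst
            dst = self.real_ip
        return src, sport, dst, dport

    def outbound(self, src, sport, dst, dport):
        # Only packets belonging to a tracked connection are rewritten.
        original = self.conntrack.get((dst, dport, src, sport))
        if original is not None:
            src = original
        return src, sport, dst, dport


router = DnatRouter("10.10.20.99", "10.10.14.2")
request = router.inbound("10.10.30.5", 40000, "10.10.20.99", 25)
reply = router.outbound("10.10.14.2", 25, "10.10.30.5", 40000)
unrelated = router.outbound("10.10.14.2", 25, "10.10.30.6", 40000)
print(request, reply, unrelated)
```

Unlike the stateless case, the packet to 10.10.30.6 is left alone because no tracked connection exists for it.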
Example 5.5. Using DNAT for all protocols (and ports) on one IP
|
In this example, all packets arriving on the router with a destination of 10.10.20.99 will depart from the router with a destination of 10.10.14.2.
Example 5.6. Using DNAT for a single port
|
Full network address translation, as performed with iproute2, can be simulated with both netfilter SNAT and DNAT, with the potential benefit (and attendant resource consumption) of connection tracking.
Example 5.7. Simulating full NAT with SNAT and DNAT
|
Port address translation (hereafter PAT) provides a similar functionality to NAT, but is a more specific tool. PAT forwards requests for a particular IP and port pair to another IP and port pair. This feature is commonly used on publicly connected hosts to make an internal service available to a larger network.
PAT will break in strange and wonderful ways if there is an alternate route between the two hosts connected by the port address translation.
PAT has one important benefit over NAT (with the iproute2 tools). Let's assume that you have only five public IP addresses for which you have paid dearly. Additionally, let's assume that you want to run services on standard ports. You had hoped to connect four SMTP servers, two SSH servers and five HTTP servers. If you had wanted to accomplish this with NAT, you'd need more IP space.
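The arithmetic behind this scenario can be checked with a short sketch. It assumes that one-to-one NAT consumes one public IP per server, whereas PAT lets services on different ports share a public IP. The function names, the internal server names and the public addresses (taken from the 203.0.113.0/24 documentation range) are hypothetical.

```python
from collections import Counter

STANDARD_PORT = {"smtp": 25, "ssh": 22, "http": 80}
SERVICES = ["smtp"] * 4 + ["ssh"] * 2 + ["http"] * 5
PUBLIC_IPS = ["203.0.113.{}".format(n) for n in range(1, 6)]  # five public IPs


def ips_needed_with_nat(services):
    """One-to-one NAT: every server needs a public IP of its own."""
    return len(services)


def ips_needed_with_pat(services):
    """PAT: only servers offering the same port compete for an IP."""
    return max(Counter(services).values())


def build_pat_table(services, public_ips):
    """Map each (public IP, port) pair to one internal server."""
    used = Counter()
    table = {}
    for n, service in enumerate(services):
        ip = public_ips[used[service]]  # next IP not yet serving this port
        used[service] += 1
        table[(ip, STANDARD_PORT[service])] = "server-{}".format(n)
    return table


pat_table = build_pat_table(SERVICES, PUBLIC_IPS)
print(ips_needed_with_nat(SERVICES), ips_needed_with_pat(SERVICES), len(pat_table))
```

All eleven servers fit on the five addresses because no port is needed more than five times.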
Commonly known under a variety of names, SNAT, masquerading or Many-To-One NAT can be part of a solution to protect
Masquerading for connections or traffic initiated from inside a network. Consider reading Chapter 5, Network Address Translation (NAT) for details on handling inbound traffic or connections.
Masquerading has been supported under the linux kernel since before kernel 2.0. The technique of masquerading
Though SNAT and masquerading perform the same fundamental function, mapping one address space into another one, the details differ slightly. Most noticeably, masquerading chooses the source IP address for the outbound packet from the IP bound to the interface through which the packet will exit.
It is not an uncommon story today to hear how people were first exposed to linux. Many people found linux an excellent and reliable masquerading firewall in the mid-1990s and slowly became more and more accustomed to working with linux as a result of the low total cost of ownership.
The capabilities of the packet filtering tools available under linux today dwarf those of early linux (ipfwadm, anybody?) yet retain the reliability and expressive flexibility of the older tools.
For networks and machines directly connected to the Internet, packet filtering is no longer an option, but a need. This chapter will introduce the packet filtering tools available under kernels 2.2 and 2.4. Since there is much available documentation on packet filtering, host protection and masquerading with a packet filter, this chapter will refer liberally to external resources.
This chapter begins with an introduction to and the history of packet filtering with linux. After covering some of the weaknesses of packet filtering, it will cover the netfilter architecture, and then delve into using iptables. An introduction to the use of ipchains will follow along with introductions to host and network protection. The chapter will close with an overview of further resources.
Packet filtering refers to the technique of conditionally allowing or denying packets entering or exiting a network or host based on the characteristics of that packet. There are two fundamental types of packet filters. A static packet filter is a set of rules against which every packet is checked, and allowed or denied. A dynamic packet filter keeps track of the connections currently passing the firewall. This is usually described as a stateful or dynamic packet filtering engine. Netfilter provides the capability for linux (2.4+) to operate as a stateful packet filtering device.
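The difference between the two types of filter can be shown with a toy model. Everything here is a hypothetical illustration: the static rule accepts any inbound packet with source port 80, the `StatefulFilter` class accepts only replies to connections it has seen leave, and the addresses are examples.

```python
def static_filter_inbound(sport):
    """Static rule: accept any inbound packet whose source port is 80."""
    return sport == 80


class StatefulFilter:
    """Accept inbound packets only if they answer a tracked outbound connection."""

    def __init__(self):
        self.connections = set()

    def outbound(self, src, sport, dst, dport):
        self.connections.add((src, sport, dst, dport))  # remember the connection
        return True

    def inbound(self, src, sport, dst, dport):
        # A reply has the addresses and ports of the original packet reversed.
        return (dst, dport, src, sport) in self.connections


firewall = StatefulFilter()
firewall.outbound("192.168.99.35", 40000, "64.70.12.210", 80)

genuine_reply = firewall.inbound("64.70.12.210", 80, "192.168.99.35", 40000)
forged_packet = firewall.inbound("64.70.12.210", 80, "192.168.99.35", 40001)
print(genuine_reply, forged_packet, static_filter_inbound(80))
```

The static rule cannot tell the forged packet from the genuine reply, since both carry source port 80; the stateful filter rejects the forgery because no matching connection was ever opened.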
For a brief digression, consider the term stateful packet inspection. This term has been used in two distinctly different meanings. At least one commercial security company differentiates between stateful packet filtering and stateful packet inspection [34]. Supposedly, a stateful packet inspection engine is able to examine the contents of a packet and make a limited guess as to the legitimacy of the application layer content. While I would call this an application layer proxy, I do not use the product. For the purposes of this documentation, the terms stateful packet inspection and stateful packet filtering are synonymous.
Packet filtering, the network layer portion of a firewall solution, is one part of a good security stance. As the embodiment and manifestation of an organizational security policy for network layer traffic, the packet filter restricts traffic flows between networks and hosts. There is tremendous value from a security perspective in enforcing these traffic flows, instead of allowing arbitrary traffic flow.
The use of packet filtering to enforce these traffic flows is not restricted to routers and firewalls alone. Standalone servers and workstations can use these same tools to protect themselves. There are a couple of common approaches to packet filtering. Generally, network security professionals subscribe to the notion that the filtering policy should deny or drop all traffic and selectively allow desired traffic. An alternate, more open, policy suggests allowing everything, selectively blocking undesirable traffic.
The languages used in most packet filtering tools for describing IP packets allow for a great deal of specificity when identifying traffic. This specificity gives an administrator a great deal of flexibility for protecting resources and limiting traffic flows.
Packet filtering under linux has a long history, punctuated by major alterations in the packet filtering systems included in the kernel. In the mid- and late-1990s, ipfwadm exposed the three packet filtering chains of kernel 2.0 to the user: in, forward, and out. Individual entries added to these chains would be traversed in order in each ruleset. The first matching rule in each chain would be used, and every packet passing through a router would traverse these three chains.
With the advent of linux 2.2, users could create their own chains and chain structures. The kernel architecture was different from that of the earlier kernel, but from the user's perspective, the manner in which the rules were written was only slightly different. Rule chains, traversed rather like subroutines and manipulated with ipchains, could be arbitrarily complex and nested. The built-in packet filtering chains had names: input, output and forward. The first matching rule in any chain called from one of the built-in chains would be used. Every packet passing through a router would traverse (at least) the three built-in rule chains. There is backward compatible support for ipfwadm syntax via a wrapper shell script which converts the command to an ipchains syntax.
In kernel 2.4, the netfilter architecture, which provides functionality other than packet filtering, allows users to create arbitrary chains and chain structures similar to those supported by linux 2.2. The built-in chains are INPUT, FORWARD, and OUTPUT. A major difference in the use of chains was introduced in linux 2.4: packets passing through a router traverse the FORWARD chain only. User-defined iptables chains resemble branches rather than subroutines. Under linux 2.4, ipchains compatibility is maintained with a kernel module. For ipfwadm compatibility, the kernel module and the aforementioned wrapper shell script function adequately.
The packet filtering support under linux has grown increasingly complex and mature with successive kernels and development efforts on the user space tools. The netfilter architecture of linux 2.4 represented a tremendous step forward in the packet filtering capabilities of linux with support for stateful packet filtering.
Although the functionality offered by linux kernels for protecting network resources with packet filtering allows tremendously specific network layer access control and auditing capability, it alone cannot successfully and completely protect network resources. There are weaknesses in and limits to the usefulness of packet filters.
In cases where a packet filter restricts access to a resource based on the source IP address attempting to access that resource, the packet filter cannot verify whether the packets originate from the real device or from a host or router spoofing this source address. A transparent proxy illustrates this problem perfectly. A transparent proxy frequently runs on a masquerading or NAT host which is connected to the Internet. This machine intercepts outbound connections for a particular protocol (e.g., HTTP), and simulates the real server to the client. The client may have a packet filter limiting outbound connections to a single IP and port pair, but the transparent proxy will still operate on the outbound connection.
This is an innocuous example, indeed. A potentially more threatening example is an ssh server which accepts connections only from an IP range. Any router between the two endpoints which can spoof IP packets will be able to pass the packet filter, whether it is a stateful or a static packet filter. This should underscore the importance of solid application layer security in addition to the need for judiciously employed packet filtering.
A packet filter makes no effort to validate the contents of a data stream, so data passed over a packet filter may be bogus, invalid or otherwise incorrect. The packet filter only verifies that the network layer datagrams are correctly addressed and well-formed [35]. Many security devices, such as firewalls, include support for proxies, which are application aware. These are security mechanisms which can validate data streams. Proxies are often integrated with packet filters for a tight network layer and application layer firewall.
Tunnels are one of the most common ways to subvert a packet filter. They come in wide varieties: ssh tunnels which allow users to transport TCP sessions into or out of a network; GRE tunnels, which allow arbitrary packets to be encapsulated in an IP packet; UDP tunnels; VPN tunnels; TAP/TUN tunnels; and application layer transport tunnels, such as RPC over HTTP/HTTPS. Some of these tunnels are very difficult to prevent with packet filtering, while others are trivial to block.
Perhaps it is apparent, why **FIXME** adversarial relationship between packet filters and content....limitation of packet filter....hence proxies...blah blah blah.
Use of ICMP, when to block ICMP; tunneling through lax packet filters with ICMP (trinoo, ICMPchat).
Another area of network security which is not addressed by packet filtering is encryption. Encryption can be used at a number of different layers in a networked environment. Compare IPSec, encrypted packets, with Secure Sockets Layer (SSL), which encrypts a single application layer session. IPSec operates at layer 3, while SSL operates above layer 4. Packet filtering does not directly address the issue of encryption in any way. Both are tools used in an ongoing effort to maintain and secure a network.
There are a few good starting places for those needing guidelines on securing machines. First, the Security Quickstart HOWTO is a good place to begin. There is also the Security HOWTO. These and several other good general security resources are also available via linuxsecurity.com's documentation area.
Much of the previous discussion applies to packet filtering in general, and linux suffers from the same limitations of packet filtering. It is folly to assume that a good packet filter makes a network immune from security issues.
The weaknesses of static (or stateless) packet filters and stateful packet filters are different in a few ways. Stateless packet filters frequently block SYN scans of networks, but ....
Stateless packet filters. (cf. iptables connection tracking), cf. state vs. stateless discussion.
confounded application layer protocols like FTP, H323
Because of the nature of connection tracking and state awareness, stateful packet filters are vulnerable to resource exhaustion and deliberate attempts to trip rate-limiting features.
DoS on connection tracking packet filters DoS on rate limiters ?
[35] In truth, there is some examination of data inside the network layer datagram. Almost all packet filtering engines allow the user to distinguish between the different IP protocol types, such as GRE, TCP, UDP, ICMP, and even attributes of these datagrams and segments. The important thing to realize is that a packet filter makes no effort to examine the data stream.
minimum ICMP required to meet the networking needs; xref PMTU discussion
source quench
parameter problem
inbound destination unreachable
outbound destination unreachable fragmentation needed
optional: echo request and echo reply
optional: outbound destination unreachable
optional: time exceeded
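The outline above can be turned into iptables rules along the following lines. This is only a sketch: the function name is hypothetical, and where the outline gives no direction for an ICMP type (source quench, parameter problem, echo, time exceeded) the choice of the INPUT chain is an assumption. The type names are the ones iptables itself lists with `iptables -p icmp -h`.

```shell
#!/bin/sh
# Print iptables rules admitting the ICMP types named in the outline.
# Pass "optional" as $1 to include the types the outline marks optional.
icmp_rules() {
    for t in source-quench parameter-problem destination-unreachable; do
        echo "iptables -A INPUT -p icmp --icmp-type $t -j ACCEPT"
    done
    # Needed for path MTU discovery by hosts sending through this machine.
    echo "iptables -A OUTPUT -p icmp --icmp-type fragmentation-needed -j ACCEPT"
    [ "${1:-}" = "optional" ] || return 0
    for t in echo-request echo-reply time-exceeded; do
        echo "iptables -A INPUT -p icmp --icmp-type $t -j ACCEPT"
    done
    echo "iptables -A OUTPUT -p icmp --icmp-type destination-unreachable -j ACCEPT"
}

icmp_rules optional
```

These rules only make sense in front of a restrictive policy or a final DROP rule; with an ACCEPT policy they change nothing.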
packet filtering engine in kernel 2.2 (skip history, adequately documented elsewhere)
packet filtering engine as part of netfilter in kernel 2.4, backwards compatible support for ipchains
differences between the packet traversal in ipchains and iptables. link to Stef Coene's KPTD (kernel 2.4). Anybody know of a link to a KPTD for kernel 2.2?
the three builtin chains, input, output, forward
policy per chain, see targets
jumping from chain to chain, -j $TARGET; where TARGET=chain
the big picture; how chains are traversed
targets (other than chains) ACCEPT, DENY, REJECT....
selecting on interface
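A few of the items in this outline (chain policy, a user-defined chain, jumping with -j, and selecting on an interface) can be shown together in one short sketch. It uses iptables syntax, where the silent-discard target is DROP; ipchains calls the equivalent target DENY. The function name, the chain name from-lan and the permitted port are assumptions chosen for illustration.

```shell
#!/bin/sh
# Print commands that build a user-defined chain and jump to it.
# $1 = name of the user-defined chain (hypothetical)
# $2 = inbound interface whose traffic is sent to that chain
chain_rules() {
    chain=$1
    iface=$2
    echo "iptables -P INPUT DROP"                          # policy per chain
    echo "iptables -N $chain"                              # create the chain
    echo "iptables -A $chain -p tcp --dport 22 -j ACCEPT"  # target: ACCEPT
    echo "iptables -A $chain -j REJECT"                    # target: REJECT
    echo "iptables -A INPUT -i $iface -j $chain"           # jump, by interface
}

chain_rules from-lan eth0
```

A packet arriving on eth0 is handed to from-lan, where the first matching rule decides its fate. Packets arriving on any other interface never enter the chain and fall through to the INPUT policy.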
Host protection in the past was typically performed with application
layer checks on the originating IP or hostname. This was (and still is)
frequently accomplished with libwrap, which verifies whether or not to
allow a connection based on the contents of the system wide
configuration files /etc/hosts.allow and
/etc/hosts.deny.
Host protection is one part of protecting a host, by preventing inbound packets from reaching higher layers. This is no substitute for tight application layer security. Strong network and host-level packet filters mitigate a host's exposure when it is connected to a network.
Example 7.1. Blocking a destination and using the REJECT
target, cf. Example D.17, “Adding a prohibit route with route
add”
|
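A rule of the kind this example describes might look like the sketch below. The destination 192.0.2.10 is a placeholder from the documentation address range, the function name is hypothetical, and the choice of the icmp-host-prohibited error is an assumption. Unlike DROP, the REJECT target answers the sender with an ICMP error, much as a prohibit route does.

```shell
#!/bin/sh
# Print rules rejecting traffic to one destination, answering with an
# ICMP "host prohibited" error rather than silently dropping it.
# $1 = destination address (192.0.2.10 below is a placeholder)
reject_destination() {
    # OUTPUT covers locally generated packets, FORWARD routed ones.
    for chain in OUTPUT FORWARD; do
        echo "iptables -A $chain -d $1 -j REJECT --reject-with icmp-host-prohibited"
    done
}

reject_destination 192.0.2.10
```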
The use of linux packet filtering features is mature and well-documented in many places throughout the Internet. One of the most thorough introductions to the use of iptables has been collected by Oskar Andreasson at his Iptables tutorial. For further reference material on the use of iptables consult this resource.
For those continuing to use ipchains the ipchains HOWTO courtesy of TLDP provides an introduction to the world of ipchains.
For kernel 2.4, understanding the sequence of packet mangling, filtering and network address translation is key. The kernel packet traveling diagram provides a visual representation of the path a packet takes through the kernel. Here you will see the netfilter hooks, traffic control, and routing stages. A similar picture of kernel 2.4's packet path is available in a single page PDF entitled Linux Kernel 2.4 Packet handling.
See also Section I.1.8, “ipchains Resources” and Section I.1.7, “Netfilter Resources” in the appendices for a more complete set of references and links.
The content in this part is intended as a practical, hands-on guide to users wanting real, tested solutions.
The remainder of this documentation is written in a less formal style, and is heavy on examples. It should be viewed as practical explication of the above chapters.
In the previous chapters, we have covered many of the key elements required to understand basic networking with linux. In this chapter, we will introduce a few new concepts, but will endeavor to put some of the ideas together to solve practical networking problems.
ARP flux. /proc/sys/net/ipv4/conf/all/hidden
Nothing here for now. Refer to
Section 2.1.4, “The ARP Flux Problem”.
Media share; IP overlay; compare VLANS; consider bridging; consider migrating from one IP space to another (vrrpd, anybody?).
Proxy ARP is a technique for splitting an IP network into two separate segments. Hosts on one segment can only reach hosts in the other segment through the router performing proxy ARP. If a router sits between two parts of an IP network and is not running bridging software, then routes to hosts in each segment and proxy ARP are required on the router to allow each half of the network to communicate with the other half.
Occasionally, this technique is incorrectly called proxy ARP bridging. An Ethernet bridge operates on frames and a router operates on packets. The proxy ARP router should have routes to all hosts on both segments. Once the router can reach all locally connected destinations via the correct interfaces, you can begin to configure the proxy ARP functionality.
Although proxy ARP complicates a network, a great advantage of the proxy ARP technique is the greater control it affords over IP connections between hosts.
There are two primary proxy ARP techniques. With the 2.4 kernel, it is
possible to use the sysctl
net/ipv4/conf/all/proxy_arp to perform proxy ARP.
Alternatively, manual population of the ARP table reaches the same end.
The key part of the correct functioning of proxy ARP in a network is that the host breaking a network into two parts has correct routes for all destinations in both halves of the network. If the host which has interfaces in both networks does not have an accurate routing table, IP packets will get dropped on the routing device.
One common method of breaking a network in two involves making a very small stub subnet at one end or the other of the IP range. This small subnet (maybe as small as a /30 network, with two usable IPs) makes an excellent sequestered location for a host which requires more protection or even a generally untrusted host which shouldn't have complete access to the Ethernet to which the other machines connect.
For a practical example of this, see the relationship between the
service-router, masq-gw and isolde in the
network map. isolde and
service-router share the same IP network, 192.168.100.0/24. If either
has a packet for the other, it will generate an ARP request which should
be answered by masq-gw. Naturally, masq-gw has its routes
configured in such a way that both hosts are reachable from it. Thus,
the packet will successfully pass through masq-gw.
Let's examine what the sequence of events is by which the packet will
reach service-router from isolde. In this example, isolde will
send an echo request packet to service-router. Please also refer to
Section B.1, “arp” for examples and command lines to create
a proxy ARP configuration.
the admin on isolde creates an echo request packet for 192.168.100.1 with ping
isolde sends an ARP request for the owner of 192.168.100.1
masq-gw replies that isolde should send packets for 192.168.100.1 to its Ethernet address, 00:80:c8:f8:5c:71
masq-gw receives the packet, unwraps it and selects eth3 as the output interface
masq-gw sends an ARP request for the owner of 192.168.100.1
service-router replies that masq-gw should send packets for 192.168.100.1 to its Ethernet address, 00:c0:7b:7d:00:c8
service-router receives the packet, unwraps it and hands it up the IP stack, which generates an echo reply bound for the source address, 192.168.100.17 (isolde's IP)
service-router sends an ARP request for the owner of 192.168.100.17
masq-gw replies that service-router should send packets for 192.168.100.17 to its Ethernet address, 00:80:c8:f8:5c:74
masq-gw receives the packet, unwraps it and selects eth0 as the output interface
masq-gw sends an ARP request for the owner of 192.168.100.17
isolde replies that masq-gw should send packets for 192.168.100.17 to its Ethernet address, 00:80:c8:e8:4b:8e
isolde receives the reply, unwraps it and hands it up the IP stack to the awaiting ping command
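The configuration on masq-gw which makes this exchange possible can be sketched as follows. This is not the proxy-arp script discussed in this chapter, only a minimal illustration using iproute2. The interface names come from the steps above (eth0 toward isolde, eth3 toward service-router); the use of /32 host routes and the function name are assumptions.

```shell
#!/bin/sh
# Print the routes and proxy ARP entries masq-gw needs for two hosts
# which share an IP network but sit on different Ethernet segments.
# $1 = host A, $2 = interface facing host A
# $3 = host B, $4 = interface facing host B
proxy_arp_pair() {
    # The router must be able to reach each host via the right interface.
    echo "ip route add $1/32 dev $2"
    echo "ip route add $3/32 dev $4"
    # Answer ARP requests for each host on the *other* host's segment.
    echo "ip neigh add proxy $1 dev $4"
    echo "ip neigh add proxy $3 dev $2"
}

proxy_arp_pair 192.168.100.17 eth0 192.168.100.1 eth3
```

With the routes in place, the kernel's own proxy ARP support is the alternative to the two `ip neigh add proxy` entries: writing 1 to /proc/sys/net/ipv4/conf/all/proxy_arp lets the kernel answer ARP requests for any address it can route through a different interface.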
Where possible, a simplified network is easier to maintain, but occasionally, this sort of trickery is necessary. This is an excellent way to insert a firewall into the middle of a network. The firewall, naturally, has to have its routes set properly, and proxy ARP entries will be required for routers.
Now, here's a short script and configuration file which can be run as a SysVInit style script. This script provides a great deal of control over the ARP table directly so may be preferable in some cases to an alternate solution outlined below. This proxy-arp script reads the following configuration file. Each is commented heavily so it should be clear how to use them.
This chapter discussed how to break a network in twain with proxy ARP techniques. For another explanation of the same concepts, read the Proxy ARP Subnet mini-HOWTO. Available in most (all?) 2.4 kernels is built-in capability for Proxy ARP. This is documented in deeper detail above. Consider familiarizing yourself with the methods of suppressing and controlling ARP through Julian Anastasov's work.
Don't forget to add something here about multiple IPs bound to loopback; and refer to Julian's work. FIXME
Assume a machine has multiple connections to the same Ethernet segment, and has individual IPs bound to each interface. A peculiar feature of linux is its willingness to respond to ARP requests for any IP bound to any interface. This can lead to ARP flux, a situation where a given IP is sometimes accessed on one MAC address and sometimes another.
/proc/sys/net/ipv4/conf/all/hidden; consider arp
suppression issues.
Consider ARP suppression issues. Leakage of sensitive (IP addressing) information from other interfaces.
FIXME!! Don't forget to note that iproute2 NAT and binding to
non-local IPs do not play well together. I disagree with
this.
Binding to a non-local socket, which was possible under
kernel 2.2 when the kernel was compiled with
CONFIG_IP_TRANSPROXY, is available under kernel 2.4 via the
/proc IP sysctl interface. If you wish to be
able to bind to non-local sockets:
|
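The sysctl in question is ip_nonlocal_bind. The sketch below reports its current state and shows, in a comment, the command which enables it. The function name and its optional path argument are hypothetical conveniences added for the example.

```shell
#!/bin/sh
# Report whether binding to non-local addresses is enabled, reading the
# sysctl file directly. $1 optionally overrides the path to the file.
nonlocal_bind_status() {
    f=${1:-/proc/sys/net/ipv4/ip_nonlocal_bind}
    if [ -r "$f" ] && [ "$(cat "$f")" = "1" ]; then
        echo enabled
    else
        echo disabled
    fi
}

nonlocal_bind_status
# To enable it (as root):
#   echo 1 > /proc/sys/net/ipv4/ip_nonlocal_bind
```

The setting does not survive a reboot, so the echo usually ends up in a startup script or in /etc/sysctl.conf as net.ipv4.ip_nonlocal_bind = 1.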
One of the most difficult aspects of working with the advanced routing features of linux is gaining an understanding of the sequence of events as a packet traverses the kernel space. It is, in fact, the key knowledge needed to grasp the potential of advanced routing scenarios and to troubleshoot successfully when things don't go as planned.
If you are reading this for the first time, stop now and study the kernel packet traveling diagram and the kernel packet handling diagram. These represent two different efforts to describe the order in which different networking subsystems inside the linux kernel have an opportunity to inspect, manipulate and redirect a packet. Understanding this sequence of events is key to harnessing the power of linux networking.
Now, let's examine some of the different commands you can use to manipulate packets at each of these stages. The list below describes the sequence of events for a packet bound for a non-local destination.
Packet Traversal; Non-Local Destination
All of the PREROUTING netfilter hooks are called here. This gives us our first opportunity to inspect and drop a packet. We can perform DNAT on the packet here, so that the destination IP is rewritten before we make a routing decision (at which time the destination address becomes very important). We can also set ToS or an fwmark on the packet at this time. If we want to use an IMQ device for ingress control, we can put our hooks here.
If we are using ipchains, the input chain is traversed.
Any traffic control on the real device on which the packet arrived is now performed.
The input routing stage is traversed by any packet entering the local machine. Here we concern ourselves only with packets which are routed through this machine to another destination. Additionally, iproute2 NAT occurs here [36].
The packet enters the FORWARD netfilter hooks. Here, the packet can be mangled with ToS or fwmark. After the mangle chain is passed, the filter chain will be traversed. For kernel 2.4-based routing devices this will be the location for packet filtering rules. If we are using ipchains, the forward chain would be traversed here instead of the netfilter FORWARD hooks.
The output chain in an ipchains installation would be traversed here.
The POSTROUTING netfilter hooks are traversed. These include packet mangling, NAT and IMQ for egress.
Finally, the packet is transmitted via the outbound device per traffic control configuration on that outbound device.
The above describes the sequence of events for packets passing through the linux routing device. Let's look at a similar description of the path that packets bound for local destinations take through the kernel.
Packet Traversal; Local Destination
All of the PREROUTING netfilter hooks are called here. This gives us our first opportunity to inspect and drop a packet. We can perform DNAT on the packet here, so that the destination IP is rewritten before we make a routing decision (at which time the destination address becomes very important). We can also set ToS or an fwmark on the packet at this time. If we want to use an IMQ device for ingress control, we can put our hooks here.
If we are using ipchains, the input chain is traversed.
Any traffic control on the real device on which the packet arrived is now performed.
The input routing stage is traversed by any packet entering the local machine. Here we concern ourselves with packets bound for local destinations only.
The INPUT netfilter hooks are traversed. Commonly this is filtering for inbound connections, but can include packet mangling.
The local destination process receives the connection. If there is no open socket, an error is generated.
Naturally, packets need to go out from the machine as well, so let's look at the path for outbound packets which were locally generated.
Packet Traversal; Locally Generated
The process with the open socket sends data.
The routing decision is made. This is frequently called output routing because it is only for packets leaving the system. This routing code is (sometimes?) responsible for selecting the source IP of the outbound packet.
The netfilter OUTPUT hooks are traversed. The basic filter, nat, and mangle hooks are available. This is where DNAT of locally generated packets can take place; SNAT is performed later, in the POSTROUTING hooks.
The output chain in an ipchains installation would be traversed here.
The POSTROUTING netfilter hooks are traversed. These include packet mangling, NAT and IMQ for egress.
Finally, the packet is transmitted via the outbound device per traffic control configuration on that outbound device.
[36] Leonardo calls this "dumb NAT" because the NAT performed by iproute2 at the routing stage is stateless.
Understanding and practically applying the knowledge of how and when to harness the routing features of linux is a matter of experience. Below is a set of examples of how to use the RPDB and multiple routing tables to solve different types of problems. These are but a few simple examples which allude to the flexibility and power available with the complex policy routing system under linux.
Type of Service (ToS) is a field in the header of an IP packet which is sometimes honored by upstream routers. Some routers on the Internet respect the ToS field and others do not. Either way, ToS can be used as part of the decision about where to route a given packet (for a refresher on the keys used for routing to a destination read Section 4.5, “Route Selection”). Because it can be used as part of the routing decision, ToS can be used to select a route separate from the route chosen for normal packets (packets not marked with any ToS).
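As a sketch of routing on ToS, the function below prints the commands which send packets carrying one ToS value to a routing table with its own default route. All three values in the example call are illustrative assumptions: ToS 0x10 (minimize delay), table number 8, and a gateway of 67.17.28.14. The function name is hypothetical.

```shell
#!/bin/sh
# Print commands routing packets with a given ToS via a separate table.
# $1 = ToS value, $2 = routing table number, $3 = gateway for that table
tos_route() {
    echo "ip route add default via $3 table $2"
    echo "ip rule add tos $1 table $2"
    # The routing cache must be flushed before the new rule takes effect.
    echo "ip route flush cache"
}

tos_route 0x10 8 67.17.28.14
```

Packets which do not carry the selected ToS never match the rule and continue to use the main routing table.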
FIXME!! Don't forget to point out that fwmark with ipchains/iptables is a decimal number, but that iproute2 uses a hexadecimal number. Thanks to Jose Luis Domingo Lopez for his post to the LARTC list!
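When a mark is set in decimal on the packet filtering side, a small helper (the name is hypothetical) shows how the same value reads in hexadecimal:

```shell
#!/bin/sh
# Convert a decimal fwmark to its hexadecimal spelling.
mark_to_hex() {
    printf '0x%x\n' "$1"
}

mark_to_hex 26    # prints 0x1a
```

Marks below 10 read the same in both bases, which is why the mismatch tends to go unnoticed until a larger mark is chosen.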
The questions summarized in this section should rightly be entered into the FAQ, since they are FAQs on the LARTC list.
There are many places where a linux based router/masquerading device can assist in managing multiple Internet connections. We'll outline here some of the more common setups involving multiple Internet connections and how to manage them with iptables, ipchains, and iproute2. One of the first distinctions you can make when planning how to use multiple Internet connections is what inbound services you expect to host and how you want to split traffic over the multiple links.
In the discussion and examples below, I'll address the issues involved with two separate uplinks to two different providers. I assume the following:
You are not using BGP, and you do not have your own AS. If you are using BGP and have your own AS, you have a different set of problems than the problems described here [37].
You have two netblocks from two different ISPs.
You are funneling your internal network through this routing device, which is performing masquerading/NAT to the Internet.
Additionally, I'll restrict my comments to statically assigned public IP address ranges unless I mention (in particular) dynamically allocated addresses.
In the following sections we'll look at the use of multiple Internet connections first in terms of outbound traffic only, then in terms of inbound traffic only. After that, we'll look at using multiple Internet connections for handling both inbound and outbound services.
There are two main uses for multiple Internet links connected to the same internal network. One common use is to select an outbound link based on the type of outbound service. The other is to split traffic arbitrarily across multiple ISPs for reasons like failover and to accommodate greater aggregate bandwidth than would be available on a single uplink.
If your need is the latter, please consult the documentation on the LARTC site, as it does a good job of summarizing the issues involved and describes how to accomplish this. This type of use of multiple Internet connections means that (from the perspective of the linux routing device), there is a multipath default route. The LARTC documentation remarks that Julian Anastasov's patches "make things nicer to work with." The patches to which the LARTC documents are referring are Julian's dead gateway detection patches (at least) which can help the linux routing device provide Internet service to the internal network when one of the links is down. See here for Julian's route work.
In the remainder of this section, we'll discuss how to classify traffic for different ISPs, how to handle the packet filtering for this sort of classification scheme, and how to create routing tables appropriate for the task at hand. If anything at all seems unclear in this section, you may find a quick re-reading of the advanced routing overview quite fruitful.
The simplest way to split Internet access into two separate groups
is by source IP of the outbound packet. This can be done most
simply with ip rule and a second routing table.
We'll assume that masq-gw in the example network gets a second,
low cost network connection through a DSL vendor.
The DSL IP on masq-gw will be 67.17.28.12 with a gateway of
67.17.28.14. We'll assume that this is for outbound connectivity
only, and that the IP is active on eth4 of the masq-gw machine.
Before beginning let's outline the process we are going to follow.
Copy the main routing table to another routing table and set the alternate default route [38].
Use iptables/ipchains to mark traffic with fwmark.
Add a rule to the routing policy database.
Test!
Here's a short snippet of shell which you may find handy for copying one routing table to another; see the full script for a more generalized example.
Example 10.1. Multiple Outbound Internet links, part I; ip route
|
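A minimal sketch of such a copy follows. It reads the output of ip route show table main and prints, rather than runs, the commands which would populate the second table. The table number 4 and the function name are assumptions; the gateways are the ones used in this section. The sample input at the end stands in for a real routing table.

```shell
#!/bin/sh
# Read "ip route show table main" output on stdin and print commands
# which recreate every non-default route in table $1, followed by a
# default route via gateway $2.
copy_table_cmds() {
    grep -v '^default' | while read -r route; do
        echo "ip route add table $1 $route"
    done
    echo "ip route add table $1 default via $2"
}

# On masq-gw, one would run:
#   ip route show table main | copy_table_cmds 4 67.17.28.14
# Here, two sample lines stand in for the real routing table.
printf '%s\n' '192.168.99.0/24 dev eth0 scope link' \
              'default via 205.254.211.254 dev eth1' |
    copy_table_cmds 4 67.17.28.14
```

Piping the printed commands to a root shell applies them. Afterward, ip route show table 4 should list the same connected routes as the main table, with only the default route differing.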
Now, exactly what have we just done? We have created two routing
tables on masq-gw each of which has a different default gateway.
We have successfully accomplished the first part of our
preparations.
Now, let's mark the traffic we would like to route conditionally. We'll use iptables to select traffic bound for destination ports 80 and 443 originating in the main office desktop network.
Example 10.2. Multiple Outbound Internet links, part II; iptables
|
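Rules of this kind can be sketched as follows. The source network 192.168.99.0/24, the interface eth4 and the address 67.17.28.12 come from this section; the mark value 4 and the function name are assumptions. Only the SNAT rule for the DSL link is shown. The other uplink needs an equivalent rule naming its own interface and address.

```shell
#!/bin/sh
# Print rules which mark web traffic from one network and SNAT whatever
# leaves via the DSL interface.
# $1 = source network, $2 = fwmark, $3 = DSL interface, $4 = DSL address
mark_web_cmds() {
    # Marking happens in PREROUTING, before the routing decision is made.
    for port in 80 443; do
        echo "iptables -t mangle -A PREROUTING -s $1 -p tcp --dport $port -j MARK --set-mark $2"
    done
    # SNAT happens in POSTROUTING, after the outbound interface is known.
    echo "iptables -t nat -A POSTROUTING -o $3 -j SNAT --to-source $4"
}

mark_web_cmds 192.168.99.0/24 4 eth4 67.17.28.12
```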
With these iptables lines we have instructed netfilter to mark packets matching these criteria with the fwmark and we have prepared the NAT rules so that our outbound packets will originate from the correct IPs.
Once again, it is important to realize that the fwmark added to a packet is only valid and discernible while the packet is still on the host running the packet filter. The fwmark is stored in a data structure the kernel uses to track the packet. Because the fwmark is not a part of the packet itself, the fwmark is lost as soon as the packet has left the local machine. For more detail on the use of fwmark, see Section 10.3.2, “Using fwmark for Policy Routing”.
iproute2 supports the use of fwmark as a selector for rule lookups, so we can use fwmarks in the routing policy database to cause packets to be conditionally routed based on that fwmark. This can lead to great complexity if a machine has multiple routing tables, packet filters, and other fancy networking tools, such as NAT or proxies. Caveat emptor.
A convention I find sensible is to use the same number for a routing table and fwmark where possible. This simplifies the maintenance of the systems which are using iproute2 and fwmark, especially if the table identifier and fwmark are set in a configuration file with the same variable name. Since we are testing this on the command line, we'll just make sure that we can add the rules first.
Example 10.3. Multiple Outbound Internet links, part III; ip rule
|
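A sketch of the rule, following the convention just described of using one number for both the fwmark and the routing table (the value 4 and the function name are assumptions):

```shell
#!/bin/sh
# Print the RPDB rule sending packets carrying fwmark $1 to table $2.
fwmark_rule_cmds() {
    echo "ip rule add fwmark $1 table $2"
    # Flush the routing cache so that the new rule takes effect at once.
    echo "ip route flush cache"
}

fwmark_rule_cmds 4 4
```

After running the printed commands as root, ip rule show should list the new rule ahead of the lookup in the main table.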
The last piece is in place. Now, users in the 192.168.99.0/24 subnet who are browsing the Internet should be using the DSL line instead of the T1 line for connectivity.
In order to verify that traffic is indeed getting marked and routed appropriately, you should use tcpdump to profile the outbound traffic on each link at the same time as you generate outbound traffic on both links.
The above is a cookbook example of categorizing traffic, and sending the traffic out across different providers. To my knowledge, the commonest reason to use this sort of solution is to separate traffic by importance and use a reliable (and perhaps more costly) link for the more important traffic while reserving the less costly Internet connection for other connections. In the above illustrative case, we have simply selected the web traffic for the less reliable (DSL) provider.
Once again, if you would like to split load over multiple links regardless of classification of traffic, then you really want a multipath default route, which is described and documented very well in the LARTC HOWTO.
There are many different ways to handle hosting servers to multiple ISPs, and most of them are out of the scope of this document. If you are in need of this sort of advanced networking, you probably already know where to research. If not, I'd suggest starting your research in load balancing, global load balancing, failover, and layer 4-7 switching. These are networking tools which can facilitate the management of a highly available service.
Publishing the same service on two different ISPs can be a formidable challenge. While this is possible using some of the advanced networking features under linux, one should understand the greater issues involved with publishing a service on two public IPs, especially if the idea is to provide service to the general Internet even if one of the ISPs goes down. For a thorough examination of the topics involved with load balancing of all kinds, see Chandra Kopparapu's book Load Balancing Servers, Firewalls and Caches.
If you are aware of the many difficult issues involved in handling inbound connections to a network, and still want to publish a service on two different ISPs (perhaps before you have a more robust load balancing/upper layer switching technology in place), you'll find the recipe below.
Before we examine the recipe, let's look at a complex scenario to see what the crucial points are. If you do not have the kernel packet traveling diagram memorized, you may wish to refer to it in the following discussion. One other item to remember is that routing decisions are stateless [39].
We'll assume that the client IP is a fixed IP (64.70.12.210) and
we'll discuss how this client IP would reach each of the services
published on masq-gw's two public networks. The IPs used for
the services will be 67.17.28.10 and 205.254.211.17.
Now, whether you are using NAT with iproute2 or
with iptables, you'll run across the problem outlined here. Here
is the flow of the packet through masq-gw to the server and back
to the client.
Inbound NAT to the same server via two public IPs in two different networks
inbound packet from 64.70.12.210 to 67.17.28.10 arrives on eth4
packet is accepted, rewritten, and routed; from 64.70.12.210 to 192.168.100.17; if iptables DNAT, packet is rewritten in PREROUTING chain of nat table, then routed; if iproute2, packet is routed and rewritten simultaneously
rewritten packet is transmitted out eth0
isolde receives packet, accepts, responds
inbound packet from 192.168.100.17 to 64.70.12.210
routing decision is made; default route (via 205.254.211.254) is selected; if iproute2 is used, packet is also rewritten from 67.17.28.10 to 64.70.12.210
if iptables DNAT is used, connection tracking will take care of rewriting this packet from 67.17.28.10 to 64.70.12.210
packet is transmitted out eth1
This is the problem! The packet may have the correct source address, but it is leaving via the wrong interface. Many ISPs filter traffic entering their network and will block traffic from your network with source IPs outside your allocated range. To an ISP this looks like spoofed traffic.
The solution is marvelously elegant and simple. Select one IP on
the internal server which will be reachable via one provider and one
IP which will be reachable via the other provider. By using two IP
addresses on the internal machine, we can use ip
rule on masq-gw to select a routing table with a
different default route based upon the source IP of the response
packets to clients. Below, we'll assume the same routing tables as
in the previous section (cf.
Section 10.4.1, “Outbound traffic Using Multiple Connections to the
Internet”).
Here we have a server isolde which needs to be accessible via two
different public IP addresses. We'll add an IP address to isolde
so that it is reachable on 192.168.100.10 as well as 192.168.100.17.
Then, the following rules on masq-gw will ensure that packets are
rewritten and routed in order to avoid the problem pointed out
above.
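The approach can be sketched roughly as follows. This is a guess at the shape of the configuration, not a tested recipe. It assumes that the alternate routing table from the previous section is table 4, that 67.17.28.10 is paired with the new address 192.168.100.10 and 205.254.211.17 with 192.168.100.17, and that the kernel supports iproute2 stateless NAT (`ip route add nat`, as used by the static NAT script later in this document). The hypothetical `inbound_nat` helper only prints the commands so that they can be reviewed before being run as root on masq-gw.

```shell
#!/bin/sh
# Sketch only. ALT_TABLE and the address pairing are assumptions.
ALT_TABLE=${ALT_TABLE:-4}   # table whose default route points at the second provider

# Print the NAT route and the matching rule for one public/internal pair.
# $1 = public IP, $2 = internal IP, $3 = optional routing table for replies
inbound_nat () {
  # inbound: rewrite the destination from the public IP to the internal IP
  echo "ip route add nat $1 via $2"
  # outbound: rewrite the source back, optionally selecting a routing table
  if [ -n "$3" ]; then
    echo "ip rule add nat $1 from $2 table $3"
  else
    echo "ip rule add nat $1 from $2"
  fi
}

# On isolde, the second address would be added first, for example:
#   ip addr add 192.168.100.10/24 dev eth0
inbound_nat 67.17.28.10    192.168.100.10 "$ALT_TABLE"
inbound_nat 205.254.211.17 192.168.100.17
echo "ip route flush cache"
```

Because the rule matches on the source address of the reply, the choice of internal address decides which default route, and therefore which provider, carries the response. Rule priorities and packet filtering are left out of this sketch.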
Example 10.4. Multiple Internet links, inbound traffic; using iproute2 only [40]
[37] Anybody who has any experience with linux as a firewall behind a BGP device? Linux as a firewall/router running BGP? Thoughts? Things I should include here? Yeah, I know about Zebra, but I haven't ever used it.
[38] Sometimes it may not be quite proper to simply copy the main routing table to another routing table. You may want a subset of hosts on the internal network to access the alternate link. Anybody have any sage advice here for the newbie in multiple routing tables?
[39] The following discussion is actually a restatement of Wes Hodges' posting on his solution to this problem.
[40] This example makes no reference to packet filtering. If you are reading this, I assume you are competent at determining the packet filtering issues. If you have doubts about what rules to add, see Section 5.4, “Stateless NAT and Packet Filtering”.
Here are some scripts which may come in handy for manipulating different features of the linux networking stack. If you'd like, you can get a tarball of these scripts to take home with you.
The proxy ARP script was written before the kernel supported proxy ARP natively. If you simply want proxy ARP to work, then you need only enable it in your 2.4 kernel. If you require more control than afforded by the kernel proxy ARP functionality and you wish to recompile iproute2 and your kernel, you can use the iproute2 extension, ip arp. Otherwise, you might try this script.
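Enabling the kernel's own proxy ARP, the first option mentioned above, comes down to writing 1 to the per-interface `proxy_arp` entry under /proc/sys/net/ipv4/conf/. A minimal sketch, in which the function names are hypothetical and eth0 is only an example interface:

```shell
#!/bin/sh
# Sketch: enable the kernel's built-in proxy ARP on one interface.

# Print the /proc control file for interface $1.
proxy_arp_file () {
  echo "/proc/sys/net/ipv4/conf/$1/proxy_arp"
}

# Write 1 to the control file; needs root and an existing interface.
enable_proxy_arp () {
  file=$(proxy_arp_file "$1")
  if [ -w "$file" ]; then
    echo 1 > "$file"
  else
    echo "cannot write $file (not root, or no such interface?)" >&2
    return 1
  fi
}

# Show which file would be touched; nothing is changed here.
echo "control file: $(proxy_arp_file eth0)"
```

Calling `enable_proxy_arp eth0` as root would then switch the feature on for eth0. The setting does not survive a reboot unless it is reapplied at boot time.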
Example 11.1. Proxy ARP SysV initialization script
#! /bin/sh -
#
# proxy-arp Set proxy-arp settings in arp cache
#
# chkconfig: 2345 15 85
# description: using the arp command line utility, populate the arp
# cache with IP addresses for hosts on different media
# which share IP space.
#
# Copyright (c)2002 SecurePipe, Inc. - http://www.securepipe.com/
#
# This program is free software; you can redistribute it and/or modify it
# under the terms of the GNU General Public License as published by the
# Free Software Foundation; either version 2 of the License, or (at your
# option) any later version.
#
# This program is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
# or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
# for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software Foundation,
# Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
#
# -- written initially during 1998
# 2002-08-14; Martin A. Brown <mabrown@securepipe.com>
# - cleaned up and commented extensively
# - joined the process parsimony bandwagon, and eliminated
# many unnecessary calls to ifconfig and awk
#
gripe () { echo "$@" >&2; }
abort () { gripe "Fatal: $@"; exit 1; }
CONFIG=${CONFIG:-/etc/proxy-arp.conf}
[ -r "$CONFIG" ] || abort $CONFIG is not readable
case "$1" in
  start)
    # -- create proxy arp settings according to
    #    table in the config file
    #
    grep -Ev '^#|^$' "$CONFIG" | {
      while read INTERFACE IPADDR ; do
        [ -z "$INTERFACE" -o -z "$IPADDR" ] && continue
        arp -s $IPADDR -i $INTERFACE -D $INTERFACE pub
      done
    }
    ;;
  stop)
    # -- clear the cache for any entries in the
    #    configuration file (the same file used by start)
    #
    grep -Ev '^#|^$' "$CONFIG" | {
      while read INTERFACE IPADDR ; do
        [ -z "$INTERFACE" -o -z "$IPADDR" ] && continue
        arp -d $IPADDR -i $INTERFACE
      done
    }
    ;;
  status)
    arp -an | grep -i perm
    ;;
  restart)
    $0 stop
    $0 start
    ;;
  *)
    echo "Usage: proxy-arp {start|stop|status|restart}"
    exit 1
    ;;
esac
exit 0
#
# - end of proxy-arp
Example 11.2. Proxy ARP configuration file
#
# Proxy ARP configuration file
#
# -- This is the proxy-arp configuration file. A sysV init script
# (proxy-arp) reads this configuration file and creates the
# required arp table entries.
#
# Copyright (c)2002 SecurePipe, Inc. - http://www.securepipe.com/
#
# This program is free software; you can redistribute it and/or modify it
# under the terms of the GNU General Public License as published by the
# Free Software Foundation; either version 2 of the License, or (at your
# option) any later version.
#
# This program is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
# or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
# for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software Foundation,
# Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
#
#
# -- file was created during 1998
# 2002-08-15; Martin A. Brown <mabrown@securepipe.com>
# - format unchanged
# - added comments
#
# -- field descriptions:
# field 1 this field contains the ethernet interface on which
# to advertise reachability of an IP.
# field 2 this field contains the IP address for which to advertise
#
# -- notes
#
# - white space, lines beginning with a comment and blank lines are ignored
#
# -- examples
#
# - each example is commented with an English description of the
# resulting configuration
# - followed by a pseudo shellcode description of how to understand
# what will happen
#
# -- example #0; advertise for 10.10.15.175 on eth1
#
# eth1 10.10.15.175
#
# for any arp request on eth1; do
# if requested address is 10.10.15.175; then
# answer arp request with our ethernet address from eth1 (so
# that the requestor sends IP packets to us)
# fi
# done
#
# -- example #1; advertise for 172.30.48.10 on eth0
#
# eth0 172.30.48.10
#
# for any arp request on eth0; do
# if requested address is 172.30.48.10; then
# answer arp request with our ethernet address from eth0 (so
# that the requestor sends IP packets to us)
# fi
# done
#
# -- add your own configuration here
# -- end /etc/proxy-arp.conf
#
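To preview what the init script would do with a given configuration, its parsing loop can be run with `echo` in front of the `arp` call. The sketch below reads a configuration from standard input instead of /etc/proxy-arp.conf; the `preview_proxy_arp` name is hypothetical.

```shell
#!/bin/sh
# Dry run of the start) branch: print the arp commands instead of running them.
preview_proxy_arp () {
  grep -Ev '^#|^$' | while read INTERFACE IPADDR ; do
    # skip lines that do not carry both fields
    if [ -z "$INTERFACE" ] || [ -z "$IPADDR" ]; then
      continue
    fi
    echo "arp -s $IPADDR -i $INTERFACE -D $INTERFACE pub"
  done
}

# The two examples from the configuration file above.
preview_proxy_arp <<'EOF'
# interface  address
eth1 10.10.15.175

eth0 172.30.48.10
EOF
```

Each surviving line yields one published ARP entry; comment lines and blank lines produce nothing.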
The static NAT script (Example 11.3) removes all NAT route entries and then all RPDB entries other than the three default entries and any entry containing "iif lo". It then populates the RPDB and creates NAT route entries according to the configuration file. Use this script with caution if you have customized your RPDB.
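Which RPDB entries would be removed can be checked ahead of time by applying the script's own filter to a rule listing. In the sketch below the sample listing is invented for illustration (its priorities and addresses are borrowed from the configuration examples, and the fwmark rule stands in for a local customization); only the three default entries are standard. The `rules_to_flush` name is hypothetical.

```shell
#!/bin/sh
# Print the rules that the flush step would delete: everything except
# priorities 0, 32766, 32767 and any rule mentioning "iif lo".
rules_to_flush () {
  grep -Ev '^(0|32766|32767):|iif lo'
}

# Sample listing, invented for illustration.
rules_to_flush <<'EOF'
0: from all lookup local
990: from 192.168.73.15 to 192.168.71.0/24 lookup main
1000: from 192.168.73.15 lookup main map-to 10.74.1.8
5000: from all fwmark 0x1 lookup 4
32766: from all lookup main
32767: from all lookup default
EOF
```

On a live system the input would come from `ip rule show` instead of the sample. The unrelated rule at priority 5000 appears in the output as well, which is the situation the caution above refers to.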
Example 11.3. Static NAT SysV initialization script
#! /bin/sh -
#
# nat; start and stop network address translations using iproute2 tools
#
# chkconfig: 345 45 55
# description: iproute2 tools allow for sophisticated routing, network
# address translation, and policy based routing. This script
# generalizes static NAT mappings and exceptions.
#
# Copyright (c)2002 SecurePipe, Inc. - http://www.securepipe.com/
#
# This program is free software; you can redistribute it and/or modify it
# under the terms of the GNU General Public License as published by the
# Free Software Foundation; either version 2 of the License, or (at your
# option) any later version.
#
# This program is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
# or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
# for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software Foundation,
# Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
#
#
# -- written initially, 2002-03-02; -MAB
# 2002-08-14; Martin A. Brown <mabrown@securepipe.com>
# - cleaned up and commented the code a bit
# - altered the script to provide support for NAT from user-specified
# networks instead of assuming that anything from 0/0 should be
# translated
# 2002-08-30; Martin A. Brown <mabrown@securepipe.com>
# - add configuration setting to flush all NAT rules and routes before
# installing new rules and routes
# - add a ./nat flush option
# 2003-01-31; Matthew Callaway <matt@securepipe.com>
# - add validation routines
# 2003-02-05; Martin A. Brown <mabrown@securepipe.com>
# - oversight identified by Shawn Balestracci; not all NAT rules
# were flushed--we were looking only for map-to, not the exclude
# rules as well
gripe () { echo "$@" >&2; }
abort () { gripe "Fatal: $@"; exit 1; }
CONFIG=${CONFIG:-/etc/sysconfig/static-nat}
[ -r "$CONFIG" ] || abort $CONFIG is not readable
isIP () {
  # -- this function validates a variable as a valid IP address or CIDR network;
  #    the 0/0 shorthand used in the configuration file is accepted as well
  #
  VAR=$1
  echo "${VAR}" | grep -Eq \
    '^[[:digit:]]{1,3}\.[[:digit:]]{1,3}\.[[:digit:]]{1,3}\.[[:digit:]]{1,3}(/[[:digit:]]{1,2})?$|^0/0$'
  [ $? -eq 0 ] && return 0
  return 1
}
isINT () {
  # -- this function validates a variable as a valid integer
  #
  VAR=$1
  echo "${VAR}" | grep -Eq \
    '^[[:digit:]]{1,}$'
  [ $? -eq 0 ] && return 0
  return 1
}
validate () {
  grep -Ev '^#|^$' $CONFIG | while read NET NAT REAL NPRIO RPRIO EXCLUDE ; do
    # Fields 5 and 6 are optional
    if [ -z "$NET" -o -z "$NAT" -o -z "$REAL" -o -z "$NPRIO" ]; then
      echo Syntax error: Missing field: $NET $NAT $REAL $NPRIO $RPRIO $EXCLUDE
      exit 1
    fi
    if [ -n "$RPRIO" -a -z "$EXCLUDE" ]; then
      echo Syntax error: $NET $NAT $REAL $NPRIO $RPRIO $EXCLUDE
      echo Field 6 must be used with field 5
      exit 1
    fi
    for ITEM in $NET $NAT $REAL $EXCLUDE ; do
      isIP $ITEM
      if [ $? -ne 0 ]; then
        echo "In line:"
        echo $NET $NAT $REAL $NPRIO $RPRIO $EXCLUDE
        echo $ITEM is not a valid IP or CIDR block
        exit 1
      fi
    done
    for ITEM in $NPRIO $RPRIO; do
      isINT $ITEM
      if [ $? -ne 0 ]; then
        echo "In line:"
        echo $NET $NAT $REAL $NPRIO $RPRIO $EXCLUDE
        echo $ITEM is not an integer
        exit 1
      fi
    done
  done
}
flush () {
  # -- this function should remove all NAT rules and routes
  #
  # -- remove all of the rules, except the three builtins and any IPSec
  #    rule; -MAB;
  #
  ip rule show | grep -Ev '^(0|32766|32767):|iif lo' \
    | while read PRIO NATRULE; do
      ip rule del prio ${PRIO%%:*} $( echo $NATRULE | sed 's|all|0/0|' )
    done
  # -- remove all of the NAT routes
  #
  ip route show table local | grep ^nat | while read NATROUTE; do
    ip route del $NATROUTE
  done
  ip route flush cache;
}
nat () {
  grep -Ev '^#|^$' $CONFIG | while read NET NAT REAL NPRIO RPRIO EXCLUDE ; do
    # <-- set up the route for the NAT IP to turn it into the real IP
    #
    ip route add from $NET nat $NAT via ${REAL%%/*}
    [ "$?" -eq "0" ] || \
      gripe cmd failed: ip route add from $NET nat $NAT via ${REAL%%/*}
    # <-- establish the minimum routing policy database;
    #     this is required so that the outbound packet gets
    #     rewritten to be from the IP which sent us the packet
    #
    ip rule add to $NET nat ${NAT%%/*} from $REAL prio $NPRIO
    [ "$?" -eq "0" ] || \
      gripe cmd failed: ip rule add to $NET nat ${NAT%%/*} from $REAL prio $NPRIO
    # <-- determine if the user has supplied networks or address to be
    #     excluded from the $NETwork address above
    #
    [ ! -z "$RPRIO" ] && [ ! -z "$EXCLUDE" ] && {
      for NETWORK in $EXCLUDE ; do
        ip rule add from $REAL to $NETWORK prio $RPRIO
        [ "$?" -eq "0" ] || \
          gripe cmd failed: ip rule add from $REAL to $NETWORK prio $RPRIO
      done;
    }
  done;
  # <-- We don't want to forget to flush the cache, or the user will
  #     sit around wondering for the next few minutes why the NAT rules
  #     aren't working. After flushing the cache, the NAT rules will
  #     work right away.
  #
  ip route flush cache;
}
# see how we were called
case "$1" in
  start)   validate && nat
           ;;
  stop)    flush
           ;;
  restart) $0 stop; $0 start
           ;;
  status)  ip route show table local | grep ^nat
           ip rule show | grep map-to
           ;;
  *)       echo "usage: nat {start|stop|restart|status}"
           ;;
esac
#
# - end of nat
Example 11.4. Static NAT configuration file
#
# NAT configuration file
#
# -- This file is used to configure NAT routes and rules
# via the iproute2 package. A sysV init script (nat)
# uses this file to set up the routes/rules.
#
#
# Copyright (c)2002 SecurePipe, Inc. - http://www.securepipe.com/
#
# This program is free software; you can redistribute it and/or modify it
# under the terms of the GNU General Public License as published by the
# Free Software Foundation; either version 2 of the License, or (at your
# option) any later version.
#
# This program is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
# or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
# for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software Foundation,
# Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
#
#
# -- file created by Matt Callaway <matt@securepipe.com>
# 2002-03-01; Martin A. Brown <mabrown@securepipe.com>
# - first major revision; added comments
# 2002-08-14; Martin A. Brown <mabrown@securepipe.com>
# - cleaned up the file; added copious commenting and examples
# - provided support for NAT only from specified networks (backwards
# incompatibility added here; benefit is huge flexibility gain)
# 2003-02-10; Martin A. Brown <mabrown@securepipe.com>
# - example #6 added. Thanks for identification and description of
# this scenario, and the example in the format of the other
# examples go to Shawn Balestracci <shawnb@securepipe.com>
#
# -- field descriptions:
# field 1 this field contains a network address. Any packets from
# this network will be translated according to fields two and
# three, with the exception of any networks specified in fields
# 6 and higher
# field 2 contains the NAT IP, the IP that only exists as a publicly
# reachable IP for an internal host
# field 3 contains the real IP of the machine, usually an internal IP
# field 4 contains the priority for the NAT rule itself in the RPDB
# field 5 contains the priority for the routing rule in the RPDB. In
# order for the internal networks to reach the real IP of the
# server/host, this priority must be higher than the priority
# for the NAT rule. **lower numbers == higher priority**
# field 6+ contains a whitespace separated list of networks which
# should be able to reach the real IP (field 3) directly.
# The entries into the rule policy database (RPDB) for these
# networks will prevent packets from real-IP to dest-network
# from being rewritten with the NAT IP as the source IP.
# Networks specified here should be subnets of the network
# specified in field 1.
#
# -- notes
#
# - white space, lines beginning with a comment and blank lines are ignored
# - field 5 should always be a lower number (higher priority) than field 4
# - fields 5 and 6+ are optional
# - fields 5 and 6+ must be used together, if used at all
#
# -- examples
#
# - each example is commented with an English description of the network
# address translation which will occur
# - followed by a pseudo shellcode description of how to understand
# exactly what the NAT will look like
#
# -- example #1; NAT a single IP from anywhere
#
# 0/0 10.10.0.14 172.31.254.1 1000
#
# for packets from any address (0/0); do
# if destination_address is 10.10.0.14 ; then
# rewrite destination address from 10.10.0.14 to 172.31.254.1
# fi
# done
#
# -- example #2; NAT an entire network (from anywhere)
#
# 0/0 10.13.0.0/16 172.17.0.0/16 1000
#
# for packets from any address (0/0); do
# if destination_address is in 10.13.0.0/16 ; then
# rewrite destination address from 10.13.x.x to 172.17.x.x
# fi
# done
#
# -- example #3; NAT an entire network, but only from a specified network
#
# 10.10.0.0/16 10.15.0.0/24 192.168.0.0/24 1000
#
# if packet is from 10.10.0.0/16 ; then
# if destination_address is in 10.15.0.0/24 ; then
# rewrite destination address from 10.15.0.x to 192.168.0.x
# fi
# fi
#
# -- example #4; NAT an entire network, but only from a specified network;
# make an exception for certain IP ranges
#
# 10.10.0.0/16 10.15.2.0/24 192.168.2.0/24 1000 990 10.10.38.0/24
#
# if packet is from 10.10.0.0/16 and not from 10.10.38.0/24 ; then
# if destination_address is in 10.15.2.0/24 ; then
# rewrite destination address from 10.15.2.x to 192.168.2.x
# fi
# fi
#
# -- example #5; NAT a single IP from anywhere; don't NAT if from specified
# IP ranges
#
# 0/0 10.74.1.8 192.168.73.15 1000 990 192.168.71.0/24 192.168.70.0/24
#
# for packets from any address except 192.168.71.0/24 and 192.168.70.0/24; do
# if destination_address is 10.74.1.8 ; then
# rewrite destination address from 10.74.1.8 to 192.168.73.15
# fi
# done
#
# -- example #6; NAT to the same IP differently based on the source
# network IP ranges
#
# 0/0 10.74.1.8 192.168.73.15 1000
# 192.168.71.0/24 192.168.71.15 192.168.73.15 400
# 192.168.70.0/24 192.168.71.15 192.168.73.15 400
#
# N.B., the RPDB must traverse lines two and three first, hence the higher
# priority. If the source network is not 192.168.{71,70}.0/24 then
# we'll meet the next entry, 1000.
# N.B., the third entry in this example will cause an RTNETLINK: file
# exists error, because there is already an entry in the local
# routing table for 192.168.71.15 --NAT--> 192.168.73.15. Known bug.
#
# for packets from 192.168.71.0/24 or 192.168.70.0/24; do
# if destination_address is 192.168.71.15 ; then
# rewrite destination address from 192.168.71.15 to 192.168.73.15
# fi
# done
#
# for packets from any address except 192.168.71.0/24 and 192.168.70.0/24; do
# if destination_address is 10.74.1.8 ; then
# rewrite destination address from 10.74.1.8 to 192.168.73.15
# fi
# done
#
# -- add your own configuration here
# -- end /etc/sysconfig/static-nat
#
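To see how a configuration line turns into iproute2 commands, the `nat` function of the init script can be mirrored with `echo` in place of the real calls. This is a sketch for inspection only; `preview_nat` is a hypothetical name and the input is example #4 from above.

```shell
#!/bin/sh
# Dry run: print the commands the nat function would issue for each line.
preview_nat () {
  grep -Ev '^#|^$' | while read NET NAT REAL NPRIO RPRIO EXCLUDE ; do
    # NAT route: packets addressed to the NAT IP are rewritten to the real IP
    echo "ip route add from $NET nat $NAT via ${REAL%%/*}"
    # NAT rule: replies from the real IP are rewritten to come from the NAT IP
    echo "ip rule add to $NET nat ${NAT%%/*} from $REAL prio $NPRIO"
    # exception rules: these networks reach the real IP without translation
    if [ -n "$RPRIO" ] && [ -n "$EXCLUDE" ]; then
      for NETWORK in $EXCLUDE ; do
        echo "ip rule add from $REAL to $NETWORK prio $RPRIO"
      done
    fi
  done
  echo "ip route flush cache"
}

echo "10.10.0.0/16 10.15.2.0/24 192.168.2.0/24 1000 990 10.10.38.0/24" | preview_nat
```

The exception rule is printed with priority 990, ahead of the NAT rule at 1000, which matches the note that field 5 must be the lower number.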
Invariably, troubles and misconfigurations creep into networks. New devices get connected and added to a network. Old devices are removed, and something seemingly unrelated breaks. Troubleshooting is really a test in discerning patterns.
My favored method for solving problems is to start with the simplest elements, verifying correct operation and proceeding to the next layer or element until I have isolated the problem element. If you are lucky, you'll know from a symptom where the problem is likely to be, but more often, you'll have to start at the bottom of the networking hierarchy and verify each successive layer.
The first thing to consider whenever somebody reports a strange networking problem is any recent change. What has changed recently in the network? Have any new machines been added? Is the user using a service which was recently decommissioned? Did a machine (firewall, mail server, DNS resolver) recently reboot? Did all of the services restart?
The content in this part is intended to function as supporting reference material for the preceding chapters. Below you will find a reference for many common linux command line utilities as well as the example network map and network description. A set of links to external resources and a troubleshooting guide round out the content in this part of the document.
The below network map is a fictional network. This network should provide examples of several of the common functions of a linux box in networking situations. The hostnames used in the documentation are taken from this network map. Where practical, I have tried to simulate real-world situations throughout the documentation, to ease the practical application of the concepts.

Because this guide focusses on linux networking, I have omitted discussion
of the ISDN routers and unless relevant, the layer 2 devices (hubs and
switches). The remaining hosts on the example network can be
broken into three main categories: single-homed hosts (servers and
workstations), masquerading (cf. NAT) routers, and public routers.
For those viewing the above netmap from a security perspective,
wan-gw and masq-gw would both run
packet filters (at least), which turns the network into a traditional
screened-subnet firewall.
The LAN shown above is a common leaf-network scenario for business offices. Frequently, there are one or two machines on a public network segment, a masquerading firewall, and one or more networks behind the masquerading firewall. Please do not consider this example network the only way to interconnect devices. The above is one method of designing a network--there are many practical issues to weigh in network design. I am deliberately skirting the issue of network design here and proposing an example network similar to or a superset of a commonly found network design.
It is rare for a business which is not an ISP to own a class C sized network today, but I have nonetheless chosen a class C sized public network as our fictitious company's network.
In addition to the network map above, you may find the following network address and host address information handy as you read through the various examples and documentation based on this fictional network.
Table A.1. Example Network; Network Addressing
| network address | function |
|---|---|
| 205.254.211.0/24 | public ISP-allocated network |
| 192.168.100.0/24 | internal server network |
| 192.168.99.0/24 | main office desktop network |
| 192.168.98.0/24 | branch office desktop network |
Host addressing information is summarized in the following table.
Table A.2. Example Network; Host Addressing
| hostname | interface | IP address | MAC address |
|---|---|---|---|
| isolde | eth0 | 192.168.100.17/24 | 00:80:c8:e8:4b:8e |
| tristan | eth0 | 192.168.99.35/24 | 00:80:c8:f8:4a:51 |
| morgan | eth0 | 192.168.98.82/24 | 00:80:c8:f8:4a:53 |
| masq-gw | eth0 | 192.168.100.254/24 | 00:80:c8:f8:5c:71 |
| masq-gw | eth1 | 205.254.211.179/24 | 00:80:c8:f8:5c:72 |
| masq-gw | eth2 | 192.168.99.254/24 | 00:80:c8:f8:5c:73 |
| masq-gw | eth3 | 192.168.100.2/30 | 00:80:c8:f8:5c:74 |
| wan-gw | eth0 | 205.254.211.254/24 | [ unknown ] |
| wan-gw | wan0 | 205.254.209.73/30 | [ n/a ] |
| isdn-router | (Ethernet) | 192.168.99.1/24 | 00:c0:7b:45:6a:39 |
| branch-router | (Ethernet) | 192.168.98.254/24 | 00:c0:7b:37:af:91 |
| service-router | (Ethernet) | 192.168.100.1/24 | 00:c0:7b:7d:00:c8 |
I have referred liberally to this example network throughout this documentation. Any example commands in the documentation assume the network configuration as shown on this network map.
Additionally, hosts which are not part of this (fictional) network but
appear in the documentation will appear under the names real-server
and real-client. This convention exists simply to disambiguate
real-world examples from the machines in the fictional network.