Netfilter
Netfilter is a framework in the Linux kernel that allows modules to observe and modify packets
as they pass through the protocol stack. Kernel services or modules can register custom
hooks/filters by both protocol family (e.g. PF_INET) and the point in packet processing
(e.g. NF_IP_LOCAL_IN) at which the filter is to be invoked.
The facility is currently available for IPv4, IPv6 and DECnet but could be extended to other protocol
families. Each protocol family can provide several processing points in the stack where a packet of
that protocol can be passed to a filter. These points are referred to as hook points or hook types.
Hence, when registering a custom hook, both the protocol family and the protocol-specific hook type
must be specified.
A statically allocated array of lists defined in net/core/netfilter.c holds all the hooks registered for
each protocol family and hook type. NF_MAX_HOOKS, the maximum number of hook types a protocol
can support, is defined as 8 in include/linux/netfilter.h.
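As a rough picture of that table, the following user-space sketch models it as a two-dimensional array of list heads indexed by protocol family and hook type. The names NPROTO_SKETCH, nf_hooks_init(), nf_hook_list() and list_is_empty() are hypothetical helpers introduced here for illustration, and the list_head type is a minimal stand-in for the kernel's.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for the kernel's circular doubly linked list head. */
struct list_head {
    struct list_head *next, *prev;
};

#define NPROTO_SKETCH  32  /* assumption: number of protocol families */
#define NF_MAX_HOOKS   8   /* value quoted in the text */
#define PF_INET        2
#define NF_IP_LOCAL_IN 1

/* One list of registered hooks per (protocol family, hook type) pair. */
static struct list_head nf_hooks[NPROTO_SKETCH][NF_MAX_HOOKS];

/* Point every list head at itself, which marks every list as empty. */
static void nf_hooks_init(void)
{
    for (int pf = 0; pf < NPROTO_SKETCH; pf++) {
        for (int h = 0; h < NF_MAX_HOOKS; h++) {
            nf_hooks[pf][h].next = &nf_hooks[pf][h];
            nf_hooks[pf][h].prev = &nf_hooks[pf][h];
        }
    }
}

/* Look up the list for one protocol family and hook type. */
static struct list_head *nf_hook_list(int pf, int hooknum)
{
    assert(pf >= 0 && pf < NPROTO_SKETCH);
    assert(hooknum >= 0 && hooknum < NF_MAX_HOOKS);
    return &nf_hooks[pf][hooknum];
}

/* A circular list is empty when its head points back at itself. */
static int list_is_empty(const struct list_head *head)
{
    return head->next == head;
}
```

After nf_hooks_init(), nf_hook_list(PF_INET, NF_IP_LOCAL_IN) returns an empty list, since no hook has been registered yet.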
Defining a netfilter hook
Each custom hook is defined using the following nf_hook_ops structure. This structure is passed to
the nf_register_hook function.
struct nf_hook_ops
{
    struct list_head list;

    /* User fills in from here down. */
    nf_hookfn *hook;
    int pf;
    int hooknum;
    /* Hooks are ordered in ascending priority. */
    int priority;
};
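list: links all hooks with a common pf and hooknum into the nf_hooks array.
pf: the protocol family (e.g. PF_INET) of the filter.
hooknum: the protocol-specific hook type identifier (e.g. NF_IP_FORWARD).
hook: a pointer to the hook function, of type nf_hookfn. It is called with the hook number, the
sk_buff, the input and output devices and the okfn, and it returns a verdict such as NF_ACCEPT.
priority: the position of the hook in the list. The standard IPv4 priorities are as follows: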
enum nf_ip_hook_priorities {
    NF_IP_PRI_FIRST = INT_MIN,
    NF_IP_PRI_CONNTRACK = -200,
    NF_IP_PRI_MANGLE = -150,
    NF_IP_PRI_NAT_DST = -100,
    NF_IP_PRI_FILTER = 0,
    NF_IP_PRI_NAT_SRC = 100,
    NF_IP_PRI_LAST = INT_MAX,
};
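To make the pieces concrete, here is a minimal user-space sketch of filling in an nf_hook_ops structure. The sk_buff and net_device types are heavily simplified stand-ins, drop_empty_hook() and its policy are hypothetical, and the hook argument list mirrors the call made in nf_iterate(); it is an illustration under those assumptions, not kernel code.

```c
#include <assert.h>
#include <stddef.h>

/* User-space stand-ins for kernel types (assumptions, heavily simplified). */
struct list_head { struct list_head *next, *prev; };
struct sk_buff { unsigned int len; };
struct net_device { const char *name; };

/* Constants with the values used by 2.4 kernels. */
#define NF_DROP          0
#define NF_ACCEPT        1
#define PF_INET          2
#define NF_IP_LOCAL_IN   1
#define NF_IP_PRI_FILTER 0

typedef unsigned int nf_hookfn(unsigned int hooknum,
                               struct sk_buff **skb,
                               const struct net_device *in,
                               const struct net_device *out,
                               int (*okfn)(struct sk_buff *));

struct nf_hook_ops {
    struct list_head list;
    nf_hookfn *hook;
    int pf;
    int hooknum;
    int priority;
};

/* Hypothetical policy: drop empty packets, accept everything else. */
static unsigned int drop_empty_hook(unsigned int hooknum,
                                    struct sk_buff **skb,
                                    const struct net_device *in,
                                    const struct net_device *out,
                                    int (*okfn)(struct sk_buff *))
{
    (void)hooknum; (void)in; (void)out; (void)okfn;
    assert(skb != NULL && *skb != NULL);
    return (*skb)->len == 0 ? NF_DROP : NF_ACCEPT;
}

/* The caller fills in everything except the list linkage. */
static struct nf_hook_ops drop_empty_ops = {
    .list     = { NULL, NULL },
    .hook     = drop_empty_hook,
    .pf       = PF_INET,
    .hooknum  = NF_IP_LOCAL_IN,
    .priority = NF_IP_PRI_FILTER,
};
```

In a real module, a structure like drop_empty_ops would then be handed to nf_register_hook().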
The nf_register_hook() function, defined in net/core/netfilter.c, adds the nf_hook_ops structure that
defines a custom hook to the appropriate list based on the protocol family and hook type.
Since the list is ordered by ascending priority values, hooks are invoked lowest numerical value
first.
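The ordered insertion can be sketched in user space as follows. The reduced nf_hook_ops structure and the names list_init(), list_add_before() and register_hook_sketch() are hypothetical; the sketch only assumes that a new hook is placed in front of the first entry with a strictly higher priority value.

```c
#include <assert.h>
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

/* Reduced hook descriptor: only the fields the ordering needs. */
struct nf_hook_ops {
    struct list_head list;   /* must stay the first member (see cast below) */
    int priority;
};

static void list_init(struct list_head *head)
{
    head->next = head->prev = head;
}

/* Link 'item' immediately before 'pos'. */
static void list_add_before(struct list_head *item, struct list_head *pos)
{
    item->prev = pos->prev;
    item->next = pos;
    pos->prev->next = item;
    pos->prev = item;
}

/* Insert 'reg' in front of the first entry with a strictly higher
 * priority value, so equal priorities keep their registration order. */
static void register_hook_sketch(struct list_head *head,
                                 struct nf_hook_ops *reg)
{
    struct list_head *i;

    assert(head != NULL && reg != NULL);
    for (i = head->next; i != head; i = i->next) {
        if (reg->priority < ((struct nf_hook_ops *)i)->priority)
            break;
    }
    list_add_before(&reg->list, i);
}
```

Registering hooks with priorities 0, 100 and -200 in that order leaves the list as -200, 0, 100, which is the order in which they would be invoked.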
IP Packet Transmission Through the Netfilter Layer
The NF_HOOK macro translates to a call to the nf_hook_slow() function if the netfilter debug option
is defined or if there are hooks/filters set for the specific protocol family and hook type. Otherwise
it simply passes the sk_buff directly to the okfn function.
/* This is gross, but inline doesn't cut it for avoiding the
   function call in fast path: gcc doesn't inline (needs
   value tracking?). --RR */
#ifdef CONFIG_NETFILTER_DEBUG
#define NF_HOOK nf_hook_slow
#else
#define NF_HOOK(pf, hook, skb, indev, outdev, okfn)             \
(list_empty(&nf_hooks[(pf)][(hook)])                            \
 ? (okfn)(skb)                                                  \
 : nf_hook_slow((pf), (hook), (skb), (indev), (outdev), (okfn)))
#endif
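The fast-path test can be tried out in user space with a reduced version of the macro. In this sketch NF_HOOK_SKETCH, nf_hook_slow_sketch(), deliver() and the slow_calls counter are hypothetical names, the protocol family and device arguments are left out, and the slow path simply records that it ran before calling okfn.

```c
#include <assert.h>
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };
struct sk_buff { int seen_by_slow_path; };

static int slow_calls;   /* counts trips through the slow path */

/* A circular list is empty when its head points back at itself. */
static int list_empty(const struct list_head *head)
{
    return head->next == head;
}

/* Stand-in for nf_hook_slow(): record the visit, then accept the packet. */
static int nf_hook_slow_sketch(struct list_head *hooks, struct sk_buff *skb,
                               int (*okfn)(struct sk_buff *))
{
    (void)hooks;
    assert(skb != NULL && okfn != NULL);
    slow_calls++;
    skb->seen_by_slow_path = 1;
    return okfn(skb);
}

/* With no hooks registered, go straight to okfn; otherwise take the
 * slow path, which decides whether okfn is called at all. */
#define NF_HOOK_SKETCH(hooks, skb, okfn)                     \
    (list_empty(hooks)                                       \
     ? (okfn)(skb)                                           \
     : nf_hook_slow_sketch((hooks), (skb), (okfn)))

/* Hypothetical okfn: pretend the packet was delivered successfully. */
static int deliver(struct sk_buff *skb)
{
    (void)skb;
    return 0;
}
```

With an empty list the packet never touches nf_hook_slow_sketch(), which is the whole point of doing the test in a macro rather than in a function call.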
When the netfilter facility is enabled and the hook list is non-empty, this macro invokes the
nf_hook_slow() function. The nf_hook_slow() function is defined in net/core/netfilter.c. Its task is
to invoke each hook in the specified list and, based on the verdict from the hooks, either pass
the packet to the okfn or drop the packet.
int nf_hook_slow(int pf, unsigned int hook,
                 struct sk_buff *skb,
                 struct net_device *indev,
                 struct net_device *outdev,
                 int (*okfn)(struct sk_buff *))
{
    struct list_head *elem;
    unsigned int verdict;
    int ret = 0;
For a non-linear sk_buff, each fragment's size, offset and page address are stored in the
skb_frag_struct array. If the skb is non-linear (i.e. skb->data_len != 0), skb_linearize() is called to
reorganize all the data into one linear buffer.
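The idea behind linearization can be illustrated with a small user-space sketch. The structures skb_sketch and frag_sketch and the functions below are hypothetical simplifications (a real fragment refers to a page, not a plain pointer), and the sketch assumes the head buffer was obtained from malloc().

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_FRAGS_SKETCH 4

/* One fragment: where it lives, and which bytes of it belong to the packet. */
struct frag_sketch {
    const unsigned char *page;
    unsigned int offset;
    unsigned int size;
};

struct skb_sketch {
    unsigned char *data;      /* linear (head) data, malloc()ed */
    unsigned int len;         /* total bytes: head plus fragments */
    unsigned int data_len;    /* bytes held in fragments only */
    unsigned int nr_frags;
    struct frag_sketch frags[MAX_FRAGS_SKETCH];
};

static int skb_is_nonlinear_sketch(const struct skb_sketch *skb)
{
    return skb->data_len != 0;
}

/* Copy the head and every fragment into one fresh buffer.
 * Returns 0 on success and -1 if the allocation fails. */
static int linearize_sketch(struct skb_sketch *skb)
{
    unsigned int head_len, pos, i;
    unsigned char *buf;

    if (!skb_is_nonlinear_sketch(skb))
        return 0;                       /* already linear, nothing to do */

    assert(skb->len >= skb->data_len);
    head_len = skb->len - skb->data_len;
    buf = malloc(skb->len);
    if (buf == NULL)
        return -1;

    memcpy(buf, skb->data, head_len);
    pos = head_len;
    for (i = 0; i < skb->nr_frags; i++) {
        const struct frag_sketch *f = &skb->frags[i];

        assert(pos + f->size <= skb->len);   /* never write past the buffer */
        memcpy(buf + pos, f->page + f->offset, f->size);
        pos += f->size;
    }

    free(skb->data);
    skb->data = buf;
    skb->data_len = 0;                  /* everything is in the head now */
    skb->nr_frags = 0;
    return 0;
}
```

After a successful call, data_len is zero and the whole packet can be read through the data pointer alone.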
After ensuring the sk_buff is linear, nf_hook_slow() continues. The ip_summed field in the sk_buff
was initialized to 0 (CHECKSUM_NONE) during creation. The objective of this code block is
unclear. It should be remembered, though, that nf_hook_slow() is called for both input and
output processing.
Here the function nf_iterate() is called to execute all the hooks defined for this protocol family and
hook type.
    elem = &nf_hooks[pf][hook];
    verdict = nf_iterate(&nf_hooks[pf][hook],
                         &skb, hook, indev, outdev, &elem, okfn);
On return to nf_hook_slow(), actions are based on the verdict. For an IP packet, a verdict of
NF_QUEUE results in a series of function calls leading to the ipq_enqueue() function defined in
net/ipv4/netfilter/ip_queue.c. It is not understood what conditions might trigger this situation.
If NF_ACCEPT is the verdict from all hooks, the output_maybe_reroute() function, which was
passed into nf_hook_slow() as the okfn(), is invoked with the sk_buff as the parameter. If the packet
is to be dropped, kfree_skb() is called.
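The handling of the verdict can be summarized by the following user-space sketch. The sk_buff here only records what happened to it, the helpers kfree_skb_sketch(), queue_sketch(), send_ok() and finish_verdict() are hypothetical, and the -EPERM return value for a dropped packet is an assumption of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Verdict codes in the numeric order used by 2.4 kernels. */
enum { NF_DROP, NF_ACCEPT, NF_STOLEN, NF_QUEUE, NF_REPEAT };

/* Stand-in packet that only remembers what was done with it. */
struct sk_buff { int freed; int queued; int sent; };

static void kfree_skb_sketch(struct sk_buff *skb) { skb->freed = 1; }
static void queue_sketch(struct sk_buff *skb)     { skb->queued = 1; }

/* Act on the verdict returned by the hook chain. */
static int finish_verdict(unsigned int verdict, struct sk_buff *skb,
                          int (*okfn)(struct sk_buff *))
{
    int ret = 0;

    assert(skb != NULL && okfn != NULL);
    switch (verdict) {
    case NF_ACCEPT:            /* every hook agreed: continue processing */
        ret = okfn(skb);
        break;
    case NF_DROP:              /* some hook rejected the packet */
        kfree_skb_sketch(skb);
        ret = -EPERM;          /* assumption: error code reported on drop */
        break;
    case NF_QUEUE:             /* hand the packet to the queueing code */
        queue_sketch(skb);
        break;
    default:                   /* NF_STOLEN: the hook now owns the packet */
        break;
    }
    return ret;
}

/* Hypothetical okfn: mark the packet as sent. */
static int send_ok(struct sk_buff *skb)
{
    skb->sent = 1;
    return 0;
}
```

Only the NF_ACCEPT branch ever reaches okfn; in every other case the packet leaves the normal path.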
Iterating through the hook chain
The value returned by the hook function determines the action taken by the switch statement. An
immediate return, possibly aborting the send, is made if the value returned is NF_QUEUE,
NF_STOLEN, or NF_DROP. For values of NF_REPEAT or NF_ACCEPT the for loop continues.
        switch (elem->hook(hook, skb, indev, outdev, okfn)) {
        case NF_QUEUE:
            return NF_QUEUE;

        case NF_STOLEN:
            return NF_STOLEN;

        case NF_DROP:
            return NF_DROP;

        case NF_REPEAT:
            *i = (*i)->prev;
            break;

#ifdef CONFIG_NETFILTER_DEBUG
        case NF_ACCEPT:
            break;

        default:
            NFDEBUG("Evil return from %p(%u).\n",
                    elem->hook, hook);
#endif
        }
    }
If all the hook functions return NF_ACCEPT, then NF_ACCEPT is returned to nf_hook_slow.
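The effect of each verdict on the loop, in particular the step back taken for NF_REPEAT, can be seen in this user-space sketch. The hook_entry structure, the single-argument hook signature, chain_append() and the three sample hooks are hypothetical simplifications introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>

enum { NF_DROP, NF_ACCEPT, NF_STOLEN, NF_QUEUE, NF_REPEAT };

struct list_head { struct list_head *next, *prev; };
struct sk_buff { int visits; };   /* counts how many hook calls were made */

struct hook_entry {
    struct list_head list;                      /* first member, so casts work */
    unsigned int (*hook)(struct sk_buff *skb);  /* simplified hook signature */
};

/* Walk the chain; stop early on any verdict other than accept or repeat. */
static unsigned int iterate_sketch(struct list_head *head, struct sk_buff *skb)
{
    struct list_head *i;

    assert(head != NULL && skb != NULL);
    for (i = head->next; i != head; i = i->next) {
        struct hook_entry *elem = (struct hook_entry *)i;

        switch (elem->hook(skb)) {
        case NF_QUEUE:  return NF_QUEUE;
        case NF_STOLEN: return NF_STOLEN;
        case NF_DROP:   return NF_DROP;
        case NF_REPEAT:
            i = i->prev;    /* the loop step then selects this hook again */
            break;
        default:            /* NF_ACCEPT: move on to the next hook */
            break;
        }
    }
    return NF_ACCEPT;
}

/* Append an entry at the tail of the chain. */
static void chain_append(struct list_head *head, struct hook_entry *e)
{
    e->list.prev = head->prev;
    e->list.next = head;
    head->prev->next = &e->list;
    head->prev = &e->list;
}

/* Hypothetical sample hooks. */
static unsigned int count_hook(struct sk_buff *skb)
{
    skb->visits++;
    return NF_ACCEPT;
}

static unsigned int repeat_once_hook(struct sk_buff *skb)
{
    skb->visits++;
    return skb->visits < 2 ? NF_REPEAT : NF_ACCEPT;
}

static unsigned int drop_hook(struct sk_buff *skb)
{
    skb->visits++;
    return NF_DROP;
}
```

A hook that returns NF_REPEAT is simply called again for the same packet, while a hook that returns NF_DROP prevents every later hook in the chain from running.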
The pointer skb->dst refers to the route cache element associated with this packet's source and
destination. In ip_route_output_slow(), rt->u.dst->output was set to ip_output() which is defined
in net/ipv4/ip_output.c.
int ip_output(struct sk_buff *skb)
{
#ifdef CONFIG_IP_ROUTE_NAT
    struct rtable *rt = (struct rtable*)skb->dst;
#endif

    IP_INC_STATS(IpOutRequests);

#ifdef CONFIG_IP_ROUTE_NAT
    if (rt->rt_flags&RTCF_NAT)
        ip_do_nat(skb);
#endif

    return ip_finish_output(skb);
}
The ip_finish_output() function
The ip_finish_output() function sets skb->dev to the output device associated with the route
and sets the protocol type to ETH_P_IP. This indicates that the value 0x0800
must represent an IP packet even if the output device is not an Ethernet device.
__inline__ int ip_finish_output(struct sk_buff *skb)
{
    struct net_device *dev = skb->dst->dev;

    skb->dev = dev;
    skb->protocol = __constant_htons(ETH_P_IP);
Next, the NF_HOOK macro is invoked again. This macro expands to nf_hook_slow() and invokes
all the netfilter hooks defined for PF_INET at the NF_IP_POST_ROUTING level. If the verdict from
all filters is NF_ACCEPT, the okfn(), here ip_finish_output2(), is called as before.

    return NF_HOOK(PF_INET, NF_IP_POST_ROUTING,
                   skb, NULL, dev, ip_finish_output2);
}
There are two mechanisms by which calls to the link layer may be made. If the dst_entry has an
hh_cache pointer, then the hh_cache entry must contain both the hardware header itself and a pointer
to an output function at the device / link layer. The output function is always set to
dev_queue_xmit().
If there is no hh pointer but there is a neighbour pointer, then the neighbour structure must have an
output function pointer. The output function of the neighbour structure is set to
neigh_resolve_output() if the network device needs a hardware header. Otherwise (for a loopback,
point-to-point, or virtual device) it is set to invoke dev_queue_xmit() by the arp_constructor()
function, which is called when each neighbour structure is created. Note the ugly hard-coding of the
hardware header length at 16 bytes.
    if (hh) {
        read_lock_bh(&hh->hh_lock);
        memcpy(skb->data - 16, hh->hh_data, 16);
        read_unlock_bh(&hh->hh_lock);
        skb_push(skb, hh->hh_len);
        return hh->hh_output(skb);
    } else if (dst->neighbour)
        return dst->neighbour->output(skb);
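The choice between the two mechanisms, including the fixed 16-byte copy followed by a push of the real header length, can be modelled in user space as below. All structure and function names ending in _sketch are hypothetical, locking is omitted, and the sketch assumes the cached header is stored right-aligned in the 16-byte area so that it ends exactly where the payload begins.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define HH_DATA_LEN 16   /* the hard-coded copy size noted in the text */

struct sk_buff_sketch {
    unsigned char buf[64];
    unsigned char *data;     /* start of the current payload inside buf */
    unsigned int len;
};

struct hh_cache_sketch {
    unsigned char hh_data[HH_DATA_LEN];   /* header, right-aligned (assumption) */
    unsigned int hh_len;                  /* real header length, at most 16 */
    int (*hh_output)(struct sk_buff_sketch *);
};

struct neighbour_sketch {
    int (*output)(struct sk_buff_sketch *);
};

struct dst_sketch {
    struct hh_cache_sketch *hh;
    struct neighbour_sketch *neighbour;
};

static int finish_output2_sketch(struct dst_sketch *dst,
                                 struct sk_buff_sketch *skb)
{
    struct hh_cache_sketch *hh = dst->hh;

    if (hh) {
        /* There must be at least 16 bytes of headroom before the payload. */
        assert(skb->data - skb->buf >= HH_DATA_LEN);
        memcpy(skb->data - HH_DATA_LEN, hh->hh_data, HH_DATA_LEN);
        skb->data -= hh->hh_len;     /* equivalent of skb_push() */
        skb->len += hh->hh_len;
        return hh->hh_output(skb);
    }
    if (dst->neighbour)
        return dst->neighbour->output(skb);

    return -EINVAL;                  /* no way to build a link-layer header */
}

/* Hypothetical transmit function: report how many bytes would be sent. */
static int xmit_sketch(struct sk_buff_sketch *skb)
{
    return (int)skb->len;
}
```

With a 14-byte header, the first two bytes of the copied area are padding and the packet data pointer ends up on the first real header byte.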
If there is no hardware header structure and no neighbor structure available, then there is no way to
send the packet and it must be dropped. The net_ratelimit() function is used to limit the number of
printk's generated to not more than 1 every 5 seconds to avoid flooding the syslog in case
something is badly amiss in the network setup.
    if (net_ratelimit())
        printk(KERN_DEBUG "ip_finish_output2: No header cache and no neighbour!\n");
    kfree_skb(skb);
    return -EINVAL;
}
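A limiter with the behaviour described above, at most one message every 5 seconds, can be sketched as a simple token counter. This is not the kernel's implementation: the ratelimit_state structure, ratelimit_sketch(), the explicit tick argument and the value of HZ_SKETCH are assumptions made so the example runs in user space, and the kernel's own limiter may additionally allow a short initial burst.

```c
#include <assert.h>
#include <stddef.h>

#define HZ_SKETCH 100                 /* assumption: clock ticks per second */
#define MSG_COST  (5 * HZ_SKETCH)     /* tokens needed for one message */

struct ratelimit_state {
    unsigned long toks;      /* tokens accumulated, capped at MSG_COST */
    unsigned long last;      /* tick of the previous call */
    unsigned int missed;     /* messages suppressed since the last print */
};

/* Returns 1 if a message may be printed at tick 'now', else 0. */
static int ratelimit_sketch(struct ratelimit_state *rs, unsigned long now)
{
    assert(rs != NULL && now >= rs->last);

    rs->toks += now - rs->last;      /* earn one token per elapsed tick */
    rs->last = now;
    if (rs->toks > MSG_COST)
        rs->toks = MSG_COST;         /* no saving up for a burst */

    if (rs->toks >= MSG_COST) {
        rs->toks -= MSG_COST;
        rs->missed = 0;
        return 1;
    }
    rs->missed++;
    return 0;
}
```

Starting with a full token count, the first call succeeds and further calls are refused until 5 seconds' worth of ticks have passed.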