Lecture - 08 - Tcpip Stack in The Linux Kernel

Uploaded by

Hien Le
Copyright
© All Rights Reserved

Divye Kapoor

Pracheer Agarwal
Swagat Konchada
• The Virtual File System (VFS) is the software layer in the kernel that provides a uniform filesystem interface to userspace programs.
• It provides an abstraction within the kernel that lets the kernel work transparently with a variety of filesystems.
• It thus allows many different filesystem implementations to coexist.

• Each socket is implemented as a “file” mounted on the sockfs filesystem.
• file->private_data points to the socket information.
• Inodes provide a method to access the actual data blocks allocated to a file. For sockets, they provide buffer space that can be used to hold socket-specific data.
  struct inode
• Every file is represented in the kernel as an object of the file structure. It requires an inode to be provided to it.
  struct file
struct operations {
    int (*read)(int, char *, int);
    void (*destroy_inode)(struct inode *);
    void (*dirty_inode)(struct inode *);
    int (*write_inode)(struct inode *, int);
    void (*drop_inode)(struct inode *);
    void (*delete_inode)(struct inode *);
};
sizeof(struct operations) = sizeof(function pointer) * 6
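A table of function pointers lets generic code call into a specific implementation without naming it. The following is a minimal userspace sketch of that pattern; struct demo_obj, struct demo_ops and the demo_* functions are hypothetical names, not kernel types.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct demo_obj;

/* Operations table: one slot per action the generic layer may request. */
struct demo_ops {
    int  (*read)(struct demo_obj *obj, char *buf, int len);
    void (*destroy)(struct demo_obj *obj);
};

struct demo_obj {
    const struct demo_ops *ops;  /* behaviour is selected by this table */
    const char *payload;
    int destroyed;
};

/* One concrete implementation: copies up to len bytes of the payload. */
static int demo_read(struct demo_obj *obj, char *buf, int len)
{
    int n = (int)strlen(obj->payload);

    if (n > len)
        n = len;
    memcpy(buf, obj->payload, (size_t)n);
    return n;
}

static void demo_destroy(struct demo_obj *obj)
{
    obj->destroyed = 1;
}

const struct demo_ops demo_file_ops = {
    .read = demo_read,
    .destroy = demo_destroy,
};

/* Generic code never names demo_read(); it only goes through the table. */
int generic_read(struct demo_obj *obj, char *buf, int len)
{
    assert(obj && obj->ops && obj->ops->read);
    return obj->ops->read(obj, buf, len);
}
```

On common platforms all function pointers have the same size and need no padding between them, so a table of N pointers occupies N * sizeof(function pointer) bytes.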
Divye Kapoor
User Space
    socket, bind, listen, connect, send, recv, write, read etc.

Socket Functions (Kernel)
    sys_socket, sys_bind, sys_listen, sys_connect etc. in socket.c

TCP/IP Layer Functions
    inet_create, tcp_v4_connect, tcp_sendmsg, tcp_recvmsg

Ethernet Device Layer
    dev_hard_start_xmit
sys_socket()
    sock_create()
        Allocate a socket object (internally an inode associated with a file object).
        Locate the family requested and call the create function for that family.
        inet_create()
            Lower layer initialization
    sock_map_fd()
        sock_alloc_fd()
            Allocate a file descriptor
        sock_attach_fd()
        fd_install()
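The "locate the family and call its create function" step can be pictured as a lookup in a small registry. This is a simplified userspace model under stated assumptions: the table size, the demo_* names and the struct layouts are invented for illustration and do not mirror the kernel's definitions.

```c
#include <stddef.h>

#define DEMO_NPROTO  4   /* hypothetical number of family slots */
#define DEMO_AF_INET 2   /* same numeric value as the usual AF_INET */

struct demo_socket {
    int family;
    int protocol;
    int initialised;
};

struct demo_family {
    int family;
    int (*create)(struct demo_socket *sock, int protocol);
};

static const struct demo_family *demo_families[DEMO_NPROTO];

/* A protocol family registers its create function once, at start-up.
 * Returns 0 on success, -1 if the slot is invalid or already taken. */
int demo_sock_register(const struct demo_family *fam)
{
    if (fam->family < 0 || fam->family >= DEMO_NPROTO ||
        demo_families[fam->family] != NULL)
        return -1;
    demo_families[fam->family] = fam;
    return 0;
}

/* Look up the requested family and hand over to its create function. */
int demo_sock_create(int family, int protocol, struct demo_socket *sock)
{
    if (family < 0 || family >= DEMO_NPROTO || demo_families[family] == NULL)
        return -1;  /* family not supported */
    sock->family = family;
    sock->protocol = protocol;
    sock->initialised = 0;
    return demo_families[family]->create(sock, protocol);
}

/* Stand-in for inet_create(): the lower layer initialises the socket. */
static int demo_inet_create(struct demo_socket *sock, int protocol)
{
    (void)protocol;
    sock->initialised = 1;
    return 0;
}

const struct demo_family demo_inet_family = { DEMO_AF_INET, demo_inet_create };
```

The socket layer stays protocol-independent because it only knows the table, not the families behind it.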
sys_connect()
    sockfd_lookup_light()
        Returns the socket object associated with the given fd
    move_addr_to_kernel()
        Copies the userspace struct sockaddr into kernel space
    sock->ops->connect()
        Lower layer call
    tcp_v4_connect()
Socket layer functions are elided.
struct sk_buff is defined in <include/linux/skbuff.h>
• Used by every network layer (except the physical layer).
• The fields of the structure change as it is passed from one layer to another, i.e., fields are layer dependent.
struct sk_buff {
    ... ... ...
#ifdef CONFIG_NET_SCHED
    __u32 tc_index;
#ifdef CONFIG_NET_CLS_ACT
    __u32 tc_verd;
    __u32 tc_classid;
#endif
#endif
};
sk_buff is peppered with C preprocessor #ifdef directives.
The CONFIG_NET_SCHED symbol must be defined at compile time for the structure to have the tc_index member.
Such symbols are enabled by an administrator with some version of make config.
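The effect of a configuration symbol on a structure can be reproduced in a few lines. In this sketch DEMO_CONFIG_NET_SCHED and struct demo_skb are stand-ins chosen for illustration; the real symbol is set by the kernel build system, not by a #define in the source file.

```c
#define DEMO_CONFIG_NET_SCHED 1   /* remove this line to drop tc_index */

struct demo_skb {
    unsigned int len;
#ifdef DEMO_CONFIG_NET_SCHED
    unsigned int tc_index;        /* exists only when the option is on */
#endif
};

/* Code that touches the optional member must be guarded the same way,
 * otherwise it fails to compile when the option is off. */
unsigned int demo_tc_index(const struct demo_skb *skb)
{
#ifdef DEMO_CONFIG_NET_SCHED
    return skb->tc_index;
#else
    (void)skb;
    return 0;
#endif
}
```

A consequence is that the size and layout of the structure depend on the configuration it was compiled with.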
• The kernel maintains all sk_buff structures in a doubly linked list.

struct sk_buff_head { /* only the head of the list */
    /* These two members must be first. */
    struct sk_buff *next;
    struct sk_buff *prev;

    __u32 qlen;
    spinlock_t lock; /* atomicity in accessing a sk_buff list. */
};
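The list is circular: an empty head points at itself, and the head takes part in the chain like any other element. The sketch below models this in userspace without locking. One deliberate difference, stated as an assumption: instead of casting the head to a buffer pointer as the kernel does, it puts the two link pointers into a shared struct demo_links placed first in both structures, which keeps the casts valid in standard C. All demo_* names are hypothetical.

```c
#include <stddef.h>

struct demo_links {
    struct demo_links *next;
    struct demo_links *prev;
};

struct demo_skb {
    struct demo_links links;   /* must be first so the cast below is valid */
    int id;
};

struct demo_skb_head {
    struct demo_links links;   /* must be first, same as in demo_skb */
    unsigned int qlen;         /* number of buffers on the list */
};

/* An empty list is a head that points at itself in both directions. */
void demo_queue_init(struct demo_skb_head *list)
{
    list->links.next = list->links.prev = &list->links;
    list->qlen = 0;
}

/* Append skb at the tail of the list. */
void demo_queue_tail(struct demo_skb_head *list, struct demo_skb *skb)
{
    struct demo_links *last = list->links.prev;

    skb->links.next = &list->links;
    skb->links.prev = last;
    last->next = &skb->links;
    list->links.prev = &skb->links;
    list->qlen++;
}

/* Remove and return the first buffer, or NULL if the list is empty. */
struct demo_skb *demo_dequeue(struct demo_skb_head *list)
{
    struct demo_links *first = list->links.next;

    if (first == &list->links)
        return NULL;
    list->links.next = first->next;
    first->next->prev = &list->links;
    first->next = first->prev = NULL;
    list->qlen--;
    return (struct demo_skb *)first;  /* links is the first member */
}
```

Because the head is part of the ring, insertion and removal need no special case for the first or last element.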
• Layout
• General
• Feature-specific
• Management functions
• struct sock *sk
  sock data structure of the socket that owns this buffer
• unsigned int len
  includes both the data in the main buffer (i.e., the one pointed to by head) and the data in the fragments
• unsigned int data_len
  unlike len, data_len accounts only for the size of the data in the fragments
• unsigned int truesize
  skb->truesize = size + sizeof(struct sk_buff);
• atomic_t users
  reference count, or the number of entities using this sk_buff buffer; changed with atomic_inc() and atomic_dec()
• unsigned char *head
• sk_buff_data_t end
• unsigned char *data
• sk_buff_data_t tail
struct net_device *dev
• Represents the interface on which the packet was received, or the device (or interface) through which it is to be transmitted.
• When several devices are grouped into one virtual device, dev usually points to the virtual device's net_device structure.

• Pointers to protocol headers:
  sk_buff_data_t transport_header;
  sk_buff_data_t network_header;
  sk_buff_data_t mac_header;
The data is updated through the *_header pointers.
• char cb[40]
  This is a "control buffer," or storage for private information, maintained by each layer for internal use.

struct tcp_skb_cb {
    ... ... ...
    __u32 seq; /* Starting sequence number */
    __u32 end_seq; /* SEQ + FIN + SYN + datalen */
    __u32 when; /* used to compute rtt's */
    __u8 flags; /* TCP header flags. */
    ... ... ...
};
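The control buffer is plain bytes as far as the buffer is concerned; each layer overlays its own structure on it. The sketch below is a userspace model with hypothetical names (struct demo_skb, struct demo_tcp_cb). It copies the private structure in and out with memcpy() to stay within standard C, whereas the kernel reaches the same storage through a cast macro (TCP_SKB_CB() for TCP).

```c
#include <stdint.h>
#include <string.h>

struct demo_skb {
    char cb[40];   /* opaque scratch space, owned by the current layer */
};

/* Hypothetical private state of one layer. */
struct demo_tcp_cb {
    uint32_t seq;
    uint32_t end_seq;
    uint8_t  flags;
};

/* The private state must fit into the control buffer. */
_Static_assert(sizeof(struct demo_tcp_cb) <= sizeof(((struct demo_skb *)0)->cb),
               "private data does not fit in cb");

/* Store this layer's private state in the buffer's control area. */
void demo_cb_store(struct demo_skb *skb, const struct demo_tcp_cb *cb)
{
    memcpy(skb->cb, cb, sizeof(*cb));
}

/* Read the private state back out of the control area. */
struct demo_tcp_cb demo_cb_load(const struct demo_skb *skb)
{
    struct demo_tcp_cb cb;

    memcpy(&cb, skb->cb, sizeof(cb));
    return cb;
}
```

The compile-time size check is the important part: a layer whose private structure outgrows cb would silently overwrite the fields that follow it.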
Defined in <include/linux/skbuff.h> & <net/core/skbuff.c>

skb_put(struct sk_buff *skb, unsigned int len)
skb_push(struct sk_buff *skb, unsigned int len)
skb_pull(struct sk_buff *skb, unsigned int len)
skb_reserve(struct sk_buff *skb, int len)

skb_put(), skb_push() and skb_pull() return a pointer into the data area; skb_reserve() only moves the data and tail pointers and returns nothing.
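The four functions only move the data and tail pointers inside the fixed area between head and end. The following userspace model uses hypothetical demo_* names and assert() in place of the kernel's panic paths to show the pointer arithmetic.

```c
#include <assert.h>
#include <stddef.h>

struct demo_skb {
    unsigned char *head, *data, *tail, *end;
    unsigned int len;               /* bytes between data and tail */
};

/* Attach an empty buffer: no headroom, no data yet. */
void demo_skb_init(struct demo_skb *skb, unsigned char *buf, size_t size)
{
    skb->head = skb->data = skb->tail = buf;
    skb->end = buf + size;
    skb->len = 0;
}

/* Leave headroom for headers that lower layers will prepend later.
 * Only valid while the buffer is still empty. */
void demo_skb_reserve(struct demo_skb *skb, unsigned int len)
{
    assert(skb->len == 0 && len <= (size_t)(skb->end - skb->tail));
    skb->data += len;
    skb->tail += len;
}

/* Grow the data area at the tail; returns where the new bytes start. */
unsigned char *demo_skb_put(struct demo_skb *skb, unsigned int len)
{
    unsigned char *old_tail = skb->tail;

    assert(len <= (size_t)(skb->end - skb->tail));
    skb->tail += len;
    skb->len += len;
    return old_tail;
}

/* Grow the data area at the front, e.g. to prepend a header. */
unsigned char *demo_skb_push(struct demo_skb *skb, unsigned int len)
{
    assert(len <= (size_t)(skb->data - skb->head));
    skb->data -= len;
    skb->len += len;
    return skb->data;
}

/* Shrink the data area at the front, e.g. to strip a header. */
unsigned char *demo_skb_pull(struct demo_skb *skb, unsigned int len)
{
    assert(len <= skb->len);
    skb->data += len;
    skb->len -= len;
    return skb->data;
}
```

On the transmit path a sender typically reserves headroom first, puts the payload, and then each lower layer pushes its header; the receive path pulls headers in the opposite order. No bytes are copied by any of these calls.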
Defined in <net/core/skbuff.c>

struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
                            int fclone, int node)

size = SKB_DATA_ALIGN(size);
data = kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);

struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
                                   unsigned int length, gfp_t gfp_mask)
The buffer allocation function meant for use by device drivers.
Executed in interrupt mode.

Freeing memory: kfree_skb and dev_kfree_skb

Release the buffer back to the buffer pool.
The buffer is released only when the skb->users counter is 1. Otherwise, the counter is only decremented.
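The release rule is ordinary reference counting. A minimal userspace sketch follows, assuming C11 atomics and malloc() in place of the kernel's atomic_t and buffer pool; the demo_* names are hypothetical.

```c
#include <stdatomic.h>
#include <stdlib.h>

struct demo_skb {
    atomic_int users;        /* number of holders of this buffer */
    unsigned char *data;
};

/* Allocate a buffer with one reference held by the caller. */
struct demo_skb *demo_alloc_skb(size_t size)
{
    struct demo_skb *skb = malloc(sizeof(*skb));

    if (skb == NULL)
        return NULL;
    skb->data = malloc(size);
    if (skb->data == NULL) {
        free(skb);
        return NULL;
    }
    atomic_init(&skb->users, 1);
    return skb;
}

/* Take an additional reference. */
struct demo_skb *demo_skb_get(struct demo_skb *skb)
{
    atomic_fetch_add(&skb->users, 1);
    return skb;
}

/* Drop one reference. Returns 1 if this call freed the buffer, else 0. */
int demo_kfree_skb(struct demo_skb *skb)
{
    /* atomic_fetch_sub returns the previous value; 1 means last holder. */
    if (atomic_fetch_sub(&skb->users, 1) != 1)
        return 0;
    free(skb->data);
    free(skb);
    return 1;
}
```

Every holder calls the free function exactly once, and only the last call actually returns the memory.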
• struct net_device is defined in <include/linux/netdevice.h>
• It stores all information specifically regarding a network device.
• There is one such structure for each device, both real ones (such as Ethernet NICs) and virtual ones.
• Network devices can be classified into types such as Ethernet cards and Token Ring cards.
• Each type may come in several models.
• Model-specific parameters are initialized by device driver software.
• Parameters common to different models are initialized by the kernel.
struct net_device {
    char name[IFNAMSIZ];
    int ifindex;

    /* device name hash chain, ex: eth0 */
    struct hlist_node name_hlist;

    unsigned long mem_end; /* shared mem end */
    unsigned long mem_start; /* shared mem start */
    unsigned long base_addr; /* device I/O address */
    unsigned int irq; /* device IRQ number */
    unsigned char if_port; /* Selectable AUI, TP,.. */
    unsigned char dma; /* DMA channel */
    unsigned short flags; /* interface flags (a la BSD) */
    /* ex: IFF_UP | IFF_RUNNING | IFF_MULTICAST */
    ... ... ...
};
struct net_device {
    unsigned mtu; /* interface MTU value */
    unsigned short type; /* interface hardware type */
    unsigned short hard_header_len; /* hardware hdr length */

    unsigned char dev_addr[MAX_ADDR_LEN];
    unsigned char addr_len; /* hardware address length */

    unsigned char broadcast[MAX_ADDR_LEN];
    unsigned int promiscuity;
    ... ... ...
};


struct net_device {
    struct net_device *next;
    struct hlist_node name_hlist;
    struct hlist_node index_hlist;
    ... ... ...
};
The packet is not processed in the interrupt handler.
netif_rx() raises the network receive softirq (NET_RX_SOFTIRQ).
net_rx_action() is then called and starts processing the packet.
• Processing of the packet starts with the protocol switching section.
netif_receive_skb() is called to process the packet and find the next protocol layer.
The protocol family of the packet is extracted from the link layer header.
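For Ethernet, the protocol of the payload is carried in the two-byte EtherType field at the end of the 14-byte link layer header. The sketch below is a simplified userspace model of the protocol switch; the demo_* names and the handler enum are hypothetical, and the kernel dispatches through registered packet handlers rather than a switch statement.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_ETH_HLEN  14       /* dst MAC (6) + src MAC (6) + type (2) */
#define DEMO_ETH_P_IP  0x0800   /* EtherType for IPv4 */
#define DEMO_ETH_P_ARP 0x0806   /* EtherType for ARP */

enum demo_handler { DEMO_DROP, DEMO_IP_RCV, DEMO_ARP_RCV };

/* Reads the 16-bit EtherType, stored big-endian at offset 12.
 * Returns 0 when the frame is too short to hold a full header. */
uint16_t demo_eth_type(const unsigned char *frame, size_t len)
{
    if (len < DEMO_ETH_HLEN)
        return 0;
    return (uint16_t)((frame[12] << 8) | frame[13]);
}

/* Chooses the next-layer handler from the EtherType. */
enum demo_handler demo_protocol_switch(const unsigned char *frame, size_t len)
{
    switch (demo_eth_type(frame, len)) {
    case DEMO_ETH_P_IP:
        return DEMO_IP_RCV;
    case DEMO_ETH_P_ARP:
        return DEMO_ARP_RCV;
    default:
        return DEMO_DROP;
    }
}
```

Reading the field byte by byte avoids any dependence on the host's byte order.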
ip_rcv() is the entry point for IP packet processing.
It checks whether the packet is destined for some other host (packet type PACKET_OTHERHOST); such packets are dropped.
It verifies the checksum of the IP header by calling ip_fast_csum().
It calls ip_route_input(), which checks the kernel routing table rt_hash_table.
If the packet needs to be forwarded, the input routine is ip_forward().
Otherwise, the input routine is ip_local_deliver().
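The IP header checksum is the one's-complement of the one's-complement sum of the header's 16-bit words. The portable sketch below computes the same value that the header check relies on; it is not the kernel's optimized ip_fast_csum(), and the demo_* names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* One's-complement sum of the header's 16-bit big-endian words, folded
 * to 16 bits and inverted. hdr_len must be even (IP headers always are). */
uint16_t demo_ip_checksum(const unsigned char *hdr, size_t hdr_len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < hdr_len; i += 2)
        sum += (uint32_t)((hdr[i] << 8) | hdr[i + 1]);
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);   /* add the carries back in */
    return (uint16_t)~sum;
}

/* A received header is intact when the computation over all of its
 * words, checksum field included, yields 0. */
int demo_ip_header_ok(const unsigned char *hdr, size_t hdr_len)
{
    return hdr_len >= 20 && hdr_len % 2 == 0 &&
           demo_ip_checksum(hdr, hdr_len) == 0;
}
```

A sender computes the value with the checksum field set to zero and stores the result in that field; a receiver runs the same routine over the header as received.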
ip_send() is called to check whether the packet needs to be fragmented.
If so, the packet is fragmented by calling ip_fragment().
Packet output path: ip_finish_output()
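The fragmentation decision is a comparison of the packet length with the MTU of the outgoing route. The sketch below models that check and, as an added illustration, the resulting number of fragments, given that every fragment except the last must carry a payload that is a multiple of 8 bytes. The function names and the fixed header length parameter are assumptions of this sketch.

```c
#include <stddef.h>

/* Returns 1 when a packet of total_len bytes (header plus payload)
 * does not fit into one frame of the given MTU. */
int demo_needs_fragment(size_t total_len, size_t mtu)
{
    return total_len > mtu;
}

/* Number of fragments needed, assuming each fragment repeats a header
 * of hdr_len bytes. Returns 0 for inputs that cannot be handled. */
size_t demo_fragment_count(size_t total_len, size_t hdr_len, size_t mtu)
{
    size_t payload, per_frag;

    if (total_len < hdr_len || mtu < hdr_len + 8)
        return 0;
    if (!demo_needs_fragment(total_len, mtu))
        return 1;
    payload = total_len - hdr_len;
    per_frag = (mtu - hdr_len) & ~(size_t)7;  /* round down to 8 bytes */
    return (payload + per_frag - 1) / per_frag;
}
```

For example, a 4020-byte packet with a 20-byte header and an MTU of 1500 is carried in three fragments with payloads of 1480, 1480 and 1040 bytes.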
ip_local_deliver(): the packet needs to be delivered locally.
ip_defrag() reassembles fragmented packets.
The protocol identifier is read from skb->nh.iph->protocol (the protocol field in the IP header).
For TCP, the receive handler is tcp_v4_rcv() (the entry point for the TCP layer).
__tcp_v4_lookup() finds the socket to which the packet belongs.
Established sockets are maintained in the hash table tcp_ehash.
If no established socket is found, the packet may be a new connection request for a listening socket.
The listening socket is searched for with tcp_v4_lookup_listener().
tcp_rcv_established() handles segments received on an established connection.
The application reads the data from the receive queue when it issues recv().
The kernel routine that reads data from a TCP socket is tcp_recvmsg().
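The two-stage socket lookup can be sketched as follows. This is a simplified userspace model: it scans a small array linearly instead of using hash tables, and struct demo_sock and the demo_* functions are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

struct demo_sock {
    uint32_t laddr, raddr;   /* local and remote IPv4 address */
    uint16_t lport, rport;   /* local and remote port */
    int listening;           /* 1: listening socket, remote side unused */
};

/* Stage 1: exact match on the full 4-tuple of an established socket. */
static const struct demo_sock *demo_lookup_established(
    const struct demo_sock *socks, size_t n,
    uint32_t saddr, uint16_t sport, uint32_t daddr, uint16_t dport)
{
    size_t i;

    for (i = 0; i < n; i++) {
        const struct demo_sock *s = &socks[i];

        if (!s->listening && s->raddr == saddr && s->rport == sport &&
            s->laddr == daddr && s->lport == dport)
            return s;
    }
    return NULL;
}

/* Stage 2: a listener on the destination port; local address 0 is
 * treated as a wildcard that matches any destination address. */
static const struct demo_sock *demo_lookup_listener(
    const struct demo_sock *socks, size_t n, uint32_t daddr, uint16_t dport)
{
    size_t i;

    for (i = 0; i < n; i++) {
        const struct demo_sock *s = &socks[i];

        if (s->listening && s->lport == dport &&
            (s->laddr == 0 || s->laddr == daddr))
            return s;
    }
    return NULL;
}

/* Established sockets take priority; fall back to a listening socket. */
const struct demo_sock *demo_tcp_lookup(
    const struct demo_sock *socks, size_t n,
    uint32_t saddr, uint16_t sport, uint32_t daddr, uint16_t dport)
{
    const struct demo_sock *s =
        demo_lookup_established(socks, n, saddr, sport, daddr, dport);

    return s ? s : demo_lookup_listener(socks, n, daddr, dport);
}
```

Checking established sockets first ensures that segments of an existing connection never reach the listener, which only sees packets that could start a new connection.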
