--- zzzz-none-000/linux-3.10.107/Documentation/networking/packet_mmap.txt 2017-06-27 09:49:32.000000000 +0000 +++ scorpion-7490-727/linux-3.10.107/Documentation/networking/packet_mmap.txt 2021-02-04 17:41:59.000000000 +0000 @@ -98,6 +98,11 @@ The destruction of the socket and all associated resources is done by a simple call to close(fd). +Similarly as without PACKET_MMAP, it is possible to use one socket +for capture and transmission. This can be done by mapping the +allocated RX and TX buffer ring with a single mmap() call. +See "Mapping and use of the circular buffer (ring)". + Next I will describe PACKET_MMAP settings and its constraints, also the mapping of the circular buffer in the user process and the use of this buffer. @@ -414,6 +419,19 @@ the frames. This is because a frame cannot be spawn across two blocks. +To use one socket for capture and transmission, the mapping of both the +RX and TX buffer ring has to be done with one call to mmap: + + ... + setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &foo, sizeof(foo)); + setsockopt(fd, SOL_PACKET, PACKET_TX_RING, &bar, sizeof(bar)); + ... + rx_ring = mmap(0, size * 2, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); + tx_ring = rx_ring + size; + +RX must be the first as the kernel maps the TX ring memory right +after the RX one. + At the beginning of each frame there is an status field (see struct tpacket_hdr). If this field is 0 means that the frame is ready to be used for the kernel, If not, there is a frame the user can read @@ -422,9 +440,10 @@ +++ Capture process: from include/linux/if_packet.h - #define TP_STATUS_COPY 2 - #define TP_STATUS_LOSING 4 - #define TP_STATUS_CSUMNOTREADY 8 + #define TP_STATUS_COPY (1 << 1) + #define TP_STATUS_LOSING (1 << 2) + #define TP_STATUS_CSUMNOTREADY (1 << 3) + #define TP_STATUS_CSUM_VALID (1 << 7) TP_STATUS_COPY : This flag indicates that the frame (and associated meta information) has been truncated because it's @@ -435,7 +454,7 @@ enabled previously with setsockopt() and the PACKET_COPY_THRESH option. - The number of frames than can be buffered to + The number of frames that can be buffered to be read with recvfrom is limited like a normal socket. See the SO_RCVBUF option in the socket (7) man page. @@ -448,6 +467,12 @@ reading the packet we should not try to check the checksum. +TP_STATUS_CSUM_VALID : This flag indicates that at least the transport + header checksum of the packet has been already + validated on the kernel side. If the flag is not set + then we are free to check the checksum by ourselves + provided that TP_STATUS_CSUMNOTREADY is also not set. + for convenience there are also the following defines: #define TP_STATUS_KERNEL 0 @@ -517,8 +542,6 @@ TPACKET_V1: - Default if not otherwise specified by setsockopt(2) - RX_RING, TX_RING available - - VLAN metadata information available for packets - (TP_STATUS_VLAN_VALID) TPACKET_V1 --> TPACKET_V2: - Made 64 bit clean due to unsigned long usage in TPACKET_V1 @@ -526,6 +549,13 @@ userspace and the like - Timestamp resolution in nanoseconds instead of microseconds - RX_RING, TX_RING available + - VLAN metadata information available for packets + (TP_STATUS_VLAN_VALID, TP_STATUS_VLAN_TPID_VALID), + in the tpacket2_hdr structure: + - TP_STATUS_VLAN_VALID bit being set into the tp_status field indicates + that the tp_vlan_tci field has valid VLAN TCI value + - TP_STATUS_VLAN_TPID_VALID bit being set into the tp_status field + indicates that the tp_vlan_tpid field has valid VLAN TPID value - How to switch to TPACKET_V2: 1. Replace struct tpacket_hdr by struct tpacket2_hdr 2. Query header len and save @@ -553,6 +583,15 @@ In the AF_PACKET fanout mode, packet reception can be load balanced among processes. This also works in combination with mmap(2) on packet sockets. +Currently implemented fanout policies are: + + - PACKET_FANOUT_HASH: schedule to socket by skb's packet hash + - PACKET_FANOUT_LB: schedule to socket by round-robin + - PACKET_FANOUT_CPU: schedule to socket by CPU packet arrives on + - PACKET_FANOUT_RND: schedule to socket by random selection + - PACKET_FANOUT_ROLLOVER: if one socket is full, rollover to another + - PACKET_FANOUT_QM: schedule to socket by skbs recorded queue_mapping + Minimal example code by David S. Miller (try things like "./test eth0 hash", "./test eth0 lb", etc.): @@ -714,6 +753,12 @@ Minimal example code by Daniel Borkmann based on Chetan Loke's lolpcap (compile it with gcc -Wall -O2 blob.c, and try things like "./a.out eth0", etc.): +/* Written from scratch, but kernel-to-user space API usage + * dissected from lolpcap: + * Copyright 2011, Chetan Loke + * License: GPL, version 2.0 + */ + #include #include #include @@ -732,27 +777,6 @@ #include #include -#define BLOCK_SIZE (1 << 22) -#define FRAME_SIZE 2048 - -#define NUM_BLOCKS 64 -#define NUM_FRAMES ((BLOCK_SIZE * NUM_BLOCKS) / FRAME_SIZE) - -#define BLOCK_RETIRE_TOV_IN_MS 64 -#define BLOCK_PRIV_AREA_SZ 13 - -#define ALIGN_8(x) (((x) + 8 - 1) & ~(8 - 1)) - -#define BLOCK_STATUS(x) ((x)->h1.block_status) -#define BLOCK_NUM_PKTS(x) ((x)->h1.num_pkts) -#define BLOCK_O2FP(x) ((x)->h1.offset_to_first_pkt) -#define BLOCK_LEN(x) ((x)->h1.blk_len) -#define BLOCK_SNUM(x) ((x)->h1.seq_num) -#define BLOCK_O2PRIV(x) ((x)->offset_to_priv) -#define BLOCK_PRIV(x) ((void *) ((uint8_t *) (x) + BLOCK_O2PRIV(x))) -#define BLOCK_HDR_LEN (ALIGN_8(sizeof(struct block_desc))) -#define BLOCK_PLUS_PRIV(sz_pri) (BLOCK_HDR_LEN + ALIGN_8((sz_pri))) - #ifndef likely # define likely(x) __builtin_expect(!!(x), 1) #endif @@ -775,7 +799,7 @@ static unsigned long packets_total = 0, bytes_total = 0; static sig_atomic_t sigint = 0; -void sighandler(int num) +static void sighandler(int num) { sigint = 1; } @@ -784,6 +808,8 @@ { int err, i, fd, v = TPACKET_V3; struct sockaddr_ll ll; + unsigned int blocksiz = 1 << 22, framesiz = 1 << 11; + unsigned int blocknum = 64; fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL)); if (fd < 0) { @@ -798,13 +824,12 @@ } memset(&ring->req, 0, sizeof(ring->req)); - ring->req.tp_block_size = BLOCK_SIZE; - ring->req.tp_frame_size = FRAME_SIZE; - ring->req.tp_block_nr = NUM_BLOCKS; - ring->req.tp_frame_nr = NUM_FRAMES; - ring->req.tp_retire_blk_tov = BLOCK_RETIRE_TOV_IN_MS; - ring->req.tp_sizeof_priv = BLOCK_PRIV_AREA_SZ; - ring->req.tp_feature_req_word |= TP_FT_REQ_FILL_RXHASH; + ring->req.tp_block_size = blocksiz; + ring->req.tp_frame_size = framesiz; + ring->req.tp_block_nr = blocknum; + ring->req.tp_frame_nr = (blocksiz * blocknum) / framesiz; + ring->req.tp_retire_blk_tov = 60; + ring->req.tp_feature_req_word = TP_FT_REQ_FILL_RXHASH; err = setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &ring->req, sizeof(ring->req)); @@ -814,8 +839,7 @@ } ring->map = mmap(NULL, ring->req.tp_block_size * ring->req.tp_block_nr, - PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED, - fd, 0); + PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED, fd, 0); if (ring->map == MAP_FAILED) { perror("mmap"); exit(1); @@ -845,58 +869,6 @@ return fd; } -#ifdef __checked -static uint64_t prev_block_seq_num = 0; - -void assert_block_seq_num(struct block_desc *pbd) -{ - if (unlikely(prev_block_seq_num + 1 != BLOCK_SNUM(pbd))) { - printf("prev_block_seq_num:%"PRIu64", expected seq:%"PRIu64" != " - "actual seq:%"PRIu64"\n", prev_block_seq_num, - prev_block_seq_num + 1, (uint64_t) BLOCK_SNUM(pbd)); - exit(1); - } - - prev_block_seq_num = BLOCK_SNUM(pbd); -} - -static void assert_block_len(struct block_desc *pbd, uint32_t bytes, int block_num) -{ - if (BLOCK_NUM_PKTS(pbd)) { - if (unlikely(bytes != BLOCK_LEN(pbd))) { - printf("block:%u with %upackets, expected len:%u != actual len:%u\n", - block_num, BLOCK_NUM_PKTS(pbd), bytes, BLOCK_LEN(pbd)); - exit(1); - } - } else { - if (unlikely(BLOCK_LEN(pbd) != BLOCK_PLUS_PRIV(BLOCK_PRIV_AREA_SZ))) { - printf("block:%u, expected len:%lu != actual len:%u\n", - block_num, BLOCK_HDR_LEN, BLOCK_LEN(pbd)); - exit(1); - } - } -} - -static void assert_block_header(struct block_desc *pbd, const int block_num) -{ - uint32_t block_status = BLOCK_STATUS(pbd); - - if (unlikely((block_status & TP_STATUS_USER) == 0)) { - printf("block:%u, not in TP_STATUS_USER\n", block_num); - exit(1); - } - - assert_block_seq_num(pbd); -} -#else -static inline void assert_block_header(struct block_desc *pbd, const int block_num) -{ -} -static void assert_block_len(struct block_desc *pbd, uint32_t bytes, int block_num) -{ -} -#endif - static void display(struct tpacket3_hdr *ppd) { struct ethhdr *eth = (struct ethhdr *) ((uint8_t *) ppd + ppd->tp_mac); @@ -926,37 +898,27 @@ static void walk_block(struct block_desc *pbd, const int block_num) { - int num_pkts = BLOCK_NUM_PKTS(pbd), i; + int num_pkts = pbd->h1.num_pkts, i; unsigned long bytes = 0; - unsigned long bytes_with_padding = BLOCK_PLUS_PRIV(BLOCK_PRIV_AREA_SZ); struct tpacket3_hdr *ppd; - assert_block_header(pbd, block_num); - - ppd = (struct tpacket3_hdr *) ((uint8_t *) pbd + BLOCK_O2FP(pbd)); + ppd = (struct tpacket3_hdr *) ((uint8_t *) pbd + + pbd->h1.offset_to_first_pkt); for (i = 0; i < num_pkts; ++i) { bytes += ppd->tp_snaplen; - if (ppd->tp_next_offset) - bytes_with_padding += ppd->tp_next_offset; - else - bytes_with_padding += ALIGN_8(ppd->tp_snaplen + ppd->tp_mac); - display(ppd); - ppd = (struct tpacket3_hdr *) ((uint8_t *) ppd + ppd->tp_next_offset); - __sync_synchronize(); + ppd = (struct tpacket3_hdr *) ((uint8_t *) ppd + + ppd->tp_next_offset); } - assert_block_len(pbd, bytes_with_padding, block_num); - packets_total += num_pkts; bytes_total += bytes; } -void flush_block(struct block_desc *pbd) +static void flush_block(struct block_desc *pbd) { - BLOCK_STATUS(pbd) = TP_STATUS_KERNEL; - __sync_synchronize(); + pbd->h1.block_status = TP_STATUS_KERNEL; } static void teardown_socket(struct ring *ring, int fd) @@ -972,7 +934,7 @@ socklen_t len; struct ring ring; struct pollfd pfd; - unsigned int block_num = 0; + unsigned int block_num = 0, blocks = 64; struct block_desc *pbd; struct tpacket_stats_v3 stats; @@ -994,15 +956,15 @@ while (likely(!sigint)) { pbd = (struct block_desc *) ring.rd[block_num].iov_base; -retry_block: - if ((BLOCK_STATUS(pbd) & TP_STATUS_USER) == 0) { + + if ((pbd->h1.block_status & TP_STATUS_USER) == 0) { poll(&pfd, 1, -1); - goto retry_block; + continue; } walk_block(pbd, block_num); flush_block(pbd); - block_num = (block_num + 1) % NUM_BLOCKS; + block_num = (block_num + 1) % blocks; } len = sizeof(stats); @@ -1022,6 +984,27 @@ } ------------------------------------------------------------------------------- ++ PACKET_QDISC_BYPASS +------------------------------------------------------------------------------- + +If there is a requirement to load the network with many packets in a similar +fashion as pktgen does, you might set the following option after socket +creation: + + int one = 1; + setsockopt(fd, SOL_PACKET, PACKET_QDISC_BYPASS, &one, sizeof(one)); + +This has the side-effect, that packets sent through PF_PACKET will bypass the +kernel's qdisc layer and are forcedly pushed to the driver directly. Meaning, +packet are not buffered, tc disciplines are ignored, increased loss can occur +and such packets are also not visible to other PF_PACKET sockets anymore. So, +you have been warned; generally, this can be useful for stress testing various +components of a system. + +On default, PACKET_QDISC_BYPASS is disabled and needs to be explicitly enabled +on PF_PACKET sockets. + +------------------------------------------------------------------------------- + PACKET_TIMESTAMP ------------------------------------------------------------------------------- @@ -1032,14 +1015,9 @@ of hardware timestamps with SIOCSHWTSTAMP (see related information from Documentation/networking/timestamping.txt). -PACKET_TIMESTAMP accepts the same integer bit field as -SO_TIMESTAMPING. However, only the SOF_TIMESTAMPING_SYS_HARDWARE -and SOF_TIMESTAMPING_RAW_HARDWARE values are recognized by -PACKET_TIMESTAMP. SOF_TIMESTAMPING_SYS_HARDWARE takes precedence over -SOF_TIMESTAMPING_RAW_HARDWARE if both bits are set. +PACKET_TIMESTAMP accepts the same integer bit field as SO_TIMESTAMPING: - int req = 0; - req |= SOF_TIMESTAMPING_SYS_HARDWARE; + int req = SOF_TIMESTAMPING_RAW_HARDWARE; setsockopt(fd, SOL_PACKET, PACKET_TIMESTAMP, (void *) &req, sizeof(req)) For the mmap(2)ed ring buffers, such timestamps are stored in the @@ -1047,14 +1025,13 @@ what kind of timestamp has been reported, the tp_status field is binary |'ed with the following possible bits ... - TP_STATUS_TS_SYS_HARDWARE TP_STATUS_TS_RAW_HARDWARE TP_STATUS_TS_SOFTWARE ... that are equivalent to its SOF_TIMESTAMPING_* counterparts. For the -RX_RING, if none of those 3 are set (i.e. PACKET_TIMESTAMP is not set), -then this means that a software fallback was invoked *within* PF_PACKET's -processing code (less precise). +RX_RING, if neither is set (i.e. PACKET_TIMESTAMP is not set), then a +software fallback was invoked *within* PF_PACKET's processing code (less +precise). Getting timestamps for the TX_RING works as follows: i) fill the ring frames, ii) call sendto() e.g. in blocking mode, iii) wait for status of relevant