eBPF In-Depth Technical Guide: From Fundamentals to Production

Document Version: 3.0 (Enhanced Edition)
Last Updated: February 22, 2026
Target Audience: Systems Engineers, Kernel Developers, Performance Optimization Experts


Part I: eBPF Core Architecture Deep Dive

1. eBPF Virtual Machine Implementation

1.1 Register Architecture and Instruction Set

The eBPF virtual machine adopts a RISC-style 64-bit register architecture, consisting of:

  • R0-R9: 10 general-purpose registers (64-bit)
  • R10: Read-only stack pointer register
  • An implicit program counter (PC) that tracks the currently executing instruction (not addressable as a register)

Register Conventions:

Register   Purpose                  Calling Convention
R0         Return value register    Function return value, helper function results
R1-R5      Argument registers       Function call parameters (up to 5)
R6-R9      Callee-saved registers   Preserved across helper and function calls
R10        Stack pointer            Read-only, points to the top of the 512-byte stack
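
These conventions are visible directly in compiled bytecode. A simplified disassembly of a bpf_map_lookup_elem(&map, &key) call (helper ID 1); map_ptr and the drop label are placeholders for illustration:

r1 = map_ptr            // 1st argument in R1
r2 = r10 - 4            // 2nd argument: pointer to the key on the stack
call 1                  // helper ID 1 = bpf_map_lookup_elem
if r0 == 0 goto drop    // result comes back in R0; R6-R9 survive the call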

1.2 Instruction Format and Encoding

eBPF instructions use a fixed 64-bit encoding; the one exception, the 64-bit immediate load (BPF_LD | BPF_IMM | BPF_DW), occupies two consecutive slots:

struct bpf_insn {
    __u8  code;     // Opcode (8 bits)
    __u8  dst_reg:4; // Destination register (4 bits)
    __u8  src_reg:4; // Source register (4 bits)
    __s16 off;      // Offset (16 bits)
    __s32 imm;      // Immediate value (32 bits)
};

Opcode Categories:

  • ALU Operations: 0x04 (ADD), 0x14 (SUB), 0x24 (MUL), 0x34 (DIV); these are the 32-bit BPF_ALU class, and the 64-bit BPF_ALU64 equivalents are 0x07, 0x17, 0x27, 0x37
  • Memory Operations: 0x61 (LDXW), 0x62 (STW), 0x63 (STXW)
  • Jump Operations: 0x05 (JA), 0x15 (JEQ), 0x25 (JGT)
  • Function Calls: 0x85 (CALL), 0x95 (EXIT)
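
To make the encoding concrete, here is a hand-assembled two-instruction program ("r0 = 0; exit") built from the struct shown above; 0xb7 encodes BPF_ALU64 | BPF_MOV | BPF_K, and 0x95 encodes BPF_JMP | BPF_EXIT:

// "return 0", assembled by hand
#include <linux/bpf.h>   // provides struct bpf_insn

struct bpf_insn prog[] = {
    // 0xb7 = BPF_ALU64 | BPF_MOV | BPF_K  ->  r0 = imm
    { .code = 0xb7, .dst_reg = 0, .src_reg = 0, .off = 0, .imm = 0 },
    // 0x95 = BPF_JMP | BPF_EXIT           ->  return r0
    { .code = 0x95, .dst_reg = 0, .src_reg = 0, .off = 0, .imm = 0 },
};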

2. Probe Engine Mechanism Deep Dive

2.1 Kprobe Implementation: Instruction Replacement and Breakpoint Mechanism

Kprobe implements function interception through dynamic instruction replacement. The core mechanism includes:

Step 1: Breakpoint Instruction Insertion

// Original function instruction sequence
do_sys_open:
    push   %rbp
    mov    %rsp,%rbp
    ...

// After Kprobe activation (INT3 overwrites the 1-byte push %rbp;
// the displaced instruction is saved for later single-stepping)
do_sys_open:
    int3          // 0xCC breakpoint instruction (x86_64)
    mov    %rsp,%rbp
    ...

Step 2: Breakpoint Handling Flow

  1. CPU triggers INT3 exception, enters kernel exception handler
  2. Save register state to pt_regs structure
  3. Lookup Kprobe handler, execute registered eBPF program
  4. Single-step original instruction (using TF flag)
  5. Resume normal execution, continue function flow

Program Counter (PC) Jump Details:

// PC jump timeline
1. PC = do_sys_open                  // Normal execution
2. Hit INT3, CPU raises a breakpoint exception
3. PC = kprobe_int3_handler          // Kernel exception entry
4. Kprobe pre-handler runs           // Executes the attached eBPF program
5. Set TF flag, PC = insn slot       // Single-step a saved copy of the
                                     // displaced instruction (out-of-line)
6. DEBUG exception fires, TF cleared // Kprobe post-handler runs
7. PC = do_sys_open + insn_len       // Resume the original flow right
                                     // after the displaced instruction
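
A minimal program that rides this mechanism, written against libbpf conventions (a sketch; the probed symbol must exist in your running kernel, and do_sys_open was replaced by do_sys_openat2 around kernel 5.6):

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

SEC("kprobe/do_sys_openat2")
int count_open(struct pt_regs *ctx)
{
    // Executes at step 4 of the timeline above, before the
    // displaced instruction is single-stepped
    bpf_printk("open() entered");
    return 0;
}

char LICENSE[] SEC("license") = "GPL";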

2.2 Kretprobe Return Value Interception

Kretprobe works by hijacking the function return address:

// Function call stack changes
Normal call:
    [caller's return address]  ← RSP
    [saved RBP]
    [local variables]

After Kretprobe activation:
    [trampoline address]       ← RSP (overwrites the return address;
                                 the original is saved in a
                                 kretprobe_instance)
    [saved RBP]
    [local variables]

Return Address Hijacking Flow:

  1. At function entry, save original return address
  2. Replace stack return address with kretprobe_trampoline
  3. When function executes RET, jumps to trampoline
  4. Trampoline executes the eBPF program, which reads the return value from the saved registers (RAX on x86_64, exposed as PT_REGS_RC(ctx))
  5. Restore original return address, jump back to caller
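
On the eBPF side this is wrapped by the BPF_KRETPROBE macro from bpf_tracing.h; a minimal sketch that inspects the return value of tcp_v4_connect:

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

SEC("kretprobe/tcp_v4_connect")
int BPF_KRETPROBE(tcp_v4_connect_exit, int ret)
{
    // 'ret' comes from PT_REGS_RC(ctx), i.e. RAX saved at the trampoline
    if (ret != 0)
        bpf_printk("tcp_v4_connect failed: %d", ret);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";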

2.3 Uprobe User-Space Probing

Uprobe patches the probed instruction through a copy-on-write (COW) page replacement, then traps into the kernel via INT3:

// Uprobe activation flow
1. Resolve the target to an inode + offset (e.g., malloc in libc)
2. Create a copy of the page containing the instruction (COW)
3. Insert an INT3 instruction in the copy
4. Swap the mapping so the virtual address points at the patched page
5. When a thread hits the INT3, the trap handler runs the registered
   eBPF program
6. The displaced instruction is single-stepped in a per-process XOL
   (execute-out-of-line) area
7. Execution resumes at the instruction following the probe point
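
The programming model mirrors kprobes; a minimal sketch (assuming a libbpf recent enough to auto-attach from the binary:function section name, and a glibc at the path shown):

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

SEC("uprobe//usr/lib/x86_64-linux-gnu/libc.so.6:malloc")
int BPF_KPROBE(trace_malloc, size_t size)
{
    // First argument of malloc(), read from the saved user registers
    bpf_printk("malloc(%lu)", size);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";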

3. Hardware Resource Access Mechanisms

3.1 PMU (Performance Monitoring Unit) Access

eBPF accesses hardware performance counters through the perf_event subsystem:

// PMU register access flow
1. User-space configures perf_event_attr
2. Kernel allocates PMU hardware counter
3. Configure MSR (Model-Specific Register)
   - IA32_PERFEVTSEL0-3: Event selection registers
   - IA32_PMC0-3: Performance counter registers
4. eBPF program reads counter via bpf_perf_event_read()
5. Hardware interrupt triggers sampling (overflow)
6. eBPF program executes in interrupt context
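
Step 4 as seen from the eBPF side: a sketch that reads the current CPU's cycle counter on every context switch. It assumes user space has opened one PERF_COUNT_HW_CPU_CYCLES event per CPU and stored the fds in the map, and the probed symbol may carry a compiler suffix on some kernels:

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
    __uint(max_entries, 128);   // one slot per CPU, filled by user space
    __type(key, u32);
    __type(value, u32);
} cycles SEC(".maps");

SEC("kprobe/finish_task_switch")
int on_switch(struct pt_regs *ctx)
{
    // Reads the hardware counter (IA32_PMCx underneath) for this CPU
    u64 val = bpf_perf_event_read(&cycles, BPF_F_CURRENT_CPU);
    if ((s64)val < 0)
        return 0;   // no counter configured on this CPU
    bpf_printk("cycles=%llu", val);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";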

MSR Register Configuration Example (x86_64):

// Configure CPU cycles counter
IA32_PERFEVTSEL0 = 0x0043003C
  [7:0]   = 0x3C  // Event Select: CPU_CLK_UNHALTED.THREAD_P
  [15:8]  = 0x00  // Unit Mask
  [16]    = 1     // USR: Count user mode
  [17]    = 1     // OS: Count kernel mode
  [18]    = 0     // Edge Detect
  [19]    = 0     // Pin Control
  [20]    = 0     // APIC Interrupt Enable
  [22]    = 1     // Enable Counter
  [23]    = 0     // Invert Counter Mask

// Read counter value
cycles = RDMSR(IA32_PMC0)

3.2 Memory Mapping and DMA Access

eBPF programs can access kernel memory through bpf_probe_read_kernel():

// Memory access permission checking
SEC("kprobe/tcp_sendmsg")
int kprobe_tcp_sendmsg(struct pt_regs *ctx) {
    struct sock *sk = (struct sock *)PT_REGS_PARM1(ctx);
    
    // 1. Guard against a NULL socket pointer
    if (!sk) return 0;
    
    // 2. Use a safe read helper (performs address validation)
    u16 family;
    bpf_probe_read_kernel(&family, sizeof(family), &sk->sk_family);
    
    // 3. Reading user-space memory requires the _user variant
    //    (user_ptr stands in for a user-space address obtained from
    //    the probe context; shown here for illustration)
    char buffer[256];
    bpf_probe_read_user(buffer, sizeof(buffer), user_ptr);
    
    return 0;
}

Memory Access Safety Mechanisms:

Function                  Purpose             Safety Checks
bpf_probe_read_kernel()   Read kernel memory  Page-table check, address-range validation
bpf_probe_read_user()     Read user memory    copy_from_user() semantics, page-fault handling
bpf_probe_write_user()    Write user memory   Write-permission check, COW handling

Part II: eBPF Maps Advanced Implementation

4. Map Internal Data Structures

4.1 Hash Map Implementation: Buckets and Per-CPU Freelists

// BPF_MAP_TYPE_HASH kernel implementation
struct bpf_htab {
    struct bpf_map map;
    struct bucket *buckets;  // Hash bucket array
    void *elems;             // Element storage area
    union {
        struct pcpu_freelist freelist;  // Per-CPU free list
        struct bpf_lru lru;             // LRU eviction policy
    };
    atomic_t count;          // Current element count
    u32 n_buckets;           // Bucket count (power of 2)
    u32 elem_size;           // Element size
    // Locking is per bucket: each struct bucket embeds a raw spinlock
};

// Hash lookup flow
1. hash = jhash(key, key_size, seed)
2. bucket_id = hash & (n_buckets - 1)
3. bucket = &buckets[bucket_id]
4. Traverse bucket list, compare keys
5. Return value pointer

4.2 Ring Buffer Lock-Free Implementation

BPF_MAP_TYPE_RINGBUF pairs spinlock-serialized reservation on the producer side with lock-free consumption on the consumer side:

// Ring Buffer memory layout
struct bpf_ringbuf {
    u64 consumer_pos __aligned(PAGE_SIZE);  // Consumer position (user-space)
    u64 producer_pos __aligned(PAGE_SIZE);  // Producer position (kernel-space)
    char data[] __aligned(PAGE_SIZE);       // Data area
};

// Reserve/commit algorithm (simplified)
1. Take rb->spinlock (serializes concurrent producers)
2. new_pos = producer_pos + len
3. if (new_pos - consumer_pos > rb->mask) return NULL  // buffer full, -ENOSPC
4. Write a record header at data[producer_pos & mask] with the BUSY bit set
5. producer_pos = new_pos, release the spinlock
6. Caller fills the record; commit clears the BUSY bit so the consumer
   (which reads lock-free) can consume the record
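
From an eBPF program this reserve/commit protocol surfaces as a pair of helpers; a minimal sketch:

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 256 * 1024);  // power-of-2 multiple of the page size
} events SEC(".maps");

struct event {
    u32 pid;
    u64 ts;
};

SEC("kprobe/do_sys_openat2")
int emit_event(struct pt_regs *ctx)
{
    // Reserve covers steps 1-5 above; NULL means the buffer is full
    struct event *e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
    if (!e)
        return 0;
    e->pid = bpf_get_current_pid_tgid() >> 32;
    e->ts  = bpf_ktime_get_ns();
    // Submit is step 6: clears the BUSY bit for the consumer
    bpf_ringbuf_submit(e, 0);
    return 0;
}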

5. JIT Compiler Implementation

5.1 x86_64 JIT Compilation Flow

// eBPF instruction to x86_64 machine code mapping
eBPF:   r0 = r1
        r0 += r2
  ↓ JIT compilation
x86_64: mov %rdi, %rax   // r1→RDI copied into r0→RAX
        add %rsi, %rax   // r2→RSI added in place

// JIT compilation steps
1. First pass: Calculate jump offsets
2. Allocate JIT code buffer (executable pages)
3. Second pass: Generate machine code
4. Fix jump target addresses
5. Set page permissions to RX (read+execute)
6. Flush I-Cache

JIT Optimization Examples:

eBPF Instruction     Unoptimized x86_64       Optimized x86_64
r0 = 0               mov $0, %rax             xor %rax, %rax
r0 += 1              add $1, %rax             inc %rax
if r0 == 0 goto +5   cmp $0, %rax; je off     test %rax, %rax; jz off

(r0 maps to RAX in the x86_64 JIT)

Part III: Production Environment Practices

6. Performance Optimization Techniques

6.1 Reduce Map Lookup Overhead

// Before optimization: Multiple lookups
SEC("kprobe/tcp_sendmsg")
int kprobe_tcp_sendmsg(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;  // upper 32 bits hold the tgid (process ID)
    u64 *bytes = bpf_map_lookup_elem(&stats, &pid);
    if (bytes) (*bytes)++;
    u64 *packets = bpf_map_lookup_elem(&stats2, &pid);
    if (packets) (*packets)++;
    return 0;
}

// After optimization: Merged data structure
struct tcp_stats {
    u64 bytes;
    u64 packets;
};

SEC("kprobe/tcp_sendmsg")
int kprobe_tcp_sendmsg_optimized(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;  // upper 32 bits hold the tgid (process ID)
    struct tcp_stats *stats = bpf_map_lookup_elem(&combined_stats, &pid);
    if (stats) {
        stats->bytes++;
        stats->packets++;
    }
    return 0;
}

6.2 Use Per-CPU Maps to Avoid Lock Contention

// Per-CPU Map definition
struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
    __uint(max_entries, 1);
    __type(key, u32);
    __type(value, u64);
} per_cpu_counter SEC(".maps");

// Lock-free update
SEC("kprobe/schedule")
int kprobe_schedule(struct pt_regs *ctx) {
    u32 key = 0;
    u64 *count = bpf_map_lookup_elem(&per_cpu_counter, &key);
    if (count) (*count)++;  // No atomic operations needed
    return 0;
}
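
The cost moves to the read side: a single user-space lookup on a per-CPU map returns one value for every possible CPU, and the reader aggregates them itself. A sketch:

// User-space aggregation for the per_cpu_counter map above
#include <bpf/libbpf.h>
#include <bpf/bpf.h>

static unsigned long long read_total(int map_fd)
{
    int ncpus = libbpf_num_possible_cpus();
    __u64 vals[ncpus];
    __u32 key = 0;
    unsigned long long total = 0;

    // One lookup fills vals[] with the value from every possible CPU
    if (bpf_map_lookup_elem(map_fd, &key, vals) == 0)
        for (int i = 0; i < ncpus; i++)
            total += vals[i];
    return total;
}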

7. Debugging and Troubleshooting

7.1 Debugging with bpftool

# List all eBPF programs
bpftool prog show

# View program's JIT assembly code
bpftool prog dump jited id 123

# View program's eBPF bytecode
bpftool prog dump xlated id 123

# View Map contents
bpftool map dump id 456

# View program statistics (requires: sysctl -w kernel.bpf_stats_enabled=1)
bpftool --json prog show id 123 | jq '.run_time_ns'

7.2 Verifier Error Analysis

Common Verifier Errors and Solutions:

Error Message             Cause                                 Solution
invalid read from stack   Uninitialized stack variable read     Write before read, or initialize with = {0}
unbounded loop            Loop bound the verifier cannot prove  Use #pragma unroll or a fixed iteration limit
R1 pointer arithmetic     Illegal pointer operation             Use bpf_probe_read_kernel() instead of direct dereference
exceeds max program size  Instruction count limit exceeded      Split into multiple programs, use tail calls
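
For the unbounded-loop case, a sketch of the standard fix: give the loop a compile-time bound and unroll it so no back-edge reaches the verifier (the buffer is also initialized, avoiding the first error in the table):

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

SEC("kprobe/do_sys_openat2")
int bounded_loop(struct pt_regs *ctx)
{
    char comm[16] = {0};   // initialized: no "invalid read from stack"
    bpf_get_current_comm(&comm, sizeof(comm));

    int dots = 0;
#pragma unroll
    for (int i = 0; i < (int)sizeof(comm); i++)  // fixed bound, fully unrolled
        if (comm[i] == '.')
            dots++;
    bpf_printk("dots=%d", dots);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";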

8. Practical Case Study: TCP Connection Tracking

8.1 Complete Implementation

// tcp_tracer.bpf.c
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

struct conn_info {
    u32 saddr;
    u32 daddr;
    u16 sport;
    u16 dport;
    u64 bytes_sent;
    u64 bytes_recv;
};

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 10240);
    __type(key, u64);  // sock pointer
    __type(value, struct conn_info);
} connections SEC(".maps");

SEC("kprobe/tcp_sendmsg")
int BPF_KPROBE(tcp_sendmsg, struct sock *sk, struct msghdr *msg, size_t size)
{
    u64 sock_ptr = (u64)sk;
    struct conn_info *info = bpf_map_lookup_elem(&connections, &sock_ptr);
    
    if (!info) {
        struct conn_info new_info = {0};
        
        // Read connection information
        BPF_CORE_READ_INTO(&new_info.saddr, sk, __sk_common.skc_rcv_saddr);
        BPF_CORE_READ_INTO(&new_info.daddr, sk, __sk_common.skc_daddr);
        BPF_CORE_READ_INTO(&new_info.sport, sk, __sk_common.skc_num);   // host byte order
        BPF_CORE_READ_INTO(&new_info.dport, sk, __sk_common.skc_dport); // network byte order
        
        new_info.bytes_sent = size;
        bpf_map_update_elem(&connections, &sock_ptr, &new_info, BPF_NOEXIST);
    } else {
        __sync_fetch_and_add(&info->bytes_sent, size);
    }
    
    return 0;
}

SEC("kprobe/tcp_cleanup_rbuf")
int BPF_KPROBE(tcp_cleanup_rbuf, struct sock *sk, int copied)
{
    if (copied <= 0) return 0;
    
    u64 sock_ptr = (u64)sk;
    struct conn_info *info = bpf_map_lookup_elem(&connections, &sock_ptr);
    
    if (info) {
        __sync_fetch_and_add(&info->bytes_recv, copied);
    }
    
    return 0;
}

char LICENSE[] SEC("license") = "GPL";

8.2 User-Space Program

// tcp_tracer.c
#include <stdio.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <bpf/libbpf.h>
#include <bpf/bpf.h>

// Mirrors struct conn_info in tcp_tracer.bpf.c (normally shared via a header)
struct conn_info {
    __u32 saddr;
    __u32 daddr;
    __u16 sport;
    __u16 dport;
    __u64 bytes_sent;
    __u64 bytes_recv;
};

int main() {
    struct bpf_object *obj;
    struct bpf_program *prog;
    struct bpf_map *map;
    int map_fd;
    
    // Load eBPF program
    obj = bpf_object__open_file("tcp_tracer.bpf.o", NULL);
    if (!obj || bpf_object__load(obj))
        return 1;
    
    // Attach every program in the object (libbpf has no object-level
    // attach; each program is attached individually)
    bpf_object__for_each_program(prog, obj) {
        if (!bpf_program__attach(prog))
            return 1;
    }
    
    // Get map
    map = bpf_object__find_map_by_name(obj, "connections");
    map_fd = bpf_map__fd(map);
    
    // Periodically read connection information
    while (1) {
        __u64 key = 0, next_key;
        struct conn_info info;
        
        while (bpf_map_get_next_key(map_fd, &key, &next_key) == 0) {
            bpf_map_lookup_elem(map_fd, &next_key, &info);
            
            char saddr[INET_ADDRSTRLEN], daddr[INET_ADDRSTRLEN];
            inet_ntop(AF_INET, &info.saddr, saddr, sizeof(saddr));
            inet_ntop(AF_INET, &info.daddr, daddr, sizeof(daddr));
            
            // skc_num is already host order; skc_dport is network order
            printf("%s:%u -> %s:%u  TX: %llu  RX: %llu\n",
                   saddr, info.sport,
                   daddr, ntohs(info.dport),
                   info.bytes_sent, info.bytes_recv);
            
            key = next_key;
        }
        
        sleep(1);
    }
    
    return 0;
}

Summary and Best Practices

Key Takeaways

  • Understand Low-Level Mechanisms: Master Probe engine, PC jumps, hardware access implementation details
  • Performance Optimization: Use Per-CPU Maps, reduce Map lookups, leverage JIT properly
  • Safe Programming: Follow Verifier rules, use safe memory access functions
  • Production Deployment: Thorough testing, monitor performance impact, prepare rollback plans

This document is continuously updated. Feedback and suggestions are welcome.
