eBPF In-Depth Technical Guide: From Fundamentals to Production

Document Version: 3.0 (Enhanced Edition)
Last Updated: February 22, 2026
Target Audience: Systems Engineers, Kernel Developers, Performance Optimization Experts


Part I: eBPF Core Architecture Deep Dive

1. eBPF Virtual Machine Implementation

1.1 Register Architecture and Instruction Set

The eBPF virtual machine adopts a RISC-style 64-bit register architecture, consisting of:

  • R0-R9: 10 general-purpose registers (64-bit)
  • R10: Read-only stack pointer register
  • An implicit program counter (PC) that tracks the currently executing instruction (not addressable as a register)

Register Conventions:

Register   Purpose                  Calling Convention
R0         Return value register    Function return value, helper function results
R1-R5      Argument registers       Function call parameters (up to 5)
R6-R9      Callee-saved registers   Preserved across helper and function calls
R10        Stack pointer            Read-only, points to the top of the 512-byte stack
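
These conventions are visible directly in compiled bytecode. A simplified disassembly of a bpf_map_lookup_elem(&map, &key) call (helper ID 1); map_ptr and the drop label are placeholders for illustration:

r1 = map_ptr            // 1st argument in R1
r2 = r10 - 4            // 2nd argument: pointer to the key on the stack
call 1                  // helper ID 1 = bpf_map_lookup_elem
if r0 == 0 goto drop    // result comes back in R0; R6-R9 survive the call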

1.2 Instruction Format and Encoding

eBPF instructions use a fixed 64-bit encoding; the one exception, the 64-bit immediate load (BPF_LD | BPF_IMM | BPF_DW), occupies two consecutive slots:

struct bpf_insn {
    __u8  code;     // Opcode (8 bits)
    __u8  dst_reg:4; // Destination register (4 bits)
    __u8  src_reg:4; // Source register (4 bits)
    __s16 off;      // Offset (16 bits)
    __s32 imm;      // Immediate value (32 bits)
};

Opcode Categories:

  • ALU Operations: 0x04 (ADD), 0x14 (SUB), 0x24 (MUL), 0x34 (DIV); these are the 32-bit BPF_ALU class, and the 64-bit BPF_ALU64 equivalents are 0x07, 0x17, 0x27, 0x37
  • Memory Operations: 0x61 (LDXW), 0x62 (STW), 0x63 (STXW)
  • Jump Operations: 0x05 (JA), 0x15 (JEQ), 0x25 (JGT)
  • Function Calls: 0x85 (CALL), 0x95 (EXIT)
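
To make the encoding concrete, here is a hand-assembled two-instruction program ("r0 = 0; exit") built from the struct shown above; 0xb7 encodes BPF_ALU64 | BPF_MOV | BPF_K, and 0x95 encodes BPF_JMP | BPF_EXIT:

// "return 0", assembled by hand
#include <linux/bpf.h>   // provides struct bpf_insn

struct bpf_insn prog[] = {
    // 0xb7 = BPF_ALU64 | BPF_MOV | BPF_K  ->  r0 = imm
    { .code = 0xb7, .dst_reg = 0, .src_reg = 0, .off = 0, .imm = 0 },
    // 0x95 = BPF_JMP | BPF_EXIT           ->  return r0
    { .code = 0x95, .dst_reg = 0, .src_reg = 0, .off = 0, .imm = 0 },
};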

2. Probe Engine Mechanism Deep Dive

2.1 Kprobe Implementation: Instruction Replacement and Breakpoint Mechanism

Kprobe implements function interception through dynamic instruction replacement. The core mechanism includes:

Step 1: Breakpoint Instruction Insertion

// Original function instruction sequence
do_sys_open:
    push   %rbp
    mov    %rsp,%rbp
    ...

// After Kprobe activation (INT3 overwrites the 1-byte push %rbp;
// the displaced instruction is saved for later single-stepping)
do_sys_open:
    int3          // 0xCC breakpoint instruction (x86_64)
    mov    %rsp,%rbp
    ...

Step 2: Breakpoint Handling Flow

  1. CPU triggers INT3 exception, enters kernel exception handler
  2. Save register state to pt_regs structure
  3. Lookup Kprobe handler, execute registered eBPF program
  4. Single-step original instruction (using TF flag)
  5. Resume normal execution, continue function flow

Program Counter (PC) Jump Details:

// PC jump timeline
1. PC = do_sys_open                  // Normal execution
2. Hit INT3, CPU raises a breakpoint exception
3. PC = kprobe_int3_handler          // Kernel exception entry
4. Kprobe pre-handler runs           // Executes the attached eBPF program
5. Set TF flag, PC = insn slot       // Single-step a saved copy of the
                                     // displaced instruction (out-of-line)
6. DEBUG exception fires, TF cleared // Kprobe post-handler runs
7. PC = do_sys_open + insn_len       // Resume the original flow right
                                     // after the displaced instruction
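
A minimal program that rides this mechanism, written against libbpf conventions (a sketch; the probed symbol must exist in your running kernel, and do_sys_open was replaced by do_sys_openat2 around kernel 5.6):

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

SEC("kprobe/do_sys_openat2")
int count_open(struct pt_regs *ctx)
{
    // Executes at step 4 of the timeline above, before the
    // displaced instruction is single-stepped
    bpf_printk("open() entered");
    return 0;
}

char LICENSE[] SEC("license") = "GPL";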

2.2 Kretprobe Return Value Interception

Kretprobe works by hijacking the function return address:

// Function call stack changes
Normal call:
    [caller's return address]  ← RSP
    [saved RBP]
    [local variables]

After Kretprobe activation:
    [trampoline address]       ← RSP (overwrites the return address;
                                 the original is saved in a
                                 kretprobe_instance)
    [saved RBP]
    [local variables]

Return Address Hijacking Flow:

  1. At function entry, save original return address
  2. Replace stack return address with kretprobe_trampoline
  3. When function executes RET, jumps to trampoline
  4. Trampoline executes the eBPF program, which reads the return value from the saved registers (RAX on x86_64, exposed as PT_REGS_RC(ctx))
  5. Restore original return address, jump back to caller
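
On the eBPF side this is wrapped by the BPF_KRETPROBE macro from bpf_tracing.h; a minimal sketch that inspects the return value of tcp_v4_connect:

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

SEC("kretprobe/tcp_v4_connect")
int BPF_KRETPROBE(tcp_v4_connect_exit, int ret)
{
    // 'ret' comes from PT_REGS_RC(ctx), i.e. RAX saved at the trampoline
    if (ret != 0)
        bpf_printk("tcp_v4_connect failed: %d", ret);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";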

2.3 Uprobe User-Space Probing

Uprobe patches the probed instruction through a copy-on-write (COW) page replacement, then traps into the kernel via INT3:

// Uprobe activation flow
1. Resolve the target to an inode + offset (e.g., malloc in libc)
2. Create a copy of the page containing the instruction (COW)
3. Insert an INT3 instruction in the copy
4. Swap the mapping so the virtual address points at the patched page
5. When a thread hits the INT3, the trap handler runs the registered
   eBPF program
6. The displaced instruction is single-stepped in a per-process XOL
   (execute-out-of-line) area
7. Execution resumes at the instruction following the probe point
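
The programming model mirrors kprobes; a minimal sketch (assuming a libbpf recent enough to auto-attach from the binary:function section name, and a glibc at the path shown):

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

SEC("uprobe//usr/lib/x86_64-linux-gnu/libc.so.6:malloc")
int BPF_KPROBE(trace_malloc, size_t size)
{
    // First argument of malloc(), read from the saved user registers
    bpf_printk("malloc(%lu)", size);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";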

3. Hardware Resource Access Mechanisms

3.1 PMU (Performance Monitoring Unit) Access

eBPF accesses hardware performance counters through the perf_event subsystem:

// PMU register access flow
1. User-space configures perf_event_attr
2. Kernel allocates PMU hardware counter
3. Configure MSR (Model-Specific Register)
   - IA32_PERFEVTSEL0-3: Event selection registers
   - IA32_PMC0-3: Performance counter registers
4. eBPF program reads counter via bpf_perf_event_read()
5. Hardware interrupt triggers sampling (overflow)
6. eBPF program executes in interrupt context
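
Step 4 as seen from the eBPF side: a sketch that reads the current CPU's cycle counter on every context switch. It assumes user space has opened one PERF_COUNT_HW_CPU_CYCLES event per CPU and stored the fds in the map, and the probed symbol may carry a compiler suffix on some kernels:

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
    __uint(max_entries, 128);   // one slot per CPU, filled by user space
    __type(key, u32);
    __type(value, u32);
} cycles SEC(".maps");

SEC("kprobe/finish_task_switch")
int on_switch(struct pt_regs *ctx)
{
    // Reads the hardware counter (IA32_PMCx underneath) for this CPU
    u64 val = bpf_perf_event_read(&cycles, BPF_F_CURRENT_CPU);
    if ((s64)val < 0)
        return 0;   // no counter configured on this CPU
    bpf_printk("cycles=%llu", val);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";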

MSR Register Configuration Example (x86_64):

// Configure CPU cycles counter
IA32_PERFEVTSEL0 = 0x0043003C
  [7:0]   = 0x3C  // Event Select: CPU_CLK_UNHALTED.THREAD_P
  [15:8]  = 0x00  // Unit Mask
  [16]    = 1     // USR: Count user mode
  [17]    = 1     // OS: Count kernel mode
  [18]    = 0     // Edge Detect
  [19]    = 0     // Pin Control
  [20]    = 0     // APIC Interrupt Enable
  [22]    = 1     // Enable Counter
  [23]    = 0     // Invert Counter Mask

// Read counter value
cycles = RDMSR(IA32_PMC0)

3.2 Memory Mapping and DMA Access

eBPF programs can access kernel memory through bpf_probe_read_kernel():

// Memory access permission checking
SEC("kprobe/tcp_sendmsg")
int kprobe_tcp_sendmsg(struct pt_regs *ctx) {
    struct sock *sk = (struct sock *)PT_REGS_PARM1(ctx);
    
    // 1. Guard against a NULL socket pointer
    if (!sk) return 0;
    
    // 2. Use a safe read helper (performs address validation)
    u16 family;
    bpf_probe_read_kernel(&family, sizeof(family), &sk->sk_family);
    
    // 3. Reading user-space memory requires the _user variant
    //    (user_ptr stands in for a user-space address obtained from
    //    the probe context; shown here for illustration)
    char buffer[256];
    bpf_probe_read_user(buffer, sizeof(buffer), user_ptr);
    
    return 0;
}

Memory Access Safety Mechanisms:

Function                  Purpose             Safety Checks
bpf_probe_read_kernel()   Read kernel memory  Page-table check, address-range validation
bpf_probe_read_user()     Read user memory    copy_from_user() semantics, page-fault handling
bpf_probe_write_user()    Write user memory   Write-permission check, COW handling

Part II: eBPF Maps Advanced Implementation

4. Map Internal Data Structures

4.1 Hash Map Implementation: Buckets and Per-CPU Freelists

// BPF_MAP_TYPE_HASH kernel implementation
struct bpf_htab {
    struct bpf_map map;
    struct bucket *buckets;  // Hash bucket array
    void *elems;             // Element storage area
    union {
        struct pcpu_freelist freelist;  // Per-CPU free list
        struct bpf_lru lru;             // LRU eviction policy
    };
    atomic_t count;          // Current element count
    u32 n_buckets;           // Bucket count (power of 2)
    u32 elem_size;           // Element size
    // Locking is per bucket: each struct bucket embeds a raw spinlock
};

// Hash lookup flow
1. hash = jhash(key, key_size, seed)
2. bucket_id = hash & (n_buckets - 1)
3. bucket = &buckets[bucket_id]
4. Traverse bucket list, compare keys
5. Return value pointer

4.2 Ring Buffer Lock-Free Implementation

BPF_MAP_TYPE_RINGBUF pairs spinlock-serialized reservation on the producer side with lock-free consumption on the consumer side:

// Ring Buffer memory layout
struct bpf_ringbuf {
    u64 consumer_pos __aligned(PAGE_SIZE);  // Consumer position (user-space)
    u64 producer_pos __aligned(PAGE_SIZE);  // Producer position (kernel-space)
    char data[] __aligned(PAGE_SIZE);       // Data area
};

// Reserve/commit algorithm (simplified)
1. Take rb->spinlock (serializes concurrent producers)
2. new_pos = producer_pos + len
3. if (new_pos - consumer_pos > rb->mask) return NULL  // buffer full, -ENOSPC
4. Write a record header at data[producer_pos & mask] with the BUSY bit set
5. producer_pos = new_pos, release the spinlock
6. Caller fills the record; commit clears the BUSY bit so the consumer
   (which reads lock-free) can consume the record
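
From an eBPF program this reserve/commit protocol surfaces as a pair of helpers; a minimal sketch:

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 256 * 1024);  // power-of-2 multiple of the page size
} events SEC(".maps");

struct event {
    u32 pid;
    u64 ts;
};

SEC("kprobe/do_sys_openat2")
int emit_event(struct pt_regs *ctx)
{
    // Reserve covers steps 1-5 above; NULL means the buffer is full
    struct event *e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
    if (!e)
        return 0;
    e->pid = bpf_get_current_pid_tgid() >> 32;
    e->ts  = bpf_ktime_get_ns();
    // Submit is step 6: clears the BUSY bit for the consumer
    bpf_ringbuf_submit(e, 0);
    return 0;
}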

5. JIT Compiler Implementation

5.1 x86_64 JIT Compilation Flow

// eBPF instruction to x86_64 machine code mapping
eBPF:   r0 = r1
        r0 += r2
  ↓ JIT compilation
x86_64: mov %rdi, %rax   // r1→RDI copied into r0→RAX
        add %rsi, %rax   // r2→RSI added in place

// JIT compilation steps
1. First pass: Calculate jump offsets
2. Allocate JIT code buffer (executable pages)
3. Second pass: Generate machine code
4. Fix jump target addresses
5. Set page permissions to RX (read+execute)
6. Flush I-Cache

JIT Optimization Examples:

eBPF Instruction     Unoptimized x86_64       Optimized x86_64
r0 = 0               mov $0, %rax             xor %rax, %rax
r0 += 1              add $1, %rax             inc %rax
if r0 == 0 goto +5   cmp $0, %rax; je off     test %rax, %rax; jz off

(r0 maps to RAX in the x86_64 JIT)

Part III: Production Environment Practices

6. Performance Optimization Techniques

6.1 Reduce Map Lookup Overhead

// Before optimization: Multiple lookups
SEC("kprobe/tcp_sendmsg")
int kprobe_tcp_sendmsg(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;  // upper 32 bits hold the tgid (process ID)
    u64 *bytes = bpf_map_lookup_elem(&stats, &pid);
    if (bytes) (*bytes)++;
    u64 *packets = bpf_map_lookup_elem(&stats2, &pid);
    if (packets) (*packets)++;
    return 0;
}

// After optimization: Merged data structure
struct tcp_stats {
    u64 bytes;
    u64 packets;
};

SEC("kprobe/tcp_sendmsg")
int kprobe_tcp_sendmsg_optimized(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;  // upper 32 bits hold the tgid (process ID)
    struct tcp_stats *stats = bpf_map_lookup_elem(&combined_stats, &pid);
    if (stats) {
        stats->bytes++;
        stats->packets++;
    }
    return 0;
}

6.2 Use Per-CPU Maps to Avoid Lock Contention

// Per-CPU Map definition
struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
    __uint(max_entries, 1);
    __type(key, u32);
    __type(value, u64);
} per_cpu_counter SEC(".maps");

// Lock-free update
SEC("kprobe/schedule")
int kprobe_schedule(struct pt_regs *ctx) {
    u32 key = 0;
    u64 *count = bpf_map_lookup_elem(&per_cpu_counter, &key);
    if (count) (*count)++;  // No atomic operations needed
    return 0;
}
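
The cost moves to the read side: a single user-space lookup on a per-CPU map returns one value for every possible CPU, and the reader aggregates them itself. A sketch:

// User-space aggregation for the per_cpu_counter map above
#include <bpf/libbpf.h>
#include <bpf/bpf.h>

static unsigned long long read_total(int map_fd)
{
    int ncpus = libbpf_num_possible_cpus();
    __u64 vals[ncpus];
    __u32 key = 0;
    unsigned long long total = 0;

    // One lookup fills vals[] with the value from every possible CPU
    if (bpf_map_lookup_elem(map_fd, &key, vals) == 0)
        for (int i = 0; i < ncpus; i++)
            total += vals[i];
    return total;
}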

7. Debugging and Troubleshooting

7.1 Debugging with bpftool

# List all eBPF programs
bpftool prog show

# View program's JIT assembly code
bpftool prog dump jited id 123

# View program's eBPF bytecode
bpftool prog dump xlated id 123

# View Map contents
bpftool map dump id 456

# View program statistics (requires: sysctl -w kernel.bpf_stats_enabled=1)
bpftool --json prog show id 123 | jq '.run_time_ns'

7.2 Verifier Error Analysis

Common Verifier Errors and Solutions:

Error Message             Cause                                 Solution
invalid read from stack   Uninitialized stack variable read     Write before read, or initialize with = {0}
unbounded loop            Loop bound the verifier cannot prove  Use #pragma unroll or a fixed iteration limit
R1 pointer arithmetic     Illegal pointer operation             Use bpf_probe_read_kernel() instead of direct dereference
exceeds max program size  Instruction count limit exceeded      Split into multiple programs, use tail calls
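
For the unbounded-loop case, a sketch of the standard fix: give the loop a compile-time bound and unroll it so no back-edge reaches the verifier (the buffer is also initialized, avoiding the first error in the table):

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

SEC("kprobe/do_sys_openat2")
int bounded_loop(struct pt_regs *ctx)
{
    char comm[16] = {0};   // initialized: no "invalid read from stack"
    bpf_get_current_comm(&comm, sizeof(comm));

    int dots = 0;
#pragma unroll
    for (int i = 0; i < (int)sizeof(comm); i++)  // fixed bound, fully unrolled
        if (comm[i] == '.')
            dots++;
    bpf_printk("dots=%d", dots);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";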

8. Practical Case Study: TCP Connection Tracking

8.1 Complete Implementation

// tcp_tracer.bpf.c
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

struct conn_info {
    u32 saddr;
    u32 daddr;
    u16 sport;
    u16 dport;
    u64 bytes_sent;
    u64 bytes_recv;
};

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 10240);
    __type(key, u64);  // sock pointer
    __type(value, struct conn_info);
} connections SEC(".maps");

SEC("kprobe/tcp_sendmsg")
int BPF_KPROBE(tcp_sendmsg, struct sock *sk, struct msghdr *msg, size_t size)
{
    u64 sock_ptr = (u64)sk;
    struct conn_info *info = bpf_map_lookup_elem(&connections, &sock_ptr);
    
    if (!info) {
        struct conn_info new_info = {0};
        
        // Read connection information
        BPF_CORE_READ_INTO(&new_info.saddr, sk, __sk_common.skc_rcv_saddr);
        BPF_CORE_READ_INTO(&new_info.daddr, sk, __sk_common.skc_daddr);
        BPF_CORE_READ_INTO(&new_info.sport, sk, __sk_common.skc_num);   // host byte order
        BPF_CORE_READ_INTO(&new_info.dport, sk, __sk_common.skc_dport); // network byte order
        
        new_info.bytes_sent = size;
        bpf_map_update_elem(&connections, &sock_ptr, &new_info, BPF_NOEXIST);
    } else {
        __sync_fetch_and_add(&info->bytes_sent, size);
    }
    
    return 0;
}

SEC("kprobe/tcp_cleanup_rbuf")
int BPF_KPROBE(tcp_cleanup_rbuf, struct sock *sk, int copied)
{
    if (copied <= 0) return 0;
    
    u64 sock_ptr = (u64)sk;
    struct conn_info *info = bpf_map_lookup_elem(&connections, &sock_ptr);
    
    if (info) {
        __sync_fetch_and_add(&info->bytes_recv, copied);
    }
    
    return 0;
}

char LICENSE[] SEC("license") = "GPL";

8.2 User-Space Program

// tcp_tracer.c
#include <stdio.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <bpf/libbpf.h>
#include <bpf/bpf.h>

// Mirrors struct conn_info in tcp_tracer.bpf.c (normally shared via a header)
struct conn_info {
    __u32 saddr;
    __u32 daddr;
    __u16 sport;
    __u16 dport;
    __u64 bytes_sent;
    __u64 bytes_recv;
};

int main() {
    struct bpf_object *obj;
    struct bpf_program *prog;
    struct bpf_map *map;
    int map_fd;
    
    // Load eBPF program
    obj = bpf_object__open_file("tcp_tracer.bpf.o", NULL);
    if (!obj || bpf_object__load(obj))
        return 1;
    
    // Attach every program in the object (libbpf has no object-level
    // attach; each program is attached individually)
    bpf_object__for_each_program(prog, obj) {
        if (!bpf_program__attach(prog))
            return 1;
    }
    
    // Get map
    map = bpf_object__find_map_by_name(obj, "connections");
    map_fd = bpf_map__fd(map);
    
    // Periodically read connection information
    while (1) {
        __u64 key = 0, next_key;
        struct conn_info info;
        
        while (bpf_map_get_next_key(map_fd, &key, &next_key) == 0) {
            bpf_map_lookup_elem(map_fd, &next_key, &info);
            
            char saddr[INET_ADDRSTRLEN], daddr[INET_ADDRSTRLEN];
            inet_ntop(AF_INET, &info.saddr, saddr, sizeof(saddr));
            inet_ntop(AF_INET, &info.daddr, daddr, sizeof(daddr));
            
            // skc_num is already host order; skc_dport is network order
            printf("%s:%u -> %s:%u  TX: %llu  RX: %llu\n",
                   saddr, info.sport,
                   daddr, ntohs(info.dport),
                   info.bytes_sent, info.bytes_recv);
            
            key = next_key;
        }
        
        sleep(1);
    }
    
    return 0;
}

Summary and Best Practices

Key Takeaways

  • Understand Low-Level Mechanisms: Master Probe engine, PC jumps, hardware access implementation details
  • Performance Optimization: Use Per-CPU Maps, reduce Map lookups, leverage JIT properly
  • Safe Programming: Follow Verifier rules, use safe memory access functions
  • Production Deployment: Thorough testing, monitor performance impact, prepare rollback plans

This document is continuously updated. Feedback and suggestions are welcome.
