
Break, Remap, Debug: Using the FPB Unit on ARM Cortex-M

1. Introduction

When you debug code that lives in Flash, "software" breakpoints aren't always enough. The Flash Patch and Breakpoint (FPB) unit is a tiny hardware block in the ARM debug fabric that lets you do two powerful things without touching your binary: set true hardware breakpoints on instruction fetches, and—optionally—replace the fetched instruction with one you choose. In practice, that means you can halt exactly at a function in Flash, or even redirect the first instruction to a small veneer to tweak behavior for tests and hotfixes.

2. FPB Features

What is FPB?

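The Flash Patch & Breakpoint (FPB) unit is a CoreSight debug component in ARMv7-M/ARMv7E-M cores (Cortex-M3, M4, and M7; the STM32H7 is an M7 example). It watches instruction and literal fetches in the Code region (addresses below 0x20000000) and can either raise a debug event or substitute the fetched word, enabling non-intrusive breakpoints and lightweight patches in code that runs from Flash. Remapping is only available where FP_REMAP reports it as supported (bit 29 reads 1). That is the case on Cortex-M3/M4, whereas the Cortex-M7's FPB provides breakpoints only, so check your core's Technical Reference Manual before relying on the remap path.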

2.1 Hardware Breakpoints

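The FPB's instruction comparators watch instruction fetches in the Code region (the literal comparators can only remap, not break). When a comparator matches an address, the core raises a debug event: it halts in place if a debugger has enabled halting debug (DHCSR.C_DEBUGEN=1), or it enters the DebugMonitor exception if DEMCR.MON_EN=1. One of the two must be set up before you arm a comparator. Either way, you can stop in Flash without changing your binary.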

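For a match to occur, both enables must be set: the global FP_CTRL.ENABLE (written together with FP_CTRL.KEY, bit 1, or the write is ignored) and the per-comparator FP_COMP[n].ENABLE. Matches have halfword granularity, but how the halfword is selected depends on the FPB version reported in FP_CTRL.REV (bits 31:28). Version 1 (Cortex-M3/M4) stores the word address in FP_COMP[28:2] and uses the REPLACE field (bits 31:30) to break on the lower halfword (0b01), the upper halfword (0b10), or both (0b11). Version 2 (e.g., Cortex-M7) stores the halfword address directly in FP_COMP[31:1]. This is the simplest, most reliable way to break on code in Flash, and it's what most debuggers use under the hood.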
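As a rough sketch of the comparator encoding, the helper below builds an FP_COMP value for a breakpoint. The name `fpb_bkpt_comp` and the `fpb_rev` parameter are made up for illustration; the field layout assumed here follows the ARMv7-M description of FP_COMPn (version 1: word address in bits 28:2 with the halfword chosen by REPLACE; version 2: halfword address in bits 31:1).

```c
#include <stdint.h>

/* Hypothetical helper: build an FP_COMP value that breaks on the
 * instruction at `addr`. `fpb_rev` is the FP_CTRL.REV field
 * (0 = FPB version 1, anything else is treated as version 2).
 * Returns 0 if a version 1 FPB cannot match the address. */
static uint32_t fpb_bkpt_comp(uint32_t addr, uint32_t fpb_rev)
{
    addr &= ~1u;                                  /* drop the Thumb bit */
    if (fpb_rev == 0u) {
        if (addr >= 0x20000000u)                  /* v1 covers the Code region only */
            return 0u;
        uint32_t replace = (addr & 2u) ? (2u << 30)    /* break on upper halfword */
                                       : (1u << 30);   /* break on lower halfword */
        return replace | (addr & 0x1FFFFFFCu) | 1u;    /* bit 0 = ENABLE */
    }
    return addr | 1u;                             /* v2: BPADDR[31:1] plus enable bit */
}
```

On a target, the result would be written to a free comparator, for example `FPB->FP_COMP[0] = fpb_bkpt_comp((uint32_t)(uintptr_t)test, rev);`, with `rev` read from FP_CTRL bits 31:28.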


Figure 1: FPB Hardware Breakpoint

2.2 Flash Patch (Remap) — Lightweight Instruction Replacement

Flash Patch (remap) lets the FPB swap one fetched 32-bit word with a word you pre-load in RAM. You point FP_REMAP at a 32-byte table (8×32-bit) in SRAM. The table must be 32-byte aligned, and it must live in SRAM because hardware forces FP_REMAP[31:29]=0b001. Arm comparator n at the target address with REPLACE=0b00 (remap); when the CPU fetches there, the FPB substitutes table[n] for the original 32-bit fetch.
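The table constraints above are easy to get wrong, so a small sanity check can help. This is only a sketch: the function names are hypothetical, and it assumes the 8-comparator layout described here (32-byte table, one word per comparator).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical check: can `table_addr` serve as the FP_REMAP base?
 * Assumes 8 comparators, hence a 32-byte table with 32-byte alignment. */
static bool fpb_remap_base_ok(uint32_t table_addr)
{
    bool aligned = (table_addr & 0x1Fu) == 0u;
    bool in_sram = (table_addr >> 29) == 0x1u;   /* 0x20000000..0x3FFFFFFF */
    return aligned && in_sram;
}

/* Address of the replacement word that comparator n reads on a match. */
static uint32_t fpb_remap_slot(uint32_t table_addr, unsigned n)
{
    return table_addr + 4u * n;
}
```

A table placed in Flash or at a 16-byte boundary fails the check, which is cheaper to catch here than to debug as a patch that silently never applies.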

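Remap works on whole, word-aligned 32-bit fetches. In Thumb, each such word contains two 16-bit halves: the lower halfword at address A and the upper at A+2. The comparator matches on the word address and the table entry replaces both halves at once, so to change a single 16-bit instruction you copy the untouched halfword from Flash into the replacement word. The nonzero REPLACE encodings (LOWER, UPPER, BOTH) do not select a partial remap; they turn the comparator into a breakpoint on the corresponding halfword. A 32-bit Thumb-2 instruction such as B.W fits in one table entry only if it starts on a word boundary. Typical uses are swapping a single instruction, tweaking a literal, or replacing the first instruction with a branch to a veneer in Flash (needed because a single B.W only reaches ±16 MB, so long jumps to RAM require a veneer).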
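Assuming the FPB remaps a whole word-aligned 32-bit fetch (as the ARMv7-M description of FP_COMPn indicates), patching a single 16-bit instruction means carrying the neighbouring halfword over unchanged. The helper below is a minimal sketch of that merge; `fpb_patch_halfword` is a hypothetical name, and a little-endian core is assumed.

```c
#include <stdint.h>

/* Hypothetical helper: build a remap word that replaces only the 16-bit
 * instruction at `addr` and keeps the other halfword of the same word.
 * `orig_word` is the 32-bit value read from Flash at (addr & ~3u). */
static uint32_t fpb_patch_halfword(uint32_t orig_word, uint32_t addr,
                                   uint16_t new_insn)
{
    if (addr & 2u)   /* upper halfword lives in bits 31:16 (little-endian) */
        return (orig_word & 0x0000FFFFu) | ((uint32_t)new_insn << 16);
    return (orig_word & 0xFFFF0000u) | new_insn;   /* lower halfword, bits 15:0 */
}
```

For instance, if the word in Flash is 0x4770BF00 (NOP followed by BX LR), replacing the NOP with MOVS r0, #1 (0x2001) gives 0x47702001, which is the value to store in the comparator's table slot.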


Figure 2: FPB Remap Flow

Credit: The Definitive Guide to ARM Cortex-M3 and Cortex-M4

3. How to Use FPB

3.1 Hardware Breakpoint

To halt when test() is fetched from Flash without touching the app, turn on the FPB by writing FP_CTRL with both KEY and ENABLE set. (DEMCR.TRCENA gates the DWT and ITM rather than the FPB, so setting it is optional.) Then program one comparator (e.g., FP_COMP[0]) with the first-instruction address of test(): clear the Thumb bit (bit0), encode the halfword the way your FPB version expects (the REPLACE field on version 1, the address bits themselves on version 2), and set the comparator's ENABLE bit. Issue DSB; ISB barriers, and call test(). On the next fetch, the FPB match raises a debug event. Your debugger halts immediately (halt mode) or the DebugMonitor exception fires, giving you a true hardware breakpoint in Flash with zero code changes.

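#include <stdint.h>
#include "stm32h7xx_hal.h"  /* assumed device/HAL header: CoreDebug, __DSB(), __NOP(), HAL_Init() */

typedef struct {
    volatile uint32_t FP_CTRL, FP_REMAP, FP_COMP[8];
} FPB_Type;

#define FPB ((FPB_Type*)0xE0002000UL)
#define FPB_CTRL_ENABLE     (1u<<0)
#define FPB_CTRL_KEY        (1u<<1)             /* FP_CTRL writes are ignored unless KEY=1 */
#define FPB_CTRL_REV(x)     (((x)>>28) & 0xFu)  /* 0 = FPB version 1, 1 = version 2 */
#define FPB_COMP_ENABLE     (1u<<0)
#define FPB_COMP_BKPT_LOWER (1u<<30)            /* v1: break on halfword at word+0 */
#define FPB_COMP_BKPT_UPPER (2u<<30)            /* v1: break on halfword at word+2 */

__attribute__((noinline)) void test(void) {
    /* debugger halts here */
    __NOP();
}

static inline void fpb_enable(void) {
    /* TRCENA gates DWT/ITM, not the FPB; harmless, kept for other trace units */
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
    FPB->FP_CTRL = FPB_CTRL_KEY | FPB_CTRL_ENABLE;
    __DSB();
    __ISB();
}

void demo_breakpoint(void) {
    fpb_enable();
    uint32_t a = ((uint32_t)(uintptr_t)test) & ~1u;   /* clear the Thumb bit */
    uint32_t comp;
    if (FPB_CTRL_REV(FPB->FP_CTRL) == 0u) {
        /* FPB v1 (Cortex-M3/M4): word address in [28:2], halfword chosen by REPLACE */
        comp = (a & 0x1FFFFFFCu)
             | ((a & 0x2u) ? FPB_COMP_BKPT_UPPER : FPB_COMP_BKPT_LOWER);
    } else {
        /* FPB v2 (e.g., Cortex-M7): halfword address in [31:1] */
        comp = a;
    }
    FPB->FP_COMP[0] = comp | FPB_COMP_ENABLE;
    __DSB();
    __ISB();

    test();
}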

int main(void) {
    HAL_Init();
    SystemClock_Config();
    demo_breakpoint();
    while (1) {}
}

3.2 Flash Patch (Remap)

To change behavior at runtime without touching Flash, use FPB's remap path (available on cores whose FPB supports it, such as Cortex-M3/M4): instead of executing the word fetched from Flash, the core can substitute a word you preload in a tiny table in SRAM. Point FP_REMAP at a 32-byte (8×32-bit) table that's 32-byte aligned, then arm a comparator on the word that holds the first instruction of the function you want to intercept (clear the T-bit and leave REPLACE at 0b00, which selects remap).

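For a 32-bit replacement (e.g., injecting a B.W), the target instruction must start on a word boundary so that a single table entry covers it; after a DSB; ISB, the next fetch at that address will read your replacement word from the table, not from Flash. A practical pattern is to inject a 32-bit branch to a tiny veneer in Flash (keeps within the ±16 MB range of B.W), do your tweak there, then BX LR back. Because LR still holds the original return address, the veneer returns to the caller of the patched function.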

/* 3.2 Flash patch (remap): replace the first word of test() with a 32-bit B.W to a veneer.
   Requires an FPB with remap support (e.g., Cortex-M3/M4). */

#define FPB_COMP_REPLACE_REMAP  (0u<<30)   /* REPLACE=0b00: remap the fetch to the table entry */
#define FPB_REMAP_RMPSPT        (1u<<29)   /* reads 1 when remap is supported */

__attribute__((aligned(32))) static uint32_t fpb_table[8];

__attribute__((noinline)) static void veneer_dec(void) {
    /* example: adjust state; the normal return (BX LR) goes to test()'s caller */
    __NOP();
}

/* Encode Thumb-2 unconditional B.W (encoding T4); pc = branch address + 4 */
static inline uint32_t encode_bw(uint32_t pc, uint32_t target) {
    uint32_t off = (target & ~1u) - pc;            /* byte offset, must fit in ±16 MB */
    uint32_t S=(off>>24)&1u, I1=(off>>23)&1u, I2=(off>>22)&1u;
    uint32_t imm10=(off>>12)&0x3FFu, imm11=(off>>1)&0x7FFu;
    uint32_t J1=(I1^1u)^S, J2=(I2^1u)^S;           /* J = NOT(I) XOR S */
    uint16_t hw1=(uint16_t)(0xF000u | (S<<10) | imm10);
    uint16_t hw2=(uint16_t)(0x9000u | (J1<<13) | (J2<<11) | imm11);
    return ((uint32_t)hw2<<16) | hw1;              /* first halfword sits at the lower address */
}

void demo_patch(void) {
    /* 1) Enable the FPB (bit 1 is KEY; without it the write is ignored) */
    FPB->FP_CTRL = FPB_CTRL_ENABLE | (1u<<1);
    __DSB();
    __ISB();

    /* 2) Program FP_REMAP to our 32-byte table in SRAM; give up if remap is unsupported */
    uint32_t base = (uint32_t)(uintptr_t)fpb_table;             /* already 32-byte aligned */
    FPB->FP_REMAP = base;
    if ((FPB->FP_REMAP & FPB_REMAP_RMPSPT) == 0u)
        return;

    /* 3) Build the replacement word: a 32-bit branch to a small veneer in FLASH */
    uint32_t entry = ((uint32_t)(uintptr_t)test) & ~1u;         /* clear T-bit */
    if (entry & 0x2u)
        return;                  /* a 32-bit replacement needs a word-aligned entry */
    uint32_t pc_for_b = entry + 4u;                             /* PC seen by the branch */
    uint32_t bw = encode_bw(pc_for_b, (uint32_t)(uintptr_t)veneer_dec);

    /* 4) Comparator 0 reads table slot 0 */
    ((volatile uint32_t*)base)[0] = bw;
    __DSB();
    __ISB();

    /* 5) Arm comparator 0 on the entry word; REPLACE=0b00 selects remap */
    FPB->FP_COMP[0] = (entry & 0x1FFFFFFCu) | FPB_COMP_REPLACE_REMAP | FPB_COMP_ENABLE;
    __DSB();
    __ISB();

    /* Call: the first fetch of test() now yields our B.W, so the veneer runs in place of
       test() and returns straight to this caller. test() must not be inlined here. */
    test();
}
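Since the patch relies on the veneer being within reach of a single B.W, it can be worth checking the distance before arming the comparator. The following is a sketch only; `bw_reaches` is a hypothetical helper, and the limits come from the 25-bit signed byte offset of the T4 encoding, taken relative to the branch address plus 4.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical check: can a B.W placed at `insn_addr` reach `target`?
 * The T4 encoding holds a signed 25-bit byte offset relative to
 * insn_addr + 4, i.e. -16777216 .. +16777214 bytes. */
static bool bw_reaches(uint32_t insn_addr, uint32_t target)
{
    int64_t off = (int64_t)(target & ~1u) - ((int64_t)insn_addr + 4);
    return off >= -16777216 && off <= 16777214;
}
```

A branch from Flash at 0x08000100 to SRAM at 0x20000000 fails this check, which is why the replacement word branches to a veneer in Flash rather than directly to RAM.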

4. Conclusion

FPB shines when you need true hardware breakpoints in Flash without touching the binary and when you want fast, reversible hot patches—swap a single instruction, tweak a literal, or redirect a function prologue to a small veneer. It's especially useful for production debugging and in-field diagnostics where reflashing is risky.