Intel's XScale Microarchitecture processors provide a Performance Monitoring Unit (PMU) that can be utilized to provide information that can be useful for fine tuning of code. This text file describes the API that's been developed for use by Linux kernel programmers. When I have some extra time on my hand, I will extend the code to provide support for user mode performance monitoring (which is probably much more useful). Note that to get the most usage out of the PMU, I highly reccomend getting the XScale reference manual from Intel and looking at chapter 12. To use the PMU, you must #include in your source file. Since there's only one PMU, only one user can currently use the PMU at a given time. To claim the PMU for usage, call pmu_claim() which returns an identifier. When you are done using the PMU, call pmu_release() with the identifier that you were given by pmu_claim. In addition, the PMU can only be used on XScale based systems that provide an external timer. Systems that the PMU is currently supported on are: - Cyclone IQ80310 Before delving into how to use the PMU code, let's do a quick overview of the PMU itself. The PMU consists of three registers that can be used for performance measurements. The first is the CCNT register with provides the number of clock cycles elapsed since the PMU was started. The next two register, PMN0 and PMN1, are eace user programmable to provide 1 of 20 different performance statistics. By combining different statistics, you can derive complex performance metrics. To start the PMU, just call pmu_start(pm0, pmn1). pmn0 and pmn1 tell the PMU what statistics to capture and can each be one of: EVT_ICACHE_MISS Instruction fetches requiring access to external memory EVT_ICACHE_NO_DELIVER Instruction cache could not deliver an instruction. Either an ICACHE miss or an instruction TLB miss. EVT_ICACHE_DATA_STALL Stall in execution due to a data dependency. This counter is incremented each cycle in which the condition is present. EVT_ITLB_MISS Instruction TLB miss EVT_DTLB_MISS Data TLB miss EVT_BRANCH A branch instruction was executed and it may or may not have changed program flow EVT_BRANCH_MISS A branch (B or BL instructions only) was mispredicted EVT_INSTRUCTION An instruction was executed EVT_DCACHE_FULL_STALL Stall because data cache buffers are full. Incremented on every cycle in which condition is present. EVT_DCACHE_FULL_STALL_CONTIG Stall because data cache buffers are full. Incremented on every cycle in which condition is contigous. EVT_DCACHE_ACCESS Data cache access (data fetch) EVT_DCACHE_MISS Data cache miss EVT_DCACHE_WRITE_BACK Data cache write back. This counter is incremented for every 1/2 line (four words) that are written back. EVT_PC_CHANGED Software changed the PC. This is incremented only when the software changes the PC and there is no mode change. For example, a MOV instruction that targets the PC would increment the counter. An SWI would not as it triggers a mode change. EVT_BCU_REQUEST The Bus Control Unit(BCU) received a request from the core EVT_BCU_FULL The BCU request queue if full. A high value for this event means that the BCU is often waiting for to complete on the external bus. EVT_BCU_DRAIN The BCU queues were drained due to either a Drain Write Buffer command or an I/O transaction for a page that was marked as uncacheable and unbufferable. EVT_BCU_ECC_NO_ELOG The BCU detected an ECC error on the memory bus but noe ELOG register was available to to log the errors. EVT_BCU_1_BIT_ERR The BCU detected a 1-bit error while reading from the bus. EVT_RMW An RMW cycle occurred due to narrow write on ECC protected memory. To get the results back, call pmu_stop(&results) where results is defined as a struct pmu_results: struct pmu_results { u32 ccnt; /* Clock Counter Register */ u32 ccnt_of; / u32 pmn0; /* Performance Counter Register 0 */ u32 pmn0_of; u32 pmn1; /* Performance Counter Register 1 */ u32 pmn1_of; }; Pretty simple huh? Following are some examples of how to get some commonly wanted numbers out of the PMU data. Note that since you will be dividing things, this isn't super useful from the kernel and you need to printk the data out to syslog. See [1] for more examples. Instruction Cache Efficiency pmu_start(EVT_INSTRUCTION, EVT_ICACHE_MISS); ... pmu_stop(&results); icache_miss_rage = results.pmn1 / results.pmn0; cycles_per_instruction = results.ccnt / results.pmn0; Data Cache Efficiency pmu_start(EVT_DCACHE_ACCESS, EVT_DCACHE_MISS); ... pmu_stop(&results); dcache_miss_rage = results.pmn1 / results.pmn0; Instruction Fetch Latency pmu_start(EVT_ICACHE_NO_DELIVER, EVT_ICACHE_MISS); ... pmu_stop(&results); average_stall_waiting_for_instruction_fetch = results.pmn0 / results.pmn1; percent_stall_cycles_due_to_instruction_fetch = results.pmn0 / results.ccnt; ToDo: - Add support for usermode PMU usage. This might require hooking into the scheduler so that we pause the PMU when the task that requested statistics is scheduled out. -- This code is still under development, so please feel free to send patches, questions, comments, etc to me. Deepak Saxena