Linux Server Power and Performance Management (Part 3): Implementation of the cpuidle Subsystem (2024)
This series collects CPU hardware fundamentals and kernel-subsystem knowledge related to Linux server power and performance.
- Linux Server Power and Performance Management (Part 1): CPU Hardware Fundamentals (2024)
- Linux Server Power and Performance Management (Part 2): Design of Several Kernel Subsystems (2024)
- Linux Server Power and Performance Management (Part 3): Implementation of the cpuidle Subsystem (2024)
- Linux Server Power and Performance Management (Part 4): Monitoring, Configuration, and Tuning (2024)
- Linux Server Power and Performance Management (Part 5): Discussion of Issues (2024)
Given limited expertise and maintenance bandwidth, errors or outdated content are inevitable; please read critically. Spread knowledge and respect the author's work: please credit the source when reposting.
- 1 Data structures
- 2 cpuidle governor registration
- 3 cpuidle driver registration
- 4 The idle task: how a c-state is entered
- 4.1 The idle task: do_idle()
- 4.2 do_idle() -> cpuidle_idle_call() -> call_cpuidle(driver)
- 4.3 call_cpuidle(driver) -> cpuidle_enter(drv, dev, next_state)
- 4.4 cpuidle_enter(drv, dev, next_state) -> cpuidle_enter_state(dev, drv, index)
- 4.5 cpuidle_enter_state(dev, drv, index) -> target_state->enter()
- 4.6 The idle state enter() callback: the haltpoll c1 state as an example
- 5 Quickly confirming the call path: tracing kernel call stacks
- References
The previous two parts were theory; this one digs into the kernel code: the idle task and the implementation of the cpuidle subsystem.
Wherever the kernel code deals with sleep states it uses the term "idle state", which largely corresponds to the c-state discussed in the previous part; this article uses the two terms interchangeably.
1 Data structures
1.1 struct cpuidle_state
The structure representing a CPU idle state, i.e. Linux's model of a c-state:
// include/linux/cpuidle.h
struct cpuidle_state {
...
s64 exit_latency_ns;
s64 target_residency_ns;
unsigned int exit_latency; /* in US */
unsigned int target_residency; /* in US */
unsigned int flags;
int power_usage; /* in mW */
int (*enter) (struct cpuidle_device *dev, struct cpuidle_driver *drv, int index);
};
A few fields have been reordered here for readability. Let's look at the important fields and methods.
1.1.1 exit_latency/target_residency
There are two variants of each, in us and ns respectively:
- exit_latency: the time needed to return to a fully functional state;
- target_residency: the minimum time the processor should stay in this idle state once it has entered it.
These two parameters reflect that entering and leaving a state has a cost: if the expected stay is shorter than some threshold, the state is not worth it, and in that case there is no point in entering it.
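As a minimal sketch (a hypothetical helper, not actual kernel code), the break-even logic a governor applies with these two fields boils down to something like the following; the real check is in menu_select(), shown in section 2.1.2:
/* Hypothetical sketch: pick the deepest state whose entry/exit cost is amortized. */
static int pick_state_sketch(struct cpuidle_driver *drv, s64 predicted_idle_ns, s64 latency_req_ns)
{
	int i, idx = 0;
	for (i = 0; i < drv->state_count; i++) {
		struct cpuidle_state *s = &drv->states[i];
		if (s->target_residency_ns > predicted_idle_ns)
			break; /* we won't stay long enough to amortize the entry/exit cost */
		if (s->exit_latency_ns > latency_req_ns)
			break; /* waking from this state would violate the latency budget */
		idx = i;
	}
	return idx;
}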
1.1.2 power_usage: power consumption in this state
The CPU's power draw while in this state, in mW.
1.1.3 flags
A bitmask of per-state attributes:
/* Idle State Flags */
#define CPUIDLE_FLAG_NONE (0x00)
#define CPUIDLE_FLAG_POLLING BIT(0) /* polling state */
#define CPUIDLE_FLAG_COUPLED BIT(1) /* state applies to multiple cpus */
#define CPUIDLE_FLAG_TIMER_STOP BIT(2) /* timer is stopped on this state */
#define CPUIDLE_FLAG_UNUSABLE BIT(3) /* avoid using this state */
#define CPUIDLE_FLAG_OFF BIT(4) /* disable this state by default */
#define CPUIDLE_FLAG_TLB_FLUSHED BIT(5) /* idle-state flushes TLBs */
#define CPUIDLE_FLAG_RCU_IDLE BIT(6) /* idle-state takes care of RCU */
1.1.4 The enter() method
enter(struct cpuidle_device *dev, struct cpuidle_driver *drv, int index)
Implemented by each idle driver, as we will see later.
Calling this method enters the state; it takes three parameters: the CPU device, the idle driver, and the state index.
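This is exactly how cpuidle_enter_state() invokes it later (section 4.5):
int entered_state = target_state->enter(dev, drv, index);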
1.2 struct cpuidle_governor
// include/linux/cpuidle.h
struct cpuidle_governor {
char name[CPUIDLE_NAME_LEN];
struct list_head governor_list;
unsigned int rating; // the governor's idea of how useful it is. By default, the kernel will use
// the governor with the highest rating value, but the system administrator can override that choice
int (*select) (struct cpuidle_driver *drv, struct cpuidle_device *dev, bool *stop_tick);
void (*reflect) (struct cpuidle_device *dev, int index);
};
1.2.1 select()
The most important method. Based on its own judgment, taking into account
- timer events,
- the predicted sleep/idle duration,
- PM QoS latency requirements,
and so on, the governor picks the idle state it considers most suitable (the snippet below shows how the latency requirement is obtained).
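The PM QoS latency requirement mentioned above comes from cpuidle_governor_latency_req(), which in recent kernels (drivers/cpuidle/governor.c) roughly takes the tighter of the per-CPU-device and global CPU latency limits:
// drivers/cpuidle/governor.c (roughly, recent kernels)
s64 cpuidle_governor_latency_req(unsigned int cpu)
{
	struct device *device = get_cpu_device(cpu);
	int device_req = dev_pm_qos_raw_resume_latency(device); /* per-device resume latency limit */
	int global_req = cpu_latency_qos_limit();                /* global CPU latency QoS limit */

	if (device_req > global_req)
		device_req = global_req;

	return (s64)device_req * NSEC_PER_USEC;
}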
1.2.2 reflect()
Called when the CPU exits the idle state; the governor uses the timing information to reflect on how good its decision was.
1.3 struct cpuidle_driver
struct cpuidle_driver {
...
const char *name;
struct module *owner;
struct cpuidle_state states[CPUIDLE_STATE_MAX]; /* must be ordered in decreasing power consumption */
int state_count;
int safe_state_index;
struct cpumask *cpumask; /* the driver handles the cpus in cpumask */
const char *governor;/* preferred governor to switch at register time */
};
1.3.1 states[]: the idle states supported by this driver
Ordered by decreasing power consumption.
1.3.2 cpuidle_register_driver(): registering an idle driver
A driver registers itself with:
int cpuidle_register_driver(struct cpuidle_driver *drv);
Grep the tree to see where drivers register themselves:
$ grep -R "cpuidle_register_driver" *
arch/x86/kernel/apm_32.c: if (!cpuidle_register_driver(&apm_idle_driver))
drivers/acpi/processor_idle.c: retval = cpuidle_register_driver(&acpi_idle_driver);
drivers/cpuidle/cpuidle.c: ret = cpuidle_register_driver(drv);
drivers/cpuidle/driver.c:EXPORT_SYMBOL_GPL(cpuidle_register_driver);
drivers/cpuidle/cpuidle-haltpoll.c: ret = cpuidle_register_driver(drv);
drivers/cpuidle/cpuidle-cps.c: err = cpuidle_register_driver(&cps_driver);
drivers/idle/intel_idle.c: retval = cpuidle_register_driver(&intel_idle_driver);
...
1.4 struct cpuidle_device
There is one of these per CPU:
struct cpuidle_device {
unsigned int registered:1;
unsigned int enabled:1;
unsigned int poll_time_limit:1;
unsigned int cpu;
ktime_t next_hrtimer;
int last_state_idx;
u64 last_residency_ns;
u64 poll_limit_ns;
u64 forced_idle_latency_limit_ns;
struct cpuidle_state_usage states_usage[CPUIDLE_STATE_MAX];
struct cpuidle_state_kobj *kobjs[CPUIDLE_STATE_MAX];
struct cpuidle_driver_kobj *kobj_driver;
struct cpuidle_device_kobj *kobj_dev;
struct list_head device_list;
cpumask_t coupled_cpus;
struct cpuidle_coupled *coupled;
};
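The idle loop in section 4.2 looks up the current CPU's device via cpuidle_get_device(), which is just a per-CPU pointer read; roughly (include/linux/cpuidle.h):
// include/linux/cpuidle.h (roughly)
static inline struct cpuidle_device *cpuidle_get_device(void)
{
	return __this_cpu_read(cpuidle_devices);
}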
2 cpuidle governor registration
Here we only look at the most commonly used one: the menu governor.
2.1 menu governor registration
2.1.1 Registration
// drivers/cpuidle/governors/menu.c
static struct cpuidle_governor menu_governor = {
.name = "menu",
.rating = 20,
.select = menu_select,
.reflect = menu_reflect,
};
static int __init init_menu(void) {
return cpuidle_register_governor(&menu_governor);
}
postcore_initcall(init_menu);
Next, let's look at its select()/reflect() implementations.
2.1.2 The select() method
// menu_select - selects the next idle state to enter
// @drv: cpuidle driver containing state data
// @dev: the CPU
// @stop_tick: indication on whether or not to stop the tick
static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev, bool *stop_tick) {
struct menu_device *data = this_cpu_ptr(&menu_devices);
s64 latency_req = cpuidle_governor_latency_req(dev->cpu);
/* determine the expected residency time, round up */
delta = tick_nohz_get_sleep_length(&delta_tick);
data->next_timer_ns = delta;
/* Use the lowest expected idle interval to pick the idle state. */
predicted_ns = ...;
// Find the idle state with the lowest power while satisfying our constraints.
for (i = 0; i < drv->state_count; i++) {
struct cpuidle_state *s = &drv->states[i];
if (s->target_residency_ns > predicted_ns) {
// Use a physical idle state, not busy polling, unless a timer is going to trigger soon enough.
if ((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) && s->exit_latency_ns <= latency_req && s->target_residency_ns <= data->next_timer_ns) {
predicted_ns = s->target_residency_ns;
idx = i;
break;
}
if (predicted_ns < TICK_NSEC)
break;
if (!tick_nohz_tick_stopped()) {
// If the state selected so far is shallow, waking up early won't hurt, so retain the
// tick in that case and let the governor run again in the next iteration of the loop.
predicted_ns = drv->states[idx].target_residency_ns;
break;
}
// If the state selected so far is shallow and this state's target residency matches the time till the
// closest timer event, select this one to avoid getting stuck in the shallow one for too long.
if (drv->states[idx].target_residency_ns < TICK_NSEC && s->target_residency_ns <= delta_tick)
idx = i;
return idx;
}
if (s->exit_latency_ns > latency_req)
break;
idx = i;
}
// Don't stop the tick if the selected state is a polling one or if the
// expected idle duration is shorter than the tick period length.
if (((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) || predicted_ns < TICK_NSEC) && !tick_nohz_tick_stopped()) {
*stop_tick = false;
if (idx > 0 && drv->states[idx].target_residency_ns > delta_tick) {
// The tick is not going to be stopped and the target residency of the state to be returned is not within
// the time until the next timer event including the tick, so try to correct that.
for (i = idx - 1; i >= 0; i--) {
idx = i;
if (drv->states[i].target_residency_ns <= delta_tick)
break;
}
}
}
return idx;
}
2.1.3 The reflect() method
Omitted in detail; see the sketch below.
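For completeness, menu_reflect() is tiny; roughly (v5.15, drivers/cpuidle/governors/menu.c) it records which state was just exited and flags the per-CPU menu data so the statistics get updated at the start of the next select():
// drivers/cpuidle/governors/menu.c (roughly, v5.15)
static void menu_reflect(struct cpuidle_device *dev, int index)
{
	struct menu_device *data = this_cpu_ptr(&menu_devices);

	dev->last_state_idx = index;                   /* remember which state we were in */
	data->needs_update = 1;                        /* recompute stats on the next select() */
	data->tick_wakeup = tick_nohz_idle_got_tick(); /* were we woken by the tick? */
}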
3 cpuidle driver registration
3.1 The haltpoll driver: paired with the haltpoll governor
// drivers/cpuidle/cpuidle-haltpoll.c
static struct cpuidle_driver haltpoll_driver = {
.name = "haltpoll",
.governor = "haltpoll",
.states = {
{ /* entry 0 is for polling */ },
{
.enter = default_enter_idle,
.exit_latency = 1,
.target_residency = 1,
.power_usage = -1,
.name = "haltpoll idle",
.desc = "default architecture idle",
},
},
.safe_state_index = 0,
.state_count = 2,
};
states is an array holding the c-states supported by this driver, ordered by decreasing power consumption. We can see that:
- the first entry is reserved for polling, i.e. the c0 state, which indeed has the highest power consumption;
- the second entry is the actual haltpoll idle state, corresponding to c1;
- there are no deeper, lower-power states such as c2/c3/….
3.1.1 Registration
static int __init haltpoll_init(void) {
struct cpuidle_driver *drv = &haltpoll_driver;
cpuidle_poll_state_init(drv);
cpuidle_register_driver(drv); // register driver
haltpoll_cpuidle_devices = alloc_percpu(struct cpuidle_device);
ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "cpuidle/haltpoll:online", haltpoll_cpu_online, haltpoll_cpu_offline);
haltpoll_hp_state = ret;
}
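The cpuidle_poll_state_init(drv) call above fills in the reserved entry 0; roughly (drivers/cpuidle/poll_state.c) it installs a polling idle routine with zero latency/residency and the CPUIDLE_FLAG_POLLING flag:
// drivers/cpuidle/poll_state.c (roughly)
void cpuidle_poll_state_init(struct cpuidle_driver *drv)
{
	struct cpuidle_state *state = &drv->states[0];

	snprintf(state->name, CPUIDLE_NAME_LEN, "POLL");
	snprintf(state->desc, CPUIDLE_DESC_LEN, "CPUIDLE CORE POLL IDLE");
	state->exit_latency = 0;
	state->target_residency = 0;
	state->power_usage = -1;
	state->enter = poll_idle; /* spins on need_resched() instead of halting */
	state->flags = CPUIDLE_FLAG_POLLING;
}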
3.1.2 enter(): executing the hlt instruction to put the CPU to sleep
static int default_enter_idle(struct cpuidle_device *dev, struct cpuidle_driver *drv, int index) {
if (current_clr_polling_and_test()) {
local_irq_enable();
return index;
}
default_idle();
return index;
}
The ensuing call chain on x86:
default_enter_idle
|-default_idle
|-__cpuidle default_idle(void) // arch/x86/kernel/process.c
|-raw_safe_halt() // include/linux/irqflags.h
|-raw_safe_halt() // arch/x86/include/asm/irqflags.h
|-native_safe_halt();
|-asm volatile("sti; hlt": : :"memory");
As we can see, at the bottom it executes a single instruction sequence, sti; hlt, via inline assembly, putting the processor into a halted state until the next external interrupt arrives.
In the x86 computer architecture, HLT (halt) is an assembly language instruction which halts the central processing unit (CPU) until the next external interrupt is fired.
3.1.3 This effectively disables the cpuidle subsystem
The whole point of cpuidle is to choose the most suitable state among several idle states.
With idle=haltpoll there is only one low-power state, c1, so there is nothing to choose from, and the cpuidle subsystem effectively has no role to play.
3.2 The acpi_idle driver
As we saw in the previous part, AMD CPUs use this driver in many cases.
ACPI (Advanced Configuration and Power Interface) is a vendor-neutral specification for advanced configuration and power management; it exposes the underlying hardware and its capabilities to the kernel, and communicates with that hardware through firmware (UEFI or BIOS).
3.2.1 Registration
This driver's registration is a bit special: instead of statically initializing all fields like other drivers, it fills them in dynamically later, depending on various conditions, e.g. which enter() method to use.
// drivers/acpi/processor_idle.c
// governor/enter() etc. all get filled in dynamically later
struct cpuidle_driver acpi_idle_driver = {
.name = "acpi_idle",
.owner = THIS_MODULE,
};
// prepares and configures cpuidle global state data i.e. idle routines
static int acpi_processor_setup_cpuidle_states(struct acpi_processor *pr) {
struct cpuidle_driver *drv = &acpi_idle_driver;
if (pr->flags.has_lpi)
return acpi_processor_setup_lpi_states(pr);
return acpi_processor_setup_cstates(pr);
}
LPI (Low Power Idle) mode
static int acpi_processor_setup_lpi_states(struct acpi_processor *pr) {
struct acpi_lpi_state *lpi;
struct cpuidle_state *state;
struct cpuidle_driver *drv = &acpi_idle_driver;
for (i = 0; i < pr->power.count && i < CPUIDLE_STATE_MAX; i++) {
lpi = &pr->power.lpi_states[i];
state = &drv->states[i];
snprintf(state->name, CPUIDLE_NAME_LEN, "LPI-%d", i);
strlcpy(state->desc, lpi->desc, CPUIDLE_DESC_LEN);
state->exit_latency = lpi->wake_latency;
state->target_residency = lpi->min_residency;
if (lpi->arch_flags)
state->flags |= CPUIDLE_FLAG_TIMER_STOP;
state->enter = acpi_idle_lpi_enter;
drv->safe_state_index = i;
}
drv->state_count = i;
return 0;
}
The registered enter() method is acpi_idle_lpi_enter().
Normal mode
static int acpi_processor_setup_cstates(struct acpi_processor *pr) {
struct acpi_processor_cx *cx;
struct cpuidle_state *state;
struct cpuidle_driver *drv = &acpi_idle_driver;
if (max_cstate == 0)
max_cstate = 1;
if (IS_ENABLED(CONFIG_ARCH_HAS_CPU_RELAX)) {
cpuidle_poll_state_init(drv);
count = 1;
} else {
count = 0;
}
for (i = 1; i < ACPI_PROCESSOR_MAX_POWER && i <= max_cstate; i++) {
cx = &pr->power.states[i];
if (!cx->valid)
continue;
state = &drv->states[count];
snprintf(state->name, CPUIDLE_NAME_LEN, "C%d", i);
strlcpy(state->desc, cx->desc, CPUIDLE_DESC_LEN);
state->exit_latency = cx->latency;
state->target_residency = cx->latency * latency_factor;
state->enter = acpi_idle_enter;
state->flags = 0;
if (cx->type == ACPI_STATE_C1 || cx->type == ACPI_STATE_C2) {
drv->safe_state_index = count;
}
count++;
if (count == CPUIDLE_STATE_MAX)
break;
}
drv->state_count = count;
if (!count)
return -EINVAL;
return 0;
}
The registered enter() method is acpi_idle_enter().
3.2.2 The enter() methods
Normal mode
static int acpi_idle_enter(struct cpuidle_device *dev, struct cpuidle_driver *drv, int index) {
struct acpi_processor_cx *cx = per_cpu(acpi_cstate[index], dev->cpu);
struct acpi_processor *pr;
pr = __this_cpu_read(processors);
if (unlikely(!pr))
return -EINVAL;
if (cx->type != ACPI_STATE_C1) {
if (cx->type == ACPI_STATE_C3 && pr->flags.bm_check)
return acpi_idle_enter_bm(drv, pr, cx, index);
/* C2 to C1 demotion. */
if (acpi_idle_fallback_to_c1(pr) && num_online_cpus() > 1) {
index = ACPI_IDLE_STATE_START;
cx = per_cpu(acpi_cstate[index], dev->cpu);
}
}
if (cx->type == ACPI_STATE_C3)
ACPI_FLUSH_CPU_CACHE();
acpi_idle_do_entry(cx);
return index;
}
// acpi_idle_do_entry - enter idle state using the appropriate method
// @cx: cstate data
//
// Caller disables interrupt before call and enables interrupt after return.
static void __cpuidle acpi_idle_do_entry(struct acpi_processor_cx *cx) {
if (cx->entry_method == ACPI_CSTATE_FFH) {
/* Call into architectural FFH based C-state */
acpi_processor_ffh_cstate_enter(cx);
} else if (cx->entry_method == ACPI_CSTATE_HALT) {
acpi_safe_halt();
} else {
/* IO port based C-state */
inb(cx->address);
wait_for_freeze();
}
}
// Callers should disable interrupts before the call and enable interrupts after return.
static void __cpuidle acpi_safe_halt(void) {
if (!tif_need_resched()) {
safe_halt();
local_irq_disable();
}
}
#define safe_halt() \
do { \
trace_hardirqs_on(); \
raw_safe_halt(); \
} while (0)
LPI (Low Power Idle) mode
/**
* acpi_idle_lpi_enter - enters an ACPI any LPI state
* @dev: the target CPU
* @drv: cpuidle driver containing cpuidle state info
* @index: index of target state
*
* Return: 0 for success or negative value for error
*/
static int acpi_idle_lpi_enter(struct cpuidle_device *dev, struct cpuidle_driver *drv, int index) {
struct acpi_processor *pr;
struct acpi_lpi_state *lpi;
pr = __this_cpu_read(processors);
lpi = &pr->power.lpi_states[index];
if (lpi->entry_method == ACPI_CSTATE_FFH)
return acpi_processor_ffh_lpi_enter(lpi);
return -EINVAL;
}
3.3 The intel_idle driver
Intel has a dizzying number of c-states; see the registration tables in drivers/idle/intel_idle.c.
An Intel-CPU node does not necessarily use this driver; it may use acpi_idle instead, depending on the user's configuration. For example, machines from the same server vendor but from different batches may be configured differently.
Below we pick one set of idle states to look at.
3.3.1 Example: the Nehalem idle states
// States are indexed by the cstate number, which is also the index into the MWAIT hint array.
// Thus C0 is a dummy.
static struct cpuidle_state nehalem_cstates[] __initdata = {
{
.name = "C1",
.desc = "MWAIT 0x00",
.flags = MWAIT2flg(0x00),
.exit_latency = 3,
.target_residency = 6,
.enter = &intel_idle,
.enter_s2idle = intel_idle_s2idle, },
{
.name = "C1E",
.desc = "MWAIT 0x01",
.flags = MWAIT2flg(0x01) | CPUIDLE_FLAG_ALWAYS_ENABLE,
.exit_latency = 10,
.target_residency = 20,
.enter = &intel_idle,
.enter_s2idle = intel_idle_s2idle, },
{
.name = "C3",
.desc = "MWAIT 0x10",
.flags = MWAIT2flg(0x10) | CPUIDLE_FLAG_TLB_FLUSHED,
.exit_latency = 20,
.target_residency = 80,
.enter = &intel_idle,
.enter_s2idle = intel_idle_s2idle, },
{
.name = "C6",
.desc = "MWAIT 0x20",
.flags = MWAIT2flg(0x20) | CPUIDLE_FLAG_TLB_FLUSHED,
.exit_latency = 200,
.target_residency = 800,
.enter = &intel_idle,
.enter_s2idle = intel_idle_s2idle, },
{
.enter = NULL }
};
This c-state table registers four states. What they have in common:
- the enter function: all four use intel_idle() to handle state entry, which is good news for us, since we only need to read one function.
What differs:
- name: the Intel-defined c-state, ranging from C1 to C6;
- flags: the first two are shallow sleep states; the last two are deep sleep states that flush the TLB, leaving it cold on wakeup;
- latencies: exit_latency ranges from 3 to 200 us, a factor of roughly 70; target_residency ranges from 6 to 800 us, a factor of more than 100.
These states show up in cpupower:
# On an intel-cpu node
$ cpupower monitor
| Nehalem || SandyBridge || Mperf || Idle_Stats
PKG|CORE| CPU| C3 | C6 | PC3 | PC6 || C7 | PC2 | PC7 || C0 | Cx | Freq || POLL | C1
0| 0| 0| 0.00| 0.00| 0.00| 0.00|| 0.00| 0.00| 0.00|| 1.83| 98.17| 2493|| 0.00| 99.81
0| 0| 24| 0.00| 0.00| 0.00| 0.00|| 0.00| 0.00| 0.00|| 1.72| 98.28| 2494|| 0.00| 99.82
...
3.3.2 The enter() method: intel_idle()
/**
* intel_idle - Ask the processor to enter the given idle state.
* @dev: cpuidle device of the target CPU.
* @drv: cpuidle driver (assumed to point to intel_idle_driver).
* @index: Target idle state index.
*
* Use the MWAIT instruction to notify the processor that the CPU represented by
* @dev is idle and it can try to enter the idle state corresponding to @index.
*
* If the local APIC timer is not known to be reliable in the target idle state,
* enable one-shot tick broadcasting for the target CPU before executing MWAIT.
*
* Optionally call leave_mm() for the target CPU upfront to avoid wakeups due to flushing user TLBs.
*/
static __cpuidle int intel_idle(struct cpuidle_device *dev, struct cpuidle_driver *drv, int index) {
struct cpuidle_state *state = &drv->states[index];
unsigned long eax = flg2MWAIT(state->flags);
unsigned long ecx = 1; /* break on interrupt flag */
mwait_idle_with_hints(eax, ecx);
return index;
}
/*
* MWAIT takes an 8-bit "hint" in EAX "suggesting"
* the C-state (top nibble) and sub-state (bottom nibble)
* 0x00 means "MWAIT(C1)", 0x10 means "MWAIT(C2)" etc.
*
* We store the hint at the top of our "flags" for each state.
*/
#define flg2MWAIT(flags) (((flags) >> 24) & 0xFF)
/*
* This uses new MONITOR/MWAIT instructions on P4 processors with PNI,
* which can obviate IPI to trigger checking of need_resched.
* We execute MONITOR against need_resched and enter optimized wait state
* through MWAIT. Whenever someone changes need_resched, we would be woken
* up from MWAIT (without an IPI).
*
* New with Core Duo processors, MWAIT can take some hints based on CPU capability.
*/
static inline void mwait_idle_with_hints(unsigned long eax, unsigned long ecx) {
if (static_cpu_has_bug(X86_BUG_MONITOR) || !current_set_polling_and_test()) {
if (static_cpu_has_bug(X86_BUG_CLFLUSH_MONITOR)) {
mb();
clflush((void *)¤t_thread_info()->flags);
mb();
}
__monitor((void *)¤t_thread_info()->flags, 0, 0);
if (!need_resched())
__mwait(eax, ecx);
}
current_clr_polling();
}
3.3.3 How intel_idle and acpi_idle take precedence
intel_idle has enough information to manage c-states without relying on firmware/BIOS. The driver largely ignores BIOS settings and kernel boot parameters; if you want to control c-states yourself, disable it with intel_idle.max_cstate=0.
Once the intel_idle driver is disabled, the kernel falls back to acpi_idle to manage C-states. The system firmware (BIOS) provides the kernel with the list of available c-states via the ACPI tables, and users can modify this c-states table through BIOS settings.
Disabling C-states in this way will typically result in Linux using the C1 state for idle processors, which is fairly fast. If BIOS doesn’t allow C-states to be disabled, C-states can also be limited to C1 with the kernel parameter “idle=halt” (kernel parameter “idle=halt” should automatically disable cpuidle, including intel_idle, in newer kernels).
Controlling Processor C-State Usage in Linux, A Dell technical white paper describing the use of C-states with Linux operating systems, 2013
4 The idle task: how a c-state is entered
Now let's walk through the code path Linux takes to switch into a c-state.
In recent kernels, the idle task corresponds to the do_idle() function, so we start there.
4.1 The idle task: do_idle()
// https://github.com/torvalds/linux/blob/v5.15/kernel/sched/idle.c#L261
// Generic idle loop implementation. Called with polling cleared.
static void do_idle(void) {
int cpu = smp_processor_id();
nohz_run_idle_balance(cpu); // Check if we need to update blocked load
// If the arch has a polling bit, we maintain an invariant:
//
// Our polling bit is clear if we're not scheduled (i.e. if rq->curr != rq->idle). This means that,
// if rq->idle has the polling bit set, then setting need_resched is guaranteed to cause the CPU to reschedule.
__current_set_polling();
+- tick_nohz_idle_enter();
|
| while (!need_resched()) {
| local_irq_disable();
| +- arch_cpu_idle_enter();
| |
| | // In poll mode we reenable interrupts and spin. Also if we detected in the wakeup from idle path that the tick
| | // broadcast device expired for us, we don't want to go deep idle as we know that the IPI is going to arrive right away.
| | if (cpu_idle_force_poll || tick_check_broadcast_expired()) {
| | tick_nohz_idle_restart_tick();
| | cpu_idle_poll();
| | } else {
| | cpuidle_idle_call(); // --> CALLING INTO cpuidle subsystem, governor+driver
| | }
| +- arch_cpu_idle_exit();
| }
|
| // Since we fell out of the loop above, we know TIF_NEED_RESCHED must be set, propagate it into PREEMPT_NEED_RESCHED.
| // This is required because for polling idle loops we will not have had an IPI to fold the state for us.
| preempt_set_need_resched();
+- tick_nohz_idle_exit();
__current_clr_polling();
schedule_idle();
}
It calls cpuidle_idle_call(), which enters the cpuidle subsystem.
4.2 do_idle() -> cpuidle_idle_call() -> call_cpuidle(driver)
/**
* cpuidle_idle_call - the main idle function
*
* On architectures that support TIF_POLLING_NRFLAG, is called with polling
* set, and it returns with polling set. If it ever stops polling, it must clear the polling bit.
*/
static void cpuidle_idle_call(void) {
struct cpuidle_device *dev = cpuidle_get_device();
struct cpuidle_driver *drv = cpuidle_get_cpu_driver(dev);
if (idle_should_enter_s2idle() || dev->forced_idle_latency_limit_ns) {
...
} else {
// Ask the cpuidle framework to choose a convenient idle state.
bool stop_tick = true;
next_state = cpuidle_select(drv, dev, &stop_tick);
if (stop_tick || tick_nohz_tick_stopped())
tick_nohz_idle_stop_tick();
else
tick_nohz_idle_retain_tick();
entered_state = call_cpuidle(drv, dev, next_state);
cpuidle_reflect(dev, entered_state); // Give the governor an opportunity to reflect on the outcome
}
exit_idle:
__current_set_polling();
}
Each iteration of the idle loop does two main things.
4.2.1 cpuidle_select(drv, dev, &stop_tick): choosing a c-state
It calls into the governor to find the idle state best suited to the current conditions (the dispatch is shown below).
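cpuidle_select() itself just dispatches to the currently active governor; roughly (drivers/cpuidle/cpuidle.c):
// drivers/cpuidle/cpuidle.c (roughly)
int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev, bool *stop_tick)
{
	return cpuidle_curr_governor->select(drv, dev, stop_tick);
}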
We already walked through this selection logic when looking at the governor's select() method; next, let's see how, once the state is chosen, the CPU actually switches to it.
4.2.2 call_cpuidle(drv, dev, next_state): asking the processor to enter the c-state
It calls into the driver, asking the processor hardware to enter the selected idle state.
Let's follow the call chain into that idle state.
4.3 call_cpuidle(driver) -> cpuidle_enter(drv, dev, next_state)
static int call_cpuidle(struct cpuidle_driver *drv, struct cpuidle_device *dev, int next_state) {
// This function will block until an interrupt occurs and will take care of re-enabling the local interrupts
return cpuidle_enter(drv, dev, next_state);
}
4.4 cpuidle_enter(drv, dev, next_state) -> cpuidle_enter_state(dev, drv, index)
// drivers/cpuidle/cpuidle.c
/**
* cpuidle_enter - enter into the specified idle state
*
* @drv: the cpuidle driver tied with the cpu
* @dev: the cpuidle device
* @index: the index in the idle state table
*
* Returns the index in the idle state, < 0 in case of error.
* The error code depends on the backend driver
*/
int cpuidle_enter(struct cpuidle_driver *drv, struct cpuidle_device *dev, int index) {
/*
* Store the next hrtimer, which becomes either next tick or the next
* timer event, whatever expires first. Additionally, to make this data
* useful for consumers outside cpuidle, we rely on that the governor's
* ->select() callback have decided, whether to stop the tick or not.
*/
WRITE_ONCE(dev->next_hrtimer, tick_nohz_get_next_hrtimer());
if (cpuidle_state_is_coupled(drv, index))
ret = cpuidle_enter_state_coupled(dev, drv, index);
else
ret = cpuidle_enter_state(dev, drv, index);
WRITE_ONCE(dev->next_hrtimer, 0);
return ret;
}
4.5 cpuidle_enter_state(dev, drv, index) -> target_state->enter()
// drivers/cpuidle/cpuidle.c
/**
* cpuidle_enter_state - enter the state and update stats
* @dev: cpuidle device for this cpu
* @drv: cpuidle driver for this cpu
* @index: index into the states table in @drv of the state to enter
*/
int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv, int index) {
struct cpuidle_state *target_state = &drv->states[index];
bool broadcast = !!(target_state->flags & CPUIDLE_FLAG_TIMER_STOP);
/*
* Tell the time framework to switch to a broadcast timer because our
* local timer will be shut down. If a local timer is used from another
* CPU as a broadcast timer, this call may fail if it is not available.
*/
if (broadcast && tick_broadcast_enter()) {
index = find_deepest_state(drv, dev, target_state->exit_latency_ns, CPUIDLE_FLAG_TIMER_STOP, false);
if (index < 0) {
default_idle_call();
return -EBUSY;
}
target_state = &drv->states[index];
broadcast = false;
}
if (target_state->flags & CPUIDLE_FLAG_TLB_FLUSHED)
leave_mm(dev->cpu);
sched_idle_set_state(target_state); /* Take note of the planned idle state. */
time_start = ns_to_ktime(local_clock());
int entered_state = target_state->enter(dev, drv, index);
sched_clock_idle_wakeup_event();
time_end = ns_to_ktime(local_clock());
sched_idle_set_state(NULL); /* The cpu is no longer idle or about to enter idle. */
if (broadcast) {
if (WARN_ON_ONCE(!irqs_disabled()))
local_irq_disable();
tick_broadcast_exit();
}
if (!cpuidle_state_is_coupled(drv, index))
local_irq_enable();
if (entered_state >= 0) {
s64 diff, delay = drv->states[entered_state].exit_latency_ns;
// Update cpuidle counters
diff = ktime_sub(time_end, time_start);
dev->last_residency_ns = diff;
dev->states_usage[entered_state].time_ns += diff;
dev->states_usage[entered_state].usage++;
if (diff < drv->states[entered_state].target_residency_ns) {
for (i = entered_state - 1; i >= 0; i--) {
if (dev->states_usage[i].disable)
continue;
dev->states_usage[entered_state].above++; /* Shallower states are enabled, so update. */
break;
}
} else if (diff > delay) {
for (i = entered_state + 1; i < drv->state_count; i++) {
if (dev->states_usage[i].disable)
continue;
// Update if a deeper state would have been a better match for the observed idle duration.
if (diff - delay >= drv->states[i].target_residency_ns)
dev->states_usage[entered_state].below++;
break;
}
}
} else {
dev->last_residency_ns = 0;
dev->states_usage[index].rejected++;
}
return entered_state;
}
This step calls the c-state's enter() method, i.e. one of the callbacks we saw in the driver registration section, for example:
- haltpoll: default_enter_idle()
- acpi_idle: acpi_idle_enter() / acpi_idle_lpi_enter()
- intel_idle: intel_idle()
4.6 The idle state enter() callback: the haltpoll c1 state as an example
// https://github.com/torvalds/linux/blob/v5.15/arch/x86/include/asm/irqflags.h#L54
static inline __cpuidle void native_halt(void) {
mds_idle_clear_cpu_buffers();
asm volatile("hlt": : :"memory");
}
This instruction puts the CPU into the halted state until the next interrupt arrives.
5 Quickly confirming the call path: tracing kernel call stacks
The code has many branches, and sometimes you want to know which path is actually taken on a particular machine (which ultimately comes down to a particular configuration). Here are two tracing tools that can answer this quickly.
5.1 bpftrace
Fuzzy-search for traceable kernel functions:
$ bpftrace -l '*cpuidle*' # or: bpftrace -l | grep cpuidle
...
Attach to a kernel function to check whether it gets called, and print the call stack that led to it:
$ bpftrace -e 'kprobe:cpuidle_enter_state {printf("%s\n",kstack);}'
cpuidle_enter_state+1
cpuidle_enter+41
cpuidle_idle_call+300
do_idle+123
cpu_startup_entry+25
secondary_startup_64_no_verify+194
$ bpftrace -e 'kprobe:intel_idle {printf("%s\n",kstack);}'
intel_idle+1
cpuidle_enter_state+137
cpuidle_enter+41
cpuidle_idle_call+300
do_idle+123
cpu_startup_entry+25
secondary_startup_64_no_verify+194
For more usage, see Linux tracing/profiling basics: symbol tables, call stacks, perf/bpftrace examples (2022).
5.2 trace-cmd
A long-established tracing tool with functionality similar to bpftrace; its usage resembles perf:
$ trace-cmd record -l 'cpuidle_enter_state' -p function_graph
$ trace-cmd report
...
<idle>-0 [007] 699714.113701: funcgraph_entry: | cpuidle_enter_state() {
<idle>-0 [002] 699714.113703: funcgraph_entry: | cpuidle_enter_state() {