An In-Depth Look at Linux Process Scheduling (schedule) [Repost]
- October 10, 2019
- Notes

Reposted from: https://blog.csdn.net/Vince_/article/details/88982802
Copyright notice: this is an original article by the blogger, licensed under CC 4.0 BY-SA; please keep the original source link and this notice when reposting. Original link: https://blog.csdn.net/Vince_/article/details/88982802. (Long read ahead.)
The scheduler is one of the core foundational subsystems of a modern operating system, especially on a multitasking OS: the system may run on a single-core or multi-core CPU, and a process may be running or sitting runnable in memory waiting for the CPU. Letting many tasks share resources concurrently, giving users prompt responses for interactive work, and sustaining high-throughput concurrency all pose great challenges for OS design, and the Linux scheduling subsystem has to satisfy these seemingly contradictory requirements and adapt to very different usage scenarios.

Linux is a complex modern operating system whose subsystems must cooperate to get work done efficiently. This article is organized around the scheduling subsystem: it introduces its core concepts and explores how it relates to the other kernel components it depends on, in particular the interrupt subsystem (hardirq and softirq) and the timers, giving a deep and fairly complete picture of the scheduling-related concepts and how they fit together.

Since the author has recently been debugging PowerPC-based chips, the examples extract and walk through kernel source for that architecture. The code is from the Linux 4.4 stable release; readers can follow along in the source tree.
1. Basic concepts

To understand the scheduling subsystem we first need an overall view of the scheduling flow; with that bird's-eye view in place, each step can then be analyzed in depth to fill in the details.

Early during boot the kernel registers its hardware interrupts, and the timer (tick) interrupt is one of the most important of them: scheduling needs the running task's state to be refreshed periodically and the reschedule flag to be set in order to decide whether the task should be preempted, and the tick interrupt does exactly that, periodically. This brings up another design idea of modern OS scheduling, preemption (preempt), as opposed to non-preemptive or cooperative multitasking; the two are contrasted in detail below. The tick is a hardware interrupt. Linux does not nest interrupt handlers, so local interrupts are disabled while a handler runs (local_irq_disable); to respond to other hardware events as quickly as possible, the handler must finish quickly and re-enable interrupts, which is what leads to the interrupt "bottom half", i.e. the softirq mechanism. Likewise, many timers (Timer/Hrtimer) are armed during scheduling to do related work. When a scheduling decision is made, different scheduling policies apply depending on the kind of resources a task needs, hence the different scheduling classes, each implementing its own algorithm for its scenario. This article therefore looks at interrupts and softirqs, timers and high-resolution timers, preemptive vs. cooperative scheduling, real-time vs. normal task scheduling, and locking and concurrency, and ties these concepts together into a reading of the Linux scheduling subsystem.
1.1 Preemptive

On a preemptive multitasking system the scheduler decides when a running task stops and another one starts; taking the CPU away is called preemption. Before being preempted a task normally runs for a preset timeslice, whose length is related to the task's priority and is determined by the scheduling class in use (scheduling classes are described later). The tick interrupt handler refreshes the task's runtime accounting (vruntime); if the task has used up its allotted slice, the TIF_NEED_RESCHED flag is set in the current task's thread_info flags. At the next scheduling point, need_resched() tests this flag and, if it is set, the scheduler runs, switches out the current task and picks a new one to execute. The scheduling entry points are covered in detail in a later section.
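To make the tick-driven part concrete, here is a lightly simplified sketch of the per-tick preemption check in CFS (based on check_preempt_tick() in kernel/sched/fair.c of v4.4-era kernels; the real function also handles wakeup preemption and minimum granularity):

```c
/* Simplified sketch: decide whether the running entity has exhausted its slice. */
static void check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
	unsigned long ideal_runtime, delta_exec;

	/* The "timeslice" CFS grants this entity in the current period. */
	ideal_runtime = sched_slice(cfs_rq, curr);
	/* CPU time actually consumed since the entity was last picked. */
	delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
	if (delta_exec > ideal_runtime) {
		/* Sets TIF_NEED_RESCHED on the running task; the context switch
		 * itself only happens later, at the next scheduling point. */
		resched_curr(rq_of(cfs_rq));
	}
}
```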
1.2 Cooperative

The defining property of non-preemptive, cooperative multitasking is that another task is scheduled only when the running task voluntarily gives up the CPU, which is called yielding. The scheduler has no global control over task run time, and the biggest drawback is that a hung task can stall the whole system, since nothing else can be scheduled. A task that needs to wait for a particular signal or event gives up the CPU and goes to sleep, entering the scheduler by calling schedule() itself.
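A minimal sketch of that voluntary path, using the standard wait-queue helpers (DEFINE_WAIT / prepare_to_wait / finish_wait); my_wq and my_condition are illustrative names, not from the original article:

```c
#include <linux/wait.h>
#include <linux/sched.h>

static DECLARE_WAIT_QUEUE_HEAD(my_wq);
static bool my_condition;

static void wait_for_my_event(void)
{
	DEFINE_WAIT(wait);

	while (!my_condition) {
		/* Mark ourselves sleeping and queue on the wait queue. */
		prepare_to_wait(&my_wq, &wait, TASK_INTERRUPTIBLE);
		if (!my_condition)
			schedule();	/* voluntarily yield the CPU until woken */
		finish_wait(&my_wq, &wait);
	}
}
```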
1.3 Nice

Ordinary tasks are assigned a nice value that determines their priority, and user space can change it with the nice system call. Nice ranges from -20 to +19; the timeslice a task receives is generally scaled by its nice value, so the higher (nicer) the value, the smaller the slice. ps -el shows the nice value. Think of it as how "nice" a task is willing to be to the other tasks.
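For example, from user space a process can lower its own priority with nice(2) or setpriority(2); a small stand-alone sketch (error handling kept minimal):

```c
#include <errno.h>
#include <stdio.h>
#include <sys/resource.h>
#include <unistd.h>

int main(void)
{
	errno = 0;
	/* Be "nicer": raise our nice value by 10, i.e. lower our priority. */
	if (nice(10) == -1 && errno != 0)
		perror("nice");

	/* setpriority(PRIO_PROCESS, 0, value) would set an absolute nice value. */
	printf("current nice value: %d\n", getpriority(PRIO_PROCESS, 0));
	return 0;
}
```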
1.4 Real-time priority

Real-time priority is a dimension separate from nice. It ranges from 0 to 99, and a larger value means a higher priority; real-time processes generally take precedence over normal processes. ps -eo state,uid,pid,ppid,rtprio,time,comm shows it: a "-" in the rtprio column means the process is not real-time, a number is its real-time priority.
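From user space a real-time policy and priority can be requested with sched_setscheduler(2); a short illustrative sketch (the priority value 50 is arbitrary, and the call usually requires root or CAP_SYS_NICE):

```c
#include <sched.h>
#include <stdio.h>

int main(void)
{
	/* Real-time priorities range from 1 to 99 for SCHED_FIFO / SCHED_RR. */
	struct sched_param sp = { .sched_priority = 50 };

	/* pid 0 means the calling process. */
	if (sched_setscheduler(0, SCHED_FIFO, &sp) == -1) {
		perror("sched_setscheduler");
		return 1;
	}
	printf("now running as SCHED_FIFO with rtprio %d\n", sp.sched_priority);
	return 0;
}
```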
2. Scheduler types

By resource demand, tasks can be divided into I/O-bound and processor-bound. "I/O" can be read broadly here, covering network devices, keyboard, mouse and so on: such tasks need low latency but usually do not use the CPU heavily. Processor-bound tasks use the CPU intensively, e.g. encryption/decryption or image processing. Scheduling the two kinds differently gives a better experience: interactive work stays responsive while plenty of CPU is left for compute-heavy tasks. The scheduler therefore uses fairly elaborate algorithms to deliver both prompt response and high throughput.

There are five scheduling classes:
- fair_sched_class: on recent kernels this is CFS (Completely Fair Scheduler), the main scheduler for normal tasks on Linux; its group-scheduling support is controlled by the CONFIG_FAIR_GROUP_SCHED option.
- rt_sched_class: the real-time scheduling class; its group-scheduling support is controlled by CONFIG_RT_GROUP_SCHED.
- dl_sched_class: the deadline class, a higher-priority kind of real-time scheduling used for deadline-driven tasks.
- stop_sched_class: the highest-priority class, a special real-time class used by the per-CPU stopper threads (stop_machine / task migration).
- idle_sched_class: the lowest priority, scheduled only when the system is idle. There is one idle task per CPU; on the boot CPU it is the initial task init_task, which turns itself into the idle task once initialization is done and does little else.

3. Initialization of the scheduling subsystem

start_kernel calls sched_init to initialize the scheduler. It first allocates alloc_size bytes of memory and initializes root_task_group, the default task group that every task belongs to during boot; note that several members of root_task_group are per-CPU. Once everything is set up, init_task is marked as the idle task. See the comments in the function below.
```c
void __init sched_init(void)
{
	int i, j;
	unsigned long alloc_size = 0, ptr;

	/* Calculate the size to be allocated for the root_task_group items.
	 * Some fields in struct task_group are per-cpu arrays, so use
	 * nr_cpu_ids here. */
#ifdef CONFIG_FAIR_GROUP_SCHED
	alloc_size += 2 * nr_cpu_ids * sizeof(void **);
#endif
#ifdef CONFIG_RT_GROUP_SCHED
	alloc_size += 2 * nr_cpu_ids * sizeof(void **);
#endif
	if (alloc_size) {
		/* allocate mem here. */
		ptr = (unsigned long)kzalloc(alloc_size, GFP_NOWAIT);

#ifdef CONFIG_FAIR_GROUP_SCHED
		root_task_group.se = (struct sched_entity **)ptr;
		ptr += nr_cpu_ids * sizeof(void **);

		root_task_group.cfs_rq = (struct cfs_rq **)ptr;
		ptr += nr_cpu_ids * sizeof(void **);
#endif /* CONFIG_FAIR_GROUP_SCHED */
#ifdef CONFIG_RT_GROUP_SCHED
		root_task_group.rt_se = (struct sched_rt_entity **)ptr;
		ptr += nr_cpu_ids * sizeof(void **);

		root_task_group.rt_rq = (struct rt_rq **)ptr;
		ptr += nr_cpu_ids * sizeof(void **);
#endif /* CONFIG_RT_GROUP_SCHED */
	}
#ifdef CONFIG_CPUMASK_OFFSTACK
	/* Use dynamic allocation for cpumask_var_t, instead of putting them on
	 * the stack. This is a bit more expensive, but avoids stack overflow.
	 * Allocate load_balance_mask for every cpu below. */
	for_each_possible_cpu(i) {
		per_cpu(load_balance_mask, i) = (cpumask_var_t)kzalloc_node(
			cpumask_size(), GFP_KERNEL, cpu_to_node(i));
	}
#endif /* CONFIG_CPUMASK_OFFSTACK */

	/* Initialize the real-time task group's CPU-time share;
	 * the hrtimer of def_rt_bandwidth is initialized here. */
	init_rt_bandwidth(&def_rt_bandwidth,
			global_rt_period(), global_rt_runtime());
	/* Initialize the deadline task group's CPU-time share. */
	init_dl_bandwidth(&def_dl_bandwidth,
			global_rt_period(), global_rt_runtime());

#ifdef CONFIG_SMP
	/* Initialize the default root scheduling domain. A sched domain
	 * contains one or more CPUs; load balancing runs within a domain,
	 * and domains are isolated from each other. */
	init_defrootdomain();
#endif

#ifdef CONFIG_RT_GROUP_SCHED
	init_rt_bandwidth(&root_task_group.rt_bandwidth,
			global_rt_period(), global_rt_runtime());
#endif /* CONFIG_RT_GROUP_SCHED */

#ifdef CONFIG_CGROUP_SCHED
	/* Add the freshly allocated and initialized root_task_group to the
	 * global task_groups list. */
	list_add(&root_task_group.list, &task_groups);
	INIT_LIST_HEAD(&root_task_group.children);
	INIT_LIST_HEAD(&root_task_group.siblings);
	/* Initialize autogrouping. */
	autogroup_init(&init_task);
#endif /* CONFIG_CGROUP_SCHED */

	/* Walk every possible CPU's run queue and initialize it. */
	for_each_possible_cpu(i) {
		struct rq *rq;

		rq = cpu_rq(i);
		raw_spin_lock_init(&rq->lock);
		/* Number of scheduling entities (sched_entity) on this run queue. */
		rq->nr_running = 0;
		/* CPU load accounting. */
		rq->calc_load_active = 0;
		/* Next load-update time. */
		rq->calc_load_update = jiffies + LOAD_FREQ;
		/* Initialize the run queue's cfs, rt and dl sub-queues. */
		init_cfs_rq(&rq->cfs);
		init_rt_rq(&rq->rt);
		init_dl_rq(&rq->dl);
#ifdef CONFIG_FAIR_GROUP_SCHED
		/* Total CPU share of the root task group. */
		root_task_group.shares = ROOT_TASK_GROUP_LOAD;
		INIT_LIST_HEAD(&rq->leaf_cfs_rq_list);
		/*
		 * How much cpu bandwidth does root_task_group get?
		 *
		 * In case of task-groups formed thr' the cgroup filesystem, it
		 * gets 100% of the cpu resources in the system. This overall
		 * system cpu resource is divided among the tasks of
		 * root_task_group and its child task-groups in a fair manner,
		 * based on each entity's (task or task-group's) weight
		 * (se->load.weight).
		 *
		 * In other words, if root_task_group has 10 tasks of weight
		 * 1024) and two child groups A0 and A1 (of weight 1024 each),
		 * then A0's share of the cpu resource is:
		 *
		 *	A0's bandwidth = 1024 / (10*1024 + 1024 + 1024) = 8.33%
		 *
		 * We achieve this by letting root_task_group's tasks sit
		 * directly in rq->cfs (i.e root_task_group->se[] = NULL).
		 */
		/* Initialize cfs_bandwidth (the CPU share of normal tasks) and
		 * the corresponding hrtimer of this scheduling class. */
		init_cfs_bandwidth(&root_task_group.cfs_bandwidth);
		/* Cross-link: this CPU run queue's cfs_rq gets root_task_group
		 * as its task_group, cfs_rq->rq points back to this run queue,
		 * and root_task_group records its per-CPU cfs_rq; the
		 * sched_entity se is NULL for now. */
		init_tg_cfs_entry(&root_task_group, &rq->cfs, NULL, i, NULL);
#endif /* CONFIG_FAIR_GROUP_SCHED */

		rq->rt.rt_runtime = def_rt_bandwidth.rt_runtime;
#ifdef CONFIG_RT_GROUP_SCHED
		/* Same kind of cross-linking as init_tg_cfs_entry above. */
		init_tg_rt_entry(&root_task_group, &rq->rt, NULL, i, NULL);
#endif

		/* Initialize the per-CPU load history kept on this run queue. */
		for (j = 0; j < CPU_LOAD_IDX_MAX; j++)
			rq->cpu_load[j] = 0;
		/* Time this run queue's CPU load was last updated. */
		rq->last_load_update_tick = jiffies;

#ifdef CONFIG_SMP
		/* Initialize the load-balancing related fields. */
		rq->sd = NULL;
		rq->rd = NULL;
		rq->cpu_capacity = rq->cpu_capacity_orig = SCHED_CAPACITY_SCALE;
		rq->balance_callback = NULL;
		rq->active_balance = 0;
		rq->next_balance = jiffies;
		rq->push_cpu = 0;
		rq->cpu = i;
		rq->online = 0;
		rq->idle_stamp = 0;
		rq->avg_idle = 2*sysctl_sched_migration_cost;
		rq->max_idle_balance_cost = sysctl_sched_migration_cost;

		INIT_LIST_HEAD(&rq->cfs_tasks);

		/* Attach this CPU run queue to the default root domain. */
		rq_attach_root(rq, &def_root_domain);
#ifdef CONFIG_NO_HZ_COMMON
		/* Dynamic-tick flags; nothing is in use yet at this point. */
		rq->nohz_flags = 0;
#endif
#ifdef CONFIG_NO_HZ_FULL
		/* Dynamic-tick bookkeeping: records when the last scheduler tick happened. */
		rq->last_sched_tick = 0;
#endif
#endif
		/* Initialize the run queue's high-resolution tick timer (not active yet). */
		init_rq_hrtick(rq);
		atomic_set(&rq->nr_iowait, 0);
	}

	/* Set the load weight of the init task. */
	set_load_weight(&init_task);

#ifdef CONFIG_PREEMPT_NOTIFIERS
	/* Initialize init_task's preempt notifier list. */
	INIT_HLIST_HEAD(&init_task.preempt_notifiers);
#endif

	/*
	 * The boot idle thread does lazy MMU switching as well:
	 */
	atomic_inc(&init_mm.mm_count);
	enter_lazy_tlb(&init_mm, current);

	/*
	 * During early bootup we pretend to be a normal task:
	 */
	/* The init task starts out under the fair scheduling class. */
	current->sched_class = &fair_sched_class;

	/*
	 * Make us the idle thread. Technically, schedule() should not be
	 * called from this thread, however somewhere below it might be,
	 * but because we are the idle thread, we just pick up running again
	 * when this runqueue becomes "idle".
	 */
	/* Turn the current task into the idle task: reinitialize its state and
	 * switch its scheduling class to the idle scheduler. */
	init_idle(current, smp_processor_id());

	calc_load_update = jiffies + LOAD_FREQ;

#ifdef CONFIG_SMP
	zalloc_cpumask_var(&sched_domains_tmpmask, GFP_NOWAIT);
	/* May be allocated at isolcpus cmdline parse time */
	if (cpu_isolated_map == NULL)
		zalloc_cpumask_var(&cpu_isolated_map, GFP_NOWAIT);
	idle_thread_set_boot_cpu();
	set_cpu_rq_start_time();
#endif
	/* Initialize the fair scheduling class. In practice this registers
	 * run_rebalance_domains as the SCHED_SOFTIRQ handler, which performs
	 * load balancing. (Open question for later: when is SCHED_SOFTIRQ raised?) */
	init_sched_fair_class();

	/* Mark the scheduler as running. At this point the system only has
	 * init_task, which is the idle task, and the timer has not been
	 * started yet, so nothing else can be scheduled; execution simply
	 * returns to start_kernel to continue initialization. */
	scheduler_running = 1;
}
```

After sched_init, execution returns to start_kernel; the scheduling-related steps that follow are:
init_IRQ: This sets up the IRQ stacks and the infrastructure for all software and hardware interrupts in the system. The timer interrupt, the driving force behind scheduling, including both its hardirq part and its softirq bottom half, is initialized here as well. Interrupts are covered in detail in a later section; at this point it is enough to know where this happens in the boot flow.

init_timers: This initializes the timer subsystem and registers run_timer_softirq as the TIMER_SOFTIRQ handler (softirqs are described at the end of the article). Given that the softirq is registered here, where does it actually get raised, and what does it do?
As we will see in the section on registering the tick interrupt, tick_handle_periodic is the clock-event handler: time_init installs it in the clock event device's handler hook, and it is invoked when a timer interrupt is handled. It ends up calling tick_periodic, which calls update_process_times, which in turn calls run_local_timers to raise TIMER_SOFTIRQ; run_local_timers also calls hrtimer_run_queues to run high-resolution timers. This is the classic split of interrupt handling: the hardirq does only the critical part, raises a softirq and re-enables hardware interrupts, and the bulk of the work is done in the softirq bottom half. What this particular softirq does is described later; in short, it fires all expired timers.
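The corresponding per-tick code is short; roughly, in v4.4 (kernel/time/timer.c, lightly trimmed), the architecture-independent path looks like this:

```c
/* Called from the tick handler on every timer interrupt (trimmed sketch). */
void update_process_times(int user_tick)
{
	struct task_struct *p = current;

	account_process_tick(p, user_tick);	/* charge the tick to the running task */
	run_local_timers();			/* hrtimers + raise the timer softirq   */
	rcu_check_callbacks(user_tick);
	scheduler_tick();			/* scheduler bookkeeping, task_tick hook */
	run_posix_cpu_timers(p);
}

void run_local_timers(void)
{
	hrtimer_run_queues();			/* run expired hrtimers (low-res mode)   */
	raise_softirq(TIMER_SOFTIRQ);		/* defer timer-wheel work to the softirq */
}
```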
time_init: Performs the clock-related initialization. As shown later, the hardware interrupt vector table is registered during the early assembly stage of boot, but the clock event device and its handler are not yet set up at that point; here init_decrementer_clockevent initializes the decrementer clock-event device and installs tick_handle_periodic as its event handler, and tick_setup_hrtimer_broadcast registers the high-resolution broadcast device and its callback, which will actually run when the interrupt fires. From this point on the hardware timer interrupt is live.

sched_clock_postinit and sched_clock_init: Start the timers that periodically update the scheduler clock.
4. The scheduling process

4.1 The schedule() interface

schedule() first disables preemption to prevent re-entering the scheduler, then calls __schedule. __schedule handles the current task: if it has pending signals it is kept in TASK_RUNNING; if it intends to sleep, deactivate_task removes it from the run queue so it can be placed on the appropriate wait queue. pick_next_task then selects the next task to run, and context_switch switches to it.
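The top-level entry looks roughly like this in v4.4 (kernel/sched/core.c): preemption is disabled around __schedule(), and the loop repeats as long as the reschedule flag is still set:

```c
asmlinkage __visible void __sched schedule(void)
{
	struct task_struct *tsk = current;

	sched_submit_work(tsk);		/* flush plugged block I/O before sleeping */
	do {
		preempt_disable();			/* no nested scheduling while switching */
		__schedule(false);			/* false: not the preemption path */
		sched_preempt_enable_no_resched();
	} while (need_resched());
}
```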
4.2 pick_next_task

It first checks whether the current task's sched_class is fair_sched_class, i.e. CFS. If it is, and every runnable entity on this CPU's run queue belongs to CFS (rq->nr_running == rq->cfs.h_nr_running, meaning nothing is queued under RT or any other class), it simply returns the result of fair_sched_class's pick_next_task. Otherwise it walks all scheduling classes with for_each_class(class) and returns the first non-NULL result of class->pick_next_task.
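Condensed, the selection logic reads roughly as follows (simplified from kernel/sched/core.c in v4.4; the RETRY_TASK handling is omitted here):

```c
static inline struct task_struct *
pick_next_task(struct rq *rq, struct task_struct *prev)
{
	const struct sched_class *class = &fair_sched_class;
	struct task_struct *p;

	/* Fast path: everything runnable on this rq belongs to CFS. */
	if (likely(prev->sched_class == class &&
		   rq->nr_running == rq->cfs.h_nr_running)) {
		p = fair_sched_class.pick_next_task(rq, prev);
		if (likely(p))
			return p;
	}

	/* Slow path: walk the classes from highest to lowest priority. */
	for_each_class(class) {
		p = class->pick_next_task(rq, prev);
		if (p)
			return p;
	}

	BUG();	/* unreachable: the idle class always returns a task */
}
```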
What matters here is the for_each_class traversal, which starts from sched_class_highest, i.e. stop_sched_class.
```c
#define sched_class_highest (&stop_sched_class)
#define for_each_class(class) \
	for (class = sched_class_highest; class; class = class->next)

extern const struct sched_class stop_sched_class;
extern const struct sched_class dl_sched_class;
extern const struct sched_class rt_sched_class;
extern const struct sched_class fair_sched_class;
extern const struct sched_class idle_sched_class;
```

4.3 How the scheduling classes are linked

Listed in priority order, they form a singly linked list:
stop_sched_class -> dl_sched_class -> rt_sched_class -> fair_sched_class -> idle_sched_class -> NULL (each class's ->next pointer leads to the next lower class).
4.4 Registering the scheduling classes

The stop machinery is registered at early boot via early_initcall(cpu_stop_init). cpu_stop_init registers cpu_stop_threads; when its create method is invoked, it runs cpu_stop_create -> sched_set_stop_task, which installs the stop task for the stop scheduling class. The create path is:

cpu_stop_init -> smpboot_register_percpu_thread -> smpboot_register_percpu_thread_cpumask -> __smpboot_create_thread -> cpu_stop_threads.create (i.e. cpu_stop_create)

Back to pick_next_task: since stop_sched_class, as the highest-priority class, heads the list linking all scheduling classes in the system, the traversal examines every sched_class from the highest priority downwards until one of them returns a runnable task. If the whole system is idle, the walk eventually schedules init_task, which was turned into the idle task during initialization and runs whenever the system has nothing else to do, as explained in the sched_init walkthrough above.
5. Scheduling entry points

"Timer interrupt is responsible for decrementing the running process's timeslice count. When the count reaches zero, need_resched is set and the kernel runs the scheduler as soon as possible."

The timer interrupt updates the running task's execution-time accounting; when the timeslice is used up, need_resched is set and the running task is switched out during the next scheduling pass.
RTC (Real-Time Clock): a battery-backed (CMOS) non-volatile device that stores the wall-clock time. At boot the kernel reads it to initialize the system time.

System timer: a programmable electronic timer that drives the periodic system timer interrupt. Some architectures implement it as a decrementer: a counter is loaded with an initial value and counts down at a fixed frequency until it reaches zero, at which point the timer interrupt is triggered.
The timer interrupt is broken into two pieces: an architecture-dependent and an architecture-independent routine. The architecture-dependent routine is registered as the interrupt handler for the system timer and, thus, runs when the timer interrupt hits. Its exact job depends on the given architecture, of course, but most handlers perform at least the following work:
1. Obtain the xtime_lock lock, which protects access to jiffies_64 and the wall time value, xtime.
2. Acknowledge or reset the system timer as required.
3. Periodically save the updated wall time to the real time clock.
4. Call the architecture-independent timer routine, tick_periodic().

The architecture-independent routine, tick_periodic(), performs much more work:
1. Increment the jiffies_64 count by one. (This is safe, even on 32-bit architectures, because the xtime_lock lock was previously obtained.)
2. Update resource usages, such as consumed system and user time, for the currently running process.
3. Run any dynamic timers that have expired (discussed in the following section).
4. Execute scheduler_tick(), as discussed in Chapter 4.
5. Update the wall time, which is stored in xtime.
6. Calculate the infamous load average.
6. The timer interrupt

The timer (tick) interrupt is what drives scheduling and preemption: it updates the running task's time accounting and the reschedule flag that decides whether a switch should happen. The code below uses the PowerPC FSL BookE ppce500 chip as an example; other architectures differ in detail but follow the same design.

6.1 Registering the timer interrupt

The interrupt handlers are registered at the very beginning of boot, in the assembly initialization code that runs before start_kernel; once system initialization is complete, the handler is executed whenever a timer interrupt occurs.

For IBM's PowerPC architecture, the kernel entry (head) files live under arch/powerpc/kernel/; the e500 entry file is head_fsl_booke.S, which defines the interrupt vector table:
```
interrupt_base:
	/* Critical Input Interrupt */
	CRITICAL_EXCEPTION(0x0100, CRITICAL, CriticalInput, unknown_exception)
	……
	/* Decrementer Interrupt */
	DECREMENTER_EXCEPTION
	……
```

The timer interrupt entry is DECREMENTER_EXCEPTION; its actual expansion is in the header arch/powerpc/kernel/head_booke.h:
```
#define EXC_XFER_TEMPLATE(hdlr, trap, msr, copyee, tfer, ret)	\
	li	r10,trap;					\
	stw	r10,_TRAP(r11);					\
	lis	r10,msr@h;					\
	ori	r10,r10,msr@l;					\
	copyee(r10, r9);					\
	bl	tfer;						\
	.long	hdlr;						\
	.long	ret

#define EXC_XFER_LITE(n, hdlr)					\
	EXC_XFER_TEMPLATE(hdlr, n+1, MSR_KERNEL, NOCOPY,	\
			  transfer_to_handler, ret_from_except)

#define DECREMENTER_EXCEPTION					\
	START_EXCEPTION(Decrementer)				\
	NORMAL_EXCEPTION_PROLOG(DECREMENTER);			\
	lis	r0,TSR_DIS@h;	/* Setup the DEC interrupt mask */ \
	mtspr	SPRN_TSR,r0;	/* Clear the DEC interrupt */	\
	addi	r3,r1,STACK_FRAME_OVERHEAD;			\
	EXC_XFER_LITE(0x0900, timer_interrupt)
```

Now look at the timer_interrupt function:
```c
static void __timer_interrupt(void)
{
	struct pt_regs *regs = get_irq_regs();
	u64 *next_tb = this_cpu_ptr(&decrementers_next_tb);
	struct clock_event_device *evt = this_cpu_ptr(&decrementers);
	u64 now;

	trace_timer_interrupt_entry(regs);

	if (test_irq_work_pending()) {
		clear_irq_work_pending();
		irq_work_run();
	}

	now = get_tb_or_rtc();
	if (now >= *next_tb) {
		*next_tb = ~(u64)0;
		if (evt->event_handler)
			evt->event_handler(evt);
		__this_cpu_inc(irq_stat.timer_irqs_event);
	} else {
		now = *next_tb - now;
		if (now <= DECREMENTER_MAX)
			set_dec((int)now);
		/* We may have raced with new irq work */
		if (test_irq_work_pending())
			set_dec(1);
		__this_cpu_inc(irq_stat.timer_irqs_others);
	}

#ifdef CONFIG_PPC64
	/* collect purr register values often, for accurate calculations */
	if (firmware_has_feature(FW_FEATURE_SPLPAR)) {
		struct cpu_usage *cu = this_cpu_ptr(&cpu_usage_array);
		cu->current_tb = mfspr(SPRN_PURR);
	}
#endif

	trace_timer_interrupt_exit(regs);
}

/*
 * timer_interrupt - gets called when the decrementer overflows,
 * with interrupts disabled.
 */
void timer_interrupt(struct pt_regs * regs)
{
	struct pt_regs *old_regs;
	u64 *next_tb = this_cpu_ptr(&decrementers_next_tb);

	/* Ensure a positive value is written to the decrementer, or else
	 * some CPUs will continue to take decrementer exceptions.
	 */
	set_dec(DECREMENTER_MAX);

	/* Some implementations of hotplug will get timer interrupts while
	 * offline, just ignore these and we also need to set
	 * decrementers_next_tb as MAX to make sure __check_irq_replay
	 * don't replay timer interrupt when return, otherwise we'll trap
	 * here infinitely :(
	 */
	if (!cpu_online(smp_processor_id())) {
		*next_tb = ~(u64)0;
		return;
	}

	/* Conditionally hard-enable interrupts now that the DEC has been
	 * bumped to its maximum value
	 */
	may_hard_irq_enable();

#if defined(CONFIG_PPC32) && defined(CONFIG_PPC_PMAC)
	if (atomic_read(&ppc_n_lost_interrupts) != 0)
		do_IRQ(regs);
#endif

	old_regs = set_irq_regs(regs);
	irq_enter();

	__timer_interrupt();
	irq_exit();
	set_irq_regs(old_regs);
}
```

__timer_interrupt invokes evt->event_handler. What is this event_handler, and where is it actually registered?
The answer is tick_handle_periodic. This function is the real handler of the clock event: the interrupt handler shown above only does preparation, such as saving registers and setting up the entry, while the event_handler below does the work the interrupt is actually meant for. It is defined as follows:
```c
/*
 * Event handler for periodic ticks
 */
void tick_handle_periodic(struct clock_event_device *dev)
{
	int cpu = smp_processor_id();
	ktime_t next = dev->next_event;

	tick_periodic(cpu);

#if defined(CONFIG_HIGH_RES_TIMERS) || defined(CONFIG_NO_HZ_COMMON)
	/*
	 * The cpu might have transitioned to HIGHRES or NOHZ mode via
	 * update_process_times() -> run_local_timers() ->
	 * hrtimer_run_queues().
	 */
	if (dev->event_handler != tick_handle_periodic)
		return;
#endif

	if (!clockevent_state_oneshot(dev))
		return;
	for (;;) {
		/*
		 * Setup the next period for devices, which do not have
		 * periodic mode:
		 */
		next = ktime_add(next, tick_period);

		if (!clockevents_program_event(dev, next, false))
			return;
		/*
		 * Have to be careful here. If we're in oneshot mode,
		 * before we call tick_periodic() in a loop, we need
		 * to be sure we're using a real hardware clocksource.
		 * Otherwise we could get trapped in an infinite
		 * loop, as the tick_periodic() increments jiffies,
		 * which then will increment time, possibly causing
		 * the loop to trigger again and again.
		 */
		if (timekeeping_valid_for_hres())
			tick_periodic(cpu);
	}
}
```

The registration and invocation path of tick_handle_periodic is:
start_kernel->time_init->init_decrementer_clockevent->register_decrementer_clockevent->clockevents_register_device->tick_check_new_device->tick_setup_periodic->tick_set_periodic_handler->tick_handle_periodic->tick_periodic->update_process_times->scheduler_tick
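The end of that chain, scheduler_tick(), is where the per-class task_tick hook discussed next is invoked; a lightly trimmed copy of the v4.4 version (kernel/sched/core.c) looks like this:

```c
/* Called from update_process_times() on every tick, interrupts disabled (trimmed). */
void scheduler_tick(void)
{
	int cpu = smp_processor_id();
	struct rq *rq = cpu_rq(cpu);
	struct task_struct *curr = rq->curr;

	sched_clock_tick();

	raw_spin_lock(&rq->lock);
	update_rq_clock(rq);
	curr->sched_class->task_tick(rq, curr, 0);	/* e.g. task_tick_fair / task_tick_rt */
	update_cpu_load_active(rq);
	raw_spin_unlock(&rq->lock);

#ifdef CONFIG_SMP
	rq->idle_balance = idle_cpu(cpu);
	trigger_load_balance(rq);	/* may raise SCHED_SOFTIRQ for load balancing */
#endif
}
```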
The latter part of the chain is the call path taken when tick_handle_periodic runs. In scheduler_tick the task_tick hook of the current task's scheduling class is called: with CFS this is fair_sched_class's task_tick, and in rt_sched_class it is implemented as task_tick_rt, shown below:
```c
static void task_tick_rt(struct rq *rq, struct task_struct *p, int queued)
{
	struct sched_rt_entity *rt_se = &p->rt;

	update_curr_rt(rq);

	watchdog(rq, p);

	/*
	 * RR tasks need a special form of timeslice management.
	 * FIFO tasks have no timeslices.
	 */
	if (p->policy != SCHED_RR)
		return;

	if (--p->rt.time_slice)
		return;

	p->rt.time_slice = sched_rr_timeslice;

	/*
	 * Requeue to the end of queue if we (and all of our ancestors) are not
	 * the only element on the queue
	 */
	for_each_sched_rt_entity(rt_se) {
		if (rt_se->run_list.prev != rt_se->run_list.next) {
			requeue_task_rt(rq, p, 0);
			resched_curr(rq);
			return;
		}
	}
}
```

As the code shows, if the current timeslice is not yet used up, the function returns immediately; otherwise the real-time timeslice is refilled to sched_rr_timeslice, the task is moved to the tail of its run queue, and resched_curr marks it for rescheduling before returning. This is the round-robin (SCHED_RR) idea of real-time scheduling.
This raises a new question: once the task's TIF_NEED_RESCHED flag has been set, when does the actual scheduling happen? There are four scheduling entry points:

1. Return from an interrupt;
2. Return from a system call to user space;
3. A task voluntarily giving up the CPU and calling the scheduler;
4. Return to kernel space after signal handling.

Scheduling triggered by returning from the timer interrupt is case 1. Again taking ppce500 as the example, here is how it happens: the common exception-return path RET_FROM_EXC_LEVEL reaches user_exc_return and enters do_work, and do_work is the central entry point for this processing:
```
do_work:			/* r10 contains MSR_KERNEL here */
	andi.	r0,r9,_TIF_NEED_RESCHED
	beq	do_user_signal
```

If the reschedule flag is set, execution falls through into do_resched; otherwise it branches to do_user_signal, and once no work bits remain the recheck path returns to the interrupted context via restore_user.
```
do_user_signal:			/* r10 contains MSR_KERNEL here */
	ori	r10,r10,MSR_EE
	SYNC
	MTMSRD(r10)		/* hard-enable interrupts */
	/* save r13-r31 in the exception frame, if not already done */
	lwz	r3,_TRAP(r1)
	andi.	r0,r3,1
	beq	2f
	SAVE_NVGPRS(r1)
	rlwinm	r3,r3,0,0,30
	stw	r3,_TRAP(r1)
2:	addi	r3,r1,STACK_FRAME_OVERHEAD
	mr	r4,r9
	bl	do_notify_resume
	REST_NVGPRS(r1)
	b	recheck
```
do_resched is reached from the recheck label, also defined in entry_32.S:
```
recheck:
	/* Note: And we don't tell it we are disabling them again
	 * neither. Those disable/enable cycles used to peek at
	 * TI_FLAGS aren't advertised.
	 */
	LOAD_MSR_KERNEL(r10,MSR_KERNEL)
	SYNC
	MTMSRD(r10)		/* disable interrupts */
	CURRENT_THREAD_INFO(r9, r1)
	lwz	r9,TI_FLAGS(r9)
	andi.	r0,r9,_TIF_NEED_RESCHED
	bne-	do_resched
	andi.	r0,r9,_TIF_USER_WORK_MASK
	beq	restore_user
```

In entry_32.S we can see that do_resched calls schedule() to actually perform the scheduling:
```
do_resched:			/* r10 contains MSR_KERNEL here */
	/* Note: We don't need to inform lockdep that we are enabling
	 * interrupts here. As far as it knows, they are already enabled
	 */
	ori	r10,r10,MSR_EE
	SYNC
	MTMSRD(r10)		/* hard-enable interrupts */
	bl	schedule
```

6.2 Execution of the timer interrupt

In the interrupt-vector definitions above there is a bl tfer step. Here tfer is transfer_to_handler or transfer_to_handler_full (transfer_to_handler for the timer interrupt); it does the preparation needed before the interrupt handler is called, then jumps to the handler hdlr and finally into ret. ret is ret_from_except or ret_from_except_full (ret_from_except for the timer interrupt), which goes through resume_kernel into preempt_schedule_irq to run the scheduler:
```c
/*
 * this is the entry point to schedule() from kernel preemption
 * off of irq context.
 * Note, that this is called and return with irqs disabled. This will
 * protect us against recursive calling from irq.
 */
asmlinkage __visible void __sched preempt_schedule_irq(void)
{
	enum ctx_state prev_state;

	/* Catch callers which need to be fixed */
	BUG_ON(preempt_count() || !irqs_disabled());

	prev_state = exception_enter();

	do {
		preempt_disable();
		local_irq_enable();
		__schedule(true);
		local_irq_disable();
		sched_preempt_enable_no_resched();
	} while (need_resched());

	exception_exit(prev_state);
}
```

Next, let's look at preempt_disable and local_irq_disable.
```c
static __always_inline volatile int *preempt_count_ptr(void)
{
	return &current_thread_info()->preempt_count;
}
```

Disabling preemption simply adds 1 to the current task's preempt_count; the call is followed by a barrier() to keep the compiler from reordering memory accesses around it, which is what provides the required synchronization.
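For reference, with preemption counting enabled (CONFIG_PREEMPT_COUNT / CONFIG_PREEMPT) the generic definitions in include/linux/preempt.h are essentially:

```c
#define preempt_disable() \
do { \
	preempt_count_inc();	/* bump the current thread's preempt_count */ \
	barrier();		/* compiler barrier: no reordering across it */ \
} while (0)

#define preempt_enable() \
do { \
	barrier(); \
	if (unlikely(preempt_count_dec_and_test())) \
		__preempt_schedule();	/* reschedule if preemption became pending */ \
} while (0)
```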
```c
/*
 * Wrap the arch provided IRQ routines to provide appropriate checks.
 */
#define raw_local_irq_disable()		arch_local_irq_disable()
#define raw_local_irq_enable()		arch_local_irq_enable()
#define raw_local_irq_save(flags)			\
	do {						\
		typecheck(unsigned long, flags);	\
		flags = arch_local_irq_save();		\
	} while (0)
#define raw_local_irq_restore(flags)			\
	do {						\
		typecheck(unsigned long, flags);	\
		arch_local_irq_restore(flags);		\
	} while (0)
#define raw_local_save_flags(flags)			\
	do {						\
		typecheck(unsigned long, flags);	\
		flags = arch_local_save_flags();	\
	} while (0)
#define raw_irqs_disabled_flags(flags)			\
	({						\
		typecheck(unsigned long, flags);	\
		arch_irqs_disabled_flags(flags);	\
	})
#define raw_irqs_disabled()	(arch_irqs_disabled())
#define raw_safe_halt()		arch_safe_halt()

#define local_irq_enable()	do { raw_local_irq_enable(); } while (0)
#define local_irq_disable()	do { raw_local_irq_disable(); } while (0)
#define local_irq_save(flags)	do { raw_local_irq_save(flags); } while (0)
#define local_irq_restore(flags) do { raw_local_irq_restore(flags); } while (0)
#define safe_halt()		do { raw_safe_halt(); } while (0)
```

The architecture-specific irq operations are defined as follows:
```c
static inline void arch_local_irq_restore(unsigned long flags)
{
#if defined(CONFIG_BOOKE)
	asm volatile("wrtee %0" : : "r" (flags) : "memory");
#else
	mtmsr(flags);
#endif
}

static inline unsigned long arch_local_irq_save(void)
{
	unsigned long flags = arch_local_save_flags();
#ifdef CONFIG_BOOKE
	asm volatile("wrteei 0" : : : "memory");
#else
	SET_MSR_EE(flags & ~MSR_EE);
#endif
	return flags;
}

static inline void arch_local_irq_disable(void)
{
#ifdef CONFIG_BOOKE
	asm volatile("wrteei 0" : : : "memory");
#else
	arch_local_irq_save();
#endif
}

static inline void arch_local_irq_enable(void)
{
#ifdef CONFIG_BOOKE
	asm volatile("wrteei 1" : : : "memory");
#else
	unsigned long msr = mfmsr();
	SET_MSR_EE(msr | MSR_EE);
#endif
}

static inline bool arch_irqs_disabled_flags(unsigned long flags)
{
	return (flags & MSR_EE) == 0;
}

static inline bool arch_irqs_disabled(void)
{
	return arch_irqs_disabled_flags(arch_local_save_flags());
}

#define hard_irq_disable()	arch_local_irq_disable()
```

6.3 IRQ notes

Here is a look at the ppce500-specific irq details:
e500 is a Book E family core and differs from the classic PowerPC architecture; for external interrupt and exception handling the main difference is how the interrupt-vector address is obtained. On the classic architecture the offset is derived from the exception type, and the physical address of the exception vector is:

when MSR[IP] = 0, Vector = offset;

when MSR[IP] = 1, Vector = offset | 0xFFF00000;

where MSR[IP] is the Interrupt Prefix bit of the Machine State Register, which selects the address prefix of the interrupt vectors.

On Book E parts, the offset instead comes from the IVOR (Interrupt Vector Offset Register) corresponding to the exception type (only the low 16 bits are used, with the lowest 4 bits cleared), combined with the upper 16 bits of IVPR (Interrupt Vector Prefix Register) to form the vector address:

Vector = (IVORn & 0xFFF0) | (IVPR & 0xFFFF0000);

Note that, unlike classic PowerPC, Book E interrupt vectors are effective addresses, i.e. kernel virtual addresses in Linux. The Book E MMU is always on, so the core never runs in real mode; during early initialization address translation is set up by hand-crafting TLB entries, and once the page tables are built the TLB is refilled from them. The relevant comment in the kernel source reads:
```c
/*
 * Interrupt vector entry code
 *
 * The Book E MMUs are always on so we don't need to handle
 * interrupts in real mode as with previous PPC processors. In
 * this case we handle interrupts in the kernel virtual address
 * space.
 *
 * Interrupt vectors are dynamically placed relative to the
 * interrupt prefix as determined by the address of interrupt_base.
 * The interrupt vectors offsets are programmed using the labels
 * for each interrupt vector entry.
 *
 * Interrupt vectors must be aligned on a 16 byte boundary.
 * We align on a 32 byte cache line boundary for good measure.
 */
```

Below is what the reference manual says about the Fixed-Interval Timer Interrupt: "A fixed-interval timer interrupt occurs when no higher priority exception exists, a fixed-interval timer exception exists (TSR[FIS] = 1), and the interrupt is enabled (TCR[FIE] = 1 and (MSR[EE] = 1 or (MSR[GS] = 1))). See Section 9.5, "Fixed-Interval Timer.""
The fixed-interval timer period is determined by TCR[FPEXT] || TCR[FP], which specifies one of 64 bit locations of the time base used to signal a fixed-interval timer exception on a transition from 0 to 1. TCR[FPEXT] || TCR[FP] = 0b0000_00 selects TBU[0]. TCR[FPEXT] || TCR[FP] = 0b1111_11 selects TBL[63].
NOTE: Software Considerations. MSR[EE] also enables other asynchronous interrupts. TSR[FIS] is set when a fixed-interval timer exception exists. SRR0, SRR1, and MSR are updated as follows:
- SRR0: set to the effective address of the next instruction to be executed.
- SRR1: set to the MSR contents at the time of the interrupt.
- MSR: CM is set to EPCR[ICM]; RI, ME, DE, CE are unchanged; all other defined MSR bits are cleared.
- TSR: FIS is set when a fixed-interval timer exception exists, not as a result of the interrupt. See Section 4.7.2, "Timer Status Register (TSR)."

Instruction execution resumes at address IVPR[0-47] || IVOR11[48-59] || 0b0000.
NOTE: Software Considerations To avoid redundant fixed-interval timer interrupts, before reenabling MSR[EE], the interrupt handler must clear TSR[FIS] by writing a word to TSR using mtspr with a 1 in any bit position to be cleared and 0 in all others. Data written to the TSR is not direct data, but a mask. Writing a 1 to this bit causes it to be cleared; writing a 0 has no effect.
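In kernel C code this write-one-to-clear convention is why TSR updates are written as a plain mask store. A small illustrative sketch, assuming the mtspr() macro and the SPRN_TSR/TSR_FIS definitions from the powerpc Book E register headers (arch/powerpc/include/asm/reg*.h):

```c
#include <asm/reg.h>	/* mtspr(), SPRN_TSR; TSR_FIS comes from reg_booke.h */

static inline void ack_fixed_interval_timer(void)
{
	/* Write-one-to-clear: only TSR[FIS] is cleared, other status bits
	 * are left untouched because their mask bits are written as 0. */
	mtspr(SPRN_TSR, TSR_FIS);
}
```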
https://www.ibm.com/developerworks/cn/linux/l-cn-powerpc-mpic/index.html
https://www.nxp.com/files-static/32bit/doc/ref_manual/EREF_RM.pdf
https://blog.51cto.com/13578681/2073499
https://www.cnblogs.com/tolimit/p/4303052.html
http://www.haifux.org/lectures/299/netLec7.pdf
https://ggaaooppeenngg.github.io/zh-CN/2017/05/07/cgroups-%E5%88%86%E6%9E%90%E4%B9%8B%E5%86%85%E5%AD%98%E5%92%8CCPU/
https://blog.csdn.net/pwl999/article/details/78817899
https://blog.csdn.net/zhoudaxia/article/details/7375780