2. Linux-3.14.12 Memory Management Notes [The memblock Algorithm During System Boot (2)]

  • October 8, 2019
  • Notes

With the preparatory work before memblock initialization finished, we return to the initialization of the memblock algorithm and its implementation. memblock is a very simple algorithm.

The memblock implementation keeps all of its state in a single global variable, `memblock` (declared with the `__initdata_memblock` section attribute); initialization, as well as every allocation and free, is simply a change to the state of the memory blocks it records. So let's start from the data structures.

This global is a struct memblock. The structure is defined as:

【file:/include/linux/memblock.h】

```c
struct memblock {
    bool bottom_up;  /* is bottom up direction? */
    phys_addr_t current_limit;
    struct memblock_type memory;
    struct memblock_type reserved;
};
```

The meaning of each member:

  • bottom_up: indicates whether the allocator hands out memory from low addresses (here "low" means just past the tail of the kernel image, same below) up towards high addresses, or from high addresses down;
  • current_limit: an upper bound limiting requests made through memblock_alloc() and memblock_alloc_base(..., MEMBLOCK_ALLOC_ACCESSIBLE);
  • memory: the memory that is available for allocation;
  • reserved: the memory that has already been allocated;

memory and reserved are the key data structures here; memblock's memory initialization and its allocation/free paths all revolve around them.

Next, look at the definition of memblock_type, the type of both memory and reserved:

【file:/include/linux/memblock.h】

```c
struct memblock_type {
    unsigned long cnt;       /* number of regions */
    unsigned long max;       /* size of the allocated array */
    phys_addr_t total_size;  /* size of all regions */
    struct memblock_region *regions;
};
```

cnt and max are, respectively, the number of memory regions currently recorded for this type (memory/reserved) and the maximum number it can hold; total_size is the total size of this type's space (i.e. the sum of the sizes of all recorded blocks); and regions is the array holding the per-block information (base, size, flags and so on):

【file:/include/linux/memblock.h】

```c
struct memblock_region {
    phys_addr_t base;
    phys_addr_t size;
    unsigned long flags;
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
    int nid;
#endif
};
```

That is about all of memblock's main structures; their overall relationship is shown in the figure:

Now go back and look at the definition of the memblock global:

【file:/mm/memblock.c】

```c
static struct memblock_region memblock_memory_init_regions[INIT_MEMBLOCK_REGIONS] __initdata_memblock;
static struct memblock_region memblock_reserved_init_regions[INIT_MEMBLOCK_REGIONS] __initdata_memblock;

struct memblock memblock __initdata_memblock = {
    .memory.regions   = memblock_memory_init_regions,
    .memory.cnt       = 1,  /* empty dummy entry */
    .memory.max       = INIT_MEMBLOCK_REGIONS,

    .reserved.regions = memblock_reserved_init_regions,
    .reserved.cnt     = 1,  /* empty dummy entry */
    .reserved.max     = INIT_MEMBLOCK_REGIONS,

    .bottom_up        = false,
    .current_limit    = MEMBLOCK_ALLOC_ANYWHERE,
};
```

It initializes some of the members: allocation proceeds from high addresses to low, and current_limit is set to ~0, i.e. 0xFFFFFFFF; at the same time, the two statically defined arrays provide the storage backing memory and reserved in the memblock management structure.

Next, let's analyze memblock initialization. Its initialization function is memblock_x86_fill(), reached through the following call chain:

```
start_kernel()                   # /init/main.c
  └─> setup_arch()               # /arch/x86/kernel/setup.c
       └─> memblock_x86_fill()   # /arch/x86/kernel/e820.c
```

The function implementation:

【file:/arch/x86/kernel/e820.c】

```c
void __init memblock_x86_fill(void)
{
    int i;
    u64 end;

    /*
     * EFI may have more than 128 entries
     * We are safe to enable resizing, beause memblock_x86_fill()
     * is rather later for x86
     */
    memblock_allow_resize();

    for (i = 0; i < e820.nr_map; i++) {
        struct e820entry *ei = &e820.map[i];

        end = ei->addr + ei->size;
        if (end != (resource_size_t)end)
            continue;

        if (ei->type != E820_RAM && ei->type != E820_RESERVED_KERN)
            continue;

        memblock_add(ei->addr, ei->size);
    }

    /* throw away partial pages */
    memblock_trim_memory(PAGE_SIZE);

    memblock_dump_all();
}
```

Inside the function, the call to memblock_allow_resize() merely sets memblock_can_resize; the for loop then walks the e820 memory-layout entries and feeds each one to memblock_add(); finally, after the loop exits, memblock_trim_memory() and memblock_dump_all() are called as post-processing. First, the implementation of memblock_add():

【file:/mm/memblock.c】

```c
int __init_memblock memblock_add(phys_addr_t base, phys_addr_t size)
{
    return memblock_add_region(&memblock.memory, base, size,
                               MAX_NUMNODES, 0);
}
```

memblock_add() is mainly a wrapper around memblock_add_region(); note in particular that its operand is memblock.memory (the memory available for allocation), from which we can infer that the intent is to add the e820 memory information there. Continuing down into the implementation of memblock_add_region():

【file:/mm/memblock.c】

```c
/**
 * memblock_add_region - add new memblock region
 * @type: memblock type to add new region into
 * @base: base address of the new region
 * @size: size of the new region
 * @nid: nid of the new region
 * @flags: flags of the new region
 *
 * Add new memblock region [@base,@base+@size) into @type. The new region
 * is allowed to overlap with existing ones - overlaps don't affect already
 * existing regions. @type is guaranteed to be minimal (all neighbouring
 * compatible regions are merged) after the addition.
 *
 * RETURNS:
 * 0 on success, -errno on failure.
 */
static int __init_memblock memblock_add_region(struct memblock_type *type,
                phys_addr_t base, phys_addr_t size,
                int nid, unsigned long flags)
{
    bool insert = false;
    phys_addr_t obase = base;
    phys_addr_t end = base + memblock_cap_size(base, &size);
    int i, nr_new;

    if (!size)
        return 0;

    /* special case for empty array */
    if (type->regions[0].size == 0) {
        WARN_ON(type->cnt != 1 || type->total_size);
        type->regions[0].base = base;
        type->regions[0].size = size;
        type->regions[0].flags = flags;
        memblock_set_region_node(&type->regions[0], nid);
        type->total_size = size;
        return 0;
    }
repeat:
    /*
     * The following is executed twice. Once with %false @insert and
     * then with %true. The first counts the number of regions needed
     * to accomodate the new area. The second actually inserts them.
     */
    base = obase;
    nr_new = 0;

    for (i = 0; i < type->cnt; i++) {
        struct memblock_region *rgn = &type->regions[i];
        phys_addr_t rbase = rgn->base;
        phys_addr_t rend = rbase + rgn->size;

        if (rbase >= end)
            break;
        if (rend <= base)
            continue;
        /*
         * @rgn overlaps. If it separates the lower part of new
         * area, insert that portion.
         */
        if (rbase > base) {
            nr_new++;
            if (insert)
                memblock_insert_region(type, i++, base,
                                       rbase - base, nid,
                                       flags);
        }
        /* area below @rend is dealt with, forget about it */
        base = min(rend, end);
    }

    /* insert the remaining portion */
    if (base < end) {
        nr_new++;
        if (insert)
            memblock_insert_region(type, i, base, end - base,
                                   nid, flags);
    }

    /*
     * If this was the first round, resize array and repeat for actual
     * insertions; otherwise, merge and return.
     */
    if (!insert) {
        while (type->cnt + nr_new > type->max)
            if (memblock_double_array(type, obase, size) < 0)
                return -ENOMEM;
        insert = true;
        goto repeat;
    } else {
        memblock_merge_regions(type);
        return 0;
    }
}
```

The behavior of memblock_add_region(), step by step:

  1. If the memory managed by memblock is still empty, record the new range in the first (dummy) slot;
  2. Otherwise, check whether the new range overlaps existing regions; overlapping portions are skipped, and only the non-overlapping parts are inserted;
  3. If the region[] array runs out of slots, grow it with memblock_double_array();
  4. Finally, memblock_merge_regions() merges regions that end up adjacent.

It is now clear that the function's purpose is to translate the memory layout in the e820 map into memblock.memory within the memblock management algorithm, marking that memory as available.

Back in memblock_x86_fill(), after the for loop come the two post-processing functions memblock_trim_memory() and memblock_dump_all(). The implementation of memblock_trim_memory():

【file:/mm/memblock.c】

```c
void __init_memblock memblock_trim_memory(phys_addr_t align)
{
    int i;
    phys_addr_t start, end, orig_start, orig_end;
    struct memblock_type *mem = &memblock.memory;

    for (i = 0; i < mem->cnt; i++) {
        orig_start = mem->regions[i].base;
        orig_end = mem->regions[i].base + mem->regions[i].size;
        start = round_up(orig_start, align);
        end = round_down(orig_end, align);

        if (start == orig_start && end == orig_end)
            continue;

        if (start < end) {
            mem->regions[i].base = start;
            mem->regions[i].size = end - start;
        } else {
            memblock_remove_region(mem, i);
            i--;
        }
    }
}
```

The function trims memblock.memory, discarding the unaligned parts at the head and tail of each region. Finally, memblock_dump_all() merely dumps the resulting information, so we won't analyze it here.

At this point memblock initialization is complete. Next, let's look at how the algorithm allocates and frees memory. The allocation and free interfaces of memblock are memblock_alloc() and memblock_free().

The implementation of memblock_alloc() (its parameters are the request size and align for byte alignment):

【file:/mm/memblock.c】

```c
phys_addr_t __init memblock_alloc(phys_addr_t size, phys_addr_t align)
{
    return memblock_alloc_base(size, align, MEMBLOCK_ALLOC_ACCESSIBLE);
}
```

Adding the flag MEMBLOCK_ALLOC_ACCESSIBLE, meaning the requested memory must be accessible, it wraps memblock_alloc_base():

【file:/mm/memblock.c】

```c
phys_addr_t __init memblock_alloc_base(phys_addr_t size, phys_addr_t align, phys_addr_t max_addr)
{
    phys_addr_t alloc;

    alloc = __memblock_alloc_base(size, align, max_addr);

    if (alloc == 0)
        panic("ERROR: Failed to allocate 0x%llx bytes below 0x%llx.\n",
              (unsigned long long) size, (unsigned long long) max_addr);

    return alloc;
}
```

On to __memblock_alloc_base() (which wraps memblock_alloc_base_nid(), adding the NUMA_NO_NODE parameter to indicate there is no NUMA node; after all, initialization hasn't got that far yet):

【file:/mm/memblock.c】

```c
phys_addr_t __init __memblock_alloc_base(phys_addr_t size, phys_addr_t align, phys_addr_t max_addr)
{
    return memblock_alloc_base_nid(size, align, max_addr, NUMA_NO_NODE);
}
```

Then memblock_alloc_base_nid():

【file:/mm/memblock.c】

```c
static phys_addr_t __init memblock_alloc_base_nid(phys_addr_t size,
                    phys_addr_t align, phys_addr_t max_addr,
                    int nid)
{
    phys_addr_t found;

    if (!align)
        align = SMP_CACHE_BYTES;

    found = memblock_find_in_range_node(size, align, 0, max_addr, nid);
    if (found && !memblock_reserve(found, size))
        return found;

    return 0;
}
```

Note the two key functions here: memblock_find_in_range_node() and memblock_reserve().

First, the implementation of memblock_find_in_range_node():

【file:/mm/memblock.c】

```c
/**
 * memblock_find_in_range_node - find free area in given range and node
 * @size: size of free area to find
 * @align: alignment of free area to find
 * @start: start of candidate range
 * @end: end of candidate range, can be %MEMBLOCK_ALLOC_{ANYWHERE|ACCESSIBLE}
 * @nid: nid of the free area to find, %NUMA_NO_NODE for any node
 *
 * Find @size free area aligned to @align in the specified range and node.
 *
 * When allocation direction is bottom-up, the @start should be greater
 * than the end of the kernel image. Otherwise, it will be trimmed. The
 * reason is that we want the bottom-up allocation just near the kernel
 * image so it is highly likely that the allocated memory and the kernel
 * will reside in the same node.
 *
 * If bottom-up allocation failed, will try to allocate memory top-down.
 *
 * RETURNS:
 * Found address on success, 0 on failure.
 */
phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t size,
                    phys_addr_t align, phys_addr_t start,
                    phys_addr_t end, int nid)
{
    int ret;
    phys_addr_t kernel_end;

    /* pump up @end */
    if (end == MEMBLOCK_ALLOC_ACCESSIBLE)
        end = memblock.current_limit;

    /* avoid allocating the first page */
    start = max_t(phys_addr_t, start, PAGE_SIZE);
    end = max(start, end);
    kernel_end = __pa_symbol(_end);

    /*
     * try bottom-up allocation only when bottom-up mode
     * is set and @end is above the kernel image.
     */
    if (memblock_bottom_up() && end > kernel_end) {
        phys_addr_t bottom_up_start;

        /* make sure we will allocate above the kernel */
        bottom_up_start = max(start, kernel_end);

        /* ok, try bottom-up allocation first */
        ret = __memblock_find_range_bottom_up(bottom_up_start, end,
                                              size, align, nid);
        if (ret)
            return ret;

        /*
         * we always limit bottom-up allocation above the kernel,
         * but top-down allocation doesn't have the limit, so
         * retrying top-down allocation may succeed when bottom-up
         * allocation failed.
         *
         * bottom-up allocation is expected to be fail very rarely,
         * so we use WARN_ONCE() here to see the stack trace if
         * fail happens.
         */
        WARN_ONCE(1, "memblock: bottom-up allocation failed, "
                     "memory hotunplug may be affected\n");
    }

    return __memblock_find_range_top_down(start, end, size, align, nid);
}
```

Roughly: it first adjusts end. Following the call chain down, end here is actually MEMBLOCK_ALLOC_ACCESSIBLE, so it gets replaced with memblock.current_limit. Then start is bumped up to avoid allocating from the first page. memblock_bottom_up() returns memblock.bottom_up, which we saw initialized to false earlier (though not always — NUMA initialization sets it to true), so in the end __memblock_find_range_top_down() is called to search for memory. Its implementation:

【file:/mm/memblock.c】

```c
/**
 * __memblock_find_range_top_down - find free area utility, in top-down
 * @start: start of candidate range
 * @end: end of candidate range, can be %MEMBLOCK_ALLOC_{ANYWHERE|ACCESSIBLE}
 * @size: size of free area to find
 * @align: alignment of free area to find
 * @nid: nid of the free area to find, %NUMA_NO_NODE for any node
 *
 * Utility called from memblock_find_in_range_node(), find free area top-down.
 *
 * RETURNS:
 * Found address on success, 0 on failure.
 */
static phys_addr_t __init_memblock
__memblock_find_range_top_down(phys_addr_t start, phys_addr_t end,
                               phys_addr_t size, phys_addr_t align, int nid)
{
    phys_addr_t this_start, this_end, cand;
    u64 i;

    for_each_free_mem_range_reverse(i, nid, &this_start, &this_end, NULL) {
        this_start = clamp(this_start, start, end);
        this_end = clamp(this_end, start, end);

        if (this_end < size)
            continue;

        cand = round_down(this_end - size, align);
        if (cand >= this_start)
            return cand;
    }

    return 0;
}
```

__memblock_find_range_top_down() iterates with the for_each_free_mem_range_reverse macro, which wraps __next_free_mem_range_rev(). That function extracts the blocks in memblock.memory one by one and checks each against the entries of memblock.reserved, guaranteeing that the this_start/this_end it returns never intersects reserved memory. The range is then clamped into [start, end) and its size checked; if it is large enough, the start address of a size-sized span carved from its upper end (this being a top-down allocation) is returned, provided that address does not fall below this_start. With that, a memory block satisfying the request has been found.

As an aside, __memblock_find_range_bottom_up() and __memblock_find_range_top_down() are implemented entirely analogously; they differ only in searching bottom-up versus top-down.

With a suitable memory block found, return to the other key function called by memblock_alloc_base_nid(): memblock_reserve():

【file:/mm/memblock.c】

```c
int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size)
{
    return memblock_reserve_region(base, size, MAX_NUMNODES, 0);
}
```

And then memblock_reserve_region():

【file:/mm/memblock.c】

```c
static int __init_memblock memblock_reserve_region(phys_addr_t base,
                           phys_addr_t size,
                           int nid,
                           unsigned long flags)
{
    struct memblock_type *_rgn = &memblock.reserved;

    memblock_dbg("memblock_reserve: [%#016llx-%#016llx] flags %#02lx %pF\n",
                 (unsigned long long)base,
                 (unsigned long long)base + size - 1,
                 flags, (void *)_RET_IP_);

    return memblock_add_region(_rgn, base, size, nid, flags);
}
```

As we can see, memblock_reserve_region() uses memblock_add_region() to add the block's information into memblock.reserved.

Finally, the implementation of memblock's memblock_free():

【file:/mm/memblock.c】

```c
int __init_memblock memblock_free(phys_addr_t base, phys_addr_t size)
{
    memblock_dbg("   memblock_free: [%#016llx-%#016llx] %pF\n",
                 (unsigned long long)base,
                 (unsigned long long)base + size - 1,
                 (void *)_RET_IP_);

    return __memblock_remove(&memblock.reserved, base, size);
}
```

The function mainly wraps __memblock_remove(), operating on memblock.reserved.

Next, __memblock_remove():

【file:/mm/memblock.c】

```c
static int __init_memblock __memblock_remove(struct memblock_type *type,
                                             phys_addr_t base, phys_addr_t size)
{
    int start_rgn, end_rgn;
    int i, ret;

    ret = memblock_isolate_range(type, base, size, &start_rgn, &end_rgn);
    if (ret)
        return ret;

    for (i = end_rgn - 1; i >= start_rgn; i--)
        memblock_remove_region(type, i);
    return 0;
}
```

It calls two key functions, memblock_isolate_range() and memblock_remove_region(). First, memblock_isolate_range():

【file:/mm/memblock.c】

```c
/**
 * memblock_isolate_range - isolate given range into disjoint memblocks
 * @type: memblock type to isolate range for
 * @base: base of range to isolate
 * @size: size of range to isolate
 * @start_rgn: out parameter for the start of isolated region
 * @end_rgn: out parameter for the end of isolated region
 *
 * Walk @type and ensure that regions don't cross the boundaries defined by
 * [@base,@base+@size). Crossing regions are split at the boundaries,
 * which may create at most two more regions. The index of the first
 * region inside the range is returned in *@start_rgn and end in *@end_rgn.
 *
 * RETURNS:
 * 0 on success, -errno on failure.
 */
static int __init_memblock memblock_isolate_range(struct memblock_type *type,
                    phys_addr_t base, phys_addr_t size,
                    int *start_rgn, int *end_rgn)
{
    phys_addr_t end = base + memblock_cap_size(base, &size);
    int i;

    *start_rgn = *end_rgn = 0;

    if (!size)
        return 0;

    /* we'll create at most two more regions */
    while (type->cnt + 2 > type->max)
        if (memblock_double_array(type, base, size) < 0)
            return -ENOMEM;

    for (i = 0; i < type->cnt; i++) {
        struct memblock_region *rgn = &type->regions[i];
        phys_addr_t rbase = rgn->base;
        phys_addr_t rend = rbase + rgn->size;

        if (rbase >= end)
            break;
        if (rend <= base)
            continue;

        if (rbase < base) {
            /*
             * @rgn intersects from below. Split and continue
             * to process the next region - the new top half.
             */
            rgn->base = base;
            rgn->size -= base - rbase;
            type->total_size -= base - rbase;
            memblock_insert_region(type, i, rbase, base - rbase,
                                   memblock_get_region_node(rgn),
                                   rgn->flags);
        } else if (rend > end) {
            /*
             * @rgn intersects from above. Split and redo the
             * current region - the new bottom half.
             */
            rgn->base = end;
            rgn->size -= end - rbase;
            type->total_size -= end - rbase;
            memblock_insert_region(type, i--, rbase, end - rbase,
                                   memblock_get_region_node(rgn),
                                   rgn->flags);
        } else {
            /* @rgn is fully contained, record it */
            if (!*end_rgn)
                *start_rgn = i;
            *end_rgn = i + 1;
        }
    }

    return 0;
}
```

As we can see, memblock_isolate_range() mainly locates the array entries that cover the specified memory range and returns their indices through the out parameters. Next, the implementation of memblock_remove_region():

【file:/mm/memblock.c】

```c
static void __init_memblock memblock_remove_region(struct memblock_type *type,
                                                   unsigned long r)
{
    type->total_size -= type->regions[r].size;
    memmove(&type->regions[r], &type->regions[r + 1],
            (type->cnt - (r + 1)) * sizeof(type->regions[r]));
    type->cnt--;

    /* Special case for empty arrays */
    if (type->cnt == 0) {
        WARN_ON(type->total_size != 0);
        type->cnt = 1;
        type->regions[0].base = 0;
        type->regions[0].size = 0;
        type->regions[0].flags = 0;
        memblock_set_region_node(&type->regions[0], MAX_NUMNODES);
    }
}
```

Its job is to remove the entry with the given index from the memblock.reserved management structure.

Taken together they are easier to understand. In __memblock_remove(), memblock_isolate_range() splits memblock.reserved at the boundaries of the range being freed, returning in start_rgn and end_rgn the indices of the first and last entries fully covered by that range. Once it returns, memblock_remove_region() uses start_rgn and end_rgn to drop those entries from the memblock.reserved management structure. The memory is then freed.

A brief summary: the memblock algorithm manages the available, allocatable memory in memblock.memory and the already-allocated memory in memblock.reserved; as soon as a block is added to memblock.reserved, it counts as allocated. One key point to note: allocation merely adds the allocated range to memblock.reserved and makes no deletions or changes to memblock.memory, which is why both allocation and free operate solely on memblock.reserved. The algorithm is not very efficient, but that is reasonable; during the initialization phase there are no complex memory-usage scenarios, and in many places memory is allocated once and used permanently.