msm-4.14/mm/memory_hotplug.c
Dan Williams 6e8d8f0cd2
mm: shuffle initial free memory to improve memory-side-cache utilization
Patch series "mm: Randomize free memory", v10.

This patch (of 3):

Randomization of the page allocator improves the average utilization of
a direct-mapped memory-side-cache.  Memory side caching is a platform
capability that Linux has been previously exposed to in HPC
(high-performance computing) environments on specialty platforms.  In
that instance it was a smaller pool of high-bandwidth-memory relative to
higher-capacity / lower-bandwidth DRAM.  Now, this capability is going
to be found on general purpose server platforms where DRAM is a cache in
front of higher latency persistent memory [1].

Robert offered an explanation of the state of the art of Linux
interactions with memory-side-caches [2], and I copy it here:

    It's been a problem in the HPC space:
    http://www.nersc.gov/research-and-development/knl-cache-mode-performance-coe/

    A kernel module called zonesort is available to try to help:
    https://software.intel.com/en-us/articles/xeon-phi-software

    and this abandoned patch series proposed that for the kernel:
    https://lkml.kernel.org/r/20170823100205.17311-1-lukasz.daniluk@intel.com

    Dan's patch series doesn't attempt to ensure buffers won't conflict, but
    also reduces the chance that the buffers will. This will make performance
    more consistent, albeit slower than "optimal" (which is near impossible
    to attain in a general-purpose kernel).  That's better than forcing
    users to deploy remedies like:
        "To eliminate this gradual degradation, we have added a Stream
         measurement to the Node Health Check that follows each job;
         nodes are rebooted whenever their measured memory bandwidth
         falls below 300 GB/s."

A replacement for zonesort was merged upstream in commit cc9aec03e58f
("x86/numa_emulation: Introduce uniform split capability").  With this
numa_emulation capability, memory can be split into cache sized
("near-memory" sized) numa nodes.  A bind operation to such a node, and
disabling workloads on other nodes, enables full cache performance.
However, once the workload exceeds the cache size then cache conflicts
are unavoidable.  While HPC environments might be able to tolerate
time-scheduling of cache sized workloads, for general purpose server
platforms, the oversubscribed cache case will be the common case.

The worst case scenario is that a server system owner benchmarks a
workload at boot with an un-contended cache only to see that performance
degrade over time, even below the average cache performance due to
excessive conflicts.  Randomization clips the peaks and fills in the
valleys of cache utilization to yield steady average performance.

Here are some performance impact details of the patches:

1/ An Intel internal synthetic memory bandwidth measurement tool, saw a
   3X speedup in a contrived case that tries to force cache conflicts.
   The contrived cased used the numa_emulation capability to force an
   instance of the benchmark to be run in two of the near-memory sized
   numa nodes.  If both instances were placed on the same emulated they
   would fit and cause zero conflicts.  While on separate emulated nodes
   without randomization they underutilized the cache and conflicted
   unnecessarily due to the in-order allocation per node.

2/ A well known Java server application benchmark was run with a heap
   size that exceeded cache size by 3X.  The cache conflict rate was 8%
   for the first run and degraded to 21% after page allocator aging.  With
   randomization enabled the rate levelled out at 11%.

3/ A MongoDB workload did not observe measurable difference in
   cache-conflict rates, but the overall throughput dropped by 7% with
   randomization in one case.

4/ Mel Gorman ran his suite of performance workloads with randomization
   enabled on platforms without a memory-side-cache and saw a mix of some
   improvements and some losses [3].

While there is potentially significant improvement for applications that
depend on low latency access across a wide working-set, the performance
may be negligible to negative for other workloads.  For this reason the
shuffle capability defaults to off unless a direct-mapped
memory-side-cache is detected.  Even then, the page_alloc.shuffle=0
parameter can be specified to disable the randomization on those systems.

Outside of memory-side-cache utilization concerns there is potentially
security benefit from randomization.  Some data exfiltration and
return-oriented-programming attacks rely on the ability to infer the
location of sensitive data objects.  The kernel page allocator, especially
early in system boot, has predictable first-in-first out behavior for
physical pages.  Pages are freed in physical address order when first
onlined.

Quoting Kees:
    "While we already have a base-address randomization
     (CONFIG_RANDOMIZE_MEMORY), attacks against the same hardware and
     memory layouts would certainly be using the predictability of
     allocation ordering (i.e. for attacks where the base address isn't
     important: only the relative positions between allocated memory).
     This is common in lots of heap-style attacks. They try to gain
     control over ordering by spraying allocations, etc.

     I'd really like to see this because it gives us something similar
     to CONFIG_SLAB_FREELIST_RANDOM but for the page allocator."

While SLAB_FREELIST_RANDOM reduces the predictability of some local slab
caches it leaves vast bulk of memory to be predictably in order allocated.
However, it should be noted, the concrete security benefits are hard to
quantify, and no known CVE is mitigated by this randomization.

Introduce shuffle_free_memory(), and its helper shuffle_zone(), to perform
a Fisher-Yates shuffle of the page allocator 'free_area' lists when they
are initially populated with free memory at boot and at hotplug time.  Do
this based on either the presence of a page_alloc.shuffle=Y command line
parameter, or autodetection of a memory-side-cache (to be added in a
follow-on patch).

The shuffling is done in terms of CONFIG_SHUFFLE_PAGE_ORDER sized free
pages where the default CONFIG_SHUFFLE_PAGE_ORDER is MAX_ORDER-1 i.e.  10,
4MB this trades off randomization granularity for time spent shuffling.
MAX_ORDER-1 was chosen to be minimally invasive to the page allocator
while still showing memory-side cache behavior improvements, and the
expectation that the security implications of finer granularity
randomization is mitigated by CONFIG_SLAB_FREELIST_RANDOM.  The
performance impact of the shuffling appears to be in the noise compared to
other memory initialization work.

This initial randomization can be undone over time so a follow-on patch is
introduced to inject entropy on page free decisions.  It is reasonable to
ask if the page free entropy is sufficient, but it is not enough due to
the in-order initial freeing of pages.  At the start of that process
putting page1 in front or behind page0 still keeps them close together,
page2 is still near page1 and has a high chance of being adjacent.  As
more pages are added ordering diversity improves, but there is still high
page locality for the low address pages and this leads to no significant
impact to the cache conflict rate.

[1]: https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/
[2]: https://lkml.kernel.org/r/AT5PR8401MB1169D656C8B5E121752FC0F8AB120@AT5PR8401MB1169.NAMPRD84.PROD.OUTLOOK.COM
[3]: https://lkml.org/lkml/2018/10/12/309

[dan.j.williams@intel.com: fix shuffle enable]
  Link: http://lkml.kernel.org/r/154943713038.3858443.4125180191382062871.stgit@dwillia2-desk3.amr.corp.intel.com
[cai@lca.pw: fix SHUFFLE_PAGE_ALLOCATOR help texts]
  Link: http://lkml.kernel.org/r/20190425201300.75650-1-cai@lca.pw
Link: http://lkml.kernel.org/r/154899811738.3165233.12325692939590944259.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Qian Cai <cai@lca.pw>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Robert Elliott <elliott@hpe.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-08-25 14:41:19 +00:00

2004 lines
51 KiB
C

/*
* linux/mm/memory_hotplug.c
*
* Copyright (C)
*/
#include <linux/stddef.h>
#include <linux/mm.h>
#include <linux/sched/signal.h>
#include <linux/swap.h>
#include <linux/interrupt.h>
#include <linux/pagemap.h>
#include <linux/compiler.h>
#include <linux/export.h>
#include <linux/pagevec.h>
#include <linux/writeback.h>
#include <linux/slab.h>
#include <linux/sysctl.h>
#include <linux/cpu.h>
#include <linux/memory.h>
#include <linux/memremap.h>
#include <linux/memory_hotplug.h>
#include <linux/highmem.h>
#include <linux/vmalloc.h>
#include <linux/ioport.h>
#include <linux/delay.h>
#include <linux/migrate.h>
#include <linux/page-isolation.h>
#include <linux/pfn.h>
#include <linux/suspend.h>
#include <linux/mm_inline.h>
#include <linux/firmware-map.h>
#include <linux/stop_machine.h>
#include <linux/hugetlb.h>
#include <linux/memblock.h>
#include <linux/bootmem.h>
#include <linux/compaction.h>
#include <linux/device.h>
#include <linux/rmap.h>
#include <asm/tlbflush.h>
#include "internal.h"
#include "shuffle.h"
/*
* online_page_callback contains pointer to current page onlining function.
* Initially it is generic_online_page(). If it is required it could be
* changed by calling set_online_page_callback() for callback registration
* and restore_online_page_callback() for generic callback restore.
*/
static int generic_online_page(struct page *page);
static online_page_callback_t online_page_callback = generic_online_page;
static DEFINE_MUTEX(online_page_callback_lock);
DEFINE_STATIC_PERCPU_RWSEM(mem_hotplug_lock);
void get_online_mems(void)
{
percpu_down_read(&mem_hotplug_lock);
}
void put_online_mems(void)
{
percpu_up_read(&mem_hotplug_lock);
}
#ifndef CONFIG_MEMORY_HOTPLUG_MOVABLE_NODE
bool movable_node_enabled = false;
#else
bool movable_node_enabled = true;
#endif
#ifndef CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE
bool memhp_auto_online;
#else
bool memhp_auto_online = true;
#endif
EXPORT_SYMBOL_GPL(memhp_auto_online);
static int __init setup_memhp_default_state(char *str)
{
if (!strcmp(str, "online"))
memhp_auto_online = true;
else if (!strcmp(str, "offline"))
memhp_auto_online = false;
return 1;
}
__setup("memhp_default_state=", setup_memhp_default_state);
void mem_hotplug_begin(void)
{
cpus_read_lock();
percpu_down_write(&mem_hotplug_lock);
}
void mem_hotplug_done(void)
{
percpu_up_write(&mem_hotplug_lock);
cpus_read_unlock();
}
/* add this memory to iomem resource */
static struct resource *register_memory_resource(u64 start, u64 size)
{
struct resource *res, *conflict;
res = kzalloc(sizeof(struct resource), GFP_KERNEL);
if (!res)
return ERR_PTR(-ENOMEM);
res->name = "System RAM";
res->start = start;
res->end = start + size - 1;
res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
conflict = request_resource_conflict(&iomem_resource, res);
if (conflict) {
if (conflict->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY) {
pr_debug("Device unaddressable memory block "
"memory hotplug at %#010llx !\n",
(unsigned long long)start);
}
pr_debug("System RAM resource %pR cannot be added\n", res);
kfree(res);
return ERR_PTR(-EEXIST);
}
return res;
}
static void release_memory_resource(struct resource *res)
{
if (!res)
return;
release_resource(res);
kfree(res);
return;
}
#ifdef CONFIG_MEMORY_HOTPLUG_SPARSE
void get_page_bootmem(unsigned long info, struct page *page,
unsigned long type)
{
page->freelist = (void *)type;
SetPagePrivate(page);
set_page_private(page, info);
page_ref_inc(page);
}
void put_page_bootmem(struct page *page)
{
unsigned long type;
type = (unsigned long) page->freelist;
BUG_ON(type < MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE ||
type > MEMORY_HOTPLUG_MAX_BOOTMEM_TYPE);
if (page_ref_dec_return(page) == 1) {
page->freelist = NULL;
ClearPagePrivate(page);
set_page_private(page, 0);
INIT_LIST_HEAD(&page->lru);
free_reserved_page(page);
}
}
#ifdef CONFIG_HAVE_BOOTMEM_INFO_NODE
#ifndef CONFIG_SPARSEMEM_VMEMMAP
static void register_page_bootmem_info_section(unsigned long start_pfn)
{
unsigned long *usemap, mapsize, section_nr, i;
struct mem_section *ms;
struct page *page, *memmap;
section_nr = pfn_to_section_nr(start_pfn);
ms = __nr_to_section(section_nr);
/* Get section's memmap address */
memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr);
/*
* Get page for the memmap's phys address
* XXX: need more consideration for sparse_vmemmap...
*/
page = virt_to_page(memmap);
mapsize = sizeof(struct page) * PAGES_PER_SECTION;
mapsize = PAGE_ALIGN(mapsize) >> PAGE_SHIFT;
/* remember memmap's page */
for (i = 0; i < mapsize; i++, page++)
get_page_bootmem(section_nr, page, SECTION_INFO);
usemap = __nr_to_section(section_nr)->pageblock_flags;
page = virt_to_page(usemap);
mapsize = PAGE_ALIGN(usemap_size()) >> PAGE_SHIFT;
for (i = 0; i < mapsize; i++, page++)
get_page_bootmem(section_nr, page, MIX_SECTION_INFO);
}
#else /* CONFIG_SPARSEMEM_VMEMMAP */
static void register_page_bootmem_info_section(unsigned long start_pfn)
{
unsigned long *usemap, mapsize, section_nr, i;
struct mem_section *ms;
struct page *page, *memmap;
if (!pfn_valid(start_pfn))
return;
section_nr = pfn_to_section_nr(start_pfn);
ms = __nr_to_section(section_nr);
memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr);
register_page_bootmem_memmap(section_nr, memmap, PAGES_PER_SECTION);
usemap = __nr_to_section(section_nr)->pageblock_flags;
page = virt_to_page(usemap);
mapsize = PAGE_ALIGN(usemap_size()) >> PAGE_SHIFT;
for (i = 0; i < mapsize; i++, page++)
get_page_bootmem(section_nr, page, MIX_SECTION_INFO);
}
#endif /* !CONFIG_SPARSEMEM_VMEMMAP */
void __init register_page_bootmem_info_node(struct pglist_data *pgdat)
{
unsigned long i, pfn, end_pfn, nr_pages;
int node = pgdat->node_id;
struct page *page;
nr_pages = PAGE_ALIGN(sizeof(struct pglist_data)) >> PAGE_SHIFT;
page = virt_to_page(pgdat);
for (i = 0; i < nr_pages; i++, page++)
get_page_bootmem(node, page, NODE_INFO);
pfn = pgdat->node_start_pfn;
end_pfn = pgdat_end_pfn(pgdat);
/* register section info */
for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
/*
* Some platforms can assign the same pfn to multiple nodes - on
* node0 as well as nodeN. To avoid registering a pfn against
* multiple nodes we check that this pfn does not already
* reside in some other nodes.
*/
if (pfn_valid(pfn) && (early_pfn_to_nid(pfn) == node))
register_page_bootmem_info_section(pfn);
}
}
#endif /* CONFIG_HAVE_BOOTMEM_INFO_NODE */
static int __meminit __add_section(int nid, unsigned long phys_start_pfn,
bool want_memblock)
{
int ret;
int i;
if (pfn_valid(phys_start_pfn))
return -EEXIST;
ret = sparse_add_one_section(NODE_DATA(nid), phys_start_pfn);
if (ret < 0)
return ret;
/*
* Make all the pages reserved so that nobody will stumble over half
* initialized state.
* FIXME: We also have to associate it with a node because pfn_to_node
* relies on having page with the proper node.
*/
for (i = 0; i < PAGES_PER_SECTION; i++) {
unsigned long pfn = phys_start_pfn + i;
struct page *page;
if (!pfn_valid(pfn))
continue;
page = pfn_to_page(pfn);
set_page_node(page, nid);
SetPageReserved(page);
}
if (!want_memblock)
return 0;
return register_new_memory(nid, __pfn_to_section(phys_start_pfn));
}
/*
* Reasonably generic function for adding memory. It is
* expected that archs that support memory hotplug will
* call this function after deciding the zone to which to
* add the new pages.
*/
int __ref __add_pages(int nid, unsigned long phys_start_pfn,
unsigned long nr_pages, bool want_memblock)
{
unsigned long i;
int err = 0;
int start_sec, end_sec;
struct vmem_altmap *altmap;
/* during initialize mem_map, align hot-added range to section */
start_sec = pfn_to_section_nr(phys_start_pfn);
end_sec = pfn_to_section_nr(phys_start_pfn + nr_pages - 1);
altmap = to_vmem_altmap((unsigned long) pfn_to_page(phys_start_pfn));
if (altmap) {
/*
* Validate altmap is within bounds of the total request
*/
if (altmap->base_pfn != phys_start_pfn
|| vmem_altmap_offset(altmap) > nr_pages) {
pr_warn_once("memory add fail, invalid altmap\n");
err = -EINVAL;
goto out;
}
altmap->alloc = 0;
}
for (i = start_sec; i <= end_sec; i++) {
err = __add_section(nid, section_nr_to_pfn(i), want_memblock);
/*
* EEXIST is finally dealt with by ioresource collision
* check. see add_memory() => register_memory_resource()
* Warning will be printed if there is collision.
*/
if (err && (err != -EEXIST))
break;
err = 0;
cond_resched();
}
vmemmap_populate_print_last();
out:
return err;
}
EXPORT_SYMBOL_GPL(__add_pages);
#ifdef CONFIG_MEMORY_HOTREMOVE
/* find the smallest valid pfn in the range [start_pfn, end_pfn) */
static unsigned long find_smallest_section_pfn(int nid, struct zone *zone,
unsigned long start_pfn,
unsigned long end_pfn)
{
for (; start_pfn < end_pfn; start_pfn += PAGES_PER_SECTION) {
if (unlikely(!pfn_to_online_page(start_pfn)))
continue;
if (unlikely(pfn_to_nid(start_pfn) != nid))
continue;
if (zone && zone != page_zone(pfn_to_page(start_pfn)))
continue;
return start_pfn;
}
return 0;
}
/* find the biggest valid pfn in the range [start_pfn, end_pfn). */
static unsigned long find_biggest_section_pfn(int nid, struct zone *zone,
unsigned long start_pfn,
unsigned long end_pfn)
{
unsigned long pfn;
/* pfn is the end pfn of a memory section. */
pfn = end_pfn - 1;
for (; pfn >= start_pfn; pfn -= PAGES_PER_SECTION) {
if (unlikely(!pfn_to_online_page(pfn)))
continue;
if (unlikely(pfn_to_nid(pfn) != nid))
continue;
if (zone && zone != page_zone(pfn_to_page(pfn)))
continue;
return pfn;
}
return 0;
}
static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
unsigned long end_pfn)
{
unsigned long zone_start_pfn = zone->zone_start_pfn;
unsigned long z = zone_end_pfn(zone); /* zone_end_pfn namespace clash */
unsigned long zone_end_pfn = z;
unsigned long pfn;
int nid = zone_to_nid(zone);
zone_span_writelock(zone);
if (zone_start_pfn == start_pfn) {
/*
* If the section is smallest section in the zone, it need
* shrink zone->zone_start_pfn and zone->zone_spanned_pages.
* In this case, we find second smallest valid mem_section
* for shrinking zone.
*/
pfn = find_smallest_section_pfn(nid, zone, end_pfn,
zone_end_pfn);
if (pfn) {
zone->zone_start_pfn = pfn;
zone->spanned_pages = zone_end_pfn - pfn;
}
} else if (zone_end_pfn == end_pfn) {
/*
* If the section is biggest section in the zone, it need
* shrink zone->spanned_pages.
* In this case, we find second biggest valid mem_section for
* shrinking zone.
*/
pfn = find_biggest_section_pfn(nid, zone, zone_start_pfn,
start_pfn);
if (pfn)
zone->spanned_pages = pfn - zone_start_pfn + 1;
}
/*
* The section is not biggest or smallest mem_section in the zone, it
* only creates a hole in the zone. So in this case, we need not
* change the zone. But perhaps, the zone has only hole data. Thus
* it check the zone has only hole or not.
*/
pfn = zone_start_pfn;
for (; pfn < zone_end_pfn; pfn += PAGES_PER_SECTION) {
if (unlikely(!pfn_to_online_page(pfn)))
continue;
if (page_zone(pfn_to_page(pfn)) != zone)
continue;
/* If the section is current section, it continues the loop */
if (start_pfn == pfn)
continue;
/* If we find valid section, we have nothing to do */
zone_span_writeunlock(zone);
return;
}
/* The zone has no valid section */
zone->zone_start_pfn = 0;
zone->spanned_pages = 0;
zone_span_writeunlock(zone);
}
static void update_pgdat_span(struct pglist_data *pgdat)
{
unsigned long node_start_pfn = 0, node_end_pfn = 0;
struct zone *zone;
for (zone = pgdat->node_zones;
zone < pgdat->node_zones + MAX_NR_ZONES; zone++) {
unsigned long zone_end_pfn = zone->zone_start_pfn +
zone->spanned_pages;
/* No need to lock the zones, they can't change. */
if (!zone->spanned_pages)
continue;
if (!node_end_pfn) {
node_start_pfn = zone->zone_start_pfn;
node_end_pfn = zone_end_pfn;
continue;
}
if (zone_end_pfn > node_end_pfn)
node_end_pfn = zone_end_pfn;
if (zone->zone_start_pfn < node_start_pfn)
node_start_pfn = zone->zone_start_pfn;
}
pgdat->node_start_pfn = node_start_pfn;
pgdat->node_spanned_pages = node_end_pfn - node_start_pfn;
}
static void __remove_zone(struct zone *zone, unsigned long start_pfn)
{
struct pglist_data *pgdat = zone->zone_pgdat;
int nr_pages = PAGES_PER_SECTION;
unsigned long flags;
#ifdef CONFIG_ZONE_DEVICE
/*
* Zone shrinking code cannot properly deal with ZONE_DEVICE. So
* we will not try to shrink the zones - which is okay as
* set_zone_contiguous() cannot deal with ZONE_DEVICE either way.
*/
if (zone_idx(zone) == ZONE_DEVICE)
return;
#endif
pgdat_resize_lock(zone->zone_pgdat, &flags);
shrink_zone_span(zone, start_pfn, start_pfn + nr_pages);
update_pgdat_span(pgdat);
pgdat_resize_unlock(zone->zone_pgdat, &flags);
}
static int __remove_section(struct zone *zone, struct mem_section *ms,
unsigned long map_offset)
{
unsigned long start_pfn;
int scn_nr;
int ret = -EINVAL;
if (!valid_section(ms))
return ret;
ret = unregister_memory_section(ms);
if (ret)
return ret;
scn_nr = __section_nr(ms);
start_pfn = section_nr_to_pfn((unsigned long)scn_nr);
__remove_zone(zone, start_pfn);
sparse_remove_one_section(zone, ms, map_offset);
return 0;
}
/**
* __remove_pages() - remove sections of pages from a zone
* @zone: zone from which pages need to be removed
* @phys_start_pfn: starting pageframe (must be aligned to start of a section)
* @nr_pages: number of pages to remove (must be multiple of section size)
*
* Generic helper function to remove section mappings and sysfs entries
* for the section of the memory we are removing. Caller needs to make
* sure that pages are marked reserved and zones are adjust properly by
* calling offline_pages().
*/
int __remove_pages(struct zone *zone, unsigned long phys_start_pfn,
unsigned long nr_pages)
{
unsigned long i;
unsigned long map_offset = 0;
int sections_to_remove, ret = 0;
/* In the ZONE_DEVICE case device driver owns the memory region */
if (is_dev_zone(zone)) {
struct page *page = pfn_to_page(phys_start_pfn);
struct vmem_altmap *altmap;
altmap = to_vmem_altmap((unsigned long) page);
if (altmap)
map_offset = vmem_altmap_offset(altmap);
} else {
resource_size_t start, size;
start = phys_start_pfn << PAGE_SHIFT;
size = nr_pages * PAGE_SIZE;
ret = release_mem_region_adjustable(&iomem_resource, start,
size);
if (ret) {
resource_size_t endres = start + size - 1;
pr_warn("Unable to release resource <%pa-%pa> (%d)\n",
&start, &endres, ret);
}
}
clear_zone_contiguous(zone);
/*
* We can only remove entire sections
*/
BUG_ON(phys_start_pfn & ~PAGE_SECTION_MASK);
BUG_ON(nr_pages % PAGES_PER_SECTION);
sections_to_remove = nr_pages / PAGES_PER_SECTION;
for (i = 0; i < sections_to_remove; i++) {
unsigned long pfn = phys_start_pfn + i*PAGES_PER_SECTION;
ret = __remove_section(zone, __pfn_to_section(pfn), map_offset);
map_offset = 0;
if (ret)
break;
}
set_zone_contiguous(zone);
return ret;
}
#endif /* CONFIG_MEMORY_HOTREMOVE */
int set_online_page_callback(online_page_callback_t callback)
{
int rc = -EINVAL;
get_online_mems();
mutex_lock(&online_page_callback_lock);
if (online_page_callback == generic_online_page) {
online_page_callback = callback;
rc = 0;
}
mutex_unlock(&online_page_callback_lock);
put_online_mems();
return rc;
}
EXPORT_SYMBOL_GPL(set_online_page_callback);
int restore_online_page_callback(online_page_callback_t callback)
{
int rc = -EINVAL;
get_online_mems();
mutex_lock(&online_page_callback_lock);
if (online_page_callback == callback) {
online_page_callback = generic_online_page;
rc = 0;
}
mutex_unlock(&online_page_callback_lock);
put_online_mems();
return rc;
}
EXPORT_SYMBOL_GPL(restore_online_page_callback);
void __online_page_set_limits(struct page *page)
{
}
EXPORT_SYMBOL_GPL(__online_page_set_limits);
void __online_page_increment_counters(struct page *page)
{
adjust_managed_page_count(page, 1);
}
EXPORT_SYMBOL_GPL(__online_page_increment_counters);
void __online_page_free(struct page *page)
{
__free_reserved_page(page);
}
EXPORT_SYMBOL_GPL(__online_page_free);
static int generic_online_page(struct page *page)
{
__online_page_set_limits(page);
__online_page_increment_counters(page);
__online_page_free(page);
return 0;
}
static void __free_pages_hotplug(struct page *page, unsigned int order)
{
unsigned int nr_pages = 1 << order;
struct page *p = page;
unsigned int loop;
adjust_managed_page_count(page, nr_pages);
for (loop = 0; loop < nr_pages; loop++, p++) {
__online_page_set_limits(p);
ClearPageReserved(p);
set_page_count(p, 0);
}
set_page_refcounted(page);
__free_pages(page, order);
}
static void __free_pages_memory(unsigned long start,
unsigned long nr_pages, void *arg)
{
unsigned long order;
unsigned long onlined_pages = *(unsigned long *)arg;
struct page *page;
unsigned long i;
unsigned long phy_addr;
for (i = 0; i < nr_pages; i += (1UL << order)) {
order = min(MAX_ORDER - 1UL, __ffs(start + i));
page = pfn_to_page(start + i);
phy_addr = page_to_phys(page);
if (phy_addr >= bootloader_memory_limit)
break;
while ((1UL << order) > nr_pages - i ||
phy_addr + ((1UL << order) * PAGE_SIZE)
> bootloader_memory_limit)
order--;
__free_pages_hotplug(page, order);
onlined_pages += (1UL << order);
}
*(unsigned long *)arg = onlined_pages;
}
static int online_pages_range(unsigned long start_pfn, unsigned long nr_pages,
void *arg)
{
if (PageReserved(pfn_to_page(start_pfn)))
__free_pages_memory(start_pfn, nr_pages, arg);
online_mem_sections(start_pfn, start_pfn + nr_pages);
return 0;
}
/* check which state of node_states will be changed when online memory */
static void node_states_check_changes_online(unsigned long nr_pages,
struct zone *zone, struct memory_notify *arg)
{
int nid = zone_to_nid(zone);
enum zone_type zone_last = ZONE_NORMAL;
/*
* If we have HIGHMEM or movable node, node_states[N_NORMAL_MEMORY]
* contains nodes which have zones of 0...ZONE_NORMAL,
* set zone_last to ZONE_NORMAL.
*
* If we don't have HIGHMEM nor movable node,
* node_states[N_NORMAL_MEMORY] contains nodes which have zones of
* 0...ZONE_MOVABLE, set zone_last to ZONE_MOVABLE.
*/
if (N_MEMORY == N_NORMAL_MEMORY)
zone_last = ZONE_MOVABLE;
/*
* if the memory to be online is in a zone of 0...zone_last, and
* the zones of 0...zone_last don't have memory before online, we will
* need to set the node to node_states[N_NORMAL_MEMORY] after
* the memory is online.
*/
if (zone_idx(zone) <= zone_last && !node_state(nid, N_NORMAL_MEMORY))
arg->status_change_nid_normal = nid;
else
arg->status_change_nid_normal = -1;
#ifdef CONFIG_HIGHMEM
/*
* If we have movable node, node_states[N_HIGH_MEMORY]
* contains nodes which have zones of 0...ZONE_HIGHMEM,
* set zone_last to ZONE_HIGHMEM.
*
* If we don't have movable node, node_states[N_NORMAL_MEMORY]
* contains nodes which have zones of 0...ZONE_MOVABLE,
* set zone_last to ZONE_MOVABLE.
*/
zone_last = ZONE_HIGHMEM;
if (N_MEMORY == N_HIGH_MEMORY)
zone_last = ZONE_MOVABLE;
if (zone_idx(zone) <= zone_last && !node_state(nid, N_HIGH_MEMORY))
arg->status_change_nid_high = nid;
else
arg->status_change_nid_high = -1;
#else
arg->status_change_nid_high = arg->status_change_nid_normal;
#endif
/*
* if the node don't have memory befor online, we will need to
* set the node to node_states[N_MEMORY] after the memory
* is online.
*/
if (!node_state(nid, N_MEMORY))
arg->status_change_nid = nid;
else
arg->status_change_nid = -1;
}
static void node_states_set_node(int node, struct memory_notify *arg)
{
if (arg->status_change_nid_normal >= 0)
node_set_state(node, N_NORMAL_MEMORY);
if (arg->status_change_nid_high >= 0)
node_set_state(node, N_HIGH_MEMORY);
node_set_state(node, N_MEMORY);
}
static void __meminit resize_zone_range(struct zone *zone, unsigned long start_pfn,
unsigned long nr_pages)
{
unsigned long old_end_pfn = zone_end_pfn(zone);
if (zone_is_empty(zone) || start_pfn < zone->zone_start_pfn)
zone->zone_start_pfn = start_pfn;
zone->spanned_pages = max(start_pfn + nr_pages, old_end_pfn) - zone->zone_start_pfn;
}
static void __meminit resize_pgdat_range(struct pglist_data *pgdat, unsigned long start_pfn,
unsigned long nr_pages)
{
unsigned long old_end_pfn = pgdat_end_pfn(pgdat);
if (!pgdat->node_spanned_pages || start_pfn < pgdat->node_start_pfn)
pgdat->node_start_pfn = start_pfn;
pgdat->node_spanned_pages = max(start_pfn + nr_pages, old_end_pfn) - pgdat->node_start_pfn;
}
void __ref move_pfn_range_to_zone(struct zone *zone,
unsigned long start_pfn, unsigned long nr_pages)
{
struct pglist_data *pgdat = zone->zone_pgdat;
int nid = pgdat->node_id;
unsigned long flags;
if (zone_is_empty(zone))
init_currently_empty_zone(zone, start_pfn, nr_pages);
clear_zone_contiguous(zone);
/* TODO Huh pgdat is irqsave while zone is not. It used to be like that before */
pgdat_resize_lock(pgdat, &flags);
zone_span_writelock(zone);
resize_zone_range(zone, start_pfn, nr_pages);
zone_span_writeunlock(zone);
resize_pgdat_range(pgdat, start_pfn, nr_pages);
pgdat_resize_unlock(pgdat, &flags);
/*
* TODO now we have a visible range of pages which are not associated
* with their zone properly. Not nice but set_pfnblock_flags_mask
* expects the zone spans the pfn range. All the pages in the range
* are reserved so nobody should be touching them so we should be safe
*/
memmap_init_zone(nr_pages, nid, zone_idx(zone), start_pfn, MEMMAP_HOTPLUG);
set_zone_contiguous(zone);
}
/*
* Returns a default kernel memory zone for the given pfn range.
* If no kernel zone covers this pfn range it will automatically go
* to the ZONE_NORMAL.
*/
static struct zone *default_kernel_zone_for_pfn(int nid, unsigned long start_pfn,
unsigned long nr_pages)
{
struct pglist_data *pgdat = NODE_DATA(nid);
int zid;
for (zid = 0; zid <= ZONE_NORMAL; zid++) {
struct zone *zone = &pgdat->node_zones[zid];
if (zone_intersects(zone, start_pfn, nr_pages))
return zone;
}
return &pgdat->node_zones[ZONE_NORMAL];
}
static inline struct zone *default_zone_for_pfn(int nid, unsigned long start_pfn,
unsigned long nr_pages)
{
struct zone *kernel_zone = default_kernel_zone_for_pfn(nid, start_pfn,
nr_pages);
struct zone *movable_zone = &NODE_DATA(nid)->node_zones[ZONE_MOVABLE];
bool in_kernel = zone_intersects(kernel_zone, start_pfn, nr_pages);
bool in_movable = zone_intersects(movable_zone, start_pfn, nr_pages);
/*
* We inherit the existing zone in a simple case where zones do not
* overlap in the given range
*/
if (in_kernel ^ in_movable)
return (in_kernel) ? kernel_zone : movable_zone;
/*
* If the range doesn't belong to any zone or two zones overlap in the
* given range then we use movable zone only if movable_node is
* enabled because we always online to a kernel zone by default.
*/
return movable_node_enabled ? movable_zone : kernel_zone;
}
struct zone *zone_for_pfn_range(int online_type, int nid,
unsigned long start_pfn, unsigned long nr_pages)
{
if (online_type == MMOP_ONLINE_KERNEL)
return default_kernel_zone_for_pfn(nid, start_pfn, nr_pages);
if (online_type == MMOP_ONLINE_MOVABLE)
return &NODE_DATA(nid)->node_zones[ZONE_MOVABLE];
return default_zone_for_pfn(nid, start_pfn, nr_pages);
}
/*
* Associates the given pfn range with the given node and the zone appropriate
* for the given online type.
*/
static struct zone * __meminit move_pfn_range(int online_type, int nid,
unsigned long start_pfn, unsigned long nr_pages)
{
struct zone *zone;
zone = zone_for_pfn_range(online_type, nid, start_pfn, nr_pages);
move_pfn_range_to_zone(zone, start_pfn, nr_pages);
return zone;
}
/* Must be protected by mem_hotplug_begin() or a device_lock */
int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_type)
{
unsigned long flags;
unsigned long onlined_pages = 0;
struct zone *zone;
int need_zonelists_rebuild = 0;
int nid;
int ret;
struct memory_notify arg;
nid = pfn_to_nid(pfn);
/* associate pfn range with the zone */
zone = move_pfn_range(online_type, nid, pfn, nr_pages);
arg.start_pfn = pfn;
arg.nr_pages = nr_pages;
node_states_check_changes_online(nr_pages, zone, &arg);
ret = memory_notify(MEM_GOING_ONLINE, &arg);
ret = notifier_to_errno(ret);
if (ret)
goto failed_addition;
/*
* If this zone is not populated, then it is not in zonelist.
* This means the page allocator ignores this zone.
* So, zonelist must be updated after online.
*/
if (!populated_zone(zone)) {
need_zonelists_rebuild = 1;
setup_zone_pageset(zone);
}
ret = walk_system_ram_range(pfn, nr_pages, &onlined_pages,
online_pages_range);
if (ret) {
if (need_zonelists_rebuild)
zone_pcp_reset(zone);
goto failed_addition;
}
zone->present_pages += onlined_pages;
pgdat_resize_lock(zone->zone_pgdat, &flags);
zone->zone_pgdat->node_present_pages += onlined_pages;
pgdat_resize_unlock(zone->zone_pgdat, &flags);
shuffle_zone(zone);
if (onlined_pages) {
node_states_set_node(nid, &arg);
if (need_zonelists_rebuild)
build_all_zonelists(NULL);
else
zone_pcp_update(zone);
}
init_per_zone_wmark_min();
if (onlined_pages) {
kswapd_run(nid);
kcompactd_run(nid);
}
vm_total_pages = nr_free_pagecache_pages();
writeback_set_ratelimit();
if (onlined_pages)
memory_notify(MEM_ONLINE, &arg);
return 0;
failed_addition:
pr_debug("online_pages [mem %#010llx-%#010llx] failed\n",
(unsigned long long) pfn << PAGE_SHIFT,
(((unsigned long long) pfn + nr_pages) << PAGE_SHIFT) - 1);
memory_notify(MEM_CANCEL_ONLINE, &arg);
return ret;
}
#endif /* CONFIG_MEMORY_HOTPLUG_SPARSE */
static void reset_node_present_pages(pg_data_t *pgdat)
{
struct zone *z;
for (z = pgdat->node_zones; z < pgdat->node_zones + MAX_NR_ZONES; z++)
z->present_pages = 0;
pgdat->node_present_pages = 0;
}
/* we are OK calling __meminit stuff here - we have CONFIG_MEMORY_HOTPLUG */
static pg_data_t __ref *hotadd_new_pgdat(int nid, u64 start)
{
struct pglist_data *pgdat;
unsigned long zones_size[MAX_NR_ZONES] = {0};
unsigned long zholes_size[MAX_NR_ZONES] = {0};
unsigned long start_pfn = PFN_DOWN(start);
pgdat = NODE_DATA(nid);
if (!pgdat) {
pgdat = arch_alloc_nodedata(nid);
if (!pgdat)
return NULL;
arch_refresh_nodedata(nid, pgdat);
} else {
/*
* Reset the nr_zones, order and classzone_idx before reuse.
* Note that kswapd will init kswapd_classzone_idx properly
* when it starts in the near future.
*/
pgdat->nr_zones = 0;
pgdat->kswapd_order = 0;
pgdat->kswapd_classzone_idx = 0;
}
/* we can use NODE_DATA(nid) from here */
/* init node's zones as empty zones, we don't have any present pages.*/
free_area_init_node(nid, zones_size, start_pfn, zholes_size);
pgdat->per_cpu_nodestats = alloc_percpu(struct per_cpu_nodestat);
/*
* The node we allocated has no zone fallback lists. For avoiding
* to access not-initialized zonelist, build here.
*/
build_all_zonelists(pgdat);
/*
* zone->managed_pages is set to an approximate value in
* free_area_init_core(), which will cause
* /sys/device/system/node/nodeX/meminfo has wrong data.
* So reset it to 0 before any memory is onlined.
*/
reset_node_managed_pages(pgdat);
/*
* When memory is hot-added, all the memory is in offline state. So
* clear all zones' present_pages because they will be updated in
* online_pages() and offline_pages().
*/
reset_node_present_pages(pgdat);
return pgdat;
}
static void rollback_node_hotadd(int nid, pg_data_t *pgdat)
{
arch_refresh_nodedata(nid, NULL);
free_percpu(pgdat->per_cpu_nodestats);
arch_free_nodedata(pgdat);
return;
}
/**
* try_online_node - online a node if offlined
*
* called by cpu_up() to online a node without onlined memory.
*/
int try_online_node(int nid)
{
pg_data_t *pgdat;
int ret;
if (node_online(nid))
return 0;
mem_hotplug_begin();
pgdat = hotadd_new_pgdat(nid, 0);
if (!pgdat) {
pr_err("Cannot online node %d due to NULL pgdat\n", nid);
ret = -ENOMEM;
goto out;
}
node_set_online(nid);
ret = register_one_node(nid);
BUG_ON(ret);
out:
mem_hotplug_done();
return ret;
}
static int online_memory_one_block(struct memory_block *mem, void *arg)
{
bool *onlined_block = (bool *)arg;
int ret;
if (*onlined_block || !is_memblock_offlined(mem))
return 0;
ret = device_online(&mem->dev);
if (!ret)
*onlined_block = true;
return 0;
}
bool try_online_one_block(int nid)
{
struct zone *zone = &NODE_DATA(nid)->node_zones[ZONE_MOVABLE];
bool onlined_block = false;
int ret = lock_device_hotplug_sysfs();
if (ret)
return false;
walk_memory_range(zone->zone_start_pfn, zone_end_pfn(zone),
&onlined_block, online_memory_one_block);
unlock_device_hotplug();
return onlined_block;
}
static int check_hotplug_memory_range(u64 start, u64 size)
{
u64 start_pfn = PFN_DOWN(start);
u64 nr_pages = size >> PAGE_SHIFT;
/* Memory range must be aligned with section */
if ((start_pfn & ~PAGE_SECTION_MASK) ||
(nr_pages % PAGES_PER_SECTION) || (!nr_pages)) {
pr_err("Section-unaligned hotplug range: start 0x%llx, size 0x%llx\n",
(unsigned long long)start,
(unsigned long long)size);
return -EINVAL;
}
return 0;
}
static int online_memory_block(struct memory_block *mem, void *arg)
{
return device_online(&mem->dev);
}
/*
* NOTE: The caller must call lock_device_hotplug() to serialize hotplug
* and online/offline operations (triggered e.g. by sysfs).
*
* we are OK calling __meminit stuff here - we have CONFIG_MEMORY_HOTPLUG
*/
int __ref add_memory_resource(int nid, struct resource *res, bool online)
{
u64 start, size;
pg_data_t *pgdat = NULL;
bool new_pgdat;
bool new_node;
int ret;
start = res->start;
size = resource_size(res);
ret = check_hotplug_memory_range(start, size);
if (ret)
return ret;
{ /* Stupid hack to suppress address-never-null warning */
void *p = NODE_DATA(nid);
new_pgdat = !p;
}
mem_hotplug_begin();
/*
* Add new range to memblock so that when hotadd_new_pgdat() is called
* to allocate new pgdat, get_pfn_range_for_nid() will be able to find
* this new range and calculate total pages correctly. The range will
* be removed at hot-remove time.
*/
memblock_add_node(start, size, nid);
new_node = !node_online(nid);
if (new_node) {
pgdat = hotadd_new_pgdat(nid, start);
ret = -ENOMEM;
if (!pgdat)
goto error;
}
/* call arch's memory hotadd */
ret = arch_add_memory(nid, start, size, true);
if (ret < 0)
goto error;
/* we online node here. we can't roll back from here. */
node_set_online(nid);
if (new_node) {
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
ret = __register_one_node(nid);
if (ret)
goto register_fail;
/*
* link memory sections under this node. This is already
* done when creatig memory section in register_new_memory
* but that depends to have the node registered so offline
* nodes have to go through register_node.
* TODO clean up this mess.
*/
ret = link_mem_sections(nid, start_pfn, nr_pages);
register_fail:
/*
* If sysfs file of new node can't create, cpu on the node
* can't be hot-added. There is no rollback way now.
* So, check by BUG_ON() to catch it reluctantly..
*/
BUG_ON(ret);
}
/* create new memmap entry */
firmware_map_add_hotplug(start, start + size, "System RAM");
/* online pages if requested */
if (online)
walk_memory_range(PFN_DOWN(start), PFN_UP(start + size - 1),
NULL, online_memory_block);
goto out;
error:
/* rollback pgdat allocation and others */
if (new_pgdat && pgdat)
rollback_node_hotadd(nid, pgdat);
memblock_remove(start, size);
out:
mem_hotplug_done();
return ret;
}
/* requires device_hotplug_lock, see add_memory_resource() */
int __ref __add_memory(int nid, u64 start, u64 size)
{
struct resource *res;
int ret;
res = register_memory_resource(start, size);
if (IS_ERR(res))
return PTR_ERR(res);
ret = add_memory_resource(nid, res, memhp_auto_online);
if (ret < 0)
release_memory_resource(res);
return ret;
}
int add_memory(int nid, u64 start, u64 size)
{
int rc;
lock_device_hotplug();
rc = __add_memory(nid, start, size);
unlock_device_hotplug();
return rc;
}
EXPORT_SYMBOL_GPL(add_memory);
#ifdef CONFIG_MEMORY_HOTREMOVE
/*
* A free page on the buddy free lists (not the per-cpu lists) has PageBuddy
* set and the size of the free page is given by page_order(). Using this,
* the function determines if the pageblock contains only free pages.
* Due to buddy contraints, a free page at least the size of a pageblock will
* be located at the start of the pageblock
*/
static inline int pageblock_free(struct page *page)
{
return PageBuddy(page) && page_order(page) >= pageblock_order;
}
/* Return the start of the next active pageblock after a given page */
static struct page *next_active_pageblock(struct page *page)
{
/* Ensure the starting page is pageblock-aligned */
BUG_ON(page_to_pfn(page) & (pageblock_nr_pages - 1));
/* If the entire pageblock is free, move to the end of free page */
if (pageblock_free(page)) {
int order;
/* be careful. we don't have locks, page_order can be changed.*/
order = page_order(page);
if ((order < MAX_ORDER) && (order >= pageblock_order))
return page + (1 << order);
}
return page + pageblock_nr_pages;
}
/* Checks if this range of memory is likely to be hot-removable. */
bool is_mem_section_removable(unsigned long start_pfn, unsigned long nr_pages)
{
struct page *page = pfn_to_page(start_pfn);
unsigned long end_pfn = min(start_pfn + nr_pages, zone_end_pfn(page_zone(page)));
struct page *end_page = pfn_to_page(end_pfn);
/* Check the starting page of each pageblock within the range */
for (; page < end_page; page = next_active_pageblock(page)) {
if (!is_pageblock_removable_nolock(page))
return false;
cond_resched();
}
/* All pageblocks in the memory block are likely to be hot-removable */
return true;
}
/*
* Confirm all pages in a range [start, end) belong to the same zone.
* When true, return its valid [start, end).
*/
int test_pages_in_a_zone(unsigned long start_pfn, unsigned long end_pfn,
unsigned long *valid_start, unsigned long *valid_end)
{
unsigned long pfn, sec_end_pfn;
unsigned long start, end;
struct zone *zone = NULL;
struct page *page;
int i;
for (pfn = start_pfn, sec_end_pfn = SECTION_ALIGN_UP(start_pfn + 1);
pfn < end_pfn;
pfn = sec_end_pfn, sec_end_pfn += PAGES_PER_SECTION) {
/* Make sure the memory section is present first */
if (!present_section_nr(pfn_to_section_nr(pfn)))
continue;
for (; pfn < sec_end_pfn && pfn < end_pfn;
pfn += MAX_ORDER_NR_PAGES) {
i = 0;
/* This is just a CONFIG_HOLES_IN_ZONE check.*/
while ((i < MAX_ORDER_NR_PAGES) &&
!pfn_valid_within(pfn + i))
i++;
if (i == MAX_ORDER_NR_PAGES || pfn + i >= end_pfn)
continue;
/* Check if we got outside of the zone */
if (zone && !zone_spans_pfn(zone, pfn + i))
return 0;
page = pfn_to_page(pfn + i);
if (zone && page_zone(page) != zone)
return 0;
if (!zone)
start = pfn + i;
zone = page_zone(page);
end = pfn + MAX_ORDER_NR_PAGES;
}
}
if (zone) {
*valid_start = start;
*valid_end = min(end, end_pfn);
return 1;
} else {
return 0;
}
}
/*
* Scan pfn range [start,end) to find movable/migratable pages (LRU pages,
* non-lru movable pages and hugepages). We scan pfn because it's much
* easier than scanning over linked list. This function returns the pfn
* of the first found movable page if it's found, otherwise 0.
*/
static unsigned long scan_movable_pages(unsigned long start, unsigned long end)
{
unsigned long pfn;
struct page *page;
for (pfn = start; pfn < end; pfn++) {
if (pfn_valid(pfn)) {
page = pfn_to_page(pfn);
if (PageLRU(page))
return pfn;
if (__PageMovable(page))
return pfn;
if (PageHuge(page)) {
if (page_huge_active(page))
return pfn;
else
pfn = round_up(pfn + 1,
1 << compound_order(page)) - 1;
}
}
}
return 0;
}
static struct page *new_node_page(struct page *page, unsigned long private,
int **result)
{
int nid = page_to_nid(page);
nodemask_t nmask = node_states[N_MEMORY];
/*
* try to allocate from a different node but reuse this node if there
* are no other online nodes to be used (e.g. we are offlining a part
* of the only existing node)
*/
node_clear(nid, nmask);
if (nodes_empty(nmask))
node_set(nid, nmask);
return new_page_nodemask(page, nid, &nmask);
}
#define NR_OFFLINE_AT_ONCE_PAGES (256)
static int
do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
{
unsigned long pfn;
struct page *page;
int move_pages = NR_OFFLINE_AT_ONCE_PAGES;
int not_managed = 0;
int ret = 0;
LIST_HEAD(source);
for (pfn = start_pfn; pfn < end_pfn && move_pages > 0; pfn++) {
if (!pfn_valid(pfn))
continue;
page = pfn_to_page(pfn);
if (PageHuge(page)) {
struct page *head = compound_head(page);
pfn = page_to_pfn(head) + (1<<compound_order(head)) - 1;
if (compound_order(head) > PFN_SECTION_SHIFT) {
ret = -EBUSY;
break;
}
if (isolate_huge_page(page, &source))
move_pages -= 1 << compound_order(head);
continue;
} else if (thp_migration_supported() && PageTransHuge(page))
pfn = page_to_pfn(compound_head(page))
+ hpage_nr_pages(page) - 1;
/*
* HWPoison pages have elevated reference counts so the migration would
* fail on them. It also doesn't make any sense to migrate them in the
* first place. Still try to unmap such a page in case it is still mapped
* (e.g. current hwpoison implementation doesn't unmap KSM pages but keep
* the unmap as the catch all safety net).
*/
if (PageHWPoison(page)) {
if (WARN_ON(PageLRU(page)))
isolate_lru_page(page);
if (page_mapped(page))
try_to_unmap(page, TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS,
NULL);
continue;
}
if (!get_page_unless_zero(page))
continue;
/*
* We can skip free pages. And we can deal with pages on
* LRU and non-lru movable pages.
*/
if (PageLRU(page))
ret = isolate_lru_page(page);
else
ret = isolate_movable_page(page, ISOLATE_UNEVICTABLE);
if (!ret) { /* Success */
put_page(page);
list_add_tail(&page->lru, &source);
move_pages--;
if (!__PageMovable(page))
inc_node_page_state(page, NR_ISOLATED_ANON +
page_is_file_cache(page));
} else {
#ifdef CONFIG_DEBUG_VM
pr_alert("failed to isolate pfn %lx\n", pfn);
dump_page(page, "isolation failed");
#endif
put_page(page);
/* Because we don't have big zone->lock. we should
check this again here. */
if (page_count(page)) {
not_managed++;
ret = -EBUSY;
break;
}
}
}
if (!list_empty(&source)) {
if (not_managed) {
putback_movable_pages(&source);
goto out;
}
/* Allocate a new page from the nearest neighbor node */
ret = migrate_pages(&source, new_node_page, NULL, 0,
MIGRATE_SYNC, MR_MEMORY_HOTPLUG);
if (ret)
putback_movable_pages(&source);
}
out:
return ret;
}
/*
* remove from free_area[] and mark all as Reserved.
*/
static int
offline_isolated_pages_cb(unsigned long start, unsigned long nr_pages,
void *data)
{
__offline_isolated_pages(start, start + nr_pages);
return 0;
}
static void
offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
{
walk_system_ram_range(start_pfn, end_pfn - start_pfn, NULL,
offline_isolated_pages_cb);
}
/*
* Check all pages in range, recoreded as memory resource, are isolated.
*/
static int
check_pages_isolated_cb(unsigned long start_pfn, unsigned long nr_pages,
void *data)
{
int ret;
long offlined = *(long *)data;
ret = test_pages_isolated(start_pfn, start_pfn + nr_pages, true);
offlined = nr_pages;
if (!ret)
*(long *)data += offlined;
return ret;
}
static long
check_pages_isolated(unsigned long start_pfn, unsigned long end_pfn)
{
long offlined = 0;
int ret;
ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn, &offlined,
check_pages_isolated_cb);
if (ret < 0)
offlined = (long)ret;
return offlined;
}
static int __init cmdline_parse_movable_node(char *p)
{
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
movable_node_enabled = true;
#else
pr_warn("movable_node parameter depends on CONFIG_HAVE_MEMBLOCK_NODE_MAP to work properly\n");
#endif
return 0;
}
early_param("movable_node", cmdline_parse_movable_node);
/* check which state of node_states will be changed when offline memory */
static void node_states_check_changes_offline(unsigned long nr_pages,
struct zone *zone, struct memory_notify *arg)
{
struct pglist_data *pgdat = zone->zone_pgdat;
unsigned long present_pages = 0;
enum zone_type zt, zone_last = ZONE_NORMAL;
/*
* If we have HIGHMEM or movable node, node_states[N_NORMAL_MEMORY]
* contains nodes which have zones of 0...ZONE_NORMAL,
* set zone_last to ZONE_NORMAL.
*
* If we don't have HIGHMEM nor movable node,
* node_states[N_NORMAL_MEMORY] contains nodes which have zones of
* 0...ZONE_MOVABLE, set zone_last to ZONE_MOVABLE.
*/
if (N_MEMORY == N_NORMAL_MEMORY)
zone_last = ZONE_MOVABLE;
/*
* check whether node_states[N_NORMAL_MEMORY] will be changed.
* If the memory to be offline is in a zone of 0...zone_last,
* and it is the last present memory, 0...zone_last will
* become empty after offline , thus we can determind we will
* need to clear the node from node_states[N_NORMAL_MEMORY].
*/
for (zt = 0; zt <= zone_last; zt++)
present_pages += pgdat->node_zones[zt].present_pages;
if (zone_idx(zone) <= zone_last && nr_pages >= present_pages)
arg->status_change_nid_normal = zone_to_nid(zone);
else
arg->status_change_nid_normal = -1;
#ifdef CONFIG_HIGHMEM
/*
* If we have movable node, node_states[N_HIGH_MEMORY]
* contains nodes which have zones of 0...ZONE_HIGHMEM,
* set zone_last to ZONE_HIGHMEM.
*
* If we don't have movable node, node_states[N_NORMAL_MEMORY]
* contains nodes which have zones of 0...ZONE_MOVABLE,
* set zone_last to ZONE_MOVABLE.
*/
zone_last = ZONE_HIGHMEM;
if (N_MEMORY == N_HIGH_MEMORY)
zone_last = ZONE_MOVABLE;
for (; zt <= zone_last; zt++)
present_pages += pgdat->node_zones[zt].present_pages;
if (zone_idx(zone) <= zone_last && nr_pages >= present_pages)
arg->status_change_nid_high = zone_to_nid(zone);
else
arg->status_change_nid_high = -1;
#else
arg->status_change_nid_high = arg->status_change_nid_normal;
#endif
/*
* node_states[N_HIGH_MEMORY] contains nodes which have 0...ZONE_MOVABLE
*/
zone_last = ZONE_MOVABLE;
/*
* check whether node_states[N_HIGH_MEMORY] will be changed
* If we try to offline the last present @nr_pages from the node,
* we can determind we will need to clear the node from
* node_states[N_HIGH_MEMORY].
*/
for (; zt <= zone_last; zt++)
present_pages += pgdat->node_zones[zt].present_pages;
if (nr_pages >= present_pages)
arg->status_change_nid = zone_to_nid(zone);
else
arg->status_change_nid = -1;
}
static void node_states_clear_node(int node, struct memory_notify *arg)
{
if (arg->status_change_nid_normal >= 0)
node_clear_state(node, N_NORMAL_MEMORY);
if ((N_MEMORY != N_NORMAL_MEMORY) &&
(arg->status_change_nid_high >= 0))
node_clear_state(node, N_HIGH_MEMORY);
if ((N_MEMORY != N_HIGH_MEMORY) &&
(arg->status_change_nid >= 0))
node_clear_state(node, N_MEMORY);
}
static int __ref __offline_pages(unsigned long start_pfn,
unsigned long end_pfn, unsigned long timeout)
{
unsigned long pfn, nr_pages, expire;
long offlined_pages;
int ret, drain, retry_max, node;
unsigned long flags;
unsigned long valid_start, valid_end;
struct zone *zone;
struct memory_notify arg;
/* at least, alignment against pageblock is necessary */
if (!IS_ALIGNED(start_pfn, pageblock_nr_pages))
return -EINVAL;
if (!IS_ALIGNED(end_pfn, pageblock_nr_pages))
return -EINVAL;
/* This makes hotplug much easier...and readable.
we assume this for now. .*/
if (!test_pages_in_a_zone(start_pfn, end_pfn, &valid_start, &valid_end))
return -EINVAL;
zone = page_zone(pfn_to_page(valid_start));
node = zone_to_nid(zone);
nr_pages = end_pfn - start_pfn;
/* set above range as isolated */
ret = start_isolate_page_range(start_pfn, end_pfn,
MIGRATE_MOVABLE, true);
if (ret)
return ret;
arg.start_pfn = start_pfn;
arg.nr_pages = nr_pages;
node_states_check_changes_offline(nr_pages, zone, &arg);
ret = memory_notify(MEM_GOING_OFFLINE, &arg);
ret = notifier_to_errno(ret);
if (ret)
goto failed_removal;
pfn = start_pfn;
expire = jiffies + timeout;
drain = 0;
retry_max = 5;
repeat:
/* start memory hot removal */
ret = -EAGAIN;
if (time_after(jiffies, expire))
goto failed_removal;
ret = -EINTR;
if (signal_pending(current))
goto failed_removal;
ret = 0;
if (drain) {
lru_add_drain_all_cpuslocked();
cond_resched();
drain_all_pages(zone);
}
pfn = scan_movable_pages(start_pfn, end_pfn);
if (pfn) { /* We have movable pages */
ret = do_migrate_range(pfn, end_pfn);
if (!ret) {
drain = 1;
goto repeat;
} else {
if (ret < 0)
if (--retry_max == 0)
goto failed_removal;
yield();
drain = 1;
goto repeat;
}
}
/* drain all zone's lru pagevec, this is asynchronous... */
lru_add_drain_all_cpuslocked();
yield();
/* drain pcp pages, this is synchronous. */
drain_all_pages(zone);
/*
* dissolve free hugepages in the memory block before doing offlining
* actually in order to make hugetlbfs's object counting consistent.
*/
ret = dissolve_free_huge_pages(start_pfn, end_pfn);
if (ret)
goto failed_removal;
/* check again */
offlined_pages = check_pages_isolated(start_pfn, end_pfn);
if (offlined_pages < 0) {
ret = -EBUSY;
goto failed_removal;
}
pr_info("Offlined Pages %ld\n", offlined_pages);
/* Ok, all of our target is isolated.
We cannot do rollback at this point. */
offline_isolated_pages(start_pfn, end_pfn);
/* reset pagetype flags and makes migrate type to be MOVABLE */
undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
/* removal success */
adjust_managed_page_count(pfn_to_page(start_pfn), -offlined_pages);
zone->present_pages -= offlined_pages;
pgdat_resize_lock(zone->zone_pgdat, &flags);
zone->zone_pgdat->node_present_pages -= offlined_pages;
pgdat_resize_unlock(zone->zone_pgdat, &flags);
init_per_zone_wmark_min();
if (!populated_zone(zone)) {
zone_pcp_reset(zone);
build_all_zonelists(NULL);
} else
zone_pcp_update(zone);
node_states_clear_node(node, &arg);
if (arg.status_change_nid >= 0) {
kswapd_stop(node);
kcompactd_stop(node);
}
vm_total_pages = nr_free_pagecache_pages();
writeback_set_ratelimit();
memory_notify(MEM_OFFLINE, &arg);
return 0;
failed_removal:
pr_debug("memory offlining [mem %#010llx-%#010llx] failed\n",
(unsigned long long) start_pfn << PAGE_SHIFT,
((unsigned long long) end_pfn << PAGE_SHIFT) - 1);
memory_notify(MEM_CANCEL_OFFLINE, &arg);
/* pushback to free area */
undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
return ret;
}
/* Must be protected by mem_hotplug_begin() or a device_lock */
int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
{
return __offline_pages(start_pfn, start_pfn + nr_pages,
MIGRATE_TIMEOUT_SEC * HZ);
}
#endif /* CONFIG_MEMORY_HOTREMOVE */
/**
* walk_memory_range - walks through all mem sections in [start_pfn, end_pfn)
* @start_pfn: start pfn of the memory range
* @end_pfn: end pfn of the memory range
* @arg: argument passed to func
* @func: callback for each memory section walked
*
* This function walks through all present mem sections in range
* [start_pfn, end_pfn) and call func on each mem section.
*
* Returns the return value of func.
*/
int walk_memory_range(unsigned long start_pfn, unsigned long end_pfn,
void *arg, int (*func)(struct memory_block *, void *))
{
struct memory_block *mem = NULL;
struct mem_section *section;
unsigned long pfn, section_nr;
int ret;
for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
section_nr = pfn_to_section_nr(pfn);
if (!present_section_nr(section_nr))
continue;
section = __nr_to_section(section_nr);
/* same memblock? */
if (mem)
if ((section_nr >= mem->start_section_nr) &&
(section_nr <= mem->end_section_nr))
continue;
mem = find_memory_block_hinted(section, mem);
if (!mem)
continue;
ret = func(mem, arg);
if (ret) {
kobject_put(&mem->dev.kobj);
return ret;
}
}
if (mem)
kobject_put(&mem->dev.kobj);
return 0;
}
#ifdef CONFIG_MEMORY_HOTREMOVE
static int check_memblock_offlined_cb(struct memory_block *mem, void *arg)
{
int ret = !is_memblock_offlined(mem);
if (unlikely(ret)) {
phys_addr_t beginpa, endpa;
beginpa = PFN_PHYS(section_nr_to_pfn(mem->start_section_nr));
endpa = PFN_PHYS(section_nr_to_pfn(mem->end_section_nr + 1))-1;
pr_warn("removing memory fails, because memory [%pa-%pa] is onlined\n",
&beginpa, &endpa);
}
return ret;
}
static int check_cpu_on_node(pg_data_t *pgdat)
{
int cpu;
for_each_present_cpu(cpu) {
if (cpu_to_node(cpu) == pgdat->node_id)
/*
* the cpu on this node isn't removed, and we can't
* offline this node.
*/
return -EBUSY;
}
return 0;
}
static void unmap_cpu_on_node(pg_data_t *pgdat)
{
#ifdef CONFIG_ACPI_NUMA
int cpu;
for_each_possible_cpu(cpu)
if (cpu_to_node(cpu) == pgdat->node_id)
numa_clear_node(cpu);
#endif
}
static int check_and_unmap_cpu_on_node(pg_data_t *pgdat)
{
int ret;
ret = check_cpu_on_node(pgdat);
if (ret)
return ret;
/*
* the node will be offlined when we come here, so we can clear
* the cpu_to_node() now.
*/
unmap_cpu_on_node(pgdat);
return 0;
}
/**
* try_offline_node
*
* Offline a node if all memory sections and cpus of the node are removed.
*
* NOTE: The caller must call lock_device_hotplug() to serialize hotplug
* and online/offline operations before this call.
*/
void try_offline_node(int nid)
{
pg_data_t *pgdat = NODE_DATA(nid);
unsigned long start_pfn = pgdat->node_start_pfn;
unsigned long end_pfn = start_pfn + pgdat->node_spanned_pages;
unsigned long pfn;
for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
unsigned long section_nr = pfn_to_section_nr(pfn);
if (!present_section_nr(section_nr))
continue;
if (pfn_to_nid(pfn) != nid)
continue;
/*
* some memory sections of this node are not removed, and we
* can't offline node now.
*/
return;
}
if (check_and_unmap_cpu_on_node(pgdat))
return;
/*
* all memory/cpu of this node are removed, we can offline this
* node now.
*/
node_set_offline(nid);
unregister_one_node(nid);
}
EXPORT_SYMBOL(try_offline_node);
/**
* remove_memory
*
* NOTE: The caller must call lock_device_hotplug() to serialize hotplug
* and online/offline operations before this call, as required by
* try_offline_node().
*/
void __ref remove_memory(int nid, u64 start, u64 size)
{
int ret;
BUG_ON(check_hotplug_memory_range(start, size));
mem_hotplug_begin();
/*
* All memory blocks must be offlined before removing memory. Check
* whether all memory blocks in question are offline and trigger a BUG()
* if this is not the case.
*/
ret = walk_memory_range(PFN_DOWN(start), PFN_UP(start + size - 1), NULL,
check_memblock_offlined_cb);
if (ret)
BUG();
/* remove memmap entry */
firmware_map_remove(start, start + size, "System RAM");
memblock_free(start, size);
memblock_remove(start, size);
arch_remove_memory(start, size);
try_offline_node(nid);
mem_hotplug_done();
}
EXPORT_SYMBOL_GPL(remove_memory);
#endif /* CONFIG_MEMORY_HOTREMOVE */