Commit Graph

870228 Commits

Author SHA1 Message Date
Zi Yan
40f9962e23 BACKPORT: mm/compaction: stop isolation if too many pages are isolated and we have pages to migrate
In isolate_migratepages_block, if we have too many isolated pages and
nr_migratepages is not zero, we should try to migrate what we have
without wasting time on isolating.

In theory it's possible that multiple parallel compactions will cause
too_many_isolated() to become true even if each has isolated less than
COMPACT_CLUSTER_MAX, and loop forever in the while loop.  Bailing
immediately prevents that.

[vbabka@suse.cz: changelog addition]

Fixes: 1da2f328fa64 (“mm,thp,compaction,cma: allow THP migration for CMA allocations”)
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: <stable@vger.kernel.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Yang Shi <shy828301@gmail.com>
Link: https://lkml.kernel.org/r/20201030183809.3616803-2-zi.yan@sent.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: I748deb8121f802388337cd19d06af0f569f59362
2022-11-12 11:22:25 +00:00
Zi Yan
50ede4f71b BACKPORT: mm/compaction: count pages and stop correctly during page isolation
In isolate_migratepages_block, when cc->alloc_contig is true, we are
able to isolate compound pages.  But nr_migratepages and nr_isolated did
not count compound pages correctly, causing us to isolate more pages
than we thought.

So count compound pages as the number of base pages they contain.
Otherwise, we might be trapped in too_many_isolated while loop, since
the actual isolated pages can go up to COMPACT_CLUSTER_MAX*512=16384,
where COMPACT_CLUSTER_MAX is 32, since we stop isolation after
cc->nr_migratepages reaches to COMPACT_CLUSTER_MAX.

In addition, after we fix the issue above, cc->nr_migratepages could
never be equal to COMPACT_CLUSTER_MAX if compound pages are isolated,
thus page isolation could not stop as we intended.  Change the isolation
stop condition to '>='.

The issue can be triggered as follows:

In a system with 16GB memory and an 8GB CMA region reserved by
hugetlb_cma, if we first allocate 10GB THPs and mlock them (so some THPs
are allocated in the CMA region and mlocked), reserving 6 1GB hugetlb
pages via /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages will
get stuck (looping in too_many_isolated function) until we kill either
task.  With the patch applied, oom will kill the application with 10GB
THPs and let hugetlb page reservation finish.

[ziy@nvidia.com: v3]
[UtsavBalar1231: mm,compaction: Fix compilation on 4.19]

Link: https://lkml.kernel.org/r/20201030183809.3616803-1-zi.yan@sent.com
Fixes: 1da2f328fa64 ("cmm,thp,compaction,cma: allow THP migration for CMA allocations")
Signed-off-by: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Rik van Riel <riel@surriel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20201029200435.3386066-1-zi.yan@sent.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: I1b7acd38e68fcf2f17bce7db23bcc0ef5aa09d05
2022-11-12 11:22:24 +00:00
Mateusz Nosek
515fb161e6 BACKPORT: mm/compaction.c: micro-optimization remove unnecessary branch
The same code can work both for 'zone->compact_considered > defer_limit'
and 'zone->compact_considered >= defer_limit'.  In the latter there is one
branch less which is more effective considering performance.

Signed-off-by: Mateusz Nosek <mateusznosek0@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Link: https://lkml.kernel.org/r/20200913190448.28649-1-mateusznosek0@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: I8958caaf1223d36d16aa52c341ab07ee323b5f40
2022-11-12 11:22:24 +00:00
Randy Dunlap
4bb8143326 BACKPORT: mm/compaction.c: delete duplicated word
Drop the repeated word "a".

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Link: http://lkml.kernel.org/r/20200801173822.14973-2-rdunlap@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: I1f190adce016c6c5cb1866338d41e30a36cf184a
2022-11-12 11:22:24 +00:00
Alex Shi
fe8430fbed BACKPORT: mm/compaction: correct the comments of compact_defer_shift
There is no compact_defer_limit. It should be compact_defer_shift in
use. and add compact_order_failed explanation.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Link: http://lkml.kernel.org/r/3bd60e1b-a74e-050d-ade4-6e8f54e00b92@linux.alibaba.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: Iff6c5094c6966d0323c40165819e5732f10beb62
2022-11-12 11:22:24 +00:00
Nitin Gupta
dd33c02e78 BACKPORT: mm: use unsigned types for fragmentation score
Proactive compaction uses per-node/zone "fragmentation score" which is
always in range [0, 100], so use unsigned type of these scores as well as
for related constants.

Signed-off-by: Nitin Gupta <nigupta@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Link: http://lkml.kernel.org/r/20200618010319.13159-1-nigupta@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: Icbbba74fd60245c549d000d8ffad8d6b5ecd20fd
2022-11-12 11:22:24 +00:00
Nitin Gupta
4686653495 BACKPORT: mm: fix compile error due to COMPACTION_HPAGE_ORDER
Fix compile error when COMPACTION_HPAGE_ORDER is assigned to
HUGETLB_PAGE_ORDER.  The correct way to check if this constant is defined
is to check for CONFIG_HUGETLBFS.

Reported-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Nitin Gupta <nigupta@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Nathan Chancellor <natechancellor@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20200623064544.25766-1-nigupta@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: Ibe116429a3ec96e58cac23e30e130bbe433aeac0
2022-11-12 11:22:24 +00:00
Nitin Gupta
3d84a10b4a BACKPORT: mm: proactive compaction
For some applications, we need to allocate almost all memory as hugepages.
However, on a running system, higher-order allocations can fail if the
memory is fragmented.  Linux kernel currently does on-demand compaction as
we request more hugepages, but this style of compaction incurs very high
latency.  Experiments with one-time full memory compaction (followed by
hugepage allocations) show that kernel is able to restore a highly
fragmented memory state to a fairly compacted memory state within <1 sec
for a 32G system.  Such data suggests that a more proactive compaction can
help us allocate a large fraction of memory as hugepages keeping
allocation latencies low.

For a more proactive compaction, the approach taken here is to define a
new sysctl called 'vm.compaction_proactiveness' which dictates bounds for
external fragmentation which kcompactd tries to maintain.

The tunable takes a value in range [0, 100], with a default of 20.

Note that a previous version of this patch [1] was found to introduce too
many tunables (per-order extfrag{low, high}), but this one reduces them to
just one sysctl.  Also, the new tunable is an opaque value instead of
asking for specific bounds of "external fragmentation", which would have
been difficult to estimate.  The internal interpretation of this opaque
value allows for future fine-tuning.

Currently, we use a simple translation from this tunable to [low, high]
"fragmentation score" thresholds (low=100-proactiveness, high=low+10%).
The score for a node is defined as weighted mean of per-zone external
fragmentation.  A zone's present_pages determines its weight.

To periodically check per-node score, we reuse per-node kcompactd threads,
which are woken up every 500 milliseconds to check the same.  If a node's
score exceeds its high threshold (as derived from user-provided
proactiveness value), proactive compaction is started until its score
reaches its low threshold value.  By default, proactiveness is set to 20,
which implies threshold values of low=80 and high=90.

This patch is largely based on ideas from Michal Hocko [2].  See also the
LWN article [3].

Performance data
================

System: x64_64, 1T RAM, 80 CPU threads.
Kernel: 5.6.0-rc3 + this patch

echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag

Before starting the driver, the system was fragmented from a userspace
program that allocates all memory and then for each 2M aligned section,
frees 3/4 of base pages using munmap.  The workload is mainly anonymous
userspace pages, which are easy to move around.  I intentionally avoided
unmovable pages in this test to see how much latency we incur when
hugepage allocations hit direct compaction.

1. Kernel hugepage allocation latencies

With the system in such a fragmented state, a kernel driver then allocates
as many hugepages as possible and measures allocation latency:

(all latency values are in microseconds)

- With vanilla 5.6.0-rc3

  percentile latency
  –––––––––– –––––––
	   5    7894
	  10    9496
	  25   12561
	  30   15295
	  40   18244
	  50   21229
	  60   27556
	  75   30147
	  80   31047
	  90   32859
	  95   33799

Total 2M hugepages allocated = 383859 (749G worth of hugepages out of 762G
total free => 98% of free memory could be allocated as hugepages)

- With 5.6.0-rc3 + this patch, with proactiveness=20

sysctl -w vm.compaction_proactiveness=20

  percentile latency
  –––––––––– –––––––
	   5       2
	  10       2
	  25       3
	  30       3
	  40       3
	  50       4
	  60       4
	  75       4
	  80       4
	  90       5
	  95     429

Total 2M hugepages allocated = 384105 (750G worth of hugepages out of 762G
total free => 98% of free memory could be allocated as hugepages)

2. JAVA heap allocation

In this test, we first fragment memory using the same method as for (1).

Then, we start a Java process with a heap size set to 700G and request the
heap to be allocated with THP hugepages.  We also set THP to madvise to
allow hugepage backing of this heap.

/usr/bin/time
 java -Xms700G -Xmx700G -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch

The above command allocates 700G of Java heap using hugepages.

- With vanilla 5.6.0-rc3

17.39user 1666.48system 27:37.89elapsed

- With 5.6.0-rc3 + this patch, with proactiveness=20

8.35user 194.58system 3:19.62elapsed

Elapsed time remains around 3:15, as proactiveness is further increased.

Note that proactive compaction happens throughout the runtime of these
workloads.  The situation of one-time compaction, sufficient to supply
hugepages for following allocation stream, can probably happen for more
extreme proactiveness values, like 80 or 90.

In the above Java workload, proactiveness is set to 20.  The test starts
with a node's score of 80 or higher, depending on the delay between the
fragmentation step and starting the benchmark, which gives more-or-less
time for the initial round of compaction.  As t he benchmark consumes
hugepages, node's score quickly rises above the high threshold (90) and
proactive compaction starts again, which brings down the score to the low
threshold level (80).  Repeat.

bpftrace also confirms proactive compaction running 20+ times during the
runtime of this Java benchmark.  kcompactd threads consume 100% of one of
the CPUs while it tries to bring a node's score within thresholds.

Backoff behavior
================

Above workloads produce a memory state which is easy to compact.  However,
if memory is filled with unmovable pages, proactive compaction should
essentially back off.  To test this aspect:

- Created a kernel driver that allocates almost all memory as hugepages
  followed by freeing first 3/4 of each hugepage.
- Set proactiveness=40
- Note that proactive_compact_node() is deferred maximum number of times
  with HPAGE_FRAG_CHECK_INTERVAL_MSEC of wait between each check
  (=> ~30 seconds between retries).

[1] https://patchwork.kernel.org/patch/11098289/
[2] https://lore.kernel.org/linux-mm/20161230131412.GI13301@dhcp22.suse.cz/
[3] https://lwn.net/Articles/817905/

Signed-off-by: Nitin Gupta <nigupta@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Oleksandr Natalenko <oleksandr@redhat.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Reviewed-by: Oleksandr Natalenko <oleksandr@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Nitin Gupta <ngupta@nitingupta.dev>
Cc: Oleksandr Natalenko <oleksandr@redhat.com>
Link: http://lkml.kernel.org/r/20200616204527.19185-1-nigupta@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: I69571a7af05f33174eeb24e9d4b4d0dc640983c4
2022-11-12 11:22:23 +00:00
Wonhyuk Yang
70f7432702 BACKPORT: mm/compaction: fix misbehaviors of fast_find_migrateblock()
[ Upstream commit 15d28d0d11609c7a4f217b3d85e26456d9beb134 ]

In the fast_find_migrateblock(), it iterates ocer the freelist to find the
proper pageblock.  But there are some misbehaviors.

First, if the page we found is equal to cc->migrate_pfn, it is considered
that we didn't find a suitable pageblock.  Secondly, if the loop was
terminated because order is less than PAGE_ALLOC_COSTLY_ORDER, it could be
considered that we found a suitable one.  Thirdly, if the skip bit is set
on the page block and we goto continue, it doesn't check nr_scanned.
Fourthly, if the page block's skip bit is set, it checks that page block
is the last of list, which is unnecessary.

Link: https://lkml.kernel.org/r/20210128130411.6125-1-vvghjk1234@gmail.com
Fixes: 70b44595eafe9 ("mm, compaction: use free lists to quickly locate a migration source")
Signed-off-by: Wonhyuk Yang <vvghjk1234@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: I430cb1768260351895d65faccd6e4f51fb0a6019
2022-11-12 11:22:23 +00:00
Rokudo Yan
407f232842 BACKPORT: mm, compaction: move high_pfn to the for loop scope
commit 74e21484e40bb8ce0f9828bbfe1c9fc9b04249c6 upstream.

In fast_isolate_freepages, high_pfn will be used if a prefered one (ie
PFN >= low_fn) not found.

But the high_pfn is not reset before searching an free area, so when it
was used as freepage, it may from another free area searched before.  As
a result move_freelist_head(freelist, freepage) will have unexpected
behavior (eg corrupt the MOVABLE freelist)

  Unable to handle kernel paging request at virtual address dead000000000200
  Mem abort info:
    ESR = 0x96000044
    Exception class = DABT (current EL), IL = 32 bits
    SET = 0, FnV = 0
    EA = 0, S1PTW = 0
  Data abort info:
    ISV = 0, ISS = 0x00000044
    CM = 0, WnR = 1
  [dead000000000200] address between user and kernel address ranges

  -000|list_cut_before(inline)
  -000|move_freelist_head(inline)
  -000|fast_isolate_freepages(inline)
  -000|isolate_freepages(inline)
  -000|compaction_alloc(?, ?)
  -001|unmap_and_move(inline)
  -001|migrate_pages([NSD:0xFFFFFF80088CBBD0] from = 0xFFFFFF80088CBD88, [NSD:0xFFFFFF80088CBBC8] get_new_p
  -002|__read_once_size(inline)
  -002|static_key_count(inline)
  -002|static_key_false(inline)
  -002|trace_mm_compaction_migratepages(inline)
  -002|compact_zone(?, [NSD:0xFFFFFF80088CBCB0] capc = 0x0)
  -003|kcompactd_do_work(inline)
  -003|kcompactd([X19] p = 0xFFFFFF93227FBC40)
  -004|kthread([X20] _create = 0xFFFFFFE1AFB26380)
  -005|ret_from_fork(asm)

The issue was reported on an smart phone product with 6GB ram and 3GB
zram as swap device.

This patch fixes the issue by reset high_pfn before searching each free
area, which ensure freepage and freelist match when call
move_freelist_head in fast_isolate_freepages().

Link: http://lkml.kernel.org/r/20190118175136.31341-12-mgorman@techsingularity.net
Link: https://lkml.kernel.org/r/20210112094720.1238444-1-wu-yan@tcl.com
Fixes: 5a811889de10f1eb ("mm, compaction: use free lists to quickly locate a migration target")
Signed-off-by: Rokudo Yan <wu-yan@tcl.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: Id970e5a5eb496babe170a9740652c49ac7726b0a
2022-11-12 11:22:23 +00:00
Vlastimil Babka
7e934ecada BACKPORT: mm, compaction: make capture control handling safe wrt interrupts
commit b9e20f0da1f5c9c68689450a8cb436c9486434c8 upstream.

Hugh reports:

 "While stressing compaction, one run oopsed on NULL capc->cc in
  __free_one_page()'s task_capc(zone): compact_zone_order() had been
  interrupted, and a page was being freed in the return from interrupt.

  Though you would not expect it from the source, both gccs I was using
  (4.8.1 and 7.5.0) had chosen to compile compact_zone_order() with the
  ".cc = &cc" implemented by mov %rbx,-0xb0(%rbp) immediately before
  callq compact_zone - long after the "current->capture_control =
  &capc". An interrupt in between those finds capc->cc NULL (zeroed by
  an earlier rep stos).

  This could presumably be fixed by a barrier() before setting
  current->capture_control in compact_zone_order(); but would also need
  more care on return from compact_zone(), in order not to risk leaking
  a page captured by interrupt just before capture_control is reset.

  Maybe that is the preferable fix, but I felt safer for task_capc() to
  exclude the rather surprising possibility of capture at interrupt
  time"

I have checked that gcc10 also behaves the same.

The advantage of fix in compact_zone_order() is that we don't add
another test in the page freeing hot path, and that it might prevent
future problems if we stop exposing pointers to uninitialized structures
in current task.

So this patch implements the suggestion for compact_zone_order() with
barrier() (and WRITE_ONCE() to prevent store tearing) for setting
current->capture_control, and prevents page leaking with
WRITE_ONCE/READ_ONCE in the proper order.

Link: http://lkml.kernel.org/r/20200616082649.27173-1-vbabka@suse.cz
Fixes: 5e1f0f098b46 ("mm, compaction: capture a page under direct compaction")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Hugh Dickins <hughd@google.com>
Suggested-by: Hugh Dickins <hughd@google.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Li Wang <liwang@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: <stable@vger.kernel.org>	[5.1+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: I9f2c405e8673ed4e7148b2399c2c6e140b238960
2022-11-12 11:22:23 +00:00
Vlastimil Babka
4ebd766e37 BACKPORT: mm, compaction: fully assume capture is not NULL in compact_zone_order()
commit 6467552ca64c4ddd2b83ed73192107d7145f533b upstream.

Dan reports:

The patch 5e1f0f098b46: "mm, compaction: capture a page under direct
compaction" from Mar 5, 2019, leads to the following Smatch complaint:

    mm/compaction.c:2321 compact_zone_order()
     error: we previously assumed 'capture' could be null (see line 2313)

mm/compaction.c
  2288  static enum compact_result compact_zone_order(struct zone *zone, int order,
  2289                  gfp_t gfp_mask, enum compact_priority prio,
  2290                  unsigned int alloc_flags, int classzone_idx,
  2291                  struct page **capture)
                                      ^^^^^^^

  2313		if (capture)
                    ^^^^^^^
Check for NULL

  2314			current->capture_control = &capc;
  2315
  2316		ret = compact_zone(&cc, &capc);
  2317
  2318		VM_BUG_ON(!list_empty(&cc.freepages));
  2319		VM_BUG_ON(!list_empty(&cc.migratepages));
  2320
  2321		*capture = capc.page;
                ^^^^^^^^
Unchecked dereference.

  2322		current->capture_control = NULL;
  2323

In practice this is not an issue, as the only caller path passes non-NULL
capture:

__alloc_pages_direct_compact()
  struct page *page = NULL;
  try_to_compact_pages(capture = &page);
    compact_zone_order(capture = capture);

So let's remove the unnecessary check, which should also make Smatch happy.

Fixes: 5e1f0f098b46 ("mm, compaction: capture a page under direct compaction")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Link: http://lkml.kernel.org/r/18b0df3c-0589-d96c-23fa-040798fee187@suse.cz
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: I799c005429a5fb50bf7cd29e53bd6495ff405279
2022-11-12 11:22:23 +00:00
Vlastimil Babka
2cf518a50c BACKPORT: mm, compaction: fix wrong pfn handling in __reset_isolation_pfn()
Florian and Dave reported [1] a NULL pointer dereference in
__reset_isolation_pfn().  While the exact cause is unclear, staring at
the code revealed two bugs, which might be related.

One bug is that if zone starts in the middle of pageblock, block_page
might correspond to different pfn than block_pfn, and then the
pfn_valid_within() checks will check different pfn's than those accessed
via struct page.  This might result in acessing an unitialized page in
CONFIG_HOLES_IN_ZONE configs.

The other bug is that end_page refers to the first page of next
pageblock and not last page of current pageblock.  The online and valid
check is then wrong and with sections, the while (page < end_page) loop
might wander off actual struct page arrays.

[1] https://lore.kernel.org/linux-xfs/87o8z1fvqu.fsf@mid.deneb.enyo.de/

Link: http://lkml.kernel.org/r/20191008152915.24704-1-vbabka@suse.cz
Fixes: 6b0868c820ff ("mm/compaction.c: correct zone boundary handling when resetting pageblock skip hints")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Florian Weimer <fw@deneb.enyo.de>
Reported-by: Dave Chinner <david@fromorbit.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: I22f75ef159c47fb52e93a971130c99c03f7d636b
2022-11-12 11:22:22 +00:00
Pengfei Li
61847fc3c2 BACKPORT: mm/compaction.c: remove unnecessary zone parameter in isolate_migratepages()
Like commit 40cacbcb3240 ("mm, compaction: remove unnecessary zone
parameter in some instances"), remove unnecessary zone parameter.

No functional change.

Link: http://lkml.kernel.org/r/20190806151616.21107-1-lpf.vector@gmail.com
Signed-off-by: Pengfei Li <lpf.vector@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Qian Cai <cai@lca.pw>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: Ib45836180760685a4db1cfebaacc935b93aef782
2022-11-12 11:22:22 +00:00
Yafang Shao
c9a27b3ad1 BACKPORT: mm/compaction.c: clear total_{migrate,free}_scanned before scanning a new zone
total_{migrate,free}_scanned will be added to COMPACTMIGRATE_SCANNED and
COMPACTFREE_SCANNED in compact_zone().  We should clear them before
scanning a new zone.  In the proc triggered compaction, we forgot clearing
them.

[laoar.shao@gmail.com: introduce a helper compact_zone_counters_init()]
  Link: http://lkml.kernel.org/r/1563869295-25748-1-git-send-email-laoar.shao@gmail.com
[akpm@linux-foundation.org: expand compact_zone_counters_init() into its single callsite, per mhocko]
[vbabka@suse.cz: squash compact_zone() list_head init as well]
  Link: http://lkml.kernel.org/r/1fb6f7da-f776-9e42-22f8-bbb79b030b98@suse.cz
[akpm@linux-foundation.org: kcompactd_do_work(): avoid unnecessary initialization of cc.zone]
Link: http://lkml.kernel.org/r/1563789275-9639-1-git-send-email-laoar.shao@gmail.com
Fixes: 7f354a548d ("mm, compaction: add vmstats for kcompactd work")
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Yafang Shao <shaoyafang@didiglobal.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: I4432c84a692ef0fba7fab2247fd9e647ede7b6e5
2022-11-12 11:22:22 +00:00
Suzuki K Poulose
afb646ce70 BACKPORT: mm, compaction: make sure we isolate a valid PFN
When we have holes in a normal memory zone, we could endup having
cached_migrate_pfns which may not necessarily be valid, under heavy memory
pressure with swapping enabled ( via __reset_isolation_suitable(),
triggered by kswapd).

Later if we fail to find a page via fast_isolate_freepages(), we may end
up using the migrate_pfn we started the search with, as valid page.  This
could lead to accessing NULL pointer derefernces like below, due to an
invalid mem_section pointer.

Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008 [47/1825]
 Mem abort info:
   ESR = 0x96000004
   Exception class = DABT (current EL), IL = 32 bits
   SET = 0, FnV = 0
   EA = 0, S1PTW = 0
 Data abort info:
   ISV = 0, ISS = 0x00000004
   CM = 0, WnR = 0
 user pgtable: 4k pages, 48-bit VAs, pgdp = 0000000082f94ae9
 [0000000000000008] pgd=0000000000000000
 Internal error: Oops: 96000004 [#1] SMP
 ...
 CPU: 10 PID: 6080 Comm: qemu-system-aar Not tainted 510-rc1+ #6
 Hardware name: AmpereComputing(R) OSPREY EV-883832-X3-0001/OSPREY, BIOS 4819 09/25/2018
 pstate: 60000005 (nZCv daif -PAN -UAO)
 pc : set_pfnblock_flags_mask+0x58/0xe8
 lr : compaction_alloc+0x300/0x950
 [...]
 Process qemu-system-aar (pid: 6080, stack limit = 0x0000000095070da5)
 Call trace:
  set_pfnblock_flags_mask+0x58/0xe8
  compaction_alloc+0x300/0x950
  migrate_pages+0x1a4/0xbb0
  compact_zone+0x750/0xde8
  compact_zone_order+0xd8/0x118
  try_to_compact_pages+0xb4/0x290
  __alloc_pages_direct_compact+0x84/0x1e0
  __alloc_pages_nodemask+0x5e0/0xe18
  alloc_pages_vma+0x1cc/0x210
  do_huge_pmd_anonymous_page+0x108/0x7c8
  __handle_mm_fault+0xdd4/0x1190
  handle_mm_fault+0x114/0x1c0
  __get_user_pages+0x198/0x3c0
  get_user_pages_unlocked+0xb4/0x1d8
  __gfn_to_pfn_memslot+0x12c/0x3b8
  gfn_to_pfn_prot+0x4c/0x60
  kvm_handle_guest_abort+0x4b0/0xcd8
  handle_exit+0x140/0x1b8
  kvm_arch_vcpu_ioctl_run+0x260/0x768
  kvm_vcpu_ioctl+0x490/0x898
  do_vfs_ioctl+0xc4/0x898
  ksys_ioctl+0x8c/0xa0
  __arm64_sys_ioctl+0x28/0x38
  el0_svc_common+0x74/0x118
  el0_svc_handler+0x38/0x78
  el0_svc+0x8/0xc
 Code: f8607840 f100001f 8b011401 9a801020 (f9400400)
 ---[ end trace af6a35219325a9b6 ]---

The issue was reported on an arm64 server with 128GB with holes in the
zone (e.g, [32GB@4GB, 96GB@544GB]), with a swap device enabled, while
running 100 KVM guest instances.

This patch fixes the issue by ensuring that the page belongs to a valid
PFN when we fallback to using the lower limit of the scan range upon
failure in fast_isolate_freepages().

Link: http://lkml.kernel.org/r/1558711908-15688-1-git-send-email-suzuki.poulose@arm.com
Fixes: 5a811889de10f1eb ("mm, compaction: use free lists to quickly locate a migration target")
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Reported-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: I270a6ad91b96cb8b6c8c65c77228a4389b4b687c
2022-11-12 11:22:22 +00:00
Mel Gorman
0f0575f5d4 BACKPORT: mm/compaction.c: correct zone boundary handling when isolating pages from a pageblock
syzbot reported the following error from a tree with a head commit of
baf76f0c58ae ("slip: make slhc_free() silently accept an error pointer")

  BUG: unable to handle kernel paging request at ffffea0003348000
  #PF error: [normal kernel read fault]
  PGD 12c3f9067 P4D 12c3f9067 PUD 12c3f8067 PMD 0
  Oops: 0000 [#1] PREEMPT SMP KASAN
  CPU: 1 PID: 28916 Comm: syz-executor.2 Not tainted 5.1.0-rc6+ #89
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
  RIP: 0010:constant_test_bit arch/x86/include/asm/bitops.h:314 [inline]
  RIP: 0010:PageCompound include/linux/page-flags.h:186 [inline]
  RIP: 0010:isolate_freepages_block+0x1c0/0xd40 mm/compaction.c:579
  Code: 01 d8 ff 4d 85 ed 0f 84 ef 07 00 00 e8 29 00 d8 ff 4c 89 e0 83 85 38 ff
  ff ff 01 48 c1 e8 03 42 80 3c 38 00 0f 85 31 0a 00 00 <4d> 8b 2c 24 31 ff 49
  c1 ed 10 41 83 e5 01 44 89 ee e8 3a 01 d8 ff
  RSP: 0018:ffff88802b31eab8 EFLAGS: 00010246
  RAX: 1ffffd4000669000 RBX: 00000000000cd200 RCX: ffffc9000a235000
  RDX: 000000000001ca5e RSI: ffffffff81988cc7 RDI: 0000000000000001
  RBP: ffff88802b31ebd8 R08: ffff88805af700c0 R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000000000 R12: ffffea0003348000
  R13: 0000000000000000 R14: ffff88802b31f030 R15: dffffc0000000000
  FS:  00007f61648dc700(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: ffffea0003348000 CR3: 0000000037c64000 CR4: 00000000001426e0
  Call Trace:
   fast_isolate_around mm/compaction.c:1243 [inline]
   fast_isolate_freepages mm/compaction.c:1418 [inline]
   isolate_freepages mm/compaction.c:1438 [inline]
   compaction_alloc+0x1aee/0x22e0 mm/compaction.c:1550

There is no reproducer and it is difficult to hit -- 1 crash every few
days.  The issue is very similar to the fix in commit 6b0868c820ff
("mm/compaction.c: correct zone boundary handling when resetting pageblock
skip hints").  When isolating free pages around a target pageblock, the
boundary handling is off by one and can stray into the next pageblock.
Triggering the syzbot error requires that the end of pageblock is section
or zone aligned, and that the next section is unpopulated.

A more subtle consequence of the bug is that pageblocks were being
improperly used as migration targets which potentially hurts fragmentation
avoidance in the long-term one page at a time.

A debugging patch revealed that it's definitely possible to stray outside
of a pageblock which is not intended.  While syzbot cannot be used to
verify this patch, it was confirmed that the debugging warning no longer
triggers with this patch applied.  It has also been confirmed that the THP
allocation stress tests are not degraded by this patch.

Link: http://lkml.kernel.org/r/20190510182124.GI18914@techsingularity.net
Fixes: e332f741a8dd ("mm, compaction: be selective about what pageblocks to clear skip hints")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reported-by: syzbot+d84c80f9fe26a0f7a734@syzkaller.appspotmail.com
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org> # v5.1+
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: Ic632c6013f0751094ca4abaa002360df78ce312b
2022-11-12 11:22:22 +00:00
Qian Cai
c784cc0fe9 BACKPORT: mm/compaction.c: fix an undefined behaviour
In a low-memory situation, cc->fast_search_fail can keep increasing as it
is unable to find an available page to isolate in
fast_isolate_freepages().  As the result, it could trigger an error below,
so just compare with the maximum bits can be shifted first.

UBSAN: Undefined behaviour in mm/compaction.c:1160:30
shift exponent 64 is too large for 64-bit type 'unsigned long'
CPU: 131 PID: 1308 Comm: kcompactd1 Kdump: loaded Tainted: G
W    L    5.0.0+ #17
Call trace:
 dump_backtrace+0x0/0x450
 show_stack+0x20/0x2c
 dump_stack+0xc8/0x14c
 __ubsan_handle_shift_out_of_bounds+0x7e8/0x8c4
 compaction_alloc+0x2344/0x2484
 unmap_and_move+0xdc/0x1dbc
 migrate_pages+0x274/0x1310
 compact_zone+0x26ec/0x43bc
 kcompactd+0x15b8/0x1a24
 kthread+0x374/0x390
 ret_from_fork+0x10/0x18

[akpm@linux-foundation.org: code cleanup]
Link: http://lkml.kernel.org/r/20190320203338.53367-1-cai@lca.pw
Fixes: 70b44595eafe ("mm, compaction: use free lists to quickly locate a migration source")
Signed-off-by: Qian Cai <cai@lca.pw>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: I9efcb7cfcca586f3225365415560945ce8a37ffd
2022-11-12 11:22:21 +00:00
Qian Cai
16de6558a7 BACKPORT: mm/compaction.c: abort search if isolation fails
Running LTP oom01 in a tight loop or memory stress testing put the system
in a low-memory situation could triggers random memory corruption like
page flag corruption below due to in fast_isolate_freepages(), if
isolation fails, next_search_order() does not abort the search immediately
could lead to improper accesses.

UBSAN: Undefined behaviour in ./include/linux/mm.h:1195:50
index 7 is out of range for type 'zone [5]'
Call Trace:
 dump_stack+0x62/0x9a
 ubsan_epilogue+0xd/0x7f
 __ubsan_handle_out_of_bounds+0x14d/0x192
 __isolate_free_page+0x52c/0x600
 compaction_alloc+0x886/0x25f0
 unmap_and_move+0x37/0x1e70
 migrate_pages+0x2ca/0xb20
 compact_zone+0x19cb/0x3620
 kcompactd_do_work+0x2df/0x680
 kcompactd+0x1d8/0x6c0
 kthread+0x32c/0x3f0
 ret_from_fork+0x35/0x40
------------[ cut here ]------------
kernel BUG at mm/page_alloc.c:3124!
invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
RIP: 0010:__isolate_free_page+0x464/0x600
RSP: 0000:ffff888b9e1af848 EFLAGS: 00010007
RAX: 0000000030000000 RBX: ffff888c39fcf0f8 RCX: 0000000000000000
RDX: 1ffff111873f9e25 RSI: 0000000000000004 RDI: ffffed1173c35ef6
RBP: ffff888b9e1af898 R08: fffffbfff4fc2461 R09: fffffbfff4fc2460
R10: fffffbfff4fc2460 R11: ffffffffa7e12303 R12: 0000000000000008
R13: dffffc0000000000 R14: 0000000000000000 R15: 0000000000000007
FS:  0000000000000000(0000) GS:ffff888ba8e80000(0000)
knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fc7abc00000 CR3: 0000000752416004 CR4: 00000000001606a0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 compaction_alloc+0x886/0x25f0
 unmap_and_move+0x37/0x1e70
 migrate_pages+0x2ca/0xb20
 compact_zone+0x19cb/0x3620
 kcompactd_do_work+0x2df/0x680
 kcompactd+0x1d8/0x6c0
 kthread+0x32c/0x3f0
 ret_from_fork+0x35/0x40

Link: http://lkml.kernel.org/r/20190320192648.52499-1-cai@lca.pw
Fixes: dbe2d4e4f12e ("mm, compaction: round-robin the order while searching the free lists for a target")
Signed-off-by: Qian Cai <cai@lca.pw>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: I7bbee10fffce4cdf3289fd461ed949285ea195a1
2022-11-12 11:22:21 +00:00
Mel Gorman
ba7b813abb BACKPORT: mm/compaction.c: correct zone boundary handling when resetting pageblock skip hints
Mikhail Gavrilo reported the following bug being triggered in a Fedora
kernel based on 5.1-rc1 but it is relevant to a vanilla kernel.

 kernel: page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
 kernel: ------------[ cut here ]------------
 kernel: kernel BUG at include/linux/mm.h:1021!
 kernel: invalid opcode: 0000 [#1] SMP NOPTI
 kernel: CPU: 6 PID: 116 Comm: kswapd0 Tainted: G         C        5.1.0-0.rc1.git1.3.fc31.x86_64 #1
 kernel: Hardware name: System manufacturer System Product Name/ROG STRIX X470-I GAMING, BIOS 1201 12/07/2018
 kernel: RIP: 0010:__reset_isolation_pfn+0x244/0x2b0
 kernel: Code: fe 06 e8 0f 8e fc ff 44 0f b6 4c 24 04 48 85 c0 0f 85 dc fe ff ff e9 68 fe ff ff 48 c7 c6 58 b7 2e 8c 4c 89 ff e8 0c 75 00 00 <0f> 0b 48 c7 c6 58 b7 2e 8c e8 fe 74 00 00 0f 0b 48 89 fa 41 b8 01
 kernel: RSP: 0018:ffff9e2d03f0fde8 EFLAGS: 00010246
 kernel: RAX: 0000000000000034 RBX: 000000000081f380 RCX: ffff8cffbddd6c20
 kernel: RDX: 0000000000000000 RSI: 0000000000000006 RDI: ffff8cffbddd6c20
 kernel: RBP: 0000000000000001 R08: 0000009898b94613 R09: 0000000000000000
 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000100000
 kernel: R13: 0000000000100000 R14: 0000000000000001 R15: ffffca7de07ce000
 kernel: FS:  0000000000000000(0000) GS:ffff8cffbdc00000(0000) knlGS:0000000000000000
 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 kernel: CR2: 00007fc1670e9000 CR3: 00000007f5276000 CR4: 00000000003406e0
 kernel: Call Trace:
 kernel:  __reset_isolation_suitable+0x62/0x120
 kernel:  reset_isolation_suitable+0x3b/0x40
 kernel:  kswapd+0x147/0x540
 kernel:  ? finish_wait+0x90/0x90
 kernel:  kthread+0x108/0x140
 kernel:  ? balance_pgdat+0x560/0x560
 kernel:  ? kthread_park+0x90/0x90
 kernel:  ret_from_fork+0x27/0x50

He bisected it down to e332f741a8dd ("mm, compaction: be selective about
what pageblocks to clear skip hints").  The problem is that the patch in
question was sloppy with respect to the handling of zone boundaries.  In
some instances, it was possible for PFNs outside of a zone to be examined
and if those were not properly initialised or poisoned then it would
trigger the VM_BUG_ON.  This patch corrects the zone boundary issues when
resetting pageblock skip hints and Mikhail reported that the bug did not
trigger after 30 hours of testing.

Link: http://lkml.kernel.org/r/20190327085424.GL3189@techsingularity.net
Fixes: e332f741a8dd ("mm, compaction: be selective about what pageblocks to clear skip hints")
Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: I6cc42eeb1ec6743fa4504065ba35d9c6f35968af
2022-11-12 11:22:21 +00:00
Mel Gorman
d255a22284 BACKPORT: mm, compaction: capture a page under direct compaction
Compaction is inherently race-prone as a suitable page freed during
compaction can be allocated by any parallel task.  This patch uses a
capture_control structure to isolate a page immediately when it is freed
by a direct compactor in the slow path of the page allocator.  The
intent is to avoid redundant scanning.

                                     5.0.0-rc1              5.0.0-rc1
                               selective-v3r17          capture-v3r19
Amean     fault-both-1         0.00 (   0.00%)        0.00 *   0.00%*
Amean     fault-both-3      2582.11 (   0.00%)     2563.68 (   0.71%)
Amean     fault-both-5      4500.26 (   0.00%)     4233.52 (   5.93%)
Amean     fault-both-7      5819.53 (   0.00%)     6333.65 (  -8.83%)
Amean     fault-both-12     9321.18 (   0.00%)     9759.38 (  -4.70%)
Amean     fault-both-18     9782.76 (   0.00%)    10338.76 (  -5.68%)
Amean     fault-both-24    15272.81 (   0.00%)    13379.55 *  12.40%*
Amean     fault-both-30    15121.34 (   0.00%)    16158.25 (  -6.86%)
Amean     fault-both-32    18466.67 (   0.00%)    18971.21 (  -2.73%)

Latency is only moderately affected but the devil is in the details.  A
closer examination indicates that base page fault latency is reduced but
latency of huge pages is increased as it takes creater care to succeed.
Part of the "problem" is that allocation success rates are close to 100%
even when under pressure and compaction gets harder

                                5.0.0-rc1              5.0.0-rc1
                          selective-v3r17          capture-v3r19
Percentage huge-3        96.70 (   0.00%)       98.23 (   1.58%)
Percentage huge-5        96.99 (   0.00%)       95.30 (  -1.75%)
Percentage huge-7        94.19 (   0.00%)       97.24 (   3.24%)
Percentage huge-12       94.95 (   0.00%)       97.35 (   2.53%)
Percentage huge-18       96.74 (   0.00%)       97.30 (   0.58%)
Percentage huge-24       97.07 (   0.00%)       97.55 (   0.50%)
Percentage huge-30       95.69 (   0.00%)       98.50 (   2.95%)
Percentage huge-32       96.70 (   0.00%)       99.27 (   2.65%)

And scan rates are reduced as expected by 6% for the migration scanner
and 29% for the free scanner indicating that there is less redundant
work.

Compaction migrate scanned    20815362    19573286
Compaction free scanned       16352612    11510663

[mgorman@techsingularity.net: remove redundant check]
  Link: http://lkml.kernel.org/r/20190201143853.GH9565@techsingularity.net
Link: http://lkml.kernel.org/r/20190118175136.31341-23-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: Ia00b58bc8791d428ee1b33d8bf83f83d04f1e9ea
2022-11-12 11:22:21 +00:00
Mel Gorman
ecc71ef993 BACKPORT: mm, compaction: be selective about what pageblocks to clear skip hints
Pageblock hints are cleared when compaction restarts or kswapd makes
enough progress that it can sleep but it's over-eager in that the bit is
cleared for migration sources with no LRU pages and migration targets
with no free pages.  As pageblock skip hint flushes are relatively rare
and out-of-band with respect to kswapd, this patch makes a few more
expensive checks to see if it's appropriate to even clear the bit.
Every pageblock that is not cleared will avoid 512 pages being scanned
unnecessarily on x86-64.

The impact is variable with different workloads showing small
differences in latency, success rates and scan rates.  This is expected
as clearing the hints is not that common but doing a small amount of
work out-of-band to avoid a large amount of work in-band later is
generally a good thing.

Link: http://lkml.kernel.org/r/20190118175136.31341-22-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Qian Cai <cai@lca.pw>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: YueHaibing <yuehaibing@huawei.com>
[cai@lca.pw: no stuck in __reset_isolation_pfn()]
  Link: http://lkml.kernel.org/r/20190206034732.75687-1-cai@lca.pw
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: I2dab8b77d34a4fb8ce0c54e51bf45ce7646ec914
2022-11-12 11:22:20 +00:00
Mel Gorman
3461a6e26b BACKPORT: mm, compaction: sample pageblocks for free pages
Once fast searching finishes, there is a possibility that the linear
scanner is scanning full blocks found by the fast scanner earlier.  This
patch uses an adaptive stride to sample pageblocks for free pages.  The
more consecutive full pageblocks encountered, the larger the stride
until a pageblock with free pages is found.  The scanners might meet
slightly sooner but it is an acceptable risk given that the search of
the free lists may still encounter the pages and adjust the cached PFN
of the free scanner accordingly.

                                     5.0.0-rc1              5.0.0-rc1
                              roundrobin-v3r17       samplefree-v3r17
Amean     fault-both-1         0.00 (   0.00%)        0.00 *   0.00%*
Amean     fault-both-3      2752.37 (   0.00%)     2729.95 (   0.81%)
Amean     fault-both-5      4341.69 (   0.00%)     4397.80 (  -1.29%)
Amean     fault-both-7      6308.75 (   0.00%)     6097.61 (   3.35%)
Amean     fault-both-12    10241.81 (   0.00%)     9407.15 (   8.15%)
Amean     fault-both-18    13736.09 (   0.00%)    10857.63 *  20.96%*
Amean     fault-both-24    16853.95 (   0.00%)    13323.24 *  20.95%*
Amean     fault-both-30    15862.61 (   0.00%)    17345.44 (  -9.35%)
Amean     fault-both-32    18450.85 (   0.00%)    16892.00 (   8.45%)

The latency is mildly improved offseting some overhead from earlier
patches that are prerequisites for the rest of the series.  However, a
major impact is on the free scan rate with an 82% reduction.

                                5.0.0-rc1      5.0.0-rc1
                         roundrobin-v3r17 samplefree-v3r17
Compaction migrate scanned    21607271            20116887
Compaction free scanned       95336406            16668703

It's also the first time in the series where the number of pages scanned
by the migration scanner is greater than the free scanner due to the
increased search efficiency.

Link: http://lkml.kernel.org/r/20190118175136.31341-21-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: I60f54e0f066bf609ca6713fd3038c5dd1ac3ebb8
2022-11-12 11:22:20 +00:00
Mel Gorman
eba2727080 BACKPORT: mm, compaction: round-robin the order while searching the free lists for a target
As compaction proceeds and creates high-order blocks, the free list
search gets less efficient as the larger blocks are used as compaction
targets.  Eventually, the larger blocks will be behind the migration
scanner for partially migrated pageblocks and the search fails.  This
patch round-robins what orders are searched so that larger blocks can be
ignored and find smaller blocks that can be used as migration targets.

The overall impact was small on 1-socket but it avoids corner cases
where the migration/free scanners meet prematurely or situations where
many of the pageblocks encountered by the free scanner are almost full
instead of being properly packed.  Previous testing had indicated that
without this patch there were occasional large spikes in the free
scanner without this patch.

[dan.carpenter@oracle.com: fix static checker warning]
Link: http://lkml.kernel.org/r/20190118175136.31341-20-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: Ib2e2da22ab748ba706b62d93aed3f6e80cdb3b6e
2022-11-12 11:22:20 +00:00
Mel Gorman
1f0901aba3 BACKPORT: mm, compaction: reduce premature advancement of the migration target scanner
The fast isolation of free pages allows the cached PFN of the free
scanner to advance faster than necessary depending on the contents of
the free list.  The key is that fast_isolate_freepages() can update
zone->compact_cached_free_pfn via isolate_freepages_block().  When the
fast search fails, the linear scan can start from a point that has
skipped valid migration targets, particularly pageblocks with just
low-order free pages.  This can cause the migration source/target
scanners to meet prematurely causing a reset.

This patch starts by avoiding an update of the pageblock skip
information and cached PFN from isolate_freepages_block() and puts the
responsibility of updating that information in the callers.  The fast
scanner will update the cached PFN if and only if it finds a block that
is higher than the existing cached PFN and sets the skip if the
pageblock is full or nearly full.  The linear scanner will update
skipped information and the cached PFN only when a block is completely
scanned.  The total impact is that the free scanner advances more slowly
as it is primarily driven by the linear scanner instead of the fast
search.

                                     5.0.0-rc1              5.0.0-rc1
                               noresched-v3r17         slowfree-v3r17
Amean     fault-both-3      2965.68 (   0.00%)     3036.75 (  -2.40%)
Amean     fault-both-5      3995.90 (   0.00%)     4522.24 * -13.17%*
Amean     fault-both-7      5842.12 (   0.00%)     6365.35 (  -8.96%)
Amean     fault-both-12     9550.87 (   0.00%)    10340.93 (  -8.27%)
Amean     fault-both-18    13304.72 (   0.00%)    14732.46 ( -10.73%)
Amean     fault-both-24    14618.59 (   0.00%)    16288.96 ( -11.43%)
Amean     fault-both-30    16650.96 (   0.00%)    16346.21 (   1.83%)
Amean     fault-both-32    17145.15 (   0.00%)    19317.49 ( -12.67%)

The impact to latency is higher than the last version but it appears to
be due to a slight increase in the free scan rates which is a potential
side-effect of the patch.  However, this is necessary for later patches
that are more careful about how pageblocks are treated as earlier
iterations of those patches hit corner cases where the restarts were
punishing and very visible.

Link: http://lkml.kernel.org/r/20190118175136.31341-19-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: Ia45514deafc6bd13472947b8cc7557b010186747
2022-11-12 11:22:20 +00:00
Mel Gorman
f30a0bf029 BACKPORT: mm, compaction: do not consider a need to reschedule as contention
Scanning on large machines can take a considerable length of time and
eventually need to be rescheduled.  This is treated as an abort event
but that's not appropriate as the attempt is likely to be retried after
making numerous checks and taking another cycle through the page
allocator.  This patch will check the need to reschedule if necessary
but continue the scanning.

The main benefit is reduced scanning when compaction is taking a long
time or the machine is over-saturated.  It also avoids an unnecessary
exit of compaction that ends up being retried by the page allocator in
the outer loop.

                                     5.0.0-rc1              5.0.0-rc1
                              synccached-v3r16        noresched-v3r17
Amean     fault-both-1         0.00 (   0.00%)        0.00 *   0.00%*
Amean     fault-both-3      2958.27 (   0.00%)     2965.68 (  -0.25%)
Amean     fault-both-5      4091.90 (   0.00%)     3995.90 (   2.35%)
Amean     fault-both-7      5803.05 (   0.00%)     5842.12 (  -0.67%)
Amean     fault-both-12     9481.06 (   0.00%)     9550.87 (  -0.74%)
Amean     fault-both-18    14141.51 (   0.00%)    13304.72 (   5.92%)
Amean     fault-both-24    16438.00 (   0.00%)    14618.59 (  11.07%)
Amean     fault-both-30    17531.72 (   0.00%)    16650.96 (   5.02%)
Amean     fault-both-32    17101.96 (   0.00%)    17145.15 (  -0.25%)

Link: http://lkml.kernel.org/r/20190118175136.31341-18-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: Id7aff270d3b0a0fd8be98019f3fdabcc81a61f3b
2022-11-12 11:22:20 +00:00
Mel Gorman
b95f8bcbb0 BACKPORT: mm, compaction: rework compact_should_abort as compact_check_resched
With incremental changes, compact_should_abort no longer makes any
documented sense.  Rename to compact_check_resched and update the
associated comments.  There is no benefit other than reducing redundant
code and making the intent slightly clearer.  It could potentially be
merged with earlier patches but it just makes the review slightly
harder.

Link: http://lkml.kernel.org/r/20190118175136.31341-17-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: I03cadabc5b27c28be9353f3cc7fdc019d06553a9
2022-11-12 11:22:19 +00:00
Mel Gorman
011a3aca6e BACKPORT: mm, compaction: keep cached migration PFNs synced for unusable pageblocks
Migrate has separate cached PFNs for ASYNC and SYNC* migration on the
basis that some migrations will fail in ASYNC mode.  However, if the
cached PFNs match at the start of scanning and pageblocks are skipped
due to having no isolation candidates, then the sync state does not
matter.  This patch keeps matching cached PFNs in sync until a pageblock
with isolation candidates is found.

The actual benefit is marginal given that the sync scanner following the
async scanner will often skip a number of pageblocks but it's useless
work.  Any benefit depends heavily on whether the scanners restarted
recently.

Link: http://lkml.kernel.org/r/20190118175136.31341-16-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: I1d7ed8c961622dab9cd1e55b77e1e1ec6fa3bb5a
2022-11-12 11:22:19 +00:00
Mel Gorman
0cfc1369fc BACKPORT: mm, compaction: check early for huge pages encountered by the migration scanner
When scanning for sources or targets, PageCompound is checked for huge
pages as they can be skipped quickly but it happens relatively late
after a lot of setup and checking.  This patch short-cuts the check to
make it earlier.  It might still change when the lock is acquired but
this has less overhead overall.  The free scanner advances but the
migration scanner does not.  Typically the free scanner encounters more
movable blocks that change state over the lifetime of the system and
also tends to scan more aggressively as it's actively filling its
portion of the physical address space with data.  This could change in
the future but for the moment, this worked better in practice and
incurred fewer scan restarts.

The impact on latency and allocation success rates is marginal but the
free scan rates are reduced by 15% and system CPU usage is reduced by
3.3%.  The 2-socket results are not materially different.

Link: http://lkml.kernel.org/r/20190118175136.31341-15-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: Ica3b213fa634a7c8dc0c43a4546e4adbfe9b7c0b
2022-11-12 11:22:19 +00:00
Mel Gorman
7382d2caca BACKPORT: mm, compaction: finish pageblock scanning on contention
Async migration aborts on spinlock contention but contention can be high
when there are multiple compaction attempts and kswapd is active.  The
consequence is that the migration scanners move forward uselessly while
still contending on locks for longer while leaving suitable migration
sources behind.

This patch will acquire the lock but track when contention occurs.  When
it does, the current pageblock will finish as compaction may succeed for
that block and then abort.  This will have a variable impact on latency
as in some cases useless scanning is avoided (reduces latency) but a
lock will be contended (increase latency) or a single contended
pageblock is scanned that would otherwise have been skipped (increase
latency).

                                     5.0.0-rc1              5.0.0-rc1
                                norescan-v3r16    finishcontend-v3r16
Amean     fault-both-1         0.00 (   0.00%)        0.00 *   0.00%*
Amean     fault-both-3      3002.07 (   0.00%)     3153.17 (  -5.03%)
Amean     fault-both-5      4684.47 (   0.00%)     4280.52 (   8.62%)
Amean     fault-both-7      6815.54 (   0.00%)     5811.50 *  14.73%*
Amean     fault-both-12    10864.02 (   0.00%)     9276.85 (  14.61%)
Amean     fault-both-18    12247.52 (   0.00%)    11032.67 (   9.92%)
Amean     fault-both-24    15683.99 (   0.00%)    14285.70 (   8.92%)
Amean     fault-both-30    18620.02 (   0.00%)    16293.76 *  12.49%*
Amean     fault-both-32    19250.28 (   0.00%)    16721.02 *  13.14%*

                                5.0.0-rc1              5.0.0-rc1
                           norescan-v3r16    finishcontend-v3r16
Percentage huge-1         0.00 (   0.00%)        0.00 (   0.00%)
Percentage huge-3        95.00 (   0.00%)       96.82 (   1.92%)
Percentage huge-5        94.22 (   0.00%)       95.40 (   1.26%)
Percentage huge-7        92.35 (   0.00%)       95.92 (   3.86%)
Percentage huge-12       91.90 (   0.00%)       96.73 (   5.25%)
Percentage huge-18       89.58 (   0.00%)       96.77 (   8.03%)
Percentage huge-24       90.03 (   0.00%)       96.05 (   6.69%)
Percentage huge-30       89.14 (   0.00%)       96.81 (   8.60%)
Percentage huge-32       90.58 (   0.00%)       97.41 (   7.54%)

There is a variable impact that is mostly good on latency while allocation
success rates are slightly higher.  System CPU usage is reduced by about
10% but scan rate impact is mixed

Compaction migrate scanned    27997659.00    20148867
Compaction free scanned      120782791.00   118324914

Migration scan rates are reduced 28% which is expected as a pageblock is
used by the async scanner instead of skipped.  The impact on the free
scanner is known to be variable.  Overall the primary justification for
this patch is that completing scanning of a pageblock is very important
for later patches.

[yuehaibing@huawei.com: fix unused variable warning]
Link: http://lkml.kernel.org/r/20190118175136.31341-14-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: YueHaibing <yuehaibing@huawei.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: I7c5c1aedbea9bfbc77beb52b2b517066bda4c0d6
2022-11-12 11:22:19 +00:00
Mel Gorman
b321c4ac6f BACKPORT: mm, compaction: avoid rescanning the same pageblock multiple times
Pageblocks are marked for skip when no pages are isolated after a scan.
However, it's possible to hit corner cases where the migration scanner
gets stuck near the boundary between the source and target scanner.  Due
to pages being migrated in blocks of COMPACT_CLUSTER_MAX, pages that are
migrated can be reallocated before the pageblock is complete.  The
pageblock is not necessarily skipped so it can be rescanned multiple
times.  Similarly, a pageblock with some dirty/writeback pages may fail
to migrate and be rescanned until writeback completes which is wasteful.

This patch tracks if a pageblock is being rescanned.  If so, then the
entire pageblock will be migrated as one operation.  This narrows the
race window during which pages can be reallocated during migration.
Secondly, if there are pages that cannot be isolated then the pageblock
will still be fully scanned and marked for skipping.  On the second
rescan, the pageblock skip is set and the migration scanner makes
progress.

                                     5.0.0-rc1              5.0.0-rc1
                                findfree-v3r16         norescan-v3r16
Amean     fault-both-1         0.00 (   0.00%)        0.00 *   0.00%*
Amean     fault-both-3      3200.68 (   0.00%)     3002.07 (   6.21%)
Amean     fault-both-5      4847.75 (   0.00%)     4684.47 (   3.37%)
Amean     fault-both-7      6658.92 (   0.00%)     6815.54 (  -2.35%)
Amean     fault-both-12    11077.62 (   0.00%)    10864.02 (   1.93%)
Amean     fault-both-18    12403.97 (   0.00%)    12247.52 (   1.26%)
Amean     fault-both-24    15607.10 (   0.00%)    15683.99 (  -0.49%)
Amean     fault-both-30    18752.27 (   0.00%)    18620.02 (   0.71%)
Amean     fault-both-32    21207.54 (   0.00%)    19250.28 *   9.23%*

                                5.0.0-rc1              5.0.0-rc1
                           findfree-v3r16         norescan-v3r16
Percentage huge-3        96.86 (   0.00%)       95.00 (  -1.91%)
Percentage huge-5        93.72 (   0.00%)       94.22 (   0.53%)
Percentage huge-7        94.31 (   0.00%)       92.35 (  -2.08%)
Percentage huge-12       92.66 (   0.00%)       91.90 (  -0.82%)
Percentage huge-18       91.51 (   0.00%)       89.58 (  -2.11%)
Percentage huge-24       90.50 (   0.00%)       90.03 (  -0.52%)
Percentage huge-30       91.57 (   0.00%)       89.14 (  -2.65%)
Percentage huge-32       91.00 (   0.00%)       90.58 (  -0.46%)

Negligible difference but this was likely a case when the specific
corner case was not hit.  A previous run of the same patch based on an
earlier iteration of the series showed large differences where migration
rates could be halved when the corner case was hit.

The specific corner case where migration scan rates go through the roof
was due to a dirty/writeback pageblock located at the boundary of the
migration/free scanner did not happen in this case.  When it does
happen, the scan rates multipled by massive margins.

Link: http://lkml.kernel.org/r/20190118175136.31341-13-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: I622c77c32b975d50c0901c66e7e62f21b6858990
2022-11-12 11:22:19 +00:00
Mel Gorman
cbbeed2202 BACKPORT: mm, compaction: use free lists to quickly locate a migration target
Similar to the migration scanner, this patch uses the free lists to
quickly locate a migration target.  The search is different in that
lower orders will be searched for a suitable high PFN if necessary but
the search is still bound.  This is justified on the grounds that the
free scanner typically scans linearly much more than the migration
scanner.

If a free page is found, it is isolated and compaction continues if
enough pages were isolated.  For SYNC* scanning, the full pageblock is
scanned for any remaining free pages so that is can be marked for
skipping in the near future.

1-socket thpfioscale
                                     5.0.0-rc1              5.0.0-rc1
                                 isolmig-v3r15         findfree-v3r16
Amean     fault-both-3      3024.41 (   0.00%)     3200.68 (  -5.83%)
Amean     fault-both-5      4749.30 (   0.00%)     4847.75 (  -2.07%)
Amean     fault-both-7      6454.95 (   0.00%)     6658.92 (  -3.16%)
Amean     fault-both-12    10324.83 (   0.00%)    11077.62 (  -7.29%)
Amean     fault-both-18    12896.82 (   0.00%)    12403.97 (   3.82%)
Amean     fault-both-24    13470.60 (   0.00%)    15607.10 * -15.86%*
Amean     fault-both-30    17143.99 (   0.00%)    18752.27 (  -9.38%)
Amean     fault-both-32    17743.91 (   0.00%)    21207.54 * -19.52%*

The impact on latency is variable but the search is optimistic and
sensitive to the exact system state.  Success rates are similar but the
major impact is to the rate of scanning

                                5.0.0-rc1      5.0.0-rc1
                            isolmig-v3r15 findfree-v3r16
Compaction migrate scanned    25646769          29507205
Compaction free scanned      201558184         100359571

The free scan rates are reduced by 50%.  The 2-socket reductions for the
free scanner are more dramatic which is a likely reflection that the
machine has more memory.

[dan.carpenter@oracle.com: fix static checker warning]
[vbabka@suse.cz: correct number of pages scanned for lower orders]
Link: http://lkml.kernel.org/r/20190118175136.31341-12-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: I1d4ce3e48d5130e891e1780ad21b89feba73c263
2022-11-12 11:22:18 +00:00
Mel Gorman
728bb051fe BACKPORT: mm, compaction: keep migration source private to a single compaction instance
Due to either a fast search of the free list or a linear scan, it is
possible for multiple compaction instances to pick the same pageblock
for migration.  This is lucky for one scanner and increased scanning for
all the others.  It also allows a race between requests on which first
allocates the resulting free block.

This patch tests and updates the pageblock skip for the migration
scanner carefully.  When isolating a block, it will check and skip if
the block is already in use.  Once the zone lock is acquired, it will be
rechecked so that only one scanner can set the pageblock skip for
exclusive use.  Any scanner contending will continue with a linear scan.
The skip bit is still set if no pages can be isolated in a range.  While
this may result in redundant scanning, it avoids unnecessarily acquiring
the zone lock when there are no suitable migration sources.

1-socket thpscale
Amean     fault-both-1         0.00 (   0.00%)        0.00 *   0.00%*
Amean     fault-both-3      3390.40 (   0.00%)     3024.41 (  10.80%)
Amean     fault-both-5      5082.28 (   0.00%)     4749.30 (   6.55%)
Amean     fault-both-7      7012.51 (   0.00%)     6454.95 (   7.95%)
Amean     fault-both-12    11346.63 (   0.00%)    10324.83 (   9.01%)
Amean     fault-both-18    15324.19 (   0.00%)    12896.82 *  15.84%*
Amean     fault-both-24    16088.50 (   0.00%)    13470.60 *  16.27%*
Amean     fault-both-30    18723.42 (   0.00%)    17143.99 (   8.44%)
Amean     fault-both-32    18612.01 (   0.00%)    17743.91 (   4.66%)

                                5.0.0-rc1              5.0.0-rc1
                            findmig-v3r15          isolmig-v3r15
Percentage huge-3        89.83 (   0.00%)       92.96 (   3.48%)
Percentage huge-5        91.96 (   0.00%)       93.26 (   1.41%)
Percentage huge-7        92.85 (   0.00%)       93.63 (   0.84%)
Percentage huge-12       92.74 (   0.00%)       92.80 (   0.07%)
Percentage huge-18       91.71 (   0.00%)       91.62 (  -0.10%)
Percentage huge-24       92.13 (   0.00%)       91.50 (  -0.69%)
Percentage huge-30       93.79 (   0.00%)       92.73 (  -1.13%)
Percentage huge-32       91.27 (   0.00%)       91.94 (   0.74%)

This shows a reasonable reduction in latency as multiple compaction
scanners do not operate on the same blocks with a similar allocation
success rate.

Compaction migrate scanned    41093126    25646769

Migration scan rates are reduced by 38%.

Link: http://lkml.kernel.org/r/20190118175136.31341-11-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: I7dd42d993e2f76113bb92e967298019669bfbaca
2022-11-12 11:22:18 +00:00
Mel Gorman
dee9498637 BACKPORT: mm, compaction: use free lists to quickly locate a migration source
The migration scanner is a linear scan of a zone with a potentiall large
search space.  Furthermore, many pageblocks are unusable such as those
filled with reserved pages or partially filled with pages that cannot
migrate.  These still get scanned in the common case of allocating a THP
and the cost accumulates.

The patch uses a partial search of the free lists to locate a migration
source candidate that is marked as MOVABLE when allocating a THP.  It
prefers picking a block with a larger number of free pages already on
the basis that there are fewer pages to migrate to free the entire
block.  The lowest PFN found during searches is tracked as the basis of
the start for the linear search after the first search of the free list
fails.  After the search, the free list is shuffled so that the next
search will not encounter the same page.  If the search fails then the
subsequent searches will be shorter and the linear scanner is used.

If this search fails, or if the request is for a small or
unmovable/reclaimable allocation then the linear scanner is still used.
It is somewhat pointless to use the list search in those cases.  Small
free pages must be used for the search and there is no guarantee that
movable pages are located within that block that are contiguous.

                                     5.0.0-rc1              5.0.0-rc1
                                 noboost-v3r10          findmig-v3r15
Amean     fault-both-3      3771.41 (   0.00%)     3390.40 (  10.10%)
Amean     fault-both-5      5409.05 (   0.00%)     5082.28 (   6.04%)
Amean     fault-both-7      7040.74 (   0.00%)     7012.51 (   0.40%)
Amean     fault-both-12    11887.35 (   0.00%)    11346.63 (   4.55%)
Amean     fault-both-18    16718.19 (   0.00%)    15324.19 (   8.34%)
Amean     fault-both-24    21157.19 (   0.00%)    16088.50 *  23.96%*
Amean     fault-both-30    21175.92 (   0.00%)    18723.42 *  11.58%*
Amean     fault-both-32    21339.03 (   0.00%)    18612.01 *  12.78%*

                                5.0.0-rc1              5.0.0-rc1
                            noboost-v3r10          findmig-v3r15
Percentage huge-3        86.50 (   0.00%)       89.83 (   3.85%)
Percentage huge-5        92.52 (   0.00%)       91.96 (  -0.61%)
Percentage huge-7        92.44 (   0.00%)       92.85 (   0.44%)
Percentage huge-12       92.98 (   0.00%)       92.74 (  -0.25%)
Percentage huge-18       91.70 (   0.00%)       91.71 (   0.02%)
Percentage huge-24       91.59 (   0.00%)       92.13 (   0.60%)
Percentage huge-30       90.14 (   0.00%)       93.79 (   4.04%)
Percentage huge-32       90.03 (   0.00%)       91.27 (   1.37%)

This shows an improvement in allocation latencies with similar
allocation success rates.  While not presented, there was a 31%
reduction in migration scanning and a 8% reduction on system CPU usage.
A 2-socket machine showed similar benefits.

[mgorman@techsingularity.net: several fixes]
  Link: http://lkml.kernel.org/r/20190204120111.GL9565@techsingularity.net
[vbabka@suse.cz: migrate block that was found-fast, some optimisations]
Link: http://lkml.kernel.org/r/20190118175136.31341-10-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <Vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: I9e9fef1d7e4182abb4eafb66034a7eb338343147
2022-11-12 11:22:18 +00:00
Mel Gorman
c76da502b1 BACKPORT: mm, compaction: ignore the fragmentation avoidance boost for isolation and compaction
When pageblocks get fragmented, watermarks are artifically boosted to
reclaim pages to avoid further fragmentation events.  However,
compaction is often either fragmentation-neutral or moving movable pages
away from unmovable/reclaimable pages.  As the true watermarks are
preserved, allow compaction to ignore the boost factor.

The expected impact is very slight as the main benefit is that
compaction is slightly more likely to succeed when the system has been
fragmented very recently.  On both 1-socket and 2-socket machines for
THP-intensive allocation during fragmentation the success rate was
increased by less than 1% which is marginal.  However, detailed tracing
indicated that failure of migration due to a premature ENOMEM triggered
by watermark checks were eliminated.

Link: http://lkml.kernel.org/r/20190118175136.31341-9-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: I3e73b53719800e4454bec29c39ef3afd4b3e327a
2022-11-12 11:22:18 +00:00
Mel Gorman
8aaf44c127 BACKPORT: mm, compaction: always finish scanning of a full pageblock
When compaction is finishing, it uses a flag to ensure the pageblock is
complete but it makes sense to always complete migration of a pageblock.
Minimally, skip information is based on a pageblock and partially
scanned pageblocks may incur more scanning in the future.  The pageblock
skip handling also becomes more strict later in the series and the hint
is more useful if a complete pageblock was always scanned.

The potentially impacts latency as more scanning is done but it's not a
consistent win or loss as the scanning is not always a high percentage
of the pageblock and sometimes it is offset by future reductions in
scanning.  Hence, the results are not presented this time due to a
misleading mix of gains/losses without any clear pattern.  However, full
scanning of the pageblock is important for later patches.

Link: http://lkml.kernel.org/r/20190118175136.31341-8-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: I376206c9c8ef39ae4805fbe60842fbbc9365a653
2022-11-12 11:22:18 +00:00
Mel Gorman
811c6def0c BACKPORT: mm, migrate: immediately fail migration of a page with no migration handler
Pages with no migration handler use a fallback handler which sometimes
works and sometimes persistently retries.  A historical example was
blockdev pages but there are others such as odd refcounting when
page->private is used.  These are retried multiple times which is
wasteful during compaction so this patch will fail migration faster
unless the caller specifies MIGRATE_SYNC.

This is not expected to help THP allocation success rates but it did
reduce latencies very slightly in some cases.

1-socket thpfioscale
                                        4.20.0                 4.20.0
                              noreserved-v2r15         failfast-v2r15
Amean     fault-both-1         0.00 (   0.00%)        0.00 *   0.00%*
Amean     fault-both-3      3839.67 (   0.00%)     3833.72 (   0.15%)
Amean     fault-both-5      5177.47 (   0.00%)     4967.15 (   4.06%)
Amean     fault-both-7      7245.03 (   0.00%)     7139.19 (   1.46%)
Amean     fault-both-12    11534.89 (   0.00%)    11326.30 (   1.81%)
Amean     fault-both-18    16241.10 (   0.00%)    16270.70 (  -0.18%)
Amean     fault-both-24    19075.91 (   0.00%)    19839.65 (  -4.00%)
Amean     fault-both-30    22712.11 (   0.00%)    21707.05 (   4.43%)
Amean     fault-both-32    21692.92 (   0.00%)    21968.16 (  -1.27%)

The 2-socket results are not materially different.  Scan rates are
similar as expected.

Link: http://lkml.kernel.org/r/20190118175136.31341-7-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: Iae251660ab1eba19d0bbd451798b683dedaa3da3
2022-11-12 11:22:17 +00:00
Mel Gorman
747bc31ad0 BACKPORT: mm, compaction: rename map_pages to split_map_pages
It's non-obvious that high-order free pages are split into order-0 pages
from the function name.  Fix it.

Link: http://lkml.kernel.org/r/20190118175136.31341-6-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: I4011d479a5cb4614c4e77f2bc967d1d599f3ca5b
2022-11-12 11:22:17 +00:00
Mel Gorman
e0f7c5c47e BACKPORT: mm, compaction: remove unnecessary zone parameter in some instances
A zone parameter is passed into a number of top-level compaction
functions despite the fact that it's already in compact_control.  This
is harmless but it did need an audit to check if zone actually ever
changes meaningfully.  This patches removes the parameter in a number of
top-level functions.  The change could be much deeper but this was
enough to briefly clarify the flow.

No functional change.

Link: http://lkml.kernel.org/r/20190118175136.31341-5-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: I5246c4733adc5c2f5bf48195b76ad6fbf6d7d2a5
2022-11-12 11:22:17 +00:00
UtsavBalar1231
38479a533f Revert "mm/compaction.c: clear total_{migrate,free}_scanned before scanning a new zone"
This reverts commit 4d8bdf7f3a.

Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: I70f066cbc4944cffd0983e26ebad08abee37a768
2022-11-12 11:22:17 +00:00
Mel Gorman
8f31a213b3 BACKPORT: mm, compaction: remove last_migrated_pfn from compact_control
The last_migrated_pfn field is a bit dubious as to whether it really
helps but either way, the information from it can be inferred without
increasing the size of compact_control so remove the field.

Link: http://lkml.kernel.org/r/20190118175136.31341-4-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: I419d00c5b4a722ea2a8c4d7b0d920502254da951
2022-11-12 11:22:17 +00:00
Mel Gorman
e375d355a4 BACKPORT: mm: move zone watermark accesses behind an accessor
This is a preparation patch only, no functional change.

Link: http://lkml.kernel.org/r/20181123114528.28802-3-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change-Id: Iccef5f02348ab59d75cbbe2b48f40017391b88b1
Git-commit: a921444382b49cc7fdeca3fba3e278bc09484a27
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
[vinmenon@codeaurora.org: trivial merge conflict fixes]
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
2022-11-12 11:22:16 +00:00
Anshuman Khandual
4bd17506dd BACKPORT: arm64/mm: enable HugeTLB migration for contiguous bit HugeTLB pages
Let arm64 subscribe to the previously added framework in which
architecture can inform whether a given huge page size is supported for
migration.  This just overrides the default function
arch_hugetlb_migration_supported() and enables migration for all
possible HugeTLB page sizes on arm64.

With this, HugeTLB migration support on arm64 now covers all possible
HugeTLB options.

          CONT PTE    PMD    CONT PMD    PUD
          --------    ---    --------    ---
  4K:        64K      2M        32M      1G
  16K:        2M     32M         1G
  64K:        2M    512M        16G

Link: http://lkml.kernel.org/r/1545121450-1663-6-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: I0a066eee32e51fe39161a182205bc55b53a55e23
2022-11-12 11:22:16 +00:00
Anshuman Khandual
14f30bee97 BACKPORT: arm64/mm: enable HugeTLB migration
Let arm64 subscribe to generic HugeTLB page migration framework.  Right
now this only works on the following PMD and PUD level HugeTLB page
sizes with various kernel base page size combinations.

         CONT PTE    PMD    CONT PMD    PUD
         --------    ---    --------    ---
  4K:         NA     2M         NA      1G
  16K:        NA    32M         NA
  64K:        NA   512M         NA

Link: http://lkml.kernel.org/r/1545121450-1663-5-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: Id6f11998f24adb6c6423f3e31579df4b002b3d2c
2022-11-12 11:22:16 +00:00
Anshuman Khandual
415df046da BACKPORT: mm/hugetlb: enable arch specific huge page size support for migration
Architectures like arm64 have HugeTLB page sizes which are different
than generic sizes at PMD, PUD, PGD level and implemented via contiguous
bits.  At present these special size HugeTLB pages cannot be identified
through macros like (PMD|PUD|PGDIR)_SHIFT and hence chosen not be
migrated.

Enabling migration support for these special HugeTLB page sizes along
with the generic ones (PMD|PUD|PGD) would require identifying all of
them on a given platform.  A platform specific hook can precisely
enumerate all huge page sizes supported for migration.  Instead of
comparing against standard huge page orders let
hugetlb_migration_support() function call a platform hook
arch_hugetlb_migration_support().  Default definition for the platform
hook maintains existing semantics which checks standard huge page order.
But an architecture can choose to override the default and provide
support for a comprehensive set of huge page sizes.

Link: http://lkml.kernel.org/r/1545121450-1663-4-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: I0ea47939853787fb6497b9d732b8f2f27b946ca1
2022-11-12 11:22:16 +00:00
Anshuman Khandual
a4a566f7ae BACKPORT: mm/hugetlb: enable PUD level huge page migration
Architectures like arm64 have PUD level HugeTLB pages for certain configs
(1GB huge page is PUD based on ARM64_4K_PAGES base page size) that can
be enabled for migration.  It can be achieved through checking for
PUD_SHIFT order based HugeTLB pages during migration.

Link: http://lkml.kernel.org/r/1545121450-1663-3-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: I4b5e1d74d1a2b082a3a9dfa0c2bd7aa09e973dd2
2022-11-12 11:22:16 +00:00
Anshuman Khandual
57729829d1 BACKPORT: mm/hugetlb: distinguish between migratability and movability
Patch series "arm64/mm: Enable HugeTLB migration", v4.

This patch series enables HugeTLB migration support for all supported
huge page sizes at all levels including contiguous bit implementation.
Following HugeTLB migration support matrix has been enabled with this
patch series.  All permutations have been tested except for the 16GB.

           CONT PTE    PMD    CONT PMD    PUD
           --------    ---    --------    ---
  4K:         64K     2M         32M     1G
  16K:         2M    32M          1G
  64K:         2M   512M         16G

First the series adds migration support for PUD based huge pages.  It
then adds a platform specific hook to query an architecture if a given
huge page size is supported for migration while also providing a default
fallback option preserving the existing semantics which just checks for
(PMD|PUD|PGDIR)_SHIFT macros.  The last two patches enables HugeTLB
migration on arm64 and subscribe to this new platform specific hook by
defining an override.

The second patch differentiates between movability and migratability
aspects of huge pages and implements hugepage_movable_supported() which
can then be used during allocation to decide whether to place the huge
page in movable zone or not.

This patch (of 5):

During huge page allocation it's migratability is checked to determine
if it should be placed under movable zones with GFP_HIGHUSER_MOVABLE.
But the movability aspect of the huge page could depend on other factors
than just migratability.  Movability in itself is a distinct property
which should not be tied with migratability alone.

This differentiates these two and implements an enhanced movability check
which also considers huge page size to determine if it is feasible to be
placed under a movable zone.  At present it just checks for gigantic pages
but going forward it can incorporate other enhanced checks.

Link: http://lkml.kernel.org/r/1545121450-1663-2-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Suggested-by: Michal Hocko <mhocko@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: I4f892af1b05c6812969f8c9fc99a41fbaed696d3
2022-11-12 11:22:15 +00:00
Matthew Wilcox
ef61c5c2e4 BACKPORT: mm: remove sysctl_extfrag_handler()
sysctl_extfrag_handler() neglects to propagate the return value from
proc_dointvec_minmax() to its caller.  It's a wrapper that doesn't need
to exist, so just use proc_dointvec_minmax() directly.

Link: http://lkml.kernel.org/r/20190104032557.3056-1-willy@infradead.org
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Reported-by: Aditya Pakki <pakki001@umn.edu>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: I7b0017a905054d90a12bba644f48a96a93588968
2022-11-12 11:22:15 +00:00
Alex Shi
15ff646fd5 BACKPORT: mm/lru: introduce TestClearPageLRU()
Currently lru_lock still guards both lru list and page's lru bit, that's
ok.  but if we want to use specific lruvec lock on the page, we need to
pin down the page's lruvec/memcg during locking.  Just taking lruvec lock
first may be undermined by the page's memcg charge/migration.  To fix this
problem, we will clear the lru bit out of locking and use it as pin down
action to block the page isolation in memcg changing.

So now a standard steps of page isolation is following:
	1, get_page(); 	       #pin the page avoid to be free
	2, TestClearPageLRU(); #block other isolation like memcg change
	3, spin_lock on lru_lock; #serialize lru list access
	4, delete page from lru list;

This patch start with the first part: TestClearPageLRU, which combines
PageLRU check and ClearPageLRU into a macro func TestClearPageLRU.  This
function will be used as page isolation precondition to prevent other
isolations some where else.  Then there are may !PageLRU page on lru list,
need to remove BUG() checking accordingly.

There 2 rules for lru bit now:
1, the lru bit still indicate if a page on lru list, just in some
   temporary moment(isolating), the page may have no lru bit when
   it's on lru list.  but the page still must be on lru list when the
   lru bit set.
2, have to remove lru bit before delete it from lru list.

As Andrew Morton mentioned this change would dirty cacheline for a page
which isn't on the LRU.  But the loss would be acceptable in Rong Chen
<rong.a.chen@intel.com> report:
https://lore.kernel.org/lkml/20200304090301.GB5972@shao2-debian/

Link: https://lkml.kernel.org/r/1604566549-62481-15-git-send-email-alex.shi@linux.alibaba.com
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Alexander Duyck <alexander.duyck@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Jann Horn <jannh@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mika Penttilä <mika.penttila@nextfour.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: I49a9f0c981e49efe5819f5556528b9db9181c18c
2022-11-12 11:22:15 +00:00
Miklos Szeredi
9440600419 BACKPORT: fuse: fix warning in tree_insert() and clean up writepage insertion
fuse_writepages_fill() calls tree_insert() with ap->num_pages = 0 which
triggers the following warning:

 WARNING: CPU: 1 PID: 17211 at fs/fuse/file.c:1728 tree_insert+0xab/0xc0 [fuse]
 RIP: 0010:tree_insert+0xab/0xc0 [fuse]
 Call Trace:
  fuse_writepages_fill+0x5da/0x6a0 [fuse]
  write_cache_pages+0x171/0x470
  fuse_writepages+0x8a/0x100 [fuse]
  do_writepages+0x43/0xe0

Fix up the warning and clean up the code around rb-tree insertion:

 - Rename tree_insert() to fuse_insert_writeback() and make it return the
   conflicting entry in case of failure

 - Re-add tree_insert() as a wrapper around fuse_insert_writeback()

 - Rename fuse_writepage_in_flight() to fuse_writepage_add() and reverse
   the meaning of the return value to mean

    + "true" in case the writepage entry was successfully added

    + "false" in case it was in-fligt queued on an existing writepage
       entry's auxiliary list or the existing writepage entry's temporary
       page updated

   Switch from fuse_find_writeback() + tree_insert() to
   fuse_insert_writeback()

 - Move setting orig_pages to before inserting/updating the entry; this may
   result in the orig_pages value being discarded later in case of an
   in-flight request

 - In case of a new writepage entry use fuse_writepage_add()
   unconditionally, only set data->wpa if the entry was added.

Fixes: 6b2fb79963fb ("fuse: optimize writepages search")
Reported-by: kernel test robot <rong.a.chen@intel.com>
Original-path-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: I2ae5f55cc95ce93a5a6892baca52c9cfd4a85503
2022-11-12 11:22:15 +00:00