Commit Graph

14427 Commits

Author SHA1 Message Date
Sultan Alsawaf
b75887b8a8 mm: Don't hog the CPU and zone lock in rmqueue_bulk()
There is noticeable scheduling latency and heavy zone lock contention
stemming from rmqueue_bulk's single hold of the zone lock while doing
its work, as seen with the preemptoff tracer. There's no actual need for
rmqueue_bulk() to hold the zone lock the entire time; it only does so
for supposed efficiency. As such, we can relax the zone lock and even
reschedule when IRQs are enabled in order to keep the scheduling delays
and zone lock contention at bay. Forward progress is still guaranteed,
as the zone lock can only be relaxed after page removal.

With this change, rmqueue_bulk() no longer appears as a serious offender
in the preemptoff tracer, and system latency is noticeably improved.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:20 +00:00
Sultan Alsawaf
83a8040ba8 mm: kmemleak: Don't require global debug options or debugfs
This allows kmemleak to function even when debugfs is globally disabled,
allowing kmemleak to give accurate results for CONFIG_DEBUG_FS=n.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:00 +00:00
UtsavBalar1231
94031b31da mm: damon: Enable DAMON-based reclaim by default
Change-Id: I23da7fed2549c5d8ba9b62874056e74a201186b3
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
2022-11-12 11:23:13 +00:00
Suren Baghdasaryan
c116699b25 UPSTREAM: mm: count time in drain_all_pages during direct reclaim as memory pressure
When page allocation in direct reclaim path fails, the system will make
one attempt to shrink per-cpu page lists and free pages from high alloc
reserves.  Draining per-cpu pages into buddy allocator can be a very slow
operation because it's done using workqueues and the task in direct
reclaim waits for all of them to finish before proceeding.  Currently this
time is not accounted as psi memory stall.

While testing mobile devices under extreme memory pressure, when
allocations are failing during direct reclaim, we notices that psi events
which would be expected in such conditions were not triggered.  After
profiling these cases it was determined that the reason for missing psi
events was that a big chunk of time spent in direct reclaim is not
accounted as memory stall, therefore psi would not reach the levels at
which an event is generated.  Further investigation revealed that the bulk
of that unaccounted time was spent inside drain_all_pages call.

A typical captured case when drain_all_pages path gets activated:

__alloc_pages_slowpath  took 44.644.613ns
    __perform_reclaim   took    751.668ns (1.7%)
    drain_all_pages     took 43.887.167ns (98.3%)

PSI in this case records the time spent in __perform_reclaim but ignores
drain_all_pages, IOW it misses 98.3% of the time spent in
__alloc_pages_slowpath.

Annotate __alloc_pages_direct_reclaim in its entirety so that delays from
handling page allocation failure in the direct reclaim path are accounted
as memory stall.

Link: https://lkml.kernel.org/r/20220223194812.1299646-1-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reported-by: Tim Murray <timmurray@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: SJD Ayy <3xp1r4t3d@gmail.com>
Change-Id: I1e0683dc8687fcce141ab5d9af3eb3ddfe297973
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
2022-11-12 11:23:12 +00:00
Minchan Kim
6f414bfe88 BACKPORT: mm: don't be stuck to rmap lock on reclaim path
The rmap locks(i_mmap_rwsem and anon_vma->root->rwsem) could be contended
under memory pressure if processes keep working on their vmas(e.g., fork,
mmap, munmap).  It makes reclaim path stuck.  In our real workload traces,
we see kswapd is waiting the lock for 300ms+(worst case, a sec) and it
makes other processes entering direct reclaim, which were also stuck on
the lock.

This patch makes lru aging path try_lock mode like shink_page_list so the
reclaim context will keep working with next lru pages without being stuck.
if it found the rmap lock contended, it rotates the page back to head of
lru in both active/inactive lrus to make them consistent behavior, which
is basic starting point rather than adding more heristic.

Since this patch introduces a new "contended" field as out-param along
with try_lock in-param in rmap_walk_control, it's not immutable any longer
if the try_lock is set so remove const keywords on rmap related functions.
Since rmap walking is already expensive operation, I doubt the const
would help sizable benefit( And we didn't have it until 5.17).

In a heavy app workload in Android, trace shows following statistics.  It
almost removes rmap lock contention from reclaim path.

Martin Liu reported:

Before:

   max_dur(ms)  min_dur(ms)  max-min(dur)ms  avg_dur(ms)  sum_dur(ms)  count blocked_function
         1632            0            1631   151.542173        31672    209  page_lock_anon_vma_read
          601            0             601   145.544681        28817    198  rmap_walk_file

After:

   max_dur(ms)  min_dur(ms)  max-min(dur)ms  avg_dur(ms)  sum_dur(ms)  count blocked_function
          NaN          NaN              NaN          NaN          NaN    0.0             NaN
            0            0                0     0.127645            1     12  rmap_walk_file

[minchan@kernel.org: add comment, per Matthew]
  Link: https://lkml.kernel.org/r/YnNqeB5tUf6LZ57b@google.com
Link: https://lkml.kernel.org/r/20220510215423.164547-1-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: John Dias <joaodias@google.com>
Cc: Tim Murray <timmurray@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Martin Liu <liumartin@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Conflicts:
	folio->page

(cherry picked from commit 6d4675e601357834dadd2ba1d803f6484596015c)
Bug: 239681156
Signed-off-by: Minchan Kim <minchan@google.com>
Change-Id: I0c63e0291120c8a1b5f2d83b8a7b210cb56c27a2
2022-11-12 11:23:12 +00:00
Charan Teja Kalla
77cd7bea25 UPSTREAM: mm: madvise: return correct bytes advised with process_madvise
Patch series "mm: madvise: return correct bytes processed with
process_madvise", v2.  With the process_madvise(), always choose to return
non zero processed bytes over an error.  This can help the user to know on
which VMA, passed in the 'struct iovec' vector list, is failed to advise
thus can take the decission of retrying/skipping on that VMA.

This patch (of 2):

The process_madvise() system call returns error even after processing some
VMA's passed in the 'struct iovec' vector list which leaves the user
confused to know where to restart the advise next.  It is also against
this syscall man page[1] documentation where it mentions that "return
value may be less than the total number of requested bytes, if an error
occurred after some iovec elements were already processed.".

Consider a user passed 10 VMA's in the 'struct iovec' vector list of which
9 are processed but one.  Then it just returns the error caused on that
failed VMA despite the first 9 VMA's processed, leaving the user confused
about on which VMA it is failed.  Returning the number of bytes processed
here can help the user to know which VMA it is failed on and thus can
retry/skip the advise on that VMA.

[1]https://man7.org/linux/man-pages/man2/process_madvise.2.html.

Link: https://lkml.kernel.org/r/cover.1647008754.git.quic_charante@quicinc.com
Link: https://lkml.kernel.org/r/125b61a0edcee5c2db8658aed9d06a43a19ccafc.1647008754.git.quic_charante@quicinc.com
Fixes: ecb8ac8b1f14("mm/madvise: introduce process_madvise() syscall: an external memory hinting API")
Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: Id743a28f41e28f4867bb5265be6ae7ef58bbc394
2022-11-12 11:23:12 +00:00
Yu Zhao
1e3683dd22 mglru: fixes
Signed-off-by: Yu Zhao <yuzhao@google.com>
Change-Id: I09999793fd6481bcf2acefb29c57d5d40c0b1427
2022-11-12 11:23:12 +00:00
UtsavBalar1231
3e27214d50 mm: vmscan: Adapt to use new mm_walk_ops API
commit: pagewalk: separate function pointers from iterator data
allows us to add operations inside mm_walk_ops structure so refactor the code
to use it instead of mm_walk structure.

Change-Id: I56f47aadc961bf96617b6ef5af3b2b256133c890
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
2022-11-12 11:23:11 +00:00
Suren Baghdasaryan
84ebfc164d UPSTREAM: mm/madvise: replace ptrace attach requirement for process_madvise
process_madvise currently requires ptrace attach capability.
PTRACE_MODE_ATTACH gives one process complete control over another
process.  It effectively removes the security boundary between the two
processes (in one direction).  Granting ptrace attach capability even to a
system process is considered dangerous since it creates an attack surface.
This severely limits the usage of this API.

The operations process_madvise can perform do not affect the correctness
of the operation of the target process; they only affect where the data is
physically located (and therefore, how fast it can be accessed).  What we
want is the ability for one process to influence another process in order
to optimize performance across the entire system while leaving the
security boundary intact.

Replace PTRACE_MODE_ATTACH with a combination of PTRACE_MODE_READ and
CAP_SYS_NICE.  PTRACE_MODE_READ to prevent leaking ASLR metadata and
CAP_SYS_NICE for influencing process performance.

Link: https://lkml.kernel.org/r/20210303185807.2160264-1-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jeff Vander Stoep <jeffv@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tim Murray <timmurray@google.com>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: James Morris <jmorris@namei.org>
Cc: <stable@vger.kernel.org>	[5.10+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: Id57cd1b73d7e0e2c020bde09137d354243abeee9
2022-11-12 11:23:11 +00:00
Minchan Kim
e0d1393539 UPSTREAM: mm/madvise: remove racy mm ownership check
Jann spotted the security hole due to race of mm ownership check.

If the task is sharing the mm_struct but goes through execve() before
mm_access(), it could skip process_madvise_behavior_valid check.  That
makes *any advice hint* to reach into the remote process.

This patch removes the mm ownership check.  With it, it will lose the
ability that local process could give *any* advice hint with vector
interface for some reason (e.g., performance).  Since there is no
concrete example in upstream yet, it would be better to remove the
abiliity at this moment and need to review when such new advice comes
up.

Fixes: ecb8ac8b1f14 ("mm/madvise: introduce process_madvise() syscall: an external memory hinting API")
Reported-by: Jann Horn <jannh@google.com>
Suggested-by: Jann Horn <jannh@google.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: Ia41d642752fcbc689d2dd6a461aeebd5c5bdc97f
2022-11-12 11:23:11 +00:00
Eric Dumazet
26903478fa UPSTREAM: mm/madvise: fix memory leak from process_madvise
The early return in process_madvise() will produce a memory leak.

Fix it.

Fixes: ecb8ac8b1f14 ("mm/madvise: introduce process_madvise() syscall: an external memory hinting API")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: https://lkml.kernel.org/r/20201116155132.GA3805951@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: Iaac716b6096f0bf9e4837658dbbb4ff75294b86d
2022-11-12 11:23:10 +00:00
Minchan Kim
1c21fc048c BACKPORT: mm/madvise: introduce process_madvise() syscall: an external memory hinting API
There is usecase that System Management Software(SMS) want to give a
memory hint like MADV_[COLD|PAGEEOUT] to other processes and in the
case of Android, it is the ActivityManagerService.

The information required to make the reclaim decision is not known to the
app.  Instead, it is known to the centralized userspace
daemon(ActivityManagerService), and that daemon must be able to initiate
reclaim on its own without any app involvement.

To solve the issue, this patch introduces a new syscall
process_madvise(2).  It uses pidfd of an external process to give the
hint.  It also supports vector address range because Android app has
thousands of vmas due to zygote so it's totally waste of CPU and power if
we should call the syscall one by one for each vma.(With testing 2000-vma
syscall vs 1-vector syscall, it showed 15% performance improvement.  I
think it would be bigger in real practice because the testing ran very
cache friendly environment).

Another potential use case for the vector range is to amortize the cost
ofTLB shootdowns for multiple ranges when using MADV_DONTNEED; this could
benefit users like TCP receive zerocopy and malloc implementations.  In
future, we could find more usecases for other advises so let's make it
happens as API since we introduce a new syscall at this moment.  With
that, existing madvise(2) user could replace it with process_madvise(2)
with their own pid if they want to have batch address ranges support
feature.

ince it could affect other process's address range, only privileged
process(PTRACE_MODE_ATTACH_FSCREDS) or something else(e.g., being the same
UID) gives it the right to ptrace the process could use it successfully.
The flag argument is reserved for future use if we need to extend the API.

I think supporting all hints madvise has/will supported/support to
process_madvise is rather risky.  Because we are not sure all hints make
sense from external process and implementation for the hint may rely on
the caller being in the current context so it could be error-prone.  Thus,
I just limited hints as MADV_[COLD|PAGEOUT] in this patch.

If someone want to add other hints, we could hear the usecase and review
it for each hint.  It's safer for maintenance rather than introducing a
buggy syscall but hard to fix it later.

So finally, the API is as follows,

      ssize_t process_madvise(int pidfd, const struct iovec *iovec,
                unsigned long vlen, int advice, unsigned int flags);

    DESCRIPTION
      The process_madvise() system call is used to give advice or directions
      to the kernel about the address ranges from external process as well as
      local process. It provides the advice to address ranges of process
      described by iovec and vlen. The goal of such advice is to improve
      system or application performance.

      The pidfd selects the process referred to by the PID file descriptor
      specified in pidfd. (See pidofd_open(2) for further information)

      The pointer iovec points to an array of iovec structures, defined in
      <sys/uio.h> as:

        struct iovec {
            void *iov_base;         /* starting address */
            size_t iov_len;         /* number of bytes to be advised */
        };

      The iovec describes address ranges beginning at address(iov_base)
      and with size length of bytes(iov_len).

      The vlen represents the number of elements in iovec.

      The advice is indicated in the advice argument, which is one of the
      following at this moment if the target process specified by pidfd is
      external.

        MADV_COLD
        MADV_PAGEOUT

      Permission to provide a hint to external process is governed by a
      ptrace access mode PTRACE_MODE_ATTACH_FSCREDS check; see ptrace(2).

      The process_madvise supports every advice madvise(2) has if target
      process is in same thread group with calling process so user could
      use process_madvise(2) to extend existing madvise(2) to support
      vector address ranges.

    RETURN VALUE
      On success, process_madvise() returns the number of bytes advised.
      This return value may be less than the total number of requested
      bytes, if an error occurred. The caller should check return value
      to determine whether a partial advice occurred.

FAQ:

Q.1 - Why does any external entity have better knowledge?

Quote from Sandeep

"For Android, every application (including the special SystemServer)
are forked from Zygote.  The reason of course is to share as many
libraries and classes between the two as possible to benefit from the
preloading during boot.

After applications start, (almost) all of the APIs end up calling into
this SystemServer process over IPC (binder) and back to the
application.

In a fully running system, the SystemServer monitors every single
process periodically to calculate their PSS / RSS and also decides
which process is "important" to the user for interactivity.

So, because of how these processes start _and_ the fact that the
SystemServer is looping to monitor each process, it does tend to *know*
which address range of the application is not used / useful.

Besides, we can never rely on applications to clean things up
themselves.  We've had the "hey app1, the system is low on memory,
please trim your memory usage down" notifications for a long time[1].
They rely on applications honoring the broadcasts and very few do.

So, if we want to avoid the inevitable killing of the application and
restarting it, some way to be able to tell the OS about unimportant
memory in these applications will be useful.

- ssp

Q.2 - How to guarantee the race(i.e., object validation) between when
giving a hint from an external process and get the hint from the target
process?

process_madvise operates on the target process's address space as it
exists at the instant that process_madvise is called.  If the space
target process can run between the time the process_madvise process
inspects the target process address space and the time that
process_madvise is actually called, process_madvise may operate on
memory regions that the calling process does not expect.  It's the
responsibility of the process calling process_madvise to close this
race condition.  For example, the calling process can suspend the
target process with ptrace, SIGSTOP, or the freezer cgroup so that it
doesn't have an opportunity to change its own address space before
process_madvise is called.  Another option is to operate on memory
regions that the caller knows a priori will be unchanged in the target
process.  Yet another option is to accept the race for certain
process_madvise calls after reasoning that mistargeting will do no
harm.  The suggested API itself does not provide synchronization.  It
also apply other APIs like move_pages, process_vm_write.

The race isn't really a problem though.  Why is it so wrong to require
that callers do their own synchronization in some manner?  Nobody
objects to write(2) merely because it's possible for two processes to
open the same file and clobber each other's writes --- instead, we tell
people to use flock or something.  Think about mmap.  It never
guarantees newly allocated address space is still valid when the user
tries to access it because other threads could unmap the memory right
before.  That's where we need synchronization by using other API or
design from userside.  It shouldn't be part of API itself.  If someone
needs more fine-grained synchronization rather than process level,
there were two ideas suggested - cookie[2] and anon-fd[3].  Both are
applicable via using last reserved argument of the API but I don't
think it's necessary right now since we have already ways to prevent
the race so don't want to add additional complexity with more
fine-grained optimization model.

To make the API extend, it reserved an unsigned long as last argument
so we could support it in future if someone really needs it.

Q.3 - Why doesn't ptrace work?

Injecting an madvise in the target process using ptrace would not work
for us because such injected madvise would have to be executed by the
target process, which means that process would have to be runnable and
that creates the risk of the abovementioned race and hinting a wrong
VMA.  Furthermore, we want to act the hint in caller's context, not the
callee's, because the callee is usually limited in cpuset/cgroups or
even freezed state so they can't act by themselves quick enough, which
causes more thrashing/kill.  It doesn't work if the target process are
ptraced(e.g., strace, debugger, minidump) because a process can have at
most one ptracer.

[1] https://developer.android.com/topic/performance/memory"

[2] process_getinfo for getting the cookie which is updated whenever
    vma of process address layout are changed - Daniel Colascione -
    https://lore.kernel.org/lkml/20190520035254.57579-1-minchan@kernel.org/T/#m7694416fd179b2066a2c62b5b139b14e3894e224

[3] anonymous fd which is used for the object(i.e., address range)
    validation - Michal Hocko -
    https://lore.kernel.org/lkml/20200120112722.GY18451@dhcp22.suse.cz/

[minchan@kernel.org: fix process_madvise build break for arm64]
  Link: http://lkml.kernel.org/r/20200303145756.GA219683@google.com
[minchan@kernel.org: fix build error for mips of process_madvise]
  Link: http://lkml.kernel.org/r/20200508052517.GA197378@google.com
[akpm@linux-foundation.org: fix patch ordering issue]
[akpm@linux-foundation.org: fix arm64 whoops]
[minchan@kernel.org: make process_madvise() vlen arg have type size_t, per Florian]
[akpm@linux-foundation.org: fix i386 build]
[sfr@canb.auug.org.au: fix syscall numbering]
  Link: https://lkml.kernel.org/r/20200905142639.49fc3f1a@canb.auug.org.au
[sfr@canb.auug.org.au: madvise.c needs compat.h]
  Link: https://lkml.kernel.org/r/20200908204547.285646b4@canb.auug.org.au
[minchan@kernel.org: fix mips build]
  Link: https://lkml.kernel.org/r/20200909173655.GC2435453@google.com
[yuehaibing@huawei.com: remove duplicate header which is included twice]
  Link: https://lkml.kernel.org/r/20200915121550.30584-1-yuehaibing@huawei.com
[minchan@kernel.org: do not use helper functions for process_madvise]
  Link: https://lkml.kernel.org/r/20200921175539.GB387368@google.com
[akpm@linux-foundation.org: pidfd_get_pid() gained an argument]
[sfr@canb.auug.org.au: fix up for "iov_iter: transparently handle compat iovecs in import_iovec"]
  Link: https://lkml.kernel.org/r/20200928212542.468e1fef@canb.auug.org.au

Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Christian Brauner <christian@brauner.io>
Cc: Daniel Colascione <dancol@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Dias <joaodias@google.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oleksandr Natalenko <oleksandr@redhat.com>
Cc: Sandeep Patil <sspatil@google.com>
Cc: SeongJae Park <sj38.park@gmail.com>
Cc: SeongJae Park <sjpark@amazon.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Sonny Rao <sonnyrao@google.com>
Cc: Tim Murray <timmurray@google.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Florian Weimer <fw@deneb.enyo.de>
Cc: <linux-man@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200302193630.68771-3-minchan@kernel.org
Link: http://lkml.kernel.org/r/20200508183320.GA125527@google.com
Link: http://lkml.kernel.org/r/20200622192900.22757-4-minchan@kernel.org
Link: https://lkml.kernel.org/r/20200901000633.1920247-4-minchan@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: I8aaf9794564361e2f9c42fb139b38bfe7b63dc65
2022-11-12 11:23:10 +00:00
Linus Torvalds
39149288d9 UPSTREAM: mm: check that mm is still valid in madvise()
IORING_OP_MADVISE can end up basically doing mprotect() on the VM of
another process, which means that it can race with our crazy core dump
handling which accesses the VM state without holding the mmap_sem
(because it incorrectly thinks that it is the final user).

This is clearly a core dumping problem, but we've never fixed it the
right way, and instead have the notion of "check that the mm is still
ok" using mmget_still_valid() after getting the mmap_sem for writing in
any situation where we're not the original VM thread.

See commit 04f5866e41fb ("coredump: fix race condition between
mmget_not_zero()/get_task_mm() and core dumping") for more background on
this whole mmget_still_valid() thing.  You might want to have a barf bag
handy when you do.

We're discussing just fixing this properly in the only remaining core
dumping routines.  But even if we do that, let's make do_madvise() do
the right thing, and then when we fix core dumping, we can remove all
these mmget_still_valid() checks.

Change-Id: I503f628ca80bd759e101d03a96f3600294a0726a
Reported-and-tested-by: Jann Horn <jannh@google.com>
Fixes: c1ca757bd6f4 ("io_uring: add IORING_OP_MADVISE")
Acked-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Git-commit: d208cbe90b0f39abbd019a383514306997ba47e3
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
2022-11-12 11:23:10 +00:00
Michal Hocko
fb870bc27a UPSTREAM: mm: do not allow MADV_PAGEOUT for CoW pages
commit 12e967fd8e4e6c3d275b4c69c890adc838891300 upstream.

Jann has brought up a very interesting point [1].  While shared pages
are excluded from MADV_PAGEOUT normally, CoW pages can be easily
reclaimed that way.  This can lead to all sorts of hard to debug
problems.  E.g.  performance problems outlined by Daniel [2].

There are runtime environments where there is a substantial memory
shared among security domains via CoW memory and a easy to reclaim way
of that memory, which MADV_{COLD,PAGEOUT} offers, can lead to either
performance degradation in for the parent process which might be more
privileged or even open side channel attacks.

The feasibility of the latter is not really clear to me TBH but there is
no real reason for exposure at this stage.  It seems there is no real
use case to depend on reclaiming CoW memory via madvise at this stage so
it is much easier to simply disallow it and this is what this patch
does.  Put it simply MADV_{PAGEOUT,COLD} can operate only on the
exclusively owned memory which is a straightforward semantic.

[1] http://lkml.kernel.org/r/CAG48ez0G3JkMq61gUmyQAaCq=_TwHbi1XKzWRooxZkv08PQKuw@mail.gmail.com
[2] http://lkml.kernel.org/r/CAKOZueua_v8jHCpmEtTB6f3i9e2YnmX4mqdYVWhV4E=Z-n+zRQ@mail.gmail.com

Fixes: 9c276cc65a58 ("mm: introduce MADV_COLD")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Daniel Colascione <dancol@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: "Joel Fernandes (Google)" <joel@joelfernandes.org>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200312082248.GS23944@dhcp22.suse.cz
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: I188fc0c8e2d25aa4c9eb451bc2989d735254bf60
2022-11-12 11:23:10 +00:00
zhong jiang
3958c5d71e UPSTREAM: mm: fix trying to reclaim unevictable lru page when calling madvise_pageout
Recently, I hit the following issue when running upstream.

  kernel BUG at mm/vmscan.c:1521!
  invalid opcode: 0000 [#1] SMP KASAN PTI
  CPU: 0 PID: 23385 Comm: syz-executor.6 Not tainted 5.4.0-rc4+ #1
  RIP: 0010:shrink_page_list+0x12b6/0x3530 mm/vmscan.c:1521
  Call Trace:
   reclaim_pages+0x499/0x800 mm/vmscan.c:2188
   madvise_cold_or_pageout_pte_range+0x58a/0x710 mm/madvise.c:453
   walk_pmd_range mm/pagewalk.c:53 [inline]
   walk_pud_range mm/pagewalk.c:112 [inline]
   walk_p4d_range mm/pagewalk.c:139 [inline]
   walk_pgd_range mm/pagewalk.c:166 [inline]
   __walk_page_range+0x45a/0xc20 mm/pagewalk.c:261
   walk_page_range+0x179/0x310 mm/pagewalk.c:349
   madvise_pageout_page_range mm/madvise.c:506 [inline]
   madvise_pageout+0x1f0/0x330 mm/madvise.c:542
   madvise_vma mm/madvise.c:931 [inline]
   __do_sys_madvise+0x7d2/0x1600 mm/madvise.c:1113
   do_syscall_64+0x9f/0x4c0 arch/x86/entry/common.c:290
   entry_SYSCALL_64_after_hwframe+0x49/0xbe

madvise_pageout() accesses the specified range of the vma and isolates
them, then runs shrink_page_list() to reclaim its memory.  But it also
isolates the unevictable pages to reclaim.  Hence, we can catch the
cases in shrink_page_list().

The root cause is that we scan the page tables instead of specific LRU
list.  and so we need to filter out the unevictable lru pages from our
end.

Link: http://lkml.kernel.org/r/1572616245-18946-1-git-send-email-zhongjiang@huawei.com
Fixes: 1a4e58cce84e ("mm: introduce MADV_PAGEOUT")
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: I84ecf1ed9f7d9de14df1798922183fb637d75adc
2022-11-12 11:23:10 +00:00
Minchan Kim
3c9057067d UPSTREAM: mm: factor out common parts between MADV_COLD and MADV_PAGEOUT
There are many common parts between MADV_COLD and MADV_PAGEOUT.
This patch factor them out to save code duplication.

Link: http://lkml.kernel.org/r/20190726023435.214162-6-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Daniel Colascione <dancol@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: kbuild test robot <lkp@intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Oleksandr Natalenko <oleksandr@redhat.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Sonny Rao <sonnyrao@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tim Murray <timmurray@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: I625bee817bffdb970838f885247da02dad1aebca
2022-11-12 11:23:09 +00:00
Minchan Kim
25b4dabc2b BACKPORT: mm: introduce MADV_PAGEOUT
When a process expects no accesses to a certain memory range for a long
time, it could hint kernel that the pages can be reclaimed instantly but
data should be preserved for future use.  This could reduce workingset
eviction so it ends up increasing performance.

This patch introduces the new MADV_PAGEOUT hint to madvise(2) syscall.
MADV_PAGEOUT can be used by a process to mark a memory range as not
expected to be used for a long time so that kernel reclaims *any LRU*
pages instantly.  The hint can help kernel in deciding which pages to
evict proactively.

A note: It doesn't apply SWAP_CLUSTER_MAX LRU page isolation limit
intentionally because it's automatically bounded by PMD size.  If PMD
size(e.g., 256) makes some trouble, we could fix it later by limit it to
SWAP_CLUSTER_MAX[1].

- man-page material

MADV_PAGEOUT (since Linux x.x)

Do not expect access in the near future so pages in the specified
regions could be reclaimed instantly regardless of memory pressure.
Thus, access in the range after successful operation could cause
major page fault but never lose the up-to-date contents unlike
MADV_DONTNEED. Pages belonging to a shared mapping are only processed
if a write access is allowed for the calling process.

MADV_PAGEOUT cannot be applied to locked pages, Huge TLB pages, or
VM_PFNMAP pages.

[1] https://lore.kernel.org/lkml/20190710194719.GS29695@dhcp22.suse.cz/

[minchan@kernel.org: clear PG_active on MADV_PAGEOUT]
  Link: http://lkml.kernel.org/r/20190802200643.GA181880@google.com
[akpm@linux-foundation.org: resolve conflicts with hmm.git]
Link: http://lkml.kernel.org/r/20190726023435.214162-5-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reported-by: kbuild test robot <lkp@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Chris Zankel <chris@zankel.net>
Cc: Daniel Colascione <dancol@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Oleksandr Natalenko <oleksandr@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Sonny Rao <sonnyrao@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tim Murray <timmurray@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: I65f078fbd0e1ddbd645ada85e19ad02f3bd5f94b
2022-11-12 11:23:09 +00:00
Minchan Kim
63009fceb9 BACKPORT: mm: introduce MADV_COLD
Patch series "Introduce MADV_COLD and MADV_PAGEOUT", v7.

- Background

The Android terminology used for forking a new process and starting an app
from scratch is a cold start, while resuming an existing app is a hot
start.  While we continually try to improve the performance of cold
starts, hot starts will always be significantly less power hungry as well
as faster so we are trying to make hot start more likely than cold start.

To increase hot start, Android userspace manages the order that apps
should be killed in a process called ActivityManagerService.
ActivityManagerService tracks every Android app or service that the user
could be interacting with at any time and translates that into a ranked
list for lmkd(low memory killer daemon).  They are likely to be killed by
lmkd if the system has to reclaim memory.  In that sense they are similar
to entries in any other cache.  Those apps are kept alive for
opportunistic performance improvements but those performance improvements
will vary based on the memory requirements of individual workloads.

- Problem

Naturally, cached apps were dominant consumers of memory on the system.
However, they were not significant consumers of swap even though they are
good candidate for swap.  Under investigation, swapping out only begins
once the low zone watermark is hit and kswapd wakes up, but the overall
allocation rate in the system might trip lmkd thresholds and cause a
cached process to be killed(we measured performance swapping out vs.
zapping the memory by killing a process.  Unsurprisingly, zapping is 10x
times faster even though we use zram which is much faster than real
storage) so kill from lmkd will often satisfy the high zone watermark,
resulting in very few pages actually being moved to swap.

- Approach

The approach we chose was to use a new interface to allow userspace to
proactively reclaim entire processes by leveraging platform information.
This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages
that are known to be cold from userspace and to avoid races with lmkd by
reclaiming apps as soon as they entered the cached state.  Additionally,
it could provide many chances for platform to use much information to
optimize memory efficiency.

To achieve the goal, the patchset introduce two new options for madvise.
One is MADV_COLD which will deactivate activated pages and the other is
MADV_PAGEOUT which will reclaim private pages instantly.  These new
options complement MADV_DONTNEED and MADV_FREE by adding non-destructive
ways to gain some free memory space.  MADV_PAGEOUT is similar to
MADV_DONTNEED in a way that it hints the kernel that memory region is not
currently needed and should be reclaimed immediately; MADV_COLD is similar
to MADV_FREE in a way that it hints the kernel that memory region is not
currently needed and should be reclaimed when memory pressure rises.

This patch (of 5):

When a process expects no accesses to a certain memory range, it could
give a hint to kernel that the pages can be reclaimed when memory pressure
happens but data should be preserved for future use.  This could reduce
workingset eviction so it ends up increasing performance.

This patch introduces the new MADV_COLD hint to madvise(2) syscall.
MADV_COLD can be used by a process to mark a memory range as not expected
to be used in the near future.  The hint can help kernel in deciding which
pages to evict early during memory pressure.

It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves

	active file page -> inactive file LRU
	active anon page -> inacdtive anon LRU

Unlike MADV_FREE, it doesn't move active anonymous pages to inactive file
LRU's head because MADV_COLD is a little bit different symantic.
MADV_FREE means it's okay to discard when the memory pressure because the
content of the page is *garbage* so freeing such pages is almost zero
overhead since we don't need to swap out and access afterward causes just
minor fault.  Thus, it would make sense to put those freeable pages in
inactive file LRU to compete other used-once pages.  It makes sense for
implmentaion point of view, too because it's not swapbacked memory any
longer until it would be re-dirtied.  Even, it could give a bonus to make
them be reclaimed on swapless system.  However, MADV_COLD doesn't mean
garbage so reclaiming them requires swap-out/in in the end so it's bigger
cost.  Since we have designed VM LRU aging based on cost-model, anonymous
cold pages would be better to position inactive anon's LRU list, not file
LRU.  Furthermore, it would help to avoid unnecessary scanning if system
doesn't have a swap device.  Let's start simpler way without adding
complexity at this moment.  However, keep in mind, too that it's a caveat
that workloads with a lot of pages cache are likely to ignore MADV_COLD on
anonymous memory because we rarely age anonymous LRU lists.

* man-page material

MADV_COLD (since Linux x.x)

Pages in the specified regions will be treated as less-recently-accessed
compared to pages in the system with similar access frequencies.  In
contrast to MADV_FREE, the contents of the region are preserved regardless
of subsequent writes to pages.

MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP
pages.

[akpm@linux-foundation.org: resolve conflicts with hmm.git]
Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reported-by: kbuild test robot <lkp@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Chris Zankel <chris@zankel.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Daniel Colascione <dancol@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Oleksandr Natalenko <oleksandr@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Sonny Rao <sonnyrao@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tim Murray <timmurray@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: I088a174c6fe7776229a0e557bc66c205a9b4ce5b
2022-11-12 11:23:09 +00:00
Mike Rapoport
921374f9c8 UPSTREAM: mm/madvise: reduce code duplication in error handling paths
madvise_behavior() converts -ENOMEM to -EAGAIN in several places using
identical code.

Move that code to a common error handling path.

No functional changes.

Link: http://lkml.kernel.org/r/1564640896-1210-1-git-send-email-rppt@linux.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Pankaj Gupta <pagupta@redhat.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: Ia8646d7a9077f9f35427c4c9af4b87c8b8010648
2022-11-12 11:23:09 +00:00
Christoph Hellwig
e75292735d BACKPORT: pagewalk: separate function pointers from iterator data
The mm_walk structure currently mixed data and code.  Split out the
operations vectors into a new mm_walk_ops structure, and while we are
changing the API also declare the mm_walk structure inside the
walk_page_range and walk_page_vma functions.

Based on patch from Linus Torvalds.

Link: https://lore.kernel.org/r/20190828141955.22210-3-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Thomas Hellstrom <thellstrom@vmware.com>
Reviewed-by: Steven Price <steven.price@arm.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: I0fce77a7f34d9979ff5798855516fdc403b21d27

Conflicts:
	arch/openrisc/kernel/dma.c
	mm/hmm.c
	mm/madvise.c
	mm/migrate.c
	mm/pagewalk.c
2022-11-12 11:23:09 +00:00
Christoph Hellwig
1c83f8bcfc BACKPORT: mm: split out a new pagewalk.h header from mm.h
Add a new header for the two handful of users of the walk_page_range /
walk_page_vma interface instead of polluting all users of mm.h with it.

Link: https://lore.kernel.org/r/20190828141955.22210-2-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Thomas Hellstrom <thellstrom@vmware.com>
Reviewed-by: Steven Price <steven.price@arm.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: I52777c8a3ec57547071137af37d932fcadc551c2
2022-11-12 11:23:08 +00:00
Peter Zijlstra
e86ff4eef6 BACKPORT: mm/mmu_gather: invalidate TLB correctly on batch allocation failure and flush
commit 0ed1325967ab5f7a4549a2641c6ebe115f76e228 upstream.

Architectures for which we have hardware walkers of Linux page table
should flush TLB on mmu gather batch allocation failures and batch flush.
Some architectures like POWER supports multiple translation modes (hash
and radix) and in the case of POWER only radix translation mode needs the
above TLBI.  This is because for hash translation mode kernel wants to
avoid this extra flush since there are no hardware walkers of linux page
table.  With radix translation, the hardware also walks linux page table
and with that, kernel needs to make sure to TLB invalidate page walk cache
before page table pages are freed.

More details in commit d86564a2f0 ("mm/tlb, x86/mm: Support invalidating
TLB caches for RCU_TABLE_FREE")

The changes to sparc are to make sure we keep the old behavior since we
are now removing HAVE_RCU_TABLE_NO_INVALIDATE.  The default value for
tlb_needs_table_invalidate is to always force an invalidate and sparc can
avoid the table invalidate.  Hence we define tlb_needs_table_invalidate to
false for sparc architecture.

Link: http://lkml.kernel.org/r/20200116064531.483522-3-aneesh.kumar@linux.ibm.com
Fixes: a46cc7a90f ("powerpc/mm/radix: Improve TLB/PWC flushes")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>	[powerpc]
Cc: <stable@vger.kernel.org>	[4.14+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: Ic3ae446cacefcf4cdeefa549260a8317da76cdc6
2022-11-12 11:23:07 +00:00
Peter Zijlstra
f1bf332d86 BACKPORT: asm-generic/tlb: Remove tlb_table_flush()
There are no external users of this API (nor should there be); remove it.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: I3f195dd2de4eae907f258ab896f3e2d327c6bc3f
2022-11-12 11:23:07 +00:00
Peter Zijlstra
10e8b913b8 BACKPORT: asm-generic/tlb: Remove tlb_flush_mmu_free()
As the comment notes; it is a potentially dangerous operation. Just
use tlb_flush_mmu(), that will skip the (double) TLB invalidate if
it really isn't needed anyway.

No change in behavior intended.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: I163973fc009d4f74c073f8f0727be4459e3e06c7
2022-11-12 11:23:06 +00:00
Peter Zijlstra
3ec443aa7e BACKPORT: asm-generic/tlb: Remove CONFIG_HAVE_GENERIC_MMU_GATHER
Since all architectures are now using it, it is redundant.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: I94ad445b58b3e93594cb63e9c297a247284ba5b7
2022-11-12 11:23:06 +00:00
Peter Zijlstra
e4eff8566c BACKPORT: asm-generic/tlb: Remove arch_tlb*_mmu()
Now that all architectures are converted to the generic code, remove
the arch hooks.

No change in behavior intended.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: I30655d00eb84558368e4e3ae6d3442a1e9c53cea
2022-11-12 11:23:06 +00:00
Martin Schwidefsky
66f65458ee BACKPORT: asm-generic/tlb: Introduce CONFIG_HAVE_MMU_GATHER_NO_GATHER=y
Add the Kconfig option HAVE_MMU_GATHER_NO_GATHER to the generic
mmu_gather code. If the option is set the mmu_gather will not
track individual pages for delayed page free anymore. A platform
that enables the option needs to provide its own implementation
of the __tlb_remove_page_size() function to free pages.

No change in behavior intended.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: aneesh.kumar@linux.vnet.ibm.com
Cc: heiko.carstens@de.ibm.com
Cc: linux@armlinux.org.uk
Cc: npiggin@gmail.com
Link: http://lkml.kernel.org/r/20180918125151.31744-2-schwidefsky@de.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: I8b967fff9864d99e814387200abc290974c44d7a
2022-11-12 11:23:06 +00:00
Peter Zijlstra
0c534c1fe0 UPSTREAM: asm-generic/tlb, arch: Invert CONFIG_HAVE_RCU_TABLE_INVALIDATE
Make issuing a TLB invalidate for page-table pages the normal case.

The reason is twofold:

 - too many invalidates is safer than too few,
 - most architectures use the linux page-tables natively
   and would thus require this.

Make it an opt-out, instead of an opt-in.

No change in behavior intended.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: I375d2c679328ac4027e930e21ff2ec05f18a7b84
2022-11-12 11:23:04 +00:00
Peter Zijlstra
920b99018b BACKPORT: x86/mm: Page size aware flush_tlb_mm_range()
Use the new tlb_get_unmap_shift() to determine the stride of the
INVLPG loop.

Cc: Nick Piggin <npiggin@gmail.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: Ide59c57cbe4bcdbd2116c187bb89f24120d36d24
2022-11-12 11:23:03 +00:00
Peter Zijlstra
faeb134753 BACKPORT: asm-generic/tlb, arch: Provide CONFIG_HAVE_MMU_GATHER_PAGE_SIZE
Move the mmu_gather::page_size things into the generic code instead of
PowerPC specific bits.

No change in behavior intended.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: I48f1ee515438bbf89e80a5e3c489560ce139eb87
2022-11-12 11:23:03 +00:00
Peter Zijlstra
f814c5e8b6 BACKPORT: mm/memory: Move mmu_gather and TLB invalidation code into its own file
In preparation for maintaining the mmu_gather code as its own entity,
move the implementation out of memory.c and into its own file.

[UtsavBalar1231]: Adapt to sm8250 4.19

Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: I4fe8e61514e4304a45b91dad85520a08c86b7b96
2022-11-12 11:23:02 +00:00
Will Deacon
982501ad0a UPSTREAM: asm-generic/tlb: Track which levels of the page tables have been cleared
It is common for architectures with hugepage support to require only a
single TLB invalidation operation per hugepage during unmap(), rather than
iterating through the mapping at a PAGE_SIZE increment. Currently,
however, the level in the page table where the unmap() operation occurs
is not stored in the mmu_gather structure, therefore forcing
architectures to issue additional TLB invalidation operations or to give
up and over-invalidate by e.g. invalidating the entire TLB.

Ideally, we could add an interval rbtree to the mmu_gather structure,
which would allow us to associate the correct mapping granule with the
various sub-mappings within the range being invalidated. However, this
is costly in terms of book-keeping and memory management, so instead we
approximate by keeping track of the page table levels that are cleared
and provide a means to query the smallest granule required for invalidation.

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: I60de38ecb7e95942cbc03e00861113427eb4dd3f
2022-11-12 11:23:02 +00:00
Hailong Tu
6b31703551 FROMLIST: mm/damon/reclaim: Fix the timer always stays active
The timer stays active even if the reclaim mechanism is never enabled.
It is unnecessary overhead can be completely avoided by using module_param_cb() for enabled flag.

Signed-off-by: Hailong Tu <tuhailong@gmail.com>

Link: https://lore.kernel.org/all/20220421125910.1052459-1-tuhailong@gmail.com/

Bug: 228223814
Signed-off-by: Hailong Tu <tuhailong@oppo.com>
Change-Id: I77591e41ce424ce16a4a5c70e7a86cdae996a354
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
2022-11-12 11:23:02 +00:00
SeongJae Park
6f4af7419b UPSTREAM: mm/damon: hide kernel pointer from tracepoint event
DAMON's virtual address spaces monitoring primitive uses 'struct pid *'
of the target process as its monitoring target id.  The kernel address
is exposed as-is to the user space via the DAMON tracepoint,
'damon_aggregated'.

Though primarily only privileged users are allowed to access that, it
would be better to avoid unnecessarily exposing kernel pointers so.
Because the trace result is only required to be able to distinguish each
target, we aren't need to use the pointer as-is.

This makes the tracepoint to use the index of the target in the
context's targets list as its id in the tracepoint, to hide the kernel
space address.

Link: https://lkml.kernel.org/r/20211229131016.23641-5-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

(cherry picked from commit 76fd0285b447991267e838842c0be7395eb454bb)

Bug: 228223814
Signed-off-by: Hailong Tu <tuhailong@oppo.com>
Change-Id: Iee4a6f56cf9bf5f61fb95f438dfc3316c10198e5
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
2022-11-12 11:23:01 +00:00
SeongJae Park
f91236cb9c UPSTREAM: mm/damon/vaddr: hide kernel pointer from damon_va_three_regions() failure log
The failure log message for 'damon_va_three_regions()' prints the target
id, which is a 'struct pid' pointer in the case.  To avoid exposing the
kernel pointer via the log, this makes the log to use the index of the
target in the context's targets list instead.

Link: https://lkml.kernel.org/r/20211229131016.23641-4-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

(cherry picked from commit 962fe7a6b1b2f9deb1b31b3344afa3b11afdf7ab)

Bug: 228223814
Signed-off-by: Hailong Tu <tuhailong@oppo.com>
Change-Id: I6b9b09a9b7024630f05ec41db030006d63f47ce1
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
2022-11-12 11:23:01 +00:00
SeongJae Park
6dc1d52c1d UPSTREAM: mm/damon/vaddr: use pr_debug() for damon_va_three_regions() failure logging
Failure of 'damon_va_three_regions()' is logged using 'pr_err()'.  But,
the function can fail in legal situations.  To avoid making users be
surprised and to keep the kernel clean, this makes the log to be printed
using 'pr_debug()'.

Link: https://lkml.kernel.org/r/20211229131016.23641-3-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

(cherry picked from commit 251403f19aab6a122f4dcfb14149814e85564202)

Bug: 228223814
Signed-off-by: Hailong Tu <tuhailong@oppo.com>
Change-Id: I0a2c08d856d0067af6af7a008e89f63fa5fac381
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
2022-11-12 11:23:01 +00:00
SeongJae Park
a5ffdc2f5c UPSTREAM: mm/damon/dbgfs: remove an unnecessary variable
Patch series "mm/damon: Hide unnecessary information disclosures".

DAMON is exposing some unnecessary information including kernel pointer
in kernel log and tracepoint.  This patchset hides such information.
The first patch is only for a trivial cleanup, though.

This patch (of 4):

This commit removes a unnecessarily used variable in
dbgfs_target_ids_write().

Link: https://lkml.kernel.org/r/20211229131016.23641-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20211229131016.23641-2-sj@kernel.org
Fixes: 4bc05954d007 ("mm/damon: implement a debugfs-based user space interface")
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

(cherry picked from commit 70b8480812d0a3930049a44820a1fa149b090c10)

Bug: 228223814
Signed-off-by: Hailong Tu <tuhailong@oppo.com>
Change-Id: I774c900de92780ae9ebf09f01c0b0536eb43f822
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
2022-11-12 11:23:00 +00:00
Guoqing Jiang
ada84132a2 UPSTREAM: mm/damon: move the implementation of damon_insert_region to damon.h
Usually, inline function is declared static since it should sit between
storage and type.  And implement it in a header file if used by multiple
files.

And this change also fixes compile issue when backport damon to 5.10.

  mm/damon/vaddr.c: In function `damon_va_evenly_split_region':
  ./include/linux/damon.h:425:13: error: inlining failed in call to `always_inline' `damon_insert_region': function body not available
  425 | inline void damon_insert_region(struct damon_region *r,
      | ^~~~~~~~~~~~~~~~~~~
  mm/damon/vaddr.c:86:3: note: called from here
  86 | damon_insert_region(n, r, next, t);
     | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Link: https://lkml.kernel.org/r/20211223085703.6142-1-guoqing.jiang@linux.dev
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

(cherry picked from commit 2cd4b8e10cc31eadb5b10b1d73b3f28156f3776c)

Bug: 228223814
Signed-off-by: Hailong Tu <tuhailong@oppo.com>
Change-Id: Iaa05318092b8e98bfbfc72a8c5df2cb6de97d224
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
2022-11-12 11:23:00 +00:00
Baolin Wang
21644bb0d9 UPSTREAM: mm/damon: add access checking for hugetlb pages
The process's VMAs can be mapped by hugetlb page, but now the DAMON did
not implement the access checking for hugetlb pte, so we can not get the
actual access count like below if a process VMAs were mapped by hugetlb.

  damon_aggregated: target_id=18446614368406014464 nr_regions=12 4194304-5476352: 0 545
  damon_aggregated: target_id=18446614368406014464 nr_regions=12 140662370467840-140662372970496: 0 545
  damon_aggregated: target_id=18446614368406014464 nr_regions=12 140662372970496-140662375460864: 0 545
  damon_aggregated: target_id=18446614368406014464 nr_regions=12 140662375460864-140662377951232: 0 545
  damon_aggregated: target_id=18446614368406014464 nr_regions=12 140662377951232-140662380449792: 0 545
  damon_aggregated: target_id=18446614368406014464 nr_regions=12 140662380449792-140662382944256: 0 545
  ......

Thus this patch adds hugetlb access checking support, with this patch we
can see below VMA mapped by hugetlb access count.

  damon_aggregated: target_id=18446613056935405824 nr_regions=12 140296486649856-140296489914368: 1 3
  damon_aggregated: target_id=18446613056935405824 nr_regions=12 140296489914368-140296492978176: 1 3
  damon_aggregated: target_id=18446613056935405824 nr_regions=12 140296492978176-140296495439872: 1 3
  damon_aggregated: target_id=18446613056935405824 nr_regions=12 140296495439872-140296498311168: 1 3
  damon_aggregated: target_id=18446613056935405824 nr_regions=12 140296498311168-140296501198848: 1 3
  damon_aggregated: target_id=18446613056935405824 nr_regions=12 140296501198848-140296504320000: 1 3
  damon_aggregated: target_id=18446613056935405824 nr_regions=12 140296504320000-140296507568128: 1 2
  ......

[baolin.wang@linux.alibaba.com: fix unused var warning]
  Link: https://lkml.kernel.org/r/1aaf9c11-0d8e-b92d-5c92-46e50a6e8d4e@linux.alibaba.com
[baolin.wang@linux.alibaba.com: v3]
  Link: https://lkml.kernel.org/r/486927ecaaaecf2e3a7fbe0378ec6e1c58b50747.1640852276.git.baolin.wang@linux.alibaba.com

Link: https://lkml.kernel.org/r/6afcbd1fda5f9c7c24f320d26a98188c727ceec3.1639623751.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

(cherry picked from commit 49f4203aae06ba9d67b500c90339b262b0a52637)

Bug: 228223814
Signed-off-by: Hailong Tu <tuhailong@oppo.com>
Change-Id: Id7c86f5c0344efef150b3b27f662b696263947ed
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
2022-11-12 11:23:00 +00:00
SeongJae Park
51db59ab5a UPSTREAM: mm/damon/dbgfs: support all DAMOS stats
Currently, DAMON debugfs interface is not supporting DAMON-based
Operation Schemes (DAMOS) stats for schemes successfully applied regions
and time/space quota limit exceeds.  This adds the support.

Link: https://lkml.kernel.org/r/20211210150016.35349-6-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

(cherry picked from commit 3a619fdb8de8a3ecd4200e7d183d2c8ceb32289e)

Bug: 228223814
Signed-off-by: Hailong Tu <tuhailong@oppo.com>
Change-Id: Ide3cfaff1b63e053ece5a8746a463fb87ef6c607
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
2022-11-12 11:23:00 +00:00
SeongJae Park
c7eaa00794 UPSTREAM: mm/damon/reclaim: provide reclamation statistics
This implements new DAMON_RECLAIM parameters for statistics reporting.
Those can be used for understanding how DAMON_RECLAIM is working, and
for tuning the other parameters.

Link: https://lkml.kernel.org/r/20211210150016.35349-4-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

(cherry picked from commit 60e52e7c46a127bca5ddd48b89002564f3862063)

Bug: 228223814
Signed-off-by: Hailong Tu <tuhailong@oppo.com>
Change-Id: I721bdacb3b2d8b41154296f5dbf8ef48c5dd0744
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
2022-11-12 11:23:00 +00:00
SeongJae Park
fb62fd4aa3 UPSTREAM: mm/damon/schemes: account how many times quota limit has exceeded
If the time/space quotas of a given DAMON-based operation scheme is too
small, the scheme could show unexpectedly slow progress.  However, there
is no good way to notice the case in runtime.  This commit extends the
DAMOS stat to provide how many times the quota limits exceeded so that
the users can easily notice the case and tune the scheme.

Link: https://lkml.kernel.org/r/20211210150016.35349-3-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

(cherry picked from commit 6268eac34ca30af7f6313504d556ec7fcd295621)

Bug: 228223814
Signed-off-by: Hailong Tu <tuhailong@oppo.com>
Change-Id: I36184dc51917810c81ac8d576b144e33a4454dc1
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
2022-11-12 11:22:59 +00:00
SeongJae Park
59a81312e5 UPSTREAM: mm/damon/schemes: account scheme actions that successfully applied
Patch series "mm/damon/schemes: Extend stats for better online analysis and tuning".

To help online access pattern analysis and tuning of DAMON-based
Operation Schemes (DAMOS), DAMOS provides simple statistics for each
scheme.  Introduction of DAMOS time/space quota further made the tuning
easier by making the risk management easier.  However, that also made
understanding of the working schemes a little bit more difficult.

For an example, progress of a given scheme can now be throttled by not
only the aggressiveness of the target access pattern, but also the
time/space quotas.  So, when a scheme is showing unexpectedly slow
progress, it's difficult to know by what the progress of the scheme is
throttled, with currently provided statistics.

This patchset extends the statistics to contain some metrics that can be
helpful for such online schemes analysis and tuning (patches 1-2),
exports those to users (patches 3 and 5), and add documents (patches 4
and 6).

This patch (of 6):

DAMON-based operation schemes (DAMOS) stats provide only the number and
the amount of regions that the action of the scheme has tried to be
applied.  Because the action could be failed for some reasons, the
currently provided information is sometimes not useful or convenient
enough for schemes profiling and tuning.  To improve this situation,
this commit extends the DAMOS stats to provide the number and the amount
of regions that the action has successfully applied.

Link: https://lkml.kernel.org/r/20211210150016.35349-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20211210150016.35349-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

(cherry picked from commit 0e92c2ee9f459542c5384d9cfab24873c3dd6398)

Bug: 228223814
Signed-off-by: Hailong Tu <tuhailong@oppo.com>
Change-Id: Iddfe9257cb99091404202576a1addc5cc340cb8b
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
2022-11-12 11:22:59 +00:00
SeongJae Park
76479acf46 UPSTREAM: mm/damon: convert macro functions to static inline functions
Patch series "mm/damon: Misc cleanups".

This patchset contains miscellaneous cleanups for DAMON's macro
functions and documentation.

This patch (of 6):

This commit converts macro functions in DAMON to static inline functions,
for better type checking, code documentation, etc[1].

[1] https://lore.kernel.org/linux-mm/20211202151213.6ec830863342220da4141bc5@linux-foundation.org/

Link: https://lkml.kernel.org/r/20211209131806.19317-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20211209131806.19317-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

(cherry picked from commit 88f86dcfa454784f7de550966c60fc78a3e95d6d)

Bug: 228223814
Signed-off-by: Hailong Tu <tuhailong@oppo.com>
Change-Id: Id1e4e08a844b49cf3572cb3ac02158f5c1909ea2
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
2022-11-12 11:22:59 +00:00
Xin Hao
8afdd7f734 UPSTREAM: mm/damon: move damon_rand() definition into damon.h
damon_rand() is called in three files:damon/core.c, damon/ paddr.c,
damon/vaddr.c, i think there is no need to redefine this twice, So move
it to damon.h will be a good choice.

Link: https://lkml.kernel.org/r/20211202075859.51341-1-xhao@linux.alibaba.com
Signed-off-by: Xin Hao <xhao@linux.alibaba.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

(cherry picked from commit 9b2a38d6ef25c1748e3964b0ff30a89e4ed26583)

Bug: 228223814
Signed-off-by: Hailong Tu <tuhailong@oppo.com>
Change-Id: Ib7be3062385fac4b422faa86705968aa39095a72
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
2022-11-12 11:22:59 +00:00
Xin Hao
655c93a820 UPSTREAM: mm/damon/schemes: add the validity judgment of thresholds
In dbgfs "schemes" interface, i do some test like this:
    # cd /sys/kernel/debug/damon
    # echo "2 1 2 1 10 1 3 10 1 1 1 1 1 1 1 1 2 3" > schemes
    # cat schemes
    # 2 1 2 1 10 1 3 10 1 1 1 1 1 1 1 1 2 3 0 0

There have some unreasonable places, i set the valules of these variables
"<min_sz, max_sz> <min_nr_a, max_nr_a>, <min_age, max_age>, <wmarks.high,
wmarks.mid, wmarks.low>" as "<2, 1>, <2, 1>, <10, 1>, <1, 2, 3>.

So there add a validity judgment for these thresholds value.

Link: https://lkml.kernel.org/r/d78360e52158d786fcbf20bc62c96785742e76d3.1637239568.git.xhao@linux.alibaba.com
Signed-off-by: Xin Hao <xhao@linux.alibaba.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

(cherry picked from commit c89ae63eb0662b6c9f82dbfad3ef010239b8c1b1)

Bug: 228223814
Signed-off-by: Hailong Tu <tuhailong@oppo.com>
Change-Id: I2b23124b95ebeac935f28f9632dccaeabace4533
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
2022-11-12 11:22:59 +00:00
Yihao Han
f2f5855f52 UPSTREAM: mm/damon/vaddr: remove swap_ranges() and replace it with swap()
Remove 'swap_ranges()' and replace it with the macro 'swap()' defined in
'include/linux/minmax.h' to simplify code and improve efficiency

Link: https://lkml.kernel.org/r/20211111115355.2808-1-hanyihao@vivo.com
Signed-off-by: Yihao Han <hanyihao@vivo.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

(cherry picked from commit 8bd0b9da03c9154e279b1a502636103887b9fbed)

Bug: 228223814
Signed-off-by: Hailong Tu <tuhailong@oppo.com>
Change-Id: Icf565b52a7642fb830ae004764ca496986aed82a
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
2022-11-12 11:22:58 +00:00
Xin Hao
607816e355 UPSTREAM: mm/damon: remove some unneeded function definitions in damon.h
In damon.h some func definitions about VA & PA can only be used in its
own file, so there no need to define in the header file, and the header
file will look cleaner.

If other files later need these functions, the prototypes can be added
to damon.h at that time.

[sj@kernel.org: remove unnecessary function prototype position changes]
 Link: https://lkml.kernel.org/r/20211118114827.20052-1-sj@kernel.org

Link: https://lkml.kernel.org/r/45fd5b3ef6cce8e28dbc1c92f9dc845ccfc949d7.1636989871.git.xhao@linux.alibaba.com
Signed-off-by: Xin Hao <xhao@linux.alibaba.com>
Signed-off-by: SeongJae Park <sj@kernel.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

(cherry picked from commit cdeed009f3bceee41f73f0137db785fd29a05cb8)

Bug: 228223814
Signed-off-by: Hailong Tu <tuhailong@oppo.com>
Change-Id: I2203b9c8e4625493797e1f3e506431799c4404c9
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
2022-11-12 11:22:58 +00:00
Xin Hao
8b9ac54ec9 UPSTREAM: mm/damon/core: use abs() instead of diff_of()
In kernel, we can use abs(a - b) to get the absolute value, So there is no
need to redefine a new one.

Link: https://lkml.kernel.org/r/b24e7b82d9efa90daf150d62dea171e19390ad0b.1636989871.git.xhao@linux.alibaba.com
Signed-off-by: Xin Hao <xhao@linux.alibaba.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

(cherry picked from commit d720bbbd70e968f8a0257393b575c3a29b56f990)

Bug: 228223814
Signed-off-by: Hailong Tu <tuhailong@oppo.com>
Change-Id: Iab418bdc79cea24a175c2c5e17ab7ee393df6c47
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
2022-11-12 11:22:58 +00:00
Xin Hao
601a5d3e94 UPSTREAM: mm/damon: unified access_check function naming rules
Patch series "mm/damon: Do some small changes", v4.

This patch (of 4):

In damon/paddr.c file, two functions names start with underscore,
	static void __damon_pa_prepare_access_check(struct damon_ctx *ctx,
			struct damon_region *r)
	static void __damon_pa_prepare_access_check(struct damon_ctx *ctx,
			struct damon_region *r)
In damon/vaddr.c file, there are also two functions with the same function,
	static void damon_va_prepare_access_check(struct damon_ctx *ctx,
			struct mm_struct *mm, struct damon_region *r)
	static void damon_va_check_access(struct damon_ctx *ctx,
			struct mm_struct *mm, struct damon_region *r)

It makes sense to keep consistent, and it is not easy to be confused with
the function that call them.

Link: https://lkml.kernel.org/r/cover.1636989871.git.xhao@linux.alibaba.com
Link: https://lkml.kernel.org/r/529054aed932a42b9c09fc9977ad4574b9e7b0bd.1636989871.git.xhao@linux.alibaba.com
Signed-off-by: Xin Hao <xhao@linux.alibaba.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

(cherry picked from commit b627b774911660852ce7f3f3817955ddad2bd130)

Bug: 228223814
Signed-off-by: Hailong Tu <tuhailong@oppo.com>
Change-Id: I180e7f068ed4745c5fbc0e2a81320b7fda03c52e
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
2022-11-12 11:22:58 +00:00