android_kernel_xiaomi_sm7250/kernel/bounds.c
Yu Zhao 3c91f2ca61 FROMLIST: mm: multi-gen LRU: minimal implementation
To avoid confusion, the terms "promotion" and "demotion" will be
applied to the multi-gen LRU, as a new convention; the terms
"activation" and "deactivation" will be applied to the active/inactive
LRU, as usual.

The aging produces young generations. Given an lruvec, it increments
max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging
promotes hot pages to the youngest generation when it finds them
accessed through page tables; the demotion of cold pages happens
consequently when it increments max_seq. The aging has the complexity
O(nr_hot_pages), since it is only interested in hot pages. Promotion
in the aging path does not require any LRU list operations, only the
updates of the gen counter and lrugen->nr_pages[]; demotion, unless it
is the result of the increment of max_seq, requires LRU list
operations, e.g., lru_deactivate_fn().
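
As a minimal standalone sketch (not the kernel implementation) of the
aging trigger described above, with a made-up struct and MIN_NR_GENS
assumed to be 2 as introduced elsewhere in this series:

    #include <stdbool.h>

    #define MIN_NR_GENS 2

    /* made-up stand-in for the per-lruvec generation counters */
    struct lruvec_sketch {
        unsigned long max_seq;  /* youngest generation */
        unsigned long min_seq;  /* oldest generation */
    };

    /* the aging runs once the number of generations approaches MIN_NR_GENS */
    static bool need_aging(const struct lruvec_sketch *lruvec)
    {
        return lruvec->max_seq - lruvec->min_seq + 1 <= MIN_NR_GENS;
    }

    /*
     * incrementing max_seq opens a new youngest generation; pages left
     * in the previously youngest generation age by one step without any
     * LRU list operations
     */
    static void inc_max_seq(struct lruvec_sketch *lruvec)
    {
        lruvec->max_seq++;
    }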

The eviction consumes old generations. Given an lruvec, it increments
min_seq when the lists indexed by min_seq%MAX_NR_GENS become empty. A
feedback loop modeled after the PID controller monitors refaults over
anon and file types and decides which type to evict when both types
are available from the same generation.
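
A similar standalone sketch of how min_seq advances; the struct is
made up and simplified (the kernel keeps lists per generation, type
and zone), and MAX_NR_GENS is assumed to be 4 as introduced elsewhere
in this series:

    #include <stdbool.h>

    #define MAX_NR_GENS   4
    #define ANON_AND_FILE 2

    /* made-up stand-in for the per-lruvec eviction state */
    struct lrugen_sketch {
        unsigned long min_seq[ANON_AND_FILE];
        /* page counts for the lists indexed by seq % MAX_NR_GENS */
        unsigned long nr_pages[MAX_NR_GENS][ANON_AND_FILE];
    };

    /*
     * the eviction increments min_seq[type] once the lists of the
     * oldest generation of that type have been emptied
     */
    static bool try_to_inc_min_seq(struct lrugen_sketch *lrugen, int type)
    {
        unsigned long seq = lrugen->min_seq[type];

        if (lrugen->nr_pages[seq % MAX_NR_GENS][type])
            return false;

        lrugen->min_seq[type]++;
        return true;
    }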

Each generation is divided into multiple tiers. Tiers represent
different ranges of numbers of accesses through file descriptors. A
page accessed N times through file descriptors is in tier
order_base_2(N). Tiers do not have dedicated lrugen->lists[], only
bits in page->flags. In contrast to moving across generations, which
requires the LRU lock, moving across tiers only involves operations on
page->flags. The feedback loop also monitors refaults over all tiers
and decides which tiers (N>1) to protect and when, using the
first tier (N=0,1) as a baseline. The first tier contains single-use
unmapped clean pages, which are most likely the best choices. The
eviction moves a page to the next generation, i.e., min_seq+1, if the
feedback loop decides to do so. This approach has the following advantages:
1. It removes the cost of activation in the buffered access path by
   inferring whether pages accessed multiple times through file
   descriptors are statistically hot and thus worth protecting in the
   eviction path.
2. It takes pages accessed through page tables into account and avoids
   overprotecting pages accessed multiple times through file
   descriptors. (Pages accessed through page tables are in the first
   tier, since N=0.)
3. More tiers provide better protection for pages accessed more than
   twice through file descriptors, when under heavy buffered I/O
   workloads.
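
As a concrete illustration of the tier mapping, order_base_2() below
is a userspace stand-in for the kernel macro and page_tier() is a
hypothetical helper, not part of this patch:

    /* userspace stand-in for the kernel's order_base_2(), i.e. ceil(log2(n)) */
    static unsigned int order_base_2(unsigned int n)
    {
        unsigned int order = 0;

        while ((1U << order) < n)
            order++;
        return order;
    }

    /*
     * a page accessed N times through file descriptors is in tier
     * order_base_2(N): N=0,1 -> tier 0, N=2 -> tier 1, N=3,4 -> tier 2, ...
     */
    static unsigned int page_tier(unsigned int refs)
    {
        return order_base_2(refs);
    }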

Server benchmark results:
  Single workload:
    fio (buffered I/O): +[38, 40]%
                         IOPS         BW
      5.18-ed4643521e6a: 2547k        9989MiB/s
      patch1-6:          3540k        13.5GiB/s

  Single workload:
    memcached (anon): +[103, 107]%
                         Ops/sec      KB/sec
      5.18-ed4643521e6a: 469048.66    18243.91
      patch1-6:          964656.80    37520.88

  Configurations:
    CPU: two Xeon 6154
    Mem: total 256G

    Node 1 was only used as a ram disk to reduce the variance in the
    results.

    patch drivers/block/brd.c <<EOF
    99,100c99,100
    < 	gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM;
    < 	page = alloc_page(gfp_flags);
    ---
    > 	gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE;
    > 	page = alloc_pages_node(1, gfp_flags, 0);
    EOF

    cat >>/etc/systemd/system.conf <<EOF
    CPUAffinity=numa
    NUMAPolicy=bind
    NUMAMask=0
    EOF

    cat >>/etc/memcached.conf <<EOF
    -m 184320
    -s /var/run/memcached/memcached.sock
    -a 0766
    -t 36
    -B binary
    EOF

    cat fio.sh
    modprobe brd rd_nr=1 rd_size=113246208
    swapoff -a
    mkfs.ext4 /dev/ram0
    mount -t ext4 /dev/ram0 /mnt

    mkdir /sys/fs/cgroup/user.slice/test
    echo 38654705664 >/sys/fs/cgroup/user.slice/test/memory.max
    echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs
    fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \
      --buffered=1 --ioengine=io_uring --iodepth=128 \
      --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
      --rw=randread --random_distribution=random --norandommap \
      --time_based --ramp_time=10m --runtime=5m --group_reporting

    cat memcached.sh
    modprobe brd rd_nr=1 rd_size=113246208
    swapoff -a
    mkswap /dev/ram0
    swapon /dev/ram0

    memtier_benchmark -S /var/run/memcached/memcached.sock \
      -P memcache_binary -n allkeys --key-minimum=1 \
      --key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \
      --ratio 1:0 --pipeline 8 -d 2000

    memtier_benchmark -S /var/run/memcached/memcached.sock \
      -P memcache_binary -n allkeys --key-minimum=1 \
      --key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \
      --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed

Client benchmark results:
  kswapd profiles:
    5.18-ed4643521e6a
      39.56%  page_vma_mapped_walk
      19.32%  lzo1x_1_do_compress (real work)
       7.18%  do_raw_spin_lock
       4.23%  _raw_spin_unlock_irq
       2.26%  vma_interval_tree_subtree_search
       2.12%  vma_interval_tree_iter_next
       2.11%  folio_referenced_one
       1.90%  anon_vma_interval_tree_iter_first
       1.47%  ptep_clear_flush
       0.97%  __anon_vma_interval_tree_subtree_search

    patch1-6
      36.13%  lzo1x_1_do_compress (real work)
      19.16%  page_vma_mapped_walk
       6.55%  _raw_spin_unlock_irq
       4.02%  do_raw_spin_lock
       2.32%  anon_vma_interval_tree_iter_first
       2.11%  ptep_clear_flush
       1.76%  __zram_bvec_write
       1.64%  folio_referenced_one
       1.40%  memmove
       1.35%  obj_malloc

  Configurations:
    CPU: single Snapdragon 7c
    Mem: total 4G

    Chrome OS MemoryPressure [1]

[1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/

Link: https://lore.kernel.org/r/20220309021230.721028-7-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 228114874
Change-Id: I3fe4850006d7984cd9f4fd46134b826609dc2f86
2022-11-12 11:21:16 +00:00

// SPDX-License-Identifier: GPL-2.0
/*
 * Generate definitions needed by the preprocessor.
 * This code generates raw asm output which is post-processed
 * to extract and format the required data.
 */

#define __GENERATING_BOUNDS_H
/* Include headers that define the enum constants of interest */
#include <linux/page-flags.h>
#include <linux/mmzone.h>
#include <linux/kbuild.h>
#include <linux/log2.h>
#include <linux/spinlock_types.h>

int main(void)
{
	/* The enum constants to put into include/generated/bounds.h */
	DEFINE(NR_PAGEFLAGS, __NR_PAGEFLAGS);
	DEFINE(MAX_NR_ZONES, __MAX_NR_ZONES);
#ifdef CONFIG_SMP
	DEFINE(NR_CPUS_BITS, ilog2(CONFIG_NR_CPUS));
#endif
	DEFINE(SPINLOCK_SIZE, sizeof(spinlock_t));
#ifdef CONFIG_LRU_GEN
	DEFINE(LRU_GEN_WIDTH, order_base_2(MAX_NR_GENS + 1));
	DEFINE(LRU_REFS_WIDTH, MAX_NR_TIERS - 2);
#else
	DEFINE(LRU_GEN_WIDTH, 0);
	DEFINE(LRU_REFS_WIDTH, 0);
#endif
	/* End of constants */

	return 0;
}
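
For reference, a small userspace check (not part of the patch) of what
these widths evaluate to, assuming MAX_NR_GENS and MAX_NR_TIERS are
both 4 as introduced elsewhere in this series; order_base_2() is a
stand-in for the kernel macro:

    #include <assert.h>

    /* assumed copies of the constants this series introduces elsewhere */
    #define MAX_NR_GENS  4U
    #define MAX_NR_TIERS 4U

    /* userspace stand-in for the kernel's order_base_2() */
    static unsigned int order_base_2(unsigned int n)
    {
        unsigned int order = 0;

        while ((1U << order) < n)
            order++;
        return order;
    }

    int main(void)
    {
        /* enough bits for MAX_NR_GENS + 1 distinct gen values */
        assert(order_base_2(MAX_NR_GENS + 1) == 3); /* LRU_GEN_WIDTH */
        /* LRU_REFS_WIDTH as defined above */
        assert(MAX_NR_TIERS - 2 == 2);              /* LRU_REFS_WIDTH */
        return 0;
    }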