Commit Graph

Yue Hu
3fa8ac111c erofs: directly use wrapper erofs_page_is_managed() when shrinking
We already have a wrapper function to identify managed pages.

Link: https://lore.kernel.org/r/20210810065450.1320-1-zbestahu@gmail.com
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Yue Hu <huyue2@yulong.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Change-Id: Id68356b5fc0cd6f4da2b53c2e56f9075253d624d
2022-11-12 11:24:50 +00:00
Gao Xiang
027bb6d69b erofs: fix 1 lcluster-sized pcluster for big pcluster
If the 1st NONHEAD lcluster of a pcluster isn't a CBLKCNT lcluster
type but a HEAD or PLAIN type instead, its pclustersize _must_ be
1 lcluster (since its uncompressed size is smaller than 2 lclusters),
as illustrated below:

       HEAD     HEAD / PLAIN    lcluster type
   ____________ ____________
  |_:__________|_________:__|   file data (uncompressed)
   .                .
  .____________.
  |____________|                pcluster data (compressed)

Such on-disk case was explained before [1] but missed to be handled
properly in the runtime implementation.

It can be observed by manually generating a 1 lcluster-sized pcluster
with 2 lclusters (so that no CBLKCNT exists). Let's fix it now.

[1] https://lore.kernel.org/r/20210407043927.10623-1-xiang@kernel.org

Link: https://lore.kernel.org/r/20210510064715.29123-1-xiang@kernel.org
Fixes: cec6e93beadf ("erofs: support parsing big pcluster compress indexes")
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <xiang@kernel.org>
Change-Id: I8a11d6d0c883a7222767e5c48f5da41c22698784
2022-11-12 11:24:50 +00:00
Gao Xiang
c94dab0246 erofs: enable big pcluster feature
Enable COMPR_CFGS and BIG_PCLUSTER since the implementations are
all settled properly.

Link: https://lore.kernel.org/r/20210407043927.10623-11-xiang@kernel.org
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Change-Id: I3e12261cf03e62bad4d7287d5827f3309a45d2a6
2022-11-12 11:24:50 +00:00
Gao Xiang
6247dd5bfb erofs: support decompress big pcluster for lz4 backend
Prior to big pcluster, there was only one compressed page, so it was
easy to map. However, when big pcluster is enabled, more work needs
to be done to handle multiple compressed pages. In detail,

 - (maptype 0) if there is only one compressed page + no need
   to copy inplace I/O, just map it directly what we did before;

 - (maptype 1) if there are more compressed pages + no need to
   copy inplace I/O, vmap such compressed pages instead;

 - (maptype 2) if inplace I/O needs to be copied, use per-CPU
   buffers for decompression then.

Another thing is how to detect whether inplace decompression is
feasible (it's still quite easy for non-big pclusters): apart from the
inplace margin calculation, the inplace I/O page reuse order also needs
to be considered for each compressed page. Currently, if the compressed
page is the xth page, it shouldn't be reused as pages [0 ...
nrpages_out - nrpages_in + x]; otherwise, a full copy will be triggered.
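
For illustration, a minimal sketch of the reuse-order rule above
(hypothetical helper name, not the actual patch):

  /*
   * Sketch: whether the xth compressed page may be reused in place as
   * the nth decompressed page. Per the rule above, page x must not be
   * reused as pages [0 ... nrpages_out - nrpages_in + x]; otherwise a
   * full copy is triggered.
   */
  static bool inplace_reuse_is_safe(unsigned int x, unsigned int n,
                                    unsigned int nrpages_in,
                                    unsigned int nrpages_out)
  {
          return n > nrpages_out - nrpages_in + x;
  }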

Although there are some extra optimization ideas for this, I'd like to
make big pcluster work correctly first; it can obviously be optimized
further later since this has nothing to do with the on-disk format at
all.

Link: https://lore.kernel.org/r/20210407043927.10623-10-xiang@kernel.org
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Change-Id: Icf073fd978b9a7f659651540e478a67d14450a09
2022-11-12 11:24:50 +00:00
Gao Xiang
7caadd02df erofs: support parsing big pcluster compact indexes
Different from non-compact indexes, several lclusters are packed as a
compact form at once, and a unique base blkaddr is stored for each
pack, so each lcluster index takes less space on average (e.g. 2 bytes
for COMPACT_2B.) By the way, that is also why the BIG_PCLUSTER switch
should be consistent for compact head0/1.

Prior to big pcluster, the size of all pclusters was 1 lcluster.
Therefore, when a new HEAD lcluster was scanned, blkaddr would be
bumped by 1 lcluster. However, that way doesn't work anymore for
big pcluster since we actually don't know the compressed size of
pclusters in advance (before reading CBLKCNT lcluster).

So, instead, let blkaddr of each pack be the first pcluster blkaddr
with a valid CBLKCNT, in detail,

 1) if CBLKCNT starts at the pack, this first valid pcluster is
    itself, e.g.
  _____________________________________________________________
 |_CBLKCNT0_|_NONHEAD_| .. |_HEAD_|_CBLKCNT1_| ... |_HEAD_| ...
 ^ = blkaddr base          ^ += CBLKCNT0           ^ += CBLKCNT1

 2) if CBLKCNT doesn't start at the pack, the first valid pcluster
    is the next pcluster, e.g.
  _________________________________________________________
 | NONHEAD_| .. |_HEAD_|_CBLKCNT0_| ... |_HEAD_|_HEAD_| ...
                ^ = blkaddr base        ^ += CBLKCNT0
                                               ^ += 1

When a CBLKCNT is found, blkaddr will be increased by CBLKCNT
lclusters; when a new HEAD is found immediately instead, blkaddr is
bumped by 1 (see the picture above.)

Also note that if a CBLKCNT lcluster ends the pack, it still stores
the compressed block count (delta0) instead of delta1 (the distance to
the next HEAD lcluster) as normal NONHEADs do, since delta1 can be
calculated indirectly but the block count can't.

Adjust decoding logic to fit big pcluster compact indexes as well.
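
A hedged sketch of the blkaddr bumping rule above (illustrative names
only):

  /*
   * Starting from a pack's base blkaddr (the first pcluster with a
   * valid CBLKCNT), bump the decoded blkaddr per the rule above.
   */
  static void bump_blkaddr(unsigned int *blkaddr, bool found_cblkcnt,
                           unsigned int cblkcnt, bool found_new_head)
  {
          if (found_cblkcnt)
                  *blkaddr += cblkcnt;    /* compressed size is known */
          else if (found_new_head)
                  *blkaddr += 1;          /* 1 lcluster-sized pcluster */
  }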

Link: https://lore.kernel.org/r/20210407043927.10623-9-xiang@kernel.org
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Change-Id: Ia9dc0c8bf03b630edc4dfcedf61ef0632e1fcb05
2022-11-12 11:24:49 +00:00
Gao Xiang
05658e6d14 erofs: support parsing big pcluster compress indexes
When the INCOMPAT_BIG_PCLUSTER sb feature is enabled, legacy compress
indexes will also have the same on-disk header as compact indexes to
keep per-file configurations instead of leaving it zeroed.

If ADVISE_BIG_PCLUSTER is set for a file, CBLKCNT will be loaded for each
pcluster in this file by parsing 1st non-head lcluster.

Link: https://lore.kernel.org/r/20210407043927.10623-8-xiang@kernel.org
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Change-Id: I316bad6965e2cfa97e6fb1fb548dda8bc363b3b1
2022-11-12 11:24:49 +00:00
Gao Xiang
3a2d5cdb17 erofs: adjust per-CPU buffers according to max_pclusterblks
Adjust per-CPU buffers on demand since the big pcluster definition is
now available. Also, bail out on unsupported pcluster sizes according
to Z_EROFS_PCLUSTER_MAX_SIZE.

Link: https://lore.kernel.org/r/20210407043927.10623-7-xiang@kernel.org
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Change-Id: I5679ca306856206222bba50d78985390252e6b1f
2022-11-12 11:24:49 +00:00
Gao Xiang
d0d8fc4059 erofs: add big physical cluster definition
Big pcluster indicates that the size of compressed data for each
physical pcluster is no longer fixed at block size, but could be more
than 1 block (more accurately, 1 logical pcluster).

When the big pcluster feature is enabled for head0/1, delta0 of the
1st non-head lcluster index will keep the block count of this pcluster
in lcluster units instead of 1. Otherwise, the compressed size of the
pcluster must be 1 lcluster if the pcluster has no non-head lcluster
index.

Also note that BIG_PCLUSTER feature reuses COMPR_CFGS feature since
it depends on COMPR_CFGS and will be released together.

Link: https://lore.kernel.org/r/20210407043927.10623-6-xiang@kernel.org
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Change-Id: I00c6ff3b9197f0d4c3b19698d4f1fb1899773734
2022-11-12 11:24:49 +00:00
Gao Xiang
fd0b2e82e9 erofs: fix up inplace I/O pointer for big pcluster
When picking up inplace I/O pages, they should be traversed in reverse
order, in alignment with the traversal order of file-backed online
pages. Also, the index should be updated together when preloading
compressed pages.

Previously, only page-sized pclustersize was supported, so there was
no problem at all. Also, rename `compressedpages' to `icpage_ptr' to
reflect its functionality.

Link: https://lore.kernel.org/r/20210407043927.10623-5-xiang@kernel.org
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Change-Id: Iefdad4fe34757134e4ae06dd4dcc16ada0a7880c
2022-11-12 11:24:49 +00:00
Gao Xiang
18f3dd8eb4 erofs: introduce physical cluster slab pools
Since multiple pcluster sizes could be used at once, the number of
compressed pages will become a variable factor. It's necessary to
introduce slab pools rather than a single slab cache now.

This limits the pclustersize to 1M (Z_EROFS_PCLUSTER_MAX_SIZE) and
gets rid of the obsolete EROFS_FS_CLUSTER_PAGE_LIMIT, which is no
longer used.
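
As a rough sketch of the idea (cache names and size classes here are
illustrative; assumes the existing struct z_erofs_pcluster):

  #include <linux/slab.h>

  /* one slab cache per supported trailing page-vector size */
  static const unsigned int pcl_nrpages[] = { 1, 4, 16, 64, 128, 256 };
  static const char *pcl_names[] = {
          "pcl-1", "pcl-4", "pcl-16", "pcl-64", "pcl-128", "pcl-256",
  };
  static struct kmem_cache *pcl_caches[ARRAY_SIZE(pcl_nrpages)];

  static int z_erofs_create_pcluster_pools(void)
  {
          unsigned int i;

          for (i = 0; i < ARRAY_SIZE(pcl_caches); ++i) {
                  /* object size grows with the number of page slots */
                  pcl_caches[i] = kmem_cache_create(pcl_names[i],
                                  sizeof(struct z_erofs_pcluster) +
                                  pcl_nrpages[i] * sizeof(struct page *),
                                  0, SLAB_RECLAIM_ACCOUNT, NULL);
                  if (!pcl_caches[i])
                          return -ENOMEM;
          }
          return 0;
  }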

Link: https://lore.kernel.org/r/20210407043927.10623-4-xiang@kernel.org
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Change-Id: I69bdcc92c379a66f71db38f0111d41a34526a586
2022-11-12 11:24:48 +00:00
Gao Xiang
089adbeb82 erofs: introduce multipage per-CPU buffers
Per-CPU buffers were introduced to deal with cases where inplace
decompression is infeasible for some inplace I/O, and to get rid of
page allocation latency and thrashing for low-latency decompression
algorithms such as lz4.

For the big pcluster feature, introduce multipage per-CPU buffers to
keep such inplace I/O pclusters temporarily as well, but note that
per-CPU pages are only virtually consecutive.

When a new big pcluster fs is mounted, its max pclustersize will be
read and the per-CPU buffers can be grown if needed. Shrinking
adjustable per-CPU buffers is more complex (because we don't know
whether such a size is still in use), so currently we just release
them all when unloading.
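
Roughly, the growing step can be sketched as below (simplified names;
error unwinding omitted; assumes vmap() of individually allocated
pages):

  #include <linux/vmalloc.h>

  /* per-CPU buffer: physically separate pages, virtually consecutive */
  struct erofs_pcpubuf {
          void *ptr;              /* vmap()'ed address */
          struct page **pages;
          unsigned int nrpages;
  };
  static DEFINE_PER_CPU(struct erofs_pcpubuf, erofs_pcb);

  static int pcpubuf_growone(int cpu, unsigned int nrpages)
  {
          struct erofs_pcpubuf *pcb = &per_cpu(erofs_pcb, cpu);
          struct page **pages;
          unsigned int i;

          if (nrpages <= pcb->nrpages)
                  return 0;       /* big enough; never shrink here */
          pages = kcalloc(nrpages, sizeof(*pages), GFP_KERNEL);
          if (!pages)
                  return -ENOMEM;
          for (i = 0; i < nrpages; ++i) {
                  pages[i] = alloc_page(GFP_KERNEL);
                  if (!pages[i])
                          return -ENOMEM; /* sketch: no unwinding */
          }
          pcb->ptr = vmap(pages, nrpages, VM_MAP, PAGE_KERNEL);
          pcb->pages = pages;
          pcb->nrpages = nrpages;
          return pcb->ptr ? 0 : -ENOMEM;
  }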

Link: https://lore.kernel.org/r/20210409190630.19569-1-xiang@kernel.org
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Change-Id: I7153344779a49710efc2ce57b0e2d8b9b991009b
2022-11-12 11:24:48 +00:00
Gao Xiang
d59de04be7 erofs: reserve physical_clusterbits[]
The formal big pcluster design is actually more powerful / flexible
than previously thought: pclustersize used to be fixed as power-of-2
blocks, which was obviously inefficient and space-wasting. Instead,
pclustersize can now be set independently for each pcluster, so various
pcluster sizes can also be used together in one file if mkfs wants (for
example, according to data type and/or compression ratio).

Let's get rid of the previous physical_clusterbits[] setting (also
notice that the corresponding on-disk fields are still 0 for now).
Therefore, head1/2 can be used for at most 2 different algorithms in
one file, and again pclustersize is now independent of these.

Link: https://lore.kernel.org/r/20210407043927.10623-2-xiang@kernel.org
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Change-Id: Ie48ec3c8088d7aab59683ecce6049cc8c13f4bf2
2022-11-12 11:24:48 +00:00
Ruiqi Gong
272228d253 erofs: Clean up spelling mistakes found in fs/erofs
zmap.c: s/correspoinding/corresponding
zdata.c: s/endding/ending

Link: https://lore.kernel.org/r/20210331093920.31923-1-gongruiqi1@huawei.com
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Ruiqi Gong <gongruiqi1@huawei.com>
Reviewed-by: Gao Xiang <hsiangkao@redhat.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Change-Id: Ia6f036e2eb2aad4054fb7fa4d434623fa97ac0fd
2022-11-12 11:24:48 +00:00
Gao Xiang
52b1a55f3d erofs: add on-disk compression configurations
Add a bitmap for available compression algorithms and a variable-sized
on-disk table for compression options in preparation for upcoming big
pcluster and LZMA algorithm, which follows the end of super block.

To parse the compression options, the bitmap is scanned bit by bit.
For each available algorithm, there is a 2-byte `length' field together
with its corresponding config data (that's enough for most cases;
otherwise, entire fs blocks should be used.)

With such an available algorithm bitmap, the kernel itself can also
refuse to mount such a filesystem if any unsupported compression
algorithm exists.

Note that COMPR_CFGS feature will be enabled with BIG_PCLUSTER.
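
For illustration, a hedged sketch of scanning such a bitmap, assuming
each record is a 2-byte little-endian `length' followed by `length'
bytes of data (function and layout details here are assumptions, not
the literal patch):

  #include <asm/unaligned.h>

  /* walk length-prefixed per-algorithm config records in a buffer */
  static int parse_compr_cfgs(const u8 *p, const u8 *end, u16 algs)
  {
          unsigned int alg;

          for (alg = 0; algs; ++alg, algs >>= 1) {
                  u16 len;

                  if (!(algs & 1))
                          continue;       /* algorithm not present */
                  if (end - p < 2)
                          return -EFSCORRUPTED;
                  len = get_unaligned_le16(p);
                  p += 2;
                  if (end - p < len)
                          return -EFSCORRUPTED;
                  /* hand <alg, p, len> to a per-algorithm loader here */
                  p += len;
          }
          return 0;
  }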

Link: https://lore.kernel.org/r/20210329100012.12980-1-hsiangkao@aol.com
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Change-Id: Ic8420a2f98774fb316075c54bd81103f62756e19
2022-11-12 11:24:48 +00:00
Gao Xiang
d45aa2f8fe erofs: introduce on-disk lz4 fs configurations
Introduce z_erofs_lz4_cfgs to store all lz4 configurations.
Currently it's only max_distance, but will be used for new
features later.

Link: https://lore.kernel.org/r/20210329012308.28743-4-hsiangkao@aol.com
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Change-Id: Ie7306c702d33d62529630c3c6825d892a306a6fe
2022-11-12 11:24:48 +00:00
Huang Jianan
50412aa0b1 erofs: support adjust lz4 history window size
lz4 uses LZ4_DISTANCE_MAX to bound how much history is preserved. When
using rolling decompression, a block with a higher compression ratio
will cause a larger memory allocation (up to 64k). That may create a
large resource burden in extreme cases on devices with small memory and
a large number of concurrent I/Os, so appropriately reducing this value
can improve performance.

Decreasing this value will reduce the compression ratio (except when
input_size < LZ4_DISTANCE_MAX). But considering that erofs currently
only supports 4k output, reducing this value will not significantly
reduce the compression benefits.

The maximum value of LZ4_DISTANCE_MAX defined by lz4 is 64k, and we
can only reduce this value. Old kernels simply can't reduce the memory
allocation during rolling decompression without affecting the
decompression result.
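
A minimal sketch of the consequence for the rolling-decompression
window (helper name is hypothetical):

  /*
   * With a recorded per-fs max distance, only that much history needs
   * to be preserved instead of the full LZ4_DISTANCE_MAX (64k).
   */
  static unsigned int lz4_window_pages(unsigned int max_distance)
  {
          return DIV_ROUND_UP(max_distance, PAGE_SIZE);
  }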

Link: https://lore.kernel.org/r/20210329012308.28743-3-hsiangkao@aol.com
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Huang Jianan <huangjianan@oppo.com>
Signed-off-by: Guo Weichao <guoweichao@oppo.com>
[ Gao Xiang: introduce struct erofs_sb_lz4_info for configurations. ]
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Change-Id: I19ebaefe144f3144e291225626ff7067ca646af6
2022-11-12 11:24:47 +00:00
Gao Xiang
a8f99e98e5 erofs: introduce erofs_sb_has_xxx() helpers
Introduce erofs_sb_has_xxx() to make long checks short, especially
for later big pcluster & LZMA features.

Link: https://lore.kernel.org/r/20210329012308.28743-2-hsiangkao@aol.com
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Change-Id: Ibf0e8724fbbc787b53e512212c794ca99f31390b
2022-11-12 11:24:47 +00:00
Yue Hu
6f04dbe3a3 erofs: don't use erofs_map_blocks() any more
Currently, erofs_map_blocks() will be called only from
erofs_{bmap, read_raw_page}, which are all for uncompressed files.
So the compression branch in erofs_map_blocks() is pointless. Let's
remove it and use erofs_map_blocks_flatmode() directly. Also update
related comments.

Link: https://lore.kernel.org/r/20210325071008.573-1-zbestahu@gmail.com
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Yue Hu <huyue2@yulong.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Change-Id: I1804aa803d08d1df05e03811788ee8a2ace91f76
2022-11-12 11:24:47 +00:00
Gao Xiang
1a47f87d98 erofs: complete a missing case for inplace I/O
Add a missing case which could cause unnecessary page allocation
instead of directly using inplace I/O, which increases the runtime
memory footprint.

In detail, consider an online file-backed page whose right half is
chosen to be cached (e.g. the end page of a readahead request) while
some of its data doesn't exist in the managed cache, so the pcluster
will definitely be kept in the submission chain. (IOWs, it cannot be
decompressed without I/O, e.g., due to the bypass queue.)

Currently, DELAYEDALLOC/TRYALLOC cases can be downgraded to NOINPLACE,
which stops online pages from being used for inplace I/O. After this
patch, unneeded page allocations won't be observed in
pickup_page_for_submission() anymore.

Link: https://lore.kernel.org/r/20210321183227.5182-1-hsiangkao@aol.com
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Change-Id: I4ec6ee8ee275e142846f657dabe14934c0019ce2
2022-11-12 11:24:47 +00:00
Huang Jianan
1ffe1efa18 erofs: use workqueue decompression for atomic contexts only
z_erofs_decompressqueue_endio may not be executed in atomic context,
for example, when dm-verity is turned on. In this scenario, data can be
decompressed directly to get rid of the additional kworker scheduling
overhead.
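
In other words, the dispatch decision can be sketched roughly as below
(assuming erofs' existing decompression workqueue and work item; not
the literal patch):

  static void z_erofs_decompress_kickoff(struct z_erofs_decompressqueue *q)
  {
          /* only pay the kworker scheduling cost in atomic contexts */
          if (in_atomic() || irqs_disabled())
                  queue_work(z_erofs_workqueue, &q->u.work);
          else
                  z_erofs_decompressqueue_work(&q->u.work);
  }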

Link: https://lore.kernel.org/r/20210317035448.13921-2-huangjianan@oppo.com
Reviewed-by: Gao Xiang <hsiangkao@redhat.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Huang Jianan <huangjianan@oppo.com>
Signed-off-by: Guo Weichao <guoweichao@oppo.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Change-Id: I3dbb47bf28b420fe272de86a437d4fec09d69943
2022-11-12 11:24:47 +00:00
Huang Jianan
1913a8c2a4 erofs: avoid memory allocation failure during rolling decompression
Currently, a memory allocation failure during rolling decompression
would be treated as an I/O error. Therefore, it'd be better to
guarantee the memory allocation to avoid such I/O errors.

In the long term, we might consider adding another !Uptodate case for
this.

Link: https://lore.kernel.org/r/20210316031515.90954-1-huangjianan@oppo.com
Reviewed-by: Gao Xiang <hsiangkao@redhat.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Huang Jianan <huangjianan@oppo.com>
Signed-off-by: Guo Weichao <guoweichao@oppo.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Change-Id: I609eec373f765e897fa3f66f4b5a8b6549b88356
2022-11-12 11:24:46 +00:00
Gao Xiang
76042998ea erofs: force inplace I/O under low memory scenario
Try to forcibly switch to inplace I/O under low memory scenarios in
order to avoid direct memory reclaim due to cached page allocation.

Link: https://lore.kernel.org/r/20201209123717.12430-1-hsiangkao@aol.com
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Change-Id: I8ea2d3b59c68125271f66853cf5dc6ca39e7aaa9
2022-11-12 11:24:46 +00:00
Gao Xiang
6ddba74d77 erofs: simplify try_to_claim_pcluster()
Simplify try_to_claim_pcluster() by directly using cmpxchg() here
(the retry loop caused more overhead.) Also, move the chain loop
detection in and rename it to z_erofs_try_to_claim_pcluster().
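
The claim then boils down to a single attempt along these lines (a
sketch based on the description above; field and constant names are
assumed from the erofs source):

  static bool try_claim(struct z_erofs_pcluster *pcl,
                        z_erofs_next_pcluster_t *owned_head)
  {
          /* a single cmpxchg() replaces the old retry loop */
          if (cmpxchg(&pcl->next, Z_EROFS_PCLUSTER_NIL,
                      *owned_head) == Z_EROFS_PCLUSTER_NIL) {
                  *owned_head = &pcl->next;       /* chain extended */
                  return true;
          }
          return false;
  }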

Link: https://lore.kernel.org/r/20201208095834.3133565-3-hsiangkao@redhat.com
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Change-Id: I8d091ff44123b099ef199eaa4200a00b8854623f
2022-11-12 11:24:46 +00:00
Gao Xiang
d6e4efa5bd erofs: insert to managed cache after adding to pcl
Previously, there could be some concern about calling
add_to_page_cache_lru() with page->mapping == Z_EROFS_MAPPING_STAGING
(!= NULL).

In contrast, page->private is used instead now, so partially revert
commit 5ddcee1f3a1c ("erofs: get rid of __stagingpage_alloc helper")
with some adaptation for simplicity.

Link: https://lore.kernel.org/r/20201208095834.3133565-2-hsiangkao@redhat.com
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Change-Id: If250d62b47083649e96d0937eb1990b6c84d768f
2022-11-12 11:24:46 +00:00
Gao Xiang
5e6356e2a9 erofs: get rid of magical Z_EROFS_MAPPING_STAGING
Previously, we played around with a magical page->mapping for
short-lived temporary pages, since we need to identify different types
of pages in the same pcluster but both invalidated and short-lived
temporary pages can have page->mapping == NULL. It was considered safe
because such temporary pages are all non-LRU / non-movable pages.

This patch uses a specific page->private value to identify short-lived
pages instead, so it won't rely on page->mapping anymore. Details are
described in "compress.h" as well.
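
Conceptually, the marker works along these lines (a sketch; the
helpers here are assumptions based on the description):

  /* mark a temporary page as short-lived via page->private */
  static void set_page_shortlived(struct page *page)
  {
          set_page_private(page, Z_EROFS_SHORTLIVED_PAGE);
  }

  static bool page_is_shortlived(struct page *page)
  {
          return page_private(page) == Z_EROFS_SHORTLIVED_PAGE;
  }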

Link: https://lore.kernel.org/r/20201208095834.3133565-1-hsiangkao@redhat.com
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Change-Id: I2c8650e80cb6016ed828d04f89f8bd3512ca3fb2
2022-11-12 11:24:46 +00:00
Vladimir Zapolskiy
bb7da106ff erofs: remove a void EROFS_VERSION macro set in Makefile
Since commit 4f761fa253b4 ("erofs: rename errln/infoln/debugln to
erofs_{err, info, dbg}"), the defined macro EROFS_VERSION has had no
effect; therefore, removing it from the Makefile is a non-functional
change.

Link: https://lore.kernel.org/r/20201030122839.25431-1-vladimir@tuxera.com
Reviewed-by: Gao Xiang <hsiangkao@redhat.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Vladimir Zapolskiy <vladimir@tuxera.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Change-Id: Id63ad279985db2a156d62be814bf381c9bea8342
2022-11-12 11:24:45 +00:00
Gao Xiang
be617d2193 erofs: fix unsafe pagevec reuse of hooked pclusters
At runtime, there are pclusters marked with Z_EROFS_PCLUSTER_TAIL
before actual I/O submission. Thus, the decompression chain can be
extended if the following pcluster chain hooks onto such a tail
pcluster.

As the related comment mentions, if some page is made up of a hooked
pcluster and another followed pcluster, it can be reused for in-place
I/O (since the I/O should be submitted anyway):
 _______________________________________________________________
|  tail (partial) page |          head (partial) page           |
|_____PRIMARY_HOOKED___|____________PRIMARY_FOLLOWED____________|

However, it's by no means safe to reuse it as a pagevec, since such
PRIMARY_HOOKED pclusters may finally move into the bypass chain without
I/O submission. It's somewhat hard to reproduce with LZ4, and I just
found it (a general protection fault) by ro_fsstressing an LZMA image
for a long time.

I'm going to actively clean up the related code together with the
multi-page folio adaptation in the next few months. Let's address it
directly for easier backporting for now.

Call trace for reference:
  z_erofs_decompress_pcluster+0x10a/0x8a0 [erofs]
  z_erofs_decompress_queue.isra.36+0x3c/0x60 [erofs]
  z_erofs_runqueue+0x5f3/0x840 [erofs]
  z_erofs_readahead+0x1e8/0x320 [erofs]
  read_pages+0x91/0x270
  page_cache_ra_unbounded+0x18b/0x240
  filemap_get_pages+0x10a/0x5f0
  filemap_read+0xa9/0x330
  new_sync_read+0x11b/0x1a0
  vfs_read+0xf1/0x190

Link: https://lore.kernel.org/r/20211103182006.4040-1-xiang@kernel.org
Fixes: 3883a79abd ("staging: erofs: introduce VLE decompression support")
Cc: <stable@vger.kernel.org> # 4.19+
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Change-Id: Ieecf1f9ac4d2b2e1d285b7c157d9f37feddb79f7
2022-11-12 11:24:45 +00:00
Yue Hu
3fc031f2a6 erofs: remove the occupied parameter from z_erofs_pagevec_enqueue()
No any behavior to variable occupied in z_erofs_attach_page() which
is only caller to z_erofs_pagevec_enqueue().

Link: https://lore.kernel.org/r/20210419102623.2015-1-zbestahu@gmail.com
Signed-off-by: Yue Hu <huyue2@yulong.com>
Reviewed-by: Gao Xiang <xiang@kernel.org>
Signed-off-by: Gao Xiang <xiang@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Change-Id: I1e3990f8b8c6bdce29dcadcf561aca0ee8657453
2022-11-12 11:24:45 +00:00
Gao Xiang
ba4695a03d erofs: don't trigger WARN() when decompression fails
syzbot reported a WARNING [1] due to corrupted compressed data.

As Dmitry said, "If this is not a kernel bug, then the code should
not use WARN. WARN is for kernel bugs and is recognized as such by
all testing systems and humans."

[1] https://lore.kernel.org/r/000000000000b3586105cf0ff45e@google.com

Link: https://lore.kernel.org/r/20211025074311.130395-1-hsiangkao@linux.alibaba.com
Cc: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Reported-by: syzbot+d8aaffc3719597e8cfb4@syzkaller.appspotmail.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Change-Id: If67aa741b0e1563465d162cf56ddc0ba6a50b0d1
2022-11-12 11:24:45 +00:00
Gao Xiang
7c6e563c72 erofs: move from drivers/staging/ to fs/
Since 5.4, erofs has been moved into fs/.

Keep up with the 5.10 LTS kernel until the following commit:
dbaf435ddf97 ("erofs: add unsupported inode i_format check")

Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Tested-by: Liu Bo <bo.liu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Change-Id: If4827beb38d76d65844e31f9a594010871849918
2022-11-12 11:24:45 +00:00
Gao Xiang
0c8e6c2ec1 erofs: sync up with kernel 5.10
Backport the 5.10 LTS erofs codebase to 4.19 with adaptations such
as using the old mount APIs, a radix tree instead of XArray, and
reverting some bio interface changes.

Keep up with the 5.10 LTS kernel until the following commit:
dbaf435ddf97 ("erofs: add unsupported inode i_format check")

Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Tested-by: Liu Bo <bo.liu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Change-Id: Iac3047e822480a14c446ac2671f79369251bd7b6
2022-11-12 11:24:44 +00:00
Park Ju Hyung
038ea3002f msm_geni_serial: skip flushing tx upon shutdown
This is causing runtime PM to malfunction and preventing suspend
indefinitely.

This might be HAL related, but since flushing data by manually powering on
the serial is unnecessary upon shutdown, skip it instead.

Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: celtare21 <celtare21@gmail.com>
2022-11-12 11:24:44 +00:00
Park Ju Hyung
56f3dfad34 msm_geni_serial: reduce wakelock timeout from ISR to 100ms
The average userspace response time from the ISR is less than 10ms.

A whopping 2 seconds is way too long. Reduce it to 100ms.

Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com>
Change-Id: I78e1f434f6e5585497972ee82b2a29c5cfd87408
2022-11-12 11:24:44 +00:00
Arjan van de Ven
de7528f5bd fs: ext4: fsync: optimize double-fsync() a bunch
There are cases where EXT4 is a bit too conservative in sending
barriers down to the disk; there are cases where the transaction in
progress is not the one that sent the barrier (in other words: the
fsync is for a file for which the I/O happened longer ago and all data
was already sent to the disk).

For that case, a better-performing tradeoff can be made on SSD devices
(which have the ability to flush their DRAM caches in a hurry on a
power-fail event): the barrier gets sent to the disk, but we don't need
to wait for the barrier to complete. Any subsequent I/O will block on
the barrier correctly.
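
As a hedged sketch of the tradeoff (illustrative helper; 4.19-era bio
API assumed):

  static void flush_endio(struct bio *bio)
  {
          bio_put(bio);           /* fire-and-forget completion */
  }

  static void issue_flush_nowait(struct block_device *bdev)
  {
          struct bio *bio = bio_alloc(GFP_KERNEL, 0);

          bio_set_dev(bio, bdev);
          bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH;
          bio->bi_end_io = flush_endio;
          submit_bio(bio);        /* no submit_bio_wait() here */
  }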

Signed-off-by: Diab Neiroukh <lazerl0rd@thezest.dev>
2022-11-12 11:24:44 +00:00
nathanchance
6ff225f1cd block: Makefile: disable align-mismatch warning

* newer clang introduced -Walign-mismatch
* fixes compilation of block/blk-mq
2022-11-12 11:24:43 +00:00
Jens Axboe
f0042f8846 blk-mq: fix corruption with direct issue
If we attempt a direct issue to a SCSI device, and it returns BUSY,
then we queue the request up normally. However, the SCSI layer may have
already set up SG tables etc. for this particular command. If we later
merge with this request, then the old tables are no longer valid. Once
we issue the I/O, we only read/write the original part of the request,
not the new state of it.

This causes data corruption, and is most often noticed with the file
system complaining about the just read data being invalid:

[  235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: comm dpkg-query: bad extra_isize 24937 (inode size 256)

because most of it is garbage...

This doesn't happen from the normal issue path, as we will simply defer
the request to the hardware queue dispatch list if we fail. Once it's on
the dispatch list, we never merge with it.

Fix this from the direct issue path by flagging the request as
REQ_NOMERGE so we don't change the size of it before issue.
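
The gist of the fix can be sketched as follows (simplified; not the
full patch):

  /* before trying direct issue, make sure nobody can merge with rq */
  static void prep_direct_issue(struct request *rq)
  {
          rq->cmd_flags |= REQ_NOMERGE;   /* keeps SG tables valid */
  }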

See also:
  https://bugzilla.kernel.org/show_bug.cgi?id=201685

Tested-by: Guenter Roeck <linux@roeck-us.net>
Fixes: 6ce3dd6eec ("blk-mq: issue directly if hw queue isn't busy in case of 'none'")
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Diab Neiroukh <lazerl0rd@thezest.dev>
2022-11-12 11:24:43 +00:00
Arjan van de Ven
29729569a8 do accept() in LIFO order for cache efficiency
Signed-off-by: Diab Neiroukh <lazerl0rd@thezest.dev>
2022-11-12 11:24:43 +00:00
Arjan van de Ven
c48ce61c87 kernel: time: reduce ntp wakeups
Signed-off-by: Diab Neiroukh <lazerl0rd@thezest.dev>
2022-11-12 11:24:43 +00:00
Arjan van de Ven
1353e025a4 ipv4/tcp: allow the memory tuning for tcp to go a little bigger than default
Signed-off-by: Diab Neiroukh <lazerl0rd@thezest.dev>
2022-11-12 11:24:43 +00:00
Arjan van de Ven
e27e70b196 Initialize ata before graphics
ATA init is the long pole in the boot process, and it's asynchronous.
Move the graphics init after it so that ATA and graphics initialize
in parallel.

Signed-off-by: Diab Neiroukh <lazerl0rd@thezest.dev>
2022-11-12 11:24:42 +00:00
Arjan van de Ven
5f4f5aceae Increase the ext4 default commit age
Both the VM and EXT4 have a "commit to disk after X seconds" time.
Currently the EXT4 time is shorter than our VM time, which is a bit
suboptimal; it's better for performance to let the VM do the writeouts
in bulk rather than something deep in the journalling layer.

(DISTRO TWEAK -- NOT FOR UPSTREAM)

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Jose Carlos Venegas Munoz <jose.carlos.venegas.munoz@intel.com>
Signed-off-by: Diab Neiroukh <lazerl0rd@thezest.dev>
2022-11-12 11:24:42 +00:00
spakkkk
eaa4e1547f arm64: config: enable exfat
2022-11-12 11:24:42 +00:00
Ritesh Harjani
70cc7eeb24 ext4: optimize file overwrites
If the file already has underlying blocks/extents allocated, then we
don't need to start a journal txn and can directly return the
underlying mapping. Currently, ext4_iomap_begin() is used by both the
DAX & DIO paths. We can check whether the write request is an
overwrite & then directly return the mapping information.

This could give a significant perf boost for multi-threaded writes,
especially random overwrites. On a PPC64 VM with a simulated pmem (DAX)
device, ~10x perf improvement could be seen in random writes
(overwrite), also because this optimizes away the spinlock contention
during jbd2 slab cache allocation (jbd2_journal_handle). On an x86 VM,
~2x perf improvement was observed.
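
A hedged sketch of the overwrite check (simplified from ext4's
map-blocks interface; the helper name is illustrative):

  /* already fully mapped as written? then no journal handle needed */
  static bool ext4_write_is_overwrite(struct inode *inode,
                                      struct ext4_map_blocks *map)
  {
          int len = map->m_len;
          int err = ext4_map_blocks(NULL, inode, map, 0); /* lookup only */

          return err > 0 && (map->m_flags & EXT4_MAP_MAPPED) &&
                 err >= len;
  }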

Reported-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/88e795d8a4d5cd22165c7ebe857ba91d68d8813e.1600401668.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-11-12 11:24:42 +00:00
Park Ju Hyung
4d8e9a4708 quota_tree: Avoid dynamic memory allocations
Most allocations done here are rather small and can fit on the stack,
eliminating the need to allocate them dynamically. Reserve a 1024B
stack buffer for this purpose to avoid the overhead of dynamic
memory allocation.

1024B covers most use cases, and higher values were observed to cause
stack corruptions.
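
The pattern is roughly the following (a sketch with an illustrative
function; the 1024B figure comes from the message above):

  #define QT_STACK_BUF_SIZE 1024

  static int qt_with_block_buffer(size_t size)
  {
          char stackbuf[QT_STACK_BUF_SIZE];
          char *buf = size <= sizeof(stackbuf) ? stackbuf
                                               : kmalloc(size, GFP_NOFS);

          if (!buf)
                  return -ENOMEM;
          memset(buf, 0, size);   /* stand-in for the real tree work */
          if (buf != stackbuf)
                  kfree(buf);
          return 0;
  }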

Co-authored-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com>
2022-11-12 11:24:42 +00:00
Park Ju Hyung
33ce619f2f techpack: display: add some bp hints to hot paths
Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com>
[@0ctobot: Adapted for 4.19]
Signed-off-by: Adam W. Willis <return.of.octobot@gmail.com>
2022-11-12 11:24:41 +00:00
Park Ju Hyung
aeef1bc25e drm/msm: use kmem_cache pool for struct vblank_work
These get allocated and freed millions of times on this kernel tree.

Use a dedicated kmem_cache pool and avoid costly dynamic memory allocations.
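
Sketched out, the change looks like this (assumes the driver's existing
struct vblank_work; names are illustrative):

  static struct kmem_cache *vblank_work_cache;

  static int vblank_work_cache_init(void)
  {
          /* dedicated pool: hot-path alloc/free skips kmalloc() */
          vblank_work_cache = kmem_cache_create("vblank_work",
                          sizeof(struct vblank_work), 0,
                          SLAB_HWCACHE_ALIGN, NULL);
          return vblank_work_cache ? 0 : -ENOMEM;
  }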

Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com>
2022-11-12 11:24:41 +00:00
Park Ju Hyung
d2baa1092e kthread: use buffer from the stack space
struct kthread_create_info is small enough to fit comfortably on the
stack.

Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com>
Signed-off-by: Adam W. Willis <return.of.octobot@gmail.com>
2022-11-12 11:24:41 +00:00
Park Ju Hyung
518acbdb88 exec: use bprm from the stack space
struct linux_binprm isn't big and is safe to use from the stack.

Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com>
[@0ctobot: Adapted for 4.19]
Signed-off-by: Adam W. Willis <return.of.octobot@gmail.com>
2022-11-12 11:24:41 +00:00
Park Ju Hyung
60b66cad88 sched: do not allocate window cpu arrays separately
These are allocated extremely frequently.

Allocate them with CONFIG_NR_CPUS entries upon struct ravg's
allocation.

This will break WALT debug tracing.

Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com>
Signed-off-by: Adam W. Willis <return.of.octobot@gmail.com>
2022-11-12 11:24:41 +00:00
Park Ju Hyung
19d0fb968c power_supply: don't allocate attrname
healthd queries this extremely frequently, and attrname is allocated
and de-allocated repeatedly.

Use the stack space instead.

Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com>
Signed-off-by: Adam W. Willis <return.of.octobot@gmail.com>
2022-11-12 11:24:40 +00:00