There is noticeable scheduling latency and heavy zone lock contention
stemming from rmqueue_bulk() holding the zone lock for the entire
duration of its work, as seen with the preemptoff tracer. There's no
actual need for
rmqueue_bulk() to hold the zone lock the entire time; it only does so
for supposed efficiency. As such, we can relax the zone lock and even
reschedule when IRQs are enabled in order to keep the scheduling delays
and zone lock contention at bay. Forward progress is still guaranteed,
as the zone lock can only be relaxed after page removal.
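The lock-break pattern can be sketched with a pthread mutex standing in for the zone spin lock (illustrative names and batch size; the kernel uses spin_unlock()/cond_resched() and its own free-list structures):

```c
#include <pthread.h>
#include <sched.h>
#include <stddef.h>

#define BATCH 8

/* Hypothetical item; stands in for a page removed from the free list. */
struct item { struct item *next; };

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Remove up to `count` items from `*list`, relaxing the lock every
 * BATCH removals so other lockers (and the scheduler) can make
 * progress. Forward progress is preserved because the lock is only
 * dropped *after* items have been detached from the shared list.
 */
int take_items(struct item **list, struct item **out, int count)
{
	int taken = 0;

	pthread_mutex_lock(&lock);
	while (taken < count && *list) {
		struct item *it = *list;

		*list = it->next;
		it->next = *out;
		*out = it;
		taken++;

		/* Lock break: let waiters in periodically. */
		if (taken % BATCH == 0) {
			pthread_mutex_unlock(&lock);
			sched_yield();	/* analogous to cond_resched() */
			pthread_mutex_lock(&lock);
		}
	}
	pthread_mutex_unlock(&lock);
	return taken;
}
```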
With this change, rmqueue_bulk() no longer appears as a serious offender
in the preemptoff tracer, and system latency is noticeably improved.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
The page allocator processes free pages in groups of pageblocks, where
the size of a pageblock is typically quite large (1024 pages without
hugetlbpage support). Pageblocks are processed atomically with the zone
lock held, which can cause severe scheduling delays on both the CPU
going through the pageblock and any other CPUs waiting to acquire the
zone lock. A frequent offender is move_freepages_block(), which is used
by rmqueue() for page allocation.
As it turns out, there's no requirement for pageblocks to be so large,
so the pageblock order can simply be reduced to ease the scheduling
delays and zone lock contention. PAGE_ALLOC_COSTLY_ORDER is used as a
reasonable setting to ensure non-costly page allocation requests can
still be serviced without always needing to free up more than one
pageblock's worth of pages at a time.
This has a noticeable effect on overall system latency when memory
pressure is elevated. The various mm functions which operate on
pageblocks no longer appear in the preemptoff tracer, where previously
they would spend up to 100 ms on a mobile arm64 CPU processing a
pageblock with preemption disabled and the zone lock held.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
There's no requirement that perf_event_read_local() be used from a
context where CPU migration isn't possible, yet smp_processor_id() is
used with the assumption that the caller guarantees CPU migration can't
occur. Since IRQs are disabled here anyway, the smp_processor_id() can
simply be moved to the IRQ-disabled section to guarantee its safety.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
The partial cache maintenance helpers check the number of segments in
each mapping before checking if the mapping is actually in use, which
sometimes results in spurious errors being returned to vidc. The errors
then cause vidc to malfunction, even though nothing's wrong.
The reason for checking the segment count first was to elide map_rwsem;
however, it turns out that map_rwsem isn't needed anyway, so we can have
our cake and eat it too.
Fix the spurious segment count errors by reordering the checks, and
remove map_rwsem entirely so we don't have to worry about eliding it for
performance reasons.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
There isn't a need for cpus_affine to be atomic, and reading/writing to
it outside of the global pm_qos lock is racy anyway. As such, we can
simply turn it into a primitive integer type.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
The plist is already sorted and traversed in ascending order of PM QoS
value, so we can simply look at the lowest PM QoS values which affect
the given request's CPUs until we've looked at all of them, at which
point the traversal can be stopped early. This also lets us get rid of
the pesky qos_val array.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
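The early-terminating traversal can be modeled in plain C (a sketch with illustrative names; the kernel walks a plist of PM QoS requests sorted by value rather than an array, and uses cpumasks rather than a bitmask word):

```c
#include <stdint.h>

#define NR_CPUS 8
#define DEFAULT_VAL INT32_MAX

/* Hypothetical request: a PM QoS value plus the CPUs it applies to. */
struct req {
	int32_t val;
	uint32_t cpus;
};

/*
 * `reqs` is sorted ascending by val, mirroring the plist. Because the
 * first value seen for a CPU is the lowest one affecting it, the
 * traversal can stop as soon as every CPU in `relevant` is covered.
 */
void compute_targets(const struct req *reqs, int nr,
		     uint32_t relevant, int32_t *target)
{
	uint32_t seen = 0;
	int i, cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		target[cpu] = DEFAULT_VAL;

	for (i = 0; i < nr && (seen & relevant) != relevant; i++) {
		uint32_t fresh = reqs[i].cpus & relevant & ~seen;

		for (cpu = 0; cpu < NR_CPUS; cpu++)
			if (fresh & (1u << cpu))
				target[cpu] = reqs[i].val;
		seen |= fresh;
	}
}
```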
Andrzej Perczak discovered that his CPUs would almost never enter an
idle state deeper than C0, and pinpointed the cause of the issue to be
commit "qos: Speed up pm_qos_set_value_for_cpus()". As it turns out, the
optimizations introduced in that commit contain two issues that are
responsible for this behavior: pm_qos_remove_request() fails to refresh
the affected per-CPU targets, and IRQ migrations fail to refresh their
old affinity's targets.
Removing a request fails to refresh the per-CPU targets because
`new_req->node.prio` isn't updated to the PM QoS class' default value
upon removal, and so it contains its old value from when it was active.
This causes the `changed` loop in pm_qos_set_value_for_cpus() to check
against a stale PM QoS request value and erroneously determine that the
request in question doesn't alter the current per-CPU targets.
As for IRQ migrations, only the new CPU affinity mask gets updated,
which causes the CPUs present in the old affinity mask but not the new
one to retain their targets, specifically when a migration occurs while
the associated PM QoS request is active.
To fix these issues while retaining optimal speed, update PM QoS
requests' CPU affinity inside pm_qos_set_value_for_cpus() so that the
old affinity can be known, and skip the `changed` loop when the request
in question is being removed.
Reported-by: Andrzej Perczak <kartapolska@gmail.com>
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
The bias timer is only started when WFI is used, so we only need to
try to cancel it after leaving WFI.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Scheduling in atomic context is indicative of a serious problem that,
although it may not be immediately lethal, can lead to strange issues and
eventually a panic. We should therefore panic the first time it's
detected.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
It turns out that the ION_IOC_HEAP_QUERY command is actually used in
some camera-related components in Android 11, such as libultradepth_api,
and in libdmabufheap in Android 12. The omission of this command causes
these components to break when their ioctl attempt returns -ENOTTY.
Restore the ION_IOC_HEAP_QUERY command to fix the incompatibility.
Unfortunately, libdmabufheap uses heap names in order to look up heap
IDs so that the calling userspace code can maintain a constant heap name
and cope with inconsistent heap IDs. For example, if some user code
wants to allocate from the system heap, it only has to specify "system"
as the desired heap name, and it doesn't need to keep track of the
system heap ID.
This is unfortunate because now we must copy heap name strings to
userspace. In order to speed this up, a statically-allocated array sized
for the maximum number of heaps is populated with heap data as heaps are
created. When a heap query command
requests heap data, all we have to do is copy the big array of pre-made
data, and we're done.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
We can omit the _IOC_SIZE() check and also inline copy_from_user() by
duplicating copy_from_user() for each ioctl command and giving it a
constant size. Since there aren't many ioctls here, this doesn't turn
the code into spaghetti.
We can further optimize the prefetch ioctls as well by omitting one word
of data from the copy_from_user(), since the first member of `struct
ion_prefetch_data` (the `len` field) is unused. As proof of this, rename
`len` to `unused` in the uapi header, which also ensures that the
compiler will notify us if this ever changes in the future. This is
necessary because the prefetch data is used outside of ion.c, where we
cannot easily audit its usage.
There's no reduction done for the allocation ioctl because we could only
reduce the copy_from_user() payload by half a word, which would result
in a payload size that isn't a multiple of a word. The copy_from_user()
implementation on arm64 goes slower as a result, so just leave it
untouched.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Heaps are never removed, and there is only one ion_device_add_heap()
user: msm_ion_probe(). This single user calls ion_device_add_heap()
sequentially, not concurrently. Furthermore, all heap additions are done
once kernel init is complete, and heaps are only accessed by userspace,
so no locking is needed at all here.
The write lock in ion_walk_heaps() doesn't make sense either since the
heap-walking functions neither mutate a heap's placement in the plist,
nor change a heap in a way that requires pausing all buffer allocations.
The functions used in the heap walking routine handle synchronization
themselves, so there's no need for the mutex-style locking here. This
write lock appears to be a historical artifact from the following 2013
commit (present in msm-3.4 trees) where a justification for the write
lock was never given: 7c1b8aa23ef ("gpu: ion: Add support for heap
walking").
Since the heap plist rwsem appears to be thoroughly useless, we can
safely remove it to reduce complexity and improve performance.
Also, change the name of ion_device_add_heap() to ion_add_heap() so the
compiler can notify us if ion_device_add_heap() is used elsewhere in the
future.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
The following warning occurs because we don't update the runqueue's
clock when taking rq->lock in sched_migrate_to_cpumask_end():
rq->clock_update_flags < RQCF_ACT_SKIP
WARNING: CPU: 0 PID: 991 at update_curr+0x1c8/0x2bc
[...]
Call trace:
update_curr+0x1c8/0x2bc
dequeue_task_fair+0x7c/0x1238
do_set_cpus_allowed+0x64/0x28c
sched_migrate_to_cpumask_end+0xa8/0x1b4
m_stop+0x40/0x78
seq_read+0x39c/0x4ac
__vfs_read+0x44/0x12c
vfs_read+0xf0/0x1d8
SyS_read+0x6c/0xcc
el0_svc_naked+0x34/0x38
Fix it by adding an update_rq_clock() call when taking rq->lock.
Signed-off-by: celtare21 <celtare21@gmail.com>
These are in the critical path for rendering frames to the display, so
mark them as performance-critical and affine them to the big CPU
cluster. They aren't placed onto the prime cluster because the
single-CPU prime cluster will be used to run the DRM IRQ and kthreads.
DRM is more latency-critical than KGSL and we need to have DRM and KGSL
running on separate CPUs for the best performance, so KGSL gets the big
cluster.
Note that since there are other IRQs requested via kgsl_request_irq(),
we must specify that the IRQ to be made perf-critical is kgsl_3d0_irq.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
This reverts commit 417bded5a942a2a23ad65b3fe5fd3fff2d0dbf5b.
This is wrong. This causes 3 IRQs to be affined to the big CPU cluster,
not just the primary kgsl_3d0_irq one. As a result, the perf crit API
thinks that the 2 extra IRQs are critical and will balance them despite
them being rarely used (kgsl_hfi_irq and kgsl_gmu_irq).
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Scheduler code is very hot and every little optimization counts. Instead
of constantly checking sched_numa_balancing when NUMA is disabled,
compile it out.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
PID map reads for processes with thousands of mappings can be done
extensively by certain Android apps, burning through CPU time on
higher-performance CPUs even though reading PID maps is never a
performance-critical task. We can relieve the load on the important CPUs
by moving PID map reads to little CPUs via sched_migrate_to_cpumask_*().
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: dreamisbaka <jolinux.g@gmail.com>
There are some chunks of code in the kernel running in process context
where it may be helpful to run the code on a specific set of CPUs, such
as when reading some CPU-intensive procfs files. This is especially
useful when the code in question must run within the context of the
current process (so kthreads cannot be used).
Add an API to make this possible, which consists of the following:
sched_migrate_to_cpumask_start():
@old_mask: pointer to output the current task's old cpumask
@dest: pointer to a cpumask the current task should be moved to
sched_migrate_to_cpumask_end():
@old_mask: pointer to the old cpumask generated earlier
@dest: pointer to the dest cpumask provided earlier
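A userspace analogue of the pair, using sched_setaffinity() in place of the kernel's cpumask plumbing (a sketch only; the in-kernel version manipulates the task's cpumask under rq locks, and this analogue drops the `dest` argument from the _end() call):

```c
#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>

/*
 * Userspace model of the proposed API. migrate_to_cpumask_start()
 * saves the caller's current affinity into *old_mask and moves the
 * caller onto *dest; migrate_to_cpumask_end() restores the saved mask.
 */
int migrate_to_cpumask_start(cpu_set_t *old_mask, const cpu_set_t *dest)
{
	if (sched_getaffinity(0, sizeof(*old_mask), old_mask))
		return -1;
	return sched_setaffinity(0, sizeof(*dest), dest);
}

int migrate_to_cpumask_end(const cpu_set_t *old_mask)
{
	/* Restore the affinity saved by the _start() call. */
	return sched_setaffinity(0, sizeof(*old_mask), old_mask);
}
```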
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: dreamisbaka <jolinux.g@gmail.com>
Android and various applications in Android need to read PID map data in
order to work. Some processes can contain over 10,000 mappings, which
results in lots of time wasted on simply generating strings. This wasted
time adds up, especially in the case of Unity-based games, which utilize
the Boehm garbage collector. A game's main process typically has well
over 10,000 mappings due to the loaded textures, and the Boehm GC reads
PID maps several times a second. This results in over 100,000 map
entries being printed out per second, so micro-optimization here is
important. Before this commit, show_vma_header_prefix() would typically
take around 1000 ns to run on a Snapdragon 855; now it only takes about
50 ns to run, which is a 20x improvement.
The primary micro-optimizations here assume that there are no more than
40 bits in the virtual address space, hence the CONFIG_ARM64_VA_BITS
check. Arm64 uses a virtual address size of 39 bits, so this perfectly
covers it.
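The fixed-width formatting that the 40-bit assumption enables can be sketched as follows (illustrative, not the kernel's actual helper; 40 bits means exactly 10 hex digits, so the loop has a fixed trip count and sidesteps the generic printf machinery):

```c
#include <stdint.h>

/*
 * Emit `val` as exactly 10 lower-case hex digits (40 bits), keeping
 * leading zeros, and return a pointer past the last digit written.
 * The 40-bit bound is the same assumption the commit makes via the
 * CONFIG_ARM64_VA_BITS check.
 */
static char *hex40(char *buf, uint64_t val)
{
	static const char digits[] = "0123456789abcdef";
	int shift;

	for (shift = 36; shift >= 0; shift -= 4)
		*buf++ = digits[(val >> shift) & 0xf];
	return buf;
}
```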
This also removes padding used to beautify PID map output to further
speed up reads and reduce the amount of bytes printed, and optimizes the
dentry path retrieval for file-backed mappings. Note, however, that the
trailing space at the end of the line for non-file-backed mappings
cannot be omitted, as it breaks some PID map parsers.
This still retains insignificant leading zeros from printed hex values
to maintain the current output format.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: LibXZR <xzr467706992@163.com>
There's no point in enabling QoS clocks for clients that don't have
any.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Adam W. Willis <return.of.octobot@gmail.com>
Page pool additions and removals are very hot during GPU workloads, so
they should be optimized accordingly. We can use a lock-less list for
storing the free pages in order to speed things up. The lock-less list
allows for one llist_del_first() user and unlimited llist_add() users to
run concurrently, so only a spin lock around the llist_del_first() is
needed; everything else is lock-free. The per-pool page count is now an
atomic to make it lock-free as well.
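The scheme can be modeled with C11 atomics (a sketch with illustrative names; the kernel uses llist_add()/llist_del_first() with a spin lock around the deleter, not pthreads):

```c
#include <stdatomic.h>
#include <pthread.h>
#include <stddef.h>

struct node { struct node *next; };

struct pool {
	_Atomic(struct node *) head;	/* models the llist head */
	atomic_int count;		/* lock-free page count */
	pthread_mutex_t del_lock;	/* serializes deleters only */
};

/* Lock-free push: any number of adders may run concurrently. */
void pool_add(struct pool *p, struct node *n)
{
	struct node *old = atomic_load(&p->head);

	do {
		n->next = old;
	} while (!atomic_compare_exchange_weak(&p->head, &old, n));
	atomic_fetch_add(&p->count, 1);
}

/* Pop: only one deleter at a time, per the llist_del_first() rule. */
struct node *pool_del_first(struct pool *p)
{
	struct node *n;

	pthread_mutex_lock(&p->del_lock);
	n = atomic_load(&p->head);
	while (n && !atomic_compare_exchange_weak(&p->head, &n, n->next))
		;	/* CAS failed: an adder raced us, retry */
	pthread_mutex_unlock(&p->del_lock);

	if (n)
		atomic_fetch_sub(&p->count, 1);
	return n;
}
```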
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: LibXZR <xzr467706992@163.com>
Clearing dim layers indiscriminately for each blend stage on each commit
wastes a lot of CPU time since the clearing process is heavy on register
accesses. We can optimize this by only clearing dim layers when they're
actually set, and only clearing them on a per-stage basis at that. This
reduces display commit latency considerably.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
The most frequent user of fenced GMU writes, adreno_ringbuffer_submit(),
performs a fenced GMU write under a spin lock, and since fenced GMU
writes use udelay(), a lot of CPU cycles are burned here. Not only is
the spin lock held for longer than necessary (because the write doesn't
need to be inside the spin lock), but also a lot of CPU time is wasted
in udelay() for tens of microseconds when usleep_range() can be used
instead.
Move the locked fenced GMU writes to outside their spin locks and make
adreno_gmu_fenced_write() use usleep_range() when not in atomic/IRQ
context, to save power and improve performance. Fenced GMU writes are
found to take an average of 28 microseconds on the Snapdragon 855, so a
usleep range of 10 to 30 microseconds is optimal.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
The time profiling here is only used to provide additional debug info
for a context dump as well as a tracepoint. It adds non-trivial overhead
to ringbuffer submission since it accesses GPU registers, so remove it
along with the tracepoint since we're not debugging adreno.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Qualcomm's PM QoS solution suffers from a number of issues: applying
PM QoS to all CPUs, convoluted spaghetti code that wastes CPU cycles,
and keeping PM QoS applied for 10 ms after all requests finish
processing.
This implements a simple IRQ-affined PM QoS mechanism for each UFS
adapter which uses atomics to elide locking, and enqueues a worker to
apply PM QoS to the target CPU as soon as a command request is issued.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: alk3pInjection <webmaster@raspii.tech>
Qualcomm's QoS implementation wastes a significant amount of power on
CPU cycles.
Scrap the QoS bits and save a bit of power without hurting any
functionality.
Change-Id: I1de3563d9c99ba863f10a90a900d290bdd8e6b79
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
Signed-off-by: Carlos Ayrton Lopez Arroyo <15030201@itcelaya.edu.mx>
This implementation is completely over the top and wastes lots of CPU
cycles. It's too convoluted to fix, so just scrap it to make way for a
simpler solution. This purges every PM QoS reference in the UFS drivers.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: alk3pInjection <webmaster@raspii.tech>
Using a timeout for a PM QoS request can lead to disastrous results on
power consumption. It's always possible to find a fixed scope in which a
PM QoS request should be applied, so timeouts aren't ever strictly
needed; they're usually just a lazy way of using PM QoS. Remove the API
so that it cannot be abused any longer.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
KGSL already has PM QoS covering what matters. The L2PC PM QoS code is
not only unneeded, but also unused, so remove it. It's poorly designed
anyway since it uses a timeout with PM QoS, which is drastically bad for
power consumption.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Reading and clearing any errors from the VBIF error registers takes a
significant amount of time during kickoff, and is only used to produce
debug logs when errors are detected. Since we're not debugging hardware
issues in MDSS, remove the VBIF error clearing entirely to reduce
display rendering latency.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Explicit write memory barriers are unneeded here since releasing a lock
already implies a full memory barrier.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
The IRQ status reads are decoupled from the IRQ dispatcher, even though
the dispatcher is the only one using the IRQ statuses. This results in a
lot of redundant work being done as the IRQ status reader also reads the
IRQ-enable register and clears the IRQ mask, both of which are already
handled by the IRQ dispatcher. We can cut out the redundant work done in
the hardware IRQ handler by consolidating the IRQ status reads into the
IRQ dispatcher.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
These heavy checks for seeing if autorefresh is enabled are unneeded
when the autorefresh config is disabled. These checks are performed on
every display commit and show up as using a significant amount of CPU
time in perf top. Skip them when it's unnecessary in order to improve
display rendering performance.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Every atomic frame commit allocates memory dynamically to check the
states of the CRTCs, when those allocations can just be stored on the
stack instead. Eliminate these dynamic memory allocations in the frame
commit path to improve performance. They don't need to be zeroed out
either.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Diab Neiroukh <lazerl0rd@thezest.dev>
The default INITCONTEXTLEN-sized buffers can fit on the stack. Do so to
save a call to kmalloc() in a hot path.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Most, if not all, map keys and values are rather small and can fit on
the stack, eliminating the need to allocate them dynamically. Reserve
some small stack buffers for them to avoid dynamic memory allocation.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Most command buffers here are rather small (fewer than 256 words); it's
a waste of time to dynamically allocate memory for such a small buffer
when it could easily fit on the stack.
Conditionally using an on-stack command buffer when the size is small
enough eliminates the need for using a dynamically-allocated buffer most
of the time, reducing GPU command submission latency.
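The pattern can be sketched as follows (illustrative names and a model workload; the real code copies command words from userspace and submits them to the GPU):

```c
#include <stdlib.h>
#include <string.h>

#define ONSTACK_WORDS 256	/* threshold from the commit */

/*
 * Stage `nwords` command words in an on-stack array when the count is
 * small enough, falling back to malloc() only for large buffers. The
 * summing loop stands in for actual command processing.
 */
long process_cmds(const unsigned *src, size_t nwords)
{
	unsigned onstack[ONSTACK_WORDS];
	unsigned *cmds = onstack;
	long sum = 0;
	size_t i;

	if (nwords > ONSTACK_WORDS) {
		cmds = malloc(nwords * sizeof(*cmds));
		if (!cmds)
			return -1;
	}

	memcpy(cmds, src, nwords * sizeof(*cmds));
	for (i = 0; i < nwords; i++)
		sum += cmds[i];

	if (cmds != onstack)
		free(cmds);
	return sum;
}
```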
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Ashmem uses a single big mutex lock for all synchronization, and even
uses it when no synchronization issues are present. The contention from
using a single lock results in all-around poor performance.
Rewrite to use fine-grained locks and atomic constructions to eliminate
the big mutex lock, thereby improving performance greatly. In places
where locks are needed for a one-time operation, we speculatively
check if locking is needed while avoiding data races. The optional name
fields are removed as well.
Note that because asma->unpinned_list never has anything added to it,
we can remove any code using it to clean up the driver a lot and
reduce synchronization requirements. This also means that
ashmem_lru_list never gets anything added to it either, so all code
using it is dead code as well, which we can remove.
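The speculative-check idea can be sketched with C11 atomics (illustrative only, not ashmem's actual code): a lock-free read handles the common case, and the lock is taken, with a re-check under it, only when the one-time transition might actually be needed.

```c
#include <stdatomic.h>
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_bool done;
static int setup_calls;

void do_once(void)
{
	/* Fast path: no lock and no store once the flag is set. */
	if (atomic_load_explicit(&done, memory_order_acquire))
		return;

	pthread_mutex_lock(&lock);
	/* Re-check under the lock: another thread may have won the race. */
	if (!atomic_load_explicit(&done, memory_order_relaxed)) {
		setup_calls++;	/* the one-time work */
		atomic_store_explicit(&done, 1, memory_order_release);
	}
	pthread_mutex_unlock(&lock);
}
```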
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Remote register I/O amounts to a measurably significant portion of CPU
time due to how frequently this function is used. Cache the value of
each register on-demand and use this value in future invocations to
mitigate the expensive I/O.
Co-authored-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
[@0ctobot: Adapted for msm-4.19]
Signed-off-by: Adam W. Willis <return.of.octobot@gmail.com>
The plane states allocation and free show up on perf top as taking up a
non-trivial amount of time on every commit. Since the allocation is
small, just place it on the stack to eliminate the dynamic allocation
overhead completely.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Diab Neiroukh <lazerl0rd@thezest.dev>
More often than not, get_vsync_info() is used only to get the write
count, while the other values it returns are left unused. This is not
optimal since it is done on every display commit. We can eliminate the
superfluous register reads by adding a parameter specifying if only the
write line count is requested.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Diab Neiroukh <lazerl0rd@thezest.dev>
The cleanup portion of non-blocking commits takes up a non-trivial
amount of CPU time, so offload it to the little CPUs to reduce latency
in the display commit path.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
[lazerl0rd: Adjust for Linux 4.19, with different commit cleanup.]
Signed-off-by: Diab Neiroukh <lazerl0rd@thezest.dev>
The default frequency on Qualcomm CPUs is the lowest frequency supported
by the CPU. This hurts latency when waking from suspend, as each CPU
coming online runs at its lowest frequency until the governor can take
over later. To speed up waking from suspend, hijack the CPUHP_AP_ONLINE
hook and use it to set the highest available frequency on each CPU as
they come online. This is done behind the governor's back but it's fine
because the governor isn't running at this point in time for a CPU
that's coming online.
This speeds up waking from suspend significantly.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
[lazerl0rd: Adjusted to apply to qcom-cpufreq-hw instead of clk-cpu-osm.]
Signed-off-by: Diab Neiroukh <lazerl0rd@thezest.dev>
Since we know an interrupt will be arriving soon when a frame is
committed, we can anticipate it and prevent the CPU servicing that
interrupt from entering deep idle states. This reduces display rendering
latency.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Diab Neiroukh <lazerl0rd@thezest.dev>
A lot of unnecessary work is done in pm_qos_set_value_for_cpus(),
especially when the request being updated isn't affined to all CPUs.
We can reduce the work done here significantly by only inspecting the
CPUs which are affected by the updated request, and bailing out if the
updated request doesn't change anything.
We can make some other micro-optimizations as well knowing that this
code is only for the PM_QOS_CPU_DMA_LATENCY class.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
RCU callbacks are not time-critical and constitute kernel housekeeping.
Offload the no-callback kthreads onto little CPUs to clear load off of
the more important, higher-performance CPUs.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Binder code is very hot, so checking frequently to see if a debug
message should be printed is a waste of cycles. We're not debugging
binder, so just stub out the debug prints to compile them out entirely.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
All of these dynamically-allocated structs can be simply placed on the
stack, eliminating the overhead of dynamic memory allocation entirely.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Most write buffers are rather small and can fit on the stack,
eliminating the need to allocate them dynamically. Reserve a 4 KiB
stack buffer for this purpose to avoid the overhead of dynamic
memory allocation.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>