Commit Graph

870580 Commits

Author SHA1 Message Date
Sultan Alsawaf
b75887b8a8 mm: Don't hog the CPU and zone lock in rmqueue_bulk()
There is noticeable scheduling latency and heavy zone lock contention
stemming from rmqueue_bulk() holding the zone lock across all of its
work, as seen with the preemptoff tracer. There's no actual need for
rmqueue_bulk() to hold the zone lock the entire time; it only does so
for supposed efficiency. As such, we can relax the zone lock and even
reschedule when IRQs are enabled in order to keep the scheduling delays
and zone lock contention at bay. Forward progress is still guaranteed,
as the zone lock can only be relaxed after page removal.

With this change, rmqueue_bulk() no longer appears as a serious offender
in the preemptoff tracer, and system latency is noticeably improved.
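
As a rough illustration of the locking pattern only (not the actual
diff; __rmqueue()'s exact signature varies by kernel version):

    spin_lock(&zone->lock);
    for (i = 0; i < count; i++) {
        struct page *page = __rmqueue(zone, order, migratetype);

        if (!page)
            break;
        list_add_tail(&page->lru, list);
        /*
         * The page is already off the free list, so forward progress
         * is guaranteed even though the lock is relaxed here.
         */
        spin_unlock(&zone->lock);
        if (!irqs_disabled())
            cond_resched();
        spin_lock(&zone->lock);
    }
    spin_unlock(&zone->lock);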

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:20 +00:00
Sultan Alsawaf
370fefb53f mm: Lower the non-hugetlbpage pageblock size to reduce scheduling delays
The page allocator processes free pages in groups of pageblocks, where
the size of a pageblock is typically quite large (1024 pages without
hugetlbpage support). Pageblocks are processed atomically with the zone
lock held, which can cause severe scheduling delays on both the CPU
going through the pageblock and any other CPUs waiting to acquire the
zone lock. A frequent offender is move_freepages_block(), which is used
by rmqueue() for page allocation.

As it turns out, there's no requirement for pageblocks to be so large,
so the pageblock order can simply be reduced to ease the scheduling
delays and zone lock contention. PAGE_ALLOC_COSTLY_ORDER is used as a
reasonable setting to ensure non-costly page allocation requests can
still be serviced without always needing to free up more than one
pageblock's worth of pages at a time.

This has a noticeable effect on overall system latency when memory
pressure is elevated. The various mm functions which operate on
pageblocks no longer appear in the preemptoff tracer, where previously
they would spend up to 100 ms on a mobile arm64 CPU processing a
pageblock with preemption disabled and the zone lock held.
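
Conceptually, the change amounts to something like the following
(illustrative sketch; the real definition lives in
include/linux/pageblock-flags.h):

    /* Without CONFIG_HUGETLB_PAGE this used to be MAX_ORDER - 1 (1024 pages). */
    #define pageblock_order    PAGE_ALLOC_COSTLY_ORDER    /* order 3, i.e. 8 pages */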

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:20 +00:00
Sultan Alsawaf
3e0eb86439 perf/core: Fix risky smp_processor_id() usage in perf_event_read_local()
There's no requirement that perf_event_read_local() be used from a
context where CPU migration isn't possible, yet smp_processor_id() is
used with the assumption that the caller guarantees CPU migration can't
occur. Since IRQs are disabled here anyway, the smp_processor_id() can
simply be moved to the IRQ-disabled section to guarantee its safety.
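
A sketch of the reordering (illustrative; the surrounding checks are
elided):

    unsigned long flags;
    int cpu;

    local_irq_save(flags);
    /* Safe: the task cannot migrate to another CPU while IRQs are off. */
    cpu = smp_processor_id();
    /* ...the checks that previously used a possibly stale CPU number... */
    local_irq_restore(flags);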

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:20 +00:00
Sultan Alsawaf
e06cdf4ce5 ion: Fix partial cache maintenance operations
The partial cache maintenance helpers check the number of segments in
each mapping before checking if the mapping is actually in use, which
sometimes results in spurious errors being returned to vidc. The errors
then cause vidc to malfunction, even though nothing's wrong.

The reason for checking the segment count first was to elide map_rwsem;
however, it turns out that map_rwsem isn't needed anyway, so we can have
our cake and eat it too.

Fix the spurious segment count errors by reordering the checks, and
remove map_rwsem entirely so we don't have to worry about eliding it for
performance reasons.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:20 +00:00
Sultan Alsawaf
7efe3414b7 qos: Change cpus_affine to not be atomic
There isn't a need for cpus_affine to be atomic, and reading/writing to
it outside of the global pm_qos lock is racy anyway. As such, we can
simply turn it into a primitive integer type.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:19 +00:00
Sultan Alsawaf
803faf2d23 qos: Speed up plist traversal in pm_qos_set_value_for_cpus()
The plist is already sorted and traversed in ascending order of PM QoS
value, so we only need to look at the lowest PM QoS values affecting the
given request's CPUs; once every one of those CPUs has been covered, the
traversal can be stopped early. This also lets us get rid of the pesky
qos_val array.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:19 +00:00
Sultan Alsawaf
5d000fb40c qos: Fix PM QoS requests almost never shutting off
Andrzej Perczak discovered that his CPUs would almost never enter an
idle state deeper than C0, and pinpointed the cause of the issue to be
commit "qos: Speed up pm_qos_set_value_for_cpus()". As it turns out, the
optimizations introduced in that commit contain two issues that are
responsible for this behavior: pm_qos_remove_request() fails to refresh
the affected per-CPU targets, and IRQ migrations fail to refresh their
old affinity's targets.

Removing a request fails to refresh the per-CPU targets because
`new_req->node.prio` isn't updated to the PM QoS class' default value
upon removal, and so it contains its old value from when it was active.
This causes the `changed` loop in pm_qos_set_value_for_cpus() to check
against a stale PM QoS request value and erroneously determine that the
request in question doesn't alter the current per-CPU targets.

As for IRQ migrations, only the new CPU affinity mask gets updated,
which causes the CPUs present in the old affinity mask but not the new
one to retain their targets, specifically when a migration occurs while
the associated PM QoS request is active.

To fix these issues while retaining optimal speed, update PM QoS
requests' CPU affinity inside pm_qos_set_value_for_cpus() so that the
old affinity can be known, and skip the `changed` loop when the request
in question is being removed.

Reported-by: Andrzej Perczak <kartapolska@gmail.com>
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:19 +00:00
Sultan Alsawaf
40106c0f1f cpuidle: lpm-levels: Only cancel the bias timer when it's used
The bias timer is only started when WFI is used, so we only need to
try and cancel it after leaving WFI.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:19 +00:00
Sultan Alsawaf
cf3a9c131c sched/core: Always panic when scheduling in atomic context
Scheduling in atomic context is indicative of a serious problem that,
although it may not be immediately lethal, can lead to strange issues
and eventually a panic. We should therefore panic the first time it's
detected.
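
A sketch of the escalation, assuming the existing __schedule_bug()
warning path is what gets promoted to a panic:

    static noinline void __schedule_bug(struct task_struct *prev)
    {
        panic("BUG: scheduling while atomic: %s/%d/0x%08x\n",
              prev->comm, prev->pid, preempt_count());
    }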

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:19 +00:00
Sultan Alsawaf
57cc3cbe1e ion: Restore ION_IOC_HEAP_QUERY ioctl command
It turns out that the ION_IOC_HEAP_QUERY command is actually used in
some camera-related components in Android 11, such as libultradepth_api,
and in libdmabufheap in Android 12. The omission of this command causes
these components to break when their ioctl attempt returns -ENOTTY.

Restore the ION_IOC_HEAP_QUERY command to fix the incompatibility.

Unfortunately, libdmabufheap uses heap names in order to look up heap
IDs so that the calling userspace code can maintain a constant heap name
and cope with inconsistent heap IDs. For example, if some user code
wants to allocate from the system heap, it only has to specify "system"
as the desired heap name, and it doesn't need to keep track of the
system heap ID.

This is unfortunate because now we must copy heap name strings to
userspace. In order to speed this up, a pre-allocated array, which is
statically allocated to accommodate the maximum number of heaps, is
populated with heap data as heaps are created. When a heap query command
requests heap data, all we have to do is copy the big array of pre-made
data, and we're done.
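
A sketch of the pre-built query array (the array bound and the place
where it is populated are assumptions, not the actual code):

    static struct ion_heap_data heap_query_data[32];    /* assumed max heap count */
    static u32 heap_query_cnt;

    /* Populated once per heap at heap-creation time: */
    strlcpy(heap_query_data[heap_query_cnt].name, heap->name,
            sizeof(heap_query_data[0].name));
    heap_query_data[heap_query_cnt].heap_id = heap->id;
    heap_query_cnt++;

    /* ION_IOC_HEAP_QUERY then reduces to one copy_to_user() of this array. */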

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:18 +00:00
Sultan Alsawaf
5fb388c1c1 ion: Further optimize ioctl handler
We can omit the _IOC_SIZE() check and also inline copy_from_user() by
duplicating copy_from_user() for each ioctl command and giving it a
constant size. Since there aren't many ioctls here, this doesn't turn
the code into spaghetti.

We can further optimize the prefetch ioctls as well by omitting one word
of data from the copy_from_user(), since the first member of `struct
ion_prefetch_data` (the `len` field) is unused. As proof of this, rename
`len` to `unused` in the uapi header, which also ensures that the
compiler will notify us if this ever changes in the future. This is
necessary because the prefetch data is used outside of ion.c, where we
cannot easily audit its usage.

There's no reduction done for the allocation ioctl because we could only
reduce the copy_from_user() payload by half a word, which would result
in a payload size that isn't a multiple of a word. The copy_from_user()
implementation on arm64 would be slower as a result, so just leave it
untouched.
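
A sketch of the per-command constant-size copy (the handler helper is
hypothetical):

    switch (cmd) {
    case ION_IOC_ALLOC: {
        struct ion_allocation_data data;

        /* Constant sizeof(data) lets the compiler inline copy_from_user(). */
        if (copy_from_user(&data, (void __user *)arg, sizeof(data)))
            return -EFAULT;
        return ion_ioctl_alloc(filp, &data);    /* hypothetical helper */
    }
    /* ...one case per remaining command, each with its own constant size... */
    }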

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:18 +00:00
Sultan Alsawaf
5e17b2f335 ion: Remove unneeded rwsem for the heap priority list
Heaps are never removed, and there is only one ion_device_add_heap()
user: msm_ion_probe(). This single user calls ion_device_add_heap()
sequentially, not concurrently. Furthermore, all heap additions are done
once kernel init is complete, and heaps are only accessed by userspace,
so no locking is needed at all here.

The write lock in ion_walk_heaps() doesn't make sense either since the
heap-walking functions neither mutate a heap's placement in the plist,
nor change a heap in a way that requires pausing all buffer allocations.
The functions used in the heap walking routine handle synchronization
themselves, so there's no need for the mutex-style locking here. This
write lock appears to be a historical artifact from the following 2013
commit (present in msm-3.4 trees) where a justification for the write
lock was never given: 7c1b8aa23ef ("gpu: ion: Add support for heap
walking").

Since the heap plist rwsem appears to be thoroughly useless, we can
safely remove it to reduce complexity and improve performance.

Also, change the name of ion_device_add_heap() to ion_add_heap() so the
compiler can notify us if ion_device_add_heap() is used elsewhere in the
future.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:18 +00:00
celtare21
5ddd0c1940 sched/core: Fix rq clock warning in sched_migrate_to_cpumask_end()
The following warning occurs because we don't update the runqueue's
clock when taking rq->lock in sched_migrate_to_cpumask_end():

rq->clock_update_flags < RQCF_ACT_SKIP
WARNING: CPU: 0 PID: 991 at update_curr+0x1c8/0x2bc
[...]
Call trace:
update_curr+0x1c8/0x2bc
dequeue_task_fair+0x7c/0x1238
do_set_cpus_allowed+0x64/0x28c
sched_migrate_to_cpumask_end+0xa8/0x1b4
m_stop+0x40/0x78
seq_read+0x39c/0x4ac
__vfs_read+0x44/0x12c
vfs_read+0xf0/0x1d8
SyS_read+0x6c/0xcc
el0_svc_naked+0x34/0x38

Fix it by adding an update_rq_clock() call when taking rq->lock.
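
A sketch of the fix; the locking helpers shown follow the mainline
scheduler API, and whether sched_migrate_to_cpumask_end() uses exactly
these is an assumption:

    struct rq_flags rf;
    struct rq *rq;

    rq = task_rq_lock(p, &rf);
    update_rq_clock(rq);    /* satisfies the RQCF_ACT_SKIP assertion */
    do_set_cpus_allowed(p, mask);
    task_rq_unlock(rq, p, &rf);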

Signed-off-by: celtare21 <celtare21@gmail.com>
2022-11-12 11:24:18 +00:00
Sultan Alsawaf
adaa599abb msm: kgsl: Affine kgsl_3d0_irq and worker kthread to the big CPU cluster
These are in the critical path for rendering frames to the display, so
mark them as performance-critical and affine them to the big CPU
cluster. They aren't placed onto the prime cluster because the
single-CPU prime cluster will be used to run the DRM IRQ and kthreads.
DRM is more latency-critical than KGSL and we need to have DRM and KGSL
running on separate CPUs for the best performance, so KGSL gets the big
cluster.

Note that since there are other IRQs requested via kgsl_request_irq(),
we must specify that the IRQ to be made perf-critical is kgsl_3d0_irq.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:18 +00:00
Sultan Alsawaf
94f3e31c1b Revert "msm: kgsl: Affine IRQ and worker kthread to the big CPU cluster"
This reverts commit 417bded5a942a2a23ad65b3fe5fd3fff2d0dbf5b.

This is wrong. This causes 3 IRQs to be affined to the big CPU cluster,
not just the primary kgsl_3d0_irq one. As a result, the perf crit API
thinks that the 2 extra IRQs are critical and will balance them despite
them being rarely used (kgsl_hfi_irq and kgsl_gmu_irq).

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:17 +00:00
Sultan Alsawaf
686d26f283 sched/fair: Compile out NUMA code entirely when NUMA is disabled
Scheduler code is very hot and every little optimization counts. Instead
of constantly checking sched_numa_balancing when NUMA is disabled,
compile it out.
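
Illustrative pattern only (the exact call sites differ; this just shows
the guard letting the compiler drop the branch entirely):

    #ifdef CONFIG_NUMA_BALANCING
        if (static_branch_unlikely(&sched_numa_balancing))
            task_tick_numa(rq, curr);
    #endif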

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:17 +00:00
Sultan Alsawaf
e23d4ea590 mm: Perform PID map reads on the little CPU cluster
PID map reads for processes with thousands of mappings can be done
extensively by certain Android apps, burning through CPU time on
higher-performance CPUs even though reading PID maps is never a
performance-critical task. We can relieve the load on the important CPUs
by moving PID map reads to little CPUs via sched_migrate_to_cpumask_*().

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: dreamisbaka <jolinux.g@gmail.com>
2022-11-12 11:24:17 +00:00
Sultan Alsawaf
c2c3304ca2 sched: Add API to migrate the current process to a given cpumask
There are some chunks of code in the kernel running in process context
where it may be helpful to run the code on a specific set of CPUs, such
as when reading some CPU-intensive procfs files. This is especially
useful when the code in question must run within the context of the
current process (so kthreads cannot be used).

Add an API to make this possible, which consists of the following:
sched_migrate_to_cpumask_start():
 @old_mask: pointer to output the current task's old cpumask
 @dest: pointer to a cpumask the current task should be moved to

sched_migrate_to_cpumask_end():
 @old_mask: pointer to the old cpumask generated earlier
 @dest: pointer to the dest cpumask provided earlier
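
A usage sketch based on the signatures above; cpu_lp_mask (a
little-cluster mask) is used here only as an example destination:

    cpumask_t old_mask;

    sched_migrate_to_cpumask_start(&old_mask, cpu_lp_mask);
    /* ...CPU-intensive work that must stay in the current process... */
    sched_migrate_to_cpumask_end(&old_mask, cpu_lp_mask);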

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: dreamisbaka <jolinux.g@gmail.com>
2022-11-12 11:24:17 +00:00
Sultan Alsawaf
d516115a24 mm: Micro-optimize PID map reads for arm64 while retaining output format
Android and various applications in Android need to read PID map data in
order to work. Some processes can contain over 10,000 mappings, which
results in lots of time wasted on simply generating strings. This wasted
time adds up, especially in the case of Unity-based games, which utilize
the Boehm garbage collector. A game's main process typically has well
over 10,000 mappings due to the loaded textures, and the Boehm GC reads
PID maps several times a second. This results in over 100,000 map
entries being printed out per second, so micro-optimization here is
important. Before this commit, show_vma_header_prefix() would typically
take around 1000 ns to run on a Snapdragon 855; now it only takes about
50 ns to run, which is a 20x improvement.

The primary micro-optimizations here assume that there are no more than
40 bits in the virtual address space, hence the CONFIG_ARM64_VA_BITS
check. Arm64 uses a virtual address size of 39 bits, so this perfectly
covers it.

This also removes padding used to beautify PID map output to further
speed up reads and reduce the amount of bytes printed, and optimizes the
dentry path retrieval for file-backed mappings. Note, however, that the
trailing space at the end of the line for non-file-backed mappings
cannot be omitted, as it breaks some PID map parsers.

This still retains insignificant leading zeros from printed hex values
to maintain the current output format.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: LibXZR <xzr467706992@163.com>
2022-11-12 11:24:17 +00:00
Sultan Alsawaf
2d1025e96a msm: msm_bus: Don't enable QoS clocks when none are present
There's no point in enabling QoS clocks when there are none for certain
clients.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Adam W. Willis <return.of.octobot@gmail.com>
2022-11-12 11:24:16 +00:00
Sultan Alsawaf
f3df72d95e msm: kgsl: Use lock-less list for page pools
Page pool additions and removals are very hot during GPU workloads, so
they should be optimized accordingly. We can use a lock-less list for
storing the free pages in order to speed things up. The lock-less list
allows for one llist_del_first() user and unlimited llist_add() users to
run concurrently, so only a spin lock around the llist_del_first() is
needed; everything else is lock-free. The per-pool page count is now an
atomic to make it lock-free as well.
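
A generic sketch of the pattern (names are illustrative, not the actual
pool code):

    static LLIST_HEAD(free_pages);
    static DEFINE_SPINLOCK(del_lock);    /* serializes llist_del_first() only */
    static atomic_t page_count = ATOMIC_INIT(0);

    static void pool_add(struct llist_node *entry)
    {
        llist_add(entry, &free_pages);    /* lock-free for any number of adders */
        atomic_inc(&page_count);
    }

    static struct llist_node *pool_del(void)
    {
        struct llist_node *entry;

        spin_lock(&del_lock);
        entry = llist_del_first(&free_pages);
        spin_unlock(&del_lock);
        if (entry)
            atomic_dec(&page_count);
        return entry;
    }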

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: LibXZR <xzr467706992@163.com>
2022-11-12 11:24:16 +00:00
Sultan Alsawaf
da748e58a9 drm/msm/sde: Don't clear dim layers when there aren't any applied
Clearing dim layers indiscriminately for each blend stage on each commit
wastes a lot of CPU time since the clearing process is heavy on register
accesses. We can optimize this by only clearing dim layers when they're
actually set, and only clearing them on a per-stage basis at that. This
reduces display commit latency considerably.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:16 +00:00
Sultan Alsawaf
03af212bd8 msm: kgsl: Don't busy wait for fenced GMU writes when possible
The most frequent user of fenced GMU writes, adreno_ringbuffer_submit(),
performs a fenced GMU write under a spin lock, and since fenced GMU
writes use udelay(), a lot of CPU cycles are burned here. Not only is
the spin lock held for longer than necessary (because the write doesn't
need to be inside the spin lock), but also a lot of CPU time is wasted
in udelay() for tens of microseconds when usleep_range() can be used
instead.

Move the locked fenced GMU writes to outside their spin locks and make
adreno_gmu_fenced_write() use usleep_range() when not in atomic/IRQ
context, to save power and improve performance. Fenced GMU writes are
found to take an average of 28 microseconds on the Snapdragon 855, so a
usleep range of 10 to 30 microseconds is optimal.
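
A sketch of the wait-side change (the atomic-context test and the udelay
value are assumptions; the 10 to 30 microsecond range is from the
description above):

    if (in_interrupt() || irqs_disabled())
        udelay(10);                /* unchanged busy-wait path */
    else
        usleep_range(10, 30);      /* fenced writes average ~28 us here */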

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:16 +00:00
Sultan Alsawaf
0871f5ceea msm: kgsl: Remove unneeded time profiling from ringbuffer submission
The time profiling here is only used to provide additional debug info
for a context dump as well as a tracepoint. It adds non-trivial overhead
to ringbuffer submission since it accesses GPU registers, so remove it
along with the tracepoint since we're not debugging adreno.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:15 +00:00
Sultan Alsawaf
a8052f1777 scsi: ufs: Add simple IRQ-affined PM QoS operations
Qualcomm's PM QoS solution suffers from a number of issues: applying
PM QoS to all CPUs, convoluted spaghetti code that wastes CPU cycles,
and keeping PM QoS applied for 10 ms after all requests finish
processing.

This implements a simple IRQ-affined PM QoS mechanism for each UFS
adapter which uses atomics to elide locking, and enqueues a worker to
apply PM QoS to the target CPU as soon as a command request is issued.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: alk3pInjection <webmaster@raspii.tech>
2022-11-12 11:24:15 +00:00
Panchajanya1999
67322ffa74 drivers/char: adsprpc: Remove Qcom's PM_QoS implementation
Qualcomm's QoS implementation wastes a significant amount of power on
CPU cycles.
Scrap the QoS bits and save a bit of power without hurting any
functionality.

Change-Id: I1de3563d9c99ba863f10a90a900d290bdd8e6b79
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
Signed-off-by: Carlos Ayrton Lopez Arroyo <15030201@itcelaya.edu.mx>
2022-11-12 11:24:15 +00:00
Sultan Alsawaf
d89f76b07e scsi: ufs: Scrap Qualcomm's PM QoS implementation
This implementation is completely over the top and wastes lots of CPU
cycles. It's too convoluted to fix, so just scrap it to make way for a
simpler solution. This purges every PM QoS reference in the UFS drivers.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: alk3pInjection <webmaster@raspii.tech>
2022-11-12 11:24:15 +00:00
Sultan Alsawaf
c0b8dc4d0a qos: Remove pm_qos_update_request_timeout() API
Using a timeout for a PM QoS request can lead to disastrous results on
power consumption. It's always possible to find a fixed scope in which a
PM QoS request should be applied, so timeouts aren't ever strictly
needed; they're usually just a lazy way of using PM QoS. Remove the API
so that it cannot be abused any longer.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:15 +00:00
Sultan Alsawaf
28c9e970a8 msm: kgsl: Remove L2PC PM QoS feature
KGSL already has PM QoS covering what matters. The L2PC PM QoS code is
not only unneeded, but also unused, so remove it. It's poorly designed
anyway since it uses a timeout with PM QoS, which is drastically bad for
power consumption.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:14 +00:00
Sultan Alsawaf
85bc91328d drm/msm/sde: Don't read and clear VBIF errors upon commit
Reading and clearing any errors from the VBIF error registers takes a
significant amount of time during kickoff, and is only used to produce
debug logs when errors are detected. Since we're not debugging hardware
issues in MDSS, remove the VBIF error clearing entirely to reduce
display rendering latency.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:14 +00:00
Sultan Alsawaf
952dfb1b3f drm/msm/sde: Remove redundant write memory barriers from IRQ routines
Explicit write memory barriers are unneeded here since releasing a lock
already implies a full memory barrier.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:14 +00:00
Sultan Alsawaf
07276d430e drm/msm/sde: Consolidate IRQ status reads into IRQ dispatcher
The IRQ status reads are decoupled from the IRQ dispatcher, even though
the dispatcher is the only one using the IRQ statuses. This results in a
lot of redundant work being done as the IRQ status reader also reads the
IRQ-enable register and clears the IRQ mask, both of which are already
handled by the IRQ dispatcher. We can cut out the redundant work done in
the hardware IRQ handler by consolidating the IRQ status reads into the
IRQ dispatcher.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:14 +00:00
Sultan Alsawaf
eb696e16c7 drm/msm/sde: Skip heavy autorefresh checks when it's not enabled
These heavy checks for seeing if autorefresh is enabled are unneeded
when the autorefresh config is disabled. These checks are performed on
every display commit and show up as using a significant amount of CPU
time in perf top. Skip them when it's unnecessary in order to improve
display rendering performance.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:14 +00:00
Sultan Alsawaf
ee675ad54d drm/msm/sde: Don't allocate memory dynamically for CRTC atomic check
Every atomic frame commit allocates memory dynamically to check the
states of the CRTCs, when that state can just be stored on the stack
instead. Eliminate these dynamic memory allocations in the frame commit
path to improve performance. The buffers don't need to be zeroed out
either.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Diab Neiroukh <lazerl0rd@thezest.dev>
2022-11-12 11:24:13 +00:00
Sultan Alsawaf
7375d8ee7a selinux: Avoid dynamic memory allocation for INITCONTEXTLEN buffers
The default INITCONTEXTLEN-sized buffers can fit on the stack. Do so to
save a call to kmalloc() in a hot path.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:13 +00:00
Sultan Alsawaf
f18743c5ae bpf: Avoid allocating small buffers for map keys and values
Most, if not all, map keys and values are rather small and can fit on
the stack, eliminating the need to allocate them dynamically. Reserve
some small stack buffers for them to avoid dynamic memory allocation.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:13 +00:00
Sultan Alsawaf
a97ce0db2c msm: kgsl: Avoid dynamically allocating small command buffers
Most command buffers here are rather small (fewer than 256 words); it's
a waste of time to dynamically allocate memory for such a small buffer
when it could easily fit on the stack.

Conditionally using an on-stack command buffer when the size is small
enough eliminates the need for using a dynamically-allocated buffer most
of the time, reducing GPU command submission latency.
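
A sketch of the conditional stack-buffer pattern (the threshold and
names are illustrative):

    u32 stack_cmds[256];    /* covers the common, small case */
    u32 *cmds = stack_cmds;

    if (dwords > ARRAY_SIZE(stack_cmds)) {
        cmds = kmalloc_array(dwords, sizeof(*cmds), GFP_KERNEL);
        if (!cmds)
            return -ENOMEM;
    }
    /* ...build and submit the command buffer... */
    if (cmds != stack_cmds)
        kfree(cmds);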

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:13 +00:00
LibXZR
ce7ebaf0c0 ashmem: Adapt building on msm-4.19
2022-11-12 11:24:12 +00:00
Sultan Alsawaf
58c1240c0b ashmem: Rewrite to improve clarity and performance
Ashmem uses a single big mutex lock for all synchronization, and even
uses it when no synchronization issues are present. The contention from
using a single lock results in all-around poor performance.

Rewrite to use fine-grained locks and atomic constructions to eliminate
the big mutex lock, thereby improving performance greatly. In places
where locks are needed for a one-time operation, we speculatively
check if locking is needed while avoiding data races. The optional name
fields are removed as well.

Note that because asma->unpinned_list never has anything added to it,
we can remove any code using it to clean up the driver a lot and
reduce synchronization requirements. This also means that
ashmem_lru_list never gets anything added to it either, so all code
using it is dead code as well, which we can remove.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:12 +00:00
Danny Lin
98adc65124 drm/msm/sde: Cache register values when performing clock control
Remote register I/O amounts to a measurably significant portion of CPU
time due to how frequently the clock control function is called.
each register on-demand and use this value in future invocations to
mitigate the expensive I/O.
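
A sketch of the on-demand caching (field names are assumptions, not the
actual driver structures):

    if (!ctx->cache_valid) {
        ctx->reg_cache = readl_relaxed(ctx->base + offset);
        ctx->cache_valid = true;
    }
    val = enable ? (ctx->reg_cache | clk_bit) : (ctx->reg_cache & ~clk_bit);
    if (val != ctx->reg_cache) {
        writel_relaxed(val, ctx->base + offset);
        ctx->reg_cache = val;
    }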

Co-authored-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
[@0ctobot: Adapted for msm-4.19]
Signed-off-by: Adam W. Willis <return.of.octobot@gmail.com>
2022-11-12 11:24:12 +00:00
Sultan Alsawaf
aafe02c4d8 drm/msm/sde: Don't allocate memory dynamically for plane states
The plane states allocation and free show up on perf top as taking up a
non-trivial amount of time on every commit. Since the allocation is
small, just place it on the stack to eliminate the dynamic allocation
overhead completely.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Diab Neiroukh <lazerl0rd@thezest.dev>
2022-11-12 11:24:12 +00:00
Sultan Alsawaf
238880552e drm/msm/sde: Skip unneeded register reads when getting write line count
More often than not, get_vsync_info() is used to only get the write line
count, while the other values it returns are left unused. This is not
optimal since it is done on every display commit. We can eliminate the
superfluous register reads by adding a parameter specifying if only the
write line count is requested.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Diab Neiroukh <lazerl0rd@thezest.dev>
2022-11-12 11:24:12 +00:00
Sultan Alsawaf
2fb480b5de drm/msm: Offload commit cleanup onto little CPUs
The cleanup portion of non-blocking commits takes up a non-trivial
amount of CPU time, so offload it to the little CPUs to reduce latency
in the display commit path.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
[lazerl0rd: Adjust for Linux 4.19, with different commit cleanup.]
Signed-off-by: Diab Neiroukh <lazerl0rd@thezest.dev>
2022-11-12 11:24:11 +00:00
Sultan Alsawaf
5e45392196 clk: qcom: qcom-cpufreq-hw: Set each CPU clock to its max when waking up
The default frequency on Qualcomm CPUs is the lowest frequency supported
by the CPU. This hurts latency when waking from suspend, as each CPU
coming online runs at its lowest frequency until the governor can take
over later. To speed up waking from suspend, hijack the CPUHP_AP_ONLINE
hook and use it to set the highest available frequency on each CPU as
they come online. This is done behind the governor's back but it's fine
because the governor isn't running at this point in time for a CPU
that's coming online.
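
A rough sketch of the idea; the register helper and names here are
hypothetical, not the real driver code:

    static int qcom_cpufreq_online_max(unsigned int cpu)
    {
        /* Hypothetical helper: request the highest LUT entry for this CPU. */
        qcom_cpufreq_hw_write_perf_state(cpu, max_perf_state(cpu));
        return 0;
    }

    cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE, "clk/qcom-cpufreq:online",
                              qcom_cpufreq_online_max, NULL);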

This speeds up waking from suspend significantly.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
[lazerl0rd: Adjusted to apply to qcom-cpufreq-hw instead of clk-cpu-osm.]
Signed-off-by: Diab Neiroukh <lazerl0rd@thezest.dev>
2022-11-12 11:24:11 +00:00
Sultan Alsawaf
78d0f9faf4 drm/msm: Speed up interrupt processing upon commit
Since we know an interrupt will be arriving soon when a frame is
committed, we can anticipate it and prevent the CPU servicing that
interrupt from entering deep idle states. This reduces display rendering
latency.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Diab Neiroukh <lazerl0rd@thezest.dev>
2022-11-12 11:24:11 +00:00
Sultan Alsawaf
c2e09bd1cf qos: Speed up pm_qos_set_value_for_cpus()
A lot of unnecessary work is done in pm_qos_set_value_for_cpus(),
especially when the request being updated isn't affined to all CPUs.
We can reduce the work done here significantly by only inspecting the
CPUs which are affected by the updated request, and bailing out if the
updated request doesn't change anything.

We can make some other micro-optimizations as well knowing that this
code is only for the PM_QOS_CPU_DMA_LATENCY class.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:11 +00:00
Sultan Alsawaf
530c735e50 rcu: Run nocb kthreads on little CPUs
RCU callbacks are not time-critical and constitute kernel housekeeping.
Offload the no-callback kthreads onto little CPUs to clear load off of
the more important, higher-performance CPUs.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:11 +00:00
Sultan Alsawaf
143b27d222 binder: Stub out debug prints by default
Binder code is very hot, so checking frequently to see if a debug
message should be printed is a waste of cycles. We're not debugging
binder, so just stub out the debug prints to compile them out entirely.
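
A sketch of the stubbing, using no_printk() so format checking is
preserved (the exact macro bodies are an assumption):

    #define binder_debug(mask, x...)    no_printk(x)
    #define binder_user_error(x...)     no_printk(x)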

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:10 +00:00
Sultan Alsawaf
21ef075aef ALSA: control_compat: Don't dynamically allocate single-use structs
All of these dynamically-allocated structs can be simply placed on the
stack, eliminating the overhead of dynamic memory allocation entirely.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:10 +00:00
Sultan Alsawaf
7781d6b3bd kernfs: Avoid dynamic memory allocation for small write buffers
Most write buffers are rather small and can fit on the stack,
eliminating the need to allocate them dynamically. Reserve a 4 KiB
stack buffer for this purpose to avoid the overhead of dynamic
memory allocation.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:10 +00:00