Commit Graph

Sultan Alsawaf
78d0f9faf4 drm/msm: Speed up interrupt processing upon commit
Since we know an interrupt will be arriving soon when a frame is
committed, we can anticipate it and prevent the CPU servicing that
interrupt from entering deep idle states. This reduces display rendering
latency.
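
A minimal sketch of the idea using the legacy PM QoS interface of this kernel
generation; the request name and the exact latency value are assumptions, not
the actual patch:

    static struct pm_qos_request commit_qos_req;    /* hypothetical */

    /* init: pm_qos_add_request(&commit_qos_req, PM_QOS_CPU_DMA_LATENCY,
     *                          PM_QOS_DEFAULT_VALUE); */

    /* frame committed: cap wakeup latency so the IRQ is serviced promptly */
    pm_qos_update_request(&commit_qos_req, 100);    /* ~100 us, shallow idle only */

    /* after the commit-done interrupt is handled: lift the restriction */
    pm_qos_update_request(&commit_qos_req, PM_QOS_DEFAULT_VALUE);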

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Diab Neiroukh <lazerl0rd@thezest.dev>
2022-11-12 11:24:11 +00:00
Sultan Alsawaf
c2e09bd1cf qos: Speed up pm_qos_set_value_for_cpus()
A lot of unnecessary work is done in pm_qos_set_value_for_cpus(),
especially when the request being updated isn't affined to all CPUs.
We can reduce the work done here significantly by only inspecting the
CPUs which are affected by the updated request, and bailing out if the
updated request doesn't change anything.

We can make some other micro-optimizations as well, since this code only
serves the PM_QOS_CPU_DMA_LATENCY class.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:11 +00:00
Sultan Alsawaf
530c735e50 rcu: Run nocb kthreads on little CPUs
RCU callbacks are not time-critical and constitute kernel housekeeping.
Offload the no-callback kthreads onto little CPUs to clear load off of
the more important, higher-performance CPUs.
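
A sketch of the affinity change, assuming this tree exposes a little-cluster
mask such as cpu_lp_mask (the symbol and the exact call site are assumptions):

    /* kernel/rcu/tree_plugin.h, after the rcuo kthread is spawned (sketch) */
    t = kthread_run(rcu_nocb_kthread, rdp, "rcuo%c/%d", rsp->abbr, cpu);
    if (!IS_ERR(t))
        set_cpus_allowed_ptr(t, cpu_lp_mask);    /* keep housekeeping off big cores */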

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:11 +00:00
Sultan Alsawaf
143b27d222 binder: Stub out debug prints by default
Binder code is very hot, so checking frequently to see if a debug
message should be printed is a waste of cycles. We're not debugging
binder, so just stub out the debug prints to compile them out entirely.
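
The stubbing amounts to turning the logging macro into a no-op, roughly:

    /* before: every call site checks binder_debug_mask at runtime */
    #define binder_debug(mask, x...) \
        do { \
            if (binder_debug_mask & mask) \
                pr_info_ratelimited(x); \
        } while (0)

    /* after: compiles out entirely; the arguments are never evaluated */
    #define binder_debug(mask, x...) do {} while (0)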

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:10 +00:00
Sultan Alsawaf
21ef075aef ALSA: control_compat: Don't dynamically allocate single-use structs
All of these dynamically-allocated structs can be simply placed on the
stack, eliminating the overhead of dynamic memory allocation entirely.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:10 +00:00
Sultan Alsawaf
7781d6b3bd kernfs: Avoid dynamic memory allocation for small write buffers
Most write buffers are rather small and can fit on the stack. Reserve a
4 KiB stack buffer for this purpose to avoid the overhead of dynamic
memory allocation.
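
A sketch of the resulting pattern in the kernfs write path; the threshold and
fallback handling are assumptions:

    char stack_buf[SZ_4K];            /* 4 KiB scratch buffer on the stack */
    char *buf = stack_buf;

    if (len >= sizeof(stack_buf)) {
        /* rare large write: fall back to the heap as before */
        buf = kmalloc(len + 1, GFP_KERNEL);
        if (!buf)
            return -ENOMEM;
    }

    if (copy_from_user(buf, user_buf, len)) {
        len = -EFAULT;
        goto out_free;
    }
    buf[len] = '\0';

    /* ... hand the buffer to the kernfs ops ... */

    out_free:
    /* only the heap fallback needs freeing */
    if (buf != stack_buf)
        kfree(buf);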

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:10 +00:00
Sultan Alsawaf
b34db5c768 sched/fair: Don't remove important task migration logic from PELT
This task migration logic is guarded by a WALT #ifdef even though it has
nothing specific to do with WALT. The result is that, with PELT, boosted
tasks can be migrated to the little cluster, causing visible stutters.

Move the WALT #ifdef so PELT can benefit from this logic as well.

Thanks to Zachariah Kennedy <zkennedy87@gmail.com> and Danny Lin
<danny@kdrag0n.dev> for discovering this issue and creating this fix.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:10 +00:00
Sultan Alsawaf
5cfb51a3e7 workqueue: Affine unbound workqueues to little CPUs by default
Although unbound workqueues are eligible to run their workers on any
CPU, the workqueue subsystem prefers scheduling workers onto the CPU
which queues them. This results in unbound workers consuming valuable
CPU time on the big and prime CPU clusters.

We can alleviate the burden of kernel housekeeping on the more important
CPUs by moving the unbound workqueues to the little CPU cluster by
default. This may also reduce power consumption, which is a plus.

Fix Warning:
kernel/workqueue.c:5765:6: warning: unused variable 'hk_flags' [-Wunused-variable]
        int hk_flags = HK_FLAG_DOMAIN | HK_FLAG_WQ;
[spak]
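
A sketch of the default-mask change in workqueue_init_early(), assuming the
tree provides a little-cluster mask such as cpu_lp_mask:

    /* kernel/workqueue.c (sketch) */
    BUG_ON(!alloc_cpumask_var(&wq_unbound_cpumask, GFP_KERNEL));
    cpumask_copy(wq_unbound_cpumask, cpu_lp_mask);  /* was: housekeeping_cpumask(hk_flags) */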

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:10 +00:00
Sultan Alsawaf
ae14413b9d sched/rt: Change default SCHED_RR timeslice from 100 ms to 1 jiffy
For us, it's most helpful to have the round-robin timeslice as low as is
allowed by the scheduler to reduce latency. Since it's limited by the
scheduler tick rate, just set the default to 1 jiffy, which is the
lowest possible value.
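
In kernel/sched/rt.c terms this is roughly the following one-liner
(RR_TIMESLICE is the upstream 100 ms default, expressed in jiffies):

    /* was: int sched_rr_timeslice = RR_TIMESLICE;  (100 * HZ / 1000 jiffies) */
    int sched_rr_timeslice = 1;    /* 1 jiffy: the smallest slice the tick allows */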

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:09 +00:00
Sultan Alsawaf
bb31f94813 sched/core: Use SCHED_RR in place of SCHED_FIFO for all users
Although SCHED_FIFO is a real-time scheduling policy, it can hurt system
latency, since a SCHED_FIFO task runs until it blocks or yields, never
giving way to other tasks of the same priority. This can result in
visible micro-stalls when a SCHED_FIFO task hogs the CPU for too long. On
a system where latency is favored over throughput, SCHED_RR is a better
choice than SCHED_FIFO.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:09 +00:00
Sultan Alsawaf
9a8d0bd815 kernel: Extend the perf-critical API to little CPUs
It's helpful to be able to affine low-priority kthreads to the little
CPUs, such as for deferred memory cleanup. Extend the perf-critical API
to make this possible.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:09 +00:00
Sultan Alsawaf
62d5626ba3 drm: Reduce latency while processing atomic ioctls
Unfortunately, our display driver must occasionally sleep inside
start_atomic(), which causes the CPU running the ioctl to schedule and
potentially idle. Depending on how deeply the running CPU idles, there
can be quite a bit of latency added to processing the "atomic" ioctl as
a result, which hurts display rendering latency. We can alleviate this
by optimistically assuming that the task running the ioctl won't
migrate CPUs while it runs and restricting the current CPU to shallow
idle states.
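
A sketch of the mechanism using the msm PM QoS core-affined request type; the
latency value and the way the affinity is expressed are assumptions:

    struct pm_qos_request req = {
        .type = PM_QOS_REQ_AFFINE_CORES,
        /* assumption: affinity is set to cover only the current CPU */
    };

    pm_qos_add_request(&req, PM_QOS_CPU_DMA_LATENCY, 100);  /* ~100 us: shallow idle only */
    /* ... run the atomic commit, which may sleep in start_atomic() ... */
    pm_qos_remove_request(&req);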

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:09 +00:00
Sultan Alsawaf
df7fe36f50 disp: msm: Remove debug print from sde_reg_write()
This unused debug print wastes CPU time when writing to registers,
resulting in perf top reporting a decent chunk of time spent inside
sde_reg_write(). Removing the debug print gets sde_reg_write() off perf
top's radar.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:09 +00:00
Sultan Alsawaf
cfd927d9f2 msm: kgsl: Relax CPU latency requirements to save power
Relaxing the CPU latency requirement by about 500 us won't significantly
hurt graphics performance. On the flip side, most SoCs have many idle
levels just below 1000 us in latency, with deeper idle levels having
latencies in excess of 2000 us. Changing the latency requirement to
1000 us allows most SoCs to use their deepest sub-1000-us idle state
while the GPU is active.

Additionally, since the lpm driver has been updated to allow power
levels with latencies equal to target latencies, change the wakeup
latency from 101 to 100 for clarity.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:08 +00:00
Sultan Alsawaf
b29379f4da msm: kgsl: Reduce latency while processing ioctls
The GPU's ioctls sleep a lot due to memory allocation, during which time
the CPU can enter a deep idle state and take a while to finish executing
the ioctl afterwards. We can alleviate this by optimistically assuming
that the task running the ioctl won't migrate CPUs while it runs and
restricting the current CPU to shallow idle states.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:08 +00:00
Sultan Alsawaf
bee671e32a kernel: Affine hwcomposer to big CPUs
HWC in Android is responsible for communicating userspace's graphics
requests to the kernel. Affining it to fast CPUs greatly reduces jitter.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:08 +00:00
Sultan Alsawaf
281004c8c4 PM: sleep: Don't allow s2idle to be used
Unfortunately, s2idle is only somewhat functional. Although commit
70441d36af58 ("cpuidle: lpm_levels: add soft watchdog for s2idle") makes
s2idle usable, there are still CPU stalls caused by s2idle's buggy
behavior, and the aforementioned hack doesn't address them. Therefore,
let's stop userspace from enabling s2idle and instead enforce the
default deep sleep mode.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:08 +00:00
Sultan Alsawaf
bae6dcbbb8 dma-buf/sync_file: Speed up ioctl by omitting debug names
A lot of CPU time is wasted allocating, populating, and copying debug
names back and forth with userspace when they're not actually needed. We
can't simply remove the name buffers from the various sync data
structures, because we must preserve ABI compatibility with userspace;
instead, we can just pretend the name fields of the user-shared structs
aren't there. This massively reduces the amount of memory allocated for
these data structures and the amount of data copied to and from
userspace, and eliminates a kzalloc() entirely from
sync_file_ioctl_fence_info(), thus improving graphics performance.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:08 +00:00
Danny Lin
8f5c1833fc binder: Fix log spam caused by interrupted waits
Being interrupted by a signal during binder ioctls is not unusual.
Commit 218225c05beaf02320f05a9b3b7041e27a0684f7 changed the interrupted
ioctl return code from ERESTARTSYS to EINTR, but the error logging check
was never updated:

[  352.090265] binder: 6534:6552 ioctl c0306201 7a0f9fea10 returned -4
[  352.090393] binder: 6534:7670 ioctl c0306201 7a6489da10 returned -4
[  352.090432] binder: 6534:9501 ioctl c0306201 7a61377a10 returned -4

Update the check accordingly to fix the log spam.
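
The offending check sits in binder_ioctl()'s exit path; the fix is a
one-constant change (sketch):

    /* was: if (ret && ret != -ERESTARTSYS) */
    if (ret && ret != -EINTR)
        pr_info("%d:%d ioctl %x %lx returned %d\n",
            proc->pid, current->pid, cmd, arg, ret);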

Signed-off-by: Danny Lin <danny@kdrag0n.dev>
2022-11-12 11:24:07 +00:00
Sultan Alsawaf
77cee07f2b arm64: Allow LD_DEAD_CODE_DATA_ELIMINATION to be selected
DCE is not problematic on arm64 and works out of the box. Let it be
used.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:07 +00:00
Sultan Alsawaf
d130711775 arm64: Keep alternative-instruction sections
Otherwise DCE happily discards them and we're left with a kernel that
doesn't boot.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:07 +00:00
Sultan Alsawaf
c939e768d6 arm64: Inline the spin lock function family
Combined with LTO, this yields a consistent 5% boost to procfs I/O
performance right off the bat (as measured with callbench). The spin
lock functions constitute some of the hottest code paths in the kernel;
inlining them to improve performance makes sense.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:07 +00:00
Sultan Alsawaf
20c9e246d5 kbuild: Increase GCC automatic inline instruction limit to 1000 for LTO
GCC 10 updated its interprocedural optimizer's logic to have it make
more conservative inlining decisions, resulting in worse syscall and
hackbench performance compared to GCC 9. Although the max-inline-insns-
auto parameter's value was not altered, increasing it from the -O3
default of 30 to 1000 instructions yields improved performance with LTO,
surpassing GCC 9.

Do this only for LTO, though, because for non-LTO builds it causes GCC
to produce mountains of spurious -Wmaybe-uninitialized warnings.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:07 +00:00
Sultan Alsawaf
12d906d079 lto: Add Link Time Optimization support for GCC
This requires a modern version of GCC and various other patches in order
to work. LTO results in a smaller kernel binary with better performance.

Based on work by Andi Kleen <ak@linux.intel.com>.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:06 +00:00
Andi Kleen
6eac219ad8 lto: Add __noreorder and mark initcalls __noreorder
GCC 5 has a new no_reorder attribute that prevents top-level reordering
only for that symbol.

Kernels don't like any reordering of initcalls between files, as several
initcalls depend on each other. LTO previously needed to use
-fno-toplevel-reorder to prevent boot failures.

Add a __noreorder wrapper for the no_reorder attribute and use
it for initcalls.
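
A sketch of the wrapper and its use, assuming the usual compiler-gcc.h /
init.h layout of this kernel generation:

    /* include/linux/compiler-gcc.h (sketch) */
    #if GCC_VERSION >= 50000
    #define __noreorder    __attribute__((no_reorder))
    #else
    #define __noreorder
    #endif

    /* include/linux/init.h (sketch): keep initcall order stable under LTO */
    #define __define_initcall(fn, id) \
        static initcall_t __initcall_##fn##id __used __noreorder \
        __attribute__((__section__(".initcall" #id ".init"))) = fn;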

Signed-off-by: Andi Kleen <ak@linux.intel.com>
2022-11-12 11:24:06 +00:00
Sultan Alsawaf
da7364b282 sys_ni: Fix cond_syscall() alias for LTO
When using LTO, the conditional syscall aliases aren't weak, and instead
override implemented syscalls rather than serve as a fallback for
missing syscalls. Fix the cond_syscall() alias using an attribute so
that it gets properly evaluated at link time.
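
A sketch of the shape of the fix: express the stub as a C-level weak alias,
which GCC's LTO plugin can evaluate, instead of a bare assembler-level .weak
directive (the exact macro and its location are assumptions):

    /* was (roughly): an asm(".weak"/".set") alias that LTO cannot see, so the
     * sys_ni_syscall stub ends up overriding real syscall definitions */
    #define cond_syscall(x) \
        asmlinkage long x(void) __attribute__((weak, alias("sys_ni_syscall")))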

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:06 +00:00
Sultan Alsawaf
23055407dc scripts: gcc-ld: Fix -m* flag parsing
Any -m* flags need to be prefixed with "-Wl," when passed to the linker.
This is already done for flags which aren't special-cased, so we can
just remove the specific -m* flag handling to appease GCC.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:06 +00:00
Sultan Alsawaf
3ee4d5731d Revert "kbuild: thin archives final link close --whole-archives option"
This reverts commit 1328a1ae0e.

This breaks GCC LTO.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:06 +00:00
Sultan Alsawaf
56e3dceb7c arm64: Disable place-relative 32-bit relocations for GCC LTO
This is not supported by LTO.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:05 +00:00
Sultan Alsawaf
2b687a7733 arm64: Tune GCC for lito's CPU
This yields a modest, consistent improvement in hackbench performance.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:05 +00:00
Sultan Alsawaf
37f65732ef arm64: Only disable LTO for vDSO when Clang is used
GCC LTO works fine on the vDSO.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:05 +00:00
Sultan Alsawaf
67fd676f12 arm64: Disable -fwhole-program for vDSO
The vDSO library is obviously not self-contained, so it doesn't qualify
for -fwhole-program; building it with -fwhole-program produces a broken
vDSO. Disable the flag for the vDSO to fix this.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:05 +00:00
Sultan Alsawaf
f4dfc6c0cb kbuild: Disable stack conservation for GCC
There's plenty of room on the stack for a few more inlined bytes here
and there. The measured stack usage at runtime is still safe without
-fconserve-stack, and performance is surely improved at a microscopic
level, so remove the flag.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:04 +00:00
Sultan Alsawaf
008fd45714 msm: camera: Stub out the camera_debug_util API and compile it out
A measurably significant amount of CPU time is spent in these routines
while the camera is open. These are also responsible for a grotesque
amount of dmesg spam, so let's nuke them.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:04 +00:00
Sultan Alsawaf
92924c9df1 disp: msm: Use the PM_QOS_REQ_AFFINE_IRQ feature to control SDE PM QoS
Instead of registering a custom IRQ notifier, we should just use the PM
QoS framework's PM_QOS_REQ_AFFINE_IRQ feature.
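
Roughly, the driver now just tells the PM QoS core which IRQ to follow instead
of registering its own affinity notifier; the surrounding driver plumbing here
is an assumption:

    struct pm_qos_request *req = &sde_kms->pm_qos_irq_req;    /* hypothetical field */

    req->type = PM_QOS_REQ_AFFINE_IRQ;
    req->irq = platform_get_irq(pdev, 0);    /* the display interrupt */
    pm_qos_add_request(req, PM_QOS_CPU_DMA_LATENCY, PM_QOS_DEFAULT_VALUE);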

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:04 +00:00
Sultan Alsawaf
614ff8afe7 disp: msm: Remove unneeded PM QoS requests
These are blocking some CPUs in the LITTLE cluster from entering deep
idle because the driver assumes that display rendering work occurs on a
hardcoded set of CPUs, which is false. We already have the IRQ PM QoS
machinery, so this cruft is unneeded.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:04 +00:00
Sultan Alsawaf
a7748256b4 irqchip/gic-v3: Remove pr_devel message containing smp_processor_id()
This call to smp_processor_id() forces gic_raise_softirq() to be called
only with preemption disabled, which isn't an actual requirement. When
gic_raise_softirq() is called with preemption enabled, smp_processor_id()
is used incorrectly and generates a warning splat when the relevant
kernel debug options are enabled.

Get rid of the useless pr_devel message outright to fix the incorrect
smp_processor_id() usage.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:04 +00:00
Sultan Alsawaf
b48e29aaf2 mbcache: Speed up cache entry creation
To prevent racing callers from creating redundant entries,
mb_cache_entry_create() scans through a large hash list of all current
entries to see whether the requested new entry has already been
allocated. Furthermore, it allocates memory for the new entry before
scanning through this hash list, so the allocated memory is simply
discarded when the requested entry is already present. This happens more
than half the time.

Speed up cache entry creation by keeping a small linked list of
requested new entries in progress, and scanning through that first
instead of the large hash-list. Additionally, don't bother allocating
memory for a new entry until it's known that the allocated memory will
be used.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:03 +00:00
Sultan Alsawaf
aae02f228c cpuidle: Mark CPUs idle as late as possible to avoid unneeded IPIs
It isn't guaranteed a CPU will idle upon calling lpm_cpuidle_enter(),
since it could abort early at the need_resched() check. In this case,
it's possible for an IPI to be sent to this "idle" CPU needlessly, thus
wasting power. For the same reason, it's also wasteful to keep a CPU
marked idle even after it's woken up.

Make the window in which CPUs are marked idle as small as possible in
order to reduce wasted power.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:03 +00:00
Sultan Alsawaf
5cd8543751 cpuidle: Optimize pm_qos notifier callback and IPI semantics
The pm_qos callback currently suffers from a number of pitfalls: it
sends IPIs to CPUs that may not be idle, waits for those IPIs to finish
propagating while preemption is disabled (resulting in a long busy wait
for the pm_qos_update_target() caller), and needlessly calls a no-op
function when the IPIs are processed.

Optimize the pm_qos notifier by only sending IPIs to CPUs that are
idle, and by using arch_send_wakeup_ipi_mask() instead of
smp_call_function_many(). Using IPI_WAKEUP instead of IPI_CALL_FUNC,
which is what smp_call_function_many() uses behind the scenes, has the
benefit of doing zero work upon receipt of the IPI; IPI_WAKEUP is
designed purely for sending an IPI without a payload, whereas
IPI_CALL_FUNC does unwanted extra work just to run the empty
smp_callback() function.

Determining which CPUs are idle is done efficiently with an atomic
bitmask instead of using the wake_up_if_idle() API, which checks the
CPU's runqueue in an RCU read-side critical section and under a spin
lock. Not very efficient in comparison to a simple, atomic bitwise
operation. A cpumask isn't needed for this because NR_CPUS is
guaranteed to fit within a word.
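
A sketch of the two pieces; the bitmask name and the notifier wiring are
assumptions:

    static unsigned long cpus_in_idle;    /* one bit per CPU; NR_CPUS fits in a word */

    /* cpuidle entry/exit (sketch): a CPU is marked only while it is truly idle */
    static void mark_cpu_idle(int cpu)  { set_bit(cpu, &cpus_in_idle); }
    static void mark_cpu_awake(int cpu) { clear_bit(cpu, &cpus_in_idle); }

    /* pm_qos notifier (sketch): kick only the idle CPUs, with an empty
     * IPI_WAKEUP rather than an IPI_CALL_FUNC running a no-op callback */
    static int lpm_qos_notify(struct notifier_block *nb, unsigned long val, void *priv)
    {
        struct cpumask mask;

        cpumask_clear(&mask);
        cpumask_bits(&mask)[0] = READ_ONCE(cpus_in_idle);
        if (!cpumask_empty(&mask))
            arch_send_wakeup_ipi_mask(&mask);
        return NOTIFY_OK;
    }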

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:03 +00:00
Sultan Alsawaf
db8f37bbef arm64: Allow IPI_WAKEUP to be used outside of the ACPI parking protocol
An empty IPI is useful for cpuidle to wake sleeping CPUs without causing
them to do unnecessary work upon receipt of the IPI. IPI_WAKEUP fills
this use-case nicely, so let it be used outside of the ACPI parking
protocol.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:03 +00:00
Sultan Alsawaf
020542d17c qos: Don't disable interrupts while holding pm_qos_lock
None of the pm_qos functions actually run in interrupt context; if some
driver calls pm_qos_update_target in interrupt context then it's already
broken. There's no need to disable interrupts while holding pm_qos_lock,
so don't do it.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:03 +00:00
Sultan Alsawaf
b6bbd4193e qos: Replace expensive cpumask usage with raw bitwise operations
cpumask_set_cpu() uses the set_bit() helper, which, in typical kernels
prior to 4.19, uses a spin lock to guarantee atomicity. This is
expensive and unneeded, especially since the qos functions are hot code
paths. The rest of the cpumask functions use the bitmap API, which is
also more expensive than just doing some simple operations on a word.

Since we're operating with a CPU count that can fit within a word,
replace the expensive cpumask operations with raw bitwise operations
wherever possible to make the pm_qos framework more efficient.
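
The flavor of the change (sketch; the actual fields and call sites differ):

    /* before: generic cpumask/bitmap helpers */
    cpumask_set_cpu(cpu, &req->cpus_affine);
    affected = cpumask_test_cpu(cpu, &req->cpus_affine);

    /* after: plain operations on a word, since NR_CPUS fits in one long */
    req->cpus_affine |= BIT(cpu);
    affected = req->cpus_affine & BIT(cpu);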

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:02 +00:00
Sultan Alsawaf
73a40230d3 Revert "mutex: Add a delay into the SPIN_ON_OWNER wait loop."
This reverts commit 1e5a5b5e00.

This doesn't make sense for a few reasons. Firstly, upstream uses this
mutex code and it works fine on all arches; why should arm be any
different?

Secondly, once the mutex owner starts to spin on `wait_lock`,
preemption is disabled and the owner will be in an actively-running
state. The optimistic mutex spinning occurs when the lock owner is
actively running on a CPU, and while the optimistic spinning takes
place, no attempt to acquire `wait_lock` is made by the new waiter.
Therefore, it is guaranteed that new mutex waiters which optimistically
spin will not contend the `wait_lock` spin lock that the owner needs to
acquire in order to make forward progress.

Another potential source of `wait_lock` contention can come from tasks
that call mutex_trylock(), but this isn't actually problematic (and if
it were, it would affect the MUTEX_SPIN_ON_OWNER=n use-case too). This
won't introduce significant contention on `wait_lock` because the
trylock code exits before attempting to lock `wait_lock`, specifically
when the atomic mutex counter indicates that the mutex is already
locked. So in reality, the amount of `wait_lock` contention that can
come from mutex_trylock() amounts to only one task. And once it
finishes, `wait_lock` will no longer be contended and the previous
mutex owner can proceed with clean up.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:02 +00:00
Sultan Alsawaf
0a2139efa8 Revert "usb: gadget: mtp: Increase RX transfer length to 1M"
This reverts commit 0db49c2550a09458db188fb7312c66783c5af104.

This results in kmalloc() abuse to find a large number of contiguous
pages, which thrashes the page allocator and hurts overall performance.
I couldn't reproduce the improved MTP throughput that this commit
claimed either, so just revert it.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:02 +00:00
Sultan Alsawaf
5dc0c03441 Revert "usb: gadget: f_mtp: Increase default TX buffer size"
This reverts commit a9a60c58e0fa21c41ac284282949187b13bdd756.

This results in kmalloc() abuse to find a large number of contiguous
pages, which thrashes the page allocator and hurts overall performance.
I couldn't reproduce the improved MTP throughput that this commit
claimed either, so just revert it.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:02 +00:00
Sultan Alsawaf
72fe2c3281 ion: Mark workqueues freeing buffers asynchronously as CPU intensive
When exiting the camera, there's a period of intense lag caused by all
of the buffer-free workers consuming all CPUs at once for a few seconds.
This isn't very good, and freeing the buffers isn't super time-critical,
so we can lighten the load by marking the per-heap workqueues as CPU
intensive, which offloads the burden of balancing the workers onto the
scheduler.

Also, mark these workqueues with WQ_MEM_RECLAIM so forward progress is
guaranteed via a rescuer thread, since these are used to free memory.
The unnecessary WQ_UNBOUND_MAX_ACTIVE is removed as well, since it's
only used for increasing the active worker count on large-CPU systems.
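
A sketch of the resulting per-heap workqueue allocation; the field and
workqueue names are assumptions:

    heap->free_wq = alloc_workqueue("%s-free", WQ_CPU_INTENSIVE | WQ_MEM_RECLAIM,
                                    0, heap->name);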

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Diab Neiroukh <lazerl0rd@thezest.dev>
2022-11-12 11:24:02 +00:00
Sultan Alsawaf
b933e6ca1e ion: Rewrite to improve clarity and performance
The ION driver suffers from massive code bloat caused by excessive
debug features, as well as poor lock usage as a result of that. Multiple
locks in ION exist to make the debug features thread-safe, which hurts
ION's actual performance when doing its job.

There are numerous code paths in ION that hold mutexes for no reason and
hold them for longer than necessary. This results in not only unwanted
lock contention, but also long delays when a mutex lock results in the
calling thread getting preempted for a while. All lock usage in ION
follows this pattern, which causes poor performance across the board.
Furthermore, a single big mutex is used nearly everywhere, which
degrades performance due to unnecessary lock overhead.

Instead of having a big mutex lock, multiple fine-grained locks are now
used, improving performance.

Additionally, dup_sg_table is called very frequently, and lies within
the rendering path for the display. Speed it up by copying scatterlists
in page-sized chunks rather than iterating one at a time. Note that
sg_alloc_table zeroes out `table`, so there's no need to zero it out
using the memory allocator.

This also features a lock-less caching system for DMA attachments and
their respective sg_table copies, reducing overhead significantly for
code which frequently maps and unmaps DMA buffers and speeding up cache
maintenance since iteration through the list of buffer attachments is
now lock-free. This is safe since there is no interleaved DMA buffer
attaching or accessing for a single ION buffer.

Overall, just rewrite ION entirely to fix its deficiencies. This
optimizes ION for excellent performance and discards its debug cruft.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Diab Neiroukh <lazerl0rd@thezest.dev>
2022-11-12 11:24:01 +00:00
Sultan Alsawaf
4f1ab4600a iommu: msm: Rewrite to improve clarity and performance
The scope of this driver's lock usage is extremely wide, leading to
excessively long lock hold times. Additionally, there is lots of
excessive linked-list traversal and unnecessary dynamic memory
allocation in a critical path, causing poor performance across the
board.

Fix all of this by greatly reducing the scope of the locks used and by
significantly reducing the amount of operations performed when
msm_dma_map_sg_attrs() is called. The entire driver's code is overhauled
for better cleanliness and performance.

Note that ION must be modified to pass a known structure via the private
dma_buf pointer, so that the IOMMU driver can prevent races when
operating on the same buffer concurrently. This is the only way to
eliminate said buffer races without hurting the IOMMU driver's
performance.

Some additional members are added to the device struct as well to make
these various performance improvements possible.

This also removes the manual cache maintenance since ION already handles
it.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:01 +00:00
Sultan Alsawaf
f342f528cd f2fs: Force strict fsync mode
To help mitigate data loss potential in the event of an unclean
shutdown. The small performance hit is worth the trade-off for improved
data integrity, especially since custom kernels are more susceptible to
unclean shutdowns (panics) during development.
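
In current f2fs terms this amounts to defaulting (and effectively pinning) the
fsync_mode mount option, roughly:

    /* fs/f2fs/super.c, default_options() (sketch) */
    F2FS_OPTION(sbi).fsync_mode = FSYNC_MODE_STRICT;    /* was FSYNC_MODE_POSIX */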

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:01 +00:00