Although unbound workqueues are eligible to run their workers on any
CPU, the workqueue subsystem prefers running workers on the CPU that
queued the work. This results in unbound workers consuming valuable
CPU time on the big and prime CPU clusters.
We can alleviate the burden of kernel housekeeping on the more important
CPUs by moving the unbound workqueues to the little CPU cluster by
default. This may also reduce power consumption, which is a plus.
This also fixes the following warning:
kernel/workqueue.c:5765:6: warning: unused variable 'hk_flags' [-Wunused-variable]
int hk_flags = HK_FLAG_DOMAIN | HK_FLAG_WQ;
[spak]
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
For us, it's most helpful to have the round-robin timeslice as low as is
allowed by the scheduler to reduce latency. Since it's limited by the
scheduler tick rate, just set the default to 1 jiffy, which is the
lowest possible value.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Although SCHED_FIFO is a real-time scheduling policy, it can hurt
system latency: a SCHED_FIFO task runs until it blocks or yields,
starving other tasks at the same priority. This can result in visible
micro-stalls when a SCHED_FIFO task hogs the CPU for too long. On a
system where latency is favored over throughput, using SCHED_RR is a
better choice than SCHED_FIFO.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
It's helpful to be able to affine low-priority kthreads to the little
CPU, such as for deferred memory cleanup. Extend the perf-critical API
to make this possible.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Unfortunately, our display driver must occasionally sleep inside
start_atomic(), which causes the CPU running the ioctl to schedule and
potentially idle. Depending on how deeply the running CPU idles, there
can be quite a bit of latency added to processing the "atomic" ioctl as
a result, which hurts display rendering latency. We can alleviate this
by optimistically assuming that the task running the ioctl won't
migrate CPUs while it runs and restricting the current CPU to shallow
idle states.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
This unused debug print wastes CPU time when writing to registers,
resulting in perf top reporting a decent chunk of time spent inside
sde_reg_write(). Removing the debug print gets sde_reg_write() off perf
top's radar.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Relaxing the CPU latency requirement by about 500 us won't significantly
hurt graphics performance. On the flip side, most SoCs have many idle
levels just below 1000 us in latency, with deeper idle levels having
latencies in excess of 2000 us. Changing the latency requirement to
1000 us allows most SoCs to use their deepest sub-1000-us idle state
while the GPU is active.
Additionally, since the lpm driver has been updated to allow power
levels with latencies equal to target latencies, change the wakeup
latency from 101 us to 100 us for clarity.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
The GPU's ioctls sleep a lot due to memory allocation, during which time
the CPU can enter a deep idle state and take a while to finish executing
the ioctl afterwards. We can alleviate this by optimistically assuming
that the task running the ioctl won't migrate CPUs while it runs and
restricting the current CPU to shallow idle states.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
HWC in Android is responsible for communicating userspace's graphics
requests to the kernel. Affining it to fast CPUs greatly reduces
jitter.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Unfortunately, s2idle is only somewhat functional. Although commit
70441d36af58 ("cpuidle: lpm_levels: add soft watchdog for s2idle") makes
s2idle usable, there are still CPU stalls caused by s2idle's buggy
behavior, and the aforementioned hack doesn't address them. Therefore,
let's stop userspace from enabling s2idle and instead enforce the
default deep sleep mode.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
A lot of CPU time is wasted on allocating, populating, and copying
debug names back and forth with userspace when they're not actually
needed. The name buffers can't simply be removed from the various sync
data structures, since ABI compatibility with userspace must be
preserved; instead, we can just pretend the name fields of the
user-shared structs aren't there. This massively reduces the size of
the memory allocated for these data structures and the amount of data
passed between the kernel and userspace, and eliminates a kzalloc()
entirely from sync_file_ioctl_fence_info(), thus improving graphics
performance.
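As a rough userspace sketch of the trick (the struct layout and names
here are hypothetical stand-ins, not the actual sync ABI), the
populated and copied region can be limited to the bytes before the
name field while the full ABI-sized struct is preserved:

```c
#include <stddef.h>
#include <string.h>

/* Illustrative stand-in for a user-shared sync struct whose trailing
 * field is a debug name. */
struct fence_info_abi {
	int status;
	unsigned int flags;
	char name[64];
};

/* Bytes we actually need to populate and copy out: everything up to
 * the name field. The full ABI size is still what userspace sees, so
 * layout compatibility is preserved. */
static size_t info_copy_size(void)
{
	return offsetof(struct fence_info_abi, name);
}

/* Populate a struct while leaving the name bytes zeroed; userspace
 * reads the name back as an empty string. */
static void fill_info(struct fence_info_abi *dst, int status,
		      unsigned int flags)
{
	memset(dst, 0, sizeof(*dst));
	dst->status = status;
	dst->flags = flags;
}
```

The allocation and copy sizes shrink by the size of the name buffers
without changing the struct definition userspace compiles against.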
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Being interrupted by a signal during binder ioctls is not unusual.
Commit 218225c05beaf02320f05a9b3b7041e27a0684f7 changed the interrupted
ioctl return code from ERESTARTSYS to EINTR, but the error logging check
was never updated:
[ 352.090265] binder: 6534:6552 ioctl c0306201 7a0f9fea10 returned -4
[ 352.090393] binder: 6534:7670 ioctl c0306201 7a6489da10 returned -4
[ 352.090432] binder: 6534:9501 ioctl c0306201 7a61377a10 returned -4
Update the error logging check to fix the log spam.
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Combined with LTO, this yields a consistent 5% boost to procfs I/O
performance right off the bat (as measured with callbench). The spin
lock functions constitute some of the hottest code paths in the kernel;
inlining them to improve performance makes sense.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
GCC 10 updated its interprocedural optimizer's logic to have it make
more conservative inlining decisions, resulting in worse syscall and
hackbench performance compared to GCC 9. Although the max-inline-insns-
auto parameter's value was not altered, increasing it from the -O3
default of 30 to 1000 instructions yields improved performance with LTO,
surpassing GCC 9.
Do this only for LTO though, because for non-LTO builds, this causes
GCC to produce mountains of spurious -Wmaybe-uninitialized warnings.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
This requires a modern version of GCC and various other patches in order
to work. LTO results in a smaller kernel binary with better performance.
Based off of work from Andi Kleen <ak@linux.intel.com>.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
gcc 5 has a new no_reorder attribute that prevents top level
reordering only for that symbol.
Kernels don't like any reordering of initcalls between files, as several
initcalls depend on each other. LTO previously needed to use
-fno-toplevel-reordering to prevent boot failures.
Add a __noreorder wrapper for the no_reorder attribute and use
it for initcalls.
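A minimal sketch of such a wrapper, guarding on GCC 5+ since other
compilers may not know the attribute:

```c
/* Map __noreorder to GCC 5+'s no_reorder attribute, and to nothing on
 * compilers without it. */
#if defined(__GNUC__) && !defined(__clang__) && __GNUC__ >= 5
#define __noreorder __attribute__((no_reorder))
#else
#define __noreorder
#endif

/* An initcall-like function marked so the compiler's top-level
 * reordering (e.g. under LTO) keeps it in source order relative to
 * other symbols in this translation unit. */
static int __noreorder my_initcall(void)
{
	return 0;
}

int run_initcall(void)
{
	return my_initcall();
}
```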
Signed-off-by: Andi Kleen <ak@linux.intel.com>
When using LTO, the conditional syscall aliases aren't weak, and instead
override implemented syscalls rather than serve as a fallback for
missing syscalls. Fix the cond_syscall() alias using an attribute so
that it gets properly evaluated at link time.
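The weak-alias approach can be sketched in plain C like this
(sys_fancy and ENOSYS_VAL are illustrative; the real macro covers
arbitrary syscall signatures):

```c
/* The fallback returns -ENOSYS (38 on most Linux arches; defined
 * locally here to stay self-contained). */
#define ENOSYS_VAL 38

long sys_ni_syscall(void)
{
	return -ENOSYS_VAL;
}

/* Declare a conditional syscall as a weak alias of the fallback. A
 * real implementation in another object file overrides the alias at
 * link time; with none linked in, calls hit sys_ni_syscall(). */
#define cond_syscall(name) \
	long name(void) __attribute__((weak, alias("sys_ni_syscall")))

/* No real sys_fancy() exists, so this resolves to the fallback. */
cond_syscall(sys_fancy);
```

Because the alias is an attribute on the symbol itself rather than an
asm-level directive, the linker (and LTO) sees the weak binding and
resolves it correctly.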
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Any -m* flags need to be prefixed with "-Wl," when passed to the linker.
This is already done for flags which aren't special-cased, so we can
just remove the specific -m* flag handling to appease GCC.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
The vDSO library is obviously not self-contained, so it doesn't
qualify for -fwhole-program; building it with -fwhole-program breaks
it. Disable -fwhole-program for the vDSO to fix this.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
There's plenty of room on the stack for a few more inlined bytes here
and there. The measured stack usage at runtime is still safe without
this, and performance is surely improved at a microscopic level, so
remove it.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
A measurably significant amount of CPU time is spent in these routines
while the camera is open. These are also responsible for a grotesque
amount of dmesg spam, so let's nuke them.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Instead of registering a custom IRQ notifier, we should just use the PM
QoS framework's PM_QOS_REQ_AFFINE_IRQ feature.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
These are blocking some CPUs in the LITTLE cluster from entering deep
idle because the driver assumes that display rendering work occurs on a
hardcoded set of CPUs, which is false. We already have the IRQ PM QoS
machinery, so this cruft is unneeded.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
This call to smp_processor_id() forces gic_raise_softirq() to be
called with preemption disabled, which isn't an actual requirement.
When gic_raise_softirq() is called with preemption enabled,
smp_processor_id() is thus used incorrectly and generates a warning
splat with the relevant kernel debug options enabled.
Get rid of the useless pr_devel message outright to fix the incorrect
smp_processor_id() usage.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
In order to prevent redundant entry creation by racing against itself,
mb_cache_entry_create scans through a large hash-list of all current
entries in order to see if another allocation for the requested new
entry has been made. Furthermore, it allocates memory for a new entry
before scanning through this hash-list, which results in that allocated
memory being discarded when the requested new entry is already present.
This happens more than half the time.
Speed up cache entry creation by keeping a small linked list of
requested new entries in progress, and scanning through that first
instead of the large hash-list. Additionally, don't bother allocating
memory for a new entry until it's known that the allocated memory will
be used.
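A stripped-down, single-threaded sketch of the idea (the real code
must hold the appropriate locks while touching these lists, and the
names here are made up for illustration):

```c
#include <stdlib.h>

struct entry {
	struct entry *next;
	unsigned int key;
};

static struct entry *in_progress;	/* creations in flight */
static struct entry *hash_list;		/* stand-in for the big hash-list */
static int allocations;

static struct entry *list_find(struct entry *head, unsigned int key)
{
	for (; head; head = head->next)
		if (head->key == key)
			return head;
	return NULL;
}

/* Returns 0 on success, -1 if the entry already exists. Memory is
 * only allocated once we know it will be used. */
static int entry_create(unsigned int key)
{
	struct entry *e;

	/* check the small in-flight list first, then the hash-list */
	if (list_find(in_progress, key) || list_find(hash_list, key))
		return -1;

	e = malloc(sizeof(*e));
	if (!e)
		return -1;
	allocations++;
	e->key = key;

	/* publish on the in-progress list while creation completes */
	e->next = in_progress;
	in_progress = e;

	/* once fully created, move it to the hash-list */
	in_progress = e->next;
	e->next = hash_list;
	hash_list = e;
	return 0;
}
```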
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
It isn't guaranteed a CPU will idle upon calling lpm_cpuidle_enter(),
since it could abort early at the need_resched() check. In this case,
it's possible for an IPI to be sent to this "idle" CPU needlessly, thus
wasting power. For the same reason, it's also wasteful to keep a CPU
marked idle even after it's woken up.
Reduce the window that CPUs are marked idle to as small as it can be in
order to improve power consumption.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
The pm_qos callback currently suffers from a number of pitfalls: it
sends IPIs to CPUs that may not be idle, waits for those IPIs to finish
propagating while preemption is disabled (resulting in a long busy wait
for the pm_qos_update_target() caller), and needlessly calls a no-op
function when the IPIs are processed.
Optimize the pm_qos notifier by only sending IPIs to CPUs that are
idle, and by using arch_send_wakeup_ipi_mask() instead of
smp_call_function_many(). Using IPI_WAKEUP instead of IPI_CALL_FUNC,
which is what smp_call_function_many() uses behind the scenes, has the
benefit of doing zero work upon receipt of the IPI; IPI_WAKEUP is
designed purely for sending an IPI without a payload, whereas
IPI_CALL_FUNC does unwanted extra work just to run the empty
smp_callback() function.
Determining which CPUs are idle is done efficiently with an atomic
bitmask instead of using the wake_up_if_idle() API, which checks the
CPU's runqueue in an RCU read-side critical section and under a spin
lock; that's not very efficient compared to a simple atomic bitwise
operation. A cpumask isn't needed for this because NR_CPUS is
guaranteed to fit within a word.
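The idle-CPU tracking can be sketched with a single atomic word (the
CPU count and helper names here are illustrative):

```c
#include <stdatomic.h>

#define NR_CPUS 8	/* assumed to fit in an unsigned long */

static atomic_ulong idle_mask;

static void cpu_enter_idle(int cpu)
{
	atomic_fetch_or(&idle_mask, 1UL << cpu);
}

static void cpu_exit_idle(int cpu)
{
	atomic_fetch_and(&idle_mask, ~(1UL << cpu));
}

/* Only idle CPUs among `affected` need a wakeup IPI; running CPUs
 * will pick up the new QoS limit on their next idle entry anyway. */
static unsigned long cpus_to_ipi(unsigned long affected)
{
	return atomic_load(&idle_mask) & affected;
}
```

The mask computed by cpus_to_ipi() would then be handed to
arch_send_wakeup_ipi_mask() in place of smp_call_function_many().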
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
An empty IPI is useful for cpuidle to wake sleeping CPUs without causing
them to do unnecessary work upon receipt of the IPI. IPI_WAKEUP fills
this use-case nicely, so let it be used outside of the ACPI parking
protocol.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
None of the pm_qos functions actually run in interrupt context; if some
driver calls pm_qos_update_target in interrupt context then it's already
broken. There's no need to disable interrupts while holding pm_qos_lock,
so don't do it.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
cpumask_set_cpu() uses the set_bit() helper, which, in typical kernels
prior to 4.19, uses a spin lock to guarantee atomicity. This is
expensive and unneeded, especially since the qos functions are hot code
paths. The rest of the cpumask functions use the bitmap API, which is
also more expensive than just doing some simple operations on a word.
Since we're operating with a CPU count that can fit within a word,
replace the expensive cpumask operations with raw bitwise operations
wherever possible to make the pm_qos framework more efficient.
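A sketch of the substitution, assuming the CPU count fits in an
unsigned long (helper names are made up for illustration):

```c
/* With CPU counts that fit in a word, the cpumask helpers can be
 * replaced by plain bitwise operations on an unsigned long, avoiding
 * the locked set_bit()/bitmap machinery entirely. */
typedef unsigned long cpu_word_t;

static inline void cpu_word_set(cpu_word_t *mask, int cpu)
{
	*mask |= 1UL << cpu;
}

static inline void cpu_word_clear(cpu_word_t *mask, int cpu)
{
	*mask &= ~(1UL << cpu);
}

static inline int cpu_word_test(cpu_word_t mask, int cpu)
{
	return !!(mask & (1UL << cpu));
}

/* e.g. "do these two masks share a CPU?" becomes a single AND */
static inline int cpu_word_intersects(cpu_word_t a, cpu_word_t b)
{
	return (a & b) != 0;
}
```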
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
This reverts commit 1e5a5b5e00.
This doesn't make sense for a few reasons. Firstly, upstream uses this
mutex code and it works fine on all arches; why should arm be any
different?
Secondly, once the mutex owner starts to spin on `wait_lock`,
preemption is disabled and the owner will be in an actively-running
state. The optimistic mutex spinning occurs when the lock owner is
actively running on a CPU, and while the optimistic spinning takes
place, no attempt to acquire `wait_lock` is made by the new waiter.
Therefore, it is guaranteed that new mutex waiters which optimistically
spin will not contend the `wait_lock` spin lock that the owner needs to
acquire in order to make forward progress.
Another potential source of `wait_lock` contention can come from tasks
that call mutex_trylock(), but this isn't actually problematic (and if
it were, it would affect the MUTEX_SPIN_ON_OWNER=n use-case too). This
won't introduce significant contention on `wait_lock` because the
trylock code exits before attempting to lock `wait_lock`, specifically
when the atomic mutex counter indicates that the mutex is already
locked. So in reality, the amount of `wait_lock` contention that can
come from mutex_trylock() amounts to only one task. And once it
finishes, `wait_lock` will no longer be contended and the previous
mutex owner can proceed with clean up.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
This reverts commit 0db49c2550a09458db188fb7312c66783c5af104.
This results in kmalloc() abuse to find a large number of contiguous
pages, which thrashes the page allocator and hurts overall performance.
I couldn't reproduce the improved MTP throughput that this commit
claimed either, so just revert it.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
This reverts commit a9a60c58e0fa21c41ac284282949187b13bdd756.
This results in kmalloc() abuse to find a large number of contiguous
pages, which thrashes the page allocator and hurts overall performance.
I couldn't reproduce the improved MTP throughput that this commit
claimed either, so just revert it.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
When exiting the camera, there's a period of intense lag caused by all
of the buffer-free workers consuming all CPUs at once for a few seconds.
This isn't very good, and freeing the buffers isn't super time critical,
so we can lower the burden of the workers by marking the per-heap
workqueues as CPU intensive, which offloads the burden of balancing the
workers onto the scheduler.
Also, mark these workqueues with WQ_MEM_RECLAIM so forward progress is
guaranteed via a rescuer thread, since these are used to free memory.
The unnecessary WQ_UNBOUND_MAX_ACTIVE is removed as well, since it's
only used for increasing the active worker count on large-CPU systems.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Diab Neiroukh <lazerl0rd@thezest.dev>
The ION driver suffers from massive code bloat caused by excessive
debug features, as well as poor lock usage as a result of that. Multiple
locks in ION exist to make the debug features thread-safe, which hurts
ION's actual performance when doing its job.
There are numerous code paths in ION that hold mutexes for no reason and
hold them for longer than necessary. This results in not only unwanted
lock contention, but also long delays when a mutex lock results in the
calling thread getting preempted for a while. All lock usage in ION
follows this pattern, which causes poor performance across the board.
Furthermore, a big mutex lock is used mostly everywhere, which causes
performance degradation due to unnecessary lock overhead.
Instead of having a big mutex lock, multiple fine-grained locks are now
used, improving performance.
Additionally, dup_sg_table is called very frequently, and lies within
the rendering path for the display. Speed it up by copying scatterlists
in page-sized chunks rather than iterating one at a time. Note that
sg_alloc_table zeroes out `table`, so there's no need to zero it out
using the memory allocator.
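The chunked-copy idea, sketched with a hypothetical stand-in for
struct scatterlist (the real code also has to respect how the table's
entries are laid out across its page-sized allocations):

```c
#include <string.h>

#define PAGE_SIZE_BYTES 4096

/* Illustrative entry layout; sized so a page holds a whole number of
 * entries, as with struct scatterlist. */
struct sg_entry {
	unsigned long page_link;
	unsigned int offset;
	unsigned int length;
};

/* Copy table entries in page-sized memcpy chunks instead of iterating
 * and assigning one entry at a time. */
static void copy_entries_chunked(struct sg_entry *dst,
				 const struct sg_entry *src,
				 unsigned int nents)
{
	size_t bytes = (size_t)nents * sizeof(*src);

	while (bytes) {
		size_t chunk = bytes < PAGE_SIZE_BYTES ? bytes
						       : PAGE_SIZE_BYTES;
		memcpy(dst, src, chunk);
		dst += chunk / sizeof(*dst);
		src += chunk / sizeof(*src);
		bytes -= chunk;
	}
}
```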
This also features a lock-less caching system for DMA attachments and
their respective sg_table copies, reducing overhead significantly for
code which frequently maps and unmaps DMA buffers and speeding up cache
maintenance since iteration through the list of buffer attachments is
now lock-free. This is safe since there is no interleaved DMA buffer
attaching or accessing for a single ION buffer.
Overall, just rewrite ION entirely to fix its deficiencies. This
optimizes ION for excellent performance and discards its debug cruft.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Diab Neiroukh <lazerl0rd@thezest.dev>
The scope of this driver's lock usage is extremely wide, leading to
excessively long lock hold times. Additionally, there is lots of
excessive linked-list traversal and unnecessary dynamic memory
allocation in a critical path, causing poor performance across the
board.
Fix all of this by greatly reducing the scope of the locks used and by
significantly reducing the amount of operations performed when
msm_dma_map_sg_attrs() is called. The entire driver's code is overhauled
for better cleanliness and performance.
Note that ION must be modified to pass a known structure via the private
dma_buf pointer, so that the IOMMU driver can prevent races when
operating on the same buffer concurrently. This is the only way to
eliminate said buffer races without hurting the IOMMU driver's
performance.
Some additional members are added to the device struct as well to make
these various performance improvements possible.
This also removes the manual cache maintenance since ION already handles
it.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
This helps mitigate potential data loss in the event of an unclean
shutdown. The small performance hit is worth the trade-off for improved
data integrity, especially since custom kernels are more susceptible to
unclean shutdowns (panics) during development.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
This is a waste of memory for something that we don't use. We're
building our own kernel, so this GKI nonsense is unneeded.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
This allows kmemleak to function even when debugfs is globally disabled,
allowing kmemleak to give accurate results for CONFIG_DEBUG_FS=n.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Trying to wait for fences that have already been signaled incurs a high
setup cost, since dynamic memory allocation must be used. Avoiding this
overhead when it isn't needed improves performance.
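A minimal sketch of the fast path, with hypothetical names (the slow
path's wait machinery is reduced to a single allocation here):

```c
#include <stdlib.h>

struct fence {
	int signaled;
};

static int wait_setups;	/* counts expensive slow-path setups */

static int fence_wait(struct fence *f)
{
	/* fast path: already signaled, no allocation needed */
	if (f->signaled)
		return 0;

	/* slow path: allocate the wait callback, then block (elided) */
	void *cb = malloc(64);
	if (!cb)
		return -1;
	wait_setups++;
	free(cb);
	return 0;
}
```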
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Exporting the IRQ of a SPI device's master controller can help device
drivers utilize the PM QoS API to force the SPI master IRQ to be
serviced with low latency.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Exporting the IRQ of an i2c client's adapter can help i2c client drivers
utilize the PM QoS API to force the i2c IRQ to be serviced with low
latency.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>