Commit Graph

Sultan Alsawaf
78d0f9faf4 drm/msm: Speed up interrupt processing upon commit
Since we know an interrupt will be arriving soon when a frame is
committed, we can anticipate it and prevent the CPU servicing that
interrupt from entering deep idle states. This reduces display rendering
latency.
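
A minimal sketch of the idea using the legacy PM QoS interface of this kernel
generation; the request name and the exact latency value are assumptions, not
the actual patch:

    static struct pm_qos_request commit_qos_req;    /* hypothetical */

    /* init: pm_qos_add_request(&commit_qos_req, PM_QOS_CPU_DMA_LATENCY,
     *                          PM_QOS_DEFAULT_VALUE); */

    /* frame committed: cap wakeup latency so the IRQ is serviced promptly */
    pm_qos_update_request(&commit_qos_req, 100);    /* ~100 us, shallow idle only */

    /* after the commit-done interrupt is handled: lift the restriction */
    pm_qos_update_request(&commit_qos_req, PM_QOS_DEFAULT_VALUE);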

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Diab Neiroukh <lazerl0rd@thezest.dev>
2022-11-12 11:24:11 +00:00
Sultan Alsawaf
c2e09bd1cf qos: Speed up pm_qos_set_value_for_cpus()
A lot of unnecessary work is done in pm_qos_set_value_for_cpus(),
especially when the request being updated isn't affined to all CPUs.
We can reduce the work done here significantly by only inspecting the
CPUs which are affected by the updated request, and bailing out if the
updated request doesn't change anything.

We can make some other micro-optimizations as well, since this code only
serves the PM_QOS_CPU_DMA_LATENCY class.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:11 +00:00
Sultan Alsawaf
530c735e50 rcu: Run nocb kthreads on little CPUs
RCU callbacks are not time-critical and constitute kernel housekeeping.
Offload the no-callback kthreads onto little CPUs to clear load off of
the more important, higher-performance CPUs.
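
A sketch of the affinity change, assuming this tree exposes a little-cluster
mask such as cpu_lp_mask (the symbol and the exact call site are assumptions):

    /* kernel/rcu/tree_plugin.h, after the rcuo kthread is spawned (sketch) */
    t = kthread_run(rcu_nocb_kthread, rdp, "rcuo%c/%d", rsp->abbr, cpu);
    if (!IS_ERR(t))
        set_cpus_allowed_ptr(t, cpu_lp_mask);    /* keep housekeeping off big cores */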

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:11 +00:00
Sultan Alsawaf
143b27d222 binder: Stub out debug prints by default
Binder code is very hot, so checking frequently to see if a debug
message should be printed is a waste of cycles. We're not debugging
binder, so just stub out the debug prints to compile them out entirely.
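
The stubbing amounts to turning the logging macro into a no-op, roughly:

    /* before: every call site checks binder_debug_mask at runtime */
    #define binder_debug(mask, x...) \
        do { \
            if (binder_debug_mask & mask) \
                pr_info_ratelimited(x); \
        } while (0)

    /* after: compiles out entirely; the arguments are never evaluated */
    #define binder_debug(mask, x...) do {} while (0)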

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:10 +00:00
Sultan Alsawaf
21ef075aef ALSA: control_compat: Don't dynamically allocate single-use structs
All of these dynamically-allocated structs can be simply placed on the
stack, eliminating the overhead of dynamic memory allocation entirely.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:10 +00:00
Sultan Alsawaf
7781d6b3bd kernfs: Avoid dynamic memory allocation for small write buffers
Most write buffers are rather small and can fit on the stack. Reserve a
4 KiB stack buffer for this purpose to avoid the overhead of dynamic
memory allocation.
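
A sketch of the resulting pattern in the kernfs write path; the threshold and
fallback handling are assumptions:

    char stack_buf[SZ_4K];            /* 4 KiB scratch buffer on the stack */
    char *buf = stack_buf;

    if (len >= sizeof(stack_buf)) {
        /* rare large write: fall back to the heap as before */
        buf = kmalloc(len + 1, GFP_KERNEL);
        if (!buf)
            return -ENOMEM;
    }

    if (copy_from_user(buf, user_buf, len)) {
        len = -EFAULT;
        goto out_free;
    }
    buf[len] = '\0';

    /* ... hand the buffer to the kernfs ops ... */

    out_free:
    /* only the heap fallback needs freeing */
    if (buf != stack_buf)
        kfree(buf);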

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:10 +00:00
Sultan Alsawaf
b34db5c768 sched/fair: Don't remove important task migration logic from PELT
This task migration logic is guarded by a WALT #ifdef even though it has
nothing specific to do with WALT. The result is that, with PELT, boosted
tasks can be migrated to the little cluster, causing visible stutters.

Move the WALT #ifdef so PELT can benefit from this logic as well.

Thanks to Zachariah Kennedy <zkennedy87@gmail.com> and Danny Lin
<danny@kdrag0n.dev> for discovering this issue and creating this fix.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:10 +00:00
Sultan Alsawaf
5cfb51a3e7 workqueue: Affine unbound workqueues to little CPUs by default
Although unbound workqueues are eligible to run their workers on any
CPU, the workqueue subsystem prefers scheduling workers onto the CPU
which queues them. This results in unbound workers consuming valuable
CPU time on the big and prime CPU clusters.

We can alleviate the burden of kernel housekeeping on the more important
CPUs by moving the unbound workqueues to the little CPU cluster by
default. This may also reduce power consumption, which is a plus.

Fix Warning:
kernel/workqueue.c:5765:6: warning: unused variable 'hk_flags' [-Wunused-variable]
        int hk_flags = HK_FLAG_DOMAIN | HK_FLAG_WQ;
[spak]
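
A sketch of the default-mask change in workqueue_init_early(), assuming the
tree provides a little-cluster mask such as cpu_lp_mask:

    /* kernel/workqueue.c (sketch) */
    BUG_ON(!alloc_cpumask_var(&wq_unbound_cpumask, GFP_KERNEL));
    cpumask_copy(wq_unbound_cpumask, cpu_lp_mask);  /* was: housekeeping_cpumask(hk_flags) */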

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:10 +00:00
Sultan Alsawaf
ae14413b9d sched/rt: Change default SCHED_RR timeslice from 100 ms to 1 jiffy
For us, it's most helpful to have the round-robin timeslice as low as is
allowed by the scheduler to reduce latency. Since it's limited by the
scheduler tick rate, just set the default to 1 jiffy, which is the
lowest possible value.
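
In kernel/sched/rt.c terms this is roughly the following one-liner
(RR_TIMESLICE is the upstream 100 ms default, expressed in jiffies):

    /* was: int sched_rr_timeslice = RR_TIMESLICE;  (100 * HZ / 1000 jiffies) */
    int sched_rr_timeslice = 1;    /* 1 jiffy: the smallest slice the tick allows */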

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:09 +00:00
Sultan Alsawaf
bb31f94813 sched/core: Use SCHED_RR in place of SCHED_FIFO for all users
Although SCHED_FIFO is a real-time scheduling policy, it can hurt system
latency, since a SCHED_FIFO task runs until it blocks or yields, never
giving way to other tasks of the same priority. This can result in
visible micro-stalls when a SCHED_FIFO task hogs the CPU for too long. On
a system where latency is favored over throughput, SCHED_RR is a better
choice than SCHED_FIFO.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:09 +00:00
Sultan Alsawaf
9a8d0bd815 kernel: Extend the perf-critical API to little CPUs
It's helpful to be able to affine low-priority kthreads to the little
CPUs, such as for deferred memory cleanup. Extend the perf-critical API
to make this possible.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:09 +00:00
Sultan Alsawaf
62d5626ba3 drm: Reduce latency while processing atomic ioctls
Unfortunately, our display driver must occasionally sleep inside
start_atomic(), which causes the CPU running the ioctl to schedule and
potentially idle. Depending on how deeply the running CPU idles, there
can be quite a bit of latency added to processing the "atomic" ioctl as
a result, which hurts display rendering latency. We can alleviate this
by optimistically assuming that the task running the ioctl won't
migrate CPUs while it runs and restricting the current CPU to shallow
idle states.
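
A sketch of the mechanism using the msm PM QoS core-affined request type; the
latency value and the way the affinity is expressed are assumptions:

    struct pm_qos_request req = {
        .type = PM_QOS_REQ_AFFINE_CORES,
        /* assumption: affinity is set to cover only the current CPU */
    };

    pm_qos_add_request(&req, PM_QOS_CPU_DMA_LATENCY, 100);  /* ~100 us: shallow idle only */
    /* ... run the atomic commit, which may sleep in start_atomic() ... */
    pm_qos_remove_request(&req);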

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:09 +00:00
Sultan Alsawaf
df7fe36f50 disp: msm: Remove debug print from sde_reg_write()
This unused debug print wastes CPU time when writing to registers,
resulting in perf top reporting a decent chunk of time spent inside
sde_reg_write(). Removing the debug print gets sde_reg_write() off perf
top's radar.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:09 +00:00
Sultan Alsawaf
cfd927d9f2 msm: kgsl: Relax CPU latency requirements to save power
Relaxing the CPU latency requirement by about 500 us won't significantly
hurt graphics performance. On the flip side, most SoCs have many idle
levels just below 1000 us in latency, with deeper idle levels having
latencies in excess of 2000 us. Changing the latency requirement to
1000 us allows most SoCs to use their deepest sub-1000-us idle state
while the GPU is active.

Additionally, since the lpm driver has been updated to allow power
levels with latencies equal to target latencies, change the wakeup
latency from 101 to 100 for clarity.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:08 +00:00
Sultan Alsawaf
b29379f4da msm: kgsl: Reduce latency while processing ioctls
The GPU's ioctls sleep a lot due to memory allocation, during which time
the CPU can enter a deep idle state and take a while to finish executing
the ioctl afterwards. We can alleviate this by optimistically assuming
that the task running the ioctl won't migrate CPUs while it runs and
restricting the current CPU to shallow idle states.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:08 +00:00
Sultan Alsawaf
bee671e32a kernel: Affine hwcomposer to big CPUs
HWC in Android is responsible for communicating userspace's graphics
requests to the kernel. Affining it to fast CPUs greatly reduces jitter.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:08 +00:00
Sultan Alsawaf
281004c8c4 PM: sleep: Don't allow s2idle to be used
Unfortunately, s2idle is only somewhat functional. Although commit
70441d36af58 ("cpuidle: lpm_levels: add soft watchdog for s2idle") makes
s2idle usable, there are still CPU stalls caused by s2idle's buggy
behavior, and the aforementioned hack doesn't address them. Therefore,
let's stop userspace from enabling s2idle and instead enforce the
default deep sleep mode.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:08 +00:00
Sultan Alsawaf
bae6dcbbb8 dma-buf/sync_file: Speed up ioctl by omitting debug names
A lot of CPU time is wasted allocating, populating, and copying debug
names back and forth with userspace when they're not actually needed. We
can't simply remove the name buffers from the various sync data
structures, because we must preserve ABI compatibility with userspace;
instead, we can just pretend the name fields of the user-shared structs
aren't there. This massively reduces the amount of memory allocated for
these data structures and the amount of data copied to and from
userspace, and eliminates a kzalloc() entirely from
sync_file_ioctl_fence_info(), thus improving graphics performance.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:08 +00:00
Danny Lin
8f5c1833fc binder: Fix log spam caused by interrupted waits
Being interrupted by a signal during binder ioctls is not unusual.
Commit 218225c05beaf02320f05a9b3b7041e27a0684f7 changed the interrupted
ioctl return code from ERESTARTSYS to EINTR, but the error logging check
was never updated:

[  352.090265] binder: 6534:6552 ioctl c0306201 7a0f9fea10 returned -4
[  352.090393] binder: 6534:7670 ioctl c0306201 7a6489da10 returned -4
[  352.090432] binder: 6534:9501 ioctl c0306201 7a61377a10 returned -4

Update the check accordingly to fix the log spam.
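
The offending check sits in binder_ioctl()'s exit path; the fix is a
one-constant change (sketch):

    /* was: if (ret && ret != -ERESTARTSYS) */
    if (ret && ret != -EINTR)
        pr_info("%d:%d ioctl %x %lx returned %d\n",
            proc->pid, current->pid, cmd, arg, ret);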

Signed-off-by: Danny Lin <danny@kdrag0n.dev>
2022-11-12 11:24:07 +00:00
Sultan Alsawaf
77cee07f2b arm64: Allow LD_DEAD_CODE_DATA_ELIMINATION to be selected
DCE is not problematic on arm64 and works out of the box. Let it be
used.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:07 +00:00
Sultan Alsawaf
d130711775 arm64: Keep alternative-instruction sections
Otherwise DCE happily discards them and we're left with a kernel that
doesn't boot.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:07 +00:00
Sultan Alsawaf
c939e768d6 arm64: Inline the spin lock function family
Combined with LTO, this yields a consistent 5% boost to procfs I/O
performance right off the bat (as measured with callbench). The spin
lock functions constitute some of the hottest code paths in the kernel;
inlining them to improve performance makes sense.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:07 +00:00
Sultan Alsawaf
20c9e246d5 kbuild: Increase GCC automatic inline instruction limit to 1000 for LTO
GCC 10 updated its interprocedural optimizer's logic to have it make
more conservative inlining decisions, resulting in worse syscall and
hackbench performance compared to GCC 9. Although the max-inline-insns-
auto parameter's value was not altered, increasing it from the -O3
default of 30 to 1000 instructions yields improved performance with LTO,
surpassing GCC 9.

Do this only for LTO, though, because for non-LTO builds it causes GCC
to produce mountains of spurious -Wmaybe-uninitialized warnings.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:07 +00:00
Sultan Alsawaf
12d906d079 lto: Add Link Time Optimization support for GCC
This requires a modern version of GCC and various other patches in order
to work. LTO results in a smaller kernel binary with better performance.

Based on work by Andi Kleen <ak@linux.intel.com>.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:06 +00:00
Andi Kleen
6eac219ad8 lto: Add __noreorder and mark initcalls __noreorder
GCC 5 has a new no_reorder attribute that prevents top-level reordering
only for that symbol.

Kernels don't like any reordering of initcalls between files, as several
initcalls depend on each other. LTO previously needed to use
-fno-toplevel-reorder to prevent boot failures.

Add a __noreorder wrapper for the no_reorder attribute and use
it for initcalls.
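
A sketch of the wrapper and its use, assuming the usual compiler-gcc.h /
init.h layout of this kernel generation:

    /* include/linux/compiler-gcc.h (sketch) */
    #if GCC_VERSION >= 50000
    #define __noreorder    __attribute__((no_reorder))
    #else
    #define __noreorder
    #endif

    /* include/linux/init.h (sketch): keep initcall order stable under LTO */
    #define __define_initcall(fn, id) \
        static initcall_t __initcall_##fn##id __used __noreorder \
        __attribute__((__section__(".initcall" #id ".init"))) = fn;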

Signed-off-by: Andi Kleen <ak@linux.intel.com>
2022-11-12 11:24:06 +00:00
Sultan Alsawaf
da7364b282 sys_ni: Fix cond_syscall() alias for LTO
When using LTO, the conditional syscall aliases aren't weak, and instead
override implemented syscalls rather than serve as a fallback for
missing syscalls. Fix the cond_syscall() alias using an attribute so
that it gets properly evaluated at link time.
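
A sketch of the shape of the fix: express the stub as a C-level weak alias,
which GCC's LTO plugin can evaluate, instead of a bare assembler-level .weak
directive (the exact macro and its location are assumptions):

    /* was (roughly): an asm(".weak"/".set") alias that LTO cannot see, so the
     * sys_ni_syscall stub ends up overriding real syscall definitions */
    #define cond_syscall(x) \
        asmlinkage long x(void) __attribute__((weak, alias("sys_ni_syscall")))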

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:06 +00:00
Sultan Alsawaf
23055407dc scripts: gcc-ld: Fix -m* flag parsing
Any -m* flags need to be prefixed with "-Wl," when passed to the linker.
This is already done for flags which aren't special-cased, so we can
just remove the specific -m* flag handling to appease GCC.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:06 +00:00
Sultan Alsawaf
3ee4d5731d Revert "kbuild: thin archives final link close --whole-archives option"
This reverts commit 1328a1ae0e.

This breaks GCC LTO.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:06 +00:00
Sultan Alsawaf
56e3dceb7c arm64: Disable place-relative 32-bit relocations for GCC LTO
This is not supported by LTO.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:05 +00:00
Sultan Alsawaf
2b687a7733 arm64: Tune GCC for lito's CPU
This yields a modest, consistent improvement in hackbench performance.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:05 +00:00
Sultan Alsawaf
37f65732ef arm64: Only disable LTO for vDSO when Clang is used
GCC LTO works fine on the vDSO.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:05 +00:00
Sultan Alsawaf
67fd676f12 arm64: Disable -fwhole-program for vDSO
The vDSO library is obviously not self-contained, so it doesn't qualify
for -fwhole-program; building it with -fwhole-program produces a broken
vDSO. Disable the flag for the vDSO to fix this.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:05 +00:00
Sultan Alsawaf
f4dfc6c0cb kbuild: Disable stack conservation for GCC
There's plenty of room on the stack for a few more inlined bytes here
and there. The measured stack usage at runtime is still safe without
-fconserve-stack, and performance is surely improved at a microscopic
level, so remove the flag.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:04 +00:00
Sultan Alsawaf
008fd45714 msm: camera: Stub out the camera_debug_util API and compile it out
A measurably significant amount of CPU time is spent in these routines
while the camera is open. These are also responsible for a grotesque
amount of dmesg spam, so let's nuke them.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:04 +00:00
Sultan Alsawaf
92924c9df1 disp: msm: Use the PM_QOS_REQ_AFFINE_IRQ feature to control SDE PM QoS
Instead of registering a custom IRQ notifier, we should just use the PM
QoS framework's PM_QOS_REQ_AFFINE_IRQ feature.
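
Roughly, the driver now just tells the PM QoS core which IRQ to follow instead
of registering its own affinity notifier; the surrounding driver plumbing here
is an assumption:

    struct pm_qos_request *req = &sde_kms->pm_qos_irq_req;    /* hypothetical field */

    req->type = PM_QOS_REQ_AFFINE_IRQ;
    req->irq = platform_get_irq(pdev, 0);    /* the display interrupt */
    pm_qos_add_request(req, PM_QOS_CPU_DMA_LATENCY, PM_QOS_DEFAULT_VALUE);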

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:04 +00:00
Sultan Alsawaf
614ff8afe7 disp: msm: Remove unneeded PM QoS requests
These are blocking some CPUs in the LITTLE cluster from entering deep
idle because the driver assumes that display rendering work occurs on a
hardcoded set of CPUs, which is false. We already have the IRQ PM QoS
machinery, so this cruft is unneeded.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:04 +00:00
Sultan Alsawaf
a7748256b4 irqchip/gic-v3: Remove pr_devel message containing smp_processor_id()
This call to smp_processor_id() forces gic_raise_softirq() to be called
only with preemption disabled, which isn't an actual requirement. When
gic_raise_softirq() is called with preemption enabled, smp_processor_id()
is used incorrectly and generates a warning splat when the relevant
kernel debug options are enabled.

Get rid of the useless pr_devel message outright to fix the incorrect
smp_processor_id() usage.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:04 +00:00
Sultan Alsawaf
b48e29aaf2 mbcache: Speed up cache entry creation
To prevent racing callers from creating redundant entries,
mb_cache_entry_create() scans through a large hash list of all current
entries to see whether the requested new entry has already been
allocated. Furthermore, it allocates memory for the new entry before
scanning through this hash list, so the allocated memory is simply
discarded when the requested entry is already present. This happens more
than half the time.

Speed up cache entry creation by keeping a small linked list of
requested new entries in progress, and scanning through that first
instead of the large hash-list. Additionally, don't bother allocating
memory for a new entry until it's known that the allocated memory will
be used.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:03 +00:00
Sultan Alsawaf
aae02f228c cpuidle: Mark CPUs idle as late as possible to avoid unneeded IPIs
It isn't guaranteed a CPU will idle upon calling lpm_cpuidle_enter(),
since it could abort early at the need_resched() check. In this case,
it's possible for an IPI to be sent to this "idle" CPU needlessly, thus
wasting power. For the same reason, it's also wasteful to keep a CPU
marked idle even after it's woken up.

Make the window in which CPUs are marked idle as small as possible in
order to reduce wasted power.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:03 +00:00
Sultan Alsawaf
5cd8543751 cpuidle: Optimize pm_qos notifier callback and IPI semantics
The pm_qos callback currently suffers from a number of pitfalls: it
sends IPIs to CPUs that may not be idle, waits for those IPIs to finish
propagating while preemption is disabled (resulting in a long busy wait
for the pm_qos_update_target() caller), and needlessly calls a no-op
function when the IPIs are processed.

Optimize the pm_qos notifier by only sending IPIs to CPUs that are
idle, and by using arch_send_wakeup_ipi_mask() instead of
smp_call_function_many(). Using IPI_WAKEUP instead of IPI_CALL_FUNC,
which is what smp_call_function_many() uses behind the scenes, has the
benefit of doing zero work upon receipt of the IPI; IPI_WAKEUP is
designed purely for sending an IPI without a payload, whereas
IPI_CALL_FUNC does unwanted extra work just to run the empty
smp_callback() function.

Determining which CPUs are idle is done efficiently with an atomic
bitmask instead of using the wake_up_if_idle() API, which checks the
CPU's runqueue in an RCU read-side critical section and under a spin
lock. Not very efficient in comparison to a simple, atomic bitwise
operation. A cpumask isn't needed for this because NR_CPUS is
guaranteed to fit within a word.
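
A sketch of the two pieces; the bitmask name and the notifier wiring are
assumptions:

    static unsigned long cpus_in_idle;    /* one bit per CPU; NR_CPUS fits in a word */

    /* cpuidle entry/exit (sketch): a CPU is marked only while it is truly idle */
    static void mark_cpu_idle(int cpu)  { set_bit(cpu, &cpus_in_idle); }
    static void mark_cpu_awake(int cpu) { clear_bit(cpu, &cpus_in_idle); }

    /* pm_qos notifier (sketch): kick only the idle CPUs, with an empty
     * IPI_WAKEUP rather than an IPI_CALL_FUNC running a no-op callback */
    static int lpm_qos_notify(struct notifier_block *nb, unsigned long val, void *priv)
    {
        struct cpumask mask;

        cpumask_clear(&mask);
        cpumask_bits(&mask)[0] = READ_ONCE(cpus_in_idle);
        if (!cpumask_empty(&mask))
            arch_send_wakeup_ipi_mask(&mask);
        return NOTIFY_OK;
    }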

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:03 +00:00
Sultan Alsawaf
db8f37bbef arm64: Allow IPI_WAKEUP to be used outside of the ACPI parking protocol
An empty IPI is useful for cpuidle to wake sleeping CPUs without causing
them to do unnecessary work upon receipt of the IPI. IPI_WAKEUP fills
this use-case nicely, so let it be used outside of the ACPI parking
protocol.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:03 +00:00
Sultan Alsawaf
020542d17c qos: Don't disable interrupts while holding pm_qos_lock
None of the pm_qos functions actually run in interrupt context; if some
driver calls pm_qos_update_target in interrupt context then it's already
broken. There's no need to disable interrupts while holding pm_qos_lock,
so don't do it.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:03 +00:00
Sultan Alsawaf
b6bbd4193e qos: Replace expensive cpumask usage with raw bitwise operations
cpumask_set_cpu() uses the set_bit() helper, which, in typical kernels
prior to 4.19, uses a spin lock to guarantee atomicity. This is
expensive and unneeded, especially since the qos functions are hot code
paths. The rest of the cpumask functions use the bitmap API, which is
also more expensive than just doing some simple operations on a word.

Since we're operating with a CPU count that can fit within a word,
replace the expensive cpumask operations with raw bitwise operations
wherever possible to make the pm_qos framework more efficient.
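
The flavor of the change (sketch; the actual fields and call sites differ):

    /* before: generic cpumask/bitmap helpers */
    cpumask_set_cpu(cpu, &req->cpus_affine);
    affected = cpumask_test_cpu(cpu, &req->cpus_affine);

    /* after: plain operations on a word, since NR_CPUS fits in one long */
    req->cpus_affine |= BIT(cpu);
    affected = req->cpus_affine & BIT(cpu);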

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:02 +00:00
Sultan Alsawaf
73a40230d3 Revert "mutex: Add a delay into the SPIN_ON_OWNER wait loop."
This reverts commit 1e5a5b5e00.

This doesn't make sense for a few reasons. Firstly, upstream uses this
mutex code and it works fine on all arches; why should arm be any
different?

Secondly, once the mutex owner starts to spin on `wait_lock`,
preemption is disabled and the owner will be in an actively-running
state. The optimistic mutex spinning occurs when the lock owner is
actively running on a CPU, and while the optimistic spinning takes
place, no attempt to acquire `wait_lock` is made by the new waiter.
Therefore, it is guaranteed that new mutex waiters which optimistically
spin will not contend the `wait_lock` spin lock that the owner needs to
acquire in order to make forward progress.

Another potential source of `wait_lock` contention can come from tasks
that call mutex_trylock(), but this isn't actually problematic (and if
it were, it would affect the MUTEX_SPIN_ON_OWNER=n use-case too). This
won't introduce significant contention on `wait_lock` because the
trylock code exits before attempting to lock `wait_lock`, specifically
when the atomic mutex counter indicates that the mutex is already
locked. So in reality, the amount of `wait_lock` contention that can
come from mutex_trylock() amounts to only one task. And once it
finishes, `wait_lock` will no longer be contended and the previous
mutex owner can proceed with clean up.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:02 +00:00
Sultan Alsawaf
0a2139efa8 Revert "usb: gadget: mtp: Increase RX transfer length to 1M"
This reverts commit 0db49c2550a09458db188fb7312c66783c5af104.

This results in kmalloc() abuse to find a large number of contiguous
pages, which thrashes the page allocator and hurts overall performance.
I couldn't reproduce the improved MTP throughput that this commit
claimed either, so just revert it.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:02 +00:00
Sultan Alsawaf
5dc0c03441 Revert "usb: gadget: f_mtp: Increase default TX buffer size"
This reverts commit a9a60c58e0fa21c41ac284282949187b13bdd756.

This results in kmalloc() abuse to find a large number of contiguous
pages, which thrashes the page allocator and hurts overall performance.
I couldn't reproduce the improved MTP throughput that this commit
claimed either, so just revert it.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:02 +00:00
Sultan Alsawaf
72fe2c3281 ion: Mark workqueues freeing buffers asynchronously as CPU intensive
When exiting the camera, there's a period of intense lag caused by all
of the buffer-free workers consuming all CPUs at once for a few seconds.
This isn't very good, and freeing the buffers isn't super time-critical,
so we can lighten the load by marking the per-heap workqueues as CPU
intensive, which offloads the burden of balancing the workers onto the
scheduler.

Also, mark these workqueues with WQ_MEM_RECLAIM so forward progress is
guaranteed via a rescuer thread, since these are used to free memory.
The unnecessary WQ_UNBOUND_MAX_ACTIVE is removed as well, since it's
only used for increasing the active worker count on large-CPU systems.
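
A sketch of the resulting per-heap workqueue allocation; the field and
workqueue names are assumptions:

    heap->free_wq = alloc_workqueue("%s-free", WQ_CPU_INTENSIVE | WQ_MEM_RECLAIM,
                                    0, heap->name);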

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Diab Neiroukh <lazerl0rd@thezest.dev>
2022-11-12 11:24:02 +00:00
Sultan Alsawaf
b933e6ca1e ion: Rewrite to improve clarity and performance
The ION driver suffers from massive code bloat caused by excessive
debug features, as well as poor lock usage as a result of that. Multiple
locks in ION exist to make the debug features thread-safe, which hurts
ION's actual performance when doing its job.

There are numerous code paths in ION that hold mutexes for no reason and
hold them for longer than necessary. This results in not only unwanted
lock contention, but also long delays when a mutex lock results in the
calling thread getting preempted for a while. All lock usage in ION
follows this pattern, which causes poor performance across the board.
Furthermore, a single big mutex is used nearly everywhere, which
degrades performance due to unnecessary lock overhead.

Instead of having a big mutex lock, multiple fine-grained locks are now
used, improving performance.

Additionally, dup_sg_table is called very frequently, and lies within
the rendering path for the display. Speed it up by copying scatterlists
in page-sized chunks rather than iterating one at a time. Note that
sg_alloc_table zeroes out `table`, so there's no need to zero it out
using the memory allocator.

This also features a lock-less caching system for DMA attachments and
their respective sg_table copies, reducing overhead significantly for
code which frequently maps and unmaps DMA buffers and speeding up cache
maintenance since iteration through the list of buffer attachments is
now lock-free. This is safe since there is no interleaved DMA buffer
attaching or accessing for a single ION buffer.

Overall, just rewrite ION entirely to fix its deficiencies. This
optimizes ION for excellent performance and discards its debug cruft.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Diab Neiroukh <lazerl0rd@thezest.dev>
2022-11-12 11:24:01 +00:00
Sultan Alsawaf
4f1ab4600a iommu: msm: Rewrite to improve clarity and performance
The scope of this driver's lock usage is extremely wide, leading to
excessively long lock hold times. Additionally, there is lots of
excessive linked-list traversal and unnecessary dynamic memory
allocation in a critical path, causing poor performance across the
board.

Fix all of this by greatly reducing the scope of the locks used and by
significantly reducing the amount of operations performed when
msm_dma_map_sg_attrs() is called. The entire driver's code is overhauled
for better cleanliness and performance.

Note that ION must be modified to pass a known structure via the private
dma_buf pointer, so that the IOMMU driver can prevent races when
operating on the same buffer concurrently. This is the only way to
eliminate said buffer races without hurting the IOMMU driver's
performance.

Some additional members are added to the device struct as well to make
these various performance improvements possible.

This also removes the manual cache maintenance since ION already handles
it.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:01 +00:00
Sultan Alsawaf
f342f528cd f2fs: Force strict fsync mode
To help mitigate data loss potential in the event of an unclean
shutdown. The small performance hit is worth the trade-off for improved
data integrity, especially since custom kernels are more susceptible to
unclean shutdowns (panics) during development.
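
In current f2fs terms this amounts to defaulting (and effectively pinning) the
fsync_mode mount option, roughly:

    /* fs/f2fs/super.c, default_options() (sketch) */
    F2FS_OPTION(sbi).fsync_mode = FSYNC_MODE_STRICT;    /* was FSYNC_MODE_POSIX */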

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2022-11-12 11:24:01 +00:00