There is noticeable scheduling latency and heavy zone lock contention
stemming from rmqueue_bulk() holding the zone lock for the entire
duration of its work, as seen with the preemptoff tracer. There's no
actual need for
rmqueue_bulk() to hold the zone lock the entire time; it only does so
for supposed efficiency. As such, we can relax the zone lock and even
reschedule when IRQs are enabled in order to keep the scheduling delays
and zone lock contention at bay. Forward progress is still guaranteed,
as the zone lock can only be relaxed after page removal.
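The lock-break pattern can be sketched with a pthread mutex standing in for the zone spin lock (illustrative names and batch size; the kernel uses spin_unlock()/cond_resched() and its own free-list structures):

```c
#include <pthread.h>
#include <sched.h>
#include <stddef.h>

#define BATCH 8

/* Hypothetical item; stands in for a page removed from the free list. */
struct item { struct item *next; };

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Remove up to `count` items from `*list`, relaxing the lock every
 * BATCH removals so other lockers (and the scheduler) can make
 * progress. Forward progress is preserved because the lock is only
 * dropped *after* items have been detached from the shared list.
 */
int take_items(struct item **list, struct item **out, int count)
{
	int taken = 0;

	pthread_mutex_lock(&lock);
	while (taken < count && *list) {
		struct item *it = *list;

		*list = it->next;
		it->next = *out;
		*out = it;
		taken++;

		/* Lock break: let waiters in periodically. */
		if (taken % BATCH == 0) {
			pthread_mutex_unlock(&lock);
			sched_yield();	/* analogous to cond_resched() */
			pthread_mutex_lock(&lock);
		}
	}
	pthread_mutex_unlock(&lock);
	return taken;
}
```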
With this change, rmqueue_bulk() no longer appears as a serious offender
in the preemptoff tracer, and system latency is noticeably improved.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
The page allocator processes free pages in groups of pageblocks, where
the size of a pageblock is typically quite large (1024 pages without
hugetlbpage support). Pageblocks are processed atomically with the zone
lock held, which can cause severe scheduling delays on both the CPU
going through the pageblock and any other CPUs waiting to acquire the
zone lock. A frequent offender is move_freepages_block(), which is used
by rmqueue() for page allocation.
As it turns out, there's no requirement for pageblocks to be so large,
so the pageblock order can simply be reduced to ease the scheduling
delays and zone lock contention. PAGE_ALLOC_COSTLY_ORDER is used as a
reasonable setting to ensure non-costly page allocation requests can
still be serviced without always needing to free up more than one
pageblock's worth of pages at a time.
This has a noticeable effect on overall system latency when memory
pressure is elevated. The various mm functions which operate on
pageblocks no longer appear in the preemptoff tracer, where previously
they would spend up to 100 ms on a mobile arm64 CPU processing a
pageblock with preemption disabled and the zone lock held.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
There's no requirement that perf_event_read_local() be used from a
context where CPU migration isn't possible, yet smp_processor_id() is
used with the assumption that the caller guarantees CPU migration can't
occur. Since IRQs are disabled here anyway, the smp_processor_id() can
simply be moved to the IRQ-disabled section to guarantee its safety.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
The partial cache maintenance helpers check the number of segments in
each mapping before checking if the mapping is actually in use, which
sometimes results in spurious errors being returned to vidc. The errors
then cause vidc to malfunction, even though nothing's wrong.
The reason for checking the segment count first was to elide map_rwsem;
however, it turns out that map_rwsem isn't needed anyway, so we can have
our cake and eat it too.
Fix the spurious segment count errors by reordering the checks, and
remove map_rwsem entirely so we don't have to worry about eliding it for
performance reasons.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
There isn't a need for cpus_affine to be atomic, and reading/writing to
it outside of the global pm_qos lock is racy anyway. As such, we can
simply turn it into a primitive integer type.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
The plist is already sorted and traversed in ascending order of PM QoS
value, so we can simply look at the lowest PM QoS values which affect
the given request's CPUs until we've looked at all of them, at which
point the traversal can be stopped early. This also lets us get rid of
the pesky qos_val array.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
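The early-terminating traversal can be modeled in plain C (a sketch with illustrative names; the kernel walks a plist of PM QoS requests sorted by value rather than an array, and uses cpumasks rather than a bitmask word):

```c
#include <stdint.h>

#define NR_CPUS 8
#define DEFAULT_VAL INT32_MAX

/* Hypothetical request: a PM QoS value plus the CPUs it applies to. */
struct req {
	int32_t val;
	uint32_t cpus;
};

/*
 * `reqs` is sorted ascending by val, mirroring the plist. Because the
 * first value seen for a CPU is the lowest one affecting it, the
 * traversal can stop as soon as every CPU in `relevant` is covered.
 */
void compute_targets(const struct req *reqs, int nr,
		     uint32_t relevant, int32_t *target)
{
	uint32_t seen = 0;
	int i, cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		target[cpu] = DEFAULT_VAL;

	for (i = 0; i < nr && (seen & relevant) != relevant; i++) {
		uint32_t fresh = reqs[i].cpus & relevant & ~seen;

		for (cpu = 0; cpu < NR_CPUS; cpu++)
			if (fresh & (1u << cpu))
				target[cpu] = reqs[i].val;
		seen |= fresh;
	}
}
```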
Andrzej Perczak discovered that his CPUs would almost never enter an
idle state deeper than C0, and pinpointed the cause of the issue to be
commit "qos: Speed up pm_qos_set_value_for_cpus()". As it turns out, the
optimizations introduced in that commit contain two issues that are
responsible for this behavior: pm_qos_remove_request() fails to refresh
the affected per-CPU targets, and IRQ migrations fail to refresh their
old affinity's targets.
Removing a request fails to refresh the per-CPU targets because
`new_req->node.prio` isn't updated to the PM QoS class' default value
upon removal, and so it contains its old value from when it was active.
This causes the `changed` loop in pm_qos_set_value_for_cpus() to check
against a stale PM QoS request value and erroneously determine that the
request in question doesn't alter the current per-CPU targets.
As for IRQ migrations, only the new CPU affinity mask gets updated,
which causes the CPUs present in the old affinity mask but not the new
one to retain their targets, specifically when a migration occurs while
the associated PM QoS request is active.
To fix these issues while retaining optimal speed, update PM QoS
requests' CPU affinity inside pm_qos_set_value_for_cpus() so that the
old affinity can be known, and skip the `changed` loop when the request
in question is being removed.
Reported-by: Andrzej Perczak <kartapolska@gmail.com>
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
The bias timer is only started when WFI is used, so we only need to
try to cancel it after leaving WFI.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Scheduling in atomic context is indicative of a serious problem that,
although it may not be immediately lethal, can lead to strange issues and
eventually a panic. We should therefore panic the first time it's
detected.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
It turns out that the ION_IOC_HEAP_QUERY command is actually used in
some camera-related components in Android 11, such as libultradepth_api,
and in libdmabufheap in Android 12. The omission of this command causes
these components to break when their ioctl attempt returns -ENOTTY.
Restore the ION_IOC_HEAP_QUERY command to fix the incompatibility.
Unfortunately, libdmabufheap uses heap names in order to look up heap
IDs so that the calling userspace code can maintain a constant heap name
and cope with inconsistent heap IDs. For example, if some user code
wants to allocate from the system heap, it only has to specify "system"
as the desired heap name, and it doesn't need to keep track of the
system heap ID.
This is unfortunate because now we must copy heap name strings to
userspace. In order to speed this up, a statically-allocated array sized
for the maximum number of heaps is populated with heap data as heaps are
created. When a heap query command
requests heap data, all we have to do is copy the big array of pre-made
data, and we're done.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
We can omit the _IOC_SIZE() check and also inline copy_from_user() by
duplicating copy_from_user() for each ioctl command and giving it a
constant size. Since there aren't many ioctls here, this doesn't turn
the code into spaghetti.
We can further optimize the prefetch ioctls as well by omitting one word
of data from the copy_from_user(), since the first member of `struct
ion_prefetch_data` (the `len` field) is unused. As proof of this, rename
`len` to `unused` in the uapi header, which also ensures that the
compiler will notify us if this ever changes in the future. This is
necessary because the prefetch data is used outside of ion.c, where we
cannot easily audit its usage.
There's no reduction done for the allocation ioctl because we could only
reduce the copy_from_user() payload by half a word, which would result
in a payload size that isn't a multiple of a word. The copy_from_user()
implementation on arm64 goes slower as a result, so just leave it
untouched.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Heaps are never removed, and there is only one ion_device_add_heap()
user: msm_ion_probe(). This single user calls ion_device_add_heap()
sequentially, not concurrently. Furthermore, all heap additions are done
once kernel init is complete, and heaps are only accessed by userspace,
so no locking is needed at all here.
The write lock in ion_walk_heaps() doesn't make sense either since the
heap-walking functions neither mutate a heap's placement in the plist,
nor change a heap in a way that requires pausing all buffer allocations.
The functions used in the heap walking routine handle synchronization
themselves, so there's no need for the mutex-style locking here. This
write lock appears to be a historical artifact from the following 2013
commit (present in msm-3.4 trees) where a justification for the write
lock was never given: 7c1b8aa23ef ("gpu: ion: Add support for heap
walking").
Since the heap plist rwsem appears to be thoroughly useless, we can
safely remove it to reduce complexity and improve performance.
Also, change the name of ion_device_add_heap() to ion_add_heap() so the
compiler can notify us if ion_device_add_heap() is used elsewhere in the
future.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
The following warning occurs because we don't update the runqueue's
clock when taking rq->lock in sched_migrate_to_cpumask_end():
rq->clock_update_flags < RQCF_ACT_SKIP
WARNING: CPU: 0 PID: 991 at update_curr+0x1c8/0x2bc
[...]
Call trace:
update_curr+0x1c8/0x2bc
dequeue_task_fair+0x7c/0x1238
do_set_cpus_allowed+0x64/0x28c
sched_migrate_to_cpumask_end+0xa8/0x1b4
m_stop+0x40/0x78
seq_read+0x39c/0x4ac
__vfs_read+0x44/0x12c
vfs_read+0xf0/0x1d8
SyS_read+0x6c/0xcc
el0_svc_naked+0x34/0x38
Fix it by adding an update_rq_clock() call when taking rq->lock.
Signed-off-by: celtare21 <celtare21@gmail.com>
These are in the critical path for rendering frames to the display, so
mark them as performance-critical and affine them to the big CPU
cluster. They aren't placed onto the prime cluster because the
single-CPU prime cluster will be used to run the DRM IRQ and kthreads.
DRM is more latency-critical than KGSL and we need to have DRM and KGSL
running on separate CPUs for the best performance, so KGSL gets the big
cluster.
Note that since there are other IRQs requested via kgsl_request_irq(),
we must specify that the IRQ to be made perf-critical is kgsl_3d0_irq.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
This reverts commit 417bded5a942a2a23ad65b3fe5fd3fff2d0dbf5b.
This is wrong. This causes 3 IRQs to be affined to the big CPU cluster,
not just the primary kgsl_3d0_irq one. As a result, the perf crit API
thinks that the 2 extra IRQs are critical and will balance them despite
them being rarely used (kgsl_hfi_irq and kgsl_gmu_irq).
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Scheduler code is very hot and every little optimization counts. Instead
of constantly checking sched_numa_balancing when NUMA is disabled,
compile it out.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
PID map reads for processes with thousands of mappings can be done
extensively by certain Android apps, burning through CPU time on
higher-performance CPUs even though reading PID maps is never a
performance-critical task. We can relieve the load on the important CPUs
by moving PID map reads to little CPUs via sched_migrate_to_cpumask_*().
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: dreamisbaka <jolinux.g@gmail.com>
There are some chunks of code in the kernel running in process context
where it may be helpful to run the code on a specific set of CPUs, such
as when reading some CPU-intensive procfs files. This is especially
useful when the code in question must run within the context of the
current process (so kthreads cannot be used).
Add an API to make this possible, which consists of the following:
sched_migrate_to_cpumask_start():
@old_mask: pointer to output the current task's old cpumask
@dest: pointer to a cpumask the current task should be moved to
sched_migrate_to_cpumask_end():
@old_mask: pointer to the old cpumask generated earlier
@dest: pointer to the dest cpumask provided earlier
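A userspace analogue of the pair, using sched_setaffinity() in place of the kernel's cpumask plumbing (a sketch only; the in-kernel version manipulates the task's cpumask under rq locks, and this analogue drops the `dest` argument from the _end() call):

```c
#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>

/*
 * Userspace model of the proposed API. migrate_to_cpumask_start()
 * saves the caller's current affinity into *old_mask and moves the
 * caller onto *dest; migrate_to_cpumask_end() restores the saved mask.
 */
int migrate_to_cpumask_start(cpu_set_t *old_mask, const cpu_set_t *dest)
{
	if (sched_getaffinity(0, sizeof(*old_mask), old_mask))
		return -1;
	return sched_setaffinity(0, sizeof(*dest), dest);
}

int migrate_to_cpumask_end(const cpu_set_t *old_mask)
{
	/* Restore the affinity saved by the _start() call. */
	return sched_setaffinity(0, sizeof(*old_mask), old_mask);
}
```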
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: dreamisbaka <jolinux.g@gmail.com>
Android and various applications in Android need to read PID map data in
order to work. Some processes can contain over 10,000 mappings, which
results in lots of time wasted on simply generating strings. This wasted
time adds up, especially in the case of Unity-based games, which utilize
the Boehm garbage collector. A game's main process typically has well
over 10,000 mappings due to the loaded textures, and the Boehm GC reads
PID maps several times a second. This results in over 100,000 map
entries being printed out per second, so micro-optimization here is
important. Before this commit, show_vma_header_prefix() would typically
take around 1000 ns to run on a Snapdragon 855; now it only takes about
50 ns to run, which is a 20x improvement.
The primary micro-optimizations here assume that there are no more than
40 bits in the virtual address space, hence the CONFIG_ARM64_VA_BITS
check. Arm64 uses a virtual address size of 39 bits, so this perfectly
covers it.
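The fixed-width formatting that the 40-bit assumption enables can be sketched as follows (illustrative, not the kernel's actual helper; 40 bits means exactly 10 hex digits, so the loop has a fixed trip count and sidesteps the generic printf machinery):

```c
#include <stdint.h>

/*
 * Emit `val` as exactly 10 lower-case hex digits (40 bits), keeping
 * leading zeros, and return a pointer past the last digit written.
 * The 40-bit bound is the same assumption the commit makes via the
 * CONFIG_ARM64_VA_BITS check.
 */
static char *hex40(char *buf, uint64_t val)
{
	static const char digits[] = "0123456789abcdef";
	int shift;

	for (shift = 36; shift >= 0; shift -= 4)
		*buf++ = digits[(val >> shift) & 0xf];
	return buf;
}
```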
This also removes padding used to beautify PID map output to further
speed up reads and reduce the amount of bytes printed, and optimizes the
dentry path retrieval for file-backed mappings. Note, however, that the
trailing space at the end of the line for non-file-backed mappings
cannot be omitted, as it breaks some PID map parsers.
This still retains insignificant leading zeros from printed hex values
to maintain the current output format.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: LibXZR <xzr467706992@163.com>
There's no point in enabling QoS clocks for clients that don't have
any.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Adam W. Willis <return.of.octobot@gmail.com>
Page pool additions and removals are very hot during GPU workloads, so
they should be optimized accordingly. We can use a lock-less list for
storing the free pages in order to speed things up. The lock-less list
allows for one llist_del_first() user and unlimited llist_add() users to
run concurrently, so only a spin lock around the llist_del_first() is
needed; everything else is lock-free. The per-pool page count is now an
atomic to make it lock-free as well.
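The scheme can be modeled with C11 atomics (a sketch with illustrative names; the kernel uses llist_add()/llist_del_first() with a spin lock around the deleter, not pthreads):

```c
#include <stdatomic.h>
#include <pthread.h>
#include <stddef.h>

struct node { struct node *next; };

struct pool {
	_Atomic(struct node *) head;	/* models the llist head */
	atomic_int count;		/* lock-free page count */
	pthread_mutex_t del_lock;	/* serializes deleters only */
};

/* Lock-free push: any number of adders may run concurrently. */
void pool_add(struct pool *p, struct node *n)
{
	struct node *old = atomic_load(&p->head);

	do {
		n->next = old;
	} while (!atomic_compare_exchange_weak(&p->head, &old, n));
	atomic_fetch_add(&p->count, 1);
}

/* Pop: only one deleter at a time, per the llist_del_first() rule. */
struct node *pool_del_first(struct pool *p)
{
	struct node *n;

	pthread_mutex_lock(&p->del_lock);
	n = atomic_load(&p->head);
	while (n && !atomic_compare_exchange_weak(&p->head, &n, n->next))
		;	/* CAS failed: an adder raced us, retry */
	pthread_mutex_unlock(&p->del_lock);

	if (n)
		atomic_fetch_sub(&p->count, 1);
	return n;
}
```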
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: LibXZR <xzr467706992@163.com>
Clearing dim layers indiscriminately for each blend stage on each commit
wastes a lot of CPU time since the clearing process is heavy on register
accesses. We can optimize this by only clearing dim layers when they're
actually set, and only clearing them on a per-stage basis at that. This
reduces display commit latency considerably.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
The most frequent user of fenced GMU writes, adreno_ringbuffer_submit(),
performs a fenced GMU write under a spin lock, and since fenced GMU
writes use udelay(), a lot of CPU cycles are burned here. Not only is
the spin lock held for longer than necessary (because the write doesn't
need to be inside the spin lock), but also a lot of CPU time is wasted
in udelay() for tens of microseconds when usleep_range() can be used
instead.
Move the locked fenced GMU writes to outside their spin locks and make
adreno_gmu_fenced_write() use usleep_range() when not in atomic/IRQ
context, to save power and improve performance. Fenced GMU writes are
found to take an average of 28 microseconds on the Snapdragon 855, so a
usleep range of 10 to 30 microseconds is optimal.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
The time profiling here is only used to provide additional debug info
for a context dump as well as a tracepoint. It adds non-trivial overhead
to ringbuffer submission since it accesses GPU registers, so remove it
along with the tracepoint since we're not debugging adreno.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Qualcomm's PM QoS solution suffers from a number of issues: applying
PM QoS to all CPUs, convoluted spaghetti code that wastes CPU cycles,
and keeping PM QoS applied for 10 ms after all requests finish
processing.
This implements a simple IRQ-affined PM QoS mechanism for each UFS
adapter which uses atomics to elide locking, and enqueues a worker to
apply PM QoS to the target CPU as soon as a command request is issued.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: alk3pInjection <webmaster@raspii.tech>
Qualcomm's QoS implementation wastes a significant amount of power on
CPU cycles.
Scrap the QoS bits and save a bit of power without hurting any
functionality.
Change-Id: I1de3563d9c99ba863f10a90a900d290bdd8e6b79
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
Signed-off-by: Carlos Ayrton Lopez Arroyo <15030201@itcelaya.edu.mx>
This implementation is completely over the top and wastes lots of CPU
cycles. It's too convoluted to fix, so just scrap it to make way for a
simpler solution. This purges every PM QoS reference in the UFS drivers.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: alk3pInjection <webmaster@raspii.tech>
Using a timeout for a PM QoS request can lead to disastrous results on
power consumption. It's always possible to find a fixed scope in which a
PM QoS request should be applied, so timeouts aren't ever strictly
needed; they're usually just a lazy way of using PM QoS. Remove the API
so that it cannot be abused any longer.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
KGSL already has PM QoS covering what matters. The L2PC PM QoS code is
not only unneeded, but also unused, so remove it. It's poorly designed
anyway since it uses a timeout with PM QoS, which is drastically bad for
power consumption.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Reading and clearing any errors from the VBIF error registers takes a
significant amount of time during kickoff, and is only used to produce
debug logs when errors are detected. Since we're not debugging hardware
issues in MDSS, remove the VBIF error clearing entirely to reduce
display rendering latency.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Explicit write memory barriers are unneeded here since releasing a lock
already implies a full memory barrier.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
The IRQ status reads are decoupled from the IRQ dispatcher, even though
the dispatcher is the only one using the IRQ statuses. This results in a
lot of redundant work being done as the IRQ status reader also reads the
IRQ-enable register and clears the IRQ mask, both of which are already
handled by the IRQ dispatcher. We can cut out the redundant work done in
the hardware IRQ handler by consolidating the IRQ status reads into the
IRQ dispatcher.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
These heavy checks for seeing if autorefresh is enabled are unneeded
when the autorefresh config is disabled. These checks are performed on
every display commit and show up as using a significant amount of CPU
time in perf top. Skip them when it's unnecessary in order to improve
display rendering performance.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Every atomic frame commit allocates memory dynamically to check the
states of the CRTCs, when those allocations can just be stored on the
stack instead. Eliminate these dynamic memory allocations in the frame
commit path to improve performance. They don't need to be zeroed out
either.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Diab Neiroukh <lazerl0rd@thezest.dev>
The default INITCONTEXTLEN-sized buffers can fit on the stack. Do so to
save a call to kmalloc() in a hot path.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Most, if not all, map keys and values are rather small and can fit on
the stack, eliminating the need to allocate them dynamically. Reserve
some small stack buffers for them to avoid dynamic memory allocation.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Most command buffers here are rather small (fewer than 256 words); it's
a waste of time to dynamically allocate memory for such a small buffer
when it could easily fit on the stack.
Conditionally using an on-stack command buffer when the size is small
enough eliminates the need for using a dynamically-allocated buffer most
of the time, reducing GPU command submission latency.
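The pattern can be sketched as follows (illustrative names and a model workload; the real code copies command words from userspace and submits them to the GPU):

```c
#include <stdlib.h>
#include <string.h>

#define ONSTACK_WORDS 256	/* threshold from the commit */

/*
 * Stage `nwords` command words in an on-stack array when the count is
 * small enough, falling back to malloc() only for large buffers. The
 * summing loop stands in for actual command processing.
 */
long process_cmds(const unsigned *src, size_t nwords)
{
	unsigned onstack[ONSTACK_WORDS];
	unsigned *cmds = onstack;
	long sum = 0;
	size_t i;

	if (nwords > ONSTACK_WORDS) {
		cmds = malloc(nwords * sizeof(*cmds));
		if (!cmds)
			return -1;
	}

	memcpy(cmds, src, nwords * sizeof(*cmds));
	for (i = 0; i < nwords; i++)
		sum += cmds[i];

	if (cmds != onstack)
		free(cmds);
	return sum;
}
```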
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Ashmem uses a single big mutex lock for all synchronization, and even
uses it when no synchronization issues are present. The contention from
using a single lock results in all-around poor performance.
Rewrite to use fine-grained locks and atomic constructions to eliminate
the big mutex lock, thereby improving performance greatly. In places
where locks are needed for a one-time operation, we speculatively
check if locking is needed while avoiding data races. The optional name
fields are removed as well.
Note that because asma->unpinned_list never has anything added to it,
we can remove any code using it to clean up the driver a lot and
reduce synchronization requirements. This also means that
ashmem_lru_list never gets anything added to it either, so all code
using it is dead code as well, which we can remove.
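The speculative-check idea can be sketched with C11 atomics (illustrative only, not ashmem's actual code): a lock-free read handles the common case, and the lock is taken, with a re-check under it, only when the one-time transition might actually be needed.

```c
#include <stdatomic.h>
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_bool done;
static int setup_calls;

void do_once(void)
{
	/* Fast path: no lock and no store once the flag is set. */
	if (atomic_load_explicit(&done, memory_order_acquire))
		return;

	pthread_mutex_lock(&lock);
	/* Re-check under the lock: another thread may have won the race. */
	if (!atomic_load_explicit(&done, memory_order_relaxed)) {
		setup_calls++;	/* the one-time work */
		atomic_store_explicit(&done, 1, memory_order_release);
	}
	pthread_mutex_unlock(&lock);
}
```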
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Remote register I/O amounts to a measurably significant portion of CPU
time due to how frequently this function is used. Cache the value of
each register on-demand and use this value in future invocations to
mitigate the expensive I/O.
Co-authored-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
[@0ctobot: Adapted for msm-4.19]
Signed-off-by: Adam W. Willis <return.of.octobot@gmail.com>
The plane states allocation and free show up on perf top as taking up a
non-trivial amount of time on every commit. Since the allocation is
small, just place it on the stack to eliminate the dynamic allocation
overhead completely.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Diab Neiroukh <lazerl0rd@thezest.dev>
More often than not, get_vsync_info() is used only to get the write
count, while the other values it returns are left unused. This is not
optimal since it is done on every display commit. We can eliminate the
superfluous register reads by adding a parameter specifying if only the
write line count is requested.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Diab Neiroukh <lazerl0rd@thezest.dev>
The cleanup portion of non-blocking commits takes up a non-trivial
amount of CPU time, so offload it to the little CPUs to reduce latency
in the display commit path.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
[lazerl0rd: Adjust for Linux 4.19, with different commit cleanup.]
Signed-off-by: Diab Neiroukh <lazerl0rd@thezest.dev>
The default frequency on Qualcomm CPUs is the lowest frequency supported
by the CPU. This hurts latency when waking from suspend, as each CPU
coming online runs at its lowest frequency until the governor can take
over later. To speed up waking from suspend, hijack the CPUHP_AP_ONLINE
hook and use it to set the highest available frequency on each CPU as
they come online. This is done behind the governor's back but it's fine
because the governor isn't running at this point in time for a CPU
that's coming online.
This speeds up waking from suspend significantly.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
[lazerl0rd: Adjusted to apply to qcom-cpufreq-hw instead of clk-cpu-osm.]
Signed-off-by: Diab Neiroukh <lazerl0rd@thezest.dev>
Since we know an interrupt will be arriving soon when a frame is
committed, we can anticipate it and prevent the CPU servicing that
interrupt from entering deep idle states. This reduces display rendering
latency.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Diab Neiroukh <lazerl0rd@thezest.dev>
A lot of unnecessary work is done in pm_qos_set_value_for_cpus(),
especially when the request being updated isn't affined to all CPUs.
We can reduce the work done here significantly by only inspecting the
CPUs which are affected by the updated request, and bailing out if the
updated request doesn't change anything.
We can make some other micro-optimizations as well knowing that this
code is only for the PM_QOS_CPU_DMA_LATENCY class.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
RCU callbacks are not time-critical and constitute kernel housekeeping.
Offload the no-callback kthreads onto little CPUs to clear load off of
the more important, higher-performance CPUs.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Binder code is very hot, so checking frequently to see if a debug
message should be printed is a waste of cycles. We're not debugging
binder, so just stub out the debug prints to compile them out entirely.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
All of these dynamically-allocated structs can be simply placed on the
stack, eliminating the overhead of dynamic memory allocation entirely.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Most write buffers are rather small and can fit on the stack,
eliminating the need to allocate them dynamically. Reserve a 4 KiB
stack buffer for this purpose to avoid the overhead of dynamic
memory allocation.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>