summaryrefslogtreecommitdiffstats
path: root/arch/loongarch/lib
Commit message (Collapse)AuthorAgeFilesLines
* LoongArch: Select ARCH_SUPPORTS_INT128 if CC_HAS_INT128Xi Ruoyao2024-05-142-0/+58
| | | | | | | | | | | | | | | | | | | | This allows compiling a full 128-bit product of two 64-bit integers as a mul/mulh pair, instead of a nasty long sequence of 20+ instructions. However, after selecting ARCH_SUPPORTS_INT128, when optimizing for size the compiler generates calls to __ashlti3, __ashrti3, and __lshrti3 for shifting __int128 values, causing a link failure: loongarch64-unknown-linux-gnu-ld: kernel/sched/fair.o: in function `mul_u64_u32_shr': <PATH>/include/linux/math64.h:161:(.text+0x5e4): undefined reference to `__lshrti3' So provide the implementation of these functions if ARCH_SUPPORTS_INT128. Closes: https://lore.kernel.org/loongarch/CAAhV-H5EZ=7OF7CSiYyZ8_+wWuenpo=K2WT8-6mAT4CvzUC_4g@mail.gmail.com/ Signed-off-by: Xi Ruoyao <xry111@xry111.site> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
* LoongArch: Add ORC stack unwinder supportTiezhu Yang2024-03-114-0/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | The kernel CONFIG_UNWINDER_ORC option enables the ORC unwinder, which is similar in concept to a DWARF unwinder. The difference is that the format of the ORC data is much simpler than DWARF, which in turn allows the ORC unwinder to be much simpler and faster. The ORC data consists of unwind tables which are generated by objtool. After analyzing all the code paths of a .o file, it determines information about the stack state at each instruction address in the file and outputs that information to the .orc_unwind and .orc_unwind_ip sections. The per-object ORC sections are combined at link time and are sorted and post-processed at boot time. The unwinder uses the resulting data to correlate instruction addresses with their stack states at run time. Most of the logic are similar with x86, in order to get ra info before ra is saved into stack, add ra_reg and ra_offset into orc_entry. At the same time, modify some arch-specific code to silence the objtool warnings. Co-developed-by: Jinyang He <hejinyang@loongson.cn> Signed-off-by: Jinyang He <hejinyang@loongson.cn> Co-developed-by: Youling Tang <tangyouling@loongson.cn> Signed-off-by: Youling Tang <tangyouling@loongson.cn> Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
* LoongArch: Add KASAN (Kernel Address Sanitizer) supportQing Zhang2023-09-063-9/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 1/8 of kernel addresses reserved for shadow memory. But for LoongArch, There are a lot of holes between different segments and valid address space (256T available) is insufficient to map all these segments to kasan shadow memory with the common formula provided by kasan core, saying (addr >> KASAN_SHADOW_SCALE_SHIFT) + KASAN_SHADOW_OFFSET So LoongArch has a arch-specific mapping formula, different segments are mapped individually, and only limited space lengths of these specific segments are mapped to shadow. At early boot stage the whole shadow region populated with just one physical page (kasan_early_shadow_page). Later, this page is reused as readonly zero shadow for some memory that kasan currently don't track. After mapping the physical memory, pages for shadow memory are allocated and mapped. Functions like memset()/memcpy()/memmove() do a lot of memory accesses. If bad pointer passed to one of these function it is important to be caught. Compiler's instrumentation cannot do this since these functions are written in assembly. KASan replaces memory functions with manually instrumented variants. Original functions declared as weak symbols so strong definitions in mm/kasan/kasan.c could replace them. Original functions have aliases with '__' prefix in names, so we could call non-instrumented variant if needed. Signed-off-by: Qing Zhang <zhangqing@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
* LoongArch: Add SIMD-optimized XOR routinesWANG Xuerui2023-09-065-0/+315
| | | | | | | | | | | | | | | | | | | | | | Add LSX and LASX implementations of xor operations, operating on 64 bytes (one L1 cache line) at a time, for a balance between memory utilization and instruction mix. Huacai confirmed that all future LoongArch implementations by Loongson (that we care) will likely also feature 64-byte cache lines, and experiments show no throughput improvement with further unrolling. Performance numbers measured during system boot on a 3A5000 @ 2.5GHz: > 8regs : 12702 MB/sec > 8regs_prefetch : 10920 MB/sec > 32regs : 12686 MB/sec > 32regs_prefetch : 10918 MB/sec > lsx : 17589 MB/sec > lasx : 26116 MB/sec Acked-by: Song Liu <song@kernel.org> Signed-off-by: WANG Xuerui <git@xen0n.name> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
* LoongArch: Adjust {copy, clear}_user exception handler behaviorWeihao Li2023-09-062-121/+127
| | | | | | | | | | The {copy, clear}_user function should returns number of bytes that could not be {copied, cleared}. So, try to {copy, clear} byte by byte when ld.{d,w,h} and st.{d,w,h} trapped into an exception. Reviewed-by: WANG Rui <wangrui@loongson.cn> Signed-off-by: Weihao Li <liweihao@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
* LoongArch: Replace #include <asm/export.h> with #include <linux/export.h>Masahiro Yamada2023-08-255-5/+5
| | | | | | | | | | | | | Commit ddb5cdbafaaad ("kbuild: generate KSYMTAB entries by modpost") deprecated <asm/export.h>, which is now a wrapper of <linux/export.h>. Replace #include <asm/export.h> with #include <linux/export.h>. After all the <asm/export.h> lines are converted, <asm/export.h> and <asm-generic/export.h> will be removed. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
* LoongArch: Remove unneeded #include <asm/export.h>Masahiro Yamada2023-08-251-1/+0
| | | | | | | | There is no EXPORT_SYMBOL() line there, hence #include <asm/export.h> is unneeded. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
* LoongArch: Fix return value underflow in exception pathWANG Rui2023-07-282-2/+4
| | | | | | | | | | | | This patch fixes an underflow issue in the return value within the exception path, specifically at .Llt8 when the remaining length is less than 8 bytes. Cc: stable@vger.kernel.org Fixes: 8941e93ca590 ("LoongArch: Optimize memory ops (memset/memcpy/memmove)") Reported-by: Weihao Li <liweihao@loongson.cn> Signed-off-by: WANG Rui <wangrui@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
* LoongArch: Make the CPUCFG&CSR ops simple aliases of compiler built-insWANG Xuerui2023-06-291-3/+3
| | | | | | | | | | | | | | | In addition to less visual clutter, this also makes Clang happy regarding the const-ness of arguments. In the original approach, all Clang gets to see is the incoming arguments whose const-ness cannot be proven without first being inlined; so Clang errors out here while GCC is fine. While at it, tweak several printk format strings because the return type of csr_read64 becomes effectively unsigned long, instead of unsigned long long. Signed-off-by: WANG Xuerui <git@xen0n.name> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
* LoongArch: Add support for function error injectionTiezhu Yang2023-05-012-0/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Inspired by the commit 42d038c4fb00f ("arm64: Add support for function error injection") and the commit ee55ff803b383 ("riscv: Add support for function error injection"), this patch supports function error injection for LoongArch. Mainly implement two functions: (1) regs_set_return_value() which is used to overwrite the return value, (2) override_function_with_return() which is used to override the probed function returning and jump to its caller. Here is a simple test under CONFIG_FUNCTION_ERROR_INJECTION and CONFIG_FAIL_FUNCTION: # echo sys_clone > /sys/kernel/debug/fail_function/inject # echo 100 > /sys/kernel/debug/fail_function/probability # dmesg bash: fork: Invalid argument # dmesg ... FAULT_INJECTION: forcing a failure. name fail_function, interval 1, probability 100, space 0, times 1 ... Call Trace: [<90000000002238f4>] show_stack+0x5c/0x180 [<90000000012e384c>] dump_stack_lvl+0x60/0x88 [<9000000000b1879c>] should_fail_ex+0x1b0/0x1f4 [<900000000032ead4>] fei_kprobe_handler+0x28/0x6c [<9000000000230970>] kprobe_breakpoint_handler+0xf0/0x118 [<90000000012e3e60>] do_bp+0x2c4/0x358 [<9000000002241924>] exception_handlers+0x1924/0x10000 [<900000000023b7d0>] sys_clone+0x0/0x4 [<90000000012e4744>] do_syscall+0x7c/0x94 [<9000000000221e44>] handle_syscall+0xc4/0x160 Tested-by: Hengqi Chen <hengqi.chen@gmail.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
* LoongArch: Add checksum optimization for 64-bit systemBibo Mao2023-05-012-1/+142
| | | | | | | | | | | | | LoongArch platform is 64-bit system, which supports 8-bytes memory accessing, but generic checksum functions use 4-byte memory access. So add 8-bytes memory access optimization for checksum functions on LoongArch. And the code comes from arm64 system. When network hw checksum is disabled, iperf performance improves about 10% with this patch. Signed-off-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
* LoongArch: Optimize memory ops (memset/memcpy/memmove)WANG Rui2023-05-015-167/+603
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To optimize memset()/memcpy()/memmove() and so on, we use a jump table to dispatch cases for short data lengths; and for long data lengths, we split the destination into head part (first 8 bytes), tail part (last 8 bytes) and middle part. The head part and tail part may be at unaligned addresses, while the middle part is always aligned (the middle part is allowed to overlap the head/tail part). In this way, the first and last 8 bytes may be unaligned accesses, but we can make sure the data in the middle is processed at an aligned destination address. We have tested micro-bench[1] on a Loongson-3C5000 16-core machine (2.2GHz): 1. memset | length | src offset | dst offset | speed before | speed after | % | |--------|------------|------------|--------------|-------------|---------| | 8 | 0 | 0 | 696.191 | 1518.785 | 118.16% | | 8 | 0 | 1 | 696.325 | 1518.937 | 118.14% | | 50 | 0 | 0 | 969.976 | 8053.902 | 730.32% | | 50 | 0 | 1 | 970.034 | 8058.475 | 730.74% | | 300 | 0 | 0 | 5876.612 | 16544.703 | 181.53% | | 300 | 0 | 1 | 5030.849 | 16549.011 | 228.95% | | 1200 | 0 | 0 | 11797.077 | 16752.137 | 42.00% | | 1200 | 0 | 1 | 5687.141 | 16645.233 | 192.68% | | 4000 | 0 | 0 | 15723.27 | 16761.557 | 6.60% | | 4000 | 0 | 1 | 5906.114 | 16732.316 | 183.30% | | 8000 | 0 | 0 | 16751.403 | 16770.002 | 0.11% | | 8000 | 0 | 1 | 5995.449 | 16754.07 | 179.45% | 2. memcpy | length | src offset | dst offset | speed before | speed after | % | |--------|------------|------------|--------------|-------------|---------| | 8 | 0 | 0 | 696.2 | 1670.605 | 139.96% | | 8 | 0 | 1 | 696.325 | 1671.138 | 139.99% | | 50 | 0 | 0 | 969.974 | 8724.999 | 799.51% | | 50 | 0 | 1 | 970.032 | 8730.138 | 799.98% | | 300 | 0 | 0 | 5564.662 | 16272.652 | 192.43% | | 300 | 0 | 1 | 4670.436 | 14972.842 | 220.59% | | 1200 | 0 | 0 | 10740.23 | 16751.728 | 55.97% | | 1200 | 0 | 1 | 5027.741 | 14874.564 | 195.85% | | 4000 | 0 | 0 | 15122.367 | 16737.642 | 10.68% | | 4000 | 0 | 1 | 5536.918 | 14890.397 | 168.93% | | 8000 | 0 | 0 | 16505.453 | 16553.543 | 0.29% | | 8000 | 0 | 1 | 5821.619 | 14841.804 | 154.94% | 3. memmove | length | src offset | dst offset | speed before | speed after | % | |--------|------------|------------|--------------|-------------|---------| | 8 | 0 | 0 | 982.693 | 1670.568 | 70.00% | | 8 | 0 | 1 | 983.023 | 1671.174 | 70.00% | | 50 | 0 | 0 | 1230.87 | 8727.625 | 609.06% | | 50 | 0 | 1 | 1232.515 | 8730.138 | 608.32% | | 300 | 0 | 0 | 6490.375 | 16296.993 | 151.09% | | 300 | 0 | 1 | 4282.687 | 14972.842 | 249.61% | | 1200 | 0 | 0 | 11742.755 | 16752.546 | 42.66% | | 1200 | 0 | 1 | 5039.338 | 14872.951 | 195.14% | | 4000 | 0 | 0 | 15467.786 | 16737.09 | 8.21% | | 4000 | 0 | 1 | 5009.905 | 14890.542 | 197.22% | | 8000 | 0 | 0 | 16489.664 | 16553.273 | 0.39% | | 8000 | 0 | 1 | 5823.786 | 14858.646 | 155.14% | * speed: MB/s * length: byte [1] https://github.com/heiher/mem-bench Signed-off-by: WANG Rui <wangrui@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
* LoongArch: Mark some assembler symbols as non-kprobe-ableTiezhu Yang2023-02-253-0/+10
| | | | | | | | | | | | | | Some assembler symbols are not kprobe safe, such as handle_syscall (used as syscall exception handler), *memset*/*memcpy*/*memmove* (may cause recursive exceptions), they can not be instrumented, just blacklist them for kprobing. Here is a related problem and discussion: Link: https://lore.kernel.org/lkml/20230114143859.7ccc45c1c5d9ce302113ab0a@kernel.org/ Tested-by: Jeff Xie <xiehuan09@gmail.com> Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
* LoongArch: Use alternative to optimize librariesHuacai Chen2022-12-146-11/+460
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use the alternative to optimize common libraries according whether CPU has UAL (hardware unaligned access support) feature, including memset(), memcopy(), memmove(), copy_user() and clear_user(). We have tested UnixBench on a Loongson-3A5000 quad-core machine (1.6GHz): 1, One copy, before patch: System Benchmarks Index Values BASELINE RESULT INDEX Dhrystone 2 using register variables 116700.0 9566582.0 819.8 Double-Precision Whetstone 55.0 2805.3 510.1 Execl Throughput 43.0 2120.0 493.0 File Copy 1024 bufsize 2000 maxblocks 3960.0 209833.0 529.9 File Copy 256 bufsize 500 maxblocks 1655.0 89400.0 540.2 File Copy 4096 bufsize 8000 maxblocks 5800.0 320036.0 551.8 Pipe Throughput 12440.0 340624.0 273.8 Pipe-based Context Switching 4000.0 109939.1 274.8 Process Creation 126.0 4728.7 375.3 Shell Scripts (1 concurrent) 42.4 2223.1 524.3 Shell Scripts (8 concurrent) 6.0 883.1 1471.9 System Call Overhead 15000.0 518639.1 345.8 ======== System Benchmarks Index Score 500.2 2, One copy, after patch: System Benchmarks Index Values BASELINE RESULT INDEX Dhrystone 2 using register variables 116700.0 9567674.7 819.9 Double-Precision Whetstone 55.0 2805.5 510.1 Execl Throughput 43.0 2392.7 556.4 File Copy 1024 bufsize 2000 maxblocks 3960.0 417804.0 1055.1 File Copy 256 bufsize 500 maxblocks 1655.0 112909.5 682.2 File Copy 4096 bufsize 8000 maxblocks 5800.0 1255207.4 2164.2 Pipe Throughput 12440.0 555712.0 446.7 Pipe-based Context Switching 4000.0 99964.5 249.9 Process Creation 126.0 5192.5 412.1 Shell Scripts (1 concurrent) 42.4 2302.4 543.0 Shell Scripts (8 concurrent) 6.0 919.6 1532.6 System Call Overhead 15000.0 511159.3 340.8 ======== System Benchmarks Index Score 640.1 3, Four copies, before patch: System Benchmarks Index Values BASELINE RESULT INDEX Dhrystone 2 using register variables 116700.0 38268610.5 3279.2 Double-Precision Whetstone 55.0 11222.2 2040.4 Execl Throughput 43.0 7892.0 1835.3 File Copy 1024 bufsize 2000 maxblocks 3960.0 235149.6 593.8 File Copy 256 bufsize 500 maxblocks 1655.0 74959.6 452.9 File Copy 4096 bufsize 8000 maxblocks 5800.0 545048.5 939.7 Pipe Throughput 12440.0 1337359.0 1075.0 Pipe-based Context Switching 4000.0 473663.9 1184.2 Process Creation 126.0 17491.2 1388.2 Shell Scripts (1 concurrent) 42.4 6865.7 1619.3 Shell Scripts (8 concurrent) 6.0 1015.9 1693.1 System Call Overhead 15000.0 1899535.2 1266.4 ======== System Benchmarks Index Score 1278.3 4, Four copies, after patch: System Benchmarks Index Values BASELINE RESULT INDEX Dhrystone 2 using register variables 116700.0 38272815.5 3279.6 Double-Precision Whetstone 55.0 11222.8 2040.5 Execl Throughput 43.0 8839.2 2055.6 File Copy 1024 bufsize 2000 maxblocks 3960.0 313912.9 792.7 File Copy 256 bufsize 500 maxblocks 1655.0 80976.1 489.3 File Copy 4096 bufsize 8000 maxblocks 5800.0 1176594.3 2028.6 Pipe Throughput 12440.0 2100941.9 1688.9 Pipe-based Context Switching 4000.0 476696.4 1191.7 Process Creation 126.0 18394.7 1459.9 Shell Scripts (1 concurrent) 42.4 7172.2 1691.6 Shell Scripts (8 concurrent) 6.0 1058.3 1763.9 System Call Overhead 15000.0 1874714.7 1249.8 ======== System Benchmarks Index Score 1488.8 Signed-off-by: Jun Yi <yijun@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
* LoongArch: Add unaligned access supportHuacai Chen2022-12-142-1/+85
| | | | | | | | | | Loongson-2 series (Loongson-2K500, Loongson-2K1000) don't support unaligned access in hardware, while Loongson-3 series (Loongson-3A5000, Loongson-3C5000) are configurable whether support unaligned access in hardware. This patch add unaligned access emulation for those LoongArch processors without hardware support. Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
* LoongArch: Remove the .fixup section usageYouling Tang2022-12-142-19/+11
| | | | | | | | Use the `.L_xxx` label to improve fixup code and then remove the .fixup section usage. Signed-off-by: Youling Tang <tangyouling@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
* LoongArch: Consolidate __ex_table constructionYouling Tang2022-12-142-6/+4
| | | | | | | | | | Consolidate all the __ex_table constuction code with a _ASM_EXTABLE or _asm_extable helper. There should be no functional change as a result of this patch. Signed-off-by: Youling Tang <tangyouling@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
* LoongArch: Improve dump_tlb() output messagesHuacai Chen2022-09-031-13/+13
| | | | | | | 1, Use nr/nx to replace ri/xi; 2, Add 0x prefix for hexadecimal data. Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
* LoongArch: Remove useless header compiler.hJun Yi2022-07-291-1/+0
| | | | | | | | The content of LoongArch's compiler.h is trivial, with some unused anywhere, so inline the definitions and remove the header. Signed-off-by: Jun Yi <yijun@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
* LoongArch: Simplify "BGT foo, zero" with BGTZWANG Xuerui2022-07-292-2/+2
| | | | | | | | | Support for the syntactic sugar is present in upstream binutils port from the beginning. Use it for shorter lines and better consistency. Generated code should be identical. Signed-off-by: WANG Xuerui <git@xen0n.name> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
* LoongArch: Add some library functionsHuacai Chen2022-06-034-0/+244
| | | | | | | | | | Add some library functions for LoongArch, including: delay, memset, memcpy, memmove, copy_user, strncpy_user, strnlen_user and tlb dump functions. Reviewed-by: WANG Xuerui <git@xen0n.name> Reviewed-by: Jiaxun Yang <jiaxun.yang@flygoat.com> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
* LoongArch: Add build infrastructureHuacai Chen2022-06-031-0/+6
Add Kbuild, Makefile, Kconfig and link script for LoongArch build infrastructure. Reviewed-by: Guo Ren <guoren@kernel.org> Reviewed-by: WANG Xuerui <git@xen0n.name> Reviewed-by: Jiaxun Yang <jiaxun.yang@flygoat.com> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>