Message ID: 20230518131013.3366406-1-guoren@kernel.org
Series: riscv: s64ilp32: Running 32-bit Linux kernel on 64-bit supervisor mode
Message
Guo Ren
May 18, 2023, 1:09 p.m. UTC
From: Guo Ren <guoren@linux.alibaba.com>
This patch series adds s64ilp32 support to riscv. The term s64ilp32
means smode-xlen=64 and -mabi=ilp32 (ints, longs, and pointers are all
32-bit), i.e., running a 32-bit Linux kernel in pure 64-bit supervisor
mode. Several 64ilp32 ABIs already exist, such as mips-n32 [1],
arm-aarch64ilp32 [2], and x86-x32 [3], but they all target userspace.
Thus, this should be the first time a 32-bit Linux kernel runs with
the 64ilp32 ABI in supervisor mode (if not, correct me).
Why 32-bit Linux?
=================
The motivation for using a 32-bit Linux kernel is to reduce the memory
footprint and fit the small DDR & cache capacities of low-end parts
(e.g., 64/128MB SiP SoCs).
Here are the 32-bit v.s. 64-bit Linux kernel data type comparison
summary:
                      32-bit     64-bit
sizeof(page):         32 bytes   64 bytes
sizeof(list_head):     8 bytes   16 bytes
sizeof(hlist_head):    8 bytes   16 bytes
sizeof(vm_area):      68 bytes  136 bytes
...
The size of ilp32's long & pointer is just half of lp64's (the rv64
default ABI, where longs and pointers are 64-bit). This significant
difference in data types leads to different memory & cache footprints.
Here is a comparison measurement between s32ilp32, s64ilp32, and
s64lp64 in the same 128MB qemu system environment:
Rootfs:
u32ilp32 - Using the same 32-bit userspace rootfs.ext2 (UXL=32) binary
from buildroot 2023.02-rc3, qemu_riscv32_virt_defconfig
Linux:
s32ilp32 - Linux version 6.3.0-rc1 (124MB)
rv32_defconfig: $(Q)$(MAKE) -f $(srctree)/Makefile
defconfig 32-bit.config
s64lp64 - Linux version 6.3.0-rc1 (126MB)
defconfig: $(Q)$(MAKE) -f $(srctree)/Makefile defconfig
s64ilp32 - Linux version 6.3.0-rc1 (126MB)
rv64ilp32_defconfig: $(Q)$(MAKE) -f $(srctree)/Makefile
defconfig 64ilp32.config
Opensbi:
m64lp64 - (2MB) OpenSBI v1.2-80-g4b28afc98bbe
m32ilp32 - (4MB) OpenSBI v1.2-80-g4b28afc98bbe
+----------------------------------------+--------
| u32ilp32 |
| UXL=32 | Rootfs
+----------------------------------------+--------
| +----------+ +---------+ | +---------+ |
| | s64ilp32 | | s64lp64 | | | s32ilp32| |
| | SXL=64 | | SXL=64 | | | SXL=32 | | Linux
| +----------+ +---------+ | +---------+ |
+----------------------------------------+--------
| +----------------------+ | +---------+ |
| | m64lp64 | | | m32ilp32| |
| | MXL=64 | | | MXL=32 | | Opensbi
| +----------------------+ | +---------+ |
+----------------------------------------+--------
| +----------------------+ | +---------+ |
| | qemu-rv64 | | |qemu-rv32| | HW
| +----------------------+ | +---------+ |
+----------------------------------------+--------
Mem-usage:
(s32ilp32) # free
total used free shared buff/cache available
Mem: 100040 8380 88244 44 3416 88080
(s64lp64) # free
total used free shared buff/cache available
Mem: 91568 11848 75796 44 3924 75952
(s64ilp32) # free
total used free shared buff/cache available
Mem: 101952 8528 90004 44 3420 89816
^^^^^
It's a rough measurement based on the current default configs without
any modification; the 32-bit kernels (s32ilp32, s64ilp32) save more
than 16% of memory compared to 64-bit (s64lp64). But s32ilp32 &
s64ilp32 have similar memory footprints (about 0.33% difference),
meaning s64ilp32 has a good chance of replacing s32ilp32 on 64-bit
machines.
Why s64ilp32?
=============
The current RISC-V has the profiles of RVA20S64, RVA22S64, and RVA23S64
(ongoing) [4], but no RVA**S32 profile exists, nor is one planned. That
means when a vendor wants to produce a 32-bit s-mode RISC-V application
processor, there is no profile to follow. As a result, many cheap riscv
chips have come out that follow the RVA2xS64 profiles instead, such as
Allwinner D1/D1s/F133 [5], SOPHGO CV1800B [6], Canaan Kendryte K230 [7],
and Bouffalo Lab BL808 [8], which are typical cortex-a7/a35/a53 product
scenarios. The D1, CV1800B & BL808 don't support UXL=32 (32-bit U-mode),
so they would need a new u64ilp32 userspace ABI, which currently has no
software ecosystem. Thus, the first landing of s64ilp32 would be on the
Canaan Kendryte K230, whose c908 core has rv64gcv and a compat user mode
(sstatus.uxl=32/64) that can run the existing rv32 userspace software
ecosystem.
Another reason for inventing s64ilp32 is the performance benefit and
the simpler 64-bit CPU hardware design (v.s. s32ilp32).
Why s64ilp32 has better performance?
====================================
Generally speaking, to run 32-bit Linux on a 64-bit processor, we either
build 32-bit s-mode hardware (such as Linux-arm32 on cortex-a53) or use
the old 32ilp32 ABI on the 64-bit machine (such as mips
SYS_SUPPORTS_32BIT_KERNEL). Neither can reuse the performance-related
features and instructions of the 64-bit hardware, such as the 64-bit
ALU, AMO, and LD/SD, which causes significant performance gaps in many
Linux features:
- memcpy/memset/strcmp (s64ilp32 needs half the instruction count and
  has double the load/store bandwidth of s32ilp32.)
- ebpf JIT: ebpf is a 64-bit virtual ISA, which maps poorly to
  s32ilp32.
- Atomic64 (s64ilp32 maps to exactly the same native instructions as
  s64lp64, while s32ilp32 can only use generic_atomic64, a limited
  software tradeoff.)
- 64-bit native arithmetic instructions for the "long long" type.
- cmpxchg_double support for slub (making this the 2nd 32-bit Linux
  port with the feature; the 1st is i386.)
- ...
Compared with the userspace ecosystem, the 32-bit Linux kernel benefits
even more from 64ilp32, because the kernel can't utilize the
floating-point/vector features of the ISA.
Let's look at performance from another perspective (s64ilp32 v.s.
s64lp64). As the first chapter said, an ilp32 pointer is half the size
of an lp64 one, which shrinks the critical data structures (e.g., page,
list, ...). That means a cache of a given capacity can hold twice the
data under ilp32 as under lp64, a natural advantage of 32-bit.
Why s64ilp32 simplifies CPU design?
===================================
Yes, there are many historical examples of running 32-bit Linux on
64-bit hardware, such as arm cortex-a35/a53/a55, which implement 32-bit
EL1/EL2/EL3 hardware modes to support 32-bit Linux. We could follow
Arm's style, but riscv can choose a better way. Compared to UXL=32
alone, MXL=SXL=32 brings many CSR-related hardware functionalities that
take a lot of effort to mix into a 64-bit design. The s64ilp32 kernel
works in MXL=SXL=64 mode, so CPU vendors needn't implement 32-bit
machine and supervisor modes at all.
How does s64ilp32 work?
=======================
From the hardware's view, s64ilp32 is the same as s64lp64 compat mode,
i.e., MXL=SXL=64 + UXL=32. Because s64ilp32 uses CONFIG_32BIT of Linux,
it only supports the u32ilp32 userspace ABI (the current standard rv32
software ecosystem) and can't work with the u64lp64 ABI (I don't want
that complex and useless combination). It may work with u64ilp32 in the
future; for now, s64ilp32 depends on the UXL=32 feature of the hardware.
The 64ilp32 gcc still uses sign-extending lw & auipc to generate
address values, because inserting zero-extend instructions to mask the
upper 32 bits would cause significant code-size and performance
problems. Thus, we invented an OS-level approach to solve the problem:
- When satp=bare and the start physical address is < 2GB, there is no
  sign-extended address problem.
- When satp=bare and the start physical address is > 2GB, we need a
  Zjpm-like hardware extension to mask the high 32 bits.
  (Fortunately, all existing SoCs (D1/D1s/F133, CV1800B, K230, BL808)
  start below 2GB.)
- When satp=sv39, we invent a double mapping to make the sign-extended
  virtual address resolve the same as the zero-extended one.
+--------+ +---------+ +--------+
| | +--| 511:PUD1| | |
| | | +---------+ | |
| | | | 510:PUD0|--+ | |
| | | +---------+ | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | INVALID | | | |
| | | | | | | |
| .... | | | | | | .... |
| | | | | | | |
| | | +---------+ | | |
| | +--| 3:PUD1 | | | |
| | | +---------+ | | |
| | | | 2:PUD0 |--+ | |
| | | +---------+ | | |
| | | |1:USR_PUD| | | |
| | | +---------+ | | |
| | | |0:USR_PUD| | | |
+--------+<--+ +---------+ +-->+--------+
PUD1 ^ PGD PUD0
1GB | 4GB 1GB
|
+----------+
| Sv39 PGDP|
+----------+
SATP
The size of xlen was always equal to the pointer/long size before
s64ilp32 emerged. So we introduce a new data type, xlen_t, to handle
CSR-related and callee-save/restore operations.
Some kernel features use 32BIT/64BIT to determine the exact ISA; e.g.,
the ebpf JIT maps to the rv32 ISA when CONFIG_32BIT=y. But s64ilp32
needs the ebpf JIT to target the rv64 ISA even with CONFIG_32BIT=y, so
another config option is needed to distinguish the two cases.
For more details, please review the patch series.
How to run s64ilp32?
====================
GNU toolchain
-------------
git clone https://github.com/Liaoshihua/riscv-gnu-toolchain.git
cd riscv-gnu-toolchain
./configure --prefix="$PWD/opt-rv64-ilp32/" --with-arch=rv64imac --with-abi=ilp32
make linux
export PATH=$PATH:$PWD/opt-rv64-ilp32/bin/
Opensbi
-------
git clone https://github.com/riscv-software-src/opensbi.git
CROSS_COMPILE=riscv64-unknown-linux-gnu- make PLATFORM=generic
Linux kernel
------------
git clone https://github.com/guoren83/linux.git -b s64ilp32
cd linux
make ARCH=riscv CROSS_COMPILE=riscv64-unknown-linux-gnu- rv64ilp32_defconfig
make ARCH=riscv CROSS_COMPILE=riscv64-unknown-linux-gnu- all
Rootfs
------
git clone git://git.busybox.net/buildroot
cd buildroot
make qemu_riscv32_virt_defconfig
make
Qemu
----
git clone https://github.com/plctlab/plct-qemu.git -b plct-s64ilp32-dev
cd plct-qemu
mkdir build
cd build
../configure --target-list="riscv64-softmmu riscv32-softmmu"
make
Run
---
./qemu-system-riscv64 -cpu rv64 -M virt -m 128m -nographic -bios fw_dynamic.bin -kernel Image -drive file=rootfs.ext2,format=raw,id=hd0 -device virtio-blk-device,drive=hd0 -append "rootwait root=/dev/vda ro console=ttyS0 earlycon=sbi" -netdev user,id=net0 -device virtio-net-device,netdev=net0
OpenSBI v1.2-119-gdc1c7db05e07
____ _____ ____ _____
/ __ \ / ____| _ \_ _|
| | | |_ __ ___ _ __ | (___ | |_) || |
| | | | '_ \ / _ \ '_ \ \___ \| _ < | |
| |__| | |_) | __/ | | |____) | |_) || |_
\____/| .__/ \___|_| |_|_____/|___/_____|
| |
|_|
Platform Name : riscv-virtio,qemu
Platform Features : medeleg
Platform HART Count : 1
Platform IPI Device : aclint-mswi
Platform Timer Device : aclint-mtimer @ 10000000Hz
Platform Console Device : uart8250
Platform HSM Device : ---
Platform PMU Device : ---
Platform Reboot Device : sifive_test
Platform Shutdown Device : sifive_test
Platform Suspend Device : ---
Platform CPPC Device : ---
Firmware Base : 0x60000000
Firmware Size : 360 KB
Firmware RW Offset : 0x40000
Runtime SBI Version : 1.0
Domain0 Name : root
Domain0 Boot HART : 0
Domain0 HARTs : 0*
Domain0 Region00 : 0x0000000002000000-0x000000000200ffff M: (I,R,W) S/U: ()
Domain0 Region01 : 0x0000000060040000-0x000000006005ffff M: (R,W) S/U: ()
Domain0 Region02 : 0x0000000060000000-0x000000006003ffff M: (R,X) S/U: ()
Domain0 Region03 : 0x0000000000000000-0xffffffffffffffff M: (R,W,X) S/U: (R,W,X)
Domain0 Next Address : 0x0000000060200000
Domain0 Next Arg1 : 0x0000000067e00000
Domain0 Next Mode : S-mode
Domain0 SysReset : yes
Domain0 SysSuspend : yes
Boot HART ID : 0
Boot HART Domain : root
Boot HART Priv Version : v1.12
Boot HART Base ISA : rv64imafdch
Boot HART ISA Extensions : time,sstc
Boot HART PMP Count : 16
Boot HART PMP Granularity : 4
Boot HART PMP Address Bits: 54
Boot HART MHPM Count : 16
Boot HART MIDELEG : 0x0000000000001666
Boot HART MEDELEG : 0x0000000000f0b509
[ 0.000000] Linux version 6.3.0-rc1-00086-gc8d2fedb997a (guoren@fedora) (riscv64-unknown-linux-gnu-gcc (g5e578a16201f) 13.0.1 20230206 (experimental), GNU ld (GNU Binutils) 2.40.50.20230205) #1 SMP Sun May 14 10:46:42 EDT 2023
[ 0.000000] random: crng init done
[ 0.000000] OF: fdt: Ignoring memory range 0x60000000 - 0x60200000
[ 0.000000] Machine model: riscv-virtio,qemu
[ 0.000000] efi: UEFI not found.
[ 0.000000] OF: reserved mem: 0x60000000..0x6003ffff (256 KiB) map non-reusable mmode_resv1@60000000
[ 0.000000] OF: reserved mem: 0x60040000..0x6005ffff (128 KiB) map non-reusable mmode_resv0@60040000
[ 0.000000] Zone ranges:
[ 0.000000] Normal [mem 0x0000000060200000-0x0000000067ffffff]
[ 0.000000] Movable zone start for each node
[ 0.000000] Early memory node ranges
[ 0.000000] node 0: [mem 0x0000000060200000-0x0000000067ffffff]
[ 0.000000] Initmem setup node 0 [mem 0x0000000060200000-0x0000000067ffffff]
[ 0.000000] On node 0, zone Normal: 512 pages in unavailable ranges
[ 0.000000] SBI specification v1.0 detected
[ 0.000000] SBI implementation ID=0x1 Version=0x10002
[ 0.000000] SBI TIME extension detected
[ 0.000000] SBI IPI extension detected
[ 0.000000] SBI RFENCE extension detected
[ 0.000000] SBI SRST extension detected
[ 0.000000] SBI HSM extension detected
[ 0.000000] riscv: base ISA extensions acdfhim
[ 0.000000] riscv: ELF capabilities acdfim
[ 0.000000] percpu: Embedded 13 pages/cpu s24352 r8192 d20704 u53248
[ 0.000000] Built 1 zonelists, mobility grouping on. Total pages: 31941
[ 0.000000] Kernel command line: rootwait root=/dev/vda ro console=ttyS0 earlycon=sbi norandmaps
[ 0.000000] Dentry cache hash table entries: 16384 (order: 4, 65536 bytes, linear)
[ 0.000000] Inode-cache hash table entries: 8192 (order: 3, 32768 bytes, linear)
[ 0.000000] mem auto-init: stack:all(zero), heap alloc:off, heap free:off
[ 0.000000] Virtual kernel memory layout:
[ 0.000000] fixmap : 0x9ce00000 - 0x9d000000 (2048 kB)
[ 0.000000] pci io : 0x9d000000 - 0x9e000000 ( 16 MB)
[ 0.000000] vmemmap : 0x9e000000 - 0xa0000000 ( 32 MB)
[ 0.000000] vmalloc : 0xa0000000 - 0xc0000000 ( 512 MB)
[ 0.000000] lowmem : 0xc0000000 - 0xc7e00000 ( 126 MB)
[ 0.000000] Memory: 97748K/129024K available (8699K kernel code, 8867K rwdata, 4096K rodata, 4204K init, 361K bss, 31276K reserved, 0K cma-reserved)
...
Starting network: udhcpc: started, v1.36.0
udhcpc: broadcasting discover
udhcpc: broadcasting select for 10.0.2.15, server 10.0.2.2
udhcpc: lease of 10.0.2.15 obtained from 10.0.2.2, lease time 86400
deleting routers
adding dns 10.0.2.3
OK
Welcome to Buildroot
buildroot login: root
# cat /proc/cpuinfo
processor : 0
hart : 0
isa : rv64imafdch_zihintpause_zbb_sstc
mmu : sv39
mvendorid : 0x0
marchid : 0x70232
mimpid : 0x70232
# uname -a
Linux buildroot 6.3.0-rc1-00086-gc8d2fedb997a #1 SMP Sun May 14 10:46:42 EDT 2023 riscv32 GNU/Linux
# ls /lib/
ld-linux-riscv32-ilp32d.so.1 libgcc_s.so.1
libanl.so.1 libm.so.6
libatomic.so libnss_dns.so.2
libatomic.so.1 libnss_files.so.2
libatomic.so.1.2.0 libpthread.so.0
libc.so.6 libresolv.so.2
libcrypt.so.1 librt.so.1
libdl.so.2 libutil.so.1
libgcc_s.so modules
# cat /proc/99/maps
0000000055554000-0000000055634000 r-xp 00000000 00000000fe:00 17 /bin/busybox
0000000055634000-0000000055636000 r--p 00000000df000 00000000fe:00 17 /bin/busybox
0000000055636000-0000000055637000 rw-p 00000000e1000 00000000fe:00 17 /bin/busybox
0000000055637000-0000000055659000 rw-p 00000000 00:00 0 [heap]
0000000077e8d000-0000000077fbe000 r-xp 00000000 00000000fe:00 137 /lib/libc.so.6
0000000077fbe000-0000000077fbf000 ---p 00000000131000 00000000fe:00 137 /lib/libc.so.6
0000000077fbf000-0000000077fc1000 r--p 00000000131000 00000000fe:00 137 /lib/libc.so.6
0000000077fc1000-0000000077fc2000 rw-p 00000000133000 00000000fe:00 137 /lib/libc.so.6
0000000077fc2000-0000000077fcc000 rw-p 00000000 00:00 0
0000000077fcc000-0000000077fd4000 r-xp 00000000 00000000fe:00 146 /lib/libresolv.so.2
0000000077fd4000-0000000077fd5000 ---p 000000008000 00000000fe:00 146 /lib/libresolv.so.2
0000000077fd5000-0000000077fd6000 r--p 000000008000 00000000fe:00 146 /lib/libresolv.so.2
0000000077fd6000-0000000077fd7000 rw-p 000000009000 00000000fe:00 146 /lib/libresolv.so.2
0000000077fd7000-0000000077fd9000 rw-p 00000000 00:00 0
0000000077fd9000-0000000077fdb000 r--p 00000000 00:00 0 [vvar]
0000000077fdb000-0000000077fdd000 r-xp 00000000 00:00 0 [vdso]
0000000077fdd000-0000000077ffc000 r-xp 00000000 00000000fe:00 132 /lib/ld-linux-riscv32-ilp32d.so.1
0000000077ffd000-0000000077ffe000 r--p 000000001f000 00000000fe:00 132 /lib/ld-linux-riscv32-ilp32d.so.1
0000000077ffe000-0000000077fff000 rw-p 0000000020000 00000000fe:00 132 /lib/ld-linux-riscv32-ilp32d.so.1
000000007ffde000-000000007ffff000 rw-p 00000000 00:00 0 [stack]
Other resources
===============
OpenEuler riscv32 rootfs
------------------------
The OpenEuler riscv32 rootfs can be downloaded here:
https://repo.tarsier-infra.com/openEuler-RISC-V/obs/archive/rv32/openeuler-image-qemu-riscv32-20221111070036.rootfs.ext4
(Made by Junqiang Wang)
Debian riscv32 rootfs
---------------------
The Debian riscv32 rootfs can be downloaded here:
https://github.com/yuzibo/riscv32
(Made by Bo YU and Han Gao)
Fedora riscv32 rootfs
---------------------
https://fedoraproject.org/wiki/Architectures/RISC-V/RV32
(Made by Wei Fu)
LLVM 64ilp32
------------
git clone https://github.com/luxufan/llvm-project.git -b rv64-ilp32
cd llvm-project
mkdir build && cd build
cmake ../llvm -G Ninja -DCMAKE_BUILD_TYPE=Release -DLLVM_TARGETS_TO_BUILD="X86;RISCV" -DLLVM_ENABLE_PROJECTS="clang;lld"
ninja all
(LLVM development status: CC=clang can compile the kernel with LLVM=1,
but the result has not yet booted successfully.)
Patch organization
==================
This series depends on 64ilp32 toolchain patches that are not upstream
yet.
PATCH [0-1] unify vdso32 & compat_vdso
PATCH [2] adds time-related vDSO common flow for vdso32
PATCH [3] adds s64ilp32 support of clocksource driver
PATCH [5] adds s64ilp32 support of irqchip driver
PATCH [4,6-12] add basic data types and compiling framework
PATCH [13] adds MMU_SV39 support
PATCH [14] adds native atomic64
PATCH [15] adds TImode
PATCH [16] adds cmpxchg_double
PATCH [17-19] cleanup kconfig & add defconfig
PATCH [20-21] fix temporary compiler problems
Open issues
===========
Callee-saved register width
---------------------------
For 64-bit ISAs (including 64lp64 and 64ilp32), the callee can't know
the width actually used in a register, so it saves the maximum register
width of the ISA, i.e., xlen. The same rule appears in x86-x32,
mips-n32, and aarch64ilp32, inherited from 64lp64. See PATCH [20].
This has two downsides:
- It differs from 32ilp32's stack frame, and s64ilp32 reuses the
  32ilp32 software stack, so many compatibility problems could appear
  while porting 64ilp32 software.
- It also increases stack usage.
<setup_vm>:
auipc a3,0xff3fb
add a3,a3,1234 # c0000000
li a5,-1
lui a4,0xc0000
addw sp,sp,-96
srl a5,a5,0x20
subw a4,a4,a3
auipc a2,0x111a
add a2,a2,1212 # c1d1f000
sd s0,80(sp)----+
sd s1,72(sp) |
sd s2,64(sp) |
sd s7,24(sp) |
sd s8,16(sp) |
sd s9,8(sp) |-> All <= 32b widths, but occupy 64b
sd ra,88(sp) | stack space.
sd s3,56(sp) | Affect memory footprint & cache
sd s4,48(sp) | performance.
sd s5,40(sp) |
sd s6,32(sp) |
sd s10,0(sp)----+
sll a1,a4,0x20
subw a2,a2,a3
and a4,a4,a5
So here is a proposal for the riscv 64ilp32 ABI:
- Let the compiler prevent the callee from keeping ">32b variables" in
  callee-saved registers. (Q: We need to measure the impact on 64-bit
  variables that live across function calls.)
EF_RISCV_X32
------------
We add an e_flag (EF_RISCV_X32) to distinguish the 32-bit ELF, which
occupies BIT[6] of the e_flags layout.
ELF Header:
Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
Class: ELF32
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: REL (Relocatable file)
Machine: RISC-V
Version: 0x1
Entry point address: 0x0
Start of program headers: 0 (bytes into file)
Start of section headers: 24620 (bytes into file)
Flags: 0x21, RVC, X32, soft-float ABI
^^^
64-bit Optimization problem
---------------------------
There is an existing problem in 64ilp32 gcc: it combines two pointers
into one register. Liao is working on the problem; until he finishes,
we can fortunately prevent it with a simple noinline attribute.
struct path {
struct vfsmount *mnt;
struct dentry *dentry;
} __randomize_layout;
struct nameidata {
struct path path;
...
struct path root;
...
} __randomize_layout;
struct nameidata *nd
...
nd->path = nd->root;
6c88 ld a0,24(s1)
^^ // a0 contains two pointers
e088 sd a0,0(s1)
mntget(path->mnt);
// Need "lw a0,0(s1)" or "a0 << 32; a0 >> 32"
2a6150ef jal c01ce946 <mntget> // bug!
Acknowledgements
================
The s64ilp32 work needs the cooperation of many other projects. Thanks
to everyone involved:
- GNU: LiaoShihua <shihua@iscas.ac.cn>,
Jiawei Chen <jiawei@iscas.ac.cn>
- Qemu: Weiwei Li <liweiwei@iscas.ac.cn>
- LLVM: luxufan <luxufan@iscas.ac.cn>,
Chunyu Liao<chunyu@iscas.ac.cn>
- OpenEuler rv32: Junqiang Wang <wangjunqiang@iscas.ac.cn>
- Debian rv32: Bo YU <tsu.yubo@gmail.com>
Han Gao <gaohan@iscas.ac.cn>
- Fedora rv32: Wei Fu <wefu@redhat.com>
References
==========
[1] https://techpubs.jurassic.nl/manuals/0630/developer/Mpro_n32_ABI/sgi_html/index.html
[2] https://wiki.debian.org/Arm64ilp32Port
[3] https://lwn.net/Articles/456731/
[4] https://github.com/riscv/riscv-profiles/releases
[5] https://www.cnx-software.com/2021/10/25/allwinner-d1s-f133-risc-v-processor-64mb-ddr2/
[6] https://milkv.io/duo/
[7] https://twitter.com/tphuang/status/1631308330256801793
[8] https://www.cnx-software.com/2022/12/02/pine64-ox64-sbc-bl808-risc-v-multi-protocol-wisoc-64mb-ram/
Guo Ren (22):
riscv: vdso: Unify vdso32 & compat_vdso into vdso/Makefile
riscv: vdso: Remove compat_vdso/
riscv: vdso: Add time-related vDSO common flow for vdso32
clocksource: riscv: s64ilp32: Use __riscv_xlen instead of CONFIG_32BIT
riscv: s64ilp32: Introduce xlen_t
irqchip: riscv: s64ilp32: Use __riscv_xlen instead of CONFIG_32BIT
riscv: s64ilp32: Add sbi support
riscv: s64ilp32: Add asid support
riscv: s64ilp32: Introduce PTR_L and PTR_S
riscv: s64ilp32: Enable user space runtime environment
riscv: s64ilp32: Add ebpf jit support
riscv: s64ilp32: Add ELF32 support
riscv: s64ilp32: Add ARCH RV64 ILP32 compiling framework
riscv: s64ilp32: Add MMU_SV39 mode support for 32BIT
riscv: s64ilp32: Enable native atomic64
riscv: s64ilp32: Add TImode (128 int) support
riscv: s64ilp32: Implement cmpxchg_double
riscv: s64ilp32: Disable KVM
riscv: Cleanup rv32_defconfig
riscv: s64ilp32: Add rv64ilp32_defconfig
riscv: s64ilp32: Correct the rv64ilp32 stackframe layout
riscv: s64ilp32: Temporary workaround solution to gcc problem
arch/riscv/Kconfig | 36 +++-
arch/riscv/Makefile | 24 ++-
arch/riscv/configs/32-bit.config | 2 -
arch/riscv/configs/64ilp32.config | 2 +
arch/riscv/include/asm/asm.h | 5 +
arch/riscv/include/asm/atomic.h | 6 +
arch/riscv/include/asm/cmpxchg.h | 53 ++++++
arch/riscv/include/asm/cpu_ops_sbi.h | 4 +-
arch/riscv/include/asm/csr.h | 58 +++---
arch/riscv/include/asm/extable.h | 2 +-
arch/riscv/include/asm/page.h | 24 ++-
arch/riscv/include/asm/pgtable-64.h | 42 ++---
arch/riscv/include/asm/pgtable.h | 26 ++-
arch/riscv/include/asm/processor.h | 8 +-
arch/riscv/include/asm/ptrace.h | 96 +++++-----
arch/riscv/include/asm/sbi.h | 24 +--
arch/riscv/include/asm/stacktrace.h | 6 +
arch/riscv/include/asm/timex.h | 10 +-
arch/riscv/include/asm/vdso.h | 34 +++-
arch/riscv/include/asm/vdso/gettimeofday.h | 84 +++++++++
arch/riscv/include/uapi/asm/elf.h | 2 +-
arch/riscv/include/uapi/asm/unistd.h | 1 +
arch/riscv/kernel/Makefile | 3 +-
arch/riscv/kernel/compat_signal.c | 2 +-
arch/riscv/kernel/compat_vdso/.gitignore | 2 -
arch/riscv/kernel/compat_vdso/compat_vdso.S | 8 -
.../kernel/compat_vdso/compat_vdso.lds.S | 3 -
arch/riscv/kernel/compat_vdso/flush_icache.S | 3 -
arch/riscv/kernel/compat_vdso/getcpu.S | 3 -
arch/riscv/kernel/compat_vdso/note.S | 3 -
arch/riscv/kernel/compat_vdso/rt_sigreturn.S | 3 -
arch/riscv/kernel/cpu.c | 4 +-
arch/riscv/kernel/cpu_ops_sbi.c | 4 +-
arch/riscv/kernel/cpufeature.c | 4 +-
arch/riscv/kernel/entry.S | 24 +--
arch/riscv/kernel/head.S | 8 +-
arch/riscv/kernel/process.c | 8 +-
arch/riscv/kernel/sbi.c | 24 +--
arch/riscv/kernel/signal.c | 6 +-
arch/riscv/kernel/traps.c | 4 +-
arch/riscv/kernel/vdso.c | 4 +-
arch/riscv/kernel/vdso/Makefile | 176 ++++++++++++------
..._vdso_offsets.sh => gen_vdso32_offsets.sh} | 2 +-
.../gen_vdso64_offsets.sh} | 2 +-
arch/riscv/kernel/vdso/vgettimeofday.c | 39 +++-
arch/riscv/kernel/vdso32.S | 8 +
arch/riscv/kernel/{vdso/vdso.S => vdso64.S} | 8 +-
arch/riscv/kvm/Kconfig | 1 +
arch/riscv/lib/Makefile | 1 +
arch/riscv/lib/memset.S | 4 +-
arch/riscv/mm/context.c | 16 +-
arch/riscv/mm/fault.c | 13 +-
arch/riscv/mm/init.c | 29 ++-
arch/riscv/net/Makefile | 6 +-
arch/riscv/net/bpf_jit_comp64.c | 10 +-
drivers/clocksource/timer-riscv.c | 2 +-
drivers/irqchip/irq-riscv-intc.c | 4 +-
fs/namei.c | 2 +-
58 files changed, 675 insertions(+), 317 deletions(-)
create mode 100644 arch/riscv/configs/64ilp32.config
delete mode 100644 arch/riscv/kernel/compat_vdso/.gitignore
delete mode 100644 arch/riscv/kernel/compat_vdso/compat_vdso.S
delete mode 100644 arch/riscv/kernel/compat_vdso/compat_vdso.lds.S
delete mode 100644 arch/riscv/kernel/compat_vdso/flush_icache.S
delete mode 100644 arch/riscv/kernel/compat_vdso/getcpu.S
delete mode 100644 arch/riscv/kernel/compat_vdso/note.S
delete mode 100644 arch/riscv/kernel/compat_vdso/rt_sigreturn.S
rename arch/riscv/kernel/vdso/{gen_vdso_offsets.sh => gen_vdso32_offsets.sh} (78%)
rename arch/riscv/kernel/{compat_vdso/gen_compat_vdso_offsets.sh => vdso/gen_vdso64_offsets.sh} (77%)
create mode 100644 arch/riscv/kernel/vdso32.S
rename arch/riscv/kernel/{vdso/vdso.S => vdso64.S} (73%)
Comments
On Thu, 18 May 2023 06:09:51 PDT (-0700), guoren@kernel.org wrote: > From: Guo Ren <guoren@linux.alibaba.com> > > This patch series adds s64ilp32 support to riscv. The term s64ilp32 > means smode-xlen=64 and -mabi=ilp32 (ints, longs, and pointers are all > 32-bit), i.e., running 32-bit Linux kernel on pure 64-bit supervisor > mode. There have been many 64ilp32 abis existing, such as mips-n32 [1], > arm-aarch64ilp32 [2], and x86-x32 [3], but they are all about userspace. > Thus, this should be the first time running a 32-bit Linux kernel with > the 64ilp32 ABI at supervisor mode (If not, correct me). Does anyone actually want this? At a bare minimum we'd need to add it to the psABI, which would presumably also be required on the compiler side of things. It's not even clear anyone wants rv64/ilp32 in userspace, the kernel seems like it'd be even less widely used. > Why 32-bit Linux? > ================= > The motivation for using a 32-bit Linux kernel is to reduce memory > footprint and meet the small capacity of DDR & cache requirement > (e.g., 64/128MB SIP SoC). > > Here are the 32-bit v.s. 64-bit Linux kernel data type comparison > summary: > 32-bit 64-bit > sizeof(page): 32bytes 64bytes > sizeof(list_head): 8bytes 16bytes > sizeof(hlist_head): 8bytes 16bytes > sizeof(vm_area): 68bytes 136bytes > ... > > The size of ilp32's long & pointer is just half of lp64's (rv64 default > abi - longs and pointers are all 64-bit). This significant difference > in data type causes different memory & cache footprint costs. 
> Here is the comparison measurement between s32ilp32, s64ilp32, and
> s64lp64 in the same 128MB qemu system environment:
>
> Rootfs:
>  u32ilp32 - Using the same 32-bit userspace rootfs.ext2 (UXL=32) binary
>             from buildroot 2023.02-rc3, qemu_riscv32_virt_defconfig
>
> Linux:
>  s32ilp32 - Linux version 6.3.0-rc1 (124MB)
>             rv32_defconfig: $(Q)$(MAKE) -f $(srctree)/Makefile defconfig 32-bit.config
>
>  s64lp64  - Linux version 6.3.0-rc1 (126MB)
>             defconfig: $(Q)$(MAKE) -f $(srctree)/Makefile defconfig
>
>  s64ilp32 - Linux version 6.3.0-rc1 (126MB)
>             rv64ilp32_defconfig: $(Q)$(MAKE) -f $(srctree)/Makefile defconfig 64ilp32.config
>
> Opensbi:
>  m64lp64  - (2MB) OpenSBI v1.2-80-g4b28afc98bbe
>  m32ilp32 - (4MB) OpenSBI v1.2-80-g4b28afc98bbe
>
> +----------------------------------------+--------
> | u32ilp32                               |
> | UXL=32                                 | Rootfs
> +----------------------------------------+--------
> | +----------+ +---------+  | +---------+|
> | | s64ilp32 | | s64lp64 |  | | s32ilp32||
> | | SXL=64   | | SXL=64  |  | | SXL=32  || Linux
> | +----------+ +---------+  | +---------+|
> +----------------------------------------+--------
> | +----------------------+  | +---------+|
> | | m64lp64              |  | | m32ilp32||
> | | MXL=64               |  | | MXL=32  || Opensbi
> | +----------------------+  | +---------+|
> +----------------------------------------+--------
> | +----------------------+  | +---------+|
> | | qemu-rv64            |  | |qemu-rv32|| HW
> | +----------------------+  | +---------+|
> +----------------------------------------+--------
>
> Mem-usage:
> (s32ilp32) # free
>               total    used    free  shared  buff/cache  available
> Mem:         100040    8380   88244      44        3416      88080
>
> (s64lp64) # free
>               total    used    free  shared  buff/cache  available
> Mem:          91568   11848   75796      44        3924      75952
>
> (s64ilp32) # free
>               total    used    free  shared  buff/cache  available
> Mem:         101952    8528   90004      44        3420      89816
>              ^^^^^^
>
> It's a rough measurement based on the current default config without
> any modification, and the 32-bit kernels (s32ilp32, s64ilp32) saved
> more than 16% memory compared to 64-bit (s64lp64). But s32ilp32 &
> s64ilp32 have a similar memory footprint (about 0.33% difference),
> meaning s64ilp32 has a big chance to replace s32ilp32 on 64-bit
> machines.
>
> Why s64ilp32?
> =============
> The current RISC-V has the profiles RVA20S64, RVA22S64, and RVA23S64
> (ongoing) [4], but no RVA**S32 profile exists, nor is any planned. That
> means when a vendor wants to produce a 32-bit s-mode RISC-V Application
> Processor, they have no shape to follow. Therefore, many cheap riscv
> chips have come out but follow the RVA2xS64 profiles, such as Allwinner
> D1/D1s/F133 [5], SOPHGO CV1800B [6], Canaan Kendryte k230 [7], and
> Bouffalo Lab BL808 [8], which are typically cortex a7/a35/a53 product
> scenarios. The D1 & CV1800B & BL808 don't support UXL=32 (32-bit
> U-mode), so they would need a new u64ilp32 userspace ABI, which
> currently has no software ecosystem. Thus, the first landing of
> s64ilp32 would be on the Canaan Kendryte k230, whose c908 has rv64gcv
> and compat user mode (sstatus.uxl=32/64), and so can support the
> existing rv32 userspace software ecosystem.
>
> Another reason for inventing s64ilp32 is the performance benefit and
> the simplified 64-bit CPU hardware design (vs. s32ilp32).
>
> Why does s64ilp32 have better performance?
> ==========================================
> Generally speaking, to run a 32-bit Linux kernel on a 64-bit processor
> we would have to build a 32-bit hardware s-mode (such as Linux-arm32 on
> cortex-a53), or only use the old 32ilp32 ABI on a 64-bit machine (such
> as mips SYS_SUPPORTS_32BIT_KERNEL). Neither can reuse the
> performance-related features and instructions of the 64-bit hardware,
> such as the 64-bit ALU, AMO, and LD/SD, which causes significant
> performance gaps for many Linux features:
>
>  - memcpy/memset/strcmp (s64ilp32 has half the instruction count and
>    double the load/store bandwidth of s32ilp32.)
>
>  - ebpf JIT (ebpf is a 64-bit virtual ISA, which is not suitable for
>    mapping to s32ilp32.)
>
>  - Atomic64 (s64ilp32 has the same native instruction mapping as
>    s64lp64, but s32ilp32 can only use generic_atomic64, a tradeoff &
>    limited software solution.)
>
>  - 64-bit native arithmetic instructions for the "long long" type
>
>  - cmpxchg_double support for slub (This makes it the 2nd 32-bit Linux
>    port to support the feature; the 1st is i386.)
>
>  - ...
>
> Compared with the userspace ecosystem, the 32-bit Linux kernel is even
> more eager for 64ilp32's performance improvements, because the kernel
> can't utilize the float-point/vector features of the ISA.
>
> Let's look at performance from another perspective (s64ilp32 vs.
> s64lp64). As the first chapter said, ilp32's pointer size is half of
> lp64's, which reduces the size of the critical data structs (e.g.,
> page, list, ...). That means an ilp32 cache can hold double the data
> of an lp64 one with the same capacity, which is a natural advantage of
> 32-bit.
>
> Why does s64ilp32 simplify CPU design?
> ======================================
> Yes, there are a lot of examples in history of running 32-bit Linux on
> 64-bit hardware, such as arm cortex a35/a53/a55, which implement a
> 32-bit EL1/EL2/EL3 hardware mode to support 32-bit Linux. We could
> follow Arm's style, but riscv can choose a better way. Compared to
> UXL=32, MXL=SXL=32 involves a lot of CSR-related hardware
> functionality, which takes a lot of effort to mix into 64-bit
> hardware. The s64ilp32 works in MXL=SXL=64 mode, so CPU vendors
> needn't implement 32-bit machine and supervisor modes at all.
>
> How does s64ilp32 work?
> =======================
> The s64ilp32 is the same as the s64lp64 compat mode from the hardware
> view, i.e., MXL=SXL=64 + UXL=32. Because s64ilp32 uses CONFIG_32BIT of
> Linux, it only supports the u32ilp32 userspace ABI, the current
> standard rv32 software ecosystem, and it can't work with the u64lp64
> ABI (I don't want that complex and useless stuff). But it may work
> with u64ilp32 in the future; for now, s64ilp32 depends on the UXL=32
> feature of the hardware.
>
> The 64ilp32 gcc still uses sign-extending lw & auipc to generate
> address variables, because inserting zero-extend instructions to mask
> the highest 32 bits would cause significant code size and performance
> problems. Thus, we invented an OS approach to solve the problem:
>  - When satp=bare and the start physical address is < 2GB, there is no
>    sign-extended address problem.
>  - When satp=bare and the start physical address is > 2GB, we need
>    Zjpm-like hardware extensions to mask the high 32 bits.
>    (Fortunately, all existing SoCs (D1/D1s/F133, CV1800B, k230, BL808)
>    have start physical addresses < 2GB.)
>  - When satp=sv39, we invent a double mapping to make the sign-extended
>    virtual address the same as the zero-extended virtual address.
>
>  +--------+    +---------+     +--------+
>  |        | +--| 511:PUD1|     |        |
>  |        | |  +---------+     |        |
>  |        | |  | 510:PUD0|--+  |        |
>  |        | |  +---------+  |  |        |
>  |        | |  |         |  |  |        |
>  |        | |  | INVALID |  |  |        |
>  |        | |  |         |  |  |        |
>  |  ....  | |  |         |  |  |  ....  |
>  |        | |  +---------+  |  |        |
>  |        | +--| 3:PUD1  |  |  |        |
>  |        | |  +---------+  |  |        |
>  |        | |  | 2:PUD0  |--+  |        |
>  |        | |  +---------+  |  |        |
>  |        | |  |1:USR_PUD|  |  |        |
>  |        | |  +---------+  |  |        |
>  |        | |  |0:USR_PUD|  |  |        |
>  +--------+<-+ +---------+  +->+--------+
>    PUD1        ^    PGD          PUD0
>    1GB         |    4GB          1GB
>               |
>         +----------+
>         | Sv39 PGDP|
>         +----------+
>             SATP
>
> The size of xlen always equaled the pointer/long size before s64ilp32
> emerged. So we introduce a new data type - xlen_t - which deals with
> CSR-related and callee-save/restore operations.
>
> Some kernel features use 32BIT/64BIT to determine the exact ISA; e.g.,
> the ebpf JIT maps to the rv32 ISA when CONFIG_32BIT=y. But s64ilp32
> needs the ebpf JIT to map to the rv64 ISA when CONFIG_32BIT=y, so we
> need another config option to distinguish the difference.
>
> For more details, please review the patch series.
>
> How to run s64ilp32?
> ====================
>
> GNU toolchain
> -------------
> git clone https://github.com/Liaoshihua/riscv-gnu-toolchain.git
> cd riscv-gnu-toolchain
> ./configure --prefix="$PWD/opt-rv64-ilp32/" --with-arch=rv64imac --with-abi=ilp32
> make linux
> export PATH=$PATH:$PWD/opt-rv64-ilp32/bin/
>
> Opensbi
> -------
> git clone https://github.com/riscv-software-src/opensbi.git
> CROSS_COMPILE=riscv64-unknown-linux-gnu- make PLATFORM=generic
>
> Linux kernel
> ------------
> git clone https://github.com/guoren83/linux.git -b s64ilp32
> cd linux
> make ARCH=riscv CROSS_COMPILE=riscv64-unknown-linux-gnu- rv64ilp32_defconfig
> make ARCH=riscv CROSS_COMPILE=riscv64-unknown-linux-gnu- all
>
> Rootfs
> ------
> git clone git://git.busybox.net/buildroot
> cd buildroot
> make qemu_riscv32_virt_defconfig
> make
>
> Qemu
> ----
> git clone https://github.com/plctlab/plct-qemu.git -b plct-s64ilp32-dev
> cd plct-qemu
> mkdir build
> cd build
> ../configure --target-list="riscv64-softmmu riscv32-softmmu"
> make
>
> Run
> ---
> ./qemu-system-riscv64 -cpu rv64 -M virt -m 128m -nographic \
>   -bios fw_dynamic.bin -kernel Image \
>   -drive file=rootfs.ext2,format=raw,id=hd0 \
>   -device virtio-blk-device,drive=hd0 \
>   -append "rootwait root=/dev/vda ro console=ttyS0 earlycon=sbi" \
>   -netdev user,id=net0 -device virtio-net-device,netdev=net0
>
> OpenSBI v1.2-119-gdc1c7db05e07
>    ____                    _____ ____ _____
>   / __ \                  / ____|  _ \_   _|
>  | |  | |_ __   ___ _ __ | (___ | |_) || |
>  | |  | | '_ \ / _ \ '_ \ \___ \|  _ < | |
>  | |__| | |_) |  __/ | | |____) | |_) || |_
>   \____/| .__/ \___|_| |_|_____/|___/_____|
>         | |
>         |_|
>
> Platform Name             : riscv-virtio,qemu
> Platform Features         : medeleg
> Platform HART Count       : 1
> Platform IPI Device       : aclint-mswi
> Platform Timer Device     : aclint-mtimer @ 10000000Hz
> Platform Console Device   : uart8250
> Platform HSM Device       : ---
> Platform PMU Device       : ---
> Platform Reboot Device    : sifive_test
> Platform Shutdown Device  : sifive_test
> Platform Suspend Device   :
> ---
> Platform CPPC Device      : ---
> Firmware Base             : 0x60000000
> Firmware Size             : 360 KB
> Firmware RW Offset        : 0x40000
> Runtime SBI Version       : 1.0
>
> Domain0 Name              : root
> Domain0 Boot HART         : 0
> Domain0 HARTs             : 0*
> Domain0 Region00          : 0x0000000002000000-0x000000000200ffff M: (I,R,W) S/U: ()
> Domain0 Region01          : 0x0000000060040000-0x000000006005ffff M: (R,W) S/U: ()
> Domain0 Region02          : 0x0000000060000000-0x000000006003ffff M: (R,X) S/U: ()
> Domain0 Region03          : 0x0000000000000000-0xffffffffffffffff M: (R,W,X) S/U: (R,W,X)
> Domain0 Next Address      : 0x0000000060200000
> Domain0 Next Arg1         : 0x0000000067e00000
> Domain0 Next Mode         : S-mode
> Domain0 SysReset          : yes
> Domain0 SysSuspend        : yes
>
> Boot HART ID              : 0
> Boot HART Domain          : root
> Boot HART Priv Version    : v1.12
> Boot HART Base ISA        : rv64imafdch
> Boot HART ISA Extensions  : time,sstc
> Boot HART PMP Count       : 16
> Boot HART PMP Granularity : 4
> Boot HART PMP Address Bits: 54
> Boot HART MHPM Count      : 16
> Boot HART MIDELEG         : 0x0000000000001666
> Boot HART MEDELEG         : 0x0000000000f0b509
> [    0.000000] Linux version 6.3.0-rc1-00086-gc8d2fedb997a (guoren@fedora) (riscv64-unknown-linux-gnu-gcc (g5e578a16201f) 13.0.1 20230206 (experimental), GNU ld (GNU Binutils) 2.40.50.20230205) #1 SMP Sun May 14 10:46:42 EDT 2023
> [    0.000000] random: crng init done
> [    0.000000] OF: fdt: Ignoring memory range 0x60000000 - 0x60200000
> [    0.000000] Machine model: riscv-virtio,qemu
> [    0.000000] efi: UEFI not found.
> [    0.000000] OF: reserved mem: 0x60000000..0x6003ffff (256 KiB) map non-reusable mmode_resv1@60000000
> [    0.000000] OF: reserved mem: 0x60040000..0x6005ffff (128 KiB) map non-reusable mmode_resv0@60040000
> [    0.000000] Zone ranges:
> [    0.000000]   Normal   [mem 0x0000000060200000-0x0000000067ffffff]
> [    0.000000] Movable zone start for each node
> [    0.000000] Early memory node ranges
> [    0.000000]   node   0: [mem 0x0000000060200000-0x0000000067ffffff]
> [    0.000000] Initmem setup node 0 [mem 0x0000000060200000-0x0000000067ffffff]
> [    0.000000] On node 0, zone Normal: 512 pages in unavailable ranges
> [    0.000000] SBI specification v1.0 detected
> [    0.000000] SBI implementation ID=0x1 Version=0x10002
> [    0.000000] SBI TIME extension detected
> [    0.000000] SBI IPI extension detected
> [    0.000000] SBI RFENCE extension detected
> [    0.000000] SBI SRST extension detected
> [    0.000000] SBI HSM extension detected
> [    0.000000] riscv: base ISA extensions acdfhim
> [    0.000000] riscv: ELF capabilities acdfim
> [    0.000000] percpu: Embedded 13 pages/cpu s24352 r8192 d20704 u53248
> [    0.000000] Built 1 zonelists, mobility grouping on.  Total pages: 31941
> [    0.000000] Kernel command line: rootwait root=/dev/vda ro console=ttyS0 earlycon=sbi norandmaps
> [    0.000000] Dentry cache hash table entries: 16384 (order: 4, 65536 bytes, linear)
> [    0.000000] Inode-cache hash table entries: 8192 (order: 3, 32768 bytes, linear)
> [    0.000000] mem auto-init: stack:all(zero), heap alloc:off, heap free:off
> [    0.000000] Virtual kernel memory layout:
> [    0.000000]       fixmap : 0x9ce00000 - 0x9d000000   (2048 kB)
> [    0.000000]       pci io : 0x9d000000 - 0x9e000000   (  16 MB)
> [    0.000000]      vmemmap : 0x9e000000 - 0xa0000000   (  32 MB)
> [    0.000000]      vmalloc : 0xa0000000 - 0xc0000000   ( 512 MB)
> [    0.000000]       lowmem : 0xc0000000 - 0xc7e00000   ( 126 MB)
> [    0.000000] Memory: 97748K/129024K available (8699K kernel code, 8867K rwdata, 4096K rodata, 4204K init, 361K bss, 31276K reserved, 0K cma-reserved)
> ...
> Starting network: udhcpc: started, v1.36.0
> udhcpc: broadcasting discover
> udhcpc: broadcasting select for 10.0.2.15, server 10.0.2.2
> udhcpc: lease of 10.0.2.15 obtained from 10.0.2.2, lease time 86400
> deleting routers
> adding dns 10.0.2.3
> OK
>
> Welcome to Buildroot
> buildroot login: root
> # cat /proc/cpuinfo
> processor       : 0
> hart            : 0
> isa             : rv64imafdch_zihintpause_zbb_sstc
> mmu             : sv39
> mvendorid       : 0x0
> marchid         : 0x70232
> mimpid          : 0x70232
>
> # uname -a
> Linux buildroot 6.3.0-rc1-00086-gc8d2fedb997a #1 SMP Sun May 14 10:46:42 EDT 2023 riscv32 GNU/Linux
> # ls /lib/
> ld-linux-riscv32-ilp32d.so.1  libgcc_s.so.1
> libanl.so.1                   libm.so.6
> libatomic.so                  libnss_dns.so.2
> libatomic.so.1                libnss_files.so.2
> libatomic.so.1.2.0            libpthread.so.0
> libc.so.6                     libresolv.so.2
> libcrypt.so.1                 librt.so.1
> libdl.so.2                    libutil.so.1
> libgcc_s.so                   modules
>
> # cat /proc/99/maps
> 0000000055554000-0000000055634000 r-xp 00000000 fe:00 17   /bin/busybox
> 0000000055634000-0000000055636000 r--p 000df000 fe:00 17   /bin/busybox
> 0000000055636000-0000000055637000 rw-p 000e1000 fe:00 17   /bin/busybox
> 0000000055637000-0000000055659000 rw-p 00000000 00:00 0    [heap]
> 0000000077e8d000-0000000077fbe000 r-xp 00000000 fe:00 137  /lib/libc.so.6
> 0000000077fbe000-0000000077fbf000 ---p 00131000 fe:00 137  /lib/libc.so.6
> 0000000077fbf000-0000000077fc1000 r--p 00131000 fe:00 137  /lib/libc.so.6
> 0000000077fc1000-0000000077fc2000 rw-p 00133000 fe:00 137  /lib/libc.so.6
> 0000000077fc2000-0000000077fcc000 rw-p 00000000 00:00 0
> 0000000077fcc000-0000000077fd4000 r-xp 00000000 fe:00 146  /lib/libresolv.so.2
> 0000000077fd4000-0000000077fd5000 ---p 00008000 fe:00 146  /lib/libresolv.so.2
> 0000000077fd5000-0000000077fd6000 r--p 00008000 fe:00 146  /lib/libresolv.so.2
> 0000000077fd6000-0000000077fd7000 rw-p 00009000 fe:00 146  /lib/libresolv.so.2
> 0000000077fd7000-0000000077fd9000 rw-p 00000000 00:00 0
> 0000000077fd9000-0000000077fdb000 r--p 00000000 00:00 0    [vvar]
> 0000000077fdb000-0000000077fdd000 r-xp 00000000 00:00 0    [vdso]
> 0000000077fdd000-0000000077ffc000 r-xp 00000000 fe:00 132  /lib/ld-linux-riscv32-ilp32d.so.1
> 0000000077ffd000-0000000077ffe000 r--p 0001f000 fe:00 132  /lib/ld-linux-riscv32-ilp32d.so.1
> 0000000077ffe000-0000000077fff000 rw-p 00020000 fe:00 132  /lib/ld-linux-riscv32-ilp32d.so.1
> 000000007ffde000-000000007ffff000 rw-p 00000000 00:00 0    [stack]
>
> Other resources
> ===============
>
> OpenEuler riscv32 rootfs
> ------------------------
> You can download the OpenEuler riscv32 rootfs from here:
> https://repo.tarsier-infra.com/openEuler-RISC-V/obs/archive/rv32/openeuler-image-qemu-riscv32-20221111070036.rootfs.ext4
> (Made by Junqiang Wang)
>
> Debian riscv32 rootfs
> ---------------------
> You can download the Debian riscv32 rootfs from here:
> https://github.com/yuzibo/riscv32
> (Made by Bo YU and Han Gao)
>
> Fedora riscv32 rootfs
> ---------------------
> https://fedoraproject.org/wiki/Architectures/RISC-V/RV32
> (Made by Wei Fu)
>
> LLVM 64ilp32
> ------------
> git clone https://github.com/luxufan/llvm-project.git -b rv64-ilp32
> cd llvm-project
> mkdir build && cd build
> cmake ../llvm -G Ninja -DCMAKE_BUILD_TYPE=Release -DLLVM_TARGETS_TO_BUILD="X86;RISCV" -DLLVM_ENABLE_PROJECTS="clang;lld"
> ninja all
>
> (The LLVM development status is that CC=clang can compile the kernel
> with LLVM=1 but has not yet booted successfully.)
>
> Patch organization
> ==================
> This series depends on 64ilp32 toolchain patches that are not upstream
> yet.
>
> PATCH [0-1]    unify vdso32 & compat_vdso
> PATCH [2]      adds time-related vDSO common flow for vdso32
> PATCH [3]      adds s64ilp32 support to the clocksource driver
> PATCH [5]      adds s64ilp32 support to the irqchip driver
> PATCH [4,6-12] add basic data types and the compiling framework
> PATCH [13]     adds MMU_SV39 support
> PATCH [14]     adds native atomic64
> PATCH [15]     adds TImode
> PATCH [16]     adds cmpxchg_double
> PATCH [17-19]  clean up kconfig & add a defconfig
> PATCH [20-21]  fix temporary compiler problems
>
> Open issues
> ===========
>
> Callee-saved register width
> ---------------------------
> For a 64-bit ISA (including 64lp64 and 64ilp32), the callee can't
> determine the exact width in use in a register, so it saves the
> maximum width of the ISA register, i.e., xlen size. We also found this
> rule in x86-x32, mips-n32, and aarch64ilp32, which inherit it from
> 64lp64. See PATCH [20].
>
> Here are two downsides of this:
>  - It causes a difference from 32ilp32's stack frame, and s64ilp32
>    reuses the 32ilp32 software stack. Thus, many additional
>    compatibility problems would appear during the porting of 64ilp32
>    software.
>  - It also increases the stack usage budget.
>
> <setup_vm>:
>    auipc   a3,0xff3fb
>    add     a3,a3,1234 # c0000000
>    li      a5,-1
>    lui     a4,0xc0000
>    addw    sp,sp,-96
>    srl     a5,a5,0x20
>    subw    a4,a4,a3
>    auipc   a2,0x111a
>    add     a2,a2,1212 # c1d1f000
>    sd      s0,80(sp)----+
>    sd      s1,72(sp)    |
>    sd      s2,64(sp)    |
>    sd      s7,24(sp)    |
>    sd      s8,16(sp)    |
>    sd      s9,8(sp)     |-> All <= 32b widths, but occupy 64b
>    sd      ra,88(sp)    |   stack space.
>    sd      s3,56(sp)    |   Affects memory footprint & cache
>    sd      s4,48(sp)    |   performance.
>    sd      s5,40(sp)    |
>    sd      s6,32(sp)    |
>    sd      s10,0(sp)----+
>    sll     a1,a4,0x20
>    subw    a2,a2,a3
>    and     a4,a4,a5
>
> So here is a proposal for the riscv 64ilp32 ABI:
>  - Let the compiler prevent the callee from saving ">32b variables" in
>    callee-saved registers. (Q: We need to measure the influence of 64b
>    variables living across function calls.)
>
> EF_RISCV_X32
> ------------
> We add an e_flag (EF_RISCV_X32) to distinguish the 32-bit ELF, which
> occupies BIT[5] (0x20) of the e_flags layout.
>
> ELF Header:
>   Magic:   7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
>   Class:                             ELF32
>   Data:                              2's complement, little endian
>   Version:                           1 (current)
>   OS/ABI:                            UNIX - System V
>   ABI Version:                       0
>   Type:                              REL (Relocatable file)
>   Machine:                           RISC-V
>   Version:                           0x1
>   Entry point address:               0x0
>   Start of program headers:          0 (bytes into file)
>   Start of section headers:          24620 (bytes into file)
>   Flags:                             0x21, RVC, X32, soft-float ABI
>                                                 ^^^
>
> 64-bit Optimization problem
> ---------------------------
> There is an existing problem in the 64ilp32 gcc: it combines two
> pointers into one register. Liao is solving that problem. Before he
> finishes the job, we can fortunately prevent it with a simple noinline
> attribute.
>
> struct path {
>         struct vfsmount *mnt;
>         struct dentry *dentry;
> } __randomize_layout;
>
> struct nameidata {
>         struct path     path;
>         ...
>         struct path     root;
>         ...
> } __randomize_layout;
>
> struct nameidata *nd
> ...
>         nd->path = nd->root;
>             6c88      ld   a0,24(s1)   // a0 contains two pointers
>             e088      sd   a0,0(s1)
>         mntget(path->mnt);
>             // Need "lw a0,0(s1)" or "a0 << 32; a0 >> 32"
>             2a6150ef  jal  c01ce946 <mntget>   // bug!
>
> Acknowledgements
> ================
> The s64ilp32 needs many other projects' cooperation.
> Thanks to everyone involved:
>  - GNU:            LiaoShihua <shihua@iscas.ac.cn>,
>                    Jiawei Chen <jiawei@iscas.ac.cn>
>  - Qemu:           Weiwei Li <liweiwei@iscas.ac.cn>
>  - LLVM:           luxufan <luxufan@iscas.ac.cn>,
>                    Chunyu Liao <chunyu@iscas.ac.cn>
>  - OpenEuler rv32: Junqiang Wang <wangjunqiang@iscas.ac.cn>
>  - Debian rv32:    Bo YU <tsu.yubo@gmail.com>,
>                    Han Gao <gaohan@iscas.ac.cn>
>  - Fedora rv32:    Wei Fu <wefu@redhat.com>
>
> References
> ==========
> [1] https://techpubs.jurassic.nl/manuals/0630/developer/Mpro_n32_ABI/sgi_html/index.html
> [2] https://wiki.debian.org/Arm64ilp32Port
> [3] https://lwn.net/Articles/456731/
> [4] https://github.com/riscv/riscv-profiles/releases
> [5] https://www.cnx-software.com/2021/10/25/allwinner-d1s-f133-risc-v-processor-64mb-ddr2/
> [6] https://milkv.io/duo/
> [7] https://twitter.com/tphuang/status/1631308330256801793
> [8] https://www.cnx-software.com/2022/12/02/pine64-ox64-sbc-bl808-risc-v-multi-protocol-wisoc-64mb-ram/
>
> Guo Ren (22):
>   riscv: vdso: Unify vdso32 & compat_vdso into vdso/Makefile
>   riscv: vdso: Remove compat_vdso/
>   riscv: vdso: Add time-related vDSO common flow for vdso32
>   clocksource: riscv: s64ilp32: Use __riscv_xlen instead of CONFIG_32BIT
>   riscv: s64ilp32: Introduce xlen_t
>   irqchip: riscv: s64ilp32: Use __riscv_xlen instead of CONFIG_32BIT
>   riscv: s64ilp32: Add sbi support
>   riscv: s64ilp32: Add asid support
>   riscv: s64ilp32: Introduce PTR_L and PTR_S
>   riscv: s64ilp32: Enable user space runtime environment
>   riscv: s64ilp32: Add ebpf jit support
>   riscv: s64ilp32: Add ELF32 support
>   riscv: s64ilp32: Add ARCH RV64 ILP32 compiling framework
>   riscv: s64ilp32: Add MMU_SV39 mode support for 32BIT
>   riscv: s64ilp32: Enable native atomic64
>   riscv: s64ilp32: Add TImode (128 int) support
>   riscv: s64ilp32: Implement cmpxchg_double
>   riscv: s64ilp32: Disable KVM
>   riscv: Cleanup rv32_defconfig
>   riscv: s64ilp32: Add rv64ilp32_defconfig
>   riscv: s64ilp32: Correct the rv64ilp32 stackframe layout
>   riscv: s64ilp32: Temporary workaround
>     solution to gcc problem
>
>  arch/riscv/Kconfig                            |  36 +++-
>  arch/riscv/Makefile                           |  24 ++-
>  arch/riscv/configs/32-bit.config              |   2 -
>  arch/riscv/configs/64ilp32.config             |   2 +
>  arch/riscv/include/asm/asm.h                  |   5 +
>  arch/riscv/include/asm/atomic.h               |   6 +
>  arch/riscv/include/asm/cmpxchg.h              |  53 ++++++
>  arch/riscv/include/asm/cpu_ops_sbi.h          |   4 +-
>  arch/riscv/include/asm/csr.h                  |  58 +++---
>  arch/riscv/include/asm/extable.h              |   2 +-
>  arch/riscv/include/asm/page.h                 |  24 ++-
>  arch/riscv/include/asm/pgtable-64.h           |  42 ++---
>  arch/riscv/include/asm/pgtable.h              |  26 ++-
>  arch/riscv/include/asm/processor.h            |   8 +-
>  arch/riscv/include/asm/ptrace.h               |  96 +++++-----
>  arch/riscv/include/asm/sbi.h                  |  24 +--
>  arch/riscv/include/asm/stacktrace.h           |   6 +
>  arch/riscv/include/asm/timex.h                |  10 +-
>  arch/riscv/include/asm/vdso.h                 |  34 +++-
>  arch/riscv/include/asm/vdso/gettimeofday.h    |  84 +++++++++
>  arch/riscv/include/uapi/asm/elf.h             |   2 +-
>  arch/riscv/include/uapi/asm/unistd.h          |   1 +
>  arch/riscv/kernel/Makefile                    |   3 +-
>  arch/riscv/kernel/compat_signal.c             |   2 +-
>  arch/riscv/kernel/compat_vdso/.gitignore      |   2 -
>  arch/riscv/kernel/compat_vdso/compat_vdso.S   |   8 -
>  .../kernel/compat_vdso/compat_vdso.lds.S      |   3 -
>  arch/riscv/kernel/compat_vdso/flush_icache.S  |   3 -
>  arch/riscv/kernel/compat_vdso/getcpu.S        |   3 -
>  arch/riscv/kernel/compat_vdso/note.S          |   3 -
>  arch/riscv/kernel/compat_vdso/rt_sigreturn.S  |   3 -
>  arch/riscv/kernel/cpu.c                       |   4 +-
>  arch/riscv/kernel/cpu_ops_sbi.c               |   4 +-
>  arch/riscv/kernel/cpufeature.c                |   4 +-
>  arch/riscv/kernel/entry.S                     |  24 +--
>  arch/riscv/kernel/head.S                      |   8 +-
>  arch/riscv/kernel/process.c                   |   8 +-
>  arch/riscv/kernel/sbi.c                       |  24 +--
>  arch/riscv/kernel/signal.c                    |   6 +-
>  arch/riscv/kernel/traps.c                     |   4 +-
>  arch/riscv/kernel/vdso.c                      |   4 +-
>  arch/riscv/kernel/vdso/Makefile               | 176 ++++++++++++------
>  ..._vdso_offsets.sh => gen_vdso32_offsets.sh} |   2 +-
>  .../gen_vdso64_offsets.sh}                    |   2 +-
>  arch/riscv/kernel/vdso/vgettimeofday.c        |  39 +++-
>  arch/riscv/kernel/vdso32.S                    |   8 +
>  arch/riscv/kernel/{vdso/vdso.S => vdso64.S}   |   8 +-
>  arch/riscv/kvm/Kconfig                        |   1 +
>  arch/riscv/lib/Makefile                       |   1 +
>  arch/riscv/lib/memset.S                       |   4 +-
>  arch/riscv/mm/context.c                       |  16 +-
>  arch/riscv/mm/fault.c                         |  13 +-
>  arch/riscv/mm/init.c                          |  29 ++-
>  arch/riscv/net/Makefile                       |   6 +-
>  arch/riscv/net/bpf_jit_comp64.c               |  10 +-
>  drivers/clocksource/timer-riscv.c             |   2 +-
>  drivers/irqchip/irq-riscv-intc.c              |   4 +-
>  fs/namei.c                                    |   2 +-
>  58 files changed, 675 insertions(+), 317 deletions(-)
>  create mode 100644 arch/riscv/configs/64ilp32.config
>  delete mode 100644 arch/riscv/kernel/compat_vdso/.gitignore
>  delete mode 100644 arch/riscv/kernel/compat_vdso/compat_vdso.S
>  delete mode 100644 arch/riscv/kernel/compat_vdso/compat_vdso.lds.S
>  delete mode 100644 arch/riscv/kernel/compat_vdso/flush_icache.S
>  delete mode 100644 arch/riscv/kernel/compat_vdso/getcpu.S
>  delete mode 100644 arch/riscv/kernel/compat_vdso/note.S
>  delete mode 100644 arch/riscv/kernel/compat_vdso/rt_sigreturn.S
>  rename arch/riscv/kernel/vdso/{gen_vdso_offsets.sh => gen_vdso32_offsets.sh} (78%)
>  rename arch/riscv/kernel/{compat_vdso/gen_compat_vdso_offsets.sh => vdso/gen_vdso64_offsets.sh} (77%)
>  create mode 100644 arch/riscv/kernel/vdso32.S
>  rename arch/riscv/kernel/{vdso/vdso.S => vdso64.S} (73%)
On Thu, May 18, 2023, at 17:38, Palmer Dabbelt wrote:
> On Thu, 18 May 2023 06:09:51 PDT (-0700), guoren@kernel.org wrote:
>> From: Guo Ren <guoren@linux.alibaba.com>
>>
>> This patch series adds s64ilp32 support to riscv. The term s64ilp32
>> means smode-xlen=64 and -mabi=ilp32 (ints, longs, and pointers are all
>> 32-bit), i.e., running 32-bit Linux kernel on pure 64-bit supervisor
>> mode. There have been many 64ilp32 abis existing, such as mips-n32 [1],
>> arm-aarch64ilp32 [2], and x86-x32 [3], but they are all about userspace.
>> Thus, this should be the first time running a 32-bit Linux kernel with
>> the 64ilp32 ABI at supervisor mode (If not, correct me).
>
> Does anyone actually want this?  At a bare minimum we'd need to add it
> to the psABI, which would presumably also be required on the compiler
> side of things.
>
> It's not even clear anyone wants rv64/ilp32 in userspace, the kernel
> seems like it'd be even less widely used.

We have had long discussions about supporting ilp32 userspace on
arm64, and I think almost everyone is glad we never merged it into
the mainline kernel, so we don't have to worry about supporting it
in the future. The cost of supporting an extra user space ABI
is huge, and I'm sure you don't want to go there. The other two
cited examples (mips-n32 and x86-x32) are pretty much unused now
as well, but still have a maintenance burden until they can finally
get removed.

If for some crazy reason you'd still want the 64ilp32 ABI in user
space, running the kernel this way is probably still a bad idea,
but that one is less clear. There is clearly a small memory
penalty of running a 64-bit kernel for larger data structures
(page, inode, task_struct, ...)
and vmlinux, and there is no huge additional maintenance cost on
top of the ABI itself that you'd need either way, but using a
64-bit address space in the kernel has some important advantages
even when running 32-bit userland: processes can use the entire
4GB virtual space, while the kernel can address more than 768MB
of lowmem, and KASLR has more bits to work with for randomization.
On RISCV, some additional features (VMAP_STACK, KASAN, KFENCE, ...)
depend on 64-bit kernels even though they don't strictly need that.

      Arnd
On Thu, 18 May 2023, Palmer Dabbelt wrote:
> On Thu, 18 May 2023 06:09:51 PDT (-0700), guoren@kernel.org wrote:
> > This patch series adds s64ilp32 support to riscv. The term s64ilp32
> > means smode-xlen=64 and -mabi=ilp32 (ints, longs, and pointers are all
> > 32-bit), i.e., running 32-bit Linux kernel on pure 64-bit supervisor
> > mode. There have been many 64ilp32 abis existing, such as mips-n32 [1],
> > arm-aarch64ilp32 [2], and x86-x32 [3], but they are all about userspace.
> > Thus, this should be the first time running a 32-bit Linux kernel with
> > the 64ilp32 ABI at supervisor mode (If not, correct me).
>
> Does anyone actually want this?  At a bare minimum we'd need to add it
> to the psABI, which would presumably also be required on the compiler
> side of things.
>
> It's not even clear anyone wants rv64/ilp32 in userspace, the kernel
> seems like it'd be even less widely used.

We've certainly talked to folks who are interested in RV64 ILP32
userspace with an LP64 kernel.  The motivation is the usual one: to
reduce data size and therefore (ideally) BOM cost.  I think this work,
if it goes forward, would need to go hand in hand with the RVIA psABI
group.

The RV64 ILP32 kernel and ILP32 userspace approach implemented by this
patch is intriguing, but I guess for me, the question is whether it's
worth the extra hassle vs. a pure RV32 kernel & userspace.

- Paul
On Thu, 18 May 2023, Arnd Bergmann wrote:
> We have had long discussions about supporting ilp32 userspace on
> arm64, and I think almost everyone is glad we never merged it into
> the mainline kernel, so we don't have to worry about supporting it
> in the future. The cost of supporting an extra user space ABI
> is huge, and I'm sure you don't want to go there. The other two
> cited examples (mips-n32 and x86-x32) are pretty much unused now
> as well, but still have a maintenance burden until they can finally
> get removed.

There probably hasn't been much pressure to support Aarch64 ILP32 since
ARM still has hardware support for Aarch32.  Will be interesting to see
if that's still the case after ARM drops Aarch32 support for future
designs.

- Paul
On Fri, May 19, 2023, at 02:38, Paul Walmsley wrote:
> On Thu, 18 May 2023, Arnd Bergmann wrote:
>
>> We have had long discussions about supporting ilp32 userspace on
>> arm64, and I think almost everyone is glad we never merged it into
>> the mainline kernel, so we don't have to worry about supporting it
>> in the future. The cost of supporting an extra user space ABI
>> is huge, and I'm sure you don't want to go there. The other two
>> cited examples (mips-n32 and x86-x32) are pretty much unused now
>> as well, but still have a maintenance burden until they can finally
>> get removed.
>
> There probably hasn't been much pressure to support Aarch64 ILP32 since
> ARM still has hardware support for Aarch32. Will be interesting to see
> if that's still the case after ARM drops Aarch32 support for future
> designs.

I think there was some pressure for 64ilp32 from Arm when aarch64
support was originally added, as they always planned to drop aarch32
support eventually, but I don't see that coming back now.

I think the situation is quite different as well: On aarch64, there is
a significant cost in supporting aarch32 userspace because of the
complexity of that particular instruction set, but at the same time
there is also a huge amount of software that is compiled for or written
to support aarch32 software, and nobody wants to replace that.

There are also a lot of existing arm32 chips with guaranteed
availability well into the 2030s, new 32-bit-only chips based on
Cortex-A7 (originally released in 2011) coming out constantly, and even
the latest low-end core (Cortex-A510 r1) supports aarch32. It's
probably going to be several years before that core even shows up in
low-memory systems, and then decades before this stops being available
in SoCs, even in the unlikely case that no future low-end cores support
aarch32-el0 mode (it's already been announced that there are no plans
for future high-end cores with aarch32 mode, but those won't be used in
low-memory configurations anyway).
For RISC-V, I have not seen much interest in Linux userspace for the
existing rv32 mode, so you could argue that there is not much to lose
in abandoning it. On the other hand, the cost of adding rv32 support to
an rv64 core should be very small as all the instructions are already
present in some other encoding, and developers have already spent a
significant amount of work on bringing up rv32 userspace that would all
have to be done again for a new ABI, and you'd end up splitting the
already tiny developer base for 32-bit riscv in two for the existing
rv32 side and a new rv64ilp32 side.

I suppose the answer in both cases is the same though: if a SoC maker
wants to sell a product to users with low memory, they should pick a
CPU core that implements standard 32-bit user space support rather than
making a mess of it and expecting software to work around it.

      Arnd
On Fri, May 19, 2023 at 2:29 AM Arnd Bergmann <arnd@arndb.de> wrote: > > On Thu, May 18, 2023, at 17:38, Palmer Dabbelt wrote: > > On Thu, 18 May 2023 06:09:51 PDT (-0700), guoren@kernel.org wrote: > >> From: Guo Ren <guoren@linux.alibaba.com> > >> > >> This patch series adds s64ilp32 support to riscv. The term s64ilp32 > >> means smode-xlen=64 and -mabi=ilp32 (ints, longs, and pointers are all > >> 32-bit), i.e., running 32-bit Linux kernel on pure 64-bit supervisor > >> mode. There have been many 64ilp32 abis existing, such as mips-n32 [1], > >> arm-aarch64ilp32 [2], and x86-x32 [3], but they are all about userspace. > >> Thus, this should be the first time running a 32-bit Linux kernel with > >> the 64ilp32 ABI at supervisor mode (If not, correct me). > > > > Does anyone actually want this? At a bare minimum we'd need to add it > > to the psABI, which would presumably also be required on the compiler > > side of things. > > > > It's not even clear anyone wants rv64/ilp32 in userspace, the kernel > > seems like it'd be even less widely used. > > We have had long discussions about supporting ilp32 userspace on > arm64, and I think almost everyone is glad we never merged it into > the mainline kernel, so we don't have to worry about supporting it > in the future. The cost of supporting an extra user space ABI > is huge, and I'm sure you don't want to go there. The other two > cited examples (mips-n32 and x86-x32) are pretty much unused now > as well, but still have a maintenance burden until they can finally > get removed. > > If for some crazy reason you'd still want the 64ilp32 ABI in user > space, running the kernel this way is probably still a bad idea, > but that one is less clear. There is clearly a small memory > penalty of running a 64-bit kernel for larger data structures > (page, inode, task_struct, ...) and vmlinux, and there is no I don't think it's a small memory penalty, our measurement is about 16% with defconfig, see "Why 32-bit Linux?" section. 
This patch series doesn't add a 64ilp32 userspace ABI, but it seems you
also don't like running a 32-bit Linux kernel on 64-bit hardware, right?

The motivation of s64ilp32 (running a 32-bit Linux kernel on 64-bit s-mode):
 - The target hardware (Canaan Kendryte k230) only supports MXL=64,
   SXL=64, UXL=64/32.
 - 64-bit Linux + compat 32-bit apps can't satisfy the 64/128MB scenarios.

> huge additional maintenance cost on top of the ABI itself
> that you'd need either way, but using a 64-bit address space
> in the kernel has some important advantages even when running
> 32-bit userland: processes can use the entire 4GB virtual
> space, while the kernel can address more than 768MB of lowmem,
> and KASLR has more bits to work with for randomization. On
> RISCV, some additional features (VMAP_STACK, KASAN, KFENCE,
> ...) depend on 64-bit kernels even though they don't
> strictly need that.

I agree that the 64-bit Linux kernel has more functionality, but:
 - What do you think about Linux on a 64/128MB SoC? Can it afford
   VMAP_STACK, KASAN, KFENCE?
 - I think 32-bit Linux & RTOS have monopolized this market (64/128MB
   scenarios), right?

>
> Arnd
On Fri, May 19, 2023, at 17:31, Guo Ren wrote: > On Fri, May 19, 2023 at 2:29 AM Arnd Bergmann <arnd@arndb.de> wrote: >> On Thu, May 18, 2023, at 17:38, Palmer Dabbelt wrote: >> > On Thu, 18 May 2023 06:09:51 PDT (-0700), guoren@kernel.org wrote: >> >> If for some crazy reason you'd still want the 64ilp32 ABI in user >> space, running the kernel this way is probably still a bad idea, >> but that one is less clear. There is clearly a small memory >> penalty of running a 64-bit kernel for larger data structures >> (page, inode, task_struct, ...) and vmlinux, and there is no > I don't think it's a small memory penalty, our measurement is about > 16% with defconfig, see "Why 32-bit Linux?" section. > > This patch series doesn't add 64ilp32 userspace abi, but it seems you > also don't like to run 32-bit Linux kernel on 64-bit hardware, right? Ok, I'm sorry for missing the important bit here. So if this can still use the normal 32-bit user space, the cost of this patch set is not huge, and it's something that can be beneficial in a few cases, though I suspect most users are still better off running 64-bit kernels. > The motivation of s64ilp32 (running 32-bit Linux kernel on 64-bit s-mode): > - The target hardware (Canaan Kendryte k230) only supports MXL=64, > SXL=64, UXL=64/32. > - The 64-bit Linux + compat 32-bit app can't satisfy the 64/128MB scenarios. > >> huge additional maintenance cost on top of the ABI itself >> that you'd need either way, but using a 64-bit address space >> in the kernel has some important advantages even when running >> 32-bit userland: processes can use the entire 4GB virtual >> space, while the kernel can address more than 768MB of lowmem, >> and KASLR has more bits to work with for randomization. On >> RISCV, some additional features (VMAP_STACK, KASAN, KFENCE, >> ...) depend on 64-bit kernels even though they don't >> strictly need that. 
> > I agree that the 64-bit linux kernel has more functionalities, but: > - What do you think about linux on a 64/128MB SoC? Could it be > affordable to VMAP_STACK, KASAN, KFENCE? I would definitely recommend VMAP_STACK, but that can be implemented and is used on other 32-bit architectures (ppc32, arm32) without a huge cost. The larger virtual user address space can help even on machines with 128MB, though most applications probably don't care at that point. > - I think 32-bit Linux & RTOS have monopolized this market (64/128MB > scenarios), right? The minimum amount of RAM that makes a system usable for Linux is constantly going up, so I think with 64MB, most new projects are already better off running some RTOS kernel instead of Linux. The ones that are still usable today probably won't last a lot of distro upgrades before the bloat catches up with them, but I can see how your patch set can give them a few extra years of updates. For the 256MB+ systems, I would expect the sensitive kernel allocations to be small enough that the series makes little difference. The 128MB systems are the most interesting ones here, and I'm curious to see where you spot most of the memory usage differences, I'll also reply to your initial mail for that. Arnd
On Fri, 19 May 2023 09:53:35 PDT (-0700), Arnd Bergmann wrote: > On Fri, May 19, 2023, at 17:31, Guo Ren wrote: >> On Fri, May 19, 2023 at 2:29 AM Arnd Bergmann <arnd@arndb.de> wrote: >>> On Thu, May 18, 2023, at 17:38, Palmer Dabbelt wrote: >>> > On Thu, 18 May 2023 06:09:51 PDT (-0700), guoren@kernel.org wrote: >>> >>> If for some crazy reason you'd still want the 64ilp32 ABI in user >>> space, running the kernel this way is probably still a bad idea, >>> but that one is less clear. There is clearly a small memory >>> penalty of running a 64-bit kernel for larger data structures >>> (page, inode, task_struct, ...) and vmlinux, and there is no >> I don't think it's a small memory penalty, our measurement is about >> 16% with defconfig, see "Why 32-bit Linux?" section. >> >> This patch series doesn't add 64ilp32 userspace abi, but it seems you >> also don't like to run 32-bit Linux kernel on 64-bit hardware, right? > > Ok, I'm sorry for missing the important bit here. So if this can > still use the normal 32-bit user space, the cost of this patch set > is not huge, and it's something that can be beneficial in a few > cases, though I suspect most users are still better off running > 64-bit kernels. Running a normal 32-bit userspace would require HW support for the 32-bit mode switch for userspace, though (rv32 isn't a subset of rv64, so there's nothing we can do to make those binaries function correctly with uABI). The userspace-only mode switch is a bit simpler than the user+supervisor switch, but it seems like vendors who really want the memory savings would just implement both mode switches. >> The motivation of s64ilp32 (running 32-bit Linux kernel on 64-bit s-mode): >> - The target hardware (Canaan Kendryte k230) only supports MXL=64, >> SXL=64, UXL=64/32. >> - The 64-bit Linux + compat 32-bit app can't satisfy the 64/128MB scenarios. 
>> >>> huge additional maintenance cost on top of the ABI itself >>> that you'd need either way, but using a 64-bit address space >>> in the kernel has some important advantages even when running >>> 32-bit userland: processes can use the entire 4GB virtual >>> space, while the kernel can address more than 768MB of lowmem, >>> and KASLR has more bits to work with for randomization. On >>> RISCV, some additional features (VMAP_STACK, KASAN, KFENCE, >>> ...) depend on 64-bit kernels even though they don't >>> strictly need that. >> >> I agree that the 64-bit linux kernel has more functionalities, but: >> - What do you think about linux on a 64/128MB SoC? Could it be >> affordable to VMAP_STACK, KASAN, KFENCE? > > I would definitely recommend VMAP_STACK, but that can be implemented > and is used on other 32-bit architectures (ppc32, arm32) without a > huge cost. The larger virtual user address space can help even on > machines with 128MB, though most applications probably don't care at > that point. At least having them as an option seems reasonable. Historically we haven't gated new base systems on having every feature the others do, though (!MMU, rv32, etc). >> - I think 32-bit Linux & RTOS have monopolized this market (64/128MB >> scenarios), right? > > The minimum amount of RAM that makes a system usable for Linux is > constantly going up, so I think with 64MB, most new projects are > already better off running some RTOS kernel instead of Linux. > The ones that are still usable today probably won't last a lot > of distro upgrades before the bloat catches up with them, but I > can see how your patch set can give them a few extra years of > updates. We also have 32-bit kernel support. 
Systems that have tens of MB of RAM tend to end up with some memory
technology that doesn't scale to gigabytes these days, and since that's
fixed when the chip is built it seems like those folks would be better
off just having HW support for 32-bit kernels (and maybe not even
bothering with HW support for 64-bit kernels).

> For the 256MB+ systems, I would expect the sensitive kernel
> allocations to be small enough that the series makes little
> difference. The 128MB systems are the most interesting ones
> here, and I'm curious to see where you spot most of the
> memory usage differences, I'll also reply to your initial
> mail for that.

Thanks. I agree we need to see some real systems that benefit from
this, as it's a pretty big support cost. Defconfig sizes alone don't
mean a whole lot, as users on these very constrained systems aren't
likely to run defconfig anyway.

If someone's going to use it then I'm fine taking the code, it just
seems like a very thin set of possible use cases. We've already got
almost no users in RISC-V land, and I've got a feeling this is esoteric
enough to actually have zero.

>
> Arnd
On Thu, May 18, 2023, at 15:09, guoren@kernel.org wrote:
> From: Guo Ren <guoren@linux.alibaba.com>

> Why 32-bit Linux?
> =================
> The motivation for using a 32-bit Linux kernel is to reduce memory
> footprint and meet the small capacity of DDR & cache requirement
> (e.g., 64/128MB SIP SoC).
>
> Here are the 32-bit v.s. 64-bit Linux kernel data type comparison
> summary:
>                       32-bit    64-bit
> sizeof(page):         32bytes   64bytes
> sizeof(list_head):    8bytes    16bytes
> sizeof(hlist_head):   8bytes    16bytes
> sizeof(vm_area):      68bytes   136bytes
> ...
>
> Mem-usage:
> (s32ilp32) # free
>         total   used   free    shared  buff/cache  available
> Mem:    100040  8380   88244   44      3416        88080
>
> (s64lp64) # free
>         total   used   free    shared  buff/cache  available
> Mem:    91568   11848  75796   44      3924        75952
>
> (s64ilp32) # free
>         total   used   free    shared  buff/cache  available
> Mem:    101952  8528   90004   44      3420        89816
>                                                    ^^^^^
>
> It's a rough measurement based on the current default config without any
> modification, and 32-bit (s32ilp32, s64ilp32) saved more than 16% memory
> to 64-bit (s64lp64). But s32ilp32 & s64ilp32 have a similar memory
> footprint (about 0.33% difference), meaning s64ilp32 has a big chance to
> replace s32ilp32 on the 64-bit machine.

I've tried to run the same numbers for the debate about running 32-bit
vs 64-bit arm kernels in the past, though I focused mostly on slightly
larger systems: I looked mainly at the 512MB case, as that is the most
cost-efficient DDR3 memory configuration and fairly common.

What I'd like to understand better in your example is where the 14MB of
memory went. I assume this is for 128MB of total RAM, so we know that
1MB went into additional 'struct page' objects (32 bytes * 32768
pages). It would be good to know where the dynamic allocations went and
if they are reclaimable (e.g. inodes) or non-reclaimable (e.g.
kmalloc-128).
For the vmlinux size, is this already a minimal config that one would
run on a board with 128MB of RAM, or a defconfig that includes a lot of
stuff that is only relevant for other platforms but also grows on
64-bit?

What do you see in /proc/slabinfo, /proc/meminfo, and 'size vmlinux'
for the s64ilp32 and s64lp64 kernels here?

     Arnd
On Sat, May 20, 2023 at 12:54 AM Arnd Bergmann <arnd@arndb.de> wrote: > > On Fri, May 19, 2023, at 17:31, Guo Ren wrote: > > On Fri, May 19, 2023 at 2:29 AM Arnd Bergmann <arnd@arndb.de> wrote: > >> On Thu, May 18, 2023, at 17:38, Palmer Dabbelt wrote: > >> > On Thu, 18 May 2023 06:09:51 PDT (-0700), guoren@kernel.org wrote: > >> > >> If for some crazy reason you'd still want the 64ilp32 ABI in user > >> space, running the kernel this way is probably still a bad idea, > >> but that one is less clear. There is clearly a small memory > >> penalty of running a 64-bit kernel for larger data structures > >> (page, inode, task_struct, ...) and vmlinux, and there is no > > I don't think it's a small memory penalty, our measurement is about > > 16% with defconfig, see "Why 32-bit Linux?" section. > > > > This patch series doesn't add 64ilp32 userspace abi, but it seems you > > also don't like to run 32-bit Linux kernel on 64-bit hardware, right? > > Ok, I'm sorry for missing the important bit here. So if this can > still use the normal 32-bit user space, the cost of this patch set > is not huge, and it's something that can be beneficial in a few > cases, though I suspect most users are still better off running > 64-bit kernels. > > > The motivation of s64ilp32 (running 32-bit Linux kernel on 64-bit s-mode): > > - The target hardware (Canaan Kendryte k230) only supports MXL=64, > > SXL=64, UXL=64/32. > > - The 64-bit Linux + compat 32-bit app can't satisfy the 64/128MB scenarios. > > > >> huge additional maintenance cost on top of the ABI itself > >> that you'd need either way, but using a 64-bit address space > >> in the kernel has some important advantages even when running > >> 32-bit userland: processes can use the entire 4GB virtual > >> space, while the kernel can address more than 768MB of lowmem, > >> and KASLR has more bits to work with for randomization. On > >> RISCV, some additional features (VMAP_STACK, KASAN, KFENCE, > >> ...) 
depend on 64-bit kernels even though they don't
> >> strictly need that.
> >
> > I agree that the 64-bit linux kernel has more functionalities, but:
> > - What do you think about linux on a 64/128MB SoC? Could it be
> > affordable to VMAP_STACK, KASAN, KFENCE?
>
> I would definitely recommend VMAP_STACK, but that can be implemented
> and is used on other 32-bit architectures (ppc32, arm32) without a
> huge cost. The larger virtual user address space can help even on
> machines with 128MB, though most applications probably don't care at
> that point.

Good point, I would support VMAP_STACK in ARCH_RV64ILP32.

> > - I think 32-bit Linux & RTOS have monopolized this market (64/128MB
> > scenarios), right?
>
> The minimum amount of RAM that makes a system usable for Linux is
> constantly going up, so I think with 64MB, most new projects are
> already better off running some RTOS kernel instead of Linux.
> The ones that are still usable today probably won't last a lot
> of distro upgrades before the bloat catches up with them, but I
> can see how your patch set can give them a few extra years of
> updates.

Linux development is much cheaper than RTOS development, so vendors
would first develop a Linux version. If it succeeds in the market, the
vendor will then create a cost-down solution, so their first choice is
to cut down the memory footprint of the first Linux version instead of
moving to an RTOS. With the prices of 128MB DDR3 and 64MB DDR2 getting
closer and closer, 32-bit Linux has more and more opportunities to
replace RTOS.

> For the 256MB+ systems, I would expect the sensitive kernel
> allocations to be small enough that the series makes little
> difference. The 128MB systems are the most interesting ones
> here, and I'm curious to see where you spot most of the
> memory usage differences, I'll also reply to your initial
> mail for that.

Thanks, I also recommend reading the "Why s64ilp32 has better performance?"
section :)

What do you think about running arm32 Linux on Cortex-A35/A53/A55?

>
> Arnd
On Sat, May 20, 2023 at 4:20 AM Arnd Bergmann <arnd@arndb.de> wrote:
>
> On Thu, May 18, 2023, at 15:09, guoren@kernel.org wrote:
> > From: Guo Ren <guoren@linux.alibaba.com>
> > Why 32-bit Linux?
> > =================
> > The motivation for using a 32-bit Linux kernel is to reduce memory
> > footprint and meet the small capacity of DDR & cache requirement
> > (e.g., 64/128MB SIP SoC).
> >
> > Here are the 32-bit v.s. 64-bit Linux kernel data type comparison
> > summary:
> >                       32-bit    64-bit
> > sizeof(page):         32bytes   64bytes
> > sizeof(list_head):    8bytes    16bytes
> > sizeof(hlist_head):   8bytes    16bytes
> > sizeof(vm_area):      68bytes   136bytes
> > ...
> >
> > Mem-usage:
> > (s32ilp32) # free
> >         total   used   free    shared  buff/cache  available
> > Mem:    100040  8380   88244   44      3416        88080
> >
> > (s64lp64) # free
> >         total   used   free    shared  buff/cache  available
> > Mem:    91568   11848  75796   44      3924        75952
> >
> > (s64ilp32) # free
> >         total   used   free    shared  buff/cache  available
> > Mem:    101952  8528   90004   44      3420        89816
> >                                                    ^^^^^
> >
> > It's a rough measurement based on the current default config without any
> > modification, and 32-bit (s32ilp32, s64ilp32) saved more than 16% memory
> > to 64-bit (s64lp64). But s32ilp32 & s64ilp32 have a similar memory
> > footprint (about 0.33% difference), meaning s64ilp32 has a big chance to
> > replace s32ilp32 on the 64-bit machine.
>
> I've tried to run the same numbers for the debate about running
> 32-bit vs 64-bit arm kernels in the past, but focused mostly on
> slightly larger systems, but I looked mainly at the 512MB case,
> as that is the most cost-efficient DDR3 memory configuration
> and fairly common.

512MB is extravagant, in my opinion. In the IPC market, 32/64MB is for
480P/720P/1080p, 128/256MB is for 1080p/2k, and 512/1024MB is for 4K.
Chips with >512MB are less than 5% of the total (I guess). Even in
512MB chips, the additional memory is for the frame buffer, not the
Linux system.
I agree that the >512MB scenarios would be less sensitive to the choice
of a 32- or 64-bit Linux kernel.

>
> What I'd like to understand better in your example is where
> the 14MB of memory went. I assume this is for 128MB of total
> RAM, so we know that 1MB went into additional 'struct page'
> objects (32 bytes * 32768 pages). It would be good to know
> where the dynamic allocations went and if they are reclaimable
> (e.g. inodes) or non-reclaimable (e.g. kmalloc-128).
>
> For the vmlinux size, is this already a minimal config
> that one would run on a board with 128MB of RAM, or a
> defconfig that includes a lot of stuff that is only relevant
> for other platforms but also grows on 64-bit?

It's not a minimal config, it's defconfig, which is why I called it a
rough measurement :) I admit I wanted to exaggerate it a little bit,
but that's the starting point for cutting down memory usage for most
people, right?

During the past year, we have been convincing our customers to use
s64lp64 + u32ilp32, but they can't tolerate even a 1% additional memory
cost in 64MB/128MB scenarios, and so chose cortex-a7/a35, which can run
32-bit Linux. I think it's too early to talk about throwing 32-bit
Linux into the garbage, not only because of the memory footprint but
also because of people's ingrained opinions. Changing their minds takes
a long time.

>
> What do you see in /proc/slabinfo, /proc/meminfo/, and
> 'size vmlinux' for the s64ilp32 and s64lp64 kernels here?

Both s64ilp32 & s64lp64 use the same u32ilp32_rootfs.ext2 binary and
the same opensbi binary. All use an opensbi(2MB) + Linux(126MB) memory
layout.
Here is the result:

s64ilp32:

[    0.000000] Virtual kernel memory layout:
[    0.000000]   fixmap : 0x9ce00000 - 0x9d000000   (2048 kB)
[    0.000000]   pci io : 0x9d000000 - 0x9e000000   (  16 MB)
[    0.000000]  vmemmap : 0x9e000000 - 0xa0000000   (  32 MB)
[    0.000000]  vmalloc : 0xa0000000 - 0xc0000000   ( 512 MB)
[    0.000000]   lowmem : 0xc0000000 - 0xc7e00000   ( 126 MB)
[    0.000000] Memory: 97748K/129024K available (8699K kernel code, 8867K rwdata, 4096K rodata, 4204K init, 361K bss, 31276K reserved, 0K cma-reserved)
...
# free
              total        used        free      shared  buff/cache   available
Mem:         101952        8516       90016          44        3420       89828
Swap:             0           0           0
# cat /proc/meminfo
MemTotal:         101952 kB
MemFree:           90016 kB
MemAvailable:      89836 kB
Buffers:             292 kB
Cached:             2484 kB
SwapCached:            0 kB
Active:             2556 kB
Inactive:            656 kB
Active(anon):         40 kB
Inactive(anon):      440 kB
Active(file):       2516 kB
Inactive(file):      216 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:                32 kB
Writeback:             0 kB
AnonPages:           480 kB
Mapped:             1804 kB
Shmem:                44 kB
KReclaimable:        644 kB
Slab:               4536 kB
SReclaimable:        644 kB
SUnreclaim:         3892 kB
KernelStack:         344 kB
PageTables:          112 kB
SecPageTables:         0 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:       50976 kB
Committed_AS:       2040 kB
VmallocTotal:     524288 kB
VmallocUsed:         112 kB
VmallocChunk:          0 kB
Percpu:               64 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB
# cat /proc/slabinfo
slabinfo - version: 2.1
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
ext4_groupinfo_1k     28     28    144   28    1 : tunables    0    0    0 : slabdata      1      1      0
p9_req_t               0      0    104   39    1 : tunables    0    0    0 : slabdata      0      0      0
UDPv6                  0      0   1088   15    4 : tunables    0    0    0 : slabdata      0      0      0
tw_sock_TCPv6          0      0    200   20    1 : tunables    0    0    0 : slabdata      0      0      0
request_sock_TCPv6     0      0    240   17    1 : tunables    0    0    0 : slabdata      0      0      0
TCPv6                  0      0   2048    8    4 : tunables    0    0    0 : slabdata      0      0      0
bio-72                32     32
128 32 1 : tunables 0 0 0 : slabdata 1 1 0 bfq_io_cq 0 0 1000 8 2 : tunables 0 0 0 : slabdata 0 0 0 bio-184 21 21 192 21 1 : tunables 0 0 0 : slabdata 1 1 0 mqueue_inode_cache 10 10 768 10 2 : tunables 0 0 0 : slabdata 1 1 0 v9fs_inode_cache 0 0 576 14 2 : tunables 0 0 0 : slabdata 0 0 0 nfs4_xattr_cache_cache 0 0 1848 17 8 : tunables 0 0 0 : slabdata 0 0 0 nfs_direct_cache 0 0 152 26 1 : tunables 0 0 0 : slabdata 0 0 0 nfs_read_data 36 36 640 12 2 : tunables 0 0 0 : slabdata 3 3 0 nfs_inode_cache 0 0 832 19 4 : tunables 0 0 0 : slabdata 0 0 0 isofs_inode_cache 0 0 528 15 2 : tunables 0 0 0 : slabdata 0 0 0 fat_inode_cache 0 0 632 25 4 : tunables 0 0 0 : slabdata 0 0 0 fat_cache 0 0 24 170 1 : tunables 0 0 0 : slabdata 0 0 0 jbd2_journal_handle 0 0 48 85 1 : tunables 0 0 0 : slabdata 0 0 0 jbd2_journal_head 0 0 80 51 1 : tunables 0 0 0 : slabdata 0 0 0 ext4_fc_dentry_update 0 0 88 46 1 : tunables 0 0 0 : slabdata 0 0 0 ext4_inode_cache 88 88 984 8 2 : tunables 0 0 0 : slabdata 11 11 0 ext4_allocation_context 36 36 112 36 1 : tunables 0 0 0 : slabdata 1 1 0 ext4_io_end_vec 0 0 24 170 1 : tunables 0 0 0 : slabdata 0 0 0 pending_reservation 0 0 16 256 1 : tunables 0 0 0 : slabdata 0 0 0 extent_status 256 256 32 128 1 : tunables 0 0 0 : slabdata 2 2 0 mbcache 102 102 40 102 1 : tunables 0 0 0 : slabdata 1 1 0 dio 0 0 384 10 1 : tunables 0 0 0 : slabdata 0 0 0 audit_tree_mark 0 0 64 64 1 : tunables 0 0 0 : slabdata 0 0 0 rpc_inode_cache 0 0 576 14 2 : tunables 0 0 0 : slabdata 0 0 0 ip4-frags 0 0 152 26 1 : tunables 0 0 0 : slabdata 0 0 0 RAW 9 9 896 9 2 : tunables 0 0 0 : slabdata 1 1 0 UDP 8 8 960 8 2 : tunables 0 0 0 : slabdata 1 1 0 tw_sock_TCP 0 0 200 20 1 : tunables 0 0 0 : slabdata 0 0 0 request_sock_TCP 0 0 240 17 1 : tunables 0 0 0 : slabdata 0 0 0 TCP 0 0 1920 8 4 : tunables 0 0 0 : slabdata 0 0 0 hugetlbfs_inode_cache 8 8 504 8 1 : tunables 0 0 0 : slabdata 1 1 0 bio-164 42 42 192 21 1 : tunables 0 0 0 : slabdata 2 2 0 ep_head 0 0 8 512 1 : tunables 0 0 0 : 
slabdata 0 0 0 dax_cache 14 14 576 14 2 : tunables 0 0 0 : slabdata 1 1 0 sgpool-128 16 16 2048 8 4 : tunables 0 0 0 : slabdata 2 2 0 sgpool-64 8 8 1024 8 2 : tunables 0 0 0 : slabdata 1 1 0 request_queue 13 13 616 13 2 : tunables 0 0 0 : slabdata 1 1 0 blkdev_ioc 0 0 80 51 1 : tunables 0 0 0 : slabdata 0 0 0 bio-120 64 64 128 32 1 : tunables 0 0 0 : slabdata 2 2 0 biovec-max 40 40 3072 10 8 : tunables 0 0 0 : slabdata 4 4 0 biovec-128 0 0 1536 10 4 : tunables 0 0 0 : slabdata 0 0 0 [19/1691] biovec-64 10 10 768 10 2 : tunables 0 0 0 : slabdata 1 1 0 dmaengine-unmap-2 128 128 32 128 1 : tunables 0 0 0 : slabdata 1 1 0 sock_inode_cache 22 22 704 11 2 : tunables 0 0 0 : slabdata 2 2 0 skbuff_small_head 14 14 576 14 2 : tunables 0 0 0 : slabdata 1 1 0 skbuff_fclone_cache 0 0 448 9 1 : tunables 0 0 0 : slabdata 0 0 0 file_lock_cache 28 28 144 28 1 : tunables 0 0 0 : slabdata 1 1 0 buffer_head 357 357 80 51 1 : tunables 0 0 0 : slabdata 7 7 0 proc_dir_entry 256 256 128 32 1 : tunables 0 0 0 : slabdata 8 8 0 pde_opener 0 0 24 170 1 : tunables 0 0 0 : slabdata 0 0 0 proc_inode_cache 60 60 536 15 2 : tunables 0 0 0 : slabdata 4 4 0 seq_file 42 42 96 42 1 : tunables 0 0 0 : slabdata 1 1 0 sigqueue 85 85 48 85 1 : tunables 0 0 0 : slabdata 1 1 0 bdev_cache 14 14 1152 14 4 : tunables 0 0 0 : slabdata 1 1 0 shmem_inode_cache 637 637 600 13 2 : tunables 0 0 0 : slabdata 49 49 0 kernfs_node_cache 13938 13938 88 46 1 : tunables 0 0 0 : slabdata 303 303 0 inode_cache 360 360 496 8 1 : tunables 0 0 0 : slabdata 45 45 0 dentry 1196 1196 152 26 1 : tunables 0 0 0 : slabdata 46 46 0 names_cache 8 8 4096 8 8 : tunables 0 0 0 : slabdata 1 1 0 net_namespace 0 0 2944 11 8 : tunables 0 0 0 : slabdata 0 0 0 iint_cache 0 0 96 42 1 : tunables 0 0 0 : slabdata 0 0 0 key_jar 105 105 192 21 1 : tunables 0 0 0 : slabdata 5 5 0 uts_namespace 0 0 416 19 2 : tunables 0 0 0 : slabdata 0 0 0 nsproxy 102 102 40 102 1 : tunables 0 0 0 : slabdata 1 1 0 vm_area_struct 255 255 80 51 1 : tunables 0 0 0 : 
slabdata 5 5 0 signal_cache 55 55 704 11 2 : tunables 0 0 0 : slabdata 5 5 0 sighand_cache 60 60 1088 15 4 : tunables 0 0 0 : slabdata 4 4 0 anon_vma_chain 384 384 32 128 1 : tunables 0 0 0 : slabdata 3 3 0 anon_vma 168 168 72 56 1 : tunables 0 0 0 : slabdata 3 3 0 perf_event 0 0 816 10 2 : tunables 0 0 0 : slabdata 0 0 0 maple_node 32 32 256 16 1 : tunables 0 0 0 : slabdata 2 2 0 radix_tree_node 338 338 304 13 1 : tunables 0 0 0 : slabdata 26 26 0 task_group 8 8 512 8 1 : tunables 0 0 0 : slabdata 1 1 0 mm_struct 20 20 768 10 2 : tunables 0 0 0 : slabdata 2 2 0 vmap_area 102 102 40 102 1 : tunables 0 0 0 : slabdata 1 1 0 page->ptl 256 256 16 256 1 : tunables 0 0 0 : slabdata 1 1 0 kmalloc-cg-8k 0 0 8192 4 8 : tunables 0 0 0 : slabdata 0 0 0 kmalloc-cg-4k 8 8 4096 8 8 : tunables 0 0 0 : slabdata 1 1 0 kmalloc-cg-2k 72 72 2048 8 4 : tunables 0 0 0 : slabdata 9 9 0 kmalloc-cg-1k 32 32 1024 8 2 : tunables 0 0 0 : slabdata 4 4 0 kmalloc-cg-512 32 32 512 8 1 : tunables 0 0 0 : slabdata 4 4 0 kmalloc-cg-256 96 96 256 16 1 : tunables 0 0 0 : slabdata 6 6 0 kmalloc-cg-192 63 63 192 21 1 : tunables 0 0 0 : slabdata 3 3 0 kmalloc-cg-128 160 160 128 32 1 : tunables 0 0 0 : slabdata 5 5 0 kmalloc-cg-64 128 128 64 64 1 : tunables 0 0 0 : slabdata 2 2 0 kmalloc-rcl-8k 0 0 8192 4 8 : tunables 0 0 0 : slabdata 0 0 0 kmalloc-rcl-4k 0 0 4096 8 8 : tunables 0 0 0 : slabdata 0 0 0 kmalloc-rcl-2k 0 0 2048 8 4 : tunables 0 0 0 : slabdata 0 0 0 kmalloc-rcl-1k 0 0 1024 8 2 : tunables 0 0 0 : slabdata 0 0 0 kmalloc-rcl-512 0 0 512 8 1 : tunables 0 0 0 : slabdata 0 0 0 kmalloc-rcl-256 0 0 256 16 1 : tunables 0 0 0 : slabdata 0 0 0 kmalloc-rcl-192 0 0 192 21 1 : tunables 0 0 0 : slabdata 0 0 0 kmalloc-rcl-128 0 0 128 32 1 : tunables 0 0 0 : slabdata 0 0 0 kmalloc-rcl-64 0 0 64 64 1 : tunables 0 0 0 : slabdata 0 0 0 kmalloc-8k 12 12 8192 4 8 : tunables 0 0 0 : slabdata 3 3 0 kmalloc-4k 16 16 4096 8 8 : tunables 0 0 0 : slabdata 2 2 0 kmalloc-2k 40 40 2048 8 4 : tunables 0 0 0 : slabdata 5 5 0 
kmalloc-1k            88     88   1024    8    2 : tunables    0    0    0 : slabdata     11     11      0
kmalloc-512          856    856    512    8    1 : tunables    0    0    0 : slabdata    107    107      0
kmalloc-256           64     64    256   16    1 : tunables    0    0    0 : slabdata      4      4      0
kmalloc-192          126    126    192   21    1 : tunables    0    0    0 : slabdata      6      6      0
kmalloc-128         1056   1056    128   32    1 : tunables    0    0    0 : slabdata     33     33      0
kmalloc-64          5302   5312     64   64    1 : tunables    0    0    0 : slabdata     83     83      0
kmem_cache_node      128    128     64   64    1 : tunables    0    0    0 : slabdata      2      2      0
kmem_cache           128    128    128   32    1 : tunables    0    0    0 : slabdata      4      4      0

s64lp64:

[    0.000000] Virtual kernel memory layout:
[    0.000000]   fixmap : 0xff1bfffffee00000 - 0xff1bffffff000000   (2048 kB)
[    0.000000]   pci io : 0xff1bffffff000000 - 0xff1c000000000000   (  16 MB)
[    0.000000]  vmemmap : 0xff1c000000000000 - 0xff20000000000000   (1024 TB)
[    0.000000]  vmalloc : 0xff20000000000000 - 0xff60000000000000   (16384 TB)
[    0.000000]  modules : 0xffffffff01579000 - 0xffffffff80000000   (2026 MB)
[    0.000000]   lowmem : 0xff60000000000000 - 0xff60000008000000   ( 128 MB)
[    0.000000]   kernel : 0xffffffff80000000 - 0xffffffffffffffff   (2047 MB)
[    0.000000] Memory: 89380K/131072K available (8638K kernel code, 4979K rwdata, 4096K rodata, 2191K init, 477K bss, 41692K reserved, 0K cma-reserved)
...
# free
              total        used        free      shared  buff/cache   available
Mem:          91568       11472       76264          48        3832       76376
Swap:             0           0           0
# cat /proc/meminfo
MemTotal:          91568 kB
MemFree:           76220 kB
MemAvailable:      76352 kB
Buffers:             292 kB
Cached:             2488 kB
SwapCached:            0 kB
Active:             2560 kB
Inactive:            656 kB
Active(anon):         44 kB
Inactive(anon):      440 kB
Active(file):       2516 kB
Inactive(file):      216 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:                16 kB
Writeback:             0 kB
AnonPages:           480 kB
Mapped:             1804 kB
Shmem:                48 kB
KReclaimable:       1092 kB
Slab:               6900 kB
SReclaimable:       1092 kB
SUnreclaim:         5808 kB
KernelStack:         688 kB
PageTables:          120 kB
SecPageTables:         0 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:       45784 kB
Committed_AS:       2044 kB
VmallocTotal:   17592186044416 kB
VmallocUsed:         904 kB
VmallocChunk:          0 kB
Percpu:               88 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB
# cat /proc/slabinfo
slabinfo - version: 2.1
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
ext4_groupinfo_1k     19     19    208   19    1 : tunables    0    0    0 : slabdata      1      1      0
p9_req_t               0      0    176   23    1 : tunables    0    0    0 : slabdata      0      0      0
ip6-frags              0      0    208   19    1 : tunables    0    0    0 : slabdata      0      0      0
UDPv6                  0      0   1472   11    4 : tunables    0    0    0 : slabdata      0      0      0
tw_sock_TCPv6          0      0    264   15    1 : tunables    0    0    0 : slabdata      0      0      0
request_sock_TCPv6     0      0    312   13    1 : tunables    0    0    0 : slabdata      0      0      0
TCPv6                  0      0   2560   12    8 : tunables    0    0    0 : slabdata      0      0      0
bio-96                32     32    128   32    1 : tunables    0    0    0 : slabdata      1      1      0
bfq_io_cq              0      0   1352   12    4 : tunables    0    0    0 : slabdata      0      0      0
bfq_queue              0      0    576   14    2 : tunables    0    0    0 : slabdata      0      0      0
mqueue_inode_cache    14     14   1152   14    4 : tunables    0    0    0 : slabdata      1      1      0
v9fs_inode_cache       0      0    888    9    2 : tunables    0    0    0 : slabdata      0      0      0
nfs4_xattr_cache_cache 0      0   3168   10    8 : tunables    0    0    0 : slabdata      0      0      0
nfs_direct_cache       0      0    264   15    1 : tunables    0    0    0 : slabdata      0      0      0
nfs_commit_data
11 11 704 11 2 : tunables 0 0 0 : slabdata 1 1 0 nfs_read_data 36 36 896 9 2 : tunables 0 0 0 : slabdata 4 4 0 nfs_inode_cache 0 0 1272 25 8 : tunables 0 0 0 : slabdata 0 0 0 isofs_inode_cache 0 0 824 19 4 : tunables 0 0 0 : slabdata 0 0 0 fat_inode_cache 0 0 976 8 2 : tunables 0 0 0 : slabdata 0 0 0 fat_cache 0 0 40 102 1 : tunables 0 0 0 : slabdata 0 0 0 jbd2_journal_head 0 0 144 28 1 : tunables 0 0 0 : slabdata 0 0 0 jbd2_revoke_table_s 0 0 16 256 1 : tunables 0 0 0 : slabdata 0 0 0 ext4_fc_dentry_update 0 0 96 42 1 : tunables 0 0 0 : slabdata 0 0 0 ext4_inode_cache 105 105 1496 21 8 : tunables 0 0 0 : slabdata 5 5 0 ext4_allocation_context 30 30 136 30 1 : tunables 0 0 0 : slabdata 1 1 0 ext4_prealloc_space 34 34 120 34 1 : tunables 0 0 0 : slabdata 1 1 0 ext4_system_zone 102 102 40 102 1 : tunables 0 0 0 : slabdata 1 1 0 ext4_io_end_vec 0 0 32 128 1 : tunables 0 0 0 : slabdata 0 0 0 bio_post_read_ctx 170 170 48 85 1 : tunables 0 0 0 : slabdata 2 2 0 pending_reservation 0 0 32 128 1 : tunables 0 0 0 : slabdata 0 0 0 extent_status 102 102 40 102 1 : tunables 0 0 0 : slabdata 1 1 0 mbcache 0 0 56 73 1 : tunables 0 0 0 : slabdata 0 0 0 dnotify_struct 0 0 32 128 1 : tunables 0 0 0 : slabdata 0 0 0 pid_namespace 0 0 160 25 1 : tunables 0 0 0 : slabdata 0 0 0 posix_timers_cache 0 0 272 15 1 : tunables 0 0 0 : slabdata 0 0 0 rpc_inode_cache 0 0 832 19 4 : tunables 0 0 0 : slabdata 0 0 0 UNIX 12 12 1344 12 4 : tunables 0 0 0 : slabdata 1 1 0 ip4-frags 0 0 224 18 1 : tunables 0 0 0 : slabdata 0 0 0 xfrm_dst_cache 0 0 320 12 1 : tunables 0 0 0 : slabdata 0 0 0 ip_fib_trie 85 85 48 85 1 : tunables 0 0 0 : slabdata 1 1 0 ip_fib_alias 73 73 56 73 1 : tunables 0 0 0 : slabdata 1 1 0 UDP 12 12 1280 12 4 : tunables 0 0 0 : slabdata 1 1 0 [35/1689] tw_sock_TCP 0 0 264 15 1 : tunables 0 0 0 : slabdata 0 0 0 request_sock_TCP 0 0 312 13 1 : tunables 0 0 0 : slabdata 0 0 0 TCP 0 0 2432 13 8 : tunables 0 0 0 : slabdata 0 0 0 hugetlbfs_inode_cache 10 10 784 10 2 : tunables 0 0 0 : 
slabdata 1 1 0 bio-224 48 48 256 16 1 : tunables 0 0 0 : slabdata 3 3 0 ep_head 0 0 16 256 1 : tunables 0 0 0 : slabdata 0 0 0 inotify_inode_mark 0 0 96 42 1 : tunables 0 0 0 : slabdata 0 0 0 dax_cache 8 8 960 8 2 : tunables 0 0 0 : slabdata 1 1 0 sgpool-128 10 10 3072 10 8 : tunables 0 0 0 : slabdata 1 1 0 sgpool-64 10 10 1536 10 4 : tunables 0 0 0 : slabdata 1 1 0 sgpool-16 10 10 384 10 1 : tunables 0 0 0 : slabdata 1 1 0 request_queue 15 15 1040 15 4 : tunables 0 0 0 : slabdata 1 1 0 bio-160 42 42 192 21 1 : tunables 0 0 0 : slabdata 2 2 0 biovec-128 8 8 2048 8 4 : tunables 0 0 0 : slabdata 1 1 0 biovec-64 8 8 1024 8 2 : tunables 0 0 0 : slabdata 1 1 0 user_namespace 0 0 632 25 4 : tunables 0 0 0 : slabdata 0 0 0 uid_cache 84 84 192 21 1 : tunables 0 0 0 : slabdata 4 4 0 dmaengine-unmap-2 64 64 64 64 1 : tunables 0 0 0 : slabdata 1 1 0 sock_inode_cache 24 24 1024 8 2 : tunables 0 0 0 : slabdata 3 3 0 skbuff_small_head 12 12 640 12 2 : tunables 0 0 0 : slabdata 1 1 0 skbuff_fclone_cache 0 0 512 8 1 : tunables 0 0 0 : slabdata 0 0 0 file_lock_cache 17 17 232 17 1 : tunables 0 0 0 : slabdata 1 1 0 fsnotify_mark_connector 0 0 56 73 1 : tunables 0 0 0 : slabdata 0 0 0 pde_opener 0 0 40 102 1 : tunables 0 0 0 : slabdata 0 0 0 proc_inode_cache 57 57 848 19 4 : tunables 0 0 0 : slabdata 3 3 0 seq_file 26 26 152 26 1 : tunables 0 0 0 : slabdata 1 1 0 sigqueue 51 51 80 51 1 : tunables 0 0 0 : slabdata 1 1 0 bdev_cache 18 18 1792 9 4 : tunables 0 0 0 : slabdata 2 2 0 shmem_inode_cache 646 646 936 17 4 : tunables 0 0 0 : slabdata 38 38 0 kernfs_iattrs_cache 0 0 96 42 1 : tunables 0 0 0 : slabdata 0 0 0 kernfs_node_cache 14304 14304 128 32 1 : tunables 0 0 0 : slabdata 447 447 0 filp 84 84 320 12 1 : tunables 0 0 0 : slabdata 7 7 0 inode_cache 360 360 776 10 2 : tunables 0 0 0 : slabdata 36 36 0 dentry 1188 1188 216 18 1 : tunables 0 0 0 : slabdata 66 66 0 names_cache 48 48 4096 8 8 : tunables 0 0 0 : slabdata 6 6 0 net_namespace 0 0 3840 8 8 : tunables 0 0 0 : slabdata 0 0 
0 iint_cache 0 0 152 26 1 : tunables 0 0 0 : slabdata 0 0 0 uts_namespace 0 0 432 9 1 : tunables 0 0 0 : slabdata 0 0 0 nsproxy 56 56 72 56 1 : tunables 0 0 0 : slabdata 1 1 0 vm_area_struct 240 240 136 30 1 : tunables 0 0 0 : slabdata 8 8 0 files_cache 22 22 704 11 2 : tunables 0 0 0 : slabdata 2 2 0 signal_cache 56 56 1152 14 4 : tunables 0 0 0 : slabdata 4 4 0 sighand_cache 57 57 1664 19 8 : tunables 0 0 0 : slabdata 3 3 0 task_struct 55 55 2880 11 8 : tunables 0 0 0 : slabdata 5 5 0 anon_vma 120 120 136 30 1 : tunables 0 0 0 : slabdata 4 4 0 perf_event 0 0 1152 14 4 : tunables 0 0 0 : slabdata 0 0 0 maple_node 304 304 256 16 1 : tunables 0 0 0 : slabdata 19 19 0 radix_tree_node 350 350 584 14 2 : tunables 0 0 0 : slabdata 25 25 0 task_group 10 10 768 10 2 : tunables 0 0 0 : slabdata 1 1 0 mm_struct 22 22 1408 11 4 : tunables 0 0 0 : slabdata 2 2 0 vmap_area 168 168 72 56 1 : tunables 0 0 0 : slabdata 3 3 0 page->ptl 170 170 24 170 1 : tunables 0 0 0 : slabdata 1 1 0 kmalloc-cg-8k 0 0 8192 4 8 : tunables 0 0 0 : slabdata 0 0 0 kmalloc-cg-4k 24 24 4096 8 8 : tunables 0 0 0 : slabdata 3 3 0 kmalloc-cg-2k 32 32 2048 8 4 : tunables 0 0 0 : slabdata 4 4 0 kmalloc-cg-1k 24 24 1024 8 2 : tunables 0 0 0 : slabdata 3 3 0 kmalloc-cg-512 32 32 512 8 1 : tunables 0 0 0 : slabdata 4 4 0 kmalloc-cg-256 16 16 256 16 1 : tunables 0 0 0 : slabdata 1 1 0 kmalloc-cg-192 147 147 192 21 1 : tunables 0 0 0 : slabdata 7 7 0 kmalloc-cg-128 64 64 128 32 1 : tunables 0 0 0 : slabdata 2 2 0 kmalloc-cg-64 320 320 64 64 1 : tunables 0 0 0 : slabdata 5 5 0 kmalloc-rcl-8k 0 0 8192 4 8 : tunables 0 0 0 : slabdata 0 0 0 kmalloc-rcl-4k 0 0 4096 8 8 : tunables 0 0 0 : slabdata 0 0 0 kmalloc-rcl-2k 0 0 2048 8 4 : tunables 0 0 0 : slabdata 0 0 0 kmalloc-rcl-1k 0 0 1024 8 2 : tunables 0 0 0 : slabdata 0 0 0 kmalloc-rcl-512 0 0 512 8 1 : tunables 0 0 0 : slabdata 0 0 0 kmalloc-rcl-256 0 0 256 16 1 : tunables 0 0 0 : slabdata 0 0 0 kmalloc-rcl-192 0 0 192 21 1 : tunables 0 0 0 : slabdata 0 0 0 
kmalloc-rcl-128 320 320 128 32 1 : tunables 0 0 0 : slabdata 10 10 0 kmalloc-rcl-64 64 64 64 64 1 : tunables 0 0 0 : slabdata 1 1 0 kmalloc-8k 12 12 8192 4 8 : tunables 0 0 0 : slabdata 3 3 0 kmalloc-4k 16 16 4096 8 8 : tunables 0 0 0 : slabdata 2 2 0 kmalloc-2k 64 64 2048 8 4 : tunables 0 0 0 : slabdata 8 8 0 kmalloc-1k 840 840 1024 8 2 : tunables 0 0 0 : slabdata 105 105 0 kmalloc-512 144 144 512 8 1 : tunables 0 0 0 : slabdata 18 18 0 kmalloc-256 816 816 256 16 1 : tunables 0 0 0 : slabdata 51 51 0 kmalloc-192 252 252 192 21 1 : tunables 0 0 0 : slabdata 12 12 0 kmalloc-128 480 480 128 32 1 : tunables 0 0 0 : slabdata 15 15 0 kmalloc-64 4912 4928 64 64 1 : tunables 0 0 0 : slabdata 77 77 0 kmem_cache_node 128 128 128 32 1 : tunables 0 0 0 : slabdata 4 4 0 kmem_cache 126 126 192 21 1 : tunables 0 0 0 : slabdata 6 6 0 > > Arnd
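[Editorial note] Dumps like the one above are easier to compare once each cache is collapsed to its resident footprint (num_objs * objsize). A minimal sketch of that arithmetic — the helper name is mine, not from the thread, and the sample lines are copied from the lp64 dump above:

```python
# Summarize a /proc/slabinfo paste: footprint per cache = num_objs * objsize.
def slab_footprints(slabinfo_text):
    totals = {}
    for line in slabinfo_text.splitlines():
        if not line or line.startswith(("slabinfo", "#")):
            continue  # skip the version and column-header lines
        fields = line.split()
        name, num_objs, objsize = fields[0], int(fields[2]), int(fields[3])
        totals[name] = num_objs * objsize  # bytes held resident by this cache
    # largest consumers first
    return dict(sorted(totals.items(), key=lambda kv: -kv[1]))

sample = """\
kernfs_node_cache 14304 14304 128 32 1 : tunables 0 0 0 : slabdata 447 447 0
kmalloc-1k 840 840 1024 8 2 : tunables 0 0 0 : slabdata 105 105 0
dentry 1188 1188 216 18 1 : tunables 0 0 0 : slabdata 66 66 0
"""
top = slab_footprints(sample)
print(top)  # kernfs_node_cache alone is 14304 * 128 bytes, about 1.75 MiB
```

Running this over both kernels' dumps gives the per-cache deltas directly, which is the comparison made later in the thread.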
On Sat, May 20, 2023, at 04:53, Guo Ren wrote:
> On Sat, May 20, 2023 at 4:20 AM Arnd Bergmann <arnd@arndb.de> wrote:
>> On Thu, May 18, 2023, at 15:09, guoren@kernel.org wrote:
>>
>> I've tried to run the same numbers for the debate about running
>> 32-bit vs 64-bit arm kernels in the past, focused mostly on
>> slightly larger systems; I looked mainly at the 512MB case,
>> as that is the most cost-efficient DDR3 memory configuration
>> and fairly common.
> 512MB is extravagant, in my opinion. In the IPC market, 32/64MB is for
> 480p/720p/1080p, 128/256MB is for 1080p/2K, and 512/1024MB is for 4K.
> 512MB chips are less than 5% of the total (I guess). Even in 512MB
> chips, the additional memory is for the frame buffer, not the Linux
> system.

This depends a lot on the target application of course. For
a phone or NAS box, 512MB is probably the lower limit.

What I observe in arch/arm/ devicetree submissions, on board-db.org,
and when looking at industrial Arm board vendor websites is that
512MB is the most common configuration, and I think 1GB is still
more common than 256MB even for 32-bit machines. There is of course
a difference between the number of individual products and the number
of machines shipped in a given configuration, and I guess you have
a good point that the cheapest ones are also the ones that ship
in the highest volume.

>> What I'd like to understand better in your example is where
>> the 14MB of memory went. I assume this is for 128MB of total
>> RAM, so we know that 1MB went into additional 'struct page'
>> objects (32 bytes * 32768 pages). It would be good to know
>> where the dynamic allocations went and if they are reclaimable
>> (e.g. inodes) or non-reclaimable (e.g. kmalloc-128).
>>
>> For the vmlinux size, is this already a minimal config
>> that one would run on a board with 128MB of RAM, or a
>> defconfig that includes a lot of stuff that is only relevant
>> for other platforms but also grows on 64-bit?
> It's not a minimal config, it's defconfig. So I'd say it's a rough
> measurement :)
>
> I admit I wanted to exaggerate it a little bit, but that's the
> starting point for cutting down memory usage for most people, right?
> During the past year, we have been convincing our customers to use
> s64lp64 + u32ilp32, but they can't tolerate even 1% additional memory
> cost in 64MB/128MB scenarios, and then chose cortex-a7/a35, which can
> run 32-bit Linux. I think it's too early to talk about throwing 32-bit
> Linux into the garbage, not only because of the memory footprint
> but also because of people's ingrained opinions. Changing their minds
> takes a long time.
>
>> What do you see in /proc/slabinfo, /proc/meminfo, and
>> 'size vmlinux' for the s64ilp32 and s64lp64 kernels here?
> Both s64ilp32 & s64lp64 use the same u32ilp32_rootfs.ext2 binary and
> the same opensbi binary.
> All are opensbi (2MB) + Linux (126MB) memory layouts.
>
> Here is the result:
>
> s64ilp32:
> [    0.000000] Virtual kernel memory layout:
> [    0.000000]   fixmap : 0x9ce00000 - 0x9d000000 (2048 kB)
> [    0.000000]   pci io : 0x9d000000 - 0x9e000000 (  16 MB)
> [    0.000000]  vmemmap : 0x9e000000 - 0xa0000000 (  32 MB)
> [    0.000000]  vmalloc : 0xa0000000 - 0xc0000000 ( 512 MB)
> [    0.000000]   lowmem : 0xc0000000 - 0xc7e00000 ( 126 MB)
> [    0.000000] Memory: 97748K/129024K available (8699K kernel code,
> 8867K rwdata, 4096K rodata, 4204K init, 361K bss, 31276K reserved, 0K
> cma-reserved)

Ok, so it saves only a little bit on .text/.init/.bss/.rodata, but
there is a 4MB difference in rwdata, and a total of 10.4MB difference
in "reserved" size, which I think includes all of the above plus
the mem_map[] array.
> 89380K/131072K available (8638K kernel code, 4979K rwdata, 4096K
> rodata, 2191K init, 477K bss, 41692K reserved, 0K cma-reserved)

Oddly, I don't see anywhere close to 8MB in a riscv64 defconfig
build (linux-next, gcc-13), so I don't know where that comes
from:

$ size -A build/tmp/vmlinux | sort -k2 -nr | head
Total              13518684
.text               8896058   18446744071562076160
.rodata             2219008   18446744071576748032
.data                933760   18446744071583039488
.bss                 476080   18446744071584092160
.init.text           264718   18446744071572553728
__ksymtab_strings    183986   18446744071579214312
__ksymtab_gpl        122928   18446744071579091384
__ksymtab            109080   18446744071578982304
__bug_table           98352   18446744071583973248

> KReclaimable:        644 kB
> Slab:               4536 kB
> SReclaimable:        644 kB
> SUnreclaim:         3892 kB
> KernelStack:         344 kB

These look like the only notable differences in meminfo:

KReclaimable:       1092 kB
Slab:               6900 kB
SReclaimable:       1092 kB
SUnreclaim:         5808 kB
KernelStack:         688 kB

The largest chunk here is 2MB in non-reclaimable slab allocations,
or a 50% growth of those.

The kernel stacks are doubled as expected, but that's only 344KB,
and similarly for reclaimable slabs.

> # cat /proc/slabinfo
>
> slabinfo - version: 2.1
> # name <active_objs> <num_objs> <objsize> <objperslab>
> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> :
> slabdata <active_slabs> <num_slabs> <sharedavail>
> ext4_groupinfo_1k 28 28 144 28 1 : tunables 0 0
> 0 : slabdata 1 1 0
> p9_req_t 0 0 104 39 1 : tunables 0 0

Did you perhaps miss a few lines while pasting these? It seems
odd that some caches only show up in the ilp32 case (proc_dir_entry,
jbd2_journal_handle, buffer_head, biovec_max, anon_vma_chain, ...) and
some others are only in the lp64 case (UNIX, ext4_prealloc_space,
files_cache, filp, ip_fib_alias, task_struct, uid_cache, ...).
Looking at the ones that are in both and have the largest size
increase, I see

# lp64
1788 kernfs_node_cache 14304 128
 590 shmem_inode_cache 646 936
 272 inode_cache 360 776
 153 ext4_inode_cache 105 1496
 250 dentry 1188 216
 192 names_cache 48 4096
 199 radix_tree_node 350 584
 307 kmalloc-64 4912 64
  60 kmalloc-128 480 128
  47 kmalloc-192 252 192
 204 kmalloc-256 816 256
  72 kmalloc-512 144 512
 840 kmalloc-1k 840 1024

# ilp32
1197 kernfs_node_cache 13938 88
 373 shmem_inode_cache 637 600
 174 inode_cache 360 496
  84 ext4_inode_cache 88 984
 177 dentry 1196 152
  32 names_cache 8 4096
 100 radix_tree_node 338 304
 331 kmalloc-64 5302 64
 132 kmalloc-128 1056 128
  23 kmalloc-192 126 192
  16 kmalloc-256 64 256
 428 kmalloc-512 856 512
  88 kmalloc-1k 88 1024

So sysfs (kernfs_node_cache) has the largest chunk of the
2MB non-reclaimable slab, grown 50% from 1.2MB to 1.8MB.
In some cases, this could be avoided entirely by turning
off sysfs, but most users can't do that.
shmem_inode_cache is probably mostly devtmpfs; the
other inode caches are smaller and likely reclaimable.

It's interesting how the largest slab cache ends up
being the kmalloc-1k cache (840 1K objects) on lp64,
but the kmalloc-512 cache (856 512B objects) on ilp32.
My guess is that the majority of this is from a single
callsite that has an allocation growing just beyond 512B.
This alone seems significant enough to need further
investigation; I would hope we can completely avoid
these by adding a custom slab cache. I don't see this
effect on an arm64 boot though; for me the 512B allocations
are much higher than the 1K ones.

Maybe you can identify the culprit using the boot-time traces
as listed in https://elinux.org/Kernel_dynamic_memory_analysis#Dynamic
That might help everyone running a 64-bit kernel on
low-memory configurations, though it would of course slightly
weaken your argument for an ilp32 kernel ;-)

     Arnd
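[Editorial note] Arnd's point about an allocation "growing just beyond 512B" follows from how the generic kmalloc caches round requests up to fixed size classes. A sketch of that rounding — the size-class list below is the usual generic set, but the exact set depends on kernel configuration:

```python
# Generic kmalloc size classes: 8, 16, 32, 64, 96, 128, 192, 256, 512, 1024, ...
# Each request is served from the smallest cache that fits it.
KMALLOC_SIZES = [8, 16, 32, 64, 96, 128, 192, 256, 512, 1024, 2048, 4096, 8192]

def kmalloc_bucket(size):
    for s in KMALLOC_SIZES:
        if size <= s:
            return s
    raise ValueError("larger requests fall back to the page allocator")

# An object that is exactly 512 bytes on ilp32 fits the 512B cache; if
# widened pointers/longs push it to, say, 520 bytes on lp64, every object
# suddenly occupies a 1024B slot - nearly 100% per-object overhead.
print(kmalloc_bucket(512), kmalloc_bucket(520))
```

That doubling is consistent with the kmalloc-512 population on ilp32 reappearing as kmalloc-1k on lp64 in the tables above.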
On Sat, May 20, 2023 at 6:13 PM Arnd Bergmann <arnd@arndb.de> wrote:
>
> On Sat, May 20, 2023, at 04:53, Guo Ren wrote:
> > On Sat, May 20, 2023 at 4:20 AM Arnd Bergmann <arnd@arndb.de> wrote:
> >> On Thu, May 18, 2023, at 15:09, guoren@kernel.org wrote:
> >>
> >> I've tried to run the same numbers for the debate about running
> >> 32-bit vs 64-bit arm kernels in the past, focused mostly on
> >> slightly larger systems; I looked mainly at the 512MB case,
> >> as that is the most cost-efficient DDR3 memory configuration
> >> and fairly common.
> > 512MB is extravagant, in my opinion. In the IPC market, 32/64MB is for
> > 480p/720p/1080p, 128/256MB is for 1080p/2K, and 512/1024MB is for 4K.
> > 512MB chips are less than 5% of the total (I guess). Even in 512MB
> > chips, the additional memory is for the frame buffer, not the Linux
> > system.
>
> This depends a lot on the target application of course. For
> a phone or NAS box, 512MB is probably the lower limit.
>
> What I observe in arch/arm/ devicetree submissions, on board-db.org,
> and when looking at industrial Arm board vendor websites is that
> 512MB is the most common configuration, and I think 1GB is still
> more common than 256MB even for 32-bit machines. There is of course
> a difference between the number of individual products and the number
> of machines shipped in a given configuration, and I guess you have
> a good point that the cheapest ones are also the ones that ship
> in the highest volume.
>
> >> What I'd like to understand better in your example is where
> >> the 14MB of memory went. I assume this is for 128MB of total
> >> RAM, so we know that 1MB went into additional 'struct page'
> >> objects (32 bytes * 32768 pages). It would be good to know
> >> where the dynamic allocations went and if they are reclaimable
> >> (e.g. inodes) or non-reclaimable (e.g. kmalloc-128).
> >>
> >> For the vmlinux size, is this already a minimal config
> >> that one would run on a board with 128MB of RAM, or a
> >> defconfig that includes a lot of stuff that is only relevant
> >> for other platforms but also grows on 64-bit?
> > It's not a minimal config, it's defconfig. So I'd say it's a rough
> > measurement :)
> >
> > I admit I wanted to exaggerate it a little bit, but that's the
> > starting point for cutting down memory usage for most people, right?
> > During the past year, we have been convincing our customers to use
> > s64lp64 + u32ilp32, but they can't tolerate even 1% additional memory
> > cost in 64MB/128MB scenarios, and then chose cortex-a7/a35, which can
> > run 32-bit Linux. I think it's too early to talk about throwing 32-bit
> > Linux into the garbage, not only because of the memory footprint
> > but also because of people's ingrained opinions. Changing their minds
> > takes a long time.
> >
> >> What do you see in /proc/slabinfo, /proc/meminfo, and
> >> 'size vmlinux' for the s64ilp32 and s64lp64 kernels here?
> > Both s64ilp32 & s64lp64 use the same u32ilp32_rootfs.ext2 binary and
> > the same opensbi binary.
> > All are opensbi (2MB) + Linux (126MB) memory layouts.
> >
> > Here is the result:
> >
> > s64ilp32:
> > [    0.000000] Virtual kernel memory layout:
> > [    0.000000]   fixmap : 0x9ce00000 - 0x9d000000 (2048 kB)
> > [    0.000000]   pci io : 0x9d000000 - 0x9e000000 (  16 MB)
> > [    0.000000]  vmemmap : 0x9e000000 - 0xa0000000 (  32 MB)
> > [    0.000000]  vmalloc : 0xa0000000 - 0xc0000000 ( 512 MB)
> > [    0.000000]   lowmem : 0xc0000000 - 0xc7e00000 ( 126 MB)
> > [    0.000000] Memory: 97748K/129024K available (8699K kernel code,
> > 8867K rwdata, 4096K rodata, 4204K init, 361K bss, 31276K reserved, 0K
> > cma-reserved)
>
> Ok, so it saves only a little bit on .text/.init/.bss/.rodata, but
> there is a 4MB difference in rwdata, and a total of 10.4MB difference
> in "reserved" size, which I think includes all of the above plus
> the mem_map[] array.
>
> > 89380K/131072K available (8638K kernel code, 4979K rwdata, 4096K
> > rodata, 2191K init, 477K bss, 41692K reserved, 0K cma-reserved)
>
> Oddly, I don't see anywhere close to 8MB in a riscv64 defconfig
> build (linux-next, gcc-13), so I don't know where that comes
> from:
>
> $ size -A build/tmp/vmlinux | sort -k2 -nr | head
> Total              13518684
> .text               8896058   18446744071562076160
> .rodata             2219008   18446744071576748032
> .data                933760   18446744071583039488
> .bss                 476080   18446744071584092160
> .init.text           264718   18446744071572553728
> __ksymtab_strings    183986   18446744071579214312
> __ksymtab_gpl        122928   18446744071579091384
> __ksymtab            109080   18446744071578982304
> __bug_table           98352   18446744071583973248
>
> > KReclaimable:        644 kB
> > Slab:               4536 kB
> > SReclaimable:        644 kB
> > SUnreclaim:         3892 kB
> > KernelStack:         344 kB
>
> These look like the only notable differences in meminfo:
>
> KReclaimable:       1092 kB
> Slab:               6900 kB
> SReclaimable:       1092 kB
> SUnreclaim:         5808 kB
> KernelStack:         688 kB
>
> The largest chunk here is 2MB in non-reclaimable slab allocations,
> or a 50% growth of those.
>
> The kernel stacks are doubled as expected, but that's only 344KB,
> and similarly for reclaimable slabs.
>
> > # cat /proc/slabinfo
> >
> > slabinfo - version: 2.1
> > # name <active_objs> <num_objs> <objsize> <objperslab>
> > <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> :
> > slabdata <active_slabs> <num_slabs> <sharedavail>
> > ext4_groupinfo_1k 28 28 144 28 1 : tunables 0 0
> > 0 : slabdata 1 1 0
> > p9_req_t 0 0 104 39 1 : tunables 0 0
>
> Did you perhaps miss a few lines while pasting these? It seems
> odd that some caches only show up in the ilp32 case (proc_dir_entry,
> jbd2_journal_handle, buffer_head, biovec_max, anon_vma_chain, ...) and
> some others are only in the lp64 case (UNIX, ext4_prealloc_space,
> files_cache, filp, ip_fib_alias, task_struct, uid_cache, ...).
>
> Looking at the ones that are in both and have the largest size
> increase, I see
>
> # lp64
> 1788 kernfs_node_cache 14304 128
>  590 shmem_inode_cache 646 936
>  272 inode_cache 360 776
>  153 ext4_inode_cache 105 1496
>  250 dentry 1188 216
>  192 names_cache 48 4096
>  199 radix_tree_node 350 584
>  307 kmalloc-64 4912 64
>   60 kmalloc-128 480 128
>   47 kmalloc-192 252 192
>  204 kmalloc-256 816 256
>   72 kmalloc-512 144 512
>  840 kmalloc-1k 840 1024
>
> # ilp32
> 1197 kernfs_node_cache 13938 88
>  373 shmem_inode_cache 637 600
>  174 inode_cache 360 496
>   84 ext4_inode_cache 88 984
>  177 dentry 1196 152
>   32 names_cache 8 4096
>  100 radix_tree_node 338 304
>  331 kmalloc-64 5302 64
>  132 kmalloc-128 1056 128
>   23 kmalloc-192 126 192
>   16 kmalloc-256 64 256
>  428 kmalloc-512 856 512
>   88 kmalloc-1k 88 1024
>
> So sysfs (kernfs_node_cache) has the largest chunk of the
> 2MB non-reclaimable slab, grown 50% from 1.2MB to 1.8MB.
> In some cases, this could be avoided entirely by turning
> off sysfs, but most users can't do that.
> shmem_inode_cache is probably mostly devtmpfs; the
> other inode caches are smaller and likely reclaimable.
>
> It's interesting how the largest slab cache ends up
> being the kmalloc-1k cache (840 1K objects) on lp64,
> but the kmalloc-512 cache (856 512B objects) on ilp32.
> My guess is that the majority of this is from a single
> callsite that has an allocation growing just beyond 512B.
> This alone seems significant enough to need further
> investigation; I would hope we can completely avoid
> these by adding a custom slab cache. I don't see this
> effect on an arm64 boot though; for me the 512B allocations
> are much higher than the 1K ones.
>
> Maybe you can identify the culprit using the boot-time traces
> as listed in https://elinux.org/Kernel_dynamic_memory_analysis#Dynamic
> That might help everyone running a 64-bit kernel on
> low-memory configurations, though it would of course slightly
> weaken your argument for an ilp32 kernel ;-)

Thanks for the detailed reply; I will try the approaches you mentioned
later. But those cover the traditional CONFIG_32BIT vs. CONFIG_64BIT
comparison.

Besides the detailed analysis data, we also face a perception problem.
For struct page, struct list_head, and variables containing pointers,
the ilp32 versions are significantly smaller than the lp64 ones. That
means ilp32 is smaller than lp64 in people's minds, and this perception
prevents vendors from accepting lp64 as a cost-down solution. They
won't even try, which is what I've seen over these years.

I was an lp64 kernel supporter last year, but I met a lot of arguments
against s64lp64 + u32ilp32. Some people are using arm32 Linux; they
want to stay on 32-bit Linux to ensure their complex C code keeps
working. So our "ilp32 vs. lp64" argument won't reach a conclusion.
Let's look at it from another angle: cache utilization. These
64/128MB SoCs also have limited cache capacities (L1-32KB + L2-128KB,
or only L1-64KB), and list walking and stack saving/restoring are very
common in Linux. What do you think about "32-bit vs. 64-bit" cache
utilization?

>
>      Arnd
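[Editorial note] The pointer-size and cache-utilization effect described above can be put into numbers with a back-of-envelope model — the layouts below are simplified illustrations, not the real kernel structure definitions:

```python
# ilp32 has 4-byte pointers and longs; lp64 has 8-byte ones.
def struct_size(n_pointers, n_longs, other_bytes, ptr_size):
    return (n_pointers + n_longs) * ptr_size + other_bytes

# struct list_head { struct list_head *next, *prev; }
list_head_ilp32 = struct_size(2, 0, 0, 4)   # two 4-byte pointers
list_head_lp64  = struct_size(2, 0, 0, 8)   # two 8-byte pointers

# A 64-byte L1 cache line therefore holds twice as many list nodes,
# so a linked-list walk touches half as many cache lines under ilp32.
nodes_per_line_ilp32 = 64 // list_head_ilp32
nodes_per_line_lp64  = 64 // list_head_lp64
print(list_head_ilp32, list_head_lp64,
      nodes_per_line_ilp32, nodes_per_line_lp64)
```

The same arithmetic applies to struct page and on-stack spill slots, which is why the footprint gap shows up again as a cache-miss gap on the small L1/L2 configurations mentioned.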
On Fri, May 19, 2023 at 8:14 AM Paul Walmsley <paul.walmsley@sifive.com> wrote:
>
> On Thu, 18 May 2023, Palmer Dabbelt wrote:
>
> > On Thu, 18 May 2023 06:09:51 PDT (-0700), guoren@kernel.org wrote:
> >
> > > This patch series adds s64ilp32 support to riscv. The term s64ilp32
> > > means smode-xlen=64 and -mabi=ilp32 (ints, longs, and pointers are
> > > all 32-bit), i.e., running a 32-bit Linux kernel in pure 64-bit
> > > supervisor mode. Many 64ilp32 ABIs already exist, such as mips-n32
> > > [1], arm-aarch64ilp32 [2], and x86-x32 [3], but they are all about
> > > userspace. Thus, this should be the first time running a 32-bit
> > > Linux kernel with the 64ilp32 ABI at supervisor mode (if not,
> > > correct me).
> >
> > Does anyone actually want this? At a bare minimum we'd need to add it
> > to the psABI, which would presumably also be required on the compiler
> > side of things.
> >
> > It's not even clear anyone wants rv64/ilp32 in userspace; the kernel
> > seems like it'd be even less widely used.
>
> We've certainly talked to folks who are interested in RV64 ILP32
> userspace with an LP64 kernel. The motivation is the usual one: to
> reduce data size and therefore (ideally) BOM cost. I think this work,
> if it goes forward, would need to go hand in hand with the RVIA psABI
> group.
>
> The RV64 ILP32 kernel and ILP32 userspace approach implemented by this
> patch is intriguing, but I guess for me, the question is whether it's
> worth the extra hassle vs. a pure RV32 kernel & userspace.

Running a pure RV32 kernel on 64-bit hardware (such as
cortex-a35/a53/a55) is not a sensible choice, because it wastes the
64-bit hardware capabilities, and the hardware designer has to spend
additional resources & time on the 32-bit machine & supervisor modes
(in Arm these are the EL3/EL2/EL1 exception levels). Think about all
the duplicated PMP CSRs, PMU CSRs, and mode switching ... it's
definitely wrong to follow the cortex-a35/a53/a55 way of dealing with
riscv32 on 64-bit hardware.
The chapter "Why s64ilp32 has better performance?" gives the
improvements vs. pure 32-bit; I repeat them here:

 - memcpy/memset/strcmp (s64ilp32 needs half the number of
   instructions and has double the bandwidth per load/store
   instruction compared with s32ilp32.)
 - eBPF (the JIT targets a 64-bit virtual ISA, which can't be mapped
   efficiently by s32ilp32, but s64ilp32 can, just like s64lp64.)
 - Atomic64 (s64ilp32 has exactly the same native instruction mapping
   as s64lp64, while s32ilp32 only uses generic_atomic64, a tradeoff &
   limited software solution.)
 - 64-bit native arithmetic instructions for the "long long" type.
 - riscv s64ilp32 could support cmpxchg_double for slub (it would be
   the 2nd 32-bit Linux port to support the feature; the 1st is i386.)

>
> - Paul
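[Editorial note] The memcpy/memset claim in the first bullet is simple arithmetic on access width; a rough model, assuming aligned buffers and ignoring loop overhead:

```python
# Copying N aligned bytes with XLEN-wide accesses takes one load plus one
# store per XLEN/8 bytes, so a 64-bit data path halves the instruction
# count and doubles the bytes moved per memory access.
def copy_ops(n_bytes, xlen):
    access = xlen // 8               # bytes moved per load or store
    return 2 * (n_bytes // access)   # one load + one store per access

ops32 = copy_ops(4096, 32)   # s32ilp32: 2048 instructions for a 4 KiB copy
ops64 = copy_ops(4096, 64)   # s64ilp32/s64lp64: 1024 instructions
print(ops32, ops64, ops32 // ops64)
```

This is only the straight-line instruction count; real memcpy implementations unroll and handle tails, but the 2x width advantage carries through.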