[1/2] Documentation: riscv: Add early boot document

Message ID 20230619094705.51337-1-alexghiti@rivosinc.com
State New
Headers
Series [1/2] Documentation: riscv: Add early boot document |

Commit Message

Alexandre Ghiti June 19, 2023, 9:47 a.m. UTC
  This document describes the constraints and requirements of the early
boot process in a RISC-V kernel.

Szigned-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
---
 Documentation/riscv/boot-image-header.rst |   3 -
 Documentation/riscv/boot.rst              | 181 ++++++++++++++++++++++
 Documentation/riscv/index.rst             |   1 +
 3 files changed, 182 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/riscv/boot.rst
  

Comments

Alexandre Ghiti June 19, 2023, 9:49 a.m. UTC | #1
@Sunil V L Something about ACPI is more than welcome :)

And thanks to @Björn Töpel and @Atish Kumar Patra for helping in
writing this document!

On Mon, Jun 19, 2023 at 11:47 AM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
>
> This document describes the constraints and requirements of the early
> boot process in a RISC-V kernel.
>
> Szigned-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
> ---
>  Documentation/riscv/boot-image-header.rst |   3 -
>  Documentation/riscv/boot.rst              | 181 ++++++++++++++++++++++
>  Documentation/riscv/index.rst             |   1 +
>  3 files changed, 182 insertions(+), 3 deletions(-)
>  create mode 100644 Documentation/riscv/boot.rst
>
> diff --git a/Documentation/riscv/boot-image-header.rst b/Documentation/riscv/boot-image-header.rst
> index d7752533865f..a4a45310c4c4 100644
> --- a/Documentation/riscv/boot-image-header.rst
> +++ b/Documentation/riscv/boot-image-header.rst
> @@ -7,9 +7,6 @@ Boot image header in RISC-V Linux
>
>  This document only describes the boot image header details for RISC-V Linux.
>
> -TODO:
> -  Write a complete booting guide.
> -
>  The following 64-byte header is present in decompressed Linux kernel image::
>
>         u32 code0;                /* Executable code */
> diff --git a/Documentation/riscv/boot.rst b/Documentation/riscv/boot.rst
> new file mode 100644
> index 000000000000..b02230818b79
> --- /dev/null
> +++ b/Documentation/riscv/boot.rst
> @@ -0,0 +1,181 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=============================================
> +Early boot requirements/constraints on RISC-V
> +=============================================
> +
> +:Author: Alexandre Ghiti <alexghiti@rivosinc.com>
> +:Date: 23 May 2023
> +
> +This document describes what the RISC-V kernel expects from the previous stages
> +and the firmware, but also the constraints that any developer must have in mind
> +when touching the early boot process, e.g. before the final virtual mapping is
> +setup.
> +
> +Pre-kernel boot (Expectations from firmware)
> +============================================
> +
> +Registers state
> +---------------
> +
> +The RISC-V kernel expects:
> +
> +  * `$a0` to contain the hartid of the current core.
> +  * `$a1` to contain the address of the device tree in memory.
> +
> +CSR state
> +---------
> +
> +The RISC-V kernel expects:
> +
> +  * `$satp = 0`: the MMU must be disabled.
> +
> +Reserved memory for resident firmware
> +-------------------------------------
> +
> +The RISC-V kernel expects the firmware to mark any resident memory with the
> +`no-map` flag, thus the kernel won't map those regions in the direct mapping
> +(avoiding issues with hibernation, speculative accesses and probably other
> +subsystems).
> +
> +Kernel location
> +---------------
> +
> +The RISC-V kernel expects to be placed at a PMD boundary (2MB for rv64 and 4MB
> +for rv32). Note though that the EFI stub will physically relocate the kernel if
> +that's not the case.
> +
> +Device-tree
> +-----------
> +
> +The RISC-V kernel always expects a device tree, it is:
> +
> +- either passed directly to the kernel from the previous stage using the `$a1`
> +  register,
> +- or when booting with UEFI, the device tree will be retrieved by the EFI stub
> +  using the EFI configuration table or it will be created.
> +
> +Bootflow
> +--------
> +
> +There exist 2 methods to enter the kernel:
> +
> +- `RISCV_BOOT_SPINWAIT`: the firmware releases all harts in the kernel, one hart
> +  wins a lottery and executes the early boot code while the other harts are
> +  parked waiting for the initialization to finish. This method is now
> +  **deprecated**.
> +- Ordered booting: the firmware releases only one hart that will execute the
> +  initialization phase and then will start all other harts using the SBI HSM
> +  extension.
> +
> +UEFI
> +----
> +
> +UEFI memory map
> +~~~~~~~~~~~~~~~
> +
> +When booting with UEFI, the RISC-V kernel will use only the EFI memory map to
> +populate the system memory.
> +
> +The UEFI firmware must parse the subnodes of the `/reserved-memory` device tree
> +node and abide by the device tree specification to convert the attributes of
> +those subnodes (`no-map` and `reusable`) into their correct EFI equivalent
> +(refer to section "3.5.4 /reserved-memory and UEFI" of the device tree
> +specification).
> +
> +RISCV_EFI_BOOT_PROTOCOL
> +~~~~~~~~~~~~~~~~~~~~~~~
> +
> +When booting with UEFI, the EFI stub requires the boot hartid in order to pass
> +it to the RISC-V kernel in `$a1`. The EFI stub retrieves the boot hartid using
> +one of the following methods:
> +
> +- `RISCV_EFI_BOOT_PROTOCOL` (**preferred**).
> +- `boot-hartid` device tree subnode (**deprecated**).
> +
> +Any new firmware must implement `RISCV_EFI_BOOT_PROTOCOL` as the device tree
> +based approach is deprecated now.
> +
> +During kernel boot: (Kernel internals)
> +======================================
> +
> +EFI stub and device tree
> +------------------------
> +
> +When booting with UEFI, the device tree is supplemented by the EFI stub with the
> +following parameters (largely shared with arm64 in Documentation/arm/uefi.rst):
> +
> +==========================  ======   ===========================================
> +Name                        Size     Description
> +==========================  ======   ===========================================
> +linux,uefi-system-table     64-bit   Physical address of the UEFI System Table.
> +
> +linux,uefi-mmap-start       64-bit   Physical address of the UEFI memory map,
> +                                     populated by the UEFI GetMemoryMap() call.
> +
> +linux,uefi-mmap-size        32-bit   Size in bytes of the UEFI memory map
> +                                     pointed to in previous entry.
> +
> +linux,uefi-mmap-desc-size   32-bit   Size in bytes of each entry in the UEFI
> +                                     memory map.
> +
> +linux,uefi-mmap-desc-ver    32-bit   Version of the mmap descriptor format.
> +
> +kaslr-seed                  64-bit   Entropy used to randomize the kernel image
> +                                     base address location.
> +
> +bootargs                             Kernel command line
> +==========================  ======   ===========================================
> +
> +Virtual mapping setup
> +---------------------
> +
> +The installation of the virtual mapping is done in 2 steps in the RISC-V kernel:
> +
> +1. :c:func:`setup_vm` installs a temporary kernel mapping in
> +   :c:var:`early_pg_dir` which allows to discover the system memory: only the
> +   kernel text/data are mapped at this point. When establishing this mapping,
> +   no allocation can be done (since the system memory is not known yet), so
> +   :c:var:`early_pg_dir` page table is statically allocated (using only one
> +   table for each level).
> +
> +2. :c:func:`setup_vm_final` creates the final kernel mapping in
> +   :c:var:`swapper_pg_dir` and takes advantage of the discovered system memory
> +   to create the linear mapping. When establishing this mapping, the kernel
> +   can allocate memory but cannot access it directly (since the direct mapping
> +   is not present yet), so it uses temporary mappings in the fixmap region to
> +   be able to access the newly allocated page table levels.
> +
> +For :c:func:`virt_to_phys` and :c:func:`phys_to_virt` to be able to correctly
> +convert direct mapping addresses to physical addresses, it needs to know the
> +start of the DRAM: this happens after 1, right before 2 installs the direct
> +mapping (see :c:func:`setup_bootmem` function in arch/riscv/mm/init.c). So
> +any usage of those macros before the final virtual mapping is installed must be
> +carefully examined.
> +
> +Device-tree mapping via fixmap
> +------------------------------
> +
> +The RISC-V kernel uses the fixmap region to map the device tree because the
> +device tree virtual mapping must remain the same between :c:func:`setup_vm` and
> +:c:func:`setup_vm_final` calls since :c:var:`reserved_mem` array is initialized
> +with virtual addresses established by :c:func:`setup_vm` and used with the
> +mapping established by :c:func:`setup_vm_final`.
> +
> +Pre-MMU execution
> +-----------------
> +
> +Any code that executes before even the first virtual mapping is established
> +must be very carefully compiled as:
> +
> +- `-fno-pie`: This is needed for relocatable kernels which use `-fPIE`, since
> +  otherwise, any access to a global symbol would go through the GOT which is
> +  only relocated virtually.
> +- `-mcmodel=medany`: Any access to a global symbol must be PC-relative to avoid
> +  any relocations to happen before the MMU is setup.
> +- Also note that *all* instrumentation must also be disabled (that includes
> +  KASAN, ftrace and others).
> +
> +As using a symbol from a different compilation unit requires this unit to be
> +compiled with those flags, we advise, as much as possible, not to use external
> +symbols.
> diff --git a/Documentation/riscv/index.rst b/Documentation/riscv/index.rst
> index 175a91db0200..1f66062def6d 100644
> --- a/Documentation/riscv/index.rst
> +++ b/Documentation/riscv/index.rst
> @@ -5,6 +5,7 @@ RISC-V architecture
>  .. toctree::
>      :maxdepth: 1
>
> +    boot
>      boot-image-header
>      vm-layout
>      hwprobe
> --
> 2.39.2
>
  
Björn Töpel June 19, 2023, 11 a.m. UTC | #2
Alexandre Ghiti <alexghiti@rivosinc.com> writes:

> This document describes the constraints and requirements of the early
> boot process in a RISC-V kernel.
>
> Szigned-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>

Additional 'z'; "Signed-off-by:"

Nice writeup!

Reviewed-by: Björn Töpel <bjorn@rivosinc.com>
  
Conor Dooley June 19, 2023, 12:25 p.m. UTC | #3
Hey Alex,

Thanks for working on this :) I've got a mix of suggestions and
questions below. Hopefully it is not too disjoint, since I didn't write
them in order.

On Mon, Jun 19, 2023 at 11:47:04AM +0200, Alexandre Ghiti wrote:
> This document describes the constraints and requirements of the early
> boot process in a RISC-V kernel.
> 
> Szigned-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
> ---
>  Documentation/riscv/boot-image-header.rst |   3 -
>  Documentation/riscv/boot.rst              | 181 ++++++++++++++++++++++
>  Documentation/riscv/index.rst             |   1 +
>  3 files changed, 182 insertions(+), 3 deletions(-)
>  create mode 100644 Documentation/riscv/boot.rst
> 
> diff --git a/Documentation/riscv/boot-image-header.rst b/Documentation/riscv/boot-image-header.rst
> index d7752533865f..a4a45310c4c4 100644
> --- a/Documentation/riscv/boot-image-header.rst
> +++ b/Documentation/riscv/boot-image-header.rst
> @@ -7,9 +7,6 @@ Boot image header in RISC-V Linux
>  
>  This document only describes the boot image header details for RISC-V Linux.
>  
> -TODO:
> -  Write a complete booting guide.
> -
>  The following 64-byte header is present in decompressed Linux kernel image::
>  
>  	u32 code0;		  /* Executable code */
> diff --git a/Documentation/riscv/boot.rst b/Documentation/riscv/boot.rst
> new file mode 100644
> index 000000000000..b02230818b79
> --- /dev/null
> +++ b/Documentation/riscv/boot.rst
> @@ -0,0 +1,181 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=============================================
> +Early boot requirements/constraints on RISC-V
> +=============================================

Please use "title case", here and elsewhere in the doc.
I'd also be inclined to drop the "Early" from here, as it permits more
natural section headings. Perhaps "RISC-V Kernel Boot Requirements and
Constraints"?

> +
> +:Author: Alexandre Ghiti <alexghiti@rivosinc.com>
> +:Date: 23 May 2023
> +
> +This document describes what the RISC-V kernel expects from the previous stages

"the previous stages" is a bit vague IMO. You mean bootloader stages I
assume, but I think it should be explicit. Perhaps:
"...what a RISC-V kernel expects from bootloaders and firmware, and the
constraints..."

> +and the firmware, but also the constraints that any developer must have in mind
> +when touching the early boot process, e.g. before the final virtual mapping is
> +setup.

s/setup./set up./

Do you mean to have "For example" here? Or is "before the final virtual
mapping is set up" the definition or "early boot"? If the latter, I
would reword this as something like:
"...when modifying the early boot process. For the purposes of this
document, the 'early boot process' refers to any code that runs before
the final virtual mapping is set up."

> +Pre-kernel boot (Expectations from firmware)

Firmware or bootloaders? TBH, I would just drop the section in () and
do something like:
	Pre-kernel Requirements and Constraints
	=======================================
	
	The RISC-V kernel expects the following of bootloaders and platform
	firmware:

> +
> +Registers state

s/Registers state/Register State/

> +---------------
> +
> +The RISC-V kernel expects:
> +
> +  * `$a0` to contain the hartid of the current core.
> +  * `$a1` to contain the address of the device tree in memory.
> +
> +CSR state
> +---------
> +
> +The RISC-V kernel expects:
> +
> +  * `$satp = 0`: the MMU must be disabled.

"the MMU, if present, must be disabled." ;)

> +
> +Reserved memory for resident firmware
> +-------------------------------------
> +
> +The RISC-V kernel expects the firmware to mark any resident memory with the

Should this be
"...resident memory, or memory it has protected with PMPs, with..."
?

> +`no-map` flag, thus the kernel won't map those regions in the direct mapping

"no-map" is a DT specific term, should this section be moved down under
DT, as a sub-section of that?

> +(avoiding issues with hibernation, speculative accesses and probably other
> +subsystems).

I'm not sure that this () section is beneficial. To be honest, recent
issues aside, this section here seems like a statement of the obvious...

> +
> +Kernel location
> +---------------
> +
> +The RISC-V kernel expects to be placed at a PMD boundary (2MB for rv64 and 4MB

Would that be better worded as "(2 MB aligned for rv64 and 4 MB aligned
for rv32)"? It might be overly explicit, but I figure there's no harm...

> +for rv32). Note though that the EFI stub will physically relocate the kernel if

s/though//

> +that's not the case.
> +
> +Device-tree

s/Device-tree/Devicetree/ and...

> +-----------
> +
> +The RISC-V kernel always expects a device tree, it is:

...s/device tree/devicetree/ to match elsewhere in the kernel docs.
Same applies to the other instances of "device tree" in this patch,
please.

> +
> +- either passed directly to the kernel from the previous stage using the `$a1`
> +  register,
> +- or when booting with UEFI, the device tree will be retrieved by the EFI stub
> +  using the EFI configuration table or it will be created.

Can I suggest changing this around a little, pulling the "either" &
dropping some boilerplate so that it reads (to me!) a little more
naturally:
	The RISC-V kernel always expects a devicetree, it is either:
	
	- passed directly to the kernel from the previous stage using the `$a1`
	  register,
	- retrieved by the EFI stub when booting with UEFI, using the EFI
	  configuration table or it will be created by ____.

Also, please elaborate on what it will be created by.

> +
> +Bootflow

"Boot Flow", no?
I am not sure that this is the "correct" heading for the content it
describes, but I have nothing better to offer :/

> +--------
> +
> +There exist 2 methods to enter the kernel:
> +
> +- `RISCV_BOOT_SPINWAIT`: the firmware releases all harts in the kernel, one hart
> +  wins a lottery and executes the early boot code while the other harts are
> +  parked waiting for the initialization to finish. This method is now

nit: s/now//

What do you mean by deprecated? There's no requirement to implement the
HSM extension, right?

> +  **deprecated**.
> +- Ordered booting: the firmware releases only one hart that will execute the
> +  initialization phase and then will start all other harts using the SBI HSM
> +  extension.
> +
> +UEFI
> +----
> +
> +UEFI memory map
> +~~~~~~~~~~~~~~~
> +
> +When booting with UEFI, the RISC-V kernel will use only the EFI memory map to
> +populate the system memory.
> +
> +The UEFI firmware must parse the subnodes of the `/reserved-memory` device tree
> +node and abide by the device tree specification to convert the attributes of
> +those subnodes (`no-map` and `reusable`) into their correct EFI equivalent
> +(refer to section "3.5.4 /reserved-memory and UEFI" of the device tree
> +specification).
> +
> +RISCV_EFI_BOOT_PROTOCOL
> +~~~~~~~~~~~~~~~~~~~~~~~
> +
> +When booting with UEFI, the EFI stub requires the boot hartid in order to pass
> +it to the RISC-V kernel in `$a1`. The EFI stub retrieves the boot hartid using
> +one of the following methods:
> +
> +- `RISCV_EFI_BOOT_PROTOCOL` (**preferred**).
> +- `boot-hartid` device tree subnode (**deprecated**).
> +
> +Any new firmware must implement `RISCV_EFI_BOOT_PROTOCOL` as the device tree
> +based approach is deprecated now.
> +
> +During kernel boot: (Kernel internals)

With the other section titles changed, this could be:
	Early Boot Requirements and Constraints
	=======================================
	
	The RISC-V kernel's early boot process operates under the
	following constraints:

Thoughts?

> +======================================
> +
> +EFI stub and device tree

Same comments about "device tree" here etc.

> +------------------------
> +
> +When booting with UEFI, the device tree is supplemented by the EFI stub with the
> +following parameters (largely shared with arm64 in Documentation/arm/uefi.rst):
> +
> +==========================  ======   ===========================================
> +Name                        Size     Description
> +==========================  ======   ===========================================
> +linux,uefi-system-table     64-bit   Physical address of the UEFI System Table.

nit: Hmm, I think for all of these sizes s/-bit/ bits/.

> +
> +linux,uefi-mmap-start       64-bit   Physical address of the UEFI memory map,
> +                                     populated by the UEFI GetMemoryMap() call.
> +
> +linux,uefi-mmap-size        32-bit   Size in bytes of the UEFI memory map
> +                                     pointed to in previous entry.
> +
> +linux,uefi-mmap-desc-size   32-bit   Size in bytes of each entry in the UEFI
> +                                     memory map.
> +
> +linux,uefi-mmap-desc-ver    32-bit   Version of the mmap descriptor format.
> +
> +kaslr-seed                  64-bit   Entropy used to randomize the kernel image
> +                                     base address location.
> +
> +bootargs                             Kernel command line
> +==========================  ======   ===========================================
> +
> +Virtual mapping setup

nit: s/setup/Installation/

> +---------------------
> +
> +The installation of the virtual mapping is done in 2 steps in the RISC-V kernel:
> +
> +1. :c:func:`setup_vm` installs a temporary kernel mapping in
> +   :c:var:`early_pg_dir` which allows to discover the system memory: only the

s/to discover/discovery of/
s/: only/. Only/

> +   kernel text/data are mapped at this point. When establishing this mapping,
> +   no allocation can be done (since the system memory is not known yet), so
> +   :c:var:`early_pg_dir` page table is statically allocated (using only one
> +   table for each level).
> +
> +2. :c:func:`setup_vm_final` creates the final kernel mapping in
> +   :c:var:`swapper_pg_dir` and takes advantage of the discovered system memory
> +   to create the linear mapping. When establishing this mapping, the kernel
> +   can allocate memory but cannot access it directly (since the direct mapping
> +   is not present yet), so it uses temporary mappings in the fixmap region to
> +   be able to access the newly allocated page table levels.
> +
> +For :c:func:`virt_to_phys` and :c:func:`phys_to_virt` to be able to correctly
> +convert direct mapping addresses to physical addresses, it needs to know the

nit: s/it/they/

> +start of the DRAM: this happens after 1, right before 2 installs the direct

s/:/./
Also how about s/1/step 1/ & s/2/step 2/?

> +mapping (see :c:func:`setup_bootmem` function in arch/riscv/mm/init.c). So

s/So//

> +any usage of those macros before the final virtual mapping is installed must be
> +carefully examined.
> +
> +Device-tree mapping via fixmap
> +------------------------------
> +
> +The RISC-V kernel uses the fixmap region to map the device tree because the
> +device tree virtual mapping must remain the same between :c:func:`setup_vm` and
> +:c:func:`setup_vm_final` calls since :c:var:`reserved_mem` array is initialized

Missing a "the" before reserved_mem.

> +with virtual addresses established by :c:func:`setup_vm` and used with the
> +mapping established by :c:func:`setup_vm_final`.
> +
> +Pre-MMU execution
> +-----------------
> +
> +Any code that executes before even the first virtual mapping is established
> +must be very carefully compiled as:

Could you point out what the non-obvious examples of this code are?

> +- `-fno-pie`: This is needed for relocatable kernels which use `-fPIE`, since

Is there a reason why the capitalisation is different for the two
compiler flags?

> +  otherwise, any access to a global symbol would go through the GOT which is
> +  only relocated virtually.
> +- `-mcmodel=medany`: Any access to a global symbol must be PC-relative to avoid
> +  any relocations to happen before the MMU is setup.
> +- Also note that *all* instrumentation must also be disabled (that includes

nit: s/Also note that//

> +  KASAN, ftrace and others).
> +
> +As using a symbol from a different compilation unit requires this unit to be
> +compiled with those flags, we advise, as much as possible, not to use external
> +symbols.

If the use of early alternatives grows, are we going to have to split
the vendors early alternatives into a different compilation unit from
their regular alternatives?

Cheers,
Conor.
  
Alexandre Ghiti June 19, 2023, 2:04 p.m. UTC | #4
On Mon, Jun 19, 2023 at 2:26 PM Conor Dooley <conor.dooley@microchip.com> wrote:
>
> Hey Alex,
>
> Thanks for working on this :) I've got a mix of suggestions and
> questions below. Hopefully it is not too disjoint, since I didn't write
> them in order.
>
> On Mon, Jun 19, 2023 at 11:47:04AM +0200, Alexandre Ghiti wrote:
> > This document describes the constraints and requirements of the early
> > boot process in a RISC-V kernel.
> >
> > Szigned-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
> > ---
> >  Documentation/riscv/boot-image-header.rst |   3 -
> >  Documentation/riscv/boot.rst              | 181 ++++++++++++++++++++++
> >  Documentation/riscv/index.rst             |   1 +
> >  3 files changed, 182 insertions(+), 3 deletions(-)
> >  create mode 100644 Documentation/riscv/boot.rst
> >
> > diff --git a/Documentation/riscv/boot-image-header.rst b/Documentation/riscv/boot-image-header.rst
> > index d7752533865f..a4a45310c4c4 100644
> > --- a/Documentation/riscv/boot-image-header.rst
> > +++ b/Documentation/riscv/boot-image-header.rst
> > @@ -7,9 +7,6 @@ Boot image header in RISC-V Linux
> >
> >  This document only describes the boot image header details for RISC-V Linux.
> >
> > -TODO:
> > -  Write a complete booting guide.
> > -
> >  The following 64-byte header is present in decompressed Linux kernel image::
> >
> >       u32 code0;                /* Executable code */
> > diff --git a/Documentation/riscv/boot.rst b/Documentation/riscv/boot.rst
> > new file mode 100644
> > index 000000000000..b02230818b79
> > --- /dev/null
> > +++ b/Documentation/riscv/boot.rst
> > @@ -0,0 +1,181 @@
> > +.. SPDX-License-Identifier: GPL-2.0
> > +
> > +=============================================
> > +Early boot requirements/constraints on RISC-V
> > +=============================================
>
> Please use "title case", here and elsewhere in the doc.

You mean using "title: " instead of "===="? Or using uppercase for the
first letter of each word? FYI I followed
https://docs.kernel.org/doc-guide/sphinx.html?highlight=title#specific-guidelines-for-the-kernel-documentation

> I'd also be inclined to drop the "Early" from here, as it permits more
> natural section headings. Perhaps "RISC-V Kernel Boot Requirements and
> Constraints"?

Good suggestion, I'll go with that, thanks

>
> > +
> > +:Author: Alexandre Ghiti <alexghiti@rivosinc.com>
> > +:Date: 23 May 2023
> > +
> > +This document describes what the RISC-V kernel expects from the previous stages
>
> "the previous stages" is a bit vague IMO. You mean bootloader stages I
> assume, but I think it should be explicit. Perhaps:
> "...what a RISC-V kernel expects from bootloaders and firmware, and the
> constraints..."
>
> > +and the firmware, but also the constraints that any developer must have in mind
> > +when touching the early boot process, e.g. before the final virtual mapping is
> > +setup.
>
> s/setup./set up./
>
> Do you mean to have "For example" here? Or is "before the final virtual
> mapping is set up" the definition or "early boot"? If the latter, I
> would reword this as something like:
> "...when modifying the early boot process. For the purposes of this
> document, the 'early boot process' refers to any code that runs before
> the final virtual mapping is set up."

Thanks, that's what I meant.

>
> > +Pre-kernel boot (Expectations from firmware)
>
> Firmware or bootloaders? TBH, I would just drop the section in () and
> do something like:
>         Pre-kernel Requirements and Constraints
>         =======================================
>
>         The RISC-V kernel expects the following of bootloaders and platform
>         firmware:
>

Ok

> > +
> > +Registers state
>
> s/Registers state/Register State/

Ok

>
> > +---------------
> > +
> > +The RISC-V kernel expects:
> > +
> > +  * `$a0` to contain the hartid of the current core.
> > +  * `$a1` to contain the address of the device tree in memory.
> > +
> > +CSR state
> > +---------
> > +
> > +The RISC-V kernel expects:
> > +
> > +  * `$satp = 0`: the MMU must be disabled.
>
> "the MMU, if present, must be disabled." ;)

Ahah forgot the !mmu case, thanks :)

>
> > +
> > +Reserved memory for resident firmware
> > +-------------------------------------
> > +
> > +The RISC-V kernel expects the firmware to mark any resident memory with the
>
> Should this be
> "...resident memory, or memory it has protected with PMPs, with..."
> ?

I used "resident" memory instead of "PMP" memory because it was more
general. I mean you can have a region that is resident but not
protected by PMP, and I don't think the kernel should ask for this
resident memory to be protected with PMP right?

>
> > +`no-map` flag, thus the kernel won't map those regions in the direct mapping
>
> "no-map" is a DT specific term, should this section be moved down under
> DT, as a sub-section of that?

Maybe I can rephrase with something like that:

"The RISC-V kernel must not map any resident memory in the direct
mapping, so the firmware must correctly mark those regions as follows:
- when using a devicetree, using the `no-map` flag,
- when booting with UEFI without devicetree, either as
`EfiRuntimeServicesData/Code` or `EfiReserved`."

>
> > +(avoiding issues with hibernation, speculative accesses and probably other
> > +subsystems).
>
> I'm not sure that this () section is beneficial. To be honest, recent
> issues aside, this section here seems like a statement of the obvious...

But I made the mistake, so it was not that straightforward to
me...I'll remove the ().

>
> > +
> > +Kernel location
> > +---------------
> > +
> > +The RISC-V kernel expects to be placed at a PMD boundary (2MB for rv64 and 4MB
>
> Would that be better worded as "(2 MB aligned for rv64 and 4 MB aligned
> for rv32)"? It might be overly explicit, but I figure there's no harm...

Sure I can add that.

>
> > +for rv32). Note though that the EFI stub will physically relocate the kernel if
>
> s/though//

Ok

>
> > +that's not the case.
> > +
> > +Device-tree

Damn, missed this one, thanks!

>
> s/Device-tree/Devicetree/ and...
>
> > +-----------
> > +
> > +The RISC-V kernel always expects a device tree, it is:
>
> ...s/device tree/devicetree/ to match elsewhere in the kernel docs.
> Same applies to the other instances of "device tree" in this patch,
> please.

Ok I'll do that (but I'm happy to say that I thought about that and it
was intentional since "git grep "device tree" | wc -l" returns a
significant number of instances :)).

>
> > +
> > +- either passed directly to the kernel from the previous stage using the `$a1`
> > +  register,
> > +- or when booting with UEFI, the device tree will be retrieved by the EFI stub
> > +  using the EFI configuration table or it will be created.
>
> Can I suggest changing this around a little, pulling the "either" &
> dropping some boilerplate so that it reads (to me!) a little more
> naturally:
>         The RISC-V kernel always expects a devicetree, it is either:
>
>         - passed directly to the kernel from the previous stage using the `$a1`
>           register,
>         - retrieved by the EFI stub when booting with UEFI, using the EFI
>           configuration table or it will be created by ____.
>
> Also, please elaborate on what it will be created by.

Is it better this way:

"The RISC-V kernel always expects a devicetree, it is either:

- passed directly to the kernel from the previous stage using the
`$a1`
  register,
- retrieved by the EFI stub if present in the EFI configuration table,
- created by the EFI stub otherwise."

>
> > +
> > +Bootflow
>
> "Boot Flow", no?
> I am not sure that this is the "correct" heading for the content it
> describes, but I have nothing better to offer :/

Yes you're right, what about simply "Kernel Entrance"? Not sure this
is easily understandable though.

>
> > +--------
> > +
> > +There exist 2 methods to enter the kernel:
> > +
> > +- `RISCV_BOOT_SPINWAIT`: the firmware releases all harts in the kernel, one hart
> > +  wins a lottery and executes the early boot code while the other harts are
> > +  parked waiting for the initialization to finish. This method is now
>
> nit: s/now//

Ok

>
> What do you mean by deprecated? There's no requirement to implement the
> HSM extension, right?

I would say yes, you have to use a recent version of openSBI that
supports the HSM extension. @Atish Kumar Patra WDYT?

>
> > +  **deprecated**.
> > +- Ordered booting: the firmware releases only one hart that will execute the
> > +  initialization phase and then will start all other harts using the SBI HSM
> > +  extension.
> > +
> > +UEFI
> > +----
> > +
> > +UEFI memory map
> > +~~~~~~~~~~~~~~~
> > +
> > +When booting with UEFI, the RISC-V kernel will use only the EFI memory map to
> > +populate the system memory.
> > +
> > +The UEFI firmware must parse the subnodes of the `/reserved-memory` device tree
> > +node and abide by the device tree specification to convert the attributes of
> > +those subnodes (`no-map` and `reusable`) into their correct EFI equivalent
> > +(refer to section "3.5.4 /reserved-memory and UEFI" of the device tree
> > +specification).
> > +
> > +RISCV_EFI_BOOT_PROTOCOL
> > +~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +When booting with UEFI, the EFI stub requires the boot hartid in order to pass
> > +it to the RISC-V kernel in `$a1`. The EFI stub retrieves the boot hartid using
> > +one of the following methods:
> > +
> > +- `RISCV_EFI_BOOT_PROTOCOL` (**preferred**).
> > +- `boot-hartid` device tree subnode (**deprecated**).
> > +
> > +Any new firmware must implement `RISCV_EFI_BOOT_PROTOCOL` as the device tree
> > +based approach is deprecated now.
> > +
> > +During kernel boot: (Kernel internals)
>
> With the other section titles changed, this could be:
>         Early Boot Requirements and Constraints
>         =======================================
>
>         The RISC-V kernel's early boot process operates under the
>         following constraints:
>
> Thoughts?

I think it's better as you propose, I changed it, thanks.

>
> > +======================================
> > +
> > +EFI stub and device tree
>
> Same comments about "device tree" here etc.
>
> > +------------------------
> > +
> > +When booting with UEFI, the device tree is supplemented by the EFI stub with the
> > +following parameters (largely shared with arm64 in Documentation/arm/uefi.rst):
> > +
> > +==========================  ======   ===========================================
> > +Name                        Size     Description
> > +==========================  ======   ===========================================
> > +linux,uefi-system-table     64-bit   Physical address of the UEFI System Table.
>
> nit: Hmm, I think for all of these sizes s/-bit/ bits/.

That's a copy-paste from the link just above the table.

But maybe I should have pointed to the doc and added only the
"bootargs" stuff (maybe that's also present for arm64 actually).

>
> > +
> > +linux,uefi-mmap-start       64-bit   Physical address of the UEFI memory map,
> > +                                     populated by the UEFI GetMemoryMap() call.
> > +
> > +linux,uefi-mmap-size        32-bit   Size in bytes of the UEFI memory map
> > +                                     pointed to in previous entry.
> > +
> > +linux,uefi-mmap-desc-size   32-bit   Size in bytes of each entry in the UEFI
> > +                                     memory map.
> > +
> > +linux,uefi-mmap-desc-ver    32-bit   Version of the mmap descriptor format.
> > +
> > +kaslr-seed                  64-bit   Entropy used to randomize the kernel image
> > +                                     base address location.
> > +
> > +bootargs                             Kernel command line
> > +==========================  ======   ===========================================
> > +
> > +Virtual mapping setup
>
> nit: s/setup/Installation/

Ok

>
> > +---------------------
> > +
> > +The installation of the virtual mapping is done in 2 steps in the RISC-V kernel:
> > +
> > +1. :c:func:`setup_vm` installs a temporary kernel mapping in
> > +   :c:var:`early_pg_dir` which allows to discover the system memory: only the
>
> s/to discover/discovery of/

You mean "the discovery of" right?

> s/: only/. Only/

Ok

>
> > +   kernel text/data are mapped at this point. When establishing this mapping,
> > +   no allocation can be done (since the system memory is not known yet), so
> > +   :c:var:`early_pg_dir` page table is statically allocated (using only one
> > +   table for each level).
> > +
> > +2. :c:func:`setup_vm_final` creates the final kernel mapping in
> > +   :c:var:`swapper_pg_dir` and takes advantage of the discovered system memory
> > +   to create the linear mapping. When establishing this mapping, the kernel
> > +   can allocate memory but cannot access it directly (since the direct mapping
> > +   is not present yet), so it uses temporary mappings in the fixmap region to
> > +   be able to access the newly allocated page table levels.
> > +
> > +For :c:func:`virt_to_phys` and :c:func:`phys_to_virt` to be able to correctly
> > +convert direct mapping addresses to physical addresses, it needs to know the
>
> nit: s/it/they/

Ok

>
> > +start of the DRAM: this happens after 1, right before 2 installs the direct
>
> s/:/./

Ahah, you really don't like long sentences :) But of course ok :)

> Also how about s/1/step 1/ & s/2/step 2/?

Way better, thanks

>
> > +mapping (see :c:func:`setup_bootmem` function in arch/riscv/mm/init.c). So
>
> s/So//
>

Ok

> > +any usage of those macros before the final virtual mapping is installed must be
> > +carefully examined.
> > +
> > +Device-tree mapping via fixmap
> > +------------------------------
> > +
> > +The RISC-V kernel uses the fixmap region to map the device tree because the
> > +device tree virtual mapping must remain the same between :c:func:`setup_vm` and
> > +:c:func:`setup_vm_final` calls since :c:var:`reserved_mem` array is initialized
>
> Missing a "the" before reserved_mem.

Ok

>
> > +with virtual addresses established by :c:func:`setup_vm` and used with the
> > +mapping established by :c:func:`setup_vm_final`.
> > +
> > +Pre-MMU execution
> > +-----------------
> > +
> > +Any code that executes before even the first virtual mapping is established
> > +must be very carefully compiled as:
>
> Could you point out what the non-obvious examples of this code are?

I can do a list, yes

>
> > +- `-fno-pie`: This is needed for relocatable kernels which use `-fPIE`, since
>
> Is there a reason why the capitalisation is different for the two
> compiler flags?

No idea!

>
> > +  otherwise, any access to a global symbol would go through the GOT which is
> > +  only relocated virtually.
> > +- `-mcmodel=medany`: Any access to a global symbol must be PC-relative to avoid
> > +  any relocations to happen before the MMU is setup.
> > +- Also note that *all* instrumentation must also be disabled (that includes
>
> nit: s/Also note that//

Ok

>
> > +  KASAN, ftrace and others).
> > +
> > +As using a symbol from a different compilation unit requires this unit to be
> > +compiled with those flags, we advise, as much as possible, not to use external
> > +symbols.
>
> If the use of early alternatives grows, are we going to have to split
> the vendors early alternatives into a different compilation unit from
> their regular alternatives?

Indeed, that would relax this constraint for the non-early part of the
file, but as we already talked, the goal is to remove all those
constraints by moving all this code in kernel/pi (at least it is my
goal :)).

>
> Cheers,
> Conor.

Thanks for your thorough review!
  
Conor Dooley June 19, 2023, 4 p.m. UTC | #5
On Mon, Jun 19, 2023 at 04:04:52PM +0200, Alexandre Ghiti wrote:
> On Mon, Jun 19, 2023 at 2:26 PM Conor Dooley <conor.dooley@microchip.com> wrote:
> > On Mon, Jun 19, 2023 at 11:47:04AM +0200, Alexandre Ghiti wrote:

> > > diff --git a/Documentation/riscv/boot.rst b/Documentation/riscv/boot.rst
> > > new file mode 100644
> > > index 000000000000..b02230818b79
> > > --- /dev/null
> > > +++ b/Documentation/riscv/boot.rst
> > > @@ -0,0 +1,181 @@
> > > +.. SPDX-License-Identifier: GPL-2.0
> > > +
> > > +=============================================
> > > +Early boot requirements/constraints on RISC-V
> > > +=============================================
> >
> > Please use "title case", here and elsewhere in the doc.
> 
> You mean using "title: " instead of "===="? Or using uppercase for the
> first letter of each word? FYI I followed
> https://docs.kernel.org/doc-guide/sphinx.html?highlight=title#specific-guidelines-for-the-kernel-documentation

The latter. That's weird, I guess it would be nice to see what Jon
thinks about that.

> > > +Reserved memory for resident firmware
> > > +-------------------------------------
> > > +
> > > +The RISC-V kernel expects the firmware to mark any resident memory with the
> >
> > Should this be
> > "...resident memory, or memory it has protected with PMPs, with..."
> > ?
> 
> I used "resident" memory instead of "PMP" memory because it was more
> general. I mean you can have a region that is resident but not
> protected by PMP, and I don't think the kernel should ask for this
> resident memory to be protected with PMP right?

Nah, I was thinking about the opposite. PMP protected regions that do
not have memory-resident programs in them.

> > > +`no-map` flag, thus the kernel won't map those regions in the direct mapping
> >
> > "no-map" is a DT specific term, should this section be moved down under
> > DT, as a sub-section of that?
> 
> Maybe I can rephrase with something like that:
> 
> "The RISC-V kernel must not map any resident memory in the direct
> mapping, so the firmware must correctly mark those regions as follows:
> - when using a devicetree, using the `no-map` flag,
> - when booting with UEFI without devicetree, either as
> `EfiRuntimeServicesData/Code` or `EfiReserved`."

I'm not sure you need to have a list to be honest. Could do it in
free-form text if you like. But you've got 3 options there for Efi
stuff, isn't only one of them a valid equivalent for "no-map"?
> 
> > > +(avoiding issues with hibernation, speculative accesses and probably other
> > > +subsystems).
> >
> > I'm not sure that this () section is beneficial. To be honest, recent
> > issues aside, this section here seems like a statement of the obvious...
> 
> But I made the mistake, so it was not that straightforward to
> me...I'll remove the ().

I know, hence the "recent issues aside" ;)

> > > +that's not the case.
> > > +
> > > +Device-tree
> 
> Damn, missed this one, thanks!
> 
> >
> > s/Device-tree/Devicetree/ and...
> >
> > > +-----------
> > > +
> > > +The RISC-V kernel always expects a device tree, it is:
> >
> > ...s/device tree/devicetree/ to match elsewhere in the kernel docs.
> > Same applies to the other instances of "device tree" in this patch,
> > please.
> 
> Ok I'll do that (but I'm happy to say that I thought about that and it
> was intentional since "git grep "device tree" | wc -l" returns a
> significant number of instances :)).

Yeah, I had the same thoughts recently too. It's completely mixed
unfortunately, but I suppose I was going off of the headings in the DT
docs that are in rst form. It's not a big deal obviously.

> > > +- either passed directly to the kernel from the previous stage using the `$a1`
> > > +  register,
> > > +- or when booting with UEFI, the device tree will be retrieved by the EFI stub
> > > +  using the EFI configuration table or it will be created.
> >
> > Can I suggest changing this around a little, pulling the "either" &
> > dropping some boilerplate so that it reads (to me!) a little more
> > naturally:
> >         The RISC-V kernel always expects a devicetree, it is either:
> >
> >         - passed directly to the kernel from the previous stage using the `$a1`
> >           register,
> >         - retrieved by the EFI stub when booting with UEFI, using the EFI
> >           configuration table or it will be created by ____.
> >
> > Also, please elaborate on what it will be created by.
> 
> Is it better this way:
> 
> "The RISC-V kernel always expects a devicetree, it is either:
> 
> - passed directly to the kernel from the previous stage using the
> `$a1`
>   register,
> - retrieved by the EFI stub if present in the EFI configuration table,
> - created by the EFI stub otherwise."

Nah, I think the 2 bullet structure was better. This 3 bullet mode
implies that if not passed in a1, then the EFI stub will create it.
Which is obviously not true
> 
> >
> > > +
> > > +Bootflow
> >
> > "Boot Flow", no?
> > I am not sure that this is the "correct" heading for the content it
> > describes, but I have nothing better to offer :/
> 
> Yes you're right, what about simply "Kernel Entrance"? Not sure this
> is easily understandable though.

"Non-boot Hart Initialisation"?

> > > +--------
> > > +
> > > +There exist 2 methods to enter the kernel:
> > > +
> > > +- `RISCV_BOOT_SPINWAIT`: the firmware releases all harts in the kernel, one hart
> > > +  wins a lottery and executes the early boot code while the other harts are
> > > +  parked waiting for the initialization to finish. This method is now
> >
> > nit: s/now//
> 
> Ok
> 
> >
> > What do you mean by deprecated? There's no requirement to implement the
> > HSM extension, right?
> 
> I would say yes, you have to use a recent version of openSBI that
> supports the HSM extension. @Atish Kumar Patra WDYT?

Uh, you don't need to use OpenSBI in the first place, let alone use a
recent version of it, right? What am I missing?
Also, what about !SMP systems? Although my suggested new section title
kinda addresses that!

> > > +  **deprecated**.
> > > +- Ordered booting: the firmware releases only one hart that will execute the
> > > +  initialization phase and then will start all other harts using the SBI HSM
> > > +  extension.
> > > +
> > > +UEFI
> > > +----
> > > +
> > > +UEFI memory map
> > > +~~~~~~~~~~~~~~~
> > > +
> > > +When booting with UEFI, the RISC-V kernel will use only the EFI memory map to
> > > +populate the system memory.
> > > +
> > > +The UEFI firmware must parse the subnodes of the `/reserved-memory` device tree
> > > +node and abide by the device tree specification to convert the attributes of
> > > +those subnodes (`no-map` and `reusable`) into their correct EFI equivalent
> > > +(refer to section "3.5.4 /reserved-memory and UEFI" of the device tree
> > > +specification).
> > > +
> > > +RISCV_EFI_BOOT_PROTOCOL
> > > +~~~~~~~~~~~~~~~~~~~~~~~
> > > +
> > > +When booting with UEFI, the EFI stub requires the boot hartid in order to pass
> > > +it to the RISC-V kernel in `$a1`. The EFI stub retrieves the boot hartid using
> > > +one of the following methods:
> > > +
> > > +- `RISCV_EFI_BOOT_PROTOCOL` (**preferred**).
> > > +- `boot-hartid` device tree subnode (**deprecated**).
> > > +
> > > +Any new firmware must implement `RISCV_EFI_BOOT_PROTOCOL` as the device tree
> > > +based approach is deprecated now.
> > > +
> > > +During kernel boot: (Kernel internals)
> >
> > With the other section titles changed, this could be:
> >         Early Boot Requirements and Constraints
> >         =======================================
> >
> >         The RISC-V kernel's early boot process operates under the
> >         following constraints:
> >
> > Thoughts?
> 
> I think it's better as you propose, I changed it, thanks.
> 
> >
> > > +======================================
> > > +
> > > +EFI stub and device tree
> >
> > Same comments about "device tree" here etc.
> >
> > > +------------------------
> > > +
> > > +When booting with UEFI, the device tree is supplemented by the EFI stub with the
> > > +following parameters (largely shared with arm64 in Documentation/arm/uefi.rst):
> > > +
> > > +==========================  ======   ===========================================
> > > +Name                        Size     Description
> > > +==========================  ======   ===========================================
> > > +linux,uefi-system-table     64-bit   Physical address of the UEFI System Table.
> >
> > nit: Hmm, I think for all of these sizes s/-bit/ bits/.
> 
> That's a copy-paste from the link just above the table.
> 
> But maybe I should have pointed to the doc and added only the
> "bootargs" stuff (maybe that's also present for arm64 actually).

If it is identical, sounds like a good idea. It's common code that does
that stuff, right?

> > > +---------------------
> > > +
> > > +The installation of the virtual mapping is done in 2 steps in the RISC-V kernel:
> > > +
> > > +1. :c:func:`setup_vm` installs a temporary kernel mapping in
> > > +   :c:var:`early_pg_dir` which allows to discover the system memory: only the
> >
> > s/to discover/discovery of/
> 
> You mean "the discovery of" right?

No? The "the" there would not be required.

> > > +For :c:func:`virt_to_phys` and :c:func:`phys_to_virt` to be able to correctly
> > > +convert direct mapping addresses to physical addresses, it needs to know the
> >
> > nit: s/it/they/
> 
> Ok
> 
> >
> > > +start of the DRAM: this happens after 1, right before 2 installs the direct
> >
> > s/:/./
> 
> Ahah, you really don't like long sentences :)

I dunno if it is long sentences per se, I just think it is easier to
follow.

> But of course ok :)
> 
> > Also how about s/1/step 1/ & s/2/step 2/?
> 
> Way better, thanks

> > > +Pre-MMU execution
> > > +-----------------
> > > +
> > > +Any code that executes before even the first virtual mapping is established
> > > +must be very carefully compiled as:
> >
> > Could you point out what the non-obvious examples of this code are?
> 
> I can do a list, yes

Not even a list, just something like "...established, for example early
alternatives and foo, must be very..."

> Thanks for your thorough review!

NW chief.
  
Jonathan Corbet June 19, 2023, 4:48 p.m. UTC | #6
Conor Dooley <conor@kernel.org> writes:

> On Mon, Jun 19, 2023 at 04:04:52PM +0200, Alexandre Ghiti wrote:
>> On Mon, Jun 19, 2023 at 2:26 PM Conor Dooley <conor.dooley@microchip.com> wrote:
>> > On Mon, Jun 19, 2023 at 11:47:04AM +0200, Alexandre Ghiti wrote:
>
>> > > diff --git a/Documentation/riscv/boot.rst b/Documentation/riscv/boot.rst
>> > > new file mode 100644
>> > > index 000000000000..b02230818b79
>> > > --- /dev/null
>> > > +++ b/Documentation/riscv/boot.rst
>> > > @@ -0,0 +1,181 @@
>> > > +.. SPDX-License-Identifier: GPL-2.0
>> > > +
>> > > +=============================================
>> > > +Early boot requirements/constraints on RISC-V
>> > > +=============================================
>> >
>> > Please use "title case", here and elsewhere in the doc.
>> 
>> You mean using "title: " instead of "===="? Or using uppercase for the
>> first letter of each word? FYI I followed
>> https://docs.kernel.org/doc-guide/sphinx.html?highlight=title#specific-guidelines-for-the-kernel-documentation
>
> The latter. That's weird, I guess it would be nice to see what Jon
> thinks about that.

I have Never Been Fond of Excessive Use of Capital Letters, so my
preference would be capitalization as in a normal sentence.

That said, I don't really feel that something like this is important
enough that we need to define and enforce a policy around it.  If the
maintainers for specific subsystems feel differently, then I guess
that's up to them.

Thanks,

jon
  
Sunil V L June 20, 2023, 7:18 a.m. UTC | #7
Hey Alex,

Many thanks for writing this up!

On Mon, Jun 19, 2023 at 11:49:13AM +0200, Alexandre Ghiti wrote:
> @Sunil V L Something about ACPI is more than welcome :)
> 
> And thanks to @Björn Töpel and @Atish Kumar Patra for helping in
> writing this document!
> 
> On Mon, Jun 19, 2023 at 11:47 AM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
> >
> > This document describes the constraints and requirements of the early
> > boot process in a RISC-V kernel.
> >
> > Szigned-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
> > ---
> >  Documentation/riscv/boot-image-header.rst |   3 -
> >  Documentation/riscv/boot.rst              | 181 ++++++++++++++++++++++
> >  Documentation/riscv/index.rst             |   1 +
> >  3 files changed, 182 insertions(+), 3 deletions(-)
> >  create mode 100644 Documentation/riscv/boot.rst
> >
> > diff --git a/Documentation/riscv/boot-image-header.rst b/Documentation/riscv/boot-image-header.rst
> > index d7752533865f..a4a45310c4c4 100644
> > --- a/Documentation/riscv/boot-image-header.rst
> > +++ b/Documentation/riscv/boot-image-header.rst
> > @@ -7,9 +7,6 @@ Boot image header in RISC-V Linux
> >
> >  This document only describes the boot image header details for RISC-V Linux.
> >
> > -TODO:
> > -  Write a complete booting guide.
> > -
> >  The following 64-byte header is present in decompressed Linux kernel image::
> >
> >         u32 code0;                /* Executable code */
> > diff --git a/Documentation/riscv/boot.rst b/Documentation/riscv/boot.rst
> > new file mode 100644
> > index 000000000000..b02230818b79
> > --- /dev/null
> > +++ b/Documentation/riscv/boot.rst
> > @@ -0,0 +1,181 @@
> > +.. SPDX-License-Identifier: GPL-2.0
> > +
> > +=============================================
> > +Early boot requirements/constraints on RISC-V
> > +=============================================
> > +
> > +:Author: Alexandre Ghiti <alexghiti@rivosinc.com>
> > +:Date: 23 May 2023
> > +
> > +This document describes what the RISC-V kernel expects from the previous stages
> > +and the firmware, but also the constraints that any developer must have in mind
> > +when touching the early boot process, e.g. before the final virtual mapping is
> > +setup.
> > +
> > +Pre-kernel boot (Expectations from firmware)
> > +============================================
> > +
> > +Registers state
> > +---------------
> > +
> > +The RISC-V kernel expects:
> > +
> > +  * `$a0` to contain the hartid of the current core.
> > +  * `$a1` to contain the address of the device tree in memory.
> > +
> > +CSR state
> > +---------
> > +
> > +The RISC-V kernel expects:
> > +
> > +  * `$satp = 0`: the MMU must be disabled.
> > +
> > +Reserved memory for resident firmware
> > +-------------------------------------
> > +
> > +The RISC-V kernel expects the firmware to mark any resident memory with the
> > +`no-map` flag, thus the kernel won't map those regions in the direct mapping
> > +(avoiding issues with hibernation, speculative accesses and probably other
> > +subsystems).
> > +
> > +Kernel location
> > +---------------
> > +
> > +The RISC-V kernel expects to be placed at a PMD boundary (2MB for rv64 and 4MB
> > +for rv32). Note though that the EFI stub will physically relocate the kernel if
> > +that's not the case.
> > +
> > +Device-tree
> > +-----------
> > +
> > +The RISC-V kernel always expects a device tree, it is:
> > +
In general, the firmware can pass either DT or ACPI (not both at the
same time) to the OS. So, I think the statement that kernel always
expects DT probably causes confusion. Can we mention some thing like
below?

The firmware can pass either DeviceTree or ACPI tables to the RISC-V
kernel.

The DeviceTree is either passed directly to the kernel from the
previous stage using the `$a1` register, or when booting with UEFI, it
can be passed using the EFI configuration table.

The ACPI tables are passed to the kernel using the EFI configuration
table. In this case, a tiny DeviceTree is still created by the EFI
stub. Please refer to "EFI stub and device" tree section below for
details of the tiny DeviceTree.

Thanks,
Sunil
  
Sunil V L June 20, 2023, 7:41 a.m. UTC | #8
On Mon, Jun 19, 2023 at 04:04:52PM +0200, Alexandre Ghiti wrote:
> On Mon, Jun 19, 2023 at 2:26 PM Conor Dooley <conor.dooley@microchip.com> wrote:
> >
> > Hey Alex,
> >
> > Thanks for working on this :) I've got a mix of suggestions and
> > questions below. Hopefully it is not too disjoint, since I didn't write
> > them in order.
> >
> > On Mon, Jun 19, 2023 at 11:47:04AM +0200, Alexandre Ghiti wrote:
> > > This document describes the constraints and requirements of the early
> > > boot process in a RISC-V kernel.
> > >
> > > Szigned-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
> > > ---
> > >  Documentation/riscv/boot-image-header.rst |   3 -
> > >  Documentation/riscv/boot.rst              | 181 ++++++++++++++++++++++
> > >  Documentation/riscv/index.rst             |   1 +
> > >  3 files changed, 182 insertions(+), 3 deletions(-)
> > >  create mode 100644 Documentation/riscv/boot.rst
> > >
> > > diff --git a/Documentation/riscv/boot-image-header.rst b/Documentation/riscv/boot-image-header.rst
> > > index d7752533865f..a4a45310c4c4 100644
> > > --- a/Documentation/riscv/boot-image-header.rst
> > > +++ b/Documentation/riscv/boot-image-header.rst
> > > @@ -7,9 +7,6 @@ Boot image header in RISC-V Linux
> > >
> > >  This document only describes the boot image header details for RISC-V Linux.
> > >
> > > -TODO:
> > > -  Write a complete booting guide.
> > > -
> > >  The following 64-byte header is present in decompressed Linux kernel image::
> > >
> > >       u32 code0;                /* Executable code */
> > > diff --git a/Documentation/riscv/boot.rst b/Documentation/riscv/boot.rst
> > > new file mode 100644
> > > index 000000000000..b02230818b79
> > > --- /dev/null
> > > +++ b/Documentation/riscv/boot.rst
> > > @@ -0,0 +1,181 @@
> > > +.. SPDX-License-Identifier: GPL-2.0
> > > +
> > > +=============================================
> > > +Early boot requirements/constraints on RISC-V
> > > +=============================================
> >
> > Please use "title case", here and elsewhere in the doc.
> 
> You mean using "title: " instead of "===="? Or using uppercase for the
> first letter of each word? FYI I followed
> https://docs.kernel.org/doc-guide/sphinx.html?highlight=title#specific-guidelines-for-the-kernel-documentation
> 
> > I'd also be inclined to drop the "Early" from here, as it permits more
> > natural section headings. Perhaps "RISC-V Kernel Boot Requirements and
> > Constraints"?
> 
> Good suggestion, I'll go with that, thanks
> 
> >
> > > +
> > > +:Author: Alexandre Ghiti <alexghiti@rivosinc.com>
> > > +:Date: 23 May 2023
> > > +
> > > +This document describes what the RISC-V kernel expects from the previous stages
> >
> > "the previous stages" is a bit vague IMO. You mean bootloader stages I
> > assume, but I think it should be explicit. Perhaps:
> > "...what a RISC-V kernel expects from bootloaders and firmware, and the
> > constraints..."
> >
> > > +and the firmware, but also the constraints that any developer must have in mind
> > > +when touching the early boot process, e.g. before the final virtual mapping is
> > > +setup.
> >
> > s/setup./set up./
> >
> > Do you mean to have "For example" here? Or is "before the final virtual
> > mapping is set up" the definition or "early boot"? If the latter, I
> > would reword this as something like:
> > "...when modifying the early boot process. For the purposes of this
> > document, the 'early boot process' refers to any code that runs before
> > the final virtual mapping is set up."
> 
> Thanks, that's what I meant.
> 
> >
> > > +Pre-kernel boot (Expectations from firmware)
> >
> > Firmware or bootloaders? TBH, I would just drop the section in () and
> > do something like:
> >         Pre-kernel Requirements and Constraints
> >         =======================================
> >
> >         The RISC-V kernel expects the following of bootloaders and platform
> >         firmware:
> >
> 
> Ok
> 
> > > +
> > > +Registers state
> >
> > s/Registers state/Register State/
> 
> Ok
> 
> >
> > > +---------------
> > > +
> > > +The RISC-V kernel expects:
> > > +
> > > +  * `$a0` to contain the hartid of the current core.
> > > +  * `$a1` to contain the address of the device tree in memory.
> > > +
> > > +CSR state
> > > +---------
> > > +
> > > +The RISC-V kernel expects:
> > > +
> > > +  * `$satp = 0`: the MMU must be disabled.
> >
> > "the MMU, if present, must be disabled." ;)
> 
> Ahah forgot the !mmu case, thanks :)
> 
> >
> > > +
> > > +Reserved memory for resident firmware
> > > +-------------------------------------
> > > +
> > > +The RISC-V kernel expects the firmware to mark any resident memory with the
> >
> > Should this be
> > "...resident memory, or memory it has protected with PMPs, with..."
> > ?
> 
> I used "resident" memory instead of "PMP" memory because it was more
> general. I mean you can have a region that is resident but not
> protected by PMP, and I don't think the kernel should ask for this
> resident memory to be protected with PMP right?
> 
> >
> > > +`no-map` flag, thus the kernel won't map those regions in the direct mapping
> >
> > "no-map" is a DT specific term, should this section be moved down under
> > DT, as a sub-section of that?
> 
> Maybe I can rephrase with something like that:
> 
> "The RISC-V kernel must not map any resident memory in the direct
> mapping, so the firmware must correctly mark those regions as follows:
> - when using a devicetree, using the `no-map` flag,
> - when booting with UEFI without devicetree, either as
> `EfiRuntimeServicesData/Code` or `EfiReserved`."
> 
Hi Alex,

I am not sure about the idea behind mentioning only UEFI boot without
DT since UEFI boot is supported with DT also. Should we just mention
that "when booting with UEFI, resident firmware ranges must be marked as
per UEFI specification" ? Converting reserved-memory node in DT to UEFI
memory map is anyway mentioned separately under UEFI memory map section
right?

Thanks,
Sunil
  
Alexandre Ghiti June 20, 2023, 8:14 a.m. UTC | #9
On Tue, Jun 20, 2023 at 9:18 AM Sunil V L <sunilvl@ventanamicro.com> wrote:
>
> Hey Alex,
>
> Many thanks for writing this up!
>
> On Mon, Jun 19, 2023 at 11:49:13AM +0200, Alexandre Ghiti wrote:
> > @Sunil V L Something about ACPI is more than welcome :)
> >
> > And thanks to @Björn Töpel and @Atish Kumar Patra for helping in
> > writing this document!
> >
> > On Mon, Jun 19, 2023 at 11:47 AM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
> > >
> > > This document describes the constraints and requirements of the early
> > > boot process in a RISC-V kernel.
> > >
> > > Szigned-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
> > > ---
> > >  Documentation/riscv/boot-image-header.rst |   3 -
> > >  Documentation/riscv/boot.rst              | 181 ++++++++++++++++++++++
> > >  Documentation/riscv/index.rst             |   1 +
> > >  3 files changed, 182 insertions(+), 3 deletions(-)
> > >  create mode 100644 Documentation/riscv/boot.rst
> > >
> > > diff --git a/Documentation/riscv/boot-image-header.rst b/Documentation/riscv/boot-image-header.rst
> > > index d7752533865f..a4a45310c4c4 100644
> > > --- a/Documentation/riscv/boot-image-header.rst
> > > +++ b/Documentation/riscv/boot-image-header.rst
> > > @@ -7,9 +7,6 @@ Boot image header in RISC-V Linux
> > >
> > >  This document only describes the boot image header details for RISC-V Linux.
> > >
> > > -TODO:
> > > -  Write a complete booting guide.
> > > -
> > >  The following 64-byte header is present in decompressed Linux kernel image::
> > >
> > >         u32 code0;                /* Executable code */
> > > diff --git a/Documentation/riscv/boot.rst b/Documentation/riscv/boot.rst
> > > new file mode 100644
> > > index 000000000000..b02230818b79
> > > --- /dev/null
> > > +++ b/Documentation/riscv/boot.rst
> > > @@ -0,0 +1,181 @@
> > > +.. SPDX-License-Identifier: GPL-2.0
> > > +
> > > +=============================================
> > > +Early boot requirements/constraints on RISC-V
> > > +=============================================
> > > +
> > > +:Author: Alexandre Ghiti <alexghiti@rivosinc.com>
> > > +:Date: 23 May 2023
> > > +
> > > +This document describes what the RISC-V kernel expects from the previous stages
> > > +and the firmware, but also the constraints that any developer must have in mind
> > > +when touching the early boot process, e.g. before the final virtual mapping is
> > > +setup.
> > > +
> > > +Pre-kernel boot (Expectations from firmware)
> > > +============================================
> > > +
> > > +Registers state
> > > +---------------
> > > +
> > > +The RISC-V kernel expects:
> > > +
> > > +  * `$a0` to contain the hartid of the current core.
> > > +  * `$a1` to contain the address of the device tree in memory.
> > > +
> > > +CSR state
> > > +---------
> > > +
> > > +The RISC-V kernel expects:
> > > +
> > > +  * `$satp = 0`: the MMU must be disabled.
> > > +
> > > +Reserved memory for resident firmware
> > > +-------------------------------------
> > > +
> > > +The RISC-V kernel expects the firmware to mark any resident memory with the
> > > +`no-map` flag, thus the kernel won't map those regions in the direct mapping
> > > +(avoiding issues with hibernation, speculative accesses and probably other
> > > +subsystems).
> > > +
> > > +Kernel location
> > > +---------------
> > > +
> > > +The RISC-V kernel expects to be placed at a PMD boundary (2MB for rv64 and 4MB
> > > +for rv32). Note though that the EFI stub will physically relocate the kernel if
> > > +that's not the case.
> > > +
> > > +Device-tree
> > > +-----------
> > > +
> > > +The RISC-V kernel always expects a device tree, it is:
> > > +
> In general, the firmware can pass either DT or ACPI (not both at the
> same time) to the OS. So, I think the statement that kernel always
> expects DT probably causes confusion. Can we mention some thing like
> below?
>
> The firmware can pass either DeviceTree or ACPI tables to the RISC-V
> kernel.
>
> The DeviceTree is either passed directly to the kernel from the
> previous stage using the `$a1` register, or when booting with UEFI, it
> can be passed using the EFI configuration table.
>
> The ACPI tables are passed to the kernel using the EFI configuration
> table. In this case, a tiny DeviceTree is still created by the EFI
> stub. Please refer to "EFI stub and device" tree section below for
> details of the tiny DeviceTree.
>

Great, this is way better so I'll go with that:

Hardware description
--------------------

The firmware can pass either a devicetree or ACPI tables to the RISC-V kernel.

The devicetree is either passed directly to the kernel from the previous stage
using the `$a1` register, or when booting with UEFI, it can be passed
using the
EFI configuration table.

The ACPI tables are passed to the kernel using the EFI configuration table. In
this case, a tiny devicetree is still created by the EFI stub. Please refer to
"EFI stub and devicetree" tree section below for details about this devicetree.

Thanks!

> Thanks,
> Sunil
  
Alexandre Ghiti June 20, 2023, 8:24 a.m. UTC | #10
On Tue, Jun 20, 2023 at 9:41 AM Sunil V L <sunilvl@ventanamicro.com> wrote:
>
> On Mon, Jun 19, 2023 at 04:04:52PM +0200, Alexandre Ghiti wrote:
> > On Mon, Jun 19, 2023 at 2:26 PM Conor Dooley <conor.dooley@microchip.com> wrote:
> > >
> > > Hey Alex,
> > >
> > > Thanks for working on this :) I've got a mix of suggestions and
> > > questions below. Hopefully it is not too disjoint, since I didn't write
> > > them in order.
> > >
> > > On Mon, Jun 19, 2023 at 11:47:04AM +0200, Alexandre Ghiti wrote:
> > > > This document describes the constraints and requirements of the early
> > > > boot process in a RISC-V kernel.
> > > >
> > > > Szigned-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
> > > > ---
> > > >  Documentation/riscv/boot-image-header.rst |   3 -
> > > >  Documentation/riscv/boot.rst              | 181 ++++++++++++++++++++++
> > > >  Documentation/riscv/index.rst             |   1 +
> > > >  3 files changed, 182 insertions(+), 3 deletions(-)
> > > >  create mode 100644 Documentation/riscv/boot.rst
> > > >
> > > > diff --git a/Documentation/riscv/boot-image-header.rst b/Documentation/riscv/boot-image-header.rst
> > > > index d7752533865f..a4a45310c4c4 100644
> > > > --- a/Documentation/riscv/boot-image-header.rst
> > > > +++ b/Documentation/riscv/boot-image-header.rst
> > > > @@ -7,9 +7,6 @@ Boot image header in RISC-V Linux
> > > >
> > > >  This document only describes the boot image header details for RISC-V Linux.
> > > >
> > > > -TODO:
> > > > -  Write a complete booting guide.
> > > > -
> > > >  The following 64-byte header is present in decompressed Linux kernel image::
> > > >
> > > >       u32 code0;                /* Executable code */
> > > > diff --git a/Documentation/riscv/boot.rst b/Documentation/riscv/boot.rst
> > > > new file mode 100644
> > > > index 000000000000..b02230818b79
> > > > --- /dev/null
> > > > +++ b/Documentation/riscv/boot.rst
> > > > @@ -0,0 +1,181 @@
> > > > +.. SPDX-License-Identifier: GPL-2.0
> > > > +
> > > > +=============================================
> > > > +Early boot requirements/constraints on RISC-V
> > > > +=============================================
> > >
> > > Please use "title case", here and elsewhere in the doc.
> >
> > You mean using "title: " instead of "===="? Or using uppercase for the
> > first letter of each word? FYI I followed
> > https://docs.kernel.org/doc-guide/sphinx.html?highlight=title#specific-guidelines-for-the-kernel-documentation
> >
> > > I'd also be inclined to drop the "Early" from here, as it permits more
> > > natural section headings. Perhaps "RISC-V Kernel Boot Requirements and
> > > Constraints"?
> >
> > Good suggestion, I'll go with that, thanks
> >
> > >
> > > > +
> > > > +:Author: Alexandre Ghiti <alexghiti@rivosinc.com>
> > > > +:Date: 23 May 2023
> > > > +
> > > > +This document describes what the RISC-V kernel expects from the previous stages
> > >
> > > "the previous stages" is a bit vague IMO. You mean bootloader stages I
> > > assume, but I think it should be explicit. Perhaps:
> > > "...what a RISC-V kernel expects from bootloaders and firmware, and the
> > > constraints..."
> > >
> > > > +and the firmware, but also the constraints that any developer must have in mind
> > > > +when touching the early boot process, e.g. before the final virtual mapping is
> > > > +setup.
> > >
> > > s/setup./set up./
> > >
> > > Do you mean to have "For example" here? Or is "before the final virtual
> > > mapping is set up" the definition or "early boot"? If the latter, I
> > > would reword this as something like:
> > > "...when modifying the early boot process. For the purposes of this
> > > document, the 'early boot process' refers to any code that runs before
> > > the final virtual mapping is set up."
> >
> > Thanks, that's what I meant.
> >
> > >
> > > > +Pre-kernel boot (Expectations from firmware)
> > >
> > > Firmware or bootloaders? TBH, I would just drop the section in () and
> > > do something like:
> > >         Pre-kernel Requirements and Constraints
> > >         =======================================
> > >
> > >         The RISC-V kernel expects the following of bootloaders and platform
> > >         firmware:
> > >
> >
> > Ok
> >
> > > > +
> > > > +Registers state
> > >
> > > s/Registers state/Register State/
> >
> > Ok
> >
> > >
> > > > +---------------
> > > > +
> > > > +The RISC-V kernel expects:
> > > > +
> > > > +  * `$a0` to contain the hartid of the current core.
> > > > +  * `$a1` to contain the address of the device tree in memory.
> > > > +
> > > > +CSR state
> > > > +---------
> > > > +
> > > > +The RISC-V kernel expects:
> > > > +
> > > > +  * `$satp = 0`: the MMU must be disabled.
> > >
> > > "the MMU, if present, must be disabled." ;)
> >
> > Ahah forgot the !mmu case, thanks :)
> >
> > >
> > > > +
> > > > +Reserved memory for resident firmware
> > > > +-------------------------------------
> > > > +
> > > > +The RISC-V kernel expects the firmware to mark any resident memory with the
> > >
> > > Should this be
> > > "...resident memory, or memory it has protected with PMPs, with..."
> > > ?
> >
> > I used "resident" memory instead of "PMP" memory because it was more
> > general. I mean you can have a region that is resident but not
> > protected by PMP, and I don't think the kernel should ask for this
> > resident memory to be protected with PMP right?
> >
> > >
> > > > +`no-map` flag, thus the kernel won't map those regions in the direct mapping
> > >
> > > "no-map" is a DT specific term, should this section be moved down under
> > > DT, as a sub-section of that?
> >
> > Maybe I can rephrase with something like that:
> >
> > "The RISC-V kernel must not map any resident memory in the direct
> > mapping, so the firmware must correctly mark those regions as follows:
> > - when using a devicetree, using the `no-map` flag,
> > - when booting with UEFI without devicetree, either as
> > `EfiRuntimeServicesData/Code` or `EfiReserved`."
> >
> Hi Alex,
>
> I am not sure about the idea behind mentioning only UEFI boot without
> DT since UEFI boot is supported with DT also. Should we just mention
> that "when booting with UEFI, resident firmware ranges must be marked as
> per UEFI specification" ? Converting reserved-memory node in DT to UEFI
> memory map is anyway mentioned separately under UEFI memory map section
> right?

Right, let's make it simple:

"The RISC-V kernel must not map any resident memory in the direct
mapping, so the
firmware must correctly mark those regions as per the devicetree
specification and/or the UEFI specification."

Feel free to comment if that's still not right to you,

Thanks,

>
> Thanks,
> Sunil
  
Alexandre Ghiti June 20, 2023, 9:09 a.m. UTC | #11
On Mon, Jun 19, 2023 at 6:00 PM Conor Dooley <conor@kernel.org> wrote:
>
> On Mon, Jun 19, 2023 at 04:04:52PM +0200, Alexandre Ghiti wrote:
> > On Mon, Jun 19, 2023 at 2:26 PM Conor Dooley <conor.dooley@microchip.com> wrote:
> > > On Mon, Jun 19, 2023 at 11:47:04AM +0200, Alexandre Ghiti wrote:
>
> > > > diff --git a/Documentation/riscv/boot.rst b/Documentation/riscv/boot.rst
> > > > new file mode 100644
> > > > index 000000000000..b02230818b79
> > > > --- /dev/null
> > > > +++ b/Documentation/riscv/boot.rst
> > > > @@ -0,0 +1,181 @@
> > > > +.. SPDX-License-Identifier: GPL-2.0
> > > > +
> > > > +=============================================
> > > > +Early boot requirements/constraints on RISC-V
> > > > +=============================================
> > >
> > > Please use "title case", here and elsewhere in the doc.
> >
> > You mean using "title: " instead of "===="? Or using uppercase for the
> > first letter of each word? FYI I followed
> > https://docs.kernel.org/doc-guide/sphinx.html?highlight=title#specific-guidelines-for-the-kernel-documentation
>
> The latter. That's weird, I guess it would be nice to see what Jon
> thinks about that.
>
> > > > +Reserved memory for resident firmware
> > > > +-------------------------------------
> > > > +
> > > > +The RISC-V kernel expects the firmware to mark any resident memory with the
> > >
> > > Should this be
> > > "...resident memory, or memory it has protected with PMPs, with..."
> > > ?
> >
> > I used "resident" memory instead of "PMP" memory because it was more
> > general. I mean you can have a region that is resident but not
> > protected by PMP, and I don't think the kernel should ask for this
> > resident memory to be protected with PMP right?
>
> Nah, I was thinking about the opposite. PMP protected regions that do
> not have memory-resident programs in them.

Ok, I'll add a reference to PMP regions like you mentioned the first time.

>
> > > > +`no-map` flag, thus the kernel won't map those regions in the direct mapping
> > >
> > > "no-map" is a DT specific term, should this section be moved down under
> > > DT, as a sub-section of that?
> >
> > Maybe I can rephrase with something like that:
> >
> > "The RISC-V kernel must not map any resident memory in the direct
> > mapping, so the firmware must correctly mark those regions as follows:
> > - when using a devicetree, using the `no-map` flag,
> > - when booting with UEFI without devicetree, either as
> > `EfiRuntimeServicesData/Code` or `EfiReserved`."
>
> I'm not sure you need to have a list to be honest. Could do it in
> free-form text if you like. But you've got 3 options there for Efi
> stuff, isn't only one of them a valid equivalent for "no-map"?

Indeed, the no-map attribute is (according to the dt spec) always
converted to EfiReservedMemoryType and no-map is the only way to
"reserve" such memory using devicetree. A firmware can also mark those
regions as EfiRuntimeServicesData/Code so that they don't get mapped
in the direct mapping.  But ok, let's make it simple, I'll go with
something like Sunil's proposition.

> >
> > > > +(avoiding issues with hibernation, speculative accesses and probably other
> > > > +subsystems).
> > >
> > > I'm not sure that this () section is beneficial. To be honest, recent
> > > issues aside, this section here seems like a statement of the obvious...
> >
> > But I made the mistake, so it was not that straightforward to
> > me...I'll remove the ().
>
> I know, hence the "recent issues aside" ;)
>
> > > > +that's not the case.
> > > > +
> > > > +Device-tree
> >
> > Damn, missed this one, thanks!
> >
> > >
> > > s/Device-tree/Devicetree/ and...
> > >
> > > > +-----------
> > > > +
> > > > +The RISC-V kernel always expects a device tree, it is:
> > >
> > > ...s/device tree/devicetree/ to match elsewhere in the kernel docs.
> > > Same applies to the other instances of "device tree" in this patch,
> > > please.
> >
> > Ok I'll do that (but I'm happy to say that I thought about that and it
> > was intentional since "git grep "device tree" | wc -l" returns a
> > significant number of instances :)).
>
> Yeah, I had the same thoughts recently too. It's completely mixed
> unfortunately, but I suppose I was going off of the headings in the DT
> docs that are in rst form. It's not a big deal obviously.
>
> > > > +- either passed directly to the kernel from the previous stage using the `$a1`
> > > > +  register,
> > > > +- or when booting with UEFI, the device tree will be retrieved by the EFI stub
> > > > +  using the EFI configuration table or it will be created.
> > >
> > > Can I suggest changing this around a little, pulling the "either" &
> > > dropping some boilerplate so that it reads (to me!) a little more
> > > naturally:
> > >         The RISC-V kernel always expects a devicetree, it is either:
> > >
> > >         - passed directly to the kernel from the previous stage using the `$a1`
> > >           register,
> > >         - retrieved by the EFI stub when booting with UEFI, using the EFI
> > >           configuration table or it will be created by ____.
> > >
> > > Also, please elaborate on what it will be created by.
> >
> > Is it better this way:
> >
> > "The RISC-V kernel always expects a devicetree, it is either:
> >
> > - passed directly to the kernel from the previous stage using the
> > `$a1`
> >   register,
> > - retrieved by the EFI stub if present in the EFI configuration table,
> > - created by the EFI stub otherwise."
>
> Nah, I think the 2 bullet structure was better. This 3 bullet mode
> implies that if not passed in a1, then the EFI stub will create it.
> Which is obviously not true

I'll go with Sunil's proposition which is way better than the list above!

> >
> > >
> > > > +
> > > > +Bootflow
> > >
> > > "Boot Flow", no?
> > > I am not sure that this is the "correct" heading for the content it
> > > describes, but I have nothing better to offer :/
> >
> > Yes you're right, what about simply "Kernel Entrance"? Not sure this
> > is easily understandable though.
>
> "Non-boot Hart Initialisation"?

Hmmm not that great either (sorry for being direct here)

>
> > > > +--------
> > > > +
> > > > +There exist 2 methods to enter the kernel:
> > > > +
> > > > +- `RISCV_BOOT_SPINWAIT`: the firmware releases all harts in the kernel, one hart
> > > > +  wins a lottery and executes the early boot code while the other harts are
> > > > +  parked waiting for the initialization to finish. This method is now
> > >
> > > nit: s/now//
> >
> > Ok
> >
> > >
> > > What do you mean by deprecated? There's no requirement to implement the
> > > HSM extension, right?
> >
> > I would say yes, you have to use a recent version of openSBI that
> > supports the HSM extension. @Atish Kumar Patra WDYT?
>
> Uh, you don't need to use OpenSBI in the first place, let alone use a
> recent version of it, right? What am I missing?

You need a M-Mode firmware which follows the SBI specification and
that implements the HSM extension.

> Also, what about !SMP systems? Although my suggested new section title
> kinda addresses that!

I'll add "On SMP systems, there exist 2 methods to enter the
kernel:....", I don't think we need to detail the !SMP case as it is
obvious.

>
> > > > +  **deprecated**.
> > > > +- Ordered booting: the firmware releases only one hart that will execute the
> > > > +  initialization phase and then will start all other harts using the SBI HSM
> > > > +  extension.
> > > > +
> > > > +UEFI
> > > > +----
> > > > +
> > > > +UEFI memory map
> > > > +~~~~~~~~~~~~~~~
> > > > +
> > > > +When booting with UEFI, the RISC-V kernel will use only the EFI memory map to
> > > > +populate the system memory.
> > > > +
> > > > +The UEFI firmware must parse the subnodes of the `/reserved-memory` device tree
> > > > +node and abide by the device tree specification to convert the attributes of
> > > > +those subnodes (`no-map` and `reusable`) into their correct EFI equivalent
> > > > +(refer to section "3.5.4 /reserved-memory and UEFI" of the device tree
> > > > +specification).
> > > > +
> > > > +RISCV_EFI_BOOT_PROTOCOL
> > > > +~~~~~~~~~~~~~~~~~~~~~~~
> > > > +
> > > > +When booting with UEFI, the EFI stub requires the boot hartid in order to pass
> > > > +it to the RISC-V kernel in `$a1`. The EFI stub retrieves the boot hartid using
> > > > +one of the following methods:
> > > > +
> > > > +- `RISCV_EFI_BOOT_PROTOCOL` (**preferred**).
> > > > +- `boot-hartid` device tree subnode (**deprecated**).
> > > > +
> > > > +Any new firmware must implement `RISCV_EFI_BOOT_PROTOCOL` as the device tree
> > > > +based approach is deprecated now.
> > > > +
> > > > +During kernel boot: (Kernel internals)
> > >
> > > With the other section titles changed, this could be:
> > >         Early Boot Requirements and Constraints
> > >         =======================================
> > >
> > >         The RISC-V kernel's early boot process operates under the
> > >         following constraints:
> > >
> > > Thoughts?
> >
> > I think it's better as you propose, I changed it, thanks.
> >
> > >
> > > > +======================================
> > > > +
> > > > +EFI stub and device tree
> > >
> > > Same comments about "device tree" here etc.
> > >
> > > > +------------------------
> > > > +
> > > > +When booting with UEFI, the device tree is supplemented by the EFI stub with the
> > > > +following parameters (largely shared with arm64 in Documentation/arm/uefi.rst):
> > > > +
> > > > +==========================  ======   ===========================================
> > > > +Name                        Size     Description
> > > > +==========================  ======   ===========================================
> > > > +linux,uefi-system-table     64-bit   Physical address of the UEFI System Table.
> > >
> > > nit: Hmm, I think for all of these sizes s/-bit/ bits/.
> >
> > That's a copy-paste from the link just above the table.
> >
> > But maybe I should have pointed to the doc and added only the
> > "bootargs" stuff (maybe that's also present for arm64 actually).
>
> If it is identical, sounds like a good idea. It's common code that does
> that stuff, right?

Yep, all of that is done in the same generic function in libstub
(update_fdt()). I'll move the bootargs stuff to the arm documentation
\o/

>
> > > > +---------------------
> > > > +
> > > > +The installation of the virtual mapping is done in 2 steps in the RISC-V kernel:
> > > > +
> > > > +1. :c:func:`setup_vm` installs a temporary kernel mapping in
> > > > +   :c:var:`early_pg_dir` which allows to discover the system memory: only the
> > >
> > > s/to discover/discovery of/
> >
> > You mean "the discovery of" right?
>
> No? The "the" there would not be required.

That was a genuine question, thanks

>
> > > > +For :c:func:`virt_to_phys` and :c:func:`phys_to_virt` to be able to correctly
> > > > +convert direct mapping addresses to physical addresses, it needs to know the
> > >
> > > nit: s/it/they/
> >
> > Ok
> >
> > >
> > > > +start of the DRAM: this happens after 1, right before 2 installs the direct
> > >
> > > s/:/./
> >
> > Ahah, you really don't like long sentences :)
>
> I dunno if it is long sentences per se, I just think it is easier to
> follow.
>
> > But of course ok :)
> >
> > > Also how about s/1/step 1/ & s/2/step 2/?
> >
> > Way better, thanks
>
> > > > +Pre-MMU execution
> > > > +-----------------
> > > > +
> > > > +Any code that executes before even the first virtual mapping is established
> > > > +must be very carefully compiled as:
> > >
> > > Could you point out what the non-obvious examples of this code are?
> >
> > I can do a list, yes
>
> Not even a list, just something like "...established, for example early
> alternatives and foo, must be very..."

Done as follows:

"A few pieces of code need to run before even the first virtual mapping is
established, that comprises the installation of the first virtual mapping, the
early alternatives and the early parsing of the kernel command line. That code
must be very carefully compiled as:..."

>
> > Thanks for your thorough review!
>
> NW chief.
>
  
Conor Dooley June 20, 2023, 9:57 a.m. UTC | #12
On Mon, Jun 19, 2023 at 10:48:06AM -0600, Jonathan Corbet wrote:
> Conor Dooley <conor@kernel.org> writes:
> 
> > On Mon, Jun 19, 2023 at 04:04:52PM +0200, Alexandre Ghiti wrote:
> >> On Mon, Jun 19, 2023 at 2:26 PM Conor Dooley <conor.dooley@microchip.com> wrote:
> >> > On Mon, Jun 19, 2023 at 11:47:04AM +0200, Alexandre Ghiti wrote:
> >
> >> > > diff --git a/Documentation/riscv/boot.rst b/Documentation/riscv/boot.rst
> >> > > new file mode 100644
> >> > > index 000000000000..b02230818b79
> >> > > --- /dev/null
> >> > > +++ b/Documentation/riscv/boot.rst
> >> > > @@ -0,0 +1,181 @@
> >> > > +.. SPDX-License-Identifier: GPL-2.0
> >> > > +
> >> > > +=============================================
> >> > > +Early boot requirements/constraints on RISC-V
> >> > > +=============================================
> >> >
> >> > Please use "title case", here and elsewhere in the doc.
> >> 
> >> You mean using "title: " instead of "===="? Or using uppercase for the
> >> first letter of each word? FYI I followed
> >> https://docs.kernel.org/doc-guide/sphinx.html?highlight=title#specific-guidelines-for-the-kernel-documentation
> >
> > The latter. That's weird, I guess it would be nice to see what Jon
> > thinks about that.
> 
> I have Never Been Fond of Excessive Use of Capital Letters, so my
> preference would be capitalization as in a normal sentence.
> 
> That said, I don't really feel that something like this is important
> enough that we need to define and enforce a policy around it.  If the
> maintainers for specific subsystems feel differently, then I guess
> that's up to them.

Okay, thanks.
  
Song Shuai June 20, 2023, 10:31 a.m. UTC | #13
在 2023/6/19 17:47, Alexandre Ghiti 写道:
> This document describes the constraints and requirements of the early
> boot process in a RISC-V kernel.
> 
> Szigned-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
> ---
>   Documentation/riscv/boot-image-header.rst |   3 -
>   Documentation/riscv/boot.rst              | 181 ++++++++++++++++++++++
>   Documentation/riscv/index.rst             |   1 +
>   3 files changed, 182 insertions(+), 3 deletions(-)
>   create mode 100644 Documentation/riscv/boot.rst
> 
> diff --git a/Documentation/riscv/boot-image-header.rst b/Documentation/riscv/boot-image-header.rst
> index d7752533865f..a4a45310c4c4 100644
> --- a/Documentation/riscv/boot-image-header.rst
> +++ b/Documentation/riscv/boot-image-header.rst
> @@ -7,9 +7,6 @@ Boot image header in RISC-V Linux
>   
>   This document only describes the boot image header details for RISC-V Linux.
>   
> -TODO:
> -  Write a complete booting guide.
> -
>   The following 64-byte header is present in decompressed Linux kernel image::
>   
>   	u32 code0;		  /* Executable code */
> diff --git a/Documentation/riscv/boot.rst b/Documentation/riscv/boot.rst
> new file mode 100644
> index 000000000000..b02230818b79
> --- /dev/null
> +++ b/Documentation/riscv/boot.rst
> @@ -0,0 +1,181 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=============================================
> +Early boot requirements/constraints on RISC-V
> +=============================================
> +
> +:Author: Alexandre Ghiti <alexghiti@rivosinc.com>
> +:Date: 23 May 2023
> +
> +This document describes what the RISC-V kernel expects from the previous stages
> +and the firmware, but also the constraints that any developer must have in mind
> +when touching the early boot process, e.g. before the final virtual mapping is
> +setup.
> +
> +Pre-kernel boot (Expectations from firmware)
> +============================================
> +
> +Registers state
> +---------------
> +
> +The RISC-V kernel expects:
> +
> +  * `$a0` to contain the hartid of the current core.
> +  * `$a1` to contain the address of the device tree in memory.
> +
> +CSR state
> +---------
> +
> +The RISC-V kernel expects:
> +
> +  * `$satp = 0`: the MMU must be disabled.
> +
> +Reserved memory for resident firmware
> +-------------------------------------
> +
> +The RISC-V kernel expects the firmware to mark any resident memory with the
> +`no-map` flag, thus the kernel won't map those regions in the direct mapping
> +(avoiding issues with hibernation, speculative accesses and probably other
> +subsystems).
> +
> +Kernel location
> +---------------
> +
> +The RISC-V kernel expects to be placed at a PMD boundary (2MB for rv64 and 4MB
> +for rv32). Note though that the EFI stub will physically relocate the kernel if
> +that's not the case.
> +
> +Device-tree
> +-----------
> +
> +The RISC-V kernel always expects a device tree, it is:
> +
> +- either passed directly to the kernel from the previous stage using the `$a1`
> +  register,
> +- or when booting with UEFI, the device tree will be retrieved by the EFI stub
> +  using the EFI configuration table or it will be created.
> +
> +Bootflow
> +--------
> +
> +There exist 2 methods to enter the kernel:
> +
> +- `RISCV_BOOT_SPINWAIT`: the firmware releases all harts in the kernel, one hart
> +  wins a lottery and executes the early boot code while the other harts are
> +  parked waiting for the initialization to finish. This method is now
> +  **deprecated**.
> +- Ordered booting: the firmware releases only one hart that will execute the
> +  initialization phase and then will start all other harts using the SBI HSM
> +  extension.
> +
> +UEFI
> +----
> +
> +UEFI memory map
> +~~~~~~~~~~~~~~~
> +
> +When booting with UEFI, the RISC-V kernel will use only the EFI memory map to
> +populate the system memory.
> +
> +The UEFI firmware must parse the subnodes of the `/reserved-memory` device tree
> +node and abide by the device tree specification to convert the attributes of
> +those subnodes (`no-map` and `reusable`) into their correct EFI equivalent
> +(refer to section "3.5.4 /reserved-memory and UEFI" of the device tree
> +specification).
how about declare the version of Device Tree specification?
like the devicetree-specification-v0.4-rc1 we recently reference
> +
> +RISCV_EFI_BOOT_PROTOCOL
> +~~~~~~~~~~~~~~~~~~~~~~~
> +
> +When booting with UEFI, the EFI stub requires the boot hartid in order to pass
> +it to the RISC-V kernel in `$a1`. The EFI stub retrieves the boot hartid using
> +one of the following methods:
> +
> +- `RISCV_EFI_BOOT_PROTOCOL` (**preferred**).
> +- `boot-hartid` device tree subnode (**deprecated**).
> +
> +Any new firmware must implement `RISCV_EFI_BOOT_PROTOCOL` as the device tree
> +based approach is deprecated now.
> +
> +During kernel boot: (Kernel internals)
> +======================================
> +
> +EFI stub and device tree
> +------------------------
> +
> +When booting with UEFI, the device tree is supplemented by the EFI stub with the
> +following parameters (largely shared with arm64 in Documentation/arm/uefi.rst):
> +
> +==========================  ======   ===========================================
> +Name                        Size     Description
> +==========================  ======   ===========================================
> +linux,uefi-system-table     64-bit   Physical address of the UEFI System Table.
> +
> +linux,uefi-mmap-start       64-bit   Physical address of the UEFI memory map,
> +                                     populated by the UEFI GetMemoryMap() call.
> +
> +linux,uefi-mmap-size        32-bit   Size in bytes of the UEFI memory map
> +                                     pointed to in previous entry.
> +
> +linux,uefi-mmap-desc-size   32-bit   Size in bytes of each entry in the UEFI
> +                                     memory map.
> +
> +linux,uefi-mmap-desc-ver    32-bit   Version of the mmap descriptor format.
> +
> +kaslr-seed                  64-bit   Entropy used to randomize the kernel image
> +                                     base address location.
> +
> +bootargs                             Kernel command line
how about use "string" to declare the type of "bootargs"
and replace the "Size" in header with "Type"
> +==========================  ======   ===========================================
> +
> +Virtual mapping setup
> +---------------------
> +
> +The installation of the virtual mapping is done in 2 steps in the RISC-V kernel:
> +
> +1. :c:func:`setup_vm` installs a temporary kernel mapping in
> +   :c:var:`early_pg_dir` which allows to discover the system memory: only the
> +   kernel text/data are mapped at this point. When establishing this mapping,
> +   no allocation can be done (since the system memory is not known yet), so
> +   :c:var:`early_pg_dir` page table is statically allocated (using only one
> +   table for each level).
> +
> +2. :c:func:`setup_vm_final` creates the final kernel mapping in
> +   :c:var:`swapper_pg_dir` and takes advantage of the discovered system memory
> +   to create the linear mapping. When establishing this mapping, the kernel
> +   can allocate memory but cannot access it directly (since the direct mapping
> +   is not present yet), so it uses temporary mappings in the fixmap region to
> +   be able to access the newly allocated page table levels.
> +
> +For :c:func:`virt_to_phys` and :c:func:`phys_to_virt` to be able to correctly
> +convert direct mapping addresses to physical addresses, it needs to know the
> +start of the DRAM: this happens after 1, right before 2 installs the direct
> +mapping (see :c:func:`setup_bootmem` function in arch/riscv/mm/init.c). So
> +any usage of those macros before the final virtual mapping is installed must be
> +carefully examined.
> +
> +Device-tree mapping via fixmap
> +------------------------------
> +
> +The RISC-V kernel uses the fixmap region to map the device tree because the
> +device tree virtual mapping must remain the same between :c:func:`setup_vm` and
> +:c:func:`setup_vm_final` calls since :c:var:`reserved_mem` array is initialized
> +with virtual addresses established by :c:func:`setup_vm` and used with the
> +mapping established by :c:func:`setup_vm_final`.
> +
> +Pre-MMU execution
> +-----------------
> +
> +Any code that executes before even the first virtual mapping is established
> +must be very carefully compiled as:
> +
> +- `-fno-pie`: This is needed for relocatable kernels which use `-fPIE`, since
> +  otherwise, any access to a global symbol would go through the GOT which is
> +  only relocated virtually.
> +- `-mcmodel=medany`: Any access to a global symbol must be PC-relative to avoid
> +  any relocations to happen before the MMU is setup.
> +- Also note that *all* instrumentation must also be disabled (that includes
> +  KASAN, ftrace and others).
> +
> +As using a symbol from a different compilation unit requires this unit to be
> +compiled with those flags, we advise, as much as possible, not to use external
> +symbols.
> diff --git a/Documentation/riscv/index.rst b/Documentation/riscv/index.rst
> index 175a91db0200..1f66062def6d 100644
> --- a/Documentation/riscv/index.rst
> +++ b/Documentation/riscv/index.rst
> @@ -5,6 +5,7 @@ RISC-V architecture
>   .. toctree::
>       :maxdepth: 1
>   
> +    boot
>       boot-image-header
>       vm-layout
>       hwprobe
  
Conor Dooley June 20, 2023, 10:33 a.m. UTC | #14
On Tue, Jun 20, 2023 at 11:09:47AM +0200, Alexandre Ghiti wrote:
> On Mon, Jun 19, 2023 at 6:00 PM Conor Dooley <conor@kernel.org> wrote:
> > On Mon, Jun 19, 2023 at 04:04:52PM +0200, Alexandre Ghiti wrote:
> > > On Mon, Jun 19, 2023 at 2:26 PM Conor Dooley <conor.dooley@microchip.com> wrote:
> > > > On Mon, Jun 19, 2023 at 11:47:04AM +0200, Alexandre Ghiti wrote:

> > > > > +Bootflow
> > > >
> > > > "Boot Flow", no?
> > > > I am not sure that this is the "correct" heading for the content it
> > > > describes, but I have nothing better to offer :/
> > >
> > > Yes you're right, what about simply "Kernel Entrance"? Not sure this
> > > is easily understandable though.
> >
> > "Non-boot Hart Initialisation"?
> 
> Hmmm not that great either (sorry for being direct here)

lol, no need to apologise.

> > > > > +There exist 2 methods to enter the kernel:
> > > > > +
> > > > > +- `RISCV_BOOT_SPINWAIT`: the firmware releases all harts in the kernel, one hart
> > > > > +  wins a lottery and executes the early boot code while the other harts are
> > > > > +  parked waiting for the initialization to finish. This method is now
> > > >
> > > > nit: s/now//
> > >
> > > Ok
> > >
> > > >
> > > > What do you mean by deprecated? There's no requirement to implement the
> > > > HSM extension, right?
> > >
> > > I would say yes, you have to use a recent version of openSBI that
> > > supports the HSM extension. @Atish Kumar Patra WDYT?
> >
> > Uh, you don't need to use OpenSBI in the first place, let alone use a
> > recent version of it, right? What am I missing?
> 
> You need a M-Mode firmware which follows the SBI specification and
> that implements the HSM extension.

Firstly, and maybe I am showing my ignorance here, but we do support
m-mode in Linux, and SMP is not disabled for m-mode kernels where it is
required to use the spinwait method.
Secondly, I don't think that we've actually noted that non-HSM booting
is deprecated before now - at least not somewhere obviously. Things like
the platform spec on github might require it & it may be deprecated in
SBI implementations etc, but in the Kconfig option it is not described
as deprecated. The Kconfig option only says that it "should be only
enabled for M-mode Linux or platforms relying on older firmware without
SBI HSM extension".
I think marking it as deprecated here is not accurate, and instead we
would be better off pointing out what the limitations of the method are
and noting the limited situations when it should be used.

> > Also, what about !SMP systems? Although my suggested new section title
> > kinda addresses that!
> 
> I'll add "On SMP systems, there exist 2 methods to enter the

nit: s/exist/are/

> kernel:....", I don't think we need to detail the !SMP case as it is
> obvious.

That's fine. Maybe I am just a pedant, but I think it is good to be a
bit over precise.

> > > > > +  **deprecated**.
> > > > > +- Ordered booting: the firmware releases only one hart that will execute the
> > > > > +  initialization phase and then will start all other harts using the SBI HSM
> > > > > +  extension.

> > > > > +---------------------
> > > > > +
> > > > > +The installation of the virtual mapping is done in 2 steps in the RISC-V kernel:
> > > > > +
> > > > > +1. :c:func:`setup_vm` installs a temporary kernel mapping in
> > > > > +   :c:var:`early_pg_dir` which allows to discover the system memory: only the
> > > >
> > > > s/to discover/discovery of/
> > >
> > > You mean "the discovery of" right?
> >
> > No? The "the" there would not be required.
> 
> That was a genuine question, thanks

Sorry if the "No?" came across as aggressive, I meant it inquisitively.
Adding the "the" changes the meaning slightly, but not in a way that
that is relevant here.

> > > > > +Pre-MMU execution
> > > > > +-----------------
> > > > > +
> > > > > +Any code that executes before even the first virtual mapping is established
> > > > > +must be very carefully compiled as:
> > > >
> > > > Could you point out what the non-obvious examples of this code are?
> > >
> > > I can do a list, yes
> >
> > Not even a list, just something like "...established, for example early
> > alternatives and foo, must be very..."
> 
> Done as follows:
> 
> "A few pieces of code need to run before even the first virtual mapping is
> established, that comprises the installation of the first virtual mapping, the
> early alternatives and the early parsing of the kernel command line. That code
> must be very carefully compiled as:..."

Changed slightly:
"A few pieces of code need to run before even the first virtual mapping is
established. These are the installation of the first virtual mapping itself,
patching of early alternatives and the early parsing of the kernel command line.
That code must be very carefully compiled as:..."

Two minor suggestions there, one to make the it more obvious what the first
section inside commas refers to & one to note what it is that we do with
the alternatives.

Cheers,
Conor.
  
Alexandre Ghiti June 20, 2023, noon UTC | #15
On Tue, Jun 20, 2023 at 12:33 PM Conor Dooley
<conor.dooley@microchip.com> wrote:
>
> On Tue, Jun 20, 2023 at 11:09:47AM +0200, Alexandre Ghiti wrote:
> > On Mon, Jun 19, 2023 at 6:00 PM Conor Dooley <conor@kernel.org> wrote:
> > > On Mon, Jun 19, 2023 at 04:04:52PM +0200, Alexandre Ghiti wrote:
> > > > On Mon, Jun 19, 2023 at 2:26 PM Conor Dooley <conor.dooley@microchip.com> wrote:
> > > > > On Mon, Jun 19, 2023 at 11:47:04AM +0200, Alexandre Ghiti wrote:
>
> > > > > > +Bootflow
> > > > >
> > > > > "Boot Flow", no?
> > > > > I am not sure that this is the "correct" heading for the content it
> > > > > describes, but I have nothing better to offer :/
> > > >
> > > > Yes you're right, what about simply "Kernel Entrance"? Not sure this
> > > > is easily understandable though.
> > >
> > > "Non-boot Hart Initialisation"?
> >
> > Hmmm not that great either (sorry for being direct here)
>
> lol, no need to apologise.
>
> > > > > > +There exist 2 methods to enter the kernel:
> > > > > > +
> > > > > > +- `RISCV_BOOT_SPINWAIT`: the firmware releases all harts in the kernel, one hart
> > > > > > +  wins a lottery and executes the early boot code while the other harts are
> > > > > > +  parked waiting for the initialization to finish. This method is now
> > > > >
> > > > > nit: s/now//
> > > >
> > > > Ok
> > > >
> > > > >
> > > > > What do you mean by deprecated? There's no requirement to implement the
> > > > > HSM extension, right?
> > > >
> > > > I would say yes, you have to use a recent version of openSBI that
> > > > supports the HSM extension. @Atish Kumar Patra WDYT?
> > >
> > > Uh, you don't need to use OpenSBI in the first place, let alone use a
> > > recent version of it, right? What am I missing?
> >
> > You need a M-Mode firmware which follows the SBI specification and
> > that implements the HSM extension.
>
> Firstly, and maybe I am showing my ignorance here, but we do support
> m-mode in Linux, and SMP is not disabled for m-mode kernels where it is
> required to use the spinwait method.

You're right.

> Secondly, I don't think that we've actually noted that non-HSM booting
> is deprecated before now - at least not somewhere obviously. Things like
> the platform spec on github might require it & it may be deprecated in
> SBI implementations etc, but in the Kconfig option it is not described
> as deprecated. The Kconfig option only says that it "should be only
> enabled for M-mode Linux or platforms relying on older firmware without
> SBI HSM extension".
> I think marking it as deprecated here is not accurate, and instead we
> would be better off pointing out what the limitations of the method are
> and noting the limited situations when it should be used.

You're right again, before this is officially deprecated, I'll point
out the limitations of this method only as follows:

Kernel entrance
---------------

On SMP systems, there are 2 methods to enter the kernel:

- `RISCV_BOOT_SPINWAIT`: the firmware releases all harts in the kernel, one hart
  wins a lottery and executes the early boot code while the other
harts are
  parked waiting for the initialization to finish. This method is mostly used to
  support older firmwares without SBI HSM extension and M-mode RISC-V
kernel.
- `Ordered booting`: the firmware releases only one hart that will
execute the
  initialization phase and then will start all other harts using the SBI HSM
  extension. The ordered booting method is the preferred booting
method for
  booting the RISC-V kernel because it can support cpu hotplug and kexec.

>
> > > Also, what about !SMP systems? Although my suggested new section title
> > > kinda addresses that!
> >
> > I'll add "On SMP systems, there exist 2 methods to enter the
>
> nit: s/exist/are/
>
> > kernel:....", I don't think we need to detail the !SMP case as it is
> > obvious.
>
> That's fine. Maybe I am just a pedant, but I think it is good to be a
> bit over precise.
>
> > > > > > +  **deprecated**.
> > > > > > +- Ordered booting: the firmware releases only one hart that will execute the
> > > > > > +  initialization phase and then will start all other harts using the SBI HSM
> > > > > > +  extension.
>
> > > > > > +---------------------
> > > > > > +
> > > > > > +The installation of the virtual mapping is done in 2 steps in the RISC-V kernel:
> > > > > > +
> > > > > > +1. :c:func:`setup_vm` installs a temporary kernel mapping in
> > > > > > +   :c:var:`early_pg_dir` which allows to discover the system memory: only the
> > > > >
> > > > > s/to discover/discovery of/
> > > >
> > > > You mean "the discovery of" right?
> > >
> > > No? The "the" there would not be required.
> >
> > That was a genuine question, thanks
>
> Sorry if the "No?" came across as aggressive, I meant it inquisitively.
> Adding the "the" changes the meaning slightly, but not in a way that
> that is relevant here.

No worries :)

>
> > > > > > +Pre-MMU execution
> > > > > > +-----------------
> > > > > > +
> > > > > > +Any code that executes before even the first virtual mapping is established
> > > > > > +must be very carefully compiled as:
> > > > >
> > > > > Could you point out what the non-obvious examples of this code are?
> > > >
> > > > I can do a list, yes
> > >
> > > Not even a list, just something like "...established, for example early
> > > alternatives and foo, must be very..."
> >
> > Done as follows:
> >
> > "A few pieces of code need to run before even the first virtual mapping is
> > established, that comprises the installation of the first virtual mapping, the
> > early alternatives and the early parsing of the kernel command line. That code
> > must be very carefully compiled as:..."
>
> Changed slightly:
> "A few pieces of code need to run before even the first virtual mapping is
> established. These are the installation of the first virtual mapping itself,
> patching of early alternatives and the early parsing of the kernel command line.
> That code must be very carefully compiled as:..."
>
> Two minor suggestions there, one to make the it more obvious what the first
> section inside commas refers to & one to note what it is that we do with
> the alternatives.
>

Thanks

> Cheers,
> Conor.
  
Alexandre Ghiti June 20, 2023, 12:06 p.m. UTC | #16
On Tue, Jun 20, 2023 at 12:32 PM Song Shuai <songshuaishuai@tinylab.org> wrote:
>
>
>
> 在 2023/6/19 17:47, Alexandre Ghiti 写道:
> > This document describes the constraints and requirements of the early
> > boot process in a RISC-V kernel.
> >
> > Szigned-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
> > ---
> >   Documentation/riscv/boot-image-header.rst |   3 -
> >   Documentation/riscv/boot.rst              | 181 ++++++++++++++++++++++
> >   Documentation/riscv/index.rst             |   1 +
> >   3 files changed, 182 insertions(+), 3 deletions(-)
> >   create mode 100644 Documentation/riscv/boot.rst
> >
> > diff --git a/Documentation/riscv/boot-image-header.rst b/Documentation/riscv/boot-image-header.rst
> > index d7752533865f..a4a45310c4c4 100644
> > --- a/Documentation/riscv/boot-image-header.rst
> > +++ b/Documentation/riscv/boot-image-header.rst
> > @@ -7,9 +7,6 @@ Boot image header in RISC-V Linux
> >
> >   This document only describes the boot image header details for RISC-V Linux.
> >
> > -TODO:
> > -  Write a complete booting guide.
> > -
> >   The following 64-byte header is present in decompressed Linux kernel image::
> >
> >       u32 code0;                /* Executable code */
> > diff --git a/Documentation/riscv/boot.rst b/Documentation/riscv/boot.rst
> > new file mode 100644
> > index 000000000000..b02230818b79
> > --- /dev/null
> > +++ b/Documentation/riscv/boot.rst
> > @@ -0,0 +1,181 @@
> > +.. SPDX-License-Identifier: GPL-2.0
> > +
> > +=============================================
> > +Early boot requirements/constraints on RISC-V
> > +=============================================
> > +
> > +:Author: Alexandre Ghiti <alexghiti@rivosinc.com>
> > +:Date: 23 May 2023
> > +
> > +This document describes what the RISC-V kernel expects from the previous stages
> > +and the firmware, but also the constraints that any developer must have in mind
> > +when touching the early boot process, e.g. before the final virtual mapping is
> > +setup.
> > +
> > +Pre-kernel boot (Expectations from firmware)
> > +============================================
> > +
> > +Registers state
> > +---------------
> > +
> > +The RISC-V kernel expects:
> > +
> > +  * `$a0` to contain the hartid of the current core.
> > +  * `$a1` to contain the address of the device tree in memory.
> > +
> > +CSR state
> > +---------
> > +
> > +The RISC-V kernel expects:
> > +
> > +  * `$satp = 0`: the MMU must be disabled.
> > +
> > +Reserved memory for resident firmware
> > +-------------------------------------
> > +
> > +The RISC-V kernel expects the firmware to mark any resident memory with the
> > +`no-map` flag, thus the kernel won't map those regions in the direct mapping
> > +(avoiding issues with hibernation, speculative accesses and probably other
> > +subsystems).
> > +
> > +Kernel location
> > +---------------
> > +
> > +The RISC-V kernel expects to be placed at a PMD boundary (2MB for rv64 and 4MB
> > +for rv32). Note though that the EFI stub will physically relocate the kernel if
> > +that's not the case.
> > +
> > +Device-tree
> > +-----------
> > +
> > +The RISC-V kernel always expects a device tree, it is:
> > +
> > +- either passed directly to the kernel from the previous stage using the `$a1`
> > +  register,
> > +- or when booting with UEFI, the device tree will be retrieved by the EFI stub
> > +  using the EFI configuration table or it will be created.
> > +
> > +Bootflow
> > +--------
> > +
> > +There exist 2 methods to enter the kernel:
> > +
> > +- `RISCV_BOOT_SPINWAIT`: the firmware releases all harts in the kernel, one hart
> > +  wins a lottery and executes the early boot code while the other harts are
> > +  parked waiting for the initialization to finish. This method is now
> > +  **deprecated**.
> > +- Ordered booting: the firmware releases only one hart that will execute the
> > +  initialization phase and then will start all other harts using the SBI HSM
> > +  extension.
> > +
> > +UEFI
> > +----
> > +
> > +UEFI memory map
> > +~~~~~~~~~~~~~~~
> > +
> > +When booting with UEFI, the RISC-V kernel will use only the EFI memory map to
> > +populate the system memory.
> > +
> > +The UEFI firmware must parse the subnodes of the `/reserved-memory` device tree
> > +node and abide by the device tree specification to convert the attributes of
> > +those subnodes (`no-map` and `reusable`) into their correct EFI equivalent
> > +(refer to section "3.5.4 /reserved-memory and UEFI" of the device tree
> > +specification).
> how about declare the version of Device Tree specification?
> like the devicetree-specification-v0.4-rc1 we recently reference

You're right, the section number refers to this version of the
specification, so it should be noted.

> > +
> > +RISCV_EFI_BOOT_PROTOCOL
> > +~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +When booting with UEFI, the EFI stub requires the boot hartid in order to pass
> > +it to the RISC-V kernel in `$a1`. The EFI stub retrieves the boot hartid using
> > +one of the following methods:
> > +
> > +- `RISCV_EFI_BOOT_PROTOCOL` (**preferred**).
> > +- `boot-hartid` device tree subnode (**deprecated**).
> > +
> > +Any new firmware must implement `RISCV_EFI_BOOT_PROTOCOL` as the device tree
> > +based approach is deprecated now.
> > +
> > +During kernel boot: (Kernel internals)
> > +======================================
> > +
> > +EFI stub and device tree
> > +------------------------
> > +
> > +When booting with UEFI, the device tree is supplemented by the EFI stub with the
> > +following parameters (largely shared with arm64 in Documentation/arm/uefi.rst):
> > +
> > +==========================  ======   ===========================================
> > +Name                        Size     Description
> > +==========================  ======   ===========================================
> > +linux,uefi-system-table     64-bit   Physical address of the UEFI System Table.
> > +
> > +linux,uefi-mmap-start       64-bit   Physical address of the UEFI memory map,
> > +                                     populated by the UEFI GetMemoryMap() call.
> > +
> > +linux,uefi-mmap-size        32-bit   Size in bytes of the UEFI memory map
> > +                                     pointed to in previous entry.
> > +
> > +linux,uefi-mmap-desc-size   32-bit   Size in bytes of each entry in the UEFI
> > +                                     memory map.
> > +
> > +linux,uefi-mmap-desc-ver    32-bit   Version of the mmap descriptor format.
> > +
> > +kaslr-seed                  64-bit   Entropy used to randomize the kernel image
> > +                                     base address location.
> > +
> > +bootargs                             Kernel command line
> how about use "string" to declare the type of "bootargs"
> and replace the "Size" in header with "Type"

Why not, this will be more accurate.

> > +==========================  ======   ===========================================
> > +
> > +Virtual mapping setup
> > +---------------------
> > +
> > +The installation of the virtual mapping is done in 2 steps in the RISC-V kernel:
> > +
> > +1. :c:func:`setup_vm` installs a temporary kernel mapping in
> > +   :c:var:`early_pg_dir` which allows to discover the system memory: only the
> > +   kernel text/data are mapped at this point. When establishing this mapping,
> > +   no allocation can be done (since the system memory is not known yet), so
> > +   :c:var:`early_pg_dir` page table is statically allocated (using only one
> > +   table for each level).
> > +
> > +2. :c:func:`setup_vm_final` creates the final kernel mapping in
> > +   :c:var:`swapper_pg_dir` and takes advantage of the discovered system memory
> > +   to create the linear mapping. When establishing this mapping, the kernel
> > +   can allocate memory but cannot access it directly (since the direct mapping
> > +   is not present yet), so it uses temporary mappings in the fixmap region to
> > +   be able to access the newly allocated page table levels.
> > +
> > +For :c:func:`virt_to_phys` and :c:func:`phys_to_virt` to be able to correctly
> > +convert direct mapping addresses to physical addresses, it needs to know the
> > +start of the DRAM: this happens after 1, right before 2 installs the direct
> > +mapping (see :c:func:`setup_bootmem` function in arch/riscv/mm/init.c). So
> > +any usage of those macros before the final virtual mapping is installed must be
> > +carefully examined.
> > +
> > +Device-tree mapping via fixmap
> > +------------------------------
> > +
> > +The RISC-V kernel uses the fixmap region to map the device tree because the
> > +device tree virtual mapping must remain the same between :c:func:`setup_vm` and
> > +:c:func:`setup_vm_final` calls since :c:var:`reserved_mem` array is initialized
> > +with virtual addresses established by :c:func:`setup_vm` and used with the
> > +mapping established by :c:func:`setup_vm_final`.
> > +
> > +Pre-MMU execution
> > +-----------------
> > +
> > +Any code that executes before even the first virtual mapping is established
> > +must be very carefully compiled as:
> > +
> > +- `-fno-pie`: This is needed for relocatable kernels which use `-fPIE`, since
> > +  otherwise, any access to a global symbol would go through the GOT which is
> > +  only relocated virtually.
> > +- `-mcmodel=medany`: Any access to a global symbol must be PC-relative to avoid
> > +  any relocations to happen before the MMU is setup.
> > +- Also note that *all* instrumentation must also be disabled (that includes
> > +  KASAN, ftrace and others).
> > +
> > +As using a symbol from a different compilation unit requires this unit to be
> > +compiled with those flags, we advise, as much as possible, not to use external
> > +symbols.
> > diff --git a/Documentation/riscv/index.rst b/Documentation/riscv/index.rst
> > index 175a91db0200..1f66062def6d 100644
> > --- a/Documentation/riscv/index.rst
> > +++ b/Documentation/riscv/index.rst
> > @@ -5,6 +5,7 @@ RISC-V architecture
> >   .. toctree::
> >       :maxdepth: 1
> >
> > +    boot
> >       boot-image-header
> >       vm-layout
> >       hwprobe
>
> --
> Thanks
> Song Shuai
>

Thanks Song!
  

Patch

diff --git a/Documentation/riscv/boot-image-header.rst b/Documentation/riscv/boot-image-header.rst
index d7752533865f..a4a45310c4c4 100644
--- a/Documentation/riscv/boot-image-header.rst
+++ b/Documentation/riscv/boot-image-header.rst
@@ -7,9 +7,6 @@  Boot image header in RISC-V Linux
 
 This document only describes the boot image header details for RISC-V Linux.
 
-TODO:
-  Write a complete booting guide.
-
 The following 64-byte header is present in decompressed Linux kernel image::
 
 	u32 code0;		  /* Executable code */
diff --git a/Documentation/riscv/boot.rst b/Documentation/riscv/boot.rst
new file mode 100644
index 000000000000..b02230818b79
--- /dev/null
+++ b/Documentation/riscv/boot.rst
@@ -0,0 +1,181 @@ 
+.. SPDX-License-Identifier: GPL-2.0
+
+=============================================
+Early boot requirements/constraints on RISC-V
+=============================================
+
+:Author: Alexandre Ghiti <alexghiti@rivosinc.com>
+:Date: 23 May 2023
+
+This document describes what the RISC-V kernel expects from the previous stages
+and the firmware, but also the constraints that any developer must have in mind
+when touching the early boot process, e.g. before the final virtual mapping is
+setup.
+
+Pre-kernel boot (Expectations from firmware)
+============================================
+
+Registers state
+---------------
+
+The RISC-V kernel expects:
+
+  * `$a0` to contain the hartid of the current core.
+  * `$a1` to contain the address of the device tree in memory.
+
+CSR state
+---------
+
+The RISC-V kernel expects:
+
+  * `$satp = 0`: the MMU must be disabled.
+
+Reserved memory for resident firmware
+-------------------------------------
+
+The RISC-V kernel expects the firmware to mark any resident memory with the
+`no-map` flag, thus the kernel won't map those regions in the direct mapping
+(avoiding issues with hibernation, speculative accesses and probably other
+subsystems).
+
+Kernel location
+---------------
+
+The RISC-V kernel expects to be placed at a PMD boundary (2MB for rv64 and 4MB
+for rv32). Note though that the EFI stub will physically relocate the kernel if
+that's not the case.
+
+Device-tree
+-----------
+
+The RISC-V kernel always expects a device tree, it is:
+
+- either passed directly to the kernel from the previous stage using the `$a1`
+  register,
+- or when booting with UEFI, the device tree will be retrieved by the EFI stub
+  using the EFI configuration table or it will be created.
+
+Bootflow
+--------
+
+There exist 2 methods to enter the kernel:
+
+- `RISCV_BOOT_SPINWAIT`: the firmware releases all harts in the kernel, one hart
+  wins a lottery and executes the early boot code while the other harts are
+  parked waiting for the initialization to finish. This method is now
+  **deprecated**.
+- Ordered booting: the firmware releases only one hart that will execute the
+  initialization phase and then will start all other harts using the SBI HSM
+  extension.
+
+UEFI
+----
+
+UEFI memory map
+~~~~~~~~~~~~~~~
+
+When booting with UEFI, the RISC-V kernel will use only the EFI memory map to
+populate the system memory.
+
+The UEFI firmware must parse the subnodes of the `/reserved-memory` device tree
+node and abide by the device tree specification to convert the attributes of
+those subnodes (`no-map` and `reusable`) into their correct EFI equivalent
+(refer to section "3.5.4 /reserved-memory and UEFI" of the device tree
+specification).
+
+RISCV_EFI_BOOT_PROTOCOL
+~~~~~~~~~~~~~~~~~~~~~~~
+
+When booting with UEFI, the EFI stub requires the boot hartid in order to pass
+it to the RISC-V kernel in `$a1`. The EFI stub retrieves the boot hartid using
+one of the following methods:
+
+- `RISCV_EFI_BOOT_PROTOCOL` (**preferred**).
+- `boot-hartid` device tree subnode (**deprecated**).
+
+Any new firmware must implement `RISCV_EFI_BOOT_PROTOCOL` as the device tree
+based approach is deprecated now.
+
+During kernel boot: (Kernel internals)
+======================================
+
+EFI stub and device tree
+------------------------
+
+When booting with UEFI, the device tree is supplemented by the EFI stub with the
+following parameters (largely shared with arm64 in Documentation/arm/uefi.rst):
+
+==========================  ======   ===========================================
+Name                        Size     Description
+==========================  ======   ===========================================
+linux,uefi-system-table     64-bit   Physical address of the UEFI System Table.
+
+linux,uefi-mmap-start       64-bit   Physical address of the UEFI memory map,
+                                     populated by the UEFI GetMemoryMap() call.
+
+linux,uefi-mmap-size        32-bit   Size in bytes of the UEFI memory map
+                                     pointed to in previous entry.
+
+linux,uefi-mmap-desc-size   32-bit   Size in bytes of each entry in the UEFI
+                                     memory map.
+
+linux,uefi-mmap-desc-ver    32-bit   Version of the mmap descriptor format.
+
+kaslr-seed                  64-bit   Entropy used to randomize the kernel image
+                                     base address location.
+
+bootargs                             Kernel command line
+==========================  ======   ===========================================
+
+Virtual mapping setup
+---------------------
+
+The installation of the virtual mapping is done in 2 steps in the RISC-V kernel:
+
+1. :c:func:`setup_vm` installs a temporary kernel mapping in
+   :c:var:`early_pg_dir` which allows to discover the system memory: only the
+   kernel text/data are mapped at this point. When establishing this mapping,
+   no allocation can be done (since the system memory is not known yet), so
+   :c:var:`early_pg_dir` page table is statically allocated (using only one
+   table for each level).
+
+2. :c:func:`setup_vm_final` creates the final kernel mapping in
+   :c:var:`swapper_pg_dir` and takes advantage of the discovered system memory
+   to create the linear mapping. When establishing this mapping, the kernel
+   can allocate memory but cannot access it directly (since the direct mapping
+   is not present yet), so it uses temporary mappings in the fixmap region to
+   be able to access the newly allocated page table levels.
+
+For :c:func:`virt_to_phys` and :c:func:`phys_to_virt` to be able to correctly
+convert direct mapping addresses to physical addresses, it needs to know the
+start of the DRAM: this happens after 1, right before 2 installs the direct
+mapping (see :c:func:`setup_bootmem` function in arch/riscv/mm/init.c). So
+any usage of those macros before the final virtual mapping is installed must be
+carefully examined.
+
+Device-tree mapping via fixmap
+------------------------------
+
+The RISC-V kernel uses the fixmap region to map the device tree because the
+device tree virtual mapping must remain the same between :c:func:`setup_vm` and
+:c:func:`setup_vm_final` calls since :c:var:`reserved_mem` array is initialized
+with virtual addresses established by :c:func:`setup_vm` and used with the
+mapping established by :c:func:`setup_vm_final`.
+
+Pre-MMU execution
+-----------------
+
+Any code that executes before even the first virtual mapping is established
+must be very carefully compiled as:
+
+- `-fno-pie`: This is needed for relocatable kernels which use `-fPIE`, since
+  otherwise, any access to a global symbol would go through the GOT which is
+  only relocated virtually.
+- `-mcmodel=medany`: Any access to a global symbol must be PC-relative to avoid
+  any relocations to happen before the MMU is setup.
+- Also note that *all* instrumentation must also be disabled (that includes
+  KASAN, ftrace and others).
+
+As using a symbol from a different compilation unit requires this unit to be
+compiled with those flags, we advise, as much as possible, not to use external
+symbols.
diff --git a/Documentation/riscv/index.rst b/Documentation/riscv/index.rst
index 175a91db0200..1f66062def6d 100644
--- a/Documentation/riscv/index.rst
+++ b/Documentation/riscv/index.rst
@@ -5,6 +5,7 @@  RISC-V architecture
 .. toctree::
     :maxdepth: 1
 
+    boot
     boot-image-header
     vm-layout
     hwprobe