Message ID | 20221213234505.173468-1-npache@redhat.com |
---|---|
State | New |
Headers | From: Nico Pache <npache@redhat.com>; To: linux-kernel@vger.kernel.org, linux-mm@kvack.org; Cc: muchun.song@linux.dev, mike.kravetz@oracle.com, akpm@linux-foundation.org, willy@infradead.org, gerald.schaefer@linux.ibm.com; Date: Tue, 13 Dec 2022 16:45:05 -0700; Message-Id: <20221213234505.173468-1-npache@redhat.com> |
Series | [RFC,V2] mm: add the zero case to page[1].compound_nr in set_compound_order |
Commit Message
Nico Pache
Dec. 13, 2022, 11:45 p.m. UTC
Since commit 1378a5ee451a ("mm: store compound_nr as well as
compound_order"), page[1].compound_nr must be explicitly set to 0 when
calling set_compound_order(page, 0).

This can lead to bugs if the caller of set_compound_order(page, 0) forgets
to explicitly set compound_nr=0. An example of this is commit ba9c1201beaa
("mm/hugetlb: clear compound_nr before freeing gigantic pages").

Collapse these calls into set_compound_order() by utilizing branchless
bitmath [1].
[1] https://graphics.stanford.edu/~seander/bithacks.html#ConditionalSetOrClearBitsWithoutBranching
V2: slight changes to commit log and remove extra '//' in the comments
Signed-off-by: Nico Pache <npache@redhat.com>
---
include/linux/mm.h | 6 +++++-
mm/hugetlb.c | 6 ------
2 files changed, 5 insertions(+), 7 deletions(-)
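For readers unfamiliar with the trick referenced as [1], the following is a minimal userspace sketch of the branchless conditional (the helper name and test harness are purely illustrative, not part of the patch): it yields 0 for order == 0 and 1 << order otherwise, without a branch.

```c
#include <stdio.h>

/* Illustrative stand-alone version of the expression used in the patch. */
static unsigned int branchless_nr(unsigned int order)
{
	unsigned int shift = 1U << order;

	/* order == 0 -> 0, order > 0 -> shift; note & binds tighter than ^ */
	return shift ^ ((-order ^ shift) & shift);
}

int main(void)
{
	for (unsigned int order = 0; order < 11; order++)
		printf("order %u -> %u\n", order, branchless_nr(order));
	return 0;
}
```

In the patch itself, shift corresponds to the local variable of the same name and the result is what ends up in page[1].compound_nr.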
Comments
On 12/13/22 16:45, Nico Pache wrote:
> Since commit 1378a5ee451a ("mm: store compound_nr as well as
> compound_order") the page[1].compound_nr must be explicitly set to 0 if
> calling set_compound_order(page, 0).
>
> This can lead to bugs if the caller of set_compound_order(page, 0) forgets
> to explicitly set compound_nr=0. An example of this is commit ba9c1201beaa
> ("mm/hugetlb: clear compound_nr before freeing gigantic pages")

There has been some recent work in this area. The latest patch being,
https://lore.kernel.org/linux-mm/20221213212053.106058-1-sidhartha.kumar@oracle.com/
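For context on "the branch" discussed in the rest of the thread: the series linked above moves this logic into folio_set_order() and uses an explicit conditional rather than bitmath. A rough sketch of that shape (not quoted from the linked patch; the field names and the WARN are assumptions) is:

```c
static inline void folio_set_order(struct folio *folio, unsigned int order)
{
	if (WARN_ON_ONCE(!folio_test_large(folio)))
		return;

	folio->_folio_order = order;
#ifdef CONFIG_64BIT
	/* the branch under discussion: order 0 stores 0, otherwise 1 << order */
	folio->_folio_nr_pages = order ? 1U << order : 0;
#endif
}
```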
Hi Mike,

Thanks for the pointer! Would the branchless conditional be an
improvement over the current approach? I'm not sure how hot this path
is, but it may be worth the optimization.

-- Nico

On Tue, Dec 13, 2022 at 4:48 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 12/13/22 16:45, Nico Pache wrote:
> > Since commit 1378a5ee451a ("mm: store compound_nr as well as
> > compound_order") the page[1].compound_nr must be explicitly set to 0 if
> > calling set_compound_order(page, 0).
> >
> > This can lead to bugs if the caller of set_compound_order(page, 0) forgets
> > to explicitly set compound_nr=0. An example of this is commit ba9c1201beaa
> > ("mm/hugetlb: clear compound_nr before freeing gigantic pages")
>
> There has been some recent work in this area. The latest patch being,
> https://lore.kernel.org/linux-mm/20221213212053.106058-1-sidhartha.kumar@oracle.com/
>
> --
> Mike Kravetz
>
According to the document linked the following approach is even faster
than the one I used due to CPU parallelization:

page[1].compound_nr = (shift & ~shift) | (-order & shift);

for (int x = 0; x < 11; x++) {
	unsigned int order = x;
	unsigned long shift = 1U << order;
	printf("order %d output : %lu\n", order, (shift & ~shift) | (-order & shift));
}

order 0 output : 0
order 1 output : 2
order 2 output : 4
order 3 output : 8
order 4 output : 16
order 5 output : 32
order 6 output : 64
order 7 output : 128
order 8 output : 256

-- Nico

On Tue, Dec 13, 2022 at 4:53 PM Nico Pache <npache@redhat.com> wrote:
>
> Hi Mike,
>
> Thanks for the pointer! Would the branchless conditional be an
> improvement over the current approach? I'm not sure how hot this path
> is, but it may be worth the optimization.
>
> -- Nico
>
> On Tue, Dec 13, 2022 at 4:48 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
> >
> > On 12/13/22 16:45, Nico Pache wrote:
> > > Since commit 1378a5ee451a ("mm: store compound_nr as well as
> > > compound_order") the page[1].compound_nr must be explicitly set to 0 if
> > > calling set_compound_order(page, 0).
> > >
> > > This can lead to bugs if the caller of set_compound_order(page, 0) forgets
> > > to explicitly set compound_nr=0. An example of this is commit ba9c1201beaa
> > > ("mm/hugetlb: clear compound_nr before freeing gigantic pages")
> >
> > There has been some recent work in this area. The latest patch being,
> > https://lore.kernel.org/linux-mm/20221213212053.106058-1-sidhartha.kumar@oracle.com/
> >
> > --
> > Mike Kravetz
> >
On 12/13/22 17:27, Nico Pache wrote:
> According to the document linked the following approach is even faster
> than the one I used due to CPU parallelization:

I do not think we are very concerned with speed here. This routine is being
called in the creation of compound pages, and in the case of hugetlb the
tear down of gigantic pages. In general, creation and tear down of gigantic
pages happens infrequently. Usually only at system/application startup and
system/application shutdown.

I think the only case where we 'might' be concerned with speed is in the
creation of compound pages for THP. Do note that this code path is
still using set_compound_order as it has not been converted to folios.
On 12/13/22 5:02 PM, Mike Kravetz wrote:
> On 12/13/22 17:27, Nico Pache wrote:
>> According to the document linked the following approach is even faster
>> than the one I used due to CPU parallelization:
>
> I do not think we are very concerned with speed here. This routine is being
> called in the creation of compound pages, and in the case of hugetlb the
> tear down of gigantic pages. In general, creation and tear down of gigantic
> pages happens infrequently. Usually only at system/application startup and
> system/application shutdown.
>

Hi Nico,

I wrote a bpftrace script to track the time spent in
__prep_compound_gigantic_folio both with and without the branch in
folio_set_order() and the resulting histogram was the same for both
versions. This is probably because the for loop through every base page
has a much higher overhead than the singular call to folio_set_order().
I am not sure what the performance difference for THP would be.

@prep_nsecs:
[1M, 2M) 50 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

Below is the script.

Thanks,
Sidhartha Kumar

k:__prep_compound_gigantic_folio
{
	@prep_start[pid] = nsecs;
}

kr:__prep_compound_gigantic_folio
{
	@prep_nsecs = hist((nsecs - @prep_start[pid]));
	delete(@prep_start[pid]);
}

> I think the only case where we 'might' be concerned with speed is in the
> creation of compound pages for THP. Do note that this code path is
> still using set_compound_order as it has not been converted to folios.
On Tue, Dec 13, 2022 at 04:45:05PM -0700, Nico Pache wrote:
> Since commit 1378a5ee451a ("mm: store compound_nr as well as
> compound_order") the page[1].compound_nr must be explicitly set to 0 if
> calling set_compound_order(page, 0).
>
> This can lead to bugs if the caller of set_compound_order(page, 0) forgets
> to explicitly set compound_nr=0. An example of this is commit ba9c1201beaa
> ("mm/hugetlb: clear compound_nr before freeing gigantic pages")
>
> Collapse these calls into the set_compound_order by utilizing branchless
> bitmaths [1].
>
> [1] https://graphics.stanford.edu/~seander/bithacks.html#ConditionalSetOrClearBitsWithoutBranching
>
> V2: slight changes to commit log and remove extra '//' in the comments

We don't usually use // comments anywhere in the kernel other than
the SPDX header.

>  static inline void set_compound_order(struct page *page, unsigned int order)
>  {
> +	unsigned long shift = (1U << order);

Shift is a funny name for this variable.  order is the shift.  this is 'nr'.

>  	page[1].compound_order = order;
>  #ifdef CONFIG_64BIT
> -	page[1].compound_nr = 1U << order;
> +	// Branchless conditional:
> +	// order > 0 --> compound_nr = shift
> +	// order == 0 --> compound_nr = 0
> +	page[1].compound_nr = shift ^ (-order ^ shift) & shift;

Can the compiler see through this?  Before, the compiler sees:

	page[1].compound_order = 0;
	page[1].compound_nr = 1U << 0;
	...
	page[1].compound_nr = 0;

and it can eliminate the first store.

Now the compiler sees:
	unsigned long shift = (1U << 0);
	page[1].compound_order = order;
	page[1].compound_nr = shift ^ (0 ^ shift) & shift;

Does it do the maths at compile-time, knowing that order is 0 at this
callsite and deducing that it can just store a 0?

I think it might, since shift is constant-1,

	page[1].compound_nr = 1 ^ (0 ^ 1) & 1;
->	page[1].compound_nr = 1 ^ 1 & 1;
->	page[1].compound_nr = 0 & 1;
->	page[1].compound_nr = 0;

But you should run it through the compiler and check the assembly
output for __destroy_compound_gigantic_page().
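One way to approach the constant-folding question raised above without building a kernel is to compile a stand-alone mock-up with gcc -O2 -S and inspect the store emitted for a constant-zero call site (the struct and function names below are stand-ins, not the kernel's):

```c
/* Stand-in for struct page; only the folded arithmetic matters here. */
struct fake_page {
	unsigned char compound_order;
	unsigned int compound_nr;
};

static inline void fake_set_compound_order(struct fake_page *page, unsigned int order)
{
	unsigned int shift = 1U << order;

	page[1].compound_order = order;
	/* same expression as the patch; & binds tighter than ^ */
	page[1].compound_nr = shift ^ ((-order ^ shift) & shift);
}

/* Mirrors the destroy path: order is the compile-time constant 0, so the
 * question is whether the emitted store is a plain zero. */
void fake_destroy(struct fake_page *page)
{
	fake_set_compound_order(page, 0);
}
```

Checking the generated assembly, as is done later in the thread against a real kernel build, answers the question directly.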
On Tue, Dec 13, 2022 at 11:38 PM Sidhartha Kumar
<sidhartha.kumar@oracle.com> wrote:
>
> On 12/13/22 5:02 PM, Mike Kravetz wrote:
> > On 12/13/22 17:27, Nico Pache wrote:
> >> According to the document linked the following approach is even faster
> >> than the one I used due to CPU parallelization:
> >
> > I do not think we are very concerned with speed here. This routine is being
> > called in the creation of compound pages, and in the case of hugetlb the
> > tear down of gigantic pages. In general, creation and tear down of gigantic
> > pages happens infrequently. Usually only at system/application startup and
> > system/application shutdown.
> >
> Hi Nico,
>
> I wrote a bpftrace script to track the time spent in
> __prep_compound_gigantic_folio both with and without the branch in
> folio_set_order() and resulting histogram was the same for both
> versions. This is probably because the for loop through every base page
> has a much higher overhead than the singular call to folio_set_order().
> I am not sure what the performance difference for THP would be.

Hi Sidhartha,

Ok great! We may want to proactively implement a branchless version so
once/if THP comes around to utilizing this we won't see a regression.

Furthermore, Waiman brought up a good point off the list:
This bitmath is needlessly complex and can be achieved with
	page[1].compound_nr = (1U << order) & ~1U;

Tested:
order 0 output : 0
order 1 output : 2
order 2 output : 4
order 3 output : 8
order 4 output : 16
order 5 output : 32
order 6 output : 64
order 7 output : 128
order 8 output : 256
order 9 output : 512
order 10 output : 1024

> Below is the script.
> Thanks,
> Sidhartha Kumar

Thanks for the script!!

Cheers,
-- Nico

> k:__prep_compound_gigantic_folio
> {
>	@prep_start[pid] = nsecs;
> }
>
> kr:__prep_compound_gigantic_folio
> {
>	@prep_nsecs = hist((nsecs - @prep_start[pid]));
>	delete(@prep_start[pid]);
> }
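A quick userspace check (illustrative only; the loop bound of 10 is an arbitrary choice, not taken from the patch) that Waiman's simpler expression agrees with the bithacks form for every order:

```c
#include <assert.h>
#include <stdio.h>

int main(void)
{
	for (unsigned int order = 0; order <= 10; order++) {
		unsigned int shift = 1U << order;
		unsigned int bithacks = shift ^ ((-order ^ shift) & shift);
		unsigned int simpler = (1U << order) & ~1U;

		/* both forms must give 0 for order == 0 and 1 << order otherwise */
		assert(bithacks == simpler);
		printf("order %u output : %u\n", order, simpler);
	}
	return 0;
}
```

The simpler form works because bit 0 of 1 << order is set only when order == 0, so masking it off zeroes exactly that case.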
On Wed, Dec 14, 2022 at 10:04 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, Dec 13, 2022 at 04:45:05PM -0700, Nico Pache wrote:
> > Since commit 1378a5ee451a ("mm: store compound_nr as well as
> > compound_order") the page[1].compound_nr must be explicitly set to 0 if
> > calling set_compound_order(page, 0).
> >
> > This can lead to bugs if the caller of set_compound_order(page, 0) forgets
> > to explicitly set compound_nr=0. An example of this is commit ba9c1201beaa
> > ("mm/hugetlb: clear compound_nr before freeing gigantic pages")
> >
> > Collapse these calls into the set_compound_order by utilizing branchless
> > bitmaths [1].
> >
> > [1] https://graphics.stanford.edu/~seander/bithacks.html#ConditionalSetOrClearBitsWithoutBranching
> >
> > V2: slight changes to commit log and remove extra '//' in the comments
>
> We don't usually use // comments anywhere in the kernel other than
> the SPDX header.

Whoops!

> >  static inline void set_compound_order(struct page *page, unsigned int order)
> >  {
> > +	unsigned long shift = (1U << order);
>
> Shift is a funny name for this variable.  order is the shift.  this is 'nr'.

Good point! Waiman found an even better/cleaner solution that would
avoid needing an extra variable.
	page[1].compound_nr = (1U << order) & ~1U;

> >  	page[1].compound_order = order;
> >  #ifdef CONFIG_64BIT
> > -	page[1].compound_nr = 1U << order;
> > +	// Branchless conditional:
> > +	// order > 0 --> compound_nr = shift
> > +	// order == 0 --> compound_nr = 0
> > +	page[1].compound_nr = shift ^ (-order ^ shift) & shift;
>
> Can the compiler see through this?  Before, the compiler sees:
>
> 	page[1].compound_order = 0;
> 	page[1].compound_nr = 1U << 0;
> 	...
> 	page[1].compound_nr = 0;
>
> and it can eliminate the first store.

This may be the case at the moment, but with:
https://lore.kernel.org/linux-mm/20221213212053.106058-1-sidhartha.kumar@oracle.com/
we will have a branch instead. Sidhartha tested it and found no
regression; the concern is that if THPs get implemented using this
callpath then we may end up seeing a slowdown.

After doing my analysis below I don't think this is the case for the
destroy case (at least on x86).
In the destroy case, for both the branch and branchless approach we see
the compiler optimizing away the bitmath and the branch and setting
the variable to zero.
In the prep case we see the introduction of a test and cmovne
instruction, implying a branch.

> Now the compiler sees:
> 	unsigned long shift = (1U << 0);
> 	page[1].compound_order = order;
> 	page[1].compound_nr = shift ^ (0 ^ shift) & shift;
>
> Does it do the maths at compile-time, knowing that order is 0 at this
> callsite and deducing that it can just store a 0?
>
> I think it might, since shift is constant-1,
>
> 	page[1].compound_nr = 1 ^ (0 ^ 1) & 1;
> ->	page[1].compound_nr = 1 ^ 1 & 1;
> ->	page[1].compound_nr = 0 & 1;
> ->	page[1].compound_nr = 0;
>
> But you should run it through the compiler and check the assembly
> output for __destroy_compound_gigantic_page().

Yep it does look like it gets optimized away for the destroy case:

Bitmath Case (destroy)
---------------------------------
Dump of assembler code for function __update_and_free_page:
...
mov    %rsi,%rbp          //move 2nd arg (page) to rbp
...
movb   $0x0,0x51(%rbp)    //page[1].compound_order = 0
movl   $0x0,0x5c(%rbp)    //page[1].compound_nr = 0
...

Math for movl : 0x5c (92) - 64 (sizeof page[0]) = 28
pahole page: unsigned int compound_nr;    /*  28  4 */

Bitmath Case (prep)
---------------------------------
In the case of prep_compound_gigantic_page the bitmath is being computed
0xffffffff8134f17d <+13>:	mov    %rdi,%r12
0xffffffff8134f180 <+16>:	push   %rbp
0xffffffff8134f181 <+17>:	mov    $0x1,%ebp
0xffffffff8134f186 <+22>:	shl    %cl,%ebp
0xffffffff8134f188 <+24>:	neg    %ecx
0xffffffff8134f18a <+26>:	push   %rbx
0xffffffff8134f18b <+27>:	and    %ebp,%ecx
0xffffffff8134f18d <+29>:	mov    %sil,0x51(%rdi)
0xffffffff8134f191 <+33>:	mov    %ecx,0x5c(%rdi)    //set page[1].compound_nr

Now to break down the approach with the branch:

Branch Case (destroy)
---------------------------------
No branch utilized to determine the following instructions.
0xffffffff813507bc <+236>:	movb   $0x0,0x51(%rbp)
0xffffffff813507c0 <+240>:	movl   $0x0,0x5c(%rbp)

Branch Case (prep)
---------------------------------
The branch is being computed with the introduction of a cmovne instruction.
0xffffffff8134f15d <+13>:	mov    %rdi,%r12
0xffffffff8134f160 <+16>:	push   %rbp
0xffffffff8134f161 <+17>:	mov    $0x1,%ebp
0xffffffff8134f166 <+22>:	shl    %cl,%ebp
0xffffffff8134f168 <+24>:	test   %esi,%esi          //test
0xffffffff8134f16a <+26>:	push   %rbx
0xffffffff8134f16b <+27>:	cmovne %ebp,%ecx          //branch evaluation
0xffffffff8134f16e <+30>:	mov    %sil,0x51(%rdi)
0xffffffff8134f172 <+34>:	mov    %ecx,0x5c(%rdi)

So it looks like in the destruction of compound pages we'll see no
gain or loss between the bitmath or branch approach.
However, in the prep case we may see some performance loss once/if THP
utilizes this path due to the branch and the loss of CPU
parallelization that can be achieved utilizing the bitmath approach.

Cheers,
-- Nico
On Wed, Dec 14, 2022 at 7:48 PM Nico Pache <npache@redhat.com> wrote:
>
> On Wed, Dec 14, 2022 at 10:04 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Tue, Dec 13, 2022 at 04:45:05PM -0700, Nico Pache wrote:
> > > Since commit 1378a5ee451a ("mm: store compound_nr as well as
> > > compound_order") the page[1].compound_nr must be explicitly set to 0 if
> > > calling set_compound_order(page, 0).
> > >
> > > This can lead to bugs if the caller of set_compound_order(page, 0) forgets
> > > to explicitly set compound_nr=0. An example of this is commit ba9c1201beaa
> > > ("mm/hugetlb: clear compound_nr before freeing gigantic pages")
> > >
> > > Collapse these calls into the set_compound_order by utilizing branchless
> > > bitmaths [1].
> > >
> > > [1] https://graphics.stanford.edu/~seander/bithacks.html#ConditionalSetOrClearBitsWithoutBranching
> > >
> > > V2: slight changes to commit log and remove extra '//' in the comments
> >
> > We don't usually use // comments anywhere in the kernel other than
> > the SPDX header.
>
> Whoops!
>
> > >  static inline void set_compound_order(struct page *page, unsigned int order)
> > >  {
> > > +	unsigned long shift = (1U << order);
> >
> > Shift is a funny name for this variable.  order is the shift.  this is 'nr'.
>
> Good point! Waiman found an even better/cleaner solution that would
> avoid needing an extra variable.
> 	page[1].compound_nr = (1U << order) & ~1U;
>
> > >  	page[1].compound_order = order;
> > >  #ifdef CONFIG_64BIT
> > > -	page[1].compound_nr = 1U << order;
> > > +	// Branchless conditional:
> > > +	// order > 0 --> compound_nr = shift
> > > +	// order == 0 --> compound_nr = 0
> > > +	page[1].compound_nr = shift ^ (-order ^ shift) & shift;
> >
> > Can the compiler see through this?  Before, the compiler sees:
> >
> > 	page[1].compound_order = 0;
> > 	page[1].compound_nr = 1U << 0;
> > 	...
> > 	page[1].compound_nr = 0;
> >
> > and it can eliminate the first store.
>
> This may be the case at the moment, but with:
> https://lore.kernel.org/linux-mm/20221213212053.106058-1-sidhartha.kumar@oracle.com/
> we will have a branch instead. Sidhartha tested it and found no
> regression; the concern is that if THPs get implemented using this
> callpath then we may end up seeing a slowdown.
>
> After doing my analysis below I don't think this is the case for the
> destroy case (at least on x86).
> In the destroy case, for both the branch and branchless approach we see
> the compiler optimizing away the bitmath and the branch and setting
> the variable to zero.
> In the prep case we see the introduction of a test and cmovne
> instruction, implying a branch.
>
> > Now the compiler sees:
> > 	unsigned long shift = (1U << 0);
> > 	page[1].compound_order = order;
> > 	page[1].compound_nr = shift ^ (0 ^ shift) & shift;
> >
> > Does it do the maths at compile-time, knowing that order is 0 at this
> > callsite and deducing that it can just store a 0?
> >
> > I think it might, since shift is constant-1,
> >
> > 	page[1].compound_nr = 1 ^ (0 ^ 1) & 1;
> > ->	page[1].compound_nr = 1 ^ 1 & 1;
> > ->	page[1].compound_nr = 0 & 1;
> > ->	page[1].compound_nr = 0;
> >
> > But you should run it through the compiler and check the assembly
> > output for __destroy_compound_gigantic_page().
>
> Yep it does look like it gets optimized away for the destroy case:
>
> Bitmath Case (destroy)
> ---------------------------------
> Dump of assembler code for function __update_and_free_page:
> ...
> mov    %rsi,%rbp          //move 2nd arg (page) to rbp
> ...
> movb   $0x0,0x51(%rbp)    //page[1].compound_order = 0
> movl   $0x0,0x5c(%rbp)    //page[1].compound_nr = 0
> ...
>
> Math for movl : 0x5c (92) - 64 (sizeof page[0]) = 28
> pahole page: unsigned int compound_nr;    /*  28  4 */
>
> Bitmath Case (prep)
> ---------------------------------
> In the case of prep_compound_gigantic_page the bitmath is being computed
> 0xffffffff8134f17d <+13>:	mov    %rdi,%r12
> 0xffffffff8134f180 <+16>:	push   %rbp
> 0xffffffff8134f181 <+17>:	mov    $0x1,%ebp
> 0xffffffff8134f186 <+22>:	shl    %cl,%ebp
> 0xffffffff8134f188 <+24>:	neg    %ecx
> 0xffffffff8134f18a <+26>:	push   %rbx
> 0xffffffff8134f18b <+27>:	and    %ebp,%ecx
> 0xffffffff8134f18d <+29>:	mov    %sil,0x51(%rdi)
> 0xffffffff8134f191 <+33>:	mov    %ecx,0x5c(%rdi)    //set page[1].compound_nr
>
> Now to break down the approach with the branch:
>
> Branch Case (destroy)
> ---------------------------------
> No branch utilized to determine the following instructions.
> 0xffffffff813507bc <+236>:	movb   $0x0,0x51(%rbp)
> 0xffffffff813507c0 <+240>:	movl   $0x0,0x5c(%rbp)
>
> Branch Case (prep)
> ---------------------------------
> The branch is being computed with the introduction of a cmovne instruction.
> 0xffffffff8134f15d <+13>:	mov    %rdi,%r12
> 0xffffffff8134f160 <+16>:	push   %rbp
> 0xffffffff8134f161 <+17>:	mov    $0x1,%ebp
> 0xffffffff8134f166 <+22>:	shl    %cl,%ebp
> 0xffffffff8134f168 <+24>:	test   %esi,%esi          //test
> 0xffffffff8134f16a <+26>:	push   %rbx
> 0xffffffff8134f16b <+27>:	cmovne %ebp,%ecx          //branch evaluation
> 0xffffffff8134f16e <+30>:	mov    %sil,0x51(%rdi)
> 0xffffffff8134f172 <+34>:	mov    %ecx,0x5c(%rdi)
>

To expand a little more on the analysis:
I computed the latency/throughput between <+24> and <+27> using
intel's manual (APPENDIX D):

The bitmath solution shows a total latency of 2.5 with a throughput of 0.5.
The branch solution shows a total latency of 4 and a throughput of 1.5.

Given this is not a tight loop, and the next instruction is requiring
the data computed, better (lower) latency is the more ideal situation.

Just wanted to add that little piece :)

-- Nico

> So it looks like in the destruction of compound pages we'll see no
> gain or loss between the bitmath or branch approach.
> However, in the prep case we may see some performance loss once/if THP
> utilizes this path due to the branch and the loss of CPU
> parallelization that can be achieved utilizing the bitmath approach.
>
> Cheers,
> -- Nico
On Thu, Dec 15, 2022 at 02:38:28PM -0700, Nico Pache wrote:
> To expand a little more on the analysis:
> I computed the latency/throughput between <+24> and <+27> using
> intel's manual (APPENDIX D):
>
> The bitmath solution shows a total latency of 2.5 with a throughput of 0.5.
> The branch solution shows a total latency of 4 and a throughput of 1.5.
>
> Given this is not a tight loop, and the next instruction is requiring
> the data computed, better (lower) latency is the more ideal situation.
>
> Just wanted to add that little piece :)

I appreciate how hard you're working on this, but it really is straining
at gnats ;-)  For a modern cpu, the most important thing is cache misses
and avoiding dirtying cachelines.  Cycle counting isn't that important
when an L3 cache miss takes 2000 (or more) cycles.
On Thu, Dec 15, 2022 at 2:47 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Thu, Dec 15, 2022 at 02:38:28PM -0700, Nico Pache wrote:
> > To expand a little more on the analysis:
> > I computed the latency/throughput between <+24> and <+27> using
> > intel's manual (APPENDIX D):
> >
> > The bitmath solution shows a total latency of 2.5 with a throughput of 0.5.
> > The branch solution shows a total latency of 4 and a throughput of 1.5.
> >
> > Given this is not a tight loop, and the next instruction is requiring
> > the data computed, better (lower) latency is the more ideal situation.
> >
> > Just wanted to add that little piece :)
>
> I appreciate how hard you're working on this, but it really is straining
> at gnats ;-)  For a modern cpu, the most important thing is cache misses
> and avoiding dirtying cachelines.  Cycle counting isn't that important
> when an L3 cache miss takes 2000 (or more) cycles.

Haha yeah I figured so once I saw the results, but I figured I'd share.

We have HPC systems in the TiB of memory so sometimes gnats matter ;p
The 2-3 extra cycles may turn into 2 million extra cycles on a 2TiB
system full of THPs -- I guess that's not a significant amount of cycles
either in the grand scheme of things.

Cheers,
-- Nico
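The "2 million extra cycles" figure works out as follows, assuming 2 MiB THPs (the THP size is an assumption; the thread does not state it):

```c
#include <stdio.h>

int main(void)
{
	unsigned long long mem  = 2ULL << 40;	/* 2 TiB of memory */
	unsigned long long thp  = 2ULL << 20;	/* 2 MiB per THP (assumed) */
	unsigned long long nthp = mem / thp;	/* ~1 million THPs */

	/* 2-3 extra cycles per compound-page prep */
	printf("%llu THPs -> %llu-%llu extra cycles\n", nthp, 2 * nthp, 3 * nthp);
	return 0;
}
```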
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6a05a3bc0a28..9510f6294706 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -938,9 +938,13 @@ static inline int head_compound_pincount(struct page *head)
 
 static inline void set_compound_order(struct page *page, unsigned int order)
 {
+	unsigned long shift = (1U << order);
 	page[1].compound_order = order;
 #ifdef CONFIG_64BIT
-	page[1].compound_nr = 1U << order;
+	// Branchless conditional:
+	// order > 0 --> compound_nr = shift
+	// order == 0 --> compound_nr = 0
+	page[1].compound_nr = shift ^ (-order ^ shift) & shift;
 #endif
 }
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3d9f4abec17c..706dec43a6a2 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1344,9 +1344,6 @@ static void __destroy_compound_gigantic_page(struct page *page,
 	}
 
 	set_compound_order(page, 0);
-#ifdef CONFIG_64BIT
-	page[1].compound_nr = 0;
-#endif
 	__ClearPageHead(page);
 }
 
@@ -1865,9 +1862,6 @@ static bool __prep_compound_gigantic_page(struct page *page, unsigned int order,
 		__ClearPageReserved(p);
 	}
 	set_compound_order(page, 0);
-#ifdef CONFIG_64BIT
-	page[1].compound_nr = 0;
-#endif
 	__ClearPageHead(page);
 	return false;
 }