Message ID | 20230421101415.5734-1-osalvador@suse.de |
---|---|
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:b0ea:0:b0:3b6:4342:cba0 with SMTP id b10csp960149vqo; Fri, 21 Apr 2023 03:20:13 -0700 (PDT) X-Google-Smtp-Source: AKy350aFtDjS7dKur/g+lgXoXQYTAvLxII3aqWMZwk0ISPqozvEMYF5iruqzW7gOIwjz5UE5DP0G X-Received: by 2002:a17:902:db09:b0:1a6:d46b:dfb5 with SMTP id m9-20020a170902db0900b001a6d46bdfb5mr5568709plx.26.1682072412882; Fri, 21 Apr 2023 03:20:12 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1682072412; cv=none; d=google.com; s=arc-20160816; b=PAxKC6wdZxFWdFNlzvd3RgYfr8W8DPaA5edByOo6JiZMlBQw8f0wMHBNrXUs1qsmHN c4zJyHjQIxdY+rBVh2aanhwuKirfHGMUVGd5hWlRhh+Xps2YJ6L0XxzhH0YTVNz7NcVr Q8dUrzj+stVOYUu9tpRoVTkYOUzARlUJK1egbDVJO+ejNBEZ1HxfyS7UISi5NYpORV4G fYiwrzkfeKcnGGkpqGW/y7TBE5+chuWogWAwR8uzPvpGlCvP57M6Re8HnN2BG0gE9hQy X7YSLHunfks2qp3f5r7eAZlzQGraLAPwmiHhjl1XN3F2rFuDW8noaPvQL9ByF55/WI68 WzAw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :message-id:date:subject:cc:to:from:dkim-signature:dkim-signature; bh=XPGL8P2XA7j00dRVX3s+KCxeC+rjR4e4xQOiJaucwNo=; b=IhYZ88tIbNX3Ge58LF4Myey3fOByQ7rRsKZv3cp/8dvdvs6bi9zYflomaTRolVtlBw dmBvox9XKMMhLifI9L01yG3HSaUYtVTPyrTAGP20ykcpLMKBqDoJDj7GdrPdvCKtzhdd wuhQT8/i3a8WET4V/i6H4cRyr6O5nemX0eHXciC7tMmrYGcSe1DfroFcaA/B7N47SmAa NZzK8ho/ZQYQ+pNXHVZa38W2pptdhK6ooMOXB/of5qCGQNlpdGeAwwqM+gOlkthb/GL5 2hOdWQbSkJRgaFEdrYWcnBmjabLi5EXKT/mXiwYbO7CBqoA/0hNp8g1dmLiiaE0UYhFg GfPQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@suse.de header.s=susede2_rsa header.b=cH0+KmRF; dkim=neutral (no key) header.i=@suse.de header.s=susede2_ed25519; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=suse.de Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id l6-20020a170903244600b0019a95ab6b4esi4662598pls.407.2023.04.21.03.20.00; Fri, 21 Apr 2023 03:20:12 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@suse.de header.s=susede2_rsa header.b=cH0+KmRF; dkim=neutral (no key) header.i=@suse.de header.s=susede2_ed25519; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=suse.de Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231215AbjDUKOd (ORCPT <rfc822;cjcooper78@gmail.com> + 99 others); Fri, 21 Apr 2023 06:14:33 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:48600 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230260AbjDUKOb (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Fri, 21 Apr 2023 06:14:31 -0400 Received: from smtp-out2.suse.de (smtp-out2.suse.de [IPv6:2001:67c:2178:6::1d]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 76981107 for <linux-kernel@vger.kernel.org>; Fri, 21 Apr 2023 03:14:30 -0700 (PDT) Received: from imap2.suse-dmz.suse.de (imap2.suse-dmz.suse.de [192.168.254.74]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-521) server-digest SHA512) (No client certificate requested) by smtp-out2.suse.de (Postfix) with ESMTPS id 277DE1FDDB; Fri, 21 Apr 2023 10:14:29 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.de; s=susede2_rsa; t=1682072069; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version: content-transfer-encoding:content-transfer-encoding; bh=XPGL8P2XA7j00dRVX3s+KCxeC+rjR4e4xQOiJaucwNo=; b=cH0+KmRFbQjGyYmetk1EDardNufiO90JBkvn6ILrP8GjUqtgtODoIE6hTNhkRC20XI76Ul KERVpE4mPom/pXvjHiZlNAOE68E+amEZ2gydOFXlKQ5054InSwCNVXNjVJ+hIgCJLPJEas 9QwVYEZuiB5RgAay5/kDpugZAUSdJq4= DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=suse.de; s=susede2_ed25519; t=1682072069; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version: content-transfer-encoding:content-transfer-encoding; bh=XPGL8P2XA7j00dRVX3s+KCxeC+rjR4e4xQOiJaucwNo=; b=74vubsOSwB/834QZmrs5/G4FEBrC3iXQL3axQjd1o+bfn4mFEv1nEe0KLxCSoA09AehG6X 7vT71Q6QWC/YCYAg== Received: from imap2.suse-dmz.suse.de (imap2.suse-dmz.suse.de [192.168.254.74]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-521) server-digest SHA512) (No client certificate requested) by imap2.suse-dmz.suse.de (Postfix) with ESMTPS id 406661390E; Fri, 21 Apr 2023 10:14:28 +0000 (UTC) Received: from dovecot-director2.suse.de ([192.168.254.65]) by imap2.suse-dmz.suse.de with ESMTPSA id pJKADARiQmRNaAAAMHmgww (envelope-from <osalvador@suse.de>); Fri, 21 Apr 2023 10:14:28 +0000 From: Oscar Salvador <osalvador@suse.de> To: Andrew Morton <akpm@linux-foundation.org> Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Michal Hocko <mhocko@suse.com>, Vlastimil Babka <vbabka@suse.cz>, Eric Dumazet <edumazet@google.com>, Waiman Long <longman@redhat.com>, Suren Baghdasaryan <surenb@google.com>, Marco Elver <elver@google.com>, Andrey Konovalov <andreyknvl@gmail.com>, Alexander Potapenko <glider@google.com>, Oscar Salvador <osalvador@suse.de> Subject: [PATCH v4 0/3] page_owner: print stacks and their counter Date: Fri, 21 Apr 2023 12:14:12 +0200 Message-Id: <20230421101415.5734-1-osalvador@suse.de> X-Mailer: git-send-email 2.40.0 MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Spam-Status: No, score=-4.4 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_MED,SPF_HELO_NONE, SPF_PASS,T_SCC_BODY_TEXT_LINE,URIBL_BLOCKED autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1763780762434764920?= X-GMAIL-MSGID: =?utf-8?q?1763780762434764920?= |
Series | page_owner: print stacks and their counter | |
Message
Oscar Salvador
April 21, 2023, 10:14 a.m. UTC
Changes v3 -> v4: - Rebase (long time has passed) - Use boolean instead of enum for action by Alexander Potapenko - (I left some feedback untouched because it's been long and would like to discuss it here now instead of re-vamping and old thread) Changes v2 -> v3: - Replace interface in favor of seq operations (suggested by Vlastimil) - Use debugfs interface to store/read valued (suggested by Ammar) Hi, page_owner is a great debug functionality tool that gets us to know about all pages that have been allocated/freed and their stacktrace. This comes very handy when e.g: debugging leaks, as with some scripting we might be able to see those stacktraces that are allocating pages but not freeing theme. In my experience, that is one of the most useful cases, but it can get really tedious to screen through all pages aand try to reconstruct the stack <-> allocated/freed relationship. There is a lot of noise to cancel off. This patch aims to fix that by adding a new functionality into page_owner. What this does is to create a new read-only file "page_owner_stacks", which prints only the allocating stacktraces and their counting, being that the times the stacktrace has allocated - the times it has freed. So we have a clear overview of stacks <-> allocated/freed relationship without the need to fiddle with pages and trying to match free stacktraces with allocated stacktraces. This is achieved by adding a new refcount_t field in the stack_record struct, incrementing that refcount_t everytime the same stacktrace allocates, and decrementing it when it frees a page. Details can be seen in the respective patches. We also create another file called "page_owner_threshold", which let us specify a threshold, so when when reading from "page_owner_stacks", we will only see those stacktraces which counting goes beyond the threshold we specified. One thing I am not completely happy about is to polute lib/stackdepot.c file with the stack_* functions. We could sort that out if the stack_record struct definitions were in a header file instead of stackdepot.c. But I am not sure about that trade-off, so suggestions are accepted. A PoC can be found below: # cat /sys/kernel/debug/page_owner_threshold 0 # cat /sys/kernel/debug/page_owner_stacks > stacks_full.txt # head -32 stacks_full.txt prep_new_page+0x10d/0x180 get_page_from_freelist+0x1bd6/0x1e10 __alloc_pages+0x194/0x360 alloc_page_interleave+0x13/0x90 new_slab+0x31d/0x530 ___slab_alloc+0x5d7/0x720 __slab_alloc.isra.85+0x4a/0x90 kmem_cache_alloc+0x455/0x4a0 acpi_ps_alloc_op+0x57/0x8f acpi_ps_create_scope_op+0x12/0x23 acpi_ps_execute_method+0x102/0x2c1 acpi_ns_evaluate+0x343/0x4da acpi_evaluate_object+0x1cb/0x392 acpi_run_osc+0x135/0x260 acpi_init+0x165/0x4ed do_one_initcall+0x3e/0x200 stack count: 2 free_pcp_prepare+0x287/0x5c0 free_unref_page+0x1c/0xd0 __mmdrop+0x50/0x160 finish_task_switch+0x249/0x2b0 __schedule+0x2c3/0x960 schedule+0x44/0xb0 futex_wait_queue+0x70/0xd0 futex_wait+0x160/0x250 do_futex+0x11c/0x1b0 __x64_sys_futex+0x5e/0x1d0 do_syscall_64+0x37/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd stack count: 1 # echo 10000 > /sys/kernel/debug/page_owner_threshold # cat /sys/kernel/debug/page_owner_stacks > stacks_10000.txt # cat stacks_10000.txt prep_new_page+0x10d/0x180 get_page_from_freelist+0x1bd6/0x1e10 __alloc_pages+0x194/0x360 folio_alloc+0x17/0x40 page_cache_ra_unbounded+0x96/0x170 filemap_get_pages+0x23d/0x5e0 filemap_read+0xbf/0x3a0 __kernel_read+0x136/0x2f0 kernel_read_file+0x197/0x2d0 kernel_read_file_from_fd+0x54/0x90 __do_sys_finit_module+0x89/0x120 do_syscall_64+0x37/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd stack count: 36195 prep_new_page+0x10d/0x180 get_page_from_freelist+0x1bd6/0x1e10 __alloc_pages+0x194/0x360 folio_alloc+0x17/0x40 page_cache_ra_unbounded+0x96/0x170 filemap_get_pages+0x23d/0x5e0 filemap_read+0xbf/0x3a0 new_sync_read+0x106/0x180 vfs_read+0x16f/0x190 ksys_read+0xa5/0xe0 do_syscall_64+0x37/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd stack count: 44484 prep_new_page+0x10d/0x180 get_page_from_freelist+0x1bd6/0x1e10 __alloc_pages+0x194/0x360 folio_alloc+0x17/0x40 page_cache_ra_unbounded+0x96/0x170 filemap_get_pages+0xdd/0x5e0 filemap_read+0xbf/0x3a0 new_sync_read+0x106/0x180 vfs_read+0x16f/0x190 ksys_read+0xa5/0xe0 do_syscall_64+0x37/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd stack count: 17874 Oscar Salvador (3): lib/stackdepot: Add a refcount field in stack_record mm, page_owner: Add page_owner_stacks file to print out only stacks and their counte mm,page_owner: Filter out stacks by a threshold counter include/linux/stackdepot.h | 17 +++- lib/stackdepot.c | 160 ++++++++++++++++++++++++++++++++++--- mm/kasan/common.c | 3 +- mm/page_owner.c | 46 ++++++++++- 4 files changed, 209 insertions(+), 17 deletions(-)
Comments
On Fri, Apr 21, 2023 at 12:14 PM Oscar Salvador <osalvador@suse.de> wrote: > > Changes v3 -> v4: > - Rebase (long time has passed) > - Use boolean instead of enum for action by Alexander Potapenko > - (I left some feedback untouched because it's been long and > would like to discuss it here now instead of re-vamping > and old thread) > > Changes v2 -> v3: > - Replace interface in favor of seq operations (suggested by Vlastimil) > - Use debugfs interface to store/read valued (suggested by Ammar) > > Hi, > > page_owner is a great debug functionality tool that gets us to know > about all pages that have been allocated/freed and their stacktrace. > This comes very handy when e.g: debugging leaks, as with some scripting > we might be able to see those stacktraces that are allocating pages > but not freeing theme. > > In my experience, that is one of the most useful cases, but it can get > really tedious to screen through all pages aand try to reconstruct the > stack <-> allocated/freed relationship. There is a lot of noise > to cancel off. > > This patch aims to fix that by adding a new functionality into page_owner. > What this does is to create a new read-only file "page_owner_stacks", > which prints only the allocating stacktraces and their counting, being that > the times the stacktrace has allocated - the times it has freed. > > So we have a clear overview of stacks <-> allocated/freed relationship > without the need to fiddle with pages and trying to match free stacktraces > with allocated stacktraces. > > This is achieved by adding a new refcount_t field in the stack_record struct, > incrementing that refcount_t everytime the same stacktrace allocates, > and decrementing it when it frees a page. Details can be seen in the > respective patches. I think the implementation of these counters is too specific to page_owner and is hard to use for any other purpose. If we decide to have them, there should be no page_owner-specific logic in the way we initialize/increment/decrement these counters. The thresholds in "mm,page_owner: Filter out stacks by a threshold counter" should also belong elsewhere. Given that no other stackdepot user needs these counters, maybe it should be cleaner to store an opaque struct along with the stack, passing its size to stack_depot_save(), and letting users access it directly using the stackdepot handler. I am also wondering if a separate hashtable mapping handlers to counters would solve the problem for you?
On 2023-04-21 13:19, Alexander Potapenko wrote: > I think the implementation of these counters is too specific to > page_owner and is hard to use for any other purpose. > If we decide to have them, there should be no page_owner-specific > logic in the way we initialize/increment/decrement these counters. Another solution would be to always increment the refcount in __stack_depot_save, in this case the "page-owner" specific changes are gone, and it is more of a generic thing. e.g: Andrey Konovalov mentioned that in a future KASAN remodelation, he would be using a stack refcount as well. > The thresholds in "mm,page_owner: Filter out stacks by a threshold > counter" should also belong elsewhere. That can certainly be cleaned up I guess to not polute non-page_owner code. > Given that no other stackdepot user needs these counters, maybe it > should be cleaner to store an opaque struct along with the stack, > passing its size to stack_depot_save(), and letting users access it > directly using the stackdepot handler. > > I am also wondering if a separate hashtable mapping handlers to > counters would solve the problem for you? Let us see first if with the changes from above the code gets to a more generic and clean stage, if not we can explore further options. Thanks for your feedback Alexander!
On Mon, 24 Apr 2023 05:54:59 +0200 Oscar Salvador <osalvador@suse.de> wrote: > > Given that no other stackdepot user needs these counters, maybe it > > should be cleaner to store an opaque struct along with the stack, > > passing its size to stack_depot_save(), and letting users access it > > directly using the stackdepot handler. > > > > I am also wondering if a separate hashtable mapping handlers to > > counters would solve the problem for you? > > Let us see first if with the changes from above the code gets to a more > generic and clean stage, if not we can explore further options. Alexander, does this approach sound reasonable to you? The overall feature seems useful, although I'm not seeing any positive reviewer feedback.
On 6/9/23 23:55, Andrew Morton wrote: > On Mon, 24 Apr 2023 05:54:59 +0200 Oscar Salvador <osalvador@suse.de> wrote: > >> > Given that no other stackdepot user needs these counters, maybe it >> > should be cleaner to store an opaque struct along with the stack, >> > passing its size to stack_depot_save(), and letting users access it >> > directly using the stackdepot handler. >> > >> > I am also wondering if a separate hashtable mapping handlers to >> > counters would solve the problem for you? >> >> Let us see first if with the changes from above the code gets to a more >> generic and clean stage, if not we can explore further options. > > Alexander, does this approach sound reasonable to you? Note this is a v4 thread; there was (and the version in mm-unstable is) v5, where the latest was Alexander requesting further changes: https://lore.kernel.org/all/CAG_fn%3DUJgnLFGgpkXbMeD6axZN_ifEPHvWpy2_oiPyG1a6PXng@mail.gmail.com/ > The overall feature seems useful, although I'm not seeing any positive > reviewer feedback.