Message ID | 20231221031915.619337-1-pasha.tatashin@soleen.com
---|---
From | Pasha Tatashin <pasha.tatashin@soleen.com>
To | akpm@linux-foundation.org, linux-mm@kvack.org, pasha.tatashin@soleen.com, linux-kernel@vger.kernel.org, rientjes@google.com, dwmw2@infradead.org, baolu.lu@linux.intel.com, joro@8bytes.org, will@kernel.org, robin.murphy@arm.com, iommu@lists.linux.dev
Subject | [RFC 0/3] iommu/intel: Free empty page tables on unmaps
Date | Thu, 21 Dec 2023 03:19:12 +0000
Series | iommu/intel: Free empty page tables on unmaps
Message
Pasha Tatashin
Dec. 21, 2023, 3:19 a.m. UTC
This series frees empty page tables on unmaps. It intends to be a
low overhead feature.

The read-writer lock is used to synchronize page table, but most of
time the lock is held is reader. It is held as a writer for short
period of time when unmapping a page that is bigger than the current
iova request. For all other cases this lock is read-only.

page->refcount is used in order to track number of entries at each page
table.

Microbenchmark data using iova_stress[1]:

Base:
yqbtg12:/home# ./iova_stress -s 16
dma_size:  4K  iova space:  16T  iommu: ~ 32783M  time: 22.297s

Fix:
yqbtg12:/home# ./iova_stress -s 16
dma_size:  4K  iova space:  16T  iommu: ~     0M  time: 23.388s

The test maps/unmaps 4K pages and cycles through the IOVA space. Base
uses 32G of memory for IOMMU page tables and the test completes in
22.3s; Fix uses 0G and the test completes in 23.4s.

I believe the proposed fix is a good compromise in terms of complexity
and scalability. A more scalable solution would be to spread the
read/writer lock across page-table pages, using the page->private field
to store the lock itself. However, since the iommu layer already
guarantees that no one else touches the IOVA range of an in-flight
map/unmap request, we can avoid that extra complexity, rely on a single
per-page-table RW lock, and stay in reader mode most of the time.

[1] https://github.com/soleen/iova_stress

Pasha Tatashin (3):
  iommu/intel: Use page->refcount to count number of entries in IOMMU
  iommu/intel: synchronize page table map and unmap operations
  iommu/intel: free empty page tables on unmaps

 drivers/iommu/intel/iommu.c | 153 ++++++++++++++++++++++++++++--------
 drivers/iommu/intel/iommu.h |  44 +++++++++--
 2 files changed, 158 insertions(+), 39 deletions(-)
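As a rough illustration of the locking policy described above, the sketch below assumes a single rwlock_t covering the whole domain page table; the function name and the may_free_subtables flag are invented for illustration and are not taken from the posted patches.

#include <linux/spinlock.h>

/* One lock for the whole domain page table (illustrative name). */
static DEFINE_RWLOCK(pgtable_lock);

static void domain_unmap_range(bool may_free_subtables)
{
        /*
         * Reader mode for ordinary map/unmap: callers already operate on
         * disjoint IOVA ranges, so clearing leaf PTEs needs no mutual
         * exclusion. Writer mode only for the short window in which empty
         * sub-tables may be unlinked and queued for freeing.
         */
        if (may_free_subtables)
                write_lock(&pgtable_lock);
        else
                read_lock(&pgtable_lock);

        /* ... clear PTEs, collect empty page-table pages on a freelist ... */

        if (may_free_subtables)
                write_unlock(&pgtable_lock);
        else
                read_unlock(&pgtable_lock);
}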
Comments
On Thu, Dec 21, 2023 at 03:19:12AM +0000, Pasha Tatashin wrote:
> This series frees empty page tables on unmaps. It intends to be a
> low overhead feature.
>
> The read-writer lock is used to synchronize page table, but most of
> time the lock is held is reader. It is held as a writer for short
> period of time when unmapping a page that is bigger than the current
> iova request. For all other cases this lock is read-only.
>
> page->refcount is used in order to track number of entries at each page
> table.

Have I not put enough DANGER signs up around the page refcount?

 * If you want to use the refcount field, it must be used in such a way
 * that other CPUs temporarily incrementing and then decrementing the
 * refcount does not cause problems. On receiving the page from
 * alloc_pages(), the refcount will be positive.

You can't use refcount for your purpose, and honestly I'm shocked you
haven't seen any of your WARNings trigger.
On Wed, Dec 20, 2023 at 11:16 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Thu, Dec 21, 2023 at 03:19:12AM +0000, Pasha Tatashin wrote:
> > This series frees empty page tables on unmaps. It intends to be a
> > low overhead feature.
> >
> > The read-writer lock is used to synchronize page table, but most of
> > time the lock is held is reader. It is held as a writer for short
> > period of time when unmapping a page that is bigger than the current
> > iova request. For all other cases this lock is read-only.
> >
> > page->refcount is used in order to track number of entries at each page
> > table.
>
> Have I not put enough DANGER signs up around the page refcount?
>
>  * If you want to use the refcount field, it must be used in such a way
>  * that other CPUs temporarily incrementing and then decrementing the
>  * refcount does not cause problems. On receiving the page from
>  * alloc_pages(), the refcount will be positive.
>
> You can't use refcount for your purpose, and honestly I'm shocked you
> haven't seen any of your WARNings trigger.

Hi Matthew,

Thank you for looking at this.

Could you please explain exactly why refcount can't be used like this?

After alloc_page() refcount is set to 1, we never reduce it to 0,
every new entry in a page table adds 1, so we get up-to 513, that is
why I added warn like this: WARN_ON_ONCE(rc > 513 || rc < 2); to
dma_set_pte() macro. When refcount == 1, we know that the page table
is empty, and can be added to a freelist for a delayed freeing.

What is wrong with using refcount for a scalable way of keeping the
track of number of entries in a iommu page table? Is there a better
way that I should use?

Thank you,
Pasha
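Read as a sketch, the scheme Pasha describes (one allocation reference from alloc_pages() plus one reference per present entry in a 512-entry table) looks roughly like the helpers below; the names are invented for illustration and are not the series' actual dma_set_pte() changes.

#include <linux/mm.h>
#include <linux/page_ref.h>

#define ENTRIES_PER_TABLE       512

/* Called when an entry in this page-table page becomes present. */
static void pgtable_entry_set(struct page *pgtable_page)
{
        page_ref_inc(pgtable_page);
        /* 1 allocation reference + up to 512 entry references. */
        WARN_ON_ONCE(page_ref_count(pgtable_page) > ENTRIES_PER_TABLE + 1 ||
                     page_ref_count(pgtable_page) < 2);
}

/* Called when an entry is cleared; true means the table is now empty. */
static bool pgtable_entry_clear(struct page *pgtable_page)
{
        page_ref_dec(pgtable_page);
        return page_ref_count(pgtable_page) == 1;
}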
On Thu, Dec 21, 2023 at 12:13 AM Pasha Tatashin
<pasha.tatashin@soleen.com> wrote:
>
> On Wed, Dec 20, 2023 at 11:16 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Thu, Dec 21, 2023 at 03:19:12AM +0000, Pasha Tatashin wrote:
> > > This series frees empty page tables on unmaps. It intends to be a
> > > low overhead feature.
> > >
> > > The read-writer lock is used to synchronize page table, but most of
> > > time the lock is held is reader. It is held as a writer for short
> > > period of time when unmapping a page that is bigger than the current
> > > iova request. For all other cases this lock is read-only.
> > >
> > > page->refcount is used in order to track number of entries at each page
> > > table.
> >
> > Have I not put enough DANGER signs up around the page refcount?
> >
> >  * If you want to use the refcount field, it must be used in such a way
> >  * that other CPUs temporarily incrementing and then decrementing the
> >  * refcount does not cause problems. On receiving the page from
> >  * alloc_pages(), the refcount will be positive.
> >
> > You can't use refcount for your purpose, and honestly I'm shocked you
> > haven't seen any of your WARNings trigger.
>
> Hi Matthew,
>
> Thank you for looking at this.
>
> Could you please explain exactly why refcount can't be used like this?
>
> After alloc_page() refcount is set to 1, we never reduce it to 0,
> every new entry in a page table adds 1, so we get up-to 513, that is
> why I added warn like this: WARN_ON_ONCE(rc > 513 || rc < 2); to

I guess, what you mean is that other CPUs could temporarily
increase/decrease refcount outside of IOMMU management, do you have an
example of why that would happen? I could remove the above warning,
and in the worst case we would miss an opportunity to free a page
table during unmap, not a big deal, it can be freed during another
map/unmap event. Still better than today, where we never free them
during unmaps.

Pasha
On Thu, Dec 21, 2023 at 12:42:41AM -0500, Pasha Tatashin wrote:
> On Thu, Dec 21, 2023 at 12:13 AM Pasha Tatashin
> <pasha.tatashin@soleen.com> wrote:
> >
> > On Wed, Dec 20, 2023 at 11:16 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Thu, Dec 21, 2023 at 03:19:12AM +0000, Pasha Tatashin wrote:
> > > > This series frees empty page tables on unmaps. It intends to be a
> > > > low overhead feature.
> > > >
> > > > The read-writer lock is used to synchronize page table, but most of
> > > > time the lock is held is reader. It is held as a writer for short
> > > > period of time when unmapping a page that is bigger than the current
> > > > iova request. For all other cases this lock is read-only.
> > > >
> > > > page->refcount is used in order to track number of entries at each page
> > > > table.
> > >
> > > Have I not put enough DANGER signs up around the page refcount?
> > >
> > >  * If you want to use the refcount field, it must be used in such a way
> > >  * that other CPUs temporarily incrementing and then decrementing the
> > >  * refcount does not cause problems. On receiving the page from
> > >  * alloc_pages(), the refcount will be positive.
> > >
> > > You can't use refcount for your purpose, and honestly I'm shocked you
> > > haven't seen any of your WARNings trigger.
> >
> > Hi Matthew,
> >
> > Thank you for looking at this.
> >
> > Could you please explain exactly why refcount can't be used like this?
> >
> > After alloc_page() refcount is set to 1, we never reduce it to 0,
> > every new entry in a page table adds 1, so we get up-to 513, that is
> > why I added warn like this: WARN_ON_ONCE(rc > 513 || rc < 2); to
>
> I guess, what you mean is that other CPUs could temporarily
> increase/decrease refcount outside of IOMMU management, do you have an
> example of why that would happen? I could remove the above warning,
> and in the worst case we would miss an opportunity to free a page
> table during unmap, not a big deal, it can be freed during another
> map/unmap event. Still better than today, where we never free them
> during unmaps.

Both GUP-fast and the page cache will find a page under RCU protection,
inc it's refcount if not zero, check the page is still the one they were
looking for, and if not will dec the refcount again. That means if a
page has been in the page cache or process page tables and you can't
guarantee that all CPUs have been through the requisite grace periods,
you might see the refcount increased.

I'm not prepared to make a guarantee that these are the only circumstances
under which you'll see a temporarily higher refcount than you expect.
Either currently or in the future. If you use the refcount as anything
other than a refcount, you're living dangerously. And if you think that
you'll be the one to do the last refcount put, you're not necessarily
correct (see the saga around __free_pages() which ended up as commit
e320d3012d25 fixed by 462a8e08e0e6 (which indicates the rare race does
actually happen)).

Now, it seems like from your further explanation that the consequence
of getting this wrong is simply that you fail to free the page early.
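For context, the speculative-reference pattern Matthew describes (GUP-fast and the page-cache lookup) looks roughly like the sketch below; the lookup callback is a hypothetical stand-in, not a real kernel API.

#include <linux/mm.h>
#include <linux/rcupdate.h>

static struct page *speculative_get(struct page *(*lookup)(void *cookie),
                                    void *cookie)
{
        struct page *page;

        rcu_read_lock();
        page = lookup(cookie);
        if (page && !get_page_unless_zero(page))
                page = NULL;            /* page was already being freed */

        /*
         * The page may have been freed and reused (for example as an IOMMU
         * page-table page) between the lookup and the refcount bump:
         * recheck, and if it no longer matches, drop the reference again.
         * The new owner of the page sees a transient inc/dec of the
         * refcount that it did not cause.
         */
        if (page && lookup(cookie) != page) {
                put_page(page);
                page = NULL;
        }
        rcu_read_unlock();

        return page;
}

In other words, a page freed by its previous owner and immediately reused as an IOMMU page table can briefly have its refcount raised by a CPU that has not yet gone through the grace period, which is why the expected [2, 513] bound cannot be guaranteed.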
On Thu, Dec 21, 2023 at 9:06 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Thu, Dec 21, 2023 at 12:42:41AM -0500, Pasha Tatashin wrote:
> > On Thu, Dec 21, 2023 at 12:13 AM Pasha Tatashin
> > <pasha.tatashin@soleen.com> wrote:
> > >
> > > On Wed, Dec 20, 2023 at 11:16 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Thu, Dec 21, 2023 at 03:19:12AM +0000, Pasha Tatashin wrote:
> > > > > This series frees empty page tables on unmaps. It intends to be a
> > > > > low overhead feature.
> > > > >
> > > > > The read-writer lock is used to synchronize page table, but most of
> > > > > time the lock is held is reader. It is held as a writer for short
> > > > > period of time when unmapping a page that is bigger than the current
> > > > > iova request. For all other cases this lock is read-only.
> > > > >
> > > > > page->refcount is used in order to track number of entries at each page
> > > > > table.
> > > >
> > > > Have I not put enough DANGER signs up around the page refcount?
> > > >
> > > >  * If you want to use the refcount field, it must be used in such a way
> > > >  * that other CPUs temporarily incrementing and then decrementing the
> > > >  * refcount does not cause problems. On receiving the page from
> > > >  * alloc_pages(), the refcount will be positive.
> > > >
> > > > You can't use refcount for your purpose, and honestly I'm shocked you
> > > > haven't seen any of your WARNings trigger.
> > >
> > > Hi Matthew,
> > >
> > > Thank you for looking at this.
> > >
> > > Could you please explain exactly why refcount can't be used like this?
> > >
> > > After alloc_page() refcount is set to 1, we never reduce it to 0,
> > > every new entry in a page table adds 1, so we get up-to 513, that is
> > > why I added warn like this: WARN_ON_ONCE(rc > 513 || rc < 2); to
> >
> > I guess, what you mean is that other CPUs could temporarily
> > increase/decrease refcount outside of IOMMU management, do you have an
> > example of why that would happen? I could remove the above warning,
> > and in the worst case we would miss an opportunity to free a page
> > table during unmap, not a big deal, it can be freed during another
> > map/unmap event. Still better than today, where we never free them
> > during unmaps.
>
> Both GUP-fast and the page cache will find a page under RCU protection,
> inc it's refcount if not zero, check the page is still the one they were
> looking for, and if not will dec the refcount again. That means if a
> page has been in the page cache or process page tables and you can't
> guarantee that all CPUs have been through the requisite grace periods,
> you might see the refcount increased.

Interesting scenario, it sounds like this could only happen for a
short period of time at the beginning of the life of a page in the
IOMMU Page Table.

> I'm not prepared to make a guarantee that these are the only circumstances
> under which you'll see a temporarily higher refcount than you expect.
> Either currently or in the future. If you use the refcount as anything
> other than a refcount, you're living dangerously. And if you think that
> you'll be the one to do the last refcount put, you're not necessarily
> correct (see the saga around __free_pages() which ended up as commit
> e320d3012d25 fixed by 462a8e08e0e6 (which indicates the rare race does
> actually happen)).
>
> Now, it seems like from your further explanation that the consequence
> of getting this wrong is simply that you fail to free the page early.
> That seems OK, but I insist that you insert some comments explaining
> what is going on and why it's safe so somebody auditing uses of refcount
> doesn't have to reanalyse the whole thing for themself. Or worse that
> somebody working on the iommu sees this and thinks they can "improve"
> on it.

Yes, I can add detailed comments explaining how refcount is used here.

Alternatively, I was thinking of using mapcount:

From mm_types.h:
 * If your page will not be mapped to userspace, you can also use the four
 * bytes in the mapcount union, but you must call page_mapcount_reset()
 * before freeing it.

It sounds like we can safely use _mapcount for our needs, and do
page_mapcount_reset() before freeing pages.

Pasha
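A sketch of that _mapcount alternative, assuming the page-table pages are never mapped to userspace; the helpers mirror the refcount sketch earlier as a drop-in replacement and are illustrative only, not part of the posted series.

#include <linux/mm.h>
#include <linux/mm_types.h>

/* _mapcount starts at -1; treat it as "number of present entries - 1". */
static void pgtable_entry_set(struct page *pgtable_page)
{
        atomic_inc(&pgtable_page->_mapcount);
}

/* True when the last entry was just cleared and the table is empty. */
static bool pgtable_entry_clear(struct page *pgtable_page)
{
        return atomic_dec_return(&pgtable_page->_mapcount) == -1;
}

static void pgtable_free(struct page *pgtable_page)
{
        /* Mandatory before handing the page back to the allocator. */
        page_mapcount_reset(pgtable_page);
        __free_page(pgtable_page);
}

Unlike the refcount, _mapcount is not touched by speculative lookups for pages that never reach userspace mappings, which is what makes the bound check meaningful in this variant.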