Message ID | 20231129143529.260264-1-akrowiak@linux.ibm.com |
---|---|
State | New |
Headers |
From: Tony Krowiak <akrowiak@linux.ibm.com>
To: linux-s390@vger.kernel.org, linux-kernel@vger.kernel.org, kvm@vger.kernel.org
Cc: jjherne@linux.ibm.com, pasic@linux.ibm.com, alex.williamson@redhat.com, borntraeger@linux.ibm.com, kwankhede@nvidia.com, frankja@linux.ibm.com, imbrenda@linux.ibm.com, david@redhat.com
Subject: [PATCH] s390/vfio-ap: handle response code 01 on queue reset
Date: Wed, 29 Nov 2023 09:35:24 -0500
Message-ID: <20231129143529.260264-1-akrowiak@linux.ibm.com>
X-Mailer: git-send-email 2.41.0 |
Series |
s390/vfio-ap: handle response code 01 on queue reset
|
|
Commit Message
Anthony Krowiak
Nov. 29, 2023, 2:35 p.m. UTC
In the current implementation, response code 01 (AP queue number not valid)
is handled as a default case along with other response codes returned from
a queue reset operation that are not handled specifically. Barring a bug,
response code 01 will occur only when a queue has been externally removed
from the host's AP configuration; in this case, the queue must
be reset by the machine in order to avoid leaking crypto data if/when the
queue is returned to the host's configuration. The response code 01 case
will be handled specifically by logging a WARN message followed by cleaning
up the IRQ resources.
Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com>
---
drivers/s390/crypto/vfio_ap_ops.c | 31 +++++++++++++++++++++++++++++++
1 file changed, 31 insertions(+)
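The change adds a `case` for response code 01 to the switch that checks the queue status after a reset. A minimal userspace model of that branch is sketched below. It assumes the value 0x01 from the patch description, a 0x00 "normal" code, and an APQN layout with the card in bits 8-15 and the queue in bits 0-7 (as used by `AP_QID_CARD()`/`AP_QID_QUEUE()`). `sketch_apq_status_check()` and its `-EIO` default are hypothetical stand-ins, not the driver's code, and `fprintf()` replaces the kernel logging call.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* 0x01 is the code discussed in the patch; 0x00 ("normal") is an
 * assumption mirroring the kernel's AP_RESPONSE_* naming. */
#define AP_RESPONSE_NORMAL      0x00
#define AP_RESPONSE_Q_NOT_AVAIL 0x01

/* Assumed APQN layout: card in bits 8-15, queue (domain) in bits 0-7. */
#define AP_QID_CARD(qid)  (((qid) >> 8) & 0xff)
#define AP_QID_QUEUE(qid) ((qid) & 0xff)

/*
 * Hypothetical, simplified model of the status check done after a queue
 * reset. Returning 0 tells the caller to go on and free the IRQ (AQIC)
 * resources; a negative errno value reports a failure.
 */
static int sketch_apq_status_check(int apqn, unsigned int response_code)
{
	assert(apqn >= 0 && apqn <= 0xffff);	/* APQNs are 16 bits wide */

	switch (response_code) {
	case AP_RESPONSE_NORMAL:
		return 0;
	case AP_RESPONSE_Q_NOT_AVAIL:
		/* Queue was removed from the host's AP configuration: report
		 * it, but return 0 so the caller still cleans up. */
		fprintf(stderr,
			"Unable to reset queue %02x.%04x: not in host AP configuration\n",
			(unsigned int)AP_QID_CARD(apqn),
			(unsigned int)AP_QID_QUEUE(apqn));
		return 0;
	default:
		/* Any other code is treated as an unexpected failure. */
		return -EIO;
	}
}
```

For example, `sketch_apq_status_check(0x0512, AP_RESPONSE_Q_NOT_AVAIL)` reports queue `05.0012` and returns 0, so cleanup proceeds instead of falling into the generic failure path.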
Comments
On 29.11.23 at 15:35, Tony Krowiak wrote:
> The response code 01 case will be handled specifically by logging a WARN
> message followed by cleaning up the IRQ resources.
>

To me it looks like this can be triggered by the LPAR admin, correct? So it
is not desirable but possible.
In that case I prefer to not use WARN, maybe use dev_warn or dev_err instead.
WARN can be a disruptive event if panic_on_warn is set.

> Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com>
> ---
>  drivers/s390/crypto/vfio_ap_ops.c | 31 +++++++++++++++++++++++++++++++
>  1 file changed, 31 insertions(+)
>
> diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c
> index 4db538a55192..91d6334574d8 100644
> --- a/drivers/s390/crypto/vfio_ap_ops.c
> +++ b/drivers/s390/crypto/vfio_ap_ops.c
> @@ -1652,6 +1652,21 @@ static int apq_status_check(int apqn, struct ap_queue_status *status)
>  		 * a value indicating a reset needs to be performed again.
>  		 */
>  		return -EAGAIN;
> +	case AP_RESPONSE_Q_NOT_AVAIL:
> +		/*
> +		 * This response code indicates the queue is not available.
> +		 * Barring a bug, response code 01 will occur only when a queue
> +		 * has been externally removed from the host's AP configuration;
> +		 * in which case, the queue must be reset by the machine in
> +		 * order to avoid leaking crypto data if/when the queue is
> +		 * returned to the host's configuration. In this case, let's go
> +		 * ahead and log a warning message and return 0 so the AQIC
> +		 * resources get cleaned up by the caller.
> +		 */
> +		WARN(true,
> +		     "Unable to reset queue %02x.%04x: not in host AP configuration\n",
> +		     AP_QID_CARD(apqn), AP_QID_QUEUE(apqn));
> +		return 0;
>  	default:
>  		WARN(true,
>  		     "failed to verify reset of queue %02x.%04x: TAPQ rc=%u\n",
> @@ -1736,6 +1751,22 @@ static void vfio_ap_mdev_reset_queue(struct vfio_ap_queue *q)
>  		q->reset_status.response_code = 0;
>  		vfio_ap_free_aqic_resources(q);
>  		break;
> +	case AP_RESPONSE_Q_NOT_AVAIL:
> +		/*
> +		 * This response code indicates the queue is not available.
> +		 * Barring a bug, response code 01 will occur only when a queue
> +		 * has been externally removed from the host's AP configuration;
> +		 * in which case, the queue must be reset by the machine in
> +		 * order to avoid leaking crypto data if/when the queue is
> +		 * returned to the host's configuration. In this case, let's go
> +		 * ahead and log a warning message then clean up the AQIC
> +		 * resources.
> +		 */
> +		WARN(true,
> +		     "Unable to reset queue %02x.%04x: not in host AP configuration\n",
> +		     AP_QID_CARD(q->apqn), AP_QID_QUEUE(q->apqn));
> +		vfio_ap_free_aqic_resources(q);
> +		break;
>  	default:
>  		WARN(true,
>  		     "PQAP/ZAPQ for %02x.%04x failed with invalid rc=%u\n",
On Wed, 29 Nov 2023 09:35:24 -0500
Tony Krowiak <akrowiak@linux.ibm.com> wrote:

> In the current implementation, response code 01 (AP queue number not valid)
> is handled as a default case along with other response codes returned from
> a queue reset operation that are not handled specifically. Barring a bug,
> response code 01 will occur only when a queue has been externally removed
> from the host's AP configuration; in this case, the queue must
> be reset by the machine in order to avoid leaking crypto data if/when the
> queue is returned to the host's configuration.

s/if\/when/at latest before/

I would argue that some of the cleanups need to happen before even 01 is
reflected...

The code comments may also require a similar rewording. With that fixed:
Reviewed-by: Halil Pasic <pasic@linux.ibm.com>

Regards,
Halil
On 11/29/23 12:12, Christian Borntraeger wrote:
> On 29.11.23 at 15:35, Tony Krowiak wrote:
>> The response code 01 case will be handled specifically by logging a WARN
>> message followed by cleaning up the IRQ resources.
>
> To me it looks like this can be triggered by the LPAR admin, correct? So it
> is not desirable but possible.
> In that case I prefer to not use WARN, maybe use dev_warn or dev_err
> instead.
> WARN can be a disruptive event if panic_on_warn is set.

Yes, it can be triggered by the LPAR admin. I can't use dev_warn here
because we don't have a reference to any device, but I can use pr_warn if
that suffices.
On 04.12.23 at 15:53, Tony Krowiak wrote:
> On 11/29/23 12:12, Christian Borntraeger wrote:
>> To me it looks like this can be triggered by the LPAR admin, correct? So it
>> is not desirable but possible.
>> In that case I prefer to not use WARN, maybe use dev_warn or dev_err instead.
>> WARN can be a disruptive event if panic_on_warn is set.
>
> Yes, it can be triggered by the LPAR admin. I can't use dev_warn here
> because we don't have a reference to any device, but I can use pr_warn if
> that suffices.

Ok, please use pr_warn then.
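With the agreement to switch to pr_warn, each `WARN(true, ...)` in the new branches would become a plain `pr_warn(...)` call, which prints at warning level without a stack trace and is not affected by `panic_on_warn`. Kernel macros are not available in userspace, so the sketch below models `pr_warn()` with a hypothetical `mock_pr_warn()` that formats into a buffer, allowing the resulting message text to be checked. `report_queue_not_avail()` is likewise hypothetical, and the APQN macros assume the same card/queue layout as the driver's `AP_QID_CARD()`/`AP_QID_QUEUE()`.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Assumed APQN layout: card in bits 8-15, queue (domain) in bits 0-7. */
#define AP_QID_CARD(qid)  (((qid) >> 8) & 0xff)
#define AP_QID_QUEUE(qid) ((qid) & 0xff)

/* Last message "logged" by the mock, kept so callers can inspect it. */
static char last_msg[128];

/*
 * Hypothetical userspace stand-in for the kernel's pr_warn(): it only
 * formats the message. Unlike WARN(), pr_warn() neither dumps a stack
 * trace nor panics the system when panic_on_warn is set.
 */
static void mock_pr_warn(const char *fmt, ...)
{
	va_list ap;

	va_start(ap, fmt);
	vsnprintf(last_msg, sizeof(last_msg), fmt, ap);
	va_end(ap);
}

/* Hypothetical helper for the response code 01 branch; returns the
 * length of the logged message. */
static size_t report_queue_not_avail(int apqn)
{
	assert(apqn >= 0 && apqn <= 0xffff);
	mock_pr_warn("Unable to reset queue %02x.%04x: not in host AP configuration\n",
		     (unsigned int)AP_QID_CARD(apqn),
		     (unsigned int)AP_QID_QUEUE(apqn));
	return strlen(last_msg);
}
```

In the driver itself, no `struct device` is at hand in these paths (as noted above), which is why the device-less `pr_*` family fits better than `dev_warn()`.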
On Mon, 4 Dec 2023 16:16:31 +0100
Christian Borntraeger <borntraeger@linux.ibm.com> wrote:

> On 04.12.23 at 15:53, Tony Krowiak wrote:
> > Yes, it can be triggered by the LPAR admin. I can't use dev_warn here
> > because we don't have a reference to any device, but I can use pr_warn
> > if that suffices.
>
> Ok, please use pr_warn then.

Shouldn't we rather make this an 'info'? I mean, we probably do not want
people complaining about this condition. Yes, it should be a best practice
to coordinate such things with the guest, and ideally remove the resource
from the guest first. But AFAIU our stack is supposed to be able to
handle something like this. IMHO issuing a warning is an excessive measure.
I know Reinhard and Tony probably disagree with the last sentence
though.

Regards,
Halil
On 12/4/23 11:15, Halil Pasic wrote:
> Shouldn't we rather make this an 'info'? I mean, we probably do not want
> people complaining about this condition. Yes, it should be a best practice
> to coordinate such things with the guest, and ideally remove the resource
> from the guest first. But AFAIU our stack is supposed to be able to
> handle something like this. IMHO issuing a warning is an excessive measure.
> I know Reinhard and Tony probably disagree with the last sentence
> though.

I don't feel strongly one way or the other. Anybody else?
On 12/4/23 07:10, Halil Pasic wrote:
> On Wed, 29 Nov 2023 09:35:24 -0500
> Tony Krowiak <akrowiak@linux.ibm.com> wrote:
>
>> be reset by the machine in order to avoid leaking crypto data if/when the
>> queue is returned to the host's configuration.
>
> s/if\/when/at latest before/
>
> I would argue that some of the cleanups need to happen before even 01 is
> reflected...

To what cleanups are you referring?

>
> The code comments may also require a similar rewording. With that fixed:
> Reviewed-by: Halil Pasic <pasic@linux.ibm.com>
On Mon, 4 Dec 2023 12:51:49 -0500
Tony Krowiak <akrowiak@linux.ibm.com> wrote:

> > s/if\/when/at latest before/
> >
> > I would argue that some of the cleanups need to happen before even 01 is
> > reflected...
>
> To what cleanups are you referring?

Event notification and interruption disablement, for starters. Otherwise
the OS has no way to figure out when the GISA and NIB are safe to
deallocate. Those actions are part of the reset process. I.e., some of the
reset stuff can be deferred at most until the queue is made accessible
again; some, not so much.

Regards,
Halil
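Halil's ordering constraint (interruptions must be disabled before the notification memory may be released) can be expressed as a guard in the cleanup path. The sketch below is a hypothetical model: `struct sketch_queue`, its `irq_enabled` flag, and the `malloc()`'d buffer standing in for the NIB are illustrations only, and the code does not reproduce what `vfio_ap_free_aqic_resources()` actually does.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical model of a queue's interruption state. */
struct sketch_queue {
	bool irq_enabled;	/* machine may still post notifications */
	void *nib;		/* stand-in for the notification indicator byte */
};

/* Stand-in for the reset step that disables event notification and
 * interruptions for the queue. */
static void sketch_disable_irq(struct sketch_queue *q)
{
	assert(q != NULL);
	q->irq_enabled = false;
}

/*
 * Free the notification memory only once interruptions are off;
 * otherwise the machine could still write into memory the OS has
 * already released.
 */
static int sketch_free_irq_resources(struct sketch_queue *q)
{
	assert(q != NULL);
	if (q->irq_enabled)
		return -EBUSY;
	free(q->nib);
	q->nib = NULL;
	return 0;
}
```

The guard turns an out-of-order cleanup into a reported error rather than a silent use-after-free, which is the property the reset path has to preserve even when response code 01 prevents a normal reset.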
On 2023-12-04 17:15, Halil Pasic wrote:
> Shouldn't we rather make this an 'info'? I mean, we probably do not want
> people complaining about this condition. Yes, it should be a best practice
> to coordinate such things with the guest, and ideally remove the resource
> from the guest first. But AFAIU our stack is supposed to be able to
> handle something like this. IMHO issuing a warning is an excessive measure.
> I know Reinhard and Tony probably disagree with the last sentence
> though.

Halil, Tony, the thing about info versus warning versus error is our own
stuff. Keep in mind that these messages end up in the "debug feature" as
FFDC data. So it comes down to which FFDC data you/Tony want to see there.
It should be enough to explain to a customer what happened without the need
to "recreate with higher debug level" if something serious happened. So my
private decision table is:

1) Is it something serious, something exceptional, something which may not
   come up again if you try to recreate it? Yes -> make it visible on the
   first occurrence as an error message.
2) Is it something you want to read when a customer hits it and you tell
   him to extract and examine the debug feature data? Yes -> make it a
   warning, and make sure your debug feature records warnings by default.
3) Still serious, but may flood the debug feature, and good enough with a
   high probability of reappearing on a recreate? Yes -> make it an info
   message and live with the risk that you may not be able to explain to a
   customer what happened without a recreate and a higher debug level.
4) Not 1-3 -> maybe a debug message, but still think about what happens
   when a customer enables the "debug feature" with the highest level. Does
   it squeeze out more important stuff? Maybe make it dynamic debug with
   pr_debug() (see the kernel documentation,
   admin-guide/dynamic-debug-howto.rst).
On Tue, 05 Dec 2023 09:04:23 +0100
Harald Freudenberger <freude@linux.ibm.com> wrote:

> 2) Is it something you want to read when a customer hits it and you tell
>    him to extract and examine the debug feature data? Yes -> make it a
>    warning, and make sure your debug feature records warnings by default.

AFAIU the default log level of the S390 Debug Feature is 3, that is, error.
So warnings do not help us there by default. And if we are already asking
the reporter to crank up the loglevel of the debug feature, we can ask the
reporter to crank it up to 5, assuming there is not too much stuff at log
level 5 in that area... How much info stuff do we have for the 'ap' debug
facility (I hope that is the facility used by vfio_ap)?

I think log levels are supposed to be primarily about severity, and I'm
not sure that a queue becoming unavailable in G1 without first
reconfiguring the G2 so that it no longer has access to the given queue is
really a warning-severity thing. IMHO, if we really do want people
complaining about this should they ever see it, then yes, it should be a
warning. If not, then probably not.

Regards,
Halil
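Halil's argument depends on how the debug feature filters entries. Assuming it keeps an entry only when the entry's level is at or below the configured level, the sketch below models that check with the numbers from this thread (3 = error as the default, 5 = info); the value 4 for warning and all `SKETCH_*` names are assumptions for illustration, not the s390 debug feature API.

```c
#include <assert.h>
#include <stdbool.h>

/* 3 (error, default) and 5 (info) come from the discussion; 4 for
 * warning is an assumption. Names are illustrative only. */
enum sketch_dbf_level {
	SKETCH_DBF_ERR  = 3,
	SKETCH_DBF_WARN = 4,
	SKETCH_DBF_INFO = 5,
};

#define SKETCH_DBF_DEFAULT_LEVEL SKETCH_DBF_ERR

/* An entry is recorded only if its level is at or below the configured one. */
static bool sketch_dbf_records(int configured_level, enum sketch_dbf_level msg)
{
	assert(configured_level >= 0);
	return (int)msg <= configured_level;
}
```

Under this model a warning-level debug-feature entry is dropped at the default level 3, and raising the level to 5 captures warning and info entries alike, which is the trade-off Halil describes. Note that this concerns debug-feature entries only; a `pr_warn()`/`pr_info()` call goes to the kernel log, not to the debug feature.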
On 12/6/23 12:17 PM, Halil Pasic wrote:
> On Tue, 05 Dec 2023 09:04:23 +0100
> Harald Freudenberger <freude@linux.ibm.com> wrote:
>
>> On 2023-12-04 17:15, Halil Pasic wrote:
>>> On Mon, 4 Dec 2023 16:16:31 +0100
>>> Christian Borntraeger <borntraeger@linux.ibm.com> wrote:
>>>
>>>> On 04.12.23 at 15:53, Tony Krowiak wrote:
>>>>>
>>>>> On 11/29/23 12:12, Christian Borntraeger wrote:
>>>>>> On 29.11.23 at 15:35, Tony Krowiak wrote:
>>>>>>> In the current implementation, response code 01 (AP queue number not valid)
>>>>>>> is handled as a default case along with other response codes returned from
>>>>>>> a queue reset operation that are not handled specifically. Barring a bug,
>>>>>>> response code 01 will occur only when a queue has been externally removed
>>>>>>> from the host's AP configuration; in this case, the queue must
>>>>>>> be reset by the machine in order to avoid leaking crypto data if/when the
>>>>>>> queue is returned to the host's configuration. The response code 01 case
>>>>>>> will be handled specifically by logging a WARN message followed by cleaning
>>>>>>> up the IRQ resources.
>>>>>>>
>>>>>> To me it looks like this can be triggered by the LPAR admin, correct? So it
>>>>>> is not desirable but possible.
>>>>>> In that case I prefer to not use WARN, maybe use dev_warn or dev_err instead.
>>>>>> WARN can be a disruptive event if panic_on_warn is set.
>>>>> Yes, it can be triggered by the LPAR admin. I can't use dev_warn here because we don't have a reference to any device, but I can use pr_warn if that suffices.
>>>> Ok, please use pr_warn then.
>>> Shouldn't we rather make this an 'info'? I mean we probably do not want
>>> people complaining about this condition. Yes, it should be a best practice
>>> to coordinate such things with the guest, and ideally remove the resource
>>> from the guest first. But AFAIU our stack is supposed to be able to
>>> handle something like this. IMHO issuing a warning is an excessive measure.
>>> I know Reinhard and Tony probably disagree with the last sentence though.
>> Halil, Tony, the thing about info versus warning versus error is our
>> own stuff. Keep in mind that these messages end up in the "debug feature"
>> as FFDC data. So it comes down to which FFDC data you/Tony want to
>> see there. It should be enough to explain to a customer what happened
>> without the need to "recreate with higher debug level" if something serious
>> happened. So my private decision table is:
>> 1) Is it something serious, something exceptional, something which may not
>>    come up again if tried to recreate? Yes -> make it visible on the first
>>    occurrence as an error msg.
>> 2) Is it something you want to read when a customer hits it and you tell him
>>    to extract and examine the debug feature data? Yes -> make it a warning
>>    and make sure your debug feature by default records warnings.
>> 3) Still serious, but may flood the debug feature. Good enough and high
>>    probability to reappear on a recreate? Yes -> make it an info message
>>    and live with the risk that you may not be able to explain to a customer
>>    what happened without a recreate and higher debug level.
>> 4) Not 1-3 -> maybe a debug msg, but still think about what happens when a
>>    customer enables the "debug feature" with the highest level. Does it squeeze
>>    out more important stuff? Maybe make it dynamic debug with pr_debug() (see
>>    kernel docu admin-guide/dynamic-debug-howto.rst).
> AFAIU the default log level of the S390 Debug Feature is 3, that is,
> error. So warnings do not help us there by default.
And if we are > already asking the reporter to crank up the loglevel of the debug > feature, we can as the reporter to crank it up to 5, assumed there > is not too much stuff that log level 5 in that area... How much > info stuff do we have for the 'ap' debug facility (I hope > that is the facility used by vfio_ap)? No info logging is done via the S390 Debug Feature in vfio_ap. There are a few warning messages logged solely in the handle_pqap and vfio_ap_irq_enable functions. The question is, why are we talking about the S390 Debug Feature given the discussion is about using pr_warn verses pr_info. What am I missing here? > > I think log levels are supposed to be primarily about severity, and > and I'm not sure that a queue becoming unavailable in G1 without > fist re-configuring the G2 so that it no more has access to the > given queue is not really a warning severity thing. IMHO if we > really do want people complaining about this should they ever see it, > yes it should be a warning. If not then probably not. > > Regards, > Halil
On 12/4/23 5:05 PM, Halil Pasic wrote:
> On Mon, 4 Dec 2023 12:51:49 -0500
> Tony Krowiak <akrowiak@linux.ibm.com> wrote:
>
>>> s/if\/when/at latest before/
>>>
>>> I would argue that some of the cleanups need to happen before even 01 is
>>> reflected...
>> To what cleanups are you referring?
> Event notification and interruption disablement for starters. Otherwise
> the OS has no way to figure out when the GISA and NIB are safe to deallocate.
> Those actions are part of the reset process. I.e. some of the reset stuff
> can be deferred at most until the queue is made accessible again, some
> not so much.

How do you propose we disable interrupts if the PQAP(AQIC) will likely fail with response code 01, which is the subject of this patch? Do you think we should not free up the AQIC resources as we do in this patch?

>
>
> Regards,
> Halil
diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c
index 4db538a55192..91d6334574d8 100644
--- a/drivers/s390/crypto/vfio_ap_ops.c
+++ b/drivers/s390/crypto/vfio_ap_ops.c
@@ -1652,6 +1652,21 @@ static int apq_status_check(int apqn, struct ap_queue_status *status)
 		 * a value indicating a reset needs to be performed again.
 		 */
 		return -EAGAIN;
+	case AP_RESPONSE_Q_NOT_AVAIL:
+		/*
+		 * This response code indicates the queue is not available.
+		 * Barring a bug, response code 01 will occur only when a queue
+		 * has been externally removed from the host's AP configuration;
+		 * in which case, the queue must be reset by the machine in
+		 * order to avoid leaking crypto data if/when the queue is
+		 * returned to the host's configuration. In this case, let's go
+		 * ahead and log a warning message and return 0 so the AQIC
+		 * resources get cleaned up by the caller.
+		 */
+		WARN(true,
+		     "Unable to reset queue %02x.%04x: not in host AP configuration\n",
+		     AP_QID_CARD(apqn), AP_QID_QUEUE(apqn));
+		return 0;
 	default:
 		WARN(true,
 		     "failed to verify reset of queue %02x.%04x: TAPQ rc=%u\n",
@@ -1736,6 +1751,22 @@ static void vfio_ap_mdev_reset_queue(struct vfio_ap_queue *q)
 		q->reset_status.response_code = 0;
 		vfio_ap_free_aqic_resources(q);
 		break;
+	case AP_RESPONSE_Q_NOT_AVAIL:
+		/*
+		 * This response code indicates the queue is not available.
+		 * Barring a bug, response code 01 will occur only when a queue
+		 * has been externally removed from the host's AP configuration;
+		 * in which case, the queue must be reset by the machine in
+		 * order to avoid leaking crypto data if/when the queue is
+		 * returned to the host's configuration. In this case, let's go
+		 * ahead and log a warning message then clean up the AQIC
+		 * resources.
+		 */
+		WARN(true,
+		     "Unable to reset queue %02x.%04x: not in host AP configuration\n",
+		     AP_QID_CARD(q->apqn), AP_QID_QUEUE(q->apqn));
+		vfio_ap_free_aqic_resources(q);
+		break;
 	default:
 		WARN(true,
 		     "PQAP/ZAPQ for %02x.%04x failed with invalid rc=%u\n",