From patchwork Wed Oct 25 09:42:19 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Steven Rostedt <rostedt@goodmis.org>
X-Patchwork-Id: 157969
Date: Wed, 25 Oct 2023 05:42:19 -0400
From: Steven Rostedt <rostedt@goodmis.org>
To: LKML <linux-kernel@vger.kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>,
        Peter Zijlstra <peterz@infradead.org>,
        Ankur Arora <ankur.a.arora@oracle.com>,
        Linus Torvalds <torvalds@linux-foundation.org>,
        linux-mm@kvack.org, x86@kernel.org, akpm@linux-foundation.org,
        luto@kernel.org, bp@alien8.de, dave.hansen@linux.intel.com,
        hpa@zytor.com, mingo@redhat.com, juri.lelli@redhat.com,
        vincent.guittot@linaro.org, willy@infradead.org, mgorman@suse.de,
        jon.grimm@amd.com, bharata@amd.com, raghavendra.kt@amd.com,
        boris.ostrovsky@oracle.com, konrad.wilk@oracle.com,
        jgross@suse.com, andrew.cooper3@citrix.com,
        Joel Fernandes <joel@joelfernandes.org>,
        Youssef Esmat <youssefesmat@chromium.org>,
        Vineeth Pillai <vineethrp@google.com>,
        Suleiman Souhlal <suleiman@google.com>,
        Ingo Molnar <mingo@kernel.org>,
        Daniel Bristot de Oliveira <bristot@kernel.org>
Subject: [POC][RFC][PATCH] sched: Extended Scheduler Time Slice
Message-ID: <20231025054219.1acaa3dd@gandalf.local.home>
X-Mailer: Claws Mail 3.19.1 (GTK+ 2.24.33; x86_64-pc-linux-gnu)
MIME-Version: 1.0
Precedence: bulk
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

[
 This is basically a resend of this email, but as a separate patch and not
 part of a very long thread.
 https://lore.kernel.org/lkml/20231024214958.22fff0bc@gandalf.local.home/
]

This gives very good performance improvements for spin locks implemented in
user space, and I'm sure this can be used for spin locks in VMs too. That
will come shortly.

I started with Thomas's PREEMPT_AUTO.patch from the rt-devel tree:

 https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/patches/PREEMPT_AUTO.patch?h=v6.6-rc6-rt10-patches

So you need to select:

  CONFIG_PREEMPT_AUTO

The below is my proof of concept patch. It still has debugging in it, and
I'm sure the interface will need to be changed.

There's now a new file:  /sys/kernel/extend_sched

Attached is a program that tests it. It mmaps that file, with:

 struct extend_map {
	unsigned long		flags;
 };

 static __thread struct extend_map *extend_map;

That is, there's this structure for every thread. It's assigned with:

	fd = open("/sys/kernel/extend_sched", O_RDWR);
	extend_map = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

I don't actually like this interface, as it wastes a full page for just two
bits :-p  Perhaps it should be a new system call, where it just locks in
existing memory from the user application? The requirement is that each
thread needs its own bits to play with. It should not be shared with other
threads. It could be, as it will not mess up the kernel, but will mess up
the application.
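
As a rough sketch of the per-thread setup described above, each thread could
do its own open and mmap and leave the pointer NULL on failure, so that the
helpers below turn into no-ops on kernels without the file. The function
name extend_map_init() and its path parameter are assumptions made for this
sketch; they are not part of the patch or of the attached program.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

struct extend_map {
	unsigned long		flags;
};

/* One pointer per thread; each thread must map its own page. */
static __thread struct extend_map *extend_map;

/*
 * Hypothetical helper: map the per-thread page found at @path.
 * Returns 0 on success and -1 on failure. On failure extend_map stays
 * NULL, which the extend()/unextend() helpers already check for.
 */
static int extend_map_init(const char *path)
{
	void *map;
	int fd;

	fd = open(path, O_RDWR);
	if (fd < 0)
		return -1;

	map = mmap(NULL, (size_t)sysconf(_SC_PAGESIZE),
		   PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	/* A mapping stays valid after its file descriptor is closed. */
	close(fd);

	if (map == MAP_FAILED)
		return -1;

	extend_map = map;
	return 0;
}
```

Each thread would call extend_map_init("/sys/kernel/extend_sched") once
before its first critical section.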

Anyway, to tell the kernel to "extend" the time slice if possible because
it's in a critical section, we have:

 static void extend(void)
 {
	if (!extend_map)
		return;

	extend_map->flags = 1;
 }

And to say that it's done:

 static void unextend(void)
 {
	unsigned long prev;

	if (!extend_map)
		return;

	prev = xchg(&extend_map->flags, 0);
	if (prev & 2)
		sched_yield();
 }

So, the first bit (value 1) is for user space to tell the kernel "please
extend me", and the second bit (value 2) is for the kernel to tell user
space "OK, I extended you, but call sched_yield() when done".
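
To make the handshake between the two bits easier to follow, here is a small
user space model of it using C11 atomics. It is only a sketch: the names
EXTEND_REQUEST, EXTEND_GRANTED and the model_*() functions are made up for
illustration, and model_kernel_lazy_resched() merely imitates the decision
the patch makes on a lazy reschedule request when returning to user space.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define EXTEND_REQUEST	1UL	/* user space: "please extend me" */
#define EXTEND_GRANTED	2UL	/* kernel: "extended, yield when done" */

/* User side: entering a critical section. */
static void model_extend(atomic_ulong *flags)
{
	atomic_store(flags, EXTEND_REQUEST);
}

/*
 * Stand-in for the kernel side handling of a lazy reschedule request.
 * Returns true if the task would be scheduled out right away, false if
 * it is allowed to keep running with the "granted" bit set.
 */
static bool model_kernel_lazy_resched(atomic_ulong *flags)
{
	if (!(atomic_load(flags) & EXTEND_REQUEST))
		return true;

	atomic_fetch_or(flags, EXTEND_GRANTED);
	return false;
}

/*
 * User side: leaving the critical section. Clears both bits in one
 * atomic exchange and returns true if the caller owes a sched_yield().
 */
static bool model_unextend(atomic_ulong *flags)
{
	unsigned long prev = atomic_exchange(flags, 0);

	return (prev & EXTEND_GRANTED) != 0;
}
```

The exchange in model_unextend() matters: clearing the word and reading the
old value in one step means a grant that arrives just before the clear is
not lost.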

This test program creates 1 + number of CPUs threads, which run in a loop
for 5 seconds. Each thread will grab a user space spin lock (not a futex,
but just shared memory). Before grabbing the lock it will call "extend()".
If it fails to grab the lock, it calls "unextend()" and spins on the lock
until it's free, where it will try again. Then after it gets the lock, it
will update a counter, and release the lock, calling "unextend()" as well.
Then it will spin on the counter until it increments again to allow another
task to get into the critical section.
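
The lock loop described above might look roughly like the following. This is
a sketch, not the attached program: extend() and unextend() are empty stubs
standing in for the helpers shown earlier, the 5 second timer and the spin on
the counter after release are left out, and run_workers() with its fixed
iteration count is an assumption added so the sketch terminates.

```c
#include <pthread.h>
#include <stdatomic.h>

#define MAX_THREADS	64

static atomic_int lock;		/* 0 = free, 1 = held */
static unsigned long counter;	/* protected by lock */

/* Stubs: replace with the real extend()/unextend() helpers. */
static void extend(void) { }
static void unextend(void) { }

static void grab_lock(void)
{
	for (;;) {
		int unlocked = 0;

		/* Ask for an extension before trying to take the lock. */
		extend();
		if (atomic_compare_exchange_strong(&lock, &unlocked, 1))
			return;	/* got it, stay "extended" while holding it */

		/* Lost the race: drop the request, spin until it's free. */
		unextend();
		while (atomic_load(&lock))
			;
	}
}

static void release_lock(void)
{
	atomic_store(&lock, 0);
	unextend();
}

static void *worker(void *arg)
{
	unsigned long i, loops = *(unsigned long *)arg;

	for (i = 0; i < loops; i++) {
		grab_lock();
		counter++;
		release_lock();
	}
	return NULL;
}

/* Run nr_threads workers for loops iterations each; return the counter. */
static unsigned long run_workers(int nr_threads, unsigned long loops)
{
	pthread_t threads[MAX_THREADS];
	int i, started = 0;

	if (nr_threads > MAX_THREADS)
		nr_threads = MAX_THREADS;

	for (i = 0; i < nr_threads; i++) {
		if (pthread_create(&threads[started], NULL, worker, &loops))
			break;
		started++;
	}
	for (i = 0; i < started; i++)
		pthread_join(threads[i], NULL);

	return counter;
}
```

If every thread starts, run_workers(n, loops) should return n * loops,
which is a quick sanity check that the lock itself is correct before
measuring anything.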

With the init of the extend_map disabled, so that it doesn't use the extend
code, it ends with:

 Ran for 3908165 times
 Total wait time: 33.965654

I can give you stdev and all that too, but the above is pretty much the
same after several runs.

After enabling the extend code, it has:

 Ran for 4829340 times
 Total wait time: 32.635407

It was able to get into the critical section almost 1 million times more in
those 5 seconds! That's a 23% improvement!

The total wait time for getting into the critical section also dropped by
over a second (4% improvement).
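
For anyone who wants to redo the arithmetic on their own runs, the
percentages above come from a plain relative change. The helper name
percent_change() is made up for this sketch:

```c
/*
 * Relative change from @before to @after, in percent.
 * Positive means @after is larger, negative means it is smaller.
 */
static double percent_change(double before, double after)
{
	return (after - before) / before * 100.0;
}
```

Calling percent_change(3908165, 4829340) gives the gain in critical section
entries, and percent_change(33.965654, 32.635407) gives the (negative)
change in total wait time.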

I ran a traceeval tool on it (still a work in progress, but I can post it
when it's done), using the following trace along with the writes to
trace_marker (tracefs_printf):

 trace-cmd record -e sched_switch ./extend-sched

It showed that without the extend, each task was preempted while holding
the lock around 200 times. With the extend, only one task was ever
preempted while holding the lock, and it only happened once!

Note, I tried replacing the user space spin lock with a futex, and
performance dropped so much, both with and without the update, that the
benefit is in the noise.

Below is my patch (with debugging and on top of Thomas's PREEMPT_AUTO.patch).

Attached is the program I tested it with. It uses libtracefs to write to
the trace_marker file, and builds with:

  gcc -o extend-sched extend-sched.c `pkg-config --libs --cflags libtracefs` -lpthread

If you don't want to build it with libtracefs, you can just do:

 grep -v tracefs extend-sched.c > extend-sched-notracefs.c

And build that.

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9b13b7d4f1d3..fb540dd0dec0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -740,6 +740,10 @@ struct kmap_ctrl {
 #endif
 };
 
+struct extend_map {
+	long				flags;
+};
+
 struct task_struct {
 #ifdef CONFIG_THREAD_INFO_IN_TASK
 	/*
@@ -802,6 +806,8 @@ struct task_struct {
 	unsigned int			core_occupation;
 #endif
 
+	struct extend_map		*extend_map;
+
 #ifdef CONFIG_CGROUP_SCHED
 	struct task_group		*sched_task_group;
 #endif
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index c1f706038637..21d0e4d81d33 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -147,17 +147,32 @@ void __weak arch_do_signal_or_restart(struct pt_regs *regs) { }
 static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 					    unsigned long ti_work)
 {
+	unsigned long ignore_mask;
+
 	/*
 	 * Before returning to user space ensure that all pending work
 	 * items have been completed.
 	 */
 	while (ti_work & EXIT_TO_USER_MODE_WORK) {
+		ignore_mask = 0;
 
 		local_irq_enable_exit_to_user(ti_work);
 
-		if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
+		if (ti_work & _TIF_NEED_RESCHED) {
 			schedule();
 
+		} else if (ti_work & _TIF_NEED_RESCHED_LAZY) {
+			if (!current->extend_map ||
+			    !(current->extend_map->flags & 1)) {
+				schedule();
+			} else {
+				trace_printk("Extend!\n");
+				/* Allow to leave with NEED_RESCHED_LAZY still set */
+				ignore_mask |= _TIF_NEED_RESCHED_LAZY;
+				current->extend_map->flags |= 2;
+			}
+		}
+
 		if (ti_work & _TIF_UPROBE)
 			uprobe_notify_resume(regs);
 
@@ -184,6 +199,7 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 		tick_nohz_user_enter_prepare();
 
 		ti_work = read_thread_flags();
+		ti_work &= ~ignore_mask;
 	}
 
 	/* Return the latest work state for arch_exit_to_user_mode() */
diff --git a/kernel/exit.c b/kernel/exit.c
index edb50b4c9972..ddf89ec9ab62 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -906,6 +906,13 @@ void __noreturn do_exit(long code)
 	if (tsk->io_context)
 		exit_io_context(tsk);
 
+	if (tsk->extend_map) {
+		unsigned long addr = (unsigned long)tsk->extend_map;
+
+		virt_to_page(addr)->mapping = NULL;
+		free_page(addr);
+	}
+
 	if (tsk->splice_pipe)
 		free_pipe_info(tsk->splice_pipe);
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 3b6d20dfb9a8..da2214082d25 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1166,6 +1166,8 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 	tsk->wake_q.next = NULL;
 	tsk->worker_private = NULL;
 
+	tsk->extend_map = NULL;
+
 	kcov_task_init(tsk);
 	kmsan_task_create(tsk);
 	kmap_local_fork(tsk);
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 976092b7bd45..297061cfa08d 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -32,3 +32,4 @@ obj-y += core.o
 obj-y += fair.o
 obj-y += build_policy.o
 obj-y += build_utility.o
+obj-y += extend.o
diff --git a/kernel/sched/extend.c b/kernel/sched/extend.c
new file mode 100644
index 000000000000..a632e1a8f57b
--- /dev/null
+++ b/kernel/sched/extend.c
@@ -0,0 +1,90 @@
+#include <linux/kobject.h>
+#include <linux/pagemap.h>
+#include <linux/sysfs.h>
+#include <linux/init.h>
+
+#ifdef CONFIG_SYSFS
+static ssize_t extend_sched_read(struct file *file,  struct kobject *kobj,
+				 struct bin_attribute *bin_attr,
+				 char *buf, loff_t off, size_t len)
+{
+	static const char output[] = "Extend scheduling time slice\n";
+
+	printk("%s:%d\n", __func__, __LINE__);
+	if (off >= sizeof(output))
+		return 0;
+
+	strscpy(buf, output + off, len);
+	return min((ssize_t)len, sizeof(output) - off - 1);
+}
+
+static ssize_t extend_sched_write(struct file *file, struct kobject *kobj,
+				  struct bin_attribute *bin_attr,
+				  char *buf, loff_t off, size_t len)
+{
+	printk("%s:%d\n", __func__, __LINE__);
+	return -EINVAL;
+}
+
+static vm_fault_t extend_sched_mmap_fault(struct vm_fault *vmf)
+{
+	vm_fault_t ret = VM_FAULT_SIGBUS;
+
+	trace_printk("%s:%d\n", __func__, __LINE__);
+	/* Only has one page */
+	if (vmf->pgoff || !current->extend_map)
+		return ret;
+
+	vmf->page = virt_to_page(current->extend_map);
+
+	get_page(vmf->page);
+	vmf->page->mapping = vmf->vma->vm_file->f_mapping;
+	vmf->page->index   = vmf->pgoff;
+
+	return 0;
+}
+
+static void extend_sched_mmap_open(struct vm_area_struct *vma)
+{
+	printk("%s:%d\n", __func__, __LINE__);
+	WARN_ON(!current->extend_map);
+}
+
+static const struct vm_operations_struct extend_sched_vmops = {
+	.open		= extend_sched_mmap_open,
+	.fault		= extend_sched_mmap_fault,
+};
+
+static int extend_sched_mmap(struct file *file, struct kobject *kobj,
+			     struct bin_attribute *attr,
+			     struct vm_area_struct *vma)
+{
+	if (current->extend_map)
+		return -EBUSY;
+
+	current->extend_map = (void *)get_zeroed_page(GFP_USER);
+	if (!current->extend_map)
+		return -ENOMEM;
+
+	vm_flags_mod(vma, VM_DONTCOPY | VM_DONTDUMP | VM_MAYWRITE, 0);
+	vma->vm_ops = &extend_sched_vmops;
+
+	return 0;
+}
+
+static struct bin_attribute extend_sched_attr = {
+	.attr = {
+		.name = "extend_sched",
+		.mode = 0777,
+	},
+	.read = &extend_sched_read,
+	.write = &extend_sched_write,
+	.mmap = &extend_sched_mmap,
+};
+
+static __init int extend_init(void)
+{
+	return sysfs_create_bin_file(kernel_kobj, &extend_sched_attr);
+}
+late_initcall(extend_init);
+#endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 700b140ac1bb..17ca22e80384 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -993,9 +993,10 @@ static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se, bool
 		resched_curr(rq);
 	} else {
 		/* Did the task ignore the lazy reschedule request? */
-		if (tick && test_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED_LAZY))
+		if (tick && test_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED_LAZY)) {
+			trace_printk("Force resched?\n");
 			resched_curr(rq);
-		else
+		} else
 			resched_curr_lazy(rq);
 	}
 	clear_buddies(cfs_rq, se);