[RFC/REFACT] Refactoring and significantly reducing code complexity

Message ID ZRU/EjubEH/5QLlG@fedora
State New
Headers
Series [RFC/REFACT] Refactoring and significantly reducing code complexity |

Commit Message

Wang Jinchao Sept. 28, 2023, 8:53 a.m. UTC
  This is a refactored version with the following main changes:

- The parallel workqueue no longer uses the WQ_UNBOUND attribute
- Removal of CPU-related logic, sysfs-related interfaces
- removal of structures like padata_cpumask, and deletion of parallel_data
- Using completion to maintain sequencing
- no longer using lists
- removing structures like padata_list and padata_serial_queue
- Removal of padata_do_serial()
- padata automatically ensures the calling sequence.

Testing was conducted using ltp's pcrypt_aead01, and the execution time
comparison between the old and new versions is as follows:

Old Version:
real 0m27.451s
user 0m0.031s
sys 0m0.260s

New Version:
real 0m21.351s
user 0m0.023s
sys 0m0.260s

Signed-off-by: Wang Jinchao <wangjinchao@xfusion.com>
---
 crypto/pcrypt.c        |  34 +-
 include/linux/padata.h |  95 +----
 kernel/padata.c        | 815 ++---------------------------------------
 3 files changed, 43 insertions(+), 901 deletions(-)
  

Comments

Steffen Klassert Sept. 29, 2023, 5:47 a.m. UTC | #1
On Thu, Sep 28, 2023 at 04:53:38PM +0800, Wang Jinchao wrote:
> This is a refactored version with the following main changes:
> 
> - The parallel workqueue no longer uses the WQ_UNBOUND attribute
> - Removal of CPU-related logic, sysfs-related interfaces
> - removal of structures like padata_cpumask, and deletion of parallel_data
> - Using completion to maintain sequencing
> - no longer using lists
> - removing structures like padata_list and padata_serial_queue
> - Removal of padata_do_serial()

This removes all the logic that is needed to ensure that
the parallelized objects return in the same order as
they were before the parallelization. This change makes
padata unusable for networking.
  
Wang Jinchao Oct. 7, 2023, 1:17 a.m. UTC | #2
On Fri, Sep 29, 2023 at 07:47:22AM +0200, Steffen Klassert wrote:
> On Thu, Sep 28, 2023 at 04:53:38PM +0800, Wang Jinchao wrote:
> > This is a refactored version with the following main changes:
> > 
> > - The parallel workqueue no longer uses the WQ_UNBOUND attribute
> > - Removal of CPU-related logic, sysfs-related interfaces
> > - removal of structures like padata_cpumask, and deletion of parallel_data
> > - Using completion to maintain sequencing
> > - no longer using lists
> > - removing structures like padata_list and padata_serial_queue
> > - Removal of padata_do_serial()
> 
> This removes all the logic that is needed to ensure that
> the parallelized objects return in the same order as
> they were before the parallelization. This change makes
> padata unusable for networking.

The RFC use the following three to ensure serial timing sequence:

1.  Use alloc_ordered_workqueue() to create a serial worker queue where 
    serial() function runs. This ensures that serial() function executes
    as serial work was enqueued using queue_work().
2.  Queue the serial work before enqueueing parallel work in padata_do_parallel().
    This ensures the serial work follows the same order as the padata_do_parallel().
3.  The serial work wait for completion of parallel_done, which will be 
    complete()ed after the parallel() function within the parallel work.

This is just a design idea, because I am not familiar with IPsec, I haven't 
tested it in a real network environment yet. 
Could you give me some clues on how to use pcrypt in an IPsec scenario?
  
Wang Jinchao Oct. 8, 2023, 7:58 a.m. UTC | #3
On Fri, Sep 29, 2023 at 07:47:22AM +0200, Steffen Klassert wrote:
> On Thu, Sep 28, 2023 at 04:53:38PM +0800, Wang Jinchao wrote:
> > This is a refactored version with the following main changes:
> > 
> > - The parallel workqueue no longer uses the WQ_UNBOUND attribute
> > - Removal of CPU-related logic, sysfs-related interfaces
> > - removal of structures like padata_cpumask, and deletion of parallel_data
> > - Using completion to maintain sequencing
> > - no longer using lists
> > - removing structures like padata_list and padata_serial_queue
> > - Removal of padata_do_serial()
> 
> This removes all the logic that is needed to ensure that
> the parallelized objects return in the same order as
> they were before the parallelization. This change makes
> padata unusable for networking.

Hello Steffen, 
I have constructed a scenario where parallel() time cost 
is forced to be reversed , and then ensured the order using serial(). 
The tests have passed on a 32-core machine. 
Can you please explain if my refactored approach guarantees the serial ordering?

Here is the code:

padata_test.c
--------------
```c
#include <linux/delay.h>
#include <linux/module.h>
#include <linux/padata.h>

struct request {
  struct padata_priv padata;
  int seq;
  struct completion done;
};

#define TEST_ARRAY_SIZE 200

#define PARALLEL_BASE_TIME 10

static int serial_cnt;
static int shuffled;
struct request requests[TEST_ARRAY_SIZE];
void parallel(struct padata_priv *padata) {
  struct request *req = container_of(padata, struct request, padata);

  // The smaller the req->seq number, the longer the delay time
  // creating a reverse order.
  mdelay((TEST_ARRAY_SIZE - req->seq) * PARALLEL_BASE_TIME);
  msleep((TEST_ARRAY_SIZE - req->seq) * PARALLEL_BASE_TIME);
}

void serial(struct padata_priv *padata) {
  struct request *req = container_of(padata, struct request, padata);
  if (req->seq != serial_cnt)
    shuffled = 1;
  serial_cnt++;
  complete(&req->done);
}

struct padata_instance *pinst;
struct padata_shell *ps;
static int __init padata_test_init(void) {
  serial_cnt = 0;
  shuffled = 0;
  pinst = padata_alloc("padata_test");
  if (!pinst)
    return -ENOMEM;

  ps = padata_alloc_shell(pinst);

  for (int i = 0; i < TEST_ARRAY_SIZE; i++) {
    requests[i].padata.parallel = parallel;
    requests[i].padata.serial = serial;
    requests[i].seq = i;
    init_completion(&requests[i].done);
  }

  for (int i = 0; i < TEST_ARRAY_SIZE; i++) {
    padata_do_parallel(ps, &requests[i].padata);
  }

  // only wait for the last one 
  // if they were not done serially
  // the serial_cnt will not be TEST_ARRAY_SIZE
  // there was a shuffering
  int last = TEST_ARRAY_SIZE - 1;
  wait_for_completion(&requests[last].done);
  if (serial_cnt != TEST_ARRAY_SIZE)
	shuffled = 1;
  
  printk(KERN_INFO "padata_test: shuffled %d serial_cnt %d\n", shuffled,
         serial_cnt);
  return 0;
}

static void __exit padata_test_exit(void) {
  padata_free_shell(ps);
  padata_free(pinst);
}

module_init(padata_test_init);
module_exit(padata_test_exit);

MODULE_LICENSE("GPL");
MODULE_AUTHOR("wangjinchao");
MODULE_DESCRIPTION("padata Test Module");
```

Makefile
---------
```makefile
obj-m += padata_test.o

all:
	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules

clean:
	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean
```

run.sh
--------
```bash
make
dmesg -C
insmod padata_test.ko
rmmod padata_test
dmesg|grep serial_cnt
```
  
Steffen Klassert Oct. 16, 2023, 10:28 a.m. UTC | #4
On Sat, Oct 07, 2023 at 09:17:59AM +0800, Wang Jinchao wrote:
> On Fri, Sep 29, 2023 at 07:47:22AM +0200, Steffen Klassert wrote:
> > On Thu, Sep 28, 2023 at 04:53:38PM +0800, Wang Jinchao wrote:
> > > This is a refactored version with the following main changes:
> > > 
> > > - The parallel workqueue no longer uses the WQ_UNBOUND attribute
> > > - Removal of CPU-related logic, sysfs-related interfaces
> > > - removal of structures like padata_cpumask, and deletion of parallel_data
> > > - Using completion to maintain sequencing
> > > - no longer using lists
> > > - removing structures like padata_list and padata_serial_queue
> > > - Removal of padata_do_serial()
> > 
> > This removes all the logic that is needed to ensure that
> > the parallelized objects return in the same order as
> > they were before the parallelization. This change makes
> > padata unusable for networking.
> 
> The RFC use the following three to ensure serial timing sequence:
> 
> 1.  Use alloc_ordered_workqueue() to create a serial worker queue where 
>     serial() function runs. This ensures that serial() function executes
>     as serial work was enqueued using queue_work().
> 2.  Queue the serial work before enqueueing parallel work in padata_do_parallel().
>     This ensures the serial work follows the same order as the padata_do_parallel().
> 3.  The serial work wait for completion of parallel_done, which will be 
>     complete()ed after the parallel() function within the parallel work.

I had a closer look to the padata codebase now. The logic to
parallelize/serialize changed quite a bit since I wrote padata
initially. I don't understand the new logic completely,
so we need to rely on Daniels review for this.

> This is just a design idea, because I am not familiar with IPsec, I haven't 
> tested it in a real network environment yet. 
> Could you give me some clues on how to use pcrypt in an IPsec scenario?

pcrypt has performance gains if the cost of the crypto
oreration is higher than the cost of moving the crypto
request to another cpu. That was the case back in the
days, but the situation changed since then. We now have
hardware support for crypto, like aes-ni etc.
So for hardware supported algorithms, pcrypt will not
help much. But it is still interesting for crypto
algorithms that are implemented in software only.

Here is how you can instantiate a pcrypted
algorithm:

----------------------------------------------------
You need to instantiate pcrypt before you can use it.

You need either crconf or the tcrypt module to instantiate pcrypt.

With crconf:

You can get crconf from https://sourceforge.net/projects/crconf/
After installing crconf do e.g.

crconf add driver "pcrypt(authenc(hmac(sha1-generic),cbc(aes-asm)))" type 3


With tcrypt:

modprobe tcrypt alg="pcrypt(authenc(hmac(sha1-generic),cbc(aes-asm)))" type=3

The modprobe will return with an error, don't worry about it, that's ok.



Once you've did one of the above, your /proc/crypto should show
something like:

name         : authenc(hmac(sha1),cbc(aes))
driver       : pcrypt(authenc(hmac(sha1-generic),cbc(aes-asm)))
module       : pcrypt
priority     : 2100
refcnt       : 1
selftest     : passed
type         : aead
async        : yes
blocksize    : 16
ivsize       : 16
maxauthsize  : 20
geniv        : <built-in>


Now pcrypt is instantiated, e.g. all new IPsec states (that do
hmac-sha1, cbc-aes) will use it.
  
Steffen Klassert Oct. 16, 2023, 10:47 a.m. UTC | #5
Hi,

On Thu, Sep 28, 2023 at 04:53:38PM +0800, Wang Jinchao wrote:
> This is a refactored version with the following main changes:
> 
> - The parallel workqueue no longer uses the WQ_UNBOUND attribute
> - Removal of CPU-related logic, sysfs-related interfaces
> - removal of structures like padata_cpumask, and deletion of parallel_data
> - Using completion to maintain sequencing
> - no longer using lists
> - removing structures like padata_list and padata_serial_queue
> - Removal of padata_do_serial()
> - padata automatically ensures the calling sequence.
> 
> Testing was conducted using ltp's pcrypt_aead01, and the execution time
> comparison between the old and new versions is as follows:
> 
> Old Version:
> real 0m27.451s
> user 0m0.031s
> sys 0m0.260s
> 
> New Version:
> real 0m21.351s
> user 0m0.023s
> sys 0m0.260s
> 
> Signed-off-by: Wang Jinchao <wangjinchao@xfusion.com>
> ---
>  crypto/pcrypt.c        |  34 +-
>  include/linux/padata.h |  95 +----
>  kernel/padata.c        | 815 ++---------------------------------------
>  3 files changed, 43 insertions(+), 901 deletions(-)

...

>  
> -static int pcrypt_sysfs_add(struct padata_instance *pinst, const char *name)
> -{
> -	int ret;
> -
> -	pinst->kobj.kset = pcrypt_kset;
> -	ret = kobject_add(&pinst->kobj, NULL, "%s", name);
> -	if (!ret)
> -		kobject_uevent(&pinst->kobj, KOBJ_ADD);
> -
> -	return ret;
> -}

Users might rely on the existence of that sysfs files,
so removing this might be dangerous.

>  static void padata_serial_worker(struct work_struct *serial_work)
>  {
> -	struct padata_serial_queue *squeue;
> -	struct parallel_data *pd;
> -	LIST_HEAD(local_list);
> -	int cnt;
> -
> -	local_bh_disable();

Note that the networking RX path must run with BHs off.
The call to padata_do_parallel might come with BHs off
and the serialization callback should also called with
BHs off.

Please make sure that the networking usecase continues
to work with that refactorization.

Thanks!
  
Daniel Jordan Oct. 25, 2023, 6:07 p.m. UTC | #6
Hello,

On Thu, Sep 28, 2023 at 04:53:38PM +0800, Wang Jinchao wrote:
> This is a refactored version with the following main changes:

The RFC overall is a nice simplification, and the basic approach of using an
ordered workqueue seems to work.

> - The parallel workqueue no longer uses the WQ_UNBOUND attribute

What's the justification here?  If it improves performance, please show
numbers.  Earlier tests[0] showed a large improvement when adding this
flag.

[0] https://lore.kernel.org/linux-crypto/20190906014029.3345-1-daniel.m.jordan@oracle.com/

> - Removal of CPU-related logic, sysfs-related interfaces

I agree with Steffen that we should continue to honor the cpumasks that the
user sets.

The simplest way I see to make the parallel mask work with your refactor is to
just make the parallel workqueue unbound again, since setting workqueue
attributes is only allowed for unbound, and bring back some of the plumbing
that leads to the apply_workqueue_attrs call.

The serial mask is trickier.  Changing attributes of an ordered workqueue (the
cpumask in this case) makes the kernel throw a warning...

static int apply_workqueue_attrs_locked
        ...
        /* creating multiple pwqs breaks ordering guarantee */                   
        if (!list_empty(&wq->pwqs)) {                                            
                if (WARN_ON(wq->flags & __WQ_ORDERED_EXPLICIT))                  
                        return -EINVAL;                                          
                                                                                 
                wq->flags &= ~__WQ_ORDERED;                                      
        }

...but I'm not sure this is a fundamental limitation.  The changelog of
0a94efb5acbb ("workqueue: implicit ordered attribute should be overridable")
says changes to "max_active and some attribute changes" are rejected, but it
might be possible to relax the warning to allow setting a cpumask while still
rejecting other changes.

> Testing was conducted using ltp's pcrypt_aead01, and the execution time
> comparison between the old and new versions is as follows:
> 
> Old Version:
> real 0m27.451s
> user 0m0.031s
> sys 0m0.260s
> 
> New Version:
> real 0m21.351s
> user 0m0.023s
> sys 0m0.260s

Great speedup.  A test that runs many requests for a long time in parallel is
also good to run, such as [0].

> @@ -986,57 +281,27 @@ struct padata_instance *padata_alloc(const char *name)
...
> +	pinst->serial_wq = alloc_ordered_workqueue ("%s_serial",
> +									WQ_MEM_RECLAIM | WQ_FREEZABLE,
> +									name);

Why add these two WQ_ flags?  Also, whitespace is kinda funky.
  
Daniel Jordan Oct. 25, 2023, 6:12 p.m. UTC | #7
On Thu, Sep 28, 2023 at 04:53:38PM +0800, Wang Jinchao wrote:
> -/**
> - * struct parallel_data - Internal control structure, covers everything
> - * that depends on the cpumask in use.
> - *
> - * @ps: padata_shell object.
> - * @reorder_list: percpu reorder lists
> - * @squeue: percpu padata queues used for serialuzation.
> - * @refcnt: Number of objects holding a reference on this parallel_data.
> - * @seq_nr: Sequence number of the parallelized data object.
> - * @processed: Number of already processed objects.
> - * @cpu: Next CPU to be processed.
> - * @cpumask: The cpumasks in use for parallel and serial workers.
> - * @reorder_work: work struct for reordering.
> - * @lock: Reorder lock.
> - */
> -struct parallel_data {
> -	struct padata_shell		*ps;
> -	struct padata_list		__percpu *reorder_list;
> -	struct padata_serial_queue	__percpu *squeue;
> -	refcount_t			refcnt;
> -	unsigned int			seq_nr;
> -	unsigned int			processed;
> -	int				cpu;
> -	struct padata_cpumask		cpumask;
> -	struct work_struct		reorder_work;
> -	spinlock_t                      ____cacheline_aligned lock;
> -};

reorder_list used to serialize one sequence of objects per padata_shell,
but now serial_wq serializes all sequences of objects in one list of
work_structs.  That works in theory, since a total order can maintain
each sequence's order, but it's possible (not sure yet, need to think
more) that this could lead to deadlocks or other issues in odd cases
such as the one that padata_shell was introduced for in bbefa1dd6a6d
("crypto: pcrypt - Avoid deadlock by using per-instance padata queues").

The reproducer in that commit produces this splat when testing the RFC.
Not sure if it's related to the above though.

[   40.084146] alg: aead: pcrypt(pcrypt(rfc4106-gcm-aesni)) encryption test failed (wrong result) on test vector 1, cfg="in-place (one sglist)"
[   40.087192] alg: self-tests for rfc4106(gcm(aes)) using pcrypt(pcrypt(rfc4106-gcm-aesni)) failed (rc=-22)
[   40.087195] ------------[ cut here ]------------
[   40.090296] alg: self-tests for rfc4106(gcm(aes)) using pcrypt(pcrypt(rfc4106-gcm-aesni)) failed (rc=-22)
[   40.090313] WARNING: CPU: 4 PID: 321 at crypto/testmgr.c:5936 alg_test+0x404/0x510
[   40.094020] Modules linked in: pcrypt
[   40.094882] CPU: 4 PID: 321 Comm: cryptomgr_test Not tainted 6.6.0-rc6-padata-refact-rfc-v1-test+ #3
[   40.096856] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.2-2-2 04/01/2014
[   40.098838] RIP: 0010:alg_test+0x404/0x510
[   40.099664] Code: e9 4a fd ff ff c6 44 24 17 01 41 bd ff ff ff ff e9 d1 fc ff ff 44 89 e9 48 89 da 4c 89 e6 48 c7 c7 a0 59 1d 82 e8 1c 45 c4 ff <0f> 0b e9 aa fe ff ff 0f 0b 48 c7 c7 80 58 1d 82 e8 e7 c6 cc ff e9
[   40.103180] RSP: 0018:ffffc90000343e10 EFLAGS: 00010286
[   40.104322] RAX: 0000000000000000 RBX: ffff88800421cc00 RCX: 0000000000000000
[   40.105875] RDX: 0000000000000002 RSI: ffffc90000343cc0 RDI: 00000000ffffffff
[   40.107467] RBP: 0000000000000883 R08: 00000000ffffdfff R09: 0000000000000001
[   40.109249] R10: 00000000ffffdfff R11: ffffffff824722a0 R12: ffff88800421cc80
[   40.110862] R13: 00000000ffffffea R14: 00000000ffffffff R15: 0000000000000000
[   40.112354] FS:  0000000000000000(0000) GS:ffff88807dd00000(0000) knlGS:0000000000000000
[   40.114132] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   40.115130] CR2: 00007f1808faba38 CR3: 000000000243e002 CR4: 0000000000170ea0
[   40.115959] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   40.116630] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   40.117261] Call Trace:
[   40.117559]  <TASK>
[   40.117811]  ? __warn+0x7d/0x140
[   40.118165]  ? alg_test+0x404/0x510
[   40.118452]  ? report_bug+0x18d/0x1c0
[   40.118743]  ? handle_bug+0x3a/0x70
[   40.118998]  ? exc_invalid_op+0x13/0x60
[   40.119265]  ? asm_exc_invalid_op+0x16/0x20
[   40.119597]  ? alg_test+0x404/0x510
[   40.119918]  ? _raw_spin_unlock_irqrestore+0x1b/0x40
[   40.120390]  ? try_to_wake_up+0x9a/0x6a0
[   40.120749]  ? preempt_count_add+0x6a/0xa0
[   40.121121]  ? __pfx_cryptomgr_test+0x10/0x10
[   40.121494]  cryptomgr_test+0x20/0x40
[   40.121751]  kthread+0x100/0x130
[   40.121979]  ? __pfx_kthread+0x10/0x10
[   40.122247]  ret_from_fork+0x30/0x50
[   40.122511]  ? __pfx_kthread+0x10/0x10
[   40.122775]  ret_from_fork_asm+0x1b/0x30
[   40.123050]  </TASK>
[   40.123207] ---[ end trace 0000000000000000 ]---
  
Daniel Jordan Oct. 25, 2023, 6:13 p.m. UTC | #8
On Sat, Oct 07, 2023 at 09:17:59AM +0800, Wang Jinchao wrote:
> This is just a design idea, because I am not familiar with IPsec, I haven't 
> tested it in a real network environment yet. 

Out of curiosity, what's your use case for padata?
  
Wang Jinchao Oct. 26, 2023, 1:07 a.m. UTC | #9
On Wed, Oct 25, 2023 at 02:07:18PM -0400, Daniel Jordan wrote:
> Hello,
> 
> On Thu, Sep 28, 2023 at 04:53:38PM +0800, Wang Jinchao wrote:
> > This is a refactored version with the following main changes:
> 
> The RFC overall is a nice simplification, and the basic approach of using an
> ordered workqueue seems to work.
> 
> > - The parallel workqueue no longer uses the WQ_UNBOUND attribute
> 
> What's the justification here?  If it improves performance, please show
> numbers.  Earlier tests[0] showed a large improvement when adding this
> flag.
> 
> [0] https://lore.kernel.org/linux-crypto/20190906014029.3345-1-daniel.m.jordan@oracle.com/
> 
When I wrote the email on September 28th, I only used the "pcrypt_aead01"
test case from LTP. Thank you for you and Steffen's responses. I now have
more test cases, but I haven't tested them yet.
> > - Removal of CPU-related logic, sysfs-related interfaces
> 
> I agree with Steffen that we should continue to honor the cpumasks that the
> user sets.
> 
> The simplest way I see to make the parallel mask work with your refactor is to
> just make the parallel workqueue unbound again, since setting workqueue
> attributes is only allowed for unbound, and bring back some of the plumbing
> that leads to the apply_workqueue_attrs call.
You've convinced me, and I agree with your stance on keeping the cpumask.
So, using WQ_UNBOUND is the right choice, and this will be reflected in
my upcoming patches.

> 
> The serial mask is trickier.  Changing attributes of an ordered workqueue (the
> cpumask in this case) makes the kernel throw a warning...
> 
> static int apply_workqueue_attrs_locked
>         ...
>         /* creating multiple pwqs breaks ordering guarantee */                   
>         if (!list_empty(&wq->pwqs)) {                                            
>                 if (WARN_ON(wq->flags & __WQ_ORDERED_EXPLICIT))                  
>                         return -EINVAL;                                          
>                                                                                  
>                 wq->flags &= ~__WQ_ORDERED;                                      
>         }
> 
> ...but I'm not sure this is a fundamental limitation.  The changelog of
> 0a94efb5acbb ("workqueue: implicit ordered attribute should be overridable")
> says changes to "max_active and some attribute changes" are rejected, but it
> might be possible to relax the warning to allow setting a cpumask while still
> rejecting other changes.
Workqueue provides the alloc_ordered_workqueue method, which may be more suitable
for serial workers.

> 
> > Testing was conducted using ltp's pcrypt_aead01, and the execution time
> > comparison between the old and new versions is as follows:
> > 
> > Old Version:
> > real 0m27.451s
> > user 0m0.031s
> > sys 0m0.260s
> > 
> > New Version:
> > real 0m21.351s
> > user 0m0.023s
> > sys 0m0.260s
> 
> Great speedup.  A test that runs many requests for a long time in parallel is
> also good to run, such as [0].
> 
I will conduct the test as previously mentioned.
> > @@ -986,57 +281,27 @@ struct padata_instance *padata_alloc(const char *name)
> ...
> > +	pinst->serial_wq = alloc_ordered_workqueue ("%s_serial",
> > +									WQ_MEM_RECLAIM | WQ_FREEZABLE,
> > +									name);
> 
> Why add these two WQ_ flags?  Also, whitespace is kinda funky.
You're right, I need to adjust this part of the code.
  
Wang Jinchao Oct. 26, 2023, 1:12 a.m. UTC | #10
On Wed, Oct 25, 2023 at 02:12:31PM -0400, Daniel Jordan wrote:
> On Thu, Sep 28, 2023 at 04:53:38PM +0800, Wang Jinchao wrote:
> > -/**
> > - * struct parallel_data - Internal control structure, covers everything
> > - * that depends on the cpumask in use.
> > - *
> > - * @ps: padata_shell object.
> > - * @reorder_list: percpu reorder lists
> > - * @squeue: percpu padata queues used for serialuzation.
> > - * @refcnt: Number of objects holding a reference on this parallel_data.
> > - * @seq_nr: Sequence number of the parallelized data object.
> > - * @processed: Number of already processed objects.
> > - * @cpu: Next CPU to be processed.
> > - * @cpumask: The cpumasks in use for parallel and serial workers.
> > - * @reorder_work: work struct for reordering.
> > - * @lock: Reorder lock.
> > - */
> > -struct parallel_data {
> > -	struct padata_shell		*ps;
> > -	struct padata_list		__percpu *reorder_list;
> > -	struct padata_serial_queue	__percpu *squeue;
> > -	refcount_t			refcnt;
> > -	unsigned int			seq_nr;
> > -	unsigned int			processed;
> > -	int				cpu;
> > -	struct padata_cpumask		cpumask;
> > -	struct work_struct		reorder_work;
> > -	spinlock_t                      ____cacheline_aligned lock;
> > -};
> 
> reorder_list used to serialize one sequence of objects per padata_shell,
> but now serial_wq serializes all sequences of objects in one list of
> work_structs.  That works in theory, since a total order can maintain
> each sequence's order, but it's possible (not sure yet, need to think
> more) that this could lead to deadlocks or other issues in odd cases
> such as the one that padata_shell was introduced for in bbefa1dd6a6d
> ("crypto: pcrypt - Avoid deadlock by using per-instance padata queues").
> 
Yes, you are correct. This version is not only ordered at the padata_shell
level but at the instance level, which indeed doesn't align with the design.
Apart from potential deadlocks, it may also cause a padata_shell_B that
should have completed earlier to be blocked by an unrelated padata_shell_B.
I will address this issue in subsequent patches.
  
Wang Jinchao Oct. 26, 2023, 1:15 a.m. UTC | #11
On Wed, Oct 25, 2023 at 02:13:43PM -0400, Daniel Jordan wrote:
> On Sat, Oct 07, 2023 at 09:17:59AM +0800, Wang Jinchao wrote:
> > This is just a design idea, because I am not familiar with IPsec, I haven't 
> > tested it in a real network environment yet. 
> 
> Out of curiosity, what's your use case for padata?
When I wrote the rfc_v1 on September 28th, I only used the "pcrypt_aead01" 
test case from LTP. Thank you for you and Steffen's responses. I now have
more test cases, but I haven't tested them yet.
  
Wang Jinchao Oct. 26, 2023, 1:45 a.m. UTC | #12
On Thu, Oct 26, 2023 at 09:12:54AM +0800, Wang Jinchao wrote:
> On Wed, Oct 25, 2023 at 02:12:31PM -0400, Daniel Jordan wrote:
> > On Thu, Sep 28, 2023 at 04:53:38PM +0800, Wang Jinchao wrote:
> > > -/**
> > > - * struct parallel_data - Internal control structure, covers everything
> > > - * that depends on the cpumask in use.
> > > - *
> > > - * @ps: padata_shell object.
> > > - * @reorder_list: percpu reorder lists
> > > - * @squeue: percpu padata queues used for serialuzation.
> > > - * @refcnt: Number of objects holding a reference on this parallel_data.
> > > - * @seq_nr: Sequence number of the parallelized data object.
> > > - * @processed: Number of already processed objects.
> > > - * @cpu: Next CPU to be processed.
> > > - * @cpumask: The cpumasks in use for parallel and serial workers.
> > > - * @reorder_work: work struct for reordering.
> > > - * @lock: Reorder lock.
> > > - */
> > > -struct parallel_data {
> > > -	struct padata_shell		*ps;
> > > -	struct padata_list		__percpu *reorder_list;
> > > -	struct padata_serial_queue	__percpu *squeue;
> > > -	refcount_t			refcnt;
> > > -	unsigned int			seq_nr;
> > > -	unsigned int			processed;
> > > -	int				cpu;
> > > -	struct padata_cpumask		cpumask;
> > > -	struct work_struct		reorder_work;
> > > -	spinlock_t                      ____cacheline_aligned lock;
> > > -};
> > 
> > reorder_list used to serialize one sequence of objects per padata_shell,
> > but now serial_wq serializes all sequences of objects in one list of
> > work_structs.  That works in theory, since a total order can maintain
> > each sequence's order, but it's possible (not sure yet, need to think
> > more) that this could lead to deadlocks or other issues in odd cases
> > such as the one that padata_shell was introduced for in bbefa1dd6a6d
> > ("crypto: pcrypt - Avoid deadlock by using per-instance padata queues").
> > 
> Yes, you are correct. This version is not only ordered at the padata_shell
> level but at the instance level, which indeed doesn't align with the design.
> Apart from potential deadlocks, it may also cause a padata_shell_B that
> should have completed earlier to be blocked by an unrelated padata_shell_B.
> I will address this issue in subsequent patches.
Apart from potential deadlocks, it may also cause a padata_shell_B that
should have completed earlier to be blocked by an unrelated padata_shell_A.
I will address this issue in subsequent patches.
  

Patch

diff --git a/crypto/pcrypt.c b/crypto/pcrypt.c
index 8c1d0ca41213..db9a36c35229 100644
--- a/crypto/pcrypt.c
+++ b/crypto/pcrypt.c
@@ -70,8 +70,6 @@  static void pcrypt_aead_done(void *data, int err)
 	struct padata_priv *padata = pcrypt_request_padata(preq);
 
 	padata->info = err;
-
-	padata_do_serial(padata);
 }
 
 static void pcrypt_aead_enc(struct padata_priv *padata)
@@ -86,7 +84,6 @@  static void pcrypt_aead_enc(struct padata_priv *padata)
 		return;
 
 	padata->info = ret;
-	padata_do_serial(padata);
 }
 
 static int pcrypt_aead_encrypt(struct aead_request *req)
@@ -114,7 +111,7 @@  static int pcrypt_aead_encrypt(struct aead_request *req)
 			       req->cryptlen, req->iv);
 	aead_request_set_ad(creq, req->assoclen);
 
-	err = padata_do_parallel(ictx->psenc, padata, &ctx->cb_cpu);
+	err = padata_do_parallel(ictx->psenc, padata);
 	if (!err)
 		return -EINPROGRESS;
 
@@ -133,7 +130,6 @@  static void pcrypt_aead_dec(struct padata_priv *padata)
 		return;
 
 	padata->info = ret;
-	padata_do_serial(padata);
 }
 
 static int pcrypt_aead_decrypt(struct aead_request *req)
@@ -161,7 +157,7 @@  static int pcrypt_aead_decrypt(struct aead_request *req)
 			       req->cryptlen, req->iv);
 	aead_request_set_ad(creq, req->assoclen);
 
-	err = padata_do_parallel(ictx->psdec, padata, &ctx->cb_cpu);
+	err = padata_do_parallel(ictx->psdec, padata);
 	if (!err)
 		return -EINPROGRESS;
 
@@ -304,31 +300,18 @@  static int pcrypt_create(struct crypto_template *tmpl, struct rtattr **tb)
 	return -EINVAL;
 }
 
-static int pcrypt_sysfs_add(struct padata_instance *pinst, const char *name)
-{
-	int ret;
-
-	pinst->kobj.kset = pcrypt_kset;
-	ret = kobject_add(&pinst->kobj, NULL, "%s", name);
-	if (!ret)
-		kobject_uevent(&pinst->kobj, KOBJ_ADD);
-
-	return ret;
-}
-
 static int pcrypt_init_padata(struct padata_instance **pinst, const char *name)
 {
 	int ret = -ENOMEM;
+	pr_info("%s:%d %s\n", __FILE__, __LINE__, __func__);
 
 	*pinst = padata_alloc(name);
+	pr_info("%s:%d %s\n", __FILE__, __LINE__, __func__);
+	pr_info("%s:%d %s *pinst=%p\n", __FILE__, __LINE__, __func__, *pinst);
+
 	if (!*pinst)
 		return ret;
-
-	ret = pcrypt_sysfs_add(*pinst, name);
-	if (ret)
-		padata_free(*pinst);
-
-	return ret;
+	return 0;
 }
 
 static struct crypto_template pcrypt_tmpl = {
@@ -344,15 +327,18 @@  static int __init pcrypt_init(void)
 	pcrypt_kset = kset_create_and_add("pcrypt", NULL, kernel_kobj);
 	if (!pcrypt_kset)
 		goto err;
+	pr_info("%s:%d %s\n", __FILE__, __LINE__, __func__);
 
 	err = pcrypt_init_padata(&pencrypt, "pencrypt");
 	if (err)
 		goto err_unreg_kset;
 
+	pr_info("%s:%d %s\n", __FILE__, __LINE__, __func__);
 	err = pcrypt_init_padata(&pdecrypt, "pdecrypt");
 	if (err)
 		goto err_deinit_pencrypt;
 
+	pr_info("%s:%d %s\n", __FILE__, __LINE__, __func__);
 	return crypto_register_template(&pcrypt_tmpl);
 
 err_deinit_pencrypt:
diff --git a/include/linux/padata.h b/include/linux/padata.h
index 495b16b6b4d7..ed880ef155c2 100644
--- a/include/linux/padata.h
+++ b/include/linux/padata.h
@@ -19,107 +19,36 @@ 
 #include <linux/list.h>
 #include <linux/kobject.h>
 
-#define PADATA_CPU_SERIAL   0x01
-#define PADATA_CPU_PARALLEL 0x02
 
 /**
  * struct padata_priv - Represents one job
  *
- * @list: List entry, to attach to the padata lists.
- * @pd: Pointer to the internal control structure.
- * @cb_cpu: Callback cpu for serializatioon.
- * @seq_nr: Sequence number of the parallelized data object.
  * @info: Used to pass information from the parallel to the serial function.
+ * @parallel_done: completetion of parallel_work.
+ * @parallel_work: Parallel work.
+ * @serial_work: Serial work.
  * @parallel: Parallel execution function.
  * @serial: Serial complete function.
  */
 struct padata_priv {
-	struct list_head	list;
-	struct parallel_data	*pd;
-	int			cb_cpu;
-	unsigned int		seq_nr;
 	int			info;
+	struct completion parallel_done;
+	struct work_struct	parallel_work;
+	struct work_struct	serial_work;
 	void                    (*parallel)(struct padata_priv *padata);
 	void                    (*serial)(struct padata_priv *padata);
 };
 
-/**
- * struct padata_list - one per work type per CPU
- *
- * @list: List head.
- * @lock: List lock.
- */
-struct padata_list {
-	struct list_head        list;
-	spinlock_t              lock;
-};
-
-/**
-* struct padata_serial_queue - The percpu padata serial queue
-*
-* @serial: List to wait for serialization after reordering.
-* @work: work struct for serialization.
-* @pd: Backpointer to the internal control structure.
-*/
-struct padata_serial_queue {
-       struct padata_list    serial;
-       struct work_struct    work;
-       struct parallel_data *pd;
-};
-
-/**
- * struct padata_cpumask - The cpumasks for the parallel/serial workers
- *
- * @pcpu: cpumask for the parallel workers.
- * @cbcpu: cpumask for the serial (callback) workers.
- */
-struct padata_cpumask {
-	cpumask_var_t	pcpu;
-	cpumask_var_t	cbcpu;
-};
-
-/**
- * struct parallel_data - Internal control structure, covers everything
- * that depends on the cpumask in use.
- *
- * @ps: padata_shell object.
- * @reorder_list: percpu reorder lists
- * @squeue: percpu padata queues used for serialuzation.
- * @refcnt: Number of objects holding a reference on this parallel_data.
- * @seq_nr: Sequence number of the parallelized data object.
- * @processed: Number of already processed objects.
- * @cpu: Next CPU to be processed.
- * @cpumask: The cpumasks in use for parallel and serial workers.
- * @reorder_work: work struct for reordering.
- * @lock: Reorder lock.
- */
-struct parallel_data {
-	struct padata_shell		*ps;
-	struct padata_list		__percpu *reorder_list;
-	struct padata_serial_queue	__percpu *squeue;
-	refcount_t			refcnt;
-	unsigned int			seq_nr;
-	unsigned int			processed;
-	int				cpu;
-	struct padata_cpumask		cpumask;
-	struct work_struct		reorder_work;
-	spinlock_t                      ____cacheline_aligned lock;
-};
-
 /**
  * struct padata_shell - Wrapper around struct parallel_data, its
  * purpose is to allow the underlying control structure to be replaced
  * on the fly using RCU.
  *
  * @pinst: padat instance.
- * @pd: Actual parallel_data structure which may be substituted on the fly.
- * @opd: Pointer to old pd to be freed by padata_replace.
  * @list: List entry in padata_instance list.
  */
 struct padata_shell {
 	struct padata_instance		*pinst;
-	struct parallel_data __rcu	*pd;
-	struct parallel_data		*opd;
 	struct list_head		list;
 };
 
@@ -151,24 +80,17 @@  struct padata_mt_job {
 /**
  * struct padata_instance - The overall control structure.
  *
- * @cpu_online_node: Linkage for CPU online callback.
- * @cpu_dead_node: Linkage for CPU offline callback.
  * @parallel_wq: The workqueue used for parallel work.
  * @serial_wq: The workqueue used for serial work.
  * @pslist: List of padata_shell objects attached to this instance.
- * @cpumask: User supplied cpumasks for parallel and serial works.
  * @kobj: padata instance kernel object.
  * @lock: padata instance lock.
  * @flags: padata flags.
  */
 struct padata_instance {
-	struct hlist_node		cpu_online_node;
-	struct hlist_node		cpu_dead_node;
 	struct workqueue_struct		*parallel_wq;
 	struct workqueue_struct		*serial_wq;
 	struct list_head		pslist;
-	struct padata_cpumask		cpumask;
-	struct kobject                   kobj;
 	struct mutex			 lock;
 	u8				 flags;
 #define	PADATA_INIT	1
@@ -187,9 +109,6 @@  extern void padata_free(struct padata_instance *pinst);
 extern struct padata_shell *padata_alloc_shell(struct padata_instance *pinst);
 extern void padata_free_shell(struct padata_shell *ps);
 extern int padata_do_parallel(struct padata_shell *ps,
-			      struct padata_priv *padata, int *cb_cpu);
-extern void padata_do_serial(struct padata_priv *padata);
+			      struct padata_priv *padata);
 extern void __init padata_do_multithreaded(struct padata_mt_job *job);
-extern int padata_set_cpumask(struct padata_instance *pinst, int cpumask_type,
-			      cpumask_var_t cpumask);
 #endif
diff --git a/kernel/padata.c b/kernel/padata.c
index 222d60195de6..ec80e2b9556d 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -13,9 +13,7 @@ 
 
 #include <linux/completion.h>
 #include <linux/export.h>
-#include <linux/cpumask.h>
 #include <linux/err.h>
-#include <linux/cpu.h>
 #include <linux/padata.h>
 #include <linux/mutex.h>
 #include <linux/sched.h>
@@ -44,31 +42,8 @@  struct padata_mt_job_state {
 	unsigned long		chunk_size;
 };
 
-static void padata_free_pd(struct parallel_data *pd);
 static void __init padata_mt_helper(struct work_struct *work);
 
-static int padata_index_to_cpu(struct parallel_data *pd, int cpu_index)
-{
-	int cpu, target_cpu;
-
-	target_cpu = cpumask_first(pd->cpumask.pcpu);
-	for (cpu = 0; cpu < cpu_index; cpu++)
-		target_cpu = cpumask_next(target_cpu, pd->cpumask.pcpu);
-
-	return target_cpu;
-}
-
-static int padata_cpu_hash(struct parallel_data *pd, unsigned int seq_nr)
-{
-	/*
-	 * Hash the sequence numbers to the cpus by taking
-	 * seq_nr mod. number of cpus in use.
-	 */
-	int cpu_index = seq_nr % cpumask_weight(pd->cpumask.pcpu);
-
-	return padata_index_to_cpu(pd, cpu_index);
-}
-
 static struct padata_work *padata_work_alloc(void)
 {
 	struct padata_work *pw;
@@ -144,27 +119,18 @@  static void __init padata_works_free(struct list_head *works)
 
 static void padata_parallel_worker(struct work_struct *parallel_work)
 {
-	struct padata_work *pw = container_of(parallel_work, struct padata_work,
-					      pw_work);
-	struct padata_priv *padata = pw->pw_data;
-
-	local_bh_disable();
+	struct padata_priv *padata = container_of(parallel_work, struct padata_priv, parallel_work);						  
 	padata->parallel(padata);
-	spin_lock(&padata_works_lock);
-	padata_work_free(pw);
-	spin_unlock(&padata_works_lock);
-	local_bh_enable();
+	complete(&padata->parallel_done);
 }
 
+static void padata_serial_worker(struct work_struct *serial_work);
+
 /**
  * padata_do_parallel - padata parallelization function
  *
  * @ps: padatashell
  * @padata: object to be parallelized
- * @cb_cpu: pointer to the CPU that the serialization callback function should
- *          run on.  If it's not in the serial cpumask of @pinst
- *          (i.e. cpumask.cbcpu), this function selects a fallback CPU and if
- *          none found, returns -EINVAL.
  *
  * The parallelization callback function will run with BHs off.
  * Note: Every object which is parallelized by padata_do_parallel
@@ -173,269 +139,36 @@  static void padata_parallel_worker(struct work_struct *parallel_work)
  * Return: 0 on success or else negative error code.
  */
 int padata_do_parallel(struct padata_shell *ps,
-		       struct padata_priv *padata, int *cb_cpu)
+		       struct padata_priv *padata)
 {
 	struct padata_instance *pinst = ps->pinst;
-	int i, cpu, cpu_index, err;
-	struct parallel_data *pd;
-	struct padata_work *pw;
-
-	rcu_read_lock_bh();
-
-	pd = rcu_dereference_bh(ps->pd);
+	int err;
 
 	err = -EINVAL;
 	if (!(pinst->flags & PADATA_INIT) || pinst->flags & PADATA_INVALID)
 		goto out;
 
-	if (!cpumask_test_cpu(*cb_cpu, pd->cpumask.cbcpu)) {
-		if (cpumask_empty(pd->cpumask.cbcpu))
-			goto out;
-
-		/* Select an alternate fallback CPU and notify the caller. */
-		cpu_index = *cb_cpu % cpumask_weight(pd->cpumask.cbcpu);
-
-		cpu = cpumask_first(pd->cpumask.cbcpu);
-		for (i = 0; i < cpu_index; i++)
-			cpu = cpumask_next(cpu, pd->cpumask.cbcpu);
-
-		*cb_cpu = cpu;
-	}
-
 	err =  -EBUSY;
 	if ((pinst->flags & PADATA_RESET))
 		goto out;
 
-	refcount_inc(&pd->refcnt);
-	padata->pd = pd;
-	padata->cb_cpu = *cb_cpu;
-
-	spin_lock(&padata_works_lock);
-	padata->seq_nr = ++pd->seq_nr;
-	pw = padata_work_alloc();
-	spin_unlock(&padata_works_lock);
-
-	if (!pw) {
-		/* Maximum works limit exceeded, run in the current task. */
-		padata->parallel(padata);
-	}
-
-	rcu_read_unlock_bh();
-
-	if (pw) {
-		padata_work_init(pw, padata_parallel_worker, padata, 0);
-		queue_work(pinst->parallel_wq, &pw->pw_work);
-	}
+	INIT_WORK(&padata->parallel_work, padata_parallel_worker);
+	INIT_WORK(&padata->serial_work, padata_serial_worker);
+	init_completion(&padata->parallel_done);
+	queue_work(pinst->serial_wq, &padata->serial_work);
+	queue_work(pinst->parallel_wq, &padata->parallel_work);
 
 	return 0;
 out:
-	rcu_read_unlock_bh();
-
 	return err;
 }
 EXPORT_SYMBOL(padata_do_parallel);
 
-/*
- * padata_find_next - Find the next object that needs serialization.
- *
- * Return:
- * * A pointer to the control struct of the next object that needs
- *   serialization, if present in one of the percpu reorder queues.
- * * NULL, if the next object that needs serialization will
- *   be parallel processed by another cpu and is not yet present in
- *   the cpu's reorder queue.
- */
-static struct padata_priv *padata_find_next(struct parallel_data *pd,
-					    bool remove_object)
-{
-	struct padata_priv *padata;
-	struct padata_list *reorder;
-	int cpu = pd->cpu;
-
-	reorder = per_cpu_ptr(pd->reorder_list, cpu);
-
-	spin_lock(&reorder->lock);
-	if (list_empty(&reorder->list)) {
-		spin_unlock(&reorder->lock);
-		return NULL;
-	}
-
-	padata = list_entry(reorder->list.next, struct padata_priv, list);
-
-	/*
-	 * Checks the rare case where two or more parallel jobs have hashed to
-	 * the same CPU and one of the later ones finishes first.
-	 */
-	if (padata->seq_nr != pd->processed) {
-		spin_unlock(&reorder->lock);
-		return NULL;
-	}
-
-	if (remove_object) {
-		list_del_init(&padata->list);
-		++pd->processed;
-		pd->cpu = cpumask_next_wrap(cpu, pd->cpumask.pcpu, -1, false);
-	}
-
-	spin_unlock(&reorder->lock);
-	return padata;
-}
-
-static void padata_reorder(struct parallel_data *pd)
-{
-	struct padata_instance *pinst = pd->ps->pinst;
-	int cb_cpu;
-	struct padata_priv *padata;
-	struct padata_serial_queue *squeue;
-	struct padata_list *reorder;
-
-	/*
-	 * We need to ensure that only one cpu can work on dequeueing of
-	 * the reorder queue the time. Calculating in which percpu reorder
-	 * queue the next object will arrive takes some time. A spinlock
-	 * would be highly contended. Also it is not clear in which order
-	 * the objects arrive to the reorder queues. So a cpu could wait to
-	 * get the lock just to notice that there is nothing to do at the
-	 * moment. Therefore we use a trylock and let the holder of the lock
-	 * care for all the objects enqueued during the holdtime of the lock.
-	 */
-	if (!spin_trylock_bh(&pd->lock))
-		return;
-
-	while (1) {
-		padata = padata_find_next(pd, true);
-
-		/*
-		 * If the next object that needs serialization is parallel
-		 * processed by another cpu and is still on it's way to the
-		 * cpu's reorder queue, nothing to do for now.
-		 */
-		if (!padata)
-			break;
-
-		cb_cpu = padata->cb_cpu;
-		squeue = per_cpu_ptr(pd->squeue, cb_cpu);
-
-		spin_lock(&squeue->serial.lock);
-		list_add_tail(&padata->list, &squeue->serial.list);
-		spin_unlock(&squeue->serial.lock);
-
-		queue_work_on(cb_cpu, pinst->serial_wq, &squeue->work);
-	}
-
-	spin_unlock_bh(&pd->lock);
-
-	/*
-	 * The next object that needs serialization might have arrived to
-	 * the reorder queues in the meantime.
-	 *
-	 * Ensure reorder queue is read after pd->lock is dropped so we see
-	 * new objects from another task in padata_do_serial.  Pairs with
-	 * smp_mb in padata_do_serial.
-	 */
-	smp_mb();
-
-	reorder = per_cpu_ptr(pd->reorder_list, pd->cpu);
-	if (!list_empty(&reorder->list) && padata_find_next(pd, false))
-		queue_work(pinst->serial_wq, &pd->reorder_work);
-}
-
-static void invoke_padata_reorder(struct work_struct *work)
-{
-	struct parallel_data *pd;
-
-	local_bh_disable();
-	pd = container_of(work, struct parallel_data, reorder_work);
-	padata_reorder(pd);
-	local_bh_enable();
-}
-
 static void padata_serial_worker(struct work_struct *serial_work)
 {
-	struct padata_serial_queue *squeue;
-	struct parallel_data *pd;
-	LIST_HEAD(local_list);
-	int cnt;
-
-	local_bh_disable();
-	squeue = container_of(serial_work, struct padata_serial_queue, work);
-	pd = squeue->pd;
-
-	spin_lock(&squeue->serial.lock);
-	list_replace_init(&squeue->serial.list, &local_list);
-	spin_unlock(&squeue->serial.lock);
-
-	cnt = 0;
-
-	while (!list_empty(&local_list)) {
-		struct padata_priv *padata;
-
-		padata = list_entry(local_list.next,
-				    struct padata_priv, list);
-
-		list_del_init(&padata->list);
-
-		padata->serial(padata);
-		cnt++;
-	}
-	local_bh_enable();
-
-	if (refcount_sub_and_test(cnt, &pd->refcnt))
-		padata_free_pd(pd);
-}
-
-/**
- * padata_do_serial - padata serialization function
- *
- * @padata: object to be serialized.
- *
- * padata_do_serial must be called for every parallelized object.
- * The serialization callback function will run with BHs off.
- */
-void padata_do_serial(struct padata_priv *padata)
-{
-	struct parallel_data *pd = padata->pd;
-	int hashed_cpu = padata_cpu_hash(pd, padata->seq_nr);
-	struct padata_list *reorder = per_cpu_ptr(pd->reorder_list, hashed_cpu);
-	struct padata_priv *cur;
-	struct list_head *pos;
-
-	spin_lock(&reorder->lock);
-	/* Sort in ascending order of sequence number. */
-	list_for_each_prev(pos, &reorder->list) {
-		cur = list_entry(pos, struct padata_priv, list);
-		if (cur->seq_nr < padata->seq_nr)
-			break;
-	}
-	list_add(&padata->list, pos);
-	spin_unlock(&reorder->lock);
-
-	/*
-	 * Ensure the addition to the reorder list is ordered correctly
-	 * with the trylock of pd->lock in padata_reorder.  Pairs with smp_mb
-	 * in padata_reorder.
-	 */
-	smp_mb();
-
-	padata_reorder(pd);
-}
-EXPORT_SYMBOL(padata_do_serial);
-
-static int padata_setup_cpumasks(struct padata_instance *pinst)
-{
-	struct workqueue_attrs *attrs;
-	int err;
-
-	attrs = alloc_workqueue_attrs();
-	if (!attrs)
-		return -ENOMEM;
-
-	/* Restrict parallel_wq workers to pd->cpumask.pcpu. */
-	cpumask_copy(attrs->cpumask, pinst->cpumask.pcpu);
-	err = apply_workqueue_attrs(pinst->parallel_wq, attrs);
-	free_workqueue_attrs(attrs);
-
-	return err;
+	struct padata_priv *padata = container_of(serial_work, struct padata_priv, serial_work);
+	wait_for_completion(&padata->parallel_done);
+	padata->serial(padata);
 }
 
 static void __init padata_mt_helper(struct work_struct *w)
@@ -530,449 +263,11 @@  void __init padata_do_multithreaded(struct padata_mt_job *job)
 	padata_works_free(&works);
 }
 
-static void __padata_list_init(struct padata_list *pd_list)
-{
-	INIT_LIST_HEAD(&pd_list->list);
-	spin_lock_init(&pd_list->lock);
-}
-
-/* Initialize all percpu queues used by serial workers */
-static void padata_init_squeues(struct parallel_data *pd)
-{
-	int cpu;
-	struct padata_serial_queue *squeue;
-
-	for_each_cpu(cpu, pd->cpumask.cbcpu) {
-		squeue = per_cpu_ptr(pd->squeue, cpu);
-		squeue->pd = pd;
-		__padata_list_init(&squeue->serial);
-		INIT_WORK(&squeue->work, padata_serial_worker);
-	}
-}
-
-/* Initialize per-CPU reorder lists */
-static void padata_init_reorder_list(struct parallel_data *pd)
-{
-	int cpu;
-	struct padata_list *list;
-
-	for_each_cpu(cpu, pd->cpumask.pcpu) {
-		list = per_cpu_ptr(pd->reorder_list, cpu);
-		__padata_list_init(list);
-	}
-}
-
-/* Allocate and initialize the internal cpumask dependend resources. */
-static struct parallel_data *padata_alloc_pd(struct padata_shell *ps)
-{
-	struct padata_instance *pinst = ps->pinst;
-	struct parallel_data *pd;
-
-	pd = kzalloc(sizeof(struct parallel_data), GFP_KERNEL);
-	if (!pd)
-		goto err;
-
-	pd->reorder_list = alloc_percpu(struct padata_list);
-	if (!pd->reorder_list)
-		goto err_free_pd;
-
-	pd->squeue = alloc_percpu(struct padata_serial_queue);
-	if (!pd->squeue)
-		goto err_free_reorder_list;
-
-	pd->ps = ps;
-
-	if (!alloc_cpumask_var(&pd->cpumask.pcpu, GFP_KERNEL))
-		goto err_free_squeue;
-	if (!alloc_cpumask_var(&pd->cpumask.cbcpu, GFP_KERNEL))
-		goto err_free_pcpu;
-
-	cpumask_and(pd->cpumask.pcpu, pinst->cpumask.pcpu, cpu_online_mask);
-	cpumask_and(pd->cpumask.cbcpu, pinst->cpumask.cbcpu, cpu_online_mask);
-
-	padata_init_reorder_list(pd);
-	padata_init_squeues(pd);
-	pd->seq_nr = -1;
-	refcount_set(&pd->refcnt, 1);
-	spin_lock_init(&pd->lock);
-	pd->cpu = cpumask_first(pd->cpumask.pcpu);
-	INIT_WORK(&pd->reorder_work, invoke_padata_reorder);
-
-	return pd;
-
-err_free_pcpu:
-	free_cpumask_var(pd->cpumask.pcpu);
-err_free_squeue:
-	free_percpu(pd->squeue);
-err_free_reorder_list:
-	free_percpu(pd->reorder_list);
-err_free_pd:
-	kfree(pd);
-err:
-	return NULL;
-}
-
-static void padata_free_pd(struct parallel_data *pd)
-{
-	free_cpumask_var(pd->cpumask.pcpu);
-	free_cpumask_var(pd->cpumask.cbcpu);
-	free_percpu(pd->reorder_list);
-	free_percpu(pd->squeue);
-	kfree(pd);
-}
-
 static void __padata_start(struct padata_instance *pinst)
 {
 	pinst->flags |= PADATA_INIT;
 }
 
-static void __padata_stop(struct padata_instance *pinst)
-{
-	if (!(pinst->flags & PADATA_INIT))
-		return;
-
-	pinst->flags &= ~PADATA_INIT;
-
-	synchronize_rcu();
-}
-
-/* Replace the internal control structure with a new one. */
-static int padata_replace_one(struct padata_shell *ps)
-{
-	struct parallel_data *pd_new;
-
-	pd_new = padata_alloc_pd(ps);
-	if (!pd_new)
-		return -ENOMEM;
-
-	ps->opd = rcu_dereference_protected(ps->pd, 1);
-	rcu_assign_pointer(ps->pd, pd_new);
-
-	return 0;
-}
-
-static int padata_replace(struct padata_instance *pinst)
-{
-	struct padata_shell *ps;
-	int err = 0;
-
-	pinst->flags |= PADATA_RESET;
-
-	list_for_each_entry(ps, &pinst->pslist, list) {
-		err = padata_replace_one(ps);
-		if (err)
-			break;
-	}
-
-	synchronize_rcu();
-
-	list_for_each_entry_continue_reverse(ps, &pinst->pslist, list)
-		if (refcount_dec_and_test(&ps->opd->refcnt))
-			padata_free_pd(ps->opd);
-
-	pinst->flags &= ~PADATA_RESET;
-
-	return err;
-}
-
-/* If cpumask contains no active cpu, we mark the instance as invalid. */
-static bool padata_validate_cpumask(struct padata_instance *pinst,
-				    const struct cpumask *cpumask)
-{
-	if (!cpumask_intersects(cpumask, cpu_online_mask)) {
-		pinst->flags |= PADATA_INVALID;
-		return false;
-	}
-
-	pinst->flags &= ~PADATA_INVALID;
-	return true;
-}
-
-static int __padata_set_cpumasks(struct padata_instance *pinst,
-				 cpumask_var_t pcpumask,
-				 cpumask_var_t cbcpumask)
-{
-	int valid;
-	int err;
-
-	valid = padata_validate_cpumask(pinst, pcpumask);
-	if (!valid) {
-		__padata_stop(pinst);
-		goto out_replace;
-	}
-
-	valid = padata_validate_cpumask(pinst, cbcpumask);
-	if (!valid)
-		__padata_stop(pinst);
-
-out_replace:
-	cpumask_copy(pinst->cpumask.pcpu, pcpumask);
-	cpumask_copy(pinst->cpumask.cbcpu, cbcpumask);
-
-	err = padata_setup_cpumasks(pinst) ?: padata_replace(pinst);
-
-	if (valid)
-		__padata_start(pinst);
-
-	return err;
-}
-
-/**
- * padata_set_cpumask - Sets specified by @cpumask_type cpumask to the value
- *                      equivalent to @cpumask.
- * @pinst: padata instance
- * @cpumask_type: PADATA_CPU_SERIAL or PADATA_CPU_PARALLEL corresponding
- *                to parallel and serial cpumasks respectively.
- * @cpumask: the cpumask to use
- *
- * Return: 0 on success or negative error code
- */
-int padata_set_cpumask(struct padata_instance *pinst, int cpumask_type,
-		       cpumask_var_t cpumask)
-{
-	struct cpumask *serial_mask, *parallel_mask;
-	int err = -EINVAL;
-
-	cpus_read_lock();
-	mutex_lock(&pinst->lock);
-
-	switch (cpumask_type) {
-	case PADATA_CPU_PARALLEL:
-		serial_mask = pinst->cpumask.cbcpu;
-		parallel_mask = cpumask;
-		break;
-	case PADATA_CPU_SERIAL:
-		parallel_mask = pinst->cpumask.pcpu;
-		serial_mask = cpumask;
-		break;
-	default:
-		 goto out;
-	}
-
-	err =  __padata_set_cpumasks(pinst, parallel_mask, serial_mask);
-
-out:
-	mutex_unlock(&pinst->lock);
-	cpus_read_unlock();
-
-	return err;
-}
-EXPORT_SYMBOL(padata_set_cpumask);
-
-#ifdef CONFIG_HOTPLUG_CPU
-
-static int __padata_add_cpu(struct padata_instance *pinst, int cpu)
-{
-	int err = 0;
-
-	if (cpumask_test_cpu(cpu, cpu_online_mask)) {
-		err = padata_replace(pinst);
-
-		if (padata_validate_cpumask(pinst, pinst->cpumask.pcpu) &&
-		    padata_validate_cpumask(pinst, pinst->cpumask.cbcpu))
-			__padata_start(pinst);
-	}
-
-	return err;
-}
-
-static int __padata_remove_cpu(struct padata_instance *pinst, int cpu)
-{
-	int err = 0;
-
-	if (!cpumask_test_cpu(cpu, cpu_online_mask)) {
-		if (!padata_validate_cpumask(pinst, pinst->cpumask.pcpu) ||
-		    !padata_validate_cpumask(pinst, pinst->cpumask.cbcpu))
-			__padata_stop(pinst);
-
-		err = padata_replace(pinst);
-	}
-
-	return err;
-}
-
-static inline int pinst_has_cpu(struct padata_instance *pinst, int cpu)
-{
-	return cpumask_test_cpu(cpu, pinst->cpumask.pcpu) ||
-		cpumask_test_cpu(cpu, pinst->cpumask.cbcpu);
-}
-
-static int padata_cpu_online(unsigned int cpu, struct hlist_node *node)
-{
-	struct padata_instance *pinst;
-	int ret;
-
-	pinst = hlist_entry_safe(node, struct padata_instance, cpu_online_node);
-	if (!pinst_has_cpu(pinst, cpu))
-		return 0;
-
-	mutex_lock(&pinst->lock);
-	ret = __padata_add_cpu(pinst, cpu);
-	mutex_unlock(&pinst->lock);
-	return ret;
-}
-
-static int padata_cpu_dead(unsigned int cpu, struct hlist_node *node)
-{
-	struct padata_instance *pinst;
-	int ret;
-
-	pinst = hlist_entry_safe(node, struct padata_instance, cpu_dead_node);
-	if (!pinst_has_cpu(pinst, cpu))
-		return 0;
-
-	mutex_lock(&pinst->lock);
-	ret = __padata_remove_cpu(pinst, cpu);
-	mutex_unlock(&pinst->lock);
-	return ret;
-}
-
-static enum cpuhp_state hp_online;
-#endif
-
-static void __padata_free(struct padata_instance *pinst)
-{
-#ifdef CONFIG_HOTPLUG_CPU
-	cpuhp_state_remove_instance_nocalls(CPUHP_PADATA_DEAD,
-					    &pinst->cpu_dead_node);
-	cpuhp_state_remove_instance_nocalls(hp_online, &pinst->cpu_online_node);
-#endif
-
-	WARN_ON(!list_empty(&pinst->pslist));
-
-	free_cpumask_var(pinst->cpumask.pcpu);
-	free_cpumask_var(pinst->cpumask.cbcpu);
-	destroy_workqueue(pinst->serial_wq);
-	destroy_workqueue(pinst->parallel_wq);
-	kfree(pinst);
-}
-
-#define kobj2pinst(_kobj)					\
-	container_of(_kobj, struct padata_instance, kobj)
-#define attr2pentry(_attr)					\
-	container_of(_attr, struct padata_sysfs_entry, attr)
-
-static void padata_sysfs_release(struct kobject *kobj)
-{
-	struct padata_instance *pinst = kobj2pinst(kobj);
-	__padata_free(pinst);
-}
-
-struct padata_sysfs_entry {
-	struct attribute attr;
-	ssize_t (*show)(struct padata_instance *, struct attribute *, char *);
-	ssize_t (*store)(struct padata_instance *, struct attribute *,
-			 const char *, size_t);
-};
-
-static ssize_t show_cpumask(struct padata_instance *pinst,
-			    struct attribute *attr,  char *buf)
-{
-	struct cpumask *cpumask;
-	ssize_t len;
-
-	mutex_lock(&pinst->lock);
-	if (!strcmp(attr->name, "serial_cpumask"))
-		cpumask = pinst->cpumask.cbcpu;
-	else
-		cpumask = pinst->cpumask.pcpu;
-
-	len = snprintf(buf, PAGE_SIZE, "%*pb\n",
-		       nr_cpu_ids, cpumask_bits(cpumask));
-	mutex_unlock(&pinst->lock);
-	return len < PAGE_SIZE ? len : -EINVAL;
-}
-
-static ssize_t store_cpumask(struct padata_instance *pinst,
-			     struct attribute *attr,
-			     const char *buf, size_t count)
-{
-	cpumask_var_t new_cpumask;
-	ssize_t ret;
-	int mask_type;
-
-	if (!alloc_cpumask_var(&new_cpumask, GFP_KERNEL))
-		return -ENOMEM;
-
-	ret = bitmap_parse(buf, count, cpumask_bits(new_cpumask),
-			   nr_cpumask_bits);
-	if (ret < 0)
-		goto out;
-
-	mask_type = !strcmp(attr->name, "serial_cpumask") ?
-		PADATA_CPU_SERIAL : PADATA_CPU_PARALLEL;
-	ret = padata_set_cpumask(pinst, mask_type, new_cpumask);
-	if (!ret)
-		ret = count;
-
-out:
-	free_cpumask_var(new_cpumask);
-	return ret;
-}
-
-#define PADATA_ATTR_RW(_name, _show_name, _store_name)		\
-	static struct padata_sysfs_entry _name##_attr =		\
-		__ATTR(_name, 0644, _show_name, _store_name)
-#define PADATA_ATTR_RO(_name, _show_name)		\
-	static struct padata_sysfs_entry _name##_attr = \
-		__ATTR(_name, 0400, _show_name, NULL)
-
-PADATA_ATTR_RW(serial_cpumask, show_cpumask, store_cpumask);
-PADATA_ATTR_RW(parallel_cpumask, show_cpumask, store_cpumask);
-
-/*
- * Padata sysfs provides the following objects:
- * serial_cpumask   [RW] - cpumask for serial workers
- * parallel_cpumask [RW] - cpumask for parallel workers
- */
-static struct attribute *padata_default_attrs[] = {
-	&serial_cpumask_attr.attr,
-	&parallel_cpumask_attr.attr,
-	NULL,
-};
-ATTRIBUTE_GROUPS(padata_default);
-
-static ssize_t padata_sysfs_show(struct kobject *kobj,
-				 struct attribute *attr, char *buf)
-{
-	struct padata_instance *pinst;
-	struct padata_sysfs_entry *pentry;
-	ssize_t ret = -EIO;
-
-	pinst = kobj2pinst(kobj);
-	pentry = attr2pentry(attr);
-	if (pentry->show)
-		ret = pentry->show(pinst, attr, buf);
-
-	return ret;
-}
-
-static ssize_t padata_sysfs_store(struct kobject *kobj, struct attribute *attr,
-				  const char *buf, size_t count)
-{
-	struct padata_instance *pinst;
-	struct padata_sysfs_entry *pentry;
-	ssize_t ret = -EIO;
-
-	pinst = kobj2pinst(kobj);
-	pentry = attr2pentry(attr);
-	if (pentry->show)
-		ret = pentry->store(pinst, attr, buf, count);
-
-	return ret;
-}
-
-static const struct sysfs_ops padata_sysfs_ops = {
-	.show = padata_sysfs_show,
-	.store = padata_sysfs_store,
-};
-
-static const struct kobj_type padata_attr_type = {
-	.sysfs_ops = &padata_sysfs_ops,
-	.default_groups = padata_default_groups,
-	.release = padata_sysfs_release,
-};
-
 /**
  * padata_alloc - allocate and initialize a padata instance
  * @name: used to identify the instance
@@ -986,57 +281,27 @@  struct padata_instance *padata_alloc(const char *name)
 	pinst = kzalloc(sizeof(struct padata_instance), GFP_KERNEL);
 	if (!pinst)
 		goto err;
-
-	pinst->parallel_wq = alloc_workqueue("%s_parallel", WQ_UNBOUND, 0,
+	pr_info("%s:%d %s\n", __FILE__, __LINE__, __func__);
+	pinst->parallel_wq = alloc_workqueue("%s_parallel", WQ_CPU_INTENSIVE, 0,
 					     name);
 	if (!pinst->parallel_wq)
 		goto err_free_inst;
 
-	cpus_read_lock();
-
-	pinst->serial_wq = alloc_workqueue("%s_serial", WQ_MEM_RECLAIM |
-					   WQ_CPU_INTENSIVE, 1, name);
+	pinst->serial_wq = alloc_ordered_workqueue ("%s_serial",
+									WQ_MEM_RECLAIM | WQ_FREEZABLE,
+									name);
 	if (!pinst->serial_wq)
-		goto err_put_cpus;
-
-	if (!alloc_cpumask_var(&pinst->cpumask.pcpu, GFP_KERNEL))
-		goto err_free_serial_wq;
-	if (!alloc_cpumask_var(&pinst->cpumask.cbcpu, GFP_KERNEL)) {
-		free_cpumask_var(pinst->cpumask.pcpu);
-		goto err_free_serial_wq;
-	}
+		goto err_free_parallel_wq;
 
 	INIT_LIST_HEAD(&pinst->pslist);
 
-	cpumask_copy(pinst->cpumask.pcpu, cpu_possible_mask);
-	cpumask_copy(pinst->cpumask.cbcpu, cpu_possible_mask);
-
-	if (padata_setup_cpumasks(pinst))
-		goto err_free_masks;
-
 	__padata_start(pinst);
 
-	kobject_init(&pinst->kobj, &padata_attr_type);
 	mutex_init(&pinst->lock);
 
-#ifdef CONFIG_HOTPLUG_CPU
-	cpuhp_state_add_instance_nocalls_cpuslocked(hp_online,
-						    &pinst->cpu_online_node);
-	cpuhp_state_add_instance_nocalls_cpuslocked(CPUHP_PADATA_DEAD,
-						    &pinst->cpu_dead_node);
-#endif
-
-	cpus_read_unlock();
-
 	return pinst;
 
-err_free_masks:
-	free_cpumask_var(pinst->cpumask.pcpu);
-	free_cpumask_var(pinst->cpumask.cbcpu);
-err_free_serial_wq:
-	destroy_workqueue(pinst->serial_wq);
-err_put_cpus:
-	cpus_read_unlock();
+err_free_parallel_wq:
 	destroy_workqueue(pinst->parallel_wq);
 err_free_inst:
 	kfree(pinst);
@@ -1052,7 +317,11 @@  EXPORT_SYMBOL(padata_alloc);
  */
 void padata_free(struct padata_instance *pinst)
 {
-	kobject_put(&pinst->kobj);
+	WARN_ON(!list_empty(&pinst->pslist));
+
+	destroy_workqueue(pinst->serial_wq);
+	destroy_workqueue(pinst->parallel_wq);
+	kfree(pinst);
 }
 EXPORT_SYMBOL(padata_free);
 
@@ -1065,7 +334,6 @@  EXPORT_SYMBOL(padata_free);
  */
 struct padata_shell *padata_alloc_shell(struct padata_instance *pinst)
 {
-	struct parallel_data *pd;
 	struct padata_shell *ps;
 
 	ps = kzalloc(sizeof(*ps), GFP_KERNEL);
@@ -1074,22 +342,12 @@  struct padata_shell *padata_alloc_shell(struct padata_instance *pinst)
 
 	ps->pinst = pinst;
 
-	cpus_read_lock();
-	pd = padata_alloc_pd(ps);
-	cpus_read_unlock();
-
-	if (!pd)
-		goto out_free_ps;
-
 	mutex_lock(&pinst->lock);
-	RCU_INIT_POINTER(ps->pd, pd);
 	list_add(&ps->list, &pinst->pslist);
 	mutex_unlock(&pinst->lock);
 
 	return ps;
 
-out_free_ps:
-	kfree(ps);
 out:
 	return NULL;
 }
@@ -1107,7 +365,6 @@  void padata_free_shell(struct padata_shell *ps)
 
 	mutex_lock(&ps->pinst->lock);
 	list_del(&ps->list);
-	padata_free_pd(rcu_dereference_protected(ps->pd, 1));
 	mutex_unlock(&ps->pinst->lock);
 
 	kfree(ps);
@@ -1117,20 +374,6 @@  EXPORT_SYMBOL(padata_free_shell);
 void __init padata_init(void)
 {
 	unsigned int i, possible_cpus;
-#ifdef CONFIG_HOTPLUG_CPU
-	int ret;
-
-	ret = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN, "padata:online",
-				      padata_cpu_online, NULL);
-	if (ret < 0)
-		goto err;
-	hp_online = ret;
-
-	ret = cpuhp_setup_state_multi(CPUHP_PADATA_DEAD, "padata:dead",
-				      NULL, padata_cpu_dead);
-	if (ret < 0)
-		goto remove_online_state;
-#endif
 
 	possible_cpus = num_possible_cpus();
 	padata_works = kmalloc_array(possible_cpus, sizeof(struct padata_work),
@@ -1144,11 +387,5 @@  void __init padata_init(void)
 	return;
 
 remove_dead_state:
-#ifdef CONFIG_HOTPLUG_CPU
-	cpuhp_remove_multi_state(CPUHP_PADATA_DEAD);
-remove_online_state:
-	cpuhp_remove_multi_state(hp_online);
-err:
-#endif
 	pr_warn("padata: initialization failed\n");
 }