From 4f4fd7c5798bbdd5a03a60f6269cf1177fbd11ef Mon Sep 17 00:00:00 2001 From: Nigel Croxon Date: Fri, 29 Mar 2019 10:46:15 -0700 Subject: Don't jump to compute_result state from check_result state Changing state from check_state_check_result to check_state_compute_result not only is unsafe but also doesn't appear to serve a valid purpose. A raid6 check should only be pushing out extra writes if doing repair and a mis-match occurs. The stripe dev management will already try and do repair writes for failing sectors. This patch makes the raid6 check_state_check_result handling work more like raid5's. If somehow too many failures for a check, just quit the check operation for the stripe. When any checks pass, don't try and use check_state_compute_result for a purpose it isn't needed for and is unsafe for. Just mark the stripe as in sync for passing its parity checks and let the stripe dev read/write code and the bad blocks list do their job handling I/O errors. Repro steps from Xiao: These are the steps to reproduce this problem: 1. redefined OPT_MEDIUM_ERR_ADDR to 12000 in scsi_debug.c 2. insmod scsi_debug.ko dev_size_mb=11000 max_luns=1 num_tgts=1 3. mdadm --create /dev/md127 --level=6 --raid-devices=5 /dev/sde1 /dev/sde2 /dev/sde3 /dev/sde5 /dev/sde6 sde is the disk created by scsi_debug 4. echo "2" >/sys/module/scsi_debug/parameters/opts 5. raid-check It panic: [ 4854.730899] md: data-check of RAID array md127 [ 4854.857455] sd 5:0:0:0: [sdr] tag#80 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE [ 4854.859246] sd 5:0:0:0: [sdr] tag#80 Sense Key : Medium Error [current] [ 4854.860694] sd 5:0:0:0: [sdr] tag#80 Add. Sense: Unrecovered read error [ 4854.862207] sd 5:0:0:0: [sdr] tag#80 CDB: Read(10) 28 00 00 00 2d 88 00 04 00 00 [ 4854.864196] print_req_error: critical medium error, dev sdr, sector 11656 flags 0 [ 4854.867409] sd 5:0:0:0: [sdr] tag#100 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE [ 4854.869469] sd 5:0:0:0: [sdr] tag#100 Sense Key : Medium Error [current] [ 4854.871206] sd 5:0:0:0: [sdr] tag#100 Add. Sense: Unrecovered read error [ 4854.872858] sd 5:0:0:0: [sdr] tag#100 CDB: Read(10) 28 00 00 00 2e e0 00 00 08 00 [ 4854.874587] print_req_error: critical medium error, dev sdr, sector 12000 flags 4000 [ 4854.876456] sd 5:0:0:0: [sdr] tag#101 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE [ 4854.878552] sd 5:0:0:0: [sdr] tag#101 Sense Key : Medium Error [current] [ 4854.880278] sd 5:0:0:0: [sdr] tag#101 Add. Sense: Unrecovered read error [ 4854.881846] sd 5:0:0:0: [sdr] tag#101 CDB: Read(10) 28 00 00 00 2e e8 00 00 08 00 [ 4854.883691] print_req_error: critical medium error, dev sdr, sector 12008 flags 4000 [ 4854.893927] sd 5:0:0:0: [sdr] tag#166 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE [ 4854.896002] sd 5:0:0:0: [sdr] tag#166 Sense Key : Medium Error [current] [ 4854.897561] sd 5:0:0:0: [sdr] tag#166 Add. Sense: Unrecovered read error [ 4854.899110] sd 5:0:0:0: [sdr] tag#166 CDB: Read(10) 28 00 00 00 2e e0 00 00 10 00 [ 4854.900989] print_req_error: critical medium error, dev sdr, sector 12000 flags 0 [ 4854.902757] md/raid:md127: read error NOT corrected!! (sector 9952 on sdr1). [ 4854.904375] md/raid:md127: read error NOT corrected!! (sector 9960 on sdr1). [ 4854.906201] ------------[ cut here ]------------ [ 4854.907341] kernel BUG at drivers/md/raid5.c:4190! raid5.c:4190 above is this BUG_ON: handle_parity_checks6() ... BUG_ON(s->uptodate < disks - 1); /* We don't need Q to recover */ Cc: # v3.16+ OriginalAuthor: David Jeffery Cc: Xiao Ni Tested-by: David Jeffery Signed-off-by: David Jeffy Signed-off-by: Nigel Croxon Signed-off-by: Song Liu Signed-off-by: Jens Axboe --- drivers/md/raid5.c | 19 ++++--------------- 1 file changed, 4 insertions(+), 15 deletions(-) (limited to 'drivers/md') diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index c033bfcb209e..364dd2f6fa1b 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -4223,26 +4223,15 @@ static void handle_parity_checks6(struct r5conf *conf, struct stripe_head *sh, case check_state_check_result: sh->check_state = check_state_idle; + if (s->failed > 1) + break; /* handle a successful check operation, if parity is correct * we are done. Otherwise update the mismatch count and repair * parity if !MD_RECOVERY_CHECK */ if (sh->ops.zero_sum_result == 0) { - /* both parities are correct */ - if (!s->failed) - set_bit(STRIPE_INSYNC, &sh->state); - else { - /* in contrast to the raid5 case we can validate - * parity, but still have a failure to write - * back - */ - sh->check_state = check_state_compute_result; - /* Returning at this point means that we may go - * off and bring p and/or q uptodate again so - * we make sure to check zero_sum_result again - * to verify if p or q need writeback - */ - } + /* Any parity checked was correct */ + set_bit(STRIPE_INSYNC, &sh->state); } else { atomic64_add(STRIPE_SECTORS, &conf->mddev->resync_mismatches); if (test_bit(MD_RECOVERY_CHECK, &conf->mddev->recovery)) { -- cgit v1.2.3 From 4bc034d35377196c854236133b07730a777c4aba Mon Sep 17 00:00:00 2001 From: NeilBrown Date: Fri, 29 Mar 2019 10:46:16 -0700 Subject: Revert "MD: fix lock contention for flush bios" This reverts commit 5a409b4f56d50b212334f338cb8465d65550cd85. This patch has two problems. 1/ it make multiple calls to submit_bio() from inside a make_request_fn. The bios thus submitted will be queued on current->bio_list and not submitted immediately. As the bios are allocated from a mempool, this can theoretically result in a deadlock - all the pool of requests could be in various ->bio_list queues and a subsequent mempool_alloc could block waiting for one of them to be released. 2/ It aims to handle a case when there are many concurrent flush requests. It handles this by submitting many requests in parallel - all of which are identical and so most of which do nothing useful. It would be more efficient to just send one lower-level request, but allow that to satisfy multiple upper-level requests. Fixes: 5a409b4f56d5 ("MD: fix lock contention for flush bios") Cc: # v4.19+ Tested-by: Xiao Ni Signed-off-by: NeilBrown Signed-off-by: Song Liu Signed-off-by: Jens Axboe --- drivers/md/md.c | 159 ++++++++++++++++++++------------------------------------ drivers/md/md.h | 22 +++----- 2 files changed, 62 insertions(+), 119 deletions(-) (limited to 'drivers/md') diff --git a/drivers/md/md.c b/drivers/md/md.c index 05ffffb8b769..021e522001c1 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -132,24 +132,6 @@ static inline int speed_max(struct mddev *mddev) mddev->sync_speed_max : sysctl_speed_limit_max; } -static void * flush_info_alloc(gfp_t gfp_flags, void *data) -{ - return kzalloc(sizeof(struct flush_info), gfp_flags); -} -static void flush_info_free(void *flush_info, void *data) -{ - kfree(flush_info); -} - -static void * flush_bio_alloc(gfp_t gfp_flags, void *data) -{ - return kzalloc(sizeof(struct flush_bio), gfp_flags); -} -static void flush_bio_free(void *flush_bio, void *data) -{ - kfree(flush_bio); -} - static struct ctl_table_header *raid_table_header; static struct ctl_table raid_table[] = { @@ -423,54 +405,30 @@ static int md_congested(void *data, int bits) /* * Generic flush handling for md */ -static void submit_flushes(struct work_struct *ws) -{ - struct flush_info *fi = container_of(ws, struct flush_info, flush_work); - struct mddev *mddev = fi->mddev; - struct bio *bio = fi->bio; - - bio->bi_opf &= ~REQ_PREFLUSH; - md_handle_request(mddev, bio); - - mempool_free(fi, mddev->flush_pool); -} -static void md_end_flush(struct bio *fbio) +static void md_end_flush(struct bio *bio) { - struct flush_bio *fb = fbio->bi_private; - struct md_rdev *rdev = fb->rdev; - struct flush_info *fi = fb->fi; - struct bio *bio = fi->bio; - struct mddev *mddev = fi->mddev; + struct md_rdev *rdev = bio->bi_private; + struct mddev *mddev = rdev->mddev; rdev_dec_pending(rdev, mddev); - if (atomic_dec_and_test(&fi->flush_pending)) { - if (bio->bi_iter.bi_size == 0) { - /* an empty barrier - all done */ - bio_endio(bio); - mempool_free(fi, mddev->flush_pool); - } else { - INIT_WORK(&fi->flush_work, submit_flushes); - queue_work(md_wq, &fi->flush_work); - } + if (atomic_dec_and_test(&mddev->flush_pending)) { + /* The pre-request flush has finished */ + queue_work(md_wq, &mddev->flush_work); } - - mempool_free(fb, mddev->flush_bio_pool); - bio_put(fbio); + bio_put(bio); } -void md_flush_request(struct mddev *mddev, struct bio *bio) +static void md_submit_flush_data(struct work_struct *ws); + +static void submit_flushes(struct work_struct *ws) { + struct mddev *mddev = container_of(ws, struct mddev, flush_work); struct md_rdev *rdev; - struct flush_info *fi; - - fi = mempool_alloc(mddev->flush_pool, GFP_NOIO); - - fi->bio = bio; - fi->mddev = mddev; - atomic_set(&fi->flush_pending, 1); + INIT_WORK(&mddev->flush_work, md_submit_flush_data); + atomic_set(&mddev->flush_pending, 1); rcu_read_lock(); rdev_for_each_rcu(rdev, mddev) if (rdev->raid_disk >= 0 && @@ -480,40 +438,59 @@ void md_flush_request(struct mddev *mddev, struct bio *bio) * we reclaim rcu_read_lock */ struct bio *bi; - struct flush_bio *fb; atomic_inc(&rdev->nr_pending); atomic_inc(&rdev->nr_pending); rcu_read_unlock(); - - fb = mempool_alloc(mddev->flush_bio_pool, GFP_NOIO); - fb->fi = fi; - fb->rdev = rdev; - bi = bio_alloc_mddev(GFP_NOIO, 0, mddev); - bio_set_dev(bi, rdev->bdev); bi->bi_end_io = md_end_flush; - bi->bi_private = fb; + bi->bi_private = rdev; + bio_set_dev(bi, rdev->bdev); bi->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH; - - atomic_inc(&fi->flush_pending); + atomic_inc(&mddev->flush_pending); submit_bio(bi); - rcu_read_lock(); rdev_dec_pending(rdev, mddev); } rcu_read_unlock(); + if (atomic_dec_and_test(&mddev->flush_pending)) + queue_work(md_wq, &mddev->flush_work); +} - if (atomic_dec_and_test(&fi->flush_pending)) { - if (bio->bi_iter.bi_size == 0) { - /* an empty barrier - all done */ - bio_endio(bio); - mempool_free(fi, mddev->flush_pool); - } else { - INIT_WORK(&fi->flush_work, submit_flushes); - queue_work(md_wq, &fi->flush_work); - } +static void md_submit_flush_data(struct work_struct *ws) +{ + struct mddev *mddev = container_of(ws, struct mddev, flush_work); + struct bio *bio = mddev->flush_bio; + + /* + * must reset flush_bio before calling into md_handle_request to avoid a + * deadlock, because other bios passed md_handle_request suspend check + * could wait for this and below md_handle_request could wait for those + * bios because of suspend check + */ + mddev->flush_bio = NULL; + wake_up(&mddev->sb_wait); + + if (bio->bi_iter.bi_size == 0) { + /* an empty barrier - all done */ + bio_endio(bio); + } else { + bio->bi_opf &= ~REQ_PREFLUSH; + md_handle_request(mddev, bio); } } + +void md_flush_request(struct mddev *mddev, struct bio *bio) +{ + spin_lock_irq(&mddev->lock); + wait_event_lock_irq(mddev->sb_wait, + !mddev->flush_bio, + mddev->lock); + mddev->flush_bio = bio; + spin_unlock_irq(&mddev->lock); + + INIT_WORK(&mddev->flush_work, submit_flushes); + queue_work(md_wq, &mddev->flush_work); +} EXPORT_SYMBOL(md_flush_request); static inline struct mddev *mddev_get(struct mddev *mddev) @@ -560,6 +537,7 @@ void mddev_init(struct mddev *mddev) atomic_set(&mddev->openers, 0); atomic_set(&mddev->active_io, 0); spin_lock_init(&mddev->lock); + atomic_set(&mddev->flush_pending, 0); init_waitqueue_head(&mddev->sb_wait); init_waitqueue_head(&mddev->recovery_wait); mddev->reshape_position = MaxSector; @@ -5511,22 +5489,6 @@ int md_run(struct mddev *mddev) if (err) return err; } - if (mddev->flush_pool == NULL) { - mddev->flush_pool = mempool_create(NR_FLUSH_INFOS, flush_info_alloc, - flush_info_free, mddev); - if (!mddev->flush_pool) { - err = -ENOMEM; - goto abort; - } - } - if (mddev->flush_bio_pool == NULL) { - mddev->flush_bio_pool = mempool_create(NR_FLUSH_BIOS, flush_bio_alloc, - flush_bio_free, mddev); - if (!mddev->flush_bio_pool) { - err = -ENOMEM; - goto abort; - } - } spin_lock(&pers_lock); pers = find_pers(mddev->level, mddev->clevel); @@ -5686,11 +5648,8 @@ int md_run(struct mddev *mddev) return 0; abort: - mempool_destroy(mddev->flush_bio_pool); - mddev->flush_bio_pool = NULL; - mempool_destroy(mddev->flush_pool); - mddev->flush_pool = NULL; - + bioset_exit(&mddev->bio_set); + bioset_exit(&mddev->sync_set); return err; } EXPORT_SYMBOL_GPL(md_run); @@ -5894,14 +5853,6 @@ static void __md_stop(struct mddev *mddev) mddev->to_remove = &md_redundancy_group; module_put(pers->owner); clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery); - if (mddev->flush_bio_pool) { - mempool_destroy(mddev->flush_bio_pool); - mddev->flush_bio_pool = NULL; - } - if (mddev->flush_pool) { - mempool_destroy(mddev->flush_pool); - mddev->flush_pool = NULL; - } } void md_stop(struct mddev *mddev) diff --git a/drivers/md/md.h b/drivers/md/md.h index c52afb52c776..2deb84fa93f9 100644 --- a/drivers/md/md.h +++ b/drivers/md/md.h @@ -252,19 +252,6 @@ enum mddev_sb_flags { MD_SB_NEED_REWRITE, /* metadata write needs to be repeated */ }; -#define NR_FLUSH_INFOS 8 -#define NR_FLUSH_BIOS 64 -struct flush_info { - struct bio *bio; - struct mddev *mddev; - struct work_struct flush_work; - atomic_t flush_pending; -}; -struct flush_bio { - struct flush_info *fi; - struct md_rdev *rdev; -}; - struct mddev { void *private; struct md_personality *pers; @@ -470,8 +457,13 @@ struct mddev { * metadata and bitmap writes */ - mempool_t *flush_pool; - mempool_t *flush_bio_pool; + /* Generic flush handling. + * The last to finish preflush schedules a worker to submit + * the rest of the request (without the REQ_PREFLUSH flag). + */ + struct bio *flush_bio; + atomic_t flush_pending; + struct work_struct flush_work; struct work_struct event_work; /* used by dm to report failure event */ void (*sync_super)(struct mddev *mddev, struct md_rdev *rdev); struct md_cluster_info *cluster_info; -- cgit v1.2.3 From 2bc13b83e6298486371761de503faeffd15b7534 Mon Sep 17 00:00:00 2001 From: NeilBrown Date: Fri, 29 Mar 2019 10:46:17 -0700 Subject: md: batch flush requests. Currently if many flush requests are submitted to an md device is quick succession, they are serialized and can take a long to process them all. We don't really need to call flush all those times - a single flush call can satisfy all requests submitted before it started. So keep track of when the current flush started and when it finished, allow any pending flush that was requested before the flush started to complete without waiting any more. Test results from Xiao: Test is done on a raid10 device which is created by 4 SSDs. The tool is dbench. 1. The latest linux stable kernel Operation Count AvgLat MaxLat -------------------------------------------------- Deltree 768 10.509 78.305 Flush 2078376 0.013 10.094 Close 21787697 0.019 18.821 LockX 96580 0.007 3.184 Mkdir 384 0.008 0.062 Rename 1255883 0.191 23.534 ReadX 46495589 0.020 14.230 WriteX 14790591 7.123 60.706 Unlink 5989118 0.440 54.551 UnlockX 96580 0.005 2.736 FIND_FIRST 10393845 0.042 12.079 SET_FILE_INFORMATION 2415558 0.129 10.088 QUERY_FILE_INFORMATION 4711725 0.005 8.462 QUERY_PATH_INFORMATION 26883327 0.032 21.715 QUERY_FS_INFORMATION 4929409 0.010 8.238 NTCreateX 29660080 0.100 53.268 Throughput 1034.88 MB/sec (sync open) 128 clients 128 procs max_latency=60.712 ms 2. With patch1 "Revert "MD: fix lock contention for flush bios"" Operation Count AvgLat MaxLat -------------------------------------------------- Deltree 256 8.326 36.761 Flush 693291 3.974 180.269 Close 7266404 0.009 36.929 LockX 32160 0.006 0.840 Mkdir 128 0.008 0.021 Rename 418755 0.063 29.945 ReadX 15498708 0.007 7.216 WriteX 4932310 22.482 267.928 Unlink 1997557 0.109 47.553 UnlockX 32160 0.004 1.110 FIND_FIRST 3465791 0.036 7.320 SET_FILE_INFORMATION 805825 0.015 1.561 QUERY_FILE_INFORMATION 1570950 0.005 2.403 QUERY_PATH_INFORMATION 8965483 0.013 14.277 QUERY_FS_INFORMATION 1643626 0.009 3.314 NTCreateX 9892174 0.061 41.278 Throughput 345.009 MB/sec (sync open) 128 clients 128 procs max_latency=267.939 m 3. With patch1 and patch2 Operation Count AvgLat MaxLat -------------------------------------------------- Deltree 768 9.570 54.588 Flush 2061354 0.666 15.102 Close 21604811 0.012 25.697 LockX 95770 0.007 1.424 Mkdir 384 0.008 0.053 Rename 1245411 0.096 12.263 ReadX 46103198 0.011 12.116 WriteX 14667988 7.375 60.069 Unlink 5938936 0.173 30.905 UnlockX 95770 0.005 4.147 FIND_FIRST 10306407 0.041 11.715 SET_FILE_INFORMATION 2395987 0.048 7.640 QUERY_FILE_INFORMATION 4672371 0.005 9.291 QUERY_PATH_INFORMATION 26656735 0.018 19.719 QUERY_FS_INFORMATION 4887940 0.010 7.654 NTCreateX 29410811 0.059 28.551 Throughput 1026.21 MB/sec (sync open) 128 clients 128 procs max_latency=60.075 ms Cc: # v4.19+ Tested-by: Xiao Ni Signed-off-by: NeilBrown Signed-off-by: Song Liu Signed-off-by: Jens Axboe --- drivers/md/md.c | 27 +++++++++++++++++++++++---- drivers/md/md.h | 3 +++ 2 files changed, 26 insertions(+), 4 deletions(-) (limited to 'drivers/md') diff --git a/drivers/md/md.c b/drivers/md/md.c index 021e522001c1..d0f688399a56 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -427,6 +427,7 @@ static void submit_flushes(struct work_struct *ws) struct mddev *mddev = container_of(ws, struct mddev, flush_work); struct md_rdev *rdev; + mddev->start_flush = ktime_get_boottime(); INIT_WORK(&mddev->flush_work, md_submit_flush_data); atomic_set(&mddev->flush_pending, 1); rcu_read_lock(); @@ -467,6 +468,7 @@ static void md_submit_flush_data(struct work_struct *ws) * could wait for this and below md_handle_request could wait for those * bios because of suspend check */ + mddev->last_flush = mddev->start_flush; mddev->flush_bio = NULL; wake_up(&mddev->sb_wait); @@ -481,15 +483,32 @@ static void md_submit_flush_data(struct work_struct *ws) void md_flush_request(struct mddev *mddev, struct bio *bio) { + ktime_t start = ktime_get_boottime(); spin_lock_irq(&mddev->lock); wait_event_lock_irq(mddev->sb_wait, - !mddev->flush_bio, + !mddev->flush_bio || + ktime_after(mddev->last_flush, start), mddev->lock); - mddev->flush_bio = bio; + if (!ktime_after(mddev->last_flush, start)) { + WARN_ON(mddev->flush_bio); + mddev->flush_bio = bio; + bio = NULL; + } spin_unlock_irq(&mddev->lock); - INIT_WORK(&mddev->flush_work, submit_flushes); - queue_work(md_wq, &mddev->flush_work); + if (!bio) { + INIT_WORK(&mddev->flush_work, submit_flushes); + queue_work(md_wq, &mddev->flush_work); + } else { + /* flush was performed for some other bio while we waited. */ + if (bio->bi_iter.bi_size == 0) + /* an empty barrier - all done */ + bio_endio(bio); + else { + bio->bi_opf &= ~REQ_PREFLUSH; + mddev->pers->make_request(mddev, bio); + } + } } EXPORT_SYMBOL(md_flush_request); diff --git a/drivers/md/md.h b/drivers/md/md.h index 2deb84fa93f9..257cb4c9e22b 100644 --- a/drivers/md/md.h +++ b/drivers/md/md.h @@ -463,6 +463,9 @@ struct mddev { */ struct bio *flush_bio; atomic_t flush_pending; + ktime_t start_flush, last_flush; /* last_flush is when the last completed + * flush was started. + */ struct work_struct flush_work; struct work_struct event_work; /* used by dm to report failure event */ void (*sync_super)(struct mddev *mddev, struct md_rdev *rdev); -- cgit v1.2.3 From 72deb455b5ec619ff043c30bc90025aa3de3cdda Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Fri, 5 Apr 2019 18:08:59 +0200 Subject: block: remove CONFIG_LBDAF Currently support for 64-bit sector_t and blkcnt_t is optional on 32-bit architectures. These types are required to support block device and/or file sizes larger than 2 TiB, and have generally defaulted to on for a long time. Enabling the option only increases the i386 tinyconfig size by 145 bytes, and many data structures already always use 64-bit values for their in-core and on-disk data structures anyway, so there should not be a large change in dynamic memory usage either. Dropping this option removes a somewhat weird non-default config that has cause various bugs or compiler warnings when actually used. Signed-off-by: Christoph Hellwig Signed-off-by: Jens Axboe --- drivers/md/dm-exception-store.h | 28 ++-------------------------- drivers/md/dm-integrity.c | 8 ++------ drivers/md/md.c | 6 ++---- 3 files changed, 6 insertions(+), 36 deletions(-) (limited to 'drivers/md') diff --git a/drivers/md/dm-exception-store.h b/drivers/md/dm-exception-store.h index 12b5216c2cfe..721efc493942 100644 --- a/drivers/md/dm-exception-store.h +++ b/drivers/md/dm-exception-store.h @@ -135,9 +135,8 @@ struct dm_dev *dm_snap_cow(struct dm_snapshot *snap); /* * Funtions to manipulate consecutive chunks */ -# if defined(CONFIG_LBDAF) || (BITS_PER_LONG == 64) -# define DM_CHUNK_CONSECUTIVE_BITS 8 -# define DM_CHUNK_NUMBER_BITS 56 +#define DM_CHUNK_CONSECUTIVE_BITS 8 +#define DM_CHUNK_NUMBER_BITS 56 static inline chunk_t dm_chunk_number(chunk_t chunk) { @@ -163,29 +162,6 @@ static inline void dm_consecutive_chunk_count_dec(struct dm_exception *e) e->new_chunk -= (1ULL << DM_CHUNK_NUMBER_BITS); } -# else -# define DM_CHUNK_CONSECUTIVE_BITS 0 - -static inline chunk_t dm_chunk_number(chunk_t chunk) -{ - return chunk; -} - -static inline unsigned dm_consecutive_chunk_count(struct dm_exception *e) -{ - return 0; -} - -static inline void dm_consecutive_chunk_count_inc(struct dm_exception *e) -{ -} - -static inline void dm_consecutive_chunk_count_dec(struct dm_exception *e) -{ -} - -# endif - /* * Return the number of sectors in the device. */ diff --git a/drivers/md/dm-integrity.c b/drivers/md/dm-integrity.c index d57d997a52c8..0eb56ba89a7f 100644 --- a/drivers/md/dm-integrity.c +++ b/drivers/md/dm-integrity.c @@ -88,14 +88,10 @@ struct journal_entry { #if BITS_PER_LONG == 64 #define journal_entry_set_sector(je, x) do { smp_wmb(); WRITE_ONCE((je)->u.sector, cpu_to_le64(x)); } while (0) -#define journal_entry_get_sector(je) le64_to_cpu((je)->u.sector) -#elif defined(CONFIG_LBDAF) -#define journal_entry_set_sector(je, x) do { (je)->u.s.sector_lo = cpu_to_le32(x); smp_wmb(); WRITE_ONCE((je)->u.s.sector_hi, cpu_to_le32((x) >> 32)); } while (0) -#define journal_entry_get_sector(je) le64_to_cpu((je)->u.sector) #else -#define journal_entry_set_sector(je, x) do { (je)->u.s.sector_lo = cpu_to_le32(x); smp_wmb(); WRITE_ONCE((je)->u.s.sector_hi, cpu_to_le32(0)); } while (0) -#define journal_entry_get_sector(je) le32_to_cpu((je)->u.s.sector_lo) +#define journal_entry_set_sector(je, x) do { (je)->u.s.sector_lo = cpu_to_le32(x); smp_wmb(); WRITE_ONCE((je)->u.s.sector_hi, cpu_to_le32((x) >> 32)); } while (0) #endif +#define journal_entry_get_sector(je) le64_to_cpu((je)->u.sector) #define journal_entry_is_unused(je) ((je)->u.s.sector_hi == cpu_to_le32(-1)) #define journal_entry_set_unused(je) do { ((je)->u.s.sector_hi = cpu_to_le32(-1)); } while (0) #define journal_entry_is_inprogress(je) ((je)->u.s.sector_hi == cpu_to_le32(-2)) diff --git a/drivers/md/md.c b/drivers/md/md.c index d0f688399a56..1fa2682951f1 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -1106,8 +1106,7 @@ static int super_90_load(struct md_rdev *rdev, struct md_rdev *refdev, int minor * (not needed for Linear and RAID0 as metadata doesn't * record this size) */ - if (IS_ENABLED(CONFIG_LBDAF) && (u64)rdev->sectors >= (2ULL << 32) && - sb->level >= 1) + if ((u64)rdev->sectors >= (2ULL << 32) && sb->level >= 1) rdev->sectors = (sector_t)(2ULL << 32) - 2; if (rdev->sectors < ((sector_t)sb->size) * 2 && sb->level >= 1) @@ -1405,8 +1404,7 @@ super_90_rdev_size_change(struct md_rdev *rdev, sector_t num_sectors) /* Limit to 4TB as metadata cannot record more than that. * 4TB == 2^32 KB, or 2*2^32 sectors. */ - if (IS_ENABLED(CONFIG_LBDAF) && (u64)num_sectors >= (2ULL << 32) && - rdev->mddev->level >= 1) + if ((u64)num_sectors >= (2ULL << 32) && rdev->mddev->level >= 1) num_sectors = (sector_t)(2ULL << 32) - 2; do { md_super_write(rdev->mddev, rdev, rdev->sb_start, rdev->sb_size, -- cgit v1.2.3 From ee37e62191a59d253fc916b9fc763deb777211e2 Mon Sep 17 00:00:00 2001 From: Yufen Yu Date: Tue, 2 Apr 2019 14:22:14 +0800 Subject: md: add mddev->pers to avoid potential NULL pointer dereference When doing re-add, we need to ensure rdev->mddev->pers is not NULL, which can avoid potential NULL pointer derefence in fallowing add_bound_rdev(). Fixes: a6da4ef85cef ("md: re-add a failed disk") Cc: Xiao Ni Cc: NeilBrown Cc: # 4.4+ Reviewed-by: NeilBrown Signed-off-by: Yufen Yu Signed-off-by: Song Liu --- drivers/md/md.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) (limited to 'drivers/md') diff --git a/drivers/md/md.c b/drivers/md/md.c index 1fa2682951f1..664b77ceaf2d 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -2850,8 +2850,10 @@ state_store(struct md_rdev *rdev, const char *buf, size_t len) err = 0; } } else if (cmd_match(buf, "re-add")) { - if (test_bit(Faulty, &rdev->flags) && (rdev->raid_disk == -1) && - rdev->saved_raid_disk >= 0) { + if (!rdev->mddev->pers) + err = -EINVAL; + else if (test_bit(Faulty, &rdev->flags) && (rdev->raid_disk == -1) && + rdev->saved_raid_disk >= 0) { /* clear_bit is performed _after_ all the devices * have their local Faulty bit cleared. If any writes * happen in the meantime in the local node, they -- cgit v1.2.3 From ed4d0a4ea11e19863952ac6a7cea3bbb27ccd452 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Thu, 4 Apr 2019 18:56:10 +0200 Subject: md: add a missing endianness conversion in check_sb_changes The on-disk value is little endian and we need to convert it to native endian before storing the value in the in-core structure. Fixes: 7564beda19b36 ("md-cluster/raid10: support add disk under grow mode") Cc: # 4.20+ Acked-by: Guoqing Jiang Signed-off-by: Christoph Hellwig Signed-off-by: Song Liu --- drivers/md/md.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'drivers/md') diff --git a/drivers/md/md.c b/drivers/md/md.c index 664b77ceaf2d..a15eb7e37c6d 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -9227,7 +9227,7 @@ static void check_sb_changes(struct mddev *mddev, struct md_rdev *rdev) * reshape is happening in the remote node, we need to * update reshape_position and call start_reshape. */ - mddev->reshape_position = sb->reshape_position; + mddev->reshape_position = le64_to_cpu(sb->reshape_position); if (mddev->pers->update_reshape_pos) mddev->pers->update_reshape_pos(mddev); if (mddev->pers->start_reshape) -- cgit v1.2.3 From c35403f82ced283807f31eeafc2a7aebf1b20331 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Thu, 4 Apr 2019 18:56:11 +0200 Subject: md: use correct types in md_bitmap_print_sb If we want to convert from a little endian format we need to cast to a little endian type, otherwise sparse will be unhappy. Signed-off-by: Christoph Hellwig Signed-off-by: Song Liu --- drivers/md/md-bitmap.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) (limited to 'drivers/md') diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c index 1cd4f991792c..3a62a46b75c7 100644 --- a/drivers/md/md-bitmap.c +++ b/drivers/md/md-bitmap.c @@ -490,10 +490,10 @@ void md_bitmap_print_sb(struct bitmap *bitmap) pr_debug(" magic: %08x\n", le32_to_cpu(sb->magic)); pr_debug(" version: %d\n", le32_to_cpu(sb->version)); pr_debug(" uuid: %08x.%08x.%08x.%08x\n", - le32_to_cpu(*(__u32 *)(sb->uuid+0)), - le32_to_cpu(*(__u32 *)(sb->uuid+4)), - le32_to_cpu(*(__u32 *)(sb->uuid+8)), - le32_to_cpu(*(__u32 *)(sb->uuid+12))); + le32_to_cpu(*(__le32 *)(sb->uuid+0)), + le32_to_cpu(*(__le32 *)(sb->uuid+4)), + le32_to_cpu(*(__le32 *)(sb->uuid+8)), + le32_to_cpu(*(__le32 *)(sb->uuid+12))); pr_debug(" events: %llu\n", (unsigned long long) le64_to_cpu(sb->events)); pr_debug("events cleared: %llu\n", -- cgit v1.2.3 From 00485d094244db47ae2bf4f738e8d0f8d9f5890f Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Thu, 4 Apr 2019 18:56:12 +0200 Subject: md: use correct type in super_1_load If we want to convert from a little endian format we need to cast to a little endian type, otherwise sparse will be unhappy. Signed-off-by: Christoph Hellwig Signed-off-by: Song Liu --- drivers/md/md.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'drivers/md') diff --git a/drivers/md/md.c b/drivers/md/md.c index a15eb7e37c6d..7a74e2de40b5 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -1548,7 +1548,7 @@ static int super_1_load(struct md_rdev *rdev, struct md_rdev *refdev, int minor_ */ s32 offset; sector_t bb_sector; - u64 *bbp; + __le64 *bbp; int i; int sectors = le16_to_cpu(sb->bblog_size); if (sectors > (PAGE_SIZE / 512)) @@ -1560,7 +1560,7 @@ static int super_1_load(struct md_rdev *rdev, struct md_rdev *refdev, int minor_ if (!sync_page_io(rdev, bb_sector, sectors << 9, rdev->bb_page, REQ_OP_READ, 0, true)) return -EIO; - bbp = (u64 *)page_address(rdev->bb_page); + bbp = (__le64 *)page_address(rdev->bb_page); rdev->badblocks.shift = sb->bblog_shift; for (i = 0 ; i < (sectors << (9-3)) ; i++, bbp++) { u64 bb = le64_to_cpu(*bbp); -- cgit v1.2.3 From ae50640bebc48f1fc0092f16ea004c7c4d12c985 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Thu, 4 Apr 2019 18:56:13 +0200 Subject: md: use correct type in super_1_sync If we want to convert from a little endian format we need to cast to a little endian type, otherwise sparse will be unhappy. Signed-off-by: Christoph Hellwig Signed-off-by: Song Liu --- drivers/md/md.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'drivers/md') diff --git a/drivers/md/md.c b/drivers/md/md.c index 7a74e2de40b5..4a31380b0eff 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -1872,7 +1872,7 @@ static void super_1_sync(struct mddev *mddev, struct md_rdev *rdev) md_error(mddev, rdev); else { struct badblocks *bb = &rdev->badblocks; - u64 *bbp = (u64 *)page_address(rdev->bb_page); + __le64 *bbp = (__le64 *)page_address(rdev->bb_page); u64 *p = bb->page; sb->feature_map |= cpu_to_le32(MD_FEATURE_BAD_BLOCKS); if (bb->changed) { -- cgit v1.2.3 From 2b598ee54a1e50323143a613aa29eb40377d8792 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Thu, 4 Apr 2019 18:56:14 +0200 Subject: md: mark md_cluster_mod static Sparse complains that it has no external declaration, and it turns out that it is never even used outside of md.c. So just mark it static and drop the export. Acked-by: Guoqing Jiang Signed-off-by: Christoph Hellwig Signed-off-by: Song Liu --- drivers/md/md.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) (limited to 'drivers/md') diff --git a/drivers/md/md.c b/drivers/md/md.c index 4a31380b0eff..541015373f6a 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -88,8 +88,7 @@ static struct kobj_type md_ktype; struct md_cluster_operations *md_cluster_ops; EXPORT_SYMBOL(md_cluster_ops); -struct module *md_cluster_mod; -EXPORT_SYMBOL(md_cluster_mod); +static struct module *md_cluster_mod; static DECLARE_WAIT_QUEUE_HEAD(resync_wait); static struct workqueue_struct *md_wq; -- cgit v1.2.3 From 368ecade0532982c8916a49e66b8105413d8db59 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Thu, 4 Apr 2019 18:56:15 +0200 Subject: md: add __acquires/__releases annotations to (un)lock_two_stripes This tells sparse that we acquire/release the two stripe locks and avoids a warning. Signed-off-by: Christoph Hellwig Signed-off-by: Song Liu --- drivers/md/raid5.c | 4 ++++ 1 file changed, 4 insertions(+) (limited to 'drivers/md') diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index 364dd2f6fa1b..d794d8745144 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -711,6 +711,8 @@ static bool is_full_stripe_write(struct stripe_head *sh) } static void lock_two_stripes(struct stripe_head *sh1, struct stripe_head *sh2) + __acquires(&sh1->stripe_lock) + __acquires(&sh2->stripe_lock) { if (sh1 > sh2) { spin_lock_irq(&sh2->stripe_lock); @@ -722,6 +724,8 @@ static void lock_two_stripes(struct stripe_head *sh1, struct stripe_head *sh2) } static void unlock_two_stripes(struct stripe_head *sh1, struct stripe_head *sh2) + __releases(&sh1->stripe_lock) + __releases(&sh2->stripe_lock) { spin_unlock(&sh1->stripe_lock); spin_unlock_irq(&sh2->stripe_lock); -- cgit v1.2.3 From efcd487c69b9d968552a6bf80e7839c4f28b419d Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Thu, 4 Apr 2019 18:56:16 +0200 Subject: md: add __acquires/__releases annotations to handle_active_stripes This tells sparse that we release and reacquire the device_lock and avoids a warning. Signed-off-by: Christoph Hellwig Signed-off-by: Song Liu --- drivers/md/raid5.c | 2 ++ 1 file changed, 2 insertions(+) (limited to 'drivers/md') diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index d794d8745144..2b0a715e70c9 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -6159,6 +6159,8 @@ static int retry_aligned_read(struct r5conf *conf, struct bio *raid_bio, static int handle_active_stripes(struct r5conf *conf, int group, struct r5worker *worker, struct list_head *temp_inactive_list) + __releases(&conf->device_lock) + __acquires(&conf->device_lock) { struct stripe_head *batch[MAX_STRIPE_BATCH], *sh; int i, batch_size = 0, hash; -- cgit v1.2.3 From c42d3240990814eec1e4b2b93fa0487fc4873aed Mon Sep 17 00:00:00 2001 From: Pawel Baldysiak Date: Wed, 27 Mar 2019 13:48:21 +0100 Subject: md: return -ENODEV if rdev has no mddev assigned Mdadm expects that setting drive as faulty will fail with -EBUSY only if this operation will cause RAID to be failed. If this happens, it will try to stop the array. Currently -EBUSY might also be returned if rdev is in the middle of the removal process - for example there is a race with mdmon that already requested the drive to be failed/removed. If rdev does not contain mddev, return -ENODEV instead, so the caller can distinguish between those two cases and behave accordingly. Reviewed-by: NeilBrown Signed-off-by: Pawel Baldysiak Signed-off-by: Song Liu --- drivers/md/md.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'drivers/md') diff --git a/drivers/md/md.c b/drivers/md/md.c index 541015373f6a..45ffa23fa85d 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -3380,10 +3380,10 @@ rdev_attr_store(struct kobject *kobj, struct attribute *attr, return -EIO; if (!capable(CAP_SYS_ADMIN)) return -EACCES; - rv = mddev ? mddev_lock(mddev): -EBUSY; + rv = mddev ? mddev_lock(mddev) : -ENODEV; if (!rv) { if (rdev->mddev == NULL) - rv = -EBUSY; + rv = -ENODEV; else rv = entry->store(rdev, page, length); mddev_unlock(mddev); -- cgit v1.2.3 From a25d8c327bb41742dbd59f8c545f59f3b9c39983 Mon Sep 17 00:00:00 2001 From: Song Liu Date: Tue, 16 Apr 2019 09:34:21 -0700 Subject: Revert "Don't jump to compute_result state from check_result state" This reverts commit 4f4fd7c5798bbdd5a03a60f6269cf1177fbd11ef. Cc: Dan Williams Cc: Nigel Croxon Cc: Xiao Ni Signed-off-by: Song Liu --- drivers/md/raid5.c | 19 +++++++++++++++---- 1 file changed, 15 insertions(+), 4 deletions(-) (limited to 'drivers/md') diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index 2b0a715e70c9..b5742d07662d 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -4227,15 +4227,26 @@ static void handle_parity_checks6(struct r5conf *conf, struct stripe_head *sh, case check_state_check_result: sh->check_state = check_state_idle; - if (s->failed > 1) - break; /* handle a successful check operation, if parity is correct * we are done. Otherwise update the mismatch count and repair * parity if !MD_RECOVERY_CHECK */ if (sh->ops.zero_sum_result == 0) { - /* Any parity checked was correct */ - set_bit(STRIPE_INSYNC, &sh->state); + /* both parities are correct */ + if (!s->failed) + set_bit(STRIPE_INSYNC, &sh->state); + else { + /* in contrast to the raid5 case we can validate + * parity, but still have a failure to write + * back + */ + sh->check_state = check_state_compute_result; + /* Returning at this point means that we may go + * off and bring p and/or q uptodate again so + * we make sure to check zero_sum_result again + * to verify if p or q need writeback + */ + } } else { atomic64_add(STRIPE_SECTORS, &conf->mddev->resync_mismatches); if (test_bit(MD_RECOVERY_CHECK, &conf->mddev->recovery)) { -- cgit v1.2.3 From b2176a1dfb518d870ee073445d27055fea64dfb8 Mon Sep 17 00:00:00 2001 From: Nigel Croxon Date: Tue, 16 Apr 2019 09:50:09 -0700 Subject: md/raid: raid5 preserve the writeback action after the parity check The problem is that any 'uptodate' vs 'disks' check is not precise in this path. Put a "WARN_ON(!test_bit(R5_UPTODATE, &dev->flags)" on the device that might try to kick off writes and then skip the action. Better to prevent the raid driver from taking unexpected action *and* keep the system alive vs killing the machine with BUG_ON. Note: fixed warning reported by kbuild test robot Signed-off-by: Dan Williams Signed-off-by: Nigel Croxon Signed-off-by: Song Liu --- drivers/md/raid5.c | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-) (limited to 'drivers/md') diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index b5742d07662d..7fde645d2e90 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -4191,7 +4191,7 @@ static void handle_parity_checks6(struct r5conf *conf, struct stripe_head *sh, /* now write out any block on a failed drive, * or P or Q if they were recomputed */ - BUG_ON(s->uptodate < disks - 1); /* We don't need Q to recover */ + dev = NULL; if (s->failed == 2) { dev = &sh->dev[s->failed_num[1]]; s->locked++; @@ -4216,6 +4216,14 @@ static void handle_parity_checks6(struct r5conf *conf, struct stripe_head *sh, set_bit(R5_LOCKED, &dev->flags); set_bit(R5_Wantwrite, &dev->flags); } + if (WARN_ONCE(dev && !test_bit(R5_UPTODATE, &dev->flags), + "%s: disk%td not up to date\n", + mdname(conf->mddev), + dev - (struct r5dev *) &sh->dev)) { + clear_bit(R5_LOCKED, &dev->flags); + clear_bit(R5_Wantwrite, &dev->flags); + s->locked--; + } clear_bit(STRIPE_DEGRADED, &sh->state); set_bit(STRIPE_INSYNC, &sh->state); -- cgit v1.2.3 From 1568ee7e3c6305d9fbb2414bfd4b56e52853d42d Mon Sep 17 00:00:00 2001 From: Guoju Fang Date: Thu, 25 Apr 2019 00:48:26 +0800 Subject: bcache: fix crashes stopping bcache device before read miss done The bio from upper layer is considered completed when bio_complete() returns. In most scenarios bio_complete() is called in search_free(), but when read miss happens, the bio_compete() is called when backing device reading completed, while the struct search is still in use until cache inserting finished. If someone stops the bcache device just then, the device may be closed and released, but after cache inserting finished the struct search will access a freed struct cached_dev. This patch add the reference of bcache device before bio_complete() when read miss happens, and put it after the search is not used. Signed-off-by: Guoju Fang Signed-off-by: Coly Li Signed-off-by: Jens Axboe --- drivers/md/bcache/request.c | 26 +++++++++++++++++++++----- 1 file changed, 21 insertions(+), 5 deletions(-) (limited to 'drivers/md') diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c index f101bfe8657a..f11123079fe0 100644 --- a/drivers/md/bcache/request.c +++ b/drivers/md/bcache/request.c @@ -706,14 +706,14 @@ static void search_free(struct closure *cl) { struct search *s = container_of(cl, struct search, cl); - atomic_dec(&s->d->c->search_inflight); + atomic_dec(&s->iop.c->search_inflight); if (s->iop.bio) bio_put(s->iop.bio); bio_complete(s); closure_debug_destroy(cl); - mempool_free(s, &s->d->c->search); + mempool_free(s, &s->iop.c->search); } static inline struct search *search_alloc(struct bio *bio, @@ -756,13 +756,13 @@ static void cached_dev_bio_complete(struct closure *cl) struct search *s = container_of(cl, struct search, cl); struct cached_dev *dc = container_of(s->d, struct cached_dev, disk); - search_free(cl); cached_dev_put(dc); + search_free(cl); } /* Process reads */ -static void cached_dev_cache_miss_done(struct closure *cl) +static void cached_dev_read_error_done(struct closure *cl) { struct search *s = container_of(cl, struct search, cl); @@ -800,7 +800,22 @@ static void cached_dev_read_error(struct closure *cl) closure_bio_submit(s->iop.c, bio, cl); } - continue_at(cl, cached_dev_cache_miss_done, NULL); + continue_at(cl, cached_dev_read_error_done, NULL); +} + +static void cached_dev_cache_miss_done(struct closure *cl) +{ + struct search *s = container_of(cl, struct search, cl); + struct bcache_device *d = s->d; + + if (s->iop.replace_collision) + bch_mark_cache_miss_collision(s->iop.c, s->d); + + if (s->iop.bio) + bio_free_pages(s->iop.bio); + + cached_dev_bio_complete(cl); + closure_put(&d->cl); } static void cached_dev_read_done(struct closure *cl) @@ -833,6 +848,7 @@ static void cached_dev_read_done(struct closure *cl) if (verify(dc) && s->recoverable && !s->read_dirty_data) bch_data_verify(dc, s->orig_bio); + closure_get(&dc->disk.cl); bio_complete(s); if (s->iop.bio && -- cgit v1.2.3 From 4e0c04ec3a304490a83d5c0355e64176acc9b4ba Mon Sep 17 00:00:00 2001 From: Guoju Fang Date: Thu, 25 Apr 2019 00:48:27 +0800 Subject: bcache: fix inaccurate result of unused buckets To get the amount of unused buckets in sysfs_priority_stats, the code count the buckets which GC_SECTORS_USED is zero. It's correct and should not be overwritten by the count of buckets which prio is zero. Signed-off-by: Guoju Fang Signed-off-by: Coly Li Signed-off-by: Jens Axboe --- drivers/md/bcache/sysfs.c | 2 -- 1 file changed, 2 deletions(-) (limited to 'drivers/md') diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c index 17bae9c14ca0..6cd44d3cf906 100644 --- a/drivers/md/bcache/sysfs.c +++ b/drivers/md/bcache/sysfs.c @@ -996,8 +996,6 @@ SHOW(__bch_cache) !cached[n - 1]) --n; - unused = ca->sb.nbuckets - n; - while (cached < p + n && *cached == BTREE_PRIO) cached++, n--; -- cgit v1.2.3 From 78d4eb8ad9e1d413449d1b7a060f50b6efa81ebd Mon Sep 17 00:00:00 2001 From: Arnd Bergmann Date: Thu, 25 Apr 2019 00:48:28 +0800 Subject: bcache: avoid clang -Wunintialized warning clang has identified a code path in which it thinks a variable may be unused: drivers/md/bcache/alloc.c:333:4: error: variable 'bucket' is used uninitialized whenever 'if' condition is false [-Werror,-Wsometimes-uninitialized] fifo_pop(&ca->free_inc, bucket); ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ drivers/md/bcache/util.h:219:27: note: expanded from macro 'fifo_pop' #define fifo_pop(fifo, i) fifo_pop_front(fifo, (i)) ^~~~~~~~~~~~~~~~~~~~~~~~~ drivers/md/bcache/util.h:189:6: note: expanded from macro 'fifo_pop_front' if (_r) { \ ^~ drivers/md/bcache/alloc.c:343:46: note: uninitialized use occurs here allocator_wait(ca, bch_allocator_push(ca, bucket)); ^~~~~~ drivers/md/bcache/alloc.c:287:7: note: expanded from macro 'allocator_wait' if (cond) \ ^~~~ drivers/md/bcache/alloc.c:333:4: note: remove the 'if' if its condition is always true fifo_pop(&ca->free_inc, bucket); ^ drivers/md/bcache/util.h:219:27: note: expanded from macro 'fifo_pop' #define fifo_pop(fifo, i) fifo_pop_front(fifo, (i)) ^ drivers/md/bcache/util.h:189:2: note: expanded from macro 'fifo_pop_front' if (_r) { \ ^ drivers/md/bcache/alloc.c:331:15: note: initialize the variable 'bucket' to silence this warning long bucket; ^ This cannot happen in practice because we only enter the loop if there is at least one element in the list. Slightly rearranging the code makes this clearer to both the reader and the compiler, which avoids the warning. Signed-off-by: Arnd Bergmann Reviewed-by: Nathan Chancellor Signed-off-by: Coly Li Signed-off-by: Jens Axboe --- drivers/md/bcache/alloc.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) (limited to 'drivers/md') diff --git a/drivers/md/bcache/alloc.c b/drivers/md/bcache/alloc.c index 5002838ea476..f8986effcb50 100644 --- a/drivers/md/bcache/alloc.c +++ b/drivers/md/bcache/alloc.c @@ -327,10 +327,11 @@ static int bch_allocator_thread(void *arg) * possibly issue discards to them, then we add the bucket to * the free list: */ - while (!fifo_empty(&ca->free_inc)) { + while (1) { long bucket; - fifo_pop(&ca->free_inc, bucket); + if (!fifo_pop(&ca->free_inc, bucket)) + break; if (ca->discard) { mutex_unlock(&ca->set->bucket_lock); -- cgit v1.2.3 From 792732d9852c0e4505aceff4631ea2168fd02480 Mon Sep 17 00:00:00 2001 From: Geliang Tang Date: Thu, 25 Apr 2019 00:48:29 +0800 Subject: bcache: use kmemdup_nul for CACHED_LABEL buffer This patch uses kmemdup_nul to create a NUL-terminated string from dc->sb.label. This is better than open coding it. With this, we can move env[2] initialization into env[] array to make code more elegant. Signed-off-by: Geliang Tang Signed-off-by: Coly Li Signed-off-by: Jens Axboe --- drivers/md/bcache/super.c | 10 ++++------ 1 file changed, 4 insertions(+), 6 deletions(-) (limited to 'drivers/md') diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c index a697a3a923cd..6e618cb6126c 100644 --- a/drivers/md/bcache/super.c +++ b/drivers/md/bcache/super.c @@ -906,21 +906,18 @@ static int cached_dev_status_update(void *arg) void bch_cached_dev_run(struct cached_dev *dc) { struct bcache_device *d = &dc->disk; - char buf[SB_LABEL_SIZE + 1]; + char *buf = kmemdup_nul(dc->sb.label, SB_LABEL_SIZE, GFP_KERNEL); char *env[] = { "DRIVER=bcache", kasprintf(GFP_KERNEL, "CACHED_UUID=%pU", dc->sb.uuid), - NULL, + kasprintf(GFP_KERNEL, "CACHED_LABEL=%s", buf ? : ""), NULL, }; - memcpy(buf, dc->sb.label, SB_LABEL_SIZE); - buf[SB_LABEL_SIZE] = '\0'; - env[2] = kasprintf(GFP_KERNEL, "CACHED_LABEL=%s", buf); - if (atomic_xchg(&dc->running, 1)) { kfree(env[1]); kfree(env[2]); + kfree(buf); return; } @@ -944,6 +941,7 @@ void bch_cached_dev_run(struct cached_dev *dc) kobject_uevent_env(&disk_to_dev(d->disk)->kobj, KOBJ_CHANGE, env); kfree(env[1]); kfree(env[2]); + kfree(buf); if (sysfs_create_link(&d->kobj, &disk_to_dev(d->disk)->kobj, "dev") || sysfs_create_link(&disk_to_dev(d->disk)->kobj, &d->kobj, "bcache")) -- cgit v1.2.3 From 3a3947271cd6ce05c5b30c1250fa99de57410500 Mon Sep 17 00:00:00 2001 From: George Spelvin Date: Thu, 25 Apr 2019 00:48:30 +0800 Subject: bcache: Clean up bch_get_congested() There are a few nits in this function. They could in theory all be separate patches, but that's probably taking small commits too far. 1) I added a brief comment saying what it does. 2) I like to declare pointer parameters "const" where possible for documentation reasons. 3) It uses bitmap_weight(&rand, BITS_PER_LONG) to compute the Hamming weight of a 32-bit random number (giving a random integer with mean 16 and variance 8). Passing by reference in a 64-bit variable is silly; just use hweight32(). 4) Its helper function fract_exp_two is unnecessarily tangled. Gcc can optimize the multiply by (1 << x) to a shift, but it can be written in a much more straightforward way at the cost of one more bit of internal precision. Some analysis reveals that this bit is always available. This shrinks the object code for fract_exp_two(x, 6) from 23 bytes: 0000000000000000 : 0: 89 f9 mov %edi,%ecx 2: c1 e9 06 shr $0x6,%ecx 5: b8 01 00 00 00 mov $0x1,%eax a: d3 e0 shl %cl,%eax c: 83 e7 3f and $0x3f,%edi f: d3 e7 shl %cl,%edi 11: c1 ef 06 shr $0x6,%edi 14: 01 f8 add %edi,%eax 16: c3 retq To 19: 0000000000000017 : 17: 89 f8 mov %edi,%eax 19: 83 e0 3f and $0x3f,%eax 1c: 83 c0 40 add $0x40,%eax 1f: 89 f9 mov %edi,%ecx 21: c1 e9 06 shr $0x6,%ecx 24: d3 e0 shl %cl,%eax 26: c1 e8 06 shr $0x6,%eax 29: c3 retq (Verified with 0 <= frac_bits <= 8, 0 <= x < 16< Signed-off-by: Coly Li Signed-off-by: Jens Axboe --- drivers/md/bcache/request.c | 15 ++++++++------- drivers/md/bcache/request.h | 2 +- drivers/md/bcache/util.h | 26 +++++++++++++++++++------- 3 files changed, 28 insertions(+), 15 deletions(-) (limited to 'drivers/md') diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c index f11123079fe0..41adcd1546f1 100644 --- a/drivers/md/bcache/request.c +++ b/drivers/md/bcache/request.c @@ -329,12 +329,13 @@ void bch_data_insert(struct closure *cl) bch_data_insert_start(cl); } -/* Congested? */ - -unsigned int bch_get_congested(struct cache_set *c) +/* + * Congested? Return 0 (not congested) or the limit (in sectors) + * beyond which we should bypass the cache due to congestion. + */ +unsigned int bch_get_congested(const struct cache_set *c) { int i; - long rand; if (!c->congested_read_threshold_us && !c->congested_write_threshold_us) @@ -353,8 +354,7 @@ unsigned int bch_get_congested(struct cache_set *c) if (i > 0) i = fract_exp_two(i, 6); - rand = get_random_int(); - i -= bitmap_weight(&rand, BITS_PER_LONG); + i -= hweight32(get_random_u32()); return i > 0 ? i : 1; } @@ -376,7 +376,7 @@ static bool check_should_bypass(struct cached_dev *dc, struct bio *bio) { struct cache_set *c = dc->disk.c; unsigned int mode = cache_mode(dc); - unsigned int sectors, congested = bch_get_congested(c); + unsigned int sectors, congested; struct task_struct *task = current; struct io *i; @@ -412,6 +412,7 @@ static bool check_should_bypass(struct cached_dev *dc, struct bio *bio) goto rescale; } + congested = bch_get_congested(c); if (!congested && !dc->sequential_cutoff) goto rescale; diff --git a/drivers/md/bcache/request.h b/drivers/md/bcache/request.h index 721bf336ed1a..c64dbd7a91aa 100644 --- a/drivers/md/bcache/request.h +++ b/drivers/md/bcache/request.h @@ -33,7 +33,7 @@ struct data_insert_op { BKEY_PADDED(replace_key); }; -unsigned int bch_get_congested(struct cache_set *c); +unsigned int bch_get_congested(const struct cache_set *c); void bch_data_insert(struct closure *cl); void bch_cached_dev_request_init(struct cached_dev *dc); diff --git a/drivers/md/bcache/util.h b/drivers/md/bcache/util.h index 00aab6abcfe4..1fbced94e4cc 100644 --- a/drivers/md/bcache/util.h +++ b/drivers/md/bcache/util.h @@ -560,17 +560,29 @@ static inline uint64_t bch_crc64_update(uint64_t crc, return crc; } -/* Does linear interpolation between powers of two */ +/* + * A stepwise-linear pseudo-exponential. This returns 1 << (x >> + * frac_bits), with the less-significant bits filled in by linear + * interpolation. + * + * This can also be interpreted as a floating-point number format, + * where the low frac_bits are the mantissa (with implicit leading + * 1 bit), and the more significant bits are the exponent. + * The return value is 1.mantissa * 2^exponent. + * + * The way this is used, fract_bits is 6 and the largest possible + * input is CONGESTED_MAX-1 = 1023 (exponent 16, mantissa 0x1.fc), + * so the maximum output is 0x1fc00. + */ static inline unsigned int fract_exp_two(unsigned int x, unsigned int fract_bits) { - unsigned int fract = x & ~(~0 << fract_bits); - - x >>= fract_bits; - x = 1 << x; - x += (x * fract) >> fract_bits; + unsigned int mantissa = 1 << fract_bits; /* Implicit bit */ - return x; + mantissa += x & (mantissa - 1); + x >>= fract_bits; /* The exponent */ + /* Largest intermediate value 0x7f0000 */ + return mantissa << x >> fract_bits; } void bch_bio_map(struct bio *bio, void *base); -- cgit v1.2.3 From a4b732a248d12cbdb46999daf0bf288c011335eb Mon Sep 17 00:00:00 2001 From: Liang Chen Date: Thu, 25 Apr 2019 00:48:31 +0800 Subject: bcache: fix a race between cache register and cacheset unregister There is a race between cache device register and cache set unregister. For an already registered cache device, register_bcache will call bch_is_open to iterate through all cachesets and check every cache there. The race occurs if cache_set_free executes at the same time and clears the caches right before ca is dereferenced in bch_is_open_cache. To close the race, let's make sure the clean up work is protected by the bch_register_lock as well. This issue can be reproduced as follows, while true; do echo /dev/XXX> /sys/fs/bcache/register ; done& while true; do echo 1> /sys/block/XXX/bcache/set/unregister ; done & and results in the following oops, [ +0.000053] BUG: unable to handle kernel NULL pointer dereference at 0000000000000998 [ +0.000457] #PF error: [normal kernel read fault] [ +0.000464] PGD 800000003ca9d067 P4D 800000003ca9d067 PUD 3ca9c067 PMD 0 [ +0.000388] Oops: 0000 [#1] SMP PTI [ +0.000269] CPU: 1 PID: 3266 Comm: bash Not tainted 5.0.0+ #6 [ +0.000346] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.fc28 04/01/2014 [ +0.000472] RIP: 0010:register_bcache+0x1829/0x1990 [bcache] [ +0.000344] Code: b0 48 83 e8 50 48 81 fa e0 e1 10 c0 0f 84 a9 00 00 00 48 89 c6 48 89 ca 0f b7 ba 54 04 00 00 4c 8b 82 60 0c 00 00 85 ff 74 2f <49> 3b a8 98 09 00 00 74 4e 44 8d 47 ff 31 ff 49 c1 e0 03 eb 0d [ +0.000839] RSP: 0018:ffff92ee804cbd88 EFLAGS: 00010202 [ +0.000328] RAX: ffffffffc010e190 RBX: ffff918b5c6b5000 RCX: ffff918b7d8e0000 [ +0.000399] RDX: ffff918b7d8e0000 RSI: ffffffffc010e190 RDI: 0000000000000001 [ +0.000398] RBP: ffff918b7d318340 R08: 0000000000000000 R09: ffffffffb9bd2d7a [ +0.000385] R10: ffff918b7eb253c0 R11: ffffb95980f51200 R12: ffffffffc010e1a0 [ +0.000411] R13: fffffffffffffff2 R14: 000000000000000b R15: ffff918b7e232620 [ +0.000384] FS: 00007f955bec2740(0000) GS:ffff918b7eb00000(0000) knlGS:0000000000000000 [ +0.000420] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ +0.000801] CR2: 0000000000000998 CR3: 000000003cad6000 CR4: 00000000001406e0 [ +0.000837] Call Trace: [ +0.000682] ? _cond_resched+0x10/0x20 [ +0.000691] ? __kmalloc+0x131/0x1b0 [ +0.000710] kernfs_fop_write+0xfa/0x170 [ +0.000733] __vfs_write+0x2e/0x190 [ +0.000688] ? inode_security+0x10/0x30 [ +0.000698] ? selinux_file_permission+0xd2/0x120 [ +0.000752] ? security_file_permission+0x2b/0x100 [ +0.000753] vfs_write+0xa8/0x1a0 [ +0.000676] ksys_write+0x4d/0xb0 [ +0.000699] do_syscall_64+0x3a/0xf0 [ +0.000692] entry_SYSCALL_64_after_hwframe+0x44/0xa9 Signed-off-by: Liang Chen Cc: stable@vger.kernel.org Signed-off-by: Coly Li Signed-off-by: Jens Axboe --- drivers/md/bcache/super.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'drivers/md') diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c index 6e618cb6126c..53c5e3e0ac22 100644 --- a/drivers/md/bcache/super.c +++ b/drivers/md/bcache/super.c @@ -1514,6 +1514,7 @@ static void cache_set_free(struct closure *cl) bch_btree_cache_free(c); bch_journal_free(c); + mutex_lock(&bch_register_lock); for_each_cache(ca, c, i) if (ca) { ca->set = NULL; @@ -1532,7 +1533,6 @@ static void cache_set_free(struct closure *cl) mempool_exit(&c->search); kfree(c->devices); - mutex_lock(&bch_register_lock); list_del(&c->list); mutex_unlock(&bch_register_lock); -- cgit v1.2.3 From 14215ee01f6377c81c25c2cecda729e8811d2826 Mon Sep 17 00:00:00 2001 From: Coly Li Date: Thu, 25 Apr 2019 00:48:32 +0800 Subject: bcache: move definition of 'int ret' out of macro read_bucket() 'int ret' is defined as a local variable inside macro read_bucket(). Since this macro is called multiple times, and following patches will use a 'int ret' variable in bch_journal_read(), this patch moves definition of 'int ret' from macro read_bucket() to range of function bch_journal_read(). Signed-off-by: Coly Li Reviewed-by: Hannes Reinecke Signed-off-by: Jens Axboe --- drivers/md/bcache/journal.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) (limited to 'drivers/md') diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c index b2fd412715b1..6e18057d1d82 100644 --- a/drivers/md/bcache/journal.c +++ b/drivers/md/bcache/journal.c @@ -147,7 +147,7 @@ int bch_journal_read(struct cache_set *c, struct list_head *list) { #define read_bucket(b) \ ({ \ - int ret = journal_read_bucket(ca, list, b); \ + ret = journal_read_bucket(ca, list, b); \ __set_bit(b, bitmap); \ if (ret < 0) \ return ret; \ @@ -156,6 +156,7 @@ int bch_journal_read(struct cache_set *c, struct list_head *list) struct cache *ca; unsigned int iter; + int ret = 0; for_each_cache(ca, c, iter) { struct journal_device *ja = &ca->journal; @@ -267,7 +268,7 @@ bsearch: struct journal_replay, list)->j.seq; - return 0; + return ret; #undef read_bucket } -- cgit v1.2.3 From 1bee2addc0c8470c8aaa65ef0599eeae96dd88bc Mon Sep 17 00:00:00 2001 From: Coly Li Date: Thu, 25 Apr 2019 00:48:33 +0800 Subject: bcache: never set KEY_PTRS of journal key to 0 in journal_reclaim() In journal_reclaim() ja->cur_idx of each cache will be update to reclaim available journal buckets. Variable 'int n' is used to count how many cache is successfully reclaimed, then n is set to c->journal.key by SET_KEY_PTRS(). Later in journal_write_unlocked(), a for_each_cache() loop will write the jset data onto each cache. The problem is, if all jouranl buckets on each cache is full, the following code in journal_reclaim(), 529 for_each_cache(ca, c, iter) { 530 struct journal_device *ja = &ca->journal; 531 unsigned int next = (ja->cur_idx + 1) % ca->sb.njournal_buckets; 532 533 /* No space available on this device */ 534 if (next == ja->discard_idx) 535 continue; 536 537 ja->cur_idx = next; 538 k->ptr[n++] = MAKE_PTR(0, 539 bucket_to_sector(c, ca->sb.d[ja->cur_idx]), 540 ca->sb.nr_this_dev); 541 } 542 543 bkey_init(k); 544 SET_KEY_PTRS(k, n); If there is no available bucket to reclaim, the if() condition at line 534 will always true, and n remains 0. Then at line 544, SET_KEY_PTRS() will set KEY_PTRS field of c->journal.key to 0. Setting KEY_PTRS field of c->journal.key to 0 is wrong. Because in journal_write_unlocked() the journal data is written in following loop, 649 for (i = 0; i < KEY_PTRS(k); i++) { 650-671 submit journal data to cache device 672 } If KEY_PTRS field is set to 0 in jouranl_reclaim(), the journal data won't be written to cache device here. If system crahed or rebooted before bkeys of the lost journal entries written into btree nodes, data corruption will be reported during bcache reload after rebooting the system. Indeed there is only one cache in a cache set, there is no need to set KEY_PTRS field in journal_reclaim() at all. But in order to keep the for_each_cache() logic consistent for now, this patch fixes the above problem by not setting 0 KEY_PTRS of journal key, if there is no bucket available to reclaim. Signed-off-by: Coly Li Reviewed-by: Hannes Reinecke Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe --- drivers/md/bcache/journal.c | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) (limited to 'drivers/md') diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c index 6e18057d1d82..5180bed911ef 100644 --- a/drivers/md/bcache/journal.c +++ b/drivers/md/bcache/journal.c @@ -541,11 +541,11 @@ static void journal_reclaim(struct cache_set *c) ca->sb.nr_this_dev); } - bkey_init(k); - SET_KEY_PTRS(k, n); - - if (n) + if (n) { + bkey_init(k); + SET_KEY_PTRS(k, n); c->journal.blocks_free = c->sb.bucket_size >> c->block_bits; + } out: if (!journal_full(&c->journal)) __closure_wake_up(&c->journal.wait); @@ -672,6 +672,9 @@ static void journal_write_unlocked(struct closure *cl) ca->journal.seq[ca->journal.cur_idx] = w->data->seq; } + /* If KEY_PTRS(k) == 0, this jset gets lost in air */ + BUG_ON(i == 0); + atomic_dec_bug(&fifo_back(&c->journal.pin)); bch_journal_next(&c->journal); jou