6655 Commits

Author SHA1 Message Date
Anand Jain
36bd329814 btrfs: device stats, log when stats are zeroed
[ Upstream commit a69976bc69308aa475d0ba3b8b3efd1d013c0460 ]

We had a report indicating that some read errors aren't reported by the
device stats in the userland. It is important to have the errors
reported in the device stat as user land scripts might depend on it to
take the reasonable corrective actions. But to debug these issue we need
to be really sure that request to reset the device stat did not come
from the userland itself. So log an info message when device error reset
happens.

For example:
 BTRFS info (device sdc): device stats zeroed by btrfs(9223)

Reported-by: philip@philip-seeger.de
Link: https://www.spinics.net/lists/linux-btrfs/msg96528.html
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-28 16:36:06 +01:00
David Sterba
76d4e6aeac btrfs: safely advance counter when looking up bio csums
[ Upstream commit 4babad10198fa73fe73239d02c2e99e3333f5f5c ]

Dan's smatch tool reports

  fs/btrfs/file-item.c:295 btrfs_lookup_bio_sums()
  warn: should this be 'count == -1'

which points to the while (count--) loop. With count == 0 the check
itself could decrement it to -1. There's a WARN_ON a few lines below
that has never been seen in practice though.

It turns out that the value of page_bytes_left matches the count (by
sectorsize multiples). The loop never reaches the state where count
would go to -1, because page_bytes_left == 0 is found first and this
breaks out.

For clarity, use only plain check on count (and only for positive
value), decrement safely inside the loop. Any other discrepancy after
the whole bio list processing should be reported by the exising
WARN_ON_ONCE as well.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-28 16:36:05 +01:00
Johannes Thumshirn
ad915b3d80 btrfs: fix possible NULL-pointer dereference in integrity checks
[ Upstream commit 3dbd351df42109902fbcebf27104149226a4fcd9 ]

A user reports a possible NULL-pointer dereference in
btrfsic_process_superblock(). We are assigning state->fs_info to a local
fs_info variable and afterwards checking for the presence of state.

While we would BUG_ON() a NULL state anyways, we can also just remove
the local fs_info copy, as fs_info is only used once as the first
argument for btrfs_num_copies(). There we can just pass in
state->fs_info as well.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=205003
Signed-off-by: Johannes Thumshirn <jth@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-28 16:36:05 +01:00
David Sterba
13b91b8b70 btrfs: log message when rw remount is attempted with unclean tree-log
commit 10a3a3edc5b89a8cd095bc63495fb1e0f42047d9 upstream.

A remount to a read-write filesystem is not safe when there's tree-log
to be replayed. Files that could be opened until now might be affected
by the changes in the tree-log.

A regular mount is needed to replay the log so the filesystem presents
the consistent view with the pending changes included.

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-28 16:35:53 +01:00
David Sterba
893bb1890f btrfs: print message when tree-log replay starts
commit e8294f2f6aa6208ed0923aa6d70cea3be178309a upstream.

There's no logged information about tree-log replay although this is
something that points to previous unclean unmount. Other filesystems
report that as well.

Suggested-by: Chris Murphy <lists@colorremedies.com>
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-28 16:35:53 +01:00
Filipe Manana
841793cd07 Btrfs: fix race between using extent maps and merging them
commit ac05ca913e9f3871126d61da275bfe8516ff01ca upstream.

We have a few cases where we allow an extent map that is in an extent map
tree to be merged with other extents in the tree. Such cases include the
unpinning of an extent after the respective ordered extent completed or
after logging an extent during a fast fsync. This can lead to subtle and
dangerous problems because when doing the merge some other task might be
using the same extent map and as consequence see an inconsistent state of
the extent map - for example sees the new length but has seen the old start
offset.

With luck this triggers a BUG_ON(), and not some silent bug, such as the
following one in __do_readpage():

  $ cat -n fs/btrfs/extent_io.c
  3061  static int __do_readpage(struct extent_io_tree *tree,
  3062                           struct page *page,
  (...)
  3127                  em = __get_extent_map(inode, page, pg_offset, cur,
  3128                                        end - cur + 1, get_extent, em_cached);
  3129                  if (IS_ERR_OR_NULL(em)) {
  3130                          SetPageError(page);
  3131                          unlock_extent(tree, cur, end);
  3132                          break;
  3133                  }
  3134                  extent_offset = cur - em->start;
  3135                  BUG_ON(extent_map_end(em) <= cur);
  (...)

Consider the following example scenario, where we end up hitting the
BUG_ON() in __do_readpage().

We have an inode with a size of 8KiB and 2 extent maps:

  extent A: file offset 0, length 4KiB, disk_bytenr = X, persisted on disk by
            a previous transaction

  extent B: file offset 4KiB, length 4KiB, disk_bytenr = X + 4KiB, not yet
            persisted but writeback started for it already. The extent map
	    is pinned since there's writeback and an ordered extent in
	    progress, so it can not be merged with extent map A yet

The following sequence of steps leads to the BUG_ON():

1) The ordered extent for extent B completes, the respective page gets its
   writeback bit cleared and the extent map is unpinned, at that point it
   is not yet merged with extent map A because it's in the list of modified
   extents;

2) Due to memory pressure, or some other reason, the MM subsystem releases
   the page corresponding to extent B - btrfs_releasepage() is called and
   returns 1, meaning the page can be released as it's not dirty, not under
   writeback anymore and the extent range is not locked in the inode's
   iotree. However the extent map is not released, either because we are
   not in a context that allows memory allocations to block or because the
   inode's size is smaller than 16MiB - in this case our inode has a size
   of 8KiB;

3) Task B needs to read extent B and ends up __do_readpage() through the
   btrfs_readpage() callback. At __do_readpage() it gets a reference to
   extent map B;

4) Task A, doing a fast fsync, calls clear_em_loggin() against extent map B
   while holding the write lock on the inode's extent map tree - this
   results in try_merge_map() being called and since it's possible to merge
   extent map B with extent map A now (the extent map B was removed from
   the list of modified extents), the merging begins - it sets extent map
   B's start offset to 0 (was 4KiB), but before it increments the map's
   length to 8KiB (4kb + 4KiB), task A is at:

   BUG_ON(extent_map_end(em) <= cur);

   The call to extent_map_end() sees the extent map has a start of 0
   and a length still at 4KiB, so it returns 4KiB and 'cur' is 4KiB, so
   the BUG_ON() is triggered.

So it's dangerous to modify an extent map that is in the tree, because some
other task might have got a reference to it before and still using it, and
needs to see a consistent map while using it. Generally this is very rare
since most paths that lookup and use extent maps also have the file range
locked in the inode's iotree. The fsync path is pretty much the only
exception where we don't do it to avoid serialization with concurrent
reads.

Fix this by not allowing an extent map do be merged if if it's being used
by tasks other then the one attempting to merge the extent map (when the
reference count of the extent map is greater than 2).

Reported-by: ryusuke1925 <st13s20@gm.ibaraki-ct.ac.jp>
Reported-by: Koki Mitani <koki.mitani.xg@hco.ntt.co.jp>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=206211
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-28 16:35:53 +01:00
Josef Bacik
8d3a5b8134 btrfs: flush write bio if we loop in extent_write_cache_pages
[ Upstream commit 42ffb0bf584ae5b6b38f72259af1e0ee417ac77f ]

There exists a deadlock with range_cyclic that has existed forever.  If
we loop around with a bio already built we could deadlock with a writer
who has the page locked that we're attempting to write but is waiting on
a page in our bio to be written out.  The task traces are as follows

  PID: 1329874  TASK: ffff889ebcdf3800  CPU: 33  COMMAND: "kworker/u113:5"
   #0 [ffffc900297bb658] __schedule at ffffffff81a4c33f
   #1 [ffffc900297bb6e0] schedule at ffffffff81a4c6e3
   #2 [ffffc900297bb6f8] io_schedule at ffffffff81a4ca42
   #3 [ffffc900297bb708] __lock_page at ffffffff811f145b
   #4 [ffffc900297bb798] __process_pages_contig at ffffffff814bc502
   #5 [ffffc900297bb8c8] lock_delalloc_pages at ffffffff814bc684
   #6 [ffffc900297bb900] find_lock_delalloc_range at ffffffff814be9ff
   #7 [ffffc900297bb9a0] writepage_delalloc at ffffffff814bebd0
   #8 [ffffc900297bba18] __extent_writepage at ffffffff814bfbf2
   #9 [ffffc900297bba98] extent_write_cache_pages at ffffffff814bffbd

  PID: 2167901  TASK: ffff889dc6a59c00  CPU: 14  COMMAND:
  "aio-dio-invalid"
   #0 [ffffc9003b50bb18] __schedule at ffffffff81a4c33f
   #1 [ffffc9003b50bba0] schedule at ffffffff81a4c6e3
   #2 [ffffc9003b50bbb8] io_schedule at ffffffff81a4ca42
   #3 [ffffc9003b50bbc8] wait_on_page_bit at ffffffff811f24d6
   #4 [ffffc9003b50bc60] prepare_pages at ffffffff814b05a7
   #5 [ffffc9003b50bcd8] btrfs_buffered_write at ffffffff814b1359
   #6 [ffffc9003b50bdb0] btrfs_file_write_iter at ffffffff814b5933
   #7 [ffffc9003b50be38] new_sync_write at ffffffff8128f6a8
   #8 [ffffc9003b50bec8] vfs_write at ffffffff81292b9d
   #9 [ffffc9003b50bf00] ksys_pwrite64 at ffffffff81293032

I used drgn to find the respective pages we were stuck on

page_entry.page 0xffffea00fbfc7500 index 8148 bit 15 pid 2167901
page_entry.page 0xffffea00f9bb7400 index 7680 bit 0 pid 1329874

As you can see the kworker is waiting for bit 0 (PG_locked) on index
7680, and aio-dio-invalid is waiting for bit 15 (PG_writeback) on index
8148.  aio-dio-invalid has 7680, and the kworker epd looks like the
following

  crash> struct extent_page_data ffffc900297bbbb0
  struct extent_page_data {
    bio = 0xffff889f747ed830,
    tree = 0xffff889eed6ba448,
    extent_locked = 0,
    sync_io = 0
  }

Probably worth mentioning as well that it waits for writeback of the
page to complete while holding a lock on it (at prepare_pages()).

Using drgn I walked the bio pages looking for page
0xffffea00fbfc7500 which is the one we're waiting for writeback on

  bio = Object(prog, 'struct bio', address=0xffff889f747ed830)
  for i in range(0, bio.bi_vcnt.value_()):
      bv = bio.bi_io_vec[i]
      if bv.bv_page.value_() == 0xffffea00fbfc7500:
	  print("FOUND IT")

which validated what I suspected.

The fix for this is simple, flush the epd before we loop back around to
the beginning of the file during writeout.

Fixes: b293f02e1423 ("Btrfs: Add writepages support")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-14 16:32:20 -05:00
Filipe Manana
f6eb1d580e Btrfs: fix race between adding and putting tree mod seq elements and nodes
[ Upstream commit 7227ff4de55d931bbdc156c8ef0ce4f100c78a5b ]

There is a race between adding and removing elements to the tree mod log
list and rbtree that can lead to use-after-free problems.

Consider the following example that explains how/why the problems happens:

1) Task A has mod log element with sequence number 200. It currently is
   the only element in the mod log list;

2) Task A calls btrfs_put_tree_mod_seq() because it no longer needs to
   access the tree mod log. When it enters the function, it initializes
   'min_seq' to (u64)-1. Then it acquires the lock 'tree_mod_seq_lock'
   before checking if there are other elements in the mod seq list.
   Since the list it empty, 'min_seq' remains set to (u64)-1. Then it
   unlocks the lock 'tree_mod_seq_lock';

3) Before task A acquires the lock 'tree_mod_log_lock', task B adds
   itself to the mod seq list through btrfs_get_tree_mod_seq() and gets a
   sequence number of 201;

4) Some other task, name it task C, modifies a btree and because there
   elements in the mod seq list, it adds a tree mod elem to the tree
   mod log rbtree. That node added to the mod log rbtree is assigned
   a sequence number of 202;

5) Task B, which is doing fiemap and resolving indirect back references,
   calls btrfs get_old_root(), with 'time_seq' == 201, which in turn
   calls tree_mod_log_search() - the search returns the mod log node
   from the rbtree with sequence number 202, created by task C;

6) Task A now acquires the lock 'tree_mod_log_lock', starts iterating
   the mod log rbtree and finds the node with sequence number 202. Since
   202 is less than the previously computed 'min_seq', (u64)-1, it
   removes the node and frees it;

7) Task B still has a pointer to the node with sequence number 202, and
   it dereferences the pointer itself and through the call to
   __tree_mod_log_rewind(), resulting in a use-after-free problem.

This issue can be triggered sporadically with the test case generic/561
from fstests, and it happens more frequently with a higher number of
duperemove processes. When it happens to me, it either freezes the VM or
it produces a trace like the following before crashing:

  [ 1245.321140] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
  [ 1245.321200] CPU: 1 PID: 26997 Comm: pool Not tainted 5.5.0-rc6-btrfs-next-52 #1
  [ 1245.321235] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
  [ 1245.321287] RIP: 0010:rb_next+0x16/0x50
  [ 1245.321307] Code: ....
  [ 1245.321372] RSP: 0018:ffffa151c4d039b0 EFLAGS: 00010202
  [ 1245.321388] RAX: 6b6b6b6b6b6b6b6b RBX: ffff8ae221363c80 RCX: 6b6b6b6b6b6b6b6b
  [ 1245.321409] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8ae221363c80
  [ 1245.321439] RBP: ffff8ae20fcc4688 R08: 0000000000000002 R09: 0000000000000000
  [ 1245.321475] R10: ffff8ae20b120910 R11: 00000000243f8bb1 R12: 0000000000000038
  [ 1245.321506] R13: ffff8ae221363c80 R14: 000000000000075f R15: ffff8ae223f762b8
  [ 1245.321539] FS:  00007fdee1ec7700(0000) GS:ffff8ae236c80000(0000) knlGS:0000000000000000
  [ 1245.321591] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [ 1245.321614] CR2: 00007fded4030c48 CR3: 000000021da16003 CR4: 00000000003606e0
  [ 1245.321642] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  [ 1245.321668] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  [ 1245.321706] Call Trace:
  [ 1245.321798]  __tree_mod_log_rewind+0xbf/0x280 [btrfs]
  [ 1245.321841]  btrfs_search_old_slot+0x105/0xd00 [btrfs]
  [ 1245.321877]  resolve_indirect_refs+0x1eb/0xc60 [btrfs]
  [ 1245.321912]  find_parent_nodes+0x3dc/0x11b0 [btrfs]
  [ 1245.321947]  btrfs_check_shared+0x115/0x1c0 [btrfs]
  [ 1245.321980]  ? extent_fiemap+0x59d/0x6d0 [btrfs]
  [ 1245.322029]  extent_fiemap+0x59d/0x6d0 [btrfs]
  [ 1245.322066]  do_vfs_ioctl+0x45a/0x750
  [ 1245.322081]  ksys_ioctl+0x70/0x80
  [ 1245.322092]  ? trace_hardirqs_off_thunk+0x1a/0x1c
  [ 1245.322113]  __x64_sys_ioctl+0x16/0x20
  [ 1245.322126]  do_syscall_64+0x5c/0x280
  [ 1245.322139]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
  [ 1245.322155] RIP: 0033:0x7fdee3942dd7
  [ 1245.322177] Code: ....
  [ 1245.322258] RSP: 002b:00007fdee1ec6c88 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
  [ 1245.322294] RAX: ffffffffffffffda RBX: 00007fded40210d8 RCX: 00007fdee3942dd7
  [ 1245.322314] RDX: 00007fded40210d8 RSI: 00000000c020660b RDI: 0000000000000004
  [ 1245.322337] RBP: 0000562aa89e7510 R08: 0000000000000000 R09: 00007fdee1ec6d44
  [ 1245.322369] R10: 0000000000000073 R11: 0000000000000246 R12: 00007fdee1ec6d48
  [ 1245.322390] R13: 00007fdee1ec6d40 R14: 00007fded40210d0 R15: 00007fdee1ec6d50
  [ 1245.322423] Modules linked in: ....
  [ 1245.323443] ---[ end trace 01de1e9ec5dff3cd ]---

Fix this by ensuring that btrfs_put_tree_mod_seq() computes the minimum
sequence number and iterates the rbtree while holding the lock
'tree_mod_log_lock' in write mode. Also get rid of the 'tree_mod_seq_lock'
lock, since it is now redundant.

Fixes: bd989ba359f2ac ("Btrfs: add tree modification log functions")
Fixes: 097b8a7c9e48e2 ("Btrfs: join tree mod log code with the code holding back delayed refs")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-14 16:32:19 -05:00
David Sterba
1066f7e4bc btrfs: remove trivial locking wrappers of tree mod log
[ Upstream commit b1a09f1ec540408abf3a50d15dff5d9506932693 ]

The wrappers are trivial and do not bring any extra value on top of the
plain locking primitives.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-14 16:32:19 -05:00
Josef Bacik
8d9ec37255 btrfs: free block groups after free'ing fs trees
[ Upstream commit 4e19443da1941050b346f8fc4c368aa68413bc88 ]

Sometimes when running generic/475 we would trip the
WARN_ON(cache->reserved) check when free'ing the block groups on umount.
This is because sometimes we don't commit the transaction because of IO
errors and thus do not cleanup the tree logs until at umount time.

These blocks are still reserved until they are cleaned up, but they
aren't cleaned up until _after_ we do the free block groups work.  Fix
this by moving the free after free'ing the fs roots, that way all of the
tree logs are cleaned up and we have a properly cleaned fs.  A bunch of
loops of generic/475 confirmed this fixes the problem.

CC: stable@vger.kernel.org # 4.9+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-14 16:32:19 -05:00
Anand Jain
71f54d0173 btrfs: use bool argument in free_root_pointers()
[ Upstream commit 4273eaff9b8d5e141113a5bdf9628c02acf3afe5 ]

We don't need int argument bool shall do in free_root_pointers().  And
rename the argument as it confused two people.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-14 16:32:19 -05:00
Filipe Manana
1d5a59051c Btrfs: fix missing hole after hole punching and fsync when using NO_HOLES
[ Upstream commit 0e56315ca147b3e60c7bf240233a301d3c7fb508 ]

When using the NO_HOLES feature, if we punch a hole into a file and then
fsync it, there are cases where a subsequent fsync will miss the fact that
a hole was punched, resulting in the holes not existing after replaying
the log tree.

Essentially these cases all imply that, tree-log.c:copy_items(), is not
invoked for the leafs that delimit holes, because nothing changed those
leafs in the current transaction. And it's precisely copy_items() where
we currenly detect and log holes, which works as long as the holes are
between file extent items in the input leaf or between the beginning of
input leaf and the previous leaf or between the last item in the leaf
and the next leaf.

First example where we miss a hole:

  *) The extent items of the inode span multiple leafs;

  *) The punched hole covers a range that affects only the extent items of
     the first leaf;

  *) The fsync operation is done in full mode (BTRFS_INODE_NEEDS_FULL_SYNC
     is set in the inode's runtime flags).

  That results in the hole not existing after replaying the log tree.

  For example, if the fs/subvolume tree has the following layout for a
  particular inode:

      Leaf N, generation 10:

      [ ... INODE_ITEM INODE_REF EXTENT_ITEM (0 64K) EXTENT_ITEM (64K 128K) ]

      Leaf N + 1, generation 10:

      [ EXTENT_ITEM (128K 64K) ... ]

  If at transaction 11 we punch a hole coverting the range [0, 128K[, we end
  up dropping the two extent items from leaf N, but we don't touch the other
  leaf, so we end up in the following state:

      Leaf N, generation 11:

      [ ... INODE_ITEM INODE_REF ]

      Leaf N + 1, generation 10:

      [ EXTENT_ITEM (128K 64K) ... ]

  A full fsync after punching the hole will only process leaf N because it
  was modified in the current transaction, but not leaf N + 1, since it
  was not modified in the current transaction (generation 10 and not 11).
  As a result the fsync will not log any holes, because it didn't process
  any leaf with extent items.

Second example where we will miss a hole:

  *) An inode as its items spanning 5 (or more) leafs;

  *) A hole is punched and it covers only the extents items of the 3rd
     leaf. This resulsts in deleting the entire leaf and not touching any
     of the other leafs.

  So the only leaf that is modified in the current transaction, when
  punching the hole, is the first leaf, which contains the inode item.
  During the full fsync, the only leaf that is passed to copy_items()
  is that first leaf, and that's not enough for the hole detection
  code in copy_items() to determine there's a hole between the last
  file extent item in the 2nd leaf and the first file extent item in
  the 3rd leaf (which was the 4th leaf before punching the hole).

Fix this by scanning all leafs and punch holes as necessary when doing a
full fsync (less common than a non-full fsync) when the NO_HOLES feature
is enabled. The lack of explicit file extent items to mark holes makes it
necessary to scan existing extents to determine if holes exist.

A test case for fstests follows soon.

Fixes: 16e7549f045d33 ("Btrfs: incompatible format change to remove hole extents")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-14 16:32:19 -05:00
Filipe Manana
8a024c09cb Btrfs: fix assertion failure on fsync with NO_HOLES enabled
[ Upstream commit 0ccc3876e4b2a1559a4dbe3126dda4459d38a83b ]

Back in commit a89ca6f24ffe4 ("Btrfs: fix fsync after truncate when
no_holes feature is enabled") I added an assertion that is triggered when
an inline extent is found to assert that the length of the (uncompressed)
data the extent represents is the same as the i_size of the inode, since
that is true most of the time I couldn't find or didn't remembered about
any exception at that time. Later on the assertion was expanded twice to
deal with a case of a compressed inline extent representing a range that
matches the sector size followed by an expanding truncate, and another
case where fallocate can update the i_size of the inode without adding
or updating existing extents (if the fallocate range falls entirely within
the first block of the file). These two expansion/fixes of the assertion
were done by commit 7ed586d0a8241 ("Btrfs: fix assertion on fsync of
regular file when using no-holes feature") and commit 6399fb5a0b69a
("Btrfs: fix assertion failure during fsync in no-holes mode").
These however missed the case where an falloc expands the i_size of an
inode to exactly the sector size and inline extent exists, for example:

 $ mkfs.btrfs -f -O no-holes /dev/sdc
 $ mount /dev/sdc /mnt

 $ xfs_io -f -c "pwrite -S 0xab 0 1096" /mnt/foobar
 wrote 1096/1096 bytes at offset 0
 1 KiB, 1 ops; 0.0002 sec (4.448 MiB/sec and 4255.3191 ops/sec)

 $ xfs_io -c "falloc 1096 3000" /mnt/foobar
 $ xfs_io -c "fsync" /mnt/foobar
 Segmentation fault

 $ dmesg
 [701253.602385] assertion failed: len == i_size || (len == fs_info->sectorsize && btrfs_file_extent_compression(leaf, extent) != BTRFS_COMPRESS_NONE) || (len < i_size && i_size < fs_info->sectorsize), file: fs/btrfs/tree-log.c, line: 4727
 [701253.602962] ------------[ cut here ]------------
 [701253.603224] kernel BUG at fs/btrfs/ctree.h:3533!
 [701253.603503] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
 [701253.603774] CPU: 2 PID: 7192 Comm: xfs_io Tainted: G        W         5.0.0-rc8-btrfs-next-45 #1
 [701253.604054] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
 [701253.604650] RIP: 0010:assfail.constprop.23+0x18/0x1a [btrfs]
 (...)
 [701253.605591] RSP: 0018:ffffbb48c186bc48 EFLAGS: 00010286
 [701253.605914] RAX: 00000000000000de RBX: ffff921d0a7afc08 RCX: 0000000000000000
 [701253.606244] RDX: 0000000000000000 RSI: ffff921d36b16868 RDI: ffff921d36b16868
 [701253.606580] RBP: ffffbb48c186bcf0 R08: 0000000000000000 R09: 0000000000000000
 [701253.606913] R10: 0000000000000003 R11: 0000000000000000 R12: ffff921d05d2de18
 [701253.607247] R13: ffff921d03b54000 R14: 0000000000000448 R15: ffff921d059ecf80
 [701253.607769] FS:  00007f14da906700(0000) GS:ffff921d36b00000(0000) knlGS:0000000000000000
 [701253.608163] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 [701253.608516] CR2: 000056087ea9f278 CR3: 00000002268e8001 CR4: 00000000003606e0
 [701253.608880] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 [701253.609250] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 [701253.609608] Call Trace:
 [701253.609994]  btrfs_log_inode+0xdfb/0xe40 [btrfs]
 [701253.610383]  btrfs_log_inode_parent+0x2be/0xa60 [btrfs]
 [701253.610770]  ? do_raw_spin_unlock+0x49/0xc0
 [701253.611150]  btrfs_log_dentry_safe+0x4a/0x70 [btrfs]
 [701253.611537]  btrfs_sync_file+0x3b2/0x440 [btrfs]
 [701253.612010]  ? do_sysinfo+0xb0/0xf0
 [701253.612552]  do_fsync+0x38/0x60
 [701253.612988]  __x64_sys_fsync+0x10/0x20
 [701253.613360]  do_syscall_64+0x60/0x1b0
 [701253.613733]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
 [701253.614103] RIP: 0033:0x7f14da4e66d0
 (...)
 [701253.615250] RSP: 002b:00007fffa670fdb8 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
 [701253.615647] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f14da4e66d0
 [701253.616047] RDX: 000056087ea9c260 RSI: 000056087ea9c260 RDI: 0000000000000003
 [701253.616450] RBP: 0000000000000001 R08: 0000000000000020 R09: 0000000000000010
 [701253.616854] R10: 000000000000009b R11: 0000000000000246 R12: 000056087ea9c260
 [701253.617257] R13: 000056087ea9c240 R14: 0000000000000000 R15: 000056087ea9dd10
 (...)
 [701253.619941] ---[ end trace e088d74f132b6da5 ]---

Updating the assertion again to allow for this particular case would result
in a meaningless assertion, plus there is currently no risk of logging
content that would result in any corruption after a log replay if the size
of the data encoded in an inline extent is greater than the inode's i_size
(which is not currently possibe either with or without compression),
therefore just remove the assertion.

CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-14 16:32:19 -05:00
Qu Wenruo
cf1569db06 btrfs: Get rid of the confusing btrfs_file_extent_inline_len
[ Upstream commit e41ca5897489b1c18af75ff0cc8f5c80260b3281 ]

We used to call btrfs_file_extent_inline_len() to get the uncompressed
data size of an inlined extent.

However this function is hiding evil, for compressed extent, it has no
choice but to directly read out ram_bytes from btrfs_file_extent_item.
While for uncompressed extent, it uses item size to calculate the real
data size, and ignoring ram_bytes completely.

In fact, for corrupted ram_bytes, due to above behavior kernel
btrfs_print_leaf() can't even print correct ram_bytes to expose the bug.

Since we have the tree-checker to verify all EXTENT_DATA, such mismatch
can be detected pretty easily, thus we can trust ram_bytes without the
evil btrfs_file_extent_inline_len().

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-14 16:32:19 -05:00
Josef Bacik
f7a7788766 btrfs: set trans->drity in btrfs_commit_transaction
commit d62b23c94952e78211a383b7d90ef0afbd9a3717 upstream.

If we abort a transaction we have the following sequence

if (!trans->dirty && list_empty(&trans->new_bgs))
	return;
WRITE_ONCE(trans->transaction->aborted, err);

The idea being if we didn't modify anything with our trans handle then
we don't really need to abort the whole transaction, maybe the other
trans handles are fine and we can carry on.

However in the case of create_snapshot we add a pending_snapshot object
to our transaction and then commit the transaction.  We don't actually
modify anything.  sync() behaves the same way, attach to an existing
transaction and commit it.  This means that if we have an IO error in
the right places we could abort the committing transaction with our
trans->dirty being not set and thus not set transaction->aborted.

This is a problem because in the create_snapshot() case we depend on
pending->error being set to something, or btrfs_commit_transaction
returning an error.

If we are not the trans handle that gets to commit the transaction, and
we're waiting on the commit to happen we get our return value from
cur_trans->aborted.  If this was not set to anything because sync() hit
an error in the transaction commit before it could modify anything then
cur_trans->aborted would be 0.  Thus we'd return 0 from
btrfs_commit_transaction() in create_snapshot.

This is a problem because we then try to do things with
pending_snapshot->snap, which will be NULL because we didn't create the
snapshot, and then we'll get a NULL pointer dereference like the
following

"BUG: kernel NULL pointer dereference, address: 00000000000001f0"
RIP: 0010:btrfs_orphan_cleanup+0x2d/0x330
Call Trace:
 ? btrfs_mksubvol.isra.31+0x3f2/0x510
 btrfs_mksubvol.isra.31+0x4bc/0x510
 ? __sb_start_write+0xfa/0x200
 ? mnt_want_write_file+0x24/0x50
 btrfs_ioctl_snap_create_transid+0x16c/0x1a0
 btrfs_ioctl_snap_create_v2+0x11e/0x1a0
 btrfs_ioctl+0x1534/0x2c10
 ? free_debug_processing+0x262/0x2a3
 do_vfs_ioctl+0xa6/0x6b0
 ? do_sys_open+0x188/0x220
 ? syscall_trace_enter+0x1f8/0x330
 ksys_ioctl+0x60/0x90
 __x64_sys_ioctl+0x16/0x20
 do_syscall_64+0x4a/0x1b0

In order to fix this we need to make sure anybody who calls
commit_transaction has trans->dirty set so that they properly set the
trans->transaction->aborted value properly so any waiters know bad
things happened.

This was found while I was running generic/475 with my modified
fsstress, it reproduced within a few runs.  I ran with this patch all
night and didn't see the problem again.

CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-14 16:32:14 -05:00
Josef Bacik
e1404bf536 btrfs: do not zero f_bavail if we have available space
commit d55966c4279bfc6a0cf0b32bf13f5df228a1eeb6 upstream.

There was some logic added a while ago to clear out f_bavail in statfs()
if we did not have enough free metadata space to satisfy our global
reserve.  This was incorrect at the time, however didn't really pose a
problem for normal file systems because we would often allocate chunks
if we got this low on free metadata space, and thus wouldn't really hit
this case unless we were actually full.

Fast forward to today and now we are much better about not allocating
metadata chunks all of the time.  Couple this with d792b0f19711 ("btrfs:
always reserve our entire size for the global reserve") which now means
we'll easily have a larger global reserve than our free space, we are
now more likely to trip over this while still having plenty of space.

Fix this by skipping this logic if the global rsv's space_info is not
full.  space_info->full is 0 unless we've attempted to allocate a chunk
for that space_info and that has failed.  If this happens then the space
for the global reserve is definitely sacred and we need to report
b_avail == 0, but before then we can just use our calculated b_avail.

Reported-by: Martin Steigerwald <martin@lichtvoll.de>
Fixes: ca8a51b3a979 ("btrfs: statfs: report zero available if metadata are exhausted")
CC: stable@vger.kernel.org # 4.5+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Tested-By: Martin Steigerwald <martin@lichtvoll.de>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-05 14:18:29 +00:00
Omar Sandoval
a7a67b4e8e btrfs: use correct count in btrfs_file_write_iter()
[ Upstream commit c09767a8960ca0500fb636bf73686723337debf4 ]

generic_write_checks() may modify iov_iter_count(), so we must get the
count after the call, not before. Using the wrong one has a couple of
consequences:

1. We check a longer range in check_can_nocow() for nowait than we're
   actually writing.
2. We create extra hole extent maps in btrfs_cont_expand(). As far as I
   can tell, this is harmless, but I might be missing something.

These issues are pretty minor, but let's fix it before something more
important trips on it.

Fixes: edf064e7c6fe ("btrfs: nowait aio support")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-27 14:46:46 +01:00
Filipe Manana
510cd98350 Btrfs: fix inode cache waiters hanging on path allocation failure
[ Upstream commit 9d123a35d7e97bb2139747b16127c9b22b6a593e ]

If the caching thread fails to allocate a path, it returns without waking
up any cache waiters, leaving them hang forever. Fix this by following the
same approach as when we fail to start the caching thread: print an error
message, disable inode caching and make the wakers fallback to non-caching
mode behaviour (calling btrfs_find_free_objectid()).

Fixes: 581bb050941b4f ("Btrfs: Cache free inode numbers in memory")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-27 14:46:46 +01:00
Filipe Manana
2162f5aae4 Btrfs: fix inode cache waiters hanging on failure to start caching thread
[ Upstream commit a68ebe0790fc88b4314d17984a2cf99ce2361901 ]

If we fail to start the inode caching thread, we print an error message
and disable the inode cache, however we never wake up any waiters, so they
hang forever waiting for the caching to finish. Fix this by waking them
up and have them fallback to a call to btrfs_find_free_objectid().

Fixes: e60efa84252c05 ("Btrfs: avoid triggering bug_on() when we fail to start inode caching task")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-27 14:46:46 +01:00
Filipe Manana
0d479ec44e Btrfs: fix hang when loading existing inode cache off disk
[ Upstream commit 7764d56baa844d7f6206394f21a0e8c1f303c476 ]

If we are able to load an existing inode cache off disk, we set the state
of the cache to BTRFS_CACHE_FINISHED, but we don't wake up any one waiting
for the cache to be available. This means that anyone waiting for the
cache to be available, waiting on the condition that either its state is
BTRFS_CACHE_FINISHED or its available free space is greather than zero,
can hang forever.

This could be observed running fstests with MOUNT_OPTIONS="-o inode_cache",
in particular test case generic/161 triggered it very frequently for me,
producing a trace like the following:

  [63795.739712] BTRFS info (device sdc): enabling inode map caching
  [63795.739714] BTRFS info (device sdc): disk space caching is enabled
  [63795.739716] BTRFS info (device sdc): has skinny extents
  [64036.653886] INFO: task btrfs-transacti:3917 blocked for more than 120 seconds.
  [64036.654079]       Not tainted 5.2.0-rc4-btrfs-next-50 #1
  [64036.654143] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [64036.654232] btrfs-transacti D    0  3917      2 0x80004000
  [64036.654239] Call Trace:
  [64036.654258]  ? __schedule+0x3ae/0x7b0
  [64036.654271]  schedule+0x3a/0xb0
  [64036.654325]  btrfs_commit_transaction+0x978/0xae0 [btrfs]
  [64036.654339]  ? remove_wait_queue+0x60/0x60
  [64036.654395]  transaction_kthread+0x146/0x180 [btrfs]
  [64036.654450]  ? btrfs_cleanup_transaction+0x620/0x620 [btrfs]
  [64036.654456]  kthread+0x103/0x140
  [64036.654464]  ? kthread_create_worker_on_cpu+0x70/0x70
  [64036.654476]  ret_from_fork+0x3a/0x50
  [64036.654504] INFO: task xfs_io:3919 blocked for more than 120 seconds.
  [64036.654568]       Not tainted 5.2.0-rc4-btrfs-next-50 #1
  [64036.654617] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [64036.654685] xfs_io          D    0  3919   3633 0x00000000
  [64036.654691] Call Trace:
  [64036.654703]  ? __schedule+0x3ae/0x7b0
  [64036.654716]  schedule+0x3a/0xb0
  [64036.654756]  btrfs_find_free_ino+0xa9/0x120 [btrfs]
  [64036.654764]  ? remove_wait_queue+0x60/0x60
  [64036.654809]  btrfs_create+0x72/0x1f0 [btrfs]
  [64036.654822]  lookup_open+0x6bc/0x790
  [64036.654849]  path_openat+0x3bc/0xc00
  [64036.654854]  ? __lock_acquire+0x331/0x1cb0
  [64036.654869]  do_filp_open+0x99/0x110
  [64036.654884]  ? __alloc_fd+0xee/0x200
  [64036.654895]  ? do_raw_spin_unlock+0x49/0xc0
  [64036.654909]  ? do_sys_open+0x132/0x220
  [64036.654913]  do_sys_open+0x132/0x220
  [64036.654926]  do_syscall_64+0x60/0x1d0
  [64036.654933]  entry_SYSCALL_64_after_hwframe+0x49/0xbe

Fix this by adding a wake_up() call right after setting the cache state to
BTRFS_CACHE_FINISHED, at start_caching(), when we are able to load the
cache from disk.

Fixes: 82d5902d9c681b ("Btrfs: Support reading/writing on disk free ino cache")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-27 14:46:46 +01:00
Johannes Thumshirn
90515d01c0 btrfs: fix memory leak in qgroup accounting
commit 26ef8493e1ab771cb01d27defca2fa1315dc3980 upstream.

When running xfstests on the current btrfs I get the following splat from
kmemleak:

unreferenced object 0xffff88821b2404e0 (size 32):
  comm "kworker/u4:7", pid 26663, jiffies 4295283698 (age 8.776s)
  hex dump (first 32 bytes):
    01 00 00 00 00 00 00 00 10 ff fd 26 82 88 ff ff  ...........&....
    10 ff fd 26 82 88 ff ff 20 ff fd 26 82 88 ff ff  ...&.... ..&....
  backtrace:
    [<00000000f94fd43f>] ulist_alloc+0x25/0x60 [btrfs]
    [<00000000fd023d99>] btrfs_find_all_roots_safe+0x41/0x100 [btrfs]
    [<000000008f17bd32>] btrfs_find_all_roots+0x52/0x70 [btrfs]
    [<00000000b7660afb>] btrfs_qgroup_rescan_worker+0x343/0x680 [btrfs]
    [<0000000058e66778>] btrfs_work_helper+0xac/0x1e0 [btrfs]
    [<00000000f0188930>] process_one_work+0x1cf/0x350
    [<00000000af5f2f8e>] worker_thread+0x28/0x3c0
    [<00000000b55a1add>] kthread+0x109/0x120
    [<00000000f88cbd17>] ret_from_fork+0x35/0x40

This corresponds to:

  (gdb) l *(btrfs_find_all_roots_safe+0x41)
  0x8d7e1 is in btrfs_find_all_roots_safe (fs/btrfs/backref.c:1413).
  1408
  1409            tmp = ulist_alloc(GFP_NOFS);
  1410            if (!tmp)
  1411                    return -ENOMEM;
  1412            *roots = ulist_alloc(GFP_NOFS);
  1413            if (!*roots) {
  1414                    ulist_free(tmp);
  1415                    return -ENOMEM;
  1416            }
  1417

Following the lifetime of the allocated 'roots' ulist, it gets freed
again in btrfs_qgroup_account_extent().

But this does not happen if the function is called with the
'BTRFS_FS_QUOTA_ENABLED' flag cleared, then btrfs_qgroup_account_extent()
does a short leave and directly returns.

Instead of directly returning we should jump to the 'out_free' in order to
free all resources as expected.

CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
[ add comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-23 08:20:32 +01:00
Goldwyn Rodrigues
f04fb20253 btrfs: simplify inode locking for RWF_NOWAIT
commit 9cf35f673583ccc9f3e2507498b3079d56614ad3 upstream.

This is similar to 942491c9e6d6 ("xfs: fix AIM7 regression"). Apparently
our current rwsem code doesn't like doing the trylock, then lock for
real scheme. This causes extra contention on the lock and can be
measured eg. by AIM7 benchmark.  So change our read/write methods to
just do the trylock for the RWF_NOWAIT case.

Fixes: edf064e7c6fe ("btrfs: nowait aio support")
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-17 19:45:46 +01:00
Josef Bacik
a93dd1c462 btrfs: abort transaction after failed inode updates in create_subvol
[ Upstream commit c7e54b5102bf3614cadb9ca32d7be73bad6cecf0 ]

We can just abort the transaction here, and in fact do that for every
other failure in this function except these two cases.

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-31 12:37:56 +01:00
Dan Carpenter
f8f86208bf btrfs: return error pointer from alloc_test_extent_buffer
[ Upstream commit b6293c821ea8fa2a631a2112cd86cd435effeb8b ]

Callers of alloc_test_extent_buffer have not correctly interpreted the
return value as error pointer, as alloc_test_extent_buffer should behave
as alloc_extent_buffer. The self-tests were unaffected but
btrfs_find_create_tree_block could call both functions and that would
cause problems up in the call chain.

Fixes: faa2dbf004e8 ("Btrfs: add sanity tests for new qgroup accounting code")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-31 12:37:55 +01:00
Omar Sandoval
7cfef55ff0 btrfs: don't prematurely free work in scrub_missing_raid56_worker()
[ Upstream commit 57d4f0b863272ba04ba85f86bfdc0f976f0af91c ]

Currently, scrub_missing_raid56_worker() puts and potentially frees
sblock (which embeds the work item) and then submits a bio through
scrub_wr_submit(). This is another potential instance of the bug in
"btrfs: don't prematurely free work in run_ordered_work()". Fix it by
dropping the reference after we submit the bio.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-31 12:37:53 +01:00
Omar Sandoval
4a27508240 btrfs: don't prematurely free work in reada_start_machine_worker()
[ Upstream commit e732fe95e4cad35fc1df278c23a32903341b08b3 ]

Currently, reada_start_machine_worker() frees the reada_machine_work and
then calls __reada_start_machine() to do readahead. This is another
potential instance of the bug in "btrfs: don't prematurely free work in
run_ordered_work()".

There _might_ already be a deadlock here: reada_start_machine_worker()
can depend on itself through stacked filesystems (__read_start_machine()
-> reada_start_machine_dev() -> reada_tree_block_flagged() ->
read_extent_buffer_pages() -> submit_one_bio() ->
btree_submit_bio_hook() -> btrfs_map_bio() -> submit_stripe_bio() ->
submit_bio() onto a loop device can trigger readahead on the lower
filesystem).

Either way, let's fix it by freeing the work at the end.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-31 12:37:53 +01:00
Omar Sandoval
6d52fb75cd btrfs: don't prematurely free work in run_ordered_work()
[ Upstream commit c495dcd6fbe1dce51811a76bb85b4675f6494938 ]

We hit the following very strange deadlock on a system with Btrfs on a
loop device backed by another Btrfs filesystem:

1. The top (loop device) filesystem queues an async_cow work item from
   cow_file_range_async(). We'll call this work X.
2. Worker thread A starts work X (normal_work_helper()).
3. Worker thread A executes the ordered work for the top filesystem
   (run_ordered_work()).
4. Worker thread A finishes the ordered work for work X and frees X
   (work->ordered_free()).
5. Worker thread A executes another ordered work and gets blocked on I/O
   to the bottom filesystem (still in run_ordered_work()).
6. Meanwhile, the bottom filesystem allocates and queues an async_cow
   work item which happens to be the recently-freed X.
7. The workqueue code sees that X is already being executed by worker
   thread A, so it schedules X to be executed _after_ worker thread A
   finishes (see the find_worker_executing_work() call in
   process_one_work()).

Now, the top filesystem is waiting for I/O on the bottom filesystem, but
the bottom filesystem is waiting for the top filesystem to finish, so we
deadlock.

This happens because we are breaking the workqueue assumption that a
work item cannot be recycled while it still depends on other work. Fix
it by waiting to free the work item until we are done with all of the
related ordered work.

P.S.:

One might ask why the workqueue code doesn't try to detect a recycled
work item. It actually does try by checking whether the work item has
the same work function (find_worker_executing_work()), but in our case
the function is the same. This is the only key that the workqueue code
has available to compare, short of adding an additional, layer-violating
"custom key". Considering that we're the only ones that have ever hit
this, we should just play by the rules.

Unfortunately, we haven't been able to create a minimal reproducer other
than our full container setup using a compress-force=zstd filesystem on
top of another compress-force=zstd filesystem.

Suggested-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-31 12:37:44 +01:00
Omar Sandoval
9e5ae20bb9 btrfs: don't prematurely free work in end_workqueue_fn()
[ Upstream commit 9be490f1e15c34193b1aae17da58e14dd9f55a95 ]

Currently, end_workqueue_fn() frees the end_io_wq entry (which embeds
the work item) and then calls bio_endio(). This is another potential
instance of the bug in "btrfs: don't prematurely free work in
run_ordered_work()".

In particular, the endio call may depend on other work items. For
example, btrfs_end_dio_bio() can call btrfs_subio_endio_read() ->
__btrfs_correct_data_nocsum() -> dio_read_error() ->
submit_dio_repair_bio(), which submits a bio that is also completed
through a end_workqueue_fn() work item. However,
__btrfs_correct_data_nocsum() waits for the newly submitted bio to
complete, thus it depends on another work item.

This example currently usually works because we use different workqueue
helper functions for BTRFS_WQ_ENDIO_DATA and BTRFS_WQ_ENDIO_DIO_REPAIR.
However, it may deadlock with stacked filesystems and is fragile
overall. The proper fix is to free the work item at the very end of the
work function, so let's do that.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-31 12:37:44 +01:00
Filipe Manana
1622a3416b Btrfs: fix removal logic of the tree mod log that leads to use-after-free issues
commit 6609fee8897ac475378388238456c84298bff802 upstream.

When a tree mod log user no longer needs to use the tree it calls
btrfs_put_tree_mod_seq() to remove itself from the list of users and
delete all no longer used elements of the tree's red black tree, which
should be all elements with a sequence number less then our equals to
the caller's sequence number. However the logic is broken because it
can delete and free elements from the red black tree that have a
sequence number greater then the caller's sequence number:

1) At a point in time we have sequence numbers 1, 2, 3 and 4 in the
   tree mod log;

2) The task which got assigned the sequence number 1 calls
   btrfs_put_tree_mod_seq();

3) Sequence number 1 is deleted from the list of sequence numbers;

4) The current minimum sequence number is computed to be the sequence
   number 2;

5) A task using sequence number 2 is at tree_mod_log_rewind() and gets
   a pointer to one of its elements from the red black tree through
   a call to tree_mod_log_search();

6) The task with sequence number 1 iterates the red black tree of tree
   modification elements and deletes (and frees) all elements with a
   sequence number less then or equals to 2 (the computed minimum sequence
   number) - it ends up only leaving elements with sequence numbers of 3
   and 4;

7) The task with sequence number 2 now uses the pointer to its element,
   already freed by the other task, at __tree_mod_log_rewind(), resulting
   in a use-after-free issue. When CONFIG_DEBUG_PAGEALLOC=y it produces
   a trace like the following:

  [16804.546854] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
  [16804.547451] CPU: 0 PID: 28257 Comm: pool Tainted: G        W         5.4.0-rc8-btrfs-next-51 #1
  [16804.548059] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
  [16804.548666] RIP: 0010:rb_next+0x16/0x50
  (...)
  [16804.550581] RSP: 0018:ffffb948418ef9b0 EFLAGS: 00010202
  [16804.551227] RAX: 6b6b6b6b6b6b6b6b RBX: ffff90e0247f6600 RCX: 6b6b6b6b6b6b6b6b
  [16804.551873] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff90e0247f6600
  [16804.552504] RBP: ffff90dffe0d4688 R08: 0000000000000001 R09: 0000000000000000
  [16804.553136] R10: ffff90dffa4a0040 R11: 0000000000000000 R12: 000000000000002e
  [16804.553768] R13: ffff90e0247f6600 R14: 0000000000001663 R15: ffff90dff77862b8
  [16804.554399] FS:  00007f4b197ae700(0000) GS:ffff90e036a00000(0000) knlGS:0000000000000000
  [16804.555039] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [16804.555683] CR2: 00007f4b10022000 CR3: 00000002060e2004 CR4: 00000000003606f0
  [16804.556336] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  [16804.556968] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  [16804.557583] Call Trace:
  [16804.558207]  __tree_mod_log_rewind+0xbf/0x280 [btrfs]
  [16804.558835]  btrfs_search_old_slot+0x105/0xd00 [btrfs]
  [16804.559468]  resolve_indirect_refs+0x1eb/0xc70 [btrfs]
  [16804.560087]  ? free_extent_buffer.part.19+0x5a/0xc0 [btrfs]
  [16804.560700]  find_parent_nodes+0x388/0x1120 [btrfs]
  [16804.561310]  btrfs_check_shared+0x115/0x1c0 [btrfs]
  [16804.561916]  ? extent_fiemap+0x59d/0x6d0 [btrfs]
  [16804.562518]  extent_fiemap+0x59d/0x6d0 [btrfs]
  [16804.563112]  ? __might_fault+0x11/0x90
  [16804.563706]  do_vfs_ioctl+0x45a/0x700
  [16804.564299]  ksys_ioctl+0x70/0x80
  [16804.564885]  ? trace_hardirqs_off_thunk+0x1a/0x20
  [16804.565461]  __x64_sys_ioctl+0x16/0x20
  [16804.566020]  do_syscall_64+0x5c/0x250
  [16804.566580]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
  [16804.567153] RIP: 0033:0x7f4b1ba2add7
  (...)
  [16804.568907] RSP: 002b:00007f4b197adc88 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
  [16804.569513] RAX: ffffffffffffffda RBX: 00007f4b100210d8 RCX: 00007f4b1ba2add7
  [16804.570133] RDX: 00007f4b100210d8 RSI: 00000000c020660b RDI: 0000000000000003
  [16804.570726] RBP: 000055de05a6cfe0 R08: 0000000000000000 R09: 00007f4b197add44
  [16804.571314] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f4b197add48
  [16804.571905] R13: 00007f4b197add40 R14: 00007f4b100210d0 R15: 00007f4b197add50
  (...)
  [16804.575623] ---[ end trace 87317359aad4ba50 ]---

Fix this by making btrfs_put_tree_mod_seq() skip deletion of elements that
have a sequence number equals to the computed minimum sequence number, and
not just elements with a sequence number greater then that minimum.

Fixes: bd989ba359f2ac ("Btrfs: add tree modification log functions")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-31 12:36:45 +01:00
Josef Bacik
5a744f9eee btrfs: handle ENOENT in btrfs_uuid_tree_iterate
commit 714cd3e8cba6841220dce9063a7388a81de03825 upstream.

If we get an -ENOENT back from btrfs_uuid_iter_rem when iterating the
uuid tree we'll just continue and do btrfs_next_item().  However we've
done a btrfs_release_path() at this point and no longer have a valid
path.  So increment the key and go back and do a normal search.

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-31 12:36:44 +01:00
Josef Bacik
9cb1548707 btrfs: do not leak reloc root if we fail to read the fs root
commit ca1aa2818a53875cfdd175fb5e9a2984e997cce9 upstream.

If we fail to read the fs root corresponding with a reloc root we'll
just break out and free the reloc roots.  But we remove our current
reloc_root from this list higher up, which means we'll leak this
reloc_root.  Fix this by adding ourselves back to the reloc_roots list
so we are properly cleaned up.

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-31 12:36:44 +01:00
Josef Bacik
08c0e06a7b btrfs: skip log replay on orphaned roots
commit 9bc574de590510eff899c3ca8dbaf013566b5efe upstream.

My fsstress modifications coupled with generic/475 uncovered a failure
to mount and replay the log if we hit a orphaned root.  We do not want
to replay the log for an orphan root, but it's completely legitimate to
have an orphaned root with a log attached.  Fix this by simply skipping
replaying the log.  We still need to pin it's root node so that we do
not overwrite it while replaying other logs, as we re-read the log root
at every stage of the replay.

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-31 12:36:44 +01:00
Josef Bacik
0145dc5ac7 btrfs: do not call synchronize_srcu() in inode_tree_del
commit f72ff01df9cf5db25c76674cac16605992d15467 upstream.

Testing with the new fsstress uncovered a pretty nasty deadlock with
lookup and snapshot deletion.

Process A
unlink
 -> final iput
   -> inode_tree_del
     -> synchronize_srcu(subvol_srcu)

Process B
btrfs_lookup  <- srcu_read_lock() acquired here
  -> btrfs_iget
    -> find inode that has I_FREEING set
      -> __wait_on_freeing_inode()

We're holding the srcu_read_lock() while doing the iget in order to make
sure our fs root doesn't go away, and then we are waiting for the inode
to finish freeing.  However because the free'ing process is doing a
synchronize_srcu() we deadlock.

Fix this by dropping the synchronize_srcu() in inode_tree_del().  We
don't need people to stop accessing the fs root at this point, we're
only adding our empty root to the dead roots list.

A larger much more invasive fix is forthcoming to address how we deal
with fs roots, but this fixes the immediate problem.

Fixes: 76dda93c6ae2 ("Btrfs: add snapshot/subvolume destroy ioctl")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-31 12:36:43 +01:00
Josef Bacik
af6ba22a38 btrfs: don't double lock the subvol_sem for rename exchange
commit 943eb3bf25f4a7b745dd799e031be276aa104d82 upstream.

If we're rename exchanging two subvols we'll try to lock this lock
twice, which is bad.  Just lock once if either of the ino's are subvols.

Fixes: cdd1fedf8261 ("btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-31 12:36:43 +01:00
Josef Bacik
04e23c8fce btrfs: record all roots for rename exchange on a subvol
commit 3e1740993e43116b3bc71b0aad1e6872f6ccf341 upstream.

Testing with the new fsstress support for subvolumes uncovered a pretty
bad problem with rename exchange on subvolumes.  We're modifying two
different subvolumes, but we only start the transaction on one of them,
so the other one is not added to the dirty root list.  This is caught by
btrfs_cow_block() with a warning because the root has not been updated,
however if we do not modify this root again we'll end up pointing at an
invalid root because the root item is never updated.

Fix this by making sure we add the destination root to the trans list,
the same as we do with normal renames.  This fixes the corruption.

Fixes: cdd1fedf8261 ("btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT")
CC: stable@vger.kernel.org # 4.9+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-17 20:39:18 +01:00
Filipe Manana
ebf66f5a00 Btrfs: send, skip backreference walking for extents with many references
commit fd0ddbe2509568b00df364156f47561e9f469f15 upstream.

Backreference walking, which is used by send to figure if it can issue
clone operations instead of write operations, can be very slow and use
too much memory when extents have many references. This change simply
skips backreference walking when an extent has more than 64 references,
in which case we fallback to a write operation instead of a clone
operation. This limit is conservative and in practice I observed no
signicant slowdown with up to 100 references and still low memory usage
up to that limit.

This is a temporary workaround until there are speedups in the backref
walking code, and as such it does not attempt to add extra interfaces or
knobs to tweak the threshold.

Reported-by: Atemu <atemu.main@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CAE4GHgkvqVADtS4AzcQJxo0Q1jKQgKaW3JGp3SGdoinVo=C9eQ@mail.gmail.com/T/#me55dc0987f9cc2acaa54372ce0492c65782be3fa
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-17 20:39:17 +01:00
Qu Wenruo
b0e1357a0a btrfs: Remove btrfs_bio::flags member
commit 34b127aecd4fe8e6a3903e10f204a7b7ffddca22 upstream.

The last user of btrfs_bio::flags was removed in commit 326e1dbb5736
("block: remove management of bi_remaining when restoring original
bi_end_io"), remove it.

(Tagged for stable as the structure is heavily used and space savings
are desirable.)

CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-17 20:39:16 +01:00
Filipe Manana
8a2e1bc91a Btrfs: fix negative subv_writers counter and data space leak after buffered write
commit a0e248bb502d5165b3314ac3819e888fdcdf7d9f upstream.

When doing a buffered write it's possible to leave the subv_writers
counter of the root, used for synchronization between buffered nocow
writers and snapshotting. This happens in an exceptional case like the
following:

1) We fail to allocate data space for the write, since there's not
   enough available data space nor enough unallocated space for allocating
   a new data block group;

2) Because of that failure, we try to go to NOCOW mode, which succeeds
   and therefore we set the local variable 'only_release_metadata' to true
   and set the root's sub_writers counter to 1 through the call to
   btrfs_start_write_no_snapshotting() made by check_can_nocow();

3) The call to btrfs_copy_from_user() returns zero, which is very unlikely
   to happen but not impossible;

4) No pages are copied because btrfs_copy_from_user() returned zero;

5) We call btrfs_end_write_no_snapshotting() which decrements the root's
   subv_writers counter to 0;

6) We don't set 'only_release_metadata' back to 'false' because we do
   it only if 'copied', the value returned by btrfs_copy_from_user(), is
   greater than zero;

7) On the next iteration of the while loop, which processes the same
   page range, we are now able to allocate data space for the write (we
   got enough data space released in the meanwhile);

8) After this if we fail at btrfs_delalloc_reserve_metadata(), because
   now there isn't enough free metadata space, or in some other place
   further below (prepare_pages(), lock_and_cleanup_extent_if_need(),
   btrfs_dirty_pages()), we break out of the while loop with
   'only_release_metadata' having a value of 'true';

9) Because 'only_release_metadata' is 'true' we end up decrementing the
   root's subv_writers counter to -1 (through a call to
   btrfs_end_write_no_snapshotting()), and we also end up not releasing the
   data space previously reserved through btrfs_check_data_free_space().
   As a consequence the mechanism for synchronizing NOCOW buffered writes
   with snapshotting gets broken.

Fix this by always setting 'only_release_metadata' to false at the start
of each iteration.

Fixes: 8257b2dc3c1a ("Btrfs: introduce btrfs_{start, end}_nocow_write() for each subvolume")
Fixes: 7ee9e4405f26 ("Btrfs: check if we can nocow if we don't have data space")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-17 20:39:16 +01:00
Josef Bacik
ca64b008af btrfs: use refcount_inc_not_zero in kill_all_nodes
commit baf320b9d531f1cfbf64c60dd155ff80a58b3796 upstream.

We hit the following warning while running down a different problem

[ 6197.175850] ------------[ cut here ]------------
[ 6197.185082] refcount_t: underflow; use-after-free.
[ 6197.194704] WARNING: CPU: 47 PID: 966 at lib/refcount.c:190 refcount_sub_and_test_checked+0x53/0x60
[ 6197.521792] Call Trace:
[ 6197.526687]  __btrfs_release_delayed_node+0x76/0x1c0
[ 6197.536615]  btrfs_kill_all_delayed_nodes+0xec/0x130
[ 6197.546532]  ? __btrfs_btree_balance_dirty+0x60/0x60
[ 6197.556482]  btrfs_clean_one_deleted_snapshot+0x71/0xd0
[ 6197.566910]  cleaner_kthread+0xfa/0x120
[ 6197.574573]  kthread+0x111/0x130
[ 6197.581022]  ? kthread_create_on_node+0x60/0x60
[ 6197.590086]  ret_from_fork+0x1f/0x30
[ 6197.597228] ---[ end trace 424bb7ae00509f56 ]---

This is because the free side drops the ref without the lock, and then
takes the lock if our refcount is 0.  So you can have nodes on the tree
that have a refcount of 0.  Fix this by zero'ing out that element in our
temporary array so we don't try to kill it again.

CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-17 20:39:15 +01:00
Josef Bacik
797200898a btrfs: check page->mapping when loading free space cache
commit 3797136b626ad4b6582223660c041efdea8f26b2 upstream.

While testing 5.2 we ran into the following panic

[52238.017028] BUG: kernel NULL pointer dereference, address: 0000000000000001
[52238.105608] RIP: 0010:drop_buffers+0x3d/0x150
[52238.304051] Call Trace:
[52238.308958]  try_to_free_buffers+0x15b/0x1b0
[52238.317503]  shrink_page_list+0x1164/0x1780
[52238.325877]  shrink_inactive_list+0x18f/0x3b0
[52238.334596]  shrink_node_memcg+0x23e/0x7d0
[52238.342790]  ? do_shrink_slab+0x4f/0x290
[52238.350648]  shrink_node+0xce/0x4a0
[52238.357628]  balance_pgdat+0x2c7/0x510
[52238.365135]  kswapd+0x216/0x3e0
[52238.371425]  ? wait_woken+0x80/0x80
[52238.378412]  ? balance_pgdat+0x510/0x510
[52238.386265]  kthread+0x111/0x130
[52238.392727]  ? kthread_create_on_node+0x60/0x60
[52238.401782]  ret_from_fork+0x1f/0x30

The page we were trying to drop had a page->private, but had no
page->mapping and so called drop_buffers, assuming that we had a
buffer_head on the page, and then panic'ed trying to deref 1, which is
our page->private for data pages.

This is happening because we're truncating the free space cache while
we're trying to load the free space cache.  This isn't supposed to
happen, and I'll fix that in a followup patch.  However we still
shouldn't allow those sort of mistakes to result in messing with pages
that do not belong to us.  So add the page->mapping check to verify that
we still own this page after dropping and re-acquiring the page lock.

This page being unlocked as:
btrfs_readpage
  extent_read_full_page
    __extent_read_full_page
      __do_readpage
        if (!nr)
	   unlock_page  <-- nr can be 0 only if submit_extent_page
			    returns an error

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
[ add callchain ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-17 20:39:15 +01:00
Josef Bacik
fe6f3973ff btrfs: only track ref_heads in delayed_ref_updates
[ Upstream commit 158ffa364bf723fa1ef128060646d23dc3942994 ]

We use this number to figure out how many delayed refs to run, but
__btrfs_run_delayed_refs really only checks every time we need a new
delayed ref head, so we always run at least one ref head completely no
matter what the number of items on it.  Fix the accounting to only be
adjusted when we add/remove a ref head.

In addition to using this number to limit the number of delayed refs
run, a future patch is also going to use it to calculate the amount of
space required for delayed refs space reservation.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-05 15:37:25 +01:00
Arnd Bergmann
ab5ef999c7 btrfs: avoid link error with CONFIG_NO_AUTO_INLINE
[ Upstream commit 7e17916b35797396f681a3270245fd29c1e4c250 ]

Note: this patch fixes a problem in a feature outside of btrfs ("kernel
hacking: add a config option to disable compiler auto-inlining") and is
applied ahead of time due to cross-subsystem dependencies.

On 32-bit ARM with gcc-8, I see a link error with the addition of the
CONFIG_NO_AUTO_INLINE option:

fs/btrfs/super.o: In function `btrfs_statfs':
super.c:(.text+0x67b8): undefined reference to `__aeabi_uldivmod'
super.c:(.text+0x67fc): undefined reference to `__aeabi_uldivmod'
super.c:(.text+0x6858): undefined reference to `__aeabi_uldivmod'
super.c:(.text+0x6920): undefined reference to `__aeabi_uldivmod'
super.c:(.text+0x693c): undefined reference to `__aeabi_uldivmod'
fs/btrfs/super.o:super.c:(.text+0x6958): more undefined references to `__aeabi_uldivmod' follow

So far this is the only file that shows the behavior, so I'd propose
to just work around it by marking the functions as 'static inline'
that normally get inlined here.

The reference to __aeabi_uldivmod comes from a div_u64() which has an
optimization for a constant division that uses a straight '/' operator
when the result should be known to the compiler. My interpretation is
that as we turn off inlining, gcc still expects the result to be constant
but fails to use that constant value.

Link: https://lkml.kernel.org/r/20181103153941.1881966-1-arnd@arndb.de
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Changbin Du <changbin.du@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
[ add the note ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-01 09:14:04 +01:00
Nikolay Borisov
67631651aa btrfs: handle error of get_old_root
[ Upstream commit 315bed43fea532650933e7bba316a7601d439edf ]

In btrfs_search_old_slot get_old_root is always used with the assumption
it cannot fail. However, this is not true in rare circumstance it can
fail and return null. This will lead to null point dereference when the
header is read. Fix this by checking the return value and properly
handling NULL by setting ret to -EIO and returning gracefully.

Coverity-id: 1087503
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-01 09:13:22 +01:00
Qu Wenruo
e06e89fe9a btrfs: block-group: Fix a memory leak due to missing btrfs_put_block_group()
commit 4b654acdae850f48b8250b9a578a4eaa518c7a6f upstream.

In btrfs_read_block_groups(), if we have an invalid block group which
has mixed type (DATA|METADATA) while the fs doesn't have MIXED_GROUPS
feature, we error out without freeing the block group cache.

This patch will add the missing btrfs_put_block_group() to prevent
memory leak.

Note for stable backports: the file to patch in versions <= 5.3 is
fs/btrfs/extent-tree.c

Fixes: 49303381f19a ("Btrfs: bail out if block group has different mixed flag")
CC: stable@vger.kernel.org # 4.9+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-29 09:17:43 +01:00
Josef Bacik
f94575d0b4 btrfs: fix incorrect updating of log root tree
commit 4203e968947071586a98b5314fd7ffdea3b4f971 upstream.

We've historically had reports of being unable to mount file systems
because the tree log root couldn't be read.  Usually this is the "parent
transid failure", but could be any of the related errors, including
"fsid mismatch" or "bad tree block", depending on which block got
allocated.

The modification of the individual log root items are serialized on the
per-log root root_mutex.  This means that any modification to the
per-subvol log root_item is completely protected.

However we update the root item in the log root tree outside of the log
root tree log_mutex.  We do this in order to allow multiple subvolumes
to be updated in each log transaction.

This is problematic however because when we are writing the log root
tree out we update the super block with the _current_ log root node
information.  Since these two operations happen independently of each
other, you can end up updating the log root tree in between writing out
the dirty blocks and setting the super block to point at the current
root.

This means we'll point at the new root node that hasn't been written
out, instead of the one we should be pointing at.  Thus whatever garbage
or old block we end up pointing at complains when we mount the file
system later and try to replay the log.

Fix this by copying the log's root item into a local root item copy.
Then once we're safely under the log_root_tree->log_mutex we update the
root item in the log_root_tree.  This way we do not modify the
log_root_tree while we're committing it, fixing the problem.

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Chris Mason <clm@fb.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-17 13:43:50 -07:00
Filipe Manana
9aa376a13f Btrfs: fix race setting up and completing qgroup rescan workers
[ Upstream commit 13fc1d271a2e3ab8a02071e711add01fab9271f6 ]

There is a race between setting up a qgroup rescan worker and completing
a qgroup rescan worker that can lead to callers of the qgroup rescan wait
ioctl to either not wait for the rescan worker to complete or to hang
forever due to missing wake ups. The following diagram shows a sequence
of steps that illustrates the race.

        CPU 1                                                         CPU 2                                  CPU 3

 btrfs_ioctl_quota_rescan()
  btrfs_qgroup_rescan()
   qgroup_rescan_init()
    mutex_lock(&fs_info->qgroup_rescan_lock)
    spin_lock(&fs_info->qgroup_lock)

    fs_info->qgroup_flags |=
      BTRFS_QGROUP_STATUS_FLAG_RESCAN

    init_completion(
      &fs_info->qgroup_rescan_completion)

    fs_info->qgroup_rescan_running = true

    mutex_unlock(&fs_info->qgroup_rescan_lock)
    spin_unlock(&fs_info->qgroup_lock)

    btrfs_init_work()
     --> starts the worker

                                                        btrfs_qgroup_rescan_worker()
                                                         mutex_lock(&fs_info->qgroup_rescan_lock)

                                                         fs_info->qgroup_flags &=
                                                           ~BTRFS_QGROUP_STATUS_FLAG_RESCAN

                                                         mutex_unlock(&fs_info->qgroup_rescan_lock)

                                                         starts transaction, updates qgroup status
                                                         item, etc

                                                                                                           btrfs_ioctl_quota_rescan()
                                                                                                            btrfs_qgroup_rescan()
                                                                                                             qgroup_rescan_init()
                                                                                                              mutex_lock(&fs_info->qgroup_rescan_lock)
                                                                                                              spin_lock(&fs_info->qgroup_lock)

                                                                                                              fs_info->qgroup_flags |=
                                                                                                                BTRFS_QGROUP_STATUS_FLAG_RESCAN

                                                                                                              init_completion(
                                                                                                                &fs_info->qgroup_rescan_completion)

                                                                                                              fs_info->qgroup_rescan_running = true

                                                                                                              mutex_unlock(&fs_info->qgroup_rescan_lock)
                                                                                                              spin_unlock(&fs_info->qgroup_lock)

                                                                                                              btrfs_init_work()
                                                                                                               --> starts another worker

                                                         mutex_lock(&fs_info->qgroup_rescan_lock)

                                                         fs_info->qgroup_rescan_running = false

                                                         mutex_unlock(&fs_info->qgroup_rescan_lock)

							 complete_all(&fs_info->qgroup_rescan_completion)

Before the rescan worker started by the task at CPU 3 completes, if
another task calls btrfs_ioctl_quota_rescan(), it will get -EINPROGRESS
because the flag BTRFS_QGROUP_STATUS_FLAG_RESCAN is set at
fs_info->qgroup_flags, which is expected and correct behaviour.

However if other task calls btrfs_ioctl_quota_rescan_wait() before the
rescan worker started by the task at CPU 3 completes, it will return
immediately without waiting for the new rescan worker to complete,
because fs_info->qgroup_rescan_running is set to false by CPU 2.

This race is making test case btrfs/171 (from fstests) to fail often:

  btrfs/171 9s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//btrfs/171.out.bad)
#      --- tests/btrfs/171.out     2018-09-16 21:30:48.505104287 +0100
#      +++ /home/fdmanana/git/hub/xfstests/results//btrfs/171.out.bad      2019-09-19 02:01:36.938486039 +0100
#      @@ -1,2 +1,3 @@
#       QA output created by 171
#      +ERROR: quota rescan failed: Operation now in progress
#       Silence is golden
#      ...
#      (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/btrfs/171.out /home/fdmanana/git/hub/xfstests/results//btrfs/171.out.bad'  to see the entire diff)

That is because the test calls the btrfs-progs commands "qgroup quota
rescan -w", "qgroup assign" and "qgroup remove" in a sequence that makes
calls to the rescan start ioctl fail with -EINPROGRESS (note the "btrfs"
commands 'qgroup assign' and 'qgroup remove' often call the rescan start
ioctl after calling the qgroup assign ioctl,
btrfs_ioctl_qgroup_assign()), since previous waits didn't actually wait
for a rescan worker to complete.

Another problem the race can cause is missing wake ups for waiters,
since the call to complete_all() happens outside a critical section and
after clearing the flag BTRFS_QGROUP_STATUS_FLAG_RESCAN. In the sequence
diagram above, if we have a waiter for the first rescan task (executed
by CPU 2), then fs_info->qgroup_rescan_completion.wait is not empty, and
if after the rescan worker clears BTRFS_QGROUP_STATUS_FLAG_RESCAN and
before it calls complete_all() against
fs_info->qgroup_rescan_completion, the task at CPU 3 calls
init_completion() against fs_info->qgroup_rescan_completion which
re-initilizes its wait queue to an empty queue, therefore causing the
rescan worker at CPU 2 to call complete_all() against an empty queue,
never waking up the task waiting for that rescan worker.

Fix this by clearing BTRFS_QGROUP_STATUS_FLAG_RESCAN and setting
fs_info->qgroup_rescan_running to false in the same critical section,
delimited by the mutex fs_info->qgroup_rescan_lock, as well as doing the
call to complete_all() in that same critical section. This gives the
protection needed to avoid rescan wait ioctl callers not waiting for a
running rescan worker and the lost wake ups problem, since setting that
rescan flag and boolean as well as initializing the wait queue is done
already in a critical section delimited by that mutex (at
qgroup_rescan_init()).

Fixes: 57254b6ebce4ce ("Btrfs: add ioctl to wait for qgroup rescan completion")
Fixes: d2c609b834d62f ("btrfs: properly track when rescan worker is running")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-10-05 12:48:14 +02:00
Lu Fengqi
5f8c5f1f8b btrfs: qgroup: Drop quota_root and fs_info parameters from update_qgroup_status_item
[ Upstream commit 2e980acdd829742966c6a7e565ef3382c0717295 ]

They can be fetched from the transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-10-05 12:48:14 +02:00
Qu Wenruo
db04b74e00 btrfs: qgroup: Fix the wrong target io_tree when freeing reserved data space
commit bab32fc069ce8829c416e8737c119f62a57970f9 upstream.

[BUG]
Under the following case with qgroup enabled, if some error happened
after we have reserved delalloc space, then in error handling path, we
could cause qgroup data space leakage:

From btrfs_truncate_block() in inode.c:

	ret = btrfs_delalloc_reserve_space(inode, &data_reserved,
					   block_start, blocksize);
	if (ret)
		goto out;

 again:
	page = find_or_create_page(mapping, index, mask);
	if (!page) {
		btrfs_delalloc_release_space(inode, data_reserved,
					     block_start, blocksize, true);
		btrfs_delalloc_release_extents(BTRFS_I(inode), blocksize, true);
		ret = -ENOMEM;
		goto out;
	}

[CAUSE]
In the above case, btrfs_delalloc_reserve_space() will call
btrfs_qgroup_reserve_data() and mark the io_tree range with
EXTENT_QGROUP_RESERVED flag.

In the error handling path, we have the following call stack:
btrfs_delalloc_release_space()
|- btrfs_free_reserved_data_space()
   |- btrsf_qgroup_free_data()
      |- __btrfs_qgroup_release_data(reserved=@reserved, free=1)
         |- qgroup_free_reserved_data(reserved=@reserved)
            |- clear_record_extent_bits();
            |- freed += changeset.bytes_changed;

However due to a completion bug, qgroup_free_reserved_data() will clear
EXTENT_QGROUP_RESERVED flag in BTRFS_I(inode)->io_failure_tree, other
than the correct BTRFS_I(inode)->io_tree.
Since io_failure_tree is never marked with that flag,
btrfs_qgroup_free_data() will not free any data reserved space at all,
causing a leakage.

This type of error handling can only be triggered by errors outside of
qgroup code. So EDQUOT error from qgroup can't trigger it.

[FIX]
Fix the wrong target io_tree.

Reported-by: Josef Bacik <josef@toxicpanda.com>
Fixes: bc42bda22345 ("btrfs: qgroup: Fix qgroup reserved space underflow by only freeing reserved ranges")
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-05 12:48:10 +02:00
Nikolay Borisov
1ca2ae2358 btrfs: Relinquish CPUs in btrfs_compare_trees
commit 6af112b11a4bc1b560f60a618ac9c1dcefe9836e upstream.

When doing any form of incremental send the parent and the child trees
need to be compared via btrfs_compare_trees. This  can result in long
loop chains without ever relinquishing the CPU. This causes softlockup
detector to trigger when comparing trees with a lot of items. Example
report:

watchdog: BUG: soft lockup - CPU#0 stuck for 24s! [snapperd:16153]
CPU: 0 PID: 16153 Comm: snapperd Not tainted 5.2.9-1-default #1 openSUSE Tumbleweed (unreleased)
Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
pstate: 40000005 (nZcv daif -PAN -UAO)
pc : __ll_sc_arch_atomic_sub_return+0x14/0x20
lr : btrfs_release_extent_buffer_pages+0xe0/0x1e8 [btrfs]
sp : ffff00001273b7e0
Call trace:
 __ll_sc_arch_atomic_sub_return+0x14/0x20
 release_extent_buffer+0xdc/0x120 [btrfs]
 free_extent_buffer.part.0+0xb0/0x118 [btrfs]
 free_extent_buffer+0x24/0x30 [btrfs]
 btrfs_release_path+0x4c/0xa0 [btrfs]
 btrfs_free_path.part.0+0x20/0x40 [btrfs]
 btrfs_free_path+0x24/0x30 [btrfs]
 get_inode_info+0xa8/0xf8 [btrfs]
 finish_inode_if_needed+0xe0/0x6d8 [btrfs]
 changed_cb+0x9c/0x410 [btrfs]
 btrfs_compare_trees+0x284/0x648 [btrfs]
 send_subvol+0x33c/0x520 [btrfs]
 btrfs_ioctl_send+0x8a0/0xaf0 [btrfs]
 btrfs_ioctl+0x199c/0x2288 [btrfs]
 do_vfs_ioctl+0x4b0/0x820
 ksys_ioctl+0x84/0xb8
 __arm64_sys_ioctl+0x28/0x38
 el0_svc_common.constprop.0+0x7c/0x188
 el0_svc_handler+0x34/0x90
 el0_svc+0x8/0xc

Fix this by adding a call to cond_resched at the beginning of the main
loop in btrfs_compare_trees.

Fixes: 7069830a9e38 ("Btrfs: add btrfs_compare_trees function")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-05 12:48:09 +02:00
Filipe Manana
8c17312125 Btrfs: fix use-after-free when using the tree modification log
commit efad8a853ad2057f96664328a0d327a05ce39c76 upstream.

At ctree.c:get_old_root(), we are accessing a root's header owner field
after we have freed the respective extent buffer. This results in an
use-after-free that can lead to crashes, and when CONFIG_DEBUG_PAGEALLOC
is set, results in a stack trace like the following:

  [ 3876.799331] stack segment: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
  [ 3876.799363] CPU: 0 PID: 15436 Comm: pool Not tainted 5.3.0-rc3-btrfs-next-54 #1
  [ 3876.799385] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
  [ 3876.799433] RIP: 0010:btrfs_search_old_slot+0x652/0xd80 [btrfs]
  (...)
  [ 3876.799502] RSP: 0018:ffff9f08c1a2f9f0 EFLAGS: 00010286
  [ 3876.799518] RAX: ffff8dd300000000 RBX: ffff8dd85a7a9348 RCX: 000000038da26000
  [ 3876.799538] RDX: 0000000000000000 RSI: ffffe522ce368980 RDI: 0000000000000246
  [ 3876.799559] RBP: dae1922adadad000 R08: 0000000008020000 R09: ffffe522c0000000
  [ 3876.799579] R10: ffff8dd57fd788c8 R11: 000000007511b030 R12: ffff8dd781ddc000
  [ 3876.799599] R13: ffff8dd9e6240578 R14: ffff8dd6896f7a88 R15: ffff8dd688cf90b8
  [ 3876.799620] FS:  00007f23ddd97700(0000) GS:ffff8dda20200000(0000) knlGS:0000000000000000
  [ 3876.799643] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [ 3876.799660] CR2: 00007f23d4024000 CR3: 0000000710bb0005 CR4: 00000000003606f0
  [ 3876.799682] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  [ 3876.799703] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  [ 3876.799723] Call Trace:
  [ 3876.799735]  ? do_raw_spin_unlock+0x49/0xc0
  [ 3876.799749]  ? _raw_spin_unlock+0x24/0x30
  [ 3876.799779]  resolve_indirect_refs+0x1eb/0xc80 [btrfs]
  [ 3876.799810]  find_parent_nodes+0x38d/0x1180 [btrfs]
  [ 3876.799841]  btrfs_check_shared+0x11a/0x1d0 [btrfs]
  [ 3876.799870]  ? extent_fiemap+0x598/0x6e0 [btrfs]
  [ 3876.799895]  extent_fiemap+0x598/0x6e0 [btrfs]
  [ 3876.799913]  do_vfs_ioctl+0x45a/0x700
  [ 3876.799926]  ksys_ioctl+0x70/0x80
  [ 3876.799938]  ? trace_hardirqs_off_thunk+0x1a/0x20
  [ 3876.799953]  __x64_sys_ioctl+0x16/0x20
  [ 3876.799965]  do_syscall_64+0x62/0x220
  [ 3876.799977]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
  [ 3876.799993] RIP: 0033:0x7f23e0013dd7
  (...)
  [ 3876.800056] RSP: 002b:00007f23ddd96ca8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
  [ 3876.800078] RAX: ffffffffffffffda RBX: 00007f23d80210f8 RCX: 00007f23e0013dd7
  [ 3876.800099] RDX: 00007f23d80210f8 RSI: 00000000c020660b RDI: 0000000000000003
  [ 3876.800626] RBP: 000055fa2a2a2440 R08: 0000000000000000 R09: 00007f23ddd96d7c
  [ 3876.801143] R10: 00007f23d8022000 R11: 0000000000000246 R12: 00007f23ddd96d80
  [ 3876.801662] R13: 00007f23ddd96d78 R14: 00007f23d80210f0 R15: 00007f23ddd96d80
  (...)
  [ 3876.805107] ---[ end trace e53161e179ef04f9 ]---

Fix that by saving the root's header owner field into a local variable
before freeing the root's extent buffer, and then use that local variable
when needed.

Fixes: 30b0463a9394d9 ("Btrfs: fix accessing the root pointer in tree mod log functions")
CC: stable@vger.kernel.org # 3.10+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-05 12:48:09 +02:00