[ Upstream commit 538bcaa6261b77e71d37f5596c33127c1a3ec3f7 ]
The jbd2 superblock is lockless now, so there is probably a race
condition between writing it so disk and modifing contents of it, which
may lead to checksum error. The following race is the one case that we
have captured.
jbd2 fsstress
jbd2_journal_commit_transaction
jbd2_journal_update_sb_log_tail
jbd2_write_superblock
jbd2_superblock_csum_set jbd2_journal_revoke
jbd2_journal_set_features(revork)
modify superblock
submit_bh(checksum incorrect)
Fix this by locking the buffer head before modifing it. We always
write the jbd2 superblock after we modify it, so this just means
calling the lock_buffer() a little earlier.
This checksum corruption problem can be reproduced by xfstests
generic/475.
Reported-by: zhangyi (F) <yi.zhang@huawei.com>
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit cc4b1242d7e3b42eed73881fc749944146493e4f ]
The preadv2 and pwritev2 syscalls are supposed to emulate the readv and
writev syscalls when offset == -1. Therefore the compat code should
check for offset before calling do_compat_preadv64 and
do_compat_pwritev64. This is the case for the preadv2 and pwritev2
syscalls, but handling of offset == -1 is missing in their 64-bit
equivalent.
This patch fixes that, calling do_compat_readv and do_compat_writev when
offset == -1. This fixes the following glibc tests on x32:
- misc/tst-preadvwritev2
- misc/tst-preadvwritev64v2
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: H.J. Lu <hjl.tools@gmail.com>
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit dce30ca9e3b676fb288c33c1f4725a0621361185 ]
guard_bio_eod() can truncate a segment in bio to allow it to do IO on
odd last sectors of a device.
It already checks if the IO starts past EOD, but it does not consider
the possibility of an IO request starting within device boundaries can
contain more than one segment past EOD.
In such cases, truncated_bytes can be bigger than PAGE_SIZE, and will
underflow bvec->bv_len.
Fix this by checking if truncated_bytes is lower than PAGE_SIZE.
This situation has been found on filesystems such as isofs and vfat,
which doesn't check the device size before mount, if the device is
smaller than the filesystem itself, a readahead on such filesystem,
which spans EOD, can trigger this situation, leading a call to
zero_user() with a wrong size possibly corrupting memory.
I didn't see any crash, or didn't let the system run long enough to
check if memory corruption will be hit somewhere, but adding
instrumentation to guard_bio_end() to check truncated_bytes size, was
enough to see the error.
The following script can trigger the error.
MNT=/mnt
IMG=./DISK.img
DEV=/dev/loop0
mkfs.vfat $IMG
mount $IMG $MNT
cp -R /etc $MNT &> /dev/null
umount $MNT
losetup -D
losetup --find --show --sizelimit 16247280 $IMG
mount $DEV $MNT
find $MNT -type f -exec cat {} + >/dev/null
Kudos to Eric Sandeen for coming up with the reproducer above
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 6e876c3dd205d30b0db6850e97a03d75457df007 ]
In jbd2_journal_commit_transaction(), if we are in abort mode,
we may flush the buffer without setting descriptor block checksum
by goto start_journal_io. Then fs is mounted,
jbd2_descriptor_block_csum_verify() failed.
[ 271.379811] EXT4-fs (vdd): shut down requested (2)
[ 271.381827] Aborting journal on device vdd-8.
[ 271.597136] JBD2: Invalid checksum recovering block 22199 in log
[ 271.598023] JBD2: recovery failed
[ 271.598484] EXT4-fs (vdd): error loading journal
Fix this problem by keep setting descriptor block checksum if the
descriptor buffer is not NULL.
This checksum problem can be reproduced by xfstests generic/388.
Signed-off-by: luojiajun <luojiajun3@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 68e2672f8fbd1e04982b8d2798dd318bf2515dd2 ]
There is a NULL pointer dereference of devname in strspn()
The oops looks something like:
CIFS: Attempting to mount (null)
BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
...
RIP: 0010:strspn+0x0/0x50
...
Call Trace:
? cifs_parse_mount_options+0x222/0x1710 [cifs]
? cifs_get_volume_info+0x2f/0x80 [cifs]
cifs_setup_volume_info+0x20/0x190 [cifs]
cifs_get_volume_info+0x50/0x80 [cifs]
cifs_smb3_do_mount+0x59/0x630 [cifs]
? ida_alloc_range+0x34b/0x3d0
cifs_do_mount+0x11/0x20 [cifs]
mount_fs+0x52/0x170
vfs_kern_mount+0x6b/0x170
do_mount+0x216/0xdc0
ksys_mount+0x83/0xd0
__x64_sys_mount+0x25/0x30
do_syscall_64+0x65/0x220
entry_SYSCALL_64_after_hwframe+0x49/0xbe
Fix this by adding a NULL check on devname in cifs_parse_devname()
Signed-off-by: Yao Liu <yotta.liu@ucloud.cn>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 259594bea574e515a148171b5cd84ce5cbdc028a ]
When compiling with -Wformat, clang emits the following warnings:
fs/cifs/smb1ops.c:312:20: warning: format specifies type 'unsigned
short' but the argument has type 'unsigned int' [-Wformat]
tgt_total_cnt, total_in_tgt);
^~~~~~~~~~~~
fs/cifs/cifs_dfs_ref.c:289:4: warning: format specifies type 'short'
but the argument has type 'int' [-Wformat]
ref->flags, ref->server_type);
^~~~~~~~~~
fs/cifs/cifs_dfs_ref.c:289:16: warning: format specifies type 'short'
but the argument has type 'int' [-Wformat]
ref->flags, ref->server_type);
^~~~~~~~~~~~~~~~
fs/cifs/cifs_dfs_ref.c:291:4: warning: format specifies type 'short'
but the argument has type 'int' [-Wformat]
ref->ref_flag, ref->path_consumed);
^~~~~~~~~~~~~
fs/cifs/cifs_dfs_ref.c:291:19: warning: format specifies type 'short'
but the argument has type 'int' [-Wformat]
ref->ref_flag, ref->path_consumed);
^~~~~~~~~~~~~~~~~~
The types of these arguments are unconditionally defined, so this patch
updates the format character to the correct ones for ints and unsigned
ints.
Link: https://github.com/ClangBuiltLinux/linux/issues/378
Signed-off-by: Louis Taylor <louis@kragniz.eu>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 9083977dabf3833298ddcd40dee28687f1e6b483 ]
Fix below warning coming because of using mutex lock in atomic context.
BUG: sleeping function called from invalid context at kernel/locking/mutex.c:98
in_atomic(): 1, irqs_disabled(): 0, pid: 585, name: sh
Preemption disabled at: __radix_tree_preload+0x28/0x130
Call trace:
dump_backtrace+0x0/0x2b4
show_stack+0x20/0x28
dump_stack+0xa8/0xe0
___might_sleep+0x144/0x194
__might_sleep+0x58/0x8c
mutex_lock+0x2c/0x48
f2fs_trace_pid+0x88/0x14c
f2fs_set_node_page_dirty+0xd0/0x184
Do not use f2fs_radix_tree_insert() to avoid doing cond_resched() with
spin_lock() acquired.
Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit cc725ef3cb202ef2019a3c67c8913efa05c3cce6 ]
In the process of creating a node, it will cause NULL pointer
dereference in kernel if o2cb_ctl failed in the interval (mkdir,
o2cb_set_node_attribute(node_num)] in function o2cb_add_node.
The node num is initialized to 0 in function o2nm_node_group_make_item,
o2nm_node_group_drop_item will mistake the node number 0 for a valid
node number when we delete the node before the node number is set
correctly. If the local node number of the current host happens to be
0, cluster->cl_local_node will be set to O2NM_INVALID_NODE_NUM while
o2hb_thread still running. The panic stack is generated as follows:
o2hb_thread
\-o2hb_do_disk_heartbeat
\-o2hb_check_own_slot
|-slot = ®->hr_slots[o2nm_this_node()];
//o2nm_this_node() return O2NM_INVALID_NODE_NUM
We need to check whether the node number is set when we delete the node.
Link: http://lkml.kernel.org/r/133d8045-72cc-863e-8eae-5013f9f6bc51@huawei.com
Signed-off-by: Jia Guo <guojia12@huawei.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Acked-by: Jun Piao <piaojun@huawei.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <ge.changwei@h3c.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit bc31d0cdcfbadb6258b45db97e93b1c83822ba33 ]
We have a customer reporting crashes in lock_get_status() with many
"Leaked POSIX lock" messages preceeding the crash.
Leaked POSIX lock on dev=0x0:0x56 ...
Leaked POSIX lock on dev=0x0:0x56 ...
Leaked POSIX lock on dev=0x0:0x56 ...
Leaked POSIX lock on dev=0x0:0x53 ...
Leaked POSIX lock on dev=0x0:0x53 ...
Leaked POSIX lock on dev=0x0:0x53 ...
Leaked POSIX lock on dev=0x0:0x53 ...
POSIX: fl_owner=ffff8900e7b79380 fl_flags=0x1 fl_type=0x1 fl_pid=20709
Leaked POSIX lock on dev=0x0:0x4b ino...
Leaked locks on dev=0x0:0x4b ino=0xf911400000029:
POSIX: fl_owner=ffff89f41c870e00 fl_flags=0x1 fl_type=0x1 fl_pid=19592
stack segment: 0000 [#1] SMP
Modules linked in: binfmt_misc msr tcp_diag udp_diag inet_diag unix_diag af_packet_diag netlink_diag rpcsec_gss_krb5 arc4 ecb auth_rpcgss nfsv4 md4 nfs nls_utf8 lockd grace cifs sunrpc ccm dns_resolver fscache af_packet iscsi_ibft iscsi_boot_sysfs vmw_vsock_vmci_transport vsock xfs libcrc32c sb_edac edac_core crct10dif_pclmul crc32_pclmul ghash_clmulni_intel drbg ansi_cprng vmw_balloon aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd joydev pcspkr vmxnet3 i2c_piix4 vmw_vmci shpchp fjes processor button ac btrfs xor raid6_pq sr_mod cdrom ata_generic sd_mod ata_piix vmwgfx crc32c_intel drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm serio_raw ahci libahci drm libata vmw_pvscsi sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod autofs4
Supported: Yes
CPU: 6 PID: 28250 Comm: lsof Not tainted 4.4.156-94.64-default #1
Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/05/2016
task: ffff88a345f28740 ti: ffff88c74005c000 task.ti: ffff88c74005c000
RIP: 0010:[<ffffffff8125dcab>] [<ffffffff8125dcab>] lock_get_status+0x9b/0x3b0
RSP: 0018:ffff88c74005fd90 EFLAGS: 00010202
RAX: ffff89bde83e20ae RBX: ffff89e870003d18 RCX: 0000000049534f50
RDX: ffffffff81a3541f RSI: ffffffff81a3544e RDI: ffff89bde83e20ae
RBP: 0026252423222120 R08: 0000000020584953 R09: 000000000000ffff
R10: 0000000000000000 R11: ffff88c74005fc70 R12: ffff89e5ca7b1340
R13: 00000000000050e5 R14: ffff89e870003d30 R15: ffff89e5ca7b1340
FS: 00007fafd64be800(0000) GS:ffff89f41fd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000001c80018 CR3: 000000a522048000 CR4: 0000000000360670
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Stack:
0000000000000208 ffffffff81a3d6b6 ffff89e870003d30 ffff89e870003d18
ffff89e5ca7b1340 ffff89f41738d7c0 ffff89e870003d30 ffff89e5ca7b1340
ffffffff8125e08f 0000000000000000 ffff89bc22b67d00 ffff88c74005ff28
Call Trace:
[<ffffffff8125e08f>] locks_show+0x2f/0x70
[<ffffffff81230ad1>] seq_read+0x251/0x3a0
[<ffffffff81275bbc>] proc_reg_read+0x3c/0x70
[<ffffffff8120e456>] __vfs_read+0x26/0x140
[<ffffffff8120e9da>] vfs_read+0x7a/0x120
[<ffffffff8120faf2>] SyS_read+0x42/0xa0
[<ffffffff8161cbc3>] entry_SYSCALL_64_fastpath+0x1e/0xb7
When Linux closes a FD (close(), close-on-exec, dup2(), ...) it calls
filp_close() which also removes all posix locks.
The lock struct is initialized like so in filp_close() and passed
down to cifs
...
lock.fl_type = F_UNLCK;
lock.fl_flags = FL_POSIX | FL_CLOSE;
lock.fl_start = 0;
lock.fl_end = OFFSET_MAX;
...
Note the FL_CLOSE flag, which hints the VFS code that this unlocking
is done for closing the fd.
filp_close()
locks_remove_posix(filp, id);
vfs_lock_file(filp, F_SETLK, &lock, NULL);
return filp->f_op->lock(filp, cmd, fl) => cifs_lock()
rc = cifs_setlk(file, flock, type, wait_flag, posix_lck, lock, unlock, xid);
rc = server->ops->mand_unlock_range(cfile, flock, xid);
if (flock->fl_flags & FL_POSIX && !rc)
rc = locks_lock_file_wait(file, flock)
Notice how we don't call locks_lock_file_wait() which does the
generic VFS lock/unlock/wait work on the inode if rc != 0.
If we are closing the handle, the SMB server is supposed to remove any
locks associated with it. Similarly, cifs.ko frees and wakes up any
lock and lock waiter when closing the file:
cifs_close()
cifsFileInfo_put(file->private_data)
/*
* Delete any outstanding lock records. We'll lose them when the file
* is closed anyway.
*/
down_write(&cifsi->lock_sem);
list_for_each_entry_safe(li, tmp, &cifs_file->llist->locks, llist) {
list_del(&li->llist);
cifs_del_lock_waiters(li);
kfree(li);
}
list_del(&cifs_file->llist->llist);
kfree(cifs_file->llist);
up_write(&cifsi->lock_sem);
So we can safely ignore unlocking failures in cifs_lock() if they
happen with the FL_CLOSE flag hint set as both the server and the
client take care of it during the actual closing.
This is not a proper fix for the unlocking failure but it's safe and
it seems to prevent the lock leakages and crashes the customer
experiences.
Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: NeilBrown <neil@brown.name>
Signed-off-by: Steve French <stfrench@microsoft.com>
Acked-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 5e86bdda41534e17621d5a071b294943cae4376e upstream.
Currently, we are releasing the indirect buffer where we are done with
it in ext4_ind_remove_space(), so we can see the brelse() and
BUFFER_TRACE() everywhere. It seems fragile and hard to read, and we
may probably forget to release the buffer some day. This patch cleans
up the code by putting of the code which releases the buffers to the
end of the function.
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jari Ruusu <jari.ruusu@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit e6a9467ea14bae8691b0f72c500510c42ea8edb8 upstream.
ocfs2_reflink_inodes_lock() can swap the inode1/inode2 variables so that
we always grab cluster locks in order of increasing inode number.
Unfortunately, we forget to swap the inode record buffer head pointers
when we've done this, which leads to incorrect bookkeepping when we're
trying to make the two inodes have the same refcount tree.
This has the effect of causing filesystem shutdowns if you're trying to
reflink data from inode 100 into inode 97, where inode 100 already has a
refcount tree attached and inode 97 doesn't. The reflink code decides
to copy the refcount tree pointer from 100 to 97, but uses inode 97's
inode record to open the tree root (which it doesn't have) and blows up.
This issue causes filesystem shutdowns and metadata corruption!
Link: http://lkml.kernel.org/r/20190312214910.GK20533@magnolia
Fixes: 29ac8e856cb369 ("ocfs2: implement the VFS clone_range, copy_range, and dedupe_range features")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 73601ea5b7b18eb234219ae2adf77530f389da79 upstream.
syzbot is hitting lockdep warning [1] due to trying to open a fifo
during an execve() operation. But we don't need to open non regular
files during an execve() operation, for all files which we will need are
the executable file itself and the interpreter programs like /bin/sh and
ld-linux.so.2 .
Since the manpage for execve(2) says that execve() returns EACCES when
the file or a script interpreter is not a regular file, and the manpage
for uselib(2) says that uselib() can return EACCES, and we use
FMODE_EXEC when opening for execve()/uselib(), we can bail out if a non
regular file is requested with FMODE_EXEC set.
Since this deadlock followed by khungtaskd warnings is trivially
reproducible by a local unprivileged user, and syzbot's frequent crash
due to this deadlock defers finding other bugs, let's workaround this
deadlock until we get a chance to find a better solution.
[1] https://syzkaller.appspot.com/bug?id=b5095bfec44ec84213bac54742a82483aad578ce
Link: http://lkml.kernel.org/r/1552044017-7890-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
Reported-by: syzbot <syzbot+e93a80c1bb7c5c56e522461c149f8bf55eab1b2b@syzkaller.appspotmail.com>
Fixes: 8924feff66f35fe2 ("splice: lift pipe_lock out of splice_to_pipe()")
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Biggers <ebiggers3@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: <stable@vger.kernel.org> [4.9+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 0cb98abb5bd13b9a636bde603d952d722688b428 upstream.
Allow the async rpc task for finish and update the open state if needed,
then free the slot. Otherwise, the async rpc unable to decode the reply.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Fixes: ae55e59da0e4 ("pnfs: Don't release the sequence slot...")
Cc: stable@vger.kernel.org # v4.18+
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 3897b6f0a859288c22fb793fad11ec2327e60fcd upstream.
Parity page is incorrectly unmapped in finish_parity_scrub(), triggering
a reference counter bug on i386, i.e.:
[ 157.662401] kernel BUG at mm/highmem.c:349!
[ 157.666725] invalid opcode: 0000 [#1] SMP PTI
The reason is that kunmap(p_page) was completely left out, so we never
did an unmap for the p_page and the loop unmapping the rbio page was
iterating over the wrong number of stripes: unmapping should be done
with nr_data instead of rbio->real_stripes.
Test case to reproduce the bug:
- create a raid5 btrfs filesystem:
# mkfs.btrfs -m raid5 -d raid5 /dev/sdb /dev/sdc /dev/sdd /dev/sde
- mount it:
# mount /dev/sdb /mnt
- run btrfs scrub in a loop:
# while :; do btrfs scrub start -BR /mnt; done
BugLink: https://bugs.launchpad.net/bugs/1812845
Fixes: 5a6ac9eacb49 ("Btrfs, raid56: support parity scrub on raid56")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 2cc8334270e281815c3850c3adea363c51f21e0d upstream.
When Filipe added the recursive directory logging stuff in
2f2ff0ee5e430 ("Btrfs: fix metadata inconsistencies after directory
fsync") he specifically didn't take the directory i_mutex for the
children directories that we need to log because of lockdep. This is
generally fine, but can lead to this WARN_ON() tripping if we happen to
run delayed deletion's in between our first search and our second search
of dir_item/dir_indexes for this directory. We expect this to happen,
so the WARN_ON() isn't necessary. Drop the WARN_ON() and add a comment
so we know why this case can happen.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit bf504110bc8aa05df48b0e5f0aa84bfb81e0574b upstream.
If we do a shrinking truncate against an inode which is already present
in the respective log tree and then rename it, as part of logging the new
name we end up logging an inode item that reflects the old size of the
file (the one which we previously logged) and not the new smaller size.
The decision to preserve the size previously logged was added by commit
1a4bcf470c886b ("Btrfs: fix fsync data loss after adding hard link to
inode") in order to avoid data loss after replaying the log. However that
decision is only needed for the case the logged inode size is smaller then
the current size of the inode, as explained in that commit's change log.
If the current size of the inode is smaller then the previously logged
size, we know a shrinking truncate happened and therefore need to use
that smaller size.
Example to trigger the problem:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ xfs_io -f -c "pwrite -S 0xab 0 8000" /mnt/foo
$ xfs_io -c "fsync" /mnt/foo
$ xfs_io -c "truncate 3000" /mnt/foo
$ mv /mnt/foo /mnt/bar
$ xfs_io -c "fsync" /mnt/bar
<power failure>
$ mount /dev/sdb /mnt
$ od -t x1 -A d /mnt/bar
0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab
*
0008000
Once we rename the file, we log its name (and inode item), and because
the inode was already logged before in the current transaction, we log it
with a size of 8000 bytes because that is the size we previously logged
(with the first fsync). As part of the rename, besides logging the inode,
we do also sync the log, which is done since commit d4682ba03ef618
("Btrfs: sync log after logging new name"), so the next fsync against our
inode is effectively a no-op, since no new changes happened since the
rename operation. Even if did not sync the log during the rename
operation, the same problem (fize size of 8000 bytes instead of 3000
bytes) would be visible after replaying the log if the log ended up
getting synced to disk through some other means, such as for example by
fsyncing some other modified file. In the example above the fsync after
the rename operation is there just because not every filesystem may
guarantee logging/journalling the inode (and syncing the log/journal)
during the rename operation, for example it is needed for f2fs, but not
for ext4 and xfs.
Fix this scenario by, when logging a new name (which is triggered by
rename and link operations), using the current size of the inode instead
of the previously logged inode size.
A test case for fstests follows soon.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202695
CC: stable@vger.kernel.org # 4.4+
Reported-by: Seulbae Kim <seulbae@gatech.edu>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 674a2b27234d1b7afcb0a9162e81b2e53aeef217 upstream.
All indirect buffers get by ext4_find_shared() should be released no
mater the branch should be freed or not. But now, we forget to release
the lower depth indirect buffers when removing space from the same
higher depth indirect block. It will lead to buffer leak and futher
more, it may lead to quota information corruption when using old quota,
consider the following case.
- Create and mount an empty ext4 filesystem without extent and quota
features,
- quotacheck and enable the user & group quota,
- Create some files and write some data to them, and then punch hole
to some files of them, it may trigger the buffer leak problem
mentioned above.
- Disable quota and run quotacheck again, it will create two new
aquota files and write the checked quota information to them, which
probably may reuse the freed indirect block(the buffer and page
cache was not freed) as data block.
- Enable quota again, it will invoke
vfs_load_quota_inode()->invalidate_bdev() to try to clean unused
buffers and pagecache. Unfortunately, because of the buffer of quota
data block is still referenced, quota code cannot read the up to date
quota info from the device and lead to quota information corruption.
This problem can be reproduced by xfstests generic/231 on ext3 file
system or ext4 file system without extent and quota features.
This patch fix this problem by releasing the missing indirect buffers,
in ext4_ind_remove_space().
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 372a03e01853f860560eade508794dd274e9b390 upstream.
Ext4 needs to serialize unaligned direct AIO because the zeroing of
partial blocks of two competing unaligned AIOs can result in data
corruption.
However it decides not to serialize if the potentially unaligned aio is
past i_size with the rationale that no pending writes are possible past
i_size. Unfortunately if the i_size is not block aligned and the second
unaligned write lands past i_size, but still into the same block, it has
the potential of corrupting the previous unaligned write to the same
block.
This is (very simplified) reproducer from Frank
// 41472 = (10 * 4096) + 512
// 37376 = 41472 - 4096
ftruncate(fd, 41472);
io_prep_pwrite(iocbs[0], fd, buf[0], 4096, 37376);
io_prep_pwrite(iocbs[1], fd, buf[1], 4096, 41472);
io_submit(io_ctx, 1, &iocbs[1]);
io_submit(io_ctx, 1, &iocbs[2]);
io_getevents(io_ctx, 2, 2, events, NULL);
Without this patch the 512B range from 40960 up to the start of the
second unaligned write (41472) is going to be zeroed overwriting the data
written by the first write. This is a data corruption.
00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
00009200 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30
*
0000a000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0000a200 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31
With this patch the data corruption is avoided because we will recognize
the unaligned_aio and wait for the unwritten extent conversion.
00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
00009200 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30
*
0000a200 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31
*
0000b200
Reported-by: Frank Sorenson <fsorenso@redhat.com>
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Fixes: e9e3bcecf44c ("ext4: serialize unaligned asynchronous DIO")
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit fa30dde38aa8628c73a6dded7cb0bba38c27b576 upstream.
We see the following NULL pointer dereference while running xfstests
generic/475:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
PGD 8000000c84bad067 P4D 8000000c84bad067 PUD c84e62067 PMD 0
Oops: 0000 [#1] SMP PTI
CPU: 7 PID: 9886 Comm: fsstress Kdump: loaded Not tainted 5.0.0-rc8 #10
RIP: 0010:ext4_do_update_inode+0x4ec/0x760
...
Call Trace:
? jbd2_journal_get_write_access+0x42/0x50
? __ext4_journal_get_write_access+0x2c/0x70
? ext4_truncate+0x186/0x3f0
ext4_mark_iloc_dirty+0x61/0x80
ext4_mark_inode_dirty+0x62/0x1b0
ext4_truncate+0x186/0x3f0
? unmap_mapping_pages+0x56/0x100
ext4_setattr+0x817/0x8b0
notify_change+0x1df/0x430
do_truncate+0x5e/0x90
? generic_permission+0x12b/0x1a0
This is triggered because the NULL pointer handle->h_transaction was
dereferenced in function ext4_update_inode_fsync_trans().
I found that the h_transaction was set to NULL in jbd2__journal_restart
but failed to attached to a new transaction while the journal is aborted.
Fix this by checking the handle before updating the inode.
Fixes: b436b9bef84d ("ext4: Wait for proper transaction commit on fsync")
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit d3ca4651d05c0ff7259d087d8c949bcf3e14fb46 upstream.
When truncate(2) hits IO error when reading indirect extent block the
code just bugs with:
kernel BUG at linux-4.15.0/fs/udf/truncate.c:249!
...
Fix the problem by bailing out cleanly in case of IO error.
CC: stable@vger.kernel.org
Reported-by: jean-luc malet <jeanluc.malet@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit c1dffe0bf7f9c3d57d9f237a7cb2a81e62babd2b upstream.
If we have to retransmit a request, we should ensure that we reinitialise
the sequence results structure, since in the event of a signal
we need to treat the request as if it had not been sent.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit dd838821f0a29781b185cd8fb8e48d5c177bd838 upstream.
Commit 62a063b8e7d1 "nfsd4: fix crash on writing v4_end_grace before
nfsd startup" is trying to fix a NULL dereference issue, but it
mistakenly checks if the nfsd server is started. So fix it.
Fixes: 62a063b8e7d1 "nfsd4: fix crash on writing v4_end_grace before nfsd startup"
Cc: stable@vger.kernel.org
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit b602345da6cbb135ba68cf042df8ec9a73da7981 upstream.
If the result of an NFSv3 readdir{,plus} request results in the
"offset" on one entry having to be split across 2 pages, and is sized
so that the next directory entry doesn't fit in the requested size,
then memory corruption can happen.
When encode_entry() is called after encoding the last entry that fits,
it notices that ->offset and ->offset1 are set, and so stores the
offset value in the two pages as required. It clears ->offset1 but
*does not* clear ->offset.
Normally this omission doesn't matter as encode_entry_baggage() will
be called, and will set ->offset to a suitable value (not on a page
boundary).
But in the case where cd->buflen < elen and nfserr_toosmall is
returned, ->offset is not reset.
This means that nfsd3proc_readdirplus will see ->offset with a value 4
bytes before the end of a page, and ->offset1 set to NULL.
It will try to write 8bytes to ->offset.
If we are lucky, the next page will be read-only, and the system will
BUG: unable to handle kernel paging request at...
If we are unlucky, some innocent page will have the first 4 bytes
corrupted.
nfsd3proc_readdir() doesn't even check for ->offset1, it just blindly
writes 8 bytes to the offset wherever it is.
Fix this by clearing ->offset after it is used, and copying the
->offset handling code from nfsd3_proc_readdirplus into
nfsd3_proc_readdir.
(Note that the commit hash in the Fixes tag is from the 'history'
tree - this bug predates git).
Fixes: 0b1d57cf7654 ("[PATCH] kNFSd: Fix nfs3 dentry encoding")
Fixes-URL: https://git.kernel.org/pub/scm/linux/kernel/git/history/history.git/commit/?id=0b1d57cf7654
Cc: stable@vger.kernel.org (v2.6.12+)
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 8127d82705998568b52ac724e28e00941538083d upstream.
If the I/O completion failed with a fatal error, then we should just
exit nfs_pageio_complete_mirror() rather than try to recoalesce.
Fixes: a7d42ddb3099 ("nfs: add mirroring support to pgio layer")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: stable@vger.kernel.org # v4.0+
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 4d91969ed4dbcefd0e78f77494f0cb8fada9048a upstream.
Whether we need to exit early, or just reprocess the list, we
must not lost track of the request which failed to get recoalesced.
Fixes: 03d5eb65b538 ("NFS: Fix a memory leak in nfs_do_recoalesce")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: stable@vger.kernel.org # v4.0+
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit f57dcf4c72113c745d83f1c65f7291299f65c14f upstream.
When we fail to add the request to the I/O queue, we currently leave it
to the caller to free the failed request. However since some of the
requests that fail are actually created by nfs_pageio_add_request()
itself, and are not passed back the caller, this leads to a leakage
issue, which can again cause page locks to leak.
This commit addresses the leakage by freeing the created requests on
error, using desc->pg_completion_ops->error_cleanup()
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Fixes: a7d42ddb30997 ("nfs: add mirroring support to pgio layer")
Cc: stable@vger.kernel.org # v4.0: c18b96a1b862: nfs: clean up rest of reqs
Cc: stable@vger.kernel.org # v4.0: d600ad1f2bdb: NFS41: pop some layoutget
Cc: stable@vger.kernel.org # v4.0+
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 01215d3edb0f384ddeaa5e4a22c1ae5ff634149f upstream.
The jh pointer may be used uninitialized in the two cases below and the
compiler complain about it when enabling JBUFFER_TRACE macro, fix them.
In file included from fs/jbd2/transaction.c:19:0:
fs/jbd2/transaction.c: In function ‘jbd2_journal_get_undo_access’:
./include/linux/jbd2.h:1637:38: warning: ‘jh’ is used uninitialized in this function [-Wuninitialized]
#define JBUFFER_TRACE(jh, info) do { printk("%s: %d\n", __func__, jh->b_jcount);} while (0)
^
fs/jbd2/transaction.c:1219:23: note: ‘jh’ was declared here
struct journal_head *jh;
^
In file included from fs/jbd2/transaction.c:19:0:
fs/jbd2/transaction.c: In function ‘jbd2_journal_dirty_metadata’:
./include/linux/jbd2.h:1637:38: warning: ‘jh’ may be used uninitialized in this function [-Wmaybe-uninitialized]
#define JBUFFER_TRACE(jh, info) do { printk("%s: %d\n", __func__, jh->b_jcount);} while (0)
^
fs/jbd2/transaction.c:1332:23: note: ‘jh’ was declared here
struct journal_head *jh;
^
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 904cdbd41d749a476863a0ca41f6f396774f26e4 upstream.
Now, we capture a data corruption problem on ext4 while we're truncating
an extent index block. Imaging that if we are revoking a buffer which
has been journaled by the committing transaction, the buffer's jbddirty
flag will not be cleared in jbd2_journal_forget(), so the commit code
will set the buffer dirty flag again after refile the buffer.
fsx kjournald2
jbd2_journal_commit_transaction
jbd2_journal_revoke commit phase 1~5...
jbd2_journal_forget
belongs to older transaction commit phase 6
jbddirty not clear __jbd2_journal_refile_buffer
__jbd2_journal_unfile_buffer
test_clear_buffer_jbddirty
mark_buffer_dirty
Finally, if the freed extent index block was allocated again as data
block by some other files, it may corrupt the file data after writing
cached pages later, such as during unmount time. (In general,
clean_bdev_aliases() related helpers should be invoked after
re-allocation to prevent the above corruption, but unfortunately we
missed it when zeroout the head of extra extent blocks in
ext4_ext_handle_unwritten_extents()).
This patch mark buffer as freed and set j_next_transaction to the new
transaction when it already belongs to the committing transaction in
jbd2_journal_forget(), so that commit code knows it should clear dirty
bits when it is done with the buffer.
This problem can be reproduced by xfstests generic/455 easily with
seeds (3246 3247 3248 3249).
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 1c2d14212b15a60300a2d4f6364753e87394c521 upstream.
When ext2 filesystem is created with 64k block size, ext2_max_size()
will return value less than 0. Also, we cannot write any file in this fs
since the sb->maxbytes is less than 0. The core of the problem is that
the size of block index tree for such large block size is more than
i_blocks can carry. So fix the computation to count with this
possibility.
File size limits computed with the new function for the full range of
possible block sizes look like:
bits file_size
10 17247252480
11 275415851008
12 2196873666560
13 2197948973056
14 2198486220800
15 2198754754560
16 2198888906752
CC: stable@vger.kernel.org
Reported-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit f96c3ac8dfc24b4e38fc4c2eba5fea2107b929d1 upstream.
When computing maximum size of filesystem possible with given number of
group descriptor blocks, we forget to include s_first_data_block into
the number of blocks. Thus for filesystems with non-zero
s_first_data_block it can happen that computed maximum filesystem size
is actually lower than current filesystem size which confuses the code
and eventually leads to a BUG_ON in ext4_alloc_group_tables() hitting on
flex_gd->count == 0. The problem can be reproduced like:
truncate -s 100g /tmp/image
mkfs.ext4 -b 1024 -E resize=262144 /tmp/image 32768
mount -t ext4 -o loop /tmp/image /mnt
resize2fs /dev/loop0 262145
resize2fs /dev/loop0 300000
Fix the problem by properly including s_first_data_block into the
computed number of filesystem blocks.
Fixes: 1c6bd7173d66 "ext4: convert file system to meta_bg if needed..."
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit abdc644e8cbac2e9b19763680e5a7cf9bab2bee7 upstream.
The reason is that while swapping two inode, we swap the flags too.
Some flags such as EXT4_JOURNAL_DATA_FL can really confuse the things
since we're not resetting the address operations structure. The
simplest way to keep things sane is to restrict the flags that can be
swapped.
Signed-off-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 8e928218780e2f1cf2f5891c7575e8f0b284fcce upstream.
In the past we had data corruption when reading compressed extents that
are shared within the same file and they are consecutive, this got fixed
by commit 005efedf2c7d0 ("Btrfs: fix read corruption of compressed and
shared extents") and by commit 808f80b46790f ("Btrfs: update fix for read
corruption of compressed and shared extents"). However there was a case
that was missing in those fixes, which is when the shared and compressed
extents are referenced with a non-zero offset. The following shell script
creates a reproducer for this issue:
#!/bin/bash
mkfs.btrfs -f /dev/sdc &> /dev/null
mount -o compress /dev/sdc /mnt/sdc
# Create a file with 3 consecutive compressed extents, each has an
# uncompressed size of 128Kb and a compressed size of 4Kb.
for ((i = 1; i <= 3; i++)); do
head -c 4096 /dev/zero
for ((j = 1; j <= 31; j++)); do
head -c 4096 /dev/zero | tr '\0' "\377"
done
done > /mnt/sdc/foobar
sync
echo "Digest after file creation: $(md5sum /mnt/sdc/foobar)"
# Clone the first extent into offsets 128K and 256K.
xfs_io -c "reflink /mnt/sdc/foobar 0 128K 128K" /mnt/sdc/foobar
xfs_io -c "reflink /mnt/sdc/foobar 0 256K 128K" /mnt/sdc/foobar
sync
echo "Digest after cloning: $(md5sum /mnt/sdc/foobar)"
# Punch holes into the regions that are already full of zeroes.
xfs_io -c "fpunch 0 4K" /mnt/sdc/foobar
xfs_io -c "fpunch 128K 4K" /mnt/sdc/foobar
xfs_io -c "fpunch 256K 4K" /mnt/sdc/foobar
sync
echo "Digest after hole punching: $(md5sum /mnt/sdc/foobar)"
echo "Dropping page cache..."
sysctl -q vm.drop_caches=1
echo "Digest after hole punching: $(md5sum /mnt/sdc/foobar)"
umount /dev/sdc
When running the script we get the following output:
Digest after file creation: 5a0888d80d7ab1fd31c229f83a3bbcc8 /mnt/sdc/foobar
linked 131072/131072 bytes at offset 131072
128 KiB, 1 ops; 0.0033 sec (36.960 MiB/sec and 295.6830 ops/sec)
linked 131072/131072 bytes at offset 262144
128 KiB, 1 ops; 0.0015 sec (78.567 MiB/sec and 628.5355 ops/sec)
Digest after cloning: 5a0888d80d7ab1fd31c229f83a3bbcc8 /mnt/sdc/foobar
Digest after hole punching: 5a0888d80d7ab1fd31c229f83a3bbcc8 /mnt/sdc/foobar
Dropping page cache...
Digest after hole punching: fba694ae8664ed0c2e9ff8937e7f1484 /mnt/sdc/foobar
This happens because after reading all the pages of the extent in the
range from 128K to 256K for example, we read the hole at offset 256K
and then when reading the page at offset 260K we don't submit the
existing bio, which is responsible for filling all the page in the
range 128K to 256K only, therefore adding the pages from range 260K
to 384K to the existing bio and submitting it after iterating over the
entire range. Once the bio completes, the uncompressed data fills only
the pages in the range 128K to 256K because there's no more data read
from disk, leaving the pages in the range 260K to 384K unfilled. It is
just a slightly different variant of what was solved by commit
005efedf2c7d0 ("Btrfs: fix read corruption of compressed and shared
extents").
Fix this by forcing a bio submit, during readpages(), whenever we find a
compressed extent map for a page that is different from the extent map
for the previous page or has a different starting offset (in case it's
the same compressed extent), instead of the extent map's original start
offset.
A test case for fstests follows soon.
Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Fixes: 808f80b46790f ("Btrfs: update fix for read corruption of compressed and shared extents")
Fixes: 005efedf2c7d0 ("Btrfs: fix read corruption of compressed and shared extents")
Cc: stable@vger.kernel.org # 4.3+
Tested-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 349ae63f40638a28c6fce52e8447c2d14b84cc0c upstream.
We recently had a customer issue with a corrupted filesystem. When
trying to mount this image btrfs panicked with a division by zero in
calc_stripe_length().
The corrupt chunk had a 'num_stripes' value of 1. calc_stripe_length()
takes this value and divides it by the number of copies the RAID profile
is expected to have to calculate the amount of data stripes. As a DUP
profile is expected to have 2 copies this division resulted in 1/2 = 0.
Later then the 'data_stripes' variable is used as a divisor in the
stripe length calculation which results in a division by 0 and thus a
kernel panic.
When encountering a filesystem with a DUP block group and a
'num_stripes' value unequal to 2, refuse mounting as the image is
corrupted and will lead to unexpected behaviour.
Code inspection showed a RAID1 block group has the same issues.
Fixes: e06cd3dd7cea ("Btrfs: add validadtion checks for chunk loading")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit a0873490660246db587849a9e172f2b7b21fa88a upstream.
We are holding a transaction handle when setting an acl, therefore we can
not allocate the xattr value buffer using GFP_KERNEL, as we could deadlock
if reclaim is triggered by the allocation, therefore setup a nofs context.
Fixes: 39a27ec1004e8 ("btrfs: use GFP_KERNEL for xattr and acl allocations")
CC: stable@vger.kernel.org # 4.9+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit a0ce2f0aa6ad97c3d4927bf2ca54bcebdf062d55 upstream.
Before this patch, it was possible for two pipes to affect each other after
data had been transferred between them with tee():
============
$ cat tee_test.c
int main(void) {
int pipe_a[2];
if (pipe(pipe_a)) err(1, "pipe");
int pipe_b[2];
if (pipe(pipe_b)) err(1, "pipe");
if (write(pipe_a[1], "abcd", 4) != 4) err(1, "write");
if (tee(pipe_a[0], pipe_b[1], 2, 0) != 2) err(1, "tee");
if (write(pipe_b[1], "xx", 2) != 2) err(1, "write");
char buf[5];
if (read(pipe_a[0], buf, 4) != 4) err(1, "read");
buf[4] = 0;
printf("got back: '%s'\n", buf);
}
$ gcc -o tee_test tee_test.c
$ ./tee_test
got back: 'abxx'
$
============
As suggested by Al Viro, fix it by creating a separate type for
non-mergeable pipe buffers, then changing the types of buffers in
splice_pipe_to_pipe() and link_pipe().
Cc: <stable@vger.kernel.org>
Fixes: 7c77f0b3f920 ("splice: implement pipe to pipe splicing")
Fixes: 70524490ee2e ("[PATCH] splice: add support for sys_tee()")
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 73052b0daee0b750b39af18460dfec683e4f5887 upstream.
d_delete only unhashes an entry if it is reached with
dentry->d_lockref.count != 1. Prior to commit 8ead9dd54716 ("devpts:
more pty driver interface cleanups"), d_delete was called on a dentry
from devpts_pty_kill with two references held, which would trigger the
unhashing, and the subsequent dputs would release it.
Commit 8ead9dd54716 reworked devpts_pty_kill to stop acquiring the second
reference from d_find_alias, and the d_delete call left the dentries
still on the hashed list without actually ever being dropped from dcache
before explicit cleanup. This causes the number of negative dentries for
devpts to pile up, and an `ls /dev/pts` invocation can take seconds to
return.
Provide always_delete_dentry() from simple_dentry_operations
as .d_delete for devpts, to make the dentry be dropped from dcache.
Without this cleanup, the number of dentries in /dev/pts/ can be grown
arbitrarily as:
`python -c 'import pty; pty.spawn(["ls", "/dev/pts"])'`
A systemtap probe on dcache_readdir to count d_subdirs shows this count
to increase with each pty spawn invocation above:
probe kernel.function("dcache_readdir") {
subdirs = &@cast($file->f_path->dentry, "dentry")->d_subdirs;
p = subdirs;
p = @cast(p, "list_head")->next;
i = 0
while (p != subdirs) {
p = @cast(p, "list_head")->next;
i = i+1;
}
printf("number of dentries: %d\n", i);
}
Fixes: 8ead9dd54716 ("devpts: more pty driver interface cleanups")
Signed-off-by: Varad Gautam <vrd@amazon.de>
Reported-by: Zheng Wang <wanz@amazon.de>
Reported-by: Brandon Schwartz <bsschwar@amazon.de>
Root-caused-by: Maximilian Heyne <mheyne@amazon.de>
Root-caused-by: Nicolas Pernas Maradei <npernas@amazon.de>
CC: David Woodhouse <dwmw@amazon.co.uk>
CC: Maximilian Heyne <mheyne@amazon.de>
CC: Stefan Nuernberger <snu@amazon.de>
CC: Amit Shah <aams@amazon.de>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
CC: Al Viro <viro@ZenIV.linux.org.uk>
CC: Christian Brauner <christian.brauner@ubuntu.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
CC: Matthew Wilcox <willy@infradead.org>
CC: Eric Biggers <ebiggers@google.com>
CC: <stable@vger.kernel.org> # 4.9+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 6dfbd84684700cb58b34e8602c01c12f3d2595c8 upstream.
When we have a READ lease for a file and have just issued a write
operation to the server we need to purge the cache and set oplock/lease
level to NONE to avoid reading stale data. Currently we do that
only if a write operation succedeed thus not covering cases when
a request was sent to the server but a negative error code was
returned later for some other reasons (e.g. -EIOCBQUEUED or -EINTR).
Fix this by turning off caching regardless of the error code being
returned.
The patches fixes generic tests 075 and 112 from the xfs-tests.
Cc: <stable@vger.kernel.org>
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 7b9b9edb49ad377b1e06abf14354c227e9ac4b06 upstream.
Currently on lease break the client sets a caching level twice:
when oplock is detected and when oplock is processed. While the
1st attempt sets the level to the value provided by the server,
the 2nd one resets the level to None unconditionally.
This happens because the oplock/lease processing code was changed
to avoid races between page cache flushes and oplock breaks.
The commit c11f1df5003d534 ("cifs: Wait for writebacks to complete
before attempting write.") fixed the races for oplocks but didn't
apply the same changes for leases resulting in overwriting the
server granted value to None. Fix this by properly processing
lease breaks.
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 399504e21a10be16dd1408ba0147367d9d82a10c upstream.
same story as with last May fixes in sysfs (7b745a4e4051
"unfuck sysfs_mount()"); new_sb is left uninitialized
in case of early errors in kernfs_mount_ns() and papering
over it by treating any error from kernfs_mount_ns() as
equivalent to !new_ns ends up conflating the cases when
objects had never been transferred to a superblock with
ones when that has happened and resulting new superblock
had been dropped. Easily fixed (same way as in sysfs
case). Additionally, there's a superblock leak on
kernfs_node_dentry() failure *and* a dentry leak inside
kernfs_node_dentry() itself - the latter on probably
impossible errors, but the former not impossible to trigger
(as the matter of fact, injecting allocation failures
at that point *does* trigger it).
Cc: stable@kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 822ad64d7e46a8e2c8b8a796738d7b657cbb146d ]
In the request_key() upcall mechanism there's a dependency loop by which if
a key type driver overrides the ->request_key hook and the userspace side
manages to lose the authorisation key, the auth key and the internal
construction record (struct key_construction) can keep each other pinned.
Fix this by the following changes:
(1) Killing off the construction record and using the auth key instead.
(2) Including the operation name in the auth key payload and making the
payload available outside of security/keys/.
(3) The ->request_key hook is given the authkey instead of the cons
record and operation name.
Changes (2) and (3) allow the auth key to naturally be cleaned up if the
keyring it is in is destroyed or cleared or the auth key is unlinked.
Fixes: 7ee02a316600 ("keys: Fix dependency loop between construction record and auth key")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: James Morris <james.morris@microsoft.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit d2ceb7e57086750ea6198a31fd942d98099a0786 ]
If nfs_page_async_flush() removes the page from the mapping, then we can't
use page_file_mapping() on it as nfs_updatepate() is wont to do when
receiving an error. Instead, push the mapping to the stack before the page
is possibly truncated.
Fixes: 8fc75bed96bb ("NFS: Fix up return value on fatal errors in nfs_page_async_flush()")
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 5e3cc1ee1405a7eb3487ed24f786dec01b4cbe1f upstream.
Use inode->i_lock to protect i_size_write(), else i_size_read() in
generic_fillattr() may loop infinitely in read_seqcount_begin() when
multiple processes invoke v9fs_vfs_getattr() or v9fs_vfs_getattr_dotl()
simultaneously under 32-bit SMP environment, and a soft lockup will be
triggered as show below:
watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [stat:2217]
Modules linked in:
CPU: 5 PID: 2217 Comm: stat Not tainted 5.0.0-rc1-00005-g7f702faf5a9e #4
Hardware name: Generic DT based system
PC is at generic_fillattr+0x104/0x108
LR is at 0xec497f00
pc : [<802b8898>] lr : [<ec497f00>] psr: 200c0013
sp : ec497e20 ip : ed608030 fp : ec497e3c
r10: 00000000 r9 : ec497f00 r8 : ed608030
r7 : ec497ebc r6 : ec497f00 r5 : ee5c1550 r4 : ee005780
r3 : 0000052d r2 : 00000000 r1 : ec497f00 r0 : ed608030
Flags: nzCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none
Control: 10c5387d Table: ac48006a DAC: 00000051
CPU: 5 PID: 2217 Comm: stat Not tainted 5.0.0-rc1-00005-g7f702faf5a9e #4
Hardware name: Generic DT based system
Backtrace:
[<8010d974>] (dump_backtrace) from [<8010dc88>] (show_stack+0x20/0x24)
[<8010dc68>] (show_stack) from [<80a1d194>] (dump_stack+0xb0/0xdc)
[<80a1d0e4>] (dump_stack) from [<80109f34>] (show_regs+0x1c/0x20)
[<80109f18>] (show_regs) from [<801d0a80>] (watchdog_timer_fn+0x280/0x2f8)
[<801d0800>] (watchdog_timer_fn) from [<80198658>] (__hrtimer_run_queues+0x18c/0x380)
[<801984cc>] (__hrtimer_run_queues) from [<80198e60>] (hrtimer_run_queues+0xb8/0xf0)
[<80198da8>] (hrtimer_run_queues) from [<801973e8>] (run_local_timers+0x28/0x64)
[<801973c0>] (run_local_timers) from [<80197460>] (update_process_times+0x3c/0x6c)
[<80197424>] (update_process_times) from [<801ab2b8>] (tick_nohz_handler+0xe0/0x1bc)
[<801ab1d8>] (tick_nohz_handler) from [<80843050>] (arch_timer_handler_virt+0x38/0x48)
[<80843018>] (arch_timer_handler_virt) from [<80180a64>] (handle_percpu_devid_irq+0x8c/0x240)
[<801809d8>] (handle_percpu_devid_irq) from [<8017ac20>] (generic_handle_irq+0x34/0x44)
[<8017abec>] (generic_handle_irq) from [<8017b344>] (__handle_domain_irq+0x6c/0xc4)
[<8017b2d8>] (__handle_domain_irq) from [<801022e0>] (gic_handle_irq+0x4c/0x88)
[<80102294>] (gic_handle_irq) from [<80101a30>] (__irq_svc+0x70/0x98)
[<802b8794>] (generic_fillattr) from [<8056b284>] (v9fs_vfs_getattr_dotl+0x74/0xa4)
[<8056b210>] (v9fs_vfs_getattr_dotl) from [<802b8904>] (vfs_getattr_nosec+0x68/0x7c)
[<802b889c>] (vfs_getattr_nosec) from [<802b895c>] (vfs_getattr+0x44/0x48)
[<802b8918>] (vfs_getattr) from [<802b8a74>] (vfs_statx+0x9c/0xec)
[<802b89d8>] (vfs_statx) from [<802b9428>] (sys_lstat64+0x48/0x78)
[<802b93e0>] (sys_lstat64) from [<80101000>] (ret_fast_syscall+0x0/0x28)
[dominique.martinet@cea.fr: updated comment to not refer to a function
in another subsystem]
Link: http://lkml.kernel.org/r/20190124063514.8571-2-houtao1@huawei.com
Cc: stable@vger.kernel.org
Fixes: 7549ae3e81cc ("9p: Use the i_size_[read, write]() macros instead of using inode->i_size directly.")
Reported-by: Xing Gaopeng <xingaopeng@huawei.com>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 605b0487f0bc1ae9963bf52ece0f5c8055186f81 upstream.
Mark Syms has reported seeing tasks that are stuck waiting in
find_insert_glock. It turns out that struct lm_lockname contains four padding
bytes on 64-bit architectures that function glock_waitqueue doesn't skip when
hashing the glock name. As a result, we can end up waking up the wrong
waitqueue, and the waiting tasks may be stuck forever.
Fix that by using ht_parms.key_len instead of sizeof(struct lm_lockname) for
the key length.
Reported-by: Mark Syms <mark.syms@citrix.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 43636c804df0126da669c261fc820fb22f62bfc2 ]
When something let __find_get_block_slow() hit all_mapped path, it calls
printk() for 100+ times per a second. But there is no need to print same
message with such high frequency; it is just asking for stall warning, or
at least bloating log files.
[ 399.866302][T15342] __find_get_block_slow() failed. block=1, b_blocknr=8
[ 399.873324][T15342] b_state=0x00000029, b_size=512
[ 399.878403][T15342] device loop0 blocksize: 4096
[ 399.883296][T15342] __find_get_block_slow() failed. block=1, b_blocknr=8
[ 399.890400][T15342] b_state=0x00000029, b_size=512
[ 399.895595][T15342] device loop0 blocksize: 4096
[ 399.900556][T15342] __find_get_block_slow() failed. block=1, b_blocknr=8
[ 399.907471][T15342] b_state=0x00000029, b_size=512
[ 399.912506][T15342] device loop0 blocksize: 4096
This patch reduces frequency to up to once per a second, in addition to
concatenating three lines into one.
[ 399.866302][T15342] __find_get_block_slow() failed. block=1, b_blocknr=8, b_state=0x00000029, b_size=512, device loop0 blocksize: 4096
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit f585b283e3f025754c45bbe7533fc6e5c4643700 ]
In autofs_fill_super() on error of get inode/make root dentry the return
should be ENOMEM as this is the only failure case of the called
functions.
Link: http://lkml.kernel.org/r/154725123240.11260.796773942606871359.stgit@pluto-themaw-net
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 63ce5f552beb9bdb41546b3a26c4374758b21815 ]
autofs_expire_run() calls dput(dentry) to drop the reference count of
dentry. However, dentry is read via autofs_dentry_ino(dentry) after
that. This may result in a use-free-bug. The patch drops the reference
count of dentry only when it is never used.
Link: http://lkml.kernel.org/r/154725122396.11260.16053424107144453867.stgit@pluto-themaw-net
Signed-off-by: Pan Bian <bianpan2016@163.com>
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit c27d82f52f75fc9d8d9d40d120d2a96fdeeada5e ]
When superblock has lots of inodes without any pagecache (like is the
case for /proc), drop_pagecache_sb() will iterate through all of them
without dropping sb->s_inode_list_lock which can lead to softlockups
(one of our customers hit this).
Fix the problem by going to the slow path and doing cond_resched() in
case the process needs rescheduling.
Link: http://lkml.kernel.org/r/20190114085343.15011-1-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>