Summary of http://lkml.org/lkml/2014/3/14/363 :
Ted: module_param(queue_depth, int, 444)
Joe: 0444!
Rusty: User perms >= group perms >= other perms?
Joe: CLASS_ATTR, DEVICE_ATTR, SENSOR_ATTR and SENSOR_ATTR_2?
Side effect of stricter permissions means removing the unnecessary
S_IFREG from several callers.
Note that the BUILD_BUG_ON_ZERO((perm) & 2) test was removed: a fair
number of drivers fail this test, so that will be the debate for a
future patch.
Suggested-by: Joe Perches <joe@perches.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com> for drivers/pci/slot.c
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We can get false negative from __lookup_mnt() if an unrelated vfsmount
gets moved. In that case legitimize_mnt() is guaranteed to fail,
and we will fall back to non-RCU walk... unless we end up running
into a hard error on a filesystem object we wouldn't have reached
if not for that false negative. IOW, delaying that check until
the end of pathname resolution is wrong - we should recheck right
after we attempt to cross the mountpoint. We don't need to recheck
unless we see d_mountpoint() being true - in that case even if
we have just raced with mount/umount, we can simply go on as if
we'd come at the moment when the sucker wasn't a mountpoint; if we
run into a hard error as the result, it was a legitimate outcome.
__lookup_mnt() returning NULL is different in that respect, since
it might've happened due to operation on completely unrelated
mountpoint.
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In all callchains leading to prepend_name(), the value left in *buflen
is eventually discarded unused if prepend_name() has returned a negative.
So we are free to do what prepend() does, and subtract from *buflen
*before* checking for underflow (which turns into checking the sign
of subtraction result, of course).
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Commit bd2a31d522344 ("get rid of fget_light()") introduced the
__fdget_pos() function, which returns the resulting file pointer and
fdput flags combined in an 'unsigned long'. However, it also changed the
behavior to return files with FMODE_PATH set, which shouldn't happen
because read(), write(), lseek(), etc. aren't allowed on such files.
This commit restores the old behavior.
This regression actually had no effect on read() and write() since
FMODE_READ and FMODE_WRITE are not set on file descriptors opened with
O_PATH, but it did cause lseek() on a file descriptor opened with O_PATH
to fail with ESPIPE rather than EBADF.
Signed-off-by: Eric Biggers <ebiggers3@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Commit 9c225f2655e36a4 ("vfs: atomic f_pos accesses as per POSIX") changed
several system calls to use fdget_pos() instead of fdget(), but missed
sys_llseek(). Fix it.
Signed-off-by: Eric Biggers <ebiggers3@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
xfstests's btrfs/035 triggers a BUG_ON, which we use to detect the split
of inline extents in __btrfs_drop_extents().
For inline extents, we cannot duplicate another EXTENT_DATA item, because
it breaks the rule of inline extents, that is, 'start offset' needs to be 0.
We have set limitations for the source inode's compressed inline extents,
because it needs to decompress and recompress. Now the destination inode's
inline extents also need similar limitations.
With this, xfstests btrfs/035 doesn't run into panic.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
I added an optimization for large files where we would stop searching for
backrefs once we had looked at the number of references we currently had for
this extent. This works great most of the time, but for snapshots that point to
this extent and has changes in the original root this assumption falls on it
face. So keep track of any delayed ref mods made and add in the actual ref
count as reported by the extent item and use that to limit how far down an inode
we'll search for extents. Thanks,
Reportedy-by: Hugo Mills <hugo@carfax.org.uk>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reported-by: Hugo Mills <hugo@carfax.org.uk>
Tested-by: Hugo Mills <hugo@carfax.org.uk>
Signed-off-by: Chris Mason <clm@fb.com>
For an incremental send, fix the process of determining whether the directory
inode we're currently processing needs to have its move/rename operation delayed.
We were ignoring the fact that if the inode's new immediate ancestor has a higher
inode number than ours but wasn't renamed/moved, we might still need to delay our
move/rename, because some other ancestor directory higher in the hierarchy might
have an inode number higher than ours *and* was renamed/moved too - in this case
we have to wait for rename/move of that ancestor to happen before our current
directory's rename/move operation.
Simple steps to reproduce this issue:
$ mkfs.btrfs -f /dev/sdd
$ mount /dev/sdd /mnt
$ mkdir -p /mnt/a/x1/x2
$ mkdir /mnt/a/Z
$ mkdir -p /mnt/a/x1/x2/x3/x4/x5
$ btrfs subvolume snapshot -r /mnt /mnt/snap1
$ btrfs send /mnt/snap1 -f /tmp/base.send
$ mv /mnt/a/x1/x2/x3 /mnt/a/Z/X33
$ mv /mnt/a/x1/x2 /mnt/a/Z/X33/x4/x5/X22
$ btrfs subvolume snapshot -r /mnt /mnt/snap2
$ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send
The incremental send caused the kernel code to enter an infinite loop when
building the path string for directory Z after its references are processed.
A more complex scenario:
$ mkfs.btrfs -f /dev/sdd
$ mount /dev/sdd /mnt
$ mkdir -p /mnt/a/b/c/d
$ mkdir /mnt/a/b/c/d/e
$ mkdir /mnt/a/b/c/d/f
$ mv /mnt/a/b/c/d/e /mnt/a/b/c/d/f/E2
$ mkdir /mmt/a/b/c/g
$ mv /mnt/a/b/c/d /mnt/a/b/D2
$ btrfs subvolume snapshot -r /mnt /mnt/snap1
$ btrfs send /mnt/snap1 -f /tmp/base.send
$ mkdir /mnt/a/o
$ mv /mnt/a/b/c/g /mnt/a/b/D2/f/G2
$ mv /mnt/a/b/D2 /mnt/a/b/dd
$ mv /mnt/a/b/c /mnt/a/C2
$ mv /mnt/a/b/dd/f /mnt/a/o/FF
$ mv /mnt/a/b /mnt/a/o/FF/E2/BB
$ btrfs subvolume snapshot -r /mnt /mnt/snap2
$ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send
A test case for xfstests follows.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
It's possible to change the parent/child relationship between directories
in such a way that if a child directory has a higher inode number than
its parent, it doesn't necessarily means the child rename/move operation
can be performed immediately. The parent migth have its own rename/move
operation delayed, therefore in this case the child needs to have its
rename/move operation delayed too, and be performed after its new parent's
rename/move.
Steps to reproduce the issue:
$ umount /mnt
$ mkfs.btrfs -f /dev/sdd
$ mount /dev/sdd /mnt
$ mkdir /mnt/A
$ mkdir /mnt/B
$ mkdir /mnt/C
$ mv /mnt/C /mnt/A
$ mv /mnt/B /mnt/A/C
$ mkdir /mnt/A/C/D
$ btrfs subvolume snapshot -r /mnt /mnt/snap1
$ btrfs send /mnt/snap1 -f /tmp/base.send
$ mv /mnt/A/C/D /mnt/A/D2
$ mv /mnt/A/C/B /mnt/A/D2/B2
$ mv /mnt/A/C /mnt/A/D2/B2/C2
$ btrfs subvolume snapshot -r /mnt /mnt/snap2
$ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send
The incremental send caused the kernel code to enter an infinite loop when
building the path string for directory C after its references are processed.
The necessary conditions here are that C has an inode number higher than both
A and B, and B as an higher inode number higher than A, and D has the highest
inode number, that is:
inode_number(A) < inode_number(B) < inode_number(C) < inode_number(D)
The same issue could happen if after the first snapshot there's any number
of intermediary parent directories between A2 and B2, and between B2 and C2.
A test case for xfstests follows, covering this simple case and more advanced
ones, with files and hard links created inside the directories.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
No need to search in the send tree for the generation number of the inode,
we already have it in the recorded_ref structure passed to us.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
While we update an existing ref head's extent_op, we're not holding
its spinlock, so while we're updating its extent_op contents (key,
flags) we can have a task running __btrfs_run_delayed_refs() that
holds the ref head's lock and sets its extent_op to NULL right after
the task updating the ref head just checked its extent_op was not NULL.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
Since most of the btrfs_workqueue is printed as pointer address,
for easier analysis, add trace for btrfs_workqueue alloc/destroy.
So it is possible to determine the workqueue that a given work belongs
to(by comparing the wq pointer address with alloc trace event).
Signed-off-by: Qu Wenruo <quenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
When finding new extents during an autodefrag, don't do so many fs tree
lookups to find an extent with a size smaller then the target treshold.
Instead, after each fs tree forward search immediately unlock upper
levels and process the entire leaf while holding a read lock on the leaf,
since our leaf processing is very fast.
This reduces lock contention, allowing for higher concurrency when other
tasks want to write/update items related to other inodes in the fs tree,
as we're not holding read locks on upper tree levels while processing the
leaf and we do less tree searches.
Test:
sysbench --test=fileio --file-num=512 --file-total-size=16G \
--file-test-mode=rndrw --num-threads=32 --file-block-size=32768 \
--file-rw-ratio=3 --file-io-mode=sync --max-time=1800 \
--max-requests=10000000000 [prepare|run]
(fileystem mounted with -o autodefrag, averages of 5 runs)
Before this change: 58.852Mb/sec throughtput, read 77.589Gb, written 25.863Gb
After this change: 63.034Mb/sec throughtput, read 83.102Gb, written 27.701Gb
Test machine: quad core intel i5-3570K, 32Gb of RAM, SSD.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
The error message is confusing:
# btrfs sub delete /mnt/mysub/
Delete subvolume '/mnt/mysub'
ERROR: cannot delete '/mnt/mysub' - Directory not empty
The error message does not make sense to me: It's not about deleting a
directory but it's a subvolume, and it doesn't matter if the subvolume is
empty or not.
Maybe EPERM or is more appropriate in this case, combined with an explanatory
kernel log message. (e.g. "subvolume with ID 123 cannot be deleted because
it is configured as default subvolume.")
Reported-by: Koen De Wit <koen.de.wit@oracle.com>
Signed-off-by: Guangyu Sun <guangyu.sun@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
When locking file ranges in the inode's io_tree, cache the first
extent state that belongs to the target range, so that when unlocking
the range we don't need to search in the io_tree again, reducing cpu
time and making and therefore holding the io_tree's lock for a shorter
period.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
Zach found this deadlock that would happen like this
btrfs_end_transaction <- reduce trans->use_count to 0
btrfs_run_delayed_refs
btrfs_cow_block
find_free_extent
btrfs_start_transaction <- increase trans->use_count to 1
allocate chunk
btrfs_end_transaction <- decrease trans->use_count to 0
btrfs_run_delayed_refs
lock tree block we are cowing above ^^
We need to only decrease trans->use_count if it is above 1, otherwise leave it
alone. This will make nested trans be the only ones who decrease their added
ref, and will let us get rid of the trans->use_count++ hack if we have to commit
the transaction. Thanks,
cc: stable@vger.kernel.org
Reported-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Tested-by: Zach Brown <zab@redhat.com>
Signed-off-by: Chris Mason <clm@fb.com>
If multiple redundant fsync calls are triggered, we don't need to write its
node pages with fsync mark continuously.
So, this patch adds FI_NEED_FSYNC to track whether the latest node block is
written with the fsync mark or not.
If the mark was set, a new fsync doesn't need to write a node block.
Otherwise, we should do a new node block with the mark for roll-forward
recovery.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch introduces fi->i_sem to protect fi's info that includes xattr_ver,
pino, i_nlink.
This enables to remove i_mutex during f2fs_sync_file, resulting in performance
improvement when a number of fsync calls are triggered from many concurrent
threads.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
It is more reasonable to determine the reclaiming rate of prefree segments
according to the volume size, which is set to 5% by default.
For example, if the volume is 128GB, the prefree segments are reclaimed
when the number reaches to 6.4GB.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
The NM_WOUT_THRESHOLD is now obsolete since f2fs starts to control on a basis
of the memory footprint.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch introduces ram_thresh, a sysfs entry, which controls the memory
footprint used by the free nid list and the nat cache.
Previously, the free nid list was controlled by MAX_FREE_NIDS, while the nat
cache was managed by NM_WOUT_THRESHOLD.
However, this approach cannot be applied dynamically according to the system.
So, this patch adds ram_thresh that users can specify the threshold, which is
in order of 1 / 1024.
For example, if the total ram size is 4GB and the value is set to 10 by default,
f2fs tries to control the number of free nids and nat caches not to consume over
10 * (4GB / 1024) = 10MB.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
The try_to_free_nats should not receive the negative nr_shrink.
Otherwise, it can drop all the nat entries by the while loop.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
If a page is on writeback, f2fs can face with deadlock due to under writepages.
This is caused by merging IOs inside f2fs, so if it comes to detect, let's throw
merged IOs, which is implemented by f2fs_wait_on_page_writeback.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
The Hurd file system uses uses the inode field which is now used for
i_version for its translator block. This means that ext2 file systems
that are formatted for GNU Hurd can't be used to support NFSv4. Given
that Hurd file systems don't support extents, and a huge number of
modern file system features, this is no great loss.
If we don't do this, the attempt to update the i_version field will
stomp over the translator block field, which will cause file system
corruption for Hurd file systems. This can be replicated via:
mke2fs -t ext2 -o hurd /dev/vdc
mount -t ext4 /dev/vdc /vdc
touch /vdc/bug0000
umount /dev/vdc
e2fsck -f /dev/vdc
Addresses-Debian-Bug: #738758
Reported-By: Gabriele Giacone <1o5g4r8o@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Here is a revised patch based on Steve's feedback:
This patch eliminates function gfs2_set_mode which was only called in
one place, and always returned 0.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch eliminates function gfs2_security_init in favor of just
calling security_inode_init_security directly.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch increases the maximum number of ACLs from 25 to 300 for
a 4K block size. The value is adjusted accordingly if the block size
is smaller. Note that this is an arbitrary limit with a performance
tradeoff, and that the physical limit is slightly over 500.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
If a timeout or a signal interrupts the NFSv4 trunking discovery
SETCLIENTID_CONFIRM call, then we don't know whether or not the
server has changed the callback identifier on us.
Assume that it did, and schedule a 'path down' recovery...
Tested-by: Steve Dickson <steved@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
This patch adds new interfaces to create and destory cache,
ext4_xattr_create_cache() and ext4_xattr_destroy_cache(), and remove
the cache creation and destory calls from ex4_init_xattr() and
ext4_exitxattr() in fs/ext4/xattr.c.
fs/ext4/super.c has been changed so that when a filesystem is mounted
a cache is allocated and attched to its ext4_sb_info structure.
fs/mbcache.c has been changed so that only one slab allocator is
allocated and used by all mbcache structures.
Signed-off-by: T. Makphaibulchoke <tmac@hp.com>
The patch increases the parallelism of mbcache by using the built-in
lock in the hlist_bl_node to protect the mb_cache's local block and
index hash chains. The global data mb_cache_lru_list and
mb_cache_list continue to be protected by the global
mb_cache_spinlock.
New block group spinlock, mb_cache_bg_lock is also added to serialize
accesses to mb_cache_entry's local data.
A new member e_refcnt is added to the mb_cache_entry structure to help
preventing an mb_cache_entry from being deallocated by a free while it
is being referenced by either mb_cache_entry_get() or
mb_cache_entry_find().
Signed-off-by: T. Makphaibulchoke <tmac@hp.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This patch changes each mb_cache's both block and index hash chains to
use a hlist_bl_node, which contains a built-in lock. This is the
first step in decoupling of locks serializing accesses to mb_cache
global data and each mb_cache_entry local data.
Signed-off-by: T. Makphaibulchoke <tmac@hp.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Introduce new FALLOC_FL_ZERO_RANGE flag for fallocate. This has the same
functionality as xfs ioctl XFS_IOC_ZERO_RANGE.
It can be used to convert a range of file to zeros preferably without
issuing data IO. Blocks should be preallocated for the regions that span
holes in the file, and the entire range is preferable converted to
unwritten extents
This can be also used to preallocate blocks past EOF in the same way as
with fallocate. Flag FALLOC_FL_KEEP_SIZE which should cause the inode
size to remain the same.
Also add appropriate tracepoints.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Move block allocation out of the ext4_fallocate into separate function
called ext4_alloc_file_blocks(). This will allow us to use the same
allocation code for other allocation operations such as zero range which
is commit in the next patch.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Currently in ext4_fallocate we would update inode size, c_time and sync
the file with every partial allocation which is entirely unnecessary. It
is true that if the crash happens in the middle of truncate we might end
up with unchanged i size, or c_time which I do not think is really a
problem - it does not mean file system corruption in any way. Note that
xfs is doing things the same way e.g. update all of the mentioned after
the allocation is done.
This commit moves all the updates after the allocation is done. In
addition we also need to change m_time as not only inode has been change
bot also data regions might have changed (unwritten extents). However
m_time will be only updated when i_size changed.
Also we do not need to be paranoid about changing the c_time only if the
actual allocation have happened, we can change it even if we try to
allocate only to find out that there are already block allocated. It's
not really a big deal and it will save us some additional complexity.
Also use ext4_debug, instead of ext4_warning in #ifdef EXT4FS_DEBUG
section.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>-
--
v3: Do not remove the code to set EXT4_INODE_EOFBLOCKS flag
fs/ext4/extents.c | 96 ++++++++++++++++++++++++-------------------------------
1 file changed, 42 insertions(+), 54 deletions(-)
This patch introduces nr_pages_to_write to align page writes to the segment
or other operational unit size, which can be tuned according to the system
environment.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch introduces nr_pages_to_skip(sbi, type) to determine writepages can
be skipped.
The dentry, node, and meta pages can be conrolled by F2FS without breaking the
FS consistency.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Previously 'background_gc={on***,off***}' is being parsed as correct option,
with this patch we cloud fix the trivial bug in mount process.
Change log from v1:
o need to check length of parameter suggested by Jaegeuk Kim.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
We should return error number of read_normal_summaries instead of -EINVAL when
read_normal_summaries failed.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch introduces a help function f2fs_has_xattr_block for better
readability.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
After sucessful decompressing, the buffer which pointed by 'buf' will be
lost as 'buf' is overwrite by 'big_oops_buf' and will never be freed.
Signed-off-by: Liu ShuoX <shuox.liu@intel.com>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
In case new offset is equal to prz->buffer_size, it won't wrap at this
time and will return old(overflow) value next time.
Signed-off-by: Liu ShuoX <shuox.liu@intel.com>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
In case that ramoops_init_przs failed, max_dump_cnt won't be reset to
zero in error handle path.
Signed-off-by: Liu ShuoX <shuox.liu@intel.com>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
ramoops_get_next_prz get the prz according the paramters. If it get a
uninitialized prz, access its members by following persistent_ram_old_size(prz)
will cause a NULL pointer crash.
Ex: if ftrace_size is 0, fprz will be NULL.
Fix it by return NULL in advance.
Signed-off-by: Liu ShuoX <shuox.liu@intel.com>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>