Attempt to use the new DLM operations. If it is not supported, use the
traditional ocfs2_controld.
To exchange ocfs2 versioning, we use the LVB of the version dlm lock.
It first attempts to take the lock in EX mode (non-blocking). If
successful (which means it is the first mount), it writes the version
number and downconverts to PR lock. If it is unsuccessful, it reads the
version from the lock.
If this becomes the standard (with o2cb as well), it could simplify
userspace tools to check if the filesystem is mounted on other nodes.
Dan: Since ocfs2_protocol_version are two u8 values, the additional
checks with LONG* don't make sense.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use the native DLM locks for version control negotiation. Most of the
framework is taken from gfs2/lock_dlm.c
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is done to differentiate between using and not using controld and
use the connection information accordingly.
We need to be backward compatible. So, we use a new enum
ocfs2_connection_type to identify when controld is used and when it is
not.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We perform this because the DLM recovery callbacks will require the
ocfs2_live_connection structure to record the node information when
dlm_new_lockspace() is updated (in the last patch of the series).
Before calling dlm_new_lockspace(), we need the structure ready for the
.recover_done() callback, which would set oc_this_node. This is the
reason we allocate ocfs2_live_connection beforehand in user_connect().
[AKPM] rc initialization is not required because it assigned in case of
errors. It will be cleared by compiler anyways.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reveiwed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
These are the callbacks called by the fs/dlm code in case the membership
changes. If there is a failure while/during calling any of these, the
DLM creates a new membership and relays to the rest of the nodes.
- recover_prep() is called when DLM understands a node is down.
- recover_slot() is called once all nodes have acknowledged
recover_prep and recovery can begin.
- recover_done() is called once the recovery is complete. It returns
the new membership.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is an effort of removing ocfs2_controld.pcmk and getting ocfs2 DLM
handling up to the times with respect to DLM (>=4.0.1) and corosync
(2.3.x). AFAIK, cman also is being phased out for a unified corosync
cluster stack.
fs/dlm performs all the functions with respect to fencing and node
management and provides the API's to do so for ocfs2. For all future
references, DLM stands for fs/dlm code.
The advantages are:
+ No need to run an additional userspace daemon (ocfs2_controld)
+ No controld device handling and controld protocol
+ Shifting responsibilities of node management to DLM layer
For backward compatibility, we are keeping the controld handling code.
Once enough time has passed we can remove a significant portion of the
code. This was tested by using the kernel with changes on older
unmodified tools. The kernel used ocfs2_controld as expected, and
displayed the appropriate warning message.
This feature requires modification in the userspace ocfs2-tools. The
changes can be found at: https://github.com/goldwynr/ocfs2-tools branch:
nocontrold Currently, not many checks are present in the userspace code,
but that would change soon.
This patch (of 6):
Add clustername to cluster connection.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The versioning information is confusing for end-users. The numbers are
stuck at 1.5.0 when the tools version have moved to 1.8.2. Remove the
versioning system in the OCFS2 modules and let the kernel version be the
guide to debug issues.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Acked-by: Sunil Mushran <sunil.mushran@gmail.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Acked-by: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We usually rely on the fact that struct members not specified in the
initializer are set to NULL. So do that with fsnotify function pointers
as well.
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Eric Paris <eparis@parisplace.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After removing event structure creation from the generic layer there is
no reason for separate .should_send_event and .handle_event callbacks.
So just remove the first one.
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Eric Paris <eparis@parisplace.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently fsnotify framework creates one event structure for each
notification event and links this event into all interested notification
groups. This is done so that we save memory when several notification
groups are interested in the event. However the need for event
structure shared between inotify & fanotify bloats the event structure
so the result is often higher memory consumption.
Another problem is that fsnotify framework keeps path references with
outstanding events so that fanotify can return open file descriptors
with its events. This has the undesirable effect that filesystem cannot
be unmounted while there are outstanding events - a regression for
inotify compared to a situation before it was converted to fsnotify
framework. For fanotify this problem is hard to avoid and users of
fanotify should kind of expect this behavior when they ask for file
descriptors from notified files.
This patch changes fsnotify and its users to create separate event
structure for each group. This allows for much simpler code (~400 lines
removed by this patch) and also smaller event structures. For example
on 64-bit system original struct fsnotify_event consumes 120 bytes, plus
additional space for file name, additional 24 bytes for second and each
subsequent group linking the event, and additional 32 bytes for each
inotify group for private data. After the conversion inotify event
consumes 48 bytes plus space for file name which is considerably less
memory unless file names are long and there are several groups
interested in the events (both of which are uncommon). Fanotify event
fits in 56 bytes after the conversion (fanotify doesn't care about file
names so its events don't have to have it allocated). A win unless
there are four or more fanotify groups interested in the event.
The conversion also solves the problem with unmount when only inotify is
used as we don't have to grab path references for inotify events.
[hughd@google.com: fanotify: fix corruption preventing startup]
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Eric Paris <eparis@parisplace.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rounding of name length when passing it to userspace was done in several
places. Provide a function to do it and use it in all places.
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Eric Paris <eparis@parisplace.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
cond_resched_lock(cinfo->lock) is called everywhere else while holding
the cinfo->lock spinlock. Not holding this lock while calling
transfer_commit_list in filelayout_recover_commit_reqs causes the BUG
below.
It's true that we can't hold this lock while calling pnfs_put_lseg,
because that might try to lock the inode lock - which might be the
same lock as cinfo->lock.
To reproduce, mount a 2 DS pynfs server and run an O_DIRECT command
that crosses a stripe boundary and is not page aligned, such as:
dd if=/dev/zero of=/mnt/f bs=17000 count=1 oflag=direct
BUG: sleeping function called from invalid context at linux/fs/nfs/nfs4filelayout.c:1161
in_atomic(): 0, irqs_disabled(): 0, pid: 27, name: kworker/0:1
2 locks held by kworker/0:1/27:
#0: (events){.+.+.+}, at: [<ffffffff810501d7>] process_one_work+0x175/0x3a5
#1: ((&dreq->work)){+.+...}, at: [<ffffffff810501d7>] process_one_work+0x175/0x3a5
CPU: 0 PID: 27 Comm: kworker/0:1 Not tainted 3.13.0-rc3-branch-dros_testing+ #21
Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
Workqueue: events nfs_direct_write_schedule_work [nfs]
0000000000000000 ffff88007a39bbb8 ffffffff81491256 ffff88007b87a130 ffff88007a39bbd8 ffffffff8105f103 ffff880079614000 ffff880079617d40 ffff88007a39bc20 ffffffffa011603e ffff880078988b98 0000000000000000
Call Trace:
[<ffffffff81491256>] dump_stack+0x4d/0x66
[<ffffffff8105f103>] __might_sleep+0x100/0x105
[<ffffffffa011603e>] transfer_commit_list+0x94/0xf1 [nfs_layout_nfsv41_files]
[<ffffffffa01160d6>] filelayout_recover_commit_reqs+0x3b/0x68 [nfs_layout_nfsv41_files]
[<ffffffffa00ba53a>] nfs_direct_write_reschedule+0x9f/0x1d6 [nfs]
[<ffffffff810705df>] ? mark_lock+0x1df/0x224
[<ffffffff8106e617>] ? trace_hardirqs_off_caller+0x37/0xa4
[<ffffffff8106e691>] ? trace_hardirqs_off+0xd/0xf
[<ffffffffa00ba8f8>] nfs_direct_write_schedule_work+0x9d/0xb7 [nfs]
[<ffffffff810501d7>] ? process_one_work+0x175/0x3a5
[<ffffffff81050258>] process_one_work+0x1f6/0x3a5
[<ffffffff810501d7>] ? process_one_work+0x175/0x3a5
[<ffffffff8105187e>] worker_thread+0x149/0x1f5
[<ffffffff81051735>] ? rescuer_thread+0x28d/0x28d
[<ffffffff81056d74>] kthread+0xd2/0xda
[<ffffffff81056ca2>] ? __kthread_parkme+0x61/0x61
[<ffffffff8149e66c>] ret_from_fork+0x7c/0xb0
[<ffffffff81056ca2>] ? __kthread_parkme+0x61/0x61
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Version 3 cap export message includes information about the imported
caps. It allows us to add the imported caps if the corresponding cap
import message still hasn't been received.
This allow us to handle situation that the importer MDS crashes and
the cap import message is missing.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Version 3 cap import message includes the ID of the exported
caps. It allow us to remove the exported caps if we still haven't
received the corresponding cap export message.
We remove the exported caps because they are stale, keeping them
can compromise consistence.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Some inodes in readdir reply may have no caps. Getattr mds request
for these inodes can return -ESTALE. The fix is consider dentry that
links to inode with no caps as invalid. Invalid dentry causes a
lookup request to send to the mds, the MDS will send caps back.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Send requests that operate on path to directory's auth MDS if
mode == USE_AUTH_MDS. Always retry using the auth MDS if got
-ESTALE reply from non-auth MDS. Also clean up the code that
handles auth MDS change.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
- don't trim auth cap if there are flusing caps
- don't trim auth cap if any 'write' cap is wanted
- allow trimming non-auth cap even if the inode is dirty
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
handle following sequence of events:
- non-auth MDS revokes Fc cap. queue invalidate work
- auth MDS issues Fc cap through request reply. i_rdcache_gen gets
increased.
- invalidate work runs. it finds i_rdcache_revoking != i_rdcache_gen,
so it does nothing.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Here's the big driver core and sysfs patch set for 3.14-rc1.
There's a lot of work here moving sysfs logic out into a "kernfs" to
allow other subsystems to also have a virtual filesystem with the same
attributes of sysfs (handle device disconnect, dynamic creation /
removal as needed / unneeded, etc. This is primarily being done for
the cgroups filesystem, but the goal is to also move debugfs to it when
it is ready, solving all of the known issues in that filesystem as well.
The code isn't completed yet, but all should be stable now (there is a
big section that was reverted due to problems found when testing.)
There's also some other smaller fixes, and a driver core addition that
allows for a "collection" of objects, that the DRM people will be using
soon (it's in this tree to make merges after -rc1 easier.)
All of this has been in linux-next with no reported issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iEYEABECAAYFAlLdh0cACgkQMUfUDdst+ylv4QCfeDKDgLo4LsaBIIrFSxLoH/c7
UUsAoMPRwA0h8wy+BQcJAg4H4J4maKj3
=0pc0
-----END PGP SIGNATURE-----
Merge tag 'driver-core-3.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core / sysfs patches from Greg KH:
"Here's the big driver core and sysfs patch set for 3.14-rc1.
There's a lot of work here moving sysfs logic out into a "kernfs" to
allow other subsystems to also have a virtual filesystem with the same
attributes of sysfs (handle device disconnect, dynamic creation /
removal as needed / unneeded, etc)
This is primarily being done for the cgroups filesystem, but the goal
is to also move debugfs to it when it is ready, solving all of the
known issues in that filesystem as well. The code isn't completed
yet, but all should be stable now (there is a big section that was
reverted due to problems found when testing)
There's also some other smaller fixes, and a driver core addition that
allows for a "collection" of objects, that the DRM people will be
using soon (it's in this tree to make merges after -rc1 easier)
All of this has been in linux-next with no reported issues"
* tag 'driver-core-3.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (113 commits)
kernfs: associate a new kernfs_node with its parent on creation
kernfs: add struct dentry declaration in kernfs.h
kernfs: fix get_active failure handling in kernfs_seq_*()
Revert "kernfs: fix get_active failure handling in kernfs_seq_*()"
Revert "kernfs: replace kernfs_node->u.completion with kernfs_root->deactivate_waitq"
Revert "kernfs: remove KERNFS_ACTIVE_REF and add kernfs_lockdep()"
Revert "kernfs: remove KERNFS_REMOVED"
Revert "kernfs: restructure removal path to fix possible premature return"
Revert "kernfs: invoke kernfs_unmap_bin_file() directly from __kernfs_remove()"
Revert "kernfs: remove kernfs_addrm_cxt"
Revert "kernfs: make kernfs_get_active() block if the node is deactivated but not removed"
Revert "kernfs: implement kernfs_{de|re}activate[_self]()"
Revert "kernfs, sysfs, driver-core: implement kernfs_remove_self() and its wrappers"
Revert "pci: use device_remove_file_self() instead of device_schedule_callback()"
Revert "scsi: use device_remove_file_self() instead of device_schedule_callback()"
Revert "s390: use device_remove_file_self() instead of device_schedule_callback()"
Revert "sysfs, driver-core: remove unused {sysfs|device}_schedule_callback_owner()"
Revert "kernfs: remove unnecessary NULL check in __kernfs_remove()"
kernfs: remove unnecessary NULL check in __kernfs_remove()
drivers/base: provide an infrastructure for componentised subsystems
...
If clp is new (cl_count = 1) and it matches another client in
nfs4_discover_server_trunking, the nfs_put_client will free clp before
->cl_preserve_clid is set.
Cc: stable@vger.kernel.org # 3.7+
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Fixed a variety of trivial checkpatch warnings. The only delta should
be some minor formatting on log strings that were split / too long.
Signed-off-by: Chris Fries <cfries@motorola.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Both nfs41_walk_client_list and nfs40_walk_client_list expect the
'status' variable to be set to the value -NFS4ERR_STALE_CLIENTID
if the loop fails to find a match.
The problem is that the 'pos->cl_cons_state > NFS_CS_READY' changes
the value of 'status', and sets it either to the value '0' (which
indicates success), or to the value EINTR.
Cc: stable@vger.kernel.org # 3.7.x: 7b1f1fd1842e6: NFSv4/4.1: Fix bugs in
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
0d0d110720d7960b77c03c9f2597faaff4b484ae asserts that "d_splice_alias()
can't return error unless it was given an IS_ERR(inode)".
That was true of the implementation of d_splice_alias, but this is
really a problem with d_splice_alias: at a minimum it should be able to
return -ELOOP in the case where inserting the given dentry would cause a
directory loop.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Pull namespace fixes from Eric Biederman:
"This is a set of 3 regression fixes.
This fixes /proc/mounts when using "ip netns add <netns>" to display
the actual mount point.
This fixes a regression in clone that broke lxc-attach.
This fixes a regression in the permission checks for mounting /proc
that made proc unmountable if binfmt_misc was in use. Oops.
My apologies for sending this pull request so late. Al Viro gave
interesting review comments about the d_path fix that I wanted to
address in detail before I sent this pull request. Unfortunately a
bad round of colds kept from addressing that in detail until today.
The executive summary of the review was:
Al: Is patching d_path really sufficient?
The prepend_path, d_path, d_absolute_path, and __d_path family of
functions is a really mess.
Me: Yes, patching d_path is really sufficient. Yes, the code is mess.
No it is not appropriate to rewrite all of d_path for a regression
that has existed for entirely too long already, when a two line
change will do"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
vfs: Fix a regression in mounting proc
fork: Allow CLONE_PARENT after setns(CLONE_NEWPID)
vfs: In d_path don't call d_dname on a mount point
We should always make sure the cached page is up-to-date when we're
determining whether we can extend a write to cover the full page -- even
if we've received a write delegation from the server.
Commit c7559663 added logic to skip this check if we have a write
delegation, which can lead to data corruption such as the following
scenario if client B receives a write delegation from the NFS server:
Client A:
# echo 123456789 > /mnt/file
Client B:
# echo abcdefghi >> /mnt/file
# cat /mnt/file
0�D0�abcdefghi
Just because we hold a write delegation doesn't mean that we've read in
the entire page contents.
Cc: <stable@vger.kernel.org> # v3.11+
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Once created, a kernfs_node is always destroyed by kernfs_put().
Since ba7443bc656e ("sysfs, kernfs: implement
kernfs_create/destroy_root()"), kernfs_put() depends on kernfs_root()
to locate the ino_ida. kernfs_root() in turn depends on
kernfs_node->parent being set for !dir nodes. This means that
kernfs_put() of a !dir node requires its ->parent to be initialized.
This leads to oops when a newly created !dir node is destroyed without
going through kernfs_add_one() or after failing kernfs_add_one()
before ->parent is set. kernfs_root() invoked from kernfs_put() will
try to dereference NULL parent.
Fix it by moving parent association to kernfs_new_node() from
kernfs_add_one(). kernfs_new_node() now takes @parent instead of
@root and determines the root from the parent and also sets the new
node's parent properly. @parent parameter is removed from
kernfs_add_one(). As there's no parent when creating the root node,
__kernfs_new_node() which takes @root as before and doesn't set the
parent is used in that case.
This ensures that a kernfs_node in any stage in its life has its
parent associated and thus can be put.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
"disconnected" is too easily confused with "DCACHE_DISCONNECTED". I
think "unhashed" is the more precise term here.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Reviewed-by: Sage Weil <sage@inktank.com>
This is a small cleanup to function gfs2_rgrp_go_lock so that it
uses rgd instead of its more complicated twin.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Al Viro has tactfully pointed out that we are using the incorrect
error code in some cases. This patch fixes that, and also removes
the (unused) return value for glock dumping.
> * gfs2_iget() - ENOBUFS instead of ENOMEM. ENOBUFS is
> "No buffer space available (POSIX.1 (XSI STREAMS option))" and since
> we don't support STREAMS it's probably fair game, but... what the hell?
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Doing sync_meta_pages with META_FLUSH when checkpoint, we overide rw
using WRITE_FLUSH_FUA. At this time, we also should set
REQ_META|REQ_PRIO.
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch should resolve the following bug.
=========================================================
[ INFO: possible irq lock inversion dependency detected ]
3.13.0-rc5.f2fs+ #6 Not tainted
---------------------------------------------------------
kswapd0/41 just changed the state of lock:
(&sbi->gc_mutex){+.+.-.}, at: [<ffffffffa030503e>] f2fs_balance_fs+0xae/0xd0 [f2fs]
but this lock took another, RECLAIM_FS-READ-unsafe lock in the past:
(&sbi->cp_rwsem){++++.?}
and interrupts could create inverse lock ordering between them.
other info that might help us debug this:
Chain exists of:
&sbi->gc_mutex --> &sbi->cp_mutex --> &sbi->cp_rwsem
Possible interrupt unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&sbi->cp_rwsem);
local_irq_disable();
lock(&sbi->gc_mutex);
lock(&sbi->cp_mutex);
<Interrupt>
lock(&sbi->gc_mutex);
*** DEADLOCK ***
This bug is due to the f2fs_balance_fs call in f2fs_write_data_page.
If f2fs_write_data_page is triggered by wbc->for_reclaim via kswapd, it should
not call f2fs_balance_fs which tries to get a mutex grabbed by original syscall
flow.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQIcBAABAgAGBQJS080OAAoJECvKgwp+S8JaIdUQAJKNZTzXKylUjUZty42t57Jh
1qRrQeJ6ha+JVSpYX4jJz/mSzUdJdjoFg7J3O54OnVFj/CnlcY7GRZj3VMel9ijf
uhlf8DcU6JsThcFK4Q6mqXtdAHDPkQ1jkQHLNCe7bow9AjCzHymAZWJix4YvEsXF
zeJJURMqSaJeo/44MynnXyn/h5RRhg+5HWErhoFiVUzDzHR3RoQqmt3lPVVJkdj1
iokHLMzGui2vs52vUJj2yx7m9kaoDx/6bJpqR61qHfk5S4wjLkUI+1ID8dsTNVF2
4O3THb0nUDWx4wuJIxrAKoPiYjiemX1KmQXlUVr3IsfhDiiBbLyviVyn4aRaFIxV
IRCVXCj1CWw+cFLeCA5E+/WvpxjLfKs4WNBxIqjes5YRPM4PLpU3MDiabssaUzHI
0VPbU8TQ05hqH0wbs0hIgXyvED6yNn9d3sPHS2Lb5i2tp3E0FzVEoh2EH2jn8lmQ
1DAdi+ezk9EiJs8AFiN6MSIBpAZosX3Nq+RTmYGKqLZMGnxlJ30YspNlipiBPFpC
4xokkMZAZ0+wzpVabOMie36Rc/AaOAqiOjS1C6UIoOSrBTgtwWL7Ft2Da3SKb0KX
XQhNWCHNYcgOn9/DDDmGxzwt6HsEzOIYinMwrG37LSass5KEvopssmiLCXn8wry+
QXUoiFFFAPpg8iXaqj4X
=AHdo
-----END PGP SIGNATURE-----
Merge tag 'writeback-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux
Pull writeback fix from Wu Fengguang:
"Fix data corruption on NFS writeback.
It has been in linux-next for one month"
* tag 'writeback-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
writeback: Fix data corruption on NFS
Well I don't get the same warning locally as the kbuild
robot, but I guess this should fix the problem, anyway.
Here is the warning:
head: 2d9e72303d538024627fb1fe2cbde48aec12acc0
commit: ee2411a8db49a21bc55dc124e1b434ba194c8903 [19/20] GFS2: Clean up quota slot allocation
config: make ARCH=powerpc allmodconfig
All error/warnings:
fs/gfs2/quota.c: In function 'gfs2_quota_init':
>> fs/gfs2/quota.c:1246:3: error: implicit declaration of function '__vmalloc' [-Werror=implicit-function-declaration]
sdp->sd_quota_bitmap = __vmalloc(bm_size, GFP_NOFS, PAGE_KERNEL);
^
>> fs/gfs2/quota.c:1246:24: warning: assignment makes pointer from integer without a cast [enabled by default]
sdp->sd_quota_bitmap = __vmalloc(bm_size, GFP_NOFS, PAGE_KERNEL);
^
fs/gfs2/quota.c: In function 'gfs2_quota_cleanup':
>> fs/gfs2/quota.c:1361:4: error: implicit declaration of function 'vfree' [-Werror=implicit-function-declaration]
vfree(sdp->sd_quota_bitmap);
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
There is a bug in the function nilfs_segctor_collect, which results in
active data being written to a segment, that is marked as clean. It is
possible, that this segment is selected for a later segment
construction, whereby the old data is overwritten.
The problem shows itself with the following kernel log message:
nilfs_sufile_do_cancel_free: segment 6533 must be clean
Usually a few hours later the file system gets corrupted:
NILFS: bad btree node (blocknr=8748107): level = 0, flags = 0x0, nchildren = 0
NILFS error (device sdc1): nilfs_bmap_last_key: broken bmap (inode number=114660)
The issue can be reproduced with a file system that is nearly full and
with the cleaner running, while some IO intensive task is running.
Although it is quite hard to reproduce.
This is what happens:
1. The cleaner starts the segment construction
2. nilfs_segctor_collect is called
3. sc_stage is on NILFS_ST_SUFILE and segments are freed
4. sc_stage is on NILFS_ST_DAT current segment is full
5. nilfs_segctor_extend_segments is called, which
allocates a new segment
6. The new segment is one of the segments freed in step 3
7. nilfs_sufile_cancel_freev is called and produces an error message
8. Loop around and the collection starts again
9. sc_stage is on NILFS_ST_SUFILE and segments are freed
including the newly allocated segment, which will contain active
data and can be allocated at a later time
10. A few hours later another segment construction allocates the
segment and causes file system corruption
This can be prevented by simply reordering the statements. If
nilfs_sufile_cancel_freev is called before nilfs_segctor_extend_segments
the freed segments are marked as dirty and cannot be allocated any more.
Signed-off-by: Andreas Rohner <andreas.rohner@gmx.net>
Reviewed-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Tested-by: Andreas Rohner <andreas.rohner@gmx.net>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Gradually, the global qd_lock is being used for less and less.
After this patch it will only be used for the per super block
list whose purpose is to allow syncing of changes back to the
master quota file from the local quota changes file. Fixing
up that process to make it more efficient will be the subject
of a later patch, however this patch removes another barrier
to doing that.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Abhijith Das <adas@redhat.com>
Quota slot allocation has historically used a vector of pages
and a set of homegrown find/test/set/clear bit functions. Since
the size of the bitmap is likely to be based on the default
qc file size, thats a couple of pages at most. So we ought
to be able to allocate that as a single chunk, with a vmalloc
fallback, just in case of memory fragmentation.
We are then able to use the kernel's own find/test/set/clear
bit functions, rather than rolling our own.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Abhijith Das <adas@redhat.com>
While investigating a rather strange bit of code in the quota
clean up function, I spotted that the reason for its existence
was that when remounting read only, we were not stopping the
quotad thread, and thus it was possible for it to still have
a reference to some of the quotas in that case.
This patch moves the logd and quota thread start and stop into
the make_fs_rw/ro functions, so that we now stop those threads
when mounted read only.
This means that quotad will always be stopped before we call
the quota clean up function, and we can thus dispose of the
(rather hackish) code that waits for it to give up its
reference on the quotas.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Abhijith Das <adas@redhat.com>
Prior to this patch, GFS2 kept all the quotas for each
super block in a single linked list. This is rather slow
when there are large numbers of quotas.
This patch introduces a hlist_bl based hash table, similar
to the one used for glocks. The initial look up of the quota
is now lockless in the case where it is already cached,
although we still have to take the per quota spinlock in
order to bump the ref count. Either way though, this is a
big improvement on what was there before.
The qd_lock and the per super block list is preserved, for
the time being. However it is intended that since this is no
longer used for its original role, it should be possible to
shrink the number of items on that list in due course and
remove the requirement to take qd_lock in qd_get.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Abhijith Das <adas@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
We recently fixed the writeback of pages prior to performing
direct i/o, however the initial fix was perhaps a bit heavy
handed. There is no need to invalidate pages if the direct i/o
is only a read, since they will be identical to what has been
flushed to disk anyway.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
When kernfs_seq_start() fails to obtain an active reference, it
returns ERR_PTR(-ENODEV). kernfs_seq_stop() is then invoked with the
error pointer value; however, it still proceeds to invoke
kernfs_put_active() on the node leading to unbalanced put.
If kernfs_seq_stop() is called even after active ref failure, it
should skip invocation of @ops->seq_stop() and put_active.
Unfortunately, this is a bit complicated because active ref failure
isn't the only thing which may fail with ERR_PTR(-ENODEV).
@ops->seq_start/next() may also fail with the error value and
kernfs_seq_stop() doesn't have a way to tell apart those failures.
Work it around by factoring out the active part of kernfs_seq_stop()
into kernfs_seq_stop_active() and invoking it directly if
@ops->seq_start/next() fail with ERR_PTR(-ENODEV) and updating
kernfs_seq_stop() to skip kernfs_seq_stop_active() on
ERR_PTR(-ENODEV). This is a bit nasty but ensures that the active put
is skipped iff get_active failed in kernfs_seq_start().
tj: This was originally committed as d92d2e6bd72b but got reverted by
683bb2761fbf along with other kernfs self removal patches.
However, this one is an independent fix and shouldn't have been
reverted together. Reinstate the change. Sorry about the mess.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Support for f2fs-tools/tools/f2stat to monitor
/sys/kernel/debug/f2fs/status
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
With the 2 previous changes, all the long time operations are moved out
of the protection region, so here we can use spinlock rather than mutex
(orphan_inode_mutex) for lower overhead.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Move alloc new orphan node out of lock protection region.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
"boo sync" parameter is never referenced in f2fs_wait_on_page_writeback.
We should remove this parameter.
Signed-off-by: Yuan Zhong <yuan.mark.zhong@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This reverts commit d92d2e6bd72b653f9811e0c9c46307c743b3fc58.
Tejun writes:
I'm sorry but can you please revert the whole series?
get_active() waiting while a node is deactivated has potential
to lead to deadlock and that deactivate/reactivate interface is
something fundamentally flawed and that cgroup will have to work
with the remove_self() like everybody else. IOW, I think the
first posting was correct.
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>