13177 Commits

Author SHA1 Message Date
Linus Torvalds
8d80ce80e1 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (71 commits)
  SELinux: inode_doinit_with_dentry drop no dentry printk
  SELinux: new permission between tty audit and audit socket
  SELinux: open perm for sock files
  smack: fixes for unlabeled host support
  keys: make procfiles per-user-namespace
  keys: skip keys from another user namespace
  keys: consider user namespace in key_permission
  keys: distinguish per-uid keys in different namespaces
  integrity: ima iint radix_tree_lookup locking fix
  TOMOYO: Do not call tomoyo_realpath_init unless registered.
  integrity: ima scatterlist bug fix
  smack: fix lots of kernel-doc notation
  TOMOYO: Don't create securityfs entries unless registered.
  TOMOYO: Fix exception policy read failure.
  SELinux: convert the avc cache hash list to an hlist
  SELinux: code readability with avc_cache
  SELinux: remove unused av.decided field
  SELinux: more careful use of avd in avc_has_perm_noaudit
  SELinux: remove the unused ae.used
  SELinux: check seqno when updating an avc_node
  ...
2009-03-26 11:03:39 -07:00
Matthew Garrett
0a1c01c947 Make relatime default
Change the default behaviour of the kernel to use relatime for all
filesystems. This can be overridden with the "strictatime" mount
option.

Signed-off-by: Matthew Garrett <mjg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-03-26 11:01:10 -07:00
Matthew Garrett
d0adde574b Add a strictatime mount option
Add support for explicitly requesting full atime updates. This makes it
possible for kernels to default to relatime but still allow userspace to
override it.

Signed-off-by: Matthew Garrett <mjg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-03-26 10:56:35 -07:00
Matthew Garrett
11ff6f05f1 Allow relatime to update atime once a day
Allow atime to be updated once per day even with relatime. This lets
utilities like tmpreaper (which delete files based on last access time)
continue working, making relatime a plausible default for distributions.

Signed-off-by: Matthew Garrett <mjg@redhat.com>
Reviewed-by: Matthew Wilcox <willy@linux.intel.com>
Acked-by: Valerie Aurora Henson <vaurora@redhat.com>
Acked-by: Alan Cox <alan@redhat.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-03-26 10:48:13 -07:00
Stefan Weinhuber
b44b0ab3ba [S390] dasd: add large volume support
The dasd device driver will now support ECKD devices with more then
65520 cylinders.
In the traditional ECKD adressing scheme each track is addressed
by a 16-bit cylinder and 16-bit head number. The new addressing
scheme makes use of the fact that the actual number of heads is
never larger then 15, so 12 bits of the head number can be redefined
to be part of the cylinder address.

Signed-off-by: Stefan Weinhuber <wein@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2009-03-26 15:24:05 +01:00
Jens Axboe
a2a9537ac0 Get rid of pdflush_operation() in emergency sync and remount
Opencode a cheasy approach with kevent. The idea here is that we'll
add some generic delayed work infrastructure, which probably wont be
based on pdflush (or maybe it will, in which case we can just add it
back).

This is in preparation for getting rid of pdflush completely.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-03-26 11:01:36 +01:00
Jens Axboe
6933c02e9c btrfs: get rid of current_is_pdflush() in btrfs_btree_balance_dirty
Chris says it's safe to kill.

Acked-by: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-03-26 11:01:35 +01:00
Theodore Ts'o
563bdd61fe ext4: Check for an valid i_mode when reading the inode from disk
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-03-26 00:06:19 -04:00
Theodore Ts'o
7058548cd5 ext4: Use WRITE_SYNC for commits which are caused by fsync()
If a commit is triggered by fsync(), set a flag indicating the journal
blocks associated with the transaction should be flushed out using
WRITE_SYNC.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-03-25 23:35:46 -04:00
Manish Katiyar
c16831b4cc ext2: Zero our b_size in ext2_quota_read()
ext2_quota_read() doesn't initialize tmp_bh.b_size before calling
ext2_get_block() where we access it. Since it is a local variable it
might contain some garbage. Make sure it is filled with reasonable
value before passing.

Signed-off-by: Manish Katiyar <mkatiyar@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2009-03-26 02:18:38 +01:00
Matt LaPlante
620372a9ff trivial: fix typos/grammar errors in fs/Kconfig
Signed-off-by: Matt LaPlante <kernel1@cyberdogtech.com>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Jan Kara <jack@suse.cz>
2009-03-26 02:18:38 +01:00
Jan Kara
268157ba67 quota: Coding style fixes
Wrap long lines, remove assignments from conditions, rewrite two
overcomplicated for loops.

Signed-off-by: Jan Kara <jack@suse.cz>
2009-03-26 02:18:38 +01:00
Jan Kara
7a2435d874 quota: Remove superfluous inlines
Remove inlines of large functions to decrease code size (saved 1543
bytes).

Signed-off-by: Jan Kara <jack@suse.cz>
2009-03-26 02:18:37 +01:00
Jan Kara
90c0af05a5 nfsd: Use lowercase names of quota functions
Use lowercase names of quota functions instead of old uppercase ones.

CC: bfields@fieldses.org
CC: neilb@suse.de
Signed-off-by: Jan Kara <jack@suse.cz>
2009-03-26 02:18:37 +01:00
Jan Kara
c94d2a22f2 jfs: Use lowercase names of quota functions
Use lowercase names of quota functions instead of old uppercase ones.

Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Dave Kleikamp <shaggy@austin.ibm.com>
2009-03-26 02:18:37 +01:00
Jan Kara
bacfb7c2e5 udf: Use lowercase names of quota functions
Use lowercase names of quota functions instead of old uppercase ones.

Signed-off-by: Jan Kara <jack@suse.cz>
2009-03-26 02:18:36 +01:00
Jan Kara
5f5fa796c6 ufs: Use lowercase names of quota functions
Use lowercase names of quota functions instead of old uppercase ones.

Signed-off-by: Jan Kara <jack@suse.cz>
CC: Evgeniy Dushistov <dushistov@mail.ru>
2009-03-26 02:18:36 +01:00
Jan Kara
77db4f25bc reiserfs: Use lowercase names of quota functions
Use lowercase names of quota functions instead of old uppercase ones.

Signed-off-by: Jan Kara <jack@suse.cz>
CC: reiserfs-devel@vger.kernel.org
2009-03-26 02:18:36 +01:00
Jan Kara
a269eb1829 ext4: Use lowercase names of quota functions
Use lowercase names of quota functions instead of old uppercase ones.

Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Mingming Cao <cmm@us.ibm.com>
CC: linux-ext4@vger.kernel.org
2009-03-26 02:18:36 +01:00
Jan Kara
81a0522739 ext3: Use lowercase names of quota functions
Use lowercase names of quota functions instead of old uppercase ones.

Signed-off-by: Jan Kara <jack@suse.cz>
CC: linux-ext4@vger.kernel.org
2009-03-26 02:18:36 +01:00
Jan Kara
6f90bee506 ext2: Use lowercase names of quota functions
Use lowercase names of quota functions instead of old uppercase ones.

Signed-off-by: Jan Kara <jack@suse.cz>
CC: linux-ext4@vger.kernel.org
2009-03-26 02:18:36 +01:00
Jan Kara
314649558d ramfs: Remove quota call
Ramfs has no bussiness in quotas.

Signed-off-by: Jan Kara <jack@suse.cz>
2009-03-26 02:18:35 +01:00
Jan Kara
9e3509e273 vfs: Use lowercase names of quota functions
Use lowercase names of quota functions instead of old uppercase ones.

Signed-off-by: Jan Kara <jack@suse.cz>
CC: Alexander Viro <viro@zeniv.linux.org.uk>
2009-03-26 02:18:35 +01:00
Jan Kara
d26ac1a812 quota: Remove dqbuf_t and other cleanups
Remove bogus typedef which is just a definition of char *.
Remove unnecessary type casts.
Substitute freedqbuf() with kfree.

Signed-off-by: Jan Kara <jack@suse.cz>
2009-03-26 02:18:35 +01:00
Jan Kara
dd6f3c6d5a quota: Remove NODQUOT macro
Remove this macro which is just a definition of NULL. Fix a few coding style
issues along the way.

Signed-off-by: Jan Kara <jack@suse.cz>
2009-03-26 02:18:35 +01:00
Jan Kara
c516610cfe quota: Make global quota locks cacheline aligned
Andrew Morton has suggested that three global quota locks can end up in the
same cacheline which can result in bad cacheline ping-pong on SMP machines.
Make locks cacheline aligned so that we avoid this problem (thanks goes to
Andrew for the idea).

Signed-off-by: Jan Kara <jack@suse.cz>
CC: Andrew Morton <akpm@linux-foundation.org>
2009-03-26 02:18:35 +01:00
Jan Kara
884d179dff quota: Move quota files into separate directory
Quota subsystem has more and more files. It's time to create a dir for it.

Signed-off-by: Jan Kara <jack@suse.cz>
2009-03-26 02:18:35 +01:00
Mingming Cao
60e58e0f30 ext4: quota reservation for delayed allocation
Uses quota reservation/claim/release to handle quota properly for delayed
allocation in the three steps: 1) quotas are reserved when data being copied
to cache when block allocation is defered 2) when new blocks are allocated.
reserved quotas are converted to the real allocated quota, 2) over-booked
quotas for metadata blocks are released back.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
2009-03-26 02:18:34 +01:00
Jan Kara
643d00ccc3 reiserfs: Remove unnecessary quota functions
reiserfs_dquot_initialize() and reiserfs_dquot_drop() is no longer
needed because of modified quota locking.

Signed-off-by: Jan Kara <jack@suse.cz>
2009-03-26 02:18:34 +01:00
Jan Kara
edf7245362 ext4: Remove unnecessary quota functions
ext4_dquot_initialize() and ext4_dquot_drop() is no longer
needed because of modified quota locking.

Signed-off-by: Jan Kara <jack@suse.cz>
2009-03-26 02:18:34 +01:00
Jan Kara
a219ce3748 ext3: Remove unnecessary quota functions
ext3_dquot_initialize() and ext3_dquot_drop() is no longer
needed because of modified quota locking.

Signed-off-by: Jan Kara <jack@suse.cz>
2009-03-26 02:18:34 +01:00
Mingming Cao
08d0350ce9 quota: Move EXPORT_SYMBOL immediately next to the functions/varibles
According to checkpatch: EXPORT_SYMBOL(foo); should immediately follow its
 function/variable

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2009-03-26 02:18:34 +01:00
Mingming Cao
740d9dcd94 quota: Add quota reservation claim and released operations
Reserved quota will be claimed at the block allocation time. Over-booked
quota could be returned back with the release callback function.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2009-03-26 02:18:24 +01:00
Mingming Cao
f18df22899 quota: Add quota reservation support
Delayed allocation defers the block allocation at the dirty pages
flush-out time, doing quota charge/check at that time is too late.
But we can't charge the quota blocks until blocks are really allocated,
otherwise users could get overcharged after reboot from system crash.

This patch adds quota reservation for delayed allocation. Quota blocks
are reserved in memory, inode and quota won't gets dirtied until later
block allocation time.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2009-03-26 02:15:50 +01:00
Chris Mason
1a81af4d1d Btrfs: make sure btrfs_update_delayed_ref doesn't increase ref_mod
btrfs_update_delayed_ref is optimized to add and remove different
references in one pass through the delayed ref tree.  It is a zero
sum on the total number of refs on a given extent.

But, the code was recording an extra ref in the head node.  This
never made it down to the disk but was used when deciding if it was
safe to free the extent while dropping snapshots.

The fix used here is to make sure the ref_mod count is unchanged
on the head ref when btrfs_update_delayed_ref is called.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-25 09:55:11 -04:00
Hugh Dickins
095160aee9 sysfs: fix some bin_vm_ops errors
Commit 86c9508eb1c0ce5aa07b5cf1d36b60c54efc3d7a
"sysfs: don't block indefinitely for unmapped files" in linux-next
crashes the PowerMac G5 when X starts up.  It's caught out by the way
powerpc's pci_mmap of legacy_mem uses shmem_zero_setup(), substituting
a new vma->vm_file whose private_data no longer points to the bin_buffer
(substitution done because some versions of X crash if that mmap fails).

The fix to this is straightforward: the original vm_file is fput() in
that case, so this mmap won't block sysfs at all, so just don't switch
over to bin_vm_ops if vm_file has changed.

But more fixes made before realizing that was the problem:-

It should not be an error if bin_page_mkwrite() finds no underlying
page_mkwrite().

Check that a file already mmap'ed has the same underlying vm_ops
_before_ pointing vma->vm_ops at bin_vm_ops.

If the file being mmap'ed is a shmem/tmpfs file, don't fail the mmap
on CONFIG_NUMA=y, just because that has a set_policy and get_policy:
provide bin_set_policy, bin_get_policy and bin_migrate.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Acked-by: Eric Biederman <ebiederm@aristanetworks.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-03-24 16:38:26 -07:00
Alex Chiang
669420644c sysfs: only allow one scheduled removal callback per kobj
The only way for a sysfs attribute to remove itself (without
deadlock) is to use the sysfs_schedule_callback() interface.

Vegard Nossum discovered that a poorly written sysfs ->store
callback can repeatedly schedule remove callbacks on the same
device over and over, e.g.

	$ while true ; do echo 1 > /sys/devices/.../remove ; done

If the 'remove' attribute uses the sysfs_schedule_callback API
and also does not protect itself from concurrent accesses, its
callback handler will be called multiple times, and will
eventually attempt to perform operations on a freed kobject,
leading to many problems.

Instead of requiring all callers of sysfs_schedule_callback to
implement their own synchronization, provide the protection in
the infrastructure.

Now, sysfs_schedule_callback will only allow one scheduled
callback per kobject. On subsequent calls with the same kobject,
return -EAGAIN.

This is a short term fix. The long term fix is to allow sysfs
attributes to remove themselves directly, without any of this
callback hokey pokey.

[cornelia.huck@de.ibm.com: s390 ccwgroup bits]

Reported-by: vegard.nossum@gmail.com
Signed-off-by: Alex Chiang <achiang@hp.com>
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-03-24 16:38:26 -07:00
Ming Lei
f67f129e51 Driver core: implement uevent suppress in kobject
This patch implements uevent suppress in kobject and removes it
from struct device, based on the following ideas:

1,Uevent sending should be one attribute of kobject, so suppressing it
in kobject layer is more natural than in device layer. By this way,
we can do it for other objects embedded with kobject.

2,It may save several bytes for each instance of struct device.(On my
omap3(32bit ARM) based box, can save 8bytes per device object)

This patch also introduces dev_set|get_uevent_suppress() helpers to
set and query uevent_suppress attribute in case to help kobject
as private part of struct device in future.

[This version is against the latest driver-core patch set of Greg,please
ignore the last version.]

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-03-24 16:38:26 -07:00
Eric W. Biederman
e0edd3c65a sysfs: don't block indefinitely for unmapped files.
Modify sysfs bin files so that we can remove the bin file while they are
still mapped.  When the kobject is removed we unmap the bin file and
arrange for future accesses to the mapping to receive SIGBUS.

Implementing this prevents a nasty DOS when pci devices are hot plugged
and unplugged.  Where if any of their resources were mmaped the kernel
could not free up their pci resources or release their pci data
structures.

[akpm@linux-foundation.org: remove unused var]
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-03-24 16:38:26 -07:00
Eric W. Biederman
04256b4a8f sysfs: reference sysfs_dirent from sysfs inodes
The sysfs_dirent serves as both an inode and a directory entry
for sysfs.  To prevent the sysfs inode numbers from being freed
prematurely hold a reference to sysfs_dirent from the sysfs inode.

[akpm@linux-foundation.org: add comment]
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-03-24 16:38:25 -07:00
Alex Chiang
425cb02912 sysfs: sysfs_add_one WARNs with full path to duplicate filename
sysfs: sysfs_add_one WARNs with full path to duplicate filename

As a debugging aid, it can be useful to know the full path to a
duplicate file being created in sysfs.

We now will display warnings such as:

	sysfs: cannot create duplicate filename '/foo'

when attempting to create multiple files named 'foo' in the sysfs
root, or:

	sysfs: cannot create duplicate filename '/bus/pci/slots/5/foo'

when attempting to create multiple files named 'foo' under a
given directory in sysfs.

The path displayed is always a relative path to sysfs_root. The
leading '/' in the path name refers to the sysfs_root mount
point, and should not be confused with the "real" '/'.

Thanks to Alex Williamson for essentially writing sysfs_pathname.

Cc: Alex Williamson <alex.williamson@hp.com>
Signed-off-by: Alex Chiang <achiang@hp.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-03-24 16:38:25 -07:00
Eric W. Biederman
4a67a1bc0b sysfs: Take sysfs_mutex when fetching the root inode.
sysfs_get_inode ultimately calls sysfs_count_nlink when the a
directory inode is fectched.  sysfs_count_nlink needs to be
called under the sysfs_mutex to guard against the unlikely
but possible scenario that the root directory is changing
as we are counting the number entries in it, and just in
general to be consistent.

Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-03-24 16:38:24 -07:00
Qinghuang Feng
8231f2f99a SYSFS: use standard magic.h for sysfs
SYSFS_MAGIC has been added into magic.h, so only use that definition
in magic.h to avoid potential consistency problem.

Signed-off-by: Qinghuang Feng <qhfeng.kernel@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-03-24 16:38:24 -07:00
Chris Mason
af4176b49c Btrfs: optimize fsyncs on old files
The fsync log has code to make sure all of the parents of a file are in the
log along with the file.  It uses a minimal log of the parent directory
inodes, just enough to get the parent directory on disk.

If the transaction that originally created a file is fully on disk,
and the file hasn't been renamed or linked into other directories, we
can safely skip the parent directory walk.  We know the file is on disk
somewhere and we can go ahead and just log that single file.

This is more important now because unrelated unlinks in the parent directory
might make us force a commit if we try to log the parent.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-24 16:14:52 -04:00
Chris Mason
12fcfd22fe Btrfs: tree logging unlink/rename fixes
The tree logging code allows individual files or directories to be logged
without including operations on other files and directories in the FS.
It tries to commit the minimal set of changes to disk in order to
fsync the single file or directory that was sent to fsync or O_SYNC.

The tree logging code was allowing files and directories to be unlinked
if they were part of a rename operation where only one directory
in the rename was in the fsync log.  This patch adds a few new rules
to the tree logging.

1) on rename or unlink, if the inode being unlinked isn't in the fsync
log, we must force a full commit before doing an fsync of the directory
where the unlink was done.  The commit isn't done during the unlink,
but it is forced the next time we try to log the parent directory.

Solution: record transid of last unlink/rename per directory when the
directory wasn't already logged.  For renames this is only done when
renaming to a different directory.

mkdir foo/some_dir
normal commit
rename foo/some_dir foo2/some_dir
mkdir foo/some_dir
fsync foo/some_dir/some_file

The fsync above will unlink the original some_dir without recording
it in its new location (foo2).  After a crash, some_dir will be gone
unless the fsync of some_file forces a full commit

2) we must log any new names for any file or dir that is in the fsync
log.  This way we make sure not to lose files that are unlinked during
the same transaction.

2a) we must log any new names for any file or dir during rename
when the directory they are being removed from was logged.

2a is actually the more important variant.  Without the extra logging
a crash might unlink the old name without recreating the new one

3) after a crash, we must go through any directories with a link count
of zero and redo the rm -rf

mkdir f1/foo
normal commit
rm -rf f1/foo
fsync(f1)

The directory f1 was fully removed from the FS, but fsync was never
called on f1, only its parent dir.  After a crash the rm -rf must
be replayed.  This must be able to recurse down the entire
directory tree.  The inode link count fixup code takes care of the
ugly details.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-24 16:14:52 -04:00
Chris Mason
a74ac32207 Btrfs: Make sure i_nlink doesn't hit zero too soon during log replay
During log replay, inodes are copied from the log to the main filesystem
btrees.  Sometimes they have a zero link count in the log but they actually
gain links during the replay or have some in the main btree.

This patch updates the link count to be at least one after copying the
inode out of the log.  This makes sure the inode is deleted during an
iput while the rest of the replay code is still working on it.

The log replay has fixup code to make sure that link counts are correct
at the end of the replay, so we could use any non-zero number here and
it would work fine.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-24 16:14:51 -04:00
Chris Mason
a4b6e07d1a Btrfs: limit balancing work while flushing delayed refs
The delayed reference mechanism is responsible for all updates to the
extent allocation trees, including those updates created while processing
the delayed references.

This commit tries to limit the amount of work that gets created during
the final run of delayed refs before a commit.  It avoids cowing new blocks
unless it is required to finish the commit, and so it avoids new allocations
that were not really required.

The goal is to avoid infinite loops where we are always making more work
on the final run of delayed refs.  Over the long term we'll make a
special log for the last delayed ref updates as well.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-24 16:14:51 -04:00
Chris Mason
5d13a98f3b Btrfs: readahead checksums during btrfs_finish_ordered_io
This reads in blocks in the checksum btree before starting the
transaction in btrfs_finish_ordered_io.  It makes it much more likely
we'll be able to do operations inside the transaction without
needing any btree reads, which limits transaction latencies overall.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-24 16:14:51 -04:00
Chris Mason
b9473439d3 Btrfs: leave btree locks spinning more often
btrfs_mark_buffer dirty would set dirty bits in the extent_io tree
for the buffers it was dirtying.  This may require a kmalloc and it
was not atomic.  So, anyone who called btrfs_mark_buffer_dirty had to
set any btree locks they were holding to blocking first.

This commit changes dirty tracking for extent buffers to just use a flag
in the extent buffer.  Now that we have one and only one extent buffer
per page, this can be safely done without losing dirty bits along the way.

This also introduces a path->leave_spinning flag that callers of
btrfs_search_slot can use to indicate they will properly deal with a
path returned where all the locks are spinning instead of blocking.

Many of the btree search callers now expect spinning paths,
resulting in better btree concurrency overall.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-24 16:14:28 -04:00
Chris Mason
89573b9c51 Btrfs: Only let very young transactions grow during commit
Commits are fairly expensive, and so btrfs has code to sit around for a while
during the commit and let new writers come in.

But, while we're sitting there, new delayed refs might be added, and those
can be expensive to process as well.  Unless the transaction is very very
young, it makes sense to go ahead and let the commit finish without hanging
around.

The commit grow loop isn't as important as it used to be, the fsync logging
code handles most performance critical syncs now.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-24 16:14:28 -04:00