* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (71 commits)
SELinux: inode_doinit_with_dentry drop no dentry printk
SELinux: new permission between tty audit and audit socket
SELinux: open perm for sock files
smack: fixes for unlabeled host support
keys: make procfiles per-user-namespace
keys: skip keys from another user namespace
keys: consider user namespace in key_permission
keys: distinguish per-uid keys in different namespaces
integrity: ima iint radix_tree_lookup locking fix
TOMOYO: Do not call tomoyo_realpath_init unless registered.
integrity: ima scatterlist bug fix
smack: fix lots of kernel-doc notation
TOMOYO: Don't create securityfs entries unless registered.
TOMOYO: Fix exception policy read failure.
SELinux: convert the avc cache hash list to an hlist
SELinux: code readability with avc_cache
SELinux: remove unused av.decided field
SELinux: more careful use of avd in avc_has_perm_noaudit
SELinux: remove the unused ae.used
SELinux: check seqno when updating an avc_node
...
Change the default behaviour of the kernel to use relatime for all
filesystems. This can be overridden with the "strictatime" mount
option.
Signed-off-by: Matthew Garrett <mjg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add support for explicitly requesting full atime updates. This makes it
possible for kernels to default to relatime but still allow userspace to
override it.
Signed-off-by: Matthew Garrett <mjg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Allow atime to be updated once per day even with relatime. This lets
utilities like tmpreaper (which delete files based on last access time)
continue working, making relatime a plausible default for distributions.
Signed-off-by: Matthew Garrett <mjg@redhat.com>
Reviewed-by: Matthew Wilcox <willy@linux.intel.com>
Acked-by: Valerie Aurora Henson <vaurora@redhat.com>
Acked-by: Alan Cox <alan@redhat.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The dasd device driver will now support ECKD devices with more then
65520 cylinders.
In the traditional ECKD adressing scheme each track is addressed
by a 16-bit cylinder and 16-bit head number. The new addressing
scheme makes use of the fact that the actual number of heads is
never larger then 15, so 12 bits of the head number can be redefined
to be part of the cylinder address.
Signed-off-by: Stefan Weinhuber <wein@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Opencode a cheasy approach with kevent. The idea here is that we'll
add some generic delayed work infrastructure, which probably wont be
based on pdflush (or maybe it will, in which case we can just add it
back).
This is in preparation for getting rid of pdflush completely.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
If a commit is triggered by fsync(), set a flag indicating the journal
blocks associated with the transaction should be flushed out using
WRITE_SYNC.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
ext2_quota_read() doesn't initialize tmp_bh.b_size before calling
ext2_get_block() where we access it. Since it is a local variable it
might contain some garbage. Make sure it is filled with reasonable
value before passing.
Signed-off-by: Manish Katiyar <mkatiyar@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Matt LaPlante <kernel1@cyberdogtech.com>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Jan Kara <jack@suse.cz>
Use lowercase names of quota functions instead of old uppercase ones.
CC: bfields@fieldses.org
CC: neilb@suse.de
Signed-off-by: Jan Kara <jack@suse.cz>
Use lowercase names of quota functions instead of old uppercase ones.
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Dave Kleikamp <shaggy@austin.ibm.com>
Use lowercase names of quota functions instead of old uppercase ones.
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Mingming Cao <cmm@us.ibm.com>
CC: linux-ext4@vger.kernel.org
Use lowercase names of quota functions instead of old uppercase ones.
Signed-off-by: Jan Kara <jack@suse.cz>
CC: Alexander Viro <viro@zeniv.linux.org.uk>
Remove bogus typedef which is just a definition of char *.
Remove unnecessary type casts.
Substitute freedqbuf() with kfree.
Signed-off-by: Jan Kara <jack@suse.cz>
Andrew Morton has suggested that three global quota locks can end up in the
same cacheline which can result in bad cacheline ping-pong on SMP machines.
Make locks cacheline aligned so that we avoid this problem (thanks goes to
Andrew for the idea).
Signed-off-by: Jan Kara <jack@suse.cz>
CC: Andrew Morton <akpm@linux-foundation.org>
Uses quota reservation/claim/release to handle quota properly for delayed
allocation in the three steps: 1) quotas are reserved when data being copied
to cache when block allocation is defered 2) when new blocks are allocated.
reserved quotas are converted to the real allocated quota, 2) over-booked
quotas for metadata blocks are released back.
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
According to checkpatch: EXPORT_SYMBOL(foo); should immediately follow its
function/variable
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reserved quota will be claimed at the block allocation time. Over-booked
quota could be returned back with the release callback function.
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Delayed allocation defers the block allocation at the dirty pages
flush-out time, doing quota charge/check at that time is too late.
But we can't charge the quota blocks until blocks are really allocated,
otherwise users could get overcharged after reboot from system crash.
This patch adds quota reservation for delayed allocation. Quota blocks
are reserved in memory, inode and quota won't gets dirtied until later
block allocation time.
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Jan Kara <jack@suse.cz>
btrfs_update_delayed_ref is optimized to add and remove different
references in one pass through the delayed ref tree. It is a zero
sum on the total number of refs on a given extent.
But, the code was recording an extra ref in the head node. This
never made it down to the disk but was used when deciding if it was
safe to free the extent while dropping snapshots.
The fix used here is to make sure the ref_mod count is unchanged
on the head ref when btrfs_update_delayed_ref is called.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Commit 86c9508eb1c0ce5aa07b5cf1d36b60c54efc3d7a
"sysfs: don't block indefinitely for unmapped files" in linux-next
crashes the PowerMac G5 when X starts up. It's caught out by the way
powerpc's pci_mmap of legacy_mem uses shmem_zero_setup(), substituting
a new vma->vm_file whose private_data no longer points to the bin_buffer
(substitution done because some versions of X crash if that mmap fails).
The fix to this is straightforward: the original vm_file is fput() in
that case, so this mmap won't block sysfs at all, so just don't switch
over to bin_vm_ops if vm_file has changed.
But more fixes made before realizing that was the problem:-
It should not be an error if bin_page_mkwrite() finds no underlying
page_mkwrite().
Check that a file already mmap'ed has the same underlying vm_ops
_before_ pointing vma->vm_ops at bin_vm_ops.
If the file being mmap'ed is a shmem/tmpfs file, don't fail the mmap
on CONFIG_NUMA=y, just because that has a set_policy and get_policy:
provide bin_set_policy, bin_get_policy and bin_migrate.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Acked-by: Eric Biederman <ebiederm@aristanetworks.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The only way for a sysfs attribute to remove itself (without
deadlock) is to use the sysfs_schedule_callback() interface.
Vegard Nossum discovered that a poorly written sysfs ->store
callback can repeatedly schedule remove callbacks on the same
device over and over, e.g.
$ while true ; do echo 1 > /sys/devices/.../remove ; done
If the 'remove' attribute uses the sysfs_schedule_callback API
and also does not protect itself from concurrent accesses, its
callback handler will be called multiple times, and will
eventually attempt to perform operations on a freed kobject,
leading to many problems.
Instead of requiring all callers of sysfs_schedule_callback to
implement their own synchronization, provide the protection in
the infrastructure.
Now, sysfs_schedule_callback will only allow one scheduled
callback per kobject. On subsequent calls with the same kobject,
return -EAGAIN.
This is a short term fix. The long term fix is to allow sysfs
attributes to remove themselves directly, without any of this
callback hokey pokey.
[cornelia.huck@de.ibm.com: s390 ccwgroup bits]
Reported-by: vegard.nossum@gmail.com
Signed-off-by: Alex Chiang <achiang@hp.com>
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This patch implements uevent suppress in kobject and removes it
from struct device, based on the following ideas:
1,Uevent sending should be one attribute of kobject, so suppressing it
in kobject layer is more natural than in device layer. By this way,
we can do it for other objects embedded with kobject.
2,It may save several bytes for each instance of struct device.(On my
omap3(32bit ARM) based box, can save 8bytes per device object)
This patch also introduces dev_set|get_uevent_suppress() helpers to
set and query uevent_suppress attribute in case to help kobject
as private part of struct device in future.
[This version is against the latest driver-core patch set of Greg,please
ignore the last version.]
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Modify sysfs bin files so that we can remove the bin file while they are
still mapped. When the kobject is removed we unmap the bin file and
arrange for future accesses to the mapping to receive SIGBUS.
Implementing this prevents a nasty DOS when pci devices are hot plugged
and unplugged. Where if any of their resources were mmaped the kernel
could not free up their pci resources or release their pci data
structures.
[akpm@linux-foundation.org: remove unused var]
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The sysfs_dirent serves as both an inode and a directory entry
for sysfs. To prevent the sysfs inode numbers from being freed
prematurely hold a reference to sysfs_dirent from the sysfs inode.
[akpm@linux-foundation.org: add comment]
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
sysfs: sysfs_add_one WARNs with full path to duplicate filename
As a debugging aid, it can be useful to know the full path to a
duplicate file being created in sysfs.
We now will display warnings such as:
sysfs: cannot create duplicate filename '/foo'
when attempting to create multiple files named 'foo' in the sysfs
root, or:
sysfs: cannot create duplicate filename '/bus/pci/slots/5/foo'
when attempting to create multiple files named 'foo' under a
given directory in sysfs.
The path displayed is always a relative path to sysfs_root. The
leading '/' in the path name refers to the sysfs_root mount
point, and should not be confused with the "real" '/'.
Thanks to Alex Williamson for essentially writing sysfs_pathname.
Cc: Alex Williamson <alex.williamson@hp.com>
Signed-off-by: Alex Chiang <achiang@hp.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
sysfs_get_inode ultimately calls sysfs_count_nlink when the a
directory inode is fectched. sysfs_count_nlink needs to be
called under the sysfs_mutex to guard against the unlikely
but possible scenario that the root directory is changing
as we are counting the number entries in it, and just in
general to be consistent.
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
SYSFS_MAGIC has been added into magic.h, so only use that definition
in magic.h to avoid potential consistency problem.
Signed-off-by: Qinghuang Feng <qhfeng.kernel@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The fsync log has code to make sure all of the parents of a file are in the
log along with the file. It uses a minimal log of the parent directory
inodes, just enough to get the parent directory on disk.
If the transaction that originally created a file is fully on disk,
and the file hasn't been renamed or linked into other directories, we
can safely skip the parent directory walk. We know the file is on disk
somewhere and we can go ahead and just log that single file.
This is more important now because unrelated unlinks in the parent directory
might make us force a commit if we try to log the parent.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The tree logging code allows individual files or directories to be logged
without including operations on other files and directories in the FS.
It tries to commit the minimal set of changes to disk in order to
fsync the single file or directory that was sent to fsync or O_SYNC.
The tree logging code was allowing files and directories to be unlinked
if they were part of a rename operation where only one directory
in the rename was in the fsync log. This patch adds a few new rules
to the tree logging.
1) on rename or unlink, if the inode being unlinked isn't in the fsync
log, we must force a full commit before doing an fsync of the directory
where the unlink was done. The commit isn't done during the unlink,
but it is forced the next time we try to log the parent directory.
Solution: record transid of last unlink/rename per directory when the
directory wasn't already logged. For renames this is only done when
renaming to a different directory.
mkdir foo/some_dir
normal commit
rename foo/some_dir foo2/some_dir
mkdir foo/some_dir
fsync foo/some_dir/some_file
The fsync above will unlink the original some_dir without recording
it in its new location (foo2). After a crash, some_dir will be gone
unless the fsync of some_file forces a full commit
2) we must log any new names for any file or dir that is in the fsync
log. This way we make sure not to lose files that are unlinked during
the same transaction.
2a) we must log any new names for any file or dir during rename
when the directory they are being removed from was logged.
2a is actually the more important variant. Without the extra logging
a crash might unlink the old name without recreating the new one
3) after a crash, we must go through any directories with a link count
of zero and redo the rm -rf
mkdir f1/foo
normal commit
rm -rf f1/foo
fsync(f1)
The directory f1 was fully removed from the FS, but fsync was never
called on f1, only its parent dir. After a crash the rm -rf must
be replayed. This must be able to recurse down the entire
directory tree. The inode link count fixup code takes care of the
ugly details.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
During log replay, inodes are copied from the log to the main filesystem
btrees. Sometimes they have a zero link count in the log but they actually
gain links during the replay or have some in the main btree.
This patch updates the link count to be at least one after copying the
inode out of the log. This makes sure the inode is deleted during an
iput while the rest of the replay code is still working on it.
The log replay has fixup code to make sure that link counts are correct
at the end of the replay, so we could use any non-zero number here and
it would work fine.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The delayed reference mechanism is responsible for all updates to the
extent allocation trees, including those updates created while processing
the delayed references.
This commit tries to limit the amount of work that gets created during
the final run of delayed refs before a commit. It avoids cowing new blocks
unless it is required to finish the commit, and so it avoids new allocations
that were not really required.
The goal is to avoid infinite loops where we are always making more work
on the final run of delayed refs. Over the long term we'll make a
special log for the last delayed ref updates as well.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This reads in blocks in the checksum btree before starting the
transaction in btrfs_finish_ordered_io. It makes it much more likely
we'll be able to do operations inside the transaction without
needing any btree reads, which limits transaction latencies overall.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_mark_buffer dirty would set dirty bits in the extent_io tree
for the buffers it was dirtying. This may require a kmalloc and it
was not atomic. So, anyone who called btrfs_mark_buffer_dirty had to
set any btree locks they were holding to blocking first.
This commit changes dirty tracking for extent buffers to just use a flag
in the extent buffer. Now that we have one and only one extent buffer
per page, this can be safely done without losing dirty bits along the way.
This also introduces a path->leave_spinning flag that callers of
btrfs_search_slot can use to indicate they will properly deal with a
path returned where all the locks are spinning instead of blocking.
Many of the btree search callers now expect spinning paths,
resulting in better btree concurrency overall.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Commits are fairly expensive, and so btrfs has code to sit around for a while
during the commit and let new writers come in.
But, while we're sitting there, new delayed refs might be added, and those
can be expensive to process as well. Unless the transaction is very very
young, it makes sense to go ahead and let the commit finish without hanging
around.
The commit grow loop isn't as important as it used to be, the fsync logging
code handles most performance critical syncs now.
Signed-off-by: Chris Mason <chris.mason@oracle.com>