Commit Graph

13722 Commits

Author SHA1 Message Date
Steven Whitehouse
e56985da45 GFS2: Fix page_mkwrite() return code
This allows for the possibility of returning VM_FAULT_OOM as
well as VM_FAULT_SIGBUS. This ensures that the correct action
is taken.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-04-20 16:02:02 +01:00
Steven Whitehouse
52fcd11c09 GFS2: Clear dirty bit at end of inode glock sync
The dirty bit can get set during the inode glock sync. Its too
complicated to change that at the moment, so this is the quick
fix - to clear the bit again at the end of the function.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-04-20 09:05:21 +01:00
Andi Kleen
613cbe3d48 Don't set relatime when noatime is specified
Since commit 0a1c01c947 ("Make relatime
default") when a file system is mounted explicitely with noatime it gets
both the MNT_RELATIME and MNT_NOATIME bits set.

This shows up like this in /proc/mounts:

  /dev/xxx /yyy ext3 rw,noatime,relatime,errors=continue,data=writeback 0 0

That looks strange.  The VFS uses noatime in this case, but both flags
are set.  So it's more a cosmetic issue, but still better to fix.

Cc: mjg@redhat.com
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-19 10:46:47 -07:00
Linus Torvalds
8d4ab5daca Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
  cifs: when renaming don't try to unlink negative dentry
  cifs: remove unneeded bcc_ptr update in CIFSTCon
  cifs: add cFYI messages with some of the saved strings from ssetup/tcon
  cifs: fix buffer size for tcon->nativeFileSystem field
  cifs: fix unicode string area word alignment in session setup
  [CIFS] Fix build break caused by change to new current_umask helper function
  [CIFS] Fix sparse warnings
  [CIFS] Add support for posix open during lookup
  cifs: no need to use rcu_assign_pointer on immutable keys
  cifs: remove dnotify thread code
  [CIFS] remove some build warnings
  cifs: vary timeout on writes past EOF based on offset (try #5)
  [CIFS] Fix build break from recent DFS patch when DFS support not enabled
  Remote DFS root support.
  [CIFS] Endian convert UniqueId when reporting inode numbers from server files
  cifs: remove some pointless conditionals before kfree()
  cifs: flush data on any setattr
2009-04-18 21:37:07 -07:00
Jeff Layton
fc6f394332 cifs: when renaming don't try to unlink negative dentry
When attempting to rename a file on a read-only share, the kernel can
call cifs_unlink on a negative dentry, which causes an oops. Only try
to unlink the file if it's a positive dentry.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Tested-by: Shirish Pargaonkar <shirishp@us.ibm.com>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-04-17 21:08:15 +00:00
Linus Torvalds
74a205a3f1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6:
  UIO: fix specific device driver missing statement for depmod
  Driver core: remove pr_fmt() from dynamic_dev_dbg() printk
  driver core: prevent device_for_each_child from oopsing
  dynamic debug: resurrect old pr_debug() semantics as pr_devel()
  Driver Core: early platform driver
  proc: mounts_poll() make consistent to mdstat_poll
  sysfs: sysfs poll keep the poll rule of regular file.
  driver core: allow non-root users to listen to uevents
  driver core: fix driver_match_device
  sysfs: don't use global workqueue in sysfs_schedule_callback()
2009-04-17 13:53:16 -07:00
Matt Kraai
6566abdbd0 AFS: Guard afs_file_readpage_read_complete() definition with CONFIG_AFS_FSCACHE
If CONFIG_AFS_FSCACHE is not defined, the following warning is displayed when
fs/afs/file.c is compiled:

 fs/afs/file.c:111: warning: ‘afs_file_readpage_read_complete’ defined but not used

This occurs because all calls to this function are guarded by
CONFIG_AFS_FSCACHE.  Thus, guard its definition as well.

Signed-off-by: Matt Kraai <kraai@ftbfs.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-17 09:55:19 -07:00
Alan Cox
d29a2e9438 vfat: Note the NLS requirement
Close bug #4754. Stop people getting into a situation where they can't
get their FAT filesystems to mount as they expect.

Signed-off-by: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-17 09:32:11 -07:00
Randy Dunlap
b80901bbf5 splice: fix new kernel-doc warnings
splice: fix kernel-doc warnings

  Warning(fs/splice.c:617): bad line:
  Warning(fs/splice.c:722): No description found for parameter 'sd'
  Warning(fs/splice.c:722): Excess function parameter 'pipe' description in 'splice_from_pipe_begin'

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-17 07:38:07 -07:00
Jeff Layton
22c9d52bc0 cifs: remove unneeded bcc_ptr update in CIFSTCon
This pointer isn't used again after this point. It's also not updated in
the ascii case, so there's no need to update it here.

Pointed-out-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-04-17 01:26:50 +00:00
Jeff Layton
313fecfa69 cifs: add cFYI messages with some of the saved strings from ssetup/tcon
...to make it easier to find problems in this area in the future.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-04-17 01:26:50 +00:00
Jeff Layton
f083def68f cifs: fix buffer size for tcon->nativeFileSystem field
The buffer for this was resized recently to fix a bug. It's still
possible however that a malicious server could overflow this field
by sending characters in it that are >2 bytes in the local charset.
Double the size of the buffer to account for this possibility.

Also get rid of some really strange and seemingly pointless NULL
termination. It's NULL terminating the string in the source buffer,
but by the time that happens, we've already copied the string.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-04-17 01:26:50 +00:00
Jeff Layton
27b87fe52b cifs: fix unicode string area word alignment in session setup
The handling of unicode string area alignment is wrong.
decode_unicode_ssetup improperly assumes that it will always be preceded
by a pad byte. This isn't the case if the string area is already
word-aligned.

This problem, combined with the bad buffer sizing for the serverDomain
string can cause memory corruption. The bad alignment can make it so
that the alignment of the characters is off. This can make them
translate to characters that are greater than 2 bytes each. If this
happens we can overflow the allocation.

Fix this by fixing the alignment in CIFS_SessSetup instead so we can
verify it against the head of the response. Also, clean up the
workaround for improperly terminated strings by checking for a
odd-length unicode buffers and then forcibly terminating them.

Finally, resize the buffer for serverDomain. Now that we've fixed
the alignment, it's probably fine, but a malicious server could
overflow it.

A better solution for handling these strings is still needed, but
this should be a suitable bandaid.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-04-17 01:26:50 +00:00
Steve French
88dd47fff4 [CIFS] Fix build break caused by change to new current_umask helper function
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-04-17 01:26:50 +00:00
Steve French
bc8cd4390c [CIFS] Fix sparse warnings
Signed-off-by: Shirish Pargaonkar <shirishp@us.ibm.com>
CC: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-04-17 01:26:49 +00:00
Steve French
a6ce4932fb [CIFS] Add support for posix open during lookup
This patch by utilizing lookup intents, and thus removing a network
roundtrip in the open path, improves performance dramatically on
open (30% or more) to Samba and other servers which support the
cifs posix extensions

Signed-off-by: Shirish Pargaonkar <shirishp@us.ibm.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-04-17 01:26:49 +00:00
Jeff Layton
d9fb5c091b cifs: no need to use rcu_assign_pointer on immutable keys
cifs: no need to use rcu_assign_pointer on immutable keys

Neither keytype in use by CIFS has an "update" method. This means that
the keys are immutable once instantiated. We don't need to use RCU
to set the payload data pointers.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-04-17 01:26:49 +00:00
Jeff Layton
5144ebf408 cifs: remove dnotify thread code
cifs: remove dnotify thread code

Al Viro recently removed the dir_notify code from the kernel along with
the CIFS code that used it. We can also get rid of the dnotify thread
as well.

In actuality, it never had anything to do with dir_notify anyway. All
it did was unnecessarily wake up all the tasks waiting on the response
queues every 15s. Previously that happened to prevent tasks from hanging
indefinitely when the server went unresponsive, but we put those to
sleep with proper timeouts now so there's no reason to keep this around.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-04-17 01:26:49 +00:00
Steve French
2d6d589d80 [CIFS] remove some build warnings
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-04-17 01:26:49 +00:00
Jeff Layton
fbec9ab952 cifs: vary timeout on writes past EOF based on offset (try #5)
This is the fourth version of this patch:

The first three generated a compiler warning asking for explicit curly
braces.

The first two didn't handle update the size correctly when writes that
didn't start at the eof were done.

The first patch also didn't update the size correctly when it explicitly
set via truncate().

This patch adds code to track the client's current understanding of the
size of the file on the server separate from the i_size, and then to use
this info to semi-intelligently set the timeout for writes past the EOF.

This helps prevent timeouts when trying to write large, sparse files on
windows servers.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-04-17 01:26:49 +00:00
Steve French
d036f50fc2 [CIFS] Fix build break from recent DFS patch when DFS support not enabled
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-04-17 01:26:48 +00:00
Igor Mammedov
1bfe73c258 Remote DFS root support.
Allows to mount share on a server that returns -EREMOTE
 at the tree connect stage or at the check on a full path
 accessibility.

Signed-off-by: Igor Mammedov <niallain@gmail.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-04-17 01:26:48 +00:00
Steve French
85a6dac54a [CIFS] Endian convert UniqueId when reporting inode numbers from server files
Jeff made a good point that we should endian convert the UniqueId when we use
it to set i_ino Even though this value is opaque to the client, when comparing
the inode numbers of the same server file from two different clients (one
big endian, one little endian) or when we compare a big endian client's view
of i_ino with what the server thinks - we should get the same value

Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-04-17 01:26:48 +00:00
Wei Yongjun
74496d365a cifs: remove some pointless conditionals before kfree()
Remove some pointless conditionals before kfree().

Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-04-17 01:26:48 +00:00
Jeff Layton
0f4d634c59 cifs: flush data on any setattr
We already flush all the dirty pages for an inode before doing
ATTR_SIZE and ATTR_MTIME changes. There's another problem though -- if
we change the mode so that the file becomes read-only then we may not
be able to write data to it after a reconnect.

Fix this by just going back to flushing all the dirty data on any
setattr call. There are probably some cases that can be optimized out,
but I'm not sure they're worthwhile and we need to consider them more
carefully to make sure that we don't cause regressions if we have
to reconnect before writeback occurs.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-04-17 01:26:48 +00:00
KOSAKI Motohiro
31b07093c4 proc: mounts_poll() make consistent to mdstat_poll
In recently sysfs_poll discussion, Neil Brown pointed out /proc/mounts
also should be fixed.

SUSv3 says "Regular files shall always poll TRUE for reading and
writing".  see
http://www.opengroup.org/onlinepubs/009695399/functions/poll.html

Then, mounts_poll()'s default should be "POLLIN | POLLRDNORM".  it mean
always readable.

In addition, event trigger should use "POLLERR | POLLPRI" instead
POLLERR.  it makes consistent to mdstat_poll() and sysfs_poll(). and,
select(2) can handle POLLPRI easily.


Reported-by: Neil Brown <neilb@suse.de>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Miklos Szeredi <mszeredi@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-04-16 16:17:10 -07:00
KOSAKI Motohiro
1af3557abd sysfs: sysfs poll keep the poll rule of regular file.
Currently, following test programs don't finished.

% ruby -e '
Thread.new { sleep }
File.read("/sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies")
'

strace expose the reason.

...
open("/sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies", O_RDONLY|O_LARGEFILE) = 3
ioctl(3, SNDCTL_TMR_TIMEBASE or TCGETS, 0xbf9fa6b8) = -1 ENOTTY (Inappropriate ioctl for device)
fstat64(3, {st_mode=S_IFREG|0444, st_size=4096, ...}) = 0
_llseek(3, 0, [0], SEEK_CUR)            = 0
select(4, [3], NULL, NULL, NULL)        = 1 (in [3])
read(3, "1400000 1300000 1200000 1100000 1"..., 4096) = 62
select(4, [3], NULL, NULL, NULL


Because Ruby (the scripting language) VM assume select system-call
against regular file don't block.  it because SUSv3 says "Regular files
shall always poll TRUE for reading and writing".  see
http://www.opengroup.org/onlinepubs/009695399/functions/poll.html it
seems valid assumption.

But sysfs_poll() don't keep this rule although sysfs file can read and
write always.

This patch restore proper poll behavior to sysfs.
/sys/block/md*/md/sync_action polling application and another sysfs
updating sensitive application still can use POLLERR and POLLPRI.

Cc: Neil Brown <neilb@suse.de>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-04-16 16:17:09 -07:00
Alex Chiang
d110271e1f sysfs: don't use global workqueue in sysfs_schedule_callback()
A sysfs attribute using sysfs_schedule_callback() to commit suicide
may end up calling device_unregister(), which will eventually call
a driver's ->remove function.

Drivers may call flush_scheduled_work() in their shutdown routines,
in which case lockdep will complain with something like the following:

  =============================================
  [ INFO: possible recursive locking detected ]
  2.6.29-rc8-kk #1
  ---------------------------------------------
  events/4/56 is trying to acquire lock:
  (events){--..}, at: [<ffffffff80257fc0>] flush_workqueue+0x0/0xa0

  but task is already holding lock:
  (events){--..}, at: [<ffffffff80257648>] run_workqueue+0x108/0x230

  other info that might help us debug this:
  3 locks held by events/4/56:
  #0:  (events){--..}, at: [<ffffffff80257648>] run_workqueue+0x108/0x230
  #1:  (&ss->work){--..}, at: [<ffffffff80257648>] run_workqueue+0x108/0x230
  #2:  (pci_remove_rescan_mutex){--..}, at: [<ffffffff803c10d1>] remove_callback+0x21/0x40

  stack backtrace:
  Pid: 56, comm: events/4 Not tainted 2.6.29-rc8-kk #1
  Call Trace:
  [<ffffffff8026dfcd>] validate_chain+0xb7d/0x1260
  [<ffffffff8026eade>] __lock_acquire+0x42e/0xa40
  [<ffffffff8026f148>] lock_acquire+0x58/0x80
  [<ffffffff80257fc0>] ? flush_workqueue+0x0/0xa0
  [<ffffffff8025800d>] flush_workqueue+0x4d/0xa0
  [<ffffffff80257fc0>] ? flush_workqueue+0x0/0xa0
  [<ffffffff80258070>] flush_scheduled_work+0x10/0x20
  [<ffffffffa0144065>] e1000_remove+0x55/0xfe [e1000e]
  [<ffffffff8033ee30>] ? sysfs_schedule_callback_work+0x0/0x50
  [<ffffffff803bfeb2>] pci_device_remove+0x32/0x70
  [<ffffffff80441da9>] __device_release_driver+0x59/0x90
  [<ffffffff80441edb>] device_release_driver+0x2b/0x40
  [<ffffffff804419d6>] bus_remove_device+0xa6/0x120
  [<ffffffff8043e46b>] device_del+0x12b/0x190
  [<ffffffff8043e4f6>] device_unregister+0x26/0x70
  [<ffffffff803ba969>] pci_stop_dev+0x49/0x60
  [<ffffffff803baab0>] pci_remove_bus_device+0x40/0xc0
  [<ffffffff803c10d9>] remove_callback+0x29/0x40
  [<ffffffff8033ee4f>] sysfs_schedule_callback_work+0x1f/0x50
  [<ffffffff8025769a>] run_workqueue+0x15a/0x230
  [<ffffffff80257648>] ? run_workqueue+0x108/0x230
  [<ffffffff8025846f>] worker_thread+0x9f/0x100
  [<ffffffff8025bce0>] ? autoremove_wake_function+0x0/0x40
  [<ffffffff802583d0>] ? worker_thread+0x0/0x100
  [<ffffffff8025b89d>] kthread+0x4d/0x80
  [<ffffffff8020d4ba>] child_rip+0xa/0x20
  [<ffffffff8020cebc>] ? restore_args+0x0/0x30
  [<ffffffff8025b850>] ? kthread+0x0/0x80
  [<ffffffff8020d4b0>] ? child_rip+0x0/0x20

Although we know that the device_unregister path will never acquire
a lock that a driver might try to acquire in its ->remove, in general
we should never attempt to flush a workqueue from within the same
workqueue, and lockdep rightly complains.

So as long as sysfs attributes cannot commit suicide directly and we
are stuck with this callback mechanism, put the sysfs callbacks on
their own workqueue instead of the global one.

This has the side benefit that if a suicidal sysfs attribute kicks
off a long chain of ->remove callbacks, we no longer induce a long
delay on the global queue.

This also fixes a missing module_put in the error path introduced
by sysfs-only-allow-one-scheduled-removal-callback-per-kobj.patch.

We never destroy the workqueue, but I'm not sure that's a
problem.

Reported-by: Kenji Kaneshige <kaneshige.kenji@jp.fujitsu.com>
Tested-by: Kenji Kaneshige <kaneshige.kenji@jp.fujitsu.com>
Signed-off-by: Alex Chiang <achiang@hp.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-04-16 16:17:08 -07:00
Chris Mason
35c80d5f40 Add block_write_full_page_endio for passing endio handler
block_write_full_page doesn't allow the caller to control what happens
when the IO is over.  This adds a new call named block_write_full_page_endio
so the buffer head end_io handler can be provided by the caller.

This will be used by the ext3 data=guarded mode to do i_size updates in
a workqueue based end_io handler.  end_buffer_async_write is also
exported so it can be called to do the dirty work of managing page
writeback for the higher level end_io handler.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
Acked-by: Theodore Tso <tytso@mit.edu>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-16 07:47:49 -07:00
Linus Torvalds
a2c252ebde Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes:
  GFS2: Use DEFINE_SPINLOCK
  GFS2: cleanup file_operations mess
  GFS2: Move umount flush rwsem
  GFS2: Fix symlink creation race
  GFS2: Make quotad's waiting interruptible
2009-04-15 09:04:12 -07:00
Nikanth Karthikesan
b1fffc9ca6 gfs2: Remove code handling bio_alloc failure with __GFP_WAIT
Remove code handling bio_alloc failure with __GFP_WAIT.
GFP_NOFS implies __GFP_WAIT.

Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-15 12:10:13 +02:00
Nikanth Karthikesan
226e7dabf5 ext4: Remove code handling bio_alloc failure with __GFP_WAIT
Remove code handling bio_alloc failure with __GFP_WAIT.
GFP_NOIO implies __GFP_WAIT.

Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-15 12:10:13 +02:00
Nikanth Karthikesan
4d1f9fdb61 dio: Remove code handling bio_alloc failure with __GFP_WAIT
Remove code handling bio_alloc failure with __GFP_WAIT.
GFP_KERNEL implies __GFP_WAIT.

Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-15 12:10:13 +02:00
Jens Axboe
86c824b943 bio: add documentation to bio_alloc()
Explain that with __GFP_WAIT set it will not fail, and that the caller
must never allocate more than 1 bio at the time.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-15 12:10:12 +02:00
Miklos Szeredi
61e0d47c33 splice: add helpers for locking pipe inode
There are lots of sequences like this, especially in splice code:

	if (pipe->inode)
		mutex_lock(&pipe->inode->i_mutex);
	/* do something */
	if (pipe->inode)
		mutex_unlock(&pipe->inode->i_mutex);

so introduce helpers which do the conditional locking and unlocking.
Also replace the inode_double_lock() call with a pipe_double_lock()
helper to avoid spreading the use of this functionality beyond the
pipe code.

This patch is just a cleanup, and should cause no behavioral changes.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-15 12:10:12 +02:00
Miklos Szeredi
f8cc774ce4 splice: remove generic_file_splice_write_nolock()
Remove the now unused generic_file_splice_write_nolock() function.
It's conceptually broken anyway, because splice may need to wait for
pipe events so holding locks across the whole operation is wrong.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-15 12:10:12 +02:00
Miklos Szeredi
328eaaba4e ocfs2: fix i_mutex locking in ocfs2_splice_to_file()
Rearrange locking of i_mutex on destination and call to
ocfs2_rw_lock() so locks are only held while buffers are copied with
the pipe_to_file() actor, and not while waiting for more data on the
pipe.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-15 12:10:12 +02:00
Miklos Szeredi
eb443e5a25 splice: fix i_mutex locking in generic_splice_write()
Rearrange locking of i_mutex on destination so it's only held while
buffers are copied with the pipe_to_file() actor, and not while
waiting for more data on the pipe.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-15 12:10:11 +02:00
Miklos Szeredi
2933970b96 splice: remove i_mutex locking in splice_from_pipe()
splice_from_pipe() is only called from two places:

  - generic_splice_sendpage()
  - splice_write_null()

Neither of these require i_mutex to be taken on the destination inode.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-15 12:10:11 +02:00
Miklos Szeredi
b3c2d2ddd6 splice: split up __splice_from_pipe()
Split up __splice_from_pipe() into four helper functions:

  splice_from_pipe_begin()
  splice_from_pipe_next()
  splice_from_pipe_feed()
  splice_from_pipe_end()

splice_from_pipe_next() will wait (if necessary) for more buffers to
be added to the pipe.  splice_from_pipe_feed() will feed the buffers
to the supplied actor and return when there's no more data available
(or if all of the requested data has been copied).

This is necessary so that implementations can do locking around the
non-waiting splice_from_pipe_feed().

This patch should not cause any change in behavior.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-15 12:10:11 +02:00
Xu Gang
1328df7252 GFS2: Use DEFINE_SPINLOCK
SPIN_LOCK_UNLOCKED is deprecated, use DEFINE_SPINLOCK instead.
(as suggested in Documentation/spinlocks.txt)

Signed-off-by: Xu Gang <xug@cn.fujitsu.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-04-15 10:18:07 +01:00
Christoph Hellwig
10d2198805 GFS2: cleanup file_operations mess
Remove the weird pointer to file_operations mess and replace it with
straight-forward defining of the lockinginstance names to the _nolock
variants.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-04-15 10:17:18 +01:00
Steven Whitehouse
a228df6339 GFS2: Move umount flush rwsem
The rwsem, used only on umount, is in the wrong place in glock.c.
This patch moves it up a bit so that it does not get called under
a spinlock.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-04-15 10:16:13 +01:00
Steven Whitehouse
5cf32524de GFS2: Fix symlink creation race
In certain cases symlinks can appear to have zero size if a lookup
on the inode occurs within a certain (very short) time after the
symlink has been created. The symlink is correctly created on disk
but appears to have zero size when stat()ed. This patch closes the
race and prevents incorrect sizes appearing.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-04-15 10:15:38 +01:00
Steven Whitehouse
7fa5d20d1a GFS2: Make quotad's waiting interruptible
So we don't count its D state in the loadavg.

Reported-by: Nathan Straz <nstraz@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-04-15 10:15:08 +01:00
Jens Axboe
053c525fcf buffer: switch do_emergency_thaw() away from pdflush_operation()
This is (again) a preparatory patch similar to commit
a2a9537ac0. It open codes a simple
async way of executing do_thaw_all() out of context, so we can get
rid of pdflush.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-15 08:28:12 +02:00
Linus Torvalds
e9de427e40 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: fix "direct_io" private mmap
  fuse: fix argument type in fuse_get_user_pages()
2009-04-14 10:12:07 -07:00
Linus Torvalds
9fc0178caa Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2:
  nilfs2: fix possible mismatch of sufile counters on recovery
  nilfs2: segment usage file cleanups
  nilfs2: fix wrong accounting and duplicate brelse in nilfs_sufile_set_error
  nilfs2: simplify handling of active state of segments fix
  nilfs2: remove module version
  nilfs2: fix lockdep recursive locking warning on meta data files
  nilfs2: fix lockdep recursive locking warning on bmap
  nilfs2: return f_fsid for statfs2
2009-04-14 10:10:53 -07:00
Theodore Ts'o
38d726d153 jbd: use SWRITE_SYNC_PLUG when writing synchronous revoke records
The revoke records must be written using the same way as the rest of
the blocks during the commit process; that is, either marked as
synchronous writes or as asynchornous writes.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-04-14 10:10:47 -04:00
Theodore Ts'o
67c457a8c3 jbd2: use SWRITE_SYNC_PLUG when writing synchronous revoke records
The revoke records must be written using the same way as the rest of
the blocks during the commit process; that is, either marked as
synchronous writes or as asynchornous writes.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-04-14 07:50:56 -04:00
Chuck Ebbert
6b82f3cb2d ext4: really print the find_group_flex fallback warning only once
Missing braces caused the warning to print more than once.

Signed-Off-By: Chuck Ebbert <cebbert@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-04-14 07:37:40 -04:00
Jan Kara
316cb4ef3e ext2: fix data corruption for racing writes
If two writers allocating blocks to file race with each other (e.g.
because writepages races with ordinary write or two writepages race with
each other), ext2_getblock() can be called on the same inode in parallel.
Before we are going to allocate new blocks, we have to recheck the block
chain we have obtained so far without holding truncate_mutex.  Otherwise
we could overwrite the indirect block pointer set by the other writer
leading to data loss.

The below test program by Ying is able to reproduce the data loss with ext2
on in BRD in a few minutes if the machine is under memory pressure:

long kMemSize  = 50 << 20;
int kPageSize = 4096;

int main(int argc, char **argv) {
	int status;
	int count = 0;
	int i;
	char *fname = "/mnt/test.mmap";
	char *mem;
	unlink(fname);
	int fd = open(fname, O_CREAT | O_EXCL | O_RDWR, 0600);
	status = ftruncate(fd, kMemSize);
	mem = mmap(0, kMemSize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	// Fill the memory with 1s.
	memset(mem, 1, kMemSize);
	sleep(2);
	for (i = 0; i < kMemSize; i++) {
		int byte_good = mem[i] != 0;
		if (!byte_good && ((i % kPageSize) == 0)) {
			//printf("%d ", i / kPageSize);
			count++;
		}
	}
	munmap(mem, kMemSize);
	close(fd);
	unlink(fname);

	if (count > 0) {
		printf("Running %d bad page\n", count);
		return 1;
	}
	return 0;
}

Cc: Ying Han <yinghan@google.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Mingming Cao <cmm@us.ibm.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-13 15:04:33 -07:00
Jan Kara
3243387948 jbd: update locking coments
Update information about locking in JBD revoke code.

Reported-by: Lin Tan <tammy000@gmail.com>.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-13 15:04:32 -07:00
Dave Anderson
eb2e5f452a hfs: fix memory leak when unmounting
When an HFS filesystem is unmounted, it leaks a 2-page bitmap.  Also,
under extreme memory pressure, it's possible that hfs_releasepage() may
use a tree pointer that has not been initialized, and if so, the release
request should just be rejected.

[akpm@linux-foundation.org: free_pages(0) is legal, remove obvious comment]
Signed-off-by: Dave Anderson <anderson@redhat.com>
Tested-by: Eugene Teo <eugeneteo@kernel.sg>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-13 15:04:29 -07:00
Linus Torvalds
3c1795cc4b Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
  xfs: remove xfs_flush_space
  xfs: flush delayed allcoation blocks on ENOSPC in create
  xfs: block callers of xfs_flush_inodes() correctly
  xfs: make inode flush at ENOSPC synchronous
  xfs: use xfs_sync_inodes() for device flushing
  xfs: inform the xfsaild of the push target before sleeping
  xfs: prevent unwritten extent conversion from blocking I/O completion
  xfs: fix double free of inode
  xfs: validate log feature fields correctly
2009-04-13 14:35:13 -07:00
Ryusuke Konishi
c85399c2da nilfs2: fix possible mismatch of sufile counters on recovery
On-disk counters ndirtysegs and ncleansegs of sufile, can go wrong
after roll-forward recovery because
nilfs_prepare_segment_for_recovery() function marks segments dirty
without adjusting value of these counters.

This fixes the problem by adding a function to sufile which does the
operation adjusting the counters, and by letting the recovery function
use it.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-04-13 09:53:52 +09:00
Ryusuke Konishi
a703018f7b nilfs2: segment usage file cleanups
This will simplify sufile.c by sharing common code which repeatedly
appears in routines updating a segment usage entry; a wrapper function
nilfs_sufile_update() is introduced for the purpose, and counter
modifications are integrated to a new function
nilfs_sufile_mod_counter().

This is a preparation for the successive bugfix patch ("nilfs2: fix
possible mismatch of sufile counters on recovery").

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-04-13 09:53:51 +09:00
Ryusuke Konishi
88072faf9a nilfs2: fix wrong accounting and duplicate brelse in nilfs_sufile_set_error
The nilfs_sufile_set_error() function wrongly adjusts the number of
dirty segments instead of the number of clean segments.  In addition,
the function calls brelse() twice for the same buffer head.

This fixes these bugs.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-04-13 09:53:51 +09:00
Ryusuke Konishi
3efb55b496 nilfs2: simplify handling of active state of segments fix
This fixes a bug of ("nilfs2: simplify handling of active state of
segments") patch.  The patch did not take account that a base index is
increased in nilfs_sufile_get_suinfo() function if requested entries
go across block boundary on sufile.

Due to this bug, the active flag sometimes appears on wrong segments
and has induced malfunction of garbage collection.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-04-13 09:53:51 +09:00
Ryusuke Konishi
e7a7402c0d nilfs2: remove module version
A MODULE_VERSION() macro has been used in out-of-tree nilfs modules,
but it's needless and not updated in tree.  So, this removes it along
with the version declaration.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-04-13 09:53:50 +09:00
Ryusuke Konishi
c2698e50e3 nilfs2: fix lockdep recursive locking warning on meta data files
This fixes the following false detection of lockdep against nilfs meta
data files:

=============================================
[ INFO: possible recursive locking detected ]
2.6.29 #26
---------------------------------------------
mount.nilfs2/4185 is trying to acquire lock:
 (&mi->mi_sem){----}, at: [<d0c7925b>] nilfs_sufile_get_stat+0x1e/0x105 [nilfs2]
 but task is already holding lock:
  (&mi->mi_sem){----}, at: [<d0c72026>] nilfs_count_free_blocks+0x48/0x84 [nilfs2]

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-04-13 09:53:50 +09:00
Ryusuke Konishi
bcb48891b0 nilfs2: fix lockdep recursive locking warning on bmap
The bmap semaphore of DAT file can be held while a bmap of other files
is locked.  This has caused the following false detection of lockdep
check:

mount.nilfs2/4667 is trying to acquire lock:
 (&bmap->b_sem){..--}, at: [<d0c6c4b4>] nilfs_bmap_lookup_at_level+0x1a/0x74 [nilfs2]

but task is already holding lock:
 (&bmap->b_sem){..--}, at: [<d0c6c4b4>] nilfs_bmap_lookup_at_level+0x1a/0x74 [nilfs2]

This will fix the false detection by distinguishing semaphores of the
DAT and other files.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-04-13 09:53:49 +09:00
Ryusuke Konishi
c306af23e1 nilfs2: return f_fsid for statfs2
This follows the change of Coly Li's series ("fs: return f_fsid for
statfs(2)"), and make nilfs2 return f_fsid info for statfs(2).

Acked-by: Coly Li <coly.li@suse.de>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-04-13 09:53:49 +09:00
Linus Torvalds
54f93b74cf Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: check block device size on mount
  ext4: Fix off-by-one-error in ext4_valid_extent_idx()
  ext4: Fix big-endian problem in __ext4_check_blockref()
2009-04-09 16:42:05 -07:00
Felix Blyakher
dc2a5536d6 Merge branch 'master' into for-linus 2009-04-09 14:12:07 -05:00
Stoyan Gaydarov
11ff5f6aff afs: BUG to BUG_ON changes
Signed-off-by: Stoyan Gaydarov <stoyboyker@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-09 10:41:19 -07:00
Miklos Szeredi
3121bfe763 fuse: fix "direct_io" private mmap
MAP_PRIVATE mmap could return stale data from the cache for
"direct_io" files.  Fix this by flushing the cache on mmap. 

Found with a slightly modified fsx-linux.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2009-04-09 17:37:53 +02:00
Miklos Szeredi
ce60a2f157 fuse: fix argument type in fuse_get_user_pages()
Fix the following warning:

fs/fuse/file.c: In function 'fuse_direct_io':
fs/fuse/file.c:1002: warning: passing argument 3 of 'fuse_get_user_pages' from incompatible pointer type

This was introduced by commit f4975c67 "fuse: allow kernel to access
"direct_io" files".

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2009-04-09 17:37:52 +02:00
Linus Torvalds
a7b334de4d Merge branch 'ext3-latency-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'ext3-latency-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext3: Try to avoid starting a transaction in writepage for data=writepage
  block_write_full_page: switch synchronous writes to use WRITE_SYNC_PLUG
2009-04-08 17:42:32 -07:00
Nobuhiro Iwamatsu
4c967291fc nommu: fix typo vma->pg_off to vma->vm_pgoff
6260a4b052 ("/proc/pid/maps: don't show
pgoff of pure ANON VMAs" had a typo.

fs/proc/task_nommu.c:138: error: 'struct vm_area_struct' has no member named 'pg_off'
distcc[21484] ERROR: compile fs/proc/task_nommu.c on sprygo/32 failed

Signed-off-by: Nobuhiro Iwamatsu <iwamatsu.nobuhiro@renesas.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-08 10:21:44 -07:00
Alexander Beregalov
2b3fffefea befs: fix build on parisc
fs/befs/super.c:85: error: 'PAGE_SIZE' undeclared

Signed-off-by: Alexander Beregalov <a.beregalov@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-08 10:21:43 -07:00
Jan Kara
430db323fa ext3: Try to avoid starting a transaction in writepage for data=writepage
This does the same as commit 9e80d40773
(avoid starting a transaction when no block allocation is needed)
but for data=writeback mode of ext3. We also cleanup the data=ordered
case a bit to stick to coding style...

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-04-08 13:15:10 -04:00
Theodore Ts'o
6e34eeddf7 block_write_full_page: switch synchronous writes to use WRITE_SYNC_PLUG
Now that we have a distinction between WRITE_SYNC and WRITE_SYNC_PLUG,
use WRITE_SYNC_PLUG in __block_write_full_page() to avoid unplugging
the block device I/O queue between each page that gets flushed out.

Otherwise, when we run sync() or fsync() and we need to write out a
large number of pages, the block device queue will get unplugged
between for every page that is flushed out, which will be a pretty
serious performance regression caused by commit a64c8610.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-04-08 13:15:09 -04:00
Trond Myklebust
2b2ec7554c NFS: Fix the return value in nfs_page_mkwrite()
Commit c2ec175c39 ("mm: page_mkwrite
change prototype to match fault") exposed a bug in the NFS
implementation of page_mkwrite.  We should be returning 0 on success...

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 14:07:03 -07:00
From: Thiemo Nagel
0f2ddca66d ext4: check block device size on mount
Signed-off-by: Thiemo Nagel <thiemo.nagel@ph.tum.de>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-04-07 14:07:47 -04:00
Miklos Szeredi
7bfac9ecf0 splice: fix deadlock in splicing to file
There's a possible deadlock in generic_file_splice_write(),
splice_from_pipe() and ocfs2_file_splice_write():

 - task A calls generic_file_splice_write()
 - this calls inode_double_lock(), which locks i_mutex on both
   pipe->inode and target inode
 - ordering depends on inode pointers, can happen that pipe->inode is
   locked first
 - __splice_from_pipe() needs more data, calls pipe_wait()
 - this releases lock on pipe->inode, goes to interruptible sleep
 - task B calls generic_file_splice_write(), similarly to the first
 - this locks pipe->inode, then tries to lock inode, but that is
   already held by task A
 - task A is interrupted, it tries to lock pipe->inode, but fails, as
   it is already held by task B
 - ABBA deadlock

Fix this by explicitly ordering locks: the outer lock must be on
target inode and the inner lock (which is later unlocked and relocked)
must be on pipe->inode.  This is OK, pipe inodes and target inodes
form two nonoverlapping sets, generic_file_splice_write() and friends
are not called with a target which is a pipe.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:34:46 -07:00
Ryusuke Konishi
612392307c nilfs2: support nanosecond timestamp
After a review of user's feedback for finding out other compatibility
issues, I found nilfs improperly initializes timestamps in inode;
CURRENT_TIME was used there instead of CURRENT_TIME_SEC even though nilfs
didn't have nanosecond timestamps on disk.  A few users gave us the report
that the tar program sometimes failed to expand symbolic links on nilfs,
and it turned out to be the cause.

Instead of applying the above displacement, I've decided to support
nanosecond timestamps on this occation.  Fortunetaly, a needless 64-bit
field was in the nilfs_inode struct, and I found it's available for this
purpose without impact for the users.

So, this will do the enhancement and resolve the tar problem.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:20 -07:00
Ryusuke Konishi
e339ad31f5 nilfs2: introduce secondary super block
The former versions didn't have extra super blocks.  This improves the
weak point by introducing another super block at unused region in tail of
the partition.

This doesn't break disk format compatibility; older versions just ingore
the secondary super block, and new versions just recover it if it doesn't
exist.  The partition created by an old mkfs may not have unused region,
but in that case, the secondary super block will not be added.

This doesn't make more redundant copies of the super block; it is a future
work.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:20 -07:00
Ryusuke Konishi
cece552074 nilfs2: simplify handling of active state of segments
will reduce some lines of segment constructor.  Previously, the state was
complexly controlled through a list of segments in order to keep
consistency in meta data of usage state of segments.  Instead, this
presents ``calculated'' active flags to userland cleaner program and stop
maintaining its real flag on disk.

Only by this fake flag, the cleaner cannot exactly know if each segment is
reclaimable or not.  However, the recent extension of nilfs_sustat ioctl
struct (nilfs2-extend-nilfs_sustat-ioctl-struct.patch) can prevent the
cleaner from reclaiming in-use segment wrongly.

So, now I can apply this for simplification.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:20 -07:00
Ryusuke Konishi
c96fa464a5 nilfs2: mark minor flag for checkpoint created by internal operation
Nilfs creates checkpoints even for garbage collection or metadata updates
such as checkpoint mode change.  So, user often sees checkpoints created
only by such internal operations.

This is inconvenient in some situations.  For example, application that
monitors checkpoints and changes them to snapshots, will fall into an
infinite loop because it cannot distinguish internally created
checkpoints.

This patch solves this sort of problem by adding a flag to checkpoint for
identification.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:19 -07:00
Ryusuke Konishi
458c5b0822 nilfs2: clean up sketch file
The sketch file is a file to mark checkpoints with user data.  It was
experimentally introduced in the original implementation, and now
obsolete.  The file was handled differently with regular files; the file
size got truncated when a checkpoint was created.

This stops the special treatment and will treat it as a regular file.
Most users are not affected because mkfs.nilfs2 no longer makes this file.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:19 -07:00
Ryusuke Konishi
e626874685 nilfs2: super block operations fix endian bug
This adds a missing endian conversion of checksum field in the super
block.  This fixes compatibility issue on big endian machines which will
come to surface after supporting recovery of super block.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:19 -07:00
Ryusuke Konishi
1f5abe7e7d nilfs2: replace BUG_ON and BUG calls triggerable from ioctl
Pekka Enberg advised me:
> It would be nice if BUG(), BUG_ON(), and panic() calls would be
> converted to proper error handling using WARN_ON() calls. The BUG()
> call in nilfs_cpfile_delete_checkpoints(), for example, looks to be
> triggerable from user-space via the ioctl() system call.

This will follow the comment and keep them to a minimum.

Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:19 -07:00
Ryusuke Konishi
2c2e52fc4f nilfs2: extend nilfs_sustat ioctl struct
This adds a new argument to the nilfs_sustat structure.

The extended field allows to delete volatile active state of segments,
which was needed to protect freshly-created segments from garbage
collection but has confused code dealing with segments.  This
extension alleviates the mess and gives room for further
simplifications.

The volatile active flag is not persistent, so it's eliminable on this
occasion without affecting compatibility other than the ioctl change.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:19 -07:00
Ryusuke Konishi
7a9461939a nilfs2: use unlocked_ioctl
Pekka Enberg suggested converting ->ioctl operations to use
->unlocked_ioctl to avoid BKL.

The conversion was verified to be safe, so I will take it on this
occasion.

Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:19 -07:00
Ryusuke Konishi
8082d36aed nilfs2: remove compat ioctl code
This removes compat code from the nilfs ioctls and applies the same
function for both .ioctl and .compat_ioctl file operations.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:18 -07:00
Ryusuke Konishi
dc498d09be nilfs2: use fixed sized types for ioctl structures
Nilfs ioctl had structures not having fixed sized types such as:

  struct nilfs_argv {
         void *v_base;
         size_t v_nmembs;
         size_t v_size;
         int v_index;
         int v_flags;
  };

Further, some of them are wrongly aligned:

  e.g.

  struct nilfs_cpmode {
        __u64 cm_cno;
        int cm_mode;
  };

The size of wrongly aligned structures varies depending on
architectures, and it breaks the identity of ioctl commands, which
leads to arch dependent errors.

Previously, these are compensated by using compat_ioctl.

This fixes these problems and allows removal of compat ioctl.

Since this will change sizes of those structures, binary compatibility
for the past utilities will once break; new utilities have to be used
instead.  However, it would be helpful to avoid platform dependent
problems in the long term.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:18 -07:00
Ryusuke Konishi
1088dcf4c3 nilfs2: remove timedwait ioctl command
This removes NILFS_IOCTL_TIMEDWAIT command from ioctl interface along
with the related flags and wait queue.

The command is terrible because it just sleeps in the ioctl.  I prefer
to avoid this by devising means of event polling in userland program.
By reconsidering the userland GC daemon, I found this is possible
without changing behaviour of the daemon and sacrificing efficiency.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:18 -07:00
Ryusuke Konishi
76068c4ff1 nilfs2: fix buggy behavior seen in enumerating checkpoints
This will fix the weird behavior of lscp command in listing continuously
created checkpoints; the output of lscp is rewinded regularly for the
recent nilfs.  As a result of debugging, a defect was found in
nilfs_cpfile_do_get_cpinfo() function.

Though the function can be repeatedly called to enumerate checkpoints and
it can skip invalid checkpoint entries, the index value was not carried
between successive calls.

The bug has long been present, and came to surface after applying a bugfix
nilfs2-fix-problems-of-memory-allocation-in-ioctl.patch, which increased
frequency of calling the function.  The similar bugfix was already applied
for ``snapshots'' by
nilfs2-fix-gc-failure-on-volumes-keeping-numerous-snapshots.patch.

This fixes the problem by making the index argument bidirectional on the
function.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:18 -07:00
Pekka Enberg
8acfbf0939 nilfs2: clean up indirect function calling conventions
This cleans up the strange indirect function calling convention used in
nilfs to follow the normal kernel coding style.

Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:17 -07:00
Ryusuke Konishi
7fa10d2001 nilfs2: fix improper return values of nilfs_get_cpinfo ioctl
A few tool developers gave me requests for fixing inconvenient return
value of nilfs_get_cpinfo() ioctl; if the requested mode is NILFS_SNAPSHOT
and the specified start entry is not a snapshot, the ioctl unnaturally
returns one as the number of acquired snapshot item.

In addition, the ioctl function returns an ENOENT error for checkpoints
within blocks deleted by garbage collection.

These behaviors require corrections for programs which enumerate
snapshots.  This resolves the inconvenience by changing the return values
to zero for the above cases.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:17 -07:00
Ryusuke Konishi
b028fcfc4c nilfs2: fix gc failure on volumes keeping numerous snapshots
This resolves the following failure of nilfs2 cleaner daemon:

 nilfs_cleanerd[20670]: cannot clean segments: No such file or directory
 nilfs_cleanerd[20670]: shutdown

When creating thousands of snapshots, the cleaner daemon had rarely died
as above due to an error returned from the kernel code.

After applying the recent patch which fixed memory allocation problems in
ioctl (Message-Id: <20081215.155840.105124170.ryusuke@osrg.net>), the
problem gets more frequent.

It turned out to be a bug of nilfs_ioctl_wrap_copy function and one of its
callback routines to read out information of snapshots; if the
nilfs_ioctl_wrap_copy function divided a large read request into multiple
requests, the second and later requests have failed since a restart
position on snapshot meta data was not properly set forward.

It's a deficiency of the callback interface that cannot pass the restart
position among multiple requests.  This patch fixes the issue by allowing
nilfs_ioctl_wrap_copy and snapshot read functions to exchange a position
argument.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:17 -07:00
Ryusuke Konishi
047180f2d7 nilfs2: insert explanations in gcinode file
The file gcinode.c gives buffer cache functions for on-disk blocks
moved in garbage collection.  Joern Engel has suggested inserting its
explanations in the source file (Message-ID:
<20080917144146.GD8750@logfs.org> and
<20080917224953.GB14644@logfs.org>).

This follows the comment.

Cc: Joern Engel <joern@logfs.org>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:17 -07:00
Ryusuke Konishi
47420c7998 nilfs2: avoid double error caused by nilfs_transaction_end
Pekka Enberg pointed out that double error handlings found after
nilfs_transaction_end() can be avoided by separating abort operation:

 OK, I don't understand this. The only way nilfs_transaction_end() can
 fail is if we have NILFS_TI_SYNC set and we fail to construct the
 segment. But why do we want to construct a segment if we don't commit?

 I guess what I'm asking is why don't we have a separate
 nilfs_transaction_abort() function that can't fail for the erroneous
 case to avoid this double error value tracking thing?

This does the separation and renames nilfs_transaction_end() to
nilfs_transaction_commit() for clarification.

Since, some calls of these functions were used just for exclusion control
against the segment constructor, they are replaced with semaphore
operations.

Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:17 -07:00
Ryusuke Konishi
a2e7d2df82 nilfs2: cleanup nilfs_clear_inode
This will remove the following unnecessary locks and cleanup code in
nilfs_clear_inode():

- unnecessary protection using nilfs_transaction_begin() and
  nilfs_transaction_end().

- cleanup code of i_dirty list field which is never chained
  when this function is called.

- spinlock used when releasing i_bh field.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:17 -07:00
Ryusuke Konishi
3358b4aaa8 nilfs2: fix problems of memory allocation in ioctl
This is another patch for fixing the following problems of a memory
copy function in nilfs2 ioctl:

(1) It tries to allocate 128KB size of memory even for small objects.

(2) Though the function repeatedly tries large memory allocations
    while reducing the size, GFP_NOWAIT flag is not specified.
    This increases the possibility of system memory shortage.

(3) During the retries of (2), verbose warnings are printed
    because _GFP_NOWARN flag is not used for the kmalloc calls.

The first patch was still doing large allocations by kmalloc which are
repeatedly tried while reducing the size.

Andi Kleen told me that using copy_from_user for large memory is not
good from the viewpoint of preempt latency:

 On Fri, 12 Dec 2008 21:24:11 +0100, Andi Kleen <andi@firstfloor.org> wrote:
 > > In the current interface, each data item is copied twice: one is to
 > > the allocated memory from user space (via copy_from_user), and another
 >
 > For such large copies it is better to use multiple smaller (e.g. 4K)
 > copy user, that gives better real time preempt latencies. Each cfu has a
 > cond_resched(), but only one, not multiple times in the inner loop.

He also advised me that:

 On Sun, 14 Dec 2008 16:13:27 +0100, Andi Kleen <andi@firstfloor.org> wrote:
 > Better would be if you could go to PAGE_SIZE. order 0 allocations
 > are typically the fastest / least likely to stall.
 >
 > Also in this case it's a good idea to use __get_free_pages()
 > directly, kmalloc tends to be become less efficient at larger
 > sizes.

For the function in question, the size of buffer memory can be reduced
since the buffer is repeatedly used for a number of small objects.  On
the other hand, it may incur large preempt latencies for larger buffer
because a copy_from_user (and a copy_to_user) was applied only once
each cycle.

With that, this revision uses the order 0 allocations with
__get_free_pages() to fix the original problems.

Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:16 -07:00
Ryusuke Konishi
0c4fb87764 nilfs2: update makefile and Kconfig
This adds a Makefile for the nilfs2 file system, and updates the
makefile and Kconfig file in the file system directory.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:16 -07:00
Koji Sato
7942b919f7 nilfs2: ioctl operations
This adds userland interface implemented with ioctl.

Signed-off-by: Koji Sato <sato.koji@lab.ntt.co.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:16 -07:00
Ryusuke Konishi
a3d93f709e nilfs2: block cache for garbage collection
This adds the cache of on-disk blocks to be moved in garbage
collection.  The disk blocks are held with dummy inodes (called
gcinodes), and this file provides lookup function of the dummy inodes,
and their buffer read function.

Signed-off-by: Seiji Kihara <kihara.seiji@lab.ntt.co.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Yoshiji Amagai <amagai.yoshiji@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:16 -07:00
Ryusuke Konishi
84ef1ecfde nilfs2: another dat for garbage collection
NILFS2 uses another DAT inode during garbage collection to ensure
atomicity and consistency of the DAT in the transient state.  This
twin inode is called GCDAT.

This adds functions to initialize the GCDAT and to switch page caches
and B-tree node caches between these two inodes.

Signed-off-by: Seiji Kihara <kihara.seiji@lab.ntt.co.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Yoshiji Amagai <amagai.yoshiji@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:16 -07:00