Commit Graph

316 Commits

Author SHA1 Message Date
Hugh Dickins
689bcebfda [PATCH] unpaged: PG_reserved bad_page
It used to be the case that PG_reserved pages were silently never freed, but
in 2.6.15-rc1 they may be freed with a "Bad page state" message.  We should
work through such cases as they appear, fixing the code; but for now it's
safer to issue the message without freeing the page, leaving PG_reserved set.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-22 09:13:42 -08:00
Hugh Dickins
f57e88a8d8 [PATCH] unpaged: ZERO_PAGE in VM_UNPAGED
It's strange enough to be looking out for anonymous pages in VM_UNPAGED areas,
let's not insert the ZERO_PAGE there - though whether it would matter will
depend on what we decide about ZERO_PAGE refcounting.

But whereas do_anonymous_page may (exceptionally) be called on a VM_UNPAGED
area, do_no_page should never be: just BUG_ON.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-22 09:13:42 -08:00
Hugh Dickins
ee498ed730 [PATCH] unpaged: anon in VM_UNPAGED
copy_one_pte needs to copy the anonymous COWed pages in a VM_UNPAGED area,
zap_pte_range needs to free them, do_wp_page needs to COW them: just like
ordinary pages, not like the unpaged.

But recognizing them is a little subtle: because PageReserved is no longer a
condition for remap_pfn_range, we can now mmap all of /dev/mem (whether the
distro permits, and whether it's advisable on this or that architecture, is
another matter).  So if we can see a PageAnon, it may not be ours to mess with
(or may be ours from elsewhere in the address space).  I suspect there's an
entertaining insoluble self-referential problem here, but the page_is_anon
function does a good practical job, and MAP_PRIVATE PROT_WRITE VM_UNPAGED will
always be an odd choice.

In updating the comment on page_address_in_vma, noticed a potential NULL
dereference, in a path we don't actually take, but fixed it.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-22 09:13:42 -08:00
Hugh Dickins
920fc356f5 [PATCH] unpaged: COW on VM_UNPAGED
Remove the BUG_ON(vma->vm_flags & VM_UNPAGED) from do_wp_page, and let it do
Copy-On-Write without touching the VM_UNPAGED's page counts - but this is
incomplete, because the anonymous page it inserts will itself need to be
handled, here and in other functions - next patch.

We still don't copy the page if the pfn is invalid, because the
copy_user_highpage interface does not allow it.  But that's not been a problem
in the past: can be added in later if the need arises.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-22 09:13:42 -08:00
Hugh Dickins
101d2be764 [PATCH] unpaged: VM_NONLINEAR VM_RESERVED
There's one peculiar use of VM_RESERVED which the previous patch left behind:
because VM_NONLINEAR's try_to_unmap_cluster uses vm_private_data as a swapout
cursor, but should never meet VM_RESERVED vmas, it was a way of extending
VM_NONLINEAR to VM_RESERVED vmas using vm_private_data for some other purpose.
 But that's an empty set - they don't have the populate function required.  So
just throw away those VM_RESERVED tests.

But one more interesting in rmap.c has to go too: try_to_unmap_one will want
to swap out an anonymous page from VM_RESERVED or VM_UNPAGED area.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-22 09:13:42 -08:00
Hugh Dickins
0b14c179a4 [PATCH] unpaged: VM_UNPAGED
Although we tend to associate VM_RESERVED with remap_pfn_range, quite a few
drivers set VM_RESERVED on areas which are then populated by nopage.  The
PageReserved removal in 2.6.15-rc1 changed VM_RESERVED not to free pages in
zap_pte_range, without changing those drivers not to set it: so their pages
just leak away.

Let's not change miscellaneous drivers now: introduce VM_UNPAGED at the core,
to flag the special areas where the ptes may have no struct page, or if they
have then it's not to be touched.  Replace most instances of VM_RESERVED in
core mm by VM_UNPAGED.  Force it on in remap_pfn_range, and the sparc and
sparc64 io_remap_pfn_range.

Revert addition of VM_RESERVED to powerpc vdso, it's not needed there.  Is it
needed anywhere?  It still governs the mm->reserved_vm statistic, and special
vmas not to be merged, and areas not to be core dumped; but could probably be
eliminated later (the drivers are probably specifying it because in 2.4 it
kept swapout off the vma, but in 2.6 we work from the LRU, which these pages
don't get on).

Use the VM_SHM slot for VM_UNPAGED, and define VM_SHM to 0: it serves no
purpose whatsoever, and should be removed from drivers when we clean up.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Acked-by: William Irwin <wli@holomorphy.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-22 09:13:42 -08:00
Hugh Dickins
664beed019 [PATCH] unpaged: unifdefed PageCompound
It looks like snd_xxx is not the only nopage to be using PageReserved as a way
of holding a high-order page together: which no longer works, but is masked by
our failure to free from VM_RESERVED areas.  We cannot fix that bug without
first substituting another way to hold the high-order page together, while
farming out the 0-order pages from within it.

That's just what PageCompound is designed for, but it's been kept under
CONFIG_HUGETLB_PAGE.  Remove the #ifdefs: which saves some space (out- of-line
put_page), doesn't slow down what most needs to be fast (already using
hugetlb), and unifies the way we handle high-order pages.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-22 09:13:42 -08:00
Hugh Dickins
83e9b7e929 [PATCH] unpaged: private write VM_RESERVED
The PageReserved removal in 2.6.15-rc1 issued a "deprecated" message when you
tried to mmap or mprotect MAP_PRIVATE PROT_WRITE a VM_RESERVED, and failed
with -EACCES: because do_wp_page lacks the refinement to COW pages in those
areas, nor do we expect to find anonymous pages in them; and it seemed just
bloat to add code for handling such a peculiar case.  But immediately it
caused vbetool and ddcprobe (using lrmi) to fail.

So revert the "deprecated" messages, letting mmap and mprotect succeed.  But
leave do_wp_page's BUG_ON(vma->vm_flags & VM_RESERVED) in place until we've
added the code to do it right: so this particular patch is only good if the
app doesn't really need to write to that private area.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-22 09:13:42 -08:00
Hugh Dickins
ed5297a940 [PATCH] unpaged: get_user_pages VM_RESERVED
The PageReserved removal in 2.6.15-rc1 prohibited get_user_pages on the areas
flagged VM_RESERVED in place of PageReserved.  That is correct in theory - we
ought not to interfere with struct pages in such a reserved area; but in
practice it broke BTTV for one.

So revert to prohibiting only on VM_IO: if someone gets into trouble with
get_user_pages on VM_RESERVED, it'll just be a "don't do that".

You can argue that videobuf_mmap_mapper shouldn't set VM_RESERVED in the first
place, but now's not the time for breaking drivers without notice.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-22 09:13:41 -08:00
Kyle McMartin
2161558fa5 Merge branch 'master' 2005-11-18 16:39:20 -05:00
Matthew Wilcox
9ab8851549 [PARISC] Fix compile warning caused by conflicting types of expand_upwards()
Fix compile warning caused by conflicting types of expand_upwards. IA64
requires it to not be static inline, as it's used outside mm/mmap.c

Signed-off-by: Matthew Wilcox <willy@parisc-linux.org>
Signed-off-by: Kyle McMartin <kyle@parisc-linux.org>
2005-11-18 16:16:42 -05:00
Hans Reiser
58bb01a9cd [PATCH] re-export clear_page_dirty_for_io()
2.6.14 has this exported, and reiser4 (at least) uses it.  Put things back
the way they were.

Signed-off-by: Vladimir V. Saveliev <vs@namesys.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-18 07:49:45 -08:00
Jens Axboe
6b1de9161e [PATCH] VM: fix zone list restart in page allocatate
We must reassign z before looping through the zones kicking kswapd,
since it will be NULL if we hit an OOM condition and jump back to the
beginning again. 'z' is initially assigned before the restart: label. So
move the restart label up a little.

Signed-off-by: Jens Axboe <axboe@suse.de>
2005-11-17 12:43:01 -08:00
Linus Torvalds
4060994c3e Merge x86-64 update from Andi 2005-11-14 19:56:02 -08:00
Andi Kleen
07808b74e7 [PATCH] x86_64: Remove obsolete ARCH_HAS_ATOMIC_UNSIGNED and page_flags_t
Has been introduced for x86-64 at some point to save memory
in struct page, but has been obsolete for some time. Just
remove it.

Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-14 19:55:14 -08:00
Andi Kleen
b0d4169321 [PATCH] x86_64: When cpu_up fails clean up page allocator properly
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-14 19:55:13 -08:00
Andi Kleen
a2f1b42490 [PATCH] x86_64: Add 4GB DMA32 zone
Add a new 4GB GFP_DMA32 zone between the GFP_DMA and GFP_NORMAL zones.

As a bit of historical background: when the x86-64 port
was originally designed we had some discussion if we should
use a 16MB DMA zone like i386 or a 4GB DMA zone like IA64 or
both. Both was ruled out at this point because it was in early
2.4 when VM is still quite shakey and had bad troubles even
dealing with one DMA zone.  We settled on the 16MB DMA zone mainly
because we worried about older soundcards and the floppy.

But this has always caused problems since then because
device drivers had trouble getting enough DMA able memory. These days
the VM works much better and the wide use of NUMA has proven
it can deal with many zones successfully.

So this patch adds both zones.

This helps drivers who need a lot of memory below 4GB because
their hardware is not accessing more (graphic drivers - proprietary
and free ones, video frame buffer drivers, sound drivers etc.).
Previously they could only use IOMMU+16MB GFP_DMA, which
was not enough memory.

Another common problem is that hardware who has full memory
addressing for >4GB misses it for some control structures in memory
(like transmit rings or other metadata).  They tended to allocate memory
in the 16MB GFP_DMA or the IOMMU/swiotlb then using pci_alloc_consistent,
but that can tie up a lot of precious 16MB GFPDMA/IOMMU/swiotlb memory
(even on AMD systems the IOMMU tends to be quite small) especially if you have
many devices.  With the new zone pci_alloc_consistent can just put
this stuff into memory below 4GB which works better.

One argument was still if the zone should be 4GB or 2GB. The main
motivation for 2GB would be an unnamed not so unpopular hardware
raid controller (mostly found in older machines from a particular four letter
company) who has a strange 2GB restriction in firmware. But
that one works ok with swiotlb/IOMMU anyways, so it doesn't really
need GFP_DMA32. I chose 4GB to be compatible with IA64 and because
it seems to be the most common restriction.

The new zone is so far added only for x86-64.

For other architectures who don't set up this
new zone nothing changes. Architectures can set a compatibility
define in Kconfig CONFIG_DMA_IS_DMA32 that will define GFP_DMA32
as GFP_DMA. Otherwise it's a nop because on 32bit architectures
it's normally not needed because GFP_NORMAL (=0) is DMA able
enough.

One problem is still that GFP_DMA means different things on different
architectures. e.g. some drivers used to have #ifdef ia64  use GFP_DMA
(trusting it to be 4GB) #elif __x86_64__ (use other hacks like
the swiotlb because 16MB is not enough) ... . This was quite
ugly and is now obsolete.

These should be now converted to use GFP_DMA32 unconditionally. I haven't done
this yet. Or best only use pci_alloc_consistent/dma_alloc_coherent
which will use GFP_DMA32 transparently.

Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-14 19:55:13 -08:00
Christoph Lameter
50c85a19e7 [PATCH] slab: remove alloc_pages() calls
The slab allocator never uses alloc_pages since kmem_getpages() is always
called with a valid nodeid.  Remove the branch and the code from
kmem_getpages()

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-13 18:14:12 -08:00
Pekka Enberg
065d41cb26 [PATCH] slab: convert cache to page mapping macros
This patch converts object cache <-> page mapping macros to static inline
functions to make the more explicit and readable.

Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-13 18:14:12 -08:00
Nick Piggin
669ed17521 [PATCH] mm: highmem watermarks
The pages_high - pages_low and pages_low - pages_min deltas are the asynch
reclaim watermarks.  As such, the should be in the same ratios as any other
zone for highmem zones.  It is the pages_min - 0 delta which is the
PF_MEMALLOC reserve, and this is the region that isn't very useful for
highmem.

This patch ensures highmem systems have similar characteristics as non highmem
ones with the same amount of memory, and also that highmem zones get similar
reclaim pressures to other zones.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-13 18:14:12 -08:00
Rohit Seth
7fb1d9fca5 [PATCH] mm: __alloc_pages cleanup
Clean up of __alloc_pages.

Restoration of previous behaviour, plus further cleanups by introducing an
'alloc_flags', removing the last of should_reclaim_zone.

Signed-off-by: Rohit Seth <rohit.seth@intel.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-13 18:14:12 -08:00
Robin Holt
51c6f666fc [PATCH] mm: ZAP_BLOCK causes redundant work
The address based work estimate for unmapping (for lockbreak) is and always
was horribly inefficient for sparse mappings.  The problem is most simply
explained with an example:

If we find a pgd is clear, we still have to call into unmap_page_range
PGDIR_SIZE / ZAP_BLOCK_SIZE times, each time checking the clear pgd, in
order to progress the working address to the next pgd.

The fundamental way to solve the problem is to keep track of the end
address we've processed and pass it back to the higher layers.

From: Nick Piggin <npiggin@suse.de>

  Modification to completely get away from address based work estimate
  and instead use an abstract count, with a very small cost for empty
  entries as opposed to present pages.

  On 2.6.14-git2, ppc64, and CONFIG_PREEMPT=y, mapping and unmapping 1TB
  of virtual address space takes 1.69s; with the following patch applied,
  this operation can be done 1000 times in less than 0.01s

From: Andrew Morton <akpm@osdl.org>

With CONFIG_HUTETLB_PAGE=n:

mm/memory.c: In function `unmap_vmas':
mm/memory.c:779: warning: division by zero

Due to

			zap_work -= (end - start) /
					(HPAGE_SIZE / PAGE_SIZE);

So make the dummy HPAGE_SIZE non-zero

Signed-off-by: Robin Holt <holt@sgi.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-13 18:14:12 -08:00
Kirill Korotaev
885036d32f [PATCH] mm: __GFP_NOFAIL fix
In __alloc_pages():

if ((p->flags & (PF_MEMALLOC | PF_MEMDIE)) && !in_interrupt()) {
         /* go through the zonelist yet again, ignoring mins */
         for (i = 0; zones[i] != NULL; i++) {
                 struct zone *z = zones[i];

                 page = buffered_rmqueue(z, order, gfp_mask);
                 if (page) {
                         zone_statistics(zonelist, z);
                         goto got_pg;
                 }
         }
         goto nopage;                <<<< HERE!!! FAIL...
}

kswapd (which has PF_MEMALLOC flag) can fail to allocate memory even when
it allocates it with __GFP_NOFAIL flag.

Signed-Off-By: Pavel Emelianov <xemul@sw.ru>
Signed-Off-By: Denis Lunev <den@sw.ru>
Signed-Off-By: Kirill Korotaev <dev@sw.ru>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-13 18:14:12 -08:00
Linus Torvalds
63f45b8094 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial 2005-11-11 16:29:22 -08:00
Dave Jones
6b482c6779 [PATCH] Don't print per-cpu vm stats for offline cpus.
I just hit a page allocation error on a kernel configured to support
64 CPUs.  It spewed 60 completely useless unnecessary lines of info.

Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-10 13:25:53 -08:00
Adrian Bunk
dc6f3f276e mm/slab.c: fix a comment typo 2005-11-08 16:44:08 +01:00
Adrian Bunk
4936967374 [PATCH] mm/swap_state.c: unexport swapper_space
I didn't find any possible modular usage in the kernel.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07 07:54:07 -08:00
Adrian Bunk
e2de225710 [PATCH] mm/swapfile.c: unexport total_swap_pages
I didn't find any possible modular usage in the kernel.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07 07:54:07 -08:00
Adrian Bunk
1b09d16489 [PATCH] mm/swap.c: unexport vm_acct_memory
I didn't find any possible modular usage in the kernel.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07 07:54:07 -08:00
Adrian Bunk
f8b8db77b0 [PATCH] unexport nr_swap_pages
I didn't find any possible modular usage in the kernel.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07 07:54:07 -08:00
Adrian Bunk
e6a7e0e7ce [PATCH] unexport clear_page_dirty_for_io
I didn't find any possible modular usage in the kernel.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07 07:54:06 -08:00
Adrian Bunk
99697dc02d [PATCH] unexport hugetlb_total_pages
I didn't find any possible modular usage in the kernel.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07 07:54:06 -08:00
Adrian Bunk
55be570c52 [PATCH] mm/{mmap,nommu}.c: several unexports
I didn't find any possible modular usage in the kernel.

This patch was already ACK'ed by Christoph Hellwig.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07 07:54:06 -08:00
Randy Dunlap
d44e0780bc [PATCH] kernel-doc: fix warnings in vmalloc.c
Fix new kernel-doc errors in vmalloc.c.

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07 07:53:56 -08:00
Randy Dunlap
1e5d533142 [PATCH] more kernel-doc cleanups, additions
Various core kernel-doc cleanups:
- add missing function parameters in ipc, irq/manage, kernel/sys,
  kernel/sysctl, and mm/slab;
- move description to just above function for kernel_restart()

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07 07:53:55 -08:00
Andrew Morton
7361f4d8ca [PATCH] readahead commentary
Add a few comments surrounding the generic readahead API.

Also convert some ulongs into pgoff_t: the identifier for PAGE_CACHE_SIZE
offsets into pagecache.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07 07:53:37 -08:00
Manfred Spraul
cd61ef6268 [PATCH] slab: Use same schedule timeout for all cpus in cache_reap
Chen noticed that cache_reap uses REAPTIMEOUT_CPUC+smp_processor_id() as
the timeout for rescheduling.

The "+smp_processor_id()" part is wrong, the timeout should be identical
for all cpus: start_cpu_timer already adds a cpu dependant offset to avoid
any clustering.

The attached patch removes smp_processor_id().

Signed-Off-By: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07 07:53:24 -08:00
Pekka J Enberg
2109a2d1b1 [PATCH] mm: rename kmem_cache_s to kmem_cache
This patch renames struct kmem_cache_s to kmem_cache so we can start using
it instead of kmem_cache_t typedef.

Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07 07:53:24 -08:00
Andrew Morton
4f12bb4f77 [PATCH] slab: don't BUG on duplicated cache
slab presently goes BUG if someone tries to register an already-registered
cache.

But this can happen if the user accidentally loads a module which is already
statically linked into the kernel.  Nuking the kernel is rather a harsh
reaction.

Change it into a warning, and just fail the kmem_cache_alloc() attempt.  If
the module is well-behaved, the modprobe will fail and all is well.

Notes:

- Swaps the ranking of cache_chain_sem and lock_cpu_hotplug().  Doesn't seem
  important.

Acked-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07 07:53:24 -08:00
Hugh Dickins
2d4b95f060 [PATCH] Suppress split ptlock on arches which may use one page for multiple page tables
Suppress split ptlock on arches which may use one page for multiple page
tables.  Reconsider what better to do (particularly on ppc64) later on.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07 07:53:23 -08:00
Benjamin Herrenschmidt
3c726f8dee [PATCH] ppc64: support 64k pages
Adds a new CONFIG_PPC_64K_PAGES which, when enabled, changes the kernel
base page size to 64K.  The resulting kernel still boots on any
hardware.  On current machines with 4K pages support only, the kernel
will maintain 16 "subpages" for each 64K page transparently.

Note that while real 64K capable HW has been tested, the current patch
will not enable it yet as such hardware is not released yet, and I'm
still verifying with the firmware architects the proper to get the
information from the newer hypervisors.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-06 16:56:47 -08:00
Steve French
7f28570185 Export __pagevec_release and pagevec_lookup_tag
These are needed to implement cifs_writepages

Signed-off-by: Dave Kleikamp <shaggy@austin.ibm.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2005-11-01 10:22:55 -08:00
Matt Mackall
5d57bd39eb [PATCH] Error checks omitted in init_tmpfs() in mm/tiny-shmem.c
From: Hareesh Nagarajan <hnagar2@gmail.com>

Signed-off-by: Hareesh Nagarajan <hnagar2@gmail.com>
Acked-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:27 -08:00
Paul E. McKenney
a241ec65ae [PATCH] RCU torture-testing kernel module
This patch is a rewrite of the one submitted on October 1st, using modules
(http://marc.theaimsgroup.com/?l=linux-kernel&m=112819093522998&w=2).

This rewrite adds a tristate CONFIG_RCU_TORTURE_TEST, which enables an
intense torture test of the RCU infratructure.  This is needed due to the
continued changes to the RCU infrastructure to accommodate dynamic ticks,
CPU hotplug, realtime, and so on.  Most of the code is in a separate file
that is compiled only if the CONFIG variable is set.  Documentation on how
to run the test and interpret the output is also included.

This code has been tested on i386 and ppc64, and an earlier version of the
code has received extensive testing on a number of architectures as part of
the PREEMPT_RT patchset.

Signed-off-by: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:27 -08:00
Tejun Heo
c7e9dd4dd0 [PATCH] vm: remove redundant assignment from __pagevec_release_nonlru()
This patch removes redundant assignment from __pagevec_release_nonlru().
pages_to_free.cold is set to pvec->cold by pagevec_init() call right above
the assignment.

Signed-off-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:22 -08:00
Tejun Heo
39e88ca2c9 [PATCH] fs: error case fix in __generic_file_aio_read
When __generic_file_aio_read() hits an error during reading, it reports the
error iff nothing has successfully been read yet.  This is condition - when
an error occurs, if nothing has been read/written, report the error code;
otherwise, report the amount of bytes successfully transferred upto that
point.

This corner case can be exposed by performing readv(2) with the following
iov.

 iov[0] = len0 @ ptr0
 iov[1] = len1 @ NULL (or any other invalid pointer)
 iov[2] = len2 @ ptr2

When file size is enough, performing above readv(2) results in

 len0 bytes from file_pos @ ptr0
 len2 bytes from file_pos + len0 @ ptr2

And the return value is len0 + len2.  Test program is attached to this
mail.

This patch makes __generic_file_aio_read()'s error handling identical to
other functions.

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/uio.h>
#include <errno.h>
#include <string.h>

int main(int argc, char **argv)
{
	const char *path;
	struct stat stbuf;
	size_t len0, len1;
	void *buf0, *buf1;
	struct iovec iov[3];
	int fd, i;
	ssize_t ret;

	if (argc < 2) {
		fprintf(stderr, "Usage: testreadv path (better be a "
			"small text file)\n");
		return 1;
	}
	path = argv[1];

	if (stat(path, &stbuf) < 0) {
		perror("stat");
		return 1;
	}

	len0 = stbuf.st_size / 2;
	len1 = stbuf.st_size - len0;

	if (!len0 || !len1) {
		fprintf(stderr, "Dude, file is too small\n");
		return 1;
	}

	if ((fd = open(path, O_RDONLY)) < 0) {
		perror("open");
		return 1;
	}

	if (!(buf0 = malloc(len0)) || !(buf1 = malloc(len1))) {
		perror("malloc");
		return 1;
	}

	memset(buf0, 0, len0);
	memset(buf1, 0, len1);

	iov[0].iov_base = buf0;
	iov[0].iov_len = len0;
	iov[1].iov_base = NULL;
	iov[1].iov_len = len1;
	iov[2].iov_base = buf1;
	iov[2].iov_len = len1;

	printf("vector ");
	for (i = 0; i < 3; i++)
		printf("%p:%zu ", iov[i].iov_base, iov[i].iov_len);
	printf("\n");

	ret = readv(fd, iov, 3);
	if (ret < 0)
		perror("readv");

	printf("readv returned %zd\nbuf0 = [%s]\nbuf1 = [%s]\n",
	       ret, (char *)buf0, (char *)buf1);

	return 0;
}

Signed-off-by: Tejun Heo <htejun@gmail.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:22 -08:00
Paul Jackson
68860ec10b [PATCH] cpusets: automatic numa mempolicy rebinding
This patch automatically updates a tasks NUMA mempolicy when its cpuset
memory placement changes.  It does so within the context of the task,
without any need to support low level external mempolicy manipulation.

If a system is not using cpusets, or if running on a system with just the
root (all-encompassing) cpuset, then this remap is a no-op.  Only when a
task is moved between cpusets, or a cpusets memory placement is changed
does the following apply.  Otherwise, the main routine below,
rebind_policy() is not even called.

When mixing cpusets, scheduler affinity, and NUMA mempolicies, the
essential role of cpusets is to place jobs (several related tasks) on a set
of CPUs and Memory Nodes, the essential role of sched_setaffinity is to
manage a jobs processor placement within its allowed cpuset, and the
essential role of NUMA mempolicy (mbind, set_mempolicy) is to manage a jobs
memory placement within its allowed cpuset.

However, CPU affinity and NUMA memory placement are managed within the
kernel using absolute system wide numbering, not cpuset relative numbering.

This is ok until a job is migrated to a different cpuset, or what's the
same, a jobs cpuset is moved to different CPUs and Memory Nodes.

Then the CPU affinity and NUMA memory placement of the tasks in the job
need to be updated, to preserve their cpuset-relative position.  This can
be done for CPU affinity using sched_setaffinity() from user code, as one
task can modify anothers CPU affinity.  This cannot be done from an
external task for NUMA memory placement, as that can only be modified in
the context of the task using it.

However, it easy enough to remap a tasks NUMA mempolicy automatically when
a task is migrated, using the existing cpuset mechanism to trigger a
refresh of a tasks memory placement after its cpuset has changed.  All that
is needed is the old and new nodemask, and notice to the task that it needs
to rebind its mempolicy.  The tasks mems_allowed has the old mask, the
tasks cpuset has the new mask, and the existing
cpuset_update_current_mems_allowed() mechanism provides the notice.  The
bitmap/cpumask/nodemask remap operators provide the cpuset relative
calculations.

This patch leaves open a couple of issues:

 1) Updating vma and shmfs/tmpfs/hugetlbfs memory policies:

    These mempolicies may reference nodes outside of those allowed to
    the current task by its cpuset.  Tasks are migrated as part of jobs,
    which reside on what might be several cpusets in a subtree.  When such
    a job is migrated, all NUMA memory policy references to nodes within
    that cpuset subtree should be translated, and references to any nodes
    outside that subtree should be left untouched.  A future patch will
    provide the cpuset mechanism needed to mark such subtrees.  With that
    patch, we will be able to correctly migrate these other memory policies
    across a job migration.

 2) Updating cpuset, affinity and memory policies in user space:

    This is harder.  Any placement state stored in user space using
    system-wide numbering will be invalidated across a migration.  More
    work will be required to provide user code with a migration-safe means
    to manage its cpuset relative placement, while preserving the current
    API's that pass system wide numbers, not cpuset relative numbers across
    the kernel-user boundary.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:22 -08:00
Paul Jackson
28a42b9ea7 [PATCH] cpusets: confine pdflush to its cpuset
This patch keeps pdflush daemons on the same cpuset as their parent, the
kthread daemon.

Some large NUMA configurations put as much as they can of kernel threads
and other classic Unix load in what's called a bootcpuset, keeping the rest
of the system free for dedicated jobs.

This effort is thwarted by pdflush, which dynamically destroys and
recreates pdflush daemons depending on load.

It's easy enough to force the originally created pdflush deamons into the
bootcpuset, at system boottime.  But the pdflush threads created later were
allowed to run freely across the system, due to the necessary line in their
startup kthread():

        set_cpus_allowed(current, CPU_MASK_ALL);

By simply coding pdflush to start its threads with the cpus_allowed
restrictions of its cpuset (inherited from kthread, its parent) we can
ensure that dynamically created pdflush threads are also kept in the
bootcpuset.

On systems w/o cpusets, or w/o a bootcpuset implementation, the following
will have no affect, leaving pdflush to run on any CPU, as before.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:21 -08:00
Jan Kara
aaa4059bc2 [PATCH] ext3: Fix unmapped buffers in transaction's lists
Fix the problem (BUG 4964) with unmapped buffers in transaction's
t_sync_data list.  The problem is we need to call filesystem's own
invalidatepage() from block_write_full_page().

block_write_full_page() must call filesystem's invalidatepage().  Otherwise
following nasty race can happen:

   proc 1                                        proc 2
   ------                                        ------
- write some new data to 'offset'
  => bh gets to the transactions data list
                                              - starts truncate
                                                => i_size set to new size
- mpage_writepages()
  - ext3_ordered_writepage() to 'offset'
    - block_write_full_page()
      - page->index > end_index+1
        - block_invalidatepage()
          - discard_buffer()
            - clear_buffer_mapped()

- commit triggers and finds unmapped buffer - BOOM!

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:17 -08:00
Nikita Danilov
b1459461f1 [PATCH] mm/filemap.c:filemap_populate(): move export.
move EXPORT_SYMBOL(filemap_populate) to the proper place: just after
function itself: it's easy to miss that function is exported otherwise.

Signed-off-by: Nikita Danilov <nikita@clusterfs.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29 21:40:45 -07:00