View Incident: http://co-op.engr.sgi.com/BugWorks/query.cgi/903758

Status: open                          Priority: 2                           
Assigned Engineer: erikj              Submitter: erikj                      
Assigned Group: linux-kernel          Project: communitylinux               
Opened Date: 11/04/03                 Description:

For LBS (kernel 2.4) we implemented round-robin allocation of certain kernel
hash tables.  This was documented in pv 893276.

We need to port or implement this for 2.6.
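
The idea, roughly: instead of taking the whole table from one node with
__get_free_pages(), take the backing pages round-robin from each online
node and map them virtually contiguous.  A minimal sketch against the
2.6 interfaces (illustrative only, not the pv 893276 code; the wrapper
name alloc_hash_round_robin() is made up, though the calls it uses are
standard kernel interfaces):

#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>

static void *alloc_hash_round_robin(unsigned int npages)
{
	struct page **pages;
	void *table;
	unsigned int i;
	int nid = 0;

	pages = kmalloc(npages * sizeof(*pages), GFP_KERNEL);
	if (!pages)
		return NULL;

	for (i = 0; i < npages; i++) {
		/* spread successive pages over successive nodes */
		pages[i] = alloc_pages_node(nid, GFP_KERNEL, 0);
		if (!pages[i])
			goto fail;
		nid = (nid + 1) % num_online_nodes();
	}

	/* one contiguous virtual range over the scattered pages */
	table = vmap(pages, npages, VM_MAP, PAGE_KERNEL);
	if (table) {
		/* a boot-time table is never torn down, so the page
		 * array is only needed for the error path */
		kfree(pages);
		return table;
	}
fail:
	while (i--)
		__free_page(pages[i]);
	kfree(pages);
	return NULL;
}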

Before I begin researching this, I understand at least one SGI person
(mort?) may already be looking into some aspect of this.  If that is the
case, please pass along a bug number; we can perhaps dupe this to it.


.....

==========================
ADDITIONAL INFORMATION (UPDATE)
From: jes@sgi.com (BugWorks)
Date: Nov 25 2003 04:53:40AM
==========================
ACTIONS:
 CC List: jbarnes steiner mort jh hawkes raybry ->
          jbarnes steiner mort jh hawkes raybry jbarnes
--------------------------

Hi,

Going with vmalloc for these allocations is never going to fly with the
community; vmalloc is a scarce resource and shouldn't be used lightly
like this.

However, I think the real problem here is that the current algorithm for
determining the hash table sizes is out of proportion with reality; i.e.,
on ascender we do not need 1-2GB of inode and dentry hash tables.  I have
been trying to come up with a better algorithm, which I am going to
propose to the community.

Right now I cap those tables at 64MB, which should be sufficient even
on ascender.
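
For concreteness: the rule in the patch below shifts mempages so that
the order grows by one for each doubling of RAM, then clamps it to the
range [2, 12].  With 16KB pages (PAGE_SHIFT == 14, as on ascender's
ia64) the order-12 cap works out to 2^12 * 16KB = 64MB; with 4KB pages
it would be 16MB.  A standalone sketch of the arithmetic (userspace C;
fls() is reimplemented here, and the memory sizes are made up for
illustration):

#include <stdio.h>

#define PAGE_SHIFT 14	/* assumption: ia64-style 16KB pages */

/* most significant set bit, 1-based; fls(0) == 0, like the kernel's */
static int fls(unsigned long x)
{
	int r = 0;

	while (x) {
		r++;
		x >>= 1;
	}
	return r;
}

int main(void)
{
	/* made-up total memory sizes, in bytes */
	unsigned long sizes[] = { 1UL << 30, 1UL << 36, 1UL << 40 };
	unsigned int i;

	for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
		unsigned long mempages = sizes[i] >> PAGE_SHIFT;
		int order;

		/* the patch's sizing rule */
		mempages >>= (23 - (PAGE_SHIFT - 1));
		order = fls(mempages);
		if (order < 2)
			order = 2;	/* floor at order 2 */
		if (order > 12)
			order = 12;	/* cap at order 12 (64MB here) */

		printf("%5lu GB RAM -> order %2d -> %6lu KB table\n",
		       sizes[i] >> 30, order,
		       ((1UL << order) << PAGE_SHIFT) >> 10);
	}
	return 0;
}

With these numbers, 1GB of RAM yields order 7 (a 2MB table) and
anything from 64GB up hits the 64MB cap.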

I am attaching a copy of my current patch, which I recommend go into
Jesse's patchset for the time being.  Once I get some response from the
linux-kernel list, it may need to be updated.

Wrt the IP hash table, it would be interesting to size it based on the
number of network interfaces installed in the system.  However, that is
hard to do, since the interface count isn't known until after the hash
has been allocated.  Either way, that hash needs to be reduced as well.
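
If the networking hashes do get the same treatment, the clamp could be
shared rather than duplicated per caller.  A hypothetical helper
(clamp_hash_order() exists neither in the patch nor in the kernel; it
just factors out the rule used by dcache_init()/inode_init() below):

/* hypothetical, for illustration only */
static int clamp_hash_order(unsigned long mempages)
{
	int order;

	mempages >>= (23 - (PAGE_SHIFT - 1));
	order = fls(mempages);
	if (order < 2)
		order = 2;
	if (order > 12)
		order = 12;	/* 64MB with 16KB pages */
	return order;
}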

Jes

--- orig/linux-2.6.0-test9/fs/dcache.c	Sat Oct 25 11:42:58 2003
+++ linux-2.6.0-test10/fs/dcache.c	Tue Nov 25 05:33:04 2003
@@ -1549,9 +1549,8 @@
 static void __init dcache_init(unsigned long mempages)
 {
 	struct hlist_head *d;
-	unsigned long order;
 	unsigned int nr_hash;
-	int i;
+	int i, order;
 
 	/* 
 	 * A constructor could be added for stable state like the lists,
@@ -1571,12 +1570,17 @@
 	
 	set_shrinker(DEFAULT_SEEKS, shrink_dcache_memory);
 
+#if 0
 #if PAGE_SHIFT < 13
 	mempages >>= (13 - PAGE_SHIFT);
 #endif
 	mempages *= sizeof(struct hlist_head);
 	for (order = 0; ((1UL << order) << PAGE_SHIFT) < mempages; order++)
 		;
+#endif
+	mempages >>= (23 - (PAGE_SHIFT - 1));
+	order = max(2, fls(mempages));
+	order = min(12, order);
 
 	do {
 		unsigned long tmp;
@@ -1594,7 +1598,7 @@
 			__get_free_pages(GFP_ATOMIC, order);
 	} while (dentry_hashtable == NULL && --order >= 0);
 
-	printk(KERN_INFO "Dentry cache hash table entries: %d (order: %ld, %ld bytes)\n",
+	printk(KERN_INFO "Dentry cache hash table entries: %d (order: %d, %ld bytes)\n",
 			nr_hash, order, (PAGE_SIZE << order));
 
 	if (!dentry_hashtable)
--- orig/linux-2.6.0-test9/fs/inode.c	Sat Oct 25 11:44:53 2003
+++ linux-2.6.0-test10/fs/inode.c	Tue Nov 25 05:33:27 2003
@@ -1333,17 +1333,21 @@
 void __init inode_init(unsigned long mempages)
 {
 	struct hlist_head *head;
-	unsigned long order;
 	unsigned int nr_hash;
-	int i;
+	int i, order;
 
 	for (i = 0; i < ARRAY_SIZE(i_wait_queue_heads); i++)
 		init_waitqueue_head(&i_wait_queue_heads[i].wqh);
 
+#if 0
 	mempages >>= (14 - PAGE_SHIFT);
 	mempages *= sizeof(struct hlist_head);
 	for (order = 0; ((1UL << order) << PAGE_SHIFT) < mempages; order++)
 		;
+#endif
+	mempages >>= (23 - (PAGE_SHIFT - 1));
+	order = max(2, fls(mempages));
+	order = min(12, order);
 
 	do {
 		unsigned long tmp;
