02 April 2009

Lowering ASP.NET memory usage

This blog post is a work in progress on how to keep ASP.NET web application memory usage low. The motivation for this is to avoid the web app being stopped for using too much memory.

Background

The Microsoft Internet Information Services (IIS) web server has administrator options to Recycle Worker Threads, ie to automatically stop a thread that runs an ASP.NET web application if a threshold is passed. Some of these thresholds are 'arbitrary', eg the default Elapsed Time of 29 hours. However the Virtual Memory and Used Memory thresholds tend to kick in as you use more memory. If a web app is 'recycled' then you get no warning - the thread is simply terminated; the web app is only restarted in response to another web request.

In addition, a web app will tend to slow down as its memory use increases.

I am working on this topic for my FindinSite-MS site search engine. This one web app both (a) does searches and (b) crawls a web site to build a 'search database' that is used by the search. The crawl/index task is done in a separate background thread - an indexing task is either started from the user interface, or run on a predefined schedule. (The app has to be alive for a scheduled index to run - so an outside process can be set up to wake up a web app if need be.)

This software currently suffers from two problems:
- Too much of the database-being-searched is kept in memory
- More significantly: The index process uses a large amount of memory.

Note that my 'search database' db (as I refer to it) is just a set of files in memory - no real database is used. This makes the search engine easier to deploy as a database is not required.

Memory heuristics

I use the System.GC.GetTotalMemory(false) call to get the current total memory usage, without forcing a garbage collection.

I don't have a precise figure for what amount of memory is too big. On our Crystaltech shared host, anything less than 10MB is good, while anything over 100MB is bad - though a recent indexing run worked with a maximum 400MB memory usage.

Redesign methodology

My initial focus was on designing a new 'search database'. This db design needs simultaneously to be searchable and buildable, ie sufficently fast and low-memory while searching - and the same while building.

The main obvious programming technique is not to keep anything in memory. This is actually quite a hard mindset to achieve. For example, it would be quite nice to keep a bit of information in memory about every file indexed, eg URL, title, size, etc. While this might work for 1,000 files, or even 25,000, it is not going to work for half a million.

Using disk instead of memory

If I cannot store information in memory, then I'll have to save it to disk. In fact, this may not be as bad an option as it sounds, as the operating system (and hardware etc) will cache disk data in memory, so performance time may not go down too significantly. Storing data on disk should not impact on ASP.NET memory usage, so should help ensure that my app isn't killed.

Example 1: word file list

I want to store a list of file numbers for each word found during the crawl. My first re-design had 32 of these numbers in memory, along with 4 other integers - with the rest on disk. A second redesign reduced this to 8+4. My latest design simply has 2 integers per word in memory. Everything else is on disk.

The first integer is the block number in the temporary data file. The second number is the last inserted file number - this makes sure that I don't update the block more than once per file for each word.

Example 2: file list

I want to know whether I've indexed a file before. I used to keep 2 (yes two) lists of files in memory. Now this is all written out to disk.

So how do I check quickly if I've already indexed a file? OK, I do have a List<> of all files indexed. Each List element is a structure that contains two integers, a FileHash and a FilePointer. The FileHash is the HashCode of the file path, and the FilePointer is the location of the full information on disk.

To check whether I've indexed a file, I find the HashCode of the file path. I then iterate through the List<>. If the hash matches then I use the FilePointer to retrieve the path from disk. If this matches, then the file has been indexed before. I keep looking if it doesn't match, in case two or more files have the same hash.

Example 3: reversed word list

To support wild cards at the start of a search, eg a search for [*hris], I need to reverse each word, so that "chris" becomes "sirhc". [I won't reveal my algorithm just now.]

At one stage, I created a list of reversed words as I went through the normal word list. However, I now write the reversed words out to disk first. I then clear the word list (see below) to reduce my memory footprint. Finally I read my reversed words in for processing.

Example 4: SortedDictionary.Clear()

I use a SortedDictionary during the indexing run. If I set this to null and do a garbage collect, then no memory is freed. If I call Clear() and then set this to null, the memory is cleared. [And I'm pretty certain that there are no other references to items in the dictionary.]

Should I use Cache?

Is it safe to use the ASP.NET Page.Cache? I presume that using this will still add to the memory used by the application, so the web app could still be shut down unilaterally, without trying to clear the Cache.

I do use the Cache as a part of the search process. However I set an expiry time of 5 minutes - this provides a useful cache while a user is searching, but clears the memory when they have probably gone anyway.

Storing data in the Session

I have just remembered that I store the search results for each user in a Session variable. This is useful, eg when they ask for the second page of hits for a search. This results data could be reasonably big, so I now think that this is not wise. The Session variable will presumably be cleared after say 20 minutes, but this is too big a risk. I'll have to store the results to disk, retrieve and clear as necessary.

Large Object Heap objects

This article on The Dangers of the Large Object Heap says that any object larger than 85kB (or 8kB for doubles) might result in increased memory usage. Try not to use large objects.
Does each collection or generic collection count as one object, or is each one a multitude of its individual components?

Current progress

A crawl of a 100,000 simple HTML files now takes 11 minutes and uses a maximum of 23MB of memory. When this db is loaded for searching, the rest state of the web app is 6MB. After a couple of searches this went up to 22MB.

The previous version took 83 minutes and used 126MB max memory. When loaded for searching, 44MB of memory is used, going up to 50MB after a couple of searches.

This is not exactly a completely fair test as the new code is not complete in several important ways. However it does show dramatic memory usage reductions. I am not exactly sure why there's such a dramatic speed improvement - it might be because of the reduced memory usage - or it could be the simpler not-really-complete algorithm.

2 comments:

Joe Hopkins said...

Hi,

i wouldn't recommend using ASP.NET Cache because ASP.NET Cache being stand-alone & InProc can be a real problem. Trying to "simulate" distribution through SqlCacheDependency is also really a "hack" in my opinion. The right way to solve this scalability problem is through an in-memory distributed cache.

I personally like NCache which is really impressive because of its rich set of caching topologies (Mirrored, Replicated, Partitioned, Partition-Replica, and Client Cache). The cool thing is that it also has NCache Express which is free for 2-server environments.

Jason said...

Environment.WorkingSet returns the working set incorrectly for my asp.net application which is the only application in its application pool.

On a Windows 2003 Server SP2 with 3GBs of Ram which is a VMWare Virtual Machine, it reports working set as 2.047.468.061 bytes(1952MBs) and Process.WorkingSet value is 75.563.008 bytes(72MBs).

• Memory Status values returned by GlobalMemoryStatusEx:

AvailExtendedVirtual : 0
AvailPageFile: 4.674.134.016
AvailPhys: 2.140.078.080
AvailVirtual: 1.347.272.704
TotalPageFile: 6.319.915.008
TotalPhys: 3.245.568.000
TotalVirtual: 2.147.352.576
• GetProcessMemoryInfo()

Working Set : 55.140.352
Peak Working Set: 75.571.200
PageFile : 94.560.256
QuotaPagedPoolUsage : 376.012
QuotaNonPagedPoolUsage : 33.261
• GetProcessWorkingSetSize() - min : 204.800 - max : 1.413.120

• GetPerformanceInfo()

CommitLimit : 1.542.948 pages 6.319.915.008 bytes
CommitPeak : 484.677 pages 1.985.236.992 bytes
CommitTotal : 417.514 pages 1.710.137.344 bytes
HandleCount : 57.012
KernelNonpaged : 8.671 pages 35.516.416 bytes
KernelPaged : 27.302 pages 111.828.992 bytes
KernelTotal : 35.973 pages 147.345.408 bytes
PageSize : 4.096 bytes
PhysicalAvailable : 508.083 pages 2.081.107.968 bytes
PhysicalTotal : 792.375 pages 3.245.568.000 bytes
ProcessCount : 43
SystemCache : 263.734 pages 1.080.254.464 bytes
ThreadCount : 1.038
After loding the new patch, http://support.microsoft.com/kb/983583/en-us, .NET Version changes to 2.0.50727.3615 and Environment.WorkingSet now returns the value: 2.047.468.141.(which is 80 bytes bigger than previous one)

On a Vista machine with 3GBs of Ram, Environment.WorkingSet and Process.WorkingSet values are similar and around 37 MBs.

So, why Environment.WorkingSet returns a fixed value? Restarting the application pool does not change anything, it always return the same magic value, 2.047.468.061.

I have also setup an .NET 1.1.4322.2443 application, and it weirdly WorkingSet is returned a number from a random set of unrelated numbers(193.654.824, 214.101.416, 57.207.080, 287.635.496) each time page refreshed while GetProcessMemoryInfo() returns the expected number.

I have also found that when the application run by impersonating "NT AUTHORITY\NetworkService" account this problem does not occur, Environment.WorkingSet returns the expected number both .net v1.1 and v2.0.

I have checked CodeAccessPermissions like EnvironmentPermission for windows user and NetworkService but could not find anything that restricts reading the WorkingSet value.

So, what could cause this? Is it a bug, some incorrect configuration or corrupt file etc.?