14 April 2009

DNN development and production sites

Bro John wants me to write up how he does DotNetNuke (DNN) upgrades and site copies. As a preamble, he wrote this:
=============
I would think anyone with a production site would want three copies:
A - a live site
B - a transition site
C - a development site

also:
D - a trash site

As most of my stuff is now in phdcc.CodeModules, I develop these until they work on C, then copy them to D to see if they still work in a different environment. I then copy them to A and test.

I would prefer to have a B, which means I can shut down A, have users running on B, and then swap back to A later if, for example, I do a DNN version upgrade.

Doing a DNN version upgrade on a live site is just asking for trouble.

If this sort of thing is not sorted out properly in DNN, then I wouldn't consider it a solid environment to do anything serious in.

I am trying to be careful to keep all database- and email-server-specific stuff in a few files. Hard-coded links to other pages on the site are OK if you copy the entire site correctly.

Maybe bigger players have other tricks they pull. Maybe they swap DNS pointers. However, that takes time to propagate and produces horrid caching problems.

03 April 2009

Content-Location HTTP header for current URL

All our web sites at shared-hosting provider Crystaltech/Newtek have just started serving an extra HTTP header, Content-Location, for all static web page requests - the header contains the current URL.

To see this in action, use Rex Swain's HTTP Viewer to view this page at our web site: http://www.phdcc.com/phd.html.

If you look in the received output, you will see that the Content-Location header is set to the requested URL:
Content-Location:·http://www.phdcc.com/phd.html(CR)(LF)

Normally, the Content-Location header is used to indicate when the content actually corresponds to another URL. So if you look at http://www.phdcc.com/ you will see this output:
Content-Location:·http://www.phdcc.com/default.htm(CR)(LF)
ie the web site home page is actually called default.htm.

While this is a harmless change for most users, it did in fact fool our FindinSite-MS search engine - I am in the process of releasing a new version to cope with this. Our ISP Crystaltech claims that nothing has changed. Has anyone any further information?

02 April 2009

Lowering ASP.NET memory usage

This blog post is a work in progress on how to keep ASP.NET web application memory usage low. The motivation for this is to avoid the web app being stopped for using too much memory.

Background

The Microsoft Internet Information Services (IIS) web server has administrator options to recycle worker processes, ie to automatically stop the process that runs an ASP.NET web application if a threshold is passed. Some of these thresholds are 'arbitrary', eg the default Elapsed Time of 29 hours. However, the Virtual Memory and Used Memory thresholds tend to kick in as you use more memory. If a web app is 'recycled' then you get no warning - the worker process is simply terminated; the web app is only restarted in response to another web request.

In addition, a web app will tend to slow down as its memory use increases.

I am working on this topic for my FindinSite-MS site search engine. This one web app both (a) does searches and (b) crawls a web site to build a 'search database' that is used by the search. The crawl/index task is done in a separate background thread - an indexing task is either started from the user interface or run on a predefined schedule. (The app has to be alive for a scheduled index to run, so an outside process can be set up to wake up the web app if need be.)
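As an illustration of that last point, here is a minimal sketch of such an outside process: a tiny console program (run from Windows Task Scheduler, say) that simply requests a page from the site shortly before a scheduled index is due. The URL is made up.

using System;
using System.Net;

// Hypothetical wake-up utility: requesting any page starts the web app if it has been recycled.
class WakeUpWebApp
{
    static void Main()
    {
        using (WebClient client = new WebClient())
        {
            // Illustrative URL - point this at any page of the site running the search engine
            client.DownloadString("http://www.example.com/search.aspx");
        }
        Console.WriteLine("Web app pinged");
    }
}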

This software currently suffers from two problems:
- Too much of the database-being-searched is kept in memory
- More significantly: The index process uses a large amount of memory.

Note that my 'search database' db (as I refer to it) is just a set of files loaded into memory - no real database is used. This makes the search engine easier to deploy, as a database is not required.

Memory heuristics

I use the System.GC.GetTotalMemory(false) call to get the current total memory usage, without forcing a garbage collection.
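For reference, this is the sort of helper I mean - the class and method names are just illustrative:

using System;

static class MemoryCheck
{
    // Report the current managed memory usage; 'false' means do not force a garbage collection first.
    public static void Report(string label)
    {
        long bytes = GC.GetTotalMemory(false);
        Console.WriteLine("{0}: {1:N1} MB", label, bytes / (1024.0 * 1024.0));
    }
}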

I don't have a precise figure for how much memory is too much. On our Crystaltech shared host, anything under 10MB is good, while anything over 100MB is bad - though a recent indexing run worked with a maximum of 400MB memory usage.

Redesign methodology

My initial focus was on designing a new 'search database'. This db design needs to be simultaneously searchable and buildable, ie sufficiently fast and low-memory while searching - and the same while building.

The main obvious programming technique is not to keep anything in memory. This is actually quite a hard mindset to achieve. For example, it would be quite nice to keep a bit of information in memory about every file indexed, eg URL, title, size, etc. While this might work for 1,000 files, or even 25,000, it is not going to work for half a million.

Using disk instead of memory

If I cannot store information in memory, then I'll have to save it to disk. In fact, this may not be as bad an option as it sounds, as the operating system (and hardware etc) will cache disk data in memory, so performance may not suffer too significantly. Storing data on disk should not count towards ASP.NET memory usage, so it should help ensure that my app isn't killed.

Example 1: word file list

I want to store a list of file numbers for each word found during the crawl. My first redesign had 32 of these numbers in memory per word, along with 4 other integers, with the rest on disk. A second redesign reduced this to 8+4. My latest design has just 2 integers per word in memory; everything else is on disk.

The first integer is the block number in the temporary data file. The second number is the last inserted file number - this makes sure that I don't update the block more than once per file for each word.
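As a sketch, the per-word in-memory record therefore boils down to something like this (the names are mine, not necessarily those used in FindinSite-MS):

// Two integers per word in memory; everything else for the word lives in the temporary data file.
struct WordEntry
{
    public int BlockNumber;      // block number in the temporary data file
    public int LastFileNumber;   // last inserted file number, so a block is updated at most once per file
}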

Example 2: file list

I want to know whether I've indexed a file before. I used to keep 2 (yes two) lists of files in memory. Now this is all written out to disk.

So how do I check quickly if I've already indexed a file? OK, I do have a List<> of all files indexed. Each List element is a structure that contains two integers, a FileHash and a FilePointer. The FileHash is the HashCode of the file path, and the FilePointer is the location of the full information on disk.

To check whether I've indexed a file, I find the HashCode of the file path. I then iterate through the List<>. If the hash matches then I use the FilePointer to retrieve the path from disk. If this matches, then the file has been indexed before. I keep looking if it doesn't match, in case two or more files have the same hash.
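Here is a sketch of that check. The field and method names are mine, and the on-disk format (a path written with BinaryWriter) is an assumption - it just illustrates the hash-then-verify idea:

using System;
using System.Collections.Generic;
using System.IO;

// Sketch only: in-memory list of (hash, pointer) pairs, with the full file details on disk.
struct FileEntry
{
    public int FileHash;     // HashCode of the file path
    public int FilePointer;  // offset of the full file record in the data file
}

class FileIndex
{
    readonly List<FileEntry> files = new List<FileEntry>();
    readonly string dataFilePath;

    public FileIndex(string dataFilePath) { this.dataFilePath = dataFilePath; }

    // Hypothetical helper: read the stored path at the given offset.
    // (In practice the data file would be kept open rather than reopened each time.)
    string ReadPathFromDisk(int pointer)
    {
        using (BinaryReader reader = new BinaryReader(File.OpenRead(dataFilePath)))
        {
            reader.BaseStream.Seek(pointer, SeekOrigin.Begin);
            return reader.ReadString();   // assumes the path was written with BinaryWriter.Write(string)
        }
    }

    public bool AlreadyIndexed(string path)
    {
        int hash = path.GetHashCode();
        foreach (FileEntry entry in files)
        {
            if (entry.FileHash != hash)
                continue;                                // different hash: cannot be the same path
            if (ReadPathFromDisk(entry.FilePointer) == path)
                return true;                             // hash and path both match
            // hash collision with a different path: keep looking
        }
        return false;
    }
}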

Example 3: reversed word list

To support wildcards at the start of a search, eg a search for [*hris], I need to reverse each word, so that "chris" becomes "sirhc". [I won't reveal my algorithm just now.]

At one stage, I created a list of reversed words as I went through the normal word list. However, I now write the reversed words out to disk first. I then clear the word list (see below) to reduce my memory footprint. Finally I read my reversed words in for processing.
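The reversal itself is the trivial part - something like this (this is just the string reversal, not the undisclosed algorithm):

using System;

static class WordReverser
{
    // "chris" becomes "sirhc"
    public static string Reverse(string word)
    {
        char[] chars = word.ToCharArray();
        Array.Reverse(chars);
        return new string(chars);
    }
}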

Example 4: SortedDictionary.Clear()

I use a SortedDictionary during the indexing run. If I set this to null and do a garbage collect, then no memory is freed. If I call Clear() and then set this to null, the memory is cleared. [And I'm pretty certain that there are no other references to items in the dictionary.]
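In code, the pattern that worked looks like this ('wordIndex' is a stand-in name for the dictionary used during the indexing run):

using System;
using System.Collections.Generic;

class IndexRun
{
    SortedDictionary<string, long> wordIndex = new SortedDictionary<string, long>();

    // ... the indexing run fills and uses wordIndex ...

    void ReleaseIndexMemory()
    {
        wordIndex.Clear();   // release the dictionary's contents first
        wordIndex = null;    // then drop the reference
        GC.Collect();        // only now is the memory actually freed; without Clear() it was not
    }
}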

Should I use Cache?

Is it safe to use the ASP.NET Page.Cache? I presume that using this will still add to the memory used by the application, so the web app could still be shut down unilaterally, without the Cache being trimmed first.

I do use the Cache as a part of the search process. However, I set an expiry time of 5 minutes - this provides a useful cache while a user is searching, but frees the memory once they have probably gone anyway.
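From within a Page, the call looks something like this - the cache key and the cached object are made-up names, and whether the real code uses sliding or absolute expiration is my guess:

using System;
using System.Web.Caching;
using System.Web.UI;

public class SearchPage : Page
{
    // 'searchDatabase' stands in for the object holding the loaded search database
    void CacheSearchDatabase(object searchDatabase)
    {
        // 5-minute sliding expiration: kept alive while the user keeps searching,
        // evicted by ASP.NET once they have probably gone anyway.
        Cache.Insert("search-db",                 // hypothetical cache key
                     searchDatabase,
                     null,                        // no cache dependency
                     Cache.NoAbsoluteExpiration,
                     TimeSpan.FromMinutes(5));
    }
}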

Storing data in the Session

I have just remembered that I store the search results for each user in a Session variable. This is useful, eg when they ask for the second page of hits for a search. This results data could be reasonably big, so I now think that this is not wise. The Session variable will presumably be cleared after, say, 20 minutes, but holding that much memory for that long is too big a risk. I'll have to store the results to disk instead, retrieving and clearing them as necessary.
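As a hypothetical sketch of that plan (the class, file naming and format are all my assumptions, and a real shared host would probably need a folder under the web app rather than the temp directory):

using System;
using System.IO;
using System.Web.SessionState;

// Keep only a small file path in the Session; the potentially large results go on disk.
public static class ResultsStore
{
    public static void Save(HttpSessionState session, string[] hitUrls)
    {
        string path = Path.Combine(Path.GetTempPath(),
                                   "results-" + session.SessionID + ".txt");
        File.WriteAllLines(path, hitUrls);      // results stored on disk, not in memory
        session["ResultsFile"] = path;          // only the path stays in the Session
    }

    public static string[] Load(HttpSessionState session)
    {
        string path = (string)session["ResultsFile"];
        return (path != null && File.Exists(path)) ? File.ReadAllLines(path) : new string[0];
    }

    public static void Clear(HttpSessionState session)
    {
        string path = (string)session["ResultsFile"];
        if (path != null && File.Exists(path))
            File.Delete(path);                  // clear as necessary
        session.Remove("ResultsFile");
    }
}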

Large Object Heap objects

This article on The Dangers of the Large Object Heap says that any object larger than 85kB (or 8kB for an array of doubles) goes on the Large Object Heap, which can become fragmented and so result in increased memory usage. Try not to use large objects.
Does each collection or generic collection count as one object, or as a multitude of its individual components?
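For illustration, the thresholds quoted there work out like this (these sizes are taken from that article and common descriptions of the CLR, not something I have measured):

class LohExamples
{
    byte[] smallArray = new byte[84000];    // under ~85,000 bytes: normal generational heap
    byte[] largeArray = new byte[85000];    // 85,000 bytes or more: allocated on the Large Object Heap
    double[] doubles  = new double[1000];   // 1,000 doubles (8kB): also treated as a large object
}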

Current progress

A crawl of 100,000 simple HTML files now takes 11 minutes and uses a maximum of 23MB of memory. When this db is loaded for searching, the rest state of the web app is 6MB; after a couple of searches this goes up to 22MB.

The previous version took 83 minutes and used a maximum of 126MB of memory. When its db was loaded for searching, 44MB of memory was used, going up to 50MB after a couple of searches.

This is not a completely fair test, as the new code is not complete in several important ways. However, it does show dramatic reductions in memory usage. I am not sure why there is such a dramatic speed improvement - it might be because of the reduced memory usage, or it could be the simpler, not-yet-complete algorithm.