03 April 2009

Content-Location HTTP header for current URL

All our web sites at shared hosting provider Crystaltech/Newtek have just started serving an extra Content-Location HTTP header for all static web page requests - the header contains the current URL.

To see this in action, use Rex Swain's HTTP Viewer to view this page at our web site: http://www.phdcc.com/phd.html.

If you look in the received output, you will see that the Content-Location header is set to the requested URL:
Content-Location:·http://www.phdcc.com/phd.html(CR)(LF)

Normally, the Content-Location header is used to indicate when the content actually corresponds to another URL. So if you look at http://www.phdcc.com/ you will see this output:
Content-Location:·http://www.phdcc.com/default.htm(CR)(LF)
ie the web site home page is actually called default.htm.
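
If you want to check the header programmatically rather than via Rex Swain's viewer, a small console sketch like this would do - it just uses the standard .NET HttpWebRequest class against the example URL above:

    using System;
    using System.Net;

    class ContentLocationCheck
    {
        static void Main()
        {
            // Request the page and print any Content-Location header in the response.
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://www.phdcc.com/phd.html");
            using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
            {
                string contentLocation = response.Headers["Content-Location"];
                Console.WriteLine(contentLocation ?? "(no Content-Location header)");
            }
        }
    }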

While this is a harmless change for most users, it did in fact fool our FindinSite-MS search engine - I am in the process of releasing a new version to cope with this. Our ISP Crystaltech claims that nothing has changed. Has anyone any further information?

02 April 2009

Lowering ASP.NET memory usage

This blog post is a work in progress on how to keep ASP.NET web application memory usage low. The motivation for this is to avoid the web app being stopped for using too much memory.

Background

The Microsoft Internet Information Services (IIS) web server has administrator options to recycle worker processes, ie to automatically stop the process running an ASP.NET web application if a threshold is passed. Some of these thresholds are 'arbitrary', eg the default Elapsed Time of 29 hours. However the Virtual Memory and Used Memory thresholds tend to kick in as you use more memory. If a web app is 'recycled' then you get no warning - the worker process is simply terminated; the web app is only restarted in response to another web request.
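
There is no notification when this happens, but you can at least log why the app stopped by recording the shutdown reason in Global.asax. A minimal sketch, assuming an App_Data folder that the worker process can write to (note that a hard kill may skip Application_End entirely):

    using System;
    using System.IO;
    using System.Web;
    using System.Web.Hosting;

    public class Global : HttpApplication
    {
        // Runs when ASP.NET shuts the application down, including on most recycles.
        protected void Application_End(object sender, EventArgs e)
        {
            string reason = HostingEnvironment.ShutdownReason.ToString();
            string logPath = Path.Combine(HostingEnvironment.ApplicationPhysicalPath,
                                          @"App_Data\shutdown.log");   // assumed writable location
            File.AppendAllText(logPath, DateTime.Now + " shutdown: " + reason + Environment.NewLine);
        }
    }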

In addition, a web app will tend to slow down as its memory use increases.

I am working on this topic for my FindinSite-MS site search engine. This one web app both (a) does searches and (b) crawls a web site to build a 'search database' that is used by the search. The crawl/index task is done in a separate background thread - an indexing task is either started from the user interface, or run on a predefined schedule. (The app has to be alive for a scheduled index to run - so an outside process can be set up to wake up a web app if need be.)

This software currently suffers from two problems:
- Too much of the database-being-searched is kept in memory
- More significantly, the indexing process uses a large amount of memory.

Note that my 'search database' db (as I refer to it) is just a set of files that get loaded into memory - no real database is used. This makes the search engine easier to deploy, as a database is not required.

Memory heuristics

I use the System.GC.GetTotalMemory(false) call to get the current total memory usage, without forcing a garbage collection.
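
For reference, the whole heuristic is just this call, wrapped here in a tiny console example:

    using System;

    class MemoryCheck
    {
        static void Main()
        {
            // Current managed heap size without forcing a garbage collection.
            long bytes = GC.GetTotalMemory(false);
            Console.WriteLine("Managed memory: {0:F1} MB", bytes / (1024.0 * 1024.0));
        }
    }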

I don't have a precise figure for how much memory is too much. On our Crystaltech shared host, anything less than 10MB is good, while anything over 100MB is bad - though a recent indexing run worked with a maximum of 400MB memory usage.

Redesign methodology

My initial focus was on designing a new 'search database'. This db design needs to be simultaneously searchable and buildable, ie sufficiently fast and low-memory while searching - and the same while building.

The main obvious programming technique is not to keep anything in memory. This is actually quite a hard mindset to achieve. For example, it would be quite nice to keep a bit of information in memory about every file indexed, eg URL, title, size, etc. While this might work for 1,000 files, or even 25,000, it is not going to work for half a million.

Using disk instead of memory

If I cannot store information in memory, then I'll have to save it to disk. In fact, this may not be as bad an option as it sounds: the operating system (and hardware etc) will cache disk data in memory, so performance may not suffer too badly. Storing data on disk should not count towards ASP.NET memory usage, so it should help ensure that my app isn't killed.

Example 1: word file list

I want to store a list of file numbers for each word found during the crawl. My first redesign kept 32 of these numbers in memory, along with 4 other integers - with the rest on disk. A second redesign reduced this to 8+4. My latest design has just 2 integers per word in memory; everything else is on disk.

The first integer is the block number in the temporary data file. The second number is the last inserted file number - this makes sure that I don't update the block more than once per file for each word.
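
A minimal sketch of the idea - the type and member names here are mine, not the shipping FindinSite-MS code, and the disk handling is omitted:

    using System.Collections.Generic;

    // Per-word state kept in memory; the full file-number lists live in a temporary data file.
    struct WordEntry
    {
        public int BlockNumber;    // block in the temporary data file holding this word's file numbers
        public int LastFileNumber; // last file number recorded, so a block is updated at most once per file
    }

    class WordIndex
    {
        // Word -> two integers; everything else stays on disk.
        private readonly SortedDictionary<string, WordEntry> words =
            new SortedDictionary<string, WordEntry>();

        public void AddOccurrence(string word, int fileNumber)
        {
            WordEntry entry;
            if (words.TryGetValue(word, out entry) && entry.LastFileNumber == fileNumber)
                return;   // this file has already been recorded for this word

            // ... append fileNumber to the word's block on disk, setting entry.BlockNumber
            //     if a new block has to be allocated (disk handling omitted) ...
            entry.LastFileNumber = fileNumber;
            words[word] = entry;
        }
    }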

Example 2: file list

I want to know whether I've indexed a file before. I used to keep 2 (yes two) lists of files in memory. Now this is all written out to disk.

So how do I check quickly if I've already indexed a file? OK, I do have a List<> of all files indexed. Each List element is a structure that contains two integers, a FileHash and a FilePointer. The FileHash is the HashCode of the file path, and the FilePointer is the location of the full information on disk.

To check whether I've indexed a file, I find the HashCode of the file path. I then iterate through the List<>. If the hash matches then I use the FilePointer to retrieve the path from disk. If this matches, then the file has been indexed before. I keep looking if it doesn't match, in case two or more files have the same hash.
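
In code the lookup looks roughly like this - again the names are mine, and the disk read is stood in for by a delegate, purely to illustrate the technique:

    using System;
    using System.Collections.Generic;

    // One entry per indexed file: the path's hash code plus where the full record lives on disk.
    struct FileEntry
    {
        public int FileHash;     // HashCode of the file path
        public long FilePointer; // offset of the full file record in the data file
    }

    class FileList
    {
        private readonly List<FileEntry> files = new List<FileEntry>();

        public void Add(string path, long filePointer)
        {
            FileEntry entry;
            entry.FileHash = path.GetHashCode();
            entry.FilePointer = filePointer;
            files.Add(entry);
        }

        // readPathAt stands in for the disk read: seek to FilePointer and return the stored path.
        public bool AlreadyIndexed(string path, Func<long, string> readPathAt)
        {
            int hash = path.GetHashCode();
            foreach (FileEntry entry in files)
            {
                if (entry.FileHash != hash)
                    continue;                              // cheap in-memory check first
                if (readPathAt(entry.FilePointer) == path)
                    return true;                           // hash and full path both match
                // otherwise a hash collision - keep looking
            }
            return false;
        }
    }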

Example 3: reversed word list

To support wild cards at the start of a search, eg a search for [*hris], I need to reverse each word, so that "chris" becomes "sirhc". [I won't reveal my algorithm just now.]

At one stage, I created a list of reversed words as I went through the normal word list. However, I now write the reversed words out to disk first. I then clear the word list (see below) to reduce my memory footprint. Finally I read my reversed words in for processing.
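
The reversal itself is trivial; what matters is the order of operations - write out, clear, read back. A sketch with placeholder words and a temporary file:

    using System;
    using System.Collections.Generic;
    using System.IO;

    class ReversedWords
    {
        static void Main()
        {
            List<string> words = new List<string> { "chris", "search", "engine" }; // stand-in word list
            string tempFile = Path.GetTempFileName();

            // 1. Write the reversed words out to disk.
            using (StreamWriter writer = new StreamWriter(tempFile))
            {
                foreach (string word in words)
                {
                    char[] chars = word.ToCharArray();
                    Array.Reverse(chars);                 // "chris" -> "sirhc"
                    writer.WriteLine(new string(chars));
                }
            }

            // 2. Drop the in-memory word list before the next phase.
            words.Clear();

            // 3. Read the reversed words back in for processing.
            foreach (string reversed in File.ReadAllLines(tempFile))
            {
                Console.WriteLine(reversed);
            }
        }
    }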

Example 4: SortedDictionary.Clear()

I use a SortedDictionary during the indexing run. If I set this to null and do a garbage collect, then no memory is freed. If I call Clear() and then set this to null, the memory is cleared. [And I'm pretty certain that there are no other references to items in the dictionary.]
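
The pattern I now use, shown as a stand-alone sketch with an arbitrary dictionary payload:

    using System;
    using System.Collections.Generic;

    class ClearTest
    {
        static void Main()
        {
            SortedDictionary<int, byte[]> dict = new SortedDictionary<int, byte[]>();
            for (int i = 0; i < 1000; i++)
                dict.Add(i, new byte[10000]);        // roughly 10MB of payload

            Console.WriteLine("Before: {0:F1} MB", GC.GetTotalMemory(true) / (1024.0 * 1024.0));

            dict.Clear();    // in my app, nulling the reference alone did not free the memory
            dict = null;
            GC.Collect();

            Console.WriteLine("After:  {0:F1} MB", GC.GetTotalMemory(true) / (1024.0 * 1024.0));
        }
    }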

Should I use Cache?

Is it safe to use the ASP.NET Page.Cache? I presume that anything put in it still adds to the memory used by the application, so the web app could still be shut down unilaterally, without ASP.NET trying to clear the Cache first.

I do use the Cache as a part of the search process. However I set an expiry time of 5 minutes - this provides a useful cache while a user is searching, but clears the memory when they have probably gone anyway.
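
In sketch form, using the standard Cache.Insert overload - the key and the cached value are placeholders, and whether the real code uses a sliding or an absolute expiry I'll leave open; a sliding expiry keeps the entry alive while the user keeps searching:

    using System;
    using System.Web.UI;

    public class SearchPage : Page
    {
        // Keep the loaded search data around while the user is searching,
        // then let ASP.NET drop it 5 minutes after the last access.
        private void CacheSearchData(object searchData)
        {
            Cache.Insert("findinsite-db",                                  // placeholder key
                         searchData,
                         null,                                             // no cache dependency
                         System.Web.Caching.Cache.NoAbsoluteExpiration,
                         TimeSpan.FromMinutes(5));                         // 5 minute sliding expiry
        }
    }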

Storing data in the Session

I have just remembered that I store the search results for each user in a Session variable. This is useful, eg when they ask for the second page of hits for a search. This results data could be reasonably big, so I now think that this is not wise. The Session variable will presumably be cleared after say 20 minutes, but this is too big a risk. I'll have to store the results to disk, retrieve and clear as necessary.
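
A possible shape for the replacement, keeping only a small marker in the Session - the folder and helper names are mine:

    using System.IO;
    using System.Web.UI;

    public class ResultsPage : Page
    {
        // Write the (potentially large) search results to a per-session file
        // and keep only the file path in the Session.
        private void SaveResults(string resultsText)
        {
            string dir = Server.MapPath("~/App_Data/results");     // assumed writable folder
            Directory.CreateDirectory(dir);
            string path = Path.Combine(dir, Session.SessionID + ".txt");
            File.WriteAllText(path, resultsText);
            Session["ResultsFile"] = path;                          // small marker only
        }

        private string LoadResults()
        {
            string path = Session["ResultsFile"] as string;
            return (path != null && File.Exists(path)) ? File.ReadAllText(path) : null;
        }
    }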

Large Object Heap objects

This article on The Dangers of the Large Object Heap says that any object larger than 85kB (or 8kB for arrays of doubles) may lead to increased memory usage, so try not to use large objects.
Does each collection or generic collection count as one object, or as many objects - one per element?
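
One practical way to stay under the 85kB threshold is to stream data through a small reusable buffer rather than allocating one big array - a sketch, with 64kB chosen arbitrarily as a safe size:

    using System.IO;

    static class ChunkedCopy
    {
        // Copy a file using a 64kB buffer, which stays off the Large Object Heap,
        // instead of reading the whole file into one big byte[].
        public static void Copy(string sourcePath, string destPath)
        {
            byte[] buffer = new byte[64 * 1024];   // well under the ~85,000 byte LOH threshold
            using (FileStream input = File.OpenRead(sourcePath))
            using (FileStream output = File.Create(destPath))
            {
                int read;
                while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
                    output.Write(buffer, 0, read);
            }
        }
    }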

Current progress

A crawl of 100,000 simple HTML files now takes 11 minutes and uses a maximum of 23MB of memory. When this db is loaded for searching, the rest state of the web app is 6MB. After a couple of searches this goes up to 22MB.

The previous version took 83 minutes and used 126MB of memory at most. When loaded for searching, it used 44MB of memory, going up to 50MB after a couple of searches.

This is not a completely fair test, as the new code is not complete in several important ways. However it does show dramatic memory usage reductions. I am not sure why there's such a dramatic speed improvement - it might be because of the reduced memory usage, or it could be the simpler, not-yet-complete algorithm.

29 March 2009

DNN portal Import/Export Link Url bug

In DotNetNuke (DNN) 4.9.2, you can use [Host][Portals] to export a complete Portal as a template to the Portals/_default/ directory. You can then import the template elsewhere using [Admin][Site Wizard].

Summary: if you have pages that have a "Link Url" that refers to another page on the site, then these links do not survive the import/export process properly. Currently, you will need to fix these up by hand after the import, going through [Admin][Pages] to edit the selected page's settings.

Preamble: in DNN, each page is identified by an integer TabId. This nomenclature stems from DNN's initial versions, which referred to each page as a "tab".

Details: When a "Link Url" is set up to another page on the site, the page's TabId is stored as an integer in the Tabs database table Url column. When the portal is exported, this Url TabId number is stored in the template (in a <url> XML element). On import, the Url TabId is simply stored directly in the new page row Url column. However, the TabId of the desired "Link Url" page will almost certainly have changed, so the page will redirect the user to the wrong page.

Background: The Site Wizard does not really describe what it does very well. An old book says that the import process is *additive*, ie the pages in the template are *added* to the existing pages on your site. However this isn't entirely correct: if you select "Replace" in the wizard, then the existing pages are soft-deleted, ie their TabName has "_old" appended and the page is marked as Deleted.

The import Site Wizard assigns a new TabId to each imported page (if a new page is created). This means that it is virtually certain that the TabId-s have changed. If your template etc has hard-coded links to particular TabId-s, then these are almost guaranteed to be wrong.

It would be useful if it were possible to import a template and keep the original TabId-s. Currently, the portal export does not contain the current TabId for each page. Superficially it is also not possible to keep existing TabId-s as the Tab table TabID column is an identity. However it is possible to use the SET IDENTITY_INSERT command to insert a row with a particular TabId.

Reported on DNN Gemini bug tracker

To do:
- See if there is a better way of putting links (eg in Text/HTML and in code) so that they survive the import/export process. Me: in code you can do this using the TabController GetTabByName function - see the sketch below.
- Double-check what the Ignore and Merge site wizard options do
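
For the first item, a sketch of what the code route might look like - the GetTabByName signature and the Globals.NavigateURL usage are my assumptions about the DNN 4.x API, so check against your version:

    using DotNetNuke.Common;
    using DotNetNuke.Entities.Tabs;

    public class LinkHelper
    {
        // Build a link to a page by name rather than by hard-coded TabId,
        // so the link survives the page getting a new TabId on import.
        public static string LinkToPage(string tabName, int portalId)
        {
            TabController tabController = new TabController();
            TabInfo tab = tabController.GetTabByName(tabName, portalId);   // signature assumed for DNN 4.x
            return (tab != null) ? Globals.NavigateURL(tab.TabID) : null;
        }
    }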

Bro John adds:
- These problems typically cause an infinite redirect loop; IE does not notice this, but Firefox and Chrome do. Me: The problem in one case was that the Login page was not visible to anonymous users, so the site redirects to the Login page ad infinitum. The import process does not reset the LoginTabId; it only sets it if a tab with tabtype logintab is found. Unless the old LoginTabId is Replaced, this should be safe.
- The login menu item typically fails, so you need to have added a hand-made login page to be able to log in to the site once you have created it ...