Suffering CPU or memory problems in CF? Spiders could be killing you in ways you'd never dream

If you're trying to get to the bottom of high memory or CPU use on CFML servers, you may be missing a seemingly innocuous but deadly invader, especially if you're focusing only on "what are my long running requests?" What if, instead, the problem is due to thousands or millions of small page requests that are unexpectedly creating thousands or millions of sessions and client variables each day? It's a pernicious problem that many may dismiss too readily.

The Management Summary

In deference to those who don't like detail, I'll offer in this entry a brief overview of the issue and some starting points for diagnostics. In a future entry, I'll offer more detail on the problem, as well as diagnostic and remediation suggestions.

It may seem innocent enough: search bots may be visiting your site many times a day, due to many different search engines, and perhaps even many times per day per bot. And you may have hundreds or thousands of folks signed up with RSS readers to watch your RSS feeds.

The problem is that if the pages being visited are CFML-based, the bots and RSS readers will not likely track cookies, which starts a real waterfall of problems. There are two potential impacts, sessions and clients.

Creation of many new sessions

Assuming you have SessionManagement="yes" enabled, each such new request without a cookie will cause the CFML server (CF or BD) to create a new session. Normally, that's not a problem. The browser for a typical user will usually store these cookies for reuse on a future page request. But since these bots do not typically track cookies, the CFML server creates a new session for each page request they make.

And that's not one new session per bot, but one new session per bot per page requested. And these new sessions will live as long as your sessiontimeout is set--which could be minutes, hours, or days. That could become a substantial burden for the CFML server to manage, even if there's "nothing" in the session.
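To get a feel for the scale, here is a rough back-of-the-envelope estimate, written in Python purely as a calculator. It assumes cookieless requests arrive at a steady rate and that each one creates a session that lives for the full timeout, so the number of sessions held at any moment settles at roughly the arrival rate multiplied by the timeout. The function name and the traffic figure are made up for illustration; they are not measurements from any real site.

```python
def live_bot_sessions(cookieless_requests_per_day: float,
                      session_timeout_minutes: float) -> float:
    """Approximate number of sessions held in memory at steady state.

    Assumes every cookieless request creates one session and that the
    session is never reused, so it simply lives until it times out.
    """
    requests_per_minute = cookieless_requests_per_day / (24 * 60)
    return requests_per_minute * session_timeout_minutes


# Hypothetical site: 72,000 cookieless hits a day (50 per minute).
for timeout_minutes in (1, 20, 24 * 60):
    sessions = live_bot_sessions(72_000, timeout_minutes)
    print(timeout_minutes, round(sessions))
# prints:
# 1 50
# 20 1000
# 1440 72000
```

Under these assumptions, the same traffic that holds about 50 sessions with a one-minute timeout holds about 72,000 with a one-day timeout, which is why the timeout setting matters so much here.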

(Props to Mark Kruger whose blog entry, Sessions and Cookies and Bots (oh my), was the first I saw to point this out. As I pointed out in his comments, the problem could be still worse with respect to client variables also, as I'll explain below. And since then, I've realized that RSS Readers could be another, and different problem, since the number of individuals running them may be far greater than the number of search engines.)

Creation of many new sessions with large amounts of memory per session

Further, if you DO for some reason put a lot of data into "new sessions" when they're created, then this could become a huge memory burden. And as memory use increases substantially, so will the cost of garbage collection. Eventually the CPU to manage that GC will become problematic, and your system could become unresponsive or even unstable.

Creation of many client records

Finally, there's an even more pernicious (and more persistent) problem due to client variables. Now, even if you'd say "we don't use them", consider this: if you've got ClientManagement="yes" set in your CFAPPLICATION, and you've not checked the "Disable Global Client Variable Updates" option in the CF admin for your client repository (global updates are enabled by default), then the CFML server will create fixed client variables (hitcount, lastvisit, etc.) for every new client. Given the problem above, this would be a new set of these records for each visiting bot request! If you're storing these in the Registry, any great increase in the number of entries is clearly bad enough.

But even if you're storing these auto-created variables in a database, the problem is quite different from the wild creation of sessions. At least those are removed when their sessions time out or the server is restarted. Client variables, though, are typically set to expire in days, weeks, or even months! So your registry or database table, as well as the internal memory and processing used by the CFML server, could become very large and burdensome.

Diagnosis

If you think this may be happening to you (and even if you don't), you should set up monitoring to see how many unique new sessions or clients are being created. You have a few ways to do this.

Sadly, there's no mechanism (I can think of right now) for CF to tell you how many new sessions or clients have been created in a day. Note that tools like the admin API or service factory--or even tools like SeeFusion and FusionReactor--tell you what's going on right now. Not what's happened throughout the day, or even historically.

If you can access them, you could analyze your web server logs to find out how many pages are coming in that have no CFID/CFTOKEN cookie values. That would be a clue, as it would leave CF to create new ones in response. You could also look at either the registry or client variable database to see how many entries there are. You may be shocked by the number.
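As a starting point for that log analysis, here is a minimal Python sketch. It rests on an assumption that will not hold for every server: that your web server has been configured to log the request's Cookie header as the final double-quoted field on each line (many default log formats do not record cookies at all). The function name, the sample lines, and the log layout are all hypothetical, so adjust the pattern to whatever your logs actually contain.

```python
import re
from collections import Counter

# Assumption: the Cookie header is the last double-quoted field on the line.
COOKIE_FIELD = re.compile(r'"([^"]*)"\s*$')
# Matches a CFID cookie at the start of the header or after a ';' separator.
HAS_CFID = re.compile(r'(?:^|;\s*)CFID=', re.IGNORECASE)


def count_cookieless(lines):
    """Tally log lines by whether the logged Cookie header carries CFID."""
    counts = Counter()
    for line in lines:
        match = COOKIE_FIELD.search(line)
        if match is None:
            counts["unparsed"] += 1  # line does not fit the assumed layout
        elif HAS_CFID.search(match.group(1)):
            counts["with_cfid"] += 1
        else:
            counts["without_cfid"] += 1  # the server would issue a new CFID
    return counts


# Made-up sample lines in the assumed layout.
sample = [
    '192.0.2.1 - - [04/Oct/2006:10:00:00 -0400] "GET /index.cfm HTTP/1.1" 200 512 "CFID=101; CFTOKEN=555"',
    '192.0.2.9 - - [04/Oct/2006:10:00:01 -0400] "GET /rss.cfm HTTP/1.1" 200 2048 "-"',
    '192.0.2.9 - - [04/Oct/2006:10:00:02 -0400] "GET /rss.cfm HTTP/1.1" 200 2048 "-"',
]
result = count_cookieless(sample)
print(result["with_cfid"], result["without_cfid"], result["unparsed"])
```

To run it against a real file, pass an open file object instead of the sample list. A high share of "without_cfid" lines for CFML pages suggests many new sessions or clients are being created.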

Remediation

The simplest thing is to ensure that any code that may be hit by bots, search engines, RSS readers, etc. does not run under a CFAPPLICATION with SessionManagement="yes" or ClientManagement="yes" (or the equivalent properties in application.cfc). That may be trivial, or it may be a hassle, depending on the complexity of your application.

Update: here's a thought that's come up from discussions on a private mailing list. One solution to consider would be to detect whether the request has the CFID cookie (which it wouldn't for bots) and, if not, set the session timeout to a short value. Be careful, though, because legitimate first-time visitors will also have no CFID cookie, so if you set any session vars on the page they request or in the application.cfm/.cfc file, don't set the timeout so short that those vars will be gone by the time they reach the next page. Perhaps set it to a minute, but realize the implications.
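The decision itself is simple enough to sketch. What follows is a language-neutral illustration written in Python, not CFML; in a real application the chosen value would feed the sessiontimeout setting in application.cfm/.cfc. The function name and the 20-minute "normal" timeout are hypothetical, and the one-minute value is just the figure floated above.

```python
from datetime import timedelta

# Hypothetical values: pick ones that suit your own application.
NORMAL_TIMEOUT = timedelta(minutes=20)    # request already carries a CFID cookie
FIRST_HIT_TIMEOUT = timedelta(minutes=1)  # no CFID yet: a bot or a first-time visitor


def choose_session_timeout(cookies: dict) -> timedelta:
    """Return a short timeout when the request carries no CFID cookie."""
    has_cfid = any(name.upper() == "CFID" for name in cookies)
    return NORMAL_TIMEOUT if has_cfid else FIRST_HIT_TIMEOUT


print(choose_session_timeout({"CFID": "101", "CFTOKEN": "555"}))
print(choose_session_timeout({}))
```

Note that the short timeout applies to every cookieless request, so a person's first page view gets it too; their next request will carry the cookie and receive the normal timeout, provided it arrives before the short one expires.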

Other thoughts

Similarly, if you use load testing tools, be sure to enable any option to have them honor cookies. Otherwise your test results may not be accurate, because the server will be burdened with creating new sessions and clients for each simulated request, which would not happen in production (except for the bots, RSS readers, etc.).

I'll offer some tools to help with these issues in a future entry. I'll also expand on the discussion of the problem, for those who would appreciate it. For now, I wanted to get this out ASAP, since the problem came up on a list and I wanted to point to this as something to consider. Hopefully it may help others as well.

Comments
Michael Dinowitz had pointed this out to my office (we had him offer to help us debug some server memory issues). I have taken his ideas and run with them. Here are some entries that I blogged that might be of interest:

http://www.bennadel....

... fixed an error that was made in ....

http://www.bennadel....

Anyway, what I basically do is turn off and on session management on a per-page-hit basis (as per Dinowitz's suggestion).

Also, Charlie, are you the carehart that made the first comment on this page: http://livedocs.macr...

Cause if you are, THANK YOU THANK YOU THANK YOU, you have saved me a lot of stress.
# Posted By Ben Nadel | 10/4/06 5:17 PM
I wrote a blog entry on this a while back with a hack level fix. I've altered it since then to allow a bot to create a session but have it time out after 2 seconds rather than the 10 minutes a person has.

http://www.blogoffus...
# Posted By Michael Dinowitz | 10/4/06 5:36 PM
Michael,

Why have you switched over to a very short session duration? I assume that this is so that your whole site can assume that session management is being used without complicating the logic of the way things work???

-Ben
# Posted By Ben Nadel | 10/4/06 5:48 PM
Thanks, Ben. I'd not yet seen other blog entries besides Mark's, or I would have pointed them out (and may not even have written the entry), so thanks for the links. :-) Glad to see Mike offer his link from last year, too. I guess I'm late to the party. :-) But perhaps another round of notice will reach more people.

Mike makes a great point I did not: the impact of the client variable issue applies not only if you use registry or db client vars, but with cookie-based client vars as well. As he wrote, "It seems that when a client variable is set, a memory structure is also set for CF. Now each bot hit is assumed to be it's own session as it does not accept cookies. This mean each bot hit generates a memory structure of about 1k. Now this is not really a lot, but when you have a few 10's of thousands of hits from bots a day, it adds up. " Mike also offers more remediation techniques.

As for the CFDocs custom tag, yes, that was me, Ben. :-) Glad it helped.
# Posted By Charlie Arehart | 10/4/06 7:57 PM
Ben - Exactly. I want a bot to 'exist' as a normal user, which means a normal session. I just don't want the session to last. I wish Adobe had some solid docs or settings for how/when client var structures time out or can be timed out. For what I do, client vars are superior to session.

Charlie - You're late to my post, I'm late to someone else's, etc. This is a VERY important topic that many just don't think about and bringing it up every now and again is only to the benefit of everyone.
# Posted By Michael Dinowitz | 10/4/06 9:38 PM
Charlie,

I had the same issue on a server that was serving RSS via CF. As you correctly say, RSS readers generally don't honour cookies, so every RSS request was creating a new session. Combine that with a session-scoped user object, some session-scoped cached data, and a spider that was crawling the site and all its RSS links, and you very quickly end up with a site that mysteriously goes down in the middle of the night. I fixed it by
- adding a bit of code into OnRequestEnd.cfm to check if the user was not logged in, and if not, manually clear the session scope
- writing the RSS to flat files

I put the full sordid story here : "RSS Ate My Server" - http://instantbadger...
# Posted By Al Davidson | 10/6/06 11:28 AM
The problem of spiders etc. is a big problem.

Some tips:

1. Implement client side caching using <cfheader> with either etags or Last-Modified.

2. Don't be scared to send a 503 error response to say "server busy try again later"

3. BlueDragon has the <cfthrottle> tag that helps you manage the repeat offenders
# Posted By Alan | 10/13/06 8:30 AM
Good points, Alan. Thanks.
# Posted By Charlie Arehart | 10/13/06 12:13 PM