Some code to throttle rapid requests to your CF server from one IP address
Note: This blog post is from 2010. Some content may be outdated--though not necessarily. The same goes for links and for subsequent comments from myself or others. Corrections are welcome in the comments, and I may revise the content as necessary.

Some time ago I implemented some code on my own site to throttle when any single IP address (bot, spider, hacker, user) made too many requests at once. I've mentioned it occasionally, and people have often asked me to share it, which I've happily done by email. Today, with another request, I decided to post it and of course seek any feedback.
It's a first cut. While a couple of concerns will come to mind for some readers (I try to address those at the end), it does work for me, it has helped improve my server's stability and reliability, and it's been used by many others.
Update in 2020: I have changed the 503 status code below to 429, as that has become the norm for such throttles. I had acknowledged 429 as an option originally; I just want to change it now, in case someone grabs the code without reading the whole post or the comments. Speaking of comments, do see the discussion below with thoughts from others, especially from James Moberg, who created his own variant (offered on github) addressing some concerns, as well as the conversation that followed, including yet another later variant.
Update in 2021: Rather than use my code, perhaps you would rather have this throttling done by your web server or another proxy. It is now a feature offered in IIS, Apache, and others. I discuss those in a new section below.
Background: do you need to care about throttling? Perhaps more than you realize
As background, in my consulting to help people troubleshoot CF server problems, one of the most common surprises I help people discover is that their servers are often being bombarded by spiders, bots, hackers, people grabbing their content, RSS readers, or even just their own internal/external ping tools (monitoring whether the server is up).
There may be many more of them than expected, they may come more often than expected, or they may hit your server extremely fast (even many times a second). This throttle tool can help deal with that last problem.
Why you can't "just use robots.txt and call it a day"
Yes, I do know that there is a robots.txt standard (or "robots exclusion protocol") which, if implemented on your server, robots should follow so as not to abuse your site. And it does offer a crawl-delay option.
The first problem is that some of the things I allude to above aren't bots in the classic sense (such as RSS readers and ping tools). They don't "crawl" your site, so they don't consider that they need to be told how or where to look; they're just coming to look for a given page.
A second is that the crawl-delay is not honored by all spiders.
The third problem is that some bots simply ignore the robots.txt, or don't honor all of it. For instance, while Google honors the file in terms of what it should look at, my understanding is that it does not regard it with respect to how often it should come. Instead, Google requires you to implement the webmaster toolkit for your site to control its crawl rate.
Then, too, if you have multiple sites on your server, a spider or bot may not consider that when deciding to send a wave of requests. It may say "I'll only send requests to domain x at a rate of 1 per second," but not realize it's sending requests to domains x, y, and z (and a, b, and c), all of which live on one server/cluster, which could lead a single server to be hit far more than once a second in that scenario. It may seem like an edge case, but honestly it's not that unusual from what I've observed.
Finally, another reason all this becomes a concern is that there can of course be many spiders, bots, and other automated requests all hitting your server at once sometimes. My tool can't help with that, but it can at least help with the other points above.
(As with so much in IT and this very space, things do change, so what's true today may change, or one may have old knowledge, so as always I welcome feedback.)
Rather than use my code, you may want to have your web server do the throttling
Again, this section is an update in 2021: rather than use my code, there are now features to do this sort of throttling in web servers like IIS (which calls the feature "dynamic ip restrictions"/DIPR) and Apache (such as via mod_limitipconn or mod_evasive) and others. Or your load balancer may offer it, as may a WAF, or third party services. I list several such options in the Security/Protection area of my CF411.com site. (This 2010 post predates IIS adding its DIPR feature, and since most CFers at the time used IIS, I offered the code approach below.)
FWIW, I had offered a comment below in 2012 when IIS first came out with a module to do this, as something you would download and configure. But don't use that if you are on IIS 8 or above: since then, Microsoft has rolled it into IIS 8 and above as an optional built-in feature (rather than the older downloadable-module approach for IIS 7).
You can learn more about enabling and using it here, including how to configure it in the IIS UI or config/xml, and also how to simply LOG what WOULD be blocked. (In my 2010 code below, I did add logging, but I didn't think to offer an option to ONLY log. That would be pretty simple to implement, if you'd like to. The code is open--though it's from the time before github's prominence, so I never posted it there.)
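If you did want a log-only mode in the CFML approach below, a minimal sketch might look like the following (untested, and the logOnly argument name is just my choice): add the argument alongside the others at the top of the UDF, and wrap the blocking lines so the CFLOG still runs but the rejection becomes optional.

<!--- sketch: add alongside the other cfargument tags at the top of the UDF below --->
<cfargument name="logOnly" type="boolean" default="false">

<!--- sketch: then wrap the blocking lines, so the request is only rejected when logOnly is false --->
<cfif not arguments.logOnly>
	<cfoutput><p>You are making too many requests too fast, please slow down and wait #arguments.duration# seconds</p></cfoutput>
	<cfheader statuscode="429" statustext="Too Many Requests">
	<cfheader name="Retry-After" value="#arguments.duration#">
	<cfabort>
</cfif>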
The code
So I hope I've made the case for why you should consider some sort of throttling, such that too many requests from one IP address are rejected. I've done it in a two-fold approach, sending both a plain text warning message and an http header that is appropriate for this sort of "slow down" kind of rejection. You can certainly change it to your taste.
I've just implemented it as a UDF (user-defined function). Yes, I could have also written it all in CFScript (which would run in any release, as there's nothing in that code that couldn't be written in script--well, except the CFLOG, which could be removed). But since CF6 added the ability to define UDFs with tags, and to keep things simplest for the most people, I've just done it as tags. Feel free to modify it to all script if you'd like. It's just a starting point.
I simply drop the UDF into my application.cfm (or application.cfc, as appropriate). Yes, one could include it, or implement it as a CFC method if they wished.
<cffunction name="limiter">
<!---
Written by Charlie Arehart, [email protected], in 2009, updated 2012
- Throttles requests made more than "count" times within "duration" seconds from a single IP.
- Sends a 429 status code for bots to consider, as well as text for humans to read
- Also logs to a new "limiter.log" that is created automatically in the CF logs directory (cfusion\logs, in CF10 and above), tracking when limits are hit, to help fine-tune
- Note that since it relies on the application scope, you need to place the call to it AFTER a cfapplication tag in application.cfm (if called in onrequeststart of application.cfc, that would be implicitly called after onapplicationstart, so no worries there)
- Updated 10/16/12: now adds a test around the actual throttling code, so that it applies only to requests that present no cookie, and so should only impact spiders, bots, and other automated requests. A "legit" user in a regular browser will be given a cookie by CF after their first visit and so would no longer be throttled.
- I also tweaked the cflog output to be more like a csv-format output
--->
<cfargument name="count" type="numeric" default="3">
<cfargument name="duration" type="numeric" default="3">
<cfif not IsDefined("application.rate_limiter")>
<cfset application.rate_limiter = StructNew()>
<cfset application.rate_limiter[CGI.REMOTE_ADDR] = StructNew()>
<cfset application.rate_limiter[CGI.REMOTE_ADDR].attempts = 1>
<cfset application.rate_limiter[CGI.REMOTE_ADDR].last_attempt = Now()>
<cfelse>
<cfif cgi.http_cookie is "">
<cfif StructKeyExists(application.rate_limiter, CGI.REMOTE_ADDR) and DateDiff("s",application.rate_limiter[CGI.REMOTE_ADDR].last_attempt,Now()) LT arguments.duration>
<cfif application.rate_limiter[CGI.REMOTE_ADDR].attempts GT arguments.count>
<cfoutput><p>You are making too many requests too fast, please slow down and wait #arguments.duration# seconds</p></cfoutput>
<cfheader statuscode="429" statustext="Service Unavailable">
<cfheader name="Retry-After" value="#arguments.duration#">
<cflog file="limiter" text="'limiter invoked for:','#cgi.remote_addr#',#application.rate_limiter[CGI.REMOTE_ADDR].attempts#,#cgi.request_method#,'#cgi.SCRIPT_NAME#', '#cgi.QUERY_STRING#','#cgi.http_user_agent#','#application.rate_limiter[CGI.REMOTE_ADDR].last_attempt#',#listlen(cgi.http_cookie,";")#">
<cfset application.rate_limiter[CGI.REMOTE_ADDR].attempts = application.rate_limiter[CGI.REMOTE_ADDR].attempts + 1>
<cfset application.rate_limiter[CGI.REMOTE_ADDR].last_attempt = Now()>
<cfabort>
<cfelse>
<cfset application.rate_limiter[CGI.REMOTE_ADDR].attempts = application.rate_limiter[CGI.REMOTE_ADDR].attempts + 1>
<cfset application.rate_limiter[CGI.REMOTE_ADDR].last_attempt = Now()>
</cfif>
<cfelse>
<cfset application.rate_limiter[CGI.REMOTE_ADDR] = StructNew()>
<cfset application.rate_limiter[CGI.REMOTE_ADDR].attempts = 1>
<cfset application.rate_limiter[CGI.REMOTE_ADDR].last_attempt = Now()>
</cfif>
</cfif>
</cfif>
</cffunction>
Then I call the UDF, simply using cfset limiter(), as shown below. That's it. No arguments need be passed to it unless you want to override the defaults of limiting things to 3 requests from one IP address within 3 seconds (the example below bumps the duration to 5).
<cfset limiter(count=3,duration=5)>
Note that since the UDF relies on the application scope, you need to place the call to it AFTER a cfapplication tag if using application.cfm. If using application.cfc, you could call it from within your onrequeststart, and that would be implicitly called after onapplicationstart so no worries there.
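For instance, a minimal Application.cfc might look like the following (just a sketch; the limiter.cfm filename is hypothetical, assumed to hold the UDF above, and your component will of course have its own settings):

<cfcomponent output="false">
	<cfset this.name = "myApp">

	<!--- pull in the limiter UDF (assumed here to be saved as limiter.cfm), making it available as a method --->
	<cfinclude template="limiter.cfm">

	<cffunction name="onRequestStart" returntype="boolean" output="true">
		<cfargument name="targetPage" type="string" required="true">
		<!--- throttle any single IP to 3 requests per 3 seconds (the defaults) --->
		<cfset limiter()>
		<cfreturn true>
	</cffunction>
</cfcomponent>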
Caveats
There are definitely a few caveats to consider, and some concerns/observations that readers may have. The first couple have to do with the whole idea of doing this throttling by IP address:
- First, some will be quick to point out a potential flaw in throttling by IP address: you may have some visitors who are behind a proxy, such that they all appear to your server to be coming from one IP address. That's a dilemma that requires more handling. For instance, one idea would be to key on yet another field in the request headers (like the user agent), so that you use two keys to identify "a user" rather than just the IP address (see the rough sketch after this list). If you think that's an issue for you, feel free to tweak it and report back here for others to benefit. I didn't choose to bother with that, as in my case (on my site) I'm just not that worried about the problem. Note that the log I create will help you determine if/when the UDF is doing any work at all. Again, the CFLOG will create a new log called limiter.log, in the CF logs folder (cfusion\logs, in CF10 and above).
- Other folks will want to be sure I point out that many spiders and other automated request tools may now come to your site from different IP addresses, still within that short timespan. My code would not detect them. For now, I have not put in anything to address this (it wouldn't be trivial), but the percentage of hits you'd fail to block because of this may be relatively low. Still, doing anything is better than doing nothing.
- Speaking of the frequency with which this code would run, someone might reasonably propose that this sort of "check" might only need to be done for requests that look like spiders and bots. As I've talked about elsewhere, spiders and bots tend not to present any cookies, and so you could add a test near the top to only pay attention to requests that have no cookie (cgi.http_cookie is ""). I'll leave you to do that if you think it worthwhile. Since there's a chance that some non-spider requesters could also make such frequent requests, I'll leave such a test out for now. (Update: I changed this on 10/16/12 to add just that test, so the code above now only blocks such requests that "look like spiders". A legit browser visitor would get the cookie set by CF on the first request, so won't be impacted by this limiter.)
- Someone may fear that this could cause spiders and bots to store this phrase "You are making too many requests too fast, please slow down and wait" (or whatever value you use). But I will note that I have searched Google, Bing, and Yahoo for this phrase and not found it as the result shown for a page on any site that may have implemented this code. (Since I originally gave the status code of 503, I think that's why it would not store it as the result. I have not checked again since changing this to use 429, per my update above)
- Here's a related gotcha to consider, if you implement this and then try to test it from your browser and find "I can't ever seem to get the error to show" even when I refresh the page often. Here's the explanation: some browsers have a built-in throttling mechanism of their own and they won't send more than x requests to a given domain from the browser at a time. I've spoken on this before, and you can read more from yslow creator Steve Souders. So while you may think you can just hit refresh 4 times to force this, it may not quite work that way. What I have found is that if you wait for each request to finish and then do the refresh (and do that 4 times), you'll get the expected message. Again, use the logs for real verification of whether the throttling is really working for real users, and to what extent. (Separately, after the update above on 10/16/12 to only limit spiders/bots/requests without a cookie, that's another reason you'll never be throttled by this in a regular browser.)
- Finally, someone may note that technically I ought to be doing a CFLOCK since I am updating a shared scope (application) variable. The situation in which this code is running is certainly susceptible to a "race condition" (two or more threads running at once, updating the same variable). But in this case, it's not the end of the world if two requests modify the data at once. And I'd rather not have code like this doing any CFLOCKing since it's prospectively running on all requests.
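On the first caveat above (visitors behind a shared proxy), here is a rough, untested sketch of the composite-key idea: build a key from the IP plus a hash of the user agent (the limiterKey variable name is just illustrative), and then use that key everywhere the code above uses CGI.REMOTE_ADDR as the struct key.

<!--- sketch: near the top of the UDF, after the cfargument tags --->
<cfset var limiterKey = CGI.REMOTE_ADDR & "|" & hash(CGI.HTTP_USER_AGENT)>

<!--- sketch: then, for example, the per-visitor entry would be created and tracked with that key --->
<cfif not StructKeyExists(application.rate_limiter, limiterKey)>
	<cfset application.rate_limiter[limiterKey] = StructNew()>
	<cfset application.rate_limiter[limiterKey].attempts = 1>
	<cfset application.rate_limiter[limiterKey].last_attempt = Now()>
</cfif>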
Some other thoughts
Beyond those caveats, there are a few more points about this idea that you may want to consider:
- Of course, an inevitable question/concern some may have is, "but if you slow down a bot, might that not affect what it thinks about your site? Might it stop crawling entirely?" I suppose that's a consideration each will have to make for themselves. I implemented this several months ago and haven't noticed any change in my page ranks, my own search results, etc. That's all just anecdotal, of course, and again, things can change. I'll say that of course you use this at your own risk. I just offer it for those who may want to consider it, and who want to save a little time trying to code up a solution. Again, I welcome feedback if it could be improved.
- Some may recommend (and others may want to consider) that this sort of throttling could/should instead be done at the servlet filter level, rather than in CFML (filters are something I've written about before). Yep, since CF runs atop a servlet engine (JRun by default), you could indeed do that, which would then apply to all applications on your entire CF server (rather than being implemented per application, like the above). And there are indeed throttling servlet filters, such as this one. Again, I offer this UDF for those who aren't interested in trying to implement such a filter. If you do, and want to share your experience here, please do.
- BlueDragon fans will want to point out that they don't need to code a solution at all (or use this), because it has had a CFTHROTTLE tag for several years. Indeed it has. I do wish Adobe would implement it in CF (I'm not aware of it existing in Railo). Until then, perhaps this will help others as it has me. (The BD CFThrottle tag also addresses the problem of visitors behind a proxy, with a TOKEN attribute allowing you to key on yet another field in the request headers.)
- There is another nasty effect of spiders, bots, and other automated requests, and that's the risk of an explosion of sessions which could eat away at your java heap space. People often accuse CF of a memory leak, when it's really just this issue. I've written on it before (see the related entries at the bottom here, above the comments). This suggestion about throttling requests may help a little with that, but it really is a bigger problem with other solutions, which I allude to in those other entries.
- It would probably be wise to add some sort of additional code to purge entries from this application-scoped struct, lest it grow in size forever over the life of a CF server. It's only really necessary to keep entries that are less than a minute old, since any older than that would not trigger the throttle mechanism (since it's based on x requests in y seconds). It may not be wise to do this check on every request, but it may be wise to add another function that could be called, perhaps as a scheduled task, to purge any "old" entries (see the rough sketch after this list).
- Finally, yes, I realize I could and should post this UDF to the wonderful CFlib repository, and I surely will. I wouldn't mind getting some feedback if anyone sees any issues with it. I'm sure there's some improvement that could be made. I just wanted to get it out, as is, given that it works for me and may help others.
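As for the purge idea a couple of bullets up, here is a rough sketch of what such a function might look like (untested; the purgeLimiter name and 60-second default are just my choices). It could be called from a scheduled task, or occasionally from some other request.

<cffunction name="purgeLimiter" returntype="void" output="false">
	<cfargument name="maxAgeSeconds" type="numeric" default="60">
	<cfset var ip = "">
	<cfif IsDefined("application.rate_limiter")>
		<!--- loop over a snapshot of the keys, so we are not deleting from the struct while iterating it --->
		<cfloop list="#StructKeyList(application.rate_limiter)#" index="ip">
			<!--- entries older than the cutoff can no longer trip the throttle, so drop them --->
			<cfif StructKeyExists(application.rate_limiter, ip) and DateDiff("s", application.rate_limiter[ip].last_attempt, Now()) GT arguments.maxAgeSeconds>
				<cfset StructDelete(application.rate_limiter, ip)>
			</cfif>
		</cfloop>
	</cfif>
</cffunction>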
Besides feedback/corrections/suggestions, please do also let me know here if it's helpful for you.
As for "cfcombine", I have to say you've stumped me there. I've never heard of it. I realize it's not a tag, but you must have something else in mind, perhaps some project or tool. I googled but could find nothing obvious. Can you share a little more?
http://combine.riafo...
This project "combines multiple javascript or CSS files into a single, compressed, HTTP request."
You'd want to be careful to exclude scripts like this, as a visitor could hit the maximum limit within a single page load if things aren't properly configured.
On another note: To deal with abusive requests, I've written a SpiderCheck custom tag to identify both good and bad/abusive spiders. Identified abusive spiders receive an automatic "403 Forbidden" status and a message. I've also written a "BlackListIP" script that blocks POST requests by known exploited IPs and works with Project HoneyPot. I haven't published any of my internal projects/scripts before because I hadn't had much time. I primarily communicate on third-party blogs on topics of interest and don't attend many conferences. (I hope this doesn't make me a troll.) I wouldn't mind sharing my scripts if someone is interested in reviewing them and distributing them. (I personally don't have time to provide any customer support.)
Thanks for all you do.
As for your scripts, I'm sure many would appreciate seeing them. And as I wrote, I too just would hand out my script on request so just finally decided to offer it here.
But you don't need a blog to post it, of course. You could post them onto RIAForge, and then you don't have to "support it" yourself. A community of enthusiastic users may well rally around the tool--and even if not, no one "expects" full support of things posted on riaforge.
I'd think it worth putting it out there just to "run it up the flag pole and see who salutes". I'm sure Ray or others would help you if you're unsure how to go about getting them posted. (I don't know that you should expect too many volunteers from readers here, though. My blog is not among the more widely read/shared/promoted, but maybe someone will see this discussion and offer to help.)
I myself have been meaning to get around to posting something to riaforge also, just to see what the experience is like. I'm sure someone will pipe in to say "it's incredibly easy." I don't doubt that. I just haven't had the right project at the right time (when I could explore it) to see for myself.
But back to your tools, there are certainly a lot of ways to solve the spider dilemma. I've been reluctant to do a check of either those or blacklisted IPs just because of the overhead of checking them on each page request. I imagine those are big lists! :-) But certainly if someone is suffering because of them (or fears that), then it may be worth doing. Again, I'm sure many would appreciate seeing your tools. Hope you'll share them. :-)
--
Mike
Thank you very much for this timely post. @cfchris and I were just discussing this problem last week, and this is exactly what he was recommending. I'm curious about the other steps you recommend when dealing with this.
Here are a few things we've tried and our experience:
We've adopted a suggestion from the community: sniffing the user agent and forcing the session to time out quickly for known bots. Specifically, what we were seeing is that each bot would throw away the cookie, generating a new session for each request. At standard session timeouts (15+ minutes), this quickly added up and overtook all of the server's memory. The obvious challenge with that is keeping up with the known bots out there. For instance, a Russian search index http://www.yandex.co... indexes many of our client pages and completely ignores the robots.txt recommendation.
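Roughly, the user-agent check amounts to something like this in the Application.cfc pseudo-constructor (a simplified sketch; the bot patterns and timeout values here are just illustrative):

<!--- give requests that look like bots a very short session, so abandoned sessions don't pile up in memory --->
<cfif refindnocase("(bot|spider|crawl|slurp|yandex)", CGI.HTTP_USER_AGENT)>
	<cfset this.sessionTimeout = createTimeSpan(0, 0, 0, 2)>
<cfelse>
	<cfset this.sessionTimeout = createTimeSpan(0, 0, 20, 0)>
</cfif>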
In general, every problem we've experienced with search engines taking down our servers has come down to abandoned sessions slowly eating up memory and ultimately crashing the server.
Robots.txt crawl rate - One thing we employ, although not many bots respect it, is the robots.txt "Request-rate" directive. If you set this to match the suggested rate in this code (say, once every five seconds, or Request-rate: 1/5 in robots.txt), that might help them sync up.
Challenges / Suggestions:
The one challenge I see with this code is that many of our clients' images are dynamically resized using a ColdFusion call that checks whether the thumbnail exists, generates it if necessary, and then redirects to the file. In theory, pages with many dynamic images would trigger this code. However, we may need to revisit that solution in general, as the very fact that it takes up a ColdFusion thread for the processing is ultimately what causes the bot load to crash the server.
You addressed this already: potential impacts on SEO. I thought about this for a while, but the reality right now is that rogue bots that don't play nice get the same treatment as first-class engines such as Google, Yahoo, and MSN, and ultimately the same as real users. One other addition might be a "safe user agent list": if you notice a trusted user agent, you simply exclude it from this check. The obvious problem is that user agents are sometimes faked by crawlers to look like Firefox or IE, but playing around with that might also prevent this from taking down real users and priority bots, while still keeping out most of the ones that don't play well.
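For instance, a simple (and admittedly spoofable) version of that exclusion could sit at the top of the limiter UDF, something like:

<!--- sketch: skip throttling entirely for a list of trusted user agents (easily faked, so use with care) --->
<cfif refindnocase("(googlebot|msnbot|slurp)", CGI.HTTP_USER_AGENT)>
	<cfreturn>
</cfif>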
I'm going to play with this on our staging environment for a while, but plan on launching this into a few sites later this week. I'll let you know my findings.
Thanks again for your help on this topic,
Roy
So I have just added a new bullet to my caveats/notes above, and I point people to the two older entries which are listed as "related entries," shown just above the comments here.
It's without question a huge problem, and as I note in the caveat, this limiter wasn't really focused on solving that problem, though it will help. Your ideas are among many that have been considered and that I discuss some in the other entries. Thanks for bringing it up, though.
Wouldn't bots/crawlers or individual internet browsers be assigned their own session? That way you wouldn't have to worry about affecting multiple users in an IP address and wouldn't have to check for CFToken.
So every visit they make creates a new session (CF creates a new session for each page visit they make)--and therefore each new page visit would not be able to access the session scope created in the previous visit.
What you're thinking could make sense for "typical" browsers perhaps, but not for this problem of automated requests. Why not try it out for yourself, though, if you're really interested in this topic. It can be a compelling learning experience.
I've written about it previously a few years ago here: http://tinyurl.com/s...
One thing I wanted to point out is that this function is a bit more aggressive in blocking bots than might at first be apparent. For instance, I set it up to allow up to 6 requests every 3 seconds. I found immediately that Googlebot was being blocked even though it was making only one request per second. The reason is that the "last_attempt" value is updated at every request so that the number of attempts keeps going up. One way around this is to only update the "last_attempt" value when the number of requests is reset.
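Roughly, the change I mean would look something like this inside the no-cookie branch (a sketch only, leaving out the 429/log/abort handling, which would stay as in the post):

<cfif StructKeyExists(application.rate_limiter, CGI.REMOTE_ADDR) and DateDiff("s", application.rate_limiter[CGI.REMOTE_ADDR].last_attempt, Now()) LT arguments.duration>
	<!--- still inside the window: count the attempt, but do NOT touch last_attempt, so the window is measured from the start of the burst --->
	<cfset application.rate_limiter[CGI.REMOTE_ADDR].attempts = application.rate_limiter[CGI.REMOTE_ADDR].attempts + 1>
<cfelse>
	<!--- window has passed (or IP not seen yet): reset the count, and only now update last_attempt --->
	<cfset application.rate_limiter[CGI.REMOTE_ADDR] = StructNew()>
	<cfset application.rate_limiter[CGI.REMOTE_ADDR].attempts = 1>
	<cfset application.rate_limiter[CGI.REMOTE_ADDR].last_attempt = Now()>
</cfif>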
Incidentally, I ended up rewriting the function a bit and changing it to use ColdFusion's new built-in cache functionality to automatically prune old IP addresses. Thought I'd throw it out there:
https://www.modernsi...
Good stuff. Thanks for sharing. I'll do some testing of your recommendation about last_attempt and will tweak the code above after that.
Finally, the use of CF9 caching is nice, though I did want to write something that worked more generically, across CF versions and CFML engines.
Still, folks should definitely check out your variation. Two heads are better than one! :-)
PS I suppose we may want to consider posting this somewhere that can be better shared, tweaked, etc. but in this case I think yours would have been a fork, being so different, so there would still end up being two out there, so I don't know if it's really worth our bothering. :-)
One thing I've learned is that the 500 errors are meant more for "server" errors, while 400-level ones are meant more for "client" errors, like the client made a mistake (as in this case, where the client is making too many requests.)
It has been encouraging to see other tools referring to this same approach I've outlined above.
For instance, there is now an IIS setting for this, in IIS 7's "Dynamic IP Restrictions" module. More at http://learn.iis.net..., in the section "Blocking of IP addresses based on number of requests over time" (where they show returning a 403.8).
[As an update in 2021, regarding that last mention of the IIS feature, in IIS 8 and above it's now built into IIS. See my discussion of this above, as a new section in the blog post, "Rather than use my code, you may want to have your web server do the throttling". I wanted to update this comment here, rather than offer a new one that would appear well below it, to help folks using IIS 8 and above not get that older module. And though long links within comments can look ugly, here is a link to that new section above. https://www.carehart...]
Finally, here is a Ruby implementation, http://datagraph.rub..., which also discusses the debate between returning a 403 or 503 code.
So clearly it's not a new idea, and I guess we'll see as the industry moves toward more of a consensus on the best status code as well as other aspects of such blocking, but at least it validates the approach I put forth. :-)
http://tools.ietf.or...
And I was made aware of it in reviewing the API for Harvest (my preferred timesheet/invoicing system), who use the same approach for throttling rapid requests. Just nice to see I wasn't off in left field, and that even today this seems the same preferred approach for throttling. (If anyone has another, I'm open to ideas.)
And it seems that it came to be because Brad Wood had found and used the code in his own PiBox blog, which he discussed here: http://pi.bradwood.c...
Glad to see the code helping still more people beyond just as shared here.
https://gist.github....
First, I should note that David Hammond had done much the same thing, changing it to use ehcache, as discussed at https://www.modernsi... (and mentioned it in a comment above, both in 2012).
And he hadn't put it into github. Of course back then and indeed in 2010 when I first posted this entry, it wasn't as much "the thing", but surely putting it there is helpful, not only to make it easier for folks to tweak and discuss, but even just to get the code (versus using copy/paste from blog posts like this). I'd made a comment to that point in reply to his, but neither of us got around to doing it, it seems.
Second, besides comments from others and myself here, there are surely many other ideas for improvements that I foresaw (and mentioned in the blog post). Folks who may be motivated to "take this ball and run with it" and try to implement still more tweaks should review both my post and the comments (since those don't appear in the github repo).
For instance, I see you changed it to use status code 429, versus the 503 I used initially. You may have noticed I discussed that very point in a comment (also in 2012): http://www.carehart...., where I showed different tools like this (far more popular ones, like those embedded in web servers and app servers) that used a variety of values: 503, 403, 429. I then added the next comment after it, in 2014, pointing to an IETF standard suggesting 503 and a retry-after header.
But you say in your gist comments that you chose 429 based on "best practices". I'd love to hear if you found something more definitive on this, since it was up in the air in years past. :-)
As for ehcache, it just wasn't an option when I first wrote the code in 2009 (CF9 came out later that year), and I didn't think of it when I posted the entry in 2010. Even so, lots of people would have still been on CF8 or earlier at the time, so I was leaving it generic enough to work not only on multiple versions of CF but also other CFML engines of the time (Railo, BlueDragon, and since then Lucee).
BTW, James, speaking of older comments here, did you ever get around to posting the related scripts you'd mentioned in a comment in 2010 (http://www.carehart....)? If not, I'm sure those could be useful for folks as well.
I do see that you have over 100 gists posted there on github, but it only lets us see 10 at a time, and doesn't seem to offer searching just in a particular person's gists, so I couldn't readily tell on my own.
Keep up the great work, and I hope to find time to get involved in using and perhaps contributing to the version you've posted.
https://gist.github....