Timing Out Sockets

Imagine a scenario: You've learned how to make your RPG program talk TCP/IP using the socket API. You've written a wonderful application that uses TCP/IP. Your boss is proud of you. You walk around with a smile on your face. All is right with the world.

Then, it happens. Your program gets hung up waiting for data on a socket. Some sort of weird communications problem has left your program stuck! It's holding up the system, things aren't getting done. Mass hysteria.

This article shows you how to time-out your sockets so they can recover from network errors.

There are two ways of performing timeouts on sockets. They are:

alarm signals
the select() API

Alarm Signals

On a Unix system, "signals" are used to report fatal errors in much the same way that *ESCAPE messages (MCH1234, CPF1234, etc error messages) are used to report errors on i5/OS. Signals work like hardware interrupts do... they stop whatever the program is doing, and force it to handle the signal before continuing.

In order to let you handle signals in your programs, the socket APIs all abort when a signal is received. They return -1 for failure, and set errno to EINTR (which means "interrupted by signal").

Since the socket APIs are "Unix-type" APIs, IBM has provided the same signalling capabilities on i5/OS.

There's an API called sigaction() that tells the system that your program would like to handle its own signals. There's another API called alarm() that tells the system to send a signal after a certain number of seconds. If the signal arrives while a socket API is still active, it'll abort the API!

At the start of your program, tell the system you want it to call a subprocedure in your program when a signal arrives. In this example, it'll call a subprocedure named "my_handler":

     act.sa_flags = 0;
     act.sa_sigaction = *null;
     act.sa_handler = %paddr(my_handler);
     sigfillset(act.sa_mask);
     sigaction(SIGALRM: act: *omit);

For each socket API you want to time-out, call the alarm() API before the socket API:

     alarm(30);
     rc = connect(sock: %addr(connto): %size(connto));
     alarm(0);

     if (rc = -1);
        if (errno = EINTR);
            msg = 'connect() timed out!';
        else;
            msg = %str(strerror(errno));
        endif;
     endif;

The signal technique works on all socket APIs, but I only demonstrated the connect() API, above... To use it with a different socket API, just replace the connect() call above with a call to recv(), read(), send(), write() or accept().

The signal technique works best with blocking sockets, because the timeout only happens if a socket API is active when the signal is received. Otherwise, you have to write additional code to make your program respond to the signal.

For the full details of the signal technique, including sample code, please read Handling Errors in TCP/IP Programming from the October 27, 2005 issue of this newsletter.

The Select() Method and Non-Blocking Sockets

The alternative to the signal method involves using non-blocking sockets with the select() API to produce timeouts. In most situations, the signal method works just as well, but when you want to make sure your program is as robust as possible, I recommend non-blocking sockets instead.

The term "blocking" means that an API suspends execution of your program until it has completed. For example, I could say "The recv() API blocks until data is received" and that would mean that the API suspends your program and doesn't return control to you until data is received.

The term non-blocking, appropriately, means the opposite. Your program never waits for a non-blocking socket, the API always returns control back to you immediately. If the socket can't carry out its function (for example, you call the recv() API, but there's no data to return unless it waits for some) the API will return an error code. When you check errno to see what the error is, the error will be EWOULDBLOCK denoting that in order for the API to do its job, it would have to block.

Because non-blocking sockets never wait for a network event to occur, they're the safest way to write your applications. Your program will never get "frozen" waiting for a socket. No opportunity for mass hysteria!
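
To make the EWOULDBLOCK behavior concrete, here's a small sketch in C (the native language of these APIs). The names set_nonblocking() and try_recv(), and the -2 return value, are invented for this illustration. The fcntl() calls it relies on are the ones explained next. Some systems report EAGAIN instead of EWOULDBLOCK, so the sketch checks for both.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: put a descriptor into non-blocking mode.
   Returns 0 on success, or -1 on error. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/* Hypothetical helper: try to receive without waiting. Returns the byte
   count, 0 at end of stream, -2 if there is no data yet, -1 on a real error. */
static ssize_t try_recv(int sock, void *buf, size_t len)
{
    ssize_t rc = recv(sock, buf, len, 0);
    if (rc == -1 && (errno == EWOULDBLOCK || errno == EAGAIN))
        return -2;                   /* the API would have had to block */
    return rc;
}
```

On a connected socket with nothing waiting, try_recv() returns -2 right away instead of freezing the program.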

Non-blocking mode is enabled by flipping a bit in the descriptor's flags. To do that, you have to use the fcntl() API with the F_GETFL option to retrieve the flags. You can then turn the non-blocking flag on, and activate the new flags by calling fcntl() with the F_SETFL option.

Here's sample code that creates a new socket descriptor and puts it into non-blocking mode:

 D sock            s             10i 0   
 D flags           s             10i 0   
      .
      .
      sock = socket(AF_INET: SOCK_STREAM: IPPROTO_IP);
      if (sock = -1);                      
         // error occurred! check errno! 
      endif;                             
                                         
      flags = fcntl(sock: F_GETFL);        
      flags = %bitor(flags: O_NONBLOCK); 
      fcntl(sock: F_SETFL: flags);

Tip: In V5R1 and earlier releases, the %bitor() BIF does not exist. In those releases, you can set the flags using addition instead. For example:

      flags = fcntl(sock: F_GETFL);        
      flags = flags + O_NONBLOCK; 
      fcntl(sock: F_SETFL: flags);

The only drawback to using addition is that it'll cause "unpredictable results" if the socket is already in non-blocking mode, so you should only use it if you know the socket is still in blocking mode. This won't be a problem if you set the flag immediately after the socket is created.
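
The reason is plain binary arithmetic, which this short C sketch illustrates (the two function names are made up for the example). O_NONBLOCK is a single bit on the systems I'm aware of, and that's an assumption of this sketch. Adding it to a value where the bit is already on carries into the next bit, which turns non-blocking mode off and switches on some unrelated flag.

```c
#include <fcntl.h>

/* Turn the flag on with OR: safe whether or not it is already set. */
static int with_flag_or(int flags)
{
    return flags | O_NONBLOCK;
}

/* Turn the flag on with addition: only correct if the bit is currently off. */
static int with_flag_add(int flags)
{
    return flags + O_NONBLOCK;
}
```

Applying with_flag_or() twice gives the same value as applying it once. Applying with_flag_add() twice gives a value where the O_NONBLOCK bit is off again.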

Waiting For Events with the Select() API

Of course, non-blocking also presents a new challenge. In most situations, your program will be capable of reading data much faster than the network is capable of delivering it, so socket APIs will almost always fail! You don't want to keep calling the API in a loop until it succeeds, since that would use massive amounts of CPU resources. I love challenges, don't you?

That's where the select() API comes in. The select() API is able to block (or wait) for network events to occur on whole sets of sockets. As soon as at least one network event has occurred, the select() API will return control to your program so that you can process the events. You can set a timeout value on the select() API to ensure that it returns control to your program after a particular amount of time, even if no events have occurred.

The select() API waits until at least one of the following events has occurred:

There's data to read from one or more of the sockets in the "read set".
There's buffer space to write data to one or more of the sockets in the "write set".
An exception event has occurred on one or more of the sockets in the "exception set". See the note below.
The timeout occurs.

Important: An exception event is not an error. Exception events in TCP/IP usually mean that some out of band data has been received. Out of band data is a means of sending data in a "separate stream" from the main data stream, and is almost never used in modern TCP/IP applications.

When I refer to a "set" in the descriptions above, I'm referring to a string of bits where each bit represents one descriptor. You can monitor for events on just one socket if you wish, by creating a set that only has one socket, or you can monitor for events on many sockets at once, if you need to.

In ILE C, macros (preprocessor directives) are used to manipulate descriptor sets. Since you can't use C macros in RPG, I've written RPG subprocedures that do the same thing as the C macros. The subprocedures that you use to manipulate descriptor sets follow:

FD_ZERO: Removes all descriptors from a set. In other words, it "clears" or "zeroes" the set.
FD_SET: Adds a descriptor to a set.
FD_CLR: Removes a descriptor from a set.
FD_ISSET: Checks to see if a given descriptor is in a set.

For example, to create a descriptor set named readset that contains two sockets, you could do the following:

     D readset         s                   like(fdset)
         .
         .
         FD_ZERO(readset);
         FD_SET(sock1: readset);
         FD_SET(sock2: readset);
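
For reference, this is roughly what the equivalent looks like with the C macros themselves. The wrapper function build_readset() is a name invented for this sketch. Note that the C macros take the descriptor first and a pointer to the set second.

```c
#include <sys/select.h>

/* Hypothetical helper: build a read set that holds exactly two descriptors. */
static void build_readset(fd_set *readset, int sock1, int sock2)
{
    FD_ZERO(readset);            /* start with an empty set */
    FD_SET(sock1, readset);      /* turn on the bit for sock1 */
    FD_SET(sock2, readset);      /* turn on the bit for sock2 */
}
```

After the call, FD_ISSET() reports true for the two sockets that were added and false for every other descriptor.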

The following is the RPG prototype for the select() API:

     D Select          PR            10I 0 extproc('select')
     D   max_desc                    10I 0 VALUE
     D   read_set                      *   VALUE
     D   write_set                     *   VALUE
     D   except_set                    *   VALUE
     D   wait_Time                     *   VALUE

The first parameter (max_desc) should be used to specify a number that's one higher than the highest descriptor number that the API should check. The second, third and fourth parameters are pointers to descriptor sets that specify the descriptors that the select() API should monitor for read, write and exception events, respectively. The last parameter is a pointer to a data structure that specifies the timeout value.

Any of the pointers in the preceding prototype can be set to *NULL if you don't want to use the corresponding parameter. For example, if you don't want to monitor for exception events, you can pass *NULL for the except_set parameter. If you don't want the select() API to time out, you can pass *NULL for the wait_time parameter, and so forth.
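
When you monitor several sockets at once, the value for the first parameter has to be worked out from all of them. Here's one way that could be sketched in C. The helper name max_desc_for() is hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: returns the value to pass as the first parameter
   of select(), which is one more than the highest descriptor in the
   list (or 0 if the list is empty). */
static int max_desc_for(const int *fds, size_t count)
{
    int highest = -1;
    for (size_t i = 0; i < count; i++) {
        if (fds[i] > highest)
            highest = fds[i];
    }
    return highest + 1;
}
```

For the descriptors 4, 9 and 6, the helper returns 10, so select() checks descriptors 0 through 9.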

Speaking of the wait_time parameter, it points to a data structure that specifies the timeout value in two fields. The first subfield of the data structure specifies a number of seconds, and the second one specifies a number of microseconds. The select() API will time out when the total of the two subfields has elapsed. For example:

     D timeout         DS                  qualified
     D   tv_sec                      10I 0
     D   tv_usec                     10I 0
        .
        .
        timeout.tv_sec = 30;
        timeout.tv_usec = 500000;

Since one microsecond is one millionth of a second, the preceding code will create a data structure that will cause the select() API to time out after 30.5 seconds.
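
If your timeouts are naturally expressed in milliseconds, the split into the two subfields can be done with integer arithmetic. The following C sketch shows the idea. The function name timeval_from_ms() is made up for the example.

```c
#include <sys/time.h>

/* Hypothetical helper: split a timeout in milliseconds into the two
   subfields of the timeval structure. */
static struct timeval timeval_from_ms(long ms)
{
    struct timeval tv;
    tv.tv_sec  = ms / 1000;              /* whole seconds */
    tv.tv_usec = (ms % 1000) * 1000;     /* leftover milliseconds, as microseconds */
    return tv;
}
```

A value of 1500 milliseconds becomes tv_sec = 1 and tv_usec = 500000, which is one and a half seconds.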

When the select() API is successful, it returns the number of sockets that have events pending. The return value can be zero if the request timed out before any events occurred. In either case, the descriptor sets will be changed so that they only contain the descriptors that have pending events.

If the select() API fails, it returns -1, and the descriptor sets are unchanged. You should check errno to see what went wrong.

Example: Timeout for the Recv() API

I recommend that you create a service program containing useful routines for use with sockets, including the timeout routines that I describe in this article. Having to re-code these routines for every socket call you make in every application would be a major source of frustration! Write them once, and re-use them everywhere!

To use the select() API and a non-blocking socket to perform a timeout, I suggest that you first call the regular socket API to see if there's data. If not, you'll get an EWOULDBLOCK error code, and you'll know that you should call select() to wait for data.

Here's an example of reading data from a non-blocking socket with a timeout:

      *+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
      * nbrecv(): Receive data on non-blocking socket w/timeout
      *
      *    sock = (input) socket to receive data from
      *     buf = (output) buffer to receive into (address of variable)
      *    size = (input) size of buffer to receive into (size of variable)
      *    secs = (input) seconds to wait before timing out
      *
      * Returns length received, or -1 upon failure (check errno!)
      *+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
     P nbrecv          B                   export
     D nbrecv          pi            10i 0
     D    sock                       10i 0 value
     D    buf                          *   value
     D    size                       10i 0 value
     D    secs                       10p 3 value

     D readset         s                   like(fdset)
     D timeout         ds                  likeds(timeval)
     D err             s             10i 0 based(p_err)
     D rc              s             10i 0

      /free

          p_err = sys_errno();

          dow '1';

              rc = recv(sock: buf: size: 0);
              if (rc <> -1);
                 return rc;
              endif;

              // recv() failed for a reason other than "no data yet"
              if (err <> EWOULDBLOCK);
                 return -1;
              endif;

              // -----------------------------------
              //  Wait until socket is readable
              // -----------------------------------

              FD_ZERO(readset);
              FD_SET(sock: readset);
              timeout.tv_sec = %int(secs);
              timeout.tv_usec = (secs - timeout.tv_sec) * 1000000;

              rc = select( sock+1             // descriptor count
                         : %addr(readset)     // read set
                         : *null              // write set
                         : *null              // exceptional set
                         : %addr(timeout) );  // timeout
              select;
              when rc = 0;
                 err = ETIME;
                 return -1;
              when rc = -1;
                 return -1;
              endsl;

          enddo;
      /end-free
     P                 E

I did a few interesting things in this example:

I made my subprocedure accept the same arguments as the normal recv() API, except that I changed the "flags" parameter into a timeout value. This makes it easy to scan your existing programs for calls to the recv() API and replace them with nbrecv().
I created only one descriptor set, and only added one socket to it, because it's the only one I'm interested in handling timeouts for.
Since this procedure only reads a socket, it's not interested in write or exception events. Therefore, it passes *null for those parameters.
If a timeout occurs, the select() API will return 0 to say that there weren't any events on my sockets.
When a timeout occurs, I set errno to ETIME. That way, if the caller checks errno to see what went wrong, it'll get a time out error.

Example: Timeout for the Connect() API

Using the connect() API with a non-blocking socket is more complicated. Setting up a connection takes time, and if it fails, it needs a way to report the error to you. This can be a problem with a non-blocking socket, because a non-blocking socket must return control to your program immediately.

Therefore, the connect() API will always return -1 on a non-blocking socket. It'll set errno to EINPROGRESS to tell you that the connection is in progress -- in other words, it's being established in the background!

You can call the select() API to wait until it has completed by monitoring for a write event. If the socket becomes writable, it tells you that either the connection has been established (and therefore you can write data to the socket) or that an error has occurred. The select() API will not return -1 if connect() fails, because it's not the select() API that failed, it's connect()!

To determine whether the connection was successful or not, you have to retrieve an error code using the SO_ERROR option of the getsockopt() API.

Therefore, to connect() a non-blocking socket, I follow these steps:

Call connect() to start the connection process.
The preceding step should always fail, but if it doesn't for some odd reason, return success to the caller.
Make sure the error was EINPROGRESS. If not, return an error to the caller.
Create a descriptor set containing the socket.
Create a timeout structure.
Call select() to wait until the socket is writable.
If a timeout occurs, return an ETIME error to the caller.
Call getsockopt() with SO_ERROR to determine if connect() succeeded or failed.
If it failed, set errno to the error code, and return the error to the caller.
Otherwise, return success.

Perhaps you can see why I wouldn't want to re-write this for every program that uses sockets! Once you have one that's working properly, you can just call it when you need it. No need to go through this more than once!

Here's an example of a non-blocking connect with a timeout:

      *+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
      * nbconnect(): Connect with a non-blocking socket and timeout
      *
      *    sock = (input) socket to connect
      *    addr = (input) sockaddr structure that denotes the
      *                   location to connect to.
      *    size = (input) size of preceding structure
      *    secs = (input) seconds to wait for connection before
      *                   timing out.
      *
      * Returns 0 if successful, or -1 upon error (check errno!)
      *+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
     P nbconnect       B                   export
     D nbconnect       pi            10i 0
     D    sock                       10i 0 value
     D    addr                         *   value
     D    size                       10i 0 value
     D    secs                       10i 0 value

     D writeset        s                   like(fdset)
     D timeout         ds                  likeds(timeval)
     D err             s             10i 0 based(p_err)
     D rc              s             10i 0
     D connerr         s             10i 0
     D errsize         s             10i 0

      /free

          p_err = sys_errno();

          // -----------------------------------------------
          // On a non-blocking connection, the connect() API
          // won't wait for completion.  It'll start the
          // connection attempt, then return EINPROGRESS to
          // tell you that the connection is in progress...
          // -----------------------------------------------

          rc = connect(sock: addr: size);
          select;
          when rc = 0;
             return 0;
          when err <> EINPROGRESS;
             return -1;
          endsl;

          // -----------------------------------------------
          //  The select() API can be used to wait for a
          //  connection to complete by waiting until it's
          //  "writable".  Note that the select() API has
          //  a timeout value you can set!
          //
          //  Select returns 0 if a timeout occurs.
          //
          //  Note that select() only returns -1 if an error
          //  occurs with the select() API.  If the connect()
          //  API (running in the background) fails, it will
          //  not return -1.
          // -----------------------------------------------

          FD_ZERO(writeset);
          FD_SET(sock: writeset);
          timeout.tv_sec = secs;
          timeout.tv_usec = 0;

          rc = select( sock+1             // descriptor count
                     : *null              // read set
                     : %addr(writeset)    // write set
                     : *null              // exceptional set
                     : %addr(timeout) );  // timeout
          select;
          when rc = 0;
             err = ETIME;
             return -1;
          when rc = -1;
             return -1;
          endsl;

          // -----------------------------------------------
          //  To detect if the connect() API (running in the
          //  background) has failed, you need to get the
          //  SO_ERROR socket option
          // -----------------------------------------------

          errsize = %size(connerr);
          getsockopt( sock
                    : SOL_SOCKET
                    : SO_ERROR
                    : %addr(connerr)
                    : errsize );
          if (connerr <> 0);
             err = connerr;
             return -1;
          endif;

          return 0;
      /end-free
     P                 E

More Examples and Code Download

The code download for this article provides the code for my socket utility service program. It contains code that demonstrates both timeouts with signals and timeouts with the select() API for the connect(), send(), recv() and accept() APIs. It also contains all of the prototypes needed to call the socket APIs and signal APIs.

The same techniques that I use for the regular socket APIs can be applied to the Global Secure Toolkit (GSKit) APIs for SSL programming, but I did not provide samples of these in the code download for this article. If you need timeouts for those, simply replace the calls to recv() and send() with calls to gsk_secure_soc_read() and gsk_secure_soc_write(), respectively.

You can retrieve the code download for this article from the following link:
http://www.scottklement.com/rpg/socktimeout/SocketUtilities.zip

More About TCP/IP Programming with Sockets

If you read this article and said to yourself "Whoa there, back up a step!" then you may be interested in reading my previous articles about socket programming. Here are links to those articles:

In System iNetwork Programming Tips: (NOTE: The publisher has taken these offline!)

In System iNEWS: (NOTE: The publisher has taken these offline!)

On ScottKlement.com:

RPG IV Sockets Tutorial (This is an older tutorial, some code may be out of date, but should still work.)