An incomplete guide to what's going wrong with your PBIO/DataExchange/ECho program.

Q1. I get a Segmentation fault/Bus error in write_IOfile(), encode_IOcontext*(), DEport_write_data*() or ECsubmit_typed_event().

Segmentation faults can be caused by a variety of things, including memory overwrites from other portions of the program. But the most common causes of a seg fault in these calls are corrupt data or an IOFieldList that doesn't describe your data correctly.

All of these calls use the supplied IOFieldList to walk your data and package it for transmission. If, for example, there are bad pointers in your event data, you'll get a seg fault or bus error. If your IOFieldList doesn't match the actual data, PBIO may interpret something as a pointer when it isn't, with the same result. So, check your data. If necessary, print it out before the call in question. Make sure that you've initialized everything that needs to be initialized. (In particular, don't assume that memory obtained from malloc() is zeroed. malloc() does not guarantee that.)

If your data is OK, then carefully examine the IOFieldList. If there is a problem causing a seg fault, it is with a data type that implies a pointer, such as a string or a dynamic array. If you've used the IOOffset() macro, make sure that the field name and structure member name are right. IOFieldLists are often copied and edited rather than written from scratch; if you forgot to change something, the field list won't match your data. In the case of a dynamic array, make sure that the corresponding field in your data structure is a pointer, not a static array. For example, this field list specifies a dynamic array:

IOFieldList example[] = {
    {"icount", "integer", 
       sizeof(long), IOOffset(rec_ptr, icount)},
    {"var_int_array", "integer[icount]",
       sizeof(int), IOOffset(rec_ptr, var_int_array)},
    { NULL, NULL, 0, 0}
};
The corresponding C data structure is:
typedef struct _rec {
    long	icount;
    int		*var_int_array;
} rec, *rec_ptr;
If you got this wrong, and perhaps used a structure like:
typedef struct _rec {
    long	icount;
    int		var_int_array[1];
} rec, *rec_ptr;
you'll get a segmentation fault.

Q2. I set my DataExchange/PBIO socket to non-blocking and something bad happened, why?

Many people think that setting a socket to non-blocking means that everything will happen the same, except that their program will never wait for a write() to complete. That's right, except for the "everything will happen the same" part. Normal socket I/O blocks only when the socket buffers fill up. If that happens consistently enough to affect your program's speed, you've probably got a mismatch between the speeds of the sender and the receiver. In this case, blocking I/O serves to slow down the sender by blocking it when the buffers are full.

In contrast, when the OS buffers get full, writes on non-blocking sockets return an error instead of blocking. At that point, half of the data might already be buffered while the OS says there isn't room for the other half. The idea is that the application can then handle the situation somehow. But for a library like DataExchange, there really isn't a good way to handle it. The library can't pull back the portion of the data that has already been buffered, so it's pretty much committed to sending the whole message. (Sending half a message would make PBIO lose track of message boundaries.)

If DataExchange wanted to support non-blocking I/O, we'd really have only two options. First, we could sit in a loop on the write() call, trying to finish writing the last half of the message. This is essentially a far less efficient version of a blocking write(). The second option would be to copy the last half of the message into a buffer, let the program go on, and continue doing our own message buffering until the OS buffers empty out. This has a couple of drawbacks. If there truly is a write/read speed mismatch, we'll just queue messages until we run out of memory. To avoid that, we'd have to set a buffer limit. But what do we do when we hit the limit? Probably we'd have to block the application. But if we went to all the effort of implementing this, we'd just have re-implemented what the OS already does for blocking calls. Two-level buffering doesn't make much sense, and there's no way that DataExchange could do it as efficiently as the OS does.

So, we just don't support non-blocking I/O. If you set the FDs to non-blocking, data will be dropped when the OS buffers are full. You might get lucky and lose only whole messages (instead of partial messages). But if you don't, PBIO will eventually blow up.

Your best bet is to stick with blocking I/O and figure out why you're writing faster than you're reading. Solve that and you won't block.


This page is maintained by Greg Eisenhauer
Last Modified Jun 23, 1999