Case Study: Complex Data File

Once you see the complexity of this data file, it will be easy to understand why other products were unable to read it. Luckily, most files are not this complex, but it often doesn’t take much to make a file inaccessible to most tools.

  • Physically, the file is in IBM variable-length format and is 7 GB in size. It is updated nightly in batch by a very complex legacy PL/1 application.
  • The file is blocked at a fixed 8,000 bytes, with records being spanned across blocks.
  • There is one record for each account, with each record containing a variety of different segments, each describing specific activities or facts.
  • Each record contains a 50-byte header, of which 14 bytes are reserved for a bit array (14 × 8 = 112 bits) identifying which of 112 possible segments are included in this record.
  • The segments themselves cannot be identified by their content. The only means of identification is to rely on the fact that they are stored sequentially in the record, and that the next data relates to the next bit in the array that is set.
  • The original designer likely thought that 112 would be more segment types than would ever be needed, but of course this assumption was wrong. At some point in the past all 112 segment types had been used and more were needed. This was addressed by making one of the later segments an additional 10-byte array of bits, which logically extends the first and identifies the presence of an additional 80 possible segment types. Needless to say, this extension segment is only present if one of the 80 new segment types is included in the record.
  • The lengths of the segments themselves follow no particular pattern. Many are of a fixed length specific to that segment type, but the length isn’t included in the file. It has to be discerned from the PL/1 copybooks.
  • Other segment types are variable length. These at least follow a standard: the first two bytes are a binary segment length. Most variable-length segments are actually instances of a repeating block of data of the same type, where the number of occurrences must be inferred from the total segment length divided by the length of each element (again, only available from the copybooks).
  • Taken as a whole, across all the segment types in use, there are over 9,000 fields. Each is stored in the densest format practical. For example, all dates are stored as two-byte binary values, counting the number of days since April 1, 1940.
  • Many of the individual segment types include codes to specify the nature of the particular transaction(s). For example, a segment holding cash adjustments would include a code indicating the nature of the adjustment. Of course, the sign of the adjustment amount is specific to each adjustment code, with the “normal” transaction being positive.
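
To make the layout above concrete, here is a minimal Python sketch of how a single record might be walked. It is an illustration only, not the method used by any actual tool. Several details are assumptions because the description does not state them: bits are numbered most-significant-bit first within each byte, binary values are unsigned and big-endian, a variable-length segment's two-byte length counts the length prefix itself, day 0 of the date format is April 1, 1940, and the segments announced by the extension array follow all of the original ones. The names fixed_lengths and ext_segment are hypothetical stand-ins for information that would come from the PL/1 copybooks.

```python
from __future__ import annotations

import struct
from datetime import date, timedelta

EPOCH = date(1940, 4, 1)  # base date from the text; day 0 == EPOCH is assumed
FIRST_EXTENDED = 112      # extended segment numbers start after the first 112


def decode_date(raw: bytes) -> date:
    """Decode a two-byte day count (assumed unsigned, big-endian)."""
    (days,) = struct.unpack(">H", raw)
    return EPOCH + timedelta(days=days)


def set_bits(bit_array: bytes, base: int = 0) -> list[int]:
    """Return the segment numbers whose bits are set (assumed MSB-first)."""
    present = []
    for byte_index, byte in enumerate(bit_array):
        for bit in range(8):
            if byte & (0x80 >> bit):
                present.append(base + byte_index * 8 + bit)
    return present


def walk_segments(bit_array, body, fixed_lengths, ext_segment=None):
    """Yield (segment_number, payload) for each segment stored in `body`.

    bit_array     -- the 14-byte presence array taken from the record header
    body          -- the record bytes that follow the 50-byte header
    fixed_lengths -- hypothetical {segment_number: length} from the copybooks;
                     any segment not listed is treated as variable length
    ext_segment   -- hypothetical number of the segment holding the 10-byte
                     extension bit array, or None if not applicable
    """
    pending = set_bits(bit_array)
    offset = 0
    i = 0
    while i < len(pending):
        seg = pending[i]
        if seg in fixed_lengths:
            length = fixed_lengths[seg]
            payload = body[offset:offset + length]
        else:
            # Variable length: two-byte length prefix, assumed to include itself.
            (length,) = struct.unpack_from(">H", body, offset)
            payload = body[offset + 2:offset + length]
        offset += length
        if seg == ext_segment:
            # The extension array announces further segments (112 and up),
            # assumed here to be stored after all of the original ones.
            pending.extend(set_bits(payload, base=FIRST_EXTENDED))
        yield seg, payload
        i += 1


def split_occurrences(payload: bytes, element_length: int) -> list[bytes]:
    """Split a repeating segment into its elements; the count is inferred."""
    if len(payload) % element_length:
        raise ValueError("payload is not a whole number of elements")
    return [payload[i:i + element_length]
            for i in range(0, len(payload), element_length)]
```

Note that nothing in the record itself says where one segment ends and the next begins: every offset depends on correctly decoding all of the segments before it, which is why a single wrong length from the copybooks would make the rest of the record unreadable.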

It is estimated that if the source file were flattened using conventional techniques, the resulting file would be in the terabytes.