Case Study: Complex Data File
Once you see the complexity of this data file, it will be easy to
understand why other products were unable to read it. Luckily, most
files are not this complex, but it often doesn’t take much
to make a file inaccessible to most tools.
- Physically, the file is in IBM variable-length
record format and is 7 GB in size. It is updated
nightly in batch by a very complex
legacy PL/1 application.
- The file is blocked at a fixed
8,000 bytes, with records being spanned
across blocks.
- There is one record for each account,
with each record containing a variety
of different segments, each describing
specific activities or facts.
- Each record contains a 50-byte
header, of which 14 bytes are reserved
for a bit array (14 x 8 = 112 bits)
identifying which of 112 possible
segments are included in this record.
- The segments themselves cannot
be identified by their content. The
only means of identification is to
rely on the fact that they are stored
sequentially in the record, so that
each successive segment corresponds to
the next bit in the array that is set.
- The original designer likely thought
that 112 would be more segment types
than they would ever need, but of
course this assumption was wrong.
At some point in the past they had
used all 112 segment types and needed
more. They addressed this by making
one of the later segments an additional
10-byte array of bits, which logically
extends the first and identifies
the presence of an additional 80
possible segment types. Needless
to say, this extension array is only
present if one of the 80 new segment
types is included in the record.
- The lengths of the segments themselves
follow no particular pattern. Many
are of a fixed length specific to
that segment type, but the length
isn’t included in the file.
It has to be determined from the
PL/1 copybooks.
- Other segment types are variable
length. At least for these, a standard
is followed: the first two bytes
are a binary segment length. Most
variable-length segments are actually
instances of a repeating block of
data of the same type, where the
number of occurrences must be inferred
from the total segment length divided
by the length of each element (again,
only available from the copybooks).
- Taken as a whole, across all the
segment types in use, there are over
9,000 fields. Each is stored in the densest
format practical. For example, all
dates are stored as two-byte binary
values, counting the number of days
since April 1, 1940.
- Many of the individual segment
types include codes to specify the
nature of the particular transaction(s).
For example, a segment holding cash
adjustments would include a code
indicating the nature of the adjustment.
Of course, the sign of the adjustment
amount is specific to each adjustment
code, with the “normal” transaction
being positive.
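
As a rough illustration of the record layout described in the list above, the
following Python sketch walks one record: it reads the 14-byte bit array from
the header, then steps through the segments in bit order. It is not the
application's actual code. The position of the bit array within the header,
the most-significant-bit-first ordering, the segment number that carries the
extension array, the convention that a variable segment's length counts its
own two bytes, and the SEGMENT_LENGTHS table (a stand-in for the copybooks)
are all assumptions made for the example.

```python
import struct

HEADER_LEN = 50   # from the description: 50-byte header
BITMAP_LEN = 14   # from the description: 14 bytes = 112 presence bits
# ASSUMPTION: the bit array occupies the last 14 bytes of the header.
BITMAP_OFFSET = HEADER_LEN - BITMAP_LEN
# ASSUMPTION: segment 112 is the one holding the 10-byte extension array.
EXT_SEGMENT = 112

# HYPOTHETICAL stand-in for the copybooks: segment number -> fixed length
# in bytes, or None when the segment is variable length.
SEGMENT_LENGTHS = {1: 4, 3: None, 112: 10, 113: 2}


def set_bits(bitmap: bytes) -> list[int]:
    """Return 1-based positions of the set bits (ASSUMPTION: MSB first)."""
    return [
        byte_index * 8 + bit + 1
        for byte_index, byte in enumerate(bitmap)
        for bit in range(8)
        if byte & (0x80 >> bit)
    ]


def split_segments(record: bytes) -> dict[int, bytes]:
    """Map each segment number present in the record to its raw bytes."""
    pending = set_bits(record[BITMAP_OFFSET:HEADER_LEN])
    segments: dict[int, bytes] = {}
    pos = HEADER_LEN
    i = 0
    while i < len(pending):
        seg_no = pending[i]
        seg_len = SEGMENT_LENGTHS[seg_no]
        if seg_len is None:
            # Variable length: the first two bytes are a binary length,
            # read here as big-endian. ASSUMPTION: it counts itself.
            (seg_len,) = struct.unpack_from(">H", record, pos)
        data = record[pos:pos + seg_len]
        segments[seg_no] = data
        pos += seg_len
        if seg_no == EXT_SEGMENT:
            # The extension array adds segment types 113..192.
            # ASSUMPTION: those segments follow all earlier ones.
            pending += [BITMAP_LEN * 8 + n for n in set_bits(data)]
        i += 1
    return segments


# Hand-built sample record: segments 1, 3, 112 and (via the extension) 113.
bitmap = bytes([0xA0] + [0] * 12 + [0x01])
sample = (
    bytes(BITMAP_OFFSET) + bitmap
    + b"AAAA"                 # segment 1, fixed 4 bytes
    + b"\x00\x05xyz"          # segment 3, variable, length 5
    + b"\x80" + bytes(9)      # segment 112, extension array, bit 1 set
    + b"ZZ"                   # segment 113, fixed 2 bytes
)
print(sorted(split_segments(sample)))  # [1, 3, 112, 113]
```

Note that nothing in the record says where a fixed-length segment ends. A
single wrong entry in the length table shifts every segment after it, which
is part of what makes a file like this so hard for generic tools to read.
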
It is estimated that if the source file were flattened using conventional
techniques, the resulting file would be terabytes in size.
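
Two of the field-level conventions from the list above, the day-count dates
and the repeating blocks inside variable-length segments, can be sketched in
Python as follows. This is a minimal sketch under stated assumptions, not the
legacy application's logic: it assumes the two-byte values are unsigned and
big-endian, that a day count of 0 means April 1, 1940 itself, and that a
segment's two-byte length includes the length bytes. The function names and
the element length passed in are hypothetical; in practice the element
length would come from the copybooks.

```python
import struct
from datetime import date, timedelta

EPOCH = date(1940, 4, 1)  # from the description


def decode_date(raw: bytes) -> date:
    """Convert a two-byte day count into a calendar date.

    ASSUMPTIONS: unsigned, big-endian, and a count of 0 means the epoch.
    """
    (days,) = struct.unpack(">H", raw)
    return EPOCH + timedelta(days=days)


def split_occurrences(segment: bytes, element_len: int) -> list[bytes]:
    """Split a variable-length segment into its repeated elements.

    element_len would come from the copybooks. ASSUMPTION: the two-byte
    length prefix counts itself, so the payload is seg_len - 2 bytes.
    """
    (seg_len,) = struct.unpack_from(">H", segment, 0)
    body = segment[2:seg_len]
    count, leftover = divmod(len(body), element_len)
    if leftover:
        raise ValueError("segment length is not a multiple of element length")
    return [body[i * element_len:(i + 1) * element_len] for i in range(count)]


print(decode_date(b"\x00\x1e"))                 # 1940-05-01
print(split_occurrences(b"\x00\x08abcdef", 3))  # [b'abc', b'def']
```

Read as unsigned, a two-byte count covers 65,535 days, or roughly 179 years
from the epoch. If the field were signed, the range would be about half that.
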