DataStage Sequential File Stages (Import and Export) Performance Tuning

Improving Sequential File Performance

If the source file is fixed-width or delimited, the Readers Per Node option can be used to read a single input file in parallel at evenly-spaced offsets. Note that when a file is read in this manner, input row order is not maintained.

If the input sequential file cannot be read in parallel, performance can still be improved by separating the file I/O from the column parsing operation. To accomplish this, define a single large string column for the non-parallel Sequential File read, then pass this to a Column Import stage to parse the file in parallel. The formatting and column properties of the Column Import stage must match those of the Sequential File stage.

On heavily-loaded file servers or some RAID/SAN array configurations, the environment variables $APT_IMPORT_BUFFER_SIZE and $APT_EXPORT_BUFFER_SIZE can be used to improve I/O performance. These settings specify the size of the read (import) and write (export) buffers in Kbytes, with a default of 128 (128K). Increasing these values may improve performance.

Finally, in some disk array configurations, setting the environment variable $APT_CONSISTENT_BUFFERIO_SIZE to a value equal to the read/write size in bytes can significantly improve performance of Sequential File operations.
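These variables are typically defined project-wide (for example, in the dsenv file) or as job-level environment variable parameters. A minimal sketch follows; the values shown are purely illustrative assumptions and should be tuned for the specific file server or disk array:

    # Illustrative values only -- tune for the underlying storage.
    # 256 KB import/export buffers instead of the 128 KB default.
    export APT_IMPORT_BUFFER_SIZE=256
    export APT_EXPORT_BUFFER_SIZE=256
    # Match the array's read/write size in bytes (1 MB assumed here).
    export APT_CONSISTENT_BUFFERIO_SIZE=1048576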

Partitioning Sequential File Reads

Care must be taken to choose the appropriate partitioning method when reading from a Sequential File stage:

Don’t read from Sequential File using SAME partitioning! Unless more than one source file is specified, SAME will read the entire file into a single partition, making the entire downstream flow run sequentially (unless it is later repartitioned).

When multiple files are read by a single Sequential File stage (using multiple files, or by using a File Pattern), each file’s data is read into a separate partition. It is important to use ROUND-ROBIN partitioning (or other partitioning appropriate to downstream components) to evenly distribute the data in the flow.

Sequential File (Export) Buffering

By default, the Sequential File (export operator) stage buffers its writes to optimize performance. When a job completes successfully, the buffers are always flushed to disk. The environment variable $APT_EXPORT_FLUSH_COUNT allows the job developer to specify how frequently (in number of rows) the Sequential File stage flushes its internal buffer on writes. Setting this value to a low number (such as 1) is useful for realtime applications, but there is a small performance penalty associated with the increased I/O.
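For example, a near-realtime job could flush after every row. A minimal sketch of the setting (the value of 1 is illustrative and trades throughput for latency):

    # Flush the Sequential File write buffer after every row.
    export APT_EXPORT_FLUSH_COUNT=1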

Reading from and Writing to Fixed-Length Files

Particular attention must be paid when processing fixed-length fields using the Sequential File stage:

If the incoming columns are variable-length data types (e.g. Integer, Decimal, Varchar), the field width column property must be set to match the fixed width of the input column. Double-click on the column number in the grid dialog to set this column property.
 

If a field is nullable, you must define the null field value and length in the Nullable section of the column property. Double-click on the column number in the grid dialog to set these properties.
 

When writing fixed-length files from variable-length fields (e.g. Integer, Decimal, Varchar), the field width and pad string column properties must be set to match the fixed width of the output column. Double-click on the column number in the grid dialog to set these column properties.

To display each field value, use the print_field import property. All import and export properties are listed in Chapter 25, Import/Export Properties, of the Orchestrate 7.0 Operators Reference.

Reading Bounded-Length VARCHAR Columns

Care must be taken when reading delimited, bounded-length Varchar columns (Varchars with the length option set). By default, if the source file has fields with values longer than the maximum Varchar length, these extra characters will be silently truncated.

Starting with v7.01, the environment variable $APT_IMPORT_REJECT_STRING_FIELD_OVERRUNS will direct DataStage to reject records with strings longer than their declared maximum column length.
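A minimal sketch of enabling this behaviour (the value of 1 is an assumption; many APT switch variables only need to be defined to take effect, so verify against the documentation for your release):

    # Reject records whose string fields exceed the declared Varchar length
    # instead of silently truncating them.
    export APT_IMPORT_REJECT_STRING_FIELD_OVERRUNS=1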
