Thursday, March 13, 2008

Create a FME Mapping file to convert column aligned text (CAT) to Access

There have been occasions when I needed to bulk load fixed width or Column Aligned Text (CAT) format files into a table in Microsoft Access. Access has a great Import Wizard that allows me to interactively define the fixed width columns before importing to a table. However I needed a non-graphical batch tool to do the same job. So I decided to use FME for the job. I explored the FME Workbench for a way to define the source CAT dataset but it was nowhere to be found. There is the Comma Separated Value (CSV) format but my dataset was not formatted that way. I did some research on Google and could not find a suitable method within the Workbench. So I had to roll up my sleeves and tried to write a FME Mapping file that did the job. So here are the steps that I took.


Step 1 - Create the FME Mapping File

Create an empty text file called cat2mdb.fme with your favorite text editor. Add the following comments and mapping statements to the file.

# ----------------------------------------------------------------------
# Title: CAT.TXT to MDB Mapping file.
# Description:
# This mapping file sets up the FME to translate
# TXT data from a column aligned text format file into
# an Access MDB database file while filtering out non-valid
# record lines.
# ------------------------------------------------------------------------
READER_TYPE TABLE
READER_KEYWORD TABLE_IN

WRITER_TYPE DATABASE
WRITER_KEYWORD DATABASE


The READER_TYPE line above tells FME to read tabular or TABLE format data. The WRITER_TYPE line tells FME that we want to write out to a database.


# ------------------------------------------------------------------------
# GUI lines are set up to prompt for the needed parameters.

GUI DIRNAME SourceDir Source Directory:
GUI TEXT SourceFile Source Dataset FileName:

TABLE_IN_DATASET "$(SourceDir)"

# First get the dataset for the database. Its an MDB file
DEFAULT_MACRO DestDataset
INCLUDE [ if {"$(DestDataset)" != ""} { \
puts {DEFAULT_MACRO DestDataset_DATABASE $(DestDataset)} \
} ]
DEFAULT_MACRO DestDataset_DATABASE
GUI FILENAME DestDataset_DATABASE Access_Database_Files(*.mdb)|*.mdb|All_files(*.*)|*.* Destination Access Database File:


DATABASE_SERVER_TYPE MDB

DATABASE_IMMEDIATE_WRITE yes

# For database compression
DEFAULT_MACRO _COMPRESS_AT_END_DATABASE yes
GUI CHOICE _COMPRESS_AT_END_DATABASE yes%no Compress Database When Done:

DATABASE_COMPRESS_AT_END $(_COMPRESS_AT_END_DATABASE)

DATABASE_DATASET "$(DestDataset_DATABASE)"

LOG_FILENAME "cat2mdb.log"
LOG_APPEND NO


GUI TITLE Translate CAT.TXT to Access MDB

FME_DEBUG UNCORRELATED UNGROUPED


The above lines define some macros which are used by the FME Universal Translator graphical user interface to get the necessary input and output parameters. The macro DATABASE_COMPRESS_AT_END is set to 'yes' in the example above. The figure below shows how the macros look like when the mapping file is run from the Translator.
Now we have to add in the *_DEF statements to define the file formats for the input CAT dataset and the output Access dataset. For example, we may have a sample CAT file that looks like the few lines below.

* This is a comment line
S0000000105981116893620091001XXX
S0000000216897716897720091001XXX
S0000000516965916862220081001XXX

The first 9 characters make up an ID number. The next 6 characters starting from column 10 define the OLDPOSCODE field. The NEWPOSCODE field is 6 characters long starting from the 16th column. The next 8 characters starting from column 22 make up the REGDATE field. Next is a 1 character REGCODE field. Finally the last 3 characters define the AREA field. The correct input TABLE definition statement in the mapping file would therefore be:

#--------------------------------------------------------------------------
# Define the Input CAT file
TABLE_IN_DEF TextFile CAT \
$(SourceFile) \
ID String(9,1) \
OLDPOSCODE String(6,10) \
NEWPOSCODE String(6,16) \
REGDATE Integer(8,22,0) \
REGCODE String(1,30) \
AREA String(3,31)

We want to bulk load all the input data into the output Access table. The output DATABASE_DEF table definition should be almost identical as shown below. If you do not want to append to the Access table then add in the keyword SQL_DROP_TABLE and set it to "yes"; this will tell FME to drop and recreate the table.

#--------------------------------------------------------------------------
# Define the Output Access table
DATABASE_DEF MYTABLE \
SQL_DROP_TABLE yes \
ID char(9) \
OLDPOSCODE char(6) \
NEWPOSCODE char(6) \
REGDATE integer \
REGCODE char(1) \
AREA char(3)

Since we do not add, delete, or change any of the input data, we can transfer all of the data to the output table as is. This is done by adding the following lines in the mapping file.

# -------------------------------------------------------------------------

TABLE_IN *
DATABASE *

We have now created a basic mapping file that can be used to bulk load the CAT dataset to an Access database without any data validation. The next step is to add the statements to perform the data checking.

Step 2 - Adding the Checking


There may be comment lines or blank lines in the CAT file which you may not want to load. Comment lines usually have a comment character like '#' or '!' in the first column. To filter out these lines, you can use the TestFactory and define a TCL regular expression to test for the conditions. In our example sample data, if the first column has the '*' character, then the line is a comment line and should be ignored.

# -------------------------------------------------------------------------

FACTORY_DEF * TestFactory \
FACTORY_NAME "Validate Line" \
INPUT FEATURE_TYPE * \
TEST @TCL("regexp {^\*|^$} &ID") != 1 \
OUTPUT PASSED FEATURE_TYPE *

# -------------------------------------------------------------------------

FACTORY_DEF * TeeFactory \
FACTORY_NAME "CAT.TXT -> MYTABLE Correlator" \
INPUT FEATURE_TYPE * \
OUTPUT FEATURE_TYPE MYTABLE

The first factory is named as "Validate Line" with the FACTORY_NAME keyword. The INPUT FEATURE_TYPE line with the '*' allows the factory to read all lines from the CAT file. For each ID field of the input lines, the factory will test it against the TCL regular expression "{^\*|^$)". In plain English, the regular expression will look for the existence of the '*' character or the newline character '$' in the first column. If either of these is found, then the entire TEST TCL expression will evaluate to the value of 1 and the line record will fail the test and the factory will not pass the record down the pipeline to the second factory.

Step 3 - Run the Translation

Once the mapping file has been created, you can run it through the FME Universal Translator interface or you can run it from the DOS Command Prompt. An example translation run is shown below.

C:\> fme cat2mdb.fme --SourceDir c:\MyFolder --SourceFile cat.txt --DestDataset output.mdb LOG_FILENAME cat2mdb.log


If all goes well, you'd find a new table called MyTable in the destination Access database output.mdb. The sample files can be downloaded here:

cat2mdb.fme and cat.txt.


No comments:

Post a Comment