Data Migration Done Right

If you've ever had to migrate large amounts of data from one schema to another with non-trivial structural differences in entity definitions or relations, then you'll know that it's no simple task and the effort required should not be underestimated.

Data migration strategies can take many forms, from simplistic manual exporting/importing (e.g. you write a few SQL expressions to dump data out to CSV, then use a loader to import it into your new schema) to more sophisticated custom migration programs (e.g. a custom JDBC Java application with logic to query data from the legacy schema and intelligently insert it into the new schema).
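To make the second option concrete, here's roughly what such a hand-rolled JDBC migration program tends to look like. This is only a minimal sketch; the connection strings, tables and columns are made-up examples:

import java.sql.*;

public class CustomerMigration {
    public static void main(String[] args) throws SQLException {
        // hypothetical legacy and target databases
        try (Connection legacy = DriverManager.getConnection("jdbc:postgresql://oldhost/legacy", "user", "pass");
             Connection target = DriverManager.getConnection("jdbc:postgresql://newhost/newdb", "user", "pass");
             Statement read = legacy.createStatement();
             ResultSet rs = read.executeQuery("SELECT cust_id, full_name FROM customers");
             PreparedStatement write = target.prepareStatement(
                 "INSERT INTO customer (legacy_id, name) VALUES (?, ?)")) {
            while (rs.next()) {
                // every bit of splitting, massaging and cleansing logic ends up hand-coded here
                write.setLong(1, rs.getLong("cust_id"));
                write.setString(2, rs.getString("full_name").trim());
                write.executeUpdate();
            }
        }
    }
}

Multiply that by every entity and relationship in the system and the effort adds up very quickly.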

For complex projects, the first solution is rarely feasible. Migration often requires careful consideration of how keys and relationships should be sequenced, split, massaged and re-joined, as well as various complex data cleansing operations. A CSV approach often lacks the flexibility for such advanced cases. The latter solution, on the other hand, i.e. writing a complete application specific to your migration needs, will give you all the flexibility you need but will also consume more time and effort than necessary.

Luckily, data migration has been a common enough hurdle across projects to draw significant attention to generic solutions, which has resulted in the ETL (Extract, Transform, Load) paradigm. You can read more about it elsewhere, but basically ETL is the foundation for a number of frameworks and tools that make data migration a whole lot easier, and there are plenty of them out there.

In the Open-Source/Freeware space, I've recently come across Scriptella and Talend. Both are free, Java-based ETL frameworks that allow you to do advanced transformations and migrations, but the two tools take very different approaches:
  • Scriptella is a minimalistic ETL scripting engine which simply takes XML ETL instructions as input and blindly executes them. To use it, you write your ETL scripts by hand against its APIs, using the provided modules, syntax and structure to make things work.
  • Talend on the other hand goes one step further and provides a full visual Eclipse-based IDE to aid you in creating the ETL instructions graphically through simple drag-and-drop operations and through visual mappers, whilst still providing all the flexibility to write and invoke custom Java code at any level and any time.
For my project, I chose to go with Talend, so I won't comment on Scriptella any further. If you're thinking of using Talend for your migration projects as well, I highly recommend watching the online screencasts first and then having a play to get familiar with the basics; otherwise the whole thing may seem a little overwhelming and put you off exploring it altogether. After spending a few good days with it, I'm convinced it's the best (free) tool for migration that I've ever used, and I'll never consider another ad-hoc migration approach unless the system/process is extremely trivial.

The tool does have its quirks, however. A few lessons learnt from my experience so far are outlined below:

Job Sequencing: I've found it useful to create small fine-grained jobs, which are usually responsible for migrating data into a single entity or a single group of closely related entities. Trying to do too much in one job (or job file I should say) makes things messy, harder to test and harder to troubleshoot. Instead I create a top-level 'Run_All' job file in which I drag-and-drop all my little sub-jobs and sequence them in a snake-like fashion across the canvas. Running this job will thus perform the entire migration process.

tMap Component: I use this in virtually every single job, whether I need it or not. I don't know if it has performance implications (my jobs still run fine) but the component gives great flexibility and allows you to easily structure your tasks in a common format.

tMap Joins: Be careful with tMap joins, however. You have the option of enabling an Inner Join, but you can also select 'All Matches', 'First Match', 'Last Match' and 'Unique Match'. Unique Match is selected by default, which can throw you off if you're thinking in standard SQL. 'All Matches' with Inner Join seems to return what I'd expect from writing the equivalent SQL query and executing it natively against the backend.

Custom Functions: It's very easy to write custom Java static functions in a class and call upon these from anywhere in your tMap nodes to do fancy things with certain data fields, etc. I wrote a very simple function that lets me insert order numbers incremented by 10 into whatever field I want. There's probably an out-of-the-box way to do this, but it was quicker to write this up than trawl through forums.

// Requires java.util.Map and java.util.HashMap to be imported in the routine class.
// Keeps one counter per named sequence; each call advances that counter by 'step' and returns it.
// The ignoreMe parameter is not used by the logic itself.
private static final Map<String, Integer> seqMap = new HashMap<String, Integer>();

public static int inc(String seq, int step, int ignoreMe) {
    synchronized(seqMap) {
        Integer s = seqMap.get(seq);
        if(s == null) s = 0;
        s = s + step;
        seqMap.put(seq, s);
        return s;
    }
}
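In a tMap output expression you then just call the routine for whatever target field you like, for example (assuming the function lives in a user routine class I've named MigrationRoutines here; the third argument is ignored and simply ties the call to the incoming row):

MigrationRoutines.inc("order_position", 10, row1.id)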

Use the Docs: Talend comes with two very good PDF documents, a user guide and a component guide. After watching the screencasts and getting familiar with the general workflow, use the guides as a reference point for any basic functionality questions, etc.

Self-Contained Project: It helps to be able to re-execute the entire migration project, or a subset thereof, as easily as possible, i.e. design so that you can click the Run button as often as needed. This is especially useful when debugging/testing the migration jobs. To do this you need a way to reset the database to some initial state. In my case I chose to drop-cascade all tables and re-create them from scratch via a custom SQL DDL script. Executing arbitrary SQL scripts is not immediately obvious or as intuitive as it should be in Talend. The way I do it is to import my DDL script as a CSV schema in Talend, with one command per line, all in one column (set the delimiter and line-break characters accordingly to get it into this format). Then I design a simple job with the CSV file as input, linking straight to a tRow output. In the tRow Basic Settings/Query, enter (QueryString is the name of the column you defined in the CSV schema):
""+row1.QueryString+""
I place this job at the beginning of my execution chain so that it refreshes the database every time I run, for consistent results. Different migration scenarios may require different initialization strategies.
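If you'd rather skip the CSV detour altogether, the same reset can be driven from a tJava component using the connection Talend exposes through globalMap, much like the clean-up job further down. This is only a rough sketch under a couple of assumptions: a tPostgresqlConnection_1 component is already open, the script path is one you substitute yourself, and splitting on semicolons is good enough for your DDL:

java.sql.Connection con = (java.sql.Connection)globalMap.get("conn_" + "tPostgresqlConnection_1");

// read the whole reset script into one string (hypothetical path)
java.io.BufferedReader in = new java.io.BufferedReader(new java.io.FileReader("/path/to/reset_schema.sql"));
StringBuilder ddl = new StringBuilder();
String line;
while((line = in.readLine()) != null) {
    ddl.append(line).append("\n");
}
in.close();

// naive split on semicolons: fine for plain DDL, not for scripts containing function bodies
for(String stmt : ddl.toString().split(";")) {
    if(stmt.trim().length() > 0) {
        con.createStatement().execute(stmt);
    }
}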

Use the Debugger: You will undoubtedly get a stack of NullPointerExceptions, ClassCastExceptions, etc. at different stages during your script development. Use the debugger when this happens and click on the first line of the error message in the stack trace to be taken to the offending line of generated Java code. This should give you a very strong hint of exactly which map and which field the problem is occurring in. I believe some NullPointerExceptions are actually product bugs, especially when using the intermediary Var attributes. Just check your logic first and rule out data nulls, etc., then try bypassing the Vars and entering the expressions straight into the destination maps. This has worked for me many times.

Clean-Up Job: My current migration project involves creating a new PostgreSQL schema and porting data from a legacy system. At the end of the Talend migration process, the sequences are all out of sync since lots of artificial ids are copied across. It's thus necessary to reset all sequences to point to the next available id in each table to avoid collisions. This can be done easily by creating a 'Clean_Up' job, placing a tPostgresqlConnection_1 component and sending an OnSubjobOk trigger to a tJava_1 component, then copy-pasting the following code inside the Java component (the java.sql classes used below also need to be available to the component):

Connection con = (Connection)globalMap.get("conn_" + "tPostgresqlConnection_1");
String schema = (String)globalMap.get("schema_" + "tPostgresqlConnection_1");

con.setAutoCommit(true);
DatabaseMetaData meta = con.getMetaData();
String[] types = {"TABLE"};
ResultSet tables = meta.getTables(null, schema, null, types);
while(tables.next()) {
    String tname = tables.getString("TABLE_NAME");

    // let's do a reindex while we're at it
    con.createStatement().execute("REINDEX TABLE " + schema + ".\"" + tname + "\"");

    // find the primary key column(s) of this table
    ResultSet pk = meta.getPrimaryKeys(null, schema, tname);
    while(pk.next()) {
        String pkname = pk.getString("COLUMN_NAME");

        // look for a sequence following the <table>_<column>_seq naming convention
        String[] seqtypes = {"SEQUENCE"};
        ResultSet seq = meta.getTables(null, schema, tname + "_" + pkname + "%", seqtypes);
        while(seq.next()) {
            String seqname = seq.getString("TABLE_NAME");

            // we hit the sequence, let's reset its value to the next available id
            String sql = "SELECT SETVAL('" + schema + ".\"" + seqname + "\"', (SELECT MAX(\"" + pkname + "\") FROM " + schema + ".\"" + tname + "\") + 1)";
            con.createStatement().execute(sql); // throws an exception if the reset fails
            System.err.println("Sequence " + seqname + " reset ok.");
        }
    }
}

// let's do a vacuum while we're at it
con.createStatement().execute("VACUUM FULL ANALYZE;");
This dynamically retrieves all table names and all primary keys for each table, then looks for a sequence that follows the standard naming convention (Table_Name_Key_Column_Name_seq). If found, the script resets the sequence to the next available id for that table. The script also takes the liberty of performing a REINDEX on each table and a VACUUM FULL ANALYZE at the end.

tNormalize Component: This thing is priceless if you need to pull delimited data out of one field of a source row and break it up into multiple rows for insertion into your new schema. I had several 'ids' fields with space-delimited lists of key identifiers that really should have been stored in separate associative relations; the data is close to useless otherwise.
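For illustration, here is the kind of reshaping tNormalize performs, written as plain Java outside of Talend (the field, table and value names are made up):

// one legacy row: order_id = 42, product_ids = "101 205 309"
// normalizing on the product_ids column with a space as the item separator
// yields one row per id, ready for an associative table
String productIds = "101 205 309";
for(String id : productIds.split(" ")) {
    System.out.println("INSERT INTO order_product (order_id, product_id) VALUES (42, " + id + ")");
}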

Wrapping Up: As a final note, data migration should always be taken as a serious task when drafting up project plans and estimates. Allow plenty of time for it as you'll need to fully understand the inner workings (and hacks) of the legacy system and may need to reformat your target schema to account for things you've left out, etc. Investing some time in learning proper usage of a good ETL tool will pay off in the long term. Your data migration strategy should be easily re-executable with minimal to no human interaction.

Comments

  1. Hi Richard,

    your post helped a great deal in my effort to extend the tRowGenerator component to do a random lookup in a reference table and pick values for a generated column.
    Many thanks and may you be blessed for generously sharing your knowledge.
    Peace
    G.
