We wish to add more Change Data Capture (CDC) to all our various data sources (JDE, Infosys, AWE, Oracle DB, etc.) to allow for robust, controllable and speedy ETL to our BI layers.
The chosen product for BI ETL is Talend. For Microsoft SQL, Talend currently only supports an intrusive, SQL trigger based CDC.
SQL triggers can cause significant added performance and maintenance costs, and can even corrupt data in some scenarios.
On the other hand, when using Talend with Microsoft SQL, SQL triggers are currently the only available option for near real-time ETL.
See the decision matrix to help decide what is best for each use case.
The primary purpose of this slideset is to inform stakeholders on the various options available for CDC, specifically as supported by Talend at this time (2016).
Secondary objective is to clear up any myths and misconceptions about CDC as it pertains to Microsoft SQL servers.
As with many of the more esoteric processes surrounding data, such as Security, ETL, Data Replication and Auditing, the subject of Change Data Capture (and its close relative, Changing Dimensions) is either unknown to most IT professionals, or only marginally understood.
The author of this presentation has extensive experience in designing and implementing a wide variety of complex CDC processes since the 1980s, and is not playing favourites in any way. So rest assured, you are in good hands.
Based on Talend as of December 2016.
Link to Talend Manuals https://help.talend.com/
Talend is just one product in a group of high-level ETL tools that are often used to allow laypersons and ETL neophytes to build a still relatively reliable and usable ETL system with minimum effort. (And that is the whole raison d'être, really.)
Other examples of high-level ETL tools that we have used in the past or are still currently using are Informatica and Magic XPI.
Tools such as Talend significantly abstract and - from a layperson’s perspective - simplify many of the highly complex ETL tasks.
ETL tool vendors have to provide "one size fits all" generic but highly flexible solutions. Because abstraction coupled with flexibility always has a cost, ETL tools are not as performant as static but expertly hand-coded ETL. To make up for the lost performance, ETL tools tend to require a substantial infrastructure boost.
Due to the heavy use of abstraction layers in high-level ETL tools, much more resources have to be spent in other areas, such as:
Implementing an ETL tool is thus always more expensive than having good ETL specialists. On the plus side, easily replaceable and cheap labour can be trained to use the tool, which helps guarantee business continuity.
As always, there is no free lunch - all you can choose is where to eat.
Similar to Magic XPI, Informatica and other high-level ETL tools, Talend offers a friendly, easy-to-use, mostly drag-and-drop visual interface.
With a bit of training, a business analyst who knows their data can quickly click together even complex ETL data flows, and simply publish for immediate use. This reduces (or eliminates!) otherwise long wait times on specialized IT staff resources.
At the same time, because everything is visual, Talend is also almost self-documenting - one look at the data flow and it becomes quite clear where the data comes from and where it ends up.
Side Note: more complex transforms may be a bit more complicated to view quickly, depending on how they were implemented. Very basic SQL knowledge may be required. But the logic is still much easier to understand than pure T-SQL or PL/SQL.
Scheduling when an ETL job needs to be run is also a mere matter of a few simple clicks.
There are many useful options, ranging from near real time, all the way up to scheduling a job to run only once a month (very handy for Month End or data cleanup processes, for example).
Talend also offers a very readable dashboard on all ETL jobs that are scheduled.
It is immediately obvious what jobs were successful or not.
Because everything is in one tool, the user can quickly pinpoint where the error occurred, and even - depending on the cause - immediately correct the issue.
Talend offers three main CDC options, depending on the database engine that the source data resides on.
We primarily run on Microsoft SQL Servers. For direct CDC on MS-SQL, Talend currently only offers the intrusive SQL trigger based CDC (see “Solutioning” further below for a non-intrusive workaround that could be used with Talend).
For Oracle v11 and earlier, Talend also supports the non-intrusive Redo/Archive Log CDC method. This is similar to Microsoft's TransLog-based CDC.
Oracle V12+ supports XStream over OCI, and Talend takes full advantage of that.
This suggested work method has an industry wide proven track record of ensuring that requirements are followed with little or no misunderstandings, and that quality assurance is built live and iteratively into the development itself.
You may recognize this as a variant of "Pair Programming", which was eventually codified in the 1990s in "Extreme Programming", and is now part of both "Agile" and "Lean" SDLC.
When designing and building (“coding”) an ETL process within Talend, it is highly recommended to intimately and directly include the “customer”.
Optimally the “customer” (for example the BA that requested the ETL) should sit physically side by side with the ETL developer, at least for the initial rough sketching out of the data flow.
Remote screen sharing is also acceptable, as long as both parties can stay uninterrupted.
After the initial design, the customer must be directly and live involved at each milestone. Several walk-through sessions of each ETL step are recommended ("code review"), to catch any issues.
Leveraging Talend based options
The first option for Microsoft SQL based data sources where modification of the source is not possible or advisable, is quite simple.
We would use Microsoft SQL Replication to publish the whole source database to a new database (all tables & indexes). SQL Repl is included with all our MS-SQL licenses, thus no extra software purchase required, and we already run it for Infosys.
Talend would use the replicated database as CDC and ETL “Atomic” source - triggers and custom tables can be created in the replicated database (this was tested).
The replicated copy can also include extra ETL related indexes and other additional ETL relevant objects that are not possible to add onto the source system!
Slight downside: MS-SQL replication requires an experienced DBA.
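As a rough illustration of what such a setup involves, the following T-SQL sketches publishing a source database via transactional replication. All names (SourceDB, BI_Atomic, CLNTLIVR, BISERVER, AtomicDB) are illustrative assumptions, and a real setup also needs a configured Distributor and a snapshot agent.

```sql
-- Sketch: publishing a source DB via transactional replication.
-- Run on the publisher; assumes a Distributor is already configured.
EXEC sp_replicationdboption @dbname = N'SourceDB',
     @optname = N'publish', @value = N'true';

EXEC sp_addpublication @publication = N'BI_Atomic',
     @status = N'active', @repl_freq = N'continuous';
-- (a snapshot agent would also be added via sp_addpublication_snapshot)

-- One sp_addarticle call per table to replicate.
EXEC sp_addarticle @publication = N'BI_Atomic',
     @article = N'CLNTLIVR', @source_object = N'CLNTLIVR';

-- Push the publication to the BI/"Atomic" server.
EXEC sp_addsubscription @publication = N'BI_Atomic',
     @subscriber = N'BISERVER', @destination_db = N'AtomicDB',
     @subscription_type = N'push';
```

Once the subscription is in sync, Talend's trigger-based CDC objects are created in AtomicDB, leaving the real source untouched.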
If using a replicated database is not acceptable, then the only remaining option for Microsoft SQL based databases is to allow Talend to create triggers and other CDC related objects directly on the source data.
Talend trigger based CDC requires the source tables and the CDC collection tables to be on the same server.
For MS-SQL sources specifically, you also need to set up separate schemata to hold the CDC collection in the same database as the source tables.
This is required by Talend to ensure the CDC triggers do not impact performance too much with cross-database or even cross-server calls.
(Note: For Pioneer we can use Oracle’s v11g REDO/ARCHIVE with Talend CDC!)
Default Talend CDC mode for MS-SQL
Does not require much additional space or databases (a little bit extra required for CDC collection data).
Does not require DBA assistance - Talend trained user can set up from within tool.
Talend more likely to fully support their default setup.
Triggers increase CPU usage, and will impact performance of source (locks held during trigger execution).
Increases maintenance needs of the source data, database and applications dependent on the data.
Increases change management costs and duration (including uninstall of CDC).
Extra CDC collection tables are required in the same database as the source tables - this will cause extra disk IO on the system and extra cache memory usage, which is already at a premium.
This is the default method of using Talend with MS-SQL servers, so we are just pointing out some of the requirements here.
How it works.
CDC => “Change Data Capture”.
CDC is used to monitor any insert, update, or delete of row and/or column data. The before and after state of the data is then acted on and often stored to a separate repository for further processing.
This saved CDC information can then be used for auditing or ETL purposes.
These days databases are the most common source that requires change data capture.
We mostly use SQL based database engines, but CDC can also be applied to what is termed “No-SQL” type data repositories.
Other CDC source formats that are still important are structured files, such as CSV, XML, JSON or even EXCEL type files.
CDC can also be applied to web based data sources, such as web services or even web pages.
Finally, and this is again becoming increasingly important in the security field, CDC is applied to raw data streams, either over the wire or wireless.
On the database side, the methods used are Transaction/Redo Logs, Message Queue/Service Bus, Hooks/Triggers, or even just plain data comparison.
As you see, CDC is by no means trivial.
Auditing: CDC is regularly used to track who made what changes to business data, in order to ensure that only valid changes are made. Auditing also comes into play when monitoring changes to sensitive data, such as pricing or salary data.
ETL: CDC also has a place in ETL, especially when DC (changing dimensions) need to be tracked for BI and other reporting purposes. Used correctly CDC can also reduce ETL IO stress, as only data that has actually changed needs to be exported from the data source.
We currently use CDC for both use cases.
CDC can be either “intrusive” or “non-Intrusive” to the system being monitored.
Intrusive CDC processes require that the source system be modified in some way, either one time or continuously. Modifications include adding special columns to all rows, or extra tables, or even adding extra code such as triggers, UDF, SPs or adding In-Process binaries. Example: Talend’s use of triggers for CDC is considered intrusive.
Non-Intrusive CDC processes on the other hand do not actively affect the structure or data of the source system. Example: Microsoft’s SQL engine built-in CDC is considered non-intrusive.
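For context, enabling Microsoft's built-in non-intrusive CDC takes only two stored procedure calls. The database and table names below are illustrative assumptions:

```sql
-- Sketch: enabling SQL Server's built-in (non-intrusive) CDC.
USE SourceDB;
EXEC sys.sp_cdc_enable_db;

-- Enable CDC on one table; SQL Server creates a change table and
-- reads changes from the transaction log (no triggers on the source).
EXEC sys.sp_cdc_enable_table
     @source_schema = N'dbo',
     @source_name   = N'CLNTLIVR',
     @role_name     = NULL;   -- NULL = no gating role required to read changes
```

Note that Talend does not consume this built-in CDC for MS-SQL; the sketch only shows what "non-intrusive" means in practice here.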
Both have their Pros and Cons, and both have their place in daily data ops, depending on the situation.
Built In: Some systems have CDC built right in (either Non-Intrusive or Intrusive):
On the application level (ie: an ERP system actively journals CDC itself);
Or on the data engine level (ie: Microsoft SQL Server CDC);
Or finally on the data transport layer itself.
Add-On: third-party modules that are Intrusive, as they require system modifications to various degrees, be it at the application, database or another layer.
External: Finally there are two types of external CDC systems, both are very Non-Intrusive:
Log or Journal file reading (Talend’s use of Oracle’s Redo Log, for example);
ETL that does CDC on the destination’s dirty layer (BICO, Portal, etc).
Eventually some client will require the CDC data collected. How the client is informed of the change depends strongly on the implementation.
Active Mode (also called "push" or "post" mode) is available if the CDC system either signals a client that there is data ready to review, or simply pushes the CDC collection directly to the client. This is still the less common mode and is considered real-time.
Passive Mode (also called "pull", "get" or "poll" mode) is the more common mode; it relies on the client finding out by itself whether there are new CDC collections available. Some vendors state their Passive Mode is real-time, but at best it can be considered "near" real-time, if at all.
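A Passive Mode poll against SQL Server's built-in CDC looks roughly like this. The capture instance name dbo_CLNTLIVR is an illustrative assumption:

```sql
-- Sketch: client-side "pull" of changes between two log sequence numbers.
DECLARE @from_lsn BINARY(10) = sys.fn_cdc_get_min_lsn('dbo_CLNTLIVR');
DECLARE @to_lsn   BINARY(10) = sys.fn_cdc_get_max_lsn();

-- 'all' returns every recorded insert/update/delete in the LSN range.
SELECT *
FROM   cdc.fn_cdc_get_all_changes_dbo_CLNTLIVR(@from_lsn, @to_lsn, N'all');
```

A real poller would persist @to_lsn as the watermark to use as @from_lsn on the next cycle - which is exactly why this mode is "near" real-time at best.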
Intrusive CDC is often implemented as an Add-On.
This is usually the case when there is no non-intrusive built-in CDC.
Or, in the case of Talend and Microsoft SQL server, when the client tools that rely on CDC do not support Microsoft’s built-in CDC.
The good news is, there are solutions available to implement intrusive CDC in such a way that no changes are required directly on the source system.
See the solutioning section for an option for implementing Talend CDC in a non-intrusive manner (on the source), with the respective pros and cons.
There are various risks to consider when allowing intrusive CDC directly on source data.
The intrusive CDC will certainly reduce source system performance directly.
This is because intrusive CDC by definition runs in the domain of the monitored system itself.
Depending on implementation, this can be even quite severe during high load.
During data mass updates, triggers - including those for CDC - may need to be disabled.
If not disabled, the update may either fail (remember: if a trigger fails, the originating transaction also fails), or take much longer than expected to complete.
Longer duration may especially be an issue if the mass update is done during a fixed outage period.
Any structure changes, upgrades or patching to the source system may become quite complicated.
In some cases any intrusive CDC may first need to be disabled or removed entirely before a safe upgrade can be made.
Removal may also be necessary because some upgrade processes drop and recreate objects, which would break the CDC if it is not removed first.
Non-Intrusive CDC seems like the way to always go, right? In theory, yes. But in practice non-intrusive CDC also has its cons, especially when it comes to the built-in variants.
Most non-intrusive CDC rely on some form of internal data change journalling (ie MS-SQL TransLog, Oracle Redo Log, etc). This requires that backups are run as COPY ONLY, otherwise the journal may be truncated before the CDC has a chance to record changes. And other journal consuming operations (ie: SQL replications) also have to stay friendly with the CDC.
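The COPY_ONLY option exists for exactly this kind of situation: an ad-hoc backup that does not disturb the backup chain or log handling that a log-reading CDC process may depend on. Paths and database name below are illustrative:

```sql
-- Sketch: ad-hoc backups that stay out of the way of log consumers.
BACKUP DATABASE SourceDB
   TO DISK = N'D:\Backups\SourceDB_adhoc.bak'
   WITH COPY_ONLY, INIT;

-- A COPY_ONLY log backup does not truncate the transaction log.
BACKUP LOG SourceDB
   TO DISK = N'D:\Backups\SourceDB_adhoc.trn'
   WITH COPY_ONLY, INIT;
```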
Another issue to consider is that someone - or some process - has to monitor the CDC collection to ensure it does not grow unchecked and use up all the disk resources. This would be the case if nothing is running to consume and reset the CDC collection.
A short introduction
Because Talend’s default CDC is based on TRIGGERS, we will have a quick look at how TABLE triggers work (Talend does not support DDL triggers).
A table trigger is a SQL code block that executes each time row data is modified in any way. There is a non-negligible performance cost to executing extra code for every modification. The longer a trigger executes, the longer data locks are held.
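A minimal sketch of what a trigger-based CDC collection looks like in T-SQL. All table and column names are illustrative assumptions, not Talend's actual generated objects:

```sql
-- Hypothetical CDC collection table for a monitored source table.
CREATE TABLE dbo.CDC_CLNTLIVR (
    ChangeId   INT IDENTITY PRIMARY KEY,
    ChangeType CHAR(1)   NOT NULL,                        -- I / U / D
    ChangedAt  DATETIME2 NOT NULL DEFAULT SYSUTCDATETIME(),
    OrderNo    INT       NOT NULL                         -- PK of the source row
);
GO
CREATE TRIGGER trg_CLNTLIVR_cdc
ON dbo.CLNTLIVR
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;
    -- inserted/deleted hold the after/before images of the affected rows.
    INSERT dbo.CDC_CLNTLIVR (ChangeType, OrderNo)
    SELECT CASE WHEN d.OrderNo IS NULL THEN 'I'
                WHEN i.OrderNo IS NULL THEN 'D'
                ELSE 'U' END,
           COALESCE(i.OrderNo, d.OrderNo)
    FROM inserted i
    FULL OUTER JOIN deleted d ON d.OrderNo = i.OrderNo;
END;
```

Even this small example writes one extra row per changed row, inside the original transaction - which is where the performance and locking cost comes from.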
Triggers can be of varying complexity and, depending on database settings, can even be nested/chained or recursive. Nested/chained and recursive triggers add extensive performance costs, so should be avoided.
Pro Tip: Because a trigger must process every row that is modified, it may be a good idea in some cases to disable all triggers on the affected table(s) during mass updates! Of course, that would also disable trigger-based CDC during that time.
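The disable/enable dance around a mass update is a one-liner each way. The table and update below are illustrative assumptions:

```sql
-- Sketch: temporarily disabling triggers around a mass update.
-- WARNING: trigger-based CDC will miss any changes made while disabled.
DISABLE TRIGGER ALL ON dbo.CLNTLIVR;

UPDATE dbo.CLNTLIVR
SET    PriceCode = 'STD'       -- illustrative mass update
WHERE  PriceCode IS NULL;

ENABLE TRIGGER ALL ON dbo.CLNTLIVR;
```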
Infosys is a good example of Built-In Intrusive CDC. It has always used TRIGGER-based CDC internally to capture targeted data changes on sensitive columns in a defined set of tables. (Currently over 100 triggers are set.)
Infosys does CDC for auditing purposes, and uses one TRIGGER per change type.
Example for table [CLNTLIVR] (Customer Delivery Product):
The CDC collection is stored in the [MAINTLOG] table within Infosys’s DB.
All major modern DB engines and even some ERP systems support native non-intrusive CDC. Sadly, each native CDC is proprietary - there is no industry agreed-upon standard. That requires each DB engine's CDC to be coded against specifically, and possibly even version-dependently (example: Oracle 12 has a completely new native CDC compared to its older editions). That is why, to save development time and cost, and also to simplify quality control, most ETL tool vendors such as Talend end up supporting only one or two native CDCs, while all other data sources require the implementation of intrusive trigger-based CDC.
What you have to know... Talend trigger-based CDC has a few requirements, but one is vital: source tables that need to be monitored for change require a Primary Key (PK) to be set. If there is no primary key on the data, then Talend has no way to know exactly which row has changed. (PS: PKs are unique within a given table.)
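Where a source table lacks a key, one can sometimes be added over an existing unique column set. The table and columns below are illustrative assumptions; the chosen columns must genuinely be unique:

```sql
-- Sketch: giving a key-less source table the primary key Talend needs.
ALTER TABLE dbo.CLNTLIVR
  ADD CONSTRAINT PK_CLNTLIVR PRIMARY KEY (OrderNo, LineNum);
```

Of course, altering a source table is itself an intrusive change and needs the same change management care as the triggers do.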
The existence of SQL triggers has allowed for many otherwise impossible workarounds, so they are a real boon to DBAs and DB devs. Triggers also allow for CDC when no other option is available, or where supporting the many non-standard proprietary built-in CDC methods is just not viable, resource-wise. Modern trigger logic is quite optimised compared to just a decade ago, so there is less danger now in implementing CDC triggers. And thanks to various built-in views in current MS-SQL editions, it is easier than ever to track down trigger-based performance issues. Still, when it comes to triggers, a good saying to remember is: with great power comes great responsibility. Triggers are amazing, but they can quickly burn up your DB's IO or, if coded badly, even cause data corruption.
We've already hinted at some of the risks of using triggers. A trigger's work grows with every record affected. A trigger runs in the same transaction as the original operation - if the trigger fails, the whole transaction fails (actually, this can be a pro for auditing!). A trigger inserting data into another table will lock both tables until the trigger is done; if another insert/update/delete operation or trigger fires wanting to write to the same table as the first, you will experience blocking. This means that even well-crafted triggers use up CPU, IO, RAM and lock resources, and can drastically slow down long-running or often-repeated DB tasks (note: normally not an issue on short, one-time tasks). Finally, badly coded triggers can destroy your data in such a way that, if not caught in time, even the backups become useless.
Risks that need to be considered when using SQL triggers in our ERPs.
JDE 920 makes extremely heavy use of cursors on MS-SQL.
Because triggers run in the same transaction as the originating operation, this can cause unexpected blocking, depending on how well the cursors have been optimised and configured.
From simple empirical profiling, it seems JDE makes little or no attempt to optimise its cursors - more research would be required to validate this finding.
Also, remember that under MS-SQL, depending on the transaction isolation level the SELECTs run under, even a simple query can block data change operations and their linked triggers.
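One common mitigation is row versioning, so that readers do not take the shared locks that block writers (and their triggers). The database name is an illustrative assumption, and both options trade lock contention for tempdb version-store usage:

```sql
-- Sketch: make the default READ COMMITTED level use row versioning
-- (readers no longer block writers). Requires exclusive DB access to set.
ALTER DATABASE SourceDB SET READ_COMMITTED_SNAPSHOT ON;

-- Or opt in per session instead:
-- (requires ALTER DATABASE SourceDB SET ALLOW_SNAPSHOT_ISOLATION ON)
SET TRANSACTION ISOLATION LEVEL SNAPSHOT;
```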
Infosys has always been transactionally very busy. The design makes heavy use of high I/O “current data” tables physically partitioned from write once historic tables. As we have seen, Infosys also already uses triggers for auditing.
There are a few high-priority, business-vital ETL processes that run on Infosys directly (intrusively) and might be blocked or slowed down when adding more triggers:
AWE ETL - continuous integration, already has lock issues.
CARDLOCK ETL - Periodic integration, already causes lock issues.
SalesFeed ETL - daily export - vital to BI to be correct and uninterrupted.
Note: the Infosys DB is already replicated in real time. This replica can be used as the BI source, or a second subscription can feed a BI server directly, to be used as the ATOMIC layer.