Oracle Data Integrator and Hadoop. Is ODI the only ETL tool for Big Data that works?
Both ODI and the Hadoop ecosystem share a common design philosophy: bring the processing to the data rather than the other way around. Sounds logical, doesn't it? Why move terabytes of data around your network if you can process it all in one place? Why invest millions in additional servers and hardware just to transform and process your data?
In the ODI world this approach is known as ELT. ELT is partly a marketing term, but it points to a real design choice: data transformations are performed in the processing engine where the data already resides, rather than moving the data around to be transformed. This approach has underpinned the product since its inception.
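To make the ELT idea concrete, here is a minimal sketch using SQLite as a stand-in for the target engine (in an ODI deployment this would be Hive, an Oracle database, etc.). The table and column names are purely illustrative. The point is that the transformation is expressed as one set-based SQL statement executed inside the engine that holds the data, rather than rows being pulled out to a separate ETL server:

```python
import sqlite3

# Stand-in target engine; illustrative schema, not a real ODI object.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE staging_orders (order_id INTEGER, amount REAL, status TEXT);
    INSERT INTO staging_orders VALUES
        (1, 100.0, 'SHIPPED'),
        (2,  50.0, 'CANCELLED'),
        (3, 200.0, 'SHIPPED');
""")

# ELT: push the transformation down to the engine as a single SQL statement.
# No rows cross the network; the engine filters and materialises in place.
conn.execute("""
    CREATE TABLE fact_orders AS
    SELECT order_id, amount
    FROM staging_orders
    WHERE status = 'SHIPPED'
""")

total = conn.execute("SELECT SUM(amount) FROM fact_orders").fetchone()[0]
print(total)  # 300.0
```

The classic ETL alternative would fetch every staging row into the client process, filter it there, and write the survivors back, paying for two network trips and a dedicated transformation tier.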
While other ETL tools such as Informatica now also offer some pushdown functionality (e.g. Hive pushdown), it is not in the DNA of these tools or companies. Traditionally, they settled on a completely different approach, and the problems with it are now showing more than ever. It is hard for these vendors to work around their original design philosophy. Compare this to Microsoft and Google: the latter has the Internet and Big Data in its DNA as a company, the former does not, and Microsoft is throwing huge resources at the problem without being especially successful at closing the gap. Let me put it another way: why settle for the copy if you can get the real thing?
The advantages of ODI over traditional ETL tools don't stop there. ODI has a concept of reusable code templates known as Knowledge Modules. This metadata-driven design approach encapsulates common data integration strategies, such as timestamp-based extracts, data merging, auditing of changes, truncate loads, and parking defective records in an error hospital, and makes them available for reuse. This can result in ETL developer productivity gains of more than 40%.
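The idea behind a Knowledge Module can be sketched as a template that turns table and column metadata into executable integration code. The toy function below, with entirely hypothetical names, renders a merge (upsert) statement from metadata; real ODI Knowledge Modules are far richer and use ODI's own substitution framework, but the principle is the same: write the strategy once, reuse it across every mapping.

```python
def incremental_merge_km(target, source, key_cols, update_cols):
    """Toy 'Knowledge Module': render a MERGE statement from metadata.

    The same template serves any source/target pair, so the merge
    strategy is written once and reused across mappings.
    """
    on_clause = " AND ".join(f"t.{c} = s.{c}" for c in key_cols)
    set_clause = ", ".join(f"t.{c} = s.{c}" for c in update_cols)
    all_cols = key_cols + update_cols
    insert_cols = ", ".join(all_cols)
    insert_vals = ", ".join(f"s.{c}" for c in all_cols)
    return (
        f"MERGE INTO {target} t USING {source} s\n"
        f"ON ({on_clause})\n"
        f"WHEN MATCHED THEN UPDATE SET {set_clause}\n"
        f"WHEN NOT MATCHED THEN INSERT ({insert_cols}) "
        f"VALUES ({insert_vals})"
    )

# Hypothetical tables: one metadata call yields the full merge statement.
sql = incremental_merge_km(
    target="dw.customers",
    source="stg.customers",
    key_cols=["customer_id"],
    update_cols=["name", "email"],
)
print(sql)
```

Swapping in a different target table or key list changes nothing in the template itself, which is where the claimed productivity gains come from: developers declare metadata instead of re-coding each load.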
What will the future of data integration on Hadoop look like? At the moment a lot of ETL is still hand-written as custom MapReduce jobs. As SQL engines on Hadoop reach a higher level of maturity, they will become the vehicle for 90%+ of Big Data transformation flows. Only for very specific use cases where performance is the highest priority will we see custom coding on Spark, MapReduce, etc. Based on its underlying design principles, Oracle Data Integrator is a perfect match for Hadoop.
Coming back to my question in the headline. Yes, I believe that Oracle Data Integrator really is the only ETL and data integration tool that is fit for purpose for Big Data workloads.
If you are planning a Big Data project, an ODI implementation, or both, then get in touch with us. Why settle for second best when you can get the ODI and Big Data experts?