Is it time for a new data warehouse?
Ever since the data warehouse was invented, several core data warehousing concepts have remained the same regardless of vendor. All of these concepts exist for one reason: to make the data warehouse's response time meet the expectations of the end user. The trade-off is that they add significant work to setting up reports. With SAP HANA, that extra work is not required, because HANA can meet end users' response time expectations without it. The effort to build reports in HANA is substantially less than building comparable reports in a traditional data warehouse. Let's explore.
The first concept is limiting the amount of data taken from the source system. The premise is that not all of the information needs to be in the data warehouse, since end users will only want reports on certain data from the source system, and that bringing over less data improves response times. Both points are true; the flaw is that as business priorities change, reporting requirements change with them, and different data is needed in the reports. Because not all of the data was brought over from the source system, bringing in additional data requires extensive rework and redesign.

For example, when the business articulates its requirements for a new report or data mart, it is possible (even probable) that some requirements will not be fully articulated. After the extraction programs are built and the reporting structures in the data warehouse are complete and loaded, new data may be identified as required. The data must then be dropped, extraction programs modified, ETL jobs changed, reporting structures in the data warehouse changed, and the data reloaded. All of these steps are needed for something as simple as adding a few fields from a table or tables in the source system. Even though the change seems small when the business requests it (e.g. "please add three fields to the xyz report"), the work required isn't small and can take weeks to months to implement. The business often becomes frustrated, wondering why such a simple change takes so long and costs so much. With HANA, the source system data can be loaded at a far more granular level of detail and with far more columns, and that is how HANA solves this problem.
You can include every field, because you don't face the size limitations of typical warehouses, and the business can change without the data warehouse having to change with it. The columnar nature of the HANA database applies multiple compression techniques, such as dictionary encoding and run-length encoding, to keep the database at a manageable size. Another benefit of using HANA as a data warehouse is that load times are significantly faster than loading into traditional data warehouses.
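To make the compression point concrete, here is a minimal Python sketch of the two techniques named above. This is an illustration of the general encodings, not HANA's actual implementation; the column values are invented for the example.

```python
def dictionary_encode(column):
    """Dictionary compression: replace each value with a small integer ID.
    Columns with few distinct values (e.g. country codes) shrink dramatically."""
    dictionary = {}
    ids = []
    for value in column:
        if value not in dictionary:
            dictionary[value] = len(dictionary)
        ids.append(dictionary[value])
    return dictionary, ids

def run_length_encode(ids):
    """Run-length encoding: collapse repeats into [value, run_length] pairs.
    Sorted or low-cardinality columns produce long runs."""
    runs = []
    for v in ids:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

column = ["DE", "DE", "DE", "US", "US", "DE", "DE"]
dictionary, ids = dictionary_encode(column)
runs = run_length_encode(ids)
# dictionary -> {"DE": 0, "US": 1}
# runs       -> [[0, 3], [1, 2], [0, 2]]
```

Seven string values become a two-entry dictionary plus three short runs, which is why columnar stores compress so well compared with row stores.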
The second concept is setting up indexes to speed things up for the end user. Indexes have been necessary for adequate performance in the RDBMS technologies of the past 30 years; in fact, indexes and aggregates were long considered best practices for delivering performance. However, indexes create substantial additional overhead for the data warehouse, both in storage size and in DBA effort. Why are indexes required? Again, to meet the response time expectations of the end user: the only way a traditional, disk-based data warehouse can respond quickly is to maintain an index on what the user will search. This is one more setup step that is not required with HANA. In HANA, data is stored entirely in-memory, and indexes are not required; HANA can meet the response time expectations of the end user without them, because reading from memory is on the order of 1,000x to 1,000,000x faster than reading from disk. This is another step eliminated from setting up and maintaining a data warehouse and its reports.
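The trade-off described above can be sketched in a few lines of Python. This is a toy model, not HANA internals: the "index" stands in for a disk-era secondary index that must be built and maintained, while the plain scan stands in for an in-memory columnar scan that answers the same query with no extra structure.

```python
# Invented sample data: 1,000 sales rows with a region attribute.
rows = [{"id": i, "region": "EMEA" if i % 3 == 0 else "AMER", "amount": i * 10}
        for i in range(1_000)]

# Traditional approach: build (and forever maintain) a secondary index on
# "region", then answer the query via the index.
region_index = {}
for pos, row in enumerate(rows):
    region_index.setdefault(row["region"], []).append(pos)
emea_total_indexed = sum(rows[pos]["amount"] for pos in region_index["EMEA"])

# In-memory approach: a straight scan over the data, no index to build,
# store, or keep in sync on every load.
emea_total_scanned = sum(r["amount"] for r in rows if r["region"] == "EMEA")

assert emea_total_indexed == emea_total_scanned
```

Both paths return the same answer; the difference is that the first requires an extra structure that consumes space and DBA attention, which is exactly the overhead the in-memory approach removes.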
The third and final concept is the data warehouse duplicating data. If a data warehouse holds 5 TB, it likely contains only about 1 TB of raw data. Why is the data duplicated? Again, to meet the response time expectations of the end user. Take sales reporting: reports are needed showing sales per day/week/month/quarter/year, and sales per geography over the same periods. These reports are pre-calculated as aggregates so they are ready to go when a user asks for them. Building those aggregates is another required step, adding further time to a data warehouse project, and another step not required with HANA: HANA is fast enough to calculate these reports from the raw data, on the fly. In addition, since HANA does not copy the raw data and compresses it on average 4:1, the 5 TB data set becomes roughly 256 GB (the ~1 TB of raw data compressed 4:1). This has an additional positive impact on TCO, because the database remains manageable, enabling more data to be loaded in less space. Again, the 'best practices' of the past are now obsolete because HANA doesn't require aggregates.
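The on-the-fly calculation above can be illustrated with a short Python sketch. The sales rows are invented for the example; the point is that any rollup (per month, per region-and-month, etc.) is computed from the raw rows at query time, instead of being stored as separate pre-built aggregate tables.

```python
from collections import defaultdict

# Hypothetical raw sales rows: (date, region, amount). In a traditional
# warehouse, each rollup below would exist as its own aggregate table.
sales = [
    ("2024-01-15", "EMEA", 100.0),
    ("2024-01-15", "AMER", 250.0),
    ("2024-02-03", "EMEA", 75.0),
    ("2024-02-20", "AMER", 125.0),
]

def aggregate(rows, key_fn):
    """Compute a rollup on the fly from the raw rows -- no stored aggregates."""
    totals = defaultdict(float)
    for day, region, amount in rows:
        totals[key_fn(day, region)] += amount
    return dict(totals)

# Any grouping is just a different key function over the same raw data.
per_month = aggregate(sales, lambda day, region: day[:7])
per_region_month = aggregate(sales, lambda day, region: (region, day[:7]))
# per_month -> {"2024-01": 350.0, "2024-02": 200.0}
```

Because every rollup is derived from one copy of the raw data, nothing is duplicated, and a new reporting dimension is a new key function rather than a new aggregate table and load job.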
These are three solid reasons that analytics can be delivered with less effort in HANA than in a traditional data warehouse. The time of the traditional data warehouse has passed; SAP has re-invented the solution with SAP HANA. Business requirements are evolving ever more rapidly, and it's time for a data solution that can change with them.