Data Lakes - serving the needs of Big Data analytics - core component of a Big Data Supply Chain
Mo Data stashed this in Big Data Preparation
"The problem is that, in the world of big data, we don’t really know what value the data has when it’s initially accepted from the array of sources available to us. We might know some questions we want to answer, but not to the extent that it makes sense to close off the ability to answer questions that materialize later. Therefore, storing data in some “optimal” form for later analysis doesn’t make any sense. Instead, what the Dixon suggests is storing the data in a massive, easily accessible repository based on the cheap storage that’s available today. Then, when there are questions that need answers, that is the time to organize and sift through the chunks of data that will provide those answers.""Once the data lake is operational, users need a way to traverse the lake and determine the value of the information that “lives” in the lake. New search engines have been developed, which are specifically designed to query the data types that reside in the data lake. These search engines fundamentally differ from their counterparts used in data warehouses. In a data warehouse, the main mode of access is a relational database storage paradigm, in which the structure of the data was predetermined at the time the database was designed. With data lakes and big data, the structure of the data is more flexible.""Pervasive’s Data Rush provides another way of sifting through data. Data Rush is a programming toolkit that creates highly parallel applications that are very friendly to the challenge of sifting through large amounts of data. When using Data Rush, you must write a program, which is harder than configuring Hadoop, but the speed and richness of the extraction will be worth it for certain applications. Another approach is to monitor the stream of data arriving in the lake for specific events. 
Complex event processing (CEP) engines can also sift through data as it enters storage, or later when it’s needed for analysis.""In fact, there are many new, different structures entering the market, which determine the structure of the data at the time of search, not at the time of storage. That means the process of searching through a data lake is much more like running a query on Google, looking at the result set and deciding, “Ah, here’s a field I’m interested in.” On the next search, you use that field, and probably create and identify other fields as well, searching interactively and expanding the description of the structure of the big data at the same time."
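
The idea of determining structure at search time rather than at storage time can be illustrated with a small sketch. This is not how any particular data lake search engine works; it is a minimal Python illustration that assumes the raw records are JSON strings. The names `RAW_LAKE`, `search`, and `discover_fields`, along with the sample records, are hypothetical and invented for this example.

```python
import json
from collections import Counter

# Hypothetical raw records, stored exactly as they arrived.
# No schema was decided when they were written (assumption: JSON lines).
RAW_LAKE = [
    '{"user": "ann", "action": "login", "ts": 1}',
    '{"user": "bob", "action": "purchase", "amount": 30, "ts": 2}',
    '{"sensor": "t1", "temp_c": 21.5, "ts": 3}',
    '{"user": "ann", "action": "purchase", "amount": 12, "ts": 4}',
]


def search(lake, **criteria):
    """Parse each record at query time and keep those matching every criterion."""
    results = []
    for line in lake:
        record = json.loads(line)  # structure is read here, not at storage time
        if all(record.get(key) == value for key, value in criteria.items()):
            results.append(record)
    return results


def discover_fields(records):
    """Count how often each field name appears in a result set."""
    return Counter(field for record in records for field in record)


# First pass: a broad search, then inspect which fields the results contain.
first_pass = search(RAW_LAKE)
fields = discover_fields(first_pass)

# Second pass: reuse a field discovered in the first pass to narrow the search.
purchases = search(RAW_LAKE, action="purchase")
total = sum(record["amount"] for record in purchases)
```

Here the first search plays the role of "looking at the result set", and the second search uses the `action` field found along the way. The description of the data's structure grows with each query instead of being fixed up front.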