Data Strategy: told through tales of an 18th Century British gardener
Mo Data stashed this in Data Strategy
This is a great article about Lancelot "Capability" Brown, an 18th-century gardener who broke away from the formal gardens of his time, and the analogy between his work and the highly structured data stores and Big Data problems we face today.
Big Data's Constant Gardener, by Paul Sonderegger
The work of an 18th-century British landscape architect might seem like an odd place to look for insights into the elusive task of extracting real business value from big data. But Lancelot “Capability” Brown—that really was his nickname—has much to teach us if we study his redefinition of the English garden in the late 1700s.
In its time, his work was a revelation. Unlike earlier formal gardens, such as those at Versailles, which used careful geometric arrangements to show the triumph of order over nature, Brown’s landscaping designs were full of rolling green lawns, seemingly random bands of trees, and ponds that looked like they’d always been there.
But Brown wasn’t rejecting order in favor of nature. He was combining the two. He redirected springs and rivers to make them flow where he wanted. He kept trees that framed a house as you came up the drive, removing those that didn’t. In fact, Brown got his nickname from repeatedly saying that his clients’ lands had great “capability” for the designs he envisioned. Although he was reshaping nature, when his work was complete he had done the natural landscape one better, resulting in what one admirer called “nature perfected.”
By analogy, Brown’s approach to landscaping holds the key to understanding how to get the most from big data. The data warehouses created by large organizations over the past 40 years are like the pre-Brownian triumphs of order over nature. Unfortunately, today’s datafication of everything—aka big data—threatens to upend such systematized storage and let in the data wilds. In response, CIOs are tempted to wall off big data into its own silos, afraid it’ll otherwise overrun their whole IT environment.
But the choice between a cleanly coordinated “Versailles” or a big data jungle is a false one. There’s a Capability Brown solution that redirects the power of wild data without imposing artificial order. It’s a way of querying data across Hadoop, relational, and NoSQL databases as if it were all in a single, high-performance system.
SQL, the most common language for accessing data, is the key. But to understand how SQL—which stands for structured query language, after all—can give you convenient access to diverse data without standardizing it, we have to take another short history lesson.
SQL may have been created in California in the 1970s, but its roots reach back to work done by German mathematicians more than a century ago. In 1879, Gottlob Frege, a brilliant but introverted professor, imposed rigor on the unruliness of human thought by formalizing logic mathematically. Frege’s efforts inspired work on set theory and predicate logic. These important tools for reasoning about things based on their properties were foundational to the early work on electronic computers in the post-World War II era.
By the 1960s, there were enough computers around, each with its own way of structuring data for an application, that the hunt was on for a general-purpose way to represent data and query it. In 1970, Edgar Codd, an IBM researcher, proposed just such a way based on set theory and mathematical logic. This was the relational model which, in turn, encouraged the creation of SQL.
It’s important to remember that every advance from Frege to Codd relied upon systematic methods to represent the things one wants to think about, and clearly defined rules to keep that thinking on track. The relational model is not SQL; SQL is not the relational model. The relational model is the garden of data, and SQL is the gardener.
Our story so far is all about the triumph of order over nature. But with the advent of big data, it’s a jungle out there. The datafication of everything creates a huge diversity of data. Think of log files from sensors, configuration files from mobile devices, text in tweets and posts, and branching hierarchies from website click-stream paths. None of this data starts out as nice, neat tables, which form the heart of relational databases. Instead, it shows up as bundles of files and records in Hadoop clusters and NoSQL stores.
Yet Capability Brown would have said that these torrents and outgrowths of data have great capability. They can tell you in detail about your customers, products, capital equipment, and even your own business processes, if only they could be redirected.
In fact, this is where SQL has a power Brown could only dream of. While SQL was designed to work on tables, its real job is to select and filter data inside those tables. SQL works at the most elemental level of mathematical logic: the properties, or attributes, of things and their values. The tables are just convenient containers.
Most of the new data types flooding into the modern enterprise from sensors, mobile devices, and online services are really just bags of attribute-value pairs. Let’s examine the familiar tweet. A single tweet has 150 attribute-value pairs hanging off it, collectively called metadata. The metadata include a unique ID number for that tweet, the author’s screen name, how many tweets this person has sent, the ID number of the tweet this tweet is replying to, geographic coordinates, and more. These pieces of metadata are provided as a list, one attribute-value pair at a time. Some tweets have them all populated, others don’t. And Twitter can change this list of attributes any time it wants. In February 2013, Twitter added a few, including “lang” whose values come from Twitter’s language detection algorithms.
What if you could redirect this list of attribute-value pairs to look like a table? Imagine deriving a table from the tweet where the columns are the attributes it happens to have, and you put values for the attributes in the row’s cells wherever appropriate. Now do that for each of the nearly 672 million tweets produced during the World Cup. The resulting table would be extremely sparse (many of the tweets don’t have a value for attributes that other tweets do), and no self-respecting database administrator would ever design such a thing. However, such a dynamically derived table provides convenient access to the exotic variety in the source data. By analogy to Brown, it’s nature perfected.
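To make the idea of a dynamically derived table concrete, here is a minimal Python sketch. It assumes three made-up tweets with made-up attribute names (none of them come from Twitter's real schema) and uses an in-memory SQLite database only as a convenient stand-in for "a table you can query with SQL". It is an illustration of the idea, not how any particular product does it.

```python
import sqlite3

# Hypothetical tweets: each is a bag of attribute-value pairs,
# and not every tweet carries every attribute.
tweets = [
    {"id": 1, "screen_name": "ana", "lang": "pt", "text": "Gol!"},
    {"id": 2, "screen_name": "ben", "in_reply_to_id": 1, "text": "What a match"},
    {"id": 3, "screen_name": "cho", "lang": "ko", "geo_lat": 37.5, "geo_lon": 127.0},
]

def derive_table(records):
    """Return (columns, rows). Columns are the union of attributes seen,
    in first-seen order; rows hold None where a record lacks an attribute."""
    columns = []
    for record in records:
        for name in record:
            if name not in columns:
                columns.append(name)
    rows = [tuple(record.get(name) for name in columns) for record in records]
    return columns, rows

columns, rows = derive_table(tweets)

# Load the sparse table into SQLite so ordinary SQL can select and filter it.
conn = sqlite3.connect(":memory:")
column_list = ", ".join('"{}"'.format(name) for name in columns)
placeholders = ", ".join("?" for _ in columns)
conn.execute("CREATE TABLE tweets ({})".format(column_list))
conn.executemany("INSERT INTO tweets VALUES ({})".format(placeholders), rows)

# Filter on an attribute that only some tweets have.
with_lang = conn.execute(
    "SELECT id, lang FROM tweets WHERE lang IS NOT NULL ORDER BY id"
).fetchall()
```

The derived table has one column per attribute that appears anywhere in the input, and the empty cells are simply NULL. That sparseness is the price of not forcing every tweet into a schema designed in advance.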
The IT industry knows all this, which is why we see so much activity around SQL on Hadoop. This is good, but there’s one more step. What you really want to do is query the data in Hadoop, relational databases, and NoSQL repositories as if it were all in the same system. You’d have equal access to both the data jungle and the formal garden, without the misery of planting them in the same earth. This is what Oracle Big Data SQL does.
Now let’s step into some Oracle-specific capabilities. Oracle Big Data SQL is the only product that provides the three things SQL has to have to be useful across this diversity of repositories: seamlessness, speed, and security. So, let me tell you how we do it.
Seamlessness comes from the automatic derivation of tables from data loaded into Hadoop, much like the tweets we discussed already. These tables tell the Oracle database about the shape and location of individual objects in Hadoop and NoSQL just as if they were in the database itself. This seamless catalog of data lets the query planner think globally but act locally, sending pieces of the query to wherever the data lives.
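The role of a shared catalog can be sketched in a few lines of Python. Everything here is hypothetical: the `CATALOG` and `STORES` dictionaries, the table names, and the `scan` and `join_on` helpers are invented for illustration and do not reflect Oracle's actual catalog or planner. The point is only that once the caller knows each table's shape and location, a join can be written without caring where the rows physically sit.

```python
# Hypothetical catalog: logical table name -> where it lives and its columns.
CATALOG = {
    "social_mentions": {"store": "hadoop", "columns": ["product", "mentions"]},
    "store_sales": {"store": "warehouse", "columns": ["product", "units"]},
}

# In-memory stand-ins for the physical stores; real ones would be remote systems.
STORES = {
    "hadoop": {
        "social_mentions": [
            {"product": "boots", "mentions": 120},
            {"product": "hats", "mentions": 8},
        ]
    },
    "warehouse": {
        "store_sales": [
            {"product": "boots", "units": 40},
            {"product": "hats", "units": 5},
        ]
    },
}

def scan(table):
    """Look the table up in the catalog and read it from wherever it lives."""
    entry = CATALOG[table]
    return STORES[entry["store"]][table]

def join_on(left_table, right_table, key):
    """Join two catalogued tables on a shared key, regardless of their stores."""
    right_by_key = {row[key]: row for row in scan(right_table)}
    return [
        {**left, **right_by_key[left[key]]}
        for left in scan(left_table)
        if left[key] in right_by_key
    ]

combined = join_on("social_mentions", "store_sales", "product")
```

The caller names tables, not systems. Which store answers each part of the request is decided by the catalog lookup inside `scan`.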
Farming out a query across multiple systems is normally called query federation because the planner hands off to the local execution engine, whatever it may be. But then performance becomes unpredictable. Instead, Oracle Big Data SQL uses query franchising. It puts Smart Scan technology, inherited from the software that makes Exadata warehouses run so fast, on the Hadoop and NoSQL nodes. Pieces of the query run in different locations, but to the same standards.
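The general technique behind scanning at the storage node is often called predicate pushdown: filter rows and trim columns where the data lives, so that only the relevant fraction travels back to the database. The toy Python sketch below illustrates that general idea only. The `StorageNode` class, its method names, and the flight data are all assumptions made up for this example, and it is not a description of how Smart Scan is implemented.

```python
class StorageNode:
    """Toy storage node that can filter and project rows locally."""

    def __init__(self, rows):
        self.rows = rows

    def scan_all(self):
        # No pushdown: every row, with every column, is shipped to the caller.
        return list(self.rows)

    def scan_with_pushdown(self, predicate, columns):
        # Pushdown: filter rows and trim columns before anything leaves the node.
        return [
            {name: row[name] for name in columns}
            for row in self.rows
            if predicate(row)
        ]

def is_hot(row):
    """Hypothetical predicate: flag readings above an arbitrary threshold."""
    return row["engine_temp"] > 1590

# 1,000 made-up test-flight readings with a bulky column we do not need.
node = StorageNode(
    [{"flight": i, "engine_temp": 600 + i, "note": "x" * 50} for i in range(1000)]
)

# Without pushdown, all rows cross the wire and are filtered by the caller.
shipped_without = node.scan_all()
hot_without = [{"flight": row["flight"]} for row in shipped_without if is_hot(row)]

# With pushdown, only the matching rows and requested columns cross the wire.
hot_with = node.scan_with_pushdown(is_hot, ["flight"])
```

Both paths return the same answer, but the first ships 1,000 full rows to find it while the second ships 9 narrow ones. Running the same scan logic on every node is what makes the cost of each piece of the query predictable.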
Because the database now thinks it’s talking only to itself the whole time, the full security of the Oracle database extends to the entire big data environment. SQL now becomes big data’s constant gardener, making queries across Hadoop, relational, and NoSQL databases act as if they were running in a single, high-performance system.
This is a thing of beauty for retailers who want to know about correlations between social media data (in Hadoop) and in-store sales (in a warehouse). It’s of immense value to airplane manufacturers who want to know how data from the latest test flight (in a NoSQL database) fits with operational data from current commercial flights (in a warehouse). And let’s not forget wireless service providers who want to combine phone configuration details (in NoSQL) with application usage (in Hadoop) with call hand-off records (in a warehouse) to improve service in small neighborhoods of big cities.
Let’s conclude by connecting yesterday’s grand gardens with today’s torrents of data. Capability Brown’s genius was in realizing there was a relationship between nature and order, not a competition. His method of bending the natural capabilities of land and water, rather than replacing them, created an enduring landscape architecture that you can still see today throughout England. Tapping the capabilities of data in its natural format to provide easy access without artificial order is equally powerful.