

From data to money: Building a startup

Thanks to commodity computing power, it’s possible to build a startup business based around big data and analytics. But what does it take to do this, and how can you make money? These questions were addressed recently in blog posts by Russell Jurney and Pete Warden.

Jurney takes on the question of how many people you need to start a data product team. He draws out the ideal roles for such a team, including customer, market strategist, deal maker, product manager, experience designer, interaction designer, web developer, data hacker, and researcher.

Quite the cast, and not really the ideal starting point for a product or business startup, so Jurney condenses these roles into the more succinct definitions of “hustler,” “designer” and “prodineer” — a minimum of three people.

Analytic products are such a multidisciplinary undertaking that in a data startup a founding team is at minimum three people. Ideally all are founders. There are probably exceptions, but that is the minimum number of bodies required to flesh out all these areas with passionate people who share the vision and are deeply invested in the success of the company. Someone needs to be good at and enjoy each of these roles.

Once you start, and have a minimal product, Jurney recommends quickly connecting with real customers, and taking it from there. The next step is making money, of course, which is what Pete Warden has been thinking about.

After running through a “thousand ways not to do it,” Warden reckons finding a way to make money is the most important question for big data startups. He paints the stages of evolution a data product goes through to actually deliver value to customers.

  • Data: You need it, but selling it raw is the lowest level of business. Warden writes “The data itself, no matter how unique, is low value, since it will take somebody else a lot of effort to turn it into something they can use to make money”.
  • Charts: Simple graphs, which at least help users understand what you have, but “still leaves them staring at a space shuttle control panel, though, and only the most dogged people will invest enough time to understand how to use it.”
  • Reports: Bring a focus to what the customer wants. Many data-driven startups stop here and make good money doing that. But there’s further to go: “It can be very hard to defend this position. Unless you have exclusive access to a data source, the barriers to entry are low and you’ll be competing against a lot of other teams”.
  • Recommendations: Your product now goes from raw data and produces actionable recommendations, a much more defensible business. “To get here you also have to have absorbed a tremendous amount of non-obvious detail about the customer’s requirements, which is a big barrier to anyone copying you,” Warden writes.

Ending his piece, Warden offers this pithy advice: “More actionable means more valuable!”

Data in the dirt

What would you say to a pub full of people about data? That was my challenge when I gave a talk at Ignite Sebastopol 4, held in O’Reilly’s hometown of Sebastopol, Calif. Explaining some of the 200-year history behind Strata, I had just 20 slides, shown for 15 seconds each, to get my point across.

Dolphins, cellphones and social networks

A couple of recent research reports bring interesting insights from social networks outside the online worlds of Facebook and Twitter. Writing in Ars Technica, Casey Johnston reports on how the mathematics of text messaging might help mobile phone networks plan capacity. Researchers discovered that text-messaging patterns were generally bimodal.

Text message sets often start off with a burst: the times between messages are short and follow a power-law distribution (that is, there are a lot of text messages with short intervals between them).

Outside of an initial two- to 20-minute window, though, the frequency of messages falls dramatically. There are fewer messages, with longer intervals between them, and the tail can extend up to five or six hours past the initial burst, as the intervals continue to grow longer and the texts less frequent.
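The researchers' exact model isn't reproduced here, but the shape they describe (a heavy-tailed burst of short gaps followed by a sparse tail of long ones) can be sketched with Python's standard library. The distribution parameters below, a Pareto shape of 2.0 and a one-hour mean for the tail, are illustrative assumptions, not values from the study:

```python
import random
import statistics

random.seed(42)

# Burst phase: inter-message gaps are heavy-tailed (power-law), so most gaps
# are short but a few are long. paretovariate(2.0) draws from a Pareto
# distribution with minimum 1; scaling by 5 puts the gaps in seconds.
burst_gaps = [5 * random.paretovariate(2.0) for _ in range(200)]

# Tail phase: after the initial window, messages thin out. Modeled here as
# exponential with a one-hour mean (an assumption, not the researchers' model).
tail_gaps = [random.expovariate(1 / 3600) for _ in range(20)]

print(f"burst: median={statistics.median(burst_gaps):.1f}s, "
      f"mean={statistics.mean(burst_gaps):.1f}s")
print(f"tail:  mean={statistics.mean(tail_gaps):.0f}s")
```

In a heavy-tailed burst the mean sits well above the median, which is one quick way to spot power-law-like behavior in real inter-message data.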

The researchers took these observations, and developed models to explain what they saw. The model assumed that text exchanges were primarily task-focused, dealing with some issue the conversants had in common, such as deciding what to eat for dinner.

Cliques in a dolphin community, as noted in a Microsoft research report (PDF).

Karate students and dolphin pods feature in recent research from Microsoft, explained by Christopher Mims in his Technology Review blog. Using a new approach built on game theory, researchers were able to model cliques in communities. Possible applications of the research include urban development, criminal intelligence and marketing. Mims explains the wide applicability of the technique:

Intriguingly, two of the data sets the researchers tested their work on, which are apparently standard for this kind of research, were data gathered by anthropologists about a Karate academy, and data gathered by marine biologists about a pod of 64 dolphins. Applying their game-theoretic approach to both networks, they were able to resolve cliques that other approaches missed entirely.

Resolving cliques also has applications in determining identity, Mims points out. Individuals with non-unique names can be identified instead by the community footprint generated by their clique membership.
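As a toy illustration of that idea (the names and networks below are invented, not data from the Microsoft study), two people who share a name can be told apart by comparing how much their known associate sets overlap with an observed record:

```python
# Hypothetical records: two people share the name "J. Smith", but the sets of
# people they associate with (their "community footprints") differ.
footprints = {
    "j_smith_A": {"alice", "bob", "carol", "dave"},
    "j_smith_B": {"erin", "frank", "grace"},
}

# An unattributed record of some "J. Smith" observed with these associates:
observed = {"alice", "carol", "dave", "heidi"}

def jaccard(a, b):
    """Overlap between two sets, from 0.0 (disjoint) to 1.0 (identical)."""
    return len(a & b) / len(a | b)

# Attribute the record to the known individual whose footprint matches best.
best = max(footprints, key=lambda k: jaccard(footprints[k], observed))
print(best)  # j_smith_A: three of the four observed associates are in A's clique
```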

Gangsta test data

Perhaps one of the best known pieces of test data is the Lorem Ipsum text, used by graphic designers as a substitute for real text during the “greeking” process. This venerable text has now received an update for contemporary culture, courtesy of a couple of Dutch developers.

The Gangsta Lorem Ipsum generator serves up such modern nonsense as “Lorizzle bling bling dolor we gonna chung amizzle, consectetuer adipiscing dizzle.”

The most important unsolved question for Big Data startups is how to make money. I consider myself somewhat of an expert on this, having discovered a thousand ways not to do it over the last two years. Here's my hierarchy showing the stages from raw data to cold, hard cash:

Data

You have a bunch of files containing information you've gathered, way too much for any human to ever read. You know there's a lot of useful stuff in there, but you can talk until you're blue in the face and the people with the checkbooks will keep them closed. The data itself, no matter how unique, is low value, since it will take somebody else a lot of effort to turn it into something they can use to make money. It's like trying to sell raw mining ore on a street corner: the buyer will have to invest so much time and effort processing it that they'd much prefer to buy a more finished version, even if it's a lot more expensive.

Down the road there will definitely be a need for data marketplaces: common platforms where producers and consumers of large information sets can connect, just as there are for other commodities. The big question is how long it will take for the market to mature, to standardize on formats, and to develop the processing capabilities on the data-consumer side. Companies like InfoChimps are smart to keep their flag planted in that space; it will be a big segment someday, but they're also moving up the value chain for near-term revenue opportunities.

Charts

You take that massive deluge of data and turn it into some summary tables and simple graphs. You want to give an unbiased overview of the information, so the tables and graphs are quite detailed. This now makes a bit more sense to the potential end users: they can at least understand what it is you have, and start to imagine ways they could use it. The inclusion of all the relevant information still leaves them staring at a space shuttle control panel, though, and only the most dogged people will invest enough time to understand how to use it.

Reports

You're finally getting a feel for what your customers actually want, and you now process your data into a pretty minimal report. You focus on a few key metrics (e.g., unique site visitors per day, time on site, conversion rate) and present them clearly in tables and graphs. You're now providing answers to the informational questions your customers are asking: "Is my website doing what I want it to?", "Which areas are most popular?", "What are people saying about my brand on Twitter?" There's good money to be had here, and this is the point many successful data-driven startups are at.

The biggest trouble is that it can be very hard to defend this position. Unless you have exclusive access to a data source, the barriers to entry are low and you'll be competing against a lot of other teams. If all you're doing is presenting information, that's pretty easy to copy, which has caused a race to the bottom in prices in spaces like 'social listening platforms'/'brand monitoring' and website analytics.

Recommendations

Now you know your customers really well, and you truly understand what they need. You're able to take the raw data and magically turn it into recommendations for actions they should take. You tell them which keywords they should spend more AdWords money on. You point out the bloggers and Twitter users they should be wooing to gain the PR they're after. You're offering them direct ways to meet their business goals, which is incredibly valuable. This is the Nirvana of data startups: you've turned into an essential business tool that your customers know is helping them make money, so they're willing to pay a lot. To get here you also have to have absorbed a tremendous amount of non-obvious detail about the customer's requirements, which is a big barrier to anyone copying you. Without the same level of background knowledge, they'll deliver something that fails to meet the customer's need, even if it looks the same on the surface.

This is why Radian6 has flourished and been able to afford to buy out struggling 'social listening platforms' for a song. They know their customers and give them recommendations, not mere information. If this sounds like a consultancy approach, it's definitely approaching that, though hopefully with enough automation that finding skilled employees isn't your bottleneck.

Of course, the line between the last two stages is not clear-cut (Radian6 is still very dashboard-centric, for example), and it does all sound a bit like the horrible use of 'solution' as a buzzword for tools back in the '90s, but I still find it very helpful when I'm thinking about how to move forward. More actionable means more valuable!

Yesterday I went to a great talk by DJ Patil called "Building Great Data Products." DJ has an impressive background in building data-centric products: he was head of LinkedIn's data products for several years, then Data Scientist in Residence at Greylock, and is now the VP of Product at RelateIQ, a CRM tool whose homepage reads, "The Beginning of Data Science in Decision Making." He also co-coined the term Data Scientist.

DJ discussed some of the lessons he learned while building products at LinkedIn and RelateIQ, and the following is a summary of my notes from the talk:

  • Don't try to be too clever. Simple, straightforward approaches beat cleverness 9 times out of 10.
  • Start with something simple, then make it more complex if necessary. Don't start with something complex and then simplify.
  • The hardest part of data science is getting good, clean data. Cleaning data is often 80% of the work.
  • Try to get clean data from the front end (i.e. the user) instead of cleaning it on the backend. For example, if you're trying to figure out what company someone works for, it's easier to guide them with auto-complete or "did you mean ___?" suggestions, rather than accepting whatever they type and trying to understand it later. You'd be surprised at the number of ways in which people can input the same thing if you don't give them any guidance.
  • Use humans in general and Mechanical Turk specifically for early versions of your product, then try to automate and streamline as desired.
  • Build easy products first. For example, start with collaborative filtering before diving into fully personalized recommendations.
  • Showing users their own data via charts, blog posts, etc. is a great way to engage them.
  • When showing data, think about 1) what you want the viewer to take away, 2) what actions you want them to take, 3) and how you want them to feel. UX is very important. Don't overload people with too much information or creep them out with inappropriate details.
  • Set user expectations low. If you set high expectations and screw up, it's very hard to regain a user's trust. For example, if you tell someone, "We know you will love XYZ!" and they don't like XYZ, they'll be skeptical of your future recommendations -- or even ignore them. If you reframe as, "Are you interested in XYZ? No? Okay, sorry!" then users will be more forgiving.
  • Unfortunately, the best way to test data products is in production. It's the only way to find out if your recommendations are effective and to learn about all of the warts and corner cases that lead to embarrassing mistakes. For example, how do you tell if your product suggestions are good? Show them to users and measure the effect that they have on spending/engagement/whatever you're hoping to improve.
  • Simple beats clever 9 times out of 10, but you need to be able to recognize when to build something sophisticated.
  • Try to augment humans and make them more efficient instead of trying to replace them. People generally dislike feeling unnecessary or replaceable.
  • Minimize the friction in your product. If you're asking users to answer questions or input data, make that as easy and painless as possible -- otherwise users won't do it. Nobody reads manuals and instructions anymore. Strive to make products that are as intuitive as the iPad or Angry Birds.
  • Rule of thumb: every time you ask for data, your conversion funnel takes a 10% hit. Try to keep all questions lightweight and easy to answer so that you can minimize the damage.
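The front-end cleaning advice above (guide input with "did you mean ___?" suggestions instead of repairing whatever users type later) can be sketched with Python's standard difflib module. The company list and similarity cutoff here are illustrative assumptions, not anything from the talk:

```python
import difflib

# A hypothetical canonical list; a real product would use its own database.
KNOWN_COMPANIES = ["LinkedIn", "Google", "Microsoft", "Radian6", "Factual"]

def did_you_mean(raw: str, cutoff: float = 0.6):
    """Suggest a canonical company name for a messy user-typed string.

    Matches case-insensitively; returns None when nothing is close enough,
    so the UI can fall back to accepting the raw input.
    """
    by_lower = {name.lower(): name for name in KNOWN_COMPANIES}
    hits = difflib.get_close_matches(raw.strip().lower(), list(by_lower),
                                     n=1, cutoff=cutoff)
    return by_lower[hits[0]] if hits else None

print(did_you_mean("linked in"))  # LinkedIn
print(did_you_mean("Googel"))     # Google
print(did_you_mean("Acme Corp"))  # None
```

Wiring a function like this into an auto-complete box keeps the garbage out of the database in the first place, which is the point of the bullet above.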

DJ's talk was great and I can vouch for many of these lessons personally based on my work at Google and Factual. I think the most valuable lesson that I've learned over the last decade is one that came up repeatedly during the talk: simple approaches are often surprisingly effective. For example, I remember one task where I had a sparse dataset and had to fill out as much of the missing data as possible. Instead of using fancy algorithms and sophisticated machine learning, I tried the following heuristic: for every pair of columns, if a value, X, in one column was associated with a value, Y, in another column almost all of the time, then every time the first column value was X and the second column value was missing, I'd set the second column value to Y. It was a very naive approach, and yet it managed to fill in a large chunk of my dataset. I've now used this heuristic for data about books, movies, places of interest, and other datasets, and it often makes more clever strategies unnecessary or not worth the time.
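A minimal sketch of that heuristic, assuming rows are Python dicts and using a 90% co-occurrence threshold (both choices are my assumptions, not details from the original work):

```python
from collections import Counter, defaultdict

def fill_by_cooccurrence(rows, col_a, col_b, threshold=0.9):
    """Fill missing col_b values using dominant col_a -> col_b associations.

    If a value X in col_a appears with the same col_b value Y in at least
    `threshold` of the rows where both columns are present, set col_b to Y
    wherever col_a == X and col_b is missing. Returns the number of fills.
    """
    # Count how often each col_a value co-occurs with each col_b value.
    pair_counts = defaultdict(Counter)
    for row in rows:
        a, b = row.get(col_a), row.get(col_b)
        if a is not None and b is not None:
            pair_counts[a][b] += 1

    # Keep only dominant associations X -> Y.
    rule = {}
    for a, counts in pair_counts.items():
        y, n = counts.most_common(1)[0]
        if n / sum(counts.values()) >= threshold:
            rule[a] = y

    filled = 0
    for row in rows:
        if row.get(col_b) is None and row.get(col_a) in rule:
            row[col_b] = rule[row[col_a]]
            filled += 1
    return filled

books = [
    {"publisher": "O'Reilly", "country": "US"},
    {"publisher": "O'Reilly", "country": "US"},
    {"publisher": "O'Reilly", "country": None},  # filled: O'Reilly -> US always
    {"publisher": "Indie", "country": "UK"},
    {"publisher": "Indie", "country": "US"},
    {"publisher": "Indie", "country": None},     # left alone: no dominant value
]
n_filled = fill_by_cooccurrence(books, "publisher", "country")
print(n_filled, books[2]["country"])  # 1 US
```

Running it over every pair of columns, in both directions, gives the naive-but-effective fill described above.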

Another important lesson that I've learned is that it's a great idea to work with small samples of data and use a single machine for as long as possible. Hadoop and distributed systems are nice when you're running in production on terabytes of data, but they greatly diminish your development speed and ability to experiment. You'll probably make progress much more rapidly if you just load a 500MB slice of data into RAM and experiment on your laptop.
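One standard way to carve out such a slice without ever loading the full file is reservoir sampling, sketched below with a small in-memory stand-in for the big file:

```python
import io
import random

def reservoir_sample(lines, k, seed=0):
    """One-pass uniform random sample of k items from a stream of unknown
    length (Algorithm R). Memory use is O(k), so it works on files far too
    big to load."""
    rng = random.Random(seed)
    sample = []
    for i, line in enumerate(lines):
        if i < k:
            sample.append(line)
        else:
            # Keep this line with probability k / (i + 1), evicting a
            # uniformly chosen current resident if so.
            j = rng.randint(0, i)
            if j < k:
                sample[j] = line
    return sample

# Stand-in for a huge log file: 100,000 records streamed line by line.
big_file = io.StringIO("".join(f"record {i}\n" for i in range(100_000)))
sample = reservoir_sample(big_file, k=1000)
print(len(sample))  # 1000
```

The same function works unchanged on `open("huge.log")`, so the laptop only ever holds the 1,000-line sample.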

After DJ's talk, I started thinking about the many blog posts that I've seen that focus on technologies that are commonly used for working with data: Hadoop, scipy, regular expressions, etc. I'd love to see more blog posts (and books) about higher level strategies and tactics for building data products. Posts that offer suggestions like "work with small samples"; "leverage Mechanical Turk"; and "start with the simplest approaches." I might turn a few of these topics into future blog posts, but I'm sure there are many lessons that I haven't learned yet. If you know of any great resources for creating data products, please mention them in the comment section!
