The term "data scientist" is tossed around loosely these days, so much so that it's creating a bit of confusion in the tech industry.
Mo Data stashed this in Big Data Hype Cycle
Don't try to find one superhuman who does it all. You need three experts: business analyst, machine learning expert, and data engineer, says Lithium Technologies chief scientist.
Is there really a data scientist shortage, or are organizations simply trying too hard to recruit a unicorn, a jack-of-all-trades who possesses both advanced technical and business acumen?
If the unicorn hypothesis is true, it would explain why the scarcity of data scientists is expected to worsen in the coming years.
The solution isn't difficult, some industry insiders believe, but it might prove unpopular with cost-conscious organizations that are unable or unwilling to hire a data science team rather than a single data scientist.
Dr. Michael Wu is chief scientist of Lithium Technologies, a San Francisco-based company that sells social customer experience management software to businesses. Not surprisingly, Lithium captures a lot of data on consumer behavior, and part of Wu's job is to analyze that information and predict customer actions on an aggregate level.
"What the industry calls a 'data scientist' now is really several different roles," said Wu in a phone interview with InformationWeek. "When people say there's a shortage of data scientists, (they mean) there is a shortage of people with all of these different skills."
Wu subdivides the data scientist role into three distinct jobs, each requiring a different skill set: business analyst, machine learning expert, and data engineer.
"You need these three groups of people to work together in order to inform the business decision-makers," said Wu.
The role of business analyst existed long before the terms "big data" or "data scientist" were in vogue. This person works with front-end tools, meaning those closest to the organization's core business or function, such as Microsoft Excel, Tableau Software's visualization tools, or QlikTech's QlikView BI apps. A business analyst might also have sufficient programming skills to code up dashboards, and have some familiarity with SQL and NoSQL.
"They analyze business-level data and try to produce actionable insights," said Wu. "A lot of companies have (these) people."
The recent hype surrounding big data, however, has led many business analysts to rebrand themselves as data scientists even though they are not, according to Wu's definition.
"It automatically gives them a little boost in their salary," Wu said, chuckling.
The second data science role is that of machine-learning expert, a statistics-minded person who builds data models and makes sure the information they provide is accurate, easy to understand, and unbiased.
"These are the people who develop algorithms and crunch numbers," said Wu. "They are interested in building models that predict something."
A machine-learning expert, for instance, might develop algorithms that predict consumer sentiment or estimate a person's influence in a particular industry.
"There are even machine-learning algorithms that look at images and tag them automatically, or look at videos and try to understand what the video is about," said Wu.
Like the business analyst, the machine-learning expert isn't a new profession, but rather one that's existed "in the last 30 years or so," Wu estimated.
The third key job, data engineer, is "the bottom layer, the foundation," said Wu. "They are the ones who play with Hadoop, MapReduce, HBase, Cassandra. These are people interested in capturing, storing, and processing this data… so that the algorithm people can build models and derive insights from it."
However, it's nearly impossible to find one person -- that data scientist unicorn -- who excels in each of these three areas, Wu said. And that's why organizations must focus instead on building a data science team.
Amen. It feels like anyone can call himself or herself a data scientist.
I'd actually go further than the three roles specified. I would add 'data alchemist' to the mix, maybe not as a separate role, but as one that definitely needs to be present. Let's say the organization is generating raw data from a consumer-facing application, say an ecommerce store - the business analyst is going to figure out that consumer-driven recommendations based on some sort of collaborative filtering are required, the data scientist will sort out the algorithm, and the machine learning guy is going to leverage actual purchase or click-through data to improve the accuracy. However, there is some magic where the users can be persuaded to part with some more information in return for a little more value, and that's where the alchemy comes in - designing a holistic end-to-end system that is centered around data. In our consumer-facing application, could consumers create lists of gifts they would buy for their friends, and could those lists be emailed to friends who are not currently users? That would increase the data available for the collaborative filtering and may also increase the user population. This is where the data alchemist demonstrates their value.
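To make the collaborative filtering idea a bit more concrete, here is a minimal sketch of item-based filtering on purchase data. Everything in it is an assumption for illustration: the `purchases` data, the item names, and the `item_similarity` and `recommend` helpers are hypothetical, and a real store would likely use a dedicated library and far more signal (click-throughs, gift lists, and so on).

```python
from collections import defaultdict
from math import sqrt

# Hypothetical purchase data: user -> set of items bought (illustrative only).
purchases = {
    "ann": {"book", "lamp", "mug"},
    "bob": {"book", "mug"},
    "cat": {"lamp", "rug"},
    "dan": {"book", "mug", "rug"},
}


def item_similarity(purchases):
    """Cosine similarity between items, computed from binary purchase data."""
    # Invert the data: item -> set of users who bought it.
    buyers = defaultdict(set)
    for user, items in purchases.items():
        for item in items:
            buyers[item].add(user)

    sim = defaultdict(dict)
    for a in buyers:
        for b in buyers:
            if a != b:
                overlap = len(buyers[a] & buyers[b])
                sim[a][b] = overlap / sqrt(len(buyers[a]) * len(buyers[b]))
    return sim


def recommend(user, purchases, top_n=2):
    """Score items the user does not own by similarity to items they do own."""
    sim = item_similarity(purchases)
    owned = purchases[user]
    scores = defaultdict(float)
    for item in owned:
        for other, s in sim[item].items():
            if other not in owned:
                scores[other] += s
    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, score in ranked[:top_n] if score > 0]


print(recommend("bob", purchases))
```

In this sketch, every extra purchase or gift list a user contributes adds entries to `purchases`, which is the sense in which more data would sharpen the recommendations.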
LinkedIn and Facebook definitely employ people to do this. I guess soon other companies will too.
You sound like you disapprove, Adam? Also how's the re-design coming?
The redesign is coming slowly, mostly because the site grows without redesign.
Do I disapprove of data alchemy being used to turn users into product? Yes I do.
I believe data alchemy should be used to improve the lives of users, not to sell users to customers.
Very good point, users should not become the products. My notion of Data Alchemy is not really about products or exploiting users, but about building 'systems' where data is the fuel for recursive improvements. That data might come from sensors as well as social sources and the idea of the alchemy is to work out some sort of turbocharger where data exhaust is recycled and used to generate extra value.
My favorite example is Google - indexing websites and offering search for free. The exhaust was the search and click history of the users, which, when aggregated and sliced by site / page instead of by user, produces a very useful piece of data that powers AdWords. A little oversimplified perhaps, but that's how I see the alchemy.
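A rough sketch of what "sliced by site / page instead of user" could look like. This is not how Google actually does it: the `click_log` records, the example URLs, and the `queries_by_page` helper are all made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical click log: (user, query, clicked_page). Illustrative data only.
click_log = [
    ("u1", "running shoes", "shoes.example/trail"),
    ("u2", "running shoes", "shoes.example/trail"),
    ("u3", "running shoes", "shoes.example/road"),
    ("u1", "rain jacket", "outdoor.example/jackets"),
    ("u2", "trail shoes", "shoes.example/trail"),
]


def queries_by_page(log):
    """Aggregate the log per page, discarding the user field entirely."""
    by_page = defaultdict(Counter)
    for _user, query, page in log:
        by_page[page][query] += 1
    return by_page


by_page = queries_by_page(click_log)
for page, counts in sorted(by_page.items()):
    print(page, counts.most_common(1))
```

The point of the sketch is that the per-page counts carry the value (which queries lead to which pages) without keeping any per-user history in the output.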
I have strong beliefs regarding personal data which is where IMHO privacy starts to get eroded - but that's another topic...
Google USED to be pseudonymous / anonymous but now they too mix in social data.
Did Google have a choice but to turn their data mining onto personal sources? Does any for-profit company have an alternative path if personal data is available, especially if their competitors are using it? Pretty much anyone who is doing online advertising is optimizing based on personal click-history - not sure if there is much of an alternative. Is it too late now that our personal data is out there in the open - has the horse already bolted?
It's not too late to say no to using personal data. It just requires the desire.
In 2000, I was wondering about personal data and how valuable it was to become:
Today, as we advise organizations on how to turn data into value, I am conscious of a missing industry-standard infrastructure through which organizations can act as custodians of our personal data. I think Oracle has ownership of the Sun Liberty Alliance patent where a federated data model would allow that to happen, but I can't see anything being implemented anytime soon, not while there is so little regard for our personal data.
Right, you made the point that the economic incentive is to have little regard for personal data.
If the economics change, companies' behavior will change.