What Popular Baby Names Teach Us About Data Analytics
A typical big data analysis goes like this: First, a data scientist finds some obscure data accumulating on a server. Next, he or she spends days or weeks slicing and dicing the numbers, eventually stumbling upon some unusual insights. Then, a meeting is organized to present the findings to business managers, after which the scientist feels disgruntled or even disrespected, while the managers wish they could have the time back.
When these meetings fail, the main points of contention usually include unclear purpose; analyses that are too narrowly focused; and over-confidence in the science, which turns off non-technical managers. If you’re facing this situation, you should read the FiveThirtyEight article on mining the baby names dataset. When you’re done, send the article to your analytics team.
What FiveThirtyEight’s Nate Silver and Allison McCann did with the baby names dataset sets an example for all data analysts: They imbued it with a relevant business problem, attached complementary data, made a bold but acceptable assumption to patch a hole in the data, and elaborated their conclusion with a margin of error. Their article represents the best of data journalism. It surpasses most examples of big data analytics as we know it.
Curated by the Social Security Administration (SSA), the dataset of the first names of all newborn Americans since 1880 is a star of big data. In the past few years, the baby names dataset has been mined to death (pardon the pun). Its fame can be traced to computer scientist Martin Wattenberg, who created the Baby Names Voyager, a user-friendly interface for visualizing the baby names. The purpose of the Voyager is to investigate which names were popular when. Since Wattenberg, a line of analysts has pursued numerous projects, such as the most trendy names, the most poisoned names, and the most distinctive name by state.
All this slicing and dicing has produced insights that are little more than sound bites or click bait. And then, Silver and McCann entered the picture.
They imbued the data with a relevant business problem.
Instead of asking what names were popular (or poisoned or trendy or distinctive) in a given period of time, the two data journalists turned the question around and investigated whether someone’s first name provides sufficient information to guess when he or she was born.
This framing of the issue immediately reminds me of the real-world problems of guessing someone’s religion or languages spoken from his or her name, place of residence, and other factors. Many sophisticated businesses use such demographic data to develop customer segmentation. If your business purchases third-party data with those variables, you are already benefiting from the type of analysis Silver and McCann presented. (In practice, direct information on people’s age is more available than religion or languages.)
They attached complementary data.
It is rarely the case that one dataset contains all of the information needed to solve a business problem. The SSA data have information on births but not on deaths. A simple averaging of the birth dates of every Elizabeth ever born leads to a vastly over-stated average age because some of those people are no longer living. To perform the analysis properly, the data journalists incorporated actuarial life tables, which contain estimates of death rates.
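The effect of ignoring deaths can be seen in a minimal sketch. The birth counts and survival probabilities below are invented for illustration and are not SSA or actuarial figures; the reference year of 2014 and the helper name `mean_age` are also assumptions. Every person born in a given year gets the same survival probability, whatever his or her name.

```python
# Toy figures, invented for illustration only (not SSA or actuarial data).
REFERENCE_YEAR = 2014

# birth year -> number of babies given the name that year
births = {1920: 1000, 1960: 1000, 2000: 1000}

# birth year -> assumed probability of still being alive in REFERENCE_YEAR
survival = {1920: 0.10, 1960: 0.90, 2000: 0.99}


def mean_age(weights):
    """Average age in REFERENCE_YEAR, weighting each birth year by `weights`."""
    total = sum(weights.values())
    return sum((REFERENCE_YEAR - year) * w for year, w in weights.items()) / total


# Naive: treat everyone ever born as if still alive.
naive = mean_age(births)

# Adjusted: weight each birth year by the estimated number of survivors.
living = {year: n * survival[year] for year, n in births.items()}
adjusted = mean_age(living)

print(f"naive mean age:    {naive:.1f}")
print(f"adjusted mean age: {adjusted:.1f}")
```

With these made-up inputs, the naive average is 54.0 years and the survival-adjusted average is about 36.1 years, which shows how far the two can diverge when older cohorts have mostly died.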
They patched a hole in the data.
Actuaries, however, do not care about first names. The death rates can be split by gender, but not by name. The analyst could give up on the project at this stage, or make an assumption and trudge forward. Silver and McCann chose the latter route by assuming that death rates do not vary by first name. This is, without a doubt, a bold move, but one I’m comfortable with because it allows the analysis to reach a satisfactory state. Data analysts often face this type of decision in the course of any big data work. (You can see key analytical decisions in the footnotes of the article.)
They elaborated their conclusion with a margin of error.
The powerful graphics in the article clearly display the potential error sustained if one uses first names to predict a person’s age. Silver and McCann showed that the level of accuracy depends on gender and on the shape of the popularity trend. In some of the better examples, they can bracket someone’s age to within 10 years with 50% confidence. All too often, media reports of big data analyses omit any quantification of their accuracy, a harsh irony given the field’s trumpeting of the scientific method.
All the lessons described here apply easily to any business analytics team. Instead of generating sound bites with scant business relevance, data scientists should consult their business partners early and agree on an interesting business problem before digging into the data. As gigantic as many of today’s datasets are, they may still lack important variables, thus requiring augmentation. Big data analysis is highly valued because it can provide useful predictions, but analysts err when they fail to include a margin of error. Sound business decisions require understanding not only the most likely scenario, but also the range of possibilities. As the discipline of data science and analytics evolves, the process of generating business insights will improve, and there will be less all-around frustration when teams meet about data projects.
Kaiser Fung is a professional statistician for Vimeo and author of Junk Charts, a blog devoted to the critical examination of data and graphics in the mass media. His latest book is Number Sense: How to Use Big Data to Your Advantage. He holds an MBA from Harvard Business School, in addition to degrees from Princeton and Cambridge Universities, and teaches statistics at New York University.
Picture Mildred, Agnes, Ethel and Blanche. Perhaps you imagine the Golden Girls or your grandmother’s poker game. These are names for women of age, wisdom and distinction. The median living Mildred in the United States is now 78 years old.
Now imagine Madison, Sydney, Alexa and Hailey. They sound like the starting midfield on a fourth-grade girls’ soccer team. And they might as well be: the median American females with these names are between 9 and 12 years old.
There are quite a lot of websites devoted to tracking the popularity of American baby names over time. (The data ultimately comes from the Social Security Administration, which records birth names dating back to 1880.) But we haven’t seen anyone ask the age of living Americans with a given name.
The method for determining the answer is quite simple: All you really need is the SSA’s baby name database and its actuarial tables, which estimate how many people born in a given year are still alive.
Below, for example, is a chart of Josephs. It shows how many American boys named Joseph were born in each year since 1900. And it shows how many of them are still alive today, assuming that Josephs die at the same rate as other American males.
The peak year for boys named Joseph was 1914 — when about 39,000 of them were born. Those 1914 Josephs would be due to celebrate their 100th birthdays at some point this year. But only about 130 of them were still alive as of Jan. 1.
Joseph has been one of the most enduring American names; it’s never gone out of fashion. So knowing that a man is named Joseph doesn’t tell you very much about his age. The median living Joseph is 37 years old, and the interquartile range (that is, the range spanning the 25th through 75th percentiles) runs from 21 to 56. In other words, a quarter of living Josephs are older than 56 and a quarter are younger than 21; the rest are somewhere in between. Not very helpful.
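For readers who want to try the calculation, here is a rough sketch of how a median age and an interquartile range could be derived from births and survival estimates. It is not the authors' code: the counts, the survival probabilities, the 2014 reference year and the function name `weighted_quantile` are all assumptions made for illustration.

```python
# Hypothetical counts for one name, invented for illustration (not SSA data).
REFERENCE_YEAR = 2014

# birth year -> babies given the name; birth year -> assumed share still alive
births = {1930: 4000, 1950: 3000, 1970: 2500, 1990: 2500, 2010: 2000}
survival = {1930: 0.25, 1950: 0.80, 1970: 0.96, 1990: 0.98, 2010: 0.995}

# Estimated number of living people at each age in REFERENCE_YEAR.
living_by_age = {
    REFERENCE_YEAR - year: births[year] * survival[year] for year in births
}


def weighted_quantile(counts_by_age, q):
    """Smallest age at which the cumulative share of living people reaches q."""
    total = sum(counts_by_age.values())
    running = 0.0
    for age in sorted(counts_by_age):
        running += counts_by_age[age]
        if running >= q * total:
            return age
    raise ValueError("empty distribution")


q25 = weighted_quantile(living_by_age, 0.25)
median_age = weighted_quantile(living_by_age, 0.50)
q75 = weighted_quantile(living_by_age, 0.75)

print(f"median age: {median_age}, interquartile range: {q25} to {q75}")
```

With these made-up inputs, the sketch reports a median age of 44 and an interquartile range of 24 to 64. The wider that range, the less a name says about its bearer's age.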
By contrast, you can make much stronger inferences about a woman named Brittany. That name was very popular from the mid-1980s through the mid-1990s, but it wasn’t all that common before and hasn’t been since. If you know a Brittany, she is probably of college age or just a bit older. Half of living American Brittanys are between the ages of 19 and 25.
We can run these calculations for any name in the SSA’s database — for instance, for the 25 most popular male names since 1900. Joshuas, Andrews and Matthews are the youngest of these, with median ages of 22, 24 and 26. Georges and Donalds are the oldest, each with a median age of 59.
The data for the top 25 female names is more dynamic. The median Emily is just 17 years old; the median Dorothy is 74.
Girls’ names typically cycle in and out of fashion more quickly than boys’ names, which means that they have narrower interquartile ranges. For instance, almost half of living Lisas are now in their 40s, meaning that they were born at some point between 1964 and 1973.
However, there are some exceptions — most notably Anna, which is a remarkably well-enduring girl’s name. The name Anna steadily declined in popularity from 1900 to 1950; however, many of those older Annas are no longer with us, and the name has remained at reasonably steady levels of popularity since then. Thus, while a quarter of living Annas are younger than 14, another quarter are older than 62.
Boys are catching up when it comes to fashionable names that reveal a lot about their age. Do you know a Liam, an Aiden, a Jayden or a Mason? Their median ages are 3, 4, 4 and 6, respectively. How about a Noah, an Elijah or an Isaiah? They are 8, 8 and 9. (The charts that follow are restricted to birth names given to at least 100,000 Americans of a particular gender since 1900.)
By contrast, the majority of living Hermans, Howards, Harrys, Harolds, Harveys and Herberts are in their 60s, or older. And the oldest male name is Elmer, with a median age of 66.
Eva, Mia, Sophia, Ella and Isabella might be friends with Mason and Liam in their kindergarten classes. The median girls with these names are between 5 and 8 years old.
We’ve already listed some of the oldest female names, but we didn’t mention the oldest of all: Gertrude. The median living Gertrude is 80 years old; a quarter of Gertrudes are older than 87. (Note also the presence of Betty and Wilma, the names of the “Flintstones” wives, on the oldest names list. Betty and Wilma are not quite prehistoric. But they are each now a median of 73 years old.)
Other names have unusual distributions. What if you know a woman — or a girl — named Violet? The median living Violet is 47 years old. However, you’d be mistaken in assuming that a given Violet is middle-aged. Instead, a quarter of Violets are older than 78, while another quarter are younger than 4. Only about 4 percent of Violets are within five years of 47.
Lolas, Stellas and Claras also have highly bimodal distributions.
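A small sketch shows why the median can mislead for a bimodal name. The age counts below are invented for illustration, not taken from the SSA data, and the helper names `weighted_median` and `share_near` are hypothetical.

```python
# Invented counts of living people by age for a name popular long ago and
# popular again recently (illustrative only, not SSA data).
living_by_age = {2: 300, 5: 190, 45: 20, 50: 20, 80: 250, 85: 220}


def weighted_median(counts_by_age):
    """Smallest age at which at least half of the living people are covered."""
    total = sum(counts_by_age.values())
    running = 0
    for age in sorted(counts_by_age):
        running += counts_by_age[age]
        if 2 * running >= total:
            return age
    raise ValueError("empty distribution")


def share_near(counts_by_age, center, window):
    """Share of living people whose age is within `window` years of `center`."""
    total = sum(counts_by_age.values())
    near = sum(n for age, n in counts_by_age.items() if abs(age - center) <= window)
    return near / total


median_age = weighted_median(living_by_age)
share = share_near(living_by_age, median_age, 5)

print(f"median age: {median_age}")
print(f"share within five years of the median: {share:.0%}")
```

Here the median lands at 45, in the trough between the two peaks, yet only 4 percent of the hypothetical population is within five years of it. Reporting the quartiles alongside the median exposes this shape.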
This pattern is slightly less common among male names. But it does occur occasionally, perhaps partly as an unfortunate consequence of the movie “Titanic.” The two male names with the widest age spreads are Leo (as in DiCaprio) and Jack (as in Dawson, the character he played in the film).
Jack died in the end, so let’s end on a morbid note. Out of all Americans given a particular name since 1900, how many have since died?
These results are highly similar to the lists of the oldest names, although slightly more Mabels (90.8 percent) have died than Gertrudes (89.4 percent). Elmer is the deadest common male name, at a 79.2 percent fatality rate. But if the list were liberalized to include more infrequent names, Hyman (91.3 percent), Eino (89.7 percent) and Isidore (87.2 percent) would do a better job of keeping up with the ladies, ’til death did they part.
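The share of a name's bearers who have died follows from the same two inputs: total births and estimated survivors. The sketch below uses invented numbers (not SSA or actuarial figures), and the function name `deceased_share` is an assumption.

```python
# Invented figures for one old-fashioned name (illustrative only).
# birth year -> babies given the name; birth year -> assumed share still alive
births = {1910: 5000, 1930: 3000, 1950: 2000}
survival = {1910: 0.01, 1930: 0.30, 1950: 0.80}


def deceased_share(births_by_year, survival_by_year):
    """Fraction of everyone given the name who is estimated to have died."""
    total_born = sum(births_by_year.values())
    total_living = sum(
        n * survival_by_year[year] for year, n in births_by_year.items()
    )
    return 1 - total_living / total_born


share_dead = deceased_share(births, survival)
print(f"estimated share deceased: {share_dead:.1%}")
```

With these made-up inputs, about 74.5 percent of the people ever given the name are estimated to have died. Names whose births are concentrated in early cohorts score highest on this measure, which is why the list resembles the list of oldest names.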
Well, that's a lot of data that supports the claim that names peak in popularity and then fade back out.