Building The LinkedIn Knowledge Graph with Machine Learning
Adam Rifkin stashed this in Interest Graph!
LinkedIn's knowledge graph was built with machine learning.
LinkedIn’s knowledge graph is a large knowledge base built upon “entities” on LinkedIn, such as members, jobs, titles, skills, companies, geographical locations, schools, etc. These entities and the relationships among them form the ontology of the professional world and are used by LinkedIn to enhance its recommender systems, search, monetization and consumer products, and business and consumer analytics.
Creating a large knowledge base is a big challenge. Websites like Wikipedia and Freebase primarily rely on direct contributions from human volunteers. Other related work, such as Google's Knowledge Vault and Microsoft's Satori, focuses on automatically extracting facts from the internet for constructing knowledge bases. Different from these efforts, we derive LinkedIn’s knowledge graph primarily from a large amount of user-generated content from members, recruiters, advertisers, and company administrators, and supplement it with data extracted from the internet, which is noisy and can have duplicates. The knowledge graph needs to scale as new members register, new jobs are posted, new companies, skills, and titles appear in member profiles and job descriptions, etc.
To solve the challenges we face when building the LinkedIn knowledge graph, we apply machine learning techniques, which is essentially a process of data standardization on user-generated content and external data sources, in which machine learning is applied to entity taxonomy construction, entity relationship inference, data representation for downstream data consumers, insight extraction from graph, and interactive data acquisition from users to validate our inference and collect training data. LinkedIn’s knowledge graph is a dynamic graph. New entities are added to the graph and new relationships are formed continuously. Existing relationships can also change. For example, the mapping from a member to her current title changes when she has a new job. We need to update the LinkedIn knowledge graph in real time upon member profile changes and when new entities emerge.
Construction of entity taxonomy:
For LinkedIn, an entity taxonomy consists of the identity of an entity (e.g., its identifier, definition, canonical name, and synonyms in different languages, etc.) and the attributes of an entity. Entities are created in two ways:
Organic entities are generated by users, where informational attributes are produced and maintained by users. Examples include members, premium jobs, companies created by their administrators, etc.
Auto-created entities are generated by LinkedIn. Since the member coverage of an entity (number of members who have this entity) is key to the value that data can drive across both monetization and consumer products, we focus on creating new entities for which we can map members to. By mining member profiles for entity candidates and utilizing external data sources and human validations to enrich candidate attributes, we created tens of thousands of skills, titles, geographical locations, companies, certificates, etc., to which we can map members.
To date, there are 450M members, 190M historical job listings, 9M companies, 200+ countries (where 60+ have granular geolocational data), 35K skills in 19 languages, 28K schools, 1.5K fields of study, 600+ degrees, 24K titles in 19 languages, and 500+ certificates, among other entities.
Normalizing the data:
Entities represent the nodes in the LinkedIn knowledge graph. We need to clean up user-generated organic entities, which can have meaningless names, invalid or incomplete attributes, stale content, or no member mapped to them. We inductively generate rules to identify inaccurate or problematic organic entities. For auto-created entities, the generation process includes:
Generate candidates. Each entity has a canonical name which is an English phrase in most cases. Entity candidates are common phrases in member profiles and job descriptions based on intuitive rules.
Disambiguate entities. A phrase can have different meanings in different contexts. By representing each phrase as a vector of top co-occurred phrases in member profiles and job descriptions, we developed a soft clustering algorithm to group phrases. An ambiguous phrase can appear in multiple clusters and represent different entities.
De-duplicate entities. Multiple phrases can represent the same entity if they are synonyms of each other. By representing each phrase as a word vector (e.g., produced by a word2vec model trained on member profiles and job descriptions), we run a clustering algorithm combined with manual validations from taxonomists to de-duplicate entities. Similar techniques are also used to cluster entities if the taxonomy has a hierarchical structure.
Translate entities into other languages. Given the power-law nature of the member coverage of entities, linguistic experts at LinkedIn manually translate the top entities with high member coverages into international languages to achieve high precision, and PSCFG-based machine translation models are applied to automatically translate long-tail entities to achieve high recall.
Insights can be extracted from the graph:
Additional knowledge can be inferred on top of the standardized knowledge graph, generating insights for business and consumer analytics. For example, by conducting OLAP to selectively aggregate graph data from different points of view, we can generate real-time insights such as the number of members who have a given skill in a given location (supply), the number of job hires requiring a given skill in that same location (demand), and finally the sophisticated skill gap after considering both supply and demand ends. We can also constrain the data analytics into a certain time range for fetching retrospective insights. The below figure lists the top ten most in-demand soft skills that can help job seekers stand out from other candidates based on data analytics on member profile updates between June 2014 and June 2015.
Insights help leaders and sales make business decisions, and increase member engagement with LinkedIn. For example, the above insights encourage members to add those soft skills to their profiles or learn them in LinkedIn online courses.