juan_sequeda
Contributor

Who should be responsible for your data? The knowledge scientist

opinion
Oct 30, 2019 · 5 mins

Organizations that recognize the importance of clean and reliable data while elevating knowledge work will move faster along the path to true data-driven decision-making


How can you build a data-driven culture and spur digital transformation without thinking through who should be responsible for your data? Let’s do that together.

Data engineers and data scientists each occupy critical roles. Data engineers manage the data infrastructure and are in charge of designing, building, and integrating data workflows, pipelines, and the ETL process. Their goal is to provide data for data scientists’ analysis. Data scientists are those who can turn data into insights by applying statistics, machine learning, and analytical approaches. Their goal is to answer critical business questions.

Data-driven organizations require reliable, clean data to function. Without it, your AI, machine learning, and analytics are worthless. Unreliable, erroneous, and incomplete data leads to answers that can’t be trusted—hence, “garbage in, garbage out.”  

Therefore, the process of wrangling and cleaning data is crucial, often said to be 80% of a data scientist’s work. Typically, this is seen as boring, annoying grunt work people don’t want to do.

However, I think this negative view is at least partly based on a major underappreciation of the significance of such work. Data wrangling and cleaning is not simply about eliminating white spaces, replacing wrong characters, and normalizing dates. Stepping back, these tasks should be viewed in the context of two key objectives:

  1. Understanding the ecosystem of people, data, and tasks in an organization
  2. Communicating and documenting that knowledge in order to generate clean and reliable data

Yes, data wrangling and cleaning can take 80% of a data scientist’s time and energy. This does not mean that 80% is wasted. While these tasks can and should be optimized for efficiency, they are part of the vital knowledge work that should be elevated within a data-driven organization. But who should be doing it?

Who should be responsible for data?

In typical organizations, the need for reliable data is constant, but the knowledge work that creates it is ad hoc. Practices and results are not documented and shared because data scientists are usually not equipped, trained, or incentivized to do so. Indeed, in our experience, a lot of the “softer” knowledge work (like conference calls, discussions, whiteboarding sessions, documentation, long Slack chats) required to create clean and reliable data is not valued by data scientists or their managers. Making matters worse, most tools are designed and provisioned for a small set of user types and teams to the exclusion of other user types and teams. Thus, the responsibility to create and manage reliable data is siloed, scattered, or even non-existent.

I argue that data scientists should not be responsible for creating and managing reliable and clean data because their responsibility is to turn data into insights. Instead, I call for a new role which must be developed to fill this critical need: the knowledge scientist.

Who is a knowledge scientist?

A knowledge scientist is a person who builds bridges between business requirements, questions, and data. The goal of the knowledge scientist is to document knowledge by gathering information from business users, data scientists, data engineers, and their environments in order to make data more useful for AI, machine learning, business intelligence, data analytics, and more.

From a hard skills perspective, knowledge scientists should work with business users and demonstrate what they have learned by using skills and techniques such as data modeling, knowledge representation, and ontology engineering. The output is a data model that represents how the business user sees the world. Knowledge scientists should align this data model with other models derived from talking to other business users.

Furthermore, while working with data engineers, the knowledge scientist should be fluent in data access and transformation methods such as query and programming languages. They should transform the data being provided by the data engineer and map it to the business meaning provided by the business user. They should be conversant in analytical and machine learning methods.
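To make the mapping step concrete, here is a minimal sketch of how a knowledge scientist might record the link between raw columns and business meaning. Everything in it is hypothetical and for illustration only: the column names (`cust_no`, `ord_amt`, `flg_x`), the glossary entries, and the helper `to_business_view` are assumptions, not part of any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BusinessTerm:
    """One documented piece of business meaning (hypothetical structure)."""
    name: str            # term the business user actually uses
    definition: str      # meaning agreed on with the business user
    source_column: str   # raw column delivered by the data engineer


# Hypothetical glossary gathered from conversations with business users.
GLOSSARY = [
    BusinessTerm("customer_id", "Identifier the sales team uses for a customer", "cust_no"),
    BusinessTerm("order_total_usd", "Order value in US dollars, tax included", "ord_amt"),
]


def to_business_view(raw_row, glossary=GLOSSARY):
    """Rename raw columns to business terms and list columns nobody has documented yet."""
    column_to_term = {term.source_column: term.name for term in glossary}
    mapped = {}
    undocumented = []
    for column, value in raw_row.items():
        if column in column_to_term:
            mapped[column_to_term[column]] = value
        else:
            undocumented.append(column)
    return mapped, sorted(undocumented)


# A raw record as it might arrive from a pipeline (made-up values).
raw = {"cust_no": "C-1001", "ord_amt": 249.5, "flg_x": 1}
business_row, undocumented = to_business_view(raw)
```

The list of undocumented columns is the interesting part: each entry is a question to take back to the people who know what the data means, which is the knowledge work the article describes.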

Knowledge work is people work. From a soft skills perspective, the knowledge scientist should have excellent communication skills that can be applied to both the business user and the data engineer. The knowledge scientist should be both a “people person” and a “geek.”

The knowledge science discipline has its roots in the knowledge engineering approaches of the 1980s and 1990s. In that world, skills such as knowledge acquisition, knowledge elicitation, and knowledge specification were taught and used. These are lost arts in industry today, particularly in the data science context. I believe that revisiting these approaches will be a key part of developing both the instructional curriculum and the tooling needed to support the knowledge scientist.

Organizations that recognize the central importance of clean and reliable data while elevating knowledge work will be at the forefront of digital transformation and will move faster along the path to becoming data-driven. Who are the knowledge scientists in your organization?

Juan F. Sequeda is the Principal Scientist at data.world. He joined through the acquisition of Capsenta, a company he founded as a spin-off from his research. He holds a PhD in Computer Science from The University of Texas at Austin.

Juan is a recipient of the NSF Graduate Research Fellowship. He received 2nd Place in the 2013 Semantic Web Challenge for his work on ConstituteProject.org, Best Student Research Paper at the 2014 International Semantic Web Conference, and the 2015 Best Transfer and Innovation Project awarded by the Institute for Applied Informatics. Juan is on the Editorial Board of the Journal of Web Semantics and is a member of multiple program committees (ISWC, ESWC, WWW, AAAI, IJCAI). He was the General Chair of AMW2018, PC chair of the ISWC 2017 In-Use track, and co-creator of the COLD workshop (co-located at ISWC for 7 years). He has served as a bridge between academia and industry as the current chair of the Property Graph Schema Working Group, a member of the Graph Query Languages task force of the Linked Data Benchmark Council (LDBC), and a past invited expert member and standards editor at the World Wide Web Consortium (W3C).

Wearing his scientific hat, Juan's goal is to reliably create knowledge from inscrutable data. His research interests lie at the intersection of Logic and Data for (ontology-based) data integration and semantic/graph data management, and what are now called Knowledge Graphs.

Wearing his business hat, Juan is a product manager who does business development, strategy, and technical sales, and works with customers to understand their problems and translate them back to R&D.

The opinions expressed in this blog are those of Juan Sequeda and do not necessarily represent those of IDG Communications, Inc., its parent, subsidiary or affiliated companies.