Bridging the developer and data scientist gap with cloud, notebooks and PixieDust

opinion
Sep 14, 2017 | 4 mins

When developers and data scientists work together, the benefits abound

A wealth of information hides in the vast amount of data produced every day: roadside sensors measuring traffic volume, medical imaging for rapid diagnosis, and satellites circling overhead analyzing weather patterns. In nearly every industry, the cloud enables exponential growth by providing cheap, remote storage of data, access through a variety of devices, and elastic compute for data processing at scale. But how can we capture the full potential of this data?

Doing so requires closer collaboration between data scientists and developers. As data-driven intelligence becomes a more integral component of nearly every function, from inventory management to personalized customer marketing, these two roles increasingly need to work in tandem. Yet many teams still struggle to do so, because they continue to work with different tools and in separate languages.

Notebooks, for example, are powerful, cloud-ready tools, but they often require experience with programming languages that are popular among data scientists, such as Python, which is valued for its strength in numerical analysis. Because of that Python base, notebooks in particular are often overlooked by developers, who typically prefer working with Java or Node.js.

However, notebooks offer tremendous potential to bridge the gap between developers and data scientists, and can bring benefits to both sides. Notebooks allow users to write and share code and rich text, all in one environment understood by both data scientists and developers. This allows them to work on the same data sets simultaneously, instead of following the traditional process in which developers hand off raw data to data scientists, who translate it into languages like Python for analysis and then give findings and models back to developers, who must translate them yet again into their preferred language, such as Java or HTML.

While this approach has worked in the past, today’s era of constant iteration and continuous demand for new, competitive features requires a more agile and connected approach than passing data around in a relay hand-off.

Let’s consider an example. If a marketing department wants to quickly build an application that generates real-time sentiment analysis from Twitter, team members can turn to their data science and development teams for support. More likely than not, the data scientist ingesting the social data and the developer building the dashboard will be working in different languages, which can cause friction, bottlenecks and time-to-market delays.

This is where notebooks come in, combined with the magic of PixieDust. PixieDust is an open source helper library for Jupyter notebooks that allows developers to explore data analysis models without having to learn or code in statistical languages. Fueled by the collaborative power of the cloud, PixieDust enables users to visualize data, build dashboards, and more efficiently share data findings within notebooks.  
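To make the sentiment example concrete, here is a minimal sketch of what a first notebook cell might look like. The word lists, sample tweets, and the label function are all hypothetical and invented for illustration; a real project would use a trained sentiment model and a live Twitter feed. The commented lines at the end assume PixieDust and pandas are installed in a Jupyter notebook, where PixieDust's display() can render a DataFrame as an interactive table or chart.

```python
from collections import Counter

# Hypothetical word lists and sample tweets, invented for illustration only.
POSITIVE = {"love", "great", "fast"}
NEGATIVE = {"hate", "slow", "broken"}


def label(text):
    """Label a tweet by counting positive vs. negative words (toy heuristic)."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


tweets = [
    "Love the new checkout, so fast!",
    "App is broken again. Hate it.",
    "Just installed the update.",
]

# One row per tweet, in a shape that converts directly to a DataFrame.
rows = [{"tweet": t, "sentiment": label(t)} for t in tweets]
counts = Counter(r["sentiment"] for r in rows)
print(dict(counts))

# Inside a Jupyter notebook with PixieDust and pandas installed (an assumption),
# the same rows could be handed to PixieDust for interactive exploration:
#   import pixiedust
#   import pandas as pd
#   display(pd.DataFrame(rows))
```

The point of the sketch is the shape of the workflow rather than the scoring logic: the data scientist can swap in a better model inside label, while the developer explores the same rows visually without rewriting anything.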

By using notebooks and PixieDust together, the data scientist and developer can work in their preferred language in the same notebook. This means a developer can obtain early insights into raw data at the same time a data scientist begins working with the same sets—allowing both sides to immediately view trends worth exploring, as well as communicate feedback around potential new features, without waiting for the typical translation to be completed first.

PixieDust allows different languages to be used in tandem by abstracting out trends and patterns in data. It turns these insights into visualizations, rather than lines of code, that almost any user can interpret, including non-technical line-of-business users.

PixieDust supports a data strategy tuned for cloud and AI by improving developer productivity, and quickly turns data into logic that users across a business can understand. It offers value for developers and data scientists alike by tapping into the power of the cloud to understand data, and helps them to work together to quickly identify business opportunities through data visualization. PixieDust works to derive meaning from numbers and allows intelligence to be delivered to developers, data scientists and business users with clarity.

When developers and data scientists work together, the benefits abound. Just imagine what you and your teams could build once you remove the barriers that stand in the way of what’s possible with data.

David Taieb is a Distinguished Engineer with the Watson Data Platform Developer Advocacy team at IBM, where he leads a team of avid technologists with the mission of educating developers on the art of the possible with cloud technologies. He’s passionate about building open source tools, such as the PixieDust Python library for the Jupyter Notebook and Apache Spark, that help improve developers’ productivity and overall experience.

Previously, David was the lead architect for the Watson Core UI and Tooling team based in Littleton, Massachusetts, where he led the design and development of a Unified Tooling Platform to support all the Watson Tools, including accuracy analysis, test experiments, corpus ingestion, and training data generation. Before that, he was the lead architect for the Domino Server OSGi team responsible for integrating the eXpeditor J2EE Web Container in Domino and building first-class APIs for the developer community.

David started with IBM in 1996, working on various globalization technologies and products including Domino Global Workbench and a multilingual content management system for the WebSphere Application Server. David enjoys sharing his experience by speaking at conferences and meeting as many people as possible. You’ll find him at various events like the Strata Data Conference, Velocity and IBM InterConnect.

The opinions expressed in this blog are those of David Taieb and do not necessarily represent those of IDG Communications, Inc., its parent, subsidiary or affiliated companies.
