A software engineer's guide to data science
It’s not a question of if developers are going to be working with data scientists—it’s a matter of when. Over the past few years, businesses have begun to realize that data science is the key to solving some of their most pressing problems, and they are bringing more and more data scientists into the workforce. So what does it mean for you as a software developer?
To get the answers, I asked Dr. Nicole Forsgren, director of organizational performance and analytics at Chef Software, and Ohad Assulin, chief data scientist at Hewlett Packard Enterprise Software, to explain what data scientists actually do and how you as a software engineer can work effectively with them—and perhaps add a few of those in-demand data science skills to your own CV.
What data scientists do
Data science is simply the conversion of data to knowledge. Data may be structured or unstructured, and unstructured data can take many forms, such as text, images, or video. A data scientist uses data to inform business decisions, either directly or by analyzing the data and presenting it in a way that helps others make those decisions.
Data scientists can have many different roles and responsibilities in the business. For example, they might:
- Educate the business. The business is often unaware that it has challenges that data science can solve. Part of the data scientist’s job is to explain how data science can help.
- Look for problems to solve. A data scientist who understands the business is in an excellent position to seek out and identify data-related challenges where the solution can bring value to the business.
- Research new techniques. New problems surface daily, and techniques are continually evolving to meet the challenges. Data scientists can be deeply involved in researching new algorithms and making known algorithms more effective.
- Collate data for analysis. This is often referred to as extract, transform, and load (ETL), the process of extracting data from different sources, transforming it into a format that makes it easier to work with, and then loading it into a system for processing.
- Crunch numbers. Data scientists must be well versed in techniques such as statistical analysis, exploratory analysis, and predictive analysis, and must be able to identify and apply the appropriate algorithms to the data.
- Implement algorithms. Many data scientists are software developers too, producing code that becomes part of a product. They are fluent in modern software development and delivery techniques.
- Design big data-capable architecture. The infrastructure required for data science is often different from the infrastructure for other types of projects. Data scientists need to be familiar with the state-of-the-art tools, many of which are open source.
- Present insights. The results of an analysis must be presented in a format that the nontechnical or non-mathematical person can easily understand and, more importantly, act on.
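The "collate data" step in the list above can be made concrete with a minimal ETL sketch in Python. Everything in it is hypothetical: the CSV content, the column names, and the in-memory SQLite table are illustrative assumptions rather than part of any real pipeline.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would come from a file or an API.
RAW_CSV = """user_id,signup_date,plan
1,2016-01-05,Free
2,2016-01-07,PRO
3,,pro
"""

def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop rows without a signup date and normalize the plan name."""
    cleaned = []
    for row in rows:
        if not row["signup_date"]:
            continue  # incomplete record; skip it
        cleaned.append((int(row["user_id"]), row["signup_date"], row["plan"].lower()))
    return cleaned

def load(records, conn):
    """Load: write the cleaned records into a table ready for analysis."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS signups (user_id INTEGER, signup_date TEXT, plan TEXT)"
    )
    conn.executemany("INSERT INTO signups VALUES (?, ?, ?)", records)
    conn.commit()

def run_etl(text=RAW_CSV):
    """Run all three steps against an in-memory database and return the connection."""
    conn = sqlite3.connect(":memory:")
    load(transform(extract(text)), conn)
    return conn
```

Calling `run_etl()` returns a connection whose `signups` table holds only the complete, normalized rows. Keeping the three steps as separate functions makes each one easy to test and replace on its own.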
In a small organization, a data scientist might perform all of these roles, while larger organizations may have a team of data scientists that includes specialists in different areas.
Forsgren says the data scientist role is an evolution of the business intelligence (BI) role, and a data scientist might take on some or all of these responsibilities in an organization. While the traditional BI role was typically more database-centric, often analyzing offline data, data scientists tend to have a stronger background in statistics, predictive analytics techniques, and the implementation of algorithms on real-time or near-real-time data. At the end of the day, however, the two roles are not drastically different.
Assulin's organization employs many different types of data scientists. Some, particularly those in Hewlett Packard Labs, are focused on an academic perspective, researching new generic techniques that can be applied in different contexts. Others, such as Assulin and his colleagues in HPE Software, are focused on a specific business, developing a deep understanding of the business, and solving specific problems for the business.
What data science means for software developers
To understand what data science means for software developers, you need to understand the answers to three questions:
- How can data scientists make the software development lifecycle (SDLC) more efficient?
- How can they help you improve your product?
- How are they involved in developing software for your product?
To make your SDLC more efficient, Forsgren says, you need to think about your goal and keep in mind that performance and effectiveness are best measured at the team level, rather than at the individual level. For example, if your goal is to increase your software development productivity, you don't want to count the number of lines of code each developer writes. But you will want to capture metrics such as the number of deploys, stability of the codebase, number of pull requests, and amount of branching and merging. If that data isn’t available, the data scientist will need to work with the developers and the operations engineers to make it available, for example by getting access to the source code repository (such as a Git repository). The data scientist can then paint a picture of the current status and recommend improvements to reduce build and delivery time and to increase quality. Companies are gradually realizing that they need to measure in order to improve. As they say at Etsy, “If it moves, graph it,” and that applies to everything in the SDLC.
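As a rough illustration of measuring at the team level, the sketch below counts delivery events per team rather than per developer. The event records, team names, and event kinds are all hypothetical; real data would come from your repository and deployment tooling.

```python
from collections import Counter, defaultdict

# Hypothetical event log, invented for illustration only.
EVENTS = [
    {"team": "payments", "kind": "deploy"},
    {"team": "payments", "kind": "pull_request"},
    {"team": "payments", "kind": "deploy"},
    {"team": "search", "kind": "pull_request"},
    {"team": "search", "kind": "merge"},
]

def team_metrics(events):
    """Count events of each kind per team (deliberately not per individual)."""
    metrics = defaultdict(Counter)
    for event in events:
        metrics[event["team"]][event["kind"]] += 1
    # Convert to plain dicts so the result is easy to print or serialize.
    return {team: dict(counts) for team, counts in metrics.items()}
```

The records carry no author field on purpose: the aggregation cannot slide into per-developer scoring, which is the kind of measurement the paragraph above warns against.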
To improve the product, data scientists need to examine production data, such as server metrics, server logs, and application logs. But a good data scientist first takes a step back, asking, “What are the questions I can ask?” and, “What data do I need to answer them?” The data scientist may need to ask developers to add hooks to capture additional data, if the existing production data is insufficient. Once the right data has been captured, however, the data scientist can turn it into concrete recommendations.
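A "hook" of the kind described above can be as small as one function that writes a structured record each time something interesting happens. The following is a minimal sketch under stated assumptions: `record_event`, `checkout`, and the field names are hypothetical and do not come from any particular product or logging library.

```python
import io
import json
import time

def record_event(sink, name, **fields):
    """Append one JSON line describing an application event to a file-like sink."""
    event = {"event": name, "timestamp": time.time(), **fields}
    sink.write(json.dumps(event, sort_keys=True) + "\n")
    return event

def checkout(sink, cart_total):
    """Hypothetical application function with a data-capture hook added."""
    started = time.perf_counter()
    result = round(cart_total * 1.2, 2)  # stand-in for real business logic
    record_event(
        sink,
        "checkout",
        cart_total=cart_total,
        duration_s=time.perf_counter() - started,
    )
    return result

# Usage: any writable text stream works as the sink, such as an open log file.
demo_sink = io.StringIO()
checkout(demo_sink, 10.0)
```

Writing one JSON object per line keeps the output easy to parse later, so the data scientist can load it without custom log-scraping code.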
When data scientists are developing software, they could be writing anything from pseudo-code to fully productized code, for things from data collection to number crunching to visualizing and presenting the results. Assulin points out that a big data project needs a specific platform on which to run, such as Hadoop or Spark, and each requires a particular architecture and design. So data scientists must work closely with a data engineer or big data architect. These people know how to deal with all aspects of managing the data, although they might not be experts in algorithms. The two roles are not always filled by different people, however; many data scientists double as data engineers or big data architects.
What data scientists produce
What a data scientist delivers depends on what you want them to do for you. If you’re asking for insight into the kinds of problems on which they can help or an analysis based on data, you’ll get a report or presentation expressed in plain business language that all stakeholders can understand.
In many cases, though, they produce code. This may come in the form of prototypes and demos that developers later roll up into the main product, or production-quality code that they write and deliver directly into the product. For this to be effective, Forsgren says, data scientists must work closely with other developers.
Assulin says that data scientists must give the consumer something that’s easy to work with, whether in the form of a library or microservice, that integrates easily with the main product’s code. In the past, data scientists wrote algorithms in the language with which they were most familiar, even when the developers didn't know the language, just to prove that the algorithms worked. Many times, the algorithm would be inefficient, and the developers would need to rewrite it. Today, the data scientist is expected to deliver code that is well designed, well written, secure, and performant, just like the rest of the product’s code. However, Assulin cautions that a lone data scientist may be limited by not having anyone else to bounce ideas off of, and the highly mathematical nature of the code can make code reviews difficult.
A data scientist's development toolkit
Python and R are the most common programming languages data scientists use. These open-source languages are supported by large communities, and you can find many add-on packages designed for data science applications, such as NumPy and SciPy for Python. These languages are interpreted, rather than compiled, leaving the data scientist free to focus on the problem rather than nuances of the language. Scala, a compiled language, is also quite popular among data scientists. Forsgren and Assulin note that these languages are free, while commercial tools such as MATLAB, Stata, and SPSS can be expensive.
Testing data science code is challenging because validation can be highly mathematical. Data scientists do test their own code and algorithms, but, as Assulin says, that’s like the fox guarding the henhouse. Nevertheless, part of data science is defining the validation, and testers can work with that. Forsgren says data scientists should be writing well-documented and repeatable code and following standard development practices, including code reviews, just as any other developer should do. If you have more than one data scientist on your team, they should do code reviews for each other.
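One common way to make such validation testable is to run the algorithm on synthetic data whose correct answer is known in advance. The sketch below assumes a deliberately simple case, an ordinary least-squares line fit written from scratch. The function and test names are hypothetical, and a real project would more likely test a library-backed model the same way.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def test_recovers_known_line():
    """Noise-free data generated from y = 2x + 1 must give back slope 2, intercept 1."""
    xs = [0.0, 1.0, 2.0, 3.0, 4.0]
    ys = [2.0 * x + 1.0 for x in xs]
    slope, intercept = fit_line(xs, ys)
    assert math.isclose(slope, 2.0)
    assert math.isclose(intercept, 1.0)

def test_shift_changes_only_intercept():
    """Adding a constant to every y must leave the slope unchanged."""
    xs = [0.0, 1.0, 2.0, 3.0, 4.0]
    ys = [2.0 * x + 1.0 for x in xs]
    slope_a, intercept_a = fit_line(xs, ys)
    slope_b, intercept_b = fit_line(xs, [y + 5.0 for y in ys])
    assert math.isclose(slope_a, slope_b)
    assert math.isclose(intercept_b, intercept_a + 5.0)
```

A tester does not need to follow the derivation to maintain checks like these. The data scientist states the property, such as "shifting the data shifts only the intercept," and the test encodes it.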
Data scientists might be assigned to a features team, or they might be part of a dedicated data science team. Assulin says that data scientists in an agile development environment tend to work at a different cadence from the rest of the team, because their work requires lots of research. If they’re working on a features team, their delivery will be quicker, but possibly at the cost of quality. As they move away from the features team and toward a dedicated data science team, their delivery will be of higher quality, but perhaps at a slower pace. Startups with a well-defined data problem should include their data scientists in the features teams, whereas larger organizations with a variety of problems and data will do better with a team of data scientists who can support one another while providing data science services to the rest of the organization.
Why the business needs data science
Ideally, a business analyst or product owner should present a problem to the data scientist, who will then attempt to solve it. But in practice, the business person is often unaware that the problem is solvable with data science, or even that there is a problem at all.
One of Assulin’s roles is to educate business analysts and product owners on techniques and analytical tools that are available to the business, such as explaining how the data scientist can make predictions about the future based on past history, gather insights with data clustering, or make recommendations based on user behavior. He doesn’t go too deeply into the technical details, but he makes sure that the business is aware of what he can offer. He also conducts workshops with product managers to elicit difficult problems that could be solved using data science.
Forsgren says that businesses today should aim to be metrics-driven or, at the very least, metrics-informed. Business is increasingly competitive in every industry, and a data scientist can help you identify competitive and strategic advantages.
If your organization already has a data scientist, he or she can help you recruit. If you have no data scientists on staff, Assulin recommends getting help from outside of your organization. “It’s like Java—you can’t interview someone for a Java position effectively if you don’t know Java yourself,” he says. Forsgren agrees and adds that you should also ask business-related questions, such as, “How do you see a data scientist adding value to my business?”
The first data scientist to join your business should have initiative and understand what value he or she can bring. If the data scientist is not an expert in your business, he or she should at least be able to suggest the types of data that you need to collect and where you can find it. Less experienced candidates should be able to cite at least some contribution to a data science project, for example as part of a data science boot camp or university-level project.
Some data scientists have a background in physics, mathematics, or computer science. But Forsgren notes that people coming from a purely scientific or technical background could be limited if they lack a grounding in business. She says that an MBA or other further degree, such as one in HR, healthcare, or medical informatics, in addition to a technical background, can be incredibly powerful. Ideally, the data scientist will have both business and technical skills, but a data scientist who is stronger on the business side can work closely with a more technical data scientist, so that they complement each other. Assulin adds that a further degree provides a grounding in research, which is an essential skill for any data scientist.
Data science for developers: How to come up to speed
If you want to become a data scientist yourself, this is the perfect time to do it. Online course providers such as Coursera, edX, and Udacity have good data science tracks. Some of the courses are quite advanced and taught by well-known data scientists, such as Andrew Ng and Sebastian Thrun.
You can also find boot camps that take people with a computer science, mathematics, or physics background and teach them how to solve a data-centric business problem from scratch. Kaggle offers data-science competitions sponsored by companies such as Facebook and Wix, which may recruit participants who score well.
Assulin says that when he’s recruiting data scientists, his baseline is a computer science degree, or at least significant experience in software development, because data scientists are expected to write production-level code that is part of the product.
Increasing demand for data scientists
The demand for data scientists continues to increase, and more and more software engineers are working with data scientists in their organizations every day. These data scientists might be involved in education across the organization or deeply immersed in the implementation of statistical algorithms. They may be designing big data architectures or presenting to C-level executives. Or they could be doing all of these things. As a developer with a clear understanding of data science concepts and how data scientists work, you'll be positioned to collaborate with data scientists while expanding your own expertise in this growing discipline.