Learn data science: 44 essential resources for developers

Erik Sherman, Journalist, Independent

Data science is hot, and the path to becoming a data scientist can scorch. That's why there aren't many who possess all of the skills that fall under the title of data scientist. This scarcity means that data scientists cost a lot of money, but companies are looking to undercut the perks and compensation full-fledged data scientists can command by breaking the role up into multiple ones: data engineer, scientist, visualization designer, manager, and so on.

In other words, developers have a faster back door into the position. Someone must write the software, implement the models, run the machine learning libraries, and otherwise enable the data science. There is a need for programming specialists who know the subject. That can be you. 

Data science bootcamps demand a substantial investment of time over a short period, as well as a significant payment. If that doesn't appeal to you, here is a collection of no- or low-cost training and tools to get up to speed on the field of data science. (We've divided them into six categories that we will update over time.)

Foundational math

Probability and Statistics, Self-Paced

A free and self-paced introduction to probability, statistics, and basic statistical inference from Stanford University.

MIT OpenCourseWare, Mathematics

A large variety of math courses from MIT, including course resources, available for free online use. Includes courses in single-variable calculus, multi-variable calculus, probability and statistics, and linear algebra.

Calculus Lifesaver: A Free Online Course from Princeton

A review of calculus, in case your differentiation and integration are a little rusty.

Linear Algebra - Foundations to Frontiers

A free, self-paced course covering linear transformations, matrices, systems of linear equations, vector spaces, and other areas important to working with large datasets.

A First Course in Linear Algebra

A free textbook on linear algebra, available for download or online reading. Also available is the exercise and solutions manual.

Statistical modeling

Statistical Learning, Self-Paced

A Stanford online course that focuses on regression and classification methods, with an emphasis on modern data analysis without heavy reliance on formulas and complex mathematics.

Statistical Reasoning, Self-Paced

Another Stanford online course with an emphasis on statistical inference and understanding distribution and data relationships.

Multilevel Modeling Online Course

From the University of Bristol, this course moves from quantitative research to multilevel modeling of continuous and discrete data.

Data Analysis & Statistical Inference

A Duke University online course about how to collect and analyze data, make statistical inferences, and draw conclusions.

Foundations of Data Analysis - Part 2: Inferential Statistics

From the University of Texas at Austin (through edX), this archived course covers how to use data samples and inferential statistics, including t-tests, chi-square, ANOVA, and regression.
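To give a feel for the kind of calculation a t-test involves, here is a minimal sketch in Python that uses only the standard library. It computes Welch's t-statistic for two independent samples. The function name and the sample values are made up for illustration and are not taken from the course.

```python
import math
import statistics


def welch_t_statistic(sample_a, sample_b):
    """Return Welch's t-statistic for two independent samples."""
    mean_a = statistics.mean(sample_a)
    mean_b = statistics.mean(sample_b)
    # statistics.variance() is the sample variance (n - 1 denominator).
    var_a = statistics.variance(sample_a)
    var_b = statistics.variance(sample_b)
    # Standard error of the difference between the two sample means.
    standard_error = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean_a - mean_b) / standard_error


# Hypothetical measurements for two groups.
group_a = [1.0, 2.0, 3.0, 4.0, 5.0]
group_b = [2.0, 3.0, 4.0, 5.0, 6.0]

t_value = welch_t_statistic(group_a, group_b)
print(t_value)
```

Turning the statistic into a p-value also requires the degrees of freedom and the t distribution, which statistical libraries typically handle for you.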

Data science software tools

Think Bayes: Bayesian Statistics Made Simple

A free downloadable or online book on Bayesian statistics with implementations in Python.
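As a taste of what a Bayesian calculation in Python can look like, here is a minimal sketch of a single Bayes' theorem update over a set of hypotheses. The function name, the two-bowl scenario, and the probabilities are illustrative assumptions, not code from the book.

```python
def bayes_update(priors, likelihoods):
    """Return the posterior probability of each hypothesis.

    priors: dict mapping hypothesis -> prior probability
    likelihoods: dict mapping hypothesis -> P(data | hypothesis)
    """
    # Multiply each prior by the likelihood of the observed data.
    unnormalized = {h: priors[h] * likelihoods[h] for h in priors}
    # Normalize so the posteriors sum to 1.
    total = sum(unnormalized.values())
    return {h: p / total for h, p in unnormalized.items()}


# Hypothetical setup: a cookie is drawn from one of two bowls chosen at random.
priors = {"bowl_1": 0.5, "bowl_2": 0.5}
# Assumed chance of drawing a vanilla cookie from each bowl.
likelihoods = {"bowl_1": 0.75, "bowl_2": 0.5}

posterior = bayes_update(priors, likelihoods)
print(posterior)
```

Because bowl 1 is assumed to hold a higher share of vanilla cookies, drawing a vanilla cookie shifts belief toward bowl 1.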

Statistics with R Specialization

A series of statistics courses from Duke (through Coursera) focused on using R and RStudio. R is a popular statistical programming language. The series runs $59 a month for a certificate, although it is possible to audit any of the courses in the series for no charge to see if they would be helpful.

Jupyter Notebook Tutorial: The Definitive Guide

A practical introduction to Jupyter Notebook and how to use it in data science projects.

BeautifulSoup series

A series of articles on using BeautifulSoup, a Python library for parsing HTML and XML that is useful for scraping data from websites.
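To show the basic idea behind parsing a page to pull out data, here is a minimal sketch that collects link targets from a snippet of HTML. Note that it uses Python's built-in html.parser module rather than BeautifulSoup, so it runs without installing anything; the class name and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


# Hypothetical page content; a real scraper would download this first.
SAMPLE_HTML = """
<html><body>
  <a href="https://example.com/one">One</a>
  <p>No link here.</p>
  <a href="https://example.com/two">Two</a>
</body></html>
"""

collector = LinkCollector()
collector.feed(SAMPLE_HTML)
print(collector.links)
```

Libraries such as BeautifulSoup wrap this kind of tag-by-tag handling in a more convenient search interface, which is why they are popular for scraping.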

Web Scraping with Beautiful Soup

A Stanford tutorial on using BeautifulSoup with Python to scrape data from websites.

Selenium Tutorial: Web Scraping with Selenium and Python

A tutorial for using the Selenium library with Python to simulate web surfing and scrape data.

NumPy

A scientific computing package for Python used to manipulate and manage n-dimensional arrays.

SciPy.org

A Python-based ecosystem of open-source software for mathematics, science, and engineering. The libraries include NumPy, SciPy, Matplotlib, pandas, and more.

statsmodels

A Python statistics library to explore data and perform statistical tests.

scikit-learn

A Python machine learning library.

Apache Spark

A cluster computing framework that enables large-scale data processing.

Data Science and Engineering with Apache Spark

A course from the University of California at Berkeley on using Spark.

Intro to D3.js

An introduction to one of the most popular JavaScript libraries for creating data visualizations in a browser.

Seaborn: statistical data visualization

A Python data visualization library.

DB-Engines

An information and research center on various databases in the software industry.

State of the Database Landscape

An infographic that maps the categories of various databases and analytics platforms. It's useful for choosing the right database for your use case. The 2016 version requires some info to download, but the 2014 version does not.

Advanced techniques

Top-down learning path: Machine Learning for Software Engineers

A multi-month, self-taught study plan for developers to understand and implement machine learning.

Introduction to Machine Learning for Developers

A guide to understanding the basics of machine learning, whether for building a chat bot or using big data to create a recommendation engine for an app.

Machine Learning in a Year

A self-study course to begin using machine learning without a heavy background in mathematics.

Computational and Inferential Thinking: The Foundations of Data Science

The textbook for the Foundations of Data Science class at the University of California, Berkeley.

Natural language processing: an introduction

An overview and tutorial for natural language processing and NLP system design.

Introduction to Natural Language Processing

An introduction to computerized processing of natural languages.

Geospatial Analysis

A free web version of a book about undertaking the geospatial analysis of data.

Data visualization

Data Analysis: Visualization and Dashboard Design

A course from TU Delft (through edX) on advanced techniques for robust data analysis in a business environment, including importing, summarizing, interpreting, analyzing, and visualizing data.

D3 Tutorials

Practical, step-by-step tutorials on creating visualizations with JavaScript.

Comprehensive Guide to Data Visualization in R

A guide to visually rendering the results of statistical analysis done in R.

Designing Data Visualizations

A downloadable O'Reilly book on how to approach the design aspects of visually presenting data.

Cross-category training and tools

Introduction to Probability and Statistics

Although MIT's OpenCourseWare offerings are mentioned in the first section, this course in particular covers a wider range of topics. It starts with basic probability and works through applied statistics, including Bayesian inference, linear regression, and resampling methods.

How to Become a Data Scientist (Part 1/3)

This might be the resource you want to read first. It briefs you on the skills you need and the steps you may have to take to get into commercial data science. It's also useful if you're a recruiter vetting candidates for a data science position. Also available are Part 2 and Part 3.

The Python Data Science Handbook

A recently published, free book that provides all the getting-started and reference topics needed to harness Python's many data science-related libraries.

OSS University Data Science

A GitHub repo that includes a collection of free courses to learn data science.

Going Pro in Data Science: What It Takes to Succeed as a Professional Data Scientist

A free ebook from O'Reilly about the skills necessary for real-world data science and how to approach the necessary coding.

Top 50 Data Science Resources: The Best Blogs, Forums, Videos and Tutorials to Learn All about Data Science

A list of great resources on many aspects of data science.

Excel to MySQL: Analytic Techniques for Business

A Duke University online series about performing sophisticated data analysis with Excel, Tableau, and MySQL.

Update 1:

Here are some additional resources recommended by the community and the TechBeacon editors after publication:

The "Joel Test" for Data Science

Over a decade ago, software engineering luminary Joel Spolsky wrote 12 questions to determine the maturity of your software engineering team. Domino Data Lab came up with its own Joel Test for assessing the maturity of your data science program and individual data scientists. This is a useful resource for building some of the baseline requirements you might want a data scientist candidate to have.

Machine Learning for Product Managers

Similar to this article, Machine Learning for Product Managers is also a collection of resources for getting started with machine learning in your organization. There are 18 links to blogs, courses, and books, some of which you've already found in this collection.

The Best Machine Learning Course for Beginners

The machine learning Coursera course from Stanford has gotten rave reviews and is taught by Andrew Ng, an associate professor at Stanford, a chief scientist at Baidu, and a co-founder of Coursera itself.

46 Questions on SQL to Test a Data Science Professional

This is another good resource for testing the skills of someone claiming to be a data scientist. The authors assert, "If there is one language, every data science professional should know – it is SQL." The questions are multiple choice, and they focus on practical aspects and challenges people encounter while using SQL.

Being a Data Scientist: My Experience and Toolset

Jefferson Heard explores what he does as a data scientist, with very detailed stories about his various experiences in the field. The article also contains a list of links to all of the tools he's used in his work.

Are there any resources that you find particularly useful in the field of data science? Share them in the comments below and we will consider adding them to our evolving guide.

 
