Data Science Roadmap: Become Fluent in 8 Stages


The Data Science Roadmap guides you through every step of your journey to becoming a Data Science Practitioner, from the foundation stage to the advanced stage.

This post is the first in a series of articles, each of which will focus on one phase and its stages. It covers the roadmap in general and what to expect from it.

Purpose

The purpose of this roadmap is to cut through the noise in the Data Science field, which is a constant source of confusion for beginners and professionals alike.

The common issues faced by both beginners and professionals are listed below:

  1. Not knowing where to start the Data Science journey.
  2. Not knowing how to proceed once the journey has begun.
  3. Struggling to make sense of the latest developments in the Data Science space.

To address this confusion, I created this roadmap to act as a kind of virtual coach. It shows you three things:

  1. Where you are right now in your Data Science journey.
  2. What the next steps are.
  3. How to keep yourself up to date.

Introduction

The roadmap has 3 phases:

  1. Foundation
  2. Professional
  3. Advanced

Each phase is composed of several stages. Phase 1 is the core of the roadmap and lays the groundwork.

Phase 2 is all about the practical skills you require to accomplish various tasks. Together with Phase 1, it gives you the foundational knowledge to perform most tasks as a Data Science Practitioner.

Phase 3 is where you learn the hot buzzwords in the industry right now: for instance, where Kubernetes and Kubeflow fit in and how to leverage them for Data Science workloads.

You may be wondering which phase to start with:

  1. Start at the beginning if you are a complete beginner.
  2. Start with Phase 2 if you need to understand the tools required for day-to-day work.
  3. Start with Phase 3 if you already know the fundamentals well and want to learn the next big thing.

Foundation

This phase is all about understanding Data Science better and developing the necessary technical skills. It has three stages.

Scope

This stage helps you understand everything you need to know about Data Science as a whole. Additionally, it provides context to your Data Science journey and clarifies critical questions you might have, such as what Data Science is and what it is not.

Engagement

Connecting with the key influencers in the field and learning from them is indispensable. Therefore, in this stage, you get to know the top blogs, Twitter accounts, and online communities you need to follow.

Fundamental Skills

This stage is the core of this phase. Undoubtedly, you need to learn the essential technical skills to succeed in the Data Science field. Please do not overlook any of these skills, as they form the foundation on which everything else builds.

Do not make the mistake of assuming that these fundamental skills are insignificant. They are the skills that demand the utmost attention as you progress through your career.

SQL and Data Warehousing Concepts

SQL and Data Warehousing concepts are some of the most critical skills you need to develop. Most people make the mistake of underestimating SQL and only realize its importance later in their careers. It is, therefore, always better to build a solid understanding of these concepts early.

Moreover, a firm grip on SQL also helps with Apache Spark's SQL interface in Phase 2.

In this division, you get to know the vital SQL concepts for Data Science and why you need them.
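To make this concrete, here is a minimal sketch of the kind of set-based thinking SQL rewards, using Python's built-in sqlite3 module; the table and data are made up for illustration:

    import sqlite3

    # An in-memory database with a made-up sales table.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("north", 120.0), ("south", 80.0), ("north", 60.0)],
    )

    # Set-based thinking: one declarative query computes per-region
    # totals instead of looping over rows in application code.
    query = "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
    for region, total in conn.execute(query):
        print(region, total)  # north 180.0, then south 80.0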

Python and Software Engineering Concepts

The ability to write Pythonic code and to compose modular, loosely coupled architectures is essential. In short, it is as important as knowing machine learning algorithms.

Knowing how to code in a Jupyter Notebook is not enough; you need to understand how to write modular code that follows time-tested software engineering principles. This also comes in handy when building machine learning pipelines.

In this division, you get to know how to design Pythonic interfaces for Data Science and why doing so is crucial.
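As a taste of what that looks like, here is a minimal sketch of a loosely coupled, Pythonic interface; the class and method names are illustrative and loosely follow the fit/transform convention popularized by scikit-learn:

    class Scaler:
        """Scales values to [0, 1] based on the range seen in fit()."""

        def fit(self, values):
            self._lo, self._hi = min(values), max(values)
            return self  # returning self enables fluent chaining

        def transform(self, values):
            span = (self._hi - self._lo) or 1.0  # guard against a zero range
            return [(v - self._lo) / span for v in values]

    # Because every step exposes the same small interface, steps can be
    # swapped or chained without touching each other's internals.
    data = [3.0, 7.0, 5.0]
    print(Scaler().fit(data).transform(data))  # [0.0, 1.0, 0.5]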

Scientific Python Stack

Your best friends are Numpy, Scipy, and Pandas. Before you turn to Pandas, take time to understand Numpy thoroughly. Pandas is just a layer on top of Numpy.

Furthermore, when people need an operation that is not readily apparent in the Pandas documentation, they often fall back on iterative procedures. This is highly inefficient and should be avoided.

Therefore, the ability to write efficient vectorized code with Numpy should become second nature to you.

Numpy and SQL are similar in some respects; both operate on multiple rows at once. It is, therefore, critical that you build SQL expertise before you switch to Numpy.

In this division, you get to know why vectorized code matters and how to leverage it for Data Science purposes.
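Here is a minimal sketch of the difference, comparing an iterative Python loop with the equivalent vectorized Numpy expression; the data is random and the sizes are arbitrary:

    import numpy as np

    prices = np.random.rand(1_000_000)
    quantities = np.random.rand(1_000_000)

    # Iterative version: one Python-level operation per element.
    total_loop = 0.0
    for p, q in zip(prices, quantities):
        total_loop += p * q

    # Vectorized version: the whole multiply-and-sum runs in compiled code.
    total_vec = float(np.dot(prices, quantities))

    print(np.isclose(total_loop, total_vec))  # True, but the loop is far slower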

Machine Learning Concepts

Having a decent understanding of Linear Algebra and Calculus is mandatory. You need to know the intuition behind the maths, as that is what builds an intuitive understanding of the fundamental machine learning algorithms.

Apart from that, it also helps to focus on how to build production machine learning pipelines.

Your expertise in software engineering comes into play here. In short, you need an intuitive understanding of the maths and of how to build machine learning pipelines. This is true whether you are a data scientist or a data engineer.

In this division, you get to know fundamental machine learning algorithms and how to build pipelines.
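As a preview, here is a minimal sketch of such a pipeline, using scikit-learn as one common choice (the roadmap does not prescribe a particular library):

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Each step follows the same fit/transform convention, so the whole
    # pipeline can be fit, evaluated, and deployed as a single object.
    pipeline = Pipeline([
        ("scale", StandardScaler()),
        ("model", LogisticRegression(max_iter=1000)),
    ])
    pipeline.fit(X_train, y_train)
    print(pipeline.score(X_test, y_test))  # accuracy on held-out data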

Professional

Once your fundamentals are in place, becoming a professional requires a few more skills. This phase has three stages.

Big Data Formats

Knowing the difference between column-based storage and row-based storage is essential, because each has its own performance implications.

Moreover, it helps to know when to use which type of storage format. The right choice dramatically reduces the amount of data processed and can lead to significantly faster data processing.

In this division, you get to know various Big Data formats and their performance implications for Data Science workloads.
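Here is a minimal sketch of the idea, writing the same made-up data as row-oriented CSV and column-oriented Parquet; it assumes pandas with a Parquet engine such as pyarrow installed:

    import pandas as pd

    # A made-up events table.
    df = pd.DataFrame({
        "user_id": range(5),
        "country": ["IN", "US", "DE", "IN", "US"],
        "amount": [10.0, 20.0, 5.0, 7.5, 12.0],
    })

    df.to_csv("events.csv", index=False)  # row-oriented text format
    df.to_parquet("events.parquet")       # column-oriented binary format

    # Reading a CSV forces a scan of every row; a columnar format lets
    # the reader touch only the column a query needs.
    amounts = pd.read_parquet("events.parquet", columns=["amount"])
    print(amounts.head())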

Massive Scale Analysis (MSA)

This stage has two components: distributed computing and deep learning.

Having a good understanding of various Big Data frameworks such as Apache Spark is ideal. Sometimes, when using Pandas, you will run into datasets that do not fit in memory.

In such situations, a good understanding of Apache Spark and its DataFrame- and SQL-based APIs serves you well.
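Here is a minimal sketch of the same aggregation written against both of Spark's interfaces; it assumes pyspark is installed, and the input path is a made-up example:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("msa-sketch").getOrCreate()
    df = spark.read.parquet("events.parquet")  # hypothetical input path

    # The DataFrame API...
    df.groupBy("country").agg(F.sum("amount").alias("total")).show()

    # ...and the equivalent SQL interface, where your SQL skills carry over.
    df.createOrReplaceTempView("events")
    spark.sql(
        "SELECT country, SUM(amount) AS total FROM events GROUP BY country"
    ).show()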

Then comes deep learning. Understand the use cases of common deep learning frameworks such as PyTorch and TensorFlow. Their GPU-based acceleration helps in solving a wide range of difficult problems, and the potential for solving new problems with deep learning keeps growing day by day.
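As a small illustration of that acceleration, here is a minimal PyTorch sketch where the same tensor code runs on a GPU whenever one is available:

    import torch

    # Fall back to the CPU when no GPU is present; the rest of the code
    # is unchanged either way.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    x = torch.randn(1024, 1024, device=device)
    y = torch.randn(1024, 1024, device=device)
    z = x @ y  # the matrix multiply runs on the GPU when device is "cuda"
    print(z.device, z.shape)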

Trying to learn deep learning without first thoroughly understanding the fundamentals is like shooting yourself in the foot. In short, to appreciate the breadth and applicability of deep learning, you need strong fundamentals beforehand.

In this stage, you get to know what distributed systems are, the latest Big Data tools, and how to leverage them for the Data Science workloads of the 21st century.

Cloud-Based Massive Scale Analysis (MSA)

With the growth of various cloud vendors, deploying machine learning models in the cloud is now a reality, and the practice is growing fast.

Consequently, leveraging the machine learning APIs available from the various cloud vendors helps with massive-scale analysis and deployment.

Tools like managed Hadoop environments and machine learning workflow frameworks are real game-changers: they make it much easier to build and deploy massive-scale models.

In this stage, you get to know the cloud-based machine learning platforms and how to leverage them for the Data Science workloads of the 21st century.
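As one hypothetical illustration, here is a minimal sketch of calling a cloud-hosted prediction API over HTTP; the endpoint URL, payload shape, and credential are all made up, and each vendor's real API differs:

    import requests

    # Hypothetical endpoint and credential; substitute your vendor's real ones.
    ENDPOINT = "https://ml.example.com/v1/models/churn:predict"
    TOKEN = "..."

    response = requests.post(
        ENDPOINT,
        json={"instances": [{"tenure_months": 12, "plan": "basic"}]},
        headers={"Authorization": f"Bearer {TOKEN}"},
        timeout=10,
    )
    response.raise_for_status()
    print(response.json())  # e.g. a predicted churn probability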

Advanced

This phase is for advanced Data Science Practitioners. It has two stages.

Containerized Machine Learning Deployments

Understanding containers and container orchestration is becoming an important skill. Therefore, the core focus of this stage is how to leverage Docker and Kubernetes for machine learning deployments.
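To give a flavour, here is a minimal sketch of the kind of service you would package into a Docker image and run on Kubernetes: a small Flask app wrapping a stand-in model (Flask is just one common choice, not a requirement):

    from flask import Flask, jsonify, request

    app = Flask(__name__)

    def predict(features):
        # Stand-in for a real model loaded once at startup.
        return sum(features) / len(features)

    @app.route("/predict", methods=["POST"])
    def serve():
        features = request.get_json()["features"]
        return jsonify({"prediction": predict(features)})

    if __name__ == "__main__":
        # Inside a container this port is exposed, and a Kubernetes
        # Service routes traffic to it.
        app.run(host="0.0.0.0", port=8080)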

Serverless Machine Learning Deployments

This is the most volatile stage, with developments happening all the time. In this stage, I will help you understand the fundamentals of serverless computing and function-as-a-service (FaaS).

Therefore, the core focus of this stage is how to leverage serverless technologies for machine learning deployments.
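As a taste, here is a minimal sketch of a function-as-a-service handler in the style of an AWS Lambda Python handler; the event shape and the stand-in model are hypothetical, and other FaaS platforms use similar entry points:

    import json

    def score(features):
        # Stand-in for a real model; in practice the model is loaded
        # outside the handler so that warm invocations can reuse it.
        return sum(features) / len(features)

    def handler(event, context):
        # The event shape here mimics an HTTP-triggered invocation.
        features = json.loads(event["body"])["features"]
        return {
            "statusCode": 200,
            "body": json.dumps({"prediction": score(features)}),
        }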

Conclusion

This roadmap covers the important pillars of the Data Science field. There are other pillars, like Visualization, which I have omitted to keep the focus on the core of Data Science. That is not to say Visualization is unimportant; it is an important part of Data Science.

Once you are confident with the above skills, you can venture into Visualization as you see fit.

Keep an eye on this space for follow-up articles. Please let me know in the comments if you think I have missed any important pillars of Data Science.