Data science has been one of the hottest jobs in 2021, and will surely be one of the most in-demand jobs for 2022.
Most of the data scientists I know picked up their skills as self-learners, sometimes starting from a computer engineering background. From what I have seen, school can take you far enough to be independent in your job, but it will never keep up with the latest technology.
For example, in the past few months, I have been specializing in vector-based encoding in NLP.
None of my friends at school are learning this technology, which only became popular in the last year and a half.
Study guide for data science
Self-learning is a necessary evil in this subject.
People like me love it, because I am not constrained by the limitations of school, while for others it is torture: there is never a defined and clear path to what they need to study, and every day they discover something new and have to adapt their study plan.
Overall, if I had to study data science all over again, this is the path I would suggest my younger self take:
1. Cross-sectional data
- Supervised learning: Binary Classification and Univariate Regression
- Data Transformations
- Supervised learning: Multi-label classification and Multi-variate regression
- Unsupervised learning: Clustering
- Unsupervised learning: Dimensionality Reduction
- Unsupervised learning: Association Rule Learning
2. Natural Language Processing
- Using API, parsing data
- Web Scraping
- Data Cleaning, Lemmatization, Stemming
- Sentiment Analysis
3. Deep learning
- Multi-layer perceptron
- Predictive Maintenance
- Start working with Big Data
- Word2Vec encoding
- BERT encoding
- Topic Modeling
- Recommendation Systems
4. Computer Vision…
These are the concepts, in learning order, that I suggest each of you follow. There is a precise logic behind this sequence, and I am going to explain it in the rest of the article.
Start with Cross-sectional data
If you scroll to the top of my list, you can see that I started with cross-sectional data. Why is that?
Like many of you, when I first started I was hyped by the idea of becoming a data scientist and wanted to master the coolest use case around: object recognition.
Unfortunately, the task was too big for me at the time, and I failed at it repeatedly, getting nothing done.
You need to start with something easy: something you can master in only a few days, that does not require much effort, and that acts as the foundation for your knowledge.
That means knowing how to work with, and make predictions on, cross-sectional data.
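As a starting point, here is a minimal sketch of supervised learning on cross-sectional (tabular) data. It uses scikit-learn's bundled breast-cancer dataset purely for illustration; any tabular binary-classification dataset would work the same way.

```python
# Binary classification on cross-sectional data with scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Load a tabular dataset: each row is one observation, each column a feature.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scaling is one of the "data transformations" from the outline above.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print(f"test accuracy: {model.score(X_test, y_test):.2f}")
```

Once this kind of workflow feels routine, swapping in multi-label classification or multivariate regression is mostly a matter of changing the estimator.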
The next step: NLP
Natural Language Processing is one of the most in-demand applications at the moment. For every job that I am hired for I am asked to perform a natural language processing task.
The reason it is in such high demand is that there is an entry barrier to this technology that scares off most ML programmers.
As a result, there is a high demand for NLP, which is found basically anywhere, and a low supply of experts.
However, NLP is tough to learn. If you do not know how to manage cross-sectional data first, forget about even venturing into NLP.
In the beginning, you can do everything in a simple notebook, but most real NLP tasks will require huge processing power and involve gigabytes of data.
Because you are not ready for that scale yet, focus only on the data volumes you can manage.
The main tasks you will want to learn in NLP are performing a Sentiment Analysis, Named Entity Recognition, and String-search.
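To make the first of those tasks concrete, here is a toy sentiment-analysis sketch: TF-IDF features fed into logistic regression. The tiny labeled corpus is invented for illustration only; real work needs a proper dataset and the cleaning steps (lemmatization, stemming) from the outline.

```python
# Toy sentiment analysis: TF-IDF + logistic regression.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# A hand-made corpus, purely for demonstration.
texts = [
    "I love this product, it is wonderful",
    "What a great and useful tool",
    "Absolutely fantastic experience, I love it",
    "I hate this, it is terrible",
    "Awful quality, a terrible waste of money",
    "This is the worst thing I have ever bought",
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = positive, 0 = negative

# TF-IDF turns each text into a numeric vector; the classifier
# then works on it exactly like on cross-sectional data.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["I love it, great quality"]))
```

Notice how the pipeline reduces text to a table of numbers, which is exactly why mastering cross-sectional data first pays off.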
The final step: Deep learning
Finally, now that you can code several ML use cases properly and work with cross-sectional data very well, you can start understanding the math and the concepts behind deep learning.
The main issue with this technology is that it is very tough and slow to implement.
Setting up a deep learning experiment can take hours, mostly because you have to preprocess the data so that it fits inside a neural network.
Deep learning also demands huge computing power, and you will have to keep tuning your data.
It is more an effort of math than of programming, although the programming alone is tough enough.
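To illustrate the preprocessing point, here is a sketch that standardizes raw features before feeding them to a small multi-layer perceptron. scikit-learn's MLPClassifier stands in for a full deep-learning framework, and the digits dataset is an assumption made for illustration.

```python
# Preprocessing + a small MLP on the scikit-learn digits dataset.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Neural networks train poorly on unscaled inputs, so standardize
# first; the scaler is fit on training data only to avoid leakage.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

mlp = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300, random_state=0)
mlp.fit(X_train_s, y_train)
print(f"test accuracy: {mlp.score(X_test_s, y_test):.2f}")
```

Even in this small example, most of the lines are data preparation rather than modeling, which is typical of deep learning work.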
This technology is going to be a must for you to master if you wish to become an expert in the subject.
Because most vector-based models emerged no more than two years ago, it is still a subject at an early stage, and it is unlikely you will find much structured content in a course.
Instead, you may need to proceed with experiments, learning the theory on the side.
I am starting to write several guides on how to use vectors to create amazing things, including recommendation systems and topic modeling.
This technology is used most of the time to complete NLP tasks.
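As a taste of what vector-based models enable, here is a minimal recommender sketch: items are represented as embedding vectors (random here, purely for illustration; in practice they would come from Word2Vec, BERT, or similar) and recommendations are the nearest neighbours by cosine similarity.

```python
# A minimal vector-based recommender using cosine similarity.
import numpy as np

rng = np.random.default_rng(0)
item_names = ["article_a", "article_b", "article_c", "article_d"]
embeddings = rng.normal(size=(4, 8))  # 4 items, 8-dimensional vectors

def recommend(query_idx, k=2):
    q = embeddings[query_idx]
    # Cosine similarity between the query item and every item.
    sims = embeddings @ q / (
        np.linalg.norm(embeddings, axis=1) * np.linalg.norm(q)
    )
    sims[query_idx] = -np.inf  # never recommend the item itself
    top = np.argsort(sims)[::-1][:k]
    return [item_names[i] for i in top]

print(recommend(0))
```

The same nearest-neighbour idea underlies topic modeling and semantic search once the random vectors are replaced with learned embeddings.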
Throughout this article, I have been mainly focusing on data science concepts.
There is a myriad of other things you are expected to know when managing a data science project, including Git, Cloud Computing, and Data Engineering.
My suggestion is to study what you need in your path to learning data science.