Machine Learning Curriculum
Machine Learning is a branch of Artificial Intelligence dedicated to making
machines learn from observational data without being explicitly programmed.
Machine learning and AI are not the same. Machine learning is an instrument in
the AI symphony — a component of AI. So what is Machine Learning — or ML —
exactly? It’s the ability for an algorithm to learn from prior data in order
to produce a behavior. ML is teaching machines to make decisions in situations
they have never seen.
This curriculum is made to guide you in learning machine learning, recommend tools, and help you embrace the ML lifestyle by suggesting media to follow.
I update it regularly to maintain freshness and get rid of outdated content and deprecated tools.
Machine Learning in General
Study this section to understand fundamental concepts and develop intuitions before going any deeper.
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.
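To make the definition concrete, here is a minimal scikit-learn sketch (my own illustrative example, not part of the original curriculum): the task T is classifying handwritten digits, the experience E is a growing labeled training set, and the performance measure P is accuracy on a held-out test set, which should generally improve as E grows.

```python
# Mitchell's definition in code (illustrative sketch, arbitrary dataset):
# more experience E -> better performance P on the same task T.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for n in (50, 200, 800):  # growing experience E
    model = LogisticRegression(max_iter=1000).fit(X_train[:n], y_train[:n])
    acc = accuracy_score(y_test, model.predict(X_test))  # performance P
    print(f"trained on {n} examples -> test accuracy {acc:.2f}")
```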
Books
Reinforcement Learning
The goal of reinforcement learning is to build a machine that senses the environment and then chooses the best action at any given state, according to a learned policy, to maximize its expected long-term scalar reward.
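To ground those terms, here is a tiny tabular Q-learning sketch (my own illustration with a made-up 1-D environment, not taken from any resource listed here): the agent observes a state, takes an action, receives a scalar reward, and gradually learns a policy that maximizes long-term reward.

```python
# Minimal tabular Q-learning on a 1-D chain of 5 states (illustrative).
# Reaching the rightmost state yields reward 1; everything else yields 0.
import numpy as np

n_states, n_actions = 5, 2          # actions: 0 = move left, 1 = move right
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.9             # learning rate and discount factor
rng = np.random.default_rng(0)

for episode in range(500):
    s = 0
    while s != n_states - 1:        # state 4 is terminal
        # behave randomly; Q-learning is off-policy, so it still learns
        # the greedy (best-action) policy from this random experience
        a = rng.integers(n_actions)
        s_next = max(0, s - 1) if a == 0 else s + 1
        r = 1.0 if s_next == n_states - 1 else 0.0
        # move Q(s, a) toward r + gamma * max_a' Q(s', a')
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q[:-1].argmax(axis=1))  # learned policy: [1 1 1 1], always go right
```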
Deep Learning
Deep learning is a branch of machine learning where deep artificial neural networks (DNNs) — algorithms inspired by the way neurons work in the brain — find
patterns in raw data by combining multiple layers of artificial neurons. As the
layers increase, so does the neural network’s ability to learn increasingly
abstract concepts.
The simplest kind of DNN is a Multilayer Perceptron (MLP).
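To show how little code a basic MLP takes, here is a minimal Keras sketch (Keras is covered in the Frameworks section below; the layer sizes and the random placeholder data are arbitrary assumptions of mine):

```python
# A minimal Multilayer Perceptron (illustrative). Each Dense layer is one
# layer of artificial neurons; stacking layers is what makes it "deep".
import numpy as np
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Input(shape=(784,)),              # e.g. a flattened 28x28 image
    keras.layers.Dense(128, activation="relu"),    # first hidden layer
    keras.layers.Dense(64, activation="relu"),     # second hidden layer
    keras.layers.Dense(10, activation="softmax"),  # 10-class output
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Random placeholder data, just to show the shape of the API.
X = np.random.rand(256, 784).astype("float32")
y = np.random.randint(0, 10, size=256)
model.fit(X, y, epochs=2, batch_size=32)
```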
- THE LITTLE BOOK OF DEEP LEARNING This book is a short introduction to deep learning for readers with a STEM background, originally designed to be read on a phone screen. It is distributed under a non-commercial Creative Commons license and was downloaded close to 250’000 times in the month following its public release.
- Full Stack Deep Learning Learn Production-Level Deep Learning from Top Practitioners
- DeepLearning.ai a bunch of courses taught by Andrew Ng at Coursera; it’s the sequel to the Machine Learning course at Coursera.
- Intro to Deep Learning with PyTorch a course by Facebook AI on Udacity
- A friendly introduction to Deep Learning and Neural Networks
- A Neural Network Playground Tinker with a simple neural network designed to help you visualize the learning process
- Deep Learning Demystified - Youtube explains the inspiration for deep learning, from real neurons to artificial neural networks
- Learn TensorFlow and deep learning, without a Ph.D. This 3-hour course (video + slides) offers developers a quick introduction to deep-learning fundamentals, with some TensorFlow thrown into the bargain.
- A Guide to Deep Learning by YN^2 a curated maths guide to Deep Learning
- Practical Deep Learning For Coders Course at Fast.ai taught by Jeremy Howard (Kaggle’s #1 competitor 2 years running, and founder of Enlitic)
- Deep learning - Udacity recommended for visual learners who know some ML; this course provides high-level ideas of deep learning, with dense intuitive details packed into a short amount of time; you will use TensorFlow in the course
- Deep Learning Summer School, Montreal 2015
- Neural networks class - YouTube Playlist
- http://neuralnetworksanddeeplearning.com/index.html a hands-on online book for deep learning maths intuition. I can say that after you finish this, you will be able to explain deep learning in fine detail.
- The Neural Network Zoo a bunch of neural network models that you should know about (I know about half of them, so don’t worry if you don’t know many, because most of them are not popular or useful at present)
- Intro to TensorFlow for Deep Learning taught at Udacity
- Primers • AI Here’s a hand-picked selection of articles on AI fundamentals/concepts that cover the entire process, from building neural nets to training them to evaluating results. There is also a very detailed Transformer architecture explanation.
- Hugging Face Diffusion Models Course Learn the theory, train the model from scratch, and use it to generate images and audio.
- Deep Learning Fundamentals by Lightning.AI with Sebastian Raschka
Convolutional Neural Networks
DNNs that work with grid data like sound waveforms, images and videos better than ordinary DNNs. They are based on the assumption that nearby input units are more related than distant units. They also exploit translation invariance. For example, given an image, it might be useful to detect the same kind of edges everywhere in the image.
They are sometimes called convnets or CNNs.
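A minimal convnet sketch in PyTorch (illustrative; the architecture and sizes are arbitrary assumptions of mine). The same small filters slide across the whole image, which is exactly the locality and translation-invariance idea described above:

```python
# A tiny convnet for 32x32 RGB images (illustrative architecture).
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # 16 learned edge/texture filters
    nn.ReLU(),
    nn.MaxPool2d(2),                              # downsample 32x32 -> 16x16
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),                              # 16x16 -> 8x8
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 10),                    # classify into 10 classes
)

x = torch.randn(1, 3, 32, 32)  # one fake RGB image
print(model(x).shape)          # torch.Size([1, 10])
```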
Recurrent Neural Networks
DNNs that carry state across time steps, which lets them handle sequences of varying length.
They are sometimes called RNNs.
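A minimal sketch in PyTorch (illustrative; the sizes are arbitrary assumptions): the recurrent hidden state is updated one step at a time, so the same network handles sequences of different lengths.

```python
# One LSTM (a popular RNN variant) applied to two sequences of
# different lengths (illustrative sizes).
import torch
import torch.nn as nn

rnn = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

short_seq = torch.randn(1, 5, 8)   # batch of 1, 5 time steps, 8 features
long_seq = torch.randn(1, 50, 8)   # same network, 50 time steps

out, (h, c) = rnn(short_seq)
print(out.shape)  # torch.Size([1, 5, 16]): one 16-dim output per step
out, (h, c) = rnn(long_seq)
print(out.shape)  # torch.Size([1, 50, 16])
```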
Best Practices
Libraries and frameworks that are useful for practical machine learning
Frameworks
Machine learning building blocks
- scikit-learn general machine learning library, high level abstraction, geared towards beginners
- TensorFlow; Awesome TensorFlow; a computation-graph framework built by Google. It has a nice visualization board (TensorBoard) and is probably the most popular framework nowadays for doing Deep Learning
- Keras: Deep Learning for humans Keras is a deep learning API written in Python, running on top of TensorFlow. It’s still king of high level abstraction for deep learning. Update: Keras is now available for TensorFlow, JAX and PyTorch!
- PyTorch Tensors and Dynamic neural networks in Python with strong GPU acceleration. It’s commonly used by cutting-edge researchers including OpenAI.
- Lightning The Deep Learning framework to train, deploy, and ship AI products Lightning fast. (Used to be called PyTorch Lightning)
- JAX is Autograd and XLA, brought together for high-performance machine learning research.
- OneFlow is a deep learning framework designed to be user-friendly, scalable and efficient.
- Apache MXNet (incubating) for Deep Learning Apache MXNet is a deep learning framework designed for both efficiency and flexibility. It allows you to mix symbolic and imperative programming to maximize efficiency and productivity.
- Chainer A flexible framework of neural networks for deep learning
- Vowpal Wabbit is a machine learning system which pushes the frontier of machine learning with techniques such as online, hashing, allreduce, reductions, learning2search, active, and interactive learning. There is a specific focus on reinforcement learning, with several contextual bandit algorithms implemented and an online nature that lends itself well to the problem.
- H2O is an in-memory platform for distributed, scalable machine learning.
- spektral Graph Neural Networks with Keras and Tensorflow 2.
- Ivy is both an ML transpiler and a framework, currently supporting JAX, TensorFlow, PyTorch and Numpy. Ivy unifies all ML frameworks 💥 enabling you not only to write code that can be used with any of these frameworks as the backend, but also to convert 🔄 any function, model or library written in any of them to your preferred framework!
No coding
- Ludwig Ludwig is a toolbox that allows users to train and test deep learning models without the need to write code. It is built on top of TensorFlow.
Gradient Boosting
Models that are used heavily in competitions because of their outstanding generalization performance. A minimal usage sketch follows the list below.
- https://github.com/dmlc/xgboost eXtreme Gradient Boosting
- https://github.com/microsoft/LightGBM lightweight alternative compared to xgboost
- https://github.com/catboost/catboost A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.
- https://github.com/tensorflow/decision-forests TensorFlow Decision Forests (TF-DF) is a collection of state-of-the-art algorithms for the training, serving and interpretation of Decision Forest models.
- PyTorch/TensorFlow implementation of TabNet paper. Further read: TabNet balances explainability and model performance on tabular data, but can it dethrone boosted tree models?
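A minimal usage sketch with XGBoost’s scikit-learn-style API (illustrative; the dataset and hyperparameters are arbitrary assumptions of mine, and LightGBM/CatBoost expose very similar APIs):

```python
# Gradient boosting on a small tabular dataset (illustrative sketch).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Trees are added one by one, each correcting the errors of the
# ensemble built so far (these hyperparameter values are untuned).
model = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```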
Time Series Inference
Time series data require a unique feature extraction process to be usable in most machine learning models, because most models require data to be in a tabular format. Alternatively, you can use special model architectures that target time series, e.g. LSTM, TCN, etc.
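A minimal sketch of that feature extraction idea (illustrative; the series and window sizes are made up): sliding-window lag features turn a univariate series into an ordinary tabular dataset.

```python
# Turn a time series into tabular (X, y) pairs with lag features.
import pandas as pd

series = pd.Series([112, 118, 132, 129, 121, 135, 148, 148, 136, 119])

df = pd.DataFrame({"y": series})
for lag in (1, 2, 3):                      # the previous 3 observations
    df[f"lag_{lag}"] = series.shift(lag)
df["rolling_mean_3"] = series.shift(1).rolling(3).mean()
df = df.dropna()                           # drop incomplete windows

X, y = df.drop(columns="y"), df["y"]       # now fit any tabular model
print(df.head())
```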
Life Cycle
Libraries that help you develop/debug/deploy the model in production (MLOps). There is more to ML than training the model.
- https://huggingface.co/ Just like GitHub, but for storing ML models, datasets, and apps (they call apps “Spaces”). They have libraries for you to easily use their models/datasets in your code. The storage is free and unlimited for both public and private projects.
- https://wandb.ai/ Build better models faster with experiment tracking, dataset versioning, and model management
- https://github.com/flyteorg/flyte Flyte makes it easy to create concurrent, scalable, and maintainable workflows for machine learning and data processing.
- https://github.com/allegroai/clearml Auto-Magical Suite of tools to streamline your ML workflow. Experiment Manager, ML-Ops and Data-Management
- https://github.com/quantumblacklabs/kedro A Python framework for creating reproducible, maintainable and modular data science code.
- https://github.com/determined-ai/determined Determined is an open-source deep learning training platform that makes building models fast and easy. I use it mainly for tuning hyperparameters.
- https://github.com/iterative/cml Continuous Machine Learning (CML) is an open-source library for implementing continuous integration & delivery (CI/CD) in machine learning projects. Use it to automate parts of your development workflow, including model training and evaluation, comparing ML experiments across your project history, and monitoring changing datasets.
- https://github.com/creme-ml/creme Python library for online machine learning. All the tools in the library can be updated with a single observation at a time, and can therefore be used to learn from streaming data.
- https://github.com/aimhubio/aim A super-easy way to record, search and compare 1000s of ML training runs
- https://github.com/Netflix/metaflow Metaflow is a human-friendly Python library that helps scientists and engineers build and manage real-life data science projects. Metaflow was originally developed at Netflix.
- MLflow MLflow is an open source platform to manage the ML lifecycle, including experimentation, reproducibility and deployment. It currently offers three components: MLflow Tracking, MLflow Projects, MLflow Models. See the tracking sketch after this list.
- FloydHub a Heroku for Deep Learning (You focus on the model, they’ll deploy)
- comet.ml Comet enables data scientists and teams to track, compare, explain and optimize experiments and models across the model’s entire lifecycle. From training to production
- https://neptune.ai/ Manage all your model building metadata in a single place
- https://github.com/fastai/nbdev Create delightful python projects using Jupyter Notebooks
- https://rapids.ai/ data science on GPUs
- https://github.com/datarevenue-berlin/OpenMLOps
- https://github.com/jacopotagliabue/you-dont-need-a-bigger-boat Not really a tool, but a guide on how to compose many tools together for a real-world, reasonable-scale business.
- https://www.modular.com/ A company with the ambitious goal of redesigning AI infrastructure from the ground up. They introduced a new language called Mojo, which is a superset of Python.
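Most of the experiment trackers above (wandb, ClearML, aim, MLflow, comet, neptune) share the same core idea: log the parameters and metrics of every run so you can compare runs later. A minimal sketch with MLflow (illustrative; the values are placeholders, not real results):

```python
# Log one training run's parameters and metrics (illustrative values).
import mlflow

with mlflow.start_run(run_name="baseline"):
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_param("batch_size", 32)
    for epoch in range(3):
        # in a real project this would be your measured validation metric
        mlflow.log_metric("val_accuracy", 0.7 + 0.05 * epoch, step=epoch)

# Compare runs in the web UI by running:  mlflow ui
```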
GPU Cloud
Remember that this is an opinionated list. There are bazillions of cloud providers out there. I’m not going to list them all. I’m just going to list the ones that I’m familiar with and I think are good.
- https://lightning.ai/ Lightning Studio makes it possible for you to ditch your high-end laptop for developing machine learning models. Just write code in the cloud using VSCode and use their GPUs for training or inference. Lightning Studio is similar to GitHub Codespaces but with GPU.
- https://modal.com/ Modal lets you run or deploy machine learning models, massively parallel compute jobs, task queues, web apps, and much more, without your own infrastructure.
- https://www.runpod.io/ Save over 80% on GPUs. GPU rental made easy with Jupyter for PyTorch, TensorFlow or any other AI framework. I’ve used it before. Quite easy to use.
- https://replicate.com/ Run and fine-tune open-source models. Deploy custom models at scale using cog. All with one line of code.
- https://bentoml.com/ BentoML is the platform for software engineers to build AI products. Deploy using BentoML package.
- https://www.baseten.co/ Fast and scalable model inference in the cloud using truss
- https://lambdalabs.com/ GPU cloud built for deep learning. Instant access to the best prices for cloud GPUs on the market. No commitments or negotiations required. Save over 73% vs AWS, Azure, and GCP. Configured for deep learning with Pytorch, TensorFlow, Jupyter
- https://www.beam.cloud/ On-demand GPU compute: Train and deploy AI and LLM applications securely on serverless GPUs, without managing infrastructure
Data Storage
- https://github.com/huggingface/datasets/ a package for loading, preprocessing and sharing datasets.
- https://github.com/activeloopai/deeplake Data Lake for Deep Learning. Build, manage, query, version, & visualize datasets. Stream data real-time to PyTorch/TensorFlow.
- https://github.com/determined-ai/yogadl Better approach to data loading for Deep Learning. API-transparent caching to disk, GCS, or S3.
- https://github.com/google/ml_collections ML Collections is a library of Python Collections designed for ML use cases. It contains ConfigDict, a “dict-like” data structure with dot access to nested elements. It is meant to be used as the main way of expressing configurations of experiments and models. A short sketch follows this list.
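A minimal ConfigDict sketch (illustrative; the field names are arbitrary assumptions of mine), showing the dot access and the typo protection it offers for experiment configs:

```python
# Express an experiment configuration with ml_collections.ConfigDict.
import ml_collections

config = ml_collections.ConfigDict()
config.learning_rate = 0.01
config.model = ml_collections.ConfigDict()   # nested config
config.model.hidden_units = 128

print(config.model.hidden_units)  # 128, via dot access

config.lock()                 # freeze the set of keys
config.learning_rate = 0.1    # fine: updating an existing field
# config.learning_rte = 0.1   # would now raise AttributeError (typo caught)
```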
Data Wrangling
Data cleaning and data augmentation
- https://github.com/cgnorthcutt/cleanlab Find and clean label errors in your dataset
- https://github.com/aleju/imgaug Image augmentation library which supports converting keypoints to heatmaps
- https://github.com/albu/albumentations Fastest image augmentation library (see the sketch after this list)
- https://github.com/mdbloice/Augmentor Easy-to-use image augmentation for classification tasks (cannot augment keypoints)
- https://github.com/facebookresearch/AugLy A data augmentations library for audio, image, text, and video.
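Most of these augmentation libraries follow the same compose-then-apply pattern. A minimal sketch with albumentations (illustrative; the transforms, probabilities, and the random placeholder image are arbitrary assumptions):

```python
# Compose a pipeline of random image perturbations and apply it.
import albumentations as A
import numpy as np

transform = A.Compose([
    A.HorizontalFlip(p=0.5),             # mirror half of the images
    A.RandomBrightnessContrast(p=0.2),   # occasionally vary lighting
    A.Rotate(limit=15, p=0.5),           # small random rotations
])

image = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)
augmented = transform(image=image)["image"]   # a new, perturbed copy
print(augmented.shape)  # (224, 224, 3)
```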
Data Orchestration
- https://github.com/PrefectHQ/prefect
- https://github.com/dagster-io/dagster
- https://github.com/ploomber/ploomber Ploomber is the fastest way to build data pipelines ⚡️. Use your favorite editor (Jupyter, VSCode, PyCharm) to develop interactively and deploy ☁️ without code changes (Kubernetes, Airflow, AWS Batch, and SLURM).
- https://github.com/orchest/orchest Build data pipelines, the easy way using user-friendly UI
Data Visualization
- https://github.com/gradio-app/gradio Create UIs for your machine learning model in Python in 3 minutes. The UI is a web app that can be shared with anyone, even non-technical people. One of the features I like is the examples component; it makes it obvious that the app is built for a machine learning use case. See the sketch after this list.
- https://github.com/streamlit/streamlit Streamlit turns data scripts into shareable web apps in minutes. All in Python. All for free. No front‑end experience required.
- https://github.com/oegedijk/explainerdashboard Quickly build Explainable AI dashboards that show the inner workings of so-called “blackbox” machine learning models.
- https://github.com/lux-org/lux By simply printing out a dataframe in a Jupyter notebook, Lux recommends a set of visualizations highlighting interesting trends and patterns in the dataset.
- https://github.com/slundberg/shap SHAP (SHapley Additive exPlanations) is a game theoretic approach to explain the output of any machine learning model.
- https://github.com/comet-ml/kangas Kangas is a tool for exploring, analyzing, and visualizing large-scale multimedia data. It provides a straightforward Python API for logging large tables of data, along with an intuitive visual interface for performing complex queries against your dataset.
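As referenced above, a minimal Gradio sketch (illustrative; the classify function is a stub standing in for a real model, and the examples are made up):

```python
# Wrap a prediction function in a shareable web UI with Gradio.
import gradio as gr

def classify(text):
    # stand-in for a real model; returns label -> confidence
    return {"positive": 0.7, "negative": 0.3}

demo = gr.Interface(
    fn=classify,
    inputs=gr.Textbox(label="Review"),
    outputs=gr.Label(label="Sentiment"),
    examples=[["I loved this movie!"], ["Terrible, do not watch."]],
)
demo.launch()  # serves a local web app; share=True gives a public link
```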
Hyperparameter Tuning
Before you begin, please read this blog post to understand the motivation of searching in general: https://www.determined.ai/blog/stop-doing-iterative-model-development
Open your eyes to search-driven development. It will change you. The main benefit is that there are no setbacks; only progress and improvement are allowed. Imagine working and progressing every day, instead of regressing because your new solution doesn’t work. This guaranteed progress is what search-driven development gives you. Apply it to everything in optimization, not just machine learning.
My top opinionated preferences are determined, ray tune, and optuna because of parallelization (distributed tuning on many machines), flexibility (they can optimize arbitrary objectives and allow dataset parameters to be tuned), libraries of SOTA tuning algorithms (e.g. HyperBand, BOHB, TPE, PBT, ASHA, etc.), result visualization/analysis tools, and extensive documentation/tutorials (see the Optuna sketch after this list).
- https://github.com/determined-ai/determined Determined is an open-source deep learning training platform that makes building models fast and easy. IMO, it’s best for cheaply tuning hyperparameters of deep learning models because it will train many epochs on models that have promising metrics and early stop models that don’t. They support AWS and most cloud services as first-class citizen. They also support preemptible instances, which is again, cheap. When you finish training, all the GPU instances are automatically shutdown. If you want to save money on large-scale training, go with Determined.
- https://docs.ray.io/en/master/tune/index.html Ray Tune is a Python library for experiment execution and hyperparameter tuning at any scale. If you are looking for distributed tuning, Ray Tune is probably the most serious framework out there.
- https://github.com/optuna/optuna an automatic hyperparameter optimization software framework (framework agnostic, define-by-run)
- https://github.com/pyhopper/pyhopper PyHopper is a hyperparameter optimizer, made specifically for high-dimensional problems arising in machine learning research and businesses. The author claims that it’s 10x faster than Optuna. Is this claim true? We can’t know until we try!
- https://github.com/keras-team/keras-tuner an easy-to-use, distributable hyperparameter optimization for keras; read its article here
- https://github.com/autonomio/talos Hyperparameter Optimization for Keras, TensorFlow (tf.keras) and PyTorch
- https://github.com/maxpumperla/hyperas Keras + Hyperopt: A very simple wrapper for convenient hyperparameter optimization
- https://github.com/fmfn/BayesianOptimization A Python implementation of global optimization with gaussian processes.
- https://github.com/hyperopt/hyperopt
- https://github.com/msu-coinlab/pymoo Multi-objective Optimization in Python
- https://github.com/google/vizier Open Source Vizier: Reliable and Flexible Black-Box Optimization. OSS Vizier is a Python-based service for black-box optimization and research, based on Google Vizier, one of the first hyperparameter tuning services designed to work at scale.
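The tools above differ in scale and features, but the core loop is the same everywhere: suggest hyperparameters, evaluate, repeat. A minimal sketch with Optuna (illustrative; the search space, model, and trial count are arbitrary assumptions of mine):

```python
# Hyperparameter search with Optuna (illustrative sketch).
import optuna
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_digits(return_X_y=True)

def objective(trial):
    # each trial suggests one point in the search space
    n_estimators = trial.suggest_int("n_estimators", 50, 300)
    max_depth = trial.suggest_int("max_depth", 2, 16)
    model = RandomForestClassifier(
        n_estimators=n_estimators, max_depth=max_depth, random_state=0
    )
    return cross_val_score(model, X, y, cv=3).mean()  # score to maximize

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print(study.best_params, study.best_value)
```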
AutoML
Make machines learn without the tedious work of feature engineering, model selection, and hyperparameter tuning that you would otherwise have to do yourself. Let the machines perform machine learning for you!
Personally, if I have a tabular dataset, I would try FLAML and mljar first, especially if I want to get something working fast (see the FLAML sketch after this list).
If you want to try gradient boosting frameworks such as XGBoost, LightGBM, CatBoost, etc. but you don’t know which one works best, I suggest you try AutoML first, because internally it will try the gradient boosting frameworks mentioned previously.
- Best OpenSource AutoML frameworks in 2021 an article on medium containing a curated list of OpenSource AutoML frameworks.
- https://github.com/dabl/dabl Data Analysis Baseline Library; quickly train a simple model to be used as a performance baseline
- https://www.automl.org/ Find curated list of AutoML libraries and researches
- https://github.com/jhfjhfj1/autokeras As of writing (24 August 2018), this library is pretty premature as it can only do classification.
- https://github.com/automl/auto-sklearn/ Does not run on Windows, you need to install WSL (Windows Subsystem for Linux) to use it
- https://github.com/EpistasisLab/tpot Run thousands of machine learning pipelines and output the code for you
- https://github.com/ClimbsRocks/auto_ml Read what the author thinks about the comparison between tpot and auto-sklearn
- https://github.com/microsoft/FLAML Fast and Lightweight AutoML with cost-effective economical optimization algorithms.
- https://github.com/mljar/mljar-supervised an Automated Machine Learning Python package that works with tabular data. I like that it generates visualization report (in the Explain mode) and extra features for you e.g. golden features and K-means features.
- https://github.com/awslabs/autogluon AutoML for Text, Image, and Tabular Data. But it doesn’t support Windows (as of 11 October 2021).
- https://github.com/AutoViML/Auto_ViML Auto_ViML was designed for building High Performance Interpretable Models with the fewest variables needed.
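As mentioned before this list, a minimal FLAML sketch (illustrative; the dataset and the 60-second time budget are arbitrary assumptions of mine):

```python
# Let FLAML search over learners (LightGBM, XGBoost, ...) and their
# hyperparameters within a fixed time budget (illustrative sketch).
from flaml import AutoML
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

automl = AutoML()
automl.fit(X_train, y_train, task="classification", time_budget=60)  # seconds

print("best learner:", automl.best_estimator)   # e.g. "lgbm"
pred = automl.predict(X_test)
print("test accuracy:", accuracy_score(y_test, pred))
```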
Model Architectures
Architectures that are state-of-the-art in their fields.
- https://github.com/rwightman/pytorch-image-models PyTorch image models, scripts, pretrained weights – ResNet, ResNeXT, EfficientNet, EfficientNetV2, NFNet, Vision Transformer, MixNet, MobileNet-V3/V2, RegNet, DPN, CSPNet, and more. It is typically called timm.
- https://modelzoo.co/ Model Zoo
- https://github.com/tensorflow/models
- Magenta: Music and Art Generation with Machine Intelligence
- https://github.com/phillipi/pix2pix Image-to-image translation using conditional adversarial nets; TensorFlow port of pix2pix; watch the presentation of this work: Learning to see without a teacher
- wav2letter Facebook AI Research’s Automatic Speech Recognition Toolkit
- https://github.com/huggingface/transformers State-of-the-art Natural Language Processing for TensorFlow 2.0 and PyTorch
- https://github.com/huggingface/diffusers 🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch
- https://bigscience.huggingface.co/blog/bloom open large language model from BigScience LLM. Article
- https://github.com/hpcaitech/ColossalAI Article
- https://stability.ai/blog/stable-diffusion-public-release Stable Diffusion is a model that can generate high-quality images from brief text descriptions. Here is a short Twitter thread explaining why it works so well. And here is a thread containing resources to learn more about diffusion models.
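If you want to try Stable Diffusion programmatically, the diffusers library listed above wraps it in a few lines. A minimal sketch (illustrative; the checkpoint name and prompt are assumptions of mine, the first run downloads several GB of weights, and a CUDA GPU is strongly recommended):

```python
# Text-to-image with a Stable Diffusion checkpoint via diffusers.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")  # half precision + GPU for reasonable speed

image = pipe("a watercolor painting of a robot reading a book").images[0]
image.save("robot.png")
```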
Prompt Engineering
Large language models (LLMs) like GPT-3 are powerful, but they need to be prompted well to generate the desired output. This is where prompt engineering comes in: the craft of designing prompts that reliably steer the model toward the output you want.
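A minimal few-shot prompting sketch in plain Python (illustrative; the task and examples are made up): the prompt demonstrates the desired input/output pattern before asking the model for a new completion. The tools below automate and scale exactly this kind of prompt construction.

```python
# Build a few-shot classification prompt by showing worked examples.
examples = [
    ("I loved this movie!", "positive"),
    ("Terrible, do not watch.", "negative"),
]

def build_prompt(new_input):
    shots = "\n\n".join(f"Review: {t}\nSentiment: {s}" for t, s in examples)
    return (
        "Classify the sentiment of each movie review.\n\n"
        f"{shots}\n\n"
        f"Review: {new_input}\nSentiment:"
    )

prompt = build_prompt("An instant classic, I cried at the end.")
print(prompt)  # send this string to any LLM completion API
```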
- https://github.com/hwchase17/langchain It’s a python package for building applications with LLMs through composability.
- https://dust.tt/ A web-based tool for designing and deploying large language model apps.
- https://github.com/jerryjliu/gpt_index GPT Index is a project consisting of a set of data structures that are created using LLMs and can be traversed using LLMs in order to answer queries.
- https://github.com/Xpitfire/symbolicai/ Compositional Differentiable Programming Library: Building applications with LLMs at its core through our Symbolic API leverages the power of classical and differentiable programming in Python.
Nice Blogs & Vlogs to Follow
Impactful People
- Geoffrey Hinton, he has been called the godfather of deep learning. With his students, he introduced 2 revolutionary techniques (ReLU and Dropout) that help solve the Vanishing Gradient and Generalization problems of deep neural networks.
- Yann LeCun, he invented CNNs
(Convolutional neural networks), the kind of network that is really popular
among computer vision developers today. Currently working at Meta.
- Yoshua Bengio another leading professor of Deep Learning; you can watch his TEDx talk here (2017)
- Andrew Ng he discovered that GPUs make deep learning faster.
He taught 2 famous online courses, Machine Learning and Deep Learning specialization at Coursera.
- Juergen Schmidhuber he invented LSTM (Long Short-Term Memory, a particular type of RNN)
- Jeff Dean, a
Google Brain engineer, watch his TEDx Talk
- Ian Goodfellow, he invented
GANs (Generative Adversarial Networks), is an OpenAI engineer
- David Silver this is
the guy behind AlphaGo and Atari reinforcement learning game agents at DeepMind
- Demis Hassabis CEO of DeepMind, has given a lot of talks about DeepMind’s AlphaGo and Reinforcement Learning achievements
- Andrej Karpathy he teaches convnet classes, wrote ConvNetJS, and produces a lot of content for the DL community; he also writes a blog (see the Nice Blogs & Vlogs to Follow section)
- Pedro Domingos he wrote the book
The Master Algorithm: How the Quest for the Ultimate Learning Machine Will
Remake Our World, watch his TEDx talk here
- Emad Mostaque he is the founder of stability.ai, a company that releases many open source AI models including Stable Diffusion
- Sam Altman he is the CEO of OpenAI, the company that created ChatGPT
Cutting-Edge Research Publishers
Steal the most recent techniques introduced by smart computer scientists (could be you).
Thoughtful Insights for Future Research
Uncategorized
Other Big Lists
I am confused, too many links, where do I start?
If you are a beginner and want to get started with my suggestions, please read this issue:
https://github.com/offchan42/machine-learning-curriculum/issues/4
Disclaimer
From now on, this list is going to be compact and opinionated towards my own real-world ML journey, and I will put only content that I think is truly beneficial for me and most people.
All the materials and tools that are not good enough (in any aspect) will be gradually removed to combat information overload, including:
- too difficult materials without much intuition; impractical content
- too much theory without real-world practice
- low-quality and unstructured materials
- courses that I don’t consider to enroll myself
- knowledge or tools that are too niche, such that not many people can use them in their work, e.g. deepdream or unsupervised domain adaptation (because you can Google them if you need them)
- tools that are beaten by other tools; not being state-of-the-art anymore
- commercial tools that look like they could die at any time
- projects that are outdated or not maintained anymore
NOTE: There is no particular rank for each link. The order in which links appear carries no meaning and should not be read as a ranking.
How to contribute to this list
- Fork this repository, then apply your change.
- Make a pull request and tag me if you want.
- That’s it. If your edit is useful, I’ll merge it.
Or you can just submit a new issue
containing the resource you want me to include if you don’t have time to send a pull request.
The resource you want to include should be free to study.