Python is not the only language for scientific computing or data processing, but it has become the de facto standard in the domain: it is simple to use and easy to install thanks to its interpreted nature, and above all it comes with a wide ecosystem of powerful libraries.
You don't want to write plain Python, though: Python is slow when it comes to loops and raw processing, but its scientific libraries are well optimized and rely on low-level standard implementations.
So when using it, just remember one thing: don't write loops, use NumPy, Pandas, Scikit-learn, or any of the other well-implemented libraries that exist! The planet, your HPC center, or your cloud bill will thank you for it.
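To make the "no loops" advice concrete, here is a minimal sketch (assuming NumPy is installed; the function names are just illustrative) comparing a pure-Python loop with its vectorized equivalent:

```python
import numpy as np

# Pure-Python loop: every iteration goes through the interpreter.
def sum_of_squares_loop(values):
    total = 0.0
    for v in values:
        total += v * v
    return total

# Vectorized version: one NumPy expression, executed in optimized C code.
def sum_of_squares_numpy(values):
    arr = np.asarray(values, dtype=float)
    return float(np.dot(arr, arr))

data = list(range(1_000))
assert sum_of_squares_loop(data) == sum_of_squares_numpy(data)
```

On arrays of realistic size, the vectorized version is typically orders of magnitude faster, because the loop runs inside NumPy's compiled code instead of the Python interpreter.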
Also keep in mind that although we focus on Python here for its standard position in the scientific domain, there are other good languages to use: C++ or Rust for performance, or Julia to get the best of both the compiled and interpreted worlds.
Here is a selection of the most common packages that we recommend for scientific computing and data science. All of these tools are well-known open-source libraries, and they come with plenty of good resources for learning how to use them.
Python Scientific core stack
These are the base libraries that anyone doing scientific Python should know and use.
| Tools | Description | Related use cases |
|---|---|---|
| NumPy | Fundamental package for scientific computing with Python. It focuses on N-dimensional array processing and is the basis of almost all the tools listed below. When dealing with EO data such as rasters, the NumPy array is the one you will use! | Get Started |
| Pandas | Fast, powerful, flexible and easy-to-use open-source data analysis and manipulation tool. It is built to deal with DataFrames: tabular datasets coming from CSV or Parquet files, for example. | Pandas cheat sheet |
| Xarray | Xarray makes working with labelled multi-dimensional arrays in Python simple. It combines NumPy and Pandas to handle N-dimensional arrays using coordinates and dimensions. Built for NetCDF-like datasets (climate, oceanography), it can also be used for image stacks. | Tutorials |
| SciPy | Provides algorithms for optimization, integration, interpolation, eigenvalue problems, algebraic equations, differential equations, statistics and many other classes of problems. | |
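As a small illustration of how the core stack avoids explicit loops (the station data below is made up for the example), a Pandas DataFrame supports vectorized column arithmetic and grouped aggregation:

```python
import pandas as pd

# A tiny tabular dataset, as you might load from a CSV with pd.read_csv().
df = pd.DataFrame({
    "station": ["A", "A", "B", "B"],
    "temperature": [20.5, 21.0, 18.0, 19.0],
})

# Vectorized column arithmetic (NumPy under the hood), no explicit loop.
df["temp_kelvin"] = df["temperature"] + 273.15

# Grouped aggregation: mean temperature per station.
mean_temp = df.groupby("station")["temperature"].mean()
print(mean_temp["A"])  # 20.75
```

The same pattern scales from a four-row toy table to millions of rows, since the heavy lifting stays in Pandas' compiled internals.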
Geospatial and image data handling
Higher-level packages, often built on top of the base libraries above, to handle geospatial or imagery datasets.
| Tools | Description | Related use cases |
|---|---|---|
| GeoPandas | Extension of Pandas to work with geospatial data. | Examples |
| Rioxarray | Extension of Xarray that handles spatial coordinates and other attributes. | Examples |
| Scikit-image | Collection of algorithms for image processing. | General example |
Libraries for image processing and machine learning
Python is the reference language for machine and deep learning libraries: most of the efficient and well-known tools offer a Python implementation or API.
| Tools | Description | Related use cases |
|---|---|---|
| Scikit-learn | Simple and efficient tools for predictive data analysis. | Home page |
| PyTorch | Optimized tensor library for deep learning using GPUs and CPUs. | |
| TensorFlow | End-to-end platform for machine learning and especially deep learning. | |
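As a taste of the Scikit-learn API (the data below is a toy example built to follow y = 2x + 1 exactly), most estimators share the same fit/predict interface:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data following y = 2x + 1 exactly.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

# Every scikit-learn estimator exposes the same fit/predict interface.
model = LinearRegression().fit(X, y)
prediction = model.predict(np.array([[4.0]]))[0]
print(prediction)
```

Swapping `LinearRegression` for another estimator (a random forest, a support-vector machine, ...) leaves the rest of the code unchanged, which is a large part of the library's appeal.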
Code optimization and distribution
Again, the first thing to do to optimize Python code is to use well-built libraries like NumPy. But that is not always enough, and there are other solutions. Python multithreading is often bound by the GIL, so when it comes to parallelization, multiprocessing and frameworks like Dask are the correct approach, though they can come at the price of moving data between processes. For more information on optimization and distribution, see the associated page.
| Tools | Description | Related use cases |
|---|---|---|
| Dask | Provides advanced parallelism for analytics, enabling performance at scale for the tools you love. | Dask website |
| Multiprocessing | Standard-library package that supports spawning processes using an API similar to the threading module. | |
| Numba | Translates Python functions to optimized machine code at runtime using the industry-standard LLVM compiler library. Numba-compiled numerical algorithms in Python can approach the speed of C or FORTRAN. | |