|
5 | 5 | "id": "31b1fab7-2441-4746-925f-69d51c4d98a8", |
6 | 6 | "metadata": {}, |
7 | 7 | "source": [ |
8 | | - "# Generate annual/yearly zarr stores from hourly ERA5 data files on NCAR's Research Data Archive" |
| 8 | + "# Generate annual Zarr stores from hourly ERA5 NetCDF files on NCAR's Research Data Archive"
| 9 | + ] |
| 10 | + }, |
| 11 | + { |
| 12 | + "cell_type": "markdown", |
| 13 | + "id": "ef28c0f9-b371-44be-a0f4-654681115eb6", |
| 14 | + "metadata": {}, |
| 15 | + "source": [ |
| 16 | + "## Warning: Please Read\n", |
| 17 | + "- ERA5 data on NCAR's RDA is stored in hourly NetCDF files, so it is necessary to create intermediate analysis-ready, cloud-optimized (ARCO) datasets for fast processing.\n",
| 18 | + "- In this notebook, we read hourly data from NCAR's publicly accessible ERA5 collection using an intake catalog, compute the annual means, and store the results as Zarr stores.\n",
| 19 | + "- If you don't have write permission to save to the Research Data Archive (RDA), please save the result to your local folder.\n",
| 20 | + "- If you need annual means for the following variables, please don't run this notebook. The data has already been computed and can be accessed via HTTPS from https://data.rda.ucar.edu/pythia_era5_24/annual_means/\n",
| 21 | + "\n", |
| 22 | + " - Air temperature at 2 m / VAR_2T\n",
| 23 | + " \n", |
| 24 | + "- Otherwise, please run this script once to generate the annual means.\n" |
| 25 | + ] |
| 26 | + }, |
| 27 | + { |
| 28 | + "cell_type": "markdown", |
| 29 | + "id": "741a7ddb-1343-4807-9ed2-0150c350d73d", |
| 30 | + "metadata": {}, |
| 31 | + "source": [ |
| 32 | + "## Imports" |
9 | 33 | ] |
10 | 34 | }, |
11 | 35 | { |
|
40 | 64 | }, |
41 | 65 | { |
42 | 66 | "cell_type": "code", |
43 | | - "execution_count": 30, |
| 67 | + "execution_count": 4, |
44 | 68 | "id": "4bee4557-d1f1-4720-bf61-a09f106f41ba", |
45 | 69 | "metadata": {}, |
46 | 70 | "outputs": [ |
47 | 71 | { |
48 | 72 | "name": "stdout", |
49 | 73 | "output_type": "stream", |
50 | 74 | "text": [ |
51 | | - "https://data.rda.ucar.edu/pythia_era5_24/pythia_intake_catalogs/era5_catalog.json\n" |
| 75 | + "https://data.rda.ucar.edu/pythia_era5_24/annual_means/\n" |
52 | 76 | ] |
53 | 77 | } |
54 | 78 | ], |
|
64 | 88 | "######## \n", |
65 | 89 | "zarr_path = rda_scratch + \"/tas_zarr/\"\n", |
66 | 90 | "##########\n", |
67 | | - "print(era5_catalog)" |
| 91 | + "print(annual_means)" |
| 92 | + ] |
| 93 | + }, |
| 94 | + { |
| 95 | + "cell_type": "markdown", |
| 96 | + "id": "20a20dff-a028-4e38-a7b7-8bfb670bdf01", |
| 97 | + "metadata": {}, |
| 98 | + "source": [ |
| 99 | + "### Create a Dask cluster" |
| 100 | + ] |
| 101 | + }, |
| 102 | + { |
| 103 | + "cell_type": "markdown", |
| 104 | + "id": "c6eb54a1-a044-4402-8e5a-a53cefb11256", |
| 105 | + "metadata": {}, |
| 106 | + "source": [ |
| 107 | + "#### Dask Introduction\n", |
| 108 | + "\n", |
| 109 | + "[Dask](https://www.dask.org/) is a solution that enables the scaling of Python libraries. It mimics popular scientific libraries such as numpy, pandas, and xarray, enabling an easier path to parallel processing without having to refactor code. \n",
| 110 | + "\n", |
| 111 | + "There are three components to parallel processing with Dask: the client, the scheduler, and the workers. \n",
| 112 | + "\n", |
| 113 | + "The Client is best envisioned as the application that sends information to the Dask cluster. In Python applications this is handled when the client is defined with `client = Client(CLUSTER_TYPE)`. A Dask cluster comprises a single scheduler that manages the execution of tasks on workers. The `CLUSTER_TYPE` can be defined in a number of different ways.\n",
| 114 | + "\n", |
| 115 | + "- A `LocalCluster` runs on the same hardware as the application and shares the available resources; it is created directly in Python with `dask.distributed`. \n",
| 116 | + "\n", |
| 117 | + "- On JupyterHubs where Dask Gateway is available, a dedicated Dask cluster with its own resources can be created dynamically with `dask.gateway`. \n",
| 118 | + "\n", |
| 119 | + "- On HPC systems, `dask_jobqueue` connects to job schedulers such as Slurm and PBS to provision resources.\n",
| 120 | + "\n", |
| 121 | + "The `dask.distributed` `Client` can also connect to an existing cluster. A Dask scheduler and workers can be deployed in containers, or on Kubernetes, without using a Python function to create the cluster; the Client is then configured to connect to the scheduler either by container name or by Kubernetes service name. "
| 122 | + ] |
| 123 | + }, |
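The cluster concepts above can be sketched with a minimal, self-contained example (not part of the notebook's diff): a threaded `LocalCluster`, a connected `Client`, and a toy `dask.array` computation. The array shape and chunking are arbitrary choices for illustration.

```python
# Minimal sketch: start a LocalCluster, attach a Client, run a small task graph.
from dask.distributed import Client, LocalCluster
import dask.array as da

# processes=False keeps workers as threads in this process, so the sketch
# runs safely in scripts and notebooks alike.
cluster = LocalCluster(processes=False)
client = Client(cluster)

# A toy computation: mean of a chunked array of ones.
x = da.ones((1000, 1000), chunks=(250, 250))
result = x.mean().compute()

client.close()
cluster.close()
print(result)  # 1.0
```

The same `Client` call would accept a `PBSCluster` or a Dask Gateway cluster object in place of the `LocalCluster`, which is what makes the flag-based selection below possible.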
| 124 | + { |
| 125 | + "cell_type": "markdown", |
| 126 | + "id": "09542b2f-aac2-4596-aeaf-89dee2f67cee", |
| 127 | + "metadata": {}, |
| 128 | + "source": [ |
| 129 | + "#### Select the Dask cluster type" |
| 130 | + ] |
| 131 | + }, |
| 132 | + { |
| 133 | + "cell_type": "markdown", |
| 134 | + "id": "59253ed5-2e1d-4415-bb60-78606d78d36a", |
| 135 | + "metadata": {}, |
| 136 | + "source": [ |
| 137 | + "The default will be `LocalCluster` as that can run on any system.\n", |
| 138 | + "\n", |
| 139 | + "If running on an HPC system with a PBS scheduler, set `USE_PBS_SCHEDULER` to True. Otherwise, set it to False."
| 140 | + ] |
| 141 | + }, |
| 142 | + { |
| 143 | + "cell_type": "code",
| 144 | + "execution_count": null,
| 145 | + "id": "1a995e3d-2be7-414e-a7bf-7c53178d44d2",
| 146 | + "metadata": {},
| 147 | + "outputs": [],
| 148 | + "source": [
| 149 | + "USE_PBS_SCHEDULER = False"
| 148 | + ] |
| 149 | + }, |
| 150 | + { |
| 151 | + "cell_type": "markdown", |
| 152 | + "id": "8bf1065d-7e67-4259-8f9d-876743106a41", |
| 153 | + "metadata": {}, |
| 154 | + "source": [ |
| 155 | + "If running on a Jupyter server with Dask Gateway configured, set `USE_DASK_GATEWAY` to True. Otherwise, set it to False."
| 156 | + ] |
| 157 | + }, |
| 158 | + { |
| 159 | + "cell_type": "code",
| 160 | + "execution_count": null,
| 161 | + "id": "8df9739b-5005-4c0d-bf5e-4dc4cc432f50",
| 162 | + "metadata": {},
| 163 | + "outputs": [],
| 164 | + "source": [
| 165 | + "USE_DASK_GATEWAY = False"
68 | 164 | ] |
69 | 165 | }, |
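A minimal sketch of the selection logic the two flags imply, outside the notebook diff. The `create_cluster` helper and its PBS/Gateway arguments are illustrative assumptions, not the notebook's actual code; only the `LocalCluster` branch executes with both flags False.

```python
# Sketch: pick a Dask cluster type based on the two flags above.
from dask.distributed import Client, LocalCluster

USE_PBS_SCHEDULER = False
USE_DASK_GATEWAY = False

def create_cluster():
    if USE_PBS_SCHEDULER:
        # HPC with a PBS scheduler: provision workers through the queue.
        # cores/memory values here are placeholders.
        from dask_jobqueue import PBSCluster
        return PBSCluster(cores=1, memory="4GB")
    if USE_DASK_GATEWAY:
        # Jupyter server with Dask Gateway configured.
        from dask_gateway import Gateway
        return Gateway().new_cluster()
    # Default: a LocalCluster runs on any system.
    return LocalCluster(processes=False)

cluster = create_cluster()
client = Client(cluster)
print(type(cluster).__name__)  # LocalCluster
client.close()
cluster.close()
```

Deferring the `dask_jobqueue` and `dask_gateway` imports into their branches means the default path works even on systems where those packages are not installed.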
70 | 166 | { |
|