|
6 | 6 | "source": [ |
7 | 7 | "# Xarray's Data structures\n", |
8 | 8 | "\n", |
9 | | - "In this lesson, we cover the basics of Xarray data structures. Our\n", |
10 | | - "learning goals are as follows. By the end of the lesson, we will be able to:\n", |
| 9 | + "In this lesson, we cover the basics of Xarray data structures. By the end of the lesson, we will be able to:\n", |
11 | 10 | "\n", |
12 | | - "- Understand the basic data structures (`DataArray` and `Dataset` objects) in Xarray\n", |
13 | | - "\n", |
14 | | - "---\n", |
15 | | - "\n", |
16 | | - "## Introduction\n", |
17 | | - "\n", |
18 | | - "Multi-dimensional (a.k.a. N-dimensional, ND) arrays (sometimes called “tensors”)\n", |
19 | | - "are an essential part of computational science. They are encountered in a wide\n", |
20 | | - "range of fields, including physics, astronomy, geoscience, bioinformatics,\n", |
21 | | - "engineering, finance, and deep learning. In Python, [NumPy](https://numpy.org/)\n", |
22 | | - "provides the fundamental data structure and API for working with raw ND arrays.\n", |
23 | | - "However, real-world datasets are usually more than just raw numbers; they have\n", |
24 | | - "labels which encode information about how the array values map to locations in\n", |
25 | | - "space, time, etc.\n", |
26 | | - "\n", |
27 | | - "Here is an example of how we might structure a dataset for a weather forecast:\n", |
28 | | - "\n", |
29 | | - "<img src=\"https://docs.xarray.dev/en/stable/_images/dataset-diagram.png\" align=\"center\" width=\"80%\">\n", |
30 | | - "\n", |
31 | | - "You'll notice multiple data variables (temperature, precipitation), coordinate\n", |
32 | | - "variables (latitude, longitude), and dimensions (x, y, t). We'll cover how these\n", |
33 | | - "fit into Xarray's data structures below.\n", |
34 | | - "\n", |
35 | | - "Xarray doesn’t just keep track of labels on arrays – it uses them to provide a\n", |
36 | | - "powerful and concise interface. For example:\n", |
37 | | - "\n", |
38 | | - "- Apply operations over dimensions by name: `x.sum('time')`.\n", |
39 | | - "\n", |
40 | | - "- Select values by label (or logical location) instead of integer location:\n", |
41 | | - " `x.loc['2014-01-01']` or `x.sel(time='2014-01-01')`.\n", |
42 | | - "\n", |
43 | | - "- Mathematical operations (e.g., `x - y`) vectorize across multiple dimensions\n", |
44 | | - " (array broadcasting) based on dimension names, not shape.\n", |
45 | | - "\n", |
46 | | - "- Easily use the split-apply-combine paradigm with groupby:\n", |
47 | | - " `x.groupby('time.dayofyear').mean()`.\n", |
48 | | - "\n", |
49 | | - "- Database-like alignment based on coordinate labels that smoothly handles\n", |
50 | | - " missing values: `x, y = xr.align(x, y, join='outer')`.\n", |
51 | | - "\n", |
52 | | - "- Keep track of arbitrary metadata in the form of a Python dictionary:\n", |
53 | | - " `x.attrs`.\n", |
54 | | - "\n", |
55 | | - "The N-dimensional nature of xarray’s data structures makes it suitable for\n", |
56 | | - "dealing with multi-dimensional scientific data, and its use of dimension names\n", |
57 | | - "instead of axis labels (`dim='time'` instead of `axis=0`) makes such arrays much\n", |
58 | | - "more manageable than the raw numpy ndarray: with xarray, you don’t need to keep\n", |
59 | | - "track of the order of an array’s dimensions or insert dummy dimensions of size 1\n", |
60 | | - "to align arrays (e.g., using np.newaxis).\n", |
61 | | - "\n", |
62 | | - "The immediate payoff of using xarray is that you’ll write less code. The\n", |
63 | | - "long-term payoff is that you’ll understand what you were thinking when you come\n", |
64 | | - "back to look at it weeks or months later.\n" |
| 11 | + ":::{admonition} Learning Goals\n", |
| 12 | + "- Understand the basic Xarray data structures `DataArray` and `Dataset` \n", |
| 13 | + "- Customize the display of Xarray data structures\n", |
| 14 | + "- Understand the connection between Pandas and Xarray data structures\n", |
| 15 | + ":::" |
65 | 16 | ] |
66 | 17 | }, |
67 | 18 | { |
|
72 | 23 | "\n", |
73 | 24 | "Xarray provides two data structures: the `DataArray` and `Dataset`. The\n", |
74 | 25 | "`DataArray` class attaches dimension names, coordinates and attributes to\n", |
75 | | - "multi-dimensional arrays while `Dataset` combines multiple arrays.\n", |
| 26 | + "multi-dimensional arrays while `Dataset` combines multiple DataArrays.\n", |
76 | 27 | "\n", |
77 | 28 | "Both classes are most commonly created by reading data.\n", |
78 | | - "To learn how to create a DataArray or Dataset manually, see the [Creating Data Structures](01.1_creating_data_structures.ipynb) tutorial.\n", |
79 | | - "\n", |
80 | | - "Xarray has a few small real-world tutorial datasets hosted in this GitHub repository https://github.com/pydata/xarray-data.\n", |
81 | | - "We'll use the [xarray.tutorial.load_dataset](https://docs.xarray.dev/en/stable/generated/xarray.tutorial.open_dataset.html#xarray.tutorial.open_dataset) convenience function to download and open the `air_temperature` (National Centers for Environmental Prediction) Dataset by name." |
| 29 | + "To learn how to create a DataArray or Dataset manually, see the [Creating Data Structures](01.1_creating_data_structures.ipynb) tutorial." |
82 | 30 | ] |
83 | 31 | }, |
84 | 32 | { |
|
88 | 36 | "outputs": [], |
89 | 37 | "source": [ |
90 | 38 | "import numpy as np\n", |
91 | | - "import xarray as xr" |
| 39 | + "import xarray as xr\n", |
| 40 | + "import pandas as pd\n", |
| 41 | + "\n", |
| 42 | + "# When working in a Jupyter Notebook, you might want to customize Xarray display settings to your liking\n", |
| 43 | + "# The following settings reduce the amount of data displayed by default\n", |
| 44 | + "xr.set_options(display_expand_attrs=False, display_expand_data=False)\n", |
| 45 | + "np.set_printoptions(threshold=10, edgeitems=2)" |
92 | 46 | ] |
93 | 47 | }, |
94 | 48 | { |
|
97 | 51 | "source": [ |
98 | 52 | "### Dataset\n", |
99 | 53 | "\n", |
100 | | - "`Dataset` objects are dictionary-like containers of DataArrays, mapping a variable name to each DataArray.\n" |
| 54 | + "`Dataset` objects are dictionary-like containers of DataArrays, mapping a variable name to each DataArray.\n", |
| 55 | + "\n", |
| 56 | + "Xarray has a few small real-world tutorial datasets hosted in this GitHub repository https://github.com/pydata/xarray-data.\n", |
| 57 | + "We'll use the [xarray.tutorial.load_dataset](https://docs.xarray.dev/en/stable/generated/xarray.tutorial.load_dataset.html#xarray.tutorial.load_dataset) convenience function to download and open the `air_temperature` (National Centers for Environmental Prediction) Dataset by name." |
101 | 58 | ] |
102 | 59 | }, |
103 | 60 | { |
|
147 | 104 | "cell_type": "markdown", |
148 | 105 | "metadata": {}, |
149 | 106 | "source": [ |
150 | | - "#### What is all this anyway? (String representations)\n", |
| 107 | + "#### HTML vs text representations\n", |
151 | 108 | "\n", |
152 | 109 | "Xarray has two representation types: `\"html\"` (which is only available in\n", |
153 | 110 | "notebooks) and `\"text\"`. To choose between them, use the `display_style` option.\n", |
154 | 111 | "\n", |
155 | 112 | "So far, our notebook has automatically displayed the `\"html\"` representation (which we will continue using).\n", |
156 | | - "The `\"html\"` representation is interactive, allowing you to collapse sections (left arrows) and\n", |
157 | | - "view attributes and values for each value (right hand sheet icon and data symbol)." |
| 113 | + "The `\"html\"` representation is interactive, allowing you to collapse sections (▶) and\n", |
| 114 | + "view attributes and values for each variable (📄 and ≡)." |
158 | 115 | ] |
159 | 116 | }, |
160 | 117 | { |
|
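The `display_style` option described above can be tried out on a small dataset. This is a minimal sketch, assuming a hypothetical toy `Dataset` named `toy` in place of the tutorial's `ds`; only `xr.set_options` and its `display_style` option come from the text.

```python
import numpy as np
import xarray as xr

# Hypothetical toy dataset standing in for the tutorial's `ds`
toy = xr.Dataset(
    {"air": (("time", "lat"), np.arange(6.0).reshape(3, 2))},
    coords={"time": [0, 1, 2], "lat": [10.0, 20.0]},
)

# Used as a context manager, set_options changes the option only inside the block.
# In a notebook, rich display of `toy` inside this block falls back to plain text.
with xr.set_options(display_style="text"):
    text_repr = repr(toy)  # repr()/print() always produce the text form
    print(text_repr)
```

Calling `xr.set_options(display_style="text")` outside a `with` block changes the setting for the rest of the session instead.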
171 | 128 | "cell_type": "markdown", |
172 | 129 | "metadata": {}, |
173 | 130 | "source": [ |
174 | | - "The output consists of:\n", |
| 131 | + "☝️ From top to bottom the output consists of:\n", |
175 | 132 | "\n", |
176 | | - "- a summary of all *dimensions* of the `Dataset` `(lat: 25, time: 2920, lon: 53)`: this tells us that the first\n", |
177 | | - " dimension is named `lat` and has a size of `25`, the second dimension is named\n", |
178 | | - " `time` and has a size of `2920`, and the third dimension is named `lon` and has a size\n", |
179 | | - " of `53`. Because we will access the dimensions by name, the order doesn't matter.\n", |
180 | | - "- an unordered list of *coordinates* or dimensions with coordinates with one item\n", |
181 | | - " per line. Each item has a name, one or more dimensions in parentheses, a dtype\n", |
182 | | - " and a preview of the values. Also, if it is a dimension coordinate, it will be\n", |
183 | | - " marked with a `*`.\n", |
184 | | - "- an alphabetically sorted list of *dimensions without coordinates* (if there are any)\n", |
185 | | - "- an unordered list of *attributes*, or metadata" |
| 133 | + "- **Dimensions**: a summary of all *dimensions* of the `Dataset`, `(lat: 25, time: 2920, lon: 53)`. This tells us that the first dimension is named `lat` and has a size of `25`, the second dimension is named `time` and has a size of `2920`, and the third dimension is named `lon` and has a size of `53`. Because we access dimensions by name, the order doesn't matter.\n", |
| 134 | + "- **Coordinates**: an unordered list of *coordinates*, one item per line. Each item has a name, one or more dimensions in parentheses, a dtype, and a preview of the values. Dimension coordinates are printed in **bold** font; *non-dimension coordinates* appear in plain font (there are none in this example, but you might imagine a 'mask' coordinate that has a value assigned at every point).\n", |
| 135 | + "- **Data variables**: the name of each N-dimensional *measurement* in the dataset, followed by its dimensions `(time, lat, lon)`, dtype, and a preview of values.\n", |
| 136 | + "- **Indexes**: Each dimension with coordinates is backed by an \"Index\". In this example, each dimension is backed by a `PandasIndex`.\n", |
| 137 | + "- **Attributes**: an unordered list of metadata (for example, a paragraph describing the dataset)" |
186 | 138 | ] |
187 | 139 | }, |
188 | 140 | { |
|
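Each section of the representation listed above corresponds to a property on the `Dataset`. The following is a rough sketch on a hypothetical toy dataset (`toy`, with made-up values and a made-up `mask` coordinate), not the `air_temperature` data itself.

```python
import numpy as np
import xarray as xr

# Hypothetical toy dataset; names and values are illustrative only
toy = xr.Dataset(
    data_vars={"air": (("time", "lat", "lon"), np.zeros((2, 3, 4)))},
    coords={
        "time": [0, 1],
        "lat": [10.0, 20.0, 30.0],
        "lon": [100.0, 110.0, 120.0, 130.0],
        # Non-dimension coordinate: one value per (lat, lon) point
        "mask": (("lat", "lon"), np.ones((3, 4), dtype=bool)),
    },
    attrs={"title": "toy example"},
)

print(dict(toy.sizes))      # Dimensions: name -> size
print(list(toy.coords))     # Coordinates (dimension and non-dimension)
print(list(toy.data_vars))  # Data variables
print(list(toy.indexes))    # Indexes: only the dimension coordinates
print(toy.attrs)            # Attributes (metadata dictionary)
```

Note that `mask` shows up under coordinates but not under indexes, because it is not a dimension coordinate.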
379 | 331 | "methods on `xarray` objects:\n" |
380 | 332 | ] |
381 | 333 | }, |
382 | | - { |
383 | | - "cell_type": "code", |
384 | | - "execution_count": null, |
385 | | - "metadata": {}, |
386 | | - "outputs": [], |
387 | | - "source": [ |
388 | | - "import pandas as pd" |
389 | | - ] |
390 | | - }, |
391 | 334 | { |
392 | 335 | "cell_type": "code", |
393 | 336 | "execution_count": null, |
|
429 | 372 | "cell_type": "markdown", |
430 | 373 | "metadata": {}, |
431 | 374 | "source": [ |
432 | | - "**<code>to_series</code>**: This will always convert `DataArray` objects to\n", |
433 | | - "`pandas.Series`, using a `MultiIndex` for higher dimensions\n" |
| 375 | + "### to_series\n", |
| 376 | + "This will always convert `DataArray` objects to a `pandas.Series`, using a `MultiIndex` for higher dimensions.\n" |
434 | 377 | ] |
435 | 378 | }, |
436 | 379 | { |
|
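To see the `MultiIndex` behaviour of `to_series` on something small, here is a minimal sketch with a hypothetical two-dimensional `DataArray` (the names `da`, `lat`, `lon` and the values are assumptions for illustration).

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical 2-D DataArray with labelled dimensions
da = xr.DataArray(
    np.arange(6).reshape(2, 3),
    dims=("lat", "lon"),
    coords={"lat": [10, 20], "lon": [100, 110, 120]},
    name="air",
)

series = da.to_series()

# Each dimension becomes one level of the MultiIndex
print(type(series.index).__name__)  # MultiIndex
print(list(series.index.names))

# Values are addressed by a (lat, lon) label tuple
print(series.loc[(20, 110)])
```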
446 | 389 | "cell_type": "markdown", |
447 | 390 | "metadata": {}, |
448 | 391 | "source": [ |
449 | | - "**<code>to_dataframe</code>**: This will always convert `DataArray` or `Dataset`\n", |
450 | | - "objects to a `pandas.DataFrame`. Note that `DataArray` objects have to be named\n", |
451 | | - "for this.\n" |
| 392 | + "### to_dataframe\n", |
| 393 | + "\n", |
| 394 | + "This will always convert `DataArray` or `Dataset` objects to a `pandas.DataFrame`. Note that `DataArray` objects have to be named for this. Since all columns in a `DataFrame` must share the same index, the variables are\n", |
| 395 | + "broadcast against each other." |
452 | 396 | ] |
453 | 397 | }, |
454 | 398 | { |
|
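The naming requirement and the broadcasting described for `to_dataframe` can be illustrated with a small sketch. The objects below (`unnamed`, `toy`, and their values) are hypothetical and only meant to show the behaviour.

```python
import numpy as np
import xarray as xr

# An unnamed DataArray cannot be converted directly
unnamed = xr.DataArray(np.arange(3), dims="x", coords={"x": [0, 1, 2]})
try:
    unnamed.to_dataframe()
except ValueError as err:
    print("unnamed DataArray:", err)

# Supplying a name makes the conversion possible
df = unnamed.to_dataframe(name="values")

# Hypothetical Dataset whose variables have different dimensions
toy = xr.Dataset(
    {
        "a": ("x", [1.0, 2.0]),                       # depends on x only
        "b": (("x", "y"), [[1, 2, 3], [4, 5, 6]]),    # depends on x and y
    },
    coords={"x": [0, 1], "y": [10, 20, 30]},
)

# "a" is repeated along y so that both columns share the (x, y) index
wide = toy.to_dataframe()
print(wide)
```

The resulting frame has one row per `(x, y)` pair, so a variable that lacks a dimension is repeated along it.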
459 | 403 | "source": [ |
460 | 404 | "ds.air.to_dataframe()" |
461 | 405 | ] |
462 | | - }, |
463 | | - { |
464 | | - "cell_type": "markdown", |
465 | | - "metadata": {}, |
466 | | - "source": [ |
467 | | - "Since columns in a `DataFrame` need to have the same index, they are\n", |
468 | | - "broadcasted.\n" |
469 | | - ] |
470 | | - }, |
471 | | - { |
472 | | - "cell_type": "code", |
473 | | - "execution_count": null, |
474 | | - "metadata": {}, |
475 | | - "outputs": [], |
476 | | - "source": [ |
477 | | - "ds.to_dataframe()" |
478 | | - ] |
479 | 406 | } |
480 | 407 | ], |
481 | 408 | "metadata": { |
|