Preprocessor to convert calendar and "fill-in" the data #2106

Open
larsbarring opened this issue Jun 20, 2023 · 12 comments
Labels
enhancement New feature or request

Comments

@larsbarring

larsbarring commented Jun 20, 2023

Is your feature request related to a problem? Please describe.
There are many use cases in which model data and observational datasets are combined for analysis. When the datasets have daily resolution, non-standard model calendars cause problems. In addition, for certain analyses (e.g. of climate indicators related to spell length or the day of year on which something happens) the leap day of the standard calendar is a complication. Hence a preprocessor to convert calendars would be very useful. It would give a wide community a common approach to a problem that otherwise, and traditionally, "everyone" solves in some ad hoc way with a quick-and-dirty fix (in the worst case again and again).

Hence, we propose the following conversion table (numbers are explained below)

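| ↓ INPUT / OUTPUT → | 360_day | 365_day | standard | gregorian | proleptic_gregorian | 366_day | julian | none |
|---|---|---|---|---|---|---|---|---|
| 360_day | 0 | 2 | 13 | X | 3 | n/a | n/a | n/a |
| 365_day | n/a | 0 | 14 | X | 4 | n/a | n/a | n/a |
| standard | n/a | 15 | 0 | X | 11 | n/a | 12 | n/a |
| gregorian | n/a | 15 | 1 | 20 | 11 | n/a | 12 | n/a |
| proleptic_gregorian | n/a | 5 | 11 | X | 0 | n/a | n/a | n/a |
| 366_day | n/a | 5 | 16 | X | 6 | 0 | n/a | n/a |
| julian | n/a | n/a | 12 | X | n/a | n/a | 0 | n/a |
| none | n/a | n/a | n/a | X | n/a | n/a | n/a | 0 |
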
| no. | conversion action |
|---|---|
| 0 | pass through |
| 1 | change only calendar string |
| 2 | fill-in(*) five days evenly spaced across the year |
| 3 | fill-in(*) five or six days evenly spaced across the year |
| 4 | for leap years fill-in(*) the leap day |
| 5 | remove leap days |
| 6 | for non-leap years remove the leap day |
| 11 | iff all data after (including) 1582-10-15 00:00:00 then 1 else n/a |
| 12 | iff all data before (excluding) 1582-10-05 00:00:00 then 1 else n/a |
| 13 | iff all data after (including) 1582-10-15 00:00:00 then 3 else n/a |
| 14 | iff all data after (including) 1582-10-15 00:00:00 then 4 else n/a |
| 15 | iff all data after (including) 1582-10-15 00:00:00 then 5 else n/a |
| 16 | iff all data after (including) 1582-10-15 00:00:00 then 6 else n/a |
| 20 | issue a warning and change only calendar string to standard |
| X | illegal conversion because the gregorian calendar is deprecated |
| n/a | not available (is there a concrete use case for any of these ??) |
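
To make the two tables above concrete, here is a minimal sketch of how the lookup could be resolved in code. The names `CONVERSION_CODES`, `CONDITIONAL` and `resolve_action` are hypothetical, only a subset of the table cells is reproduced, and code 12 (the julian case) is left out; dates are plain `(year, month, day)` tuples so that the check does not depend on any calendar library.

```python
# Subset of the conversion table, keyed by (input calendar, output calendar).
# Pairs that are not listed are treated as "n/a" in this sketch.
CONVERSION_CODES = {
    ("360_day", "365_day"): 2,
    ("360_day", "standard"): 13,
    ("360_day", "proleptic_gregorian"): 3,
    ("365_day", "standard"): 14,
    ("365_day", "proleptic_gregorian"): 4,
    ("standard", "365_day"): 15,
    ("standard", "proleptic_gregorian"): 11,
    ("proleptic_gregorian", "365_day"): 5,
    ("366_day", "standard"): 16,
    ("366_day", "proleptic_gregorian"): 6,
}

# Conditional codes fall back to a base action when all data lie on or
# after the start of the Gregorian calendar.
CONDITIONAL = {11: 1, 13: 3, 14: 4, 15: 5, 16: 6}
GREGORIAN_START = (1582, 10, 15)


def resolve_action(source, target, first_date):
    """Return the action number to apply, or None for "n/a".

    first_date is the (year, month, day) of the earliest time point.
    """
    if target == "gregorian":
        if source == "gregorian":
            return 20  # warn and relabel as "standard"
        raise ValueError("conversion to the deprecated gregorian calendar")
    if source == target:
        return 0  # pass through
    code = CONVERSION_CODES.get((source, target))
    if code in CONDITIONAL:
        # Tuple comparison is lexicographic: year, then month, then day.
        return CONDITIONAL[code] if first_date >= GREGORIAN_START else None
    return code


action = resolve_action("360_day", "standard", (1850, 1, 1))
```

With data starting in 1850, the `360_day` to `standard` request resolves to action 3; the same request for data starting before 1582-10-15 resolves to `None` ("n/a").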

(*) Fill-in: For transformation from 365_day calendar to standard or proleptic_gregorian calendar it is suggested to add a day after February 28th (day-in-year 59). For transformation from the 360_day calendar several days have to be added:

For conversion to a non-leap year the following days should be inserted (day-in-year in parenthesis):

February 6th (36), April 19th (109), July 2nd (183), September 12th (255), November 25th (329).

For conversion to a leap year the following days should be inserted (day-in-year in parenthesis):

January 31st (31), March 31st (91), June 1st (153), July 31st (213), September 30th (275) and November 30th (335).

This follows what has been implemented in xarray (xarray.Dataset.convert_calendar using align_on="year"). However, as indicated in the table above, we suggest not implementing transformations to the 360_day calendar, or the xarray alternative align_on="date", because it removes several days.
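
The "evenly spaced" positions can be sketched without any calendar library. The function `missing_days` is hypothetical, and the mapping rule is an assumption: each source day-of-year `d` is placed at `round(d * N_target / 360)`, using Python's round-half-to-even. The target days that no source day lands on are the ones that need a fill-in.

```python
def missing_days(target_days_in_year, source_days_in_year=360):
    """Day-of-year numbers in the target year that receive no source day.

    Assumption: source day d is mapped to round(d * target / source),
    where round() resolves exact halves to the nearest even integer.
    """
    mapped = {
        round(d * target_days_in_year / source_days_in_year)
        for d in range(1, source_days_in_year + 1)
    }
    return [d for d in range(1, target_days_in_year + 1) if d not in mapped]


leap_gaps = missing_days(366)      # six gaps for a leap year
non_leap_gaps = missing_days(365)  # five gaps for a non-leap year
```

For a 366-day target this reproduces the six day numbers listed above. Several source days land exactly half-way between two target days, so the rounding rule can shift an inserted day by one; exact positions should be checked against the library actually used.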

We also suggest the following alternatives for "creating" fill-in data:

  • filling with NaN, or similar like _FillValue
  • copying the data from the preceding day
  • [linear] interpolation between data for the preceding day and the succeeding day
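
The three alternatives can be sketched for a plain list of daily values. The function `insert_day` and its `method` names are hypothetical; a real preprocessor would operate on arrays and use the file's `_FillValue` or a mask rather than a bare NaN.

```python
import math


def insert_day(values, index, method="nan"):
    """Return a copy of `values` with one new day inserted before `index`.

    The new day must have a neighbour on both sides (0 < index < len(values)).
    """
    if not 0 < index < len(values):
        raise IndexError("the inserted day needs a preceding and a succeeding day")
    if method == "nan":
        new = math.nan  # missing-value marker
    elif method == "previous":
        new = values[index - 1]  # copy the preceding day
    elif method == "linear":
        # Midpoint of the preceding and succeeding day
        new = 0.5 * (values[index - 1] + values[index])
    else:
        raise ValueError(f"unknown fill method: {method!r}")
    return values[:index] + [new] + values[index:]


filled = insert_day([1.0, 3.0], 1, method="linear")
```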

Would you be able to help out?
Would you have the time and skills to implement the solution yourself?

@larsbarring larsbarring added the enhancement New feature or request label Jun 20, 2023
@valeriupredoi
Contributor

Hi @larsbarring, many thanks for proposing this! We have a preprocessor called regrid_time that aligns time points from cubes with differing time axes on a common standardized time axis, and that should account for differing calendars too (I believe I made sure that calendars were taken care of via conversion from num to date via a standard calendar when I wrote that). Have you seen/tried it? Of course, we can always generalize it or add a new preprocessor, e.g. regrid/align_calendars. About fill values: we need to stick to the CMOR standard _fillValue that is specified in the file's metadata (1e20), no mixing of NaNs or 1e999 or other such things. About interpolating data from one day to another: that shouldn't be tricky, because for monthly means we don't need to do that, and for daily means we can just take the mean of the two days straddling the missing day; not sure about hourly data though. @ESMValGroup/technical-lead-development-team what do you folks think?

@larsbarring
Author

larsbarring commented Jun 21, 2023

Hi @valeriupredoi, many thanks for the quick response! No, I did not know of regrid_time (I am a newcomer to the world of ESMVal), but I have been discussing the general requirement from our side with @zklaus and @ljoakim (any insights?). I had a quick look at the documentation link you provide, but that does not give enough detail (for me ...). I guess some quick testing might fill that gap. Anyway, and as a general comment, I think that it would be very useful to align what is done in ESMVal with what is done in xarray (or vice versa!! --- I am all for common methods :-) ). And yes, as you write, this is a feature request that does focus on daily data; at higher temporal resolution something else is needed, and at lower resolution (monthly...) things become easier.

And, yes, I do take your point about _fillValue.

@zklaus

zklaus commented Jun 21, 2023

Our regrid_time does not change the calendar, so is not addressing this issue.

I am not sure I understand the comment on fill value. @larsbarring suggested that the fill value could possibly be used as one marker value. That would be possible with 1e20 as with any other; regardless, that applies to masking, I think. It still is perfectly reasonable to use NaNs for cases where NaNs make sense, no?

I think a new change_calendar or ensure_calendar preprocessor would be useful. The main challenge to me seems to be the efficient allocation and memory handling of long timeseries with a few extra days sprinkled in. The filling method should probably be configurable.

@valeriupredoi
Contributor

Ah then, very good and informative comments from both you gents @larsbarring and @zklaus 👍 I am annoyed with myself that I didn't fix the calendar business in regrid_time TBH, but then again, that function doesn't really help much when it comes to high-frequency data (daily etc). Then it does indeed sound like a good idea for a preprocessor! About xarray: good idea, but we are directly involved with iris, so it is probably best we go via iris first (especially since I believe there still is an effort to merge forces, iris and xarray, I mean). Cheers @zklaus, I think I overthought the missing/fill value issue 😁

@ljoakim
Contributor

ljoakim commented Sep 15, 2023

This may also be relevant as an option when adding days to a 360_day calendar: https://loca.ucsd.edu/loca-calendar/. To convert to standard, they always insert Feb 29th for leap years, and 5 additional days are inserted randomly, each within its corresponding 72-day time block, in order to reduce some statistical effects that may arise from adding the same days every year. Also implemented in xclim (https://github.com/Ouranosinc/xclim/issues/841).
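
A rough sketch of the random placement described above, based only on the description in this comment and not on the LOCA or xclim code: the 360-day year is split into five 72-day blocks and one insertion point is drawn at random inside each. The function name `random_insert_positions` and its parameters are hypothetical, and the Feb 29th handling for leap years is left out.

```python
import random


def random_insert_positions(seed=None, block_length=72, n_blocks=5):
    """Pick one source day-of-year per block after which a day is inserted.

    Returns n_blocks values, the i-th drawn uniformly from block i of the
    360-day source year (days 1-72, 73-144, ...).
    """
    rng = random.Random(seed)  # a per-year seed makes the choice reproducible
    return [
        block * block_length + rng.randint(1, block_length)
        for block in range(n_blocks)
    ]


positions = random_insert_positions(seed=2000)
```

Seeding the generator per year, as in the usage line, would make the conversion reproducible while still varying the inserted days from year to year.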

@schlunma
Contributor

schlunma commented Jan 23, 2024

We just encountered problems (again) with regrid_time in #2299. I really like this approach here, but I fear that it will take some time to properly implement this.

In the meantime, I propose to implement a "workaround" for monthly and yearly data. For those, it should be sufficient to simply take the 15th of the month (for monthly data) or 1 July (for yearly data) and assign those dates to the data using a fixed calendar (most likely standard). The errors from this should be minimal.

This would be very simple to implement. Moreover, we are already doing exactly the same for multi_model_statistics:

def _unify_time_coordinates(cubes):

If we agree on this, I can try to implement that.
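
The workaround can be sketched as follows, assuming each time point is reduced to its year (and month, for monthly data) before a new date is assigned. The helper `representative_date` is hypothetical and uses the standard-library `datetime`, which is proleptic Gregorian, as a stand-in for the fixed target calendar.

```python
from datetime import datetime


def representative_date(year, month=None):
    """Return the date assigned to a monthly or yearly data point.

    Monthly data get the 15th of the month; yearly data (month=None)
    get 1 July. Both dates exist in every calendar discussed here.
    """
    if month is None:
        return datetime(year, 7, 1)
    return datetime(year, month, 15)


# A 360_day time point such as 2000-02-30 reduces to (2000, 2) first.
monthly_point = representative_date(2000, 2)
yearly_point = representative_date(2000)
```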

@schlunma
Contributor

How should this be implemented? I can think of two solutions at the moment:

  1. Add a calendar keyword to regrid_time. If not set (default), use the current behavior, if set, use as a common calendar. This is fully backwards-compatible and easy to implement; however, it enforces calendar changes (e.g., if all input data is already on the same calendar, then this will also be changed).
  2. Add a new multi-dataset preprocessor align_calendar (or align_time or any other name). This is also fully backwards-compatible and can keep identical calendars if possible (just like the code for multi-model statistics does); however, it's more difficult to implement. Moreover, we then end up with two very similar preprocessors regrid_time and the new one, which might be confusing (and I am not sure if regrid_time in its current form is useful at all).

I think 1. is the better solution, and we could even think of changing the default behavior in the future (with a proper deprecation cycle). @ESMValGroup/technical-lead-development-team any opinions?

@larsbarring
Author

larsbarring commented Jan 24, 2024

In relation to what @schlunma wrote:

> In the meantime, I propose to implement a "workaround" for monthly and yearly data. For those, it should be sufficient to simply take the 15th of the month (for monthly data) or 1 July (for yearly data) and assign those dates to the data using a fixed calendar (most likely standard). The errors from this should be minimal.

I would like to just state the obvious difference between an intensive and an extensive quantity. For the former, simply adjusting the time coordinate should be fine, but for the latter differences in period length between different dataset calendars should be factored in by adjusting the data as such, and not only the time coordinate.
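
As an illustration of the extensive case, consider a monthly total (e.g. accumulated precipitation) from a 360_day model, where every month has 30 days. A minimal sketch, assuming the mean daily rate is kept constant and the total is rescaled to the month length of the standard calendar; the function `rescale_monthly_total` is hypothetical.

```python
import calendar


def rescale_monthly_total(total, year, month, source_days=30):
    """Rescale an extensive monthly total to the target month length.

    Assumes a constant mean daily rate, so the total scales with the
    number of days in the month of the (Gregorian) target calendar.
    """
    target_days = calendar.monthrange(year, month)[1]
    return total * target_days / source_days


feb_total = rescale_monthly_total(60.0, 2001, 2)  # 30-day month to 28 days
```

An intensive quantity such as a monthly mean temperature would be left unchanged; only the time coordinate is adjusted.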

@bouweandela
Member

Any improvements to regrid_time are welcome of course. Maybe we should consider the suggestion by @zklaus to rename it to change_calendar if that makes it easier to find. I agree with the suggestion from the top post to use xarray.Dataset.convert_calendar (and also xarray.Dataset.interp_calendar) as much as possible and see if we can contribute additional features, as described in the table above, back to those functions. I couldn't find any feature in Iris that supports this type of time-specific operations, but we could open an issue there as well to ask if there is an interest.

@schlunma
Contributor

> Maybe we should consider the suggestion by @zklaus to rename it to change_calendar if that makes it easier to find. I agree with the suggestion from the top post to use xarray.Dataset.convert_calendar (and also xarray.Dataset.interp_calendar) as much as possible and see if we can contribute additional features, as described in the table above, back to those functions.

In principle, I agree with that. However, I fear that it will take some time to implement this properly for all calendar combinations and frequencies (in addition, there is still no perfect solution for bridging iris and xarray; see SciTools/iris#4994), and I am on a deadline here (I need this for our EGU abstract).

Thus, I would propose to expand the existing regrid_time so that it's able to convert calendars for monthly and yearly data. As mentioned, this is mostly straightforward to implement and we are already doing the exact same in our multi-model statistics code. Of course, we should mention the caveats there (e.g., extensive vs. intensive variables). In the future, we can add a new preprocessor change_calendar that generalizes this for all frequencies and (maybe) deprecate regrid_time. What do you think about this?

@bouweandela
Member

Improving regrid_time a bit sounds fine to me. We can deprecate it as soon as someone has time to actually work on this issue and implement things properly.

> there is still no perfect solution for bridging iris and xarray

This may not be a huge problem in this case, it should be possible to just convert the time coordinate and time dependent data/coordinates etc separately to an Xarray Dataset and use the resulting values to make a new cube.

@schlunma
Contributor

A PR with improvements to regrid_time and the proposed addition of a calendar argument (that only works for decadal, yearly, and monthly data) is open here: #2311
