Your goal is to predict the total_cases label for each (city, year, weekofyear) in the test set. There are two cities, San Juan and Iquitos, with test data for each city spanning 5 and 3 years respectively. You will make one submission that contains predictions for both cities. The data for each city have been concatenated along with a city column indicating the source: sj for San Juan and iq for Iquitos. The test set is a pure future hold-out, meaning the test data are sequential and non-overlapping with any of the training data. Throughout, missing values have been filled as NaNs.
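Because the test set lies strictly after the training period, a local validation split that also holds out the most recent weeks mirrors the evaluation more closely than a random split. The following is a minimal sketch, assuming each row is a dictionary carrying the city and week_start_date fields listed in the feature descriptions; the function name and the n_valid_weeks parameter are hypothetical and not part of the competition materials.

```python
def time_ordered_split(rows, n_valid_weeks):
    """Hold out the last n_valid_weeks of each city for validation.

    Assumes every row is a dict with "city" and "week_start_date" keys,
    and that week_start_date is a yyyy-mm-dd string (which sorts
    chronologically when compared as text).
    """
    by_city = {}
    for row in rows:
        by_city.setdefault(row["city"], []).append(row)

    train, valid = [], []
    for city_rows in by_city.values():
        ordered = sorted(city_rows, key=lambda r: r["week_start_date"])
        cut = max(0, len(ordered) - n_valid_weeks)
        train.extend(ordered[:cut])   # earlier weeks
        valid.extend(ordered[cut:])   # most recent weeks
    return train, valid


# Toy rows for illustration only; these are not real observations.
toy_rows = [
    {"city": "sj", "week_start_date": "1994-05-14"},
    {"city": "sj", "week_start_date": "1994-05-07"},
    {"city": "sj", "week_start_date": "1994-05-21"},
    {"city": "iq", "week_start_date": "2001-01-08"},
    {"city": "iq", "week_start_date": "2001-01-01"},
]
train_rows, valid_rows = time_ordered_split(toy_rows, n_valid_weeks=1)
```

Splitting within each city keeps the two series separate, so one city's later weeks never end up in the training portion of the other city's validation period.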
- City and date indicators
  - city – City abbreviations: sj for San Juan and iq for Iquitos
  - week_start_date – Date given in yyyy-mm-dd format
- NOAA's GHCN daily climate data weather station measurements
  - station_max_temp_c – Maximum temperature
  - station_min_temp_c – Minimum temperature
  - station_avg_temp_c – Average temperature
  - station_precip_mm – Total precipitation
  - station_diur_temp_rng_c – Diurnal temperature range
- PERSIANN satellite precipitation measurements (0.25x0.25 degree scale)
  - precipitation_amt_mm – Total precipitation
- NOAA's NCEP Climate Forecast System Reanalysis measurements (0.5x0.5 degree scale)
  - reanalysis_sat_precip_amt_mm – Total precipitation
  - reanalysis_dew_point_temp_k – Mean dew point temperature
  - reanalysis_air_temp_k – Mean air temperature
  - reanalysis_relative_humidity_percent – Mean relative humidity
  - reanalysis_specific_humidity_g_per_kg – Mean specific humidity
  - reanalysis_precip_amt_kg_per_m2 – Total precipitation
  - reanalysis_max_air_temp_k – Maximum air temperature
  - reanalysis_min_air_temp_k – Minimum air temperature
  - reanalysis_avg_temp_k – Average air temperature
  - reanalysis_tdtr_k – Diurnal temperature range
- Satellite vegetation - Normalized difference vegetation index (NDVI) - NOAA's CDR Normalized Difference Vegetation Index (0.5x0.5 degree scale) measurements
  - ndvi_se – Pixel southeast of city centroid
  - ndvi_sw – Pixel southwest of city centroid
  - ndvi_ne – Pixel northeast of city centroid
  - ndvi_nw – Pixel northwest of city centroid
For example, a single row in the dataset, indexed by (city, year, weekofyear): (sj, 1994, 18), has these values:
week_start_date: 1994-05-07
total_cases: 22
station_max_temp_c: 33.3
station_avg_temp_c: 27.7571428571
station_precip_mm: 10.5
station_min_temp_c: 22.8
station_diur_temp_rng_c: 7.7
precipitation_amt_mm: 68.0
reanalysis_sat_precip_amt_mm: 68.0
reanalysis_dew_point_temp_k: 295.235714286
reanalysis_air_temp_k: 298.927142857
reanalysis_relative_humidity_percent: 80.3528571429
reanalysis_specific_humidity_g_per_kg: 16.6214285714
reanalysis_precip_amt_kg_per_m2: 14.1
reanalysis_max_air_temp_k: 301.1
reanalysis_min_air_temp_k: 297.0
reanalysis_avg_temp_k: 299.092857143
reanalysis_tdtr_k: 2.67142857143
ndvi_location_1: 0.1644143
ndvi_location_2: 0.0652
ndvi_location_3: 0.1321429
ndvi_location_4: 0.08175
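Note that the column suffixes encode units: station temperatures (_c) are in degrees Celsius, while reanalysis temperatures (_k) are in kelvin. If the two sources are to be compared or combined, they need a common scale. A minimal sketch using values from the example row above; the helper name is hypothetical, and treating missing values as NaN follows the dataset's stated convention.

```python
import math

KELVIN_OFFSET = 273.15  # 0 degrees Celsius expressed in kelvin


def kelvin_to_celsius(value_k):
    """Convert a kelvin reading to degrees Celsius, passing NaN through."""
    if math.isnan(value_k):
        return float("nan")
    return value_k - KELVIN_OFFSET


# Values taken from the (sj, 1994, 18) example row.
station_avg_temp_c = 27.7571428571
reanalysis_air_temp_k = 298.927142857

reanalysis_air_temp_c = kelvin_to_celsius(reanalysis_air_temp_k)
print(round(reanalysis_air_temp_c, 2))  # 25.78
```

The station and reanalysis readings differ by about two degrees here, which is unsurprising given that one is a point measurement and the other a gridded estimate.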
Performance is evaluated according to the mean absolute error.
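Mean absolute error is the average of the absolute differences between predicted and actual case counts, so every week contributes in proportion to how far off its prediction is. A minimal sketch of the metric for local validation; the function name is an illustrative choice, and this is not the scoring code used by the competition.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |actual - predicted| over all weeks."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one observation is required")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy numbers for illustration: absolute errors are 2, 3 and 0.
actual = [22, 10, 4]
predicted = [20, 13, 4]
score = mean_absolute_error(actual, predicted)  # (2 + 3 + 0) / 3
```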
The submission file contains the (city, year, weekofyear) index and the predicted total_cases for each San Juan and Iquitos row (for an example, see SubmissionFormat.csv on the data download page). The total_cases values should be represented as integers.
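Most regression models return floating-point predictions, so these have to be converted before submission. A minimal sketch of one way to do this; the function name is hypothetical, and clipping negative values to zero is an assumption based on case counts being non-negative, not a stated competition rule.

```python
def to_case_counts(predictions):
    """Round float predictions to integers and clip negatives to zero.

    Python's built-in round() rounds exact halves to the nearest even
    integer, so 2.5 becomes 2 and 3.5 becomes 4.
    """
    return [max(0, int(round(p))) for p in predictions]


raw_predictions = [4.6, -0.3, 2.2]
case_counts = to_case_counts(raw_predictions)
print(case_counts)  # [5, 0, 2]
```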
For example, if you just predicted that there were 5 cases each week for 5 weeks in San Juan and 3 cases each week for 5 weeks in Iquitos, for a total of 10 weeks, your predictions might look like this:
city year weekofyear total_cases
sj 2008 18 5
sj 2008 19 5
sj 2008 20 5
sj 2008 21 5
sj 2008 22 5
...
iq 2013 22 3
iq 2013 23 3
iq 2013 24 3
iq 2013 25 3
iq 2013 26 3
The .csv file that you submit would look like:
city,year,weekofyear,total_cases
sj,2008,18,5
sj,2008,19,5
sj,2008,20,5
sj,2008,21,5
sj,2008,22,5
...
iq,2013,22,3
iq,2013,23,3
iq,2013,24,3
iq,2013,25,3
iq,2013,26,3
Keep in mind that you need to submit one CSV with predictions for both cities! Hence the requirement of the city column. Results will be parsed on our end and MAE scores will be given for each city's predictions.
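Combining both cities into a single file can be sketched with Python's standard csv module. The helper name and the in-memory buffer are illustrative assumptions; the constant predictions simply repeat the example values shown earlier.

```python
import csv
import io

HEADER = ["city", "year", "weekofyear", "total_cases"]


def write_submission(rows, handle):
    """Write (city, year, weekofyear, total_cases) tuples as CSV."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(HEADER)
    for city, year, week, cases in rows:
        writer.writerow([city, year, week, int(cases)])


# San Juan rows first, then Iquitos, all in one file.
sj_rows = [("sj", 2008, week, 5) for week in range(18, 23)]
iq_rows = [("iq", 2013, week, 3) for week in range(22, 27)]

buffer = io.StringIO()
write_submission(sj_rows + iq_rows, buffer)
submission_text = buffer.getvalue()
```

To write to disk instead, pass a file opened with open(path, "w", newline="") in place of the buffer, where path is whatever file name you choose.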