Time series prediction for daily COVID-19 cases and deaths using PyTorch

Note: This article is not intended to give an exact prediction for COVID-19, but to give an idea of how to develop an LSTM model for predictions on time-series data.

I am Thirunayan, an Artificial Intelligence/Deep learning enthusiast. Currently, I am an Intern Data Scientist at Rootcode Labs. This is something I recently worked on.

RNNs (Recurrent Neural Networks) are especially good at processing and predicting sequential data. LSTM (Long Short-Term Memory) is a particular type of RNN that is widely used in time series prediction. Although there are many mathematical models for time series prediction, such as the ARIMA model, LSTMs have recently been gaining popularity for their ability to recall patterns in time series data.

The dataset used to train and test the model is the daily time-series data provided by Johns Hopkins University on the number of cases and deaths. The dataset is updated on a daily basis (https://github.com/CSSEGISandData/COVID-19).

As in any ML prediction task, the first step is data processing. In this scenario there was no need to deal with null values, but the dataset contains the daily count of infected people broken down by country. Since we are predicting the global count, the counts had to be summed across all countries to get the total cumulative number of patients for each day.
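The aggregation step can be sketched as follows. This is a minimal illustration, not the original code: the small DataFrame is a made-up stand-in that mimics the layout of the JHU global time-series CSV (four metadata columns followed by one column per date), and the variable names are my own.

```python
import pandas as pd

# Tiny stand-in for the JHU CSV layout: four metadata columns followed by
# one column per date holding the cumulative count for that row.
# The numbers are made up for illustration.
df = pd.DataFrame({
    "Province/State": [None, "Hubei", None],
    "Country/Region": ["Italy", "China", "Sri Lanka"],
    "Lat": [41.9, 30.9, 7.9],
    "Long": [12.6, 112.3, 80.8],
    "1/22/20": [0, 444, 0],
    "1/23/20": [0, 444, 0],
    "1/24/20": [2, 549, 1],
})

# Drop the metadata columns so that only the date columns remain.
dates_only = df.drop(columns=["Province/State", "Country/Region", "Lat", "Long"])

# Sum over all rows to get one global cumulative count per day.
global_cumulative = dates_only.sum(axis=0)

# Turn the column labels into real dates so the series can be plotted over time.
global_cumulative.index = pd.to_datetime(global_cumulative.index, format="%m/%d/%y")
```

With the real file, `df` would come from `pd.read_csv(...)` on the confirmed-cases CSV in the repository linked above.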

The next step was the visualization of the cumulative progression of the number of infected patients.

This graph shows a clearly exponential pattern. However, it does not show the number of new cases per day, because each day's value also includes the accumulated patients from all previous days.

To get the non-cumulative number of infected patients each day:
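One way to do this with pandas is to take the difference between consecutive days. The sketch below uses a short made-up series. Treating the first day's cumulative value as that day's new cases is an assumption on my part, since there is no earlier day to subtract.

```python
import pandas as pd

# Made-up cumulative counts for four consecutive days.
cumulative = pd.Series(
    [444, 444, 552, 700],
    index=pd.date_range("2020-01-22", periods=4),
)

# diff() subtracts the previous day's value from each day's value.
# The first entry has no predecessor and becomes NaN, so it is filled
# with the first cumulative value (an assumption, see above).
daily = cumulative.diff().fillna(cumulative.iloc[0]).astype("int64")
```

Here `daily` holds 444, 0, 108 and 148: the new cases on each of the four days.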

Now visualizing this pandas series would give the non-cumulative count of patients each day:

Now, we have the necessary data in the format we need. We can start allocating the data for training and testing.

Here 27 days are allocated for training and 14 for testing:
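A split like this can be sketched with simple slicing. The array below is a stand-in for the daily case counts, and the variable names are my own. The split is chronological (no shuffling), so the test set is always the most recent 14 days.

```python
import numpy as np

# Stand-in for 41 days of daily case counts (27 for training + 14 for testing).
daily_cases = np.arange(41, dtype=np.float32)

test_size = 14

# Keep the time order: everything except the last 14 days is training data.
train_data = daily_cases[:-test_size]
test_data = daily_cases[-test_size:]
```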

The next step is to normalize the data by scaling it to the range 0 to 1 so that training is faster, and to create a function that divides the training data into chunks of sequences for each cycle:
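Both steps can be sketched as below. Plain NumPy min-max scaling is used here as a stand-in for whatever scaler the original code used, and the function names are hypothetical. Taking the scaling bounds from the training data only is my assumption. It keeps information from the test period out of training.

```python
import numpy as np

def min_max_scale(values, lo, hi):
    """Map values into [0, 1] using bounds taken from the training data."""
    return (values - lo) / (hi - lo)

def create_sequences(data, seq_length):
    """Slide a window over data: each window is an input, the next value its target."""
    xs, ys = [], []
    for i in range(len(data) - seq_length):
        xs.append(data[i:i + seq_length])
        ys.append(data[i + seq_length])
    return np.array(xs), np.array(ys)

# Made-up daily counts standing in for the training partition.
train_data = np.array([10, 20, 40, 30, 60, 80, 110, 100], dtype=np.float32)

lo, hi = train_data.min(), train_data.max()
train_scaled = min_max_scale(train_data, lo, hi)

# Windows of 5 consecutive days, each paired with the value of the following day.
X_train, y_train = create_sequences(train_scaled, seq_length=5)
```

With 8 values and a window of 5, this gives 3 input windows and 3 targets. Predictions made on the scaled data have to be mapped back with `pred * (hi - lo) + lo` to read them as case counts.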

To train our model, we used sequences of 5 samples of the training data for each cycle. With the data pre-processing phase done, we can move on to building the model’s architecture.

Instead of using custom training functions, I have used a class to encapsulate the necessary hyperparameters so that they can be tuned through objects from the class.

The method “reset_hidden_state()” resets the state of the LSTM model, since we are using a stateless LSTM. The method “forward()” passes all the sequences through the LSTM layer at once. We take the output of the last time step and pass it through our linear layer to get the prediction. Moving forward, a training class is defined, through which the data partitions for training can be provided.
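A model class along these lines might look like the sketch below. It is my reconstruction from the description, not the original code: the class name is hypothetical, `batch_first=True` is my choice of tensor layout, and the default sizes are only placeholders.

```python
import torch
from torch import nn

class CasePredictor(nn.Module):  # hypothetical name
    def __init__(self, n_features=1, n_hidden=600, n_layers=2):
        super().__init__()
        self.n_hidden = n_hidden
        self.n_layers = n_layers
        # batch_first=True means inputs are shaped (batch, seq_len, n_features).
        self.lstm = nn.LSTM(input_size=n_features, hidden_size=n_hidden,
                            num_layers=n_layers, batch_first=True)
        self.linear = nn.Linear(n_hidden, 1)
        self.hidden = None

    def reset_hidden_state(self, batch_size):
        # Stateless use: start every pass from an all-zero hidden and cell state.
        zeros = torch.zeros(self.n_layers, batch_size, self.n_hidden)
        self.hidden = (zeros, zeros.clone())

    def forward(self, sequences):
        # Run all sequences through the LSTM in one call.
        lstm_out, self.hidden = self.lstm(sequences, self.hidden)
        # Keep only the output of the last time step of each sequence.
        last_time_step = lstm_out[:, -1, :]
        return self.linear(last_time_step)

# Usage on random stand-in data: 3 sequences of 5 days with 1 feature each.
model = CasePredictor()
model.reset_hidden_state(batch_size=3)
out = model(torch.rand(3, 5, 1))
```

`out` has one prediction per input sequence, so its shape is (3, 1). Calling `reset_hidden_state()` before each pass keeps state from leaking between batches.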

The next step is to define the hyperparameter values. In our case, we have only one feature, which is the infected count. I’ve used 600 hidden neurons with 2 hidden layers.

I have used MSE (mean squared error) as the error metric and trained the model for 22 epochs. 22 epochs may seem lower than usual, but since our training data is relatively small, using a high number of epochs may just cause overfitting.
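A training loop with these settings could be sketched as follows. The hidden size, layer count, loss and epoch count come from the text. The Adam optimizer, its learning rate, the compact model class and the random stand-in data are all assumptions of mine, so treat this as an illustration rather than the original training code.

```python
import torch
from torch import nn

torch.manual_seed(0)

class LSTMRegressor(nn.Module):  # hypothetical, compact stand-in for the model
    def __init__(self, n_features, n_hidden, n_layers):
        super().__init__()
        self.lstm = nn.LSTM(n_features, n_hidden, n_layers, batch_first=True)
        self.linear = nn.Linear(n_hidden, 1)

    def forward(self, x):
        out, _ = self.lstm(x)            # zero initial state on every call (stateless)
        return self.linear(out[:, -1])   # prediction from the last time step

# Hyperparameters named in the text; optimizer and learning rate are assumptions.
model = LSTMRegressor(n_features=1, n_hidden=600, n_layers=2)
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
n_epochs = 22

# Random stand-in data: windows of 5 scaled values and one target per window.
X_train = torch.rand(22, 5, 1)
y_train = torch.rand(22, 1)

losses = []
model.train()
for epoch in range(n_epochs):
    optimizer.zero_grad()              # clear gradients from the previous epoch
    y_pred = model(X_train)            # forward pass over all windows at once
    loss = loss_fn(y_pred, y_train)    # mean squared error
    loss.backward()                    # backpropagate
    optimizer.step()                   # update the weights
    losses.append(loss.item())
```

Tracking `losses` per epoch, ideally next to a loss on the held-out test days, is a simple way to see whether more epochs start to overfit.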

Also, to counter the problem of a small dataset, we can use the predicted values as input to the next sequence; that is, we can reuse the predicted values themselves as training data. Even after applying this strategy, the model’s accuracy was not impressive. Thus the only option left was to use all the data for training and then do the prediction for the next 14 days.
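Feeding predictions back in as inputs can be sketched as a rolling forecast. The function name is hypothetical. To keep the example small, a trivial module that returns the mean of the window stands in for the trained LSTM. Any model that maps a (1, seq_len, 1) tensor to a (1, 1) tensor could be passed instead.

```python
import torch
from torch import nn

def forecast(model, seed_window, n_days):
    """Predict n_days ahead, feeding each prediction back in as the newest input."""
    window = seed_window.clone().view(1, -1, 1)   # (batch=1, seq_len, features=1)
    preds = []
    model.eval()
    with torch.no_grad():
        for _ in range(n_days):
            next_val = model(window)              # shape (1, 1)
            preds.append(next_val.item())
            # Drop the oldest value and append the new prediction.
            window = torch.cat([window[:, 1:, :], next_val.view(1, 1, 1)], dim=1)
    return preds

class MeanOfWindow(nn.Module):  # trivial stand-in for a trained LSTM
    def forward(self, x):
        return x.mean(dim=1)    # (batch, 1)

# Seed with the last 5 (scaled) observations and roll forward 14 days.
last_known = torch.tensor([0.1, 0.2, 0.3, 0.4, 0.5])
preds = forecast(MeanOfWindow(), last_known, n_days=14)
```

Because each step builds on earlier predictions, errors compound over the horizon, which is one reason long-range forecasts from this approach should be read with caution.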

Finally, these were the model’s predictions for the number of cases until 22nd April.

Conclusion

This prediction was carried out on April 5th of 2020.

The overall prediction summary was that by 15th of April we would reach more than 2 million cases and by 22nd of April, we would have more than 4 million cases. But by 20th April 2020 (Today) the worldwide case number is close to 2.4 million which is well below the prediction. Maybe the lockdown measures are working, maybe we are handling it better or maybe my model isn’t that great! 🙂

Inspired by : https://www.curiousily.com/posts/time-series-forecasting-with-lstm-for-daily-coronavirus-cases/