Socio-economic factors in the spread of SARS-CoV-2 across Russian regions

ABSTRACT Relevance. The worldwide spread of the new SARS-CoV-2 infection makes it relevant to analyze the socio-economic factors that make modern civilization vulnerable to previously unknown diseases. In this regard, developing mathematical models that describe the spread of pandemics like COVID-19 and identifying the socio-economic factors that affect the epidemiological situation in regions are important research tasks. Research objective. This study seeks to develop a mathematical model describing the spread of COVID-19, thus enabling the analysis of the main characteristics of the spread of the disease and an assessment of the impact of various socio-economic factors. Data and methods. The study relies on the official statistical data on the pandemic published on coronavirus websites in Russia and other countries, the Yandex DataLens dataset service, and data from the Federal State Statistics Service. Three methods were used: correlation analysis of COVID-19 incidence parameters and socio-economic characteristics of regions; multivariate regression, to determine the parameters of the probabilistic mathematical model of the spread of the pandemic proposed by the authors; and clustering, to group the regions with similar incidence characteristics and exclude the regions with abnormal parameters from the analysis. Results. A mathematical model of the spread of the COVID-19 pandemic is proposed. The parameters of this model are determined on the basis of official statistics on morbidity, in particular the frequency (probability) of infections, the reliability of disease detection, the probability density of the disease duration, and its average value. Based on the specificity of COVID-19, Russian regions are clustered according to disease-related characteristics. 
For clusters that include regions with typical disease-related characteristics, a correlation analysis is carried out of the relationship between the number of cases and the rate of infection, on the one hand, and the socio-economic characteristics of the region, on the other. The most significant factors associated with the parameters of the pandemic are identified. Conclusions. The proposed mathematical model of the pandemic and the established correlations between the parameters of the epidemiological situation and the socio-economic characteristics of the regions can be used to make informed decisions regarding the key risk factors and their impact on the course of the


Introduction
The rapid worldwide spread of the new SARS-CoV-2 infection has raised some important questions about the socio-economic problems that make modern civilization vulnerable to new, previously unknown diseases. To prevent situations like the COVID-19 pandemic in the future, its development should be analyzed and its patterns should be identified, in particular, to assess the influence of various socio-economic factors on its parameters such as the disease spread rate (number of cases, growth rate) and its peculiarities (disease duration and mortality). The results of this analysis can be used to build models of disease spread and make decisions concerning disease prevention.
The modeling of global disease spread has been discussed for a long time. Examples include the mathematical model of influenza spread developed by Rvachev (1971) and the models of nonlinear population waves developed by Svirezhev (1987). The construction of probabilistic epidemic models was considered by Whittle (1955), who formulated the equations for the distribution of the number of individuals infected during an epidemic. This distribution turns from unimodal to bimodal as the ratio of the intensity of isolation of infected individuals to the intensity of infection of healthy individuals decreases to some critical value (Bailey, 1975). In more recent works (Barlow & Weinstein, 2020), the so-called SIR (Susceptible-Infected-Recovered) epidemic models were considered and differential equations for the number of infected individuals were formulated. Studies that focused on the spread of coronavirus proposed extensions of the SIR model by using nonlinear differential equations describing the dynamics of various groups of participants in the epidemic (e.g. Ndairou et al., 2020; Abdo et al., 2020). A large number of works are concerned with algorithms for predicting epidemic spread, based both on probabilistic models (e.g. Zhang et al., 2020) and on more traditional ARIMA-based approaches, where epidemiological indicators are linked to a number of social and demographic characteristics of the countries affected by the COVID-19 pandemic (e.g. Chakraborty & Ghosh, 2020). Finally, some studies apply various theoretical models to calculate the reproduction index, that is, the expected number of secondary cases caused in a fully susceptible population by a typical infected person (Driessche & Watmough, 2020; Ndairou et al., 2020).
To address the above research gap, this study intends to meet the following objectives:
- to compare the statistical data on the progress of the COVID-19 pandemic in Russian regions;
- to identify the socio-economic factors that determined the main characteristics of the disease spread;
- to construct a probabilistic mathematical model describing the progress of the pandemic that would enable us to forecast the course of the pandemic and assess the reliability of such forecasts.
The study relies on the materials retrieved from the Yandex DataLens 1 service, the official website on coronavirus in the Russian Federation 2 as well as the Rosstat data for Russian regions (2018) 3 .

Theoretical foundations of the analysis of the COVID-19 pandemic
As already noted in the literature described above, a large number of problems in a variety of fields can be reduced to random-walk processes of the subjects under study along the nodes of a directed graph. Moreover, each node corresponds to a particular group (cluster) in which the subjects under study are united in accordance with their inherent characteristics. This can be, for example, customers choosing the products of one of the competing manufacturers (Sinitsyn et al., 2011), vehicles moving along city streets between the intersections where they congregate (Tolmachev et al., 2019), and even students belonging to different performance groups in massive open online courses.
In the model of the COVID-19 pandemic considered herein, the population is grouped by health status as follows:
- I: infected individuals in whom the disease has not been identified (those who are ill without clinical symptoms, who did not seek medical care, or who were mistakenly diagnosed as free of the disease);
- H: healthy individuals susceptible to the disease;
- A: detected active cases of the disease;
- R: recovered;
- D: deaths;
- U: disease-resistant individuals (infected persons who were undetected and then recovered, as well as vaccinated persons).
Hereinafter, the size of each group will be denoted by the letter N with the corresponding index: $N_I, \ldots, N_U$. The relative share of each group in the population N of the region will be denoted by n with an index denoting the group: $n_I, \ldots, n_U$. Thus,

$$N_I + N_H + \ldots + N_U = N \quad \text{or} \quad n_I + n_H + \ldots + n_U = 1 \qquad (1)$$

The graph corresponding to the model used herein is shown in Figure 1. The following assumptions, arising from the information about the course of the coronavirus-induced disease contained on the sites on coronavirus in the Russian Federation 4 and other countries 5, were made when constructing the graph:
1. The recovery is accompanied by persistent immunity (the probability of re-infection is formally taken into account by the probability $P_{UH}$; hereinafter the authors will assume $P_{UH} = 0$).
2. All the recovered persons recover completely, that is, there is no return of those who have recovered to the group of active cases (treatment is not completed or the conclusion about the recovery is wrong).
3. Long-term consequences leading to death after a fixed recovery (transitions between groups R and D) are not taken into account 6 .
4. There are no vaccinations, that is, transitions from the H group directly to the U group are impossible.
If necessary, the above restrictions can be removed and the model can be expanded, but this will lead to computational complexity.
The transitions shown in Figure 1 can be described by using a 6 × 6 matrix T (2), which takes into account the above remarks. The columns of matrix (2) contain the probabilities of transitions from the group indicated in the top row to the group presented in the rightmost column, which implies that the sum of all elements of each column is equal to 1. Such matrices are usually called stochastic (Leskovec et al., 2020).
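The column-sum property can be illustrated with a minimal sketch. It assumes the group order I, H, A, R, D, U and a set of hypothetical one-step transition probabilities (illustrative values, not the authors' estimates); the names `GROUPS`, `TRANSITIONS` and `build_matrix` are introduced here only for the example.

```python
GROUPS = ["I", "H", "A", "R", "D", "U"]

# Hypothetical one-step transition probabilities (source, destination).
# The values are illustrative assumptions, not estimates from the paper.
TRANSITIONS = {
    ("H", "I"): 0.01,   # infection of a healthy individual
    ("I", "A"): 0.12,   # detection of an infected individual
    ("I", "U"): 0.05,   # undetected recovery
    ("A", "R"): 0.06,   # recovery of a detected case
    ("A", "D"): 0.002,  # death of a detected case
    ("R", "U"): 1.0,    # immunity formed immediately after recovery
}


def build_matrix(groups, transitions):
    """Build a column-stochastic matrix: column = source, row = destination."""
    idx = {g: k for k, g in enumerate(groups)}
    n = len(groups)
    matrix = [[0.0] * n for _ in range(n)]
    for (src, dst), prob in transitions.items():
        matrix[idx[dst]][idx[src]] = prob
    # The diagonal holds the probability of staying in the same group,
    # chosen so that every column sums to 1.
    for j in range(n):
        leaving = sum(matrix[i][j] for i in range(n) if i != j)
        matrix[j][j] = 1.0 - leaving
    return matrix


T = build_matrix(GROUPS, TRANSITIONS)
column_sums = [sum(T[i][j] for i in range(len(GROUPS))) for j in range(len(GROUPS))]
```

In this sketch groups D and U are absorbing, which corresponds to the assumptions of no re-infection and no transitions out of the group of deaths.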
Matrix (2) is used together with the vector X, whose components are equal to the sizes of the corresponding groups, $N_I, N_H, \ldots, N_U$. Performing simple calculations by analogy with the approach described, for example, by Feller (1964) when deriving the equations for the processes of birth and death, one can obtain equation (3) for the probability $P(N_I, N_H, \ldots, N_U; t)$, in which summation over repeated indices is implied. Obviously, (3) is an analog of the Kolmogorov equation for the considered model. Using (3), various aspects connected with the probability $P(N_I, N_H, \ldots, N_U; t)$ can be studied; in particular, it is possible to determine confidence intervals for all projections of morbidity, mortality, and recovery rates as well as to analyze possible epidemic scenarios and assess their probabilities. However, it is more appropriate to discuss these results in a separate work concerned with the mathematical aspects of the model under consideration. For the purposes of further analysis, this paper is restricted to deriving from (3) the equations for the mathematical expectations of the values $n_I, n_H, \ldots, n_U$ 7 . In discrete form 8 , after a series of transformations, one can obtain:

$$n_H(t+1) = n_H(t) - P_{HI}\, n_H(t) \qquad (4a)$$
$$n_I(t+1) = n_I(t) + P_{HI}\, n_H(t) - P_{IA}\, n_I(t) - P_{IU}\, n_I(t) \qquad (4b)$$
$$n_A(t+1) = n_A(t) + P_{IA}\, n_I(t) - P_{AR}\, n_A(t) - P_{AD}\, n_A(t) \qquad (4c)$$
$$n_D(t+1) = n_D(t) + P_{AD}\, n_A(t) \qquad (4d)$$
$$n_R(t+1) = n_R(t) + P_{AR}\, n_A(t) - P_{RU}\, n_R(t) \qquad (4e)$$
$$n_U(t+1) = n_U(t) + P_{RU}\, n_R(t) + P_{IU}\, n_I(t) \qquad (4f)$$

When deriving equations (4), the above-mentioned feature of matrix (2) was used (the sum of the elements of each column is equal to 1). It was additionally assumed that $P_{ID} = 0$. The reason for this is that all severe cases ending in death require medical intervention and are recorded as active before death (transition from group I to group A). Those who are asymptomatic or mildly symptomatic and do not seek medical help, and accordingly are unlikely to die, may well remain undetected.
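Balance equations of this kind can be iterated day by day. The following is a minimal sketch, assuming that each group gains exactly what another group loses, that $P_{UH} = P_{ID} = 0$, and that all probabilities are constant; the constant $P_{HI}$ is a simplification made only for this example (in the model it depends on the share of the infected), and all numerical values are hypothetical.

```python
def step(shares, p):
    """Advance the expected group shares by one time step (e.g. one day)."""
    new_infected = p["HI"] * shares["H"]       # H -> I
    detected = p["IA"] * shares["I"]           # I -> A
    hidden_recovered = p["IU"] * shares["I"]   # I -> U
    recovered = p["AR"] * shares["A"]          # A -> R
    died = p["AD"] * shares["A"]               # A -> D
    immune = p["RU"] * shares["R"]             # R -> U
    return {
        "H": shares["H"] - new_infected,
        "I": shares["I"] + new_infected - detected - hidden_recovered,
        "A": shares["A"] + detected - recovered - died,
        "D": shares["D"] + died,
        "R": shares["R"] + recovered - immune,
        "U": shares["U"] + immune + hidden_recovered,
    }


# Hypothetical daily probabilities and initial shares (assumptions).
P = {"HI": 0.002, "IA": 0.12, "IU": 0.05, "AR": 0.06, "AD": 0.002, "RU": 1.0}
history = [{"H": 0.999, "I": 0.001, "A": 0.0, "R": 0.0, "D": 0.0, "U": 0.0}]
for _ in range(30):
    history.append(step(history[-1], P))
```

Because every outflow appears once with a minus sign and once with a plus sign, the shares keep summing to 1 at every step, which mirrors normalization (1).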
Let us make one more well-grounded assumption: let us exclude the possibility of delayed formation of immunity (thus the subjects from group R (recovered) will go to group U (disease-resistant) immediately after their recovery, without remaining in group R for the time required for the final formation of immunity). This is consistent with the fact that $P_{RU} = 1$, while $P_{RR} = 0$. It is obvious that in this case, group U actually absorbs group R, and equations (4e) and (4f) can be replaced by one:

$$n_U(t+1) = n_U(t) + P_{AR}\, n_A(t) + P_{IU}\, n_I(t) \qquad (4g)$$

It should also be noted that the distinction between the two groups of infected people (I) and active cases (A) is based on the assumption that not all cases of COVID-19 are detected. Therefore, the probability of transition between these groups is different from 1. The available statistical information is insufficient to determine the size of group I. As noted in the monograph by Svirezhev (1987), the probability $P_{IA}$ can be represented by formula (5), where d is the likelihood of an infected person self-referring to a doctor per time unit, T is the number of tests carried out per time unit to detect COVID-19 9 , and $P_\alpha$ is the reliability of disease detection, which reflects the probability of a type I error in diagnosis (the disease goes undetected).
7 This approach, with the exception of the number of analyzed participants in the epidemic, is similar to the SIR model described above.
8 Equations (4) can be written in differential form similarly to (3). To do this, it is required to transfer n H (t) to the left side of the equation and replace,
An attempt can be made to determine parameters (5) using equation (4c) and data on the number of active cases and the number of tests. For example, Figure 2 shows the proportion of detected COVID-19 cases in the tests performed.
The initial phase of the pandemic is clearly visible (until about mid-May 2020). In this phase, an exponential increase in the number of cases and positive results of tests performed was observed 10 . Its subsequent stabilization followed by a decline to a certain constant value is a characteristic form of such graphs for most countries. For example, Figure 2 shows the graphs for Russia and Germany.
Assuming that the concentration of cases determined for the tested groups can be extended to the entire population of Russia, a simple calculation shows that the number of infected people in the country should be about 4.25 million people, which significantly exceeds the official data of about 860 thousand people. However, formula (5) assumes the use of a random sample of representatives of the population for testing. In practice, in Russia and in the world, mainly representatives of risk groups are tested (those who have been in contact with patients, who have arrived from disadvantaged regions with signs of SARS, tourists, medical workers, etc.), among whom the concentration of cases is naturally higher. This situation with testing explains the above discrepancy and suggests that to estimate the total number of infected people in the country, separate studies of representative samples of the entire population are required.
10 A closer analysis shows that the rate of exponential growth in the number of cases during this period is mainly due to the exponential increase in the number of tests performed.
The data in Figure 4 reflect the activity of testing by days of the week, apparently related to the established COVID-19 testing procedures. Figure 4 shows that since May 30, the ratio T/N has stabilized at the average level of 0.194%, while the median value has remained practically the same: 0.198%.
To eliminate problems with the analysis of equations (4) caused by the above-mentioned uncertainty of data on the number of infected individuals, various simplifying assumptions can be used:
1. The country's health care system detects all cases of the disease (more precisely, the proportion of those undiagnosed is negligible). In this case, group I is actually eliminated.
2. The ratio of the number of infected individuals and detected cases is approximately constant. For example, for Russia, according to the estimates above, the number of detected infected individuals is approximately 20-24% of the total number of infected people.
In what follows, we will focus on the second assumption. To accurately determine the ratio between the infected and detected active cases, it is required to solve equations (4), having previously determined the parameters by using, for example, regression models, analysis of medical statistics, etc. Relevant examples will be discussed below.
One more remark should be made. Following the works [5-9] cited above, a local temporal connection between all variables in equations (4) was assumed. In reality, this is not the case. For example, the individuals who recover at time t were infected a certain number of days earlier. Similarly, due to the long incubation period, the individuals detected at time t were infected up to 14 days earlier 11 . Thus, equations (4) are retarded equations: the current size of the groups under consideration depends on their past size. To take this into account in equations (4), substitutions (7a)-(7d) should be made 12 .
11 https://www.worldometers.info/coronavirus/ (date of access: August 10, 2020)
12 For the discrete form of equations, the integration should be replaced by summation. The authors have kept the integral notation for the reasons of compactness. Thus, in their original form, Eqs. (4) are integro-differential retarded equations.
https://journals.urfu.ru/index.php/r-economy Online ISSN 2412-0731
Here functions $f_{IA}(t - \tau), \ldots, f_{IU}(t - \tau)$ specify (in the order of the equations): the density of the probability distribution of the incubation period (7a), the time from disease detection to death (7b), and the time to recovery (7c) and (7d). The latter two will be determined by using the available statistics in section 3.2 below.
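The effect of such a delay can be sketched in discrete form, where the integral is replaced by a sum. The example below assumes that the flow leaving a group at time t is a weighted sum of past inflows, with weights given by a delay density; the function name `delayed_flow`, the density values and the detection share are hypothetical and serve only as an illustration.

```python
def delayed_flow(inflow, density, p_total=1.0):
    """Discrete analogue of a delay integral.

    out[t] = p_total * sum_k density[k] * inflow[t - k],
    where density[k] is the probability that the delay equals k steps
    and p_total is the overall probability of the transition.
    """
    out = []
    for t in range(len(inflow)):
        total = 0.0
        for k, weight in enumerate(density):
            if t - k >= 0:
                total += weight * inflow[t - k]
        out.append(p_total * total)
    return out


# Hypothetical example: 100 people infected on day 0, a delay density
# concentrated on days 1-3, and only half of the infected ever detected.
new_infections = [100, 0, 0, 0, 0, 0]
incubation_density = [0.0, 0.2, 0.5, 0.3]
detections = delayed_flow(new_infections, incubation_density, p_total=0.5)
```

With these inputs the detections are spread over days 1 to 3 rather than appearing on the day of infection, which is the retardation effect described in the text.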
The probability distribution of the incubation period can be estimated by using the website data on the spread of coronavirus in the world 13 . Thus, it can be assumed that function $f_{IA}(t - \tau)$ has the form shown in Figure 5 (obviously, the function must also include the multiplier $P_{IA} \le 1$, taking into account that some infected persons remain undetected).
Let us consider the process of transition of healthy individuals to the group of infected ones, which is the most important in the light of the pandemic. The further fate of the infected will not be considered here (that is, transitions I → U are not analyzed). Then, from (4a) and (4b), one can obtain equations (8a) and (8b). The latter terms in (8) appear when the probability $P_{HI}$ in (4a, b) is written out explicitly. Here, an urn scheme in which two individuals are sampled with replacement is used as a model of contacts. The choice is made among all the undetected infected individuals and the healthy individuals susceptible to the disease; f is the average frequency of contacts in the population multiplied by the probability of infection by contact 14 . Thus, the presented formula assumes that an individual can become infected with a probability proportional to the fraction of the infected among all individuals with whom the said individual came in contact (Svirezhev, 1987).
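The contact model described above can be illustrated with a short sketch. It assumes that the infection probability takes the form $P_{HI} = f\, n_I / (n_I + n_H)$, that is, f multiplied by the fraction of undetected infected individuals among all individuals taking part in contacts; this explicit form and the shares used below are assumptions of the example, while f = 0.14 is the value reported later in the text.

```python
def infection_probability(f, n_infected, n_healthy):
    """Probability that a healthy individual is infected in one time step.

    Assumed form: f times the share of undetected infected individuals
    in the pool of contacts (undetected infected plus healthy).
    """
    pool = n_infected + n_healthy
    if pool == 0:
        return 0.0
    return f * n_infected / pool


f = 0.14            # contact frequency times infection probability per contact
n_I = 0.005         # hypothetical share of undetected infected individuals
n_H = 0.990         # hypothetical share of healthy, susceptible individuals

p_hi = infection_probability(f, n_I, n_H)
new_infected_share = p_hi * n_H   # share of the population infected per step
small_share_approx = f * n_I      # approximation valid when n_I << n_H
```

When the share of the infected is small, the exact value and the approximation `f * n_I` nearly coincide, which is the simplification used when the share of cases in the population is small.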
We can attempt to determine the parameters of (8b) on the basis of statistical data on COVID-19 cases. For this purpose, it should be taken into account that, despite the large absolute number of cases, their share in the population is small (9a), so that (8) takes the form (9b). The above-mentioned multiplier $P_{IA}$ (5) is clearly distinguished in formula (9b). Taking into account the above assumption that the ratio of the number of infected individuals to the number of detected cases is constant, the construction of a regression model can be simplified. Namely, the statistical data on the number of detected cases provided by Yandex DataLens 15 were used instead of the unavailable data on the number of the infected. The results are summarized in Table 1.
https://journals.urfu.ru/index.php/r-economy R-ECONOMY, 2020, 6(3), 129-145 doi: 10.15826/recon.2020.6.3.011 Online ISSN 2412-0731
Using the data on the population of the Russian Federation 16 and the parameters from Table 1, $P_\alpha$ (reliability of disease detection, second line of the table) can be determined: $P_\alpha = 65.3\%$. Then, from the first line, the proportion of individuals with a completed incubation period who independently consult doctors every day can be found: d = 18.4%. Finally, f = 0.14. The comparison of the data calculated by model (9b) with the actual data is shown in Figure 6.
Finally, using equations (4)-(9), some of the usual characteristics of the epidemic can be expressed.
The spreading coefficient $R_t$ is given by (10). The condition $R_t < 1$, which is mandatory for lifting the quarantine restrictions, reduces to condition (11). The equal sign in (11) corresponds to a plateau, that is, the termination of growth in the number of the infected. Using the mean value theorem, which is standard for estimating the value of the integral in (11) (Fikhtenholts, 1970), one can obtain (12a), where $SMA_T[n_I(t)]$ is the simple moving average with a period equal to the incubation period. In accordance with (12a), the sufficient condition for reaching a plateau has the form (12b). The meaning of this condition is obvious: in a time unit (e.g. a day), the number of infected people identified and isolated should exceed the number of newly infected.
We use the data from Table 1 and take into account that the most controllable parameter in (12b) is the number of tests carried out per time unit. It can be shown that to reach a plateau and the subsequent transition to a decline in incidence, at least 4.6 million tests must be performed daily (provided that all other factors remain the same); the actual figures for the Russian Federation are much more modest. Given the cost of testing, the alternative is obvious: changing in the right direction all the factors listed in the previous paragraph, including a reduction in the number of contacts (social distancing) and in the probability of infection upon contact (mask and glove regime, disinfection).
The exact solution to equations (4)-(9) is a rather complicated problem, which will not be considered in this work. However, as the above assessments show, even without such a solution, one can obtain useful information for analyzing the development of the COVID-19 pandemic in Russian regions.
16 https://rosstat.gov.ru/ (date of access: August 10, 2020)
In the following sections, we discuss how the parameters of models (9b) and (4), which determine the main characteristics of the pandemic, depend on the socio-economic parameters characterizing Russian regions.

Methods of analysis
The following methods were used to analyze the entire set of collected data on the COVID-19 pandemic and the socio-economic situation in Russian regions.
We apply such methods of statistical analysis as correlation analysis of parameters characterizing morbidity and factors describing the socio-economic state of the regions and multivariate regression.
Correlation analysis was used to determine the degree of influence of regions' socio-economic characteristics on the epidemiological situation in these regions.
By using a multivariate regression, we were able to determine the parameters of the mathematical epidemic spread model proposed in Section 1 based on the comparison of the number of cases predicted by the model and the actual number of cases registered in Russia.
We also applied the following methods of data mining (Barsegyan et al., 2007; Leskovec et al., 2020):
- clustering by using self-organizing Kohonen maps (Debock & Kohonen, 2001), which led us to divide all Russian regions into clusters, that is, groups with similar characteristics of the incidence of COVID-19;
- hierarchical clustering (Zhambu, 1988), which enabled us to separate the "abnormal" clusters that differ significantly from the others in terms of morbidity characteristics and exclude them from the sample for subsequent regression and correlation analysis.
The data from the website on coronavirus in the Russian Federation 17 , the service with Yandex DataLens 18 datasets as well as the data from Rosstat for Russian regions (2018) were used for the analysis.

Results of the analysis of factors affecting the incidence of COVID in Russian regions
This section provides a comparative analysis of Russian regions in terms of the incidence of COVID-19 and socio-economic factors, which, in accordance with the theoretical concepts described in Section 2, can affect the characteristics of the pandemic.

Characteristics of the COVID-19 epidemic in Russian regions
The selection of adequate characteristics of the analyzed objects is a prerequisite for successful application of data mining methods. In what follows, in addition to the obvious characteristics such as the infection rate (number of cases per 1,000 people) and the mortality rate (proportion of deaths among those infected), we are going to use the following indicators:
- the growth rate of the infected (the popular parameter $R_t$), see (10);
- the growth rate of the recovered defined by equation (4f);
- the growth rate of lethal outcomes defined by equation (4d);
- the time from the moment the infected person is identified until the moment of his or her recovery or death.

General characteristics of COVID-19 pandemic in Russian regions
figure shows a normal distribution with the same mean and standard deviation.
As can be seen, the distribution of the lethality level differs significantly from the normal one (this is also confirmed by the corresponding check using the Pearson and Kolmogorov-Smirnov criteria (Ivchenko & Medvedev, 2014)).
The significant deviation of the data presented in Figures 7 and 8 from the normal distribution indicates the presence of certain nonrandom factors that distinguish the course of the pandemic in various regions. This is evidenced by the data on the geometric mean growth indices of the number of the infected, the recovered, and lethal outcomes since the beginning of the pandemic in Russia. The corresponding data as of the end of July are presented in Figure 9.

Figure 9. Growth rates of indices for cases: infections -[R t -1], recoveries -[i R (t) -1] and lethal outcomes -[i D (t) -1]
Source: the authors' calculations based on the dataset from Yandex DataLens service https://datalens.yandex.ru

As can be seen, the spread of indices by regions is also quite noticeable. In this regard, it is of interest to analyze the factors affecting the rate of development of the pandemic. It can be concluded from the theoretical model in Section 2 that the presented indices are determined not only by the pathogenicity of the virus in a particular region but also by the socio-economic parameters of the latter, which affect the probabilities of transitions between different groups of pandemic participants in equation (9).
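A geometric mean growth index of the kind mentioned above can be computed from a cumulative series as follows. This is a minimal sketch: the text does not specify how the authors computed the indices, so the function name and the sample series are assumptions, and only the standard definition of a geometric mean of step-by-step growth ratios is used.

```python
def geometric_mean_growth_index(cumulative):
    """Geometric mean of the step-by-step growth ratios of a series.

    The product of the ratios N[t+1] / N[t] telescopes to N[-1] / N[0],
    so the geometric mean is (N[-1] / N[0]) ** (1 / number_of_steps).
    """
    steps = len(cumulative) - 1
    if steps < 1 or cumulative[0] <= 0:
        raise ValueError("need at least two points and a positive start")
    return (cumulative[-1] / cumulative[0]) ** (1.0 / steps)


# Hypothetical cumulative number of cases growing by 10% per day.
cases = [100.0, 110.0, 121.0, 133.1]
index = geometric_mean_growth_index(cases)
growth_rate = index - 1.0   # the quantity plotted as [index - 1]
```

Subtracting 1 from the index gives the average relative growth per step, which matches the bracketed quantities in the caption of Figure 9.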

Analysis of the time of recovery and death of patients in Russia
In this section we are going to consider one more characteristic of the pathogenicity of SARS-CoV-2 -the duration of the disease. The initial data for the analysis of the time of recovery or a lethal outcome were obtained from Yandex DataLens 19 . The period from March 2, 2020 to July 18, 2020 was analyzed for the regions of Russia and from January 22, 2020 to July 17, 2020 for the world. Data processing was performed by using the scripts developed by the authors in the Python language in the Anaconda data analysis package.
According to the CRISP-DM data mining standard (Chapman et al., 2000), the initial data required processing and preparation for analysis. Therefore, first, the countries in which there were no records of the recovered people 20 or the presented values differed significantly from the values for most countries 21 were excluded from the data sample. Such distortions of statistics lead to an abnormally high mortality rate. We excluded regions and countries where the mortality rate exceeded 50% from the sample for analyzing the duration of the disease. Moreover, subjects with an unrepresentative number (fewer than 2,000) of closed cases (recoveries and lethal outcomes) were also excluded from the sample. Finally, erroneous initial data, such as negative values of the numbers of new cases, recoveries, and lethal outcomes, were corrected.
To determine the duration of the disease, the FIFO (First In, First Out) method was used as the only method available given the official statistics: the first to enter the group of active cases is the first to leave it, by recovery or death. Thus, in each sample analyzed, the total number of recoveries and lethal outcomes was equal to the number of cases.
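The FIFO matching can be sketched as follows. This is not the authors' script, only a minimal illustration under the assumption that the input consists of two daily series, new detected cases and closed cases (recoveries plus deaths); the function name and the sample data are hypothetical.

```python
from collections import deque


def fifo_durations(new_cases, closed_cases):
    """Estimate disease durations by matching closures to the oldest cases.

    new_cases[t] and closed_cases[t] are daily counts. Each closed case is
    paired with the earliest still-active case (First In, First Out).
    Closures that exceed the number of active cases are ignored.
    """
    queue = deque()   # items: [day_detected, number_still_active]
    durations = []
    for day, (opened, closed) in enumerate(zip(new_cases, closed_cases)):
        if opened > 0:
            queue.append([day, opened])
        while closed > 0 and queue:
            start, remaining = queue[0]
            take = min(closed, remaining)
            durations.extend([day - start] * take)
            closed -= take
            if take == remaining:
                queue.popleft()
            else:
                queue[0][1] = remaining - take
    return durations


# Hypothetical data: 3 cases on day 0, 2 on day 1, closures on days 2-4.
durations = fifo_durations([3, 2, 0, 0, 0], [0, 0, 1, 3, 1])
```

The resulting list contains one duration per closed case, so a histogram of it gives an empirical distribution of the disease duration.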
To determine which of the groups a closed case belongs to, the Monte Carlo procedure was applied: closed cases were assigned to the groups "Recovered" or "Dead" with a probability equal to the probability of recovery or death by the time point in question. The latter probability was determined by the ratio of the total number of recoveries (lethal outcomes) to the total number of the infected (detected). The results obtained in the form of the distribution of the disease duration until recovery or death for Russian regions are shown in Figure 10.
20 For example, in Sweden, the number of cases is 77,281, the number of lethal outcomes is 5,568, the number of recoveries is 0.
21 For example, in the UK, the number of cases is 294,803, the number of lethal outcomes is 42,477, the number of recoveries is 1,312.
For comparison, Figure 11 shows the data for the countries included in the sample formed in accordance with the above rules.
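The Monte Carlo assignment of closed cases described above can be sketched as follows. The sketch assumes that, for a closed case, the probability of the "Dead" label is the cumulative number of deaths divided by the cumulative number of closed cases (deaths plus recoveries) at that time point; this normalization, the function name and the sample figures are assumptions of the example.

```python
import random


def assign_outcomes(n_closed, total_recovered, total_dead, rng):
    """Randomly label closed cases as "Recovered" or "Dead".

    The probability of death is taken as total_dead divided by the total
    number of closed cases so far (an assumed normalization).
    """
    total = total_recovered + total_dead
    p_dead = total_dead / total if total > 0 else 0.0
    return ["Dead" if rng.random() < p_dead else "Recovered"
            for _ in range(n_closed)]


rng = random.Random(2020)   # fixed seed so that the example is repeatable
labels = assign_outcomes(n_closed=1000, total_recovered=9800,
                         total_dead=200, rng=rng)
share_dead = labels.count("Dead") / len(labels)
```

With a 2% death probability the share of "Dead" labels in a large sample stays close to 0.02, while individual runs differ because of the random draw.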
The presented figures show that the disease duration until recovery or death does not differ statistically significantly. On the contrary, the data on the disease duration for different countries and different Russian regions differ significantly. In particular, in Russia, the most probable disease duration is higher than in other countries (with a significantly lower mortality rate). Given the occasional foreign reports of the re-infections with COVID-19, there is some reason to believe that the existing treatment regimens for such patients may allow the completion of treatment prior to the patient's final recovery.

Clustering of Russian regions by disease characteristics
It should be noted that in different Russian regions, the disease can be caused by different strains of SARS-CoV-2 22 ; therefore, before analyzing the socio-economic factors influencing the spread of the disease, it is necessary to group the regions where the course of the disease has similar medical characteristics, that is, the subjects included in a cluster are closer to each other in these characteristics than to the subjects of other clusters. The mortality rate and the average time that elapses from the moment of infection to the recovery or death of the infected person were considered as such characteristics. The clustering of Russian regions was carried out by using a self-organizing neural network (Debock & Kohonen, 2001). As a result of data processing by the neural network 23 for all Russian regions included in the sample, the subjects were distributed across 10 clusters presented in Table 2. Table 2 also shows the significance levels of the variables for cluster formation (100% is the maximum significance). The higher the level of significance of a variable, the more likely it is that subjects with similar values of this variable will fall into the same cluster. By the degree of similarity of the variables of the subjects included in them, the clusters presented in Table 2 can be combined into a hierarchical structure shown in Figure 12. The vertical axis represents the conditional distance (degree of difference) between the clusters (%). The smaller this distance, the smaller the difference between the clusters. It is noteworthy that the clusters containing a small number of subjects (0, 3, 5, 9) are quite far from all the others. In the next section, when analyzing the dependence of the characteristics of the epidemic on the socio-economic factors of the region, we will rely on this hierarchy and carry out the analysis only for subjects belonging to relatively close clusters.
22 There have been reports in the media about the existence of various modifications of coronavirus in Russia (for example, https://radiokp.ru/obschestvo/skoltekh-nashel-v-rossii-9-shtammov-koronavirusa-kotorykh-net-v-drugikh-stra-nakh_nid28655_au67au); however, at the time of this writing, it was not possible to find any serious scientific papers confirming this (quite probable) thesis.
23 The Deductor Studio software, academic version 5.3.0.88, was used for clustering.
4. Discussion. Correlation and regression analysis of socio-economic factors affecting the characteristics of the pandemic
As we pointed out at the end of the previous section, we will restrict ourselves to a correlation analysis of morbidity characteristics only for the subjects included in clusters 1, 2, 4, 6, 7, and 8. As can be concluded from equations (9)-(11), the rate of increase in morbidity $R_t$ is determined by the difference between the frequency of infection f and the frequency of detection and localization of patients (13). At the same time, the number of cases at the moment when the morbidity plateau is reached satisfies equation (14), from which (15) follows. Thus, the morbidity rate $R_t$ and the number of cases are proportional to the same value.
In accordance with the above, let us consider the factors that can affect parameters (15). Below we present only those results for which the correlation coefficients between the factors and the morbidity characteristics differ significantly from zero. The correlation coefficients are considered by groups of socio-economic factors. Since the values of the correlation coefficients are generally small and the distributions of a number of factors differ significantly from the normal distribution, the quadrant correlation coefficient is also calculated (Amosova et al., 2001). The factors for which the values of the standard and quadrant correlation coefficients differ significantly (and, moreover, have different signs) are discarded. Before the correlation analysis, all characteristics $X_i$ for the sample of Russian regions are reduced to the standard form $(X_i - \bar{X})/\sigma_X$, where $\bar{X}$ is the sample mean and $\sigma_X$ is the sample standard deviation. The results are presented below.
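The two quantities described above, the standardized characteristics and the quadrant correlation coefficient, can be sketched as follows. The quadrant coefficient is computed here by its textbook definition, the mean product of the signs of deviations from the medians, which is assumed to match the definition in the cited source. The function names and the toy data are illustrative assumptions.

```python
import statistics


def standardize(values):
    """Reduce a sample to the standard form (X_i - mean) / sample std. deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    return [(v - mean) / sd for v in values]


def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def quadrant_correlation(x, y):
    """Mean product of the signs of deviations of x and y from their medians."""
    mx, my = statistics.median(x), statistics.median(y)
    return sum(sign(a - mx) * sign(b - my) for a, b in zip(x, y)) / len(x)


def pearson(x, y):
    """Standard (Pearson) correlation computed from the standardized samples."""
    zx, zy = standardize(x), standardize(y)
    return sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)


# Toy data: a monotone relationship with one outlier in y.
x = [1, 2, 3, 4, 5, 6]
y = [2, 4, 6, 8, 10, 100]
print(pearson(x, y), quadrant_correlation(x, y))
```

Because the quadrant coefficient depends only on signs relative to the medians, a single extreme value barely changes it, whereas the standard coefficient is pulled noticeably away from 1 in this example. This is why it is a useful cross-check when the distributions are far from normal.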
First, we checked whether general characteristics of a region, such as area, population, number of municipalities, population density, and road density (which affects the transport mobility of the population), correlate with the characteristics of morbidity. Statistically significant correlations are presented in Table 3.
Thus, the risk group (a high rate of development of the epidemic and a large number of cases) includes regions with a large population and a large number of municipalities. When analyzing the structure of the population, significant correlations (with a confidence level of more than 99%) were found only for the shares of the urban and rural population: the correlation coefficient is +0.3 for the share of the urban population and -0.3 for the share of the rural population, and the corresponding quadrant correlation coefficients are ±0.27. This supports the above conclusion about the increased risk of disease in large settlements. The conclusion is also confirmed by the analysis of the main economic indicators presented in Table 4.
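To give a sense of what a 99% confidence threshold means for correlation coefficients of this size, the following sketch estimates the smallest absolute value of a correlation coefficient that differs significantly from zero for a given sample size. It uses the Fisher z-transformation, which is an approximation and not necessarily the test applied in this study. The sample sizes of 80 and 50 are purely hypothetical, since the number of regions in the analyzed clusters is not restated here.

```python
import math
from statistics import NormalDist


def critical_r(n, confidence=0.99):
    """Approximate smallest |r| significantly different from zero (two-sided).

    Uses the Fisher z-transformation: under the hypothesis of zero correlation,
    atanh(r) is approximately normal with standard deviation 1 / sqrt(n - 3).
    """
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    return math.tanh(z / math.sqrt(n - 3))


def is_significant(r, n, confidence=0.99):
    """True if |r| exceeds the approximate critical value for sample size n."""
    return abs(r) > critical_r(n, confidence)


# Hypothetical sample sizes, chosen only to illustrate the dependence on n.
for n in (50, 80):
    print(n, round(critical_r(n), 3), is_significant(0.3, n))
```

Under these assumptions, a coefficient of 0.3 clears the 99% threshold for a sample of 80 subjects but not for a sample of 50, which shows how strongly the significance of small correlations depends on the number of regions analyzed.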
The results in Table 4, in line with the previous data, indicate an increased risk of the pandemic in industrialized regions with large fixed assets, a significant gross regional product, and developed retail trade. This is natural, since all of the above factors imply the concentration of a sufficiently large number of people in a relatively limited space (work premises, offices, shopping centers, and shops), that is, an increase in the number of contacts (parameter $f$ in (13)-(15)). This feature of pandemics like COVID-19 indicates the significant risks that they can pose for economically developed regions. In this regard, the risk management plans of business entities should be revised to take such threats into account.

Table 4. Significant (with a confidence level of more than 99% and a value greater than the specified value) correlation coefficients of the characteristics of the incidence of COVID-19 with the main economic indicators of a Russian region, correlation coefficient/quadrant correlation coefficient
* Average value by type: "Manufacturing", "Supply of electricity, gas and steam; air conditioning", "Water supply; wastewater disposal, waste collection and disposal, pollution elimination activities", "Retail trade turnover, mln rubles".
Source: the authors' calculations based on the dataset from the Yandex DataLens service, https://datalens.yandex.ru.

When analyzing the impact of the labor market on COVID-19 morbidity, the following significant correlations (with a confidence level of more than 99%) are worth noting: It is obvious that the factors listed above cover the main activities of modern urbanized territories, the state of the labor market, transport systems, and education. The lack of any significant correlation with the characteristics of health care and with morbidity unrelated to COVID-19 is noteworthy. The use of such information makes it possible to determine all the parameters characterizing the development of the disease in equations (4)-(9) and thereby to develop a consistent mathematical model corresponding to the actual data, whose numerical solution will enable us to consider various probable scenarios for the development of the pandemic in Russia. This task will be addressed in our future works.

Conclusion
The mathematical model proposed in this article allows for a logically consistent description of the development of the COVID-19 pandemic in Russia. It is noteworthy that the model relies on data that can be obtained from the national system of morbidity statistics and does not require the analysis of individual case histories, which are much more difficult to obtain. An additional result was a check of the completeness and reliability of the statistical data. When calculating the distribution of disease durations in the regions, we checked the correspondence between the daily flows of infected and closed cases. While only isolated cases of inconsistency were observed in Russia, the analysis of data from around the world revealed quite a number of countries that give reason to doubt the adequacy of their statistical recording of morbidity. The need for, and importance of, an adequate and timely statistical accounting system for making management decisions during pandemics like COVID-19 can be considered one of its lessons.
Given the importance of this issue, we decided not to present the corresponding results and conclusions in detail in this article, leaving them for a separate study.
The correlation analysis of the influence of socio-economic factors on the development of the disease indicates that COVID-19 and other similar pandemics are a serious challenge for modern civilization. Just as a virus develops in the cells of an organism by exploiting their normal functioning, the pandemic develops by exploiting the socio-economic processes that have evolved in, and are necessary for, a civilized society. In this regard, a kind of "vaccination" of socio-economic systems is required to reduce the rate of infection, the number of cases, or both indicators at the same time. This refers to the rational organization of transport services, cultural events, jobs, educational processes, and the entire socio-economic structure of the region. Some experience with quarantine measures has already been accumulated. However, to ensure the resilience of socio-economic systems to such diseases, evidence-based measures are needed that act selectively on the key risk factors for the development of the pandemic and take into account their long-term consequences. We recommend analyzing the existing strategies for the socio-economic development of the regions for the period up to 2035 from this point of view and making the necessary adjustments. Such measures and adjustments to strategies should be based on models such as those outlined in this paper.