Lyme disease (LD), a tick-borne, bacterial, zoonotic infection, remains a serious challenge for public health. The disease is distributed globally, predominantly in temperate portions of the Northern Hemisphere such as Europe, Canada and USA (Oliver et al., 2014). In the United States (US), the geographical distribution of LD is primarily confined to the north-eastern and mid-western areas (Pepin et al., 2012). Past studies have shown that in these areas, LD is caused by Borrelia burgdorferi sensu stricto. The pathogen is mainly transmitted to humans during blood meals by the bite of infected blacklegged ticks (Ixodes scapularis) with white footed mice (Peromyscus leucopus) serving as the primary reservoir for this bacteria (Anderson, 1989). The disease is the most common vector-borne disease in U.S. with an estimated average number of 30,000 new cases every year (Radolf et al., 2012); however, the genuine number is likely much higher. The mean incidence rate of LD in the top 13 U.S. states with the highest incidence rate during 2005-2009 progressively rose from 29.6±10.6 per 100,000 in 2005 to 49.6±15.5 per 100,000 in 2009 (Pepin et al., 2012). At the same time, in 11 states with the lowest incidence rate, the mean incidence developed from 1.3±0.7 to 2.3±1.7 per 100,000 individuals (Pepin et al., 2012). Although this common zoonotic disease rarely leads to death, it can cause severe symptoms related to skin, joints and heart in addition to anxiety and depression if untreated (CDC, 2016). LD can also be a socioeconomic burden to society.
In recent years, exploratory spatial data analyses (ESDA) to describe spatial patterns of LD has increased significantly as a strategy to improve our understanding of disease transmission and risk. Several recent studies from different parts of the U.S. have examined the spatial pattern of LD using ESDA. For example, Kugeler and colleagues (2015) applied circular scan statistics to detect high-risk counties of LD in the U.S. from 1993 to 2012. They showed that the number of counties with high incidence of LD successively increased from 69 (1993-1997) to 130 (1998-2002) and further to 197 (2003-2007) and 260 (2008-2012) counties, respectively (Kugeler et al., 2015). In Texas, which is a nonendemic LD area, Szonya et al. (2015) applied global and local Moran’s I-tests to determine the distribution and location of possible clusters, respectively, with respect to the spatial distribution of LD at the county level (2000-2011). They observed a clustered distribution with a high incidence cluster in central parts of the state, mainly in a cross-timbers eco-region (Szonyi et al., 2015). In Virginia, Li et al. (2014) utilized the Empirical Bayes smoothing (EBS) method on census tract LD cases with the aim of lessening random variations, especially in censuses with small populations. Then, they applied space-time scan statistic and found a primary cluster in northern Virginia which had experienced population growth and urban-sub-urban improvements between 2008 and 2011 (Li et al., 2014).
The town of Lyme in Connecticut was the first spot that LD was recognized in the U.S. The initial cluster in 1976 was observed in children (Steere et al., 1977). Since then, in spite of all endeavours conducted by the Connecticut Department of Public Health (CTDPH) to control the disease, it remains endemic with substantial morbidity rates. Although LD is a well-investigated epidemiological subject in Connecticut, historical changes in patterns of disease have been minimally studied (Cromley et al., 1993. Additionally, even though geographical information systems (GIS) is a useful tool to study infectious diseases (Moore and Carpenter, 1999), powerful GIS-based studies of LD from this region are insufficient for prioritizing counties for intervention. Thus, the main objective of this study is to use the combination of GIS and ESDA to better describe changes in the spatial pattern of LD, supposing that the reported passive cases of LD represent a spatially random subsample of the disease in the state. Our specific research questions included: 1) what is the spatial distribution (random/ dispersed/ clustered) of LD incidence over the past 24 years?; 2) If clustered, where have the clusters/hotspots occurred? 3) Can we avoid or minimize the effects of edges in the spatial scan statistic technique?; and 4) Which cluster detection technique (LISA/spatial scan statistic) is more sensitive, specific and accurate?
Materials and Methods
Data collection and preparation
We used passively reported indigenous LD cases over a period of 24 years from 1991 to 2014 throughout the state of Connecticut. We retrieved data from the CTDPH containing yearly counts and rates of LD at the town level. The CTDPH has a well-established LD surveillance system operating since 1987. Reports of cases were based on the National Surveillance Case Definition from the Centers for Disease Control and Prevention (CDC) for LD (CDC, 1996, 2008, 2011). Data were geocoded and grouped into four equal intervals (each period included six years: 1991–1996, 1997– 2002, 2003–2008 and 2009–2014) to further explore clustering, possible clusters and how hotspots had changed.
Administrative boundaries of towns were obtained from the Map and Geographic Information Center (MAGIC) of Connecticut GIS data using the shapefile format (http://magic.lib.uconn.edu/). Similarly, annual population statistics were downloaded from CTDPH (http://www.ct.gov/dph/site/default.asp). The study area and the names of the towns mentioned in this paper are shown in Figure 1.
We applied global clustering techniques to statistically evaluate whether the existing pattern of LD incidence was random, clustered, or dispersed. We used the global Moran’s I statistic (Moran, 1950) to measure spatial autocorrelation using GeoDa software version 1.6.7 (Anselin, 2004). The null hypothesis assumes that there is no spatial pattern among the incidence of LD in different towns (i.e. complete spatial randomness) (O’Sullivan and Unwin, 2014). This statistic employs a covariance term between each town and its neighbours as follows (Mitchell, 1999):
where xi and xj are incidences of LD in the ith and jth towns, respectively; N the aggregate number of towns; and wij the spatial neighbourhood weight for towns i and j generated based on the firstorder Queen’s contiguity which shares all common points including boundaries and vertices. The generated spatial weight is used as a criterion for recognising neighbours of each town. The weight is defined taking into account adjacent neighbours and written as:
The Moran’s I index varies between -1 and +1, with 0 showing spatially random distribution, while negative values indicate dispersed distributions and positive values for clustered distributions. We assessed significance of the index using both the Z-score and P value.
The global clustering techniques provide information about the overall distribution of LD (random, clustered or dispersed), but we were also interested in identifying local clusters. First, we applied the EBS routine to account for variation in town sizes and populations. Contrasts in population size among the spatial areal units (i.e. towns of Connecticut) may lead to variance instability and spurious outliers (Anselin, 2004). This is due to the observed raw rate in spatial areal units with small population being profoundly affected by small changes of adding or removing few cases. Thus, crude rates might not reflect underlying risk compared with other areal units with large populations. EBS provides a solution to avoid this type of possible bias as it adjusts the estimated risk toward the global mean to reduce variance instability (Clayton and Kaldor, 1987); areas with low population are adjusted more than areas with larger populations. Since there was a considerable differences in areas of some towns (e.g., Derby and New London are approximately 5 mi2 while Woodstock and New Milford cover more than 60 mi2) and also population size of towns (e.g., Union and Canaan have about 1,000 people, whereas New Haven and Bridgeport have more than 120,000 individuals) applying the EBS is justifiable. We calculated spatial weights for each time interval using the first-order Queen’s contiguity. EBS smoothed rates were employed in local cluster detection analyses.
Local Moran’s I
To detect local clusters of the LD rates smoothed by EBS, we applied Anselin’s local indicator of spatial autocorrelation (LISA) statistics (Kulldorff, 2010). LISA identifies hotspots (towns with a high incidence surrounded by high incidences); coldspots (towns with a low incidence surrounded by low incidences) and outliers (towns with a high incidence surrounded by low incidences, or towns with a low incidence surrounding by high incidences). We used GeoDa (https://geodacenter.github.io/) for LISA analyses. We set the number of permutation tests to 999 and 95% significance level (P<0.05). We mapped significant clusters using ArcGIS 10.2 (ESRI, Redlands, CA, USA).
Spatial scan statistics
We were interested in comparing LD clusters identified by LISA and the Spatial Scan Statistic (SSS), for which we used an ellipsoidal, moving window situated on the centroid of each town so that at any point the window incorporated different sets of neighbours. At each position, the radii of ellipse was set to vary continuously from 0 to a maximum that never included more than half of the total population at risk. If the window contained the centroid of the neighbouring towns, then that whole town was included. This procedure produces a very large number of ellipsoidal windows and each one can be a possible cluster of LD with different set of neighbours. The ratio of the length of longest to the shortest axis of the ellipse was 1.5, 2, 3, 4 or 5. For each shape, a different number of angles (i.e. the angle between the horizontal east-west line and the longest axis of the ellipse) of the ellipse were also tested. For each ellipse a likelihood ratio statistic was computed based on the number of observed and expected cases within and outside the ellipse. The null hypothesis (i.e. LD incidence is equal inside and outside of the window) was tested against the alternative hypothesis that the risk was elevated within the ellipse. The likelihood ratio is reported by P values calculated based on the Monte Carlo simulation approach which finds the maximum likelihood ratio over the entire study region. The ellipse with the maximum likelihood was signified the most likely (primary) cluster. This approach also detected secondary clusters which were additional ranked clusters that had high likelihood ratios but did not overlap the primary cluster (Kulldorff, 2010).
Three sets of data were built for the analysis based on discrete Poisson probability model in SaTScan software version 9.4.2 (Kulldorff, 2010). The datasets were: Case file representing the annual number of cases of LD for each town (n=169) from 1991 to 2014; Coordinate file was the two-dimensional Cartesian coordinates of centroid of each town; and Population file was the population size of each town. We used the mean population and mean number of cases per time interval. To scan the study area we applied two criteria: no geographic overlap between clusters and 5% of the population as the maximum to search for hotspots with the aim of comparing the results with LISA’s high-high clusters as the LISA analysis only considered the neighboring towns. Also in another run, we used no geographic overlap between clusters and 50% population at risk to investigate whether the results depended on the primary settings. To ensure statistical power, the number of Monte Carlo replications was set to 999 and only clusters at 99% confidence interval were considered (P<0.01).
Table 1 shows a 2x2 confusion matrix used to compare the performance of LISA and SSS (for 5% and 50% of populations at risk) to detect hotspots, sensitivity, specificity and overall accuracy (Fielding and Bell, 1997). In this case, sensitivity measures how well the cluster detection techniques correctly detected the presence of LD hotspots, whereas specificity provides a measure of how well the techniques correctly identified the absence of LD hotspots. Overall accuracy shows the ability of cluster detection techniques to identify true positive and true negative LD hotspots. Locations of the detected clusters with both LISA and SSS techniques were compared with the location of true clusters as defined by Birnbaum et al. (1996) that true clusters explain fewer than 5% of all reported clusters. Therefore the top 5% towns with regard to high LD incidence were considered as True Hotspots. We calculated these statistics for each technique in each period, separately.
There were 54,478 reported human LD cases from 1991 to 2014, out of which 10,328 cases (19.0%) occurred in first period (1991-1996), 20,234 cases (37.1%) in second period (1997-2002), 11,210 cases (20.6%) in third period (2003-2008) and 12,706 cases (23.3%) in the last period (2009-2014). The annual incidence of LD ranged between 84.7 (in 1993) and 305.6 (in 2002) cases per 100,000 individuals with a mean of 144.8 cases per 100,000 people (Figure 2). The P values for the global Moran’s I statistic were close to zero which reject the null hypothesis of complete spatial randomness (CSR) for all time periods (Table 2). The index values ranged from 0.55 to 0.71, which indicates significant clustering. Although, we detected clustering with both raw and smoothed methods, and in all four periods, the variation in rates required EBS before running LISA. Based on the results of EB smoothing technique, LISA found High-High clusters (hotspots), which varied for each study period. Results of LISA showed that in the first and last period, the hotspots were completely restricted to the towns in the East. In other words, for the first period, 24 towns and for the last period 30 towns were identified as the hotspot in eastern Connecticut. The towns in the West were more influenced in the second and third periods. The number of towns in the West identified as hotspots in the second period was 8, while 7 towns were hotspot in the East of Connecticut. In addition, in the third period, 10 towns in the West and 6 towns in the East were identified as hotspots. It should be noted that applying EBS before running LISA, reduces the likelihood of detecting a false cluster in low-population areas where the cases were detected. But comparison of the detected clusters by raw and EBS methods in LISA showed small differences with regard to the location of the detected clusters.
Spatial scan statistics with 5% and 50% of the populations at risk with no geographic cluster overlaps identified clusters with high incidence rates for each period. Except for the second period, The SSS results revealed that the most likely cluster predominantly occurred in the eastern region of the state, while the secondary cluster occurred in towns in the West (Figure 3). Primary and secondary hotspots were observed in different locations when 5% of the population at risk was investigated using SSS for comparison with LISA. For the first period, the primary cluster occurred in the eastern parts of the state and included 22 towns and 792 cases. The risk of LD incidence within the primary cluster was 7.67 times greater than outside the cluster. The secondary cluster occurred in the western region and contained 5 towns and 152 cases. During the second period, the primary cluster occurred only in the eastern parts of the state and included 23 towns and 1,208 cases. The risk of LD incidence within this cluster was 3.40 times higher than outside. During the 2003-2008 period, the disease largely affected the eastern region with 22 towns and 738 cases as compared to the western parts with 10 towns and 145 cases. The relative risks of primary and secondary clusters were 3.33 and 9.18, respectively. In the last period, only eastern parts of the state showed statically significant clustering (no secondary cluster). The primary cluster contained 24 towns and 942 cases. The risk of LD incidence in this cluster was 3.69 times more than in other areas (Table 3).
Comparison of the results of accuracy assessments of the spatial scan statistics at 5% and 50% of the populations at risk, shows that sensitivity of this method increases with an increment of the population defined to be at risk. However, increasing the population at risk also leads to a decrease in the specificity of the results. In addition, LISA tended to have higher specificity and had the highest overall accuracy (Figure 4).
This retrospective study examined the spatial structure of LD incidence distribution in Connecticut based on 24 years of reported data with the aim of describing the spatial distribution and the changes that have occurred with regard to the disease. It differs from most other studies in the region by not focusing on local studies of the pathogen, reservoir or vector and their associations with the environment. Instead, it focuses on the changing patterns of documented human disease occurrencies. We assumed that the number of affected human cases corresponded to the density of Ixodes scapularis in Connecticut (Connally et al., 2006; Mather et al., 1996; Stafford et al., 1998). In addition, it should be noted that ticks have limited capabilities to move to new areas because of their small size (Ogden et al., 2008). Therefore, one of the means to fight against the risk of the disease in the study area would be targeted intervention in the areas (towns in this case) that were constantly affected with a very high morbidity rate. Targeted control, planning and management of the disease can assist with resource allocation to the towns with persistent high incidence rates resulting in time and costs savings.
Results of this study confirm and extend the findings of Xue et al. (2015) in Connecticut. They analysed yearly clusters of LD and showed that the distribution was clustered and that this clustering occurred in western and eastern Connecticut with few cases in the central region supposing the yearly incidence weighted-geographic mean analysis represents the clusters of the case distributions. According to the paper, the epidemic of LD reached equilibrium in 2007 in the western parts of Connecticut, while this happened in 2009 in the East. Comparison of the results shows that periods 1-3 included epidemic conditions while period 4 contained equilibrium condition. Therefore, actual/sudden shifts in the locations of clusters occurred roughly before equilibrium (period 1-3); and it may be that epidemic conditions affected both western and eastern parts of the state; while equilibrium conditions only occurred in the eastern regions.
This study identified towns with high LD incidence rates using the LISA test and spatial scan statistics approaches. By intersecting the locations of the clusters identified by LISA some towns, including Chaplin, Windham and Scotland, were persistently detected to have high-rate clusters. Likewise, neighbours of these towns including Andover, Columbia and Lebanon, were recognized as having high-rate clusters in 3 out of the 4 periods of study (Figure 1). This indicates that these areas were almost persistently affected by a high incidence of LD during the 24 years of study and therefore they deserve closer consideration.
In this study, we used an elliptical window in the scan rather than highly applied circular shape used in other studies for several reasons. First, according to previously published papers, the elliptic shape had better power and precision compared with circular one and also follows more accurately certain geographic features with varying shapes and directions (Root et al., 2009). Another reason was to avoid edge effects at the borders of the study area. When circles are used for scanning the study area, particularly when the circle should centre on the centroid of border towns, considerable parts of it inevitably covers the neighbouring state (i.e. Rhode Island in the East, Massachusetts in the North, Long Island Sound to the South and New York in West, where states are also LD-endemic areas where data were not accessible) and the true number of cases would be underestimated. The use of ellipses with different shapes and directions helps to reduce this problem to some degree.
There were similarities in the locations of the spatial hotspots identified by SaTScan and LISA even though these methods apply different methodologies to detect hotspots. It should be noted that we smoothed the incidence rate for the LISA cluster detection method, while this is not available for spatial scan statistics. The results were also in agreement with the findings of Barro et al. (2015), who compared different cluster detection techniques including Getis-Ord Gi*statistic, a multidirectional optimal ecotope- based algorithm (AMOEBA) and the spatial scan statistic for identifying hotspots of human cutaneous anthrax in Georgia for point data. They also found that SSS was more sensitive (due to augmentation in the quantity of true positives) but less specific (by increment of the population at risk because of a declining number of true negative clusters). Here, the results were highly dependent on defining the weight matrix in LISA or the percentage of the population at risk in spatial scan statistics.
The most important limitation of this study is attributed to the data reported that were used in this study. As indicated by the CDC, surveillance data are subject to under-reporting and misclassification in highly endemic areas such as Connecticut; however, this problem would not be very severe in light of well-designed LD surveillance system in this state. Additionally, as long as there is simply a spatially random thinning of reported cases, the overall spatial analysis should not be affected. However, if there is persistent over- or under-reporting in selected regions, it may be difficult to recognize the impact of such biases. Another factor influencing the results is that the definition of LD has changed several times: from using a two-test approach for laboratory affirmation (CDC, 1996) to the addition of probable and suspect categories with less strict criteria (CDC, 2007) and subtle changes in the specification of confirmed cases (CDC, 2011). However, empirically (Figure 3), these changes do not seem to substantially have altered the areas of high-risk clusters. Therefore, although the analyses conducted were based on data reported by CTDPH, the findings of this study should be interpreted with some caution.
The findings of this study can be regarded as a basis for generating hypotheses about underlying risk components. For instance, visual comparison of the locations of towns that were never identified as cluster areas (Figure 3) and the digital elevation model (DEM) of the study area suggest that low-risk towns are located in low-lying areas; thus it seems that people who live at moderate or higher altitudes in Connecticut are at a higher risk. This would be consistent with the earliest epidemiologic investigations (Steere et al., 1977) that identified increased risk in areas away from the seashore. Thus, altitude as a proxy of other environmental factors can uncover the reasons of the spatial distribution of the disease in the study area. Moreover, except for Windham, all of the other towns identified as persistent clusters had populations lower than 8,000 people showing that persistent towns as clusters occurred in less populated areas. Therefore, further studies should incorporate other environmental factors that have significant influence on the spatial and temporal patterns of the disease.