Dataset construction, cluster identification and definition of DCCs
Whole genome sequencing of two collections of isolates from Manchester, UK, and the Netherlands was carried out as previously described2. Briefly, DNA was extracted from colony sweeps of subcultured samples before to paired-end sequencing using the Illumina HiSeq platform. These samples were combined with whole genome sequencing samples from previous studies2,10,12,13,14,15,16,18,43,44,45,46 to give a final dataset containing 2,045 samples from 1,178 patients. Samples were genotyped to the subspecies level using Mash47, comparing against one reference genome from each subspecies. Sample were assigned to the subspecies with the smallest genetic distance; all samples exhibited a clear subspecies assignment. A summary of all samples used, including sample accession numbers, is provided in Supplementary Table 3.
Sequencing reads from each sample were mapped against the corresponding subspecies reference sequence using the multiple_mappings_to_bam pipeline (https://github.com/sanger-pathogens/bact-gen-scripts) with BWA-MEM as the aligner. ATCC19977 (accession no. CU458896.1) was used as the reference for M. a. abscessus and CIP_108297 (accession no. GCF_001792625.1) for M. a. massiliense. SNPs were called from this alignment using the multiple_mappings_to_bam pipeline. Subspecies phylogenetic trees were reconstructed using RAxML v.8.2.12 (ref. 48) with the general time reversible (GTR) model of nucleotide substitution and gamma rate heterogeneity with four gamma classes. To enable extraction of maximal genetic variation, clusters of samples were identified in the subspecies trees using FastBAPS49. This clustering identified 19 clusters in M. a. abscessus and 17 clusters in M. a. massiliense (Extended Data Fig. 1). A summary of each FastBAPS cluster is provided in Supplementary Table 1.
All subsequent analyses were carried out on each FastBAPS cluster independently. De novo assembly was carried out on each sample as previously described50 and the best assembly was identified for each FastBAPS cluster on the basis of number of contigs and N50-N90 values (Supplementary Table 1). Samples from each FastBAPS cluster were mapped against their respective best assembly as above to maximize the captured SNP diversity. Recombination was removed from FastBAPS cluster alignments using Gubbins v.2.4.1 (ref. 51) and phylogenetic trees were reconstructed for each FastBAPS cluster as above. Phylogenetic trees were viewed and figures constructed using FigTree52 and GGTREE53.
DCCs were identified as clusters of highly related sequences collected from at least 20 patients on multiple continents. DCCs 1, 2, 3, 5 and 7 are FastBAPS clusters while DCCs 4 and 6 are subclusters within a FastBAPS cluster (Supplementary Table 1). No DCCs were identified in M. a. bolletii.
Phylogenetic analyses, temporal, spatiotemporal and population reconstruction
Temporal phylogenetic reconstruction was carried out on DCCs 1–7. DCC-specific datasets were constructed containing the earliest sequenced sample from each patient that clusters within the DCC. These samples were mapped against the respective DCC reference sequence (Supplementary Table 1) as above and a maximum likelihood phylogenetic tree reconstructed with RAxML48 as above. Methods to infer substitution rates and ancestral dates are only valid if there is a temporal signal within the dataset54. We initially assessed temporal signal within each dataset using root-to-tip randomization. In each case, the maximum likelihood tree was rooted to minimize the heuristic residual mean square score using TempEst54. We examined the root-to-tip correlation visually (Extended Data Fig. 7a) and through comparison of the R2 correlation between sample collection date and root-to-tip distance with 1,000 randomizations of the tip dates to identify significance of the correlation. A significant positive correlation was observed for DCCs 1–4 (P < 0.001). We therefore initially reconstructed the temporal history of these DCCs using BEAST v.2.4.2 (ref. 19). We used the Hasegawa–Kishono–Yano (HKY) model of nucleotide substitution. We used the relaxed log-normal clock model with a log-normal prior on the substitution rate with mean set to the estimated slope in TempEst and standard deviation 0.5. We modelled the population history using the coalescent Bayesian skyline population prior. At least three independent runs were carried out for 100,000,000 steps for each dataset. Convergence was assessed using Tracer v.1.7 (ref. 55).
As a more thorough test of temporal signal within each DCC, we carried out the date randomization test56. BEAST v.2.4.2 was run on each DCC dataset using a uniform substitution rate prior between 1 × 10−9 and 1 × 10−5, with these bounds chosen to encompass the likely substitution rates for Mycobacteria based on previous work2,56. Other priors were as described above. The results from these uniform prior runs were highly similar to those with the informed substitution rate prior in each case (Extended Data Fig. 8). Ten date randomizations were performed where the sequence collection dates were randomly assigned to tips. BEAST was run on each of these randomized datasets independently using the same uniform substitution rate prior (1 × 10−9 – 1 × 10−5). All four DCCs passed the date randomization test, defined here as the median posterior substitution rate and most recent common ancestor dates with the real sample collection dates not overlapping with that of any of the ten date randomizations (Extended Data Fig. 8). We did not attempt these analyses with DCCs 5–7 as they did not pass the correlation test.
The inferred substitution rates of DCCs 1–3 were highly similar (Extended Data Fig. 7b). The substitution rate of DCC4 is higher (Extended Data Fig. 7b), probably due to this clade having a far more recent common ancestor date than DCCs 1–3 (Fig. 1c) and therefore less opportunity to remove deleterious substitutions. As DCCs 5–7 contain similar levels of diversity to DCCs 1–3 (Extended Data Fig. 7c), we reconstructed their temporal history as above but using a uniform substitution rate prior with boundaries of 8.76 × 10−8 – 2.41 × 10−7, chosen as the upper and lower 95% highest probability density (HPD) substitution rate estimates for DCCs 1–3 (Extended Data Fig. 7b).
We determined whether each DCC has undergone a historical population expansion by using the Bayesian skyline plot estimates (Extended Data Fig. 9) of relative genetic diversity in the posterior distribution. We used all samples in the posterior distribution and found that all samples in all DCCs exhibited an increase in relative genetic diversity of more than tenfold relative to the value at the root of the tree, thereby strongly supporting a historical population expansion in each case. We identified the date of the expansion in each DCC by calculating the earliest date at which the relative genetic diversity increased by more than tenfold relative to the root of the tree and combined these values into a single distribution, from which the median and 95% HPD was calculated in each case.
Before carrying out spatiotemporal reconstruction, we calculated the association index57 of the distribution of collection continents across the maximum likelihood tree of each DCC. This was significant in each case on the basis of 1,000 location randomizations (P < 0.001 in each DCC), indicating a correlation between phylogeny and continent of collection. We carried out asymmetric phylogeographic reconstructions of DCCs 1–3 using the BEAST_CLASSIC package v.1.3.2 in BEAST v.2.4.2 (ref. 19). Each sequence was labelled with the continent of collection. We used an informed log-normal substitution rate prior and Bayesian skyline population prior as above. We used an exponential prior on the overall rate of lineage movements with mean 1.0. The relative rates of migration between different continent pairs were modelled using a gamma distribution with alpha and beta both set to 1.0. As the number of sequences collected from each continent is unequal for each DCC, we assessed the robustness of our inferences by randomly subsampling the sequences from overrepresented continents and rerunning the spatiotemporal reconstruction. We carried out the subsampling five times and found that the results were highly similar in all subsamples (for example, Fig. 2b). Supported migration routes were identified using SPREAD v.0.9.6 (ref. 58) as directed continent pairs had Bayes factor support greater than three in the dataset without subsampling and at least four of the five subsampled datasets.
Mutational spectrum analysis
The mutational spectrum consists of all of the mutations that have occurred within the history of a sample set in their surrounding nucleotide context31. It is necessary to identify the direction of each mutation, that is, the parental nucleotide and the descendent nucleotide. To identify the mutations that have occurred and their direction, we carried out ancestral reconstruction on each FastBAPS cluster phylogenetic tree. Recombination was removed and phylogenetic trees reconstructed as above. Ancestral reconstruction was carried out on all variable alignment positions using the phylogenetic analysis by maximum likelihood (PAML) package v.4.9 (ref. 59). We compared the fit of HKY, HKY + GAMMA, GTR and GTR + GAMMA models of nucleotide substitution. Results were highly similar with all models and in all cases either the GTR or GTR + GAMMA model was supported. The mutations that occurred along each branch in the phylogeny were extracted from the PAML output. The surrounding nucleotide context of each mutation was identified from the reference sequence that was mapped against. The number of polymorphic sites contributing to each mutational spectrum is shown in Extended Data Fig. 10.
To compare the mutational spectrum in different niches, the branches in the phylogenetic tree were divided into categories (Fig. 3a). We reasoned that the internal branches within the non-DCC clusters will have been environmental as these branches probably often span hundreds to thousands of years during which time prolonged human infection will have been unlikely. We therefore calculated the environmental mutational spectrum by combining contextual mutations inferred along all internal branches of non-DCC clusters.
The phylogenetic branches within clades containing sequences from a single patient represent within-patient evolution. We therefore calculated the within-patient mutational spectrum by combining contextual mutations inferred along branches within monophyletic patient clades.
The relative contributions of environmental and within-patient evolution along tip branches and branches leading to patient ancestors is unclear as the patient may have acquired the infection at any point along this branch. We therefore did not include these branches in the environmental mutational spectrum or the within-patient spectrum.
To examine the mutations acquired during DCC transmission chains, we combined the contextual mutations that occurred along the internal branches within the seven DCC trees (Fig. 3a). We excluded the deep branches in these clades that precede population expansion (Fig. 3a) to only examine mutations that have occurred since emergence and therefore in more recent transmission chains.
We compared mutational spectra between niches by subtracting the inferred environmental spectrum from the DCC internal branch and within CF patient spectra. Significance of observed differences was assessed through 1,000 independent down-samplings of the inferred environmental mutations to the number identified along DCC internal branches or within CF patients. Contextual mutations were identified as significant if their calculated proportion fell outside two standard deviations of the mean proportion in the 1,000 replicates. This process was repeated ten times and all reported significant mutations were significant in all ten runs.
Decomposition of the mutational spectrum into input signatures was carried out using signal32 (https://signal.mutationalsignatures.com/, date last accessed 15 November 2020). The contextual mutations that were elevated in the DCCs relative to the environment were combined into a 10,000 mutation catalogue with their relative frequencies representing their relative enrichment above the environment. This catalogue was used as input for signal specifying lung as the originating organ. Mutational drivers were assigned from the respective COSMIC mutational signature (https://cancer.sanger.ac.uk/cosmic/signatures/SBS/index.tt, date last accessed 15 November, 2020).
Transmission network reconstruction
SNP distances were calculated between all pairs of samples within each FastBAPS cluster using PairSNP60 and the minimum SNP distance between each pair of patients extracted. Patients were classified as being linked at a given SNP cut-off if their closest pair of samples differed by that number of SNPs or fewer. Localized linkages were classified on the basis of available metadata if the patients were in the sample hospital, CF Trust, city or state.
Transmission networks were reconstructed on the basis of minimum SNP distance between patient isolates. SNP distances of 20 and 38 SNPs were previously shown to represent ‘probable’ and ‘possible’ transmission, respectively, on the basis of the number of SNPs observed in within patient infections2. We therefore plotted the transmission network at ten SNPs to represent very likely transmission and 38 SNPs to represent possible transmission.
Transmission network connectivity measures (Extended Data Fig. 6) were calculated using 38 SNPs as a cut-off to include linkages, with 38 chosen to include possible person-to-person transmission events2. Therefore, any patient linkages of 39 SNPs or more were not included. The total number of linkages involving patients with or without CF was identified and divided by the total number of patients within the respective group to calculate the average number of transmission linkages per patient with and without CF. To calculate the weighted connectivity measures, each edge in the transmission network was given a weighting of a 39-SNP distance. Therefore, edges linking patient pairs whose isolates differ by zero SNPs were given a weighting of 39 and edges had zero weighting if they connect patient pairs whose isolates differ by 39 or more SNPs. The total weighting of all edges involving patients with or without CF was identified and divided by the total number of patients within the respective group to calculate the average weighted connectivity. To identify the average weighting of CF–CF, CF–non-CF and non-CF–non-CF linkages, the total weighted connectivity of each edge type was calculated by summing the weights of all respective edges and this was divided by the total number of potential linkages of that type.
Trends in tobacco smoking
Annual estimates of the number of cigarette sales per adult per day were obtained from https://ourworldindata.org/smoking (last accessed 18 March 2021). 30 countries were included with data available from before 1960.
Further information on research design is available in the Nature Research Reporting Summary linked to this article.