Effect of aging on the transcriptomic changes associated with the expression of the HERV-K (HML-2) provirus at 1q22

Background The human genome contains remnants of ancient retroviral infections called human endogenous retroviruses (HERV). Their expression is often observed in several diseases of autoimmune or inflammatory nature. However, the exact biological mechanisms induced by HERVs are still poorly understood. We have previously shown that several HERVs of the HERV-K (HML-2) family are strongly transcribed in the peripheral blood mononuclear cells (PBMC) derived from young and old individuals. To examine the potential functional consequences of HERV-K (HML-2) expression, we have now analyzed the correlation of its expression with age-associated changes in the transcriptome using gene set enrichment analysis (GSEA). We focused our analysis on the HERV-K (HML-2) provirus at 1q22, also known as ERVK-7. Results The genes strongly correlating with the expression of HERV-K (HML-2) provirus at 1q22 expression were found to be almost entirely different in young and old individuals. The number of genes strongly correlating (Pearson correlation coefficient ≥ 0.7) with 1q22 expression was 946 genes in the old and 435 in the young, of which only 41 genes correlated strongly in both. Consequently, the related gene ontology (GO) biological processes were different. In the older individuals, many of the highest correlating processes relate to the function of neutrophils. Conclusions The results of this work suggest that the biological processes associated with the expression of HERV-K (HML-2) provirus at 1q22 are different in the blood of young and old individuals. Specifically, a strong association was found in the older individuals between neutrophil activity and the expression of the HERV-K (HML-2) provirus at 1q22. These findings offer insight into potential effects of altered HERV expression in older individuals.


Background
An estimated 8% of the human genome consists of endogenous retroviruses, which are remnants of exogenous retroviral infections that have integrated into the human germline throughout millions of years. The majority of these human endogenous retrovirus (HERV) proviruses do not contain intact open reading frames due to the accumulation of mutations, insertions, and deletions. Yet despite being seemingly harmless passengers, several studies have shown that activation of HERV proviruses is associated with various diseases, such as multiple sclerosis, rheumatoid arthritis and HIV infection (reviewed in [15]).
However, the exact immunomodulatory mechanisms of HERV activation are not known. It is clear that the original retrotranspositional mechanisms are not involved, as human ERVs have lost their infectious capacity. Therefore, it seems likely that the expressed HERV proteins (env, gag, pol) are critical in this respect [9]. HERV RNA could be recognized as a pathogen-associated molecular pattern (PAMP) by toll-like receptors, and this would induce type I interferon production contributing to the pathogenesis of autoinflammatory diseases [26].
The most recently integrated proviruses, belonging to the HERV-K (HML-2) family, are relatively well preserved and have retained considerable expression capacity [10]. Yet out of the 91 proviruses in the HERV-K family, only two, at 1q22 and at 1q23.3, are thought able to produce intact viral proteins [3]. In our previous work [16], we found the HERV-K (HML-2) at 1q22 to be significantly higher expressed (p-value < 0.05) in nonagenarian PBMCs (Peripheral Blood Mononuclear Cells) compared to young controls (Fig. 1), while expression of 1q23.3 was found to be very low. In our results, median 1q22 mRNA level in the young was 0.357 TPM (Transcripts Per Million), compared to 0.478 TPM in the old, and the mean 1q22 overexpression in the old was 1.3fold. In this work, we have thus focused solely on the provirus at 1q22 in investigating the potential consequences of these transcriptomic changes. The provirus at 1q22 is also known as K102 and ERVK-7 [21].

Results
In order to investigate the potential consequences of the significantly increased expression of HERV-K (HML-2) at 1q22 in PBMCs from older individuals, co-expression of the provirus with human genes was investigated utilizing Pearson's product-moment correlation coefficient. Provirus 1q22 expression was found to correlate strongly (Pearson's correlation coefficient ≥ 0.7) with the expression of 946 genes in nonagenarians, but with only 435 in the young controls, out of 9447 total genes. Of the genes that correlated strongly with 1q22 in either young or old individuals, only 41 correlated strongly (Pearson's correlation coefficient ≥ 0.7) with 1q22 in both the young and the old. Highest correlating genes (Pearson's correlation coefficient ≥ 0.95) can be seen in the supplementary file, in table S1. However, correlating the expression of one provirus to thousands of genes results in strong correlations with some genes by pure chance, even if there is no true association, and therefore it is not enough to look at individual correlations, but necessary to study these correlations as a whole.
To explore what biological processes might be affected by or be behind the transcriptomic changes, gene set enrichment analysis (GSEA) was performed based on known gene ontology (GO) functions of co-expressed genes [1,20,22]. Figure 2a and b show the 15 most upregulated and 15 most downregulated processes in both nonagenarians and young controls, respectively. The 20 most significantly associating GO biological processes are listed in Tables 1 and 2, in the order of significance, for both age groups. Figure 2c illustrates the strong association of 1q22 expression with genes involved in neutrophil activation in the older individuals, demonstrated by the number of highly correlating genes located at the beginning of the gene list.
To quantitatively assess the enrichment of neutrophil associated genes in the transcripts that are strongly coexpressed with 1q22, the hypergeometric test was utilized. The old individuals had 4 of the top 20 most significantly enriched GO terms involve neutrophils (Table 1), which according to the hypergeometric test results in a very low p-value of 4.96e-13, indicating significance (p-value < 0.05). This p-value signifies that the association to neutrophil function seen with 1q22 is very unlikely to be purely coincidental. For the young individuals, no significantly enriched biological processes were directly involved in neutrophil function ( Table 2).
The GSEA indicated association between 1q22 and neutrophil activation was further studied by focusing on the correlations between 1q22 expression and common markers of neutrophil activation: IL-8 [29] and myeloperoxidase (MPO) [17]. The results are shown in Table 3. IL-8, also known as neutrophil chemotactic factor, had a strong and significant correlation with 1q22 in the old (Pearson's correlation coefficient = 0.83, p-value = 0.02), but not in the young (Pearson's correlation coefficient = 0.18, p-value = 0.70). In the old individuals, some correlation was seen between 1q22 and MPO, though this correlation was not statistically significant. MPO, along  with human neutrophil elastase (HNE), is expressed in neutrophils and released to the extracellular space as part of neutrophil degranulation. The expression of HNE was below threshold (mean normalized read count < 50) and was therefore not included in analysis. Correlations between 1q22 and these genes were present only weakly in the PBMCs from young controls.

Discussion
Many of the biological processes enriched in the PBMCs from the old individuals are related to neutrophil activation and function ( Fig. 2a and Table 1). This association between 1q22 expression and neutrophil function is not present in the samples from young controls. In addition to the differences between age groups in the number of directly neutrophil involving GO terms, the GSEA results for the old contain many GO terms that are indirectly related to neutrophil function, such as terms describing granulocyte and leukocyte activity as well as more general immune system functions. Elevated levels of neutrophil activation in old individuals have already been previously reported. For example, in a time series study spanning 15 years, Samson et al. found that the numbers of neutrophils and monocytes increased with age, correlating with frailty and CRP trajectories [18]. It is known that neutrophils are crucial in the first line of defense against infectious agents and that several of their functional activities are affected by aging [7]. As host cells for viruses, neutrophils are suboptimal, as they are both short-lived and terminally differentiated. Table 1 For nonagenarian samples, the GO biological processes that were most significantly associated with HERV-K (HML-2) provirus 1q22 expression, ordered by p-value. Enrichment score for each GO biological process is determined through GSEA, as shown in Fig. 2. Set size refers to the total number of the studied genes associated to each process, while gene ratio is the number of those genes that strongly correlated with 1q22 expression. The GO terms that are directly involved in neutrophil functions are in bold Gene count refers to the number of genes associated with each GO biological process. Gene ratio is the percentage of genes that significantly correlated with 1q22 expression from the total number of genes associated to that process. Terms are ranked in the figure by decreasing gene ratio. One GO term, "neutrophil activation", directly involves neutrophil function and is bolded in the figure. (c) GSEA plot indicating enrichment of the biological process of neutrophil activation in the nonagenarians, based on the concentration of neutrophil activation -related genes at the beginning of the gene list ordered by correlation with 1q22 expression However, internalization of viruses during infection does occur, for example with the West Nile virus, which is reported to replicate within neutrophils [2]. Though many of the weapons in the neutrophil's arsenal are specifically for bacteria, neutrophils still play an important indirect role in viral infection defense by influencing the behavior of other cell subsets [19]. It is this indirect relationship between neutrophils and viruses that may explain why the results indicate enrichment of neutrophil-associated functions in PBMC samples that should not contain more than trace amounts of neutrophil-derived RNA. Though neutrophils are absent from the samples, the effects exerted by lymphocytes on neutrophils and by neutrophils on lymphocytes can still be seen in gene expression from the lymphocyte-containing PBMC samples. For example, neutrophil chemotaxis and activation can be induced by lymphocyte-produced IL-8 [29]. In the results of this work, IL-8 expression was found to strongly correlate with 1q22 expression in the old, yet not in the young, which could be seen as another link between 1q22 expression and neutrophil activation.
Based on the results of this work, it could thus be speculated that a chronic activation of neutrophils by HERVs is leading to the functional exhaustion of neutrophils as part of the aging-associated decline of immune capacity, i.e., immunosenescence. Additionally, as the expression of various HERVs have been shown to be increased in several diseases of autoimmune and autoinflammatory nature, HERV expression could be leading to increased production of pro-inflammatory mediators and in this way potentiating age-associated inflammation, referred to as "inflammaging". It would therefore be of interest to analyze HERV-associated biological processes in cases of autoimmune or autoinflammatory disease.
If HERVs were indeed causing activation of neutrophilassociated mechanisms, they would not be the only viruses to do so. Dunning et al. [6] described a comparable change in the analysis of the blood transcriptome in patients with long-lasting and severe influenza infection. This change from the "anti-viral" transcriptomic signature (including the effects of type-I interferons) to the "anti-bacterial" signature (e.g. neutrophil-mediated) is similar to that we observed in the HERV-K 1q22 expressing PBMC from the older individuals. However, Dunning et al. [6] saw a rise in transcripts associated with the GO term "response to bacterium", while  Table 2 For young control samples, the GO biological processes that were most significantly associated with HERV-K (HML-2) provirus 1q22 expression, ordered by p-value. Enrichment score for each GO biological process is determined through GSEA, as shown in Fig. 2. Set size refers to the total number of the studied genes associated to each process, while gene ratio is the number of those genes that strongly correlated with 1q22 expression. No significantly enriched GO terms were directly involved in neutrophil function in the young control samples that GO term was not enriched in our data. Additionally the transcripts that we found to most strongly correlate with 1q22 in young and in old separately (Supplementary file, Table S1) had no overlap with the transcripts that Dunning et al. describe to be associated with the GO terms "response to virus" or "response to bacterium". The difference in results could be explained by Dunning et al. having sequenced whole-blood samples, as opposed to the PBMC samples used in this work. In our results, the association with neutrophils does not come from a general response to bacterium, rather we see a more specific co-expression with transcripts associated with neutrophil activation. As a limitation of this work, the cell types involved in this provirus 1q22 associated processes cannot be determined based on this PBMC derived data, as the correlating transcripts could be derived from different cell types. A natural next step in this line of research would therefore be to investigate this association between 1q22 and neutrophils in datasets from specific cell types.

Conclusions
The results of this work suggest that the biological processes associated with the expression of HERV-K (HML-2) provirus at 1q22 are different in the blood of young and old individuals. The number of genes strongly correlating with the expression of the provirus is higher in the PBMC from the old individuals compared to the young, and the associated GO biological processes differ between the age groups. The strong association that was found in the older individuals between neutrophil activity and the expression of HERV-K (HML-2) provirus at 1q22 could be explained by a chronic activation of neutrophils by HERVs, leading to the functional exhaustion of neutrophils as part of immunosenescence. Expression f this provirus could also be leading to increased production of proinflammatory mediators, thus potentiating inflammaging. Taken as a whole, these findings offer insight into potential effects of altered HERV expression in older individuals. Investigation of provirus 1q22 expression associated biological functions in specific cell types and in conditions of autoimmune or autoinflammatory disease could further elucidate the consequences of its expression.

Materials and methods
This work extends the bioinformatic analyses done on our Vitality 90+ dataset that we have previously presented in [16]. Below are again described the sample collection and RNA sequencing protocols used, followed by the new analyses that have been performed as part of this work.

Study populations
Samples were collected from two populations: nonagenarian individuals (n = 7) and young controls (n = 7). Nonagenarians studied are all 94-year-old women, participants in The Vitality 90+ study. Young controls are women aged between 26 and 32, with a median age of 28, all healthy laboratory personnel with no medically diagnosed chronic illnesses, were non-smokers and had not had any infections or received any vaccinations within the 2 weeks prior to blood sample collection. The methods of recruitment and characterization of participants were done as has been reported previously [8]. The study participants provided their written informed consent.

Sample collection
Blood samples were collected by a trained laboratory technician in laboratory facilities. All blood samples were drawn between 8 am and 12 am and collected into EDTA containing tubes. Samples were directly subjected to leucocyte separation on a Ficoll-Paque density gradient (Ficoll-Paque™ Premium, cat. no. 17-5442-03, GE Healthcare Bio-Sciences AB, Uppsala, Sweden). The PBMC layer was collected and cells used for RNA extraction were suspended in 150 μl of RNAlater solution (Ambion Inc., Austin, TX, USA).

RNA extraction
RNA used for RNA sequencing was purified using a miRNeasy mini kit (Qiagen, CA, USA) according to manufacturer's protocol with on-column DNA digestion (Qiagen). The concentration and quality of the RNA was assessed with a NanoDrop ND-1000 spectrophotometer (NanoDrop Technologies, Wilmington, DE, USA).

RNA sequencing
Agilent Bioanalyzer RNA nano chips (Agilent) were used to evaluate the integrity of total RNA and Qubit RNA kit (Life Technologies) to quantitate RNA in samples. 1 μg of total RNA was used for ScriptSeq™ Complete Gold System (Epicentre) to ribodeplete rRNA and further for RNA-seq library preparation. SPRI beads (Agencourt AMPure XP, Beckman Coulter) were used for purification of RNAseq libraries. The library QC was evaluated on High Sensitivity chips by Agilent Bioanalyzer (Agilent). Paired-end sequencing of RNA-seq libraries was done using Illumina HiSeq technology with a minimum of 60 million 2x100bp paired-end reads per sample.

Data preprocessing
Raw reads were aligned to human genome reference build hg19 using TopHat v2.0.13 [24] with the default parameters. SAMtools [12] was used to filter out reads mapping to multiple regions of the genome. The raw expression estimates were calculated using the Cuffquant tool from Cufflinks v.2.2.1 [23,25]. The expression values were normalized with the Cuffnorm tool from Cufflinks v.2.2.1 [23,25], utilizing the geometric normalization method, which scales the read counts as well as the FPKM values according to procedure described in Love et al. [14]. To ensure the robustness of the normalization the expressions of HERV elements were quantified and normalized together with ENSEMBL v. 82 gene reference set [11,27]. The annotation data for HERV-K (HML-2) was from Subramanian et al. [21].

Correlation calculations
To prepare the dataset for correlation analysis, low expression genes with a mean normalized read count lower than 50 were filtered out. This threshold reduced the number of transcripts in the dataset from 57,886 to 12,639. Genes with no associated GO terms were also removed, utilizing R package msigdbr [13] (version 7.0.1), further reducing the number of genes to 9447. Changes to the mean transcript counts as a result of the filtering are shown in Table 4. Correlations between the expressions of each gene and of HERV-K (HML-2) at 1q22 were calculated utilizing Pearson's product-moment correlation coefficient. The R programming language (version 3.6.1) with stats package was used to calculate the correlation coefficients.
The expression of the filtered genes across samples approximates normal distribution in accordance with the assumptions of the Pearson correlation coefficient significance calculation. The supplementary file figures S1 and S2 show Q-Q plots of 1q22 expression against normal distribution in the nonagenarian and young control samples respectively.

GSEA
Gene set enrichment analysis (GSEA) was used to evaluate the potential biological functions of the HERV-K (HML-2) provirus at 1q22. The gene list used in GSEA was ranked by Pearson correlation coefficient of the expression between 1q22 and genes. The GSEA function used here is provided by the clusterProfiler software package version 3.12.0 [28], which utilizes biological process (BP) gene ontology (GO) terms [1,22]. The genome annotation information was retrieved with R package org.Hs.eg.db [4] (version 3.8.2). In the GSEA, the minimum size of gene sets was 25 genes and the maximum was 500. Permutation size was 100,000. Significance of enrichment was determined by a Benjamini-Hochberg corrected p-value of less than 0.05.

Hypergeometric test
In order to statistically study the incidence of neutrophil related GO terms in the top 20 most significantly enriched terms, the hypergeometric test was utilized. The R programming language [5] (version 3.6.1) and more specifically the packages stats and GO.db were used to perform the hypergeometric test. There was a total of 29,698 biological process GO terms, of which only 17 specifically describe neutrophil functions. For the old individuals, 4 GO terms related to neutrophil function were found within the top 20 most significantly enriched GO terms. For the young individuals, no significantly enriched GO terms were related to neutrophils.
Additional file 1: Table S1. Genes that had the highest correlation (Pearson correlation coefficient > 0.95) with 1q22 expression. Both positive and negative correlation are included. There is no overlap in these genes between the age groups. Figure S1. Q-Q plot of the HERV-K (HML-2) provirus at 1q22 expression across nonagenarian samples against normal distribution. The plot as well as a Shapiro-Wilk test (pvalue = 0.58) support that the distribution of the expression of 1q22 in the studied samples approximates normal distribution. Figure S2. Q-Q plot of the HERV-K (HML-2) provirus at 1q22 expression across young samples against normal distribution. The plot as well as a Shapiro-Wilk test (p-value = 0.32) support that the distribution of the expression of 1q22 in the studied samples approximates normal distribution.