Pre-processing of TCGA datasets

If you are here probably you have already downloaded a dataset from TCGA. At the end of the download procedure we saved a .RData file containing our favourite dataset.

Now, we are going to load it along with some R packages.

library(maftools)
library(org.Hs.eg.db)
library(impute)
library(doParallel)
library(MethylMix)
library(TCGAbiolinks)

# load data downloaded by TCGAbiolinks
load("downloadTCGA/TCGA-OV-hg38.RData")

To process the dataset and make the analysis procedure easier to follow, we implemented some functions, that are stored in a separate file that we need to source.

Please download ‘functions-to-process-TCGA-data.R’ from our GitHub page and move it into the ‘downloadTCGA’ direcotry.

source("downloadTCGA/functions-to-process-TCGA-data.R")

genome = "hg38"
tumor = "OV"
project = paste0("TCGA-", tumor)

Expression

Now, we’re ready to start.

The first dataset we are going to process here is the expression data. We are going to set unique gene identifiers and store counts accordingly. We choose to work with EntrezID; thus, we are going to translate the gene identifiers and discard all the features without a valid EntrezID.

# convert ensembl ids into entrez
geneAnnotation <- rowData(exprData)

ensembl2entrez <- mapIds(org.Hs.eg.db, 
                         keys = sub("\\..*", "", geneAnnotation$gene_id),
                         column = "ENTREZID", keytype = "ENSEMBL")
geneAnnotation$entrezID <- ensembl2entrez

rowData(exprData) <- geneAnnotation
genes <- geneAnnotation$gene_id[!is.na(geneAnnotation$entrezID)]

We filter for patients features, selecting only patients with primary tumors.

# select primary tumors
colAnnotation <- exprData@colData
primaryTumorSel <- colAnnotation$definition == "Primary solid Tumor"
primaryTumor <- colnames(exprData)[primaryTumorSel]

We are going to filter the expression data: we keep only genes with a valid EntrezID on the rows, and primaryTumor samples on the columns.

exprData <- exprData[genes, primaryTumor, drop=FALSE]

After the conversion of gene IDs, some of them may be duplicated. We are going to average the profiles of the genes present more than once in the expression matrix.

# take average for duplicated entrez ids
exp <- assay(exprData, 1)
row.names(exp) <- sub("\\..*", "", row.names(exprData))

avg <- tapply(row.names(exp), rowData(exprData)$entrezID, function(r){
  row = exp[r, ,drop=FALSE]
  if (nrow(row)>1)
    row=colMeans(row, na.rm = TRUE)
  row
})

expAvg <- do.call(rbind, avg)
row.names(expAvg) <- names(avg)

colnames(expAvg) <- substr(colnames(expAvg), 1, 12)

expAvg[1:6, 1:6]

##           TCGA-20-1687 TCGA-29-2414 TCGA-24-1416 TCGA-29-1697
## 1                    1           13            1            3
## 10                   1            3            0            6
## 100                307          556          352          373
## 1000              1112          517          319         7134
## 10000              274          531           95          564
## 100009668            2            6            8           30
##           TCGA-25-1635 TCGA-20-1686
## 1                    3            1
## 10                   8            1
## 100                530          173
## 1000              3755           77
## 10000              885          308
## 100009668            4           16

Our expression matrix is now ready: EntrezID x samples.

Mutation

Next, we are going to prepare mutation data. Some statistics are summarized.

maf <- read.maf(mutData)@data

## -Validating
## -Silent variants: 9351 
## -Summarizing
## --Mutiple centers found
## WUGSC;BI;BI;WUGSC--Possible FLAGS among top ten genes:
##   TTN
##   USH2A
##   MUC16
##   HMCN1
##   FLG
## -Processing clinical data
## --Missing clinical data
## -Finished in 0.819s elapsed (2.077s cpu)

mut.statistics <- tapply(seq_along(maf$Hugo_Symbol), maf$IMPACT, function(idx){
  stat = as.data.frame(table(maf[idx, 9]))
  names(stat)[1] <- "mutation_type"
  stat
})

mut.statistics

## $HIGH
##            mutation_type Freq
## 1        Frame_Shift_Del 1243
## 2        Frame_Shift_Ins 1034
## 3           In_Frame_Del    0
## 4           In_Frame_Ins    0
## 5      Missense_Mutation    0
## 6      Nonsense_Mutation 1656
## 7       Nonstop_Mutation   46
## 8            Splice_Site  893
## 9 Translation_Start_Site   37
## 
## $MODERATE
##            mutation_type  Freq
## 1        Frame_Shift_Del     0
## 2        Frame_Shift_Ins     0
## 3           In_Frame_Del   443
## 4           In_Frame_Ins    19
## 5      Missense_Mutation 24906
## 6      Nonsense_Mutation     0
## 7       Nonstop_Mutation     0
## 8            Splice_Site     0
## 9 Translation_Start_Site     0

We choose to filter the data according to EntrezID genes that have been measured in expression analysis. Moreover, mutations are filtered according to the impact of the mutation, considering mutations with high and moderate impact.

considerThisImpactType <- c("HIGH", "MODERATE")

ov.mutations <- prepareMutations(maf, impact = considerThisImpactType, 
                                 filterByThisEntrez = row.names(expAvg), 
                                 patients = colnames(expAvg))

ov.mutations$data[1:6,1:6]

##           TCGA-24-2033 TCGA-13-1488 TCGA-24-1844 TCGA-25-2400
## 1                    1            0            0            0
## 10                   0            1            1            0
## 100                  0            0            0            1
## 1000                 0            0            0            0
## 10003                0            0            0            0
## 100049587            0            0            0            0
##           TCGA-09-2045 TCGA-23-1029
## 1                    0            0
## 10                   0            0
## 100                  0            0
## 1000                 1            1
## 10003                0            0
## 100049587            0            0

CNV

We can now move to the CNV data. As we downloaded only the primary tumors, we do not need any particular transformation. For CNV, our tool needs numeric values; so we need to transform the values.

ov.cnv <- lapply(cnvData, as.numeric)
ov.cnv <- data.frame(ov.cnv, check.names = FALSE)
row.names(ov.cnv) <- row.names(cnvData)

for (i in names(ov.cnv)) {
  if (!identical(ov.cnv[[i]], as.numeric(cnvData[[i]])))
    warning(paste0("sample number ", i, " is not equal after numeric transformation"))
}

ov.cnv[1:6,1:6]

##        TCGA-04-1331 TCGA-04-1332 TCGA-04-1335 TCGA-04-1336 TCGA-04-1337
## 116983           -1            0           -1            1            0
## 140625           -1            0           -1            1            0
## 375790           -1            0           -1            1            0
## 441869           -1            0           -1            1            0
## 55210            -1            0           -1            1            0
## 83858            -1            0           -1            1            0
##        TCGA-04-1338
## 116983           -2
## 140625           -2
## 375790           -2
## 441869           -2
## 55210            -2
## 83858            -2

Methylation

Finally, we are going to process the methylation data with MethylMix R package. First, we are going to select the samples that are ‘primary tumors’ and ‘normals’ in the TCGA. We generate two datasets and we process both of them, removing rows with low values and imputing missing values. Finally, we can create CpG clusters.

# Select primary solid tumors
sampleSelection <- colData(metData27)[[5]] == "Primary solid Tumor"
normalSelection <- colData(metData27)[[5]] == "Solid Tissue Normal"
samples <- colnames(metData27)[sampleSelection]
normals <- colnames(metData27)[normalSelection]
normData <- metData27[, normals]
metData <- metData27[, samples]

# Remove all rows with low values and impute missing values
thr <- ncol(assay(normData)) * 0.4
remove <- apply(assay(normData), 1, sum) > thr
normAssay <- assay(normData)[!remove, ]
normAssay <- imputeKNN(normAssay, k = 5)$data

## Warning in knnimp(x, k, maxmiss = rowmax, maxp = maxp): 3593 rows with more than 50 % entries missing;
##  mean imputation used for these rows

thr <- ncol(assay(metData)) * 0.4
remove <- apply(is.na(assay(metData)), 1, sum) > thr
metAssay <- assay(metData)[!remove, ]
metAssay <- imputeKNN(metAssay, k = 5)$data

# Make clusters
cl <- makeCluster(16)
registerDoParallel(cl)
metClustValues <- ClusterProbes(metAssay, normAssay)

## Removing 2115 probes with SNPs.
## Clustering 15383 probes in CpG site clusters.
## Removing doubles for 1 samples.
## 
## Found 12195 CpG site clusters.

stopCluster(cl)

The metClustValues object contains two methylation beta-value matrices and one dictionary, which connects CpG clusters to the methylation array probes. Now we need to connect methylation clusters to genes (EntrezID).

geneNames <- sapply(strsplit(row.names(metClustValues$MET_Cancer_Clustered), "---"),
                    function(x) x[1])
assList <- tapply(row.names(metClustValues$MET_Cancer_Clustered), geneNames, function(g) g)
symbol2entrez <- mapIds(org.Hs.eg.db, keys = names(assList), column = "ENTREZID",
                        keytype = "SYMBOL")
names(assList) <- symbol2entrez
metClustValues$eMap <- assList[!is.na(names(assList))]

metClustValues$MET_Cancer_Clustered[1:6, 1:6]

##                  TCGA-42-2593 TCGA-29-2414 TCGA-29-1707 TCGA-09-2044
## A2BP1---Cluster1   0.18340251   0.01368626  0.253939711   0.02354808
## A4GALT             0.03316592   0.06308160  0.048887535   0.06805393
## AAAS               0.01495299   0.01634827  0.014152061   0.01223318
## AACS               0.03354944   0.02310594  0.009328074   0.03470545
## AADAT---Cluster1   0.01523262   0.01366706  0.010625261   0.01134474
## AADAT---Cluster2   0.02773927   0.02048844  0.016261786   0.02019205
##                  TCGA-04-1351 TCGA-13-1506
## A2BP1---Cluster1  0.120138926   0.02680149
## A4GALT            0.064649710   0.07557158
## AAAS              0.019189938   0.03804874
## AACS              0.056246113   0.02541006
## AADAT---Cluster1  0.006467459   0.01268726
## AADAT---Cluster2  0.009586054   0.02525091

Survival annotation

To conclude, we extract the follow-up values. We are going to calculate both Overall Survival (OS) and Progression Free Survival (PFI). The definition of these two measures are made following Liu et al., 2018.

Using the function createFollowUp on clinical data, we are going to obtain the follow-up table.

fup <- createFollowUp(followUp, newTumorEvent)

head(fup$os)

##              status days
## TCGA-04-1331      1 1336
## TCGA-04-1332      1 1247
## TCGA-04-1335      1   55
## TCGA-04-1336      0 1495
## TCGA-04-1337      1   61
## TCGA-04-1338      0 1418

head(fup$pfs)

##              status days
## TCGA-04-1331      1  459
## TCGA-04-1332      1  393
## TCGA-04-1335      1   55
## TCGA-04-1336      0 1495
## TCGA-04-1337      1   61
## TCGA-04-1338      1  380

Two-class annotation

The two-class annotation object is generated using TCGA ovarian cancer subtypes, that can be retrieved with the function PanCancerAtlas_subtypes.

subtypes <- PanCancerAtlas_subtypes()
table(subtypes[subtypes$cancer.type == "OVCA", "Subtype_mRNA"])

## Subtype_mRNA
## Differentiated Immunoreactive    Mesenchymal  Proliferative 
##            135            107            109            138

subtypes <- subtypes[subtypes$cancer.type == "OVCA",]

classes <- data.frame(row.names = substr(subtypes$pan.samplesID, 1, 12), 
                      classes = subtypes$Subtype_mRNA)

head(classes)

##                     classes
## TCGA-04-1331  Proliferative
## TCGA-04-1332    Mesenchymal
## TCGA-04-1336 Differentiated
## TCGA-04-1337    Mesenchymal
## TCGA-04-1338    Mesenchymal
## TCGA-04-1341  Proliferative

We can save the pre-processed dataset on a RData file inside the ‘downloadTCGA’ directory.

save(classes, fup, metClustValues, ov.cnv, ov.mutations, expAvg, 
     file = "downloadTCGA/TCGA-OV-pre-processed.RData")