If you are here probably you have already downloaded a dataset from
TCGA. At the end of the download procedure we saved a
.RData
file containing our favourite dataset.
Now, we are going to load it along with some R packages.
library(maftools)
library(org.Hs.eg.db)
library(impute)
library(doParallel)
library(MethylMix)
library(TCGAbiolinks)
# load data downloaded by TCGAbiolinks
load("downloadTCGA/TCGA-OV-hg38.RData")
To process the dataset and make the analysis procedure easier to follow, we implemented some functions, that are stored in a separate file that we need to source.
Please download ‘functions-to-process-TCGA-data.R’ from our GitHub page and move it into the ‘downloadTCGA’ direcotry.
source("downloadTCGA/functions-to-process-TCGA-data.R")
genome = "hg38"
tumor = "OV"
project = paste0("TCGA-", tumor)
Now, we’re ready to start.
The first dataset we are going to process here is the expression data. We are going to set unique gene identifiers and store counts accordingly. We choose to work with EntrezID; thus, we are going to translate the gene identifiers and discard all the features without a valid EntrezID.
# convert ensembl ids into entrez
geneAnnotation <- rowData(exprData)
ensembl2entrez <- mapIds(org.Hs.eg.db,
keys = sub("\\..*", "", geneAnnotation$gene_id),
column = "ENTREZID", keytype = "ENSEMBL")
geneAnnotation$entrezID <- ensembl2entrez
rowData(exprData) <- geneAnnotation
genes <- geneAnnotation$gene_id[!is.na(geneAnnotation$entrezID)]
We filter for patients features, selecting only patients with primary tumors.
# select primary tumors
colAnnotation <- exprData@colData
primaryTumorSel <- colAnnotation$definition == "Primary solid Tumor"
primaryTumor <- colnames(exprData)[primaryTumorSel]
We are going to filter the expression data: we keep only genes with a valid EntrezID on the rows, and primaryTumor samples on the columns.
exprData <- exprData[genes, primaryTumor, drop=FALSE]
After the conversion of gene IDs, some of them may be duplicated. We are going to average the profiles of the genes present more than once in the expression matrix.
# take average for duplicated entrez ids
exp <- assay(exprData, 1)
row.names(exp) <- sub("\\..*", "", row.names(exprData))
avg <- tapply(row.names(exp), rowData(exprData)$entrezID, function(r){
row = exp[r, ,drop=FALSE]
if (nrow(row)>1)
row=colMeans(row, na.rm = TRUE)
row
})
expAvg <- do.call(rbind, avg)
row.names(expAvg) <- names(avg)
colnames(expAvg) <- substr(colnames(expAvg), 1, 12)
expAvg[1:6, 1:6]
## TCGA-20-1687 TCGA-29-2414 TCGA-24-1416 TCGA-29-1697
## 1 1 13 1 3
## 10 1 3 0 6
## 100 307 556 352 373
## 1000 1112 517 319 7134
## 10000 274 531 95 564
## 100009668 2 6 8 30
## TCGA-25-1635 TCGA-20-1686
## 1 3 1
## 10 8 1
## 100 530 173
## 1000 3755 77
## 10000 885 308
## 100009668 4 16
Our expression matrix is now ready: EntrezID x samples.
Next, we are going to prepare mutation data. Some statistics are summarized.
maf <- read.maf(mutData)@data
## -Validating
## -Silent variants: 9351
## -Summarizing
## --Mutiple centers found
## WUGSC;BI;BI;WUGSC--Possible FLAGS among top ten genes:
## TTN
## USH2A
## MUC16
## HMCN1
## FLG
## -Processing clinical data
## --Missing clinical data
## -Finished in 0.819s elapsed (2.077s cpu)
mut.statistics <- tapply(seq_along(maf$Hugo_Symbol), maf$IMPACT, function(idx){
stat = as.data.frame(table(maf[idx, 9]))
names(stat)[1] <- "mutation_type"
stat
})
mut.statistics
## $HIGH
## mutation_type Freq
## 1 Frame_Shift_Del 1243
## 2 Frame_Shift_Ins 1034
## 3 In_Frame_Del 0
## 4 In_Frame_Ins 0
## 5 Missense_Mutation 0
## 6 Nonsense_Mutation 1656
## 7 Nonstop_Mutation 46
## 8 Splice_Site 893
## 9 Translation_Start_Site 37
##
## $MODERATE
## mutation_type Freq
## 1 Frame_Shift_Del 0
## 2 Frame_Shift_Ins 0
## 3 In_Frame_Del 443
## 4 In_Frame_Ins 19
## 5 Missense_Mutation 24906
## 6 Nonsense_Mutation 0
## 7 Nonstop_Mutation 0
## 8 Splice_Site 0
## 9 Translation_Start_Site 0
We choose to filter the data according to EntrezID genes that have been measured in expression analysis. Moreover, mutations are filtered according to the impact of the mutation, considering mutations with high and moderate impact.
considerThisImpactType <- c("HIGH", "MODERATE")
ov.mutations <- prepareMutations(maf, impact = considerThisImpactType,
filterByThisEntrez = row.names(expAvg),
patients = colnames(expAvg))
ov.mutations$data[1:6,1:6]
## TCGA-24-2033 TCGA-13-1488 TCGA-24-1844 TCGA-25-2400
## 1 1 0 0 0
## 10 0 1 1 0
## 100 0 0 0 1
## 1000 0 0 0 0
## 10003 0 0 0 0
## 100049587 0 0 0 0
## TCGA-09-2045 TCGA-23-1029
## 1 0 0
## 10 0 0
## 100 0 0
## 1000 1 1
## 10003 0 0
## 100049587 0 0
We can now move to the CNV data. As we downloaded only the primary tumors, we do not need any particular transformation. For CNV, our tool needs numeric values; so we need to transform the values.
ov.cnv <- lapply(cnvData, as.numeric)
ov.cnv <- data.frame(ov.cnv, check.names = FALSE)
row.names(ov.cnv) <- row.names(cnvData)
for (i in names(ov.cnv)) {
if (!identical(ov.cnv[[i]], as.numeric(cnvData[[i]])))
warning(paste0("sample number ", i, " is not equal after numeric transformation"))
}
ov.cnv[1:6,1:6]
## TCGA-04-1331 TCGA-04-1332 TCGA-04-1335 TCGA-04-1336 TCGA-04-1337
## 116983 -1 0 -1 1 0
## 140625 -1 0 -1 1 0
## 375790 -1 0 -1 1 0
## 441869 -1 0 -1 1 0
## 55210 -1 0 -1 1 0
## 83858 -1 0 -1 1 0
## TCGA-04-1338
## 116983 -2
## 140625 -2
## 375790 -2
## 441869 -2
## 55210 -2
## 83858 -2
Finally, we are going to process the methylation data with
MethylMix
R package. First, we are going to select the
samples that are ‘primary tumors’ and ‘normals’ in the TCGA. We generate
two datasets and we process both of them, removing rows with low values
and imputing missing values. Finally, we can create CpG clusters.
# Select primary solid tumors
sampleSelection <- colData(metData27)[[5]] == "Primary solid Tumor"
normalSelection <- colData(metData27)[[5]] == "Solid Tissue Normal"
samples <- colnames(metData27)[sampleSelection]
normals <- colnames(metData27)[normalSelection]
normData <- metData27[, normals]
metData <- metData27[, samples]
# Remove all rows with low values and impute missing values
thr <- ncol(assay(normData)) * 0.4
remove <- apply(assay(normData), 1, sum) > thr
normAssay <- assay(normData)[!remove, ]
normAssay <- imputeKNN(normAssay, k = 5)$data
## Warning in knnimp(x, k, maxmiss = rowmax, maxp = maxp): 3593 rows with more than 50 % entries missing;
## mean imputation used for these rows
thr <- ncol(assay(metData)) * 0.4
remove <- apply(is.na(assay(metData)), 1, sum) > thr
metAssay <- assay(metData)[!remove, ]
metAssay <- imputeKNN(metAssay, k = 5)$data
# Make clusters
cl <- makeCluster(16)
registerDoParallel(cl)
metClustValues <- ClusterProbes(metAssay, normAssay)
## Removing 2115 probes with SNPs.
## Clustering 15383 probes in CpG site clusters.
## Removing doubles for 1 samples.
##
## Found 12195 CpG site clusters.
stopCluster(cl)
The metClustValues
object contains two methylation
beta-value matrices and one dictionary, which connects CpG clusters to
the methylation array probes. Now we need to connect methylation
clusters to genes (EntrezID).
geneNames <- sapply(strsplit(row.names(metClustValues$MET_Cancer_Clustered), "---"),
function(x) x[1])
assList <- tapply(row.names(metClustValues$MET_Cancer_Clustered), geneNames, function(g) g)
symbol2entrez <- mapIds(org.Hs.eg.db, keys = names(assList), column = "ENTREZID",
keytype = "SYMBOL")
names(assList) <- symbol2entrez
metClustValues$eMap <- assList[!is.na(names(assList))]
metClustValues$MET_Cancer_Clustered[1:6, 1:6]
## TCGA-42-2593 TCGA-29-2414 TCGA-29-1707 TCGA-09-2044
## A2BP1---Cluster1 0.18340251 0.01368626 0.253939711 0.02354808
## A4GALT 0.03316592 0.06308160 0.048887535 0.06805393
## AAAS 0.01495299 0.01634827 0.014152061 0.01223318
## AACS 0.03354944 0.02310594 0.009328074 0.03470545
## AADAT---Cluster1 0.01523262 0.01366706 0.010625261 0.01134474
## AADAT---Cluster2 0.02773927 0.02048844 0.016261786 0.02019205
## TCGA-04-1351 TCGA-13-1506
## A2BP1---Cluster1 0.120138926 0.02680149
## A4GALT 0.064649710 0.07557158
## AAAS 0.019189938 0.03804874
## AACS 0.056246113 0.02541006
## AADAT---Cluster1 0.006467459 0.01268726
## AADAT---Cluster2 0.009586054 0.02525091
To conclude, we extract the follow-up values. We are going to calculate both Overall Survival (OS) and Progression Free Survival (PFI). The definition of these two measures are made following Liu et al., 2018.
Using the function createFollowUp
on clinical data, we
are going to obtain the follow-up table.
fup <- createFollowUp(followUp, newTumorEvent)
head(fup$os)
## status days
## TCGA-04-1331 1 1336
## TCGA-04-1332 1 1247
## TCGA-04-1335 1 55
## TCGA-04-1336 0 1495
## TCGA-04-1337 1 61
## TCGA-04-1338 0 1418
head(fup$pfs)
## status days
## TCGA-04-1331 1 459
## TCGA-04-1332 1 393
## TCGA-04-1335 1 55
## TCGA-04-1336 0 1495
## TCGA-04-1337 1 61
## TCGA-04-1338 1 380
The two-class annotation object is generated using TCGA ovarian
cancer subtypes, that can be retrieved with the function
PanCancerAtlas_subtypes
.
subtypes <- PanCancerAtlas_subtypes()
table(subtypes[subtypes$cancer.type == "OVCA", "Subtype_mRNA"])
## Subtype_mRNA
## Differentiated Immunoreactive Mesenchymal Proliferative
## 135 107 109 138
subtypes <- subtypes[subtypes$cancer.type == "OVCA",]
classes <- data.frame(row.names = substr(subtypes$pan.samplesID, 1, 12),
classes = subtypes$Subtype_mRNA)
head(classes)
## classes
## TCGA-04-1331 Proliferative
## TCGA-04-1332 Mesenchymal
## TCGA-04-1336 Differentiated
## TCGA-04-1337 Mesenchymal
## TCGA-04-1338 Mesenchymal
## TCGA-04-1341 Proliferative
We can save the pre-processed
dataset on a RData
file inside the ‘downloadTCGA’
directory.
save(classes, fup, metClustValues, ov.cnv, ov.mutations, expAvg,
file = "downloadTCGA/TCGA-OV-pre-processed.RData")