In this tutorial, we download a dataset from The Cancer Genome Atlas (TCGA). Note that these datasets are often large, so the downloads can take some time.

First, browse the TCGA projects at https://portal.gdc.cancer.gov and pick the one you want to work with.

For this tutorial, we choose TCGA-OV, i.e., the High Grade Serous Ovarian Cancer (HGSOC) dataset.

Now we set some variables and load the TCGAbiolinks library. To keep the downloads organized, we suggest creating a directory (“downloadTCGA”) and storing everything there.

library(TCGAbiolinks)

tumor <- "OV"
project <- paste0("TCGA-", tumor)
genome <- "hg38"

methylation_platforms <- c("Illumina Human Methylation 27",
                           "Illumina Human Methylation 450")

# Create the download directory if it does not exist yet
dirname <- "downloadTCGA"
if (!dir.exists(dirname)) {
  dir.create(dirname)
}
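
The download calls in the following sections all write to a GDCdata subfolder of this directory. A minimal sketch of building that path once with file.path() is shown here. It assumes a temporary base directory so that it can run anywhere, and the names base_dir and gdc_dir are illustrative, not part of the tutorial's own code.

```r
# Illustrative only: a temporary folder stands in for the working directory.
base_dir <- file.path(tempdir(), "downloadTCGA")
gdc_dir <- file.path(base_dir, "GDCdata")

# recursive = TRUE creates parent folders as needed;
# showWarnings = FALSE keeps the call quiet if the folder already exists.
dir.create(gdc_dir, recursive = TRUE, showWarnings = FALSE)

dir.exists(gdc_dir)
```

Defining the path in one place avoids repeating the same string in every download and prepare call.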

Clinical data

We start by downloading the clinical data.

cliQuery <- GDCquery(project = project, data.category = "Clinical", 
                     data.format = "bcr xml")

GDCdownload(cliQuery, method="api", files.per.chunk = 10, 
            directory = "downloadTCGA/GDCdata")

followUp <- GDCprepare_clinic(cliQuery, clinical.info = "follow_up",
                              directory = "downloadTCGA/GDCdata")
newTumorEvent <- GDCprepare_clinic(cliQuery, clinical.info = "new_tumor_event",
                                   directory = "downloadTCGA/GDCdata")

We now have two data frames parsed from the downloaded XML files: followUp and newTumorEvent.

Expression

Now, it’s time for the expression counts.

expQuery <- GDCquery(project = project,
                     data.category = "Transcriptome Profiling",
                     data.type = "Gene Expression Quantification",
                     workflow.type = "STAR - Counts")

GDCdownload(expQuery, method = "api", directory = "downloadTCGA/GDCdata")

exprData <- GDCprepare(expQuery, directory = "downloadTCGA/GDCdata")

Methylation

We need to download methylation beta values, which are available from two different platforms: “Illumina Human Methylation 27” and “Illumina Human Methylation 450”. Since only 10 samples from the TCGA-OV project have been profiled with the 450k Illumina platform, we download data from the 27k platform.

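metQuery <- GDCquery(project = project,
                     data.category = "DNA Methylation",
                     data.type = "Methylation Beta Value",
                     platform = methylation_platforms[[1]])

# The files must be downloaded before GDCprepare can read them
GDCdownload(metQuery, method = "api", directory = "downloadTCGA/GDCdata")

metData <- GDCprepare(metQuery, directory = "downloadTCGA/GDCdata")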
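
The platform choice described above rests on how many samples each platform covers. A minimal sketch of such a tally follows, using a small made-up table in place of real query results. The data frame results, its column names and its barcodes are assumptions for illustration only.

```r
# Hypothetical stand-in for a table of query results: one row per file.
results <- data.frame(
  cases = c("TCGA-AA-0001-01A", "TCGA-AA-0002-01A",
            "TCGA-AA-0003-01A", "TCGA-AA-0001-01A"),
  platform = c("Illumina Human Methylation 27",
               "Illumina Human Methylation 27",
               "Illumina Human Methylation 27",
               "Illumina Human Methylation 450"),
  stringsAsFactors = FALSE
)

# Count distinct samples per platform.
samples_per_platform <- tapply(results$cases, results$platform,
                               function(x) length(unique(x)))

# Pick the platform that covers the most samples.
best_platform <- names(samples_per_platform)[which.max(samples_per_platform)]
best_platform
```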

Mutation

Next, we download mutation data.

mutQuery <- GDCquery(
    project = project, 
    data.category = "Simple Nucleotide Variation", 
    data.type = "Masked Somatic Mutation",
    workflow.type = "Aliquot Ensemble Somatic Variant Merging and Masking")

GDCdownload(mutQuery, method = "api", directory = "downloadTCGA/GDCdata")

mutData <- GDCprepare(mutQuery, directory = "downloadTCGA/GDCdata")

Copy Number Variation

Finally, we can download the CNV data generated by the GISTIC2 pipeline, selecting only primary tumors.

gisticTable <- getGistic("OV-TP", type = "thresholded")
cnvData <- gisticTable[,-c(1:3)]
colnames(cnvData) <- substr(colnames(cnvData), 1, 12)
row.names(cnvData) <- gisticTable$`Locus ID`
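
The column renaming above keeps the first 12 characters of each TCGA barcode, which identify the patient rather than the individual aliquot. A small sketch on a made-up thresholded matrix illustrates this. The barcodes, locus IDs and values are invented; thresholded GISTIC2 calls are integers from -2 to 2.

```r
# Invented example: 3 loci x 2 aliquot-level barcodes.
toy_cnv <- matrix(c(-1L, 0L, 2L,
                     0L, 1L, -2L),
                  nrow = 3,
                  dimnames = list(c("1001", "1002", "1003"),
                                  c("TCGA-AA-0001-01A-01D-0001-01",
                                    "TCGA-AA-0002-01A-01D-0002-01")))

# Characters 1-12 cover project, tissue source site and participant.
colnames(toy_cnv) <- substr(colnames(toy_cnv), 1, 12)

# Truncation can produce duplicates when a patient has several aliquots,
# so it is worth checking before using the names as keys.
anyDuplicated(colnames(toy_cnv))
colnames(toy_cnv)
```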

We save all the downloaded data in a .RData file inside the ‘downloadTCGA’ directory, which we will load in the next tutorial for pre-processing.
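
The code for the save step is not shown here. A minimal sketch of it could look as follows, assuming placeholder objects and a hypothetical file name (TCGA-OV.RData) written to a temporary directory. In the tutorial, the objects would be the ones created in the sections above.

```r
# Placeholder objects standing in for the downloaded data.
exprData <- matrix(1:4, nrow = 2)
cnvData <- data.frame(a = c(0, 1))

# Hypothetical file name; a temporary folder is used so the sketch runs anywhere.
out_dir <- file.path(tempdir(), "downloadTCGA")
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)
rdata_file <- file.path(out_dir, "TCGA-OV.RData")

# save() stores the named objects; load() restores them under the same names.
save(exprData, cnvData, file = rdata_file)

# Loading into a separate environment avoids overwriting existing objects.
restored <- new.env()
loaded_names <- load(rdata_file, envir = restored)
loaded_names
```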

sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.5 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Europe/Rome
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils    
## [7] datasets  methods   base     
## 
## other attached packages:
##  [1] MethylMix_2.36.0            doParallel_1.0.17          
##  [3] iterators_1.0.14            foreach_1.5.2              
##  [5] impute_1.80.0               maftools_2.22.0            
##  [7] TCGAbiolinks_2.34.0         devtools_2.4.5             
##  [9] usethis_3.0.0               kableExtra_1.4.0           
## [11] MOSClip_0.99.5              graphite_1.52.0            
## [13] EDASeq_2.40.0               ShortRead_1.64.0           
## [15] GenomicAlignments_1.42.0    SummarizedExperiment_1.36.0
## [17] MatrixGenerics_1.18.0       matrixStats_1.4.1          
## [19] Rsamtools_2.22.0            GenomicRanges_1.58.0       
## [21] Biostrings_2.74.0           GenomeInfoDb_1.42.0        
## [23] XVector_0.46.0              BiocParallel_1.40.0        
## [25] org.Hs.eg.db_3.20.0         AnnotationDbi_1.68.0       
## [27] IRanges_2.40.0              S4Vectors_0.44.0           
## [29] Biobase_2.66.0              BiocGenerics_0.52.0        
## 
## loaded via a namespace (and not attached):
##   [1] fs_1.6.5                    bitops_1.0-9               
##   [3] httr_1.4.7                  RColorBrewer_1.1-3         
##   [5] SuperExactTest_1.1.0        Rgraphviz_2.50.0           
##   [7] profvis_0.4.0               tools_4.4.2                
##   [9] backports_1.5.0             utf8_1.2.4                 
##  [11] R6_2.5.1                    DT_0.33                    
##  [13] GetoptLong_1.0.5            urlchecker_1.0.1           
##  [15] withr_3.0.2                 prettyunits_1.2.0          
##  [17] gridExtra_2.3               cli_3.6.3                  
##  [19] gRbase_2.0.3                Cairo_1.6-2                
##  [21] flashClust_1.01-2           sandwich_3.1-1             
##  [23] labeling_0.4.3              sass_0.4.9                 
##  [25] mvtnorm_1.3-2               survMisc_0.5.6             
##  [27] readr_2.1.5                 qpgraph_2.40.0             
##  [29] systemfonts_1.1.0           yulab.utils_0.1.7          
##  [31] svglite_2.1.3               R.utils_2.12.3             
##  [33] sessioninfo_1.2.2           rstudioapi_0.17.1          
##  [35] RSQLite_2.3.7               generics_0.1.3             
##  [37] gridGraphics_0.5-1          shape_1.4.6.1              
##  [39] BiocIO_1.16.0               hwriter_1.3.2.1            
##  [41] car_3.1-3                   dplyr_1.1.4                
##  [43] qtl_1.70                    lars_1.3                   
##  [45] leaps_3.2                   Matrix_1.7-1               
##  [47] interp_1.1-6                fansi_1.0.6                
##  [49] abind_1.4-8                 R.methodsS3_1.8.2          
##  [51] lifecycle_1.0.4             scatterplot3d_0.3-44       
##  [53] multcomp_1.4-26             yaml_2.3.10                
##  [55] carData_3.0-5               SparseArray_1.6.0          
##  [57] BiocFileCache_2.14.0        grid_4.4.2                 
##  [59] blob_1.2.4                  promises_1.3.0             
##  [61] crayon_1.5.3                pwalign_1.2.0              
##  [63] miniUI_0.1.1.1              lattice_0.22-6             
##  [65] GenomicFeatures_1.58.0      annotate_1.84.0            
##  [67] KEGGREST_1.46.0             magick_2.8.5               
##  [69] pillar_1.9.0                knitr_1.48                 
##  [71] ComplexHeatmap_2.22.0       rjson_0.2.23               
##  [73] TCGAbiolinksGUI.data_1.26.0 estimability_1.5.1         
##  [75] corpcor_1.6.10              codetools_0.2-20           
##  [77] glue_1.8.0                  downloader_0.4             
##  [79] remotes_2.5.0               data.table_1.16.2          
##  [81] MultiAssayExperiment_1.32.0 vctrs_0.6.5                
##  [83] png_0.1-8                   coxrobust_1.0.1            
##  [85] gtable_0.3.6                cachem_1.1.0               
##  [87] aroma.light_3.36.0          xfun_0.49                  
##  [89] mime_0.12                   S4Arrays_1.6.0             
##  [91] coda_0.19-4.1               survival_3.7-0             
##  [93] pheatmap_1.0.12             KMsurv_0.1-5               
##  [95] ellipsis_0.3.2              TH.data_1.1-2              
##  [97] bit64_4.5.2                 progress_1.2.3             
##  [99] filelock_1.0.3              bslib_0.8.0                
## [101] elasticnet_1.3              colorspace_2.1-1           
## [103] DBI_1.2.3                   DNAcopy_1.80.0             
## [105] tidyselect_1.2.1            emmeans_1.10.5             
## [107] bit_4.5.0                   compiler_4.4.2             
## [109] curl_5.2.3                  rvest_1.0.4                
## [111] httr2_1.0.6                 graph_1.84.0               
## [113] xml2_1.3.6                  DelayedArray_0.32.0        
## [115] rtracklayer_1.66.0          checkmate_2.3.2            
## [117] scales_1.3.0                multcompView_0.1-10        
## [119] rappdirs_0.3.3              stringr_1.5.1              
## [121] digest_0.6.37               rmarkdown_2.29             
## [123] htmltools_0.5.8.1           pkgconfig_2.0.3            
## [125] jpeg_0.1-10                 highr_0.11                 
## [127] FactoMineR_2.11             dbplyr_2.5.0               
## [129] fastmap_1.2.0               rlang_1.1.4                
## [131] GlobalOptions_0.1.2         htmlwidgets_1.6.4          
## [133] UCSC.utils_1.2.0            shiny_1.9.1                
## [135] farver_2.1.2                jquerylib_0.1.4            
## [137] zoo_1.8-12                  jsonlite_1.8.9             
## [139] R.oo_1.27.0                 RCurl_1.98-1.16            
## [141] magrittr_2.0.3              Formula_1.2-5              
## [143] GenomeInfoDbData_1.2.13     ggplotify_0.1.2            
## [145] NbClust_3.0.1               munsell_0.5.1              
## [147] Rcpp_1.0.13-1               stringi_1.8.4              
## [149] zlibbioc_1.52.0             MASS_7.3-61                
## [151] pkgbuild_1.4.5              plyr_1.8.9                 
## [153] ggrepel_0.9.6               deldir_2.0-4               
## [155] survminer_0.5.0             splines_4.4.2              
## [157] hms_1.1.3                   circlize_0.4.16            
## [159] igraph_2.1.1                ggpubr_0.6.0               
## [161] ggsignif_0.6.4              pkgload_1.4.0              
## [163] biomaRt_2.62.0              XML_3.99-0.17              
## [165] evaluate_1.0.1              latticeExtra_0.6-30        
## [167] tzdb_0.4.0                  httpuv_1.6.15              
## [169] tidyr_1.3.1                 purrr_1.0.2                
## [171] reshape_0.8.9               km.ci_0.5-6                
## [173] clue_0.3-65                 ggplot2_3.5.1              
## [175] BiocBaseUtils_1.8.0         broom_1.0.7                
## [177] xtable_1.8-4                restfulr_0.0.15            
## [179] rstatix_0.7.2               later_1.3.2                
## [181] viridisLite_0.4.2           tibble_3.2.1               
## [183] memoise_2.0.1               cluster_2.1.6