In this tutorial, we are going to download a dataset from The Cancer Genome Atlas (TCGA). Note that these datasets are often very large, so the download may take some time.
First, browse the TCGA datasets at https://portal.gdc.cancer.gov and pick the one you are interested in.
For this tutorial, we choose TCGA-OV, the High Grade Serous Ovarian Cancer (HGSOC) dataset.
Now, we set some variables and load the required library. To keep the downloads organized, we suggest creating a directory (“downloadTCGA”) and storing everything there.
library(TCGAbiolinks)
# Project settings used throughout the tutorial
tumor = "OV"
project = paste0("TCGA-", tumor)
genome = "hg38"
methylation_platforms <- c("Illumina Human Methylation 27",
                           "Illumina Human Methylation 450")
# Create the download directory if it does not exist yet
dirname = "downloadTCGA"
if (!dir.exists(dirname)) {
  dir.create(dirname)
}
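The download calls in this tutorial pass the literal path "downloadTCGA/GDCdata". As a minimal sketch, the path could instead be built once with `file.path()` and reused. The names `baseDir` and `gdcDir` are illustrative and not part of the tutorial, and a temporary folder is used here only so that the snippet runs anywhere.

```r
# Illustrative names (baseDir, gdcDir); the tutorial itself uses the
# literal "downloadTCGA" in the working directory.
baseDir <- file.path(tempdir(), "downloadTCGA")
gdcDir <- file.path(baseDir, "GDCdata")

# recursive = TRUE creates both levels at once; showWarnings = FALSE
# keeps the call quiet when the folders already exist.
dir.create(gdcDir, recursive = TRUE, showWarnings = FALSE)

# Confirm that the folder is in place
dir.exists(gdcDir)
```

With this in place, `gdcDir` could be passed as the `directory` argument wherever the literal path appears.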
We can start with the download of clinical data.
cliQuery <- GDCquery(project = project, data.category = "Clinical",
                     data.format = "bcr xml")
GDCdownload(cliQuery, method = "api", files.per.chunk = 10,
            directory = "downloadTCGA/GDCdata")
followUp <- GDCprepare_clinic(cliQuery, clinical.info = "follow_up",
                              directory = "downloadTCGA/GDCdata")
newTumorEvent <- GDCprepare_clinic(cliQuery, clinical.info = "new_tumor_event",
                                   directory = "downloadTCGA/GDCdata")
We now have two data frames parsed from the clinical files: followUp and newTumorEvent.
Now, it’s time for the expression counts.
expQuery <- GDCquery(project = project,
                     data.category = "Transcriptome Profiling",
                     data.type = "Gene Expression Quantification",
                     workflow.type = "STAR - Counts")
GDCdownload(expQuery, method = "api", directory = "downloadTCGA/GDCdata")
exprData <- GDCprepare(expQuery, directory = "downloadTCGA/GDCdata")
We also need the methylation beta values, which are available from two different platforms: “Illumina Human Methylation 27” and “Illumina Human Methylation 450”. Since only 10 samples from the TCGA-OV project have been profiled with the 450k Illumina platform, we download the data from the 27k platform.
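metQuery <- GDCquery(project = project,
                     data.category = "DNA Methylation",
                     data.type = "Methylation Beta Value",
                     platform = methylation_platforms[[1]])
# The files must be downloaded before GDCprepare() can read them
GDCdownload(metQuery, method = "api", directory = "downloadTCGA/GDCdata")
metData <- GDCprepare(metQuery, directory = "downloadTCGA/GDCdata")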
Next, we download mutation data.
mutQuery <- GDCquery(
  project = project,
  data.category = "Simple Nucleotide Variation",
  data.type = "Masked Somatic Mutation",
  workflow.type = "Aliquot Ensemble Somatic Variant Merging and Masking")
GDCdownload(mutQuery, method = "api", directory = "downloadTCGA/GDCdata")
mutData <- GDCprepare(mutQuery, directory = "downloadTCGA/GDCdata")
Finally, we can download the CNV data generated by the GISTIC2 pipeline, selecting only primary tumors.
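# "OV-TP" selects the primary tumor (TP) samples of the OV cohort
gisticTable <- getGistic("OV-TP", type = "thresholded")
# Drop the first three annotation columns, keeping one column per sample
cnvData <- gisticTable[, -c(1:3)]
# Shorten the sample barcodes to their first 12 characters (patient barcode)
colnames(cnvData) <- substr(colnames(cnvData), 1, 12)
row.names(cnvData) <- gisticTable$`Locus ID`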
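To see what the `substr()` call does to the column names, here is a small sketch on made-up barcodes in the TCGA format. The barcodes are illustrative and are not taken from the GISTIC table. The first 12 characters identify the patient, while characters 14 and 15 hold the sample type code, where "01" denotes a primary solid tumor.

```r
# Illustrative barcodes in the TCGA format (not real samples)
barcodes <- c("TCGA-AA-0001-01A-01D-0001-01",
              "TCGA-AA-0002-01A-01D-0001-01",
              "TCGA-AA-0002-01B-01D-0001-01")

# Patient-level identifier: the first 12 characters
patients <- substr(barcodes, 1, 12)
patients

# Sample type code: "01" marks a primary solid tumor
sampleType <- substr(barcodes, 14, 15)
sampleType

# Two vials from the same patient collapse to one identifier, so it is
# worth checking for duplicates before relying on the shortened names.
anyDuplicated(patients) > 0
```

If the check returns TRUE on real data, the duplicated columns would need to be resolved (for example by keeping one sample per patient) before matching patients across omics.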
We save all the downloaded data in an .RData file inside the ‘downloadTCGA’ directory, which we will load in the next tutorial for pre-processing.
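The saving step could look like the following minimal sketch. The file name "TCGA-OV-downloads.RData" is a placeholder, not necessarily the name expected by the next tutorial, and small stand-in objects replace the real data so that the snippet runs on its own. With the real objects, all six (followUp, newTumorEvent, exprData, metData, mutData, cnvData) would be listed in the `save()` call.

```r
# Stand-ins for two of the objects created above
followUp <- data.frame(patient = c("A", "B"))
cnvData <- matrix(0, nrow = 2, ncol = 2)

# Placeholder location and file name; a temporary folder is used here
# in place of the "downloadTCGA" directory of the tutorial.
outDir <- file.path(tempdir(), "downloadTCGA")
dir.create(outDir, showWarnings = FALSE)
outFile <- file.path(outDir, "TCGA-OV-downloads.RData")

# Write the objects to a single .RData file
save(followUp, cnvData, file = outFile)

# In a later session, load() restores the objects under their original
# names and returns those names as a character vector.
restored <- load(outFile)
restored
```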
sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.5 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Europe/Rome
## tzcode source: system (glibc)
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils
## [7] datasets methods base
##
## other attached packages:
## [1] MethylMix_2.36.0 doParallel_1.0.17
## [3] iterators_1.0.14 foreach_1.5.2
## [5] impute_1.80.0 maftools_2.22.0
## [7] TCGAbiolinks_2.34.0 devtools_2.4.5
## [9] usethis_3.0.0 kableExtra_1.4.0
## [11] MOSClip_0.99.5 graphite_1.52.0
## [13] EDASeq_2.40.0 ShortRead_1.64.0
## [15] GenomicAlignments_1.42.0 SummarizedExperiment_1.36.0
## [17] MatrixGenerics_1.18.0 matrixStats_1.4.1
## [19] Rsamtools_2.22.0 GenomicRanges_1.58.0
## [21] Biostrings_2.74.0 GenomeInfoDb_1.42.0
## [23] XVector_0.46.0 BiocParallel_1.40.0
## [25] org.Hs.eg.db_3.20.0 AnnotationDbi_1.68.0
## [27] IRanges_2.40.0 S4Vectors_0.44.0
## [29] Biobase_2.66.0 BiocGenerics_0.52.0
##
## loaded via a namespace (and not attached):
## [1] fs_1.6.5 bitops_1.0-9
## [3] httr_1.4.7 RColorBrewer_1.1-3
## [5] SuperExactTest_1.1.0 Rgraphviz_2.50.0
## [7] profvis_0.4.0 tools_4.4.2
## [9] backports_1.5.0 utf8_1.2.4
## [11] R6_2.5.1 DT_0.33
## [13] GetoptLong_1.0.5 urlchecker_1.0.1
## [15] withr_3.0.2 prettyunits_1.2.0
## [17] gridExtra_2.3 cli_3.6.3
## [19] gRbase_2.0.3 Cairo_1.6-2
## [21] flashClust_1.01-2 sandwich_3.1-1
## [23] labeling_0.4.3 sass_0.4.9
## [25] mvtnorm_1.3-2 survMisc_0.5.6
## [27] readr_2.1.5 qpgraph_2.40.0
## [29] systemfonts_1.1.0 yulab.utils_0.1.7
## [31] svglite_2.1.3 R.utils_2.12.3
## [33] sessioninfo_1.2.2 rstudioapi_0.17.1
## [35] RSQLite_2.3.7 generics_0.1.3
## [37] gridGraphics_0.5-1 shape_1.4.6.1
## [39] BiocIO_1.16.0 hwriter_1.3.2.1
## [41] car_3.1-3 dplyr_1.1.4
## [43] qtl_1.70 lars_1.3
## [45] leaps_3.2 Matrix_1.7-1
## [47] interp_1.1-6 fansi_1.0.6
## [49] abind_1.4-8 R.methodsS3_1.8.2
## [51] lifecycle_1.0.4 scatterplot3d_0.3-44
## [53] multcomp_1.4-26 yaml_2.3.10
## [55] carData_3.0-5 SparseArray_1.6.0
## [57] BiocFileCache_2.14.0 grid_4.4.2
## [59] blob_1.2.4 promises_1.3.0
## [61] crayon_1.5.3 pwalign_1.2.0
## [63] miniUI_0.1.1.1 lattice_0.22-6
## [65] GenomicFeatures_1.58.0 annotate_1.84.0
## [67] KEGGREST_1.46.0 magick_2.8.5
## [69] pillar_1.9.0 knitr_1.48
## [71] ComplexHeatmap_2.22.0 rjson_0.2.23
## [73] TCGAbiolinksGUI.data_1.26.0 estimability_1.5.1
## [75] corpcor_1.6.10 codetools_0.2-20
## [77] glue_1.8.0 downloader_0.4
## [79] remotes_2.5.0 data.table_1.16.2
## [81] MultiAssayExperiment_1.32.0 vctrs_0.6.5
## [83] png_0.1-8 coxrobust_1.0.1
## [85] gtable_0.3.6 cachem_1.1.0
## [87] aroma.light_3.36.0 xfun_0.49
## [89] mime_0.12 S4Arrays_1.6.0
## [91] coda_0.19-4.1 survival_3.7-0
## [93] pheatmap_1.0.12 KMsurv_0.1-5
## [95] ellipsis_0.3.2 TH.data_1.1-2
## [97] bit64_4.5.2 progress_1.2.3
## [99] filelock_1.0.3 bslib_0.8.0
## [101] elasticnet_1.3 colorspace_2.1-1
## [103] DBI_1.2.3 DNAcopy_1.80.0
## [105] tidyselect_1.2.1 emmeans_1.10.5
## [107] bit_4.5.0 compiler_4.4.2
## [109] curl_5.2.3 rvest_1.0.4
## [111] httr2_1.0.6 graph_1.84.0
## [113] xml2_1.3.6 DelayedArray_0.32.0
## [115] rtracklayer_1.66.0 checkmate_2.3.2
## [117] scales_1.3.0 multcompView_0.1-10
## [119] rappdirs_0.3.3 stringr_1.5.1
## [121] digest_0.6.37 rmarkdown_2.29
## [123] htmltools_0.5.8.1 pkgconfig_2.0.3
## [125] jpeg_0.1-10 highr_0.11
## [127] FactoMineR_2.11 dbplyr_2.5.0
## [129] fastmap_1.2.0 rlang_1.1.4
## [131] GlobalOptions_0.1.2 htmlwidgets_1.6.4
## [133] UCSC.utils_1.2.0 shiny_1.9.1
## [135] farver_2.1.2 jquerylib_0.1.4
## [137] zoo_1.8-12 jsonlite_1.8.9
## [139] R.oo_1.27.0 RCurl_1.98-1.16
## [141] magrittr_2.0.3 Formula_1.2-5
## [143] GenomeInfoDbData_1.2.13 ggplotify_0.1.2
## [145] NbClust_3.0.1 munsell_0.5.1
## [147] Rcpp_1.0.13-1 stringi_1.8.4
## [149] zlibbioc_1.52.0 MASS_7.3-61
## [151] pkgbuild_1.4.5 plyr_1.8.9
## [153] ggrepel_0.9.6 deldir_2.0-4
## [155] survminer_0.5.0 splines_4.4.2
## [157] hms_1.1.3 circlize_0.4.16
## [159] igraph_2.1.1 ggpubr_0.6.0
## [161] ggsignif_0.6.4 pkgload_1.4.0
## [163] biomaRt_2.62.0 XML_3.99-0.17
## [165] evaluate_1.0.1 latticeExtra_0.6-30
## [167] tzdb_0.4.0 httpuv_1.6.15
## [169] tidyr_1.3.1 purrr_1.0.2
## [171] reshape_0.8.9 km.ci_0.5-6
## [173] clue_0.3-65 ggplot2_3.5.1
## [175] BiocBaseUtils_1.8.0 broom_1.0.7
## [177] xtable_1.8-4 restfulr_0.0.15
## [179] rstatix_0.7.2 later_1.3.2
## [181] viridisLite_0.4.2 tibble_3.2.1
## [183] memoise_2.0.1 cluster_2.1.6