Database Import

First, download the pre-built database files that we provide and unpack them. Make sure that you have enough space available: the packed file takes up 31 GB, and unpacking it requires an additional 188 GB.

$ cd /plenty/space
$ wget https://file-public.bihealth.org/transient/varfish/varfish-server-background-db-20201006.tar.gz{,.sha256}
$ sha256sum -c varfish-server-background-db-20201006.tar.gz.sha256
$ tar xzvf varfish-server-background-db-20201006.tar.gz
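Before starting the download, it can help to confirm that the target volume has room for both the archive and its unpacked contents (31 GB plus 188 GB, per the figures above). The following Python sketch is one way to do this. The helper names `required_bytes` and `has_enough_space` and the 10% safety margin are illustrative assumptions, not part of VarFish.

```python
import shutil

# Figures taken from the text above, treated here as decimal gigabytes.
ARCHIVE_GB = 31
UNPACKED_GB = 188


def required_bytes(margin=0.10):
    """Bytes needed for the archive plus the unpacked data, with a safety margin."""
    total_gb = ARCHIVE_GB + UNPACKED_GB
    return int(total_gb * (1 + margin) * 10**9)


def has_enough_space(path, margin=0.10):
    """Return True if the filesystem containing `path` has enough free space."""
    free = shutil.disk_usage(path).free
    return free >= required_bytes(margin)


if __name__ == "__main__":
    # "." stands in for the download directory (/plenty/space in the commands above).
    target = "."
    if has_enough_space(target):
        print(f"{target}: enough free space")
    else:
        needed_gb = required_bytes() / 10**9
        print(f"{target}: need about {needed_gb:.0f} GB free")
```

Run the script from the directory you intend to download into, or replace `"."` with that directory's path.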

We recommend excluding the large databases: frequency tables, extra annotations, and dbSNP. Also keep in mind that importing the whole database takes more than 24 hours, depending on the speed of your HDD.

This is a list of the possible imports, sorted by size:

Component	Size	Exclude	Function
gnomAD_genomes	80G	highly recommended	frequency annotation
extra-annos	50G	highly recommended	diverse
dbSNP	32G	highly recommended	SNP annotation
thousand_genomes	6.5G	highly recommended	frequency annotation
gnomAD_exomes	6.0G	highly recommended	frequency annotation
knowngeneaa	4.5G	highly recommended	alignment annotation
clinvar	3.3G	highly recommended	pathogenicity classification
ExAC	1.9G	highly recommended	frequency annotation
dbVar	573M	recommended	SV annotation
gnomAD_SV	250M	recommended	SV frequency annotation
ncbi_gene	151M		gene annotation
ensembl_regulatory	77M		regulatory feature annotation
DGV	43M		SV annotation
hpo	22M		phenotype information
hgnc	15M		gene annotation
gnomAD_constraints	13M		frequency annotation
mgi	10M		mouse gene annotation
ensembltorefseq	8.3M		identifier mapping
hgmd_public	5.0M		gene annotation
ExAC_constraints	4.6M		frequency annotation
refseqtoensembl	2.0M		identifier mapping
ensembltogenesymbol	1.6M		identifier mapping
ensembl_genes	1.2M		gene annotation
HelixMTdb	1.2M		MT frequency annotation
refseqtogenesymbol	1.1M		identifier mapping
refseq_genes	804K		gene annotation
mim2gene	764K		phenotype information
MITOMAP	660K		MT frequency annotation
kegg	632K		pathway annotation
mtDB	336K		MT frequency annotation
tads_hesc	108K		domain annotation
tads_imr90	108K		domain annotation
vista	104K		orthologous region annotation
acmg	16K		disease gene annotation
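To get a feel for what the exclusions save, the sizes in the table can simply be added up. The short Python sketch below does this for the components whose exclusion is marked "highly recommended". The dictionary is transcribed by hand from the table, and the variable names are illustrative.

```python
# Sizes in gigabytes, transcribed from the table above for the components
# whose exclusion is marked "highly recommended".
HIGHLY_RECOMMENDED_GB = {
    "gnomAD_genomes": 80.0,
    "extra-annos": 50.0,
    "dbSNP": 32.0,
    "thousand_genomes": 6.5,
    "gnomAD_exomes": 6.0,
    "knowngeneaa": 4.5,
    "clinvar": 3.3,
    "ExAC": 1.9,
}

# Total space that is not needed if all of these components are skipped.
saved_gb = sum(HIGHLY_RECOMMENDED_GB.values())
print(f"Excluding these components saves about {saved_gb:.1f} GB")
```

These eight components add up to roughly 184 GB, so skipping them accounts for most of the unpacked size.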

You can find the import_versions.tsv file in the root folder of the package. This file determines which components (called table_group, each represented as a folder in the package) get imported when the import command is issued. To exclude a table group, comment out its line with # or delete the line. Excluding tables that are not required for development reduces time and space consumption. The GRCh38 tables can also be excluded.

A space-saving version of the file would look like this:

build       table_group     version
GRCh37      acmg    v2.0
#GRCh37     clinvar 20200929
#GRCh37     dbSNP   b151
#GRCh37     dbVar   latest
GRCh37      DGV     2016
GRCh37      ensembl_genes   r96
GRCh37      ensembl_regulatory      latest
GRCh37      ensembltogenesymbol     latest
GRCh37      ensembltorefseq latest
GRCh37      ExAC_constraints        r0.3.1
#GRCh37     ExAC    r1
#GRCh37     extra-annos     20200704
GRCh37      gnomAD_constraints      v2.1.1
#GRCh37     gnomAD_exomes   r2.1
#GRCh37     gnomAD_genomes  r2.1
#GRCh37     gnomAD_SV       v2
GRCh37      HelixMTdb       20190926
GRCh37      hgmd_public     ensembl_r75
GRCh37      hgnc    latest
GRCh37      hpo     latest
GRCh37      kegg    april2011
#GRCh37     knowngeneaa     latest
GRCh37      mgi     latest
GRCh37      mim2gene        latest
GRCh37      MITOMAP 20200116
GRCh37      mtDB    latest
GRCh37      ncbi_gene       latest
GRCh37      refseq_genes    r105
GRCh37      refseqtoensembl latest
GRCh37      refseqtogenesymbol      latest
GRCh37      tads_hesc       dixon2012
GRCh37      tads_imr90      dixon2012
#GRCh37     thousand_genomes        phase3
GRCh37      vista   latest
#GRCh38     clinvar 20200929
#GRCh38     dbVar   latest
#GRCh38     DGV     2016
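Commenting out lines by hand works fine, but the same edit can also be scripted. The Python sketch below prefixes a `#` to every line whose table_group is in an exclusion set or whose build is GRCh38. It assumes the whitespace-separated three-column layout shown above. The function name `comment_out` and the `EXCLUDE` set are illustrative and not part of VarFish.

```python
# Table groups to disable; this mirrors the commented-out lines in the listing above.
EXCLUDE = {
    "clinvar", "dbSNP", "dbVar", "ExAC", "extra-annos", "gnomAD_exomes",
    "gnomAD_genomes", "gnomAD_SV", "knowngeneaa", "thousand_genomes",
}


def comment_out(text, exclude=EXCLUDE, skip_builds=("GRCh38",)):
    """Return `text` with the lines for excluded table groups or builds commented out."""
    out = []
    for line in text.splitlines():
        fields = line.split()
        # Leave blank, malformed, and already commented lines untouched.
        if len(fields) < 3 or line.lstrip().startswith("#"):
            out.append(line)
            continue
        build, table_group = fields[0], fields[1]
        if table_group in exclude or build in skip_builds:
            out.append("#" + line)
        else:
            out.append(line)
    # Preserve the trailing newline if the input had one.
    return "\n".join(out) + ("\n" if text.endswith("\n") else "")


if __name__ == "__main__":
    sample = (
        "build\ttable_group\tversion\n"
        "GRCh37\tacmg\tv2.0\n"
        "GRCh37\tdbSNP\tb151\n"
        "GRCh38\tDGV\t2016\n"
    )
    print(comment_out(sample), end="")
```

To apply this to the real file, read import_versions.tsv from the root folder of the package, pass its contents through `comment_out`, and write the result back. Keeping a backup copy of the original file first is a sensible precaution.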

To perform the import, issue:

$ python manage.py import_tables --tables-path /plenty/space/varfish-server-background-db-20201006

Running the import a second time automatically skips tables that have already been imported. To re-import tables, add the --force parameter to the command:

$ python manage.py import_tables --tables-path varfish-db-downloader --force