.. _developer_database:

===============
Database Import
===============

First, download the pre-built database files that we provide and unpack them.
Please make sure that you have enough space available. The packed file takes up
31 GB; unpacking it requires an additional 188 GB.

.. code-block:: bash

    $ cd /plenty/space
    $ wget https://file-public.bihealth.org/transient/varfish/varfish-server-background-db-20201006.tar.gz{,.sha256}
    $ sha256sum -c varfish-server-background-db-20201006.tar.gz.sha256
    $ tar xzvf varfish-server-background-db-20201006.tar.gz
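
Before starting the download, it can help to confirm that the target volume has
enough free space. The following is a minimal sketch using only the Python
standard library. The helper name ``has_enough_space`` and the 219 GB threshold
(31 GB packed plus 188 GB unpacked, taken from the figures above) are
assumptions for illustration and are not part of VarFish.

.. code-block:: python

    import shutil

    # Assumed requirement: 31 GB archive + 188 GB unpacked (figures from this section).
    REQUIRED_GB = 31 + 188


    def has_enough_space(path, required_gb=REQUIRED_GB):
        """Return True if the file system holding ``path`` has at least ``required_gb`` GB free."""
        free_bytes = shutil.disk_usage(path).free
        # Decimal gigabytes (10**9 bytes) are used here; adjust if you count in GiB.
        return free_bytes >= required_gb * 10**9

For example, ``has_enough_space("/plenty/space")`` returns ``False`` if less
than the assumed 219 GB is available in the download directory.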

We recommend excluding the large databases: frequency tables, extra
annotations, and dbSNP. Also, keep in mind that importing the whole database
takes more than 24 hours, depending on the speed of your HDD.

This is a list of the possible imports, sorted by size:

===================  ====  ==================  =============================
Component            Size  Exclude             Function
===================  ====  ==================  =============================
gnomAD_genomes       80G   highly recommended  frequency annotation
extra-annos          50G   highly recommended  diverse
dbSNP                32G   highly recommended  SNP annotation
thousand_genomes     6.5G  highly recommended  frequency annotation
gnomAD_exomes        6.0G  highly recommended  frequency annotation
knowngeneaa          4.5G  highly recommended  alignment annotation
clinvar              3.3G  highly recommended  pathogenicity classification
ExAC                 1.9G  highly recommended  frequency annotation
dbVar                573M  recommended         SV annotation
gnomAD_SV            250M  recommended         SV frequency annotation
ncbi_gene            151M                      gene annotation
ensembl_regulatory   77M                       regulatory annotation
DGV                  43M                       SV annotation
hpo                  22M                       phenotype information
hgnc                 15M                       gene annotation
gnomAD_constraints   13M                       frequency annotation
mgi                  10M                       mouse gene annotation
ensembltorefseq      8.3M                      identifier mapping
hgmd_public          5.0M                      gene annotation
ExAC_constraints     4.6M                      frequency annotation
refseqtoensembl      2.0M                      identifier mapping
ensembltogenesymbol  1.6M                      identifier mapping
ensembl_genes        1.2M                      gene annotation
HelixMTdb            1.2M                      MT frequency annotation
refseqtogenesymbol   1.1M                      identifier mapping
refseq_genes         804K                      gene annotation
mim2gene             764K                      phenotype information
MITOMAP              660K                      MT frequency annotation
kegg                 632K                      pathway annotation
mtDB                 336K                      MT frequency annotation
tads_hesc            108K                      domain annotation
tads_imr90           108K                      domain annotation
vista                104K                      orthologous region annotation
acmg                 16K                       disease gene annotation
===================  ====  ==================  =============================

You can find the ``import_versions.tsv`` file in the root folder of the
package. This file determines which components (called ``table_group`` and
represented as folders in the package) get imported when the import command is
issued. To exclude a table group, comment out (``#``) or delete its line.
Excluding tables that are not required for development reduces the time and
space the import needs. The GRCh38 tables can be excluded as well.

A space-saving version of the file would look like this::

    build	table_group	version
    GRCh37	acmg	v2.0
    #GRCh37	clinvar	20200929
    #GRCh37	dbSNP	b151
    #GRCh37	dbVar	latest
    GRCh37	DGV	2016
    GRCh37	ensembl_genes	r96
    GRCh37	ensembl_regulatory	latest
    GRCh37	ensembltogenesymbol	latest
    GRCh37	ensembltorefseq	latest
    GRCh37	ExAC_constraints	r0.3.1
    #GRCh37	ExAC	r1
    #GRCh37	extra-annos	20200704
    GRCh37	gnomAD_constraints	v2.1.1
    #GRCh37	gnomAD_exomes	r2.1
    #GRCh37	gnomAD_genomes	r2.1
    #GRCh37	gnomAD_SV	v2
    GRCh37	HelixMTdb	20190926
    GRCh37	hgmd_public	ensembl_r75
    GRCh37	hgnc	latest
    GRCh37	hpo	latest
    GRCh37	kegg	april2011
    #GRCh37	knowngeneaa	latest
    GRCh37	mgi	latest
    GRCh37	mim2gene	latest
    GRCh37	MITOMAP	20200116
    GRCh37	mtDB	latest
    GRCh37	ncbi_gene	latest
    GRCh37	refseq_genes	r105
    GRCh37	refseqtoensembl	latest
    GRCh37	refseqtogenesymbol	latest
    GRCh37	tads_hesc	dixon2012
    GRCh37	tads_imr90	dixon2012
    #GRCh37	thousand_genomes	phase3
    GRCh37	vista	latest
    #GRCh38	clinvar	20200929
    #GRCh38	dbVar	latest
    #GRCh38	DGV	2016
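
Editing the file by hand works fine; for repeated setups, the same change can be
scripted. The sketch below comments out selected rows of
``import_versions.tsv``. It assumes the tab-separated ``build``,
``table_group``, ``version`` layout shown above. The function name
``comment_out_table_groups`` and the ``LARGE_GROUPS`` set are hypothetical
helpers for illustration and are not part of VarFish.

.. code-block:: python

    # Table groups marked "highly recommended" or "recommended" for exclusion
    # in the size table of this section (an assumption you may want to adjust).
    LARGE_GROUPS = {
        "gnomAD_genomes", "extra-annos", "dbSNP", "thousand_genomes",
        "gnomAD_exomes", "knowngeneaa", "clinvar", "ExAC", "dbVar", "gnomAD_SV",
    }


    def comment_out_table_groups(tsv_text, groups=LARGE_GROUPS, builds=("GRCh38",)):
        """Prefix a row with '#' if its table_group is in ``groups`` or its build is in ``builds``."""
        out = []
        for line in tsv_text.splitlines():
            fields = line.split("\t")
            is_data_row = (
                len(fields) == 3
                and fields[0] != "build"      # skip the header row
                and not line.startswith("#")  # leave rows that are already commented out
            )
            if is_data_row and (fields[1] in groups or fields[0] in builds):
                line = "#" + line
            out.append(line)
        return "".join(row + "\n" for row in out)

A typical call would read the file content with ``pathlib.Path.read_text()``,
pass it through the function, and write the result back with ``write_text()``.
Keeping a copy of the original file makes it easy to re-enable groups later.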

To perform the import, issue:

.. code-block:: bash

    $ python manage.py import_tables --tables-path /plenty/space/varfish-server-background-db-20201006

Running the import a second time automatically skips tables that are already
imported. To re-import tables, add the ``--force`` parameter to the command:

.. code-block:: bash

    $ python manage.py import_tables --tables-path /plenty/space/varfish-server-background-db-20201006 --force