AnnoTree Database Scripts

This repository contains all the scripts for creating a MySQL database that the front-end AnnoTree application queries to visualize functional data on a phylogenetic tree. The scripts were written for data provided by the GTDB, but any data files that use the same format may be used.

Installation Requirements

You must be running a Unix-based operating system (e.g., Linux or macOS)
Install MySQL 5.7+
Install Python 3+
Install required Python modules. To do this, download the requirements.txt file in the root of this repo and run the following: `pip install -r requirements.txt`

Configuration YAML and Data Files:

Use gtdb_database/test_data/db_config_example.yml as a template and change the fields described below to match your setup. It is suggested that this file be password-protected, since it contains database credentials. Consult the gtdb_database/test_data directory for example data files.

Field	Description
`database_name`	The name to be assigned to the database in MySQL [default: `gtdb_bacteria`]
`host`	MySQL database host. This may differ from default if running in a Docker container [default: `localhost`]
`port`	MySQL port. This may differ from default if running in a Docker container [default: `3306`]
`user`	MySQL user with database creation privileges
`password`	Password for the MySQL user
`kegg_counts`	Path to count matrix with KEGG IDs as column names and genome IDs as row names. Row names must match the leaf names in `newick_tree` [Example: `gtdb_kegg_table.test.tsv`]
`kegg_tophits_dir`	Path to directory containing KEGG hit scores for each genome. Each file must follow the naming format `[genome ID]_ko_hits.tsv` and have the same header and format as the files in the example directory [Example directory: `ko_tophits.test`]
`metadata`	Path to metadata file supplied by the GTDB. So far only the `accession` (i.e., genome ID), `ncbi_taxonomy`, `gtdb_taxonomy`, and `ncbi_taxid` fields are used, so a file containing only these fields should work [Example: `bac_metadata.test.tsv`]
`newick_tree`	Path to phylogenetic tree in Newick format. The tree must contain branch lengths, bootstrap values, and labels at internal nodes with taxonomic ranks following Greengenes taxonomy formatting (e.g., 'p__Firmicutes'). Leaf names should represent genome IDs [Example: `tsv-to-json/gtdb_r80_bac120.20171025.tree`]
`pfam_counts`	Path to count matrix with Pfam IDs as column names and genome IDs as row names. Row names must match the leaf names in `newick_tree` [Example: `gtdb_pfam_table.test.tsv`]
`pfam_ftp_dir_url`	URL to the Pfam FTP directory corresponding to the Pfam version you would like to download [Example for v32.0: `ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam32.0/database_files`]
`pfam_tophits_dir`	Path to directory containing Pfam hit scores for each genome. Each file must follow the naming format `[genome ID]_pfam_tophit.tsv` and have the same header and format as the files in the example directory [Example directory: `pfam_tophits.test`]
`protein_seq_dir`	Path to directory containing protein FASTA files for each genome. Each file must follow the naming format `[genome ID]_protein.faa`. Encoded STOP codons (`*`) are permitted but are removed for database loading [Example directory: `protein_files.test`]
`gtdb_taxonomy`	Path to file containing the taxonomy information for each genome ID, with headers and delimiters similar to those in the taxonomy files provided by the GTDB [Example: `tsv-to-json/gtdb_bac_r80_20171025.tsv`]
`json_tree`	Desired output path for the JSON tree generated from data in the `gtdb_taxonomy` and `newick_tree` files. It is recommended that this file be given a version number associated with the data files used to generate it [Example: `bac_r80_tree.json`]
`pfamA_sql`	Desired output path of the MySQL dump file for the `pfamA` table of the Pfam database. It is recommended that this file be given a version number associated with the Pfam version that was given in `pfam_ftp_dir_url`. If 'current_release' was used, you should verify the version [Example: `pfamA_v32_0.sql`]
`pfamA_txt`	Desired output path of the MySQL data file for the `pfamA` table of the Pfam database. If you are running MySQL with the `--secure-file-priv` option (true by default), the file must be in the secure directory. It is recommended that this file be given a version number associated with the Pfam version that was given in `pfam_ftp_dir_url`. If 'current_release' was used, you should verify the version [Example: `pfamA_v32_0.txt`]
`pfam_taxonomy_sql`	Desired output path of the MySQL dump file for the `taxonomy` table of the Pfam database. It is recommended that this file be given a version number associated with the Pfam version that was given in `pfam_ftp_dir_url`. If 'current_release' was used, you should verify the version [Example: `pfam_taxonomy_v32_0.sql`]
`pfam_taxonomy_txt`	Desired output path of the MySQL data file for the `taxonomy` table of the Pfam database. If you are running MySQL with the `--secure-file-priv` option (true by default), the file must be in the secure directory. It is recommended that this file be given a version number associated with the Pfam version that was given in `pfam_ftp_dir_url`. If 'current_release' was used, you should verify the version [Example: `pfam_taxonomy_v32_0.txt`]
`tigrfam_counts`	Path to count matrix with TIGRFAM IDs as column names and genome IDs as row names. Row names must match the leaf names in `newick_tree` [Example: `gtdb_tigrfam_table.test.tsv`]
`tigrfam_tophits_dir`	Path to directory containing TIGRFAM hit scores for each genome. Each file must follow the naming format `[genome ID]_tigrfam_tophit.tsv` and have the same header and format as the files in the example directory [Example directory: `tigrfam_tophits.test`]
`tigrfam_info_dir`	Path to directory containing a .INFO file for each TIGRFAM ID. The directory can be obtained from the JCVI FTP site: `ftp://ftp.jcvi.org/pub/data/TIGRFAMs/`. [Example directory: `TIGRFAMs_INFO.test` derived from the `TIGRFAMs_15.0_INFO.tar.gz` file at the JCVI FTP site]
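
Several fields above require that the row names of a count matrix match the leaf names in `newick_tree`. A minimal sketch of how that could be checked before building the database is given here. It is not code from this repository: the function names are hypothetical, the regular expression assumes unquoted leaf labels, and the count matrix is assumed to be tab-delimited with one header row and genome IDs in the first column.

```python
import csv
import io
import re

# Assumption: leaf labels are unquoted and contain no whitespace or ( ) , : ;
# A leaf label directly follows "(" or ","; internal node labels follow ")".
LEAF_PATTERN = re.compile(r"[(,]\s*([^\s(),:;]+)")


def leaf_names(newick_text):
    """Return the set of leaf labels found in a Newick string."""
    return set(LEAF_PATTERN.findall(newick_text))


def matrix_row_names(tsv_handle):
    """Return first-column values of a tab-delimited matrix, skipping the header."""
    reader = csv.reader(tsv_handle, delimiter="\t")
    next(reader, None)  # skip the header row of column IDs
    return {row[0] for row in reader if row}


def missing_from_tree(newick_text, tsv_handle):
    """Return the sorted row names that do not appear as leaves in the tree."""
    return sorted(matrix_row_names(tsv_handle) - leaf_names(newick_text))


# Toy data made up for illustration only (not taken from the example files).
toy_tree = "((G1:0.1,G2:0.2)p__Firmicutes:0.3,G3:0.4);"
toy_matrix = "genome\tK00001\tK00002\nG1\t1\t0\nG2\t0\t2\nG4\t3\t1\n"

print(missing_from_tree(toy_tree, io.StringIO(toy_matrix)))  # ['G4']
```

For real files, the tree text and an open file handle for the matrix would be passed in place of the toy values. A full Newick parser would be a safer choice if leaf labels are quoted.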

NOTE: All file paths must be either absolute paths or paths relative to the directory from which you run make_db.py.
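
Because relative paths are resolved against the directory you run make_db.py from, it can help to confirm that every input path resolves before starting a long run. The following is a minimal sketch, not code from this repository: `unresolved_inputs` is a hypothetical helper, the choice of which fields count as inputs is an assumption based on the table above, and the configuration is assumed to have already been parsed into a dictionary (for example with PyYAML's `yaml.safe_load`).

```python
import os

# Assumption: these fields from the table above point to inputs that must already exist.
INPUT_FIELDS = (
    "kegg_counts", "kegg_tophits_dir", "metadata", "newick_tree",
    "pfam_counts", "pfam_tophits_dir", "protein_seq_dir", "gtdb_taxonomy",
    "tigrfam_counts", "tigrfam_tophits_dir", "tigrfam_info_dir",
)


def unresolved_inputs(config, base_dir=None):
    """Return {field: reason} for input fields that are unset or do not exist.

    Relative paths are resolved against base_dir, which defaults to the
    current working directory.
    """
    base_dir = os.getcwd() if base_dir is None else base_dir
    problems = {}
    for field in INPUT_FIELDS:
        value = config.get(field)
        if not value:
            problems[field] = "not set in config"
            continue
        # os.path.join returns `value` unchanged when it is already absolute.
        full_path = os.path.join(base_dir, value)
        if not os.path.exists(full_path):
            problems[field] = "path not found: " + full_path
    return problems


# Example with a deliberately incomplete, made-up config.
for field, reason in unresolved_inputs({"newick_tree": "no/such/file.tree"}).items():
    print(field, "->", reason)
```

An empty result means every listed input was found. Output fields such as `json_tree` are left out on purpose, since they are created by the run.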

Running Instructions

Once you have satisfied all of the installation requirements, formatted all data correctly, and generated a configuration file, run the wrapper script: `python make_db.py --config path/to/config.yaml`

It will generate and populate a new MySQL database and output progress to the screen and make_db.log.