AnnoTree Database Scripts

This repository contains all the scripts for creating a MySQL database that can be queried by the front-end AnnoTree application to visualize functional data on a phylogenetic tree. The scripts were written for data provided by the GTDB, but any data files that use the same format may be used.

Installation Requirements

  • You must be running a Unix-based operating system (e.g., Linux or macOS)
  • Install MySQL 5.7+
  • Install Python 3
  • Install the required Python modules. To do this, download the requirements.txt file from the root of this repo and run the following: pip install -r requirements.txt

Configuration YAML and Data Files

Use gtdb_database/test_data/db_config_example.yml as a template and change the fields accordingly. It is suggested that this file be password-protected, since it contains database credentials. Consult the gtdb_database/test_data directory for example data files.

Field Description
database_name The name to be assigned to the database in MySQL [default: gtdb_bacteria]
host MySQL database host. This may differ from the default if running in a Docker container [default: localhost]
port MySQL port. This may differ from the default if running in a Docker container [default: 3306]
user MySQL user with database creation privileges
password Password for the MySQL user
kegg_counts Path to count matrix with KEGG IDs as column names and genome IDs as row names. Row names must match the leaf names in newick_tree [Example: gtdb_kegg_table.test.tsv]
kegg_tophits_dir Path to directory containing KEGG hit scores for each genome. Each file must follow the naming format [genome ID]_ko_hits.tsv and have the same header and format as the files in the example directory [Example directory: ko_tophits.test]
metadata Path to metadata file supplied by the GTDB. Currently only the accession (i.e., genome ID), ncbi_taxonomy, gtdb_taxonomy, and ncbi_taxid fields are used, so a file containing only these fields should work [Example: bac_metadata.test.tsv]
newick_tree Path to phylogenetic tree in Newick format. The tree must contain branch lengths, bootstrap values, and labels at internal nodes with taxonomic ranks following Greengenes taxonomy formatting (e.g., 'p__Firmicutes'). Leaf names should represent genome IDs [Example: tsv-to-json/gtdb_r80_bac120.20171025.tree]
pfam_counts Path to count matrix with Pfam IDs as column names and genome IDs as row names. Row names must match the leaf names in newick_tree [Example: gtdb_pfam_table.test.tsv]
pfam_ftp_dir_url URL to the Pfam FTP directory corresponding to the Pfam version you would like to download [Example for v32.0: ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam32.0/database_files]
pfam_tophits_dir Path to directory containing Pfam hit scores for each genome. Each file must follow the naming format [genome ID]_pfam_tophit.tsv and have the same header and format as the files in the example directory [Example directory: pfam_tophits.test]
protein_seq_dir Path to directory containing protein FASTA files for each genome. Each file must follow the naming format [genome ID]_protein.faa. Encoded STOP codons (*) are permitted but are removed for database loading [Example directory: protein_files.test]
gtdb_taxonomy Path to file containing the taxonomy information for each genome ID, with headers and delimiters similar to those in the taxonomy files provided by the GTDB [Example: tsv-to-json/gtdb_bac_r80_20171025.tsv]
json_tree Desired output path for the JSON tree generated from data in the gtdb_taxonomy and newick_tree files. It is recommended that this file be given a version number associated with the data files used to generate it [Example: bac_r80_tree.json]
pfamA_sql Desired output path of the MySQL dump file for the pfamA table of the Pfam database. It is recommended that this file be given a version number associated with the Pfam version that was given in pfam_ftp_dir_url. If 'current_release' was used, you should verify the version [Example: pfamA_v32_0.sql]
pfamA_txt Desired output path of the MySQL data file for the pfamA table of the Pfam database. If you are running MySQL with the --secure-file-priv option (true by default), the file must be in the secure directory. It is recommended that this file be given a version number associated with the Pfam version that was given in pfam_ftp_dir_url. If 'current_release' was used, you should verify the version [Example: pfamA_v32_0.txt]
pfam_taxonomy_sql Desired output path of the MySQL dump file for the taxonomy table of the Pfam database. It is recommended that this file be given a version number associated with the Pfam version that was given in pfam_ftp_dir_url. If 'current_release' was used, you should verify the version [Example: pfam_taxonomy_v32_0.sql]
pfam_taxonomy_txt Desired output path of the MySQL data file for the taxonomy table of the Pfam database. If you are running MySQL with the --secure-file-priv option (true by default), the file must be in the secure directory. It is recommended that this file be given a version number associated with the Pfam version that was given in pfam_ftp_dir_url. If 'current_release' was used, you should verify the version [Example: pfam_taxonomy_v32_0.txt]
tigrfam_counts Path to count matrix with TIGRFAM IDs as column names and genome IDs as row names. Row names must match the leaf names in newick_tree [Example: gtdb_tigrfam_table.test.tsv]
tigrfam_tophits_dir Path to directory containing TIGRFAM hit scores for each genome. Each file must follow the naming format [genome ID]_tigrfam_tophit.tsv and have the same header and format as the files in the example directory [Example directory: tigrfam_tophits.test]
tigrfam_info_dir Path to directory containing a .INFO file for each TIGRFAM ID. The directory can be obtained from the JCVI FTP site: ftp://ftp.jcvi.org/pub/data/TIGRFAMs/. [Example directory: TIGRFAMs_INFO.test derived from the TIGRFAMs_15.0_INFO.tar.gz file at the JCVI FTP site]
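
Several of the directory fields above require one file per genome with a fixed suffix. The following is a minimal sketch, not part of make_db.py, of how a directory listing could be checked against those naming formats before running the pipeline. The suffixes are taken from the field descriptions; the names EXPECTED_SUFFIXES and split_by_naming, and the genome ID in the usage note, are hypothetical and used only for illustration.

```python
# Suffix required for the files in each per-genome directory, as stated in
# the field descriptions. EXPECTED_SUFFIXES is a hypothetical helper name.
EXPECTED_SUFFIXES = {
    "kegg_tophits_dir": "_ko_hits.tsv",
    "pfam_tophits_dir": "_pfam_tophit.tsv",
    "tigrfam_tophits_dir": "_tigrfam_tophit.tsv",
    "protein_seq_dir": "_protein.faa",
}


def split_by_naming(filenames, suffix):
    """Split file names into genome IDs and names that break the format.

    A name is accepted when it ends with `suffix` and has a non-empty
    genome ID in front of it. Returns (genome_ids, bad_names).
    """
    genome_ids, bad_names = [], []
    for name in filenames:
        if name.endswith(suffix) and len(name) > len(suffix):
            # Everything before the suffix is treated as the genome ID.
            genome_ids.append(name[: -len(suffix)])
        else:
            bad_names.append(name)
    return genome_ids, bad_names
```

In practice the list of names would likely come from os.listdir() on the configured directory, for example split_by_naming(os.listdir(config["kegg_tophits_dir"]), EXPECTED_SUFFIXES["kegg_tophits_dir"]), assuming config is the parsed YAML file. The returned genome IDs could then be compared with the row names of the count matrices and the leaf names of the tree.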

NOTE: All file paths must be full paths or be relative to the directory from which you run make_db.py.
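
The path rule in the note can be checked before a long run. The sketch below is an illustration only, not the logic used by make_db.py: it resolves each configured input path against a base directory (the current working directory by default) and reports the keys whose paths do not exist. The function names resolve_config_path and missing_paths are hypothetical, and config is assumed to be a dictionary already parsed from the YAML file.

```python
import os


def resolve_config_path(path, base_dir=None):
    """Return an absolute path for a configured path.

    Full paths are returned unchanged (after normalization); relative paths
    are interpreted relative to base_dir, which defaults to the directory
    the script is run from.
    """
    if base_dir is None:
        base_dir = os.getcwd()
    # os.path.join discards base_dir when `path` is already absolute.
    return os.path.normpath(os.path.join(base_dir, path))


def missing_paths(config, keys, base_dir=None):
    """Return the keys in `keys` that are unset or point to nothing on disk."""
    missing = []
    for key in keys:
        value = config.get(key)
        if value is None or not os.path.exists(resolve_config_path(value, base_dir)):
            missing.append(key)
    return missing
```

Only input fields such as metadata, newick_tree, or the *_counts and *_dir entries should be passed as keys; output fields such as json_tree or pfamA_sql name files that do not exist yet.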

Running Instructions

Once you have satisfied all of the installation requirements, formatted all data correctly, and generated a configuration file, run the wrapper script:

    python make_db.py --config path/to/config.yaml

It will generate and populate a new MySQL database and write progress to the screen and to make_db.log.