AnnoTree Database Scripts

This repository contains all the scripts for creating a MySQL database that the front-end AnnoTree application queries to visualize functional data on a phylogenetic tree. The scripts were written for data provided by the GTDB, but any data files in the same format may be used.

Installation Requirements

  • You must be running a Unix-based operating system (i.e., Linux or macOS)
  • Install MySQL 5.7+
  • Install Python 3.x
  • Install the required Python modules. To do this, download the requirements.txt file from the root of this repo and run the following: pip install -r requirements.txt

Configuration YAML and Data Files

Use gtdb_database/test_data/db_config_example.yml as a template and change the fields accordingly. Because this file contains database credentials, it is suggested that it be password-protected. Consult the gtdb_database/test_data directory for example data files.

Field Description
database_name The name to be assigned to the database in MySQL [default: gtdb_bacteria]
host MySQL database host. This may differ from default if running in a Docker image [default: localhost]
port MySQL port. This may differ from default if running in a Docker container [default: 3306]
user MySQL user with database creation privileges
password Password for the MySQL user
kegg_counts Path to count matrix with KEGG IDs as column names and genome IDs as row names. Row names must match the leaf names in newick_tree [Example: gtdb_kegg_table.test.tsv]
kegg_tophits_dir Path to directory containing KEGG hit scores for each genome. Each file must follow the naming format [genome ID]_ko_hits.tsv and have the same header and format as the files in the example directory [Example directory: ko_tophits.test]
metadata Path to metadata file supplied by the GTDB. Currently only the accession (i.e., genome ID), ncbi_taxonomy, gtdb_taxonomy, and ncbi_taxid fields are used, so a file containing only these fields should work [Example: bac_metadata.test.tsv]
newick_tree Path to phylogenetic tree in Newick format. The tree must contain branch lengths, bootstrap values, and labels at internal nodes with taxonomic ranks following Greengenes taxonomy formatting (e.g., 'p__Firmicutes'). Leaf names should be genome IDs [Example: tsv-to-json/gtdb_r80_bac120.20171025.tree]
pfam_counts Path to count matrix with Pfam IDs as column names and genome IDs as row names. Row names must match the leaf names in newick_tree [Example: gtdb_pfam_table.test.tsv]
pfam_ftp_dir_url URL to the Pfam FTP directory corresponding to the Pfam version you would like to download [Example for v32.0:]
pfam_tophits_dir Path to directory containing Pfam hit scores for each genome. Each file must follow the naming format [genome ID]_pfam_tophit.tsv and have the same header and format as the files in the example directory [Example directory: pfam_tophits.test]
protein_seq_dir Path to directory containing protein FASTA files for each genome. Each file must follow the naming format [genome ID]_protein.faa. Encoded STOP codons (*) are permitted but are removed for database loading [Example directory: protein_files.test]
gtdb_taxonomy Path to file containing the taxonomy information for each genome ID, with headers and delimiters similar to those in the taxonomy files provided by the GTDB [Example: tsv-to-json/gtdb_bac_r80_20171025.tsv]
json_tree Desired output path for the JSON tree generated from data in the gtdb_taxonomy and newick_tree files. It is recommended that this file be given a version number associated with the data files used to generate it [Example: bac_r80_tree.json]
pfamA_sql Desired output path of the MySQL dump file for the pfamA table of the Pfam database. It is recommended that this file be given a version number associated with the Pfam version that was given in pfam_ftp_dir_url. If 'current_release' was used, you should verify the version [Example: pfamA_v32_0.sql]
pfamA_txt Desired output path of the MySQL data file for the pfamA table of the Pfam database. If you are running MySQL with the --secure-file-priv option set (the default), the file must be in the secure directory. It is recommended that this file be given a version number associated with the Pfam version that was given in pfam_ftp_dir_url. If 'current_release' was used, you should verify the version [Example: pfamA_v32_0.txt]
pfam_taxonomy_sql Desired output path of the MySQL dump file for the taxonomy table of the Pfam database. It is recommended that this file be given a version number associated with the Pfam version that was given in pfam_ftp_dir_url. If 'current_release' was used, you should verify the version [Example: pfam_taxonomy_v32_0.sql]
pfam_taxonomy_txt Desired output path of the MySQL data file for the taxonomy table of the Pfam database. If you are running MySQL with the --secure-file-priv option set (the default), the file must be in the secure directory. It is recommended that this file be given a version number associated with the Pfam version that was given in pfam_ftp_dir_url. If 'current_release' was used, you should verify the version [Example: pfam_taxonomy_v32_0.txt]
tigrfam_counts Path to count matrix with TIGRFAM IDs as column names and genome IDs as row names. Row names must match the leaf names in newick_tree [Example: gtdb_tigrfam_table.test.tsv]
tigrfam_tophits_dir Path to directory containing TIGRFAM hit scores for each genome. Each file must follow the naming format [genome ID]_tigrfam_tophit.tsv and have the same header and format as the files in the example directory [Example directory: tigrfam_tophits.test]
tigrfam_info_dir Path to directory containing a .INFO file for each TIGRFAM ID. The directory can be obtained from the JCVI FTP site: [Example directory: TIGRFAMs_INFO.test derived from the TIGRFAMs_15.0_INFO.tar.gz file at the JCVI FTP site]
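
The per-genome naming conventions in the table can be checked before starting a build. The following is a minimal sketch and is not part of the repository's scripts: the helper name genome_ids_in and the SUFFIXES mapping are hypothetical, and the suffixes are copied from the field descriptions above.

```python
from pathlib import Path

# Suffix expected for each per-genome directory, taken from the field
# descriptions above. The mapping itself is illustrative, not repository code.
SUFFIXES = {
    "kegg_tophits_dir": "_ko_hits.tsv",
    "pfam_tophits_dir": "_pfam_tophit.tsv",
    "tigrfam_tophits_dir": "_tigrfam_tophit.tsv",
    "protein_seq_dir": "_protein.faa",
}


def genome_ids_in(directory, suffix):
    """Return (genome_ids, misnamed) for the files in `directory`.

    genome_ids: set of file-name prefixes that precede `suffix`.
    misnamed: sorted names of files that do not follow the convention.
    """
    genome_ids, misnamed = set(), []
    for path in Path(directory).iterdir():
        if not path.is_file():
            continue  # ignore subdirectories
        name = path.name
        if name.endswith(suffix) and len(name) > len(suffix):
            genome_ids.add(name[: -len(suffix)])
        else:
            misnamed.append(name)
    return genome_ids, sorted(misnamed)
```

Comparing the returned ID sets across the four directories, and against the row names of the count matrices, would show which genomes are missing from one of the inputs.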

NOTE: All file paths must be full paths or be relative to the directory from which you run the scripts
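
Because relative paths are interpreted against the working directory, it can help to resolve and check them before a build. The sketch below is an illustration under stated assumptions, not the repository's own validation: the function name check_input_paths and the INPUT_PATH_FIELDS list are hypothetical, and treating every input field as mandatory is an assumption.

```python
import os

# Fields from the table above that point at existing input files or
# directories. Treating all of them as mandatory is an assumption.
INPUT_PATH_FIELDS = [
    "kegg_counts", "kegg_tophits_dir", "metadata", "newick_tree",
    "pfam_counts", "pfam_tophits_dir", "protein_seq_dir", "gtdb_taxonomy",
    "tigrfam_counts", "tigrfam_tophits_dir", "tigrfam_info_dir",
]


def check_input_paths(config, base_dir=None):
    """Resolve each input path and report fields that are unset or missing.

    `config` is a dict of field names to values, e.g. a parsed YAML file.
    Relative paths are resolved against `base_dir` (default: the current
    working directory); absolute paths are kept unchanged.
    Returns (resolved, problems): a dict of absolute paths and a list of
    human-readable problem descriptions.
    """
    base_dir = base_dir or os.getcwd()
    resolved, problems = {}, []
    for field in INPUT_PATH_FIELDS:
        value = config.get(field)
        if not value:
            problems.append(f"{field}: not set")
            continue
        # os.path.join discards base_dir when value is already absolute.
        path = os.path.abspath(os.path.join(base_dir, str(value)))
        if os.path.exists(path):
            resolved[field] = path
        else:
            problems.append(f"{field}: {path} does not exist")
    return resolved, problems
```

Assuming PyYAML is installed, the dict could come from yaml.safe_load applied to the opened configuration file; an empty problems list then means every input path was found.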

Running Instructions

Once you have satisfied all of the installation requirements, formatted all data correctly, and generated a configuration file, run the wrapper script:

python --config path/to/config.yaml

It will generate and populate a new MySQL database and write progress to the screen and to make_db.log.