AnnoTree Database Scripts
This repository contains all the scripts for creating a MySQL database that the front-end AnnoTree application queries to visualize functional data on a phylogenetic tree. The scripts were written for data provided by the GTDB, but any data files that use the same format may be used.
Installation Requirements
- You must be running a Unix-based operating system (i.e., Linux or macOS)
- Install MySQL 5.7+
- Install Python 3
- Install the required Python modules. To do this, download the requirements.txt file in the root of this repo and run the following: pip install -r requirements.txt
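
The requirements above can be checked before going further. The following is a minimal sketch, not part of this repository: the function name is hypothetical, and it assumes the MySQL command-line client is installed under the name mysql and is reachable on PATH. It does not verify the MySQL version.

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 0)):
    """Return a list of human-readable problems; an empty list means every check passed."""
    problems = []
    # Compare the running interpreter against the minimum version.
    if sys.version_info < min_python:
        problems.append("Python %d.%d+ is required" % min_python)
    # Assumption: the MySQL client binary is called `mysql` and is on PATH.
    if shutil.which("mysql") is None:
        problems.append("the mysql client was not found on PATH")
    # The scripts expect a Unix-based operating system.
    if sys.platform.startswith("win"):
        problems.append("a Unix-based operating system is required")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("WARNING:", problem)
```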
Configuration YAML and Data Files
Use gtdb_database/test_data/db_config_example.yml as a template and change the fields accordingly. It is suggested that this file be password-protected, because it contains sensitive database credentials. Consult the gtdb_database/test_data directory for example data files.
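
One simple way to limit who can read the configuration file on a Unix system is to restrict its permissions to the owning user. This is a different technique from password protection, and the helper below is a hypothetical sketch rather than something the repository provides.

```python
import os
import stat


def restrict_to_owner(path):
    """Make `path` readable and writable by its owner only (mode 0600).

    Returns the resulting permission bits so the caller can confirm the change.
    """
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
    return stat.S_IMODE(os.stat(path).st_mode)
```

For example, restrict_to_owner("my_config.yml") would be called on your own copy of the template (the file name here is made up).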
Field | Description |
---|---|
database_name | The name to be assigned to the database in MySQL [default: gtdb_bacteria] |
host | MySQL database host. This may differ from the default if running in a Docker container [default: localhost] |
port | MySQL port. This may differ from the default if running in a Docker container [default: 3306] |
user | MySQL user with database creation privileges |
password | Password for the MySQL user |
kegg_counts | Path to count matrix with KEGG IDs as column names and genome IDs as row names. Row names must match the leaf names in newick_tree [Example: gtdb_kegg_table.test.tsv] |
kegg_tophits_dir | Path to directory containing KEGG hit scores for each genome. Each file must follow the naming format [genome ID]_ko_hits.tsv and have the same header and format as the files in the example directory [Example directory: ko_tophits.test] |
metadata | Path to metadata file supplied by the GTDB. So far, only the accession (i.e., genome ID), ncbi_taxonomy, gtdb_taxonomy, and ncbi_taxid fields are used, so a file containing only these fields should work [Example: bac_metadata.test.tsv] |
newick_tree | Path to phylogenetic tree in Newick format. The tree must contain branch lengths, bootstrap values, and labels at internal nodes with taxonomic ranks following Greengenes taxonomy formatting (e.g., 'p__Firmicutes'). Leaf names should be genome IDs [Example: tsv-to-json/gtdb_r80_bac120.20171025.tree] |
pfam_counts | Path to count matrix with Pfam IDs as column names and genome IDs as row names. Row names must match the leaf names in newick_tree [Example: gtdb_pfam_table.test.tsv] |
pfam_ftp_dir_url | URL of the Pfam FTP directory corresponding to the Pfam version you would like to download [Example for v32.0: ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam32.0/database_files] |
pfam_tophits_dir | Path to directory containing Pfam hit scores for each genome. Each file must follow the naming format [genome ID]_pfam_tophit.tsv and have the same header and format as the files in the example directory [Example directory: pfam_tophits.test] |
protein_seq_dir | Path to directory containing protein FASTA files for each genome. Each file must follow the naming format [genome ID]_protein.faa. Encoded STOP codons (*) are permitted but are removed for database loading [Example directory: protein_files.test] |
gtdb_taxonomy | Path to file containing the taxonomy information for each genome ID, with headers and delimiters similar to those in the taxonomy files provided by the GTDB [Example: tsv-to-json/gtdb_bac_r80_20171025.tsv] |
json_tree | Desired output path for the JSON tree generated from data in the gtdb_taxonomy and newick_tree files. It is recommended that this file be given a version number associated with the data files used to generate it [Example: bac_r80_tree.json] |
pfamA_sql | Desired output path of the MySQL dump file for the pfamA table of the Pfam database. It is recommended that this file be given a version number associated with the Pfam version that was given in pfam_ftp_dir_url. If 'current_release' was used, you should verify the version [Example: pfamA_v32_0.sql] |
pfamA_txt | Desired output path of the MySQL data file for the pfamA table of the Pfam database. If you are running MySQL with the --secure-file-priv option (enabled by default), the file must be in the secure directory. It is recommended that this file be given a version number associated with the Pfam version that was given in pfam_ftp_dir_url. If 'current_release' was used, you should verify the version [Example: pfamA_v32_0.txt] |
pfam_taxonomy_sql | Desired output path of the MySQL dump file for the taxonomy table of the Pfam database. It is recommended that this file be given a version number associated with the Pfam version that was given in pfam_ftp_dir_url. If 'current_release' was used, you should verify the version [Example: pfam_taxonomy_v32_0.sql] |
pfam_taxonomy_txt | Desired output path of the MySQL data file for the taxonomy table of the Pfam database. If you are running MySQL with the --secure-file-priv option (enabled by default), the file must be in the secure directory. It is recommended that this file be given a version number associated with the Pfam version that was given in pfam_ftp_dir_url. If 'current_release' was used, you should verify the version [Example: pfam_taxonomy_v32_0.txt] |
tigrfam_counts | Path to count matrix with TIGRFAM IDs as column names and genome IDs as row names. Row names must match the leaf names in newick_tree [Example: gtdb_tigrfam_table.test.tsv] |
tigrfam_tophits_dir | Path to directory containing TIGRFAM hit scores for each genome. Each file must follow the naming format [genome ID]_tigrfam_tophit.tsv and have the same header and format as the files in the example directory [Example directory: tigrfam_tophits.test] |
tigrfam_info_dir | Path to directory containing a .INFO file for each TIGRFAM ID. The directory can be obtained from the JCVI FTP site: ftp://ftp.jcvi.org/pub/data/TIGRFAMs/ [Example directory: TIGRFAMs_INFO.test, derived from the TIGRFAMs_15.0_INFO.tar.gz file at the JCVI FTP site] |
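
Several of the directory fields in the table require per-genome files that follow a fixed naming format. The sketch below shows one way such names could be checked before a long database build. The suffixes are copied from the table, while the function names and the genome ID in the usage note are hypothetical and are not taken from the repository's scripts.

```python
# Filename suffixes required by the table, keyed by configuration field.
SUFFIXES = {
    "kegg_tophits_dir": "_ko_hits.tsv",
    "pfam_tophits_dir": "_pfam_tophit.tsv",
    "tigrfam_tophits_dir": "_tigrfam_tophit.tsv",
    "protein_seq_dir": "_protein.faa",
}


def genome_id_from_filename(filename, field):
    """Return the genome ID encoded in `filename`, or None if the name does not fit."""
    suffix = SUFFIXES[field]
    # The name must end with the suffix and leave a non-empty genome ID in front of it.
    if filename.endswith(suffix) and len(filename) > len(suffix):
        return filename[: -len(suffix)]
    return None


def misnamed_files(filenames, field):
    """Return the sorted names that do not follow the naming format for `field`."""
    return sorted(
        name for name in filenames if genome_id_from_filename(name, field) is None
    )
```

With a made-up genome ID, genome_id_from_filename("GENOME_A_protein.faa", "protein_seq_dir") yields "GENOME_A". The extracted IDs could then be compared against the row names of the count matrices and the leaf names in newick_tree.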
NOTE: All file paths must be absolute paths or be relative to the directory from which you run make_db.py.
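
To illustrate the path rule in the note, the sketch below resolves input paths against a base directory and reports those that do not exist. It assumes the configuration has already been loaded into a plain dictionary (for instance by a YAML parser). The helper names are hypothetical, and make_db.py may handle paths differently.

```python
import os

# Configuration fields that point at existing input files or directories.
INPUT_PATH_FIELDS = [
    "kegg_counts", "kegg_tophits_dir", "metadata", "newick_tree",
    "pfam_counts", "pfam_tophits_dir", "protein_seq_dir", "gtdb_taxonomy",
    "tigrfam_counts", "tigrfam_tophits_dir", "tigrfam_info_dir",
]


def resolve_paths(config, base_dir=None):
    """Map each input field present in `config` to a normalized path.

    Relative paths are joined to `base_dir` (default: the current working
    directory); absolute paths pass through unchanged.
    """
    base_dir = base_dir or os.getcwd()
    resolved = {}
    for field in INPUT_PATH_FIELDS:
        if field in config:
            # os.path.join discards base_dir when the second part is absolute.
            resolved[field] = os.path.normpath(os.path.join(base_dir, config[field]))
    return resolved


def missing_paths(resolved):
    """Return the sorted field names whose resolved path does not exist."""
    return sorted(field for field, path in resolved.items() if not os.path.exists(path))
```

Running the check from the same directory in which make_db.py will be started keeps the relative paths meaningful.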
Running Instructions
Once you have satisfied all of the installation requirements, formatted all data correctly, and created a configuration file, run the wrapper script: python make_db.py --config path/to/config.yaml
It will generate and populate a new MySQL database and write progress to the screen and to make_db.log.
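
If the build needs to be started from another Python program, the command above can be wrapped with the standard subprocess module. This is only a sketch: the function names are hypothetical, and treating a non-zero exit code as failure is an assumption about make_db.py rather than documented behaviour.

```python
import subprocess
import sys


def build_command(config_path, script="make_db.py"):
    """Build the argument list for the wrapper script using the current interpreter."""
    return [sys.executable, script, "--config", config_path]


def run_make_db(config_path):
    """Run the wrapper script and return its exit code.

    Output is not captured, so progress still appears on the screen.
    """
    completed = subprocess.run(build_command(config_path))
    return completed.returncode
```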