Create custom database and run analysis with it
Create a database
Not only you can take advantage of already created database, but create your own ones! You can use the following steps:
- Download the .csv file from "Browse by genome" tab of NCBI Genome database. (for more information you can refer to the get started guide) Alternatively you can use local folder with genome sequences in fasta format.
note
The .csv file name will be used as a database name. Similarly the folder name will be used for the same purposes if local genome sequences are used
- Run the rRNADif script with
--makedb_file
(when using .csv file) or--makedb_genomes
(for local genomes' folder) options
tip
You can take advantage of --split_download
flag while using --makedb_file
. In that way, genomes will be downloaded in groups. For each group, genomes will be annotated with barrnap and deleted. Then the next group download begins. Comes in handy when not much space is available for database creation.
- (optional) If you want to use a different algorithm for Multiple Sequence Alignment or program for Phylogeny computation feel free to use
-m
or-t
flags. Default programs are mafft for MSA and fasttree for phylogenetic tree computation (GTR model). More options can be viewed viash rrnadif.sg -h
and should be installed before usage (see here).
Directory structure where database is
Databases are stored under datasets folder in the repo. Here, datasets/downloaded_csv folder holds csv file with organism names which were used for database creation and datasets/computed stores the actual data:
note
Databases are just csv files. The folders holds fasta files (one genome per file), which will be copied during the analysis if the final phylogenetic tree should be constructed
Run analysis with created database
After the database was created, you should only provide -n
flag to specify it (Database name is a folder name in datasets/computed/)
Example:
Update database
The database can be updated using local folder with genomes, or the "Browse by genome" .csv file. Just use proper updatedb flag (--updatedb_file
for csv files and --updatedb_genomes
for local folder). To specify which database to update please use the -n
flag. By default, if no database name specified, the default database will be updated.
Example:
note
After the database was created, there is no difference between used sources (csv or local folder). Therefore, multiple sources for updating the same database can be used.
Updating the database follows the same logic as the database creation, so you use the same flags to tweak this process. Particularly -m
and -t
to choose the algorithms and --split_download
to specify the number of genomes downloaded at once in case of using csv file.
caution
Beware of the names of fasta files! By default, all spaces and special characters are replaced by _
when the csv file is used. Therefore, to omit duplicate items in the database please use the same convection.
Used rules to rename genomes with special characters in their names
Character | Replaced with |
---|---|
Space ( ) | Underscore (_) |
Forward slash (/) | Underscore (_) |
Colon (:) | Underscore (_) |
Semicolon (;) | Underscore (_) |
Comma (,) | Underscore (_) |
Square brackets ([) or (]) | Deleted () |
These rules are used when the database is created or updated with the csv file (see [Create a database](#Create a database)). We encourage to follow them when using local files.