Formated UTOPIA db
Formated UTOPIA db is available here.
UTOPIA database construction
Prerequesites
-
Perl Perl libraries: Getopt, Log4Perl, DBI, SQL.
-
Sqlite
-
NCBI tool kit: https://www.ncbi.nlm.nih.gov/books/NBK179288/
-
NCBI Taxonomy Database ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
-
Install personal Perl libraries
Change libraries path in scripts extract_seq.pl
, extract_cluster_taxonomy.pl
.
NCBI Taxonomy database construction
Download and extract
wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
tar -xzf taxdump.tar.gz
Load SQLite database
loadTaxonomy.pl -names names.dmp -nodes nodes.dmp -struct taxonomyStructure.sql
Database generation
eDirect download
esearch -db "nucleotide" -query "\"internal transcribed spacer\"[Title] AND \"fungi\"[Porgn] AND \"complete sequence\" [Title]" | efetch -format gb > GB_ITS.gb
WARN: Unfortunately it is possible that some errors are generated during download, thus corrupting the Genbank file. Search for the term 'error' in the file and manually remove corrupted sequences.
extract sequence and taxonomy from Genbank file
This script uses the sqlite database previously created. You MUST modified line 18 the database path to fit your own.
perl extract_seq.pl GB_ITS.gb
This create two files:
- A fasta file: sequences.dna.
- A tabulated two column tabulated file: taxonomy.txt.
Prinseq sequence check
Generate statistics
- Maybe embedded prinseq script if its light enough *
prinseq-lite.pl -fasta sequences.fna -graph_data sequences.gd -graph_stats ld,gc,ns,pt,ts,de,da,sc,dn
Generate graphs in HTML format
prinseq-graphs.pl -i sequences.gd -html_all -o sequences_prinseq
Filtering sequences
Thanks to the statistics given by Prinseq, we've choose too exclude sequences less than 100bp and larger than 5000bp. Also we do not allowed sequences with N's or with IUPAC code.
prinseq-lite.pl -fasta sequences.fna -out_format 1 -out_good sequences_good -noniupac -ns_max_n 1 -min_len 100 -max_len 5000
[OPTIONAL] extract ITS1 and ITS2 segments with ITSx.
The ITSx software can be downloaded here.
In order to speed up this step, we've split the sequences.dna fasta file using homemade scripts.
fasSpliter.pl -dir split -ns 200 sequences.fna
\ls -1d _* | sed 's,^.*$,listPath.pl -d & >> idx,' | bash
sed "s,\(/scratch/umrf/ITSbank/_[0123]000/sequences_\([0-9]*\)\).fas,echo '/home/stheil/soft/ITSx_1.1/ITSx -i & -o \1 -p /home/stheil/soft/ITSx_1.1/ITSx_db/HMMs/ --reset --cpu 8 --multi_thread T --preserve T --only_full T -t F ' | qsub -N itsx_\2 -cwd," | bash
Concatenate ITS 1 and 2 sequences.
cat _*/*.ITS1.fasta > ITS1.fasta
cat _*/*.ITS2.fasta > ITS2.fasta
Clustering with CD-HIT (dereplication) and re-extract taxonomy
-s options stands for length difference cutoff. meaning that sequences from the same cluster as the same length
echo 'cd-hit -i sequences_good.fasta -o sequences_good.cdhit_c1_s0999 -c 1 -T 8 -s 0.999 -sf 1 -sc 1 -M 4000' | qsub -cwd -pe multithread 8 -N cdhit_c1_s0999
Extract taxonomy from clusters
Clustering sequence can produce cluster in wich taxonomy is not homogenous. i.e. in the same cluster, multiple annotation will be present. To resolve those cases, we propose two different strategies:
- Extract one sequence by taxonomy clade from the considered cluster.
- Annotate the representative sequence at the last homogenous rank. The threshold is used to determine if the rank is homogenous or not.
extract_cluster_taxonomy.pl -taxo taxonomy.txt -clstr sequences_good.cdhit.clstr -fasta sequences.fna -dir split_cdhit -mode split -threshold 0.8
ITSbank taxonomy validation / database generation
Taxid file
Generate the Taxid file.
Rscript ParseRanks_v1.R -t split_cdhit/cdhit_taxonomy.txt -o split_cdhit/cdhit_taxonomy.taxid
Taxonomy incongruity
Validation on TAXID file (output from ParseRanks_v1.R)
Filetax=split_cdhit/cdhit_taxonomy.taxid
taxo=split_cdhit/cdhit_taxonomy.txt
Check for duplicated taxon.
cut -f2- -d\* $Filetax | sort | uniq -c | sort -h -r | awk '$1!=1' | awk '{print $2}' > split_cdhit/duplicated_taxons.txt
Check for multiple ancestor.
cut -f 2 -d '*' $Filetax | cut -f 1 -d ' ' | sort | uniq -c | awk '$1!=1 && $2 !="unclassified"' | sort -r -h > split_cdhit/problematic_taxons.tsv
Replace taxonomy by the most common (automatic)
cp $taxo tmp
for i in $(awk '{print $2}' split_cdhit/problematic_taxons.tsv);
do
echo $i
counts=$(grep -o k__Fungi.*$i $taxo | sort | uniq -c | sort -r -h)
ftax=$(grep -o k__Fungi.*$i $taxo | sort | uniq -c | sort -r -h | awk '{print $2}' | head -1)
echo "${counts}"
echo
sed -i "s,k__Fungi.*$i,$ftax,g" tmp
done > split_cdhit/log_validation_taxid.txt
mv tmp split_cdhit/sequences_taxonomy_corrected.txt
Taxid Regeneration
Rscript ParseRanks_v1.R -t split_cdhit/sequences_taxonomy_corrected.txt -o split_cdhit/sequences_corrected.taxid
IDTAXA file
Rscript IDtaxa_trainingDB.R -s split_cdhit/cdhit_sequences.fna -t split_cdhit/sequences_taxonomy_corrected.txt -r split_cdhit/sequences_corrected.taxid -o split_cdhit/sequences_corrected.Rdata