Skip to content
Snippets Groups Projects
user avatar
Etienne Rifa authored
cdb017a6
History
Name Last commit Last update
db_files
lib
rentrez
scripts
README.md

Formated UTOPIA db

Formated UTOPIA db is available here.

UTOPIA database construction

Prerequesites

Change libraries path in scripts extract_seq.pl, extract_cluster_taxonomy.pl.

NCBI Taxonomy database construction

Download and extract

wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
tar -xzf taxdump.tar.gz

Load SQLite database

loadTaxonomy.pl -names names.dmp -nodes nodes.dmp -struct taxonomyStructure.sql

Database generation

eDirect download

esearch -db "nucleotide" -query "\"internal transcribed spacer\"[Title] AND \"fungi\"[Porgn] AND \"complete sequence\" [Title]" | efetch -format gb > GB_ITS.gb

WARN: Unfortunately it is possible that some errors are generated during download, thus corrupting the Genbank file. Search for the term 'error' in the file and manually remove corrupted sequences.

extract sequence and taxonomy from Genbank file

This script uses the sqlite database previously created. You MUST modified line 18 the database path to fit your own.

perl extract_seq.pl GB_ITS.gb

This create two files:

  • A fasta file: sequences.dna.
  • A tabulated two column tabulated file: taxonomy.txt.

Prinseq sequence check

Generate statistics

  • Maybe embedded prinseq script if its light enough *
prinseq-lite.pl -fasta sequences.fna -graph_data sequences.gd -graph_stats ld,gc,ns,pt,ts,de,da,sc,dn

Generate graphs in HTML format

prinseq-graphs.pl -i sequences.gd -html_all -o sequences_prinseq

Filtering sequences

Thanks to the statistics given by Prinseq, we've choose too exclude sequences less than 100bp and larger than 5000bp. Also we do not allowed sequences with N's or with IUPAC code.

prinseq-lite.pl -fasta sequences.fna -out_format 1 -out_good sequences_good -noniupac -ns_max_n 1 -min_len 100 -max_len 5000

[OPTIONAL] extract ITS1 and ITS2 segments with ITSx.

The ITSx software can be downloaded here.

In order to speed up this step, we've split the sequences.dna fasta file using homemade scripts.

fasSpliter.pl -dir split -ns 200 sequences.fna

\ls -1d _* | sed 's,^.*$,listPath.pl -d & >> idx,' | bash

sed "s,\(/scratch/umrf/ITSbank/_[0123]000/sequences_\([0-9]*\)\).fas,echo '/home/stheil/soft/ITSx_1.1/ITSx -i & -o \1 -p /home/stheil/soft/ITSx_1.1/ITSx_db/HMMs/ --reset --cpu 8 --multi_thread T --preserve T --only_full T -t F ' | qsub -N itsx_\2 -cwd," | bash

Concatenate ITS 1 and 2 sequences.

cat _*/*.ITS1.fasta > ITS1.fasta
cat _*/*.ITS2.fasta > ITS2.fasta

Clustering with CD-HIT (dereplication) and re-extract taxonomy

-s options stands for length difference cutoff. meaning that sequences from the same cluster as the same length

echo 'cd-hit -i sequences_good.fasta -o sequences_good.cdhit_c1_s0999 -c 1 -T 8 -s 0.999 -sf 1 -sc 1 -M 4000' | qsub -cwd -pe multithread 8 -N cdhit_c1_s0999

Extract taxonomy from clusters

Clustering sequence can produce cluster in wich taxonomy is not homogenous. i.e. in the same cluster, multiple annotation will be present. To resolve those cases, we propose two different strategies:

  • Extract one sequence by taxonomy clade from the considered cluster.
  • Annotate the representative sequence at the last homogenous rank. The threshold is used to determine if the rank is homogenous or not.
extract_cluster_taxonomy.pl -taxo taxonomy.txt -clstr sequences_good.cdhit.clstr -fasta sequences.fna -dir split_cdhit -mode split -threshold 0.8

ITSbank taxonomy validation / database generation

Taxid file

Generate the Taxid file.

Rscript ParseRanks_v1.R -t split_cdhit/cdhit_taxonomy.txt -o split_cdhit/cdhit_taxonomy.taxid

Taxonomy incongruity

Validation on TAXID file (output from ParseRanks_v1.R)

Filetax=split_cdhit/cdhit_taxonomy.taxid
taxo=split_cdhit/cdhit_taxonomy.txt

Check for duplicated taxon.

cut -f2- -d\* $Filetax | sort | uniq -c | sort -h -r | awk '$1!=1' | awk '{print $2}' > split_cdhit/duplicated_taxons.txt

Check for multiple ancestor.

cut -f 2 -d '*' $Filetax | cut -f 1 -d ' ' | sort | uniq -c | awk '$1!=1 && $2 !="unclassified"' | sort -r -h > split_cdhit/problematic_taxons.tsv

Replace taxonomy by the most common (automatic)

cp $taxo tmp
for i in $(awk '{print $2}' split_cdhit/problematic_taxons.tsv);
do
	echo $i
  counts=$(grep -o k__Fungi.*$i $taxo | sort | uniq -c | sort -r -h)
	ftax=$(grep -o k__Fungi.*$i $taxo | sort | uniq -c | sort -r -h | awk '{print $2}' | head -1)
	echo "${counts}"
  echo

	sed -i "s,k__Fungi.*$i,$ftax,g" tmp
done > split_cdhit/log_validation_taxid.txt
mv tmp split_cdhit/sequences_taxonomy_corrected.txt

Taxid Regeneration

Rscript ParseRanks_v1.R -t split_cdhit/sequences_taxonomy_corrected.txt -o split_cdhit/sequences_corrected.taxid

IDTAXA file

Rscript IDtaxa_trainingDB.R -s split_cdhit/cdhit_sequences.fna -t split_cdhit/sequences_taxonomy_corrected.txt -r split_cdhit/sequences_corrected.taxid -o split_cdhit/sequences_corrected.Rdata