Nextflow pipeline that creates the Neo4j NeoLeg database.
## Requirements
- The nextflow.config file must be adapted to your needs. Another config file can be created using nextflow.config as a template.
- Paths to the required files must be given as command-line arguments or set in the config file.
## Input
### species_info
Tab-separated (TSV) file describing the species genomes. The column header shown in the example is mandatory.
If a field is unknown, fill it with "NA" or a similar placeholder.

| sp_id | sp_name | accession | ncbi_taxon_id | assembly_v | annotation_v |
|---|---|---|---|---|---|
| lcul | Lens culinaris | CDC Redberry | 3864 | 2.0 | 2.0 |
| mtrun | Medicago truncatula | A17 | 3880 | 5.0 | 5.1.7 |
| psat | Pisum sativum | Cameor | 3888 | a | c |
| vfab | Vicia faba | Hedin/2 | 3906 | 1.0 | 1.0 |
| vrad | Vigna radiata | GCF_000741045.1 | 3916 | Vradiata_ver6 | 101 |

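
Because species names and accessions may contain spaces, the columns must be separated by real tab characters. The following is a minimal sketch, not part of the pipeline, of how such a file could be written from Python while filling unknown fields with "NA". The function name `write_species_info` and the output file name are hypothetical; only the column names come from the example above.

```python
import csv

# Column names taken from the species_info example; the header is mandatory.
SPECIES_INFO_HEADER = [
    "sp_id", "sp_name", "accession",
    "ncbi_taxon_id", "assembly_v", "annotation_v",
]


def write_species_info(records, out_path):
    """Write a list of dicts as a tab-separated species_info file.

    Missing or empty fields are written as "NA".
    """
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(SPECIES_INFO_HEADER)
        for record in records:
            row = []
            for column in SPECIES_INFO_HEADER:
                value = record.get(column)
                if value is None or str(value).strip() == "":
                    row.append("NA")
                else:
                    row.append(str(value))
            writer.writerow(row)


if __name__ == "__main__":
    # Hypothetical output file name; the assembly and annotation versions
    # are left out on purpose and end up as "NA".
    write_species_info(
        [{"sp_id": "vfab", "sp_name": "Vicia faba",
          "accession": "Hedin/2", "ncbi_taxon_id": 3906}],
        "species_info.tsv",
    )
```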
### species_files
Tab-separated (TSV) file listing the paths to all genome files. The column header shown in the example is mandatory.
Example:
```
ID genome gff3 protein
lcul input/genome/lcul_genome.fa input/genome/lcul_annotation.gff3 input/proteins/lcul.fa
mtrun input/genome/mtrun_genome.fa input/genome/mtrun_annotation.gff3 input/proteins/mtrun.fa
psat input/genome/psat_genome.fa input/genome/psat_annotation.gff3 input/proteins/psat.fa
vfab input/genome/vfab_genome.fa input/genome/vfab_annotation.gff3 input/proteins/vfab.fa
vrad input/genome/vrad_genome.fa input/genome/vrad_annotation.gff3 input/proteins/vrad.fa
```
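
A missing file or a mistyped header is easier to catch before launching the pipeline than in the middle of a run. The sketch below is an optional pre-flight check, assuming the four-column header from the example above. The function name `check_species_files` is hypothetical, and relative paths are resolved against the current working directory, which may differ from how the pipeline itself resolves them.

```python
import csv
from pathlib import Path

# Header taken from the species_files example; it is mandatory.
EXPECTED_HEADER = ["ID", "genome", "gff3", "protein"]


def check_species_files(tsv_path):
    """Return a list of human-readable problems found in a species_files TSV."""
    problems = []
    with open(tsv_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        if reader.fieldnames != EXPECTED_HEADER:
            return [f"unexpected header: {reader.fieldnames}"]
        for row in reader:
            for column in EXPECTED_HEADER[1:]:
                value = row[column]
                # Assumption: paths are checked relative to the current directory.
                if not value or not Path(value).is_file():
                    problems.append(
                        f"{row['ID']}: {column} file not found: {value}"
                    )
    return problems


if __name__ == "__main__":
    import sys

    for problem in check_species_files(sys.argv[1]):
        print(problem)
```

An empty result means the header matches and every listed file exists.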
### annotation_files
Tab-separated (TSV) file listing the paths to all annotation files. The column header shown in the example is mandatory.
If eggNOG files are not provided, replace the path with a "." and set `eggnog_files` to true in the config.

| ID | eggnog | mapman | interproscan | trapid_go | trapid_gene_family | trapid_rna_family |
|---|---|---|---|---|---|---|
| lcul | input/eggnog/lcul.tsv | /input/mapman/lcul.txt | input/interproscan/lcul.tsv | input/trapid/go/lcul_go.zip | input/trapid/gene_family/lcul_gf.zip | input/trapid/rna_family/lcul_rf.zip |
| mtrun | input/eggnog/mtrun.tsv | input/mapman/mtrun.txt | input/interproscan/mtrun.tsv | input/trapid/go/mtrun_go.zip | input/trapid/gene_family/mtrun_gf.zip | input/trapid/rna_family/mtrun_rf.zip |
| psat | input/eggnog/psat.tsv | input/mapman/psat.txt | input/interproscan/psat.tsv | input/trapid/go/psat_go.zip | input/trapid/gene_family/psat_gf.zip | input/trapid/rna_family/psat_rf.zip |
| vfab | input/eggnog/vfab.tsv | input/mapman/vfab.txt | input/interproscan/vfab.tsv | input/trapid/go/vfab_go.zip | input/trapid/gene_family/vfab_gf.zip | input/trapid/rna_family/vfab_rf.zip |
| vrad | input/eggnog/vrad.tsv | input/mapman/vrad.txt | input/interproscan/vrad.tsv | input/trapid/go/vrad_go.zip | input/trapid/gene_family/vrad_gf.zip | input/trapid/rna_family/vrad_rf.zip |

### synteny and orthology files
From OrthoFinder, the pipeline needs Log.txt (`OF_log` parameter, used for node annotation) and N0.tsv (`OF_N0` parameter, the file containing the orthogroups and their associated genes).
From MCScanX, the pipeline needs the collinearity file (`synt_collinearity` parameter, the .collinearity file).
Both sets of files can be generated using the *specifics_MCScanX* pipeline.
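
To inspect an N0.tsv file before handing it to the pipeline, the sketch below lists the genes of each orthogroup. It assumes the usual OrthoFinder layout: three leading columns (`HOG`, `OG`, `Gene Tree Parent Clade`) followed by one column per species, each cell holding a comma-separated gene list. The function name `read_orthogroup_members` is hypothetical, and this is not necessarily how the pipeline itself parses the file.

```python
import csv

# Assumed non-species columns at the start of an OrthoFinder N0.tsv file.
META_COLUMNS = ("HOG", "OG", "Gene Tree Parent Clade")


def read_orthogroup_members(n0_path):
    """Return a list of (orthogroup_id, species_column, gene_id) tuples."""
    members = []
    with open(n0_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        for row in reader:
            for column, cell in row.items():
                # Skip metadata columns, surplus fields and empty cells.
                if column is None or column in META_COLUMNS or not cell:
                    continue
                for gene in cell.split(","):
                    gene = gene.strip()
                    if gene:
                        members.append((row["HOG"], column, gene))
    return members
```

Each tuple corresponds to one potential gene-to-orthogroup edge, which makes it easy to count the genes per species or to spot empty orthogroups.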
### parameters
Some parameters are listed in the nextflow.config file.
These include but are not limited to:
- `make_eggnog_db`: selects whether to create the eggNOG database or to use an existing one, whose path must be given in the `eggnog_db` parameter.
- `use_trapid_gos`: set it if you would rather have TRAPID GO terms than eggNOG GO terms on RNA nodes. `keep_hidden_gos` controls whether GO terms that are not the most specific ones are kept on RNAs (see http://bioinformatics.psb.ugent.be/trapid_02/documentation/general).
## Running the pipeline
```
nextflow run main.nf -c nextflow.config --species_files neoleg_infiles.tsv --annotation_files neoleg_infiles.tsv --outdir results/20220322_neoleg
```
## Output
2022/03/22
To date, the pipeline can create the following folders (if publish is set to `true`):
- *conf_files*: simple copy of config files for input tracing.
- *eggnog_annotation*: annotation of each proteome by the EggNOG database.
- *nodes_gene*: one csv file per species containing gene nodes + 1 header.
- *nodes_gene_family*: one csv file per species containing gene family nodes + 1 header. TODO: concatenate all of them (sort and uniq); for now this is done manually.
- *nodes_mapman*: one rdf file containing the mapman ontology for n10s import (using `n10s.rdf.import.fetch()`).
- *nodes_orthogroup*: one csv file containing all orthogroup nodes + 1 header.
- *nodes_protein*: one csv file per species containing protein nodes + 1 header.
- *nodes_protein_annotation*: one csv file per species containing protein annotation nodes from InterProScan + 1 header.
- *nodes_rna*: one csv file per species containing rna nodes + 1 header.
- *nodes_rna_family*: one csv file per species containing rna family nodes + 1 header. TODO: concatenate all of them (sort and uniq); for now this is done manually.
- *nodes_synteny*: one csv file containing all synteny nodes + 1 header.
- *nodes_synteny_experiment*: one csv file containing all synteny experiment nodes + 1 header.
- *edges_gene_gene_family*: one csv file per species containing edges from gene to gene family + 1 header.
- *edges_gene_orthogroup*: one csv file containing edges from gene to orthogroup + 1 header.
- *edges_gene_rna*: one csv file per species containing edges from gene to rna + 1 header.
- *edges_gene_synteny*: one csv file containing edges from gene to synteny + 1 header.
- *edges_protein_protein_annotation*: one csv file per species containing edges from protein to protein annotation + 1 header.
- *edges_rna_mapman*: one csv file per species containing edges from rna to mapman term + 1 header.
- *edges_rna_protein*: one csv file per species containing edges from rna to protein + 1 header.
- *edges_rna_rna_family*: one csv file per species containing edges from rna to rna family + 1 header.
- *edges_synteny_synteny_experiment*: one csv file containing edges from synteny to synteny_experiment + 1 header.
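
The manual concatenation mentioned in the TODO notes for *nodes_gene_family* and *nodes_rna_family* can be sketched as follows. This is an illustration only, under the assumption that the per-species csv files contain data rows only and that the header lives in its own file; if the header is repeated inside each csv, it has to be removed first. The function name `merge_unique_lines` and the file names in the usage example are hypothetical.

```python
from pathlib import Path


def merge_unique_lines(data_files, out_path):
    """Concatenate files, drop duplicate lines, sort, and write the result.

    Returns the number of unique lines written.
    Assumption: the input files contain data rows only (no header line).
    """
    unique = set()
    for path in data_files:
        with open(path) as handle:
            for line in handle:
                line = line.rstrip("\n")
                if line:  # ignore blank lines
                    unique.add(line)
    merged = sorted(unique)
    Path(out_path).write_text("".join(f"{line}\n" for line in merged))
    return len(merged)


if __name__ == "__main__":
    # Hypothetical layout: one data file per species in nodes_gene_family/.
    files = sorted(Path("nodes_gene_family").glob("*.csv"))
    count = merge_unique_lines(files, "nodes_gene_family_merged.csv")
    print(f"{count} unique gene family nodes written")
```

This is the equivalent of a sort followed by uniq over all per-species files, so a family shared by several species appears only once in the merged file.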