# Ortho_KB pipeline

Ortho_KB is a framework built on a Nextflow pipeline that generates the files needed to populate a Neo4j database.
The OrthoLegKB instance is built using Ortho_KB.

The Ortho_KB pipeline creates files of nodes and relationships for Neo4j, following the model below:
![graph_model](fig2.jpg)

## Requirements
- The `nextflow.config` file must be adapted to your setup. Alternatively, create another config file using `nextflow.config` as a template.
	- Paths to the required files must be given as command-line arguments or set in the config file.

## Input

### species_info
TSV file describing the species genomes. The column header must match the example below.
If a value is unknown, use "NA" or a similar placeholder.

The file format for the `species_info` parameter is described below.
```tsv
sp_id	sp_name	accession	ncbi_taxon_id	assembly_v	annotation_v
lcul	Lens culinaris	CDC Redberry	3864	2.0	2.0
mtrun	Medicago truncatula	A17	3880	5.0	5.1.7
psat	Pisum sativum	Cameor	3888	a	c
vfab	Vicia faba	Hedin/2	3906	1.0	1.0
vrad	Vigna radiata	GCF_000741045.1	3916	Vradiata_ver6	101
```
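
Since the header is mandatory, it can be worth checking the file before launching the pipeline. The following is a minimal sketch, assuming a tab-separated file with exactly the six columns shown above; the function name `check_species_info` is hypothetical and not part of Ortho_KB.

```python
import csv
import io

# Column names copied from the example above.
EXPECTED_COLUMNS = [
    "sp_id", "sp_name", "accession",
    "ncbi_taxon_id", "assembly_v", "annotation_v",
]


def check_species_info(handle):
    """Return a list of problems found in a species_info TSV (empty if none)."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader, None)
    if header != EXPECTED_COLUMNS:
        return [f"unexpected header: {header}"]
    problems = []
    for line_no, row in enumerate(reader, start=2):
        if len(row) != len(EXPECTED_COLUMNS):
            problems.append(
                f"line {line_no}: expected {len(EXPECTED_COLUMNS)} fields, got {len(row)}"
            )
        elif any(field.strip() == "" for field in row):
            problems.append(f"line {line_no}: empty field, use 'NA' for unknown values")
    return problems


example = (
    "sp_id\tsp_name\taccession\tncbi_taxon_id\tassembly_v\tannotation_v\n"
    "psat\tPisum sativum\tCameor\t3888\ta\tc\n"
)
print(check_species_info(io.StringIO(example)))
```

A wrong field count is the typical symptom of spaces being used instead of tabs between columns.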

### species_files
TSV file listing the paths to all genome files. The column header must match the example below.

Example:
```tsv
ID      genome  gff3    protein
lcul    input/genome/lcul_genome.fa     input/genome/lcul_annotation.gff3       input/proteins/lcul.fa
mtrun   input/genome/mtrun_genome.fa    input/genome/mtrun_annotation.gff3      input/proteins/mtrun.fa
psat    input/genome/psat_genome.fa     input/genome/psat_annotation.gff3       input/proteins/psat.fa
vfab    input/genome/vfab_genome.fa     input/genome/vfab_annotation.gff3       input/proteins/vfab.fa
vrad    input/genome/vrad_genome.fa     input/genome/vrad_annotation.gff3       input/proteins/vrad.fa
```
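
A mistyped path in this table is easy to miss. The sketch below lists every file that does not exist, assuming the table is truly tab-separated and that relative paths are resolved from the current working directory; `missing_paths` is a hypothetical helper, not something the pipeline provides.

```python
import csv
import os
import tempfile


def missing_paths(table_path):
    """Return (ID, column, path) for every listed file that does not exist."""
    missing = []
    with open(table_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            for column in ("genome", "gff3", "protein"):
                if not os.path.isfile(row[column]):
                    missing.append((row["ID"], column, row[column]))
    return missing


# Demo in a temporary directory: only the genome file is created.
with tempfile.TemporaryDirectory() as tmp:
    genome = os.path.join(tmp, "psat_genome.fa")
    open(genome, "w").close()
    table = os.path.join(tmp, "species_files.tsv")
    with open(table, "w") as handle:
        handle.write("ID\tgenome\tgff3\tprotein\n")
        handle.write(f"psat\t{genome}\t{tmp}/psat_annotation.gff3\t{tmp}/psat.fa\n")
    for sp_id, column, path in missing_paths(table):
        print(f"{sp_id}: {column} file not found: {path}")
```

The same loop could be adapted to the other input tables by changing the tuple of column names.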

### annotation_files
TSV file listing the paths to all annotation files. The column header must match the example below.
If eggNOG files are not provided, replace the path with a "." and set `eggnog_files` to `true` in the config.

The file format for the `annotation_files` parameter is described below.
```tsv
ID      eggnog	mapman  interproscan    trapid_go       trapid_gene_family      trapid_rna_family
lcul    input/eggnog/lcul.tsv	input/mapman/lcul.txt input/interproscan/lcul.tsv   input/trapid/go/lcul_go.zip   input/trapid/gene_family/lcul_gf.zip  input/trapid/rna_family/lcul_rf.zip
mtrun   input/eggnog/mtrun.tsv	input/mapman/mtrun.txt        input/interproscan/mtrun.tsv  input/trapid/go/mtrun_go.zip  input/trapid/gene_family/mtrun_gf.zip input/trapid/rna_family/mtrun_rf.zip
psat    input/eggnog/psat.tsv	input/mapman/psat.txt input/interproscan/psat.tsv   input/trapid/go/psat_go.zip   input/trapid/gene_family/psat_gf.zip  input/trapid/rna_family/psat_rf.zip
vfab    input/eggnog/vfab.tsv	input/mapman/vfab.txt input/interproscan/vfab.tsv   input/trapid/go/vfab_go.zip   input/trapid/gene_family/vfab_gf.zip  input/trapid/rna_family/vfab_rf.zip
vrad    input/eggnog/vrad.tsv	input/mapman/vrad.txt input/interproscan/vrad.tsv   input/trapid/go/vrad_go.zip   input/trapid/gene_family/vrad_gf.zip  input/trapid/rna_family/vrad_rf.zip
```

### transcriptomics files
Transcriptomics files are obtained from nf-core/fetchngs and nf-core/rnaseq.

The file format for the `transcriptomic_files` parameter is described below.
```tsv
ID	bioinfo_protocol	dataset	counts	tpm	metadata	samples_annotation
henriet2019_PRJNA517587	input/transcriptomics/bioinfo_protocol.csv	input/transcriptomics/datasets.csv	input/transcriptomics/henriet2019_PRJNA517587/salmon.merged.gene_counts.tsv	input/transcriptomics/henriet2019_PRJNA517587/salmon.merged.gene_tpm.tsv	input/transcriptomics/henriet2019_PRJNA517587/samplesheet.csv	input/transcriptomics/henriet2019_PRJNA517587/samples_annotation.csv
morgil2019_PRJNA474098	input/transcriptomics/bioinfo_protocol.csv	input/transcriptomics/datasets.csv	input/transcriptomics/morgil2019_PRJNA474098/salmon.merged.gene_counts.tsv	input/transcriptomics/morgil2019_PRJNA474098/salmon.merged.gene_tpm.tsv	input/transcriptomics/morgil2019_PRJNA474098/samplesheet.csv	input/transcriptomics/morgil2019_PRJNA474098/samples_annotation.csv
bahrman2019_PRJNA543764	input/transcriptomics/bioinfo_protocol.csv	input/transcriptomics/datasets.csv	input/transcriptomics/bahrman2019_PRJNA543764/salmon.merged.gene_counts.tsv	input/transcriptomics/bahrman2019_PRJNA543764/salmon.merged.gene_tpm.tsv	input/transcriptomics/bahrman2019_PRJNA543764/samplesheet.csv	input/transcriptomics/bahrman2019_PRJNA543764/samples_annotation.csv
wu2020_PRJNA611089	input/transcriptomics/bioinfo_protocol.csv	input/transcriptomics/datasets.csv	input/transcriptomics/wu2020_PRJNA611089/salmon.merged.gene_counts.tsv	input/transcriptomics/wu2020_PRJNA611089/salmon.merged.gene_tpm.tsv	input/transcriptomics/wu2020_PRJNA611089/samplesheet.csv	input/transcriptomics/wu2020_PRJNA611089/samples_annotation.csv
```
The dataset file (CSV) allows the following columns:
```
dataset_id:ID(rnaseq_dataset-ID),title:String,species:String,project:String,year:int,author:String,contact:String,doi:String,abstract:String
```
`dataset_id` is the only mandatory column; it is used to link Condition nodes to Dataset nodes.

### genetic files
The file format for the `qtl_files` parameter is described below.
```tsv
ID	qtl
psat	input/qtl/psat_qtl.csv
vfab	input/qtl/vfab_qtl.csv
lcul	input/qtl/lcul_qtl.csv
```

The recognized columns for QTL data (in CSV) are the following:
```csv
species,qtl_id,trait,site,year,population,population_type,lod,pvalue,additive,r2,linkage_group,assembly,chromosome,peakmarker_genetpos,peakmarker_id,peakmarker_start,peakmarker_end,leftmarker_id,leftmarker_genetpos,leftmarker_start,leftmarker_end,rightmarker_id,rightmarker_genetpos,rightmarker_start,rightmarker_end,comment_1,comment_2,reference,doi,Note
```
The following columns are mandatory:
- species
- qtl_id
- trait
- location
- year
- chromosome
- population
- population_type (either Biparental or DiversityPanel)

For the Biparental population_type, the following columns are also mandatory:
- leftmarker_start
- leftmarker_end
- rightmarker_start
- rightmarker_end

For the DiversityPanel population_type, the following columns are also mandatory:
- peakmarker_start
- peakmarker_end
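
The rules above can be sketched as a small pre-flight check. This is only an illustration: it assumes "mandatory" means the column is present and non-empty, and it copies the column names verbatim from the lists above. Note that `location` does not appear among the recognized columns (where `site` does), so `COMMON` may need adjusting to what your version of the pipeline expects. The function name `check_qtl_rows` and the demo values are made up.

```python
import csv
import io

# Mandatory columns, copied from the lists above (assumption: must be non-empty).
COMMON = [
    "species", "qtl_id", "trait", "location",
    "year", "chromosome", "population", "population_type",
]
BY_TYPE = {
    "Biparental": [
        "leftmarker_start", "leftmarker_end",
        "rightmarker_start", "rightmarker_end",
    ],
    "DiversityPanel": ["peakmarker_start", "peakmarker_end"],
}


def check_qtl_rows(handle):
    """Return one message per QTL row that lacks a mandatory value."""
    problems = []
    for line_no, row in enumerate(csv.DictReader(handle), start=2):
        pop_type = row.get("population_type") or ""
        if pop_type not in BY_TYPE:
            problems.append(f"line {line_no}: unknown population_type {pop_type!r}")
            continue
        required = COMMON + BY_TYPE[pop_type]
        empty = [name for name in required if not (row.get(name) or "").strip()]
        if empty:
            problems.append(f"line {line_no}: missing {', '.join(empty)}")
    return problems


# Made-up rows: the second one is Biparental but has no flanking markers.
example = (
    "species,qtl_id,trait,location,year,chromosome,population,"
    "population_type,peakmarker_start,peakmarker_end\n"
    "psat,qtl_demo_1,demo_trait,demo_site,2020,chr1,demo_pop,DiversityPanel,100,200\n"
    "psat,qtl_demo_2,demo_trait,demo_site,2020,chr1,demo_pop,Biparental,100,200\n"
)
for problem in check_qtl_rows(io.StringIO(example)):
    print(problem)
```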

### synteny and orthology files
Two sets of files are required:
- From OrthoFinder: `Log.txt` (`OF_log` parameter, used for node annotation) and `N0.tsv` (`OF_N0` parameter, containing the orthogroups and their associated genes).
- From MCScanX: the `.collinearity` file (`mcscanx_collinearity` parameter) and the tandem file (`mcscanx_tandem` parameter).

Both sets of files can be generated using the *specifics_MCScanX* pipeline.

### parameters
Some parameters are listed in the `nextflow.config` file.
These include, but are not limited to:
- `make_eggnog_db`: whether to create the eggNOG database or to use an existing one, whose path must be given in the `eggnog_db` parameter.
- `use_trapid_gos`: attach TRAPID GO terms to RNA nodes instead of eggNOG GO terms. The related `keep_hidden_gos` parameter keeps GO terms that are not the most specific ones on RNAs (see http://bioinformatics.psb.ugent.be/trapid_02/documentation/general).


## Running the pipeline
```
nextflow run main.nf -c user.config \
    --mcscanx_collinearity input/synt.collinearity \
    --mcscanx_tandem input/synt.tandem \
    --OF_N0 input/N0.tsv \
    --species_files input/ortholegkb_infiles.tsv \
    --annotation_files input/ortholegkb_infiles_annotation.tsv \
    --transcriptomic_files input/ortholegkb_infiles_transcriptomics.tsv \
    --qtl_files input/ortholegkb_infiles_qtl.tsv \
    --outdir results/ortholegkb_files
```

## Output
(last update: 2023/03/21)
The pipeline can create (if publish is set to `true`) the following directories:
- *conf_files*: copy of the config files, kept to trace the inputs.
- *eggnog_annotation*: annotation of each proteome against the eggNOG database.

- *nodes_gene*: one csv file per species containing gene nodes + 1 header.
- *nodes_gene_family*: one csv file per species containing gene family nodes + 1 header. TODO: concatenate them (sort and uniq); for now this is done manually.
- *nodes_mapman*: one rdf file containing the mapman ontology for n10s import (using `n10s.rdf.import.fetch()`).
- *nodes_orthogroup*: one csv file containing all orthogroups nodes + 1 header.
- *nodes_protein*: one csv file per species containing protein nodes + 1 header.
- *nodes_protein_annotation*: one csv file per species containing protein annotation nodes from InterProScan + 1 header.
- *nodes_rna*: one csv file per species containing rna nodes + 1 header.
- *nodes_rna_family*: one csv file per species containing rna family nodes + 1 header. TODO: concatenate them (sort and uniq); for now this is done manually.
- *nodes_synteny*: one csv file containing all synteny nodes + 1 header.
- *nodes_synteny_experiment*: one csv file containing all synteny experiments nodes + 1 header.

- *edges_gene_gene_family*: one csv file per species containing edges from gene to gene family + 1 header.
- *edges_gene_orthogroup*: one csv file containing edges from gene to orthogroup + 1 header.
- *edges_gene_rna*: one csv file per species containing edges from gene to rna + 1 header.
- *edges_gene_synteny*: one csv file containing edges from gene to synteny + 1 header.
- *edges_protein_protein_annotation*: one csv file per species containing edges from protein to protein annotation + 1 header.
- *edges_rna_mapman*: one csv file per species containing edges from rna to mapman term + 1 header.
- *edges_rna_protein*: one csv file per species containing edges from rna to protein + 1 header.
- *edges_rna_rna_family*: one csv file per species containing edges from rna to rna family + 1 header.
- *edges_synteny_synteny_experiment*: one csv file containing edges from synteny to synteny_experiment + 1 header.

The `merged/` directory contains, for each file category, the files merged across all species considered.
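
The manual concatenation mentioned for *nodes_gene_family* and *nodes_rna_family* could be scripted along these lines. This is a sketch under two assumptions: every per-species file starts with the same header line (if the header is a separate file, drop the header handling), and no field contains an embedded newline. `merge_unique` and the demo file contents are hypothetical.

```python
import os
import tempfile


def merge_unique(csv_paths, out_path):
    """Concatenate CSV files sharing one header into sorted, unique rows."""
    header = None
    rows = set()
    for path in csv_paths:
        with open(path) as handle:
            lines = handle.read().splitlines()
        if not lines:
            continue  # skip empty files
        if header is None:
            header = lines[0]
        elif lines[0] != header:
            raise ValueError(f"{path}: header differs from the first file")
        rows.update(line for line in lines[1:] if line)
    if header is None:
        raise ValueError("no non-empty input file")
    with open(out_path, "w") as handle:
        handle.write("\n".join([header] + sorted(rows)) + "\n")


# Demo with two made-up per-species files (column names are placeholders).
with tempfile.TemporaryDirectory() as tmp:
    inputs = []
    for name, body in [
        ("psat.csv", "id,name\nGF2,b\nGF1,a\n"),
        ("vfab.csv", "id,name\nGF1,a\nGF3,c\n"),
    ]:
        path = os.path.join(tmp, name)
        with open(path, "w") as handle:
            handle.write(body)
        inputs.append(path)
    merged = os.path.join(tmp, "merged.csv")
    merge_unique(inputs, merged)
    with open(merged) as handle:
        print(handle.read())
```

Rows are deduplicated as whole lines, which mirrors a `sort | uniq` on the file bodies while keeping a single header on top.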