
# Ortho_KB pipeline

Ortho_KB is a framework built on a Nextflow pipeline that creates files for a Neo4j database.
The OrthoLegKB instance is built using Ortho_KB.

The Ortho_KB pipeline creates files of nodes and relationships for Neo4j, following the model below:
![graph_model](fig/OrthoLegKB.png)

## Requirements
- The `nextflow.config` file must be adapted to your needs. Another config file can be created using `nextflow.config` as a template.
- Paths to the required files must be given as command-line arguments or set in the config file.

## Input

### species_info
File describing the species genomes (TSV format). The column header shown in the example is mandatory.
If a value is unknown, use "NA" or a similar placeholder.

The file format for the `species_info` parameter is shown below.
```tsv
id      genus   species accession       ncbi_taxon_id   assembly_v      annotation_v    source
lcul    Lens    L. culinaris    CDC Redberry    3864    2.0     2.0     10.1101/2021.07.23.453237
mtru    Medicago        M. truncatula   A17     3880    5.0     5.1.7   10.1038/s41477-018-0286-7
psat    Pisum   P. sativum      Cameor  3888    a       c       10.1038/s41588-019-0480-1
vrad    Vigna   V. radiata      GCF_000741045.1 3916    Vradiata_ver6   101     10.1002/tpg2.20121
```
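
Since the header is mandatory, it can help to check the file before launching the pipeline. The following is a minimal sketch, not part of Ortho_KB: the function name `check_species_info` is hypothetical, and it assumes the file is tab-separated with exactly the header shown in the example.

```python
import csv
import io

# Column names copied from the example header above.
EXPECTED_HEADER = [
    "id", "genus", "species", "accession",
    "ncbi_taxon_id", "assembly_v", "annotation_v", "source",
]


def check_species_info(handle):
    """Return a list of problems found in a tab-separated species_info file."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader, None)
    if header != EXPECTED_HEADER:
        return [f"unexpected header: {header}"]
    problems = []
    for line_no, row in enumerate(reader, start=2):
        if len(row) != len(EXPECTED_HEADER):
            problems.append(
                f"line {line_no}: expected {len(EXPECTED_HEADER)} fields, found {len(row)}"
            )
        elif any(not field.strip() for field in row):
            problems.append(f"line {line_no}: empty field (use NA for unknown values)")
    return problems


# In-memory demo; the second data row is made up to show an empty field.
# With a real file, pass open(path, newline="") instead of a StringIO.
demo = io.StringIO(
    "\t".join(EXPECTED_HEADER) + "\n"
    "psat\tPisum\tP. sativum\tCameor\t3888\ta\tc\t10.1038/s41588-019-0480-1\n"
    "demo\tGenus\t\tNA\tNA\tNA\tNA\tNA\n"
)
print(check_species_info(demo))
```

An empty list means the header and field counts match what the example shows. The check does not validate the content of individual fields.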

### species_files
File listing the paths to all genome files (TSV format). The column header shown in the example is mandatory.

```tsv
ID  gff3   protein converted_chr_names	fai
lcul    input/genome/lcul_annotation.gff3       input/proteins/lcul.fa	input/genome/lcul_match.csv	input/genome/lcul_genome.fa.fai
mtrun   input/genome/mtrun_annotation.gff3      input/proteins/mtrun.fa	input/genome/mtrun_match.csv	input/genome/mtrun_genome.fa.fai
psat    input/genome/psat_annotation.gff3       input/proteins/psat.fa	input/genome/psat_match.csv	input/genome/psat_genome.fa.fai
vfab    input/genome/vfab_annotation.gff3       input/proteins/vfab.fa	input/genome/vfab_match.csv	input/genome/vfab_genome.fa.fai
vrad    input/genome/vrad_annotation.gff3       input/proteins/vrad.fa	input/genome/vrad_match.csv	input/genome/vrad_genome.fa.fai
```
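
A missing input file is easier to catch before the run than during it. The sketch below is an illustration rather than part of Ortho_KB: the function name `missing_paths` and the `base_dir` argument are hypothetical, and it assumes a tab-separated file whose relative paths are resolved against the directory the pipeline is launched from.

```python
import csv
import io
from pathlib import Path

# Path-holding columns, copied from the example header above.
PATH_COLUMNS = ["gff3", "protein", "converted_chr_names", "fai"]


def missing_paths(handle, base_dir="."):
    """Return (ID, column, path) for each listed file that is not found on disk."""
    missing = []
    for row in csv.DictReader(handle, delimiter="\t"):
        for column in PATH_COLUMNS:
            value = (row.get(column) or "").strip()
            # An empty cell is reported as missing too.
            if not value or not (Path(base_dir) / value).is_file():
                missing.append((row["ID"], column, value))
    return missing


# In-memory demo using one row of the example above.
# With a real file, pass open(path, newline="") instead of a StringIO.
demo = io.StringIO(
    "ID\tgff3\tprotein\tconverted_chr_names\tfai\n"
    "lcul\tinput/genome/lcul_annotation.gff3\tinput/proteins/lcul.fa\t"
    "input/genome/lcul_match.csv\tinput/genome/lcul_genome.fa.fai\n"
)
for species_id, column, path in missing_paths(demo):
    print(f"{species_id}: {column} file not found: {path}")
```

The same approach would apply to the other listing files by changing `PATH_COLUMNS`.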

### annotation_files
File listing the paths to all annotation files (TSV format). The column header shown in the example is mandatory.

The file format for the `annotation_files` parameter is shown below.
```tsv
ID      eggnog	mapman  interproscan    trapid_go       trapid_gene_family      trapid_rna_family
lcul    input/eggnog/lcul.tsv	input/mapman/lcul.txt input/interproscan/lcul.tsv   input/trapid/go/lcul_go.zip   input/trapid/gene_family/lcul_gf.zip  input/trapid/rna_family/lcul_rf.zip
mtrun   input/eggnog/mtrun.tsv	input/mapman/mtrun.txt        input/interproscan/mtrun.tsv  input/trapid/go/mtrun_go.zip  input/trapid/gene_family/mtrun_gf.zip input/trapid/rna_family/mtrun_rf.zip
psat    input/eggnog/psat.tsv	input/mapman/psat.txt input/interproscan/psat.tsv   input/trapid/go/psat_go.zip   input/trapid/gene_family/psat_gf.zip  input/trapid/rna_family/psat_rf.zip
vfab    input/eggnog/vfab.tsv	input/mapman/vfab.txt input/interproscan/vfab.tsv   input/trapid/go/vfab_go.zip   input/trapid/gene_family/vfab_gf.zip  input/trapid/rna_family/vfab_rf.zip
vrad    input/eggnog/vrad.tsv	input/mapman/vrad.txt input/interproscan/vrad.tsv   input/trapid/go/vrad_go.zip   input/trapid/gene_family/vrad_gf.zip  input/trapid/rna_family/vrad_rf.zip
```

### transcriptomics files
Transcriptomics files are obtained from nf-core/fetchngs and nf-core/rnaseq.
The file format for the `transcriptomic_files` parameter is shown below.
```tsv
ID	bioinfo_protocol	dataset	counts	tpm	metadata	samples_annotation
henriet2019_PRJNA517587	input/transcriptomics/bioinfo_protocol.csv	input/transcriptomics/datasets.csv	input/transcriptomics/henriet2019_PRJNA517587/salmon.merged.gene_counts.tsv	input/transcriptomics/henriet2019_PRJNA517587/salmon.merged.gene_tpm.tsv	input/transcriptomics/henriet2019_PRJNA517587/samplesheet.csv	input/transcriptomics/henriet2019_PRJNA517587/samples_annotation.csv
morgil2019_PRJNA474098	input/transcriptomics/bioinfo_protocol.csv	input/transcriptomics/datasets.csv	input/transcriptomics/morgil2019_PRJNA474098/salmon.merged.gene_counts.tsv	input/transcriptomics/morgil2019_PRJNA474098/salmon.merged.gene_tpm.tsv	input/transcriptomics/morgil2019_PRJNA474098/samplesheet.csv	input/transcriptomics/morgil2019_PRJNA474098/samples_annotation.csv
bahrman2019_PRJNA543764	input/transcriptomics/bioinfo_protocol.csv	input/transcriptomics/datasets.csv	input/transcriptomics/bahrman2019_PRJNA543764/salmon.merged.gene_counts.tsv	input/transcriptomics/bahrman2019_PRJNA543764/salmon.merged.gene_tpm.tsv	input/transcriptomics/bahrman2019_PRJNA543764/samplesheet.csv	input/transcriptomics/bahrman2019_PRJNA543764/samples_annotation.csv
wu2020_PRJNA611089	input/transcriptomics/bioinfo_protocol.csv	input/transcriptomics/datasets.csv	input/transcriptomics/wu2020_PRJNA611089/salmon.merged.gene_counts.tsv	input/transcriptomics/wu2020_PRJNA611089/salmon.merged.gene_tpm.tsv	input/transcriptomics/wu2020_PRJNA611089/samplesheet.csv	input/transcriptomics/wu2020_PRJNA611089/samples_annotation.csv
```
The dataset file (CSV) accepts the following columns:
```
dataset_id,topic,genera,species,title,author,year,doi
```
The `dataset_id` is used to link Condition nodes to Dataset nodes.

### genetic files
The file format for the `qtl_files` parameter is shown below.
```tsv
ID  status  species	qtl qtl_annotation
psat_public    public   psat  	input/qtl/psat_qtl.csv  input/qtl/psat_qtl_annotation.csv
vfab_public    public   vfab	input/qtl/vfab_qtl.csv  input/qtl/vfab_qtl_annotation.csv
lcul_public    public   lcul	input/qtl/lcul_qtl.csv  input/qtl/lcul_qtl_annotation.csv
```
The status must be either "private" or "public"; the corresponding label is added to the nodes of the dataset.
The "species" column should match the ID used in the species_files. This way, multiple datasets for
the same species are possible.
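
The two constraints above (allowed status values, and species matching an ID from the species_files) can be checked with a short script. This is a sketch under assumptions, not part of Ortho_KB: the function name `check_qtl_files` is hypothetical, both files are assumed to be tab-separated, and surrounding whitespace in cells is ignored.

```python
import csv
import io

ALLOWED_STATUS = {"private", "public"}


def check_qtl_files(qtl_files, species_files):
    """Return a list of problems in the qtl_files listing."""
    known_ids = {
        row["ID"].strip() for row in csv.DictReader(species_files, delimiter="\t")
    }
    problems = []
    for row in csv.DictReader(qtl_files, delimiter="\t"):
        dataset = row["ID"].strip()
        species = row["species"].strip()
        if row["status"].strip() not in ALLOWED_STATUS:
            problems.append(f"{dataset}: status must be private or public")
        if species not in known_ids:
            problems.append(f"{dataset}: species {species!r} not in species_files")
    return problems


# In-memory demo; the "demo_shared" row is made up to trigger both checks.
species_demo = io.StringIO(
    "ID\tgff3\tprotein\tconverted_chr_names\tfai\n"
    "psat\tinput/genome/psat_annotation.gff3\tinput/proteins/psat.fa\t"
    "input/genome/psat_match.csv\tinput/genome/psat_genome.fa.fai\n"
)
qtl_demo = io.StringIO(
    "ID\tstatus\tspecies\tqtl\tqtl_annotation\n"
    "psat_public\tpublic\tpsat\tinput/qtl/psat_qtl.csv\tinput/qtl/psat_qtl_annotation.csv\n"
    "demo_shared\tshared\txxxx\tdemo.csv\tdemo_annotation.csv\n"
)
for problem in check_qtl_files(qtl_demo, species_demo):
    print(problem)
```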

The recognized columns for QTL data (in CSV) are the following:
```csv
species,qtl_id,trait,site,year,population,population_type,lod,pvalue,additive,r2,linkage_group,assembly,chromosome,peakmarker_genetpos,peakmarker_id,peakmarker_start,peakmarker_end,leftmarker_id,leftmarker_genetpos,leftmarker_start,leftmarker_end,rightmarker_id,rightmarker_genetpos,rightmarker_start,rightmarker_end,comment_1,comment_2,reference,doi,Note
```
The following columns are mandatory:
- species
- qtl_id
- trait
- location
- year
- chromosome
- population
- population_type (either Biparental or DiversityPanel)

For the Biparental `population_type`, the following columns are also mandatory:
- leftmarker_start
- leftmarker_end
- rightmarker_start
- rightmarker_end

For the DiversityPanel `population_type`, the following columns are also mandatory:
- peakmarker_start
- peakmarker_end
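
The conditional rules above can be expressed as a small row-by-row check. The sketch below is illustrative only: the function name `check_qtl_rows` is hypothetical, the column names are taken verbatim from the mandatory lists above, and "mandatory" is interpreted here as "the cell must not be empty", which is an assumption about how the pipeline treats these columns.

```python
import csv
import io

# Column names copied from the mandatory lists above.
ALWAYS_MANDATORY = [
    "species", "qtl_id", "trait", "location", "year",
    "chromosome", "population", "population_type",
]
MANDATORY_BY_TYPE = {
    "Biparental": [
        "leftmarker_start", "leftmarker_end",
        "rightmarker_start", "rightmarker_end",
    ],
    "DiversityPanel": ["peakmarker_start", "peakmarker_end"],
}


def check_qtl_rows(handle):
    """Return a list of problems found in a comma-separated QTL file."""
    problems = []
    for line_no, row in enumerate(csv.DictReader(handle), start=2):
        pop_type = (row.get("population_type") or "").strip()
        if pop_type and pop_type not in MANDATORY_BY_TYPE:
            problems.append(f"line {line_no}: unknown population_type {pop_type!r}")
        # Add the type-specific columns only when the type is recognized.
        required = ALWAYS_MANDATORY + MANDATORY_BY_TYPE.get(pop_type, [])
        missing = [name for name in required if not (row.get(name) or "").strip()]
        if missing:
            problems.append(f"line {line_no}: missing {', '.join(missing)}")
    return problems


# In-memory demo with made-up values; the second row lacks peakmarker_end.
demo = io.StringIO(
    "species,qtl_id,trait,location,year,chromosome,population,population_type,"
    "leftmarker_start,leftmarker_end,rightmarker_start,rightmarker_end,"
    "peakmarker_start,peakmarker_end\n"
    "psat,q1,t,loc,2020,chr1,pop,Biparental,1,2,3,4,,\n"
    "psat,q2,t,loc,2020,chr1,pop,DiversityPanel,,,,,10,\n"
)
for problem in check_qtl_rows(demo):
    print(problem)
```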

### synteny and orthology files
From OrthoFinder, the pipeline needs `Log.txt` (`OF_log` parameter, used to annotate nodes) and `N0.tsv` (`OF_N0` parameter, the file containing the orthogroups and their associated genes).
From MCScanX, it needs the `.collinearity` file (`mcscanx_collinearity` parameter) and the tandem file (`mcscanx_tandem` parameter).
These two sets of files can be generated using the *specifics_MCScanX* pipeline.

### parameters
Check the nextflow.config file for additional parameters.


## Running the pipeline
```
nextflow run main.nf  \
-c conf/example_data.config \
--outdir results/example_data
```


## Output
(last update: 2023/03/21)
The pipeline can create the following directories (if publish is set to `true`):
- *conf_files*: simple copy of config files for input tracing.
- *eggnog_annotation*: annotation of each proteome by the EggNOG database.

- *nodes_gene*: one csv file per species containing gene nodes + 1 header.
- *nodes_gene_family*: one csv file per species containing gene family nodes + 1 header. TODO: concatenate all of them (sort and uniq); for now this is done manually.
- *nodes_mapman*: one rdf file containing the mapman ontology for n10s import (using `n10s.rdf.import.fetch()`).
- *nodes_orthogroup*: one csv file containing all orthogroups nodes + 1 header.
- *nodes_protein*: one csv file per species containing protein nodes + 1 header.
- *nodes_protein_annotation*: one csv file per species containing protein annotation nodes from InterProScan + 1 header.
- *nodes_rna*: one csv file per species containing rna nodes + 1 header.
- *nodes_rna_family*: one csv file per species containing rna family nodes + 1 header. TODO: concatenate all of them (sort and uniq); for now this is done manually.
- *nodes_synteny*: one csv file containing all synteny nodes + 1 header.
- *nodes_synteny_experiment*: one csv file containing all synteny experiment nodes + 1 header.

- *edges_gene_gene_family*: one csv file per species containing edges from gene to gene family + 1 header.
- *edges_gene_orthogroup*: one csv file containing edges from gene to orthogroup + 1 header.
- *edges_gene_rna*: one csv file per species containing edges from gene to rna + 1 header.
- *edges_gene_synteny*: one csv file containing edges from gene to synteny + 1 header.
- *edges_protein_protein_annotation*: one csv file per species containing edges from protein to protein annotation + 1 header.
- *edges_rna_mapman*: one csv file per species containing edges from rna to mapman term + 1 header.
- *edges_rna_protein*: one csv file per species containing edges from rna to protein + 1 header.
- *edges_rna_rna_family*: one csv file per species containing edges from rna to rna family + 1 header.
- *edges_synteny_synteny_experiment*: one csv file containing edges from synteny to synteny_experiment + 1 header.

The `merged/` directory contains the merge of each file category across all species considered.
This is the directory used for the Neo4j import.


# Creating and populating a Neo4j database with Ortho_KB output

The output of **Ortho_KB** can be ingested by **Neo4j** to populate a database (**Docker is required**, as Neo4j runs in a Docker container).

A small Bash script is available at [`scripts_populate_db/create_and_populate_neo4j_db.sh`](scripts_populate_db/create_and_populate_neo4j_db.sh) to **create and populate a Neo4j database** using files from the `merged/` directory, which is an output of Ortho_KB.

## Example usage

To create an instance of Ortho_KB based on the `example_data/`, you can use the results available at `example_data/results/`.
Open [`scripts_populate_db/create_and_populate_neo4j_db.sh`](scripts_populate_db/create_and_populate_neo4j_db.sh) and set the `NEO_WDIR` variable to the **absolute path** of the Neo4j directory that should store its files.
To create a database for a different dataset, update the `input` variable with the appropriate path.

Run the script from the `specifics_ortho_kb/` directory as shown below:
```
./scripts_populate_db/create_and_populate_neo4j_db.sh
```

The script will:
- **Create a Docker container** for the specified version of Neo4j.
- **Configure Neo4j** with the **neosemantics** and **APOC** plugins.
- **Perform an initial data import** using `neo4j-admin import`.
- **Download/Import ontologies** (GO, PO, TO, PECO).
- **Establish connections** between the imported data and ontologies, along with **index creation**.

⏳ *This process may take a few minutes.*

## Accessing Neo4j

By default, the created **Neo4j database** will be available at:
🔗 [http://localhost:7474/browser/](http://localhost:7474/browser/)

### **Login Credentials**
The default credentials are specified in [`scripts_populate_db/create_and_populate_neo4j_db.sh`](scripts_populate_db/create_and_populate_neo4j_db.sh):
- **Username:** `db_username`
- **Password:** `db_pwd`

Run **Cypher queries** to explore the database content.

Note: to style the nodes in the Neo4j Browser, drag and drop the [`scripts_populate_db/style.grass`](scripts_populate_db/style.grass) file into the Neo4j Browser window.