# Ortho_KB pipeline
Baptiste Imbert's avatar
Baptiste Imbert committed

Ortho_KB is a framework built on a Nextflow pipeline that creates the files needed to populate a Neo4j database.
The OrthoLegKB instance is built using Ortho_KB.

The Ortho_KB pipeline creates files of nodes and relationships for Neo4j, following the model below:

## Requirements
- The `nextflow.config` file must be adapted to your needs. Another config file can be created using `nextflow.config` as a template.
- Paths to the required files must be given as command-line arguments or set in the config file.

## Input

### species_info
File describing the species genomes (TSV format). The column header shown in the example is mandatory.
If a field is unknown, replace it with "NA" or similar.

The file format for the `species_info` parameter is shown below.

| id | genus | species | accession | ncbi_taxon_id | assembly_v | annotation_v | source |
|---|---|---|---|---|---|---|---|
| lcul | Lens | L. culinaris | CDC Redberry | 3864 | 2.0 | 2.0 | 10.1101/2021.07.23.453237 |
| mtru | Medicago | M. truncatula | A17 | 3880 | 5.0 | 5.1.7 | 10.1038/s41477-018-0286-7 |
| psat | Pisum | P. sativum | Cameor | 3888 | a | c | 10.1038/s41588-019-0480-1 |
| vrad | Vigna | V. radiata | GCF_000741045.1 | 3916 | Vradiata_ver6 | 101 | 10.1002/tpg2.20121 |
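
A quick sanity check of a `species_info` file before launching the pipeline can catch header typos early. The sketch below is not part of Ortho_KB: the function name `check_species_info` is hypothetical, and it assumes the pipeline expects the columns in exactly the order shown in the example and that the file is tab-separated.

```python
import csv
import io

# Column names copied from the example header above (order assumed to matter).
SPECIES_INFO_COLUMNS = [
    "id", "genus", "species", "accession",
    "ncbi_taxon_id", "assembly_v", "annotation_v", "source",
]


def check_species_info(handle):
    """Return a list of problems found in a species_info TSV (empty if none)."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader, None)
    if header != SPECIES_INFO_COLUMNS:
        return [f"unexpected header: {header}"]
    problems = []
    for line_no, row in enumerate(reader, start=2):
        if len(row) != len(header):
            problems.append(
                f"line {line_no}: expected {len(header)} fields, got {len(row)}"
            )
        elif any(field.strip() == "" for field in row):
            # The README asks for "NA" (or similar) instead of an empty field.
            problems.append(f"line {line_no}: empty field (use NA for unknown values)")
    return problems


# Usage with an in-memory file; replace with open("species_info.tsv", newline="").
sample = io.StringIO(
    "\t".join(SPECIES_INFO_COLUMNS) + "\n"
    "psat\tPisum\tP. sativum\tCameor\t3888\ta\tc\t10.1038/s41588-019-0480-1\n"
)
print(check_species_info(sample))
```

An empty list means the header matches and every row has the expected number of non-empty fields.
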

### species_files
File listing the paths to all genome files (TSV format). The column header shown in the example is mandatory.


| ID | gff3 | protein | converted_chr_names | fai |
|---|---|---|---|---|
| lcul | input/genome/lcul_annotation.gff3 | input/proteins/lcul.fa | input/genome/lcul_match.csv | input/genome/lcul_genome.fa.fai |
| mtrun | input/genome/mtrun_annotation.gff3 | input/proteins/mtrun.fa | input/genome/mtrun_match.csv | input/genome/mtrun_genome.fa.fai |
| psat | input/genome/psat_annotation.gff3 | input/proteins/psat.fa | input/genome/psat_match.csv | input/genome/psat_genome.fa.fai |
| vfab | input/genome/vfab_annotation.gff3 | input/proteins/vfab.fa | input/genome/vfab_match.csv | input/genome/vfab_genome.fa.fai |
| vrad | input/genome/vrad_annotation.gff3 | input/proteins/vrad.fa | input/genome/vrad_match.csv | input/genome/vrad_genome.fa.fai |

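
Because `species_files` only lists paths, a missing file tends to surface late in a run. The following sketch (a hypothetical helper, not shipped with Ortho_KB) reports every listed path that does not exist. It assumes a tab-separated file whose first column is `ID`, and that relative paths are resolved against the directory given as `base_dir`.

```python
import csv
import io
from pathlib import Path


def missing_paths(handle, base_dir=".", id_column="ID"):
    """Return (id, column, path) for each listed path that does not exist."""
    problems = []
    for row in csv.DictReader(handle, delimiter="\t"):
        for column, value in row.items():
            # Skip the ID column, surplus fields (key None) and empty cells.
            if column is None or column == id_column or not value:
                continue
            if not (Path(base_dir) / value).exists():
                problems.append((row[id_column], column, value))
    return problems


# Usage with an in-memory listing; replace with open("species_files.tsv", newline="").
listing = io.StringIO(
    "ID\tgff3\tprotein\tconverted_chr_names\tfai\n"
    "lcul\tinput/genome/lcul_annotation.gff3\tinput/proteins/lcul.fa\t"
    "input/genome/lcul_match.csv\tinput/genome/lcul_genome.fa.fai\n"
)
for species_id, column, path in missing_paths(listing):
    print(f"{species_id}: {column} not found at {path}")
```

The same check applies to any listing that shares this layout, since `Path.exists()` accepts both files and directories.
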
### annotation_files
File listing the paths to all annotation files (TSV format). The column header shown in the example is mandatory.

The file format for the `annotation_files` parameter is shown below.

| ID | eggnog | mapman | interproscan | trapid_go | trapid_gene_family | trapid_rna_family |
|---|---|---|---|---|---|---|
| lcul | input/eggnog/lcul.tsv | input/mapman/lcul.txt | input/interproscan/lcul.tsv | input/trapid/go/ | input/trapid/gene_family/ | input/trapid/rna_family/ |
| mtrun | input/eggnog/mtrun.tsv | input/mapman/mtrun.txt | input/interproscan/mtrun.tsv | input/trapid/go/ | input/trapid/gene_family/ | input/trapid/rna_family/ |
| psat | input/eggnog/psat.tsv | input/mapman/psat.txt | input/interproscan/psat.tsv | input/trapid/go/ | input/trapid/gene_family/ | input/trapid/rna_family/ |
| vfab | input/eggnog/vfab.tsv | input/mapman/vfab.txt | input/interproscan/vfab.tsv | input/trapid/go/ | input/trapid/gene_family/ | input/trapid/rna_family/ |
| vrad | input/eggnog/vrad.tsv | input/mapman/vrad.txt | input/interproscan/vrad.tsv | input/trapid/go/ | input/trapid/gene_family/ | input/trapid/rna_family/ |

### transcriptomics files
Transcriptomics files are obtained from nf-core/fetchngs and nf-core/rnaseq.
The file format for the `transcriptomic_files` parameter is shown below.

| ID | bioinfo_protocol | dataset | counts | tpm | metadata | samples_annotation |
|---|---|---|---|---|---|---|
| henriet2019_PRJNA517587 | input/transcriptomics/bioinfo_protocol.csv | input/transcriptomics/datasets.csv | input/transcriptomics/henriet2019_PRJNA517587/salmon.merged.gene_counts.tsv | input/transcriptomics/henriet2019_PRJNA517587/salmon.merged.gene_tpm.tsv | input/transcriptomics/henriet2019_PRJNA517587/samplesheet.csv | input/transcriptomics/henriet2019_PRJNA517587/samples_annotation.csv |
| morgil2019_PRJNA474098 | input/transcriptomics/bioinfo_protocol.csv | input/transcriptomics/datasets.csv | input/transcriptomics/morgil2019_PRJNA474098/salmon.merged.gene_counts.tsv | input/transcriptomics/morgil2019_PRJNA474098/salmon.merged.gene_tpm.tsv | input/transcriptomics/morgil2019_PRJNA474098/samplesheet.csv | input/transcriptomics/morgil2019_PRJNA474098/samples_annotation.csv |
| bahrman2019_PRJNA543764 | input/transcriptomics/bioinfo_protocol.csv | input/transcriptomics/datasets.csv | input/transcriptomics/bahrman2019_PRJNA543764/salmon.merged.gene_counts.tsv | input/transcriptomics/bahrman2019_PRJNA543764/salmon.merged.gene_tpm.tsv | input/transcriptomics/bahrman2019_PRJNA543764/samplesheet.csv | input/transcriptomics/bahrman2019_PRJNA543764/samples_annotation.csv |
| wu2020_PRJNA611089 | input/transcriptomics/bioinfo_protocol.csv | input/transcriptomics/datasets.csv | input/transcriptomics/wu2020_PRJNA611089/salmon.merged.gene_counts.tsv | input/transcriptomics/wu2020_PRJNA611089/salmon.merged.gene_tpm.tsv | input/transcriptomics/wu2020_PRJNA611089/samplesheet.csv | input/transcriptomics/wu2020_PRJNA611089/samples_annotation.csv |

The dataset file (CSV) allows the following columns:
The dataset_id is used to link Condition nodes to Dataset nodes.

### genetic files
The file format for the `qtl_files` parameter is shown below.

| ID | status | species | qtl | qtl_annotation |
|---|---|---|---|---|
| psat_public | public | psat | input/qtl/psat_qtl.csv | input/qtl/psat_qtl_annotation.csv |
| vfab_public | public | vfab | input/qtl/vfab_qtl.csv | input/qtl/vfab_qtl_annotation.csv |
| lcul_public | public | lcul | input/qtl/lcul_qtl.csv | input/qtl/lcul_qtl_annotation.csv |

The status must be either "private" or "public"; it determines which labels are added to the nodes of the dataset.
The "species" column must match an ID in the `species_files` file. This makes it possible to provide multiple datasets for the same species.
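
The two rules above (allowed status values, species matching a `species_files` ID) can be checked before a run. This is a minimal sketch under assumptions: both files are tab-separated with the headers shown in the examples, and the helper names `read_ids` and `check_qtl_listing` are hypothetical, not part of Ortho_KB.

```python
import csv
import io

ALLOWED_STATUS = {"private", "public"}


def read_ids(species_files_handle):
    """Collect the ID column of a species_files TSV."""
    reader = csv.DictReader(species_files_handle, delimiter="\t")
    return {(row.get("ID") or "").strip() for row in reader}


def check_qtl_listing(qtl_handle, species_ids):
    """Return a list of rule violations found in a qtl_files TSV."""
    problems = []
    for row in csv.DictReader(qtl_handle, delimiter="\t"):
        dataset = (row.get("ID") or "").strip()
        status = (row.get("status") or "").strip()
        species = (row.get("species") or "").strip()
        if status not in ALLOWED_STATUS:
            problems.append(f"{dataset}: status '{status}' is not 'private' or 'public'")
        if species not in species_ids:
            problems.append(f"{dataset}: species '{species}' is not an ID in species_files")
    return problems


# Usage with in-memory files; replace with open(..., newline="") on real files.
species_files = io.StringIO("ID\tgff3\npsat\tinput/genome/psat_annotation.gff3\n")
qtl_files = io.StringIO(
    "ID\tstatus\tspecies\tqtl\tqtl_annotation\n"
    "psat_public\tpublic\tpsat\tinput/qtl/psat_qtl.csv\tinput/qtl/psat_qtl_annotation.csv\n"
)
print(check_qtl_listing(qtl_files, read_ids(species_files)))
```

Values are stripped before comparison so that stray spaces around a cell do not trigger false alarms.
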

The recognized columns for QTL data (in CSV) are the following:
The following columns are mandatory:

- `species`
- `qtl_id`
- `trait`
- `location`
- `year`
- `chromosome`
- `population`
- `population_type` (either `Biparental` or `DiversityPanel`)

For the `Biparental` population_type, the following columns are also mandatory:

- `leftmarker_start`
- `leftmarker_end`
- `rightmarker_start`
- `rightmarker_end`

For the `DiversityPanel` population_type, the following columns are also mandatory:

- `peakmarker_start`
- `peakmarker_end`
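
The column rules above depend on `population_type`, which is easy to get wrong in a hand-edited CSV. The sketch below checks them row by row. It is illustrative only: `check_qtl_rows` is a hypothetical name, and it interprets "mandatory" as "the column exists and the cell is non-empty for that row", which the pipeline may or may not enforce in the same way.

```python
import csv
import io

COMMON_COLUMNS = [
    "species", "qtl_id", "trait", "location",
    "year", "chromosome", "population", "population_type",
]
EXTRA_COLUMNS = {
    "Biparental": [
        "leftmarker_start", "leftmarker_end",
        "rightmarker_start", "rightmarker_end",
    ],
    "DiversityPanel": ["peakmarker_start", "peakmarker_end"],
}


def check_qtl_rows(handle):
    """Return a list of problems found in a QTL CSV (empty if none)."""
    reader = csv.DictReader(handle)
    fieldnames = reader.fieldnames or []
    missing = [name for name in COMMON_COLUMNS if name not in fieldnames]
    if missing:
        return [f"missing mandatory columns: {', '.join(missing)}"]
    problems = []
    for line_no, row in enumerate(reader, start=2):
        population_type = (row["population_type"] or "").strip()
        if population_type not in EXTRA_COLUMNS:
            problems.append(f"line {line_no}: unknown population_type '{population_type}'")
            continue
        for name in EXTRA_COLUMNS[population_type]:
            # A column that is absent or empty counts as missing for this row.
            if not (row.get(name) or "").strip():
                problems.append(f"line {line_no}: {name} is required for {population_type}")
    return problems


# Usage with an in-memory file; replace with open("psat_qtl.csv", newline="").
header = COMMON_COLUMNS + EXTRA_COLUMNS["DiversityPanel"]
sample = io.StringIO(",".join(header) + "\n")
print(check_qtl_rows(sample))
```

Checking the type-specific columns per row lets a single file mix `Biparental` and `DiversityPanel` entries.
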

### synteny and orthology files
Two files from OrthoFinder are required: `Log.txt` (`OF_log` parameter, used to annotate nodes) and `N0.tsv` (`OF_N0` parameter, containing the orthogroups and their associated genes).
Two files from MCScanX are required: the `.collinearity` file (`mcscanx_collinearity` parameter) and the tandem file (`mcscanx_tandem` parameter).
Both sets of files can be generated using the *specifics_MCScanX* pipeline.

### parameters
Check the `nextflow.config` file for additional parameters.

## Running the pipeline

    nextflow run  \
    -c conf/example_data.config \
    --outdir results/example_data

## Output
(last update: 2023/03/21)
If `publish` is set to `true`, the pipeline creates the following directories:
- *conf_files*: copy of the config files, for input tracing.
- *eggnog_annotation*: annotation of each proteome by the EggNOG database.
- *nodes_gene*: one CSV file per species containing gene nodes + 1 header.
- *nodes_gene_family*: one CSV file per species containing gene family nodes + 1 header. TODO: concatenate all of them (sort and uniq); for now this is done manually.
- *nodes_mapman*: one RDF file containing the MapMan ontology for n10s import (using `n10s.rdf.import.fetch()`).
- *nodes_orthogroup*: one CSV file containing all orthogroup nodes + 1 header.
- *nodes_protein*: one CSV file per species containing protein nodes + 1 header.
- *nodes_protein_annotation*: one CSV file per species containing protein annotation nodes from InterProScan + 1 header.
- *nodes_rna*: one CSV file per species containing RNA nodes + 1 header.
- *nodes_rna_family*: one CSV file per species containing RNA family nodes + 1 header. TODO: concatenate all of them (sort and uniq); for now this is done manually.
- *nodes_synteny*: one CSV file containing all synteny nodes + 1 header.
- *nodes_synteny_experiment*: one CSV file containing all synteny experiment nodes + 1 header.
- *edges_gene_gene_family*: one CSV file per species containing edges from gene to gene family + 1 header.
- *edges_gene_orthogroup*: one CSV file containing edges from gene to orthogroup + 1 header.
- *edges_gene_rna*: one CSV file per species containing edges from gene to RNA + 1 header.
- *edges_gene_synteny*: one CSV file containing edges from gene to synteny + 1 header.
- *edges_protein_protein_annotation*: one CSV file per species containing edges from protein to protein annotation + 1 header.
- *edges_rna_mapman*: one CSV file per species containing edges from RNA to MapMan term + 1 header.
- *edges_rna_protein*: one CSV file per species containing edges from RNA to protein + 1 header.
- *edges_rna_rna_family*: one CSV file per species containing edges from RNA to RNA family + 1 header.
- *edges_synteny_synteny_experiment*: one CSV file containing edges from synteny to synteny experiment + 1 header.

The `merged/` directory contains, for each file category, the files of all species merged together.
This is the directory used for the Neo4j import.

# Creating and populating a Neo4j database with Ortho_KB output

The output of **Ortho_KB** can be ingested by **Neo4j** to populate a database (**Docker is required**, as Neo4j runs in a Docker container).

A small Bash script is available in [`scripts_populate_db/`](scripts_populate_db/) to **create and populate a Neo4j database** using the files from the `merged/` directory produced by Ortho_KB.

## Example usage

To create an instance of Ortho_KB based on the `example_data/`, you can use the results available at `example_data/results/`.
Open the script in [`scripts_populate_db/`](scripts_populate_db/) and set the `NEO_WDIR` variable to the **absolute path** of the directory where Neo4j should store its files.
To create a database for a different dataset, update the `input` variable with the appropriate path.

Run the script in specifics_ortho_kb/ as described below:

The script will:
- **Create a Docker container** for the specified version of Neo4j.
- **Configure Neo4j** with the **neosemantics** and **APOC** plugins.
- **Perform an initial data import** using `neo4j-admin import`.
- **Download/Import ontologies** (GO, PO, TO, PECO).
- **Establish connections** between the imported data and ontologies, along with **index creation**.

*This process may take a few minutes.*

## Accessing Neo4j

By default, the created **Neo4j database** will be available at:
🔗 [http://localhost:7474/browser/](http://localhost:7474/browser/)

### **Login Credentials**
The default credentials are specified in the script in [`scripts_populate_db/`](scripts_populate_db/):
- **Username:** `db_username`
- **Password:** `db_pwd`

Run **Cypher queries** to explore the database content.
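
Besides the Browser, queries can also be sent over HTTP. The sketch below counts nodes per label, which needs no knowledge of the schema. It rests on assumptions not stated in this README: that the container runs Neo4j 4.x or later (where the transactional endpoint is `/db/<database>/tx/commit`), that the database is named `neo4j`, and that the default credentials above are unchanged. `build_cypher_request` and `run_cypher` are hypothetical helper names.

```python
import base64
import json
import urllib.request

# Schema-agnostic query: number of nodes for each label combination.
COUNT_BY_LABEL = (
    "MATCH (n) RETURN labels(n) AS labels, count(*) AS nodes ORDER BY nodes DESC"
)


def build_cypher_request(statement, user, password,
                         base_url="http://localhost:7474", database="neo4j"):
    """Build a POST request for the Neo4j HTTP transactional endpoint."""
    payload = json.dumps({"statements": [{"statement": statement}]}).encode("utf-8")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        f"{base_url}/db/{database}/tx/commit",  # assumed Neo4j 4.x+ path
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Accept": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


def run_cypher(request):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Usage (requires the Neo4j container to be running):
# request = build_cypher_request(COUNT_BY_LABEL, "db_username", "db_pwd")
# print(run_cypher(request))
```

Separating request construction from execution keeps the network call optional and makes the payload easy to inspect.
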

Note: to style the nodes in the Neo4j Browser, drag and drop the [`scripts_populate_db/style.grass`](scripts_populate_db/style.grass) file into the Neo4j Browser window.