The OLGENOME project is organized into Work Packages (WPs), for which the following activities are to be performed:

>> WP1: Project coordination

The research activities of the project are coordinated by a Scientific Committee, chaired by a Coordinator and made up of the WP managers. The Coordinator, in collaboration with the manager of each WP, ensures cooperation among the operating units so that the set goals are reached. In line with the expected results stated in the project proposal, the Coordinator organises and coordinates progress meetings to foster cooperation among project partners and to share the results and the new knowledge achieved during the project.

>> WP2: Genome sequencing and assembly

The whole genome sequence will be obtained through a combination of the “BAC by BAC” and “Whole Genome Shotgun” (WGS) approaches. Hierarchical sequencing of more than 100 BAC pools is necessary to obtain the sequence of each of the two haplotypes found in the highly heterozygous diploid genome of cv. Leccino. Each BAC pool will be sequenced on an Illumina system at a minimum coverage of 50x per pool, and its reads will be assembled into haplotype-specific assemblies.
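The 50x coverage target translates into a sequencing depth per pool that can be estimated with simple arithmetic. The following is a minimal sketch only: the pool composition (96 BACs of 120 kb each) and the 2 x 150 bp read layout are hypothetical values chosen for illustration and are not stated in the project description.

```python
import math

def read_pairs_for_coverage(target_bp, coverage, read_length, paired=True):
    """Return the number of reads (or read pairs) needed to reach a given coverage.

    coverage = (sequenced bases) / (target size), so the bases required
    are target_bp * coverage.
    """
    bases_needed = target_bp * coverage
    bases_per_fragment = read_length * (2 if paired else 1)
    return math.ceil(bases_needed / bases_per_fragment)

# Hypothetical pool: 96 BAC clones of 120 kb each (assumption, not from the project).
pool_size_bp = 96 * 120_000

# Hypothetical Illumina layout: paired-end reads of 150 bp (assumption).
pairs = read_pairs_for_coverage(pool_size_bp, coverage=50, read_length=150)
print(pairs)  # 1920000 read pairs for 50x under these assumptions
```

Under different pool sizes or read lengths the figure changes proportionally, so the function is best read as a planning aid rather than as the project's actual sequencing budget.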

In order to reconstruct its structure, the whole genome will also be sequenced with a WGS approach based on third-generation sequencing, which produces reads that can exceed several tens of thousands of bp in length. In the past, the high heterozygosity and the large number of repeated regions in the olive tree genome made it very difficult to reach satisfactory results, because the NGS technologies used at the time produced reads only a few hundred bp long.

Subsequently, the reads will be assembled with the FALCON tool, making it possible to reconstruct the two olive tree haplotypes separately. This is an essential step, since it has previously been verified that the two haplotypes differ markedly, with a polymorphism rate far higher than in ordinary eukaryotic organisms or plants.

>> WP3: Anchoring the genome to a genetic map

The genome will be oriented and anchored to the 23 chromosomes through the combined use of Hi-C library sequencing and a previously built saturated genetic map.
A Hi-C library will be constructed through restriction enzyme digestion, addition of a marker (biotin) to identify the interacting regions, ligation of the free ends, purification, fragmentation and selection of the biotin-bound DNA fragments, which will then be sequenced with Illumina technology to create an intra-chromosomal contact map.

The subsequent bioinformatic analysis will make it possible not only to remove low-quality reads and chimeric sequences that only appear to represent contiguous DNA fragments, but also to align the resulting WGS reads to the assembled genome, in order to identify the interacting regions and to quantify the interactions on the basis of the number of sequences aligning to the same regions. The highest number of interactions is expected between regions that lie close to each other on the chromosome, and this evidence will provide useful data to orient and order the assembled sequences coherently.
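The idea of quantifying interactions by counting aligned sequences can be illustrated with a toy example. This is a minimal sketch, not the project's pipeline: it assumes that each read pair has already been aligned and reduced to a pair of (contig, position) tuples, and the contig names and coordinates are invented for illustration.

```python
from collections import Counter

def count_links(read_pairs):
    """Count read pairs that connect two different contigs.

    read_pairs: iterable of ((contig_a, pos_a), (contig_b, pos_b)).
    Pairs falling within a single contig carry no ordering information
    between contigs and are skipped.
    """
    links = Counter()
    for (ctg_a, _pos_a), (ctg_b, _pos_b) in read_pairs:
        if ctg_a == ctg_b:
            continue
        # Sort the names so that (a, b) and (b, a) are counted together.
        links[tuple(sorted((ctg_a, ctg_b)))] += 1
    return links

# Hypothetical aligned read pairs (invented data).
toy_pairs = [
    (("ctg1", 1200), ("ctg2", 300)),
    (("ctg2", 800), ("ctg1", 4100)),
    (("ctg1", 900), ("ctg2", 100)),
    (("ctg1", 500), ("ctg3", 700)),
    (("ctg2", 9000), ("ctg3", 200)),
    (("ctg3", 50), ("ctg2", 9500)),
    (("ctg1", 10), ("ctg1", 2000)),  # intra-contig pair, ignored
]

links = count_links(toy_pairs)
for pair, n in links.most_common():
    print(pair, n)
```

In this toy data set ctg1 and ctg2 share the most links, followed by ctg2 and ctg3, which is consistent with the order ctg1, ctg2, ctg3 along a chromosome. Real scaffolding tools also normalise for contig length and restriction site density, which this sketch deliberately leaves out.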

Moreover, a genetic map will be created using the offspring of a cross between the variety whose genome is to be sequenced (cv. Leccino) and cv. Frantoio. The resulting population will consist of 180 seedlings. All the offspring will be genotyped through GBS (Genotyping-By-Sequencing) to identify and map up to 15,000 SNP markers, an amount potentially sufficient to allow the developed SNPs to be anchored to the genomic scaffolds.
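How mapped SNPs can place a scaffold on the genetic map may be sketched as follows. This is an illustrative outline under stated assumptions, not the method adopted by the project: each marker is assumed to be described by a hypothetical tuple (scaffold, physical position, linkage group, genetic position in cM), and all values below are invented.

```python
from collections import Counter, defaultdict

def anchor_scaffolds(markers):
    """Assign each scaffold to a linkage group, a map position and an orientation.

    markers: iterable of (scaffold, physical_pos_bp, linkage_group, genetic_pos_cm).
    Returns {scaffold: (linkage_group, mean_cm, orientation)} where orientation is
    "+", "-" or "?" (undetermined, e.g. a single marker).
    """
    by_scaffold = defaultdict(list)
    for scaffold, pos, lg, cm in markers:
        by_scaffold[scaffold].append((pos, lg, cm))

    placements = {}
    for scaffold, hits in by_scaffold.items():
        # Majority vote: the linkage group supported by most markers wins.
        lg, _count = Counter(h[1] for h in hits).most_common(1)[0]
        agreeing = [(pos, cm) for pos, group, cm in hits if group == lg]
        mean_pos = sum(p for p, _ in agreeing) / len(agreeing)
        mean_cm = sum(c for _, c in agreeing) / len(agreeing)
        # Sign of the covariance between physical and genetic positions
        # tells whether the scaffold runs along or against the map.
        cov = sum((p - mean_pos) * (c - mean_cm) for p, c in agreeing)
        orientation = "+" if cov > 0 else "-" if cov < 0 else "?"
        placements[scaffold] = (lg, mean_cm, orientation)
    return placements

# Hypothetical SNP markers (invented data).
toy_markers = [
    ("scf1", 1_000, "LG1", 2.0),
    ("scf1", 50_000, "LG1", 5.5),
    ("scf1", 90_000, "LG1", 8.0),
    ("scf2", 2_000, "LG1", 20.0),
    ("scf2", 70_000, "LG1", 14.0),
    ("scf2", 30_000, "LG3", 1.0),  # conflicting marker, outvoted
    ("scf3", 500, "LG2", 3.0),     # single marker: orientation unknown
]

placements = anchor_scaffolds(toy_markers)
for name, placement in sorted(placements.items()):
    print(name, placement)
```

The example shows why marker density matters: a scaffold carrying a single SNP can be assigned to a linkage group but not oriented, which is one reason a large number of markers is desirable.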

>> WP4: Gene annotation

In order to assign a putative role to each nucleotide in the genome sequence, a platform devoted to gene structure prediction on the assembled and oriented genome will be developed. Gene structure prediction will be based on an ‘ab initio’ approach, relying only on the assembled genome and on rules describing both general gene structures and olive tree-specific ones, by means of tools such as Augustus and Glimmer. Annotation data from previously studied genomes will then be integrated, using gene homology to identify the ends of coding regions. Lastly, olive tree-specific transcriptomic data, such as EST collections found in public databases or RNA-Seq reads produced during the project itself, will be used to provide further evidence for identifying olive tree-specific genes.

Moreover, the predicted genes will be functionally annotated with the BLAST2GO tool, which makes it possible to assign them a biological function after alignment to a database of known genes, to attribute Gene Ontology terms to them and, where possible, to place them in the wider context of biological pathways.

Automatic gene prediction and annotation are essential and very robust starting points, but their results need to be integrated with further information and revised by researchers who are experts in their specific fields of interest. Corrections and additions will be made to the different database fields, editing where necessary data such as the exact start and end of a gene or an exon, or the association with a specific Gene Ontology term. Moreover, more specific data about newly found polymorphisms could be added, and the functional description of a gene may be updated on the basis of experimental data and/or previously acquired knowledge.

>> WP5: De-novo transcriptome analysis

During the project, a reference transcriptome for cv. Leccino will also be built, with the aim of optimising genome assembly and annotation and of supporting the identification of candidate genes for key functions involved in the expression of traits of interest. The full-length cDNAs, ESTs and RNA-seq reads obtained will help genome annotation, since they provide experimental evidence of expressed sequences which, once properly processed and aligned to the genome, will make it possible to identify the protein-coding regions. RNA obtained from a wide variety of tissues and organs at different stages of plant development will be used. Subsequently, a set of Illumina sequencing libraries will be produced. Sequencing will aim to obtain a sufficient number of paired-end reads to facilitate the subsequent assembly of transcription units. The data obtained will help organise the information resulting from sequence assembly and annotation.
Moreover, RNA sequencing will be used to identify new gene functions regulating relevant agronomic traits involved in the mechanisms behind plant productivity and production quality. Tissue samples produced under controlled conditions (variation of environmental parameters and use of elicitors) will be analysed: RNA will be extracted and sequenced with Illumina technology, and the resulting data will be mapped onto the reference transcriptome and onto the draft genome.