Glossary

%ID (BLAST Results)

Percentage of the aligned query (input) sequence which is identical to the subject (hit) sequence.

1-to-1 orthologues

A type of orthologue assigned for a pair of species where only one copy is found in each species.

1-to-many orthologues

A type of orthologue assigned for a pair of species where one gene in one species is orthologous to multiple genes in the other species.

1000 Genomes Project

The goal of the 1000 Genomes Project is to find most genetic variants that have frequencies of at least 1% in the populations studied using a light sequencing approach in many individuals. (www.1000genomes.org)

3prime overlapping_ncrna

Gene where ditag and/or published experimental data strongly support the existence of short non-coding transcripts transcribed from the 3'UTR.

Accession number

A unique identifier given to a sequence when it is submitted to one of the DNA repositories (GenBank, EMBL, DDBJ).

agp file (A Golden Path)

A file provided to Ensembl that describes how the longer sequences in the genome assembly were assembled from shorter sequences. For example, an AGP file can describe how a chromosome is assembled from a collection of scaffolds or a collection of contigs. For an AGP file that describes how a scaffold is assembled from a collection of contigs, each contig will be listed on a separate line in the AGP file and the line will include information about where the contig lies within the scaffold and the orientation of the contig.
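As a rough illustration of the contig lines described above, the sketch below parses one tab-separated AGP component line. The nine-column layout follows the standard AGP format; the sample line and the names `AgpComponent` and `parse_agp_line` are invented for this example and are not part of any Ensembl tool.

```python
from typing import NamedTuple


class AgpComponent(NamedTuple):
    """One non-gap line of an AGP file (hypothetical helper type)."""
    obj: str          # the assembled object, e.g. a scaffold
    obj_start: int    # where the component starts within the object (1-based)
    obj_end: int      # where the component ends within the object
    part_number: int  # line counter within the object
    comp_type: str    # e.g. "W" for a WGS contig; "N" and "U" mark gaps
    comp_id: str      # identifier of the contig
    comp_start: int   # first base of the contig that is used
    comp_end: int     # last base of the contig that is used
    orientation: str  # "+" or "-"


def parse_agp_line(line: str) -> AgpComponent:
    """Parse a tab-separated AGP component line; gap lines are rejected."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated columns")
    if fields[4] in ("N", "U"):
        raise ValueError("gap lines use a different column layout")
    return AgpComponent(
        fields[0], int(fields[1]), int(fields[2]), int(fields[3]),
        fields[4], fields[5], int(fields[6]), int(fields[7]), fields[8],
    )


# Invented example: contig_7 fills bases 1-5000 of scaffold_1, reversed.
example = parse_agp_line("scaffold_1\t1\t5000\t1\tW\tcontig_7\t1\t5000\t-")
print(example.comp_id, example.obj_start, example.obj_end, example.orientation)
```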

Algorithm

A sequence of computational tasks or actions that carry out a specific function.

Alignment

A comparison between two or more sequences by matching identical and/or similar residues and assigning a score to the match.

Allele

An allele is an alternative form of a nucleotide sequence, a gene or a locus in the genome. The term was originally used to describe variation among protein coding genes, but it also refers to variation among non-coding genes or DNA sequences.

Alternative sequence

Genomic sequence that differs from the genomic DNA on the primary assembly. Alternative sequences come in two types: allelic sequences (haplotypes and novel patches) and fix patches. Novel patches represent new allelic loci, but they are not necessarily haplotypes. Fix patches cover regions where the primary assembly was found to be incorrect, and the patch reflects the corrected sequence. Haplotypes, novel patches and fix patches are all determined by the Genome Reference Consortium (GRC). You can read the GRC definitions here. When using the API, the primary assembly is referred to as the reference sequence and an alternative sequence is referred to as a non-reference sequence.

Alu

A dispersed, intermediately repetitive DNA sequence found in the human genome in about one million copies. The sequence is about 300 bp long and is commonly found in introns, 3' untranslated regions of genes, and intergenic genomic regions. The name Alu comes from the recognition site for the AluI endonuclease that cleaves it. The Alu universal primer sequence is as follows: 5'-GTG GAT CAC CTG AGG TCA GGA GTT TC-3' (26-mer). In Ensembl, Alu repeats are included in the Type I Transposons/SINE track in the Region in detail page.

Ambiguity code

The standard ambiguity codes are provided by IUPAC (International Union of Pure and Applied Chemistry) and indicate the possible nucleotides or amino acids that can occur at a given position.

Nucleotides:

The symbols are valid for both DNA and RNA:

A = adenine

C = cytosine

G = guanine

T = thymine

R = G A (purine)

Y = T C (pyrimidine)

K = G T (keto)

M = A C (amino)

S = G C (strong bonds)

W = A T (weak bonds)

B = G T C (all but A)

D = G A T (all but C)

H = A C T (all but G)

V = G C A (all but T)

N = A G C T (any)
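The nucleotide codes above can be treated as a lookup table. The following is a minimal sketch, with the dictionary transcribed from the list above; the helper names `matches` and `expand` are illustrative only.

```python
from itertools import product

# IUPAC nucleotide ambiguity codes, transcribed from the list above.
IUPAC_NUCLEOTIDES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "GA", "Y": "TC", "K": "GT", "M": "AC",
    "S": "GC", "W": "AT",
    "B": "GTC", "D": "GAT", "H": "ACT", "V": "GCA",
    "N": "AGCT",
}


def matches(code: str, base: str) -> bool:
    """Return True if the concrete base is allowed by the ambiguity code."""
    return base.upper() in IUPAC_NUCLEOTIDES[code.upper()]


def expand(pattern: str) -> list:
    """List every concrete sequence that an ambiguous pattern can stand for."""
    choices = [IUPAC_NUCLEOTIDES[symbol] for symbol in pattern.upper()]
    return ["".join(combo) for combo in product(*choices)]


print(matches("R", "G"))     # R is a purine code, so G is allowed
print(sorted(expand("AR")))  # the two sequences covered by "AR"
```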

 

Amino Acids

A = Alanine

C = Cysteine

D = Aspartic Acid

E = Glutamic Acid

F = Phenylalanine

G = Glycine

H = Histidine

I = Isoleucine

K = Lysine

L = Leucine

M = Methionine

N = Asparagine

P = Proline

Q = Glutamine

R = Arginine

S = Serine

T = Threonine

V = Valine

W = Tryptophan

Y = Tyrosine

X = Unspecified Amino Acid

* = Stop Codon

Ambiguous ORF

Transcripts that are believed to be protein coding but have more than one possible ORF.

Antisense

Genes or transcripts that overlap any coding exon of a locus on the opposite strand. It also encompasses cases of antisense regulation of a coding gene that have been reported in the literature.

API (Application Programming Interface)

A series of routines that applications can use to make the operating system request and carry out lower-level services.

Apparent orthologue

Apparent orthologues occur when genes in different species have homology to one another, but we have reason to believe that they are not directly orthologous to each other. For example, this may occur when there has been a gene duplication but in both species we only have one gene. This may be due to gaps in the genome assemblies, gene deletions or degradation to pseudogenes. These genes are actually between species paralogues, but based on our data we cannot say for certain that the two genes are not orthologues that have significantly diverged.

APPRIS

APPRIS is a system to annotate alternatively spliced transcripts based on a range of computational methods. It provides value to the annotations of the human, mouse, zebrafish, rat, and pig genomes. (http://nar.oxfordjournals.org/content/41/D1/D110.long).

APPRIS has selected a single CDS variant for each gene as the 'PRINCIPAL' isoform. Principal isoforms are tagged with the numbers 1 to 5, with 1 being the most reliable.

PRINCIPAL:1
Transcript(s) expected to code for the main functional isoform based solely on the core modules in the APPRIS.
PRINCIPAL:2
Where the APPRIS core modules are unable to choose a clear principal variant (approximately 25% of human protein coding genes), the database chooses two or more of the CDS variants as "candidates" to be the principal variant.
PRINCIPAL:3
Where the APPRIS core modules are unable to choose a clear principal variant and more than one of the variants has a distinct CCDS identifier, APPRIS selects the variant with the lowest CCDS identifier as the principal variant. The lower the CCDS identifier, the earlier it was annotated.
PRINCIPAL:4
Where the APPRIS core modules are unable to choose a clear principal CDS and there is more than one variant with distinct (but consecutive) CCDS identifiers, APPRIS selects the longest CCDS isoform as the principal variant.
PRINCIPAL:5
Where the APPRIS core modules are unable to choose a clear principal variant and none of the candidate variants are annotated by CCDS, APPRIS selects the longest of the candidate isoforms as the principal variant.

For genes in which the APPRIS core modules are unable to choose a clear principal variant (approximately 25% of human protein coding genes), the "candidate" variants not chosen as principal are labeled in the following way:

ALTERNATIVE:1
Candidate transcript(s) models that are conserved in at least three tested species.
ALTERNATIVE:2
Candidate transcript(s) models that appear to be conserved in fewer than three tested species.

Non-candidate transcripts are not tagged and are considered as "Minor" transcripts.

Further information and additional web services can be found at the APPRIS website.

Artifact (in the context of a transcript)

Error in the sequence in a public database (for example UniProtKB, NCBI RefSeq). Annotation is by the VEGA/Havana project.

Assembly

When the genome of a species is to be sequenced, the chromosomes from many cells are broken at random positions into small fragments, which are sequenced, and reassembled into long sequences (contigs). Contigs may be assembled into longer sequences called scaffolds and sometimes, if the depth of sequencing is high enough, there may be enough information to assemble most of the scaffolds into chromosomes. The resulting collection of sequences after assembly is called a genome assembly.

Ensembl imports genome assemblies from the sources listed on the home page of each species.

ATV (A Tree Viewer)

An application (Java tool) for the visualisation of phylogenetic trees. Allows the possibility to edit and export data. See Zmasek et al.

BAC (Bacterial Artificial Chromosome)

A vector used to clone DNA fragments (100 to 300-kb insert size; average, 150 kb) from another species so that it can be replicated in bacteria.

Base pairs (number of base pairs in the genome)

The base pairs length on pages such as the whole genome display (next to the golden path length) is based on the assembled end position of the last seq_region in each chromosome (from the AGP file), or if there is a terminal gap it is set to the assembled end location of that terminal gap.

Between species paralogues (between species paralogs)

Members of the same gene family in different species that are not direct orthologues. In a gene tree, these genes are separated by a duplication node.

Biotype

A gene or transcript classification. Transcript types include protein coding, pseudogene, and long non-coding and short non-coding RNAs. For human, mouse and selected other species we incorporate manual annotation from Havana. Where a gene or transcript has been manually annotated, we use the manually assigned biotype. The full list of biotypes used by Havana is here.

 

The biotypes can be grouped into protein coding, pseudogene, long noncoding and short noncoding. Examples of biotypes in each group are as follows:

Protein coding:

  • IG_C_gene, IG_D_gene, IG_J_gene, IG_LV_gene, IG_M_gene, IG_V_gene, IG_Z_gene, nonsense_mediated_decay, nontranslating_CDS, non_stop_decay, polymorphic_pseudogene, protein_coding, TR_C_gene, TR_D_gene, TR_gene, TR_J_gene, TR_V_gene.

Pseudogene:

  • disrupted_domain, IG_C_pseudogene, IG_J_pseudogene, IG_pseudogene, IG_V_pseudogene, processed_pseudogene, pseudogene, transcribed_processed_pseudogene, transcribed_unprocessed_pseudogene, translated_processed_pseudogene, translated_unprocessed_pseudogene, TR_J_pseudogene, TR_V_pseudogene, unitary_pseudogene, unprocessed_pseudogene

Long noncoding:

  • 3prime_overlapping_ncrna, ambiguous_orf, antisense, lincRNA, ncrna_host, non_coding, processed_transcript, retained_intron, sense_intronic, sense_overlapping

Short noncoding:

  • miRNA, miRNA_pseudogene, misc_RNA, misc_RNA_pseudogene, Mt_rRNA, Mt_tRNA, Mt_tRNA_pseudogene, ncRNA, pre_miRNA, RNase_MRP_RNA, RNase_P_RNA, rRNA, rRNA_pseudogene, scRNA_pseudogene, snlRNA, snoRNA, snoRNA_pseudogene, snRNA, snRNA_pseudogene, SRP_RNA, tmRNA, tRNA, tRNA_pseudogene

Finding biotype groupings

If you see a biotype in Ensembl and are not sure which biotype group it belongs to, you can check this by connecting to the latest ensembl_production database.
For example, for Ensembl release 73, connect to the database "ensembl_production_73":

  mysql -uanonymous -P3306 -hensembldb.ensembl.org -Densembl_production_73 -e "select distinct(name),biotype_group from biotype where db_type like '%core%' and is_current=1 order by biotype_group,name;"

BLAST (Basic Local Alignment Search Tool)

A sequence comparison algorithm optimised for speed which is used to search sequence databases for optimal local alignments to a query. (Altschul et al., J Mol Biol 215:403-410; 1990)

BLAT (BLAST-Like Alignment Tool)

An mRNA/DNA and cross-species protein sequence analysis tool to quickly find sequences of 95% and greater similarity of length 40 bases or more. (Kent, W.J. 2002. BLAT -- The BLAST-Like Alignment Tool. Genome Research 12(4): 656-664)

BLOSUM 62 (Blocks Substitution Matrix)

A matrix that defines scores for amino acid substitutions, reflecting the similarity of physicochemical properties and observed substitution frequencies. The BLOSUM 62 matrix is tailored using sequences sharing no more than 62% identity (sequences that are evolutionarily closer were represented by a single sequence in the alignment, to avoid bias from using related family members). (Henikoff and Henikoff, Proc Natl Acad Sci U S A 89:10915-10919; 1992).

Canonical transcript

For human, the canonical transcript for a gene is set according to the following hierarchy:

  1. Longest CCDS translation with no stop codons.
  2. If no (1), choose the longest Ensembl/Havana merged translation with no stop codons.
  3. If no (2), choose the longest translation with no stop codons.
  4. If no translation, choose the longest non-protein-coding transcript.
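The hierarchy above can be sketched as a tiered selection. This is only an illustration under stated assumptions: the `Transcript` fields and the function `pick_canonical` are hypothetical, they do not mirror the Ensembl API, and the handling of a gene whose only translations contain stop codons is a guess, since the text does not cover that case.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Transcript:
    """Hypothetical, simplified transcript record for this sketch."""
    name: str
    length: int                        # transcript length in bp
    translation_length: Optional[int]  # None for non-coding transcripts
    has_stop_codons: bool = False      # stop codons inside the translation
    is_ccds: bool = False
    is_merged: bool = False            # Ensembl/Havana merged model


def pick_canonical(transcripts: Sequence[Transcript]) -> Optional[Transcript]:
    """Walk the four tiers in order and return the first winner found."""
    clean = [t for t in transcripts
             if t.translation_length is not None and not t.has_stop_codons]
    tiers = [
        [t for t in clean if t.is_ccds],    # step 1
        [t for t in clean if t.is_merged],  # step 2
        clean,                              # step 3
    ]
    for tier in tiers:
        if tier:
            return max(tier, key=lambda t: t.translation_length)
    # Step 4: only non-coding transcripts are considered here (assumption).
    noncoding = [t for t in transcripts if t.translation_length is None]
    if noncoding:
        return max(noncoding, key=lambda t: t.length)
    return None


gene = [
    Transcript("t1", 3000, 400, is_merged=True),
    Transcript("t2", 2500, 350, is_ccds=True),
    Transcript("t3", 4000, None),
]
print(pick_canonical(gene).name)  # the CCDS model wins despite being shorter
```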

CCDS

A coding sequence in the Consensus Coding Sequence Set is consistently annotated between Ensembl, Vega, UCSC and NCBI. The long term goal is to support convergence towards a standard set of gene annotations on the human genome.

cDNA (Complementary DNA)

DNA obtained by reverse transcription of a mRNA template. In bioinformatics jargon, cDNA is thought of as a DNA version of the mRNA sequence. Generally, cDNAs are denoted in coding or 'sense' orientation.

CDS (Coding sequence)

The portion of a gene or an mRNA that codes for a protein. Introns are not coding sequences, nor are the 5' or 3' UTR. The coding sequence in a cDNA or mature mRNA includes everything from the start codon through to the stop codon, inclusive.

Centimorgan (cM)

A unit of genetic distance, determined by how frequently two genes on the same chromosome are inherited together. One centimorgan corresponds to a 1% recombination frequency (1% recombinant offspring). In humans, 1 cM corresponds on average to about 1 x 10^6 bp.

Chr:bp

The chromosome location and coordinates in base pairs.

CIGAR (Compact Idiosyncratic Gapped Alignment Report)

The cigar line defines the sequence of matches/mismatches and deletions (or gaps). For example, the cigar line 2MD3M2D2M means that the alignment contains 2 matches/mismatches, 1 deletion (the number 1 is omitted in order to save some space), 3 matches/mismatches, 2 deletions and 2 matches/mismatches. If the original sequence is:

Original sequence: AACGCTT

The aligned sequence will be:

cigar line: 2MD3M2D2M
M M D M M M D D M M
A A - C G C - - T T
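
The expansion shown above can be reproduced in a few lines. This is a minimal sketch that handles only the M and D operations used in this entry; the function name `apply_cigar` is made up for illustration.

```python
import re


def apply_cigar(sequence: str, cigar: str) -> str:
    """Insert gap characters into a sequence according to a cigar line.

    Only M (match/mismatch) and D (deletion/gap) are supported, and a
    missing count is read as 1, as described in the entry above.
    """
    if not re.fullmatch(r"(\d*[MD])+", cigar):
        raise ValueError("unsupported cigar line: " + cigar)
    aligned = []
    position = 0
    for count, operation in re.findall(r"(\d*)([MD])", cigar):
        n = int(count) if count else 1
        if operation == "M":
            # Copy the next n residues of the original sequence.
            aligned.append(sequence[position:position + n])
            position += n
        else:
            # A deletion consumes no residues and shows up as gaps.
            aligned.append("-" * n)
    if position != len(sequence):
        raise ValueError("cigar line does not cover the whole sequence")
    return "".join(aligned)


print(apply_cigar("AACGCTT", "2MD3M2D2M"))  # AA-CGC--TT
```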

Class

The type of variation data (e.g. SNP, insertion, deletion, somatic SNV).

Clinical sign. (Clinical significance)

Clinical significance is reported to dbSNP by the submitter. Variants from OMIM may have the value probable-pathogenic. Other assignations are unknown, untested, non-pathogenic, probable-non-pathogenic, drug-response, histocompatibility, and other.

Clone

A segment of DNA that has been inserted into a vector molecule, such as a plasmid, and then replicated to form many identical copies.

CNV

Copy number variation. It is defined by SO (sequence ontology) as a variation that increases or decreases the copy number of a given region. See more details here.

Codon

Three base pairs in either DNA or RNA that code for an amino acid (or stop translation).

Constitutive exon

Exons that are not spliced out and are therefore present in all transcripts of a given gene. ENSRNOE00000218349 is a constitutive exon (CE) in rat, for example. You can search for constitutive exons using BioMart and the Core API.

Contig

A contig is a contiguous stretch of DNA sequence without gaps that has been assembled solely based on direct sequencing information.

Short sequences (reads) from a fragmented genome are compared against one another, and overlapping reads are merged to produce one long sequence. This merging process is iterative: overlapping reads are added to the merged sequence whenever possible and so the merged sequence becomes even longer. When no further reads overlap the long merged sequence, then this sequence - called a contig - has reached its maximum length.

Contig can be used in other contexts: A contig can be the sequence corresponding to only one clone. A contig map shows the regions of a chromosome where contiguous DNA segments overlap.

Contiguous gene split

Paralogues where one of them has been split partway along, but where the fragments of the gene are within 1Mb of each other on the same chromosome strand.

Coordinate system

In Ensembl, the term "coordinate system" or "coord_system" identifies which level of the assembly we are working on. A genome assembly imported into Ensembl has up to three coordinate systems defined in the coord_system table: contigs, scaffolds or chromosomes.

We define one additional coordinate system: toplevel. Toplevel sequences are tagged in the seq_region_attrib table. Most gene annotation is done on toplevel sequence.

COSMIC

The Catalogue of Somatic Mutations in Cancer contains somatic mutation information extracted from the primary literature, including details relating to human cancers. More details are here.

Cosmid

DNA from a bacterial virus spliced with a small fragment of a genome (up to 50 kb) to be amplified and sequenced.

Coverage

Refers to the number of overlapping sequences used to build a region of the assembly. High coverage indicates a good amount of sequence information while low coverage reflects a low amount of sequence information. 

Low coverage

With capillary (Sanger) sequencing methods, a 2x depth of coverage was generally considered to be low coverage. The 29 Mammals Project generated 2x genome assemblies for mammals across a broad taxonomic range. 

High coverage

With capillary sequencing methods, a 6-7x or greater depth of coverage was generally considered to be high coverage. 

Next-generation sequencing

With next-generation sequencing methods, it is common to see genome assemblies with much higher sequencing depth of coverage. 

Cytogenetic map

A banding pattern on a chromosome resulting from staining and examination by microscopy. Cytogenetic abnormalities such as deletions or inverted nucleotide sequences may be detected by examining and comparing banding patterns.

D'

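A normalised measure of linkage disequilibrium. It is derived from D, the difference between the observed and the expected frequency of a given haplotype, scaled by the maximum value D could take given the allele frequencies. If two loci are independent (i.e. in linkage equilibrium and therefore not coinherited more often than expected by chance), the D' value will be 0.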
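
As a worked illustration, the sketch below computes D' from a haplotype frequency and two allele frequencies. It uses the commonly quoted textbook normalisation (D divided by its maximum possible magnitude) and returns the absolute value; this glossary does not give a formula, so treat the function `d_prime` and its parameter names as assumptions.

```python
def d_prime(p_ab: float, p_a: float, p_b: float) -> float:
    """Return |D'| for alleles A and B at two loci.

    p_ab is the observed frequency of the haplotype carrying both A and B;
    p_a and p_b are the frequencies of A and B on their own.
    """
    # D: observed minus expected haplotype frequency.
    d = p_ab - p_a * p_b
    if d == 0:
        return 0.0  # linkage equilibrium
    if d > 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    if d_max == 0:
        raise ValueError("inconsistent frequencies")
    return abs(d) / d_max


print(d_prime(0.25, 0.5, 0.5))  # independent loci
print(d_prime(0.5, 0.5, 0.5))   # A and B always travel together
```

In real data the frequencies are estimates, so an exact comparison with zero is rarely met; a tolerance would be the more practical choice.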

DAS (Distributed Annotation System)

A protocol for requesting and returning annotation data for genomic regions. See the BioDAS site for more information.

dbSNP

The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple (short) genetic polymorphisms. This collection of polymorphisms is maintained by NCBI and includes single-base nucleotide substitutions (also known as single nucleotide polymorphisms or SNPs), small-scale multi-base deletions or insertions (also called deletion insertion polymorphisms, indels or DIPs), and retroposable element insertions and microsatellite repeat variations (also called short tandem repeats or STRs). See more details here.

dbVAR

Database of genomic structural variation (SV), such as copy number variation.

DDBJ (DNA Data Bank of Japan)

DDBJ is the sole DNA data bank in Japan, which is officially certified to collect DNA sequences from researchers and to issue the internationally recognized accession number to data submitters. Data is exchanged with EMBL/EBI and GenBank/NCBI on a daily basis, and the three data banks share virtually the same data at any given time.

DGV

The Database of Genomic Variants provides a curated catalogue of structural variation (SV) in the human genome.

DGVa

The Database of Genomic Variants archive is a repository that provides archiving, accessioning and distribution of publicly available genomic structural variants, in all species.

Disrupted domain (in the context of a transcript)

Coding region omitted due to a splice variation. Annotation is by the VEGA/Havana project.

Domain

A region of special biological interest within a single protein sequence. However, a domain may also be defined as a region within the three-dimensional structure of a protein that may encompass regions of several distinct protein sequences that accomplishes a specific function. A domain class is a group of domains that share a common set of well-defined properties or characteristics.

Dotter

Ensembl DotterView is based on the program Dotter, a dot-matrix program with dynamic threshold control suited for genomic DNA and protein sequence analysis. The Dotter tool provides a visual display of the sequence alignment it represents. The dotplot displays a detailed comparison of two sequences: every residue in one sequence is compared to every residue in the other sequence. The first sequence runs along the x-axis and the second sequence along the y-axis. In regions where the two sequences are similar to each other, a row of high scores will run diagonally across the dot matrix. If you're comparing a sequence against itself to find internal repeats, you'll notice that the main diagonal scores maximally, since it's the 100% perfect self-match. To make the score matrix more intelligible, the pairwise scores are averaged over a sliding window that runs diagonally. The averaged score matrix forms a three-dimensional landscape, with the two sequences in two dimensions and the height of the peaks in the third. This landscape is projected onto two dimensions with the aid of grayscales: higher peaks are indicated by darker grays. Dotter was written by Erik L.L. Sonnhammer and Richard Durbin (Gene 167: GC1-10; 1995).

DUST

A standalone application that looks for low complexity sequences.

DWGA (Derived from Whole Genome Alignments)

Human versus Chimpanzee exception: The human versus chimpanzee orthologue predictions were obtained in a completely different manner. Since the current chimpanzee genome sequence assembly is the result of low-coverage sequencing, the assembled sequence is too poor in quality to generate a gene set with the classical Ensembl genebuild pipeline. Instead, the chimpanzee gene set produced by Ensembl was generated by "projecting" human genes onto the chimpanzee genome through whole genome BLASTz alignments between both species and filtering for orthologous sequence alignments. The result of this procedure is de facto the human - chimpanzee orthologue set that has been Derived from Whole Genome Alignments (DWGA). See the Prediction Method section on a relevant Ensembl Gene Report page.

E-val (BLAST Results)

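The expected number of alignments with a score at least as good as this one that would occur by chance in the database searched. The lower the E-value, the less likely it is that the alignment between the query (input) sequence and subject (hit) sequence is due to chance.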

EGA

The European Genome-phenome Archive repository allows the exploration of datasets from numerous genotype experiments, supplied by a range of data providers. See more details here.

EMBL (European Molecular Biology Laboratory)

Europe's primary nucleotide sequence resource. The main sources of the DNA and RNA sequences in the database are submissions from individual researchers, genome sequencing projects and patent applications.

ENCODE (ENCyclopedia Of DNA Elements)

The ENCODE project uses defined regions of the Human genome to test and evaluate different methods and technologies for finding various functional elements in Human DNA. The two main criteria for manually selected regions were presence of well-studied genes or other known sequence elements, and existence of a substantial amount of comparative sequence data. A total of 14.82Mb of sequence was manually selected using this approach, consisting of 14 targets that range in size from 500kb to 2Mb.

End phase

In protein-coding exons, the end phase is the place where the intron lands inside the codon: 0 between codons, 1 between the first and second base, 2 between the second and third base. Exons therefore have a start phase and an end phase, but introns have just one phase. An exon which is non-coding (or non-coding at the end) has an end phase of -1.
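
For an exon that is coding along its whole length, the end phase follows from the start phase and the exon length. The sketch below shows that arithmetic; the function `end_phase` and its `coding_at_end` flag are illustrative names, and exons that start in non-coding sequence are deliberately left out.

```python
def end_phase(start_phase: int, exon_length: int,
              coding_at_end: bool = True) -> int:
    """Return the end phase of an exon that is coding from its first base.

    start_phase is 0, 1 or 2: the number of bases of the first codon that
    already lie on the previous exon.
    """
    if not coding_at_end:
        return -1  # non-coding at the end, as stated in the entry above
    if start_phase not in (0, 1, 2):
        raise ValueError("start_phase must be 0, 1 or 2 for a coding start")
    # Bases left over after filling whole codons spill into the next exon.
    return (start_phase + exon_length) % 3


print(end_phase(0, 90))  # 30 whole codons, intron falls between codons
print(end_phase(0, 91))  # one extra base, intron falls after base 1
```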

Ensembl genes

Set of Ensembl gene predictions based on experimental evidence from protein sequences and/or near-full-length cDNA available from public sequence databases. "Ensembl known genes" are predicted on the basis of species-specific database entries from manually curated UniProt/Swiss-Prot, partially manually curated RefSeq and UniProt/TrEMBL databases. Predictions of "Ensembl novel genes" are based on other experimental evidence such as protein and cDNA sequence information from related species. Golden genes are the result of a merge between a Havana transcript (manually curated) and an Ensembl gene prediction from the annotation pipeline. See "havana transcript".

Eponine

Eponine is a probabilistic method for detecting transcription start sites (TSS) in mammalian genomic sequence, with good specificity and excellent positional accuracy. Eponine models consist of a set of DNA weight matrices recognizing specific sequence motifs. Each of these is associated with a position distribution relative to the TSS.

EST (Expressed Sequence Tags)

Coarse sequence reads from flanking vector regions into the inserts of cDNA libraries. ESTs act as physical markers for cloning and full length sequencing of the cDNAs of expressed genes. Typically identified by purifying mRNAs, converting to cDNAs, and then sequencing a portion of the cDNAs. Usually short, single reads from a tissue or stage in development.

EST genes

Set of Ensembl gene predictions solely based on EST evidence. The process of EST gene prediction uses a combination of Exonerate, BLAST and Est2Genome to map ESTs onto the genomic sequence. Redundant ESTs are merged, before GenomeWise is used to assign 5' and 3' UTRs to the longest found ORF. See Eyras et al. for a more complete explanation of the EST gene prediction process.

Evidence status (variant)

Evidence codes reflect the data supporting the variant. They are: Multiple Observations, Phenotype or Disease, Frequency, HapMap, 1000 Genomes, ESP, ExAC and Cited. Find out more in this FAQ.

Exon

The part of the genomic sequence that remains in the transcript (mRNA) after introns have been spliced out.

Exonerate

A fast gapped DNA-DNA alignment algorithm. It can be used for aligning various types of sequences such as genomic DNA, cDNAs/ESTs, and proteins.

Feature

Any annotation on a specific location in the genomic sequence.

Fgenes

FGENES, also known as Find Genes, is a Human gene predictor that is based on pattern recognition of different types of exons, promoters and poly A signals. It is built based on linear discriminant functions of internal, 5'-coding, and 3'-coding exon recognition. It is designed to find the optimal combination of these components and to construct a set of gene models along a given sequence.

Flagged variants

SNPs and other genetic variants that failed our quality control analyses and are therefore flagged as suspicious.

Flanking sequence

Sequence 5' or 3' to a DNA or RNA sequence of interest (for example gene, transcript, SNP or repeat).

Frameshift intron

Frameshift introns are introns with a length of 1, 2, 4, or 5 base pairs. They are introduced by the Ensembl genebuild in order to fit the cDNA sequence to the genome.

Frequency

A measure of how prevalent an allele or genotype is in a population. In Ensembl, it is displayed ranging from 0 (zero) to 1 (one).

GENCODE

The aim of GENCODE as a sub-project of the ENCODE scale-up project is to annotate all evidence-based gene features in the entire human genome at a high accuracy. The GENCODE gene set is equivalent to the Ensembl/HAVANA merged gene set displayed on our website.

GENCODE Basic set

The GENCODE gene set is the gene set Ensembl displays for human and mouse. GENCODE Basic is a subset of the GENCODE gene set, and is intended to provide a simplified, high-quality subset of the GENCODE transcript annotations that will be useful to the majority of users. This subset prioritises full-length protein coding transcripts over partial or non-protein coding transcripts within the same gene.

GENCODE Basic includes all genes in the GENCODE gene set, with a representative subset of the transcripts (splice variants). The GENCODE Basic set is available for the human and mouse gene sets from Ensembl release 75.

More information

We worked with GENCODE to decide how to tag transcripts as 'Basic'. These are the rules that we use to tag which transcripts are included in the GENCODE Basic set:

For each gene:

  1. Loop through all protein-coding (and similar biotype) transcripts and tag all the complete (CDS start- and end found) transcripts. If none of the transcripts are complete, tag only the transcript(s) with the longest CDS.
  2. Loop through all the small noncoding and antisense transcripts and tag all the complete (mRNA start- and end found) transcripts. If none are complete, loop through the long-noncoding transcripts too and then tag only the transcript(s) with the longest combined exon length.
  3. Combine the results from steps (1) and (2) and this is what is displayed as ‘GENCODE Basic’.
  4. If, after step (3), we've got an empty basket and no transcripts in the gene are tagged as 'Basic', we look for pseudogene transcripts and tag all the pseudogene transcripts that we find.
  5. Finally, if we've still got no transcripts tagged from steps (1), (2) or (4), then we tag transcripts with 'problematic' biotypes, i.e. retained_intron, TEC, ambiguous_ORF and disrupted_domain.

GENCODE Comprehensive set

The complete GENCODE gene set is sometimes referred to as the GENCODE Comprehensive set. It is the gene set that Ensembl displays for human and mouse.

Gene hit (BLAST Results)

The gene name of the gene the transcript hit is part of.

Gene model

A gene model is a representation of a gene. In Ensembl, gene models are annotated automatically using biological sequence data (protein, mRNA) as support.

Each gene model includes information such as the genomic coordinates of the gene and its coding and noncoding exon(s). One or more transcripts may be annotated. 

Genebuild

The Ensembl gene annotation system is used to annotate genome assemblies. This process is called the genebuild.

GeneWise

GeneWise is a sequence analysis tool for comparing proteins to DNA sequences allowing for introns and frameshifts. 

Genome assembly

The set of chromosomes, unlocalized scaffolds, unplaced scaffolds, and alternate sequences that represent an organism’s genome.

For some species in Ensembl, the DNA sequenced for the genome project was obtained from one individual (eg. chimpanzee), whereas for other species (eg. human) it was obtained from multiple individuals.

Ensembl imports genome assemblies from the sources listed on the home page of each species.

See also: Assembly

Genomic Location (BLAST Results)

Location of the hit on the genome.

Genotype

Specific alleles present in an individual's genome, or the genetic makeup of one organism.

GENSCAN

An application for identification of complete gene structures in genomic DNA (Burge, C. and Karlin, S. (1997) Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268, 78-94). The splice site models used are described in more detail in: Burge, C. B. (1998) Modeling dependencies in pre-mRNA splicing signals. In Salzberg, S., Searls, D. and Kasif, S., eds. Computational Methods in Molecular Biology, Elsevier Science, Amsterdam, pp. 127-163.

Global MAF (Global Minor Allele Frequency)

Minor Allele Frequency (MAF) refers to the lowest allele frequency of a sequence variant (such as a SNP). In Ensembl, the global MAF is calculated using the allele frequencies across all 1000 Genomes Phase I populations.
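
As a small worked example of the definition above, the sketch below turns allele counts into frequencies between 0 and 1 and reports the allele with the lowest frequency. The counts are invented and the function name is illustrative; this is not a description of how Ensembl computes the value.

```python
from typing import Dict, Tuple


def minor_allele_frequency(allele_counts: Dict[str, int]) -> Tuple[str, float]:
    """Return the allele with the lowest frequency and that frequency."""
    total = sum(allele_counts.values())
    if total == 0:
        raise ValueError("no alleles observed")
    frequencies = {allele: count / total
                   for allele, count in allele_counts.items()}
    minor = min(frequencies, key=frequencies.get)
    return minor, frequencies[minor]


# Invented counts pooled across populations: 150 A alleles and 50 G alleles.
print(minor_allele_frequency({"A": 150, "G": 50}))
```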

GO (Gene Ontology)

An organized hierarchy of terms produced by the Gene Ontology Consortium, used to describe biological processes, cellular components, and molecular functions. The three ontologies are as follows:

  • Molecular Function Ontology. Tasks performed by individual gene products; examples are carbohydrate binding and ATPase activity.
  • Biological Process Ontology. Broad biological goals, such as mitosis or purine metabolism, that are accomplished by ordered assemblies of molecular functions.
  • Cellular Component Ontology. Subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and origin recognition complex.

A gene may be indexed under many GO terms, depending on the GO classification system. A gene product has one or more molecular functions and is used in one or more biological processes; it might be associated with one or more cellular components. For instance, cytochrome c can be described by the molecular function term electron transporter activity, the biological process terms oxidative phosphorylation and induction of cell death, and the cellular component terms mitochondrial matrix and mitochondrial inner membrane.

Golden path length

The golden path length is the length of the reference assembly. It is the sum of the lengths of all top-level sequences in the seq_region table, omitting any redundant regions such as haplotypes and PARs (pseudoautosomal regions).

Haplotype

Known variations to the primary assembly, due to variability in the human genome sequence (eg. the highly variable MHC locus containing haplotypes HSCHR6_MHC_COX, HSCHR6_MHC_SSTO, HSCHR6_MHC_APD, HSCHR6_MHC_DBB, HSCHR6_MHC_MANN, HSCHR6_MHC_MCF, and HSCHR6_MHC_QBL). In Region in Detail, the haplotype regions are coloured with a red background.

HapMap project

An international collaboration formed to develop a haplotype map of the human genome and thus describe the common patterns of human DNA sequence variation. (http://hapmap.ncbi.nlm.nih.gov/)

Havana transcript

A transcript resulting from manual curation of genome annotation for vertebrate species. The Havana team is a subset of Vega (See "Vega genes".)

HGMD

The Human Gene Mutation Database provides a comprehensive collection of known and published mutations responsible for inherited disease in humans. More details are here.

HGVS

The Human Genome Variation Society aims to promote the identification and characterisation of changes in the genome that lead to susceptibility to illness. Genomic variants will be collected, documented, and the resulting information in addition to any associated clinical variations will be freely distributed. See more details here.

HGVS names

Nomenclature for a given variant according to the Human Genome Variation Society (HGVS). A guide to HGVS names can be found on their website.

Homologues (homologs)

Specific sequences that are descended from the same common sequence in an ancestor. See orthologues or paralogues.

Identity

A measure of how similar two sequences are, specifically, what percent of amino acids are the same in type and position between the two sequences.
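
A minimal sketch of how percent identity could be computed from two already-aligned sequences. The function name is hypothetical, and the choice of denominator (all alignment columns, including gapped ones) is an assumption of this sketch; alignment programs differ on this point.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of alignment columns where both sequences carry the same residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Drop columns that are gaps in both sequences; they carry no information.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != gap or b != gap]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


# Five identical positions out of seven alignment columns.
print(round(percent_identity("MKT-LLV", "MKTALIV"), 2))  # 71.43
```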

Ig C gene

Gene that codes the constant (C) region of immunoglobulins.

Ig C pseudogene

Locus that shares an evolutionary history with the Ig C gene but it has been mutated through frameshift and/or stop codon(s) that disrupt the open reading frame.

Ig D gene

Gene that rearranges at the DNA level and codes the diversity (D) region of the variable domain of immunoglobulins.

Ig gene (Immunoglobulin gene)

IG segment genes are annotated by Ensembl automatic annotation (see article) and manual curation by the VEGA/Havana project.

Ig J gene

A gene that rearranges at the DNA level and codes the joining (J) region of the variable domain of immunoglobulins.

Ig J pseudogene

Locus that shares an evolutionary history with the Ig J gene but it has been mutated through frameshift and/or stop codon(s) that disrupt the open reading frame.

Ig pseudogene

Locus that shares an evolutionary history with immunoglobulin genes but it has been mutated through frameshift and/or stop codon(s) that disrupt the open reading frame.

Ig V gene

A gene that rearranges at the DNA level and codes the variable (V) region of the variable domain of immunoglobulins.

Ig V pseudogene

Locus that shares an evolutionary history with the Ig V gene but it has been mutated through frameshift and/or stop codon(s) that disrupt the open reading frame.

IMPACT (variation consequences)

A subjective classification of the severity of the variant consequence. The four categories are:

HIGH:

The variant is assumed to have a high (disruptive) impact on the protein, probably causing protein truncation, loss of function or triggering nonsense mediated decay.

MODERATE:

A non-disruptive variant that might change protein effectiveness.

LOW:

Assumed to be mostly harmless or unlikely to change protein behaviour.

MODIFIER:

Usually non-coding variants or variants affecting non-coding genes, where predictions are difficult or there is no evidence of impact.

In-del (Insertion-deletion)

A mutation or polymorphism in which one or more base pairs have been inserted into or removed from a genomic sequence.

InterPro

InterPro is an integrated resource for protein families, domains and sites, combining information from several different protein signature databases. InterPro IDs are linked to the summary of information about that domain or family. InterPro is managed by EBI. A number of databases (SwissProt, TrEMBL, PROSITE, PRINTS, Pfam, ProDom, SMART, TIGRFAMs, PIR SuperFamilies and SUPERFAMILY) with different approaches to biological information are used to derive protein signatures. ProteinView, GeneView and DomainView provide links to the relevant InterPro entries.

Intron

The part of the genomic sequence that is transcribed and then spliced out of the transcript (mRNA). Noncoding.

Intron-spanning reads

A sequence from an RNASeq experiment which aligns to the border between adjacent exons, aligning to the 3' end of the preceding exon and the 5' end of the following exon. This shows that in the particular sequenced fragment, the intron between them was spliced out. This may indicate a splice junction that has not already been identified, suggesting a new transcript for our gene-build.

Jalview

Jalview is a multiple alignment editor, used by the EBI clustalw server and the PFAM protein domain database and is available as a general purpose alignment editor.

Karyotype

The Karyotype View in Ensembl displays the set of chromosomes for a species, including the centromere and banding pattern as they would appear under a light microscope. Dark bands indicate heterochromatin and light bands indicate euchromatin.

The Karyotype View is only available for species where the genome assembly provided to us has been assembled into chromosomes. For many species in Ensembl, the genome assembly is comprised of only unplaced scaffolds.

Known gene

A known gene is an Ensembl gene for which at least one known transcript has been annotated.

Known transcript

A known Ensembl transcript matches to a sequence for the same species in a public, scientific database such as UniProtKB or NCBI RefSeq.

LD (Linkage Disequilibrium)

A measure of how often two SNPs or specific sequences are inherited together.

Length (aa)

The number of amino acids in, for example, a protein.

Length (BLAST Results)

The full length of the alignment, including all gaps in either the query or the alignment.

Length (bp)

The number of base pairs in, for example, a transcript.

LincRNA

Large intergenic non-coding RNAs, usually associated with open chromatin signatures such as histone modification sites.

Linkage

A measure of how often features (genes, specific sequences) on a chromosome are inherited together.

Long non coding gene

Long non coding genes are usually greater than 200 bases long. They may be transcribed but are not translated. In Ensembl, genes with the following biotypes are classed as long non coding genes: 3prime_overlapping_ncrna, ambiguous_orf, antisense, antisense_RNA, lincRNA, ncrna_host, non_coding, non_stop_decay, processed_transcript, retained_intron, sense_intronic, sense_overlapping. The majority of the long non coding genes in Ensembl are annotated manually by HAVANA.

LOVD

The Leiden Open Variation Database is a freely available gene-centered collection of DNA variations. An online tool is provided. See more details here.

Low-complexity region

A region in the sequence with a biased composition (i.e. repeated sequences or residues.)

LRG_gene

An LRG (Locus Reference Genomic) is a fixed sequence, independent of the genome, specifically created for the diagnostic community to record DNA sequence variation on a fixed framework. Sequence variants in LSDBs (Locus Specific Databases) are reported using LRG sequences. For more information, please see the LRG project page.

LSDB

Locus specific database. These can be generated by groups focused on a single gene, region, or a single disease. See more details here.

Many-to-many orthologues

A type of orthologue assigned for a pair of species where multiple orthologues are found in both species.

Marker

A short sequence whose placement on the genome is known.

MBRH (Multiple Best Reciprocal Hit)

When, due to gene duplications, there are multiple 'best' hits with identical score, E-value, % identity and % positivity, one is unable to pick a unique orthologue for a gene. This results in more complex graphs of 'best' relationships. This often occurs when different genes have identical translations, which could be due to a duplication event, an assembly error, or chance. On average 3% of the genes have an identical translation to some other gene either within its genome or in another genome.

  • MBRH / DUP 1.# - MBRH set where in one genome there is only one gene, but the other genome has multiple genes, all on the same chromosome and within 1.5 megabases of each other. This could be due to recent gene duplication events where sequences have not diverged or a mis-assembly of the genome sequence leading to artificial, apparent gene duplications. (e.g. MBRH / DUP 1.2 or MBRH / DUP 1.4)
  • MBRH / SYN - This is a more complex MBRH set where there are multiple genes in each genome split across multiple chromosomes. The one(s) labeled MBRH / SYN satisfy both the MBRH criteria and the RHS search criteria.
  • MBRH / COMPLEX - This is a more complex MBRH set where there are multiple genes in each genome split across multiple chromosomes. This MBRH pair does not satisfy the RHS criteria.

MGI (Mouse Genome Informatics)

Houses a database that provides integrated access to data on the genetics, genomics, and biology of mouse (Mus musculus).

Microsatellite

A region in the genomic sequence containing short tandem repeats.

miRNA

MicroRNA (miRNA) is a small RNA molecule (ca. 22 nucleotides) that functions in the post-transcriptional regulation of gene expression.

miRNA pseudogene

Locus that shares an evolutionary history with miRNA genes but it has been mutated through frameshift and/or stop codon(s) that disrupt the open reading frame.

misc RNA pseudogene

Locus that shares an evolutionary history with miscRNA genes but it has been mutated through frameshift and/or stop codon(s) that disrupt the open reading frame.

misc_RNA

Short non coding RNA genes that have not been classified into the other short noncoding RNA biotypes such as scRNA, tRNA, rRNA, among others.

Motif

A conserved region of sequence with a specific function/ structure.

mt rRNA

Mitochondrially encoded ribosomal RNA genes.

mt tRNA

Mitochondrially encoded transfer RNA genes.

mt tRNA pseudogene

Locus that shares an evolutionary history with mt tRNA genes but it has been mutated through frameshift and/or stop codon(s) that disrupt the open reading frame.

Mutation

A modification (insertion, deletion, or alteration) in the genomic or amino acid sequence.

ncRNA (non-coding RNA)

Short non-coding RNAs such as rRNA, scRNA, snRNA, snoRNA and miRNA are annotated by the Ensembl ncRNA pipeline (see article). To view these short ncRNAs, go to Region In Detail and open the Configure This Page window. Select ncRNA from the Genes menu.

Transfer RNAs (tRNAs) are identified by tRNAscan. To view tRNAs, go to Region In Detail and open the Configure This Page window. Select tRNA from the Simple Feature menu.

Long intergenic ncRNAs have only been annotated for human and mouse. To view long ncRNAs, go to Region In Detail and open the Configure This Page window. Select lincRNA from the Genes menu.

Non coding

A noncoding mRNA does not result in a protein product. Noncoding mRNAs may be short in length (eg. microRNAs) or longer in length (eg. lincRNAs).

Nonsense mediated decay

Transcript is thought to undergo nonsense mediated decay, a process which detects nonsense mutations and prevents the expression of truncated or erroneous proteins.

Novel gene

A novel gene is an Ensembl gene for which only novel transcripts (one or more) have been annotated.

Novel transcript

A novel Ensembl transcript does not match to a sequence for the same species in a public, scientific database such as UniProtKB or NCBI RefSeq.

OMIM (Online Mendelian Inheritance in Man)

A genetic knowledge database that focuses on the relationship between phenotype and genotype. Mendelian Inheritance in Man (MIM) was first published in 1966 and is updated daily. Ensembl links to OMIM entries in the gene tab (under External references), and in the variation tab (under Phenotype Data).

ORF (Open Reading Frame)

A DNA sequence that possesses a start codon and a large window of sequence with no stop codon that could potentially code for a protein.
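
To illustrate the definition, the sketch below scans the three forward-strand reading frames for stretches that begin with a start codon and run to the first in-frame stop codon. It assumes the standard start codon ATG and stop codons TAA, TAG and TGA, ignores the reverse strand, and uses a hypothetical min_codons threshold to stand in for "a large window".

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end) 0-based half-open coordinates of forward-strand ORFs.

    Each ORF starts at an ATG and ends just after the first in-frame stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                # Count the codons before the stop, including the ATG.
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs


seq = "CCATGAAATTTGGGTAACC"
for start, end in find_orfs(seq):
    print(start, end, seq[start:end])  # 2 17 ATGAAATTTGGGTAA
```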

Orientation (BLAST Results for cDNA/protein)

Strand of the genome the subject (hit) cDNA or peptide is found on.

Orientation (BLAST Results for genomic)

Strand of the genome the query sequence is found on.

Orthologues (orthologs)

Orthologues are genes derived from a common ancestor through vertical descent (or speciation) and can be thought of as the direct evolutionary counterpart. In contrast, paralogues are genes within the same genome that have evolved by duplication.

Other paralogues (other paralogs)

Paralogues which are very far away from the other members of a paralogue family. They are part of the same super-family, but the precise taxonomic relationship to other members is undefined, as the trees are too large to compute.

Overlapping Genes (BLAST Results)

Genes that overlap the genomic location hit.

p value

A measure of the statistical significance of an association: the probability of observing an association at least as strong by chance if there were no true association. Ensembl reports the p value, the negative log of the p value (where a higher number suggests greater significance), or the e-value (where a lower number suggests greater significance).

PAR (Pseudoautosomal region)

Small regions of sequence identity located at the tips of the short and the long arms of the X and Y chromosomes where recombination and genetic exchange take place. Genes within the pseudoautosomal region are not sex linked.

The Genome Reference Consortium defines two PARs for the human genome assembly. The first pseudoautosomal region, PAR1, is located at the tip of the short arm and consists entirely of N's. The second pseudoautosomal region, PAR2, is located at the tip of the long arm.

In the Ensembl human database, DNA for the complete X chromosome is stored and annotated. Only the two unique regions of the Y chromosome are stored and annotated. We are able to represent the complete Y chromosome by filling the 'gaps' with the two PAR regions from the X chromosome. This is done on-the-fly using our assembly_exceptions table.

Please note that when using the API, SliceAdaptor by default will fetch only the unique regions of the genome. This means that the PARs on chromosome X will be fetched but only the unique regions on Y will be fetched. To fetch the full length of the Y chromosome using the SliceAdaptor, set the 4th argument to '1' as shown:

my $slices = $slice_adaptor->fetch_all( 'toplevel', 'GRCh37', 0, 1 );

Paralogues (paralogs)

Sequences (homologues) that have evolved by duplication.

Patch

Patches are alternate sequences defined relative to the genomic DNA of the primary assembly. Novel patches represent new allelic loci but they are not necessarily haplotypes. Fix patches are where the primary assembly was found to be incorrect, and the patch reflects the corrected sequence. Currently available for human and mouse, they are provided by the GRC. By default, our browser displays the unchanged primary assembly (eg. GRCh37 chromosomes). In order to apply a novel (red) or fix (green) patch to a chromosome, click on the "Assembly Exception" track from the Region In Detail window.

PDB (Protein Data Bank)

A repository for 3-D biological macromolecular structure data. PDB archives protein structures deduced from crystallography and nuclear magnetic resonance (NMR) experiments on protein structures. The Protein Data Bank (PDB) is operated by Rutgers, The State University of New Jersey; the San Diego Supercomputer Center at the University of California, San Diego; and the Center for Advanced Research in Biotechnology of the National Institute of Standards and Technology -- three members of the Research Collaboratory for Structural Bioinformatics (RCSB). The RCSB PDB is supported by funds from the National Science Foundation, the Department of Energy, and the National Institutes of Health.

Pfam

Pfam is a large collection of multiple sequence alignments and hidden Markov models covering many common protein domains and families. Pfam can be used to view the domain organization of proteins, to view multiple alignments, protein domain architectures, protein structures, and species distributions.

Pmatch

Pmatch is a fast, exact matching program for aligning protein sequences with either protein or DNA sequence.

Polymorphic pseudogene

Locus that is a pseudogene in one genome (e.g. the reference) but coding in other genomes (i.e. non-reference), due to short or large scale genomic variation in some individuals/haplotypes/strains in a given population/species. Also known elsewhere as segregating pseudogenes or SPGs.

PolyPhen

A tool which predicts the variation effect on protein function based on physical and comparative considerations. See the PolyPhen website for more information.

Pre-release site

Initial annotations of upcoming Ensembl genomes, usually without gene predictions or validation, are regularly made available on the pre-release site, pre.ensembl.org.

Primary assembly

The collection of assembled chromosomes, unlocalized scaffolds and unplaced scaffolds that represent a non-redundant haploid genome. The primary assembly excludes alternate sequences (alternate loci, fix patches, novel patches).

When using the Ensembl API, the primary assembly is referred to as the "reference" assembly. The primary assembly can be fetched with the Ensembl SliceAdaptor, by ensuring that the boolean $include_non_reference is set to 0. eg.

my $slices = $slice_adaptor->fetch_all( 'toplevel', 'GRCh37', 0);

Prints

The PRINTS protein fingerprint database is a compendium of protein fingerprints. A fingerprint is a group of conserved motifs used to characterise a protein family; its diagnostic power is refined by iterative scanning of a SwissProt/TrEMBL composite. Usually the motifs do not overlap, but are separated along a sequence, though they may be contiguous in 3D-space. Fingerprints can encode protein folds and functionalities more flexibly and powerfully than can single motifs, full diagnostic potency deriving from the mutual context provided by motif neighbors.

Processed pseudogene

Pseudogene that appears to have been produced by integration of a reverse transcribed mRNA into the genome.

Processed transcript

Transcripts that don't contain an open reading frame (ORF) and cannot be placed in one of the other categories.

Projected gene (or known by_projection)

A projected Ensembl gene has only novel transcripts (one or more) annotated, and has a known gene from human or mouse as an orthologue. The gene symbol and description are projected from the human or mouse orthologue.

Prosite

PROSITE is a database of protein families and domains run by the Expert Protein Analysis System (ExPASy) proteomics server of the Swiss Institute of Bioinformatics (SIB). It consists of biologically significant sites, patterns and profiles that help to reliably identify to which known protein family (if any) a new sequence belongs.

Protein coding

Gene and/or transcript that contains an open reading frame (ORF).

Protein ID

Ensembl protein IDs are unique for differing translations.

Pseudogene

A pseudogene shares an evolutionary history with a functional protein-coding gene but it has been mutated through evolution to contain frameshift and/or stop codon(s) that disrupt the open reading frame.

Putative gene split

Paralogues where one of them has been split partway along, but where the fragments are more than 1Mb apart, or on different chromosomes or strands.

QTL (Quantitative Trait Locus)

Genetic loci where allelic variation is associated with variation in a quantitative trait (e.g. blood pressure). The presence of QTL is inferred from genetic mapping. Total variation is partitioned into components linked to a number of discrete, mapped chromosome markers described by statistical association to quantitative variation in a particular phenotypic trait that is thought to be controlled by the cumulative action of alleles at multiple loci.

Query %id

Query %id indicates the percentage of the query sequence matching the target sequence.

Query end (BLAST Results)

Position within the query sequence where the alignment ends, 1-based, inclusive.

Query name (BLAST Results)

Name of your query (input) sequence, if specified.

Query ori (BLAST Results)

The strand of the query (input) sequence used to match to the subject (hit) - almost always Forward.

Query start (BLAST Results)

Position within the query sequence where the alignment begins, 1-based, inclusive.

r2

The correlation between a pair of loci. It varies from 0 (loci are in complete linkage equilibrium) to 1 (loci are in complete linkage disequilibrium and coinherited).
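
One common way to obtain this value for two biallelic loci is r2 = D^2 / (pA(1 - pA) pB(1 - pB)), where D = pAB - pA pB. The sketch below applies that formula to phased haplotypes; the function name and the example data are hypothetical, and this is not necessarily how Ensembl computes the value it displays.

```python
def r_squared(haplotypes):
    """Compute r2 between two biallelic loci.

    haplotypes: list of (allele_at_locus_1, allele_at_locus_2) tuples,
    one per phased chromosome.
    """
    n = len(haplotypes)
    # Use the alleles on the first haplotype as the reference alleles.
    a_ref, b_ref = haplotypes[0]
    p_a = sum(1 for a, _ in haplotypes if a == a_ref) / n
    p_b = sum(1 for _, b in haplotypes if b == b_ref) / n
    p_ab = sum(1 for a, b in haplotypes if a == a_ref and b == b_ref) / n
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denominator == 0:
        raise ValueError("r2 is undefined when a locus is monomorphic")
    d = p_ab - p_a * p_b
    return d * d / denominator


# Alleles always travel together: complete linkage disequilibrium.
print(r_squared([("A", "C"), ("A", "C"), ("G", "T"), ("G", "T")]))  # 1.0
# Every combination equally frequent: linkage equilibrium.
print(r_squared([("A", "C"), ("A", "T"), ("G", "C"), ("G", "T")]))  # 0.0
```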

Readthrough (read-through)

Readthrough transcripts are tagged by HAVANA and defined as transcripts connecting two independent loci. A readthrough transcript has exons that overlap exons from transcripts belonging to two or more different loci (in addition to the locus to which the readthrough transcript itself belongs).

Readthrough transcripts are also annotated by RefSeq.

Reference assembly

The collection of assembled chromosomes, unlocalized scaffolds, unplaced scaffolds and alternate sequences (alternate loci, fix patches, novel patches) that represent a species' genome assembly.

For many species in Ensembl, the reference assembly has a single path through the genome and is equivalent to the primary assembly. For other species (eg. human and mouse), the reference assembly includes alternate sequences.

When using the Ensembl API, the full reference assembly can be fetched with the Ensembl SliceAdaptor, by ensuring that the boolean $include_non_reference is set to 1. eg.

my $slices = $slice_adaptor->fetch_all( 'toplevel', 'GRCh37', 1);

Reference SNP (Reference Single Nucleotide Polymorphism)

A SNP assigned to eliminate redundancy in the NCBI dbSNP database. All SNPs submitted at the position of a reference SNP are given the reference SNP identifier (a number preceded by 'rs').

RefSeq

NCBI's Reference Sequences (RefSeq) database is a curated database of GenBank's genomes, mRNAs and proteins. RefSeq attempts to provide a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products, providing a stable reference for gene identification and characterization, mutation analysis, expression studies, polymorphism discovery, and comparative analyses.

RefSeq comparison (How gene annotation in Ensembl and RefSeq differs)

Some species have a 'RefSeq comparison' attribute that you can find on the Gene Summary pages. There are four types of comments:

  1. No overlapping RefSeq annotation found
  2. Overlapping RefSeq annotation not matched
  3. Overlapping RefSeq Gene ID 93759 matches and has similar biotype of protein_coding
  4. Overlapping RefSeq Gene ID 667103 matches but different biotype of protein_coding

These comments are a guideline only and we encourage you to compare the annotation that you find in Ensembl with annotation from RefSeq yourself. 

How are the matches decided?

The rules for deciding whether or not the Ensembl and RefSeq genes match are basic. The logic is as follows:

  • For each gene in Ensembl, fetch all imported overlapping genes from RefSeq
  • If no RefSeq genes are fetched, the Ensembl gene is tagged with the comment, "No overlapping RefSeq annotation found".
  • Overlapping RefSeq genes may be on the opposite strand. In this case, they are not considered for matching and the Ensembl gene may be tagged with the comment, "Overlapping RefSeq annotation not matched".
  • If a RefSeq gene overlaps the Ensembl gene on the same strand, we will not automatically consider them as a match. They will only be considered a match in two cases, either:
    • both genes have been assigned the same name, or
    • the length, start and end of the RefSeq gene are within 10% of the Ensembl gene's length, start and end.
  • If a match is found, we report whether the biotypes are similar or not.
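
The logic above can be sketched as follows. This is an illustration, not Ensembl's code: the Gene class and refseq_comment function are hypothetical, the returned strings paraphrase the comments (without gene IDs), "similar biotype" is simplified to equality, and the 10% tolerance is interpreted as 10% of the Ensembl gene's length, which is one reading of the rule.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    name: str
    start: int
    end: int
    strand: int
    biotype: str

    @property
    def length(self):
        # 1-based inclusive coordinates.
        return self.end - self.start + 1


def refseq_comment(ensembl, refseq_genes, tolerance=0.10):
    """Classify an Ensembl gene against imported RefSeq genes."""
    overlapping = [
        g for g in refseq_genes
        if g.start <= ensembl.end and g.end >= ensembl.start
    ]
    if not overlapping:
        return "No overlapping RefSeq annotation found"

    margin = tolerance * ensembl.length  # assumed meaning of "within 10%"
    for g in overlapping:
        if g.strand != ensembl.strand:
            continue  # opposite-strand genes are not considered for matching
        same_name = g.name == ensembl.name
        similar_coords = (
            abs(g.length - ensembl.length) <= margin
            and abs(g.start - ensembl.start) <= margin
            and abs(g.end - ensembl.end) <= margin
        )
        if same_name or similar_coords:
            if g.biotype == ensembl.biotype:
                return "Overlapping RefSeq gene matches and has similar biotype"
            return "Overlapping RefSeq gene matches but different biotype"
    return "Overlapping RefSeq annotation not matched"


ens = Gene("GENE1", 1000, 2000, 1, "protein_coding")
print(refseq_comment(ens, [Gene("OTHER", 1050, 2050, 1, "protein_coding")]))
```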

Which species will have these comments?

These comments are only available where we have annotation imported from RefSeq. This is limited both by the species that are annotated by RefSeq and by our frequency in importing their annotation. This link will show you a list of species where RefSeq annotation is available.

Please note that these comments will be updated only when we import new annotation from RefSeq and may therefore be out of sync with the latest RefSeq annotation.

Repeat

Repetitive DNA in which the same sequence occurs multiple times.

Repeat Masking

The method by which repeated sequences and low-complexity regions are hidden, usually used in searches by alignment and homology-searching programs.

RepeatMasker

RepeatMasker (AFA Smit & P Green) is a standard software tool used in computational genomics to identify repetitive elements and low-complexity sequences.

Retained intron

Alternatively spliced transcript that is believed to contain intronic sequence relative to other coding transcripts in a given locus.

Retrotransposed

A noncoding pseudogene produced by integration of a reverse transcribed mRNA into the genome. This biotype was deprecated from Ensembl release 70, where all genes with the biotype retrotransposed were changed to processed_pseudogene.

RH map (Radiation Hybrid map)

Technique for identifying landmarks (STS) every 100 kb in the human genome. The landmarks are ordered according to the frequency with which they are separated by radiation-induced breaks. The frequency is assayed by analysing a panel of human-hamster hybrid cell lines.

RNASeq gene alignments

A sequence from an RNASeq experiment aligned to the genome.

RNASeq gene models

An assembled gene, made up by combining RNASeq alignments and intron-spanning reads from an RNASeq experiment. This may correspond to a known gene, or may represent a novel transcript.

rRNA

The RNA component of the ribosome.

rRNA pseudogene

Ribosomal RNA pseudogene.

SARA (Same As Reference Assembly)

An acronym used to indicate a SNP (single nucleotide polymorphism) that has the same sequence as the strain used in the assembly.

Scaffold

Supercontigs or scaffolds are sets of ordered, oriented contigs. They are longer sequences than contigs, but shorter than full chromosomes.

Score (BLAST Results)

BLAST score calculated from the alignment. Score depends on the calculation parameters you specified.

scRNA

Small cytoplasmic RNA.

scRNA pseudogene

Small cytoplasmic RNA pseudogene.

SEG

Seg divides sequences into contrasting segments of low-complexity and high-complexity. Low-complexity segments defined by the algorithm represent "simple sequences" or "compositionally-biased regions". Segment lengths and the number of segments per sequence are determined automatically by the algorithm.

Sense intronic

Long non-coding transcript that lies within the intron (does not overlap any exons) of a coding gene on the same strand.

Sense overlapping

Long non-coding transcript that contains a coding gene within one of its introns and on the same strand.

SGD (Saccharomyces Genome Database)

Canonical database for the molecular biology and genetics of Saccharomyces cerevisiae.

Shotgun method

(also whole genome shotgun) Semi-automated sequencing method that involves sequencing randomly cloned pieces of the genome (size selected, usually 2, 10, 50 and 150 kb), with no prior knowledge of their location. The clones are then sequenced from both ends. The two ends of the same clone are referred to as mate pairs. The distance between two "mate pairs" can be inferred if the library size is known and has a narrow window of deviation. This approach can be contrasted with "directed" strategies, in which pieces of DNA from known chromosomal locations are sequenced.

Shotgun sequencing

A method in which small, random DNA sequences are generated that overlap. The fragments are sequenced and the full, connected sequence determined through the overlaps.

SIFT

A tool which predicts the variation effect on protein function based on sequence homology and the physico-chemical similarity between the alternate amino acids. See the SIFT website for more information.

SignalP

The SignalP application predicts the presence and location of signal peptide cleavage sites in amino acid sequences from different organisms: Gram-positive prokaryotes, Gram-negative prokaryotes, and eukaryotes. The method incorporates a prediction of cleavage sites and a signal peptide/non-signal peptide prediction based on a combination of several artificial neural networks. Signal peptides indicate a protein that will be secreted. Prediction of signal peptides is quite accurate; however, care must be exercised and these regions should be verified by other means. (Henrik Nielsen, Jacob Engelbrecht, Søren Brunak and Gunnar von Heijne. Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Protein Engineering 10, 1-6 (1997).)

Similarity

A measure of how well one sequence matches another, calculated by an alignment program from the identical and conserved residues.

Slice

The term "slice" in Ensembl refers to a length of DNA sequence. A slice can be any length, from one base long to the entire length of a chromosome. A slice is defined as follows:

'coord_system_name:coord_system_version:seq_region_name:start:end:strand'

eg. 'chromosome:GRCm38:X:1000:2000:1'
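
As a small illustration of the format above, a slice name can be split into its six colon-separated fields. The Slice class and parse_slice_name function are hypothetical helpers, not part of the Ensembl API, and the length calculation assumes 1-based inclusive coordinates.

```python
from typing import NamedTuple


class Slice(NamedTuple):
    coord_system_name: str
    coord_system_version: str
    seq_region_name: str
    start: int
    end: int
    strand: int

    @property
    def length(self):
        # Assumes 1-based, inclusive start and end.
        return self.end - self.start + 1


def parse_slice_name(name):
    """Split 'coord_system_name:coord_system_version:seq_region_name:start:end:strand'."""
    parts = name.split(":")
    if len(parts) != 6:
        raise ValueError(f"expected 6 colon-separated fields, got {len(parts)}")
    cs_name, cs_version, region, start, end, strand = parts
    return Slice(cs_name, cs_version, region, int(start), int(end), int(strand))


s = parse_slice_name("chromosome:GRCm38:X:1000:2000:1")
print(s.seq_region_name, s.length)  # X 1001
```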

Small non coding gene

Small non coding genes are usually fewer than 200 bases long. They may be transcribed but are not translated. In Ensembl, genes with the following biotypes are classed as small non coding genes: miRNA, miscRNA, rRNA, scRNA, snlRNA, snoRNA, snRNA, and also the pseudogenic form of these biotypes. The majority of the small non coding genes in Ensembl are annotated automatically by our ncRNA pipeline. Please note that tRNAs are annotated separately using tRNAscan. tRNAs are included as 'simple features', not genes, because they are not annotated using aligned sequence evidence.

SNAP

  1. (Synonymous/Non-synonymous Analysis Program) A program which calculates synonymous and non-synonymous substitution rates based on a set of codon-aligned nucleotide sequences, based on the method of Nei and Gojobori, incorporating a statistic developed in Ota and Nei.
  2. An ab initio gene prediction program developed by Ian Korf that models protein coding sequences in genomic DNA by means of hidden Markov models.

snoRNA

Small nucleolar RNA genes which can be involved in chemical modifications of other RNA genes.

snoRNA pseudogene

Small nucleolar RNA pseudogene, involved in modifications of other RNAs.

SNP (Single Nucleotide Polymorphism)

SNPs are common variations that occur in DNA with a 0.1% frequency. Ensembl displays SNPs obtained from dbSNP (the SNP repository maintained by NCBI), the Human Genic Bi-Allelic Sequences Database (HGVBase) and The SNP Consortium Ltd. (TSC).

snRNA

Small nuclear RNA.

snRNA pseudogene

Small nuclear RNA pseudogene.

SNV

A Single Nucleotide Variant (SNV) is a nucleotide position in genomic DNA at which different sequence alternatives (alleles) exist. SNVs include SNPs and single nucleotide insertions or deletions. See more details here.

Source

The origin of the variation data (e.g. dbSNP, COSMIC, DGVa).

sRNA

Small but highly structured non-coding RNA molecules produced by bacteria. They are 50-250 nucleotides long and contain several stem-loops.

SSAHA (Sequence Search and Alignment by Hashing Algorithm)

A search designed to detect exact matches, or nearly exact matches, in DNA or protein databases. The SSAHA search has been optimized for alignments of high percentage identity and displays as results the most significant matches for ungapped alignments between sequences. Each exact match in an SSAHA alignment is analogous to finding a high-scoring segment pair in BLAST. A number of consecutive matches on a contig may represent features of a gene such as exons or 5' and 3' untranslated regions, depending on the nature of the query sequence.

Stable ID (Stable identifier)

Stable identifiers are defined for a number of features including genes, transcripts, translations, exons. Stable IDs for all species follow the same format:

ENS + species stable ID prefix + feature type + 11 digits

Each species has its own three-letter stable ID prefix which can be found in the species database meta table under the meta_key "species.stable_id_prefix". Human is the only species without a three-letter stable ID prefix.

A Human stable ID for Gene may look like: ENSG00000139618

A Mouse stable ID for Transcript may look like: ENSMUST00000103109

A Zebrafish stable ID for protein may look like: ENSDARP00000006884

Stable IDs are versioned.
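
The format above can be checked mechanically. The following is a minimal sketch, not part of any Ensembl API: the function name `parse_stable_id` and the `StableId` tuple are hypothetical, the letter E for exons is an assumption (only G, T and P appear in the examples above), and the version is assumed to be written as a trailing ".N".

```python
import re
from typing import NamedTuple, Optional

# Feature-type letters. G, T and P are taken from the examples above;
# E for exon is an assumption made for this sketch.
FEATURE_TYPES = {"G": "gene", "T": "transcript", "P": "translation", "E": "exon"}

# ENS + optional three-letter species prefix + feature letter + 11 digits,
# optionally followed by ".<version>" (the suffix form is an assumption).
STABLE_ID_RE = re.compile(r"ENS([A-Z]{3})?([GTPE])(\d{11})(?:\.(\d+))?")


class StableId(NamedTuple):
    species_prefix: Optional[str]  # None for human, which has no prefix
    feature_type: str
    number: int
    version: Optional[int]


def parse_stable_id(text: str) -> StableId:
    """Split a stable ID into its parts, or raise ValueError if it does not fit."""
    match = STABLE_ID_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a recognised stable ID: {text!r}")
    prefix, letter, digits, version = match.groups()
    return StableId(
        species_prefix=prefix,
        feature_type=FEATURE_TYPES[letter],
        number=int(digits),
        version=int(version) if version is not None else None,
    )


if __name__ == "__main__":
    for example in ("ENSG00000139618", "ENSMUST00000103109", "ENSDARP00000006884"):
        print(example, parse_stable_id(example))
```

Returning the species prefix as `None` for human keeps the "no prefix" case explicit rather than encoding it as an empty string.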

Start phase

In protein-coding exons, the start phase is the place where the intron lands inside the codon: 0 between codons, 1 between the first and second base, 2 between the second and third base. Exons therefore have a start phase and an end phase, but introns have just one phase. An exon which is non-coding (or non-coding at the start) has a start phase of -1.
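
For an exon that is coding along its whole length, the end phase follows from the start phase and the exon length by simple modular arithmetic. The sketch below illustrates that relationship only; the function name `end_phase` is hypothetical, and exons that are non-coding at either end (phase -1) are deliberately not handled.

```python
def end_phase(start_phase: int, exon_length: int) -> int:
    """End phase of an exon that is coding along its whole length.

    A start phase of 0, 1 or 2 means the exon begins with the first,
    second or third base of a codon. Adding the exon length and taking
    the remainder modulo 3 gives the position in the codon at which
    the following intron falls.
    """
    if start_phase not in (0, 1, 2):
        raise ValueError("this sketch only handles fully coding exons (phase 0, 1 or 2)")
    if exon_length <= 0:
        raise ValueError("exon length must be positive")
    return (start_phase + exon_length) % 3


if __name__ == "__main__":
    # A 100-base exon starting on a codon boundary holds 33 whole codons
    # plus one extra base, so the intron falls after the first base.
    print(end_phase(0, 100))
```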

STS markers

STS markers are short sequences of genomic DNA that can be uniquely amplified by the polymerase chain reaction (PCR) using a pair of primers. Because each is unique, STSs are often used in linkage and radiation hybrid mapping techniques. STSs serve as landmarks on the physical map of the human genome.

Subject end (BLAST Results)

Position within the subject sequence where the alignment ends, 1-based, inclusive.

Subject name (BLAST Results)

Name of the object (cDNA/peptide) the query sequence aligns to.

Subject ori (BLAST Results)

The orientation of the query (input) sequence compared to the subject (hit).

Subject start (BLAST Results)

Position within the subject sequence where the alignment begins, 1-based, inclusive.
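
Because subject start and subject end are both 1-based and inclusive, the aligned length and the matching substring need a small adjustment in languages that use 0-based, end-exclusive indexing. A minimal sketch, with hypothetical helper names and assuming start is not greater than end:

```python
def aligned_length(start: int, end: int) -> int:
    """Number of residues covered by a 1-based, inclusive start..end range."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return end - start + 1


def aligned_slice(sequence: str, start: int, end: int) -> str:
    """Extract the 1-based, inclusive start..end range from a sequence string."""
    if start < 1 or end < start or end > len(sequence):
        raise ValueError("range does not fit the sequence")
    # Shift the start by one; the inclusive end maps directly onto
    # Python's exclusive slice end.
    return sequence[start - 1:end]


if __name__ == "__main__":
    subject = "ACGTACGTAC"
    print(aligned_length(3, 7), aligned_slice(subject, 3, 7))
```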

Supercontig

Supercontigs or scaffolds are sets of ordered, oriented contigs. They are longer sequences than contigs, but shorter than full chromosomes.

Supercontigs

Assemblies consist of sequence contigs combined into scaffolds, also known as supercontigs. Supercontigs are combined and ordered according to their orientation and linking information provided by mated sequences from the ends of genomic sub-clones. For some species, supercontigs are combined into ultracontigs, in which neighboring supercontigs are organized into their proper order and orientation using linking information provided by the physical map of BAC clones independently assembled using restriction fragment patterns and the FPC program.

SV

Structural variation. It is generally defined as a region of DNA of at least 1 kb in size. It can include inversions, balanced translocations or genomic imbalances (insertions and deletions), commonly referred to as copy number variants (CNVs).

Synteny

The term synteny was originally defined to mean that two gene loci share the same chromosome. In a genomic context we refer to syntenic regions if both sequence and gene order are conserved between two (closely related) species.

Tandem repeats

Multiple copies of the same base sequence on a chromosome; used as markers in physical mapping.

Target % id

Target %id indicates the percentage of the target sequence matching the query sequence.

Toplevel

The largest continuous sequence for an organism. The official technical definition of toplevel sequences is 'sequence regions in the genome assembly that are not a component of another sequence region'. For example, when a genome is assembled into chromosomes, toplevel sequences will be chromosomes and unplaced scaffolds. If a genome has only been assembled into scaffolds, then toplevel sequences are scaffolds and unplaced contigs.
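
The technical definition can be illustrated with a small sketch. It assumes the assembly is available as a plain mapping from each sequence region to the regions it is built from; the function name `toplevel_regions` and the region names are hypothetical and do not reflect how Ensembl stores assemblies.

```python
from typing import Dict, List, Set


def toplevel_regions(components: Dict[str, List[str]]) -> Set[str]:
    """Return the regions that are not a component of any other region.

    `components` maps a sequence region name to the names of the regions
    it was assembled from (an empty list if it has none).
    """
    all_regions: Set[str] = set(components)
    used_as_component: Set[str] = set()
    for parts in components.values():
        all_regions.update(parts)
        used_as_component.update(parts)
    return all_regions - used_as_component


if __name__ == "__main__":
    assembly = {
        "chr1": ["scaffold_1", "scaffold_2"],
        "scaffold_1": ["contig_a", "contig_b"],
        "scaffold_2": ["contig_c"],
        "scaffold_3": ["contig_d"],  # an unplaced scaffold
    }
    print(sorted(toplevel_regions(assembly)))
```

In this toy assembly the chromosome and the unplaced scaffold are toplevel, matching the first example in the definition.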

TR C gene

Gene coding for the constant (C) domain of T cell receptors.

TR D gene

Gene that rearranges at the DNA level and codes the diversity (D) region of the variable domain of T cell receptors.

TR gene (T cell receptor gene)

T cell receptor gene. Havana annotation.

TR J gene

Gene that rearranges at the DNA level and codes the joining (J) region of the variable domain of T cell receptors.

TR J pseudogene

Locus that shares an evolutionary history with the TR J gene but it has been mutated through frameshift and/or stop codon(s) that disrupt the open reading frame.

TR pseudogene

Locus that shares an evolutionary history with T cell receptor genes but it has been mutated through frameshift and/or stop codon(s) that disrupt the open reading frame.

TR V gene

Gene that rearranges at the DNA level and codes the variable (V) region of the variable domain of T cell receptors.

TR V pseudogene

Locus that shares an evolutionary history with the TR V gene but it has been mutated through frameshift and/or stop codon(s) that disrupt the open reading frame.

Transcribed processed pseudogene

Processed pseudogenes that have evidence of transcription through the presence of locus-specific mRNAs and/or ESTs.

Transcribed unprocessed pseudogene

Unprocessed pseudogenes that have evidence of transcription through the presence of locus-specific mRNAs and/or ESTs.

Transcript

Nucleotide sequence resulting from the transcription of the genomic DNA to mRNA. One gene can have different transcripts or splice variants resulting from the alternative splicing of different exons in genes.

Transcript ID

Ensembl transcript identifiers are unique for each splice variant.

Translation start site

The position within an mRNA at which synthesis of a protein begins. The translation start site is usually an AUG codon, but occasionally, GUG or CUG codons are used to initiate protein synthesis.

tRNA

Transfer RNA. These are identified using tRNAscan and the results stored as 'simple features' (not genes). See Ensembl annotation for more information.

tRNA pseudogene

Transfer RNA pseudogene.

TSC (The SNP Consortium)

A non-profit foundation established to make SNP-related information available to the public without intellectual property restrictions.

TSL (Transcript Support Level)

The Transcript Support Level (TSL) is a method to highlight the well-supported and poorly-supported transcript models for users. The method relies on the primary data that can support full-length transcript structure: mRNA and EST alignments supplied by UCSC and Ensembl.

It is important that users understand how to assess transcript annotations that they see in GENCODE. While some transcript models have a high level of support through the full length of their exon structure, there are also transcripts that are poorly supported and that should be considered speculative. 

Method:

The mRNA and EST alignments are compared to the GENCODE transcripts and the transcripts are scored according to how well the alignment matches over its full length. The GENCODE TSL provides a consistent method of evaluating the level of support that a GENCODE transcript annotation is actually expressed in humans. Human transcript sequences from the International Nucleotide Sequence Database Collaboration (GenBank, ENA, and DDBJ) are used as the evidence for this analysis. Exonerate RNA alignments from Ensembl, together with BLAT RNA and EST alignments from the UCSC Genome Browser Database, are used in the analysis. Erroneous transcripts and libraries identified in lists maintained by the Ensembl, UCSC, HAVANA and RefSeq groups are flagged as suspect. GENCODE annotations for protein-coding and non-protein-coding transcripts are compared with the evidence alignments.

Annotations in the MHC region and other immunological genes are not evaluated, as automatic alignments tend to be very problematic. Methods for evaluating single-exon genes are still being developed and they are not included in the current analysis.

Multi-exon GENCODE annotations are evaluated using the criteria that all introns are supported by an evidence alignment and the evidence alignment does not indicate that there are unannotated exons. Small insertions and deletions in evidence alignments are assumed to be due to polymorphisms and not considered as differing from the annotations. All intron boundaries must match exactly. The transcript start and end locations are allowed to differ.
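
The multi-exon criteria can be pictured as an intron-chain comparison. The sketch below is a simplified illustration, not the GENCODE implementation: it assumes exons are given as 1-based inclusive (start, end) pairs on one strand, treats "no unannotated exons" as "the evidence has exactly the same intron chain", and does not model the tolerance for small insertions and deletions. The function names are hypothetical.

```python
from typing import List, Tuple

Exon = Tuple[int, int]    # 1-based, inclusive (start, end)
Intron = Tuple[int, int]  # 1-based, inclusive (start, end)


def introns(exons: List[Exon]) -> List[Intron]:
    """Derive the intron chain lying between consecutive exons."""
    ordered = sorted(exons)
    return [
        (left_end + 1, right_start - 1)
        for (_, left_end), (right_start, _) in zip(ordered, ordered[1:])
    ]


def evidence_supports(annotation: List[Exon], evidence: List[Exon]) -> bool:
    """True if the evidence alignment has exactly the annotated intron chain.

    Only intron boundaries are compared, so the outer ends of the first
    and last exons (transcript start and end) are free to differ.
    """
    annotated = introns(annotation)
    if not annotated:
        raise ValueError("single-exon annotations are not evaluated in this sketch")
    return introns(evidence) == annotated


if __name__ == "__main__":
    model = [(100, 200), (300, 400), (500, 600)]
    print(evidence_supports(model, [(150, 200), (300, 400), (500, 550)]))
```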

Categories

The following categories are assigned to each of the evaluated annotations:

  • tsl1 – all splice junctions of the transcript are supported by at least one non-suspect mRNA
  • tsl2 – the best supporting mRNA is flagged as suspect or the support is from multiple ESTs
  • tsl3 – the only support is from a single EST
  • tsl4 – the best supporting EST is flagged as suspect
  • tsl5 – no single transcript supports the model structure
  • tslNA – the transcript was not analysed for one of the following reasons:
    • pseudogene annotation, including transcribed pseudogenes
    • human leukocyte antigen (HLA) transcript
    • immunoglobulin gene transcript
    • T-cell receptor transcript
    • single-exon transcript (will be included in a future version)
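
The categories above can be read as a decision sequence. The following sketch is one possible reading, not the GENCODE pipeline: the function name `assign_tsl`, its parameters and the order in which the rules are tested are all assumptions. Each count is the number of evidence alignments of that kind that support the full transcript structure.

```python
def assign_tsl(
    good_mrnas: int = 0,
    suspect_mrnas: int = 0,
    good_ests: int = 0,
    suspect_ests: int = 0,
    not_analysed: bool = False,
) -> str:
    """Map supporting-evidence counts to a TSL label (illustrative only)."""
    if not_analysed:
        # Pseudogene, HLA, immunoglobulin, T-cell receptor or single-exon transcript.
        return "tslNA"
    if good_mrnas >= 1:
        # Supported by at least one non-suspect mRNA.
        return "tsl1"
    if suspect_mrnas >= 1 or good_ests >= 2:
        # Best mRNA is suspect, or support comes from multiple ESTs.
        return "tsl2"
    if good_ests == 1:
        # The only support is a single EST.
        return "tsl3"
    if suspect_ests >= 1:
        # Best supporting EST is flagged as suspect.
        return "tsl4"
    # No single transcript supports the model structure.
    return "tsl5"


if __name__ == "__main__":
    print(assign_tsl(good_mrnas=2))
    print(assign_tsl(good_ests=1))
    print(assign_tsl())
```

Testing the rules from strongest to weakest support means a transcript always receives the best level its evidence allows.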

UK10K

A project that aims to understand the correlation between low-frequency and rare genetic changes and human disease by studying the genetic code of 10,000 people in the UK.

Unigene

UniGene is an experimental system for automatically partitioning GenBank sequences into a non-redundant set of gene-oriented clusters. Each UniGene cluster contains sequences that represent a unique gene, as well as related information such as the tissue types in which the gene has been expressed and map location.

UniProt/Swiss-Prot

UniProt (Universal Protein Resource) is the world's most comprehensive catalogue of information on proteins. UniProt/Swiss-Prot is a curated protein sequence database that provides a high level of annotation, a minimal level of redundancy and a high level of integration with other databases. Swiss-Prot is maintained collaboratively by the Swiss Institute of Bioinformatics (SIB) and the European Bioinformatics Institute (EBI).

UniProt/TrEMBL

SPTrEMBL is a subset of TrEMBL (Translated EMBL database) containing the computer-annotated protein translations of all coding sequences (CDS) present in the EMBL nucleotide database that are not yet incorporated into the UniProt/Swiss-Prot database.

UniSTS

UniSTS is an NCBI resource for non-redundant Sequence Tagged Sites (STS) markers. For each marker, UniSTS displays the primer sequences, product size, and mapping information, as well as cross references to dbSNP, RHdb, GDB, MGD, etc. The marker report also lists GenBank and RefSeq records that contain the primer sequences determined by ePCR.

Unitary pseudogene

A species-specific unprocessed pseudogene without a parent gene in the same species but with an active orthologue in another species.

Unprocessed pseudogene

Pseudogenes that can contain introns since they are produced by gene duplication.

UTR (Untranslated Region)

The 5' UTR is the portion of an mRNA from the 5' end to the position of the first codon used in translation. The 3' UTR is the portion of an mRNA from the position of the last codon that is used in translation to the 3' end.
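
Given the positions of the first and last translated codons, the two UTRs can be cut out of an mRNA sequence directly. A minimal sketch, with a hypothetical function name and made-up example sequence; positions are assumed to be 1-based and inclusive, and `cds_end` is taken to be the last base of the last codon used in translation, however the caller chooses to define that codon.

```python
from typing import Tuple


def split_utrs(mrna: str, cds_start: int, cds_end: int) -> Tuple[str, str, str]:
    """Return (5' UTR, coding region, 3' UTR) of an mRNA sequence.

    cds_start is the first base of the first translated codon and
    cds_end the last base of the last translated codon, both 1-based.
    """
    if not 1 <= cds_start <= cds_end <= len(mrna):
        raise ValueError("coding region does not fit inside the mRNA")
    five_prime_utr = mrna[:cds_start - 1]
    coding = mrna[cds_start - 1:cds_end]
    three_prime_utr = mrna[cds_end:]
    return five_prime_utr, coding, three_prime_utr


if __name__ == "__main__":
    # Toy sequence: 3-base 5' UTR, 9-base coding region, 2-base 3' UTR.
    print(split_utrs("GGCAUGAAAUAGCC", 4, 12))
```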

Vega genes

Vega genes from the Vertebrate Genome Annotation (VEGA) database include manual annotation of specific Human, Mouse, and Zebrafish clones. Annotation is performed on a clone-by-clone basis using a combination of similarity searches against DNA and protein databases and ab initio gene prediction applications (Genscan, Fgenes). Comparative analysis using vertebrate datasets is used to aid novel gene discovery. The data gathered in these steps are then used to manually annotate the clone, adding gene structures, descriptions and poly-A features. The annotation is based on supporting evidence only.

VEP (Variant Effect Predictor)

Ensembl Tool that allows users to provide a list of variants and export a results file containing consequence types.

Within species paralogues (within species paralogs)

Two or more versions of a duplicated gene in a single species. In a gene tree, the genes are separated by a duplication node.

YAC (Yeast Artificial Chromosome)

Derived from a bacterial plasmid, a YAC contains a yeast centromeric region (CEN), a yeast origin of DNA replication, a cluster of unique restriction sites, a selectable marker and a telomere region at the end of each arm. YACs are capable of cloning extremely large segments of DNA (over 1 megabase long) into a host cell, where the DNA is propagated along with the other chromosomes of the yeast cell.

ZFIN (ZebraFish Information Network)

A database for the zebrafish model organism that holds information on wild-type stocks, mutants, genes, gene expression data, and map markers.