Human
Currently, there are two widely used releases: GRCh38 (hg38) and GRCh37 (hg19).
GRCh38 (hg38)
Sequences
- File name: GRCh38_no_alt_analysis_set_GCA_000001405.15.fa.gz (MD5 checksum: a08035b6a6e31780e96a34008ff21bd6)
- Local path: /References/Sequences/human/hg38/GRCh38_no_alt_analysis_set_GCA_000001405.15.fa.gz
- Remote backup: OSF
- Description: This file contains sequences for the following:
- chromosomes from the GRCh38 Primary Assembly (PA);
- mitochondrial genome from the GRCh38 non-nuclear assembly;
- unlocalized scaffolds from PA;
- unplaced scaffolds from PA;
- Epstein-Barr virus (EBV) sequence.
- Recipe:
wget https://www.encodeproject.org/files/GRCh38_no_alt_analysis_set_GCA_000001405.15/@@download/GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.gz;
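Every file on this page is listed with an MD5 checksum, so a fresh download can be compared against it. The following is a minimal sketch, assuming GNU coreutils `md5sum` is available; `verify_md5` is a hypothetical helper name, not part of any tool mentioned here.

```shell
#!/usr/bin/env bash
# verify_md5 FILE EXPECTED_MD5
# Hypothetical helper: prints OK and returns 0 when the MD5 digest of FILE
# equals EXPECTED_MD5; otherwise prints a MISMATCH line and returns 1.
# Assumes GNU coreutils `md5sum` is on PATH.
verify_md5() {
    local file="$1" expected="$2" actual
    # Reading from stdin keeps the file name out of md5sum's output.
    actual=$(md5sum < "$file" | awk '{print $1}')
    if [ "$actual" = "$expected" ]; then
        echo "OK"
    else
        echo "MISMATCH: expected $expected, got $actual"
        return 1
    fi
}

# Usage, with the file name and checksum listed above (the file must exist locally):
# verify_md5 GRCh38_no_alt_analysis_set_GCA_000001405.15.fa.gz a08035b6a6e31780e96a34008ff21bd6
```

The same call works for any other file on this page by swapping in its name and listed checksum.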
Annotations
Gene annotations
There are three major releases of gene annotations for Homo sapiens:
- GENCODE/Ensembl annotation: The GENCODE annotation is built from the Ensembl annotation, so the gene annotations are the same in both. The only exception is that genes shared by the pseudoautosomal regions (PAR) of chromosomes X and Y appear twice in the GENCODE GTF, whereas the Ensembl file lists them only on chromosome X. Gene and transcript IDs are identical in both releases except for annotations in the PAR regions. Compared with other annotations, GENCODE provides higher coverage of non-coding regions.
- RefSeq Gene (RefGene): Annotations for well-characterized genes (mostly protein-coding genes). Projects like Gene Ontology, KEGG, and MSigDB (Molecular Signatures Database, gene sets for GSEA) use this annotation for their gene identifiers, so RefGene may be the preferred annotation if you want to run enrichment analysis with GO/KEGG/GSEA.
- UCSC Known genes: Automatically generated annotations (based on protein sequences from Swiss-Prot), mostly for protein-coding genes.
GENCODE
- File name: gencode.v24.annotation.gtf.gz (MD5 checksum: 17395005bb4471605db62042b992893e)
- Local path: /References/Annotations/human/hg38/gencode.v24.annotation.gtf.gz
- Remote backup: OSF
- Description: GENCODE comprehensive annotation release 24. Downloaded from GENCODE's website.
- Recipe:
wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_24/gencode.v24.annotation.gtf.gz
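The recipes on this page address GTF attributes by whitespace-separated field position (for example `$14` and `$18`), so it helps to confirm which attribute sits at which position before reusing them. The sketch below does this on a single illustrative transcript record. The attribute order (gene_id, transcript_id, gene_type, gene_status, gene_name) is an assumption about GENCODE v24 transcript lines and should be checked against the real file, since positions can shift between releases. `list_fields` and `to_bed` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Illustrative transcript record (for demonstration only), laid out in the
# attribute order assumed for GENCODE v24:
# gene_id, transcript_id, gene_type, gene_status, gene_name, ...
gtf_line=$(printf 'chr1\tHAVANA\ttranscript\t11869\t14409\t.\t+\t.\tgene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1";')

# Print every whitespace-separated field with its index, so positions such as
# $14 (gene_type) and $18 (gene_name) can be confirmed by eye.
list_fields() {
    awk '{for (i = 1; i <= NF; i++) print i "\t" $i}'
}

# Convert transcript records to 0-based BED-like columns:
# chrom, start, end, gene_name, strand, gene_type (quotes and semicolons removed).
to_bed() {
    awk 'BEGIN{OFS="\t"} $3=="transcript" {print $1,$4-1,$5,$18,$7,$14}' | tr -d '";'
}

printf '%s\n' "$gtf_line" | list_fields
printf '%s\n' "$gtf_line" | to_bed
```

On the real file, the same check can be run on the first transcript line, for example with `zcat gencode.v24.annotation.gtf.gz | awk '$3=="transcript"' | head -n 1 | list_fields`.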
- File name: gencode.v24.segmented.tssup1kb.bed.gz (MD5 checksum: 972a57431c6209667d5aac41bbb01ebd)
- Local path: /References/Annotations/human/hg38/gencode.v24.segmented.tssup1kb.bed.gz
- Remote backup: OSF
- Description: Genomic segmentations (promoter, 5_UTR, exon, intron, 3_UTR, and intergenic region) based on GENCODE v24. Promoters are defined as the 1 kb upstream of each TSS (per transcript).
- Recipe:
# promoters for protein-coding genes
zcat gencode.v24.annotation.gtf.gz | \
awk 'BEGIN{OFS="\t"} $3=="transcript" {print $1,$4-1,$5,$18,"promoter",$7,$14}' | tr -d '";' | \
awk 'BEGIN{OFS="\t";FS="\t"}{if ($7=="protein_coding"){print $1,$2,$3,$4,$5,$6,$2,$3,"102,194,165"}}' | \
bedtools flank -i - -g hg38.genome -l 1000 -r 0 -s > promoters_1kb_p.bed
# promoters for non-protein-coding genes
zcat gencode.v24.annotation.gtf.gz | \
awk 'BEGIN{OFS="\t"} $3=="transcript" {print $1,$4-1,$5,$18,"promoter(NP)",$7,$14}' | tr -d '";' | \
awk 'BEGIN{OFS="\t";FS="\t"}{if ($7!="protein_coding"){print $1,$2,$3,$4,$5,$6,$2,$3,"102,194,165"}}' | \
bedtools flank -i - -g hg38.genome -l 1000 -r 0 -s > promoters_1kb_np.bed
- File name: gencode.v24.segmented.tssflanking500b.bed.gz (MD5 checksum: 2e624c3bc2330beb81464558ead1a11e)
- Local path: /References/Annotations/human/hg38/gencode.v24.segmented.tssflanking500b.bed.gz
- Remote backup: OSF
- Description: Genomic segmentations based on GENCODE v24. Promoters are defined as TSS ± 500 bp (per transcript).
- Recipe:
# promoters for protein-coding genes
zcat gencode.v24.annotation.gtf.gz | \
awk 'BEGIN{OFS="\t"} $3=="transcript" {print $1,$4-1,$5,$18,"promoter",$7,$14}' | tr -d '";' | \
awk 'BEGIN{OFS="\t";FS="\t"}{if ($7=="protein_coding"){print $1,$2,$3,$4,$5,$6,$2,$3,"102,194,165"}}' | \
bedtools flank -i - -g hg38.genome -l 500 -r 0 -s | \
bedtools slop -i - -g hg38.genome -l 0 -r 500 -s > promoters_500bp.bed
# promoters for non-protein-coding genes
zcat gencode.v24.annotation.gtf.gz | \
awk 'BEGIN{OFS="\t"} $3=="transcript" {print $1,$4-1,$5,$18,"promoter(NP)",$7,$14}' | tr -d '";' | \
awk 'BEGIN{OFS="\t";FS="\t"}{if ($7!="protein_coding"){print $1,$2,$3,$4,$5,$6,$2,$3,"102,194,165"}}' | \
bedtools flank -i - -g hg38.genome -l 500 -r 0 -s | \
bedtools slop -i - -g hg38.genome -l 0 -r 500 -s > np_promoters_500bp.bed
# intergenic
zcat gencode.v24.annotation.gtf.gz | \
awk 'BEGIN{OFS="\t"} $3=="gene" {print $1,$4-1,$5,$10,$16,$7}' | \
tr -d '";' | \
bedtools slop -i - -g hg38.genome -l 500 -r 0 -s | \
sortBed -g ../hg38.genome | \
bedtools complement -i stdin -g ../hg38.genome | \
awk 'BEGIN{OFS="\t";FS="\t"}{print $1,$2,$3,".","intergenic",".",$2,$3,"141,160,203"}' > intergenic_500bp.bed
# exons
zcat gencode.v24.annotation.gtf.gz | \
awk 'BEGIN{OFS="\t";} $3=="exon" {print $1,$4-1,$5,$18,"exon",$7}' | \
tr -d '";' | \
sortBed -g ../hg38.genome | \
mergeBed -i - -c 4,5,6 -o distinct,distinct,distinct -s | \
awk 'BEGIN{OFS="\t";FS="\t"}{print $1,$2,$3,$4,$5,$6,$2,$3,"231,138,195"}' > exons.bed
# introns
zcat gencode.v24.annotation.gtf.gz | \
awk 'BEGIN{OFS="\t";} $3=="gene" {print $1,$4-1,$5,$16,"intron",$7}' | \
tr -d '";' | \
sortBed -g ../hg38.genome | \
subtractBed -a stdin -b exons.bed | \
awk 'BEGIN{OFS="\t";FS="\t"}{print $1,$2,$3,$4,$5,$6,$2,$3,"255,217,47"}' > introns.bed
# UTR, Perl script from https://davetang.org/muse/2012/09/12/gencode/
get_35_utr.pl gencode.v24.annotation.gtf.gz | \
awk 'BEGIN{OFS="\t";FS="\t"}{if ($5=="3_UTR"){print $1,$2,$3,$4,$5,$6,$2,$3,"166,216,84"}else{print $1,$2,$3,$4,$5,$6,$2,$3,"252,141,98"}}' > utr.bed
cat intergenic_500bp.bed promoters_500bp.bed np_promoters_500bp.bed utr.bed introns.bed exons.bed | sort -k1,1 -k2,2n | bgzip > gencode.v24.segmented.tssflanking500b.bed.gz
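The two promoter definitions used above (1 kb upstream of the TSS, and TSS ± 500 bp) both depend on strand: for a transcript on the minus strand the TSS is at the BED end coordinate, and "upstream" means higher coordinates. The sketch below reproduces that arithmetic in plain awk so the expected windows can be checked by hand. It is not a replacement for `bedtools flank` and `bedtools slop`, and it reflects my reading of their `-s` behavior. `promoter_window` is a hypothetical helper, and unlike bedtools it does not clip windows at chromosome ends.

```shell
#!/usr/bin/env bash
# promoter_window UP DOWN
# Reads BED6 (chrom, start, end, name, label, strand) on stdin and prints the
# window from UP bp upstream to DOWN bp downstream of each TSS, strand-aware.
# Starts are clipped at 0; clipping at chromosome ends is NOT handled here
# (bedtools does that using the .genome file).
promoter_window() {
    awk -v up="$1" -v down="$2" 'BEGIN{OFS=FS="\t"}
    {
        if ($6 == "-") { tss = $3; s = tss - down; e = tss + up }   # minus strand: TSS at end
        else           { tss = $2; s = tss - up;   e = tss + down } # plus strand: TSS at start
        if (s < 0) s = 0
        $2 = s; $3 = e
        print
    }'
}

# Illustrative records (coordinates chosen for demonstration only).
printf 'chr1\t11868\t14409\tDDX11L1\tpromoter\t+\n' | promoter_window 1000 0   # 1 kb upstream
printf 'chr1\t14403\t29570\tWASH7P\tpromoter(NP)\t-\n' | promoter_window 500 500 # TSS +/- 500 bp
```

Comparing a few lines of this output with the corresponding lines from the bedtools pipeline is a quick way to catch a missing `-s` flag or a swapped `-l`/`-r` value.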
RefGene
- File name: refseq.ver109.20190125.annotation.gtf.gz (MD5 checksum: 848813de5b516e0f328046ef9c931091)
- Local path: /References/Annotations/human/hg38/refseq.ver109.20190125.annotation.gtf.gz
- Remote backup: OSF
- Description: RefSeq annotation in GTF format that has been remapped to use the same set of UCSC-style sequence identifiers used in the FASTA files. The annotation is NCBI Homo sapiens Updated Annotation Release 109.20190125 from 25 January 2019.
- Recipe:
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_full_analysis_set.refseq_annotation.gtf.gz
mv GCA_000001405.15_GRCh38_full_analysis_set.refseq_annotation.gtf.gz refseq.ver109.20190125.annotation.gtf.gz
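Because this GTF is described as using the same UCSC-style sequence identifiers as the FASTA files, one simple sanity check is to list any sequence names in the GTF that are absent from a FASTA index. The sketch below assumes a `.fai` index exists for the FASTA; `missing_seqnames` is a hypothetical helper. Names it reports are not necessarily errors: the GTF comes from the full analysis set, which may include sequences that the no_alt FASTA above leaves out.

```shell
#!/usr/bin/env bash
# missing_seqnames FAI
# Reads an uncompressed GTF on stdin and prints each sequence name (column 1)
# that does not appear in the first column of the FASTA index FAI.
# Comment lines starting with '#' are skipped; each name is printed once.
missing_seqnames() {
    awk -F'\t' -v fai="$1" '
        BEGIN {
            # Load known sequence names from the .fai index.
            while ((getline line < fai) > 0) { split(line, f, "\t"); known[f[1]] = 1 }
        }
        /^#/ { next }
        !($1 in known) && !seen[$1]++ { print $1 }'
}

# Usage (the index file name is an assumption; adjust to the local index):
# zcat refseq.ver109.20190125.annotation.gtf.gz | missing_seqnames input.fa.fai
```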
Other annotations
Repeat Masker
- File name: rmsk.bed.gz (MD5 checksum: ae12aefbef9d4f5bc7695158a67d9a55)
- Local path: /References/Annotations/human/hg38/rmsk.bed.gz
- Remote backup: OSF
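- Description: RepeatMasker track from UCSC. The following fields were selected (the recipe writes them in the order genoName, genoStart, genoEnd, repName, repFamily, strand):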
- genoName (Genomic sequence name)
- genoStart (Start in genomic sequence)
- genoEnd (End in genomic sequence)
- strand (Relative orientation + or -)
- repName (Name of repeat)
- repFamily (Family of repeat).
- Recipe:
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/rmsk.txt.gz
gunzip rmsk.txt.gz
gawk 'BEGIN{OFS="\t"} {print $6,$7,$8,$11,$13,$10}' rmsk.txt | \
sort -k1,1 -k2,2n | \
bgzip > rmsk.bed.gz
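To make the column selection in this recipe easier to follow, the sketch below applies the same field reordering to one illustrative `rmsk.txt` record. The column layout (bin, swScore, milliDiv, milliDel, milliIns, genoName, genoStart, genoEnd, genoLeft, strand, repName, repClass, repFamily, ...) is my understanding of the UCSC `rmsk` table and is worth confirming against the table schema. `rmsk_to_bed` is a hypothetical helper, and plain `awk` is used here in place of `gawk` only for portability.

```shell
#!/usr/bin/env bash
# rmsk_to_bed
# Reads rmsk.txt-style records on stdin and prints BED-like columns:
# genoName, genoStart, genoEnd, repName, repFamily, strand.
rmsk_to_bed() {
    awk 'BEGIN{OFS="\t"} {print $6,$7,$8,$11,$13,$10}'
}

# Illustrative record (values for demonstration only).
rmsk_line=$(printf '585\t1504\t13\t4\t13\tchr1\t10000\t10468\t-248945954\t+\t(CCCTAA)n\tSimple_repeat\tSimple_repeat\t1\t463\t0\t1')

printf '%s\n' "$rmsk_line" | rmsk_to_bed
```

Placing strand in column 6 matches the BED convention, which is what strand-aware tools such as bedtools expect.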
Generic
Sequences
- Primary assembly:
- rRNA: Human ribosomal DNA complete repeating unit, GenBank accession code: U13369.1.
Annotations
Motif databases (MEME)
- File name: motif_databases.12.19.tgz (MD5 checksum: f5ffcaecc07570ee19dba20b82d7bd73)
- Local path: /References/Annotations/human/generic/motif_databases.12.19.tgz
- Remote backup: OSF
- Description: Motif databases for MEME suite (updated 28 Oct 2019).
- Recipe:
wget http://alternate.meme-suite.org/meme-software/Databases/motifs/motif_databases.12.19.tgz
Note
- For all FASTA files, three standard companion files are also generated:
  - .fai: index that allows fast random access to any sequence in the indexed FASTA file. It is generated with the following command:
    samtools faidx input.fa
  - .genome: table with two columns, giving the length of each chromosome.
    cut -f1,2 input.fa.fai > size.genome
  - .dict: sequence dictionary, generated with Picard:
    java -jar picard.jar CreateSequenceDictionary \
    R=input.fa \
    O=input.dict
- There are two types of promoters in both gencode.v24.segmented.tssup1kb.bed and gencode.v24.segmented.tssflanking500b.bed:
  - Promoters for protein-coding genes (denoted as promoter in these files)
  - Promoters for non-protein-coding genes (denoted as promoter(NP))
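Since the two promoter types differ only in the label stored in column 5 of the segmented BED files, they can be separated with an exact match on that column. The sketch below is one way to do it; `select_feature` is a hypothetical helper and the records are illustrative. An exact comparison is used on purpose, because a plain substring search for `promoter` would also match `promoter(NP)`.

```shell
#!/usr/bin/env bash
# select_feature LABEL
# Reads a tab-separated segmented BED file on stdin and prints only the records
# whose column 5 equals LABEL exactly (e.g. promoter, promoter(NP), exon).
select_feature() {
    awk -F'\t' -v label="$1" '$5 == label'
}

# Illustrative records in the 9-column layout produced by the recipes above.
toy_bed=$(printf 'chr1\t10868\t11868\tDDX11L1\tpromoter(NP)\t+\t10868\t11868\t102,194,165\nchr1\t64418\t65418\tOR4F5\tpromoter\t+\t64418\t65418\t102,194,165\nchr1\t65418\t65433\tOR4F5\texon\t+\t65418\t65433\t231,138,195')

printf '%s\n' "$toy_bed" | select_feature 'promoter'
printf '%s\n' "$toy_bed" | select_feature 'promoter(NP)'
```

On the real files the same helper can be fed from `zcat`, for example `zcat gencode.v24.segmented.tssup1kb.bed.gz | select_feature 'promoter(NP)'`.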