If a human being had actually looked at his blood, anywhere along the way, instead of just running tests through the computer… parasites would have jumped right out at them.
“Failure to Communicate.” House M.D.
In one episode of the TV show House M.D., a patient is infected with a parasite; Dr. House’s team runs many tests and finds no clue. In the end, Dr. House tells the team to look at the blood sample under a microscope instead of relying on numbers from instruments. The team does so and instantly sees the parasite. Even though the medical details may not be very accurate, the episode left a profound impression on me. The same situation arises in bioinformatics: we run many tests to identify targets of interest or to evaluate our confidence in a conclusion. However, these tests all rest on assumptions, and it is always worth checking the data visually to get an intuitive sense of whether the data fit those assumptions. The Integrative Genomics Viewer (IGV) is a powerful tool for visualizing sequencing data. In this post, I’ll share some of my tricks for making IGV even more useful.
Case 1: There are a bunch of loci that need to be checked visually
If you need to visually check many regions of interest in IGV, you can automate the process with a batch script. An IGV batch script is a plain-text file that tells IGV which regions to display and where to save the snapshots. Instead of learning the details of this minimal language yourself, you can use `bedtools` to create a batch script from a list of loci in a BED file. In the following example, the genomic loci in `loci.bed` are first extended by 200 bp both upstream and downstream, and then a batch script covering these regions is saved to the file `batch.script`.
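```bash
# snapshots will be saved to `path_to_store_snapshots`; assuming `bedtools igv`, -slop adds the 200 bp flanks
bedtools igv -i loci.bed -slop 200 -path path_to_store_snapshots > batch.script
```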
After loading the tracks you want to see in IGV, click `Tools > Run Batch Script...` and load the `batch.script` file; IGV will start capturing a snapshot of each locus. If the process is slow, you can split the BED file, generate a batch script for each subset, and load them into separate instances of IGV.
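To make the contents of such a batch script less of a black box, here is a minimal Python sketch that builds one by hand and splits the loci into subsets. It is not how bedtools works internally; it only relies on the IGV batch commands `snapshotDirectory`, `goto`, and `snapshot`. The function names (`parse_bed`, `batch_script`, `split_loci`) and the snapshot file naming are hypothetical, and the padded regions are not clamped at chromosome ends, since that would require chromosome sizes.

```python
from typing import Iterable, List, Tuple

Locus = Tuple[str, int, int]  # (chrom, start, end), 0-based half-open as in BED


def parse_bed(lines: Iterable[str]) -> List[Locus]:
    """Read the first three BED columns, skipping headers and blank lines."""
    loci = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split("\t")[:3]
        loci.append((chrom, int(start), int(end)))
    return loci


def batch_script(loci: Iterable[Locus], snapshot_dir: str, flank: int = 200) -> str:
    """Build IGV batch commands: one goto/snapshot pair per padded locus."""
    commands = [f"snapshotDirectory {snapshot_dir}"]
    for chrom, start, end in loci:
        # BED starts are 0-based; the goto command takes 1-based coordinates.
        left = max(1, start + 1 - flank)
        right = end + flank  # not clamped to the chromosome length
        commands.append(f"goto {chrom}:{left}-{right}")
        commands.append(f"snapshot {chrom}_{left}_{right}.png")
    return "\n".join(commands) + "\n"


def split_loci(loci: List[Locus], n_parts: int) -> List[List[Locus]]:
    """Split loci into at most n_parts contiguous, roughly equal chunks."""
    size = max(1, -(-len(loci) // n_parts))  # ceiling division, at least 1
    return [loci[i:i + size] for i in range(0, len(loci), size)]


if __name__ == "__main__":
    bed_lines = ["chr1\t1000\t1100", "chr2\t50\t150"]
    loci = parse_bed(bed_lines)
    # One script per subset, each meant for a separate IGV instance.
    for part in split_loci(loci, 2):
        print(batch_script(part, "path_to_store_snapshots"))
```

Each chunk returned by `split_loci` can be written to its own script file and loaded into a separate IGV instance.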
Case 2: Frequently used annotations are not listed in the default server
The IGV team maintains a fabulous web server with commonly used annotations (like gene annotations from the GENCODE project) and datasets (like ChIP-seq alignments from the ENCODE project); by selecting the annotations of interest from `File > Load from Server...`, you can load them into your current session. One small pitfall is that these annotations and datasets are not always up to date, and you may want some frequently used (or customized) files of your own listed there as well. In that case, consider setting up your own data server for IGV.
Step 1: Copy precompiled data files from IGV
You can get a copy of all genome files that IGV is currently using from their GitHub repo:
```bash
git clone https://github.com/igvteam/igv.git
```
After cloning, copy the entire `igv/genomes` folder to a new location (assumed here to be `/nas1/references`), which will serve as the root of your data server.
Step 2: Install and configure a web server
If you already have a web server, skip to Step 3. Mac users can install Nginx with Homebrew:
```bash
brew install nginx
```
By default, the configuration file for Nginx (installed by Homebrew) is located at `/usr/local/etc/nginx/nginx.conf`. In the `http` section, add a new server configuration as follows:
```nginx
server {
    # …
}
```
Reload the configuration to make the changes take effect:
```bash
nginx -s reload
```
Create a new directory (`annotations`) in `/nas1/references`. The structure of this folder now looks something like this:
- references
  - db
    - 1kg_ref
    - …
    - hg19
    - hg38
    - mm10
    - …
  - sizes
    - 1kg_ref.chrom.sizes
    - …
    - hg19.chrom.sizes
    - hg38.chrom.sizes
    - mm10.chrom.sizes
    - …
  - annotations
  - genomes.tab
  - genomes.txt
  - db
Step 3: Save new annotations and modify data files
Let’s assume you have processed a new annotation file (e.g., GENCODE v35 for hg38, processed with the pipeline mentioned in the previous post). Move the file to `/nas1/references/annotations`, then modify the default genome and data registry:
- Change the content of `db/hg38/hg38_dataServerRegistry.txt` from `https://s3.amazonaws.com/igv.org.genomes/hg38/hg38_annotations.xml` to `https://ref.yaobio.com/db/hg38/hg38_annotations.xml`
- Add the following item to `db/hg38/hg38_annotations.xml`:

```xml
<Resource name="Gencode V35" path="http://ref.yaobio.com/annotations/gencode.v35.annotation.sorted.gtf.gz" index="http://ref.yaobio.com/annotations/gencode.v35.annotation.sorted.gtf.gz.tbi" hyperlink="http://www.gencodegenes.org/"/>
```
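If you register new annotations often, the two edits above can also be scripted. The following is a minimal Python sketch using only the standard library. It assumes that the annotations XML groups `<Resource>` elements inside `<Category>` elements; the category name `Annotations` and the function names are hypothetical, so check them against your own copy of the file.

```python
import xml.etree.ElementTree as ET


def swap_registry_url(registry_text: str, old_url: str, new_url: str) -> str:
    """Replace old_url with new_url in the registry file's content."""
    if old_url not in registry_text:
        raise ValueError(f"{old_url} not found in registry")
    return registry_text.replace(old_url, new_url)


def add_resource(xml_text: str, category: str, attrs: dict) -> str:
    """Append a <Resource> to the named <Category>, creating it if absent."""
    root = ET.fromstring(xml_text)
    node = next(
        (c for c in root.iter("Category") if c.get("name") == category), None
    )
    if node is None:
        node = ET.SubElement(root, "Category", {"name": category})
    ET.SubElement(node, "Resource", attrs)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Toy stand-in for db/hg38/hg38_annotations.xml (assumed layout).
    xml_text = '<Global name="hg38" version="1"><Category name="Annotations"/></Global>'
    base = "http://ref.yaobio.com/annotations/gencode.v35.annotation.sorted.gtf.gz"
    print(add_resource(xml_text, "Annotations", {
        "name": "Gencode V35",
        "path": base,
        "index": base + ".tbi",
        "hyperlink": "http://www.gencodegenes.org/",
    }))
```

Note that `xml.etree.ElementTree` drops comments and the XML declaration when it re-serializes a file, so keep a backup of the original before overwriting it.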
Step 4: Change the data server setting in IGV
Now in IGV, click `View > Preferences > Advanced` and replace the value of `Data registry url` with `http://ref.yaobio.com/db/$$/$$_dataServerRegistry.txt`. Finally, save the changes and restart IGV; you should now be able to see and load the newly added annotations.
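The `$$` in the registry URL is a placeholder that IGV fills in with the ID of the currently loaded genome, which is why the registry files live under per-genome folders. A small Python sketch of that substitution, useful for checking which URLs your server has to answer (the helper name `registry_url` is hypothetical):

```python
def registry_url(template: str, genome_id: str) -> str:
    """Fill every `$$` placeholder in the template with the genome ID."""
    return template.replace("$$", genome_id)


TEMPLATE = "http://ref.yaobio.com/db/$$/$$_dataServerRegistry.txt"

if __name__ == "__main__":
    # List the registry URLs that would be requested for a few genomes.
    for genome in ("hg19", "hg38", "mm10"):
        print(registry_url(TEMPLATE, genome))
```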
General tips
Always load bed files with indices
Always create an index for a BED file before loading it into IGV. Otherwise, IGV reads every interval into memory and builds the index on the fly, which consumes excessive memory and time. You can use `tabix` to index interval files beforehand. As an example, take a BED file `test_file_1.bed.gz` with 18M records: loaded into IGV without an index, it takes more than 25GB of memory! But if you first run `tabix test_file_1.bed.gz` to generate the index and then load the same file, it takes only 2GB!
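To avoid loading an unindexed file by accident, a quick pre-flight check can help. The sketch below is a hypothetical helper, not part of IGV or tabix: it assumes the index sits next to the data file as `<file>.tbi` (tabix's default output name) and prints the `tabix -p bed` command for each file that still needs one, without running it.

```python
import os
from typing import Callable, Iterable, List


def unindexed(
    paths: Iterable[str], exists: Callable[[str], bool] = os.path.exists
) -> List[str]:
    """Return the compressed files that have no `.tbi` index next to them."""
    return [p for p in paths if p.endswith(".gz") and not exists(p + ".tbi")]


def tabix_command(path: str) -> List[str]:
    """Command line that indexes a sorted, bgzip-compressed BED file."""
    return ["tabix", "-p", "bed", path]


if __name__ == "__main__":
    for path in unindexed(["test_file_1.bed.gz"]):
        print(" ".join(tabix_command(path)))
```

The `exists` parameter is only there so the check can be tried out without touching the file system.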
Conclusion
IGV is a valuable tool for visualizing sequencing data, and these tips and tricks can help you make the most of its capabilities. By using batch scripting, setting up a personal data server, and optimizing bed file loading with indices, you can streamline your bioinformatics workflows and gain more insights from your data.