Range Downloads
Extract specific genomic regions from BAM and VCF files using coordinates or BED files with automatic tool integration.
Overview
Range downloads allow you to extract specific genomic regions instead of downloading entire files, providing:
- Focused data retrieval - Select regions of interest
- Reduced data transfer - Download only what you need
- Storage efficiency - Smaller files for specific analyses
- Faster processing - Work with relevant data quickly
Prerequisites
Range downloads require external bioinformatics tools:
| Tool | Version | Purpose |
|---|---|---|
| samtools | v1.17+ | BAM file processing and region extraction |
| tabix | v1.20+ | VCF file indexing and querying |
| bgzip | v1.20+ | VCF file compression |
Tool Installation
Ubuntu/Debian:
sudo apt update
sudo apt install samtools tabix
macOS (Homebrew):
brew install samtools htslib
Conda/Mamba:
conda install -c bioconda samtools tabix
Verification
Check tool availability and versions:
samtools --version | head -1
tabix --version 2>&1 | head -1
bgzip --version 2>&1 | head -1
The tool automatically verifies these dependencies when using range features.
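The minimum versions in the table above can also be compared by hand. The following is a minimal sketch, not the tool's own dependency check: version_ge is a hypothetical helper, and it assumes purely numeric, dot-separated version strings such as 1.17 or 1.20.1.

```shell
#!/bin/sh
# version_ge A B: succeed when version A >= version B.
# Hypothetical helper; assumes dot-separated numeric fields (e.g. 1.17, 1.20.1).
version_ge() {
  v1=$1
  v2=$2
  while [ -n "$v1" ] || [ -n "$v2" ]; do
    f1=${v1%%.*}
    f2=${v2%%.*}
    # Drop the field just read; clear the string when no dot is left.
    case $v1 in *.*) v1=${v1#*.} ;; *) v1= ;; esac
    case $v2 in *.*) v2=${v2#*.} ;; *) v2= ;; esac
    # A missing field counts as 0, so 1.20 equals 1.20.0.
    f1=${f1:-0}
    f2=${f2:-0}
    if [ "$f1" -gt "$f2" ]; then return 0; fi
    if [ "$f1" -lt "$f2" ]; then return 1; fi
  done
  return 0
}

# Compare the installed samtools (if any) against the documented minimum.
if command -v samtools >/dev/null 2>&1; then
  found=$(samtools --version | head -1 | awk '{print $2}')
  if version_ge "$found" 1.17; then
    echo "samtools $found is new enough"
  else
    echo "samtools $found is older than 1.17"
  fi
fi
```

A field-by-field numeric comparison is used because a plain string comparison would rank 1.9 above 1.17.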
Basic Range Syntax
Genomic Coordinates
Use standard genomic coordinate format: chromosome:start-end
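Before a download is started, a region string in this format can be sanity-checked. This is a sketch only: is_valid_region is a hypothetical helper, the tool's own validation may differ, and the pattern assumes contig names made of letters, digits, underscores, and dots.

```shell
#!/bin/sh
# is_valid_region REGION: succeed when REGION looks like chromosome:start-end
# with 1 <= start <= end. Hypothetical helper, not part of varvis-download.
is_valid_region() {
  # Shape check: contig name, colon, digits, dash, digits.
  printf '%s\n' "$1" | grep -Eq '^[A-Za-z0-9_.]+:[0-9]+-[0-9]+$' || return 1
  range=${1##*:}
  start=${range%-*}
  end=${range#*-}
  [ "$start" -ge 1 ] && [ "$start" -le "$end" ]
}

for region in "chr1:1000000-2000000" "chr1:1000000_2000000" "chr2:500-100"; do
  if is_valid_region "$region"; then
    echo "ok:      $region"
  else
    echo "invalid: $region"
  fi
done
```

The second and third strings are rejected: one uses an underscore instead of a dash, the other has a start greater than its end.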
# Single genomic region
./varvis-download.js -t mytarget -a 12345 -g "chr1:1000000-2000000"
# Multiple regions (space-separated)
./varvis-download.js -t mytarget -a 12345 -g "chr1:1000000-2000000 chr2:500000-1500000"
# Another region
./varvis-download.js -t mytarget -a 12345 -g "chr7:5500000-5600000"
BED File Regions
Use BED files for complex region definitions:
Create regions.bed:
chr1 1000000 2000000 region_a
chr2 500000 1500000 region_b
chr7 5500000 5600000 region_c
chrX 1000000 1100000 region_x
Use BED file:
./varvis-download.js -t mytarget -a 12345 -b regions.bed
File Type Behavior
BAM Files
Range extraction for BAM:
- Uses samtools view with region specification
- Requires BAM index (.bai) file
- Automatically downloads index if not present
- Creates new index for extracted BAM
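The steps listed above correspond roughly to two samtools calls. The sketch below is an assumption about what such an extraction looks like, not the tool's actual implementation: bam_region_plan is a hypothetical helper that only prints the commands, and the output name follows the sample_001.chr1_1000000_2000000.bam pattern documented for extracted files.

```shell
#!/bin/sh
# bam_region_plan INPUT.bam REGION: print (do not run) samtools commands that
# would extract REGION from an indexed BAM and index the result.
# Hypothetical helper for illustration only.
bam_region_plan() {
  input=$1
  region=$2
  # chr1:1000000-2000000 -> chr1_1000000_2000000
  suffix=$(printf '%s' "$region" | sed 's/[:-]/_/g')
  output="${input%.bam}.${suffix}.bam"
  # -b writes BAM, -o names the output file.
  echo "samtools view -b -o $output $input $region"
  # Indexing the extract produces $output.bai next to it.
  echo "samtools index $output"
}

bam_region_plan sample_001.bam "chr1:1000000-2000000"
```

Printing the commands instead of running them keeps the sketch usable on a machine without samtools or the input file.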
# BAM range download
./varvis-download.js -t mytarget -a 12345 -g "chr1:1000000-2000000" -f "bam,bam.bai"
Output files:
sample_001.chr1_1000000_2000000.bam # Extracted region
sample_001.chr1_1000000_2000000.bam.bai # New index
VCF Files
Range extraction for VCF:
- Uses tabix for region querying
- Requires VCF index (.tbi) file
- Automatically compresses output with bgzip
- Creates new index for extracted VCF
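For VCF, the same steps map onto a tabix and bgzip pipeline. As before, this is a sketch under assumptions rather than the tool's real code path: vcf_region_plan is a hypothetical helper that prints the commands without running them.

```shell
#!/bin/sh
# vcf_region_plan INPUT.vcf.gz REGION: print (do not run) a tabix/bgzip
# pipeline that extracts REGION, recompresses it, and indexes the result.
# Hypothetical helper for illustration only.
vcf_region_plan() {
  input=$1
  region=$2
  suffix=$(printf '%s' "$region" | sed 's/[:-]/_/g')
  output="${input%.vcf.gz}.${suffix}.vcf.gz"
  # -h keeps the VCF header so the extract remains a valid VCF.
  echo "tabix -h $input $region | bgzip -c > $output"
  # -p vcf writes a .tbi index next to the new file.
  echo "tabix -p vcf $output"
}

vcf_region_plan sample_001.vcf.gz "chr1:1000000-2000000"
```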
# VCF range download
./varvis-download.js -t mytarget -a 12345 -g "chr1:1000000-2000000" -f "vcf.gz,vcf.gz.tbi"
Output files:
sample_001.chr1_1000000_2000000.vcf.gz # Extracted and compressed
sample_001.chr1_1000000_2000000.vcf.gz.tbi # New index
Advanced Range Examples
Region-Specific Downloads
Multiple defined regions:
# Region A
./varvis-download.js -t mytarget -a 12345 -g "chr1:1000000-2000000"
# Region B
./varvis-download.js -t mytarget -a 12345 -g "chr2:500000-1500000"
# Both regions
./varvis-download.js -t mytarget -a 12345 \
-g "chr1:1000000-2000000 chr2:500000-1500000"
Research region set:
# Create research_regions.bed
cat > research_regions.bed << EOF
chr1 1000000 2000000 region_a
chr2 500000 1500000 region_b
chr7 5500000 5600000 region_c
chr12 25000000 25100000 region_d
chr20 1000000 1100000 region_e
EOF
# Download research regions
./varvis-download.js -t mytarget -a 12345 -b research_regions.bed
Unmapped Read Extraction
The --unmapped flag extracts reads with no reference assignment from BAM files using the samtools wildcard chromosome *. This is useful for identifying contamination, adapter sequences, or novel sequences (e.g., in Illumina NovaSeq data).
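In SAM terms, a read with no reference assignment carries * in the RNAME column (column 3). Purely as an illustration of that rule, and not of how the tool works internally, the sketch below filters plain SAM text with awk. unplaced_reads and toy_sam are hypothetical helpers, and the two records are invented toy data.

```shell
#!/bin/sh
# unplaced_reads: read SAM text on stdin and print only alignment records
# whose RNAME (column 3) is "*", i.e. reads with no reference assignment.
unplaced_reads() {
  awk -F '\t' '!/^@/ && $3 == "*"'
}

# Toy SAM input: one header line, one mapped read, one unplaced read.
toy_sam() {
  printf '@SQ\tSN:chr1\tLN:248956422\n'
  printf 'read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\n'
  printf 'read2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n'
}

toy_sam | unplaced_reads | cut -f1   # prints: read2
```

Unmapped reads that were placed at their mate's position keep a chromosome name in RNAME, so a rule based on * alone would not select them.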
Unmapped Only
# Extract only unmapped reads (creates sample.unmapped.bam)
./varvis-download.js -t mytarget -a 12345 --unmapped
Only samtools is required (no tabix/bgzip). VCF files are automatically skipped.
Combined with Range Downloads
When used with --range, both ranged and unmapped reads are included in a single BAM file:
# Single BAM with a genomic region + unmapped reads
./varvis-download.js -t mytarget -a 12345 -g "chr1:1000000-2000000" --unmapped
This uses command-line regions instead of a BED file to allow the * wildcard alongside genomic coordinates.
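When the intervals of interest already live in a BED file, they can be turned into a space-separated region string for -g so that they can be combined with --unmapped. This is a sketch under assumptions: bed_to_regions is a hypothetical helper, and it assumes region strings are 1-based and inclusive (the samtools convention) while BED starts are 0-based, so each start is shifted by one. Whether the tool applies such a shift itself is not stated here.

```shell
#!/bin/sh
# bed_to_regions: read BED on stdin, print one space-separated region string.
# BED is 0-based, half-open; region strings are assumed 1-based, inclusive.
# Hypothetical helper for illustration only.
bed_to_regions() {
  awk '!/^(#|track|browser)/ && NF >= 3 {
         printf "%s%s:%d-%d", sep, $1, $2 + 1, $3
         sep = " "
       }
       END { print "" }'
}

printf 'chr1\t1000000\t2000000\tregion_a\nchr2\t500000\t1500000\tregion_b\n' \
  | bed_to_regions
# prints: chr1:1000001-2000000 chr2:500001-1500000

# Possible use with the flags documented above:
#   ./varvis-download.js -t mytarget -a 12345 \
#     -g "$(bed_to_regions < regions.bed)" --unmapped
```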
Exome Regions
Target-enriched data retrieval:
# Download exome target regions
./varvis-download.js -t mytarget -a 12345 -b exome_targets.bed -f "bam,bam.bai"
Custom region panel:
# Create custom region BED file
cat > custom_regions.bed << EOF
chr1 1000000 2000000 region_a
chr2 500000 1500000 region_b
chr3 2500000 2600000 region_c
chr7 5500000 5600000 region_d
chr11 1000000 1100000 region_e
EOF
./varvis-download.js -t mytarget -a 12345 -b custom_regions.bed
Whole Chromosome Downloads
Single chromosome:
# Chromosome 21 (smallest autosome)
./varvis-download.js -t mytarget -a 12345 -g "chr21:1-46709983"
# X chromosome
./varvis-download.js -t mytarget -a 12345 -g "chrX:1-156040895"
Multiple chromosomes:
# Chromosomes 21 and 22
./varvis-download.js -t mytarget -a 12345 \
-g "chr21:1-46709983 chr22:1-50818468"
Range Download Workflow
Complete Workflow Example
#!/bin/bash
# range_download_workflow.sh
ANALYSIS_ID="12345"
REGION_LABEL="region_a"
REGION="chr1:1000000-2000000"
OUTPUT_DIR="./range_${ANALYSIS_ID}_${REGION_LABEL}"
echo "Starting range download workflow for $REGION_LABEL..."
# Create output directory
mkdir -p "$OUTPUT_DIR"
# Download BAM region
echo "Downloading BAM region: $REGION"
./varvis-download.js -t mytarget -a "$ANALYSIS_ID" \
-g "$REGION" \
-f "bam,bam.bai" \
-d "$OUTPUT_DIR"
# Download VCF region
echo "Downloading VCF region: $REGION"
./varvis-download.js -t mytarget -a "$ANALYSIS_ID" \
-g "$REGION" \
-f "vcf.gz,vcf.gz.tbi" \
-d "$OUTPUT_DIR"
# Verify downloads
echo "Verifying downloaded files..."
for FILE in "$OUTPUT_DIR"/*.bam; do
if samtools view -H "$FILE" >/dev/null 2>&1; then
echo "✓ Valid BAM: $(basename "$FILE")"
else
echo "✗ Invalid BAM: $(basename "$FILE")"
fi
done
for FILE in "$OUTPUT_DIR"/*.vcf.gz; do
if tabix -l "$FILE" >/dev/null 2>&1; then
echo "✓ Valid VCF: $(basename "$FILE")"
else
echo "✗ Invalid VCF: $(basename "$FILE")"
fi
done
echo "Range download workflow complete: $OUTPUT_DIR"
Quality Control Integration
Post-download QC:
#!/bin/bash
# range_qc.sh
REGION_BAM="sample_001.chr1_1000000_2000000.bam"
REGION_VCF="sample_001.chr1_1000000_2000000.vcf.gz"
# BAM QC
echo "BAM Quality Control:"
samtools flagstat "$REGION_BAM"
samtools view "$REGION_BAM" | wc -l | xargs echo "Total reads:"
# VCF QC
echo "VCF Quality Control:"
zcat "$REGION_VCF" | grep -v "^#" | wc -l | xargs echo "Total variants:"
zcat "$REGION_VCF" | grep -v "^#" | cut -f7 | sort | uniq -c | sort -nr
Performance Considerations
File Size Impact
Typical size reductions:
- Single small region: 99% size reduction
- Exome regions: 98% size reduction
- Chromosome: 96% size reduction
- Multiple regions: 95-99% reduction
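Figures like these can be sanity-checked with a proportional estimate: region length divided by genome length, times the original file size. The sketch below is a rough calculation only. It assumes coverage is spread evenly across the genome, which does not hold for targeted data such as exomes, and estimate_bytes is a hypothetical helper.

```shell
#!/bin/sh
# estimate_bytes REGION_BP GENOME_BP FILE_BYTES: proportional size estimate.
# Multiplies before dividing so whole-number results stay exact.
# Hypothetical helper for illustration only.
estimate_bytes() {
  awk -v r="$1" -v g="$2" -v f="$3" 'BEGIN { printf "%.0f\n", r * f / g }'
}

# 10 kb region, 3.2 Gb genome, 2 GB file
estimate_bytes 10000 3200000000 2000000000   # prints: 6250 (about 6 KB)
```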
Size estimation:
# Check original file sizes
./varvis-download.js -t mytarget -a 12345 --list | grep -E "bam|vcf"
# Estimate region size (rough calculation)
# Region size / Genome size * Original file size
# Example: 10kb region / 3.2Gb genome * 2GB file = ~6KB
Network Optimization
Range downloads are faster because:
- Smaller data transfer
- Less network time
- Reduced bandwidth usage
- Parallel processing possible
Optimize for multiple regions:
# Download regions in parallel (separate processes)
./varvis-download.js -t mytarget -a 12345 -g "chr1:1000000-2000000" &
./varvis-download.js -t mytarget -a 12345 -g "chr2:500000-1500000" &
./varvis-download.js -t mytarget -a 12345 -g "chr7:5500000-5600000" &
wait # Wait for all downloads to complete
Storage Optimization
Efficient storage patterns:
# Organize by region
mkdir -p ./regions/{region_a,region_b,region_c}
# Download to specific directories
./varvis-download.js -t mytarget -a 12345 -g "chr1:1000000-2000000" -d "./regions/region_a/"
./varvis-download.js -t mytarget -a 12345 -g "chr2:500000-1500000" -d "./regions/region_b/"
./varvis-download.js -t mytarget -a 12345 -g "chr7:5500000-5600000" -d "./regions/region_c/"
Integration with Downstream Research Tools
Optional Downstream Processing
Region-specific processing:
# Download region
./varvis-download.js -t mytarget -a 12345 -g "chr1:1000000-2000000" -f "bam,bam.bai"
# Run a downstream research workflow on the region
REGION_BAM="sample_001.chr1_1000000_2000000.bam"
REFERENCE="/data/reference/hg38.fa"
OUTPUT_VCF="research_output.chr1_1000000_2000000.vcf"
# Example with a user-managed external tool
gatk HaplotypeCaller \
-R "$REFERENCE" \
-I "$REGION_BAM" \
-O "$OUTPUT_VCF" \
-L "chr1:1000000-2000000"
Coverage Analysis
Regional coverage analysis:
# Download BAM region
./varvis-download.js -t mytarget -a 12345 -g "chr1:1000000-2000000" -f "bam,bam.bai"
# Calculate coverage for region
REGION_BAM="sample_001.chr1_1000000_2000000.bam"
# Per-base coverage
samtools depth "$REGION_BAM" > coverage.chr1_1000000_2000000.txt
# Coverage statistics
samtools depth "$REGION_BAM" | awk '{sum+=$3; count++} END {if (count > 0) print "Average coverage:", sum/count; else print "No covered positions"}'
# Coverage histogram
samtools depth "$REGION_BAM" | cut -f3 | sort -n | uniq -c | sort -nr > coverage_histogram.txt
Research Annotation Workflows
Region-specific research annotation:
# Download VCF region
./varvis-download.js -t mytarget -a 12345 -g "chr1:1000000-2000000" -f "vcf.gz,vcf.gz.tbi"
# Annotate variants in region
REGION_VCF="sample_001.chr1_1000000_2000000.vcf.gz"
# VEP annotation
vep --input_file "$REGION_VCF" \
--output_file "annotated.chr1_1000000_2000000.vcf" \
--format vcf \
--vcf \
--symbol \
--terms SO \
--tsl \
--hgvs \
--fasta /data/reference/hg38.fa \
--offline \
--cache
Troubleshooting Range Downloads
Common Issues
Tool not found errors:
Error: samtools not found or version too old
Required: samtools v1.17+, Found: v1.10
Solution:
# Update samtools
conda install -c bioconda samtools=1.17
# or
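# apt installs the version the distribution ships; check it with
# "apt-cache policy samtools" (a bare "samtools=1.17" pin usually does not
# match the full package version string)
sudo apt install samtools
Index file issues:
Error: Could not load index for sample_001.bam
Solution: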
# The tool should automatically download indexes
# If manual intervention needed:
samtools index sample_001.bam
Region format errors:
Error: Invalid region format: chr1:1000000_2000000
Solution:
# Use colon and dash: chr1:1000000-2000000
# Not underscore: chr1:1000000_2000000
Debug Range Downloads
Enable debug logging:
./varvis-download.js -t mytarget -a 12345 -g "chr1:1000000-2000000" --loglevel debug
Test region validity:
# Test with a small region first
./varvis-download.js -t mytarget -a 12345 -g "chr1:1000-2000" --list
# Verify chromosome naming
./varvis-download.js -t mytarget -a 12345 --list | grep -i bam
samtools view -H original.bam | grep "@SQ" # Check chromosome names
Manual tool testing:
# Test samtools with region
samtools view original.bam "chr1:1000000-2000000" | head
# Test tabix with region
tabix original.vcf.gz "chr1:1000000-2000000" | head
Best Practices
Region Selection
- Use specific regions for focused retrieval
- Combine related regions in single downloads
- Document the coordinate source and reference genome build
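- Keep coordinate systems consistent: BED intervals are 0-based and half-open, while chromosome:start-end region strings follow the 1-based, inclusive convention used by samtools and tabix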
File Management
- Organize by region and analysis identifier
- Use descriptive directory names
- Keep original and extracted files separate
- Document region coordinates and purposes
Quality Control
- Always verify extracted files
- Check read/variant counts in regions
- Validate chromosome naming consistency
- Test with small regions before large downloads
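Chromosome naming is a common source of empty extracts (chr1 in the region, 1 in the file, or the reverse). A minimal sketch of a naming check follows. It is an illustration, not a feature of the tool: contig_in_header and toy_header are hypothetical helpers, and the header lines are toy data.

```shell
#!/bin/sh
# contig_in_header NAME: read SAM header text on stdin and succeed when an
# @SQ line declares SN:NAME. Hypothetical helper for illustration only.
contig_in_header() {
  awk -F '\t' -v name="$1" '
    $1 == "@SQ" {
      for (i = 2; i <= NF; i++) if ($i == "SN:" name) found = 1
    }
    END { exit !found }'
}

# Toy header standing in for the output of: samtools view -H original.bam
toy_header() {
  printf '@HD\tVN:1.6\tSO:coordinate\n'
  printf '@SQ\tSN:chr1\tLN:248956422\n'
  printf '@SQ\tSN:chr2\tLN:242193529\n'
}

for contig in chr1 1; do
  if toy_header | contig_in_header "$contig"; then
    echo "$contig: present in header"
  else
    echo "$contig: not in header (check the chr prefix)"
  fi
done
```

With a real file, the header can be piped in directly, for example: samtools view -H original.bam | contig_in_header chr1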
Next Steps
- Archive Management - Handle archived files
- Batch Operations - Large-scale processing
- Examples - Real-world range download scenarios
