A general workflow for ChIP-seq analysis

ChIP-seq is now one of the most common types of sequencing data. There are two main kinds: transcription factor (TF) ChIP-seq, which maps where transcription factors bind, and histone ChIP-seq, which maps where histone modifications occur. Let me share the general workflow.


1. Quality control

The first thing to look at is the quality of the ChIP-seq data: the signal should be much stronger than the background. A control sample is generally required so that the called peaks are more accurate and credible. There are two main types of control, input DNA and IgG; input DNA is the more commonly used.

Some ways to check the quality:
1). The number of reads in peaks. If few reads fall within peaks, the quality is mediocre.
2). The signal in peaks is high and the background is low.
3). The sequencing is deep enough.
4). The library is diverse (this is related to the level of duplication).
5). There are replicates, and the similarity between replicates is high.
...
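The first check is often summarised as the fraction of reads in peaks (FRiP). Below is a minimal sketch of that calculation using awk on toy data. All coordinates are invented for illustration, and a read is assumed to count as "in a peak" if it overlaps one by at least 1 bp; on real data a dedicated tool would normally do this.

```shell
#!/usr/bin/env bash
# Toy FRiP (fraction of reads in peaks) calculation.
# Every coordinate below is made up for illustration.
set -eu

peaks=$(mktemp)   # BED: chrom, start, end (0-based, half-open)
reads=$(mktemp)   # BED: one line per aligned read

printf 'chr1\t100\t200\nchr1\t500\t600\n' > "$peaks"
printf 'chr1\t110\t146\nchr1\t150\t186\nchr1\t300\t336\nchr1\t590\t626\nchr2\t100\t136\n' > "$reads"

# First file (peaks) is loaded into arrays; each read in the second file
# is then tested against every peak on the same chromosome.
frip=$(awk -F'\t' '
    NR == FNR { pc[NR] = $1; ps[NR] = $2; pe[NR] = $3; n = NR; next }
    {
        total++
        for (i = 1; i <= n; i++)
            if ($1 == pc[i] && $2 < pe[i] && $3 > ps[i]) { hit++; break }
    }
    END { printf "%.2f", hit / total }
' "$peaks" "$reads")

echo "FRiP = $frip"
rm -f "$peaks" "$reads"
```

Here three of the five toy reads overlap a peak. The nested loop is fine for a demonstration but would be too slow for a real BAM file.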

Software for quality control:
1). ChIPQC (T Carroll, Front Genet, 2014).
2). The SPP package, Unix/Linux (PV Kharchenko, Nature Biotechnol, 2008).
3). The standard ENCODE pipeline.

2. Sequence alignment (mapping FASTQ reads)

Alignment is generally done with BWA or Bowtie2, and the two give similar results. BWA's aln route, finished with bwa samse (single-end data) or bwa sampe (paired-end data), runs slowly but gives very good results. The usage is as follows:

bwa index reference.fa # Build the index; -p sets the index prefix (default: reference.fa)

# Single-end data
bwa aln -t 8 reference.fa test.fq.gz > test.sai
bwa samse -n 10 reference.fa test.sai test.fq.gz > test_se.sam

# Paired-end data
bwa aln reference.fa test_reads1.fq > test1.sai
bwa aln reference.fa test_reads2.fq > test2.sai
bwa sampe reference.fa test1.sai test2.sai test_reads1.fq test_reads2.fq > test_pe.sam

BWA's mem algorithm is much faster:

bwa mem reference.fa reads.fq > test_se.sam # single-end
bwa mem reference.fa read1.fq read2.fq > test_pe.sam # paired-end

Usage of Bowtie2:

bowtie2-build reference.fa index # Build the index

bowtie2 -p 8 -x index -U test.fq -S test_se.sam # Single-end alignment (-U takes unpaired reads)
bowtie2 -p 8 -x index -1 test_read1.fq -2 test_read2.fq -S test_pe.sam # Paired-end alignment

In my experience, the two aligners give similar results.
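One rough way to check this on your own data is to compare how many reads each aligner maps, and how many it maps with high confidence. The sketch below does this with awk on a made-up SAM file. The read names, positions and the MAPQ cutoff of 30 are all invented for illustration; for real data, samtools flagstat gives a fuller report.

```shell
#!/usr/bin/env bash
# Summarise a SAM file: total reads, mapped reads, reads with MAPQ >= 30.
# The SAM records below are toy data, not real alignments.
set -eu

sam=$(mktemp)
{
    printf '@SQ\tSN:chr1\tLN:1000\n'
    printf 'r1\t0\tchr1\t101\t60\t5M\t*\t0\t0\tACGTA\tIIIII\n'
    printf 'r2\t16\tchr1\t201\t30\t5M\t*\t0\t0\tACGTA\tIIIII\n'
    printf 'r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGTA\tIIIII\n'
    printf 'r4\t0\tchr1\t301\t0\t5M\t*\t0\t0\tACGTA\tIIIII\n'
} > "$sam"

# Column 2 is the FLAG; bit 4 set means the read is unmapped.
# Column 5 is the mapping quality (MAPQ).
# Assumes one record per read (no secondary or supplementary alignments).
summary=$(awk -F'\t' '
    /^@/ { next }                                   # skip header lines
    { total++ }
    int($2 / 4) % 2 == 0 { mapped++; if ($5 >= 30) confident++ }
    END { printf "total=%d mapped=%d mapq30=%d", total, mapped, confident }
' "$sam")

echo "$summary"
rm -f "$sam"
```

Running the same summary on the SAM files produced by two aligners from the same FASTQ input gives a quick, if crude, side-by-side comparison.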

3. Remove duplicates

PCR amplification during library preparation inevitably introduces duplicates, that is, different reads whose start and end positions are exactly the same.

When two reads are duplicates of each other, Picard marks one of them by adding 1024 to the FLAG value in the second column of its SAM record.
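Since the SAM FLAG is a bit field, "adding 1024" amounts to setting one bit (0x400). The sketch below shows how that bit can be tested; the flag value 99 is just a hypothetical paired-end read chosen for illustration, and is_duplicate is a helper name made up here.

```shell
#!/usr/bin/env bash
# Test whether the duplicate bit (0x400 = 1024) is set in a SAM FLAG value.
set -eu

# Hypothetical helper: prints "yes" if the duplicate bit is set, else "no".
is_duplicate() {
    if [ $(( $1 & 1024 )) -ne 0 ]; then echo yes; else echo no; fi
}

before=99                       # example FLAG before duplicate marking
after=$(( before + 1024 ))      # the same read once marked as a duplicate

echo "FLAG $before duplicate? $(is_duplicate "$before")"
echo "FLAG $after duplicate? $(is_duplicate "$after")"
```

On a real BAM file the same bit can be used for filtering: samtools view -f 1024 keeps only marked duplicates, and -F 1024 excludes them.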

The most commonly used deduplication tools are samtools rmdup (now essentially obsolete), its replacement samtools markdup, and Picard. rmdup does not work very well; when several reads share the same position, it preferentially keeps the one with the higher quality. Picard behaves much like samtools markdup (I am not sure whether they share the same underlying code). Both can either mark duplicates or remove them outright. The usage is as follows:

# The input must be coordinate-sorted and already processed with samtools fixmate -m
samtools markdup -@ 8 -r test.bam filter_test.bam # -r removes duplicates; without it they are only marked

Picard offers three ways to choose which read to keep from a set of duplicates, set with the DUPLICATE_SCORING_STRATEGY parameter: SUM_OF_BASE_QUALITIES, TOTAL_MAPPED_REFERENCE_LENGTH and RANDOM. That is, it keeps the read with the highest total base quality, the read with the longest alignment to the reference genome, or a randomly chosen read, respectively.

picard MarkDuplicates I=test.bam O=filter_test.bam M=dup_metrics.txt REMOVE_DUPLICATES=true
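To make the idea behind SUM_OF_BASE_QUALITIES concrete, here is a simplified sketch that scores two duplicate reads by the sum of their Phred+33 base qualities and keeps the higher-scoring one. The quality strings, the read names and the helper sum_quals are invented for illustration, and Picard's actual scoring may differ in detail (for example in how low-quality bases are treated).

```shell
#!/usr/bin/env bash
# Pick which of two duplicate reads to keep by total base quality.
# Simplified illustration only; not Picard's exact algorithm.
set -eu

# Hypothetical helper: sum the Phred scores of a Phred+33 quality string.
# od prints each character as its byte value; awk subtracts the 33 offset.
sum_quals() {
    printf '%s' "$1" | od -An -tu1 |
        awk '{ for (i = 1; i <= NF; i++) s += $i - 33 } END { print s + 0 }'
}

qual_a='IIIII'    # five bases at Q40
qual_b='II##I'    # three bases at Q40, two at Q2

score_a=$(sum_quals "$qual_a")
score_b=$(sum_quals "$qual_b")

if [ "$score_a" -ge "$score_b" ]; then kept=read_a; else kept=read_b; fi

echo "read_a=$score_a read_b=$score_b keep=$kept"
```

The other two strategies would swap the score for the aligned length or for a random number, while the "keep the best of each duplicate set" step stays the same.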

4. Peak calling

Peaks are regions where the read signal is relatively strong, that is, where the transcription factor is most likely to bind or the histone modification is most likely to occur. There are many peak callers; the more commonly used ones are MACS2 and Hotspot2. Example:

macs2 callpeak -t test.bam -c control.bam -f BAM -g hs -n test -B -q 0.01 

Different kinds of data call for different parameters; for example, MACS2 has a --broad option intended for broad domains such as some histone marks.
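With -n test, MACS2 writes the called peaks to test_peaks.narrowPeak, a tab-separated file in which column 9 holds -log10(q-value). The sketch below shows one way to keep only the strongest peaks from such a file. The three peak records and the cutoff of 5 (q <= 1e-5) are made up for illustration.

```shell
#!/usr/bin/env bash
# Filter narrowPeak records by q-value. The peaks below are toy data.
set -eu

peaks=$(mktemp)
# Columns: chrom start end name score strand signalValue
#          -log10(pvalue) -log10(qvalue) summit_offset
{
    printf 'chr1\t100\t300\ttest_peak_1\t250\t.\t12.5\t30.1\t25.0\t90\n'
    printf 'chr1\t900\t1100\ttest_peak_2\t40\t.\t3.2\t5.0\t2.4\t110\n'
    printf 'chr2\t500\t800\ttest_peak_3\t120\t.\t7.8\t14.2\t10.3\t150\n'
} > "$peaks"

# Keep the names of peaks whose -log10(q) is at least min_q,
# joined into one comma-separated line.
strong=$(awk -F'\t' -v min_q=5 '$9 >= min_q { print $4 }' "$peaks" | paste -sd, -)

echo "strong peaks: $strong"
rm -f "$peaks"
```

Filtering after the fact like this is a cheap way to explore stricter thresholds without rerunning the peak caller.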

5. Downstream analysis

Once peaks have been called, there is a lot that can be done downstream, depending on the question. You can analyze DNase-seq or ATAC-seq data alongside to see how transcription factor binding relates to open chromatin regions; annotate peaks with tools such as HOMER to see the relationship between different transcription factors and histone modifications; or identify the target genes of a TF. You can also use MEME for motif analysis.

Reference: The general process method of ChIP-seq analysis-Cloud + Community-Tencent Cloud