top of page

4. リード数のカウント

2021.06.10 初稿

 

リードカウントとは、生リード(200ntとか)が遺伝子にいくつ張りついたかを数えたもの。

遺伝子の長さが長ければ長いほどシークエンスされたリードも増え、結果リードカウントも多くなるはずなので、補正した値がTPM。

リードカウントを求めるためにtrinityrnaseqの中に含まれているalign_and_estimate_abundance.plを用いる。スクリプト詳細は以下

 

$ /usr/local/Cellar/trinity/2.8.6/libexec/util/align_and_estimate_abundance.pl 

 

#########################################################################

########################

#  Essential parameters:

########################

#  --transcripts <string>           transcript fasta file

#  --seqType <string>               fq|fa

#  If Paired-end:

#     --left <string>

#     --right <string>

#   or Single-end:

#      --single <string>

#   or (preferred):

#      --samples_file <string>    tab-delimited text file indicating biological replicate relationships.

#                                   ex.

#                                        cond_A    cond_A_rep1    A_rep1_left.fq    A_rep1_right.fq

#                                        cond_A    cond_A_rep2    A_rep2_left.fq    A_rep2_right.fq

#                                        cond_B    cond_B_rep1    B_rep1_left.fq    B_rep1_right.fq

#                                        cond_B    cond_B_rep2    B_rep2_left.fq    B_rep2_right.fq

#                      # if single-end instead of paired-end, then leave the 4th column above empty.

 

中略

 

#  Example usage

#   ## Just prepare the reference for alignment and abundance estimation

#    /usr/local/Cellar/trinity/2.8.6/libexec/util/align_and_estimate_abundance.pl --transcripts Trinity.fasta --est_method RSEM --aln_method bowtie --trinity_mode --prep_reference

#   ## Run the alignment and abundance estimation (assumes reference has already been prepped, errors-out if prepped reference not located.)

#    /usr/local/Cellar/trinity/2.8.6/libexec/util/align_and_estimate_abundance.pl --transcripts Trinity.fasta --seqType fq --left reads_1.fq --right reads_2.fq --est_method RSEM --aln_method bowtie --trinity_mode --output_dir rsem_outdir

##  ## prep the reference and run the alignment/estimation

#    /usr/local/Cellar/trinity/2.8.6/libexec/util/align_and_estimate_abundance.pl --transcripts Trinity.fasta --seqType fq --left reads_1.fq --right reads_2.fq --est_method RSEM --aln_method bowtie --trinity_mode --prep_reference --output_dir rsem_outdir

#   ## Use a samples.txt file:

#    /usr/local/Cellar/trinity/2.8.6/libexec/util/align_and_estimate_abundance.pl --transcripts Trinity.fasta --est_method RSEM --aln_method bowtie2 --prep_reference --trinity_mode --samples_file samples.txt --seqType fq  

#

#########################################################################

 

 

---------------------------------------------

TRINITY_HOME=/usr/local/bin/trinityrnaseq

singularity exec -e ${HOME}/trinityrnaseq.v2.11.0.simg \

 ${TRINITY_HOME}/util/align_and_estimate_abundance.pl \

 --transcripts ${HOME}/RNAseq/trinity_out/sample.Trinity.fasta \

 --seqType fq \

 --samples_file ${HOME}/RNAseq/samples_file.txt \

 --est_method RSEM \

 --aln_method bowtie2 \

 --prep_reference \

 --gene_trans_map ${HOME}/RNAseq/trinity_out/sample.Trinity.fasta.gene_trans_map \ 

 --output_dir rsem_out

---------------------------------------------

 

--samples_fileオプションで解析を行うために必要なファイル名を指定する。サンプル数x反復数となるため、手打ちするのではなくサンプル名をリスト化したもの(タブ区切りファイル)を読み込ませる。

使用する全てのサンプル名をのせたタブ区切りファイルを作る(--samples_file<string>の項目)

例えばcontrolとmutantの2サンプルがあり、それぞれ3反復あるとき。

perl -lane 'print "$1\_$2\t$1\_$2\_rep\$3\t$_\t$_" if($_=~/(\S\S\S\S)\S+\-(\d+)\_(\d)\.fastq.gz/) ' samples_file.txt

結果は以下のように4列6行となる。

    control       control_rep1   control-1_1.fastq.gz    control-1_2.fastq.gz

    control        control_rep2   control-2_1.fastq.gz    control-2_2.fastq.gz

    control        control_rep3   control-3_1.fastq.gz    control-3_2.fastq.gz

    mutant       mutant_rep1   mutant-1_1.fastq.gz    mutant-1_2.fastq.gz

    mutant        mutant_rep2   mutant-2_1.fastq.gz    mutant-2_2.fastq.gz

    mutant        mutant_rep3   mutant-3_1.fastq.gz    mutant-3_2.fastq.gz

 

各反復ごと(ペアエンドのものは1反復に含まれる)にカウントデータが出てくるので、rsem_outディレクトリには6つのディレクトリができる。align_and_estimate_abundanceの結果は、rsem_outの各サンプル名のディレクトリに入っている。RSEM.genes.results, RSEM.isoforms.results, RSEM.isoforms.results.ok, RSEM.stat, bowtie2.bam, bowtie2.bam.for_rsem.bam, bowtie2.bam.okがそれにあたる。RSEM.isoforms.resultsを以降の解析で用いるが、中身の詳細は以下。

 

OUTPUT

    sample_name.isoforms.results

        File containing isoform level expression estimates. The first line

        contains column names separated by the tab character. The format of

        each line in the rest of this file is:

 

        transcript_id gene_id length effective_length expected_count TPM FPKM

        IsoPct [posterior_mean_count posterior_standard_deviation_of_count

        pme_TPM pme_FPKM IsoPct_from_pme_TPM TPM_ci_lower_bound

        TPM_ci_upper_bound TPM_coefficient_of_quartile_variation

        FPKM_ci_lower_bound FPKM_ci_upper_bound

        FPKM_coefficient_of_quartile_variation]

 

        Fields are separated by the tab character. Fields within "" are

        optional. They will not be presented if neither '--calc-pme' nor

        '--calc-ci' is set.

 

        'transcript_id' is the transcript name of this transcript. 'gene_id'

        is the gene name of the gene which this transcript belongs to (denote

        this gene as its parent gene). If no gene information is provided,

        'gene_id' and 'transcript_id' are the same.

 

        'length' is this transcript's sequence length (poly(A) tail is not

        counted). 'effective_length' counts only the positions that can

        generate a valid fragment. If no poly(A) tail is added,

        'effective_length' is equal to transcript length - mean fragment

        length + 1. If one transcript's effective length is less than 1, this

        transcript's both effective length and abundance estimates are set to

        0.

 

        'expected_count' is the sum of the posterior probability of each read

        comes from this transcript over all reads. Because 1) each read

        aligning to this transcript has a probability of being generated from

        background noise; 2) RSEM may filter some alignable low quality reads,

        the sum of expected counts for all transcript are generally less than

        the total number of reads aligned.

 

        'TPM' stands for Transcripts Per Million. It is a relative measure of

        transcript abundance. The sum of all transcripts' TPM is 1 million.

        'FPKM' stands for Fragments Per Kilobase of transcript per Million

        mapped reads. It is another relative measure of transcript abundance.

        If we define l_bar be the mean transcript length in a sample, which

        can be calculated as

 

        l_bar = \sum_i TPM_i / 10^6 * effective_length_i (i goes through every

        transcript),

 

        the following equation is hold:

 

        FPKM_i = 10^3 / l_bar * TPM_i.

 

        We can see that the sum of FPKM is not a constant across samples.

 

        'IsoPct' stands for isoform percentage. It is the percentage of this

        transcript's abandunce over its parent gene's abandunce. If its parent

        gene has only one isoform or the gene information is not provided,

        this field will be set to 100.

 

        'posterior_mean_count', 'pme_TPM', 'pme_FPKM' are posterior mean

        estimates calculated by RSEM's Gibbs sampler.

        'posterior_standard_deviation_of_count' is the posterior standard

        deviation of counts. 'IsoPct_from_pme_TPM' is the isoform percentage

        calculated from 'pme_TPM' values.

 

        'TPM_ci_lower_bound', 'TPM_ci_upper_bound', 'FPKM_ci_lower_bound' and

        'FPKM_ci_upper_bound' are lower(l) and upper(u) bounds of 95%

        credibility intervals for TPM and FPKM values. The bounds are

        inclusive (i.e. [l, u]).

 

        'TPM_coefficient_of_quartile_variation' and

        'FPKM_coefficient_of_quartile_variation' are coefficients of quartile

        variation (CQV) for TPM and FPKM values. CQV is a robust way of

        measuring the ratio between the standard deviation and the mean. It is

        defined as

 

        CQV := (Q3 - Q1) / (Q3 + Q1),

 

        where Q1 and Q3 are the first and third quartiles.

参考にしたページ

align_and_estimate_abundance.plについて:https://kazumaxneo.hatenablog.com/entry/2020/12/04/182413

bottom of page