4. リード数のカウント
2021.06.10 初稿
リードカウントとは、生リード(200ntとか)が遺伝子にいくつ張りついたかを数えたもの。
遺伝子の長さが長ければ長いほどシークエンスされたリードも増え、結果リードカウントも多くなるはずなので、補正した値がTPM。
リードカウントを求めるためにtrinityrnaseqの中に含まれているalign_and_estimate_abundance.plを用いる。スクリプト詳細は以下
$ /usr/local/Cellar/trinity/2.8.6/libexec/util/align_and_estimate_abundance.pl
#########################################################################
########################
# Essential parameters:
########################
# --transcripts <string> transcript fasta file
# --seqType <string> fq|fa
# If Paired-end:
# --left <string>
# --right <string>
# or Single-end:
# --single <string>
# or (preferred):
# --samples_file <string> tab-delimited text file indicating biological replicate relationships.
# ex.
# cond_A cond_A_rep1 A_rep1_left.fq A_rep1_right.fq
# cond_A cond_A_rep2 A_rep2_left.fq A_rep2_right.fq
# cond_B cond_B_rep1 B_rep1_left.fq B_rep1_right.fq
# cond_B cond_B_rep2 B_rep2_left.fq B_rep2_right.fq
# # if single-end instead of paired-end, then leave the 4th column above empty.
中略
# Example usage
# ## Just prepare the reference for alignment and abundance estimation
# /usr/local/Cellar/trinity/2.8.6/libexec/util/align_and_estimate_abundance.pl --transcripts Trinity.fasta --est_method RSEM --aln_method bowtie --trinity_mode --prep_reference
# ## Run the alignment and abundance estimation (assumes reference has already been prepped, errors-out if prepped reference not located.)
# /usr/local/Cellar/trinity/2.8.6/libexec/util/align_and_estimate_abundance.pl --transcripts Trinity.fasta --seqType fq --left reads_1.fq --right reads_2.fq --est_method RSEM --aln_method bowtie --trinity_mode --output_dir rsem_outdir
## ## prep the reference and run the alignment/estimation
# /usr/local/Cellar/trinity/2.8.6/libexec/util/align_and_estimate_abundance.pl --transcripts Trinity.fasta --seqType fq --left reads_1.fq --right reads_2.fq --est_method RSEM --aln_method bowtie --trinity_mode --prep_reference --output_dir rsem_outdir
# ## Use a samples.txt file:
# /usr/local/Cellar/trinity/2.8.6/libexec/util/align_and_estimate_abundance.pl --transcripts Trinity.fasta --est_method RSEM --aln_method bowtie2 --prep_reference --trinity_mode --samples_file samples.txt --seqType fq
#
#########################################################################
---------------------------------------------
TRINITY_HOME=/usr/local/bin/trinityrnaseq
singularity exec -e ${HOME}/trinityrnaseq.v2.11.0.simg \
${TRINITY_HOME}/util/align_and_estimate_abundance.pl \
--transcripts ${HOME}/RNAseq/trinity_out/sample.Trinity.fasta \
--seqType fq \
--samples_file ${HOME}/RNAseq/samples_file.txt \
--est_method RSEM \
--aln_method bowtie2 \
--prep_reference \
--gene_trans_map ${HOME}/RNAseq/trinity_out/sample.Trinity.fasta.gene_trans_map \
--output_dir rsem_out
---------------------------------------------
--samples_fileオプションで解析を行うために必要なファイル名を指定する。サンプル数x反復数となるため、手打ちするのではなくサンプル名をリスト化したもの(タブ区切りファイル)を読み込ませる。
使用する全てのサンプル名をのせたタブ区切りファイルを作る(--samples_file<string>の項目)
例えばcontrolとmutantの2サンプルがあり、それぞれ3反復あるとき。
perl -lane 'print "$1\_$2\t$1\_$2\_rep\$3\t$_\t$_" if($_=~/(\S\S\S\S)\S+\-(\d+)\_(\d)\.fastq.gz/) ' samples_file.txt
結果は以下のように4列6行となる。
control control_rep1 control-1_1.fastq.gz control-1_2.fastq.gz
control control_rep2 control-2_1.fastq.gz control-2_2.fastq.gz
control control_rep3 control-3_1.fastq.gz control-3_2.fastq.gz
mutant mutant_rep1 mutant-1_1.fastq.gz mutant-1_2.fastq.gz
mutant mutant_rep2 mutant-2_1.fastq.gz mutant-2_2.fastq.gz
mutant mutant_rep3 mutant-3_1.fastq.gz mutant-3_2.fastq.gz
各反復ごと(ペアエンドのものは1反復に含まれる)にカウントデータが出てくるので、rsem_outディレクトリには6つのディレクトリができる。align_and_estimate_abundanceの結果は、rsem_outの各サンプル名のディレクトリに入っている。RSEM.genes.results, RSEM.isoforms.results, RSEM.isoforms.results.ok, RSEM.stat, bowtie2.bam, bowtie2.bam.for_rsem.bam, bowtie2.bam.okがそれにあたる。RSEM.isoforms.resultsを以降の解析で用いるが、中身の詳細は以下。
OUTPUT
sample_name.isoforms.results
File containing isoform level expression estimates. The first line
contains column names separated by the tab character. The format of
each line in the rest of this file is:
transcript_id gene_id length effective_length expected_count TPM FPKM
IsoPct [posterior_mean_count posterior_standard_deviation_of_count
pme_TPM pme_FPKM IsoPct_from_pme_TPM TPM_ci_lower_bound
TPM_ci_upper_bound TPM_coefficient_of_quartile_variation
FPKM_ci_lower_bound FPKM_ci_upper_bound
FPKM_coefficient_of_quartile_variation]
Fields are separated by the tab character. Fields within "" are
optional. They will not be presented if neither '--calc-pme' nor
'--calc-ci' is set.
'transcript_id' is the transcript name of this transcript. 'gene_id'
is the gene name of the gene which this transcript belongs to (denote
this gene as its parent gene). If no gene information is provided,
'gene_id' and 'transcript_id' are the same.
'length' is this transcript's sequence length (poly(A) tail is not
counted). 'effective_length' counts only the positions that can
generate a valid fragment. If no poly(A) tail is added,
'effective_length' is equal to transcript length - mean fragment
length + 1. If one transcript's effective length is less than 1, this
transcript's both effective length and abundance estimates are set to
0.
'expected_count' is the sum of the posterior probability of each read
comes from this transcript over all reads. Because 1) each read
aligning to this transcript has a probability of being generated from
background noise; 2) RSEM may filter some alignable low quality reads,
the sum of expected counts for all transcript are generally less than
the total number of reads aligned.
'TPM' stands for Transcripts Per Million. It is a relative measure of
transcript abundance. The sum of all transcripts' TPM is 1 million.
'FPKM' stands for Fragments Per Kilobase of transcript per Million
mapped reads. It is another relative measure of transcript abundance.
If we define l_bar be the mean transcript length in a sample, which
can be calculated as
l_bar = \sum_i TPM_i / 10^6 * effective_length_i (i goes through every
transcript),
the following equation is hold:
FPKM_i = 10^3 / l_bar * TPM_i.
We can see that the sum of FPKM is not a constant across samples.
'IsoPct' stands for isoform percentage. It is the percentage of this
transcript's abandunce over its parent gene's abandunce. If its parent
gene has only one isoform or the gene information is not provided,
this field will be set to 100.
'posterior_mean_count', 'pme_TPM', 'pme_FPKM' are posterior mean
estimates calculated by RSEM's Gibbs sampler.
'posterior_standard_deviation_of_count' is the posterior standard
deviation of counts. 'IsoPct_from_pme_TPM' is the isoform percentage
calculated from 'pme_TPM' values.
'TPM_ci_lower_bound', 'TPM_ci_upper_bound', 'FPKM_ci_lower_bound' and
'FPKM_ci_upper_bound' are lower(l) and upper(u) bounds of 95%
credibility intervals for TPM and FPKM values. The bounds are
inclusive (i.e. [l, u]).
'TPM_coefficient_of_quartile_variation' and
'FPKM_coefficient_of_quartile_variation' are coefficients of quartile
variation (CQV) for TPM and FPKM values. CQV is a robust way of
measuring the ratio between the standard deviation and the mean. It is
defined as
CQV := (Q3 - Q1) / (Q3 + Q1),
where Q1 and Q3 are the first and third quartiles.
参考にしたページ
align_and_estimate_abundance.plについて:https://kazumaxneo.hatenablog.com/entry/2020/12/04/182413