4. リード数カウント

4. リード数のカウント

2021.06.10 初稿

リードカウントとは、生リード（200ntとか）が遺伝子にいくつ張りついたかを数えたもの。

遺伝子の長さが長ければ長いほどシークエンスされたリードも増え、結果リードカウントも多くなるはずなので、補正した値がTPM。

リードカウントを求めるためにtrinityrnaseqの中に含まれているalign_and_estimate_abundance.plを用いる。スクリプト詳細は以下

$ /usr/local/Cellar/trinity/2.8.6/libexec/util/align_and_estimate_abundance.pl

#########################################################################

########################

# Essential parameters:

########################

# --transcripts <string> transcript fasta file

# --seqType <string> fq|fa

# If Paired-end:

# --left <string>

# --right <string>

# or Single-end:

# --single <string>

# or (preferred):

# --samples_file <string> tab-delimited text file indicating biological replicate relationships.

# ex.

# cond_A cond_A_rep1 A_rep1_left.fq A_rep1_right.fq

# cond_A cond_A_rep2 A_rep2_left.fq A_rep2_right.fq

# cond_B cond_B_rep1 B_rep1_left.fq B_rep1_right.fq

# cond_B cond_B_rep2 B_rep2_left.fq B_rep2_right.fq

# # if single-end instead of paired-end, then leave the 4th column above empty.

中略

# Example usage

# ## Just prepare the reference for alignment and abundance estimation

# /usr/local/Cellar/trinity/2.8.6/libexec/util/align_and_estimate_abundance.pl --transcripts Trinity.fasta --est_method RSEM --aln_method bowtie --trinity_mode --prep_reference

# ## Run the alignment and abundance estimation (assumes reference has already been prepped, errors-out if prepped reference not located.)

# /usr/local/Cellar/trinity/2.8.6/libexec/util/align_and_estimate_abundance.pl --transcripts Trinity.fasta --seqType fq --left reads_1.fq --right reads_2.fq --est_method RSEM --aln_method bowtie --trinity_mode --output_dir rsem_outdir

## ## prep the reference and run the alignment/estimation

# ## Use a samples.txt file:

# /usr/local/Cellar/trinity/2.8.6/libexec/util/align_and_estimate_abundance.pl --transcripts Trinity.fasta --est_method RSEM --aln_method bowtie2 --prep_reference --trinity_mode --samples_file samples.txt --seqType fq

#########################################################################

---------------------------------------------

TRINITY_HOME=/usr/local/bin/trinityrnaseq

singularity exec -e ${HOME}/trinityrnaseq.v2.11.0.simg \

${TRINITY_HOME}/util/align_and_estimate_abundance.pl \

--transcripts ${HOME}/RNAseq/trinity_out/sample.Trinity.fasta \

--seqType fq \

--samples_file ${HOME}/RNAseq/samples_file.txt \

--est_method RSEM \

--aln_method bowtie2 \

--prep_reference \

--gene_trans_map ${HOME}/RNAseq/trinity_out/sample.Trinity.fasta.gene_trans_map \

--output_dir rsem_out

---------------------------------------------

--samples_fileオプションで解析を行うために必要なファイル名を指定する。サンプル数x反復数となるため、手打ちするのではなくサンプル名をリスト化したもの（タブ区切りファイル）を読み込ませる。

使用する全てのサンプル名をのせたタブ区切りファイルを作る（--samples_file<string>の項目）

例えばcontrolとmutantの2サンプルがあり、それぞれ３反復あるとき。

perl -lane 'print "$1\_$2\t$1\_$2\_rep\$3\t$_\t$_" if($_=~/(\S\S\S\S)\S+\-(\d+)\_(\d)\.fastq.gz/) ' samples_file.txt

結果は以下のように4列6行となる。

control control_rep1 control-1_1.fastq.gz control-1_2.fastq.gz

control control_rep2 control-2_1.fastq.gz control-2_2.fastq.gz

control control_rep3 control-3_1.fastq.gz control-3_2.fastq.gz

mutant mutant_rep1 mutant-1_1.fastq.gz mutant-1_2.fastq.gz

mutant mutant_rep2 mutant-2_1.fastq.gz mutant-2_2.fastq.gz

mutant mutant_rep3 mutant-3_1.fastq.gz mutant-3_2.fastq.gz

各反復ごと（ペアエンドのものは１反復に含まれる）にカウントデータが出てくるので、rsem_outディレクトリには６つのディレクトリができる。align_and_estimate_abundanceの結果は、rsem_outの各サンプル名のディレクトリに入っている。RSEM.genes.results, RSEM.isoforms.results, RSEM.isoforms.results.ok, RSEM.stat, bowtie2.bam, bowtie2.bam.for_rsem.bam, bowtie2.bam.okがそれにあたる。RSEM.isoforms.resultsを以降の解析で用いるが、中身の詳細は以下。

OUTPUT

sample_name.isoforms.results

File containing isoform level expression estimates. The first line

contains column names separated by the tab character. The format of

each line in the rest of this file is:

transcript_id gene_id length effective_length expected_count TPM FPKM

IsoPct [posterior_mean_count posterior_standard_deviation_of_count

pme_TPM pme_FPKM IsoPct_from_pme_TPM TPM_ci_lower_bound

TPM_ci_upper_bound TPM_coefficient_of_quartile_variation

FPKM_ci_lower_bound FPKM_ci_upper_bound

FPKM_coefficient_of_quartile_variation]

Fields are separated by the tab character. Fields within "" are

optional. They will not be presented if neither '--calc-pme' nor

'--calc-ci' is set.

'transcript_id' is the transcript name of this transcript. 'gene_id'

is the gene name of the gene which this transcript belongs to (denote

this gene as its parent gene). If no gene information is provided,

'gene_id' and 'transcript_id' are the same.

'length' is this transcript's sequence length (poly(A) tail is not

counted). 'effective_length' counts only the positions that can

generate a valid fragment. If no poly(A) tail is added,

'effective_length' is equal to transcript length - mean fragment

length + 1. If one transcript's effective length is less than 1, this

transcript's both effective length and abundance estimates are set to

'expected_count' is the sum of the posterior probability of each read

comes from this transcript over all reads. Because 1) each read

aligning to this transcript has a probability of being generated from

background noise; 2) RSEM may filter some alignable low quality reads,

the sum of expected counts for all transcript are generally less than

the total number of reads aligned.

'TPM' stands for Transcripts Per Million. It is a relative measure of

transcript abundance. The sum of all transcripts' TPM is 1 million.

'FPKM' stands for Fragments Per Kilobase of transcript per Million

mapped reads. It is another relative measure of transcript abundance.

If we define l_bar be the mean transcript length in a sample, which

can be calculated as

l_bar = \sum_i TPM_i / 10^6 * effective_length_i (i goes through every

transcript),

the following equation is hold:

FPKM_i = 10^3 / l_bar * TPM_i.

We can see that the sum of FPKM is not a constant across samples.

'IsoPct' stands for isoform percentage. It is the percentage of this

transcript's abandunce over its parent gene's abandunce. If its parent

gene has only one isoform or the gene information is not provided,

this field will be set to 100.

'posterior_mean_count', 'pme_TPM', 'pme_FPKM' are posterior mean

estimates calculated by RSEM's Gibbs sampler.

'posterior_standard_deviation_of_count' is the posterior standard

deviation of counts. 'IsoPct_from_pme_TPM' is the isoform percentage

calculated from 'pme_TPM' values.

'TPM_ci_lower_bound', 'TPM_ci_upper_bound', 'FPKM_ci_lower_bound' and

'FPKM_ci_upper_bound' are lower(l) and upper(u) bounds of 95%

credibility intervals for TPM and FPKM values. The bounds are

inclusive (i.e. [l, u]).

'TPM_coefficient_of_quartile_variation' and

'FPKM_coefficient_of_quartile_variation' are coefficients of quartile

variation (CQV) for TPM and FPKM values. CQV is a robust way of

measuring the ratio between the standard deviation and the mean. It is

defined as

CQV := (Q3 - Q1) / (Q3 + Q1),

where Q1 and Q3 are the first and third quartiles.

参考にしたページ

align_and_estimate_abundance.plについて：https://kazumaxneo.hatenablog.com/entry/2020/12/04/182413