[5.4] IgBLAST: introduction and parameter settings

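IgBLAST was developed at NCBI to facilitate the analysis of immunoglobulin and T cell receptor variable domain sequences.

IgBLAST lets users view the matches to germline V, D, and J genes, the details of the rearrangement junctions, and the delineation of the IG V domain framework regions and complementarity-determining regions. It can analyze both nucleotide and protein sequences and can process sequences in batches. In addition, IgBLAST can search the germline gene databases and other sequence databases at the same time, which minimizes the chance of missing the best-matching germline V gene.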

1. Installation

1.1 Installing IgBLAST

Download URL:

https://ftp.ncbi.nih.gov/blast/executables/igblast/release/LATEST/

Choose the latest release (1.17.0 at the time of writing):

cd /data/software/igblast/
wget -c https://ftp.ncbi.nih.gov/blast/executables/igblast/release/LATEST/ncbi-igblast-1.17.0-x64-linux.tar.gz

tar -xzf ncbi-igblast-1.17.0-x64-linux.tar.gz

Add the IgBLAST bin directory to PATH by appending the two lines below to /etc/profile:

vim /etc/profile

igblast_path=/data/software/igblast/ncbi-igblast-1.17.0/bin
export PATH=$igblast_path:$PATH

source /etc/profile
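
After reloading the profile, it can be worth confirming that the binaries are actually visible on PATH. The following is a minimal sketch; the helper name check_on_path is hypothetical and not part of IgBLAST.

```shell
# Hypothetical helper (not shipped with IgBLAST): report whether each named
# program can be found on PATH. Returns 0 if all are found, 1 otherwise.
check_on_path() {
    cop_missing=0
    for cop_tool in "$@"; do
        if cop_path=$(command -v "$cop_tool" 2>/dev/null); then
            echo "found: $cop_tool -> $cop_path"
        else
            echo "missing: $cop_tool"
            cop_missing=1
        fi
    done
    return "$cop_missing"
}
```

A typical call would be: check_on_path igblastn igblastp makeblastdb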

1.2 Databases

The database, internal_data, and optional_file data sets

cd /data/database/igblast

# database
wget -r -np ftp://ftp.ncbi.nih.gov/blast/executables/igblast/release/database/
cp -fr ./ftp.ncbi.nih.gov/blast/executables/igblast/release/database ./

# internal_data
wget -r -np ftp://ftp.ncbi.nih.gov/blast/executables/igblast/release/internal_data/
cp -fr ./ftp.ncbi.nih.gov/blast/executables/igblast/release/internal_data ./

# optional_file
wget -r -np ftp://ftp.ncbi.nih.gov/blast/executables/igblast/release/optional_file/
cp -fr ./ftp.ncbi.nih.gov/blast/executables/igblast/release/optional_file ./

# remove the mirrored directory tree once the copies are in place
rm -fr ftp.ncbi.nih.gov

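IMGT data

Download the IMGT reference sequences from: http://www.imgt.org/vquest/refseqh.html#VQUEST

The downloaded FASTA files have to be converted into BLAST databases with makeblastdb: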

makeblastdb -parse_seqids -dbtype nucl -in my_seq_file

For the full procedure, see https://ncbi.github.io/igblast/cook/How-to-set-up.html

wget -c ftp://ftp.ncbi.nih.gov/blast/executables/igblast/release/edit_imgt_file.pl

# V-segment database
perl edit_imgt_file.pl IMGT_Mouse_IGHV.fasta > ./database/mouse_igh_v
makeblastdb -parse_seqids -dbtype nucl -in ./database/mouse_igh_v
# J-segment database
perl edit_imgt_file.pl IMGT_Mouse_IGHJ.fasta > ./database/mouse_igh_j
makeblastdb -parse_seqids -dbtype nucl -in ./database/mouse_igh_j
# D-segment database
perl edit_imgt_file.pl IMGT_Mouse_IGHD.fasta > ./database/mouse_igh_d
makeblastdb -parse_seqids -dbtype nucl -in ./database/mouse_igh_d

Add the database environment variables to /etc/profile:

vim /etc/profile

export BLASTDB='/data/database/igblast'

#export internal_data='/data/database/igblast/internal_data'

export IGDATA='/data/database/igblast'

source /etc/profile

If the database environment variables are not set, IgBLAST fails with an error saying that the internal_data directory cannot be found.
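
To catch that error before a long run, a small pre-flight check along the following lines may help. This is only a sketch: check_igdata is a hypothetical helper, and it assumes the directory layout used above (internal_data and optional_file directly under the IGDATA directory).

```shell
# Hypothetical helper: check that a directory looks like a usable IGDATA root.
# internal_data is treated as required; a missing optional_file only warns.
# Usage: check_igdata [dir]   (defaults to $IGDATA)
check_igdata() {
    ci_dir=${1:-${IGDATA:-}}
    if [ -z "$ci_dir" ]; then
        echo "IGDATA is not set" >&2
        return 1
    fi
    if [ ! -d "$ci_dir/internal_data" ]; then
        echo "missing required directory: $ci_dir/internal_data" >&2
        return 1
    fi
    if [ ! -d "$ci_dir/optional_file" ]; then
        echo "note: $ci_dir/optional_file not found (needed for -auxiliary_data)" >&2
    fi
    return 0
}
```

Running check_igdata with no argument after sourcing /etc/profile checks whatever IGDATA currently points to.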

2. Running

cd /data/software/igblast/ncbi-igblast-1.17.0
mkdir test;cd test

igblastp -germline_db_V igblast/database/mouse_gl_V -query test.fa -outfmt 3 -organism mouse

Using an IMGT germline database:

#igblastp -germline_db_V igblast/imgt_201807/IGKVLV -germline_db_J igblast/imgt_201807/IGKVLV -germline_db_D igblast/imgt_201807/IGKVLV -organism human -query test.fa -auxiliary_data igblast/optional_file/human_gl.aux -show_translation


./bin/igblastn -query infile.fasta -out outfile.igblast.fmt7.out -outfmt 7 -germline_db_V ./database/mouse_gl_V -germline_db_J ./database/mouse_gl_J -germline_db_D ./database/mouse_gl_D -auxiliary_data ./optional_file/mouse_gl.aux -organism mouse -domain_system imgt -ig_seqtype Ig -show_translation -num_threads 10
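
Because the species name appears in several places in that command (the three germline databases, the auxiliary file, and -organism), it is easy to mix them up. One way to keep them consistent is to generate the command from a single argument. The sketch below only prints the command; igblastn_cmd is a hypothetical helper, and it assumes the NCBI-supplied naming scheme database/<organism>_gl_V, _D, _J and optional_file/<organism>_gl.aux, with paths that contain no spaces.

```shell
# Hypothetical helper: print an igblastn command line in which the germline
# databases, the auxiliary file, and -organism all use the same species name.
# Usage: igblastn_cmd <organism> <query.fasta> <output> [data_root]
# data_root defaults to $IGDATA, or the current directory if IGDATA is unset.
igblastn_cmd() {
    ic_organism=$1
    ic_query=$2
    ic_out=$3
    ic_root=${4:-${IGDATA:-.}}
    printf 'igblastn -query %s -out %s -outfmt 7' "$ic_query" "$ic_out"
    for ic_seg in V D J; do
        printf ' -germline_db_%s %s/database/%s_gl_%s' \
            "$ic_seg" "$ic_root" "$ic_organism" "$ic_seg"
    done
    printf ' -auxiliary_data %s/optional_file/%s_gl.aux' "$ic_root" "$ic_organism"
    printf ' -organism %s -domain_system imgt -ig_seqtype Ig\n' "$ic_organism"
}
```

The printed line can be inspected first and then executed, for example with: igblastn_cmd mouse infile.fasta outfile.igblast.fmt7.out | sh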

Parameter reference

Output format

*** Formatting options
 -outfmt <String>
   alignment view options:
     3 = Flat query-anchored, show identities,
     4 = Flat query-anchored, no identities,
     7 = Tabular with comment lines
     19 = Rearrangement summary report (AIRR format)

   Options 7 can be additionally configured to produce
   a custom format specified by space delimited format specifiers.
   The supported format specifiers are:
            qseqid means Query Seq-id
               qgi means Query GI
              qacc means Query accesion
           qaccver means Query accesion.version
              qlen means Query sequence length
            sseqid means Subject Seq-id
         sallseqid means All subject Seq-id(s), separated by a ';'
               sgi means Subject GI
            sallgi means All subject GIs
              sacc means Subject accession
           saccver means Subject accession.version
           sallacc means All subject accessions
              slen means Subject sequence length
            qstart means Start of alignment in query
              qend means End of alignment in query
            sstart means Start of alignment in subject
              send means End of alignment in subject
              qseq means Aligned part of query sequence
              sseq means Aligned part of subject sequence
            evalue means Expect value
          bitscore means Bit score
             score means Raw score
            length means Alignment length
            pident means Percentage of identical matches
            nident means Number of identical matches
          mismatch means Number of mismatches
          positive means Number of positive-scoring matches
           gapopen means Number of gap openings
              gaps means Total number of gaps
              ppos means Percentage of positive-scoring matches
            frames means Query and subject frames separated by a '/'
            qframe means Query frame
            sframe means Subject frame
              btop means Blast traceback operations (BTOP)
            staxid means Subject Taxonomy ID
          ssciname means Subject Scientific Name
          scomname means Subject Common Name
        sblastname means Subject Blast Name
         sskingdom means Subject Super Kingdom
           staxids means unique Subject Taxonomy ID(s), separated by a ';'
                         (in numerical order)
         sscinames means unique Subject Scientific Name(s), separated by a ';'
         scomnames means unique Subject Common Name(s), separated by a ';'
        sblastnames means unique Subject Blast Name(s), separated by a ';'
                         (in alphabetical order)
        sskingdoms means unique Subject Super Kingdom(s), separated by a ';'
                         (in alphabetical order)
            stitle means Subject Title
        salltitles means All Subject Title(s), separated by a '<>'
           sstrand means Subject Strand
             qcovs means Query Coverage Per Subject
           qcovhsp means Query Coverage Per HSP
            qcovus means Query Coverage Per Unique Subject (blastn only)
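  • The default output format is 3. When format 7 is requested without any specifiers, the default columns are: 'qseqid sseqid pident length mismatch gapopen gaps qstart qend sstart send evalue bitscore'
  • Custom columns can be requested like this: -outfmt "7 qseqid sseqid pident length mismatch"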

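For batch processing, -outfmt 19 is recommended: it writes a structured, tab-separated table in the AIRR rearrangement format.

Each column is described at: https://docs.airr-community.org/en/latest/datarep/rearrangements.html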
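
Since the AIRR table is plain TSV with a header row, individual columns can be pulled out by name with standard tools. The following is a minimal sketch, not an official AIRR utility: airr_select is a hypothetical helper, and the column names in the usage line (sequence_id, v_call, j_call, productive) are taken from the AIRR rearrangement schema linked above.

```shell
# Hypothetical helper: print selected columns of an AIRR TSV file by name,
# header row included. Exits with status 1 if a requested column is absent.
# Usage: airr_select <file.tsv> <column> [<column> ...]
airr_select() {
    as_file=$1
    shift
    as_cols=$(IFS=,; echo "$*")   # join the requested names with commas
    awk -F '\t' -v OFS='\t' -v want="$as_cols" '
        NR == 1 {
            n = split(want, names, ",")
            for (i = 1; i <= NF; i++) idx[$i] = i      # header name -> position
            for (k = 1; k <= n; k++) {
                if (!(names[k] in idx)) {
                    print "unknown column: " names[k] > "/dev/stderr"
                    exit 1
                }
            }
        }
        {
            line = ""
            for (k = 1; k <= n; k++)
                line = line (k > 1 ? OFS : "") $(idx[names[k]])
            print line
        }' "$as_file"
}
```

For example: airr_select out.tsv sequence_id v_call j_call productive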

3. Discussion

3.1 Extending alignments at the 5' and 3' ends

Add the -extend_align3end and -extend_align5end flags:

 export BLASTDB='/data/database/igblast';
 export IGDATA='/data/database/igblast';
/data/software/igblast/ncbi-igblast-1.17.0/bin/igblastn -query query.fa -out out.tsv -outfmt 19 -germline_db_V ./imgt_20201124/mouse_v -germline_db_J ./imgt_20201124/mouse_j -germline_db_D ./imgt_20201124/mouse_d -auxiliary_data ./optional_file/mouse_gl.aux -organism mouse -domain_system imgt -ig_seqtype Ig -show_translation -num_threads 30 -num_clonotype  200000 -extend_align3end -extend_align5end

3.2 Building a custom germline database

Download the gene sequences from the IMGT/GENE-DB IG "V-REGION", "D-REGION", "J-REGION", and "C-GENE exon" sets.

Then:

cat IGHV_nucl.fa IGLV_nucl.fa IGKV_nucl.fa > mouse_v.fa
cat IGHJ_nucl.fa IGKJ_nucl.fa IGLJ_nucl.fa > mouse_j.fa
cat IGHD_nucl.fa >mouse_d.fa

/data/software/igblast/ncbi-igblast-1.17.0/bin/edit_imgt_file.pl mouse_v.fa > mouse_v;
/data/software/igblast/ncbi-igblast-1.17.0/bin/makeblastdb -parse_seqids -dbtype nucl -in mouse_v

/data/software/igblast/ncbi-igblast-1.17.0/bin/edit_imgt_file.pl  mouse_j.fa > mouse_j;
/data/software/igblast/ncbi-igblast-1.17.0/bin/makeblastdb -parse_seqids -dbtype nucl -in mouse_j

/data/software/igblast/ncbi-igblast-1.17.0/bin/edit_imgt_file.pl  mouse_d.fa > mouse_d;
/data/software/igblast/ncbi-igblast-1.17.0/bin/makeblastdb -parse_seqids -dbtype nucl -in mouse_d
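
When several IMGT files are concatenated like this, the same allele name can end up in the merged file more than once, and makeblastdb with -parse_seqids is expected to reject duplicate identifiers. A quick check on the edited file before calling makeblastdb might look like the sketch below; fasta_duplicate_ids is a hypothetical helper, and it assumes the sequence ID is the first whitespace-delimited word of each header line.

```shell
# Hypothetical helper: list sequence IDs that occur more than once in a
# FASTA file. Prints nothing when every ID is unique.
# Usage: fasta_duplicate_ids <file>
fasta_duplicate_ids() {
    # Take the first word of each header line, strip the leading '>',
    # then keep only the IDs that appear at least twice.
    awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$1" | sort | uniq -d
}
```

For example, fasta_duplicate_ids mouse_v should print nothing before makeblastdb is run on mouse_v.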
