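【5.2.1.1】PfamScan and the Pfam database

Pfam (http://pfam.sanger.ac.uk/) is a widely used database of protein families. The latest release at the time of writing, 26.0, contains more than 13,000 manually curated protein families. Pfam consists of two databases: Pfam-A, which is high quality and manually curated, and Pfam-B, which is annotated automatically. The Pfam-B entries are generated with the ADDA algorithm and supplement Pfam-A.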

1. Download

The PfamScan tool (ftp://ftp.sanger.ac.uk/pub/databases/Pfam/Tools)

The database (ftp://ftp.sanger.ac.uk/pub/databases/Pfam/current_release/). Following the README, I downloaded:

Pfam-A.hmm
Pfam-A.hmm.dat 
Pfam-B.hmm 
Pfam-B.hmm.dat 
active_site.dat
HMMER3 (http://hmmer.janelia.org/software)

Preparation:

Installing Perl and BioPerl. Both were already installed on my machine; reportedly they can be installed like this:
 sudo apt-get install perl (replace perl with bioperl to install BioPerl)

Installing Moose
 sudo -i (the system asks for your password; after you type it in, the user name in the prompt changes to root, shown in red, and you are ready to go). I skipped this step at first because I did not have permission, so the installation failed, which led to the error described later.
 Then use CPAN to install Moose:
 cpan Moose (this will take a while)

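Installing HMMER3

HMMER searches sequence databases for homologs and builds sequence alignments. It can search a whole database with a single query sequence and is very powerful.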
 tar zxf hmmer-3.1b1.tar.gz
 cd hmmer-3.1b1
 ./configure
 make
 make check
 make install
 cd easel;make install

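 Update the environment variable: export PATH=/sam/hmmer/binaries:$PATH (this syntax is for bash)

 You can now run hmmscan -h in a terminal to check whether the installation succeeded.
 That is all it takes: there is not much to install, and updating the environment variable is enough.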
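
To confirm that the PATH change took effect, you can ask the shell where it finds the HMMER programs. The sketch below is only an illustration: the helper name check_tool is made up for this example, and the two programs it checks (hmmscan, hmmpress) are simply the ones used in this post.

```shell
# check_tool is a hypothetical helper: it reports where a program is found on PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1 found at $(command -v "$1")"
  else
    echo "$1 not found on PATH" >&2
    return 1
  fi
}

# Check the two HMMER programs used in this post.
for tool in hmmscan hmmpress; do
  check_tool "$tool" || echo "add the HMMER binaries directory to PATH first" >&2
done
```

If a program is reported as missing, the export line above probably points at the wrong directory or was not run in the current shell.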

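 export PERL5LIB=/sam/hmmer/PfamScan:$PERL5LIB (the directory that contains pfam_scan.pl)
 You can check whether the variable was set correctly with the following command (the path to your pfam_scan.pl should be listed under @INC if it was added successfully):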
 perl -V

Why PERL5LIB rather than PATH? PATH tells the shell where to find executables, whereas PERL5LIB tells Perl where to find modules such as Bio/Pfam/Scan/PfamScan.pm.

 What we’re doing in a nutshell is telling Perl to push values on to the @INC array before loading any modules. You can do this on the command line, in your Perl code or with the environment variable PERL5LIB.
 PERL5LIB can contain more than one value. Just set it in your .bashrc file or wherever you see fit. This method works in bash:
 export PERL5LIB=/first/path/to/libs"${PERL5LIB:+:$PERL5LIB}"
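
The effect of PERL5LIB on @INC can be seen directly. This is a minimal sketch, assuming perl is installed; the temporary directory created here is only a stand-in for the folder that holds pfam_scan.pl and its Bio/ modules, and the variable name demo_lib is made up for the example.

```shell
# Hypothetical stand-in for the directory that holds pfam_scan.pl and its Bio/ modules.
demo_lib="$(mktemp -d)"

# Prepend the directory to PERL5LIB, keeping any value that is already set.
export PERL5LIB="${demo_lib}${PERL5LIB:+:$PERL5LIB}"

# Print @INC, one entry per line; directories from PERL5LIB are searched
# before the built-in ones, so the demo directory should be listed first.
if command -v perl >/dev/null 2>&1; then
  perl -e 'print "$_\n" for @INC'
fi
```

The `${PERL5LIB:+:$PERL5LIB}` form appends the old value only when one exists, which avoids leaving a stray trailing colon.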

2. Usage

2.1 Build the database indexes from the downloaded data with hmmpress

hmmpress Pfam-A.hmm
hmmpress Pfam-B.hmm
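
hmmpress writes four binary index files next to each HMM file, with the extensions .h3f, .h3i, .h3m and .h3p. A small check like the one below can tell you whether a database has already been pressed. It is only a sketch; the helper name is_pressed is made up for this example, and it assumes the HMM file sits in the current directory.

```shell
# is_pressed is a hypothetical helper: it succeeds only if all four index files
# written by hmmpress exist next to the given HMM file.
is_pressed() {
  hmm="$1"
  for ext in h3f h3i h3m h3p; do
    if [ ! -f "$hmm.$ext" ]; then
      echo "missing $hmm.$ext" >&2
      return 1
    fi
  done
}

# Example: report the state of the Pfam-A database in the current directory.
if is_pressed Pfam-A.hmm 2>/dev/null; then
  echo "Pfam-A.hmm is indexed"
else
  echo "Pfam-A.hmm is not indexed yet; run hmmpress Pfam-A.hmm"
fi
```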

2.2 pfam_scan

Usage:

 pfam_scan.pl -fasta <fasta_file> -dir <directory containing the Pfam data files>
 Additional options:
 -h : show this help
 -o : output file, otherwise send to STDOUT
 -clan_overlap : show overlapping hits within clan member families (applies to Pfam-A families only)
 -align : show the HMM-sequence alignment for each match
 -e_seq : specify hmmscan evalue sequence cutoff for Pfam-A searches (default Pfam defined)
 -e_dom : specify hmmscan evalue domain cutoff for Pfam-A searches (default Pfam defined)
 -b_seq : specify hmmscan bit score sequence cutoff for Pfam-A searches (default Pfam defined)
 -b_dom : specify hmmscan bit score domain cutoff for Pfam-A searches (default Pfam defined)
 -pfamB : search against Pfam-B HMMs (uses E-value sequence and domain cutoff 0.001), in addition to searching Pfam-A HMMs
 -only_pfamB : search against Pfam-B HMMs only (uses E-value sequence and domain cutoff 0.001)
 -as : predict active site residues for Pfam-A matches
 -json [pretty] : write results in JSON format. If the optional value "pretty" is given, the JSON output will be formatted using the "pretty" option in the JSON module
 For more help, check the perldoc:
 shell% perldoc pfam_scan.pl

For example:

/sam/hmmer/PfamScan/pfam_scan.pl -fasta contig_proteins.fasta -dir /sam/hmmer/PfamScan/lib -pfamB -out contig_pfam.fasta

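In the annotation results, what is the difference between an accession with a number after the "." and one without?

 Reply from pfam-help@ebi.ac.uk: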
 There is no difference for the user.
 The extra numerals after the . are for internal auditing and have no meaning
 for the results. In effect both are PF00013.24 - that is: version 24 since
 first creation of family.
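
Since the part after the dot is only a version number, it is often convenient to drop it before comparing or looking up accessions. A minimal sketch in bash, where the helper name strip_version is made up for this example:

```shell
# strip_version is a hypothetical helper: it removes everything from the first "." onward.
strip_version() {
  printf '%s\n' "${1%%.*}"
}

strip_version PF00013.24   # prints PF00013
strip_version PF01979.15   # prints PF01979
```

An accession without a version suffix passes through unchanged.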

2.3 A first look at the results

 # <seq id> <alignment start> <alignment end> <envelope start> <envelope end> <hmm acc> <hmm name> <type> <hmm start> <hmm end> <hmm length> <bit score> <E-value> <significance> <clan>
 1_1 111 424 110 425 PF01979.15 Amidohydro_1 Domain 2 332 333 185.8 1.5e-54 1 CL0034
 1_2 30 130 30 130 PF13600.1 DUF4140 Family 1 104 104 52.1 6.7e-14 1 No_clan

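Here, accessions starting with PF come from Pfam-A and those starting with PB come from Pfam-B. The clan is the next level up in the classification, a group of related families.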
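
The result is a whitespace-separated table with 15 columns, so a short awk command is enough to pull out the columns of interest. The sketch below is only an illustration: the helper name summarize_hits and the choice of columns are mine, and the two sample lines are the hits shown above, each written as a single record.

```shell
# summarize_hits is a hypothetical helper: it reads pfam_scan.pl output and prints
# seq id, hmm acc, hmm name, E-value and clan for every hit line.
summarize_hits() {
  # Skip comment lines (starting with "#") and anything with fewer than 15 fields.
  awk '!/^#/ && NF >= 15 { print $1, $6, $7, $13, $15 }' "$@"
}

# Two hit lines taken from the result above; the header is skipped as a comment.
summarize_hits <<'EOF'
# <seq id> <alignment start> ... <clan>
1_1 111 424 110 425 PF01979.15 Amidohydro_1 Domain 2 332 333 185.8 1.5e-54 1 CL0034
1_2 30 130 30 130 PF13600.1 DUF4140 Family 1 104 104 52.1 6.7e-14 1 No_clan
EOF
```

To run it on a real result file, pass the file name as an argument, for example summarize_hits contig_pfam.fasta.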

Use the "Jump to" box on the Pfam home page to look up the details of an annotation. It accepts:

Pfam A accession, e.g. PF02171
Pfam A identifier, e.g. piwi
Pfam B accession, e.g. PB000001
Pfam B identifier, e.g. Pfam-B_1
UniProt sequence accession, e.g. P00789
UniProt sequence ID, e.g. CANX_CHICK
NCBI "GI" number, e.g. 113594566
NCBI secondary accession, e.g. BAF18440.1
Pfam clan accession, e.g. CL0005
metaseq ID, e.g. JCVI_ORF_1096665732460
metaseq accession, e.g. JCVI_PEP_1096665732461
Pfam clan ID, e.g. Kazal
PDB entry, e.g. 2abl
Proteome species name, e.g. Homo sapiens

The previous e-mail address no longer works.
pfamlist-subscribe@sanger.ac.uk

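3. Discussion

  1. The Pfam team's e-mail address is pfam-help@sanger.ac.uk; you can send them your questions.
  2. Can't locate Bio/Pfam/Scan/PfamScan.pm in @INC (@INC contains: /etc/perl /usr/local/lib/perl/5.14.2 /usr/local/share/perl/5.14.2 /usr/lib/perl5 /usr/share/perl5 /usr/lib/perl/5.14 /usr/share/perl/5.14 /usr/local/lib/site_perl .) at /sam/hmmer/PfamScan/pfam_scan.pl line 8. BEGIN failed--compilation aborted at /sam/hmmer/PfamScan/pfam_scan.pl line 8. This error cost me a lot of time. In the end I changed two things: I installed Moose through cpan, and I set the environment variable for pfam_scan.pl. After that it worked. Given what PERL5LIB does, as described earlier in this post, I believe the second change was the real fix. Either way the problem is solved, so who cares?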
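
To find out which of the two changes a given machine still needs, you can ask Perl whether it can load each module. This is only a sketch: the helper name has_perl_module is made up for this example, and the module name Bio::Pfam::Scan::PfamScan is derived from the file path in the error message above.

```shell
# has_perl_module is a hypothetical helper: it succeeds if Perl can load the named module.
has_perl_module() {
  perl -M"$1" -e 1 >/dev/null 2>&1
}

# Moose comes from CPAN; Bio::Pfam::Scan::PfamScan is found through PERL5LIB.
for module in Moose Bio::Pfam::Scan::PfamScan; do
  if has_perl_module "$module"; then
    echo "$module: found"
  else
    echo "$module: not found in @INC"
  fi
done
```

If Moose is missing, install it with cpan; if only Bio::Pfam::Scan::PfamScan is missing, PERL5LIB most likely does not point at the PfamScan directory.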

References

  • Paper: The Pfam protein families database
  • The README on the official site
  • shuixia100's blog: http://shuixia100./1/post/2012/04/how-to-install-pfam_scanpl-under-linux-ubuntu.html
  • Brain Goo's blog: http://www.popmartian.com/tipsntricks/2011/04/11/how-to-pass-perl-library-paths-from-the-environment/