【3.3.5】usearch--比对,聚类,去除chimera

主页http://drive5.com/usearch/usearch_docs.html

主要有3个作用,见下图。比对,聚类,去除chimera。

他的下载是需要邮箱的认证的。

二、案例

2.1 寻找16s

1,下载greengenes 16Ss数据库:(现在最新版本是gg_13_5.fasta.gz)

http://greengenes.secondgenome.com/downloads

ftp://greengenes.microbio.me/greengenes_release/

2,为了更快的搜寻,先将16s基因聚类

usearch –sort  gg_13_5.fasta –output 16Ssorted.fa

usearch –cluster 16Ssorted.fa –id 0.90 –seedsout gg_90.fa

##这里的usearch需要认证的。

3,用数据库中16s基因来寻找我组装后的序列中的16s

makeblastdb –in assembly.fa –dbtype nucl

 blastn –query gg_90.fa –db assembly.fa –num_threds 10 –ma_target_seqs 20 –outfmt 6 –evalue 1e-10 –out gg_blast_results.txt

4,最后,最长匹配的序列提取额出来,“m”设定最短的序列长度

perl \multi-metagenome\misc.scripts\extract.long.hits.from.blast.pl-b gg_blast_results.txt –d assembly.fa -m 500 -o assembly.16S.fa

5,提取出来序列格式如下,可以用RDP或者SINA在线服务器来接着分析。

>68.890.56.834.2446
ATCGATCGATCGATCG…
  Scaffold name
  Start position on the scaffold
  End position on the scaffold
  16S gene length
  Total length of parent scaffold

2.2 序列聚类

cmd = '/data/software/usearch/usearch11.0.667_i86linux32  -cluster_fast %s -id 0.93  -uc %s -centroids %s' % (cdr_fp,clstr_fp, cluster_rep_fa)
  print(cmd)
  os.system(cmd)

cluster_fast聚类还是有问题,有两条序列,本来比较相近,但被划分到了不同的clontype里面。因为取决于模板的选择。最终选择的是calc_distmx 算出序列两两的距离,然后基于cluster_aggd来聚类

def create_clonotype(cdr_fp, result_tsv,identity=0.93):
  """ 一般的cdr长度为50,所以相似度设置为了0.94,"""

  if not os.path.exists('clonotype'):
    os.system('mkdir -p clonotype')
  clstr_fp = result_tsv + 'clusters.uc'
  cluster_rep_fa = result_tsv + 'cluster_rep.fa'

  # cmd = '/data/software/usearch/usearch11.0.667_i86linux32  -cluster_fast %s -id %s -uc %s -centroids %s' % (cdr_fp,identity,clstr_fp, cluster_rep_fa)

  mx_tsv = result_tsv + '_mx.txt'
  tree_tsv = result_tsv + '.tree'
  cmd = '/data/software/usearch/usearch11.0.667_i86linux32 -calc_distmx %s -tabbedout %s -maxdist 0.1 -termdist 0.3' % (cdr_fp,mx_tsv)
  cmd_2 = '/data/software/usearch/usearch11.0.667_i86linux32  -cluster_aggd %s -treeout %s -clusterout %s  -id 0.93 -linkage min' % (mx_tsv,tree_tsv,clstr_fp)

  print(cmd)
  os.system(cmd)
  os.system(cmd_2)

  cluster_1 = {}
  with open(clstr_fp) as data1:
    for each_line in data1:
      cnt = each_line.strip().split('\t')
      num = cnt[0]
      barcode = cnt[1]
      if num not in cluster_1:
        cluster_1[num] = {}
        cluster_1[num]['barcode'] = []
        cluster_1[num]['num'] = 0
      if barcode not in cluster_1[num]['barcode']:
        cluster_1[num]['barcode'].append(barcode)
        cluster_1[num]['num'] += 1
  cluster_sorted = sorted(cluster_1.items(),key= lambda vv:vv[1]['num'],reverse=True)

  num = 0
  with open(result_tsv,'w') as data2:
    header = ['Seq_src','new_clonotype']
    data2.write('%s\n' % '\t'.join(header))
    for one_key in cluster_sorted:
      num +=1
      for one_bar in one_key[1]['barcode']:
        data2.write('%s\t%s\n' % (one_bar ,num))

算两两序列的相似性

/data/software/usearch/usearch11.0.667_i86linux32 -calc_distmx clonotype_tmp/IGHV1-26_IGHJ4_10.fa -tabbedout test.tmp  -maxdist 0.3   -wordlength 3

The -maxdist option gives the maximum distance which should be written when the output is in tabbed_pairs format. This can greatly reduce the size of an output file.

An identity threshold for terminating the calculation can be specified using the termdist option, which is in the range 0.0 to 1.0, where 0.0 means identical sequences (100% sequence id). This is a speed optimization that saves time by skipping alignments of low-identity pairs. If a pair is encountered with distance > maxdist, the calculation is stopped. Because U-sorted order does not correlate perfectly with identity, you should set termdist somewhat lower than the maximum distance that you care about. For example, if you want all pairs with >80% id to appear in the matrix, then you might set -maxdist 0.2 -termdist 0.3. Tests on small datasets can be used to tune -termdist to a reasonable value. By default, termdist is set to 1.0 and the calculation continues for all pairs that have at least one word in common. The word length is set by the wordlength option.

如果序列太短,word_size 设置的太大,两条序列没有word_size的交集,序列很容易没有相似度,就比如下面的这个例子。最终将-wordlength 3 以后,才能看到这两个序列的相似性。

AKGGSFAMDY
AKGGTWAMDY

参考资料

药企,独角兽,苏州。团队长期招人,感兴趣的都可以发邮件聊聊:tiehan@sina.cn
个人公众号,比较懒,很少更新,可以在上面提问题,如果回复不及时,可发邮件给我: tiehan@sina.cn