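SeqKit: a toolbox for sequence processing

This tool is genuinely powerful. Almost every day-to-day FASTQ or FASTA processing task can be handled with it.

Documentation: https://bioinf.shenwei.me/seqkit/usage/

Main features: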

  • Basic: seq, stats, subseq, sliding, faidx, watch, sana, scat
  • Format conversion: fq2fa, fx2tab, tab2fx, convert, translate
  • Searching: grep, locate, amplicon, fish
  • Set operation: sample, rmdup, common, duplicate, split, split2, head, head-genome, range, pair
  • Edit: concat, replace, restart, mutate, rename
  • Ordering: sort, shuffle
  • BAM processing: bam

I. Installation

https://bioinf.shenwei.me/seqkit/usage/

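II. Usage examples

See the documentation for features not covered here. For reference, the top-level help output of version 2.0.0: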

SeqKit -- a cross-platform and ultrafast toolkit for FASTA/Q file manipulation

Version: 2.0.0

Author: Wei Shen <shenwei356@gmail.com>

Documents  : http://bioinf.shenwei.me/seqkit
Source code: https://github.com/shenwei356/seqkit
Please cite: https://doi.org/10.1371/journal.pone.0163962


Seqkit utlizies the pgzip (https://github.com/klauspost/pgzip) package to
read and write gzip file, and the outputted gzip file would be slighty
larger than files generated by GNU gzip.

Seqkit writes gzip files very fast, much faster than the multi-threaded pigz,
therefore there's no need to pipe the result to gzip/pigz.

Usage:
  seqkit [command]

Available Commands:
  amplicon        extract amplicon (or specific region around it) via primer(s)
  bam             monitoring and online histograms of BAM record features
  common          find common sequences of multiple files by id/name/sequence
  completion      generate the autocompletion script for the specified shell
  concat          concatenate sequences with same ID from multiple files
  convert         convert FASTQ quality encoding between Sanger, Solexa and Illumina
  duplicate       duplicate sequences N times
  faidx           create FASTA index file and extract subsequence
  fish            look for short sequences in larger sequences using local alignment
  fq2fa           convert FASTQ to FASTA
  fx2tab          convert FASTA/Q to tabular format (and length, GC content, average quality...)
  genautocomplete generate shell autocompletion script (bash|zsh|fish|powershell)
  grep            search sequences by ID/name/sequence/sequence motifs, mismatch allowed
  head            print first N FASTA/Q records
  head-genome     print sequences of the first genome with common prefixes in name
  help            Help about any command
  locate          locate subsequences/motifs, mismatch allowed
  mutate          edit sequence (point mutation, insertion, deletion)
  pair            match up paired-end reads from two fastq files
  range           print FASTA/Q records in a range (start:end)
  rename          rename duplicated IDs
  replace         replace name/sequence by regular expression
  restart         reset start position for circular genome
  rmdup           remove duplicated sequences by ID/name/sequence
  sample          sample sequences by number or proportion
  sana            sanitize broken single line FASTQ files
  scat            real time recursive concatenation and streaming of fastx files
  seq             transform sequences (extract ID, filter by length, remove gaps...)
  shuffle         shuffle sequences
  sliding         extract subsequences in sliding windows
  sort            sort sequences by id/name/sequence/length
  split           split sequences into files by id/seq region/size/parts (mainly for FASTA)
  split2          split sequences into files by size/parts (FASTA, PE/SE FASTQ)
  stats           simple statistics of FASTA/Q files
  subseq          get subsequences by region/gtf/bed, including flanking sequences
  tab2fx          convert tabular format to FASTA/Q format
  translate       translate DNA/RNA to protein sequence (supporting ambiguous bases)
  version         print version information and check for update
  watch           monitoring and online histograms of sequence features

Flags:
      --alphabet-guess-seq-length int   length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)
  -h, --help                            help for seqkit
      --id-ncbi                         FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...
      --id-regexp string                regular expression for parsing ID (default "^(\\S+)\\s?")
      --infile-list string              file of input files list (one file per line), if given, they are appended to files from cli arguments
  -w, --line-width int                  line width when outputing FASTA format (0 for no wrap) (default 60)
  -o, --out-file string                 out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
      --quiet                           be quiet and do not show extra information
  -t, --seq-type string                 sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")
  -j, --threads int                     number of CPUs. can also set with environment variable SEQKIT_THREADS) (default 4)

Use "seqkit [command] --help" for more information about a command.

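1. seq subcommand: read, convert, check the sequence type, and report basic statistics

seqkit seq hairpin.fa.gz # print the sequences

cat in.fa | seqkit stats # auto-detect the sequence type and print summary statistics
file  format  type  num_seqs  sum_len  min_len  avg_len  max_len
-     FASTA   RNA          1       11       11       11       11
seqkit stats *.f{a,q}.gz -T | csvtk pretty -t # summarize several FASTA/FASTQ files at once, with more readable output
# the -j flag sets the number of threads, so the work runs in parallel and finishes faster
seqkit seq hairpin.fa.gz -n # print full sequence names (the whole header line)
seqkit seq hairpin.fa.gz -n -i # print IDs only
seqkit seq hairpin.fa.gz -n -i --id-regexp ...  # extract the ID with a regular expression (--id-regexp is a global flag, so it works with every subcommand)
seqkit seq hairpin.fa.gz -s -w 0 # print sequences only (the global flag -w sets the output line width; 0 means no wrapping)
seqkit seq reads_1.fq.gz -w 0 # convert multi-line FASTQ to single-line FASTQ
seqkit seq hairpin.fa.gz -r -p # reverse complement
echo -e ">seq\nACGT-ACTGC-ACC" | seqkit seq -g -u # remove gaps (-g) and convert to upper case (-u)
cat hairpin.fa | seqkit seq -m 100 -M 1000 | seqkit stats # keep sequences of length 100 to 1000, then summarize them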
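
To make the transformations above concrete, here is a minimal Python sketch of what reverse complement, gap removal, and length filtering do to a sequence. It is an illustration only, not SeqKit's implementation: the function names are hypothetical, only DNA letters are complemented, and the set of gap characters (`-` and `.`) is an assumption.

```python
# Illustrative helpers only; names and defaults are assumptions, not SeqKit's code.
_DNA_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Complement each DNA base, then reverse the string."""
    return seq.translate(_DNA_COMPLEMENT)[::-1]


def remove_gaps(seq: str, gap_chars: str = "-.") -> str:
    """Drop every character listed in gap_chars (assumed gap symbols)."""
    return "".join(base for base in seq if base not in gap_chars)


def length_filter(seqs, min_len: int, max_len: int):
    """Keep sequences whose length lies in [min_len, max_len]."""
    return [s for s in seqs if min_len <= len(s) <= max_len]


if __name__ == "__main__":
    print(remove_gaps("ACGT-ACTGC-ACC").upper())
    print(reverse_complement("AACG"))
    print(length_filter(["A" * 50, "A" * 100, "A" * 2000], 100, 1000) == ["A" * 100])
```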

2. subseq subcommand

First 12 bases:
zcat hairpin.fa.gz | seqkit subseq -r 1:12
Last 12 bases:
zcat hairpin.fa.gz | seqkit subseq -r -12:-1
Trim the first and last 12 bases:
zcat hairpin.fa.gz | seqkit subseq -r 13:-13
Extract sequences using a GTF file:
seqkit subseq --gtf t.gtf t.fa
Extract sequences using a BED file, then remove duplicated records:
seqkit subseq --bed Homo_sapiens.GRCh38.84.bed.gz --chr 1 hsa.fa  | seqkit rmdup > chr1.bed.rmdup.fa
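
The three `-r` regions above use 1-based, inclusive coordinates, with negative numbers counting from the end of the sequence. The Python sketch below shows one way such a region maps to a slice. It is a minimal illustration under that assumption; the function name is hypothetical and edge cases such as out-of-range regions are handled in a simplified way.

```python
def subseq(seq: str, start: int, end: int) -> str:
    """Return seq[start..end] using 1-based, inclusive coordinates.

    Negative values count from the end (-1 is the last base).
    """
    if start == 0 or end == 0:
        raise ValueError("positions are 1-based; 0 is not a valid position")
    n = len(seq)
    # Convert the 1-based inclusive start to a 0-based slice start.
    s = start - 1 if start > 0 else n + start
    # Convert the 1-based inclusive end to a 0-based exclusive slice end.
    e = end if end > 0 else n + end + 1
    return seq[max(s, 0):max(e, 0)]


if __name__ == "__main__":
    demo = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"  # 36 symbols, easy to eyeball
    print(subseq(demo, 1, 12))     # first 12
    print(subseq(demo, -12, -1))   # last 12
    print(subseq(demo, 13, -13))   # trim 12 from both ends
```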

3. sliding

sliding sequences, circular genome supported
Usage:
  seqkit sliding [flags]

Flags:
  -C, --circular-genome   circular genome.
  -g, --greedy            greedy mode, i.e., exporting last subsequences even shorter than windows size
  -s, --step int          step size
  -W, --window int        window size
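
As a rough illustration of the window, step, and greedy flags listed above, the following Python sketch generates sliding windows over a linear sequence. It is a sketch under stated assumptions: the function name is hypothetical, circular genomes are not handled, and the treatment of trailing short windows may differ from SeqKit's exact behaviour.

```python
from typing import Iterator, Tuple


def sliding_windows(
    seq: str, window: int, step: int, greedy: bool = False
) -> Iterator[Tuple[int, int, str]]:
    """Yield (start, end, subsequence) with 1-based inclusive coordinates.

    Without greedy, iteration stops at the first window shorter than `window`.
    With greedy, trailing shorter windows are yielded as well.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    for start in range(0, len(seq), step):
        chunk = seq[start:start + window]
        if len(chunk) < window and not greedy:
            break
        yield start + 1, start + len(chunk), chunk


if __name__ == "__main__":
    for begin, end, sub in sliding_windows("ACGTACGTAC", window=4, step=3):
        print(begin, end, sub)
```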

4. faidx (similar to faidx in samtools)

Usage:
  seqkit faidx [flags] <fasta-file> [regions...]

Flags:
  -f, --full-head     print full header line instead of just ID. New fasta index file ending with .seqkit.fai will be created
  -h, --help          help for faidx
  -i, --ignore-case   ignore case
  -r, --use-regexp    IDs are regular expression. But subseq region is not suppored here.

seqkit faidx tests/hairpin.fa hsa-let-7a-1 hsa-let-7a-2  # extract the named sequences; -f prints full header lines; a region such as hsa-let-7a-1:1-10 extracts only those positions

5. Convert FASTQ to FASTA

seqkit fq2fa reads_1.fq.gz -o reads_1.fa.gz
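
Conceptually, the conversion keeps the header and sequence of each FASTQ record and drops the separator and quality lines. The Python sketch below shows this for strict 4-line FASTQ records. It is an illustration only: the function name and the `line_width` parameter are hypothetical, and multi-line FASTQ is not supported.

```python
def fastq_to_fasta(fastq_text: str, line_width: int = 60) -> str:
    """Convert strict 4-line FASTQ text to FASTA text."""
    lines = fastq_text.splitlines()
    # Ignore trailing blank lines at the end of the input.
    while lines and not lines[-1].strip():
        lines.pop()
    if len(lines) % 4 != 0:
        raise ValueError("expected 4 lines per FASTQ record")
    out = []
    for i in range(0, len(lines), 4):
        header, seq, plus, _qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        out.append(">" + header[1:])
        if line_width > 0:
            # Wrap the sequence at line_width characters per line.
            out.extend(seq[j:j + line_width] for j in range(0, len(seq), line_width))
        else:
            out.append(seq)
    return "\n".join(out) + "\n" if out else ""


if __name__ == "__main__":
    print(fastq_to_fasta("@r1 desc\nACGT\n+\nIIII\n@r2\nGG\n+\nII\n"), end="")
```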

6. Convert FASTA/Q to tabular format, with extra columns such as sequence length and GC content

$ seqkit fx2tab hairpin.fa.gz -l -g -n -i -H | head -n 4 | csvtk -t -C '&' pretty
#name       seq   qual   length   GC
cel-let-7                99       43.43
cel-lin-4                94       54.26
cel-mir-1                96       40.62
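
The length and GC columns in the table above can be reproduced by hand. The Python sketch below parses FASTA text and computes the ID, length, and GC percentage for each record. It is a minimal illustration with hypothetical function names, and it assumes GC content is the share of G and C among all letters, which may differ from SeqKit for sequences containing ambiguous bases.

```python
def parse_fasta(text: str):
    """Yield (header, sequence) pairs from FASTA text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def gc_percent(seq: str) -> float:
    """Percentage of G and C letters in seq (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    gc = sum(1 for base in seq.upper() if base in "GC")
    return 100.0 * gc / len(seq)


def fasta_to_rows(text: str):
    """Return (id, length, GC%) rows, with the ID taken up to the first whitespace."""
    rows = []
    for header, seq in parse_fasta(text):
        fields = header.split()
        seq_id = fields[0] if fields else ""
        rows.append((seq_id, len(seq), round(gc_percent(seq), 2)))
    return rows


if __name__ == "__main__":
    for row in fasta_to_rows(">s1 demo\nACGT\nGG\n>s2\nAATT\n"):
        print(*row, sep="\t")
```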

# round-trip between the two formats
zcat hairpin.fa.gz | seqkit fx2tab | seqkit tab2fx

# sort sequences by length
seqkit sort -l hairpin.fa.gz

# take the first 1000 reads
seqkit fx2tab hairpin.fa.gz | head -n 1000 | seqkit tab2fx

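7. grep: search sequences

zcat hairpin.fa.gz | seqkit grep -r -p ^hsa # keep records whose ID starts with hsa; -v inverts the match
zcat hairpin.fa.gz | seqkit grep -f list > new.fa # keep the subset of records whose IDs are listed in the file "list"
zcat hairpin.fa.gz | seqkit grep -s -i -p aggcg # keep records whose sequence contains AGGCG, ignoring case; -m sets the number of mismatches allowed
zcat hairpin.fa.gz | seqkit grep -s -r -i -p 'TT[CG]AA'  # regular-expression match for a motif with an ambiguous position; -R 1:30 restricts the search to the first 30 bases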
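
The options used above (match on ID or on sequence, ignore case, invert, restrict to a region) combine as in the Python sketch below. It is an illustration only: the function and parameter names are hypothetical, every pattern is treated as a regular expression, and mismatch-tolerant search is not covered.

```python
import re


def grep_records(records, pattern, by_seq=False, ignore_case=False,
                 invert=False, region=None):
    """Filter (header, sequence) pairs by a regular expression.

    by_seq      match against the sequence instead of the ID
    ignore_case case-insensitive matching
    invert      keep the records that do NOT match
    region      optional (start, end), 1-based inclusive, applied to the sequence
    """
    rx = re.compile(pattern, re.IGNORECASE if ignore_case else 0)
    kept = []
    for header, seq in records:
        if by_seq:
            target = seq[region[0] - 1:region[1]] if region else seq
        else:
            fields = header.split()
            target = fields[0] if fields else ""
        matched = rx.search(target) is not None
        if matched != invert:
            kept.append((header, seq))
    return kept


if __name__ == "__main__":
    recs = [
        ("hsa-let-7a demo", "ttcaaGG"),
        ("cel-lin-4", "ACGTACGT"),
        ("hsa-mir-1", "GGGGTTGAA"),
    ]
    print([h for h, _ in grep_records(recs, "^hsa")])
    print([h for h, _ in grep_records(recs, "TT[CG]AA", by_seq=True, ignore_case=True)])
```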

8. duplicate

cat tests/hairpin.fa | seqkit head -n 1 \
    | seqkit duplicate -n 2  # output the first record 2 times

9. rmdup

Remove duplicated sequences by id/name/sequence

Usage:
  seqkit rmdup [flags]

Flags:
  -n, --by-name                by full name instead of just id
  -s, --by-seq                 by seq
  -D, --dup-num-file string    file to save number and list of duplicated seqs
  -d, --dup-seqs-file string   file to save duplicated seqs
  -h, --help                   help for rmdup
  -i, --ignore-case            ignore case
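
The three deduplication keys (ID, full name, sequence) can be pictured with the Python sketch below, which keeps the first record for each key and collects the later ones separately. It is a minimal illustration with hypothetical names, not SeqKit's implementation, and it assumes the ID is the header text up to the first whitespace.

```python
def remove_duplicates(records, by="id", ignore_case=False):
    """Split (header, sequence) pairs into (unique, duplicates).

    by = "id"   key is the header up to the first whitespace
    by = "name" key is the full header line
    by = "seq"  key is the sequence itself
    """
    seen = set()
    unique, duplicates = [], []
    for header, seq in records:
        if by == "id":
            fields = header.split()
            key = fields[0] if fields else ""
        elif by == "name":
            key = header
        elif by == "seq":
            key = seq
        else:
            raise ValueError("by must be 'id', 'name' or 'seq'")
        if ignore_case:
            key = key.lower()
        if key in seen:
            duplicates.append((header, seq))
        else:
            seen.add(key)
            unique.append((header, seq))
    return unique, duplicates


if __name__ == "__main__":
    recs = [("a first", "ACGT"), ("b", "acgt"), ("a second", "TTTT")]
    print(remove_duplicates(recs, by="id"))
    print(remove_duplicates(recs, by="seq", ignore_case=True))
```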

10. sample

zcat hairpin.fa.gz | seqkit sample -p 0.1 -o sample.fa.gz # sample by proportion
zcat hairpin.fa.gz | seqkit sample -n 1000 -o sample.fa.gz # sample by number
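
The difference between the two modes can be sketched in Python as follows. This is an illustration under stated assumptions, not SeqKit's algorithm: the function names and the `seed` parameter are hypothetical, sampling by proportion keeps each record independently (so the output size only approximates the proportion), and sampling by number here draws exactly `n` records while preserving input order.

```python
import random


def sample_by_proportion(records, proportion, seed=0):
    """Keep each record independently with probability `proportion`."""
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    return [rec for rec in records if rng.random() < proportion]


def sample_by_number(records, n, seed=0):
    """Pick n records (or all, if fewer exist), preserving input order."""
    rng = random.Random(seed)
    n = min(n, len(records))
    chosen = sorted(rng.sample(range(len(records)), n))
    return [records[i] for i in chosen]


if __name__ == "__main__":
    recs = [(f"read{i}", "ACGT") for i in range(100)]
    print(len(sample_by_proportion(recs, 0.1)))
    print(len(sample_by_number(recs, 10)))
```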

11. rename

cat in.fa | seqkit rename  # unlike rename in seqtk, which renumbers all records from 1 to n, seqkit rename only adds suffixes _2 to _n to later duplicates of an ID
>a comment
acgt
>b comment of b
ACTG
>a_2 a comment
aaaa
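
The suffixing rule visible in the output above can be written as a short Python sketch: the first occurrence of an ID is left alone, and each later occurrence gets `_2`, `_3`, and so on, followed by the original header. The function names are hypothetical, and `renumber` is only a simplified stand-in for the "renumber from 1 to n" behaviour attributed to seqtk, not its actual implementation.

```python
def rename_duplicates(headers):
    """Append _2, _3, ... to repeated IDs, keeping the original header after it."""
    counts = {}
    renamed = []
    for header in headers:
        fields = header.split()
        seq_id = fields[0] if fields else ""
        counts[seq_id] = counts.get(seq_id, 0) + 1
        n = counts[seq_id]
        if n == 1:
            renamed.append(header)
        else:
            renamed.append(f"{seq_id}_{n} {header}")
    return renamed


def renumber(headers, prefix=""):
    """Replace every header with prefix + running number (simplified contrast case)."""
    return [f"{prefix}{i}" for i, _ in enumerate(headers, start=1)]


if __name__ == "__main__":
    names = ["a comment", "b comment of b", "a comment"]
    print(rename_duplicates(names))
    print(renumber(names))
```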

III. My own examples
