欢迎您访问程序员文章站本站旨在为大家提供分享程序员计算机编程知识!
您现在的位置是: 首页

reads去污染接头

程序员文章站 2022-06-13 21:00:42
...

对于reads:一段一段的文件,比如说下面这个文件,它是四行一段的,我只想匹配每一段第一行末尾是1还是2,匹配是1的话,匹配第二行是否以ATCG开头或者结尾,如果是输出到一个文件,不是输出到另一个文件。匹配是2的话,匹配第二行是否以GCTA开头或者结尾,如果是输出到一个文件,不是输出到另一个文件。

@FCC36J1ACXX:1:1101:1218:2181#GATTAGTG/1  
ATCGNATATTGTATAGTGATAGAGACTAATATGTACAGAAAACACATGGCTCAATGTAATCATAAAGGTAGGACATGCTTGTAAGAACTC
+
^\\\BQ\ceceggddfgggegdgbfgdgdgggfefdecfhhh`cegfhdfhhhehhhgfgffhdaffh\bdebgcdghegggggdbcddd
@FCC36J1ACXX:1:1101:1236:2181#GATTAGTG/1
GTACNGAAGACAGAGCTGGCAAACAGAAGTTTAAAAGCCCCATGAATACTTCCTGGTGTTCCTTTAACACAGCTGACTGGGTCCTATCG
+
b__eBQ`cgegggghhhiecgifgffhhidghihiihghiihihifdf[egffhihgghhfhiiiehhgggggdcbdedbc_bcccccbc
@FCC36J1ACXX:1:1101:1416:2128#GATTAGTG/2
GGCATGGCTAATGTAAAAGCAGCCCCAAAGATAATTGACACAGGAGGAGGCTTCATTTTAGAAGAGGAAGAAGAAGAAGAACAGAAAATT
+
bbbeeeeeggggggiiiiifhiiiiiiiiihiiiiiiiihiiiighidghbghiiiiiiiighiiiggggeeeeedddddcccccccccc
@FCC36J1ACXX:1:1101:1278:2132#GATTAGTG/2
ATGTCATCCCCATCAAGTTCAGCTTTTCTAAACACTTCCTTCACTCATTCATTTATTCATTCTTACGTGTATTCATTGTGAGAACAGCTA
+
bbbeceeeggfgggiiihifghdghhhgghfifhhghdfhhhihhiiii_ggfhgiihhihfhdfegfgaeffS

shell:
1 awk -vRS="@" -vfq="131206_I175_FCC36J1ACXX_L1_RHUMjdaXACDAAAPEI-52_1" 'NR>1{if(($1~/2$/&&$2~/^ATCG|ATCG$/)||($1~/1$/&&$2~/^GCTA|GCTA$/))printf RS""$0 > fq"@in.fq_2";else printf RS""$0 > fq"@un.fq_2"}' [email protected]t

PERL:

1 #!/usr/bI/perl
2 my @f = map { open my $f, '>', $_; $f } qw/1.txt 2.txt/;
3 my %h = qw/1 ATCG 2 GCTA/;
4 while (<>) {
5     my @l   = ( ~~<>,<>.<> );
6     my ($n) = /(.)$/;
7     my $i   = grep $_ eq $h{$n}, $l[0] =~ /^(.{4}).*(.{4})$/;
8     print { $f[ $i == 0 ] } $_, @l;
9 }

 

pe测序要求两条reads是位置对应的,我们需要删除一部分被污染,另一部分没污染保留下来的:

shell:

1 => cat ref.sh
2 grep @ [email protected]_2 >52.ref
3 grep @ [email protected]_2 >>52.ref
4 perl ref.pl 52.ref [email protected]_2 >131206_I175_FCC36J1ACXX_L1_RHUMjdaXACDAAAPEI-52_2.fq.result
5 perl ref.pl 52.ref [email protected]_2 >131206_I175_FCC36J1ACXX_L1_RHUMjdaXACDAAAPEI-52_1.fq.result

perl部分:

 1 => cat ref.pl
 2 #!/usr/bin/perl
 3 $ref=$ARGV[0];
 4 $fq=$ARGV[1];
 5 #$outdir=$ARGV[2];
 6 open IN1,"< $ref" or die"$!";
 7 open DATA,"< $fq" or die"$!";
 8 
 9 my %R = map { split /\// } <IN1>;
10 while (<DATA>) {
11     my ($k) = split /\//;
12     $R{$k}
13       ? ( <DATA>, <DATA>, <DATA> )
14       : print $_, <DATA>, <DATA>, <DATA>;
15 }

单独shell:

awk -F/ -vRS='@' 'NR==FNR{a[$1]=1;next}NF&&a[$1]{printf RS$0}' file1 file2