对于reads:一段一段的文件,比如说下面这个文件,它是四行一段的,我只想匹配每一段第一行末尾是1还是2,匹配是1的话,匹配第二行是否以ATCG开头或者结尾,如果是输出到一个文件,不是输出到另一个文件。匹配是2的话,匹配第二行是否以GCTA开头或者结尾,如果是输出到一个文件,不是输出到另一个文件。
@FCC36J1ACXX:1:1101:1218:2181#GATTAGTG/1
ATCGNATATTGTATAGTGATAGAGACTAATATGTACAGAAAACACATGGCTCAATGTAATCATAAAGGTAGGACATGCTTGTAAGAACTC
+
^\\\BQ\ceceggddfgggegdgbfgdgdgggfefdecfhhh`cegfhdfhhhehhhgfgffhdaffh\bdebgcdghegggggdbcddd
@FCC36J1ACXX:1:1101:1236:2181#GATTAGTG/1
GTACNGAAGACAGAGCTGGCAAACAGAAGTTTAAAAGCCCCATGAATACTTCCTGGTGTTCCTTTAACACAGCTGACTGGGTCCTATCG
+
b__eBQ`cgegggghhhiecgifgffhhidghihiihghiihihifdf[egffhihgghhfhiiiehhgggggdcbdedbc_bcccccbc
@FCC36J1ACXX:1:1101:1416:2128#GATTAGTG/2
GGCATGGCTAATGTAAAAGCAGCCCCAAAGATAATTGACACAGGAGGAGGCTTCATTTTAGAAGAGGAAGAAGAAGAAGAACAGAAAATT
+
bbbeeeeeggggggiiiiifhiiiiiiiiihiiiiiiiihiiiighidghbghiiiiiiiighiiiggggeeeeedddddcccccccccc
@FCC36J1ACXX:1:1101:1278:2132#GATTAGTG/2
ATGTCATCCCCATCAAGTTCAGCTTTTCTAAACACTTCCTTCACTCATTCATTTATTCATTCTTACGTGTATTCATTGTGAGAACAGCTA
+
bbbeceeeggfgggiiihifghdghhhgghfifhhghdfhhhihhiiii_ggfhgiihhihfhdfegfgaeffS
shell:
1 awk -vRS="@" -vfq="131206_I175_FCC36J1ACXX_L1_RHUMjdaXACDAAAPEI-52_1" 'NR>1{if(($1~/2$/&&$2~/^ATCG|ATCG$/)||($1~/1$/&&$2~/^GCTA|GCTA$/))printf RS""$0 > fq"@in.fq_2";else printf RS""$0 > fq"@un.fq_2"}' [email protected]t
PERL:
1 #!/usr/bI/perl 2 my @f = map { open my $f, '>', $_; $f } qw/1.txt 2.txt/; 3 my %h = qw/1 ATCG 2 GCTA/; 4 while (<>) { 5 my @l = ( ~~<>,<>.<> ); 6 my ($n) = /(.)$/; 7 my $i = grep $_ eq $h{$n}, $l[0] =~ /^(.{4}).*(.{4})$/; 8 print { $f[ $i == 0 ] } $_, @l; 9 }
pe测序要求两条reads是位置对应的,我们需要删除一部分被污染,另一部分没污染保留下来的:
shell:
1 => cat ref.sh 2 grep @ [email protected]_2 >52.ref 3 grep @ [email protected]_2 >>52.ref 4 perl ref.pl 52.ref [email protected]_2 >131206_I175_FCC36J1ACXX_L1_RHUMjdaXACDAAAPEI-52_2.fq.result 5 perl ref.pl 52.ref [email protected]_2 >131206_I175_FCC36J1ACXX_L1_RHUMjdaXACDAAAPEI-52_1.fq.result
perl部分:
1 => cat ref.pl 2 #!/usr/bin/perl 3 $ref=$ARGV[0]; 4 $fq=$ARGV[1]; 5 #$outdir=$ARGV[2]; 6 open IN1,"< $ref" or die"$!"; 7 open DATA,"< $fq" or die"$!"; 8 9 my %R = map { split /\// } <IN1>; 10 while (<DATA>) { 11 my ($k) = split /\//; 12 $R{$k} 13 ? ( <DATA>, <DATA>, <DATA> ) 14 : print $_, <DATA>, <DATA>, <DATA>; 15 }
单独shell:
awk -F/ -vRS='@' 'NR==FNR{a[$1]=1;next}NF&&a[$1]{printf RS$0}' file1 file2