2019/07/12_NGS Data Analysis Course (Harvard Chan Bioinformatics Core)_4
#Learning objectives
- Learn how to search for characters or patterns in a text file using the grep command
- Learn how to write to file and append to file using output redirection
- Explore how to use the pipe (|) character to chain together commands
#Searching files
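In the previous section we used less to search within a single file. To search across multiple files without opening them, we use grep.
Grep is a command-line utility for searching plain-text data sets for lines matching a pattern or regular expression (regex). (I did not fully follow this at first.) In short, grep prints every line of plain-text input that matches a pattern or regular expression.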
Suppose we want to see how many reads in our file Mov10_oe_1.subset.fq are “bad”, with 10 consecutive Ns (NNNNNNNNNN).
$ cd ~/unix_lesson/raw_fastq
$ grep NNNNNNNNNN Mov10_oe_1.subset.fq
We get back a lot of lines. What if we want to see the whole fastq record for each of these reads?
We can use the -B and -A arguments for grep to return the matched line plus one line before (-B1) and two lines after (-A2). Since each record is four lines and the second line is the sequence, this should return the whole record.
$ grep -B 1 -A 2 NNNNNNNNNN Mov10_oe_1.subset.fq
I don't fully understand this yet; I think it will only make sense after some hands-on practice.
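For hands-on practice, here is a minimal sketch of the same search on a tiny file. The file toy.fq and its three reads are made up for illustration, and the -c option (count matching lines) is not used in the lesson itself.

```shell
# Work in a throwaway directory so nothing in raw_fastq is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# toy.fq is a made-up three-read FASTQ file; only read2 has 10 consecutive Ns.
printf '%s\n' \
  '@read1' 'ACGTACGTACGTACGT' '+' 'IIIIIIIIIIIIIIII' \
  '@read2' 'ACGNNNNNNNNNNACG' '+' '################' \
  '@read3' 'TTGACCATGGCATCAA' '+' 'IIIIIIIIIIIIIIII' > toy.fq

# -c prints the number of matching lines instead of the lines themselves.
bad_count=$(grep -c NNNNNNNNNN toy.fq)
echo "bad reads: $bad_count"

# One line before (the header) and two after (+ and quality) give the full record.
bad_record=$(grep -B 1 -A 2 NNNNNNNNNN toy.fq)
echo "$bad_record"
```

The sequence is the second line of each four-line record, which is why one line of leading context and two lines of trailing context are enough to recover the whole read.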
#Redirection
The redirection command for writing something to file is >.
Let’s try it out and put all the reads that contain ‘NNNNNNNNNN’ in Mov10_oe_1.subset.fq into another file called bad_reads.txt.
$ grep -B 1 -A 2 NNNNNNNNNN Mov10_oe_1.subset.fq > bad_reads.txt
The prompt should sit there a little bit, and then it should look like nothing happened. But you should have a new file called bad_reads.txt.
$ ls -l
Take a look at the file and see if it contains what you think it should. NOTE: If we already had a file named bad_reads.txt in our directory, the redirect would have overwritten it without any warning.
Use > to redirect: > followed by a file name writes the output to a new file with that name.
The redirection command for appending something to an existing file is >>.
If we use >>, it will append to rather than overwrite a file. This can be useful for saving the results of more than one search in the same file.
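The difference between the two operators can be seen in a short sketch. The file name notes.txt and the echoed text are hypothetical and are not part of the lesson data.

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# > creates notes.txt, or silently replaces it if it already exists.
echo "first search" > notes.txt
echo "second search" > notes.txt
after_overwrite=$(cat notes.txt)

# >> keeps the existing contents and adds the new line at the end.
echo "third search" >> notes.txt
after_append=$(cat notes.txt)

echo "$after_append"
```

After the second >, "first search" is gone; after >>, the file holds both "second search" and "third search".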
The redirection command for using the output of one command as the input of another command is | (the pipe).
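We can also count the number of lines using the wc command. wc stands for word count; with the -l option it reports only the number of lines.
Note that | does not skip the command before it: both commands run, and the output of the command on the left becomes the input of the command on the right.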
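A minimal sketch of a pipe feeding wc -l, assuming three made-up lines in place of a real file:

```shell
# Three made-up lines stand in for a real file; two of them contain "exon".
sample='gene1 exon
gene1 CDS
gene2 exon'

# printf emits the lines, grep keeps the matching ones, wc -l counts them.
exon_lines=$(printf '%s\n' "$sample" | grep exon | wc -l)

# Unquoted on purpose: some wc versions pad the number with spaces.
echo "lines containing exon:" $exon_lines
```

Each stage only sees what the stage before it wrote, so wc -l counts the lines that grep kept, not the lines in the original input.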
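‘cut’ is a program that extracts columns (fields) from files. By default it treats tabs as the column separator, and the -f option selects which columns to keep.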
$ grep exon chr1-hg19_genes.gtf | cut -f1,4,5,7 | head
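This command keeps only the lines containing “exon”, extracts columns 1, 4, 5, and 7 from them (in a GTF file: chromosome, start, end, and strand), and shows the first 10 rows.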
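The same grep and cut steps can be tried on a tiny tab-separated file. The file toy.gtf and its three rows are made up for illustration; only the nine-column GTF layout is taken from the real format.

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# toy.gtf is a made-up, tab-separated file with the 9 GTF columns.
printf 'chr1\tunknown\texon\t100\t200\t.\t+\t.\tgene_id "A";\n' >  toy.gtf
printf 'chr1\tunknown\tCDS\t120\t180\t.\t+\t0\tgene_id "A";\n' >> toy.gtf
printf 'chr1\tunknown\texon\t300\t400\t.\t-\t.\tgene_id "B";\n' >> toy.gtf

# grep drops the CDS row; cut keeps chromosome, start, end, and strand.
exon_coords=$(grep exon toy.gtf | cut -f1,4,5,7)
echo "$exon_coords"
```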
Removing duplicate exons
We can use a new tool, sort, to remove exons that show up more than once. The sort command with the -u option returns only unique lines.
$ grep exon chr1-hg19_genes.gtf | cut -f1,4,5,7 | sort -u | head
Counting the total number of exons
First, let’s check how many lines we would have without using sort -u by piping the output to wc -l.
$ grep exon chr1-hg19_genes.gtf | cut -f1,4,5,7 | wc -l
Now, to count how many unique exons are on chromosome 1, we will add back the sort -u and pipe the output to wc -l.
$ grep exon chr1-hg19_genes.gtf | cut -f1,4,5,7 | sort -u | wc -l
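To see why the two counts differ, here is a small sketch with made-up rows in which one exon is listed twice (for example, because two transcripts share it). The coordinates are hypothetical.

```shell
# Made-up chromosome/start/end/strand rows; the first exon appears twice.
exons=$(printf 'chr1\t100\t200\t+\nchr1\t300\t400\t-\nchr1\t100\t200\t+\n')

# Without sort -u every row is counted, duplicates included.
total=$(printf '%s\n' "$exons" | wc -l)

# sort -u collapses identical rows before wc -l counts them.
unique=$(printf '%s\n' "$exons" | sort -u | wc -l)

# Unquoted on purpose: some wc versions pad the number with spaces.
echo "total:" $total "unique:" $unique
```

Note that sort -u only treats rows as duplicates when they are identical, which is why cutting down to the coordinate columns first matters.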
#Commands, options, and keystrokes covered in this lesson
grep
> (output redirection)
>> (output redirection, append)
| (output redirection, pipe)
wc
cut
sort