2019/07/12_NGS Data Analysis Course (Harvard Chan Bioinformatics Core)_4


#Learning objectives

Learn how to search for characters or patterns in a text file using the grep command
Learn how to write to file and append to file using output redirection
Explore how to use the pipe (|) character to chain together commands

#Searching files

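In the previous section we used less to search for content inside a single file. If we want to search across multiple files without opening them, we need grep.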

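Grep is a command-line utility for searching plain-text data sets for lines matching a pattern or regular expression (regex). (I did not fully follow this definition at first.) In short, grep searches plain-text data and returns every line that matches a pattern or regular expression (regex).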

Suppose we want to see how many reads in our file Mov10_oe_1.subset.fq are “bad”, with 10 consecutive Ns (NNNNNNNNNN).

$ cd ~/unix_lesson/raw_fastq

$ grep NNNNNNNNNN Mov10_oe_1.subset.fq

We get back a lot of lines. What if we want to see the whole fastq record for each of these reads?

We can use the -B and -A arguments for grep to return the matched line plus one line before (-B 1) and two lines after (-A 2). Since each record is four lines and the second line is the sequence, this should return the whole record.

$ grep -B 1 -A 2 NNNNNNNNNN Mov10_oe_1.subset.fq

I do not fully understand this yet; I think it takes hands-on practice to really get it.
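A tiny worked example can make the -B/-A behaviour concrete. The sketch below builds a two-record FASTQ file by hand and pulls out the full record of the one read that contains ten Ns. The file name toy.fq and the read contents are made up for illustration; they are not part of the course data.

```shell
# Work in a throwaway directory so nothing in the lesson folder is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Two hypothetical FASTQ records, four lines each:
# header, sequence, "+" separator, quality string.
printf '%s\n' \
  '@read1' 'ACGTACGTACGTACGT' '+' 'IIIIIIIIIIIIIIII' \
  '@read2' 'ACGNNNNNNNNNNACG' '+' '################' \
  > toy.fq

# The pattern matches the sequence line (line 2 of a record), so one line
# before (-B 1) and two lines after (-A 2) together give the whole record.
grep -B 1 -A 2 NNNNNNNNNN toy.fq
```

Only the four lines of @read2 are printed. One caveat: when several matching reads are not next to each other in the file, GNU grep prints a line containing -- between the groups of context lines, so the output is not strictly valid FASTQ until those separator lines are removed.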

#Redirection

The redirection command for writing something to file is >.

Let’s try it out and put all the sequences that contain ‘NNNNNNNNNN’ from Mov10_oe_1.subset.fq into another file called bad_reads.txt.

$ grep -B 1 -A 2 NNNNNNNNNN Mov10_oe_1.subset.fq > bad_reads.txt

The prompt should sit there a little bit, and then it should look like nothing happened. But you should have a new file called bad_reads.txt.

$ ls -l

Take a look at the file and see if it contains what you think it should. NOTE: If we already had a file named bad_reads.txt in our directory, the > redirection would have overwritten it without any warning.

Use > to redirect output: > followed by a file name writes the output to a new file with that name.

The redirection command for appending something to an existing file is >>.

If we use >>, it will append to rather than overwrite a file. This can be useful for saving the results of more than one search in the same file.
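The difference between > and >> is easiest to see with two searches written to the same file. This is a minimal sketch that uses two made-up single-read FASTQ files (sample1.fq and sample2.fq are hypothetical names, not course files).

```shell
# Work in a throwaway directory so no real bad_reads.txt is overwritten.
workdir=$(mktemp -d)
cd "$workdir"

# Two hypothetical FASTQ files with one "bad" read each.
printf '%s\n' '@a' 'NNNNNNNNNNAC' '+' '############' > sample1.fq
printf '%s\n' '@b' 'GTNNNNNNNNNN' '+' '############' > sample2.fq

# ">" creates bad_reads.txt (or silently replaces it if it already exists).
grep -B 1 -A 2 NNNNNNNNNN sample1.fq > bad_reads.txt

# ">>" adds the second search to the end and keeps what is already there.
grep -B 1 -A 2 NNNNNNNNNN sample2.fq >> bad_reads.txt

cat bad_reads.txt
```

The file ends up with both records, eight lines in total. Had the second command used > instead of >>, only the record from sample2.fq would remain.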

The redirection command for using the output of a command as input for a different command is |.

In other words, | takes whatever the command on its left prints and hands it to the command on its right as input.

We can also count the number of lines using the wc command. wc stands for word count; with the -l option it reports only the number of lines.

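With |, every command in the chain runs, but only the output of the last command is printed to the screen; the output of each earlier command is passed along as input to the next one.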
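As a small illustration of piping into wc, the sketch below counts the "bad" reads in a made-up three-record FASTQ file instead of printing them (toy.fq and its contents are assumptions for the example).

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir"

# Three hypothetical reads; @r1 and @r3 contain ten consecutive Ns.
printf '%s\n' \
  '@r1' 'NNNNNNNNNNACGT' '+' 'IIIIIIIIIIIIII' \
  '@r2' 'ACGTACGTACGTAC' '+' 'IIIIIIIIIIIIII' \
  '@r3' 'ACNNNNNNNNNNGT' '+' 'IIIIIIIIIIIIII' \
  > toy.fq

# grep finds the matching sequence lines; the pipe sends them to wc -l,
# which prints how many lines it received instead of the lines themselves.
grep NNNNNNNNNN toy.fq | wc -l
```

The matching sequences never appear on the screen; only the count (2 here) does, because wc -l is the last command in the chain.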

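‘cut’ is a program that extracts columns from files. By default it treats tabs as the column separator, and the -f option selects which columns (fields) to keep.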

$ grep exon chr1-hg19_genes.gtf | cut -f1,4,5,7 | head

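This command takes the lines that contain exon, extracts columns 1, 4, 5, and 7 from them (in a GTF file these are the chromosome, start, end, and strand), and shows the first 10 lines of the result.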
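To see which pieces cut keeps, it can help to run the same pipeline on a file small enough to read by eye. The three GTF-style lines below are invented for illustration (toy.gtf, the source name src, and the gene IDs are all assumptions).

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical GTF-style lines; \t is a tab, the separator cut expects.
# Columns: chromosome, source, feature, start, end, score, strand, frame, attributes.
printf 'chr1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "A";\n'  > toy.gtf
printf 'chr1\tsrc\tCDS\t120\t180\t.\t+\t0\tgene_id "A";\n'  >> toy.gtf
printf 'chr1\tsrc\texon\t300\t400\t.\t-\t.\tgene_id "B";\n' >> toy.gtf

# Keep only the exon lines, then only columns 1, 4, 5 and 7.
grep exon toy.gtf | cut -f1,4,5,7
```

The CDS line is dropped by grep, and for the two exon lines only the chromosome, start, end, and strand remain.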

Removing duplicate exons

We can use a new tool, sort, to remove exons that show up more than once. The sort command with the -u option returns only unique lines.

$ grep exon chr1-hg19_genes.gtf | cut -f1,4,5,7 | sort -u | head

Counting the total number of exons

First, let’s check how many lines we would have without using sort -u by piping the output to wc -l.

$ grep exon chr1-hg19_genes.gtf | cut -f1,4,5,7 | wc -l

Now, to count how many unique exons are on chromosome 1, we will add back the sort -u and pipe the output to wc -l.

$ grep exon chr1-hg19_genes.gtf | cut -f1,4,5,7 | sort -u | wc -l
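The two counts above can be reproduced on a toy file where the answer is known in advance. In this sketch the same exon appears in two hypothetical transcripts, T1 and T2 (the file toy.gtf and all of its values are made up for illustration).

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir"

# Three exon lines; the first two have identical coordinates and strand
# but belong to different (hypothetical) transcripts.
printf 'chr1\tsrc\texon\t100\t200\t.\t+\t.\ttranscript_id "T1";\n'  > toy.gtf
printf 'chr1\tsrc\texon\t100\t200\t.\t+\t.\ttranscript_id "T2";\n' >> toy.gtf
printf 'chr1\tsrc\texon\t300\t400\t.\t+\t.\ttranscript_id "T2";\n' >> toy.gtf

# Count all exon lines (3 in this file).
grep exon toy.gtf | cut -f1,4,5,7 | wc -l

# Count unique chromosome/start/end/strand combinations (2 in this file).
grep exon toy.gtf | cut -f1,4,5,7 | sort -u | wc -l
```

The order of the steps matters: the full lines differ in their last column, so sort -u only collapses the shared exon because cut has already reduced each line to its coordinates and strand.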

#Commands, options, and keystrokes covered in this lesson

grep
> (output redirection)
>> (output redirection, append)
| (output redirection, pipe)
wc
cut
sort