
2019/07/12_NGS Data Analysis Course (Harvard Chan Bioinformatics Core)_4

程序员文章站 2024-02-09 16:49:58

#Learning objectives

  • Learn how to search for characters or patterns in a text file using the grep command
  • Learn how to write to file and append to file using output redirection
  • Explore how to use the pipe (|) character to chain together commands

#Searching files

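In the previous lesson we used less to search within a single file. If we want to search across several files without opening them, we need grep.

grep is a command-line utility for searching plain-text data sets for lines matching a pattern or regular expression (regex). (This definition was not obvious to me at first.) In short, grep prints every line of its plain-text input that matches the given pattern or regex.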

Suppose we want to see how many reads in our file Mov10_oe_1.subset.fq are “bad”, with 10 consecutive Ns (NNNNNNNNNN).

$ cd ~/unix_lesson/raw_fastq

$ grep NNNNNNNNNN Mov10_oe_1.subset.fq
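To see what this search does on something small, here is a minimal sketch that builds a two-record FASTQ file in a temporary directory and searches it the same way. The file name toy.fq and both reads are made up for illustration; they are not part of the course data. The sketch also uses grep's -c option, which counts matching lines instead of printing them.

```shell
# Hypothetical toy data: two made-up FASTQ records, one with a run of 10 Ns.
tmpdir=$(mktemp -d)
printf '%s\n' \
  '@read1' 'ACGTNNNNNNNNNNACGT' '+' 'IIIIIIIIIIIIIIIIII' \
  '@read2' 'ACGTACGTACGTACGTAC' '+' 'IIIIIIIIIIIIIIIIII' \
  > "$tmpdir/toy.fq"

# Print the matching lines (only the sequence line of read1 matches).
matches=$(grep NNNNNNNNNN "$tmpdir/toy.fq")
echo "$matches"

# Count the matching lines instead of printing them.
n_bad=$(grep -c NNNNNNNNNN "$tmpdir/toy.fq")
echo "bad reads: $n_bad"
```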

 

We get back a lot of lines. What if we want to see the whole fastq record for each of these reads?

We can use the -B and -A arguments for grep to return the matched line plus one before (-B1) and two lines after (-A2). Since each record is four lines and the second line is the sequence, this should return the whole record.

$ grep -B 1 -A 2 NNNNNNNNNN Mov10_oe_1.subset.fq
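A small sketch may make the context options easier to picture. It assumes a made-up three-record file (toy.fq is a hypothetical name) in which the first and third reads contain the run of Ns; -B 1 pulls in the header line above each match and -A 2 pulls in the + line and the quality line below it.

```shell
# Hypothetical toy data: three made-up FASTQ records; read1 and read3 are "bad".
tmpdir=$(mktemp -d)
printf '%s\n' \
  '@read1' 'ACGTNNNNNNNNNNACGT' '+' 'IIIIIIIIIIIIIIIIII' \
  '@read2' 'ACGTACGTACGTACGTAC' '+' 'IIIIIIIIIIIIIIIIII' \
  '@read3' 'NNNNNNNNNNTTTTGGGG' '+' 'IIIIIIIIIIIIIIIIII' \
  > "$tmpdir/toy.fq"

# One line before each match (the header) and two lines after it
# (the + line and the quality string) give the full four-line record.
records=$(grep -B 1 -A 2 NNNNNNNNNN "$tmpdir/toy.fq")
echo "$records"

# Count how many record headers came back.
n_headers=$(echo "$records" | grep -c '^@read')
```

When the matched records are not adjacent in the file, as here, grep prints a line containing -- between the groups, so the output can contain separator lines in addition to the records themselves.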

I did not fully follow this part; I think it takes hands-on practice to understand.

#Redirection

The redirection operator for writing something to a file is >.

Let’s try it out and put all the sequences that contain ‘NNNNNNNNNN’ from Mov10_oe_1.subset.fq into another file called bad_reads.txt.

$ grep -B 1 -A 2 NNNNNNNNNN Mov10_oe_1.subset.fq > bad_reads.txt

The prompt should sit there a little bit, and then it should look like nothing happened. But you should have a new file called bad_reads.txt.

$ ls -l

Take a look at the file and see if it contains what you think it should. NOTE: If we already had a file named bad_reads.txt in our directory, it would have been overwritten without any warning.

Use > to redirect output: > followed by a file name writes the output to that file, creating the file if it does not exist yet.
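The overwrite behavior is easy to check safely in a scratch directory. This is a minimal sketch; results.txt is a hypothetical file name used only for the demonstration.

```shell
# Work in a throwaway directory so no real file is at risk.
tmpdir=$(mktemp -d)

# The first redirection creates the file.
echo 'first search' > "$tmpdir/results.txt"

# The second redirection to the same name silently replaces its content.
echo 'second search' > "$tmpdir/results.txt"

content=$(cat "$tmpdir/results.txt")
echo "$content"
```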

The redirection command for appending something to an existing file is >>.

If we use >>, it will append to rather than overwrite a file. This can be useful for saving the results of more than one search in the same file.
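As a sketch of saving two searches into one file, the example below assumes two made-up single-record files (sample1.fq and sample2.fq are hypothetical names). The first search creates the output file with >, and the second adds to it with >>.

```shell
# Hypothetical toy data: two made-up one-record FASTQ files.
tmpdir=$(mktemp -d)
printf '%s\n' '@a' 'NNNNNNNNNNAC' '+' 'IIIIIIIIIIII' > "$tmpdir/sample1.fq"
printf '%s\n' '@b' 'GTNNNNNNNNNN' '+' 'IIIIIIIIIIII' > "$tmpdir/sample2.fq"

# First search: > creates (or overwrites) the output file.
grep NNNNNNNNNN "$tmpdir/sample1.fq" >  "$tmpdir/bad_seqs.txt"

# Second search: >> appends, so the first result is kept.
grep NNNNNNNNNN "$tmpdir/sample2.fq" >> "$tmpdir/bad_seqs.txt"

saved=$(cat "$tmpdir/bad_seqs.txt")
echo "$saved"
```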

The redirection operator for using the output of one command as the input of another command is | (the pipe).

We can also count the number of lines using the wc command. wc stands for word count; with the -l option it reports only the number of lines.

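| passes the output of the command before it to the command after it as input. Both commands run, but only the output of the last command in the chain is shown.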
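A minimal sketch of a pipe, assuming a made-up file of three sequences (seqs.txt is a hypothetical name): grep writes the matching lines into the pipe and wc -l counts them, so the matching lines themselves never appear on screen.

```shell
# Hypothetical toy data: three made-up sequences, two with a run of 10 Ns.
tmpdir=$(mktemp -d)
printf '%s\n' 'ACGTNNNNNNNNNNACGT' 'ACGTACGTACGTACGTAC' 'NNNNNNNNNNTTTTGGGG' \
  > "$tmpdir/seqs.txt"

# grep's output becomes wc's input; tr strips the padding some wc versions add.
n_bad=$(grep NNNNNNNNNN "$tmpdir/seqs.txt" | wc -l | tr -d ' ')
echo "lines with 10 Ns: $n_bad"
```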

‘cut’ is a program that extracts columns from files. By default it treats the tab character as the column separator.

$ grep exon chr1-hg19_genes.gtf | cut -f1,4,5,7 | head

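This command extracts columns 1, 4, 5, and 7 from the lines that contain exon and shows the first ten results. In a GTF file these columns are the chromosome, start position, end position, and strand.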
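The same pipeline can be tried on a tiny tab-separated file. This is a sketch under assumptions: toy.gtf and its three lines are made up to mimic the nine-column GTF layout and are not taken from chr1-hg19_genes.gtf.

```shell
# Hypothetical toy data: three made-up GTF-style lines (tab-separated).
tmpdir=$(mktemp -d)
printf 'chr1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "g1";\n' >  "$tmpdir/toy.gtf"
printf 'chr1\tsrc\tCDS\t120\t180\t.\t+\t0\tgene_id "g1";\n'  >> "$tmpdir/toy.gtf"
printf 'chr1\tsrc\texon\t300\t400\t.\t-\t.\tgene_id "g2";\n' >> "$tmpdir/toy.gtf"

# Keep the exon lines, then keep only chromosome, start, end, and strand.
cols=$(grep exon "$tmpdir/toy.gtf" | cut -f1,4,5,7)
echo "$cols"
```

Note that grep exon matches the string anywhere on a line, not only in the feature column, so a file whose other columns mention exon would return extra lines.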

Removing duplicate exons

We can use a new tool, sort, to remove exons that show up more than once. With the -u option, sort returns only unique lines.

$ grep exon chr1-hg19_genes.gtf | cut -f1,4,5,7 | sort -u | head

Counting the total number of exons

First, let’s check how many lines we would have without using sort -u by piping the output to wc -l.

$ grep exon chr1-hg19_genes.gtf | cut -f1,4,5,7 | wc -l

Now, to count how many unique exons are on chromosome 1, we will add back the sort -u and pipe the output to wc -l.

$ grep exon chr1-hg19_genes.gtf | cut -f1,4,5,7 | sort -u | wc -l
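The difference between the two counts can be sketched on a few made-up coordinate lines (exons.tsv is a hypothetical file standing in for the output of the cut step). One line is repeated, as typically happens when the same exon is listed for more than one transcript.

```shell
# Hypothetical toy data: four made-up exon lines, one of them a duplicate.
tmpdir=$(mktemp -d)
printf 'chr1\t100\t200\t+\nchr1\t300\t400\t-\nchr1\t100\t200\t+\nchr1\t500\t600\t+\n' \
  > "$tmpdir/exons.tsv"

# All lines, duplicates included.
total=$(wc -l < "$tmpdir/exons.tsv" | tr -d ' ')

# sort -u sorts the lines and keeps one copy of each before counting.
unique=$(sort -u "$tmpdir/exons.tsv" | wc -l | tr -d ' ')

echo "total: $total, unique: $unique"
```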

#Commands, options, and keystrokes covered in this lesson

grep
> (output redirection)
>> (output redirection, append)
| (output redirection, pipe)
wc
cut
sort