搜索引擎
project1
For this project, you will write a Java program that processes all text files in a directory and its subdirectories, cleans and parses the text into word stems, and builds an in-memory inverted index to store the mapping from word stems to the documents and position within those documents where those word stems were found.
For example, suppose we have the following mapping stored in our inverted index:
{
“capybara”: {
“input/mammals.txt”: [
11
]
},
“platypus”: {
“input/dangerous/venomous.txt”: [
2
],
“input/mammals.txt”: [
3,
8
]
}
}
This indicates that after processing the word stems from files, the word capybara is found in the file input/mammals.html in position 11. The word platypus is found in two files, input/mammals.html and input/dangerous/venomous.html. In the file input/mammals.html, the word platypus appears twice in positions 3 and 8. In file input/dangerous/venomous.html, the word platypus is in position 2 in the file.
The process of stemming reduces a word to a base form (or “stem”), so that words like interesting, interested, and interests all map to the stem interest. Stemming is a common preprocessing step in many web search engines.
Functionality
The core functionality of your project must satisfy the following requirements:
Process command-line arguments to determine the input to process and output to produce. See the Input and Output sections below for specifics.
Create a custom inverted index data structure that stores a mapping from a word stem to the location(s) the word was found, and the position(s) in that file the word is located. The positions should start at 1. This will require nesting multiple built-in data structures.
If provided a directory as input, find all files within that directory and all subdirectories and parse each text file found. Any files that end in the .text or .txt extension (case insensitive) should be considered a text file. If provided a single file as input, only parse that individual file (regardless of its extension).
Use the UTF-8 character encoding for all file processing, including reading and writing.
Process text files into word stems by removing any non-letter symbols (including digits, punctuation, accents, special characters), convert the remaining alphabetic characters to lowercase, split the text into words by whitespace, and then stem the word using the Apache OpenNLP toolkit.
Use the regular expression (?U)[^\p{Alpha}\p{Space}]+ to remove special characters from text.
Use the regular expression (?U)\p{Space}+ to split text into words by whitespace.
Use the SnowballStemmer English stemming algorithm in OpenNLP to stem words.
If the appropriate command-line arguments are provided, output the inverted index in pretty JSON format. See the Output section below for specifics.
Output user-friendly error messages in the case of exceptions or invalid input. Under no circumstance should your main() method output a stack trace to the user!
The functionality of your project will be evaluated with the Project1Test.java group of JUnit tests.
Input
Your main method must be placed in a class named Driver. The Driver class should accept the following command-line arguments:
-text path where the flag -text indicates the next argument is a path to either a single file or a directory. If the value is a file, open and process that file regardless of its extension. If the value is a directory, find and process all of the text files (with .txt and .text extensions) in that directory and its subdirectories.
-index path where the flag -index is an optional flag that indicates the next argument is the path to use for the inverted index output file. If the path argument is not provided, use index.json as the default output path. If the -index flag is provided, always generate output even if it is empty. If the -index flag is not provided, do not produce an output file.
The command-line flag/value pairs may be provided in any order. Do not convert paths to absolute form when processing command-line input!
Output
All output will be produced in “pretty” JSON format using \t tab characters for indentation. According to the JSON standard, numbers like integers should never be quoted. Any string or object key, however, should always be surrounded by " quotes. Objects (similar to maps) should use curly braces { and } and arrays should use square brackets [ and ]. Make sure there are no trailing commas after the last element.
The paths should be output in the form they were originally provided. The tests use normalized relative paths, so the output should also be normalized relative paths. As long as command-line parameters are not converted to absolute form, this should be the default output provided by the path object.
The contents of your inverted index should be output in alphabetically sorted order as a nested JSON object using a “pretty” format. For example:
{
“capybara”: {
“input/mammals.txt”: [
11
]
},
“platypus”: {
“input/dangerous/venomous.txt”: [
2
],
“input/mammals.txt”: [
3,
8
]
}
}
The project tests account for different path separators (forward slash / for Linux/Mac systems, and backward slash \ for Windows systems). Your code does not have to convert between the two!
Examples
The following are a few examples (non-comprehensive) to illustrate the usage of the command-line arguments that can be passed to your Driver class via a “Run Configuration” in Eclipse, assuming you followed the Running Driver guide to setup the working directory to the project-tests directory.
Consider the following example:
-text “input/text/simple/hello.txt”
-index “actual/index-simple-hello.json”
The above arguments indicate that Driver should build an inverted index from the single hello.txt file in the input/text/simple subdirectory of the current working directory, and output the inverted index as JSON to the index-simple-hello.json file in the actual subdirectory.
-text “input/text/simple/” -index
The above arguments indicate that Driver should build an inverted index from all of the text files found in the text/simple subdirectory of the current working directory, and output the inverted index as JSON to the default path index.json in the current working directory.
-text “input/text/simple/”
The above arguments indicate that Driver should build an inverted index from all of the text files found in the input/text/simple subdirectory of the current working directory, but it should NOT produce an output file. It must still build the inverted index however! (This will be useful in the future when we add the ability to search.)
Hints
It is important to develop the project iteratively. In fact, you may already have certain components complete thanks to the homework assignments. One possible breakdown of tasks are:
Create code that handles parsing command-line arguments into flag/value pairs, and supports default values if a flag is missing a value. This is similar functionality as your ArgumentMap homework assignment.
Create code that is able to traverse a directory and return a list of all the text files found within that directory. The File I/O (Featuring NIO.2) tutorials might be useful for this. This is similar functionality as your TextFileFinder homework assignment.
Create code that is able to parse text into words, including converting that text to lowercase, replacing special characters and digits, splitting that text into words by whitespaces, and finally stemming the words. This is similar functionality as your TextFileStemmer homework assignment.
Create code that handles storing a word, file path, and location into an inverted index data structure. This is similar functionality (but not exactly the same) as your TextFileIndex homework assignment.
Create code that is capable of writing a nested data structure (matching your inverted index data structure) to a file in JSON format. This is similar functionality as your SimpleJsonWriter homework assignment.
Test each part of your project code. Keep in mind the tests provided only test the final output of your Java program. You are responsible for testing the individual components of your code.
You can use and modify homework assignments for this project. However, make sure they are passing all of the tests before using.
Once you have the above components working, then you can work on integrating your code together. This part must still be done iteratively! Do not try to combine all of the code at once. First, work on being able to traverse a directory based on command-line parameters. Then, clean the text files found and output the cleaned words to the console. Finally, add those words to the index and write that index as JSON.
project2
For this project, you will extend your previous project to support exact search and partial search. In addition to meeting the previous project requirements, your code must be able to track the total number of words found in each text file, parse and stem a query file, generate a sorted list of search results from the inverted index, and support writing those results to a JSON file.
Functionality
The core functionality of your project must satisfy the following requirements:
Maintain the functionality of the previous project.
Process additional command-line parameters to determine whether to search the inverted index and whether to produce additional JSON ouput. See the Input section for specifics.
Modify how input text files are processed to also track the total word count of that file when storing it in the inverted index. See the Word Count section for specifics.
Parse a query file line-by-line into a normalized and optimized multiple-word queries matching the processing used to build the inverted index. See the Query Parsing section for specifics.
Efficiently return exact search results from your inverted index, such that any word stem in your inverted index that exactly matches a query word is returned.
Efficiently return partial search results from your inverted index, such that any word stem in your inverted index that starts with a query word is returned.
Sort the search results using a simple term frequency metric to determine the most “useful” search results to output first. See the Result Sorting section for specifics.
The functionality of your project will be evaluated with the Project2Test.java group of JUnit tests.
Word Count
Before you can calculate search results, you need to know how many word stems were stored in your inverted index for each text file.
Avoid opening up any file more than once! You should store this information alongside the inverted index, as the files are first processed.
These word counts will be output to file using the same pretty JSON format as the inverted index. Here is the expected output for the word count of all the text files associated with this project:
{
“input/text/guten/1400-0.txt”: 187368,
“input/text/guten/2701-0.txt”: 215398,
“input/text/guten/50468-0.txt”: 10969,
“input/text/guten/pg1322.txt”: 124370,
“input/text/guten/pg1661.txt”: 107396,
“input/text/guten/pg22577.txt”: 63630,
“input/text/guten/pg37134.txt”: 16696,
“input/text/rfcs/rfc3629.txt”: 4294,
“input/text/rfcs/rfc475.txt”: 3228,
“input/text/rfcs/rfc5646.txt”: 27075,
“input/text/rfcs/rfc6805.txt”: 9785,
“input/text/rfcs/rfc6838.txt”: 9367,
“input/text/rfcs/rfc7231.txt”: 28811,
“input/text/simple/.txt/hidden.txt”: 1,
“input/text/simple/a/b/c/d/subdir.txt”: 1,
“input/text/simple/animals.text”: 11,
“input/text/simple/animals_copy.text”: 11,
“input/text/simple/animals_double.text”: 22,
“input/text/simple/capital_extension.TXT”: 1,
“input/text/simple/capitals.txt”: 4,
“input/text/simple/digits.tXt”: 2,
“input/text/simple/dir.txt/findme.Txt”: 17,
“input/text/simple/hello.txt”: 6,
“input/text/simple/position.teXt”: 20,
“input/text/simple/symbols.txt”: 10,
“input/text/simple/words.tExT”: 36,
“input/text/stems/stem-in.txt”: 22275,
“input/text/stems/stem-out.txt”: 22275
}
…and for the stand-alone sentences.md file:
{
“input/text/simple/sentences.md”: 77
}
You can also find this output in the expected/counts subdirectory in the project-tests repository.
Query Parsing
Search queries will be provided in a text file with one multi-word search query per line. (Eventually, these queries will come from a search box on a webpage instead.) When processing this file, your query parsing code must normalize and optimize the queries as follows:
Clean and parse each query line. Perform the same transformations to each line of query words as used when populating your inverted index. This includes cleaning the line of any non-alphabetic characters, converting the remaining characters to lowercase, splitting the cleaned line into words by whitespace, and stemming each word. For example, the query line:
Observers observing 99 HIDDEN capybaras!
…should be processed into the cleaned, parsed, and stemmed words [observ, observ, hidden, capybara] after this step.
Remove duplicate query word stems. There is no need to search for the same stem twice! For example, if the original query included both observers and observing, the stemmed version of both words observ should be included in the parsed query only once. Specifically, for the stemmed words:
[observ, observ, hidden, capybara]
…the duplicate observ should be removed resulting in [observ, hidden, capybara] after this step.
Sort the unique words for each query alphabetically. This is less necessary for the search operation, but required for the search result output. For example, if the stemmed words without duplicates are:
[observ, hidden, capybara]
…then the final parsed query words should be [capybara, hidden, observ] after this step.
At this stage, the normalized words from the original query line can be used by your inverted index data structure to generate search results and to produce the expected JSON file output.
Result Sorting
Search engines rank their search results such that the most useful results are provided first. For example, PageRank is the well-known algorithm initially used by Google to rank results by estimating the importance of a website by its incoming links.
When returning search results, your code must also return the results in sorted order using a simple version of term frequency and raw count to determine the usefulness of a result. To do this, your code will need to be able to calculate the following:
Total Words: The total number of word stems in each text file. For example, there are 6 word stems in the hello.txt text file. You should already have this information from the word count calculated alongside your inverted index.
Total Matches: The total number of times any of the matching query words appear in the text file. For exact search, this is the number of times the exact word stems appear. For example, the exact word stem perfor appears in the file words.tExT a total of 2 times. For partial search, this is the number of times a word stem starts with one of the query words. For example, a word stem starts with perfor in the file words.tExT a total of 11 times.
For a query with multiple words, your code should consider all query words that appear at a location. For example, for an exact search of the multiple word query [loris, okapi], the word stem lori (from loris) appears 3 times and the word okapi appears 2 times in the file animals.text. Therefore, the number of times any of the matching query words appear in the text file is 3 + 2 = 5 times total.
With this information you can calculate the score of the search result as the percent of words in the file that match the query. This is calculated by dividing the total matches by the total words:
score = total matches / total words
Then, when sorting the search results, compare the results as follows:
Score: First, compare search results by their score so results with higher scores appear first (i.e. sort in descending order by total matches / total words).
Count: If two search results have the same score, then compare by the raw count so results with higher count appear first (i.e. sort in descending order by total matches) instead.
Location: If two search results have the same score and count, then compare by the location (case-insensitive) instead (i.e. sort alphabetically ignoring case by path).
You can use Double.compare(…), Integer.compare(…), and String.compareToIgnoreCase(…) for these comparisons and the built-in sort methods in Java.
See the partial search results for ant perf for an example of results sorted by score, the results for lori for an example of results with the same score and thus sorted by count, and finally the results for capybara hidden for an example of results with the same score and same count and thus sorted alphabetically by location.
If you calculate the score using float instead of double objects, or sort using Path instead of String objects, you may not get the expected results!
Input
Your main method must be placed in a class named Driver. The Driver class should accept the following additional command-line arguments:
-counts filepath where -counts is an optional flag that indicates the next argument is the path to use to output all of the locations and their word count. If the filepath argument is not provided, use counts.json as the default output filename. If the -counts flag is not provided, do not produce an output file of locations.
-query filepath where -query indicates the next argument is a path to a text file of queries to be used for search. If this flag is not provided, then no search should be performed. In this case, your code should behave exactly the same as the previous project.
-exact where -exact is an optional flag that indicates all search operations performed should be exact search. If the flag is NOT present, any search operations should use partial search instead.
-results filepath where -results is an optional flag that indicates the next argument is the path to use for the search results output file. This may include partial or exact search results! If the filepath argument is not provided, use results.json as the default output filename. If the -results flag is not provided, do not produce an output file of search results but still perform the search operation.
The command-line flag/value pairs may be provided in any order, and the order provided is not the same as the order you should perform the operations (i.e. always build the index before performing search, even if the flags are provided in the other order).
Your code should support all of the command-line arguments from the previous project as well.
Output
The output of your inverted index should be the same from the previous project. See the word count section for the desired output.
The search results should be output as a JSON array of JSON objects, where each object represents a single line from the original query file (e.g. one search). The query objects should have two keys, queries and results such that:
The key queries should have a quoted text value that is the parsed and sorted query words joined together by a space. For example, if the parsed query words are [capybara, hidden, observ] then the text output should be “capybara hidden observ”.
The key results should be an array of nested JSON objects, one per search result. Each individual search result should have, in order, these keys:
The key where with a quoted text value that is the relative path to the text file matching one or more of the query words.
The key count with an integer value that is the raw count (i.e. total matches) within the text file.
The key score with a floating point value with 8 digits after the decimal point of the search result score. You can use a DecimalFormat object to achieve this output:
DecimalFormat FORMATTER = new DecimalFormat(“0.00000000”);
System.out.println(FORMATTER.format(Math.PI));
Alternatively, you can use format strings:
String formatted = String.format("%.8f", Math.PI);
System.out.println(formatted);
The use of spaces, newline characters, and spaces are the same as before for “pretty” JSON output. Here is an example output snippet of a single search query with a single search result:
"observ perfor": [
{
"where": "input/text/simple/words.tExT",
"count": 13,
"score": 0.36111111
}
],
You can also find this output in the expected/search/search-exact/search-exact-simple.json file in the project-tests repository. See the other expected output files for more examples.
Examples
The following are a few examples (non-comprehensive) to illustrate the usage of the command-line arguments that can be passed to your Driver class via a “Run Configuration” in Eclipse, assuming you followed the Running Driver guide to setup the working directory to the project-tests directory.
Consider the following example:
-text “input/text/simple/”
-query “input/query/simple.txt”
-results actual/search-exact-simple.json
-exact
This is functionally equivalent to the testSimpleDirectory test. It will create the inverted index from the input/text/simple/ input subdirectory, perform an exact search of the queries in the input/query/simple.txt query input file, and output the search results to search-exact-simple.json.
Hints
It is important to develop the project iteratively. One possible breakdown of tasks are:
Add the ability to parse query files. Compare your parsed queries to those in the expected output files. For example, the line performer performers should become perform after being parsed, cleaned, stemmed, sorted, and discarding duplicates.
Create a class that stores a single search result and implements the Comparable interface. You will need data members for each of the sort criteria, including the location, total word count of the location, and number of times the query occurs at that location.
Add an exact search method to your inverted index data structure that takes already parsed words from a single line of the query file, and returns a sorted list of search results. Output the results to the console for debugging.
Use lists and Collections.sort(List) to sort search results, not a TreeSet data structure. Custom mutable objects do not behave well when used in sets or as keys in maps.
Add the ability to output the search results to JSON file. Make sure the exact search results match the expected output.
Add a partial search method to your inverted index data structure that takes already parsed words from a single line of the query file, and returns a sorted list of search results.
Do not worry about efficient partial search until after you are getting correct results.
The important part will be to test your code as you go. The JUnit tests provided only test the entire project as a whole, not the individual parts. You are responsible for testing the individual parts themselves.
project3
For this project, you will extend your previous project to support multithreading. In addition to meeting the previous project requirements, your code must make a thread-safe inverted index, and use a work queue to build and search an inverted index using multiple threads.
Functionality
The core functionality of your project must satisfy the following requirements:
Maintain the functionality of the previous project.
Process additional command-line parameters to determine whether to use multithreading and if so, how many threads to use in the work queue. See the Input section for specifics.
Support the same output capability as before. See the Output section for specifics.
Create a custom read/write lock class that allows multiple concurrent read operations, non-concurrent write and read/write operations, and allows an active writer to acquire additional read or write locks as necessary (i.e. reentrant write lock).
Create a new thread-safe inverted index using the custom read/write lock class above in addition to the original inverted index created for the previous projects.
Create a single work queue with finish and shutdown features. The work queue should not be initialized unless multithreading is enabled. If multithreading is enabled, the same work queue should be reused to multithread both the building and searching of the inverted index.
Use the work queue to build your inverted index from a directory of files using multiple worker threads. Each worker thread should parse a single file.
Use the work queue (same as the one used for building) to search your inverted index from a file of multiple word queries (for both exact and partial search). Each worker thread should handle an individual query (which may contain multiple words).
Exit gracefully (shutting down all worker threads) without calling System.exit() when all of the building and searching operations are complete.
Extend your classes from previous projects or create completely new classes to support multithreading. DO NOT REMOVE YOUR SINGLE THREADING CODE, as you still need to support single threaded building and searching the index.
You may NOT use any of the classes in the java.util.concurrent package and may NOT use the Stream.parallel method for the multithreaded code.
The functionality of your project will be evaluated with various JUnit tests. Please see the Testing section for specifics.
Input
Your main method must be placed in a class named Driver. The Driver class should accept the following additional command-line arguments:
-threads num threads where -threads indicates the next argument num is the number of worker threads to use. If num is missing or an invalid number of threads are provided, please default to 5 threads.
If the -threads flag is not provided, then assume your project should be single-threaded and should execute exactly as previous projects.
The command-line flag/value pairs may be provided in any order, and the order provided is not the same as the order you should perform the operations (i.e. always build the index before performing search, even if the flags are provided in the other order).
Your code should support all of the command-line arguments from the previous project as well.
Output
The output of your inverted index and search results should be the same from the previous project. As before, you should only generate output files if the necessary flags are provided.
Testing
Unlike previous projects, you only have to pass 100% of the tests in the Project3aTest.java group of JUnit tests to satisfy the project functionality requirements and sign up for your first project 3 code review. This test group does NOT include the long-running runtime tests that benchmark your single- versus multi-threading code.
This is handled automatically by the Github Actions test script. If it detects a project v3.0.x release it will automatically run Project3aTest.java. For all other releases, like v3.1.x, it will run Project3bTest.java instead. This DOES include the long-running runtime tests that benchmark your code. Be prepared to wait!
You must pass Project3bTest.java for followup code reviews and to earn the project design grade. This includes getting a speedup of at least 1.1x faster on average using multi-threading. However, it is possible to get speedups of 1.7x faster or more.
Be patient. The tests for project 3 can take some time to complete, even if you have an efficient approach.
Hints
It is important to develop the project iteratively. One possible breakdown of tasks are:
Get log4j2 working and start adding debug messages in your code. Once you are certain a class is working, disable debug messages for that class in your log4j2.xml file.
Consider extending your previous inverted index to create a thread-safe version. You may need to add additional functionality to your single-threaded versions.
Create a thread-safe inverted index using the synchronized keyword. Do not worry about efficiency or using a custom lock class yet.
Modify how you build your index (both from a directory and from the web) to use multithreading and a work queue. Make sure you still pass the unit tests.
Modify how you search your index to use multithreading and a work queue. Make sure you still pass the unit tests.
Once you are sure this version works, convert your inverted index to use a custom read/write lock class instead of the synchronized keyword (don’t use both). Make sure you still pass the unit tests.
Test your code in a multitude of settings and with different numbers of threads. Some issues will only come up occasionally, or only on certain systems.
Test your code with logging enabled, and then test your code with logging completely disabled. Your code will run faster without logging, which sometimes causes some concurrency problems.
Lastly, do not start on this project until you understand the multithreading code shown in class. If you are stuck on the code shown in class, PLEASE SEEK HELP. You do not need to figure it out on your own! You can ask the CS tutors, the teacher assistant, or the instructor for help.
Do not worry about efficiency until after your first code review. However, if you have had one code review and are running into issues, look for the following common issues:
Make sure the code does not over-notify (e.g. wake up threads more often than necessary).
Make sure the code does not over-finish (e.g. call finish more often than necessary).
Make sure the code does not over-block (e.g. perform blocking operations within a loop).
Make sure the code does not over-synchronize (e.g. preventing operations that could occur at the same time).
These issues are best detected using logging.
The important part will be to test your code as you go. The JUnit tests provided only test the entire project as a whole, not the individual parts. You are responsible for testing the individual parts themselves.
源码地址
https://download.csdn.net/download/rainbowBear/16810860
下一篇: 小型搜索引擎项目