Twitter Memes Dataset Overview with PageRank
This is the last of three blog posts from this summer internship project showcasing how to answer questions concerning big datasets stored in MongoDB using MongoDBs frameworks and connectors. Once wed familiarized ourselves with a sizable
This is the last of three blog posts from this summer internship project showcasing how to answer questions concerning big datasets stored in MongoDB using MongoDB’s frameworks and connectors.
Once we’d familiarized ourselves with a sizable amount of data from the Flights dataset, the next step was to move towards really BIG data: Twitter Memes. SNAP (Stanford Network Analysis Project) provides free large network data sets. In particular, we were interested in the Twitter Memes dataset from the 2008 presidential election generated by MemeTracker, a combined project between Stanford and Cornell which created news cycle maps from news articles and blog posts. This dataset is a compilation of blogs, news articles, and other media postings that pertain to the presidential election. In particular, the project focused on the time lapse between the first mention of a quote and the time it took for the quote to be picked up by mass media. However, our analysis focuses again on the PageRank and importance of both the individual URLs and websites.
The entire dataset contains over 96 million documents and over 418 million links between them. To begin, we focused solely on the memes during the month of April in 2009, which had over 15 million documents. The goal was to run PageRank on this dataset (where all web pages form a graph). The higher the PageRank of a web page, the more important the page. Web pages with a relatively high PageRank usually have a high ratio of incoming links to outgoing links compared to all other pages in the graph.
Disclaimer: As we’d learn through this project, the pages with the most PageRank do not necessarily have to be related to the 2008 presidential election.
Importing the Data
The source file quotes_2009-04.txt was 11G. It came in this continuous format:
P http://blogs.abcnews.com/politicalpunch/2008/09/obama-says-mc-1.html T 2008-09-09 22:35:24 Q that's not change Q you know you can put lipstick on a pig Q what's the difference between a hockey mom and a pit bull lipstick Q you can wrap an old fish in a piece of paper called change L http://reuters.com/article/politicsnews/idusn2944356420080901?pagenumber=1&virtualbrandchannel=10112 L http://cbn.com/cbnnews/436448.aspx L http://voices.washingtonpost.com/thefix/2008/09/bristol_palin_is_pregnant.html?hpid=topnews
- P denotes the URL of the document.
- T represents the time of the post.
- Q is a quote found in the post.
- L is a link that exists in the post.
This was not an ideal schema for MongoDB. With the use of inputMongo.py, the above input was converted into documents resembling the following:
{ "_id" : ObjectId("51c8a3f200a7f40aae706e86"), "url" : "http://blogs.abcnews.com/politicalpunch/2008/09/obama-says-mc-1.html", "quotes" : [ "that's not change", "you know you can put lipstick on a pig", "what's the difference between a hockey mom and a pit bull lipstick", "you can wrap an old fish in a piece of paper called change" ], "links" : [ "http://reuters.com/article/politicsnews/idusn2944356420080901?pagenumber=1&virtualbrandchannel=10112", "http://cbn.com/cbnnews/436448.aspx", "http://voices.washingtonpost.com/thefix/2008/09/bristol_palin_is_pregnant.html?hpid=topnews" ], "time" : ISODate("2008-09-09T22:35:24Z") }
This resulted in 15,312,738 documents. We also utilized bulk insertion instead of individual document insertions. It took about 1 hour and 48 minutes to insert all of these documents into a collection in the database.
Notice that we still generated an ObjectId as _id
. We’d later realize that this is superfluous as the url
is unique per document.
Preparing the Dataset for PageRank
Theoretically, for PageRank to produce more accurate and reflective results, there must be no dead ends and the graph must be strongly connected (every node must be able to travel to any other node in the graph and back). Dead ends are nodes in the graph that have incoming links but no outgoing links. The presence of dead ends in a dataset leaks the PageRank in the graph so that the sum of PageRank of nodes in the graph will slowly converge to zero.
There are 2 ways to fix this problem:
Use a taxation parameter, also called the random surfer, to pass a portion of PageRank from every node to every other node. This is a fix more for a graph that is not strongly connected than for graphs that have dead ends.
Recursively remove dead ends.
We decided to use method (2) first because the resulting graph will be much cleaner and will theoretically give more reasonable results. To make the graph even more strongly connected, we also used a taxation parameter after erasing all dead ends.
Removing dead ends from the dataset proved to more involved than we had initially thought. Here’s what we tried:
Attempt #1: Patience, Young Grasshoppers
The first solution created new collections of nodes that were not dead ends in the last collection. If the size of the links
array is zero for a doc/node, then do not add this doc to the next iteration. Else, iterate through each link in the links
array for each doc. Remove each link which points to a doc that does not exist in the last iteration collection. This is only bounded by the 16MB document limit.
Although we had an index on the url
field, after 5 hours of running the script, we realized that the projected elapsed time would be over 20 days. This prompted us to explore other options.
Attempt #2: Further Optimizing URL search
We originally thought the reason for this lag was due to the script searching for a link in the entire collection, although we did have an index on it already. Instead, we created a more optimized solution to create another collection titled char8 which has as its _id
, the first 8 characters of the url
. Then in each char8 doc, there is a url
array that starts with those 8 characters. So instead of searching for link in each url
in the original memes collection, we’d only search through the links of the char8 collection (which is indexed by the first 8 characters).
This was now projected to finish in 5 days, still too inefficient.
Attempt #3: Migrating to Hadoop/Pig
However, we were undeterred; we still wanted to compute PageRank over the entire dataset with no dead ends. The single script approach was extremely inefficient and slow. It wasn’t fit for BIG DATA. Instead, we turned to Hadoop and Amazon Elastic MapReduce.
Hadoop is an open-source MapReduce framework that handles intensive data computations and often operates over large clusters, although it could also be used locally. It supports many languages and tools to ease usage. Amazon Elastic MapReduce is a service provided by Amazon that hosts the Hadoop framework on its EC2 instances that would be created solely for each MapReduce task and connects to S3 Simple Storage for data input and output. Mongo-Hadoop was the necessary library which would allow us to connect MongoDB as the input/output source for these MapReduce jobs.
We submitted jobs to Elastic MapReduce through both the UI and the command line tool elastic-mapreduce, with our access ID and secret key.
First, we wrote a Pig script, explained below, to eliminate dead ends.
-- Registers the mongo and hadoop jars from our bucket REGISTER s3://memes-bson/mongo-2.10.1.jar REGISTER s3://memes-bson/mongo-hadoop-pig_1.1.2-1.1.0.jar REGISTER s3://memes-bson/mongo-hadoop-core_1.1.2-1.1.0.jar REGISTER s3://memes-bson/mongo-hadoop_1.1.2-1.1.0.jar -- There's a User Defined Function in myudf.MYBAG which needs to be used REGISTER s3://memes-bson/myudf.jar original = LOAD "$INPUT" USING com.mongodb.hadoop.pig.BSONLoader; outs = FOREACH original GENERATE $0#"_id" AS url, $0#"value"#"links" AS links; -- Dead ends will not appear in showing as links is empty. url is the origin of the link while -- single is the destination. myudf.MYBAG turns the tuple of links to a bag of tuple of each link. showing = FOREACH outs GENERATE url, FLATTEN(myudf.MYBAG(links)) AS single; -- JOIN will eliminate links that go to a single which doesn't exist in our dataset joined = JOIN outs BY url, showing BY single; project = FOREACH joined GENERATE showing::url AS url, showing::single AS single; -- Group together the urls such that they form the original scheme of a url and an array of links together = GROUP project BY url; result = FOREACH together GENERATE $0 AS url, $1.single AS links; STORE result INTO "$OUTPUT" USING com.mongodb.hadoop.pig.BSONStorage;
The Pig script above removed all dead ends from the current dataset. The problem with this, is that removing dead ends could create new ones. For example, in the simple graph of
A -> B -> C -> D
D is the only dead end in the graph. But when D is removed, we have
A -> B -> C
So we would keep on removing the new dead ends until there were no new dead ends. In this particular “pathological” graph, the entire graph would be removed because all of the nodes in the graph would eventually be dead ends. Fortunately, most datasets aren’t linear graphs.
We had to figure out a way to repeatedly run the Pig script above until there are no more dead ends. Elastic MapReduce only allows for a single Pig script execution, so we wrote Bash script removeAllDeadEnds.sh that kept running the above Pig script until the output filesize stopped decreasing. The script utilized s3cmd to check file size on S3 and elastic-mapreduce-ruby to submit jobs. In theory, the output filesize will decrease if and only if some dead ends have been removed.
After 10 iterations or so, the script would only erase two or three dead ends in each iteration. This continued until we stopped the script at 70 iterations, which took over 9 hours. With 8 normalized instance hours on m1.xlarge, the iterative jobs finished, on average, in 8 minutes. The initial BSON size of the dataset with dead ends was 606.9MB and the final BSON size with only 2 dead ends was 448.2MB. We decided that the result of running more iterations would only be trivial, and thus we could simply move ahead. We ended up with 1,113,524 total nodes.
PageRank in Elastic MapReduce on the dataset
At this point, we had two collections:
NO_DEAD_ENDS which had 1,113,524 nodes, 2 of which were dead ends.
ORIGINAL which had 36,814,086 nodes, around 75% of which were dead ends.
Whereas we were able to run MongoDB’s inherent MapReduce for the flights dataset to quickly converge, the sheer size of these collections drove us to use Amazon’s Elastic MapReduce to compute the PageRank of the nodes in the graph.
NO_DEAD_ENDS
Preformat
First, we had to preformat the graph to suit the PageRank program. This involved changing the schema, using a Hadoop Job written in Java, to:
{ "_id" : "001rian.blogspot.com/2008_07_01_archive.html", "ptr" : 0.2, "pg" : 8.98049795e-7, "links" : [ "001rian.blogspot.com/2008/07/sanra-dewi-ikutan-bisnis-interne.html", "001rian.blogspot.com/2008/07/peduli-dengan-pagerank-dan-traffic.html", "001rian.blogspot.com/2008/08/88808-08-2008.html", "001rian.blogspot.com/2008/07/jadwal-puasa-dan-imsak-ramadhan-1429-h.html", "001rian.blogspot.com/2008/07/buku-elektronik-untuk-sekolah.html" ] }
-
"pg" : 6.081907480807019e-8
was the initial PageRank of all nodes in a graph, which corresponds to the reciprocal of the total number of nodes in the graph. -
ptr
was the probability of this node going to any other node it links to, which corresponds to the reciprocal of the length of thelinks
array.
PageRank
The structure of the Elastic MapReduce Java program for PageRank was similar to the MongoDB MapReduce program written for the Flights dataset.
However, instead of setting all of the arguments for the Hadoop job in the Bash script, as each iteration (one Map and one Reduce) was a job, the jobs were continuously created and their different variables needed to be set dynamically. For example, the output of the last job was set as the input of the next job for the number iteration
.
last = iteration - 1; FileInputFormat.addInputPath(job, new Path(outputString + last)); // mapred.output.dir is the current configuration variable for the output path in Mongo-Hadoop. job.getConfiguration().set("mapred.output.dir", outputString + iteration); ``` Again, the stopping criterion for PageRank is when the average percentage change of a node, the `residual` or the `diff`, drops below 0.1%. Instead of outputting the individual node diffs in Reduce and aggregating over the entire output to sum up the diff as in the MongoDB MapReduce for Flights, we used the Hadoop counter to sum the residuals per Reduce call. The Hadoop counter is an atomic variable that is accessible by all Mappers and Reducers and it will track any statistic or variable to be read after the job is completed.
// Need this 10E6 because long only takes whole numbers context.getCounter(PageRank.RanCounters.RESIDUAL).increment((long) (residual*10E6)); “`
Therefore, after each job was completed, we viewed the total residual to determine whether it was under the threshold. Our stopping criteria is again when the residual converged to .001 * n where n was the number of elements. In this case, PageRank finished after 16 total iterations in 2 hours and 30 minutes with 7 m1.xlarge instances.
ORIGINAL
Another implementation of the PageRank algorithm didn’t delete all dead ends from the graph, but instead connected all dead ends to every other node in the graph. This created a strongly connected graph but it also increased the noise in the graph. Since the resulting number of nodes after erasing dead ends is only 1 million out of the original 15,312,738, we wanted to see how the PageRank results would change if all of the dead ends were included.
Notice that the original collection from the text file only has 15,312,738 nodes, whereas we accounted for 36,814,086 nodes in the ORIGINAL collection. The extra 21,501,348 nodes are links in the original 15,312,738 nodes but were not documents in the imported collection. Rather than decrease the graph, as with erasing dead ends, making the graph strong connected increased the size of the graph with 21,501,348 extra dead end nodes.
However, there’s no need to actually create edges between all dead end nodes and those that aren’t dead ends (with 27 million dead ends, that would create 27 million * 36 million = 972 billion links). Instead, we simply distributed the summation of the PageRank from all dead ends to every other node. Here are our implementation ideas:
The first (implemented) idea was to add all of the PageRank from nodes with dead ends in the Mapper, and then distribute this PageRank among all nodes in the Reducer when summing up the incoming PageRank; however, this was not feasible as Hadoop counters accessed in the Mapper would be zero in the Reducer. Mappers and Reducers executed simultaneously so the counter values were only finalized after the job was done.
To solve (1) we waited until the end of a job to determine the PageRank of the dead ends. Then the final PageRank and residual was calculated in the Mapper of the next iteration.
The main function, which submits jobs and waits for completion, retrieves the job’s total PageRank from dead ends and passes it as a configuration variable to the next job.
long prevDeadEndsPG = prevJob.getCounters().findCounter(PageRank.RanCounters.DEAD_END_PG).getValue(); currentJob.getConfiguration().setLong("deadEndsPG", prevDeadEndsPG);
Then in the Mapper step, we add this deadEndsPG
divided by the total number of nodes (the probability of any dead end jumping to this node). We compute the residual using the previous PageRank value, and add to the counter for residuals. This way, the final PageRank value of anode for that iteration is determined in the Mapper instead of the Reducer.
long deadEndsPG = 0; // fetch the dead ends PG from the configuration file context.getConfiguration().getLong("deadEndsPG", deadEndsPG); // 10E10 for numbers large enough to keep as long double doubleDeadEndsPG = (double deadEndsPG) / 10E10; double distributeDeadEndsPG = ((double) deadEndsPG) / (PageRank.totalNodes); double beta = 0.9; currentPG = PageRank.beta * (currentPG + distributeDeadEndsPG) + PageRank.distributedBeta; double residual = residual = Math.abs(prevpg - currentPG) / prevpg; context.getCounter(PageRank.RanCounters.RESIDUAL).increment((long) (residual * 10E6));
This original PageRank took 17 iterations in the span of 7 hours and 12 minutes with 8 m1.xlarge instances.
RESULTS
The following is an interpretation of the results obtained after running the PageRank algorithm over the 2 collections above.
NO_DEAD_ENDS
For the dataset with no dead ends, the 10 web pages with the most PageRank are:
1. {"pg":0.020741859913578454 ,"url":"http://slideshare.net/youtube-in-slideshare" ,"quotes":["insert and publish","before slide 1","after slide 1","after slide 2","after slide 3"]} 2. {"pg":0.01490199050574318 ,"url":"http://slideshare.com/youtube-in-slideshare" ,"quotes":["insert and publish","before slide 1","after slide 1","after slide 2","after slide 3"]} 3. {"pg":0.00542114032291505 ,"url":"http://london.kijiji.ca/f-buy-and-sell-w0qqcatidz10" ,"quotes":[]} 4. {"pg":0.005381224128322537 ,"url":"http://badgerandblade.com/index.php?page=terms" ,"quotes":["badger and blade","b amp b"]} 5. {"pg":0.00328534940037117 ,"url":"http://saintjohn.kijiji.ca/f-buy-and-sell-w0qqcatidz10" ,"quotes":[]} 6. {"pg":0.00301961022829243 ,"url":"http://london.kijiji.ca/c-buy-and-sell-business-industrial-salon-equipment-w0qqadidz115815478" ,"quotes":[]} 7. {"pg":0.0028168240288373365 ,"url":"http://dealsofamerica.com/terms.php" ,"quotes":[]} 8. {"pg":0.0025406641926389753 ,"url":"http://london.kijiji.ca/c-buy-and-sell-cds-dvds-vhs-house-seasons-1-4-w0qqadidz123632361" ,"quotes":[]} 9. {"pg":0.0024984791525017504 ,"url":"http://answerbag.com/c_view/3602" ,"quotes":[]} 10. {"pg":0.0021795435717848356 ,"url":"http://chacha.com/entertainment-arts/music" ,"quotes":["up where they play all day in the sun","stay all day in the sun","stay or leave i want you not to go but you did","sometimes goodbye is a second chance"]}
It’s not surprising that http://slideshare.net/youtube-in-slideshare
and http://slideshare.com/youtube-in-slideshare
have the most PageRank. Around the beginning of 2009, SlideShare released a new feature to enable users to embed youtube videos in their presentations. At the time, this feature was in Beta. FAQs, examples, and other details were posted on both http://www.slideshare.net/youtube-in-slideshare
and http://www.slideshare.com/youtube-in-slideshare
. Since this was a new feature (that a lot of people were excited about!), lots of reputable bloggers posted a link to these FAQ pages to showcase the power of the new SlideShare feature. As a result, these FAQ pages accumulated most of the PageRank.
Another interesting observation here is that there are 4 web pages with the most PageRank from kijiji
, a centralized network of online urban communities for posting local online classified advertisements. The reason kijiji
accumulated a lot of PageRank is that there’s a lot of intra-domain (but inter sub-domain) linking. For example, lots of pages on london.kijiji.ca
link to vacation.kijiji.ca
which links back to london.kijiji.ca
. Such intra-domain linking creates a web structure called a spider trap that accumulates a large portion of the PageRank available to the whole system of pages. Furthermore, about 5% fo the entire NO_DEAD_ENDS contains kijiji
web pages.
ORIGINAL
1. {"_id" : "craigslist.org/about/scams.html" , "pg" : 0.018243114523103326 , "links" : []} 2. {"_id" : "slideshare.net/youtube-in-slideshare" , "pg" : 0.003038463243965542 , "links" : ["slideshare.net/youtube-in-slideshare"] , "quotes" : ["insert and publish","before slide 1","after slide 1","after slide 2","after slide 3"]} 3. {"_id" : "slideshare.com/youtube-in-slideshare" , "pg" : 0.002161141838313388 , "links" : ["slideshare.com/youtube-in-slideshare"] , "quotes" : ["insert and publish","before slide 1","after slide 1","after slide 2","after slide 3"]} 4. {"_id" : "ad.doubleclick.net/clk;212944738;23941492;n?goto.canon-asia.com/oic" , "pg" : 0.0015214745797758247 , "links" : [] , "quotes" : []} 5. {"_id" : "mx.answers.yahoo.com/info/disclaimer" , "pg" : 0.0013631525163117727 , "links" : [] , "quotes" : []} 6. {"_id" : "ar.answers.yahoo.com/info/disclaimer" , "pg" : 0.0013542983079855681 , "links" : [] , "quotes" : []} 7. {"_id" : "it.answers.yahoo.com/info/disclaimer" , "pg" : 0.0011670409020562926 , "links" : [] , "quotes" : []} 8. {"_id" : "fr.answers.yahoo.com/info/disclaimer" , "pg" : 0.001083113456512683 , "links" : [] , "quotes" : []} 9. {"_id" : "seroundtable.com" , "pg" : 0.0009033963316740201 , "links" : [] , "quotes" : []} 10. {"_id" : "de.answers.yahoo.com/info/disclaimer" , "pg" : 0.0006914069352292967 , "links" : [] , "quotes" : []}
These results differed significantly from the results of the NO_DEAD_ENDS but for good reasons.
- The PageRank for these web pages are significantly lower than the PageRank of the pages in ORIGINAL because there’s 30x more pages in this graph than NO_DEAD_ENDS.
- The PageRank of the Craigslist scams information page is 6x as much as the next highest PageRank value. This is because every Craigslist page links to the scam page, creating a spider trap. Similarly, the Yahoo! Answers disclaimer pages accumulated a lot of PageRank because most Yahoo! forum pages link to one of the disclaimer pages.
- The curious entry above is the
ad.doubleclick.net
URL. This URL is no longer available. But since doubleclick was an online ad platform, it’s possible that this particular ad was either served the most or gathered the most attention.
SubDomains
Looking at subdomains, we see a lot of entries for websites that have no URLs in the top 10 list. Sites like Twitter, Blogspot, and English Wikipedia are ranked among the top 25. It’s reasonable to assume that links to those websites aren’t purely spam or disclaimers. The D3 bubble chart below comprises the subdomains with the top 25 PageRanks.
Domains
By summing up the PageRank by domains instead of URLs, Yahoo! surpasses SlideShare. This makes sense as Yahoo! is quite spread out among its subdomains.
Lessons Learned
Bulk Insertion Speeds
The Flights dataset was inserted a single document at a time, as we didn’t know about bulk insertions yet. However, pymongo allows for bulk insertions. This is a much preferred and faster method for creating large collections. We utilized bulk insertions on the Twitter Memes dataset with a default batch size of 1000 docs. Here is the amount of time it takes for inputMongo.py to completely finish inserting the original 15,312,736 docs into Twitter Memes:
Bulk Insert Array Size | Elapsed Time in Seconds | Elapsed Time (hr:min:sec) |
---|---|---|
1 | 15007.014883 | 4:10:7 |
10 | 7781.75180 | 2:09:41 |
100 | 7332.346791 | 1:52:12 |
1000 | 6493.35044885 | 1:48:13 |
10000 | Error: Size too big for insert |
The elapsed time for inputting the data dropped significantly between single insertions and insertions of size 10 each, but interestingly, the speed tapered off as the size of the insertions increased. The maxMessageSizeByte value is 48 million bytes, and a bulk insertion of 10,000 exceeded this limit. This occurred in the Python driver, but some of the other drivers will split the array into 16MB chunks, which would avoid this error.
PageRank != Relevance
The results above show that the pages with the most PageRank are often the disclaimers and information pages, which probably isn’t what interests people most of the time. It turns out that Google’s Search algorithm is more complex than our simplistic version of PageRank. For instance, Google takes into account which links people click on for a certain query, thus boosting those URLs’ relevance for that query. Google also filters out some spam links and ad sites. In addition, some websites made with many internal links to intentionally boost their PageRank will have their PageRank values docked as part of Google’s campaign to disincentivize the intentional manipulation of PageRank.
Indexing URLs
Since each document has an unique URL, we created a unique index on this field. Fortunately, all the URLs in this dataset are well within the 1024 bytes limit for index entries. An alternative and more optimal way to index URLs is to use hashing. In addition, MongoDB supports hashed indices.
原文地址:Twitter Memes Dataset Overview with PageRank, 感谢原作者分享。