A handy crawler tool: headless Chrome with Puppeteer
程序员文章站
2022-05-27 09:02:12
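I previously used PhantomJS to scrape JD.com search result pages and found that the last thirty items on each page could not be retrieved, because JD loads them dynamically as the page is scrolled. I then came across Puppeteer, a great crawling tool that is also available for .NET (the code below matches the PuppeteerSharp API).
Here is the code; it is very simple:
First, reference the headless Chrome .NET API:
using System.Web;      //HttpUtility.UrlEncode
using PuppeteerSharp;  //NuGet package PuppeteerSharp
//Assumes a Chromium build is already available to PuppeteerSharp (for example, downloaded with BrowserFetcher)
//Enable headless mode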
var launchOptions = new LaunchOptions { Headless = true };
//Starting headless browser
var browser = await Puppeteer.LaunchAsync(launchOptions);
//New tab page
var page = await browser.NewPageAsync();
//Request URL to get the page
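//"水果" means "fruit"; URL-encode the keyword before putting it in the query string
string key = HttpUtility.UrlEncode("水果");
//Query parameters are separated by '&' (the original had "$page=1", which breaks the page parameter)
string url = "https://search.jd.com/Search?keyword=" + key + "&enc=utf-8&page=1";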
await page.GoToAsync(url);
//Press Space 12 times to scroll down so that the dynamically loaded items are requested
for (int i = 0; i < 12; i++)
{
    await page.Keyboard.PressAsync("Space");
}
//NavigationOptions nav = new NavigationOptions();
//nav.WaitUntil = WaitUntilNavigation.DOMContentLoaded;
await page.WaitForSelectorAsync(".p-num");
await page.ScreenshotAsync("example.png");
//Get and return the HTML content of the page
var htmlString = await page.GetContentAsync();
#region Dispose resources
//Close tab page
await page.CloseAsync();
//Close headless browser, all pages will be closed here.
await browser.CloseAsync();
#endregion
return htmlString;
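The Space key presses above use a fixed count, which may scroll too little on a slow connection or more than needed on a fast one. As a rough sketch (not part of the original demo), the scrolling could instead continue until the number of loaded items stops growing. The helper name `ScrollUntilStableAsync` and its parameters are hypothetical; it takes two delegates so that it does not depend on any particular browser library.

```csharp
using System;
using System.Threading.Tasks;

public static class LazyLoadHelper
{
    // Scrolls until the item count has not grown for `stableRounds`
    // consecutive scrolls, or until `maxRounds` scrolls have been made.
    // countItems: returns how many items are currently on the page.
    // scrollOnce: performs one scroll step (and any delay the caller wants).
    // Returns the largest item count observed.
    public static async Task<int> ScrollUntilStableAsync(
        Func<Task<int>> countItems,
        Func<Task> scrollOnce,
        int maxRounds = 20,
        int stableRounds = 3)
    {
        int lastCount = await countItems();
        int unchanged = 0;

        for (int round = 0; round < maxRounds && unchanged < stableRounds; round++)
        {
            await scrollOnce();
            int count = await countItems();

            if (count > lastCount)
            {
                // New items appeared, so reset the "no change" counter.
                lastCount = count;
                unchanged = 0;
            }
            else
            {
                unchanged++;
            }
        }

        return lastCount;
    }
}
```

With the page object from the code above, `countItems` could be `() => page.EvaluateExpressionAsync<int>("document.querySelectorAll('.gl-item').length")` and `scrollOnce` could be an async lambda that calls `page.Keyboard.PressAsync("Space")` followed by a short `Task.Delay`. The `.gl-item` selector is an assumption about JD's markup and should be checked against the live page.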
This tool is built around asynchronous calls, which makes it very practical.
Demo: https://download.csdn.net/download/v18770350613/11229549