Naturally, for the data-mining task, the data-preparation stage is designed like this: driven by a configuration file, open each corresponding website and save the page. The contents of those saved files are then analyzed: text extraction, matrix transformation, and clustering.
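For reference, here is a minimal config.properties sketch matching the keys the code below reads (baseUrl, fileDir, searchBlogs). The values are hypothetical placeholders; blog IDs are separated by ";" as the split in the code expects:

# hypothetical example values -- substitute real blog IDs and paths
baseUrl=http://blog.csdn.net/
fileDir=data/pages
searchBlogs=user1;user2;user3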
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public static void main(String[] args) {
    final int THREAD_COUNT = 5;
    String baseUrl = null;
    String searchBlogs = null;
    String[] blogs = null;
    String fileDir = null;
    //String category = null;

    InputStream inputStream = CsdnBlogMining.class.getClassLoader()
            .getResourceAsStream("config.properties");
    Properties p = new Properties();
    try {
        p.load(inputStream);
        baseUrl = p.getProperty("baseUrl");
        fileDir = p.getProperty("fileDir");
        searchBlogs = p.getProperty("searchBlogs");
        // != "" compares references, not content; check emptiness instead
        if (searchBlogs != null && !searchBlogs.isEmpty()) {
            blogs = searchBlogs.split(";");
        }
        if (blogs != null) {
            // download each blog page on a fixed-size thread pool
            ExecutorService pool = Executors.newFixedThreadPool(THREAD_COUNT);
            for (String s : blogs) {
                pool.submit(new SaveWeb(baseUrl + s, fileDir + "/" + s + ".html"));
            }
            pool.shutdown();
        }
        //category = new String(p.getProperty("category").getBytes("ISO-8859-1"), "UTF-8");
    } catch (IOException e) {
        e.printStackTrace();
    }
}
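Note that pool.shutdown() only stops the pool from accepting new tasks; it does not wait for the submitted downloads to finish. If the analysis step needs the files on disk first, a blocking wait can be added after shutdown(). A sketch, requiring import java.util.concurrent.TimeUnit (the 10-minute timeout is an arbitrary choice):

pool.shutdown();
try {
    // block until all download tasks complete or the timeout elapses
    if (!pool.awaitTermination(10, TimeUnit.MINUTES)) {
        System.err.println("some downloads did not finish in time");
    }
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
}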
The module that opens a web page and saves it:
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;

import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.HttpStatus;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;

public class SaveWeb implements Runnable {
    private final String url;
    private final String filename;

    public SaveWeb(String url, String filename) {
        this.url = url;
        this.filename = filename;
    }

    @Override
    public void run() {
        HttpClient httpclient = new DefaultHttpClient();
        HttpGet httpGet = new HttpGet(url);
        // present a browser User-Agent so the server does not reject the crawler
        httpGet.setHeader("User-Agent",
                "Mozilla/5.0 (Windows NT 6.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2");
        // try-with-resources closes the stream even when an exception is thrown
        try (BufferedOutputStream outputStream =
                     new BufferedOutputStream(new FileOutputStream(filename))) {
            HttpResponse response = httpclient.execute(httpGet);
            HttpEntity entity = response.getEntity();
            if (response.getStatusLine().getStatusCode() == HttpStatus.SC_OK
                    && entity != null) {
                String res = EntityUtils.toString(entity, "UTF-8");
                outputStream.write(res.getBytes("UTF-8"));
                outputStream.flush();
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
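DefaultHttpClient was deprecated in later Apache HttpClient releases; on Java 11+ the same download can be written with the JDK's built-in java.net.http client, removing the external dependency entirely. A sketch under that assumption (the class name SaveWebJdk is mine, not from the original):

import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class SaveWebJdk implements Runnable {
    private final String url;
    private final String filename;

    public SaveWebJdk(String url, String filename) {
        this.url = url;
        this.filename = filename;
    }

    @Override
    public void run() {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent",
                        "Mozilla/5.0 (Windows NT 6.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2")
                .build();
        try {
            // read the body as a string, then persist it as UTF-8
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            if (response.statusCode() == 200) {
                Files.write(Paths.get(filename),
                        response.body().getBytes(StandardCharsets.UTF_8));
            }
        } catch (IOException | InterruptedException e) {
            e.printStackTrace();
        }
    }
}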
Postscript:
The assignment is done, but it has almost nothing to do with the content above, so I originally wanted to delete it all.
On second thought, it isn't actually wrong; it just went unused. I'll keep it.
In the end, the Java code above, with a loop plus a thread pool, fetches a list of URLs and saves them to files.
The mining itself was done in R, covering fetching the pages, extracting the body text, word segmentation, clustering, and writing out the results. R really saves effort: a few dozen lines of code handled all of it. But the final classification was disappointing. It seems that computing over the full text is too generic, and the resulting clusters are quite inaccurate, so improvements are needed.
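The R code is not shown in this post. To make the matrix step concrete, here is a small Java sketch of the underlying idea (all names are mine): each tokenized document becomes a term-frequency vector, and documents are compared by cosine similarity, the measure a clusterer would work from:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TermVectors {
    // term-frequency vector for one tokenized document
    static Map<String, Integer> termFreq(List<String> tokens) {
        Map<String, Integer> tf = new HashMap<>();
        for (String t : tokens) {
            tf.merge(t, 1, Integer::sum);
        }
        return tf;
    }

    // cosine similarity between two term-frequency vectors
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * (double) other;
            }
            normA += e.getValue() * (double) e.getValue();
        }
        for (int v : b.values()) {
            normB += v * (double) v;
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }
}

Since every pair of full pages shares plenty of boilerplate terms, such full-text vectors tend to look alike, which is consistent with the inaccurate clusters observed above; restricting the vectors to the extracted body text, or weighting terms by TF-IDF, would be natural improvements to try.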
Copyright notice: this is an original post by the blogger; please keep the link to the original when reposting.