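There are basically two ways to strip all tags out of HTML in Java: use an open-source jar, or write the regular expressions yourself. If you write your own, it may not fully handle some custom tags. In enterprise applications the rule is generally to use an open-source library whenever one exists, and to write your own only when there really is no alternative...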
1. Open source: the one I have found so far is the Jsoup package:
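import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public static String getTextFromHTML(String htmlStr) {
    // Parse the markup and let Jsoup extract the visible text.
    Document doc = Jsoup.parse(htmlStr);
    String text = doc.text();

    // Remove extra white space: turn every whitespace or space-separator
    // character (including non-breaking spaces) into a plain space.
    StringBuilder builder = new StringBuilder(text);
    for (int index = 0; index < builder.length(); index++) {
        char tmp = builder.charAt(index);
        if (Character.isSpaceChar(tmp) || Character.isWhitespace(tmp)) {
            builder.setCharAt(index, ' ');
        }
    }

    // Collapse runs of spaces into one and trim both ends.
    return builder.toString().replaceAll(" +", " ").trim();
}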
2. If you write it yourself, a quick Baidu search turns up plenty of versions; I am just borrowing one here:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static String removeTag(String htmlStr) {
    String regEx_script = "<script[^>]*?>[\\s\\S]*?</script>"; // script blocks, contents included
    String regEx_style = "<style[^>]*?>[\\s\\S]*?</style>"; // style blocks, contents included
    String regEx_html = "<[^>]+>"; // any remaining HTML tag
    String regEx_space = "\\s+"; // whitespace runs; \s already covers \t, \r and \n

    // Strip <script> first so its contents never reach the generic tag pattern.
    Pattern p_script = Pattern.compile(regEx_script, Pattern.CASE_INSENSITIVE);
    Matcher m_script = p_script.matcher(htmlStr);
    htmlStr = m_script.replaceAll("");

    Pattern p_style = Pattern.compile(regEx_style, Pattern.CASE_INSENSITIVE);
    Matcher m_style = p_style.matcher(htmlStr);
    htmlStr = m_style.replaceAll("");

    Pattern p_html = Pattern.compile(regEx_html, Pattern.CASE_INSENSITIVE);
    Matcher m_html = p_html.matcher(htmlStr);
    htmlStr = m_html.replaceAll("");

    // Replace each whitespace run with a single space.
    Pattern p_space = Pattern.compile(regEx_space, Pattern.CASE_INSENSITIVE);
    Matcher m_space = p_space.matcher(htmlStr);
    htmlStr = m_space.replaceAll(" ");
    return htmlStr;
}
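To see what the regex route does and does not handle, here is a small self-contained sketch that needs only the JDK. The class name `RegexStripDemo`, the method name `stripTags`, and the sample HTML are made up for illustration. It follows the same four steps as the borrowed method, with two small changes of mine: the patterns are compiled once as constants, and the result is trimmed.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; names and sample input are illustrative only.
final class RegexStripDemo {
    // Compile once and reuse; Pattern objects are immutable and thread-safe.
    private static final Pattern SCRIPT =
            Pattern.compile("<script[^>]*>[\\s\\S]*?</script>", Pattern.CASE_INSENSITIVE);
    private static final Pattern STYLE =
            Pattern.compile("<style[^>]*>[\\s\\S]*?</style>", Pattern.CASE_INSENSITIVE);
    private static final Pattern TAG = Pattern.compile("<[^>]+>");
    private static final Pattern SPACE = Pattern.compile("\\s+");

    static String stripTags(String html) {
        // Drop script and style blocks together with their contents.
        String s = SCRIPT.matcher(html).replaceAll("");
        s = STYLE.matcher(s).replaceAll("");
        // Drop every remaining tag, keeping the text between tags.
        s = TAG.matcher(s).replaceAll("");
        // Collapse whitespace runs and trim (the trim is an addition in this sketch).
        return SPACE.matcher(s).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><style>p { color: red; }</style>"
                + "<script>var x = 1;</script></head>"
                + "<body><p>Hello,\n  <b>world</b>!</p><p>a &amp; b</p></body></html>";
        // prints: Hello, world!a &amp; b
        System.out.println(stripTags(html));
    }
}
```

The output shows two gaps in the do-it-yourself approach: the boundary between the two paragraphs is lost ("world!a"), and the entity `&amp;` is left undecoded. An HTML parser such as Jsoup works on the parsed document rather than the raw string, which is the main reason to prefer it when it is available.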