How to Convert Markdown to WXML

程序员文章站 (Programmer Article Station), 2023-10-31 22:01:28

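I wanted to build a mini program version of my technical blog. The blog runs on hexo + GitHub Pages, and the posts are written in Markdown, the format tech people love. Mini programs, however, use their own template language, WXML. I was not going to convert every existing post to WXML by hand, and I could not find a complete conversion library online, so I decided to write a parser myself.

The core of a parser is string pattern matching, and string matching means regular expressions. Fortunately, regular expressions are one of my strengths.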

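Regular expressions

Regular expressions in JavaScript

The JavaScript regular expression knowledge the parser relies on

RegExp constructor properties. Of these, lastMatch and rightContext are especially useful when slicing strings.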

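Long property name	Short property name	Replacement token	Description
input	$_		The string most recently matched against. Not implemented in Opera
lastMatch	$&	$&	The most recent match. Not implemented in Opera
lastParen	$+		The most recently matched capture group. Not implemented in Opera
leftContext	$`	$`	The text in the input string before lastMatch
rightContext	$'	$'	The text in the input string after lastMatch
multiline	$*		Boolean indicating whether all expressions use multiline mode. Not implemented in IE or Opera
	$n	$n	The n-th capture group
		$$	An escaped (literal) $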

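The test method and the RegExp constructor
After test is called, the properties above are populated on RegExp. They are a legacy feature and are overwritten by every successful match, so read them right after the call. The short property names are not recommended because they hurt readability. Here is an example: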

var text = "this has been a short summer";
var pattern = /(.)hort/g;

if (pattern.test(text)){
    alert(RegExp.input);         // this has been a short summer
    alert(RegExp.leftContext);   // this has been a
    alert(RegExp.rightContext);  // summer
    alert(RegExp.lastMatch);     // short
    alert(RegExp.lastParen);     // s
    alert(RegExp.multiline);     // false in engines that implement it
}

// Every long property name has a short equivalent. Most of the short names are not valid ECMAScript identifiers, so they must be accessed with bracket notation
pattern.lastIndex = 0; // with the g flag, test() resumes from lastIndex, so reset it before testing again
if (pattern.test(text)){
    alert(RegExp.$_);
    alert(RegExp["$`"]);
    alert(RegExp["$'"]);
    alert(RegExp["$&"]);
    alert(RegExp["$+"]);
    alert(RegExp["$*"]);
}
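
The parser later in this article relies on RegExp.rightContext to consume its input one piece at a time. Here is a minimal sketch of that pattern, assuming an engine that still supports the legacy static properties (V8 in Node and Chrome does). The `consumePairs` helper and the `key=value;` input format are made up for illustration:

```javascript
// Hypothetical helper: reads "key=value;" pairs by repeatedly matching at the
// start of the remaining text and then continuing with whatever follows the match.
function consumePairs(input) {
  const pairs = [];
  let rest = input;
  while (rest) {
    if (/^(\w+)=(\w+);?/.test(rest)) {
      // Read the static properties immediately; the next match overwrites them.
      pairs.push([RegExp.$1, RegExp.$2]);
      rest = RegExp.rightContext;
    } else {
      break; // stop on input the pattern does not cover
    }
  }
  return pairs;
}

console.log(consumePairs("title=hexo;tags=regex"));
// [["title", "hexo"], ["tags", "regex"]]
```

The `else` branch matters: a consume loop must either advance or stop on every pass, otherwise it spins forever on unexpected input.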

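The replace method

The simple version without a callback is the one most often used, but the callback version is the real powerhouse and is extremely flexible.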

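// Simple replacement. By default replace substitutes only the first match; with the global flag it substitutes every matching substring and returns the resulting string.
var regex = /(\d{4})-(\d{2})-(\d{2})/;
"2011-11-11".replace(regex, "$2/$3/$1"); // "11/11/2011"

// replace with a callback for custom substitutions. Add the g flag so the callback runs for every match in the string rather than only the first.
// match is the matched substring, index is its position in the string, and sourcestr is the original string.
// With capture groups, the group values come in between: match, group1 ... groupn, index, sourcestr
"one two three".replace(/\bt[a-zA-Z]+\b/g, function (match,index,str) { // uppercase every word that starts with "t"
    console.log(match,index,str);
    return match.toUpperCase();
}); // "one TWO THREE"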
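
To make the argument order with capture groups concrete, here is a small sketch. The `reformatDates` function and its output format are illustrative only:

```javascript
// Callback arguments: the full match, one argument per capture group,
// then the index of the match in the original string.
function reformatDates(text) {
  return text.replace(
    /(\d{4})-(\d{2})-(\d{2})/g,
    (match, year, month, day, index) => `${month}/${day}/${year}@${index}`
  );
}

console.log(reformatDates("2011-11-11 and 2012-12-12"));
// 11/11/2011@0 and 12/12/2012@15
```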

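The match method

Global and non-global modes behave very differently: without the g flag match returns the same result as exec, and with it match returns only the list of matched substrings.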

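// If the argument is a substring or a regex without the global flag, match() performs a single match from the start. It returns null if nothing matches; otherwise it returns an array whose element 0 holds the matched text, followed by any capture groups, plus index and input properties giving the start index of the match and the original string.
var str = '1a2b3c4d5e';
console.log(str.match(/b/)); // ["b", index: 3, input: "1a2b3c4d5e"]

// If the argument is a regex with the global flag, match() matches repeatedly from the start to the end of the string. It returns null if nothing matches; otherwise it returns an array of all matching substrings, without index, input, or capture group entries.
var str = '1a2b3c4d5e';
str.match(/h/g); // null
str.match(/\d/g); // ["1", "2", "3", "4", "5"]

var pattern = /\d{4}-\d{2}-\d{2}/g;
var str ="2010-11-10 2012-12-12";
var matcharray = str.match(pattern);
for (var i = 0; i < matcharray.length; i++) {
     console.log(matcharray[i]);
}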

The exec method

Similar to match in global mode, but more powerful: each result carries full match details, which match in global mode leaves out.

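// Extract matches one at a time, capture groups included. The loop needs the global flag g. A successful call returns an array (with the capture groups); a failed one returns null.
// After each successful match the regex stores the end position in lastIndex, and the next call starts matching from there.
// Without the global flag, a while loop like this one never terminates.
var pattern = /(\d{4})-(\d{2})-(\d{2})/g;
var str2 = "2011-11-11 2013-13-13" ;
var matcharray;
while ((matcharray = pattern.exec(str2)) != null) {
  console.log( "date: " + matcharray[0] + " start at: " + matcharray.index + " ends at: " + pattern.lastIndex);
  console.log( ",year: " + matcharray[1]);
  console.log( ",month: " + matcharray[2]);
  console.log( ",day: " + matcharray[3]);
}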
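
The same loop can be wrapped in a function that returns structured results. This is a sketch only; `findDates` and the shape of the returned objects are assumptions for illustration:

```javascript
// Collects every date-like match together with its position.
function findDates(text) {
  // A fresh regex per call, so lastIndex always starts at 0.
  const pattern = /(\d{4})-(\d{2})-(\d{2})/g;
  const found = [];
  let m;
  while ((m = pattern.exec(text)) !== null) {
    found.push({
      year: m[1],
      month: m[2],
      day: m[3],
      start: m.index,         // where the match begins
      end: pattern.lastIndex, // where the next search will resume
    });
  }
  return found;
}

console.log(findDates("2011-11-11 2013-12-13"));
```

Creating the regex inside the function avoids a common pitfall: a shared global regex keeps its lastIndex between calls and can skip matches in the next input.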

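search and split are simpler, so they are not covered here.

Advanced regular expression concepts

Normally a regex matches character by character from left to right, advancing after each matched character until the whole string has been consumed. That is the matching process: it matches characters and consumes them. I will skip the related basics and cover a concept that differs from character matching: assertions.

Assertions

Most regex constructs match characters. Assertions are different: they match no characters and consume none, and only check whether the text to the left or right of a position meets a condition. These position-matching elements can be called "anchors", and they fall into three groups: word boundaries, start and end positions, and lookarounds.

A word boundary \b is a position with a word character on one side and a non-word character on the other, as these examples show: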

\brow\b   //row
\brow     //row, rowdy
row\b     //row, tomorrow

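^ start of line; in multiline mode it also matches the position after each line break, i.e. the start of every line
$ end of line; in multiline mode it also matches the position before each line break, i.e. the end of every line

// In JavaScript, $ matches only the very end of the string; it does not match before a trailing line break. With multiline mode (m) enabled, ^ and $ also match at the line breaks in the middle. The following examples confirm this:

// With only the global flag, ^ and $ match the very start and end of the text and ignore the line break in between
'hello\nword'.replace(/^|$/g,'<p>')
"<p>hello
word<p>"

// In multiline mode they also match at the line breaks in the middle
'hello\nword\nhi'.replace(/^|$/mg,'<p>')
"<p>hello<p>
<p>word<p>
<p>hi<p>"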
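
Multiline anchors are exactly what a line-oriented format like Markdown needs. Here is a minimal sketch that lists the headings in a Markdown string; `listHeadings` is a hypothetical helper, not part of the parser described in this article:

```javascript
// With the m flag, ^ and $ match at every line, so each heading line is found
// even though the input is a single multi-line string.
function listHeadings(markdown) {
  const pattern = /^(#{1,6})\s+(.+)$/gm;
  const headings = [];
  let m;
  while ((m = pattern.exec(markdown)) !== null) {
    headings.push({ level: m[1].length, text: m[2] });
  }
  return headings;
}

console.log(listHeadings("# Title\nsome text\n## Section\nmore text"));
// [{ level: 1, text: "Title" }, { level: 2, text: "Section" }]
```

Without the m flag the same pattern finds nothing in this input, because `.` cannot cross a line break and `$` would only match at the very end.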

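Lookaround

Lookarounds are the most powerful assertions. Like \b, ^ and $, they consume no characters and extract none, matching only a position in the text. They go further, though, because they let you attach a condition that the text ahead of or behind that position must satisfy.

Because a lookaround takes up no width (it consumes no characters), it is also called a zero-width assertion. "Zero width" means:
- what a lookaround matches is not included in the match result;
- text a lookaround has examined can still be matched afterwards.
Lookarounds come as lookahead and lookbehind; JavaScript has supported lookbehind only since ES2018.
- (?=) positive lookahead, checks the text to the right
- (?!) negative lookahead
- (?<=) positive lookbehind, checks the text to the left
- (?<!) negative lookbehind
Here are some concrete examples: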
```
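// Get the name of a .exe file without a capture group: the lookahead checks for the suffix without consuming it, so the match excludes .exe
'asd.exe'.match(/.+(?=\.exe)/)
=> ["asd", index: 0, input: "asd.exe", groups: undefined]

// Negative lookahead variant: match HTML tags except p/a/img (as written it also rejects any tag whose name merely starts with p, a or img, such as pre)
</?(?!p|a|img)([^> /]+)[^>]*/?> 

// Plain lookbehind, again relying on the fact that lookarounds consume nothing
/(?<=\$)\d+/.exec('benjamin franklin is on the $100 bill')  // ["100",index: 29,...]
/(?<!\$)\d+/.exec('it’s is worth about €90')                // ["90", index: 21,...] 

// Use a lookahead to match positions without consuming any characters
'12345678'.replace(/\B(?=(\d{3})+$)/g , ',') 
=> "12,345,678" // digit grouping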
```
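
The lookaround ideas above can be packaged into two small helpers. This is a sketch with made-up function names (`formatThousands`, `dollarAmounts`); the lookbehind version assumes an ES2018-capable engine:

```javascript
// Inserts a comma at every position followed by a multiple of three digits up to the end.
// \B prevents a comma from being inserted at the very start of the string.
function formatThousands(digits) {
  return digits.replace(/\B(?=(\d{3})+$)/g, ",");
}

// Returns the numbers that directly follow a dollar sign, without the sign itself.
function dollarAmounts(text) {
  return text.match(/(?<=\$)\d+/g) || [];
}

console.log(formatThousands("12345678")); // 12,345,678
console.log(dollarAmounts("a $100 bill, €90 and $5")); // ["100", "5"]
```

Both patterns match positions or text next to a condition without consuming the condition, which is why no capture group or post-processing is needed.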

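Writing the parser

That was a lot about regular expressions, but sharpening the axe does not delay the woodcutting. On to the main topic.

Markdown format

A Markdown file generated by hexo looks like the following. The parser's job is to turn it into JSON output that the mini program can render as WXML.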

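---
title: Learning Haskell - Functor
date: 2018-08-15 21:27:15
tags: [haskell]
categories: Tech
banner: https://upload-images.jianshu.io/upload_images/127924-be9013350ffc4b88.jpg
---
<!-- Original post: [Learning Haskell - Functor](https://edwardzhong.github.io/2018/08/15/haskellc/) -->
## What is a Functor
A **Functor** is an object that supports a map operation. A Functor is like an expression with semantics attached, and a box makes a good analogy. The definition of **Functor** can be read this way: given a function that maps a to b and a box containing a, the result is a box containing b. **fmap** can be seen as a function that takes a function and a **Functor** and applies the function to every element of the **Functor** (a mapping).

```haskell
-- Definition of Functor
class Functor f where
    fmap :: (a -> b) -> f a -> f b
```
<!-- more -->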

Entry point

Use Node for the file operations, then call the parser to generate the JSON files.

const { readdirSync, readFileSync, writeFile } = require("fs");
const path = require("path");
const parse = require("./parse");

const files = readdirSync(path.join(__dirname, "posts"));
for (let p of files) {
  // read as UTF-8 text so the parser receives a string rather than a Buffer
  let md = readFileSync(path.join(__dirname, "posts", p), "utf8");
  const objs = parse(md);
  writeFile(path.join(__dirname, "json", p.replace('.md','.json')), JSON.stringify(objs), function( err ){
    err && console.log(err);
  });
}

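Now the entry part of the parser. It handles three kinds of content: the summary block, code blocks, and Markdown text. Comments and whitespace in the text are filtered out, while comments inside code blocks are kept.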

module.exports = function analyze(str) {
    let ret = { summary: {}, lines: [] };
    while (str) {
        // whitespace
        if (/^([\s\t\r\n]+)/.test(str)) {
            str = RegExp.rightContext;
        }
        // summary block
        if (/^(\-{3})[\r\n]?([\s\S]+?)\1[\r\n]?/.test(str)) {
            str = RegExp.rightContext;
            ret.summary = summaryparse(RegExp.$2);
            ret.num = new Date(ret.summary.date).getTime();
        }
        // code block
        if (/^`{3}(\w+)?([\s\S]+?)`{3}/.test(str)) {
            const codestr = RegExp.$2 || RegExp.$1;
            const fn = (RegExp.$2 && codeparse[RegExp.$1]) ? codeparse[RegExp.$1] : codeparse.javascript;
            str = RegExp.rightContext;
            ret.lines.push({ type: "code", child: fn(codestr) });
        }
        // comment
        if (/^<!--[\s\S]*?-->/.test(str)) {
            str = RegExp.rightContext;
        }
        // take one line of text, relying on . not matching line breaks
        if (/^(.+)[\r\n]?/.test(str)) {
            str = RegExp.rightContext;
            ret.lines.push(textparse(RegExp.$1));
        }
    }
    return ret;
};
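
The summary helper called by the entry function is not shown in the article. Here is a minimal sketch of what such a helper might look like, assuming the front matter consists only of `key: value` lines and simple bracketed lists, as in the sample post. This is not the author's implementation, and it is named `summaryParseSketch` to make that clear:

```javascript
// Hypothetical front-matter parser: turns "key: value" lines into an object.
function summaryParseSketch(block) {
  const summary = {};
  for (const line of block.split(/\r?\n/)) {
    // Split at the first colon only, so values such as URLs and timestamps stay intact.
    const m = /^(\w+):\s*(.*)$/.exec(line.trim());
    if (!m) continue; // skip blank or malformed lines
    const key = m[1];
    const value = m[2];
    const list = /^\[(.*)\]$/.exec(value);
    summary[key] = list
      ? list[1].split(",").map(item => item.trim()).filter(Boolean) // "[a, b]" becomes ["a", "b"]
      : value;
  }
  return summary;
}

console.log(summaryParseSketch("title: demo\ndate: 2018-08-15 21:27:15\ntags: [haskell]\n"));
```

All values stay strings (or arrays of strings), so the caller can still pass `summary.date` to `new Date(...)` as the entry function does.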

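Extracting text content

Extracting the summary block is simple, so I will skip it and look at parsing the Markdown text instead. The parser matches the common Markdown constructs, such as lists, headings (h), links (a) and images (img). The result is a list that can contain nested sublists. In essence it just extracts content with regular expressions until the line has been fully consumed.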

function textparse(s) {
    const trans = /^\\(\S)/; // escaped character
    const italy = /^(\*)(.+?)\1/; // italic
    const bold = /^(\*{2})(.+?)\1/; // bold
    const italybold = /^(\*{3})(.+?)\1/; // bold italic
    const headline = /^(\#{1,6})\s+/; // h1-h6
    const unsortlist = /^([*\-+])\s+/; // unordered list
    const sortlist = /^(\d+)\.\s+/; // ordered list
    const link = /^\*?\[(.+)\]\(([^()]+)\)\*?/; // link
    const img = /^(?:!\[([^\]]+)\]\(([^)]+)\)|<img(\s+)src="([^"]+)")/; // image
    const text =/^[^\\\s*]+/; // plain text

    if (headline.test(s)) return { type: "h" + RegExp.$1.length, text: RegExp.rightContext };
    if (sortlist.test(s)) return { type: "sl", num: RegExp.$1, child: lineparse(RegExp.rightContext) };
    if (unsortlist.test(s)) return { type: "ul", num: RegExp.$1, child: lineparse(RegExp.rightContext) };
    if (img.test(s)) return { type: "img", src: RegExp.$2||RegExp.$4, alt: RegExp.$1||RegExp.$3 };
    if (link.test(s)) return { type: "link", href: RegExp.$2, text: RegExp.$1 };
    return { type: "text", child: lineparse(s) };

    // Split one line into inline tokens, consuming the line from the left
    function lineparse(line) {
        let ws = [];
        while (line) {
            if (/^[\s]+/.test(line)) {
                ws.push({ type: "text", text: "&nbsp;" });
                line = RegExp.rightContext;
            }
            if (trans.test(line)) {
                ws.push({ type: "text", text: RegExp.$1 });
                line = RegExp.rightContext;
            }
            if (sortlist.test(line)) {
                return { child: lineparse(RegExp.rightContext) };
            }
            if (unsortlist.test(line)) {
                return { child: lineparse(RegExp.rightContext) };
            }
            if (link.test(line)) {
                ws.push({ type: "link", href: RegExp.$2, text: RegExp.$1 });
                line = RegExp.rightContext;
            }
            if (italybold.test(line)) {
                ws.push({ type: "italybold", text: RegExp.$2 });
                line = RegExp.rightContext;
            }
            if (bold.test(line)) {
                ws.push({ type: "bold", text: RegExp.$2 });
                line = RegExp.rightContext;
            }
            if (italy.test(line)) {
                ws.push({ type: "italy", text: RegExp.$2 });
                line = RegExp.rightContext;
            }
            if (text.test(line)) {
                ws.push({ type: "text", text: RegExp.lastMatch });
                line = RegExp.rightContext;
            }
        }
        return ws;
    }
}

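Displaying code blocks

Parsing only text content would be quite simple, but a technical blog cannot do without code blocks. To color the key tokens in the code and make it easier to read, the parsing has to go further. I have written a parser for nearly every language my blog currently uses. Some parsers can be shared: the style function, for example, applies not only to css but also to similar preprocessors such as scss and less, and the html parser likewise applies to similar markup languages.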

const codeparse = {
  haskell(str){},
  javascript(str){},
  html:html,
  css:style
};

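Let's look at the JavaScript parser, which is fairly representative. It does not split the text into an array of lines on the line break (\n), because some token types can only be inferred across lines, such as block parsing and method name detection. Instead, the whole block of text is matched and consumed piece by piece with regular expressions. The result resembles the text parsing result above: a nested list whose types are syntax keywords, common built-in methods, strings, numbers, special symbols, and so on.

This parser could be extended and abstracted further into a basic framework for the C family of languages. Passing in the regular expression rules for a given language would then be enough to parse that language, for example C#, Java, C++ or Go.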

javascript(str) {
    const comreg = /^\/{2,}.*/; // line comment
    // keywords and built-in names
    const keyreg = /^(import|from|extends|new|var|let|const|return|if|else|switch|case|break|continue|of|for|in|Array|Object|Number|Boolean|String|RegExp|Date|Error|undefined|null|true|false|this|alert|console)(?=([\s.,;(]|$))/;
    const typereg = /^(window|document|location|sessionStorage|localStorage|Math|this)(?=[,.;\s])/; // global objects
    const regreg = /^\/\S+\/[gimuys]?/; // regex literal
    // common built-in methods
    const sysfunreg = /^(forEach|map|filter|reduce|some|every|splice|slice|split|shift|unshift|push|pop|substr|substring|call|apply|bind|match|exec|test|search|replace)(?=[\s\(])/;
    const funreg = /^(function|class)\s+(\w+)(?=[\s({])/; // function or class declaration
    const methodreg = /^(\w+?)\s*?(\([^()]*\)\s*?{)/; // method definition
    const symbolreg = /^([!><?|\^$&~%*/+\-]+)/; // operators
    const strreg = /^([`'"])([^\1]*?)\1/; // string literal
    const numreg = /^(\d+\.\d+|\d+)(?!\w)/; // number
    // Emit one "comm" entry per line of a block comment
    const parsecomment = s => {
        const ret = [];
        const lines = s.split(/[\r\n]/g);
        for (let line of lines) {
            ret.push({ type: "comm", text: line });
        }
        return ret;
    };

    let ret = [];

    while (str) {
        // block comment, possibly spanning several lines
        if (/^\s*\/\*([\s\S]+?)\*\//.test(str)) {
            str = RegExp.rightContext;
            const coms = parsecomment(RegExp.lastMatch);
            ret = ret.concat(coms);
        }
        // one line of code
        if (/^(?!\/\*).+/.test(str)) {
            str = RegExp.rightContext;
            ret.push({ type: "text", child:lineparse(RegExp.lastMatch) });
        }
        // line breaks
        if(/^[\r\n]+/.test(str)){
            str=RegExp.rightContext;
            ret.push({type:'text',text:RegExp.lastMatch});
        }
    }
    return ret;

    // Split one line of code into typed tokens, consuming the line from the left
    function lineparse(line) {
        let ws = [];
        while (line) {
            if (/^([\s\t\r\n]+)/.test(line)) {
                ws.push({ type: "text", text: RegExp.$1 });
                line = RegExp.rightContext;
            }
            if (comreg.test(line)) {
                ws.push({ type: "comm", text: line });
                break;
            }
            if (regreg.test(line)) {
                ws.push({ type: "fun", text: RegExp.lastMatch });
                line = RegExp.rightContext;
            }
            if (symbolreg.test(line)) {
                ws.push({ type: "keyword", text: RegExp.$1 });
                line = RegExp.rightContext;
            }
            if (keyreg.test(line)) {
                ws.push({ type: "keyword", text: RegExp.$1 });
                line = RegExp.rightContext;
            }
            if (funreg.test(line)) {
                ws.push({ type: "keyword", text: RegExp.$1 });
                ws.push({ type: "text", text: "&nbsp;" });
                ws.push({ type: "fun", text: RegExp.$2 });
                line = RegExp.rightContext;
            }
            if (methodreg.test(line)) {
                ws.push({ type: "fun", text: RegExp.$1 });
                ws.push({ type: "text", text: "&nbsp;" });
                ws.push({ type: "text", text: RegExp.$2 });
                line = RegExp.rightContext;
            }
            if (typereg.test(line)) {
                ws.push({ type: "fun", text: RegExp.$1 });
                line = RegExp.rightContext;
            }
            if (sysfunreg.test(line)) {
                ws.push({ type: "var", text: RegExp.$1 });
                line = RegExp.rightContext;
            }
            if (strreg.test(line)) {
                ws.push({ type: "var", text: RegExp.$1 + RegExp.$2 + RegExp.$1 });
                line = RegExp.rightContext;
            }
            if (numreg.test(line)) {
                ws.push({ type: "var", text: RegExp.$1 });
                line = RegExp.rightContext;
            }
            if (/^\w+/.test(line)) {
                ws.push({ type: "text", text: RegExp.lastMatch });
                line = RegExp.rightContext;
            }
            if (/^[^`'"!><?|\^$&~%*/+\-\w]+/.test(line)) {
                ws.push({ type: "text", text: RegExp.lastMatch });
                line = RegExp.rightContext;
            }
        }
        return ws;
    }
}
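
The idea of turning this parser into a framework that takes per-language rules can be sketched as a rule table plus a generic consume loop. Everything here (`makeTokenizer`, the rule shape, the reduced `jsRules` list) is an assumption for illustration, not the author's code. It uses `exec` and `slice` in place of the legacy RegExp properties, and adds a one-character fallback so the loop always advances:

```javascript
// Builds a line tokenizer from an ordered list of { type, pattern } rules.
// Each pattern must be anchored with ^ and must not use the g flag.
function makeTokenizer(rules) {
  return function tokenize(line) {
    const tokens = [];
    let rest = line;
    while (rest) {
      let matched = false;
      for (const { type, pattern } of rules) {
        const m = pattern.exec(rest);
        if (m && m.index === 0 && m[0].length > 0) {
          tokens.push({ type, text: m[0] });
          rest = rest.slice(m[0].length); // consume the matched prefix
          matched = true;
          break; // restart from the first rule
        }
      }
      if (!matched) {
        // Fallback: emit one character as plain text so the loop cannot stall.
        tokens.push({ type: "text", text: rest[0] });
        rest = rest.slice(1);
      }
    }
    return tokens;
  };
}

// A deliberately small rule set; a real one would cover far more syntax.
const jsRules = [
  { type: "text", pattern: /^\s+/ },
  { type: "comm", pattern: /^\/\/.*/ },
  { type: "keyword", pattern: /^(?:var|let|const|return|if|else)\b/ },
  { type: "var", pattern: /^(?:\d+\.\d+|\d+)(?!\w)/ },
  { type: "var", pattern: /^(["'`]).*?\1/ },
  { type: "text", pattern: /^\w+/ },
];
const tokenizeJs = makeTokenizer(jsRules);

console.log(tokenizeJs("let x = 42; // answer"));
```

Supporting another language would then mean supplying a different rule list, while the loop stays the same. Rule order encodes priority, so comments and keywords come before the generic word rule.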

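Rendering WXML

Finally, running the parser generates the JSON file for each Markdown file. Load the JSON into the WeChat mini program's cloud database, and the mini program takes care of the rendering. Below is the display part, written in JSX with Taro.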

<View className='article'>
    {lines.map(l => (
        <Block>
        <View className='line'>
            {l.type.search("h") == 0 && ( <Text className={l.type}>{l.text}</Text> )}
            {l.type == "link" && ( <Navigator className='link' url={l.href}> {l.text} </Navigator> )}
            {l.type == "img" && ( <Image className='pic' mode='widthFix' src={l.src} /> )}
            {l.type == "sl" && ( <Block> 
                <Text decode className='num'> {l.num}.{" "} </Text>
                <TextChild list={l.child} />
            </Block>
            )}
            {l.type == "ul" && ( <Block> 
                <Text decode className='num'> {" "} &bull;{" "} </Text>
                <TextChild list={l.child} />
            </Block>
            )}
            {l.type == "text" && l.child.length && ( <TextChild list={l.child} /> )}
        </View>
        {l.type == "code" && (
            <View className='code'>
            {l.child.map(c => (
                <View className='code-line'>
                {c.type == 'comm' && <Text decode className='comm'> {c.text} </Text>}
                {c.type == 'text' && c.child.map(i => (
                    <Block>
                    {i.type == "comm" && ( <Text decode className='comm'> {i.text} </Text> )}
                    {i.type == "keyword" && ( <Text decode className='keyword'> {i.text} </Text> )}
                    {i.type == "var" && ( <Text decode className='var'> {i.text} </Text> )}
                    {i.type == "fun" && ( <Text decode className='fun'> {i.text} </Text> )}
                    {i.type == "text" && ( <Text decode className='text'> {i.text} </Text> )}
                    </Block>
                ))}
                </View>
            ))}
            </View>
        )}
        </Block>
    ))}
</View>

Afterword

After this project my regular expression skills have gone up another level; even lookarounds now come naturally.
