
Versal/word2markdown: Convert Word to Markdown, with images and math


Open source project name:

Versal/word2markdown

Open source project URL:

https://github.com/Versal/word2markdown

Open source programming language:

XSLT 90.9%

Open source project introduction:

word2markdown

At Versal, we have created quite a few lessons for internal use, many of which were originally written as Word documents. Now that we've released the first version of our educational platform, we want to convert those lessons for online publication. We have an internal API that allows uploading of Markdown documents, which are then converted to courses with lessons and gadgets. So we only needed to convert our Word documents to Markdown. Sounds easy, right?

Not so much. There are a few existing solutions, but they only work for very basic text formatting. Our documents were a bit more complex, containing tables, images, and math -- which proved especially tricky! So, using a number of existing tools, we hacked together our own conversion script. It consists of 9 consecutive steps:

  1. Exporting to HTML using Microsoft Word 2011. We automated this on OS X using Automator. Solutions for other platforms are welcome!
  2. Extracting the image types that we want to use. Images keep their original quality, unless the original is a proprietary .emz file. In this step we also fix some math.
  3. Converting HTML to XML using tagsoup.
  4. Converting OMML (Office Math Markup Language, the proprietary Word math format) into MathML equations, using Microsoft's own conversion XSLT and a custom version of that XSLT. Uses Saxon 8.
  5. Some intermediate fixes for whitespace and math.
  6. Conversion back into HTML using Tidy. Also strips a lot of stuff.
  7. More intermediate fixes to deal with shortcomings of Tidy and Pandoc.
  8. Conversion into Markdown using Pandoc.
  9. Lots of cleanup and final fixes to the Markdown.
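
The nine steps above form a linear pipeline in which each step consumes the output of the previous one. The following is a minimal Python sketch of that idea, not the project's actual implementation (which is a shell script driving XSLT and other tools). The helper names `shell_stage`, `collapse_blank_lines`, and `run_pipeline` are hypothetical, the cleanup rule is only an illustrative stand-in for the "final fixes", and the Pandoc arguments are an assumed invocation rather than the flags the project uses.

```python
import subprocess
from typing import Callable, Iterable, Sequence

# A stage takes the document text and returns the transformed text.
Stage = Callable[[str], str]


def shell_stage(command: Sequence[str]) -> Stage:
    """Wrap an external tool that reads stdin and writes stdout."""
    def run(text: str) -> str:
        result = subprocess.run(
            list(command), input=text, capture_output=True, text=True, check=True
        )
        return result.stdout
    return run


def collapse_blank_lines(text: str) -> str:
    """Hypothetical cleanup fix: squeeze runs of blank lines into one."""
    lines = []
    for line in text.splitlines():
        # Keep non-blank lines, and a blank line only after a non-blank one.
        if line.strip() or (lines and lines[-1].strip()):
            lines.append(line.rstrip())
    return "\n".join(lines).strip() + "\n"


def run_pipeline(text: str, stages: Iterable[Stage]) -> str:
    """Apply each stage in order, feeding its output to the next one."""
    for stage in stages:
        text = stage(text)
    return text


# Assumed invocation: pandoc reads stdin when no input file is given.
html_to_markdown = shell_stage(["pandoc", "-f", "html", "-t", "markdown"])
```

With Pandoc installed, `run_pipeline(html, [html_to_markdown, collapse_blank_lines])` would mimic steps 8 and 9. Keeping every stage as a plain text-to-text function makes it easy to insert the "intermediate fixes" between tools.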

We've released this pipeline as an open source project (MIT License), although it should be noted that you will need to purchase Microsoft Word for this to work. Hopefully this can be a starting point for a more reliable conversion of Word documents!

Requirements

  • Mac OS X
  • Microsoft Office 2011
  • Pandoc
  • HTML Tidy
  • Run npm install in this directory
  • Open Microsoft Office, File->Save As Webpage->Compatibility->Encoding->UTF-8. Save, exit, and now you're good to go!

Usage

For Word-to-Markdown scripts, first navigate to this directory, using cd doc-to-md.

Calling doc-to-md.sh sample.doc outputs markdown to stdout. Calling doc-to-md.sh sample.doc sample_files will also copy images. Example:

doc-to-md.sh fixtures/public.docx | less
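
If you want to drive the script from another program, the two calling forms described above can be wrapped as follows. This is a hedged sketch: the function names `doc_to_md_command` and `convert` are hypothetical, and the `./` prefix assumes the current working directory is the doc-to-md directory mentioned above.

```python
import subprocess
from typing import List, Optional


def doc_to_md_command(doc_path: str, image_dir: Optional[str] = None) -> List[str]:
    """Build the argument list for doc-to-md.sh.

    The optional second argument is the folder that images are copied to.
    """
    command = ["./doc-to-md.sh", doc_path]  # assumes cwd is doc-to-md
    if image_dir is not None:
        command.append(image_dir)
    return command


def convert(doc_path: str, image_dir: Optional[str] = None) -> str:
    """Run the script and return the Markdown it writes to stdout."""
    result = subprocess.run(
        doc_to_md_command(doc_path, image_dir),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the script fails
    )
    return result.stdout
```

For example, `convert("sample.doc", "sample_files")` corresponds to the second calling form and would also copy the images.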

Tests

Run './accept.sh' to generate new markdown, which you can compare to the original markdown using git.
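
The comparison that git performs here is a line-by-line diff between the stored Markdown and the freshly generated one. As an illustration only (the project itself relies on git, and `markdown_diff` is a hypothetical helper), the same check can be sketched with Python's standard difflib module:

```python
import difflib
from typing import List


def markdown_diff(expected: str, actual: str) -> List[str]:
    """Return unified-diff lines; an empty list means the outputs match."""
    return list(
        difflib.unified_diff(
            expected.splitlines(),
            actual.splitlines(),
            fromfile="expected",
            tofile="actual",
            lineterm="",
        )
    )
```

An empty result means the conversion output is unchanged; any other result shows which Markdown lines a change to the pipeline affected.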

HTML preview

Run 'fixtures/html.sh' to generate HTML. The HTML uses MathJax, loaded from an external server, to display equations in browsers that don't support MathML natively (pretty much everything but Firefox).

Disclaimer

Available under the MIT license (see LICENSE file). Built by @janpaul123 for @versal.
