• 设为首页
  • 点击收藏
  • 手机版
    手机扫一扫访问
    迪恩网络手机版
  • 关注官方公众号
    微信扫一扫关注
    迪恩网络公众号

Python readability.Document类代码示例

原作者: [db:作者] 来自: [db:来源] 收藏 邀请

本文整理汇总了Python中readability.Document的典型用法代码示例。如果您正苦于以下问题:Python Document类的具体用法?Python Document怎么用?Python Document使用的例子?那么恭喜您, 这里精选的类代码示例或许可以为您提供帮助。



在下文中一共展示了Document类的20个代码示例,这些例子默认根据受欢迎程度排序。您可以为喜欢或者感觉有用的代码点赞,您的评价将有助于我们的系统推荐出更棒的Python代码示例。

示例1: test_correct_cleanup

    def test_correct_cleanup(self):
        sample = """
        <html>
            <body>
                <section>test section</section>
                <article class="">
<p>Lot of text here.</p>
                <div id="advertisement"><a href="link">Ad</a></div>
<p>More text is written here, and contains punctuation and dots.</p>
</article>
                <aside id="comment1"/>
                <div id="comment2">
                    <a href="asd">spam</a>
                    <a href="asd">spam</a>
                    <a href="asd">spam</a>
                </div>
                <div id="comment3"/>
                <aside id="comment4">A small comment.</aside>
                <div id="comment5"><p>The comment is also helpful, but it's
                    still not the correct item to be extracted.</p>
                    <p>It's even longer than the article itself!"</p></div>
            </body>
        </html>
        """
        doc = Document(sample)
        s = doc.summary()
        #print(s)
        assert('punctuation' in s)
        assert(not 'comment' in s)
        assert(not 'aside' in s)
开发者ID:buriy,项目名称:python-readability,代码行数:30,代码来源:test_article_only.py


示例2: test_lxml_obj_result

 def test_lxml_obj_result(self):
     """Feed Document with an lxml obj instead of an html string. Expect an lxml response"""
     # Parse the sample into an lxml tree up front so Document receives a
     # parsed object rather than raw markup.
     utf8_parser = lxml.html.HTMLParser(encoding='utf-8')
     sample = lxml.html.document_fromstring(load_sample('nyt-article-video.sample.html'), parser=utf8_parser)
     doc = Document(sample, url='http://nytimes.com/')
     res = doc.summary()
     # NOTE(review): `basestring` exists only in Python 2 — this test cannot
     # run unmodified on Python 3; confirm the project's target version.
     self.assertFalse(isinstance(res, basestring))
开发者ID:RebelMouseTeam,项目名称:python-readability,代码行数:7,代码来源:test_article_only.py


示例3: test_si_sample_html_partial

 def test_si_sample_html_partial(self):
     """Using the si sample, make sure we can get the article alone."""
     document = Document(load_sample('si-game.sample.html'))
     document.parse(["summary"], html_partial=True)
     summary_html = document.summary()
     # A partial extraction starts at the article <div>, not at <html>.
     self.assertEqual('<div><h1>Tigers-R', summary_html[:17])
开发者ID:stalkerg,项目名称:python-readability,代码行数:7,代码来源:test_article_only.py


示例4: get

 def get(self):
     """HTTP GET: respond with the extracted title/markdown for ?url=...

     Checks the Webcache collection first; on a miss, fetches the page,
     extracts the readable article via readability.Document, converts it
     to markdown, caches the result, and writes it back as JSON.
     """
     url = self.get_argument("url", None)
     # https://www.ifanr.com/1080409
     # Cache hit: serve the stored document (drop Mongo's _id field).
     doc = Webcache.find_one({'url': url}, {'_id': 0})
     if doc:
         self.res = dict(doc)
         return self.write_json()
     try:
         sessions = requests.session()
         sessions.headers[
             'User-Agent'] = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36'
         response = sessions.get(url)
         # response.encoding = 'utf-8'  # TODO
         response.encoding = get_charset(response)
         doc = Document(response.text)
         title = doc.title()
         summary = doc.summary()
         markdown = html2text.html2text(summary)
         # Re-join words that html2text hyphenated across line breaks.
         markdown = markdown.replace('-\n', '-')
         markdown = markdown.strip()
         res = {}
         res['url'] = url
         res['title'] = title
         res['markdown'] = markdown
         # Only cache complete extractions (both title and body present).
         if title and markdown:
             webcache = Webcache
             webcache.new(res)
             self.res = res
         self.write_json()
     except Exception as e:
         # NOTE(review): failures are only printed, so the client receives
         # an empty response on error — confirm this best-effort is intended.
         print(e)
开发者ID:anwen,项目名称:anwen,代码行数:31,代码来源:api_share.py


示例5: test_si_sample_html_partial

 def test_si_sample_html_partial(self):
     """Using the si sample, make sure we can get the article alone."""
     page_html = load_sample('si-game.sample.html')
     # NOTE: this fork's Document takes the URL as the first argument.
     article = Document(
         'http://sportsillustrated.cnn.com/baseball/mlb/gameflash/2012/04/16/40630_preview.html',
         page_html)
     cleaned = article.get_clean_article()
     self.assertEqual('<div><div class="', cleaned[:17])
开发者ID:bgruszka,项目名称:python-readability,代码行数:7,代码来源:test_article_only.py


示例6: process_item

 def process_item(self, article, spider):
     """Pipeline step: reduce the article HTML to readable text and add a hash.

     Replaces article['text'] with the tag-free readability summary and
     stores a SHA-256 hex digest of the URL under article['hash'].
     Returns the (mutated) article item for the next pipeline stage.
     """
     doc = Document(article['text'])
     article['text'] = strip_tags(doc.summary())
     # Bug fix: hashlib.sha256 requires bytes — a str URL raises TypeError
     # on Python 3. Encode str input; pass already-encoded bytes through
     # (keeps Python 2 behaviour unchanged).
     url = article['url']
     if not isinstance(url, bytes):
         url = url.encode('utf-8')
     article['hash'] = hashlib.sha256(url).hexdigest()
     return article
开发者ID:omidmt,项目名称:crawler,代码行数:7,代码来源:pipelines.py


示例7: test_si_sample

 def test_si_sample(self):
     """Using the si sample, load article with only opening body element"""
     document = Document(load_sample('si-game.sample.html'))
     document.parse(["summary"])
     summary_html = document.summary()
     # The summary is wrapped in full <html><body> tags here.
     self.assertEqual('<html><body><h1>Tigers-Roya', summary_html[:27])
开发者ID:stalkerg,项目名称:python-readability,代码行数:7,代码来源:test_article_only.py


示例8: convert

def convert(link):
    """
    use burify's readability implementation to transcode a web page
    and return the transcoded page and images found in it
    """
    failure = (None, None, None)

    if not link:
        logger.error('Cannot transcode nothing!')
        return failure

    try:
        data = transcoder.prepare_link(link)
        if not data:
            logger.info('Cannot parse %s correctly' % link)
            return failure

        article = Document(data)
        if not article:
            logger.info('Burify cannot recognize the data')
            return failure

        # Extract images alongside the cleaned-up article body.
        images, content = _collect_images(
            article.summary(html_partial=False), link)
        return article.short_title(), content, images
    except Exception as k:
        logger.error('%s for %s' % (str(k), str(link)))
        return failure
开发者ID:chengdujin,项目名称:newsman,代码行数:26,代码来源:burify.py


示例9: test_si_sample_html_partial

 def test_si_sample_html_partial(self):
     """Using the si sample, make sure we can get the article alone."""
     page = load_sample("si-game.sample.html")
     article = Document(
         page,
         url="http://sportsillustrated.cnn.com/baseball/mlb/gameflash/2012/04/16/40630_preview.html",
     )
     result = article.summary(enclose_with_html_tag=True)
     self.assertEqual('<div><div class="', result[:17])
开发者ID:DannyGoodall,项目名称:python-readability,代码行数:8,代码来源:test_article_only.py


示例10: test_lazy_images

 def test_lazy_images(self):
     """
     Some sites use <img> elements with data-lazy-src elements pointing to the actual image.
     """
     page = load_sample('wired.sample.html')
     # NOTE: this fork's Document takes the URL as the first argument.
     document = Document('http://www.wired.com/design/2014/01/will-influential-ui-design-minority-report/', page)
     cleaned = document.get_clean_article()
     # The lazy-loaded image URL must have been promoted into src.
     self.assertIn('<img src="http://www.wired.com/images_blogs/design/2014/01/her-joaquin-phoenix-41-660x371.jpg"', cleaned)
开发者ID:bgruszka,项目名称:python-readability,代码行数:8,代码来源:test_article_only.py


示例11: test_many_repeated_spaces

    def test_many_repeated_spaces(self):
        """A huge run of spaces inside a paragraph must not break extraction."""
        padding = ' ' * 1000000
        markup = '<html><body><p>foo' + padding + '</p></body></html>'
        summary_html = Document(markup).summary()
        assert 'foo' in summary_html
开发者ID:buriy,项目名称:python-readability,代码行数:8,代码来源:test_article_only.py


示例12: test_si_sample

 def test_si_sample(self):
     """Using the si sample, load article with only opening body element"""
     article_url = 'http://sportsillustrated.cnn.com/baseball/mlb/gameflash/2012/04/16/40630_preview.html'
     document = Document(load_sample('si-game.sample.html'), url=article_url)
     summary_html = document.summary()
     self.assertEqual('<html><body><div><div class', summary_html[:27])
开发者ID:buriy,项目名称:python-readability,代码行数:8,代码来源:test_article_only.py


示例13: get

 def get(self):
   """HTTP GET: fetch ?url=..., run it through readability, and render the
   smartypants-formatted summary followed by the page's STYLE block."""
   urls = self.get_query_arguments('url')
   # Require exactly one ?url= query argument.
   if urls and len(urls) == 1:
     url = urls[0]
     doc = Document(requests.get(url).text)
     self.write(smartypants(doc.summary()))
     self.write(STYLE)
   else:
     self.write("Please provide ?url=[your-url]")
开发者ID:guidoism,项目名称:prettyweb,代码行数:9,代码来源:unshitify.py


示例14: transform

    def transform(self, row, chan):
        """Transform step: resolve the pending HTTP response in *row*, then
        attach the readability title and a plain-text rendering of the body.

        Yields the enriched row (generator-style transformer).
        """
        row['response'] = resolve_future(row['response'])

        doc = Document(row['response'].content)

        row['title'] = doc.title()
        summary = doc.summary()
        # html2text leaves '****' emphasis artifacts; strip them out.
        row['text'] = html2text(summary, bodywidth=160).replace('****', '').strip()

        yield row
开发者ID:hartym,项目名称:readtheweb,代码行数:10,代码来源:transformers.py


示例15: test_best_elem_is_root_and_passing

 def test_best_elem_is_root_and_passing(self):
     """summary() must not crash when the best candidate is the root node."""
     markup = (
         '<html class="article" id="body">'
         '   <body>'
         '       <p>1234567890123456789012345</p>'
         '   </body>'
         '</html>'
     )
     Document(markup).summary()
开发者ID:buriy,项目名称:python-readability,代码行数:10,代码来源:test_article_only.py


示例16: extract_article

def extract_article(url, ip):
    """Extracts the article using readability.

    Fetches *url* (via get_url, bound to *ip*) and returns a
    (title, summary) tuple. On a non-200 response both elements are
    None, so callers can always unpack two values.
    """
    title, summary = None, None
    response = get_url(url, ip)
    if response.status_code == 200:
        doc = Document(response.content)
        summary = unicode(doc.summary())
        title = unicode(doc.title())
    # Bug fix: the non-200 branch previously returned a bare None, which
    # broke callers unpacking `title, summary` — the (None, None)
    # initialisation above shows the tuple contract was intended.
    return title, summary
开发者ID:apg,项目名称:text-please,代码行数:11,代码来源:textplease.py


示例17: get_html_article

    def get_html_article(self, response):
        """
        First run readability to extract the main text, then strip HTML tags
        and blank lines. Because the extractor tends to mix navigation text
        into its output, post-process it: split the text on newlines, measure
        each segment's length, and keep only the range that looks like the
        article body.
        """

        readable_article = Document(response).summary()
        readable_article = self.remove_html_tag(readable_article)
        readable_article = self.remove_empty_line(readable_article)

        article_split = readable_article.split('\n')

        # Positions where the article body was detected to start and end
        # (end counts from the tail of the list).
        begin = 0
        end = 0

        begin_find = False
        end_find = False
        has_article = False

        for index in range(len(article_split)):

            # # When one segment is especially long, keep only that segment
            # if len(article_split[index]) > 500:
            #     begin, end = index, index
            #     break

            if not begin_find:
                # A segment longer than IS_ARTICLE_SIZE characters is taken
                # as the start of the article.
                if len(article_split[index]) > IS_ARTICLE_SIZE:
                    begin = index
                    begin_find = True
                    has_article = True

            elif not end_find:
                # Scan from the tail (index -index-1), skipping empty lines.
                if len(article_split[-index - 1]) == 0:
                    continue
                # u'\u3002' and u'\uff01' are the Chinese full stop and
                # exclamation mark — typical sentence-ending punctuation.
                elif article_split[-index - 1][-1] in u'\u3002\uff01':
                    if len(article_split[-index - 1]) > IS_ARTICLE_SIZE:
                        end = index
                        end_find = True
                        has_article = True

        empty_list=[]

        if not has_article:
            return empty_list
        elif begin == end:
            # Start and end landed on the same segment: return just that one.
            empty_list.append(article_split[begin])
            return empty_list
        else:
            # `end` was measured from the tail, so convert it to a slice bound.
            return article_split[begin: len(article_split) - end]
开发者ID:zengsn,项目名称:name-crawler-python,代码行数:53,代码来源:htmlArticle.py


示例18: view_html

def view_html(url):
    """Converts an html document to a markdown'd string
    using my own fork of python-readability"""
    try:
        from readability import Document
    except ImportError:
        print("Can't convert document: python-readability is not installed")
        return
    
    html = urlopen(url).read()
    doc=Document(html)
    # NOTE(review): Document.markdown() exists only in the author's fork —
    # upstream python-readability has no such method; confirm the fork is used.
    print(wrap(asciify(BOLD+doc.title()+RESET+"\n"+doc.markdown(),strip_newlines=False),80,''))
开发者ID:edd07,项目名称:resh,代码行数:12,代码来源:view.py


示例19: parse_item

 def parse_item(self, response):
     """Scrapy callback: archive the raw readability content to disk and
     return a BeerReviewPage item describing the crawled page."""
     # Stable on-disk filename derived from the URL.
     filename = hashlib.sha1(response.url.encode()).hexdigest()
     readability_document = Document(response.body, url=response.url)
     item = BeerReviewPage()
     item['url'] = response.url
     item['filename'] = filename
     item['depth'] = response.meta['depth']
     item['link_text'] = response.meta['link_text']
     item['title'] = readability_document.short_title()
     with open('data/' + filename + '.html','wb') as html_file:
         html_file.write(readability_document.content())
     # NOTE(review): Python 2 print statement — this module targets Python 2
     # and will not even compile on Python 3.
     print '(' + filename + ') ' + item['title'] + " : " + item['url']
     return item
开发者ID:anoras,项目名称:BeerGeek,代码行数:13,代码来源:BeerGeekSpider.py


示例20: extract_content_texts

def extract_content_texts(name):
    """For every raw article under *name*, write a JSON file holding the
    readability title, content and summary (existing output is skipped)."""
    raw_dir = os.path.join(DEFAULT_SAVE_PATH, name, 'raw_articles')
    out_dir = os.path.join(DEFAULT_SAVE_PATH, name, 'json_articles')
    mkdir_p(out_dir)
    for html_path in glob.glob(raw_dir + '/*.html'):
        out_path = os.path.join(out_dir, os.path.basename(html_path) + '.json')
        if os.path.exists(out_path):
            logging.info('Skipping existing json data: {0}'.format(out_path))
            continue
        with open(html_path, 'r') as source:
            parsed = Document(source.read())
            payload = {
                'title': parsed.title(),
                'content': parsed.content(),
                'summary': parsed.summary(),
            }
            with open(out_path, 'w') as sink:
                json.dump(payload, sink)
开发者ID:gregjan,项目名称:bullshit-detector,代码行数:18,代码来源:wbm_api.py



注:本文中的readability.Document类示例由纯净天空整理自Github/MSDocs等源码及文档管理平台,相关代码片段筛选自各路编程大神贡献的开源项目,源码版权归原作者所有,传播和使用请参考对应项目的License;未经允许,请勿转载。


鲜花

握手

雷人

路过

鸡蛋
该文章已有0人参与评论

请发表评论

全部评论

专题导读
上一篇:
Python readability.Document类代码示例发布时间:2022-05-26
下一篇:
Python read_config.read_config函数代码示例发布时间:2022-05-26
热门推荐
阅读排行榜

扫描微信二维码

查看手机版网站

随时了解更新最新资讯

139-2527-9053

在线客服(服务时间 9:00~18:00)

在线QQ客服
地址:深圳市南山区西丽大学城创智工业园
电邮:jeky_zhao#qq.com
移动电话:139-2527-9053

Powered by 互联科技 X3.4© 2001-2213 极客世界.|Sitemap