                    Chapter 8. HTML Processing

                    8.1. Overview

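                    I often see questions like these on comp.lang.python: "How can I list all the [headers|images|links] in my HTML document?" "How do I [parse|translate|munge] the text of my HTML document but leave the tags alone?" "How can I [add|remove|quote] attributes of all my HTML tags at once?" This chapter answers all of these questions.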
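As a taste of the first question (listing every link), here is a minimal sketch. It is not the code developed in this chapter: it assumes Python 3 and the standard `html.parser` module instead of the `sgmllib` module used in the listings of this chapter, and the names `LinkLister` and `list_links` are hypothetical.

```python
from html.parser import HTMLParser


class LinkLister(HTMLParser):
    """Collect the href value of every <a> tag (illustrative sketch)."""

    def reset(self):
        # HTMLParser.__init__ calls reset(), so urls always exists
        super().reset()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # tag and attribute names arrive lower-cased;
        # attrs is a list of (name, value) tuples
        if tag == "a":
            self.urls.extend(
                value for key, value in attrs if key == "href" and value
            )


def list_links(html_source):
    """Return all link targets found in html_source, in document order."""
    parser = LinkLister()
    parser.feed(html_source)
    parser.close()
    return parser.urls


sample = '<p><a href="index.html">Home</a> <a name="top">x</a> <A HREF="toc.html">TOC</A></p>'
print(list_links(sample))
```

The anchor without an `href` attribute is skipped, and the upper-case tag is still found because the parser normalizes names before calling the handler.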

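                    Here is a complete, working Python program in two parts. The first part, BaseHTMLProcessor.py, is a generic tool that helps you process HTML files by walking through the tags and text blocks. The second part, dialect.py, is an example of how to use BaseHTMLProcessor.py to translate the text of an HTML document while leaving the tags alone. Read the doc strings and comments to get an overview of what is going on. Most of it will look like black magic, because it is not obvious how any of these class methods ever get called. Don't worry, everything will be revealed step by step.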

                    Example 8.1. BaseHTMLProcessor.py

                    If you have not already done so, you can download this and the other examples used in this book.

                    
                    from sgmllib import SGMLParser
                    import htmlentitydefs
                    
                    class BaseHTMLProcessor(SGMLParser):
                        def reset(self):                       
                            # extend (called by SGMLParser.__init__)
                            self.pieces = []
                            SGMLParser.reset(self)
                    
                        def unknown_starttag(self, tag, attrs):
                            # called for each start tag
                            # attrs is a list of (attr, value) tuples
                            # e.g. for <pre class="screen">, tag="pre", attrs=[("class", "screen")]
                            # Ideally we would like to reconstruct original tag and attributes, but
                            # we may end up quoting attribute values that weren't quoted in the source
                            # document, or we may change the type of quotes around the attribute value
                            # (single to double quotes).
                            # Note that improperly embedded non-HTML code (like client-side Javascript)
                            # may be parsed incorrectly by the ancestor, causing runtime script errors.
                            # All non-HTML code must be enclosed in HTML comment tags (<!-- code -->)
                            # to ensure that it will pass through this parser unaltered (in handle_comment).
                            strattrs = "".join([' %s="%s"' % (key, value) for key, value in attrs])
                            self.pieces.append("<%(tag)s%(strattrs)s>" % locals())
                    
                        def unknown_endtag(self, tag):         
                            # called for each end tag, e.g. for </pre>, tag will be "pre"
                            # Reconstruct the original end tag.
                            self.pieces.append("</%(tag)s>" % locals())
                    
                        def handle_charref(self, ref):         
                            # called for each character reference, e.g. for "&#160;", ref will be "160"
                            # Reconstruct the original character reference.
                            self.pieces.append("&#%(ref)s;" % locals())
                    
                        def handle_entityref(self, ref):       
                            # called for each entity reference, e.g. for "&copy;", ref will be "copy"
                            # Reconstruct the original entity reference.
                            self.pieces.append("&%(ref)s" % locals())
                            # standard HTML entities are closed with a semicolon; other entities are not
                            if htmlentitydefs.entitydefs.has_key(ref):
                                self.pieces.append(";")
                    
                        def handle_data(self, text):           
                            # called for each block of plain text, i.e. outside of any tag and
                            # not containing any character or entity references
                            # Store the original text verbatim.
                            self.pieces.append(text)
                    
                        def handle_comment(self, text):        
                            # called for each HTML comment, e.g. <!-- insert Javascript code here -->
                            # Reconstruct the original comment.
                            # It is especially important that the source document enclose client-side
                            # code (like Javascript) within comments so it can pass through this
                            # processor undisturbed; see comments in unknown_starttag for details.
                            self.pieces.append("<!--%(text)s-->" % locals())
                    
                        def handle_pi(self, text):             
                            # called for each processing instruction, e.g. <?instruction>
                            # Reconstruct original processing instruction.
                            self.pieces.append("<?%(text)s>" % locals())
                    
                        def handle_decl(self, text):
                            # called for the DOCTYPE, if present, e.g.
                            # <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
                            #     "http://www.w3.org/TR/html4/loose.dtd">
                            # Reconstruct original DOCTYPE
                            self.pieces.append("<!%(text)s>" % locals())
                    
                        def output(self):              
                            """Return processed HTML as a single string"""
                            return "".join(self.pieces)
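The listing above targets Python 2: `sgmllib` was removed in Python 3 and `htmlentitydefs` became `html.entities`. The following is a rough sketch of the same "collect the pieces, then join them" idea on top of Python 3's `html.parser`. The class name `RoundTripProcessor` and the helper `round_trip` are hypothetical, and the sketch is not a drop-in replacement: `html.parser` lower-cases tag names and unescapes attribute values, so attribute values containing `&` or quotes will not come back exactly as written.

```python
from html.entities import entitydefs
from html.parser import HTMLParser


class RoundTripProcessor(HTMLParser):
    """Hypothetical Python 3 analogue of BaseHTMLProcessor (a sketch only)."""

    def __init__(self):
        # convert_charrefs=False makes the parser report &#160; and &copy;
        # through handle_charref/handle_entityref instead of decoding them.
        super().__init__(convert_charrefs=False)

    def reset(self):
        # HTMLParser.__init__ calls reset(), so pieces is always initialised.
        super().reset()
        self.pieces = []

    @staticmethod
    def _format_attrs(attrs):
        # Valueless attributes (e.g. <input disabled>) arrive with value None.
        return "".join(
            " %s" % key if value is None else ' %s="%s"' % (key, value)
            for key, value in attrs
        )

    def handle_starttag(self, tag, attrs):
        self.pieces.append("<%s%s>" % (tag, self._format_attrs(attrs)))

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>.
        self.pieces.append("<%s%s />" % (tag, self._format_attrs(attrs)))

    def handle_endtag(self, tag):
        self.pieces.append("</%s>" % tag)

    def handle_charref(self, name):
        self.pieces.append("&#%s;" % name)

    def handle_entityref(self, name):
        # Same rule as the listing above: only known entities get a semicolon.
        self.pieces.append("&%s" % name)
        if name in entitydefs:
            self.pieces.append(";")

    def handle_data(self, data):
        self.pieces.append(data)

    def handle_comment(self, data):
        self.pieces.append("<!--%s-->" % data)

    def handle_pi(self, data):
        self.pieces.append("<?%s>" % data)

    def handle_decl(self, decl):
        self.pieces.append("<!%s>" % decl)

    def output(self):
        """Return processed HTML as a single string."""
        return "".join(self.pieces)


def round_trip(html_source):
    parser = RoundTripProcessor()
    parser.feed(html_source)
    parser.close()
    return parser.output()
```

For simple markup that is already lower-case and double-quoted, `round_trip` should hand back its input unchanged, which is the property the original class is built around.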

                    Example 8.2. dialect.py

                    
                    import re
                    from BaseHTMLProcessor import BaseHTMLProcessor
                    
                    class Dialectizer(BaseHTMLProcessor):
                        subs = ()
                    
                        def reset(self):
                            # extend (called from __init__ in ancestor)
                            # Reset all data attributes
                            self.verbatim = 0
                            BaseHTMLProcessor.reset(self)
                    
                        def start_pre(self, attrs):            
                            # called for every <pre> tag in HTML source
                            # Increment verbatim mode count, then handle tag like normal
                            self.verbatim += 1                 
                            self.unknown_starttag("pre", attrs)
                    
                        def end_pre(self):                     
                            # called for every </pre> tag in HTML source
                            # Decrement verbatim mode count
                            self.unknown_endtag("pre")         
                            self.verbatim -= 1                 
                    
                        def handle_data(self, text):                                        
                            # override
                            # called for every block of text in HTML source
                            # If in verbatim mode, save text unaltered;
                            # otherwise process the text with a series of substitutions
                            self.pieces.append(self.verbatim and text or self.process(text))
                    
                        def process(self, text):
                            # called from handle_data
                            # Process text block by performing series of regular expression
                            # substitutions (actual substitutions are defined in descendant)
                            for fromPattern, toPattern in self.subs:
                                text = re.sub(fromPattern, toPattern, text)
                            return text
                    
                    class ChefDialectizer(Dialectizer):
                        """convert HTML to Swedish Chef-speak
                        
                        based on the classic chef.x, copyright (c) 1992, 1993 John Hagerman
                        """
                        subs = ((r'a([nu])', r'u\1'),
                                (r'A([nu])', r'U\1'),
                                (r'a\B', r'e'),
                                (r'A\B', r'E'),
                                (r'en\b', r'ee'),
                                (r'\Bew', r'oo'),
                                (r'\Be\b', r'e-a'),
                                (r'\be', r'i'),
                                (r'\bE', r'I'),
                                (r'\Bf', r'ff'),
                                (r'\Bir', r'ur'),
                                (r'(\w*?)i(\w*?)$', r'\1ee\2'),
                                (r'\bow', r'oo'),
                                (r'\bo', r'oo'),
                                (r'\bO', r'Oo'),
                                (r'the', r'zee'),
                                (r'The', r'Zee'),
                                (r'th\b', r't'),
                                (r'\Btion', r'shun'),
                                (r'\Bu', r'oo'),
                                (r'\BU', r'Oo'),
                                (r'v', r'f'),
                                (r'V', r'F'),
                                (r'w', r'w'),
                                (r'W', r'W'),
                                (r'([a-z])[.]', r'\1.  Bork Bork Bork!'))
                    
                    class FuddDialectizer(Dialectizer):
                        """convert HTML to Elmer Fudd-speak"""
                        subs = ((r'[rl]', r'w'),
                                (r'qu', r'qw'),
                                (r'th\b', r'f'),
                                (r'th', r'd'),
                                (r'n[.]', r'n, uh-hah-hah-hah.'))
                    
                    class OldeDialectizer(Dialectizer):
                        """convert HTML to mock Middle English"""
                        subs = ((r'i([bcdfghjklmnpqrstvwxyz])e\b', r'y\1'),
                                (r'i([bcdfghjklmnpqrstvwxyz])e', r'y\1\1e'),
                                (r'ick\b', r'yk'),
                                (r'ia([bcdfghjklmnpqrstvwxyz])', r'e\1e'),
                                (r'e[ea]([bcdfghjklmnpqrstvwxyz])', r'e\1e'),
                                (r'([bcdfghjklmnpqrstvwxyz])y', r'\1ee'),
                                (r'([bcdfghjklmnpqrstvwxyz])er', r'\1re'),
                                (r'([aeiou])re\b', r'\1r'),
                                (r'ia([bcdfghjklmnpqrstvwxyz])', r'i\1e'),
                                (r'tion\b', r'cioun'),
                                (r'ion\b', r'ioun'),
                                (r'aid', r'ayde'),
                                (r'ai', r'ey'),
                                (r'ay\b', r'y'),
                                (r'ay', r'ey'),
                                (r'ant', r'aunt'),
                                (r'ea', r'ee'),
                                (r'oa', r'oo'),
                                (r'ue', r'e'),
                                (r'oe', r'o'),
                                (r'ou', r'ow'),
                                (r'ow', r'ou'),
                                (r'\bhe', r'hi'),
                                (r've\b', r'veth'),
                                (r'se\b', r'e'),
                                (r"'s\b", r'es'),
                                (r'ic\b', r'ick'),
                                (r'ics\b', r'icc'),
                                (r'ical\b', r'ick'),
                                (r'tle\b', r'til'),
                                (r'll\b', r'l'),
                                (r'ould\b', r'olde'),
                                (r'own\b', r'oune'),
                                (r'un\b', r'onne'),
                                (r'rry\b', r'rye'),
                                (r'est\b', r'este'),
                                (r'pt\b', r'pte'),
                                (r'th\b', r'the'),
                                (r'ch\b', r'che'),
                                (r'ss\b', r'sse'),
                                (r'([wybdp])\b', r'\1e'),
                                (r'([rnt])\b', r'\1\1e'),
                                (r'from', r'fro'),
                                (r'when', r'whan'))
                    
                    def translate(url, dialectName="chef"):
                        """fetch URL and translate using dialect
                        
                        dialect in ("chef", "fudd", "olde")"""
                        import urllib                      
                        sock = urllib.urlopen(url)         
                        htmlSource = sock.read()           
                        sock.close()                       
                        parserName = "%sDialectizer" % dialectName.capitalize()
                        parserClass = globals()[parserName]                    
                        parser = parserClass()                                 
                        parser.feed(htmlSource)
                        parser.close()         
                        return parser.output() 
                    
                    def test(url):
                        """test all dialects against URL"""
                        for dialect in ("chef", "fudd", "olde"):
                            outfile = "%s.html" % dialect
                            fsock = open(outfile, "wb")
                            fsock.write(translate(url, dialect))
                            fsock.close()
                            import webbrowser
                            webbrowser.open_new(outfile)
                    
                    if __name__ == "__main__":
                        test("http://diveintopython.org/odbchelper_list.html")
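The core of dialect.py is the `verbatim` counter: text is rewritten only while the parser is outside every `<pre>` element. The sketch below isolates that mechanism for Python 3. It is an assumption-laden illustration, not the author's code: `html.parser` has no `start_pre`/`end_pre` dispatch, so the tag name is checked inside the generic handlers, and the names `FuddSketch` and `fuddify` are hypothetical. Comments, declarations and processing instructions are left out for brevity.

```python
import re
from html.parser import HTMLParser

# Copied from FuddDialectizer.subs in dialect.py
FUDD_SUBS = (
    (r'[rl]', r'w'),
    (r'qu', r'qw'),
    (r'th\b', r'f'),
    (r'th', r'd'),
    (r'n[.]', r'n, uh-hah-hah-hah.'),
)


class FuddSketch(HTMLParser):
    """Hypothetical Python 3 stand-in for FuddDialectizer (illustration only)."""

    def __init__(self):
        # Keep character/entity references as separate events so the
        # substitutions never see (and never mangle) them.
        super().__init__(convert_charrefs=False)

    def reset(self):
        super().reset()
        self.pieces = []
        self.verbatim = 0  # number of <pre> elements we are currently inside

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self.verbatim += 1
        strattrs = "".join(
            " %s" % key if value is None else ' %s="%s"' % (key, value)
            for key, value in attrs
        )
        self.pieces.append("<%s%s>" % (tag, strattrs))

    def handle_endtag(self, tag):
        self.pieces.append("</%s>" % tag)
        if tag == "pre":
            self.verbatim -= 1

    def handle_entityref(self, name):
        self.pieces.append("&%s;" % name)

    def handle_charref(self, name):
        self.pieces.append("&#%s;" % name)

    def handle_data(self, text):
        # Inside <pre>, keep the text as is; elsewhere, run the substitutions.
        self.pieces.append(text if self.verbatim else self.process(text))

    def process(self, text):
        for from_pattern, to_pattern in FUDD_SUBS:
            text = re.sub(from_pattern, to_pattern, text)
        return text

    def output(self):
        return "".join(self.pieces)


def fuddify(html_source):
    parser = FuddSketch()
    parser.feed(html_source)
    parser.close()
    return parser.output()
```

Using a counter rather than a boolean flag means nested `<pre>` elements are handled correctly: verbatim mode ends only when the outermost one is closed.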

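                    Example 8.3. Output of dialect.py

                    Running this script translates Section 3.2, "Introducing Lists", into mock Swedish Chef-speak (from The Muppets), mock Elmer Fudd-speak (from the Bugs Bunny cartoons), and mock Middle English (loosely based on Chaucer's The Canterbury Tales). If you look at the HTML source of the output pages, you will see that all the HTML tags and attributes are untouched, but the text between the tags has been translated into the mock language. If you look more closely, you will see that only the titles and paragraphs were translated; the code listings and screen examples were left untouched.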

                    <div class="abstract">
                    <p>Lists awe <span class="application">Pydon</span>'s wowkhowse datatype.
                    If youw onwy expewience wif wists is awways in
                    <span class="application">Visuaw Basic</span> ow (God fowbid) de datastowe
                    in <span class="application">Powewbuiwdew</span>, bwace youwsewf fow
                    <span class="application">Pydon</span> wists.</p>
                    </div>
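The words "wif" and "de" in this output come from two different rules, and the order of the rules in `FuddDialectizer.subs` matters: `th\b` (a word-final "th") has to be replaced before the general `th` rule consumes it. A small sketch, assuming Python 3 and using a hypothetical helper `apply_subs` that mirrors the loop in `Dialectizer.process`:

```python
import re


def apply_subs(text, subs):
    """Apply each (pattern, replacement) pair in order, as process() does."""
    for from_pattern, to_pattern in subs:
        text = re.sub(from_pattern, to_pattern, text)
    return text


# The two "th" rules in the order used by FuddDialectizer.subs ...
ORIGINAL_ORDER = ((r'th\b', r'f'), (r'th', r'd'))
# ... and in the opposite order, for comparison.
REVERSED_ORDER = ((r'th', r'd'), (r'th\b', r'f'))

print(apply_subs("with the", ORIGINAL_ORDER))  # wif de
print(apply_subs("with the", REVERSED_ORDER))  # wid de
```

With the rules reversed, the general pattern rewrites every "th" first, so the word-final rule never gets a chance to match.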
                    
