How to convert HTML code to Text with Python (solved)
pip install beautifulsoup4
pip install html5lib
html_doc = """Some HTML code that you want to convert"""
from bs4 import BeautifulSoup
# Name the parser explicitly (html5lib is the one installed above)
soup = BeautifulSoup(html_doc, "html5lib")
print(soup.get_text())
Of course, you need to import the module before you can call its functions. That is the role of the line from bs4 import BeautifulSoup above.
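Real pages usually contain script and style blocks whose content you do not want in the output. Here is a minimal sketch of how the snippet above could be wrapped in a function that drops them first. The function name html_to_text is my own choice, and I use Python's built-in "html.parser" here so the example runs without html5lib; treat it as one possible approach, not the only one.

```python
from bs4 import BeautifulSoup


def html_to_text(html_doc):
    """Return the visible text of an HTML string (hypothetical helper)."""
    soup = BeautifulSoup(html_doc, "html.parser")
    # Remove <script> and <style> elements, including their content
    for tag in soup(["script", "style"]):
        tag.decompose()
    # Strip each text fragment and join the fragments with a single space
    return soup.get_text(separator=" ", strip=True)


if __name__ == "__main__":
    print(html_to_text("<p>Hello <b>world</b></p><script>x = 1</script>"))
```

The separator and strip arguments of get_text() keep fragments from neighbouring tags from running together; pass "\n" as the separator if you prefer one fragment per line.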
Alternatives to BeautifulSoup to implement HTML2text
Stripogram might be an alternative to Beautiful Soup, but I must say, I am fully satisfied with Beautiful Soup.
from stripogram import html2text, html2safehtml
# Only allow <b>, <a>, <i>, <br>, and <p> tags
clean_html = html2safehtml(original_html,valid_tags=("b", "a", "i", "br", "p"))
# Don't process <img> tags, just strip them out. Use an indent of 4 spaces
# and a page that's 80 characters wide.
text = html2text(original_html, ignore_tags=("img",), indent_width=4, page_width=80)
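If you would rather not install anything, Python's standard library can do a rough version of the same job. The sketch below is only an illustration built on html.parser.HTMLParser; the names TextExtractor and strip_tags are mine, and it does none of the layout work (indentation, page width) that the html2text call above asks for.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the content of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped elements
        if not self._skip_depth and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def strip_tags(html_doc):
    """Return the text of an HTML string with all tags removed."""
    parser = TextExtractor()
    parser.feed(html_doc)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    print(strip_tags("<p>Hello <b>world</b></p><style>p{color:red}</style>"))
```

This is fine for quick clean-ups, but it is less forgiving than Beautiful Soup with badly broken markup.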
If you want to improve your coding skills, I advise you to look at "Cracking the Coding Interview: 150 Programming Questions and Solutions". It was written by Gayle Laakmann McDowell, a former recruiter from Google who also worked at Apple and I find it really great!