Saturday, 3 September 2016

Reclaiming the web in < 100 lines of code


The Internet seems to become less pleasant by the day for those of us who are here primarily to read. Every now and again (i.e. dozens of time per day), I see a URL that points to an article which looks like it might contain some interesting information. I click on the URL hoping to get a nice big piece of text for me to digest, but instead I'm presented with auto-play videos, a JavaScript overlay asking me to subscribe to a newsletter, another JavaScript overlay asking me to use the site's app (obligatory XKCD: App), another JavaScript overlay telling me not to use an adblocker and still another one which thanks me for not using an adblocker after I've told my adblocker to block the previous one... You get the picture.
Today I'm going to describe how you can greatly improve this experience, focusing specifically on news articles from online media, by building a Reader application and a browser extension. The application will transform web pages from looking like the image on the right, to instead look like the one on the left.




Our arsenal
To create our Reader app, we'll use Python and Flask. The browser extension we create is for Google Chrome, although it should be pretty trivial to adapt for Firefox. We'll be using the Newspaper library for article extraction, and we'll write a little bit of HTML and CSS to display our final article as we want to read it.
I assume that you know some basics, and that you have a working version of Python and Pip installed on your system. I don't go into too much depth about how the various components work, so if you have some previous knowledge of Python, HTML, CSS, and JavaScript, you'll find everything below makes a lot more sense. You should be able to piece everything together even without prior experience though.
Setting up
Newspaper, the library we use for text extraction, is primarily a Python3 library. There is a buggy fork for Python2, but I strongly recommend that you use Python3 to take advantage of the maintained version. I therefore assume that your system is set up in such a way that pip invokes pip3 and python points to the python3 interpreter. Adapt the following as necessary if this is not the case. I'm not going to show the extra commands needed to create a virtualenv and install the packages in that. If you feel strongly about this, feel free to adapt as you see fit.
First we need to install Flask and Newspaper. Run the following commands:
pip install Flask
pip install newspaper3k

For the latter, you may have some issues with the installation of the lxml library. GIYF.
Writing the Python code
The core of our app will be a web server that receives a URL from the user, downloads the content from that URL, extracts the text, reformats it, and returns it.
Create a directory for your project and create a file within this directory called reader.py. Add the following code to this file:
from flask import Flask
from flask import request
from flask import render_template
from newspaper import Article


app = Flask(__name__)


@app.route("/read")
def read():
    url = request.args.get("url")
    a = Article(url)
    a.download()
    a.parse()
    paragraphs = a.text.split("\n\n")
    return render_template("article.html", paragraphs=paragraphs, title=a.title)


if __name__ == '__main__':
    app.run(port=5000, debug=True)

The first few lines simply import the parts of Flask we'll be using and the Article class from Newspaper, which is all we need to download the article from the URL and perform text extraction on it.
The next line initialises our Flask app. We then see a single route, which will detect traffic going to the "/read" route, and call the function defined directly below it.
Our actual read() function grabs the URL of the desired article from the arguments of the current URL. It initalises an Article object, downloads the content from the URL, does Newspapers magic parsing on it (text extraction is actually a lot more difficult than one might imagine), and splits the resulting text into paragraphs. Finally, it returns an HTML template (which we'll write in the next section), and passes in the paragraphs of the article as well as the article's title as arguments. We pass in a list of paragraphs instead of the whole text chunk as Newspaper gives us text delimited with newline characters, which will be ignored in our HTML. We therefore will re-insert <p> tags between each paragraph in our template (see the next section).
The final part of the script starts up our web application if we are running it locally and turns on debug mode.
Writing the HTML
Now we need to create an HTML template which will form the skeleton of all news articles read through our app. Create a new directory inside your project directory called templates (this name will allow Flask to find your templates, so don't change it). Create a new file inside this directory called article.html. Your project should now have the following structure:
reader
|-- templates
|   +-- article.html
+-- reader.py

In the article.html file, add the following code:
<html>
    <head>
        <title>{{title}}</title>
        <style>
            body {
              font-family: "Helvetica";
              max-width: 900px;
              padding-left: 20px;
              padding-right: 20px;
              padding-top: 30px;
              margin: 0 auto;
              text-align: justify;
            }
        </style>
    </head>
    <body>
      <h1>{{title}}</h1>
      {% for paragraph in paragraphs %}
          <p>{{paragraph}}</p>
      {% endfor %}
    </body>
</html>

This is a Flask template (or more specifically a Jinja2 template). It has the normal structure of an HTML document (starting and ending with <html><body>, and <head> tags). We have a few lines of internal CSS which will make our article be displayed in a decent font, create margins on the left and right of the article on screens that are wider than 900px, add some padding so that the text doesn't try creep off the screen, put the text in the middle of the screen, and stretch out the text (fully justify) to give nice vertical lines on the left and right (which many people do not like, so feel free to remove the justify line if you prefer ragged right).
The non-html parts of the above code are enclosed in either double braces {{}} or in the brace-percent combination {%%}. The former are simply placeholders for the arguments that we pass in from our Python code (i.e. the paragraphs and the article's title). The latter defines a control sequence -- in our case, a simple for loop which will loop through each of our paragraphs and add them to the page, opening and closing <p> tags as required.
That's our entire app. Let's test it.
Testing our web application
To see if our app works, navigate to your project directory in terminal or command prompt and then run the reader.py script. To do this, run commands similar to the following (depending on where your project directory is located)
cd git/reader
python reader.py

You should see output similar to Running on http://127.0.0.1:5000/ (Press CTRL+C to quit). Now fire up your web browser and find the URL of a news article you'd like to read (e.g. this one about Mother Theresa: http://www.bbc.com/news/world-europe-37258156).
Navigate to http://localhost:5000/reader?url=http://www.bbc.com/news/world-europe-37258156 (substituting the URL you chose above if it's different). If all went well, you'll see the news article presented in a nice compact form, without any of the rubbish that you would normally have inflicted upon you.
Building a Google Chrome extension
Although our application is already usable, it's not very user-friendly. Each time you want to read an article, you have to copy the URL to the clipboard and then construct the long version as shown above. Instead of this, we want to be able right-click on any URL that we come across while browsing the web, and to easily send that article to our app. To do this, we'll build a Google Chrome extension. A basic Google Chrome extension consists of two parts: a manifest file (JSON), which describes the extension and requests the necessary permissions, and a JavaScript file, which is where the functionality of the extension lives.
Create a new directory called readerExtension and inside this create a file called manifest.json as well as one called script.js.
Inside manifest.json add the following code:
{
  "manifest_version": 2,

  "name": "Plaintext Article Reader",
  "description": "Reformats online news to remove all the gunk",
  "version": "1.0",

  "permissions": [
    "contextMenus"
  ],
  "background": {
      "scripts": ["script.js"]
  }
}

The first few lines simply describe our extension. In the permissions section, we state that we need permission to fiddle with the user's context menus (i.e. the menu that appears when you right click), and in the background section, we point to the script.js script, which will get called automatically by the browser.
In the script.js file, add the following code:
function plaintext(info,tab) {
  chrome.tabs.create({ 
    url: "http://localhost:5000/reader?url=" + info.linkUrl,
  });          
}
chrome.contextMenus.create({
  title: "View Plaintext",
  contexts:["link"],
  onclick: plaintext,
});

We start off by defining a function plaintext() which will create a new tab in the user's browser. This tab will redirect to localhost and add the URL that we receive.
The second part creates a context menu (which Chrome will automatically collapse into the existing right-click context menu for us) and adds a "View Plaintext" section. We use contexts to say that we only want this to appear if the user right-clicks on a link and we use onclick to specify that our plaintext() function should be called when the user selects this option.
Installing the Google Chrome extension
To actually publish this as a proper Google Chrome extension would involve going through a lengthy set of steps (and paying Google $5). However, it's easy enough to set Chrome to use Developer mode and to load unpacked extensions.
In the "omnibox" or address bar of Google Chrome, type . At the top of the page, tick the box that says "developer mode". Then choose "Load unpacked extension" and select your readerExtension directory from the file chooser that appears.
Now you've written a Google Chrome extension and installed it! To try it out, simply visit any web page (preferably an online news site, such as http://bbc.co.uk/news), right click on one of the articles, and click "View Plaintext", which will now appear in the context menu whenever you right click on a link.
All that's left to do is to enjoy online reading again. Note that your local Flask app has to be running in order for the extension to work, so you'll need to run python reader.py from your project directory before browsing the web.
Where next?
Instead of running the Flask application locally, you can run it permanently from a VPS. Digital Ocean will give you a basic VPS for $5 a month (and if you sign up with them using my referral link, I'll get some credit with them that I can use to keep messing around with stuff like this and writing about it). I'm not going to go into detail on how to deploy a Flask application to a server (although I do do so in my book Flask By Example). Another advantage of running the app remotely is that if you're on a mobile device and have a slow Internet connection, the server can download the large version of the page with all the attached JavaScript and CSS and serve you a much smaller version that still contains the important parts (i.e. the text that you want to read).
That's it for this post. Happy building! You can find all the code presented in this post on GitHub at https://github.com/sixhobbits/reader.

1 comment:

  1. This comment has been removed by a blog administrator.

    ReplyDelete

About Me

My photo

I'm far away from home in this country called "Europe". I'm studying towards a Master's in Computational Linguistics (I think - this might help: https://xkcd.com/114/). I write about web applications and Python and other things that you may find interesting (considering you got this far).