The Internet seems to become less pleasant by the day for those of us who are here
primarily to read. Every now and again (i.e. dozens of times per day), I see a
URL that points to an article which looks like it might contain some
interesting information. I click on the URL hoping to get a nice big piece of
text for me to digest, but instead I'm presented with auto-play videos, a
JavaScript overlay asking me to subscribe to a newsletter, another JavaScript
overlay asking me to use the site's app (obligatory XKCD: App), another JavaScript
overlay telling me not to use an adblocker and still another one which thanks
me for not using an adblocker after I've told my adblocker to block the
previous one... You get the picture.
Today I'm going to describe how you can greatly improve this experience, focusing
specifically on news articles from online media, by building a Reader
application and a browser extension. The application will transform web pages
from looking like the image on the right, to instead look like the one on the
left.
Our arsenal
To create our Reader app, we'll use Python and Flask. The browser extension we
create is for Google Chrome, although it should be pretty trivial to adapt for
Firefox. We'll be using the Newspaper library for article extraction,
and we'll write a little bit of HTML and CSS to display our final article as we
want to read it.
I assume that you know some basics, and that you have a working version of Python and
Pip installed on your system. I don't go into too much depth about how the
various components work, so if you have some previous knowledge of Python,
HTML, CSS, and JavaScript, you'll find everything below makes a lot more sense.
You should be able to piece everything together even without prior experience
though.
Setting up
Newspaper, the library we use for text extraction, is primarily a Python3 library. There
is a buggy fork for Python2, but I strongly recommend that you use Python3 to
take advantage of the maintained version. I therefore assume that your system
is set up in such a way that pip invokes pip3 and python points
to the python3 interpreter. Adapt the
following as necessary if this is not the case. I'm not going to show the extra
commands needed to create a virtualenv and install the packages in that. If you
feel strongly about this, feel free to adapt as you see fit.
First we need to install Flask and Newspaper. Run the following commands:
pip install Flask
pip install newspaper3k
For the latter, you may have some issues with the installation of the lxml library.
If so, searching the web for the exact error message will usually turn up a fix.
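Before moving on, it can help to confirm that both packages are visible to the interpreter you plan to use. The following is a minimal sketch, not part of the app itself; the helper name missing_modules is my own invention. Note that the pip package newspaper3k is imported under the name newspaper.

```python
# Sanity check: can this interpreter find the packages we just installed?
from importlib.util import find_spec

# These are import names, not pip names: newspaper3k is imported as "newspaper".
REQUIRED_MODULES = ["flask", "newspaper"]


def missing_modules(names):
    """Return the subset of `names` that cannot be found on the import path."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("Flask and Newspaper are both importable.")
```

If this reports a missing module even though pip said the install succeeded, pip and python are probably pointing at different Python installations.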
Writing the Python code
The core of our app will be a web server that receives a URL from the user, downloads
the content from that URL, extracts the text, reformats it, and returns it.
Create a directory for your project and create a file within this directory called
reader.py. Add the following code to this file:
from flask import Flask
from flask import request
from flask import render_template
from newspaper import Article

app = Flask(__name__)

@app.route("/reader")
def read():
    # The article to fetch is passed as ?url=... in the query string
    url = request.args.get("url")
    a = Article(url)
    a.download()
    a.parse()
    # Newspaper separates paragraphs with blank lines
    paragraphs = a.text.split("\n\n")
    return render_template("article.html", paragraphs=paragraphs, title=a.title)

if __name__ == '__main__':
    app.run(port=5000, debug=True)
The first few lines simply import the parts of Flask we'll be using and the Article
class from Newspaper, which is all we need to download the article from the URL
and perform text extraction on it.
The next line initialises our Flask app. We then see a single route, which will detect
traffic going to the "/reader" path, and call the function defined
directly below it.
Our actual read() function grabs the URL of the
desired article from the arguments of the current URL. It initialises an Article object,
downloads the content from the URL, does Newspaper's magic parsing on it (text
extraction is actually a lot more difficult than one might imagine), and splits
the resulting text into paragraphs. Finally, it returns an HTML template (which
we'll write in the next section), and passes in the paragraphs of the article
as well as the article's title as arguments. We pass in a list of paragraphs
instead of the whole text chunk as Newspaper gives us text delimited with newline
characters, which will be ignored in our HTML. We therefore will wrap each
paragraph in <p> tags in our template (see the next section).
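The paragraph splitting can be tried out on its own, without downloading anything. The sketch below is a slightly more defensive variant of the split described above; I'm assuming (without having checked every site) that extracted text may sometimes contain runs of more than two newlines or trailing whitespace, in which case a plain split on two newlines would yield empty paragraphs. The function name split_paragraphs is hypothetical.

```python
import re


def split_paragraphs(text):
    """Split extracted text on blank lines and drop empty chunks."""
    # A "blank line" here is a newline, optional whitespace, then another newline.
    chunks = re.split(r"\n\s*\n", text)
    return [chunk.strip() for chunk in chunks if chunk.strip()]


sample = "First paragraph.\n\nSecond paragraph.\n\n\n\nThird paragraph.\n"
print(split_paragraphs(sample))
```

If you prefer this behaviour, the result can be passed to the template in place of the plain split.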
The final part of the script starts up our web application, with debug mode turned on,
whenever the script is run directly (which is how we'll run it locally).
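One thing the read() function does not do is check that a url argument was actually supplied, or that it looks like a web address. A rough sketch of such a check, using only the standard library, might look like the following. The helper name is_fetchable_url and the decision to accept only http and https are my own assumptions, not something Flask or Newspaper require.

```python
from urllib.parse import urlparse


def is_fetchable_url(url):
    """Return True if `url` looks like an absolute http(s) URL."""
    if not url:
        # Covers both a missing argument (None) and an empty string.
        return False
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


print(is_fetchable_url("http://www.bbc.com/news/world-europe-37258156"))
print(is_fetchable_url("not a url"))
```

Inside read(), you could call this before creating the Article and respond with an error (for example via Flask's abort(400)) when it returns False.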
Writing the HTML
Now we need to create an HTML template which will form the skeleton of all news
articles read through our app. Create a new directory inside your project
directory called templates (this name will allow Flask to
find your templates, so don't change it). Create a new file inside this
directory called article.html. Your project should now have the
following structure:
reader
|-- templates
|   +-- article.html
+-- reader.py
In the article.html file,
add the following code:
<html>
  <head>
    <title>{{title}}</title>
    <style>
      body {
        font-family: "Helvetica";
        max-width: 900px;
        padding-left: 20px;
        padding-right: 20px;
        padding-top: 30px;
        margin: 0 auto;
        text-align: justify;
      }
    </style>
  </head>
  <body>
    <h1>{{title}}</h1>
    {% for paragraph in paragraphs %}
      <p>{{paragraph}}</p>
    {% endfor %}
  </body>
</html>
This is a Flask template (or more specifically a Jinja2 template). It has the normal
structure of an HTML document (starting and ending with <html>, <body>, and <head> tags).
We have a few lines of internal CSS which will make our article be displayed in
a decent font, create margins on the left and right of the article on screens
that are wider than 900px, add some padding so that the text doesn't try to creep
off the screen, put the text in the middle of the screen, and stretch out the
text (fully justify) to give nice vertical lines on the left and right (which
many people do not like, so feel free to remove the text-align: justify line
if you prefer ragged right).
The non-HTML parts of the above code are enclosed in either double braces {{ }} or
in the brace-percent combination {% %}. The former are simply placeholders
for the arguments that we pass in from our Python code (i.e. the paragraphs and
the article's title). The latter defines a control sequence -- in our case, a
simple for loop which will loop through
each of our paragraphs and add them to the page, opening and closing <p> tags
as required.
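If the template syntax feels opaque, it may help to see roughly what it does written out as ordinary Python. The sketch below is only an illustration using the standard library, not how Flask or Jinja2 work internally; render_article is a made-up name. It also escapes special characters, much as Flask's templates do by default for .html files, so that an ampersand or angle bracket in an article can't break the page.

```python
from html import escape


def render_article(title, paragraphs):
    """Build roughly the same body markup that the template produces."""
    parts = ["<h1>{}</h1>".format(escape(title))]  # mirrors <h1>{{title}}</h1>
    for paragraph in paragraphs:  # mirrors {% for paragraph in paragraphs %}
        parts.append("<p>{}</p>".format(escape(paragraph)))
    return "\n".join(parts)


print(render_article("Example title", ["One & two.", "Three."]))
```

The template version is shorter and keeps the markup in one place, which is why we use it in the app.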
That's our entire app. Let's test it.
Testing our web application
To see if our app works, navigate to your project directory in terminal or command
prompt and then run the reader.py script. To do this, run
commands similar to the following (depending on where your project directory is
located):
cd git/reader
python reader.py
You should see output similar to Running on http://127.0.0.1:5000/ (Press CTRL+C to quit). Now fire up your web browser and find the URL of a news article you'd like to read (e.g. this one about Mother Teresa: http://www.bbc.com/news/world-europe-37258156).
Navigate to http://localhost:5000/reader?url=http://www.bbc.com/news/world-europe-37258156 (substituting
the URL you chose above if it's different). If all went well, you'll see the
news article presented in a nice compact form, without any of the rubbish that
you would normally have inflicted upon you.
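Pasting the article URL straight after url= works for simple addresses like the one above, but an article URL that contains its own ? or & characters can get mangled, because those characters also separate query parameters. Here is a small sketch of building the reader URL safely in Python. The constant READER_ENDPOINT and the function reader_url are names I've made up, and the endpoint simply copies the localhost address used above.

```python
from urllib.parse import urlencode

# Assumed to match the address our Flask app is served on (see the URL above).
READER_ENDPOINT = "http://localhost:5000/reader"


def reader_url(article_url):
    """Return the reader URL for `article_url`, percent-encoding it as needed."""
    return READER_ENDPOINT + "?" + urlencode({"url": article_url})


print(reader_url("http://www.bbc.com/news/world-europe-37258156"))
```

Flask decodes the parameter again automatically, so the app receives the original article URL unchanged.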
Building a Google Chrome extension
Although our application is already usable, it's not very user-friendly. Each time you
want to read an article, you have to copy the URL to the clipboard and then
construct the long version as shown above. Instead of this, we want to be able
to right-click on any URL that we come across while browsing the web, and to
easily send that article to our app. To do this, we'll build a Google Chrome
extension. A basic Google Chrome extension consists of two parts: a manifest
file (JSON), which describes the extension and requests the necessary
permissions, and a JavaScript file, which is where the functionality of the
extension lives.
Create a new directory called readerExtension and inside this create a file
called manifest.json as well as one called script.js.
Inside manifest.json add
the following code:
{
  "manifest_version": 2,
  "name": "Plaintext Article Reader",
  "description": "Reformats online news to remove all the gunk",
  "version": "1.0",
  "permissions": [
    "contextMenus"
  ],
  "background": {
    "scripts": ["script.js"]
  }
}
The first few lines simply describe our extension. In the permissions section, we state that we need permission to fiddle with the user's context menus (i.e. the menu that appears when you right click), and in the background section, we point to the script.js script, which will get called automatically by the browser.
In the script.js file,
add the following code:
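function plaintext(info, tab) {
  chrome.tabs.create({
    // Encode the link so that any ? or & inside it stays part of the url parameter
    url: "http://localhost:5000/reader?url=" + encodeURIComponent(info.linkUrl),
  });
}

// Add a "View Plaintext" entry to the right-click menu for links
chrome.contextMenus.create({
  title: "View Plaintext",
  contexts: ["link"],
  onclick: plaintext,
});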
We start off by defining a function plaintext() which will create a new tab in
the user's browser. This tab opens our app on localhost, passing along the URL of
the link that was clicked (available as info.linkUrl) in the url parameter.
The second part creates a context menu item (which Chrome will automatically collapse
into the existing right-click context menu for us) labelled "View
Plaintext". We use contexts to say that we only want this to
appear if the user right-clicks on a link and we use onclick to
specify that our plaintext() function should be called when
the user selects this option.
Installing the Google Chrome extension
To actually publish this as a proper Google Chrome extension would involve going
through a lengthy set of steps (and paying Google $5). However, it's easy
enough to set Chrome to use Developer mode and to load unpacked extensions.
In Google Chrome, open the extensions page using the "omnibox" or address bar. At the top of
the page, tick the box that says "developer mode". Then choose
"Load unpacked extension" and select your readerExtension directory
from the file chooser that appears.
Now you've written a Google Chrome extension and installed it! To try it out, simply visit
any web page (preferably an online news site, such as http://bbc.co.uk/news),
right-click on a link to one of the articles, and click "View Plaintext", which
will now appear in the context menu whenever you right-click on a link.
All that's left to do is to enjoy online reading again. Note that your local Flask
app has to be running in order for the extension to work, so you'll need to run python reader.py from
your project directory before browsing the web.
Where next?
Instead of running the Flask application locally, you can run it permanently from a
VPS. Digital Ocean will give you a basic VPS for $5 a month (and if you sign up
with them using my referral link, I'll get some credit with them
that I can use to keep messing around with stuff like this and writing about
it). I'm not going to go into detail on how to deploy a Flask application to a
server (although I do do so in my book Flask By Example).
Another advantage of running the app remotely is that if you're on a mobile
device and have a slow Internet connection, the server can download the large
version of the page with all the attached JavaScript and CSS and serve you a
much smaller version that still contains the important parts (i.e. the text
that you want to read).
That's it for this post. Happy building! You can find all the code presented in this
post on GitHub at https://github.com/sixhobbits/reader.