A Python package for article scraping and curation
The newspaper package for Python makes it easy to download article content. It also has built-in functionality for performing some natural language processing on the downloaded content.
As a quick intro to this package, let's see what we can do with this article from Fox Sports:
Once the newspaper package is installed on your machine, we can follow along with the package's introductory documentation.
The following code snippet will extract some basic data about the Fox Sports article:
from newspaper import Article

url = 'https://www.foxsports.com.au/nrl/nrl-premiership/teams/sharks/james-maloneys-shock-confession-over-relationship-with-shane-flanagan/news-story/7134a366ebb93358cb23a2291ff78409'
article = Article(url)
article.download()
article.parse()

# see who the authors are
article.authors

# when was the publish date
article.publish_date

# the article title
article.title

# the actual text of the article
article.text

# the url of the main image of the article
article.top_img
The authors it was able to extract:
Simon Brunsdon, Staff Writers, Source, Fox Sports
The publish date:
datetime.datetime(2018, 3, 14, 9, 16, tzinfo=tzutc())
The article title:
James Maloney Shane Flanagan, Penrith Panthers signing, Cronulla Sharks NRL
Text of the article:
“JAMES Maloney has made the shock admission he hasn’t spoken to Cronulla coach Shane Flanagan since being traded to Penrith during the off-season.\n\nMaloney was caught up in the biggest movement of a crazy player market circus which dominated headlines during October and November.\n\nIt involved a big-money swap deal that saw him punted to the Panthers and Matt Moylan land in the Shire.\n\nLIVE stream every 2018 NRL Telstra Premiership game on FOX SPORTS. Get your free 2-week trial now >. If you’re overseas, you can still stream it LIVE on Watch NRL …
The URL of the main image in the article:
This is very useful information, and can form the basis of a content collection project.
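As a rough illustration of what such a content collection project might look like, the sketch below stores the extracted fields in a SQLite database keyed by URL. The table layout and the helper names (`fetch_article`, `open_store`, `save_article`) are assumptions made for this example and are not part of newspaper itself; only `Article`, `download()`, `parse()` and the attributes used earlier come from the package.

```python
import sqlite3


def fetch_article(url):
    """Download and parse one article; needs newspaper installed and network access."""
    from newspaper import Article  # lazy import: the storage code below works without it

    article = Article(url)
    article.download()
    article.parse()
    return article


def open_store(path=":memory:"):
    """Open (or create) a SQLite database with a hypothetical `articles` table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS articles (
            url       TEXT PRIMARY KEY,
            title     TEXT,
            authors   TEXT,
            published TEXT,
            top_image TEXT,
            body      TEXT
        )
        """
    )
    return conn


def save_article(conn, article):
    """Insert or update one parsed article, keyed by its URL."""
    published = article.publish_date
    row = (
        article.url,
        article.title,
        ", ".join(article.authors),
        # publish_date can be empty when no date was found on the page
        published.isoformat() if published else None,
        article.top_img,
        article.text,
    )
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO articles VALUES (?, ?, ?, ?, ?, ?)", row
        )
```

With these helpers, `save_article(open_store('articles.db'), fetch_article(url))` would add the Fox Sports article to a local file. Using the URL as the primary key means that saving the same article twice updates the existing row instead of creating a duplicate.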
Newspaper also has some natural language processing capabilities, which can be called as follows:
import nltk
# nltk.download('punkt')  # this may be necessary if it is your first time running this

article.nlp()

# Extract keywords
article.keywords

# Summary of the article
article.summary
Article keywords extracted:
shane, signing, nrl, watch, swap, penrith, flanagan, world, james…
Summary of the article:
‘JAMES Maloney has made the shock admission he hasn’t spoken to Cronulla coach Shane Flanagan since being traded to Penrith during the off-season.\nMaloney was caught up in the biggest movement of a crazy player market circus which dominated headlines during October and November.\nIt involved a big-money swap deal that saw him punted to the Panthers and Matt Moylan land in the Shire.\n“It was a bit out of the blue and it just happened, so that’s how it went.”Sharks coach Shane Flanagan and James Maloney celebrate the 2016 premiership.\nI’m comfortable with the decision and I’m sure Cronulla is happy with the decision.”Maloney admits the sudden swap deal surprised him.’
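The commented-out `nltk.download('punkt')` call above only needs to run when the tokenizer data is missing. A minimal sketch of one way to automate that check follows. The helper names `ensure_punkt` and `keywords_and_summary` are hypothetical, and the sketch assumes that `punkt` is the resource newspaper needs, as the comment in the snippet suggests; other NLTK versions may require a different resource.

```python
def ensure_punkt():
    """Download the NLTK 'punkt' data only if it is not already present.

    Returns True when a download was triggered, False otherwise.
    """
    import nltk  # imported lazily so the helper can be defined without NLTK installed

    try:
        nltk.data.find("tokenizers/punkt")
    except LookupError:
        # Assumption: 'punkt' is the resource name required by article.nlp()
        nltk.download("punkt")
        return True
    return False


def keywords_and_summary(article):
    """Run newspaper's NLP step on an already parsed article and return its results."""
    ensure_punkt()
    article.nlp()
    return list(article.keywords), article.summary
```

Calling `keywords_and_summary(article)` on the parsed article from earlier would return the same keywords and summary shown above, without having to remember whether the tokenizer data has been downloaded yet.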
So with fewer than 20 lines of code it is possible to download an article, extract key pieces of information about it, and extract some meaning from it as well.