Creating an ‘Article Meta Data’ Database – Phase 2: Saving Data Locally

Previously we looked at getting some interesting pieces of information from an article. This included data such as:

  1. Article Title
  2. Article Publish date
  3. Text of the article
  4. Keywords in the article
  5. and an NLP (Natural Language Processing) generated summary of the article

Now we are going to expand on that by getting the data for a number of articles and saving the information to disk locally.

A text file contains a list of URLs that I want to extract information from.

The following snippet reads the contents of the text file into a list and defines a function to extract information from each URL.

# Extract URLs from the text file

with open("crawl_list.txt", "r") as url_list_file:
    # strip trailing newlines and skip any blank lines
    crawlList = [line.strip() for line in url_list_file if line.strip()]
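For reference, crawl_list.txt is assumed to hold one URL per line; the addresses below are placeholders, not real entries from the project:

```text
https://example.com/news/first-article
https://example.com/news/second-article
https://example.com/news/third-article
```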

# helper function to extract information from a given article

from newspaper import Article

def ArticleDataExtractor(some_url):
    '''Pull out all key information from a given article URL.'''
    article = Article(some_url)
    article.download()  # fetch the raw HTML
    article.parse()     # extract title, authors, text, publish date
    article.nlp()       # generate keywords and summary
    output = {}
    output['url'] = some_url
    output['authors'] = article.authors
    output['pubDate'] = str(article.publish_date)
    output['title'] = article.title
    output['text'] = article.text
    # NLP-derived fields (require the article.nlp() call above)
    output['keywords'] = article.keywords
    output['summary'] = article.summary
    return output

The next question is how, and in what format, to save the extracted information for each article. For this example the JSON format seems most appropriate, since the extractor already returns a dictionary of named fields.
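As a quick sanity check of the format, a dictionary shaped like the extractor's output serializes cleanly to JSON and back. The record below is a made-up placeholder, not real extracted data:

```python
import json

# Hypothetical record with the same shape as the extractor's output
sample = {
    "url": "https://example.com/articles/some-article",
    "authors": ["Jane Doe"],
    "pubDate": "2019-01-01 00:00:00",
    "title": "Some Article",
    "keywords": ["example", "json"],
}

# Serialize to a JSON string, then parse it back
serialized = json.dumps(sample, indent=4)
restored = json.loads(serialized)
print(restored["title"])  # → Some Article
```

Because every value is a string, a list of strings, or nested combinations of those, no custom encoder is needed; the `str(article.publish_date)` conversion in the extractor exists precisely so the datetime is JSON-safe.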

# save files in JSON format

import json
import os

os.makedirs('data', exist_ok=True)  # create the output folder if it doesn't exist

for url in crawlList:
    articleID = url.rstrip('/').split('/')[-1]  # last path segment as a filename
    extractedData = ArticleDataExtractor(url)
    my_filename = 'data/' + str(articleID) + '.json'
    with open(my_filename, 'w') as fp:
        json.dump(extractedData, fp, indent=4)


After running the above snippet there is now a series of JSON files in a folder called data.


Inspecting one of the JSON files shows that all of the extracted information is present as expected.
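The saved files can also be inspected programmatically with json.load rather than by eye. This sketch writes an illustrative record to a temporary file and reads it back; the record and filename are hypothetical, but the round trip mirrors how one of the files in data/ could be checked:

```python
import json
import os
import tempfile

# Illustrative record standing in for one saved article
record = {"url": "https://example.com/a/my-article", "title": "My Article"}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "my-article.json")
    # Save the record the same way the main loop does
    with open(path, "w") as fp:
        json.dump(record, fp, indent=4)
    # Load it back and confirm nothing was lost
    with open(path) as fp:
        loaded = json.load(fp)

print(loaded["title"])  # → My Article
```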


We now have a utility for systematically processing a list of article URLs, extracting data from each one, and saving it to disk for us to access later.

A potential extension to this mini project would be to add some sort of crawler functionality to populate the list of articles. That way we could keep a comprehensive data set about all articles a particular outlet publishes. Alternatively some data analysis could be performed to find the most talked about topics, or even some machine learning to find which articles are most similar to each other through clustering.
