
Is climate change a threat to our future?

Data

Module 1

To gather data, I used three kinds of sources: PRAW (the Python Reddit API Wrapper) to scrape data from Reddit, NewsAPI to collect articles from news outlets, and BeautifulSoup for detailed article content extraction.
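
As a quick orientation, here is a minimal sketch of how the three tools can be set up. The credential values and the user agent string are placeholders, not the ones used in the project:

# Minimal setup sketch for the three data-collection tools (placeholder credentials)
import requests
import praw
from bs4 import BeautifulSoup

# PRAW client for Reddit (read-only access is enough for scraping public posts)
reddit = praw.Reddit(
    client_id='YOUR_CLIENT_ID',
    client_secret='YOUR_CLIENT_SECRET',
    user_agent='climate-change-project by u/your_username',
)

# NewsAPI is a plain REST endpoint, so requests is all that's needed
NEWSAPI_URL = 'https://newsapi.org/v2/everything'

# BeautifulSoup parses the HTML of article pages fetched with requests
html = requests.get('https://www.bbc.com/news', timeout=10).text
soup = BeautifulSoup(html, 'html.parser')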

News API Data Collection

The News API provided a structured way to gather climate-related news articles. Here's how I approached it:

Data Processing Pipeline

The data collection and processing pipeline involved several steps to ensure data quality and usability:

1. Initial Data Collection

The raw data from multiple sources required different handling approaches:

NewsAPI Call:
import requests

# n_creds holds my NewsAPI credentials, loaded separately and kept out of version control
topic = 'air pollution'
URLPost = {
    'apiKey': n_creds['api_key'],
    'source': 'bbc-news',     # the endpoint's parameter is actually 'sources' (plural), so this key has no effect
    'pageSize': 100,          # maximum articles returned per request
    'totalRequests': 100,     # not a NewsAPI parameter; also has no effect
    'q': topic}               # search term -- reusing the topic variable defined above
req = requests.get('https://newsapi.org/v2/everything', params=URLPost)
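
The response body then just needs to be decoded and saved. A minimal sketch (the output filename is illustrative, not the one used in the project):

import json

data = req.json()                             # decode the response body into a Python dict
print(data['status'], data['totalResults'])   # quick sanity check on the request

# persist the raw articles so cleaning can be re-run without hitting the API again
with open('newsapi_air_pollution_raw.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, indent=2)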

After this call, I get a JSON-structured dataset in the following format:

Before Processing:

Raw JSON data included mixed content types, null values, and nested structures like:

{
    "status": "ok",
    "totalResults": 14185,
    "articles": [
        {
            "source": {
                "id": null,
                "name": "Yahoo Entertainment"
            },
            "author": "Anna Washenko",
            "title": "Climate change increased the odds of Los Angeles' devastating fires...",
            "description": "As Los Angeles reels from the loss of lives and homes to the...",
            "url": "https://consent.yahoo.com/v2/...d7326db44fc0",
            "urlToImage": null,
            "publishedAt": "2025-01-29T21:16:52Z",
            "content": "If you click 'Accept all', we and our partners, ...[+703 chars]"
        },
        ...
    ]
}
After Processing:

Cleaned and vectorized data with standardized format:

                        ability  abuses  ac  academic  accelerating  according  \
Android_Authority            0       0   8         0             0          0   
The_Times_of_India           0       0   0         0             0          0   
Nist                         0       0   0         0             0          0   
Vox                          0       0   0         0             0          0   
Naturalnews                  0       0   0         0             0          0   
            

This is a snippet of the vectorized corpus built from the JSON data above. If you want to view all the data and code for the project, check out my GitHub: Module 1 Data.

The processing pipeline follows these steps:

  1. Data Collection: JSON format data gathered from news sources
  2. Text Extraction: Programmatically converted into a corpus of .txt files, organized by news source (sketched just after this list)
  3. Preprocessing:
    • Removal of HTML tags and special characters
    • Tokenization of text into individual words
    • Removal of stop words and rare terms
    • Lemmatization to reduce words to their base form
  4. Vectorization: Implementation of both Count Vectorizer and TF-IDF Vectorizer to create sparse matrices
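
To make the text-extraction step concrete, here is a minimal sketch that turns the saved JSON from the earlier snippet into a per-source corpus of .txt files. The file and directory names are illustrative:

import json
import os

with open('newsapi_air_pollution_raw.json', encoding='utf-8') as f:
    data = json.load(f)

os.makedirs('corpus', exist_ok=True)
for article in data['articles']:
    source = article['source']['name'].replace(' ', '_')   # e.g. "Yahoo Entertainment" -> "Yahoo_Entertainment"
    text = ' '.join(filter(None, [article.get('title'), article.get('description'), article.get('content')]))
    # one .txt file per news source; append so multiple articles from a source end up together
    with open(os.path.join('corpus', f'{source}.txt'), 'a', encoding='utf-8') as out:
        out.write(text + '\n')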

The TF-IDF (Term Frequency-Inverse Document Frequency) vectorizer was particularly crucial as it:

  • Normalizes word frequencies across documents
  • Assigns higher weights to terms that are unique to specific documents
  • Reduces the impact of common words that appear across many documents
  • Creates a more nuanced representation of document importance
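
As a minimal sketch of the vectorization step (the corpus directory follows the extraction sketch above; the actual notebooks on GitHub differ in the details):

import glob
import os
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

paths = sorted(glob.glob('corpus/*.txt'))
labels = [os.path.splitext(os.path.basename(p))[0] for p in paths]   # e.g. 'Yahoo_Entertainment'
docs = [open(p, encoding='utf-8').read() for p in paths]

# raw term counts, one row per news source
count_vec = CountVectorizer(stop_words='english')
count_df = pd.DataFrame(count_vec.fit_transform(docs).toarray(),
                        index=labels, columns=count_vec.get_feature_names_out())

# TF-IDF weights: down-weights terms shared by many sources, up-weights distinctive ones
tfidf_vec = TfidfVectorizer(stop_words='english')
tfidf_df = pd.DataFrame(tfidf_vec.fit_transform(docs).toarray(),
                        index=labels, columns=tfidf_vec.get_feature_names_out())

print(count_df.head())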

Reddit Data Collection

For Reddit data collection, I utilized PRAW (the Python Reddit API Wrapper); the helper functions I wrote around it are described below.

When using PRAW, I created a separate Python file, pythonHelperFunctions.py, to hold the majority of my Reddit helper functions, and imported it into my main script. There were two major functions I wrote to gather data from Reddit: one retrieves comments from a specific subreddit, which is useful for gathering labeled data from a single community; the other searches across all subreddits for a given search term, which is what I relied on for the majority of the project. Both are sketched below.

It was also important to account for posts with no text: empty returns would otherwise break the code when dumping the results to JSON.
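
Since the original helper file isn't reproduced here, the following is a minimal sketch of what the two functions in pythonHelperFunctions.py might look like. The function names, limits, and the empty-text placeholder are illustrative choices, not necessarily the originals:

# pythonHelperFunctions.py -- illustrative sketch, not the original file
import praw

def get_subreddit_comments(reddit: praw.Reddit, subreddit_name: str, limit: int = 100) -> list[dict]:
    """Pull comments from one subreddit -- useful for labeled data from a single community."""
    records = []
    for comment in reddit.subreddit(subreddit_name).comments(limit=limit):
        records.append({
            'subreddit': subreddit_name,
            'author': str(comment.author),
            'body': comment.body or '[no text]',   # guard against empty text before the JSON dump
            'created_utc': comment.created_utc,
        })
    return records

def search_all_subreddits(reddit: praw.Reddit, query: str, limit: int = 100) -> list[dict]:
    """Search every subreddit for a term -- the approach used for most of the project."""
    records = []
    for submission in reddit.subreddit('all').search(query, limit=limit):
        records.append({
            'subreddit': submission.subreddit.display_name,
            'title': submission.title,
            'selftext': submission.selftext or '[no text]',   # link posts have empty selftext
            'score': submission.score,
            'url': submission.url,
        })
    return records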

The JSON data structures from both News API and Reddit were semi-structured, presenting unique challenges in data cleaning and standardization. Each source required specific handling:
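
For example, a minimal sketch of how records from the two sources could be mapped into a common shape. The field choices here are illustrative, not the project's exact schema:

# Illustrative normalization of the two semi-structured sources into one record shape
def normalize_news_article(article: dict) -> dict:
    return {
        'origin': 'newsapi',
        'source': (article.get('source') or {}).get('name') or 'unknown',
        'text': ' '.join(filter(None, [article.get('title'),
                                       article.get('description'),
                                       article.get('content')])),
        'published': article.get('publishedAt'),
    }

def normalize_reddit_post(post: dict) -> dict:
    return {
        'origin': 'reddit',
        'source': post.get('subreddit', 'unknown'),
        'text': ' '.join(filter(None, [post.get('title'), post.get('selftext')])),
        'published': post.get('created_utc'),
    }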

Data Quality Assurance and Challenges

To ensure data quality and replicability, I implemented several validation steps and addressed various challenges:

Data Validation Steps:
Technical Challenges Addressed:
Data Visualization and Analysis:

To better understand the collected data:

Web Scraped TFIDF:

After I gathered article URLs from NewsAPI, I programmatically scraped the content from those sites and then built a TF-IDF sparse matrix out of that newly created corpus.
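
A minimal sketch of that scraping step, assuming the article URLs have already been pulled out of the decoded NewsAPI response (selectors and error handling are simplified):

import requests
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import TfidfVectorizer

def scrape_article_text(url: str) -> str:
    """Fetch a page and keep only its paragraph text."""
    try:
        html = requests.get(url, timeout=10).text
    except requests.RequestException:
        return ''
    soup = BeautifulSoup(html, 'html.parser')
    return ' '.join(p.get_text(' ', strip=True) for p in soup.find_all('p'))

urls = [a['url'] for a in data['articles']]          # 'data' is the decoded NewsAPI response from earlier
scraped_docs = [t for t in (scrape_article_text(u) for u in urls) if t]

tfidf = TfidfVectorizer(stop_words='english')
web_tfidf_matrix = tfidf.fit_transform(scraped_docs)  # sparse TF-IDF matrix of the scraped corpus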

NewsAPI Data Deliverable

I ran two queries with the News API to get the data that I collected. These turned into two separate dataframes that were vectorized: one dataframe holds the vectorized data for the query "air pollution" and the other for the query "climate change".

Climate Change Data
Air Pollution Data
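
For reference, a minimal sketch of running the same pipeline for both queries; fetch_articles is a hypothetical helper wrapping the NewsAPI call shown earlier, not a function from the project:

import requests

def fetch_articles(query: str, api_key: str) -> list[dict]:
    """Run one NewsAPI query and return the raw article dicts (hypothetical helper)."""
    params = {'apiKey': api_key, 'q': query, 'pageSize': 100}
    resp = requests.get('https://newsapi.org/v2/everything', params=params)
    return resp.json().get('articles', [])

# one raw corpus per query; each is then cleaned and vectorized as described above
corpora = {q: fetch_articles(q, n_creds['api_key']) for q in ['air pollution', 'climate change']}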

Key Data Processing Function

This function is plugged right into the vectorizer objects to systematically clean and process the data (a sketch is shown below).

This was implemented through the vectorizer objects from scikit-learn, which clean the text before vectorizing the words. I needed stemming and lemmatization to occur, and found it convenient that a custom function can be passed right into the object.
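
The original function isn't reproduced here, so the following is a minimal sketch of the idea, assuming NLTK's stemmer and lemmatizer; the exact cleaning rules in the project may differ:

import re
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, ENGLISH_STOP_WORDS

nltk.download('wordnet', quiet=True)

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def clean_and_tokenize(text: str) -> list[str]:
    """Strip markup and non-letters, lowercase, drop stop words, then lemmatize and stem."""
    text = re.sub(r'<[^>]+>', ' ', text)        # drop HTML tags
    text = re.sub(r'[^a-zA-Z\s]', ' ', text)    # keep letters only
    tokens = text.lower().split()
    return [stemmer.stem(lemmatizer.lemmatize(tok))
            for tok in tokens
            if tok not in ENGLISH_STOP_WORDS and len(tok) > 2]

# the custom function is passed straight into the vectorizer objects
count_vec = CountVectorizer(tokenizer=clean_and_tokenize)
tfidf_vec = TfidfVectorizer(tokenizer=clean_and_tokenize)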

Next Steps

With the data collection and processing pipeline in place, stay tuned for the next steps -> topics include: Exploratory Data Analysis, Sentiment Analysis, and even experimenting with some basic ML models to contextualize this data. We've collected the data; now we have to synthesize it.