Is climate change a threat to our future?
Data
Module 1
To gather data, I used three tools: PRAW (the Python Reddit API Wrapper) to scrape data from Reddit, NewsAPI to pull articles from news sources, and BeautifulSoup to extract detailed article content.
News API Data Collection
The News API provided a structured way to gather climate-related news articles. Here's how I approached it:
- Used endpoints like https://newsapi.org/v2/everything?q=bitcoin with an appropriate API key
- Collected raw JSON data including article metadata, URLs, and preview content
- Implemented BeautifulSoup to extract full article content from the provided URLs
Data Processing Pipeline
The data collection and processing pipeline involved several sophisticated steps to ensure data quality and usability:
1. Initial Data Collection
The raw data from multiple sources required different handling approaches:
NewsAPI Call:
```python
import requests

topic = 'air pollution'
URLPost = {'apiKey': n_creds['api_key'],   # n_creds holds my NewsAPI credentials
           'source': 'bbc-news',
           'pageSize': 100,
           'totalRequests': 100,
           'q': topic}
req = requests.get('https://newsapi.org/v2/everything', params=URLPost)
```
After this, I get a JSON-structured dataset in the following format:
Before Processing:
Raw JSON data included mixed content types, null values, and nested structures like:
{ "status": "ok", "totalResults": 14185, "articles": [ { "source": { "id": null, "name": "Yahoo Entertainment" }, "author": "Anna Washenko", "title": "Climate change increased the odds of Los Angeles' devastating fires...", "description": "As Los Angeles reels from the loss of lives and homes to the...", "url": "https://consent.yahoo.com/v2/...d7326db44fc0", "urlToImage": null, "publishedAt": "2025-01-29T21:16:52Z", "content": "If you click 'Accept all', we and our partners, ...[+703 chars]" },
After Processing:
Cleaned and vectorized data with standardized format:
```
                    ability  abuses  ac  academic  accelerating  according
Android_Authority         0       0   8         0             0          0
The_Times_of_India        0       0   0         0             0          0
Nist                      0       0   0         0             0          0
Vox                       0       0   0         0             0          0
Naturalnews               0       0   0         0             0          0
```
This is a snippet of the vectorized corpus built from the JSON data shown above. If you want to view all the data and code for the project, check out my GitHub: Module 1 Data. The processing pipeline follows these steps:
- Data Collection: JSON format data gathered from news sources
- Text Extraction: Programmatically converted into a corpus of .txt files, organized by news source
- Preprocessing:
- Removal of HTML tags and special characters
- Tokenization of text into individual words
- Removal of stop words and rare terms
- Lemmatization to reduce words to their base form
- Vectorization: Implementation of both Count Vectorizer and TF-IDF Vectorizer to create sparse matrices
The TF-IDF (Term Frequency-Inverse Document Frequency) vectorizer was particularly crucial as it:
- Normalizes word frequencies across documents
- Assigns higher weights to terms that are unique to specific documents
- Reduces the impact of common words that appear across many documents
- Creates a more nuanced representation of document importance
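As a rough sketch of what this vectorization step can look like (the corpus directory, file layout, and parameters below are illustrative placeholders rather than my exact setup):

```python
import glob
import os
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Read the per-source .txt corpus built in the Text Extraction step
# ("corpus/" is a placeholder directory name).
paths = sorted(glob.glob('corpus/*.txt'))
labels = [os.path.splitext(os.path.basename(p))[0] for p in paths]
docs = [open(p, encoding='utf-8').read() for p in paths]

# Count Vectorizer: raw term frequencies per news source
count_vec = CountVectorizer(stop_words='english', min_df=2)
count_matrix = count_vec.fit_transform(docs)
count_df = pd.DataFrame(count_matrix.toarray(), index=labels,
                        columns=count_vec.get_feature_names_out())

# TF-IDF Vectorizer: frequencies reweighted by how rare a term is across documents
tfidf_vec = TfidfVectorizer(stop_words='english', min_df=2)
tfidf_matrix = tfidf_vec.fit_transform(docs)
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), index=labels,
                        columns=tfidf_vec.get_feature_names_out())

print(count_df.head())
print(tfidf_df.head())
```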
Reddit Data Collection
For Reddit data collection, I utilized PRAW (Python Reddit API Wrapper). An example of the data structure is as follows:

When using PRAW, I created a separate Python file, pythonHelperFunctions.py, to hold the majority of my Reddit helper functions; this file was imported into my main Python script. There were two major functions I wrote to gather data from Reddit: one to get comments from a specific subreddit, useful for gathering labeled data solely from that subreddit, and another to search across all subreddits for a specific search term. The latter is what I used for the majority of the project later on.
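The sketch below shows roughly what those two helpers could look like; the function names, fields, and credential handling are illustrative stand-ins rather than the exact contents of pythonHelperFunctions.py:

```python
import praw

def make_reddit_client():
    # Credentials come from a Reddit app registration; these values are placeholders.
    return praw.Reddit(client_id='YOUR_CLIENT_ID',
                       client_secret='YOUR_CLIENT_SECRET',
                       user_agent='climate-data-collector')

def get_subreddit_comments(reddit, subreddit_name, limit=100):
    """Grab recent comments from one subreddit -- useful for labeled data."""
    comments = []
    for comment in reddit.subreddit(subreddit_name).comments(limit=limit):
        comments.append({'subreddit': subreddit_name,
                         'author': str(comment.author),
                         'body': comment.body,
                         'created_utc': comment.created_utc})
    return comments

def search_all_subreddits(reddit, query, limit=100):
    """Search across all of Reddit for a term and collect the post text."""
    posts = []
    for submission in reddit.subreddit('all').search(query, limit=limit):
        posts.append({'subreddit': str(submission.subreddit),
                      'title': submission.title,
                      'selftext': submission.selftext,
                      'created_utc': submission.created_utc})
    return posts
```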

and

It was important to account for missing text, because empty returns would break my code when I tried to dump the non-existent data into a JSON file.
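The guard itself can be as simple as the following (a sketch, not the exact check from my script):

```python
import json

def save_posts(posts, path):
    # Skip entries whose text came back empty or deleted so json.dump
    # never receives records with no usable content.
    cleaned = [p for p in posts
               if p.get('selftext') and p['selftext'] not in ('[deleted]', '[removed]')]
    if not cleaned:
        return  # nothing worth writing for this query
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(cleaned, f, indent=2)
```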
The JSON data structures from both News API and Reddit were semi-structured, presenting unique challenges in data cleaning and standardization. Each source required specific handling:
- News API data needed URL validation and content extraction
- Reddit data required handling of deleted comments and varying content lengths
- Both sources needed consistent date formatting and text normalization
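For example, NewsAPI reports timestamps as ISO 8601 strings while Reddit uses Unix epochs, so a small normalization layer along these lines brings both into one shape (a sketch; the function names are hypothetical):

```python
import re
import unicodedata
from datetime import datetime, timezone

def normalize_date(value):
    """Convert either an ISO 8601 string (NewsAPI) or a Unix timestamp (Reddit)
    to a single YYYY-MM-DD format."""
    if isinstance(value, (int, float)):          # Reddit created_utc
        dt = datetime.fromtimestamp(value, tz=timezone.utc)
    else:                                        # NewsAPI publishedAt, e.g. 2025-01-29T21:16:52Z
        dt = datetime.fromisoformat(value.replace('Z', '+00:00'))
    return dt.strftime('%Y-%m-%d')

def normalize_text(text):
    """Basic text normalization shared by both sources."""
    text = unicodedata.normalize('NFKC', text or '')
    text = re.sub(r'\s+', ' ', text)             # collapse whitespace
    return text.strip().lower()
```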
Data Quality Assurance and Challenges
To ensure data quality and replicability, I implemented several validation steps and addressed various challenges:
Data Validation Steps:
- Automated error handling for API rate limits and connection issues
- Data completeness checks before JSON storage
- Consistent text encoding and special character handling
- Detailed logging of data collection and processing steps
- Validation of data for integrity and consistency
- Filtering of non-Latin alphabet characters
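The non-Latin filtering, for instance, reduces to a single regular expression; this is a simplified sketch of the idea rather than my exact pattern:

```python
import re

# Keep only characters from the basic Latin range plus common punctuation;
# everything else (emoji, non-Latin scripts) is dropped.
LATIN_ONLY = re.compile(r"[^A-Za-z0-9\s.,;:!?'\"()-]")

def filter_non_latin(text):
    return LATIN_ONLY.sub('', text)
```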
Technical Challenges Addressed:
- API Rate Limiting: Implemented exponential backoff strategy for API requests
- Truncated Content: NewsAPI cuts article bodies off with a "...[+N chars]" marker, so much of the API response did not include the full story text for a query; this is what motivated scraping the full articles from their URLs
- Data Consistency: Developed robust error handling for missing or malformed data
- Text Encoding: Standardized UTF-8 encoding across all sources
- Memory Management: Implemented batch processing for large datasets
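The backoff strategy looked roughly like this (a sketch built on the requests library; the retry count and delays are illustrative):

```python
import time
import requests

def get_with_backoff(url, params, max_retries=5):
    """Retry a GET request with exponentially growing delays on rate limits or errors."""
    delay = 1
    for attempt in range(max_retries):
        try:
            resp = requests.get(url, params=params, timeout=30)
            if resp.status_code == 429:          # rate limited
                time.sleep(delay)
                delay *= 2
                continue
            resp.raise_for_status()
            return resp.json()
        except requests.exceptions.RequestException:
            time.sleep(delay)
            delay *= 2
    raise RuntimeError(f'Giving up on {url} after {max_retries} attempts')
```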
Data Visualization and Analysis:
To better understand the collected data:
- Word Clouds: Visual representation of term frequency across sources
- Dataframes: Data exploration and analysis
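Generating a word cloud for one source takes only a few lines; this is a minimal sketch assuming the wordcloud and matplotlib packages:

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

def plot_word_cloud(text, title):
    # WordCloud computes term frequencies internally from the raw text.
    cloud = WordCloud(width=800, height=400, background_color='white').generate(text)
    plt.imshow(cloud, interpolation='bilinear')
    plt.axis('off')
    plt.title(title)
    plt.show()
```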

Web Scraped TFIDF:
After I gathered URLs from the NewsAPI, I programmatically scraped content from those sites, then built a TF-IDF sparse matrix out of the newly created corpus.
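A condensed sketch of that scrape-then-vectorize step (the helper names and the paragraph-tag heuristic are assumptions, and real scraping needs per-site error handling):

```python
import requests
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import TfidfVectorizer

def scrape_article(url):
    """Pull the visible paragraph text from an article URL."""
    resp = requests.get(url, timeout=30)
    soup = BeautifulSoup(resp.text, 'html.parser')
    return ' '.join(p.get_text(' ', strip=True) for p in soup.find_all('p'))

def build_web_tfidf(urls):
    # urls would come from the NewsAPI results gathered earlier
    corpus = [scrape_article(u) for u in urls]
    vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
    matrix = vectorizer.fit_transform(corpus)    # sparse TF-IDF matrix
    return matrix, vectorizer.get_feature_names_out()
```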

NewsAPI Data Deliverable
I ran two queries with the News API to get the data I collected. This turned into two separate dataframes that were vectorized: one dataframe holds vectorized data from the query "air pollution" and the other from the query "climate change".
Climate Change Data

Air Pollution Data

Key Data Processing Function
This function is plugged right into the vectorizer objects to systematically clean and process the data!

This was passed into the vectorizer objects from scikit-learn to process the text before it vectorized the words. I needed stemming and lemmatization to occur and found it convenient that they can be implemented right into the object.
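A minimal sketch of that hook, assuming NLTK for the stemming and lemmatization: scikit-learn's CountVectorizer and TfidfVectorizer accept a custom tokenizer callable, which is the mechanism shown here (my actual cleaning function differs in its details):

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download('punkt', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('stopwords', quiet=True)

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()
STOPWORDS = set(stopwords.words('english'))

def clean_and_tokenize(text):
    """Lowercase, tokenize, drop non-alphabetic tokens and stop words,
    then lemmatize and stem each remaining token."""
    tokens = nltk.word_tokenize(text.lower())
    tokens = [t for t in tokens if t.isalpha() and t not in STOPWORDS]
    return [stemmer.stem(lemmatizer.lemmatize(t)) for t in tokens]

# The callable is handed to the vectorizer, which applies it to every document;
# token_pattern=None silences the warning about the unused default pattern.
vectorizer = TfidfVectorizer(tokenizer=clean_and_tokenize, token_pattern=None)
```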
Next Steps
With the data collection and processing pipeline in place, stay tuned for the next steps -> topics include: Exploratory Data Analysis, Sentiment Analysis, and even experimenting with some basic ML models to contextualize this data. We've collected the data; now we have to synthesize it.