David LaPaglia - Sustainability Analysis Project

Project Overview

This research project applies text mining and machine learning techniques to analyze how climate change is discussed across different platforms and communities. By examining textual data from social media discussions, scientific publications, news articles, and policy documents, it identifies patterns in how climate discourse varies by source.

The analysis employs both supervised and unsupervised learning approaches to examine sentiment, topic distribution, rhetorical strategies, and lexical choices across different stakeholder communities. Through this computational approach, the project identifies significant variations in how climate issues are framed, which terms are emphasized, and how scientific consensus is represented or contested.

Key Research Questions

How does climate discourse differ between scientific and public forums? Examining the terminological and conceptual gaps between expert and lay discussions.
What are the dominant narratives and frames used to discuss climate change? Identifying the primary ways climate issues are contextualized and presented.
How has the language of climate change evolved over time? Tracking shifts in terminology, sentiment, and focus areas in climate discussions.
What linguistic patterns differentiate skeptical vs. accepting views on climate science? Analyzing lexical markers that indicate stance toward climate consensus.
How do regional and political differences manifest in climate discourse? Examining variations in climate communication across different geographic and ideological communities.

Research Methodology & Workflow

This project follows a systematic text mining methodology to analyze climate discourse across multiple sources:

Data Acquisition: Collection from multiple sources
Preprocessing: Cleaning & normalization
Vectorization: TF-IDF & embeddings
Analysis: ML algorithms & NLP
Visualization: Results interpretation
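
To make the vectorization step of the workflow above concrete, here is a minimal TF-IDF sketch in plain Python. It assumes documents are already tokenized, uses length-normalized term frequency and idf = log(N / df), and is not necessarily the variant used in the project; the sample documents and the function name `tf_idf` are illustrative only.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Uses term frequency normalized by document length and
    idf = log(N / df); library implementations often add smoothing.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

# Hypothetical toy corpus, not project data.
docs = [
    ["climate", "change", "policy"],
    ["climate", "science", "consensus"],
    ["carbon", "tax", "policy"],
]
weights = tf_idf(docs)
```

Terms that appear in fewer documents (such as "change") receive a higher weight than terms shared across documents (such as "climate"), which is the property that makes TF-IDF useful for distinguishing sources.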

Project Components

Project Foundations

Introduction & Background

This section establishes the theoretical framework for the project, including the significance of climate discourse analysis, previous research in the field, and the gap this project aims to fill. Key research questions and hypotheses are outlined in detail, providing context for the entire analysis.

Data Collection & Processing

This component provides an overview of data sources, collection methods, and preprocessing techniques. It details the APIs used (Reddit, News API), the web scraping methodology, and the data cleaning pipeline that transformed raw text into analysis-ready corpora, including tokenization, stopword removal, lemmatization, and feature engineering.
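
As a rough illustration of the cleaning steps described above, the following sketch lowercases text, strips URLs, tokenizes, and removes stopwords using only the standard library. The stopword list and the `preprocess` function are hypothetical stand-ins, and lemmatization is omitted because it typically relies on an external library such as NLTK or spaCy.

```python
import re

# Tiny illustrative stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"the", "is", "a", "an", "and", "of", "to", "in", "on", "are"}

def preprocess(text):
    """Lowercase, strip URLs, tokenize on letters, and drop stopwords."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)          # remove URLs
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text)   # words, keeping contractions
    return [tok for tok in tokens if tok not in STOPWORDS]

tokens = preprocess("The climate is changing: see https://example.org/report!")
print(tokens)  # ['climate', 'changing', 'see']
```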

Unsupervised Learning Methods

Text Clustering Analysis

Exploration of document clustering techniques including K-means, Hierarchical Clustering, and DBSCAN to identify natural groupings within climate discourse. This section presents detailed algorithm implementations, parameter tuning approaches, and visualization of cluster characteristics. Key findings reveal distinct discourse communities and their linguistic patterns.
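
The K-means technique mentioned above can be sketched in a few lines. This is a minimal version of Lloyd's algorithm under simplifying assumptions: Euclidean distance, dense vectors, and deterministic initialization from the first k points. The toy vectors are made up for illustration, and the project's actual implementation and parameters may differ.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(points):
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def k_means(points, k, max_iter=100):
    """Lloyd's algorithm with the first k points as initial centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # converged: assignments stopped changing
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = mean(members)
    return labels, centroids

# Hypothetical 2-D document vectors (for example, two TF-IDF dimensions).
points = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
labels, centroids = k_means(points, k=2)
```

In practice the initialization is usually randomized and repeated, since K-means can converge to different local optima depending on the starting centroids.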

Association Rule Mining

Application of association rule mining to uncover significant co-occurrence patterns in climate-related terminology. This component employs the Apriori algorithm to identify frequent itemsets and meaningful association rules, with metrics for support, confidence, and lift. Results highlight surprising term relationships and rhetorical patterns across different sources.
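
The three metrics named above can be computed directly from their definitions. The sketch below evaluates a single rule and does not implement Apriori's candidate generation and pruning; the transactions (sets of terms per document) and function names are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    supp_both = support(transactions, set(antecedent) | set(consequent))
    supp_a = support(transactions, antecedent)
    supp_c = support(transactions, consequent)
    confidence = supp_both / supp_a   # P(consequent | antecedent)
    lift = confidence / supp_c        # > 1 means positive association
    return supp_both, confidence, lift

# Hypothetical term sets, one per document.
transactions = [
    {"carbon", "tax", "policy"},
    {"carbon", "emissions"},
    {"carbon", "tax"},
    {"policy", "emissions"},
]
supp, conf, lift = rule_metrics(transactions, {"carbon"}, {"tax"})
```

A lift above 1 indicates that the two terms co-occur more often than they would if they were independent.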

Topic Modeling & LDA

Implementation of Latent Dirichlet Allocation for discovering latent topics in the climate discourse corpus. This section details the mathematical framework of LDA, optimal topic number determination, and interactive topic visualizations. The analysis reveals distinct topic distributions across different platforms and their evolution over time.
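
For reference, the standard LDA generative model can be written as follows. The notation is an assumption here, since the original does not define symbols: \alpha is the Dirichlet prior, \theta_d the topic mixture of document d, z_{d,n} the topic of its n-th word, w_{d,n} the word itself, and \beta the per-topic word distributions.

```latex
\begin{align*}
\theta_d &\sim \mathrm{Dirichlet}(\alpha) \\
z_{d,n} \mid \theta_d &\sim \mathrm{Multinomial}(\theta_d) \\
w_{d,n} \mid z_{d,n}, \beta &\sim \mathrm{Multinomial}(\beta_{z_{d,n}}) \\[4pt]
p(\theta_d, \mathbf{z}_d, \mathbf{w}_d \mid \alpha, \beta)
  &= p(\theta_d \mid \alpha) \prod_{n=1}^{N_d}
     p(z_{d,n} \mid \theta_d)\, p(w_{d,n} \mid z_{d,n}, \beta)
\end{align*}
```

Inference reverses this process: given only the observed words, it estimates the topic mixtures and topic-word distributions that most plausibly generated them.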

Supervised Learning Methods

Text Classification Overview

A comprehensive examination of supervised learning approaches for categorizing climate texts by stance, source, and content. This section discusses feature selection, cross-validation methodologies, and performance evaluation metrics. The comparative analysis reveals strengths and limitations of different classifiers for various text classification tasks.
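
The cross-validation and evaluation ideas above can be illustrated with a small sketch: a k-fold index generator and a precision/recall/F1 calculation for one class. The stance labels "accept" and "skeptic" and both function names are hypothetical, and the folds here are contiguous rather than shuffled or stratified.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size

def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F1 for the class named by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

folds = list(k_fold_indices(5, 2))
y_true = ["accept", "skeptic", "accept", "accept"]
y_pred = ["accept", "accept", "skeptic", "accept"]
scores = precision_recall_f1(y_true, y_pred, positive="accept")
```

Each sample appears in exactly one test fold, so averaging a metric over the folds gives an estimate based on predictions for the whole dataset.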

Naive Bayes Classifier

Detailed implementation of Multinomial and Bernoulli Naive Bayes classifiers for climate text categorization. This component explores probability distributions, feature independence assumptions, and performance optimizations. Results demonstrate Naive Bayes' effectiveness for certain climate stance classification tasks despite its simplicity.
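
To show how the independence assumption turns into a classifier, here is a minimal Multinomial Naive Bayes sketch with add-one (Laplace) smoothing. It is an illustration under stated assumptions, not the project's implementation: the class name, the toy documents, and the "accepting"/"skeptical" labels are all made up.

```python
import math
from collections import Counter, defaultdict

class MultinomialNB:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def log_posterior(self, tokens, label):
        """Unnormalized log P(label | tokens): log prior + summed log likelihoods."""
        total_docs = sum(self.class_counts.values())
        score = math.log(self.class_counts[label] / total_docs)
        total_words = sum(self.word_counts[label].values())
        for w in tokens:
            if w in self.vocab:  # ignore words never seen in training
                count = self.word_counts[label][w]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
        return score

    def predict(self, tokens):
        return max(self.class_counts, key=lambda c: self.log_posterior(tokens, c))

# Hypothetical training data.
docs = [
    ["warming", "evidence", "consensus"],
    ["emissions", "evidence", "data"],
    ["hoax", "alarmist", "scam"],
    ["alarmist", "exaggerated", "hoax"],
]
labels = ["accepting", "accepting", "skeptical", "skeptical"]
model = MultinomialNB().fit(docs, labels)
prediction = model.predict(["alarmist", "hoax", "data"])
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together.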

Decision Tree Analysis

Application of decision tree algorithms for interpretable classification of climate texts. This section examines tree construction, information gain metrics, pruning techniques, and ensemble methods like Random Forests. The analysis highlights the interpretability advantages and decision boundaries in climate discourse classification.
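
The information gain criterion mentioned above can be computed as the drop in Shannon entropy after a split. The sketch below assumes boolean "term present" features; the labels and feature vectors are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_present, labels):
    """Entropy reduction from splitting on a boolean feature."""
    left = [y for x, y in zip(feature_present, labels) if x]
    right = [y for x, y in zip(feature_present, labels) if not x]
    n = len(labels)
    # Weighted average entropy of the non-empty child nodes.
    weighted = sum(len(part) / n * entropy(part) for part in (left, right) if part)
    return entropy(labels) - weighted

# Hypothetical data: does a document contain the term?
labels = ["skeptical", "skeptical", "accepting", "accepting"]
contains_hoax = [True, True, False, False]
contains_climate = [True, False, True, False]
gain_hoax = information_gain(contains_hoax, labels)
gain_climate = information_gain(contains_climate, labels)
```

In this toy case the first feature separates the classes perfectly (gain of 1 bit) while the second carries no information (gain of 0), so a tree would split on the first.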

Support Vector Machines

Implementation of SVM models with various kernels for high-dimensional text classification. This component explores margin optimization, kernel functions, and hyperparameter tuning strategies. Results show SVM's superior performance for distinguishing subtle differences in climate framing and rhetoric.
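
As a small illustration of the quantities involved, the sketch below defines a linear kernel, an RBF kernel, and the soft-margin objective that a linear SVM minimizes. It evaluates the objective for given weights rather than training a model; `C` and `gamma` follow common naming for the regularization and kernel-width hyperparameters, and the data points are made up.

```python
import math

def linear_kernel(x, z):
    """Dot product of two vectors."""
    return sum(a * b for a, b in zip(x, z))

def rbf_kernel(x, z, gamma=1.0):
    """Gaussian (RBF) kernel: exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def svm_objective(w, b, X, y, C=1.0):
    """Soft-margin objective: 0.5 * ||w||^2 + C * sum of hinge losses."""
    margin_term = 0.5 * sum(wi * wi for wi in w)
    hinge = sum(
        max(0.0, 1.0 - yi * (linear_kernel(w, xi) + b))
        for xi, yi in zip(X, y)
    )
    return margin_term + C * hinge

# Hypothetical two-point dataset with labels in {+1, -1}.
X = [[2.0, 0.0], [0.0, 2.0]]
y = [1, -1]
objective = svm_objective([0.5, -0.5], 0.0, X, y)
```

A larger `C` penalizes margin violations more heavily, while a larger `gamma` makes the RBF kernel more local; both are typical targets for hyperparameter tuning.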

Advanced Neural Networks

Neural Networks for Text Classification

Exploration of neural architectures for climate text analysis.

Source Code & Data Repository

All implementation code, datasets, and analysis notebooks are available in my GitHub repository. The repository includes commented code, documentation, and reproducible examples for each component of the project.


Key Findings & Project Significance

Major Contributions

This research makes several important contributions to our understanding of climate change communication:

Discourse Community Identification

The analysis successfully identified and characterized distinct discourse communities within climate discussions, each with unique linguistic patterns, key terminology, and framing devices. These include scientific, policy-oriented, activist, and skeptical communities, each with identifiable lexical markers.

Narrative Structure Identification

Topic modeling and sequence analysis revealed recurring narrative structures in climate communication, identifying common story arcs employed by different stakeholders. These narrative templates significantly influence public perception and engagement with climate information.

Practical Implications

These findings have significant implications for climate communication strategies:

Bridging the Terminology Gap: The identified linguistic differences between expert and public discourse highlight the need for translational communication strategies that make scientific concepts accessible without sacrificing accuracy.
Targeted Message Design: Understanding the distinct characteristics of different discourse communities enables more effective tailoring of climate messaging to specific audiences.
Countering Misinformation: The lexical patterns associated with climate misinformation can help in developing automated detection systems and more effective counter-messaging strategies.
Narrative Engagement: The identified narrative structures provide templates for more engaging and persuasive climate communication that resonates with public audiences.

Read My Project Conclusions

Interested in the broader insights from this text mining journey? Check out my project conclusions for non-technical reflections on what I've learned.

Conclusions & Future Work

This text mining analysis of climate change discourse has demonstrated the power of computational approaches for understanding complex communication patterns across diverse platforms and communities. The combination of unsupervised and supervised machine learning techniques provided complementary insights into the structure, content, and evolution of climate discourse.

Future Research Directions

This project opens several promising avenues for future research:

Multimodal Analysis: Extending the analysis to include images, videos, and other media that accompany climate text
Real-time Monitoring: Developing systems for tracking climate discourse evolution in real-time
Cross-lingual Analysis: Expanding to multiple languages to compare climate discourse globally
Causal Influence Modeling: Investigating how discourse patterns influence public opinion and policy outcomes
Intervention Testing: Experimental validation of communication strategies informed by text mining insights
Transformer Models: Applying transformer-based language models to the classification and topic analysis tasks explored in this project

For more information, code, and detailed methodological documentation, please visit the project repository.
