CNN
ML + Data Pipeline — December 2025

CNN SCRAPER

Number: // 003
Date: December 2025
Type: Machine Learning · Data
Stack: Python · TF-IDF · MATLAB
Accuracy: 79%
// 01 — Overview

Developed a Python script to scrape CNN headlines and identify emerging news trends using natural language processing and machine learning.

Headlines were converted to TF-IDF features and classified with logistic regression, achieving 79% accuracy. Statistical analysis was performed in MATLAB to validate the results.

// 02 — Media
CNN Scraper output
Pipeline output showing headline classification results — 79% accuracy
// 03 — Technical Details
TF-IDF Machine Learning Model

Used Scikit-Learn in Python to build a text classifier: each headline is turned into a TF-IDF feature vector, and logistic regression assigns it to a news category.
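
A minimal sketch of how such a classifier could be assembled, not the project's actual code. The toy headlines, the two category names, the `build_classifier` helper and the vectoriser settings (`stop_words`, `ngram_range`) are all assumptions made for illustration; the real training data and categories are not shown in this write-up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the scraped headlines and their labels.
headlines = [
    "Senate passes budget bill after late vote",
    "President signs new election law",
    "Governor announces campaign for senate seat",
    "Striker scores twice in cup final win",
    "Team clinches championship after overtime win",
    "Coach praises team after league final",
]
labels = ["politics", "politics", "politics", "sports", "sports", "sports"]


def build_classifier():
    """Chain TF-IDF vectorisation and logistic regression into one estimator."""
    return make_pipeline(
        # Assumed settings: lowercase text, drop English stop words,
        # use single words and word pairs as features.
        TfidfVectorizer(lowercase=True, stop_words="english", ngram_range=(1, 2)),
        LogisticRegression(max_iter=1000),
    )


model = build_classifier()
model.fit(headlines, labels)

# Classify an unseen headline and inspect the per-category probabilities.
new_headline = ["Senate vote on election bill"]
predicted = model.predict(new_headline)
probabilities = model.predict_proba(new_headline)
print(predicted[0], dict(zip(model.classes_, probabilities[0].round(2))))
```

Wrapping both steps in one pipeline means the vectoriser is fitted only on the training headlines, so the same vocabulary and weights are reused whenever a new headline is classified.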

Efficient Web Scraping

Used BeautifulSoup to parse the main CNN news page and extract every headline on it.
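
The extraction step might look roughly like the sketch below. It parses an inline HTML sample rather than the live site so that it runs offline; the `headline-text` class, the `SAMPLE_HTML` string and the `extract_headlines` helper are hypothetical, since the selectors the project actually used are not given here and CNN's markup changes over time.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded page.
SAMPLE_HTML = """
<html><body>
  <span class="headline-text">Markets rally after rate decision</span>
  <span class="headline-text">  Storm warning issued for coast </span>
  <span class="headline-text">Markets rally after rate decision</span>
  <span class="other">Not a headline</span>
</body></html>
"""


def extract_headlines(html, selector="span.headline-text"):
    """Return the unique, whitespace-trimmed headline texts in page order."""
    soup = BeautifulSoup(html, "html.parser")
    seen = set()
    headlines = []
    for node in soup.select(selector):
        text = node.get_text(" ", strip=True)
        # Skip empty nodes and headlines that appear more than once on the page.
        if text and text not in seen:
            seen.add(text)
            headlines.append(text)
    return headlines


headlines = extract_headlines(SAMPLE_HTML)
print(headlines)
```

In a live run the HTML string would come from an HTTP request to the page instead of the sample, and the selector would need to match whatever markup the site serves at that time.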

Statistical Analysis & Visualisation

Passed the output of the Python script to MATLAB, which compiled the results and produced the statistical analysis and visualisations.

// 04 — Challenges & Learnings
Maintaining High Accuracy

Model accuracy kept fluctuating, which made a consistently high figure hard to maintain. Enriching the training dataset resolved this, and reinforced the lesson that no ML model will ever be perfectly accurate.
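
One common way to quantify this kind of fluctuation is k-fold cross-validation, which reports accuracy on several different train/test splits instead of a single one. The sketch below is an illustration under assumed toy data and an assumed 3-fold split, not a description of how the project measured its 79% figure.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical labelled headlines; the real dataset is not shown in this write-up.
headlines = [
    "Senate passes budget bill after late vote",
    "President signs new election law",
    "Governor announces campaign for senate seat",
    "Parliament debates tax reform bill",
    "Minister resigns after election defeat",
    "Court blocks new voting law",
    "Striker scores twice in cup final win",
    "Team clinches championship after overtime win",
    "Coach praises team after league final",
    "Champion retains title in final match",
    "Club signs striker before cup match",
    "Runner wins marathon title in record time",
]
labels = ["politics"] * 6 + ["sports"] * 6

pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))

# Stratified folds keep the category balance the same in every split;
# a fixed random_state makes the splits repeatable.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, headlines, labels, cv=cv, scoring="accuracy")

print("per-fold accuracy:", scores)
print("mean: %.2f, std: %.2f" % (scores.mean(), scores.std()))
```

A large standard deviation across folds suggests the model is sensitive to which headlines it was trained on, which is the situation where adding more training data tends to help.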

Python–MATLAB Integration

Communication between the two platforms proved tedious. After working through numerous guides, the solution was to write a .mat file from Python and load it in MATLAB.
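
A minimal sketch of the Python side of that hand-off, assuming SciPy's `scipy.io.savemat` is used to write the file. The variable names, the sample values and the `headline_results.mat` filename are hypothetical.

```python
import os
import tempfile

import numpy as np
from scipy.io import loadmat, savemat

# Hypothetical classification results to hand over to MATLAB.
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])
results = {
    "y_true": y_true,
    "y_pred": y_pred,
    "accuracy": float(np.mean(y_true == y_pred)),
}

# Each dictionary key becomes a MATLAB variable of the same name.
out_path = os.path.join(tempfile.gettempdir(), "headline_results.mat")
savemat(out_path, results)

# Read the file back in Python to confirm what MATLAB will see.
# Note that 1-D arrays are stored as 1xN row vectors and scalars as 1x1 arrays.
roundtrip = loadmat(out_path)
print(roundtrip["y_pred"].shape, roundtrip["accuracy"].shape)
```

On the MATLAB side, calling `load` on the same file should bring `y_true`, `y_pred` and `accuracy` into the workspace as ordinary variables, ready for analysis and plotting.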

// 05 — Tech Stack
Python · BeautifulSoup · TF-IDF · Logistic Regression · MATLAB · NLP · Scikit-Learn