Developed a Python script to scrape CNN headlines and identify emerging news trends using natural language processing and machine learning.
TF-IDF features paired with a logistic regression classifier were used for classification, achieving 79% accuracy. Statistical analysis was performed in MATLAB to validate results.
Leveraged Scikit-Learn in Python to build a pipeline of TF-IDF features and a logistic regression classifier, sorting news headlines into categories.
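A minimal sketch of how TF-IDF features and logistic regression can be chained in Scikit-Learn. The headlines, category names and the `max_iter` setting below are illustrative assumptions, not the project's actual training data or configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy training set; the real headlines and categories are not
# given in this write-up.
headlines = [
    "Senate passes new budget bill after long debate",
    "President signs executive order on trade",
    "Parliament votes on election reform",
    "Striker scores twice in cup final win",
    "Tennis champion retires after record season",
    "Team clinches league title on final day",
    "Stock markets rally as inflation cools",
    "Central bank holds interest rates steady",
    "Tech company reports record quarterly profit",
]
labels = [
    "politics", "politics", "politics",
    "sport", "sport", "sport",
    "business", "business", "business",
]

# TfidfVectorizer turns each headline into a weighted bag-of-words vector;
# LogisticRegression then learns one set of weights per category.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(headlines, labels)

# Classify an unseen headline.
predicted = model.predict(["Senate debates new trade bill"])
print(predicted[0])
```

Wrapping both steps in one pipeline means the vocabulary learned during training is reused automatically at prediction time.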
Utilised BeautifulSoup to scrape all headlines from the main CNN news website efficiently and reliably.
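A rough sketch of headline extraction with BeautifulSoup. The CSS selector `span.headline-text`, the sample markup and the helper names `fetch_page` and `extract_headlines` are hypothetical; the selector would need to match whatever markup the live CNN page uses at the time.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch_page(url):
    """Download a page and return its HTML as text (not called in this sketch)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_headlines(html, selector="span.headline-text"):
    """Return the unique headline strings matched by `selector`, in page order."""
    soup = BeautifulSoup(html, "html.parser")
    texts = [tag.get_text(strip=True) for tag in soup.select(selector)]
    # dict.fromkeys removes duplicates while preserving first-seen order;
    # the filter drops empty strings.
    return [text for text in dict.fromkeys(texts) if text]


# Stand-in markup so the example runs without network access.
sample_html = """
<div>
  <a href="/a"><span class="headline-text">First story</span></a>
  <a href="/b"><span class="headline-text"> Second story </span></a>
  <a href="/a"><span class="headline-text">First story</span></a>
</div>
"""

headlines = extract_headlines(sample_html)
print(headlines)
```

Against the live site, the same function would be fed the output of `fetch_page` with the page URL instead of the sample string.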
Passed the data from the Python script to MATLAB to compile results, run statistical analysis and generate visualisations.
Model accuracy kept fluctuating, making it difficult to maintain a high figure. Enriching the training dataset resolved this, though it also showed that no ML model can ever be perfect.
Communication between the two platforms proved tedious. After going through numerous guides, ended up creating a .mat file with Python and loading it in MATLAB.
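One common way to produce a .mat file from Python is `scipy.io.savemat`; the write-up does not say which library was used, so treat this as an assumed approach. The file name, variable names and counts are placeholders.

```python
import os
import tempfile

import numpy as np
from scipy.io import loadmat, savemat

# Placeholder results; the real category names and counts are not given here.
categories = ["politics", "sport", "business"]
counts = np.array([12, 7, 5])

path = os.path.join(tempfile.gettempdir(), "headline_results.mat")

# An object array of strings is written as a MATLAB cell array;
# the numeric array is written as an ordinary matrix.
savemat(path, {
    "categories": np.array(categories, dtype=object),
    "counts": counts,
})

# Read the file back in Python to confirm what was written.
round_trip = loadmat(path)
print(round_trip["counts"])

# On the MATLAB side the file can then be opened with:
#   data = load('headline_results.mat');
```

Note that a 1-D array comes back as a 1-by-N matrix, which matches how MATLAB treats row vectors.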