Developed a Python script to scrape CNN headlines and identify emerging news trends using natural language processing and machine learning.
A TF-IDF model paired with logistic regression was used for classification, achieving 79% accuracy. Statistical analysis was performed in MATLAB to validate results.
TF-IDF Machine Learning Model
Used scikit-learn in Python to build a text classifier that converts news headlines into TF-IDF features and feeds them to a logistic regression model, which sorts each headline into a category
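A minimal sketch of how such a classifier might be wired together with scikit-learn. The training headlines, the category names, and the vectorizer settings below are illustrative assumptions, not the project's actual data or configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy training set; the real dataset and categories are not shown here.
train_headlines = [
    "Senate passes new budget bill",
    "President signs election reform law",
    "Local team wins championship final",
    "Star striker injured before playoff game",
    "New smartphone chip doubles battery life",
    "Software update fixes major security flaw",
]
train_labels = ["politics", "politics", "sports", "sports", "tech", "tech"]

# TF-IDF turns each headline into a weighted term vector;
# logistic regression then learns one set of term weights per category.
model = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english"),
    LogisticRegression(max_iter=1000),
)
model.fit(train_headlines, train_labels)

# Classify unseen headlines.
new_headlines = ["Team signs star striker", "Senate debates reform bill"]
predictions = list(model.predict(new_headlines))
```

Wrapping both steps in one pipeline keeps the vocabulary learned during training tied to the classifier, so new headlines are vectorized the same way at prediction time.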
Efficient Web Scraping
Used BeautifulSoup to scrape all headlines from the main CNN news page
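The scraping step might look roughly like the sketch below. The CSS selector, the `fetch_html` helper, and the sample HTML are assumptions for illustration; CNN's markup changes over time, so the selector would need to be checked against the live page.

```python
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup

# Assumed selector for headline elements; verify against the current page markup.
HEADLINE_SELECTOR = "span.container__headline-text"


def extract_headlines(html, selector=HEADLINE_SELECTOR):
    """Return unique headline strings from an HTML document, in page order."""
    soup = BeautifulSoup(html, "html.parser")
    seen = set()
    headlines = []
    for node in soup.select(selector):
        # Collapse internal whitespace and trim the ends.
        text = " ".join(node.get_text().split())
        if text and text not in seen:
            seen.add(text)
            headlines.append(text)
    return headlines


def fetch_html(url="https://www.cnn.com"):
    """Hypothetical helper: download a page as text (not called in this sketch)."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


# Offline demonstration on a small hand-written HTML fragment.
sample_html = """
<div>
  <span class="container__headline-text">Markets rally after rate decision</span>
  <span class="container__headline-text">  Storm moves toward
     the coast </span>
  <span class="container__headline-text">Markets rally after rate decision</span>
  <span class="other">Not a headline</span>
</div>
"""
sample_headlines = extract_headlines(sample_html)
```

Keeping the parsing separate from the download makes the extraction logic easy to test on saved HTML without hitting the site on every run.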
Statistical Analysis and Visualization
Passed the data produced by the Python script to MATLAB, which compiled the results and generated the statistical analysis and visualizations
Model accuracy kept fluctuating, which made it difficult to maintain high accuracy. Addressed this by enriching the dataset the model is trained on, and learned along the way that no ML model can ever be perfect.
Communication between Python and MATLAB proved tedious. Had to work through numerous guides to find a viable method. Ended up writing a .mat file from Python and loading it in MATLAB.
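One way to produce such a .mat file is SciPy's `scipy.io.savemat`. This is a sketch under assumptions: the source does not say which library was used, and the variable names, file name, and sample values below are hypothetical.

```python
import os
import tempfile

import numpy as np
from scipy.io import loadmat, savemat


def export_results(path, headlines, predicted_labels, confidences):
    """Write classification results to a .mat file that MATLAB can load."""
    savemat(
        path,
        {
            # Object arrays are stored as MATLAB cell arrays of strings.
            "headlines": np.array(headlines, dtype=object),
            "labels": np.array(predicted_labels, dtype=object),
            # Numeric data is stored as a double array.
            "confidence": np.asarray(confidences, dtype=float),
        },
    )


# Hypothetical sample values, written to a temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), "headline_results.mat")
export_results(
    out_path,
    ["Senate debates reform bill", "Team signs star striker"],
    ["politics", "sports"],
    [0.81, 0.64],
)

# Read the file back in Python as a quick sanity check.
roundtrip = loadmat(out_path)

# On the MATLAB side, the file could then be read with something like:
#   data = load('headline_results.mat');
#   histogram(data.confidence)
```

A file-based handoff like this avoids running the two environments in the same process: Python finishes its work, and MATLAB picks up the results whenever it runs.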