Fake News & Misinformation Detector
Detect fake vs real news articles using Machine Learning, TF-IDF, and Logistic Regression, complete with training scripts, evaluation charts, and an interactive Streamlit web app.
Table of Contents
- Overview
- Demo
- Project Structure
- Installation
- Dataset
- Training the Model
- Evaluation & Charts
- How It Works
- Running the Streamlit App
- Code Modules
- Technologies Used
- License
- Author
- Future Improvements
Overview
The Fake News & Misinformation Detector is a complete end-to-end Natural Language Processing (NLP) project that classifies news headlines and articles as REAL or FAKE.
It combines TF-IDF feature extraction with a Logistic Regression classifier and reaches 100% accuracy on the cleaned, balanced dataset of 1,998 articles.
The project also includes:
- Model evaluation with visual charts
- Interactive Streamlit web app
- Reusable and modular code structure
Demo
Streamlit Web App
When launched, the app allows you to paste or type any news headline or paragraph and analyze its credibility in real time.
- Prediction: REAL or FAKE
- Probability bar visualization
- Adjustable fake-detection threshold
Project Structure
fake-news-detector/
│
├── data/
│ ├── True.csv # Real news (999 rows)
│ └── Fake.csv # Fake news (999 rows)
│
├── outputs/
│ ├── model.joblib # Trained Logistic Regression model
│ ├── vectorizer.joblib # TF-IDF vectorizer
│ ├── pipeline.joblib # Combined pipeline (optional)
│ ├── metrics.json # Model performance report
│ ├── confusion_matrix.png # Confusion Matrix plot
│ ├── roc_curve.png # ROC curve plot
│ └── pr_curve.png # Precision-Recall curve plot
│
├── src/
│ ├── text_clean.py # Text preprocessing utilities
│ ├── utils.py # I/O helpers
│ ├── train_model.py # Training and evaluation script
│ ├── detect_fake_news.py # CLI prediction script
│ └── streamlit_app.py # Streamlit web application
│
└── README.md
Installation
Clone the Repository
git clone https://github.com/AmirhosseinHonardoust/Fake-News-Detector.git
cd Fake-News-Detector
Install Dependencies
pip install -r requirements.txt
Or install manually:
pip install pandas numpy scikit-learn matplotlib streamlit joblib
Dataset
| File | Type | Rows | Columns |
|---|---|---|---|
| True.csv | Real news | 999 | title, text, subject, date |
| Fake.csv | Fake news | 999 | title, text, subject, date |
Dataset Source:
This project uses and modifies the Fake and Real News Dataset by Clément Bisaillon (Kaggle).
The data was cleaned, its headers were fixed, and it was downsampled to 999 REAL and 999 FAKE articles for balanced training and clear visualization.
It is used purely for educational and research purposes.
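The balancing step described above could look roughly like the following. This is a minimal sketch, not the author's actual preprocessing: the helper names `load_labeled` and `balance`, the fixed seed, and the toy CSV rows are all hypothetical, and only the column names come from the table above.

```python
import csv
import io
import random


def load_labeled(csv_text: str, label: str) -> list[dict]:
    """Parse CSV text and tag every row with its class label."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{**row, "label": label} for row in reader]


def balance(real_rows: list[dict], fake_rows: list[dict],
            n_per_class: int, seed: int = 42) -> list[dict]:
    """Randomly keep the same number of rows from each class."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    n = min(n_per_class, len(real_rows), len(fake_rows))
    return rng.sample(real_rows, n) + rng.sample(fake_rows, n)


# Toy stand-ins for data/True.csv and data/Fake.csv (made-up rows).
REAL_CSV = (
    "title,text,subject,date\n"
    "A,real one,politics,2017-01-01\n"
    "B,real two,world,2017-01-02\n"
    "C,real three,world,2017-01-03\n"
)
FAKE_CSV = (
    "title,text,subject,date\n"
    "X,fake one,news,2017-01-01\n"
    "Y,fake two,news,2017-01-02\n"
)

rows = balance(load_labeled(REAL_CSV, "REAL"),
               load_labeled(FAKE_CSV, "FAKE"),
               n_per_class=999)
counts = {label: sum(r["label"] == label for r in rows)
          for label in ("REAL", "FAKE")}
print(counts)  # {'REAL': 2, 'FAKE': 2}
```

The sample size is capped by the smaller class, so both classes always end up with the same number of rows.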
Training the Model
Run the following command from the project root:
python src/train_model.py --real data/True.csv --fake data/Fake.csv --text-col text --outdir outputs
This script will:
- Load both datasets (real and fake).
- Clean and merge them using text_clean.py.
- Extract TF-IDF features.
- Train a Logistic Regression classifier.
- Save outputs:
  - outputs/model.joblib
  - outputs/vectorizer.joblib
  - outputs/metrics.json
  - Performance charts (confusion_matrix.png, roc_curve.png, pr_curve.png)
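The core of these steps can be sketched with scikit-learn and joblib. This is an illustration under stated assumptions rather than the contents of train_model.py: the four toy texts, the 0 = REAL / 1 = FAKE label encoding, the default vectorizer settings, and the temporary output directory are all hypothetical.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stand-in data; the real script reads data/True.csv and data/Fake.csv.
texts = [
    "senate passes budget bill after long debate",
    "central bank holds interest rates steady",
    "aliens secretly control the world government",
    "miracle pill cures every disease overnight",
]
labels = [0, 0, 1, 1]  # assumed encoding: 0 = REAL, 1 = FAKE

# Turn raw text into TF-IDF weighted features.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# Fit the classifier on the TF-IDF matrix.
model = LogisticRegression(max_iter=1000)
model.fit(X, labels)

# Persist both artifacts, mirroring outputs/model.joblib and outputs/vectorizer.joblib.
outdir = Path(tempfile.mkdtemp())
joblib.dump(model, outdir / "model.joblib")
joblib.dump(vectorizer, outdir / "vectorizer.joblib")

# Reload and score a new text; column 1 of predict_proba is the FAKE class.
loaded_vectorizer = joblib.load(outdir / "vectorizer.joblib")
loaded_model = joblib.load(outdir / "model.joblib")
features = loaded_vectorizer.transform(["miracle pill cures disease"])
fake_probability = loaded_model.predict_proba(features)[0, 1]
print(f"Fake probability: {fake_probability:.3f}")
```

Saving the vectorizer alongside the model matters because prediction must reuse the exact vocabulary and IDF weights learned during training.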
Evaluation & Charts
After training, the model achieves perfect classification accuracy on this dataset.
Confusion Matrix
| True Label | Predicted REAL | Predicted FAKE |
|---|---|---|
| REAL | 999 ✅ | 0 ❌ |
| FAKE | 0 ❌ | 999 ✅ |
The model correctly classified all 1,998 samples.
ROC Curve
The ROC curve touches the top-left corner (AUC = 1.00), indicating perfect separability between the classes.
Precision–Recall Curve
Both precision and recall reach 1.00, meaning there are no false positives and no false negatives.
Key Metrics
| Metric | Value |
|---|---|
| Accuracy | 100% |
| Precision (FAKE) | 1.00 |
| Recall (FAKE) | 1.00 |
| F1-Score | 1.00 |
| ROC-AUC | 1.00 |
Although perfect accuracy is achieved on this dataset, it’s a controlled sample. Real-world news data will naturally introduce noise and uncertainty.
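The key metrics follow directly from the confusion matrix counts. The sketch below recomputes them with the standard definitions, treating FAKE as the positive class; the function name `binary_metrics` is hypothetical and not part of the project code.

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Compute accuracy, precision, recall, and F1 from confusion matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Counts from the confusion matrix above, with FAKE as the positive class:
# 999 FAKE predicted FAKE (TP), 0 REAL predicted FAKE (FP),
# 0 FAKE predicted REAL (FN), 999 REAL predicted REAL (TN).
metrics = binary_metrics(tp=999, fp=0, fn=0, tn=999)
print(metrics)
# {'accuracy': 1.0, 'precision': 1.0, 'recall': 1.0, 'f1': 1.0}
```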
How It Works
Pipeline Overview
- Text Cleaning → Remove punctuation, URLs, emails, non-ASCII chars.
- TF-IDF Vectorization → Convert words into weighted numerical features.
- Logistic Regression → Predict probability of “FAKE” label.
- Thresholding → If p(fake) ≥ 0.5 → FAKE, else REAL.
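The text cleaning step can be sketched with a few regular expressions. This is one plausible implementation, assuming lowercasing plus the removals listed above; the actual patterns in text_clean.py may differ, and the name `clean_text` is hypothetical.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
NON_ASCII_RE = re.compile(r"[^\x00-\x7F]+")
PUNCT_RE = re.compile(r"[^a-z0-9\s]")
SPACE_RE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Lowercase, then strip URLs, emails, non-ASCII characters, and punctuation."""
    text = text.lower()
    text = URL_RE.sub(" ", text)        # drop URLs before punctuation removal breaks them up
    text = EMAIL_RE.sub(" ", text)      # drop email addresses
    text = NON_ASCII_RE.sub(" ", text)  # drop non-ASCII characters
    text = PUNCT_RE.sub(" ", text)      # drop remaining punctuation
    return SPACE_RE.sub(" ", text).strip()  # collapse repeated whitespace


sample = "BREAKING: Visit https://example.com or mail tips@example.com \u2026 now!"
cleaned = clean_text(sample)
print(cleaned)  # breaking visit or mail now
```

URLs and emails are removed first because stripping punctuation earlier would leave fragments such as "https" and "example com" in the text.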
Example: Command-Line Prediction
python src/detect_fake_news.py --model outputs/model.joblib --vectorizer outputs/vectorizer.joblib --text "It s tough sometimes to imagine that Donald Trump has five children since it s clear from Monday s speech in front of 40,000 Boy Scouts and other attendees at the Boy Scouts Jamboree in West Virginia that he has absolutely no idea what kind of talk is appropriate for children.While most adults would take this opportunity to offer some pearls of adult wisdom or cheerlead the Boy Scouts toward their futures, Trump chose to deliver a tirade of Trumpisms.Like almost any time Trump has tried to string together more than a couple of words at a time, most of his speech was an inarticulate mess which consisted of his trademark whining, a wee bit of swearing and a pointless anecdote about a burned out rich guy at a cocktail party."
Output:
Label: FAKE | Fake probability: 0.560 | Threshold: 0.40
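The output line is consistent with a simple threshold rule: 0.560 is at or above 0.40, so the label is FAKE. Note that this example reports a threshold of 0.40 rather than the 0.5 given in the pipeline overview; which default the CLI uses is not stated here, so the sketch below takes the threshold as a parameter. The function names and the output formatting are a hypothetical reconstruction, not the code in detect_fake_news.py.

```python
def label_from_probability(p_fake: float, threshold: float = 0.5) -> str:
    """Return FAKE when the fake probability meets or exceeds the threshold."""
    if not 0.0 <= p_fake <= 1.0:
        raise ValueError("p_fake must be between 0 and 1")
    return "FAKE" if p_fake >= threshold else "REAL"


def format_result(p_fake: float, threshold: float) -> str:
    """Format a prediction in the style of the CLI output shown above."""
    label = label_from_probability(p_fake, threshold)
    return (f"Label: {label} | Fake probability: {p_fake:.3f} "
            f"| Threshold: {threshold:.2f}")


print(format_result(0.560, 0.40))
# Label: FAKE | Fake probability: 0.560 | Threshold: 0.40
```

Lowering the threshold flags more articles as FAKE at the cost of more false alarms, while raising it does the opposite.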
Running the Streamlit App
Launch the App
streamlit run src/streamlit_app.py
Then open the local web interface:
http://localhost:8501
App Features
- Paste any headline or paragraph
- Analyze with one click
- Adjust FAKE probability threshold
- See model file locations and loaded status in sidebar
Code Modules
| Module | Purpose |
|---|---|
| text_clean.py | Handles text normalization (lowercasing, regex-based cleaning) |
| utils.py | Ensures output directories exist and handles JSON I/O |
| train_model.py | Loads data, trains the model, and generates metrics and plots |
| detect_fake_news.py | CLI script for predicting individual samples |
| streamlit_app.py | Streamlit web app for interactive user testing |
Technologies Used
- Python 3.10+
- scikit-learn → TF-IDF Vectorizer, Logistic Regression
- pandas / numpy → Data manipulation
- matplotlib → Model visualization
- joblib → Model persistence
- Streamlit → Web interface
Future Improvements
- Integrate BERT / DistilBERT for contextual language understanding
- Extend dataset for multi-language fake news detection
- Add Explainable AI (LIME / SHAP) for model transparency
- Deploy live on Streamlit Cloud or Hugging Face Spaces