Statistical analysis and predictive modeling of water quality parameters using Python.
This project analyzes a dataset of 500 water samples across five quality parameters to explore relationships between water quality indicators and build predictive models for conductivity.
Key findings:
- Strong positive correlation (r = 0.705) between pH and dissolved oxygen levels
- Multi-parameter linear regression model predicts conductivity from pH, temperature, turbidity, and dissolved oxygen
- OLS regression confirms statistically significant relationships between several parameter pairs (p < 0.05)
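For readers unfamiliar with the correlation coefficient reported above, the following is a minimal sketch of how a Pearson r is computed. It uses NumPy on synthetic stand-in data, not the project's samples, so the printed value will not match 0.705. The function name `pearson_r` and the generated arrays are illustrative assumptions.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean of x
    dy = y - y.mean()  # deviations from the mean of y
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))


# Synthetic stand-in data (an assumption for illustration only).
rng = np.random.default_rng(0)
ph = rng.uniform(6.83, 7.48, size=500)
dissolved_oxygen = 6.0 + 5.0 * (ph - 6.83) + rng.normal(0.0, 0.8, size=500)

r = pearson_r(ph, dissolved_oxygen)
print(f"r = {r:.3f}")
```

On the real dataset, the same number should be obtainable from the pandas correlation matrix that the notebook plots as a heatmap.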
The dataset (data/water_quality_testing.csv) contains 500 samples with the following parameters:
| Parameter | Unit | Range |
|---|---|---|
| pH | pH units | 6.83 - 7.48 |
| Temperature | °C | 20.3 - 23.6 |
| Turbidity | NTU | 3.1 - 5.1 |
| Dissolved Oxygen | mg/L | 6.0 - 9.9 |
| Conductivity | µS/cm | 316 - 370 |
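A range table like the one above can be derived with pandas. The sketch below uses a few made-up rows in place of the real CSV, and the column names are assumptions, so check the actual file header before adapting it.

```python
import io

import pandas as pd

# Tiny inline sample standing in for data/water_quality_testing.csv.
# Both the values and the column names are illustrative assumptions.
sample_csv = io.StringIO(
    "pH,Temperature,Turbidity,Dissolved Oxygen,Conductivity\n"
    "7.25,23.1,4.5,7.8,342\n"
    "7.11,22.3,5.1,6.2,335\n"
    "7.03,21.5,3.9,8.3,356\n"
)
df = pd.read_csv(sample_csv)

# One row per parameter, with its observed minimum and maximum.
ranges = df.agg(["min", "max"]).T
print(ranges)
```

To run this against the project data, replace `sample_csv` with the path `data/water_quality_testing.csv`.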
water-quality-testing-data-analysis/
├── data/
│ └── water_quality_testing.csv # Water quality dataset (500 samples)
├── notebooks/
│ └── water_quality_analysis.ipynb # Main analysis notebook
├── .gitignore
├── CODE_OF_CONDUCT.md
├── LICENSE
├── README.md
└── requirements.txt
- Python 3.8+
- pip
git clone https://github.com/ChanMeng666/water-quality-testing-data-analysis.git
cd water-quality-testing-data-analysis
pip install -r requirements.txt
jupyter notebook notebooks/water_quality_analysis.ipynb
Run all cells (Kernel > Restart & Run All) to reproduce the full analysis.
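If the notebook fails on an import, a quick check like the following may help narrow down which dependency is missing. It is a sketch using only the standard library. The import names are inferred from the libraries this README lists, not read from requirements.txt, and the helper `missing_packages` is hypothetical.

```python
import importlib.util

# Import names assumed from the library list in this README.
# Note that scikit-learn is imported as "sklearn".
REQUIRED = ["pandas", "numpy", "matplotlib", "seaborn", "sklearn", "statsmodels"]


def missing_packages(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```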
The notebook covers the following topics:
- Data Loading and Inspection - Load dataset, examine structure and summary statistics
- Distribution Analysis - Histograms with KDE for all parameters
- Correlation Analysis - Correlation matrix heatmap and pair plots
- pH vs Dissolved Oxygen - Deep dive into the strongest correlation
- Parameter Relationships - Regression plots for multiple parameter pairs
- Predictive Modeling - Linear regression for conductivity prediction (two-feature and multi-parameter models)
- Statistical Modeling (OLS) - Ordinary least squares regression with statsmodels for statistical inference
- Conclusions - Summary of key findings
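The predictive modeling step can be sketched roughly as follows with scikit-learn. This is not the notebook's code. The features are synthetic values drawn from the ranges in the parameter table, the coefficients used to generate conductivity are invented, and the 80/20 train/test split is an assumption.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 500

# Synthetic features (illustrative only), one column per predictor.
X = np.column_stack([
    rng.uniform(6.83, 7.48, n),  # pH
    rng.uniform(20.3, 23.6, n),  # temperature, °C
    rng.uniform(3.1, 5.1, n),    # turbidity, NTU
    rng.uniform(6.0, 9.9, n),    # dissolved oxygen, mg/L
])

# Invented linear relationship plus noise, standing in for conductivity.
true_coefs = np.array([20.0, 1.5, -2.0, 3.0])
y = 150.0 + X @ true_coefs + rng.normal(0.0, 2.0, n)

# Hold out 20% of the samples to evaluate the fitted model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)
r2 = r2_score(y_test, model.predict(X_test))

print("coefficients:", model.coef_)
print(f"test R^2: {r2:.3f}")
```

Scoring on held-out samples gives a more honest estimate of predictive accuracy than scoring on the training data.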
- pandas - Data manipulation and analysis
- NumPy - Numerical computing
- Matplotlib - Static plotting and visualization
- seaborn - Statistical data visualization
- scikit-learn - Linear regression modeling
- statsmodels - OLS regression and statistical testing
This project is licensed under the MIT License. See LICENSE for details.