This project analyzes medical insurance data and builds regression models to estimate an individual’s annual insurance charge from personal and health factors.
This work uses a data set of 1338 records with seven attributes:
- Age (in years)
- Sex (female or male)
- BMI (body mass index)
- Children (number of dependents)
- Smoker (yes or no)
- Region (northeast, northwest, southeast, southwest)
- Charges (annual medical cost in USD)
Models trained include:
- Linear regression
- Polynomial regression of degree two
- Random forest regression
Model performance is compared using mean absolute error (MAE), mean squared error (MSE), and root mean squared error (RMSE).
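As a rough illustration of how the three metrics relate, the sketch below computes them in plain Python. The function name `regression_errors` and the sample numbers are made up for this example. The notebook may well use scikit-learn's metric functions instead.

```python
import math


def regression_errors(y_true, y_pred):
    """Return (MAE, MSE, RMSE) for two equal-length sequences of numbers."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(r) for r in residuals) / len(residuals)  # average absolute miss
    mse = sum(r * r for r in residuals) / len(residuals)   # penalizes large misses more
    rmse = math.sqrt(mse)                                  # back in the units of the target
    return mae, mse, rmse


# Illustrative values only, not taken from the data set.
mae, mse, rmse = regression_errors([100.0, 200.0, 300.0], [110.0, 190.0, 330.0])
print(mae, mse, rmse)
```

Because RMSE is expressed in the same unit as the charges (USD), it is usually the easiest of the three to read. MAE is less sensitive to a few very large errors.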
- Clone this repository to your local machine
- Ensure you have Python 3.8 or higher installed
- Place the file insurance.csv in the project root folder
- Create and activate a virtual environment if desired
Run this command in your project folder to install all required libraries:
pip install pandas numpy matplotlib seaborn scikit-learn
First, start a Jupyter server in the project folder:
jupyter notebook
Then open the file Medical_Insurance_Cost_Prediction.ipynb and execute all cells in order.
- Load the data from insurance.csv
- Check for missing values
- Encode the categorical fields sex, smoker, and region using label encoding
- Explore distributions and relationships with plots and a correlation heat map
- Split data into training and hold-out sets
- Train linear regression, polynomial regression, and random forest models
- Evaluate each model on hold-out data using MAE, MSE, and RMSE
- Compare results and choose the best performing model
- Use the chosen model to predict charges for new input records
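The steps above can be sketched roughly as follows. This is a minimal sketch, not the notebook's actual code: the function names `compare_models` and `best_model_name`, the 80/20 split, the random seed, the forest size, the lowercase column names (including `charges`), and the choice of RMSE as the selection criterion are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder, PolynomialFeatures


def compare_models(df, test_size=0.2, seed=42):
    """Encode, split, train three regressors, and return hold-out metrics."""
    if df.isnull().values.any():
        raise ValueError("data contains missing values; handle them first")

    df = df.copy()
    # Label encoding: each category becomes an integer (classes sorted alphabetically).
    for col in ["sex", "smoker", "region"]:
        df[col] = LabelEncoder().fit_transform(df[col])

    X = df.drop(columns="charges")
    y = df["charges"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed
    )

    models = {
        "linear": LinearRegression(),
        # Degree-two polynomial regression = polynomial features + linear fit.
        "polynomial": make_pipeline(PolynomialFeatures(degree=2), LinearRegression()),
        "random_forest": RandomForestRegressor(n_estimators=100, random_state=seed),
    }

    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        mse = mean_squared_error(y_test, pred)
        results[name] = {
            "MAE": float(mean_absolute_error(y_test, pred)),
            "MSE": float(mse),
            "RMSE": float(np.sqrt(mse)),
        }
    return results


def best_model_name(results):
    """Pick the model with the lowest hold-out RMSE (an assumed criterion)."""
    return min(results, key=lambda name: results[name]["RMSE"])
```

With the real data this would be called as `results = compare_models(pd.read_csv("insurance.csv"))`, followed by `best_model_name(results)`. Fixing the seed keeps the split and the forest reproducible between runs.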
Medical_Insurance
* Medical_Insurance_Cost_Prediction.ipynb: notebook containing code and analysis
* insurance.csv: raw data file
At the end of the notebook, you can supply a new record to get an estimated annual charge:
import pandas as pd
from sklearn.linear_model import LinearRegression
# X_train and y_train come from the notebook's earlier encoding and split steps
model = LinearRegression().fit(X_train, y_train)
# Column names and order must match the columns the model was trained on
new_data = pd.DataFrame({
    'age': [30],
    'sex': [1],       # 1 = male, 0 = female
    'bmi': [24.5],
    'children': [2],
    'smoker': [0],    # 1 = yes, 0 = no
    'region': [2],    # region codes must match the notebook's encoding
})
estimate = model.predict(new_data)[0]
print(estimate)
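The integer codes in the example above have to match whatever the notebook's encoders produced. As a hedged sketch, the helper below converts a human-readable record into codes, assuming the notebook used scikit-learn's LabelEncoder, which numbers classes in sorted order. The names `CATEGORY_CODES` and `encode_record` are hypothetical, and the mapping should be checked against the fitted encoders before relying on it.

```python
# Assumed codes: LabelEncoder assigns integers to classes in sorted order.
CATEGORY_CODES = {
    "sex": {"female": 0, "male": 1},
    "smoker": {"no": 0, "yes": 1},
    "region": {"northeast": 0, "northwest": 1, "southeast": 2, "southwest": 3},
}


def encode_record(record):
    """Return a copy of record with categorical strings replaced by integer codes."""
    encoded = dict(record)
    for field, codes in CATEGORY_CODES.items():
        value = str(record[field]).strip().lower()
        if value not in codes:
            raise ValueError(f"unknown {field!r} value: {record[field]!r}")
        encoded[field] = codes[value]
    return encoded


raw = {"age": 30, "sex": "male", "bmi": 24.5, "children": 2,
       "smoker": "no", "region": "southeast"}
encoded = encode_record(raw)
print(encoded)
```

Under that assumption, the encoded record can be wrapped in a one-row DataFrame, for example `pd.DataFrame([encoded])`, and passed to `model.predict`.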
Data set (Google Drive): https://drive.google.com/file/d/1rJNN5oWWXtWnY_MAdYnOE-tI_lpyXQc5/view