library(tidyverse)
diabetes_data <- read_csv("diabetes_risk_dataset.csv") |>
  janitor::clean_names()

Assignment 3
Diabetes Risk Prediction Dataset
This CSV file (diabetes_risk_dataset.csv) contains health and lifestyle data for individuals, designed to model risk factors associated with diabetes. Each row represents one person and includes metabolic indicators such as glucose level, insulin level, BMI, and cholesterol, along with lifestyle features like diet, physical activity, sleep, and stress.
The file includes a calculated diabetes_risk_score and a diabetes_risk_category, representing the predicted level of diabetes risk for each individual.
Column Descriptions:
- patient_id – Unique identifier for each individual.
- age – Age in years.
- bmi – Body Mass Index.
- fasting_glucose_level – Fasting blood glucose level.
- insulin_level – Blood insulin level.
- blood_pressure – Blood pressure measurement.
- physical_activity_level – Physical activity category (Low, Moderate, High).
- daily_calorie_intake – Average daily calorie intake.
- sugar_intake_grams_per_day – Daily sugar consumption in grams.
- sleep_hours – Average sleep duration per night.
- family_history_diabetes – Indicates genetic predisposition (Yes/No).
- waist_circumference_cm – Waist measurement in centimeters.
- cholesterol_level – Blood cholesterol level.
- triglycerides_level – Triglyceride level in blood.
- stress_level – Stress level on a scale from 1 to 10.
- diabetes_risk_score – Calculated diabetes risk score (0–100).
- diabetes_risk_category – Risk classification: Low, Prediabetes, High Risk.
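Before starting on the questions, it can help to confirm that the loaded columns match the names described above. The following is a minimal sketch in base R, not part of the assignment: `check_columns()` is a hypothetical helper written for this illustration, and `toy` is a made-up stand-in for the real data.

```r
# Column names as listed in the description above
expected_cols <- c(
  "patient_id", "age", "bmi", "fasting_glucose_level", "insulin_level",
  "blood_pressure", "physical_activity_level", "daily_calorie_intake",
  "sugar_intake_grams_per_day", "sleep_hours", "family_history_diabetes",
  "waist_circumference_cm", "cholesterol_level", "triglycerides_level",
  "stress_level", "diabetes_risk_score", "diabetes_risk_category"
)

# Hypothetical helper (not from any package): compares a data frame's
# column names against the documented list
check_columns <- function(data, expected = expected_cols) {
  list(
    missing = setdiff(expected, names(data)),        # documented but absent
    undocumented = setdiff(names(data), expected)    # present but not documented
  )
}

# Toy stand-in for the real data: one documented and one undocumented column
toy <- data.frame(patient_id = 1:2, some_other_column = c(5.4, 6.1))
check_columns(toy)$undocumented
```

With the real file loaded, the same check would be `check_columns(diabetes_data)`. Any name reported under `undocumented` is a column the description does not cover.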
Questions
Answer all the questions using tidyverse functions.
- Inspect the structure of the dataset, determine the total number of observations and variables, and display the data types of all columns.
- Transform the categorical variables (e.g., physical_activity_level and family_history_diabetes) into factor variables and verify the transformation.
- Construct a new variable named metabolic_index, defined as the average of bmi, fasting_glucose_level, and HbA1c_level.
- Calculate the overall mean, median, standard deviation, minimum, and maximum for bmi, fasting_glucose_level, insulin_level, and blood_pressure using across(). Present the results so that variable names appear as rows and statistics appear as columns.
- Derive a categorical variable named sleep_category by classifying sleep_hours into ‘Short’ for values below 6, ‘Adequate’ for values between 6 and 8 inclusive, and ‘Long’ for values above 8.
- Compute the proportion of individuals with family history of diabetes and report the result as a percentage.
- Identify and extract the subset of individuals whose diabetes_risk_score exceeds the 75th percentile and whose stress_level is greater than 7.
- Rank all individuals by diabetes_risk_score in descending order and create a new variable named global_risk_rank.
- Within each diabetes risk category, compute the difference between an individual’s risk score and the mean risk score of that category.
- Calculate the correlation matrix for the 4 variables (bmi to insulin_level) and display it in a tidy format.
- Group the data by diabetes_risk_category and compute the mean, median, and standard deviation of diabetes_risk_score for each category. Display the results in a summary table.
- Group the dataset by physical_activity_level and calculate the average bmi, fasting_glucose_level, and cholesterol_level. Arrange the output in descending order of average BMI.
- Within each diabetes_risk_category, identify the individual with the highest fasting_glucose_level and display the corresponding Patient_ID, fasting_glucose_level, and category.
- Compute the total number of individuals and the average stress_level within each combination of diabetes_risk_category and family_history_diabetes. Sort the results in ascending order of the average stress level.
- For each physical_activity_level, calculate the proportion of individuals whose diabetes_risk_score is above the overall mean risk score.
- Generate a histogram of bmi and adjust the number of bins to improve clarity.
- Create a boxplot that compares bmi across diabetes_risk_category.
- Produce a scatter plot that displays fasting_glucose_level on the horizontal axis and insulin_level on the vertical axis while colouring points by diabetes_risk_category.
- Construct a faceted density plot of cholesterol_level by physical_activity_level. Keep them in a single row.
- Create a bar chart that displays the distribution of diabetes_risk_category.
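One of the questions asks for across() summaries laid out with variable names as rows and statistics as columns. The general pattern can be sketched on toy data as follows. This is an illustration under stated assumptions, not a model answer: the tibble `toy` and its columns `height` and `weight` are made up, and the `"__"` separator is an arbitrary choice.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Toy data with hypothetical columns; this is not the assignment dataset
toy <- tibble(
  height = c(150, 160, 170, 180),
  weight = c(50, 60, 70, 80)
)

# Step 1: one wide row, with columns named <variable>__<statistic>
wide <- toy |>
  summarise(
    across(
      c(height, weight),
      list(mean = mean, sd = sd, min = min, max = max),
      .names = "{.col}__{.fn}"
    )
  )

# Step 2: split each name at "__"; the ".value" sentinel turns the
# statistic part of the name into its own output column
summary_tbl <- wide |>
  pivot_longer(
    everything(),
    names_to = c("variable", ".value"),
    names_sep = "__"
  )

summary_tbl
```

A double underscore is used as the separator because column names in this dataset already contain single underscores, so splitting on `_` alone would break names such as fasting_glucose_level into several pieces.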
Submission and Deadline
Create a .qmd file and render it to PDF. Here is a sample qmd file you can use. Finally, submit the rendered PDF file on the Google Classroom thread by 11:59 PM, 28 February 2026.