Assignment 3

Diabetes Risk Prediction Dataset

This CSV file (diabetes_risk_dataset.csv) contains health and lifestyle data for individuals, designed to model risk factors associated with diabetes. Each row represents one person and includes metabolic indicators such as glucose level, insulin level, BMI, and cholesterol, along with lifestyle features like diet, physical activity, sleep, and stress.

The file includes a calculated diabetes_risk_score and a diabetes_risk_category, representing the predicted level of diabetes risk for each individual.

library(tidyverse)

diabetes_data <- read_csv("diabetes_risk_dataset.csv") |>
  janitor::clean_names()

Column Descriptions:

patient_id – Unique identifier for each individual.
age – Age in years.
bmi – Body Mass Index.
fasting_glucose_level – Fasting blood glucose level.
insulin_level – Blood insulin level.
blood_pressure – Blood pressure measurement.
physical_activity_level – Physical activity category (Low, Moderate, High).
daily_calorie_intake – Average daily calorie intake.
sugar_intake_grams_per_day – Daily sugar consumption in grams.
sleep_hours – Average sleep duration per night.
family_history_diabetes – Indicates genetic predisposition (Yes/No).
waist_circumference_cm – Waist measurement in centimeters.
cholesterol_level – Blood cholesterol level.
triglycerides_level – Triglyceride level in blood.
stress_level – Stress level on a scale from 1 to 10.
diabetes_risk_score – Calculated diabetes risk score (0–100).
diabetes_risk_category – Risk classification: Low, Prediabetes, High Risk.

Questions

Answer all questions using tidyverse functions.

Inspect the structure of the dataset, determine the total number of observations and variables, and display the data types of all columns.

Transform the categorical variables (e.g., physical_activity_level, family_history_diabetes) into factor variables and verify the transformation.

Construct a new variable named metabolic_index defined as the average of bmi, fasting_glucose_level, and HbA1c_level.

Calculate the overall mean, median, standard deviation, minimum, and maximum for bmi, fasting_glucose_level, insulin_level, and blood_pressure using across(). Present the results so that variable names appear as rows and statistics appear as columns.
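The reshaping step in the question above (variables as rows, statistics as columns) can be sketched on toy data. This is only one possible pattern, not a required approach: the tibble `toy`, its columns `height` and `weight`, and the `__` separator are all hypothetical choices made for illustration. A double underscore is used because real column names may already contain single underscores.

```r
library(tidyverse)

# Hypothetical toy data; these columns are NOT from the assignment dataset.
toy <- tibble(
  height = c(150, 160, 170, 180),
  weight = c(50, 60, 70, 80)
)

stat_table <- toy |>
  summarise(
    across(
      c(height, weight),
      list(mean = mean, median = median, sd = sd, min = min, max = max),
      .names = "{.col}__{.fn}"   # e.g. height__mean, weight__sd
    )
  ) |>
  # ".value" tells pivot_longer() to turn the part after "__" into columns,
  # leaving one row per original variable.
  pivot_longer(
    everything(),
    names_to = c("variable", ".value"),
    names_sep = "__"
  )

stat_table
```

If the data contain missing values, the summary functions would need to be supplied as lambdas such as `\(x) mean(x, na.rm = TRUE)`.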

Derive a categorical variable named sleep_category by classifying sleep_hours into ‘Short’ for values below 6, ‘Adequate’ for values between 6 and 8 inclusive, and ‘Long’ for values above 8.
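A three-way classification with inclusive boundaries of this kind is commonly written with case_when(). The sketch below uses a hypothetical `score` column and hypothetical cut-offs (50 and 70), not the assignment's variable or thresholds, to show how the conditions are evaluated in order and how missing values are left as NA.

```r
library(tidyverse)

# Hypothetical toy data; the column and cut-offs are illustrative only.
toy <- tibble(score = c(35, 50, 70, 85, NA))

toy_binned <- toy |>
  mutate(
    score_band = case_when(
      score < 50  ~ "Low",
      score <= 70 ~ "Middle",  # only reached when score >= 50, so 50 and 70 are included
      score > 70  ~ "High"     # explicit condition, so NA stays NA instead of becoming "High"
    ),
    # Store as an ordered set of levels so tables and plots follow a sensible order.
    score_band = factor(score_band, levels = c("Low", "Middle", "High"))
  )

toy_binned
```

Writing the last branch as an explicit condition rather than a catch-all is a design choice: a catch-all would silently assign missing values to the final category.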

Compute the proportion of individuals with a family history of diabetes and report the result as a percentage.

Identify and extract the subset of individuals whose diabetes_risk_score exceeds the 75th percentile and whose stress_level is greater than 7.

Rank all individuals by diabetes_risk_score in descending order and create a new variable named global_risk_rank.

Within each diabetes risk category, compute the difference between an individual’s risk score and the mean risk score of that category.

Calculate the correlation matrix for the 4 variables (bmi to insulin_level) and display it in a tidy format.
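"Tidy format" here usually means one row per pair of variables rather than a square matrix. A minimal sketch on hypothetical toy columns `x`, `y`, and `z` follows; the output column names `var1`, `var2`, and `correlation` are arbitrary choices for illustration.

```r
library(tidyverse)

# Hypothetical toy data; these columns are NOT from the assignment dataset.
toy <- tibble(
  x = c(1, 2, 3, 4, 5),
  y = c(2, 4, 6, 8, 10),
  z = c(5, 4, 3, 2, 1)
)

cor_tidy <- toy |>
  select(x:z) |>                          # a contiguous range of columns
  cor(use = "complete.obs") |>            # base R correlation matrix
  as_tibble(rownames = "var1") |>         # keep the matrix row names as a column
  pivot_longer(-var1, names_to = "var2", values_to = "correlation")

cor_tidy
```

The result has one row for each ordered pair, including each variable paired with itself.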

Group the data by diabetes_risk_category and compute the mean, median, and standard deviation of diabetes_risk_score for each category. Display the results in a summary table.

Group the dataset by physical_activity_level and calculate the average bmi, fasting_glucose_level, and cholesterol_level. Arrange the output in descending order of average BMI.

Within each diabetes_risk_category, identify the individual with the highest fasting_glucose_level and display the corresponding patient_id, fasting_glucose_level, and category.

Compute the total number of individuals and the average stress_level within each combination of diabetes_risk_category and family_history_diabetes. Sort the results in ascending order of the average stress level.

For each physical_activity_level, calculate the proportion of individuals whose diabetes_risk_score is above the overall mean risk score.
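One point that is easy to miss in a calculation like this is where the overall mean is computed: inside a grouped summarise(), mean() returns the group mean, not the overall mean. The sketch below, on a hypothetical tibble with columns `group` and `value`, computes the comparison before grouping; it illustrates the ordering only and is not the expected answer.

```r
library(tidyverse)

# Hypothetical toy data; the overall mean of value is 6.
toy <- tibble(
  group = c("a", "a", "b", "b", "b"),
  value = c(1, 9, 2, 8, 10)
)

prop_by_group <- toy |>
  # Ungrouped mutate(): mean(value) is the overall mean across all rows.
  mutate(above_overall = value > mean(value)) |>
  group_by(group) |>
  # The mean of a logical vector is the proportion of TRUE values.
  summarise(prop_above = mean(above_overall), .groups = "drop")

prop_by_group
```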

Generate a histogram of bmi and adjust the number of bins to improve clarity.

Create a boxplot that compares bmi across diabetes_risk_category.

Produce a scatter plot that displays fasting_glucose_level on the horizontal axis and insulin_level on the vertical axis while colouring points by diabetes_risk_category.

Construct a faceted density plot of cholesterol_level by physical_activity_level. Arrange the facets in a single row.

Create a bar chart that displays the distribution of diabetes_risk_category.

Submission and Deadline

Create a .qmd file and render it to PDF. Here is a sample qmd file you can use. Finally, submit the rendered PDF file on the Google Classroom thread by 11:59 PM, 28 February 2026.