Assignment 3

Diabetes Risk Prediction Dataset

This CSV file (diabetes_risk_dataset.csv) contains health and lifestyle data for individuals, designed to model risk factors associated with diabetes. Each row represents one person and includes metabolic indicators such as glucose level, insulin level, BMI, and cholesterol, along with lifestyle features like diet, physical activity, sleep, and stress.

The file includes a calculated diabetes_risk_score and a diabetes_risk_category, representing the predicted level of diabetes risk for each individual.

library(tidyverse)

diabetes_data <- read_csv("diabetes_risk_dataset.csv") |>
  janitor::clean_names()
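
Because `janitor::clean_names()` rewrites column names to snake_case, the names in `diabetes_data` may not match the spellings used in the questions (for example `HbA1c_level` or `Patient_ID`). The sketch below shows the effect on a small tibble whose raw column names are assumptions made for illustration, not taken from the actual CSV. Running `names(diabetes_data)` on the real data is the reliable way to see the final names.

```r
library(tibble)
library(janitor)

# Hypothetical raw column names, for illustration only;
# the real CSV may use different spellings.
raw <- tibble(
  Patient_ID  = 1:2,
  HbA1c_level = c(5.4, 6.1),
  BMI         = c(22.5, 30.1)
)

# clean_names() converts every column name to lower snake_case
cleaned <- clean_names(raw)

# Compare the names before and after cleaning
names(raw)
names(cleaned)
```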

Columns Description:

Questions

Answer all the questions using tidyverse functions.

  1. Inspect the structure of the dataset, determine the total number of observations and variables, and display the data types of all columns.
  1. Transform the categorical variables (e.g., physical_activity_level, family_history_diabetes) into factor variables and verify the transformation.
  1. Construct a new variable named metabolic_index, defined as the average of bmi, fasting_glucose_level, and HbA1c_level.
  1. Calculate the overall mean, median, standard deviation, minimum, and maximum for bmi, fasting_glucose_level, insulin_level, and blood_pressure using across(). Present the results so that variable names appear as rows and statistics appear as columns.
  1. Derive a categorical variable named sleep_category by classifying sleep_hours into ‘Short’ for values below 6, ‘Adequate’ for values between 6 and 8 inclusive, and ‘Long’ for values above 8.
  1. Compute the proportion of individuals with a family history of diabetes and report the result as a percentage.
  1. Identify and extract the subset of individuals whose diabetes_risk_score exceeds the 75th percentile and whose stress_level is greater than 7.
  1. Rank all individuals by diabetes_risk_score in descending order and create a new variable named global_risk_rank.
  1. Within each diabetes risk category, compute the difference between an individual’s risk score and the mean risk score of that category.
  1. Calculate the correlation matrix for the four variables from bmi to insulin_level and display it in a tidy format.
  1. Group the data by diabetes_risk_category and compute the mean, median, and standard deviation of diabetes_risk_score for each category. Display the results in a summary table.
  1. Group the dataset by physical_activity_level and calculate the average bmi, fasting_glucose_level, and cholesterol_level. Arrange the output in descending order of average BMI.
  1. Within each diabetes_risk_category, identify the individual with the highest fasting_glucose_level and display the corresponding Patient_ID, fasting_glucose_level, and category.
  1. Compute the total number of individuals and the average stress_level within each combination of diabetes_risk_category and family_history_diabetes. Sort the results in ascending order of the average stress level.
  1. For each physical_activity_level, calculate the proportion of individuals whose diabetes_risk_score is above the overall mean risk score.
  1. Generate a histogram of bmi and adjust the number of bins to improve clarity.
  1. Create a boxplot that compares bmi across diabetes_risk_category.
  1. Produce a scatter plot that displays fasting_glucose_level on the horizontal axis and insulin_level on the vertical axis while colouring points by diabetes_risk_category.
  1. Construct a faceted density plot of cholesterol_level by physical_activity_level. Keep them in a single row.
  1. Create a bar chart that displays the distribution of diabetes_risk_category.
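
Several questions ask for summaries where variable names appear as rows and statistics as columns. One possible pattern, sketched here on the built-in `mtcars` data rather than the assignment dataset, is to summarise with `across()` and then reshape with `pivot_longer()`. The `__` separator is an assumption chosen for this sketch so that column names that already contain single underscores still split cleanly.

```r
library(dplyr)
library(tidyr)

# Step 1: one wide row with a column per variable/statistic pair,
# e.g. mpg__mean, mpg__median, ..., wt__max
summary_wide <- mtcars |>
  summarise(
    across(
      c(mpg, hp, wt),
      list(mean = mean, median = median, sd = sd, min = min, max = max),
      .names = "{.col}__{.fn}"
    )
  )

# Step 2: split each name at "__"; the first part becomes the
# `variable` column and the second part (".value") becomes the
# statistic columns
summary_tidy <- summary_wide |>
  pivot_longer(
    everything(),
    names_to = c("variable", ".value"),
    names_sep = "__"
  )

summary_tidy
```

The result has one row per variable and one column per statistic. The same two steps should carry over to other column selections, provided the separator does not occur inside any column name.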

Submission and Deadline

Create a .qmd file and render it to PDF. Here is a sample qmd file you can use. Finally, submit the rendered PDF file in the Google Classroom thread by 11:59 PM, 28 February 2026.