A big part of my life and graduate semester this spring will be delving further into the science behind machine learning. I applied the principles to my final social media mining project last semester, but now I’m taking an entire course on machine learning. Today I want to follow my last two blog posts, on eating better and on data that can help reduce diabetes, with a discussion of how machine learning can be applied to predict diabetes.
In his Data Science Central article, Gnanguru Sattanathan uses the open source machine learning platform H2O, a Kaggle data set and a gradient boosting algorithm to predict whether someone is likely to have diabetes. The data set has 768 records of female Pima Indians from the National Institute of Diabetes and Digestive and Kidney Diseases. The “Health Statistics” section of this website has data available by ethnicity. However, when I click on the “American Indians and Alaska Natives” link, it’s unclear exactly where this source data originated. I even did a native search for “Pima diabetes data” from the Office of Minority Health website with no luck. Never mind that slight irritation; let’s assume for the sake of analysis that the data on Kaggle is accurate.
Now let’s turn to the algorithm chosen: gradient boosting. (Visit the link for lots of fun equations if you’re the type that gets excited by that.) It’s a supervised learning method that builds an ensemble of decision trees one at a time, with each new tree fit to the errors left by the trees before it, so that a loss such as the mean squared error against the true values keeps shrinking. The analysis uses a 60/40 training set split and is based on a paper published in the Johns Hopkins APL Technical Digest that reported an accuracy of 76%. It’s unclear from Sattanathan’s post what accuracy he got from his algorithm. Also, what was the rationale for using this algorithm versus another type? There are many algorithms to choose from, and a different choice could change the accuracy results.
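To make the idea concrete, here is a minimal sketch of gradient boosting in plain Python. It is not Sattanathan’s H2O code, and it does not use the Pima data: the two attributes, the labeling rule, the number of rounds and the learning rate are all made up for illustration, and I’m assuming the 60/40 split means 60% for training and 40% held out for testing. Each round fits a one-split tree (a “stump”) to the current residuals, which is the gradient step when the loss is mean squared error.

```python
import random

def fit_stump(X, residuals):
    """Find the one-feature threshold split with the lowest squared error."""
    best = None
    for j in range(len(X[0])):
        for threshold in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= threshold]
            right = [r for row, r in zip(X, residuals) if row[j] > threshold]
            if not right:  # the largest value leaves nothing on the right side
                continue
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            sse = (sum((r - left_mean) ** 2 for r in left)
                   + sum((r - right_mean) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, j, threshold, left_mean, right_mean)
    return best[1:]

def predict_stump(stump, row):
    j, threshold, left_value, right_value = stump
    return left_value if row[j] <= threshold else right_value

def fit_gradient_boosting(X, y, n_rounds=25, learning_rate=0.3):
    """Start from the mean label, then repeatedly fit a stump to the residuals."""
    base = sum(y) / len(y)
    predictions = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [t - p for t, p in zip(y, predictions)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * predict_stump(stump, row)
                       for p, row in zip(predictions, X)]
    return base, learning_rate, stumps

def predict_score(model, row):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, row) for s in stumps)

# Synthetic stand-in data (NOT the Pima data set): two invented attributes
# and an invented rule for who is labeled 1.
random.seed(0)
rows = []
for _ in range(200):
    glucose = random.uniform(70, 200)
    bmi = random.uniform(18, 45)
    label = 1 if glucose + 2 * bmi + random.gauss(0, 10) > 195 else 0
    rows.append(([glucose, bmi], label))

# 60/40 split: assumed here to mean 60% training, 40% held-out test.
random.shuffle(rows)
cut = int(0.6 * len(rows))
train, test = rows[:cut], rows[cut:]

model = fit_gradient_boosting([x for x, _ in train], [t for _, t in train])

# Threshold the boosted score at 0.5 to get a yes/no prediction.
correct = sum((predict_score(model, x) >= 0.5) == bool(t) for x, t in test)
accuracy = correct / len(test)
print(f"held-out accuracy: {accuracy:.2f}")
```

Reporting the held-out accuracy like this, along with the split and settings used, is exactly the number I wish the article had stated. Swapping in a different algorithm and comparing on the same held-out rows would also answer the “why this algorithm” question.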
Another issue I have with this article is the lack of transparency during the analysis. Of all the attributes collected that affect whether a Pima Indian will get diabetes (number of pregnancies, glucose, blood pressure, skin thickness, insulin, BMI, Diabetes Pedigree Function and age), it’s unclear which were chosen or excluded during the machine learning analysis. Ideally, for the research to be reproducible and verifiable, the data scientist would describe the entire process.
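One way an analyst could document which attributes a model actually relies on is permutation importance: shuffle one attribute’s column at a time and record how much the accuracy drops. The sketch below is only an illustration of that idea, not something the article did. The data is synthetic, and `stand_in_model` is a hypothetical hand-written rule standing in for a trained model, so the numbers say nothing about real diabetes risk.

```python
import random

ATTRIBUTES = ["glucose", "bmi", "age"]  # a subset of the attributes, for brevity

def accuracy_of(predict_fn, X, y):
    return sum(predict_fn(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(predict_fn, X, y, rng):
    """Accuracy drop when each attribute's column is shuffled on its own."""
    baseline = accuracy_of(predict_fn, X, y)
    drops = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        rng.shuffle(column)
        shuffled = [row[:j] + [value] + row[j + 1:]
                    for row, value in zip(X, column)]
        drops.append(baseline - accuracy_of(predict_fn, shuffled, y))
    return drops

def stand_in_model(row):
    # Hypothetical rule used in place of a trained model: it looks only at
    # glucose (index 0) and ignores the other attributes entirely.
    return 1 if row[0] > 125 else 0

# Synthetic records (NOT the Pima data): labels depend on glucose plus noise.
rng = random.Random(0)
X, y = [], []
for _ in range(200):
    glucose = rng.uniform(70, 200)
    bmi = rng.uniform(18, 45)
    age = rng.uniform(21, 80)
    X.append([glucose, bmi, age])
    y.append(1 if glucose + rng.gauss(0, 10) > 125 else 0)

importances = permutation_importance(stand_in_model, X, y, rng)
for name, drop in zip(ATTRIBUTES, importances):
    print(f"{name}: accuracy drop {drop:.2f}")
```

An attribute the model never uses shows a drop of zero, so a table like this makes it obvious which inputs were effectively excluded. Publishing it alongside the results would go a long way toward making the analysis reproducible.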
The final observation I want to make is the importance of not applying the results of this algorithm to other ethnicities. The data used in this analysis is from Pima Indians. Other research (see the American Journal of Clinical Nutrition) has shown that obesity is far more common in adult Pima Indians than in the rest of the U.S. population. Other research by NIH suggests that diabetes rates vary between ethnicities. Therefore, it’s safe to presume this diabetes prediction applies to this population and not necessarily to the general population. Understanding these nuances in data analysis and science can help improve the clarity of its results.