Study: Allowing guns on college campuses won’t reduce mass shootings


Policies allowing civilians to bring guns onto college campuses are unlikely to reduce mass shootings on campus and are likely to lead to more shootings, homicides, and suicides on campus—especially among students—a new report concludes.

Reference: Allowing guns on college campuses won’t reduce mass shootings

How can Lean Six Sigma help Machine Learning?


I have been using Lean Six Sigma (LSS) to improve business processes for the past 10+ years and am very satisfied with its benefits. Recently, I’ve been working with a consulting firm and a software vendor to implement a machine learning (ML) model to predict the remaining useful life (RUL) of service parts. The result I find most frustrating is the low accuracy of the resulting model. As shown below, measuring the deviation as the absolute difference between the actual part life and the predicted one, the model has average deviations of 127, 60, and 36 days for the three selected parts. I could not understand why the deviations from machine learning are so large.
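To make the metric concrete, here is a minimal sketch of how an "average deviation" like the one quoted above could be computed: the mean absolute difference between the actual part life and the predicted life. The numbers below are made up for illustration; they are not the project's real data.

```python
import numpy as np

# Hypothetical actual part lifetimes and ML predictions, in days
actual_life_days = np.array([365, 410, 290, 500, 330])
predicted_life_days = np.array([300, 520, 250, 380, 400])

# Average deviation = mean absolute difference between actual and predicted
mean_abs_deviation = np.mean(np.abs(actual_life_days - predicted_life_days))
print(f"Average deviation: {mean_abs_deviation:.0f} days")
```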


After working with the consultants and data scientists, it appears they can improve the deviation by only about 10%. This puzzles me. I thought machine learning was a great new tool for making forecasts simple and quick; I did not expect it to produce such large deviations. To me, such a deviation, even after the 10% improvement, still renders the forecast useless to the business owners. This forces me to ask the following questions:

  • Is machine learning really a good forecasting tool?
  • What do people NOT know about machine learning?
  • What is missing in machine learning? Can Lean Six Sigma fill that gap?

Note that machine learning generally targets two major categories of problems: supervised and unsupervised learning. This article focuses on a supervised learning problem, using machine learning’s regression analysis.

Lean Six Sigma

The objective of Lean Six Sigma (LSS) is to improve process performance by reducing its variance. In classical statistics, the variance of a quantity is the average of the squared deviations of the observations from their mean; in the modeling context used here, the analogous quantity is the mean squared difference between the actual values and the model’s forecast.
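In standard statistical notation (textbook definitions, not specific to any particular LSS reference), the two quantities look like this:

```latex
% Classical variance of an observed quantity y around its mean \bar{y}
\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2

% The analogous model-based quantity: mean squared difference between
% the actual values y_i and the model's forecasts \hat{y}_i
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2
```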

The result of LSS is essentially a statistical function (model) relating a set of input/independent variables to the output/dependent variable(s), as shown in the chart below.


By identifying the correlations between the input and output variables, the LSS model tells us how to control the input variables in order to move the output variable(s) to our target values. Most importantly, LSS also requires the monitored process to be “stable”: the output variance is minimized by minimizing the input variance, which is how the so-called “breakthrough” state is achieved.
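As a minimal sketch of this idea (synthetic data and hypothetical variable names, not the real process), a fitted linear model makes the input-to-output relationship explicit, so you can see how much an input must move to push the output toward a target:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x1 = rng.normal(50, 5, 200)                        # e.g. operating temperature
x2 = rng.normal(10, 1, 200)                        # e.g. load factor
y = 2.0 * x1 - 8.0 * x2 + rng.normal(0, 3, 200)    # output with process noise

# Fit the statistical function (model) between inputs and output
model = LinearRegression().fit(np.column_stack([x1, x2]), y)
b1, b2 = model.coef_
print(f"y ≈ {model.intercept_:.1f} {b1:+.2f}*x1 {b2:+.2f}*x2")

# To raise y by 10 units while holding x2 fixed, x1 must move by roughly:
print(f"required change in x1: {10 / b1:.2f}")
```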


As the chart below shows, if you reach your target (the center) without controlling variance (the spread around the target in the left chart), there is no guarantee you will stay on target; if you reduce the variance without reaching the target (right chart), you miss the target entirely. Only by keeping the result both centered and low-variance can LSS ensure that the process target is reached precisely and that process performance is sustainable and optimal. This is the major contribution of LSS.


Machine Learning (ML)

Supervised machine learning examines the relationship between a set of input variables and output variable(s) and comes up with an “approximation” of the ideal function, as shown by the green curve below.
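Here is a minimal sketch of that “approximation” idea on a synthetic one-dimensional example (the true function and noise level are made up for illustration): the model learns a curve close to the underlying function from noisy samples alone.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0, 6, 300).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.2, 300)    # ideal function + noise

# The learned model approximates the unknown function from (input, output) pairs
approx = RandomForestRegressor(n_estimators=100, random_state=1).fit(X, y)

X_grid = np.linspace(0, 6, 7).reshape(-1, 1)
print(np.round(approx.predict(X_grid), 2))         # learned curve
print(np.round(np.sin(X_grid).ravel(), 2))         # ideal curve, for comparison
```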


Similarly, unsupervised machine learning looks for a function that best differentiates a set of clusters.
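And a correspondingly minimal sketch of the unsupervised case (again with synthetic data), where the algorithm separates groups without ever being given output labels:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
cluster_a = rng.normal([0, 0], 0.5, size=(100, 2))
cluster_b = rng.normal([5, 5], 0.5, size=(100, 2))
points = np.vstack([cluster_a, cluster_b])

# KMeans differentiates the two groups with no labelled output provided
labels = KMeans(n_clusters=2, n_init=10, random_state=2).fit_predict(points)
print(labels[:5], labels[-5:])   # the two groups receive different labels
```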


Comparison between LSS and ML

It is well known that, due to bias and natural randomness, a process is inherently random; that is, it has variance. Both classical statistics and LSS show that if the input variables have large variance, we should expect large variance in the output variable(s).
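The standard variance-propagation result for a linear relationship makes this explicit (independent inputs assumed; this is textbook statistics, not something specific to the project above):

```latex
% If y = b_1 x_1 + b_2 x_2 + \varepsilon with independent inputs and noise,
% the output variance inherits the input variances scaled by the squared coefficients:
\operatorname{Var}(y) = b_1^2 \operatorname{Var}(x_1) + b_2^2 \operatorname{Var}(x_2) + \operatorname{Var}(\varepsilon)
```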


This strongly suggests that a machine learning model will be inaccurate when its input variables have large variance. This, I think, is why my recent machine learning project produced such inaccurate predictions, and also why the data science consultants could improve the accuracy by only about 10%.

People may argue that machine learning does have a step called data cleansing to improve the quality of prediction. The problem is that ML’s data cleansing is not the same as LSS’s variance reduction. In LSS, practitioners go back to the business process to find the sources of variance in the input variables and then eliminate the bias or reduce the variance of those inputs (factors). In ML, people do not revisit the business process; they only correct data errors or remove records that do not make sense. Such data cleansing does not actually reduce variance; it may not change the input variance at all. The ML model therefore cannot be expected to work well if people do not understand the role of variance.

As an example, if the left chart below represents the data points after data cleansing, the red curve is the best ML model we can get. But if the right chart below represents the data points after variance reduction, the resulting ML model is much more accurate.
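A minimal simulation of that contrast (entirely synthetic data, with measurement noise standing in for input variance): the same learning algorithm is trained twice on data from the same underlying relationship, once with large input variance and once after the input variance has been reduced at the source. The error metric is the same “average deviation” (mean absolute error) discussed earlier.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)

def average_deviation(input_noise_std):
    true_driver = rng.uniform(0, 100, 500)           # the real process driver
    part_life = 3.0 * true_driver + 200              # true part life in days
    # The model only sees a noisy measurement of the driver (input variance)
    measured_driver = true_driver + rng.normal(0, input_noise_std, 500)

    X = measured_driver.reshape(-1, 1)
    model = LinearRegression().fit(X, part_life)
    return np.mean(np.abs(part_life - model.predict(X)))

print(f"large input variance  : {average_deviation(30):.0f} days deviation")
print(f"reduced input variance: {average_deviation(5):.0f} days deviation")
```

Removing a few outliers from the noisy dataset would not change this picture much; shrinking the input variance itself is what brings the deviation down.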


In summary, I think the current data cleansing step of ML needs to include the variance reduction techniques of LSS in order to produce an accurate, reliable, and effective model for either supervised or unsupervised learning. People need to spend the effort to review the underlying business process and reduce input variance to make ML work better for real-world problems.

Software vendors and data science consulting firms should embrace variance reduction techniques in the data cleansing phase of ML to deliver ML’s real value.

Report sets research priorities for Biden’s cancer moonshot


Big Data and data analytics are major agenda items of the moonshot program.

… The moonshot report recommends creating a national network to give more patients around the country access to tumor profiling. Those patients also would be able to share their genetic data with researchers, and volunteer for cutting-edge clinical trials of treatments that match their genetics.

Reference: Report sets research priorities for Biden’s cancer moonshot

Firms may violate workers’ medical privacy with big data


It may be time to take a step back and re-evaluate how U.S. companies are using big data gathered in employee wellness and other health care analytics programs.

… In an editorial posted Tuesday in JAMA Internal Medicine, some Texas researchers argue that use of big data to predict the “risk” of a woman getting pregnant may be crossing a line. It could exacerbate long-standing patterns of employment discrimination and paint pregnancy as something to be discouraged.

… The concern arose after reports of how one health care analytics company launched a product that can track, for example, if a woman has stopped filling birth-control prescriptions or has searched for fertility information on the company’s app. And women may not even be aware that such data is being collected.

Reference: Firms may violate workers’ medical privacy with big data

Expanding Medicaid may lower all premiums


This is a very good data insight.

… The Obama administration for years has been pleading with states to expand their Medicaid programs and offer health coverage to low-income people. Now it has a further argument in its favor: Expansion of Medicaid could lower insurance prices for everyone else.

… By comparing counties across state borders, and adjusting for several differences between them, the researchers calculated that expanding Medicaid meant marketplace premiums that were 7 percent lower.

Reference: Expanding Medicaid may lower all premiums

Study links fracking industry wells to increased risk of asthma attacks


People with asthma who live near larger or more numerous active unconventional natural gas wells operated by the fracking industry in Pennsylvania are 1.5 to four times more likely to have asthma attacks than those who live farther away, new research from the Johns Hopkins Bloomberg School of Public Health suggests.

Reference: Study links fracking industry wells to increased risk of asthma attacks

Mosquito egg collectors of the U.S., unite


This article highlights the importance of data in decision making, as well as a brand-new approach to government crowd-sourcing.

“We don’t have a lot of data — good, solid data,” said John-Paul Mutebi, an entomologist with the U.S. Centers for Disease Control and Prevention.

Volunteers now are needed to collect mosquito eggs in their communities and upload the data to populate an online map, which in turn will provide real-time information about hot spots to help researchers and mosquito controllers respond.

Reference: Mosquito egg collectors of the U.S., unite