Data Science Institute Professor Develops Software for Automatic Machine Learning

Andreas Müller

Andreas Mueller is on a mission to make machine learning easy to use.

His book, “Introduction to Machine Learning with Python,” teaches the fundamentals of the field; he co-manages scikit-learn, a machine-learning library for Python; and on Twitter a phalanx of followers seek his advice on techniques in machine learning, a field that empowers computers to learn without being explicitly programmed.

And now, thanks to a $400,000 grant from the National Science Foundation (NSF), Mueller will create software to make machine learning even easier to use.

The NSF is funding his project to develop automatic machine learning, which means users will not have to select an algorithm for their data. Mueller’s software will automatically select an algorithm for them. It’s not easy to pick the right algorithm, says Mueller, especially for lay people who can be baffled by the complex choice of data processing, knobs and settings.

“Researchers who develop new algorithms often don’t provide them in a way that is useful for a wide audience,” says Mueller, a lecturer at the Data Science Institute who is a nationally known applied data scientist. “While there have been some attempts at creating software packages that do this task, they are not very easy to use, and are not tuned towards simple day-to-day use. The main point of this grant is to create something that is end-user friendly instead of research-quality software.”

Once developed, Mueller’s software tools will supplement scikit-learn, reducing the level of expertise required to apply models to a problem. Scikit-learn is the main machine-learning library for Python, which in turn is one of the most popular programming languages for machine learning. The library contains state-of-the-art machine-learning algorithms, as well as tools to tune and evaluate models. It’s a popular library for researchers looking to apply machine learning to a problem. Mueller is also creating a separate software package with models for automatic supervised learning. It will have a simple interface requiring minimal user interaction as well as easy-to-understand documentation on how to use the software.
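To give a rough sense of the manual work that automatic machine learning aims to remove, the sketch below compares a few scikit-learn models by cross-validation and picks the one that scores best. It is only an illustration and not Mueller's software: the helper name `pick_best_model`, the three candidate models and the bundled iris dataset are all assumptions chosen for the example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier


def pick_best_model(X, y, candidates, cv=5):
    """Hypothetical helper: score each candidate by mean cross-validated
    accuracy and return the best model's name plus all scores."""
    scores = {
        name: cross_val_score(model, X, y, cv=cv).mean()
        for name, model in candidates.items()
    }
    best_name = max(scores, key=scores.get)
    return best_name, scores


# Small example dataset that ships with scikit-learn (an assumption for the demo).
X, y = load_iris(return_X_y=True)

# Candidate algorithms a user would otherwise have to choose between by hand.
candidates = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "k_nearest_neighbors": make_pipeline(StandardScaler(), KNeighborsClassifier()),
}

best_name, scores = pick_best_model(X, y, candidates)
for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name}: {score:.3f}")
print("selected:", best_name)
```

Even this toy version leaves the user to decide which models to try, how to preprocess the data and which settings to use, which is the kind of choice the article describes as baffling for lay people.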

Alan Chung, a summer intern at the Data Science Institute who is new to machine learning, says Mueller’s effort to extend the scikit-learn toolkit will be of immense help to novices like him.

“While we can fit any algorithm to a dataset, the variations in different algorithms and model types drastically change the ability to extrapolate on future, unknown data,” says Chung, who is also reading Mueller’s book on machine learning. “An extension of existing tools to help find the ideal algorithm based on the data type will make selecting models much more efficient and will reduce the amount of theory one must understand. Further, even if one knows the theory behind algorithms, selecting an algorithm often comes down to trial and error, and an extension of the scikit-learn toolkit will drastically simplify this process.”

Machine learning has been at the forefront of recent technological advancements: self-driving cars, computer vision and speech-recognition systems being just three examples. It’s increasingly common in academia, industry and government, but its use by non-technical people has been limited by the complexity of choosing the right algorithm.

“This project will make it much easier for biology majors, astronomy majors and people from the liberal arts to use machine learning to solve problems in their fields,” says Mueller. “It’s my mission to create open source software to make machine learning available to everyone who has a problem to solve.”

— Robert Florida, Data Science Institute
