Data Science Institute Has a Top Team of Core Developers For Scikit-learn

In Andreas Mueller, the Data Science Institute has one of the nation’s most prominent scikit-learn core developers. And now that two of his team members have been named core developers, DSI has become the go-to place for questions on how to use scikit-learn, the immensely popular open source machine-learning library.

Joining Mueller as core developers for scikit-learn are DSI postdoctoral researcher Nicolas Hug and software developer Thomas Fan. Both were nominated by existing scikit-learn core developers and recently voted in, which gives them voting rights on the project's design decisions.

“DSI now has one of the biggest and best teams of scikit-learn core developers in the world,” says Mueller, an associate research scientist at DSI and author of the book Introduction to Machine Learning with Python. “Nicolas and Thomas have both contributed to enhancing the library and making it more user-friendly, so that researchers from all fields, not just technical fields, can use and benefit from scikit-learn.”

Hug and Fan work with Mueller to develop scikit-learn and to maintain key parts of the library. Core developers review code contributions, merge approved pull requests, and guide the library's development by weighing in on major changes to its application programming interface.

In his research, Hug focuses on integrating automatic machine-learning tools into scikit-learn. He has made a family of gradient-boosted tree algorithms run much faster, cutting their running time from five minutes to about five seconds, so they now train faster and return predictions sooner. Hug also helps users who have questions or need guidance, which often means reviewing code and finding bugs. Overall, his work improves the library and helps build a larger community of users.

Fan, the software developer, has done research to enhance the library in a few key ways, beginning before he joined Columbia. First, he improved the continuous integration system, which enables new code contributions to be tested in the cloud. This testing infrastructure allows scikit-learn to maintain a high-quality codebase. He also unified the application programming interface (API) for the compose module, which gives users the flexibility to combine machine-learning models. In addition, he helped build the documentation at scikit-learn.org, a manual for using scikit-learn, and sped up the caching and downloading of datasets from openml.org, a platform for sharing datasets and analysis.

“I’m passionate about open source programming on scikit-learn,” says Fan, “because it is a great way to increase the impact of ideas. Specifically in machine learning, once the idea is codified anyone can use it and derive value from it.”

Mueller, who teaches classes at DSI on machine learning while also working on National Science Foundation-supported research, has been a core developer for scikit-learn for seven years, managing the daily upkeep of the library. His 31,800 Twitter followers turn to him for authoritative advice on using Python and machine learning for data science. And he’s delighted to have a team of core developers at DSI working to enhance scikit-learn and make it easier for everyone to use.

“It’s become part of our mission here at DSI,” says Mueller, “to create open source software to make machine learning available to everyone who has a problem to solve.”

— Robert Florida, Data Science Institute
