Data Science Institute Students Conduct Research Across Columbia University

Students in the master’s program at the Data Science Institute (DSI) learn the most advanced data-science techniques, be it machine learning, statistical inference and modeling, or deep learning. And now, thanks to a program called Campus Connections, many of them are using those techniques to assist Columbia professors with their research.

Campus Connections unites DSI students with professors who need assistance on research projects, especially with data analysis. The students who participate in the program learn what it’s like to do academic research, while the professors get help from the brightest young minds in data science. Though leading experts in their fields, the professors may not be trained as data scientists. Nonetheless, they must contend with the big data now being generated in all areas of research; the DSI students help them process and evaluate that data.

Some of the students who sign up for Campus Connections will go on to pursue doctoral degrees and careers as research scientists. Many, though, aim to work as data scientists for major companies. And having research experience will impress prospective employers when the students interview for jobs, says Rachel Cohen, Assistant Director of Student Services & Career Development at DSI, who launched the Campus Connections program in 2017 with Jonathan Stark, Chief Administrative and Operations Officer.

“The program supports faculty who need help with data while giving our students invaluable research experience,” says Cohen. “Our students are in great demand, but having research experience makes them even more attractive to employers, who prefer to hire graduates with demonstrated data-science skills. Campus Connections gives our students that experience.”

Cohen says she promotes the program through a tri-annual email campaign and, with the success of the program’s initial two years, through word of mouth. There is also a Campus Connections website where professors can ask for assistance with research projects. She then advertises the projects via email to the DSI student body, so those interested in doing research can connect with the professors. “What ensues,” she adds, “is a positive collaborative experience for both parties as well as interesting research results.”

The Effect of Climate Change on Marine Biology

One successful Campus Connections project is continuing at the Earth Institute, where Professor Joaquim Goes studies the effect of climate change on marine biology. Goes’s research team travels by ship to different parts of the Atlantic Ocean to collect water samples. The team is designing an automated flow-through system in which seawater is drawn into their moving ship and continuously analyzed. This is an advance over the usual method, in which ocean researchers must stop their ships at pre-planned locations to collect samples; the new system allows the ships to keep moving.

The team is also gathering data on the diversity of microscopic plant life, particularly plankton, which are critical to the marine ecosystem and for assessing the ocean’s ability to sequester carbon dioxide from the atmosphere. Plankton form the basis of many food chains and are an important indicator of an ocean’s health. When fully functional, the system will provide data required for validating satellite images of the ocean now being developed by NASA, NOAA and other agencies.

Last fall, using Campus Connections, Goes teamed up with three DSI students (Ankit Peshin, Ziyao Zhang and Paridhi Singh) to develop an automated classification system for the phytoplankton types; previously the team relied on manual methods to classify images of the plankton. The classification system uses deep learning, specifically convolutional neural networks, to automatically classify phytoplankton types based on their shape, size and other distinct morphological features.

“Once completed, the system will be a considerable step forward in automating our flow-through system for large-scale oceanographic studies,” says Goes, a Research Professor at Lamont-Doherty Earth Observatory. “For us at Lamont, DSI’s Campus Connections initiative has provided an opportunity to think outside of the box. Our partnership with this team of extremely bright students will be a considerable step forward in automating our flow-through system for large-scale oceanographic studies.”

One of the students, Ankit Peshin, is using existing plankton images to train a neural network model to automatically classify the different plankton, allowing the team to better understand the biological composition of each water sample.

“With projects like these, the more data (i.e., plankton images) you have the better,” says Peshin. “A lot of experimentation is involved to see what works, since neural networks essentially act as black boxes,” he adds, meaning that their internal workings are not readily understood.

Peshin is delighted to have the chance to help develop the image-classification method. The theoretical knowledge he learned in his DSI classes has proven especially relevant to the project, he says, adding, “I’m glad to have been connected to this opportunity through Campus Connections.”

“My post-graduation plans aren’t set in stone, but having research experience will definitely help me,” he says, “whether I pursue a career in industry or academia.”

Using Machine Learning to Enhance Investigation Process

Campus Connections is also open to administrators, not only professors, and one project connected a DSI student with the Office of Internal Audit at Columbia. Student Adarsh Chavakula helped the office develop a machine-learning-based system to evaluate purchasing card transactions for possible non-compliance with university policy. In particular, he designed a system to identify transactions in which purchases that appeared legitimate had a higher probability of non-compliance or misappropriation. Chavakula’s machine-learning tool assists the auditors by examining a large number of transactions at once and flagging a small number of cases for them to review.

Chad Rothbart, a Senior Auditor in the department, says that Chavakula’s project is helping the department to flag problematic purchases automatically by predicting which transactions appear most suspicious. Chavakula’s predictive machine-learning tool, moreover, ended up being more effective and efficient than the methods currently used by auditors to conduct their audits, he says.

“I can’t tell you how impressed we all were with Adarsh and how terrific he was to work with,” adds Rothbart. “He has enabled our office to be more efficient and it was all a result of Campus Connections reaching out to the DSI master’s students and finding Adarsh for us.”

Chavakula used an array of data-science skills to build the machine-learning system, work that he says complemented what he was learning in his data-science classes. Working on the Internal Audit project, he adds, was also a great way to understand the challenges that a professional data scientist would encounter while trying to solve real-world problems. He believes it’s crucial for DSI students to gain experience outside their regular coursework, so that they can “fully appreciate the value their skills can create while also learning their limitations.”

Chavakula, who after he graduates this month will join Moody’s Investors Service in New York as a Data Scientist, says the research skills he learned at the Office of Internal Audit will help him succeed at his new job.

“Campus Connections provides students an opportunity to work on challenging real-world problems,” he says, “which add to their expertise as data scientists.”

Reports of Police Violence and Public Health

A third Campus Connections project addresses a subject that’s much in the news these days: police violence. Courtney Cogburn examines how reports of police violence in the media may adversely affect the health of blacks and whites differently. As an assistant professor at the Columbia School of Social Work, Cogburn studies racism and related effects on racial differences in health. For one of her projects, she wanted to see if spikes in news stories describing police violence in certain years correlated with increases in stress-related illnesses. She wanted to understand, for instance, if increasing news accounts of police violence are a type of “stress exposure” that is linked to stress-related conditions such as anxiety, heart problems or substance abuse. To help her analyze data for the project, Cogburn brought on DSI student Akanksha Rajput, who is designing a machine-learning algorithm that can gather, evaluate and classify hundreds of thousands of news articles on the topic of police violence in black communities.

Rajput has trained a machine-learning algorithm to classify articles on the topic that appeared in four newspapers (the Wall Street Journal, the New York Times, the Washington Post and USA Today) between the years 2006 and 2018. She’s gathering the articles from Factiva, a database of archived news articles. It would have taken too much time and labor for Cogburn to wade through 12 years of news articles in four major newspapers. Rajput is thus using her algorithm to automate the news-gathering process, which has been an immense help to the project, says Cogburn.

“Akanksha [Rajput] has been amazing,” adds Cogburn. “This is a complex project that combines public health and social work with journalism and data analytics. She immediately understood and intuited what I needed–a database of articles on police violence that would be easy for my team to use, and went beyond what I asked for.”

Rajput is delighted to have the chance to use the skills she’s learning in the DSI master’s program to create her algorithmic model. Some of those skills include scraping data from Factiva and saving the relevant articles in comma-separated values format; preparing the training data to implement a supervised learning algorithm; cleaning the text data; training the classification algorithms to predict whether an article is relevant to the project; counting the relevant articles in the time frame (2006-2018) in the four newspapers; and creating interactive plots to represent the data in a graphical form. The graphs also show the years that had spikes in the number of news stories about police violence, a graphical statistic that’s extremely helpful to Cogburn, who is trying to correlate those spikes with increased incidents of stress exposure and stress-related illnesses in the communities.
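
For readers curious what such a pipeline might look like, the steps above (cleaning text, training a supervised classifier, predicting relevance, counting relevant articles per year and spotting spikes) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Rajput’s actual code: the article does not say which classification algorithm she used, so the naive Bayes classifier, the made-up headlines, the spike threshold and every function name below are hypothetical.

```python
import math
import re
from collections import Counter


def clean(text):
    """Lowercase the text and keep only alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(labeled):
    """Fit a multinomial naive Bayes model from (text, is_relevant) pairs."""
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter()
    for text, label in labeled:
        doc_counts[label] += 1
        word_counts[label].update(clean(text))
    vocab = set(word_counts[True]) | set(word_counts[False])
    return word_counts, doc_counts, vocab


def is_relevant(model, text):
    """Return True if the 'relevant' class has the higher log-probability."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in (True, False):
        score = math.log(doc_counts[label] / total_docs)
        # Add-one (Laplace) smoothing over the shared vocabulary.
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in clean(text):
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1) / denom)
        scores[label] = score
    return scores[True] > scores[False]


def relevant_per_year(model, articles):
    """Count the articles classified as relevant in each year."""
    counts = Counter()
    for year, text in articles:
        if is_relevant(model, text):
            counts[year] += 1
    return counts


def spike_years(counts, factor=1.5):
    """Years whose count exceeds `factor` times the mean yearly count."""
    mean = sum(counts.values()) / len(counts)
    return sorted(year for year, n in counts.items() if n > factor * mean)


# Hypothetical hand-labeled training headlines (True = relevant).
training = [
    ("Police officer shoots unarmed man during traffic stop", True),
    ("Protests follow police shooting of teenager", True),
    ("Video shows police use of force in arrest", True),
    ("City council approves new budget for parks", False),
    ("Local team wins championship game", False),
    ("Stock market closes higher on tech earnings", False),
]

# Hypothetical unlabeled articles as (year, headline) pairs.
articles = [
    (2013, "Police shooting sparks protests downtown"),
    (2014, "Officer shoots man during arrest"),
    (2014, "Protests grow after police shooting"),
    (2014, "Police use of force questioned after video"),
    (2014, "Tech stock earnings lift market"),
    (2015, "Team wins game in budget thriller"),
    (2015, "Police arrest video prompts protests"),
]

model = train(training)
counts = relevant_per_year(model, articles)
spikes = spike_years(counts)
```

On this toy data the yearly counts single out 2014 as a spike. A real project would work from thousands of labeled articles, hold out data to measure the classifier’s accuracy, and likely rely on an established machine-learning library rather than a hand-written model.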

“I’m really enjoying working with Professor Cogburn’s team and I’m learning a great deal about how to apply data skills to a project that’s so important to society,” says Rajput.

Given the success of the project, Cogburn has asked Rajput to assist her on an upcoming project in which she will collect “sentiment” data from Twitter. Her idea is to collect a large sample of Twitter data relating to the “stress responses of blacks to racist events reported in the media.” For this project, Rajput will assess the Twitter data using the same skills she called upon to create her machine-learning algorithm.

Cogburn says she couldn’t be more pleased with Rajput’s work and is happy that Campus Connections was available to her.

“Akanksha [Rajput] has given brief presentations to members of my lab about working on translating technical concepts to a lay audience and she’s also teaching a few students on my team the basics of the algorithm she’s created,” Cogburn says. “It’s my hope that a couple of these students will be in a position to continue running the algorithm if we happen to need additional articles after Akanksha leaves us.”

Rajput also “has very thoughtfully contributed to our conversations on racism in the United States and ways we may address these issues in the work we’re doing,” Cogburn adds. “We’re very lucky to have her as part of our team.”

— Robert Florida, Data Science Institute
