Data science capstone ideas (and how to get started)

Oct 5, 2018

Capstones are standalone projects meant to integrate, synthesize, and demonstrate all your data science knowledge in a multi-faceted way. Capstone projects show your readiness for using data science in real life, and are ideally something you can add to your resume, show to employers, or even use to start a career.

I find data science capstone ideas are like puppies: you want all of them, but can only keep one. Below is a list of some of my ideas and starting points.

Idea #1: Nutritional analysis from Instacart orders

In 2017 Instacart released a dataset of over 3 million grocery orders from over 200,000 users as a Kaggle competition. With a dataset this juicy, a few ideas immediately come to mind:

Predict what products users will order again (this was the goal of the Kaggle challenge).
Build a model to stock the store so there are never any product shortages, but no wasted space or money in ordering.
Predict a user’s healthiness from order content.
Make a recommender system for healthier order alternatives.

The first and second are doable with the data you already have, which is nice.

The third was my personal choice: use the USDA food composition database to look up products and build a nutritional breakdown (the USDA has an API, by the way). But it also introduced a lot of hurdles:

- Users don’t eat everything they order (e.g. cat food, soap, toilet paper). This would require a lot of cleaning and munging.

- Users don’t order just for themselves (e.g. companies, birthday parties, families).

- Users order on different timelines (e.g. once per week, once every two weeks, once a month).

- Items such as deli food may not have entries in the USDA database.
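To make a few of these hurdles concrete, here is a minimal sketch of the cleaning step on made-up data. It drops non-food departments, skips items with no nutrition match, and divides by the number of days a user's orders span so that weekly and monthly shoppers are comparable. The department names, the `calories` field, and the `days_covered` lookup are all assumptions for illustration. In the real dataset you would have to build them yourself from the product tables and the USDA lookups. It does nothing about the shared-household problem.

```python
from collections import defaultdict

# Assumed labels for departments that are not eaten by the user (illustrative only).
NON_FOOD_DEPARTMENTS = {"pets", "household", "personal care"}


def calories_per_day(order_lines, days_covered):
    """Return {user_id: calories per day} from food items with a nutrition match.

    order_lines: iterable of dicts with keys user_id, department, calories
                 (calories is None when no nutrition entry was found).
    days_covered: dict mapping user_id -> days spanned by that user's orders.
    """
    totals = defaultdict(float)
    for line in order_lines:
        if line["department"] in NON_FOOD_DEPARTMENTS:
            continue  # cat food, soap, toilet paper, ...
        if line["calories"] is None:
            continue  # e.g. deli items missing from the nutrition database
        totals[line["user_id"]] += line["calories"]
    # Normalize by ordering timeline; skip users with no usable span.
    return {
        user: total / days_covered[user]
        for user, total in totals.items()
        if days_covered.get(user, 0) > 0
    }


# Made-up rows, not taken from the Instacart data.
toy_lines = [
    {"user_id": 1, "department": "produce", "calories": 700.0},
    {"user_id": 1, "department": "pets", "calories": 1200.0},
    {"user_id": 1, "department": "deli", "calories": None},
    {"user_id": 2, "department": "snacks", "calories": 2800.0},
]
toy_days = {1: 7, 2: 14}

rates = calories_per_day(toy_lines, toy_days)
print(rates)
```

Silently skipping unmatched items biases the totals downward, so in practice it is probably worth tracking what share of each order had a match.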

The fourth would also utilize the USDA database, but would not require any user-specific information or messing about with time-series.

Idea #2: Predicting solar output from satellite imaging/historical weather

One of the big obstacles to mainstream adoption of solar power is that, unlike with other energy sources (hydroelectric, oil, nuclear), you can’t control how long the sun shines. Overestimating solar output means losses for producers and investors, and downtime for users. Underestimating it makes adoption less likely when upfront decisions are made. Sounds like a job for… machine learning!

Many datasets can be found at NREL; however, they cover different years and different locations, with limits on how much you can download at once. NREL has an API, which is useful.

SolarAnywhere has an academic license, allowing you to look up any location (but only for the year 2013). They too have an API.

The NREL NSRDB data viewer is another place to look.

There are three immediate approaches I can think of:

- Using previous solar output to predict current solar output (time-series or RNN).

- Using weather datasets.

- Using satellite imaging datasets.
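Before reaching for an RNN on the first approach, it helps to have a number to beat. One simple option is a seasonal persistence baseline, which predicts each hour's output as whatever was observed at the same hour the day before. The sketch below runs it on a synthetic series. The data, the 24-hour period, and the per-day "cloudiness" factors are all made up for illustration.

```python
import math


def seasonal_persistence(series, period=24):
    """Forecast each point as the value observed `period` steps earlier."""
    return [series[t - period] for t in range(period, len(series))]


def mean_absolute_error(actual, predicted):
    """Average absolute difference between two equal-length sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Synthetic stand-in for three days of hourly output: a clipped sine wave
# (zero at night), scaled by a made-up cloudiness factor for each day.
day_factors = [1.0, 0.5, 0.75]
hourly_output = [
    day_factors[h // 24] * max(0.0, math.sin(2 * math.pi * (h % 24 - 6) / 24))
    for h in range(72)
]

forecast = seasonal_persistence(hourly_output, period=24)
actual = hourly_output[24:]  # the hours that have a forecast
baseline_mae = mean_absolute_error(actual, forecast)
print(f"persistence MAE: {baseline_mae:.4f}")
```

Any model built on weather or satellite features should beat this error on held-out days. If it doesn't, the extra data isn't earning its keep yet.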

There are a lot of academic papers on this last subject (a quick Google Scholar search returns about 30,000 results), but not a lot of publicly available satellite time-series datasets.

Idea #3: Fake news detection

This is a hot one. Without going into full rant-mode, fake news is obviously deleterious for democracy and individual mental stability.

So how do you accurately identify what’s fake and what’s true? Here are a few leads on this as a data science problem:

1. Fake News Challenge

This is the best-formatted challenge around this topic, with organizers, advisors, and volunteers from the academic, ML, and fact-checking communities. It includes GitHub repos of winning submissions. Check out the competition page on CodaLab.

2. Snopes Junk News

A starting point for well-verified fake news stories vs. actual events.

3. Getting Real About Fake News — Kaggle Dataset

A collection of nearly 13,000 items from 244 websites tagged “BS” from the BS Detector Chrome extension. The BS Detector is powered by Open Sources, a project that classifies biased and fake websites.
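Whichever of these sources you start from, a reasonable first baseline is a bag-of-words classifier on labeled text. Below is a minimal sketch of multinomial Naive Bayes with add-one smoothing, written with only the standard library. The headlines and the "fake"/"real" labels are invented for illustration and do not come from any of the datasets above. A real project would more likely use an established ML library and far more data.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.doc_counts = Counter(labels)
        self.word_counts = {label: Counter() for label in self.doc_counts}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = set()
        for counts in self.word_counts.values():
            self.vocab.update(counts)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            # Log prior plus smoothed log likelihood of each known token.
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for token in tokenize(text):
                if token not in self.vocab:
                    continue  # ignore words never seen in training
                score += math.log((counts[token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented training examples (not drawn from any real dataset).
train_texts = [
    "Miracle pill cures every disease overnight, doctors stunned",
    "Shocking secret the government does not want you to know",
    "City council approves budget for road repairs",
    "Local school district reports enrollment figures",
]
train_labels = ["fake", "fake", "real", "real"]

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("Shocking miracle secret revealed"))
print(model.predict("Council approves school budget"))
```

Word counts alone mostly pick up on tone and vocabulary, not truthfulness, so treat this as a floor to measure better features against.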