5 Datasets I Wouldn't Use for a Data Analysis Portfolio

Jul 26, 2025

Introduction

The goal of your portfolio isn’t just to showcase your technical skills. It’s to showcase your decision-making process and your ability to use data to solve problems. It’s to show you understand what matters. It’s to show you can extract meaning from mess, make sense of complexity, and tell a story that drives action.

It’s not a secret that to build a portfolio, one needs a dataset. However, I’ve noticed that too many aspiring analysts build projects that are doomed from the start. Not because their SQL is bad or because they suck at Python. Not because their visualizations are ugly. But because their dataset has nothing meaningful to say. It’s clean, safe, and overused. Or worse, it's answering a question that no one has asked. Yes, completely irrelevant to the real world.

Picking the right dataset is everything. Some datasets won’t let you make the case that you can solve real problems and create value, and that’s exactly the case you must make, especially in a crowded job market.

So here are five datasets I wouldn’t touch for a portfolio project, what they’re actually good for, and the steps you can take to pick better ones.

1. Titanic Dataset

Let’s be honest. You already know who died. So does the recruiter. This is one of the most famous datasets in data science. It’s been used to teach everything from logistic regression to feature engineering. But that’s the problem; everyone has done it. It's been more abused than the 30-year-old door rag at your mama's house. If your project says, "I predicted Titanic survival," it says two things:

You followed a tutorial or copied someone else's solution
You haven’t moved on yet (it’s been more than 110 years since the Titanic sank)

Apart from predicting who died and who survived, you’re not solving any business problem. It’s not enough that you’re predicting who died; you’re also about to bore the recruiter to death.

That doesn’t mean the Titanic dataset is useless. It’s great for teaching and practicing classification. It’s great for explaining model interpretability. But it’s not a story the business wants to hear. Use it for practice. Just don’t use it to show your uniqueness.

2. Iris Dataset

Nothing says, "I just started and haven’t touched anything real yet," like the famous Iris dataset. This is one of the most commonly used datasets in data science. Yes, it’s a clean little dataset with only four features and three classes. Beautiful flowers. Basic dataset.

Using it as part of your portfolio is like trying to sign up for the Tour de France while you're still learning to ride a bike with training wheels. Your self-confidence is impressive, but they’re sending you right back home. Apart from its simplicity, using it in your portfolio screams "not an original solution." Since it’s been used so often, you're not turning any heads with this.

Like the Titanic dataset, it's great for practicing classification and clustering. But for your portfolio? Use something with more mess, more context, and more consequence.

3. MNIST Dataset

I will not lie to you. This is a great dataset. It is a great dataset for anyone starting out with classification algorithms, especially in computer vision. The MNIST dataset contains a total of 70,000 grayscale images of handwritten digits (0 through 9), each 28x28 pixels.

Even if you are applying for computer vision roles, the dataset won't do much for you because everyone has already seen it. Recruiters have seen countless MNIST projects, all doing the same thing: classifying digits. Using it does not scream, "I can solve unique problems." It screams, "I followed a tutorial."

It is overused and lacks originality. It is also not messy enough. It is like cooking with pre-chopped ingredients. Fine for practice, but not good for proving you can handle raw, chaotic data. Real-world analysis involves working with incomplete records, inconsistent formats, and ambiguous variables. MNIST offers none of that, leaving you no room to show off data cleaning, feature engineering, or context-driven problem-solving.

4. Netflix Movies/IMDb Ratings Datasets

These datasets are fun. I’ll give you that. But fun doesn’t always mean valuable. Before adding one of them to your portfolio, ask yourself if the dataset will make you stand out. If we are being honest, I’ll say it will not. Go to Kaggle and you’ll notice that most of these datasets have thousands of downloads. They’re like a movie everyone has seen, and the plot is predictable. You cannot impress with such a dataset. Most importantly, it may not demonstrate much about how well you can solve business problems.

Apart from being overused, many Netflix or IMDb datasets are preprocessed, with neat columns like title, rating, and genre, and minimal missing values. That leaves you with little room to demonstrate advanced skills like merging multiple sources, handling unstructured data, or tackling large-scale wrangling. Remember, a great portfolio tells a story that makes people care about profits, customers, or societal issues. A movie dataset may not have all that.

These datasets are great for practice but weak as a first impression.

5. Synthetic Datasets

You know what puts off recruiters? When they hear that the data used in a project was not real.

With AI, anyone can generate synthetic data with ease. Synthetic datasets are great for testing algorithms or simulating scenarios where real data is hard to get. But if your portfolio is filled with fake data, you’re making it hard for someone to believe you can work with real-world data. Employers expect you to work with real data, so they are not going to be impressed by findings that aren’t based on it.

Synthetic data is too organized. In the real world, no one hands you a perfect CSV file with a bow on top. You get a dump of data from five sources, and none of them agree. Real-world data is inconsistent, incomplete, and often frustrating. That’s where the magic happens. That’s where you prove your worth.
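To make "none of them agree" concrete, here is a minimal sketch of what reconciling two sources can look like. Everything in it is an assumption for illustration: the field names (customer, date, amount), the two date formats, and the helper names are hypothetical, and it uses only the standard library so it runs anywhere. In a real project you would more likely reach for pandas.

```python
from datetime import datetime

# Assumed date formats for this sketch. Order matters: an ambiguous value
# such as 05/03/2024 is read as day/month/year here, which is a decision
# you would have to confirm with whoever owns the data.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def parse_date(text):
    """Try each assumed format; return an ISO date string, or None if none fit."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def normalize_name(name):
    """Lowercase and collapse repeated whitespace so spellings can be compared."""
    return " ".join(name.lower().split())


def record_key(record):
    """Build a (customer, date) key, or None when the date cannot be parsed."""
    date = parse_date(record["date"])
    if date is None:
        return None
    return (normalize_name(record["customer"]), date)


def find_conflicts(source_a, source_b):
    """Return (key, amount_a, amount_b) for records that match but disagree."""
    index = {}
    for record in source_a:
        key = record_key(record)
        if key is not None:
            index[key] = record["amount"]

    conflicts = []
    for record in source_b:
        key = record_key(record)
        if key in index and index[key] != record["amount"]:
            conflicts.append((key, index[key], record["amount"]))
    return conflicts


# Hypothetical records from two systems describing the same sale.
pos_export = [{"customer": "Ada  Obi", "date": "2024-03-05", "amount": 120.0}]
ledger = [{"customer": "ada obi", "date": "05/03/2024", "amount": 125.0}]
print(find_conflicts(pos_export, ledger))
```

The interesting part of a portfolio write-up is not the code itself but the decisions around it: which source you trusted when the amounts disagreed, and why.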

How to Pick the Best Datasets for Your Portfolio

Recruiters are usually impressed when a portfolio has projects that are personal, demonstrate how you solved a business problem, and show how you wrestled with real, messy data from multiple sources. Here’s how you can pick the best datasets:

Find real businesses: The best place to get a dataset is a real business. If you’re lucky enough to have access to a small business that’s trying to solve a real problem, then you have a great opportunity to build a standout portfolio. Let’s say you’re trying to understand why a family-run business has seen a drop in sales over the past year. The business might provide you with real data you can use in your project. Just make sure you have their permission to share it. Trust me, few things impress recruiters more than knowing you’ve solved real problems for a real business.
Go for complexity: Pick datasets with enough messiness to showcase your skills. Missing values, inconsistent formats, or large volumes let you demonstrate data cleaning, feature engineering, and scalability. Avoid overly clean or small datasets that hide your technical chops.
Start with a business question: One of the best ways to find an appropriate dataset is to start with a business question. The question acts as your compass and helps guide your search. Make sure the question is relevant to the industry you’re trying to break into.
Aim for storytelling potential: The best portfolios include a storytelling element. Recruiters love that. Choose datasets that let you build a narrative with a clear beginning (problem), middle (analysis), and end (actionable insight). When considering a dataset, ask yourself: Can I excite with a discovery, reveal a hard truth, and suggest a next step?

The formula is simple: pick datasets that let you combine insight (deep analysis), emotion (a relatable problem), and action (clear recommendations).
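If you want a quick way to check whether a candidate dataset has "enough messiness," a rough audit like the one below can help. It is only a sketch under stated assumptions: the function name, the list of tokens treated as missing, and the sample data are all hypothetical, and it sticks to the standard library so it runs without any installs. With pandas you would get similar signals from isna() and duplicated().

```python
import csv
import io
from collections import Counter

# Assumed placeholders that should count as "missing" (compared in lowercase).
MISSING_TOKENS = ("", "na", "n/a", "null", "none", "-")


def audit_messiness(csv_text, missing_tokens=MISSING_TOKENS):
    """Return simple messiness signals for a CSV file passed in as text."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = list(rows[0].keys()) if rows else []

    report = {}
    for col in columns:
        values = [(row.get(col) or "").strip() for row in rows]
        missing = sum(v.lower() in missing_tokens for v in values)
        present = [v for v in values if v.lower() not in missing_tokens]
        # Values that differ only by letter case hint at inconsistent entry.
        case_variants = len(set(present)) - len({v.lower() for v in present})
        report[col] = {
            "missing_share": round(missing / len(values), 2),
            "case_variants": case_variants,
        }

    # Exact duplicate rows are another common sign of a raw export.
    counts = Counter(tuple(row.get(c) for c in columns) for row in rows)
    duplicate_rows = sum(n - 1 for n in counts.values())

    return {"rows": len(rows), "duplicate_rows": duplicate_rows, "columns": report}


# Hypothetical sample: a small sales export with typical problems.
sample = """order_id,city,amount
1,Lagos,120
2,lagos,
3,Accra,N/A
3,Accra,N/A
4,Accra,75
"""

print(audit_messiness(sample))
```

A dataset that comes back with zero missing values, zero duplicates, and perfectly consistent labels is probably too polished to show off your cleaning skills.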

Wrap-Up

Your goal should be to build a portfolio that stands out. This is especially important in an era where recruiters are flooded with applications. Avoid datasets that have been used in countless tutorials and offer no real business impact. They are great for practice, but that is where their usefulness ends.

If you are serious about making an impression, go for datasets that are messy, meaningful, and tied to real-world questions. Whether the data comes from a small business, a public source, or something you scraped yourself, what matters most is how you work with it and the story you are able to tell. Thanks for reading.

Python and Data Analysis Insights
