The gamification of apps and services and its consequences have been a disaster for the human race.
Well, OK, maybe not that bad. But they aren't helping to make the world a better place, that's for sure.
Gamification is a concept companies use to help drive engagement. Unfortunately, it only drives people to adjust their behaviors to generate rewards. Because that's how humans work. When you measure anything, especially metrics for rewards, we humans make an effort to maximize our score.
But that alone isn't making us stupider. It is our inability to perform critical thinking when we come across some random piece of code we find on the internet, promising to solve our problem.
Let's look at Kaggle as an example. As someone still in the beginner-to-intermediate phase of their data science journey, I spend a fair amount of time on Kaggle. Like many websites, Kaggle offers rewards for engagement. At first glance, these categories and levels sound great! But after some time, you come to understand how people will do anything they can to earn high rankings by taking shortcuts.
Let's look at the Housing Prices competition. This is meant to be a practice competition for new users to get familiar with Kaggle itself. But the data itself is fairly comprehensive and offers a challenge to data scientists of all levels. The goal is to take sample housing price data and build a predictive model.
As you would expect, Kaggle offers a way for users to publish and share their code. Here's one example, where a user published their notebook, claiming they scored in the top 0.3% of entries at the time (two years ago). The notebook is filled with a lot of useful regression and exploratory data analysis techniques. It's quite good.
Except for the part where the user took a shortcut.
In section 19 of the notebook there is this comment: "Combine train and test features in order to apply the feature transformation pipeline to the entire dataset".
And this is where critical thinking comes into play. Many examples for this competition include the combination of training and test data in order to build a predictive model. Chances are one person used this method and subsequent users, novices to data science, all followed along believing this is a standard technique.
But it's not. It's lazy, and it likely gives your model an artificially better score due to data leakage: information about the test set bleeds into the transformations your model is trained on.
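Here's a minimal sketch of the problem, using made-up numbers (not the actual competition data). When you fit a transformation such as standardization on the combined train and test data, the test set's distribution leaks into the parameters; fitting on training data alone avoids this.

```python
from statistics import mean, stdev

# Hypothetical feature values for illustration only.
train = [100.0, 150.0, 200.0, 250.0]
test = [400.0, 450.0, 500.0]

def standardize(values, fit_on):
    """Scale values using the mean and stdev learned from fit_on."""
    m, s = mean(fit_on), stdev(fit_on)
    return [(v - m) / s for v in values]

# Correct: learn the scaling parameters from training data only.
clean = standardize(test, fit_on=train)

# Leaky: learning parameters from train + test lets information about
# the test distribution flow into the transformation itself.
leaky = standardize(test, fit_on=train + test)

print(clean)
print(leaky)
```

The two results differ because the leaky version "knows" how large the test values are before the model ever sees them. The same principle applies to imputing missing values, encoding categories by frequency, or any other fitted transformation: fit on train, then apply to test.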
Still, it's a good notebook and has value in many areas. Much more value than this notebook, where the user built a perfect submission by finding the answers online. There's no real explanation for this code, so perhaps they did this in an effort to test their models without hitting the limit of ten submissions a day.
Or perhaps they just wanted to get a perfect score. Because upvotes.
And this is the issue I see. Someone new to data science might believe these techniques are standard. And there isn't anyone at Kaggle offering instruction on "right" versus "wrong"; this is something we users need to learn for ourselves. Oh sure, you can think this is where the Kaggle community would come to the rescue, and all pitch in to decide what is good, bad, or wrong.
I think we've seen enough of these websites over the past two decades to understand that none of us is as dumb as all of us. Thinking a community of users produces the most accurate information is a fool's errand. All this does is produce information a handful of loud voices believe to be correct.
Humans are bad at this. And gamification isn't helping to make things better.
Today's Sponsor
Look, choices can be hard. We get that. We see and hear it all the time when customers ask "should I buy DPA or SQL Sentry?"
At SolarWinds we decided the correct answer was "why not have both?"
So that's what we have done. Think of it as Thunderdome - Two products enter, one price leaves.
That's right, for one price you get access to both products. You can mix and match the products in your environment as you deem necessary.
If you haven't tried our products, there's no better time to get started.
Community Links
Windows 11 Preview Hands-On: Much Ado About Menus
You may have heard about the new Windows 11, and I thought this brief review was worth sharing. I'm interested to learn more about the CPU requirements, and why they are necessary.
Events
We are moving forward with plans to host Live! 360 this November in Orlando.
Live! 360 brings the IT, Developer, and Data communities together for six days of training, knowledge sharing, and networking. With unlimited access to Live! 360’s five co-located events, you and your team will get the training you need to keep you and your business competitive and future-ready.
Send any questions about the event to me at SQLRockstar@thomaslarock.com
Data Janitor Roundup
Wegmans Exposes Customer Data in Misconfigured Databases
Just a friendly reminder regarding basic database security. Breaches like these are not happening because an adversary is breaching your perimeter; they happen because you leave your data exposed to the public internet. It wouldn't hurt to check your current default configuration settings.
CVS Database of 1 Billion Data Points Inadvertently Exposed
As I was just saying... On the plus side, if you printed out this database it still would not be as long as a CVS receipt.
Why Scientists Need To Be Better at Visualising Data
Not just scientists; everyone needs to be better at data visualization. Good visualizations enhance the story your data is trying to tell. Bad visualizations do the opposite, and they also communicate to the end user that you might not know what you are doing, calling your analytical skills into question.
Sponsor an Issue
Reach thousands of data professionals who care about data, databases, and helping others make their data the best version possible. Get in touch right here.
Tweet of the Week