In my effort to keep building my data science skills, I entered the Kaggle March Machine Learning Mania competitions. We were given historical data from regular season and tournament results. Using that data, we built a model to predict every possible matchup in this year's tournament.
This was a fun challenge for me, and a bit humbling. I entered believing my background as a basketball coach would give me an edge with regard to feature engineering. Unfortunately, all of my formulas made the models perform worse, not better. In the end, the models with fewer features performed better.
There was one curious puzzle I stumbled upon. I built one model for both the men's and women's competitions. This model performed far better on the women's data than on the men's. At first I assumed I had made some mistake. After some digging through the data, I concluded the likely root cause is that the men's tournament is far more unpredictable. There are more "upsets" in the men's tournament, and as a result my predictive model fit better against a data set that was, well, more predictable.
I suppose that would be a theory ripe for some hypothesis testing. But for now I'm just going to enjoy watching the games.
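For anyone who wants to try, here is a minimal sketch of one way that test could look: a two-proportion z-test comparing upset rates between the two tournaments. This is not something I ran. The function name and the counts below are made up for illustration, and I am assuming an "upset" means the lower-seeded team won. You would replace the placeholder counts with real tallies from the Kaggle data.

```python
import math


def two_proportion_z_test(upsets_a, games_a, upsets_b, games_b):
    """Return (z, two-sided p-value) for H0: both upset rates are equal."""
    rate_a = upsets_a / games_a
    rate_b = upsets_b / games_b
    # Under H0 the two samples share one upset rate, so pool them.
    pooled = (upsets_a + upsets_b) / (games_a + games_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / games_a + 1 / games_b))
    z = (rate_a - rate_b) / std_err
    # Two-sided tail probability of a standard normal beyond |z|.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# HYPOTHETICAL counts, for illustration only. These are not real results.
mens_upsets, mens_games = 30, 100
womens_upsets, womens_games = 20, 100

z, p = two_proportion_z_test(mens_upsets, mens_games, womens_upsets, womens_games)
print(f"z = {z:.3f}, two-sided p-value = {p:.3f}")
```

A small p-value would suggest the difference in upset rates is unlikely to be chance alone. It would not prove that upsets explain the gap in model performance, only that the two tournaments differ in this one respect.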
Today's Sponsor
SolarWinds
Someone is complaining about the database, again. Whatever the bottleneck may be, you know it is not your fault, but it is your responsibility.
Here's a secret for you. The majority of database issues are not edge case scenarios requiring hiring a team of experts to solve.
Common root causes include locking and blocking, poor indexing, or a badly written query. SolarWinds provides a comprehensive suite of cross-platform database performance monitoring and tuning tools for both Earthed and Cloud workloads. And with the recent purchase of SentryOne, makers of SQL Sentry, we can help with the edge cases, too.
If you haven't tried our products, there's no better time to get started.
Community Links
Sweetviz
Exploratory Data Analysis (EDA) is a necessary step in any data science project. Sweetviz is an open-source Python library that generates visualizations to kickstart EDA with just two lines of code. Output is a fully self-contained HTML application.
Raw Data Podcast
Hugh Millen | P3
Rob and I talk with Hugh Millen, a former NFL quarterback for my beloved Patriots and currently a television and radio sports analyst. His story is another great example of a unique path that a person with a keen interest in data has taken.
Data Janitor Roundup
Build proactive database monitoring for Amazon RDS
The idea of using Amazon CloudWatch Logs, AWS Lambda, and Amazon SNS to roll your own monitoring of Amazon RDS seems terribly on-brand for AWS and their desire to over-engineer solutions that use a bunch of additional AWS services.
Microsoft Offers Preliminary Explanation for March 15 Azure AD Outage
If you were caught in the Azure outage this past week, the root cause details are in this article. Reading them might make you feel better about your own deployment failures.
Announcing LAMBDA: Turn Excel formulas into custom functions
For power Excel users, this is a really big deal. Lambda functions will reduce the need to write VBA or JavaScript code to perform tasks such as recursion. Yet another reason to upgrade from Excel 2010. I'm talking to you, Gordon.
Sponsor an Issue
Reach thousands of data professionals who care about data, databases, and helping others make their data the best version possible. Get in touch right here.