Datathon Case Description
This document describes the requirements for the case description for the Datathon organized by Data Science Society. This is a revised version as of September 2018.
What does Data Science Society provide?
The Datathon is conducted both online and offline at the same time, hence there are some new requirements for the case description.
Data Science Society promotes each case online and offline (before and during the event) with equal treatment, using the marketing (and other) information supplied by the partner company. Data Science Society provides equal visibility of all cases among the participants.
Data Science Society may organize additional mentor assistance for any of the cases.
What does the partner company provide?
Each partner company provides a case and, depending on its overall quality, it might get selected to be solved by teams at the event. The teams are free to choose which cases to work on, so only the attractive and interesting cases will be selected.
The partner company provides marketing information about its mentors (short bio, photos) about two months before the event.
With each provided case the following should be included:
· Written description (see 2., 3., 4. & 5. below);
· Data set (see 6. below);
· Mentorship by industry experts (see 7. below);
· Short video presentation (see 8. below).
The data set will be available to the teams non-confidentially.
The goal of this section is for the participants in our Datathon to gain sufficient understanding of the problem under investigation from a business perspective.
More precisely, answers to the following questions have to be provided:
· What is the business problem?
· Why does the business need to solve the problem?
· What are the important problem specifics (from a business point of view) that have to be accounted for in the solution?
· Are there hypotheses that have to be investigated and possibly incorporated into the solution?
· What are the business requirements that have to be satisfied for the final result to be satisfactory?
Practical examples can shed additional light on the problem description.
This section consists of the problem description from a data science point of view. It is meant to guide the teams on the general directions they should take in their solution.
Firstly, this part should include the recommended data science approach, methods, and/or techniques, whenever this is possible or desirable for the case.
Secondly, some hints, examples, best practices, relevant papers, etc. could be provided, if possible, to clarify the case and to further direct the teams.
Thirdly, useful insights, cross-sections, distributions, etc. may be provided for particular parts of the case.
Lastly, define the acceptable accuracy measures, KPIs, and other technical requirements.
This part describes the dataset provided for the case solution:
· Precise description of each variable, its meaning, and the meaning of each of its values.
· The structure of the data set(s) – indexes of independent and/or dependent variables.
· Variable types, ranges, special values, missing values, etc. in the data set.
· If participants have to collect additional data: description of appropriate data sources, structures, etc.
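As an illustration of the items above, the following minimal sketch derives a rough per-variable summary (inferred type, number of missing values, range or distinct values) from a CSV file. It is only a starting point for the written description, not a replacement for it. The column names, the set of missing-value markers, and the function name summarize_columns are assumptions made for this example.

```python
import csv
import io

# Values treated as missing; this set is an assumption and should be
# adjusted to the conventions of the actual data set.
MISSING_MARKERS = {"", "NA", "NaN", "null"}


def _is_number(text):
    """Return True if the text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def summarize_columns(csv_file):
    """Build a per-variable summary from an open CSV file with a header row."""
    reader = csv.DictReader(csv_file)
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name in columns:
            # Short rows yield None for absent fields; treat them as empty.
            columns[name].append(row[name] or "")

    summary = {}
    for name, values in columns.items():
        present = [v for v in values if v not in MISSING_MARKERS]
        numeric = bool(present) and all(_is_number(v) for v in present)
        entry = {
            "type": "numeric" if numeric else "text",
            "missing": len(values) - len(present),
        }
        if numeric:
            numbers = [float(v) for v in present]
            entry["range"] = (min(numbers), max(numbers))
        else:
            entry["distinct_values"] = sorted(set(present))
        summary[name] = entry
    return summary


# Tiny in-memory example; the column names are hypothetical.
example = io.StringIO(
    "customer_id,age,segment\n"
    "1,34,retail\n"
    "2,NA,corporate\n"
    "3,51,retail\n"
)
data_dictionary = summarize_columns(example)
for variable, info in data_dictionary.items():
    print(variable, info)
```

The meaning of each variable and of its special values still has to be written by someone who knows the data; the summary only covers what can be read off the file itself.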
This part of the text provides information about the required/desired results from the teams working on the given case. These expected outputs may depend on the business problem, the research specifics, the data set, or some other technical requirement.
The expected outputs may include a description of the various items expected to be built while solving the case, such as:
· Algorithms, workflows, etc.;
· Possible type of models, rulesets, etc.;
· Source code, used environments and libraries.
All outputs (including those expected by the case) will be documented in a publicly available paper (DSS template), structured along the concept of CRISP-DM.
The data size should be limited to about 20 GB (preferably 10 GB). If the data set is big, it is better to distribute it among several files. It is also highly advisable to provide data samples so that the structure is easy to understand. The data set will be available to the teams non-confidentially. If the case permits, two data subsets have to be provided: a control data subset and a working data subset. The control data subset stays hidden from the participants throughout the competition.
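A minimal sketch of how a partner might prepare the data along these lines: split the records into a working and a control subset, distribute the working subset among several parts, and take a small sample. The random split, the 20% control fraction, the seed, the part size, and the record fields are all assumptions made for this example; some cases (for instance time series) may call for a different kind of split.

```python
import random


def split_working_control(rows, control_fraction=0.2, seed=42):
    """Randomly split rows into (working, control) subsets.

    control_fraction and seed are illustrative defaults, not values
    prescribed by these guidelines.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_control = int(len(shuffled) * control_fraction)
    return shuffled[n_control:], shuffled[:n_control]


def chunk(rows, rows_per_file):
    """Yield consecutive slices of rows, one slice per output file."""
    for start in range(0, len(rows), rows_per_file):
        yield rows[start:start + rows_per_file]


# Hypothetical records standing in for the real data set.
records = [{"id": i, "value": i * 10} for i in range(100)]

working, control = split_working_control(records)
parts = list(chunk(working, rows_per_file=30))
sample = working[:5]  # small sample for a quick look at the structure

print(len(working), len(control), [len(p) for p in parts], len(sample))
```

Only the working parts and the sample would be shared with the teams; the control subset is kept back for evaluating the solutions.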
In order to support the teams in their task, for each case there should be business and data science expert(s) from the company, who:
· Provide a short bio, including technical background and fields of expertise, about two months before the event;
· Are available online in Data.Chat one week before the event for about an hour every day;
· Participate in a Q&A session, offline and online, for one hour before the event;
· Are available during the competition for at least two time slots of 3 hours;
· Present the case in a 5-min. video (provided about 1 month before the event), which will be uploaded to the YouTube channel of DSS;
· Judge all teams’ solutions, including voting according to the rules of the competition;
· Are obliged to support all teams that ask for their advice.
The provided video will be the only spoken presentation of the case to the participants in the event. The video should be up to 5 minutes long. It will be uploaded to the DSS YouTube channel, so it should be optimized according to YouTube guidelines (see them here…).
The video should contain:
· Spoken explanations of the case, including: business problem formulation, research problem specification, data description, expected outputs, and a humorous joke;
· Suitable visualizations for the important aspects of the case.
Aside from the video, the following may be provided:
· Subtitles (advisable);
· Short written description of the video and/or the presenting persons, also including necessary links.