Category Data

403. Data Distribution Shifts

▮ Data Shift
After deploying your model, you’ll need to keep maintaining it because data is constantly changing. In this post, I’d like to share three different types of data distribution shifts that may occur and can degrade…

402. Data Leakage

▮ Leakage
Data leakage occurs when some form of the label “leaks” into the set of features used for making predictions, even though that information is not available during inference in production. Data leakage is hard to…

397. Finding Data

▮ Where to Find
When starting a new project, you may need additional data to train your machine-learning model. In this post, I’d like to share a couple of resources that may help you find more…

396. Topological Data Analysis

▮ Data
Data volume has grown exponentially, especially over the past few years. The plot below by Statista shows that the data volume this year (2023) has nearly doubled compared to 2020. However, despite the abundance of data…

394. Dataset/Software License To Look Out For

▮ License
When a new machine-learning project starts, I sometimes get caught up in reading research papers and trying out new machine-learning models, only to realize later that I can’t use the data or the GitHub repo because of the…

377. Storytelling With Data

▮ Visualizing Data
All machine-learning engineers inevitably have to deal with data. That means there will always be situations where they have to use that data to create proposals and communicate with their clients. Being able to visualize…

363. Splitting Datasets

Training Your Model
When training a model, the dataset is often divided into a training set, a validation set, and a test set. The split ratio among these three sets depends on how large your dataset is,…

361. How Data Augmentation “Increases” Data

Data Augmentation
Data augmentation is a technique used to “increase” the amount of data available to train a model. It can be helpful when you don’t have enough data or when you want to…

356. Data Preprocessing Steps

The Main 4 Steps
There are four main steps in data preprocessing: Data Quality Assessment, Data Cleaning, Data Transformation, and Data Reduction.
1. Data Quality Assessment
Before jumping into coding, evaluating the overall data quality is essential. Here are several problems…

337. Discretizing Data

Why
Here are several reasons why you may need to use discretization:
It is often easier to understand continuous data when it is divided and stored into meaningful categories or groups.
It is easier to find correlations with the target variables after…