394. Dataset/Software License To Look Out For

▮ License

When a new machine-learning project starts, I sometimes get caught up in reading research papers and trying out new machine-learning models, just to finally notice that I couldn’t use the data or the GitHub REPO because of the license agreement.

So for this post, I’d like to share several types of license agreements for both open-source software and dataset that data scientists should be careful with when the machine-learning model they are trying to develop is going to be for commercial use.

There are 2 categories we should look out for.

  1. Software License
  2. Dataset License

Let’s go through each one.

▮ Open Source Software License

There are mainly 2 types of OSS(open source software) licenses; Permissive and Copyleft.

Permissive Software Licenses

Permissive-licensed software has only a handful of conditions for the users to make the software their own through changes or additions. The most popular licenses are the following.

  • MIT license
    This license only requires distributed programs that use the licensed software to include a copyright notice and copy of the license text. This can be commercially used.
  • Apache License 2.0
    This is similar to the MIT License and can also be used for commercial use but has several additional constraints like the following.

    • Required to state major changes from the original
    • Include a copy of the NOTICE file with attribution notes
Copyleft Software Licenses

On the other hand, copyleft-licensed software generally requires that any work done through this software must be released under the same license as the original. The most popular licenses are the following.

  • GPLv2
    This is considered the “strong” copyleft license because it applies to the entire software system.
  • Mozilla Public License 2.0
    This is considered the “weak” copyleft license because it applies to a narrower set of codes.
Double Checking

When you check out the GitHub repo, you can confirm the license on the right side of the page like the image below.

However, you should always look through the whole README as well to see whether there are any additional agreements we need to oblige to.

Even when the repo itself has a license that allows commercial use, in some cases, there are additional constraints written on the README file such as “Only for research purposes”.

Fig.1 Tensorflow License

▮ Dataset License

After you have checked the software, the next step is to check for the data license agreement as well if you are planning to use that data to train your model. Here are a couple of examples.

Public Domain

This is technically not a license, because using this means the following.

  • You donate your dataset to the public
  • Anyone can use your dataset including commercial use
  • No restriction applies unless otherwise stated
  • You remove all copyrights in your dataset

License such as the following fall into this category.

  • CC0 (CC Zero)
  • ODC-PDDL
Attribution

You must give appropriate credit and provide a link to the license. The following fall into this category.

  • CC BY
Non-commercial

As the naming goes, you cannot use this for commercial use. The following fall into this category.

  • CC BY-NC
  • CC BY-NC-SA
  • CC BY-NC-ND

▮ Reference