How to Collect Training Data for Face Recognition Part 3

Date: 2024.02.23

Welcome back to our adventure in preparing training data for face recognition! We're moving into the final stages, focusing on ensuring our dataset isn't just large but also perfectly prepped for developing top-notch AI models.

Last time, we put the finishing touches on a special dataset, adding yet another valuable asset to our company's collection. This week, we're gearing up to hand off this dataset to our ML engineers. But before we do, let's take a moment to double-check our work, ensuring everything is spot-on.

7. Why Check Again?

The idiom "measure twice, cut once" applies especially well in data science when we finalize a dataset for AI training. After organizing the data, we conduct another round of checks. This redundant verification is a crucial safeguard against lingering inconsistencies or errors that earlier passes overlooked.

In theory, our dataset should be flawless after the exhaustive automated and manual checks. In practice, however, we always discover mistakes during this double-check, right before the dataset is delivered.

These final checks ensure that the data adheres to our stringent quality standards and stays coherent and consistent throughout. At this stage we again use both automated tools and human oversight, which significantly reduces the risk of training our AI models on flawed data and, in turn, leads to more accurate models.
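
To make the automated side of this step more concrete, here is a minimal sketch of a pre-delivery sanity check. It is an illustration rather than our actual tooling: the manifest format (a list of `(user_id, image_path)` pairs), the function name `check_manifest`, and the three checks themselves are all assumptions chosen for the example.

```python
from collections import Counter


def check_manifest(records, min_images_per_id=2):
    """Return a list of human-readable issues found in the manifest.

    records is a hypothetical manifest: a list of (user_id, image_path) pairs.
    """
    issues = []

    # Check 1: the same image path should not be listed more than once.
    path_counts = Counter(path for _, path in records)
    for path, count in sorted(path_counts.items()):
        if count > 1:
            issues.append(f"duplicate path: {path} listed {count} times")

    # Check 2: every image must carry a user_id.
    images_per_id = Counter()
    for user_id, path in records:
        if not user_id:
            issues.append(f"missing user_id: {path}")
        else:
            images_per_id[user_id] += 1

    # Check 3: identities with very few images deserve a manual look.
    for user_id, count in sorted(images_per_id.items()):
        if count < min_images_per_id:
            issues.append(f"too few images: {user_id} has {count}")

    return issues
```

Anything the function reports would then go to a human reviewer, which mirrors the mix of automated tools and human oversight described above.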

8. Combining Datasets

Chances are, you'll find yourself juggling more than one dataset, and newer ones need to be combined with those already in use. That's where the art of mixing and matching datasets comes into play. Think of it not just as stacking LEGO bricks but more like piecing together a puzzle where each piece is crucial.

For instance, when training a face recognition model, we're teaching AI models to spot the differences between faces. That means every photo is tagged with a user_id to show that all those snapshots belong to the same person. Popular faces, such as celebrities, may appear in more than one dataset, so the same person ends up with two different user_ids. We therefore need a way to combine them.

In our company, two levels of deduplication are done when combining datasets.

First, identity-level deduplication. We subsample some images from each user_id and run them through several face recognition models for cross-validation, which reveals whether two user_ids refer to the same person. If they do, we merge the two user_ids.

Second, image-level deduplication. The same image may appear in different datasets, especially once the image count reaches the 10M mark. Images that look too similar (likely differing only in cropping or resolution) are deduplicated during this phase.
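
The two levels above can be sketched roughly as follows. This is an illustrative simplification, not our production pipeline. It assumes each face recognition model has already produced embedding vectors for the subsampled images, that a pair of user_ids counts as a duplicate only when every model agrees, and that a simple average hash is enough to flag images that differ mainly in resolution (heavy cropping would need something more robust). The function names, the 0.8 threshold, and the 4x4 hash size are all hypothetical.

```python
from itertools import combinations
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Element-wise mean of a list of embedding vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def find_duplicate_ids(embeddings_by_model, threshold=0.8):
    """Return user_id pairs that every model scores at or above threshold.

    embeddings_by_model: one dict per model, mapping user_id to the
    embeddings of that identity's subsampled images.
    """
    agreed = None
    for per_user in embeddings_by_model:
        cents = {uid: centroid(vecs) for uid, vecs in per_user.items()}
        flagged = {
            (a, b)
            for a, b in combinations(sorted(cents), 2)
            if cosine(cents[a], cents[b]) >= threshold
        }
        agreed = flagged if agreed is None else agreed & flagged
    return agreed or set()


def merge_ids(user_ids, duplicate_pairs):
    """Map every user_id to a canonical id (the smallest id in its group)."""
    parent = {uid: uid for uid in user_ids}

    def find(uid):
        while parent[uid] != uid:
            parent[uid] = parent[parent[uid]]  # path compression
            uid = parent[uid]
        return uid

    for a, b in duplicate_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[max(root_a, root_b)] = min(root_a, root_b)
    return {uid: find(uid) for uid in user_ids}


def average_hash(pixels, size=4):
    """Hash a grayscale image (list of rows) via size x size block averages.

    Assumes the image is at least size pixels in each dimension.
    """
    height, width = len(pixels), len(pixels[0])
    means = []
    for i in range(size):
        for j in range(size):
            rows = range(i * height // size, (i + 1) * height // size)
            cols = range(j * width // size, (j + 1) * width // size)
            block = [pixels[r][c] for r in rows for c in cols]
            means.append(sum(block) / len(block))
    overall = sum(means) / len(means)
    return tuple(m > overall for m in means)


def hamming(hash_a, hash_b):
    """Number of positions where two hashes differ."""
    return sum(x != y for x, y in zip(hash_a, hash_b))
```

In this sketch, `merge_ids` uses a union-find structure so that chains of duplicates (A matches B, B matches C) collapse into one identity, and two images whose hashes have a small Hamming distance would be treated as near-duplicates. Comparing every pair of user_ids is quadratic, so at the 10M scale an approximate nearest-neighbor index would likely replace the brute-force loop.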

9. Creating Data Splits

Now that our dataset is all dressed up and ready to go, it's time to divide it into "data splits". Think of them as different teams, each with a special role. We've got the training set, the validation set, and the testing set.

This setup is super important to keep our model training on the straight and narrow. It makes sure that our engineers use just the right data for training without accidentally peeking at the test set. Why? Because the test set is reserved for benchmarking: adding its images to the training process would bias our results, like someone cheating on a quiz.
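
One way to keep the splits from leaking into each other is to assign whole identities, rather than individual images, to a split. The sketch below is a minimal example under that assumption; the post does not say how our splits are actually generated. The function names and the 80/10/10 ratio are hypothetical. Hashing the user_id makes the assignment deterministic, so the same person always lands in the same split no matter when the script is run.

```python
import hashlib


def assign_split(user_id, val_pct=10, test_pct=10):
    """Deterministically map a user_id to 'train', 'val', or 'test'."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in the range 0..99
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "val"
    return "train"


def split_records(records, val_pct=10, test_pct=10):
    """Group (user_id, image_path) pairs so each identity stays in one split."""
    splits = {"train": [], "val": [], "test": []}
    for user_id, image_path in records:
        name = assign_split(user_id, val_pct, test_pct)
        splits[name].append((user_id, image_path))
    return splits
```

Because the split depends only on the user_id, images added later for an existing identity go to the split that identity already belongs to, so the test set never gains faces the model has trained on.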

10. Conclusion: So Long and Thanks for All the Data

Looking back at our adventure from gathering those first bits of raw data to the careful, final touches before AI training, we're proud of the journey. It's been all about sticking to our high standards for quality, embracing diversity, and paying attention to the little details. The dataset we've put together is like the cornerstone for creating face recognition models that aren't only cutting-edge but also built with ethical responsibility and inclusiveness in mind. It's the kind of solid foundation that lets us dream big and build something truly special.

This series demonstrates the collaborative efforts of our team and the valuable insights gained during each stage of data preparation. As we push the frontiers of AI in face recognition, our commitment to improving our processes and sharing our findings remains unwavering. Keep an eye out for more updates as we explore new avenues in AI innovation.

Part1 Blog

Part2 Blog
