Serdar Yegulalp
Senior Writer

4 Google data sets to kickstart machine learning

news analysis
Oct 17, 2016 · 3 mins

Want to get started in machine learning? Google has you covered with high-quality data sets, both big and small

You can always count on Google to have data — tons of it, generated by the users who interact with and upload content to its services.

Google uses that data to build intelligence for the company, but it’s offered data for others to experiment with as well. These three data sets are abundantly large, have plenty of practical applications, and are guaranteed to be well-assembled, thanks to Google’s imprimatur.

The Open Images Dataset

The Open Images Dataset, unveiled at the end of last month, is a collection of 9 million URLs to images “that have been annotated with labels spanning over 6,000 categories,” according to Google. All carry a Creative Commons Attribution license, so they can be reused readily, and the label assignments to the images have been verified by human eyes to ensure validity. Plus, plans are underway to “improve the quality of the annotations in Open Images the coming months.”

YouTube-8M Dataset

Named for the fact that it’s been compiled from 8 million YouTube videos, the YouTube-8M Dataset aims for diversity and quality. Each video has had at least 1,000 views, runs at least two minutes, and has been preclassified via YouTube’s built-in categories. You can explore the data set online or download it for offline use, but note that the data set is only available in the TensorFlow Record file format. You’ll need to manually massage the data if you want to experiment with it in another form.
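As a rough idea of what that massaging involves, here is a minimal Python sketch that walks the records in a TFRecord file without TensorFlow installed. It assumes the commonly documented TFRecord framing (an 8-byte little-endian length, a 4-byte checksum, the payload, and another 4-byte checksum) and skips checksum verification entirely. The function name is made up for illustration, and each payload it yields is expected to be a serialized protocol buffer message that still needs a protobuf decoder before it is usable.

```python
import struct
from typing import BinaryIO, Iterator


def iter_tfrecord_payloads(stream: BinaryIO) -> Iterator[bytes]:
    """Yield the raw payload bytes of each record in a TFRecord stream.

    Assumed framing per record (little-endian):
      8 bytes  payload length (uint64)
      4 bytes  checksum of the length   (read but not verified here)
      N bytes  payload
      4 bytes  checksum of the payload  (read but not verified here)
    """
    while True:
        header = stream.read(12)  # length field plus its checksum
        if not header:
            return  # clean end of file
        if len(header) < 12:
            raise ValueError("truncated record header")
        (length,) = struct.unpack("<Q", header[:8])
        payload = stream.read(length)
        footer = stream.read(4)  # payload checksum, ignored in this sketch
        if len(payload) < length or len(footer) < 4:
            raise ValueError("truncated record body")
        yield payload
```

To try it, open one of the downloaded files in binary mode and loop over `iter_tfrecord_payloads(f)`. A production reader should verify the checksums rather than skip them.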

Google Books Ngrams

Google Books Ngrams offers a clever method to explore when a word first entered wide usage. (For example, “heavy metal” has been around since the 1800s, but its most common cultural meaning hit around 1975.) Rather than simply explore the Ngram database through its web interface, you can snag your own copy via Amazon Web Services. It’s updated regularly, but be warned: you’re looking at a 2.2TB download. Make coffee.
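If you do grab a local copy, the "when did this phrase take off" question can be approximated with a few lines of Python. The sketch below assumes each row is tab-separated and starts with the n-gram, the year, and the match count (any further columns are ignored); check that against the files you actually download. The function name, the threshold, and the sample rows are all invented for illustration and are not real Ngram counts.

```python
from typing import Iterable, Optional


def first_year_at_or_above(
    lines: Iterable[str], phrase: str, min_matches: int
) -> Optional[int]:
    """Return the earliest year in which `phrase` reaches `min_matches`, or None."""
    earliest = None
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip malformed rows
        ngram, year_text, match_text = fields[0], fields[1], fields[2]
        if ngram != phrase:
            continue
        year, matches = int(year_text), int(match_text)
        if matches >= min_matches and (earliest is None or year < earliest):
            earliest = year
    return earliest


# Made-up sample rows, for illustration only (not real Ngram data).
sample = [
    "heavy metal\t1850\t3\t2\n",
    "heavy metal\t1975\t120\t45\n",
    "heavy metal\t1980\t400\t130\n",
    "heavy water\t1975\t999\t300\n",
]

print(first_year_at_or_above(sample, "heavy metal", 100))  # prints 1975
```

Because the function accepts any iterable of lines, you can pass it an open file handle and stream through a multi-gigabyte file without loading it into memory.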

Google Trends Datastore

The data sets in the Google Trends Datastore are tied to specific, timely topics, so their shelf life is limited, and they’re often quite small: 1.1MB is considered large for any given data set. But those limited sizes and topical constraints make them useful as starting points for people getting their feet wet with data analysis.

Also worth mentioning is the Google Public Data Directory, a portal to more than 100 data providers around the world, offering information on every topic from population statistics to economic indicators. The data sets are not available directly through Google, but Google performs a certain degree of curation in selecting them, so they’re guaranteed to be of high quality.

Serdar Yegulalp

Serdar Yegulalp is a senior writer at InfoWorld. A veteran technology journalist, Serdar has been writing about computers, operating systems, databases, programming, and other information technology topics for 30 years. Before joining InfoWorld in 2013, Serdar wrote for Windows Magazine, InformationWeek, Byte, and a slew of other publications. At InfoWorld, Serdar has covered software development, devops, containerization, machine learning, and artificial intelligence, winning several B2B journalism awards including a 2024 Neal Award and a 2025 Azbee Award for best instructional content and best how-to article, respectively. He currently focuses on software development tools and technologies and major programming languages including Python, Rust, Go, Zig, and Wasm. Tune into his weekly Dev with Serdar videos for programming tips and techniques and close looks at programming libraries and tools.
