
Big data glossary: 50+ terms defined

Our big data glossary will help you navigate the world of big data.


By Geotab Team

August 21, 2024

7 minute read


What exactly is big data? And what about clustering or Hadoop? Our big data glossary will help you navigate the world of big data by walking you through key terms and definitions, from the basic to the advanced.

 

See also: Telematics glossary: 100+ terms to know

A

ACID

ACID stands for atomicity, consistency, isolation, and durability. These properties are guaranteed by a transactional database.
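
As a rough illustration, here is a minimal sketch using Python's built-in sqlite3 module and a hypothetical accounts table (both chosen only for this example): the two updates in the transfer either commit together or are rolled back together.

```python
import sqlite3

# A hypothetical in-memory accounts table used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    # Both updates form one atomic transfer: they succeed or fail together.
    conn.execute("UPDATE accounts SET balance = balance - 40 WHERE name = 'alice'")
    conn.execute("UPDATE accounts SET balance = balance + 40 WHERE name = 'bob'")
    conn.commit()      # durability: the change is persisted on commit
except sqlite3.Error:
    conn.rollback()    # atomicity: a failure undoes any partial update

print(dict(conn.execute("SELECT name, balance FROM accounts")))
```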

Aggregated Data

Data that is gathered and presented in summary form, usually for statistical analysis. Aggregation may also be one component of a process that helps ensure user anonymity.

Algorithm

An algorithm is a step-by-step set of operations which can be performed to solve a particular class of problem. The steps may involve calculation, processing, or reasoning. To be effective, an algorithm must reach a solution in finite space and time. As an example, Google uses algorithms extensively to rank page results and autocomplete user queries.

 

See Also: Read about the curve algorithm, Geotab’s patented method for GPS tracking.

Anonymized Data

Data which has been stripped of personally identifiable information, or which has had this information replaced with a randomly generated identifier. Data anonymization is just one part of a collection of methods used to help protect user privacy and identity.
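
For example, a pipeline might swap a name for a randomly generated identifier before a record is shared. The snippet below is a minimal sketch with made-up field names, not a complete anonymization process.

```python
import uuid

record = {"driver_name": "Jane Doe", "speed_kmh": 72, "timestamp": "2024-08-21T14:02:00"}

# Replace the personally identifiable field with a randomly generated identifier.
anonymized = dict(record, driver_name=str(uuid.uuid4()))
print(anonymized)
```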

Apache Kafka

An open source stream processing platform that uses a publisher and subscriber model to handle real-time data feeds. It is a highly scalable distributed system with high throughput and low latency.
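
A minimal sketch of the publisher side, assuming the third-party kafka-python client, a broker reachable at localhost:9092 and a topic named vehicle-events (all of which are assumptions made for this example):

```python
import json
from kafka import KafkaProducer  # third-party package: kafka-python

# Assumes a broker is reachable at localhost:9092 and the topic exists.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Publish one real-time event; subscribers on the same topic receive it.
producer.send("vehicle-events", {"vehicle_id": 42, "speed_kmh": 63})
producer.flush()
```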

Artificial Intelligence

The process of developing software and intelligent machines that can perceive and react to their environment, take appropriate action when required, and learn from those actions. It may include tasks that normally require human intelligence, such as translation between languages, speech recognition, visual perception and decision-making. Learn how artificial intelligence is impacting the mobility industry.

Automatic Identification and Capture (AIDC)

A system for collecting data and automatically identifying items without human involvement. Examples include facial recognition systems, magnetic stripes, smart cards, voice recognition and bar codes.

B

Big Data

Large volumes of data, structured and unstructured, that are gathered and analyzed to improve customer experience and the efficiency of a business, among many other things.

Big Data Graveyard

This “graveyard” consists of big repositories of unused data. The data usually gets stored in server farms or the cloud, mostly to never be seen or used again. Read how to resurrect value from data in this blog post from Mike Branch.

Big Data Scientist

An individual who performs data mining, statistical analysis, machine learning, and retrieval processes on large amounts of data to identify trends and patterns, and to forecast and make predictions.

Business Intelligence

A term that refers to the tools, applications, technologies and practices that are used for the collection, extraction, identification and analysis of business data.

C

Cloud Computing

Cloud computing is data storage and processing over the internet. Instead of locally managing computers, hard drives, and servers, a third-party service manages the physical infrastructure and the end-user utilizes the resources remotely.

The rise in cloud computing has been made possible by the increasing affordability of internet services, along with enhanced security and flexibility. No longer do employees need to be at their offices and computers to access their data, tools, and applications. Organizations no longer need to be in the data-storage business; they can outsource this aspect of their operations. Read about the pros and cons of cloud computing.

Clustering

This is an essential machine learning technique for analyzing big data. Sometimes referred to as “cluster analysis”, it is the task of grouping a set of objects together in a way that differentiates them from other groups. This may be used, for example, to find certain “types” of customers and identify their commonalities and needs.
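
For instance, k-means (one common clustering algorithm, shown here with scikit-learn and toy data invented for this example) groups similar records together without being told the groups in advance:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: annual trips and average spend for six hypothetical customers.
customers = np.array([[5, 100], [6, 120], [4, 90],
                      [40, 900], [42, 950], [38, 870]])

# Ask k-means to split the customers into two groups.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(customers)
print(labels)  # e.g. [0 0 0 1 1 1] -- two "types" of customer
```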

D

Dashboard

An information management tool that assists with visually tracking and displaying information, metrics, key performance indicators, and key data points to monitor processes, individuals, data quality, or important business areas.

Data Cleaning

The process of detecting inaccurate, invalid or corrupt records in a database and removing, tagging or correcting them.
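
A small example of what this can look like in practice, using pandas with made-up records:

```python
import pandas as pd

raw = pd.DataFrame({
    "vehicle_id": [1, 1, 2, 3],
    "odometer_km": [12000, 12000, None, -50],  # a duplicate, a missing value, an invalid value
})

cleaned = (
    raw.drop_duplicates()                # remove exact duplicate rows
       .dropna(subset=["odometer_km"])   # drop rows with missing readings
       .query("odometer_km >= 0")        # drop (or tag) physically impossible readings
)
print(cleaned)
```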

Data Lake

A system of storage that holds raw data until it needs to be used. It can include unstructured data, semi-structured data, relational databases and binary data.

Data-Driven Decision Making

The practice of basing decisions on the analysis of data rather than on intuition alone.

Data.geotab.com

A Geotab website offering free smart city and intelligence data to users. (Get everything you need to know about data.geotab.com in this blog post).

Data Governance

A set of rules and processes that ensure data quality, consistency, integrity, and security over time.

Data Warehouse

A system that is used to store data for analysis and reporting.

Data Mart

A subset of the data warehouse, it is used to provide data to users.

Data Mining

The analysis step in the “knowledge discovery in databases” process, it includes sorting raw data into information that can be used to solve problems and identify patterns. The aim of data mining is to obtain information from a data set and convert it into a more understandable structure for future use.

Data Schema

A structure that defines the organization of data in a database system.
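
In a relational database, for example, the schema is expressed with statements such as CREATE TABLE. The sketch below uses SQLite and invented vehicles/trips tables purely for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The schema fixes the tables, columns, types and relationships up front.
conn.executescript("""
CREATE TABLE vehicles (id INTEGER PRIMARY KEY, vin TEXT NOT NULL);
CREATE TABLE trips (
    id          INTEGER PRIMARY KEY,
    vehicle_id  INTEGER NOT NULL REFERENCES vehicles(id),
    distance_km REAL,
    started_at  TEXT
);
""")
```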

Data Science

A discipline that uses algorithms, processes, scientific methods and systems to gain insight and knowledge from data in different forms. This field usually incorporates data visualization, data mining, statistics, machine learning and programming to solve complicated problems using big data.

Data Security

The practice of protecting data from destruction, corruption or unauthorized access.

Data Visualization

A visual abstraction of data that uses plots, information graphics and statistical graphics to communicate information effectively. Read more on data visualization here.

Distributed Processing

A network that enables the same application to be run on multiple computers. This term can also be used in reference to running multiple computers in parallel to execute a data processing pipeline or algorithm.

Distributed File System

A system that provides access to data stored on a server. It is often used to share files and information among users in a controlled and authorized manner.

E

Extract, Transform, and Load (ETL)

Three functions that are combined and used in data warehousing to prepare data for analytics or reporting.
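
A compact sketch of the three steps, using pandas and SQLite with invented file, column and table names:

```python
import sqlite3
import pandas as pd

# Extract: read raw records from a source system (a hypothetical CSV file here).
raw = pd.read_csv("trips_raw.csv")

# Transform: clean and reshape the data for reporting.
raw["distance_km"] = raw["distance_m"] / 1000
daily = raw.groupby("trip_date", as_index=False)["distance_km"].sum()

# Load: write the prepared table into the warehouse (SQLite stands in here).
with sqlite3.connect("warehouse.db") as conn:
    daily.to_sql("daily_distance", conn, if_exists="replace", index=False)
```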

G

Graph Databases

A database that uses graphs with edges and nodes to represent and store data. This allows data to be connected and linked directly together so it can be retrieved with a single operation.
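
The node-and-edge model itself can be illustrated with the networkx library (an in-memory graph, not a full graph database, used here only to show the idea):

```python
import networkx as nx

g = nx.Graph()

# Nodes represent entities; edges represent the relationships between them.
g.add_edge("Alice", "Vehicle 42", relation="drives")
g.add_edge("Vehicle 42", "Depot 7", relation="parked_at")

# Because relationships are stored directly, traversal is a single lookup.
print(list(g.neighbors("Vehicle 42")))  # ['Alice', 'Depot 7']
```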

H

Hadoop

Hadoop is an open source framework — or software platform — that allows for storing and analyzing vast quantities of data. Fun fact: The creator of “Hadoop” named the open source software after a stuffed yellow elephant toy belonging to his young son. This article describes when and when not to use Hadoop.

I

In-Memory Database

A database management system that relies on the main memory of the system for data storage. The defining characteristic of an in-memory database is that it keeps data in RAM rather than relying on disk, which makes reads and writes much faster.
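
SQLite's ":memory:" mode gives a quick feel for the idea: the entire database lives in RAM and disappears when the connection closes.

```python
import sqlite3

# The whole database lives in main memory; nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
conn.execute("INSERT INTO readings VALUES ('temp', 21.5)")
print(conn.execute("SELECT * FROM readings").fetchall())
conn.close()  # the data is gone once the connection closes
```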

Internet of Things (IoT)

Gartner defines the Internet of Things (IoT) as “the network of physical objects that contain embedded technology to communicate and sense or interact with their internal states or the external environment.” No longer do computers and tablets exclusively generate data. In the near future, cars, refrigerators, wearable devices, and many other things will provide interesting insights.

 

See Also: Automotive IoT Is Disrupting the Car Rental Industry 

L

Load Balancing

The process of distributing workload across a computer network or computer cluster to optimize performance.
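
A round-robin scheme is one of the simplest approaches; the sketch below just cycles incoming jobs across a list of hypothetical workers:

```python
from itertools import cycle

workers = ["worker-1", "worker-2", "worker-3"]  # hypothetical nodes in a cluster
assignments = cycle(workers)

# Each incoming job is handed to the next worker in turn.
for job_id in range(7):
    print(f"job {job_id} -> {next(assignments)}")
```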

M

Machine Learning

The study and practice of designing systems that can adjust, learn, and improve based on the data they are fed. This is commonly used to allow a computer to analyze data and “learn” what action to take when an event or specific pattern occurs. Examples of machine learning include self-driving cars, Netflix’s recommendation system, and the Facebook news feed. Go to the full explainer: What Is Machine Learning?

Metadata

This type of data gives information and further context about other data. Put simply, this is data about data. If you take a photo with your camera, the photo itself is data. The time, date, location, and other details of that photo represent the metadata.
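
The same distinction shows up with any file on disk: the file's contents are the data, while the operating system keeps metadata about it. A minimal sketch (writing a temporary file only for illustration):

```python
import os
import tempfile

# The bytes written to the file are the data.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as f:
    f.write(b"trip log for vehicle 42")
    path = f.name

# os.stat returns metadata about that data: size, modification time, and so on.
info = os.stat(path)
print(info.st_size, info.st_mtime)
os.remove(path)
```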

N

NoSQL

Databases that are designed outside the widely used relational database management system model. For decades, users have written Structured Query Language (SQL) statements to extract, update, and create data from structured and related tables. While still enormously powerful, SQL doesn’t work nearly as well on large, messier, unstructured datasets. This is why NoSQL exists. Note that it stands for “not only SQL”.
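
As an illustration, a document store such as MongoDB accepts records whose shape can vary from one document to the next. The sketch below assumes the third-party pymongo client and a MongoDB server running locally, neither of which is part of the original definition:

```python
from pymongo import MongoClient  # third-party package: pymongo

# Assumes a MongoDB server is running locally on the default port.
client = MongoClient("mongodb://localhost:27017")
events = client["fleet"]["events"]

# Documents in the same collection do not need to share a fixed schema.
events.insert_one({"vehicle_id": 42, "type": "harsh_braking", "severity": 3})
events.insert_one({"vehicle_id": 7, "type": "geofence_exit", "zone": "Depot 7"})

print(events.count_documents({"vehicle_id": 42}))
```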

O

Open Data

Data that anyone can access, use or share without limitations or restrictions. Download our free white paper on open data and big data privacy.

P

Pattern Recognition

A branch of machine learning focused on the classification, recognition, or labeling of patterns in data.

Petabyte

A unit of data equal to roughly one million gigabytes, or 1,024 terabytes.

Predictive Modeling

The process of developing a model and using statistics to predict a trend or outcome of an event.

Q

Query Analysis

The process of analyzing database queries to optimize them for efficiency and speed. This is important because it improves the overall performance of query processing, which speeds up database functions, data analysis, and reporting tools.

R

Real-Time Data

Data that is delivered and presented immediately after it is acquired.

S

Scalability

The capability of a network, system or process to maintain or improve performance levels as the workload increases. A system is usually considered scalable when it can increase its total output under an increased load as resources such as hardware are added. Scalability is a key characteristic of Geotab’s software development kit.

Spatial Analysis

The process of analyzing spatial data, which describes entities using geographic, geometric and topological properties.

Supervised Learning

The machine learning task of inferring a function that maps inputs to outputs, based on examples of known input-output pairs. By providing the algorithm with “training” data for which the result is known, it infers the function that describes the input-output relationship, so the output for new inputs can be predicted.
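
A minimal sketch with scikit-learn and made-up labeled data: the model is fitted on input-output pairs and then predicts the output for an unseen input.

```python
from sklearn.linear_model import LogisticRegression

# Training data: (speed_kmh, hard_braking_events) -> 1 if the trip was risky, else 0.
X_train = [[60, 0], [65, 1], [120, 6], [110, 5], [70, 1], [130, 7]]
y_train = [0, 0, 1, 1, 0, 1]

model = LogisticRegression().fit(X_train, y_train)

# The inferred function now predicts the label for a new input.
print(model.predict([[125, 4]]))  # e.g. [1]
```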

T

Text Analytics

The application of linguistic, statistical and machine learning techniques to text-based sources to uncover the insight or meaning behind them. With structured data, one can easily calculate averages and maximums for sales, employee salaries and so on; text analytics brings similar rigor to unstructured text, where such calculations are not directly possible.

The 3 Vs

A model originally coined by Gartner’s Doug Laney: data today is streaming at us with increasing velocity (speed of data processing), variety (types of data) and volume (amount of data).

Transactional Data

In a data management context, transactional data describes the information that results from transactions. Note: transactional data always has a time dimension.

U

Unstructured Data

Data that has no identifiable model or structure. Unlike its structured counterpart, unstructured data is messier. Think photos, videos, emails or audio, etc. If you can analyze it in Excel, then the data is probably not unstructured.

Unsupervised Learning

The machine learning task of finding hidden structure in unlabeled data (i.e. data that has not been categorized or classified).

Z

ZooKeeper

An open source software project that provides naming registration, centralized configuration and synchronization services for large distributed systems.

 

Ready to geek out on more terminology? Go to our telematics glossary here.

 

To stay updated on more stories about big data, please subscribe to the blog!



Geotab Team

The Geotab Team writes about company news.
