by Brian Milas

How to benefit from the identity data explosion

Analysis | Feb 20, 2014 | 17 mins

Everyone wants to avoid being the next Target. Brian Milas, CTO at Courion, explains how to use the rich data generated by identity and access control solutions to reduce risk.

Week in and week out, big-name companies from Target to Neiman Marcus to Michaels learn the severe consequences of flawed data security. The truth is, even when IT has been armed with the latest security technology, defending against breaches isn’t easy. In fact, as attacks get more and more sophisticated, it’s getting harder.

In this week’s New Tech Forum, Brian Milas, CTO at Courion, offers an in-depth look at the data security problem from the standpoint of user identity and access management. As Milas argues, you can’t develop an effective solution without making sense of the very large quantity of semi-structured data generated by these gatekeeping systems. — Paul Venezia

Making sense of “big data” from identity management

Providing employees with access to applications and information is a complex operational challenge. Users require broad and varied access to be productive, but that incurs risk. IT must control access, enforcing the principle of “least privilege” in the face of compliance regulations and the threat of security breaches.

Do it right and business runs efficiently with risks understood, mitigated, and rewarded. Do it wrong and catastrophe looms.

To understand business risk effectively, you must have visibility into the access approved, access granted (which may be different than what was approved), the resources and data behind the access granted, and how access is being used. Years ago this was less complex: Employee and customer data lived in the data center, was accessed during work hours, and was less heavily regulated and audited.

Today, data resides not only in the data center but also in mobile devices and the cloud. It’s also regulated, audited, and available to many more audiences than just your employees. Here’s one way to break down the problem:

More and different types of identities. In the past, IAM (identity and access management) was primarily concerned with workers. Now contractors, suppliers, customers, partners, affiliates, and even devices have identities.

Data explosion. We’re generating and archiving more data than ever before. Recent coverage of the NSA’s data analysis efforts reveals just how much data we generate as a nation: 1.8 petabytes daily!

Flexible access. In the past, access was largely consolidated in a data center, but then came desktops, then laptops, then mobile and cloud. Today, users expect access anywhere, everywhere, all the time.

Need for speed. The United States is no longer the only “I want it now!” society. Every globally competitive company is keenly aware of the need to provide access and information immediately, whether to a shop floor employee or to a customer who needs current order status.

Increased security expectations. In the past, security was considered a specialized area, but today, government and industry regulators, auditors, board members, media, and consumers are expected to know the ropes. Increasingly, CISOs are calling for staff to flag new risks as they arise.

Logging everything

What does this all mean to a CISO who is concerned with providing only the right access to the right people at the right time? A whole lot of information about a rapidly expanding universe of electronic identities and their context. At Courion, we call this “big identity data.”

By way of example, consider a hypothetical 10,000-employee company:

  • 10,000 users with access to 10 applications results in 100,000 accounts
  • Logging in to these applications at least twice per day yields 200,000 login activity records per day
  • Keeping a data store of one month of activity (about 20 workdays) creates a total of 4 million login activity records

Now let’s consider how worker interaction with files and folders enters the equation:

  • 10,000 workers accessing 50 data assets per day creates 500,000 activity records per day
  • Distributed over an eight-hour workday, this results in 62,500 activity records per hour, or roughly 1,042 per minute (about 17 per second)
  • Keeping a data store of one month of activity (about 20 workdays) creates a total of 10 million unstructured data activity records
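
The arithmetic behind both lists above can be made explicit. This is a back-of-the-envelope sketch using the article's illustrative numbers (10,000 users, 10 applications, 50 data assets per day, 20 workdays per month), not measurements from a real environment:

```python
USERS = 10_000
APPS = 10
WORKDAYS_PER_MONTH = 20

# Accounts and login activity
accounts = USERS * APPS                                 # 100,000 accounts
logins_per_day = accounts * 2                           # 200,000 login records/day
login_records_month = logins_per_day * WORKDAYS_PER_MONTH   # 4,000,000

# Unstructured data activity
assets_per_day = 50
activity_per_day = USERS * assets_per_day               # 500,000 records/day
per_hour = activity_per_day // 8                        # 62,500/hour
per_minute = activity_per_day / (8 * 60)                # ~1,042/minute
activity_records_month = activity_per_day * WORKDAYS_PER_MONTH  # 10,000,000

print(accounts, logins_per_day, login_records_month)
print(per_hour, round(per_minute), activity_records_month)
```

Even with these deliberately conservative inputs, a single month of retention already yields 14 million records before any other systems are added.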

Just think: 14 million data elements, and that’s the tip of the identity and access data iceberg! One might contend that 4 million (or 10 million) records in the examples above are not really indicative of “big data” per se. That’s true. We used simple, conservative numbers to show how things grow.

In a real-world environment, data accumulates much faster. The two previous examples covered only data for people, applications (accounts), and activity (logins and file shares). Every business has many more applications that are important, and each of those has a wealth of data to be collected and analyzed. Data must be collected regarding access, roles, inheritance, permissions, assignment, and denial — and for key systems such as financials, HR, CRM, databases, email, SharePoint, and so on. We also need to collect activity for those systems — more than just logon events. The growth in data collected is very rapid.

Boiling an ocean of details

Against this backdrop of exploding data volumes and increased security expectations stands the persistent need to understand and manage identities and their access privileges. It’s becoming increasingly challenging to answer even simple questions. For example: “What should Bob be able to do with that application or data?”

Supervisors are often too busy to deal with such questions and may not know the right answer. Environments, applications, and policies are constantly changing. Multiple systems must be traversed in order to arrive at an answer. To make matters worse, these are often loosely coupled systems, where additional expertise must be applied to determine the proper action. No human can boil this ocean of data and look at all of this, nor would anyone want to.

As a CISO, I want to leverage all of this data to my advantage, but how do I manage it? How do I ensure that the right identities have the right access and are doing the right things? How do I see the anomalies and outliers? What’s the risk? And how do I do all of this in a timely fashion?

In other words, how do I turn all of this data into useful information or actionable intelligence?

No human can quickly and accurately digest and make sense of all this data. This is a problem that seems perfectly suited to a technology that can do the heavy lifting: the data crunching and analytics, the highly repetitive tasks. While the power of automation is certainly required, that’s not the whole story. You still need humans involved in the business analysis. Where do we automate, and where do we still need human oversight or intervention?

Defining a process

What is needed is a process for collecting, analyzing, visualizing, and taking action — converting identity and access data into actionable intelligence:

Collect > Analyze > Visualize > Act > Operationalize

First, we need data to answer the who, what, where, when, why, and how questions as they relate to identity and access:

Who: Things, people, or applications that have an identity

What: The action taken, resource or data accessed or involved

Where: Location, geography, or position

When: Time stamp

Why: The detail behind the user activity and intention; this is where inference and analytics are useful

How: The mechanism by which access was granted or used; activity history, inference, and aggregated data can help provide this context
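
The six dimensions above can be captured in a single event record. This is a minimal sketch; the field names are illustrative, not a vendor schema, and "why" is typically derived later by analytics rather than collected directly:

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass
class AccessEvent:
    who: str                 # identity: person, device, or application
    what: str                # action taken and resource involved
    where: str               # location, geography, or position
    when: datetime           # time stamp (normalized, here UTC)
    how: str                 # mechanism by which access was granted or used
    why: Optional[str] = None  # filled in later by inference/analytics

# Using the article's HCIS example: Dr. Smith views a patient record.
event = AccessEvent(
    who="dr.smith",
    what="viewed record of patient X",
    where="clinic-3 workstation",
    when=datetime(2014, 2, 20, 14, 5, tzinfo=timezone.utc),
    how="HCIS role: clinician",
)
print(event.who, event.what)
```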

Getting your arms around the data

Next, we need to prepare the data for analysis, using ETL (extract, transform, load).

Extract: Pull some or all data from a multitude of sources that have information about identities, accounts, rights, activities, and resources. Expect the data to exist in various repositories, with different storage formats and data representations, each with their own security challenges. Anticipate needing to employ different techniques and technologies to connect to and extract the needed data. Most systems have data available that help answer the who, what, where, when, why, and how. For example, an HCIS (health care information system) typically has information about the following:

  • Accounts for workers, clinicians, researchers, affiliates (Dr. Smith)
  • Rights assigned to the accounts (Dr. Smith can schedule appointments, dispense medication)
  • Resources accessible via the assigned rights (schedules for Dr. Smith’s team of clinicians and records for Dr. Smith’s patients)
  • Activity done within the HCIS (Dr. Smith logged in and viewed the records of patient X)

The extraction phase may be performed in a batch/bulk manner, or it may be conducted in real time, where data is extracted as it changes.

Transform: Next the data must be converted and normalized to get it into an understandable format. A simple example is date and time: Data may show 9:00 a.m., but what time zone? Is it daylight saving time or standard time? Does all the data conform to the same level of granularity in minutes, seconds, or microseconds? Typically you resolve this by transforming all data to a single reference such as Greenwich Mean Time (GMT). The time stamp format for logon events may vary with each system extract and needs to be converted to a consistent format for analysis.
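
The time-stamp transform described above might look like the following sketch, which parses a local time in a known source zone and normalizes it to UTC (the modern equivalent of GMT). The log format and zone name are assumptions; real extracts vary per system:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def normalize_timestamp(raw: str, source_zone: str,
                        fmt: str = "%Y-%m-%d %H:%M:%S") -> datetime:
    """Parse a naive local timestamp and convert it to UTC."""
    local = datetime.strptime(raw, fmt).replace(tzinfo=ZoneInfo(source_zone))
    return local.astimezone(timezone.utc)

# 9:00 a.m. Eastern on a winter date is EST (not EDT), i.e. 14:00 UTC.
print(normalize_timestamp("2014-02-20 09:00:00", "America/New_York"))
```

The zone database handles the daylight-saving question automatically: the same 9:00 a.m. on a summer date would normalize to 13:00 UTC instead.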

Many other data transformations may be done to prepare the extracted data for storage and analysis. The data may need to be augmented from another repository, split into new data elements, validated against other repositories, or changed to a new value. Here’s how a ZIP code might be transformed:

  • ZIP codes may be either the five-number format or ZIP+4
  • Split ZIP+4 records into two fields
  • Data without a ZIP code or with alpha characters should be discarded
  • Verify that the numeric five-digit ZIP code is valid, verify ZIP+4 if present
  • ZIP code lookup populates and corrects city and state information
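
The ZIP-code steps above can be sketched as one transform function. The lookup table here is a stand-in for illustration; a real pipeline would validate against a full USPS ZIP database:

```python
import re

ZIP_LOOKUP = {"02110": ("Boston", "MA")}  # illustrative subset only

def transform_zip(raw: str):
    """Return (zip5, plus4, city, state), or None if the record is discarded."""
    m = re.fullmatch(r"(\d{5})(?:-(\d{4}))?", raw.strip())
    if not m:
        return None                       # missing ZIP or alpha characters: discard
    zip5, plus4 = m.group(1), m.group(2)  # split ZIP+4 into two fields
    if zip5 not in ZIP_LOOKUP:
        return None                       # fails validation against the lookup
    city, state = ZIP_LOOKUP[zip5]        # lookup populates/corrects city and state
    return (zip5, plus4, city, state)

print(transform_zip("02110-1234"))  # ('02110', '1234', 'Boston', 'MA')
print(transform_zip("ABCDE"))       # None: discarded
```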

Load: The last step is to store the transformed data in a repository for analysis and determine what data is overwritten and what data is changed. For example, is the “load” data authoritative, or is the data already present in the repository authoritative? Expect to collect a large amount of data and then, depending on your data retention policy, add it to an already large data set. When activity data is collected, not only is it likely to be large, but it may also arrive quickly (in real time).

Furthermore, the need to do forensics often drives the need to retain detailed records, resulting in larger data sets. Expect that your disk storage needs may increase based on the size of your organization and your forensics needs.
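
The authority decision in the load step can be sketched as a field-level upsert. The authority map below is an assumption for illustration; real policies are defined per source and per attribute:

```python
# Which side wins for each field: the incoming extract or the repository.
AUTHORITATIVE = {"department": "incoming", "risk_score": "repository"}

def upsert(repository: dict, key: str, incoming: dict) -> None:
    """Insert a new record, or merge field-by-field per the authority map."""
    existing = repository.get(key)
    if existing is None:
        repository[key] = dict(incoming)
        return
    for field, value in incoming.items():
        if AUTHORITATIVE.get(field, "incoming") == "incoming":
            existing[field] = value   # the extract is authoritative: overwrite
        # else: the repository value is authoritative; leave it alone

repo = {"bob": {"department": "IT", "risk_score": 7}}
upsert(repo, "bob", {"department": "Finance", "risk_score": 2})
print(repo["bob"])  # department updated from HR; risk score preserved
```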

To get answers: Analyze, relate, infer, and visualize

With the data normalized and loaded, it’s ready to analyze. The analysis itself generates new data in the form of facts, relationships, indicators, trends, and inferences.

Multidimensional analysis reorganizes the data and provides new ways to pivot, view, and analyze. IAI (identity and access intelligence) analytics solutions are specifically tailored to provide analysis and visualization specific to IAM, making the connections between identities, the access assigned, permissions, and ultimately the resulting access that a person has to a given resource.

Relationships between objects, such as inheritance and hierarchy, add to the complexity of understanding the access environment. They also help us answer whether Bob can really approve large budget items, and assess the risk related to those access rights. For example, assume that we “know” (from collecting identity and access data) that Bob can approve budget items over $100,000. This allows us to infer that Bob is a power user in the application where he can approve $100,000 items.

Another analysis can infer or suggest other attributes or relationships between objects. For instance, if 75 percent of the finance team has access to application Y, it’s likely that application Y is a high-impact application.

The risk rating for Bob’s access may be inferred to be low if it’s consistent with other peers in finance. Conversely, the risk rating may be high if Bob’s access is an outlier, inconsistent with his peers (that is, he’s not in finance).
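
One way to sketch this peer-comparison inference: rate an identity's access risk by how much it overlaps with peers in the same department. The entitlement names and the 50 percent thresholds are illustrative assumptions, not a defined algorithm from the article:

```python
def peer_overlap(user_access: set, peer_access_sets: list) -> float:
    """Fraction of the user's entitlements also held by a majority of peers."""
    if not user_access:
        return 1.0
    common = {
        right for right in user_access
        if sum(right in p for p in peer_access_sets) >= len(peer_access_sets) / 2
    }
    return len(common) / len(user_access)

finance_peers = [
    {"view_budget", "request_budget"},
    {"view_budget", "request_budget", "run_reports"},
    {"view_budget"},
]

# Bob holds powerful rights that none of his nominal peers hold.
bob = {"view_budget", "approve_budget_100k", "transfer_funds"}
overlap = peer_overlap(bob, finance_peers)
risk = "high" if overlap < 0.5 else "low"   # outlier access raises the rating
print(round(overlap, 2), risk)
```

If Bob's access matched his peers closely, the overlap would approach 1.0 and the inferred risk would drop accordingly.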

This type of complex analysis results in multidimensional data structures that can help you begin to answer questions that start with “What is…,” “What should be…,” and “What if…” and provides an opportunity to look at information through different perspectives, such as by business unit, by date, or by business risk.

A simple question can be answered: “What access has been granted to this employee?”

Person > Account > Access

Now we can answer such questions as: “If I provide this employee with access to that resource, what is the likely effect and does it reveal any issues?” or “Have I authorized anything that was not intended?”

Person > Account > Access > Permissions > Resource

When we add activity to the process, we can uncover behavior trends, or analyze and summarize what a user or a set of users actually does with access that they have been granted to a resource:

Person > Account > Access > Permissions > Resource > Activity

This can be used to create a baseline of normal or expected behavior for similar users, or can be used to compare a user’s behavior over different time periods.

  • Activity by day of week: We expect to see low activity on the weekends
  • Activity by hour: We expect to see low activity outside normal 8-to-6 work hours
  • Activity by department: We expect to see certain sets of applications used in each department
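
Building such a baseline can be sketched as simple aggregation plus a rule that flags events outside the expected windows. The sample events and the 8-to-6 weekday window are illustrative assumptions:

```python
from collections import Counter
from datetime import datetime

# Illustrative activity log: four weekday daytime events and one outlier.
events = [
    datetime(2014, 2, 17, 9, 15), datetime(2014, 2, 17, 10, 2),
    datetime(2014, 2, 18, 14, 40), datetime(2014, 2, 19, 11, 5),
    datetime(2014, 2, 23, 3, 12),   # Sunday, 3 a.m.
]

# Baselines: activity counts by hour of day and by day of week.
by_hour = Counter(e.hour for e in events)
by_weekday = Counter(e.strftime("%A") for e in events)

def is_anomalous(e: datetime) -> bool:
    outside_hours = not (8 <= e.hour <= 18)   # outside the 8-to-6 workday
    weekend = e.weekday() >= 5                # Saturday or Sunday
    return outside_hours or weekend

flagged = [e for e in events if is_anomalous(e)]
print(by_hour, by_weekday)
print(flagged)  # only the 3 a.m. Sunday event stands out
```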

You start to see how adding data allows for more analysis to be done, giving better visibility, generating new information, and allowing you to answer more interesting questions. (For example: Is it normal for Bob to be downloading the entire customer list to his home laptop at 3 a.m. on a Sunday before he leaves for a hastily scheduled three-week vacation?)

Adding geographic and location-based data provides an idea of “where.” This can be used, for instance, to uncover and flag access by the same user in two different locations at the same time or to identify locations inconsistent with the expected geographies for the user.
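
A sketch of the "same user, two places at once" check described above. Locations here are coarse labels and the one-hour window is an assumption; a real check would use geo-distance and travel-time feasibility:

```python
from datetime import datetime, timedelta

# Illustrative login records: (user, timestamp, location).
logins = [
    ("bob", datetime(2014, 2, 20, 9, 0), "Boston"),
    ("bob", datetime(2014, 2, 20, 9, 10), "Singapore"),
    ("alice", datetime(2014, 2, 20, 9, 0), "Boston"),
]

def flag_impossible_travel(logins, window=timedelta(hours=1)):
    """Flag users seen in two different locations within the window."""
    flagged = []
    seen = {}  # user -> list of (timestamp, location)
    for user, ts, loc in sorted(logins, key=lambda r: r[1]):
        for prev_ts, prev_loc in seen.get(user, []):
            if loc != prev_loc and ts - prev_ts <= window:
                flagged.append((user, prev_loc, loc))
        seen.setdefault(user, []).append((ts, loc))
    return flagged

print(flag_impossible_travel(logins))  # [('bob', 'Boston', 'Singapore')]
```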

Customizing IAI to reveal business risk

Already we can see that by aggregating disparate data types and looking at the context or relationship between those elements, we are revealing new information. Next let’s look at how we can highlight information and uncover knowledge rather than just showing data in a static report.

Let’s try illustrating the difference between conveying data and conveying information.

Take an example that is familiar: orphaned or inactive accounts. Suppose you have a static report showing 30 orphans. What does 30 represent? Is this “good” or “bad”?

Visualization takes the data (“30”) and couches it in the context of the bigger picture to answer those questions. For example, maybe it’s the 30 highest-risk orphans out of the 100 total orphans found. Not all orphans are equal — one that’s powerful and can approve $100,000 expenditures is a higher risk; one that’s associated with a terminated worker is higher risk; one with access to confidential data or IP is higher risk; one that’s been used is higher risk. You get the idea.

The analysis phase highlighted 30 high-risk orphan accounts out of 100 total orphans identified across 5,000 accounts. We have an overall orphan ratio of 2 percent, but only a 0.6 percent ratio for the high-risk orphan accounts.
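
The orphan arithmetic above, made explicit using the article's example counts:

```python
total_accounts = 5_000
orphans = 100
high_risk_orphans = 30

orphan_ratio = orphans / total_accounts               # 0.02  -> 2 percent
high_risk_ratio = high_risk_orphans / total_accounts  # 0.006 -> 0.6 percent

print(f"overall orphan ratio: {orphan_ratio:.1%}")
print(f"high-risk orphan ratio: {high_risk_ratio:.1%}")
```

The visualization layer would present these ratios in context rather than the bare count of 30, so a reviewer can immediately judge whether the picture is improving or worsening.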

Act wisely

Now that you have information that provides insight, not just raw data, and we agree that context has value, what matters most is what you do with the knowledge gained. You’ll want to act on certain things uncovered by analysis and visualization.

As it relates to IAM, access risk and policy violations are most commonly identified. Actions for remediation can take a variety of forms:

  • Ignore (a one-time false positive)
  • Always ignore (always considered a false positive)
  • Accept (not a false positive, but I’ll approve this exception)
  • Request review and attestation
  • Request removal (undo)
  • Request policy modification (the policy is wrong or doesn’t represent the business)
  • Reassign (it’s not mine)
  • Escalate and research (I need help)

As we have moved from collecting the data to analyzing and visualizing and now acting, the overall objective should be to identify and eliminate compliance issues as they occur and to predict and prevent the problems that lead to risk. But how can we improve productivity?

Operationalize

Remediation as a manual process doesn’t scale well, especially if a trained member of the security team needs to look at and act upon all of the information identified. Remember our statement that nobody can look at all of this, nor would they want to. Instead, you’ll want to automate and operationalize IAM tasks where possible.

Continuously monitor for the most interesting things, then automate the remediation steps. The definition of “interesting things” varies, but can include anomalies and outliers, segregation-of-duty violations, new assignments of privileged access, or usage of an account associated with a terminated worker. Minimize noise, focus on the important things, refine, tune, and improve. Now, the phrase “continuously monitor” sounds simple, but it encompasses everything we’ve talked about thus far:

Collect > Analyze > Visualize > Act

Let’s go back to our example of Bob who has access to approve budget items over $100,000. SoD (segregation of duty) policies may dictate that individuals with budget approval rights should not have budget requesting rights. Review and sign-off is required when this combination of access is detected for an individual. By continuously monitoring this policy, violations are flagged as soon as the combination of rights is assigned to Bob. It doesn’t matter whether the assignment is done through an IAM system, native tools, or some other channel — monitoring picks up the change, checks the policy, and takes action on violations. In this case an action might be to notify stakeholders and initiate the process of having Bob’s manager review and then accept or reject the violation of SoD policy.
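
The SoD check at the heart of this monitoring can be sketched as a test against a list of toxic right combinations. The right names and the notification step are illustrative assumptions:

```python
# Pairs of rights that must not be held by the same individual.
TOXIC_PAIRS = [("approve_budget", "request_budget")]

def check_sod(user: str, rights: set) -> list:
    """Return the toxic combinations present in a user's rights."""
    violations = [
        pair for pair in TOXIC_PAIRS
        if pair[0] in rights and pair[1] in rights
    ]
    for a, b in violations:
        # In a real system this would notify stakeholders and open a
        # review workflow for the user's manager to accept or reject.
        print(f"SoD violation for {user}: {a} + {b}")
    return violations

# Bob has just been assigned requesting rights on top of approval rights.
bob_rights = {"approve_budget", "request_budget", "view_reports"}
print(check_sod("bob", bob_rights))
```

Because the check runs on every detected rights change, it fires regardless of whether the assignment came through the IAM system, native tools, or some other channel.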

In most businesses, usage of an account associated with a terminated worker would raise eyebrows. The account is disabled during the off-boarding (termination) process, but what happens if it’s reactivated and used? (We’ll assume that the worker is still not associated with the business.) “Monitor and act” might entail:

  • Knowing that the worker has separated from the company
  • Knowing that the account has been activated (and by whom?)
  • Knowing that the account has been used
  • Detecting all of the above
  • Taking action, which may entail:
    • Determining who to notify
    • Requiring the account to be attested to (reviewed)
    • Disabling the account
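
The "monitor and act" steps above can be sketched end to end. The worker and account records, and the action names, are illustrative assumptions:

```python
# Illustrative repositories joining HR status to account state.
workers = {"w123": {"name": "Pat", "status": "terminated"}}
accounts = {"pat.acct": {"owner": "w123", "enabled": True,
                         "last_used": "2014-02-19"}}

def monitor_terminated(workers: dict, accounts: dict) -> list:
    """Detect enabled accounts owned by terminated workers and act."""
    actions = []
    for acct_id, acct in accounts.items():
        owner = workers.get(acct["owner"], {})
        if owner.get("status") == "terminated" and acct["enabled"]:
            actions.append(("notify_security", acct_id))     # who to notify
            actions.append(("request_attestation", acct_id)) # require review
            acct["enabled"] = False                          # disable it
            actions.append(("disabled", acct_id))
    return actions

result = monitor_terminated(workers, accounts)
print(result)
```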

From data to information

To paint the full picture for IAM, we collect data from a broad variety of systems, pulling together identity, access, activity, and resources. That data alone can grow quite large for an organization. Data is then only turned into information or knowledge through analysis, which helps us to visualize, confirm compliance with policy, and act as needed. With the addition of historical information, we pick up the ability to visualize trends and employ forward-looking forensics.

While market conditions and the pace of technology adoption and change all contribute to a big identity data challenge, IAI is rapidly evolving, so CISOs can actually boil the ocean down to a manageable level. With the right tools they can automate routine IAM tasks, quickly identify and eliminate compliance issues, and predict, prevent, and address problems that lead to risk.

This gives us the ability, to paraphrase Albert Einstein, to not only solve problems, but to prevent them.

New Tech Forum provides a means to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all enquiries to newtechforum@infoworld.com.

This article, “How to benefit from the identity data explosion,” was originally published at InfoWorld.com. For the latest business technology news, follow InfoWorld.com on Twitter.