The era of the data explosion has opened up many new challenges for businesses and IT departments, ranging from internal security of business data, to unwanted exposure of external data, mobile data access, application and user behaviour, and everything in between. Each of these new challenges poses its own set of problems, advantages and solutions.
This article focuses on just one of these challenges: How do I store all this stuff?
The goal of this article is to outline typical data mining/data warehousing issues, and show, in one place, strategies, techniques and approaches to address these issues.
There are myriad reasons for wanting and needing to gather and hold network, server, workstation, user, administrator, application, document and perimeter device traffic. Common motives include:
- Internal and perimeter security
- Compliance and regulation
- Behaviour and performance monitoring
- Forensics
- Troubleshooting and remediation
- Capacity planning
All of these can use and access the same data sets. One of the most significant decisions is choosing a data partitioning strategy that gives efficient, effective access to the many data types and use cases within your data, without compromising the security or performance of your IT estate.
Choosing the way you partition your storage of these enormous data sets has a huge impact on performance, scalability, reliability, management and access.
Environments
If your IT environment is relatively small, or the scope of data gathering and storage is limited, it may be possible to store all gathered data on a single server. There comes a point, however, where a single piece of hardware yields diminishing returns in search and indexing performance once incoming data rates become even mildly heavy. It is certainly possible to throw more memory, disk and CPU resources at a server, and this can help performance, but when data density starts to rise, a distributed approach is required. Needless to say, the data warehousing solution you choose needs to support gathering, storing, searching and reporting to and from the distributed environment. It helps if search and reports are also distributed, particularly if data is to be stored in many far-flung locations.
Indeed, distributed storage can be a much more efficient approach even in modest traffic conditions, as it allows a physical data separation which can reap rewards for organizations with lots of sites, and where data ownership and security access issues require data to be segregated.
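As a rough illustration of distributed search, the sketch below fans a query out to each site in parallel and merges the results. The site names, the in-memory `SITE_STORES` dictionary and the function names are all hypothetical stand-ins; a real deployment would query remote data stores rather than Python lists.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-ins for per-site data stores.
SITE_STORES = {
    "london": ["login failed user=alice", "disk full on srv01"],
    "new_york": ["login ok user=bob", "login failed user=carol"],
    "tokyo": ["firewall drop 10.0.0.5"],
}


def search_site(site, term):
    """Search one site's store locally and return (site, matching events)."""
    return site, [event for event in SITE_STORES[site] if term in event]


def distributed_search(term):
    """Fan the query out to every site in parallel, then merge the results."""
    with ThreadPoolExecutor(max_workers=len(SITE_STORES)) as pool:
        results = pool.map(lambda site: search_site(site, term), SITE_STORES)
        # Keep only the sites that actually returned matches.
        return {site: hits for site, hits in results if hits}


matches = distributed_search("login failed")
```

Each site does its own filtering, so only matching events travel back to the caller. That is the main bandwidth benefit of distributing search alongside storage.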
So when is it time to look at distributing the storage of data? Well, the answer, as always, depends on the IT landscape and business requirements. The three main points that determine the ‘horizontal scaling’ of data storage solutions are:
- The amount of data being collected (rate)
- The length of time data needs to be readily accessible (retention)
- The number of physical entities/sites that need to be supported (spread)
It is the product of these elements that will ultimately decide how many systems are required, and the resources needed for each. For example, if your data input rate is 20 million events per day at each of 10 sites, and your retention policy is six months, clearly you will need a number of systems to handle the load efficiently.
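To put numbers on that example, here is a back-of-the-envelope sketch. It assumes the 20 million events per day apply to each of the 10 sites, and takes six months as 182 days. The average event size of 500 bytes is purely an illustrative assumption, not a figure from this article.

```python
def total_events_retained(events_per_day_per_site, sites, retention_days):
    """Rate x spread x retention: events that must stay searchable."""
    return events_per_day_per_site * sites * retention_days


def estimated_storage_tb(total_events, avg_event_bytes=500):
    """Raw storage in terabytes, before indexing or compression.

    avg_event_bytes is an assumed figure; measure your own data.
    """
    return total_events * avg_event_bytes / 1e12


total = total_events_retained(
    events_per_day_per_site=20_000_000,
    sites=10,
    retention_days=182,  # roughly six months
)
storage_tb = estimated_storage_tb(total)

print(f"{total:,} events retained, about {storage_tb:.1f} TB raw")
```

Under these assumptions the estate holds around 36 billion events at any one time, which is well beyond what a single server can index and search comfortably.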
These are by no means the only considerations. Data rate, spread and retention simply determine the scale at which a distributed data mining project will operate. How the data is partitioned across these systems is equally important to ensure the data warehouse solution delivers the value that will ultimately make the organization more efficient and more profitable.
A quick word on ‘the Cloud’
In recent times, there has been much information and misinformation about cloud computing, its benefits and its usage. As with most things, there are places where cloud computing fits in well, and others where it is less than ideal. By all means, look seriously at cloud solutions for mobile device management, web conferencing, CRM sales tracking etc. It’s important to note, however, that cloud solutions are not well suited to data warehousing of your internal, perimeter, application and behavioural data. For security, compliance and legal reasons, your confidential internal data traffic should never leave your IT estate. In fact, one of the top motivations for gathering internal data traffic is to stop business-sensitive data from leaking to the outside world. Even with service agreements in place, your organization is ultimately responsible for the security and confidentiality of its data, so you won’t want that data being thrown halfway ’round the world to be stored on a server that will very likely be under a completely different legal jurisdiction.
If your organization uses a Managed Service or similar provider for some or all of your IT infrastructure, it’s worth ensuring that any data monitoring/gathering that takes place is properly managed and segregated, to preserve the security and legal obligations your organization has on its data.
So, now we’ve touched on some of the principles and issues surrounding data warehousing, let’s move swiftly on to techniques of data partitioning to maximize the efficacy of your data.
Balancing Data Gathering vs. Data Usage
There are a good number of ‘high-level’ data partitioning approaches available – each with its own merits. Picking the right one for your organization is a bit of a balancing act. When considering the best choice, it’s worth taking these points into consideration:
1. Data collection
- What devices/data/behaviour is being monitored?
- Where is data located?
- What processes will perform the task of collecting data?
- Where will these processes reside in relation to the data collection targets?
2. Data Usage
- Who needs to see the data?
- Which parts/types/locations of data are interesting to which groups of people within the organization?
- Where are the best places to process data/reports/search etc., so that data is quickly and easily accessible?
- What Access Control constraints need to be taken into account?
The main issue these points address is bridging the gap between where the data is generated, where and how it is stored, and who is ultimately going to see it.
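One way to make the Data Usage questions concrete is to write down which groups may see which kinds of data. The sketch below is a minimal illustration only: the group names, data types and the `ACCESS_POLICY` structure are hypothetical, and a real deployment would draw them from its directory service or access control system.

```python
# Hypothetical mapping of organizational groups to the data types they may read.
ACCESS_POLICY = {
    "security": {"firewall", "event_log", "file_access"},
    "compliance": {"event_log", "email"},
    "operations": {"application", "event_log"},
}


def can_read(group, data_type):
    """Return True if the group is allowed to read this data type."""
    return data_type in ACCESS_POLICY.get(group, set())


def visible_types(group):
    """List the data types a group may see, in a stable order."""
    return sorted(ACCESS_POLICY.get(group, set()))
```

Unknown groups get no access by default, which is the safer failure mode when the policy and the organization drift out of step.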
Here are some of the most popular data partitioning strategies, along with some salient criteria for where they prove useful. The approaches below aren’t necessarily mutually exclusive. It’s perfectly feasible, and often desirable, to combine multiple strategies: for example, partitioning by Data Type, with each type stored separately across multiple Geographical sites/data centres.
Note these are in no particular order – it’s important to decide which approach or combination best fits the needs of the organization:
- Geographic
Store data according to the location where it is generated
This partitioning approach works well for organizations that have many geographically-dispersed and business-critical sites.
Geographic partitioning is also a good fit when stored data needs to be accessed locally, and is less needed in centralized locations.
Pros:
- Keeps data close to where it is generated, thus minimizing bandwidth traffic and latency in data gathering
- Data from different locations/countries is kept segregated, helping with issues like access control, data ownership and local regulations
Cons:
- Because data is not centralized, searching and reporting across many sites can be a lengthy process
- As a general rule, geographic partitioning requires more hardware to be deployed
- Organizational
Segregate and store data according to organizational unit/department
For large organizations, where there are many stakeholders and IT ownership is not centralized, this can be a good strategy.
Pros:
- Data owned by various departments remains within those departments. Data ownership is preserved
- Deployed hardware can be budgeted/charged to each relevant department
Cons:
- Searching and reporting across many departments can be a lengthy process
- Access Control for centralized searching/reporting can be more management intensive
- As a general rule, Organizational partitioning requires more hardware to be deployed
- Technological
Store data by the technology type that generates the data – e.g. Windows servers, Linux servers, mobile devices, etc.
This approach can work for organizations that use separate IT departments for different technology areas. Some of these may be outsourced or managed.
Pros:
- Data that is managed by different IT groups remains segregated and can be managed/accessed separately
Cons:
- Access Control for centralized searching/reporting can be more management intensive
- Searching and reporting across many technology groups can be a lengthy process
- As a general rule, Technological partitioning requires more hardware to be deployed
- Data Type
Store by the type of data being generated – e.g. event log data, file access behaviour, email, firewall logs, application data, documents etc.
This technique can work well for many types of organizations, and is particularly useful for leveraging in-house expertise in the various IT technology types. For this and other reasons, it is one of the most popular data partitioning techniques.
Pros:
- Data is aggregated as it is indexed, making retrieval/search/reporting more efficient and scalable
- Access Control can be managed by type, allowing stakeholders to have access only to the data they need
- If storage is centralized/semi-centralized, can provide an efficient use of hardware resources
Cons:
- Depending on geographical location spread, data from disparate locations can be aggregated together, potentially requiring additional access control and other management procedures to be put in place
- Usage Type
Store data by the way the data is used – e.g. Security, Compliance, System operations, etc.
This specialist approach is good for organizations that have distinct and separate (possibly outsourced) requirements for gathered data within security, operations, compliance etc.
Pros:
- Data storage can be ‘fanned-out’ to discrete organizations and/or services
- Access Control is generally managed separately by each organizational unit
Cons:
- As this technique usually involves duplication of data, there are generally higher system and hardware requirements
- Depending on the management structure involved, it can be difficult to manage the pipelining of relevant data to all parties
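Since the strategies above can be combined, here is a minimal sketch of routing events by a composite key, Geographic first and Data Type second. The event fields (`site`, `data_type`, `message`) and the function names are assumptions made for illustration, not part of any particular product.

```python
from collections import defaultdict


def partition_key(event):
    """Build a composite key: geographic site first, then data type."""
    return (event["site"], event["data_type"])


def route_events(events):
    """Group events into partitions according to their composite key."""
    partitions = defaultdict(list)
    for event in events:
        partitions[partition_key(event)].append(event)
    return dict(partitions)


# Hypothetical sample events.
events = [
    {"site": "london", "data_type": "firewall", "message": "drop 10.0.0.5"},
    {"site": "london", "data_type": "event_log", "message": "login failed"},
    {"site": "tokyo", "data_type": "firewall", "message": "allow 10.0.0.9"},
    {"site": "london", "data_type": "firewall", "message": "drop 10.0.0.7"},
]

partitions = route_events(events)
```

Changing what `partition_key` returns is all it takes to try a different strategy, for example keying on department or usage type instead.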
Conclusion
Data warehousing projects must collect and manage overwhelming amounts of data, usually with tight budgets and resourcing. A clear, strategic outline for data warehousing and the associated data mining makes it much easier for IT Managers and budget controllers to make informed choices on the best methodology for storing the morass of internal, perimeter and application IT data.
With the above techniques and approaches, you and your organization can be better placed to deliver successful and efficient data warehousing projects.
Until next time – Happy Mining!
