MOSS Adventures: Operations

Saturday, February 21, 2015

A Tale of Two Governances - Part 1 Health Benchmark

When you read about governance, it is often focused on what I call foundational governance. In the case of information technology (see my definition), we focus on foundational information governance, or the way we intend to use our information within the organization. This, however, is only one of two parts of the governance needed to manage our information. The second part, which is often not considered governance at all, encompasses the processes and procedures needed to maintain the systems that are used to manage and transmit information. I refer to this as operational governance, and it consists of the structure, policies and procedures that ensure a stable and consistent information management solution.

Operational Governance

The governance of sustainment processes within an organization is typically hit and miss. Most organizations will have some type of backup and recovery process, but how many have a process for the creation of sites in an ECM like SharePoint? Now don't get me wrong, some organizations are very dutiful in creating the processes they perceive as necessary to maintain and administer their systems. The problem is that many are not, and those that are don't necessarily cover everything they need. As a consultant I come into organizations that are experiencing pain, usually in the governance of their solutions, and my job is to determine the gaps and remediate them. One of the best ways to evaluate gaps in the operational governance of a solution (regardless of the technology) is to interview the administrators and key business users, perform a health assessment of the system and make best practice recommendations based on the gaps; in some cases we then move on to remediate those gaps as a final step. These steps quickly identify what exists and what does not, and they help me understand the technical skills of the administrative team.

In this first part we will walk through finding our current state, then in a future post we will look at the rest of the operational governance that should be considered to ensure a properly sustained environment.

The Interviews

The first step in the process is the interviews. In a SharePoint solution I like to sit down with the farm administrators, the site collection administrators and the service desk manager. These three groups or persons can provide insight into pain and into items that take up a significant portion of their daily activity. Below are a few questions I will typically ask and why I ask them. It is also important to be clear with them about your purpose: as a consultant coming in, you may be perceived as critiquing them on their job, but we are there to help them be heard and to fix their pain.

In other solutions you may have different roles. As long as you can extract the pain and issues for the solution, your interviews can be with whoever can best provide the answers.

Farm Administrators

Farm administrators are your best source of information when it comes to issues with operational governance. They know the solution better than anyone else and have to deal with everything and anything that goes wrong. Often it is easiest to just sit down over coffee with a notebook, ask them what is wrong with the solution and what they would fix, then sit back, let them vent and take notes. I like to have a plan, though, so I typically compile a list of questions to ask beforehand (let me know if you have some good questions and I can add them).

Do you have anything that maps out your daily routine? This is asked to first establish the existence of a "Run Book" or standard operating procedures (SOP).
Do you have any tickets assigned to you that are more than 30 days old? If yes, what are those tickets and what is preventing you from closing them? This will help identify not only gaps in knowledge, but also pain areas in architecture or process. There is often an in-depth conversation about the cause and what they would like to see happen to help resolve these issues.
Are there any issues that keep recurring or that never really go away? This provides insight into pain areas where they may have a workaround, or an area where they have decided to perform something a specific way and it is not working. This is another area where we will have additional conversations about how they think it should be.
Do you have any performance issues with the current farm? If yes, do you know the cause, and have you researched a solution? Performance issues point to problems with the farm's architecture and/or configuration that may be hampering the solution and preventing it from performing as intended. The answers also help gauge knowledge level and root cause problem solving capabilities.
Which group or groups are the most active on your farm? This will identify who to interview from a site collection administrator perspective, concentrating on the site collections that are the most active and most in need of support.
Do you have remote offices that access the farm? How good is their connection? Do you get performance tickets from those offices? Remote connectivity is often an issue, and identifying up front where these connections occur and whether there are problems will save you time and effort. Follow the premise that it is easier to ask the question than to search for the answer. Tools are great, but the farm administrator will have insight the tools can't provide.

Notice I didn't ask them questions like how many farms, the servers on the farm, the number of content databases and their size. These can be asked, but typically you know those things before you begin the engagement and even if you don't, reports from SPRAP or any other health assessment tool will clearly give you all this information. At my office we have developed our own health assessment tool to answer all the farm questions and to touch over 100 different areas in the farm. I have included the areas in my post, What should I Check With a Health Assessment? and would love any feedback you have on the points and questions. With your help I can make it the most complete health assessment list available.

Once we have completed these, we can move on to the Site Collection Administrator questions. Site Collection Administrators have less knowledge of the configuration, but provide a direct point of contact with your key stakeholders.

Site Collection Administrators

Based on question 5 above, you should have an idea of which Site Collection Administrators are needed for this portion of the questions. In smaller organizations, the Site Collection Administrators may be the Farm Administrators; you should be able to figure that out quickly when beginning the engagement. Site Collection Administrators are a SharePoint solution's first line of direct contact and problem solving in the business. They are the most likely to know what the users want changed and which issues recur the most from a user experience perspective.

Do you have anything that maps out your daily routine? This serves a different purpose than with Farm Administrators, here you are looking for what is taking up most of their day. If they don't have it mapped out, you should sit down with them and ask what a typical day would look like. They may have trouble providing it, so another approach is to ask them to do some logging activities for a couple days, recording what they are working on. You can then review it and confirm if the tasks are typical or not.
Do you have any requests from your business users you have not been able to fulfill? If yes, what has prevented you from fulfilling them? This will often identify issues with configuration, policy or knowledge level. Use it as a sounding board to ensure the architecture meets the business needs.
Are there any issues that keep recurring or that never really go away? This provides insight into pain areas where they may have a workaround, or an area where they have decided to perform something a specific way and it is not working. This is another area where we will have additional conversations about how they think it should be.
If you could change anything about the solution, what would you change? Site Collection Administrators often have good feedback on improvements specific to user experience and functionality. Make note of the changes, then identify them as future state requests for remediation and road mapping.

Remember, these questions are really meant to draw out the pain points and issues with the environment. You may hear the same answer from many different people, and that should raise the importance of the issue. Some of the answers may be symptoms of a deeper problem; it will be your job to determine that before attempting to remediate it.
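
One way to keep track of which answers come up repeatedly is to tally them across interviews. The following is only a minimal sketch in Python, used here purely for illustration; the interviewee labels and issue names are made-up placeholders, not findings from a real engagement.

```python
from collections import Counter

# Hypothetical interview notes: each interviewee maps to the issues they raised.
interview_notes = {
    "farm_admin": ["slow search", "no run book", "remote office latency"],
    "site_collection_admin_hr": ["slow search", "site requests take weeks"],
    "service_desk_manager": ["slow search", "remote office latency"],
}

def rank_issues(notes):
    """Count how many different interviewees raised each issue."""
    counts = Counter()
    for issues in notes.values():
        counts.update(set(issues))  # count each person once per issue
    # Most frequently raised first; ties are broken alphabetically.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

for issue, mentions in rank_issues(interview_notes):
    print(f"{mentions} interviewee(s): {issue}")
```

A high count only says the issue is widely felt. It does not say whether it is a root cause or a symptom, so the ranking is a starting point for the follow-up conversations rather than a conclusion.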

Service Desk Manager

The Service Desk Manager can provide you with tangible numbers on where issues are occurring, open tickets and typical complaints that users have made about the system. These numbers support what has been discussed with the farm and site collection administrators and provide additional insight into the importance of the issues that have been identified.

Can you provide a report of tickets opened for SharePoint in the last 6 months? This should provide ticket count, time to close and total percentage of tickets for each category.
What are the main complaints your team hears with regard to SharePoint? The Service Desk is the first line of support, so they hear most of what the users like and dislike about the solution.
What would you change about SharePoint if you could? This is an open-ended question and should elicit conversation on improvements and the pain they feel from their environment.
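
If the service desk can only hand over a raw ticket export, the three numbers from the first question (ticket count, time to close and percentage of total per category) are easy to derive. This is a rough sketch in Python for illustration only; the tuple layout, category names and dates are assumptions, and a real export will need its own parsing.

```python
from collections import defaultdict
from datetime import date

# Hypothetical export: (category, opened, closed); closed is None if still open.
tickets = [
    ("Permissions", date(2015, 1, 5), date(2015, 1, 6)),
    ("Permissions", date(2015, 1, 10), date(2015, 1, 13)),
    ("Performance", date(2015, 1, 12), None),
    ("Search", date(2015, 2, 1), date(2015, 2, 11)),
]

def summarize(tickets):
    """Return count, share of total and average days to close per category."""
    by_category = defaultdict(list)
    for category, opened, closed in tickets:
        by_category[category].append((opened, closed))
    total = len(tickets)
    report = {}
    for category, rows in by_category.items():
        # Only closed tickets contribute to the time-to-close average.
        days = [(closed - opened).days for opened, closed in rows if closed]
        report[category] = {
            "count": len(rows),
            "percent_of_total": round(100 * len(rows) / total, 1),
            "avg_days_to_close": sum(days) / len(days) if days else None,
        }
    return report

for category, stats in summarize(tickets).items():
    print(category, stats)
```

Categories with a high share of tickets or a long average time to close are the ones to compare against what the administrators told you in their interviews.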

Remember, the questions above are a starting point; you want to draw out their pain experience. In some cases it might be better to talk directly to the business units, but always remember these interviews are about gaining insight into issues with the environment.

Health Assessment

As mentioned above, the Health Assessment portion is usually done through a tool that compiles all the information about the environment. It analyzes your solution and provides feedback on all areas that need to be considered. Please refer to What should I Check With a Health Assessment? for actual check points and complete it in whatever manner you wish.

Report and Remediation
From the interviews and health assessment, a report of gaps and issues with the design can be created and presented to organizational decision makers. From the report you will also be able to identify the criticality and, with discussion, the priority of the issues involved. Use this information to build a remediation plan that includes each issue, its criticality, its priority, the solution to the issue and the effort needed to resolve it, then sit down with the decision makers and work through the plan. The plan should provide a timeline for each resolution and the resource allocation needed to complete it.
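
As an illustration of what such a plan might hold, here is a minimal Python sketch. The field names, the 1-is-highest scales, the sample issues and the assumption that work is done one item at a time by a single resource are all hypothetical; adapt them to whatever the decision makers agree on.

```python
from dataclasses import dataclass

@dataclass
class RemediationItem:
    issue: str
    criticality: int   # 1 = most critical (hypothetical scale)
    priority: int      # 1 = address first, as agreed with decision makers
    solution: str
    effort_days: int

def build_timeline(items):
    """Order by priority, then criticality, and lay the work out end to end."""
    ordered = sorted(items, key=lambda i: (i.priority, i.criticality))
    timeline, day = [], 0
    for item in ordered:
        # (issue, start day, end day), assuming one item is worked at a time
        timeline.append((item.issue, day, day + item.effort_days))
        day += item.effort_days
    return timeline

plan = [
    RemediationItem("No tested disaster recovery plan", 1, 2,
                    "Schedule and run a recovery test", 10),
    RemediationItem("Content database over supported size", 1, 1,
                    "Split site collections across databases", 5),
    RemediationItem("No run book for farm administrators", 2, 3,
                    "Document daily operating procedures", 8),
]

for issue, start, end in build_timeline(plan):
    print(f"day {start:>2} to {end:>2}: {issue}")
```

Keeping criticality and priority as separate fields matters: criticality comes out of the assessment, while priority is a business decision made in the room.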

Next Part
In the next part of this series, we will look at other parts of your operational governance and what it takes to ensure your environment has the operational governance it needs. Feel free to read my other posts and follow me on Twitter: @DavidRMcMillan and @DevfactoPortals.

Tuesday, February 17, 2015

What should I check with a Health Assessment?

When you perform a health assessment of a SharePoint farm, you need to check everything you have and compare it to patterns and practices. In some cases you may come across limits (supported maximums) and boundaries (hard limits) for certain settings. Your goal should be to ensure you are well within any limits and to have a plan in place to keep your settings within the standards and practices as they relate to your farms.

The purpose of this blog post is to give you a guide to the physical attributes of your solution and what you need to check. I do not talk about tools in this blog, but I suggest you employ a tool for your health assessment because it provides a consistent, repeatable approach to your solution's health.

I will not be too verbose in this post, but rather will concentrate on the areas that I came up with together with one of my cohorts, Kevin Cole (follow him on twitter at SPDEVGUY), a Microsoft Certified Master of SharePoint 2010 and brilliant technical mind. I have the areas broken down into 11 different sections and will briefly talk about what you need to know in each of the areas, so let's get to it.

The Check Points

As I mentioned, you can check these things manually, but it will be time consuming. There are many tools available to perform these checks; we use PowerShell, which allows us to regularly and consistently create our health reports. I have not gone in depth into any of these, but I will add to or modify the list if you provide feedback. This is a work in progress, but as far as I know it is the only checklist I have found to date that covers the farm.

Servers

Determine the servers being used in the farm: Server identification is needed to understand the resources you are working with and to identify gaps in architecture.
Determine the roles of each server in the farm: The role tells you what the server is doing and on which tier of the farm architecture the server resides.
Draw the logical diagram of the farm: A list of servers and their roles is difficult for the average user to understand; a graphical representation makes it easier for everyone.
Gather the number of processors, type and if they are dedicated or shared (VM) for each server: Knowing the allocated processing power helps identify processing shortfalls that may cause performance issues.
Gather the RAM and whether it is dedicated or shared (VM) for each server: Knowing the allocated RAM helps identify when disk caching will occur and identify performance issues.
Gather the total and available storage for each server (Physical and SAN): Understanding your storage and any limitations will ensure you don't run into a situation that has you scrambling to add storage. In addition, configuration of swap drives, etc. can affect performance.
Gather the type, current capacity, allocated and maximum capacity of the SAN: Knowing the SAN capacity will help with determining current capacity and planned growth. The type of SAN will help identify any RBS provider issues or determine what is needed to implement RBS, if it has not been implemented.
Determine the hardware lifecycle for server infrastructure: Understanding how old each server is and when it is planned to be replaced allows for a proper perspective when identifying which servers are underpowered for the current environment or for future growth.
Determine the patch levels of the server OS and all dependent services: Identifying any outstanding patches will identify any risks to the stability of the OS and the services SharePoint relies upon and may identify possible security exploits.
Determine patching schedule and outage windows for the solution: Patching Schedules and Outage windows are important to the health of the servers, allowing for proper maintenance of the servers without the risk of causing a disruption. Determine if and when patching is performed, when the outage window occurs and how long it lasts.
Determine the SQL Server version and patch level: Knowing your SQL Server version and patch level will help you identify issues with performance and may identify security holes. In addition, the SQL Server version affects some feature availability and limitations, depending on your farm.
RBS SQL Server Configuration: Storing BLOBs in the database can consume large amounts of file space and expensive server resources. RBS efficiently transfers the BLOBs to a dedicated storage solution of your choosing, and stores references to them in the database. This frees server storage for structured data, and frees server resources for database operations.
RBS BLOB Threshold: Setting the right size threshold will ensure a balance between processing needed to offload large files and your content database size.
SAN Configuration: A misconfigured SAN can cause increased latency and other issues to RBS, SharePoint and SQL Server.
Storage Provider Configuration: Using the correct storage provider (and correct version) for your SAN will improve performance.
SAN Capacity: Ensure your future storage needs do not exceed the current capacity, check for the current utilization and available storage as well as the ability to expand storage hardware if needed.
SharePoint RBS Configuration: Ensure your farm is configured correctly for RBS.
BLOB caching setup: Disk-based caching is extremely fast and eliminates the need for database round trips if it is configured properly.
RAM Utilization: Ensure your farm servers are not over utilized.
CPU Utilization: Ensure your farm servers are not over utilized.
User Profile import filters: Are service accounts and disabled accounts filtered out?
User profile synchronization schedule: Find the right balance for the sync.
Portal super reader and super user accounts setup: Verify they are set properly and that the membership is correct.
Office web apps cache: It is recommended to isolate the content database used for the Office Web Apps cache, so that cached files do not contribute to size of the "main" content database(s) for the Web application.
OWA service apps: Ensure the Apps are running on correct server roles.
Web apps: Ensure Web apps are not running in ASP.NET debug mode in production.
Farms: Record the number of Farms and purpose of each.
Web Apps: Ensure Web apps are configured correctly.
Content Databases: Ensure proper content database sizes and configuration.
Site Collections: Ensure properly sized and organized site collections.
Custom Features: Review and record the Custom Features, where they are used, their intended purpose and proper installation and activation.
Custom Apps: Review and record all custom apps installed on the farm, their intended use and where they are being used.
Custom Web Parts: Review and record where any custom web parts are being used and that they are working properly.
Environments: Record and ensure the environments are synchronized and consistent with each other and that they are being used for their intended purpose.
Environment Patching: Check environments for consistent patching (build numbers) between all environments.
SQL Naming: Ensure SQL Servers are using SQL Aliases, not computer names or CNAMES.
DNS: Ensure host records are defined for the SQL Aliases.
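
The Environment Patching check above boils down to comparing build numbers. A small sketch of that comparison follows, written in Python purely for illustration; the environment names and build strings are placeholder values, and how you collect the real build numbers is left to your tooling.

```python
# Hypothetical inventory: environment name -> farm build number as reported.
environment_builds = {
    "DEV": "15.0.4569.1000",
    "TEST": "15.0.4569.1000",
    "PROD": "15.0.4420.1017",
}

def find_build_mismatches(builds, reference="PROD"):
    """Return the environments whose build differs from the reference one."""
    expected = builds[reference]
    return {env: build for env, build in builds.items() if build != expected}

mismatches = find_build_mismatches(environment_builds)
for env, build in mismatches.items():
    print(f"{env} is on {build}, PROD is on {environment_builds['PROD']}")
```

Which environment you treat as the reference is a judgment call. Production is used here as an assumption, since that is usually the build the other environments are meant to mirror.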

Platform

Page File on a separate drive from the OS, SharePoint and Logs
Does storage meet the farm's needs (current vs. projected)?
Are there large files being stored in document repositories?
Record number and size of files
Is there a change management process involved?

Logs

Check Application log for errors
Check System log for errors
Check ULS log for errors/ critical / warnings
Check IIS logs for 503 error pages
Check IIS logs for slow (>200ms) loading pages
Check IIS logs for Active Directory Latency (304 not modified with excessive load times)
Check IIS logs for dead links (404 errors)
Check Requests per second count from IIS logs
Check log locations (SharePoint/IIS should be on a secondary drive)
Check for unrestricted growth
Check log drive capacity/utilization
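
Several of the IIS checks above (503 and 404 counts, slow pages and requests per second) can be pulled from the same pass over a log file. The sketch below is in Python for illustration only and assumes a W3C extended format log whose #Fields line includes date, time, cs-uri-stem, sc-status and time-taken (in milliseconds); the sample log lines are invented.

```python
from collections import Counter

# Invented sample in W3C extended format; real logs carry more fields.
SAMPLE_LOG = """\
#Fields: date time cs-method cs-uri-stem sc-status time-taken
2015-02-17 10:00:01 GET /sites/hr/default.aspx 200 150
2015-02-17 10:00:01 GET /sites/hr/old.aspx 404 12
2015-02-17 10:00:02 GET /sites/it/home.aspx 503 31
2015-02-17 10:00:02 GET /sites/it/reports.aspx 200 950
2015-02-17 10:00:02 GET /style.css 304 4
"""

def summarize_iis_log(text, slow_ms=200):
    """Count 503/404 responses, list slow pages and find the busiest second."""
    fields = []
    status_counts = Counter()
    per_second = Counter()
    slow_pages = []
    for line in text.splitlines():
        if line.startswith("#Fields:"):
            fields = line.split()[1:]  # column names for the rows that follow
            continue
        if not line or line.startswith("#") or not fields:
            continue
        row = dict(zip(fields, line.split()))
        status_counts[row["sc-status"]] += 1
        per_second[(row["date"], row["time"])] += 1
        if int(row["time-taken"]) > slow_ms:
            slow_pages.append(row["cs-uri-stem"])
    return {
        "503": status_counts["503"],
        "404": status_counts["404"],
        "slow_pages": slow_pages,
        "peak_requests_per_second": max(per_second.values(), default=0),
    }

print(summarize_iis_log(SAMPLE_LOG))
```

The 200 ms threshold mirrors the figure in the list above and is passed in as a parameter so it can be tuned per farm.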

Solution Integrity

Old SSP Site removed (for in place upgrades)
Check Supported Limits for Managed path counts
Check Supported Limits for Content DB sizes
Check Supported Limits for List item counts
Check for deleted pages in navigation
Check for unused content sources in the search crawl
Check Health Analyzer rules
Check patch levels for all content databases
Check for orphaned site collections
Check for broken site collections
Check for broken my sites
Check for missing web part references (Error web part detected)
Any Sites running in UI Compatibility Mode (2007 or 2010)
Check code quality process for stress testing
Check code quality process for load testing
Check code quality process for security testing (each role)
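
For the supported limit checks in this section, it helps to flag values that are approaching a limit as well as those already over it. The following Python sketch is illustrative only; the limit values shown are example figures and should be confirmed against Microsoft's published boundaries and limits for your SharePoint version, and the measured values are made up.

```python
# Example limits only; confirm against the published limits for your version.
LIMITS = {
    "content_db_gb": 200,
    "managed_paths": 20,
    "list_view_items": 5000,
}

def check_limits(measured, limits, warn_ratio=0.8):
    """Flag measurements that exceed, or come close to, their limit."""
    findings = []
    for name, value in measured.items():
        limit = limits[name]
        if value > limit:
            findings.append((name, "over limit"))
        elif value >= warn_ratio * limit:
            findings.append((name, "approaching limit"))
    return findings

# Hypothetical measurements gathered by your assessment tooling.
measured = {"content_db_gb": 240, "managed_paths": 17, "list_view_items": 1200}
for name, status in check_limits(measured, LIMITS):
    print(f"{name}: {status}")
```

The 80 percent warning ratio is an arbitrary starting point. The aim is simply to surface settings that are no longer well within their limits before they become a problem.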

Continuity

Is backup being performed?
Review backup process
http://technet.microsoft.com/en-us/library/cc298801.aspx#sectoin1b
Is the disaster recovery plan tested and reviewed annually?
Ensure Central Admin is redundant.
Is disaster recovery farm on another site?
Virtual machines distributed properly across physical hosts for disaster protection?
Check for role redundancy for Web front ends
Check for role redundancy for Application Servers
Check for role redundancy for Database
Check for Service redundancy
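
Given the list of servers and roles gathered in the Servers section, the role redundancy checks can be reduced to counting how many servers hold each role. This is a minimal Python sketch for illustration; the server names, role labels and the minimum of two servers per role are assumptions.

```python
from collections import Counter

# Hypothetical inventory: server name -> roles it holds.
servers = {
    "WFE01": ["Web Front End"],
    "WFE02": ["Web Front End"],
    "APP01": ["Application", "Central Admin"],
    "SQL01": ["Database"],
}

def roles_without_redundancy(servers, minimum=2):
    """Return roles hosted on fewer than `minimum` servers."""
    counts = Counter(role for roles in servers.values() for role in roles)
    return sorted(role for role, n in counts.items() if n < minimum)

print(roles_without_redundancy(servers))
```

A count of servers says nothing about where they run, so this complements rather than replaces the check that virtual machines are spread across physical hosts.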

Security

Check for Extra ISA Firewall rules.
Check SSL Use // IPSEC
Are MySites hosted on a dedicated web application?
Is the farm admin able to manage the service accounts?
Ensure farm account is not being used for other services.
Farm account should not be in local administrators group unless doing install or patch.
Ensure external access uses SSL.
Kerberos Configuration (SPN's configured properly)
SP 2007 http://blogs.msdn.com/b/martinkearn/archive/2007/04/23/configuring-kerberos-for-sharepoint-2007-part-1-base-configuration-for-sharepoint.aspx

SP 2010 http://www.microsoft.com/en-us/download/details.aspx?id=23176

SP 2013 http://technet.microsoft.com/en-us/library/ee806870.aspx
Ensure the proper number of service accounts:
SP 2007: 3
SP 2010: 5
SP 2013: up to 16 service and 3 server.
Ensure My Sites are configured with secondary site collection owners.
Ensure farm admin and service accounts are not permitted interactive logon.
Ensure the proper service accounts are used for the proper services:
http://technet.microsoft.com/en-us/library/ee662513.aspx

Database

Check content databases within limits.
Check transaction log sizes.
Check for excessive free space. // shrink db
Trim audit logs to reduce content db size.
Check for maximum degree of parallelism.
Ensure database auto growth sizes set properly.

Information Architecture

Verify: universal site taxonomy.
Check maximum site depth.
Check maximum site width
Check for a high number of role assignments on individual items.
Check for a high number of unique permissions.
Check content growth projections.
Check for a high number of sites sharing a content database.
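
For the content growth projection, even a simple straight-line estimate gives decision makers something to plan against. The Python sketch below is illustrative only and assumes growth is constant from month to month, which real farms rarely honor exactly; the figures are invented.

```python
import math

def months_until_full(current_gb, capacity_gb, growth_gb_per_month):
    """Estimate whole months until capacity is reached, assuming linear growth."""
    if growth_gb_per_month <= 0:
        return None  # no growth measured, so no projection can be made
    remaining = capacity_gb - current_gb
    if remaining <= 0:
        return 0  # already at or over capacity
    return math.ceil(remaining / growth_gb_per_month)

# Hypothetical figures: 120 GB used of 200 GB, growing by 10 GB a month.
print(months_until_full(120, 200, 10))
```

The same calculation works whether the capacity figure is a content database size limit or the storage actually allocated on the SAN; run it against whichever you will hit first.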

Branding

Are there any custom master pages?
Are the custom master pages or page layouts working properly?
Are all images / styles / etc checked in and published?

Customization

What WSP Solutions are deployed?
Are any InfoPath forms deployed?
Check for Invalid / missing Feature counts.
Ensure assemblies are compiled in release mode not debug mode.
Which solutions are 3rd party?
Which solutions are in house?
Check solution utilization (Where, activation locations, actual usage)

Search

Check crawl logs for any errors or warnings.
Check crawl schedules.
Check crawl running time versus crawl interval.
Check for successful crawls and crawl failures.
Check search service account configuration.
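
Checking crawl running time against crawl interval means looking for crawls that take longer than the gap before the next one is due. Here is a rough Python sketch for illustration; the crawl history and the one hour interval are invented values standing in for what you would read from the crawl log and the schedule.

```python
from datetime import datetime, timedelta

# Hypothetical crawl history: (start, end) of each completed crawl.
crawl_history = [
    (datetime(2015, 2, 16, 1, 0), datetime(2015, 2, 16, 1, 40)),
    (datetime(2015, 2, 16, 2, 0), datetime(2015, 2, 16, 3, 25)),
    (datetime(2015, 2, 16, 4, 0), datetime(2015, 2, 16, 4, 35)),
]

def overrunning_crawls(history, interval):
    """Return (start, duration) for crawls that ran longer than the interval."""
    return [
        (start, end - start)
        for start, end in history
        if end - start > interval
    ]

for start, duration in overrunning_crawls(crawl_history, timedelta(hours=1)):
    print(f"Crawl started {start} ran for {duration}")
```

Crawls that regularly overrun suggest the schedule is too aggressive for the content volume or that the crawl itself is hitting errors worth chasing in the crawl log.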

I realize there may be some repetition above, but the purpose of this is to help you ensure a healthy environment. If you have any questions, additions or modifications, please comment and I will make updates. Please follow me on twitter @DavidRMcMillan and @DevFactoPortals. I look forward to making this a resource any admin can use.