Feb 04 2018

Source: https://www.analyticsvidhya.com

In a previous blog post I wrote about how Information Architecture is an enabler for Big Data Analytics. This post will focus on how Information Architecture is used to enable Big Data Analytics and essentially becomes AI’s Secret Ingredient!

Most of the discussion around Artificial Intelligence (AI) is focused on Machine Learning, Cognitive Computing and Big Data Analytics. However, you must prepare your organization’s data in order to take proper advantage of AI tools that are focused on Big Data Analytics (such as IBM Watson and the offerings from Amazon, Microsoft, and Google). To properly prepare your data you will need to apply Information Architecture.

Information Architecture (IA) is AI’s secret ingredient. IA provides the processes, procedures and methods to perform Content (semi-structured and unstructured data) Curation. Applied to content curation, IA focuses on the semi-structured and unstructured data that comprises over 90% of the data being analyzed by big data analytics. Semi-structured data is a form of data that does not conform to the formal structure of databases or data tables, but contains tags to separate elements and enforce hierarchies within the data (e.g., spreadsheets, XML files). Unstructured data is a form of data with no tagging, metadata or inherent structure associated with it (e.g., images, text, voice, video). Content typically refers to the container that the semi-structured and unstructured data resides in (e.g., .pdf, .doc, .xml, .ppt, .csv).
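
To make the distinction concrete, here is a minimal Python sketch. The policy record, its tag names and the extract_fields helper are all made up for illustration. It shows that the tags in semi-structured data let you pull out fields directly, while the same facts written as free text carry no such handles.

```python
import xml.etree.ElementTree as ET

# Semi-structured: tags separate the elements and enforce a hierarchy.
# (Hypothetical record, invented for this example.)
SEMI_STRUCTURED = """
<policy id="P-100">
  <holder>Jane Doe</holder>
  <coverage type="auto">
    <limit>50000</limit>
  </coverage>
</policy>
"""

# Unstructured: the same facts as free text, with no tags or metadata.
UNSTRUCTURED = "Jane Doe holds auto policy P-100 with a limit of 50000."


def extract_fields(xml_text: str) -> dict:
    """Read named fields straight from the tag hierarchy."""
    root = ET.fromstring(xml_text.strip())
    return {
        "policy_id": root.get("id"),
        "holder": root.findtext("holder"),
        "coverage_type": root.find("coverage").get("type"),
        "limit": int(root.findtext("coverage/limit")),
    }


fields = extract_fields(SEMI_STRUCTURED)
print(fields)
```

Nothing comparable can be done with UNSTRUCTURED without first adding structure to it, which is the work content curation sets out to do.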

Content curation enables the extraction of value from data, and it is a capability that is required for areas that are dependent on complex and/or continuous data integration and classification. Improving data curation tools and methods directly increases the efficiency of the knowledge discovery process, maximizes return on investment per data item through reuse, and improves organizational transparency.

Most organizations just deal with creating a common data model depicting all structured data in the organization. Although this is a great start, it is only part of your data picture. Creating a common or enterprise content model depicting semi-structured and unstructured data will complete your data picture and lay the foundation for data centralization, providing the most accurate and holistic representation of your organization’s data.

Creating a common view of your semi-structured and unstructured data is a daunting task! Due to the massive amount of data and the variety of sources, it is important to start small, splitting the data into specific domains. To align your common data model and content model you must have common terms and consistent structures. In particular, for your semi-structured and unstructured data it is important to apply consistent metadata that fully describes the data.
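
As a rough illustration of "common terms" and "consistent metadata", the Python sketch below checks a content record against a fixed set of metadata fields and maps variant terms to one agreed term. The field names, the term mapping and both helper functions are assumptions made for this example, not a standard.

```python
# Hypothetical set of metadata fields every content item must carry.
REQUIRED_FIELDS = ("title", "source", "domain", "content_type")

# Hypothetical common terms: variant spelling -> agreed term.
COMMON_TERMS = {"client": "customer", "cust": "customer"}


def normalize_term(term: str) -> str:
    """Map a variant term to the agreed common term (lower-cased)."""
    key = term.strip().lower()
    return COMMON_TERMS.get(key, key)


def missing_metadata(record: dict) -> list:
    """Return the required fields that are absent or blank in a record."""
    missing = []
    for name in REQUIRED_FIELDS:
        value = record.get(name)
        if value is None or not str(value).strip():
            missing.append(name)
    return missing


record = {
    "title": "Q4 claims report",
    "source": "claims-share",
    "domain": " Client ",
    "content_type": "",
}
print(missing_metadata(record))          # fields still to be filled in
print(normalize_term(record["domain"]))  # agreed term for the domain
```

Starting with a small, fixed list of fields per domain keeps the check simple, and the list can grow as more domains are brought into the content model.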

Content Curation Process for Big Data Analytics

Content curation provides the methodological and technological data management support to address data quality issues, maximize the usability of the data, provide active and ongoing management of data through its lifecycle, perform data discovery and retrieval, create and maintain quality, add value, and provide for re-use over time.

The content curation process includes the following activities:

  • Content Audit
    • Gather the requirements for content, as well as measurement and evaluation criteria. Perform a content audit of the various content sources under consideration: determine what content is ready to be consumed, evaluate the quality of content, determine the gaps in content, and identify the measurements to determine what content is used (and not used).
  • Content Analysis
    • Content analysis examines information concepts, relationships, business rules and metadata. This provides a sharable, stable and organized structure for content (information and knowledge) for the enterprise. Semantics will address the meaning of the concepts identified in the content model as well as the meanings of the relationships between the concepts (usually expressed as business rules).
  • Address Content Gaps
    • The results of performing a Content Audit will determine the gaps in content and identify the additional sources of content that are needed for effective big data analytics.
  • Content Selection & Validation
    • Content selection should be considered in terms of four criteria. Significance: how essential or basic is the content to the discipline? Validity: is the content accurate, current and relevant to the domain under consideration? Relevance: what is the discipline/workplace/societal value of this content? Utility: how useful will the content be to the overall domain under consideration? Validation, or content validity, is concerned with making sure that the content is accessed from and/or based on an authoritative or trusted source, reviewed on a regular basis (based on the specific governance policies), modified when needed and archived when it becomes obsolete.
  • Classification
    • The classification of content will be in the form of one or more ontologies/taxonomies. Classification of information will also be realized through controlled vocabularies and thesauri. The structure refers to the methods used to aggregate the concepts and metadata into the domain ontology/taxonomy.
  • Align Content to Domain Ontology/taxonomy
    • Categorizing content and aligning the content to a common ontology/taxonomy is essential to big data analytics because of the number and variety of data sources under consideration.
  • Transformation
    • Transformation provides a consistent look and feel between similar content types, so that content is consistently identified with standard and precise metadata and aligned to an accurate and exact ontology/taxonomy.
  • Preservation and Governance
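
The Classification and Align Content to Domain Ontology/taxonomy activities above can be sketched in a few lines of Python. This is only a toy, assuming a tiny hand-written taxonomy, a tiny thesaurus and simple phrase matching; the category names, terms and function names are all hypothetical, and a real ontology would be far richer.

```python
import re

# Hypothetical domain taxonomy: category path -> controlled vocabulary terms.
TAXONOMY = {
    "Insurance/Underwriting": {"underwriting", "risk assessment"},
    "Insurance/Pricing": {"rate manual", "pricing", "premium"},
}

# Hypothetical thesaurus: variant word -> preferred term.
THESAURUS = {"rates": "pricing", "uw": "underwriting"}


def normalize(text: str) -> str:
    """Lower-case the text, split it into words, apply the thesaurus."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(THESAURUS.get(word, word) for word in words)


def align_to_taxonomy(text: str) -> list:
    """Return the sorted categories whose vocabulary appears in the text."""
    padded = f" {normalize(text)} "  # padding forces whole-word matches
    return sorted(
        category
        for category, terms in TAXONOMY.items()
        if any(f" {term} " in padded for term in terms)
    )


print(align_to_taxonomy("UW guidelines and new rates for 2018"))
print(align_to_taxonomy("Quarterly marketing newsletter"))
```

Content that matches no category, like the second example, is a useful signal in its own right: either the taxonomy has a gap or the content sits outside the domain under consideration.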

Preservation and Governance has the following characteristics:

    • Content and Classification Stewardship: The focus here is on establishing accountability for the accuracy, consistency and timeliness of content, content relationships, metadata and taxonomy within areas of the enterprise and the applications that are being used.
    • IA Management and Maintenance: This refers to the specific details on how the enterprise manages and maintains changes to content, content relationships, metadata and taxonomy. This is facilitated through the use of specific processes and workflows.
    • Policies and Procedures: This refers to establishing and/or conforming to information policies for generation, consumption and access of content (information and knowledge). This also addresses how information is handled. The organization has detailed information policies associated with specific information types (e.g., Underwriting Guidelines, Rate Manuals, Pricing Strategies).
    • Enforcement: Enforcement of governance pertains to the implementation and execution of the policies and procedures identified in the governance plan. A governance board should be established as the organizational entity that carries out the enforcement of governance, while the applications/tools must be configured to enforce governance of content on a day-to-day basis.
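
As a small illustration of day-to-day enforcement in a tool, the Python sketch below decides whether a content item should be archived, sent for review or left alone. The review intervals, the content type names and the governance_action function are assumptions for this example; actual values would come from your governance plan.

```python
from datetime import date, timedelta

# Hypothetical review intervals in days, per content type.
REVIEW_INTERVAL_DAYS = {"underwriting_guideline": 180, "rate_manual": 365}
DEFAULT_INTERVAL_DAYS = 365  # assumed fallback for unlisted content types


def governance_action(content_type: str, last_reviewed: date,
                      obsolete: bool, today: date) -> str:
    """Return 'archive', 'review' or 'ok' for one content item."""
    if obsolete:
        return "archive"  # obsolete content is archived, not reviewed
    days = REVIEW_INTERVAL_DAYS.get(content_type, DEFAULT_INTERVAL_DAYS)
    if today - last_reviewed > timedelta(days=days):
        return "review"   # the regular review is overdue
    return "ok"


today = date(2018, 2, 4)
print(governance_action("underwriting_guideline", date(2017, 1, 1), False, today))
print(governance_action("rate_manual", date(2017, 12, 1), False, today))
print(governance_action("rate_manual", date(2017, 12, 1), True, today))
```

Passing today in as a parameter, rather than reading the clock inside the function, keeps the rule easy to test and to audit.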

There is a need to discover patterns and create models to address a specific task or a business objective. Semi-structured and unstructured data is vital to the decision-making process. Defining a structured representation associated with the data allows users to compare, aggregate, and transform the data. With more data available, the barrier to data acquisition is reduced. To extract value from the data, it needs to be systematically processed, transformed, and repurposed into a new context. Curation of semi-structured and unstructured data in big data analytics is driven by the need to reduce the time-to-market, reduce the time to create new products, repurpose existing content, and improve the accessibility and visibility of information artifacts. Curation is important because of the growing variety of sources used in Big Data. Selecting your data from a variety of well-curated sources will add richness to your Big Data Analytics results!