Using the Data Restructuring Wizard for Unstructured Data
Much of the dark data that corporations have (but have not yet used) is in unstructured repositories. What is unstructured (vs. structured) data? According to Wikipedia, unstructured data is “information that either does not have a pre-defined data model or is not organized in a pre-defined manner”. It’s data that is not organized or classified in a way that can be easily grouped by subject; it’s mostly textual, but can also be images, audio, and video.
And let’s not forget social media. Facebook, Twitter, LinkedIn, and Pinterest, to name just a few, all contain unstructured and semi-structured data. That data can be very valuable to businesses large and small, but it needs to be structured before it becomes useful.
So what is structured data? Of course, in part, it’s the opposite of unstructured data. Webopedia defines structured data as “data that resides in a fixed field within a record or file.” It’s organized and relies on a model determining how the data is stored, processed, and accessed. Structured Query Language (SQL) is often used for managing structured data in database tables, just as SortCL data definition files (DDF) in IRI CoSort define the layouts of external, flat files.
Semi-structured data is a cross between structured and unstructured data. It contains structured elements but doesn’t fit into the formal models of relational databases or other sequential sources. Legacy (mainframe index) files are a good example of this hybrid, because they combine structured elements with proprietary layouts. Many XML files may fall into this category, too, although there are also tons of flat (structured) and unstructured (free-form) XML documents.
IRI software traditionally handled big data only in structured sources; that is, all kinds of flat-file formats, plus relational database tables that are extracted or reached via ODBC. But now it can also extract, structure, and process data in several semi-structured and unstructured data sources, including:
Unstructured Files (using the Data Restructuring wizard in the IRI Workbench GUI, built on Eclipse™)
- free-form text (.txt)
- Microsoft Word documents (.doc and .docx)
- Adobe Portable Document Format (.pdf)
- Extensible Markup Language (.xml)
- E-mail messages (.eml)
- Microsoft Excel spreadsheets (.xls and .xlsx)
- Microsoft PowerPoint presentations (.ppt and .pptx)
- Microsoft Exchange and Outlook (.ost and .pst)
- Rich Text Format (.rtf)
- ASN.1 call detail record (CDR) files (via a CoSort / SortCL input procedure)
- C-ISAM, IMS, QSAM, VSAM and other mainframe files (using partner ODBC drivers)
- MF-ISAM and Vision index files (using embedded Micro Focus libraries)
- MongoDB (JSON) and XML (using JDBC drivers in IRI Workbench)
This article considers just those unstructured data sources in the first group, and how IRI software helps you extract and make use of the information they contain.
The general idea is that, after parsing data in these files, you can output what you’re looking for into a structured text file, with its layout automatically defined in a data definition file (.DDF). The file and its metadata repository are easily used and re-used by IRI software to integrate, transform, migrate, mask, and report on that data, and/or feed it to other applications.
Use the Data Restructuring wizard in the IRI Workbench to search documents using parse patterns (regular expressions) and keywords. The different fields on the first screen are used to gather the information needed to search a variety of unstructured documents. Use the Source directory field to specify the upper-most path. Indicate the types of documents to be searched by checking the relevant file extensions.
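To make the first step concrete, here is a minimal Python sketch of what "start at a source directory and pick up only the checked file extensions" amounts to. It is not the wizard's implementation; the function name `find_documents` and the sample file names are hypothetical, and the recursive descent into sub-folders is an assumption.

```python
import tempfile
from pathlib import Path


def find_documents(source_dir, extensions):
    """Recursively list files under source_dir whose extension was selected."""
    wanted = {ext.lower() for ext in extensions}
    return sorted(
        path
        for path in Path(source_dir).rglob("*")
        if path.is_file() and path.suffix.lower() in wanted
    )


# Demonstration on a throwaway directory tree (hypothetical file names).
with tempfile.TemporaryDirectory() as top:
    root = Path(top)
    (root / "hr").mkdir()
    (root / "notes.txt").write_text("call 321-555-0142")
    (root / "hr" / "memo.TXT").write_text("see attached")
    (root / "hr" / "photo.png").write_bytes(b"")
    found = [
        p.relative_to(root).as_posix()
        for p in find_documents(root, [".txt", ".pdf"])
    ]

print(found)
```

The extension check is case-insensitive here so that `memo.TXT` is picked up along with `notes.txt`; the image file is skipped because its extension was not selected.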
Next, specify the folder and file names for the structured output file and the DDF. The column headers you enter correspond to the patterns or keywords you specify, and become the field names in the DDF. Choose the delimiter character that separates the fields, such as a comma or \t (for tab).
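For readers who have not worked with delimited flat files, the following Python sketch shows roughly what a tab-delimited output with named columns looks like. The column names and values are made up for illustration, and the header row is an assumption added for readability: in the wizard, the column names are recorded in the DDF, and this article does not say whether they are also written into the output file.

```python
import csv
import io

# Hypothetical column headers and search hits, for illustration only.
columns = ["FileName", "Phone"]
rows = [("notes.txt", "321-555-0142"), ("memo.txt", "321-555-0199")]


def write_delimited(handle, columns, rows, delimiter="\t", header=True):
    """Write rows as delimited text, optionally preceded by a header row."""
    writer = csv.writer(handle, delimiter=delimiter, lineterminator="\n")
    if header:
        writer.writerow(columns)
    writer.writerows(rows)


buffer = io.StringIO()
write_delimited(buffer, columns, rows)
output = buffer.getvalue()
print(output)
```

Passing `delimiter=","` instead would produce the comma-separated variant of the same file.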
Regular expressions are used to search for specific information. If you are not familiar with regular expressions, a lot of assistance is available on the internet, including the Wikipedia article on the subject. IRI also provides a couple of examples in its easy-to-use context help.
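If parse patterns are new to you, this short Python sketch shows the general technique of pairing a column name with a regular expression and collecting every match from a block of free-form text. The two patterns (a simple e-mail matcher and a NNN-NNN-NNNN phone matcher), the column names, and the sample sentence are illustrative assumptions, not patterns supplied by IRI, and regex dialects differ slightly between tools.

```python
import re

# Hypothetical column headers mapped to illustrative parse patterns.
PATTERNS = {
    "Email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "Phone": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


def extract_fields(text, patterns=PATTERNS):
    """Return {column name: list of matches} for each pattern applied to text."""
    return {name: pattern.findall(text) for name, pattern in patterns.items()}


sample = (
    "Contact Ann at ann.lee@example.com or 321-555-0142; "
    "Bob is at bob@example.org."
)
hits = extract_fields(sample)
print(hits)
```

Each key in the result plays the role of a column header, and each match is a candidate value for that column in the structured output.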
Once you have entered the required information in the wizard, click Next to start the search and create the new files. A preview screen shows the restructured data returned by the search. The preview displays up to 50 lines, but you can view the complete results by opening the output file.
The structured output file and its DDF can then be used in IRI software for:
- Data Integration and Transformation
- Data Migration and Replication
- Data Masking (Encryption, De-ID, etc.)
- DB Load and Query Optimization
- Reporting or Hand-offs to BI Tools
- Population of CRM, DB, ETL, and External Apps
See how to use the newly structured output file and its DDF in the next article, Using CoSort on Restructured Data in the IRI Workbench.