(PDF) Data Warehousing & Data Mining.pdf - PDFSLIDE.NET (2024)

  • 8/21/2019 Data Warehousing & Data Mining.pdf

    1/143

    Data Warehousing & DataMinig 10IS74

    Dept. of ISE, SJBIT Page 1

    DATA WAREHOUSING AND DATA MINING

    PARTA

    UNIT1

    Data Warehousing:

    6 Hou

    Introduction, Operational Data Stores (ODS), ExtractionTransformation Loading (ETL),

    Warehouses.Design Issues,Guidelines for Data WarehouseImplementation,Data Warehouse Metad

    UNIT2 6 Hou

    Online Analytical Processing (OLAP): Introduction,Characteristics of OLAP syst

    Multidimensional view and Data cube, Data Cube Implementations,Data Cube operat

    Implementationof OLAP and overview on OLAP Softwares.

    UNIT3 6 Hou

    Data Mining: Introduction, Challenges, Data Mining Tasks, Typesof Data,Data Preproce

    Measures of Similarity and Dissimilarity, Data MiningApplications

    UNIT4 8 Hou

    Association Analysis: Basic Concepts and Algorithms:FrequentItemsetGeneration,Rule Genera

    Compact Representationof FrequentItemsets,Alternative methodsfor generatingFrequentItemsets

    Growth Algorithm,Evaluationof Association Patterns

    UNIT5

    PART - B

    6 Ho

    Classification -1 : Basics, General approach to solveclassification problem,Decision Trees, Rule B

    Classifiers,Nearest NeighborClassifiers.

    UNIT6

    Classification - 2 : BayesianClassifiers, Estimating Predictiveaccuracy of classification

    6 Ho

    met

    Improvingaccuracy of clarification methods,Evaluationcriteriafor classification methods,Multi

    Problem.

  • 8/21/2019 Data Warehousing & Data Mining.pdf

    2/143

    Data Warehousing & DataMinig 10IS74

    Dept. of ISE, SJBIT Page 2

    UNIT7 8 Ho

    Clustering Techniques: Overview,Featuresof clusteranalysis,Types of Data and ComputingDista

    Types of Cluster Analysis Methods,PartitionalMethods,Hierarchical Methods,Density Based Meth

    Quality and Validity of Cluster Analysis.

    UNIT8 6 Ho

    Web Mining: Introduction, Webcontent mining,TextMining,Unstructured Text, Text cluste

    Mining Spatial and TemporalDatabases.

    Text Books:

    1. Pang-NingTan, Michael Steinbach, Vipin Kumar: Introduction toData Mining, Pearson Educa

    2005.2. G. K. Gupta:Introductionto Data Mining with CaseStudies, 3rdEdition,PHI, New Delhi, 2009.

    Reference Books:

    1. Arun K Pujari:Data Mining Techniques, 2nd Edition,UniversitiesPress, 2009.

    2. Jiawei Han and Micheline Kamber: Data Mining - Concepts andTechniques, 2nd Edition, Mo

    KaufmannPublisher,2006.

    3. Alex Berson and Stephen J. Smith:Data Warehousing,

  • 8/21/2019 Data Warehousing & Data Mining.pdf

    3/143

    Data Warehousing & DataMinig 10IS74

    Dept. of ISE, SJBIT Page 3

    TABLE OFCONTENTS

    Unit-1 : Data Warehousing:

    1.1 Introduction,

    Page No.

    5

    1.2 OperationalData Stores (ODS) 6

    1.3 ExtractionTransformationLoading(ETL) 8

    1.4 Data Warehouses. 12

    1.5 Design Issues, 17

    1.6 Guidelines for Data Warehouse Implementation, 24

    1.7 Data Warehouse Metadata. 27

    UNIT2: Online Analytical Processing OLAP

    2.1 Introduction, 30

    2.2 Characteristics of OLAP systems, 34

    2.3 Multidimensional view and Data cube, 38

    2.4 Data Cube Implementations, 45

    2.5 Data Cube operations, 50

    2.6 Implementationof OLAP 56

    2.7 Overview on OLAP Softwares. 57

    UNIT 3: Data Mining

    3.1 Introduction, 60

    3.2Challenges, 61

    3.3Data Mining Tasks, 67

    3.4 Types of Data,

    73

    3.5 Data Preprocessing,69

    3.6 Measures of Similarity and Dissimilarity, Data MiningApplications 84

    UNIT 4:Association Analysis:

    4.1 Basic Concepts and Algorithms 87

    4.2 FrequentItemsetGeneration, 91

  • 8/21/2019 Data Warehousing & Data Mining.pdf

    4/143

    Data Warehousing & DataMinig 10IS74

    Dept. of ISE, SJBIT Page 4

    4.3Rule Generation, 97

    4.4 Compact Representationof FrequentItemsets, 99

    4.5 Alternative methods for generatingFrequentItemsets, 103

    4.6 FP Growth Algorithm,Evaluationof Association Patterns103

    UNIT5 & UNIT6

    5.1Classification -1: Basics, 107

    5.2 General approach to solve classification problem, 107

    5.3 Decision Trees, 110

    5.4 Rule Based Classifiers, 124

    5.5 Nearest NeighborClassifiers. 129

    5.6 Classification - 2: Bayesian Classifiers, 131

    UNIT7 Clustering Techniques:

    7.1Overview, 132

    7.2 Featuresof clusteranalysis, 132

    7.3 Types of Data and ComputingDistance, 133

    Based Methods, 133

    7.5 Quality and Validity of Cluster Analysis. 134

    UNIT8 Web Mining:

    8.1Introduction, 135

    8.2 Web content mining, 135

    8.3 TextMining, 136

    8.4UnstructuredText, 136

    8.5 Textclustering, 137

    8.6 Mining Spatial and TemporalDatabases. 138

    7.4 Types of Cluster Analysis Methods,PartitionalMethods,Hierarchical Methods,Density

  • 8/21/2019 Data Warehousing & Data Mining.pdf

    5/143

    Data Warehousing & DataMinig 10IS74

    Dept. of ISE, SJBIT Page 5

    UNIT 1

    DATA WAREHOUSING

    11 INTRODUCTION

    Major enterpriseshave many computers that run a variety ofenterpriseapplications.

    For anenterprise with branches inmany locations, the branchesmay havetheir own

    systems. For example, in a university with only one campus,thelibrary may run its own

    catalogand borrowingdatabase system while the studentadministrationmay have own

    systems running on another machine. Theremight be a separatefinancesystem, a

    property and facilities management system and another for humanresources

    management.A largecompany mighthave the following system.

    Human Resources

    Financials

    Billing

    Sales leads

    Web sales

    Customer support

    Such systems arecalledonlinetransaction processing (OLTP)systems. The OLTP

    systems are mostly relationaldatabase systems designedfortransactionprocessing.The

    performance of OLTP systems is usually very importantsince suchsystems are used to

    support the users (i.e. staff) that provide service to thecustomers. The systems

  • 8/21/2019 Data Warehousing & Data Mining.pdf

    6/143

    Data Warehousing & DataMinig 10IS74

    Dept. of ISE, SJBIT Page 6

    therefore must be able to deal with insert and updateoperationsas well as answering

    simple queriesquickly.

    1.2 OPERATIONAL DATA STORES

    An ODS has been definedby InmonandImhoff(1996) as asubject-oriented,integrated, volatile, current valued data store,containing only corporate detailed data. A

    data warehouse is a reportingdatabase that containsrelativelyrecent as well as historical

    data and may alsocontainaggregate data.

    The ODS is subject-oriented. That is, it is organized around themajordata

    subjectsof an enterprise. In a university, the subjectsmight bestudents, lecturers and

    courses while in company the subjectsmightbecustomers,salespersonsand products.

    The ODS is integrated.That is, it is a collection ofsubject-orienteddata from a

    variety of systems to providean enterprise-wideview of thedata.

    The ODS is current valued. That is, an ODS is up-to-date andreflects the current

    status of the information. An ODS does not include historicaldata. Sincethe OLTP

    systems data is changingall the time, data fromunderlyingsources refresh the ODS as

    regularly and frequently as possible.

    The ODS is volatile. That is, the data in the ODS changesfrequently asnew

    informationrefreshes the ODS.

    The ODS is detailed. That is, the ODS is detailedenough to servethe needs of theoperationalmanagement staff in the enterprise. Thegranularity of the data in the ODS

    does not have to be exactly the same as in the source OLTPsystem.

    ODS Design and Implementation

    The extractionof informationfrom source databases needs to beefficient and the quality

    of data needs to be maintained. Sincethe data is refreshedregularl

    yandfrequently,

    suitablechecks are requiredto ensure quality of data after eachrefresh. An ODS would

    of course berequired to satisfy normal integrity constraints,for example, existential

    integrity, referentialintegrity and appropriateactionto dealwithnulls. An ODS is a read

    only database other than regular refreshingby the OLTP systems.Users shouldnot be

    allowed to update ODS information.

  • 8/21/2019 Data Warehousing & Data Mining.pdf

    7/143

    Data Warehousing & DataMinig 10IS74

    Dept. of ISE, SJBIT Page 7

    Populatingan ODS involves an acquisition process of extracting,transforming

    and loading data from OLTP source systems. This process is ETL.Completing

    populating the database, checking for anomalies and testing forperformance are

    necessary before an ODS system can go online.

    SourceSystems ETL ODS

    End Users

    ExtractionTransformation

    Loading

    Managemenreports

    Oracle

    Operational

    Data Source Wefsba-fbdasedApplications

    IMS

    SAP

    Initialloading+refreshing

    OtherApplications

    CICS ETL

    Flat Files DataWarehouse

    Fig :1.1A possibleOperationalData Store structure

    Zero Latency Enterprise (ZLE)

    The Gantner Group has used a term Zero Latency Enterprise (ZLE)for near real-time

    integrationof operationaldata so that there is no significantdelay in gettinginformation

    from one part or one system of an enterpriseto another systemthat needs the information.

    The heart of a ZLE system is an operationaldata store.

    A ZLE data store is somethinglike an ODS that is integratedandup-to-date.The

    aim of a ZLE data store is to allow management a single view ofenterpriseinformation

  • 8/21/2019 Data Warehousing & Data Mining.pdf

    8/143

    ata Warehousing & DataMinig 10IS74

    by bringingtogether relevantdata in real-timeandprovidingmanagement a 360-degree

    view of the customer.

    A ZLE usuallyhasthe followingcharacteristics. It hasaunifiedview of the

    enterprise operational data. It has a highlevel of availabilityand it involvesonline

    refreshing of information. The achieve these, a ZLE requiresinformation that is as

    current as possible.Since a ZLE needs to support a largenumberof concurrent users,for

    examplecall centre users,a fast turnaround timefortransactionsand 24/7availability is

    required.

    1.3 ETL

    An ODS or a data warehouse is based on asingleglobal schema thatintegrates and

    consolidatesenterpriseinformationfrom many sources.Building sucha system requires

    data acquisition from OLTP and legacy systems. The ETL processinvolves extracting,

    transformingand loadingdata from source systems. The process maysound very simple

    since it only involves readinginformationfrom sourcedatabases,transformingit to fitthe

    ODS database modeland loadingit in the ODS.

    As different data sources tend to have different conventions forcoding

    information anddifferent standards for the quality ofinformation, building an ODS

    requiresdata filtering, data cleaning, and integration.Thefollowing examplesshow the importanceof data cleaning:

    If an enterprisewishes to contact its customers or itssuppliers, it is essentialthat a

    complete,accurate and up-to-date list of contact addresses,emailaddresses and

    telephonenumbers be available. Correspondence sent to a wrongaddress that is

    then redirecteddoes not create a very good impressionabout theenterprise.

    If a customer or suppliercalls, the staff respondingshouldbequickly ale to find

    the person in the enterprise database but this requires that thecallers name or

    his/hercompany name is accurately listedin the database.

    If a customer appears in the databases with two or more slightlydifferentnames

    or different account numbers, it becomes difficult to update thecustomers

    information.

    ept. of ISE, SJBIT Page 8

  • 8/21/2019 Data Warehousing & Data Mining.pdf

    9/143

    Data Warehousing & DataMinig 10IS74

    ETL requires skills in management, businessanalysisandtechnology and is often a

    significant component ofdeveloping anODS or adata warehouse.TheETL process

    tends to be differentfor every ODS and data warehouse sinceevery system is different. It

    should not be assumed that an off-the-shelf ETL system canmagically solve all ETL

    problems.

    ETL Functions

    The ETL process consistsof data extraction from source systems,data transformation

    which includesdata cleaning, and loadingdata in the ODS or thedata warehouse.

    Transforming data that has been put in a stagingarea is a rathercomplex phase of

    ETL since a variety of transformationsmay berequired.Largeamounts of data from

    differentsources are unlikely to match even if belongingto thesame person since

    people usingdifferent conventionsand different technology anddifferent systems

    would have

    created records at different times in a different environmentfor different purposes.

    Building anintegrated database from anumber of such sourcesystems mayinvolve

    solvingsome orall of the followingproblems, some of which may besingle-source

    problemswhile others may be multiple-sourceproblems:

    1. Instance identity problem: Thesame customer or client mayberepresentedslightly different in different source systems. Forexample, my name is

    represented as Gopal Gupta in some systems and as GK Gupta inothers.Given

    that the name is unusualfor data entry staff in Westerncountries,it is sometimes

    misspelledas Gopal Gopta or Gopal Gupta or some other way. Thename may

    also be represented asProfessor GK Gupta, Dr GK Gupta or Mr GKGupta.

    There is thus a possibility ofmismatching between the differentsystems that

    needs to be identifiedand corrected.

    2. Data errors: Many differenttypes of data errors other thanidentity errors are

    possible.For example:

    Data may have some missing attributevalues.

    Dept. of ISE, SJBIT Page 9

  • 8/21/2019 Data Warehousing & Data Mining.pdf

    10/143

    Data Warehousing & DataMinig 10IS74

    Coding of some values in one database may not match with codingin

    other databases (i.e. differentcodes with the same meaningorsame code

    for differentmeanings)

    Meaningsof some code values may not be known.

    There may be duplicaterecords.

    There may be wrong aggregations.

    There may be inconsistentuse of nulls, spaces and emptyvalues.

    Some attributevalues may be inconsistent(i.e.outsidetheirdomain)

    Some data may be wrong because of inputerrors.

    There may be inappropriateuse of address lines.

    There may be non-uniqueidentifiers.

    The ETL process needs to ensure that all these types of errorsand others are

    resolvedusinga sound Technology.

    3. Record linkage problem: Recordlinkagerelates to the problemof linking

    informationfrom differentdatabases that relate to the samecustomer or client.

    The problemcan arise if a uniqueidentifier is not available inall databases that

    are being linked.Perhaps records from a database are beinglinked to records

    from a legacy system or to informationfrom a spreadsheet.Recordlinkage can

    involve a largenumber of record comparisonsto ensure linkagesthat have a high

    level of accuracy.

    4. Semanti c integration problem: This deals with theintegrationof information

    found in heterogeneous OLTP and legacy sources.Some of thesources may be

    relational, some may not be. Some may be even in text documents.Some data

    may be character stringswhile others may be integers.

    5. Data integri ty problem: This dealswith issueslikereferential integrity, null

    values, domainof values, etc.

    Dept. of ISE, SJBIT Page 10

  • 8/21/2019 Data Warehousing & Data Mining.pdf

    11/143

    Data Warehousing & DataMinig 10IS74

    Dept. of ISE, SJBIT Page 11

    Overcoming all these problemsis often a very tediouswork. Manyerrors can be difficult

    to identify. In some cases one may be forced to ask thequestionhow accurate the data

    ought to be since improving the accuracy is alwaysgoing torequire more andmore

    resources and completely fixing all problemsmay beunrealistic.

    Checking for duplicatesis not always easy. The data can besorted and duplicates

    removed although for largefiles this can beexpensive. In somecasesthe duplicate

    records are not identical. In these cases checks for primary keymay be required.If more

    than one record has the same primary key then it is likely to bebecause of duplicates.

    A sound theoreticalbackground is beingdevelopedfor data cleaningtechniques.It

    has been suggested that data cleaningshouldbe based on thefollowing five steps:

    1. Parsing: Parsing identifies various components of the sourcedata files and then

    establishes relationships between those andthe fields in thetarget files. The

    classical exampleof parsingis identifying the various componentsof a persons

    name and address.

    2. Correcting: Correcting the identifiedcomponents is usuallybased on a variety

    of sophisticated techniques includingmathematicalalgorithms.Correcting may

    involve use of other relatedinformationthat may be available inthe enterprise.

    3. Standardizing: Business rules of the enterprisemay now beused to transform

    the data to standard form. For example, in some companiestheremightbe rules

    on how name and address are to be represented.

    4. Matching: Much of the data extracted from a number of sourcesystems is likely

    to be related.Such data needs to be matched.

    5. Consolidating: All corrected, standardized andmatched datacan now be

    consolidatedto build a single version of the enterprisedata.

    Selecting an ETL Tool

    Selection of an appropriateETL Tool is an important decisionthat has to be made in

    choosing components of anODS or data warehousing application.TheETL tool is

    requiredto providecoordinatedaccess to multiple data sources sothat relevantdata may

  • 8/21/2019 Data Warehousing & Data Mining.pdf

    12/143

    Data Warehousing & DataMinig 10IS74

    Dept. of ISE, SJBIT Page 12

    be extracted from them.An ETL toolwould normally include toolsfor data cleansing,

    reorganization, transformation, aggregation, calculationandautomatic loading of data

    intothe target database.

    An ETL toolshouldprovidean easy user interfacethat allows datacleansingand

    data transformation rules to bespecified using a point-and-clickapproach. Whenall

    mappings and transformations have been specified, the ETL toolshould automatically

    generate

    mode.

    the data extract/transformation/load programs, which typicallyrun in batch

    1.4 DATA WAREHOUSES

    Data warehousingis a process for assemblingand managingdata fromvarious sources

    for the purpose of gaininga single detailed view of anenterprise.Although there are

    several definitions of data warehouse,a widely accepteddefinition by Inmon(1992) is an

    integrated subject-oriented and time-variant repository ofinformation in support of

    managementsdecision makingprocess. The definition of an ODS toexcept that an ODS

    is a current-valueddata store while a data warehouse is atime-variantrepository of data.

    The benefitsof implementinga data warehouse are as follows:

    To providea single version of truth about enterpriseinformation.This may appear

    rather obviousbut it is not uncommon in an enterprisefor twodatabase systems to

    have two differentversions of the truth.In many years of workingin universities,

    Ihave rarely found a university in which everyone agrees withfinancial figures of

    incomeand expenditureat each reportingtimeduringthe year.

    To speed up ad hoc reports and queries that involve aggregationsacross many

    attributes (that is, may GROUP BYs) which are resourceintensive. The

    managers require trends, sums and aggregations that allow, forexample,

    comparing this yearsperformance to last year

    s or preparation of forecasts for

    next year.

    To provide a system in whichmanagers who donot have a strongtechnical

    background are ableto run complex queries.If the managers areableto access the

  • 8/21/2019 Data Warehousing & Data Mining.pdf

    13/143

    Data Warehousing & DataMinig 10IS74

    Dept. of ISE, SJBIT Page 13

    information

    managers.

    they require, it is likely to reduce the bureaucracy aroundthe

    To provide adatabase that stores relatively cleandata.By usingagood ETL

    process, the data warehouse shouldhave data of high quality.When errors are

    discoveredit may be desirableto correct them directly in thedata warehouse and

    then propagate the correctionsto the OLTP systems.

    To providea database that stores historical data that may havebeen deletedfrom

    the OLTP systems. To improve response time, historical data isusuallynot

    retained in OLTP systems other than that which is required torespond to

    customer queries.The data warehouse can then store the data thatis purged from

    the OLTP systems.

    A useful way of showingthe relationshipbetween OLTP systems, adata warehouse and

    an ODS is given in Figure 7.2. The data warehouse is more likelongterm memory of an

    enterprise. The objectives in building the two systems, ODSanddata warehouse, are

    somewhat

    schemes.

    conflicting and therefore the two databases are likely to havedifferent

    ODS

    Data warehouseOLTP system

    Figure 7.2 Relationship between OLTP, ODS and DW systems.

  • 8/21/2019 Data Warehousing & Data Mining.pdf

    14/143

    Data Warehousing & DataMinig 10IS74

    Dept. of ISE, SJBIT Page 14

    In buildingand ODS, data warehousingis a process ofintegratingenterprise-widedata,

    originatingfrom a variety of sources,intoa single repository.Asshown in Figure 7.3, the

    data warehouse may be a centralenterprise-widedata warehouse foruse by all the

    decisionmakers in the enterpriseor it may consistof a number ofsmaller data warehouse

    (often called data marts or local data warehouses)

    A data mart stores information for a limited number of subjectareas.For

    example, a company mighthave a data mart about marketingthatsupports marketingand

    sales. The data mart approach is attractive since beginning witha single data mart is

    relatively inexpensive and easierto implement.

    A data mart may be used as a proof of data warehouseconcept.Data marts can

    also create difficulties by setting upsilos of informationalthough onemaybuild

    dependent data marts,which are populatedform the centraldatawarehouse.

    Data marts are often the common approach for buildinga datawarehouse since

    the cost curvefordata marts tends to bemore linear. Acentralized data warehouse

    project can be very resource intensive and requires significantinvestment at the

    beginningalthoughoverall costs over a number of years for acentralizeddata warehouse

    and for decentralizeddata marts are likely to be similar.

    A centralized warehouse can provide better quality dataandminimize data

    inconsistenciessince the data quality is controlledcentrally.The toolsand procedures for

    puttingdata in the warehouse can then be bettercontrolled.Controlling data quality with

    a decentralizedapproach is obviously more difficult. As with anycentralized function,

    though, the unitsor branches of an enterprisemay feel noownershipof the centralized

    warehouse may in some casesnot fully cooperate with theadministration of the

    warehouse. Also, maintaining a centralized warehouse wouldrequire considerable

    coordinationamong the various unitsif the enterpriseis largeandthiscoordinationmay

    incur significant costs for the enterprise.

    As an example of a data warehouse application we considerthe

    telecommunications industry which in most countries hasbecomevery competitive

    during the last few years. If acompany is able to identify amarket trend before its

    competitorsdo,then that can leadto a competitiveadvantage.Whatis therefore needed is

    to analyse customer needs and behaviour in an attempt to betterunderstand what the

  • 8/21/2019 Data Warehousing & Data Mining.pdf

    15/143

    Data Warehousing & DataMinig 10IS74

    Dept. of ISE, SJBIT Page 15

    customers want and need. Such understanding mightmake it easierfor a companyto

    identify,to

    develop, and deliver some relevant new products or new pricingschemes

    retainand attract customers.Itcan also helpin improvingprofitability since it can helpthe company understand what type ofcustomers are the most profitable.

    Data Mart Data Mart Data Mart

    Central Data Warehouse

    Database Database Legacy

    Figure 7.3 Simple structure of a data warehouse system.

    ODS and DW Architecture

    A typical ODS structure was shown in Figure 7.1. Itinvolvedextractinginformation

    from source systems by usingETL processes and then storingtheinformationin the

    CICS ,FlatFiles,Oracle

    The ODS couldthen be used for producinga variety of reports formanagement.

    ODS.

  • 8/21/2019 Data Warehousing & Data Mining.pdf

    16/143

    Data Warehousing & DataMinig 10IS74

    Dept. of ISE, SJBIT Page 16

    BusinessIntelligen

    Tools

    ETLprocess

    ExtractTransformand Load

    ETLprocess

    ETLprocess

    ETLprocess

    DataMart

    DataMart

    DataMart

    BI To

    BI To

    BI To

    DailyChangeProcess

    (StagingArea)

    DailyChangeProcess

    OperationalData Store

    (ODS)

    DataWarehouse

    (DW)

    Figure 7.4 Another structure for ODS and DW

  • 8/21/2019 Data Warehousing & Data Mining.pdf

    17/143

    Data Warehousing & DataMinig 10IS74

    Dept. of ISE, SJBIT Page 17

    The architectureof a system that includesan ODS and a datawarehouse shown in Figure

    7.4 is more complex. Itinvolves extractinginformationfrom sourcesystems by usingan

    ETL process and then storing the information in astagingdatabase.The daily changes

    also come to the staging area.Another ETL process is used totransform information

    from the staging area to populate the ODS. The ODS is then usedfor supplying

    informationvia another ETL process to the area warehouse whichin turn feeds a number

    of data marts that generate the reports requiredbymanagement.Itshouldbe noted that

    not all ETL processes in thisarchitectureinvolve data cleaning,some may only involve

    data extractionand transformationto suitthe target systems.

    1.5 DATA WAREHOUSE DESIGN

    There are a number of ways of conceptualizing a datawarehouse.One approach is to

    view it asa three-level structure. The lowest levelconsists ofthe OLTP andlegacy

    systems, the middlelevel consistsof the globalor centraldatawarehouse while the top

    level consistsof local data warehouses or data marts.Anotherapproach is possibleif the

    enterprisehas an ODS. The three levels then mightconsistof OLTPand legacy systems

    at the bottom,the ODS in the middleand the data warehouse at thetop.

    Whatever the architecture,a data warehouse needs to have a datamodelthat can

    form the basisfor implementingit.To developa data modelwe view adata warehouse asa multidimensional structure consistingofdimensions,since that is an intuitive model

    that matches the types of OLAP queries posed bymanagement. Adimension is an

    ordinate within a multidimensional structure consisting of alist of ordered values

    (sometimescalled members)justlike the x-axis and y-axis valueson a two-dimensional

    graph.

  • 8/21/2019 Data Warehousing & Data Mining.pdf

    18/143

    Data Warehousing & DataMinig 10IS74

    Dept. of ISE, SJBIT Page 18

    Scholarship

    Degree

    Number ofStudents Country

    Year

    Figure 7.5 A simple exampleof a star schema.

    A data warehouse model often consists of a central fact tableanda set of

    surroundingdimension tableson which the facts depend.Such amodelis called a star

    schema because of the shape of the model representation.A simpleexampleof such aschema is shown in Figure 7.5 for a universitywhere we assume that the number of

    students is given by the four dimensionsdegree,year, country andscholarship. These

    four dimensionswere chosen because we are interestedin findingout how many students

    come to

    scheme.

    each degree program,each year, from each country undereachscholarship

  • 8/21/2019 Data Warehousing & Data Mining.pdf

    19/143

    Data Warehousing & DataMinig 10IS74

    Dept. of ISE, SJBIT Page 19

    A characteristic of a star schema is that all thedimensionsdirectly link to the fact table.

    The fact tablemay look like table7.1 and the dimensiontablesmaylook Tables 7.2 to

    7.5.

    Table 7.1 An exampleof the fact table

    _

    Year Degree name Country name Scholarship name Number

    200301 BSc Australia Govt 35

    199902 MBBS Canada None 50

    200002 LLB USA ABC 22

    199901 BCom UK Commonwealth 7

    200102 LLB Australia Equity 2

    The first dimensionis the degree dimension.An exampleofthisdimensiontableis

    Table 7.2.

    Table 7.2 An exampleof the degree dimensiontable

    _

    Name Faculty Scholarship eligibility Number of semesters

    BSc Science Yes 6

    MBBS Medicine No 10

    LLB Law Yes 8

    BCom

    LLB

    Business No 6

    Arts No 6

    We now present the second dimension,the country dimension.Anexampleof this

    dimensiontableis Table 7.3.

    Table 7.3 An exampleof the country dimensiontable

  • 8/21/2019 Data Warehousing & Data Mining.pdf

    20/143

    Data Warehousing & DataMinig 10IS74

    Dept. of ISE, SJBIT Page 20

    _

    Name

    religion

    Nepal

    Indonesia

    Continent EducationLevel

    Asia Low

    Asia Low

    Major

    Hinduism

    Islam

    Norway

    Singapore

    Colombia

    Europe

    Asia

    South America

    High

    High

    Low

    Christianity

    NULL

    Christianity

    The thirddimensionis the scholarshipdimension.Thedimensiontableis given in Table7.4.

    Table 7.4 An exampleof the scholarshipdimensiontable

    _

    Name Amount (%) Scholarship eligibility Number

    Colombo 100 All 6

    Equity 100 Low income 10

    Asia 50 Top 5% 8

    Merit 75 Top 5% 5

    Bursary 25 Low income 12

    The fourth dimensionis the year dimension.The dimensiontableisgiven in Table 7.5.

    Table 7.5 An exampleof the year dimensiontable

    Name Newprograms

    2001

    2002

    Journalism

    Multimedia

  • 8/21/2019 Data Warehousing & Data Mining.pdf

    21/143

    Data Warehousing & DataMinig 10IS74

    Dept. of ISE, SJBIT Page 21

    2003 Biotechnology

    We now present further examples ofthe star schema. Figure7.7shows a star

    schema for a modelwith four dimensions.

    Star schema may be refinedintosnowflake schemas if we wish toprovidesupportfor dimensionhierarchies by allowing thedimensiontablesto have subtablesto represent

    the hierarchies. For example, Figure 7.8 shows a simplesnowflake schema for a two-

    dimensionalexample.

    Degree Country

    Name

    Faculty

    Fact

    Degree Name

    Name

    Continent

    ScholarshipEligibility

    Number of

    Semesters

    CountryName

    Number ofstudents

    EducationLevel

    Major

    religion

    Figure 7.6 Star schema for a two-dimensionalexample.

    The star and snowflake schemas are intuitive, easy tounderstand,can dealwith aggregate

    data and can be easily extendedby addingnew attributesor newdimensions.They are

    the popular modeling techniques for a datawarehouse.Entry-relationship modeling isoften not discussedin thecontext of data warehousingalthoughit is quitestraightforward

    to look at the star schema as an ER model.Each dimensionmay beconsideredan entity

    and the fact may be consideredeithera relationshipbetween thedimensionentitiesor an

  • 8/21/2019 Data Warehousing & Data Mining.pdf

    22/143

    Data Warehousing & DataMinig 10IS74

    Dept. of ISE, SJBIT Page 22

    entity in which the primary key is the combinationof theforeignkeys that refer to the

    dimensions.

    DegreeCountry

    Name

    FacultyDegree Name

    Name

    Continent

    Scholarship

    Eligibility

    Number ofSemesters

    CountryName

    Scholarshipname

    Year

    EducationLevel

    Majorreligion

    Scholarship

    Name

    Amount

    Eligibility

    Lastyear

    Number ofstudents

    Revenue Name

    NewProgram

    Figure 7.7 Star schema for a four-dimensionalexample.

    The star and snowflake schemas are intuitive, easy tounderstand,can dealwith aggregate

    data and can be easily extendedby addingnew attributesor newdimensions.They are

    the popular modeling techniquesfor a datawarehouse.Entity-relationship modeling is

  • 8/21/2019 Data Warehousing & Data Mining.pdf

    23/143

    Data Warehousing & DataMinig 10IS74

    Dept. of ISE, SJBIT Page 23

    often not discussedin the context of data warehousingalthoughitis quitestraightforward

    to look at the star schema as an ER model.

    Name

    Number ofacademic

    staff

    Budget

    Degree Name

    ScholarshipName

    Number of

    students

    Name

    Name

    Faculty Amount

    ScholarshipEligibility

    Number ofSemesters

    Eligibility

    Figure 1.8 An exampleof a snowflake schema.

    The dimensionalstructure of the star schema is called amultidimensionalcube in

    online analytical processing (OALP). The cubes may beprecomputed to provide very

    quick response to management OLAP queries regardless of the sizeof the data

    warehouse.

  • 8/21/2019 Data Warehousing & Data Mining.pdf

    24/143

    Data Warehousing & DataMinig 10IS74

    Dept. of ISE, SJBIT Page 24

    1.6 GUIDELINES FOR DATA WAREHOUSE IMPLEMENTATION

    Implementation steps

    1. Requirements analysis and capacity plann ing: In otherprojects,the first step in

    data warehousing involves defining enterprise needs, definingarchitecture,

    carrying out capacity planning andselecting the hardware andsoftware tools.

    This step will involveconsulting senior management as well asthe various

    stakeholders.

    2.Hardware integration:

    Once the hardware and software have been selected,theyneed to beput together by integrating the servers, the storage devices andthe

    client software tools.

    3. Modelling:Modelling is a majorstep that involvesdesigning thewarehouse

    schema andviews. This may involve usinga modelling tool if thedata

    warehouse is complex.

    4. Physicalmodelling: For the data warehouse to performefficiently, physical

    modelling is required. This involves designing the physicaldatawarehouse

    organization,data placement,data partitioning,decidingon accessmethods and

    indexing.

    5. Sources: The data for the data warehouse is likely to comefrom a number of

    data sources. This step involvesidentifying andconnecting thesources using

    gateways,ODBC drives or other wrappers.

    6. ETL: The data from the source systems will need to go throughan ETL process.

    The step of designing and implementing the ETL process mayinvolve

    identifying a suitableETL toolvendor and purchasingandimplementingthe tool.This may include customizingthe toolto suittheneeds of the enterprise.

    7. Populate the data warehouse: Oncethe ETL tools havebeenagreed upon,

    testingthe toolswill be required,perhaps usinga stagingarea.Onceeverythingis

  • 8/21/2019 Data Warehousing & Data Mining.pdf

    25/143

    Data Warehousing & DataMinig 10IS74

    Dept. of ISE, SJBIT Page 25

    working satisfactorily, the ETL toolsmay be used in populatingthe warehouse

    giventhe schema and view definitions.

    8. User applications: For the data warehouse to be useful theremust be end-user

    applications. This step involves designing and implementingapplications

    requiredby the end users.

    9. Roll -out the warehouse and appli cations: Once the datawarehouse has been

    populated andthe end-user applications tested, the warehousesystem andthe

    applicationsmay be rolled out for the user community to use.

    Implementation Guidelines

    1. Bui ld incrementall y: Data warehouses must be builtincrementally. Generally it

    is recommended that a data mart may first be built with oneparticular projectin

    mindand once it is implementeda number of other sectionsof theenterprisemay

    also wish to implement similar systems. An enterprisedatawarehouse can then

    be implemented in an iterative manner allowingall data marts toextract

    informationfrom the data warehouse. Data warehousemodellingitself

    is an iterative methodology as users become familiar with thetechnology and are

    then ableto understand and express theirrequirementsmoreclearly.

    2. Need a champion: A data warehouse project must have achampionwho iswilling to carry out considerableresearchintoexpected costs and benefitsof the

    project. Data warehousing projects require inputs from manyunits in am

    enterprise andtherefore need to be drivenby someone who iscapable of

    interactionwith people in the enterprise and canactivelypersuade colleagues.

    Withoutthe cooperationof other units, the data modelfor thewarehouse and the

    data required to populate the warehouse may be more complicatedthan they

    need to be.Studieshave shown that havinga championcanhelpadoptionand

    success

    of data warehousingprojects.

    3. Senior management support: A data warehouse projectmust befully supported

    by the seniormanagement.Given the resource intensive nature ofsuch projects

    and the timethey can take to implement,a warehouse projectcallsfor a sustained

    commitmentfrom seniormanagement. This can sometimes be difficultsince it

  • 8/21/2019 Data Warehousing & Data Mining.pdf

    26/143

    Data Warehousing & DataMinig 10IS74

    may behard to quantify the benefits of data warehouse technologyandthe

    managers may considerit a cost withoutany explicit return oninvestment.Data

    warehousingprojectstudiesshow that top management support isessential for

    the success of a data warehousingproject.

    4. Ensure quality: Only data that hasbeen cleaned and is of aquality that is

    understood by the organizationshouldbe loadedin the datawarehouse.The data

    quality in the source systems is not always highand often littleeffort is made to

    improve data quality in the source systems. Improved dataquality, when

    recognized by senior managers and stakeholders, is likely tolead to improved

    support for a data warehouse project.

    5. Corporate strategy: A data warehouse projectmust fitwithcorporate strategy

    and business objectives. The objectives of the project must beclearly defined

    before the start of the project. Given the importance of seniormanagement

    support for a data warehousing project, the fitness of theproject with the

    corporate strategy is essential.

    6. Business plan : The financial costs (hardware, software,peopleware),expected

    benefitsand a projectplan(including an ETL plan) for a datawarehouse project

    must be clearly outlined andunderstood by all stakeholders.Without such

    understanding, rumours about expenditure andbenefits canbecomethe only

    source of information, underminingthe project.

    7. Training: A data warehouse projectmust not overlook datawarehouse training

    requirements.For a data warehouse project to be successful, theusers must be

    trainedto use the warehouse and to understand its capabilities.Training of users

    and professionaldevelopmentof the projectteam may alsoberequiredsince data

    warehousingis a complex task and the skills of the projectteamare critical to the

    success of the project.

    8. Adaptability: The projectshouldbuild in adaptability so thatchanges may be

    made to the data warehouse if andwhenrequired. Like any system,a data

    warehouse will need to change,as needs of an enterprisechange.Furthermore,

    once the data warehouse is operational, new applications usingthe data

    Dept. of ISE, SJBIT Page 26

  • 8/21/2019 Data Warehousing & Data Mining.pdf

    27/143

    Data Warehousing & DataMinig 10IS74

    warehouse arealmost certain to beproposed. The system shouldbeable to

    support such new applications.

    9. Joint management: Theproject must bemanaged byboth ITandbusiness

    professionals in the enterprise. To ensure good communicationwith the

    stakeholdersand that the projectis focused on assistingtheenterprisesbusiness,

    business professionals must be involvedin the project along withtechnical

    professionals.

    1.7 DATA WAREHOUSE METADATA

    Given the complexity of informationin an ODS and the datawarehouse, it is essential

    that there be a mechanismfor users to easily find out what datais there and how it can be

    used to meet their needs.Providing

    metadata about the ODS or the data warehouse

    achieves this. Metadata is data about data or documentationaboutthe data that is needed

    by the users.Another descriptionof metadata is that it isstructured data which describes

    the characteristics of a resource. Metadata is stored in thesystem itself and can be

    queried usingtoolsthat are available on the system.

    Several examplesof metadata that shouldbe familiar to thereader:

    1. A library cataloguemay be consideredmetadata.Thecataloguemetadata consists

    of a number of predefined elements representing specificattributes of a

    resource, and each elementcan have one or more values. Theseelementscould

    be the name of the author, the name of the document, thepublishers name, the

    publication date and the category to which it belongs. Theycouldeven include

    an abstract of

    the data.

    2. The tableof contents and the index in a book may beconsideredmetadata for the

    book.

    3. Suppose wesay that a data element about a person is 80.Thismust then be

    described by noting that it is the persons weight andthe unit iskilograms.

    Therefore(weight, kilogram) is the metadata about the data80.

    4. Yet another example of metadata is data about the tablesandfiguresin a

    document like thisbook.A table(which is data) has a name (e.g.table titles in

    Dept. of ISE, SJBIT Page 27

  • 8/21/2019 Data Warehousing & Data Mining.pdf

    28/143

    Data Warehousing & DataMinig 10IS74

    Dept. of ISE, SJBIT Page 28

    thischapter) and there are column names of the table that maybeconsidered

    metadata.The figures alsohave titlesor names.

    There are many metadata standards. For example, the AGLS(Australian

    Government LocatorService) Metadata standard is a set of 19descriptive elementswhich

    Australian government departments and agenciescan use to improvethe visibility and

    accessibility of theirservices and informationover theInternet.

    In a database, metadata usuallyconsists of table (relation)lists, primary key

    names,attributesnames,theirdomains,schemas,record counts andperhaps a list of the

    most common queries. Additional information may be providedincluding logicaland

    physical data structures and when and what data was loaded.

    In the context of adata warehouse, metadata hasbeen defined asall of the

    informationin the data warehouse environmentthat is not theactualdata itself.

    In the data warehouse,metadata needs to be much morecomprehensive.Itmay be

    classified into two groups: back room metadata and front roommetadata. Much

    important informationis includedin the back room metadata thatis process relatedand

    guides,for example, the ETL processes.

    1.8 SOFTWARE FOR ODS, ZLE, ETL AND DATA WAREHOUSING

    ODS Software

    IQ Solutions:Dynamic ODS from Sybase offloadsdata from OLTPsystems and

    makes if available on a Sybase IQ platformfor queriesandanalysis.

    ADH Active Data Hub from Glenridge Solutions is a real-timedataintegration

    and reportingsolutionfor PeopleSoft, Oracle and SAPdatabases.ADH includes

    an ODS, an enterprisedata warehouse,a workflow initiator and ameta library.

    ZLE Software

    HP ZLE framework based on the HP NonStopplatformcombinesapplicationand data

    integration to create an enterprise-wide solution for real-timeinformation. The ZLE

    solutionis targeted at retail,telecommunications,healthcare,government and finance.

  • 8/21/2019 Data Warehousing & Data Mining.pdf

    29/143

    Data Warehousing & DataMinig 10IS74

    Dept. of ISE, SJBIT Page 29

    ETL Software

    AradymeData ServicesfromAradyme

    Corporation provides data migration

    services for extraction, cleaning, transformationand loadingfromany source to

    any destination.Aradyme claims to minimize the risks inherentinmany-to-one,

    many-to-many and similar migrationprojects.

    DataFlux froma company with the same name (acquired by SAS in2000)

    providessolutionsthat helpinspect,correct, integrate,enhance,andcontroldata.

    Solutions include data

    Dataset V from Intercon Systems Inc is an integrated suite fordata cleaning,

    matching,positive identification, de-duplicationand statisticalanalysis.

    WinPure List CleanerProfromWinPureprovides asuite consistingofeight

    modules that clean, correct unwanted punctuation and spellingerrors, identify

    missing data via graphs

    variety of data sources.

    and a scoring system and removes duplicatesfrom a

    Data Warehousing Software

    mySAP Business Intelligence provides facilities of ETL, datawarehouse

    management andbusiness modelling to helpbuild data warehouse,model

    informationarchitectureand manage data from multiplesources.

    SQL Server2005fromMicrosoft provides ETL tools as well as toolsfor

    buildinga relationaldata warehouse and amultidimensionaldatabase.

    Sybase IQ is designedfor reporting,data warehousingandanalytics. Itclaims to

    deliver high queryperformance and storage efficiency forstructured and

    unstructured data.Sybase has partnered with Sun in providingdatawarehousing

    solutions.

  • 8/21/2019 Data Warehousing & Data Mining.pdf

    30/143

    Data Warehousing & DataMinig 10IS74

    Dept. of ISE, SJBIT Page 30

    UNIT 2

    ONLINE ANALYTICAL PROCESSING (OLAP)

    2.1 INTRODUCTION

    A dimension is anattributeor anordinate within amultidimensional structure

    consisting of a list of values(members). For example, thedegree, the country, the

    scholarship and the year were the four dimensions used in thestudent database.

    Dimensions are used for selectingand aggregatingdata at thedesiredlevel. A dimension

    does not include orderingof values,for example there is noordering associated with

    values of each of the four dimensions,but a dimensionmay haveone or more

    hierarchies that show parent /childrelationshipbetween themembers of a dimension.

    For example, the dimension country may have a hierarchy thatdivides the world into

    continents andcontinents into regions followed by regions intocountries if such a

    hierarchy is usefulfor the applications. Multiple hierarchiesmay be defined on a

    dimension.For example, countiesmay be definedto have ageographicalhierarchy and

    may have another hierarchy definedon the basisof theirwealthorper capitaincome(e.g.

    high, medium,low).

    The non-nullvalues of facts are the numerical values stored ineach data cube cell. They

    are called measures. A measure is a non-key attributein a facttableand the value of the

    measure is dependent on thevalues of the dimensions. Each uniquecombination of

  • 8/21/2019 Data Warehousing & Data Mining.pdf

    31/143

    Data Warehousing & DataMinig 10IS74

    Dept. of ISE, SJBIT Page 31

    members in a Cartesian product dimensionsof the cube identifiesprecisely one data cell

    within the cube and that cell stores the values of themeasures.

    The SQL command GROUP BY is unusualaggregationoperator in that atableis divided

    intosub-tablesbased on the attributevalues in the GROUP BYclauseso that each sub-

    tablehas the same values for the attributeand thenaggregationsover each sub-tableare

    carried out. SQL has a variety of aggregation functionsincluding max,min,average,

    count which are used by employingthe GROUP BY facility.

    A data cube computes aggregates overall subsets ofdimensionsspecified in the cube.A

    cube may befound at the union of (large)number of SQL GROUP-BYoperations.

    Generally,all or some of the aggregates are pre-computed toimprovequery response

    time. A decisionhas to be made as to what and how much shouldbepre-computed since

    pre-computed queriesrequirestorage and timeto compute them.

    A data cube is often implementedas a database in which there aredimensiontableseach

    of whichprovides details of a dimension. The database may betheenterprise data

    warehouse.

    2.2 OLAP

    OLAP systems are data warehouse front-end software tools to makeaggregate

    data availableefficiently, for advanced analysis, to managers ofanenterprise. The

    analysis often requires resource intensive aggregationsprocessing and therefore it

    becomes necessary to implement aspecialdatabase (e.g.datawarehouse) to improve

    OLAP response time. It is essential that an OLAP system providesfacilities for a

    manager to pose ad hoc complex queriesto obtaintheinformationthat he/she requires.

    Another term that is beingused increasingly is businessintelligence. Itis used to

    mean both data warehousingand OLAP. Ithas been definedas auser-centered process of

    exploring data,data relationshipsand trends,thereby helpingtoimproveoverall decision

    making. Normally thisinvolves a process of accessing data(usually stored within the

    data warehouse) and analyzing it to draw conclusions and deriveinsights with the

  • 8/21/2019 Data Warehousing & Data Mining.pdf

    32/143

    Data Warehousing & DataMinig 10IS74

    Dept. of ISE, SJBIT Page 32

    purpose of effectingpositive change within anenterprise.Business intelligence is closely

    relatedto OLAP.

    A data warehouse and OLAP are based on amultidimensionalconceptualview of

    the enterprise data. Any enterprise data is multidimensionalconsisting ofdimensions

    degree,country, scholarship, and year. Data that is arranged bythe dimensionsis like a

    spreadsheet, although a spreadsheet presents onlytwo-dimensional data with each cell

    containing an aggregation. As an example, table 8.1 shows onesuch two-dimensional

    spreadsheet with dimensionsDegree and Country, where the measureis the number of

    studentsjoining a university in a particular year orsemester.

    Degree

    Table 8.1 A multidimensionalview of data for two dimensions

    CountryB.Sc LLB MBBS BCom BIT ALL

    Australia 5 20 15 50 11 101

    India 10 0 15 25 17 67

    Malaysia 5 1

    Singapore 2 2

    10 12

    10 10

    23 51

    31 55

    Sweden 5 0 5 25 7 42

    UK 5 15 20 20 13 73

    USA 0 2 20 15 19 56

    ALL 32 40 95 157 121 445

    Table8.1 bethe information for the year2001.Similar spreadsheetviewswould be

    available for other years. Three-dimensionaldata can alsobeorganizedin a spreadsheet

    usinga number of sheets or by usinga number of two-dimensionaltables in the samesheet.

    Although it is useful to think of OLAP systems as ageneralization of

    spreadsheets, spreadsheets are not really suitable for OLAP inspite of the niceuser-

    friendly interface that they provide. Spreadsheets tie datastorage too tightly to the

  • 8/21/2019 Data Warehousing & Data Mining.pdf

    33/143

    Data Warehousing & DataMinig 10IS74

    Dept. of ISE, SJBIT Page 33

    presentation. It is therefore difficult to obtainotherdesirableviews of the information.

    Furthermoreit is not possibleto query spreadsheets.Also,spreadsheets become unwieldy

    whenmore than three dimensions are to be represented. It isdifficult to imagine a

    spreadsheet with millionsof rowsor with thousands of formulas.Even with small

    spreadsheets, formulas often haveerrors. An error-freespreadsheet with thousands of

    formulas wouldtherefore bevery difficult to build.Data cubesessentially generalize

    spreadsheets to any number of dimensions.

    OLAP is the dynamic enterprise analysis required to create,manipulate, animate

    and synthesize information from exegetical, contemplative andformulaic data analysis

    models.

    Essentially what this definition means is that the informationis manipulated from the

    point if view of a manager (exegetical), from the pointof viewof someone whohas

    thought about it(contemplative) and accordingto someformula(formulaic).

    Another definition of OLAP, which is software technology thatenables analysts,

    managers andexecutives to gaininsight into data through fast,consistent, interactive

    access to a wide variety of possibleviews of informationthat,hasbeen transformed from

    raw data to reflect that real dimensionalof the enterpriseasunderstood by the user.

    An even simpler definition is that OLAP is a fast analysis ofshared

    multidimensional information for advanced analysis. Thisdefinition (sometimes called

    FASMI) implies that most OLAP queries should be answered withinseconds.

    Furthermore, it

    programming.

    is expected that most OLAP queries can be answered withoutany

    In summary, a manager wouldwant eventhe most complex query to beanswered

    quickly;OLAP is usually a multi-usersystem that may be run on aseparate server using

    specialized OLAP software. The major OLAP applications aretrendanalysis over a

  • 8/21/2019 Data Warehousing & Data Mining.pdf

    34/143

    Data Warehousing & DataMinig 10IS74

    Dept. of ISE, SJBIT Page 34

    number of timeperiods,slicing, dicing, drill-down and roll-uptolook at differentlevels

    of detailand pivotingor rotatingto obtaina newmultidimensionalview.

    2.3 CHARACTERISTICS OF OLAP SYSTEMS

    The following are the differencesbetween OLAP and OLTPsystems.

    1. Users: OLTP systems are designedfor office workers while theOLAP systems are

    designed fordecision makers. Therefore while anOLTP system maybeaccessed by

    hundreds or even thousands of users in a largeenterprise,an OLAPsystem is likely to be

    accessed only by a selectgroup of managers and may be used onlyby dozens of users.

    2. Functions: OLTP systems are mission-critical. They supportday-to-day operationsof

    an enterpriseand are mostly performance and availability driven.These systems carry out

    simple repetitiveoperations.OLAP systems aremanagement-criticalto support decision

    of an enterprise support functions using analyticalinvestigations. They are more

    functionality driven. These are ad hoc and often much morecomplex operations.

    3. Nature: Although SQL queries often return a set of records,OLTP systems are

    designed to process one record at a time, for example a recordrelated to the customer

    who mightbe on the phone or in the store.OLAP systems are notdesignedto dealwithindividual customer records. Insteadthey involvequeriesthat dealwith many records at a

    timeand providesummary or aggregate data to a manager.OLAPapplicationsinvolve

    data stored in a data warehouse that has been extracted frommany tablesand perhaps

    from more than one enterprisedatabase.

    4. Design: OLTP database systems are designedto beapplication-orientedwhile OLAP

    systems are designedto be subject-oriented.OLTP systems view theenterprisedata as a

    collection of tables(perhaps based on anentity-relationshipmodel). OLAP systems view

    enterpriseinformationas multidimensional).

    5. Data: OLTP systems normally deal only with the current statusof information. For

    example, informationabout an employeewho left three years agomay not be available

  • 8/21/2019 Data Warehousing & Data Mining.pdf

    35/143

    Data Warehousing & DataMinig 10IS74

    Dept. of ISE, SJBIT Page 35

    on the Human Resources System. The oldinformationmay have beenachievedon some

    type of stablestorage mediaand may not be accessibleonline. Onthe other hand,OLAP

    systems require historical data overseveralyearssincetrendsareoften important in

    decisionmaking.

    6. Kind of use: OLTP systems are used for readingand writingoperationswhile OLAP

    systems normally do not update the data.

    The differencesbetween OLTP and OLAP systems are:

    Property OLTP OLAP

    Nature of users

    Functions

    Nature of queries

    Nature of usage

    Nature of design

    Number of users

    Nature of data

    Updates

    Operationsworkers

    Mission-critical

    Mostly simple

    Mostly repetitive

    Application oriented

    Thousands

    Current, detailed,relational

    All the time

    Decision makers

    Management-critical

    Mostly complex

    Mostly ad hoc

    Subject oriented

    Dozens

    Historical, summarized,

    multidimensional

    Usually not allowed

    Table 8.1 Comparisonof OLTP and OLAP system

    FASMI Characteristics

    In the FASMI characteristics of OLAP systems, the namederivedfrom the first lettersof

    the characteristicsare:

    Fast: As noted earlier,most OLAP queries should beansweredveryquickly,

    perhaps within seconds.The performance of an OLAP system has tobe like that of a

    search engine.If the response takes more than say 20 seconds,theuser is likely to move

    away to something else assuming there is aproblem with thequery.Achievingsuch

    performance is difficult. The data structures must be efficient.The hardware must be

    powerful enough for the amount of data and the number ofusers.Full pre-computationof

    aggregates helps but is often not practical due to the largenumber of aggregates.One

  • 8/21/2019 Data Warehousing & Data Mining.pdf

    36/143

    Data Warehousing & DataMinig 10IS74

    Dept. of ISE, SJBIT Page 36

    approach is to pre-compute the most commonly queried aggregatesandcompute the

    remainingon-the-fly.

    Analytic: An OLAP system must provide rich analyticfunctionality and it is

    expected that most OLAP queries can beanswered withoutanyprogramming. The

    system shouldbe ableto cope with any relevantqueriesfor theapplicationand the user.

    Often the analysiswill be usingthe vendors own tools althoughOLAP software

    capabilities differ widely between products in the market.

    Shared: An OLAP system is shared resource although it isunlikely to be

    shared by hundreds of users.An OLAP system is likely to beaccessed only by a select

    group of managers and may be used merely by dozens ofusers.Being a shared system,

    an OLAP

    system shouldbe provideadequate security for confidentiality aswell as integrity.

    Multidimensional: This is the basicrequirement. Whatever OLAPsoftware is

    beingused,it must providea multidimensionalconceptualview of thedata.Itis because

    of the multidimensional view of data that we often refer to thedata as a cube. A

    dimension often hashierarchies that showparent / childrelationships between the

    members of a dimension.The multidimensionalstructure shouldallowsuch hierarchies.

    Information: OLAP systems usually obtain informationfrom a datawarehouse.

    The system shouldbe able to handle a largeamount of input data.The capacity of an

    OLAP system to handleinformationand its integrationwith the datawarehouse may be

    critical.

    Codds OLAP Characteristics

    Codd et als1993 paper listed12 characteristics (or rules) OLAPsystems. Another six in

    1995 followed these.Codd restructured the 18 rules intofourgroups.These rules provide

    another pointof view on what constitutesan OLAP system.

    All the 18 rules are availableathttp://www.olapreport.com/fasmi.htm.Here we

    discuss10 characteristics, that are most important.

    1. Mul tidimensional conceptual view: As noted above,thisiscentralcharacteristic of an

    OLAP system. By requiring a multidimensional view, it ispossible to carry out

    operationslike slice and dice.

    http://www.olapreport.com/fasmi.htmhttp://www.olapreport.com/fasmi.htmhttp://www.olapreport.com/fasmi.htmhttp://www.olapreport.com/fasmi.htmhttp://www.olapreport.com/fasmi.htmhttp://www.olapreport.com/fasmi.htmhttp://www.olapreport.com/fasmi.htmhttp://www.olapreport.com/fasmi.htmhttp://www.olapreport.com/fasmi.htmhttp://www.olapreport.com/fasmi.htmhttp://www.olapreport.com/fasmi.htmhttp://www.olapreport.com/fasmi.htmhttp://www.olapreport.com/fasmi.htmhttp://www.olapreport.com/fasmi.htmhttp://www.olapreport.com/fasmi.htmhttp://www.olapreport.com/fasmi.htmhttp://www.olapreport.com/fasmi.htmhttp://www.olapreport.com/fasmi.htmhttp://www.olapreport.com/fasmi.htmhttp://www.olapreport.com/fasmi.htmhttp://www.olapreport.com/fasmi.htmhttp://www.olapreport.com/fasmi.htmhttp://www.olapreport.com/fasmi.htmhttp://www.olapreport.com/fasmi.htmhttp://www.olapreport.com/fasmi.htmhttp://www.olapreport.com/fasmi.htmhttp://www.olapreport.com/fasmi.htmhttp://www.olapreport.com/fasmi.htmhttp://www.olapreport.com/fasmi.htmhttp://www.olapreport.com/fasmi.htm
  • 8/21/2019 Data Warehousing & Data Mining.pdf

    37/143

    Data Warehousing & DataMinig 10IS74

    Dept. of ISE, SJBIT Page 37

    2. Accessibi li ty (OLAP as a mediator): The OLAP softwareshouldbe sittingbetween

    data sources (e.g data warehouse) and an OLAP front-end.

    3. Batch extraction vs interpretive: An OLAP systemshouldprovidemultidimensional

    data stagingplusprecalculationof aggregates inlargemultidimensionaldatabases.

    4. Mu lti -user support: Since the OLAP system is shared, theOLAP software should

    provide many normal database operations including retrieval,update, concurrency

    control,integrity and security.

    5. Stori ng OLAP resul ts: OLAP resultsdata shouldbe keptseparate from source data.

    Read-write OLAP applications should not be implemented directlyon live transaction

    data if OLAP source systems are supplyinginformationto the OLAPsystem directly.

    6. Extraction of missing values: The OLAP systemshoulddistinguish missing values

    from zero values. A largedata cube may have a largenumber ofzeros as well as some

    missing values. If a distinctionis not made between zero valuesand missing values, the

    aggregates are likely to be computed incorrectly.

    7. Treatment ofmissing values: An OLAP system should ignore allmissingvalues

    regardlessof theirsource.Correct aggregate values will becomputed once the missing

    values are ignored.

    8. Uni form reporting perf ormance: Increasing the number ofdimensions ordatabase

    size should not significantly degrade the reporting performanceof the OLAP system.

    This is a good objectivealthoughit may be difficult to achievein practice.

    9. Generic dimensionali ty: An OLAP system shouldtreat eachdimensionas equivalent

    in both is structure and operationalcapabilities. Additionaloperationalcapabilities may

    be granted to selected dimensions but such additional functionsshouldbe grantable to

    any dimension.

    10. Unlimited dimensions and aggregation levels: An OLAP systemshould allow

    unlimiteddimensions and aggregation levels. In practice, thenumber of dimensions is

    rarely more than 10 and the number of hierarchies rarely morethan six.

    MOTIVATIONS FOR USING OLAP

    1. Understanding and improving sales: For an enterprisethat hasmany products

    anduses a number of channels for selling the products, OLAP canassist in

  • 8/21/2019 Data Warehousing & Data Mining.pdf

    38/143

    Data Warehousing & DataMinig 10IS74

    Dept. of ISE, SJBIT Page 38

    finding the most popularproducts and the most popularchannels.Insome cases it

    may be possibleto find the most profitablecustomers.For example,considering

    the telecommunications industry and only considering oneproduct,

    communicationminutes, there is a largeamount of data if acompany wanted to

    analyze the sales of product for every hour of the day (24hours), differentiate

    between weekdays and weekends (2 values) and divide regionstowhich calls are

    made into50 regions.

    2. Understanding and reducing costs of doing business: Improvingsales is one

    aspect of improving a business,the other aspect is to analyzecosts and to control

    them as much as possiblewithoutaffectingsales. OLAP can assistinanalyzing

    the costs associatedwith sales. In some cases,it may alsobepossibleto identify

    expenditures that produce a high return on investment (ROI). Forexample,

    recruiting a top salesperson may involve significant costs, butthe revenue

    generated by the salespersonmay justify the investment.

    2.3 MULTIDIMENSIONAL VIEW AND DATA CUBE

    SeniorExecutive

    V-C,Deans

    Department & FacultyManagement,Heads

    Daily operationsRegistrar,HR, Finance

    Figure 2.1 A typical University management hierarchy

  • 8/21/2019 Data Warehousing & Data Mining.pdf

    39/143

    Data Warehousing & DataMinig 10IS74

    Dept. of ISE, SJBIT Page 39

    The multidimensionalview of data is in some ways natural view ofany enterprise of

    managers.The trianglediagramin Figure 8.1 shows that as we gohigherin the triangle

    hierarchy the managers need for detailedinformationdeclines.

    The multidimensional view of data by using anexample of asimpleOLTP

    database consistsof the three tables.Much of the literatureonOLAP uses examplesof a

    shoe store selling differentcolourshoes of differentstyles.

    It should be noted that the relation enrolment would normallynot berequired

    since the degree a student is enrolledin couldbe includedin therelationstudent but some

    students are enrolledin doubledegrees and so the relationbetweenthe student and the

    degree is multifold and hence the need for therelationenrolment.

    student(Student_id,Student_name,Country, DOB, Address)

    enrolment(Student_id,Degree_id,SSemester)

    degree(Degree_id,Degree_name,Degree_length,Fee, Department)

    An exampleof the first relation, i.e. student,is given in Table2.2

    Student_id Student_name Country DOB Address

    8656789 Peta Williams Australia 1/1/1980 Davis Hall

    8700020 John Smith Canada 2/2/1981 9 Davis Hall

    8900020

    8801234

    8654321

    Arun Krishna

    Peter Chew

    Reena Rani

    USA 3/3/1983

    UK 4/4/1983

    Australia 5/5/1984

    90 Second Hall

    88LongHall

    88LongHall

    8712374

    8612345

    Kathy Garcia

    Chris Watanabe

    Malaysia

    Singapore

    6/6/1980

    7/7/1981

    88LongHall

    11 Main street

    87442238977665

    LarsAnderssen

    Sachin SinghSweden 8/8/1982UAE 9/9/1983

    NullNull

    9234567 Rahul Kumar India 10/10/1984 Null

    9176543 Saurav Gupta UK 11/11/1985 1, Captain Drive

  • 8/21/2019 Data Warehousing & Data Mining.pdf

    40/143

    Data Warehousing & DataMinig 10IS74

    Dept. of ISE, SJBIT Page 40

    Table8.3 presents an example of the relation enrolment. In thistable, the attribute

    SSemester in the semester in which the student started thecurrent degree.We code it by

    using the year followed by 01 for the first semester and 02 forthe second. We

    assume that new students are admittedin each semester.Table 8.4is an exampleof the

    relation degree. In this table, the degree length is given interms of the number of

    semester it normally takes to finish it.The fee is assumed to bein thousands of dollars per

    year.

    Table 2.3 The relationenrolment

    Student_id Degree_id SSemester

    8900020 1256 2002-01

    8700074 3271 2002-01

    8700074 3321 2002-02

    8900020 4444 2000-01

    8801234 1256 2000-01

    8801234 3321 1999-02

    8801234 3333 1999-02

    8977665 3333 2000-02

    Table 2.4 The relationdegree

    Degree_id Degree_name Degree_length Fee Department

    1256 BIT 6 18 Computer Sci.

    2345 BSc 6 20 Computer Sci

    4325 BSc 6 20 Chemistry

    3271 BSc 6 20 Physics

    3321 BCom 6 16 Business

    4444 MBBS 12 30 Medicine

    3333 LLB 8 22 Law

  • 8/21/2019 Data Warehousing & Data Mining.pdf

    41/143

    Data Warehousing & DataMinig 10IS74

    Dept. of ISE, SJBIT Page 41

    It is clear that the informationgiven in Tables 8.2, 8.3 and8.4, althoughsuitablefor a

    student enrolment OLTP system, is not suitable for efficientmanagement decision

    making. The managers do not need informationabout the individualstudents,the degree

    they are enrolledin, and the semester theyjoinedthe university.What the managers need

    is the trends in student numbers in different degree programsandfrom different

    countries.

    We first consideronly two dimensions.Letus say we are primarilyinterestedin finding

    out how many students from each country came to do a particulardegree.Thereforewe

    may visualize the data as two-dimensional,i.e.,

    Country x Degree

    A table that summarizes this type of information may berepresented by a two-

    dimensionalspreadsheet given in Table 8.5 (the numbers in Table8.5 do not need relate

    to the numbers in Table 8.3). We may call that this tablegivesthe number of students

    admitted(in say, 2000-01) a two-dimensionalcube.

    Table 2.5 A two-dimensionaltableof aggregates for semester2000-01

    Country \ Degree BSc LLB MBBS BCom BIT ALL

    Australia 5 20 15 50 11 101

    India 10 0 15 25 17 67

    Malaysia 5 1 10 12 23 51

    Singapore 2 2 10 10 31 55

    Sweden 5 0 5 25 7 42

    UK 5 15 20 20 13 73

    USA 0 2 20 15 19 56

    ALL 32 40 95 157 121 445

  • 8/21/2019 Data Warehousing & Data Mining.pdf

    42/143

    Data Warehousing & DataMinig 10IS74

    Dept. of ISE, SJBIT Page 42

    Using thistwo-dimensionalview we are ableto find the number ofstudentsjoining any

    degree from any country (only for semester 2000-01). Otherqueriesthat we are quickly

    ableto answer are:

    How many students started BIT in 2000-01?

    How many studentsjoinedfrom Singaporein 2000-01?

    The data given in Table 8.6 is for a particularsemester,2000-01. A similar tablewould

    be available for other semesters.Letus assume that the data for2001-01 is given in Table

    8.7.

    Table 2.6 A two-dimensionaltableof aggregates for semester2001-01

    Country \ Degree BSc LLB MBBS BCom BIT ALL

    Australia 7 10 16 53 10 96

    India 9 0 17 22 13 61

    Malaysia 5 1 19 19 20 64

    Singapore 2 2 10 12 23 49

    Sweden 8 0 5 16 7 36

    UK 4 13 20 26 11 74

    USA 4 2 10 10 12 38

    ALL 39 28 158 158 96 418

    Letus now imagine that Table 8.6 is put on top of Table 8.5. Wenow have a three-

    dimensionalcube with SSemester as the vertical dimension.We nowput on top of these

    two tablesanother tablethat gives the vertical sums, as shown inTable 8.7.

  • 8/21/2019 Data Warehousing & Data Mining.pdf

    43/143

    Data Warehousing & DataMinig 10IS74

    Dept. of ISE, SJBIT Page 43

    Table 2.7 Two-dimensionaltableof aggregates for bothsemesters

    Country \ Degree BSc LLB MBBS BCom BIT ALL

    Australia 12 30 31 103 21 197

    India 19 0 32 47 30 128

    Malaysia 10 2 29 31 43 115

    Singapore 4 4 20 22 54 104

    Sweden 13 0 10 41 14 78

    UK 9 28 40 46 24 147

    USA 4 4 30 25 31 94

    ALL 71 68 192 315 217 863

    Tables8.5, 8.6 and8.7 together nowform a three-dimensionalcube.The table 8.7

    providestotalsfor the two semesters and we are ableto drill-downto find numbers in

    individualsemesters. Note that a cube does not need to have anequal number of

    members in each dimension.Puttingthe three tablestogether givesa cube of 8 x 6 x 3 ( =

    144) cells including the totalsalongevery dimension.

    A cube couldbe represented by:

  • 8/21/2019 Data Warehousing & Data Mining.pdf

    44/143

    Data Warehousing & DataMinig 10IS74

    Dept. of ISE, SJBIT Page 44

    Country x Degree x Semester

    Figure 2.2 The cube formed by Tables 8.6, 8.7 and 8.8

    In the three-dimensionalcube,the following eighttypes ofa*ggregationsor queriesare

    possible:

    1. null (e.g. how many students are there? Only 1possiblequery)

    2. degrees (e.g. how many students are doingBSc? 5possiblequeriesif we assume

    5 differentdegrees)

  • 8/21/2019 Data Warehousing & Data Mining.pdf

    45/143

    Data Warehousing & DataMinig 10IS74

    Dept. of ISE, SJBIT Page 45

    3. semester (e.g. how many students entered in semester 2000-01?2 possiblequeries

    if we only have data about 2 semesters)

    4. country (e.g. how many students are from the USA? 7possiblequeriesif there are

    7 countries)

    5. degrees,semester (e.g. how many students entered in 2000-01to enroll in BCom?

    With 5 degrees and 2 differentsemesters 10 queries)

    6. (ALL, b, c) semester,country (e.g. how many students from theUK entered in

    2000-01? 14 queries)

    7. (a, b, ALL) degrees,country (e.g. how many students fromSingaporeare enrolled

    in BCom? 35 queries)

    8. (a, b, c) all (e.g. how many students from Malaysia enteredin 2000-01 to enroll in

    BCom? 70 queries)

    2.4 DATA CUBE IMPLEMENTATIONS

    1. Pre-compute and store all : This means that millions ofa*ggregates will need to be

    computed and stored.Althoughthis is the best solutionas far asquery response

    timeis concerned,the solutionis impractical since resourcesrequiredto compute

    the aggregates and to store them will be prohibitively largefora largedata cube.Indexinglargeamounts of data is alsoexpensive.

    2. Pre-compute (and store) none: This means that the aggregatesare computed on-

    the-fly using the rawdata whenever a query is posed. Thisapproach does not

    requireadditionalspace for storingthe cube but the queryresponse timeis likely

    to be very poor for largedata cubes.

    3. Pre-compute and store some: This means that we pre-computeand store the

    most frequently queriedaggregates and compute others as the needarises. We

    may alsobeableto derive some of the remainingaggregates usingtheaggregates

    that have

    already

    been computed. Itmay therefore be worthwhile also to pre-

  • 8/21/2019 Data Warehousing & Data Mining.pdf

    46/143

    Data Warehousing & DataMinig 10IS74

    Dept. of ISE, SJBIT Page 46

    compute some aggregates that are not most frequently queriedbuthelpin deriving

    many other aggregates. It will of course not be possible toderiveall the

    aggregates from the pre-computed aggregates and it will benecessary to access

    the database (e.g the data warehouse) to compute theremainingaggregates.The

    more aggregates we are ableto pre-compute the better the queryperformance.

    Itcan be shown that largenumbers of cells do have an ALLvalueand may therefore be

    derived from other aggregates.Letus reproduce the list ofqueries we had and define

    them as (a, b, c) where a stands for a value of the degreedimension,b for country and c

    for startingsemester:

    1. (ALL, ALL, ALL) null (e.g. how many students are there? Only1 query)

    2. (a, ALL, ALL) degrees (e.g. how many students are doingBSc? 5queries)

    3. (ALL, ALL, c) semester (e.g. how many students entered insemester 2000-01? 2

    queries)

    4. (ALL, b, ALL) country (e.g. how many students are from theUSA? 7 queries)

    5. (a, ALL, c) degrees,semester (e.g. how many students enteredin 2000-01 to

    enroll in BCom? 10 queries)

    6. (ALL, b, c) semester,country (e.g. how many students from theUK entered in

    2000-01? 14 queries)

    7. (a, b, ALL) degrees,country (e.g. how many students fromSingaporeare enrolled

    in BCom? 35 queries)

    8. (a, b, c) all (e.g. how many students from Malaysia enteredin 2000-01 to enroll in

    BCom? 70 queries)

    It is therefore possible to derive the other 74 of the144queries fromthe last 70

    queriesof type (a, b, c). Of course in a very largedata cube,itmay not be practical

    even to pre-compute all the (a, b, c) queriesand decisionwillneed to be made which

    ones should be pre-computed given that storage availability maybe limited

    may be requiredto minimize the average query cost.

    and it

  • 8/21/2019 Data Warehousing & Data Mining.pdf

    47/143

    Data Warehousing & DataMinig 10IS74

    Dept. of ISE, SJBIT Page 47

    In Figure 8.3 we show how the aggregated above are relatedandhow an aggregate at the

    higherlevel may be computed from the aggregates below. Forexample, aggregates

    (ALL, ALL, c) may be derivedfrom either(a, ALL, c) bysummingover all a values

    from (ALL, b, c) by summingover all b values.

    ALL, ALL, ALL

    a, ALL, ALL ALL, b, ALL ALL, ALL, c

    a, ALL, c ALL, b, c a, b, ALL

    a, b, c

    Figure 2.3 Relationships between aggregationsof athree-dimensionalcube

    Another related issue is where the data used by OLAP willreside.We assume that the

    data is stored in a data warehouse or in one or more datamarts.

    Data cube products usedifferent techniques forpre-computingaggregates and

    storingthem.They are generally based on one of twoimplementationmodels.The first

    model, supported by vendors of traditional relational modeldatabases, is calledthe

    ROLAP modelor the Relational OLAP model.The second modeliscalled the MOLAP

    model for multidimensional OLAP. The MLOAP model provides adirect

  • 8/21/2019 Data Warehousing & Data Mining.pdf

    48/143

    Data Warehousing & DataMinig 10IS74

    Dept. of ISE, SJBIT Page 48

    multidimensionalview of the data whereas the RLOAPmodelprovidesa relationalview

    of the multidimensionaldata in the form of a fact table.

    ROLAP

    ROLAP usesa relational DBMS to implement an OLAP environment. Itmay be

    considereda bottom-up approach which is typically based onusinga data warehouse that

    has been designed using a star schema. The data therefore islikely to be in a

    denormalized structure. A normalized database avoidsredundancybut is usuallynot

    appropriatefor high performance. The summary data will be heldin aggregate tables.

    The data warehouse provides the multidimensional capabilities byrepresenting data in

    fact table(s) and dimensiontables.The fact tablecontainsonecolumnfor each dimension

    and one column for each measure and every row of the table[rovides one fact. A fact

    then is represented as (BSc, India, 2001-01) with thelastcolumnas 30. An OLAP toolis

    then providedto manipulatethe data in these data warehousetables.This toolessentially

    groups the fact table to find aggregates andusessome of theaggregates already

    computed to find new aggregates.

    The advantage of using ROLAP is that it is more easily used withexisting relational

    DBMS and the data can be stored efficiently usingtablessince nozero facts need to bestored.The disadvantageof the ROLAP modelisits poor query performance.Proponents

    of the MLOAP modelhave called the ROLAP modelSLOWLAP. Someproducts in this

    category are OracleOLAP mode, OLAP Discoverer,MicroStrategyandMicrosoft

    Analysis Services.

    MOLAP

    MOLAP is based on usinga multidimensionalDBMS rather than a datawarehouse tostore andaccess data. It may beconsidered as a top-downapproach to OLAP. The

    multidimensional database systems do not have a standardapproach to storing and

    maintainingtheirdata.They often use special-purposefile systemsor indexes that store

    pre-computation of all aggregations in thecube. For example, inROLAP a cell was

  • 8/21/2019 Data Warehousing & Data Mining.pdf

    49/143

    Data Warehousing & DataMinig 10IS74

    Dept. of ISE, SJBIT Page 49

    represented as (BSc, India, 2001-01) with a value 30 stored inthe last column. In

    MOLAP, the same information is stored as 30 and the storagelocation implicitly gives

    the values of the dimensions.The dimensionvalues do not need tobe stored since all the

    values of the cube couldbe stored in an array in apredefinedway. For example, the cube

    in Figure 8.2 may be represented as an array like thefollowing:

    12 30 31 10

    3

    21 19

    7

    19 0 32 47 30 12

    8

    10 2 29 31 43

    If the values of the dimensionsare known, we can infer the celllocationin the array. If

    the cell locationis known, the values of the dimensionmay beinferred. This is obviously

    a very compact notation for storinga multidimensionaldata cubealthougha coupleofproblems remain.Firstly the array is likely tobetoo large to be stored in the main

    memory. Secondly, this representation does not solve the problemof efficiently

    representingsparse cubes.To overcome the problemof the arraybeingtoo largefor main

    memory, the arraymay besplit into pieces calledchunks, eachofwhich is small

    enough to fitin the mainmemory. To overcome the problemofsparseness,the chunks

    may be compressed.

    MOLAP systems have to dealwith sparsity since a very percentageof the cells can be

    empty in some applications. The MOLAP implementation is usuallyexceptionally

    efficient. The disadvantageof usingMOLAP is that it is likely tobe more expensive than

    OLAP, the data is not always current,and it may be moredifficult to scale a MOLAP

    system for very large OLAP problems.Some MOLAP products areHyperion Essbase

    and Applix iTM1. Oracle and Microsoft are alsocompetinginthissegment of the OLAP

    market.

  • 8/21/2019 Data Warehousing & Data Mining.pdf

    50/143

    Data Warehousing & DataMinig 10IS74

    Dept. of ISE, SJBIT Page 50

    The differencesbetween ROLAP and MOLAP are summarizedin Table8.8

    Table 2.8 Comparisonof MOLAP and ROLAP

    Property MOLAP ROLAPData structure Multidimensional database

    usingsparse arrays

    Relational tables(each cell is a row)

    Disk space Separate database for data

    cube; large for large data

    cubes

    May not require any space other than

    that available in the data warehouse

    Retrieval Fast(pre-computed) Slow(computeson-the-fly)Scalability Limited (cubes can be very

    large)

    Excellent

    Best suitedfor Inexperienced users, limited

    set of queries

    Experienced users, queries change

    frequentlyDBMS

    facilities

    Usually weak Usually very strong

    2.5 DATA CUBE OPERATIONS

    A number of operationsmay be appliedto data cubes. The commonones are:

    Roll-p

    Drill-down

    Slice and dice

    Pivot

    Roll-up

    Roll-up is like zoomingout on the data cube.Itis requiredwhenthe user needs furtherabstractionor less detail. This operationperforms further aggregationson the data, for

    example, from single degree programs to all programs offered bya School or department,

    from single countries to a collection of countries, and fromindividual semesters to

  • 8/21/2019 Data Warehousing & Data Mining.pdf

    51/143

    Data Warehousing & DataMinig 10IS74

    Dept. of ISE, SJBIT Page 51

    academic years.Often a hierarchy defined ona dimension is usefulin the roll-up

    operationas suggested by the exampleof countriesand regions.

    We providean exampleof roll-upbased on Table =s 8.6, 8.7 and8.8. We first

    definehierarchies on two dimensions. Amongst countries,letusdefine:

    1. Asia (India, Malaysia, Singapore)

    2. Europe(Sweden, UK)

    3. Rest (Australia, USA)

    Another hierarchy is definedon the dimensiondegree:

    1. Science (BSc, BIT)

    2. Medicine (MBBS)

    3. Business and Law (BCom, LLB)

    The resultof a roll-upfor both semesters together from Table 8.8then is given in Table

    8.9.

    Table 2.9 Result of a roll-upoperationusingTable 8.7

    Country \ Degree

    Asia

    Europe

    Rest

    Science

    160

    60

    68

    Medicine

    81

    50

    61

    Business and Law

    106

    115

    162

    Drill-down

    Drill-down is like zoomingin on the data and is therefore thereverse of roll-up. Itis an

    appropriate operation whenthe userneeds further detailsorwhenthe user wants to

    partitionmore finely or wants to focus on some particular valuesof certaindimensions.

    Drill-down adds more details to the data. Hierarchy defined on adimensionmay beinvolved in drill-down. For example, a higherlevelviews of student data,for examplein

    Table8.9, givesstudent numbers for the two semesters forgroupsof countries and

    groups of degrees. If one is interested in more detailthen it ispossibleto drill-down to

    tables8.6 and 8.7 for student numbers in each of the semestersfor each country and for

    each degree.

  • 8/21/2019 Data Warehousing & Data Mining.pdf

    52/143

    Data Warehousing & DataMinig 10IS74

    Dept. of ISE, SJBIT Page 52

    Slice and dice

    Slice and dice are operationsfor browsing the data in thecube.The terms refer to the

    ability to look at informationfrom differentviewpoints.

    A slice is a subset of the cube corresponding to a single valuefor one or more

    members of the dimensions.For example, a slice operation isperformed when the user

    wants aselection on onedimension of a three-dimensional cuberesulting in a two-

    dimensionalsite. Letthe degree dimensionbe fixed as degree =BIT. The slice will not

    include any informationabout other degrees.Theinformationretrievedtherefore is more

    like a two-dimensionalcube for degree = BIT as shown in Table8.10.

    Table 2.10 Result of a slice when degree value is BIT

    Country \ Semester 2000-01 2000-02 2001-01 2001-02

    Australia 11 5 10 2

    India 17 0 13 5

    Malaysia 23 2 20 1

    Singapore 31 4 23 2

    Sweden 7 0 7 4

    UK 13 8 11 6

    USA 19 4 12 5

    Itshouldbe noted that Table 8.7 also is a slice (with SSemester= 2000-01)from the

    cube built by piling several tableslike Tables 8.7 and 8.8 aboutdifferentsemesters on top

    of each other.Itis shown in Figure 8.4.

    The diceoperationis similar to slice but dicing does not involvereducingthe number ofdimensions.A diceis obtainedby performingaselectionon two or more dimensions.For

    example, one may only be interestedin degrees BIT and BCom andcountriesAustralia,

    India, and Malaysia for semesters 2000-01 and

(PDF) Data Warehousing & Data Mining.pdf - PDFSLIDE.NET (2024)

References

Top Articles
Latest Posts
Article information

Author: Merrill Bechtelar CPA

Last Updated:

Views: 6490

Rating: 5 / 5 (70 voted)

Reviews: 93% of readers found this page helpful

Author information

Name: Merrill Bechtelar CPA

Birthday: 1996-05-19

Address: Apt. 114 873 White Lodge, Libbyfurt, CA 93006

Phone: +5983010455207

Job: Legacy Representative

Hobby: Blacksmithing, Urban exploration, Sudoku, Slacklining, Creative writing, Community, Letterboxing

Introduction: My name is Merrill Bechtelar CPA, I am a clean, agreeable, glorious, magnificent, witty, enchanting, comfortable person who loves writing and wants to share my knowledge and understanding with you.