Large Synoptic Survey Telescope
    Data Management Database Design
    Jacek Becla, Daniel Wang, Serge Monkewitz,
    Andy Salnikov, Andrew Hanushevsky, Douglas
    Michael Kelsey, and Fritz Mueller
    LDM-135
    Latest Revision: 2017-07-07
    This LSST document has been approved as a Content-Controlled Document
    nical Control Team. If this document is changed or superseded,
    the Handle designation shown above. The control is on the most
    this Handle in the LSST digital archive and not printed versions.
    found in the corresponding DM RFC.
    Abstract
    This document discusses the LSST database system architecture.
    LARGE SYNOPTIC SURVEY TELESCOPE
    Data Management Database Design LDM-135 Latest Revision 2017-07-07
    Change Record
    Version | Date | Description | Owner name
    1.0 | 2009-06-15 | Initial version. | Jacek Becla
    2.0 | 2011-07-12 | Most sections rewritten, added scalability test section | Jacek Becla
    2.1 | 2011-08-12 | Refreshed future-plans and schedule of testing sections, added section about fault tolerance. | Jacek Becla, Daniel Wang
    3.0 | 2013-08-02 | Synchronized with latest changes to the requirements [LSE-163]. Rewrote most of the “Implementation” chapter. Documented new tests, refreshed all other chapters. | Jacek Becla, Daniel Wang, Serge Monkewitz, Kian-Tat Lim, Douglas Smith, Bill Chickering
    3.1 | 2013-10-10 | Refreshed numbers based on latest LDM-141. Updated shared scans (implementation) and 300-node test sections, added section about shared scans demonstration | Jacek Becla, Daniel Wang
    3.2 | 2013-10-10 | TCT approved | R Allsman
        | 2016-07-18 | Update with async query, shared scan, secondary index, XRootD, metadata service information. | John Gates, Andy Salnikov, Andrew Hanushevsky, Michael Kelsey, Fritz Mueller
        | 2017-07-05 | Move historical investigations to separate documents: DMTN-046, DMTN-047, DMTN-048, DMTR-21, DMTR-12 | T. Jenness
    4.0 | 2017-07-07 | Bring up to date with current status, condense requirements section, re-order sections for improved readability. | Fritz Mueller
    Document curator: Fritz Mueller
    Document source location:
    https://github.com/lsst/LDM-135
    The contents of this document are subject to configuration control by the
    Team.
    Contents
    1 Executive Summary
    2 Introduction
    3 Requirements
    3.1 General Requirements
    3.2 Data Production Related Requirements
    3.3 Query Access Related Requirements
    3.4 Discussion
    3.4.1 Design Considerations
    3.4.2 Query complexity and access patterns
    4 Baseline Architecture
    4.1 Alert Production and Up-to-date Catalog
    4.2 Data Release Production
    4.3 User Query Access
    4.3.1 Distributed and parallel
    4.3.2 Shared-nothing
    4.3.3 Indexing
    4.3.4 Shared scanning
    4.3.5 Clustering
    4.3.6 Partitioning
    4.3.7 Long-running queries
    4.3.8 Technology choice
    5 Implementation (Qserv)
    5.1 Components
    5.1.1 MySQL
    5.1.2 XRootD
    5.2 Partitioning
    5.3 Query Generation
    5.3.1 Processing modules
    5.3.2 Processing module overview
    5.4 Dispatch
    5.4.1 Wire protocol
    5.4.2 Frontend
    5.4.3 Worker
    5.5 Threading Model
    5.6 Aggregation
    5.7 Indexing
    5.7.1 Secondary Index Structure
    5.7.2 Secondary Index Loading
    5.8 Data Distribution
    5.8.1 Database data distribution
    5.8.2 Failure and integrity maintenance
    5.9 Metadata
    5.9.1 Static metadata
    5.9.2 Dynamic metadata
    5.9.3 Architecture
    5.9.4 Typical Data Flow
    5.10 Shared Scans
    5.10.1 Background
    5.10.2 Implementation
    5.10.3 Memory management
    5.10.4 XRootD scheduling support
    5.10.5 Multiple tables support
    5.11 Level 3: User Tables, External Data
    5.12 Cluster and Task Management
    5.13 Fault Tolerance
    5.14 Next-to-database Processing
    5.15 Administration
    5.15.1 Installation
    5.15.2 Data loading
    5.15.3 Administrative scripts
    5.16 Current Status and Future Plans
    5.17 Open Issues
    6 Risk Analysis
    6.1 Potential Key Risks
    6.2 Risk Mitigations
    7 References
    Data Management Database Design
    1 Executive Summary
    Two facets of LSST database architecture and their motivating
    database architecture in support of real time Alert Production,
    support of user query access to catalog data. Following this,
    implementation of the query access architecture.
    The LSST baseline database architecture for real time Alert
    time-based partitioning. To guarantee reproducibility,
    bined with maintaining validity time for appropriate rows
    cas are maintained to isolate live production catalogs from
    chronized in real time using native database replication.
    The LSST baseline database architecture for user query access
    sively parallel processing) relational database composed
    a distributed communications layer, and a master controller,
    cluster of commodity servers with locally attached spinning
    tal scaling and recovering from hardware failures without
    catalogs are spatially partitioned horizontally into materialized chunks, and the remaining catalogs
    are replicated on each server; the chunks are distributed
    ject catalog is further partitioned into sub-chunks with overlaps, materialized on-the-fly
    needed. Chunking is handled automatically without exposure
    also partitioned vertically to maximize performance of
    uses a few critical indexes to speed up spatial searches,
    interactive queries. Shared scans are used to answer all
    architecture is primarily driven by the variety and complexity
    from single object lookups to complex $O(n^2)$ full-sky correlations over billions
    A prototype implementation of the baseline architecture
    above, Qserv, was developed during the R&D phase of LSST, and its
    strated in early testing. Productization was subsequently
    struction phase of LSST and is presently underway.
    Qserv leverages two mature, open-source technologies as
    MySQL as a SQL execution engine (though any alternative
    without undue effort if need be), and XRootD¹ [11] to provide a distributed-system
    for fault-tolerant, elastic, content-addressed messaging.
    We currently maintain three running instances of Qserv at
    nodes each in continuous operation on dedicated hardware
    The system has been demonstrated to correctly distribute
    volume queries, including small-area cone and object searches,
    full table scans including large-area near-neighbor searches.
    UDFs in both filter and aggregation clauses have also been
    been successfully conducted on the above-mentioned clusters
    imately 70 TB, and we expect to cross the 100 TB mark as tests
    system is on track through a series of graduated data-challenge
    the stated performance requirements for the project.
    If an equivalent open-source, community supported, off-the-shelf
    become available in time, it could present significant support
    ready Qserv. The largest barrier preventing us from using
    sufficient spherical geometry and spherical partitioning
    To increase the chances such a system will become reality
    collaborate with the MonetDB open source columnar database
    stration of Qserv based on MonetDB instead of MySQL was done
    current with the state-of-the-art in petascale data management
    dialog with all relevant solution providers, both DBMS and
    intensive users, both industrial and scientific, through
    ² conference series we lead,
    and beyond.
    2 Introduction
    This document discusses the LSST database system architecture
    tation of part of that architecture (Qserv) in particular.
    ¹ http://xrootd.org
    ² https://xldb.org
    Section 3 summarizes LSST database-related requirements that
    Section 4 discusses the baseline architecture itself. Section 5 discusses Qserv as an implementation
    of the baseline architecture for user query access. Section 6 covers attendant risk
    analysis. For some additional background, DMTN-046 covers in-depth analysis of
    potential solutions (Map/Reduce and RDBMS) as of 2013, and DMTR-21 and DMTR-12 describe
    large scale Qserv tests from 2013. The full Qserv test specification is LDM-552.
    DMTN-048 discusses the original design trade-offs and decision
    tests that were run and some Qserv demonstrations.
    3 Requirements
    Formal DM database requirements are called out in LDM-555. For purposes of exposition,
    this section summarizes some of the key requirements which
    tecture.
    3.1 General Requirements
    Incremental scaling. The system must scale to tens of petabytes and
    must grow as the data grows and as the access requirements
    become available during the life of the system must be able
    quantitative storage, disk and network bandwidth and I/O LDM-141.
    Reliability. The system must not lose data, and it must provide
    face of hardware failures, software failures, system maintenance,
    Low cost. It is essential to not overrun the allocated budget,
    open-source solution is strongly preferred.
    3.2 Data Production Related Requirements
    The LSST database catalogs will be generated by a small set
    • Data Release Production – it produces all key catalogs.
    DRP takes several months to complete and is dominated
    jobs. Ingest can be done separately from pipeline processing,
    • Nightly Alert Production – it produces difference image
    ject, SSObject, DiaSource, DiaForcedSource catalogs.
    in under a minute after data has been taken, data has to
    real time. The number of row updates/ingested is modest:
    occur every ~39 sec [7].
    • Calibration Pipeline – it produces calibration information.
    no stringent timing requirements, ingest bandwidth needs
    In addition, the camera and telescope configuration is captured
    Database. Data volumes are very modest.
    Further, the Level 1 live catalog will need to be updated
    should not be taken off-line for extended periods of time.
    The database system must allow for occasional schema changes
    occasional changes that do not alter query results³ for the Level 2 data after the
    been released. Schemas for different data releases are allowed
    3.3 Query Access Related Requirements
    The Science Data Archive Data Release query load is defined
    large catalogs in the archive: Object, Source, and ForcedSource.
    for example, though numerous, are expected to be fast. In
    Reproducibility. Queries executed on any Level 1 and Level 2 data
    ducible.
    Real time. A large fraction of ad-hoc user access will involve
    – queries that touch small area of sky, or request small number
    ³ Example of non-altering changes including adding/removing/resorting
    derived information, changing type of a column without losing information, FLOAT to DOUBLE would be always
    allowed, DOUBLE to FLOAT would only be allowed if all values can be expressed using FLOAT without losing any
    information)
    required to be answered in a few seconds. On average, we expect
    running at any given time.
    Fast turnaround. High-volume queries – queries that involve full-sky
    to be answered in approximately 1 hour, while more complex
    correlations are expected to be answered in ~8-12 hours.
    queries are expected to be running at any given time.
    Cross-matching with external/user data. Occasionally, LSST database catalog
    be cross-matched with external catalogs: both large, such
    such as small amateur data sets. Users should be able to
    access them during subsequent queries.
    Query complexity. The system needs to handle complex queries, including
    tions, time series comparisons. Spatial correlations are
    – this is an important observation, as this class of queries
    partitioning with overlaps.
    Flexibility. Sophisticated end users need to be able to access
    with as few constraints as possible. Many end users will
    SQL, so most of basic SQL92 will be required.
    3.4 Discussion
    3.4.1 Design Considerations
    The above requirements have important implications on the
    • The system must allow rapid selection of small number
    tables. To achieve this, efficient data indexing in both
    is essential.
    • The system must efficiently join multi-trillion with multi-billion
    izing these tables to avoid common joins, such as Object
    ForcedSource, would be prohibitively expensive.
    • The system must provide high data bandwidth. In order
    minutes, data bandwidths on the order of tens to hundreds
    required.
    • To achieve high bandwidths, to enable expandability,
    system will need to run on a distributed cluster composed
    • The most effective way to provide high-bandwidth access
    partition the data, allowing multiple machines to work
    partitioning is also important to speed up some operations
    building.
    • Multiple machines and partitioned data in turn imply
    will be executed in parallel, requiring the management
    tasks.
    • Limited budget implies the system needs to get most out
    incrementally as needed. The system will be disk I/O limited,
    attaching multiple queries to a single table scan (shared
    3.4.2 Query complexity and access patterns
    A compilation of representative queries provided by the
    ence Council, and other surveys have been captured [5]. These queries can be divided into
    several distinct groups: analysis of a single object, analysis
    in a region or across entire sky, analysis of objects close
    special grouping, time series analysis and cross match with
    as to the complexity required: these queries include distance
    self-joins, and time series analysis.
    Small queries are expected to exhibit substantial spatial
    similar spatial coordinates: right ascension and declination).
    expected to exhibit a slightly different form of spatial
    have nearby spatial coordinates. Spatial correlations
    spatial correlations will not be needed on Source or ForcedSource tables.
    Queries related to time series analysis are expected to
    vations for a given Object, so the appropriate Source or
    joined and aggregate functions operating over the list of
    External data sets and user data, including results from
    tributed alongside distributed production table to provide
    The query complexity has important implications on the overall
    tem.
    4 Baseline Architecture
    This section describes the most important aspects of the
    The choice of the architecture is driven by the project requirements (see LDM-555) as well as
    cost, availability, and maturity of the off-the-shelf solutions
    ket (see DMTN-046), and design trade-offs (see DMTN-048). The architecture is periodically
    revisited: we continuously monitor all relevant technologies,
    baseline architecture.
    In summary:
    • The LSST baseline architecture for Alert Production is
    RDBMS system which uses replication for fault tolerance
    horizontal (time-based) partitioning;
    • The baseline architecture for user access to Data Releases
    lel processing) relational database running on a shared-nothing
    servers with locally attached spinning disk drives; capable
    (b) recovering from hardware failures without disrupting
    alogs are spatially partitioned into materialized chunks, and the remaining catalogs
    replicated on each server; the chunks are distributed
    alog is further partitioned into sub-chunks with overlaps,⁴ materialized on-the-fly
    needed. Shared scans are used to answer all but low-volume
    ⁴ A chunk’s overlap is implicitly contained within the overlaps of its
    4.1 Alert Production and Up-to-date Catalog
    Alert Production involves detection and measurement of
    (DiaSources). New DiaSources are spatially matched against
    isting DiaObjects, which contain summary properties for
    false positives). Unmatched DiaSources are used to create
    has an associated DiaSource that is no more than a month old,
    (DiaForcedSource) is taken at the position of that object,
    was detected in the exposure or not.
    The output of Alert Production consists mainly of three large
    and DiaForcedSource - as well as several smaller tables
    exposures, visits and provenance. These catalogs will be
    Note that existing DiaObjects are never overwritten. Instead,
    and DRP-produced DiaObjects are inserted, allowing users
    erties of DiaObjects as known to the pipeline when alerts
    historical queries, each DiaObject row is tagged with a
    time of a new DiaObject version is set to the observation
    Source that led to its creation, and the end time is set to
    its validity end time is updated (in place) to equal the start
    the most recent versions of DiaObjects can always be retrieved
    SELECT * FROM DiaObject WHERE validityEnd = infinity
    Versions as of some time t are retrievable via:
    SELECT * FROM DiaObject WHERE validityStart <= t AND t < validityEnd
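The validity-interval scheme behind these two queries can be tried out with SQLite standing in for the production RDBMS. This is a minimal sketch under stated assumptions: the reduced table layout, the `flux` column, the `INFINITY` sentinel value and the helper names are all hypothetical and are not taken from the LSST schema.

```python
import sqlite3

# Assumed sentinel standing in for "infinity"; the real encoding is not given here.
INFINITY = 9e18

def make_db():
    """Create an in-memory database with a reduced, hypothetical DiaObject table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE DiaObject ("
        " diaObjectId INTEGER, validityStart REAL, validityEnd REAL, flux REAL)"
    )
    return db

def insert_version(db, obj_id, start, flux):
    """Close the current version in place, then insert the new version."""
    db.execute(
        "UPDATE DiaObject SET validityEnd = ? "
        "WHERE diaObjectId = ? AND validityEnd = ?",
        (start, obj_id, INFINITY),
    )
    db.execute(
        "INSERT INTO DiaObject VALUES (?, ?, ?, ?)",
        (obj_id, start, INFINITY, flux),
    )

def latest(db):
    """Most recent version of every object."""
    return db.execute(
        "SELECT diaObjectId, flux FROM DiaObject WHERE validityEnd = ?",
        (INFINITY,),
    ).fetchall()

def as_of(db, t):
    """Versions that were valid at time t."""
    return db.execute(
        "SELECT diaObjectId, flux FROM DiaObject "
        "WHERE validityStart <= ? AND ? < validityEnd",
        (t, t),
    ).fetchall()

db = make_db()
insert_version(db, 1, 100.0, 5.0)  # first version, created at t = 100
insert_version(db, 1, 200.0, 6.0)  # supersedes the first version at t = 200
```

Because the interval is half-open, a query at exactly t = 200 sees only the newer version, and no row is ever overwritten apart from its end time.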
    Note that a DiaSource can also be re-associated to a solar-system
    processing. This will result in a new DiaObject version unless
    associated DiaSources. In that case, the validity end time
    time at which the re-association occurred.
    Once a DiaSource is associated with a solar system object,
    DiaObject. Therefore, rather than also versioning DiaSources,
    associated DiaObject and solar system object, as well as
    Re-association will set the solar system object ID and re-association
    for DiaObject 123 at time t can be obtained using:
    SELECT *
    FROM DiaSource
    WHERE diaObjectId = 123
    AND midPointTai <= t
    AND (ssObjectId IS NULL OR ssObjectReassocTime > t)
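The effect of the re-association predicate is easiest to see on a few rows. The sketch below runs the same WHERE clause in SQLite; the `diaSourceId` column, the reduced table layout and the sample values are assumptions made for illustration only.

```python
import sqlite3

def sources_at(db, dia_object_id, t):
    """DiaSources attached to a DiaObject as they were known at time t."""
    return [row[0] for row in db.execute(
        "SELECT diaSourceId FROM DiaSource "
        "WHERE diaObjectId = ? AND midPointTai <= ? "
        "AND (ssObjectId IS NULL OR ssObjectReassocTime > ?) "
        "ORDER BY diaSourceId",
        (dia_object_id, t, t),
    )]

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE DiaSource (diaSourceId INTEGER, diaObjectId INTEGER,"
    " midPointTai REAL, ssObjectId INTEGER, ssObjectReassocTime REAL)"
)
db.executemany(
    "INSERT INTO DiaSource VALUES (?, ?, ?, ?, ?)",
    [
        (1, 123, 10.0, None, None),  # never re-associated
        (2, 123, 20.0, 77, 50.0),    # re-associated to solar system object 77 at t = 50
        (3, 123, 60.0, None, None),  # observed after the re-association
    ],
)
```

Source 2 is returned for any t between 20 and 50 and drops out afterwards, which is how a historical light curve can be rebuilt without versioning DiaSource rows.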
    DiaForcedSources are never re-associated or updated in
    From the database point of view then, the alert production
    database operations 189 times (once per LSST CCD) per visit
    1. Issue a point-in-region query against the DiaObject
    versions of the objects falling inside the CCD.
    2. Use the IDs of these diaObjects to retrieve all associated
    Sources.
    3. Insert new diaSources and diaForcedSources.
    4. Update validity end times of diaObjects that will be superseded.
    5. Insert new versions of diaObjects.
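The five steps above can be sketched as one function per CCD. This is an illustration only, with SQLite as a stand-in database: the rectangular region test, the reduced column sets, the `INFINITY` sentinel and the function name `process_ccd` are assumptions, and diaForcedSources are left out to keep the example short.

```python
import sqlite3

INFINITY = 9e18  # assumed sentinel for an open-ended validity interval

def process_ccd(db, box, t, new_sources, new_versions):
    """Run the five per-CCD steps; box is (ra_min, ra_max, dec_min, dec_max)."""
    # 1. Point-in-region query for the latest versions inside the CCD.
    ids = [row[0] for row in db.execute(
        "SELECT diaObjectId FROM DiaObject WHERE validityEnd = ? "
        "AND ra BETWEEN ? AND ? AND decl BETWEEN ? AND ?",
        (INFINITY, *box),
    )]
    # 2. Retrieve every source already associated with those objects.
    marks = ",".join("?" * len(ids))
    history = db.execute(
        "SELECT diaSourceId, diaObjectId FROM DiaSource "
        f"WHERE diaObjectId IN ({marks}) ORDER BY diaSourceId",
        ids,
    ).fetchall()
    # 3. Insert the sources measured on this visit.
    db.executemany("INSERT INTO DiaSource VALUES (?, ?, ?)", new_sources)
    for obj_id, ra, decl in new_versions:
        # 4. Close the validity interval of the version being superseded.
        db.execute(
            "UPDATE DiaObject SET validityEnd = ? "
            "WHERE diaObjectId = ? AND validityEnd = ?",
            (t, obj_id, INFINITY),
        )
        # 5. Insert the new version, valid from t onwards.
        db.execute(
            "INSERT INTO DiaObject VALUES (?, ?, ?, ?, ?)",
            (obj_id, ra, decl, t, INFINITY),
        )
    db.commit()
    return ids, history

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE DiaObject (diaObjectId INTEGER, ra REAL, decl REAL,"
           " validityStart REAL, validityEnd REAL)")
db.execute("CREATE TABLE DiaSource (diaSourceId INTEGER, diaObjectId INTEGER,"
           " midPointTai REAL)")
db.execute("INSERT INTO DiaObject VALUES (1, 10.0, -5.0, 0.0, ?)", (INFINITY,))
db.execute("INSERT INTO DiaSource VALUES (100, 1, 0.0)")
ids, history = process_ccd(
    db, (9.0, 11.0, -6.0, -4.0), 30.0, [(101, 1, 30.0)], [(1, 10.0, -5.0)]
)
```

The spatial matching between new and existing sources is deliberately absent from the SQL, since the text places that work in pipeline memory rather than in the database.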
    All spatial joins will be performed on in-memory data by
    database. While Alert Production does also involve a spatial
    produced) Object catalog, this does not require any database
    are never modified, so the Object columns required for spatial
    compact binary files once per Data Release. These files will
    very fast region queries, allowing the database to be bypassed
    The DiaSource and DiaForcedSource tables will be split into
    and one containing records inserted during the current
    be small and rely on a transactional engine like InnoDB,
    failures. The historical-data tables will use the faster
    age engine, and will also take advantage of partitioning.
    seed the live catalogs will be stored in a single initial
    erarchical Triangular Mesh trixel IDs for their positions).
    diaForcedSources for the diaObjects in a CCD will be located
    ing seeks. Every month of new data will be stored in a fresh
    Such partitions will grow to contain just a few billion rows
    for the largest catalog. At the end of each night, the contents
    sorted and appended to the partition for the current-month,
    entire current-month partition is sorted spatially (during
    month is created.
    For DiaObject, the same approach is used. However, DiaObject
    occur in any partition, and are not confined to the current-night
    to use a transactional storage engine like InnoDB for all
    tables using the primary key, we will likely declare it to
    followed by disambiguating columns (diaObjectId, validityStart).
    will not be part of any index.
    No user queries will be allowed on the live production catalogs.
    separate replica just for user queries, synchronized in
    native database replication. The catalogs for user queries
    live catalogs, and views will be used to hide the splits (using “UNION ALL”).
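A view built with UNION ALL is enough to make two physical tables look like one logical catalog. The sketch below shows the idea in SQLite; the table names `DiaSource_historical` and `DiaSource_tonight` and the two-column layout are hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE DiaSource_historical (diaSourceId INTEGER, midPointTai REAL);
    CREATE TABLE DiaSource_tonight    (diaSourceId INTEGER, midPointTai REAL);
    -- The view presents both physical tables as a single logical catalog.
    CREATE VIEW DiaSource AS
        SELECT * FROM DiaSource_historical
        UNION ALL
        SELECT * FROM DiaSource_tonight;
    INSERT INTO DiaSource_historical VALUES (1, 10.0), (2, 20.0);
    INSERT INTO DiaSource_tonight VALUES (3, 30.0);
""")

# Users query the view and never see the split.
rows = db.execute(
    "SELECT diaSourceId FROM DiaSource ORDER BY diaSourceId"
).fetchall()
```

UNION ALL is the right operator here because the two tables hold disjoint rows, so the duplicate elimination performed by a plain UNION would only add cost.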
    For additional safety, we might choose to replicate the small
    partitions, and the remaining (small) changing tables to
    of disastrous master failure that cannot be fixed rapidly,
    be used as a temporary replacement, and user queries will
    resolved.
    Based on the science requirements, only short-running,
    needed on the Level 1 catalogs. The most complex queries,
    bor queries, will not be needed. Instead, user queries will
    searches, light curve lookups, and historical versions
    sorted spatially, we expect to be able to quickly answer
    ID columns and the SciSQL UDFs, an approach that has worked
    date. Furthermore, note that the positions of diaSources/diaForcedSources
    the same diaObject will be very close together, so that sorting
    also ends up placing sources belonging to the same light curve
    the data organization used to provide fast pipeline query
    user queries.
    4.2 Data Release Production
    Data Release Production will involve the generation of significantly
    Production. However, these are produced over the course
    not write directly to the database, and there are no pipeline
    execution time requirements to be satisfied. While we do expect
    table scans over the course of a Data Release Production,
    queries involving such scans on a daily basis. User query
    of our scalable database architecture, which is described
    the data loading process, please see Section 5.15.2
    4.3 User Query Access
    The user query access is the primary driver of the scalable
    tecture is described below.
    4.3.1 Distributed and parallel
    The database architecture for user query access relies on
    among autonomous worker nodes. Autonomous workers have
    other and can complete their assigned work without data or
    This implies that data must be partitioned, and the system
    single user query into sub-queries, and executing these
    high-volume query without parallelizing it would take unacceptably
    very fast CPU. The parallelism and data distribution should
    system and hidden from users.
    4.3.2 Shared-nothing
    Such architecture provides good foundation for incremental
    cause nodes have no direct knowledge of each other and can
    without data or management from their peers, it is possible
    from such system with no (or with minimal) disruption. However,
    and provide recovery mechanisms, appropriate smarts have
    agement software.
    Figure 1: Shared-nothing database architecture. (Diagram labels: Distributor, Combiner, MySQL Node (x4), Partitioned Data (x4).)
    4.3.3 Indexing
    Disk I/O bandwidth is expected to be the greatest bottleneck.
    through index, which typically translates to a random access,
    sequential read (unless multiple competing scans are involved).
    Indexes dramatically speed up locating individual rows,
    They are essential to answer low volume queries quickly,
    spatial indexes are essential. However, unlike in traditional,
    tages of indexes become questionable when a larger number
    a table. In case of LSST, selecting even a 0.01% of a table
    rows. Since each fetch through an index might turn into
    read sequentially from disk than to seek for particular
    index itself is out-of-memory. For that reason the architecture
    dexing, only a small number of carefully selected indexes
    queries, enabling table joins, and speeding up spatial
    analytical query system, it makes sense to make as few assumptions
    will be important to our users, and to try and provide reasonable
    a query load as possible, i.e. focus on scan throughput rather
    ther benefit to this approach is that many different queries
    I/O, boosting system throughput, whereas caching index
    fewer opportunities for sharing as the query count scales
    afford).
    4.3.4 Shared scanning
    Now with table-scanning being the norm rather than the exception
    significant amount of time, multiple full-scan queries
    each employed their own full-scanning read from disk. Shared scanning (or convoy
    scheduling) shares the I/O from each scan with multiple queries.
    and all concerning queries operate on that piece while it
    from many full-scan queries can be returned in little more
    query. Shared scanning also lowers the cost of data compression
    among the sharing queries, tilting the trade-off of increased
    heavily in favor of compression.
    Shared scanning will be used for all high-volume and super-high
    scanning is helpful for unpredictable, ad-hoc analysis,
    increasing the disk I/O cost – only more CPU is needed. On
    run the following scans:
    • one full table scan of Object table for the latest data
    • one synchronized full table scan of Object, Source and
    hours for the latest data release only,
    • one synchronized full table scan of Object and Object_Extra
    and previous data releases.
    Appropriate Level 3 user tables will be scanned as part
    answer any in-flight user queries.
    Shared scans will take advantage of table chunking explained
    single node a scan will involve fetching sequentially a
    on this chunk all queries in the queue. The level of parallelism
    available cores.
    Running multiple shared scans allows relatively fast response
    and supporting complex, multi-table joins: synchronized
    between different tables. For self-joins, a single shared
    node must have sufficient memory to hold 2 chunks at any given
    and next chunk). Refer to the sizing model [LDM-141] for further details on the cost
    scans.
    Low-volume queries will be executed ad-hoc, interleaved
    number of spinning disks is much larger than the number of
    any given time, this will have very limited impact on the
    shown in LDM-141.
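The core idea of shared scanning, reading each chunk once and letting every queued query work on it while it is in memory, can be sketched in a few lines. This is not the Qserv scheduler: chunks are plain Python lists, queries are predicates, and the function name `shared_scan` is hypothetical.

```python
def shared_scan(chunks, queries):
    """Read each chunk once and run every queued query against it."""
    results = {name: [] for name in queries}
    reads = 0
    for chunk in chunks:
        reads += 1  # one sequential read per chunk, regardless of query count
        for name, predicate in queries.items():
            results[name].extend(row for row in chunk if predicate(row))
    return results, reads

# Three chunks of a toy table and two concurrent full-scan queries.
chunks = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
queries = {
    "even": lambda row: row % 2 == 0,
    "big": lambda row: row > 6,
}
results, reads = shared_scan(chunks, queries)
```

Two independent scans would have read six chunks in total; the shared scan reads three, and adding a third query would add CPU work but no further disk reads.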
    4.3.5 Clustering
    The data in the Object Catalog will be physically clustered
    objects collocated in space will be also collocated on disk.
    ForcedSource, DiaSource, DiaForcedSource) will be clustered
    objectId – this approach enforces spatial clustering and
    same object, allowing sequential read for queries that involve
    SSObject catalog will be unpartitioned, because there
    could choose to use for partitioning. The associated diaSources
    with diaSources associated with static diaSources) will
    sition. For that reason the SSObject-to-DiaSource join
    all chunks, unlike DiaObject-to-DiaSource queries. Since
    should not be an issue.
    4.3.6 Partitioning
    Data must be partitioned among nodes in a shared-nothing sharding
    approaches partition data based on a hash of the primary
    LSST data since it eliminates optimizations based on celestial
    4.3.6.1 Sharded data and sharded queries
    All catalogs that require spatial
    (Object, Source, ForcedSource, DiaSource, DiaForcedSource)
    associated with them, such as Object_Extra, will be divided
    the same area by partitioning them into declination zones, and chunking each zone into RA
    stripes. Further, to be able to perform table joins without
    partitioning boundaries for each partitioned table must
    tables corresponding to the same area of sky must be co-located
    sure chunks are appropriately sized, the two largest catalogs,
    expected to be partitioned into finer-grain chunks. Since
    constant density throughout the celestial sphere, an equal-area
    that is uniformly distributed over the sky.
    Smaller catalogs that can be partitioned spatially, such
    be partitioned spatially. All remaining catalogs, such
    cated on each node. The size of these catalogs is expected
    With data in separate physical partitions, user queries
    rate physical queries to be executed on partitions. Each
    bined into a single final result.
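A simplified version of zone-and-stripe chunking, together with the fragmentation of one logical query into per-chunk physical queries, might look as follows. This is a sketch under stated assumptions and not the Qserv partitioner: `NUM_ZONES`, the cosine rule for the number of RA segments per zone, and the `Object_{zone}_{seg}` table naming are all made up for illustration.

```python
import math

NUM_ZONES = 18  # declination zones of 10 degrees each (illustrative value)

def chunks_in_zone(zone):
    """Use fewer RA segments near the poles so chunk areas stay roughly equal."""
    height = 180.0 / NUM_ZONES
    dec_lo = -90.0 + zone * height
    dec_hi = dec_lo + height
    # Pick the declination in this zone that lies closest to the equator.
    if dec_lo <= 0.0 <= dec_hi:
        widest = 0.0
    else:
        widest = min(abs(dec_lo), abs(dec_hi))
    return max(1, int(2 * NUM_ZONES * math.cos(math.radians(widest))))

def chunk_id(ra, dec):
    """Map a position in degrees to a (zone, segment) pair."""
    height = 180.0 / NUM_ZONES
    zone = min(int((dec + 90.0) / height), NUM_ZONES - 1)
    n = chunks_in_zone(zone)
    segment = min(int((ra % 360.0) / (360.0 / n)), n - 1)
    return zone, segment

def fragment(sql_template, positions):
    """Emit one physical query per chunk touched by the given positions."""
    chunks = sorted({chunk_id(ra, dec) for ra, dec in positions})
    return [sql_template.format(zone=z, seg=s) for z, s in chunks]

physical = fragment(
    "SELECT * FROM Object_{zone}_{seg}",
    [(0.0, 0.0), (1.0, 1.0), (200.0, 0.0)],
)
```

Two of the three positions fall into the same chunk, so only two physical queries are generated; their results would then be merged into the single answer returned to the user.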
    4.3.6.2 Two-level partitions
    Determining the size and number of data
    be obvious. Queries are fragmented according to partitions
    titions increases the number of physical queries to be dispatched,
    Thus a greater number of partitions increases the potential
    the overhead. For a data-intensive and bandwidth-limited
    to the number of disk spindles should minimize seeks and
    mance.
    From a management perspective, more partitions facilitate
    when nodes are added or removed. If the number of partitions
    nodes, then the addition of a new node would require the data
    other hand, if there were many more partitions than nodes,
    assigned to the new node without re-computing partition
    Smaller and more numerous partitions benefit spatial joins.
    are interested in objects near other objects, and thus a full $O(n^2)$ join is not required – a localized
    spatial join is more appropriate. With spatial data split
    computing the join need not even consider (and reject) all
    all the pairs within a region. Thus a task that is $O(n^2)$ naively becomes $O(kn)$ where $k$ is the
    number of objects in a partition.
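A quick count of candidate pairs makes the difference between the two complexities concrete. The figures below are illustrative only: they assume equal-sized partitions and ignore the extra pairs contributed by overlap regions.

```python
def naive_pairs(n):
    """Candidate pairs when every object is compared with every other object."""
    return n * (n - 1) // 2

def partitioned_pairs(n, k):
    """Candidate pairs when n objects sit in partitions of k objects each."""
    partitions = n // k
    return partitions * (k * (k - 1) // 2)

n, k = 1_000_000, 1_000  # one million objects, one thousand per partition
```

With these numbers the partitioned join examines roughly a thousand times fewer pairs than the naive one, which matches the ratio n/k expected when going from $O(n^2)$ to $O(kn)$.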
    In consideration of these trade-offs, two-level partitioning
    ple way to blend the advantages of both extremes. Queries
    coarse partitions (“chunks”), and spatial near-neighbor
    partitions (“sub-chunks”) within each partition. To avoid
    non-join queries, the system can store chunks and generate
    tial join queries. On-the-fly generation for joins is cost-effective
    of pairs, which is true as long as there are many sub-chunks
    4.3.6.3 Overlap
    A strict partitioning eliminates nearby pairs
    partitions are paired. To produce correct results under
    to objects from outside partitions, which means that data
    each partition can be stored with a precomputed amount of
    ping data does not strictly belong to the partition but is
    the partition’s borders. Using this data, spatial joins
    preset distance without needing data from other partitions
    Overlap is needed only for the Object Catalog, as all spatial
    catalog only. Guided by the experience from other projects
    preset the overlap to ~1 arcmin, which results in duplicating
    Catalog.
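
A minimal sketch of an overlap test follows, assuming rectangular partitions in (right ascension, declination) given in degrees and the ~1 arcmin overlap quoted above. The function name, the box layout, and the omission of right-ascension wraparound at 0/360 degrees are illustrative assumptions, not a description of the Qserv partitioner. The right-ascension margin is widened by 1/cos(dec) because a fixed angular distance spans more degrees of right ascension away from the equator.

```python
import math

OVERLAP_DEG = 1.0 / 60.0  # ~1 arcmin, the preset overlap quoted in the text


def in_overlap(ra, dec, box, overlap=OVERLAP_DEG):
    """Return True if (ra, dec) lies outside `box` but within `overlap`
    degrees of its border. `box` is (ra_min, ra_max, dec_min, dec_max).
    RA wraparound is ignored in this sketch."""
    ra_min, ra_max, dec_min, dec_max = box
    if ra_min <= ra < ra_max and dec_min <= dec < dec_max:
        return False  # the point belongs to the partition itself
    # Use the declination farthest from the equator for a conservative margin.
    widest_dec = min(89.9, max(abs(dec_min), abs(dec_max)) + overlap)
    ra_margin = overlap / math.cos(math.radians(widest_dec))
    return (ra_min - ra_margin <= ra < ra_max + ra_margin
            and dec_min - overlap <= dec < dec_max + overlap)


box = (10.0, 20.0, 0.0, 10.0)
```

Rows for which `in_overlap` is true would be stored alongside the partition so that a near-neighbor join within the overlap distance needs no data from other partitions.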
    4.3.6.4 Spherical geometry Support for spherical geometry is not common
    and spherical geometry-based partitioning was non-existent
    cided to develop Qserv. Since spherical geometry is the norm
    tial objects (right-ascension and declination), any spatial
    objects must account for its complexities.
    4.3.6.5 Data immutability It is important to note that user query access
    only data. Not having to deal with updates simplifies the
    extra optimizations not possible otherwise. The Level 1
    and will not require the scalable architecture; we plan to
    of-the-box MySQL as described in Section 4.1.
    4.3.7 Long-running queries
    Many of the typical user queries may need significant time
    To avoid re-submission of those long-running queries in
    or hardware issues) the system will support asynchronous
    mode users will submit queries using special options or syntax
    query and immediately return to the user some identifier of the
    user session. This query identifier will be used by the user to
    query result after query completes, or a partial query result
    The system should be able to estimate the time which user
    refuse to run long queries in a regular blocking mode.
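
The asynchronous mode described above (submit, receive an identifier immediately, retrieve status and results later) can be sketched as follows. This is a minimal in-process illustration using a thread pool; the class name, the status strings, and the `run_query` callable are hypothetical and do not reflect the actual Qserv interface or syntax.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor


class AsyncQueryService:
    """Toy asynchronous query front end (illustrative only)."""

    def __init__(self, run_query, max_workers=4):
        self._run_query = run_query  # hypothetical callable: sql -> rows
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._futures = {}

    def submit(self, sql):
        """Start the query in the background and return an identifier at once."""
        query_id = uuid.uuid4().hex
        self._futures[query_id] = self._pool.submit(self._run_query, sql)
        return query_id

    def status(self, query_id):
        future = self._futures[query_id]
        if not future.done():
            return "EXECUTING"
        return "FAILED" if future.exception() else "COMPLETED"

    def result(self, query_id, timeout=None):
        """Block until the result is available (or the timeout expires)."""
        return self._futures[query_id].result(timeout=timeout)


service = AsyncQueryService(lambda sql: [(len(sql),)])
query_id = service.submit("SELECT 1")
rows = service.result(query_id, timeout=5)
```

Because the identifier outlives the submitting call, a client that disconnects can come back later with the same identifier, which is the property that avoids re-submission of long-running queries.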
    4.3.8 Technology choice
    As explained in DMTN-046, no off-the-shelf solution meets the above requirements
    an RDBMS seems a much better fit than a Map/Reduce-based system
    such as indexes, schema, and speed. For that reason, our
    custom software built on two production components: an open
    non-parallel DBMS (MySQL) and XRootD. To ease potential
    munication with the underlying DBMS relies on basic DBMS functionality only, and
    vendor-specific features and additions.
    Figure 2: Component connections in Qserv.
    5 Implementation (Qserv)
    A prototype implementation of the baseline architecture
    above, Qserv, was developed during the R&D phase of LSST, and its
    strated in early testing (DMTR-21, DMTR-12). Productization was subsequently
    resourced for the construction phase of LSST and is presently
    rently implemented is described here.
    5.1 Components
    5.1.1 MySQL
    To control the scope of effort, Qserv uses an existing SQL
    much query processing as possible. MySQL is a good choice
    ment community, mature implementation, wide client software
    lightweight execution, and low data overhead. MySQL’s
    munity means that expertise is relatively common, which
    development or long-term maintenance in the years ahead.
    is also lightweight and well-understood, giving predictable
    vanced storage layout that may demand more capacity, bandwidth,
    constrained hardware budget.
    It is worth noting, however, that Qserv’s design and implementation
    of MySQL beyond glue code facilitating results transmission.
    order to allow the system to leverage a more advanced or more
    the future.
    5.1.2 XRootD
    The XRootD distributed file system is used to provide a distributed,
    cated, fault-tolerant communication facility for Qserv.
    have been non-trivial, so we wanted to leverage an existing
    scalability, fault-tolerance, performance, and efficiency
    physics community. Its relatively flexible API enabled its
    general communication routing system. Since it was designed
    were confident that it could mediate not only query dispatch
    transfer of results.
    An XRootD cluster is implemented as a set of data servers and
    to a redirector, which acts as a caching namespace lookup
    appropriate data servers. In Qserv, XRootD data servers
    menting plug-ins within the XRootD framework which advertise
    addressable resources within the XRootD cluster. The Qserv
    and receives results as an XRootD client by dispatching messages
    Figure 3: XRootD.
    5.2 Partitioning
    In Qserv, large spatial tables are fragmented into spatial
    scheme. The partitioning space is a spherical space defined by two angles $\phi$ (right ascension/$\alpha$)
    and $\theta$ (declination/$\delta$). For example, the Object table is fragmented
    pair specified in two columns: right-ascension and declination.
    ments are represented as tables named Object_CC and Object_CC_SS where CC is the “chunk id”
    (first-level fragment) and SS is the “sub-chunk id” (second-level fragment
    ment. Sub-chunk tables are built on-the-fly to optimize performance
    Large tables are partitioned on the same spatial boundaries
    between them.
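
The mapping from a coordinate pair to the Object_CC and Object_CC_SS table names can be sketched as follows. The scheme here is an assumption made for illustration: the sky is cut into equal-height declination stripes, each stripe is cut into a number of chunks scaled by cos(dec) so that chunk areas stay comparable, and each chunk is cut into a small grid of sub-chunks. The constants, the id numbering, and the function names are hypothetical; the text does not specify the real Qserv numbering.

```python
import math

NUM_STRIPES = 18   # hypothetical: 10-degree declination stripes
SUB_PER_SIDE = 4   # hypothetical: 4 x 4 sub-chunks per chunk
STRIPE_HEIGHT = 180.0 / NUM_STRIPES
MAX_CHUNKS_PER_STRIPE = 2 * NUM_STRIPES


def chunks_in_stripe(stripe):
    """Fewer chunks near the poles, where a stripe covers less area."""
    dec_lo = -90.0 + stripe * STRIPE_HEIGHT
    dec_hi = dec_lo + STRIPE_HEIGHT
    nearest = 0.0 if dec_lo <= 0.0 <= dec_hi else min(abs(dec_lo), abs(dec_hi))
    return max(1, int(MAX_CHUNKS_PER_STRIPE * math.cos(math.radians(nearest))))


def locate(ra, dec):
    """Return (chunk_id, sub_chunk_id) for a position given in degrees."""
    ra = ra % 360.0
    stripe = min(int((dec + 90.0) / STRIPE_HEIGHT), NUM_STRIPES - 1)
    n_chunks = chunks_in_stripe(stripe)
    width = 360.0 / n_chunks
    col = min(int(ra / width), n_chunks - 1)
    chunk_id = stripe * MAX_CHUNKS_PER_STRIPE + col
    dec_lo = -90.0 + stripe * STRIPE_HEIGHT
    sub_row = min(int((dec - dec_lo) / (STRIPE_HEIGHT / SUB_PER_SIDE)),
                  SUB_PER_SIDE - 1)
    sub_col = min(int((ra - col * width) / (width / SUB_PER_SIDE)),
                  SUB_PER_SIDE - 1)
    return chunk_id, sub_row * SUB_PER_SIDE + sub_col


def table_names(ra, dec, table="Object"):
    """Names of the chunk table and the sub-chunk table holding (ra, dec)."""
    chunk_id, sub_id = locate(ra, dec)
    return f"{table}_{chunk_id}", f"{table}_{chunk_id}_{sub_id}"
```

Applying the same `locate` function to every large table gives them identical spatial boundaries, which is what allows joins between chunk tables with the same chunk id.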
    5.3 Query Generation
    Qserv is unusual (though not unique) in processing a user
    ments that are subsequently distributed to and executed
    software. This is done in the hopes of providing a distributed
    avoiding a full re-implementation of common database features.
    that it is necessary to implement a query processing framework
    more standard database, with the exception that the resulting
    ments as the intermediate language.
    A significant amount of query analysis not unlike a database
    order to generate a distributed execution plan that accurately
    queries. Incoming user queries are first parsed into an intermediate
    modified SQL92-compliant grammar (Lubos Vnuk’s antlr-based
    representation is equivalent to the original user query,
    terpretation, but may not completely reflect the original
    tation is to provide a semantic representation that may be
    and transformation modules without the complexity of a parse
    the original EBNF grammar.
    Once the intermediate representation has been created,
    modules. The first sequence operates on the query as a single
    step occurs to split the single representation into a “plan”
    tion, one to be executed per-data-chunk, and one to be executed
    results into final user results. A second sequence is then
    necessary transformations for an accurate result.
    We have found that regular expressions and parse element
    lyze and manipulate queries for anything beyond the most
    5.3.1 Processing modules
    The processing modules perform most of the work in transforming
    ments that can produce a faithful result from a Qserv cluster.
    • Identify spatial indexing opportunities. This allows
    queries on only a subset of the available chunks constituting
    given in Qserv-specific syntax are rewritten as boolean
    • Identify secondary index opportunities. Qserv databases
    are under consideration) as a key column whose values are
    one spatial location. Identification allows Qserv to convert
    into spatial restrictions.
    • Identify table joins and generate syntax to perform distributed
    marily supports “near-neighbor” spatial joins for limited
    tioning coordinate space. Arbitrary joins between distributed
    using the key column. Queries are classified according
    ning. By identifying tables scanned in a query, Qserv
    tion using shared scanning, which greatly increases efficiency.
    5.3.2 Processing module overview
    Figure 4: Processing modules.
    This figure illustrates the query preparation pipeline that
    input query string. User query strings are parsed (1) into
    that is passed through a sequence of processing modules (2)
    tation in-place. Then, it is broken up (3) into pieces that
    execution on table partitions and pieces intended to merge
    Another processing sequence (4) operates on this new representation,
    crete query strings are generated (5) for execution.
    The two sequences of processing modules provide an extensible
    analysis and manipulation. Earlier prototypes performed
    parsing, but this led to a practically unmaintainable code
    ported the processing module model. Processing is split
    flexibility to modules that manipulate the physical structures
    query representation to modules that do not require the
    between parsing, whose only goal is to provide an intelligible
    tation, and the Qserv-specific analysis and manipulation
    maintainability, and extensibility of the system and should
    and future LSST needs.
    5.4 Dispatch
    Qserv uses XRootD as a distributed, highly-available communications
    frontends to communicate with data workers. Up until 2015,
    API with named files as communication channels. The current
    eral two-way named-channeling system which eliminates
    generalized protocol messages that can be flexibly streamed.
    Service Interface (SSI) and is built on top of XRootD.
    5.4.1 Wire protocol
    Qserv encodes query dispatches in Google Protobuf messages,
    to be executed by the worker and annotations that describe
    teristics. Transmitting query characteristics allows Qserv
    under changing CPU and disk loads as well as memory considerations.
    re-analyze the query to discover these characteristics
    determined by query inspection.
    Query results are also returned via Protobuf messages. Initial
    table dumps to avoid logic to encode and decode data values,
    type MonetDB worker backend proved that data encoding and
    problems whose solution could significantly improve overall
    tating metadata operations on worker and frontend DBMS
    encodes results in Protobuf messages containing schema
    ues. Streaming results directly from worker DBMS instances
    is a technique under consideration, as is a custom aggregation
    likely ease the implementation of providing partial query
    5.4.2 Frontend
    In 2012, a new XRootD client API was developed to address
    sion’s scalability (uncovered during a 150 node, 30 TB scalability test [DMTR-21]). The new
    client API began production use for the broader XRootD community
    quently, work began under our guidance towards an XRootD
    on request-response interaction over named channels, instead
    ing files. A production version of this API, the Scalable Service
    in early 2015 and Qserv has since been ported to use this
    significant body of code that mapped dispatching and result-retrieval
    SSI API now resides in the XRootD codebase, where it may be
    The SSI API provides Qserv with a fully asynchronous interface
    ing threads used by the Qserv frontend to communicate with
    one class of problems we encountered during large-scale
    terfaces that integrate smoothly with the Protobufs-encoded
    novel features were specifically added to improve Qserv
    sponse interface enables reduced buffering in transmitting
    to the frontend, which lowers end-to-end query latency
    on workers. The out-of-band meta-data response which arrives
    be used to map out the Protobufs encoding and significantly
    ory buffers.
    The fully asynchronous API is crucial on the master because
    rent chunk queries in flight expected in normal operation.
    10k pieces, having 10 full-scanning queries running concurrently
    chunk queries, too large a number of threads to allow on a
    chronous API to XRootD is crucial. Threads are used to parallelize
    While it does not seem to be important to parse/analyze/manipulate
    parallel (and such a task would be a research topic), the
    could be done in parallel if some portion of the aggregation/merging
    code rather than loaded into the frontend’s MySQL instance
    Thus results processing should be parallelized among results
    query parsing/analysis/manipulation can be parallelized
    5.4.3 Worker
    The Qserv worker uses both threads and asynchronous calls
    allelism. To service incoming requests from the XRootD API,
    receive requests and enqueue them for action. Specifically,
    is used on Qserv workers as well. The interface provides
    on the front-end making the logic relatively easy to follow
    prone.
    Threads are maintained in a thread pool to perform incoming
    the DBMS’s API (currently, the apparently synchronous MySQL
    run in observance of the amount of parallel resources available.
    I/O dependency of each incoming chunk query in terms of the
    resources involved, and attempts to ensure that disk access
    Thus if there are many queries that access the same table
    of them to run as there are CPU cores in the system, but if
    different chunk tables, it allows fewer simultaneous chunk
    one table scan per disk spindle occurs. Further discussion
    is described below.
    5.5 Threading Model
    Nearly every portion of Qserv is written using a combination
    execution.
    Qserv heavily relies on multi-threading to take advantage
    executing queries, as an example, to complete one full table
    chunks, 1,000 queries (processes) will be executed. To
    of processes that are executed on each worker, we ended
    and switching from a thread-per-request model to a thread
    completely asynchronous, with real call-backs.
    mysqlproxy                 Single-threaded Lua code driving non-blocking
                               client API
    Frontend-C++               Processing thread per user-query for
                               Results-merging thread-per-user-query
    Frontend-XRootD            Callback threads perform query transmission and
                               results retrieval
    Frontend-XRootD internal   Threads for maintaining worker connections
                               host)
    XRootD, cmsd               Small thread pools for managing live network
                               connections and performing lookups
    Worker-XRootD              Small plug-in thread pool O(#cores) to make blocking
                               API calls into local mysqld; callback threads
                               perform admission/scheduling of tasks
                               and transmission of results
    5.6 Aggregation
    Qserv supports several SQL aggregation functions: AVG(),
    and SQL92 level GROUP BY.
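
Aggregates such as AVG() cannot be computed by averaging per-chunk averages, so a distributed plan has to carry partial aggregates from each chunk and combine them in the merge step. The sketch below shows that standard decomposition (SUM and COUNT per chunk, one division at the end). It is a generic illustration with hypothetical function names, not a description of the SQL that Qserv actually generates.

```python
def chunk_partial(values):
    """Per-chunk step: return the partial aggregates (sum, count)."""
    return (sum(values), len(values))


def merge_avg(partials):
    """Merge step: combine the partial sums and counts into one average."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None


chunks = [[1.0, 2.0, 3.0], [10.0]]           # hypothetical column values per chunk
partials = [chunk_partial(c) for c in chunks]
average = merge_avg(partials)                 # 16.0 / 4
average_of_averages = sum(s / c for s, c in partials) / len(partials)
```

The two results differ whenever chunks hold different numbers of rows, which is why the per-chunk query and the merging query must be generated as a matched pair.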
    5.7 Indexing
    Qserv eschews heavy indexing in general, due to the prohibitive
    cur as a result of the scale of the hosted data. Nevertheless,
    primary key are anticipated to be a very common use case and
    ciently. To that end, the current implementation admits
    ter nodes, which can be used to map queries restricted by
    and subchunks which contain those Objects. Query fragments
    to just the set of involved workers, where in-memory subchunk
    to be efficiently executed. This special-case global index
    subchunk is referred to as the “secondary index”.[5]
    The secondary index utilizes one or more tables using the
    master node to perform lookups. Performance tests (Figure 5) on a single, dual-core
    with 1 TB hard disk storage (not SSD) have shown that this
    load of 40 billion rows in about 400,000 seconds (110 hours).
    with multiple cores and SSD storage is expected to meet the
    less than 48 hours.
    Figure 5: Performance tests of MySQL-based secondary index.
    To improve the performance of the InnoDB storage engine for
    may be split across a small number (dozens) of tables, each
    keys. This splitting, if done, will be independent of the
    contiguity of key ranges will allow the secondary index service
    table arithmetically via an in-memory lookup.
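
When the index is split into tables that each hold a contiguous key range, finding the table for a given key reduces to a binary search over the range boundaries held in memory. The sketch below assumes a sorted list of lower bounds, one per split table; the boundary values and the table-name prefix are hypothetical.

```python
import bisect

# Hypothetical lower bounds of objectId for each split table, ascending.
SPLIT_LOWER_BOUNDS = [0, 1_000_000, 2_000_000, 3_000_000]


def split_table_for(object_id, prefix="ObjectIndex"):
    """Return the name of the split table whose key range contains object_id."""
    idx = bisect.bisect_right(SPLIT_LOWER_BOUNDS, object_id) - 1
    if idx < 0:
        raise ValueError(f"objectId {object_id} is below the first key range")
    return f"{prefix}_{idx:02d}"
```

With only dozens of split tables the boundary list is tiny, so this lookup adds negligible cost in front of the single-table index query.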
    [5] It is acknowledged that the name “secondary index” was poorly chosen,
    literature. This name will probably be changed in the near future.
    5.7.1 Secondary Index Structure
    The secondary index consists of three columns: the Object
    where all data with that key are located (chunkId), and the
    data with the key are located (subChunkId). The objectId
    as a 64-bit integer value; the chunkId and subChunkId are
    spatial regions on the sky.
    5.7.2 Secondary Index Loading
    The InnoDB storage engine loads tables most efficiently if
    been presorted according to the table’s primary key. When
    is collected for loading (from each worker node handling
    by objectId, and may be divided into roughly equal “splits”.
    a table en masse.
    To fully optimize the loading and table splitting, the entire
    all workers and presorted in memory on the master. This
    entries (requiring a minimum of 480 GB memory, plus overhead).
    from a single worker can be assumed to be a “representative
    objectIds, so table splitting can be done using the first
    workers will be split and loaded according to those defined
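
Deriving split boundaries from a representative sample can be sketched as taking evenly spaced quantiles of the sorted sample. The function name and the choice of quantile rule are assumptions for illustration. The last two lines check the memory figure quoted above, assuming it refers to the 40 billion rows cited for the load test.

```python
def split_boundaries(sample_ids, num_splits):
    """Pick num_splits - 1 boundary keys so that each split receives a
    roughly equal share of the sample."""
    ordered = sorted(sample_ids)
    step = len(ordered) / num_splits
    return [ordered[int(i * step)] for i in range(1, num_splits)]


boundaries = split_boundaries(range(100), num_splits=4)

# Assumption: 480 GB spread over 40 billion index entries.
bytes_per_entry = (480 * 10**9) / (40 * 10**9)
```

If the sample really is representative, data arriving later from the other workers can be routed to splits using the same boundaries without re-sorting everything on the master.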
    5.8 Data Distribution
    LSST will maintain the released data store both on tape media
    tape archive is used for long-term archival. Three copies
    be kept. The database cluster will maintain 3 online copies
    clusters of reasonable size fail regularly, the cluster
    provide continuous data access. A replication factor of
    data integrity by majority rule when one replica is corrupt.
    If periodic unplanned downtime is acceptable, an on-tape
    three. However, the use of tape dramatically increases the
    This may be acceptable for some tables, particularly those
    although allowing service disruption may make it difficult
    analysis on those large tables.
    5.8.1 Database data distribution
    The baseline database system will provide access for at
    and previous. Data for each release will be spread out among
    Data releases are partitioned spatially, and spatial pieces
    robin fashion across all nodes. This means that area queries
    almost guaranteed to involve resources on multiple nodes.
    Each node should maintain at least 20% free space of its
    maining free space is then available to be “borrowed” when
    temporary use of storage capacity until more server resources
    80% storage use is returned.
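
A minimal sketch of round-robin placement with three replicas and the 80% fill limit follows. Placing the replicas of a chunk on consecutive nodes is an assumption made for illustration, as are the function names; the text only states that spatial pieces are distributed round-robin and that 20% of each node is kept free.

```python
def place_chunks(chunk_ids, nodes, replicas=3):
    """Assign each chunk to `replicas` distinct nodes in round-robin order."""
    if replicas > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    n = len(nodes)
    return {chunk: [nodes[(i + r) % n] for r in range(replicas)]
            for i, chunk in enumerate(chunk_ids)}


def within_fill_limit(used_bytes, capacity_bytes, limit=0.80):
    """True while a node stays at or below the normal 80% fill level."""
    return used_bytes <= limit * capacity_bytes


placement = place_chunks([10, 11, 12, 13], ["a", "b", "c", "d"])
```

Because neighboring chunks land on different nodes, a query over a contiguous sky area is spread across many nodes rather than concentrated on one.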
    5.8.2 Failure and integrity maintenance
    There will be failures in any large cluster of nodes, in the
    volumes, in network access and so on. These failures will
    resident on those nodes, but this loss of data access should
    to analyze the data set as a whole. We need to set a data availability
    confidence of the community in the stability of the system.
    and to allow acceptable levels of node failures in a cluster,
    a table level throughout the cluster.
    The replication level will be that each table in the database
    nodes. A monitoring layer to the system will check on the
    hours, although this time will be tuned in practice. When
    than three replicas available, this will initiate a replication
    currently hosting that table. The times for the checking,
    to the stability of the cluster, such that about 5% of all
    1 or 2 replicas. Three replicas will ensure that tables will
    failures, or when nodes need to be migrated to new hardware
    Should an entire node fail, replicating that data to another
    sive in terms of time (on the order of hours). We plan on having
    filling local storage to 80%. The free space will be used
    failures, where replicas can take place in parallel between
    new nodes with free storage are added to the cluster, then
    free space into the drive, potentially taking several hours,
    data during this time. Once this is complete, this data will
    of time until these tables can be removed from the temporary
    to 80% usage.
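
The monitoring step described in this section can be sketched as a pass over the table-to-node map that finds tables with fewer than three live replicas and proposes target nodes for new copies. Choosing targets by largest free space is an assumption made for illustration, and all names are hypothetical.

```python
def plan_repairs(table_hosts, live_nodes, free_space, want=3):
    """Return {table: [target nodes]} for tables with fewer than `want`
    replicas on live nodes. Targets are live nodes that do not already
    host the table, preferring those with the most free space."""
    plan = {}
    for table, hosts in table_hosts.items():
        alive = [h for h in hosts if h in live_nodes]
        missing = want - len(alive)
        if missing <= 0:
            continue
        candidates = sorted((n for n in live_nodes if n not in alive),
                            key=lambda n: (-free_space[n], n))
        plan[table] = candidates[:missing]
    return plan


table_hosts = {"Object_1": ["a", "b", "c"], "Object_2": ["b", "c", "d"]}
live_nodes = {"a", "b", "d", "e"}            # node "c" has failed
free_space = {"a": 10, "b": 5, "d": 7, "e": 20}
repairs = plan_repairs(table_hosts, live_nodes, free_space)
```

Each planned copy would be read from one of the surviving hosts of that table, so repairs for different tables can proceed in parallel between different pairs of nodes.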
    5.9 Metadata
    Qserv needs to track various metadata information, static
    infrequently), and dynamic (run-time) in order to maintain
    optimize the cluster usage.
    5.9.1 Static metadata
    Qserv typically works with databases and tables distributed
    it breaks individual large tables into smaller chunks and
    All chunks that belong to the same logical table must have
    parameters. Different tables often need to be partitioned
    might be partitioned with overlap (such as the Object table),
    no overlap (for example the Source table), and some might
    a tiny Filter table). All this information about schema and
    databases and tables needs to be tracked and kept consistent
    Implementation of the static metadata in Qserv is based
    which uses a regular MySQL database as a storage backend.
    multiple masters and it must be served by a fault-tolerant
    a master-master replication solution like MariaDB Galera
    critical for metadata and it should be implemented using
    engines in MySQL.
    Static metadata may contain the following information:
    • Per-database and per-table partitioning and scan scheduling
    • Table schema for each table, used to create database tables
    instances; the schema in the master MySQL instance can
    information when a table is already created.
    • Database and table state information, used primarily
    table creation or deletion.
    • Definitions for the set of worker and master nodes in a cluster
    status.
    The main clients of the static metadata are:
    • Administration tools (command-line utilities and modules)
    modify metadata structures.
    • Qserv master(s), mostly querying partitioning parameters
    table/database status when deleting/creating new tables
    not depend on node definitions in metadata; the XRootD facility
    with workers.
    • A special “watcher” service which implements distributed
    table management.
    • An initial implementation of the data loading application
    tions and will create/update database and table definitions.
    will eventually be replaced by a distributed loading mechanism
    separate mechanisms.
    5.9.2 Dynamic metadata
    In addition to static metadata, a Qserv cluster also needs
    various statistics about query execution. This sort of data
    per query execution, and is called dynamic metadata.
    The implementation of the dynamic metadata is based on the
    metadata it needs to be shared between all master instances
    tolerant MySQL instance which is shared with static metadata
    Dynamic metadata contains the following information:
    • Definition of every master instance in a Qserv cluster.
    • Record of every SELECT-type query processed by the cluster.
    processing state and some statistical/timing information.
    • Per-query list of table names used by the asynchronous
    to delay table deletion while async queries are in progress.
    • Per-query worker information, which includes chunk id
    the worker processing that chunk id. This information
    restart the master or migrate query processing to a different
    failure.
    The most significant use of the dynamic metadata is to track
    queries. When an async query is submitted it is registered
    ID is returned to the user immediately. Later users can request
    query ID which is obtained from dynamic metadata. When query
    can request results and the master can obtain the location
    metadata.
    Additionally, dynamic metadata can be used to collect statistical
    that were executed in the past which may be an important tool
    ing system performance.
    5.9.3 Architecture
    The Qserv metadata system is implemented based on master/server
    data is centrally managed by a Qserv Metadata Server (qms). The information kept
    worker is kept to a bare minimum: each worker only knows which
    to handle, and all remaining information can be fetched from
    our philosophy of keeping the workers as simple as possible.
    The real-time metadata is managed inside qms in in-memory
    with disk-based table. Such configuration allows reducing
    delaying query execution time. Should a qms failure occur,
    the information was lost will be restarted. Since the synchronization
    occur relatively frequently (e.g., at least once per minute),
    avoid overloading the qms, only the high-level information
    stored in qms; all worker-based information is cached in
    in a simple, raw form (e.g., key-value, ASCII file), and can
    5.9.4 Typical Data Flow
    Static metadata:
    1. Parts of the static metadata known before data is partitioned/loaded
    administration scripts responsible for loading data
    start data partitioner.
    2. The data partitioner reads static metadata loaded by
    remaining information.
    3. When Qserv starts, it fetches all static metadata and
    in-memory optimized C++ structure.
    4. The contents of the in-memory metadata cache inside Qserv
    mand if the static metadata changes (for example, when
    added).
    Dynamic metadata:
    1. Master loads the information for each query (when it starts,
    2. Detailed statistics are dumped by each worker into a scratch
    formation can be requested from each worker on demand.
    chunk-queries except one completed, qms would fetch
    chunk-query to estimate when the query might finish, whether
    5.10 Shared Scans
    Arbitrary full-table scanning queries must be supported
    der to provide this support cost-effectively and efficiently,
    Shared scans effectively reduce the I/O cost of executing
    rently, reducing the required system hardware and purchasing
    Shared scans reduce overall I/O costs by forcing incoming
    queries scan the same table, theoretically, they can completely
    I/O cost of a single query rather than the sum of their individual
    for queries to share I/O because their arrival times are random
    begins scanning at different times, and because LSST’s catalog
    system caching is ineffective. In Qserv, scanning queries
    shared scanning forces each query to operate on the same
    by reordering and sequencing the query fragments.
    5.10.1 Background
    Historically, shared scanning has been a research topic
    mentations. We know of only one implementation in use (Teradata).
    mentations assume OS or database caching is sufficient, encouraging
    to reduce the need of table scans. However, our experiments
    are large enough (by row count) and column access sufficiently
    columns when there are hundreds to choose from), indexes
    indexes no longer fit in memory, and even when they do fit in
    trieve each row is dominant when the index selects a percentage
    finite number (thousands or less).
    5.10.2 Implementation
    The implementation of shared scans in Qserv is in two parts.
    tion of incoming queries as scanning queries or non-scanning
    to scan a table if it depends on non-indexed column values $k$ chunks
    (where $k$ is a tunable constant). Note that involving multiple
    lects from at least one partitioned table. This classification
    on the front-end and leverages table metadata. The metadata
    is set by hand. Higher scan ratings indicate larger tables
    The identified “scan tables” and their ratings are marked
    which use the information in scheduling the fragments of
    The second part of the shared scans implementation is a scheduling
    query fragment execution to optimize cache effectiveness.
    off-the-shelf DBMS instances on worker nodes, it is not allowed
    implement shared scans. Instead, it issues query fragments
    access in data and time, and tries to lock the files associated
    much as possible. Using the identified scan tables and their
    the appropriate scheduler. There will be at least three schedulers.
    to complete in under an hour, which are expected to be related
    queries expected to take less than eight hours, expected
    one for scans expected to take eight to twelve hours for ForcedSource
    The reasoning being that a single slow query can impede the
    all the other user queries on that scan. There may be a need
    queries taking more than 12 hours.
    Each scheduler places incoming chunk queries into one of two
    id then scan rating of the individual tables. If the query
    ning chunk id, it is placed on the active priority queue,
    priority queue. After chunk id, the priority queue is sorted
    to ensure that the largest tables in the chunk are grouped
    Once the query is on the appropriate scheduler, the algorithm
    dispatch slot is available, it checks the highest priority
    fragment (hereafter called a task), and it is not at its quota
    task, otherwise the worker checks the next scheduler. It
    been started or all the schedulers have been checked.
    Each scheduler is only allowed to start a task under certain
    enough threads available from the pool so that none of the
    threads as well as enough memory available to lock all the
    the scheduler has no tasks running, it may start one task
    tables in that task. This should prevent any scheduler from
    without requiring complicated logic, but it could incur
    Schedulers check for tasks by first checking the top of the
    priority queue is empty, and the pending priority queue
    queues are swapped with the task being taken from the top
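
The two-queue behavior described above can be sketched with a pair of heaps. This is one possible reading of the text, stated as an assumption: tasks for chunks at or ahead of the current scan position go to the active queue, tasks for chunks already passed wait in the pending queue, and the queues are swapped when the active queue drains. Ordering larger scan ratings first within a chunk is also an assumption. The class and method names are hypothetical, and thread and memory quotas are left out.

```python
import heapq


class ScanScheduler:
    """Toy shared-scan scheduler with active and pending priority queues."""

    def __init__(self):
        self._active = []    # heap of (chunk_id, -scan_rating, seq, task)
        self._pending = []
        self._current_chunk = None
        self._seq = 0        # tie-breaker that preserves arrival order

    def enqueue(self, chunk_id, scan_rating, task):
        entry = (chunk_id, -scan_rating, self._seq, task)
        self._seq += 1
        already_passed = (self._current_chunk is not None
                          and chunk_id < self._current_chunk)
        heapq.heappush(self._pending if already_passed else self._active, entry)

    def next_task(self):
        """Return the next task in scan order, or None if nothing is queued."""
        if not self._active and self._pending:
            self._active, self._pending = self._pending, self._active
        if not self._active:
            return None
        chunk_id, _, _, task = heapq.heappop(self._active)
        self._current_chunk = chunk_id
        return task


scheduler = ScanScheduler()
scheduler.enqueue(5, 1, "q1-chunk5")
scheduler.enqueue(7, 1, "q1-chunk7")
order = [scheduler.next_task()]              # scan position is now chunk 5
scheduler.enqueue(3, 1, "q2-chunk3")         # chunk already passed: pending
scheduler.enqueue(5, 2, "q2-chunk5")         # joins the scan in progress
while (task := scheduler.next_task()) is not None:
    order.append(task)
```

The late-arriving query shares chunk 5 with the scan already in progress and waits for the next pass to pick up chunk 3, so both queries read each chunk close together in time.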
    Since the queries are being run by a separate DBMS instance
    over how it goes about running queries, the worker can only
    the DBMS and also lock files in memory. Files in memory are
    be paged out when memory resources are low, which would increase
    in memory prevents this from happening. However, care must
    memory can be used for locking files. Use too much and there
    on DBMS performance. Set aside too little, and schedulers
    the resources available and may be forced to run tasks without
    memory.
    The memory manager controls which files are locked in memory.
    run a task, the task asks the memory manager to lock all the
    The memory manager determines which files are associated
    already locked in memory and there is enough memory available
    not already locked, the task is given a handle and allowed
    it hands the handle back to the memory manager. If it was the
    table, the memory for the files used by that table is freed.
    When the memory manager locks a file, it does not read the
    for the file to occupy when it is read by the DBMS. In the special
    even though there is not enough memory available, those tables
    list of reserved tables and their size is subtracted from
    freed. When memory is freed, the memory manager will try
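
The handle-based bookkeeping described above can be sketched with reference counts per table and a fixed memory budget. The sketch only models the accounting: it performs no real memory locking, omits the special case of reserved tables, and uses hypothetical names and sizes.

```python
class MemoryManager:
    """Toy accounting model of locking table files in memory."""

    def __init__(self, budget_bytes):
        self._budget = budget_bytes
        self._locked = {}  # table name -> [size_bytes, reference_count]

    def used(self):
        return sum(size for size, _ in self._locked.values())

    def lock(self, tables):
        """tables maps table name -> size in bytes. Return a handle if every
        table is locked or can be locked within the budget, else None."""
        extra = sum(size for name, size in tables.items()
                    if name not in self._locked)
        if self.used() + extra > self._budget:
            return None
        for name, size in tables.items():
            self._locked.setdefault(name, [size, 0])[1] += 1
        return tuple(tables)

    def release(self, handle):
        """Give a handle back; free a table when its last user is done."""
        for name in handle:
            entry = self._locked[name]
            entry[1] -= 1
            if entry[1] == 0:
                del self._locked[name]


manager = MemoryManager(budget_bytes=100)
h1 = manager.lock({"Object_5": 60})
h2 = manager.lock({"Object_5": 60, "Source_5": 30})   # Object_5 is shared
h3 = manager.lock({"ForcedSource_5": 50})             # would exceed the budget
used_while_running = manager.used()
manager.release(h1)
used_after_first_release = manager.used()
manager.release(h2)
```

A table shared by two tasks is counted once against the budget and stays locked until the last task using it hands back its handle.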
    Because Qserv processes interactive, short queries concurrently
    query scheduler should be able to allow for those queries
    a query scan. To achieve this, Qserv worker nodes choose
    scribed above and a simpler grouping scheduler. Incoming queries with identified
    are admitted to the scan scheduler, and all other queries
    uler. The grouping scheduler is a simple scheduler that
    (first-in-first-out) scheduler. Like a FIFO scheduler, it
    cute, and operates identically to a FIFO scheduler with
    by chunk id. Each incoming query is inserted into the queue
    same chunk, and at the back if no queued query matches. The
    that the queue will never get very long, because it is intended
    queries lasting fractions of seconds, but groups its queue
    provide a minimal amount of access locality to improve throughput
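
The grouping rule can be sketched as a FIFO queue with one change: a new query is placed directly behind the last queued query for the same chunk, or at the back if no queued query matches. A plain list is used on the stated assumption that the queue stays short; the class and method names are hypothetical.

```python
class GroupingScheduler:
    """Toy FIFO scheduler that groups queued queries by chunk id."""

    def __init__(self):
        self._queue = []  # list of (chunk_id, query), front at index 0

    def enqueue(self, chunk_id, query):
        # Search from the back for the last queued query on the same chunk.
        for i in range(len(self._queue) - 1, -1, -1):
            if self._queue[i][0] == chunk_id:
                self._queue.insert(i + 1, (chunk_id, query))
                return
        self._queue.append((chunk_id, query))

    def next_query(self):
        return self._queue.pop(0)[1] if self._queue else None


grouping = GroupingScheduler()
for chunk_id, query in [(1, "a"), (2, "b"), (1, "c"), (3, "d"), (2, "e")]:
    grouping.enqueue(chunk_id, query)
served = [grouping.next_query() for _ in range(5)]
```

Queries for the same chunk run back to back, which gives some access locality without the bookkeeping of the scan scheduler.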
    Some longer queries will be admitted to the grouping scheduler
    ning queries, provided that they have been determined to
    these non-shared scan query will disrupt performance of
    disk on a worker, the impact is thought to be small because
    a large fraction of) the work for a single user query, and
    disks on all workers.
    For discussion about the performance of the current implementation, see DMTR-16.
    5.10.3 Memory management
    To minimize system paging when multiple threads are scanning
    mented a memory manager called memman. When a shared scan
    the shared scan scheduler informs memman about the tables
    how important it is to keep those tables in memory during
    directed to keep the tables in memory, memman opens each data
    memory, and then locks the pages to prevent the kernel from
    uses. Thus, once a file page is faulted in, it stays in memory
    the contents of the page without incurring additional page
    the table completes, memman is told that the tables no longer
    memman frees up the pages by unlocking them and deleting
    This type of management is necessary to satisfy system paging
    prime paging pool is the set of unlocked file system pages.
    5.10.4 XRootD scheduling support
    When the front-end dispatches a query, XRootD normally
    attempt to spread the load across all of the nodes holding
    well for interactive queries, it is hardly ideal for shared
    memory and I/O usage, queries for the same table in a shared
    to the same node. A new scheduling mode was added to the
    scheduling. The front-end can tell XRootD whether or not
    other queries using the same table. Queries that have affinity
    node relative to the table they will be using. This allows
    paging by running the maximum number of queries against the
    that node fail, XRootD assigns another working node that
    queries that have affinity.
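The difference between load spreading and affinity scheduling can be sketched in Python as follows. This is an illustrative model only, not the XRootD interface: the AffinityRouter class, its methods and the worker names are hypothetical, and round-robin is assumed as a stand-in for whatever load-spreading policy is actually used.

```python
class AffinityRouter:
    """Toy model of routing with and without table affinity."""

    def __init__(self, replicas):
        # replicas: table name -> list of nodes holding a copy of it
        self._replicas = replicas
        self._affinity_node = {}   # table -> node used for affinity queries
        self._next = {}            # table -> round-robin counter
        self._down = set()

    def mark_down(self, node):
        self._down.add(node)

    def route(self, table, affinity):
        live = [n for n in self._replicas[table] if n not in self._down]
        if not live:
            raise LookupError(f"no live node holds {table}")
        if affinity:
            # All affinity queries for a table go to one node; if that
            # node has failed, another live node takes over for all of them.
            node = self._affinity_node.get(table)
            if node not in live:
                node = live[0]
                self._affinity_node[table] = node
            return node
        # Queries without affinity are spread over the live replicas.
        i = self._next.get(table, 0)
        self._next[table] = i + 1
        return live[i % len(live)]


router = AffinityRouter({"Object_1234": ["w1", "w2", "w3"]})
shared_scan_nodes = [router.route("Object_1234", affinity=True) for _ in range(3)]
interactive_nodes = [router.route("Object_1234", affinity=False) for _ in range(3)]
```

In this model the three shared scan queries all land on the same worker, so the table needs to be held in memory on one node only, while the interactive queries are spread across all three replicas.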
    5.10.5 Multiple tables support
    Handling multiple tables in shared scans requires an additional
    scheduler will aim to satisfy a throughput yielding average
    • Object queries: 1 hour
    • Object, Source queries (join): 12 hours
    • Object, ForcedSource queries (join): 12 hours
    • Object_Extras (see footnote 6) queries (join): 8 hours.
    There are separate schedulers for queries that are expected
    twelve hours. The schedulers group the tasks by chunk id and
    all the tables in the task. The scan ratings are meant to be
    the size of the table, so this sorting places scans using
    next to each other in the queue. Using scan rating allows flexibility
    schemas different than that of LSST.
    Footnote 6: This includes all Object-related tables, e.g., Object_Extra, Object_Periodic, Object_NonPeriodic, Object_APMean
    Since scans are not limited to specific tables, complicated
    that could take more than twelve hours to process. The worker
    if particular queries appear to be too slow for their current
    placed on a dedicated "snail scan" so the rest of the queries
    overly delayed.
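The grouping of scan tasks by chunk id and scan rating described in this subsection can be sketched in Python. The rating values, the ScanTask structure and the choice of the highest rating among a task's tables as the sort key are assumptions made for illustration; they are not taken from the Qserv implementation.

```python
from dataclasses import dataclass

# Hypothetical scan ratings: larger tables get larger ratings.
SCAN_RATING = {"Object": 1, "Source": 2, "ForcedSource": 3}


@dataclass(frozen=True)
class ScanTask:
    task_id: str
    chunk_id: int
    tables: tuple


def task_rating(task):
    # Assumed here: a task is rated by the largest table it scans.
    return max(SCAN_RATING[name] for name in task.tables)


def order_tasks(tasks):
    # Sort by chunk first, then by rating, so that tasks touching the same
    # chunk and similarly sized tables sit next to each other in the queue.
    return sorted(tasks, key=lambda t: (t.chunk_id, task_rating(t)))


tasks = [
    ScanTask("a", 2, ("Object",)),
    ScanTask("b", 1, ("Object", "Source")),
    ScanTask("c", 1, ("Object",)),
    ScanTask("d", 1, ("Source",)),
]
ordered_ids = [t.task_id for t in order_tasks(tasks)]
```

Because the key depends only on a numeric rating and not on table names, the same ordering logic would apply unchanged to a schema with different tables.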
    5.11 Level 3: User Tables, External Data
    Level 3 tables including tables generated by users, and
    depending on their type and size, will be either partitioned
    duction database servers, or kept unpartitioned in one central
    and distributed Level 3 data will share the nodes with Level
    disks, independent from the disks serving Level 2 data.
    recoverability from failures.
    Level 3 tables will be tracked and managed through the Qserv
    scribed in Section 5.9. This includes both the static, as well as the
    5.12 Cluster and Task Management
    Qserv delegates management of cluster nodes to XRootD. The
    ter membership, node registration/deregistration, address
    cation. Its Scalable Service Interface (SSI) API provides
    nels to the rest of Qserv, hiding details like node count,
    existence of replicas, and node failure. The Qserv master
    endpoints and Qserv workers focus on receiving and executing
    Cluster management performed outside of XRootD does not
    but includes coordinating data distribution, data loading,
    discussed in Section 5.15. The SSI API includes methods that allow dynamic
    data view of an XRootD cluster so that when new tables appear
    system can incorporate that information for future scheduling
    dynamically change without the need to restart the XRootD
    5.13 Fault Tolerance
    Qserv approaches fault tolerance in several ways. The design
    underlying data by replicating and distributing data chunks
    event of a node failure, the problem can be isolated and
    to nodes maintaining duplicate data. Moreover, this architecture
    incremental scalability and parallel performance. Within
    modularized with minimal interdependence among its components,
    narrow interfaces. Finally, individual components contain
    handling, and recovering from errors.
    The components that comprise Qserv include features that
    prevention and failure-recovery capabilities. The MySQL
    among several underlying MySQL servers and provide automatic
    fails. The XRootD system provides multiple managers and highly
    high bandwidth, contend with high request rates, and cope
    Qserv master itself contains logic that works in conjunction
    from worker-level failures.
    A worker-level failure denotes any failure mode that can
    nodes. In principle, all such failures are recoverable given
    and alternative nodes containing duplicate data are available.
    clude a disk failure, a worker process or machine crashing,
    a worker unreachable.
    Consider the event of a disk failure. Qserv’s worker logic
    failure on localized regions of disk and would behave as if
    worker process would therefore crash and all chunk queries
    be lost. The in-flight queries on its local mysqld would be
    The Qserv master’s requests to retrieve these chunk queries
    error code. The master responds by re-initializing the chunk
    XRootD. Ideally, duplicate data associated with the chunk
    this case, XRootD silently re-routes the request(s) to the
    queries are completed as usual. In the event that duplicate
    chunk queries, XRootD would again return an error code. The
    submit a chunk query a fixed number of times (determined
    before giving up, logging information about the failure,
    the user in response to the associated query.
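The master's bounded re-submission of a failed chunk query can be sketched in Python. The function and exception names (run_chunk_query, ChunkQueryError), the dispatch callable and the default of three attempts are hypothetical; the text states only that the number of attempts is fixed.

```python
import logging

log = logging.getLogger("qserv.master.sketch")


class ChunkQueryError(Exception):
    """Raised when a chunk query cannot be completed (hypothetical name)."""


def run_chunk_query(dispatch, chunk_id, max_attempts=3):
    """Dispatch a chunk query, re-submitting it up to max_attempts times.

    dispatch is any callable that sends the chunk query through the routing
    layer and raises ChunkQueryError when an error code comes back.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            # A re-submission may be routed to a node holding a replica.
            return dispatch(chunk_id)
        except ChunkQueryError as exc:
            last_error = exc
            log.warning("chunk %s: attempt %d of %d failed: %s",
                        chunk_id, attempt, max_attempts, exc)
    # Give up: record the failure and surface an error for the user query.
    log.error("chunk %s: giving up after %d attempts", chunk_id, max_attempts)
    raise ChunkQueryError(
        f"chunk {chunk_id} failed after {max_attempts} attempts"
    ) from last_error
```

A caller would treat the final ChunkQueryError as the signal to terminate the associated user query and report the error to the user.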
    Error handling in the event that an arbitrary hardware or
    Qserv worker itself) causes a worker process or machine to
    ner described above. The same is true in the event that network
    ness/overload has the limited effect of preventing XRootD
    more worker nodes. As long as such failures are limited to
    not extend to the Qserv master node, XRootD is designed to
    error code. Moreover, if duplicate data exists on other
    XRootD, which will successfully route any subsequent chunk
    In the event of an unrecoverable error, the Qserv master is
    saging mechanism designed to both log detailed information
    human-readable error message to the user. This mechanism
    logic that encapsulates all of the master’s interactions
    tion occurs, the master gracefully terminates the query,
    event, and notifies the user. Qserv’s internal status/error
    a status message and timestamp each time an individual chunk
    Such milestones include: chunk query dispatch, written
    results merged, and query finalized. This real-time status
    in the event of an unrecoverable error.
    Building upon the existing fault-tolerance and error handling
    work includes introducing a heartbeat mechanism on worker
    worker process and will restart it in the event it becomes
    monitoring process could periodically ping worker nodes
    essary. We are also considering managing failure at a per-disk
    research since application-level treatment of disk failure
    possible to develop an interface for checking the real-time
    processed by Qserv by leveraging its internally used status/error
    5.14 Next-to-database Processing
    We expect some data analyses will be very difficult, or even
    SQL language. This might be particularly useful for time-series
    yses, we will allow users to execute their analysis algorithms
    as Python. To do that, we will allow users to run their own
    sources co-located with production database servers. Users
    tion database which stream rows directly from database cluster
    cluster, where arbitrary code may run without endangering
    allows their incurred database I/O needs to be satisfied
    scanning infrastructure while providing the full flexibility
    5.15 Administration
    5.15.1 Installation
    Qserv as a service requires a number of components that all
    figured together. On the master node we require mysqld, mysql-proxy,
    metadata service, and the Qserv master process. On each of
    be the mysqld, cmsd, and XRootD service. These major components
    XRootD, and Qserv distributions. But to get these to work
    software package, such as protobuf, Lua, expat, libevent,
    so on. And many of these require more recent versions than
    distributions.
    To manage both the complexity of deployment and the diversity
    vironments, we have adopted the use of Linux containers
    their user-space dependencies are bundled within containers
    uniformly to any host running Docker with a recent enough
    ating system distribution, package profile, patch levels,
    We have experimented with several container orchestration
    ment. Of these, Kubernetes seems to be emerging as a clear
    chestration solution of choice.
    5.15.2 Data loading
    As previously mentioned, Data Release Production will
    Instead, the DRP pipelines will produce binary FITS tables
    archived as they are produced. Data will be loaded into Qserv
    tables are either not available, or complete and immutable
    spective.
    For replicated tables, these FITS files are converted to
    header keyword value pairs, or by translating binary tables
    files are loaded directly into MySQL and indexed. For partitioned
    FITS tables are fed to the Qserv partitioner, which assigns
    and converts to CSV.
    In particular, the partitioner divides the celestial sphere
    height H. For each stripe, a width W is computed such that
    longitudes separated by at least W have angular separation
    broken into an integral number of chunks of width at least
    a varying number of chunks (e.g. polar stripes will contain
    varies by a factor of about pi over the sphere. The same procedure
    chunks: each stripe is broken into a configurable number
    each substripe is broken into equal-width subchunks. This
    erarchical Triangular Mesh for its speed (no trigonometry
    of a point given in spherical coordinates), simplicity of
    control it offers over the area of chunks and sub-chunks.
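The stripe and chunk construction can be illustrated with a Python sketch. It is a reconstruction under stated assumptions, not the partitioner's code: the function names are hypothetical, the required angular separation is assumed to equal the stripe height H, and the minimum width is derived from the spherical law of cosines evaluated at the stripe edge farthest from the equator, cos(sep) = sin^2(lat) + cos^2(lat) cos(W).

```python
import math


def chunks_in_stripe(lat_min, lat_max, min_sep):
    """Number of equal-width chunks in the stripe [lat_min, lat_max].

    All angles are in degrees. Each chunk is at least as wide in longitude
    as the width W for which two points at the stripe's largest |latitude|
    are separated by min_sep on the sky.
    """
    if min_sep <= 0.0:
        raise ValueError("min_sep must be positive")
    lat = max(abs(lat_min), abs(lat_max))
    if lat >= 90.0 or min_sep >= 180.0:
        return 1  # a stripe touching a pole is a single chunk
    sep, phi = math.radians(min_sep), math.radians(lat)
    cos_w = (math.cos(sep) - math.sin(phi) ** 2) / math.cos(phi) ** 2
    if cos_w <= -1.0:
        return 1  # no longitude difference gives the required separation
    width = math.acos(min(1.0, cos_w))
    if width == 0.0:
        raise ValueError("min_sep is too small to resolve")
    return max(1, int(2.0 * math.pi / width))


def build_stripes(num_stripes):
    """Chunk count for each latitude stripe, from south pole to north pole."""
    height = 180.0 / num_stripes
    return [chunks_in_stripe(-90.0 + i * height, -90.0 + (i + 1) * height, height)
            for i in range(num_stripes)]


def locate(lon, lat, counts):
    """Return (stripe, chunk) for a point; only arithmetic is needed here."""
    height = 180.0 / len(counts)
    stripe = min(int((lat + 90.0) / height), len(counts) - 1)
    n = counts[stripe]
    chunk = min(int((lon % 360.0) * n / 360.0), n - 1)
    return stripe, chunk


counts = build_stripes(18)          # 18 stripes of height 10 degrees
stripe, chunk = locate(0.0, 0.0, counts)
```

The trigonometry is confined to build_stripes, which runs once per partitioning configuration; locating a row afterwards uses only division and truncation. Sub-stripes and sub-chunks would follow the same pattern within each stripe and are omitted here.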
    The boundaries of subchunks constructed as described are
    the overlap region for a subchunk is defined as the spherical
    the subchunk but within the overlap radius of its boundary.
    The task of the partitioner is to find the IDs of the chunk
    titioning position of each row, and to store each row in the
    its chunk. If the partitioning parameters include overlap,
    might additionally fall inside the overlap regions of one
    copy of the row is stored for each such subchunk (in overlap
    Tables that are partitioned in Qserv must be partitioned
    This means that chunk tables in a database share identical
    mappings of chunk id to spatial partition. In order to facilitate
    columns are chosen to define the partitioning space and all
    lated set of tables) are either partitioned according to that
    at all. Our current plan chooses the Object table’s ra_PS and decl_PS columns, meaning that
    rows in the Source and ForcedSource tables will be partitioned
    reference.
    There is one exception: we allow for precomputed spatial
    a table might provide a many-to-many relationship between
    reference catalog from another survey, listing all pairs
    separated by less than some fixed angle. The reference catalog
    associated Object, as more than one Object might be matched
    the reference catalog must be partitioned by reference object
    in the match table might refer to an Object and reference
    stored on different Qserv worker nodes.
    We avoid this complication by again exploiting overlap.
    time) that no match pair is separated by more than the overlap
    match tables, we store a copy of each match in the chunk of
    match. When joining Objects to reference objects via the
    to find all matches to Objects in chunk C by joining with all
    objects in C or in the overlap region of C.
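The match-table rule, storing a copy of each match in the chunk of each of its two positions, can be sketched in Python. The toy_chunk_of function is a deliberately simplified stand-in (10-degree boxes in longitude and latitude) for the real chunk assignment, and all names are hypothetical.

```python
from collections import defaultdict


def toy_chunk_of(position):
    """Toy chunk id from (lon, lat) in degrees, using 10-degree boxes."""
    lon, lat = position
    return (int(lat + 90.0) // 10) * 36 + int(lon % 360.0) // 10


def partition_matches(matches, chunk_of=toy_chunk_of):
    """Place each (object_pos, reference_pos) match in the chunk of each position.

    A match whose two positions fall in the same chunk is stored once;
    a match that straddles a chunk boundary is stored in both chunks.
    """
    by_chunk = defaultdict(list)
    for match in matches:
        object_pos, reference_pos = match
        for chunk_id in {chunk_of(object_pos), chunk_of(reference_pos)}:
            by_chunk[chunk_id].append(match)
    return dict(by_chunk)


same_chunk = ((5.0, 5.0), (6.0, 6.0))
straddling = ((9.9, 0.5), (10.1, 0.5))
placed = partition_matches([same_chunk, straddling])
```

With the copy stored on both sides of the boundary, a join that runs entirely within one chunk (plus its overlap region) sees every match involving an Object in that chunk, provided no match pair is separated by more than the overlap radius.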
    All Qserv worker nodes will partition subsets of the pipeline
    pect partitioning to achieve similar aggregate I/O rates
    query access, so that partitioning should complete in a
    time. Once it does, each Qserv worker will gather all output
    them into MySQL. The structure of the resulting chunk tables
    performance of user query access (chunk tables will likely
    compressed), and appropriate indexes are built. Since chunks
    of these steps can be performed using an in-memory file-system.
    when reading the CSV files during the load and when copying
    files) to local disk.
    The last phase of data loading is to replicate each chunk to
    will rely on table checksum verification rather than a majority
    replica is corrupt or not.
    The partitioner has been prototyped as a multi-threaded
    map-reduce implementation internally to scale across cores,
    more input CSV files in parallel. It does not currently understand
    are also parallelized: each output chunk is processed by
    written to in parallel with no application level locking.
    was able to sustain several hundred MB/s of both read and write
    a CSV dump of the PT1.2 Source table.
    We are investigating a pair of data loading optimizations.
    either integrate the partitioning code or feed data directly
    municating via persistent storage. The other is to write
    format (e.g. as .MYD files, ideally using the MySQL/MariaDB
    the CSV database loading step to be bypassed.
    5.15.3 Administrative scripts
    The Qserv design originally had a somewhat complicated set
    to coordinate and sequence service launch and shutdown on
    on service status, to automatically relaunch failed services,
    upgrades out onto a cluster. We have recently found that all
    container orchestration frameworks such as Kubernetes,
    of removing this now-redundant tooling in order to simplify
    Data ingest still tends to be a slightly detailed process,
    to now, data preparation produces text files in CSV format
    MySQL layer as a MyISAM table. The schema for these tables
    places with columns for chunk and subchunk number. Loading
    suspended, and indices and table statistics for the query
    ated on each worker after the load. Additionally, a list of
    has been loaded must be generated to aid in efficient query
    are currently coordinated by a Python qserv-loader script.
    5.16 Current Status and Future Plans
    As of now (July 2017) we have implemented a basic version of
    three running instances at a scale of approximately 30 physical
    operation on dedicated hardware clusters at NCSA and CC-IN2P3.
    The system has been demonstrated to correctly distribute
    volume queries, including small-area cone and object searches,
    full table scans including large-area near-neighbor searches.
    UDFs in both filter and aggregation clauses have also been
    Scale testing has been successfully conducted (see Qserv test reports DMTR-21, DMTR-12, DMTR-13, and DMTR-16, the most recent) on the above-mentioned clusters
    of up to approximately 70 TB, and we expect to cross the 100
    2017. To date the system is on track through a series of graduated
    [LDM-552] to meet or exceed the stated performance requirements [LDM-555].
    The shared-scan implementation is substantially complete,
    high levels of concurrency without degraded full-scan query
    The metadata system, as described above, for both static
    and basic query management facilities (query cancellation
    mented.
    The system includes a comprehensive set of unit and regression
    the build and CI systems.
    In addition to background ongoing effort to improve query
    coverage, implementation work ahead includes:
    • automatic data distribution and replication;
    • improved query management and monitoring tools;
    • demonstrating cross-match with external catalogs;
    • implementing support for Level 3 data;
    • next-to-data processing framework;
    • improvements to administration procedures and scripts;
    • authentication and authorization;
    • resource management;
    • security;
    • early engagement with astronomy users.
    Automatic data distribution and replication. We have experimented successfully
    static replication of data at ingest time, but for production
    place data and respond dynamically to node arrivals
    quired. This is currently under active development.
    Improved query management and monitoring tools. The facility to enable clients
    porarily disconnect from long-running queries and later
    and/or collect results is currently under active development.
    and estimate completion times and data sizes for long-running
    mented.
    Demonstrating cross-match with external catalogs. One of the use cases involves
    matching with external catalogs. In cases where the catalog
    will be replicated. For cross-matching with larger catalogs,
    with will need to be partitioned and distributed to the worker
    Implementing support for Level 3 data. Users should be able to maintain their
    to store their own data or results from previous queries.
    and replace their own tables within the system.
    Next to data processing framework. A facility to stream query results
    where user-submitted code can run against those streams
    Improvements to administration procedures and scripts. To further automate common
    tasks related to database management, table management,
    tribution, and others we need to implement various improvements
    cedures and scripts.
    Authentication and authorization. A production database system should
    facility for user or role-based access so that usage can
    shared. This is in particular needed for Level-3 data products.
    Resource management. A production system should have some way
    resource usage and provide quality-of-service controls.
    that can track each node’s load and per-user-query resource
    Security. The system needs to be secure and resilient against
    Early engagement with astronomy users. It is important that we engage
    members of our target user-community, so we can have time
    we are building. Does the system have the capabilities they
    syntax usable and practical for them? We have begun some
    ties in the PDAC (Prototype Data Access Center) cluster at
    and plan to expand that audience in upcoming months.
    5.17 Open Issues
    What follows is a (non-exhaustive) list of issues, technical
    discussed and where changes are possible.
    • Row updates. Currently, all rows once ingested into a Qserv
    immutable, and row updates are not supported. This simplification
    because the motivating use case for Qserv (annually released,
    logs) is primarily read-only. The apparent need for update
    Qserv instance can be worked around by treating Level 3
    gested; should changes to a Level 3 product be required,
    and re-ingested/regenerated rather than updated in
    sary that Qserv support row updates and/or transactions
    changes to the Qserv architecture would likely be required.
    • Very large objects. Some objects (e.g. large galaxies) are much
    region; in some cases their footprint will span multiple
    with the object center, neglecting the actual footprint.
    cases that would benefit from a system that tracks objects
    is currently not a requirement. A potential solution would
    similar to the r-tree-based indexes such as the TOUCH [15].
    • Very large results. Currently, the front-end that dispatches
    ble for assembling its eventual results. In general,
    the resources required to process the results may be
    greater than those needed to dispatch the query. One potential
    the front-end to the extent necessary to handle query
    SSI interface could be augmented to itself allow running
    once a particular front-end dispatches a query it can
    connect from it. A different server could, using that handle,
    process the results. This would be a more flexible model
    dent scaling of query dispatch and result processing.
    not requiring cancellation of in-progress queries dispatched
    should that front-end die.
    • Sub-queries. Qserv does not currently support SQL sub-queries.
    that such a capability might be useful to users, we should
    signs and understand how easy/difficult they might be to
    approaches here might be to split sub-queries into multiple
    variables. A naïve implementation that involves dumping
    and then rereading them, similar to a multi-stage map/reduce,
    6 Risk Analysis
    6.1 Potential Key Risks
    Insufficient database performance and scalability is one of the major stated
    DM [Document-7025].
    Qserv as an implementation of the baseline architecture
    enced above is already well on its way, and is currently resourced
    tion on time and within budget.
    A viable alternative might be to use an off-the-shelf system.
    could present significant support cost advantages over a
    it is a system well supported by a large user and developer
    source, scalable solution will be available on the time scale
    of production scalability approaching few hundred terabytes
    systems larger than the largest single LSST data set have
    production today. There is a growing demand for large scale
    tition in that area (Hadoop, Greenplum, InfiniDB, MonetDB,
    which was a closed-source project at the outset of LSST,
    Finally, a third alternative would be to use a closed-source,
    or InfiniDB (Teradata is too expensive). Some of these systems
    believe the largest barrier preventing us from using an off-the-shelf
    poorly developed spherical geometry and spherical partitioning
    Potential problems with off-the-shelf database software used, such as MySQL, is another
    potential risk. MySQL has recently been purchased by Oracle,
    the MySQL project will be sufficiently supported in the long-term.
    independent forks of MySQL software have emerged, including
    of the MySQL founders) and Percona. Should MySQL disappear,
    compatible systems are a solid alternative. Should we need
    we have taken multiple measures to minimize the impact:
    • our schema does not contain any MySQL-specific elements
    demonstrated using it in other systems such as MonetDB;
    • we do not rely on any MySQL specific extensions, with the
    which can be made to work with non-MySQL systems if needed;
    • we minimize the use of stored functions and stored procedures
    specific, and instead use user defined functions, which
    face binding part needs to be migrated).
    Complex data analysis. The most complex analysis we identified so
    temporal correlations which exhibit O(n^2) performance characteristics, searching
    lies and rare events, as well as searching for unknown are
    dustrial users deal with much simpler, well defined access
    be ad-hoc, and access patterns might be different than these
    large-scale industrial users have started to express strong
    understanding and correlating user behavior (time-series
    nies, searching for abnormal user behavior to detect fraud
    companies, analyzing genome sequencing data run by biotech
    ket analysis run by financial companies are just a few examples.
    ad-hoc and involve searching for unknowns, similar to scientific
    (by rich, industrial users) for this type of complex analyses
    rapidly starting to add needed features into their systems.
    The complete list of all database-related risks maintained
    • DM-014: Database performance insufficient for planned
    • DM-015: Unexpected database access patterns from science
    • DM-016: Unexpected database access patterns from DM productions
    • DM-032: LSST DM hardware architecture becomes antiquated
    • DM-060: Dependencies on external software packages
    • DM-061: Provenance capture inadequate
    • DM-065: LSST DM software architecture incompatible
    dards
    • DM-070: Archive sizing inadequate
    • DM-074: LSST DM software architecture becomes antiquated
    • DM-075: New SRD requirements require new DM functionality
    6.2 Risk Mitigations
    To mitigate the insufficient performance/scalability risk,
    strated scalability and performance. In addition, to increase
    source, community supported, off-the-shelf database system
    collaborated with the MonetDB open source columnar database
    lessons-learned, they are trying to add missing features
    capable of supporting LSST needs. Further, to stay current
    cale data management and analysis, we continue a dialog with
    both DBMS and Map/Reduce, as well as with data-intensive
    tific, through the XLDB conference and workshop series we
    To understand query complexity and expected access patterns,
    Science Collaborations and the LSST Science Council to understand
    and query complexity. We have compiled a set of common queries [5] and distilled this set
    a smaller set of representative queries we use for various
    each major query type, ranging from trivial low volume, to [2]. We have
    also talked to scientists and database developers from other
    SDSS, 2MASS, Gaia, DES, LOFAR and Pan-STARRS.
    To deal with unpredictability of analysis, we will use shared
    will have access to all the data, all the columns, even these
    dictable cost; with shared scans increasing complexity
    I/O needs, it only increases the CPU needs.
    To keep query load under control, we will employ throttling
    7 References
    [1] [DMTR-12], Becla, J., 2013, Qserv 300 node test, DMTR-12, URL https://ls.st/DMTR-
    [2] Becla, J., 2013, Queries Used for Scalability & Performance https://dev.lsstcorp.org/trac/wiki/db/queries/ForPerfTest
    [3] [DMTR-13], Becla, J., 2015, Qserv Summer 15 Large Scale Tests, DMTR-13, URL https://ls.st/DMTR-13
    [4] [LDM-555], Becla, J., 2017, Data Management Database Requirements, LDM-555, URL https://ls.st/LDM-555
    [5] Becla, J., Lim, K.T., 2013, Common Queries, URL https://dev.lsstcorp.org/trac/wiki/db/queries
    [6] [LDM-141], Becla, J., Lim, K.T., 2013, Data Management Storage Sizing and, LDM-141, URL https://ls.st/LDM-141
    [7] Becla, J., Lim, K.T., Monkewitz, S., Nieto-Santisteban,
    Bunclark, P.S., Lewis, J.R. (eds.) Astronomical Data
    vol. 394 of Astronomical Society of the Pacific Conference ADS Link
    [8] [DMTN-048], Becla, J., Lim, K.T., Wang, D., 2011, Qserv design prototyping experiments, DMTN-048, URL https://dmtn-048.lsst.io
    LSST Data Management Technical Note
    [9] [DMTN-046], Becla, J., Lim, K.T., Wang, D., 2013, An investigation of database, DMTN-046, URL https://dmtn-046.lsst.io
    LSST Data Management Technical Note
    [10] [DMTR-21], Becla, J., Lim, K.T., Wang, D., 2013, Early (pre-2013) Large-Scale, DMTR-21, URL https://ls.st/DMTR-21
    [11] Dorigo, A., Elmer, P., Furano, F., Hanushevsky, A., 2005,
    ers, 4, 348, URL http://xrootd.org/presentations/xpaper3_cut_journal.pdf
    [12] [LSE-163], Jurić, M., et al., 2017, LSST Data Products Definition Document, LSE-163, URL https://ls.st/LSE-163
    [13] [Document-7025], Kantor, J., Krabbendam, V., 2011, DM Risk Register, Document-7025, URL https://ls.st/Document-7025
    [14] [LDM-552], Mueller, F., 2017, Qserv Software Test Specification, LDM-552, URL https://ls.st/LDM-552
    [15] Nobari, S., Tauheed, F., Heinis, T., et al., 2013, In:
    MOD International Conference on Management of Data,
    New York, NY, USA, doi:10.1145/2463676.2463700
    [16] [DMTR-16], Thukral, V., 2017, Qserv Fall 16 Large Scale Tests/KPMs, DMTR-16, URL https://ls.st/DMTR-16
    [17] [DMTN-047], Tommaney, J., Becla, J., Lim, K.T., Wang, D., 2011, Tests with InfiniDB, DMTN-047, URL https://dmtn-047.lsst.io
    LSST Data Management Technical Note