Big Data Business Intelligence for Govt. Agencies Certificate for...
Certificate ID:
788807
Authentication Code:
e5a4c
Certified Person Name:
Masilonyane Letebele
Trainer Name:
Callan Abrahams
Duration Days:
5
Duration Hours:
35
Course Name:
Big Data Business Intelligence for Govt. Agencies
Course Date:
2024-11-18 09:00 to 2024-11-22 16:30
Course Outline:
Each session is 2 hours
Day-1: Session -1: Business Overview of Why Big Data Business Intelligence in Govt.
- Case Studies from NIH, DoE
- Big Data adoption rate in Govt. Agencies and how they are aligning their future operation around Big Data Predictive Analytics
- Broad Scale Application Area in DoD, NSA, IRS, USDA etc.
- Interfacing Big Data with Legacy data
- Basic understanding of enabling technologies in predictive analytics
- Data Integration & Dashboard visualization
- Fraud management
- Business Rule/ Fraud detection generation
- Threat detection and profiling
- Cost benefit analysis for Big Data implementation
Day-1: Session-2 : Introduction of Big Data-1
- Main characteristics of Big Data-volume, variety, velocity and veracity. MPP architecture for volume.
- Data Warehouses – static schema, slowly evolving dataset
- MPP Databases like Greenplum, Exadata, Teradata, Netezza, Vertica etc.
- Hadoop Based Solutions – no conditions on structure of dataset.
- Typical pattern : HDFS, MapReduce (crunch), retrieve from HDFS
- Batch- suited for analytical/non-interactive
- Velocity : CEP streaming data
- Typical choices – CEP products (e.g. Infostreams, Apama, MarkLogic etc)
- Less production ready – Storm/S4
- NoSQL Databases – (columnar and key-value): Best suited as analytical adjunct to data warehouse/database
Day-1 : Session -3 : Introduction to Big Data-2
NoSQL solutions
- KV Store - Keyspace, Flare, SchemaFree, RAMCloud, Oracle NoSQL Database (OnDB)
- KV Store - Dynamo, Voldemort, Dynomite, SubRecord, Mo8onDb, DovetailDB
- KV Store (Hierarchical) - GT.m, Cache
- KV Store (Ordered) - TokyoTyrant, Lightcloud, NMDB, Luxio, MemcacheDB, Actord
- KV Cache - Memcached, Repcached, Coherence, Infinispan, EXtremeScale, JBossCache, Velocity, Terracotta
- Tuple Store - Gigaspaces, Coord, Apache River
- Object Database - ZopeDB, db4o, Shoal
- Document Store - CouchDB, Cloudant, Couchbase, MongoDB, Jackrabbit, XML-Databases, ThruDB, CloudKit, Persevere, Riak-Basho, Scalaris
- Wide Columnar Store - BigTable, HBase, Apache Cassandra, Hypertable, KAI, OpenNeptune, Qbase, KDI
Varieties of Data: Introduction to Data Cleaning issues in Big Data
- RDBMS – static structure/schema, doesn’t promote agile, exploratory environment.
- NoSQL – semi structured, enough structure to store data without exact schema before storing data
- Data cleaning issues
Day-1 : Session-4 : Big Data Introduction-3 : Hadoop
- When to select Hadoop?
- STRUCTURED - Enterprise data warehouses/databases can store massive data (at a cost) but impose structure (not good for active exploration)
- SEMI STRUCTURED data – tough to do with traditional solutions (DW/DB)
- Warehousing data = HUGE effort and static even after implementation
- For variety & volume of data, crunched on commodity hardware – HADOOP
- Commodity H/W needed to create a Hadoop Cluster
Introduction to Map Reduce /HDFS
- MapReduce – distribute computing over multiple servers
- HDFS – make data available locally for the computing process (with redundancy)
- Data – can be unstructured/schema-less (unlike RDBMS)
- Developer responsibility to make sense of data
- Programming MapReduce = working with Java (pros/cons), manually loading data into HDFS
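The MapReduce pattern listed above can be sketched in a few lines. This is a minimal single-process illustration in Python, not Hadoop's Java API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample records are assumptions made for the example, and no HDFS or cluster is involved.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(records):
    """Run map, shuffle and reduce in sequence over an iterable of lines."""
    mapped = chain.from_iterable(map_phase(r) for r in records)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


# Hypothetical sample input: two short text records.
counts = word_count(["big data in govt", "Big Data analytics"])
print(counts)
```

On a real cluster the map and reduce steps run on many servers against blocks stored in HDFS; only the shape of the computation is shown here.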
Day-2: Session-1: Big Data Ecosystem-Building Big Data ETL: universe of Big Data Tools-which one to use and when?
- Hadoop vs. Other NoSQL solutions
- For interactive, random access to data
- HBase (column oriented database) on top of Hadoop
- Random access to data but restrictions imposed (max 1 PB)
- Not good for ad-hoc analytics, good for logging, counting, time-series
- Sqoop - Import from databases to Hive or HDFS (JDBC/ODBC access)
- Flume – Stream data (e.g. log data) into HDFS
Day-2: Session-2: Big Data Management System
- Moving parts, compute nodes start/fail :ZooKeeper - For configuration/coordination/naming services
- Complex pipeline/workflow: Oozie – manage workflow, dependencies, daisy chain
- Deploy, configure, cluster management, upgrade etc (sys admin) :Ambari
- In Cloud : Whirr
Day-2: Session-3: Predictive analytics in Business Intelligence -1: Fundamental Techniques & Machine learning based BI :
- Introduction to Machine learning
- Learning classification techniques
- Bayesian Prediction-preparing training file
- Support Vector Machine
- KNN p-Tree Algebra & vertical mining
- Neural Network
- Big Data large variable problem -Random forest (RF)
- Big Data Automation problem – Multi-model ensemble RF
- Automation through Soft10-M
- Text analytic tool-Treeminer
- Agile learning
- Agent based learning
- Distributed learning
- Introduction to Open source Tools for predictive analytics : R, RapidMiner, Mahout
Day-2: Session-4 Predictive analytics eco-system-2: Common predictive analytic problems in Govt.
- Insight analytic
- Visualization analytic
- Structured predictive analytic
- Unstructured predictive analytic
- Threat/fraudster/vendor profiling
- Recommendation Engine
- Pattern detection
- Rule/Scenario discovery –failure, fraud, optimization
- Root cause discovery
- Sentiment analysis
- CRM analytic
- Network analytic
- Text Analytics
- Technology assisted review
- Fraud analytic
- Real Time Analytic
Day-3 : Session-1 : Real Time and Scalable Analytic Over Hadoop
- Why common analytic algorithms fail in Hadoop/HDFS
- Apache Hama- for Bulk Synchronous distributed computing
- Apache Spark- for cluster computing for real time analytic
- CMU GraphLab2- Graph based asynchronous approach to distributed computing
- KNN p-Algebra based approach from Treeminer for reduced hardware cost of operation
Day-3: Session-2: Tools for eDiscovery and Forensics
- eDiscovery over Big Data vs. Legacy data – a comparison of cost and performance
- Predictive coding and technology assisted review (TAR)
- Live demo of a TAR product (vMiner) to understand how TAR works for faster discovery
- Faster indexing through HDFS –velocity of data
- NLP or Natural Language processing –various techniques and open source products
- eDiscovery in foreign languages-technology for foreign language processing
Day-3 : Session 3: Big Data BI for Cyber Security –Understanding whole 360 degree views of speedy data collection to threat identification
- Understanding basics of security analytics-attack surface, security misconfiguration, host defenses
- Network infrastructure/ Large datapipe / Response ETL for real time analytic
- Prescriptive vs predictive – Fixed rule based vs auto-discovery of threat rules from Meta data
Day-3: Session 4: Big Data in USDA : Application in Agriculture
- Introduction to IoT ( Internet of Things) for agriculture-sensor based Big Data and control
- Introduction to Satellite imaging and its application in agriculture
- Integrating sensor and image data for fertility of soil, cultivation recommendation and forecasting
- Agriculture insurance and Big Data
- Crop Loss forecasting
Day-4 : Session-1: Fraud prevention BI from Big Data in Govt-Fraud analytic:
- Basic classification of Fraud analytics- rule based vs predictive analytics
- Supervised vs unsupervised Machine learning for Fraud pattern detection
- Vendor fraud/over charging for projects
- Medicare and Medicaid fraud- fraud detection techniques for claim processing
- Travel reimbursement frauds
- IRS refund frauds
- Case studies and live demo will be given wherever data is available.
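The contrast between rule based and unsupervised fraud detection can be illustrated with a small sketch. It assumes a list of claim amounts and uses a median-based outlier score as a stand-in for the unsupervised techniques covered in the session; the function names, the limit, the threshold of 3.5 and the sample figures are all hypothetical.

```python
from statistics import median


def rule_based_flags(amounts, limit):
    """Rule based: flag any claim above a fixed, analyst-chosen limit."""
    return [a for a in amounts if a > limit]


def outlier_flags(amounts, threshold=3.5):
    """Unsupervised: flag claims far from the median of the data itself.

    Uses the modified z-score, 0.6745 * |x - median| / MAD, where MAD is
    the median absolute deviation. No labelled fraud cases are needed.
    """
    med = median(amounts)
    mad = median(abs(a - med) for a in amounts)
    if mad == 0:
        return []  # all claims (nearly) identical: nothing stands out
    return [a for a in amounts if 0.6745 * abs(a - med) / mad > threshold]


# Hypothetical travel reimbursement claims; one is far larger than the rest.
claims = [100, 120, 110, 95, 105, 5000]

print(rule_based_flags(claims, limit=10_000))  # fixed rule set too high
print(outlier_flags(claims))                   # threshold learned from data
```

A fixed rule only catches what its author anticipated, while the outlier score adapts to whatever distribution the claims actually have.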
Day-4 : Session-2: Social Media Analytic- Intelligence gathering and analysis
- Big Data ETL API for extracting social media data
- Text, image, meta data and video
- Sentiment analysis from social media feed
- Contextual and non-contextual filtering of social media feed
- Social Media Dashboard to integrate diverse social media
- Automated profiling of social media profile
- Live demo of each analytic will be given through Treeminer Tool.
Day-4 : Session-3: Big Data Analytic in image processing and video feeds
- Image Storage techniques in Big Data- Storage solution for data exceeding petabytes
- LTFS and LTO
- GPFS-LTFS ( Layered storage solution for Big image data)
- Fundamental of image analytics
- Object recognition
- Image segmentation
- Motion tracking
- 3-D image reconstruction
Day-4: Session-4: Big Data applications in NIH:
- Emerging areas of Bio-informatics
- Meta-genomics and Big Data mining issues
- Big Data Predictive analytic for Pharmacogenomics, Metabolomics and Proteomics
- Big Data in downstream Genomics process
- Application of Big data predictive analytics in Public health
Big Data Dashboard for quick accessibility of diverse data and display :
- Integration of existing application platform with Big Data Dashboard
- Big Data management
- Case Study of Big Data Dashboard: Tableau and Pentaho
- Use Big Data app to push location based services in Govt.
- Tracking system and management
Day-5 : Session-1: How to justify Big Data BI implementation within an organization:
- Defining ROI for Big Data implementation
- Case studies for saving Analyst Time for collection and preparation of Data – productivity gain
- Case studies of revenue gain from saving the licensed database cost
- Revenue gain from location based services
- Saving from fraud prevention
- An integrated spreadsheet approach to calculate approx. expense vs. Revenue gain/savings from Big Data implementation.
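The expense vs. gain comparison above can be sketched as a simple calculation. The cost and gain categories below echo the bullet points of this session, but every figure is a made-up placeholder and `simple_roi` is a hypothetical helper, not a method taken from the course material.

```python
def simple_roi(costs, gains):
    """Return (total gain - total cost) / total cost as a fraction."""
    total_cost = sum(costs.values())
    total_gain = sum(gains.values())
    if total_cost == 0:
        raise ValueError("total cost must be non-zero")
    return (total_gain - total_cost) / total_cost


# Placeholder annual figures in an arbitrary currency unit.
costs = {"hardware": 200_000, "staff_and_training": 300_000}
gains = {
    "licensed_database_savings": 250_000,
    "analyst_time_saved": 200_000,
    "fraud_prevention": 300_000,
}

roi = simple_roi(costs, gains)
print(f"ROI: {roi:.0%}")
```

Keeping each category as a separate entry mirrors a spreadsheet row, so individual assumptions can be revised without touching the calculation.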
Day-5 : Session-2: Step by Step procedure to replace legacy data system to Big Data System:
- Understanding practical Big Data Migration Roadmap
- What important information is needed before architecting a Big Data implementation
- What are the different ways of calculating volume, velocity, variety and veracity of data
- How to estimate data growth
- Case studies
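One common way to estimate data growth is a compound growth projection. The sketch below assumes a constant annual growth rate, which real systems rarely follow exactly; the function names and the sample volumes are hypothetical.

```python
def observed_growth_rate(start_tb, end_tb, years):
    """Compound annual growth rate implied by two volume measurements."""
    if start_tb <= 0 or end_tb <= 0 or years <= 0:
        raise ValueError("volumes and years must be positive")
    return (end_tb / start_tb) ** (1 / years) - 1


def project_volume(current_tb, annual_growth_rate, years):
    """Projected volume after `years` of compound growth."""
    if current_tb < 0 or annual_growth_rate <= -1 or years < 0:
        raise ValueError("invalid input")
    return current_tb * (1 + annual_growth_rate) ** years


# Hypothetical history: 100 TB two years ago, 225 TB today.
rate = observed_growth_rate(100, 225, 2)
future = project_volume(225, rate, 3)
print(f"growth rate: {rate:.0%}, volume in 3 years: {future:.0f} TB")
```

The projected volume can then feed the sizing of storage and cluster nodes in the migration roadmap.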
Day-5: Session 4: Review of Big Data Vendors and review of their products. Q/A session:
- Accenture
- APTEAN (Formerly CDC Software)
- Cisco Systems
- Cloudera
- Dell
- EMC
- GoodData Corporation
- Guavus
- Hitachi Data Systems
- Hortonworks
- HP
- IBM
- Informatica
- Intel
- Jaspersoft
- Microsoft
- MongoDB (Formerly 10Gen)
- Mu Sigma
- NetApp
- Opera Solutions
- Oracle
- Pentaho
- Platfora
- QlikTech
- Quantum
- Rackspace
- Revolution Analytics
- Salesforce
- SAP
- SAS Institute
- Sisense
- Software AG/Terracotta
- Soft10 Automation
- Splunk
- Sqrrl
- Supermicro
- Tableau Software
- Teradata
- Think Big Analytics
- Tidemark Systems
- Treeminer
- VMware (Part of EMC)